
Modelling Time-course DNA Microarrays by

Kernel Auto-Regressive Methods


Sylvia Young and Philip Broadbridge
School of Engineering and Mathematical Sciences
La Trobe University
Bundoora, Victoria, Australia 3086
Email: sylvia.young, p.broadbridge@latrobe.edu.au
Abstract: Time-course DNA microarray data, whose rows correspond to genes and whose columns correspond to the time of experiments, can be regarded as time series from the signal processing perspective. As such, dynamical systems are useful tools in modelling the gene expressions of time-course microarrays. Linear auto-regressive systems have been used previously in modelling the dynamics because of their simplicity. Typical system identification algorithms are based on the celebrated Yule-Walker equations. In this paper, we study nonlinear dynamical microarray models selected by kernel auto-regression, and compare them with linear dynamical models. For the purposes of illustration, an efficient sparse kernel method is used. We use the kernel recursive least squares approach as the model learning algorithm. We present simulation results of the kernel auto-regressive methods for gene expression estimation, and show that kernel auto-regressive methods are attractive and suitable signal processing techniques for modelling gene expression data.
Keywords: Kernel methods, auto-regression, model sparsity, time-course DNA microarrays, missing values estimation.
I. INTRODUCTION

The DNA microarray technology [20] is a powerful tool for bioinformatics researchers to understand gene expression, gene regulation and gene interactions through a simultaneous study of thousands of genes. The microarray data are originally generated on a microarray slide during a microarray experiment. In a microarray experiment, a normal sample is usually labeled with a green dye Cy3 and a diseased sample with a red dye Cy5 [20]. The two labeled samples are mixed and hybridized onto the microarray slide. A typical microarray slide contains square blocks of microscopic spots of DNA immobilized in a lattice or grid. Often a microarray slide is a 24 cm membrane, a microscope glass slide or a silicon substrate. After the microarray slide is scanned by laser at two specific wavelengths, the slide is converted into a pair of (Cy3 and Cy5) microarray images in TIFF format using 16 bits to represent the intensity of each pixel. In a microarray image, the green Cy3 and the red Cy5 signals are overlaid. For example, yellow spots in the resulting microarray image indicate equal intensity for the dyes. The pixel intensity at a spot gives the gene expression as the logarithm of the ratio of the two samples, i.e., $\log(\text{Cy5}) - \log(\text{Cy3})$.
Time-course DNA microarrays are a particular type of gene expression data collected over an extended period of time. A time-course microarray can be regarded as a time series from a signal processing point of view [14]. They are frequently used in monitoring gene behaviors as a function of time. An example of a time-course microarray is shown in Fig. 1, which is an image of the expressions of yeast genes obtained from a microarray experiment. The rows of a time-course microarray image usually correspond to expression levels of different genes and the columns correspond to the times at which the gene expressions are measured. Modelling the patterns of genomic behavior is important not only in order to predict the gene expressions, but also in order to understand gene regulation. Typical usages can be found in clinical applications such as drug discovery and treatment optimization. A long-term goal of microarray measurements is to reconstruct the gene regulatory networks [22], [20]. Regulation of the cell cycle is a dynamical process [24] that can readily be monitored via time-course microarray profiles, which makes such profiles a beneficial resource for modelling gene regulation networks.

Fig. 1. An example of a time-course microarray image, where the rows correspond to different genes and the columns correspond to time.
Auto-regressive (AR) models are suitable computational
tools in extracting information from time-course microarray
data, because the models are able to capture the nature of
time series. AR models assume that the current measurement
is a linear combination of previously measured data. Applications of linear AR models in modelling microarray expressions have been extensively studied [5], [27], [28], [15]. Recent studies extend AR models by using kernel methods. For example, Kumar and Jawahar [10] discussed a kernel approach to autoregressive modelling, and reported experimental results based on simulated data. Another kernel approach related to AR modelling is the kernel predictive linear Gaussian (KPLG) model [26]. It uses kernel functions to capture the nonlinear dynamics in a stochastic system. The KPLG model was claimed to be a competitor of the unscented Kalman filter (UKF) [1]. However, the KPLG model has a relatively high computational cost compared to the UKF counterpart.
Wingate and Singh proposed a kernel auto-regressive (KAR) model in [26]. The work presented in this paper is an improvement of the KAR model. We study kernel AR models with the property of sparsity, and name them sparse kernel auto-regressive (sparse kernel AR) models. In the proposed models, the features of previous measurements are employed as the variables, instead of the previous measurements themselves as in linear AR models. There are two advantages of the sparse kernel AR models. First, the sparse kernel AR models assume that high dimensional features, defined by the kernel function, govern the gene expressions over time. Such an assumption has been confirmed in the literature of gene expression prediction [18], [22]. It has been pointed out that
gene expression is regulated by complex factors in nature. For example, transcription factors [11], which govern binding during the transcription of a gene, act at the fundamental level to regulate its expression. Post-translational modifications [7] of proteins can also have downstream effects on gene expression. All these factors may be regarded as features governing the genetic observations over time. Second, the proposed kernel model introduces a constraint of sparsity in the feature space. As such, it assumes that among the high dimensional features, only the most significant ones in the feature space are the key elements in regulating the final expression of a gene.
The time-course microarray data set being analyzed in
this paper consists of the cell-cycle microarrays of yeast
genes, provided by the Stanford Microarray Database [19]. By
applying the sparse kernel AR model, we aim to capture
the auto-regression of the gene expressions, which governs
the expression levels of those genes. We implement both the
proposed sparse kernel AR model and the KAR model on the
time-course microarrays. Based on experimental results, we
show that the kernel methods can outperform the linear AR
model in gene expression prediction, in the sense of mean
squared error (MSE).
This paper is organized as follows. Section II states the basic problem of modelling the expression levels of time-course microarrays. Section III reviews the previous work which used linear AR models. Section IV presents the proposed sparse kernel AR model for predicting microarray expressions, as well as the methods for model parameter estimation and the kernel function being used. This is followed by a description of the datasets being analyzed and the statistical evaluation of the prediction accuracy in Section V. Numerical results are also presented in Section V, including the prediction performances of three models: the linear AR model, the KAR model, and the sparse kernel AR model. The performances of one-step prediction and multi-step prediction are presented. Finally, some conclusions are summarized in Section VI.

Fig. 2. An illustrative diagram of a time-course gene vector, where M is the number of observations of the gene. The black filled circles represent the entries in a gene microarray vector g over the range of time, from 1 to M. The gene expression level at time t (t = 1, ..., M) is specified by $g_t$.
II. PROBLEM STATEMENT

Auto-regressive models have been successfully applied in the area of missing data estimation for time-course microarray profiles [3]. The extension of AR models by using the kernel method, known as kernel AR models, has recently been investigated. Researchers define kernel AR models in different ways. Some authors first map the time series to a feature space, and then model the features as the auto-regressors of a linear AR model [10]. Other authors directly define the observation as a function of features, which is specified by using a kernel function [26]. The idea of kernel AR models used in this paper is related to the latter, but we extend it by imposing a sparsity property on the kernels. The computational problem is to predict one or multiple expression levels of a gene, which might be the missing entries of a microarray vector, provided that a series of expression levels from previous measurements is known. To the best of the authors' knowledge, the implementation of kernel AR models in modelling time-course microarray data has not yet been reported in the literature.

The basic problem in this paper is to model a specific type of microarray data, namely the time-course microarrays, which are gene expression levels obtained as a time series. The analysis of time-course microarrays is performed on a gene-by-gene basis. This means the dynamics of one gene might differ from those of another gene.
A simple illustration of a time-course gene expression is given in Fig. 2. Denote a gene by an M dimensional vector $g = \{g_t\}_{t=1:M}$. For the convenience of analysis, a gene vector is centered to have zero mean. In doing so, the gene elements are processed in the following way: first the mean value of $g$ is computed by $\bar{g} = \frac{1}{M}\sum_{t=1}^{M} g_t$, and then the mean value $\bar{g}$ is subtracted from each element, so that the elements become $g_t \leftarrow g_t - \bar{g}$ for $t = 1, \ldots, M$. As such the centered gene $g$ with elements $g_t$ ($t = 1, \ldots, M$) is obtained, where $g_t$ represents the observation of gene $g$ at time $t$ and $\sum_{t=1}^{M} g_t = 0$. The gene vector $g$ is to be analyzed below as a time series.

The expression level $g_t$ at time $t$ is expressed as a function of $P$ previous expression levels at times $(t-1), (t-2), \ldots, (t-P)$. Collecting all the $P$ previous expression levels $g_{t-1}, g_{t-2}, \ldots, g_{t-P}$ in a vector $\mathbf{g}_t$, the model for the gene expression level $g_t$ at time $t$ is written in the following form

$g_t = f(\mathbf{g}_t) + r_t$  (1)
Fig. 3. An illustrative diagram of an auto-regressive model of order P, where M is the length of observations. The black filled circles in the figure represent the entries in a gene microarray vector g over the range of time, from 1 to M. The gene expression values are represented by the quantities $g_1$ to $g_M$ correspondingly. For an order P linear AR model, $g_t$ is a linear combination of $g_{t-P}, \ldots, g_{t-1}$, for any time t.
where $r_t$ is assumed to be the modelling noise following an i.i.d. Gaussian distribution with zero mean and variance $\sigma_r^2$, i.e., $\mathcal{N}(0, \sigma_r^2)$. The modelling noise is assumed independent of the gene expression data.

A linear AR model simply assumes that the current observation is a linear combination of the P previous observations, and P is called the model order. The notation AR(P) refers to an auto-regressive model of order P. An illustrative diagram for an AR(P) model is shown in Fig. 3.
An AR(P) model is defined as follows,

$g_t = \sum_{p=1}^{P} \theta_p\, g_{t-p} + r_t, \quad t = (P+1), \ldots, M$  (2)

where the model parameters $\theta_p$ ($p = 1, \ldots, P$) are estimated by using all the received observations as the training data.

A missing value estimation problem assumes that the $m$-th element of gene $g$ is missing, and the missing value can be estimated from the known elements. A linear AR model uses a linear function of the previous observations $g_{m-P}, \ldots, g_{m-1}$ to estimate $g_m$, such that

$\hat{g}_m = \sum_{p=1}^{P} \theta_p\, g_{m-p}.$  (3)
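To make the linear AR procedure concrete, the following is a minimal sketch (not taken from the paper; the synthetic gene values and the order P = 3 are placeholder choices) of estimating the AR coefficients by ordinary least squares and predicting a missing value as in equation (3).

```python
import numpy as np

def fit_ar_ls(g, P):
    """Estimate AR(P) coefficients theta by least squares on the training pairs."""
    M = len(g)
    # Row j holds the P observations immediately preceding the target g[P + j]
    X = np.array([g[j:j + P] for j in range(M - P)])   # shape (J, P), J = M - P
    y = g[P:]                                          # targets
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return theta

def predict_next(history, theta):
    """One-step prediction using the P most recent observations."""
    P = len(theta)
    return float(np.dot(theta, history[-P:]))

# Placeholder "gene": a smooth series with noise, centred to zero mean
g = np.sin(0.5 * np.arange(24)) + 0.05 * np.random.randn(24)
g = g - g.mean()
theta = fit_ar_ls(g[:-1], P=3)          # train on the first 23 observations
print(predict_next(g[:-1], theta), g[-1])
```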
Kernel AR models are generalizations of linear AR models that use the features [17] of the data as the auto-regressors. The features can be described as the results of mapping the data into a feature space via a nonlinear function. The kernel AR models assume that the current observation is generated by a linear combination of the features of the previous P data, given by

$g_t = \sum_{p=1}^{P} \theta_p\, \phi(g_{t-p}) + r_t$  (4)

where $\phi$ is a nonlinear mapping [17]. As such, the estimation of the missing value $g_m$ by the nonlinear AR model is given by

$\hat{g}_m = \sum_{p=1}^{P} \theta_p\, \phi(g_{m-p}).$  (5)
III. LINEAR AUTO-REGRESSIVE MODELS

A. Notation of AR Models

From the notation of equation (2), it can be noticed that the indices of the training data start from P+1. In order to let the indices of the training data start from 1, we set

$J = M - P.$  (6)

Thereafter, a time-course microarray vector, which was illustrated in Fig. 3 and defined in equation (2), can be simply specified as in Fig. 4. Hence the training data can be equivalently written as $D = \{\mathbf{y}_j, y_j\}_{j=1}^{J}$, where

$\mathbf{y}_j = [g_j, g_{j+1}, \ldots, g_{j+P-1}]^T$  (7)

and

$y_j = g_{P+j}, \quad j = 1, \ldots, J.$  (8)

As such, by using the new notation, an AR(P) linear model is now defined as

$y_j = \mathbf{y}_j^T \boldsymbol{\theta} + r_j, \quad j = 1, \ldots, J$  (9)

where $\boldsymbol{\theta}$ is the parameter vector, $\boldsymbol{\theta} = [\theta_1, \ldots, \theta_P]^T$. Equation (9) is a vector form of the AR model that was previously described in scalar form by equation (2). For linear AR models, the scalar form is sufficient for solving for the model parameters. The purpose of introducing the new vector form is convenience in establishing the kernel models. The prediction of the gene expression at time m is given by

$\hat{y}_m = \mathbf{y}_m^T \boldsymbol{\theta}$  (10)

where the column vector $\mathbf{y}_m$ collects the P previous observations, $\mathbf{y}_m = [g_{m-P}, \ldots, g_{m-1}]^T$, and the model parameters $\boldsymbol{\theta}$ are learned from the training data.

Fig. 4. An illustration of the notation of the AR(P) model. There are J = M − P pairs of training data for this gene, where M is the length of observations and P is the model order.
B. Model Selection in Linear Auto-regression

The key problem in auto-regressive modelling is the estimation of parameters. This problem is also known as model learning in the machine learning literature [9], or system identification in the control systems literature [1]. The aim is to estimate the vector of model parameters $\boldsymbol{\theta}$ and select an appropriate model order P.

Determination of $\boldsymbol{\theta}$.
The parameter vector $\boldsymbol{\theta}$ can typically be estimated by minimizing the sum of squared errors of the training data such that

$\hat{\boldsymbol{\theta}} = \arg\min_{\boldsymbol{\theta}} \sum_{j=1}^{J} \left( y_j - \mathbf{y}_j^T \boldsymbol{\theta} \right)^2.$  (11)

Another method to find $\boldsymbol{\theta}$ is to use the Yule-Walker equations [8].
The main idea of the Yule-Walker equations is to first multiply equation (2) by $g_{t-p}$ ($p = 1, \ldots, P$), which gives a set of P equations as follows,

$g_t\, g_{t-1} = \sum_{k=1}^{P} \theta_k\, g_{t-k}\, g_{t-1} + r_t\, g_{t-1},$  (12)

$g_t\, g_{t-2} = \sum_{k=1}^{P} \theta_k\, g_{t-k}\, g_{t-2} + r_t\, g_{t-2},$  (13)
$\vdots$

$g_t\, g_{t-P} = \sum_{k=1}^{P} \theta_k\, g_{t-k}\, g_{t-P} + r_t\, g_{t-P}.$  (14)
Then taking the expectation of each equation, the P equations result in

$E[g_t\, g_{t-p}] = E\big[\sum_{k=1}^{P} \theta_k\, g_{t-k}\, g_{t-p} + r_t\, g_{t-p}\big]$  (15)
$\qquad\qquad\;\; = E\big[\sum_{k=1}^{P} \theta_k\, g_{t-k}\, g_{t-p}\big] + E[r_t\, g_{t-p}]$  (16)
$\qquad\qquad\;\; = E\big[\sum_{k=1}^{P} \theta_k\, g_{t-k}\, g_{t-p}\big].$  (17)
The last line of the above equations holds because $E[r_t\, g_{t-p}] = 0$, since the noise is assumed independent of the data and has zero mean. The term on the left hand side of the equations, i.e., $E[g_t\, g_{t-p}]$, is called the autocorrelation coefficient at delay p, denoted by $c_p$. Therefore, organizing equations (17) by using $c_p$, the Yule-Walker equations can be obtained as follows,
$\begin{bmatrix} c_1 \\ c_2 \\ c_3 \\ \vdots \\ c_P \end{bmatrix} = \begin{bmatrix} c_0 & c_1 & c_2 & \cdots & c_{P-1} \\ c_1 & c_0 & c_1 & \cdots & c_{P-2} \\ c_2 & c_1 & c_0 & \cdots & c_{P-3} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ c_{P-1} & c_{P-2} & c_{P-3} & \cdots & c_0 \end{bmatrix} \begin{bmatrix} \theta_1 \\ \theta_2 \\ \theta_3 \\ \vdots \\ \theta_P \end{bmatrix}.$  (18)
Equivalently, the Yule-Walker equations can be rewritten in matrix form as

$\mathbf{c} = C\,\boldsymbol{\theta}$  (19)

where the vector $\mathbf{c}$ has elements $\{c_p\}_{p=1:P}$ and $C$ is a symmetric matrix containing the autocorrelation coefficients. In practice, $c_p$ can be calculated from the autocovariance elements $a_p$ ($p = 0, \ldots, P$) given by

$a_p = \begin{cases} \frac{1}{J}\sum_{i=P}^{M-1} g_i^2, & p = 0 \\ \frac{1}{J}\sum_{i=P}^{M-1} g_i\, g_{i-p}, & p = 1, 2, \ldots, (P-1) \\ \frac{1}{J}\sum_{i=P+1}^{M} g_i\, g_{i-p}, & p = P \end{cases}$  (20)

where $J = M - P$, and $c_p$ is computed by

$c_p = \frac{a_p}{a_0}, \quad p = 0, \ldots, P.$  (21)
In deriving the Yule-Walker equations (17), the time series is assumed to be stationary, so that the autocorrelation coefficients are a function of the lag only, and not of the exact time.
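As a worked illustration of the Yule-Walker route, the sketch below (ours, not the authors' code; the synthetic AR(2) series and the sample size are arbitrary) forms the sample autocovariances of equation (20), normalizes them as in equation (21), and solves the system of equation (19) with NumPy.

```python
import numpy as np

def yule_walker(g, P):
    """AR(P) coefficients for a zero-mean series g via the Yule-Walker equations."""
    g = np.asarray(g, dtype=float)
    M = len(g)
    # Sample autocovariances a_p and autocorrelations c_p = a_p / a_0
    a = np.array([np.dot(g[:M - p], g[p:]) / M for p in range(P + 1)])
    c = a / a[0]
    # Symmetric matrix C with entries c_{|i-j|}; right-hand side [c_1, ..., c_P]
    C = np.array([[c[abs(i - j)] for j in range(P)] for i in range(P)])
    return np.linalg.solve(C, c[1:P + 1])

# Quick check on a synthetic stationary AR(2) series with coefficients (0.6, -0.3)
rng = np.random.default_rng(0)
g = rng.standard_normal(200)
for t in range(2, 200):
    g[t] += 0.6 * g[t - 1] - 0.3 * g[t - 2]
g -= g.mean()
print(yule_walker(g, P=2))   # should be roughly [0.6, -0.3]
```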
The drawback of solving parameters of a linear AR model
by using equation (11) is that the solution only aims to
minimize an empirical loss function. This drawback can
be mitigated by the kernel method, where a regularized
loss function is optimized.
Selection of model order P.
Selecting the model order P is another key issue in setting up an AR model. This can be performed in a number of ways. Typical criteria include Automatic Relevance Determination (ARD) [13], the Akaike information criterion (AIC) [28] and the Bayesian information criterion (BIC) [27]. Frequentist model selection methods include cross validation (CV) [9] and generalized cross validation [9]. In this paper the CV method is used.
IV. PROPOSED SPARSE KERNEL AUTO-REGRESSIVE METHODS

There are different definitions of kernel auto-regressive models in the literature. For example, Kumar and Jawahar discussed a kernel approach to autoregressive models in [10]. Their approach defines the autoregressive patterns over the features $\{\phi(g_t)\}_{t=1:M}$. In other words, an order P model in their approach uses $\phi(g_{t-1}), \ldots, \phi(g_{t-P})$ as regressors to model the value of $\phi(g_t)$. The other type of kernel auto-regressive model is termed the KAR model by Wingate and Singh [26]. The KAR models are what we referred to as kernel AR models in the problem statement (Section II) of this paper. The kernel AR models are defined in the rest of this section.
A. Kernel AR Models

The idea of kernel AR models (or KAR models) can be summarized as follows. Kernel AR models generalize the linear AR model, so that the gene expressions can be described as a combination of the features of previous observations, instead of the previous observations themselves. In contrast to a linear AR model in equation (9), a kernel AR model can be written as

$y_j = \boldsymbol{\theta}^T \phi(\mathbf{y}_j) + r_j$  (22)
$\;\;\; = \mathbf{b}_j^T \boldsymbol{\theta} + r_j, \quad j = 1, \ldots, J$  (23)

where $\boldsymbol{\theta}$ is a vector of model parameters, $\boldsymbol{\theta} = [\theta_1, \ldots, \theta_m]^T$ with $m$ indexing the feature-space dimensions, and $\phi(\mathbf{y}_j)$ is the feature vector obtained by applying a nonlinear mapping $\phi$ [17] to $\mathbf{y}_j$. To simplify notation, we have denoted $\mathbf{b}_j = \phi(\mathbf{y}_j)$.

The method to learn the model parameters $\boldsymbol{\theta}$, from the regularization point of view, is to minimize a regularized risk function $\mathcal{J}$ which is composed of an empirical loss $L$ and a penalty term $P$, given by

$\mathcal{J}(\boldsymbol{\theta}) = L(\boldsymbol{\theta}) + P(\boldsymbol{\theta}) = \frac{1}{2}\sum_{j=1}^{J}\left(y_j - \mathbf{b}_j^T\boldsymbol{\theta}\right)^2 + \frac{\nu}{2}\,\boldsymbol{\theta}^T\boldsymbol{\theta},$  (24)

where $\nu$ is a regularization parameter.
The task of learning the model parameters is achieved by learning each element $\theta_m$ individually. Taking the derivative of $\mathcal{J}$ with respect to each parameter $\theta_m$, we have

$\frac{\partial \mathcal{J}}{\partial \theta_m} = -\sum_{j=1}^{J}\left(y_j - \mathbf{b}_j^T\boldsymbol{\theta}\right) b_{jm} + \nu\,\theta_m.$  (25)

Setting the derivative (25) to zero, we obtain

$\theta_m = \frac{1}{\nu}\sum_{j=1}^{J}\left(y_j - \mathbf{b}_j^T\boldsymbol{\theta}\right) b_{jm}$  (26)
$\quad\;\; = \sum_{j=1}^{J}\frac{y_j - \mathbf{b}_j^T\boldsymbol{\theta}}{\nu}\, b_{jm}$  (27)
$\quad\;\; = \sum_{j=1}^{J}\alpha_j\, b_{jm}$  (28)
where the coefficient $\alpha_j = \frac{y_j - \mathbf{b}_j^T\boldsymbol{\theta}}{\nu}$. It can be seen that all the J training data ($\mathbf{b}_1$ through $\mathbf{b}_J$) contribute their $m$-th elements respectively. To simplify notation, we collect the column vectors $\{\mathbf{b}_j\}_{j=1:J}$ into a matrix $B$, i.e.,

$B = [\mathbf{b}_1, \ldots, \mathbf{b}_J].$  (29)

We define a new column vector $\boldsymbol{\alpha}$, which collects the elements $\{\alpha_j\}_{j=1:J}$. As such, the parameters can be written compactly as

$\boldsymbol{\theta} = B\boldsymbol{\alpha}.$  (30)

So $\boldsymbol{\theta}$ is represented by a linear combination of the vectors $\mathbf{b}_j$ ($j = 1, \ldots, J$), with coefficients $\alpha_j$ that are themselves functions of $\boldsymbol{\theta}$.
Instead of working with the parameter vector $\boldsymbol{\theta}$, the model can be reformulated in terms of the parameter vector $\boldsymbol{\alpha}$. We show in the following that although the definition of $\boldsymbol{\alpha}$ still includes $\boldsymbol{\theta}$, it can be eliminated by using a kernel function. If equation (30) is substituted into $\mathcal{J}(\boldsymbol{\theta})$, the objective function to be optimized becomes

$\mathcal{J}(\boldsymbol{\alpha}) = \frac{1}{2}\boldsymbol{\alpha}^T B^T B B^T B\boldsymbol{\alpha} - \boldsymbol{\alpha}^T B^T B\,\mathbf{y} + \frac{1}{2}\mathbf{y}^T\mathbf{y} + \frac{\nu}{2}\boldsymbol{\alpha}^T B^T B\boldsymbol{\alpha}$  (31)

where $\mathbf{y}$ is the vector containing all observations $\{y_j\}_{j=1:J}$. A kernel function is defined as a positive definite function, called a Mercer kernel [17], which is given by

$K(\mathbf{y}_i, \mathbf{y}_j) = \phi^T(\mathbf{y}_i)\,\phi(\mathbf{y}_j) = \mathbf{b}_i^T\mathbf{b}_j.$  (32)
Then the objective function can be rewritten as

$\mathcal{J}(\boldsymbol{\alpha}) = \frac{1}{2}\boldsymbol{\alpha}^T K K \boldsymbol{\alpha} - \boldsymbol{\alpha}^T K\mathbf{y} + \frac{1}{2}\mathbf{y}^T\mathbf{y} + \frac{\nu}{2}\boldsymbol{\alpha}^T K\boldsymbol{\alpha}.$  (33)

Setting the gradient of $\mathcal{J}(\boldsymbol{\alpha})$ with respect to $\boldsymbol{\alpha}$ to zero, the solution for $\boldsymbol{\alpha}$ is obtained as

$\boldsymbol{\alpha} = (K + \nu I)^{-1}\mathbf{y},$  (34)

where $K$ is the kernel matrix whose $ij$-th element is defined by $K(\mathbf{y}_i, \mathbf{y}_j)$.
If this solution is substituted back into the auto-regression model, the following prediction model for the input vector $\mathbf{y}_t$ at time $t$ is obtained:

$y_t = \mathbf{b}_t^T\boldsymbol{\theta} + r_t$  (35)
$\;\;\, = \mathbf{b}_t^T B\boldsymbol{\alpha} + r_t$  (36)
$\;\;\, = \sum_{j=1}^{J}\alpha_j\,\mathbf{b}_t^T\mathbf{b}_j + r_t$  (37)
$\;\;\, = \sum_{j=1}^{J}\alpha_j\, K(\mathbf{y}_j, \mathbf{y}_t) + r_t$  (38)
$\;\;\, = \mathbf{k}^T(K + \nu I)^{-1}\mathbf{y} + r_t$  (39)

where $\mathbf{k}$ is a $J \times 1$ kernel vector whose elements are defined by $K(\mathbf{y}_j, \mathbf{y}_t)$, and the data $\mathbf{y}_j$ are the center vectors of the kernel function [2], [17]. The idea of the sparse kernel AR model is to find a sparse set of the center vectors.
The kernel function considered in this paper is the Gaussian kernel, defined by

$K(\mathbf{y}_j, \mathbf{y}_t) = \exp\left(-\frac{\|\mathbf{y}_j - \mathbf{y}_t\|^2}{2\sigma_j^2}\right)$  (40)

where $\sigma_j^2$ is the kernel parameter. The Gaussian kernel is used as a typical kernel function in this paper to illustrate the effectiveness of kernel models.
The Gaussian kernel function expresses the idea that if the distance between a vector $\mathbf{y}_j$ and the vector $\mathbf{y}_t$ is small, then the value of the kernel function evaluated at $\mathbf{y}_j$ and $\mathbf{y}_t$ is large; otherwise the value of the kernel function is small. When the Gaussian kernel function is applied, the kernel methods have a close link with another family of signal processing techniques, namely Gaussian processes [16].
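The following is a minimal sketch of the resulting kernel AR predictor (our illustration, not the authors' implementation): it uses the Gaussian kernel of equation (40) with a single shared width, solves the linear system of equation (34), and predicts a new value via equation (38). The gene series, the order P and the parameter values are placeholder assumptions.

```python
import numpy as np

def gauss_kernel(A, B, sigma2):
    """Gaussian kernel matrix: K[i, j] = exp(-||A[i] - B[j]||^2 / (2 * sigma2))."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma2))

def fit_kernel_ar(g, P, sigma2=0.5, nu=1e-2):
    """Return the training windows and alpha = (K + nu I)^{-1} y, as in Eq. (34)."""
    X = np.array([g[j:j + P] for j in range(len(g) - P)])   # windows y_j
    y = g[P:]                                               # targets
    K = gauss_kernel(X, X, sigma2)
    alpha = np.linalg.solve(K + nu * np.eye(len(y)), y)
    return X, alpha

def predict_kernel_ar(history, X, alpha, P, sigma2=0.5):
    """Predict the next value as k^T alpha (Eq. (38)), k being the kernel vector."""
    k = gauss_kernel(np.array([history[-P:]]), X, sigma2)[0]
    return float(k @ alpha)

g = np.cos(0.4 * np.arange(24)); g -= g.mean()      # placeholder gene vector
X, alpha = fit_kernel_ar(g[:-1], P=4)
print(predict_kernel_ar(g[:-1], X, alpha, P=4), g[-1])
```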
B. Sparse Kernel AR Models

Using the kernel AR model, all the training data $\mathbf{y}_j$ ($j = 1, \ldots, J$) act as the center vectors of the kernel function. In this section, we aim to build a sparse set of kernel centers, in which only the L ($L \leq J$) most significantly relevant vectors $\mathbf{y}_l$ ($l = 1, \ldots, L$) are retained in the kernel model. From (38), the sparse kernel AR model can then be written as

$y_t = \sum_{l=1}^{L}\alpha_l\, K(\mathbf{y}_l, \mathbf{y}_t) + r_t,$  (41)

where L, the number of vectors in the sparse subset, is smaller than J, the number of vectors in the full training set.

The sparsification can be achieved in different ways. Two main approaches are (1) kernel principal component analysis (kernel PCA) [2], [17] and (2) the incremental construction method [26], [6].

Here we review the sparsification method based on incremental construction. In [6], the algorithm that incrementally constructs the center vectors is called kernel recursive least squares (KRLS) [6], which is based on a distance measurement in the feature space. The main steps of building the center vector set by the KRLS algorithm are briefly summarized below.
Step 1: Computation of kernels.
At time $t-1$, the center vectors $\tilde{\mathbf{y}}_j$ are assumed to be collected in the set $\mathcal{D}_{t-1} = \{\tilde{\mathbf{y}}_j\}_{j=1}^{J_{t-1}}$, where $J_{t-1}$ is the number of vectors. The kernel matrix $K_{t-1}$, of size $(J_{t-1} \times J_{t-1})$, between the current center vectors can be computed by using a kernel function $K(\cdot, \cdot)$. The $ij$-th element of the matrix $K_{t-1}$, denoted $[K_{t-1}]_{i,j}$, is given by

$[K_{t-1}]_{i,j} = K(\tilde{\mathbf{y}}_i, \tilde{\mathbf{y}}_j), \quad i, j = 1, \ldots, J_{t-1}.$  (42)

We want to augment the center vectors at time $t$. When a new training pair $\{\mathbf{y}_t, y_t\}$ is observed at time $t$, a $(J_{t-1} \times 1)$ dimensional kernel vector $\mathbf{k}_{t-1}$ is obtained, whose $j$-th element is given by

$(\mathbf{k}_{t-1})_j = K(\tilde{\mathbf{y}}_j, \mathbf{y}_t), \quad j = 1, \ldots, J_{t-1}.$  (43)

Meanwhile, a scalar $k_t$ can be obtained as the result of applying the kernel function to the new input $\mathbf{y}_t$ itself, computed by

$k_t = K(\mathbf{y}_t, \mathbf{y}_t).$  (44)
Step 2: Distance measurement.
We measure the distance $\delta$ between the new feature vector $\mathbf{b}_t = \phi(\mathbf{y}_t)$ and the feature vectors of all the existing center vectors $\{\tilde{\mathbf{b}}_j\}_{j=1:J_{t-1}}$, where $\tilde{\mathbf{b}}_j = \phi(\tilde{\mathbf{y}}_j)$. The distance measurement is performed under the condition of approximate linear dependence (ALD) [25]. The ALD test judges whether $\mathbf{b}_t$ is a linear combination of $\{\tilde{\mathbf{b}}_j\}_{j=1}^{J_{t-1}}$, as described below. The distance between the new feature vector $\mathbf{b}_t$ and the existing feature vectors $\tilde{\mathbf{b}}_j$ is defined by

$\delta(\mathbf{a}) = \frac{1}{2}\Big\|\sum_{j=1}^{J_{t-1}} a_j\,\tilde{\mathbf{b}}_j - \mathbf{b}_t\Big\|^2$  (45)

where $\mathbf{a}$ is composed of the elements $a_j$. The coefficients $a_j$ are estimated by

$\hat{\mathbf{a}} = \arg\min_{\mathbf{a}}\,\delta(\mathbf{a}).$  (46)

The new feature vector $\mathbf{b}_t$ is determined to be (approximately) linearly dependent on the $\tilde{\mathbf{b}}_j$ if it satisfies the ALD condition:

$\delta(\hat{\mathbf{a}}) < th$  (47)

where $th$ is a sufficiently small threshold.
Step 3: Set expansion.
If $\delta(\hat{\mathbf{a}}) < th$, then the distance between the new feature vector and the existing feature vectors is small. This indicates that the new feature vector can be represented by the existing feature vectors; in other words, the new feature is approximately linearly dependent on the existing features, so the set of center vectors is left unchanged. Otherwise, if $\delta(\hat{\mathbf{a}}) > th$, the new feature cannot be fully expressed by the existing features. Then the new observation is included in the set of center vectors, so that the center set is augmented into

$\mathcal{D}_t = \{\mathcal{D}_{t-1},\; \mathbf{y}_t\}.$  (48)

Here the time index $t$ represents the index within the training set, which ranges from 1 to J. Once a new training pair arrives, the algorithm can immediately decide whether the new vector will be accommodated in the set of center vectors. A similar sparsification method using incremental construction has been proposed in [4] to generate sparse Gaussian processes [16].
In general, the incremental construction method is able to select a sparse set of center vectors whose features are linearly independent of each other. It assesses the linear dependence between the features of the selected center vectors by means of kernel functions only. The incremental construction method has lower computational costs in comparison to kernel PCA, which needs both kernel computations and a principal component analysis. We use the incremental construction method in this paper to select the center vectors for computing the kernel functions.
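A rough sketch of the incremental selection is given below. It is a simplified rendering of the ALD test of equations (45)-(47), not the full KRLS algorithm of [6]: for each new window, the quadratic problem of equation (46) is solved in kernel form, giving $\delta = k_{tt} - \mathbf{k}^T K^{-1}\mathbf{k}$, and the window is added to the centre set only when $\delta$ exceeds the threshold $th$. The Gaussian kernel, its width and the threshold value are assumed for illustration.

```python
import numpy as np

def gauss_k(u, v, sigma2=0.5):
    return np.exp(-np.sum((u - v) ** 2) / (2.0 * sigma2))

def build_centre_set(windows, th=0.01, sigma2=0.5):
    """Return the indices of the windows retained as kernel centres (ALD test)."""
    centres = [0]                                    # start with the first window
    for t in range(1, len(windows)):
        D = windows[centres]
        K = np.array([[gauss_k(u, v, sigma2) for v in D] for u in D])
        k = np.array([gauss_k(u, windows[t], sigma2) for u in D])
        a = np.linalg.solve(K + 1e-10 * np.eye(len(D)), k)   # small jitter for stability
        delta = gauss_k(windows[t], windows[t], sigma2) - k @ a
        if delta > th:                               # not approximately linearly dependent
            centres.append(t)
    return centres

W = np.array([[np.sin(0.3 * (t + p)) for p in range(4)] for t in range(20)])
print(build_centre_set(W))
```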
C. Model Selection in Sparse Kernel AR Models

The parameters to learn in a sparse kernel AR model include the model parameters $\alpha_l$, the model order P and the kernel parameter $\sigma^2$ (here the $\sigma_l^2$ are assumed identical, i.e., $\sigma_l^2 = \sigma^2$ for $l = 1, \ldots, L$). The algorithms used to learn these parameters are discussed in the rest of this section.

Model parameters $\alpha_l$.
The parameters $\boldsymbol{\alpha} = \{\alpha_l\}_{l=1}^{L}$ of a sparse kernel AR model can be learned in a similar way as given in equation (34). The only difference is that the kernel matrix K of the sparse kernel AR model is of size $(L \times L)$, smaller than the $(J \times J)$ kernel matrix of the previous kernel AR model.
Model order P and kernel parameter $\sigma^2$.
The model order P is an important parameter, because it determines the number of training data, $J = M - P$. Since the number of observations M is given and fixed, the choice of P leads to different numbers of training pairs $\{\mathbf{y}_j, y_j\}_{j=1:J}$. The kernel parameter $\sigma^2$ determines the smoothness of the kernel function. These parameters can be learnt from the observations in various ways. In practice, cross validation is used here as the model selection algorithm, as illustrated by the sketch below.
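A minimal grid-search sketch of this cross-validation step is shown below (an assumed protocol, not the authors' exact routine): candidate orders P and kernel widths $\sigma^2$ are scored by K-fold prediction error for the non-sparse kernel AR model, and the pair with the lowest average error is retained.

```python
import numpy as np

def gauss_kernel(A, B, s2):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * s2))

def cv_select(g, orders, widths, folds=5, nu=1e-2):
    """Return the (P, sigma^2) pair with the lowest K-fold prediction MSE."""
    best, best_err = None, np.inf
    for P in orders:
        X = np.array([g[j:j + P] for j in range(len(g) - P)])
        y = g[P:]
        parts = np.array_split(np.arange(len(y)), folds)
        for s2 in widths:
            errs = []
            for held in parts:
                train = np.setdiff1d(np.arange(len(y)), held)
                K = gauss_kernel(X[train], X[train], s2)
                alpha = np.linalg.solve(K + nu * np.eye(len(train)), y[train])
                k = gauss_kernel(X[held], X[train], s2)
                errs.append(np.mean((k @ alpha - y[held]) ** 2))
            if np.mean(errs) < best_err:
                best, best_err = (P, s2), np.mean(errs)
    return best

g = np.sin(0.5 * np.arange(24)); g -= g.mean()    # placeholder gene vector
print(cv_select(g, orders=[3, 4, 5], widths=[0.1, 1.0, 10.0]))
```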
In summary, the proposed sparse kernel AR model is a particular type of kernel AR model, which uses sparse center vectors to build the kernel matrix. The justification of the model sparsity lies in the fact that the feature space can be sufficiently represented by a set of linearly independent features, instead of all the feature vectors in total. The proposed sparse kernel AR model is able to select the vectors which are mutually independent in the feature space. The selected vectors are then used as the center vectors to build the kernel functions for the proposed model. Also, the kernel function is used to describe the nonlinear auto-regression between the observations.
V. RESULTS

A. Datasets

In this section we apply the proposed sparse kernel AR models to predicting the expression levels of time-course microarrays. This application is useful for the situation where only a number of entries of a gene are given, while the entries at subsequent time points are missing. The cell-cycle microarrays of budding yeast, which are typical time-course microarray data, are used in the studies. The dataset is publicly available online from the Stanford Microarray Database (SMD) at http://genome-www.stanford.edu/cellcycle/data/rawdata/.
In a budding yeast cell, 104 genes, out of the 6178 genes in total, have been identified as cell-cycle regulated by previous work [21]. In the analysis, we choose two primary cell-cycle regulated genes, namely YOL090W and YAL036C. The two genes belong to distinct groups of differentially expressed genes [12], and each is reported [23] as a representative gene of its group. In the experiments, we process these two genes using the same method, because both of them are time-course microarray data. The gene expression levels of YOL090W and YAL036C are generated from the so-called CDC15 experiments [21]. In a CDC15 experiment, the expression of one gene is measured at 24 time steps during the progress of the microarray experiments. Therefore, the full time length for each gene is 24 in our study. Normalization of the expression levels has been performed in advance for each gene, before any of the processing described in this section. Therefore two zero-mean gene vectors are used in the following experiments.
B. Statistical Evaluation

To evaluate the effectiveness of the auto-regressive methods, the expression levels of a gene g at the last n time steps ($1 \leq n < M$) are artificially assumed to be unknown. The segment containing the n artificially unknown observations is denoted $\mathbf{g}$. The result of a prediction algorithm is another expression sequence of size n, denoted by $\hat{\mathbf{g}}$. The quality of the prediction result is then measured by the mean squared error (MSE) defined below:

$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(\hat{g}_i - g_i\right)^2$  (49)

where $\hat{g}_i$ and $g_i$ are the $i$-th elements of the vectors $\hat{\mathbf{g}}$ and $\mathbf{g}$, respectively.

In terms of the sparse kernel AR models, we define a quantity $\rho$, called the sparsity ratio, to represent the sparsity of a kernel AR model:

$\rho = \frac{L}{J} = \frac{\text{number of selected vectors}}{\text{total number of training data}}.$  (50)

Note that the denominator of $\rho$ is given by $J = M - P$, so that the value of $\rho$ is determined once the model order P and the number of selected vectors L are set.
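The evaluation protocol can be summarized in a few lines of code (a sketch with synthetic data; the hold-out length n and the trivial zero predictor are placeholders used only to exercise the metrics):

```python
import numpy as np

def mse(pred, truth):
    pred, truth = np.asarray(pred, float), np.asarray(truth, float)
    return float(np.mean((pred - truth) ** 2))          # Eq. (49)

def sparsity_ratio(L, M, P):
    return L / (M - P)                                  # Eq. (50), with J = M - P

g = np.sin(0.5 * np.arange(24)); g -= g.mean()          # placeholder gene vector
n = 3
g_known, g_hidden = g[:-n], g[-n:]                      # last n values held out as "missing"
naive_pred = np.zeros(n)                                # trivial predictor for illustration only
print(mse(naive_pred, g_hidden), sparsity_ratio(L=5, M=24, P=4))
```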
C. Numerical Results of One-step Predictions

We evaluate the effectiveness of kernel AR models and linear AR models in predicting the time-course microarrays. In this experiment, we artificially assume that the $m$-th element of the gene vector g, i.e., $y_m$, is missing ($m \leq M$). The linear AR, kernel AR and sparse kernel AR models are used to find the estimate of the element, $\hat{y}_m$. Because only one element is assumed unknown, this experiment is called one-step prediction.

For an order P ($P \leq m-1$) linear AR model, the estimate of the missing element is given by

$\hat{y}_m = \sum_{i=1}^{P}\theta_i\, y_{m-i}.$  (51)
For a kernel AR model, the estimate of $y_m$ is given by

$\hat{y}_m = \sum_{i=1}^{P}\theta_i\,\phi(y_{m-i}) = \sum_{j=1}^{m-P}\alpha_j\, K(\mathbf{y}_j, \mathbf{y}_m)$  (52)

where $\mathbf{y}_j = [g_j, \ldots, g_{j+P-1}]^T$ and $\mathbf{y}_m = [g_{m-P}, \ldots, g_{m-1}]^T$. For a sparse kernel AR model, the estimate is similar to equation (52), with the difference that only L ($L \leq m-P$) vectors are selected to compute the kernels.
In order to provide enough information in the training data, the index m of the missing element to be estimated is set from 24 down to 17. For example, when m = 24, the training set has 23 observations, and the predicted value is the 24th element of the gene vector. Likewise, when m = 17, the predicted value is the 17th element of the gene vector.
Five-fold cross validation is used to determine the model order P, the smoothing parameter $\sigma^2$ of the kernel function, and the threshold th of the sparse kernel AR model. For instance, the kernel parameter $\sigma^2$ is determined from a wide range of candidates: $10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}, 10^{0}, 10^{1}, 10^{2}$ and $10^{3}$.
TABLE I
ONE-STEP PREDICTION PERFORMANCES W.R.T. GENE YAL036C. FOR EACH VALUE OF m, THE LOWEST MSE INDICATES THE BEST PREDICTION RESULT. NOTE THAT IN THIS TABLE AND THE FOLLOWING TABLES, THE WORD "LAR MODELS" REPRESENTS LINEAR AR MODELS FOR THE SAKE OF BREVITY.

       LAR Models     Kernel AR Models       Sparse Kernel AR Models
  m    P    MSE       P    σ²      MSE       P    σ²      ρ        MSE
  17   8    0.063     4    10^3    0.030     5    10^1    2/11     0.014
  18   3    0.029     8    10^1    0.0002    8    10^2    2/9      0.002
  19   7    0.035     4    10^2    0.038     4    10^2    14/14    0.038
  20   3    0.008     5    10^1    0.001     8    10^2    2/11     0.002
  21   5    0.023     3    10^0    0.035     6    10^1    3/14     0.056
  22   7    0.004     3    10^2    0.070     8    10^1    3/13     0.041
  23   7    0.007     3    10^0    0.001     4    10^3    2/18     0.235
  24   7    0.015     8    10^1    0.001     6    10^1    3/17     0.018
TABLE II
ONE-STEP PREDICTION PERFORMANCES W.R.T. GENE YOL090W. FOR EACH VALUE OF m, THE LOWEST MSE INDICATES THE BEST PREDICTION RESULT.

       LAR Models     Kernel AR Models       Sparse Kernel AR Models
  m    P    MSE       P    σ²      MSE       P    σ²      ρ        MSE
  17   4    0.014     7    10^1    0.004     7    10^1    9/9      0.004
  18   8    0.011     8    10^0    0.074     4    10^1    12/13    0.024
  19   8    0.042     7    10^2    0.032     8    10^1    2/10     0.052
  20   8    0.032     5    10^1    0.001     8    10^1    2/11     0.038
  21   8    0.033     8    10^1    0.003     8    10^1    12/12    0.003
  22   4    0.220     7    10^1    0.244     8    10^1    13/13    0.261
  23   6    0.002     5    10^2    0.004     8    10^1    2/14     0.0002
  24   6    0.057     4    10^3    0.021     8    10^1    2/15     0.010
Similarly, the model order P is learned from the candidates 3, 4, 5, 6, 7 and 8. For the sparse kernel AR models, the threshold th is a special parameter governing the model sparsification. The value of th is set to a sufficiently small scalar, th = 0.01. For a given th, the model is able to select the number of center vectors, L, used to build the kernel matrix. The regression coefficients of the kernel models are learned by the kernel recursive least squares algorithm [25].
In general, the kernel AR models and the sparse kernel AR models work well for Gene YAL036C. For the eight prediction cases listed in Table I, there are only two cases in which the linear AR models have better performance, namely m = 21 and m = 22. For all the other cases, the kernel methods outperform the linear counterparts. For the case m = 19 the three models show comparable prediction performances. The reason why the kernel AR model and the sparse kernel AR model show exactly the same prediction results is that they use identical model parameters and the sparsity ratio ρ equals one, so the two models become identical.

Table II shows the one-step prediction performance with respect to Gene YOL090W. Both kernel AR models generally outperform the linear counterpart in terms of MSE. The value m in the table is the index of the predicted element, so the first m − 1 observations serve as the training data. The linear AR model performs better than the kernel AR models only for the case m = 18, while in all the other cases the kernel AR models outperform the linear AR model. For the two types of kernel AR models, when ρ = 1 and the same kernel parameters are used, they produce the same predictions, for example when m = 17 and m = 21.
D. Numerical Results of Multi-step Predictions

In the multi-step prediction experiments, the last n elements of a gene vector are treated as unknown and are predicted sequentially.

TABLE IV
MULTI-STEP PREDICTION PERFORMANCES W.R.T. GENE YAL036C

       LAR Models     Kernel AR Models       Sparse Kernel AR Models
  n    P    MSE       P    σ²      MSE       P    σ²      ρ        MSE
  2    6    0.032     5    10^1    0.017     8    10^1    9/9      0.007
  3    7    0.032     7    10^1    0.019     8    10^1    10/10    0.012
  4    3    0.024     7    10^1    0.017     7    10^1    12/12    0.012
  5    5    0.040     3    10^-4   0.027     4    10^1    3/16     0.038
  6    5    0.042     5    10^3    0.031     4    10^1    3/17     0.033
  7    7    0.043     3    10^-4   0.032     4    10^1    3/18     0.045
  8    7    0.038     3    10^-4   0.028     4    10^3    2/19     0.052
For example, when n = 2, the 2-step prediction by an order P linear AR model is given by equation (51) together with

$\hat{y}_{m+1} = \sum_{i=1}^{P-1}\theta_i\, y_{m-i} + \theta_P\,\hat{y}_m.$  (53)

The 2-step kernel AR model is given by equation (52) together with

$\hat{y}_{m+1} = \sum_{i=1}^{P-1}\theta_i\,\phi(y_{m-i}) + \theta_P\,\phi(\hat{y}_m) = \sum_{j=1}^{m-P-1}\alpha_j\, K(\mathbf{y}_j, \mathbf{y}_{m+1}) + \alpha_{m-P}\, K(\mathbf{y}_{m+1}, \mathbf{y}_{m+1})$  (54)

where $\mathbf{y}_{m+1}$ includes the estimated element $\hat{y}_m$, i.e., $\mathbf{y}_{m+1} = [y_{m-P+1}, \ldots, y_{m-1}, \hat{y}_m]^T$. Similarly, the 2-step sparse kernel AR model is a variant of the 2-step kernel AR model, with the difference that only L ($L \leq m-P$) vectors are selected to compute the kernels.
In this experiment, the number of prediction steps n is up to 8. The multi-step prediction proceeds in the following way:

Step 1: Initially set n = 1. The first element $\hat{y}_m$ is estimated by the method of Section V-C for the linear AR, kernel AR and sparse kernel AR models.
Step 2: Once an element is obtained, its value $\hat{y}_m$ is added to the training set, so that the training set is incremented into $D_{m+1} = \{D_m, \hat{y}_m\}$.
Step 3: Increase n and use the training data $D_{m+1}$ to predict the next element.
Step 4: Repeat Steps 2 and 3 until n = 8 elements are predicted.

All three models follow the same procedure to make predictions. Since the length of the training data increases whenever a new element is estimated, the model has to be updated as n increases. To put it another way, once new data are integrated into the training group, the model parameters need to be learned again, because the auto-regressive pattern may change with the new training set.
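The loop below sketches this procedure (our illustration, assuming the fit_kernel_ar and predict_kernel_ar helpers from the earlier kernel AR sketch are in scope; any of the three models could be substituted at the marked line): each predicted value is appended to the series and the model is refit before the next step.

```python
import numpy as np
# Assumes fit_kernel_ar / predict_kernel_ar from the kernel AR sketch above are defined.

def multistep_predict(g_train, n_steps, P=4, sigma2=0.5):
    """Recursive multi-step prediction: refit after appending each estimate."""
    series = list(g_train)
    preds = []
    for _ in range(n_steps):
        # Any of the three models could be plugged in here in place of the kernel AR fit
        X, alpha = fit_kernel_ar(np.array(series), P, sigma2)
        y_hat = predict_kernel_ar(np.array(series), X, alpha, P, sigma2)
        preds.append(y_hat)
        series.append(y_hat)        # the estimate joins the training data for the next step
    return np.array(preds)

g = np.cos(0.4 * np.arange(24)); g -= g.mean()          # placeholder gene vector
print(multistep_predict(g[:-8], n_steps=8))             # predict the last 8 entries
```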
The performance comparison evaluated by MSE is shown in Table IV for Gene YAL036C. It can be seen that all the best performances are produced by kernel AR models, including the sparse and non-sparse models. Therefore the kernel models outperform the linear counterpart in the multi-step prediction experiments with respect to Gene YAL036C. The MSE values for Gene YAL036C are plotted in Fig. 5, where it can be seen that the kernel AR models significantly outperform the linear models.
Fig. 5. Performance of the sparse kernel AR model with respect to Gene YAL036C: MSE versus the number of prediction steps (2 to 8), compared with the linear AR and kernel AR models.
TABLE V
MULTI-STEP PREDICTION PERFORMANCES W.R.T. GENE YOL090W

       LAR Models     Kernel AR Models       Sparse Kernel AR Models
  n    P    MSE       P    σ²      MSE       P    σ²      ρ        MSE
  2    8    0.014     6    10^1    0.028     4    10^1    12/13    0.019
  3    8    0.024     7    10^1    0.024     7    10^1    11/11    0.019
  4    8    0.027     7    10^1    0.037     4    10^1    12/15    0.023
  5    4    0.026     8    10^0    0.031     8    10^1    12/12    0.019
  6    4    0.057     7    10^1    0.043     7    10^1    13/14    0.043
  7    4    0.049     4    10^1    0.040     8    10^1    13/14    0.040
  8    4    0.045     4    10^2    0.036     4    10^2    19/19    0.036
The sparse kernel AR models also generally outperform the linear counterpart; the only exception occurs when n = 8. Notice that for the case n = 4 the two kernel models show different prediction results, even though they learn the same model parameters when the sparsity ratio ρ = 1. This is a different situation from the one-step prediction experiments in the previous subsection. The reason why the two kernel models using the same model parameters can generate different predictions in the multi-step prediction experiment lies in the fact that the two models are actually based on different training data, because the last datum in the training set is itself an estimate, which can differ between the models.
Table V shows the performance comparison of the models with respect to Gene YOL090W. The sparse kernel AR models give the best performance when the number of prediction steps varies from 3 to 8. The only exception is n = 2, where the linear AR model shows the best performance. For the cases n = 6, 7 and 8, the kernel AR models without sparsification also show excellent performance in terms of MSE. Therefore the sparse kernel AR models outperform the other two models in predicting the expression levels of Gene YOL090W. Again, for the convenience of comparison, the MSE results are plotted in Fig. 6. The better performance of the sparse kernel AR models relative to the linear counterpart can be clearly observed for Gene YOL090W.
Fig. 6. Performance of the sparse kernel AR model with respect to Gene YOL090W: MSE versus the number of prediction steps (2 to 8), compared with the linear AR and kernel AR models.
The kernel AR models work comparably well to the sparse kernel AR models when the number of prediction steps n is greater than 5; in that range the two kernel models tend to show similar prediction errors in terms of MSE. Overall, the sparse kernel model achieves better prediction accuracy because of its low MSE.

On the other hand, there is a tradeoff between the prediction performance of the models and the corresponding computation time. The multi-step prediction by the linear AR model for both genes costs less than 0.1 seconds. The prediction by the kernel AR model costs about 1 second for the same genes. The sparse kernel AR model uses 2 to 8 seconds as the prediction horizon varies from 2 steps to 8 steps.
E. Discussion

In the experiment of one-step prediction, the proposed sparse kernel AR models show prediction accuracy comparable to that of the kernel AR models. Both types of kernel techniques generally outperform the linear AR models. In the experiment of multi-step prediction, the sparse kernel AR models show better performance than the kernel AR models and the linear AR models for Gene YAL036C when the number of prediction steps is less than or equal to 4, and the kernel AR models show better performance when the number of prediction steps is greater than 4. For Gene YOL090W, the sparse kernel models generally show better performance than both the kernel AR models and the linear AR models as the number of prediction steps varies.

The results show that the kernel models can properly capture the auto-regressive pattern of a time-course gene. When the number of prediction steps is large, the sparse kernel AR models remain appropriate for modelling the auto-regressions.
VI. CONCLUSIONS

The key problem studied in this paper is modelling the gene expressions of time-course microarray data by using kernel auto-regressive models. In general, the expression levels of time-course microarray data can be assumed to follow a nonlinear relationship. The nonlinearity can be expressed by using kernel methods, which are defined by a kernel function of the input variables. In this paper, a sparse modelling strategy is employed in the area of gene expression estimation. The sparsification of the model in the feature space may potentially reflect the nature of gene regulatory networks, where only key features should be considered in the interactions between genes. Based on the experimental results on time-course microarray data, it can be observed that the kernel auto-regressive models outperform the linear counterparts in terms of MSE.
REFERENCES
[1] B. D. O. Anderson and J. B. Moore, Optimal Filtering. USA: Dover Publications, Inc., 2005.
[2] C. M. Bishop, Pattern Recognition and Machine Learning. Singapore: Springer, 2006.
[3] M. K. Choong, M. Charbit, and H. Yan, "Autoregressive-model-based missing value estimation for DNA microarray time series data," IEEE Transactions on Information Technology in Biomedicine, vol. 13, no. 1, pp. 131-137, 2009.
[4] L. Csato and M. Opper, "Sparse on-line Gaussian processes," Neural Computation, vol. 14, pp. 641-668, 2002.
[5] A. Darvish, R. Hakimzadeh, and K. Najarian, "Discovering dynamic regulatory pathway by applying an auto regressive model to time series DNA microarray data," in Proceedings of the 26th Annual International Conference of the IEEE EMBS, San Francisco, USA, Sept. 2004, pp. 1-8.
[6] Y. Engel, S. Mannor, and R. Meir, "The kernel recursive least squares algorithm," IEEE Transactions on Signal Processing, vol. 52, pp. 2275-2285, 2004.
[7] K. D. Grasser, Regulation of Transcription in Plants. Oxford, UK: Blackwell Publishing Ltd, 2006.
[8] G. Walker, "On periodicity in series of related terms," Proceedings of the Royal Society of London, vol. 131, pp. 518-532, 1931.
[9] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning. New York, USA: Springer, 2001.
[10] R. Kumar and C. V. Jawahar, "Kernel approach to autoregressive modeling," in Proceedings of the 13th National Conference on Communications, Kanpur, India, 2007.
[11] D. Latchman, Gene Regulation: A Eukaryotic Perspective. London, UK: Chapman and Hall, 1995.
[12] M. L. T. Lee, Analysis of Microarray Gene Expression Data. Boston, MA, USA: Kluwer Academic Publishers, 2004.
[13] D. J. C. MacKay, "Probable networks and plausible predictions - a review of practical Bayesian methods for supervised neural networks," Network: Computation in Neural Systems, vol. 6, no. 3, pp. 469-505, 1995.
[14] M. Mutarelli, L. Cicatiello, L. Ferraro, O. M. Grober, M. Ravo, A. M. Facchiano, C. Angelini, and A. Weisz, "Time-course analysis of genome-wide gene expression data from hormone-responsive human breast cancer cells," BMC Bioinformatics, vol. 9, Suppl. 2, 2008.
[15] C. Phong and R. Singh, "Missing value estimation for time series microarray data using linear dynamical systems modeling," in Proceedings of the 22nd International Conference on Advanced Information Networking and Applications, 2008, pp. 814-819.
[16] C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning. USA: MIT Press, 2006.
[17] B. Schölkopf and A. J. Smola, Learning with Kernels. MA, USA: MIT Press, 2002.
[18] E. Segal, M. Shapira, A. Regev, D. Pe'er, D. Botstein, D. Koller, and N. Friedman, "Module networks: identifying regulatory modules and their condition-specific regulators from gene expression data," Nature Genetics, vol. 34, no. 2, pp. 166-176, 2003.
[19] SMD, The Stanford Microarray Database. [Online]. Available: http://smd.stanford.edu
[20] T. P. Speed, Statistical Analysis of Gene Expression Microarray Data. Florida, USA: Chapman and Hall, 2003.
[21] P. T. Spellman, G. Sherlock, M. Q. Zhang, V. R. Iyer, K. Anders, M. B. Eisen, P. O. Brown, D. Botstein, and B. Futcher, "Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization," Molecular Biology of the Cell, vol. 9, pp. 3273-3297, 1998.
[22] J. D. Storey, W. Xiao, J. T. Leek, R. G. Tompkins, and R. W. Davis, "Significance analysis of time course microarray experiments," Proceedings of the National Academy of Sciences USA, vol. 102, no. 36, pp. 12837-12842, 2005.
[23] G.-F. Tsai and A. Qu, "Testing the significance of cell-cycle patterns in time-course microarray data using nonparametric quadratic inference functions," Computational Statistics and Data Analysis, vol. 52, pp. 1387-1398, 2008.
[24] J. J. Tyson, A. Csikasz-Nagy, and B. Novak, "The dynamics of cell cycle regulation," BioEssays, vol. 24, pp. 1095-1109, 2002.
[25] D. Wingate, "Resources: kernel recursive least squares," 2006, http://web.mit.edu/~wingated/www/resources.html.
[26] D. Wingate and S. Singh, "Kernel predictive linear Gaussian models for nonlinear stochastic dynamical systems," in Proceedings of the 23rd International Conference on Machine Learning, Pittsburgh, USA, 2006.
[27] F. X. Wu, W. J. Zhang, and A. J. Kusalik, "Modeling gene expression from microarray expression data with state-space equations," in Proceedings of the Pacific Symposium on Biocomputing, vol. 9, 2004, pp. 581-592.
[28] R. Yamaguchi, S. Yamashita, and T. Higuchi, "Estimating gene networks with cDNA microarray data using state-space models," in Proceedings of the International Conference on Computational Science and Its Applications, vol. 3482, 2005, pp. 381-388.