
Gaussian processes

The class of Gaussian processes is one of the most widely used families of stochastic processes for modeling dependent data observed over time, or space, or time and space. The popularity of such processes stems primarily from two essential properties. First, a Gaussian process is completely determined by its mean and covariance functions. This property facilitates model fitting as only the first- and second-order moments of the process require specification. Second, solving the prediction problem is relatively straightforward. The best predictor of a Gaussian process at an unobserved location is a linear function of the observed values and, in many cases, these functions can be computed rather quickly using recursive formulas.

The fundamental characterization, as described below, of a Gaussian process is that all the finite-dimensional distributions have a multivariate normal (or Gaussian) distribution. In particular, the distribution of each observation must be normal. There are many applications, however, where this assumption is not appropriate. For example, consider observations $x_1, \ldots, x_n$, where $x_t$ denotes a 1 or 0, depending on whether or not the air pollution on the $t$th day at a certain site exceeds a government standard. A model for these data should only allow the values 0 and 1 for each daily observation, thereby precluding the normality assumption imposed by a Gaussian model. Nevertheless, Gaussian processes can still be used as building blocks to construct more complex models that are appropriate for non-Gaussian data. See [3–5] for more on modeling non-Gaussian data.
Basic Properties

A real-valued stochastic process $\{X_t, t \in T\}$, where $T$ is an index set, is a Gaussian process if all the finite-dimensional distributions have a multivariate normal distribution. That is, for any choice of distinct values $t_1, \ldots, t_k \in T$, the random vector $\mathbf{X} = (X_{t_1}, \ldots, X_{t_k})'$ has a multivariate normal distribution with mean vector $\mathbf{m} = E\mathbf{X}$ and covariance matrix $\Sigma = \mathrm{cov}(\mathbf{X}, \mathbf{X})$, which will be denoted by

\[
\mathbf{X} \sim N(\mathbf{m}, \Sigma)
\]

Provided the covariance matrix $\Sigma$ is nonsingular, the random vector $\mathbf{X}$ has a Gaussian probability density function given by

\[
f_{\mathbf{X}}(\mathbf{x}) = (2\pi)^{-k/2} (\det \Sigma)^{-1/2} \exp\left\{ -\tfrac{1}{2} (\mathbf{x} - \mathbf{m})' \Sigma^{-1} (\mathbf{x} - \mathbf{m}) \right\} \tag{1}
\]

In environmental applications, the subscript $t$ will typically denote a point in time, or space, or space and time. For simplicity, we shall restrict attention to the case of time series for which $t$ represents time. In such cases, the index set $T$ is usually $[0, \infty)$ for time series recorded continuously or $\{0, 1, \ldots\}$ for time series recorded at equally spaced time units. The mean and covariance functions of a Gaussian process are defined by

\[
\mu(t) = E X_t \tag{2}
\]

and

\[
\gamma(s, t) = \mathrm{cov}(X_s, X_t) \tag{3}
\]

respectively. While Gaussian processes depend only on these two quantities, modeling can be difficult without introducing further simplifications on the form of the mean and covariance functions. The assumption of stationarity frequently provides the proper level of simplification without sacrificing much generality. Moreover, after applying elementary transformations to the data, the assumption of stationarity of the transformed data is often quite plausible.

A Gaussian time series $\{X_t\}$ is said to be stationary if

1. $\mu(t) = E X_t = \mu$ is independent of $t$, and
2. $\gamma(t + h, t) = \mathrm{cov}(X_{t+h}, X_t)$ is independent of $t$ for all $h$.

For stationary processes, it is conventional to express the covariance function $\gamma$ as a function on $T$ instead of on $T \times T$. That is, we define $\gamma(h) = \mathrm{cov}(X_{t+h}, X_t)$ and call it the autocovariance function of the process. For stationary Gaussian processes $\{X_t\}$, we have

3. $X_t \sim N(\mu, \gamma(0))$ for all $t$, and
4. $(X_{t+h}, X_t)'$ has a bivariate normal distribution with covariance matrix

\[
\begin{pmatrix} \gamma(0) & \gamma(h) \\ \gamma(h) & \gamma(0) \end{pmatrix}
\]

for all $t$ and $h$.

A general stochastic process $\{X_t\}$ satisfying conditions 1 and 2 is said to be weakly or second-order stationary. The first- and second-order moments of weakly stationary processes are invariant with respect to time translations. A stochastic process $\{X_t\}$ is strictly stationary if the distribution of $(X_{t_1}, \ldots, X_{t_n})$ is the same as that of $(X_{t_1+s}, \ldots, X_{t_n+s})$ for any $s$. In other words, the distributional properties of the time series are the same under any time translation. For Gaussian time series, the concepts of weak and strict stationarity coalesce. This result follows immediately from the fact that for weakly stationary processes, $(X_{t_1}, \ldots, X_{t_n})$ and $(X_{t_1+s}, \ldots, X_{t_n+s})$ have the same mean vector and covariance matrix. Since each of the two vectors has a multivariate normal distribution, they must be identically distributed.
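Because a stationary Gaussian process is completely determined by $\mu$ and $\gamma(\cdot)$, any finite-dimensional distribution can be simulated directly from those two ingredients. The following sketch (Python with NumPy; the exponentially decaying autocovariance $\gamma(h) = \sigma^2 \rho^{|h|}$ is a hypothetical choice made purely for illustration, not part of the text above) builds the covariance matrix with entries $\gamma(i - j)$ and draws the vector $(X_1, \ldots, X_n)'$ through its Cholesky factor.

```python
import numpy as np

def finite_dim_sample(gamma, mu, n, rng):
    """Draw one realization of (X_1, ..., X_n)' for a stationary Gaussian
    process with mean mu and autocovariance function gamma(h)."""
    # Covariance matrix with entries gamma(i - j)
    lags = np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
    Gamma = gamma(lags)
    # Sample via the Cholesky factor of the covariance matrix
    L = np.linalg.cholesky(Gamma)
    return mu + L @ rng.standard_normal(n)

# Hypothetical autocovariance, gamma(h) = sigma2 * rho**|h|, used only to
# make the example concrete
sigma2, rho = 2.0, 0.6
gamma = lambda h: sigma2 * rho ** np.abs(h)

x = finite_dim_sample(gamma, mu=10.0, n=200, rng=np.random.default_rng(0))
print(x[:5])
```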
Properties of the Autocovariance Function

An autocovariance function $\gamma(\cdot)$ has the properties:

1. $\gamma(0) \ge 0$,
2. $|\gamma(h)| \le \gamma(0)$ for all $h$,
3. $\gamma(h) = \gamma(-h)$, i.e. $\gamma(\cdot)$ is an even function.

Autocovariances have another fundamental property, namely that of non-negative definiteness,

\[
\sum_{i,j=1}^{n} a_i\, \gamma(t_i - t_j)\, a_j \ge 0 \tag{4}
\]

for all positive integers $n$, real numbers $a_1, \ldots, a_n$, and $t_1, \ldots, t_n \in T$. Note that the expression on the left of (4) is merely the variance of $a_1 X_{t_1} + \cdots + a_n X_{t_n}$ and hence must be non-negative. Conversely, if a function $\gamma(\cdot)$ is non-negative definite and even, then it must be the autocovariance function of some stationary Gaussian process.
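Condition (4) can be checked numerically for a candidate autocovariance: the matrix $[\gamma(t_i - t_j)]_{i,j}$ must have no negative eigenvalues, and any quadratic form in it, being the variance of a linear combination of the $X_{t_i}$, must be non-negative. A minimal sketch, using a hypothetical geometrically decaying $\gamma$ chosen only for illustration:

```python
import numpy as np

def covariance_matrix(gamma_vals):
    """Matrix [gamma(t_i - t_j)] for t_i = 1, ..., n, given
    gamma_vals = [gamma(0), gamma(1), ..., gamma(n - 1)]."""
    n = len(gamma_vals)
    idx = np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
    return np.asarray(gamma_vals)[idx]

# Hypothetical candidate autocovariance values
gamma_vals = 1.5 * 0.4 ** np.arange(25)
Gamma = covariance_matrix(gamma_vals)

# Non-negative definiteness: no eigenvalue of the matrix is negative
print("smallest eigenvalue:", np.linalg.eigvalsh(Gamma).min())

# Equivalently, the quadratic form in (4) is the variance of
# a_1 X_{t_1} + ... + a_n X_{t_n}, hence non-negative for any coefficients
a = np.random.default_rng(1).standard_normal(len(gamma_vals))
print("quadratic form:", a @ Gamma @ a)
```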


Gaussian Linear Processes

If $\{X_t, t = 0, \pm 1, \pm 2, \ldots\}$ is a stationary Gaussian process with mean 0, then the Wold decomposition allows $X_t$ to be expressed as a sum of two independent components,

\[
X_t = \sum_{j=0}^{\infty} \psi_j Z_{t-j} + V_t \tag{5}
\]

where $\{Z_t\}$ is a sequence of independent and identically distributed (iid) normal random variables with mean 0 and variance $\sigma^2$, $\{\psi_j\}$ is a sequence of square summable coefficients with $\psi_0 = 1$, and $\{V_t\}$ is a deterministic process that is independent of $\{Z_t\}$. The $Z_t$ are referred to as innovations and are defined by $Z_t = X_t - E(X_t \mid X_{t-1}, X_{t-2}, \ldots)$. A process $\{V_t\}$ is deterministic if $V_t$ is completely determined by its past history $\{V_s, s < t\}$. An example of such a process is the random sinusoid, $V_t = A \cos(\lambda t + \theta)$, where $A$ and $\theta$ are independent random variables with $A \ge 0$ and $\theta$ distributed uniformly on $[0, 2\pi)$. In this case, $V_2$ is completely determined by the values of $V_0$ and $V_1$. In most time series modeling applications, the deterministic component of a time series is either not present or easily removed.

Purely nondeterministic Gaussian processes do not possess a deterministic component and can be represented as a Gaussian linear process,

\[
X_t = \sum_{j=0}^{\infty} \psi_j Z_{t-j} \tag{6}
\]

The autocovariance of $\{X_t\}$ has the form

\[
\gamma(h) = \sigma^2 \sum_{j=0}^{\infty} \psi_j \psi_{j+h} \tag{7}
\]

The class of autoregressive (AR) processes, and its extension, the class of autoregressive moving-average (ARMA) processes, is dense in the class of Gaussian linear processes. A Gaussian AR($p$) process satisfies the recursions

\[
X_t = \phi_1 X_{t-1} + \cdots + \phi_p X_{t-p} + Z_t \tag{8}
\]

where $\{Z_t\}$ is an iid sequence of $N(0, \sigma^2)$ random variables, and the polynomial $\phi(z) = 1 - \phi_1 z - \cdots - \phi_p z^p$ has no zeros inside or on the unit circle. The AR($p$) process has a linear representation (6) where the coefficients $\psi_j$ are found as functions of the $\phi_j$ (see [2]). Now for any Gaussian linear process, there exists an AR($p$) process such that the difference in the two autocovariance functions can be made arbitrarily small for all lags. In fact, the autocovariances can be matched up perfectly for the first $p$ lags.
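The linear representation (6) and the autocovariance formula (7) are easy to evaluate for an AR($p$) process: matching coefficients in $\psi(z)\phi(z) = 1$ gives $\psi_0 = 1$ and $\psi_j = \sum_{k=1}^{\min(j,p)} \phi_k \psi_{j-k}$, and (7) can then be truncated after sufficiently many terms. The sketch below uses a hypothetical causal AR(2); the truncation length is an arbitrary illustration choice.

```python
import numpy as np

def ar_psi(phi, m):
    """First m coefficients psi_0, ..., psi_{m-1} of the linear representation
    (6) of a causal AR(p): psi_0 = 1 and psi_j = sum_k phi_k psi_{j-k}."""
    p = len(phi)
    psi = np.zeros(m)
    psi[0] = 1.0
    for j in range(1, m):
        for k in range(1, min(j, p) + 1):
            psi[j] += phi[k - 1] * psi[j - k]
    return psi

def ar_acvf(phi, sigma2, max_lag, m=500):
    """Autocovariances gamma(0), ..., gamma(max_lag) from the truncated sum (7)."""
    psi = ar_psi(phi, m + max_lag)
    return np.array([sigma2 * np.dot(psi[:m], psi[h:h + m])
                     for h in range(max_lag + 1)])

# Hypothetical causal AR(2): phi(z) = 1 - 0.5 z + 0.3 z^2 has no zeros in |z| <= 1
phi, sigma2 = [0.5, -0.3], 1.0
print(ar_acvf(phi, sigma2, max_lag=5))
```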

Prediction

Recall that if two random vectors $\mathbf{X}_1$ and $\mathbf{X}_2$ have a joint normal distribution, i.e.

\[
\begin{pmatrix} \mathbf{X}_1 \\ \mathbf{X}_2 \end{pmatrix} \sim N\left( \begin{pmatrix} \mathbf{m}_1 \\ \mathbf{m}_2 \end{pmatrix}, \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix} \right)
\]

and $\Sigma_{22}$ is nonsingular, then the conditional distribution of $\mathbf{X}_1$ given $\mathbf{X}_2$ has a multivariate normal distribution with mean

\[
\mathbf{m}_{\mathbf{X}_1 | \mathbf{X}_2} = \mathbf{m}_1 + \Sigma_{12} \Sigma_{22}^{-1} (\mathbf{X}_2 - \mathbf{m}_2) \tag{9}
\]

and covariance matrix

\[
\Sigma_{\mathbf{X}_1 | \mathbf{X}_2} = \Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21} \tag{10}
\]

The key observation here is that the best mean square error predictor of $\mathbf{X}_1$ in terms of $\mathbf{X}_2$ (i.e. the multivariate function $g(\mathbf{X}_2)$ that minimizes $E\|\mathbf{X}_1 - g(\mathbf{X}_2)\|^2$, where $\|\cdot\|$ is Euclidean distance) is $E(\mathbf{X}_1 \mid \mathbf{X}_2) = \mathbf{m}_{\mathbf{X}_1 | \mathbf{X}_2}$, which is a linear function of $\mathbf{X}_2$. Also, the covariance matrix of the prediction error, $\Sigma_{\mathbf{X}_1 | \mathbf{X}_2}$, does not depend on the value of $\mathbf{X}_2$. These results extend directly to the prediction problem for Gaussian processes.
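Equations (9) and (10) translate directly into a few lines of linear algebra. The sketch below (the numerical values are hypothetical, chosen only to make the call concrete) returns the conditional mean and covariance of $\mathbf{X}_1$ given $\mathbf{X}_2 = \mathbf{x}_2$.

```python
import numpy as np

def gaussian_conditional(m1, m2, S11, S12, S22, x2):
    """Conditional mean (9) and covariance (10) of X1 given X2 = x2
    for jointly normal random vectors X1 and X2."""
    S22_inv = np.linalg.inv(S22)
    cond_mean = m1 + S12 @ S22_inv @ (x2 - m2)
    cond_cov = S11 - S12 @ S22_inv @ S12.T      # Sigma_21 = Sigma_12'
    return cond_mean, cond_cov

# Hypothetical numbers: scalar X1 conditioned on a bivariate X2
m1, m2 = np.array([0.0]), np.array([0.0, 0.0])
S11 = np.array([[1.0]])
S12 = np.array([[0.6, 0.3]])
S22 = np.array([[1.0, 0.4],
                [0.4, 1.0]])
print(gaussian_conditional(m1, m2, S11, S12, S22, x2=np.array([1.2, -0.5])))
```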
Suppose $\{X_t, t = 1, 2, \ldots\}$ is a stationary Gaussian process with mean $\mu$ and autocovariance function $\gamma(\cdot)$, and that, based on the random vector consisting of the first $n$ observations, $\mathbf{X}_n = (X_1, \ldots, X_n)'$, we wish to predict the next observation $X_{n+1}$. Prediction for other lead times is analogous to this special case. Applying the formula in (9), the best one-step-ahead predictor of $X_{n+1}$ is given by

\[
\hat{X}_{n+1} := E(X_{n+1} \mid X_1, \ldots, X_n) = \mu + \phi_{n1}(X_n - \mu) + \cdots + \phi_{nn}(X_1 - \mu) \tag{11}
\]

where

\[
(\phi_{n1}, \ldots, \phi_{nn})' = \Gamma_n^{-1} \gamma_n \tag{12}
\]

$\Gamma_n = \mathrm{cov}(\mathbf{X}_n, \mathbf{X}_n)$, and $\gamma_n = \mathrm{cov}(X_{n+1}, \mathbf{X}_n) = (\gamma(1), \ldots, \gamma(n))'$. The mean square error of prediction is given by

\[
v_n = \gamma(0) - \gamma_n' \Gamma_n^{-1} \gamma_n \tag{13}
\]

These formulas assume that $\Gamma_n$ is nonsingular. If $\Gamma_n$ is singular, then there is a linear relationship among $X_1, \ldots, X_n$, and the prediction problem can then be recast by choosing a generating prediction subset consisting of linearly independent variables. The covariance matrix of this prediction subset will be nonsingular. A mild and easily verifiable condition for ensuring nonsingularity of $\Gamma_n$ for all $n$ is that $\gamma(h) \to 0$ as $h \to \infty$ with $\gamma(0) > 0$ (see [1]).

While (12) and (13) completely solve the prediction problem, these equations require the inversion of an $n \times n$ covariance matrix, which may be difficult and time consuming for large $n$. The Durbin–Levinson algorithm (see [1]) allows one to compute the coefficient vector $\phi_n = (\phi_{n1}, \ldots, \phi_{nn})'$ and the one-step prediction error $v_n$ recursively from $\phi_{n-1}$, $v_{n-1}$, and the autocovariance function.
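Before turning to the recursive approach, note that for moderate $n$ the predictor (11) and its mean square error (13) can be obtained by solving the linear system in (12) directly. A sketch (the autocovariance $\gamma(h) = 2(0.6)^{|h|}$ and the white-noise data are stand-ins used only to exercise the code):

```python
import numpy as np

def one_step_predictor(x, mu, gamma):
    """Best one-step predictor (11) and its mean square error (13),
    obtained by solving the linear system (12) directly."""
    n = len(x)
    Gamma = gamma(np.abs(np.subtract.outer(np.arange(n), np.arange(n))))
    gamma_n = gamma(np.arange(1, n + 1))
    phi = np.linalg.solve(Gamma, gamma_n)       # (phi_n1, ..., phi_nn)'
    # In (11), phi_nj multiplies X_{n+1-j} - mu, so reverse the data vector
    pred = mu + phi @ (x[::-1] - mu)
    mse = gamma(0) - gamma_n @ phi              # (13)
    return pred, mse

# Stand-in autocovariance and data, used only to demonstrate the call
gamma = lambda h: 2.0 * 0.6 ** np.abs(h)
x = np.random.default_rng(2).standard_normal(50)
print(one_step_predictor(x, mu=0.0, gamma=gamma))
```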
The Durbin–Levinson Algorithm

The coefficients $\phi_n$ in the one-step predictor (11) and the mean square error of prediction (13) can be computed recursively from the equations

\[
\phi_{nn} = \left[ \gamma(n) - \sum_{j=1}^{n-1} \phi_{n-1,j}\, \gamma(n - j) \right] v_{n-1}^{-1}
\]

\[
\begin{pmatrix} \phi_{n,1} \\ \vdots \\ \phi_{n,n-1} \end{pmatrix} = \begin{pmatrix} \phi_{n-1,1} \\ \vdots \\ \phi_{n-1,n-1} \end{pmatrix} - \phi_{nn} \begin{pmatrix} \phi_{n-1,n-1} \\ \vdots \\ \phi_{n-1,1} \end{pmatrix}
\]

\[
v_n = v_{n-1} (1 - \phi_{nn}^2) \tag{14}
\]

where $\phi_{11} = \gamma(1)/\gamma(0)$ and $v_0 = \gamma(0)$.
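A direct transcription of the recursions (14) is given below as a sketch; the AR(1)-type autocovariance used in the check at the end is a hypothetical choice for illustration.

```python
import numpy as np

def durbin_levinson(gamma_vals):
    """Durbin-Levinson recursion (14).  gamma_vals holds gamma(0), ..., gamma(n).
    Returns (phi_n1, ..., phi_nn) and the prediction variances v_0, ..., v_n."""
    n = len(gamma_vals) - 1
    phi = np.zeros(n)
    v = np.zeros(n + 1)
    v[0] = gamma_vals[0]
    for m in range(1, n + 1):
        prev = phi[:m - 1].copy()               # (phi_{m-1,1}, ..., phi_{m-1,m-1})
        num = gamma_vals[m] - np.dot(prev, gamma_vals[m - 1:0:-1])
        phi_mm = num / v[m - 1]
        phi[:m - 1] = prev - phi_mm * prev[::-1]
        phi[m - 1] = phi_mm
        v[m] = v[m - 1] * (1.0 - phi_mm ** 2)
    return phi, v

# Check on a hypothetical AR(1)-type autocovariance gamma(h) = rho**h / (1 - rho**2)
rho = 0.6
gamma_vals = rho ** np.arange(6) / (1 - rho ** 2)
phi, v = durbin_levinson(gamma_vals)
print(phi)   # first coefficient equals rho, the rest vanish (the AR case treated below)
print(v)     # v_0 = gamma(0); v_m then drops to the innovation variance 1
```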


If $\{X_t\}$ follows the AR($p$) process in (8), then the recursions simplify a great deal. In particular, for $n > p$, the coefficients $\phi_{nj} = \phi_j$ for $j = 1, \ldots, p$ and $\phi_{nj} = 0$ for $j > p$, giving

\[
\hat{X}_{n+1} = \phi_1 X_n + \cdots + \phi_p X_{n+1-p} \tag{15}
\]

with $v_n = \sigma^2$.

The sequence of coefficients $\{\phi_{jj}, j \ge 1\}$ is called the partial autocorrelation function and is a useful tool for model identification. The partial autocorrelation at lag $j$ is interpreted as the correlation between $X_1$ and $X_{j+1}$ after correcting for the intervening observations $X_2, \ldots, X_j$. Specifically, $\phi_{jj}$ is the correlation of the two residuals obtained by regressing $X_1$ and $X_{j+1}$ on the intermediate observations $X_2, \ldots, X_j$. Of particular interest is the relationship between $\phi_{nn}$ and the reduction in the one-step mean square error as the number of predictors is increased from $n - 1$ to $n$. The one-step prediction error has the following decomposition in terms of the partial autocorrelation function:

\[
v_n = \gamma(0)\, (1 - \phi_{11}^2)(1 - \phi_{22}^2) \cdots (1 - \phi_{nn}^2) \tag{16}
\]

For a Gaussian process, $X_{n+1} - \hat{X}_{n+1}$ is normally distributed with mean 0 and variance $v_n$. Thus,

\[
\hat{X}_{n+1} \pm z_{1-\alpha/2}\, v_n^{1/2}
\]

constitute $(1 - \alpha)100\%$ prediction bounds for the observation $X_{n+1}$, where $z_{1-\alpha/2}$ is the $(1 - \alpha/2)$ quantile of the standard normal distribution. In other words, $X_{n+1}$ lies between the bounds $\hat{X}_{n+1} \pm z_{1-\alpha/2}\, v_n^{1/2}$ with probability $1 - \alpha$.
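Combining the Durbin–Levinson output with the bounds above gives a one-step prediction interval. The sketch below reuses the durbin_levinson function from the previous sketch; the autocovariance and data are again stand-ins, and z = 1.96 is the 0.975 normal quantile, giving approximate 95% bounds.

```python
import numpy as np

def one_step_interval(x, mu, gamma_vals, z=1.96):
    """(1 - alpha) prediction bounds for X_{n+1}; z = z_{1-alpha/2}.
    Relies on the durbin_levinson function from the previous sketch."""
    n = len(x)
    phi, v = durbin_levinson(gamma_vals[:n + 1])   # needs gamma(0), ..., gamma(n)
    pred = mu + phi @ (x[::-1] - mu)               # predictor (11)
    half = z * np.sqrt(v[n])                       # z * v_n^{1/2}
    return pred - half, pred + half

# Stand-in autocovariance and data, for illustration only
rho = 0.6
gamma_vals = rho ** np.arange(31) / (1 - rho ** 2)
x = np.random.default_rng(3).standard_normal(30)
print(one_step_interval(x, mu=0.0, gamma_vals=gamma_vals))
```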
Estimation for Gaussian Processes

One of the advantages of Gaussian models is that an explicit and closed form of the likelihood is readily available. Suppose that $\{X_t, t = 1, 2, \ldots\}$ is a stationary Gaussian time series with mean $\mu$ and autocovariance function $\gamma(\cdot)$. Denote the data vector by $\mathbf{X}_n = (X_1, \ldots, X_n)'$ and the vector of one-step predictors by $\hat{\mathbf{X}}_n = (\hat{X}_1, \ldots, \hat{X}_n)'$, where $\hat{X}_1 = \mu$ and $\hat{X}_j = E(X_j \mid X_1, \ldots, X_{j-1})$ for $j \ge 2$. If $\Gamma_n$ denotes the covariance matrix of $\mathbf{X}_n$, which we assume is nonsingular, then the likelihood of $\mathbf{X}_n$ is

\[
L(\Gamma_n, \mu) = (2\pi)^{-n/2} (\det \Gamma_n)^{-1/2} \exp\left\{ -\tfrac{1}{2} (\mathbf{X}_n - \mu \mathbf{1})' \Gamma_n^{-1} (\mathbf{X}_n - \mu \mathbf{1}) \right\} \tag{17}
\]

where $\mathbf{1} = (1, \ldots, 1)'$. Typically, $\Gamma_n$ will be expressible in terms of a finite number of unknown parameters, $\beta_1, \ldots, \beta_r$, so that the maximum likelihood estimators of these parameters and $\mu$ are those values that maximize $L$ for the given dataset. Under mild regularity assumptions, the resulting maximum likelihood estimators are approximately normally distributed with covariance matrix given by the inverse of the Fisher information.

In most settings, direct closed-form maximization of $L$ with respect to the parameter set is not achievable. In order to maximize $L$ using numerical methods, either derivatives or repeated calculation of the function are required. For moderate to large sample sizes $n$, calculation of both the determinant of $\Gamma_n$ and the quadratic form in the exponential of $L$ can be difficult and time consuming. On the other hand, there is a useful representation of the likelihood in terms of the one-step prediction errors and their mean square errors. By the form of $\hat{\mathbf{X}}_n$, we can write

\[
\mathbf{X}_n - \hat{\mathbf{X}}_n = A_n \mathbf{X}_n \tag{18}
\]

where $A_n$ is a lower triangular square matrix with ones on the diagonal. Inverting this expression, we have

\[
\mathbf{X}_n = C_n (\mathbf{X}_n - \hat{\mathbf{X}}_n) \tag{19}
\]

where $C_n$ is also lower triangular with ones on the diagonal. Since $X_j - E(X_j \mid X_1, \ldots, X_{j-1})$ is uncorrelated with $X_1, \ldots, X_{j-1}$, it follows that the vector $\mathbf{X}_n - \hat{\mathbf{X}}_n$ consists of uncorrelated, and hence independent, normal random variables with mean 0 and variances $v_{j-1}$, $j = 1, \ldots, n$. Taking covariances on both sides of (19) and setting $D_n = \mathrm{diag}\{v_0, \ldots, v_{n-1}\}$, we find that

\[
\Gamma_n = C_n D_n C_n' \tag{20}
\]

and

\[
(\mathbf{X}_n - \mu \mathbf{1})' \Gamma_n^{-1} (\mathbf{X}_n - \mu \mathbf{1}) = (\mathbf{X}_n - \hat{\mathbf{X}}_n)' D_n^{-1} (\mathbf{X}_n - \hat{\mathbf{X}}_n) = \sum_{j=1}^{n} \frac{(X_j - \hat{X}_j)^2}{v_{j-1}} \tag{21}
\]

It follows that $\det \Gamma_n = v_0 v_1 \cdots v_{n-1}$, so that the likelihood reduces to

\[
L(\Gamma_n, \mu) = (2\pi)^{-n/2} (v_0 v_1 \cdots v_{n-1})^{-1/2} \exp\left\{ -\frac{1}{2} \sum_{j=1}^{n} \frac{(X_j - \hat{X}_j)^2}{v_{j-1}} \right\} \tag{22}
\]

The calculation of the one-step prediction errors and their mean square errors required in the computation of $L$ based on (22) can be simplified further for a variety of time series models such as ARMA processes. We illustrate this for an AR process.
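The innovations form (22) suggests a simple way to evaluate the Gaussian log-likelihood without inverting $\Gamma_n$ explicitly: run the Durbin–Levinson recursion, accumulate the one-step errors $X_j - \hat{X}_j$ and their variances $v_{j-1}$, and substitute into (22). A sketch (the AR(1)-type autocovariance and the white-noise data are placeholders for illustration):

```python
import numpy as np

def gaussian_loglik(x, mu, gamma_vals):
    """Log-likelihood based on (22): one-step prediction errors and their
    mean square errors are produced by the Durbin-Levinson recursion."""
    x = np.asarray(x, dtype=float) - mu
    n = len(x)
    phi = np.zeros(n)
    v = np.empty(n)
    v[0] = gamma_vals[0]
    e = np.empty(n)
    e[0] = x[0]                                    # X_1 - mu, since Xhat_1 = mu
    for m in range(1, n):
        prev = phi[:m - 1].copy()
        phi_mm = (gamma_vals[m] - np.dot(prev, gamma_vals[m - 1:0:-1])) / v[m - 1]
        phi[:m - 1] = prev - phi_mm * prev[::-1]
        phi[m - 1] = phi_mm
        v[m] = v[m - 1] * (1.0 - phi_mm ** 2)
        e[m] = x[m] - phi[:m] @ x[m - 1::-1]       # X_{m+1} - Xhat_{m+1}
    return -0.5 * (n * np.log(2 * np.pi) + np.log(v).sum() + np.sum(e ** 2 / v))

# Stand-in autocovariance and data, used only to demonstrate the call
rho, n = 0.6, 200
gamma_vals = rho ** np.arange(n) / (1 - rho ** 2)
x = np.random.default_rng(4).standard_normal(n)
print(gaussian_loglik(x, mu=0.0, gamma_vals=gamma_vals))
```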
Gaussian Likelihood for an AR(p) Process

If $\{X_t\}$ is the AR($p$) process specified in (8) with mean $\mu$, then one can take advantage of the simple form for the one-step predictors and associated mean square errors. The likelihood becomes

\[
\begin{aligned}
L(\phi_1, \ldots, \phi_p, \mu, \sigma^2) = {} & (2\pi)^{-(n-p)/2}\, \sigma^{-(n-p)} \exp\left\{ -\frac{1}{2\sigma^2} \sum_{j=p+1}^{n} (X_j - \hat{X}_j)^2 \right\} \\
& \times (2\pi)^{-p/2} (v_0 v_1 \cdots v_{p-1})^{-1/2} \exp\left\{ -\frac{1}{2} \sum_{j=1}^{p} \frac{(X_j - \hat{X}_j)^2}{v_{j-1}} \right\}
\end{aligned} \tag{23}
\]

where, for $j > p$, $\hat{X}_j = \mu + \phi_1 (X_{j-1} - \mu) + \cdots + \phi_p (X_{j-p} - \mu)$ are the one-step predictors. The likelihood is a product of two terms, the conditional density of $\mathbf{X}_n$ given $\mathbf{X}_p$ and the density of $\mathbf{X}_p$. Often, just the conditional maximum likelihood estimator is computed, which is found by maximizing the first term. For the AR process, the conditional maximum likelihood estimator can be computed in closed form.
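For the conditional likelihood (the first factor in (23)), maximization reduces to ordinary least squares: regress $X_t$ on an intercept and its $p$ lagged values for $t = p+1, \ldots, n$. A sketch (the AR(2) parameter values and the simulated path are hypothetical):

```python
import numpy as np

def ar_conditional_mle(x, p):
    """Conditional MLE for an AR(p) with mean: least squares regression of
    X_t on (1, X_{t-1}, ..., X_{t-p}) for t = p+1, ..., n, which maximizes
    the first (conditional) factor in (23)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    y = x[p:]
    Z = np.column_stack([np.ones(n - p)] +
                        [x[p - j:n - j] for j in range(1, p + 1)])
    coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
    intercept, phi = coef[0], coef[1:]
    resid = y - Z @ coef
    sigma2 = resid @ resid / (n - p)
    mu = intercept / (1.0 - phi.sum())          # mean implied by the intercept
    return phi, mu, sigma2

# Hypothetical AR(2) path simulated from (8) with a nonzero mean
rng = np.random.default_rng(5)
phi_true, mu_true, n = np.array([0.5, -0.3]), 10.0, 500
x = np.full(n, mu_true)
for t in range(2, n):
    x[t] = mu_true + phi_true @ (x[t - 2:t][::-1] - mu_true) + rng.standard_normal()
print(ar_conditional_mle(x, p=2))
```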

Figure 1  Average maximum temperature, 1895–1993 (temperature versus year). Regression line is 16.83 + 0.00845t

Figure 2  QQ plot for normality of the innovations (September temperature innovations versus quantiles of the standard normal)



Example  This example consists of the average maximum temperature over the month of September for the years 1895–1993 in an area of the US whose vegetation is characterized as tundra. The time series $x_1, \ldots, x_{99}$ is plotted in Figure 1. Here we investigate the possibility of the data exhibiting a slight linear trend. After inspecting the residuals from fitting a least squares regression line to the data, we entertain a time series model of the form

\[
X_t = \beta_0 + \beta_1 t + W_t \tag{24}
\]

where $\{W_t\}$ is the Gaussian AR(1) process

\[
W_t = \phi_1 W_{t-1} + Z_t \tag{25}
\]

and $\{Z_t\}$ is a sequence of iid $N(0, \sigma^2)$ random variables. After maximizing the Gaussian likelihood over the parameters $\beta_0$, $\beta_1$, $\phi_1$, and $\sigma^2$, we find that the maximum likelihood estimate of the mean function is $16.83 + 0.00845\,t$. The maximum likelihood estimates of $\phi_1$ and $\sigma^2$ are 0.1536 and 1.3061, respectively. The maximum likelihood estimates of $\beta_0$ and $\beta_1$ can be viewed as generalized least squares estimates assuming that the residual process follows the estimated AR(1) model. The resulting standard errors of these estimates are 0.27781 and 0.00482, respectively, which casts some doubt on the significance of a nonzero slope of the line. Without modeling the dependence in the residuals, the slope would have been deemed significant using classical inference procedures. By modeling the dependence in the residuals, the evidence in favor of a nonzero slope has diminished somewhat. The QQ plot of the estimated innovations is displayed in Figure 2. This plot shows that the AR(1) model is not far from being Gaussian. Further details about inference procedures for regression models with time series errors can be found in [2, Chapter 6].
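A rough sketch of how a model of the form (24)–(25) can be fitted is given below. It uses a simple iterative generalized least squares (Prais–Winsten style) scheme rather than the full maximization of the Gaussian likelihood used in the example, and the data are simulated stand-ins, not the September temperature series analyzed above.

```python
import numpy as np

def fit_reg_ar1(y, X, n_iter=25):
    """Regression with AR(1) errors, fitted by iterating between a generalized
    least squares step for beta (via quasi-differencing) and an AR(1)
    coefficient estimate from the residuals."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)      # start from ordinary LS
    phi = 0.0
    for _ in range(n_iter):
        w = y - X @ beta                              # current residual series
        phi = (w[1:] @ w[:-1]) / (w[:-1] @ w[:-1])    # lag-1 AR coefficient
        # Quasi-difference so that the transformed errors are approximately iid
        ys = np.concatenate(([np.sqrt(1 - phi ** 2) * y[0]], y[1:] - phi * y[:-1]))
        Xs = np.vstack(([np.sqrt(1 - phi ** 2) * X[0]], X[1:] - phi * X[:-1]))
        beta, *_ = np.linalg.lstsq(Xs, ys, rcond=None)
    w = y - X @ beta
    z = w[1:] - phi * w[:-1]                          # estimated innovations
    return beta, phi, z @ z / len(z)

# Simulated stand-in for the temperature example (not the actual series)
rng = np.random.default_rng(6)
n = 99
t = np.arange(1.0, n + 1)
w = np.zeros(n)
for i in range(1, n):
    w[i] = 0.15 * w[i - 1] + rng.standard_normal()
y = 16.8 + 0.008 * t + w
X = np.column_stack([np.ones(n), t])
print(fit_reg_ar1(y, X))
```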

References

[1] Brockwell, P.J. & Davis, R.A. (1991). Time Series: Theory and Methods, 2nd Edition, Springer-Verlag, New York.
[2] Brockwell, P.J. & Davis, R.A. (1996). Introduction to Time Series and Forecasting, Springer-Verlag, New York.
[3] Diggle, P.J., Liang, K.-Y. & Zeger, S.L. (1996). Analysis of Longitudinal Data, Clarendon Press, Oxford.
[4] Fahrmeir, L. & Tutz, G. (1994). Multivariate Statistical Modeling Based on Generalized Linear Models, Springer-Verlag, New York.
[5] Rosenblatt, M. (2000). Gaussian and Non-Gaussian Linear Time Series and Random Fields, Springer-Verlag, New York.

RICHARD A. DAVIS
