
Chapter 5

Linear Stochastic Models

5.1 Least Squares


Suppose that we observe some dependent variable (e.g. number of red blood cells) as a function of some
independent variable (e.g. dosage of a drug), based on n experiments. The empirical data consists of
$(x_1, y_1), \ldots, (x_n, y_n)$, where $x_i$ is the value of the independent variable used for the $i$'th experiment, and $y_i$ is
the corresponding value of the dependent variable.
We wish to find a simple mathematical model that summarizes the relationship of the dependent variable
to the independent variable. The simplest such model is a linear model of the form

\[
y(x) = ax + b .
\]
If $n = 2$ with $x_1 \neq x_2$, we can find $a$ and $b$ by solving a linear system of two equations in two unknowns. If
$n > 2$, the system is overdetermined, and we can apply the method of “least squares”.

Specifically, let $\vec{y} = (y_1, \ldots, y_n)^T$, $\vec{x} = (x_1, \ldots, x_n)^T$, $\vec{e} = (1, 1, \ldots, 1)^T$. A reasonable means of finding the
“best” values of $a$ and $b$ is to select $a$ and $b$ so as to minimize some measure of distance between $\vec{y}$ and $a\vec{x} + b\vec{e}$.
One notion of distance that leads to a particularly nice system of determining equations for $a$ and $b$ is to
measure the distance between $\vec{z}_1, \vec{z}_2 \in \mathbb{R}^n$ via
\[
\|\vec{z}_1 - \vec{z}_2\| ,
\]
where
\[
\|\vec{w}\|^2 = \vec{w}^T \Lambda \vec{w} .
\]
Here, $\Lambda$ is a given $n \times n$ symmetric positive definite matrix. The minimizers $a, b$ of the sum of squares
\[
\min_{a,b} \|\vec{y} - a\vec{x} - b\vec{e}\|^2
\]
satisfy the linear system
\[
\begin{pmatrix}
\vec{x}^T \Lambda \vec{x} & \vec{e}^T \Lambda \vec{x} \\
\vec{x}^T \Lambda \vec{e} & \vec{e}^T \Lambda \vec{e}
\end{pmatrix}
\begin{pmatrix} a \\ b \end{pmatrix}
=
\begin{pmatrix}
\vec{y}^T \Lambda \vec{x} \\
\vec{y}^T \Lambda \vec{e}
\end{pmatrix} .
\]
Our choice of a quadratic form as our (squared) notion of distance is what leads to a linear system for the
minimizer. Other notions of distance would lead to more complex optimization problems. The case in which
$\Lambda = I$ is called “ordinary least squares”, whereas $\Lambda \neq I$ is called “weighted least squares”.
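For concreteness, the $2 \times 2$ system above can be assembled and solved directly. The following is a minimal sketch in Python/NumPy; the function name, the data, and the choice of $\Lambda$ are purely illustrative and not taken from the text.

```python
import numpy as np

def weighted_least_squares(x, y, Lam):
    """Solve for (a, b) minimizing (y - a*x - b*e)^T Lam (y - a*x - b*e)."""
    e = np.ones_like(x)
    # Assemble the 2x2 normal equations described in the text.
    A = np.array([[x @ Lam @ x, e @ Lam @ x],
                  [x @ Lam @ e, e @ Lam @ e]])
    rhs = np.array([y @ Lam @ x, y @ Lam @ e])
    a, b = np.linalg.solve(A, rhs)
    return a, b

# Illustrative (made-up) data.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
Lam = np.eye(len(x))            # Lambda = I gives ordinary least squares
print(weighted_least_squares(x, y, Lam))
```

With $\Lambda = I$ this reproduces ordinary least squares; any symmetric positive definite $\Lambda$ yields the corresponding weighted fit.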
This approach to fitting a linear model to observed data leaves open several questions:

1. How should one choose the matrix Λ?

2. How can one assign “error bars” to the slope and intercept values $a$ and $b$?

3. Is there any way to objectively test the linear model against the even simpler model in which a = 0
(the “constant” model)?

A statistical formulation of this linear modeling problem will permit us to address these issues.

5.2 Linear Regression Models with Gaussian Residuals


We turn to the statistical view of how to build a linear model. We now view the dependent data values
as random variables. In particular, we assume that Yi is a rv corresponding to the measured response of
the dependent variable (e.g. blood pressure) as a function of the independent variable (e.g. drug dosage).
Specifically, the “linear regression” model assumes that

Yi = a∗ xi + b∗ + εi
where a∗ and b∗ are the “true” slope and intercept values, and εi is a rv describing the “residual error” in
the linear model corresponding to observation Yi . The great majority of the literature on linear regression
presumes that ~ε = (ε1 , . . . , εn )T is a Gaussian random vector with mean ~0 and covariance matrix σ 2 C, where
σ 2 is unknown (and C is specified by the statistician and is therefore known). We follow the literature here,
and make the assumption that the residuals have this Gaussian structure.
This statistical model has three unknown parameters, namely a∗ , b∗ and σ 2 . The principle of maximum
likelihood asserts that a∗ , b∗ and σ 2 should be estimated as the maximizer (â, b̂, σ̂ 2 ) of the likelihood.

\[
\frac{1}{(2\pi\sigma^2)^{n/2} |\det C|^{1/2}} \exp\!\left( -\frac{1}{2\sigma^2} (\vec{Y} - a\vec{x} - b\vec{e})^T C^{-1} (\vec{Y} - a\vec{x} - b\vec{e}) \right) .
\]
Here, $\vec{Y} = (Y_1, \ldots, Y_n)^T$. (This likelihood presumes that $C$ has been specified as a positive definite matrix.
Any reasonable model for $C$ will have this property.)
The estimators satisfy
\[
\begin{pmatrix}
\vec{x}^T C^{-1} \vec{x} & \vec{e}^T C^{-1} \vec{x} \\
\vec{x}^T C^{-1} \vec{e} & \vec{e}^T C^{-1} \vec{e}
\end{pmatrix}
\begin{pmatrix} \hat{a} \\ \hat{b} \end{pmatrix}
=
\begin{pmatrix}
\vec{Y}^T C^{-1} \vec{x} \\
\vec{Y}^T C^{-1} \vec{e}
\end{pmatrix}
\]
and
\[
\hat{\sigma}^2 = \frac{1}{n} (\vec{Y} - \hat{a}\vec{x} - \hat{b}\vec{e})^T C^{-1} (\vec{Y} - \hat{a}\vec{x} - \hat{b}\vec{e}) .
\]
It is common to choose C = I in the linear regression model. This corresponds to an assumption of iid
residual errors. However, one need not choose C = I. For example, suppose that one assumes that the
variability of Yi scales with the magnitude of xi . In this case, one would set C = diag(x21 , x22 , . . . , x2n ). This
leads to a “weighted least squares” problem. Note that the statistical formulation helps suggest plausible
forms for C.
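A minimal sketch of how $\hat{a}$, $\hat{b}$ and $\hat{\sigma}^2$ might be computed numerically is given below (Python/NumPy); the data are made up, and the choice $C = \mathrm{diag}(x_1^2, \ldots, x_n^2)$ is the illustrative weighting just discussed.

```python
import numpy as np

def fit_linear_gaussian(x, y, C):
    """Maximum likelihood fit of Y = a*x + b*e + eps, eps ~ N(0, sigma^2 C)."""
    n = len(x)
    e = np.ones(n)
    Cinv = np.linalg.inv(C)
    A = np.array([[x @ Cinv @ x, e @ Cinv @ x],
                  [x @ Cinv @ e, e @ Cinv @ e]])
    rhs = np.array([y @ Cinv @ x, y @ Cinv @ e])
    a_hat, b_hat = np.linalg.solve(A, rhs)
    resid = y - a_hat * x - b_hat
    sigma2_hat = (resid @ Cinv @ resid) / n
    return a_hat, b_hat, sigma2_hat

# Illustration: variability of Y_i scaling with the magnitude of x_i.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.8, 4.4, 5.7, 8.9, 9.6])
C = np.diag(x**2)
print(fit_linear_gaussian(x, y, C))
```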
This statistical framework also permits us to develop “error bars” for our estimates of $a^*$ and $b^*$. Let
\[
\bar{x} = \frac{1}{n} \sum_{i=1}^n x_i ,
\]
\[
s_{xx} = \sum_{i=1}^n (x_i - \bar{x})^2 ,
\]
\[
\mathrm{SSE} = \text{sum of squares of estimated residuals} = \sum_{i=1}^n (Y_i - \hat{a} x_i - \hat{b})^2 .
\]
It can be shown that when $C = I$,
\[
\frac{\hat{a} - a^*}{\sqrt{\dfrac{\mathrm{SSE}/(n-2)}{s_{xx}}}} \overset{D}{=} t_{n-2} ,
\]
\[
\frac{\hat{b} - b^*}{\sqrt{\dfrac{\mathrm{SSE}}{n-2}\left(\dfrac{1}{n} + \dfrac{\bar{x}^2}{s_{xx}}\right)}} \overset{D}{=} t_{n-2} ,
\]
where $t_{n-2}$ is a so-called Student-$t$ rv with $n-2$ degrees of freedom (and is a “tabulated distribution”). It
follows that if one selects $z$ so that $P\{-z \le t_{n-2} \le z\} = 1 - \delta$, then
\[
\left[ \hat{a} - z\sqrt{\frac{\mathrm{SSE}/(n-2)}{s_{xx}}} , \; \hat{a} + z\sqrt{\frac{\mathrm{SSE}/(n-2)}{s_{xx}}} \right] ,
\]
\[
\left[ \hat{b} - z\sqrt{\frac{\mathrm{SSE}}{n-2}\left(\frac{1}{n} + \frac{\bar{x}^2}{s_{xx}}\right)} , \; \hat{b} + z\sqrt{\frac{\mathrm{SSE}}{n-2}\left(\frac{1}{n} + \frac{\bar{x}^2}{s_{xx}}\right)} \right]
\]
are exact $100(1-\delta)\%$ confidence intervals for $a^*$ and $b^*$, respectively.¹

¹See Chapter 12 of Probability and Statistics for the Engineering, Computing, and Physical Sciences by E. R. Dougherty, Prentice Hall (1990), for details.
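Assuming access to a Student-$t$ quantile routine (SciPy's is used here), the intervals might be computed as in the following sketch; the data and the function name are illustrative.

```python
import numpy as np
from scipy.stats import t

def ols_confidence_intervals(x, y, delta=0.05):
    """Exact 100(1-delta)% intervals for a* and b* when C = I."""
    n = len(x)
    xbar, ybar = x.mean(), y.mean()
    sxx = np.sum((x - xbar) ** 2)
    a_hat = np.sum(x * y - xbar * ybar) / sxx
    b_hat = ybar - a_hat * xbar
    sse = np.sum((y - a_hat * x - b_hat) ** 2)
    z = t.ppf(1 - delta / 2, df=n - 2)           # P{-z <= t_{n-2} <= z} = 1 - delta
    half_a = z * np.sqrt(sse / (n - 2) / sxx)
    half_b = z * np.sqrt(sse / (n - 2) * (1 / n + xbar**2 / sxx))
    return (a_hat - half_a, a_hat + half_a), (b_hat - half_b, b_hat + half_b)

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.9, 4.1, 6.3, 7.8, 10.2, 12.1])
print(ols_confidence_intervals(x, y, delta=0.05))
```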


We turn next to the issue of testing the linear model ($a^* \neq 0$) versus the constant model ($a^* = 0$) when
$C = I$. Under the hypothesis that $a^* = 0$,
\[
\frac{\hat{a}^2 s_{xx}}{\mathrm{SSE}/(n-2)} \overset{D}{=} F_{1,n-2}
\]
where $F_{1,n-2}$ is a rv having the $F$ distribution with 1 and $n-2$ degrees of freedom. If we choose $z$ so
that $P\{F_{1,n-2} > z\} = \gamma$ (with $\gamma$ small), then it is rare that the statistic $\hat{a}^2 s_{xx} / (\mathrm{SSE}/(n-2))$ exceeds $z$ when $a^* = 0$. (A
common value of $\gamma$ is 0.05.) Hence, if $\hat{a}^2 s_{xx} / (\mathrm{SSE}/(n-2)) \le z$, we view the data as being consistent with $a^* = 0$ (i.e. the
“constant” model), whereas if the statistic is larger than $z$, we reject the hypothesis that $a^* = 0$.
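A sketch of this $F$-test, again with SciPy supplying the quantile and with made-up data:

```python
import numpy as np
from scipy.stats import f

def test_constant_model(x, y, gamma=0.05):
    """Reject a* = 0 if a_hat^2 * sxx / (SSE/(n-2)) exceeds the F_{1,n-2} quantile."""
    n = len(x)
    xbar, ybar = x.mean(), y.mean()
    sxx = np.sum((x - xbar) ** 2)
    a_hat = np.sum(x * y - xbar * ybar) / sxx
    b_hat = ybar - a_hat * xbar
    sse = np.sum((y - a_hat * x - b_hat) ** 2)
    stat = a_hat**2 * sxx / (sse / (n - 2))
    z = f.ppf(1 - gamma, dfn=1, dfd=n - 2)       # P{F_{1,n-2} > z} = gamma
    return stat, z, stat > z                     # True => reject a* = 0

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.3, 2.1, 2.6, 2.2, 2.8, 2.4])     # nearly flat (illustrative) data
print(test_constant_model(x, y))
```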

5.3 Linear Regression Model with non-Gaussian Residuals


In some application settings, one may prefer to avoid assuming that the residuals are Gaussian. Here, we
illustrate the use of the bootstrap in this context.

We assume that for $1 \le i \le n$,
\[
Y_i = a^* x_i + b^* + \varepsilon_i
\]
where $a^*$ and $b^*$ are the “true” slope and intercept values, and $\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n$ are iid rv's with $E[\varepsilon_i] = 0$ and
$\mathrm{var}(\varepsilon_i) < \infty$. Because this is a model in which $C$ is implicitly assumed to be the identity, we estimate $a^*$
and $b^*$ as in ordinary least squares:
\[
\hat{a} = \frac{\sum_{i=1}^n (x_i Y_i - \bar{x}\bar{Y})}{\sum_{i=1}^n (x_i - \bar{x})^2}
\]
\[
\hat{b} = \bar{Y} - \hat{a}\bar{x}
\]
where $\bar{Y} = \frac{1}{n}\sum_{i=1}^n Y_i$. Under modest assumptions on the $x_i$'s, it can be shown that
\[
\hat{a} \overset{p}{\longrightarrow} a^* ,
\qquad
\hat{b} \overset{p}{\longrightarrow} b^*
\]

as n → ∞. This guarantees that when n is large, each of the estimated residuals

ε̂i = Yi − âxi − b̂

is close to its corresponding true residual.


To construct error bars in this setting, we apply the bootstrap. If we knew $a^*$, $b^*$, and the exact distribution of
the residuals, we could use Monte Carlo simulation to numerically compute the distribution of (for example)
\[
\frac{\hat{a} - a^*}{\sqrt{\dfrac{\mathrm{SSE}/(n-2)}{s_{xx}}}} .
\]
(Of course, in the Gaussian setting with $C = I$, this is known to be a $t_{n-2}$ rv. In the non-Gaussian setting,
this rv has a complicated and unknown distribution.) Specifically, we could sample the distribution of the
$\varepsilon_i$'s $n$ iid times, yielding $\varepsilon_{11}, \ldots, \varepsilon_{1n}$. Set $Y_{1i} = a^* x_i + b^* + \varepsilon_{1i}$, for $1 \le i \le n$, and compute the ordinary
least squares estimates $\hat{a}_1$ and $\hat{b}_1$ corresponding to the data set $(x_1, Y_{11}), \ldots, (x_n, Y_{1n})$. If we repeat the
process $m$ independent times (for a total of $mn$ samples from the distribution of the $\varepsilon_i$'s), thereby yielding
$\hat{a}_1, \hat{b}_1, \ldots, \hat{a}_m, \hat{b}_m$, we could estimate the required distribution via
\[
\frac{1}{m} \sum_{i=1}^m I\!\left( \frac{\hat{a}_i - a^*}{\sqrt{\dfrac{\sum_{j=1}^n (Y_{ij} - \hat{a}_i x_j - \hat{b}_i)^2/(n-2)}{s_{xx}}}} \le \cdot \right) .
\]

Of course, we generally don’t have the ability to cheaply obtain mn such samples from the distribution of
the εi ’s.
The bootstrap philosophy replaces $a^*$ by $\hat{a}$, $b^*$ by $\hat{b}$, and the distribution of the $\varepsilon_i$'s by the $\hat{\varepsilon}_i$'s. Sample the
$\hat{\varepsilon}_i$'s $n$ iid times (with replacement), thereby yielding $\varepsilon^*_{11}, \ldots, \varepsilon^*_{1n}$, and compute the ordinary least squares
estimators, $\hat{a}^*_1$ and $\hat{b}^*_1$, corresponding to the data set $(x_1, Y^*_{11}), \ldots, (x_n, Y^*_{1n})$, where $Y^*_{1j} = \hat{a} x_j + \hat{b} + \varepsilon^*_{1j}$. We
now repeat this process $m$ independent times, thereby yielding $m$ bootstrap estimates $\hat{a}^*_1, \hat{b}^*_1, \ldots, \hat{a}^*_m, \hat{b}^*_m$.
The required distribution can be estimated via
\[
\frac{1}{m} \sum_{i=1}^m I\!\left( \frac{\hat{a}^*_i - \hat{a}}{\sqrt{\dfrac{\sum_{j=1}^n (Y^*_{ij} - \hat{a}^*_i x_j - \hat{b}^*_i)^2/(n-2)}{s_{xx}}}} \le \cdot \right) .
\]

The above estimated distribution is then used to construct a confidence interval for a∗ in the usual way.
A similar bootstrap method can be used to produce confidence intervals for b∗ (that are asymptotically valid
as m, n → ∞) or to produce asymptotically valid hypothesis testing regions.
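The following sketch illustrates the residual bootstrap just described for the studentized slope statistic; the helper name and the data are hypothetical.

```python
import numpy as np

def residual_bootstrap(x, y, m=2000, rng=np.random.default_rng(0)):
    """Bootstrap the studentized slope statistic by resampling estimated residuals."""
    n = len(x)
    xbar, ybar = x.mean(), y.mean()
    sxx = np.sum((x - xbar) ** 2)
    a_hat = np.sum(x * y - xbar * ybar) / sxx
    b_hat = ybar - a_hat * xbar
    eps_hat = y - a_hat * x - b_hat              # estimated residuals
    stats = np.empty(m)
    for i in range(m):
        eps_star = rng.choice(eps_hat, size=n, replace=True)
        y_star = a_hat * x + b_hat + eps_star    # bootstrap data set
        a_star = np.sum(x * y_star - xbar * y_star.mean()) / sxx
        b_star = y_star.mean() - a_star * xbar
        sse_star = np.sum((y_star - a_star * x - b_star) ** 2)
        stats[i] = (a_star - a_hat) / np.sqrt(sse_star / (n - 2) / sxx)
    return stats                                 # empirical bootstrap distribution

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.2, 3.9, 6.4, 7.9, 10.3, 11.8, 14.2, 15.9])
stats = residual_bootstrap(x, y)
print(np.quantile(stats, [0.025, 0.975]))        # quantiles used to build an interval for a*
```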

Remark 5.1: Suppose that we assume the $\varepsilon_i$'s have a covariance matrix that is known up to an (unknown)
factor $\sigma^2$, so that
\[
Y_i = a^* x_i + b^* + \varepsilon_i ,
\]
where $\vec{\varepsilon} = (\varepsilon_1, \ldots, \varepsilon_n)^T$ is assumed to have a positive definite covariance matrix $\sigma^2 C$, where $C$ is known
and $\sigma^2$ is assumed unknown. In this case, we can use the fact that $C$ is known to compute the Cholesky
factorization $C = L L^T$. Note that
\[
L^{-1} \vec{Y} = a^* L^{-1} \vec{x} + b^* L^{-1} \vec{e} + L^{-1} \vec{\varepsilon} .
\]
Hence, if we set $\vec{Z} = L^{-1}\vec{Y}$, $\vec{w} = L^{-1}\vec{x}$, $\vec{v} = L^{-1}\vec{e}$, and $\vec{\nu} = L^{-1}\vec{\varepsilon}$, we arrive at the model
\[
\vec{Z} = a^* \vec{w} + b^* \vec{v} + \vec{\nu} ,
\]
where $\vec{\nu}$ has mean zero and $\sigma^2 I$ as its covariance matrix. If we now additionally assume that the $\nu_i$'s are iid,
the bootstrap can be applied to this transformed model (and hence to the original model with covariance
matrix $\sigma^2 C$).
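A minimal sketch of the whitening transformation in Remark 5.1, using NumPy's Cholesky factorization; the data and the diagonal choice of $C$ are illustrative.

```python
import numpy as np

def whiten(x, y, C):
    """Transform the model with covariance sigma^2*C into one with sigma^2*I residuals."""
    L = np.linalg.cholesky(C)        # C = L L^T
    e = np.ones(len(x))
    Z = np.linalg.solve(L, y)        # L^{-1} Y
    w = np.linalg.solve(L, x)        # L^{-1} x
    v = np.linalg.solve(L, e)        # L^{-1} e
    return Z, w, v                   # now Z = a* w + b* v + nu, cov(nu) = sigma^2 I

# Illustration with C = diag(x_i^2).
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.3, 2.8, 4.2])
Z, w, v = whiten(x, y, np.diag(x**2))
print(Z, w, v)
```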

5.4 Data Transformations
In many applied settings, one expects that a non-linear model might offer a better explanation of the data.
For example, one might postulate basic trends of the form:

\[
y(x) = c \exp(a x) \quad \text{(exponential trend)}
\]
or
\[
y(x) = c x^a \quad \text{(power law trend)} .
\]
Simple data transformations reduce these models to a linear model. In the presence of an exponential
trend, one fits the linear regression model to $(x_1, \log Y_1), \ldots, (x_n, \log Y_n)$, while in the presence of a power
law trend one fits a linear model to $(\log x_1, \log Y_1), \ldots, (\log x_n, \log Y_n)$.
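A short sketch of both transformations, reusing an ordinary least squares helper; the data are made up to follow a rough exponential trend.

```python
import numpy as np

def ols(x, y):
    """Ordinary least squares slope and intercept."""
    xbar, ybar = x.mean(), y.mean()
    a = np.sum(x * y - xbar * ybar) / np.sum((x - xbar) ** 2)
    return a, ybar - a * xbar

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.7, 7.5, 19.9, 54.3, 148.9])      # roughly exp(x)

# Exponential trend: log y = log c + a x, so regress log y on x.
a_exp, logc_exp = ols(x, np.log(y))
print("exponential fit:", a_exp, np.exp(logc_exp))

# Power-law trend: log y = log c + a log x, so regress log y on log x.
a_pow, logc_pow = ols(np.log(x), np.log(y))
print("power-law fit:", a_pow, np.exp(logc_pow))
```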

5.5 Multiple Linear Regression


Of course, it is common in many applications to try to explain a dependent variable (e.g. blood pressure)
as a function of multiple explanatory variables (e.g. dosage of a drug, body weight). Here, we have an
$\mathbb{R}^d$-valued vector $x_i$ that represents the levels of the $d$ explanatory variables associated with the $i$'th
experiment, and $Y_i$ is the corresponding value of the dependent variable. We assume that
\[
Y_i = a^{*T} x_i + b^* + \varepsilon_i
\]
for $1 \le i \le n$, where $a^* \in \mathbb{R}^d$ and $b^*$ are the “true” parameters, and $(\varepsilon_1, \ldots, \varepsilon_n)^T$ is an $n$-dimensional
Gaussian rv with mean $\vec{0}$ and covariance matrix $\sigma^2 C$ with unknown $\sigma^2$ (but known $C$).
Here, the likelihood is given by
\[
\frac{1}{(2\pi\sigma^2)^{n/2} |\det C|^{1/2}} \exp\!\left( -\frac{1}{2\sigma^2} (\vec{Y} - x a - b\vec{e})^T C^{-1} (\vec{Y} - x a - b\vec{e}) \right)
\]
where $\vec{Y} = (Y_1, \ldots, Y_n)^T$ and $x$ is the $n \times d$ matrix in which the $i$'th row is $x_i$. The maximum likelihood
estimators $\hat{a}$, $\hat{b}$ and $\hat{\sigma}^2$ satisfy
\[
\begin{pmatrix}
x^T C^{-1} x & x^T C^{-1} \vec{e} \\
\vec{e}^T C^{-1} x & \vec{e}^T C^{-1} \vec{e}
\end{pmatrix}
\begin{pmatrix} \hat{a} \\ \hat{b} \end{pmatrix}
=
\begin{pmatrix}
x^T C^{-1} \vec{Y} \\
\vec{e}^T C^{-1} \vec{Y}
\end{pmatrix}
\]
\[
\hat{\sigma}^2 = \frac{1}{n} (\vec{Y} - x\hat{a} - \hat{b}\vec{e})^T C^{-1} (\vec{Y} - x\hat{a} - \hat{b}\vec{e}) .
\]
All the ideas described in the context of (simple) linear regression models with d = 1 generalize in a
suitable way to the multiple linear regression context: confidence region procedures for a∗ , hypothesis testing,
bootstrap procedures for non-Gaussian residuals, etc.
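A minimal sketch of the multiple-regression estimators, assembling the $(d+1) \times (d+1)$ block system above (NumPy); the data and the default choice $C = I$ are illustrative.

```python
import numpy as np

def fit_multiple_regression(X, y, C=None):
    """MLE of (a, b, sigma^2) in Y = X a + b e + eps, eps ~ N(0, sigma^2 C)."""
    n, d = X.shape
    e = np.ones(n)
    Cinv = np.eye(n) if C is None else np.linalg.inv(C)
    # Block system from the text: unknowns (a_1, ..., a_d, b).
    A = np.block([[X.T @ Cinv @ X, (X.T @ Cinv @ e).reshape(d, 1)],
                  [(e @ Cinv @ X).reshape(1, d), np.array([[e @ Cinv @ e]])]])
    rhs = np.concatenate([X.T @ Cinv @ y, [e @ Cinv @ y]])
    sol = np.linalg.solve(A, rhs)
    a_hat, b_hat = sol[:d], sol[d]
    resid = y - X @ a_hat - b_hat
    sigma2_hat = (resid @ Cinv @ resid) / n
    return a_hat, b_hat, sigma2_hat

# Illustration with d = 2 explanatory variables (made-up data).
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
y = np.array([5.1, 5.9, 12.8, 13.9, 19.2])
print(fit_multiple_regression(X, y))
```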

5.6 The Correlation Model


Here, we adapt the linear regression model slightly, so that the xi ’s are now modeled themselves as random
variables (so, in particular, the experimentalist does not control the xi ’s at which measurements are gathered).
For example, we might collect n specimens of (say) a fish, and study the relationship between the (random)
weight Xi of the ith fish, and the amount Yi of pollutant stored in the tissues of the fish (for 1 ≤ i ≤ n).
The precise statistical specification of this so-called “correlation” model assumes that
\[
Y_i = a^{*T} X_i + b^* + \varepsilon_i
\]

for some “true” $a^* \in \mathbb{R}^d$ and $b^* \in \mathbb{R}$, where $((X_i, \varepsilon_i) : 1 \le i \le n)$ is a set of $n$ iid pairs with $E[\varepsilon_i] = 0$
and $\mathrm{var}(\varepsilon_i) < \infty$. (We permit $X_i \in \mathbb{R}^d$ to be vector valued, so as to permit $Y_i$ to depend on multiple
characteristics of each specimen.) Put $\tilde{X}_i = X_i - E[X_i]$ and $\tilde{Y}_i = Y_i - E[Y_i]$, for $1 \le i \le n$. Note that the
best affine predictor of $Y_i$ given $X_i$ must be $a^{*T} X_i + b^*$, and hence
\[
a^* = \left( E\!\left[ \tilde{X}_1 \tilde{X}_1^T \right] \right)^{-1} E\!\left[ \tilde{X}_1 \tilde{Y}_1 \right]
\]
\[
b^* = E[Y_1] - a^{*T} E[X_1] .
\]
(We assume here, and throughout, that the covariance matrix, $E[\tilde{X}_1 \tilde{X}_1^T]$, is non-singular.)
We describe now the bootstrap procedure that would be used to deal with such a correlation model (in the
presence of non-Gaussian residual errors).

Exercise 5.1: Suppose that $E\|X_1\|^2 < \infty$ and $E\varepsilon_1^2 < \infty$. Put
\[
\hat{a}_n = \left( \frac{1}{n}\sum_{i=1}^n (X_i - \bar{X}_n)^T (X_i - \bar{X}_n) \right)^{-1} \cdot \left( \frac{1}{n}\sum_{i=1}^n (X_i - \bar{X}_n)^T (Y_i - \bar{Y}_n) \right) ,
\]
\[
\hat{b}_n = \bar{Y}_n - \hat{a}_n \bar{X}_n
\]
where
\[
\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i ,
\qquad
\bar{Y}_n = \frac{1}{n}\sum_{i=1}^n Y_i .
\]
Prove that
\[
\hat{a}_n \to a^* \ \text{a.s.} ,
\qquad
\hat{b}_n \to b^* \ \text{a.s.}
\]
as $n \to \infty$.

According to Exercise 5.1, $\hat{a}_n$ and $\hat{b}_n$ are (for large sample sizes $n$) close to $a^*$ and $b^*$. Suppose that
we sample $(X^*_{11}, Y^*_{11}), \ldots, (X^*_{1n}, Y^*_{1n})$ from the collection of observations $(X_1, Y_1), \ldots, (X_n, Y_n)$, independently
(and with replacement). Put
\[
\bar{X}^*_1 = \frac{1}{n}\sum_{i=1}^n X^*_{1i} ,
\]
\[
\bar{Y}^*_1 = \frac{1}{n}\sum_{i=1}^n Y^*_{1i} ,
\]
\[
\hat{a}^*_1 = \left( \frac{1}{n}\sum_{i=1}^n (X^*_{1i} - \bar{X}^*_1)^T (X^*_{1i} - \bar{X}^*_1) \right)^{-1} \cdot \left( \frac{1}{n}\sum_{i=1}^n (X^*_{1i} - \bar{X}^*_1)^T (Y^*_{1i} - \bar{Y}^*_1) \right) ,
\]
\[
\hat{b}^*_1 = \bar{Y}^*_1 - \hat{a}^*_1 \bar{X}^*_1 .
\]
If we independently generate $m$ such bootstrap samples from $(X_1, Y_1), \ldots, (X_n, Y_n)$, we obtain $m$ pairs
$(\hat{a}^*_1, \hat{b}^*_1), \ldots, (\hat{a}^*_m, \hat{b}^*_m)$. If $m, n$ are both large, the distribution of
\[
\left( \frac{1}{n}\sum_{i=1}^n (X_i - \bar{X}_n)^T (X_i - \bar{X}_n) \right)^{-\frac{1}{2}} (\hat{a}_n - a^*)
\]
can be approximated by
\[
\frac{1}{m}\sum_{i=1}^m I\!\left( \left( \frac{1}{n}\sum_{j=1}^n (X^*_{ij} - \bar{X}^*_i)^T (X^*_{ij} - \bar{X}^*_i) \right)^{-\frac{1}{2}} (\hat{a}^*_i - \hat{a}_n) \le \cdot \right) .
\]
This bootstrap procedure can be used to construct confidence regions for $a^*$ and $b^*$ for the correlation model,
as well as hypothesis testing regions.²

²See Chapter 4 of The Bootstrap and Edgeworth Expansion by Peter Hall, Springer-Verlag (1992), for details.
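A sketch of this “pairs” bootstrap for the correlation model (NumPy); the moment estimators follow Exercise 5.1, and the simulated data and function names are purely illustrative.

```python
import numpy as np

def fit_pairs(X, Y):
    """Moment-based estimators of a* and b* for the correlation model."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean()
    a = np.linalg.solve(Xc.T @ Xc / len(Y), Xc.T @ Yc / len(Y))
    b = Y.mean() - a @ X.mean(axis=0)
    return a, b

def pairs_bootstrap(X, Y, m=2000, rng=np.random.default_rng(0)):
    """Resample (X_i, Y_i) pairs with replacement and refit, m times."""
    n = len(Y)
    a_hat, b_hat = fit_pairs(X, Y)
    boot = np.empty((m, X.shape[1] + 1))
    for i in range(m):
        idx = rng.integers(0, n, size=n)         # sample indices with replacement
        a_star, b_star = fit_pairs(X[idx], Y[idx])
        boot[i] = np.concatenate([a_star - a_hat, [b_star - b_hat]])
    return boot                                  # centered bootstrap replicates

# Illustrative simulated data with d = 1.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 1))
Y = 2.0 * X[:, 0] + 1.0 + rng.standard_normal(200)
boot = pairs_bootstrap(X, Y)
print(np.quantile(boot, [0.025, 0.975], axis=0))  # basis for confidence regions
```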

5.7 Modeling Deterministic Dynamical Systems via Differential Equations

Let $y = (y(t) : t \ge 0)$ be a deterministic dynamical system described by a $p$th order differential equation,
namely
\[
y^{(p)}(t) = f(y(t), y^{(1)}(t), \ldots, y^{(p-1)}(t))
\]
for some given function $f : \mathbb{R}^p \to \mathbb{R}$. Of course, such a $p$th order equation can always be reduced to a first
order equation by introducing a suitable state variable, namely
\[
x(t) = (y(t), y^{(1)}(t), \ldots, y^{(p-1)}(t))^T .
\]
Then we have $\dot{x}(t) = g(x(t))$, where $g : \mathbb{R}^p \to \mathbb{R}^p$ is given by
\[
g(x_1, \ldots, x_p) = (x_2, \ldots, x_p, f(x_1, \ldots, x_p))^T .
\]
An especially important case is that of a $p$th order linear differential equation of the form
\[
y^{(p)}(t) = \sum_{j=1}^p \beta_j y^{(p-j)}(t) + c \tag{5.1}
\]
in which case we have
\[
\frac{d}{dt}
\begin{pmatrix}
y(t) \\ y^{(1)}(t) \\ \vdots \\ \vdots \\ y^{(p-1)}(t)
\end{pmatrix}
=
\begin{pmatrix}
0 & 1 & 0 & \cdots & 0 \\
0 & 0 & 1 & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & \cdots & 1 \\
\beta_p & \beta_{p-1} & \beta_{p-2} & \cdots & \beta_1
\end{pmatrix}
\begin{pmatrix}
y(t) \\ y^{(1)}(t) \\ \vdots \\ \vdots \\ y^{(p-1)}(t)
\end{pmatrix}
+
\begin{pmatrix}
0 \\ 0 \\ \vdots \\ 0 \\ c
\end{pmatrix} .
\tag{5.2}
\]

5.8 Linear Difference Equations of pth Order

Definition 5.1: Given a sequence $(y_n : n \ge 0)$, define the $k$th difference by
\[
\Delta^1 y_n = y_{n+1} - y_n
\]
for $k = 1$ and, for $k > 1$,
\[
\Delta^k y_n = \Delta^{k-1} y_{n+1} - \Delta^{k-1} y_n .
\]
Then the discrete-time analog to (5.1) is
\[
\Delta^p y_n = \sum_{j=1}^p \beta_j \Delta^{p-j} y_n + c . \tag{5.3}
\]


5.9 Stochastic Linear Difference Equations of pth Order

The stochastic analog to a constant sequence $z_n = c$ is an iid sequence $(V_n : n \ge 0)$. Hence, the natural
stochastic analog to (5.3) is a stochastic sequence $(Y_n : n \ge 0)$ satisfying
\[
\Delta^p Y_n = \sum_{j=1}^p \beta_j \Delta^{p-j} Y_n + V_n . \tag{5.4}
\]
Now observe that
\[
\Delta^k Y_n = \sum_{j=0}^k \binom{k}{j} (-1)^{k-j} Y_{n+j} .
\]
As a consequence, we may write (5.4) in the form
\[
Y_n = \sum_{j=1}^p a_j Y_{n-j} + V_n \tag{5.5}
\]
for $n \ge p$ (for suitably chosen $a_j$'s).


Note that $Y_n$ is expressed as a linear combination of the $p$ previous values of the $Y$-sequence, namely $Y_{n-1}$,
$\ldots$, $Y_{n-p}$. In other words, $Y_n$ is “regressed” on the $p$ previous values of the same $Y$-sequence, and hence it
is “autoregressed”.

Definition 5.2:
A sequence $Y = (Y_n : n \ge 0)$ satisfying (5.5) with $(V_n : n \ge 0)$ iid is called a $p$th order autoregressive
sequence.
The autoregressive sequence is said to be Gaussian if the $V_n$'s are Gaussian.

Any $p$th order (scalar) autoregression can be expressed as a first order (vector) autoregression, by following
the same idea as that leading to (5.2). Put
\[
X_n = (Y_{n-p+1}, \ldots, Y_n)^T
\]
and note that
\[
X_{n+1} = F X_n + Z_{n+1} \tag{5.6}
\]
where
\[
F =
\begin{pmatrix}
0 & 1 & 0 & \cdots & 0 \\
0 & 0 & 1 & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & \cdots & 1 \\
a_p & a_{p-1} & a_{p-2} & \cdots & a_1
\end{pmatrix}
\]
\[
Z_{n+1} = (0, 0, \ldots, 0, V_{n+1})^T .
\]
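A minimal sketch that builds $F$ from $(a_1, \ldots, a_p)$ and simulates the recursion (5.6); the Gaussian choice for $V_n$ and the parameter values are illustrative.

```python
import numpy as np

def companion_matrix(a):
    """Companion matrix F for the AR(p) recursion (5.6); a = (a_1, ..., a_p)."""
    p = len(a)
    F = np.zeros((p, p))
    F[:-1, 1:] = np.eye(p - 1)       # super-diagonal of ones
    F[-1, :] = a[::-1]               # last row: a_p, a_{p-1}, ..., a_1
    return F

def simulate_ar(a, mu, sigma, n, rng=np.random.default_rng(0)):
    """Simulate Y_n = a_1 Y_{n-1} + ... + a_p Y_{n-p} + V_n with V_n ~ N(mu, sigma^2)."""
    p = len(a)
    F = companion_matrix(a)
    X = np.zeros(p)                  # state (Y_{n-p+1}, ..., Y_n), started at zero
    Y = np.empty(n)
    for k in range(n):
        Z = np.zeros(p)
        Z[-1] = rng.normal(mu, sigma)            # Z_{n+1} = (0, ..., 0, V_{n+1})
        X = F @ X + Z
        Y[k] = X[-1]
    return Y

Y = simulate_ar(a=np.array([0.5, 0.3]), mu=0.1, sigma=1.0, n=500)
print(Y[:5])
```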

5.10 Stability Properties of the Autoregressive Sequences


If $V_n = c$ in (5.5), the deterministic sequence governed by (5.5) remains bounded if and only if the spectral
radius of $F$ (i.e. the maximum of the moduli of the eigenvalues of $F$) is less than 1. This turns out to be
the right condition to guarantee “stability” of an autoregressive sequence.

Exercise 5.2:
Let $X = (X_n : n \ge 0)$ satisfy (5.6) for $n \ge 0$ with $E\|Z_n\| < \infty$ and $(Z_n : n \ge 0)$ iid.

1. Show that $X_n = F^n X_0 + \sum_{j=0}^{n-1} F^j Z_{n-j}$.

2. Prove that $X_n \overset{D}{=} F^n X_0 + \sum_{j=0}^{n-1} F^j Z_j$.

3. Prove that if the spectral radius of $F$ is less than one, then $X_n \Longrightarrow X_\infty$ as $n \to \infty$, where
\[
X_\infty \overset{D}{=} \sum_{j=0}^\infty F^j Z_j .
\]

4. If the $Z_n$'s are Gaussian with covariance $C$, show that $X_\infty$ is Gaussian with mean $(I - F)^{-1} E Z_1$ and
covariance matrix $\Lambda$ satisfying
\[
\Lambda = F \Lambda F^T + C .
\]

5. Prove that $\Lambda$ can be computed via the recursion
\[
\Lambda_{n+1} = F \Lambda_n F^T + C
\]
for $n \ge 0$, subject to $\Lambda_0 = 0$.
Requiring the eigenvalues of $F$ to have moduli less than one is equivalent to requiring that the $p$ roots
$z_1, \ldots, z_p$ of the degree $p$ polynomial
\[
z^p - \sum_{j=1}^p a_j z^{p-j} \tag{5.7}
\]
all have modulus less than one.
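The two stability criteria can be checked numerically; the following sketch compares the spectral radius of $F$ with the largest root modulus of (5.7) (NumPy; the coefficients are illustrative).

```python
import numpy as np

def spectral_radius(a):
    """Spectral radius of the companion matrix F built from a = (a_1, ..., a_p)."""
    p = len(a)
    F = np.zeros((p, p))
    F[:-1, 1:] = np.eye(p - 1)
    F[-1, :] = a[::-1]
    return np.max(np.abs(np.linalg.eigvals(F)))

def max_root_modulus(a):
    """Largest modulus among the roots of z^p - a_1 z^{p-1} - ... - a_p (eq. (5.7))."""
    coeffs = np.concatenate([[1.0], -np.asarray(a)])
    return np.max(np.abs(np.roots(coeffs)))

a = np.array([0.5, 0.3])
print(spectral_radius(a), max_root_modulus(a))   # both < 1 => stable
```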

5.11 Stationary Version of a Stable Autoregressive Sequence

Suppose that the $p$ roots of (5.7) are all less than one in modulus, and $E\|Z_1\| < \infty$. If we then initialize $X$
at time $-r$ at 0, then
\[
X_k = \sum_{j=0}^{k+r-1} F^j Z_{k-j}
\]
where $(Z_n : n \in \mathbb{Z})$ is a sequence of iid copies of the random variable $Z_1$. To indicate the dependence of $X_k$
on $r$, we write it as $X_{k,-r}$. Observe that as $r \to \infty$,
\[
X_{k,-r} \to X^*_k \quad \text{a.s.}
\]
for each $k \in \mathbb{Z}$, where
\[
X^*_k = \sum_{j=0}^\infty F^j Z_{k-j} \overset{D}{=} X_\infty .
\]
Note that $X^* = (X^*_k : k \in \mathbb{Z})$ satisfies the recursion
\[
X^*_{k+1} = F X^*_k + Z_{k+1}
\]
and is stationary in the sense that $(X^*_{m+k} : k \in \mathbb{Z}) \overset{D}{=} (X^*_k : k \in \mathbb{Z})$.

Definition 5.3:
The sequence X ∗ is said to be the stationary version of X.

We interpret a stationary process as representing a system that was initialized at time −∞ and is in stochastic
equilibrium at every finite t.

5.12 Prediction for Autoregressive Sequences

Suppose that we wish to compute the best mean square predictor of $X_{n+m}$, given the past “history” $X_j$, $j \le n$.
If $E\|Z_1\|^2 < \infty$, this is just
\[
E[X_{n+m} \mid X_j : j \le n] .
\]
This, of course, is equal to
\[
F^m X_n + \sum_{j=0}^{m-1} F^j E Z_1 . \tag{5.8}
\]
Hence, we can use this formula to predict $Y_{n+m}$ based on $Y_n, Y_{n-1}, \ldots, Y_{n-p+1}$ (the entries of $X_n$).
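A sketch of the predictor (5.8), obtained by iterating the affine recursion $m$ times; the AR(2) parameters and the state are made up for illustration.

```python
import numpy as np

def predict_mean(F, X_n, EZ, m):
    """m-step ahead conditional mean: F^m X_n + sum_{j=0}^{m-1} F^j E[Z_1] (eq. (5.8))."""
    pred = X_n.copy()
    for _ in range(m):
        pred = F @ pred + EZ          # iterate X -> F X + E[Z_1] a total of m times
    return pred                       # last entry is the predictor of Y_{n+m}

# Illustration for an AR(2) with a = (0.5, 0.3) and E[V_1] = mu.
a, mu = np.array([0.5, 0.3]), 0.1
F = np.array([[0.0, 1.0], [a[1], a[0]]])
EZ = np.array([0.0, mu])
X_n = np.array([1.2, 0.7])            # state (Y_{n-1}, Y_n)
print(predict_mean(F, X_n, EZ, m=3))
```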

5.13 Parameter Estimation for Gaussian Autoregressive Sequences

In order to use an autoregressive model (in a real-world setting), we must first estimate the parameters from
observed data. If we assume that the $V_n$'s are iid Gaussian with (unknown) mean $\mu^*$ and (unknown) variance
$\sigma^{*2}$, then the $p$th order autoregressive model contains $p+2$ unknown parameters, namely $a^*_1, \ldots, a^*_p$, $\mu^*$ and
$\sigma^{*2}$. (Here $a^*_1, \ldots, a^*_p$ are the “true” autoregressive coefficients.) Here, the “partial likelihood” (referred to
as partial because it is a likelihood that conditions on $Y_0, \ldots, Y_{p-1}$ and does not take full advantage of the
information that may be present in this initialization) based on observing $(Y_j : 0 \le j < n+p)$ is given by
\[
(2\pi\sigma^2)^{-n/2} \exp\!\left( -\frac{1}{2\sigma^2} \sum_{i=0}^{n-1} (Y_{i+p} - a_1 Y_{i+p-1} - \ldots - a_p Y_i - \mu)^2 \right) .
\]
The maximum likelihood estimators $\hat{a}_1, \ldots, \hat{a}_p$, $\hat{\mu}$ and $\hat{\sigma}^2$ solve the linear system
\[
\begin{pmatrix}
\sum_{i=0}^{n-1} Y_{i+p-1}^2 & \sum_{i=0}^{n-1} Y_{i+p-1} Y_{i+p-2} & \cdots & \sum_{i=0}^{n-1} Y_{i+p-1} \\
\sum_{i=0}^{n-1} Y_{i+p-1} Y_{i+p-2} & \sum_{i=0}^{n-1} Y_{i+p-2}^2 & \cdots & \sum_{i=0}^{n-1} Y_{i+p-2} \\
\vdots & \vdots & & \vdots \\
\sum_{i=0}^{n-1} Y_{i+p-1} Y_i & \sum_{i=0}^{n-1} Y_{i+p-2} Y_i & \cdots & \sum_{i=0}^{n-1} Y_i \\
\sum_{i=0}^{n-1} Y_{i+p-1} & \sum_{i=0}^{n-1} Y_{i+p-2} & \cdots & n
\end{pmatrix}
\begin{pmatrix} \hat{a}_1 \\ \hat{a}_2 \\ \vdots \\ \hat{a}_p \\ \hat{\mu} \end{pmatrix}
=
\begin{pmatrix}
\sum_{i=0}^{n-1} Y_{i+p} Y_{i+p-1} \\
\sum_{i=0}^{n-1} Y_{i+p} Y_{i+p-2} \\
\vdots \\
\sum_{i=0}^{n-1} Y_{i+p} Y_i \\
\sum_{i=0}^{n-1} Y_{i+p}
\end{pmatrix}
\tag{5.9}
\]
with
\[
\hat{\sigma}^2 = \frac{1}{n} \sum_{i=0}^{n-1} (Y_{i+p} - \hat{a}_1 Y_{i+p-1} - \ldots - \hat{a}_p Y_i - \hat{\mu})^2 .
\]
As in the setting of conventional regression models, exact confidence regions and hypothesis tests have
been developed in this Gaussian setting. Details can be found in the enormous literature on so-called “time
series” models.
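A sketch of this fit: the matrix and right-hand side of (5.9) are exactly $D^T D$ and $D^T$ times the target vector for the design matrix $D$ whose rows are $(Y_{i+p-1}, \ldots, Y_i, 1)$, so the system can be assembled and solved as follows (NumPy; the simulated data and parameter values are illustrative).

```python
import numpy as np

def fit_ar(Y, p):
    """Least-squares / partial-MLE fit of an AR(p) with intercept, per system (5.9)."""
    Y = np.asarray(Y, dtype=float)
    n = len(Y) - p
    # Row i of D is (Y_{i+p-1}, Y_{i+p-2}, ..., Y_i, 1); the target is Y_{i+p}.
    D = np.column_stack([Y[p - k: p - k + n] for k in range(1, p + 1)] + [np.ones(n)])
    target = Y[p: p + n]
    # D^T D and D^T target are exactly the matrix and right-hand side of (5.9).
    sol = np.linalg.solve(D.T @ D, D.T @ target)
    a_hat, mu_hat = sol[:p], sol[p]
    sigma2_hat = np.mean((target - D @ sol) ** 2)
    return a_hat, mu_hat, sigma2_hat

# Illustration on a simulated AR(2) path (made-up parameters).
rng = np.random.default_rng(0)
Y = np.zeros(1000)
for t in range(2, 1000):
    Y[t] = 0.5 * Y[t - 1] + 0.3 * Y[t - 2] + 0.1 + rng.standard_normal()
print(fit_ar(Y, p=2))
```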

5.14 Parameter Estimation for Autoregressive Sequences with non-Gaussian Residuals
We now turn to the issue of how to deal with an autoregressive sequence Y = (Yn : n ≥ 0) for which

\[
Y_n = a^*_1 Y_{n-1} + \ldots + a^*_p Y_{n-p} + \mu^* + \varepsilon_n
\]
where $(\varepsilon_n : n \ge 0)$ is an iid (possibly non-Gaussian) sequence with $E[\varepsilon_0] = 0$ and $\mathrm{var}(\varepsilon_0) < \infty$.

We first deal with the prediction problem in the presence of known parameters $a^*_1, \ldots, a^*_p$, $\mu^*$ and a known
distribution for the $\varepsilon_i$'s. Conditional on $(X_j : j \le n)$, $X_{n+m}$ has conditional mean
\[
F^m X_n + \sum_{j=0}^{m-1} F^j E Z_1 ;
\]
see (5.8) above. If the $Z_n$'s are Gaussian, the conditional distribution of $X_{n+m}$ is
\[
N\!\left( F^m X_n + \sum_{j=0}^{m-1} F^j E Z_1 , \; \Lambda_m \right) \tag{5.10}
\]
where $\Lambda_m = F \Lambda_{m-1} F^T + E(Z_1 - EZ_1)(Z_1 - EZ_1)^T$ with $\Lambda_0 = 0$. This conditional distribution can be used
to make predictions such as
\[
P(Y_{n+m} > z \mid Y_j : 0 \le j \le n) . \tag{5.11}
\]


If the $Z_n$'s are non-Gaussian, computing (5.11) is non-trivial and must generally be implemented via Monte
Carlo. In particular, to compute the conditional distribution of $X_{n+m}$ (conditional on $X_j$, $j \le n$), we
generate $mr$ independent copies of $Z_1$, call them $Z_{1,1}, \ldots, Z_{r,m}$, and use the Monte Carlo estimator (based on
$r$ independent simulations of the history of $X$ over $[n, n+m]$)
\[
\frac{1}{r} \sum_{i=1}^r I\!\left( F^m X_n + \sum_{j=0}^{m-1} F^j Z_{i,j+1} \in \cdot \right)
\]
to compute
\[
P(X_{n+m} \in \cdot \mid X_j , \, j \le n) .
\]
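A sketch of this Monte Carlo estimator; the innovation distribution (a centered exponential) and the parameter values are hypothetical choices made only for illustration.

```python
import numpy as np

def mc_predict(F, X_n, sample_Z, m, r=10000):
    """Monte Carlo draws of X_{n+m} given X_n: F^m X_n + sum_j F^j Z_{i,j+1}."""
    draws = np.empty((r, len(X_n)))
    for i in range(r):
        X = X_n.copy()
        for j in range(m):
            X = F @ X + sample_Z()        # one simulated step of the recursion
        draws[i] = X
    return draws

# Illustration: AR(2) with non-Gaussian, mean-zero innovations.
rng = np.random.default_rng(0)
a = np.array([0.5, 0.3])
F = np.array([[0.0, 1.0], [a[1], a[0]]])
sample_Z = lambda: np.array([0.0, rng.exponential(1.0) - 1.0])   # V_n = Exp(1) - 1
draws = mc_predict(F, np.array([1.2, 0.7]), sample_Z, m=3)
z = 2.0
print(np.mean(draws[:, -1] > z))          # estimates P(Y_{n+m} > z | history)
```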
We now turn to the question of parameter estimation in the setting of non-Gaussian residuals. Note that
\[
\langle \varepsilon_n , Y_{n-i} \rangle = 0
\]
for $i \ge 1$, so that
\[
\langle Y_n - a^*_1 Y_{n-1} - \ldots - a^*_p Y_{n-p} - \mu^* , \, Y_{n-i} \rangle = 0
\]
for $i \ge 1$. In other words, the “true” parameters $a^*_1, \ldots, a^*_p$ and $\mu^*$ satisfy the linear system
\[
a^*_1 E Y_{n-1} Y_{n-i} + \ldots + a^*_p E Y_{n-p} Y_{n-i} + \mu^* E Y_{n-i} = E Y_n Y_{n-i}
\]
for $i \ge 1$. A square linear system of $p+1$ equations is obtained by taking the first $p+1$ such equations (i.e.
$1 \le i \le p+1$).

Given observations $(Y_j : 0 \le j \le n+p)$, we can estimate $E Y_{l-k} Y_{l-i}$ via the sample average $\frac{1}{n}\sum_{l=p+1}^{p+n} Y_{l-k} Y_{l-i}$, suggesting that we consider
the linear system
\[
\hat{a}_1 \frac{1}{n}\!\sum_{l=p+1}^{p+n}\! Y_{l-1} Y_{l-i} + \ldots + \hat{a}_p \frac{1}{n}\!\sum_{l=p+1}^{p+n}\! Y_{l-p} Y_{l-i} + \hat{\mu} \frac{1}{n}\!\sum_{l=p+1}^{p+n}\! Y_{l-i} = \frac{1}{n}\!\sum_{l=p+1}^{p+n}\! Y_l Y_{l-i} \tag{5.12}
\]
for $1 \le i \le p+1$. (Note the similarity of (5.12) to (5.9). What explains the similarity?)

Exercise 5.3:
Suppose that the roots of (5.7) are all less than one in modulus, and assume that $E\varepsilon_1^4 < \infty$. Prove that
$\hat{a}_i \overset{p}{\longrightarrow} a^*_i$, $1 \le i \le p$, and that $\hat{\mu} \overset{p}{\longrightarrow} \mu^*$ as $n \to \infty$.

To produce confidence regions for $a^*_1, \ldots, a^*_p$ and $\mu^*$, we can apply the bootstrap idea. For $p \le i \le n+p$, let
\[
\hat{\varepsilon}_i = Y_i - \hat{a}_1 Y_{i-1} - \ldots - \hat{a}_p Y_{i-p} - \hat{\mu}
\]
be the $i$th estimated residual. To create a bootstrap sample of the autoregressive sequence, sample $\varepsilon^*_{1,p}, \ldots,
\varepsilon^*_{1,n+p}$ $n+1$ independent times from the set of estimated residuals $\{\hat{\varepsilon}_p, \ldots, \hat{\varepsilon}_{n+p}\}$. For $p \le i \le n+p$, compute
\[
Y^*_{1,i} = \hat{a}_1 Y^*_{1,i-1} + \ldots + \hat{a}_p Y^*_{1,i-p} + \hat{\mu} + \varepsilon^*_{1,i}
\]
subject to $Y^*_{1,j} = Y_j$ for $0 \le j < p$.

From the bootstrapped autoregressive sequence $(Y^*_{1,j} : 0 \le j \le n+p)$, solve the linear system corresponding
to (5.12) for $\hat{a}^*_{1,1}, \ldots, \hat{a}^*_{1,p}, \hat{\mu}^*_1$. If we repeat this bootstrap procedure $m$ independent times, then
\[
\frac{1}{m} \sum_{i=1}^m I(\hat{a}^*_{i,j} - \hat{a}_j \in \cdot)
\]
will (for large $m$ and $n$) be close to
\[
P(\hat{a}_j - a^*_j \in \cdot) ,
\]
from which a large-sample confidence interval for $a^*_j$ can be obtained. In a similar way, we can obtain a
large-sample bootstrap confidence interval for $\mu^*$.
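A sketch of this autoregressive residual bootstrap (NumPy); the compact least-squares fit, the simulated data, and the parameter values are illustrative rather than prescribed by the text.

```python
import numpy as np

def fit_ar(Y, p):
    """Least-squares AR(p) fit with intercept (as in (5.12)/(5.9))."""
    n = len(Y) - p
    D = np.column_stack([Y[p - k: p - k + n] for k in range(1, p + 1)] + [np.ones(n)])
    sol, *_ = np.linalg.lstsq(D, Y[p: p + n], rcond=None)
    return sol[:p], sol[p]                       # (a_hat, mu_hat)

def ar_bootstrap(Y, p, m=1000, rng=np.random.default_rng(0)):
    """Residual bootstrap: resample estimated residuals, regenerate the path, refit."""
    Y = np.asarray(Y, dtype=float)
    a_hat, mu_hat = fit_ar(Y, p)
    n = len(Y) - p
    D = np.column_stack([Y[p - k: p - k + n] for k in range(1, p + 1)] + [np.ones(n)])
    eps_hat = Y[p: p + n] - D @ np.concatenate([a_hat, [mu_hat]])
    boot = np.empty((m, p + 1))
    for b in range(m):
        eps_star = rng.choice(eps_hat, size=n, replace=True)
        Y_star = Y.copy()                        # initialize Y*_j = Y_j for j < p
        for i in range(p, p + n):
            Y_star[i] = a_hat @ Y_star[i - p: i][::-1] + mu_hat + eps_star[i - p]
        a_star, mu_star = fit_ar(Y_star, p)
        boot[b] = np.concatenate([a_star - a_hat, [mu_star - mu_hat]])
    return boot                                  # approximates the law of the estimation errors

rng = np.random.default_rng(1)
Y = np.zeros(600)
for t in range(2, 600):
    Y[t] = 0.5 * Y[t - 1] + 0.3 * Y[t - 2] + 0.1 + rng.standard_normal()
print(np.quantile(ar_bootstrap(Y, p=2), [0.025, 0.975], axis=0))
```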

Exercise 5.4:
Extend the bootstrap procedure to produce prediction regions for Yn+m , based on observing Yj , 0 ≤ j ≤ n,
that take into account parameter uncertainty in estimating a∗1 , ... , a∗p and µ∗ from the observed data.
