
Lecture 1: Stationary Time Series

Introduction

If a random variable X is indexed by time, usually denoted by t, the collection of observations {X_t, t ∈ T} is called a time series, where T is a time index set (for example, T = Z, the set of integers).
Time series data are very common in empirical economic studies. Figure 1 plots some frequently used variables. The upper left panel plots quarterly GDP from 1947 to 2001; the upper right panel plots the residuals after linearly detrending the logarithm of GDP; the lower left panel plots the monthly S&P 500 index from 1990 to 2001; and the lower right panel plots the log difference of the monthly S&P 500 index. As you can see, these four series display quite different patterns over time. Investigating and modeling these different patterns is an important part of this course.
In this course, you will find that many of the techniques (estimation methods, inference procedures, etc.) you have learned in your general econometrics course are still applicable in time series analysis. However, time series data have some special features compared to cross-sectional data. For example, when working with cross-sectional data, it usually makes sense to assume that the observations are independent of each other; time series data, in contrast, are very likely to display some degree of dependence over time. More importantly, for time series data we can observe only one history of the realizations of the variable. For example, suppose you obtain a series of US weekly stock index data for the last 50 years. This sample can be said to be large in terms of sample size; however, it is still one data point, as it is only one of the many possible realizations of the process.

Autocovariance Functions

In modeling a finite number of random variables, a covariance matrix is usually computed to summarize the dependence between these variables. For a time series {X_t}_{t=−∞}^∞, we need to model the dependence over an infinite number of random variables. The autocovariance and autocorrelation functions provide us a tool for this purpose.

Definition 1 (Autocovariance function). The autocovariance function of a time series {X_t} with Var(X_t) < ∞ is defined by

    γ_X(s, t) = Cov(X_s, X_t) = E[(X_s − EX_s)(X_t − EX_t)].

Example 1 (Moving average process). Let ε_t ~ i.i.d.(0, 1), and

    X_t = ε_t + 0.5 ε_{t−1},

Copyright 2002-2006 by Ling Hu.

[Figure 1 appears here. Panels: GDP; Detrended Log(GDP); Monthly S&P 500 Index; Monthly S&P 500 Index Returns; horizontal axes labeled Time.]

Figure 1: Plots of some economic variables

then E(X_t) = 0 and γ_X(s, t) = E(X_s X_t). Let s ≤ t. When s = t,

    γ_X(t, t) = E(X_t^2) = 1.25,

when t = s + 1,

    γ_X(t, t + 1) = E[(ε_t + 0.5 ε_{t−1})(ε_{t+1} + 0.5 ε_t)] = 0.5,

and when t − s > 1,

    γ_X(s, t) = 0.
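As a quick sanity check, the autocovariances in Example 1 can be verified by simulation (a minimal sketch using NumPy; drawing Gaussian shocks is an extra assumption, since the example only requires i.i.d.(0, 1)):

```python
import numpy as np

# Hedged sketch: check gamma(0) = 1.25, gamma(1) = 0.5, gamma(2) = 0
# for X_t = eps_t + 0.5 eps_{t-1} from Example 1 by simulation.
rng = np.random.default_rng(0)
n = 200_000
eps = rng.standard_normal(n + 1)          # i.i.d.(0, 1) shocks (Gaussian assumed)
x = eps[1:] + 0.5 * eps[:-1]              # X_t = eps_t + 0.5 eps_{t-1}

gamma0 = np.mean(x * x)                   # sample Var(X_t), since E(X_t) = 0
gamma1 = np.mean(x[1:] * x[:-1])          # sample Cov(X_t, X_{t+1})
gamma2 = np.mean(x[2:] * x[:-2])          # sample Cov(X_t, X_{t+2})

print(round(gamma0, 2), round(gamma1, 2), round(gamma2, 2))
```

With a long sample the three estimates settle near 1.25, 0.5, and 0, matching the calculation above.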

Stationarity and Strict Stationarity

With autocovariance functions, we can define covariance stationarity, or weak stationarity. In the literature, "stationarity" usually means weak stationarity, unless otherwise specified.
Definition 2 (Stationarity or weak stationarity). The time series {X_t, t ∈ Z} (where Z is the integer set) is said to be stationary if

(I) E(X_t^2) < ∞ for all t ∈ Z;
(II) EX_t = μ for all t ∈ Z;
(III) γ_X(s, t) = γ_X(s + h, t + h) for all s, t, h ∈ Z.

In other words, a stationary time series {X_t} must have three features: finite variance, a constant first moment, and a second moment γ_X(s, t) that depends only on (t − s) and not on s or t separately. In light of the last point, we can rewrite the autocovariance function of a stationary process as

    γ_X(h) = Cov(X_t, X_{t+h})  for t, h ∈ Z.

Also, when X_t is stationary, we must have

    γ_X(h) = γ_X(−h).

When h = 0, γ_X(0) = Cov(X_t, X_t) is the variance of X_t, so the autocorrelation function for a stationary time series {X_t} is defined to be

    ρ_X(h) = γ_X(h) / γ_X(0).

Example 1 (continued): In Example 1, we saw that E(X_t) = 0, E(X_t^2) = 1.25, and the autocovariance function does not depend on s or t. Actually we have γ_X(0) = 1.25, γ_X(1) = 0.5, and γ_X(h) = 0 for h > 1. Therefore, {X_t} is a stationary process.
Example 2 (Random walk). Let S_t be a random walk S_t = Σ_{s=1}^t X_s with S_0 = 0, where X_t is independent and identically distributed with mean zero and variance σ^2. Then for h > 0,

    γ_S(t, t + h) = Cov(S_t, S_{t+h})
                  = Cov(Σ_{i=1}^t X_i, Σ_{j=1}^{t+h} X_j)
                  = Var(Σ_{i=1}^t X_i)    since Cov(X_i, X_j) = 0 for i ≠ j
                  = t σ^2.

In this case, the autocovariance function depends on time t; therefore the random walk process S_t
is not stationary.
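The linear growth Var(S_t) = t σ^2 can be seen by simulating many independent paths and computing the variance across paths at each date (a sketch assuming Gaussian steps; the argument only needs i.i.d. mean-zero steps):

```python
import numpy as np

# Hedged sketch: Var(S_t) grows like t * sigma^2 for the random walk in
# Example 2, estimated across many simulated paths (Gaussian steps assumed).
rng = np.random.default_rng(1)
n_paths, T, sigma = 20_000, 100, 1.0
steps = sigma * rng.standard_normal((n_paths, T))
S = np.cumsum(steps, axis=1)              # S_t = X_1 + ... + X_t, one row per path

var_at = lambda t: S[:, t - 1].var()      # cross-path variance at time t
print(round(var_at(10), 1), round(var_at(100), 1))  # near 10 and 100
```

The variance at t = 100 is roughly ten times the variance at t = 10, which is exactly the time dependence that rules out stationarity.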
Example 3 (Process with linear trend). Let ε_t ~ i.i.d.(0, σ^2) and

    X_t = δt + ε_t.

Then E(X_t) = δt, which depends on t; therefore a process with a linear trend is not stationary.
Among stationary processes, there is a simple type of process that is widely used in constructing more complicated processes.

Example 4 (White noise). The time series ε_t is said to be a white noise with mean zero and variance σ^2, written as

    ε_t ~ WN(0, σ^2),

if and only if ε_t has zero mean and covariance function

    γ(h) = σ^2  if h = 0,   γ(h) = 0  if h ≠ 0.

It is clear that a white noise process is stationary. Note that the white noise assumption is weaker than the assumption of an independently and identically distributed sequence.

To tell whether a process is covariance stationary, we compute the unconditional first two moments; therefore, processes with conditional heteroskedasticity may still be stationary.
Example 5 (ARCH model). Let X_t = ε_t with E(ε_t) = 0, E(ε_t^2) = σ^2 > 0, and E(ε_t ε_s) = 0 for t ≠ s. Assume the following process for ε_t^2:

    ε_t^2 = c + α ε_{t−1}^2 + u_t,

where 0 < α < 1 and u_t ~ WN(0, 1).

In this example, the conditional variance of X_t is time varying, as

    E_{t−1}(X_t^2) = E_{t−1}(ε_t^2) = E_{t−1}(c + α ε_{t−1}^2 + u_t) = c + α ε_{t−1}^2.

However, the unconditional variance of X_t is constant, σ^2 = c/(1 − α). Therefore, this process is still stationary.

Definition 3 (Strict stationarity). The time series {X_t, t ∈ Z} is said to be strictly stationary if the joint distribution of (X_{t_1}, X_{t_2}, ..., X_{t_k}) is the same as that of (X_{t_1+h}, X_{t_2+h}, ..., X_{t_k+h}) for all h and all (t_1, ..., t_k).

In other words, strict stationarity means that the joint distribution depends only on the differences between the time indices, not on the times (t_1, ..., t_k) themselves.

Remarks: First note that finite variance is not assumed in the definition of strict stationarity; therefore, strict stationarity does not necessarily imply weak stationarity. For example, an i.i.d. Cauchy process is strictly stationary but not weakly stationary. Second, a nonlinear function of a strictly stationary variable is still strictly stationary, but this is not true for weak stationarity. For example, the square of a covariance stationary process may not have finite variance. Finally, weak

[Figure 2 appears here. Panels: S&P 500 index in year 1999; S&P 500 returns in year 1999; S&P 500 index in year 2001; S&P 500 returns in year 2001.]

Figure 2: Plots of S&P index and returns in year 1999 and 2001

stationarity usually does not imply strict stationarity, as higher moments of the process may depend on time t. However, if the process {X_t} is a Gaussian time series, which means that the distribution functions of {X_t} are all multivariate Gaussian, i.e., the joint density of

    f_{X_t, X_{t+j_1}, ..., X_{t+j_k}}(x_t, x_{t+j_1}, ..., x_{t+j_k})

is Gaussian for any j_1, j_2, ..., j_k, then weak stationarity also implies strict stationarity. This is because a multivariate Gaussian distribution is fully characterized by its first two moments.

For example, a white noise is stationary but may not be strictly stationary, but a Gaussian white noise is strictly stationary. Also, general white noise only implies uncorrelatedness, while Gaussian white noise also implies independence, because for a Gaussian process uncorrelatedness implies independence. Therefore, a Gaussian white noise is just i.i.d. N(0, σ^2).
Stationary and nonstationary processes are very different in their properties, and they require different inference procedures. We will discuss this in much detail throughout this course. At this point, note that a simple and useful method for judging whether a process is stationary in empirical studies is to plot the data. Loosely speaking, if a series does not seem to have a constant mean or variance, then very likely it is not stationary. For example, Figure 2 plots the daily S&P 500 index in the years 1999 and 2001. The upper left panel plots the index in 1999, the upper right panel plots the returns in 1999, the lower left panel plots the index in 2001, and the lower right panel plots the returns in 2001.

Note that the index levels are very different in 1999 and 2001. In 1999, the index wanders at a higher level and the market rises. In 2001, the level is much lower and the market drops. In comparison, we do not see much difference between the returns in 1999 and 2001 (although the returns in 2001 seem to have thicker tails). Actually, judging only from the return data, it is very hard to tell which figure plots the market in a boom and which figure plots the market in a crash. Therefore, people usually treat stock price data as nonstationary and stock return data as stationary.

Ergodicity

Recall that Kolmogorov's law of large numbers (LLN) tells us that if X_i ~ i.i.d.(μ, σ^2) for i = 1, ..., n, then we have the following limit for the ensemble average:

    X̄_n = n^{−1} Σ_{i=1}^n X_i → μ.

In time series, we have the time series average, not the ensemble average. To explain the difference between the ensemble average and the time series average, consider the following experiment. Suppose we want to track the movements of some particles and draw inference about their expected position (suppose that these particles move on the real line). If we have a group of particles (group size n), then we could track the position of each particle and plot a distribution of their positions. The mean of this sample is called the ensemble average. If all these particles are i.i.d., the LLN tells us that this average converges to its expectation as n → ∞. However, as we remarked earlier, with time series observations we only have one history. That means that in this experiment we only have one particle. Then, instead of collecting n particles, we can only track this single particle and record its position, say x_t, for t = 1, 2, ..., T. The mean we compute by averaging over time, T^{−1} Σ_{t=1}^T x_t, is called the time series average.

Does the time series average converge to the same limit as the ensemble average? The answer is yes if X_t is stationary and ergodic. If X_t is stationary and ergodic with E(X_t) = μ, then the time series average has the same limit as the ensemble average:

    X̄_T = T^{−1} Σ_{t=1}^T X_t → μ.

This result is known as the ergodic theorem, and we will discuss it later in Lecture 4 on asymptotic theory. Note that this result requires both stationarity and ergodicity. We have explained stationarity, and we have seen that stationarity allows time series dependence. Ergodicity requires average asymptotic independence. Note that stationarity itself does not guarantee ergodicity (page 47 in Hamilton and Lecture 4).
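A minimal simulation of the ergodic theorem (a sketch assuming an AR(1) with Gaussian shocks, which is stationary and ergodic for |φ| < 1): the average over one long history approaches the ensemble mean.

```python
import numpy as np

# Hedged sketch: for a stationary, ergodic AR(1) x_t = phi*x_{t-1} + eps_t
# with E(x_t) = 0, the time average over ONE path converges to the mean 0.
rng = np.random.default_rng(2)
phi, T = 0.9, 500_000
eps = rng.standard_normal(T)

x = np.empty(T)
x[0] = eps[0]
for t in range(1, T):
    x[t] = phi * x[t - 1] + eps[t]        # a single history of the process

time_avg = x.mean()                        # time series average
print(round(time_avg, 2))                  # close to the ensemble mean 0
```

Even though the observations are strongly dependent (φ = 0.9), the single-path average settles near zero; only the convergence rate is slower than in the i.i.d. case.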
Readings:
Hamilton, Ch. 3.1
Brockwell and Davis, Page 1-29
Hayashi, Page 97-102

Lecture 2: ARMA Models

ARMA Process

As we have remarked, dependence is very common in time series observations. To model this time series dependence, we start with univariate ARMA models. To motivate the model, we can basically follow two lines of thinking. First, for a series x_t, we can model the level of its current observations as depending on the level of its lagged observations. For example, if we observe a high GDP realization this quarter, we would expect the GDP in the next few quarters to be good as well. This way of thinking can be represented by an AR model. The AR(1) (autoregressive of order one) model can be written as

    x_t = φ x_{t−1} + ε_t,

where ε_t ~ WN(0, σ^2), an assumption we keep throughout this lecture. Similarly, the AR(p) (autoregressive of order p) model can be written as

    x_t = φ_1 x_{t−1} + φ_2 x_{t−2} + ... + φ_p x_{t−p} + ε_t.

In the second line of thinking, we can model the observations of a random variable at time t as affected not only by the shock at time t, but also by the shocks that took place before time t. For example, if we observe a negative shock to the economy, say a catastrophic earthquake, then we would expect this negative effect to affect the economy not only at the time it takes place, but also in the near future. This kind of thinking can be represented by an MA model. The MA(1) (moving average of order one) and MA(q) (moving average of order q) models can be written as

    x_t = ε_t + θ ε_{t−1}

and

    x_t = ε_t + θ_1 ε_{t−1} + ... + θ_q ε_{t−q}.

If we combine these two models, we get the general ARMA(p, q) model,

    x_t = φ_1 x_{t−1} + φ_2 x_{t−2} + ... + φ_p x_{t−p} + ε_t + θ_1 ε_{t−1} + ... + θ_q ε_{t−q}.

The ARMA model provides one of the basic tools in time series modeling. In the next few sections, we will discuss how to draw inferences using a univariate ARMA model.


Lag Operators

Lag operators enable us to present an ARMA model much more concisely. Applying the lag operator (denoted L) once moves the time index back one unit, and applying it k times moves the index back k units:

    L x_t = x_{t−1}
    L^2 x_t = x_{t−2}
    ...
    L^k x_t = x_{t−k}

The lag operator is distributive over the addition operator, i.e.,

    L(x_t + y_t) = x_{t−1} + y_{t−1}.

Using lag operators, we can rewrite the ARMA models as:

    AR(1): (1 − φL) x_t = ε_t
    AR(p): (1 − φ_1 L − φ_2 L^2 − ... − φ_p L^p) x_t = ε_t
    MA(1): x_t = (1 + θL) ε_t
    MA(q): x_t = (1 + θ_1 L + θ_2 L^2 + ... + θ_q L^q) ε_t
Let φ_0 = 1, θ_0 = 1, and define the lag polynomials

    φ(L) = 1 − φ_1 L − φ_2 L^2 − ... − φ_p L^p
    θ(L) = 1 + θ_1 L + θ_2 L^2 + ... + θ_q L^q

With lag polynomials, we can rewrite an ARMA process in a more compact way:

    AR: φ(L) x_t = ε_t
    MA: x_t = θ(L) ε_t
    ARMA: φ(L) x_t = θ(L) ε_t

Invertibility

Given a time series probability model, we can usually find multiple ways to represent it. Which representation to choose depends on our problem. For example, to study impulse-response functions (Section 4), MA representations may be more convenient, while to estimate an ARMA model, AR representations may be more convenient, as usually x_t is observable while ε_t is not. However, not all ARMA processes can be inverted. In this section, we consider under what conditions we can invert an AR model to an MA model and invert an MA model to an AR model. It turns out that invertibility, which means that the process can be inverted, is an important property of the model.
If we let 1 denote the identity operator, i.e., 1·y_t = y_t, then the inverse operator (1 − φL)^{−1} is defined to be the operator such that

    (1 − φL)^{−1} (1 − φL) = 1.

For the AR(1) process, if we premultiply both sides of the equation by (1 − φL)^{−1}, we get

    x_t = (1 − φL)^{−1} ε_t.

Is there any explicit way to rewrite (1 − φL)^{−1}? Yes, and the answer turns out to be ψ(L) with ψ_k = φ^k for |φ| < 1. To show this,

    (1 − φL) ψ(L) = (1 − φL)(1 + ψ_1 L + ψ_2 L^2 + ...)
                  = (1 − φL)(1 + φL + φ^2 L^2 + ...)
                  = 1 − φL + φL − φ^2 L^2 + φ^2 L^2 − φ^3 L^3 + ...
                  = 1 − lim_{k→∞} φ^k L^k
                  = 1    for |φ| < 1.

We can also verify this result by recursive substitution:

    x_t = φ x_{t−1} + ε_t
        = φ^2 x_{t−2} + ε_t + φ ε_{t−1}
        ...
        = φ^k x_{t−k} + ε_t + φ ε_{t−1} + ... + φ^{k−1} ε_{t−k+1}
        = φ^k x_{t−k} + Σ_{j=0}^{k−1} φ^j ε_{t−j}.

With |φ| < 1, we have lim_{k→∞} φ^k x_{t−k} = 0, so again we get the moving average representation with MA coefficients equal to φ^k. So the condition |φ| < 1 enables us to invert an AR(1) process to an MA(∞) process:

    AR(1): (1 − φL) x_t = ε_t
    MA(∞): x_t = ψ(L) ε_t    with ψ_k = φ^k.
We have obtained some nice results in inverting an AR(1) process to an MA(∞) process. Then how do we invert a general AR(p) process? We need to factorize the lag polynomial and then make use of the result that (1 − λL)^{−1} = λ(L), where λ(L) denotes 1 + λL + λ^2 L^2 + .... For example, let p = 2; we have

    (1 − φ_1 L − φ_2 L^2) x_t = ε_t.    (1)

To factorize this polynomial, we need to find roots λ_1 and λ_2 such that

    (1 − φ_1 L − φ_2 L^2) = (1 − λ_1 L)(1 − λ_2 L).

Given that both |λ_1| < 1 and |λ_2| < 1 (or, when they are complex numbers, they lie within the unit circle; keep this in mind, as I may not mention it again in the remainder of the lecture), we can write

    (1 − λ_1 L)^{−1} = λ_1(L)
    (1 − λ_2 L)^{−1} = λ_2(L)

and so, to invert (1), we have

    x_t = (1 − λ_1 L)^{−1} (1 − λ_2 L)^{−1} ε_t = λ_1(L) λ_2(L) ε_t.

Solving λ_1(L) λ_2(L) is straightforward:

    λ_1(L) λ_2(L) = (1 + λ_1 L + λ_1^2 L^2 + ...)(1 + λ_2 L + λ_2^2 L^2 + ...)
                  = 1 + (λ_1 + λ_2) L + (λ_1^2 + λ_1 λ_2 + λ_2^2) L^2 + ...
                  = Σ_{k=0}^∞ (Σ_{j=0}^k λ_1^j λ_2^{k−j}) L^k
                  = ψ(L), say,

with ψ_k = Σ_{j=0}^k λ_1^j λ_2^{k−j}. Similarly, we can invert the general AR(p) process given that all roots λ_i have absolute value less than one. An alternative way to represent this MA process (to express ψ) is to make use of partial fractions. Let c_1, c_2 be two constants whose values are determined by

    1 / [(1 − λ_1 L)(1 − λ_2 L)] = c_1/(1 − λ_1 L) + c_2/(1 − λ_2 L)
                                 = [c_1(1 − λ_2 L) + c_2(1 − λ_1 L)] / [(1 − λ_1 L)(1 − λ_2 L)].

We must have

    1 = c_1(1 − λ_2 L) + c_2(1 − λ_1 L) = (c_1 + c_2) − (c_1 λ_2 + c_2 λ_1) L,

which gives

    c_1 + c_2 = 1   and   c_1 λ_2 + c_2 λ_1 = 0.

Solving these two equations, we get

    c_1 = λ_1/(λ_1 − λ_2),   c_2 = λ_2/(λ_2 − λ_1).

Then we can express x_t as

    x_t = [(1 − λ_1 L)(1 − λ_2 L)]^{−1} ε_t
        = c_1 (1 − λ_1 L)^{−1} ε_t + c_2 (1 − λ_2 L)^{−1} ε_t
        = c_1 Σ_{k=0}^∞ λ_1^k ε_{t−k} + c_2 Σ_{k=0}^∞ λ_2^k ε_{t−k}
        = Σ_{k=0}^∞ ψ_k ε_{t−k},

where ψ_k = c_1 λ_1^k + c_2 λ_2^k.

Similarly, an MA process,

    x_t = θ(L) ε_t,

is invertible if θ(L)^{−1} exists. An MA(1) process is invertible if |θ| < 1, and an MA(q) process is invertible if all roots of

    1 + θ_1 z + θ_2 z^2 + ... + θ_q z^q = 0

lie outside the unit circle. Note that for any invertible MA process, we can find a noninvertible MA process which is the same as the invertible process up to the second moment. The converse is also true. We will give an example in Section 5.
Finally, given an invertible ARMA(p, q) process,

    φ(L) x_t = θ(L) ε_t
    x_t = φ(L)^{−1} θ(L) ε_t
    x_t = ψ(L) ε_t,

what is the series ψ_k? Note that since

    φ(L)^{−1} θ(L) ε_t = ψ(L) ε_t,

we have θ(L) = φ(L) ψ(L). So the elements of ψ can be computed recursively by equating the coefficients of L^k.

Example 1. For an ARMA(1, 1) process, we have

    1 + θL = (1 − φL)(ψ_0 + ψ_1 L + ψ_2 L^2 + ...)
           = ψ_0 + (ψ_1 − φψ_0) L + (ψ_2 − φψ_1) L^2 + ....

Matching coefficients on L^k, we get

    1 = ψ_0
    θ = ψ_1 − φψ_0
    0 = ψ_j − φψ_{j−1}   for j ≥ 2.

Solving these equations, we easily get

    ψ_0 = 1,   ψ_j = φ^{j−1}(φ + θ)   for j ≥ 1.
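The coefficient-matching recursion can be checked against the closed form (a minimal sketch with illustrative parameter values φ = θ = 0.5):

```python
import numpy as np

# Hedged sketch: recover psi from theta(L) = phi(L)*psi(L) for ARMA(1,1)
# by the recursion above, then compare with psi_j = phi**(j-1)*(phi+theta).
phi, theta, n = 0.5, 0.5, 8

psi = np.empty(n)
psi[0] = 1.0
psi[1] = theta + phi * psi[0]             # from theta = psi_1 - phi*psi_0
for j in range(2, n):
    psi[j] = phi * psi[j - 1]             # from 0 = psi_j - phi*psi_{j-1}

closed_form = np.array([1.0] + [phi**(j - 1) * (phi + theta) for j in range(1, n)])
print(np.allclose(psi, closed_form))      # True
```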

Impulse-Response Functions

Given an ARMA model, φ(L) x_t = θ(L) ε_t, it is natural to ask: what is the effect on x_t of a unit shock at time s (for s < t)?

4.1 MA process

For an MA(1) process,

    x_t = ε_t + θ ε_{t−1},

the effects of ε on x are:

    ε: 0 1 0 0 0 ...
    x: 0 1 θ 0 0 ...

For an MA(q) process,

    x_t = ε_t + θ_1 ε_{t−1} + θ_2 ε_{t−2} + ... + θ_q ε_{t−q},

the effects of ε on x are:

    ε: 0 1 0   0   ... 0   0
    x: 0 1 θ_1 θ_2 ... θ_q 0

The left panel in Figure 1 plots the impulse-response function of an MA(3) process. Similarly, we can write down the effects for an MA(∞) process. As you can see, we can read the impulse-response function immediately from an MA process.

4.2 AR process

For an AR(1) process x_t = φ x_{t−1} + ε_t with |φ| < 1, we can invert it to an MA process, and the effects of ε on x are:

    ε: 0 1 0 0   ...
    x: 0 1 φ φ^2 ...

As can be seen from the above, the impulse-response dynamics are quite clear from an MA representation. For example, let t > s > 0; given a one-unit increase in ε_s, the effect on x_t is φ^{t−s}, if there are no other shocks. If there are shocks that take place at times other than s and have nonzero effects on x_t, then we can add these effects, since this is a linear model.

The dynamics are a bit more complicated for a higher-order AR process, but after applying our old trick of inverting it to an MA process, the analysis is straightforward. Take an AR(2) process as an example.
Example 2

    x_t = 0.6 x_{t−1} + 0.2 x_{t−2} + ε_t

or

    (1 − 0.6L − 0.2L^2) x_t = ε_t.

We first solve the polynomial¹

    y^2 + 3y − 5 = 0

and get the two roots y_1 = 1.1926 and y_2 = −4.1926. Recall that λ_1 = 1/y_1 = 0.84 and λ_2 = 1/y_2 = −0.24. So we can factorize the lag polynomial as

    (1 − 0.6L − 0.2L^2) x_t = (1 − 0.84L)(1 + 0.24L) x_t
    x_t = (1 − 0.84L)^{−1} (1 + 0.24L)^{−1} ε_t = ψ(L) ε_t,

where ψ_k = Σ_{j=0}^k λ_1^j λ_2^{k−j}. In this example, the series ψ is {1, 0.6, 0.5616, 0.4579, 0.3880, ...}. So the effects of ε on x can be described as:

    ε: 0 1 0   0      0      ...
    x: 0 1 0.6 0.5616 0.4579 ...

¹Recall that the roots of the polynomial ay^2 + by + c = 0 are (−b ± sqrt(b^2 − 4ac))/(2a).

The right panel in Figure 1 plots this impulse-response function. So after we invert an AR(p) process to an MA process, given t > s > 0, the effect of a one-unit increase in ε_s on x_t is just ψ_{t−s}.

We can see that for a linear process, AR or ARMA, if we can represent it as an MA process, we find the impulse-response dynamics immediately. In fact, the MA representation is the
same thing as the impulse-response function.
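The same ψ series can be generated directly from the AR recursion ψ_k = φ_1 ψ_{k−1} + φ_2 ψ_{k−2} (a minimal sketch; note that the exact coefficients φ_1 = 0.6, φ_2 = 0.2 give ψ_2 = 0.56, slightly different from the 0.5616 obtained above from the rounded roots 0.84 and −0.24):

```python
import numpy as np

# Hedged sketch: impulse-response of the AR(2) in Example 2 via the recursion
# psi_k = phi1*psi_{k-1} + phi2*psi_{k-2}, with psi at negative indices = 0.
phi1, phi2, n = 0.6, 0.2, 6
psi = np.zeros(n)
psi[0] = 1.0
for k in range(1, n):
    psi[k] = phi1 * psi[k - 1] + (phi2 * psi[k - 2] if k >= 2 else 0.0)

print(np.round(psi, 4))  # 1, 0.6, 0.56, 0.456, 0.3856, 0.3226
```

This recursion is how impulse responses of AR models are usually computed in practice, since it avoids root finding altogether.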

[Figure 1 appears here: two response paths plotted over time.]

Figure 1: The impulse-response functions of an MA(3) process (θ_1 = 0.6, θ_2 = −0.5, θ_3 = 0.4) and an AR(2) process (φ_1 = 0.6, φ_2 = 0.2), with a unit shock at time zero

Autocovariance Functions and Stationarity of ARMA models

5.1 MA(1)

    x_t = ε_t + θ ε_{t−1},

where ε_t ~ WN(0, σ^2).
It is easy to calculate the first two moments of x_t:

    E(x_t) = E(ε_t + θ ε_{t−1}) = 0
    E(x_t^2) = (1 + θ^2) σ^2

and

    γ_x(t, t + h) = E[(ε_t + θ ε_{t−1})(ε_{t+h} + θ ε_{t+h−1})]
                  = θ σ^2  for h = 1,   and 0 for h > 1.

So, for an MA(1) process, we have a fixed mean and a covariance function that does not depend on time t: γ(0) = (1 + θ^2) σ^2, γ(1) = θ σ^2, and γ(h) = 0 for h > 1. So we know MA(1) is stationary for any finite value of θ.

The autocorrelation can be computed as ρ_x(h) = γ_x(h)/γ_x(0), so

    ρ_x(0) = 1,   ρ_x(1) = θ/(1 + θ^2),   ρ_x(h) = 0  for h > 1.

We proposed in the section on invertibility that for an invertible (noninvertible) MA process, there always exists a noninvertible (invertible) process which is the same as the original process up to the second moment. We use the following MA(1) process as an example.

Example 3. The process

    x_t = ε_t + θ ε_{t−1},   ε_t ~ WN(0, σ^2),   |θ| > 1,

is noninvertible. Consider an invertible MA process defined as

    x̃_t = ε̃_t + (1/θ) ε̃_{t−1},   ε̃_t ~ WN(0, σ^2 θ^2).

Then we can compute that E(x_t) = E(x̃_t) = 0, E(x_t^2) = E(x̃_t^2) = (1 + θ^2) σ^2, γ_x(1) = γ_x̃(1) = θ σ^2, and γ_x(h) = γ_x̃(h) = 0 for h > 1. Therefore, these two processes are equivalent up to the second moments. To be more concrete, we plug in some numbers.

Let θ = 2; then the process

    x_t = ε_t + 2 ε_{t−1},   ε_t ~ WN(0, 1)

is noninvertible. Consider the invertible process

    x̃_t = ε̃_t + (1/2) ε̃_{t−1},   ε̃_t ~ WN(0, 4).

Note that E(x_t) = E(x̃_t) = 0, E(x_t^2) = E(x̃_t^2) = 5, γ_x(1) = γ_x̃(1) = 2, and γ_x(h) = γ_x̃(h) = 0 for h > 1.

Although these two representations, noninvertible MA and invertible MA, generate the same process up to the second moment, we prefer the invertible representation in practice, because if we can invert an MA process to an AR process, we can recover the value of ε_t (non-observable) from all past values of x (observable). If a process is noninvertible, then in order to recover the value of ε_t we would have to know all future values of x.
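The equivalence of the two representations up to second moments can be checked directly from the MA(1) formulas γ(0) = (1 + θ^2)σ^2 and γ(1) = θσ^2 (a minimal sketch; `ma1_moments` is a hypothetical helper name):

```python
# Hedged sketch: second moments of the two MA(1) representations in Example 3,
# computed from gamma(0) = (1 + theta^2)*sigma^2 and gamma(1) = theta*sigma^2.
def ma1_moments(theta: float, sigma2: float) -> tuple[float, float]:
    gamma0 = (1 + theta**2) * sigma2      # variance
    gamma1 = theta * sigma2               # first autocovariance
    return gamma0, gamma1

noninvertible = ma1_moments(theta=2.0, sigma2=1.0)   # x_t = e_t + 2 e_{t-1}
invertible = ma1_moments(theta=0.5, sigma2=4.0)      # x~_t = e~_t + 0.5 e~_{t-1}
print(noninvertible, invertible)                     # both (5.0, 2.0)
```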

5.2 MA(q)

    x_t = θ(L) ε_t = Σ_{k=0}^q θ_k L^k ε_t

The first two moments are:

    E(x_t) = 0
    E(x_t^2) = σ^2 Σ_{k=0}^q θ_k^2

and

    γ_x(h) = σ^2 Σ_{k=0}^{q−h} θ_k θ_{k+h}   for h = 1, 2, ..., q,
    γ_x(h) = 0   for h > q.

Again, an MA(q) process is stationary for any finite values of θ_1, ..., θ_q.

5.3 MA(∞)

    x_t = ψ(L) ε_t = Σ_{k=0}^∞ ψ_k L^k ε_t

Before we compute moments and discuss the stationarity of x_t, we should first make sure that {x_t} converges.

Proposition 1. If {ε_t} is a sequence of white noise with σ^2 < ∞, and if Σ_{k=0}^∞ ψ_k^2 < ∞, then the series

    x_t = ψ(L) ε_t = Σ_{k=0}^∞ ψ_k ε_{t−k}

converges in mean square.

Proof (see Appendix 3.A in Hamilton): Recall the Cauchy criterion: a sequence {y_n} converges in mean square if and only if ||y_n − y_m|| → 0 as n, m → ∞. In this problem, for n > m > 0, we want to show that

    E[Σ_{k=1}^n ψ_k ε_{t−k} − Σ_{k=1}^m ψ_k ε_{t−k}]^2 = σ^2 Σ_{k=m+1}^n ψ_k^2
                                                       = σ^2 (Σ_{k=0}^n ψ_k^2 − Σ_{k=0}^m ψ_k^2) → 0   as m, n → ∞.

The result holds since {ψ_k} is square summable. It is often more convenient to work with the slightly stronger condition of absolute summability:

    Σ_{k=0}^∞ |ψ_k| < ∞.
It is easy to show that absolute summability implies square summability. An MA(∞) process with absolutely summable coefficients is stationary with moments:

    E(x_t) = 0
    E(x_t^2) = σ^2 Σ_{k=0}^∞ ψ_k^2
    γ_x(h) = σ^2 Σ_{k=0}^∞ ψ_k ψ_{k+h}

5.4 AR(1)

    (1 − φL) x_t = ε_t    (2)

Recall that an AR(1) process with |φ| < 1 can be inverted to an MA(∞) process

    x_t = ψ(L) ε_t   with ψ_k = φ^k.

With |φ| < 1, it is easy to check that absolute summability holds:

    Σ_{k=0}^∞ |ψ_k| = Σ_{k=0}^∞ |φ|^k < ∞.

Using the results for MA(∞), the moments for x_t in (2) can be computed:

    E(x_t) = 0
    E(x_t^2) = Σ_{k=0}^∞ φ^{2k} σ^2 = σ^2/(1 − φ^2)
    γ_x(h) = Σ_{k=0}^∞ φ^{2k+h} σ^2 = φ^h σ^2/(1 − φ^2)

So, an AR(1) process with |φ| < 1 is stationary.
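The AR(1) autocovariance formula γ(h) = φ^h σ^2/(1 − φ^2) can be checked by simulation (a sketch assuming Gaussian white noise; any white noise would do):

```python
import numpy as np

# Hedged sketch: compare sample autocovariances of a simulated AR(1) with
# the formula gamma(h) = phi^h * sigma^2 / (1 - phi^2).
rng = np.random.default_rng(3)
phi, sigma2, T = 0.5, 1.0, 400_000
eps = np.sqrt(sigma2) * rng.standard_normal(T)

x = np.empty(T)
x[0] = eps[0]
for t in range(1, T):
    x[t] = phi * x[t - 1] + eps[t]

theory = lambda h: phi**h * sigma2 / (1 - phi**2)
sample = lambda h: np.mean(x[h:] * x[:T - h])
print(round(sample(0), 2), round(theory(0), 2))   # both near 4/3
print(round(sample(1), 2), round(theory(1), 2))   # both near 2/3
```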

5.5 AR(p)

Recall that an AR(p) process

    (1 − φ_1 L − φ_2 L^2 − ... − φ_p L^p) x_t = ε_t

can be inverted to an MA process x_t = ψ(L) ε_t if all λ_i in

    (1 − φ_1 L − φ_2 L^2 − ... − φ_p L^p) = (1 − λ_1 L)(1 − λ_2 L) ... (1 − λ_p L)    (3)

have absolute value less than one. It also turns out that with |λ_i| < 1, the absolute summability Σ_{k=0}^∞ |ψ_k| < ∞ is also satisfied. (The proof can be found on page 770 of Hamilton; it uses the result that ψ_k = c_1 λ_1^k + c_2 λ_2^k.)

When we solve the polynomial

    (L − y_1)(L − y_2) ... (L − y_p) = 0,    (4)

the requirement that |λ_i| < 1 is equivalent to all roots in (4) lying outside the unit circle, i.e., |y_i| > 1 for all i.

First calculate the expectation of x_t: E(x_t) = 0. To compute the second moments, one method is to invert the process into an MA process and use the formula for the autocovariance function of an MA(∞). This method requires finding the moving average coefficients ψ; an alternative method, known as the Yule-Walker method, may be more convenient for finding the autocovariance functions. To illustrate this method, take an AR(2) process as an example:

    x_t = φ_1 x_{t−1} + φ_2 x_{t−2} + ε_t.

Multiply both sides of the equation by x_t, x_{t−1}, x_{t−2}, ..., take expectations, and then divide by γ(0); we get the following equations:

    1 = φ_1 ρ(1) + φ_2 ρ(2) + σ^2/γ(0)
    ρ(1) = φ_1 + φ_2 ρ(1)
    ρ(2) = φ_1 ρ(1) + φ_2
    ρ(k) = φ_1 ρ(k − 1) + φ_2 ρ(k − 2)   for k ≥ 2.

ρ(1) can first be solved from the second equation: ρ(1) = φ_1/(1 − φ_2); ρ(2) can then be solved from the third equation; ρ(k) can be solved recursively using ρ(1) and ρ(2); and finally, γ(0) can be solved from the first equation. Using γ(0) and ρ(k), γ(k) can be computed as γ(k) = ρ(k) γ(0). Figure 2 plots this autocorrelation for k = 0, ..., 50, with the parameters set to φ_1 = 0.5 and φ_2 = 0.3. As is clear from the graph, the autocorrelation is very close to zero when k > 40.
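The Yule-Walker recursion just described can be run directly (a minimal sketch for the parameter values used in Figure 2):

```python
# Hedged sketch: Yule-Walker recursion for the AR(2) autocorrelations with
# phi1 = 0.5, phi2 = 0.3, as plotted in Figure 2.
phi1, phi2 = 0.5, 0.3

rho = [1.0, phi1 / (1 - phi2)]            # rho(0); rho(1) from the 2nd equation
rho.append(phi1 * rho[1] + phi2)          # rho(2) from the 3rd equation
for k in range(3, 51):
    rho.append(phi1 * rho[k - 1] + phi2 * rho[k - 2])

print(round(rho[1], 4), round(rho[2], 4), round(rho[40], 4))
```

The values ρ(1) ≈ 0.7143 and ρ(2) ≈ 0.6571 decay geometrically, and ρ(40) is already near zero, consistent with the figure.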
[Figure 2 appears here: rho(k) declining from 1 toward zero as k runs from 0 to 50.]

Figure 2: Plot of the autocorrelation of an AR(2) process, with φ_1 = 0.5 and φ_2 = 0.3

5.6 ARMA(p, q)

Given an invertible ARMA(p, q) process, we have shown that

    φ(L) x_t = θ(L) ε_t;

inverting φ(L), we obtain

    x_t = φ(L)^{−1} θ(L) ε_t = ψ(L) ε_t.

Therefore, an ARMA(p, q) process is stationary as long as φ(L) is invertible. In other words, the stationarity of the ARMA process depends only on the autoregressive parameters, and not on the moving average parameters (assuming that all parameters are finite).

The expectation of this process is E(x_t) = 0. To find the autocovariance function, we can first invert it to an MA process and find the MA coefficients ψ(L) = φ(L)^{−1} θ(L). We have shown an example of finding ψ for the ARMA(1, 1) process, where we have

    (1 − φL) x_t = (1 + θL) ε_t
    x_t = ψ(L) ε_t = Σ_{j=0}^∞ ψ_j ε_{t−j},

where ψ_0 = 1 and ψ_j = φ^{j−1}(φ + θ) for j ≥ 1.
Now, using the autocovariance functions for the MA(∞) process, we have

    γ_x(0) = σ^2 Σ_{k=0}^∞ ψ_k^2
           = σ^2 [1 + (φ + θ)^2 Σ_{k=1}^∞ φ^{2(k−1)}]
           = σ^2 [1 + (φ + θ)^2/(1 − φ^2)].

If we plug in some numbers, say φ = 0.5 and θ = 0.5, so the original process is x_t = 0.5 x_{t−1} + ε_t + 0.5 ε_{t−1}, then γ_x(0) = (7/3) σ^2. For h ≥ 1,

    γ_x(h) = σ^2 Σ_{k=0}^∞ ψ_k ψ_{k+h}
           = σ^2 [φ^{h−1}(φ + θ) + (φ + θ)^2 φ^h Σ_{k=1}^∞ φ^{2(k−1)}]
           = σ^2 φ^{h−1} [(φ + θ) + (φ + θ)^2 φ/(1 − φ^2)].

Plugging in φ = θ = 0.5, we have, for h ≥ 1,

    γ_x(h) = (5/3)(1/2)^{h−1} σ^2.

An alternative way to compute the autocovariance function is to multiply each side of φ(L) x_t = θ(L) ε_t by x_t, x_{t−1}, ..., and take expectations. In our ARMA(1, 1) example, this gives

    γ_x(0) − φ γ_x(1) = [1 + θ(φ + θ)] σ^2
    γ_x(1) − φ γ_x(0) = θ σ^2
    γ_x(2) − φ γ_x(1) = 0
    ...
    γ_x(h) − φ γ_x(h − 1) = 0   for h ≥ 2,

where we use x_t = ψ(L) ε_t in taking expectations on the right side; for instance, E(x_t ε_t) = E[(ε_t + ψ_1 ε_{t−1} + ...) ε_t] = σ^2. Plugging in φ = θ = 0.5 and solving these equations, we have γ_x(0) = (7/3) σ^2, γ_x(1) = (5/3) σ^2, and γ_x(h) = γ_x(h − 1)/2 for h ≥ 2. These are the same results as we got
using the first method.
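Both calculations can be confirmed numerically by truncating the ψ sums (a minimal sketch for φ = θ = 0.5 and σ^2 = 1):

```python
# Hedged sketch: gamma(0) and gamma(1) for the ARMA(1,1) with phi = theta = 0.5
# and sigma^2 = 1, summed numerically over the psi weights.
phi, theta = 0.5, 0.5
psi = [1.0] + [phi**(j - 1) * (phi + theta) for j in range(1, 200)]

gamma0 = sum(p * p for p in psi)                          # sum of psi_k^2
gamma1 = sum(psi[k] * psi[k + 1] for k in range(len(psi) - 1))
print(round(gamma0, 4), round(gamma1, 4))                 # 2.3333 and 1.6667
```

The truncated sums reproduce γ(0) = 7/3 and γ(1) = 5/3 to many decimal places, since the ψ weights decay geometrically.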
Summary: An MA process is stationary if and only if its coefficients {ψ_k} are square summable (absolutely summable), i.e., Σ_{k=0}^∞ ψ_k^2 < ∞ or Σ_{k=0}^∞ |ψ_k| < ∞. Therefore, an MA process with a finite number of MA coefficients is always stationary. Note that stationarity does not require the MA process to be invertible. An AR process is stationary if it is invertible, i.e., |λ_i| < 1 or |y_i| > 1, as defined in (3) and (4) respectively. An ARMA(p, q) process is stationary if its autoregressive lag polynomial is invertible.

5.7 Autocovariance generating function of stationary ARMA processes

For a covariance stationary process, we have seen that the autocovariance function is very useful in describing the process. One way to summarize an absolutely summable autocovariance function (Σ_{h=−∞}^∞ |γ(h)| < ∞) is to use the autocovariance-generating function

    g_x(z) = Σ_{h=−∞}^∞ γ(h) z^h,

where z could be a complex number.

For white noise, the autocovariance-generating function (AGF) is just a constant: for ε ~ WN(0, σ^2), g_ε(z) = σ^2.

For an MA(1) process,

    x_t = (1 + θL) ε_t,   ε ~ WN(0, σ^2),

we can compute

    g_x(z) = σ^2 [θ z^{−1} + (1 + θ^2) + θ z] = σ^2 (1 + θz)(1 + θz^{−1}).

For an MA(q) process,

    x_t = (1 + θ_1 L + ... + θ_q L^q) ε_t,

we know that γ_x(h) = σ^2 Σ_{k=0}^{q−h} θ_k θ_{k+h} for h = 1, ..., q and γ_x(h) = 0 for h > q. We have

    g_x(z) = Σ_{h=−∞}^∞ γ(h) z^h = σ^2 (Σ_{k=0}^q θ_k z^k)(Σ_{k=0}^q θ_k z^{−k}).

For an MA(∞) process x_t = ψ(L) ε_t with Σ_{k=0}^∞ |ψ_k| < ∞, we can naturally replace q by ∞ in the AGF for MA(q) to get the AGF for MA(∞):

    g_x(z) = σ^2 (Σ_{k=0}^∞ ψ_k z^k)(Σ_{k=0}^∞ ψ_k z^{−k}) = σ^2 ψ(z) ψ(z^{−1}).

Next, for a stationary AR or ARMA process, we can invert it to an MA process. For instance, for an AR(1) process (1 − φL) x_t = ε_t, invert it to

    x_t = ψ(L) ε_t,   where ψ_k = φ^k,

and its AGF is

    g_x(z) = σ^2 (Σ_{k=0}^∞ ψ_k z^k)(Σ_{k=0}^∞ ψ_k z^{−k}) = σ^2 ψ(z) ψ(z^{−1}),

which equals

    g_x(z) = σ^2 / [(1 − φz)(1 − φz^{−1})].

In general, the AGF for an ARMA(p, q) process is

    g_x(z) = σ^2 (1 + θ_1 z + ... + θ_q z^q)(1 + θ_1 z^{−1} + ... + θ_q z^{−q})
             / [(1 − φ_1 z − ... − φ_p z^p)(1 − φ_1 z^{−1} − ... − φ_p z^{−p})]
           = σ^2 θ(z) θ(z^{−1}) / [φ(z) φ(z^{−1})].

Simulated ARMA processes

In this section, we plot a few simulated ARMA processes. In the simulations, the errors are Gaussian white noise, i.i.d. N(0, 1). As a comparison, we first plot a Gaussian white noise (or AR(1) with φ = 0) in Figure 3. Then we plot AR(1) processes with φ = 0.4 and φ = 0.9 in Figure 4 and Figure 5. As you can see, the white noise process is very choppy and patternless. When φ = 0.4, the series becomes a bit smoother, and when φ = 0.9, the departures from the mean (zero) are very prolonged. Figure 6 plots an AR(2) process whose coefficients are set to the numbers in our example in this lecture. Finally, Figure 7 plots an MA(3) process. Comparing this MA(3) process with the white noise, we can see an increase in volatility (the variance of the white noise is 1 and the variance of the MA(3) process is 1.77).

[Figures 3-7 appear here: simulated series of length 200 each.]

Figure 3: A Gaussian white noise time series

Figure 4: A simulated AR(1) process, with φ = 0.4

Figure 5: A simulated AR(1) process, with φ = 0.9

Figure 6: A simulated AR(2) process, with φ_1 = 0.6, φ_2 = 0.2

Figure 7: A simulated MA(3) process, with θ_1 = 0.6, θ_2 = −0.5, and θ_3 = 0.4

Forecasting of ARMA Models

7.1 Principles of forecasting

If we are interested in forecasting a random variable y_{t+h} based on the observations of x up to time t (denoted by X), we can have different candidate forecasts, denoted by g(X). If our criterion for picking the best forecast is to minimize the mean squared error (MSE), then the best forecast is the conditional expectation, g(X) = E(y_{t+h} | X). The proof can be found on page 73 in Hamilton. In the following discussion, we assume that the data generating process is known (so the parameters are known), so we can compute the conditional moments.

7.2 AR models

Let's start from an AR(1) process:

$$x_t = \phi x_{t-1} + \epsilon_t$$

where we continue to assume that $\epsilon_t$ is white noise with mean zero and variance $\sigma^2$. Then we can compute

$$E_t(x_{t+1}) = E_t(\phi x_t + \epsilon_{t+1}) = \phi x_t$$
$$E_t(x_{t+2}) = E_t(\phi^2 x_t + \phi\epsilon_{t+1} + \epsilon_{t+2}) = \phi^2 x_t$$
$$\ldots = \ldots$$
$$E_t(x_{t+k}) = E_t(\phi^k x_t + \phi^{k-1}\epsilon_{t+1} + \ldots + \epsilon_{t+k}) = \phi^k x_t$$

and the variance

$$\mathrm{Var}_t(x_{t+1}) = \mathrm{Var}_t(\phi x_t + \epsilon_{t+1}) = \sigma^2$$
$$\mathrm{Var}_t(x_{t+2}) = \mathrm{Var}_t(\phi^2 x_t + \phi\epsilon_{t+1} + \epsilon_{t+2}) = (1 + \phi^2)\sigma^2$$
$$\ldots = \ldots$$
$$\mathrm{Var}_t(x_{t+k}) = \mathrm{Var}_t(\phi^k x_t + \phi^{k-1}\epsilon_{t+1} + \ldots + \epsilon_{t+k}) = \sum_{j=0}^{k-1}\phi^{2j}\sigma^2$$
Note that as $k \to \infty$,

$$E_t(x_{t+k}) \to 0,$$

which is the unconditional expectation of $x_t$, and

$$\mathrm{Var}_t(x_{t+k}) \to \sigma^2/(1 - \phi^2),$$

which is the unconditional variance of $x_t$. Similarly, for an AR(p) process, we can forecast recursively.
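The recursions above translate directly into code. The following sketch is our own (the parameter values are illustrative); it computes the $k$-step-ahead conditional mean and variance for an AR(1) and checks the limits just derived.

```python
def ar1_forecast(x_t, phi, sigma2, k):
    """k-step-ahead conditional mean and variance of x_{t+k} given x_t,
    for x_t = phi * x_{t-1} + eps_t with Var(eps_t) = sigma2."""
    mean = phi**k * x_t
    var = sigma2 * sum(phi**(2 * j) for j in range(k))
    return mean, var

phi, sigma2, x_t = 0.9, 1.0, 2.0   # illustrative values

m1, v1 = ar1_forecast(x_t, phi, sigma2, 1)
assert m1 == phi * x_t and v1 == sigma2

# as k grows, the forecast reverts to the unconditional moments
m, v = ar1_forecast(x_t, phi, sigma2, 100)
assert abs(m) < 1e-4                           # -> 0, the unconditional mean
assert abs(v - sigma2 / (1 - phi**2)) < 1e-3   # -> sigma^2 / (1 - phi^2)
```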

7.3 MA Models

For an MA(1) process,

$$x_t = \epsilon_t + \theta\epsilon_{t-1},$$

if we know $\epsilon_t$, then

$$E_t(x_{t+1}) = E_t(\epsilon_{t+1} + \theta\epsilon_t) = \theta\epsilon_t$$
$$E_t(x_{t+2}) = E_t(\epsilon_{t+2} + \theta\epsilon_{t+1}) = 0$$
$$\ldots = \ldots$$
$$E_t(x_{t+k}) = E_t(\epsilon_{t+k} + \theta\epsilon_{t+k-1}) = 0$$

and

$$\mathrm{Var}_t(x_{t+1}) = \mathrm{Var}_t(\epsilon_{t+1} + \theta\epsilon_t) = \sigma^2$$
$$\mathrm{Var}_t(x_{t+2}) = \mathrm{Var}_t(\epsilon_{t+2} + \theta\epsilon_{t+1}) = (1 + \theta^2)\sigma^2$$
$$\ldots = \ldots$$
$$\mathrm{Var}_t(x_{t+k}) = \mathrm{Var}_t(\epsilon_{t+k} + \theta\epsilon_{t+k-1}) = (1 + \theta^2)\sigma^2$$

It is easy to see that for an MA(1) process, the conditional expectation two or more steps ahead is the same as the unconditional expectation, and likewise for the variance. Next, for an MA(q) model,

$$x_t = \epsilon_t + \theta_1\epsilon_{t-1} + \theta_2\epsilon_{t-2} + \ldots + \theta_q\epsilon_{t-q} = \sum_{j=0}^{q}\theta_j\epsilon_{t-j}, \quad \theta_0 = 1,$$

if we know $\epsilon_t, \epsilon_{t-1}, \ldots, \epsilon_{t-q}$, then

$$E_t(x_{t+1}) = E_t\Big(\sum_{j=0}^{q}\theta_j\epsilon_{t+1-j}\Big) = \sum_{j=1}^{q}\theta_j\epsilon_{t+1-j}$$
$$E_t(x_{t+2}) = E_t\Big(\sum_{j=0}^{q}\theta_j\epsilon_{t+2-j}\Big) = \sum_{j=2}^{q}\theta_j\epsilon_{t+2-j}$$
$$\ldots = \ldots$$
$$E_t(x_{t+k}) = E_t\Big(\sum_{j=0}^{q}\theta_j\epsilon_{t+k-j}\Big) = \sum_{j=k}^{q}\theta_j\epsilon_{t+k-j} \quad \text{for } k \le q$$
$$E_t(x_{t+k}) = E_t\Big(\sum_{j=0}^{q}\theta_j\epsilon_{t+k-j}\Big) = 0 \quad \text{for } k > q$$

and

$$\mathrm{Var}_t(x_{t+1}) = \mathrm{Var}_t\Big(\sum_{j=0}^{q}\theta_j\epsilon_{t+1-j}\Big) = \sigma^2$$
$$\mathrm{Var}_t(x_{t+2}) = \mathrm{Var}_t\Big(\sum_{j=0}^{q}\theta_j\epsilon_{t+2-j}\Big) = (1 + \theta_1^2)\sigma^2$$
$$\ldots = \ldots$$
$$\mathrm{Var}_t(x_{t+k}) = \mathrm{Var}_t\Big(\sum_{j=0}^{q}\theta_j\epsilon_{t+k-j}\Big) = \sum_{j=0}^{k-1}\theta_j^2\,\sigma^2 \quad \forall\, k > 0,$$

with the convention that $\theta_j = 0$ for $j > q$. We can see that for an MA(q) process, the conditional expectation and variance of the forecast $q + 1$ or more steps ahead are the same as the unconditional expectation and variance.
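The MA(q) formulas can be sketched in code as follows (our own illustration; the shock values are hypothetical), with $\theta_j = 0$ for $j > q$ built in:

```python
import numpy as np

def ma_forecast(theta, eps_hist, sigma2, k):
    """k-step-ahead conditional mean and variance for an MA(q).
    theta = (theta_0, ..., theta_q) with theta_0 = 1;
    eps_hist = (eps_t, eps_{t-1}, ..., eps_{t-q})."""
    q = len(theta) - 1
    # mean: sum_{j=k}^{q} theta_j * eps_{t+k-j}; the sum is empty once k > q
    mean = sum(theta[j] * eps_hist[j - k] for j in range(k, q + 1))
    # variance: sigma^2 times the squared coefficients on the k unknown shocks
    var = sigma2 * float(np.sum(np.asarray(theta)[:min(k, q + 1)] ** 2))
    return mean, var

theta = [1.0, 0.6, -0.5, 0.4]       # the MA(3) from the simulation section
eps_hist = [0.2, -1.1, 0.5, 0.3]    # hypothetical observed shocks
_, v1 = ma_forecast(theta, eps_hist, 1.0, 1)
assert np.isclose(v1, 1.0)          # one step ahead: only eps_{t+1} is unknown

m4, v4 = ma_forecast(theta, eps_hist, 1.0, 4)
assert m4 == 0                      # k > q: the unconditional mean
assert np.isclose(v4, 1.77)         # the unconditional variance
```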

Wold Decomposition

So far we have focused on ARMA models, which are linear time series models. Is there any relationship between a general covariance stationary process (maybe nonlinear) and linear representations? The answer is given by the Wold decomposition theorem:
Proposition 2 (Wold Decomposition) Any zero-mean covariance stationary process $x_t$ can be represented in the form

$$x_t = \sum_{j=0}^{\infty} \psi_j \epsilon_{t-j} + V_t$$

where

(i) $\psi_0 = 1$ and $\sum_{j=0}^{\infty}\psi_j^2 < \infty$

(ii) $\epsilon_t \sim WN(0, \sigma^2)$

(iii) $E(\epsilon_t V_s) = 0$ $\forall\, s, t > 0$

(iv) $\epsilon_t$ is the error in forecasting $x_t$ on the basis of a linear function of lagged $x$:

$$\epsilon_t = x_t - \hat{E}(x_t \,|\, x_{t-1}, x_{t-2}, \ldots)$$

(v) $V_t$ is a deterministic process and it can be predicted from a linear function of lagged $x$.
Remarks: The Wold decomposition says that any covariance stationary process has a linear representation: a linear deterministic component ($V_t$) and a linearly indeterministic component ($\sum_j \psi_j\epsilon_{t-j}$). If $V_t = 0$, then the process is said to be purely non-deterministic, and the process can be represented as an MA($\infty$) process. Basically, $\epsilon_t$ is the error from the projection of $x_t$ on lagged $x$; therefore it is uniquely determined, and it is orthogonal to lagged $x$ and lagged $\epsilon$. Since this error is the residual from the projection, it need not be the true error in the DGP of $x_t$. Also note that the error term ($\epsilon_t$) is a white noise process, but it does not need to be i.i.d.
Readings:
Hamilton Ch. 1-4
Brockwell and Davis Ch. 3
Hayashi Ch 6.1, 6.2

Lecture 3: Spectral Analysis

Any covariance stationary process has both a time domain representation and a frequency (spectrum) domain representation. So far, our analysis has been in the time domain, as we represent a time series $\{x_t\}$ in terms of past values of innovations and investigate the dependence of $x$ at distinct times. In some cases, a frequency-domain representation is more convenient for describing a process. To transform a time-domain representation into a frequency-domain representation, we use the Fourier transform.

Fourier Transforms

Let $\omega$ denote the frequency ($-\pi < \omega < \pi$), and let $T$ denote the period: the minimum time that it takes the wave to go through a whole cycle, so that $T = 2\pi/\omega$. Given any integer $z$, we have $x(t) = x(t + zT)$. Finally, the phase is the amount by which a wave is shifted.

Given a time series $\{x_t\}$, its Fourier transform is:

$$x(\omega) = \frac{1}{2\pi}\sum_{t=-\infty}^{\infty} e^{-it\omega} x(t) \quad (1)$$

and the inverse Fourier transform is:

$$x(t) = \int_{-\pi}^{\pi} e^{it\omega} x(\omega)\, d\omega \quad (2)$$

Spectrum

Recall that the autocovariance function for a zero-mean stationary process $\{x_t\}$ is defined as:

$$\gamma_x(h) = E(x_t x_{t-h})$$

and it serves to characterize the time series $\{x_t\}$. The spectrum of $\{x_t\}$ is defined to be the Fourier transform of $\gamma_x(h)$,

$$S_x(\omega) = \frac{1}{2\pi}\sum_{h=-\infty}^{\infty} e^{-ih\omega}\gamma_x(h) \quad (3)$$

Recall that the autocovariance generating function is $g_x(z) = \sum_{h=-\infty}^{\infty}\gamma_x(h) z^h$; if we let $z = e^{-i\omega}$, then the spectrum is just the autocovariance generating function divided by $2\pi$. In (3), if we take $\omega = 0$, we see that

$$\sum_{h=-\infty}^{\infty}\gamma_x(h) = 2\pi S_x(0),$$

Copyright 2002-2006 by Ling Hu.

which tells us that the sum of the autocovariances equals the spectrum at zero multiplied by $2\pi$. Using the identity

$$e^{i\theta} = \cos\theta + i\sin\theta,$$

we can also write (3) as

$$S_x(\omega) = \frac{1}{2\pi}\Big[\gamma_x(0) + 2\sum_{h=1}^{\infty}\gamma_x(h)\cos(h\omega)\Big] \quad (4)$$

Note that since $\cos(\omega) = \cos(-\omega)$ and $\gamma_x(h) = \gamma_x(-h)$, the spectrum is symmetric about zero. Also, the cosine function is periodic with period $2\pi$; therefore, for spectral analysis, we only need to find the spectrum for $\omega \in [0, \pi]$. Now if we know $\gamma_x(h)$, we can compute its spectrum using (4), and if we know the spectrum $S_x(\omega)$, we can compute $\gamma_x(h)$ using the inverse Fourier transform:

$$\gamma_x(h) = \int_{-\pi}^{\pi} e^{i\omega h} S_x(\omega)\, d\omega \quad (5)$$

Let $h = 0$; then (5) gives the variance of $\{x_t\}$:

$$\gamma_x(0) = \int_{-\pi}^{\pi} S_x(\omega)\, d\omega.$$
So the variance of $\{x_t\}$ is just the integral of the spectrum over all frequencies $-\pi < \omega < \pi$. Therefore we can see that the spectrum function $S_x(\omega)$ decomposes the variance into components contributed by each frequency. In other words, we can use the spectrum to find the importance of cycles of different frequencies.
If we normalize the spectrum $S_x(\omega)$ by dividing by $\gamma_x(0)$, we get the Fourier transform of the autocorrelation function $\rho_x(h)$,

$$f(\omega) = \frac{1}{2\pi}\sum_{h=-\infty}^{\infty} e^{-ih\omega}\rho_x(h) \quad (6)$$

The autocorrelation function can be generated from $f(\omega)$ using the inverse transform

$$\rho_x(h) = \int_{-\pi}^{\pi} e^{i\omega h} f_x(\omega)\, d\omega \quad (7)$$

Again, let $h = 0$; then (7) gives

$$1 = \int_{-\pi}^{\pi} f_x(\omega)\, d\omega$$

Note that $f(\omega)$ is positive and integrates to one, just like a probability density, so we call it the spectral density.
Example 1 (spectral density of white noise) Let $\epsilon \sim WN(0, \sigma^2)$. We have $\gamma_\epsilon(0) = \sigma^2$ and $\gamma_\epsilon(h) = 0$ for $h \ne 0$. Using (3) and (6), we can compute

$$S_\epsilon(\omega) = \frac{1}{2\pi}\gamma_\epsilon(0) = \frac{\sigma^2}{2\pi}.$$

Dividing by $\gamma_\epsilon(0)$, we have

$$f_\epsilon(\omega) = \frac{1}{2\pi}.$$

So the spectral density is uniform over $[-\pi, \pi]$, i.e., every frequency makes an equal contribution to the variance.

Spectrum of Filtered Process

Considering that the spectrum of a white noise process is so simple, we may want to make use of it for a more complicated process, say,

$$x_t = \sum_{k=-\infty}^{\infty}\psi_k\epsilon_{t-k} = \psi(L)\epsilon_t.$$

We call this process a two-sided moving average process. Then what is the relationship between $S_x(\omega)$ and $S_\epsilon(\omega)$? The general solution is given in the following statement.

Proposition 1 If $\{x_t\}$ is a zero mean stationary process with spectrum function $S_x(\omega)$, and $\{y_t\}$ is the process

$$y_t = \sum_{k=-\infty}^{\infty}\psi_k x_{t-k} = \psi(L)x_t$$

where $\psi$ is absolutely summable, then

$$S_y(\omega) = \Big|\sum_{k=-\infty}^{\infty}\psi_k e^{-ik\omega}\Big|^2 S_x(\omega) = \big|\psi(e^{-i\omega})\big|^2 S_x(\omega).$$

Proof: We start from the autocovariance function of $y$,

$$\gamma_y(h) = E(y_t y_{t-h}) = E\Big(\sum_{j=-\infty}^{\infty}\psi_j x_{t-j}\sum_{k=-\infty}^{\infty}\psi_k x_{t-h-k}\Big) = \sum_{j,k=-\infty}^{\infty}\psi_j\psi_k E(x_{t-j}x_{t-h-k}) = \sum_{j,k=-\infty}^{\infty}\psi_j\psi_k\gamma_x(h + k - j)$$

Next, consider the spectrum of $y$,

$$S_y(\omega) = \frac{1}{2\pi}\sum_{h=-\infty}^{\infty} e^{-ih\omega}\gamma_y(h) = \frac{1}{2\pi}\sum_{h=-\infty}^{\infty} e^{-ih\omega}\sum_{j,k=-\infty}^{\infty}\psi_j\psi_k\gamma_x(h + k - j)$$

(Let $l = h + k - j$ and note that $S_x(\omega) = \frac{1}{2\pi}\sum_{l=-\infty}^{\infty} e^{-il\omega}\gamma_x(l)$, so we want to construct such a term and see what remains.)

$$S_y(\omega) = \sum_{j=-\infty}^{\infty} e^{-ij\omega}\psi_j\sum_{k=-\infty}^{\infty} e^{ik\omega}\psi_k \cdot \frac{1}{2\pi}\sum_{l=-\infty}^{\infty} e^{-il\omega}\gamma_x(l) = \psi(e^{-i\omega})\psi(e^{i\omega})S_x(\omega) = \big|\psi(e^{-i\omega})\big|^2 S_x(\omega)$$

Example 2 To apply this result, first consider the problem of computing the spectrum of an MA(1) process,

$$x_t = \epsilon_t + \theta\epsilon_{t-1} = (1 + \theta L)\epsilon_t.$$

In this problem,

$$\psi(e^{-i\omega}) = 1 + \theta e^{-i\omega},$$

thus

$$\big|\psi(e^{-i\omega})\big|^2 = (1 + \theta e^{-i\omega})(1 + \theta e^{i\omega}) = 1 + \theta^2 + \theta(e^{-i\omega} + e^{i\omega})$$

Therefore,

$$S_x(\omega) = \big|\psi(e^{-i\omega})\big|^2 S_\epsilon(\omega) = \frac{\sigma^2}{2\pi}\big[1 + \theta^2 + \theta(e^{-i\omega} + e^{i\omega})\big]$$

We can verify this result by using the spectrum to compute the autocovariance function, say $\gamma_x(1)$, using (5):

$$\gamma_x(1) = \int_{-\pi}^{\pi} e^{i\omega} S_x(\omega)\, d\omega = \frac{\sigma^2}{2\pi}\int_{-\pi}^{\pi} e^{i\omega}\big[1 + \theta^2 + \theta(e^{-i\omega} + e^{i\omega})\big]\, d\omega = \frac{\sigma^2}{2\pi}\,\theta\cdot 2\pi = \theta\sigma^2,$$

which is the same as what we got from working in the time domain. In the computation we use the fact that $\int_{-\pi}^{\pi} e^{i\omega k}\, d\omega = 0$ for any nonzero integer $k$, as the integral of sine or cosine functions all the way around a circle is zero.
Figure 1 plots the spectrum of MA(1) processes with positive and negative coefficients. When $\theta > 0$, we see that the spectrum is high at low frequencies and low at high frequencies. When $\theta < 0$, we observe the opposite. This is because when $\theta$ is positive, we have positive one-lag correlation, which makes the series smooth, with only a small contribution from high frequency (say, day to day) components. When $\theta$ is negative, we have negative one-lag correlation, and therefore the series fluctuates rapidly about its mean value.
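This MA(1) spectrum can also be checked numerically. The sketch below is our own: it evaluates $S_x(\omega)$ on a uniform grid over one full period and recovers $\gamma(0) = (1 + \theta^2)\sigma^2$ and $\gamma(1) = \theta\sigma^2$ by numerical integration.

```python
import numpy as np

theta, sigma2 = 0.5, 1.0
N = 4096
w = np.linspace(-np.pi, np.pi, N, endpoint=False)  # one full period
dw = 2 * np.pi / N

# MA(1) spectrum: S_x(w) = (sigma^2 / 2pi) * (1 + theta^2 + 2 theta cos w)
S = sigma2 / (2 * np.pi) * (1 + theta**2 + 2 * theta * np.cos(w))

# gamma(h) = integral of e^{iwh} S_x(w) dw over (-pi, pi]
gamma0 = np.sum(S) * dw
gamma1 = np.sum(np.cos(w) * S) * dw  # the sine part integrates to zero

assert abs(gamma0 - (1 + theta**2) * sigma2) < 1e-9  # gamma(0) = 1.25
assert abs(gamma1 - theta * sigma2) < 1e-9           # gamma(1) = 0.5
```

The uniform-grid sum is exact here (up to rounding) because the spectrum is a trigonometric polynomial integrated over a full period.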
Above we have considered moving average processes; the next proposition gives the result for ARMA models with white noise errors:

Figure 1: Plots of the spectrum of MA(1) processes ($\theta = 0.5$ for the left figure and $\theta = -0.5$ for the right figure)

Proposition 2 Let $\{x_t\}$ be an ARMA(p, q) process satisfying

$$\phi(L)x_t = \theta(L)\epsilon_t$$

where $\epsilon \sim WN(0, \sigma^2)$ and all roots of $\phi(L)$ lie outside the unit circle. Then the spectrum of $x_t$ is:

$$S_x(\omega) = \frac{|\theta(e^{-i\omega})|^2}{|\phi(e^{-i\omega})|^2} S_\epsilon(\omega) = \frac{1}{2\pi}\frac{|\theta(e^{-i\omega})|^2}{|\phi(e^{-i\omega})|^2}\sigma^2$$
Example 3 Consider an AR(1) process,

$$x_t = \phi x_{t-1} + \epsilon_t.$$

Using the above proposition,

$$S_x(\omega) = \frac{\sigma^2}{2\pi}\frac{1}{|1 - \phi e^{-i\omega}|^2} = \frac{\sigma^2}{2\pi}\frac{1}{1 + \phi^2 - 2\phi\cos\omega} \quad (8)$$
Figure 2 plots the spectrum of AR(1) processes with positive and negative coefficients. We have similar observations here as for the MA processes. However, note that as $\phi \to 1$, $S_x(0) \to \infty$, which means that a random walk process has an infinite spectrum at frequency zero. This parallels what happens when we work with summation and differencing. When we add up a white noise (say, $\phi = 1$ as in a random walk), the high frequencies are smoothed out (those spikes in the white noise disappear) and what is left is the long term stochastic trend. On the contrary, when we difference (say, first-difference a random walk, which takes us back to the white noise series), we get rid of the long term trend, and what is left is the high frequencies (lots of spikes with mean zero, say).
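As a numerical check of equation (8), the sketch below (our own) integrates the AR(1) spectrum over $(-\pi, \pi]$, which should recover the unconditional variance $\sigma^2/(1-\phi^2)$, and illustrates the blow-up of $S_x(0)$ as $\phi \to 1$.

```python
import numpy as np

phi, sigma2 = 0.5, 1.0
N = 4096
w = np.linspace(-np.pi, np.pi, N, endpoint=False)
dw = 2 * np.pi / N

# equation (8): S_x(w) = (sigma^2 / 2pi) / (1 + phi^2 - 2 phi cos w)
S = sigma2 / (2 * np.pi) / (1 + phi**2 - 2 * phi * np.cos(w))

# integrating the spectrum over all frequencies gives gamma(0)
gamma0 = np.sum(S) * dw
assert abs(gamma0 - sigma2 / (1 - phi**2)) < 1e-6

# at frequency zero, S_x(0) = (sigma^2 / 2pi) / (1 - phi)^2 diverges as phi -> 1
spec_at_zero = lambda p: sigma2 / (2 * np.pi) / (1 - p)**2
assert spec_at_zero(0.99) > spec_at_zero(0.9) > spec_at_zero(0.5)
```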
Finally, we introduce a spectral representation theorem without proof. For a zero-mean stationary process with absolutely summable autocovariances, one can define random variables $\alpha(\omega)$ and $\delta(\omega)$ such that we can represent the series in the form

$$x_t = \int_0^{\pi}\big[\alpha(\omega)\cos(\omega t) + \delta(\omega)\sin(\omega t)\big]\, d\omega,$$

Figure 2: Plots of the spectrum of AR(1) processes ($\phi = 0.5$ for the left figure and $\phi = -0.5$ for the right figure)

where $\alpha(\omega)$ and $\delta(\omega)$ have zero mean and are mutually and serially uncorrelated. The representation theorem tells us that a stationary process with absolutely summable autocovariances can be written as a weighted sum of periodic functions.

Cross Spectrum and Spectrum of a Sum

The spectrum is an autocovariance generating function, and we can use it to compute the autocovariances of a stationary process. Besides computing the autocovariances of a single time series, a spectrum function can also capture the covariance across two time series. We call such spectrum functions cross spectra.

For a single time series $\{x_t\}$, the spectrum function is the Fourier transform of the autocovariance function $\gamma_x(h) = E(x_t x_{t-h})$. Similarly, for two time series $\{x_t\}$ and $\{y_t\}$, the cross spectrum is the Fourier transform of the covariance function of $x_t$ and $y_{t-h}$, i.e.
$$S_{xy}(\omega) = \sum_{h=-\infty}^{\infty} e^{-ih\omega} E(x_t y_{t-h})$$

In general,

$$S_{xy}(\omega) \ne S_{yx}(\omega) = \sum_{h=-\infty}^{\infty} e^{-ih\omega} E(y_t x_{t-h}).$$

But they have the following relationship:

$$S_{xy}(\omega) = \overline{S_{yx}(\omega)} = S_{yx}(-\omega),$$

which is easy to verify:

$$S_{xy}(\omega) = \sum_{h=-\infty}^{\infty} e^{-ih\omega} E(x_t y_{t-h}) = \sum_{h=-\infty}^{\infty} e^{-ih\omega} E(y_t x_{t+h})$$
$$= \sum_{k=-\infty}^{\infty} e^{ik\omega} E(y_t x_{t-k}) \quad (\text{let } k = -h)$$
$$= \sum_{k=-\infty}^{\infty} e^{-i(-\omega)k} E(y_t x_{t-k}) = S_{yx}(-\omega)$$
Note that if $x_t$ and $y_s$ are uncorrelated for all $t, s$, then $E(x_t y_{t-h}) = 0$ for all $h$; therefore $S_{xy}(\omega) = S_{yx}(\omega) = 0$. Knowing the cross spectrum, we can next compute the spectrum of a sum. For a process $z_t = x_t + y_t$, the spectrum of $z_t$ can be computed as follows:

$$S_z(\omega) = \sum_{h=-\infty}^{\infty} e^{-ih\omega} E(z_t z_{t-h}) = \sum_{h=-\infty}^{\infty} e^{-ih\omega} E[(x_t + y_t)(x_{t-h} + y_{t-h})]$$
$$= \sum_{h=-\infty}^{\infty} e^{-ih\omega}\big[E(x_t x_{t-h}) + E(x_t y_{t-h}) + E(y_t x_{t-h}) + E(y_t y_{t-h})\big]$$
$$= S_x(\omega) + S_{xy}(\omega) + S_{yx}(\omega) + S_y(\omega)$$

We have noted before that for a time series $z_t$, the spectrum decomposes its variation into components contributed by each frequency $\omega$. Here we see another form of decomposition: we can decompose the variation in $z$ into different sources. In particular, if $x_t$ and $y_s$ are uncorrelated for all $t, s$, i.e., $S_{xy}(\omega) = S_{yx}(\omega) = 0$, then we have

$$S_z(\omega) = S_x(\omega) + S_y(\omega).$$

Estimation

In equation (3), we defined the spectrum as

$$S_x(\omega) = \frac{1}{2\pi}\sum_{h=-\infty}^{\infty} e^{-ih\omega}\gamma_x(h).$$

Given a stationary process, the sample autocovariances can be estimated as

$$\hat\gamma_x(h) = T^{-1}\sum_{t=h+1}^{T}(x_t - \bar{x})(x_{t-h} - \bar{x}).$$

To estimate the spectrum, we may compute the sample analog of (3), which is known as the sample periodogram

$$I_x(\omega) = \frac{1}{2\pi}\sum_{h=-T+1}^{T-1} e^{-ih\omega}\hat\gamma_x(h).$$

Or we can equivalently write it as

$$I_x(\omega) = \frac{1}{2\pi}\Big[\hat\gamma_x(0) + 2\sum_{h=1}^{T-1}\hat\gamma_x(h)\cos(\omega h)\Big]. \quad (9)$$

We have the following asymptotic distribution for the sample periodogram:

$$\frac{2 I_x(\omega)}{S_x(\omega)} \to_d \chi^2(2)$$

Since $E(\chi^2(2)) = 2$, the sample periodogram provides an asymptotically unbiased estimate of the spectrum, $\lim_{T\to\infty} E I_x(\omega) = S_x(\omega)$. However, the variance of $I_x(\omega)$ does not go to zero. In fact,

$$\mathrm{Var}(I_x(\omega)) \to \begin{cases} 2S_x^2(0) & \text{for } \omega = 0 \\ S_x^2(\omega) & \text{for } \omega \ne 0 \end{cases}$$
Therefore, even when the sample size is very large, the sample periodogram still cannot provide an accurate estimate of the true spectrum. To estimate the spectrum, there are two better approaches. The first is the parametric approach. We can estimate the ARMA model by least squares or MLE to obtain consistent estimates of the parameters, and then plug the estimates in to obtain a consistent estimate of the spectrum. For instance, for an MA(1) process,

$$x_t = \epsilon_t + \theta\epsilon_{t-1}, \quad \epsilon_t \sim WN(0, 1),$$

if we can obtain a consistent estimator of $\theta$, denoted by $\hat\theta$, then for any $\omega$,

$$\hat{S}_x(\omega) = \frac{1}{2\pi}\big[1 + \hat\theta^2 + \hat\theta(e^{-i\omega} + e^{i\omega})\big].$$

A potential problem with parametric estimation is that we have to specify a parametric model for the process, say ARMA(p, q), so we may make errors due to misspecification. However, even if the model is incorrectly specified, if the autocovariances of the true process are close to those of our specification, then this procedure can still provide a useful estimate of the population spectrum.
An alternative approach is to estimate the spectrum nonparametrically, which saves us from specifying a model for the process. We still make use of the sample periodogram; however, to estimate the spectrum $S_x(\omega)$, we use a weighted average of the sample periodogram over several neighboring frequencies. How much weight to put on each $\omega$ in the neighborhood is determined by a function known as the kernel, or kernel function. This means that the spectrum is estimated by

$$\hat{S}_x(\omega_j) = \sum_{l=-m}^{m} k(l, m)\, I_x(\omega_{j+l}). \quad (10)$$

The kernel function $k(l, m)$ must satisfy

$$\sum_{l=-m}^{m} k(l, m) = 1.$$

Here $m$ is the bandwidth or window, indicating how many different frequencies are viewed as useful in estimating $S_x(\omega_j)$. Averaging $I_x(\omega)$ over different frequencies can equivalently be represented as multiplying the $h$th autocovariance $\hat\gamma(h)$ in (9) by a weight function $w(h, q)$. A derivation can be found on page 166 of Hamilton.
These weight functions $w(h, q)$ satisfy $w(0, q) = 1$, $|w(h, q)| \le 1$, and $w(h, q) = 0$ for $h > q$. The $q$ in the weight function works in a similar way as the $m$ in $k(l, m)$, as it specifies the length of the window. Some commonly used weight functions are:

Truncated kernel, let $x = h/q$:

$$w(x) = \begin{cases} 1 & \text{for } |x| \le 1 \\ 0 & \text{otherwise} \end{cases}$$

Bartlett kernel, let $x = h/q$:

$$w(x) = \begin{cases} 1 - |x| & \text{for } |x| \le 1 \\ 0 & \text{otherwise} \end{cases}$$

Modified Bartlett kernel:

$$w(h, q) = \begin{cases} 1 - \frac{h}{q+1} & \text{for } h = 1, 2, \ldots, q \\ 0 & \text{otherwise} \end{cases}$$

Parzen kernel, let $x = h/q$:

$$w(x) = \begin{cases} 1 - 6|x|^2 + 6|x|^3 & \text{for } |x| \le 1/2 \\ 2(1 - |x|)^3 & \text{for } 1/2 \le |x| \le 1 \\ 0 & \text{otherwise} \end{cases}$$

A typical problem in nonparametric estimation is the trade-off between variance and bias. Usually a larger bandwidth reduces variance but induces bias. To reduce the variance without adding much bias, we need to choose a proper bandwidth. In practice, we may plot estimates of the spectrum using several different bandwidths and use subjective judgment to choose among them. Basically, if the plot is too flat, then it is hard to extract information such as which frequencies are more important than others; on the other hand, if the plot is too choppy (too many peaks and valleys mixed together), then it is hard to make convincing comments.
Example 4 (Spectrum estimation of an AR(1) process). The data are generated from

$$x_t = \phi x_{t-1} + \epsilon_t, \quad \phi = 0.5, \quad \epsilon_t \sim \text{i.i.d. } N(0, 1).$$

We simulated a sequence of length $n = 200$ using this DGP, and the OLS estimate of $\phi$ is 0.59 (the OLS estimate is consistent in this problem). The upper-left figure in Figure 3 plots the population spectrum, i.e., using (8) with $\phi = 0.5$. The upper-right figure plots the estimated spectrum using (8) with the OLS estimate of $\phi$, 0.59. The lower-left figure plots the sample periodogram $I_x(\omega)$, which is very volatile. Finally, the lower-right figure plots the smoothed estimate of the spectrum using the modified Bartlett kernel, i.e.

$$\hat{S}_x(\omega) = (2\pi)^{-1}\Big[\hat\gamma_x(0) + 2\sum_{j=1}^{q}\Big(1 - \frac{j}{q+1}\Big)\hat\gamma_x(j)\cos(\omega j)\Big],$$

where $q$ is set to 5.
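A sketch of this exercise (our own implementation; the seed and estimates will differ from the lecture's particular draw) simulates the AR(1), computes the sample autocovariances, and forms both the raw periodogram (9) and the Bartlett-smoothed estimate:

```python
import numpy as np

rng = np.random.default_rng(1)
phi, T, q = 0.5, 200, 5

# simulate x_t = 0.5 x_{t-1} + eps_t
x = np.zeros(T)
eps = rng.standard_normal(T)
for t in range(1, T):
    x[t] = phi * x[t - 1] + eps[t]

xbar = x.mean()
gamma_hat = np.array([np.sum((x[h:] - xbar) * (x[:T - h] - xbar)) / T
                      for h in range(T)])

def periodogram(w):
    # equation (9)
    h = np.arange(1, T)
    return (gamma_hat[0] + 2 * np.sum(gamma_hat[1:] * np.cos(w * h))) / (2 * np.pi)

def bartlett_estimate(w):
    # modified Bartlett weights 1 - j/(q+1) down-weight long, noisy lags
    j = np.arange(1, q + 1)
    wts = 1 - j / (q + 1)
    return (gamma_hat[0] + 2 * np.sum(wts * gamma_hat[j] * np.cos(w * j))) / (2 * np.pi)

# sanity check: the periodogram integrates back to gamma_hat(0)
grid = np.linspace(-np.pi, np.pi, 512, endpoint=False)
I = np.array([periodogram(wi) for wi in grid])
assert abs(np.sum(I) * (2 * np.pi / 512) - gamma_hat[0]) < 1e-6

# the (modified) Bartlett window yields a nonnegative smoothed estimate
assert bartlett_estimate(1.0) >= 0
```

Plotting `I` against `grid` reproduces the choppy lower-left panel, while `bartlett_estimate` gives the smoother lower-right panel.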

Figure 3: Estimates of the spectrum (upper left: population spectrum; upper right: parametrically estimated spectrum; lower left: sample periodogram; lower right: nonparametrically estimated spectrum)

For empirical studies, Section 6.4 in Hamilton, on the spectrum of an industrial production series, provides a very good example. Without any detrending, the spectrum is concentrated in the low frequency region, which means that the variance of the series comes largely from the long term trend (here, economic growth). After detrending, we obtain the growth rate, which is stationary, and the variance now mostly comes from the business cycle and seasonal effects. After filtering out the seasonal effects, most of the variance is due to the business cycle.
Readings: Hamilton, Ch. 6; Brockwell and Davis, Ch. 4, Ch. 10


Lecture 4: Asymptotic Distribution Theory

In time series analysis, we usually use asymptotic theory to derive the joint distributions of the estimators of parameters in a model. An asymptotic distribution is the distribution we obtain by letting the time horizon (sample size) go to infinity. We can simplify the analysis by doing so (as we know that some terms converge to zero in the limit), but we may also incur a finite sample error. Hopefully, when the sample size is large enough, the error becomes small and we have a satisfactory approximation to the true or exact distribution. The reason we use asymptotic distributions instead of exact distributions is that the exact finite sample distributions are in many cases too complicated to derive, even for Gaussian processes. Therefore, we use asymptotic distributions as alternatives.

Review

This lecture may contain more propositions and definitions than any other lecture in this course. In summary, we are interested in two types of asymptotic results. The first type concerns convergence to a constant. For example, we are interested in whether the sample moments converge to the population moments, and the law of large numbers (LLN) is the classic result here. The second type concerns convergence to a random variable, say $Z$, and in many cases $Z$ follows a standard normal distribution. The central limit theorem (CLT) provides a tool for establishing asymptotic normality.

The potentially confusing part of this lecture is that we have several versions of the LLN and CLT. The results may look similar, but the assumptions are different. We will start from the strongest assumption, i.i.d., and then show how to obtain similar results when i.i.d. is violated. Before we come to the major part on the LLN and CLT, we first review some basic concepts.

1.1 Convergence in Probability and Convergence Almost Surely

Definition 1 (Convergence in probability) $X_n$ is said to converge in probability to $X$ if for every $\epsilon > 0$,

$$P(|X_n - X| > \epsilon) \to 0 \quad \text{as } n \to \infty.$$

If $X = 0$, we say that $X_n$ converges in probability to zero, written $X_n = o_p(1)$, or $X_n \to_p 0$.

Definition 2 (Boundedness in probability) $X_n$ is said to be bounded in probability, written $X_n = O_p(1)$, if for every $\epsilon > 0$ there exists $\delta(\epsilon) \in (0, \infty)$ such that

$$P(|X_n| > \delta(\epsilon)) < \epsilon \quad \forall\, n$$

We can similarly define order in probability: $X_n = o_p(n^{-r})$ if and only if $n^r X_n = o_p(1)$; and $X_n = O_p(n^{-r})$ if and only if $n^r X_n = O_p(1)$.

Proposition 1 If $X_n$ and $Y_n$ are random variables defined on the same probability space and $a_n > 0$, $b_n > 0$, then

(i) If $X_n = o_p(a_n)$ and $Y_n = o_p(b_n)$, we have

$$X_n Y_n = o_p(a_n b_n), \quad X_n + Y_n = o_p(\max(a_n, b_n)), \quad |X_n|^r = o_p(a_n^r) \ \text{for } r > 0.$$

(ii) If $X_n = o_p(a_n)$ and $Y_n = O_p(b_n)$, we have $X_n Y_n = o_p(a_n b_n)$.

Proof of (i): If $|X_n Y_n|/(a_n b_n) > \epsilon$ then either $|Y_n|/b_n \le 1$ and $|X_n|/a_n > \epsilon$, or $|Y_n|/b_n > 1$, hence

$$P(|X_n Y_n|/(a_n b_n) > \epsilon) \le P(|X_n|/a_n > \epsilon) + P(|Y_n|/b_n > 1) \to 0.$$

If $|X_n + Y_n|/\max(a_n, b_n) > \epsilon$, then either $|X_n|/a_n > \epsilon/2$ or $|Y_n|/b_n > \epsilon/2$, so

$$P(|X_n + Y_n|/\max(a_n, b_n) > \epsilon) \le P(|X_n|/a_n > \epsilon/2) + P(|Y_n|/b_n > \epsilon/2) \to 0.$$

Finally,

$$P(|X_n|^r/a_n^r > \epsilon) = P(|X_n|/a_n > \epsilon^{1/r}) \to 0.$$

Proof of (ii): If $|X_n Y_n|/(a_n b_n) > \epsilon$, then either $|Y_n|/b_n > \delta(\epsilon)$, or $|Y_n|/b_n \le \delta(\epsilon)$ and $|X_n|/a_n > \epsilon/\delta(\epsilon)$, so

$$P(|X_n Y_n|/(a_n b_n) > \epsilon) \le P(|X_n|/a_n > \epsilon/\delta(\epsilon)) + P(|Y_n|/b_n > \delta(\epsilon)) \to 0$$

This proposition is very useful. For example, if $X_n = o_p(n^{-1})$ and $Y_n = o_p(n^{-2})$, then $X_n + Y_n = o_p(n^{-1})$, which tells us that the slowest convergence rate dominates. Later on, we will see sums of several terms, and to study the asymptotics of the sum, we can start by judging the convergence rate of each term and picking out the terms that converge slowest. In many cases, the terms that converge faster can be omitted, such as $Y_n$ in this example.

The results also hold if we replace $o_p$ in (i) with $O_p$. The notations above extend naturally from sequences of scalars to sequences of vectors or matrices. In particular, $X_n = o_p(n^{-r})$ if and only if all elements of $X_n$ converge to zero at order $n^{-r}$. Using the Euclidean distance $|X_n - X| = \big(\sum_{j=1}^{k}(X_{nj} - X_j)^2\big)^{1/2}$, where $k$ is the dimension of $X_n$, we also have

Proposition 2 $X_n - X = o_p(1)$ if and only if $|X_n - X| = o_p(1)$.

Proposition 3 (Preservation of convergence under continuous transformations) If $\{X_n\}$ is a sequence of $k$-dimensional random vectors such that $X_n \to_p X$ and if $g: R^k \to R^m$ is a continuous mapping, then $g(X_n) \to_p g(X)$.

Proof: Let $M$ be a positive real number. Then $\forall\, \epsilon > 0$, we have

$$P(|g(X_n) - g(X)| > \epsilon) \le P(|g(X_n) - g(X)| > \epsilon, |X_n| \le M, |X| \le M) + P(\{|X_n| > M\} \cup \{|X| > M\})$$

(the above inequality uses $P(A \cup B) \le P(A) + P(B)$, where $A = \{|g(X_n) - g(X)| > \epsilon, |X_n| \le M, |X| \le M\}$ and $B = \{|X_n| > M\} \cup \{|X| > M\}$). Recall that if a function $g$ is uniformly continuous on $\{x: |x| \le M\}$, then $\forall\, \epsilon > 0$ there exists $\delta(\epsilon)$ such that $|X_n - X| < \delta(\epsilon)$ implies $|g(X_n) - g(X)| < \epsilon$. Then

$$\{|g(X_n) - g(X)| > \epsilon, |X_n| \le M, |X| \le M\} \subseteq \{|X_n - X| > \delta(\epsilon)\}.$$

Therefore,

$$P(|g(X_n) - g(X)| > \epsilon) \le P(|X_n - X| > \delta(\epsilon)) + P(|X_n| > M) + P(|X| > M)$$
$$\le P(|X_n - X| > \delta(\epsilon)) + P(|X| > M) + P(|X| > M/2) + P(|X_n - X| > M/2).$$

Given any $\eta > 0$, we can choose $M$ to make the second and third terms each less than $\eta/4$. Since $X_n \to_p X$, the first and fourth terms will each be less than $\eta/4$ for $n$ large. Therefore, we have

$$P(|g(X_n) - g(X)| > \epsilon) \le \eta.$$

Then $g(X_n) \to_p g(X)$.


Definition 3 (Convergence almost surely) A sequence $\{X_n\}$ is said to converge to $X$ almost surely, or with probability one, if for every $\epsilon > 0$,

$$P\big(\lim_{n\to\infty}|X_n - X| > \epsilon\big) = 0.$$

If $X_n$ converges to $X$ almost surely, we write $X_n \to_{a.s.} X$. Almost sure convergence is stronger than convergence in probability. In fact, we have

Proposition 4 If $X_n \to_{a.s.} X$, then $X_n \to_p X$.

However, the converse is not true. Below is an example.

Example 1 (Convergence in probability but not almost surely) Let the sample space be $S = [0, 1]$, a closed interval. Define the sequence $\{X_n\}$ as

$$X_1(s) = s + 1_{[0,1]}(s), \quad X_2(s) = s + 1_{[0,1/2]}(s), \quad X_3(s) = s + 1_{[1/2,1]}(s),$$
$$X_4(s) = s + 1_{[0,1/3]}(s), \quad X_5(s) = s + 1_{[1/3,2/3]}(s), \quad X_6(s) = s + 1_{[2/3,1]}(s),$$

etc., where $1$ is the indicator function, i.e., it equals 1 if the statement is true and 0 otherwise. Let $X(s) = s$. Then $X_n \to_p X$, as $P(|X_n - X| \ge \epsilon)$ equals the probability of the interval of $s$ values on which the indicator is 1, whose length goes to zero as $n \to \infty$. However, $X_n$ does not converge to $X$ almost surely: there is no $s \in S$ for which $X_n(s) \to s = X(s)$. For every $s$, the value of $X_n(s)$ alternates between $s$ and $s + 1$ infinitely often.

1.2 Convergence in $L_p$ Norm

When $E(|X_n|^p) < \infty$ with $p > 0$, $X_n$ is said to be $L_p$-bounded. Define the $L_p$ norm of $X$ as $\|X\|_p = (E|X|^p)^{1/p}$. Before we define $L_p$ convergence, we first review some useful inequalities.

Proposition 5 (Markov's inequality) If $E|X|^p < \infty$, $p \ge 0$ and $\epsilon > 0$, then

$$P(|X| \ge \epsilon) = P(|X|^p/\epsilon^p \ge 1) \le \frac{E|X|^p}{\epsilon^p}$$

Proof:

$$P(|X|^p/\epsilon^p \ge 1) = E\, 1_{[1,\infty)}(|X|^p/\epsilon^p) \le E\big[(|X|^p/\epsilon^p)\, 1_{[1,\infty)}(|X|^p/\epsilon^p)\big] \le \frac{E|X|^p}{\epsilon^p}$$

In Markov's inequality, we can also replace $|X|$ with $|X - c|$, where $c$ can be any real number. When $p = 2$, the inequality is also known as Chebyshev's inequality. If $X$ is $L_p$-bounded, then Markov's inequality tells us that the tail probabilities converge to zero at rate $\epsilon^{-p}$ as $\epsilon \to \infty$. Therefore, the order of $L_p$ boundedness measures the tendency of a distribution to generate outliers.
Proposition 6 (Hölder's inequality) For any $p \ge 1$,

$$E|XY| \le \|X\|_p \|Y\|_q,$$

where $q = p/(p-1)$ if $p > 1$ and $q = \infty$ if $p = 1$.

Proposition 7 (Liapunov's inequality) If $p > q > 0$, then $\|X\|_p \ge \|X\|_q$.

Proof: Let $Z = |X|^q$, $Y = 1$, $s = p/q$. Then by Hölder's inequality, $E|ZY| \le \|Z\|_s \|Y\|_{s/(s-1)}$, or

$$E(|X|^q) \le E(|X|^{qs})^{1/s} = E(|X|^p)^{q/p}.$$

Definition 4 ($L_p$ convergence) If $\|X_n\|_p < \infty$ for all $n$ with $p > 0$, and $\lim_{n\to\infty}\|X_n - X\|_p = 0$, then $X_n$ is said to converge in $L_p$ norm to $X$, written $X_n \to_{L_p} X$. When $p = 2$, we say it converges in mean square, written $X_n \to_{m.s.} X$.

For any $p > q > 0$, $L_p$ convergence implies $L_q$ convergence by Liapunov's inequality. We can regard convergence in probability as $L_0$ convergence; therefore, $L_p$ convergence implies convergence in probability:
Proposition 8 ($L_p$ convergence implies convergence in probability) If $X_n \to_{L_p} X$ then $X_n \to_p X$.

Proof:

$$P(|X_n - X| > \epsilon) \le \frac{E|X_n - X|^p}{\epsilon^p} \to 0 \quad \text{by Markov's inequality}$$

1.3 Convergence in Distribution

Definition 5 (Convergence in distribution) The sequence $\{X_n\}_{n=0}^{\infty}$ of random variables with distribution functions $\{F_{X_n}(x)\}$ is said to converge in distribution to $X$, written $X_n \to_d X$, if there exists a distribution function $F_X(x)$ such that

$$\lim_{n\to\infty} F_{X_n}(x) = F_X(x)$$

at every point $x$ where $F_X$ is continuous.

Again, we can naturally extend the definitions and related results for a scalar random variable $X$ to a vector valued random variable $X$. To verify convergence in distribution of a $k \times 1$ vector: if the scalar $(\lambda_1 X_{1n} + \lambda_2 X_{2n} + \ldots + \lambda_k X_{kn})$ converges in distribution to $(\lambda_1 X_1 + \lambda_2 X_2 + \ldots + \lambda_k X_k)$ for any real values of $(\lambda_1, \lambda_2, \ldots, \lambda_k)$, then the vector $(X_{1n}, X_{2n}, \ldots, X_{kn})$ converges in distribution to the vector $(X_1, X_2, \ldots, X_k)$.
We also have the continuous mapping theorem for convergence in distribution.

Proposition 9 If $\{X_n\}$ is a sequence of random $k$-vectors with $X_n \to_d X$ and if $g: R^k \to R^m$ is a continuous function, then $g(X_n) \to_d g(X)$.

In the special case where the limit is a constant scalar or vector, convergence in distribution implies convergence in probability.

Proposition 10 If $X_n \to_d c$ where $c$ is a constant, then $X_n \to_p c$.

Proof: If $X_n \to_d c$, then $F_{X_n}(x) \to 1_{[c,\infty)}(x)$ for all $x \ne c$. For any $\epsilon > 0$,

$$P(|X_n - c| \le \epsilon) \ge P(c - \epsilon < X_n \le c + \epsilon) = F_{X_n}(c + \epsilon) - F_{X_n}(c - \epsilon) \to 1_{[c,\infty)}(c + \epsilon) - 1_{[c,\infty)}(c - \epsilon) = 1.$$

On the other hand, for a sequence $\{X_n\}$, if the limit of convergence in probability or convergence almost surely is a random variable $X$, then the sequence also converges in distribution to $X$.

1.4 Law of Large Numbers

Theorem 1 (Chebyshev's Weak LLN) Let $X_t$ be a random variable with $E(X_t) = \mu$ and $\lim_{n\to\infty}\mathrm{Var}(\bar{X}_n) = 0$. Then

$$\bar{X}_n = \frac{1}{n}\sum_{t=1}^{n} X_t \to_p \mu.$$

The proof follows readily from Chebyshev's inequality:

$$P(|\bar{X}_n - \mu| > \epsilon) \le \frac{\mathrm{Var}(\bar{X}_n)}{\epsilon^2} \to 0.$$

The WLLN tells us that the sample mean is a consistent estimator of the population mean when its variance goes away as $n \to \infty$. Since $E(\bar{X}_n - \mu)^2 = \mathrm{Var}(\bar{X}_n) \to 0$, we also know that $\bar{X}_n$ converges to the population mean in mean square.

Theorem 2 (Kolmogorov's Strong LLN) Let $X_t$ be i.i.d. with $E(|X_t|) < \infty$. Then $\bar{X}_n \to_{a.s.} \mu$.
Note that Kolmogorov's LLN does not require finite variance. Next we consider the LLN for a heterogeneous process without serial correlation, say $E(X_t) = \mu_t$ and $\mathrm{Var}(X_t) = \sigma_t^2$, and assume that $\bar\mu_n = n^{-1}\sum_{t=1}^{n}\mu_t \to \mu$. Then we know that $E(\bar{X}_n) = \bar\mu_n \to \mu$, and

$$\mathrm{Var}(\bar{X}_n) = E\Big(n^{-1}\sum_{t=1}^{n}(X_t - \mu_t)\Big)^2 = n^{-2}\sum_{t=1}^{n}\sigma_t^2.$$

To prove the condition for $\mathrm{Var}(\bar{X}_n) \to 0$, we need another fundamental tool in asymptotic theory, Kronecker's lemma.

Theorem 3 (Kronecker's lemma) Let $\{X_t\}$ be a sequence of real numbers, let $\{b_n\}$ be a monotone increasing sequence with $b_n \to \infty$, and suppose $\sum_{t=1}^{\infty} X_t$ is convergent. Then

$$\frac{1}{b_n}\sum_{t=1}^{n} b_t X_t \to 0.$$

Theorem 4 Let $\{X_t\}$ be a serially uncorrelated sequence with $\sum_{t=1}^{\infty}\sigma_t^2/t^2 < \infty$. Then $\bar{X}_n \to_{m.s.} \mu$.

Proof: Take $b_t = t^2$; then by Kronecker's lemma, $\mathrm{Var}(\bar{X}_n) = n^{-2}\sum_{t=1}^{n}\sigma_t^2 \to 0$. Then we have $E(\bar{X}_n - \bar\mu_n)^2 \to 0$; therefore, $\bar{X}_n \to_{m.s.} \mu$.

1.5 Classical Central Limit Theory

Finally, central limit theorem (CLT) provides a tool to establish asymptotic normality of an estimator.
Definition 6 (Asymptotic Normality) A sequence of random variables {Xn } is said to be asymptotic normal with mean n and standard deviation n if n > 0 for n sufficiently large and
(Xn

n )/

where

!d Z,

Z N (0, 1).

Theorem 5 (Lindeberg-Levy Central Limit Theorem) If {Xn } iid(,


Xn )/n, then
p
n )/ !d N (0, 1).
n(X

2 ),

n = (X1 + . . . +
and X

n without assuming normality for the


Note that in CLT, we obtain normality results about X
distribution of Xn . Here we only require that Xn follows some i.i.d. We will see a moment later
that central limit theorem also holds for more general cases. Another useful tool which can be used
together with LLN and CLT is known as Slutskys theorem.
Theorem 6 (Slutsky's theorem) If $X_n \to X$ in distribution and $Y_n \to c$, a constant, then
(a) $Y_n X_n \to cX$ in distribution.
(b) $X_n + Y_n \to X + c$ in distribution.
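A quick Monte Carlo illustration of the Lindeberg-Levy CLT (our own sketch, with arbitrary simulation sizes): standardized sample means of exponential draws, which are clearly non-normal, behave approximately like $N(0, 1)$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps = 500, 2000
mu, sigma = 1.0, 1.0  # exponential(1) has mean 1 and standard deviation 1

# z = sqrt(n) (xbar_n - mu) / sigma, across many independent replications
draws = rng.exponential(scale=1.0, size=(reps, n))
z = np.sqrt(n) * (draws.mean(axis=1) - mu) / sigma

# the standardized means are approximately standard normal
assert abs(z.mean()) < 0.1
assert abs(z.std() - 1.0) < 0.1
```

A histogram of `z` looks close to the standard normal bell curve even though each underlying draw is skewed.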
If we know the distribution of a random variable, we can derive the distribution of a function of this random variable using the so-called delta method.

Proposition 11 (Delta method) Let $\{X_n\}$ be a sequence of random variables such that $\sqrt{n}(X_n - \mu) \to_d N(0, \sigma^2)$, and let $g$ be a function which is differentiable at $\mu$. Then

$$\sqrt{n}[g(X_n) - g(\mu)] \to_d N(0, g'(\mu)^2\sigma^2).$$

Proof: The Taylor expansion of $g(X_n)$ around $X_n = \mu$ is

$$g(X_n) = g(\mu) + g'(\mu)(X_n - \mu) + o_p(n^{-1/2}),$$

as $X_n \to_p \mu$. Applying Slutsky's theorem to

$$\sqrt{n}[g(X_n) - g(\mu)] = g'(\mu)\sqrt{n}(X_n - \mu) + o_p(1),$$

where we know that $\sqrt{n}(X_n - \mu) \to N(0, \sigma^2)$, we get

$$\sqrt{n}[g(X_n) - g(\mu)] \to N(0, g'(\mu)^2\sigma^2).$$

For example, let $g(X_n) = 1/X_n$, and $\sqrt{n}(X_n - \mu) \to_d N(0, \sigma^2)$; then $\sqrt{n}(1/X_n - 1/\mu) \to_d N(0, \sigma^2/\mu^4)$.
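The $1/X_n$ example can be checked by simulation. The sketch below (our own, with illustrative $\mu$ and $\sigma^2$) compares the Monte Carlo variance of $\sqrt{n}(1/\bar{X}_n - 1/\mu)$ to the delta-method value $\sigma^2/\mu^4$.

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma2 = 2.0, 1.0
n, reps = 2000, 4000

# sample means, so that sqrt(n)(xbar - mu) is approximately N(0, sigma^2)
xbar = rng.normal(mu, np.sqrt(sigma2), size=(reps, n)).mean(axis=1)

# delta method for g(x) = 1/x: asymptotic variance g'(mu)^2 sigma^2 = sigma^2/mu^4
z = np.sqrt(n) * (1.0 / xbar - 1.0 / mu)
target_var = sigma2 / mu**4  # = 0.0625

assert abs(z.mean()) < 0.02
assert abs(z.var() - target_var) < 0.02
```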
The Lindeberg-Levy CLT assumes i.i.d. observations, which is too strong in practice. Now we retain the assumption of independence but allow heterogeneous distributions (i.ni.d.), and in the next section we will show versions of the CLT for serially dependent sequences.

In the following analysis, it is more convenient to work with normalized variables. We also need to use triangular arrays. An array $X_{nt}$ is a double-indexed collection of numbers, and each sample size $n$ can be associated with a different sequence. We use $\{\{X_{nt}\}_{t=0}^{n}\}_{n=1}^{\infty}$, or just $\{X_{nt}\}$, to denote an array. Let $\{Y_t\}$ be the raw sequence with $E(Y_t) = \mu_t$. Define $s_n^2 = \sum_{t=1}^{n} E(Y_t - \mu_t)^2$, $\sigma_{nt}^2 = E(Y_t - \mu_t)^2/s_n^2$, and

$$X_{nt} = \frac{Y_t - \mu_t}{s_n}.$$

Then $E(X_{nt}) = 0$ and $\mathrm{Var}(X_{nt}) = \sigma_{nt}^2$. Define

$$S_n = \sum_{t=1}^{n} X_{nt};$$

then $E(S_n) = 0$ and

$$E(S_n^2) = \sum_{t=1}^{n}\sigma_{nt}^2 = 1. \quad (1)$$

Definition 7 (Lindeberg CLT) Let the array {Xnt} be independent with zero mean and variance sequence {σ²nt} satisfying (1). If the following condition holds,

    lim_{n→∞} Σ_{t=1}^n ∫_{|Xnt|>ε} X²nt dP = 0  for all ε > 0,    (2)

then Sn →d N(0, 1).


Equation (2) is known as the Lindeberg condition. What the Lindeberg condition rules out are cases where some elements of the sequence exhibit such extreme behavior as to influence the distribution of the sum in the limit. Finite variances alone are not sufficient to rule out this kind of situation with non-identically distributed observations. The following is a popular version of the CLT for independent processes.

Definition 8 (Liapunov CLT) A sufficient condition for the Lindeberg condition (2) is

    lim_{n→∞} Σ_{t=1}^n E|Xnt|^{2+δ} = 0  for some δ > 0.    (3)

Condition (3) is known as the Liapunov condition. It is stronger than the Lindeberg condition, but it is more easily checkable, and therefore more frequently used in practice.

2 Limit Theorems for Serially Dependent Observations

We have seen that if the data {Xt} are generated by an ARMA process, then the observations are not i.i.d., but serially correlated. In this section, we will discuss how to derive asymptotic theory for stationary and serially dependent processes.

2.1 LLN for a Covariance Stationary Process

Consider a covariance stationary process {Xt}. Without loss of generality, let E(Xt) = 0, so E(Xt Xt−h) = γ(h), where Σ_{h=0}^∞ |γ(h)| < ∞. Now we consider the properties of the sample mean X̄n = (X1 + ... + Xn)/n. First, we see that it is an unbiased estimate of the population mean: E(X̄n) = E(Xt) = 0. Next, the variance of this estimate is:
    E(X̄n²) = E[(X1 + ... + Xn)/n]²
            = (1/n²) E(X1 + ... + Xn)²
            = (1/n²) Σ_{i,j=1}^n E(Xi Xj)
            = (1/n²) Σ_{i,j=1}^n γ(i − j)
            = (1/n) [γ(0) + 2 Σ_{h=1}^{n−1} (1 − h/n) γ(h)]
    or
            = (1/n) Σ_{|h|<n} (1 − |h|/n) γ(h).

First, we can see that

    n E(X̄n²) = γ(0) + 2(1 − 1/n)γ(1) + 2(1 − 2/n)γ(2) + ... + 2(1 − (n−1)/n)γ(n−1)
             ≤ |γ(0)| + 2|γ(1)| + 2|γ(2)| + ... < ∞

by our assumption on the absolute summability of γ(h). Since nE(X̄n²) is bounded, we know that E(X̄n²) → 0, which means that X̄n →m.s. 0, the population mean.

Next, we consider the limit of nE(X̄n²) = γ(0) + 2 Σ_{h=1}^{n−1} (1 − h/n) γ(h). First, we know that if a series is summable, then its tails must go to zero. So for large h, those autocovariances do not affect the sum; and for small h, the weight approaches 1 as n → ∞. Therefore, we have

    lim_{n→∞} n E(X̄n²) = Σ_{h=−∞}^∞ γ(h) = γ(0) + 2γ(1) + 2γ(2) + ...

We summarize our results in the following proposition.

Proposition 12 (LLN for covariance stationary processes) Let Xt be a zero-mean covariance stationary process with E(Xt Xt−h) = γ(h) and absolutely summable autocovariances. Then the sample mean satisfies X̄n →m.s. 0 and lim_{n→∞} nE(X̄n²) = Σ_{h=−∞}^∞ γ(h).

If the process has population mean μ, then accordingly we have X̄n → μ, and the limit of nE[(X̄n − μ)²] remains the same. A covariance stationary process is said to be ergodic for the mean if the time series average converges to the population mean. Similarly, if the sample average provides a consistent estimate of the second moment, then the process is said to be ergodic for the second moment. In this section, we saw that a sufficient condition for a covariance stationary process to be ergodic for the mean is that Σ_{h=0}^∞ |γ(h)| < ∞. Further, if the process is Gaussian, then absolutely summable autocovariances also ensure that the process is ergodic for all moments.
Recall that in spectral analysis, we have

    Σ_{h=−∞}^∞ γx(h) = 2π Sx(0),

therefore the limit of nE(X̄n²) can be equivalently expressed as 2π Sx(0).
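As a numerical illustration (my own sketch, not part of the lecture; the AR(1) specification and simulation sizes are arbitrary choices): for a stationary AR(1) process Xt = ρXt−1 + εt with εt ∼ N(0, 1), the autocovariances are γ(h) = ρ^|h|/(1 − ρ²), so Σ_{h=−∞}^∞ γ(h) = 1/(1 − ρ)². The sketch below checks that n·Var(X̄n) approaches this long-run variance rather than γ(0).

```python
import numpy as np

# Check lim n*E(Xbar_n^2) = sum_h gamma(h) for a zero-mean AR(1):
# X_t = rho*X_{t-1} + eps_t, eps_t ~ N(0,1), so sum_h gamma(h) = 1/(1-rho)^2.
rng = np.random.default_rng(1)
rho, n, reps = 0.5, 2000, 4000

eps = rng.standard_normal((reps, n))
x = np.zeros((reps, n))
for t in range(1, n):  # start each replication from X_0 = 0; the transient is negligible here
    x[:, t] = rho * x[:, t - 1] + eps[:, t]

nvar = n * x.mean(axis=1).var()        # n * Var(sample mean), estimated across replications
print(nvar, 1.0 / (1.0 - rho) ** 2)    # the two numbers should be close (theory: 4.0)
```

Note that γ(0) = 1/(1 − ρ²) ≈ 1.33 here, so the variance of the sample mean is roughly three times what the i.i.d. formula γ(0)/n would suggest; the serial correlation matters.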

2.2 Ergodic Theorem*

The ergodic theorem is a law of large numbers for a strictly stationary and ergodic process. We need a few concepts to define ergodic stationarity, and those concepts can be found in the appendix. Given a probability space (Ω, F, P), an event E ∈ F is invariant under a transformation T if E = T⁻¹E. A measure-preserving transformation T is ergodic if for any invariant event E, we have P(E) = 1 or P(E) = 0. In other words, events that are invariant under ergodic transformations either occur almost surely, or do not occur almost surely. Let T be the shift operator; then a strictly stationary process {Xt} is said to be ergodic if Xt = T^{t−1}X1 for all t, where T is measure-preserving and ergodic.
Below is an alternative way to define ergodicity.

Theorem 7 Let (Ω, F, P) be a probability space and let {Xt} be a strictly stationary process, Xt(ω) = X1(T^{t−1}ω). Then this process is ergodic if and only if for any pair of events A, B ∈ F,

    lim_{n→∞} (1/n) Σ_{k=1}^n P(T^{−k}A ∩ B) = P(A)P(B).    (4)

To understand this result, note that if event A is not invariant and T is measure-preserving, then TA ∩ A^c is not empty. Therefore, repeated iterations of the transformation generate a sequence of sets {T^{−k}A} containing different mixtures of the elements of A and A^c. A positive dependence of B on A implies a negative dependence of B on A^c, i.e.,

    P(A ∩ B) > P(A)P(B)  ⟹  P(A^c ∩ B) = P(B) − P(A ∩ B) < P(B) − P(A)P(B) = P(A^c)P(B).

So the average dependence of B on a mixture of A and A^c should tend to zero as k → ∞.

Example 2 (Absence of ergodicity) Let Xt = Ut + Z, where Ut ∼ i.i.d. Uniform(0, 1) and Z ∼ N(0, 1), independent of {Ut}. Then Xt is stationary, as each observation follows the same distribution. However, this process is not ergodic, because

    Xt = Ut + Z = T^{t−1}(U1 + Z),

so Z is invariant under the shift operator. If we compute the autocovariance, γX(h) = Cov(Xt, Xt+h) = Var(Z) = 1, no matter how large h is. This means that the dependence is too persistent. Recall that in lecture one we proposed that the time series average of a stationary process converges to its population mean only when it is ergodic. In this example, the series is not ergodic. We can compute that the true expectation of the process is 1/2, while the sample average X̄n = (1/n) Σ_{t=1}^n Ut + Z does not converge to 1/2, but to Z + 1/2.
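Example 2 is easy to visualize by simulation. The sketch below (my own illustration, not from the notes) shows that the sample mean of Xt = Ut + Z settles near Z + 1/2, which varies across realizations of Z, rather than near the ensemble mean 1/2.

```python
import numpy as np

# Simulate X_t = U_t + Z: stationary but not ergodic, because the single draw
# of Z is shared by every observation in a given realization of the process.
rng = np.random.default_rng(2)
n = 100_000

for _ in range(3):
    z = rng.standard_normal()          # one Z per realization, fixed over time
    x = rng.uniform(0.0, 1.0, n) + z   # X_t = U_t + Z
    print(x.mean(), "vs", z + 0.5)     # sample mean tracks Z + 1/2, not 1/2
```

Each run prints a sample mean essentially equal to its own Z + 1/2; averaging over time cannot average out the common component Z.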
From Example 2 we can see that in order for Xt to be ergodic, Z has to be a constant almost surely. In practice, ergodicity is usually assumed in the theory, as it is impossible to test empirically. If a process is stationary and ergodic, we have the following LLN:

Theorem 8 (Ergodic theorem) Let Xt be a strictly stationary and ergodic process with E(Xt) = μ. Then

    X̄n = (1/n) Σ_{t=1}^n Xt →a.s. μ.

Recall that when a process is strictly stationary, a measurable function of this process is also strictly stationary. A similar property holds for ergodicity. Also, if a process is ergodic stationary, then all its moments, provided they exist and are finite, can also be consistently estimated by the corresponding sample moments. For instance, if Xt is strictly stationary and ergodic with E(Xt²) = σ², then (1/n) Σ_{t=1}^n Xt² →a.s. σ².

2.3 Mixing Sequences*

Application of the ergodic theorem is restricted in practice, since it requires strict stationarity, which is too strong an assumption in many cases. Now we introduce another condition on dependence: mixing.

A mixing transformation T implies that repeated applications of T to an event A mix up A and A^c, so that when k is large, T^{−k}A provides no information about the original event A. A classical example about mixing is due to Halmos (1956).

Consider that to make a dry martini, we pour a layer of vermouth (10% of the volume) on top of the gin (90% of the volume). Let G denote the gin, and F an arbitrary small region of the fluid, so that F ∩ G is the gin contained in F. If P(·) denotes the volume of a set as a proportion of the whole, then P(G) = 0.9. The proportion of gin in F, denoted by P(F ∩ G)/P(F), is initially either 0 or 1. Let T denote the operation of stirring the martini with a swizzle stick, so that P(T^k F ∩ G)/P(F) is the proportion of gin in F after k stirs. If the stirring mixes the martini, we would expect the proportion of gin in T^k F, which is P(T^k F ∩ G)/P(F), to tend to P(G), so that each region F of the martini eventually contains 90% gin.
Let (Ω, F, P) be a probability space, and let G, H be sub-σ-fields of F. Define

    α(G, H) = sup_{G∈G, H∈H} |P(G ∩ H) − P(G)P(H)|,    (5)

and

    φ(G, H) = sup_{G∈G, H∈H; P(G)>0} |P(H|G) − P(H)|.    (6)

Clearly, α(G, H) ≤ φ(G, H). The events in G and H are independent iff α and φ are zero.

For a sequence {Xt}_{−∞}^∞, let F_{−∞}^t = σ(..., Xt−1, Xt) and F_{t+m}^∞ = σ(Xt+m, Xt+m+1, ...). Define the strong mixing coefficient αm = sup_t α(F_{−∞}^t, F_{t+m}^∞) and the uniform mixing coefficient φm = sup_t φ(F_{−∞}^t, F_{t+m}^∞).

Next, the sequence is said to be α-mixing or strong mixing if lim_{m→∞} αm = 0, and it is said to be φ-mixing or uniform mixing if lim_{m→∞} φm = 0. Since αm ≤ φm, φ-mixing implies α-mixing.

A mixing sequence is not necessarily stationary, and it can be heterogeneous. However, if a strictly stationary process is mixing, it must be ergodic. As you can see from (4), ergodicity implies average asymptotic independence; however, ergodicity does not imply that any two parts of the sequence eventually become independent. A mixing sequence, on the other hand, does have this property (asymptotic independence). Hence mixing is a stronger condition than ergodicity: a stationary and ergodic sequence need not be mixing.
We usually use a statistic called size to characterize the rate of convergence of αm or φm. A sequence is said to be α-mixing of size −φ0 if αm = O(m^{−φ}) for some φ > φ0 > 0. If Xt is an α-mixing sequence of size −φ0, and if Yt = g(Xt, Xt−1, ..., Xt−k) is a measurable function with k finite, then Yt is also α-mixing of size −φ0. All of the above statements also apply to φ-mixing.

When a sequence is stationary and mixing, Cov(X1, Xm) → 0 as m → ∞. Consider the ARMA processes. If the process is MA(q), then it must be mixing, since any two events separated by a time interval larger than q are independent, i.e., α(m) = φ(m) = 0 for m > q. We will not discuss sufficient conditions for an MA(∞) to be strong or uniform mixing, but note that if the innovations are i.i.d. Gaussian, then absolute summability of the moving average coefficients is sufficient to ensure strong mixing.
The following LLN (McLeish (1975)) applies to heterogeneous and temporally dependent (mixing) sequences. We will only consider strong mixing.

Proposition 13 (LLN for heterogeneous mixing sequences) Let {Zt} be strong mixing of size −r/(r − 1) for some r > 1, with finite means μt = E(Zt). If for some δ, 0 < δ ≤ r,

    Σ_{t=1}^∞ (E|Zt − μt|^{r+δ} / t^{r+δ})^{1/r} < ∞,    (7)

then Z̄n − μ̄n →a.s. 0, where μ̄n = (1/n) Σ_{t=1}^n μt.

2.4 Martingales, Martingale Difference Sequences, and Mixingales

In time series observations, we know the past but we do not know the future. Therefore, a very important device in time series modeling is to condition sequentially on past events. In a probability space (Ω, F, P), we characterize partial knowledge by specifying a σ-subfield of events from F, for which it is known whether each of the events belonging to it has occurred or not. The accumulation of information over time is represented by an increasing sequence of σ-fields {Ft}_{−∞}^∞, with ... ⊆ F0 ⊆ F1 ⊆ ... ⊆ F. Here the set F has also been referred to as the universal information set. If Xt is known given Ft for each t, then {Ft}_{−∞}^∞ is said to be adapted to the sequence {Xt}_{−∞}^∞, and the pair {Xt, Ft}_{−∞}^∞ is called an adapted sequence. Setting Ft = σ(Xs, −∞ < s ≤ t), i.e., Ft generated by all current and lagged observations of X, we obtain the minimal adapted sequence; Ft defined in this way is also known as the natural filtration.

Given an adapted sequence {Xt, Ft}_{−∞}^∞, if we have

    E|Xt| < ∞,  E(Xt|Ft−1) = Xt−1

for all t, then the sequence is called a martingale. A simple example of a martingale is a random walk.

Example 3 (Random walk) Let

    Xt = Xt−1 + εt,  X0 = 0,  εt ∼ i.i.d.(0, σ²),  Ft = σ(εt, εt−1, ..., ε1).

Then we know that Xt is a martingale, as E|Xt| ≤ Σ_{k=1}^t E|εk| < ∞ and E(Xt|Ft−1) = Xt−1.

Let {Xt, Ft}_{−∞}^∞ be an adapted sequence. Two concepts related to martingales are submartingales, for which E(Xt+1|Ft) ≥ Xt, and supermartingales, for which E(Xt+1|Ft) ≤ Xt.

A sequence {Zt} is known as a martingale difference sequence (mds) if E(Zt|Ft−1) = 0. As you can see, an mds can be constructed from a martingale: let Zt = Xt − Xt−1, where {Xt} is a martingale; then the sequence Zt is an mds. On the other hand, the sum of an mds is a martingale, i.e., {Xt} is a martingale if Xt = Σ_{i=1}^t Zi where Zi is an mds.

Proposition 14 If Xt is an mds, then E(Xt Xt−h) = 0 for all t and h ≠ 0.

Proof: For h > 0, E(Xt Xt−h) = E(Et−h(Xt Xt−h)) = E(Xt−h Et−h(Xt)) = 0.

Remarks: 1. The mds property is stronger than being serially uncorrelated: if Xt is an mds, then we cannot forecast Xt by any linear or nonlinear function of its past realizations. 2. The mds property is weaker than independence, since it does not rule out the possibility that higher moments such as E(Xt²|Ft−1) depend on lagged values of Xt.
Example 4 (mds but not independent) Let εt ∼ i.i.d.(0, σ²); then Xt = εt εt−1 is an mds but not serially independent.
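A quick simulation (my own sketch, with Gaussian εt as an arbitrary concrete choice) makes Example 4 tangible: the levels of Xt = εt εt−1 are serially uncorrelated, yet the squares are positively correlated, so the sequence cannot be independent.

```python
import numpy as np

# X_t = eps_t * eps_{t-1} is an mds: uncorrelated levels, dependent squares.
rng = np.random.default_rng(3)
eps = rng.standard_normal(200_001)
x = eps[1:] * eps[:-1]

corr_levels = np.corrcoef(x[1:], x[:-1])[0, 1]              # ~ 0
corr_squares = np.corrcoef(x[1:] ** 2, x[:-1] ** 2)[0, 1]   # clearly positive
print(corr_levels, corr_squares)
```

For Gaussian εt one can compute Corr(Xt², Xt−1²) = 1/4 analytically, which is what the second number approximates; the first is statistically indistinguishable from zero.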

Another example is the GARCH model. In a GARCH model, the errors form an mds, but the conditional variance of the error depends on past values. Although the mds property is weaker than independence, an mds behaves in many ways just like an independent sequence. In cases where independence is violated, if the sequence is an mds, then we will find that many asymptotic results which hold for independent sequences also hold for mds.

One of the fundamental results in martingale theory is the martingale convergence theorem.

Theorem 9 (Martingale convergence theorem) If {Xt, Ft}_{−∞}^∞ is an L1-bounded submartingale, then Xn →a.s. X where E|X| < ∞. Further, let 1 < p < ∞. If {Xt, Ft}_{−∞}^∞ is a martingale and sup_t E|Xt|^p < ∞, then Xt converges in Lp as well as with probability one.

This is an existence theorem: it tells us that Xn converges to some X, but it does not tell us what X is. Even so, the martingale convergence theorem (MGCT) is a very powerful result.
Example 5 (LLN for heterogeneous mds) Let εt ∼ mds(0, σt²) with sup_t σt² = M < ∞. Define Sn = Σ_{t=1}^n εt/t; then Sn is a martingale with E(S²n) = Σ_{t=1}^n σt²/t². Verify that sup_n E(|Sn|²) ≤ sup_t σt² (Σ_{t=1}^∞ 1/t²) < ∞. Therefore, Sn = Σ_{t=1}^n εt/t converges by the MGCT. Next, let bn = n; then by Kronecker's lemma,

    (1/n) Σ_{t=1}^n εt = (1/n) Σ_{t=1}^n t (εt/t) → 0.
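The two steps of Example 5 can be watched numerically (my own sketch; the bounded variance pattern σt² = 1 + sin²(t) is an arbitrary choice, and the innovations are taken Gaussian and independent, a special case of an mds):

```python
import numpy as np

# Heterogeneous mds: eps_t ~ N(0, sigma_t^2) with bounded, non-constant sigma_t^2.
# S_n = sum eps_t/t converges (MGCT); (1/n) sum eps_t -> 0 (Kronecker's lemma).
rng = np.random.default_rng(4)
n = 1_000_000
t = np.arange(1, n + 1)
sigma = np.sqrt(1.0 + np.sin(t) ** 2)   # bounded variances, sup sigma_t^2 = 2
eps = sigma * rng.standard_normal(n)

s_n = np.cumsum(eps / t)[-1]            # the martingale S_n at n
print(s_n)                              # settles at some finite random limit
print(eps.mean())                       # (1/n) sum eps_t: close to 0
```

The weighted sum Sn stabilizes at a realization-specific limit, while the plain average is driven to zero, exactly the division of labor between the MGCT and Kronecker's lemma.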

A concept similar to the martingale (mds) is the mixingale, which can be regarded as an asymptotic martingale. A sequence of random variables {Xt} with E(Xt) = 0 is called an Lp-mixingale (p ≥ 1) with respect to {Ft} if, for sequences of nonnegative constants ct and ψm, where ψm → 0 as m → ∞, we have

    ‖E(Xt|Ft−m)‖p ≤ ct ψm    (8)
    ‖Xt − E(Xt|Ft+m)‖p ≤ ct ψm+1    (9)

for all t ≥ 1 and m ≥ 0. Intuitively, the mixingale property captures the idea that the sequence {Fs} contains progressively more information about Xt as s increases. In the remote past nothing is known, according to (8): any past event eventually becomes useless for predicting what happens today (at time t). In the future, everything will eventually be known, according to (9). When Xt is Ft-measurable, as in most of the cases we will be interested in, condition (9) always holds (since E(Xt|Ft+m) = Xt). So to check whether a sequence is a mixingale, in many cases we only need to check condition (8). In what follows, we will mostly use L1-mixingales. Condition (8) can then be written as

    E|E(Xt|Ft−m)| ≤ ct ψm.    (10)

As you can see, mixingales are even more general than mds; in fact, an mds is a special kind of mixingale: set ct = E|Xt|, ψ0 = 1, and ψm = 0 for m ≥ 1.
Example 6 Consider a two-sided MA(∞) process,

    Xt = Σ_{j=−∞}^∞ φj εt−j,

where εt is an mds with E|εt| < ∞. Then

    E(Xt|Ft−m) = Σ_{j=m}^∞ φj εt−j.

Take ct = sup_t E|εt| and ψm = Σ_{j=m}^∞ |φj|. If the moving average coefficients are absolutely summable, i.e., Σ_{j=−∞}^∞ |φj| < ∞, then the tails have to go to zero, i.e., ψm → 0. Then condition (10) is satisfied and Xt is an L1-mixingale.

In this example, first, we specify an MA process generated by mds errors, a more general class of stochastic processes than i.i.d. or white noise errors. Second, if E|εt| < ∞ (which controls the tails of εt), then the condition of absolutely summable coefficients makes Xt an L1-mixingale.

2.5 Law of Large Numbers for L1-Mixingales

To derive the law of large numbers for L1-mixingales, we need the notion of uniform integrability.

Definition 9 (Uniformly integrable sequence) A sequence {Xt} is said to be uniformly integrable if for every ε > 0 there exists a number c > 0 such that for all t,

    E(|Xt| 1_{[c,∞)}(|Xt|)) < ε.

We will see how to make use of this notion in a moment. First, we introduce the following two conditions for uniform integrability.

Proposition 15 (Conditions for uniform integrability) (a) A sequence {Xt} is uniformly integrable if there exist an r > 1 and an M < ∞ such that E(|Xt|^r) < M for all t. (b) Let {Xt} be a uniformly integrable sequence, and let Yt = Σ_{k=−∞}^∞ θk Xt−k with Σ_{k=−∞}^∞ |θk| < ∞. Then the sequence {Yt} is also uniformly integrable.
To derive inference for a uniformly integrable sequence, we have the following proposition.

Proposition 16 (Law of large numbers for L1-mixingales) Let {Xt} be an L1-mixingale. If {Xt} is uniformly integrable and there exists a sequence {ct} such that

    lim_{n→∞} (1/n) Σ_{t=1}^n ct < ∞,

then X̄n = (1/n) Σ_{t=1}^n Xt →p 0.

Example 7 (LLN for mds with finite variance) Let {Xt} be an mds with sup_t E|Xt|² = M < ∞. Then it is uniformly integrable and we can take ct = M, and since (1/n) Σ_{t=1}^n ct = M < ∞, by Proposition 16, X̄n →p 0.

We can naturally generalize mixingale sequences to mixingale arrays. An array {Xnt} is said to be an L1-mixingale with respect to {Fnt} if there exist nonnegative constants {cnt} and a nonnegative sequence {ψm} with ψm → 0 as m → ∞ such that

    ‖E(Xnt|Fn,t−m)‖p ≤ cnt ψm    (11)
    ‖Xnt − E(Xnt|Fn,t+m)‖p ≤ cnt ψm+1    (12)

for all t ≥ 1 and m ≥ 0. If the array is uniformly integrable with lim_{n→∞} (1/n) Σ_{t=1}^n cnt < ∞, then X̄n = (1/n) Σ_{t=1}^n Xnt →p 0.

Example 8 Let {εt}_{t=1}^∞ be an mds with E|εt|^r < M for some r > 1 and M < ∞ (i.e., εt is Lr-bounded). Let Xnt = (t/n)εt. Then {Xnt} is a uniformly integrable L1-mixingale with cnt = sup_t E|εt|, ψ0 = 1, and ψm = 0 for m > 0. Applying the LLN for L1-mixingales, we have X̄n → 0.

2.6 Consistent Estimate of Second Moment

In this section, we will show how to prove the consistency of estimates of second moments using the LLN for L1-mixingales. There are two steps in the proof: first, we construct an L1-mixingale; second, we verify that the conditions for applying the LLN are satisfied. This kind of methodology is very useful in many applications. The following proof can also be found on pages 192-193 of Hamilton.

First, we want to construct a mixingale. Our problem is outlined as follows. Let Xt = Σ_{j=0}^∞ φj εt−j, where Σ_{j=0}^∞ |φj| < ∞ and εt is i.i.d. with E|εt|^r < ∞ for some r > 2. We want to prove that

    (1/n) Σ_{t=1}^n Xt Xt−k →p E(Xt Xt−k).

Define Xtk = Xt Xt−k − E(Xt Xt−k); we want to show that Xtk is an L1-mixingale. Write

    Xt Xt−k = (Σ_{i=0}^∞ φi εt−i)(Σ_{j=0}^∞ φj εt−k−j) = Σ_{i=0}^∞ Σ_{j=0}^∞ φi φj εt−i εt−k−j,

    E(Xt Xt−k) = Σ_{i=0}^∞ Σ_{j=0}^∞ φi φj E(εt−i εt−k−j),

so that

    Xtk = Σ_{i=0}^∞ Σ_{j=0}^∞ φi φj (εt−i εt−k−j − E(εt−i εt−k−j)).

Let Ft = σ(εt, εt−1, ...); then

    E(Xtk|Ft−m) = Σ_{i=m}^∞ Σ_{j=m−k}^∞ φi φj (εt−i εt−k−j − E(εt−i εt−k−j)).

Now, we want to find ct and ψm so that condition (10) holds.


    E|E(Xtk|Ft−m)| = E |Σ_{i=m}^∞ Σ_{j=m−k}^∞ φi φj (εt−i εt−k−j − E(εt−i εt−k−j))|
                   ≤ Σ_{i=m}^∞ Σ_{j=m−k}^∞ |φi φj| E|εt−i εt−k−j − E(εt−i εt−k−j)|
                   ≤ Σ_{i=m}^∞ Σ_{j=m−k}^∞ |φi φj| M

for some M < ∞. We take ct = M and

    ψm = Σ_{i=m}^∞ Σ_{j=m−k}^∞ |φi φj| = (Σ_{i=m}^∞ |φi|)(Σ_{j=m−k}^∞ |φj|).

Since φj is absolutely summable, its tails go to zero, i.e., Σ_{i=m}^∞ |φi| → 0 as m → ∞; therefore ψm → 0.

Now we have shown that Xtk is an L1-mixingale. Next, we want to show that it is uniformly integrable and that (1/n) Σ_{t=1}^n ct < ∞. Since ct = M < ∞, the latter condition holds. The uniform integrability can be verified using part (b) of Proposition 15. Therefore, applying the LLN, we have

    (1/n) Σ_{t=1}^n Xtk = (1/n) Σ_{t=1}^n (Xt Xt−k − E(Xt Xt−k)) →p 0,

therefore

    (1/n) Σ_{t=1}^n Xt Xt−k →p E(Xt Xt−k).    (13)
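The conclusion (13) can be spot-checked by simulation (my own sketch; an MA(1) with θ = 0.7 is an arbitrary concrete case of the linear process above). For Xt = εt + θεt−1 with εt ∼ N(0, 1), E(Xt Xt−1) = θ.

```python
import numpy as np

# Check (1/n) sum X_t X_{t-k} ->p E(X_t X_{t-k}) for an MA(1) process at k = 1.
rng = np.random.default_rng(5)
theta, n = 0.7, 500_000
eps = rng.standard_normal(n + 1)
x = eps[1:] + theta * eps[:-1]        # X_t = eps_t + theta*eps_{t-1}

gamma1_hat = np.mean(x[1:] * x[:-1])  # sample second moment at lag 1
print(gamma1_hat)                     # theory: theta = 0.7
```

The sample cross-moment settles on θ, as the mixingale LLN predicts, even though the summands Xt Xt−1 are themselves serially dependent.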

2.7 Central Limit Theorem for Martingale Difference Sequences

We have already seen several versions of the CLT: (1) the CLT for independent identically distributed sequences (Lindeberg-Levy CLT); (2) CLTs for independent non-identically distributed sequences (Lindeberg CLT, Liapunov CLT). Now we consider the conditions for a CLT to hold for a martingale difference sequence. In fact, a CLT holds for any stationary ergodic mds with finite variance:

Proposition 17 Let {Xt} be a stationary and ergodic martingale difference sequence with E(Xt²) = σ² < ∞. Then

    (1/√n) Σ_{t=1}^n Xt →d N(0, σ²).    (14)

Let Sn = Sn−1 + Xn with E(Sn) = 0, which is a martingale with stationary and ergodic differences; then from the above proposition we have n^{−1/2} Sn →d N(0, σ²).

The conditions in the following version of the CLT are usually easy to check in applications:

Proposition 18 (Central Limit Theorem for mds) Let {Xt} be an mds with X̄n = n⁻¹ Σ_{t=1}^n Xt. Suppose that (a) E(Xt²) = σt² > 0 with n⁻¹ Σ_{t=1}^n σt² → σ² > 0, (b) E|Xt|^r < ∞ for some r > 2 and all t, and (c) n⁻¹ Σ_{t=1}^n Xt² →p σ². Then √n X̄n →d N(0, σ²).
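Proposition 17 can be illustrated with the mds of Example 4 (my own sketch; sizes and seed are arbitrary): Xt = εt εt−1 with standard normal εt is stationary, ergodic, and an mds with σ² = 1, so √n X̄n should behave like N(0, 1).

```python
import numpy as np

# CLT for a stationary ergodic mds: X_t = eps_t * eps_{t-1}, E(X_t^2) = 1.
rng = np.random.default_rng(6)
n, reps = 2000, 5000

eps = rng.standard_normal((reps, n + 1))
x = eps[:, 1:] * eps[:, :-1]
stat = np.sqrt(n) * x.mean(axis=1)   # sqrt(n) * sample mean, one per replication

print(stat.mean(), stat.var())       # approximately 0 and 1
```

Even though the Xt are dependent (through shared εt−1 factors), the normalized sum has the same N(0, σ²) limit as in the i.i.d. case.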

Again, this proposition can be extended from sequences {Xt} to mds arrays {Xnt} with E(X²nt) = σ²nt. In our last example in this lecture, we will use the next proposition, which is also a very useful tool.

Proposition 19 Let Xt be a strictly stationary process with E(Xt⁴) < ∞. Let Yt = Σ_{j=0}^∞ φj Xt−j, where Σ_{j=0}^∞ |φj| < ∞. Then Yt is a strictly stationary process with E|Yt Ys Yi Yj| < ∞ for all t, s, i, and j.
Example 9 (Example 7.15 in Hamilton) Let Yt = Σ_{j=0}^∞ φj εt−j with Σ_{j=0}^∞ |φj| < ∞, εt ∼ i.i.d.(0, σ²), and E(εt⁴) < ∞. Then we see that E(Yt) = 0 and E(Yt²) = σ² Σ_{j=0}^∞ φj². Define Xt = εt Yt−k for k > 0. Then Xt is an mds with respect to σ(εt, εt−1, ...), with E(Xt²) = σ² E(Yt−k²) = σ⁴ Σ_{j=0}^∞ φj² (so condition (a) in Proposition 18 is satisfied), and E(Xt⁴) = E(εt⁴ Yt−k⁴) = E(εt⁴) E(Yt⁴) < ∞. Here E(εt⁴) < ∞ by assumption and E(Yt⁴) < ∞ by Proposition 19, so condition (b) in Proposition 18 is also satisfied. The remaining condition we need to verify in order to apply the CLT is condition (c),

    (1/n) Σ_{t=1}^n Xt² →p E(Xt²).

Write

    (1/n) Σ_{t=1}^n Xt² = (1/n) Σ_{t=1}^n εt² Yt−k²
                        = (1/n) Σ_{t=1}^n (εt² − σ²) Yt−k² + σ² (1/n) Σ_{t=1}^n Yt−k².

The first term is a normed sum of an mds with finite variance. To see this,

    E_{t−1}[(εt² − σ²) Yt−k²] = Yt−k² (E_{t−1}(εt²) − σ²) = 0

and

    E[(εt² − σ²)² Yt−k⁴] = E(εt² − σ²)² E(Yt⁴) < ∞.

Then (1/n) Σ_{t=1}^n (εt² − σ²) Yt−k² → 0 (Example 7). By (13), we have

    (1/n) Σ_{t=1}^n Yt−k² →p E(Yt²).

Therefore, we have

    (1/n) Σ_{t=1}^n Xt² →p σ² E(Yt²).

Finally, by Proposition 18, we have

    (1/√n) Σ_{t=1}^n Xt →d N(0, E(Xt²)) = N(0, σ⁴ Σ_{j=0}^∞ φj²).

2.8 Central Limit Theorem for Serially Correlated Sequences

Finally, we present a CLT for serially correlated sequences.


Proposition 20 Let

    Xt = μ + Σ_{j=0}^∞ cj εt−j,

where εt is i.i.d. with E(εt²) < ∞ and Σ_{j=0}^∞ j|cj| < ∞. Then

    √n (X̄n − μ) →d N(0, Σ_{h=−∞}^∞ γ(h)).

To prove this result, we can use a tool known as the BN decomposition together with the Phillips-Solo device. Let

    ut = C(L)εt = Σ_{j=0}^∞ cj εt−j,    (15)

where (a) εt ∼ i.i.d.(0, σ²) and (b) Σ_{j=0}^∞ j|cj| < ∞. The BN decomposition tells us that we can rewrite the lag polynomial as

    C(L) = C(1) + (L − 1)C̃(L),

where C(1) = Σ_{j=0}^∞ cj, C̃(L) = Σ_{j=0}^∞ c̃j L^j, and c̃j = Σ_{k=j+1}^∞ ck. Since we assume that Σ_{j=0}^∞ j|cj| < ∞, we have Σ_{j=0}^∞ |c̃j| < ∞. When C(1) > 0 (the assumption ensures that C(1) < ∞), we can rewrite ut as

    ut = (C(1) + (L − 1)C̃(L)) εt
       = C(1)εt − C̃(L)(εt − εt−1)
       = C(1)εt − (ũt − ũt−1),

where ũt = C̃(L)εt. For example, let ut = εt + θεt−1; then it can be written as ut = (1 + θ)εt − θ(εt − εt−1). In this case, C(1) = 1 + θ, c̃0 = c1 = θ, and ũt = θεt.

Therefore,

    (1/√n) Σ_{t=1}^n ut = C(1) (1/√n) Σ_{t=1}^n εt − (1/√n)(ũn − ũ0).

Clearly, C(1)(1/√n) Σ_{t=1}^n εt → N(0, C(1)²σ²). The variance, denoted λ²u = C(1)²σ², is called the long-run variance of ut. When ut is i.i.d., c0 = 1 and cj = 0 for j > 0, hence c̃j = 0 for j ≥ 0. In that case, the variance and the long-run variance are equal; in general, they differ. Take the MA(1) as another example. Write

    ut = εt + θεt−1 = (1 + θ)εt − θ(εt − εt−1).

Hence, for this process, C(1) = 1 + θ and c̃0 = θ, with c̃j = 0 for j > 0. Note that the variance of ut is γ0 = (1 + θ²)σ², while the long-run variance of ut is λ² = σ²C(1)² = (1 + θ)²σ².

Note that since c̃j is absolutely summable, ũn − ũ0 is bounded in probability, hence

    (1/√n) Σ_{t=1}^n ut = C(1)(1/√n) Σ_{t=1}^n εt + op(1) → N(0, C(1)²σ²).    (16)

You can verify that Σ_{h=−∞}^∞ γx(h) = σ² (Σ_{j=0}^∞ cj)² = C(1)²σ².

This result also applies when εt is a martingale difference sequence satisfying certain moment conditions (Phillips and Solo 1992).
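The distinction between the ordinary variance γ0 = (1 + θ²)σ² and the long-run variance (1 + θ)²σ² is easy to see numerically. A sketch (my own, with the arbitrary choices θ = 0.5 and σ² = 1, so γ0 = 1.25 and λ² = 2.25):

```python
import numpy as np

# For MA(1) u_t = eps_t + theta*eps_{t-1}: Var(u_t) = 1 + theta^2, but the
# variance of sqrt(n)*ubar_n approaches the long-run variance (1 + theta)^2.
rng = np.random.default_rng(7)
theta, n, reps = 0.5, 2000, 5000

eps = rng.standard_normal((reps, n + 1))
u = eps[:, 1:] + theta * eps[:, :-1]

lrv_hat = (np.sqrt(n) * u.mean(axis=1)).var()
print(u.var())    # ~ 1 + theta^2 = 1.25
print(lrv_hat)    # ~ (1 + theta)^2 = 2.25
```

The gap between the two numbers is exactly the 2γ(1) = 2θσ² contribution of the serial correlation, which is why inference about the mean of a correlated series must use the long-run variance.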
Readings: Hamilton (Ch. 7); Davidson (Parts IV and V).

Appendix: Some concepts

Set theory is trivial when a set has a finite number of elements. When a set has infinitely many elements, how to measure its size becomes an interesting problem. Let X denote a set we are interested in, and suppose we want to investigate classes of its subsets. If X has n elements, then the total number of its subsets is 2^n, which can be huge when n is large. And if X has infinitely many elements, specifying the class of all its subsets is even more difficult. Therefore, we introduce some notation for the study of these subsets.

Definition 10 (σ-Field) A σ-field F is a class of subsets of X satisfying

(a) X ∈ F.
(b) If A ∈ F, then A^c ∈ F.
(c) If {An, n ∈ N} is a sequence of F-sets, then ∪_{n=1}^∞ An ∈ F.

So a σ-field is closed under the operations of complementation and countable unions (and hence countable intersections). The smallest σ-field for a set X is {X, ∅}. Let A be a subset of X; the smallest σ-field that contains A is {X, A, A^c, ∅}. So given any set or collection of sets, we can write down the smallest σ-field that contains it. Let C denote a collection of sets; then the smallest σ-field containing C is called the σ-field generated by C.
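For a finite X, the σ-field generated by a collection can be computed by brute force, closing under complements and unions until nothing new appears. A small sketch (my own, purely illustrative; on a finite space countable unions reduce to finite ones):

```python
def generated_sigma_field(space, collection):
    """Smallest sigma-field on a finite `space` containing every set in `collection`.

    Repeatedly closes under complementation and pairwise union until stable;
    on a finite space this terminates with the generated sigma-field.
    """
    space = frozenset(space)
    sets = {frozenset(), space} | {frozenset(a) for a in collection}
    while True:
        new = {space - a for a in sets} | {a | b for a in sets for b in sets}
        if new <= sets:
            return sets
        sets |= new

X = {1, 2, 3, 4}
A = {1, 2}
print(sorted(map(sorted, generated_sigma_field(X, [A]))))
# four sets: the empty set, A, A^c, and X, as in the text
```

Running it on a single subset A reproduces {X, A, A^c, ∅} exactly as stated above.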
A measure is a nonnegative countably additive set function; it associates a real number with a set.

Definition 11 (Measure) Given a class F of subsets of a set Ω, a measure μ : F ↦ R̄+ is a function satisfying

(a) μ(A) ≥ 0 for all A ∈ F.
(b) μ(∅) = 0.
(c) For a countable collection {Aj ∈ F, j ∈ N} with Aj ∩ Al = ∅ for j ≠ l and ∪j Aj ∈ F,

    μ(∪j Aj) = Σj μ(Aj).

A measurable space is a pair (Ω, F) where Ω is any collection of objects and F is a σ-field of subsets of Ω. Let (Ω, F) and (Ψ, G) be two measurable spaces, and let T be a mapping T : Ω ↦ Ψ. T is said to be measurable if T⁻¹(B) ∈ F for all B ∈ G. The idea is that a measure μ defined on (Ω, F) can be mapped into (Ψ, G): every event B ∈ G is assigned a measure, denoted ν, with ν(B) = μ(T⁻¹(B)).

When the set we are interested in is the real line R, the σ-field generated by the open sets is called the Borel field, denoted B, and its elements are the Borel sets. Let μ denote the Lebesgue measure; it is the only measure on R with μ((a, b]) = b − a.
We usually use (Ω, F, P) to denote a probability space. Ω is the sample space, the set of all possible outcomes of the experiment, and individual elements of Ω are denoted by ω. F is the σ-field of subsets of Ω. The event A ∈ F is said to have occurred if the outcome of the experiment is an element of A. A measure P is assigned to the elements of F with P(Ω) = 1, and P(A) is the probability of A. For example, in an experiment of tossing a coin, we can define Ω = {head, tail} and F = {∅, {head}, {tail}, {head, tail}}, and we can assign a probability to each element of F: P(∅) = 0, P({head}) = 1/2, P({tail}) = 1/2, and P({head, tail}) = 1. Formally, the probability measure is defined as follows.

Definition 12 A probability measure on a measurable space (Ω, F) is a set function P : F ↦ [0, 1] satisfying the axioms of probability:

(a) P(A) ≥ 0 for all A ∈ F.
(b) P(Ω) = 1.
(c) Countable additivity: for a disjoint collection {Aj ∈ F, j ∈ N},

    P(∪j Aj) = Σj P(Aj).

We can define a random variable on a probability space: if the mapping X : Ω ↦ R is F-measurable, then X is a real-valued random variable on Ω. For example, if Ω is a discrete probability space, as in our example of tossing a coin, then any function X : Ω ↦ R is a random variable.

Let (Ω, F, P) be a probability space. The transformation T : Ω ↦ Ω is measure-preserving if it is measurable and P(A) = P(T⁻¹A) for all A ∈ F. A shift transformation T for a sequence {Xt(ω)} is defined by Xt(Tω) = Xt+1(ω); a shift transformation works like a lag operator. If the shift transformation T is measure-preserving, then the sequences {Xt}_{t=1}^∞ and {Xt+k}_{t=1}^∞ have the same joint distribution for every k > 0. Therefore, we can see that when the shift transformation T is measure-preserving, the process is strictly stationary.

Lecture 5: Linear Regressions

In lecture 2, we introduced stationary linear time series models. There we discussed the data generating processes and their characteristics, assuming that all parameters (autoregressive or moving average coefficients) are known. In empirical studies, however, we have to specify an econometric model, estimate it, and draw inferences based on the estimates. This lecture provides an introduction to parametric estimation of a linear model with time series observations. Three commonly used estimation methods are least squares (LS), maximum likelihood estimation (MLE), and the generalized method of moments (GMM). In this lecture, we discuss LS and MLE.

Least Square Estimation

Least squares (LS) estimation is one of the first techniques we learn in econometrics. It is both intuitive and easy to implement, and the famous Gauss-Markov theorem tells us that under certain assumptions, the ordinary least squares (OLS) estimator is the best linear unbiased estimator (BLUE). We will start with a review of classical LS estimation and then consider estimation under progressively relaxed assumptions.
Below we set out the notation for this lecture and the basic algebra of LS estimation. Consider the regression

    yt = x′t β0 + ut,  t = 1, ..., n,    (1)

where xt is a k × 1 vector and β0, also a k × 1 vector, is the true parameter. Then the OLS estimator of β0, denoted β̂n, is

    β̂n = (Σ_{t=1}^n xt x′t)⁻¹ (Σ_{t=1}^n xt yt),    (2)

and the OLS sample residual is

    ût = yt − x′t β̂n.
Sometimes it is more convenient to work in matrix form. Define

    Yn = (y1, y2, ..., yn)′,  Xn = (x′1; x′2; ...; x′n),  Un = (u1, u2, ..., un)′,

so that Yn and Un are n × 1 and Xn is n × k with rows x′t. Then the regression can be written as

    Yn = Xn β0 + Un,    (3)

and the OLS estimator can be written as

    β̂n = (X′n Xn)⁻¹ X′n Yn.    (4)

Define

    MX = In − Xn (X′n Xn)⁻¹ X′n.

(∗ Copyright 2002-2006 by Ling Hu.)

It is easy to see that MX is symmetric, idempotent (MX MX = MX), and orthogonal to the columns of Xn. Then we have

    Ûn = Yn − Xn β̂n = MX Yn.
To derive the distribution of the estimator β̂n, write

    β̂n = (X′n Xn)⁻¹ X′n Yn = (X′n Xn)⁻¹ X′n (Xn β0 + Un) = β0 + (X′n Xn)⁻¹ X′n Un.    (5)

Therefore, the properties of β̂n depend on (X′n Xn)⁻¹ X′n Un. For example, if E[(X′n Xn)⁻¹ X′n Un] = 0, then β̂n is an unbiased estimator.

1.1 Case 1: OLS with deterministic regressors and i.i.d. Gaussian errors

Assumption 1 (a) xt is deterministic; (b) ut ∼ i.i.d.(0, σ²); (c) ut ∼ i.i.d. N(0, σ²).

Under Assumption 1 (a) and (b), E(Un) = 0 and E(Un U′n) = σ² In. Then from (5) we have

    E(β̂n) = β0 + (X′n Xn)⁻¹ X′n E(Un) = β0,

and

    E[(β̂n − β0)(β̂n − β0)′] = E[(X′n Xn)⁻¹ X′n Un U′n Xn (X′n Xn)⁻¹]
                            = (X′n Xn)⁻¹ X′n E(Un U′n) Xn (X′n Xn)⁻¹
                            = σ² (X′n Xn)⁻¹.

Under these assumptions, the Gauss-Markov theorem tells us that the OLS estimator β̂n is the best linear unbiased estimator of β0. The OLS estimator of σ² is

    s²n = Û′n Ûn / (n − k) = U′n MX MX Un / (n − k) = U′n MX Un / (n − k).    (6)

Since $M_X$ is symmetric, there exists an $n \times n$ matrix $P$ such that
\[ M_X = P\Lambda P' \quad \text{and} \quad P'P = I_n, \]
where $\Lambda$ is an $n \times n$ matrix with the eigenvalues of $M_X$ along the principal diagonal and zeros
elsewhere. From the properties of $M_X$ we can compute that $\Lambda$ contains $k$ zeros and $n-k$ ones along
its principal diagonal. Then
\[ RSS = U_n'M_XU_n = U_n'P\Lambda P'U_n = (P'U_n)'\Lambda(P'U_n) = W_n'\Lambda W_n = \sum_{t=1}^n \lambda_t w_t^2, \]
where $W_n = P'U_n$. Then $E(W_nW_n') = P'E(U_nU_n')P = \sigma^2 I_n$; therefore, the $w_t$ are uncorrelated with
mean 0 and variance $\sigma^2$. Therefore,
\[ E(U_n'M_XU_n) = \sum_{t=1}^n \lambda_t E(w_t^2) = (n-k)\sigma^2. \]
So the $s_n^2$ defined in (6) is an unbiased estimator for $\sigma^2$: $E(s_n^2) = \sigma^2$.

With the Gaussian assumption (c), $\hat\beta_n$ is also Gaussian,
\[ \hat\beta_n \sim N(\beta_0, \sigma^2(X_n'X_n)^{-1}). \]
Note that here $\hat\beta_n$ is exactly normal, while many of the estimators in our later discussions are only
asymptotically normal. Actually, under Assumption 1, the OLS estimator is optimal. Also, with the
Gaussian assumption, $w_t$ is i.i.d. $N(0, \sigma^2)$. Therefore we have
\[ U_n'M_XU_n/\sigma^2 \sim \chi^2(n-k). \]
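The algebra in (2), (4), and (6) can be sketched in a few lines of code. This is a minimal illustration only; the simulated design matrix, sample size, and parameter values are assumptions, not taken from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])  # X_n, n by k
beta0 = np.array([1.0, 2.0, -0.5])
u = rng.normal(scale=1.5, size=n)
y = X @ beta0 + u

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # OLS estimator, eq. (4)
u_hat = y - X @ beta_hat                        # OLS residuals
s2 = u_hat @ u_hat / (n - k)                    # unbiased variance estimator, eq. (6)

# M_X from the text: symmetric, idempotent, and annihilates the columns of X
M = np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)
```

Note that `np.linalg.solve` is used rather than forming $(X'X)^{-1}$ explicitly, which is the numerically preferred way of solving the normal equations.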

1.2 Case 2: OLS with stochastic regressors and i.i.d. Gaussian errors

The assumption of deterministic regressors is very strong for empirical studies in economics. Some
examples of deterministic regressors are constants and deterministic trends (i.e., $x_t = (1, t, t^2, \ldots)$).
However, most data we have for econometric regressions are stochastic. Therefore, from this subsection
on, we will allow the regressors to be stochastic. However, in Case 2 and Case 3, we assume that
$x_t$ is independent of the errors (at all leads and lags). This is still too strong in time series, as it rules out
many processes, including ARMA models.

Assumption 2 (a) $x_t$ is stochastic and independent of $u_s$ for all $t, s$; (b) $u_t \sim \text{i.i.d.}\,N(0, \sigma^2)$.

This assumption can be equivalently written as $U_n|X_n \sim N(0, \sigma^2 I_n)$. Under these assumptions,
$\hat\beta_n$ is still unbiased:
\[ E(\hat\beta_n) = \beta_0 + E[(X_n'X_n)^{-1}X_n']E(U_n) = \beta_0. \]
Conditional on $X_n$, $\hat\beta_n$ is normal: $\hat\beta_n|X_n \sim N(\beta_0, \sigma^2(X_n'X_n)^{-1})$. To get the unconditional
probability distribution of $\hat\beta_n$, we have to integrate this conditional density over $X$. Therefore, the
unconditional distribution of $\hat\beta_n$ will depend on the distribution of $X$. However, we still have the
unconditional distribution for the estimate of the variance: $U_n'M_XU_n/\sigma^2 \sim \chi^2(n-k)$.

1.3 Case 3: OLS with stochastic regressors and i.i.d. non-Gaussian errors

Compared to Case 2, in this section we let the error terms follow an arbitrary i.i.d. distribution
with finite fourth moments. Since this is an arbitrary unknown distribution, it is very hard to obtain
the exact (finite sample) distribution of $\hat\beta_n$; instead, we will apply asymptotic theory to this
problem.

Assumption 3 (a) $x_t$ is stochastic and independent of $u_s$ for all $t, s$; (b) $u_t \sim \text{i.i.d.}(0, \sigma^2)$, and
$E(u_t^4) = \mu_4 < \infty$; (c) $E(x_tx_t') = Q_t$, a positive definite matrix, with $(1/n)\sum_{t=1}^n Q_t \to Q$, a positive
definite matrix; (d) $E(x_{it}x_{jt}x_{kt}x_{lt}) < \infty$ for all $i, j, k, l$ and $t$; (e) $(1/n)\sum_{t=1}^n (x_tx_t') \to_p Q$.

With assumption (a), we still have that $\hat\beta_n$ is an unbiased estimator for $\beta_0$. Assumptions (c) to
(e) are restrictions on $x_t$. Basically, we want to have $(1/n)\sum_{t=1}^n x_tx_t' \to_p \lim_n (1/n)\sum_{t=1}^n E(x_tx_t')$.
We have
\[ \hat\beta_n - \beta_0 = \left[\sum_{t=1}^n x_tx_t'\right]^{-1}\left[\sum_{t=1}^n x_tu_t\right] = \left[(1/n)\sum_{t=1}^n x_tx_t'\right]^{-1}\left[(1/n)\sum_{t=1}^n x_tu_t\right]. \]
From the assumptions and the continuous mapping theorem, we have
\[ \left[(1/n)\sum_{t=1}^n x_tx_t'\right]^{-1} \to_p Q^{-1}. \]
$x_tu_t$ is a martingale difference sequence with finite variance, so by the LLN for mixingales, we
have
\[ (1/n)\sum_{t=1}^n x_tu_t \to_p 0. \]

Therefore, $\hat\beta_n \to_p \beta_0$, so $\hat\beta_n$ is a consistent estimator. Next, we will derive the distribution for
it. This is the first time we derive an asymptotic distribution for an OLS estimator. The routine in
deriving the asymptotic distribution of $\hat\beta_n$ is as follows: first we apply the LLN to the term
$\sum_{t=1}^n x_tx_t'$, after properly norming it (so that the limit is a constant); then we apply the continuous mapping
theorem to get the limit of $[\sum_{t=1}^n x_tx_t']^{-1}$. We already obtained this in the above proof of consistency
of $\hat\beta_n$. Then we apply a CLT to the term $\sum_{t=1}^n x_tu_t$, also after properly norming it (so that the limit
is nondegenerate).

Note that $E(x_tx_t'u_t^2) = \sigma^2Q_t$ and $(1/n)\sum_{t=1}^n \sigma^2Q_t \to \sigma^2Q$. By the CLT for mds, we have
\[ (1/\sqrt n)\sum_{t=1}^n x_tu_t \to_d N(0, \sigma^2Q). \]
Therefore,
\[ \sqrt n(\hat\beta_n - \beta_0) = \left[(1/n)\sum_{t=1}^n x_tx_t'\right]^{-1}\left[(1/\sqrt n)\sum_{t=1}^n x_tu_t\right] \to_d N(0, Q^{-1}(\sigma^2Q)Q^{-1}) = N(0, \sigma^2Q^{-1}), \]
so $\hat\beta_n$ approximately follows
\[ \hat\beta_n \sim N(\beta_0, \sigma^2Q^{-1}/n). \]
Note that this distribution is not exact, but approximate. So we should read it as "approximately
distributed as normal."
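The approximation $\sqrt n(\hat\beta_n - \beta_0) \to_d N(0, \sigma^2Q^{-1})$ can be checked by simulation. A small Monte Carlo sketch follows; the choice of $Q$, the error variance, and the replication count are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)
n, reps, sigma = 200, 2000, 1.0
Q = np.array([[1.0, 0.3], [0.3, 1.0]])
Lc = np.linalg.cholesky(Q)
beta0 = np.array([1.0, -1.0])

draws = np.empty((reps, 2))
for r in range(reps):
    X = rng.normal(size=(n, 2)) @ Lc.T      # regressors with E(x_t x_t') = Q
    u = rng.normal(scale=sigma, size=n)
    y = X @ beta0 + u
    b = np.linalg.solve(X.T @ X, X.T @ y)
    draws[r] = np.sqrt(n) * (b - beta0)

# sample covariance of sqrt(n)(b_hat - beta0) should be close to sigma^2 Q^{-1}
emp_cov = np.cov(draws.T)
theory = sigma**2 * np.linalg.inv(Q)
```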

To compute this variance, we need to know $\sigma^2$. When it is unknown, the OLS estimator $s_n^2$ is
still consistent under Assumption 3. We have
\begin{align*}
u_t^2 = (y_t - x_t'\beta_0)^2 &= [y_t - x_t'\hat\beta_n + x_t'(\hat\beta_n - \beta_0)]^2 \\
&= (y_t - x_t'\hat\beta_n)^2 + 2(y_t - x_t'\hat\beta_n)x_t'(\hat\beta_n - \beta_0) + [x_t'(\hat\beta_n - \beta_0)]^2.
\end{align*}
By the LLN, we have $(1/n)\sum_{t=1}^n u_t^2 \to \sigma^2$. There are three terms in the above equation. For the
second term, we have
\[ (1/n)\sum_{t=1}^n (y_t - x_t'\hat\beta_n)x_t'(\hat\beta_n - \beta_0) = 0 \]

as $(y_t - x_t'\hat\beta_n)$ is orthogonal to $x_t$. For the third term,
\[ (\hat\beta_n - \beta_0)'\left[(1/n)\sum_{t=1}^n x_tx_t'\right](\hat\beta_n - \beta_0) \to_p 0 \]
as $\hat\beta_n - \beta_0$ is $o_p(1)$ and $(1/n)\sum_{t=1}^n x_tx_t' \to Q$. Therefore, we can define
\[ \hat\sigma_n^2 = (1/n)\sum_{t=1}^n (y_t - x_t'\hat\beta_n)^2, \]

and we have
\[ \hat\sigma_n^2 = \frac1n\sum_{t=1}^n (y_t - x_t'\hat\beta_n)^2 = \frac1n\sum_{t=1}^n u_t^2 - \frac1n\sum_{t=1}^n [x_t'(\hat\beta_n - \beta_0)]^2. \]
This estimator is only slightly different from $s_n^2$ ($\hat\sigma_n^2 = (n-k)s_n^2/n$). Since $(n-k)/n \to 1$ as
$n \to \infty$, if $\hat\sigma_n^2$ is consistent, so is $s_n^2$.

Next, we derive the distribution of $\hat\sigma_n^2$. Write
\[ \sqrt n(\hat\sigma_n^2 - \sigma^2) = \frac1{\sqrt n}\sum_{t=1}^n (u_t^2 - \sigma^2) - \frac1{\sqrt n}\sum_{t=1}^n [x_t'(\hat\beta_n - \beta_0)]^2. \]
The second term goes to zero: it equals $\sqrt n(\hat\beta_n - \beta_0)'[(1/n)\sum_{t=1}^n x_tx_t'](\hat\beta_n - \beta_0)$, and
$(1/n)\sum_{t=1}^n x_tx_t' \to_p Q$ while $\hat\beta_n - \beta_0 \to_p 0$. Define $z_t = u_t^2 - \sigma^2$; then $z_t$ is i.i.d.
with mean zero and variance $E(u_t^4) - \sigma^4 = \mu_4 - \sigma^4$. Applying the CLT, we have
\[ \frac1{\sqrt n}\sum_{t=1}^n z_t \to_d N(0, \mu_4 - \sigma^4), \]
therefore
\[ \sqrt n(\hat\sigma_n^2 - \sigma^2) \to_d N(0, \mu_4 - \sigma^4). \]
The same limit distribution applies for $s_n^2$, since the difference between $\hat\sigma_n^2$ and $s_n^2$ is $o_p(n^{-1/2})$.

1.4 Case 4: OLS estimation in autoregressions with i.i.d. errors

In an autoregression, say $x_t = \phi_0 x_{t-1} + \epsilon_t$, where $\epsilon_t$ is i.i.d., the regressors are no longer
independent of $\epsilon_t$ at all leads and lags. In this case, the OLS estimator of $\phi_0$ is biased. However, we will
show that under Assumption 4, the estimator is consistent.

Assumption 4 The regression model is
\[ y_t = c + \phi_1y_{t-1} + \phi_2y_{t-2} + \cdots + \phi_py_{t-p} + \epsilon_t, \]
with the roots of $(1 - \phi_1z - \phi_2z^2 - \cdots - \phi_pz^p) = 0$ outside the unit circle (so that $y_t$ is stationary) and with
$\epsilon_t$ i.i.d. with mean zero, variance $\sigma^2$, and finite fourth moment $\mu_4$.

Pages 215-216 in Hamilton present the general AR(p) case with constant. We will use an AR(2)
as an example: $y_t = \phi_1y_{t-1} + \phi_2y_{t-2} + \epsilon_t$. Let $x_t' = (y_{t-1}, y_{t-2})$, $u_t = \epsilon_t$, and $y_t = x_t'\beta_0 + u_t$ (so
$\beta_0' = (\phi_1, \phi_2)$). Then
\[ \sqrt n(\hat\beta_n - \beta_0) = \left[(1/n)\sum_{t=1}^n x_tx_t'\right]^{-1}\left[(1/\sqrt n)\sum_{t=1}^n x_tu_t\right]. \tag{7} \]

The first term is
\[ (1/n)\sum_{t=1}^n x_tx_t' = (1/n)\begin{bmatrix} \sum_{t=1}^n y_{t-1}^2 & \sum_{t=1}^n y_{t-1}y_{t-2} \\ \sum_{t=1}^n y_{t-1}y_{t-2} & \sum_{t=1}^n y_{t-2}^2 \end{bmatrix}. \]
In this matrix, on the diagonal, $n^{-1}\sum_{t=1}^n y_{t-j}^2$ converges to $\gamma_0 = E(y_t^2)$, while the remaining term
$n^{-1}\sum_{t=1}^n y_{t-1}y_{t-2}$ converges to $\gamma_1 = E(y_ty_{t-1})$. Therefore,
\[ (1/n)\sum_{t=1}^n x_tx_t' \to_p Q = \begin{bmatrix} \gamma_0 & \gamma_1 \\ \gamma_1 & \gamma_0 \end{bmatrix}. \]
Applying the CLT for mds to the second term in (7),
\[ (1/\sqrt n)\sum_{t=1}^n x_tu_t \to_d N(0, \sigma^2Q), \]
therefore,
\[ \sqrt n(\hat\beta_n - \beta_0) \to_d N(0, \sigma^2Q^{-1}). \]
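Consistency of OLS in the AR(2) example can be illustrated with a short simulation. The coefficient values below are assumptions chosen to satisfy the stationarity condition in Assumption 4.

```python
import numpy as np

rng = np.random.default_rng(9)
n = 2000
phi1, phi2 = 0.5, -0.3   # roots of 1 - 0.5z + 0.3z^2 lie outside the unit circle
y = np.zeros(n)
for t in range(2, n):
    y[t] = phi1 * y[t - 1] + phi2 * y[t - 2] + rng.normal()

# OLS of y_t on (y_{t-1}, y_{t-2}): biased in finite samples but consistent
X = np.column_stack([y[1:-1], y[:-2]])
b = np.linalg.solve(X.T @ X, X.T @ y[2:])
```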

So far we have considered four cases of OLS regression. The common assumption in all
four cases is i.i.d. errors. From the next section on, we consider cases where the errors are not i.i.d.

1.5 OLS with non-i.i.d. errors

When the error $u_t$ is i.i.d., the variance-covariance matrix is $V = E(U_nU_n') = \sigma^2I_n$. If $V$ is
still diagonal but the elements are not equal (for example, the errors on some dates display larger
variance than on other dates), then the errors are said to exhibit
heteroskedasticity. If $V$ is non-diagonal, then the errors are said to be autocorrelated. For example,
let $u_t = \epsilon_t - \theta\epsilon_{t-1}$ where $\epsilon_t$ is i.i.d.; then $u_t$ is a serially correlated error.

Case 5 in Hamilton assumes

Assumption 5 (a) $x_t$ is stochastic; (b) conditional on the full matrix $X$, the vector $U \sim N(0, \sigma^2V)$;
(c) $V$ is a known positive definite matrix.

Under these assumptions, the exact distribution of $\hat\beta_n$ can be derived. However, this is a very
strong assumption, and it rules out autoregressive regressions. Also, the assumption that $V$ is
known rarely holds in applications.

Case 6 in Hamilton assumes uncorrelated but heteroskedastic errors with unknown covariance
matrix. Under Assumption 6, the OLS estimator is still consistent and asymptotically normal.

Assumption 6 (a) $x_t$ is stochastic, including perhaps lagged values of $y$; (b) $x_tu_t$ is a martingale
difference sequence; (c) $E(u_t^2x_tx_t') = \Omega_t$, a positive definite matrix, with $(1/n)\sum_{t=1}^n \Omega_t \to_p \Omega$
and $(1/n)\sum_{t=1}^n u_t^2x_tx_t' \to_p \Omega$; (d) $E(u_t^4x_{it}x_{jt}x_{lt}x_{kt}) < \infty$ for all $i, j, k, l$ and $t$; (e) the plims of
$(1/n)\sum_{t=1}^n u_tx_{it}x_tx_t'$ and $(1/n)\sum_{t=1}^n x_{it}x_{jt}x_tx_t'$ exist and are finite for all $i, j$, and $(1/n)\sum_{t=1}^n x_tx_t' \to_p Q$,
a nonsingular matrix.
Again, write the OLS estimator as
\[ \sqrt n(\hat\beta_n - \beta_0) = \left[(1/n)\sum_{t=1}^n x_tx_t'\right]^{-1}\left[(1/\sqrt n)\sum_{t=1}^n x_tu_t\right]. \]
Assumption 6 (e) ensures that
\[ (1/n)\sum_{t=1}^n x_tx_t' \to_p Q. \]
Applying the CLT for mds,
\[ (1/\sqrt n)\sum_{t=1}^n x_tu_t \to_d N(0, \Omega), \]
therefore,
\[ \sqrt n(\hat\beta_n - \beta_0) \to_d N(0, Q^{-1}\Omega Q^{-1}). \]

However, both $Q$ and $\Omega$ are not observable, and we need to find consistent estimates
of them. White proposes the estimators $\hat Q_n = (1/n)\sum_{t=1}^n x_tx_t'$ and $\hat\Omega_n = (1/n)\sum_{t=1}^n \hat u_t^2x_tx_t'$, where
$\hat u_t$ is the OLS residual $y_t - x_t'\hat\beta_n$.

Proposition 1 With heteroskedasticity of unknown form satisfying Assumption 6, the asymptotic
variance-covariance matrix of the OLS coefficient vector can be consistently estimated by
\[ \hat Q_n^{-1}\hat\Omega_n\hat Q_n^{-1} \to_p Q^{-1}\Omega Q^{-1}. \tag{8} \]

Proof: Assumption 6 (e) ensures $\hat Q_n \to Q$, and Assumption 6 (c) ensures that
\[ \tilde\Omega_n \equiv (1/n)\sum_{t=1}^n u_t^2x_tx_t' \to_p \Omega. \]
So to prove (8), we only need to show that
\[ \hat\Omega_n - \tilde\Omega_n = (1/n)\sum_{t=1}^n (\hat u_t^2 - u_t^2)x_tx_t' \to 0. \]


The trick here is to make use of the known fact that $\hat\beta_n - \beta_0 \to_p 0$. If we can write $\hat\Omega_n - \tilde\Omega_n$ as
sums of products of $\hat\beta_n - \beta_0$ and terms that are bounded, then $\hat\Omega_n - \tilde\Omega_n \to_p 0$. Write
\begin{align*}
\hat u_t^2 - u_t^2 = (\hat u_t + u_t)(\hat u_t - u_t) &= [2(y_t - \beta_0'x_t) - (\hat\beta_n - \beta_0)'x_t][-(\hat\beta_n - \beta_0)'x_t] \\
&= -2u_t(\hat\beta_n - \beta_0)'x_t + [(\hat\beta_n - \beta_0)'x_t]^2.
\end{align*}
Then
\[ \hat\Omega_n - \tilde\Omega_n = (-2/n)\sum_{t=1}^n u_t(\hat\beta_n - \beta_0)'x_t(x_tx_t') + (1/n)\sum_{t=1}^n [(\hat\beta_n - \beta_0)'x_t]^2(x_tx_t'). \]
Write the first term as
\[ (-2/n)\sum_{t=1}^n u_t(\hat\beta_n - \beta_0)'x_t(x_tx_t') = -2\sum_{i=1}^k (\hat\beta_{in} - \beta_{i0})\left[(1/n)\sum_{t=1}^n u_tx_{it}(x_tx_t')\right]. \]
The term in brackets has a finite plim by Assumption 6 (e), and we have $\hat\beta_{in} - \beta_{i0} \to 0$ for
each $i$, so this term converges to zero. (If this looks messy, take $k = 1$; then you can simply
move $(\hat\beta_n - \beta_0)$ out of the summation: $\hat\beta_n - \beta_0 \to_p 0$ and the sum has a finite plim, so the product
goes to zero.)

Similarly, for the second term,
\[ (1/n)\sum_{t=1}^n [(\hat\beta_n - \beta_0)'x_t]^2(x_tx_t') = \sum_{i=1}^k\sum_{j=1}^k (\hat\beta_{in} - \beta_{i0})(\hat\beta_{jn} - \beta_{j0})\left[(1/n)\sum_{t=1}^n x_{it}x_{jt}(x_tx_t')\right] \to_p 0, \]
as the term in brackets has a finite plim. Therefore, $\hat\Omega_n - \tilde\Omega_n \to 0$.
Define $\hat V_n = \hat Q_n^{-1}\hat\Omega_n\hat Q_n^{-1}$; then
\[ \hat\beta_n \approx N(\beta_0, \hat V_n/n), \]
and $\hat V_n/n$ is a heteroskedasticity-consistent estimate of the variance-covariance matrix. Newey and West
propose the following estimator for the variance-covariance matrix, which is heteroskedasticity and
autocorrelation consistent (HAC):
\[ \hat V_n/n = (X_n'X_n)^{-1}\left[\sum_{t=1}^n \hat u_t^2x_tx_t' + \sum_{k=1}^q\left(1 - \frac{k}{q+1}\right)\sum_{t=k+1}^n \left(x_t\hat u_t\hat u_{t-k}x_{t-k}' + x_{t-k}\hat u_{t-k}\hat u_tx_t'\right)\right](X_n'X_n)^{-1}. \]
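White's estimator and the Newey-West HAC estimator can be sketched as follows. This is an illustrative implementation under assumed simulated data; the heteroskedastic design and the lag truncation $q = 4$ are choices made for the example, not prescriptions from the text.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
u = rng.normal(size=n) * (0.5 + np.abs(x))     # heteroskedastic errors
y = X @ np.array([1.0, 2.0]) + u

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
u_hat = y - X @ beta_hat

# White: V_hat = Q_hat^{-1} Omega_hat Q_hat^{-1}
Q_hat = X.T @ X / n
Omega_hat = (X * (u_hat**2)[:, None]).T @ X / n
V_white = np.linalg.solve(Q_hat, Omega_hat) @ np.linalg.inv(Q_hat)
se_white = np.sqrt(np.diag(V_white) / n)

# Newey-West HAC with q lags and Bartlett weights 1 - k/(q+1)
q = 4
S = (X * (u_hat**2)[:, None]).T @ X
for k in range(1, q + 1):
    w = 1 - k / (q + 1)
    # sum over t of x_t u_t u_{t-k} x_{t-k}'
    Gamma = (X[k:] * u_hat[k:, None]).T @ (X[:-k] * u_hat[:-k, None])
    S += w * (Gamma + Gamma.T)
XtX_inv = np.linalg.inv(X.T @ X)
V_hac = XtX_inv @ S @ XtX_inv     # this is V_hat/n directly
se_hac = np.sqrt(np.diag(V_hac))
```

In practice one would typically reach for a library implementation (e.g. the HAC covariance options in statsmodels) rather than hand-rolling this, but the sketch makes the sandwich structure of the formula explicit.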

1.6 Generalized least squares

Generalized least squares (GLS) and feasible generalized least squares (FGLS) are preferred in least squares
estimation when the errors are heteroskedastic and/or autocorrelated.

Let $x_t$ be stochastic and $U|X \sim N(0, \sigma^2V)$ where $V$ is known (Assumption 5). Since $V$ is
symmetric and positive definite, there exists a matrix $L$ such that $V^{-1} = L'L$. Premultiplying our
regression by $L$ gives
\[ LY = LX\beta_0 + LU. \]
Then the new error $\tilde U = LU$ is i.i.d. conditional on $X$:
\[ E(\tilde U\tilde U'|X) = LE(UU'|X)L' = \sigma^2LVL' = \sigma^2I_n. \]
Then the estimator
\[ \tilde\beta_n = (X'L'LX)^{-1}X'L'LY = (X'V^{-1}X)^{-1}X'V^{-1}Y \]
is known as the generalized least squares estimator.
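A sketch of the GLS transformation follows. A diagonal known $V$ is assumed for illustration; for a general positive definite $V$ one would take $L$ from a Cholesky factorization of $V^{-1}$.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 300
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta0 = np.array([1.0, -0.5])

# known diagonal V: heteroskedastic but uncorrelated errors
v = 0.5 + rng.uniform(size=n)            # diagonal of V
u = rng.normal(size=n) * np.sqrt(v)
y = X @ beta0 + u

# GLS: beta_tilde = (X' V^{-1} X)^{-1} X' V^{-1} y.
# With V^{-1} = L'L (here L = diag(1/sqrt(v))), this is just OLS on (LX, Ly).
L = np.diag(1.0 / np.sqrt(v))
Xs, ys = L @ X, L @ y
beta_gls = np.linalg.solve(Xs.T @ Xs, Xs.T @ ys)
```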


However, as we remarked earlier, in applications, V is rarely known and we have estimate it.
The GLS estimator obtained using estimated V is known as feasible GLS estimator. Usually, FGLS
require that we specify a parametric model for the error. For example, let the error ut follow an
AR(1) process, ut = 0 ut 1 + t where t i.i.d.(0, 2 ). In this case, we can run OLS first and
obtain the OLS residual u
t . Then run OLS estimation for using the u
t . This estimator, denoted
by n , is consistent estimator for . To show this, write
\[ \hat u_t = y_t - \hat\beta_n'x_t = u_t + (\beta_0 - \hat\beta_n)'x_t. \]
Then
\[ \frac1n\sum_{t=2}^n \hat u_t\hat u_{t-1} = \frac1n\sum_{t=2}^n [u_t + (\beta_0 - \hat\beta_n)'x_t][u_{t-1} + (\beta_0 - \hat\beta_n)'x_{t-1}] = \frac1n\sum_{t=2}^n u_tu_{t-1} + o_p(1), \]
and since $u_t = \rho_0u_{t-1} + \epsilon_t$,
\[ \frac1n\sum_{t=2}^n u_tu_{t-1} = \frac1n\sum_{t=2}^n (\epsilon_t + \rho_0u_{t-1})u_{t-1} \to_p \rho_0\operatorname{var}(u_t). \]
Similarly, we can show that $\frac1n\sum_{t=1}^n \hat u_t^2 \to_p \operatorname{var}(u_t)$; hence $\hat\rho_n \to \rho_0$. Still using similar methods, we
can show that
\[ \frac1{\sqrt n}\sum_{t=2}^n \hat u_t\hat u_{t-1} = \frac1{\sqrt n}\sum_{t=2}^n u_tu_{t-1} + o_p(1), \]
and hence
\[ \sqrt n(\hat\rho_n - \rho_0) \to_d N(0, 1 - \rho_0^2). \]
Finally, the FGLS estimator for $\beta_0$ based on $V(\hat\rho_n)$ has the same limit distribution as the GLS
estimator based on $V(\rho_0)$ (pages 222-225 in Hamilton).

1.7 Statistical inference with LS estimation

Some commonly used test statistics for LS estimators are the t statistic and the F statistic. The t statistic
is used to test a hypothesis about a single parameter, say $\beta_i = c$. For simplicity, we assume that
$c = 0$, so we use the t statistic to test whether a variable is significant. The t statistic is defined as the ratio
$\hat\beta_i/\operatorname{sd}(\hat\beta_i)$. Let the estimate of the variance of $\hat\beta$ be denoted by $s^2\hat W$; then the standard deviation
of $\hat\beta_i$ is the product of $s$ and the square root of the $i$th element on the diagonal of $\hat W$, i.e.,
\[ t = \frac{\hat\beta_i}{\sqrt{s^2}\sqrt{\hat w_{ii}}}. \tag{9} \]
Recall that if $X \sim N(0, 1)$ and $Y \sim \chi^2(m)$, with $X$ and $Y$ independent, then
\[ t = \frac{X}{\sqrt{Y/m}} \]
follows an exact Student t distribution with $m$ degrees of freedom.

The F statistic is used to test a hypothesis of $m$ different linear restrictions on $\beta$, say
\[ H_0: R\beta = r, \]
where $R$ is an $m \times k$ matrix. The F statistic is then defined as
\[ F = (R\hat\beta - r)'[\operatorname{Var}(R\hat\beta - r)]^{-1}(R\hat\beta - r). \tag{10} \]

This is a Wald statistic. To derive the distribution of the statistic, we will need the following
result.

Proposition 2 If a $k \times 1$ vector $X \sim N(\mu, \Sigma)$, then $(X - \mu)'\Sigma^{-1}(X - \mu) \sim \chi^2(k)$.

Also recall that an exact $F(m, n)$ distribution is defined to be
\[ F(m, n) = \frac{\chi^2(m)/m}{\chi^2(n)/n}. \]
With Assumption 1, $\hat W = (X_n'X_n)^{-1}$, and under the null hypothesis $\hat\beta_i \sim N(0, \sigma^2w_{ii})$. We can
then write
\[ t = \frac{\hat\beta_i/\sqrt{\sigma^2w_{ii}}}{\sqrt{s^2/\sigma^2}}. \]
Since the numerator is $N(0, 1)$ and the denominator is the square root of a $\chi^2(n-k)$ divided by $n-k$
(since $RSS/\sigma^2 \sim \chi^2(n-k)$), and the numerator and denominator are independent, the t statistic
(9) under Assumption 1 follows an exact t distribution.

With Assumption 1 and under the null hypothesis, we have
\[ R\hat\beta - r \sim N(0, \sigma^2R(X_n'X_n)^{-1}R'), \]
so by Proposition 2, the F statistic defined in (10) under the hypothesis $H_0$ satisfies
\[ F^* = (R\hat\beta - r)'[\sigma^2R(X_n'X_n)^{-1}R']^{-1}(R\hat\beta - r) \sim \chi^2(m). \]
If we replace $\sigma^2$ with $s^2$ and divide by the number of restrictions $m$, we get the OLS F test
of a linear hypothesis,
\[ F = (R\hat\beta - r)'[s^2R(X_n'X_n)^{-1}R']^{-1}(R\hat\beta - r)/m = \frac{F^*/m}{(RSS/\sigma^2)/(n-k)}, \]
so $F$ follows an exact $F(m, n-k)$ distribution.


An alternative way to express the F statistics is to compute the estimator without restriction
and its associated sum of residual RSSu ; and the estimator with restriction and its associated
sum of residual RSSr , then we can write
F =

(RSSr RSSu )/m


.
RSSu /(n k)

Now, with assumption 2, X is stochastic and is normal conditional on X and RSS 2 2 (n


k) conditional on X. This conditional distribution of RSS is the same for all X, therefore, the
unconditional distribution of RSS is the same as the conditional distribution. The same is true
for the t and F statistics. Therefore we have the same results under assumption 2 as that under
assumption 1.
From case 3, we no longer have exact distribution for the estimator, and we have to derive the
asymptotic distribution for the estimator, so we also use the asymptotic distributions for the test
statistics.
p
i
n i
tn = p
= p
.
sn wii
sn nwii
where wii is the ith element on the diagonal of s asymptotic variance Q 1 n 1 . If we let the ith
element on the diagonal of Q denoted by qii , then we have i !d N (0, 2 qii ). Recall that under
assumption 3, sn ! , there we have
tn ! N (0, 1).
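The RSS-based form of the F statistic can be sketched directly. The simulated design below, where the null restriction is true by construction, is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n, k, m = 120, 3, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([0.5, 0.0, 0.0]) + rng.normal(size=n)  # H0: beta_2 = beta_3 = 0 holds

def rss(X, y):
    # residual sum of squares from an OLS fit
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ b
    return e @ e

RSS_u = rss(X, y)            # unrestricted: all k regressors
RSS_r = rss(X[:, :1], y)     # restricted: intercept only (m = 2 restrictions)
F = ((RSS_r - RSS_u) / m) / (RSS_u / (n - k))
```

Under Assumption 1 or 2 this F would be compared to an exact $F(m, n-k)$ distribution; asymptotically, $mF$ is compared to $\chi^2(m)$.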
Next, write
\[ F_n = (R\hat\beta - r)'[s_n^2R(X_n'X_n)^{-1}R']^{-1}(R\hat\beta - r)/m = \sqrt n(R\hat\beta - r)'[s_n^2R(X_n'X_n/n)^{-1}R']^{-1}\sqrt n(R\hat\beta - r)/m. \]
Now we have $s_n^2 \to_p \sigma^2$, $X_n'X_n/n \to Q$, and under the null,
\[ \sqrt n(R\hat\beta - r) = R\sqrt n(\hat\beta - \beta_0) \to_d N(0, \sigma^2RQ^{-1}R'). \]
Then by Proposition 2, we have
\[ mF_n \to \chi^2(m). \]
We can then use similar methods to derive the distributions for the other cases. In general, if $\hat\beta \to_p \beta_0$
and is asymptotically normal, $s_n^2 \to \sigma^2$, and we have found a consistent estimate of the variance of $\hat\beta$,
then the t and F statistics are asymptotically $N(0,1)$ and $\chi^2(m)$ distributed, respectively. Actually, under
Assumption 1 or 2, when the sample size is large, we can also use the normal and $\chi^2$ distributions to
approximate the exact t and F distributions. Further, since we are using asymptotic distributions,
the Wald test can also be used to test nonlinear restrictions.

2 Maximum Likelihood Estimation

2.1 Review: maximum likelihood principle and Cramer-Rao lower bound

The basic idea of the maximum likelihood principle is to choose the parameter estimates that maximize the probability of obtaining the observed sample. Suppose that we observe a sample
$X_n = (x_1, x_2, \ldots, x_n)$ and assume that the sample is drawn from an i.i.d. distribution whose
parameters are denoted by $\theta$. Let $p(x_t; \theta)$ denote the pdf of the $t$th observation. For
example, when $x_t \sim \text{i.i.d.}\,N(\mu, \sigma^2)$, then $\theta = (\mu, \sigma^2)$ and
\[ p(x_t; \theta) = (2\pi\sigma^2)^{-1/2}\exp\left(-\frac{(x_t - \mu)^2}{2\sigma^2}\right). \]
The likelihood function for the whole sample $X_n$ is
\[ L(X_n; \theta) = \prod_{t=1}^n p(x_t; \theta), \]
and the log likelihood function is
\[ l(X_n; \theta) = \sum_{t=1}^n \log p(x_t; \theta). \]
The maximum likelihood estimates of $\theta$ are chosen so that $l(X_n; \theta)$ is maximized. Define the
score function $S(\theta) = \partial l(\theta)/\partial\theta$ and the Hessian matrix $H(\theta) = \partial^2l(\theta)/\partial\theta\partial\theta'$. The famous
Cramer-Rao inequality tells us that the lower bound for the variance of an unbiased estimator of $\theta$
is the inverse of the information matrix $I(\theta_0) = E[S(\theta_0)S(\theta_0)']$, where $\theta_0$ denotes the true value
of the parameter. An estimator whose variance equals this bound is known as efficient.
Under some regularity conditions, which are satisfied for the Gaussian density, we have the following
equality:
\[ I(\theta) = -E[H(\theta)] = -E\left[\frac{\partial^2l(\theta)}{\partial\theta\partial\theta'}\right]. \]
So, if we find an unbiased estimator whose variance achieves the Cramer-Rao lower bound,
then we know that this estimator is efficient and that no other unbiased estimator (linear or
nonlinear) can have a smaller variance. However, this lower bound is not
always achievable. If an estimator does achieve this bound, then this estimator is identical to the MLE.
Note that the Cramer-Rao inequality holds for unbiased estimators, while sometimes ML estimators
are biased. If an estimator is biased but consistent, and its variance approaches the Cramer-Rao
bound asymptotically, then the estimator is known as asymptotically efficient.
Example 1 (MLE for an i.i.d. Gaussian sample) Let $x_t \sim \text{i.i.d.}\,N(\mu, \sigma^2)$, so the parameter
is $\theta = (\mu, \sigma^2)$. Then we have
\[ p(x_t; \theta) = \frac1{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(x_t - \mu)^2}{2\sigma^2}\right), \]
\[ l(X_n; \theta) = -\frac n2\log(2\pi) - \frac n2\log(\sigma^2) - \frac1{2\sigma^2}\sum_{t=1}^n (x_t - \mu)^2, \]
\[ S_\mu(X_n; \theta) = \frac{\partial l(X_n; \theta)}{\partial\mu} = \frac1{\sigma^2}\sum_{t=1}^n (x_t - \mu), \]
\[ S_{\sigma^2}(X_n; \theta) = \frac{\partial l(X_n; \theta)}{\partial\sigma^2} = -\frac n{2\sigma^2} + \frac1{2\sigma^4}\sum_{t=1}^n (x_t - \mu)^2. \]
Setting the score functions to zero, we find the MLE estimators $\hat\mu = \bar X_n$ and
$\hat\sigma^2 = \frac1n\sum_{t=1}^n (x_t - \hat\mu)^2$. It is easy to verify that $E(\hat\mu) = E(\bar X_n) = \mu$, so $\hat\mu$ is
unbiased and its variance is $\operatorname{Var}(\hat\mu) = \sigma^2/n$, while
\[ E\hat\sigma^2 = E\left[\frac1n\sum_{t=1}^n (x_t - \hat\mu)^2\right] = E[(x_t - \mu) + (\mu - \hat\mu)]^2 = \sigma^2 - \frac2n\sigma^2 + \frac1n\sigma^2 = \frac{n-1}n\sigma^2, \]
so $\hat\sigma^2$ is biased, but it is consistent since $\hat\sigma^2 \to \sigma^2$ as $n \to \infty$. Define
$s^2 = \frac1{n-1}\sum_{t=1}^n (x_t - \hat\mu)^2$; then $Es^2 = \sigma^2$ and $\operatorname{Var}(s^2) = 2\sigma^4/(n-1)$.

We can further compute the Hessian matrix
\[ H(X_n; \theta) = \begin{bmatrix} \partial^2l(X_n;\theta)/\partial\mu^2 & \partial^2l(X_n;\theta)/\partial\mu\partial\sigma^2 \\ \partial^2l(X_n;\theta)/\partial\sigma^2\partial\mu & \partial^2l(X_n;\theta)/\partial(\sigma^2)^2 \end{bmatrix}, \]
where
\[ \frac{\partial^2l(X_n;\theta)}{\partial\mu^2} = -\frac n{\sigma^2}, \qquad \frac{\partial^2l(X_n;\theta)}{\partial\mu\partial\sigma^2} = -\frac1{\sigma^4}\sum_{t=1}^n (x_t - \mu), \qquad \frac{\partial^2l(X_n;\theta)}{\partial(\sigma^2)^2} = \frac n{2\sigma^4} - \frac1{\sigma^6}\sum_{t=1}^n (x_t - \mu)^2. \]
We can also compute that
\[ |H(X_n; \theta)|_{\theta=\hat\theta} = \frac{n^2}{2\hat\sigma^6} > 0, \]
so we know that we have found a maximum (not a minimum) of the likelihood function. Next,
compute the information matrix. Using
\[ E\left[\sum_{t=1}^n (x_t - \mu)\right] = 0, \qquad E\left[\sum_{t=1}^n (x_t - \mu)^2\right] = n\sigma^2, \]
the information matrix is
\[ I(\theta) = -E[H(X_n; \theta)] = \begin{bmatrix} n/\sigma^2 & 0 \\ 0 & n/(2\sigma^4) \end{bmatrix}. \]
So the MLE of $\mu$ has achieved the Cramer-Rao lower bound $\sigma^2/n$ for the variance. Although $s^2$ does
not achieve the lower bound, it turns out that it is still the unbiased estimator of $\sigma^2$ with minimum
variance.
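The closed-form solution of Example 1 can be sketched numerically. The data-generating values below are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1000
mu0, sig2_0 = 2.0, 4.0
x = rng.normal(mu0, np.sqrt(sig2_0), size=n)

# MLE from setting the two score equations to zero
mu_hat = x.mean()
sig2_hat = np.mean((x - mu_hat) ** 2)      # biased by the factor (n-1)/n
s2 = np.sum((x - mu_hat) ** 2) / (n - 1)   # unbiased version

def loglik(mu, sig2):
    # Gaussian log likelihood l(X_n; theta) from the example
    return (-n / 2 * np.log(2 * np.pi) - n / 2 * np.log(sig2)
            - np.sum((x - mu) ** 2) / (2 * sig2))
```

A quick check that (mu_hat, sig2_hat) is indeed a maximizer is to verify that perturbing either argument lowers `loglik`.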

2.2 Asymptotic Normality of MLE

There are a few regularity conditions to ensure that the MLE is consistent. First, we assume that the
data are strictly stationary and ergodic (for example, i.i.d.). Second, we assume that the parameter
space $\Theta$ is convex and that neither the estimate $\hat\theta$ nor the true parameter $\theta_0$ lies on the boundary of $\Theta$.
Third, we require that the likelihood function evaluated at $\theta$ differs from that at $\theta_0$ for any $\theta \ne \theta_0$ in
$\Theta$; this is known as the identification condition. Finally, we assume that $E[\sup_{\theta\in\Theta}|l(X_n; \theta)|] < \infty$.
With all these conditions satisfied, the MLE is consistent: $\hat\theta \to_p \theta_0$.

Next we discuss asymptotic results for the score function $S(X_n; \theta)$, the Hessian matrix
$H(X_n; \theta)$, and the asymptotic distribution of the ML estimates $\hat\theta$.

First, we want to show that $E[S(X_n, \theta_0)] = 0$ and $E[S(X_n, \theta_0)S(X_n, \theta_0)'] = -E[H(X_n; \theta_0)]$.
Letting the integral operator denote integration over $X_1, X_2, \ldots, X_n$, we have
\[ \int L(X_n, \theta_0)\,dX_n = 1. \]
Taking the derivative with respect to $\theta$, we have
\[ \int \frac{\partial L(X_n, \theta_0)}{\partial\theta}\,dX_n = 0. \]
Meanwhile, we can write
\[ \int \frac{\partial L(X_n, \theta_0)}{\partial\theta}\,dX_n = \int \frac1{L(X_n, \theta_0)}\frac{\partial L(X_n, \theta_0)}{\partial\theta}L(X_n, \theta_0)\,dX_n = \int \frac{\partial l(X_n; \theta_0)}{\partial\theta}L(X_n, \theta_0)\,dX_n = E[S(X_n, \theta_0)]. \]
So we know that $E[S(X_n, \theta_0)] = 0$. Next, differentiating the integral (which equals zero) with respect to $\theta'$ gives
\[ \int \frac{\partial l(X_n; \theta_0)}{\partial\theta}\frac{\partial L(X_n, \theta_0)}{\partial\theta'}\,dX_n + \int \frac{\partial^2l(X_n; \theta_0)}{\partial\theta\partial\theta'}L(X_n, \theta_0)\,dX_n = 0. \]
The second term is just $E[H(X_n; \theta_0)]$. The first can be written as
\[ \int \frac{\partial l(X_n; \theta_0)}{\partial\theta}\frac1{L(X_n, \theta_0)}\frac{\partial L(X_n, \theta_0)}{\partial\theta'}L(X_n, \theta_0)\,dX_n = \int \frac{\partial l(X_n; \theta_0)}{\partial\theta}\frac{\partial l(X_n; \theta_0)}{\partial\theta'}L(X_n, \theta_0)\,dX_n = E[S(X_n, \theta_0)S(X_n, \theta_0)']. \]

Now, since $E[S(X_n, \theta_0)S(X_n, \theta_0)'] + E[H(X_n; \theta_0)] = 0$, we have $E[S(X_n, \theta_0)S(X_n, \theta_0)'] = -E[H(X_n; \theta_0)]$.

Next, define $s(x_t; \theta) = \partial\log p(x_t; \theta)/\partial\theta$; then we can write the score function as the sum
$S(X_n, \theta) = \sum_{t=1}^n s(x_t; \theta)$. The $s(x_t; \theta)$ are i.i.d., and we can show that $E[s(x_t; \theta_0)] = 0$ and
$E[s(x_t; \theta_0)s(x_t; \theta_0)'] = -E[H(x_t; \theta_0)]$, where $H(x_t; \theta)$ is the Hessian of $\log p(x_t; \theta)$. Applying the Lindeberg-Levy CLT, we obtain the asymptotic normality of the score
function:
\[ n^{-1/2}S(X_n; \theta_0) \to_d N(0, \mathcal J), \qquad \text{where } \mathcal J \equiv -E\left[\frac1nH(X_n; \theta_0)\right]. \]
Next, we consider the properties of the Hessian matrix. First we assume that $E[H(X_n; \theta_0)]$ is
nonsingular. Let $\mathcal N$ be a neighborhood of $\theta_0$ with
\[ E\left[\sup_{\theta\in\mathcal N}\|H(x_t; \theta)\|\right] < \infty; \]
then, applying the LLN, we have
\[ \frac1n\sum_{t=1}^n H(x_t; \bar\theta) \to_p E[H(x_t; \theta_0)] = E\left[\frac1nH(X_n; \theta_0)\right] = -\mathcal J, \]
where $\bar\theta$ is any consistent estimator of $\theta_0$.

Proposition 3 (Asymptotic normality of the MLE) With all the conditions outlined above, and
writing $\mathcal J = -E[\frac1nH(X_n; \theta_0)]$,
\[ \sqrt n(\hat\theta - \theta_0) \to_d N(0, \mathcal J^{-1}). \]
Proof: Do a Taylor expansion of $S(X_n; \hat\theta)$ around $\theta_0$:
\[ 0 = S(X_n; \hat\theta) \approx S(X_n; \theta_0) + H(X_n; \theta_0)(\hat\theta - \theta_0). \]
Therefore, we have
\[ \sqrt n(\hat\theta - \theta_0) = -\left[\frac1nH(X_n; \theta_0)\right]^{-1}\left[\frac1{\sqrt n}S(X_n; \theta_0)\right] \to_d N(0, \mathcal J^{-1}\mathcal J\mathcal J^{-1}) = N(0, \mathcal J^{-1}). \]
Note that $\mathcal J = -E[\frac1nH(X_n; \theta_0)] = \frac1nI(\theta_0)$, so the asymptotic distribution of $\hat\theta$ can also be
written as
\[ \hat\theta \approx N(\theta_0, I(\theta_0)^{-1}). \]
However, $I(\theta_0)$ depends on $\theta_0$, which is unknown, so we need to find a consistent estimator for
it; call the resulting variance estimate $\hat V$. There are two methods to compute this variance matrix of $\hat\theta$. One is to
compute the Hessian matrix and evaluate it at $\theta = \hat\theta$, i.e., $\hat V = -H(X_n; \hat\theta)$. The second is to
use the outer product estimate,
\[ \hat V = \sum_{t=1}^n s(x_t; \hat\theta)s(x_t; \hat\theta)'. \]

2.3 Statistical Inference for MLE

There are three asymptotically equivalent tests for MLE: the likelihood ratio (LR) test, the Wald test, and
the Lagrange multiplier (LM) test or score test. You can find discussions of these three tests
in any graduate textbook in econometrics, so we only describe them briefly here.

The likelihood ratio test is based on the difference between the likelihood computed (maximized) with and without the restriction. Let $l_u$ denote the likelihood without the restriction and $l_r$
the likelihood with the restriction (note that $l_r \le l_u$). If the restriction is valid, then we expect
$l_r$ not to be too much lower than $l_u$. Therefore, to test whether the restriction is valid, the
statistic we compute is $2(l_u - l_r)$, which follows a $\chi^2$ distribution with degrees of freedom equal to
the number of restrictions imposed.

To do the LR test, we have to compute the likelihood under both the restricted and unrestricted conditions. In comparison, the other two tests use only either the estimator without the restriction (denoted
by $\hat\theta$) or the estimator with the restriction (denoted by $\tilde\theta$).

Let the restriction be $H_0: R(\theta) = r$. The idea of the Wald test is that if this restriction is valid,
then the estimator obtained without the restriction will make $R(\hat\theta) - r$ close to zero. Therefore the
Wald statistic is
\[ W = (R(\hat\theta) - r)'[\operatorname{Var}(R(\hat\theta) - r)]^{-1}(R(\hat\theta) - r), \]
which also follows a $\chi^2$ distribution with degrees of freedom equal to the number of restrictions
imposed.

To find the ML estimator, we set the score function equal to zero and solve for the estimator,
i.e., $S(\hat\theta) = 0$. If the restriction is valid and the estimator obtained with the restriction is $\tilde\theta$,
then we expect $S(\tilde\theta)$ to be close to zero. This idea leads to the LM test or score test. The LM
statistic is
\[ LM = S(\tilde\theta)'I(\tilde\theta)^{-1}S(\tilde\theta), \]
which also follows a $\chi^2$ distribution with degrees of freedom equal to the number of restrictions
imposed.
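The LR statistic $2(l_u - l_r)$ can be illustrated for the simple case of testing $H_0: \mu = 0$ in an i.i.d. Gaussian sample, with $\sigma^2$ concentrated out at its MLE. The data-generating values are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(8)
n = 200
x = rng.normal(loc=0.0, scale=2.0, size=n)   # H0: mu = 0 holds in the simulation

def max_loglik(x, mu):
    # Gaussian log likelihood with sigma^2 replaced by its MLE given mu
    s2 = np.mean((x - mu) ** 2)
    return -n / 2 * (np.log(2 * np.pi) + np.log(s2) + 1)

l_u = max_loglik(x, x.mean())   # unrestricted MLE: mu_hat = sample mean
l_r = max_loglik(x, 0.0)        # restricted: mu = 0 imposed
LR = 2 * (l_u - l_r)            # approximately chi^2(1) under H0
```

With $\sigma^2$ concentrated out, the statistic reduces to $n\log(\hat\sigma_r^2/\hat\sigma_u^2)$, which the test below verifies.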

2.4 LS and MLE

In a regression $Y_n = X_n\beta + U_n$ where $U_n|X_n \sim N(0, \sigma^2I_n)$ (as in Assumption 2), the conditional
density of $Y$ given $X$ is
\[ f(Y|X; \beta, \sigma^2) = (2\pi\sigma^2)^{-n/2}\exp\left(-\frac1{2\sigma^2}(Y - X\beta)'(Y - X\beta)\right). \]
The log likelihood function is
\[ l(Y|X; \beta, \sigma^2) = -\frac n2\log(2\pi) - \frac n2\log(\sigma^2) - \frac1{2\sigma^2}(Y - X\beta)'(Y - X\beta). \]
Note that the $\hat\beta_n$ that maximizes $l$ is the vector that minimizes the sum of squares; therefore, under
Assumption 2, the OLS estimator is equivalent to the ML estimator for $\beta_0$. It can be shown that this
estimator is unbiased and achieves the Cramer-Rao lower bound; therefore, under Assumption 2, the
OLS/ML estimator is efficient (compared to all unbiased linear or nonlinear estimators). Recall
that under Assumption 1 we have the Gauss-Markov theorem to show that the OLS estimator is the best
linear unbiased estimator; now the Cramer-Rao inequality tells us the optimality of the OLS estimator
under Assumption 2. The ML estimator for $\sigma^2$ is $(Y - X\hat\beta)'(Y - X\hat\beta)/n$. We introduced this
estimator a moment ago, and we showed that the difference between $\hat\sigma_n^2$ and the OLS estimator $s_n^2$
becomes arbitrarily small as $n \to \infty$.
Next, consider Assumption 5, where $U|X \sim N(0, \sigma^2V)$ and $V$ is known. Then the log likelihood
function, omitting constant terms, is
\[ l(Y|X, \beta) = -\frac12\log|V| - \frac12(Y - X\beta)'V^{-1}(Y - X\beta). \]
The ML estimator is
\[ \tilde\beta_n = (X'V^{-1}X)^{-1}X'V^{-1}Y, \]
which is equivalent to the GLS estimator. The score vector is $S_n(\beta) = (Y - X\beta)'V^{-1}X$, and the Hessian
matrix is $H_n(\beta) = -X'V^{-1}X$. Therefore, the information matrix is $I(\beta) = X'V^{-1}X$, and the
GLS/ML estimator is efficient as it achieves the Cramer-Rao lower bound $(X'V^{-1}X)^{-1}$.

When $V$ is unknown, we can parameterize it as $V(\gamma)$, say, and maximize the likelihood
\[ l(Y|X, \beta, \gamma) = -\frac12\log|V(\gamma)| - \frac12(Y - X\beta)'V(\gamma)^{-1}(Y - X\beta). \]

2.5 Example: MLE in autoregressive estimation

In Hamilton's book, you can find many detailed discussions of MLE for ARMA
models in Chapter 5. We will take an AR(1) model as an example.

Consider an AR(1) model,
\[ x_t = c + \phi x_{t-1} + u_t, \]
where $u_t \sim \text{i.i.d.}\,N(0, \sigma^2)$. Let $\theta = (c, \phi, \sigma^2)$ and let the sample size be $n$. There are
two ways to construct the likelihood function, and the difference lies in how we specify the initial
observation $x_1$. If we let $x_1$ be random, we know that the unconditional distribution of $x_t$ is
$N(c/(1-\phi), \sigma^2/(1-\phi^2))$, and this leads to the exact likelihood function. Alternatively, we can
assume that $x_1$ is observable (known), and this leads to a conditional likelihood function.

We first consider the exact likelihood function. We know that
\[ p(x_1; \theta) = \left(2\pi\frac{\sigma^2}{1-\phi^2}\right)^{-1/2}\exp\left(-\frac{(x_1 - c/(1-\phi))^2}{2\sigma^2/(1-\phi^2)}\right). \]
Conditional on $x_1$, the distribution of $x_2$ is $N(c + \phi x_1, \sigma^2)$, so the conditional
probability density of the second observation is
\[ p(x_2|x_1; \theta) = (2\pi\sigma^2)^{-1/2}\exp\left(-\frac{(x_2 - c - \phi x_1)^2}{2\sigma^2}\right). \]
So the joint probability density of $(x_1, x_2)$ is
\[ p(x_1, x_2; \theta) = p(x_2|x_1; \theta)p(x_1; \theta). \]
Similarly, the probability density of the $n$th observation conditional on $x_{n-1}$ is
\[ p(x_n|x_{n-1}; \theta) = (2\pi\sigma^2)^{-1/2}\exp\left(-\frac{(x_n - c - \phi x_{n-1})^2}{2\sigma^2}\right), \]
and the density for the joint observations $X_n = (x_1, x_2, \ldots, x_n)$ is
\[ L(X_n; \theta) = p(x_1; \theta)\prod_{t=2}^n p(x_t|x_{t-1}; \theta). \]
Taking logs, we get the exact log likelihood function (omitting constant terms for simplicity):
\[ l(X_n; \theta) = -\frac12\log\left(\frac{\sigma^2}{1-\phi^2}\right) - \frac{(x_1 - c/(1-\phi))^2}{2\sigma^2/(1-\phi^2)} - \frac{n-1}2\log(\sigma^2) - \sum_{t=2}^n \frac{(x_t - c - \phi x_{t-1})^2}{2\sigma^2}. \tag{11} \]
Next, to construct the conditional likelihood, assume that $x_1$ is observable; then the log likelihood
function is (again, constant terms are omitted)
\[ l(X_n; \theta) = -\frac{n-1}2\log(\sigma^2) - \sum_{t=2}^n \frac{(x_t - c - \phi x_{t-1})^2}{2\sigma^2}. \tag{12} \]
The maximum likelihood estimates $\hat c$ and $\hat\phi$ are obtained by maximizing (12), or by solving the
score function. Note that maximizing (12) with respect to $(c, \phi)$ is equivalent to minimizing
\[ \sum_{t=2}^n (x_t - c - \phi x_{t-1})^2, \]
which is the objective function in OLS.

Compared to the exact likelihood function, the conditional likelihood function is
much easier to work with. Actually, when the sample size is large, the first observation becomes
negligible in the total likelihood function. When $|\phi| < 1$, the estimator computed from the exact
likelihood and the estimator from the conditional likelihood are asymptotically equivalent.

Finally, if the residual is not Gaussian and we estimate the parameters using the conditional
Gaussian likelihood as in (12), then the estimate we obtain is known as a quasi-maximum likelihood
estimate (QMLE). QMLE is also very frequently used in empirical estimation. Although we misspecify the density function, in many cases the QMLE is still consistent. For instance, in an AR(p)
process, if the sample second moments converge to the population second moments, then the QMLE
using (12) is consistent, whether or not the error is Gaussian. However, standard errors
for the estimated coefficients that are computed under the Gaussian assumption need not be correct
if the true data are not Gaussian (White, 1982).
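Since maximizing the conditional likelihood (12) over $(c, \phi)$ is equivalent to OLS, the conditional MLE for the AR(1) model can be sketched directly. The true parameter values below are assumptions for the simulation.

```python
import numpy as np

rng = np.random.default_rng(5)
n, c0, phi0, sig0 = 500, 0.5, 0.7, 1.0
x = np.zeros(n)
x[0] = c0 / (1 - phi0)          # start at the unconditional mean
for t in range(1, n):
    x[t] = c0 + phi0 * x[t - 1] + rng.normal(scale=sig0)

# conditional MLE of (c, phi) = OLS of x_t on (1, x_{t-1}), as implied by (12)
X = np.column_stack([np.ones(n - 1), x[:-1]])
y = x[1:]
c_hat, phi_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ np.array([c_hat, phi_hat])
sig2_hat = np.mean(resid ** 2)   # conditional MLE of sigma^2
```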

3 Model Selection

In the discussion on estimation above, we assume that the order of the lags is known. However,
in empirical estimation, we have to choose a proper order. A larger number of order (parameters)
will increase the fitness of the model, therefore we need some criterion to balance the goodness of
18

fit and model parsimony. There are three commonly used criterion, Akaike information criterion
(AIC), Schwartzs Bayesian information criterion (BIC), and the posterior information criterion
(PIC) developed by Phillips (1996).
In all these criterion, we specify a maximum order kmax , and then choose k to minimize a
criterion equation.

SSRk
2k
AIC = log
+
(13)
n
n

where n is the sample size, k = 1, 2, . . . , kmax is the number of parameters in the model, and SSRk
is the residual from the fitted model. When k increase, the fit increases, so SSRk decreases, but
the second term increases. So this shows a trade o between fit and parsimony. Since the model is
estimated using dierent lags, the sample size also varies. We can either use the dierent sample
size n k, or we can use a fixed sample size n kmax . Ng and Perron (2000) has recommended
using the fixed sample size and use it to replace n in the criterion. However, the AIC rule is not
consistent and tends to overfit the model by choosing larger k.
With all other issues the same as for the AIC rule, the BIC rule imposes a larger penalty on the number of parameters:

BIC = log(SSR_k / n) + k log(n) / n    (14)
BIC selects a smaller k than AIC, and the BIC rule is consistent for stationary data, i.e., lim_{n→∞} k̂_BIC = k. Further, Hannan and Deistler (1988) have shown that k̂_BIC remains consistent when we set k_max = [c log(n)] (the integer part of c log(n)) for any c > 0; therefore we can estimate k consistently without knowing an upper bound on k.
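As an illustration, both criteria (13) and (14) can be computed by fitting AR(k) models on a fixed effective sample of n − k_max observations, as recommended above. A minimal numpy sketch (the AR(2) data-generating process and all function names are illustrative, not part of the notes):

```python
import numpy as np

def fit_ar_ssr(x, k, kmax):
    """OLS-fit an AR(k) on a fixed effective sample of n - kmax obs; return SSR_k."""
    n = len(x)
    Y = x[kmax:]                      # rows t = kmax..n-1, same sample for every k
    X = np.column_stack([np.ones(n - kmax)] +
                        [x[kmax - j:n - j] for j in range(1, k + 1)])
    beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
    resid = Y - X @ beta
    return resid @ resid

def select_lag(x, kmax):
    """Minimize AIC = log(SSR_k/n) + 2k/n and BIC = log(SSR_k/n) + k*log(n)/n."""
    n = len(x) - kmax                 # fixed effective sample size
    aic, bic = {}, {}
    for k in range(1, kmax + 1):
        ssr = fit_ar_ssr(x, k, kmax)
        aic[k] = np.log(ssr / n) + 2 * k / n
        bic[k] = np.log(ssr / n) + k * np.log(n) / n
    return min(aic, key=aic.get), min(bic, key=bic.get)

rng = np.random.default_rng(0)
e = rng.standard_normal(600)
x = np.zeros(600)
for t in range(2, 600):               # true model: AR(2)
    x[t] = 0.5 * x[t - 1] + 0.3 * x[t - 2] + e[t]
k_aic, k_bic = select_lag(x, kmax=8)
```

Because the BIC penalty log(n)/n exceeds the AIC penalty 2/n for n > e², the BIC choice can never exceed the AIC choice on the same SSR sequence.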
Finally, to present the PIC criterion, let K = k_max, and let X(K) and X(k) denote the regressor matrices with K and k parameters respectively; similarly for the parameter vector β. Write

Y = X(K)β(K) + error = X(k)β(k) + X(−)β(−) + error,

where X(−) collects the regressors excluded from the smaller model. Define

A(−) = X(−)′X(−),  A(k) = X(k)′X(k),  A(−, k) = X(−)′X(k) = A(k, −)′,
A(− · k) = A(−) − A(−, k)A(k)⁻¹A(k, −),
β̂(−) = [X(−)′X(−) − X(−)′X(k)(X(k)′X(k))⁻¹X(k)′X(−)]⁻¹ [X(−)′Y − X(−)′X(k)(X(k)′X(k))⁻¹X(k)′Y],
σ̂²_K = SSR_K / (n − K),

then

PIC = |A(− · k)/σ̂²_K|^{1/2} exp{−(1/(2σ̂²_K)) β̂(−)′ A(− · k) β̂(−)}.

PIC is asymptotically equivalent to the BIC criterion when the data are stationary, and it remains consistent when the data are nonstationary.
Reading: Hamilton, Ch. 5, 8.


Lecture 6: Vector Autoregression

In this section, we extend our discussion to vector-valued time series. We will be mostly interested in vector autoregressions (VARs), which are much easier to estimate in applications. We first introduce the properties of, and basic tools for analyzing, stationary VAR processes, and then move on to estimation and inference in the VAR model.

1 Covariance-stationary VAR(p) process

1.1 Introduction to stationary vector ARMA processes

1.1.1 VAR processes

A VAR model applies when each variable in the system depends not only on its own lags but also on lags of the other variables. A simple VAR example is (matrices below are written with rows separated by semicolons):

x_{1t} = φ_{11} x_{1,t−1} + φ_{12} x_{2,t−1} + ε_{1t}
x_{2t} = φ_{21} x_{2,t−1} + φ_{22} x_{2,t−2} + ε_{2t}

where E(ε_{1t} ε_{2s}) = σ_{12} for t = s and zero for t ≠ s. We could rewrite it as

[x_{1t}; x_{2t}] = [φ_{11}, φ_{12}; 0, φ_{21}] [x_{1,t−1}; x_{2,t−1}] + [0, 0; 0, φ_{22}] [x_{1,t−2}; x_{2,t−2}] + [ε_{1t}; ε_{2t}],

or just

x_t = Φ_1 x_{t−1} + Φ_2 x_{t−2} + ε_t    (1)

with E(ε_t) = 0, E(ε_t ε_s′) = 0 for s ≠ t, and

E(ε_t ε_t′) = [σ_1², σ_{12}; σ_{21}, σ_2²].

As you can see, in this example the vector-valued random variable x_t follows a VAR(2) process. A general VAR(p) process with white noise errors can be written as

x_t = Φ_1 x_{t−1} + Φ_2 x_{t−2} + ... + Φ_p x_{t−p} + ε_t = Σ_{j=1}^{p} Φ_j x_{t−j} + ε_t

or, if we make use of the lag operator,

Φ(L) x_t = ε_t,  where  Φ(L) = I_k − Φ_1 L − ... − Φ_p L^p.

Copyright 2002-2006 by Ling Hu.

The error terms follow a vector white noise process, i.e., E(ε_t) = 0 and

E(ε_t ε_s′) = Ω for t = s, and 0 otherwise,

with Ω a (k × k) symmetric positive definite matrix.
Recall that in studying the scalar AR(p) process, φ(L)x_t = ε_t, we have the result that the process {x_t} is covariance stationary as long as all the roots of

1 − φ_1 z − φ_2 z² − ... − φ_p z^p = 0    (2)

lie outside the unit circle. Similarly, for the VAR(p) process to be stationary, all roots of the equation

|I_k − Φ_1 z − ... − Φ_p z^p| = 0

must lie outside the unit circle.
1.1.2 Vector moving average processes

Recall that we could invert a scalar stationary AR(p) process, φ(L)x_t = ε_t, to an MA(∞) process, x_t = ψ(L)ε_t, where ψ(L) = φ(L)⁻¹. The same is true for a covariance-stationary VAR(p) process, Φ(L)x_t = ε_t. We could invert it to

x_t = Ψ(L) ε_t,  where  Ψ(L) = Φ(L)⁻¹.

The coefficients of Ψ can be solved in the same way as in the scalar case, i.e., from Φ(L)Ψ(L) = I_k:

(I_k − Φ_1 L − Φ_2 L² − ... − Φ_p L^p)(I_k + Ψ_1 L + Ψ_2 L² + ...) = I_k.

Equating the coefficients of L^j, we have Ψ_1 = Φ_1, Ψ_2 = Φ_1 Ψ_1 + Φ_2, and in general

Ψ_s = Φ_1 Ψ_{s−1} + Φ_2 Ψ_{s−2} + ... + Φ_p Ψ_{s−p},  s ≥ p.
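This recursion (with Ψ_0 = I_k and Ψ_j = 0 for j < 0) is straightforward to implement; a minimal sketch in numpy (function and variable names are illustrative):

```python
import numpy as np

def vma_coefficients(Phi, s_max):
    """Solve Phi(L)Psi(L) = I_k for Psi_0..Psi_{s_max} via
    Psi_s = Phi_1 Psi_{s-1} + ... + Phi_p Psi_{s-p}, Psi_0 = I_k, Psi_{j<0} = 0."""
    p = len(Phi)
    k = Phi[0].shape[0]
    Psi = [np.eye(k)]
    for s in range(1, s_max + 1):
        acc = np.zeros((k, k))
        for j in range(1, min(s, p) + 1):
            acc += Phi[j - 1] @ Psi[s - j]
        Psi.append(acc)
    return Psi

Phi1 = np.array([[0.5, 0.2], [0.3, 0.4]])
Psi = vma_coefficients([Phi1], s_max=3)
# for a VAR(1) the recursion collapses to Psi_s = Phi_1^s
```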

1.2 Transforming to a state space representation

Sometimes it is more convenient to write a scalar-valued time series, say an AR(p) process, in vector form. For example,

x_t = Σ_{j=1}^{p} φ_j x_{t−j} + ε_t,  where ε_t ∼ N(0, σ²).

We could equivalently write it as

[x_t; x_{t−1}; ...; x_{t−p+1}] = [φ_1, φ_2, ..., φ_{p−1}, φ_p; 1, 0, ..., 0, 0; ...; 0, ..., 1, 0] [x_{t−1}; x_{t−2}; ...; x_{t−p}] + [ε_t; 0; ...; 0].

If we let ξ_t = (x_t, x_{t−1}, ..., x_{t−p+1})′, ξ_{t−1} = (x_{t−1}, x_{t−2}, ..., x_{t−p})′, v_t = (ε_t, 0, ..., 0)′, and let F denote the parameter matrix, then we can write the process as

ξ_t = F ξ_{t−1} + v_t.

So we have rewritten a scalar AR(p) process as a vector autoregression of order one, denoted VAR(1).
Similarly, we could also transform a VAR(p) process into a VAR(1) process. For the process

x_t = Φ_1 x_{t−1} + Φ_2 x_{t−2} + ... + Φ_p x_{t−p} + ε_t,

let

ξ_t = [x_t; x_{t−1}; ...; x_{t−p+1}],

F = [Φ_1, Φ_2, ..., Φ_{p−1}, Φ_p; I_k, 0, ..., 0, 0; 0, I_k, ..., 0, 0; ...; 0, 0, ..., I_k, 0],

v_t = [ε_t; 0; ...; 0].

Then we could rewrite the VAR(p) process in state space notation,

ξ_t = F ξ_{t−1} + v_t,    (3)

where E(v_t v_s′) equals Q for t = s and equals zero otherwise, with

Q = [Ω, 0, ..., 0; 0, 0, ..., 0; ...; 0, 0, ..., 0].
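The companion form also gives a convenient numerical stationarity check: the roots of |I_k − Φ_1 z − ... − Φ_p z^p| = 0 lie outside the unit circle exactly when all eigenvalues of F lie inside it. A sketch (the coefficient matrices are arbitrary examples):

```python
import numpy as np

def companion(Phi):
    """Stack the (k x k) matrices Phi_1..Phi_p into the (kp x kp) companion matrix F."""
    p = len(Phi)
    k = Phi[0].shape[0]
    F = np.zeros((k * p, k * p))
    F[:k, :] = np.hstack(Phi)             # first block row: Phi_1 ... Phi_p
    F[k:, :-k] = np.eye(k * (p - 1))      # identity blocks below the first row
    return F

def is_stationary(Phi):
    """Covariance stationary iff every eigenvalue of F is inside the unit circle."""
    return bool(np.all(np.abs(np.linalg.eigvals(companion(Phi))) < 1))

Phi1 = np.array([[0.5, 0.2], [0.3, 0.4]])
Phi2 = np.array([[0.0, 0.0], [0.0, 0.1]])
stationary = is_stationary([Phi1, Phi2])
```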

1.3 The autocovariance matrix

1.3.1 VAR process

For a covariance-stationary k-dimensional vector process {x_t}, let E(x_t) = μ; then the hth autocovariance is defined to be the following k × k matrix:

Γ(h) = E[(x_t − μ)(x_{t−h} − μ)′].

For simplicity, assume that μ = 0. Then we have Γ(h) = E(x_t x_{t−h}′). Because of lead-lag effects, we may not have Γ(h) = Γ(−h), but we do have Γ(h)′ = Γ(−h). To show this,

Γ(h) = E(x_{t+h} x_{t+h−h}′) = E(x_{t+h} x_t′);

taking transposes,

Γ(h)′ = E(x_t x_{t+h}′) = Γ(−h).
As in the scalar case, we define the autocovariance generating function of the process x as

G_x(z) = Σ_{h=−∞}^{∞} Γ(h) z^h,

where z is again a complex scalar.


Let ξ_t be as defined in (3). Assume that ξ and x are stationary, and let Σ denote the variance of ξ:

Σ = E(ξ_t ξ_t′) = E{[x_t; x_{t−1}; ...; x_{t−p+1}] [x_t′, x_{t−1}′, ..., x_{t−p+1}′]}
  = [Γ(0), Γ(1), ..., Γ(p−1); Γ(1)′, Γ(0), ..., Γ(p−2); ...; Γ(p−1)′, Γ(p−2)′, ..., Γ(0)].

Postmultiplying (3) by its transpose and taking expectations gives

E(ξ_t ξ_t′) = E[(F ξ_{t−1} + v_t)(F ξ_{t−1} + v_t)′] = F E(ξ_{t−1} ξ_{t−1}′) F′ + E(v_t v_t′),

or

Σ = F Σ F′ + Q.    (4)

To solve for Σ, we use the Kronecker product and the following result: let A, B, C be matrices whose dimensions are such that the product ABC exists. Then

vec(ABC) = (C′ ⊗ A) vec(B),

where vec is the operator that stacks the columns of a (k × k) matrix into a k²-dimensional vector; for example,

A = [a_{11}, a_{12}; a_{21}, a_{22}],  vec(A) = [a_{11}; a_{21}; a_{12}; a_{22}].
Applying the vec operator to both sides of (4), we get

vec(Σ) = (F ⊗ F) vec(Σ) + vec(Q),

which gives

vec(Σ) = (I_m − F ⊗ F)⁻¹ vec(Q),

where m = k²p². We can use this equation to solve for the first p autocovariances of x, Γ(0), ..., Γ(p−1). To derive the hth autocovariance of ξ, denoted Σ(h), we can postmultiply (3) by ξ_{t−h}′ and take expectations,

E(ξ_t ξ_{t−h}′) = F E(ξ_{t−1} ξ_{t−h}′) + E(v_t ξ_{t−h}′),

then

Σ(h) = F Σ(h − 1),  or  Σ(h) = F^h Σ.

Therefore we have the following relationship for Γ(h):

Γ(h) = Φ_1 Γ(h − 1) + Φ_2 Γ(h − 2) + ... + Φ_p Γ(h − p).
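These formulas are easy to check numerically. A sketch that solves vec(Σ) = (F ⊗ F) vec(Σ) + vec(Q) for a VAR(1) (so F is just Φ_1 and Q = Ω; the parameter values are those of Example 1 below and are purely illustrative):

```python
import numpy as np

k, p = 2, 1
Phi1 = np.array([[0.5, 0.2], [0.3, 0.4]])
Omega = np.array([[2.0, 1.0], [1.0, 4.0]])
F, Q = Phi1, Omega                      # p = 1: the companion matrix is Phi1 itself

# vec(Sigma) = (I - F kron F)^{-1} vec(Q); order="F" stacks columns, matching vec()
m = (k * p) ** 2
vecSigma = np.linalg.solve(np.eye(m) - np.kron(F, F), Q.flatten(order="F"))
Sigma = vecSigma.reshape((k * p, k * p), order="F")

Gamma0 = Sigma                          # for p = 1, Sigma = Gamma(0)
Gamma1 = F @ Gamma0                     # Sigma(h) = F Sigma(h-1) gives Gamma(1)
```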

1.3.2 Vector MA processes

We first consider the MA(q) process,

x_t = ε_t + Θ_1 ε_{t−1} + Θ_2 ε_{t−2} + ... + Θ_q ε_{t−q}.

Then the variance of x_t is

Γ(0) = E(x_t x_t′) = E(ε_t ε_t′) + Θ_1 E(ε_{t−1} ε_{t−1}′) Θ_1′ + ... + Θ_q E(ε_{t−q} ε_{t−q}′) Θ_q′
     = Ω + Θ_1 Ω Θ_1′ + Θ_2 Ω Θ_2′ + ... + Θ_q Ω Θ_q′,

and the autocovariances (with Θ_0 = I_k) are

Γ(h) = Θ_h Ω + Θ_{h+1} Ω Θ_1′ + Θ_{h+2} Ω Θ_2′ + ... + Θ_q Ω Θ_{q−h}′  for h = 1, ..., q,
Γ(h) = Ω Θ_{−h}′ + Θ_1 Ω Θ_{−h+1}′ + ... + Θ_{q+h} Ω Θ_q′  for h = −1, ..., −q,
Γ(h) = 0  for |h| > q.

As in the scalar case, any vector MA(q) process is stationary. Next consider the MA(∞) process

x_t = ε_t + Ψ_1 ε_{t−1} + Ψ_2 ε_{t−2} + ... = Ψ(L) ε_t.

A sequence of matrices {Ψ_s}_{s=0}^{∞} is absolutely summable if each of its elements forms an absolutely summable scalar sequence, i.e.,

Σ_{s=0}^{∞} |ψ_{ij}^{(s)}| < ∞  for i, j = 1, 2, ..., n,

where ψ_{ij}^{(s)} is the row i, column j element (ijth for short) of Ψ_s. Some important results about the MA(∞) process are summarized as follows:

Proposition 1 Let x_t be a k × 1 vector satisfying

x_t = Σ_{j=0}^{∞} Ψ_j ε_{t−j},

where ε_t is vector white noise and {Ψ_j} is absolutely summable. Then

(a) The autocovariance between the ith variable at time t and the jth variable s periods earlier, E(x_{it} x_{j,t−s}), exists and is given by the ijth element of

Γ(s) = Σ_{v=0}^{∞} Ψ_{s+v} Ω Ψ_v′  for s = 0, 1, 2, ...;

(b) {Γ(h)}_{h=0}^{∞} is absolutely summable.

If {ε_t}_{t=−∞}^{∞} is i.i.d. with E|ε_{i1,t} ε_{i2,t} ε_{i3,t} ε_{i4,t}| < ∞ for i1, i2, i3, i4 = 1, 2, ..., k, then we also have

(c) E|x_{i1,t1} x_{i2,t2} x_{i3,t3} x_{i4,t4}| < ∞ for i1, i2, i3, i4 = 1, 2, ..., k and all t1, t2, t3, t4;

(d) n⁻¹ Σ_{t=1}^{n} x_{it} x_{j,t−s} →_p E(x_{it} x_{j,t−s}) for i, j = 1, 2, ..., k and for all s.

All of these results can be viewed as extensions from the scalar case to the vector case, and the proofs can be found on pages 286-288 of Hamilton's book.

1.4 The Sample Mean of a Vector Process

Let x_t be a stationary process with E(x_t) = 0 and E(x_t x_{t−h}′) = Γ(h), where Γ(h) is absolutely summable. We consider the properties of the sample mean

x̄_n = (1/n) Σ_{t=1}^{n} x_t.

Its variance is

E(x̄_n x̄_n′) = (1/n²) E[(x_1 + ... + x_n)(x_1 + ... + x_n)′]
            = (1/n²) Σ_{i,j} E(x_i x_j′)
            = (1/n) Σ_{h=−n+1}^{n−1} (1 − |h|/n) Γ(h).

Then

n E(x̄_n x̄_n′) = Σ_{h=−n+1}^{n−1} (1 − |h|/n) Γ(h)
             = Γ(0) + (1 − 1/n)(Γ(1) + Γ(−1)) + (1 − 2/n)(Γ(2) + Γ(−2)) + ...
             → Σ_{h=−∞}^{∞} Γ(h).

This is very similar to what we did in the scalar case. We then have the following proposition:
Proposition 2 Let x_t be a zero-mean stationary process with E(x_t x_{t−h}′) = Γ(h), where Γ(h) is absolutely summable. Then the sample mean satisfies

(a) x̄_n →_p 0;
(b) lim_{n→∞} n E(x̄_n x̄_n′) = Σ_{h=−∞}^{∞} Γ(h).

Let S denote the limit of n E(x̄_n x̄_n′). If the data are generated by an MA(q) process, then result (b) implies that

S = Σ_{h=−q}^{q} Γ(h).

A natural estimate for S is then

Ŝ = Γ̂(0) + Σ_{h=1}^{q} (Γ̂(h) + Γ̂(h)′),    (5)

where

Γ̂(h) = (1/n) Σ_{t=h+1}^{n} (x_t − x̄_n)(x_{t−h} − x̄_n)′.

Ŝ defined in (5) provides a consistent estimator for a large class of stationary processes, not only MA(q) processes. Even when the process has time-varying second moments, as long as

(1/n) Σ_{t=h+1}^{n} (x_t − x̄_n)(x_{t−h} − x̄_n)′

converges in probability to

(1/n) Σ_{t=h+1}^{n} E(x_t x_{t−h}′),

Ŝ is a consistent estimate of n E(x̄_n x̄_n′). Writing the autocovariances as E(x_t x_s′), even if they are nonzero for all t and s, if this matrix goes to zero sufficiently fast as |t − s| → ∞ and q grows with the sample size n, we still have Ŝ → S.

However, a problem with Ŝ is that it may not be positive semidefinite in small samples. We can instead use the Newey-West estimate

Ŝ = Γ̂(0) + Σ_{h=1}^{q} [1 − h/(q + 1)] (Γ̂(h) + Γ̂(h)′),

which is positive semidefinite and has the same consistency properties when q, n → ∞ with q/n^{1/4} → 0.
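The Newey-West estimator above is a few lines of numpy; a minimal sketch (the function name and simulated data are illustrative):

```python
import numpy as np

def newey_west(x, q):
    """Long-run variance S = G(0) + sum_{h=1}^q (1 - h/(q+1)) (G(h) + G(h)'),
    with G(h) = (1/n) sum_{t=h+1}^n (x_t - xbar)(x_{t-h} - xbar)'."""
    x = np.asarray(x, dtype=float)
    n = x.shape[0]
    xc = x - x.mean(axis=0)

    def gamma(h):
        # sum over t of xc_t xc_{t-h}' divided by n
        return (xc[h:].T @ xc[:n - h]) / n

    S = gamma(0)
    for h in range(1, q + 1):
        w = 1.0 - h / (q + 1)          # Bartlett kernel weight
        S = S + w * (gamma(h) + gamma(h).T)
    return S

rng = np.random.default_rng(1)
x = rng.standard_normal((500, 2))
S = newey_west(x, q=4)
eigvals = np.linalg.eigvalsh((S + S.T) / 2)
```

The Bartlett weights 1 − h/(q+1) are exactly what guarantees positive semidefiniteness in finite samples.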

1.5 Impulse-response Function and Orthogonalization

1.5.1 Impulse-response function

The impulse-response function gives how a time series variable is affected by a shock at time t. Recall that for a scalar process, say an AR(1) process x_t = φ x_{t−1} + ε_t with |φ| < 1, we can invert it to an MA process x_t = (1 + φL + φ²L² + ...)ε_t, and the effects of ε on x are:

ε: 0 1 0 0 ...
x: 0 1 φ φ² ...

In other words, after we invert φ(L)x_t = ε_t to x_t = ψ(L)ε_t, the ψ(L) function gives us how x responds to a unit shock from ε_t.
We can do the same thing for a VAR process. In our earlier example, we had the VAR(2) system

x_t = Φ_1 x_{t−1} + Φ_2 x_{t−2} + ε_t,

with ε_t ∼ WN(0, Ω), where

Ω = [σ_1², σ_{12}; σ_{21}, σ_2²].

After we invert it to the MA(∞) representation

x_t = Ψ(L) ε_t,    (6)

where Ψ(L) = (I − Φ_1 L − Φ_2 L²)⁻¹, we see that in this representation the observation x_t is a linear combination of the shocks ε_t. However, suppose we are interested in another form of shocks, say

u_t = Q ε_t,

where Q is an arbitrary square matrix (in this example, 2 × 2). Then we have

x_t = Ψ(L) Q⁻¹ Q ε_t = A(L) u_t,    (7)

where we let A(L) = Ψ(L)Q⁻¹. Since Q is arbitrary, we can form many different linear combinations of shocks, each with its own response function. Which combination shall we use?

1.5.2 Orthogonalization and model specification

In economic modeling, we compute impulse-response dynamics because we are interested in how economic variables respond to particular sources of shocks. If the shocks are correlated, it is hard to identify the response to any one shock. From that view, we may want to choose Q to make u_t = Qε_t orthonormal, i.e., uncorrelated across each other and with unit variance, E(u_t u_t′) = I. To do so, we need a Q such that

Q⁻¹(Q⁻¹)′ = Ω,

so that E(u_t u_t′) = E(Q ε_t ε_t′ Q′) = QΩQ′ = I_k. We can therefore use a Choleski decomposition of Ω to find Q. However, Q is still not unique, as we can form other valid Qs by multiplying by an orthogonal matrix.

Sims (1980) proposes specifying the model by choosing a particular leading coefficient A_0. In (6) we have Ψ_0 = I_k; however, in (7), A_0 = Q⁻¹ cannot be an identity matrix unless Ω is diagonal. In our example, we would choose the Q which makes A_0 = Q⁻¹ a lower triangular matrix. That means that after this transformation, the shock u_{2t} has no instantaneous effect on x_{1t}. The nice thing is that the Choleski decomposition itself produces a triangular matrix.
Example 1 Consider an AR(1) process for a 2-dimensional vector,

[x_{1t}; x_{2t}] = [0.5, 0.2; 0.3, 0.4] [x_{1,t−1}; x_{2,t−1}] + [ε_{1t}; ε_{2t}],

where

Ω = E(ε_t ε_t′) = [2, 1; 1, 4].

First we verify that this process is stationary: solving |Φ_1 − λI| = 0 gives λ_1 = 0.7 and λ_2 = 0.2, both inside the unit circle. Invert it to a moving average process,

x_t = Ψ(L) ε_t.

We know that Ψ_0 = I_2, Ψ_1 = Φ_1, etc. Then we find Q via the Choleski decomposition of Ω, which gives

Q = [0.70, 0; −0.27, 0.53]  and  Q⁻¹ = [1.41, 0; 0.70, 1.87].

Then we can write

x_t = Ψ(L) Q⁻¹ Q ε_t = A(L) u_t,

where we define u_t = Q ε_t. Then we have

x_t = Ψ_0 Q⁻¹ u_t + Ψ_1 Q⁻¹ u_{t−1} + ...,

or

[x_{1t}; x_{2t}] = [1.41, 0; 0.70, 1.87] [u_{1t}; u_{2t}] + [0.85, 0.37; 0.70, 0.75] [u_{1,t−1}; u_{2,t−1}] + ...

In this example we have found a unique MA representation that is a linear combination of uncorrelated errors (E(u_t u_t′) = I_2), in which the second source of shocks has no instantaneous effect on x_{1t}. We can then use this representation to compute the impulse responses.
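The matrices in Example 1 can be reproduced with a few lines of numpy (a sketch; Q⁻¹ is taken as the lower triangular Choleski factor of Ω, and the displayed figures are rounded to two decimals):

```python
import numpy as np

Phi = np.array([[0.5, 0.2], [0.3, 0.4]])
Omega = np.array([[2.0, 1.0], [1.0, 4.0]])

# Q^{-1} is the lower triangular Cholesky factor of Omega, so Q Omega Q' = I
Q_inv = np.linalg.cholesky(Omega)
Q = np.linalg.inv(Q_inv)

# orthogonalized MA coefficients: A_0 = Psi_0 Q^{-1} = Q^{-1}, A_1 = Psi_1 Q^{-1} = Phi Q^{-1}
A0 = Q_inv
A1 = Phi @ Q_inv
```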
There are also other ways to specify the representation, depending on the problem of interest. For example, Quah (1988) suggests finding a Q such that the long-run response of one variable to another's shock is zero.
1.5.3 Variance decomposition

Now let's consider how to decompose the variance of the forecast errors. Write x_t = Ψ(L)ε_t = A(L)u_t, where A(L) = Ψ(L)Q⁻¹, u_t = Qε_t, and E(u_t u_t′) = I. For simplicity, let x_t = (x_{1t}, x_{2t})′. Suppose we do a one-period-ahead forecast, and let y_{t+1} denote the forecast error,

y_{t+1} = x_{t+1} − E_t(x_{t+1}) = A_0 u_{t+1} = [A_{11}^0, A_{12}^0; A_{21}^0, A_{22}^0] [u_{1,t+1}; u_{2,t+1}].

Since E(u_{1t}u_{2t}) = 0 and E(u_{it}²) = 1, the variance of the forecast error is E(y_{t+1} y_{t+1}′) = A_0 A_0′. So the variance of the forecast error for x_{1t} is (A_{11}^0)² + (A_{12}^0)². We can interpret (A_{11}^0)² as the amount of the one-step-ahead forecast error variance due to shock u_1, and (A_{12}^0)² as the amount due to shock u_2. Similarly, the variance of the forecast error for x_{2t} is (A_{21}^0)² + (A_{22}^0)², with the analogous interpretation. The variance of the k-period-ahead forecast error can be computed in a similar way.

2 Estimation of VAR(p) process

2.1 Maximum Likelihood Estimation

Usually we use the conditional likelihood in VAR estimation (recall that conditional likelihood functions are much easier to work with than unconditional likelihood functions).

Given a k-vector VAR(p) process,

y_t = c + Φ_1 y_{t−1} + Φ_2 y_{t−2} + ... + Φ_p y_{t−p} + ε_t,

we could rewrite it more concisely as

y_t = Π′ x_t + ε_t,

where

Π = [c′; Φ_1′; Φ_2′; ...; Φ_p′]  and  x_t = [1; y_{t−1}; y_{t−2}; ...; y_{t−p}].

If we assume that ε_t ∼ i.i.d. N(0, Ω), then we can use MLE to estimate the parameters θ = (c, Φ, Ω). Following the same approach as in the scalar case, assume that we have observed (y_{−p+1}, ..., y_0); then the conditional likelihood for y_t is

L(y_t, x_t; θ) = (2π)^{−k/2} |Ω⁻¹|^{1/2} exp[(−1/2)(y_t − Π′x_t)′ Ω⁻¹ (y_t − Π′x_t)].

The log likelihood of the observations (y_1, ..., y_n) is (constant omitted)

l(y, x; θ) = (n/2) log|Ω⁻¹| − (1/2) Σ_{t=1}^{n} (y_t − Π′x_t)′ Ω⁻¹ (y_t − Π′x_t).    (8)

Taking first derivatives with respect to Π and Ω, we have

Π̂_n′ = [Σ_{t=1}^{n} y_t x_t′] [Σ_{t=1}^{n} x_t x_t′]⁻¹.

The jth row of Π̂_n′ is

π̂_j′ = [Σ_{t=1}^{n} y_{jt} x_t′] [Σ_{t=1}^{n} x_t x_t′]⁻¹,

which is the estimated coefficient vector from an OLS regression of y_{jt} on x_t. So the MLE estimates of the coefficients for the jth equation of a VAR are found by an OLS regression of y_{jt} on a constant term and p lags of all of the variables in the system.
The MLE estimate of Ω is

Ω̂_n = (1/n) Σ_{t=1}^{n} ε̂_t ε̂_t′,  where  ε̂_t = y_t − Π̂_n′ x_t.

The details of the derivations can be found on pages 292-296 of Hamilton's book. The MLE estimates Π̂ and Ω̂ are consistent even if the true innovations are non-Gaussian. In the next subsection, we consider regression with non-Gaussian errors and use the LS approach to derive the asymptotics.
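Since the MLE coefficients coincide with equation-by-equation OLS, VAR estimation is a short linear-algebra exercise. A sketch (the simulated DGP reuses the Example 1 parameters; all names are illustrative):

```python
import numpy as np

def var_ols(Y, p):
    """Equation-by-equation OLS for y_t = c + Phi_1 y_{t-1} + ... + Phi_p y_{t-p} + e_t.
    Returns (Pi, Omega_hat); Pi has shape (1 + k*p, k): first row c', then Phi_1', ..."""
    n, k = Y.shape
    rows = [np.concatenate([[1.0]] + [Y[t - j] for j in range(1, p + 1)])
            for t in range(p, n)]
    X = np.array(rows)                        # (n-p) x (1 + k*p)
    Yt = Y[p:]                                # (n-p) x k
    Pi, *_ = np.linalg.lstsq(X, Yt, rcond=None)
    resid = Yt - X @ Pi
    Omega_hat = resid.T @ resid / (n - p)
    return Pi, Omega_hat

# simulate the VAR(1) of Example 1 and recover its coefficients
rng = np.random.default_rng(2)
Phi = np.array([[0.5, 0.2], [0.3, 0.4]])
L = np.linalg.cholesky(np.array([[2.0, 1.0], [1.0, 4.0]]))
n = 5000
Y = np.zeros((n, 2))
for t in range(1, n):
    Y[t] = Phi @ Y[t - 1] + L @ rng.standard_normal(2)
Pi, Omega_hat = var_ols(Y, p=1)
Phi_hat = Pi[1:].T
```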

2.2 LS estimation and asymptotics

The asymptotic distribution of Π̂_n is summarized in the following proposition.
Proposition 3
y_t = c + Φ_1 y_{t−1} + Φ_2 y_{t−2} + ... + Φ_p y_{t−p} + ε_t,

where ε_t is i.i.d. (0, Ω) with E(ε_{it} ε_{jt} ε_{lt} ε_{mt}) < ∞ for all i, j, l, and m, and where the roots of

|I_k − Φ_1 z − ... − Φ_p z^p| = 0

lie outside the unit circle. Let m = kp + 1 and let

x_t′ = [1, y_{t−1}′, y_{t−2}′, ..., y_{t−p}′],

so x_t is an m-dimensional vector. Let π̂_n = vec(Π̂_n) denote the km × 1 vector of coefficients resulting from OLS regressions of each of the elements of y_t on x_t for a sample of size n:

π̂_n = [π̂_{1,n}; π̂_{2,n}; ...; π̂_{k,n}],  where  π̂_{i,n} = [Σ_{t=1}^{n} x_t x_t′]⁻¹ [Σ_{t=1}^{n} x_t y_{it}],

and let π denote the km × 1 vector of true parameters. Finally, let

Ω̂_n = (1/n) Σ_{t=1}^{n} ε̂_t ε̂_t′,

where

ε̂_t′ = [ε̂_{1t}, ε̂_{2t}, ..., ε̂_{kt}],  ε̂_{it} = y_{it} − x_t′ π̂_{i,n}.

Then

(a) (1/n) Σ_{t=1}^{n} x_t x_t′ →_p Q, where Q = E(x_t x_t′);

(b) π̂_n →_p π;

(c) Ω̂_n →_p Ω;

(d) √n (π̂_n − π) →_d N(0, Ω ⊗ Q⁻¹).

Result (a) is a vector version of the result that sample second moments converge to population moments; it follows because the MA coefficients are absolutely summable and the process has finite fourth moments. Results (b) and (c) are similar to the derivations for the single OLS regression in case 3 of lecture 5. To show result (d), let

Q_n = n⁻¹ Σ_{t=1}^{n} x_t x_t′,

then we can write

√n (π̂_{i,n} − π_i) = Q_n⁻¹ [n^{−1/2} Σ_{t=1}^{n} x_t ε_{it}]

and

√n (π̂_n − π) = [Q_n⁻¹ n^{−1/2} Σ_{t=1}^{n} x_t ε_{1t}; Q_n⁻¹ n^{−1/2} Σ_{t=1}^{n} x_t ε_{2t}; ...; Q_n⁻¹ n^{−1/2} Σ_{t=1}^{n} x_t ε_{kt}].    (9)

Define ξ_t to be the km × 1 vector

ξ_t = [x_t ε_{1t}; x_t ε_{2t}; ...; x_t ε_{kt}].

Note that ξ_t is an mds with finite fourth moments and variance

E(ξ_t ξ_t′) = [E(ε_{1t}²), E(ε_{1t}ε_{2t}), ..., E(ε_{1t}ε_{kt}); E(ε_{2t}ε_{1t}), E(ε_{2t}²), ..., E(ε_{2t}ε_{kt}); ...; E(ε_{kt}ε_{1t}), E(ε_{kt}ε_{2t}), ..., E(ε_{kt}²)] ⊗ E(x_t x_t′)
           = Ω ⊗ Q.

We can also show that

n⁻¹ Σ_{t=1}^{n} ξ_t ξ_t′ →_p Ω ⊗ Q.

Applying the CLT for vector mds, we have

n^{−1/2} Σ_{t=1}^{n} ξ_t →_d N(0, Ω ⊗ Q).    (10)

Now rewrite (9) as

√n (π̂_n − π) = [Q_n⁻¹, 0, ..., 0; 0, Q_n⁻¹, ..., 0; ...; 0, 0, ..., Q_n⁻¹] [n^{−1/2} Σ_{t=1}^{n} x_t ε_{1t}; ...; n^{−1/2} Σ_{t=1}^{n} x_t ε_{kt}]
             = (I_k ⊗ Q_n⁻¹) n^{−1/2} Σ_{t=1}^{n} ξ_t.

By result (a) we have Q_n⁻¹ →_p Q⁻¹. Thus √n (π̂_n − π) has the same limiting distribution as

(I_k ⊗ Q⁻¹) n^{−1/2} Σ_{t=1}^{n} ξ_t.

From (10) we know that this limit is Gaussian with mean 0 and variance

(I_k ⊗ Q⁻¹)(Ω ⊗ Q)(I_k ⊗ Q⁻¹) = (I_k Ω I_k) ⊗ (Q⁻¹ Q Q⁻¹) = Ω ⊗ Q⁻¹.

Hence we get result (d). Each π̂_i has the distribution

√n (π̂_{i,n} − π_i) →_d N(0, σ_i² Q⁻¹),

where σ_i² is the ith diagonal element of Ω.

Given that the estimators are asymptotically normal, we can use them to test linear or nonlinear restrictions on the coefficients with Wald statistics.

We know that vec is an operator that stacks the columns of a k × k matrix into one k² × 1 vector. A similar operator, vech, stacks only the elements on and below the principal diagonal (so it transforms a k × k matrix into a k(k + 1)/2 × 1 vector). For example,

A = [a_{11}, a_{12}; a_{21}, a_{22}],  vech(A) = [a_{11}; a_{21}; a_{22}].

We will apply this operator to the variance matrix, which is symmetric. The joint distribution of Π̂_n and Ω̂_n is given in the following proposition.

Proposition 4 Let

y_t = c + Φ_1 y_{t−1} + Φ_2 y_{t−2} + ... + Φ_p y_{t−p} + ε_t,

where ε_t is i.i.d. N(0, Ω) and the roots of

|I_k − Φ_1 z − ... − Φ_p z^p| = 0

lie outside the unit circle. Let Π̂_n, Ω̂_n, and Q be as defined in Proposition 3. Then

[n^{1/2}(π̂_n − π); n^{1/2}(vech(Ω̂_n) − vech(Ω))] →_d N([0; 0], [Ω ⊗ Q⁻¹, 0; 0, Σ_{22}]).

Let σ_{ij} denote the ijth element of Ω; then the element of Σ_{22} corresponding to the covariance between σ̂_{ij} and σ̂_{lm} is (σ_{il}σ_{jm} + σ_{im}σ_{jl}) for all i, j, l, m = 1, ..., k.
The detailed proof can be found on pages 341-342 of Hamilton's book. Basically there are three steps. First, we show that Ω̂_n = n⁻¹ Σ_{t=1}^{n} ε̂_t ε̂_t′ has the same asymptotic distribution as Ω̃_n = n⁻¹ Σ_{t=1}^{n} ε_t ε_t′. In the second step, write

[n^{1/2}(π̂_n − π); n^{1/2}(vech(Ω̃_n) − vech(Ω))] = [(I_k ⊗ Q_n⁻¹) n^{−1/2} Σ_{t=1}^{n} ξ_t; n^{−1/2} Σ_{t=1}^{n} λ_t],

where

λ_t = vech([ε_{1t}² − σ_{11}, ..., ε_{1t}ε_{kt} − σ_{1k}; ...; ε_{kt}ε_{1t} − σ_{k1}, ..., ε_{kt}² − σ_{kk}]).

Now (ξ_t′, λ_t′)′ is an mds, and we apply the CLT for mds to get (with a few more computations)

[n^{−1/2} Σ_{t=1}^{n} ξ_t; n^{−1/2} Σ_{t=1}^{n} λ_t] →_d N([0; 0], [Ω ⊗ Q, 0; 0, Σ_{22}]).

The final step in the proof is to show that E(λ_t λ_t′) is given by the matrix Σ_{22} described in the proposition, which can be proved using a constructed error sequence that is uncorrelated Gaussian with zero mean and unit variance (see Hamilton's book for details).

With the asymptotic variance of Ω̂_n, we can then test whether two errors are correlated. For example, for k = 2,

√n [σ̂_{11,n} − σ_{11}; σ̂_{12,n} − σ_{12}; σ̂_{22,n} − σ_{22}] →_d N([0; 0; 0], [2σ_{11}², 2σ_{11}σ_{12}, 2σ_{12}²; 2σ_{11}σ_{12}, σ_{11}σ_{22} + σ_{12}², 2σ_{12}σ_{22}; 2σ_{12}², 2σ_{12}σ_{22}, 2σ_{22}²]).

Then a Wald test of the null hypothesis that there is no covariance between ε_{1t} and ε_{2t} is given by

√n σ̂_{12} / (σ̂_{11}σ̂_{22} + σ̂_{12}²)^{1/2} →_d N(0, 1).
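A quick numerical illustration of this Wald test (a sketch; the simulated error matrices are purely illustrative):

```python
import numpy as np

def wald_no_covariance(E):
    """sqrt(n) * s12 / sqrt(s11*s22 + s12^2): asymptotically N(0,1) under H0: sigma12 = 0."""
    n = E.shape[0]
    S = E.T @ E / n                     # sample second-moment matrix of the errors
    return np.sqrt(n) * S[0, 1] / np.sqrt(S[0, 0] * S[1, 1] + S[0, 1] ** 2)

rng = np.random.default_rng(4)
E0 = rng.standard_normal((2000, 2))     # independent errors: H0 true
C = np.linalg.cholesky(np.array([[1.0, 0.6], [0.6, 1.0]]))
E1 = E0 @ C.T                           # correlated errors: H0 false
z0 = wald_no_covariance(E0)
z1 = wald_no_covariance(E1)
```

Under the null, z0 behaves like a standard normal draw, while z1 should be far in the tail.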

The matrix Σ_{22} can be expressed more compactly using the duplication matrix. The duplication matrix D_k is a k² × k(k+1)/2 matrix that transforms vech(Ω) into vec(Ω), i.e., D_k vech(Ω) = vec(Ω). For example, for k = 2,

[1, 0, 0; 0, 1, 0; 0, 1, 0; 0, 0, 1] [σ_{11}; σ_{21}; σ_{22}] = [σ_{11}; σ_{21}; σ_{12}; σ_{22}].

Define

D_k⁺ = (D_k′ D_k)⁻¹ D_k′.

Note that D_k⁺ D_k = I_{k(k+1)/2}. D_k⁺ is like the reverse of D_k, as it transforms vec(Ω) into vech(Ω):

vech(Ω) = D_k⁺ vec(Ω).

For example, when k = 2, we have

[σ_{11}; σ_{21}; σ_{22}] = [1, 0, 0, 0; 0, 1/2, 1/2, 0; 0, 0, 0, 1] [σ_{11}; σ_{21}; σ_{12}; σ_{22}].

With D_k and D_k⁺ we can write

Σ_{22} = 2 D_k⁺ (Ω ⊗ Ω)(D_k⁺)′.

3 Granger Causality

In most regressions in econometrics, it is very hard to discuss causality. For instance, the significance of the coefficient β in the regression

y_i = β x_i + ε_i

only tells us about the co-occurrence of x and y, not that x causes y. In other words, the regression usually tells us only that there is some relationship between x and y, not the nature of that relationship, such as whether x causes y or y causes x.
One good feature of time series vector autoregressions is that we can test causality in a particular sense. This test was first proposed by Granger (1969), and we therefore refer to it as Granger causality. We restrict our discussion to a system of two variables, x and y. y is said to Granger-cause x if current or lagged values of y help to predict future values of x. On the other hand, y fails to Granger-cause x if, for all s > 0, the mean squared error of a forecast of x_{t+s} based on (x_t, x_{t−1}, ...) is the same as that based on (y_t, y_{t−1}, ...) and (x_t, x_{t−1}, ...). If we restrict ourselves to linear functions, y fails to Granger-cause x if

MSE[Ê(x_{t+s} | x_t, x_{t−1}, ...)] = MSE[Ê(x_{t+s} | x_t, x_{t−1}, ..., y_t, y_{t−1}, ...)].

Equivalently, we can say that x is exogenous in the time series sense with respect to y, or that y is not linearly informative about future x.

In the VAR equations, the situation described above implies a lower triangular coefficient matrix:

[x_t; y_t] = [c_1; c_2] + [φ_{11}^{(1)}, 0; φ_{21}^{(1)}, φ_{22}^{(1)}] [x_{t−1}; y_{t−1}] + ... + [φ_{11}^{(p)}, 0; φ_{21}^{(p)}, φ_{22}^{(p)}] [x_{t−p}; y_{t−p}] + [ε_{1t}; ε_{2t}].    (11)

Or, if we use the MA representation,

[x_t; y_t] = [μ_1; μ_2] + [ψ_{11}(L), 0; ψ_{21}(L), ψ_{22}(L)] [ε_{1t}; ε_{2t}],    (12)

where

ψ_{ij}(L) = ψ_{ij}^{(0)} + ψ_{ij}^{(1)} L + ψ_{ij}^{(2)} L² + ...,

with ψ_{11}^{(0)} = ψ_{22}^{(0)} = 1 and ψ_{21}^{(0)} = 0. Another implication of Granger causality is stressed by Sims (1972).

Proposition 5 Consider a linear projection of y_t on past, present, and future x's,

y_t = c + Σ_{j=0}^{∞} b_j x_{t−j} + Σ_{j=1}^{∞} d_j x_{t+j} + η_t,    (13)

where E(η_t x_τ) = 0 for all t and τ. Then y fails to Granger-cause x iff d_j = 0 for j = 1, 2, ....


Econometric tests of whether the series y Granger causes x can be based on any of the three
implications (11), (12), or (13). The simplest test is to estimate the regression which is based on
(11),
p
p
X
X
xt = c1 +
i xt i +
i yt j + u t
i=1

j=1

using OLS and then conduct a F-test of the null hypothesis


H0 :

= ... =

= 0.
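This F test takes only a few lines; a minimal numpy sketch (the bivariate DGP and all names are illustrative; in the simulated system y Granger-causes x but not conversely):

```python
import numpy as np

def granger_f(x, y, p):
    """F statistic for H0: beta_1 = ... = beta_p = 0 in
    x_t = c + sum_i alpha_i x_{t-i} + sum_j beta_j y_{t-j} + u_t."""
    n = len(x)
    rows_r, rows_u, target = [], [], []
    for t in range(p, n):
        xlags = [x[t - i] for i in range(1, p + 1)]
        ylags = [y[t - j] for j in range(1, p + 1)]
        rows_r.append([1.0] + xlags)            # restricted: own lags only
        rows_u.append([1.0] + xlags + ylags)    # unrestricted: add y lags
        target.append(x[t])
    Xr, Xu, z = map(np.array, (rows_r, rows_u, target))
    ssr = lambda X: np.sum((z - X @ np.linalg.lstsq(X, z, rcond=None)[0]) ** 2)
    ssr_r, ssr_u = ssr(Xr), ssr(Xu)
    dof = len(z) - Xu.shape[1]
    return (ssr_r - ssr_u) / p / (ssr_u / dof)

rng = np.random.default_rng(3)
n = 1000
e1, e2 = rng.standard_normal(n), rng.standard_normal(n)
x, y = np.zeros(n), np.zeros(n)
for t in range(1, n):
    y[t] = 0.5 * y[t - 1] + e2[t]
    x[t] = 0.5 * x[t - 1] + 0.4 * y[t - 1] + e1[t]   # y Granger-causes x
F_xy = granger_f(x, y, p=2)   # should be large: y helps predict x
F_yx = granger_f(y, x, p=2)   # should be small: x does not help predict y
```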

Note: we have to be aware that Granger causality is not the same as what we usually mean by causality. Even if x_1 does not cause x_2, it may still help to predict x_2, and thus Granger-cause x_2, if changes in x_1 precede changes in x_2 for some reason. A simple example: a dragonfly flies much lower before a rain storm, due to the lower air pressure. Dragonflies do not cause rain storms, but their flight does help to predict a rain storm, and thus Granger-causes it.
Reading: Hamilton Ch. 10, 11, 14.


Lecture 7: Processes with Deterministic Trends

Introduction

Recall that a process is covariance stationary if it has a constant mean, finite variance, and autocovariance functions that do not depend on time. In this lecture, we introduce one class of nonstationary processes, those with a deterministic trend. In the next lecture, we introduce another type, processes with a stochastic trend.

We are already familiar with a stationary ARMA process,

x̃_t = ψ(L) u_t.

Now consider an ARMA process with a drift,

x_t = δt + x̃_t = δt + ψ(L) u_t.    (1)

Now the expectation of x_t is δt, which is a function of time, so this process is nonstationary. We can decompose the process x_t into two components: a trend component (δt) and a stationary component (x̃_t). If δ is known, then we can detrend x_t, i.e., compute x_t − δt to recover x̃_t, which is a stationary process; the process {x_t} is then said to be trend stationary.
The k-period-ahead forecast of x is

E_t(x_{t+k}) = E_t(δ(t + k) + ψ(L) u_{t+k})
            = δ(t + k) + E_t(u_{t+k} + ψ_1 u_{t+k−1} + ... + ψ_k u_t + ψ_{k+1} u_{t−1} + ... + ψ_{t+k} u_0)
            = δ(t + k) + ψ_k u_t + ψ_{k+1} u_{t−1} + ... + ψ_{t+k} u_0.

And the forecast error variance is

E_t(x_{t+k} − E_t(x_{t+k}))² = E_t(u_{t+k} + ψ_1 u_{t+k−1} + ... + ψ_{k−1} u_{t+1})²
                            = (1 + ψ_1² + ψ_2² + ... + ψ_{k−1}²) σ_u².

Note that since x̃_t = ψ(L)u_t is a stationary process, as k → ∞ the forecast error variance converges to the unconditional variance of x̃_t, which is bounded. This is a very important difference between processes with a deterministic trend and those with a stochastic trend. Another feature of a trend stationary process is that, given a shock at time t, the effect on the level of x̃, and hence of x, eventually dies off, as in a stationary process. This is another difference from a unit root process. We will discuss this more in the next lecture.

Figure 1 plots a simulated path of (1), where δ = 1, u_t ∼ N(0, 1), and ψ(L) = 1 + 0.5L.


Figure 1: Simulated MA(2) Process with Deterministic Trend

2 Estimation and Inference

2.1 OLS estimation of the simple time trend model

Consider the process

y_t = α + δt + u_t = β′x_t + u_t,    (2)

where β′ = [α, δ], x_t′ = [1, t], and u_t ∼ i.i.d. N(0, σ²). We can use MLE to estimate the parameters, and the MLE estimator is equivalent to the OLS estimator. So we will only discuss OLS estimation, which is applicable to a more general class of errors. In the following analysis, we assume u_t ∼ i.i.d. (0, σ²) with E(u_t⁴) < ∞.
The OLS estimate of β is

b_n = [Σ_{t=1}^{n} x_t x_t′]⁻¹ [Σ_{t=1}^{n} x_t y_t],    (3)

so that

b_n − β = [Σ_{t=1}^{n} x_t x_t′]⁻¹ [Σ_{t=1}^{n} x_t u_t].    (4)

Since x_t is deterministic, taking expectations of b_n gives E(b_n) = β_0, so b_n is an unbiased estimator of β_0. It can also be shown to be consistent (it converges to the true value). So far this is just the same as what we get with stationary processes. However, although b_n = (α̂_n, δ̂_n)′ converges to the true parameter β_0 = (α_0, δ_0)′, it turns out that its two components α̂_n and δ̂_n converge at different rates!

To see this, note that Σ x_t x_t′ is a 2 × 2 matrix,

Σ_{t=1}^{n} x_t x_t′ = [n, Σ_{t=1}^{n} t; Σ_{t=1}^{n} t, Σ_{t=1}^{n} t²].

Some simple math gives

Σ_{t=1}^{n} t = n(n + 1)/2 = O(n²),  Σ_{t=1}^{n} t² = n(n + 1)(2n + 1)/6 = O(n³),

or

n⁻² Σ_{t=1}^{n} t → 1/2,  n⁻³ Σ_{t=1}^{n} t² → 1/3.

More generally, we have

n^{−(r+1)} Σ_{t=1}^{n} t^r → 1/(r + 1).

So the elements of the matrix Σ x_t x_t′ diverge at different rates. To obtain a convergent matrix, we could divide by n³ (the largest divergence rate),

n⁻³ X_n′X_n = n⁻³ Σ_{t=1}^{n} x_t x_t′ → [0, 0; 0, 1/3].

Unfortunately, this limiting matrix is singular and cannot be inverted. It turns out that to obtain nondegenerate limiting distributions, α̂_n needs to be rescaled by n^{1/2} and δ̂_n by n^{3/2}. Therefore, to get a proper limit for b_n, we normalize it with the matrix

H_n = [n^{1/2}, 0; 0, n^{3/2}].

Now premultiply b_n − β_0 by H_n:

H_n(b_n − β_0) = [n^{1/2}(α̂_n − α_0); n^{3/2}(δ̂_n − δ_0)]
              = H_n [Σ_{t=1}^{n} x_t x_t′]⁻¹ H_n H_n⁻¹ [Σ_{t=1}^{n} x_t u_t]
              = [H_n⁻¹ (Σ_{t=1}^{n} x_t x_t′) H_n⁻¹]⁻¹ [H_n⁻¹ Σ_{t=1}^{n} x_t u_t].
1"

We first derive the limit of the matrix H_n⁻¹ X_n′X_n H_n⁻¹:

H_n⁻¹ (Σ_{t=1}^{n} x_t x_t′) H_n⁻¹ = [n⁻¹ n, n⁻² Σ t; n⁻² Σ t, n⁻³ Σ t²] → [1, 1/2; 1/2, 1/3] ≡ Q

as n → ∞.

Next, we derive the asymptotic distribution of H_n⁻¹ (Σ_{t=1}^{n} x_t u_t):

H_n⁻¹ Σ_{t=1}^{n} x_t u_t = [n^{−1/2} Σ_{t=1}^{n} u_t; n^{−3/2} Σ_{t=1}^{n} t u_t] = [n^{−1/2} Σ_{t=1}^{n} u_t; n^{−1/2} Σ_{t=1}^{n} (t/n) u_t].

We will show that this vector is asymptotically normal with mean zero and covariance matrix σ²Q. First consider the term n^{−1/2} Σ_{t=1}^{n} u_t: applying the classical central limit theorem directly, we have

n^{−1/2} Σ_{t=1}^{n} u_t → N(0, σ²).

Second, consider the term n^{−1/2} Σ_{t=1}^{n} (t/n) u_t. The series {(t/n)u_t} is not i.i.d., but it is a martingale difference sequence, and we can apply the CLT for mds. To do so, we need to verify the three conditions in Proposition 15 of lecture note 4 (Proposition 7.8 in Hamilton). First, E[(t/n)u_t]² = (t²/n²)σ², and

n⁻¹ Σ_{t=1}^{n} (t²/n²) σ² → σ²/3 > 0,

so condition (a) is satisfied. Second, take r = 4: since u_t has finite fourth moment by assumption, condition (b) is satisfied. Finally, we need to show that

n⁻¹ Σ_{t=1}^{n} (t²/n²) u_t² →_p σ²/3.    (5)

Since we have

n⁻¹ Σ_{t=1}^{n} (t²/n²) σ² → σ²/3,

we just need to show that

n⁻¹ Σ_{t=1}^{n} (t²/n²)(u_t² − σ²) →_p 0.

Note that the series {(t²/n²)(u_t² − σ²)} is an mds with variance

(t⁴/n⁴) E[(u_t² − σ²)²] = (t⁴/n⁴)[E(u_t⁴) − σ⁴] = (t⁴/n⁴)(μ_4 − σ⁴) < ∞,

so (5) holds by the law of large numbers for mds. Now all three conditions are satisfied, and we can apply the CLT for mds:

n^{−1/2} Σ_{t=1}^{n} (t/n) u_t → N(0, σ²/3).

The remaining task is to show that {n^{−1/2} Σ_{t=1}^{n} u_t} and {n^{−1/2} Σ_{t=1}^{n} (t/n)u_t} are asymptotically jointly normal. To show joint normality, it suffices to show that any linear combination of the two series is asymptotically normal, i.e., that

n^{−1/2} Σ_{t=1}^{n} [λ_1 + λ_2 (t/n)] u_t → N(0, λ′σ²Qλ)

for λ = (λ_1, λ_2)′. Note that the series {λ_1 u_t + λ_2 (t/n) u_t} is an mds with variance

σ²[λ_1² + 2λ_1λ_2 (t/n) + λ_2² (t/n)²],

and

n⁻¹ Σ_{t=1}^{n} σ²[λ_1² + 2λ_1λ_2 (t/n) + λ_2² (t/n)²] → σ²[λ_1² + 2λ_1λ_2 (1/2) + λ_2² (1/3)] = σ² λ′Qλ.

So we can apply the CLT to conclude that this linear combination converges to a Gaussian distribution, which implies that the two elements are jointly Gaussian:

[n^{−1/2} Σ_{t=1}^{n} u_t; n^{−1/2} Σ_{t=1}^{n} (t/n) u_t] → N(0, σ²Q).

Therefore, we get

H_n(b_n − β_0) = [H_n⁻¹ Σ x_t x_t′ H_n⁻¹]⁻¹ [H_n⁻¹ Σ x_t u_t] → Q⁻¹ N(0, σ²Q) = N(0, σ²Q⁻¹QQ⁻¹) = N(0, σ²Q⁻¹).

We can summarize the results in:

Proposition 1 Let y_t be generated according to the simple deterministic time trend model (2), where u_t ∼ i.i.d. (0, σ²) with finite fourth moment. Then

[n^{1/2}(α̂_n − α_0); n^{3/2}(δ̂_n − δ_0)] → N([0; 0], σ² Q⁻¹),  Q = [1, 1/2; 1/2, 1/3].
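The different convergence rates in Proposition 1 are visible in simulation: by the n^{3/2} rate for the slope, multiplying n by 100 should shrink the average error in δ̂ by a factor of roughly 1000, not 10. A sketch (the DGP parameters and names are illustrative):

```python
import numpy as np

def trend_ols(y):
    """OLS of y_t on (1, t); returns (alpha_hat, delta_hat)."""
    n = len(y)
    X = np.column_stack([np.ones(n), np.arange(1, n + 1)])
    return np.linalg.lstsq(X, y, rcond=None)[0]

rng = np.random.default_rng(5)
alpha, delta = 1.0, 0.5
errs = []
for n in (100, 10000):
    abs_err = []
    for _ in range(200):
        y = alpha + delta * np.arange(1, n + 1) + rng.standard_normal(n)
        abs_err.append(abs(trend_ols(y)[1] - delta))
    errs.append(np.mean(abs_err))       # average |delta_hat - delta| at this n
```

With u_t ∼ N(0,1), the exact variance of δ̂ is σ²·12/(n(n²−1)), so the n = 10000 errors are about three orders of magnitude smaller than the n = 100 errors.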
Note that for the estimate of δ we not only have δ̂_n →_p δ, we even have n(δ̂_n − δ) →_p 0. In this case, the estimate is said to be superconsistent.

2.2 Hypothesis testing for the simple time trend model

When the innovation term u_t is Gaussian, then since the regressors in the simple trend model are deterministic, the OLS estimates α̂_n and δ̂_n are Gaussian, and the usual OLS t and F tests have exact small-sample t and F distributions. In this section, we will consider the case when u_t is non-Gaussian.
We first consider a test of the null hypothesis on α, say α = a. Let s_n² be the OLS estimate of σ²: s_n² = (1/(n−2)) Σ_{t=1}^n û_t². Then the t statistic is

t_n = (α̂_n − a) / { s_n² [1 0] (X_n'X_n)^{-1} [1 0]' }^{1/2}
    = √n (α̂_n − a) / { s_n² [√n 0] (X_n'X_n)^{-1} [√n 0]' }^{1/2}
    = √n (α̂_n − a) / { s_n² [1 0] H_n (X_n'X_n)^{-1} H_n [1 0]' }^{1/2}
    →d √n (α̂_n − a) / { σ² [1 0] Q^{-1} [1 0]' }^{1/2},

where we use that s_n² →p σ², [√n 0] = [1 0] H_n, and H_n (X_n'X_n)^{-1} H_n = [H_n^{-1} (X_n'X_n) H_n^{-1}]^{-1} → Q^{-1}.

Let q^{11} denote the (1,1) element of Q^{-1}; then under the null hypothesis √n(α̂_n − a) →d N(0, σ² q^{11}). So we can see that

t_n → √n (α̂_n − a) / √(σ² q^{11})

is an asymptotically Gaussian variable divided by the square root of its variance, so it has a N(0, 1) distribution.
Similarly, to test the null hypothesis δ = b, write

t_n = (δ̂_n − b) / { s_n² [0 1] (X_n'X_n)^{-1} [0 1]' }^{1/2}
    = n^{3/2} (δ̂_n − b) / { s_n² [0 n^{3/2}] (X_n'X_n)^{-1} [0 n^{3/2}]' }^{1/2}
    = n^{3/2} (δ̂_n − b) / { s_n² [0 1] H_n (X_n'X_n)^{-1} H_n [0 1]' }^{1/2}
    →d n^{3/2} (δ̂_n − b) / √(σ² q^{22}),

which is again asymptotically N(0, 1).


We have just considered tests on either α or δ. Now consider a test involving both α and δ:

H₀: r₁α + r₂δ = r.

We apply a similar procedure as before, but α̂ and δ̂ have different convergence rates, n^{1/2} and n^{3/2}. Which one shall we use to derive the asymptotics? It turns out (again) that the slower rate dominates.

t_n = (r₁α̂_n + r₂δ̂_n − r) / { s_n² [r₁ r₂] (X_n'X_n)^{-1} [r₁ r₂]' }^{1/2}
    = √n (r₁α̂_n + r₂δ̂_n − r) / { s_n² n [r₁ r₂] (X_n'X_n)^{-1} [r₁ r₂]' }^{1/2}
    = √n (r₁α̂_n + r₂δ̂_n − r) / { s_n² n [r₁ r₂] H_n^{-1} H_n (X_n'X_n)^{-1} H_n H_n^{-1} [r₁ r₂]' }^{1/2}
    = √n (r₁α̂_n + r₂δ̂_n − r) / { s_n² r_n' [H_n (X_n'X_n)^{-1} H_n] r_n }^{1/2},

where

r_n ≡ √n H_n^{-1} (r₁, r₂)' = (r₁, r₂/n)'.

Since r_n → (r₁, 0)' and H_n(X_n'X_n)^{-1}H_n → Q^{-1}, we have

t_n →p √n (r₁α̂_n + r₂δ̂_n − r) / { σ² [r₁ 0] Q^{-1} [r₁ 0]' }^{1/2} = √n (r₁α̂_n + r₂δ̂_n − r) / (r₁² σ² q^{11})^{1/2}.

Further, note that under the null hypothesis (r₁α + r₂δ = r), and since δ̂_n is superconsistent,

√n (r₁α̂_n + r₂δ̂_n − r) = √n [ r₁(α̂_n − α) + r₂(δ̂_n − δ) ] = √n r₁(α̂_n − α) + op(1).

Therefore, under the null,

t_n →p √n r₁(α̂_n − α) / (r₁² σ² q^{11})^{1/2} = √n (α̂_n − α) / √(σ² q^{11}),

which is asymptotically N(0, 1). This example shows that a test involving a single restriction across parameters with different rates of convergence is dominated asymptotically by the parameter with the slowest rate of convergence.
Finally, consider a joint test of separate hypotheses about α and δ,

H₀: (α, δ)' = (a, b)',

or in vector form, β = c. Then we can compute a Wald statistic:

W_n = (β̂_n − c)' [ s_n² (X_n'X_n)^{-1} ]^{-1} (β̂_n − c)
    = [H_n(β̂_n − c)]' [ s_n² H_n (X_n'X_n)^{-1} H_n ]^{-1} [H_n(β̂_n − c)]
    →d [H_n(β̂_n − c)]' [ σ² Q^{-1} ]^{-1} [H_n(β̂_n − c)].

Then we have

W_n →d χ²(2).

2.3 OLS Estimation of Autoregression with Time Trend

Now consider a general autoregressive process around a deterministic time trend,

y_t = α + δt + φ₁ y_{t−1} + φ₂ y_{t−2} + ... + φ_p y_{t−p} + u_t,

or in matrix form,

y_t = x_t' β + u_t,

where x_t' = [y_{t−1}, y_{t−2}, ..., y_{t−p}, 1, t] and β' = [φ₁, ..., φ_p, α, δ]. Sims, Stock and Watson (1990) suggest that we find a matrix G and use it to transform this process to

y_t = x_t' G' [G']^{-1} β + u_t = x̃_t' β* + u_t,

where x̃_t = G x_t = [ỹ_{t−1}, ỹ_{t−2}, ..., ỹ_{t−p}, 1, t]' and β* = [G']^{-1} β = [φ₁*, φ₂*, ..., φ_p*, α*, δ*]'.

The idea is that after the transformation, we can write y_t in terms of zero-mean covariance stationary processes (the ỹ_{t−j}), a constant, and a time trend. In doing this, we isolate components of the OLS coefficient vector with different rates of convergence: after the transformation, φ̂*_{1,n}, φ̂*_{2,n}, ... converge at the usual rate √n, while α̂*_n and δ̂*_n behave asymptotically like α̂_n and δ̂_n in the simple time trend model. Let κ + γt denote the deterministic trend of y_t itself, so that ỹ_{t−j} ≡ y_{t−j} − κ − γ(t−j) is a zero-mean stationary process (here γ = δ/(1 − φ₁ − ... − φ_p)). The matrix G' is of dimension (p+2) × (p+2):

G' = [ 1      0      ...  0      0  0
       0      1      ...  0      0  0
       ...                ...
       0      0      ...  1      0  0
       γ−κ    2γ−κ   ...  pγ−κ   1  0
       −γ     −γ     ...  −γ     0  1 ],

[G']^{-1} = [ 1      0      ...  0      0  0
              0      1      ...  0      0  0
              ...                ...
              0      0      ...  1      0  0
              κ−γ    κ−2γ   ...  κ−pγ   1  0
              γ      γ      ...  γ      0  1 ].

The relation between the OLS coefficient estimates before and after the transformation is β̂*_n = [G']^{-1} β̂_n and β̂_n = G' β̂*_n. A simple example to understand this transformation is the following model:

y_t = φ y_{t−1} + α + u_t,    (6)

for which we know that E(y_t) = α/(1−φ) ≡ μ. Now we can write

G' = [ 1  0 ; −μ  1 ],    [G']^{-1} = [ 1  0 ; μ  1 ].

Then we can rewrite the process y_t as

y_t = φ y_{t−1} + α + u_t
    = φ (y_{t−1} − μ) + φμ + α + u_t
    = φ ỹ_{t−1} + μ + u_t,

since φμ + α = α/(1−φ) = μ. The advantage of this transformation is that now ỹ_{t−1} is a zero-mean process. When a time trend is included in the process, we will see a similar fact: y_t is written in terms of demeaned and detrended lags.
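The algebra of the simple example can be verified numerically; this is just a sanity check of the identity φ(y_{t−1} − μ) + μ = φy_{t−1} + α with μ = α/(1 − φ) (the values of φ and α below are arbitrary choices):

```python
import numpy as np

# Sanity check (illustrative) of the transformation in model (6):
# with mu = alpha / (1 - phi), the transformed regression function
# phi*(y_lag - mu) + mu equals the original phi*y_lag + alpha.
phi, alpha = 0.8, 0.5
mu = alpha / (1.0 - phi)          # E(y_t) = alpha / (1 - phi)

rng = np.random.default_rng(2)
y_lag = rng.normal(mu, 1.0, size=10)

original    = phi * y_lag + alpha          # x_t' beta
transformed = phi * (y_lag - mu) + mu      # x~_t' beta*

print(np.max(np.abs(original - transformed)))  # ~ 0 up to rounding
```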
To derive the asymptotic distribution of β̂*_n, define the (p+2) × (p+2) diagonal matrix

H_n = diag( √n, √n, ..., √n, √n, n^{3/2} ),

with √n in the first p+1 diagonal positions (for φ̂₁*, ..., φ̂_p* and α̂*) and n^{3/2} in the last (for δ̂*).
Then the OLS estimates satisfy

H_n (β̂*_n − β*) = [ H_n^{-1} ( Σ_{t=1}^n x̃_t x̃_t' ) H_n^{-1} ]^{-1} [ H_n^{-1} Σ_{t=1}^n x̃_t u_t ].    (7)

Consider the first term; it can be written in the form

H_n^{-1} ( Σ_{t=1}^n x̃_t x̃_t' ) H_n^{-1} = [ A_n  B_n ; B_n'  C_n ].

The elements of A_n (p × p) take the form n^{-1} Σ_{t=1}^n ỹ_{t−i} ỹ_{t−j} for i, j = 1, ..., p, which converges to γ_y(|i−j|). We let Q₁₁ denote the limiting matrix of A_n: A_n →p Q₁₁. Next, the elements of B_n (p × 2) take the form n^{-1} Σ_{t=1}^n ỹ_{t−i} and n^{-1} Σ_{t=1}^n (t/n) ỹ_{t−i}, and all of these elements converge to zero: B_n →p 0. Finally, the matrix C_n (2 × 2) is

C_n = [ 1 , n^{-2} Σ_{t=1}^n t ; n^{-2} Σ_{t=1}^n t , n^{-3} Σ_{t=1}^n t² ] → [ 1  1/2 ; 1/2  1/3 ] ≡ Q₂₂,

which is just the Q matrix in our simple time trend model. Thus we have

H_n^{-1} ( Σ_{t=1}^n x̃_t x̃_t' ) H_n^{-1} → Q* ≡ [ Q₁₁  0 ; 0  Q₂₂ ].

Next, consider the second term in (7):

H_n^{-1} Σ_{t=1}^n x̃_t u_t = n^{-1/2} ( Σ ỹ_{t−1} u_t , Σ ỹ_{t−2} u_t , ..., Σ ỹ_{t−p} u_t , Σ u_t , Σ (t/n) u_t )' ≡ n^{-1/2} Σ_{t=1}^n ξ_t.

This ξ_t is a martingale difference sequence with variance

E(ξ_t ξ_t') = σ² Q_t,    Q_t ≡ [ Q₁₁  0  0 ; 0  1  t/n ; 0  t/n  t²/n² ],

and we have

(1/n) Σ_{t=1}^n Q_t → Q*.

Then apply the CLT:

H_n^{-1} Σ_{t=1}^n x̃_t u_t →d N(0, σ² Q*).

Therefore, for the OLS estimates we have

H_n (β̂*_n − β*) →d N(0, [Q*]^{-1} σ² Q* [Q*]^{-1}) = N(0, σ² [Q*]^{-1}).

Using the block-diagonal representation, we can also write

[Q*]^{-1} = [ [Q₁₁]^{-1}  0 ; 0  [Q₂₂]^{-1} ].

Now, given the asymptotic distribution of the estimates β̂*_n, what are the results for β̂_n, the estimates of the coefficients in the original model? We have β̂_n = G' β̂*_n, or in matrix form,

( φ̂₁, ..., φ̂_p, α̂, δ̂ )' = G' ( φ̂₁*, ..., φ̂_p*, α̂*, δ̂* )',

with G' as given above. Note that φ̂_{j,n} is identical to φ̂*_{j,n}, so for φ̂_n = (φ̂_{1,n}, ..., φ̂_{p,n})' we have

√n ( φ̂_n − φ ) →d N( 0, σ² [Q₁₁]^{-1} ).

Next, α̂_n is a linear combination of variables that converge to a Gaussian distribution at rate √n, so α̂_n behaves the same way. Let

g_a' = ( γ−κ, 2γ−κ, ..., pγ−κ, 1, 0 )

(the (p+1)-th row of G'); then α̂_n = g_a' β̂*_n and

√n ( α̂_n − α ) →d N( 0, σ² g_a' [Q*]^{-1} g_a ).

Next, δ̂_n is a linear combination of variables converging at different rates:

δ̂_n = g' β̂*_n + δ̂*_n,

where g' = ( −γ, −γ, ..., −γ, 0, 0 ). Its asymptotic distribution is governed by the variables with the slowest rate of convergence:

√n ( δ̂_n − δ ) = √n ( g' β̂*_n + δ̂*_n − g' β* − δ* )
             →p g' √n ( β̂*_n − β* )    [since √n(δ̂*_n − δ*) →p 0]
             →d N( 0, σ² g' [Q*]^{-1} g ).

So each element of β̂_n individually is asymptotically Gaussian and Op(n^{-1/2}). The asymptotic distribution of the full vector √n(β̂_n − β) is multivariate Gaussian, though with a singular variance-covariance matrix. For hypothesis testing in this model, please read Hamilton's book.

Lecture 8: Univariate Processes with Unit Roots

1 Introduction

1.1 Stationary AR(1) vs. Random Walk

In this lecture, we will discuss a very important type of process: unit root processes. For an AR(1) process

x_t = φ x_{t−1} + u_t    (1)

to be stationary, we require that |φ| < 1. In an AR(p) process, we require that all the roots of

1 − φ₁ z − ... − φ_p z^p = 0

lie outside the unit circle. If one of the roots turns out to be one, then the process is called a unit root process. In an AR(1) process, we have φ = 1:

x_t = x_{t−1} + u_t.    (2)

It turns out that the two processes (|φ| < 1 and φ = 1) behave in very different manners. For simplicity, we assume that the innovations u_t follow an i.i.d. Gaussian distribution with mean zero and variance σ².
First, I plot the following two graphs in Figure 1. In the left graph, I set φ = 0.9, and in the right graph I draw the random walk process, φ = 1. From the left graph, we see that x_t moves around zero and never gets out of the [−6, 6] region. There seems to be some force that pulls the process back to its mean (zero). But in the right graph we do not see a fixed mean; instead, x_t moves freely, and in this case it goes as high as about 72. If we repeat generating the above two processes, we would see that the φ = 0.9 processes look pretty much the same, but the random walk processes are very different from each other. For instance, in a second simulation, the random walk may go down to −80, say.
The above is a graphical illustration. Second, consider some moments of the process x_t when |φ| < 1 and φ = 1. When |φ| < 1, we have

E(x_t) = 0  and  E(x_t²) = σ²/(1 − φ²).

When φ = 1, we no longer have constant unconditional moments; the first two conditional moments are

E(x_t | F_{t−1}) = x_{t−1}  and  E(x_t² | F_{t−1}) = x_{t−1}² + σ².

Copyright 2002-2006 by Ling Hu.

Figure 1: Simulated Autoregressive Processes with Coefficient φ = 0.9 and φ = 1

When we do k-period-ahead forecasting with |φ| < 1,

E(x_{t+k} | F_t) = φ^k x_t.

Since |φ| < 1, φ^k → 0 = E(x_t) as k → ∞. So as the forecasting horizon increases, the current value of x_t matters less and less, since the conditional expectation converges to the unconditional expectation. The variance of the forecast is

Var(x_{t+k} | F_t) = (1 + φ² + ... + φ^{2(k−1)}) σ²,

which converges to σ²/(1 − φ²) as k → ∞.

Next, consider the case when φ = 1. Then

E(x_{t+k} | F_t) = x_t,

which means that the current value does matter (actually it is the only thing that matters), even as k → ∞. The variance of the forecast is

Var(x_{t+k} | F_t) = k σ² → ∞  as  k → ∞.

If we let x₀ = 1 and σ² = 1, we can draw the forecasts of x_{t+k} for φ = 0.9 and φ = 1 in Figure 2. The upper graph in Figure 2 plots the forecast of x_k when φ = 0.9: the expectation of x_k conditional on x₀ = 1 drops to zero as k increases, and the forecast standard error converges to the unconditional standard error quickly. The lower graph in Figure 2 plots the forecast of x_k when φ = 1; obviously, the forecast interval diverges as k increases.

Forecast of x(k) at time zero

Forecast of x(k) at time zero

4
2

10

10
k

k
12

12

14

14

16

16

18

18

20

20

Figure 2: Forecasting of xk at time zero when

= 0.9 and

= 1 (x0 = 1,

= 1)

A third way to compare a stationary and a nonstationary autoregressive process is to compare their impulse-response functions. We can invert (1) to

x_t = u_t + φ u_{t−1} + ... + φ^{t−1} u_1 = Σ_{k=0}^{t−1} φ^k u_{t−k}.

So the effect of a shock u_t on x_{t+h} is φ^h, which dies out as h increases. In the unit root case,

x_t = Σ_{k=1}^{t} u_k,

and the effect of u_t on x_{t+h} is one, which is independent of h. So if a process is a random walk, the effects of all shocks on the level of {x} are permanent; the impulse-response function is flat at one.
Finally, we can compare the asymptotic distribution of the coefficient estimator in a stationary and a nonstationary autoregressive process. For an AR(1) process x_t = φ x_{t−1} + u_t with |φ| < 1 and u_t ~ i.i.d. N(0, σ²), we showed in lecture note 6 that the MLE estimator of φ is asymptotically normal:

n^{1/2} ( φ̂_n − φ₀ ) ≈ N( 0, σ² ( n^{-1} Σ_{t=1}^n x_t² )^{-1} ).

However, if φ₀ = 1, then (n^{-1} Σ_{t=1}^n x_t²)^{-1} goes to zero as n → ∞. This implies that if φ₀ = 1, then φ̂_n converges at an order higher than n^{1/2}.

Above we have only considered the AR(1) process. In a general AR(p) process, if there is one unit root, then the process is a nonstationary unit root process. Consider an AR(2) example with roots λ₁ = 1 and λ₂ = 0.5:

(1 − L)(1 − 0.5L) x_t = ε_t,   ε_t ~ i.i.d.(0, σ²).

Then

(1 − L) x_t = (1 − 0.5L)^{-1} ε_t ≡ ψ(L) ε_t ≡ u_t.

So the difference of x_t, Δx_t = (1 − L)x_t, is a stationary process, and

x_t = x_{t−1} + u_t

is a unit root process with serially correlated errors.
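We can verify the factorization numerically (an illustrative check; numpy's convention of listing coefficients in increasing powers for `polymul` and decreasing powers for `roots` is the only thing assumed):

```python
import numpy as np

# Check that (1 - L)(1 - 0.5L) = 1 - 1.5L + 0.5L^2, so the AR(2) polynomial
# has one root on the unit circle (z = 1) and one outside (z = 2):
# the level is I(1) while the first difference is stationary.
a = np.polynomial.polynomial.polymul([1.0, -1.0], [1.0, -0.5])
print(a)  # coefficients of 1 - 1.5L + 0.5L^2, lowest power first

roots = np.roots([0.5, -1.5, 1.0])  # np.roots wants the highest power first
print(np.sort(roots.real))          # one unit root, one root at 2
```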

1.2 Stochastic Trend vs. Deterministic Trend

In a unit root process

x_t = x_{t−1} + u_t,

where u_t is a stationary process, x_t is said to be integrated of order one, denoted I(1). An I(1) process is also said to be difference stationary, in contrast to the trend stationary processes discussed in the previous lecture. If we need to take differences twice to get a stationary process, then the process is said to be integrated of order two, denoted I(2), and so forth. A stationary process can be denoted I(0). If a process is a stationary ARMA(p, q) after taking kth differences, then the original process is called ARIMA(p, k, q). The "I" here denotes "integrated".

Recall that we learned about the spectrum in lecture 3. The implication of these integration and difference operators for spectral analysis is that when we integrate (sum), we filter out the high-frequency components and what remains are the low frequencies, which is a feature of a unit root process. Recall that the spectrum of a stationary AR(1) process is

S(ω) = (σ²/2π) · 1/(1 + φ² − 2φ cos ω).

When φ → 1, we have S(ω) = σ²/[4π(1 − cos ω)]. Then as ω → 0, S(ω) → ∞. So processes with a stochastic trend have an infinite spectrum at the origin. Recall that S(ω) decomposes the variance of a process into components contributed by each frequency, so the variance of a unit root process is largely contributed by the low frequencies. On the other hand, when we difference, we filter out the low frequencies and what remains are the high frequencies.
In the previous lecture, we discussed processes with a deterministic trend. We can compare a process with a deterministic trend (DT) and a process with a stochastic trend (ST) from two perspectives. First, when we do k-period-ahead forecasting, as k → ∞ the forecast error for DT converges to the variance of its stationary component, which is bounded, but as we saw in the previous section, the forecast error for ST diverges as k → ∞. Second, the impulse-response function for DT is the same as in the stationary case: the effect of a shock dies out quickly, while the impulse-response function for ST is flat at one: the effects of all shocks on the level are permanent. However, note that in Figure 1 we plotted a simulated random walk, and part of its path looks like it has an upward time trend. This turns out to be a quite general problem: over a short time period, it is very hard to judge whether a process has a stochastic trend or a deterministic trend.

2 Brownian Motion and Functional Central Limit Theorem

2.1 Brownian Motion

To derive statistical inference for a unit root process, we need to make use of a very important stochastic process: Brownian motion (also called a Wiener process). To understand Brownian motion, consider a random walk

y_t = y_{t−1} + u_t,   y₀ = 0,   u_t ~ i.i.d. N(0, 1).    (3)

We can then write

y_t = Σ_{s=1}^t u_s ~ N(0, t),

and the change in the value of y between dates s and t,

y_t − y_s = u_{s+1} + u_{s+2} + ... + u_t = Σ_{i=s+1}^t u_i ~ N(0, t − s),

is independent of the change between dates r and q for s < t < r < q.
Next, consider the change y_t − y_{t−1} = u_t ~ i.i.d. N(0, 1). We can view u_t as the sum of two independent Gaussian variables,

u_t = ε_{1t} + ε_{2t},   ε_{it} ~ i.i.d. N(0, 1/2).

Then we can associate ε_{1t} with the change between y_{t−1} and the value of y at some interim point (say, y_{t−(1/2)}), and ε_{2t} with the change between y_{t−(1/2)} and y_t:

y_{t−(1/2)} − y_{t−1} = ε_{1t},   y_t − y_{t−(1/2)} = ε_{2t}.    (4)

Sampled at integer dates t = 1, 2, ..., the process in (4) has the same properties as (3), since

y_t − y_{t−1} = ε_{1t} + ε_{2t} ~ i.i.d. N(0, 1).

In addition, the process in (4) is also defined at non-integer dates, and it retains the property, for both integer and non-integer dates, that y_t − y_s ~ N(0, t − s), with y_t − y_s independent of the change over any other nonoverlapping interval. Using the same reasoning, we could partition the change between t − 1 and t into N separate subperiods:

y_t − y_{t−1} = Σ_{i=1}^N ε_{it},   ε_{it} ~ i.i.d. N(0, 1/N).

As N → ∞, the limit process is known as Brownian motion. The value of this process at date t is denoted W(t). A realization of this continuous-time process can be viewed as a stochastic function W(·). In particular, we will be interested in Brownian motion over the interval t ∈ [0, 1].
Definition 1 (Brownian Motion) A standard Brownian motion W(t), t ∈ [0, 1], is a continuous-time stochastic process such that

(a) W(0) = 0;

(b) for any dates 0 ≤ s < t ≤ 1, W(t) − W(s) ~ N(0, t − s), and the differences W(t₂) − W(t₁) and W(t₄) − W(t₃), for any 0 ≤ t₁ < t₂ ≤ t₃ < t₄ ≤ 1, are independent;

(c) W(t) is continuous in time t with probability 1.

So given a standard Brownian motion W(t), we have E(W(t)) = 0, Var(W(t)) = t, and Cov(W(t), W(s)) = min(t, s). Other Brownian motions can be generated from a standard Brownian motion. For example, the process Z(t) = σW(t) has independent increments and is distributed N(0, σ²t). Such a process is described as Brownian motion with variance σ².
An important feature of Brownian motion is that although it is continuous in t, it is not differentiable using standard calculus. The direction of change at t is likely to be completely different from that at t + Δ, no matter how small Δ is. Even if some part of the realization of a Brownian motion looks smooth, if we examine it under a microscope we will see many zig-zags.

There are several concepts of smoothness of a function, and continuity is the weakest one. Differentiability is another concept of smoothness. When the domain of the function is an interval, we have yet another smoothness condition. A function f : [a, b] → R is of bounded variation if there exists M < ∞ such that for every partition of [a, b] by a finite collection of points a = x₀ < x₁ < x₂ < ... < x_n = b,

Σ_{k=1}^n | f(x_k) − f(x_{k−1}) | ≤ M.

Brownian motion is not of bounded variation.


Later in this lecture, we will also see integrals of Brownian motion (∫₀¹ W(r) dr) and a stochastic integral (∫₀¹ W(r) dW(r)), for r ∈ [0, 1]. First, note that W(r) is a Gaussian process, hence ∫₀¹ W(r) dr is also Gaussian. It is easy to see that E[∫₀¹ W(r) dr] = 0. To compute its variance, write

E[ ( ∫₀¹ W(r) dr )² ] = ∫₀¹ ∫₀¹ E[W(r)W(s)] ds dr = 2 ∫₀¹ ∫₀^r s ds dr = ∫₀¹ r² dr = 1/3.

Therefore, ∫₀¹ W(r) dr ~ N(0, 1/3). As another exercise, consider the distribution of W(1) − ∫₀¹ W(r) dr. Again, it is Gaussian with zero mean. To compute its variance,

E[ ( W(1) − ∫₀¹ W(r) dr )² ] = 1 + 1/3 − 2 E[ W(1) ∫₀¹ W(r) dr ] = 4/3 − 2 ∫₀¹ r dr = 4/3 − 1 = 1/3.
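Both variance calculations can be checked by simulating discretized Brownian paths (an illustrative Monte Carlo sketch; the number of steps, replications, and seed are arbitrary choices):

```python
import numpy as np

# Monte Carlo check of Var(int_0^1 W(r) dr) = 1/3 and
# Var(W(1) - int_0^1 W(r) dr) = 1/3, approximating the integral
# by the average of a discretized Brownian path.
rng = np.random.default_rng(3)
steps, reps = 200, 20000

dW = rng.normal(0.0, np.sqrt(1.0 / steps), size=(reps, steps))
W = dW.cumsum(axis=1)                 # W(1/steps), ..., W(1)
integral = W.mean(axis=1)             # Riemann approximation of int W(r) dr
W1 = W[:, -1]

print(integral.var(), (W1 - integral).var())   # both close to 1/3
```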

To study the stochastic integral, we need a fundamental theorem in stochastic calculus.

Definition 2 (Itô's Lemma) Let X(r) be a process given by

dX(r) = u dr + v dW(r).

Let g(r, x) be a twice continuously differentiable real function, and let

Y(r) = g(r, X(r));

then

dY(r) = (∂g/∂r)(r, x) dr + (∂g/∂x)(r, x) dX(r) + (1/2) (∂²g/∂x²)(r, x) (dX(r))²,

where (dX(r))² = (dX(r)) · (dX(r)) is computed according to the rules

dr · dr = dr · dW(r) = dW(r) · dr = 0,   dW(r) · dW(r) = dr.

Now choose X(r) = W(r) and g(r, x) = (1/2)x². Then

Y(r) = g(r, X(r)) = (1/2) W(r)².

By Itô's lemma,

dY(r) = (∂g/∂r) dr + (∂g/∂x) dX(r) + (1/2) (∂²g/∂x²) (dX(r))² = W(r) dW(r) + (1/2) (dW(r))² = W(r) dW(r) + (1/2) dr,

where the first derivative of g with respect to x gives X(r) = W(r), and dX(r) = dW(r). Hence,

d[ (1/2) W(r)² ] = W(r) dW(r) + (1/2) dr.
2
Integrating from 0 to 1,

(1/2) W(1)² = ∫₀¹ W(r) dW(r) + 1/2;

therefore,

∫₀¹ W(r) dW(r) = (1/2) (W(1)² − 1).
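A discrete-time counterpart of this result can be checked numerically (an illustrative sketch; the step count and seed are arbitrary): for partial sums of increments dW, the identity Σ W_{t−1} dW_t = (W_n² − Σ dW_t²)/2 holds exactly, and Σ dW_t² is close to 1, mirroring ∫₀¹ W dW = (W(1)² − 1)/2.

```python
import numpy as np

# Discrete analogue of the Ito computation: for a random walk built from
# increments dW, sum_t W_{t-1} dW_t = (W_n^2 - sum_t dW_t^2) / 2 exactly,
# and sum_t dW_t^2 -> 1 as the grid is refined.
rng = np.random.default_rng(4)
steps = 100_000
dW = rng.normal(0.0, np.sqrt(1.0 / steps), size=steps)
W = np.concatenate([[0.0], dW.cumsum()])

lhs = np.sum(W[:-1] * dW)                     # "int W dW"
rhs = (W[-1] ** 2 - np.sum(dW ** 2)) / 2.0    # exact discrete identity

print(lhs - rhs, np.sum(dW ** 2))  # first is ~0; second is close to 1
```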

2.2 Functional Central Limit Theorem (FCLT)

2.2.1 Introduction

Recall that the central limit theorem (CLT) tells us that the sample mean of a stationary process is asymptotically normal and centered at the population mean. However, if x_t is a random walk, there is no such thing as a population mean. Therefore, to draw inference for processes with a unit root, we need a new tool called the functional central limit theorem (FCLT): a central limit theorem defined on function spaces. The FCLT is as important to unit root limit theory as the CLT is to stationary time series limit theory.

As usual, let n denote the sample size, and let r = t/n, so r ∈ [0, 1]. We use the symbol [nr] to denote the largest integer that is less than or equal to nr.

Consider a process u_t ~ i.i.d.(0, σ²) with sample mean ū_n; the CLT tells us that n^{1/2} ū_n →d N(0, σ²). Now suppose that, given a sample of size n, we calculate the mean of the first half of the sample and throw out the rest of the observations:

ū_{[n/2]} = (1/[n/2]) Σ_{t=1}^{[n/2]} u_t.

This estimator also satisfies the CLT:

[n/2]^{1/2} ū_{[n/2]} →d N(0, σ²).

Moreover, this estimator would be independent of an estimator that uses only the second half of the sample. More generally, let's construct a new random variable X_n(r) for r ∈ [0, 1]:

X_n(r) = (1/n) Σ_{t=1}^{[nr]} u_t,

or

X_n(r) = 0                          for r ∈ [0, 1/n)
       = u₁/n                       for r ∈ [1/n, 2/n)
       = (u₁ + u₂)/n                for r ∈ [2/n, 3/n)
         ...
       = (u₁ + u₂ + ... + u_n)/n    for r = 1.
It is easy to see that n^{1/2} X_n(1) = n^{1/2} ū_n, and the CLT tells us that this converges to N(0, σ²). But what does X_n(r) converge to as n → ∞? Write

n^{1/2} X_n(r) = (1/√n) Σ_{t=1}^{[nr]} u_t = (√[nr]/√n) · (1/√[nr]) Σ_{t=1}^{[nr]} u_t.

By the CLT, [nr]^{-1/2} Σ_{t=1}^{[nr]} u_t →d N(0, σ²), while ([nr]/n)^{1/2} → r^{1/2}; therefore we have

n^{1/2} X_n(r)/σ →d N(0, r).    (5)

Next, if we consider the behavior of a sample mean based on observations [nr₁] through [nr₂] for r₂ > r₁, this is also asymptotically normal by a similar approach:

n^{1/2} ( X_n(r₂) − X_n(r₁) )/σ →d N(0, r₂ − r₁),

and it is independent of the estimator in (5) for r < r₁. Therefore, the sequence of stochastic functions {√n X_n(·)/σ}_{n=1}^∞ has an asymptotic probability law:

n^{1/2} X_n(·)/σ →d W(·).    (6)

Note that here X_n(·) is a function, while in (5) X_n(r) is a random variable. The asymptotic result (6) is known as the functional central limit theorem (FCLT). Later on, we may also write n^{1/2} X_n(r) →d W(r), but note this does not mean that the variable X_n(r) converges to a variable with the N(0, r) distribution; rather, the function converges to a stochastic function: the standard Brownian motion.
Evaluated at r = 1, the function X_n(r) is just the sample mean. Thus, when the function in (6) is evaluated at r = 1, we get the conventional CLT:

√n X_n(1)/σ = (1/(σ√n)) Σ_{t=1}^n u_t →d W(1) ~ N(0, 1).    (7)
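The construction of X_n(r) can be sketched directly (illustrative code; the sample size, distribution, and σ are arbitrary choices):

```python
import numpy as np

# Build the step function X_n(r) from a sample: X_n(r) is the partial sum of
# the first [nr] observations divided by n, so sqrt(n) * X_n(1) / sigma is
# the standardized sample mean appearing in the CLT (7).
rng = np.random.default_rng(5)
n, sigma = 1000, 2.0
u = rng.normal(0.0, sigma, size=n)

def X_n(r):
    k = int(np.floor(n * r))      # [nr] observations enter the partial sum
    return u[:k].sum() / n

print(X_n(1.0), u.mean())              # X_n(1) equals the sample mean
print(np.sqrt(n) * X_n(1.0) / sigma)   # approximately N(0, 1)
```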

2.2.2 Convergence of a random function

In lecture 4, we discussed various modes of convergence and the continuous mapping theorem for random variables. Now let's define convergence for a random function, such as the X_n(r) defined earlier.

We first define convergence in distribution for a random function. Let S(·) represent a continuous-time stochastic process, with S(r) representing its value at some date r for r ∈ [0, 1]. Also suppose that for any given realization, S(·) is a continuous function of r with probability 1. For {S_n(·)}_{n=1}^∞ a sequence of such continuous functions, we say that S_n(·) →d S(·) if all of the following hold:

(a) For any finite collection of k particular dates 0 ≤ r₁ < r₂ < ... < r_k ≤ 1, the sequence of k-dimensional random vectors {y_n}_{n=1}^∞ converges in distribution to the vector y, where

y_n ≡ ( S_n(r₁), S_n(r₂), ..., S_n(r_k) )',   y ≡ ( S(r₁), S(r₂), ..., S(r_k) )';

(b) For each ε > 0, the probability that S_n(r₁) differs from S_n(r₂) by more than ε, for any dates r₁ and r₂ within δ of each other, goes to zero uniformly in n as δ → 0;

(c) P(|S_n(0)| > λ) → 0 uniformly in n as λ → ∞.

Next, we extend convergence in probability to random functions. Let {S_n(·)}_{n=1}^∞ and {V_n(·)}_{n=1}^∞ denote sequences of random continuous functions with S_n : r ∈ [0, 1] → R and V_n : r ∈ [0, 1] → R. Define

Y_n = sup_{r ∈ [0,1]} | S_n(r) − V_n(r) |.

Then Y_n is a sequence of random variables. If Y_n →p 0 (this is the usual convergence in probability for a random variable), then we have that

S_n(·) →p V_n(·).

In other words, we define convergence in probability of a random function in terms of convergence of the upper bound of its distance from the limit function. Further, if S_n(·) →p V_n(·) and V_n(·) →d V(·), where V(·) is a continuous function, then S_n(·) →d V(·).
Example 1 Let u_t be a strictly stationary time series with finite fourth moment, and let S_n(r) = n^{-1/2} u_{[nr]}. Then S_n(·) →p 0. Proof:

P( sup_{r ∈ [0,1]} |S_n(r)| > δ ) = P( [|n^{-1/2}u₁| > δ] or [|n^{-1/2}u₂| > δ] ... or [|n^{-1/2}u_n| > δ] )
  ≤ n P( |n^{-1/2}u_t| > δ )
  ≤ n E(n^{-1/2}u_t)⁴ / δ⁴
  = E(u_t⁴) / (n δ⁴) → 0.

So we have S_n(·) →p 0.
In lecture 4, we also reviewed the continuous mapping theorem (CMT): if x_n → x and g(·) is a continuous function, then g(x_n) → g(x). We have a similar result for the FCLT: if S_n(·) → S(·) and g(·) is a continuous functional, then g(S_n(·)) → g(S(·)). For example, √n X_n(·)/σ →d W(·) implies that

√n X_n(·) →d σ W(·).    (8)

As another example, let

S_n(r) ≡ [ √n X_n(r) ]².    (9)

Since √n X_n(·) →d σW(·), it follows that

S_n(·) →d σ² [W(·)]².    (10)

2.3 Applications to Unit Root Processes

The simplest case to illustrate how to use the FCLT to compute asymptotics is a random walk y_t with y₀ = 0:

y_t = y_{t−1} + u_t = Σ_{i=1}^t u_i,   u_t ~ i.i.d. N(0, σ²).

Define X_n(·) as

X_n(r) = 0       for r ∈ [0, 1/n)
       = y₁/n    for r ∈ [1/n, 2/n)
       = y₂/n    for r ∈ [2/n, 3/n)
         ...
       = y_n/n   for r = 1.

If we integrate X_n(r) over r ∈ [0, 1], we have

∫₀¹ X_n(r) dr = y₁/n² + y₂/n² + ... + y_{n−1}/n² = n^{-2} Σ_{t=1}^n y_{t−1}.    (11)

Multiplying both sides by √n:

∫₀¹ √n X_n(r) dr = n^{-3/2} Σ_{t=1}^n y_{t−1}.

From (8) we know that √n X_n(·) →d σW(·), so by the CMT,

∫₀¹ √n X_n(r) dr →d σ ∫₀¹ W(r) dr;

therefore, we get the limit for n^{-3/2} Σ_{t=1}^n y_{t−1}:

n^{-3/2} Σ_{t=1}^n y_{t−1} →d σ ∫₀¹ W(r) dr.    (12)

Thus, when y_t is a driftless random walk, its sample mean (1/n) Σ_{t=1}^n y_t diverges, but n^{-3/2} Σ_{t=1}^n y_t converges. An alternative way to find the limit distribution of n^{-3/2} Σ_{t=1}^n y_t is as follows:
n^{-3/2} Σ_{t=1}^n y_{t−1} = n^{-3/2} [ u₁ + (u₁ + u₂) + ... + (u₁ + u₂ + ... + u_{n−1}) ]
    = n^{-3/2} [ (n−1)u₁ + (n−2)u₂ + ... + u_{n−1} ]
    = n^{-3/2} Σ_{t=1}^n (n−t) u_t
    = n^{-1/2} Σ_{t=1}^n u_t − n^{-3/2} Σ_{t=1}^n t u_t,

while from the previous lecture we know that

( n^{-1/2} Σ_{t=1}^n u_t , n^{-3/2} Σ_{t=1}^n t u_t )' →d N( 0, σ² [ 1  1/2 ; 1/2  1/3 ] ).

Therefore, n^{-3/2} Σ_{t=1}^n y_{t−1} is asymptotically Gaussian with mean zero and variance equal to σ²[1 − 2(1/2) + 1/3] = σ²/3.    (13)

From this expression we also have

n^{-3/2} Σ_{t=1}^n t u_t = n^{-1/2} Σ_{t=1}^n u_t − n^{-3/2} Σ_{t=1}^n y_{t−1} →d σ [ W(1) − ∫₀¹ W(r) dr ].    (14)

Using similar methods, we can compute the asymptotic distribution of the sum of squares of a random walk. Define

S_n(r) = n [X_n(r)]²,

which can be written as

S_n(r) = 0        for r ∈ [0, 1/n)
       = y₁²/n    for r ∈ [1/n, 2/n)
       = y₂²/n    for r ∈ [2/n, 3/n)
         ...
       = y_n²/n   for r = 1.    (15)

Again we compute the sum:

∫₀¹ S_n(r) dr = y₁²/n² + y₂²/n² + ... + y²_{n−1}/n² = n^{-2} Σ_{t=1}^n y²_{t−1}.

Since we have that S_n(·) →d σ² W(·)², by the CMT,

n^{-2} Σ_{t=1}^n y²_{t−1} →d σ² ∫₀¹ [W(r)]² dr.    (16)

If we make use of n^{-3/2} Σ_{t=1}^n y_{t−1} →d σ ∫₀¹ W(r) dr and r = t/n, we also have

n^{-5/2} Σ_{t=1}^n t y_{t−1} = n^{-3/2} Σ_{t=1}^n (t/n) y_{t−1} →d σ ∫₀¹ r W(r) dr.    (17)

Similarly, with r = t/n and using (16), we get

n^{-3} Σ_{t=1}^n t y²_{t−1} = n^{-2} Σ_{t=1}^n (t/n) y²_{t−1} →d σ² ∫₀¹ r [W(r)]² dr.    (18)
Another useful result is

n^{-1} Σ_{t=1}^n y_{t−1} u_t →d (1/2) σ² [ W(1)² − 1 ].    (19)

Proof: first,

y_t² = (y_{t−1} + u_t)² = y²_{t−1} + 2 y_{t−1} u_t + u_t²,

so

n^{-1} Σ_{t=1}^n y_{t−1} u_t = (1/2) n^{-1} Σ_{t=1}^n (y_t² − y²_{t−1}) − (1/2) n^{-1} Σ_{t=1}^n u_t²
    = (1/2) n^{-1} y_n² − (1/2) n^{-1} Σ_{t=1}^n u_t².

By (6), we have n^{-1/2} y_n →d σ W(1), so by the CMT,

(1/2) n^{-1} y_n² →d (1/2) σ² W(1)².

By the LLN,

(1/2) n^{-1} Σ_{t=1}^n u_t² →p (1/2) σ².

Therefore,

n^{-1} Σ_{t=1}^n y_{t−1} u_t →d (1/2) σ² [ W(1)² − 1 ].

3 Unit Root Tests

3.1 Unit Root Tests with i.i.d. Errors

The asymptotics of a random walk with i.i.d. shocks are summarized in the following proposition. The number in brackets shows where each result is first introduced and proved.

Proposition 1 Suppose that ξ_t follows a random walk without drift,

ξ_t = ξ_{t−1} + u_t,   ξ₀ = 0,   u_t ~ i.i.d.(0, σ²).

Then

(a) n^{-1/2} Σ_{t=1}^n u_t →d σ W(1)   [7];
(b) n^{-1} Σ_{t=1}^n ξ_{t−1} u_t →d (1/2) σ² [W(1)² − 1]   [19];
(c) n^{-3/2} Σ_{t=1}^n t u_t →d σ [ W(1) − ∫₀¹ W(r) dr ]   [14];
(d) n^{-3/2} Σ_{t=1}^n ξ_{t−1} →d σ ∫₀¹ W(r) dr   [12];
(e) n^{-2} Σ_{t=1}^n ξ²_{t−1} →d σ² ∫₀¹ W(r)² dr   [16];
(f) n^{-5/2} Σ_{t=1}^n t ξ_{t−1} →d σ ∫₀¹ r W(r) dr   [17];
(g) n^{-3} Σ_{t=1}^n t ξ²_{t−1} →d σ² ∫₀¹ r W(r)² dr   [18];
(h) n^{-(v+1)} Σ_{t=1}^n t^v → 1/(v+1) for v = 0, 1, 2, ...   [lecture 7].
Note that all of these W(·) are the same Brownian motion, so the results are correlated. If we are not interested in their correlations, we can find simpler expressions for them. For example, (a) is just N(0, σ²), (b) is (1/2)σ²[χ²(1) − 1], and (c) and (d) are N(0, σ²/3).

In general, the correspondence between the finite-sample quantities and their limits works like Σ_{t=1}^n → ∫₀¹, (t/n) → r, (1/n) → dr, n^{-1/2} u_t → σ dW, etc. Take (h) as an example and let v = 2. From the previous lecture we know that n^{-3} Σ_{t=1}^n t² → 1/3. Using the correspondence here, we have

n^{-3} Σ_{t=1}^n t² = n^{-1} Σ_{t=1}^n (t/n)² → ∫₀¹ r² dr = 1/3.

3.1.1 Case 1

Suppose that the data generating process (DGP) is a random walk, and we estimate the parameter by OLS in the regression

y_t = ρ y_{t−1} + u_t,   u_t ~ i.i.d.(0, σ²),    (20)

where ρ = 1, and we are interested in the asymptotic distribution of the OLS estimate ρ̂_n:

ρ̂_n = Σ_{t=1}^n y_{t−1} y_t / Σ_{t=1}^n y²_{t−1}
    = Σ_{t=1}^n y_{t−1} (y_{t−1} + u_t) / Σ_{t=1}^n y²_{t−1}
    = 1 + Σ_{t=1}^n y_{t−1} u_t / Σ_{t=1}^n y²_{t−1}.

Then

n ( ρ̂_n − 1 ) = [ n^{-1} Σ_{t=1}^n y_{t−1} u_t ] / [ n^{-2} Σ_{t=1}^n y²_{t−1} ].

By (19) (result (b)), (16) (result (e)), and the CMT, we have

n ( ρ̂_n − 1 ) →d [ W(1)² − 1 ] / [ 2 ∫₀¹ W(r)² dr ].    (21)
First, we note that (ρ̂_n − 1) converges at the order n, instead of n^{1/2} as in the case |ρ| < 1. Therefore, when the true coefficient is unity, ρ̂_n is superconsistent. Second, since W(1) ~ N(0, 1), W(1)² ~ χ²(1). The probability that a χ²(1) variable is less than one is 0.68; therefore, with probability 0.68, n(ρ̂_n − 1) will be negative, which implies that its limit distribution is skewed to the left. Recall that in the AR(1) regression with |ρ| < 1, the estimate ρ̂_n is downward biased, but its limit distribution √n(ρ̂_n − ρ) is still symmetric around zero. When the true value of ρ is unity, even the limit distribution of n(ρ̂_n − 1) is asymmetric, with negative values twice as likely as positive values.

In practice, critical values for the random variable in (21) are found by computing the exact finite-sample distribution of n(ρ̂_n − 1) assuming u_t is Gaussian; the critical values can then be tabulated by Monte Carlo or by numerical approximation.
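Such a Monte Carlo tabulation can be sketched as follows (illustrative; the sample size, number of replications, and seed are arbitrary choices, and the benchmark of about −8 for the 5% quantile refers to standard DF tables):

```python
import numpy as np

# Monte Carlo sketch of the Dickey-Fuller rho-statistic n*(rho_hat - 1) from
# (21): simulate Gaussian random walks, estimate rho by OLS with no constant,
# and inspect the empirical distribution of the statistic.
rng = np.random.default_rng(6)
n, reps = 200, 5000

stats = np.empty(reps)
for i in range(reps):
    y = np.concatenate([[0.0], rng.normal(size=n).cumsum()])
    num = np.sum(y[:-1] * np.diff(y))     # sum y_{t-1} u_t (since rho = 1)
    den = np.sum(y[:-1] ** 2)             # sum y_{t-1}^2
    stats[i] = n * num / den              # n(rho_hat - 1)

print(np.mean(stats < 0))       # about 0.68: the chi2(1) skewness
print(np.percentile(stats, 5))  # in the ballpark of the tabulated -8
```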
There are two commonly used approaches to test the hypothesis ρ₀ = 1: the Dickey-Fuller ρ-test and the Dickey-Fuller t-test. The DF ρ-test computes the statistic n(ρ̂_n − 1) and compares it with the critical values of the distribution in (21); the advantage of this approach is that we do not need to compute a standard deviation. Alternatively, we can use the DF t-test, which is based on the usual t statistic

t_n = ( ρ̂_n − 1 ) / σ̂_{ρ̂_n},    (22)

where σ̂_{ρ̂_n} is the standard deviation of the OLS estimated coefficient,

σ̂²_{ρ̂_n} = s_n² / Σ_{t=1}^n y²_{t−1},    (23)

and

s_n² = (1/(n−1)) Σ_{t=1}^n ( y_t − ρ̂_n y_{t−1} )².
Plugging (23) into (22), we have

t_n = Σ_{t=1}^n y_{t−1} u_t / [ ( Σ_{t=1}^n y²_{t−1} )^{1/2} (s_n²)^{1/2} ].

If ρ = 1, which is true for the OLS estimator in the present problem, then s_n² →p σ² by the LLN. And by (19) and (16), we have the limit for t_n:

t_n →d (1/2) σ² [W(1)² − 1] / { [ σ² ∫₀¹ W(r)² dr ]^{1/2} σ } = [ W(1)² − 1 ] / { 2 [ ∫₀¹ W(r)² dr ]^{1/2} }.    (24)

For the same reason as in (21), this t statistic is asymmetric and skewed to the left.
3.1.2 Case 2

The DGP is still a random walk as in case 1 (20),

y_t = y_{t−1} + u_t,   u_t ~ i.i.d.(0, σ²),

but we include a constant term in the regression:

y_t = α + ρ y_{t−1} + û_t.

The OLS estimates of the coefficients are

( α̂_n , ρ̂_n )' = [ n , Σ y_{t−1} ; Σ y_{t−1} , Σ y²_{t−1} ]^{-1} ( Σ y_t , Σ y_{t−1} y_t )'.

Under the null hypothesis H₀: α = 0, ρ = 1, the deviations of the estimate vector from the hypothesized values are

( α̂_n , ρ̂_n − 1 )' = [ n , Σ y_{t−1} ; Σ y_{t−1} , Σ y²_{t−1} ]^{-1} ( Σ u_t , Σ y_{t−1} u_t )'.    (25)
Recall in a regression with a constant and time trend, the estimates have dierent convergent
rates. The situation is similar in this case. The order in probability for each terms are

n
n

Op (n)
Op (n3/2 )
3/2
Op (n ) Op (n2 )

As we did before, now we need a rescaling matrix


1/2
n
0
Hn =
0
n

Op (n1/2 )
Op (n)

(26)

Premultiplying (25) by $H_n$, we have

$$\begin{pmatrix} n^{1/2}\hat\alpha_n \\ n(\hat\rho_n - 1) \end{pmatrix} = \begin{pmatrix} 1 & n^{-3/2}\sum y_{t-1} \\ n^{-3/2}\sum y_{t-1} & n^{-2}\sum y_{t-1}^2 \end{pmatrix}^{-1}\begin{pmatrix} n^{-1/2}\sum u_t \\ n^{-1}\sum y_{t-1}u_t \end{pmatrix}. \qquad (27)$$

By results (d) and (e) we have

$$\begin{pmatrix} 1 & n^{-3/2}\sum y_{t-1} \\ n^{-3/2}\sum y_{t-1} & n^{-2}\sum y_{t-1}^2 \end{pmatrix} \to_d \begin{pmatrix} 1 & \sigma\int W(r)dr \\ \sigma\int W(r)dr & \sigma^2\int W(r)^2dr \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ 0 & \sigma \end{pmatrix}\begin{pmatrix} 1 & \int W(r)dr \\ \int W(r)dr & \int W(r)^2dr \end{pmatrix}\begin{pmatrix} 1 & 0 \\ 0 & \sigma \end{pmatrix},$$

and by results (a) and (b) we have

$$\begin{pmatrix} n^{-1/2}\sum u_t \\ n^{-1}\sum y_{t-1}u_t \end{pmatrix} \to_d \begin{pmatrix} \sigma W(1) \\ (1/2)\sigma^2[W(1)^2 - 1] \end{pmatrix}.$$
Therefore,

$$\begin{pmatrix} n^{1/2}\hat\alpha_n \\ n(\hat\rho_n - 1) \end{pmatrix} \to_d \begin{pmatrix} 1 & 0 \\ 0 & \sigma^{-1} \end{pmatrix}\begin{pmatrix} 1 & \int W(r)dr \\ \int W(r)dr & \int W(r)^2dr \end{pmatrix}^{-1}\begin{pmatrix} 1 & 0 \\ 0 & \sigma^{-1} \end{pmatrix}\begin{pmatrix} \sigma W(1) \\ (1/2)\sigma^2[W(1)^2 - 1] \end{pmatrix}$$

$$= \frac{1}{\int W(r)^2dr - \left(\int W(r)dr\right)^2}\begin{pmatrix} \sigma\left[W(1)\int W(r)^2dr - (1/2)[W(1)^2 - 1]\int W(r)dr\right] \\ (1/2)[W(1)^2 - 1] - W(1)\int W(r)dr \end{pmatrix}.$$

So the DF statistic for testing the null hypothesis $\rho = 1$ has the following limit distribution:

$$n(\hat\rho_n - 1) \to_d \frac{(1/2)[W(1)^2 - 1] - W(1)\int W(r)dr}{\int W(r)^2dr - \left(\int W(r)dr\right)^2}. \qquad (28)$$
As in case 1, we can also use a $t$-test,

$$t_n = \frac{\hat\rho_n - 1}{\hat\sigma_{\hat\rho_n}},$$

which converges to

$$\frac{(1/2)[W(1)^2 - 1] - W(1)\int W(r)dr}{\left\{\int W(r)^2dr - \left(\int W(r)dr\right)^2\right\}^{1/2}}.$$

The details can be found on pages 493-494 of Hamilton.
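The case-2 statistic can be simulated the same way as in case 1; a sketch with illustrative parameters and seed, whose simulated 5% quantile should land near the tabulated Dickey-Fuller value of about $-14$ for large samples:

```python
import numpy as np

def df_rho_const(n, rng):
    """Case-2 DF statistic n*(rho_hat - 1): fit y_t = a + rho*y_{t-1} + u_t to a random walk."""
    y = np.cumsum(rng.standard_normal(n + 1))
    X = np.column_stack([np.ones(n), y[:-1]])        # regressors: constant, y_{t-1}
    a_hat, rho_hat = np.linalg.lstsq(X, y[1:], rcond=None)[0]
    return n * (rho_hat - 1.0)

rng = np.random.default_rng(1)
stats = np.array([df_rho_const(500, rng) for _ in range(2000)])
q05 = np.percentile(stats, 5)   # approximates the 5% critical value of (28)
```

Including the constant shifts the whole distribution further to the left relative to case 1, which is why case 2 has its own critical values.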


3.1.3 Case 3

Now suppose that the true process is a random walk with drift:

$$y_t = \alpha + y_{t-1} + u_t, \qquad u_t \sim i.i.d.(0, \sigma^2), \qquad \alpha \neq 0.$$

Without loss of generality, we can set $y_0 = 0$. We again estimate a linear regression with a constant,

$$y_t = \alpha + \rho y_{t-1} + u_t.$$

Define

$$\xi_t \equiv u_1 + u_2 + \ldots + u_t;$$

then $y_t = \alpha t + \xi_t$ and

$$\sum_{t=1}^n y_{t-1} = \alpha\sum_{t=1}^n (t-1) + \sum_{t=1}^n \xi_{t-1}.$$

Notice that these two terms have different divergence rates. We know that $\sum_{t=1}^n t = n(n+1)/2 = O(n^2)$, while $\sum_{t=1}^n \xi_{t-1} = O_p(n^{3/2})$, since $n^{-3/2}\sum_{t=1}^n \xi_{t-1}$ converges to a normal distribution with finite variance (result (d)). Therefore, picking the fastest divergence rate,

$$n^{-2}\sum_{t=1}^n y_{t-1} = n^{-2}\alpha\sum_{t=1}^n (t-1) + n^{-2}\sum_{t=1}^n \xi_{t-1} \to_p \alpha/2. \qquad (29)$$

Similarly, $\sum_{t=1}^n y_{t-1}^2$ also has terms with different divergence rates:

$$\sum_{t=1}^n y_{t-1}^2 = \sum_{t=1}^n [\alpha(t-1) + \xi_{t-1}]^2 = \alpha^2\sum_{t=1}^n (t-1)^2 + \sum_{t=1}^n \xi_{t-1}^2 + 2\alpha\sum_{t=1}^n (t-1)\xi_{t-1},$$

where $\sum_{t=1}^n (t-1)^2 = O_p(n^3)$ (result (h)), $\sum_{t=1}^n \xi_{t-1}^2 = O_p(n^2)$ (result (e)), and $\sum_{t=1}^n (t-1)\xi_{t-1} = O_p(n^{5/2})$ (result (f)). Normalizing the sequence by the inverse of the fastest divergence rate, $n^{-3}$,

$$n^{-3}\sum_{t=1}^n y_{t-1}^2 \to_p \alpha^2/3. \qquad (30)$$

Finally,

$$\sum_{t=1}^n y_{t-1}u_t = \sum_{t=1}^n [\alpha(t-1) + \xi_{t-1}]u_t = \alpha\sum_{t=1}^n (t-1)u_t + \sum_{t=1}^n \xi_{t-1}u_t,$$

where $\sum_{t=1}^n (t-1)u_t = O_p(n^{3/2})$ (result (c)) and $\sum_{t=1}^n \xi_{t-1}u_t = O_p(n)$ (result (b)). Again normalizing by the fastest divergence rate,

$$n^{-3/2}\sum_{t=1}^n y_{t-1}u_t = n^{-3/2}\alpha\sum_{t=1}^n (t-1)u_t + o_p(1). \qquad (31)$$

Corresponding to the different rates, to derive a nondegenerate limit distribution for the estimates we again need a scaling matrix. In this case,

$$H_n = \begin{pmatrix} n^{1/2} & 0 \\ 0 & n^{3/2} \end{pmatrix}.$$

Premultiplying the OLS estimator vector (in deviations from the true values) by $H_n$, we get

$$\begin{pmatrix} n^{1/2}(\hat\alpha_n - \alpha) \\ n^{3/2}(\hat\rho_n - 1) \end{pmatrix} = \begin{pmatrix} 1 & n^{-2}\sum y_{t-1} \\ n^{-2}\sum y_{t-1} & n^{-3}\sum y_{t-1}^2 \end{pmatrix}^{-1}\begin{pmatrix} n^{-1/2}\sum u_t \\ n^{-3/2}\sum y_{t-1}u_t \end{pmatrix}.$$

From (29) and (30), the first term satisfies

$$\begin{pmatrix} 1 & n^{-2}\sum y_{t-1} \\ n^{-2}\sum y_{t-1} & n^{-3}\sum y_{t-1}^2 \end{pmatrix} \to_p \begin{pmatrix} 1 & \alpha/2 \\ \alpha/2 & \alpha^2/3 \end{pmatrix} \equiv Q.$$

From (13) and (31), we have

$$\begin{pmatrix} n^{-1/2}\sum u_t \\ n^{-3/2}\sum y_{t-1}u_t \end{pmatrix} = \begin{pmatrix} n^{-1/2}\sum u_t \\ n^{-3/2}\alpha\sum (t-1)u_t \end{pmatrix} + o_p(1) \to_d N\left(0, \sigma^2\begin{pmatrix} 1 & \alpha/2 \\ \alpha/2 & \alpha^2/3 \end{pmatrix}\right) = N(0, \sigma^2Q).$$

Therefore we have the following limit distribution for the OLS estimates:

$$\begin{pmatrix} n^{1/2}(\hat\alpha_n - \alpha) \\ n^{3/2}(\hat\rho_n - 1) \end{pmatrix} \to_d N(0, Q^{-1}\sigma^2QQ^{-1}) = N(0, \sigma^2Q^{-1}). \qquad (32)$$

So in case 3 both estimated coefficients are asymptotically Gaussian, and the asymptotic distribution is the same as for the estimates in the regression on deterministic trends. This is because here $y_t$ has two components, a deterministic time trend and a random walk, and the time trend dominates the random walk.
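A quick simulation check of (32), under the illustrative parameter choices $\alpha = \sigma = 1$ (so that $\sigma^2[Q^{-1}]_{22} = 12$ and the standard deviation of $n^{3/2}(\hat\rho_n - 1)$ should be near $\sqrt{12} \approx 3.46$); sample size, replication count, and seed are arbitrary:

```python
import numpy as np

def rho_dev(n, alpha, rng):
    """n^{3/2}*(rho_hat - 1): fit y_t = a + rho*y_{t-1} + u_t to a drifted random walk."""
    u = rng.standard_normal(n + 1)
    y = np.cumsum(alpha + u)                    # random walk with drift, y_0 = 0
    X = np.column_stack([np.ones(n), y[:-1]])
    rho_hat = np.linalg.lstsq(X, y[1:], rcond=None)[0][1]
    return n ** 1.5 * (rho_hat - 1.0)

rng = np.random.default_rng(2)
devs = np.array([rho_dev(300, 1.0, rng) for _ in range(1000)])
sd = devs.std()   # with alpha = sigma = 1, Q = [[1, 1/2], [1/2, 1/3]] and (Q^{-1})_{22} = 12
```

Unlike cases 1 and 2, no nonstandard tables are needed here: the simulated deviations look approximately Gaussian.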
3.1.4 Case 4

Finally, we consider the case where the true process is a random walk with or without drift,

$$y_t = \alpha + y_{t-1} + u_t, \qquad u_t \sim i.i.d.(0, \sigma^2),$$

where $\alpha$ may or may not be zero, and we run the regression

$$y_t = \alpha + \rho y_{t-1} + \delta t + u_t. \qquad (33)$$

Without loss of generality, we assume $y_0 = 0$. Note that when $\alpha \neq 0$, $\alpha t$ is itself a time trend, so there will be an asymptotic collinearity problem between $y_{t-1}$ and $t$. Hence rewrite the regression as

$$y_t = (1-\rho)\alpha + \rho[y_{t-1} - \alpha(t-1)] + (\delta + \rho\alpha)t + u_t = \alpha^* + \rho^*\xi_{t-1} + \delta^*t + u_t,$$

where $\alpha^* = (1-\rho)\alpha$, $\rho^* = \rho$, $\delta^* = \delta + \rho\alpha$, and $\xi_t = y_t - \alpha t$. With this transformation, under the null hypothesis $\rho = 1, \delta = 0$, $\xi_t$ is a random walk:

$$\xi_t = u_1 + u_2 + \ldots + u_t.$$

Therefore, with this transformation, we regress $y_t$ on a constant, a driftless random walk, and a deterministic time trend.
The OLS estimates in this regression are

$$\begin{pmatrix} \hat\alpha^*_n \\ \hat\rho^*_n \\ \hat\delta^*_n \end{pmatrix} = \begin{pmatrix} n & \sum\xi_{t-1} & \sum t \\ \sum\xi_{t-1} & \sum\xi_{t-1}^2 & \sum\xi_{t-1}t \\ \sum t & \sum\xi_{t-1}t & \sum t^2 \end{pmatrix}^{-1}\begin{pmatrix} \sum y_t \\ \sum\xi_{t-1}y_t \\ \sum ty_t \end{pmatrix}.$$

The hypothesis is that $\alpha = c$ (any constant), $\rho = 1$, and $\delta = 0$; correspondingly, in the transformed system, $\alpha^* = 0$, $\rho^* = 1$, and $\delta^* = c$. The deviations of the estimates from these true values are given by

$$\begin{pmatrix} \hat\alpha^*_n \\ \hat\rho^*_n - 1 \\ \hat\delta^*_n - c \end{pmatrix} = \begin{pmatrix} n & \sum\xi_{t-1} & \sum t \\ \sum\xi_{t-1} & \sum\xi_{t-1}^2 & \sum\xi_{t-1}t \\ \sum t & \sum\xi_{t-1}t & \sum t^2 \end{pmatrix}^{-1}\begin{pmatrix} \sum u_t \\ \sum\xi_{t-1}u_t \\ \sum tu_t \end{pmatrix}. \qquad (34)$$

Note that these three estimates have different convergence rates (we are already familiar with them!): $\hat\alpha^*_n$ is $n^{1/2}$-convergent, $\hat\rho^*_n$ is $n$-convergent, and $\hat\delta^*_n$ is $n^{3/2}$-convergent. Therefore we need the rescaling matrix

$$H_n = \begin{pmatrix} n^{1/2} & 0 & 0 \\ 0 & n & 0 \\ 0 & 0 & n^{3/2} \end{pmatrix}.$$

Premultiplying (34) by $H_n$, we have

$$\begin{pmatrix} n^{1/2}\hat\alpha^*_n \\ n(\hat\rho^*_n - 1) \\ n^{3/2}(\hat\delta^*_n - c) \end{pmatrix} = \begin{pmatrix} 1 & n^{-3/2}\sum\xi_{t-1} & n^{-2}\sum t \\ n^{-3/2}\sum\xi_{t-1} & n^{-2}\sum\xi_{t-1}^2 & n^{-5/2}\sum\xi_{t-1}t \\ n^{-2}\sum t & n^{-5/2}\sum\xi_{t-1}t & n^{-3}\sum t^2 \end{pmatrix}^{-1}\begin{pmatrix} n^{-1/2}\sum u_t \\ n^{-1}\sum\xi_{t-1}u_t \\ n^{-3/2}\sum tu_t \end{pmatrix}.$$

The limit distribution of each term in the above equation can be found in the proposition. Plugging them in, we get

$$\begin{pmatrix} n^{1/2}\hat\alpha^*_n \\ n(\hat\rho^*_n - 1) \\ n^{3/2}(\hat\delta^*_n - c) \end{pmatrix} \to_d \begin{pmatrix} \sigma & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & \sigma \end{pmatrix}\begin{pmatrix} 1 & \int W(r)dr & 1/2 \\ \int W(r)dr & \int W(r)^2dr & \int rW(r)dr \\ 1/2 & \int rW(r)dr & 1/3 \end{pmatrix}^{-1}\begin{pmatrix} W(1) \\ (1/2)[W(1)^2 - 1] \\ W(1) - \int W(r)dr \end{pmatrix}. \qquad (35)$$

The DF unit root test in this case is given by the middle row of (35). Note that it depends on neither $\alpha$ nor $\sigma$. The DF $t$-test can be derived in a similar way (see page 500 in Hamilton).

3.2 Unit Root Tests with Serially Correlated Errors

3.2.1 BN Decomposition and Phillips-Solo Device

Beveridge and Nelson (1981) proposed that any time series that displays some degree of nonstationarity can be decomposed into two additive parts: a stationary (also called cyclical or transitory) part and a nonstationary (also called long-run or permanent) part. Let

$$u_t = C(L)\epsilon_t = \sum_{j=0}^\infty c_j\epsilon_{t-j}, \qquad (36)$$

where (a) $\epsilon_t \sim WN(0, \sigma^2)$ and (b) $\sum_{j=0}^\infty j|c_j| < \infty$. The BN decomposition tells us that we can rewrite the lag polynomial as

$$C(L) = C(1) + (L-1)\tilde C(L),$$

where $C(1) = \sum_{j=0}^\infty c_j$, $\tilde C(L) = \sum_{j=0}^\infty \tilde c_jL^j$, and $\tilde c_j = \sum_{k=j+1}^\infty c_k$. Since we assume $\sum_{j=0}^\infty j|c_j| < \infty$, we have $\sum_{j=0}^\infty |\tilde c_j| < \infty$. Phillips and Solo (1992) verified that under conditions (a) and (b) and $C(1) \neq 0$, $u_t$ can be represented in the form

$$u_t = (C(1) + (L-1)\tilde C(L))\epsilon_t = C(1)\epsilon_t - \tilde C(L)(\epsilon_t - \epsilon_{t-1}).$$
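The decomposition is easy to verify numerically. For the MA(1) filter $C(L) = 1 + cL$, the definitions above give $C(1) = 1 + c$ and a single nonzero BN coefficient $\tilde c_0 = \sum_{k\ge 1} c_k = c$; a sketch (the value of $c$ and the seed are arbitrary):

```python
import numpy as np

# BN decomposition check for C(L) = 1 + c*L:
# C(1) = 1 + c and c~_0 = c, so C(L)e_t = C(1)e_t - c~_0*(e_t - e_{t-1}).
c = 0.7
rng = np.random.default_rng(3)
eps = rng.standard_normal(1000)

u_direct = eps[1:] + c * eps[:-1]                    # u_t = C(L) eps_t
u_bn = (1 + c) * eps[1:] - c * (eps[1:] - eps[:-1])  # C(1) eps_t - C~(L)(eps_t - eps_{t-1})

ok = np.allclose(u_direct, u_bn)
```

Both constructions produce the identical series, term by term.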

Then for a random walk process with innovations $u_t$, we can represent it as

$$y_t = y_{t-1} + u_t = y_0 + \sum_{j=1}^t u_j = y_0 + C(1)\sum_{j=1}^t \epsilon_j - \tilde C(L)\epsilon_t + \tilde C(L)\epsilon_0 = y_0 + \eta_0 - \eta_t + C(1)\sum_{j=1}^t \epsilon_j,$$

where $\eta_0 = \tilde C(L)\epsilon_0$ is the initial condition, $\eta_t = \tilde C(L)\epsilon_t = \sum_{j=0}^\infty \tilde c_j\epsilon_{t-j}$ is a stationary process (note that $\tilde c_j$ is absolutely summable), and $C(1)\sum_{j=1}^t \epsilon_j$ is a nonstationary random walk process. Rewriting $y_t$ (with $y_0 = 0$),

$$y_t = \sum_{s=1}^t u_s = C(1)\sum_{s=1}^t \epsilon_s + \eta_0 - \eta_t.$$

Note that $\sum_{s=1}^t \epsilon_s$ is a random walk with serially uncorrelated errors, so $n^{-1/2}\sum_{s=1}^{[nr]}\epsilon_s \to_d \sigma W(r)$, while $\eta_0 - \eta_t$ is bounded in probability; hence we would expect that $n^{-1/2}y_{[nr]} = C(1)\,n^{-1/2}\sum_{s=1}^{[nr]}\epsilon_s + o_p(1) \to_d \sigma C(1)W(r)$. The following proposition summarizes some important limit theory for unit root processes with serially correlated errors.
Proposition 2. Let $u_t = C(L)\epsilon_t = \sum_{j=0}^\infty c_j\epsilon_{t-j}$, where $\sum_{j=0}^\infty j|c_j| < \infty$ and $\epsilon_t \sim i.i.d.(0, \sigma^2)$ with finite fourth moment $\mu_4$. Define

$$\gamma_h = E(u_tu_{t-h}) = \sigma^2\sum_{j=0}^\infty c_jc_{j+h}, \qquad \lambda = \sigma\sum_{j=0}^\infty c_j = \sigma C(1), \qquad \xi_t = u_1 + u_2 + \ldots + u_t, \quad \xi_0 = 0.$$

In this notation, $\lambda^2$ is known as the long-run variance of $u_t$, which is in general different from the variance of $u_t$, namely $\gamma_0$.
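A sketch contrasting the long-run variance with $\gamma_0$ for the MA(1) case $u_t = \epsilon_t + c\epsilon_{t-1}$ with $\sigma^2 = 1$, where $\gamma_0 = 1 + c^2$ while $\lambda^2 = (1+c)^2$; the variance of $n^{-1/2}\sum u_t$ across replications should approach $\lambda^2$, not $\gamma_0$ (parameters and seed illustrative):

```python
import numpy as np

c, n, reps = 0.5, 500, 2000
gamma0 = 1 + c ** 2          # Var(u_t) for u_t = eps_t + c*eps_{t-1}, sigma^2 = 1  -> 1.25
lrv = (1 + c) ** 2           # long-run variance lambda^2 = sigma^2 * C(1)^2       -> 2.25

rng = np.random.default_rng(4)
sums = np.empty(reps)
for i in range(reps):
    eps = rng.standard_normal(n + 1)
    u = eps[1:] + c * eps[:-1]
    sums[i] = u.sum() / np.sqrt(n)   # n^{-1/2} * sum of u_t

est = sums.var()   # should be near lambda^2, clearly different from gamma_0
```

The gap between `est` and `gamma0` is exactly why unit root asymptotics with serially correlated errors involve both $\lambda^2$ and $\gamma_0$.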

(a) $n^{-1/2}\sum_{t=1}^n u_t \to_d \lambda W(1)$;

(b) $n^{-1/2}\sum_{t=1}^n u_{t-j}\epsilon_t \to_d N(0, \sigma^2\gamma_0)$ for $j = 1, 2, \ldots$;

(c) $n^{-1}\sum_{t=1}^n u_tu_{t-j} \to_p \gamma_j$ for $j = 0, 1, 2, \ldots$;

(d) $n^{-1}\sum_{t=1}^n \xi_{t-1}\epsilon_t \to_d (1/2)\sigma\lambda[W(1)^2 - 1]$;

(e) $n^{-1}\sum_{t=1}^n \xi_{t-1}u_{t-h} \to_d (1/2)[\lambda^2W(1)^2 - \gamma_0]$ for $h = 0$, and $\to_d (1/2)[\lambda^2W(1)^2 - \gamma_0] + \sum_{j=0}^{h-1}\gamma_j$ for $h = 1, 2, \ldots$;

(f) $n^{-3/2}\sum_{t=1}^n \xi_{t-1} \to_d \lambda\int_0^1 W(r)dr$;

(g) $n^{-3/2}\sum_{t=1}^n tu_t \to_d \lambda\left[W(1) - \int_0^1 W(r)dr\right]$;

(h) $n^{-2}\sum_{t=1}^n \xi_{t-1}^2 \to_d \lambda^2\int_0^1 W(r)^2dr$;

(i) $n^{-5/2}\sum_{t=1}^n t\xi_{t-1} \to_d \lambda\int_0^1 rW(r)dr$;

(j) $n^{-3}\sum_{t=1}^n t\xi_{t-1}^2 \to_d \lambda^2\int_0^1 rW(r)^2dr$;

(k) $n^{-(v+1)}\sum_{t=1}^n t^v \to 1/(v+1)$ for $v = 0, 1, 2, \ldots$.

The proof of all these results can be found in the appendix of Chapter 17 in Hamilton. In class we will discuss (a), (e) and (f) as examples. First, to prove (a), write

$$n^{-1/2}\sum_{t=1}^{[nr]} u_t = n^{-1/2}C(1)\sum_{t=1}^{[nr]}\epsilon_t + n^{-1/2}(\eta_0 - \eta_{[nr]}).$$

By (6) and the CMT we have

$$n^{-1/2}C(1)\sum_{t=1}^{[nr]}\epsilon_t \to_d \sigma C(1)W(r).$$

Since $\eta_0$ is the initial condition and $\eta_{[nr]}$ is a zero-mean stationary process, both are bounded in probability, so

$$n^{-1/2}(\eta_0 - \eta_{[nr]}) \to_p 0.$$

Therefore we obtain the limit

$$n^{-1/2}\sum_{t=1}^{[nr]} u_t \to_d \lambda W(r), \qquad (37)$$

and when $r = 1$,

$$n^{-1/2}\sum_{t=1}^n u_t \to_d \lambda W(1).$$

Second, to prove (e),

$$n^{-1}\sum_{t=1}^n \xi_{t-1}u_{t-h} \to_d \begin{cases} (1/2)[\lambda^2W(1)^2 - \gamma_0] & \text{for } h = 0, \\ (1/2)[\lambda^2W(1)^2 - \gamma_0] + \sum_{j=0}^{h-1}\gamma_j & \text{for } h = 1, 2, \ldots, \end{cases} \qquad (38)$$

first let $h = 0$. Since $\xi_t^2 = (\xi_{t-1} + u_t)^2$, we have $\xi_{t-1}u_t = (1/2)(\xi_t^2 - \xi_{t-1}^2 - u_t^2)$, and summing over $t$,

$$n^{-1}\sum_{t=1}^n \xi_{t-1}u_t = (1/2)n^{-1}\xi_n^2 - (1/2)n^{-1}\sum_{t=1}^n u_t^2.$$

We know that $n^{-1/2}\xi_n \to_d \lambda W(1)$. By the CMT,

$$(1/2)n^{-1}\xi_n^2 \to_d (1/2)\lambda^2W(1)^2.$$

From result (c), $n^{-1}\sum_{t=1}^n u_t^2 \to_p \gamma_0$. Therefore

$$n^{-1}\sum_{t=1}^n \xi_{t-1}u_t \to_d (1/2)[\lambda^2W(1)^2 - \gamma_0].$$

Next, let $h = 1$. Note that

$$\xi_{t-1}u_{t-1} = (\xi_{t-2} + u_{t-1})u_{t-1} = \xi_{t-2}u_{t-1} + u_{t-1}^2.$$

We already have the limit for $n^{-1}\sum\xi_{t-2}u_{t-1}$; therefore

$$n^{-1}\sum_{t=1}^n \xi_{t-1}u_{t-1} \to_d (1/2)[\lambda^2W(1)^2 - \gamma_0] + \gamma_0.$$

The cases $h = 2, 3, \ldots$ are similar.
Third, consider result (f),

$$n^{-3/2}\sum_{t=1}^n \xi_{t-1} \to_d \lambda\int_0^1 W(r)dr. \qquad (39)$$

Define

$$S_n(r) = \begin{cases} 0 & \text{for } r \in [0, 1/n), \\ n^{-1/2}\xi_t & \text{for } r \in [t/n, (t+1)/n), \\ n^{-1/2}\xi_n & \text{for } r = 1; \end{cases} \qquad (40)$$

then by (37), $S_n(r) \to_d \lambda W(r)$, and by the CMT,

$$\int_0^1 S_n(r)dr \to_d \lambda\int_0^1 W(r)dr,$$

while

$$\int_0^1 S_n(r)dr = n^{-3/2}\sum_{t=1}^n \xi_t.$$

Since $n^{-3/2}\sum_{t=1}^n \xi_{t-1}$ differs from $n^{-3/2}\sum_{t=1}^n \xi_t$ only by $n^{-3/2}\xi_n = O_p(n^{-1})$, (39) follows.

3.2.2 Phillips-Perron Tests for Unit Roots

We will discuss case 2 only; the other cases can be derived similarly. Let the true DGP be a random walk with serially correlated errors,

$$y_t = \alpha + y_{t-1} + u_t, \qquad u_t = C(L)\epsilon_t,$$

where $C(L)$ and $\epsilon_t$ satisfy the conditions in Proposition 2. When $|\rho| < 1$, the OLS estimate of $\rho$ is not consistent when the errors are serially correlated; however, when $\rho = 1$, the OLS estimate satisfies $\hat\rho_n \to 1$. Phillips and Perron (1988) therefore proposed estimating the regression with OLS and then correcting the estimates for serial correlation.
Under the null hypothesis $H_0: \alpha = 0, \rho = 1$, the deviations of the OLS estimate vector from the hypothesized values satisfy

$$\begin{pmatrix} n^{1/2}\hat\alpha_n \\ n(\hat\rho_n - 1) \end{pmatrix} = \begin{pmatrix} 1 & n^{-3/2}\sum y_{t-1} \\ n^{-3/2}\sum y_{t-1} & n^{-2}\sum y_{t-1}^2 \end{pmatrix}^{-1}\begin{pmatrix} n^{-1/2}\sum u_t \\ n^{-1}\sum y_{t-1}u_t \end{pmatrix}. \qquad (41)$$
Using results (f) and (h) of Proposition 2,

$$\begin{pmatrix} 1 & n^{-3/2}\sum y_{t-1} \\ n^{-3/2}\sum y_{t-1} & n^{-2}\sum y_{t-1}^2 \end{pmatrix} \to_d \begin{pmatrix} 1 & \lambda\int W(r)dr \\ \lambda\int W(r)dr & \lambda^2\int W(r)^2dr \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ 0 & \lambda \end{pmatrix}\begin{pmatrix} 1 & \int W(r)dr \\ \int W(r)dr & \int W(r)^2dr \end{pmatrix}\begin{pmatrix} 1 & 0 \\ 0 & \lambda \end{pmatrix},$$

and using results (a) and (e) of Proposition 2,

$$\begin{pmatrix} n^{-1/2}\sum u_t \\ n^{-1}\sum y_{t-1}u_t \end{pmatrix} \to_d \begin{pmatrix} \lambda W(1) \\ (1/2)[\lambda^2W(1)^2 - \gamma_0] \end{pmatrix} = \begin{pmatrix} \lambda W(1) \\ (1/2)\lambda^2[W(1)^2 - 1] \end{pmatrix} + \begin{pmatrix} 0 \\ (1/2)(\lambda^2 - \gamma_0) \end{pmatrix}.$$
Substituting these two results into (41),

$$\begin{pmatrix} n^{1/2}\hat\alpha_n \\ n(\hat\rho_n - 1) \end{pmatrix} \to_d \begin{pmatrix} 1 & 0 \\ 0 & \lambda^{-1} \end{pmatrix}\begin{pmatrix} 1 & \int W(r)dr \\ \int W(r)dr & \int W(r)^2dr \end{pmatrix}^{-1}\begin{pmatrix} 1 & 0 \\ 0 & \lambda^{-1} \end{pmatrix}\left[\begin{pmatrix} \lambda W(1) \\ (1/2)\lambda^2[W(1)^2 - 1] \end{pmatrix} + \begin{pmatrix} 0 \\ (1/2)(\lambda^2 - \gamma_0) \end{pmatrix}\right].$$

To test $\rho = 1$, take the second row:

$$n(\hat\rho_n - 1) \to_d \frac{(1/2)[W(1)^2 - 1] - W(1)\int W(r)dr}{\int W(r)^2dr - \left(\int W(r)dr\right)^2} + \frac{(1/2)(\lambda^2 - \gamma_0)/\lambda^2}{\int W(r)^2dr - \left(\int W(r)dr\right)^2}.$$

The first term describes the asymptotic distribution of $n(\hat\rho_n - 1)$ as if $u_t$ were i.i.d., as in (28) of the previous subsection. The second term is a correction for serial correlation. When $u_t$ is serially uncorrelated, $C(1) = 1$, so $\lambda^2 = \gamma_0 = \sigma^2$ and this term disappears. The asymptotics for the $t$-statistic can be derived in a similar way.
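The correction can be sketched as follows, using a Newey-West estimate of $\lambda^2$; this is the idea behind the Phillips-Perron $Z_\rho$ statistic, not a production implementation, and the lag truncation, sample size, and seed are illustrative:

```python
import numpy as np

def pp_z_rho(y, lags=4):
    """Sketch of the Phillips-Perron corrected statistic for y_t = a + rho*y_{t-1} + u_t."""
    n = len(y) - 1
    X = np.column_stack([np.ones(n), y[:-1]])
    beta = np.linalg.lstsq(X, y[1:], rcond=None)[0]
    u = y[1:] - X @ beta
    gamma0 = (u @ u) / n
    lrv = gamma0
    for j in range(1, lags + 1):               # Newey-West (Bartlett) estimate of lambda^2
        w = 1 - j / (lags + 1)
        lrv += 2 * w * (u[j:] @ u[:-j]) / n
    ylag_dm = y[:-1] - y[:-1].mean()           # demeaned y_{t-1}
    denom = (ylag_dm @ ylag_dm) / n ** 2       # estimates lambda^2*(int W^2 - (int W)^2)
    return n * (beta[1] - 1.0) - 0.5 * (lrv - gamma0) / denom

rng = np.random.default_rng(5)
eps = rng.standard_normal(3001)
u = eps[1:] + 0.5 * eps[:-1]    # serially correlated errors, C(L) = 1 + 0.5L
y = np.cumsum(u)                 # unit root process
z = pp_z_rho(y)
```

After the correction, `z` can be compared against the ordinary case-2 Dickey-Fuller critical values.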
3.2.3 Augmented Dickey-Fuller Tests for Unit Roots

An alternative unit root test with serially correlated errors is the augmented Dickey-Fuller test. Recall the AR(2) example from earlier in this lecture,

$$(1 - \phi_1L - \phi_2L^2)y_t = \epsilon_t,$$

with one unit root and the other root $|\lambda_2| < 1$, i.e. $(1 - \phi_1L - \phi_2L^2) = (1 - L)(1 - \lambda_2L)$. Then we can rewrite it as

$$y_t = y_{t-1} + u_t, \qquad u_t = (1 - \lambda_2L)^{-1}\epsilon_t = \psi(L)\epsilon_t.$$

So this is a unit root process with serially correlated errors. To correct for the serial correlation, define

$$\rho = \phi_1 + \phi_2, \qquad \zeta = -\phi_2.$$

Then we have the equivalent polynomial identity

$$(1 - \rho L) - \zeta L(1 - L) = 1 - (\phi_1 + \phi_2)L + \phi_2L - \phi_2L^2 = 1 - \phi_1L - \phi_2L^2.$$

Therefore the original AR(2) process can be written as

$$[(1 - \rho L) - \zeta L(1 - L)]y_t = \epsilon_t,$$

or

$$y_t = \rho y_{t-1} + \zeta\Delta y_{t-1} + \epsilon_t. \qquad (42)$$

This approach generalizes to an AR(p) process, with

$$\rho = \phi_1 + \phi_2 + \ldots + \phi_p$$

and

$$\zeta_j = -[\phi_{j+1} + \phi_{j+2} + \ldots + \phi_p] \qquad \text{for } j = 1, 2, \ldots, p-1.$$

Note that when the process contains a unit root, one root of

$$1 - \phi_1z - \phi_2z^2 - \ldots - \phi_pz^p = 0$$

is unity, so that

$$1 - \phi_1 - \phi_2 - \ldots - \phi_p = 0,$$

which implies $\rho = 1$. Therefore testing whether a process contains a unit root is equivalent to testing $\rho = 1$ in (42). Furthermore, (42) is a regression with serially uncorrelated errors. For simplicity, in the following discussion we work with an AR(2) process, and again we consider only case 2. Our regression is

$$y_t = \zeta\Delta y_{t-1} + \alpha + \rho y_{t-1} + \epsilon_t \equiv x_t'\beta + \epsilon_t,$$

where $x_t = (\Delta y_{t-1}, 1, y_{t-1})'$ and $\beta = (\zeta, \alpha, \rho)'$. The deviation of the OLS estimates from the true $\beta$ is


$$\hat\beta_n - \beta = \left[\sum_{t=1}^n x_tx_t'\right]^{-1}\left[\sum_{t=1}^n x_t\epsilon_t\right].$$

Let $u_t = y_t - y_{t-1}$. Then

$$\sum_{t=1}^n x_tx_t' = \begin{pmatrix} \sum u_{t-1}^2 & \sum u_{t-1} & \sum u_{t-1}y_{t-1} \\ \sum u_{t-1} & n & \sum y_{t-1} \\ \sum y_{t-1}u_{t-1} & \sum y_{t-1} & \sum y_{t-1}^2 \end{pmatrix}, \qquad \sum_{t=1}^n x_t\epsilon_t = \begin{pmatrix} \sum u_{t-1}\epsilon_t \\ \sum\epsilon_t \\ \sum y_{t-1}\epsilon_t \end{pmatrix}.$$

$u_t$ is stationary, so its coefficient is $n^{1/2}$-convergent; the scaling matrix is

$$H_n = \begin{pmatrix} \sqrt n & 0 & 0 \\ 0 & \sqrt n & 0 \\ 0 & 0 & n \end{pmatrix}.$$

Premultiplying the coefficient vector by $H_n$,

$$H_n(\hat\beta_n - \beta) = \left[H_n^{-1}\sum_{t=1}^n x_tx_t'H_n^{-1}\right]^{-1}H_n^{-1}\sum_{t=1}^n x_t\epsilon_t. \qquad (43)$$

Define $\lambda = \sigma C(1) = \sigma/(1 - \lambda_2)$ and $\sigma^2 = E(\epsilon_t^2)$. Then

$$H_n^{-1}\left[\sum_{t=1}^n x_tx_t'\right]H_n^{-1} \to_d \begin{pmatrix} V & 0 & 0 \\ 0 & 1 & \lambda\int W(r)dr \\ 0 & \lambda\int W(r)dr & \lambda^2\int W(r)^2dr \end{pmatrix} \equiv \begin{pmatrix} V & 0 \\ 0 & Q \end{pmatrix}.$$

Here $V = \gamma_0 = E(u_t^2)$ in this AR(2) problem, while it would be a matrix with elements $\gamma_j$ for a general AR(p) model, and

$$Q = \begin{pmatrix} 1 & \lambda\int W(r)dr \\ \lambda\int W(r)dr & \lambda^2\int W(r)^2dr \end{pmatrix}.$$

Next, consider the second term in (43),

$$H_n^{-1}\sum_{t=1}^n x_t\epsilon_t = \begin{pmatrix} n^{-1/2}\sum u_{t-1}\epsilon_t \\ n^{-1/2}\sum\epsilon_t \\ n^{-1}\sum y_{t-1}\epsilon_t \end{pmatrix}.$$

Applying the usual CLT to the first element,

$$n^{-1/2}\sum_{t=1}^n u_{t-1}\epsilon_t \to_d h_1 \sim N(0, \sigma^2V).$$

Applying results (a) and (d) of Proposition 2 to the other two elements,

$$\begin{pmatrix} n^{-1/2}\sum\epsilon_t \\ n^{-1}\sum y_{t-1}\epsilon_t \end{pmatrix} \to_d h_2 \equiv \begin{pmatrix} \sigma W(1) \\ (1/2)\sigma\lambda[W(1)^2 - 1] \end{pmatrix}.$$
Substituting the above results into (43), we get

$$H_n(\hat\beta_n - \beta) \to_d \begin{pmatrix} V & 0 \\ 0 & Q \end{pmatrix}^{-1}\begin{pmatrix} h_1 \\ h_2 \end{pmatrix} = \begin{pmatrix} V^{-1}h_1 \\ Q^{-1}h_2 \end{pmatrix}. \qquad (44)$$

Since the limit distribution is block diagonal, we can discuss the coefficients on the stationary components and the nonstationary components separately. For the stationary component,

$$\sqrt n(\hat\zeta_n - \zeta) \to_d V^{-1}h_1 \sim N(0, \sigma^2V^{-1}).$$

In this AR(2) problem, the variance is simply $\sigma^2/\gamma_0$. The limit distribution of the constant and the I(1) component is

$$\begin{pmatrix} n^{1/2}\hat\alpha_n \\ n(\hat\rho_n - 1) \end{pmatrix} \to_d Q^{-1}h_2 = \begin{pmatrix} 1 & \lambda\int W(r)dr \\ \lambda\int W(r)dr & \lambda^2\int W(r)^2dr \end{pmatrix}^{-1}\begin{pmatrix} \sigma W(1) \\ (1/2)\sigma\lambda[W(1)^2 - 1] \end{pmatrix}.$$

This implies that $n(\lambda/\sigma)(\hat\rho_n - 1)$ has the same limit distribution as in (28), where $\lambda/\sigma = C(1) = 1/(1 - \lambda_2)$. Therefore the ADF $\rho$-test is

$$\frac{n(\hat\rho_n - 1)}{1 - \hat\zeta_n} \to_d \frac{(1/2)[W(1)^2 - 1] - W(1)\int W(r)dr}{\int W(r)^2dr - \left(\int W(r)dr\right)^2}. \qquad (45)$$

For a general AR(p) process, simply replace $(1 - \hat\zeta_n)$ with $(1 - \hat\zeta_{1,n} - \ldots - \hat\zeta_{p-1,n})$. The ADF $t$-test can be found in Hamilton's book.
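A sketch of the ADF $\rho$-test on a simulated AR(2) with a unit root, $(1-L)(1-0.5L)y_t = \epsilon_t$, i.e. $\phi_1 = 1.5$, $\phi_2 = -0.5$ (sample size, replication count, and seed are illustrative):

```python
import numpy as np

def adf_rho_stat(y):
    """ADF rho-statistic n*(rho_hat - 1)/(1 - zeta_hat) from the regression
    y_t = zeta*dy_{t-1} + a + rho*y_{t-1} + eps_t (case 2, one lagged difference)."""
    dy = np.diff(y)
    n = len(dy) - 1
    X = np.column_stack([dy[:-1], np.ones(n), y[1:-1]])
    zeta, a, rho = np.linalg.lstsq(X, y[2:], rcond=None)[0]
    return n * (rho - 1.0) / (1.0 - zeta)

rng = np.random.default_rng(6)
stats = []
for _ in range(500):
    eps = rng.standard_normal(800)
    y = np.zeros(800)
    for t in range(2, 800):
        y[t] = 1.5 * y[t - 1] - 0.5 * y[t - 2] + eps[t]   # unit root AR(2)
    stats.append(adf_rho_stat(y))
stats = np.array(stats)
```

By (45), `stats` should behave approximately like the case-2 Dickey-Fuller distribution (28), so the ordinary DF tables apply.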

Lecture 9: Multivariate Unit Root Processes and Cointegration

1 Multivariate Unit Root Processes

To move from univariate to multivariate unit root processes, we need to extend the scalar Brownian motion to a vector Brownian motion.

Definition 1. A $k$-dimensional standard Brownian motion $W(\cdot)$ is a continuous-time process associating each date $r \in [0,1]$ with the $(k\times 1)$ vector $W(r)$ satisfying:
(a) $W(0) = 0$;
(b) for any dates $0 \le r_1 < r_2 < \ldots < r_k \le 1$, the changes $[W(r_2) - W(r_1)], [W(r_3) - W(r_2)], \ldots, [W(r_k) - W(r_{k-1})]$ are independent multivariate Gaussian with $[W(s) - W(r)] \sim N(0, (s-r)I_k)$;
(c) for any given realization, $W(r)$ is continuous in $r$ with probability 1.

Let $v_t$ be a $k$-dimensional i.i.d. vector process with $E(v_t) = 0$ and $E(v_tv_t') = I_k$. Define $X_n(r) = n^{-1}(v_1 + \ldots + v_{[nr]})$; then the vector version of the FCLT is

$$\sqrt nX_n(\cdot) \to_d W(\cdot). \qquad (1)$$

Let $\epsilon_t$ be a $k$-dimensional i.i.d. vector process with $E(\epsilon_t) = 0$ and $E(\epsilon_t\epsilon_t') = \Omega$. A Cholesky decomposition of $\Omega$ gives $\Omega = PP'$, so we can write $\epsilon_t = Pv_t$. Define $X_n(r) = n^{-1}(\epsilon_1 + \ldots + \epsilon_{[nr]})$; then (1) and the CMT give

$$\sqrt nX_n(\cdot) \to_d PW(\cdot). \qquad (2)$$

Finally, consider serially correlated errors $u_t = \sum_{s=0}^\infty C_s\epsilon_{t-s}$, where, if $C^s_{ij}$ denotes the $ij$-th element of $C_s$,

$$\sum_{s=0}^\infty s|C^s_{ij}| < \infty.$$

Applying the BN decomposition,

$$\sum_{s=1}^t u_s = C(1)\sum_{s=1}^t \epsilon_s + \eta_t - \eta_0,$$

where $C(1) = C_0 + C_1 + \ldots$, $\eta_t = \sum_{s=0}^\infty \tilde C_s\epsilon_{t-s}$ with $\tilde C_s = -(C_{s+1} + C_{s+2} + \ldots)$, and $\tilde C_s$ is absolutely summable. Now define $X_n(r) = (1/n)(u_1 + \ldots + u_{[nr]})$; then

$$\sqrt nX_n(\cdot) \to_d C(1)PW(\cdot). \qquad (3)$$

Proposition 18.1 in Hamilton (p. 547) (P18.1 for short in this lecture) summarizes many useful asymptotic results for vector unit root processes. Most of them are analogous to the univariate cases, with scalars replaced by the corresponding vectors and matrices.

2 Spurious Regression

Consider two independent I(1) variables, $x_1$ and $x_2$. If we regress $x_1$ on $x_2$, despite the fact that they are actually independent, the OLS estimate of the coefficient may appear significant. This phenomenon is called spurious regression (Granger and Newbold (1974), Phillips (1986)). Proposition 18.2 in Hamilton (1994) gives the results developed by Phillips (1986). We reproduce a two-variable version for (some degree of) simplicity of presentation. I think this will be easier to read, but you are still encouraged to read the original propositions and proofs.

Let $y_t = (x_{1t}, x_{2t})'$ be generated by

$$\Delta y_t = \Psi(L)\epsilon_t = \sum_{j=0}^\infty C_j\epsilon_{t-j},$$

where the error $\epsilon_t$ satisfies our standard assumptions (mean zero, finite fourth moments) and $sC_s$ is absolutely summable.

Consider the regression

$$x_{1t} = \alpha + \gamma x_{2t} + u_t. \qquad (4)$$

The OLS coefficient estimates for a sample of size $n$ are given by

$$\begin{pmatrix} \hat\alpha_n \\ \hat\gamma_n \end{pmatrix} = \begin{pmatrix} n & \sum x_{2t} \\ \sum x_{2t} & \sum x_{2t}^2 \end{pmatrix}^{-1}\begin{pmatrix} \sum x_{1t} \\ \sum x_{2t}x_{1t} \end{pmatrix}.$$

Consider a null hypothesis $H_0: R\beta = r$, where $R$ is an $m\times 2$ matrix representing $m$ separate hypotheses involving the coefficients $\beta = (\alpha, \gamma)'$. The OLS $F$-statistic is

$$F_n = (R\hat\beta_n - r)'\left\{s_n^2R\begin{pmatrix} n & \sum x_{2t} \\ \sum x_{2t} & \sum x_{2t}^2 \end{pmatrix}^{-1}R'\right\}^{-1}(R\hat\beta_n - r)/m, \qquad (5)$$

where $s_n^2 = (n-2)^{-1}\sum_{t=1}^n \hat u_t^2$. To derive the asymptotics for the estimates and test statistics, we will do some transformations. Let $E(\epsilon_t\epsilon_t') = PP'$ and $\Lambda = \Psi(1)P$. Partition $\Lambda\Lambda'$ as

$$\Lambda\Lambda' = \begin{pmatrix} \lambda_1^2 & \lambda_{12} \\ \lambda_{21} & \lambda_2^2 \end{pmatrix}.$$

Suppose that $\Lambda\Lambda'$ is nonsingular, and define

$$\lambda_{1\cdot 2}^2 = \lambda_1^2 - \lambda_{12}^2/\lambda_2^2.$$

To further simplify the problem, we assume that $x_1$ and $x_2$ are independent, so $\lambda_{12} = \lambda_{21} = 0$ and $\lambda_{1\cdot 2} = \lambda_1$, and the $L_{22}$ matrix in Proposition 18.2 is just $\lambda_2^{-1}$. The three parts of Proposition 18.2 in this problem then become:
(a) The OLS estimates $\hat\alpha_n$ and $\hat\gamma_n$ are characterized by

$$\begin{pmatrix} n^{-1/2}\hat\alpha_n \\ \hat\gamma_n \end{pmatrix} \to_d \begin{pmatrix} \lambda_1h_1 \\ (\lambda_1/\lambda_2)h_2 \end{pmatrix}, \qquad (6)$$

where

$$\begin{pmatrix} h_1 \\ h_2 \end{pmatrix} \equiv Q^{-1}F \equiv \begin{pmatrix} 1 & \int W_2(r)dr \\ \int W_2(r)dr & \int[W_2(r)]^2dr \end{pmatrix}^{-1}\begin{pmatrix} \int W_1(r)dr \\ \int W_1(r)W_2(r)dr \end{pmatrix},$$

$W_1(r)$ and $W_2(r)$ are independent standard Brownian motions, and $Q$ and $F$ are defined to be the above matrix and vector.
(b) The sum of squared residuals $RSS_n$ from the OLS estimation satisfies

$$n^{-2}RSS_n \to_d \lambda_1^2H, \qquad \text{where } H = \int W_1(r)^2dr - F'Q^{-1}F.$$

(c) The OLS $F$-statistic diverges at rate $n$: $n^{-1}F_n$ converges in distribution to a nondegenerate quadratic form in $(h_1, h_2)$; see Proposition 18.2(c) in Hamilton for the exact expression. (7)

The proof of these results for the general case can be found in Chapter 18 of Hamilton.
In our simple case, (6) tells us that neither of the estimates $(\hat\alpha, \hat\gamma)$ is consistent. Recall that if $\hat\beta$ is a consistent estimate of $\beta$, then $\hat\beta - \beta \to 0$ and we have to scale it by $n^r$ with $r > 0$ to obtain a nondegenerate limit distribution. In this problem the OLS estimate of $\alpha$ actually diverges with the sample size $n$, since we have to scale it by $n^{-1/2}$ to obtain a limit distribution.

Result (b) then tells us that the OLS estimate of the variance of $u_t$ also diverges:

$$s_n^2 = (n-k)^{-1}RSS_n \to \infty.$$

This is because in a spurious regression the residual $\hat u_t$ is an I(1) nonstationary process. To see this, write

$$\hat u_t = x_{1t} - \hat\alpha_n - \hat\gamma_nx_{2t};$$

taking differences,

$$\Delta\hat u_t = \Delta x_{1t} - \hat\gamma_n\Delta x_{2t} = [1, -\hat\gamma_n]\begin{pmatrix} \Delta x_{1t} \\ \Delta x_{2t} \end{pmatrix} \to_d [1, -(\lambda_1/\lambda_2)h_2]\,\Delta y_t,$$

which is a random vector times an I(0) variable, so $\Delta\hat u_t$ is I(0) and $\hat u_t$ is I(1). Hence $s_n^2 = n^{-1}\sum_{t=1}^n \hat u_t^2$ diverges, while $n^{-2}\sum_{t=1}^n \hat u_t^2$ converges.

Result (c) tells us that any OLS $t$- or $F$-statistic based on a spurious regression also diverges: the usual $t$-statistic has to be divided by $n^{1/2}$ to converge, and the usual $F$-statistic has to be divided by $n$. If we draw inference based on the usual test statistics, we will tend to conclude that $x_{1t}$ and $x_{2t}$ are significantly related even when they are independent.
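A small simulation illustrating the phenomenon (sample size, replication count, and seed are illustrative): regressing one random walk on an independent one and testing $\gamma = 0$ with the usual $t$-statistic rejects far more often than the nominal 5%:

```python
import numpy as np

rng = np.random.default_rng(7)
n, reps, reject = 200, 500, 0
for _ in range(reps):
    x1 = np.cumsum(rng.standard_normal(n))   # two independent random walks
    x2 = np.cumsum(rng.standard_normal(n))
    X = np.column_stack([np.ones(n), x2])
    beta = np.linalg.lstsq(X, x1, rcond=None)[0]
    u = x1 - X @ beta
    s2 = (u @ u) / (n - 2)
    t_stat = beta[1] / np.sqrt(s2 * np.linalg.inv(X.T @ X)[1, 1])
    reject += abs(t_stat) > 1.96             # nominal 5% two-sided test
rate = reject / reps
```

Despite independence of the two series, `rate` is far above 0.05, matching result (c).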

There are three common remedies for spurious regression. First, we may include lags of both the dependent and independent variables, say

$$x_{1t} = \alpha + \phi x_{1,t-1} + \gamma_0x_{2t} + \gamma_1x_{2,t-1} + u_t.$$

Now the values $\phi = 1$, $\gamma_0 = \gamma_1 = 0$ make $u_t$ I(0), so most of the usual OLS inferences are valid, although tests of some hypotheses will still involve nonstandard distributions.

The second remedy is to difference the data before the regression,

$$\Delta x_{1t} = \alpha + \gamma\Delta x_{2t} + u_t. \qquad (8)$$

Now $u_t$ is again I(0) and all the usual OLS inferences are valid. However, in differencing the data we may lose some information.

Finally, we could apply GLS to estimate the system. We first run an AR(1) regression on the residual $\hat u_t$, $\hat u_t = \rho\hat u_{t-1} + e_t$, then define $\tilde x_{1t} = x_{1t} - \hat\rho x_{1,t-1}$ and $\tilde x_{2t} = x_{2t} - \hat\rho x_{2,t-1}$, and regress $\tilde x_{1t}$ on $\tilde x_{2t}$. Since $\hat u_t$ is a unit root process, $\hat\rho \to 1$, so this Cochrane-Orcutt GLS regression is asymptotically equivalent to running OLS on the differenced data (8).

3 Cointegration

3.1 Introduction

In the previous section we showed that when we regress one I(1) variable on another I(1) variable and the residuals of the regression are also I(1), the regression is spurious: even when the two variables are independent, the usual OLS inference may imply that they are significantly related. You may now wonder when it is valid to run an OLS regression between I(1) variables. It turns out that the regression is valid only when the residual is stationary, and in this case we say that the I(1) variables are cointegrated.

There are two facts about cointegration. First, cointegration is a relationship that applies only to I(1) series. Second, although each individual series $x_{1t}, x_{2t}, \ldots, x_{kt}$ is I(1), letting $y_t = (x_{1t}, x_{2t}, \ldots, x_{kt})'$, there exists a nonzero $k\times 1$ vector $\gamma$ such that the series $\gamma'y_t$ is I(0). There are many examples of cointegration in economic applications. For instance, both income and consumption may be nonstationary, but they seem to keep a stable relation with each other. Or if we look at data on the short rate and the 3-month forward rate, they also tend to have a stable relationship over time, although both wander around.

Example: consider the following system of processes,

$$x_{1t} = \gamma_1x_{2t} + \gamma_2x_{3t} + u_{1t},$$
$$x_{2t} = \gamma_3x_{3t} + u_{2t},$$
$$x_{3t} = x_{3,t-1} + u_{3t},$$

where the three error terms are uncorrelated white noise processes. Clearly, all three processes are individually I(1). Let $y_t = (x_{1t}, x_{2t}, x_{3t})'$ and $\gamma = (1, -\gamma_1, -\gamma_2)'$; then $\gamma'y_t = u_{1t}$, which is an I(0) process. Another cointegrating relationship is between $x_{2t}$ and $x_{3t}$: letting $\beta = (0, 1, -\gamma_3)'$, $\beta'y_t = u_{2t}$ is also I(0).

3.1.1 Cointegrating Matrix

In the above example we see that the cointegrating vector is not unique. Also note that $\gamma$ and $\beta$ are linearly independent. In general, if the cointegrated system has $k$ I(1) series, we can have $h$ linearly independent cointegrating vectors with $h < k$; in the example we have 3 I(1) series and 2 cointegrating relations. Let $a_i$, $i = 1, \ldots, h$, denote these vectors; then we can construct an $h\times k$ matrix

$$A' = \begin{pmatrix} a_1' \\ \vdots \\ a_h' \end{pmatrix}.$$

The vector $A'y_t$ is then an $h$-vector-valued stationary time series. In our example,

$$A' = \begin{pmatrix} 1 & -\gamma_1 & -\gamma_2 \\ 0 & 1 & -\gamma_3 \end{pmatrix} \quad\text{and}\quad A'y_t = \begin{pmatrix} u_{1t} \\ u_{2t} \end{pmatrix}.$$

Given an $(h\times k)$ matrix $A'$ whose rows are linearly independent and for which $A'y_t$ is a stationary $(h\times 1)$ vector, suppose further that if $c'$ is any $(1\times k)$ vector linearly independent of the rows of $A'$, then $c'y_t$ is nonstationary. Then we say there are exactly $h$ cointegrating relations among the elements of $y_t$, and the rows of $A'$ $(a_1, \ldots, a_h)$ form a basis for the space of cointegrating vectors.
3.1.2 MA Representations

In univariate time series analysis we sometimes prefer to use differenced data when the original data are I(1): if $x_t$ is I(1), we may difference it and specify an AR(p) process for $\Delta x_t$. However, we cannot do this for a cointegrated system. Assume that $\Delta y_t$ is I(0), and let $\delta = E(\Delta y_t)$. Define

$$u_t = \Delta y_t - \delta. \qquad (9)$$

Then $u_t$ is a stationary process by assumption. Suppose that $u_t$ has the Wold decomposition $u_t = \Psi(L)\epsilon_t$, where $\epsilon_t$ is a vector white noise, and let $\Psi(1) = I_k + \Psi_1 + \Psi_2 + \ldots$.

The difference equation (9) together with the BN decomposition of $y_t$ gives

$$y_t = y_0 + \delta t + u_1 + u_2 + \ldots + u_t = y_0 + \delta t + \Psi(1)(\epsilon_1 + \epsilon_2 + \ldots + \epsilon_t) + \eta_t - \eta_0.$$

Premultiplying by $A'$,

$$A'y_t = A'y_0 + A'\delta t + A'\Psi(1)(\epsilon_1 + \epsilon_2 + \ldots + \epsilon_t) + A'\eta_t - A'\eta_0.$$

To ensure that $A'y_t$ is stationary, the coefficients on the nonstationary components $\delta t$ and $\sum_{i=1}^t\epsilon_i$ must be zero, i.e.

$$A'\Psi(1) = 0, \qquad A'\delta = 0.$$

Note that $A'\Psi(1) = 0$ implies that $|\Psi(z)| = 0$ at $z = 1$, which in turn means that $\Psi(L)$ is noninvertible. Thus a cointegrated system can never be represented by a finite-order VAR in the differenced data $\Delta y_t$. Intuitively, this is because in the dynamics of the system the level of the variables matters.

3.1.3 Phillips's Triangular Representation

The cointegrating matrix is obviously not unique for a cointegrated system, so researchers can choose a representation that is convenient for their problem. Phillips (1991) suggested that the $h\times k$ cointegrating matrix $A'$ be transformed as

$$A' = [\,I_h \quad -\Gamma\,],$$

where $\Gamma$ is a matrix of size $h\times g$ with $g = k - h$. Define $z_t$ as

$$z_t = A'y_t.$$

Correspondingly, partition $y_t = (y_{1t}', y_{2t}')'$; then we can represent $y_{1t}$ and $y_{2t}$ separately as

$$y_{1t} = \Gamma y_{2t} + z_t$$

and

$$\Delta y_{2t} = \delta_2 + u_{2t},$$

where $\delta_2$ and $u_{2t}$ are the last $g$ elements of $\delta$ and $u_t$. I will show how this works in our example, where

$$A' = \begin{pmatrix} 1 & -\gamma_1 & -\gamma_2 \\ 0 & 1 & -\gamma_3 \end{pmatrix}.$$

Transforming $A'$ to the required form (replace the first row by row 1 plus $\gamma_1$ times row 2),

$$A' = \begin{pmatrix} 1 & 0 & -(\gamma_2 + \gamma_1\gamma_3) \\ 0 & 1 & -\gamma_3 \end{pmatrix},$$

so that

$$z_t = \begin{pmatrix} u_{1t} + \gamma_1u_{2t} \\ u_{2t} \end{pmatrix}.$$

Therefore we can represent $y_t$ as

$$y_{1t} = \begin{pmatrix} x_{1t} \\ x_{2t} \end{pmatrix} = \begin{pmatrix} \gamma_2 + \gamma_1\gamma_3 \\ \gamma_3 \end{pmatrix}x_{3t} + z_t, \qquad \Delta y_{2t} = \Delta x_{3t} = u_{3t}.$$

The cointegrating relationships are very clear from this representation. There are also other representations that are convenient in some problems (reading: pages 578-582).

3.2 Cointegration Tests

In our discussion of spurious regression we learned that given two I(1) processes, even if they are independent, the OLS estimator may turn out to be significant according to regular test statistics such as the $t$- and $F$-tests. Therefore we should be cautious when running regressions between nonstationary time series. However, if we know that two or more I(1) series are cointegrated, then it is valid to apply our linear regression techniques. Testing whether a system of nonstationary processes is cointegrated is therefore critical in multivariate nonstationary time series studies. In the test for cointegration, we let the null be no cointegration among the elements of a $(k\times 1)$ vector $y_t$; rejection of the null is then taken as evidence of cointegration.

3.2.1 Cointegrating Vector Is Known

If we already know the cointegrating vector, or a specific relationship, say $\gamma$, is implied by economic theory, then testing whether $y_t = (x_{1t}, \ldots, x_{kt})'$ is cointegrated can be done in two steps. First, we test whether $x_{it}$ is I(1) for $i = 1, \ldots, k$. Second, if all series in $y_t$ are I(1), we test whether $\gamma'y_t = z_t$ is I(0); if it is, the system is cointegrated. In both steps we can use the unit root tests discussed in the previous lecture.
3.2.2 Estimating the Cointegrating Vector

If the cointegrating vector is unknown, we can first test whether each series is I(1), then estimate the cointegrating vector using OLS, and finally test the null hypothesis of cointegration, which is equivalent to testing whether the residual $\hat u_t$ is I(0). With the OLS estimate $\hat\gamma$: if $\hat u_t$ is I(0), the vector is cointegrated; if $\hat u_t$ is I(1), the regression is spurious. The following proposition (Proposition 19.2 in Hamilton) summarizes the asymptotic results for this approach (for simplicity, we let $y_2$ be a scalar).

Proposition 1. Suppose

$$y_{1t} = \alpha + \gamma y_{2t} + u_{1t}, \qquad \Delta y_{2t} = u_{2t}, \qquad u_t = \begin{pmatrix} u_{1t} \\ u_{2t} \end{pmatrix} = C(L)\epsilon_t, \qquad (10)$$

where $\epsilon_t$ is an i.i.d. vector with mean zero, finite fourth moments, and positive definite variance-covariance matrix $E(\epsilon_t\epsilon_t') = PP'$. Further suppose that $sC_s$ is absolutely summable and that the rows of $C(1)$ are linearly independent. Let $\hat\alpha$ and $\hat\gamma$ be the OLS estimates

$$\begin{pmatrix} \hat\alpha \\ \hat\gamma \end{pmatrix} = \begin{pmatrix} n & \sum y_{2t} \\ \sum y_{2t} & \sum y_{2t}^2 \end{pmatrix}^{-1}\begin{pmatrix} \sum y_{1t} \\ \sum y_{2t}y_{1t} \end{pmatrix}. \qquad (11)$$

Partition $C(1)P$ as

$$C(1)P = \begin{pmatrix} \lambda_1' \\ \lambda_2' \end{pmatrix};$$

then

$$\begin{pmatrix} n^{1/2}(\hat\alpha_n - \alpha) \\ n(\hat\gamma_n - \gamma) \end{pmatrix} \to_d \begin{pmatrix} 1 & \int\lambda_2'W(r)dr \\ \int\lambda_2'W(r)dr & \int[\lambda_2'W(r)]^2dr \end{pmatrix}^{-1}\begin{pmatrix} h_1 \\ h_2 \end{pmatrix},$$

where $W(r)$ is a two-dimensional standard Brownian motion and

$$\begin{pmatrix} h_1 \\ h_2 \end{pmatrix} = \begin{pmatrix} \lambda_1'W(1) \\ \lambda_1'\left[\int_0^1 W(r)dW(r)'\right]\lambda_2 + \sum_{v=0}^\infty E(u_{2t}u_{1,t+v}) \end{pmatrix}.$$

To understand the results, consider the simple example in which $u_{1t}$ and $u_{2s}$ are uncorrelated,

$$C(L) = \begin{pmatrix} 1 & 0 \\ 0 & 1 + cL \end{pmatrix}, \qquad E(\epsilon_t\epsilon_t') = \begin{pmatrix} \sigma_1^2 & 0 \\ 0 & \sigma_2^2 \end{pmatrix}.$$

Then

$$\lambda_1' = [\,\sigma_1 \quad 0\,], \qquad \lambda_2' = [\,0 \quad (1+c)\sigma_2\,],$$

and hence

$$h_1 = \sigma_1W_1(1), \qquad h_2 = (1+c)\sigma_1\sigma_2\int W_2(r)dW_1(r).$$

The proof of this proposition can be found on pages 618-619 in Hamilton; it basically uses the results of Proposition 18.1 on multivariate unit root processes.
Note that in this example we assumed that $u_{1t}$ and $u_{2s}$ are uncorrelated, so $\sum_{v=0}^\infty E(u_{2t}u_{1,t+v}) = 0$. In the general case $u_{1t}$ and $u_{2s}$ can be correlated, which induces a bias in the estimates; this bias in $\hat\gamma_n$ is $O_p(n^{-1})$. To correct the bias caused by correlation between $u_1$ and $u_2$, we can add leads and lags to the regression. Define $\tilde u_{1t}$ as the residual from a linear projection of $u_{1t}$ on $\{u_{2,t-p}, \ldots, u_{2,t-1}, u_{2t}, u_{2,t+1}, \ldots, u_{2,t+p}\}$,

$$u_{1t} = \sum_{s=-p}^p \beta_s'u_{2,t-s} + \tilde u_{1t};$$

then $\tilde u_{1t}$ is uncorrelated with $u_{2t}$, and the regression (10) can be rewritten as

$$y_{1t} = \alpha + \gamma y_{2t} + \sum_{s=-p}^p \beta_s'\Delta y_{2,t-s} + \tilde u_{1t}. \qquad (12)$$

Now the estimates are consistent.
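A sketch of the superconsistency of the OLS estimate of the cointegrating coefficient in the simple uncorrelated case (true $\gamma = 2$; intercept, sample size, and seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(8)
n = 2000
y2 = np.cumsum(rng.standard_normal(n))   # I(1) regressor
u1 = rng.standard_normal(n)              # stationary (here i.i.d.) error
y1 = 1.0 + 2.0 * y2 + u1                 # cointegrated: y1 - 2*y2 is I(0)

X = np.column_stack([np.ones(n), y2])
alpha_hat, gamma_hat = np.linalg.lstsq(X, y1, rcond=None)[0]
```

Because $\hat\gamma_n - \gamma = O_p(n^{-1})$ under cointegration, `gamma_hat` sits extremely close to 2 even in one sample, in sharp contrast to the spurious-regression case.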


3.2.3 Testing for Cointegration Among Trending Series

Still within our two-variable model, suppose there is a time trend in $y_{2t}$:

$$y_{1t} = \alpha + \gamma y_{2t} + u_{1t}, \qquad \Delta y_{2t} = \delta_2 + u_{2t}, \qquad (13)$$

with $\delta_2 \neq 0$. Then the process

$$y_{2t} = y_{20} + \delta_2t + \sum_{s=1}^t u_{2s}$$

is asymptotically dominated by the deterministic time trend $\delta_2t$, so the OLS estimates $\hat\alpha$ and $\hat\gamma$ in (13) have the same limit distributions as in a regression of an I(1) series on a constant and a time trend. If $y_{1t}$ also contains a deterministic time trend,

$$y_{1t} = \delta_1t + u_{1t},$$

then $\hat\gamma_n$ in (13) converges to $\delta_1/\delta_2$.


3.2.4 Phillips and Hansen's Fully Modified OLS Estimates

3.3 Testing Hypotheses about the Cointegrating Vector

Consider the system

$$y_{1t} = \alpha + \gamma y_{2t} + u_{1t}, \qquad (14)$$

$$y_{2t} = y_{2,t-1} + u_{2t}, \qquad (15)$$

where $y_{1t}$ and $y_{2t}$ are I(1) while $u_{1t}$ and $u_{2t}$ are i.i.d. normal sequences independent of each other:

$$\begin{pmatrix} u_{1t} \\ u_{2t} \end{pmatrix} \sim i.i.d.\;N\left(\begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \sigma_1^2 & 0 \\ 0 & \sigma_2^2 \end{pmatrix}\right).$$

Consider a null hypothesis about the cointegrating vector, say

$$R_1\alpha + R_2\gamma = r.$$

It turns out that the correct approach is simply to estimate (14) by OLS and use standard $t$- or $F$-statistics to test any hypothesis about the cointegrating vector. No special procedures or unusual critical values are needed.

Lecture 10: Further Topics

1 GARCH Model Families

There are a few key features of financial data. First, the variance seems to vary over time, and one large movement tends to be followed by another large movement; in other words, large movements tend to cluster. This can be seen from Figure 1.

[Figure 1: The S&P 500 Return 1996-2001. Upper panel: S&P 500 returns; lower panel: squares of S&P 500 returns.]

Figure 1 plots the S&P 500 returns from Jan. 1996 to Dec. 2001. The upper panel plots the returns and the lower panel plots the squares of the returns. From both graphs you can see the clustering of large movements.

Copyright 2002-2006 by Ling Hu.

Second, the distributions of financial data have heavy tails (heavier than Gaussian). For the
same data described above, I plot the empirical density and the normal density (with mean zero
and standard deviation equal to the standard deviation of the data) in Figure 2.
[Figure 2 has two panels, the left tail and the right tail of the S&P 500 return, each comparing the empirical density with the normal density.]

Figure 2: The Tails of S&P 500 Return 1996-2001


Another way to measure tail thickness is to use the kurtosis statistic. We know that the kurtosis of a Gaussian distribution is 3, but for the data we have here, the empirical kurtosis is 5.8867.
Both time-varying variance and fat tails are important in finance applications. The ARCH-GARCH family of models has been constructed to capture these features of financial data.

1.1 Autoregressive Conditional Heteroskedasticity (ARCH)

The ARCH model was introduced by Engle (1982). Consider an AR(p) process

xt = Σ_{k=1}^p φk x_{t-k} + εt,

where E(εt) = 0, E(εt^2) = σ^2 > 0 and E(εt εs) = 0 for t ≠ s.
The idea of the ARCH model is that the variance of εt, denoted by σt^2, follows an autoregressive process,

σt^2 = c + Σ_{i=1}^m αi ε_{t-i}^2 + ut,                      (1)

where ut ~ WN(0, λ^2). Then we say εt follows an ARCH(m) process.
Note: first, we must have that σt^2 > 0. A sufficient condition is that c ≥ 0 and αi ≥ 0 for i = 1, . . . , m.
Second, we are modeling a time-varying conditional variance for xt, but we'd still like to restrict our discussion to covariance-stationary processes; therefore, we want the unconditional variance of xt, and hence the unconditional variance of εt, to be constant. For the AR(m) process for εt^2 to be stationary, we must have all the roots of

1 − α1 z − . . . − αm z^m = 0

lie outside the unit circle. Combining this with the condition that all αi are nonnegative, we need to impose the following condition on the coefficients,

Σ_{i=1}^m αi < 1.

Then the unconditional variance of εt is given by

σ^2 = E(εt^2) = c/(1 − Σ_{i=1}^m αi).

Another way to specify an ARCH(m) process for εt is to let

εt = √(ht) vt,

where E(vt) = 0 and Var(vt) = 1, and

ht = c + Σ_{i=1}^m αi ε_{t-i}^2.

It is easy to see that E(εt) = 0 and E_{t-1}(εt^2) = ht.


To estimate the parameters in an ARCH model, we can specify a distribution for εt and use maximum likelihood estimation, or use GMM estimation based on some orthogonality conditions.
Recall that we motivated this section by describing two features of financial data: clustering of large movements and heavy tails. An ARCH model can capture both of these features. First, it is easy to see that we get dependence between εt and ε_{t-1}, . . ., and therefore the model can produce clustered large movements. Second, the distribution of εt will have heavy tails. We can show this by computing the kurtosis, denoted by ks, for a simple ARCH(1) process,

εt = √(ht) vt,

where vt ~ i.i.d. N(0, 1), and

ht = c + α ε_{t-1}^2.

The moments of εt are

E(εt) = 0,
E(εt^2) = E(ht vt^2) = σ^2 = c/(1 − α),
E(εt^4) = 3 (c^2 + 2cασ^2)/(1 − 3α^2),

so the kurtosis is

ks = E(εt^4)/(E(εt^2))^2 = 3 (1 − α^2)/(1 − 3α^2) > 3   for α^2 < 1/3.

Therefore, an ARCH model can produce heavier than normal tails.
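A quick simulation illustrates the excess kurtosis; the parameter values below are invented, with α chosen so that α^2 < 1/3:

```python
import numpy as np

# Simulate an ARCH(1) process: eps_t = sqrt(h_t) v_t with
# h_t = c + a*eps_{t-1}^2.  c and a are made-up values, a^2 < 1/3.
rng = np.random.default_rng(1)
c, a, n = 1.0, 0.3, 500_000
v = rng.normal(0, 1, n)
eps = np.empty(n)
h = c / (1 - a)                   # start at the unconditional variance
for t in range(n):
    eps[t] = np.sqrt(h) * v[t]
    h = c + a * eps[t] ** 2       # next period's conditional variance

# Sample kurtosis vs. the theoretical value 3(1 - a^2)/(1 - 3a^2)
ks_sample = np.mean(eps ** 4) / np.mean(eps ** 2) ** 2
ks_theory = 3 * (1 - a ** 2) / (1 - 3 * a ** 2)
print(ks_sample, ks_theory)       # both above the Gaussian value of 3
```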


1.2 Generalized Autoregressive Conditional Heteroskedasticity (GARCH)

Bollerslev (1986) extends the ARCH model and lets σt^2 in (1) depend not only on the lagged values of εt^2 but also on the lagged values of σt^2:

σt^2 = c + Σ_{i=1}^p βi σ_{t-i}^2 + Σ_{i=1}^q αi ε_{t-i}^2 + ut.        (2)

We use GARCH(p, q) to denote such a process. The sufficient condition for σt^2 > 0 and for the process to be stationary is that βi ≥ 0 for i = 1, . . . , p, αj ≥ 0 for j = 1, . . . , q, and

Σ_{i=1}^p βi + Σ_{i=1}^q αi < 1.

Finally, the unconditional variance is

σ^2 = E(εt^2) = c/(1 − Σ_{i=1}^p βi − Σ_{i=1}^q αi).

For the S&P 500 data we have displayed, we estimate a GARCH(1, 1) process for the return rt, and the estimates we get are:

σt^2 = 0.0021^2 + 0.876 σ_{t-1}^2 + 0.097 ε_{t-1}^2 + ût.

The unconditional variance is

σ^2 = E(εt^2) = 0.0021^2/(1 − 0.876 − 0.097) = 0.0127^2.

Figure 3 plots the estimated σt^2 (solid line in the lower graph) and the unconditional variance (dashed line in the lower graph).
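The unconditional variance above can be checked, and the conditional variance series reconstructed, with a short recursion; the return innovations below are simulated stand-ins rather than the actual S&P data:

```python
import numpy as np

# GARCH(1,1) variance recursion with the point estimates reported above:
# c = 0.0021^2, beta = 0.876 (on sigma^2_{t-1}), alpha = 0.097 (on eps^2_{t-1}).
rng = np.random.default_rng(2)
c, beta, alpha = 0.0021 ** 2, 0.876, 0.097
uncond = c / (1 - beta - alpha)            # unconditional variance
print(np.sqrt(uncond))                     # about 0.0128

n = 200
eps = rng.normal(0, np.sqrt(uncond), n)    # placeholder innovations
sig2 = np.empty(n)
sig2[0] = uncond                           # initialize at the unconditional variance
for t in range(1, n):
    sig2[t] = c + beta * sig2[t - 1] + alpha * eps[t - 1] ** 2
```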

1.3 Multivariate GARCH

Consider a k-vector VAR(h) process,

xt = Σ_{i=1}^h Φi x_{t-i} + εt,

where E(εt) = 0 and E(εt εs') = Ω for t = s and zero otherwise.
Recall that a univariate GARCH model produces a time-varying variance σt^2. Using the same idea, we could construct a multivariate GARCH model to produce a time-varying covariance matrix Ωt.
A simple extension from univariate GARCH(p, q) processes to multivariate GARCH(p, q) is to let each element of Ωt, say σ_{ij,t}, follow a GARCH(p, q) process,

σ_{ij,t} = c_{ij} + Σ_{l=1}^p β_{ij,l} σ_{ij,t-l} + Σ_{k=1}^q α_{ij,k} ε_{i,t-k} ε_{j,t-k} + u_{ij,t}.

Figure 3: The Estimated Variance of S&P 500 Returns

The problem with this approach is that the number of parameters may get too big when k is large. For instance, even if we assume a GARCH(1, 1) process, when k = 10 we need to estimate 3 × 10 × 11/2 = 165 parameters.
To solve this problem, we can impose some structure on Ωt. For instance, Bollerslev (1990) suggested that the conditional correlations are constant over time. Then σ_{ij,t} = ρ_{ij} σ_{i,t} σ_{j,t}, with only one parameter, ρ_{ij}, instead of c_{ij}, β_{ij} and α_{ij}.

1.4 Variants of GARCH Models

We will briefly introduce a few other members of the GARCH model family.
IGARCH  In a GARCH(p, q) model, when the coefficients satisfy

Σ_{i=1}^p βi + Σ_{j=1}^q αj = 1,

Engle and Bollerslev (1986) referred to it as an integrated GARCH process. In this case, the unconditional variance of εt is infinite, so the process is no longer covariance (weakly) stationary, but it is still strictly stationary.
EGARCH  Recall that we let the innovation take the form εt = √(ht) vt, with vt i.i.d. with mean zero and unit variance. Nelson (1991) proposed the following specification for ht:

log ht = c + Σ_{i=1}^∞ πi (|v_{t-i}| − E|v_{t-i}| + γ v_{t-i}).

The parameter πi captures the effects of the deviation of |v_{t-i}| from its expectation. A more interesting element of an EGARCH model is the parameter γ.
We have discussed two features of financial data: dependence in volatility and heavy tails. There is another feature of financial return data: negative skewness. For the normal distribution, we have zero skewness, but for the data set we have used as an example, the empirical skewness is -0.1806. In other words, in financial return data, negative shocks tend to be associated with larger volatility than positive shocks. The parameter γ in the EGARCH model can capture this effect. If γ = 0, then the volatilities for positive and negative shocks are symmetric; if γ < 0, then negative shocks tend to have a larger effect on the volatility.
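A one-lag truncation of this specification makes the asymmetry concrete; c, π1 and γ below are made-up values:

```python
import math

# One-lag EGARCH sketch: log h_t = c + pi1*(|v_{t-1}| - E|v| + g*v_{t-1}),
# with g < 0 so that negative shocks raise volatility more.
c, pi1, g = 0.0, 0.5, -0.4
E_abs_v = math.sqrt(2 / math.pi)      # E|v| for standard normal v

def log_h(v_prev):
    return c + pi1 * (abs(v_prev) - E_abs_v + g * v_prev)

# A negative shock produces a larger log variance than a positive
# shock of the same size:
print(log_h(-2.0), log_h(2.0))        # about 1.00 vs 0.20
```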
There are still other models that belong to the GARCH family: GARCH with thresholds (Zakoian 1990), GARCH with regime-switching (Cai 1994), etc.
Readings: Hamilton Ch 21

2 Hidden Markov Chains and Regime Switching Models

2.1 Markov chains

Let st be a random variable that only takes integer values. If the probability that st takes a particular value j depends on the past only through the most recent value s_{t-1}:

P{st = j | s_{t-1} = i, s_{t-2} = k, . . .} = P{st = j | s_{t-1} = i} = Pij,

this process is called a Markov chain, and Pij is called the transition probability: the probability that state i will be followed by state j. Suppose there are N states; then we must have

Σ_{j=1}^N Pij = 1.

We can collect the transition probabilities in an N × N matrix, denoted by P, known as the transition matrix:

P = [ P11  P21  . . .  PN1
      ...  ...         ...
      P1N  P2N  . . .  PNN ].

For example, suppose there is a squirrel, who may stay inside a house (in the roof) or stay in the tree (the tree by the house). We can specify the transition matrix (written here transposed, as P') of this squirrel as:

                House (t+1)   Tree (t+1)
House (t)           0.7           0.3
Tree (t)            0.1           0.9
To study forecasting of a Markov chain, using our two-state example, we can assign the integers 1 and 2 to the two states and define

ξt = (1, 0)'  when st = 1,
ξt = (0, 1)'  when st = 2.

Conditional on st = 1, the expected value of ξ_{t+1} is (P11, P12)'. Hence we can write the one-period forecast for the Markov chain as

E(ξ_{t+1} | ξt, ξ_{t-1}, . . .) = P ξt.

Hence we can express a Markov chain using a VAR(1) representation

ξ_{t+1} = P ξt + v_{t+1},                                    (3)

where

v_{t+1} = ξ_{t+1} − E(ξ_{t+1} | ξt, ξ_{t-1}, . . .).

It is easy to see that vt is a martingale difference sequence.
Similarly, the m-period forecast of a Markov chain is

E(ξ_{t+m} | ξt, ξ_{t-1}, . . .) = P^m ξt.                    (4)

Now, suppose P11 = 1 instead of 0.7; then the squirrel will stay in the roof forever. In this case, the state House is an absorbing state and the Markov chain is reducible. If the Markov chain is not reducible, we say it is irreducible. For our two-state example, this requires that P11 < 1 and P22 < 1.

For a transition matrix P, suppose that one of the eigenvalues is unity and that all other eigenvalues of P are inside the unit circle. Then the Markov chain is said to be ergodic. The vector π corresponding to the unit eigenvalue is the vector of ergodic probabilities (after rescaling so that its elements sum to unity: 1'π = 1). It can be shown that (Hamilton, page 681)

lim_{m→∞} P^m = π 1'.

From (4), we can write

E(ξ_{t+m} | ξt, ξ_{t-1}, . . .) = P^m ξt → π 1' ξt = π.

Hence the forecast of ξ_{t+m} converges to π no matter what ξt is. So we can see that π gives the unconditional probabilities for the process (while the matrix P gives the conditional probabilities).
In our example of the squirrel, we can compute that π = (0.25, 0.75)', or, the squirrel stays in the house about one fourth of the time.
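This calculation can be sketched in a few lines, using the characterization of π as the (rescaled) eigenvector of P for the unit eigenvalue:

```python
import numpy as np

# Ergodic probabilities for the squirrel example.  Columns of P hold the
# probabilities of leaving each current state, as in the text's convention.
P = np.array([[0.7, 0.1],   # P(House -> House), P(Tree -> House)
              [0.3, 0.9]])  # P(House -> Tree),  P(Tree -> Tree)

vals, vecs = np.linalg.eig(P)
v = vecs[:, np.argmin(np.abs(vals - 1.0))].real
pi = v / v.sum()            # rescale so elements sum to one
print(pi)                   # [0.25, 0.75]

# P^m converges to a matrix with pi in every column
print(np.linalg.matrix_power(P, 50))
```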
In general, for a two-state Markov chain to be ergodic, besides the conditions for irreducibility (P11 < 1, P22 < 1), we also require P11 + P22 > 0, which means at least one of these two probabilities is positive. If both probabilities are zero, then in our squirrel example the squirrel jumps from the house to the tree and from the tree to the house every period, so at time t + m the position of the squirrel is determined by its position at time t. If the squirrel is in the house at time t and m is an even number, we know that the squirrel is in the house at time t + m. Hence, no matter how large m is, we can always tell where the squirrel is given its position at time t.

2.2 The Hidden Markov Chain and i.i.d. mixture distributions

Let st be a Markov chain with N possible states. Let xt denote another sequence, where the distribution of xt at time t depends on st. For example, suppose there are two states and st takes the value 1 or 2. When st = 1, xt equals 0 with probability 0.9 and equals 1 with probability 0.1, while when st = 2, xt equals 0 with probability 0.1 and equals 1 with probability 0.9. Further assume that we cannot observe st and can only observe xt; this is a simple example of a hidden Markov chain. (draw a picture here)
xt can also be drawn from a continuous distribution, such as a normal distribution. For example, when st = 1, xt is drawn from N(0, 1), and when st = 2, xt is drawn from N(2, 4). We write the density of xt conditional on st as follows:

f(xt | st = 1) = (1/√(2π)) exp(−xt^2/2),

f(xt | st = 2) = (1/(2√(2π))) exp(−(xt − 2)^2/8).

To compute the unconditional distribution, we need to know the distribution of st. For example, if st is i.i.d. and P(st = 1) = 1/3, then the unconditional distribution of xt is

f(xt) = (1/3) f(xt | st = 1) + (2/3) f(xt | st = 2).
Figure (4) plots the density of this mixture distribution as well as those two normal distributions.
Figure 4: The plot of the mixture density

Now, although we cannot observe st, we can make an inference about st based on xt. In our example,

P(st = 1 | xt) = p(xt, st = 1)/f(xt) = (1/3) f(xt | st = 1)/f(xt).        (5)
Similarly, we can write this for P(st = 2|xt). From this expression, we see that two factors jointly determine this probability: one is the unconditional probability of st, and the other is the density that each component assigns to xt. Consider some numerical examples. Suppose we observe xt = 3. Then we know that N(2, 4) is more likely to generate this observation, and also the unconditional probability of st = 2 is larger; hence we believe that st is much more likely to be 2. Using (5), we can compute that P(st = 1|xt = 3) ≈ 0.01, which supports this reasoning. On the other hand, if we observe xt = 1, things are not that clear. Although st has a larger unconditional probability of being 2, from figure (4) we can see that N(0, 1) assigns a larger density to xt = 1 than the N(2, 4) distribution does. Using (5), we can compute that the probability that st = 1 conditional on xt = 1 is about 0.41, well above the unconditional probability of 1/3, though st = 2 remains slightly more likely.
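These conditional probabilities follow directly from (5); a minimal sketch using the example's two normal components and prior P(st = 1) = 1/3:

```python
import math

# Posterior P(s_t = 1 | x_t) from equation (5), with mixture weights
# (1/3, 2/3) and components N(0, 1) and N(2, 4).
def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def p_state1_given_x(x):
    joint1 = (1 / 3) * normal_pdf(x, 0.0, 1.0)   # p(x, s=1)
    joint2 = (2 / 3) * normal_pdf(x, 2.0, 4.0)   # p(x, s=2)
    return joint1 / (joint1 + joint2)

print(p_state1_given_x(3.0))   # about 0.012: s_t is almost surely 2
print(p_state1_given_x(1.0))   # about 0.41: much less clear-cut
```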
The above is a simple example where we assume that we know all the coefficients, and it largely illustrates how to work with i.i.d. mixture models. In general, if there are N states (N individual distributions in the mixture), and if we assume xt is drawn from N(μi, σi^2) when st = i, then we can write

f(xt | st = i; μi, σi^2) = (1/√(2πσi^2)) exp(−(xt − μi)^2/(2σi^2)).        (6)

Let πi denote the probability that st = i, and let θ = (μ1, . . . , μN, σ1^2, . . . , σN^2, π1, . . . , πN). Then the joint probability density for xt and st = i is

p(xt, st = i; θ) = πi f(xt | st = i; μi, σi^2),

hence the unconditional distribution for xt is just

f(xt; θ) = Σ_{i=1}^N πi f(xt | st = i; μi, σi^2).

If st is i.i.d., the log likelihood for observations {x1, . . . , xT} is

l(θ) = Σ_{t=1}^T log f(xt; θ).

We can then solve for the maximum likelihood estimator for θ with the restrictions that πi ≥ 0 and Σ_{i=1}^N πi = 1.
Note that to maximize this function, we first take the sum over the different components and then take the log; hence it is not possible to solve analytically for θ as a function of the data. In empirical studies, the MLE of mixture models is computed using the EM algorithm, where E represents expectation and M represents maximization. This is an iterative method, and the likelihood is guaranteed to increase in each iteration.
The MLE for the system can be shown to satisfy (pp. 699-701 in Hamilton)

μ̂i = Σ_{t=1}^T xt P(st = i | xt; θ) / Σ_{t=1}^T P(st = i | xt; θ),

σ̂i^2 = Σ_{t=1}^T (xt − μ̂i)^2 P(st = i | xt; θ) / Σ_{t=1}^T P(st = i | xt; θ),

π̂i = T^{-1} Σ_{t=1}^T P(st = i | xt; θ).

The EM algorithm was originally designed for estimation with missing data. In a mixture model, if we knew which regime (state) each observation xt was drawn from, the problem would be much easier: μ̂i and σ̂i^2 would just be the mean and variance computed from the data belonging to regime i, and π̂i would just be the proportion of the data from regime i. Since we don't know this information, we use an iterative algorithm.
We can start with an arbitrary value for θ, denoted θ^0, plug this θ^0 into the right hand side of the above equations, and obtain a new estimate for θ, denoted θ^1. We continue this iteration and stop when θ^m and θ^{m+1} are close.
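The iteration just described is the EM algorithm for a Gaussian mixture. The sketch below (simulated data from the earlier two-component example, arbitrary starting values) applies the three fixed-point equations repeatedly:

```python
import numpy as np

# EM for a two-component Gaussian mixture, following the fixed-point
# equations above.  Data are simulated from the earlier example:
# N(0, 1) with probability 1/3 and N(2, 4) with probability 2/3.
rng = np.random.default_rng(3)
T = 5000
s = rng.random(T) < 1 / 3
x = np.where(s, rng.normal(0, 1, T), rng.normal(2, 2, T))

# arbitrary starting values, as in the text
mu, var, pi = np.array([-1.0, 3.0]), np.array([1.0, 1.0]), np.array([0.5, 0.5])

def normal_pdf(x, mu, var):
    return np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

for _ in range(200):
    # E step: w[t, i] = P(s_t = i | x_t; theta), a T x 2 matrix
    joint = pi * normal_pdf(x, mu, var)
    w = joint / joint.sum(axis=1, keepdims=True)
    # M step: the fixed-point equations for mu_i, sigma_i^2, pi_i
    mu = (w * x[:, None]).sum(axis=0) / w.sum(axis=0)
    var = (w * (x[:, None] - mu) ** 2).sum(axis=0) / w.sum(axis=0)
    pi = w.mean(axis=0)

print(mu, var, pi)   # roughly [0, 2], [1, 4], [1/3, 2/3]
```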

2.3 Time series regime switching model

Next, we apply the hidden Markov model to time series, which allows for time-varying parameters. The idea is that under different regimes, the parameters, which represent levels or relationships, may be different. For simplicity, we assume that there are two regimes, st = 1 or st = 2, and let P denote the transition matrix. We continue to assume that we cannot observe st and can only observe yt, whose process is specified as

yt = c_{st} + φ y_{t-1} + ut,                                (7)

where ut ~ i.i.d. N(0, σ^2).

Then the conditional density of yt can be written as a 2 × 1 vector ηt with elements

f(yt | st = i, y_{t-1}; θ) = (1/√(2πσ^2)) exp(−(yt − ci − φ y_{t-1})^2/(2σ^2)),   i = 1, 2.        (8)

To analyze this system, let's first assume that we know all the parameters θ = (c1, c2, φ, σ^2, p11, p22). Unlike in the i.i.d. case, now we can draw inference about st based on all observations. Let Yt denote the observations up to time t; then we can write the conditional probabilities about st at time t as

ξ_{t|t} = ( P(st = 1 | Yt; θ),  P(st = 2 | Yt; θ) )'.

Let's first consider how to compute the density of yt conditional on Y_{t-1}. If we take the element-by-element product of ξ_{t|t-1} and ηt, written (ξ_{t|t-1} ⊙ ηt), we get

p(yt, st = 1 | Y_{t-1}; θ) = P(st = 1 | Y_{t-1}; θ) f(yt | st = 1, y_{t-1}; θ),
p(yt, st = 2 | Y_{t-1}; θ) = P(st = 2 | Y_{t-1}; θ) f(yt | st = 2, y_{t-1}; θ).

If we add these two elements, we get the density of yt conditional on Y_{t-1}, i.e.

f(yt | Y_{t-1}; θ) = 1'(ξ_{t|t-1} ⊙ ηt).                     (9)

Then the log likelihood function can be written as

l(θ) = Σ_{t=1}^T log f(yt | Y_{t-1}; θ).                     (10)

To derive a rule for updating the forecast and the optimal inference about st, note that if we divide each element of (ξ_{t|t-1} ⊙ ηt) by f(yt | Y_{t-1}; θ) = 1'(ξ_{t|t-1} ⊙ ηt), we have

p(yt, st = i | Y_{t-1}; θ) / 1'(ξ_{t|t-1} ⊙ ηt) = p(yt, st = i | Y_{t-1}; θ) / f(yt | Y_{t-1}; θ) = P(st = i | yt, Y_{t-1}; θ) = P(st = i | Yt; θ).

Doing this for each element of the vector, we obtain

ξ_{t|t} = (ξ_{t|t-1} ⊙ ηt) / (1'(ξ_{t|t-1} ⊙ ηt)).           (11)

Finally, if we take the expectation of (3) conditional on Yt, we have

ξ_{t+1|t} = P ξ_{t|t}.                                       (12)

These two equations, (11) and (12), compose an iterative algorithm to compute the optimal inference for st. The iteration starts from ξ_{1|0}, which can be specified in several ways (see page 693 in Hamilton).
So far, when we draw inference about st, we rely on information up to time t. However, as we obtain more information and look back, we may have different ideas about what happened at time t. Such an inference, say P(st = i | Yτ; θ) for τ > t, is called the smoothed inference.
Above we assumed that we know the parameter θ. To estimate θ, we can find the estimator that maximizes the likelihood (10), using numerical optimization techniques.
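Equations (8), (9), (11) and (12) fit together into a short filtering loop. The sketch below simulates a two-regime AR(1) with made-up parameters and runs the filter, accumulating the log likelihood (10) along the way:

```python
import numpy as np

# Filter for the two-regime AR(1) model (7).  All parameter values
# (c, phi, sigma, P) are invented for illustration.
rng = np.random.default_rng(4)
c = np.array([0.0, 1.0])         # c_1, c_2
phi, sigma = 0.5, 1.0
P = np.array([[0.9, 0.2],        # column i: transition probs from state i
              [0.1, 0.8]])

# simulate y_t under a latent regime path
T = 300
s = np.empty(T, dtype=int)
y = np.empty(T)
s[0], y[0] = 0, 0.0
for t in range(1, T):
    s[t] = 0 if rng.random() < P[0, s[t - 1]] else 1
    y[t] = c[s[t]] + phi * y[t - 1] + rng.normal(0, sigma)

# filter: xi holds (P(s_t = 1 | Y_t), P(s_t = 2 | Y_t))
xi = np.array([0.5, 0.5])        # one way to specify the starting value
loglik = 0.0
filtered = np.empty((T, 2))
filtered[0] = xi
for t in range(1, T):
    eta = np.exp(-(y[t] - c - phi * y[t - 1]) ** 2 / (2 * sigma ** 2)) \
          / np.sqrt(2 * np.pi * sigma ** 2)        # equation (8)
    pred = P @ xi                                  # equation (12)
    joint = pred * eta                             # element-by-element product
    f_y = joint.sum()                              # equation (9)
    loglik += np.log(f_y)                          # equation (10)
    xi = joint / f_y                               # equation (11)
    filtered[t] = xi

print(filtered[-1].sum())   # filtered probabilities sum to one
```

Wrapping the loop in a function of the parameters and handing the negative log likelihood to a numerical optimizer gives the MLE described in the text.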