Introductory Financial Econometrics
Reading: Wooldridge, Ch.2
Jianhua Gang
School of Finance
Renmin University of China
Spring 2013
Regression Analysis

Consider the simple linear regression model

y_i = \alpha + \beta x_i + u_i, \qquad i = 1, ..., n.   (1)
Estimation of Parameters

Method of moments (MM): impose the population moment conditions

E(u_i) = 0,
E(x_i u_i) = 0,
E(u_i^2 - \sigma^2) = 0.
Ordinary least squares (OLS): choose estimates \hat\alpha and \hat\beta to get the "best fit" in the sense of minimizing

S(\alpha, \beta) = \sum_i [y_i - (\alpha + \beta x_i)]^2.

Setting \partial S(\hat\alpha, \hat\beta)/\partial\alpha = 0 and \partial S(\hat\alpha, \hat\beta)/\partial\beta = 0 gives the normal equations

\sum_i \hat u_i = \sum_i [y_i - (\hat\alpha + \hat\beta x_i)] = 0,
\sum_i x_i \hat u_i = \sum_i x_i [y_i - (\hat\alpha + \hat\beta x_i)] = 0,   (2)

where \hat u_i = y_i - (\hat\alpha + \hat\beta x_i) denotes the OLS residual.
The solutions \hat\alpha and \hat\beta which minimize the objective function S(\alpha, \beta) are

\hat\beta = \frac{\sum_i (x_i - \bar x)(y_i - \bar y)}{\sum_i (x_i - \bar x)^2}, \qquad \hat\alpha = \bar y - \hat\beta \bar x.   (3)

It is clear that the normal equations imply that the OLS estimates of \alpha and \beta are equal to the corresponding method-of-moments estimates obtained previously.
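As an illustration, here is a minimal numerical sketch of these closed-form estimators in Python (simulated data; the variable names are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
u = rng.normal(scale=0.5, size=n)
y = 1.0 + 2.0 * x + u          # true alpha = 1, beta = 2

# OLS estimates from equation (3)
beta_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
alpha_hat = y.mean() - beta_hat * x.mean()

# residuals satisfy the normal equations (2): both sums are ~0
u_hat = y - (alpha_hat + beta_hat * x)
print(alpha_hat, beta_hat, u_hat.sum(), (x * u_hat).sum())
```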
Let \hat y_i = \hat\alpha + \hat\beta x_i denote a typical OLS predicted (fitted) value; the normal equations for OLS then yield several results, in particular

\sum_i \hat y_i \hat u_i = 0,

so that

\sum_i y_i^2 = \sum_i (\hat y_i + \hat u_i)^2 = \sum_i \hat y_i^2 + \sum_i \hat u_i^2 + 0.

Maximum likelihood (ML): the likelihood is L = \prod_i f(y_i) and, under normality, the log-likelihood is

l(\alpha, \beta, \sigma^2) = -\frac{n}{2}\ln(2\pi\sigma^2) - \frac{\sum_i [y_i - (\alpha + \beta x_i)]^2}{2\sigma^2}.

The MLEs of \alpha and \beta equal the OLS estimates. The MLE of \sigma^2 is

\hat\sigma^2 = n^{-1}\sum_i \hat u_i^2 = \text{MM estimate}.

Goodness of Fit

The coefficient of determination measures the fit of the OLS line:

R^2 = \frac{\sum_i (\hat y_i - \bar y)^2}{\sum_i (y_i - \bar y)^2}, \qquad \bar y = n^{-1}\sum_i y_i.
Sampling moments of the OLS estimators:

Var(\hat\beta) = \frac{\sigma^2}{\sum_i (x_i - \bar x)^2},

Var(\hat\alpha) = \frac{\sigma^2}{n} + \bar x^2\,Var(\hat\beta),

Cov(\hat\alpha, \hat\beta) = -\bar x\,Var(\hat\beta).
The estimators are linear in the errors:

\hat\beta = \beta + \sum_i w_i u_i, \qquad \hat\alpha = \alpha + \sum_i z_i u_i,

and RSS = \sum_i \hat u_i^2 \sim \sigma^2\chi^2(n-2), which is independent of \hat\alpha and \hat\beta.
Statistical Inference

Hence

E\left(\sum_i \hat u_i^2\right) = \sigma^2(n-2),

so

s^2 = \frac{\sum_i \hat u_i^2}{n-2}

is unbiased. The ML estimator \hat\sigma^2 = [(n-2)/n]\,s^2 is, however, biased (the bias is nonnegligible when the sample size is relatively small).
Assume \sum_i (x_i - \bar x)^2 > 0.
Under normality, \hat\alpha and \hat\beta are N(\alpha, var(\hat\alpha)) and N(\beta, var(\hat\beta)), respectively, so that

z(\hat\alpha) = (\hat\alpha - \alpha)/\sqrt{var(\hat\alpha)} \sim N(0,1),
z(\hat\beta) = (\hat\beta - \beta)/\sqrt{var(\hat\beta)} \sim N(0,1).

RSS = \sum_i \hat u_i^2 \sim \sigma^2\chi^2(n-2) independently of \hat\alpha and \hat\beta, so RSS/\sigma^2 \sim \chi^2(n-2) independently of z(\hat\alpha) and z(\hat\beta), and therefore

t(\hat\alpha) = \frac{z(\hat\alpha)}{\sqrt{RSS/[(n-2)\sigma^2]}} \sim t(n-2), \qquad t(\hat\beta) = \frac{z(\hat\beta)}{\sqrt{RSS/[(n-2)\sigma^2]}} \sim t(n-2).
Hence, replacing \sigma^2 with s^2 and writing S_{XX} = \sum_i (x_i - \bar x)^2,

t(\hat\beta) = \frac{(\hat\beta - \beta)/\sqrt{\sigma^2/S_{XX}}}{\sqrt{s^2/\sigma^2}} = \frac{\hat\beta - \beta}{\sqrt{s^2/S_{XX}}} = \frac{\hat\beta - \beta}{SE(\hat\beta)} \sim t(n-2),

and similarly

t(\hat\alpha) = \frac{\hat\alpha - \alpha}{SE(\hat\alpha)} \sim t(n-2).
Confidence intervals: choose d_1 such that

prob(-d_1 \le t(n-2) \le d_1) = 1 - \gamma.

Since

t(\hat\beta) = \frac{\hat\beta - \beta}{SE(\hat\beta)} \sim t(n-2),

a (1-\gamma) \times 100 per cent confidence interval for \beta is [\hat\beta - d_1 SE(\hat\beta),\; \hat\beta + d_1 SE(\hat\beta)], and analogously for \alpha. For testing H_0: \beta = \beta_0, use

t_0(\hat\beta) = \frac{\hat\beta - \beta_0}{SE(\hat\beta)},

which is t(n-2)-distributed if H_0 is true.
Test procedures:

1. H_1: \beta \ne \beta_0 — reject H_0 if |t_0(\hat\beta)| > d_1, where prob(t(n-2) > d_1) = \gamma/2;
2. H_1^+: \beta > \beta_0 — reject H_0 if t_0(\hat\beta) > d_2, where prob(t(n-2) > d_2) = \gamma;
3. H_1^-: \beta < \beta_0 — reject H_0 if t_0(\hat\beta) < -d_2, where prob(t(n-2) < -d_2) = \gamma.

Just replace \beta by \alpha and \hat\beta by \hat\alpha in the above to obtain test procedures for \alpha (the intercept).
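A sketch of these test computations in Python, continuing the simulated example above (scipy is assumed available for the t critical value):

```python
from scipy import stats

s2 = (u_hat ** 2).sum() / (n - 2)                 # unbiased error variance
sxx = ((x - x.mean()) ** 2).sum()
se_beta = np.sqrt(s2 / sxx)

t0 = beta_hat / se_beta                            # test of H0: beta = 0
d1 = stats.t.ppf(1 - 0.05 / 2, df=n - 2)           # two-sided 5% critical value
ci = (beta_hat - d1 * se_beta, beta_hat + d1 * se_beta)
print(t0, abs(t0) > d1, ci)
```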
E(\hat\alpha|x_1, ..., x_n) = \alpha, E(\hat\beta|x_1, ..., x_n) = \beta and E(s^2|x_1, ..., x_n) = \sigma^2. These expectations do not depend upon the x values, and so the OLS estimators are unconditionally unbiased. Similar remarks apply to probability limits.

var(\hat\alpha|x_1, ..., x_n), var(\hat\beta|x_1, ..., x_n) and cov(\hat\alpha, \hat\beta|x_1, ..., x_n), as given above, do depend on the x values, and so do not correspond to unconditional characteristics.

Fortunately, this does not pose major problems for inference. The variables (\hat\alpha - \alpha)/SE(\hat\alpha) and (\hat\beta - \beta)/SE(\hat\beta) are, given the x values, still distributed as t(n-2). This distribution does not depend on the x values, but just on the degrees of freedom (n-2). Hence the t tests and confidence intervals described above are unconditionally valid.
Prediction

Consider predicting a future value

y_f = \alpha + \beta x_f + u_f, \qquad u_f \sim N(0, \sigma^2).
Reading

Wooldridge, Ch.3, 4.

Multiple regression: y_i = \alpha + \sum_j \beta_j x_{ji} + u_i.

\alpha and the \beta_j are parameters/coefficients.
The regressors x_{ji} vary with i, but are nonrandom (nonstochastic, i.e. fixed in repeated sampling).
\alpha can be regarded as an intercept, with \alpha = E(y_i) given all x_{ji} = 0.
Slopes \beta_j can often be regarded as partial derivatives: \beta_j = \partial E(y_i)/\partial x_{ji}.
Estimation of Parameters

OLS: choose \hat\alpha, \hat\beta_1, ..., \hat\beta_k to minimize

S(\alpha, \beta_1, ..., \beta_k) = \sum_i \left[ y_i - \left( \alpha + \sum_j \beta_j x_{ji} \right) \right]^2.

The first-order conditions give the normal equations

\sum_i \hat u_i = 0, \qquad \sum_i x_{ji}\hat u_i = 0,

for j = 1, ..., k, where \hat u_i is the residual y_i - (\hat\alpha + \sum_j \hat\beta_j x_{ji}), i = 1, ..., n. The MLE of the error variance is \hat\sigma^2 = n^{-1}\sum_i \hat u_i^2.

The log-likelihood is

l(\alpha, \beta_1, ..., \beta_k, \sigma^2) = -\frac{n}{2}\ln(2\pi\sigma^2) - \frac{S(\alpha, \beta_1, ..., \beta_k)}{2\sigma^2}.
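A matrix-form sketch of multiple-regression OLS (solving the normal equations X'Xb = X'y directly; simulated data, names illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 300, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])  # intercept + k regressors
beta_true = np.array([1.0, 0.5, -1.0, 2.0])
y = X @ beta_true + rng.normal(scale=0.8, size=n)

b = np.linalg.solve(X.T @ X, X.T @ y)    # OLS via the normal equations
u_hat = y - X @ b
sigma2_ml = (u_hat ** 2).mean()          # MLE of sigma^2 (divides by n)
print(b, sigma2_ml)
```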
Let \hat y_i = \hat\alpha + \sum_j \hat\beta_j x_{ji} denote the fitted value, so y_i = \hat y_i + \hat u_i. The normal equations imply

\sum_i \hat y_i\hat u_i = \sum_i \left(\hat\alpha + \sum_j \hat\beta_j x_{ji}\right)\hat u_i = 0,

so that

\sum_i y_i^2 = \sum_i (\hat y_i + \hat u_i)^2 = \sum_i \hat y_i^2 + \sum_i \hat u_i^2,

and, in deviation form,

\sum_i (y_i - \bar y)^2 = \sum_i (\hat y_i - \bar y)^2 + \sum_i \hat u_i^2, \qquad \text{i.e. } TSS = ESS + RSS.

Goodness of Fit
The coefficient of determination is an index of the goodness of fit of the OLS line, with

R^2 = \frac{ESS}{TSS} = 1 - \frac{RSS}{TSS}, \qquad 0 \le R^2 \le 1.

The OLS estimates satisfy

\hat\alpha = \bar y - \sum_j \hat\beta_j\bar x_j, \qquad \hat\beta_j = \frac{\sum_i \tilde x_{ji} y_i}{\sum_i \tilde x_{ji}^2},

where \tilde x_{ji} is the ith residual from the OLS regression of the jth regressor on the other (k-1) regressors and the intercept term (the partialling-out result).
Substituting for y_i,

\hat\beta_j = \beta_j + \frac{\sum_i \tilde x_{ji} u_i}{\sum_i \tilde x_{ji}^2} = \beta_j + \frac{\sum_i \tilde x_{ji} u_i}{RSS_j},

where RSS_j = \sum_i \tilde x_{ji}^2. Hence E(\hat\beta_j) = \beta_j and var(\hat\beta_j) = \sigma^2/RSS_j, j = 1, ..., k.

Moreover RSS = \sum_i \hat u_i^2 \sim \sigma^2\chi^2(n-k-1), independently of \hat\alpha and the \hat\beta_j, so s^2 = RSS/(n-k-1) is unbiased.
Under normality, \hat\alpha and \hat\beta_j are N(\alpha, var(\hat\alpha)) and N(\beta_j, var(\hat\beta_j)), respectively, so that

z(\hat\alpha) = (\hat\alpha - \alpha)/\sqrt{var(\hat\alpha)} \sim N(0,1),
z(\hat\beta_j) = (\hat\beta_j - \beta_j)/\sqrt{var(\hat\beta_j)} \sim N(0,1).
Confidence Intervals

t(\hat\beta_j) = (\hat\beta_j - \beta_j)/SE(\hat\beta_j) \sim t(n-k-1),

and similarly

t(\hat\alpha) = (\hat\alpha - \alpha)/SE(\hat\alpha) \sim t(n-k-1).

Tests of H_0: \beta_j = \beta_{j0} proceed exactly as in the simple regression case, with critical values such that, e.g., prob(t(n-k-1) > d_2) = \gamma.
Example

Suppose that the null hypothesis to be tested is denoted by H_0 and consists of several linear restrictions on the parameters of the regression model. Thus H_0 specifies the values of, say, q < (k+1) linear combinations of the regression coefficients. For example, with k = 4 and q = 3, H_0 could consist of the following restrictions: \alpha + \beta_1 = 0; \beta_2 = 1; and \beta_4 = 0. We now need a joint test of all the restrictions of H_0, rather than a collection of separate t-tests.
Definition

Define the F statistic by

F = \frac{[RSS(H_0) - RSS(H_1)]/q}{RSS(H_1)/(n-k-1)},

which is F(q, n-k-1)-distributed under H_0.

Prediction

y_f = \alpha + \sum_j \beta_j x_{jf} + u_f, \qquad u_f \sim N(0, \sigma^2).
The OLS estimators use the data for i = 1, ..., n. This predictor is BLUE for E(y_f) = \alpha + \beta x_f.

The predictor \hat y_f is a linear combination of the OLS estimators and so is normally distributed. The variance of \hat y_f can be estimated, and confidence intervals and tests of hypotheses are feasible.
Reading

Wooldridge, Ch.3.

Multicollinearity
The sampling variance of \hat\beta_j can be written

var(\hat\beta_j) = \frac{\sigma^2}{\left[\sum_i (x_{ji} - \bar x_j)^2\right]\left(1 - R_j^2\right)},

where R_j^2 is the R^2 from regressing x_j on the other regressors. High correlation among regressors (R_j^2 close to 1) inflates var(\hat\beta_j).

Reading

Wooldridge, Ch.3, Ch.7, Ch.9.
Consequences

Case 2. We may have omitted some relevant regressors: write the conditional mean function as

E(y_i | x_{ji}, j = 1, ..., k) = \alpha + \sum_j \beta_j x_{ji} + E(f_i | x_{ji}, j = 1, ..., k),

where f_i denotes the omitted factor.
Case 3. If we have a strong belief about the omitted factor, we can use a precise test. For example, if we are sure that f_i is a linear combination of q variables z_{ji}, we can apply an F-test of H_0: \delta_1 = ... = \delta_q = 0 in the expanded model

y_i = \alpha + \sum_j \beta_j x_{ji} + \sum_j \delta_j z_{ji} + u_i, \qquad u_i \sim NID(0, \sigma^2).

Without such a belief, RESET uses powers of the fitted values:

y_i = \alpha + \sum_j \beta_j x_{ji} + \sum_j \delta_j (\hat y_i)^{j+1} + u_i, \qquad u_i \sim NID(0, \sigma^2).
Notes:

1. There is no \hat y_i (first-power) term because this is a linear combination of the intercept term and the regressors x_{ji};
2. The F-test is valid even though the added variables are random;
3. The choice of q has an impact on power;
4. There is no rule for determining the best value of q;
5. Quite small values of q, e.g. 1 or 2, are often used;
6. RESET cannot be expected to indicate how a model should be re-specified;
7. RESET cannot be assumed always to have high power.

Structural stability can be examined by splitting the sample:

y_i = \alpha + \sum_j \beta_j x_{ji} + u_i, \quad u_i \sim NID(0, \sigma^2), \quad \text{if } i \in n_1,
y_i = \alpha' + \sum_j \beta_j' x_{ji} + u_i, \quad u_i \sim NID(0, \sigma^2), \quad \text{if } i \in n_2,
and an F statistic comparing the pooled and split-sample residual sums of squares provides a test of parameter constancy.
Treatment

Reading

Wooldridge, Ch.5.

Non-normal Disturbances
Consequences

In the model y_i = \alpha + \sum_j \beta_j x_{ji} + u_i, i = 1, ..., n, with non-normal errors, the OLS estimators are still BLUE but, in general, are NOT normally distributed. Therefore the t and F tests are no longer valid in finite samples.
Test Procedures; Treatment

Heteroskedasticity: Introduction
Consequences of Heteroskedasticity

With var(u_i) = \sigma_i^2 varying over i,

var(\hat\beta_j) = \frac{\sum_i \tilde x_{ji}^2\sigma_i^2}{\left(\sum_i \tilde x_{ji}^2\right)^2} = \sum_i \tilde x_{ji}^2\sigma_i^2 \Big/ RSS_j^2,

which is not equal to \sigma^2/RSS_j. Conventional OLS standard errors are therefore biased.

Tests for Heteroskedasticity
Goldfeld-Quandt Test

Let RSS_1 and RSS_2 denote the OLS residual sums of squares from estimation using the first m and last m observations, respectively. Under the null hypothesis of homoskedasticity, the statistic GQ = RSS_2/RSS_1 is distributed as F(m-k-1, m-k-1), and large values indicate that the data are inconsistent with the null hypothesis.

Other tests exist in which the form of \sigma_i^2 need not be specified.
Treatment of Heteroskedasticity
Use heteroskedasticity-consistent (robust, White-type) standard errors:

WSE(\hat\beta_j) = \left(\sum_i \tilde x_{ji}^2\hat u_i^2 \Big/ RSS_j^2\right)^{1/2},

which is consistent whether or not the errors are homoskedastic.
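A sketch of the matrix (sandwich) version of these robust standard errors, reusing X and the residuals u_hat from the multiple-regression example above (the single-coefficient WSE formula is the diagonal of this matrix):

```python
import numpy as np

XtX_inv = np.linalg.inv(X.T @ X)
meat = X.T @ (X * (u_hat ** 2)[:, None])   # sum_i u_i^2 x_i x_i'
V_white = XtX_inv @ meat @ XtX_inv         # sandwich (HC0) covariance estimator
wse = np.sqrt(np.diag(V_white))
print(wse)
```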
With prob(-d_1 \le N(0,1) \le d_1) = 1 - \gamma, the (1-\gamma) \times 100 per cent confidence intervals for \alpha and \beta_j are given by \hat\alpha \pm d_1 WSE(\hat\alpha) and \hat\beta_j \pm d_1 WSE(\hat\beta_j), respectively. Also,

t_0^W(\hat\beta_j) = (\hat\beta_j - \beta_{j0})/WSE(\hat\beta_j) \sim N(0,1)

asymptotically under H_0.
Test procedures:

H_1: \beta_j \ne \beta_{j0} — reject H_0 if |t_0^W(\hat\beta_j)| > d_1, where prob(N(0,1) > d_1) = \gamma/2.

Just replace \beta_j by \alpha and \hat\beta_j by \hat\alpha in the above to obtain test procedures relevant to testing hypotheses concerning the intercept.
Consequences of Autocorrelation

With time-series data, \hat\beta_j = \beta_j + \sum_t \tilde x_{jt}u_t/RSS_j. When the errors are autocorrelated,

var\left(\sum_t \tilde x_{jt}u_t\right) \ne \sigma^2\sum_t \tilde x_{jt}^2,

and so var(\hat\beta_j) \ne \sigma^2/RSS_j. Conventional standard errors are, therefore, biased.
The Durbin-Watson statistic is

d = \frac{\sum_{t=2}^{n}(\hat u_t - \hat u_{t-1})^2}{\sum_{t=1}^{n}\hat u_t^2} \approx 2[1 - r(1)], \qquad \text{where } r(1) = \frac{\sum_{t=2}^{n}\hat u_t\hat u_{t-1}}{\sum_{t=1}^{n}\hat u_t^2}.

Lemma

Values of d close to 0 (resp. 4) indicate a high level of positive (resp. negative) residual first-order serial correlation. The distribution of d under the null hypothesis of independent errors depends upon the values of the regressors, so critical values vary from one case to another.

Tables are available for combinations of n and k (and for models with and without an intercept) giving bounds for the critical values for testing H_0 of serial independence against H_1: \rho(1) > 0. These upper and lower bounds, denoted by d_u and d_l, define an interval that contains the true (unknown) critical value. If d < d_l, reject. If d > d_u, do not reject. If d_l \le d \le d_u, the test is inconclusive. For H_1: \rho(1) < 0, use 4 - d_u and 4 - d_l as bounds.
Estimation

Feasible test and correction procedures are implemented in standard econometric software. In the auxiliary regressions, the \hat u_{t-j} are lagged values of the residuals from the OLS estimation, for selected j terms. If t - j is not positive, set \hat u_{t-j} = 0.
Jianhua Gang
School of Finance
Renmin University of China
Spring 2013
Moments

Covariance: E[(Y_t - \mu_t)(Y_{t+j} - \mu_{t+j})] = \gamma_t(j).

Correlation: \rho_t(j) = \dfrac{\gamma_t(j)}{\sigma_t\sigma_{t+j}}.
Operators

Problem

Suppose \{Y_1, Y_2, ..., Y_t, Y_{t+1}, ..., Y_{T-1}, Y_T\} is a single realization from a stochastic process \{Y_t\}. We are interested in the model that generated the time series, but we do not know it. How can we make inference using one single realization?

Lag operator: L, with L Y_t = Y_{t-1}, so L^{-1}Y_t = Y_{t+1}.

Solution

We must use the fact that this is a T-dimensional observation, and restrict the heterogeneity and dependence of the process.
Restrict Heterogeneity

Covariance (weak) stationarity:

E(Y_t) = \mu, \quad \forall t,
E[(Y_t - \mu)(Y_{t+j} - \mu)] = \gamma(j), \quad \forall t,

i.e. the first two moments are finite and do not depend on time (a spatial equivalent of homogeneity).
Strict stationarity: for any j_1, ..., j_n, the joint distribution of (Y_{t+j_1}, ..., Y_{t+j_n}) and of (Y_{t+\tau+j_1}, ..., Y_{t+\tau+j_n}) is the same for any \tau.

Definition

One restriction on the dependence that allows us to consistently estimate the population moments using the sample moments in stationary processes is called ergodicity. A sufficient condition (ergodicity for the mean) is absolute summability of the autocovariances:

\sum_{j=0}^{\infty}|\gamma_j| < \infty.
Restrict Dependence
Forecast; Wold Decomposition

We focus on linear forecasts; of course, in some cases a non-linear forecast may be better. The coefficients (\alpha_1^{(m)}, \alpha_2^{(m)}, ..., \alpha_m^{(m)}) characterise a good linear forecast of Y_{t+1} based on X_t (the m most recent observations) when the forecast error is orthogonal to the information used:

E[(Y_{t+1} - \alpha'X_t)X_t'] = 0,

so that

\hat\alpha = E(X_tX_t')^{-1}E(X_tY_{t+1}).

The Wold decomposition writes any covariance-stationary process as Y_t = \mu + \sum_{j=0}^{\infty}\psi_j\varepsilon_{t-j} with

\psi_0 = 1, \qquad \sum_{j=0}^{\infty}\psi_j^2 < \infty.
Impulse Response

For a process Y_t that admits the representation

Y_t = \mu + \sum_{j=0}^{\infty}\psi_j\varepsilon_{t-j},

with

E(\varepsilon_t) = 0, \qquad E(\varepsilon_t^2) = \sigma^2, \qquad E(\varepsilon_t\varepsilon_s) = 0 \text{ for } s \ne t,

notice that

\frac{\partial Y_t}{\partial\varepsilon_{t-j}} = \psi_j,

so \psi_j traces the impulse response at horizon j.
Autocorrelation Function

Definition (ACF)

\rho(j) = \frac{\gamma(j)}{\gamma(0)}, \qquad j = 1, 2, ...

Definition (PACF)

The m-th partial autocorrelation is the coefficient \alpha_m^{(m)} in the linear projection

\hat Y_{t+1|t,...,t-m+1} = \alpha_1^{(m)}Y_t + \alpha_2^{(m)}Y_{t-1} + ... + \alpha_m^{(m)}Y_{t-m+1}.
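A small sketch of the sample ACF, the estimator used throughout these notes (the T-denominator convention is the one defined formally in the estimation topic later):

```python
import numpy as np

def sample_acf(y, max_lag):
    """Sample autocorrelations rho_hat(1..max_lag), T-denominator convention."""
    y = np.asarray(y, dtype=float)
    T = len(y)
    d = y - y.mean()
    gamma0 = (d @ d) / T
    return np.array([(d[j:] @ d[:T - j]) / T / gamma0
                     for j in range(1, max_lag + 1)])

rng = np.random.default_rng(2)
eps = rng.normal(size=1000)
y = eps[1:] + 0.6 * eps[:-1]      # MA(1) with theta = 0.6
print(sample_acf(y, 3))           # rho_1 near 0.6/(1+0.36) ~ 0.44, rest near 0
```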
Jianhua Gang
School of Finance
Renmin University of China
Spring 2013
Preliminaries

Notation: x^* denotes the sample space and x a random variable, with

\sum_{x \in x^*} f(x) = 1 \; \text{(discrete)}, \qquad \Pr\{x \in x^*\} = \int_{x^*} f(x)\,dx = 1 \; \text{(continuous)}.

Binomial Distribution

Define

f(x) = \frac{n!}{x!(n-x)!}p^x(1-p)^{n-x}, \qquad x = 0, 1, 2, ..., n.

The density sums to one by the binomial theorem:

(a+b)^n = \sum_{x=0}^{n}\frac{n!}{x!(n-x)!}a^x b^{n-x}.
Poisson Distribution

Define

f(x) = \frac{e^{-\lambda}\lambda^x}{x!}, \qquad x = 0, 1, 2, ...

The density arises from the identity

e^{\lambda} = \sum_{x=0}^{\infty}\frac{\lambda^x}{x!},

in which \lambda = E(x).

Normal Distribution

Define

f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left\{-\frac{(x-\mu)^2}{2\sigma^2}\right\},

written as x \sim N(\mu, \sigma^2), where -\infty < x < \infty.
Moments

E(x) = \sum_{x \in x^*} x\,f(x) \; \text{(discrete)}, \qquad E(x) = \int_{x^*} x\,f(x)\,dx \; \text{(continuous)},

and more generally

E\{g(x)\} = \sum_{x \in x^*} g(x)f(x) \quad \text{or} \quad \int_{x^*} g(x)f(x)\,dx.

Central moments:

\mu_i = E\{(x-\mu)^i\}.
Calculation of Moments

Expectation is linear:

E\{c\,g(x)\} = c\,E\{g(x)\}, \qquad E\{a + b\,g(x)\} = a + b\,E\{g(x)\},

and hence, for example,

\mu_2 = E\{(x-\mu)^2\} = E(x^2) - [E(x)]^2.
Moment-Generating Function (MGF)

M_x(\theta) = E\{e^{\theta x}\} = \sum_{x \in x^*} e^{\theta x}f(x) \quad \text{or} \quad \int_{x^*} e^{\theta x}f(x)\,dx.

Expanding the exponential,

M_x(\theta) = E\left\{1 + \theta x + \frac{\theta^2 x^2}{2!} + ...\right\} = E\left\{\sum_{i=0}^{\infty}\frac{(\theta x)^i}{i!}\right\} = 1 + \theta\mu_1' + \frac{\theta^2}{2!}\mu_2' + \frac{\theta^3}{3!}\mu_3' + ... + \frac{\theta^i}{i!}\mu_i' + ...,

so that

\frac{d^i M_x(\theta)}{d\theta^i}\bigg|_{\theta=0} = \mu_i' \quad \text{(the ith raw moment)}.

Hence we call the function M_x(\theta) the MGF of x. Note that this property is true in either the discrete or the continuous case.
Example of MGF

It is also easy to see that the MGF satisfies two very important properties: for independent random variables the MGF of a sum is the product of the MGFs, and M_{cx}(\theta) = M_x(c\theta).

Example

Observations x_1 through x_n are independent copies of the r.v. x \sim Po(\lambda). Suppose we are interested in the properties (distribution, moments, etc.) of the sample mean

\bar X = \frac{1}{n}\sum_{i=1}^{n}X_i.

Then

M_{\frac{1}{n}\sum x_i}(\theta) = M_{\sum x_i}\left(\frac{\theta}{n}\right) = \prod_{i=1}^{n}M_{x_i}\left(\frac{\theta}{n}\right) = \left[M_x\left(\frac{\theta}{n}\right)\right]^n.
Example of MGF

Problem: calculate the MGF of x \sim Po(\lambda).

Solution:

M_x(\theta) = E\{e^{\theta x}\} = \sum_{x=0}^{\infty}e^{\theta x}\frac{e^{-\lambda}\lambda^x}{x!} = e^{-\lambda}\sum_{x=0}^{\infty}\frac{(\lambda e^{\theta})^x}{x!} = e^{-\lambda}\exp\{\lambda e^{\theta}\} = \exp\{\lambda(e^{\theta}-1)\}.

Problem: calculate the MGF of S_n = n\bar X = \sum_{i=1}^{n}x_i.

Solution:

M_{S_n}(\theta) = \prod_{i=1}^{n}\exp\{\lambda(e^{\theta}-1)\} = \left[\exp\{\lambda(e^{\theta}-1)\}\right]^n = \exp\{n\lambda(e^{\theta}-1)\}.

Note that the MGF of S_n is of the same form as that for x, i.e. letting \lambda^* = n\lambda,

M_{S_n}(\theta) = \exp\{\lambda^*(e^{\theta}-1)\},

i.e. S_n \sim Po(n\lambda) = Po(\lambda^*).
Example of MGF

Problem: calculate the MGF of \bar X.

Solution:

M_{\bar X}(\theta) = M_{S_n}\left(\frac{\theta}{n}\right) = \prod_{i=1}^{n}M_{x_i}\left(\frac{\theta}{n}\right) = \left[M_x\left(\frac{\theta}{n}\right)\right]^n = \left[\exp\left\{\lambda\left(e^{\theta/n}-1\right)\right\}\right]^n = \exp\left\{n\lambda\left(e^{\theta/n}-1\right)\right\}.

Problem: the moments of \bar X.

Solution: differentiate the MGF and evaluate at \theta = 0:

E[\bar X] = \frac{d}{d\theta}\exp\{n\lambda(e^{\theta/n}-1)\}\Big|_{\theta=0} = \lambda,

E[\bar X^2] = \frac{d^2}{d\theta^2}\exp\{n\lambda(e^{\theta/n}-1)\}\Big|_{\theta=0} = \lambda^2 + \frac{\lambda}{n},

so that (central moments)

\sigma^2_{\bar X} = E[\bar X^2] - (E[\bar X])^2 = \frac{\lambda}{n}.

If we consider \bar X as an estimator of \lambda, we refer to these properties as unbiasedness and, since the variance tends to zero as n grows, consistency.
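A quick simulation check of these two properties (unbiasedness and variance \lambda/n), with illustrative values of \lambda and n:

```python
import numpy as np

rng = np.random.default_rng(6)
lam, n, reps = 3.0, 50, 200_000
xbar = rng.poisson(lam, size=(reps, n)).mean(axis=1)

print(xbar.mean(), lam)        # ~3.0: E[X_bar] = lambda
print(xbar.var(), lam / n)     # ~0.06: Var(X_bar) = lambda / n
```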
3 Credits, 51 Hours
Jianhua Gang
School of Finance
Renmin University of China
Spring 2013

Throughout, we use the Wold representation

Y_t = \mu + \sum_{j=0}^{\infty}\psi_j\varepsilon_{t-j}

for the impulse response analysis and for forecasting.
Definition

\{\varepsilon_t\} is white noise if:

E(\varepsilon_t) = 0 \;\; \forall t, \qquad E(\varepsilon_t^2) = \sigma^2 \;\; \forall t, \qquad E(\varepsilon_t\varepsilon_s) = 0 \;\; \forall t \ne s,

so \mu = 0, \gamma_j = 0 and \rho_j = 0 for j \ne 0.

If \varepsilon_t is w.n.(0, \sigma^2) and Y_t = \mu + \sum_{j=0}^{\infty}\psi_j\varepsilon_{t-j}, then Y_t is stationary if

\sum_{j=0}^{\infty}\psi_j^2 < \infty,

and Y_t is ergodic for the mean if

\sum_{j=0}^{\infty}|\psi_j| < \infty.
MA(1)

Y_t = \mu + \varepsilon_t + \theta\varepsilon_{t-1}.

Stationarity always holds, since \sum_{j=0}^{\infty}\psi_j^2 = 1 + \theta^2 < \infty. Otherwise, we can check directly that the first two moments do not depend on time:

1. Mean: E(Y_t) = \mu.
2. Autocovariances:

\gamma_0 = E[(Y_t - \mu)^2] = E[(\varepsilon_t + \theta\varepsilon_{t-1})^2] = (1 + \theta^2)\sigma^2,
\gamma_1 = E[(Y_t - \mu)(Y_{t-1} - \mu)] = \theta\sigma^2,
\gamma_j = 0, \qquad j \ge 2.
Autocorrelations: \rho_1 = \dfrac{\theta}{1+\theta^2}, \quad \rho_j = 0 for j \ge 2.

Invertible MA(1)

Rewrite \varepsilon_t = (Y_t - \mu) - \theta\varepsilon_{t-1} as \varepsilon_t = (Y_t - \mu) - \theta L\varepsilon_t, so that (for |\theta| < 1)

\varepsilon_t = \frac{Y_t - \mu}{1 + \theta L} = (Y_t - \mu)\sum_{j=0}^{\infty}(-\theta)^jL^j = \sum_{j=0}^{\infty}(-\theta)^j(Y_{t-j} - \mu),

i.e. Y_t - \mu = -\sum_{j=1}^{\infty}(-\theta)^j(Y_{t-j} - \mu) + \varepsilon_t: an invertible MA(1) has an AR(\infty) representation.

MA(q)

Y_t = \mu + \varepsilon_t + \theta_1\varepsilon_{t-1} + ... + \theta_q\varepsilon_{t-q}, with

\gamma_0 = E[(\varepsilon_t + \theta_1\varepsilon_{t-1} + ... + \theta_q\varepsilon_{t-q})^2] = (1 + \theta_1^2 + ... + \theta_q^2)\sigma^2,

and, for 0 < j \le q,

\gamma_j = E[(\varepsilon_t + \theta_1\varepsilon_{t-1} + ... + \theta_q\varepsilon_{t-q})(\varepsilon_{t-j} + \theta_1\varepsilon_{t-1-j} + ... + \theta_q\varepsilon_{t-q-j})] = (\theta_j + \theta_1\theta_{j+1} + \theta_2\theta_{j+2} + ... + \theta_{q-j}\theta_q)\sigma^2,

while \gamma_j = 0 for j > q.
MA(\infty)

For Y_t = \mu + \sum_{j=0}^{\infty}\psi_j\varepsilon_{t-j},

\gamma_0 = \sum_{k=0}^{\infty}\psi_k^2\sigma^2, \qquad \gamma_j = \sum_{k=0}^{\infty}\psi_k\psi_{k+j}\sigma^2.

AR(1)

Y_t = c + \phi Y_{t-1} + \varepsilon_t. By recursive substitution,

Y_t = c + \phi(c + \phi Y_{t-2} + \varepsilon_{t-1}) + \varepsilon_t = (1+\phi)c + \phi^2Y_{t-2} + \phi\varepsilon_{t-1} + \varepsilon_t = ... \text{(iterating)}
    = \sum_{j=0}^{n}\phi^jc + \phi^{n+1}Y_{t-n-1} + \sum_{j=0}^{n}\phi^j\varepsilon_{t-j},

and, as n \to \infty with |\phi| < 1,

Y_t = \frac{c}{1-\phi} + \sum_{j=0}^{\infty}\phi^j\varepsilon_{t-j}.
The MA(\infty) coefficients are \psi_j = \phi^j, and absolute summability holds:

\sum_{j=0}^{\infty}|\psi_j| = \sum_{j=0}^{\infty}|\phi|^j = \frac{1}{1-|\phi|} < \infty.

Equivalently, Y_t = (1-\phi L)^{-1}c + (1-\phi L)^{-1}\varepsilon_t. Hence:

Mean: E(Y_t) = \dfrac{c}{1-\phi} \; (= \mu).

Autocovariances:

\gamma_0 = \sum_{k=0}^{\infty}\phi^{2k}\sigma^2 = \frac{\sigma^2}{1-\phi^2},

\gamma_j = \sum_{k=0}^{\infty}\phi^k\phi^{k+j}\sigma^2 = \sum_{k=0}^{\infty}\phi^{2k}\phi^j\sigma^2 = \frac{\phi^j\sigma^2}{1-\phi^2}.
12 / 47
AR(1)
AR(1)
AR(1)
AR(1)
Autocorrelations
Upon knowing that the process is stationary, we could derive the mean and
autocovariances:
j
= j
j =
0
= + Yt 1 + t
Yt = (Yt 1 ) + t
Yt
then
0 = E (Yt )2 = E ((Yt 1 ) + t )2
13 / 47
AR(1)
AR(1)
14 / 47
Solving for \gamma_0,

\gamma_0 = \frac{\sigma^2}{1-\phi^2}.

For j \ge 1,

\gamma_j = E[(Y_t - \mu)(Y_{t-j} - \mu)] = E[(\phi(Y_{t-1} - \mu) + \varepsilon_t)(Y_{t-j} - \mu)] = \phi\gamma_{j-1},

so

\gamma_j = \frac{\phi^j\sigma^2}{1-\phi^2}.

AR(p)

Problem: how can we check for stationarity?

Solution: factoring

(1 - \phi_1L - ... - \phi_pL^p) = (1 - \lambda_1L)...(1 - \lambda_pL),

stationarity follows if |\lambda_j| < 1 for all j. Another way to state this condition is to check that the solutions of the equation in z,

1 - \phi_1z - ... - \phi_pz^p = 0,

are all OUTSIDE the unit circle.
Given stationarity, take expectations of the defining equations (the Yule-Walker system). For the AR(2), Y_t - \mu = \phi_1(Y_{t-1}-\mu) + \phi_2(Y_{t-2}-\mu) + \varepsilon_t:

\gamma_0 = \phi_1\gamma_1 + \phi_2\gamma_2 + \sigma^2,
\gamma_1 = \phi_1\gamma_0 + \phi_2\gamma_1,
\gamma_2 = \phi_1\gamma_1 + \phi_2\gamma_0,

and notice that \rho_1 = \gamma_1/\gamma_0 = \dfrac{\phi_1}{1-\phi_2}. Replacing \gamma_1 and \gamma_2 in the first equation,

\gamma_0 = \left[\frac{\phi_1^2}{1-\phi_2} + \phi_2\left(\frac{\phi_1^2}{1-\phi_2} + \phi_2\right)\right]\gamma_0 + \sigma^2,

which solves to

\gamma_0 = \frac{(1-\phi_2)\,\sigma^2}{(1+\phi_2)\left[(1-\phi_2)^2 - \phi_1^2\right]}.
ARMA(p, q)
Given stationarity:

Mean:

E(Y_t) = E(c + \phi_1Y_{t-1} + ... + \phi_pY_{t-p} + \varepsilon_t + \theta_1\varepsilon_{t-1} + ... + \theta_q\varepsilon_{t-q}) = c + \phi_1\mu + ... + \phi_p\mu + 0 + ... + 0,

so \mu = \dfrac{c}{1 - \phi_1 - ... - \phi_p}.

Autocovariances: the autocovariances are a combination of those of an AR(p) and an MA(q), so for j > q,

\gamma_j = \phi_1\gamma_{j-1} + ... + \phi_p\gamma_{j-p}.

For the ARMA(1,1), Y_t - \mu = \phi(Y_{t-1}-\mu) + \varepsilon_t + \theta\varepsilon_{t-1}:

E[(Y_t - \mu)\varepsilon_t] = E[(\phi(Y_{t-1}-\mu) + \varepsilon_t + \theta\varepsilon_{t-1})\varepsilon_t] = 0 + \sigma^2 + 0 = \sigma^2,
E[(Y_t - \mu)\varepsilon_{t-1}] = E[(\phi(Y_{t-1}-\mu) + \varepsilon_t + \theta\varepsilon_{t-1})\varepsilon_{t-1}] = \phi\sigma^2 + 0 + \theta\sigma^2 = (\phi+\theta)\sigma^2.
so

\gamma_0 = E[(\phi(Y_{t-1}-\mu) + \varepsilon_t + \theta\varepsilon_{t-1})(Y_t - \mu)] = \phi\gamma_1 + \sigma^2 + \theta(\phi+\theta)\sigma^2,

and

\gamma_1 = E[(Y_t - \mu)(Y_{t-1} - \mu)] = \phi\gamma_0 + \theta\sigma^2,

while \gamma_j = \phi\gamma_{j-1} for j \ge 2. Solving the first two equations,

\gamma_0 = \left[1 + \frac{(\phi+\theta)^2}{1-\phi^2}\right]\sigma^2, \qquad \gamma_1 = \left[\phi + \theta + \frac{\phi(\phi+\theta)^2}{1-\phi^2}\right]\sigma^2 = \frac{(\phi+\theta)(1+\phi\theta)}{1-\phi^2}\sigma^2.
The autocorrelations can be derived in the same way: for the generic ARMA(p, q), for j > q,

\rho_j = \phi_1\rho_{j-1} + ... + \phi_p\rho_{j-p}.

IRF of ARMA

Write Y_t = \mu + \phi(L)^{-1}\theta(L)\varepsilon_t = \mu + \psi(L)\varepsilon_t, so the impulse-response weights \psi_j solve \phi(L)\psi(L) = \theta(L):

(1 - \phi_1L - ... - \phi_pL^p)(1 + \psi_1L + \psi_2L^2 + ...) = (1 + \theta_1L + ... + \theta_qL^q).
Matching powers of L:

L^0: \psi_0 = 1,
L^1: \psi_1 = \phi_1 + \theta_1,
L^2: \psi_2 = \phi_2 + \phi_1\psi_1 + \theta_2,
L^3: \psi_3 = \phi_3 + \phi_2\psi_1 + \phi_1\psi_2 + \theta_3,

and so on. For the ARMA(1,1), \psi_1 = \phi + \theta and \psi_j = \phi\psi_{j-1} = (\phi+\theta)\phi^{j-1} for j \ge 2. The ARMA(1,1) can also be decomposed in impulse-response form directly, by looking at

Y_t - \mu = \phi(Y_{t-1} - \mu) + \varepsilon_t + \theta\varepsilon_{t-1}.
Then,

Y_t - \mu = \sum_{j=0}^{\infty}\phi^j\varepsilon_{t-j} + \theta\sum_{j=0}^{\infty}\phi^j\varepsilon_{t-j-1}
          = \varepsilon_t + \sum_{j=1}^{\infty}\phi^j\varepsilon_{t-j} + \sum_{j=1}^{\infty}\theta\phi^{j-1}\varepsilon_{t-j}
          = \varepsilon_t + (\phi+\theta)\sum_{j=1}^{\infty}\phi^{j-1}\varepsilon_{t-j},

confirming \psi_j = (\phi+\theta)\phi^{j-1} for j \ge 1.
Common Factors

Example: if the AR and MA polynomials share a common root, it should be cancelled. For instance, an ARMA model whose polynomials share a common factor can reduce to

(1 - 0.5L)Y_t = \varepsilon_t,

i.e.

Y_t = 0.5Y_{t-1} + \varepsilon_t.
Filters

Sometimes data are treated (by nature or by the researcher) by summing / averaging / differencing, etc. For Y_t = \mu + \psi(L)\varepsilon_t, a filter h(L) is applied as

X_t = h(L)Y_t, \qquad \text{where } h(L) = \sum_{j=-\infty}^{\infty}h_jL^j.

If

\sum_{j=-\infty}^{\infty}|h_j| < \infty \quad \text{and} \quad \sum_{j=0}^{\infty}|\psi_j| < \infty,

then

X_t = \mu^* + \psi^*(L)\varepsilon_t, \qquad \text{where } \mu^* = h(1)\mu, \quad \psi^*(L) = h(L)\psi(L).
Sum of ARMA Processes

(A k-term moving average filter, for instance, sets X_t = k^{-1}\sum_{j=0}^{k-1}Y_{t-j}.)

Example:

Y_t = X_t + v_t, \qquad \text{where } X_t = u_t + \theta u_{t-1},

with u_t and v_t mutually uncorrelated white noises.
To find the implied MA(1) representation, compute the moments of Y_t:

\gamma_0 = (1+\theta^2)\sigma_u^2 + \sigma_v^2,
\gamma_1 = E[(X_t + v_t)(X_{t-1} + v_{t-1})] = E(X_tX_{t-1}) + E(v_tX_{t-1}) + E(X_tv_{t-1}) + E(v_tv_{t-1}) = \theta\sigma_u^2,
\gamma_j = 0, \qquad j \ge 2.

This is the autocovariance structure of an MA(1), so write

Y_t = \varepsilon_t + \theta^*\varepsilon_{t-1}, \qquad \varepsilon_t \sim w.n.(0, \sigma_*^2), \qquad \rho_1 = \frac{\theta^*}{1+\theta^{*2}}.

Solving \rho_1\theta^{*2} - \theta^* + \rho_1 = 0 for \theta^*:

\theta^*_{1,2} = \frac{1 \pm \sqrt{1 - 4\rho_1^2}}{2\rho_1}.
For invertibility, take the root

\theta^* = \frac{1 - \sqrt{1 - 4\rho_1^2}}{2\rho_1}, \qquad \rho_1 = \frac{\theta\sigma_u^2}{(1+\theta^2)\sigma_u^2 + \sigma_v^2},

and recover \sigma_*^2, for example from \theta^*\sigma_*^2 = \theta\sigma_u^2, since in an MA(1) \gamma_1 = \theta^*\sigma_*^2.

Lemma

In general, consider Y_t = X_t + W_t, where X_t and W_t are (zero-mean) stationary processes such that X_t and W_{t-\tau} are not correlated at any t, \tau. Then

E(Y_tY_{t-j}) = E(X_tX_{t-j}) + E(W_tW_{t-j}),

i.e.

\gamma_j^Y = \gamma_j^X + \gamma_j^W.
Sum of Two MA Processes

Lemma

If X_t is MA(q_1) and W_t is MA(q_2), then Y_t is MA(max[q_1, q_2]).

Sum of Two AR Processes

Suppose Y_t = X_t + W_t, where

(1 - \phi L)X_t = u_t, \qquad (1 - \rho L)W_t = v_t \quad (\phi \ne \rho).

Then

(1 - \phi L)(1 - \rho L)(X_t + W_t) = (1 - \rho L)u_t + (1 - \phi L)v_t,

so Y_t is ARMA(2,1). (If \phi = \rho, Y_t is AR(1).)
Sum of Two ARMA Processes

Lemma

If X_t is ARMA(p_1, q_1) and W_t is ARMA(p_2, q_2), then Y_t is ARMA(p, q) with

p \le p_1 + p_2 \qquad \text{and} \qquad q \le \max(p_1 + q_2, p_2 + q_1).
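To tie the ARMA(1,1) results together, a short sketch that simulates the process and evaluates the impulse-response weights \psi_j derived above (parameter values are illustrative):

```python
import numpy as np

def arma11_irf(phi, theta, horizon):
    """Impulse-response weights of an ARMA(1,1):
    psi_0 = 1, psi_1 = phi + theta, psi_j = phi * psi_{j-1} for j >= 2."""
    psi = [1.0, phi + theta]
    for _ in range(2, horizon + 1):
        psi.append(phi * psi[-1])
    return np.array(psi[: horizon + 1])

def simulate_arma11(phi, theta, T, sigma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    eps = rng.normal(scale=sigma, size=T + 1)
    y = np.zeros(T)
    y[0] = eps[1] + theta * eps[0]
    for t in range(1, T):
        y[t] = phi * y[t - 1] + eps[t + 1] + theta * eps[t]
    return y

print(arma11_irf(0.7, 0.3, 5))   # [1., 1., 0.7, 0.49, 0.343, 0.2401]
```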
Jianhua Gang
School of Finance
Renmin University of China
Spring 2013
Sample moments:

Sample mean: \bar Y = \dfrac{1}{T}\sum_{t=1}^{T}Y_t.

Sample autocovariance: \hat\gamma_j = \dfrac{1}{T}\sum_{t=j+1}^{T}(Y_t - \bar Y)(Y_{t-j} - \bar Y).

Sample autocorrelation: \hat\rho_j = \dfrac{\hat\gamma_j}{\hat\gamma_0}.

Let Y = (Y_1, ..., Y_T)' be a normally distributed vector with

E(Y) = \mu, \qquad E[(Y - \mu)(Y - \mu)'] = \Omega.

The joint density in the support of Y is

f_{Y_T,...,Y_1}(y_T, ..., y_1) = (2\pi)^{-T/2}|\Omega|^{-1/2}\exp\left(-\frac{1}{2}(y-\mu)'\Omega^{-1}(y-\mu)\right).
Examples

f(y; \theta) = (2\pi)^{-T/2}|\Omega(\theta)|^{-1/2}\exp\left(-\frac{1}{2}(y-\mu)'\Omega(\theta)^{-1}(y-\mu)\right)

is the likelihood function. Maximizing that function w.r.t. \theta yields the (exact) maximum likelihood estimate. Note the difference between \theta and \theta_0 (the true value).

AR(1) (|\phi_0| < 1):

Y_t = c_0 + \phi_0Y_{t-1} + \varepsilon_t, \qquad \varepsilon_t \sim Nid(0, \sigma_0^2),

with \theta = (c, \phi, \sigma^2)', |\phi| < 1, and

\Omega(\theta) = \frac{\sigma^2}{1-\phi^2}
\begin{pmatrix}
1 & \phi & \phi^2 & ... & \phi^{T-1} \\
\phi & 1 & \phi & ... & \phi^{T-2} \\
... & ... & ... & ... & ... \\
\phi^{T-1} & \phi^{T-2} & \phi^{T-3} & ... & 1
\end{pmatrix}.
MA(1) (|\theta_0| < 1):

Y_t = \mu_0 + \varepsilon_t + \theta_0\varepsilon_{t-1}, \qquad \varepsilon_t \sim Nid(0, \sigma_0^2),

with \theta = (\mu, \theta, \sigma^2)' and

\Omega(\theta) = \sigma^2
\begin{pmatrix}
(1+\theta^2) & \theta & 0 & ... & 0 \\
\theta & (1+\theta^2) & \theta & ... & 0 \\
... & ... & ... & ... & ... \\
0 & ... & 0 & \theta & (1+\theta^2)
\end{pmatrix}.

Numerical illustration: suppose the observed data are

y_1 = 0.5, \quad y_2 = 0.8, \quad y_3 = 0.2, \quad y_4 = 2,

and suppose you want to estimate \theta_0 in the MA(1) model with the additional assumption that \mu_0 = 0 and \sigma_0^2 = 1. Consider five potential values for \theta_0: -0.5, -0.25, 0, 0.25, 0.5. Then we have to compute \Omega(\theta) for each \theta: for example, when \theta = 0.5,
\Omega(\theta) =
\begin{pmatrix}
1.25 & 0.5 & 0 & 0 \\
0.5 & 1.25 & 0.5 & 0 \\
0 & 0.5 & 1.25 & 0.5 \\
0 & 0 & 0.5 & 1.25
\end{pmatrix},

and then

(y-\mu)'\Omega(\theta)^{-1}(y-\mu) =
\begin{pmatrix} 0.5 & 0.8 & 0.2 & 2 \end{pmatrix}
\begin{pmatrix}
1.25 & 0.5 & 0 & 0 \\
0.5 & 1.25 & 0.5 & 0 \\
0 & 0.5 & 1.25 & 0.5 \\
0 & 0 & 0.5 & 1.25
\end{pmatrix}^{-1}
\begin{pmatrix} 0.5 \\ 0.8 \\ 0.2 \\ 2 \end{pmatrix}
= 4.6903.
So,

f = (2\pi)^{-4/2}|\Omega(\theta)|^{-1/2}\exp\left(-\frac{1}{2}(y-\mu)'\Omega(\theta)^{-1}(y-\mu)\right) = (2\pi)^{-4/2}(1.332)^{-1/2}\exp\left(-\frac{1}{2}\times 4.6903\right) = 2.1033 \times 10^{-3}.

Therefore, we may get all the likelihoods for different \theta:

\theta:    -0.5    -0.25   0       0.25    0.5
10^3 f:    3.178   2.618   2.153   1.967   2.103
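A sketch of this exact-likelihood calculation in Python, with \mu_0 = 0 and \sigma_0^2 = 1 as assumed in the slides (numerical output may differ from the slides' rounded table in the last digits):

```python
import numpy as np

y = np.array([0.5, 0.8, 0.2, 2.0])   # the four observations from the slides

def ma1_exact_likelihood(theta, y, sigma2=1.0):
    """Exact Gaussian likelihood of a zero-mean MA(1), via Omega(theta)."""
    T = len(y)
    Omega = sigma2 * ((1 + theta**2) * np.eye(T)
                      + theta * (np.eye(T, k=1) + np.eye(T, k=-1)))
    quad = y @ np.linalg.solve(Omega, y)
    return ((2 * np.pi) ** (-T / 2)
            * np.linalg.det(Omega) ** (-0.5) * np.exp(-0.5 * quad))

for th in (-0.5, -0.25, 0.0, 0.25, 0.5):
    print(th, ma1_exact_likelihood(th, y))   # theta = 0.5 gives ~2.1e-3
```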
ML of AR(1)

The computation of

(2\pi)^{-T/2}|\Omega(\theta)|^{-1/2}\exp\left(-\frac{1}{2}(y-\mu)'\Omega(\theta)^{-1}(y-\mu)\right)

requires inverting a T \times T matrix. Instead, factorize the density. For the AR(1),

Y_1 \sim N\left(\frac{c_0}{1-\phi_0}, \frac{\sigma_0^2}{1-\phi_0^2}\right),

so

f_{Y_1}(y_1; \theta) = (2\pi)^{-1/2}\left(\frac{\sigma^2}{1-\phi^2}\right)^{-1/2}\exp\left[-\frac{1}{2}\frac{\left(y_1 - \frac{c}{1-\phi}\right)^2}{\frac{\sigma^2}{1-\phi^2}}\right].
and, by the same argument,

f_{Y_2|Y_1}(y_2|y_1; \theta) = (2\pi)^{-1/2}(\sigma^2)^{-1/2}\exp\left[-\frac{1}{2}\frac{(y_2 - c - \phi y_1)^2}{\sigma^2}\right].

The joint density factorizes as

f(y_T, ..., y_1; \theta) = f_{Y_1}(y_1; \theta)\prod_{t=2}^{T}f_{Y_t|Y_{t-1},...,Y_1}(y_t|y_{t-1}, ..., y_1; \theta),

where

f_{Y_t|Y_{t-1},...,Y_1}(y_t|y_{t-1}, ..., y_1; \theta) = (2\pi)^{-1/2}(\sigma^2)^{-1/2}\exp\left[-\frac{1}{2}\frac{(y_t - c - \phi y_{t-1})^2}{\sigma^2}\right]

when t = 2, ..., T.
The log-likelihood is

l(\theta) = \ln f_{Y_1}(y_1; \theta) + \sum_{t=2}^{T}\ln f_{Y_t|Y_{t-1}}(y_t|y_{t-1}; \theta)
 = -\frac{1}{2}\ln\left(\frac{2\pi\sigma^2}{1-\phi^2}\right) - \frac{\left(y_1 - \frac{c}{1-\phi}\right)^2}{2\frac{\sigma^2}{1-\phi^2}} - \frac{T-1}{2}\ln(2\pi\sigma^2) - \frac{1}{2}\sum_{t=2}^{T}\frac{(y_t - c - \phi y_{t-1})^2}{\sigma^2}.

We then succeed in writing the (log-)likelihood in a way that does not require the inversion of a T \times T matrix.
Conditional ML of AR(1): treating y_1 as given, maximize

-\frac{T-1}{2}\ln(2\pi\sigma^2) - \frac{1}{2}\sum_{t=2}^{T}\frac{(y_t - c - \phi y_{t-1})^2}{\sigma^2}.

The first-order conditions yield OLS-type estimators:

\hat\phi = \frac{\sum_{t=2}^{T}(y_t - \bar y_.)(y_{t-1} - \bar y_{.-1})}{\sum_{t=2}^{T}(y_{t-1} - \bar y_{.-1})^2}, \qquad \hat c = \bar y_. - \hat\phi\,\bar y_{.-1},

\hat\sigma^2 = \frac{1}{T-1}\sum_{t=2}^{T}(y_t - \hat c - \hat\phi y_{t-1})^2,

where \bar y_. = \frac{1}{T-1}\sum_{t=2}^{T}y_t and \bar y_{.-1} = \frac{1}{T-1}\sum_{t=2}^{T}y_{t-1}.
ML of AR(p)

Assume

Y_t = c_0 + \phi_{0;1}Y_{t-1} + ... + \phi_{0;p}Y_{t-p} + \varepsilon_t, \qquad \varepsilon_t \sim Nid(0, \sigma_0^2),

where the roots of 1 - \phi_{0;1}z - ... - \phi_{0;p}z^p = 0 are outside the unit circle.

Introduce the vector of the first p observations, y_p = (y_1, ..., y_p)', with mean vector \mu_p and variance matrix \sigma^2V_p(\theta).
The density of the first p observations is

f(y_p; \theta) = (2\pi)^{-p/2}|\sigma^2V_p(\theta)|^{-1/2}\exp\left[-\frac{1}{2\sigma^2}(y_p - \mu_p)'V_p(\theta)^{-1}(y_p - \mu_p)\right],

and, for t = p+1, ..., T,

f_{Y_t|Y_{t-1},...,Y_1}(y_t|y_{t-1}, ..., y_1; \theta) = (2\pi)^{-1/2}(\sigma^2)^{-1/2}\exp\left[-\frac{1}{2}\frac{(y_t - c - \phi_1y_{t-1} - ... - \phi_py_{t-p})^2}{\sigma^2}\right].
The log-likelihood is

l(\theta) = -\frac{p}{2}\ln(2\pi) - \frac{1}{2}\ln|\sigma^2V_p(\theta)| - \frac{1}{2\sigma^2}(y_p - \mu_p)'V_p^{-1}(\theta)(y_p - \mu_p)
 - \frac{T-p}{2}\ln(2\pi\sigma^2) - \frac{1}{2}\sum_{t=p+1}^{T}\frac{(y_t - c - \phi_1y_{t-1} - ... - \phi_py_{t-p})^2}{\sigma^2}.

Conditioning on the first p observations again reduces the maximization to OLS.
ML of MA(1)

Suppose Y_t = \mu_0 + \varepsilon_t + \theta_0\varepsilon_{t-1}, i.e. the density of Y_t|\varepsilon_{t-1} is

f_{Y_t|\varepsilon_{t-1}}(y_t|\varepsilon_{t-1}; \theta_0) = \frac{1}{\sqrt{2\pi\sigma_0^2}}\exp\left[-\frac{1}{2}\frac{(y_t - \mu_0 - \theta_0\varepsilon_{t-1})^2}{\sigma_0^2}\right] = \frac{1}{\sqrt{2\pi\sigma_0^2}}\exp\left[-\frac{1}{2}\frac{\varepsilon_t^2}{\sigma_0^2}\right].
With \varepsilon_1(\theta_0) = y_1 - \mu_0 - \theta_0\varepsilon_0, and iterating,

f_{Y_T,...,Y_1|\varepsilon_0}(y_T, ..., y_1|\varepsilon_0; \theta_0) = f_{Y_1|\varepsilon_0}(y_1|\varepsilon_0; \theta_0)\prod_{t=2}^{T}f_{Y_t|\varepsilon_{t-1}}(y_t|\varepsilon_{t-1}; \theta_0)
 = (2\pi)^{-T/2}(\sigma_0^2)^{-T/2}\exp\left[-\sum_{t=1}^{T}\frac{\varepsilon_t(\theta_0)^2}{2\sigma_0^2}\right].

Notice that this is not the (unconditional) density of (Y_T, ..., Y_1) where each Y_t has an MA(1) representation, but rather that density conditional on \varepsilon_0. Moreover, we cannot compute this likelihood directly, because we cannot observe \varepsilon_0.
Therefore, consider the process

Y_t = \mu_0 + \varepsilon_t + \theta_0\varepsilon_{t-1}, \qquad t > 0, \quad \varepsilon_0 = 0.

This process is very similar to the stationary MA(1), and it has the density above (setting \varepsilon_0 = 0); given that we know \varepsilon_0, we can initialize the iterations (for all the admissible values of \theta):

f(y_T, ..., y_1|\varepsilon_0 = 0; \theta) = (2\pi)^{-T/2}(\sigma^2)^{-T/2}\exp\left[-\sum_{t=1}^{T}\frac{\varepsilon_t(\theta)^2}{2\sigma^2}\right],

so the conditional log-likelihood is

l(\theta) = -\frac{T}{2}\ln(2\pi) - \frac{T}{2}\ln(\sigma^2) - \frac{1}{2\sigma^2}\sum_{t=1}^{T}\varepsilon_t^2(\theta),

where the \varepsilon_t(\theta) are computed recursively as \varepsilon_t(\theta) = y_t - \mu - \theta\varepsilon_{t-1}(\theta).
ML of MA(q)

Iterating in the same way, we can formulate a "conditional maximum likelihood" by setting \varepsilon_0 = \varepsilon_{-1} = ... = \varepsilon_{-q+1} = 0 and computing recursively

\varepsilon_t(\theta) = y_t - \mu - \theta_1\varepsilon_{t-1}(\theta) - ... - \theta_q\varepsilon_{t-q}(\theta),

so that

l(\theta) = -\frac{T}{2}\ln(2\pi) - \frac{T}{2}\ln(\sigma^2) - \frac{1}{2\sigma^2}\sum_{t=1}^{T}\varepsilon_t^2(\theta).
ML of ARMA(p, q)

Combining the two devices — conditioning on the first p observations of y and on zero initial values for the \varepsilon's — the conditional log-likelihood again takes the form

l(\theta) = -\frac{T-p}{2}\ln(2\pi) - \frac{T-p}{2}\ln(\sigma^2) - \frac{1}{2\sigma^2}\sum_{t=p+1}^{T}\varepsilon_t^2(\theta),

with \varepsilon_t(\theta) = y_t - c - \phi_1y_{t-1} - ... - \phi_py_{t-p} - \theta_1\varepsilon_{t-1}(\theta) - ... - \theta_q\varepsilon_{t-q}(\theta).
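A minimal sketch of the conditional ML recursion for an MA(1), with the optimization delegated to scipy (assumed available; the log-variance reparameterization is a convenience, not from the slides):

```python
import numpy as np
from scipy.optimize import minimize

def ma1_neg_cond_loglik(params, y):
    """Conditional (eps_0 = 0) negative log-likelihood of an MA(1)."""
    mu, theta, log_sigma2 = params
    sigma2 = np.exp(log_sigma2)             # keeps sigma^2 positive
    eps = np.zeros(len(y))
    eps_prev = 0.0
    for t, yt in enumerate(y):
        eps[t] = yt - mu - theta * eps_prev  # the recursion from the slides
        eps_prev = eps[t]
    T = len(y)
    return 0.5 * (T * np.log(2 * np.pi) + T * np.log(sigma2)
                  + (eps ** 2).sum() / sigma2)

rng = np.random.default_rng(3)
e = rng.normal(size=501)
y = 0.2 + e[1:] + 0.5 * e[:-1]               # true mu = 0.2, theta = 0.5
res = minimize(ma1_neg_cond_loglik, x0=np.zeros(3), args=(y,),
               method="Nelder-Mead")
print(res.x[:2], np.exp(res.x[2]))           # near (0.2, 0.5) and 1.0
```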
Numerical maximization (Newton-Raphson): define

g(\theta^{(0)}) = \frac{\partial l(\theta)}{\partial\theta}\bigg|_{\theta=\theta^{(0)}} \; \text{(gradient)}, \qquad H(\theta^{(0)}) = \frac{\partial^2 l(\theta)}{\partial\theta\,\partial\theta'}\bigg|_{\theta=\theta^{(0)}} \; \text{(Hessian)}.

A second-order expansion around \theta^{(0)} gives

l(\theta) \approx l(\theta^{(0)}) + g(\theta^{(0)})'\left[\theta - \theta^{(0)}\right] + \frac{1}{2}\left[\theta - \theta^{(0)}\right]'H(\theta^{(0)})\left[\theta - \theta^{(0)}\right].

Setting \partial l(\theta)/\partial\theta = 0 at \theta = \hat\theta and solving

g(\theta^{(0)}) + H(\theta^{(0)})\left[\theta - \theta^{(0)}\right] = 0,

i.e.,

\theta = \theta^{(0)} - H(\theta^{(0)})^{-1}g(\theta^{(0)}),

which is iterated until convergence.
Spring 2013
A Sample

Financial return series typically display:

leptokurtosis;
volatility clustering or volatility pooling;
leverage effects.

[Figure: SPX (S&P 500 Index) and CBOE VIX levels, 1990-2009.]

All the models considered so far are linear, or more compactly

y = X\beta + u, \qquad u \sim N(0, \sigma^2).
A Sample

[Figure: SPX daily returns and daily changes in the VIX, 1990-2009.]

Non-linear models: a general data generating process is

y_t = f(u_t, u_{t-1}, u_{t-2}, ...),

often specialised into a conditional-mean part g(\cdot) and a conditional-variance part \sigma^2(\cdot). Models with nonlinear g(\cdot) are non-linear in the mean, while those with nonlinear \sigma^2(\cdot) are non-linear in variance.
Examples of non-linear models:

ARCH / GARCH;
switching models;
bilinear models.

One particular non-linear model that has proved very useful in finance is the ARCH model due to Engle (1982).
Heteroskedasticity Revisited; ARCH Models

Use a model which does not assume that the variance is constant. Recall the definition of the variance of u_t:

\sigma_t^2 = var(u_t|u_{t-1}, u_{t-2}, ...) = E\{[u_t - E(u_t)]^2|u_{t-1}, u_{t-2}, ...\}.

We usually assume that E(u_t) = 0, so

\sigma_t^2 = var(u_t|u_{t-1}, u_{t-2}, ...) = E[u_t^2|u_{t-1}, u_{t-2}, ...].

Now, what could the current value of the variance of the errors plausibly depend upon? Previous squared errors: the ARCH(1) model sets

\sigma_t^2 = \alpha_0 + \alpha_1u_{t-1}^2.

We can easily extend this to the general case where the error variance depends on q lags of squared errors (ARCH(q)).
Testing for ARCH:

1. Run the postulated linear regression and save the residuals \hat u_t.
2. Then square the residuals, and regress them on q own lags to test for ARCH of order q, i.e. run the regression

\hat u_t^2 = \gamma_0 + \gamma_1\hat u_{t-1}^2 + \gamma_2\hat u_{t-2}^2 + ... + \gamma_q\hat u_{t-q}^2 + v_t,

and base the test statistic on the R^2 of this auxiliary regression, under

H_0: \gamma_1 = \gamma_2 = ... = \gamma_q = 0 \quad \text{vs.} \quad H_1: \text{at least one } \gamma_q \ne 0.

How do we decide on q? If the value of the test statistic is greater than the critical value from the \chi^2 distribution, then reject the null hypothesis. Note that the ARCH test is also sometimes applied directly to returns instead of the residuals from Stage 1 above.
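A sketch of the ARCH LM test described above, using the standard TR^2 form (statsmodels is assumed available; the helper name is illustrative):

```python
import numpy as np
import statsmodels.api as sm

def arch_lm_stat(resid, q):
    """Regress squared residuals on q own lags; compare T*R^2 with chi2(q)."""
    u2 = resid ** 2
    Y = u2[q:]
    X = sm.add_constant(np.column_stack(
        [u2[q - j: len(u2) - j] for j in range(1, q + 1)]))
    r2 = sm.OLS(Y, X).fit().rsquared
    return len(Y) * r2        # reject H0 of no ARCH if above chi2(q) critical value
```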
GARCH Models

\sigma_t^2 = \alpha_0 + \alpha_1u_{t-1}^2 + \beta\sigma_{t-1}^2.

This is a GARCH(1,1) model, which is like an ARMA(1,1) model for the variance equation. We could also show that a GARCH(1,1) model can be written as an infinite-order ARCH model. More generally,

\sigma_t^2 = \alpha_0 + \sum_{i=1}^{q}\alpha_iu_{t-i}^2 + \sum_{j=1}^{p}\beta_j\sigma_{t-j}^2.
The unconditional variance of u_t is

var(u_t) = \frac{\alpha_0}{1 - (\alpha_1 + \beta)}

when \alpha_1 + \beta < 1; \alpha_1 + \beta \ge 1 is termed non-stationarity in variance.

Since the model is no longer of the usual linear form, we cannot use OLS; we use maximum likelihood instead. The method works by finding the most likely values of the parameters given the actual data.
Specify the appropriate equations for the mean and the variance, e.g. an AR(1)-GARCH(1,1) model:

y_t = \mu + \phi y_{t-1} + u_t, \qquad u_t \sim N(0, \sigma_t^2),   (1)
\sigma_t^2 = \alpha_0 + \alpha_1u_{t-1}^2 + \beta\sigma_{t-1}^2.   (2)

The log-likelihood function (LLF) is

l = -\frac{T}{2}\log(2\pi) - \frac{1}{2}\sum_{t=1}^{T}\log(\sigma_t^2) - \frac{1}{2}\sum_{t=1}^{T}\frac{(y_t - \mu - \phi y_{t-1})^2}{\sigma_t^2}.

Steps:

1. Set up the LLF.
2. Use regression to get initial guesses for the mean parameters, and choose some initial guesses for the conditional variance parameters.
3. Specify a convergence criterion, either by criterion or by value.

The computer will maximise the function and give parameter values and their standard errors.
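A sketch of the GARCH(1,1) negative log-likelihood described above, for demeaned returns u; initializing the variance recursion at the sample variance is one common convention, not the only one:

```python
import numpy as np

def garch11_neg_loglik(params, u):
    """Negative Gaussian log-likelihood of a GARCH(1,1) for demeaned returns u."""
    a0, a1, beta = params
    if a0 <= 0 or a1 < 0 or beta < 0 or a1 + beta >= 1:
        return np.inf                      # crude positivity/stationarity guard
    T = len(u)
    s2 = np.empty(T)
    s2[0] = u.var()                        # initialization convention
    for t in range(1, T):
        s2[t] = a0 + a1 * u[t - 1] ** 2 + beta * s2[t - 1]
    return 0.5 * np.sum(np.log(2 * np.pi) + np.log(s2) + u ** 2 / s2)

# e.g. minimize over (a0, a1, beta) with scipy.optimize.minimize(...)
```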
Write u_t = v_t\sigma_t with v_t \sim N(0,1), where

\sigma_t = \sqrt{\alpha_0 + \alpha_1u_{t-1}^2 + \beta\sigma_{t-1}^2}, \qquad v_t = \frac{u_t}{\sigma_t}.

The sample counterpart is \hat v_t = \hat u_t/\hat\sigma_t. Are the \hat v_t normal? Typically the \hat v_t are still leptokurtic, although less so than the \hat u_t. Is this a problem? Not really, as we discussed before: we can use ML with a robust variance/covariance estimator. ML with robust standard errors is called Quasi-Maximum Likelihood or QML (also known as pseudo-ML).
24 / 30
T HE EGARCH M ODEL
T HE EGARCH M ODEL
T HE GJR M ODEL
r
2
ut 1
|ut 1 |
+ q
log 2t = + log 2t 1 + q
2t 1
2t 1
Since we model the log 2t , then even if the parameters are negative,
2t will be positive.
We can account for the leverage effect by noticing that a negative
shock (u
t 1 ) has an asymmetric effect on the dependent variable
log 2t as opposed to a positive shock.
T HE GJR M ODEL
25 / 30
\sigma_t^2 = \alpha_0 + \alpha_1u_{t-1}^2 + \beta\sigma_{t-1}^2 + \gamma u_{t-1}^2I_{t-1},

where

I_{t-1} = 1 \text{ if } u_{t-1} < 0, \qquad I_{t-1} = 0 \text{ otherwise}.

For a leverage effect, we would see \gamma > 0. We require \alpha_1 + \gamma \ge 0 and \alpha_1 \ge 0 for the non-negativity conditions.

An Example of GJR
Estimated GJR model (t-ratios in parentheses): conditional mean \hat y_t = 0.172 (3.198); the variance-equation t-ratios are 16.372, 0.437, 14.999 and 5.772, indicating a significant asymmetry (GJR) term.

[Figure: news impact curves — value of conditional variance against the lagged shock, GARCH vs. GJR.]
GARCH-in-Mean

In GARCH-M models the conditional variance (or standard deviation) enters the mean equation, so that the expected return depends on risk.

Uses of GARCH models: GARCH can model the volatility clustering effect, since the conditional variance is autoregressive; such models can be used to forecast volatility. We could show that

Var(y_t|y_{t-1}, y_{t-2}, ...) = Var(u_t|u_{t-1}, u_{t-2}, ...),

so modelling \sigma_t^2 gives forecasts of the variance of y_t.
Spring 2013

MGARCH Family
Problem

Is the impact the same for negative and positive shocks of the same amplitude?
VEC Model (Bollerslev et al. 1988)

y_t = \mu_t(\theta) + \varepsilon_t, \qquad \varepsilon_t = H_t^{1/2}(\theta)z_t, \qquad E(z_t) = 0, \quad Var(z_t) = I_N.

Let

h_t = vech(H_t), \qquad \eta_t = vech(\varepsilon_t\varepsilon_t'),

where vech(\cdot) denotes the operator that stacks the lower triangular portion of an N \times N matrix as an N(N+1)/2 \times 1 vector. The VEC model is

h_t = c + A\eta_{t-1} + Gh_{t-1},

where A and G are square parameter matrices of order N(N+1)/2 and c is an N(N+1)/2 \times 1 parameter vector.
VEC and DVEC (Bollerslev et al. 1988)

In the DVEC model, A and G are restricted to be diagonal. But even under this diagonality, large-scale systems are still highly parameterized and difficult to estimate. An even simpler version of the DVEC (Ding and Engle, 2001) restricts A and G to be positive scalars (the scalar model).
RiskMetrics EWMA:

h_t = (1-\lambda)\eta_{t-1} + \lambda h_{t-1},

which is a scalar VEC. The decay factor proposed by RiskMetrics is \lambda = 0.94 for daily data and \lambda = 0.97 for monthly data. However, the decay factor is not estimated but suggested, and is therefore very hard to justify.
BEKK

Definition (BEKK(1,1,K))

H_t = C'C + \sum_{k=1}^{K}A_k'\varepsilon_{t-1}\varepsilon_{t-1}'A_k + \sum_{k=1}^{K}G_k'H_{t-1}G_k.   (1)

By construction, H_t is positive definite.
VEC and BEKK vs. Factor Models

The difficulty when estimating a VEC or even a BEKK model is the high number of unknown parameters, even after imposing several restrictions. It is thus not surprising that these models are rarely used when the number of series is larger than 3 or 4. Factor and orthogonal models circumvent this difficulty by imposing a common dynamic structure on all the elements of H_t, which results in less parameterized models.

Definition (FGARCH(1,1,K))

In the factor GARCH (Engle et al.), the BEKK matrices are restricted to rank one:

A_k = \alpha_k w_k\lambda_k', \qquad G_k = \beta_k w_k\lambda_k',   (2), (3)

where the weights satisfy

\lambda_k'w_i = \begin{cases} 0 & \text{for } k \ne i \\ 1 & \text{for } k = i \end{cases}, \qquad \sum_{n=1}^{N}w_{kn} = 1.   (4)
Substituting (2) and (3) into (1), and defining \Omega = C'C, we get

H_t = \Omega + \sum_{k=1}^{K}\lambda_k\lambda_k'\left[\alpha_k^2w_k'\varepsilon_{t-1}\varepsilon_{t-1}'w_k + \beta_k^2w_k'H_{t-1}w_k\right].   (5)

For K = 2,

H_t = \Omega + \lambda_1\lambda_1'\left[\alpha_1^2w_1'\varepsilon_{t-1}\varepsilon_{t-1}'w_1 + \beta_1^2w_1'H_{t-1}w_1\right] + \lambda_2\lambda_2'\left[\alpha_2^2w_2'\varepsilon_{t-1}\varepsilon_{t-1}'w_2 + \beta_2^2w_2'H_{t-1}w_2\right].   (6)
Factor Models

Equivalently, the factor structure can be written

\varepsilon_t = \lambda_1f_{1t} + \lambda_2f_{2t} + e_t, \qquad \text{or generally} \quad \varepsilon_t = \Lambda f_t + e_t,

where each factor f_{kt} has zero conditional mean and a conditional variance like a GARCH(1,1) process, and e_t represents an idiosyncratic shock with a constant variance matrix, uncorrelated with the factors.

Orthogonal GARCH

Definition (O-GARCH(1,1,m))

Kariya (1988) and Alexander and Chibumba (1997): the N \times N time-varying variance matrix H_t is generated by m \le N univariate GARCH models.
20 / 44
Multivariate models must allow where one can specify separately (the
individual conditional variances) and the conditional correlation
matrix or other measure of dependence between individual series (like
the copula of the conditional joint density).
A hierarchical procedure:
1
21 / 44
CCC Model

Definition (CCC)

H_t = D_tRD_t, \qquad D_t = diag\left(h_{11t}^{1/2}\; h_{22t}^{1/2}\; ...\; h_{NNt}^{1/2}\right),   (7)

where h_{iit} can be defined as any univariate GARCH model and R is a constant conditional correlation matrix.
DCC Model

The DCC models of Tse and Tsui (2002) and Engle (2002) are useful when modelling high-dimensional data sets.

DCC of Tse and Tsui (DCCT):

H_t = D_tR_tD_t,

where D_t is defined in (7), h_{iit} can be defined as any univariate GARCH model, and

R_t = (1 - \theta_1 - \theta_2)R + \theta_1\Psi_{t-1} + \theta_2R_{t-1},   (8)
where \Psi_{t-1} is the sample correlation matrix of the standardized residuals u_{it} = \varepsilon_{it}/\sqrt{h_{iit}} over the previous M periods, with typical element

\psi_{ij,t-1} = \frac{\sum_{m=1}^{M}u_{i,t-m}u_{j,t-m}}{\sqrt{\left(\sum_{m=1}^{M}u_{i,t-m}^2\right)\left(\sum_{m=1}^{M}u_{j,t-m}^2\right)}}.   (9)
DCC of Engle (DCCE):

H_t = D_tR_tD_t,

where

R_t = diag\left(q_{11,t}^{-1/2}, ..., q_{NN,t}^{-1/2}\right)Q_t\,diag\left(q_{11,t}^{-1/2}, ..., q_{NN,t}^{-1/2}\right),

with Q_t an N \times N symmetric positive definite matrix driven by the standardized residuals.
For example, in the bivariate case the DCCT correlation uses (with M past periods)

\psi_{12,t-1} = \frac{\sum_{m=1}^{M}u_{1,t-m}u_{2,t-m}}{\sqrt{\left(\sum_{m=1}^{M}u_{1,t-m}^2\right)\left(\sum_{m=1}^{M}u_{2,t-m}^2\right)}},   (10)

while in the DCCE,

q_{12,t} = (1 - \theta_1 - \theta_2)\bar q_{12} + \theta_1u_{1,t-1}u_{2,t-1} + \theta_2q_{12,t-1},
q_{22,t} = (1 - \theta_1 - \theta_2)\bar q_{22} + \theta_1u_{2,t-1}^2 + \theta_2q_{22,t-1},

and the implied conditional correlation is

\rho_{12,t} = \frac{q_{12,t}}{\sqrt{q_{11,t}\,q_{22,t}}}.   (11)

Unlike the DCCT, the DCCE model does not formulate the conditional correlation as a weighted sum of past correlations.
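A sketch of the DCCE correlation recursion in Python; the function name and the use of the sample covariance of the standardized residuals for Q-bar are illustrative conventions, not prescribed by the slides:

```python
import numpy as np

def dcc_correlations(u, theta1, theta2):
    """DCC(1,1) of Engle: Q_t recursion on standardized residuals u (T x N)."""
    T, N = u.shape
    Q_bar = (u.T @ u) / T                  # unconditional target (assumption)
    Q = Q_bar.copy()
    R = np.empty((T, N, N))
    for t in range(T):
        d = 1.0 / np.sqrt(np.diag(Q))
        R[t] = Q * np.outer(d, d)          # R_t from Q_t, as in (11)
        Q = ((1 - theta1 - theta2) * Q_bar
             + theta1 * np.outer(u[t], u[t]) + theta2 * Q)
    return R
```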
A model somewhat different from the previous ones, but that nests several of them, is the general dynamic covariance (GDC) model proposed by Kroner and Ng (1998),   (12), (13)

in which D_t = (d_{ijt}) is diagonal with d_{iit} = \sqrt{\theta_{iit}} and the \theta_{ijt} follow BEKK-type dynamics. Elementwise we have

h_{ijt} = \rho_{ijt}\sqrt{\theta_{iit}\theta_{jjt}} + \phi_{ij}\theta_{ijt}, \quad \text{for } i \ne j,
h_{iit} = \theta_{iit}, \quad \text{for any } i.   (14)

Particular parameter choices deliver several of the previous models as special cases.
Estimation Issues

Two-Step MLE: estimate the univariate volatility models in a first step, then the correlation parameters in a second step, maximizing the (quasi-)likelihood in each step.
Diagnostic Checking

It is desirable to check the standardized residuals z_t = H_t^{-1/2}\varepsilon_t. If the model is correctly specified:

E(z_tz_t') = I_N;
Cov(z_{it}^2, z_{jt}^2) = 0, \text{ for all } i \ne j;
Cov(z_{it}^2, z_{j,t-k}^2) = 0, \text{ for } k > 0.
Spring 2013
All the models we have looked at thus far have been single-equation models of the form y = X\beta + u; y is an ENDOGENOUS variable. An example from economics to illustrate: the demand and supply of a good,

Q_{dt} = \alpha + \beta P_t + \gamma S_t + u_t,   (1)
Q_{st} = \lambda + \mu P_t + \kappa T_t + v_t,   (2)
Q_{dt} = Q_{st}.   (3)

Assuming that the market always clears, and dropping the time subscripts for simplicity,

Q = \alpha + \beta P + \gamma S + u,   (4)
Q = \lambda + \mu P + \kappa T + v.   (5)
Re-arranging, by equating (4) and (5):

\alpha + \beta P + \gamma S + u = \lambda + \mu P + \kappa T + v,   (6)

so that

(\beta - \mu)P = (\lambda - \alpha) + \kappa T - \gamma S + (v - u).   (7)

Solving for P,

P = \frac{\lambda - \alpha}{\beta - \mu} + \frac{\kappa}{\beta - \mu}T - \frac{\gamma}{\beta - \mu}S + \frac{v - u}{\beta - \mu}.   (8)

Solving for Q,

Q = \frac{\beta\lambda - \alpha\mu}{\beta - \mu} + \frac{\beta\kappa}{\beta - \mu}T - \frac{\gamma\mu}{\beta - \mu}S + \frac{\beta v - \mu u}{\beta - \mu}.   (9)

(8) and (9) are the reduced-form equations for P and Q.
But what would happen if we had estimated equations (4) and (5), i.e. the structural-form equations, separately using OLS?

Both equations depend on P. One of the CLRM assumptions was that E(X'u) = 0, where X is a matrix containing all the variables on the R.H.S. of the equation. It is clear from (8) that P is related to the errors in (4) and (5), i.e. it is stochastic. Hence the OLS estimator of the coefficient on P is biased (since E(X'u) \ne 0 in general!).
Writing the reduced form as

P = \pi_{10} + \pi_{11}T + \pi_{12}S + \varepsilon_1,   (10)
Q = \pi_{20} + \pi_{21}T + \pi_{22}S + \varepsilon_2,   (11)

we CAN estimate equations (10) and (11) using OLS, since all the R.H.S. variables are exogenous. But we probably don't care what the values of the \pi coefficients are; what we wanted were the original parameters in the structural equations: \alpha, \beta, \gamma, \lambda, \mu, \kappa.

Problem

As well as simultaneity, we sometimes encounter another problem: identification. Consider the following demand and supply equations:

Q = \alpha + \beta P,   (12)
Q = \lambda + \mu P.   (13)

We cannot tell which is which! (The two equations look identical from the OLS point of view.)
An equation is unidentified, like (12) and (13): we cannot get the structural coefficients from the reduced-form estimates.

An equation is exactly identified, e.g. (4) or (5): we can get unique structural-form coefficient estimates.

An equation is over-identified (examples given later): more than one set of structural coefficients could be obtained from the reduced form.
The Order Condition

Definition

Statement of the order condition (from Ramanathan 1995, p. 666): let G denote the number of structural equations. An equation is just identified if the number of variables excluded from that equation is G-1. If more than G-1 are absent, it is over-identified. If fewer than G-1 are absent, it is not identified.

Example

In the following system of equations, the Ys are endogenous, while the Xs are exogenous. Determine whether each equation is over-, under-, or just-identified.

Y_1 = \alpha_0 + \alpha_1Y_2 + \alpha_2Y_3 + \alpha_3X_1 + \alpha_4X_2 + u_1,   (14)
Y_2 = \beta_0 + \beta_1Y_3 + \beta_2X_1 + u_2,   (15)
Y_3 = \gamma_0 + \gamma_1Y_2 + u_3.   (16)
Solution

G = 3. If the number of excluded variables is 2, the equation is just identified; if more than 2, over-identified; if fewer than 2, not identified. Hence:

Equation (14): not identified.
Equation (15): just identified.
Equation (16): over-identified.

The Rank Condition

For example:

y_1 = 3y_2 - 2x_1 + x_2 + u_1,
y_2 = y_3 + x_3 + u_2,
y_3 = y_1 - y_2 - 2x_3 + u_3.
Estimation by 2SLS: write the reduced form

Y_1 = \pi_{10} + \pi_{11}X_1 + \pi_{12}X_2 + v_1,   (17)
Y_2 = \pi_{20} + \pi_{21}X_1 + v_2,   (18)
Y_3 = \pi_{30} + \pi_{31}X_1 + v_3.   (19)

Estimate the reduced-form equations (17)-(19) using OLS, and obtain the fitted values \hat Y_1, \hat Y_2, \hat Y_3.
As a Hausman-type check of whether Y_2 and Y_3 can be treated as exogenous, run regression (14) again, but now also including the fitted values \hat Y_2, \hat Y_3 as additional regressors:

Y_1 = \alpha_0 + \alpha_1Y_2 + \alpha_2Y_3 + \alpha_3X_1 + \alpha_4X_2 + \delta_2\hat Y_2 + \delta_3\hat Y_3 + u_1.   (20)

Recursive Systems

Consider the following system of equations:

Y_1 = \beta_{10} + \gamma_{11}X_1 + \gamma_{12}X_2 + u_1,   (21)
Y_2 = \beta_{20} + \beta_{21}Y_1 + \gamma_{21}X_1 + \gamma_{22}X_2 + u_2,   (22)
Y_3 = \beta_{30} + \beta_{31}Y_1 + \beta_{32}Y_2 + \gamma_{31}X_1 + \gamma_{32}X_2 + u_3.   (23)
Problem

Assume that the error terms are not correlated with each other. Can we estimate the equations individually using OLS?

(21) contains no endogenous variables, so X_1 and X_2 are NOT correlated with u_1. So we can use OLS on (21).

(22) contains the endogenous variable Y_1. We can use OLS on (22) if all the R.H.S. variables are uncorrelated with the error u_2 (true!). In fact, Y_1 is not correlated with u_2 because there is no Y_2 term in equation (21). So we can use OLS on (22), and, by the same argument, on (23).
Indirect Least Squares (ILS)

Definition

If the system is just identified, ILS involves estimating the reduced-form equations using OLS, and then using the estimates to substitute back to obtain the structural parameters.

Two-Stage Least Squares (2SLS):

Stage 1. Obtain and estimate the reduced-form equations using OLS. Save the fitted values for the endogenous variables.

Stage 2. Estimate the structural equations, but replace any R.H.S. endogenous variables with their Stage 1 fitted values.
Y_1 = \alpha_0 + \alpha_1\hat Y_2 + \alpha_2\hat Y_3 + \alpha_3X_1 + \alpha_4X_2 + u_1,   (24)
Y_2 = \beta_0 + \beta_1\hat Y_3 + \beta_2X_1 + u_2,   (25)
Y_3 = \gamma_0 + \gamma_1\hat Y_2 + u_3.   (26)

Now \hat Y_2 and \hat Y_3 will not be correlated with u_1, \hat Y_3 will not be correlated with u_2, and \hat Y_2 will not be correlated with u_3.
Recall that the reason we cannot use OLS directly on the structural equations is that the endogenous variables are correlated with the errors.

One solution: do not use Y_2 or Y_3 directly, but use some other variables instead. We want these other variables to be (highly) correlated with Y_2 and Y_3, but not correlated with the errors: the INSTRUMENTS.

Say z_2 and z_3 are suitable instruments for Y_2 and Y_3, respectively. We do not use the instruments directly, but run regressions of the form

Y_2 = \lambda_1 + \lambda_2z_2 + \varepsilon_1,   (27)
Y_3 = \lambda_3 + \lambda_4z_3 + \varepsilon_2.   (28)
Obtain the fitted values \hat Y_2 and \hat Y_3 from (27) and (28), and replace Y_2 and Y_3 with these in the structural equation. We do not use the instruments directly in the structural equation.

It is typical to use more than one instrument per endogenous variable. If the instruments are the variables in the reduced-form equations, then IV is equivalent to 2SLS.

Problem

What are the instruments?

Solution

2SLS is easier: the instruments are the exogenous variables of the system. Other (full-information) estimation techniques also exist.
Vector Autoregressive (VAR) Models

A VAR generalises the univariate autoregression Y_t = \beta_0 + \beta_1Y_{t-1} + u_t to a vector of series; e.g., in a bivariate VAR(1),

y_{2t} = \beta_{20} + \beta_{21}y_{2,t-1} + \gamma_{21}y_{1,t-1} + u_{2t}.

Issues:

1. VARs are a-theoretical (as are ARMA models). What if the data are not generated by a VAR process?
2. How do we decide the appropriate lag length?
3. So many parameters to estimate!
4. Do we need to ensure all components of the VAR are stationary?
5. How do we interpret the coefficients?

Lag length by likelihood ratio test: LR \sim \chi^2(q), q = number of restrictions. In our case above we restrict 4 lags of two variables in each of the two equations: 4 \times 2 \times 2 = 16 restrictions.

Problem

Conducting the LR test is cumbersome and requires a normality assumption for the disturbances.
Alternatively, use information criteria: the values of the information criteria are constructed for 0, 1, ... lags (up to some pre-specified maximum k_{max}), and the lag length minimizing the criterion is chosen.

A primitive (structural) bivariate VAR allows contemporaneous feedback:

y_{1t} = \beta_{10} + \beta_{11}y_{1,t-1} + \alpha_{12}y_{2t} + u_{1t},
y_{2t} = \beta_{20} + \beta_{21}y_{2,t-1} + \alpha_{21}y_{1t} + u_{2t},

or in compact form

BY_t = \beta_0 + \beta_1Y_{t-1} + u_t.
We can take the contemporaneous terms over to the L.H.S. and write

BY_t = \beta_0 + \beta_1Y_{t-1} + u_t.

We then pre-multiply both sides by B^{-1}:

Y_t = B^{-1}\beta_0 + B^{-1}\beta_1Y_{t-1} + B^{-1}u_t,

or

Y_t = A_0 + A_1Y_{t-1} + e_t.

This is known as a standard-form VAR, which we can estimate using ML as before.
Granger Causality (Block Significance Tests)

With three lags, the y_{2t} equation of the bivariate VAR(3) is, e.g.,

y_{2t} = \beta_{20} + \beta_{22}y_{2,t-1} + \beta_{21}y_{1,t-1} + \gamma_{22}y_{2,t-2} + \gamma_{21}y_{1,t-2} + \delta_{22}y_{2,t-3} + \delta_{21}y_{1,t-3} + u_{2t}.

Hypothesis — implied restriction:

1. Lags of y_{1t} do not explain current y_{2t} — \beta_{21} = 0 and \gamma_{21} = 0 and \delta_{21} = 0;
2. Lags of y_{1t} do not explain current y_{1t} — \beta_{11} = 0 and \gamma_{11} = 0 and \delta_{11} = 0;
3. Lags of y_{2t} do not explain current y_{1t} — \beta_{12} = 0 and \gamma_{12} = 0 and \delta_{12} = 0;
4. Lags of y_{2t} do not explain current y_{2t} — \beta_{22} = 0 and \gamma_{22} = 0 and \delta_{22} = 0.

Each of these four joint hypotheses can be tested within the F-test framework, since each set of restrictions contains only parameters drawn from one equation.
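A sketch of these block-significance (Granger causality) F-tests using statsmodels' VAR implementation (statsmodels and pandas are assumed available; data simulated):

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.api import VAR

rng = np.random.default_rng(4)
T = 500
y1 = np.zeros(T); y2 = np.zeros(T)
for t in range(1, T):
    y1[t] = 0.5 * y1[t - 1] + 0.3 * y2[t - 1] + rng.normal()  # y2 Granger-causes y1
    y2[t] = 0.4 * y2[t - 1] + rng.normal()

data = pd.DataFrame({"y1": y1, "y2": y2})
res = VAR(data).fit(maxlags=4, ic="aic")     # lag length by information criterion
print(res.test_causality("y1", ["y2"], kind="f").summary())  # should reject H0
print(res.test_causality("y2", ["y1"], kind="f").summary())  # should not reject
```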
Variance Decompositions

Tests of Randomness

Under the null of independence,

\sqrt{T}\,\hat\rho_j \to_d N(0,1), \qquad j \ge 1.

We can use this property to design two tests to check whether the data are independently distributed, e.g. the portmanteau test:

T\sum_{j=1}^{k}\hat\rho_j^2 \to_d \chi_k^2.
Parsimonious Modelling

Large econometric models tend to do badly in terms of forecasting, and are outperformed by small ARMA models (Box & Jenkins).

Even in ARMA models, increasing the number of parameters reduces the precision with which each parameter is estimated. This is because, when the parameters are estimated, their variance contributes to the variance of the forecast.

Adding extra parameters may then help to reduce or eliminate the forecast bias, but the gain in terms of reduced squared bias can be outweighed by the loss from the increased variance of the forecast.

One should balance the number of estimated parameters and the number of observations. Sometimes, information criteria have been advocated also to select more parsimonious models.
3 Credits, 51 Hours
Jianhua Gang
School of Finance
Renmin University of China
Spring 2013

Example

Examine the series for consumption and income. The presence of trends can sometimes invalidate the usual asymptotic theory for OLS estimation and test procedures. A discussion of trends and related topics, such as tests for unit roots and cointegration, is therefore required.
Applied workers often specify models that include both lagged values of the dependent variable and a distributed lag component in the regression function. These models are called autoregressive distributed lag (ADL) models. A very simple ADL relationship is

y_t = \alpha + \alpha_1y_{t-1} + \beta_0x_t + \beta_1x_{t-1} + u_t, \qquad |\alpha_1| < 1.
Since y_t = \Delta y_t + y_{t-1} and x_t = \Delta x_t + x_{t-1}, the ADL can be written as

(\Delta y_t + y_{t-1}) = \alpha + \alpha_1y_{t-1} + \beta_0(\Delta x_t + x_{t-1}) + \beta_1x_{t-1} + u_t,

and rearranged as the error correction model (ECM)

\Delta y_t = \beta_0\Delta x_t - (1 - \alpha_1)\left[y_{t-1} - \kappa_0 - \kappa_1x_{t-1}\right] + u_t.

Thus the ECM has first differences in y linked to first differences in x and to the extent by which y deviates from its long-run expected value in the previous period.
If the OLS estimates of the ADL are denoted by hats, and the nonlinear least squares estimates of the ECM by tildes, then it can be shown that

\tilde\kappa_0 = \frac{\hat\alpha}{1 - \hat\alpha_1}, \qquad \tilde\kappa_1 = \frac{\hat\beta_0 + \hat\beta_1}{1 - \hat\alpha_1}, \qquad \tilde\beta_0 = \hat\beta_0, \qquad \tilde\alpha_1 = \hat\alpha_1.

This two-step approach can play an important role when the data contain trends, and will be discussed later in further detail.
Trending Variables

Consider the driftless random walk

\Delta z_t = z_t - z_{t-1} = u_t, \qquad u_t \sim NID(0, \sigma^2),   (1)

hence

z_t = \sum_{s=1}^{t}u_s + z_0.

For contrast, a stable process z_t = \rho z_{t-1} + u_t with |\rho| < 1 and z_0 = 0 has

E(z_t) = 0, \qquad var(z_t) = \sigma^2/(1-\rho^2) < \infty, \qquad corr(z_t, z_{t-s}) = \rho^s, \quad s \ge 0.

Thus this I(0) variable has constant mean and constant variance (hence large departures are rare), and its autocorrelations decline as the order increases. A drift term a makes the random walk trend:

z_t = \sum_{s=1}^{t}u_s + at,   (2)
where z_t = \sum_{s=1}^{t}u_s if z_0 = 0 and a = 0. Hence, for the driftless random walk,

E(z_t) = 0, \qquad var(z_t) = t\sigma^2 \; \text{(monotonic in } t\text{)}, \qquad corr(z_t, z_{t-g}) = \sqrt{(t-g)/t} \; \text{(dependent on } t\text{)}.

Note that \Delta z_t = u_t is I(0): the random walk is I(1).
For trend-stationary data, regressing on a deterministic time trend provides a basis for valid estimation and inference, with the additional regressor serving as a trend-removing agent.

However, it has been established that the asymptotic theory of OLS estimators and tests developed for I(0) variables can be misleading when applied to data from I(1) processes. With nonstationary variables, OLS estimators may tend to nonstandard distributions, rather than normality, as n \to \infty.
Unit root tests: in z_t = \rho z_{t-1} + u_t,

if \rho = 1, then z_t \sim I(1); if |\rho| < 1, then z_t \sim I(0).

Let DGP and RE denote the true data generating process and the regression equation used to compute the test, respectively. We will consider three cases.

Case 1. H_1: z_t is a stable AR(1) with zero mean:

z_t = \rho z_{t-1} + u_t, \qquad u_t \sim NID(0, \sigma^2), \quad |\rho| < 1.
The augmented Dickey-Fuller (ADF) regression is

\Delta z_t = A_0 + A_1t + (\rho - 1)z_{t-1} + \sum_{j=1}^{p}\delta_j\Delta z_{t-j} + \epsilon_t,

and the t-ratio on (\rho - 1) is compared with Dickey-Fuller critical values.
Co-integration

Definition: the components of z_t are co-integrated if

1. z_{it} \sim I(d), d > 0, for all i; and
2. there exists a vector \alpha \ne 0 such that the linear combination \alpha'z_t is integrated of a lower order.
The second approach involves applying the ADF test in t-ratio form after OLS estimation of the equation

\Delta\hat u_t = \varphi\hat u_{t-1} + \sum_{j=1}^{p}\delta_j\Delta\hat u_{t-j} + e_t,

where \hat u_t are the residuals of the co-integrating regression (CR). This test is denoted CRADF. The DF tables are not valid for CRADF: asymptotic distributions under the unit root hypothesis depend upon the number of I(1) regressors and whether or not the CR includes an intercept and/or a trend term. Finite-sample critical values have been estimated by computer methods for various cases, and are available in some estimation programs, e.g. PcGive.

Consider, for example, the PPP theory for the exchange rate. In perfect markets there are no arbitrage opportunities, so the exchange rate R is determined by the relative movements of the domestic price level P and the foreign price level P*, i.e.

R = \frac{P}{P^*}, \qquad r = p - p^* \; \text{(in logs)}.

This can be seen as a long-run equilibrium. Data like exchange rates and price levels are usually I(1), so they are quite volatile. However, if the PPP theory is correct, they should not drift apart a lot over time, i.e.

r - (p - p^*) \; \text{should be small (stationary)}.
Moreover, in practice the variables above are usually I(1), so they may co-integrate. Consider, e.g., a money demand relation

m_t = \beta_1r_t + \beta_2\pi_t + \beta_3Y_t.

Residual-based (Engle-Granger type) procedure:

1. Use the DF test to make sure that all the variables are I(1).
2. Use OLS to estimate the model m_t = \hat\beta_1r_t + \hat\beta_2\pi_t + \hat\beta_3Y_t + \hat u_t.
3. Conduct a test on the residuals: if there is co-integration, the residuals must be stationary; otherwise the residuals will be I(1).
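A sketch of this residual-based co-integration procedure with statsmodels (assumed available; simulated data). Recall from above that standard DF critical values are not strictly valid for the residual test:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(5)
T = 400
x = np.cumsum(rng.normal(size=T))        # I(1) regressor
y = 2.0 * x + rng.normal(size=T)         # co-integrated with x

# Step 1: both series look I(1) in levels (large ADF p-values)
print(adfuller(y)[1], adfuller(x)[1])

# Step 2: co-integrating regression by OLS
res = sm.OLS(y, sm.add_constant(x)).fit()

# Step 3: test the residuals for stationarity (CRADF critical values apply)
print(adfuller(res.resid)[0])            # strongly negative test statistic
```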
VAR Model

The VAR model is, as the name suggests, an autoregression of a vector process. Consider the simplest example: a two-variable VAR model with lag of first order (VAR(1)),

\begin{pmatrix} y_{1t} \\ y_{2t} \end{pmatrix} =
\begin{pmatrix} \pi_{11} & \pi_{12} \\ \pi_{21} & \pi_{22} \end{pmatrix}
\begin{pmatrix} y_{1,t-1} \\ y_{2,t-1} \end{pmatrix} +
\begin{pmatrix} \epsilon_{1t} \\ \epsilon_{2t} \end{pmatrix}.

In general, Y_t = \sum_{i=1}^{p}\Phi_iY_{t-i} + \epsilon_t, and it can be reparameterized as

\Delta Y_t = \Pi Y_{t-1} + \sum_{i=1}^{p-1}\Gamma_i\Delta Y_{t-i} + \epsilon_t,

or

\Delta Y_t = \Pi_pY_{t-p} + \sum_{i=1}^{p-1}c_i\Delta Y_{t-i} + \epsilon_t.
VECM

Just like the scalar AR(p) model, the VAR(p) model can be reparameterised as the vector error correction model (VECM)

\Delta Y_t = \Pi_1Y_{t-1} + \sum_{i=1}^{p-1}\Gamma_i\Delta Y_{t-i} + \epsilon_t,

where \Pi_1 and the \Gamma_i are functions of the \Phi_j, so that

\Pi_1Y_{t-1} = \Delta Y_t - \sum_{i=1}^{p-1}\Gamma_i\Delta Y_{t-i} - \epsilon_t.

Note: the right-hand side is I(0), so \Pi_1Y_{t-1} must be I(0) as well, i.e. the rows of the matrix \Pi_1 are co-integrating vectors, and y_{1t} and y_{2t} co-integrate. The rank of \Pi_1 gives the number of linearly independent co-integrating vectors. Note that m = 2 here, so we cannot have more than one linearly independent co-integrating vector.
The result from the last slide can be generalized easily to higher-order VECMs. Consider the model as before, and suppose that Y_t = I(1).

The Johansen tests are based on the estimated eigenvalues \hat\lambda_1 > \hat\lambda_2 > ... of \Pi:

\lambda_{trace}(r) = -T\sum_{i=r+1}^{m}\ln(1 - \hat\lambda_i),

and

\lambda_{max}(r, r+1) = -T\ln(1 - \hat\lambda_{r+1}).
Co-integrating Vectors

The \lambda_{trace}(r) statistic tests the null that the number of co-integrating vectors is less than or equal to r against an unspecified alternative, while \lambda_{max}(r, r+1) tests the null that the number of co-integrating vectors is r against an alternative of r+1.

Write

\Pi_p = \alpha\beta',

where \alpha, \beta are m \times r full-rank matrices. Consider, for example, the case of m = 2. Then if y_{1t}, y_{2t} co-integrate, r = 1 and

\Pi_p = \alpha\beta' = \begin{pmatrix} \alpha_1 \\ \alpha_2 \end{pmatrix}\begin{pmatrix} \beta_1 & \beta_2 \end{pmatrix} = \begin{pmatrix} \alpha_1\beta_1 & \alpha_1\beta_2 \\ \alpha_2\beta_1 & \alpha_2\beta_2 \end{pmatrix}.
Let e_t = x_t - x_t^* denote the deviation from the long-run equilibrium value x_t^*. The ECM suggests that x_t changes over time to correct disequilibrium errors that occurred in the past, i.e.

\Delta x_t = -\pi e_{t-1},

where \pi is a speed-of-adjustment coefficient.
Spring 2013

Goals; Our Data; Variables

Definition (Endogenous Variable)

Variables that are specific to the phenomenon under study; they allow one to follow its evolution.

Definition (Exogenous Variable)

In order to explain the phenomenon, some variables may have an influence on the endogenous variables while their own values are fixed outside the phenomenon.
Linear System

A_0y_t + A_1y_{t-1} + ... + A_py_{t-p} + B_0x_t + B_1x_{t-1} + ... + B_px_{t-p} + b = 0,   (1)

where the A_j, j = 0, 1, ..., p, are n \times n matrices, the B_j are n \times m matrices, and b is an n \times 1 vector. A_0 is supposed to be nonsingular, so that the whole system allows for a unique determination of the current values of the endogenous variables.
The system

GDP_t = C_t + I_t + G_t,
C_t = a\,GDP_{t-1},
I_t = b(GDP_{t-1} - GDP_{t-2}),   (2)

can be stacked in the form (1) as

\begin{pmatrix} 1 & -1 & -1 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}
\begin{pmatrix} GDP_t \\ C_t \\ I_t \end{pmatrix}
+ \begin{pmatrix} 0 & 0 & 0 \\ -a & 0 & 0 \\ -b & 0 & 0 \end{pmatrix}
\begin{pmatrix} GDP_{t-1} \\ C_{t-1} \\ I_{t-1} \end{pmatrix}
+ \begin{pmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \\ b & 0 & 0 \end{pmatrix}
\begin{pmatrix} GDP_{t-2} \\ C_{t-2} \\ I_{t-2} \end{pmatrix}
+ \begin{pmatrix} -1 \\ 0 \\ 0 \end{pmatrix}G_t = 0.   (3)
Randomness: Dynamics and Disturbances

The dynamic model (3) is deterministic and does not reflect short-run disturbances. If the whole dynamics has been correctly included in the initial specification, as in (3), these disturbances should be independent. With random factors, we may re-write the model (2) as

TD_t = C_t + I_t + G_t,
GDP_t = TD_t,
C_t = a\,GDP_{t-1} + u_t,
I_t = b(GDP_{t-1} - GDP_{t-2}) + v_t.   (4)
More compactly,

A_0y_t + A_1y_{t-1} + A_2y_{t-2} + B_0x_t + B_1x_{t-1} + B_2x_{t-2} + b = \epsilon_t.   (5)

Definitions

Adding control (policy) variables z_t, the structural system becomes

A_0y_t + A_1y_{t-1} + ... + A_py_{t-p} + B_0x_t + B_1x_{t-1} + ... + B_px_{t-p} + C_0z_t + C_1z_{t-1} + ... + C_pz_{t-p} + b = \epsilon_t,

with the exogenous variables generated by

x_t + D_1x_{t-1} + ... + D_px_{t-p} + E_0z_t + E_1z_{t-1} + ... + E_pz_{t-p} + F_1y_{t-1} + ... + F_py_{t-p} + d = u_t.   (6)
Weak Exogeneity

The control variables can have an impact on the endogenous variables or the environment variables. However, they do not influence them directly (i.e. they do not alter the structural coefficients A_j, B_j, D_j, F_j). The x's are exogenous because the x_t's are fixed prior to the y_t's (F_0 = 0, and cov(u_t, \epsilon_t) = 0).   (7)

Example

In the Keynesian model, the government can alter G_t so as to influence the economy, e.g. maintain a constant level of expenditure,

G_t = G_{t-1}.   (8)

Completing the system with an equation for the control variables,

z_t + G_1z_{t-1} + ... + G_pz_{t-p} + H_1x_{t-1} + ... + H_px_{t-p} + I_1y_{t-1} + ... + I_py_{t-p} + g = v_t,   (9)

gives a recursive determination: of z, then of x, then of y. However, the policy maker may only assign the values that he wants to the coefficients G_j, H_j, I_j, whereas he does not have any influence on the other parameters of the model.   (10)
The Structural Form

A_0y_t + A_1y_{t-1} + ... + A_py_{t-p} + B_0x_t + B_1x_{t-1} + ... + B_px_{t-p} + b = \epsilon_t.   (11)

Simultaneity: rewriting,

y_t = b + (I - A_0)y_t - A_1y_{t-1} - ... - A_py_{t-p} - B_0x_t - B_1x_{t-1} - ... - B_px_{t-p} + \epsilon_t,   (12)

which shows that each endogenous variable generally depends on the current values of the others.

The Reduced Form

Definition: solving (11) for y_t,

y_t = -A_0^{-1}(A_1y_{t-1} + ... + A_py_{t-p} + B_0x_t + B_1x_{t-1} + ... + B_px_{t-p} + b) + A_0^{-1}\epsilon_t.   (13)
The Final Form; Causality

Definition (Causality)

y causes x at time t if

E(x_t | \underline{x}_{t-1}, \underline{y}_{t-1}) \ne E(x_t | \underline{x}_{t-1}),

and y causes x instantaneously at time t if

E(x_t | \underline{x}_{t-1}, \underline{y}_t) \ne E(x_t | \underline{x}_{t-1}, \underline{y}_{t-1}),

where \underline{x}_t denotes the information in current and past x.

Definition (Noncausality)

1. y does not cause x at time t iff

var(\varepsilon(x_t | \underline{x}_{t-1}, \underline{y}_{t-1})) = var(\varepsilon(x_t | \underline{x}_{t-1}));

2. y does not cause x instantaneously at time t iff

var(\varepsilon(x_t | \underline{x}_{t-1}, \underline{y}_t)) = var(\varepsilon(x_t | \underline{x}_{t-1}, \underline{y}_{t-1})).

Corollary (Symmetry)

The two following statements are equivalent:
1. y does not cause x instantaneously at time t;
2. x does not cause y instantaneously at time t.
Causality Reversal; Limits

The causal structure can be summarized through a lag-polynomial representation of the joint process,

\begin{pmatrix} x_t \\ y_t \end{pmatrix} = \begin{pmatrix} c_x \\ c_y \end{pmatrix} + \begin{pmatrix} \phi_{xx}(L) & \phi_{xy}(L) \\ \phi_{yx}(L) & \phi_{yy}(L) \end{pmatrix}\begin{pmatrix} x_{t-1} \\ y_{t-1} \end{pmatrix} + \text{noise},   (14)

in which y does not cause x iff \phi_{xy}(L) = 0.