
1. INTRODUCTION

1.1. MATRIX NOTATION AND SOME USEFUL RESULTS.
A $(p \times 1)$ vector $x$ will always be a column vector. That is,
$$
x = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_p \end{pmatrix}.
$$
The inner product of two vectors, denoted by $\langle \cdot , \cdot \rangle$, is defined as
$$
\langle x, y \rangle = \sum_{i=1}^{p} x_i y_i .
$$
A $p \times q$ matrix is written as
$$
A = (a_1, \ldots, a_q) = (a_{ij})_{i=1,\ldots,p;\; j=1,\ldots,q}.
$$
The transpose, denoted by $'$, is $A' = (a_{ji})_{i=1,\ldots,p;\; j=1,\ldots,q}$, and it satisfies
$$
(A + B)' = A' + B', \qquad (AB)' = B'A'.
$$
If the matrix $A$ is square, that is $p = q$, then if $|A| \neq 0$ the inverse $A^{-1}$ exists and is defined in such a way that $AA^{-1} = A^{-1}A = I_p$. Moreover,
$$
(AB)^{-1} = B^{-1}A^{-1}.
$$
Definition 1.1. Let $A$ be a $(p \times q)$ matrix. We define the rank as the maximum number of linearly independent columns (rows). If $|A| \neq 0$, then $\operatorname{rank}(A) = p$.
Definition 1.2. Let $A_{p \times p}$ be a matrix. A quadratic form is defined as
$$
x'Ax = \sum_{i,j=1}^{p} x_i a_{ij} x_j .
$$

Definition 1.3. We say that a quadratic form, i.e. $A$, is (a) positive definite if $x'Ax > 0$ for all $x \neq 0$; (b) positive semidefinite if $x'Ax \geq 0$ for all $x$ and $\exists x \neq 0$ such that $x'Ax = 0$.
One characterization of a positive definite matrix is that all the leading principal minors have determinants greater than zero.
Definition 1.4. We say that $A > B$ if $A - B > 0$, i.e. $A - B$ is positive definite. Also, if the inverses of $A$ and $B$ exist, then $A > B$ implies that $B^{-1} > A^{-1}$.
Definition 1.5. Kronecker's product of $A$ and $B$ is defined as $(A \otimes B) = (a_{ij} B)_{i=1,\ldots,p;\; j=1,\ldots,q}$.

Definition 1.6. The trace of a matrix is defined as $\operatorname{tr}(A) = \sum_{i=1}^{p} a_{ii}$.

Let $r(A) = q$. Then there exists a square nonsingular matrix $B$ such that
$$
BAB' = \begin{pmatrix} I_q & 0 \\ 0 & 0 \end{pmatrix}.
$$
That is, $A = B^{-1} \begin{pmatrix} I_q & 0 \\ 0 & 0 \end{pmatrix} (B')^{-1}$.

Definition 1.7. The eigenvalues of a (square) matrix $A$ are defined as the solutions $\lambda$ of the equation
$$
|A - \lambda I| = 0 .
$$
Consider $\Lambda = \operatorname{diag}(\lambda_1, \ldots, \lambda_p)$. Then $A = C \Lambda C'$, where $|C| \neq 0$.

Definition 1.8. We say that a matrix $A$ is idempotent if $A^2 = A$.
Some properties of idempotent matrices are:
- $\operatorname{rank}(A) = \operatorname{tr}(A)$.
- Its eigenvalues are zero or one.
Inverse of a Partitioned Matrix $A$.
Let
$$
A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}.
$$
Assume that $|A_{11}| \neq 0$ and denote $A_{22.1} = A_{22} - A_{21} A_{11}^{-1} A_{12}$. Then
$$
A^{-1} = \begin{pmatrix} A_{11}^{-1}\left(I_{p_1} + A_{12} A_{22.1}^{-1} A_{21} A_{11}^{-1}\right) & -A_{11}^{-1} A_{12} A_{22.1}^{-1} \\ -A_{22.1}^{-1} A_{21} A_{11}^{-1} & A_{22.1}^{-1} \end{pmatrix}.
$$
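As a quick numerical sanity check of the partitioned-inverse formula, here is a minimal sketch; the block sizes and the random matrix are my own illustrative choices, not part of the notes:

```python
import numpy as np

rng = np.random.default_rng(0)
p1, p2 = 2, 3
A = rng.normal(size=(p1 + p2, p1 + p2))
A = A @ A.T + np.eye(p1 + p2)          # make A symmetric positive definite

A11, A12 = A[:p1, :p1], A[:p1, p1:]
A21, A22 = A[p1:, :p1], A[p1:, p1:]

A11_inv = np.linalg.inv(A11)
A221 = A22 - A21 @ A11_inv @ A12       # A_{22.1}
A221_inv = np.linalg.inv(A221)

top_left = A11_inv @ (np.eye(p1) + A12 @ A221_inv @ A21 @ A11_inv)
top_right = -A11_inv @ A12 @ A221_inv
bottom_left = -A221_inv @ A21 @ A11_inv

A_inv_blocks = np.block([[top_left, top_right], [bottom_left, A221_inv]])
print(np.allclose(A_inv_blocks, np.linalg.inv(A)))   # True
```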

1.1.1. Matrix Differentiation.
- $\frac{\partial}{\partial x} x'Ay = Ay$.
- $\frac{\partial}{\partial x} x'Ax = (A + A')x$.
Let $A$ depend on a parameter $\theta$. Then, for each element of a $(p \times 1)$ vector $\theta$,
$$
\frac{\partial}{\partial \theta} \ln|A| = \operatorname{tr}\left(A^{-1}\frac{\partial A}{\partial \theta}\right).
$$
For $\theta$ scalar,
$$
\frac{\partial}{\partial \theta} A^{-1} = -A^{-1}\frac{\partial A}{\partial \theta}A^{-1},
$$
$$
\frac{\partial^2}{\partial \theta^2} \ln|A| = \operatorname{tr}\left(A^{-1}\frac{\partial^2 A}{\partial \theta^2} - A^{-1}\frac{\partial A}{\partial \theta}A^{-1}\frac{\partial A}{\partial \theta}\right),
$$
$$
\frac{\partial^2}{\partial \theta^2} A^{-1} = A^{-1}\left(2\,\frac{\partial A}{\partial \theta}A^{-1}\frac{\partial A}{\partial \theta} - \frac{\partial^2 A}{\partial \theta^2}\right)A^{-1}.
$$

1.1.2. Inequalities.
Cauchy-Schwarz:
$$
\left(\sum_{i=1}^{n} x_i y_i\right)^{2} \leq \left(\sum_{i=1}^{n} x_i^2\right)\left(\sum_{i=1}^{n} y_i^2\right).
$$
Hölder's inequality:
$$
\sum_{i=1}^{n} |x_i y_i| \leq \left(\sum_{i=1}^{n} |x_i|^p\right)^{1/p}\left(\sum_{i=1}^{n} |y_i|^q\right)^{1/q}, \quad \text{where } \frac{1}{p} + \frac{1}{q} = 1,
$$
with equality iff $y_i \propto x_i^{p-1}$ for all $i = 1, \ldots, n$.
More generally, for $\frac{1}{p} + \frac{1}{q} + \frac{1}{r} = 1$,
$$
\sum_{i=1}^{n} |x_i y_i z_i| \leq \left(\sum_{i=1}^{n} |x_i|^p\right)^{1/p}\left(\sum_{i=1}^{n} |y_i|^q\right)^{1/q}\left(\sum_{i=1}^{n} |z_i|^r\right)^{1/r}.
$$
Minkowski. For $x_i \geq 0$, $y_i \geq 0$ and $k \geq 1$,
$$
\left(\sum_{i=1}^{n} (x_i + y_i)^k\right)^{1/k} \leq \left(\sum_{i=1}^{n} x_i^k\right)^{1/k} + \left(\sum_{i=1}^{n} y_i^k\right)^{1/k}.
$$
Jensen's inequality. For any convex function $f(\cdot)$, that is $f(\lambda x + (1-\lambda)y) \leq \lambda f(x) + (1-\lambda)f(y)$ for $\lambda \in [0,1]$,
$$
f(E(x)) \leq E(f(x)).
$$

1.2. DISTRIBUTION OF QUADRATIC FORMS OF NORMAL RANDOM VARIABLES.
- Consider $(x_1, \ldots, x_p)' \sim N_p\left(\mu, \operatorname{diag}(\sigma_1^2, \ldots, \sigma_p^2)\right)$, where the symbol $\sim$ means "distributed as", and where $\mu = (\mu_1, \ldots, \mu_p)'$. Then
$$
Z = \sum_{i=1}^{p}\left(\frac{x_i - \mu_i}{\sigma_i}\right)^2 \sim \chi_p^2 ,
$$
i.e. a Chi-Square with $p$ degrees of freedom.
- Consider $Z_1$ and $Z_2$ two independent $\chi_p^2$ and $\chi_q^2$ r.v.'s respectively. Then
$$
F = \frac{Z_1/p}{Z_2/q} \sim F_{p,q},
$$
i.e. $F$ is distributed as an F-Snedecor with $p$ and $q$ degrees of freedom. When $p = 1$, we have that
$$
F^{1/2} = t = \frac{Z_1^{1/2}}{(Z_2/q)^{1/2}} \sim t_q ,
$$
i.e. a Student-t with $q$ degrees of freedom.
- Non-central $\chi^2$: Consider $X \sim N(\mu, \sigma^2)$. Then
$$
\frac{X^2}{\sigma^2} \sim \chi_1^2(\delta),
$$
where $\delta = \mu^2/\sigma^2$ is the non-centrality parameter. In general, let $(x_1, \ldots, x_p)' \sim N_p\left(\mu, \operatorname{diag}(\sigma_1^2, \ldots, \sigma_p^2)\right)$; then
$$
\sum_{i=1}^{p}\left(\frac{x_i}{\sigma_i}\right)^2 \sim \chi_p^2\left(\sum_{i=1}^{p}\frac{\mu_i^2}{\sigma_i^2}\right),
$$
where $\sum_{i=1}^{p}\mu_i^2/\sigma_i^2$ is the non-centrality parameter.
- Let $A$ be a symmetric matrix, and $X \sim N_q(0, I_q)$. Then
$$
X'AX \sim \chi_p^2 \iff A \text{ idempotent and } p = \operatorname{tr}(A) = \operatorname{rank}(A) \leq q .
$$
- If $X \sim N_q(0, \Omega)$, then
$$
X'AX \sim \chi_p^2 \iff A\Omega A = A \text{ and } p = \operatorname{rank}(A\Omega) \leq q .
$$
The proof of the last result is as follows. By the previous property,
$$
X'\Omega^{-1/2}\,\Omega^{1/2}A\Omega^{1/2}\,\Omega^{-1/2}X = Y'\,\Omega^{1/2}A\Omega^{1/2}\,Y
$$
where $Y \sim N_q(0, I_q)$. Therefore,
$$
\Omega^{1/2}A\Omega^{1/2} = \left(\Omega^{1/2}A\Omega^{1/2}\right)\left(\Omega^{1/2}A\Omega^{1/2}\right) \implies A\Omega A = A .
$$
- Consider $Y \sim N_p(\mu, \Omega)$. A necessary and sufficient condition for $P'Y$ and $(Y-\mu)'A(Y-\mu)$ to be independent is that
$$
A\Omega P = 0 .
$$
- Consider $Y \sim N_p(\mu, \Omega)$. A necessary and sufficient condition for $(Y-\mu)'A(Y-\mu)$ and $(Y-\mu)'B(Y-\mu)$ to be independent is that
$$
A\Omega B = 0 .
$$
- Finally, let $Q = Q_1 + Q_2$, where $Q \sim \chi_a^2$ and $Q_1 \sim \chi_b^2$ with $Q_2 \geq 0$. Then $Q_2 \sim \chi_{a-b}^2$.
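As an illustration of the result that $X'AX \sim \chi^2_{\operatorname{tr}(A)}$ when $A$ is idempotent and $X \sim N_q(0, I_q)$, here is a minimal Monte Carlo sketch; the particular projection matrix and simulation sizes are my own choices, not from the notes:

```python
import numpy as np

rng = np.random.default_rng(1)
q, reps = 5, 100_000

# Idempotent A: projection onto the column space of a random q x 2 matrix.
Z = rng.normal(size=(q, 2))
A = Z @ np.linalg.inv(Z.T @ Z) @ Z.T          # A^2 = A, tr(A) = rank(A) = 2

X = rng.normal(size=(reps, q))
quad = np.einsum('ij,jk,ik->i', X, A, X)      # X'AX for each replication

# Mean and variance of a chi^2_2 are 2 and 4; the simulated moments should be close.
print(quad.mean(), quad.var())
```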

2. CONCEPTS ON ASYMPTOTIC CONVERGENCE IN STATISTICS

In what follows, $\{X_n\}_{n\in\mathbb{N}}$ denotes a sequence of random variables (r.v.) and $X$ is also a r.v.
Definition 2.1. (Convergence in Probability) We say that $\{X_n\}_{n\in\mathbb{N}}$ converges in probability to $X$ if $\forall \varepsilon, \delta > 0$ there exists an $n_0$ such that for all $n \geq n_0$
$$
\Pr\{|X_n - X| > \delta\} < \varepsilon ,
$$
or equivalently $\lim_{n\to\infty}\Pr\{|X_n - X| > \delta\} = 0$.
This type of convergence will be denoted by $X_n \overset{p}{\to} X$ or $\operatorname{plim} X_n = X$.
Definition 2.2. (Convergence almost surely) We say that $\{X_n\}_{n\in\mathbb{N}}$ converges almost surely, or with probability one, to $X$ if $\forall \delta > 0$
$$
\lim_{n\to\infty}\Pr\{|X_m - X| < \delta, \ \forall m \geq n\} = 1 ,
$$
or equivalently $\Pr\{\lim_{n\to\infty}X_n = X\} = 1$.
This type of convergence is denoted as $X_n \overset{a.s.}{\to} X$.
Definition 2.3. (Convergence in r-th mean) We say that $\{X_n\}_{n\in\mathbb{N}}$ converges in r-th mean to $X$ if $\lim_{n\to\infty}E|X_n - X|^r = 0$.
This type of convergence will be denoted by $X_n \overset{r\text{-th}}{\to} X$.
Remark 2.1. For convergence in r-th mean, "r-th" in what follows, we need that $E|X_n|^r < \infty$ and $E|X|^r < \infty$.
Example 2.1. Let $\{X_n\}_{n\in\mathbb{N}}$ be such that
$$
X_n = \begin{cases} 1 & \text{with probability } 1 - \tfrac{1}{n}, \\ 1 + n & \text{with probability } \tfrac{1}{n}. \end{cases}
$$
Then $X_n \overset{p}{\to} 1$. Indeed, $\Pr\{|X_n - 1| > \delta\} = \Pr\{X_n = 1 + n\} = 1/n \to 0$.
Example 2.2. Let $\{X_n\}_{n\in\mathbb{N}}$ be defined as
$$
X_n = \begin{cases} 0 & \text{with probability } 1 - \tfrac{1}{n^2}, \\ n & \text{with probability } \tfrac{1}{n^2}. \end{cases}
$$
Then $X_n \overset{a.s.}{\to} 0$. Indeed,
$$
\Pr\{|X_m| < \delta, \text{ all } m \geq n\} = 1 - \Pr\{|X_m| \geq \delta \text{ for some } m \geq n\}.
$$
So, it suffices to prove that the second term on the right of the last displayed expression converges to zero. But that term is
$$
\Pr\left\{\bigcup_{m \geq n}\{|X_m| \geq \delta\}\right\} \leq \sum_{m=n}^{\infty}\Pr\{|X_m| \geq \delta\} = \sum_{m=n}^{\infty}\frac{1}{m^2} < \varepsilon \quad \text{for $n$ large enough,}
$$
because $\sum_{m=1}^{\infty}m^{-2} < \infty$. From here we conclude.
The last type of convergence is Convergence in Distribution.

Definition 2.4. We say that $\{X_n\}_{n\in\mathbb{N}}$ converges in distribution to $X$ if
$$
\lim_{n\to\infty}F_n(t) = F(t)
$$
$\forall t$ continuity point of $F(t)$, where $F_n(t)$ and $F(t)$ are the probability distribution functions of $X_n$ and $X$ respectively.
This type of convergence will be denoted by $X_n \overset{d}{\to} X$.
The relationship among the four types of convergence is as follows:
$$
\text{a.s.} \implies \text{PROBABILITY} \implies \text{DISTRIBUTION}, \qquad r\text{-th} \implies \text{PROBABILITY},
$$
and DISTRIBUTION $\implies$ PROBABILITY if the limit is a constant, e.g. a degenerate r.v.
Remark 2.2. The reason why PROBABILITY does not imply a.s. and PROBABILITY does not imply r-th is that for convergence in r-th mean we need $E|X_n|^r < \infty$ and $E|X|^r < \infty$, whereas in probability the moments may not even exist.
Theorem 2.1. (Markov's inequality) Let $X$ be a r.v. with finite r-th moment. Then,
$$
\Pr\{|X| > \varepsilon\} \leq \frac{E|X|^r}{\varepsilon^r}.
$$
Proof. Let $I(\cdot)$ be the indicator function. Then,
$$
\Pr\{|X| > \varepsilon\} = E\{I(|X| > \varepsilon)\} \leq E\left\{\frac{|X|^r}{\varepsilon^r}I(|X| > \varepsilon)\right\} = \frac{1}{\varepsilon^r}E\{|X|^r I(|X| > \varepsilon)\} \leq \frac{E|X|^r}{\varepsilon^r}.
$$
Remark 2.3. When $r = 2$, Markov's inequality is known as Chebyshev's inequality.
Theorem 2.2. $X_n \overset{r\text{-th}}{\to} X \implies X_n \overset{p}{\to} X$.
Proof. By Theorem 2.1,
$$
\Pr\{|X_n - X| > \varepsilon\} \leq \frac{E|X_n - X|^r}{\varepsilon^r}.
$$
Now conclude because $X_n \overset{r\text{-th}}{\to} X$.
The next question is the following. Suppose that our interest is not whether the sequence $\{X_n\}_{n\in\mathbb{N}}$ converges to $X$, but whether $g(X_n)$ converges to $g(X)$. For instance, we may be interested in $X_n^2$. Then, what can we say? The next theorem, which plays a key role, answers this question.
Theorem 2.3. (Slutzky) Let $g(\cdot)$ be a continuous function and $X_n \overset{p}{\to} X$. Then $g(X_n) \overset{p}{\to} g(X)$.
Proof. For any arbitrary $\varepsilon > 0$ we can choose a compact set $S$ such that
$$
\Pr\{X \notin S\} \leq \frac{\varepsilon}{2}.
$$
Because $g(\cdot)$ is continuous, it is uniformly continuous on $S$. Thus, for any $\eta > 0$ there exists a $\delta > 0$ independent of $x$ such that
$$
\forall x \in S, \quad \|x - y\| < \delta \implies |g(x) - g(y)| \leq \eta .
$$
Also, we know that $X_n \overset{p}{\to} X$, that is, for all $n \geq n_0$,
$$
\Pr\{|X_n - X| > \delta\} < \frac{\varepsilon}{2} \quad \text{or} \quad \Pr\{|X_n - X| \leq \delta\} \geq 1 - \frac{\varepsilon}{2}.
$$
Thus,
$$
\Pr\{|X_n - X| \leq \delta\} = \Pr\{|X_n - X| \leq \delta, \ X \in S\} + \Pr\{|X_n - X| \leq \delta, \ X \notin S\} \leq \Pr\{|X_n - X| \leq \delta, \ X \in S\} + \frac{\varepsilon}{2},
$$
which implies that
$$
1 - \frac{\varepsilon}{2} \leq \Pr\{|X_n - X| \leq \delta\} \leq \Pr\{|X_n - X| \leq \delta, \ X \in S\} + \frac{\varepsilon}{2}
$$
and hence that $1 - \varepsilon \leq \Pr\{|X_n - X| \leq \delta, \ X \in S\}$.
But by continuity of $g(\cdot)$,
$$
\Pr\{|X_n - X| \leq \delta, \ X \in S\} \leq \Pr\{|g(X_n) - g(X)| \leq \eta, \ X \in S\} \leq \Pr\{|g(X_n) - g(X)| \leq \eta\}
$$
and hence
$$
1 - \varepsilon \leq \Pr\{|g(X_n) - g(X)| \leq \eta\}.
$$
This concludes the proof.
Example 2.3. $X_n \overset{p}{\to} c \neq 0 \implies 1/X_n \overset{p}{\to} 1/c$.
Let $\{X_n\}_{n\in\mathbb{N}}$ be a sequence of $(k \times 1)$ vectors of r.v.'s.
Definition 2.5. We say that $\{X_n\}_{n\in\mathbb{N}}$ converges in probability to $X$, a $(k \times 1)$ vector, if $\forall i = 1, 2, \ldots, k$, $X_{ni} \overset{p}{\to} X_i$.
2.1. WEAK LAW OF LARGE NUMBERS.
Theorem 2.4. (Khintchine) Let $\{X_i\}_{i\in\mathbb{N}}$ be a sequence of iid random variables such that $EX_i = \mu$. Then,
$$
\frac{1}{n}\sum_{i=1}^{n}X_i \overset{p}{\to} \mu .
$$
Theorem 2.4 is known as the weak law of large numbers (WLLN).
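A minimal simulation sketch of the WLLN; the distribution and sample sizes are illustrative choices of mine:

```python
import numpy as np

rng = np.random.default_rng(2)
mu = 0.3
for n in (10, 100, 10_000):
    x = rng.exponential(scale=mu, size=n)   # iid draws with E X_i = 0.3
    print(n, x.mean())                      # the sample mean approaches 0.3 as n grows
```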


Theorem 2.5. (Chebyshev) Let $\{X_i\}_{i\in\mathbb{N}}$ be a sequence of uncorrelated r.v.'s where $EX_i = \mu_i$ and $\operatorname{Var}X_i = \sigma_i^2$. If $n^{-2}\sum_{i=1}^{n}\sigma_i^2 \to 0$ as $n \to \infty$, then
$$
\frac{1}{n}\sum_{i=1}^{n}X_i - \frac{1}{n}\sum_{i=1}^{n}\mu_i \overset{p}{\to} 0 .
$$

Proof. We have to show that
$$
\frac{1}{n}\sum_{i=1}^{n}(X_i - \mu_i) = \frac{1}{n}\sum_{i=1}^{n}\widetilde{Y}_i \overset{p}{\to} 0 .
$$
Denote by $\overline{Y}_n$ the left side of the last displayed expression. Then, by Theorem 2.1 (see also Theorem 2.2), it suffices to show that $E\left(\frac{1}{n}\sum_{i=1}^{n}\widetilde{Y}_i\right)^2 \to 0$. But because the r.v.'s $X_i$ are uncorrelated,
$$
E\left(\frac{1}{n}\sum_{i=1}^{n}\widetilde{Y}_i\right)^2 = \frac{1}{n^2}\sum_{i=1}^{n}\sigma_i^2 \to 0 .
$$
This concludes the proof.
Remark 2.4. If $\mu_i = \mu$ for all $i$, then $\frac{1}{n}\sum_{i=1}^{n}X_i \overset{p}{\to} \mu$.
Theorem 2.6. Let $\{X_i\}_{i\in\mathbb{Z}}$ be a sequence of r.v.'s where $EX_i = \mu_i$, $\operatorname{Var}X_i = \sigma_i^2$ and $\operatorname{Cov}(X_i, X_j) = \rho_{|i-j|}\,\sigma_i\sigma_j$. If $n^{-2}\sum_{i=1}^{n}\sigma_i^2 \to 0$ as $n \to \infty$ and $\sum_{i=1}^{\infty}|\rho_i| < \infty$, then
$$
\frac{1}{n}\sum_{i=1}^{n}X_i - \frac{1}{n}\sum_{i=1}^{n}\mu_i \overset{p}{\to} 0 .
$$

Remark 2.5. From Theorem 2.5 we can observe that there is a trade-off between the heterogeneity of the sequence of r.v.'s $\{X_i\}$ and the moment conditions on the sequence. In particular, we observe that in Theorem 2.4 we only need the first moment to be finite, whereas in Theorem 2.5 we need second moments, although $\sigma_i^2$ may increase to infinity, but not too quickly.

Theorem 2.7. Assume that $X_n \overset{p}{\to} 0$ and $Y_n \overset{d}{\to} Y$. Then, $X_n + Y_n \overset{d}{\to} Y$.
2.2. CENTRAL LIMIT THEOREMS.
Central limit theorems (CLT) deal with the convergence in distribution of (normalized) sums of r.v.'s, for instance $\sum_{i=1}^{n}X_i$.
Theorem 2.8. (Lindeberg-Levy) Let $\{X_i\}_{i\in\mathbb{N}}$ be iid r.v.'s with mean $\mu$ and finite variance $\sigma^2$. Then,
$$
\frac{1}{n^{1/2}}\sum_{i=1}^{n}\frac{X_i - \mu}{\sigma} \overset{d}{\to} N(0, 1).
$$
Sometimes we will write
$$
\frac{1}{n}\sum_{i=1}^{n}X_i \sim AN\left(\mu, \frac{\sigma^2}{n}\right),
$$
e.g. asymptotically normal with mean $\mu$ and variance $\sigma^2/n$.
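A minimal sketch illustrating the Lindeberg-Levy CLT by simulation; the distribution and the simulation sizes are my own illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps = 500, 20_000
mu, sigma = 1.0, 1.0                              # mean and sd of an Exponential(1)

x = rng.exponential(scale=1.0, size=(reps, n))
z = np.sqrt(n) * (x.mean(axis=1) - mu) / sigma    # standardized sample means

print(z.mean(), z.var())                          # approximately 0 and 1
print(np.mean(z <= 1.96))                         # approximately Phi(1.96) = 0.975
```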

Theorem 2.9. (Lindeberg-Feller) Let $\{X_i\}_{i\in\mathbb{N}}$ be independent r.v.'s such that $EX_i = \mu_i$, $\operatorname{Var}(X_i) = \sigma_i^2$, and let $F_i$ be the distribution function of $X_i$. Assume that
$$
B_n^2 = \sum_{i=1}^{n}\sigma_i^2
$$
satisfies (a) $\max_{j\leq n}\sigma_j^2/B_n^2 \to 0$ and (b) $B_n^2 \to \infty$. Then
$$
(1)\qquad \frac{1}{n}\sum_{i=1}^{n}X_i \sim AN\left(\frac{1}{n}\sum_{i=1}^{n}\mu_i, \ \frac{B_n^2}{n^2}\right)
$$
if and only if Lindeberg's condition holds, that is,
$$
(L)\qquad \frac{1}{B_n^2}\sum_{i=1}^{n}\int_{|z-\mu_i| > \delta B_n}|z-\mu_i|^2\,dF_i(z) \underset{n\to\infty}{\to} 0 \qquad \forall \delta > 0 .
$$

Corollary 2.10. (Lyapunov) Let $\{X_i\}_{i\in\mathbb{N}}$ be as in Theorem 2.9. Suppose that for some $r > 2$ and all $\varepsilon > 0$ arbitrarily small (for $n$ large enough),
$$
\frac{1}{B_n^{r}}\sum_{i=1}^{n}E|X_i - \mu_i|^{r} < \varepsilon .
$$
Then, we have the conclusion in (1).

Proof. Let us show condition (L) first. By standard algebra,
$$
\int_{|z-\mu_i| > \delta B_n}|z-\mu_i|^2\,dF_i(z) \leq (\delta B_n)^{2-r}\int_{|z-\mu_i| > \delta B_n}|z-\mu_i|^{r}\,dF_i(z) \leq (\delta B_n)^{2-r}E|X_i - \mu_i|^{r}.
$$
Then,
$$
\frac{1}{B_n^2}\sum_{i=1}^{n}\int_{|z-\mu_i| > \delta B_n}|z-\mu_i|^2\,dF_i(z) \leq \delta^{2-r}B_n^{-r}\sum_{i=1}^{n}E|X_i - \mu_i|^{r} \leq \delta^{2-r}\varepsilon \ll 1 .
$$

Next, we shall prove that (L) implies that
$$
\max_{i\leq n}\frac{\sigma_i^2}{B_n^2} \to 0 .
$$
Indeed, for all $1 \leq i \leq n$, we have that
$$
\sigma_i^2 \leq \int_{|z-\mu_i| > \delta B_n}|z-\mu_i|^2\,dF_i(z) + \delta^2 B_n^2
$$
and so
$$
\max_{j\leq n}\sigma_j^2 \leq \sum_{i=1}^{n}\int_{|t-\mu_i| > \delta B_n}|t-\mu_i|^2\,dF_i(t) + \delta^2 B_n^2 ,
$$
which implies that
$$
\max_{i\leq n}\frac{\sigma_i^2}{B_n^2} \leq (L) + \delta^2 \to 0 ,
$$
since $\delta$ is arbitrarily small. But the last displayed inequality also implies that $B_n^2 \to \infty$. Thus conditions (a) and (b) of the Lindeberg-Feller CLT are satisfied, which concludes the proof.
Theorem 2.11. Let $\{X_n\}_{n\in\mathbb{N}}$ converge in distribution to $X$. Consider a continuous function $g(\cdot)$ on the domain of the r.v. $X$. Then,
$$
g(X_n) \overset{d}{\to} g(X).
$$
The previous theorem is a powerful and useful one which will ease the study of the asymptotic behaviour of many statistics of interest in econometrics. In some sense it is the analogue of Theorem 2.3, but for convergence in distribution.
Example 2.4. Let $\{X_i\}_{i\in\mathbb{N}}$ be a sequence of iid r.v.'s. Then, by Theorem 2.8, we know that
$$
Z_n = \frac{1}{n^{1/2}}\sum_{i=1}^{n}\frac{X_i - \mu}{\sigma} \overset{d}{\to} N(0, 1),
$$
and then we conclude that $Z_n^2 \overset{d}{\to} \chi_1^2$.

The next theorem tells us that all we need is to show a CLT for scalar sequences of r.v.'s.
Theorem 2.12. (Cramer-Wold) Let $\{X_n\}_{n\in\mathbb{N}}$ be a sequence of $(k \times 1)$ vectors of r.v.'s. Then,
$$
X_n \overset{d}{\to} X \iff \lambda'X_n \overset{d}{\to} \lambda'X \quad \forall \lambda \neq 0 .
$$
Theorem 2.13. (Cramer) Let $\{X_n\}_{n\in\mathbb{N}}$ be a sequence of $(k \times 1)$ vectors of r.v.'s such that $X_n = A_nZ_n$. Suppose that $A_n \overset{p}{\to} A$ (p.s.d.) and $Z_n \overset{d}{\to} N(\mu, \Omega)$; then $A_nZ_n \overset{d}{\to} N(A\mu, A\Omega A')$.

3. THE LINEAR REGRESSION MODEL

One of the aims of econometrics is, based on a set of data, say $\{z_t\}_{t=1}^{T}$, to be able to make inferences about its characteristics, how these data move together, or to find causal relationships among them. In particular, if it is known that the data $\{z_t\}$ follow a known pdf up to a set of parameters $\theta_0$, we may be interested in what the true values of $\theta_0$ are. To that end we need to estimate them and then test for their hypothetical values.
To fix ideas, let us assume that the data $\{z_t\}_{t=1}^{T}$ are distributed according to $f(z_t; \theta_0)$. Then, we wish to make inferences about $\theta_0$. Sometimes the pdf $f(\cdot)$ is not known, or perhaps all we hope to know are certain characteristics, say its mean $E(z_t)$. In other situations we can split the data into two groups, say $z_t = (y_t', x_t')'$, such that
$$
f(z_t; \theta_0) = f_1(y_t \mid x_t; \theta_0)\,f_2(x_t; \theta_0) = f_1(y_t \mid x_t; \theta_{01})\,f_2(x_t; \theta_{02}),
$$
our interest being in $\theta_{01}$. For instance, the conditional distribution function of $y_t$ given $x_t$ might be
$$
f_1(y_t \mid x_t = x; \theta_{01}) \equiv N\left(\mu(x; \theta_{01}),\ \sigma^2(x; \theta_{01})\right),
$$
and we want to know $\theta_{01}$, or, less ambitiously, we hope to know only some characteristics of $y_t$ given $x_t$, like its conditional mean or variance.
Suppose that $y_t$ stands for consumption and $x_t$ for income. Then we wish to know, say, by how much $y_t$ will increase if $x_t$ increases by 1%. To answer this question what we need is $E[y_t \mid x_t]$. The starting point is to assume a particular functional form. For instance, if $x_t = (x_{t1}, \ldots, x_{tk})'$, then
$$
E[y_t \mid x_t] = \beta'x_t ,
$$
where $\beta = (\beta_1, \ldots, \beta_k)'$. We write the latter relation as
$$
(2)\qquad y_t = \beta'x_t + u_t ,
$$
where, by construction, $E[u_t \mid x_t] = 0$.
The variable $y_t$ is called endogenous, whereas the $x_t$ are called exogenous variables, although we shall refer to them as regressors. $u_t$ is known as the error term.
Remark 3.1. If $x_t$ is deterministic, the conditional expectation becomes an unconditional one, e.g. $E[y_t] = \beta'x_t$.
Notation: Equation (2) will be written in matrix notation as
$$
(3)\qquad Y = X\beta + U ,
$$
where
$$
Y = \begin{pmatrix} y_1 \\ \vdots \\ y_T \end{pmatrix}_{(T\times 1)}, \quad
X = \begin{pmatrix} x_{11} & \cdots & x_{1k} \\ \vdots & & \vdots \\ x_{T1} & \cdots & x_{Tk} \end{pmatrix}_{(T\times k)}, \quad
\beta = \begin{pmatrix} \beta_1 \\ \vdots \\ \beta_k \end{pmatrix}_{(k\times 1)}, \quad
U = \begin{pmatrix} u_1 \\ \vdots \\ u_T \end{pmatrix}_{(T\times 1)}.
$$

Once the model has been set up, the question is how we can make inferences on $\beta$ based on the data $z_t = (y_t, x_t')'$. The general way to do it is by computing a function of $\beta$ and $z_t$, called the Objective Function, and seeing which value of $\beta$ minimizes such a function. That is,
$$
(4)\qquad \widehat{\beta} = \arg\min_{\beta}Q(\beta; z) = \arg\min_{\beta}Q(\beta).
$$
One obvious function will be the sum of squares
$$
(5)\qquad Q(\beta) = \sum_{t=1}^{T}\left(y_t - \beta'x_t\right)^2 .
$$
Definition 3.1. The LEAST SQUARES ESTIMATOR (LSE) is the statistic $\widehat{\beta}$ defined in (4), with $Q(\beta)$ given in (5).
3.1. LEAST SQUARES: ALGEBRAIC RESULTS.
How do we obtain the estimator of $\beta$? Using (3), write
$$
Q(\beta) = (Y - X\beta)'(Y - X\beta).
$$
The first-order conditions (FOC) are
$$
\frac{\partial Q(\beta)}{\partial\beta}\Big|_{\widehat{\beta}} = 0 = -2X'Y + 2X'X\widehat{\beta},
$$
and, assuming that $X'X$ has full column rank, then
$$
(6)\qquad \widehat{\beta} = (X'X)^{-1}X'Y .
$$
That $\widehat{\beta}$ is a minimum comes from the observation that
$$
\frac{\partial^2 Q(\beta)}{\partial\beta\,\partial\beta'} = 2X'X \geq 0 .
$$
The least squares residuals are defined by
$$
\widehat{u}_t = y_t - \widehat{\beta}'x_t , \quad t = 1, 2, \ldots, T .
$$
An important property of the residuals is that they are orthogonal to the matrix $X$; that is,
$$
X'\widehat{U} = 0 , \qquad \text{where } \widehat{U} = \{\widehat{u}_t\}_{t=1}^{T}.
$$
Remark 3.2. Note that we are not saying that $X$ and $U$ are orthogonal, but $\widehat{U}$ and $X$. Indeed, by definition $\widehat{U} = Y - X\widehat{\beta}$, so that
$$
X'\widehat{U} = X'Y - X'X\widehat{\beta} = 0 .
$$
Remark 3.3. If one of the regressors is a constant, then $\sum_{i=1}^{T}\widehat{u}_i = 0$. Indeed, because $X'\widehat{U} = 0$, for the regressor $1$ we have $1'\widehat{U} = \sum_{i=1}^{T}\widehat{u}_i = 0$.
Definition 3.2. (Coefficient of multiple correlation) Assume that the model has a constant. Then, the coefficient of multiple correlation $R^2$ is defined as
$$
0 \leq R^2 = \frac{\sum_{t=1}^{T}(\widehat{y}_t - \overline{y})^2}{\sum_{t=1}^{T}(y_t - \overline{y})^2} \leq 1 , \qquad \widehat{y}_t = \widehat{\beta}'x_t .
$$
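A minimal numerical sketch of the LSE in (6) and of the residual identities above; the simulated data and all variable names are my own illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(4)
T, k = 200, 3
X = np.column_stack([np.ones(T), rng.normal(size=(T, k - 1))])  # first regressor is a constant
beta = np.array([1.0, 2.0, -0.5])
y = X @ beta + rng.normal(size=T)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)    # (X'X)^{-1} X'Y
u_hat = y - X @ beta_hat

print(beta_hat)
print(np.allclose(X.T @ u_hat, 0.0))            # residuals orthogonal to X

y_fit = X @ beta_hat
R2 = np.sum((y_fit - y.mean())**2) / np.sum((y - y.mean())**2)
print(R2)
```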

Before we study the statistical behaviour of $\widehat{\beta}$, it is worth describing some properties of $\widehat{u}_t$. We have already said that $X'\widehat{U} = 0$. This appears natural, why? Recall that $E(u_t \mid x_t) = 0$, that is $Eu_t = 0$, and, by the law of iterated expectations,
$$
E(x_tu_t) = E_x\left(E(x_tu_t \mid x_t)\right) = E\left(x_tE(u_t \mid x_t)\right) = 0 .
$$
If you pretend not to know $E(x_tu_t)$ and you want to estimate the expectation, the obvious estimator is the sample analogue. Thus, we estimate $E(x_tu_t)$ and $Eu_t$ by
$$
\frac{1}{T}\sum_{t=1}^{T}x_tu_t \quad \text{and} \quad \frac{1}{T}\sum_{t=1}^{T}u_t .
$$
However, we do not observe $u_t$, only $y_t$ and $x_t$, so what we do is to choose the value of $\beta$ such that
$$
\frac{1}{T}\sum_{t=1}^{T}x_t\left(y_t - \beta'x_t\right) = 0 ,
$$
as the true population moment is $0$. Hence, the LSE tries to find the value of $\beta$ which matches the sample and population moments.
3.2. STATISTICAL PROPERTIES OF THE LSE.
Introduce the following assumptions:
A1: $X$ is deterministic.
A2: $\operatorname{rank}(X) = k$, i.e. the dimension of $\beta$.
A3: (i) $Eu_i = 0$ $\forall i$; (ii) $Eu_i^2 = \sigma^2$ $\forall i$; (iii) $Eu_iu_j = 0$ $\forall i \neq j$.
Proposition 3.1. Under A1, A2 and A3(i), $\widehat{\beta}$ is unbiased.
Proof. By definition, $\widehat{\beta} = (X'X)^{-1}X'Y$. Then,
$$
E\widehat{\beta} = (X'X)^{-1}X'EY = (X'X)^{-1}X'(X\beta + EU) = \beta .
$$
Proposition 3.2. Under A1-A3, $\operatorname{Cov}(\widehat{\beta}) = \sigma^2(X'X)^{-1}$.
Proof. By definition,
$$
\operatorname{Cov}(\widehat{\beta}) = E\left(\widehat{\beta} - \beta\right)\left(\widehat{\beta} - \beta\right)'
= E\left[(X'X)^{-1}X'UU'X(X'X)^{-1}\right]
= (X'X)^{-1}X'E(UU')X(X'X)^{-1}
= (X'X)^{-1}X'\sigma^2 I X(X'X)^{-1}
= \sigma^2(X'X)^{-1}.
$$
Thus, Propositions 3.1 and 3.2 tell us that the LSE is unbiased for $\beta$ with a variance-covariance matrix equal to $\sigma^2(X'X)^{-1}$.
Suppose now that we have another unbiased linear estimator, say $\widetilde{\beta} = CY$. Then, how good is $\widetilde{\beta}$ compared to $\widehat{\beta}$?

Theorem 3.3. (Gauss-Markov) Let $\widetilde{\beta} = CY$ be an unbiased estimator of $\beta$. Then $\operatorname{Cov}(\widetilde{\beta}) \geq \operatorname{Cov}(\widehat{\beta})$, i.e. $\operatorname{Cov}(\widetilde{\beta}) - \operatorname{Cov}(\widehat{\beta}) \geq 0$.
Proof. Because $E\widetilde{\beta} = \beta$, it implies that
$$
E\widetilde{\beta} = E\left(C(X\beta + U)\right) = CX\beta .
$$
So $C$ satisfies $CX = I$. Write $C = (X'X)^{-1}X' + D$, so $DX = 0$. Next,
$$
E\left(\widetilde{\beta} - \beta\right)\left(\widetilde{\beta} - \beta\right)' = E\left(CUU'C'\right) = \sigma^2 CC'.
$$
But, since $DX = 0$,
$$
CC' = \left((X'X)^{-1}X' + D\right)\left((X'X)^{-1}X' + D\right)' = (X'X)^{-1} + DD'.
$$
From here we conclude, as $DD' \geq 0$.
Theorem 3.3 tells us that the LSE is efficient among all linear unbiased estimators of $\beta$. We will refer to the LSE as BLUE (best linear unbiased estimator).
So far, we have seen that the LSE is BLUE. But one of the basic requirements of any estimator is to be consistent, e.g. $\widehat{\beta}$ converges in probability to $\beta$. To see that this is the case, let us introduce
A4:
$$
\lim_{T\to\infty}\frac{X'X}{T} = Q > 0 .
$$
Theorem 3.4. Under A1-A4, $\operatorname{plim}\widehat{\beta} = \beta$.
Proof. We shall prove that $\operatorname{plim}(\widehat{\beta} - \beta) = 0$. By definition,
$$
(7)\qquad \widehat{\beta} - \beta = \left(\frac{X'X}{T}\right)^{-1}\frac{X'U}{T}.
$$
A4 implies that the first factor on the right of (7) converges to $Q^{-1}$, because $Q > 0$. Next, consider the second factor on the right of (7). Because of A3(i), the first moment of the second factor of (7) is zero (recall $x_t$ is deterministic), whereas A3(ii, iii) imply that the second moment is
$$
E\left[\left(\frac{1}{T}\sum_{t=1}^{T}x_tu_t\right)\left(\frac{1}{T}\sum_{t=1}^{T}x_tu_t\right)'\right] = \frac{1}{T^2}\sum_{t=1}^{T}x_tx_t'E\left(u_t^2\right) = \frac{\sigma^2}{T}\left(\frac{1}{T}\sum_{t=1}^{T}x_tx_t'\right) \to 0
$$
by A4. Thus, by Theorem 2.2, the second factor on the right of (7) converges to zero in probability. But, because the product is a continuous function, by Theorem 2.3 we conclude.
Remark 3.4. Assumption A4 is sufficient but not necessary.

Estimator of $\sigma^2$.
Up to now we have focused on $\widehat{\beta}$. But the regression model (2) also involves $\sigma^2$. The question is how we can estimate it and what its statistical properties are. If $u_t$ were observed, an obvious estimator would be
$$
\frac{1}{T}\sum_{i=1}^{T}u_i^2 .
$$
Because $u_i$ is not observed, we replace it by $\widehat{u}_i^2$, but why? We know that $\widehat{\beta}$ has good statistical properties, so we can expect $\widehat{\beta}$ to be close to $\beta$, and hence $\widehat{u}_i = y_i - \widehat{\beta}'x_i$ to be close to $u_i$, suggesting the use of
$$
(8)\qquad \widehat{\sigma}^2 = \frac{1}{T}\sum_{i=1}^{T}\widehat{u}_i^2 .
$$
Before examining the statistical properties of $\widehat{\sigma}^2$, we introduce the projection matrix $M = I - X(X'X)^{-1}X'$. Some properties of $M$ are:
(1) $M$ is symmetric, that is $M = M'$.
(2) $M$ is idempotent, that is $M = M^2$.
(3) $MX = 0$, e.g. it is orthogonal to the regressors.
(4) $\widehat{U} = MU = MY$.
Remark 3.5. $M$ is called a projection matrix because it projects $Y$ onto the space orthogonal to that spanned by the $X_i$, $S^{\perp}(X)$, whereas $X(X'X)^{-1}X'$ projects onto $S(X)$, the space spanned by the $X_i$.
Lemma 3.5. Under A1-A3, $E\left(\widehat{U}'\widehat{U}\right) = (T-k)\sigma^2$.
Proof. By definition, $E\left(\widehat{U}'\widehat{U}\right)$ is
$$
E\left(U'MU\right) = E\operatorname{tr}\left(U'MU\right) = E\operatorname{tr}\left(MUU'\right) = \operatorname{tr}\left(ME(UU')\right) = \sigma^2\operatorname{tr}M .
$$
Now we conclude, since $\operatorname{tr}M = \operatorname{tr}(I_T) - \operatorname{tr}\left((X'X)^{-1}X'X\right) = T - k$.
From Lemma 3.5, we conclude that an unbiased estimator of $\sigma^2$ is
$$
(9)\qquad \widetilde{\sigma}^2 = \frac{1}{T-k}\sum_{i=1}^{T}\widehat{u}_i^2 .
$$
Moreover, under some regularity conditions, both $\widetilde{\sigma}^2$ and $\widehat{\sigma}^2$ are consistent for $\sigma^2$.
So, $\widehat{\beta}$ and $\widehat{\sigma}^2$ are consistent, but if we wish to make inferences about possible value(s) of the parameters, what we need is to discuss their (asymptotic) distribution.
3.2.1. DISTRIBUTION OF $\widehat{\beta}$ AND $\widehat{\sigma}^2$.
Let us introduce the following assumption:
A5: Assume that $u_i \sim N(0, \sigma^2)$, for all $i = 1, \ldots, T$.

Theorem 3.6. Assuming A1 to A3 and A5, we have that
$$
(10)\qquad \text{(a)}\quad (X'X)^{1/2}\left(\widehat{\beta} - \beta\right) \sim N\left(0, \sigma^2 I_k\right),
$$
$$
(11)\qquad \text{(b)}\quad T\,\frac{\widehat{\sigma}^2}{\sigma^2} \sim \chi^2_{T-k}.
$$
Moreover, $\widehat{\beta}$ and $\widehat{\sigma}^2$ are independent.
Proof. By the definition of $\widehat{\beta}$, the left side of (10) is
$$
(X'X)^{-1/2}X'U = (X'X)^{-1/2}\sum_{i=1}^{T}x_iu_i ,
$$
which is a linear combination of normal random variables by A5. Thus,
$$
(X'X)^{-1/2}X'U \sim N\left(0, \sigma^2(X'X)^{-1/2}X'IX(X'X)^{-1/2}\right) \equiv N\left(0, \sigma^2 I_k\right),
$$
which completes the proof of part (a).
With regard to (b), observe that the left side of (11) is
$$
T\,\frac{\widehat{\sigma}^2}{\sigma^2} = \frac{U'}{\sigma}M\frac{U}{\sigma}.
$$
Then, because A5 implies that $\sigma^{-1}U \sim N(0, I_T)$, we conclude that the right side follows a $\chi^2_{T-k}$ r.v., because $\operatorname{tr}M = T - k$. Hence, to conclude the proof of the theorem, we need to show that $\widehat{\beta}$ and $\widehat{\sigma}^2$ are independent. But it suffices to show that $M$ and $X(X'X)^{-1}$ are orthogonal¹, which follows by property (3) of $M$.

Corollary 3.7. Assuming A1-A5, $T^{1/2}\left(\widehat{\beta} - \beta\right) \overset{d}{\to} N\left(0, \sigma^2 Q^{-1}\right)$.
Proof. By A4, $T^{-1}(X'X) \to Q > 0$, whereas A5 and Theorem 3.6 give $\widehat{\beta} - \beta \sim N\left(0, \sigma^2(X'X)^{-1}\right)$. Thus,
$$
\left(\frac{X'X}{T}\right)^{1/2}T^{1/2}\left(\widehat{\beta} - \beta\right) \sim N\left(0, \sigma^2 I_k\right).
$$
Now use Theorem 2.13 to conclude that $T^{1/2}\left(\widehat{\beta} - \beta\right) \overset{d}{\to} N\left(0, \sigma^2 Q^{-1}\right)$.


We now examine what happens when A1 to A5 are dropped.

3.2.2. STOCHASTIC REGRESSORS.
We will start our discussion when Assumption A1 is dropped, that is, $X$ is stochastic. If $\{x_i\}_{i=1}^{T}$ and $\{u_i\}_{i=1}^{T}$ were independent, or $E(u_i \mid x_i, i \geq 1) = 0$, then, assuming $E\left(\|(X'X)^{-1}\|\,\|X'U\|\right) < \infty$, we obtain
$$
E\left[\widehat{\beta} - \beta \mid X\right] = 0 ,
$$
¹ If $Z_1$ and $Z_2$ are independent normally distributed, then $g(Z_1)$ and $h(Z_2)$ are also independent, for any functions $g(\cdot)$ and $h(\cdot)$.

and so, unconditionally, by the law of iterated expectations. With regard to the variance, it is not true that
$$
E\left(\widehat{\beta} - \beta\right)\left(\widehat{\beta} - \beta\right)' = E\left[E\left(\left(\widehat{\beta} - \beta\right)\left(\widehat{\beta} - \beta\right)' \mid X\right)\right] = \sigma^2E\left(X'X\right)^{-1},
$$
as there is no guarantee that $E\|(X'X)^{-1}\| < \infty$. Finally, if the distribution of $u_i$ is $N(0, \sigma^2)$, then
$$
\widehat{\beta} - \beta \mid X \sim N\left(0, \sigma^2(X'X)^{-1}\right),
$$
although we do not know the unconditional one. However, because
$$
(X'X)^{1/2}\left(\widehat{\beta} - \beta\right)\Big| X \sim N\left(0, \sigma^2 I_k\right),
$$
which does not depend on $X$, the unconditional distribution is also $N\left(0, \sigma^2 I_k\right)$.
The condition of mutual independence of $U$ and $X$ is key, and it cannot be assumed in many scenarios. Thus, what can we do? The answer is to rely on their behaviour as $T \to \infty$. Even if the sample size is finite, we make all our statements as if $T$ were in fact infinity.
Let us introduce the following assumption:
A1′: $\{x_i\}_{i\in\mathbb{N}}$ is iid with $E(x_ix_i') = \Sigma_{xx} > 0$.

Theorem 3.8. Under A1′ and A3, $\widehat{\beta}$ is consistent.

Proof. It suffices to show that $\operatorname{plim}(\widehat{\beta} - \beta) = 0$. Now
$$
(12)\qquad \widehat{\beta} - \beta = \left(\frac{X'X}{T}\right)^{-1}\frac{X'U}{T}.
$$
First consider the behaviour of
$$
\frac{1}{T}\sum_{i=1}^{T}x_ix_i' = \frac{1}{T}\sum_{i=1}^{T}z_i .
$$
By A1′, $z_i$ is an iid r.v., since $x_i$ is. Moreover, $E|z_i| < \infty$ since the second moments of $x_i$ are finite. Thus, Theorem 2.4 implies that the right side converges in probability to $\Sigma_{xx} > 0$, and by Theorem 2.3,
$$
\left(\frac{1}{T}\sum_{i=1}^{T}x_ix_i'\right)^{-1} \overset{p}{\to} \Sigma_{xx}^{-1}.
$$
Next,
$$
(13)\qquad \frac{1}{T}\sum_{i=1}^{T}x_iu_i = \frac{1}{T}\sum_{i=1}^{T}z_i .
$$
But $z_i$ is zero-mean iid, since $E(u_ix_i) = Eu_i\,Ex_i = 0$. Thus, by Theorem 2.4, it converges to zero in probability. But, because the product is a continuous function, we conclude by Theorem 2.3 that (12) converges to zero in probability.
Next we will examine the consistency of $\widehat{\sigma}^2$.

Proposition 3.9. Under A1′, A3 and the assumption that $u_i$ is iid, we have that $\widehat{\sigma}^2 \overset{p}{\to} \sigma^2$.
Proof. Because $\widehat{u}_i = u_i - (\widehat{\beta} - \beta)'x_i$ and $\sum_{i=1}^{T}x_i\widehat{u}_i = 0$, (8) equals
$$
(14)\qquad \frac{1}{T}\sum_{i=1}^{T}u_i^2 - \left(\widehat{\beta} - \beta\right)'\frac{1}{T}\sum_{i=1}^{T}x_iu_i .
$$
Consider the first term of (14). Because $u_i^2$ is an iid r.v. with finite first moment, e.g. $\sigma^2$, by Theorem 2.4 we conclude that
$$
\frac{1}{T}\sum_{i=1}^{T}u_i^2 \overset{p}{\to} \sigma^2 .
$$
The second term of (14) converges in probability to $0$, because $\widehat{\beta} - \beta \overset{p}{\to} 0$ and, as shown in (13), $\frac{1}{T}\sum_{i=1}^{T}x_iu_i \overset{p}{\to} 0$. Thus, by Theorem 2.3,
$$
\frac{1}{T}\sum_{i=1}^{T}u_i^2 - \left(\widehat{\beta} - \beta\right)'\frac{1}{T}\sum_{i=1}^{T}x_iu_i \overset{p}{\to} \sigma^2
$$
because the sum is continuous.


To be able to make inferences about $\beta$, we need to know its asymptotic distribution. That is answered in the following theorem.
Theorem 3.10. Assuming A1′, A2 and A3, with $u_i$ iid r.v.'s, we have that
$$
\text{(a)}\quad T^{1/2}\left(\widehat{\beta} - \beta\right) \overset{d}{\to} N\left(0, \sigma^2\Sigma_{xx}^{-1}\right),
$$
$$
\text{(b)}\quad T^{1/2}\left(\widehat{\sigma}^2 - \sigma^2\right) \overset{d}{\to} N\left(0, Eu_i^4 - \sigma^4\right).
$$
Proof. We begin by showing part (a). By definition,
$$
T^{1/2}\left(\widehat{\beta} - \beta\right) = \left(\frac{X'X}{T}\right)^{-1}\frac{1}{T^{1/2}}X'U .
$$
By Theorem 3.8, we know that $\left(\frac{X'X}{T}\right)^{-1} \overset{p}{\to} \Sigma_{xx}^{-1}$, so that by Theorem 2.13 it suffices to show that
$$
(15)\qquad \frac{1}{T^{1/2}}X'U = \frac{1}{T^{1/2}}\sum_{i=1}^{T}x_iu_i = \frac{1}{T^{1/2}}\sum_{i=1}^{T}z_i \overset{d}{\to} N\left(0, \sigma^2\Sigma_{xx}\right).
$$
Because $z_i$ is a vector, by Theorem 2.12 it suffices to show the CLT for
$$
\frac{1}{T^{1/2}}\sum_{i=1}^{T}\lambda'z_i , \qquad \lambda \neq 0 .
$$
Because both $u_i$ and $x_i$ are iid, $\lambda'z_i$ is also zero-mean iid with finite variance. By independence of $u_i$ and $x_i$, the first moment of $\lambda'z_i$ is zero, whereas its variance is $\sigma^2\lambda'\Sigma_{xx}\lambda$. So Theorem 2.8 implies that
$$
\frac{1}{T^{1/2}}\sum_{i=1}^{T}\lambda'z_i \overset{d}{\to} N\left(0, \sigma^2\lambda'\Sigma_{xx}\lambda\right),
$$
and we conclude the proof of (15), and thus part (a), by Theorem 2.13.
Next, part (b). By the definition of $\widehat{u}_i$, we have that the left side is
$$
(16)\qquad \frac{1}{T^{1/2}}\sum_{i=1}^{T}\left(u_i^2 - \sigma^2\right) - \left(\widehat{\beta} - \beta\right)'\frac{1}{T^{1/2}}\sum_{i=1}^{T}x_ix_i'\left(\widehat{\beta} - \beta\right).
$$
Proceeding as in part (a), the first term of (16) satisfies
$$
\frac{1}{T^{1/2}}\sum_{i=1}^{T}\left(u_i^2 - \sigma^2\right) \overset{d}{\to} N\left(0, E\left(u_i^2 - \sigma^2\right)^2\right) \equiv N\left(0, Eu_i^4 - \sigma^4\right).
$$
On the other hand, the second term of (16) is
$$
\frac{1}{T^{1/2}}\,T^{1/2}\left(\widehat{\beta} - \beta\right)'\left(\frac{1}{T}\sum_{i=1}^{T}x_ix_i'\right)T^{1/2}\left(\widehat{\beta} - \beta\right).
$$
But in part (a) we have already shown that $T^{1/2}\left(\widehat{\beta} - \beta\right)$ converges in distribution, and in Theorem 3.8 that $\frac{1}{T}\sum_{i=1}^{T}x_ix_i' \overset{p}{\to} \Sigma_{xx}$. Hence, we conclude that the last displayed expression converges to zero in probability, because the product is a continuous function and $T^{-1/2} \to 0$. Then
$$
T^{1/2}\left(\widehat{\sigma}^2 - \sigma^2\right) \overset{d}{\to} N\left(0, E\left(u_i^2 - \sigma^2\right)^2\right)
$$
by Theorem 2.7.
Remark 3.6. (i) Assumption A5 is not necessary for the results to hold. Moreover, in the case of stochastic regressors and/or nonnormality of the errors, all our arguments will be asymptotic ones.
(ii) If $Eu_i^{4+\delta} < C$ for some arbitrary $\delta > 0$, we could have dropped the condition of iid r.v.'s for $u_i$. In this case, we would use Corollary 2.10 instead of Theorem 2.8.
(iii) If $u_i$ were normally distributed, then $T^{1/2}\left(\widehat{\sigma}^2 - \sigma^2\right) \overset{d}{\to} N\left(0, 2\sigma^4\right)$.

Finally, how good (efficient) is $\widehat{\beta}$? In other words, do we have a result equivalent to that for the case of $x_i$ deterministic?
Theorem 3.11. Under A1′ and A3, with $u_i$ iid, the LSE is asymptotically Gauss-Markov efficient; that is, any other linear consistent estimator of $\beta$ has an asymptotic variance-covariance matrix which exceeds $\sigma^2\Sigma_{xx}^{-1}$ by a p.s.d. matrix.
Proof. Let $\widetilde{\beta} = (Z'X)^{-1}Z'Y$. Clearly $\widetilde{\beta}$ is linear in $Y$ and very general. If $Z = X$ we have the LSE. Assume the following:
(1) $\operatorname{rank}(Z'X) = k$, that is, $E(x_iz_i') = \Sigma_{xz} \neq 0$.
(2) $z_i$ is a $(k \times 1)$ vector of iid r.v.'s with finite second moments.
(3) $z_i$ and $u_i$ are mutually independent.

From the definition of $\widetilde{\beta}$,
$$
\widetilde{\beta} - \beta = \left(\frac{1}{T}\sum_{i=1}^{T}z_ix_i'\right)^{-1}\left(\frac{1}{T}\sum_{i=1}^{T}z_iu_i\right).
$$
$\widetilde{\beta}$ is consistent by Theorem 2.4, since
$$
\frac{1}{T}\sum_{i=1}^{T}z_iu_i \overset{p}{\to} 0
$$
as $z_i$ and $u_i$ are iid with finite first moments. Similarly,
$$
\frac{1}{T}\sum_{i=1}^{T}z_ix_i' \overset{p}{\to} \Sigma_{zx} \neq 0 .
$$
Then, Theorem 2.3 implies that $\widetilde{\beta} - \beta \overset{p}{\to} 0$. Now
$$
T^{1/2}\left(\widetilde{\beta} - \beta\right) = \left(\frac{1}{T}\sum_{i=1}^{T}z_ix_i'\right)^{-1}\left(\frac{1}{T^{1/2}}\sum_{i=1}^{T}z_iu_i\right).
$$
The second factor on the right of the last displayed equality is
$$
\frac{1}{T^{1/2}}\sum_{i=1}^{T}z_iu_i \overset{d}{\to} N\left(0, \sigma^2\Sigma_{zz}\right),
$$
by Theorem 2.8, and then by Theorem 2.13,
$$
T^{1/2}\left(\widetilde{\beta} - \beta\right) \overset{d}{\to} N\left(0, \sigma^2\Sigma_{zx}^{-1}\Sigma_{zz}\Sigma_{xz}^{-1}\right).
$$
Thus, to complete the proof of the theorem, it remains to prove that
$$
\Sigma_{zx}^{-1}\Sigma_{zz}\Sigma_{xz}^{-1} \geq \Sigma_{xx}^{-1} \quad \text{or} \quad 0 \leq \Sigma_{xx} - \Sigma_{xz}\Sigma_{zz}^{-1}\Sigma_{zx}.
$$
To that end, consider the following matrix:
$$
A = \begin{pmatrix} \Sigma_{xx} & \Sigma_{xz} \\ \Sigma_{zx} & \Sigma_{zz} \end{pmatrix}.
$$
$A$ is a p.s.d. matrix because it is the covariance matrix of the vector $(x_i', z_i')'$. Choosing the vector
$$
a = \begin{pmatrix} I \\ -\Sigma_{zz}^{-1}\Sigma_{zx} \end{pmatrix},
$$
we get $0 \leq a'Aa = \Sigma_{xx} - \Sigma_{xz}\Sigma_{zz}^{-1}\Sigma_{zx}$, which completes the proof.
Restricted Least Squares.
Suppose that there are $m \leq k$ linearly independent linear constraints on the parameters $\beta$ in (2). That is, for known matrices $R_{m\times k}$ and $r_{m\times 1}$,
$$
R\beta = r , \qquad \operatorname{rank}(R) = m .
$$
How can we estimate $\beta$ subject to $R\beta = r$? To that end, we use the Lagrange-multiplier principle. Let
$$
g(\beta, \lambda) = (Y - X\beta)'(Y - X\beta) - \lambda'(R\beta - r).
$$

Thus, to obtain the $\widetilde{\beta}$ which minimizes $g(\beta, \lambda)$, the FOC are
$$
(17)\qquad \frac{\partial}{\partial\beta}g(\beta, \lambda) = 0 = -2X'Y + 2X'X\widetilde{\beta} - R'\widetilde{\lambda},
$$
$$
(18)\qquad \frac{\partial}{\partial\lambda}g(\beta, \lambda) = 0 = R\widetilde{\beta} - r .
$$
Multiplying the right side of (17) by $R(X'X)^{-1}$, we obtain
$$
0 = -2R(X'X)^{-1}X'Y + 2R\widetilde{\beta} - R(X'X)^{-1}R'\widetilde{\lambda}.
$$
Now, using (18), we have that
$$
\widetilde{\lambda} = 2\left(R(X'X)^{-1}R'\right)^{-1}\left(r - R\widehat{\beta}\right),
$$
because $\widehat{\beta} = (X'X)^{-1}X'Y$. Replacing the value of $\widetilde{\lambda}$ into (17), we have
$$
0 = -2X'Y + 2X'X\widetilde{\beta} - 2R'\left(R(X'X)^{-1}R'\right)^{-1}\left(r - R\widehat{\beta}\right),
$$
or equivalently we obtain that $\widetilde{\beta}$ is given by
$$
(19)\qquad \widetilde{\beta} = \widehat{\beta} + (X'X)^{-1}R'\left(R(X'X)^{-1}R'\right)^{-1}\left(r - R\widehat{\beta}\right).
$$
The estimator of $\beta$ given in (19) is called the Restricted Least Squares estimator.

Remark 3.7. If the LSE satisfies the constraints, then $\widehat{\beta} = \widetilde{\beta}$.
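A minimal numerical sketch of the restricted LSE in (19); the design, the particular restriction $R\beta = r$ and all variable names are illustrative choices of mine:

```python
import numpy as np

rng = np.random.default_rng(5)
T, k = 300, 3
X = rng.normal(size=(T, k))
y = X @ np.array([1.0, 1.0, 0.5]) + rng.normal(size=T)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y                  # unrestricted LSE

# Single restriction: beta_1 - beta_2 = 0, written as R beta = r
R = np.array([[1.0, -1.0, 0.0]])
r = np.array([0.0])

adj = XtX_inv @ R.T @ np.linalg.inv(R @ XtX_inv @ R.T) @ (r - R @ beta_hat)
beta_tilde = beta_hat + adj                   # restricted LSE, equation (19)

print(beta_hat, beta_tilde, R @ beta_tilde)   # R beta_tilde = r (up to rounding)
```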

3.2.3. SOME PRELIMINARY RESULTS WITH REGARD TO HYPOTHESIS TESTING.
So far, we have studied how to estimate $\beta$ in (2) and what its statistical properties are. However, we might be interested in deciding whether $\beta$ satisfies some specific constraint. Consider the regression model given in (2), e.g.
$$
y_i = \beta_1x_{i1} + \beta_2x_{i2} + \cdots + \beta_kx_{ik} + u_i .
$$
Suppose that we are interested in the hypothesis test
$$
(20)\qquad H_0: \beta_1 = 1 \quad \text{vs.} \quad H_1: \beta_1 \neq 1 .
$$
We have shown that the LSE is distributed as $N\left(\beta, \sigma^2(X'X)^{-1}\right)$ or
$$
\widehat{\beta} - \beta \sim AN\left(0, \sigma^2(X'X)^{-1}\right).
$$
The latter means that the finite-sample distribution is approximated by the asymptotic one. So, the argument that follows needs somehow to distinguish between these two frameworks.

Case 1.
By Theorem 3.6, we know that
$$
\widehat{\beta} - \beta \sim N\left(0, \sigma^2(X'X)^{-1}\right),
$$
and, denoting by $A^{i,j}$ the $(i,j)$-th element of $A^{-1}$,
$$
\widehat{\beta}_1 - 1 = (1, 0, \ldots, 0)\left(\widehat{\beta} - \beta\right) \sim N\left(0, \sigma^2(X'X)^{1,1}\right).
$$
So, to test for $H_0$ in (20), we use the statistic
$$
N = \frac{\widehat{\beta}_1 - 1}{\left(\sigma^2(X'X)^{1,1}\right)^{1/2}} \sim N(0, 1)
$$
under $H_0$, and we obtain a test at significance level $\alpha$ ($\alpha = .05$) by rejecting $H_0$ if $|N| > N_{.025}$, where $N_{.025}$ is such that $\Pr\{N > N_{.025}\} = .025$.
Because $\sigma^2$ is unknown, we replace it by $\widehat{\sigma}^2$. Then $N$ becomes
$$
t = \frac{\widehat{\beta}_1 - 1}{\left(\widehat{\sigma}^2(X'X)^{1,1}\right)^{1/2}}.
$$
Because $\widehat{\sigma}^2$ is a random variable, it is intuitive to think that $t$ is no longer a normal r.v. Theorem 3.6 showed that under A1-A5, $\widehat{\beta}$ and $T\widehat{\sigma}^2/\sigma^2$ are mutually independent and distributed as normal and $\chi^2_{T-k}$ r.v.'s respectively. So we obtain that
$$
t = \frac{\left(\widehat{\beta}_1 - 1\right)\Big/\left(\sigma^2(X'X)^{1,1}\right)^{1/2}}{\left(\dfrac{T\widehat{\sigma}^2/\sigma^2}{T-k}\right)^{1/2}} \sim t_{T-k}.
$$
Hence, a test at the 5% significance level will reject if $|t| > t_{T-k,.025}$, where $t_{T-k,.025}$ is such that $\Pr\{t > t_{T-k,.025}\} = .025$.
Case 2.
When the finite-sample distribution is not available, we have to rely on asymptotic results. We know by Theorem 3.10 that
$$
T^{1/2}\left(\widehat{\beta} - \beta\right) \overset{d}{\to} N\left(0, \sigma^2\Sigma_{xx}^{-1}\right).
$$
Therefore, for $T$ sufficiently large,
$$
T^{1/2}\left(\widehat{\beta} - \beta\right) \sim AN\left(0, \sigma^2\Sigma_{xx}^{-1}\right),
$$
which suggests testing (20) by
$$
t = \left(\sigma^2\Sigma_{xx}^{11}\right)^{-1/2}T^{1/2}\left(\widehat{\beta}_1 - 1\right).
$$
As we mentioned above, neither $\Sigma_{xx}$ nor $\sigma^2$ is known in empirical examples, and thus, to be able to compute the test, we replace them by consistent estimators, say
$$
\widehat{\Sigma}_{xx}^{11} = (1, 0, \ldots, 0)\left(\frac{1}{T}\sum_{i=1}^{T}x_ix_i'\right)^{-1}(1, 0, \ldots, 0)' \quad \text{and} \quad \widehat{\sigma}^2 \quad \text{respectively}.
$$

Then, the test statistic becomes
$$
t = \frac{T^{1/2}\left(\widehat{\beta}_1 - 1\right)}{\left(\widehat{\sigma}^2\widehat{\Sigma}_{xx}^{11}\right)^{1/2}}.
$$
Proposition 3.12. Assuming A1′ and A3, $t \overset{d}{\to} N(0, 1)$ under $H_0$.
Proof. The proof is a simple application of Theorems 3.10, 2.3 and 2.13.
Remark 3.8. More general hypothesis tests can be handled by proceeding similarly; however, we will defer them until Section 8.
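A minimal sketch of the asymptotic t-statistic of Case 2, computed on simulated data generated under $H_0: \beta_1 = 1$; the data-generating process and names are my own illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(6)
T, k = 500, 3
X = rng.normal(size=(T, k))
beta = np.array([1.0, -0.3, 0.7])            # beta_1 = 1, so H0 is true
y = X @ beta + rng.standard_t(df=7, size=T)  # non-normal errors are fine asymptotically

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
u_hat = y - X @ beta_hat
sigma2_hat = np.mean(u_hat**2)

Sxx_hat_inv = np.linalg.inv(X.T @ X / T)
se = np.sqrt(sigma2_hat * Sxx_hat_inv[0, 0] / T)
t_stat = (beta_hat[0] - 1.0) / se
print(t_stat)                                # approximately N(0,1) under H0
```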
3.3. VIOLATIONS OF ASSUMPTIONS.
When establishing the statistical properties of the LSE, we have made use of Assumptions A1-A5. Inspecting these assumptions, we can infer that some of them seem to be more crucial or important than others. For instance, we have always assumed so far that
(a) $Eu_t = 0$;
(b) $E(x_tu_t) = 0$, or $x_t$ and $u_t$ independent;
(c) $u_t$ iid.
The question, then, is what would happen to the properties of the LSE if one or more of the previous assumptions were violated. We will answer this question one by one, although (c) will be examined in Section 5.
3.3.1. (a) $Eu_t \neq 0$.
Assume that $Eu_t = \mu \neq 0$. Then, we have the following proposition.
Proposition 3.13. Assume A1 (A1′) and A2-A4, except that now $Eu_t = \mu \neq 0$. Then the LSE $\widehat{\beta}$ is biased (inconsistent) unless $X'\mu = 0$ ($\operatorname{plim}X'\mu/T = 0$).
Proof. If $E\widehat{\beta}$ exists, e.g. $X$ is fixed or $E\left(\|(X'X)^{-1}\|\,\|X'U\|\right) < \infty$, then by the law of iterated expectations we have that
$$
E\left(\widehat{\beta} - \beta\right) = E\left[(X'X)^{-1}X'E(U \mid X)\right] = E\left[(X'X)^{-1}X'\mu\right].
$$
But $\operatorname{rank}(X'X) = k$, so $(X'X)^{-1}$ exists. Then,
$$
E\left(\widehat{\beta} - \beta\right) = 0 \iff X'\mu = 0 .
$$
Now, if the $x_t$ are stochastic, then
$$
(21)\qquad \operatorname{plim}\left(\widehat{\beta} - \beta\right) = \operatorname{plim}\left[\left(\frac{X'X}{T}\right)^{-1}\frac{X'U}{T}\right].
$$
By Theorem 3.10, we have that
$$
\operatorname{plim}\left(\frac{X'X}{T}\right)^{-1} = \Sigma_{xx}^{-1} > 0 ,
$$
whereas
$$
\operatorname{plim}\frac{1}{T}\sum_{i=1}^{T}x_iu_i = \operatorname{plim}\frac{1}{T}\sum_{i=1}^{T}x_i(u_i - \mu) + \operatorname{plim}\frac{1}{T}\sum_{i=1}^{T}x_i\mu .
$$
So, by Theorem 2.3, we conclude that $\widehat{\beta}$ is consistent iff $\operatorname{plim}T^{-1}\sum_{t=1}^{T}x_tu_t = 0$, because $\Sigma_{xx}^{-1} > 0$. So $\widehat{\beta}$ is inconsistent whenever $\mu\,Ex_t \neq 0$.
A further result, given without proof, is the following.
Theorem 3.14. If $Eu_t \neq 0$, then either $\widehat{\beta}$ or $\widehat{\sigma}^2$ (or both) must be biased. In addition, as long as $\lim_{T\to\infty}(\mu'\mu)/T \neq 0$, either $\widehat{\beta}$ or $\widehat{\sigma}^2$ (or both) must be inconsistent.
There is a special case, namely where the regression contains a constant, e.g.
$$
y_t = \alpha + \beta'x_t + u_t .
$$
Theorem 3.15. Consider the latter regression model. Assume A1 (A1′) and A2-A4, except that now $Eu_t = \mu \neq 0$. Then the LSE of $\beta$ is unbiased (consistent).
Proof. Left as an exercise.
Thus, a nonzero mean of the error term has no serious consequences, as the slope parameters can always be consistently estimated.
3.3.2. (b) $x_i$ and $u_i$ are not uncorrelated.
We shall assume that $x_i$ is stochastic. Examples are quite common, perhaps the leading one being the simultaneous equations model, see Section 7. Another example is the transformed Box-Cox model
$$
\frac{y_t^{\lambda_0} - 1}{\lambda_0} = \beta'x_t + u_t ,
$$
where the parameters are $\lambda_0$ and $\beta$. The most immediate consequence is that the LSE of $\beta$ is no longer consistent, as the following proposition shows.
Proposition 3.16. Assume A1′ and A3, except that now $x_t$ and $u_t$ are correlated. Then the LSE of $\beta$ is inconsistent.
Proof. By definition,
$$
(22)\qquad \widehat{\beta} - \beta = \left(\frac{X'X}{T}\right)^{-1}\frac{X'U}{T}.
$$
As we have already seen, the first factor on the right of (22) converges in probability to $\Sigma_{xx}^{-1}$. Next, by Theorem 2.4, the second factor on the right of (22) is
$$
\frac{X'U}{T} = \frac{1}{T}\sum_{i=1}^{T}x_iu_i = \frac{1}{T}\sum_{i=1}^{T}w_i \overset{p}{\to} \gamma ,
$$
since $x_i$ and $u_i$ are iid, which implies that $w_i$ is too, and $E(x_iu_i) = \gamma \neq 0$. Thus, by Theorem 2.3, we obtain that
$$
\operatorname{plim}\left(\widehat{\beta} - \beta\right) = \Sigma_{xx}^{-1}\gamma \neq 0 .
$$
So the LSE is inconsistent.
Remark 3.9. The estimator of $\sigma^2$ in (8) is also inconsistent, as
$$
\frac{1}{T}\sum_{t=1}^{T}\widehat{u}_t^2 = \frac{U'MU}{T} \overset{p}{\to} \sigma^2 - \gamma'\Sigma_{xx}^{-1}\gamma \neq \sigma^2 .
$$

The question is: what can we do? Suppose that we have a set of $k$ variables $z_t$ that are (a) uncorrelated with $u_t$ and (b) correlated with $x_t$, that is, $E(z_tx_t') = \Sigma_{zx} \neq 0$. Consider the estimator of $\beta$
$$
(23)\qquad \widetilde{\beta} = (Z'X)^{-1}Z'Y
$$
(compare with the estimator given in Theorem 3.11). This estimator is called the INSTRUMENTAL VARIABLE ESTIMATOR (IVE). Its statistical properties are given in the next theorem.
Theorem 3.17. Under the conditions of Proposition 3.16, with $z_t$ a $(k \times 1)$ vector of iid r.v.'s such that $E(z_tx_t') = \Sigma_{zx} \neq 0$ and $E(z_tu_t) = 0$, we have
$$
\text{(a)}\quad \widetilde{\beta} \overset{p}{\to} \beta ,
$$
$$
\text{(b)}\quad T^{1/2}\left(\widetilde{\beta} - \beta\right) \overset{d}{\to} N\left(0, \sigma^2\Sigma_{zx}^{-1}\Sigma_{zz}\Sigma_{xz}^{-1}\right), \qquad \Sigma_{zz} = E(z_tz_t').
$$
Proof. We begin with (a). By definition, (23) minus $\beta$ is
$$
(24)\qquad \widetilde{\beta} - \beta = \left(\frac{Z'X}{T}\right)^{-1}\frac{Z'U}{T}.
$$
Now, by Theorem 2.4,
$$
\frac{Z'X}{T} = \frac{1}{T}\sum_{t=1}^{T}z_tx_t' \overset{p}{\to} \Sigma_{zx}
$$
because $w_t = z_tx_t'$ is an iid sequence of r.v.'s with finite first moment $\Sigma_{zx}$, whereas Theorem 2.3 implies, because $\Sigma_{zx} \neq 0$, that
$$
\left(\frac{Z'X}{T}\right)^{-1} \overset{p}{\to} \Sigma_{zx}^{-1}.
$$
On the other hand, the second factor of expression (24) is
$$
\frac{1}{T}\sum_{t=1}^{T}z_tu_t \overset{p}{\to} 0
$$
by Theorem 2.4, because $E(z_tu_t) = 0$. That concludes the proof of part (a).
Next we show (b). By definition,
$$
(25)\qquad T^{1/2}\left(\widetilde{\beta} - \beta\right) = \left(\frac{Z'X}{T}\right)^{-1}\frac{1}{T^{1/2}}\sum_{t=1}^{T}z_tu_t .
$$
By Theorem 2.13, it suffices to show that the first factor on the right of (25) converges in probability to a well-defined limit, whereas the second factor on the right of (25) satisfies the CLT. In view of part (a) and Theorem 2.12, it suffices to show the convergence of
$$
\frac{1}{T^{1/2}}\sum_{t=1}^{T}\lambda'z_tu_t , \qquad \lambda \neq 0 .
$$
But $w_t = \lambda'z_tu_t$ is iid with zero mean and $E\left(w_t^2\right) = E\left(\lambda'z_tu_tu_tz_t'\lambda\right) = \lambda'E(z_tz_t')E\left(u_t^2\right)\lambda = \sigma^2\lambda'\Sigma_{zz}\lambda$. So the second factor on the right of (25) converges in distribution to $N\left(0, \sigma^2\Sigma_{zz}\right)$, and because $(Z'X/T)^{-1} \overset{p}{\to} \Sigma_{zx}^{-1}$, by Theorem 2.13 we conclude the proof of part (b).

Remark 3.10. (i) From the result of the above theorem, we notice that the asymptotic variance of the IVE depends very much on the correlation between the regressors and the instruments: the higher this correlation, the smaller the variance.
(ii) If the LSE were consistent and asymptotically normal, then $\sigma^2(X'X)^{-1}$ would be smaller than $\sigma^2\Sigma_{zx}^{-1}\Sigma_{zz}\Sigma_{xz}^{-1}$, as the correlation between the regressors and themselves is maximal, e.g. 1. See also the proof of Theorem 3.11.
The intuition behind the LSE is that we try to find the value in $S(X)$ closest to $Y$. What is that point? It is just the projection of $Y$ onto $S(X)$. Recall that the LSE minimizes $(Y - X\beta)'(Y - X\beta)$. Thus, one can view this as splitting the space into $S(X)$ and $S^{\perp}(X)$, on which $\widehat{U}$ lies. Because
$$
y_t = \beta'x_t + u_t ,
$$
we can regard $y_t$ as the sum (in the vector sense) of $\beta'x_t$ and $u_t$. To say that $u_t$ and $x_t$ are uncorrelated is to say that $u_t \in S^{\perp}(X)$. Now, if $u_t$ is correlated with $x_t$, it means in geometrical terms that $u_t \notin S^{\perp}(X)$.
Thus, because $x_t \not\perp u_t$, the projection of $y_t$ onto $S(X)$ will be different from $\beta'x_t$, e.g. its value depends on $u_t$. The reason is that what the LSE does is to find the closest point on $S(X)$ that "explains" $y_t$, and if $u_t$ carries information on $x_t$, then we end up estimating
$$
\beta'x_t + E[u_t \mid x_t] = (\beta + \delta)'x_t .
$$
So, what does the IVE do? Basically the following: obtain the space spanned by $Z$, say $S(Z)$, and then minimize the distance from $Y$ to $S(X)$ that lies in $S(Z)$. In mathematical terms:
1st) Regress $X$ on $Z$, e.g. $Z(Z'Z)^{-1}Z'X = P_ZX$, where $P_Z$ is the matrix that projects onto $S(Z)$.
2nd) Then find the $\widetilde{\beta}$ that minimizes the distance between $Y$ and $P_ZX\beta$,
$$
\widetilde{\beta} = \left(X'P_ZX\right)^{-1}X'P_ZY = \left(X'Z(Z'Z)^{-1}Z'X\right)^{-1}X'Z(Z'Z)^{-1}Z'Y ,
$$
i.e. we minimize $(Y - P_ZX\beta)'(Y - P_ZX\beta)$, and thus it can be observed that what we minimize is
$$
(Y - X\beta)'Z(Z'Z)^{-1}Z'(Y - X\beta) \equiv (Y - X\beta)'P_Z(Y - X\beta).
$$
Observe that because $\dim(Z) = \dim(X)$, $\widetilde{\beta} = (Z'X)^{-1}Z'Y$.
The next issue is the following. We have assumed that the number of instruments equals the number of regressors. But what would happen if the number of instruments were bigger than the number of regressors? It is obvious that we could choose two different sets of instruments, each of them providing a consistent estimator. But then, how do we decide among them?
An intuitive answer is the following. Because we want to obtain a set of instruments as correlated as possible with $x_t$ (recall that the higher the correlation, the smaller the variance of the IVE), we first regress $X$ on $Z$ and then use the fitted values as instruments. Why? Recall what the LSE does: it tries to find the value on the space $S(X)$ closest to $Y$; intuitively, the closer we get, the bigger the correlation between the $y$'s and the fitted values will be. Thus, in our particular framework, the optimal instrument will be
$$
\widetilde{Z} = Z(Z'Z)^{-1}Z'X ,
$$
and so the IVE becomes
$$
(26)\qquad \widetilde{\beta} = \left(\widetilde{Z}'X\right)^{-1}\widetilde{Z}'Y = \left(X'Z(Z'Z)^{-1}Z'X\right)^{-1}X'Z(Z'Z)^{-1}Z'Y = \left(X'P_ZX\right)^{-1}X'P_ZY .
$$
This estimator is called GIVE (the generalized instrumental variable estimator).
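A minimal sketch of the GIVE in (26) on simulated data with one endogenous regressor and two instruments; the data-generating process and all names are illustrative assumptions of mine:

```python
import numpy as np

rng = np.random.default_rng(7)
T = 2000
z = rng.normal(size=(T, 2))                      # two instruments
v = rng.normal(size=T)
u = 0.8 * v + rng.normal(size=T)                 # error correlated with x through v
x = z @ np.array([1.0, 0.5]) + v                 # single endogenous regressor
X = x.reshape(-1, 1)
y = X @ np.array([2.0]) + u

beta_ols = np.linalg.solve(X.T @ X, X.T @ y)     # inconsistent here

Pz = z @ np.linalg.inv(z.T @ z) @ z.T            # projection onto S(Z)
beta_give = np.linalg.solve(X.T @ Pz @ X, X.T @ Pz @ y)

print(beta_ols, beta_give)                       # GIVE close to 2, OLS biased upward
```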
But is it best? The next lemma will give us the answer.
Lemma 3.18. Consider the linear regression model $Y = X\beta + U$. Assume A1′ to A4, except that $E(x_tu_t) \neq 0$. Consider two sets of instruments $Z_1$ and $Z_2$ such that $S(Z_1) \subseteq S(Z_2)$. Then,
$$
\widetilde{\beta}_2 = \left(X'P_{Z_2}X\right)^{-1}X'P_{Z_2}Y
$$
has an asymptotic variance-covariance matrix not bigger than that of
$$
\widetilde{\beta}_1 = \left(X'P_{Z_1}X\right)^{-1}X'P_{Z_1}Y ,
$$
where $P_{Z_i} = Z_i(Z_i'Z_i)^{-1}Z_i'$ for $i = 1, 2$.
Proof. By Theorem 3.17, we know that
$$
T^{1/2}\left(\widetilde{\beta}_i - \beta\right) \overset{d}{\to} N\left(0, \sigma^2\operatorname{plim}\left(\frac{X'P_{Z_i}X}{T}\right)^{-1}\right), \quad i = 1, 2 .
$$
Thus, it suffices to show that $X'P_{Z_2}X \geq X'P_{Z_1}X$, or that
$$
0 \leq X'\left(P_{Z_2} - P_{Z_1}\right)X .
$$
Let us examine the matrix $P_{Z_2} - P_{Z_1}$ first. We already know that if a matrix is idempotent then it is p.s.d., which is the case for $P_{Z_2} - P_{Z_1}$. Why? Because $P_{Z_i}$ is a projection matrix, and hence idempotent, we have that
$$
\left(P_{Z_2} - P_{Z_1}\right)\left(P_{Z_2} - P_{Z_1}\right) = P_{Z_2} + P_{Z_1} - P_{Z_2}P_{Z_1} - P_{Z_1}P_{Z_2}.
$$
But $S(Z_1) \subseteq S(Z_2)$, which implies that $P_{Z_1}P_{Z_2} = P_{Z_2}P_{Z_1} = P_{Z_1}$. Thus,
$$
\left(P_{Z_2} - P_{Z_1}\right)\left(P_{Z_2} - P_{Z_1}\right) = P_{Z_2} - P_{Z_1},
$$
which concludes the proof.

4. MAXIMUM LIKELIHOOD ESTIMATION AND NUMERICAL OPTIMIZATION

The estimation methods seen so far, LSE and IV (later we will also see Generalized Least Squares, GLS), are based on the first two moments of the data, i.e. the expectation (or conditional expectation) and the variance. The question is: what if we know more than the first two moments? Can we do better than the LSE? The answer is yes, and the method is called Maximum Likelihood Estimation (MLE).
Suppose that we have independent observations $y = \{y_i\}_{i=1}^{T}$ with probability density function $p(\cdot)$, depending on a set of parameters $\theta$. So, the joint pdf becomes²
$$
p(y; \theta) = \prod_{i=1}^{T}p(y_i; \theta).
$$

Definition 4.1. $p(y; \theta)$, as a function of $\theta$, is known as the Likelihood Function.
$p(y; \theta)$ as a function of $y$ integrates to $1$ for all $\theta \in \Theta$, that is,
$$
\int_{\mathcal{Y}}p(y; \theta)\,dy = 1 ,
$$
where $\Theta$ denotes the Parameter Space.
Definition 4.2. $\theta_0$ denotes the true value of the $(k \times 1)$ vector of parameters.
The likelihood function will be denoted as
$$
L \equiv L(\theta; y) = p(y; \theta).
$$
For simplicity, we examine not the likelihood function $L$ but
$$
(27)\qquad \ell(\theta; y) = \log L(\theta; y).
$$
Remark 4.1. We should mention that $L$, as a function of $\theta$, depends on $y$, so that for any $\theta$ it is a random variable with expectation
$$
\int_{\mathcal{Y}}L(\theta; y)\,p(y; \theta_0)\,dy .
$$

4.1. Definitions and Properties.
Definition 4.3. The score vector $q(\theta)$ is the $(k \times 1)$ vector of first derivatives of $\ell(\theta; y)$ with respect to $\theta$. That is,
$$
q(\theta) = \frac{(\partial/\partial\theta)L(\theta; y)}{L(\theta; y)}.
$$
Definition 4.4. The Hessian matrix $H(\theta)$ is the $(k \times k)$ matrix of second derivatives of $\ell(\theta; y)$ with respect to $\theta$. That is,
$$
H(\theta) = \frac{\partial^2}{\partial\theta\,\partial\theta'}\ell(\theta; y)
= \frac{\frac{\partial^2}{\partial\theta\,\partial\theta'}L(\theta; y)}{L(\theta; y)} - \frac{\frac{\partial}{\partial\theta}L(\theta; y)\,\frac{\partial}{\partial\theta'}L(\theta; y)}{L^2(\theta; y)}.
$$
² Knowing the probability distribution function is equivalent to knowing all its moments; the converse is true except in very pathological circumstances.

Definition 4.5. The Information Matrix, $I(\theta)$, is minus the expectation of the Hessian matrix $H(\theta)$. That is,
$$
(28)\qquad I(\theta) = -E\left(H(\theta)\right).
$$
A very important property of the score function $q(\theta)$ is that
$$
E\left(q(\theta_0)\right) = 0 .
$$
The last displayed relation plays a key role in econometrics. Also, we have that
$$
(29)\qquad E\left[\frac{\frac{\partial^2}{\partial\theta\,\partial\theta'}L(\theta_0; y)}{L(\theta_0; y)}\right] = 0 .
$$
Definition 4.6. An unbiased estimator $\widehat{\theta}$ of $\theta_0$ is called efficient (in the Cramér-Rao sense) if its covariance matrix is $I^{-1}(\theta_0)$.
In the next theorem, known as the Cramér-Rao Theorem, we will prove the important result that the covariance matrix of any unbiased estimator exceeds $I^{-1}(\theta_0)$ by a positive semi-definite matrix.
Theorem 4.1. Let $\widetilde{\theta}$ be an unbiased estimator of $\theta_0$. Then its covariance matrix must exceed $I^{-1}(\theta_0)$, given in (28), by a positive semi-definite matrix.
Proof. Since for all $\theta \in \Theta$, $\int_{\mathcal{Y}}L(\theta; y)\,dy = 1$, differentiating both sides of the last equation with respect to $\theta$ we obtain³
$$
0 = \frac{\partial}{\partial\theta}\int_{\mathcal{Y}}L(\theta; y)\,dy
= \int_{\mathcal{Y}}\frac{\partial}{\partial\theta}L(\theta; y)\,dy
= \int_{\mathcal{Y}}\frac{\partial}{\partial\theta}\ell(\theta; y)\,L(\theta; y)\,dy .
$$
Differentiating once again with respect to $\theta$, we obtain
$$
0 = \int_{\mathcal{Y}}\frac{\partial^2}{\partial\theta\,\partial\theta'}\ell(\theta; y)\,L(\theta; y)\,dy + \int_{\mathcal{Y}}\frac{\partial}{\partial\theta}\ell(\theta; y)\,\frac{\partial}{\partial\theta'}\ell(\theta; y)\,L(\theta; y)\,dy .
$$
On the other hand, $\widetilde{\theta}$ being an unbiased estimator of $\theta$ means that
$$
E\widetilde{\theta} = \int_{\mathcal{Y}}\widetilde{\theta}\,L(\theta; y)\,dy = \theta ,
$$
and differentiating both sides we obtain
$$
I_{k\times k} = \int_{\mathcal{Y}}\widetilde{\theta}\,\frac{\partial}{\partial\theta'}L(\theta; y)\,dy
= \int_{\mathcal{Y}}\widetilde{\theta}\,\frac{\partial}{\partial\theta'}\ell(\theta; y)\,L(\theta; y)\,dy .
$$
However, we know that $E\left(q(\theta_0)\right) = 0$, so
$$
I_{k\times k} = \int_{\mathcal{Y}}\left(\widetilde{\theta} - \theta\right)\frac{\partial}{\partial\theta'}\ell(\theta; y)\,L(\theta; y)\,dy .
$$
Now the Cauchy-Schwarz inequality implies that
$$
I_{k\times k} \leq \left(\int_{\mathcal{Y}}\left(\widetilde{\theta} - \theta\right)\left(\widetilde{\theta} - \theta\right)'L(\theta; y)\,dy\right)
\left(\int_{\mathcal{Y}}\frac{\partial}{\partial\theta}\ell(\theta; y)\,\frac{\partial}{\partial\theta'}\ell(\theta; y)\,L(\theta; y)\,dy\right)
= \operatorname{Cov}\left(\widetilde{\theta}\right)I(\theta).
$$
For the last step we have used (29). So
$$
\operatorname{Cov}\left(\widetilde{\theta}\right) \geq I^{-1}(\theta),
$$
which completes the proof of the theorem.
³ To be able to interchange differentiation and integration in the second equality, we need that $\mathcal{Y}$ does not depend on $\theta$. For instance, if $y$ were distributed as $U(0, \theta)$, then this step would not be valid.


Definition 4.7. The maximum likelihood estimator (MLE) is defined as the value of $\theta$, denoted $\widehat{\theta}$, which maximizes $\ell(\theta; y)$. That is,
$$
(30)\qquad \widehat{\theta} = \arg\max_{\theta\in\Theta}\ell(\theta; y).
$$
Remark 4.2. Notice that the value $\widehat{\theta}$ which maximizes $\ell(\theta; y)$ is the same as that which maximizes $L(\theta; y)$.
Since the observations are iid, the objective function for the MLE is
$$
Q(\theta) = \ell(\theta; y) = \sum_{t=1}^{T}\ell(\theta; y_t),
$$
and so $\widehat{\theta}$ is the value of $\theta \in \Theta$ which satisfies
$$
\frac{\partial}{\partial\theta}Q\left(\widehat{\theta}\right) = \sum_{t=1}^{T}q\left(\widehat{\theta}; y_t\right) = 0 .
$$
Now, it is obvious that $\widehat{\theta}$ is a maximum because
$$
\frac{\partial^2}{\partial\theta\,\partial\theta'}Q\left(\widehat{\theta}\right) = \frac{\partial}{\partial\theta'}q\left(\widehat{\theta}\right) = H\left(\widehat{\theta}\right) \leq 0 .
$$
Example 4.1. Consider the linear regression model
$$
y_t = \beta'x_t + u_t , \quad t = 1, \ldots, T ,
$$
where $u_t \sim NID(0, \sigma^2)$. Then $y_t$ conditional on $x_t$ follows a $N(\beta'x_t, \sigma^2)$, and since the $u_t$ are independent, the $y_t \mid x_t$ are also independent. So, the likelihood function is
$$
(31)\qquad \prod_{t=1}^{T}p(y_t; x_t, \theta) = \prod_{t=1}^{T}\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left\{-\frac{1}{2\sigma^2}\left(y_t - \beta'x_t\right)^2\right\},
$$
where $\theta = (\beta', \sigma^2)'$. Now, taking logs in (31), we have that
$$
\ell(\theta; y_t, x_t) = C - \frac{T}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\sum_{t=1}^{T}\left(y_t - \beta'x_t\right)^2 ,
$$
where $C$ is a constant independent of $\theta$. From the last displayed equation, we obtain the MLE of $\theta$ as
$$
\widehat{\beta} = \left(\sum_{t=1}^{T}x_tx_t'\right)^{-1}\sum_{t=1}^{T}x_ty_t \quad \text{and} \quad \widehat{\sigma}^2 = \frac{1}{T}\sum_{t=1}^{T}\left(y_t - \widehat{\beta}'x_t\right)^2 .
$$
We observe that the MLE of $\beta$, when the errors are normally distributed, coincides with the LSE. So in this case the LSE will also be efficient in the Cramér-Rao sense. On the other hand, the MLE of $\sigma^2$ is biased, although the bias converges to zero as $T \to \infty$. More specifically,
$$
E\widehat{\sigma}^2 = \frac{T-k}{T}\sigma^2 \implies \operatorname{BIAS}\left(\widehat{\sigma}^2\right) = -\frac{k}{T}\sigma^2 \underset{T\to\infty}{\to} 0 .
$$
In the previous example, the MLE has an explicit expression. However, in general this is not the case, since it requires solving the system of equations
$$
\sum_{t=1}^{T}q\left(\widehat{\theta}; y_t\right) = 0 ,
$$
which can be highly nonlinear. Moreover, obtaining the asymptotic properties of $\widehat{\theta}$ is not that easy and requires different techniques, which will not be examined here. However, we can mention its properties:
- It is consistent, that is, $\operatorname{plim}\widehat{\theta} = \theta_0$.
- It is asymptotically normal. That is,
$$
T^{1/2}\left(\widehat{\theta} - \theta_0\right) \overset{d}{\to} N\left(0, I^{-1}(\theta)\right).
$$
- It is efficient in the Cramér-Rao sense.
- It is invariant: if $\widehat{\theta}$ is the MLE of $\theta$, the MLE of $\eta = g(\theta)$ is $\widehat{\eta} = g\left(\widehat{\theta}\right)$.
Once we have outlined the properties of the MLE, in view of our previous comment the first issue to discuss is how we can obtain $\widehat{\theta}$. This will be examined in the next section.
4.2. NON-LINEAR OPTIMIZATION.
We have seen that the LSE, IVE and MLE are based on the minimization (maximization) of an objective function $Q(\theta)$. Because obtaining the value which maximizes a function $g(\theta)$ is the same as obtaining the minimum of $-g(\theta)$, in what follows we shall talk about how to obtain the minimum.
Example 4.2. Suppose the following (nonlinear) regression model
$$
y_t = f(x_t; \theta) + u_t , \quad t = 1, \ldots, T ,
$$
where $f$ is a continuous and differentiable function with respect to $\theta$ for all $x_t$. Clearly an objective function to obtain $\widehat{\theta}$ can be based on the Residual Sum of Squares, that is,
$$
Q(\theta) = \sum_{t=1}^{T}\left(y_t - f(x_t; \theta)\right)^2 .
$$
By definition, we have that
$$
\widehat{\theta} = \arg\min_{\theta\in\Theta}Q(\theta) = \arg\min_{\theta\in\Theta}\sum_{t=1}^{T}\left(y_t - f(x_t; \theta)\right)^2 ,
$$
which will be given by the solution to the set of nonlinear equations
$$
(32)\qquad q\left(\widehat{\theta}\right) = \sum_{t=1}^{T}\frac{\partial}{\partial\theta}f\left(x_t; \widehat{\theta}\right)\left(y_t - f\left(x_t; \widehat{\theta}\right)\right) = 0 ,
$$
that is, $\widehat{\theta}$ satisfies the FOC. It is clear that the equation system in (32) is nonlinear in $\theta$. The nonlinearity comes from $f$ and its derivative with respect to $\theta$, and hence no explicit solution is expected to exist.
Definition 4.8. The solution $\widehat{\theta}$ to (32) is called the Nonlinear Least Squares (NLLS) estimator.

So, in general we have a set of observations (data) $Z = (z_1, \ldots, z_T)'$, where $z_t = (y_t, x_t')'$, and also $Q_T(Z; \theta)$, a function of the data and the unknown parameters we are interested in. For instance, in the linear regression model
$$
Q_T(Z; \theta) = (Y - X\beta)'(Y - X\beta),
$$
where $Y = (y_1, \ldots, y_T)'$ and $X = (x_1, \ldots, x_T)'$. For Maximum Likelihood, the objective function was given by
$$
Q_T(Z; \theta) = \sum_{t=1}^{T}\ell(z_t; \theta),
$$
and for Nonlinear Least Squares,
$$
Q_T(Z; \theta) = \sum_{t=1}^{T}\left(y_t - f(x_t; \theta)\right)^2 .
$$

4.3. METHODS OF NUMERICAL OPTIMIZATION.
There are many procedures implemented or used in practice; however, we will only examine those procedures which will be employed in later sections. More specifically, we shall consider:
(1) GRID SEARCH
(2) STEP-WISE MINIMIZATION
(3) NEWTON-RAPHSON
(4) GAUSS-NEWTON

4.3.1. GRID SEARCH.
This method is perhaps the most natural and intuitive of all. How does it work? Suppose that $\theta$ is scalar and, for simplicity, $\theta \in [-1, 1]$. Consider now a set of possible values of $\theta$, say $\theta_i = -1.0, -0.9, \ldots, 0.9, 1.0$, and evaluate $Q_T(\theta)$ at each of those points. Let $\theta^1 = -0.5$ be the value for which $Q_T(\theta)$ attains its minimum. Then, in the next step, consider a finer mesh of points, now on the subinterval $[-0.6, -0.4]$; that is, $\theta_i^2 = -0.60, -0.59, \ldots, -0.41, -0.40$, and suppose that at $\theta = -0.43$, $Q_T(\theta)$ is minimized. In the next step, we make an even finer partition, now on the subinterval $(-0.44, -0.42)$, and so on until the prescribed level of accuracy has been achieved.

So, the method is very simple and quite appealing. However, it becomes intractable when the dimension of the vector $\theta$ becomes greater than two, as one can imagine. In addition, the method does not work very well if the objective function has a deep trough near the minimum.
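A minimal sketch of a one-dimensional grid search with successive refinement; the objective function, the grid sizes and the number of refinement rounds are illustrative choices of mine:

```python
import numpy as np

def grid_search(Q, lo=-1.0, hi=1.0, points=21, rounds=5):
    """Minimize a scalar function Q on [lo, hi] by repeatedly refining the grid."""
    for _ in range(rounds):
        grid = np.linspace(lo, hi, points)
        values = np.array([Q(th) for th in grid])
        best = grid[values.argmin()]
        step = (hi - lo) / (points - 1)
        lo, hi = best - step, best + step       # zoom in around the current minimizer
    return best

# Example objective: Q(theta) = (theta - 0.37)^2 + 1
theta_hat = grid_search(lambda th: (th - 0.37)**2 + 1.0)
print(theta_hat)                                # approximately 0.37
```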

4.3.2. STEP-WISE MINIMIZATION.
This method becomes extremely useful when, conditional on a subset of the parameters in $\theta$, the model becomes linear, so that it is possible to obtain an explicit solution for the remaining parameters in $\theta$. We shall explain the method via an example. Consider the nonlinear regression model
$$
y_t = f(x_t; \theta) + u_t ,
$$
where
$$
\text{(a)}\quad f(x_t; \theta) = \beta_1x_{t1} + \beta_2x_{t2} + \beta_1\beta_2x_{t3} ,
$$
$$
\text{(b)}\quad f(x_t; \theta) = \beta_1x_{t1} + \beta_2(x_{t2} - \beta_3)^{-1}.
$$
When studying Generalized Least Squares in Section 5, we will see examples where (a) happens to be the case.
Let us consider model (a) first. We observe that if we fix $\beta_1$, the nonlinear regression model becomes linear. Why? Write the regression function $f$ as
$$
\beta_1x_{t1} + \beta_2\left(x_{t2} + \beta_1x_{t3}\right);
$$
then, for a specific value of $\beta_1$, say $\widetilde{\beta}_1$, both $\widetilde{\beta}_1x_{t1}$ and $x_{t2} + \widetilde{\beta}_1x_{t3}$ are known (or can be taken as observed variables), so that $\beta_2$ now enters in a linear form. Thus, we can employ LS techniques to obtain the value of $\beta_2$ which minimizes
$$
(33)\qquad \sum_{t=1}^{T}\left(y_t - \widetilde{\beta}_1x_{t1} - \beta_2\left(x_{t2} + \widetilde{\beta}_1x_{t3}\right)\right)^2 .
$$
Denote this value by $\beta_2(\widetilde{\beta}_1)$, to indicate that it depends on our preliminary guess $\widetilde{\beta}_1$. We can now reverse the roles that $\beta_1$ and $\beta_2$ have played in the above argument, so that after we obtain a preliminary guess for $\beta_2$, say $\widetilde{\beta}_2$, $\beta_1$ can be obtained by least squares in
$$
\sum_{t=1}^{T}\left(y_t - \widetilde{\beta}_2x_{t2} - \beta_1\left(x_{t1} + \widetilde{\beta}_2x_{t3}\right)\right)^2 ,
$$
obtaining $\beta_1(\widetilde{\beta}_2)$. So, we can iterate the procedure back and forth until the prescribed degree of accuracy is reached. That is, we fix $\widetilde{\beta}_2$ and obtain an estimate of $\beta_1$, say $\widehat{\beta}_1^1$. With this guess of $\beta_1$, we obtain a new estimate of $\beta_2$, say $\widehat{\beta}_2^1$, via the minimization of (33), and so on; a sketch of this alternation is given after this subsection.
For model (b) we have similar issues. That is, for fixed $\beta_3$, say $\widetilde{\beta}_3$, $\beta_1$ and $\beta_2$ can be obtained by LS techniques in
$$
\sum_{t=1}^{T}\left(y_t - \beta_1x_{t1} - \beta_2\left(x_{t2} - \widetilde{\beta}_3\right)^{-1}\right)^2 ,
$$

where our regressors are $x_{t1}$ and $(x_{t2} - \widetilde{\beta}_3)^{-1}$. Then, given values of $\beta_1$ and $\beta_2$, say $\widetilde{\beta}_1$ and $\widetilde{\beta}_2$ respectively, the estimate of $\beta_3$ can be obtained by the grid search approach in
$$
\widetilde{Q}_T(\beta_3) = \sum_{t=1}^{T}\left(y_t - \widetilde{\beta}_1x_{t1} - \widetilde{\beta}_2\left(x_{t2} - \beta_3\right)^{-1}\right)^2 .
$$
Then, as in model (a), the procedure is iterated until the desired level of accuracy is reached.
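A minimal sketch of step-wise minimization for model (a), alternating a closed-form LS step for one coefficient while the other is held fixed; the data-generating process, starting value and names are illustrative assumptions of mine:

```python
import numpy as np

rng = np.random.default_rng(8)
T = 1000
x1, x2, x3 = rng.normal(size=(3, T))
b1_true, b2_true = 0.8, -1.2
y = b1_true * x1 + b2_true * x2 + b1_true * b2_true * x3 + 0.1 * rng.normal(size=T)

b1 = 0.0                                         # initial guess
for _ in range(50):
    w2 = x2 + b1 * x3                            # with b1 fixed, beta_2 enters linearly
    b2 = np.dot(w2, y - b1 * x1) / np.dot(w2, w2)
    w1 = x1 + b2 * x3                            # with b2 fixed, beta_1 enters linearly
    b1 = np.dot(w1, y - b2 * x2) / np.dot(w1, w1)

print(b1, b2)                                    # close to 0.8 and -1.2
```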

4.3.3. NEWTON-RAPHSON.
The Newton-Raphson approach is based on quadratic approximations of the objective function $Q_T(\theta)$. Suppose that $Q_T(\theta)$ has two continuous derivatives. Applying a Taylor expansion up to its second term around a point $\theta^*$, say, we obtain
$$
Q_T(\theta) = Q_T(\theta^*) + \frac{\partial}{\partial\theta'}Q_T(\theta^*)\left(\theta - \theta^*\right) + \frac{1}{2}\left(\theta - \theta^*\right)'\frac{\partial^2}{\partial\theta\,\partial\theta'}Q_T(\bar{\theta})\left(\theta - \theta^*\right),
$$
where $\bar{\theta} = \lambda\theta + (1-\lambda)\theta^*$ and $\lambda \in [0, 1]$. Now, taking derivatives with respect to $\theta$ on both sides of the last displayed equation and evaluating at $\widehat{\theta}$ (the value which minimizes $Q_T(\theta)$), we obtain
$$
0 = q\left(\widehat{\theta}\right) = q(\theta^*) + \frac{\partial^2}{\partial\theta\,\partial\theta'}Q_T(\bar{\theta})\left(\widehat{\theta} - \theta^*\right),
$$
which implies that
$$
(34)\qquad \widehat{\theta} = \theta^* - \left(\frac{\partial^2}{\partial\theta\,\partial\theta'}Q_T(\bar{\theta})\right)^{-1}q(\theta^*).
$$
Remark 4.3. Notice that $\bar{\theta} = \lambda\widehat{\theta} + (1-\lambda)\theta^*$ is an intermediate point between $\widehat{\theta}$ and the initial guess $\theta^*$.
The problem in implementing (34) is that $\bar{\theta}$ depends on $\widehat{\theta}$, which is not known; in fact, $\widehat{\theta}$ is precisely the value that we wish to obtain. So, as it stands, (34) cannot be used or computed. So, what shall we do? One solution comes from the Newton-Raphson algorithm, which is described as follows. In (34), let us evaluate the second derivatives at $\theta^*$, denoting the resulting value by $\theta^1$. That is, the right side of (34) becomes
$$
\theta^1 = \theta^* - \left(\frac{\partial^2}{\partial\theta\,\partial\theta'}Q_T(\theta^*)\right)^{-1}q(\theta^*).
$$
With this new guess $\theta^1$, repeat the above device to obtain
$$
\theta^2 = \theta^1 - \left(\frac{\partial^2}{\partial\theta\,\partial\theta'}Q_T\left(\theta^1\right)\right)^{-1}q\left(\theta^1\right),
$$
and at the $i$-th step
$$
\theta^i = \theta^{i-1} - \left(\frac{\partial^2}{\partial\theta\,\partial\theta'}Q_T\left(\theta^{i-1}\right)\right)^{-1}q\left(\theta^{i-1}\right).
$$

We stop when the prescribed level of accuracy is reached. We have that, as $i \to \infty$, $\theta^i = \theta^{i-1}$, which implies that
$$
q\left(\theta^i\right) = 0
$$
and hence $\widehat{\theta} = \lim_{i\to\infty}\theta^i$.
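A minimal sketch of the Newton-Raphson iteration for a scalar parameter, applied to the NLLS objective of Example 4.2 with an exponential regression function; the model, the data and all names are my own illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(9)
T = 500
x = rng.uniform(0.0, 2.0, size=T)
theta_true = 1.3
y = np.exp(theta_true * x) + 0.1 * rng.normal(size=T)

def Q(th):        # objective Q_T(theta): sum of squared residuals
    return np.sum((y - np.exp(th * x))**2)

def q(th):        # first derivative of Q_T
    return np.sum(-2.0 * x * np.exp(th * x) * (y - np.exp(th * x)))

def H(th):        # second derivative of Q_T
    f = np.exp(th * x)
    return np.sum(2.0 * (x * f)**2 - 2.0 * x**2 * f * (y - f))

theta = 1.0                               # initial guess
for _ in range(20):
    step = q(theta) / H(theta)
    theta = theta - step                  # theta^i = theta^{i-1} - H^{-1} q
    if abs(step) < 1e-10:
        break
print(theta)                              # close to 1.3
```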
Weaknesses of Newton-Raphson.
This optimization procedure has two major weaknesses.
(1) There is no guarantee that $\frac{\partial^2}{\partial\theta\,\partial\theta'}Q_T\left(\theta^i\right)$ is a positive definite matrix, as we require for a minimum.
(2) $\theta^{i+1} - \theta^i$ may be too large or too small. If it is too large, it overshoots the target value $\widehat{\theta}$; that is, it may happen that $Q_T\left(\theta^i\right) > Q_T\left(\theta^{i-1}\right)$, which is not desirable. On the other hand, if it is too small, then it can take a long time (many steps) to converge, that is, until $\widehat{\theta}$ is reached.
To address the first weakness, let $\alpha$ be a positive constant such that
$$
\frac{\partial^2}{\partial\theta\,\partial\theta'}Q_T\left(\theta^i\right) + \alpha I_{k\times k} > 0 .
$$
Then the Newton-Raphson algorithm becomes
$$
\theta^{i+1} = \theta^i - \left(\frac{\partial^2}{\partial\theta\,\partial\theta'}Q_T\left(\theta^i\right) + \alpha I_{k\times k}\right)^{-1}q\left(\theta^i\right).
$$
This modification of the algorithm is known as QUADRATIC-HILL-CLIMBING.
To address the second weakness, we make use of the grid search algorithm in a very specific way. We choose $\lambda$ by grid search in
$$
\theta^{i+1} = \theta^i - \lambda\left(\frac{\partial^2}{\partial\theta\,\partial\theta'}Q_T\left(\theta^i\right)\right)^{-1}q\left(\theta^i\right).
$$
How? We want to minimize $Q_T(\theta)$. So, given $\theta^i$, we update it by
$$
\theta^i - \lambda\left(\frac{\partial^2}{\partial\theta\,\partial\theta'}Q_T\left(\theta^i\right)\right)^{-1}q\left(\theta^i\right).
$$
So, if we consider $Q_T(\theta)$ as a function of $\lambda$, not of $\theta$, i.e.
$$
Q_T^*(\lambda) = Q_T\left(\theta^i - \lambda\left(\frac{\partial^2}{\partial\theta\,\partial\theta'}Q_T\left(\theta^i\right)\right)^{-1}q\left(\theta^i\right)\right),
$$
we search for the value $\lambda$ which minimizes $Q_T^*(\lambda)$. Since $\lambda$ is scalar, we can easily implement the grid search algorithm.
To finish this subsection, we should mention that in the case where the objective function is minus the log-likelihood function, by the definition of the Hessian the iterative algorithm is equivalent to
$$
\theta^{i+1} = \theta^i - \left(\frac{\partial^2}{\partial\theta\,\partial\theta'}\ell\left(\theta^i\right)\right)^{-1}\frac{\partial}{\partial\theta}\ell\left(\theta^i\right)
= \theta^i + \left(\sum_{t=1}^{T}q\left(\theta^i; y_t\right)q'\left(\theta^i; y_t\right)\right)^{-1}\sum_{t=1}^{T}q\left(\theta^i; y_t\right),
$$
which is known as the Method of Scoring. The modification of the previous approach comes from the equality (29). One advantage of this method is that its implementation does not require computing second derivatives.

4.3.4. GAUSS-NEWTON.
This method was designed to obtain the Nonlinear Least Squares estimator, that is, to minimize
$$
(35)\qquad Q_T(\theta) = \sum_{t=1}^{T}u_t^2(\theta), \qquad u_t(\theta) = y_t - f(x_t; \theta).
$$
As in Newton-Raphson, we can take a Taylor expansion of $Q_T(\theta)$ up to its second term. For the latter objective function, the first two derivatives are given respectively by
$$
\frac{\partial}{\partial\theta}Q_T(\theta) = 2\sum_{t=1}^{T}u_t(\theta)\frac{\partial}{\partial\theta}u_t(\theta),
$$
$$
(36)\qquad \frac{\partial^2}{\partial\theta\,\partial\theta'}Q_T(\theta) = 2\sum_{t=1}^{T}\left(\frac{\partial}{\partial\theta}u_t(\theta)\frac{\partial}{\partial\theta'}u_t(\theta) + u_t(\theta)\frac{\partial^2}{\partial\theta\,\partial\theta'}u_t(\theta)\right).
$$
Now, recalling that $E\left(u_t(\theta_0)\right) = 0$, we can expect that
$$
\sum_{t=1}^{T}u_t(\theta)\frac{\partial^2}{\partial\theta\,\partial\theta'}u_t(\theta) \simeq 0 ,
$$
or at least negligible, for $T$ large enough, compared to the first term on the right of (36). Thus, we can base our iterative procedure on
$$
(37)\qquad \theta^{i+1} = \theta^i - \left(\sum_{t=1}^{T}\frac{\partial}{\partial\theta}u_t\left(\theta^i\right)\frac{\partial}{\partial\theta'}u_t\left(\theta^i\right)\right)^{-1}\sum_{t=1}^{T}\frac{\partial}{\partial\theta}u_t\left(\theta^i\right)u_t\left(\theta^i\right).
$$
The algorithm given in (37), known as Gauss-Newton, has two natural motivations. The first one was given implicitly in (37). Defining $w_t = -\frac{\partial}{\partial\theta}u_t\left(\theta^i\right)$, we can write (37) as
$$
\theta^{i+1} = \theta^i + \left(\sum_{t=1}^{T}w_tw_t'\right)^{-1}\sum_{t=1}^{T}w_tu_t\left(\theta^i\right).
$$
That is, at each step $\theta^i$ is updated by the LSE of $u_t\left(\theta^i\right)$ on $w_t$.
The second motivation is as follows. By the Mean Value Theorem,
$$
u_t(\theta) \simeq u_t(\theta^*) + \left(\theta - \theta^*\right)'\frac{\partial}{\partial\theta}u_t(\theta^*)
\qquad\left[\text{i.e. } u_t(\theta^*) - \frac{\partial}{\partial\theta'}u_t(\theta^*)\,\theta^* \simeq u_t(\theta) - \frac{\partial}{\partial\theta'}u_t(\theta^*)\,\theta\right],
$$
i.e. a regression model with regressors $\widetilde{x}_t = -\frac{\partial}{\partial\theta}u_t(\theta^*)$ and dependent variable
$$
\widetilde{y}_t = u_t(\theta^*) - \frac{\partial}{\partial\theta'}u_t(\theta^*)\,\theta^* .
$$
So $\theta^{i+1}$ is the LSE of $u_t\left(\theta^i\right) - \frac{\partial}{\partial\theta'}u_t\left(\theta^i\right)\theta^i$ on $-\frac{\partial}{\partial\theta}u_t\left(\theta^i\right)$, i.e.
$$
\theta^{i+1} = \left(\sum_{t=1}^{T}\frac{\partial}{\partial\theta}u_t\left(\theta^i\right)\frac{\partial}{\partial\theta'}u_t\left(\theta^i\right)\right)^{-1}\sum_{t=1}^{T}\left(-\frac{\partial}{\partial\theta}u_t\left(\theta^i\right)\right)\left(u_t\left(\theta^i\right) - \frac{\partial}{\partial\theta'}u_t\left(\theta^i\right)\theta^i\right)
= \theta^i - \left(\sum_{t=1}^{T}\frac{\partial}{\partial\theta}u_t\left(\theta^i\right)\frac{\partial}{\partial\theta'}u_t\left(\theta^i\right)\right)^{-1}\sum_{t=1}^{T}\frac{\partial}{\partial\theta}u_t\left(\theta^i\right)u_t\left(\theta^i\right).
$$
The weakness of this algorithm is the same as that of Newton-Raphson, and the procedure to correct for it is exactly the same.
The difference between Newton-Raphson and Gauss-Newton is that the former requires second derivatives, whereas the latter requires only first derivatives.
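A minimal sketch of the Gauss-Newton iteration (37) for the same scalar exponential regression used in the Newton-Raphson sketch above; again an illustrative model of mine, not one from the notes:

```python
import numpy as np

rng = np.random.default_rng(10)
T = 500
x = rng.uniform(0.0, 2.0, size=T)
theta_true = 1.3
y = np.exp(theta_true * x) + 0.1 * rng.normal(size=T)

theta = 1.0                                    # initial guess
for _ in range(50):
    f = np.exp(theta * x)
    u = y - f                                  # u_t(theta)
    du = -x * f                                # derivative of u_t with respect to theta
    step = np.sum(du * u) / np.sum(du * du)    # (sum du^2)^{-1} sum du * u
    theta = theta - step                       # Gauss-Newton update (37)
    if abs(step) < 1e-10:
        break
print(theta)                                   # close to 1.3
```

Note that, unlike the Newton-Raphson sketch, no second derivative of the regression function is ever computed here.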

Example 4.3. Consider the linear regression model
$$
y_t = \beta'x_t + u_t , \quad t = 1, \ldots, T ,
$$
where $u_t \sim N(0, \sigma^2)$ iid. Then the likelihood function becomes
$$
L(\theta; y, x) = \frac{1}{(2\pi\sigma^2)^{T/2}}\exp\left\{-\frac{1}{2\sigma^2}\sum_{t=1}^{T}\left(y_t - \beta'x_t\right)^2\right\},
$$
where $\theta = (\beta', \sigma^2)'$, and the log-likelihood is
$$
\ell(\theta; y, x) = C - \frac{T}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\sum_{t=1}^{T}\left(y_t - \beta'x_t\right)^2 .
$$
The FOC are thus given by the set of equations
$$
(38)\qquad \frac{\partial}{\partial\beta}\ell(\theta; y, x) = \frac{1}{\widehat{\sigma}^2}\sum_{t=1}^{T}\left(y_t - \widehat{\beta}'x_t\right)x_t = 0 ,
$$
$$
(39)\qquad \frac{\partial}{\partial\sigma^2}\ell(\theta; y, x) = -\frac{T}{2\widehat{\sigma}^2} + \frac{1}{2\widehat{\sigma}^4}\sum_{t=1}^{T}\left(y_t - \widehat{\beta}'x_t\right)^2 = 0 .
$$
From (38) we obtain that the MLE of $\beta$ is $\widehat{\beta} = \left(\sum_{t=1}^{T}x_tx_t'\right)^{-1}\sum_{t=1}^{T}x_ty_t$, and plugging $\widehat{\beta}$ back into (39) we obtain that the MLE of $\sigma^2$ is $\widehat{\sigma}^2 = T^{-1}\sum_{t=1}^{T}\left(y_t - \widehat{\beta}'x_t\right)^2$.

With regard to I(θ), we need to obtain the second derivatives,

∂²ℓ(θ; y, x)/∂β∂β' = -(1/σ̂²) Σ_{t=1}^T x_t x_t'
∂²ℓ(θ; y, x)/∂β∂σ² = -(1/σ̂⁴) Σ_{t=1}^T (y_t - β̂'x_t) x_t
∂²ℓ(θ; y, x)/∂σ²∂σ² = T/(2σ̂⁴) - (1/σ̂⁶) Σ_{t=1}^T (y_t - β̂'x_t)².

So,

I(θ) = -E(H(θ)) = diag( (1/σ²) Σ_{t=1}^T x_t x_t' , T/(2σ⁴) ),

and the variance-covariance matrix of the MLE is

I^{-1}(θ₀) = diag( σ² (Σ_{t=1}^T x_t x_t')^{-1} , 2σ⁴/T ),

which coincides with that given in Theorem 3.6. So not only is the LSE
BLUE, it is also Cramér-Rao efficient when the errors are Gaussian.
Observe that since u_t is normally distributed, E(u_t⁴) = 3σ⁴ and β̂ and σ̂²
are independent.

If x_t were stochastic, we would reach the same conclusions and

T^{1/2} ( (β̂ - β₀)', σ̂² - σ₀² )' →d N( 0, diag( σ² Σ_xx^{-1} , 2σ⁴ ) ).

In the present context we obtain I(θ) by taking the probability limit of
T^{-1} times the Hessian matrix, that is, the Cramér-Rao bound becomes

( -plim T^{-1} ∂²ℓ(θ; y, x)/∂θ∂θ' )^{-1} = diag( σ² Σ_xx^{-1} , 2σ⁴ ).

Remark 4.4. In general β̂ and σ̂² are not independent, for instance if the
distribution of u_t is not symmetric.
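A small numerical check of the example (a sketch with simulated data; the design and parameter values are arbitrary) confirms that the Gaussian ML estimators reduce to β̂ = (X'X)^{-1}X'y and σ̂² = RSS/T, with information-based variances σ̂²(X'X)^{-1} and 2σ̂⁴/T.

import numpy as np

rng = np.random.default_rng(1)
T, k = 500, 3
X = rng.standard_normal((T, k))
beta0, sigma0 = np.array([1.0, -0.5, 2.0]), 1.3
y = X @ beta0 + sigma0 * rng.standard_normal(T)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # MLE of beta (= LSE)
u_hat = y - X @ beta_hat
sigma2_hat = u_hat @ u_hat / T                 # MLE of sigma^2 (divides by T, not T-k)

# information-based covariances: sigma^2 (X'X)^{-1} for beta and 2 sigma^4 / T for sigma^2
cov_beta = sigma2_hat * np.linalg.inv(X.T @ X)
var_sigma2 = 2 * sigma2_hat**2 / T
print(beta_hat, sigma2_hat)
print(np.sqrt(np.diag(cov_beta)), var_sigma2)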

5. ASYMPTOTIC PROPERTIES OF NONLINEAR


ESTIMATORS
As we have seen, frequently we cannot expect to have an explicit solution
for the estimator b of . Two examples that we have discussed were the
nonlinear regression model
yt = f (xt ; ) + ut
and the Maximum Likelihood Estimator. Recall that to obtain b, it was
required to employ numerical algorithms.
In general, given a set of observations, say zt = (yt ; x0t )0 , we want to obtain
knowledge of an unknown set of parameters. This is done by getting the

minimum (maximum) of an objective function QT (zt ; ). Recall that for the


NLLS estimator the objective function was

Q_T(z_t; θ) = Σ_{t=1}^T (y_t - f(x_t; θ))²,

whereas for the MLE it was ℓ(z_t; θ) as given in (30).
5.1. CONSISTENCY OF THE NONLINEAR LEAST SQUARES.

To simplify notation and arguments, we are going to focus in a simpler


case, namely that xt and us are mutually independent zero mean iid random
variables with variance 2x and 2 respectively. The arguments that we are
about to describe and examine follows for the M LE in a similar fashion,
but it is better not to abstract ourselves too much. So, the model we are
going to consider is the nonlinear regression
yt = f (xt ; ) + ut , t = 1; :::; T .
Let us introduce the following conditions:
A: The parameter space is a compact subset of Rk .
B: QT (y; x; ) is a continuous function in 2 for all x and y.
C:
1 p
QT (y; x; ) ! Q ( )
T
uniformly in 2 where Q ( ) is a nonstochastic function which
attains a unique minimum at = 0 .
Remark 5.1. Note that Condition B is satised if f is continuous in for
all x.
Remark 5.2. The last part of Condition C is an (asymptotic) identication
condition of the parameters .
p
Theorem 5.1. Assuming Conditions A, B and C, we have that b ! 0.

Proof. Consider the objective function

Q_T(θ) = Σ_{t=1}^T (y_t - f(x_t; θ))²

and define the estimator of θ, θ̂, as

θ̂ = arg min_{θ∈Θ} Q_T(θ).

Let N be an open neighbourhood around θ₀, and consider N^c ∩ Θ. Since
N is an open set, N^c is closed, and Condition A then implies that
N^c ∩ Θ is compact. Let

(40)   ε = min_{N^c ∩ Θ} Q(θ) - Q(θ₀).

Next, denote by A_T the event { |T^{-1}Q_T(θ) - Q(θ)| < ε/2 for all θ ∈ Θ }.
(Going ahead a little bit, observe that Condition C implies that Pr{A_T} →
1.) Then, on the set A_T we have that

(41)   Q(θ̂) < T^{-1}Q_T(θ̂) + ε/2

and

(42)   T^{-1}Q_T(θ₀) < Q(θ₀) + ε/2.

On the other hand, by definition of θ̂, we know that

(43)   Q_T(θ̂) ≤ Q_T(θ₀).

So, combining (43) and (41) yields

Q(θ̂) < T^{-1}Q_T(θ₀) + ε/2,

and adding the last displayed inequality to (42), we obtain

Q(θ̂) + T^{-1}Q_T(θ₀) < Q(θ₀) + T^{-1}Q_T(θ₀) + ε,

which implies that

Q(θ̂) < Q(θ₀) + ε.

So, we have shown that A_T implies Q(θ̂) - Q(θ₀) < ε and hence,
together with (40), that θ̂ ∈ N. Hence, we have shown that

Pr{A_T} ≤ Pr{θ̂ ∈ N}.

But moreover, Condition C implies that Pr{A_T} → 1 and thus

Pr{θ̂ ∈ N} → 1.

That is,

Pr{ |θ̂ - θ₀| > δ } → 0

for any arbitrary δ > 0. Observe that N is a neighbourhood of θ₀, that is,
N = { θ : |θ - θ₀| < δ } for arbitrary δ > 0.
What we now should do is try to give more primitive and easier to check
conditions on yt , xt and the function QT ( ) under which Conditions A, B
and C hold true. So far Conditions A and B, they are pretty easy to check,
so we do not need to give any further conditions. The only condition which
is a bit more problematic is Condition C. Therefore, we are going to give a
set of Conditions which will guarantee or imply Condition C. This will be
done in the form of a Lemma.
Lemma 5.2. Let g (z; ) a function on z and where 2 . Assuming
that
(i) is a compact set in Rk .
(ii) g (z; ) is continuous in for all z.
(iii) E (g (z; )) = 0.
(iv) z1 ; :::; zT are iid such that

E sup jg (zt ; )j < 1,

that is, sup 2 jg (z; )j L (z) whose expectation is nite.


Then, uniformly in we have that
T
1X p
g (zt ; ) ! 0.
T
t=1

Proof. We are going to use very much assumptions (i) and (ii), and the fact
that those two assumptions imply that g (z; ) is uniformly continuous in .
That is a compact set implies that 9 a partition of , say 1 ; ::::; n such
that
= [ni=1 i and i \ j = ; for all i 6= j.
1 2 i,
Also we can choose the partition such that 8 and 2 i = 1; :::; n,
1 2
(44) <
for any arbitrary small > 0.
Let 1 ; :::; n be a sequence of 0 s such that i 2 i . We need to show
that for any arbitrary " > 0,
( T
)
1X
lim Pr sup g (zt ; ) > " = 0.
T !1 2 T
t=1

But the left side of the last displayed expression is bounded by


(45)
( ( )) ( )
X T X n T
X
1 1
Pr [ni=1 sup g (zt ; ) > " Pr sup g (zt ; ) > "
2 i T t=1 2 i T i=1 t=1

since Pr fA [ Bg Pr fAg + Pr fBg. Now, after adding and subtracting


i
g zt ; , we have that the triangle inequality implies that
T T T
1X 1X i 1X i
g (zt ; ) g zt ; + g (zt ; ) g zt ; .
T T T
t=1 t=1 t=1

So, using the last displayed inequality, we have that the right side of (45)
is bounded by
(46) ( ) n ( )
X n T T
1X i
X 1X i
Pr g zt ; > "=2 + Pr sup g (zt ; ) g zt ; > "=2 .
T 2 i T
i=1 t=1 i=1 t=1

From here, to complete the proof it su ces to show that both terms of (46)
converge to zero.
We begin showing that the second term on the right of (46) converges to
zero. By Markovs inequality and that zt are iid, that term is bounded by
n
2X i
E sup g (zt ; ) g zt ; .
" 2 i
i=1
i
But for all z, and continuity of g (z; ) and (44) imply that sup 2 i g (z; ) g z; !
0, whereas by condition (iv), we obtain that
i
sup g (zt ; ) g zt ; 2 sup jg (zt ; )j 2L (z)
2 i 2

whose expectation is nite. So, by dominated convergence theorem, we


conclude that
n
2X
E sup g (zt ; ) g zt ; i <
" 2 i
i=1

which completes the proof that the second term on the right of (46) converges
to zero.
To nish the proof it su ces to show that the rst term of (46) also
converges to zero. But this is the case because zt are iid, which implies
that g zt ; i is also an iid sequence of random variables with nite rst
moments. So, Khintchines (or Kolmogorov) theorem, and because since
Eg zt ; i = 0, implies that the last expression converges to zero.
The next lemma specializes the above result for the nonlinear least squares.
Consider
yt = f (xt ; ) + ut , t = 1; :::; T .
Lemma 5.3. Assume that yt is scalar, 0 is a (k 1) vector of unknown
parameters, ut is a sequence of iid 0; 2 random variables and xt is deter-
ministic. In addition, assume that
(A) (@=@ ) f (x; ) is a continuous function in 2 for all x.
(B) f (x; ) is continuous in for all x (uniformly). That is,
1 2
f x; f x; <"
1 2 1 2
whenever < for all ; 2 and for all x.
(C) Uniformly in 1 ; 2 2 ,
T
1X 1 2 1 2
f x; f x; ! x; ; .
T
t=1
(D) If 6= 0,
T
1X
jf (x; 0) f (x; )j2 ! (x; 0; ) > 0.
T
t=1

Then, the solution b to the set of equations


T
@ X
(yt f (xt ; ))2 = 0
@
t=1
is consistent.
Proof. By denition,
T T T
1X 2 1X 2 1X 2
(yt f (xt ; )) = ut ( 0) + (f (xt ; ) f (xt ; 0 ))
T T T
t=1 t=1 t=1
XT
2
(47) + (f (xt ; 0) f (xt ; )) ut ( 0) .
T
t=1

The rst term on the right of (47) converges to 2 by Khintchines Theorem.


Recall that u2t ( 0 ) = u2t which is a sequence of iid random variables with
nite rst moments. If there were not identically distributed, will it be the
result still true? Under what conditions?
Next, by (C) and (D), the second term on the right of (47) converges to
a function uniformly in with a minimum equal to zero at 0 . Thus, to
complete the proof it su ces to show that the third term on the right of (47)
converges to zero uniformly in in probability. But this is the case as we

now show. The arguments to be used are similar to those of the proceeding
lemma. First, it is obvious that
T
1X p
(48) f (xt ; 0 ) ut ( 0 ) !0
T
t=1

since ut ( 0 ) = ut are iid 0; 2 random variables and f (xt ; 0 ) are con-


stants such that f 2 (xt ; 0 ) K. So, by Chevyshevs Theorem, why cannot
we use Khintchines Theorem? we have that (48) holds true, since the second
moment of the term on the left of the last displayed expression is
2 T
X 2
K
f 2 (xt ; 0) ! 0.
T2 T
t=1

So, the proof is completed if we show that


T
1 X p
(49) sup f (xt ; ) ut ( 0) ! 0.
2 T t=1
1 2 1 2
We have assumed in (B) that f xt ; f xt ; < " if < ,
which implies that
1 2 "
f xt ; f xt ; < 1=2
2 2 +1
0

for 1 and 2 2 Ni , for i = 1; :::; n, which forms a partition of the parameter


space . So by the triangle inequality and the Cauchy-Schwarz inequality
we have that
T T
1 X 1 X i
(50)sup f (xt ; ) ut ( 0) f xt ; ut ( 0)
2Ni T T
t=1 t=1
T
!1=2
1X 2 "
+ ut ( 0) 1=2
,
T 2 2 +1
t=1 0
i
where is a xed arbitrary point in Ni . Therefore
( T
) n
( T
)
1 X X 1 X
Pr sup f (xt ; ) ut ( 0 ) > " Pr sup f (xt ; ) ut ( 0 ) > "
2 T t=1 i=1 2Ni T t=1
n
( T
)
X 1 X
Pr f xt ; i ut ( 0 ) > "
T
i=1 t=1
( T
)
1X 2
+n Pr ut ( 0 ) > 20 + 1 .
T
t=1

In the rst inequality


Pn of the last displayed expression we have used that
n
Pr f[i=1 Ai g i=1 Pr fAi g, whereas in the second inequality we have
made use of (50). But both terms on the right of the last displayed inequality
converge to zero which proves (49) and the lemma.

5.2. ASYMPTOTIC NORMALITY.


When we studied numerical optimization, we saw that the estimator b
could be written, using for example the Newton-Raphson algorithm, as
T
! 1 T
@ 2 X 2 @ X
b = y t f x t ; e (yt f (xt ; 0 ))2 ,
0
@ @ 0 t=1
@
t=1

where e 0
b
0 . Remember that in general we had
1
b @2 @
0 = QT e QT ( 0 ) .
@ @ 0 @
p p
So far we have given su cient conditions under which b 0 ! 0 or b 0 !
0. So, what about its limiting distribution? We will focus on the nonlinear
least squares, but once again, you can guess that the results and arguments
apply for the M LE as well.
Lemma 5.4. Assuming (A) (D) of Lemma 2, and in addition
i: 0 is an interior point of .
ii: Uniformly for all 2 N ( 0 ), a neighbourhood of 0 ,
T
1X @ @
f (xt ; ) 0 f (xt ; ) ! ( ) > 0.
T @ @
t=1

iii: @ 2 f (xt ; ) = @ @ 0 is continuous in a neighbourhood of 0 uni-


formly in xt and for all 2 N ( 0 ),
T 2
1 X @ 2 f (xt ; )
! 0.
T2
t=1
@ @ 0
iv:
T
1 X @ d 2
f (xt ; 0) ut ! N 0; D .
T 1=2 t=1
@
v: Uniformly for all 2N( 0)
T
1X @ @ 2 f (xt ; )
f (xt ; ) ! e.
T
t=1
@ @ @ 0
Then, as T ! 1
d
T 1=2 b 0 ! N 0; 2 1
( 0) D
1
( 0) .

Proof. Using Newton-Raphson algorithm, we obtain that


(51)
T
! 1 T
1=2 b 1 X @2 e
2 1 X @ 2
T 0 = yt f x t ; (yt f (xt ; 0 ))
T
t=1
@ @ 0 T 1=2 t=1 @
p
because we already know that b 0 ! 0 and 0 is an interior point of the
compact set by i), where e b
0 0 . Why, or in other words
if 0 belongs to the boundary of , couldnt we obtain the above equation?
We examine each factor on the right of (51) separately.

We begin with the second factor on the right of (51). By standard dier-
entiation,
T T
1 X @ 2 2 X @
(yt f (xt ; 0 )) = f (xt ; 0) ut
T 1=2t=1
@ T 1=2 t=1
@
d 2
! N 0; 4 D
by iii). Next, we examine the rst factor on the right of (51). First,
T T
1 X @2 2 2X @ @
0 yt f xt ; e = f xt ; e f xt ; e
T
t=1
@ @ T
t=1
@ @ 0
XT
2 @2
(52) f xt ; e ut e .
T
t=1
@ @ 0
p
Because e is an intermediate point between 0 and b , it implies that e ! 0.
Moreover, since the convergence is uniform in by ii), we have that
T
2X @ @ p
f xt ; e f xt ; e !2 ( 0) > 0.
T
t=1
@ @ 0
On the other hand, iii) implies that uniformly in 2N( 0 ),
T 2
1 X @ 2 f (xt ; )
! 0.
T2
t=1
@ @ 0

Then, since ut e ut ( 0) and because ut ( 0) are iid, Chevyshevs The-


orem implies that
T
2 X @2 p
e ut e !
0 f xt ; 0.
T @ @
t=1

Thus, the left side of (52) converges in probability to 2 ( 0 ). Then, apply


Cramrs Theorem to conclude.
Remark 5.3. Typically ( 0 ) = D, as is the case in Lemma 3. Then, in
this situation we obtain that
d
T 1=2 b 0 ! N 0; 2
D 1
N 0; 2 1
( 0) .

5.3. TWO STEP ESTIMATORS.


We have described some iterative/numerical optimization procedures to
obtain the root of
@
QT b = 0,
@
that is, the value b such that minimizes the objective function QT ( ). Also,
we have presented its asymptotic properties such as consistency and asymp-
totic normality.
However sometimes it is possible to obtain a T 1=2 -consistent estimator of
e b
0 , say , which is explicit, although maybe not as e cient as . A classical
example is in a regression model where the errors are not Gaussian. In this

case, LSE is T 1=2 -consistent, although less e cient than the M LE, assum-
ing that the probability distribution function of the errors ut is known. The
question is the following. Suppose that we implement an iterative procedure
to obtain the M LE and we start the algorithm using the LSE. Then, what
are the properties of the estimator after one iteration? This is known as
T WO ST EP estimators.
Theorem 5.5. Let {y_t}_{t=1}^T be a sequence of iid random variables with probability
density f(y; θ). Assume that θ¹ is a preliminary T^{1/2}-consistent estimator
of θ. Denote by Q_T(θ) the log-likelihood function and by q(θ) = ∂Q_T(θ)/∂θ its score; then

(53)   θ² = θ¹ - ( ∂²Q_T(θ¹)/∂θ∂θ' )^{-1} q(θ¹)

is Cramér-Rao efficient.
Proof. Subtracting 0 on both sides of (53), we have that
1
2 1 @2 1 1
0 = 0 QT q
@ @ 0
1
1 @2 1 1
= 0 QT q q ( 0)
@ @ 0
1
@2 1
QT q ( 0) .
@ @ 0
We have already mentioned that
1 1
!
1 @2 1 1 d @2
QT q ( 0) ! N 0; p lim QT ( 0 )
T@ @ 0 T 1=2 @ @ 0
1
N 0; I ( 0) .
So, by Theorem 2.7, it su ces to show that
1
!
1=2 1 @2 1 1 P
T 0 QT q q ( 0) ! 0.
@ @ 0
By the mean value theorem, the left side of the last displayed expression is
1
1=2 1 1 @2 1 1 @2
T 0 QT QT T 1=2 1
0
T@ @ 0 T@ @ 0
" 1
#
1=2 1 1 @2 1 1 @2
= T 0 I QT QT
T@ @ 0 T@ @ 0
where is an intermediate point between 1 and 0 . But, by assumption
T 1=2 1 0 converges in distribution, so to complete the proof, it su ces
to show that
1
1 @2 1 1 @2 P
I QT QT ! 0.
T@ @ 0 T@ @ 0
However, since 1 !P 0 it implies that is also consistent. So using
2 2
that for the function @ @@ 0 QT ( ) is continuous, then T1 @ @@ 0 QT 1 and
1 @2
T @ @ 0 QT converge to the same matrix, which is positive denite, and

therefore by Theorem 2.3 the left side of the last displayed expression con-
verge to zero in probability, which completes the proof of the theorem.
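A minimal sketch of the two-step idea, assuming a Student-t(5) location model (a hypothetical example, not taken from the notes): the sample mean is a T^{1/2}-consistent preliminary estimator, and one Newton step of the form (53) on the log-likelihood already delivers an asymptotically efficient estimator.

import numpy as np

nu = 5.0
rng = np.random.default_rng(2)
y = 1.0 + rng.standard_t(nu, size=2000)        # true location = 1.0

def score(mu):
    # d/d mu of the t(5) location log-likelihood
    e = y - mu
    return np.sum((nu + 1.0) * e / (nu + e**2))

def hessian(mu):
    # second derivative of the log-likelihood with respect to mu
    e = y - mu
    return np.sum((nu + 1.0) * (e**2 - nu) / (nu + e**2)**2)

mu1 = y.mean()                               # preliminary T^{1/2}-consistent estimator
mu2 = mu1 - score(mu1) / hessian(mu1)        # one Newton step, as in (53)
print(mu1, mu2)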

5.4. THE GENERALIZED METHOD OF MOMENTS.


So far we have examined several methods to estimate the parameters
of a model, namely the LSE (linear and nonlinear), the IVE and the MLE. In all of these methods
what we have done is to find the root of the first order conditions. That is,

(1) For the LSE, it was

T^{-1} Σ_{t=1}^T x_t (y_t - β̂'x_t) = 0.

(2) For the NLLS, it was

T^{-1} Σ_{t=1}^T ∂f(x_t; θ̂)/∂θ · (y_t - f(x_t; θ̂)) = 0.

(3) For the IVE, it was

T^{-1} Σ_{t=1}^T z_t (y_t - β̂'x_t) = 0.

(4) For the MLE, it was

T^{-1} Σ_{t=1}^T q_t(θ̂) = 0.

But why do we equate these to zero? The reason comes from the assumption
that, if the model is correctly specified,

E(x_t u_t) = 0,   E( ∂f(x_t; θ₀)/∂θ · u_t ) = 0,   E(z_t u_t) = 0   or   E(q_t(θ₀)) = 0.

So, what we have done is simply to find the value θ̂ that equates the sample
moment to the (theoretical) population moment, without loss of generality
equal to zero. For instance, if you have {y_t}_{t=1}^T and you wish to estimate its
mean, one estimator is the sample mean

(54)   m̂ = T^{-1} Σ_{t=1}^T y_t.

Why? The reason is as follows. Let z_t = y_t - m₀, where m₀ = E(y_t). Then
E(y_t - m₀) = 0 and we replace it by its sample counterpart,

(55)   T^{-1} Σ_{t=1}^T (y_t - m).

Then we try to obtain the value m, m̂, such that the sample moment given
in (55) equals the population moment, i.e. zero. But this is just (54).

Remark 5.4. From the above arguments we can see why the LSE is inconsistent
if E(x_t u_t) = μ ≠ 0. Indeed, from the definition of the LSE,

0 = T^{-1} Σ_{t=1}^T x_t (y_t - β̂'x_t) = T^{-1} Σ_{t=1}^T x_t u_t + T^{-1} Σ_{t=1}^T x_t x_t' (β₀ - β̂).

Denoting w_t = x_t u_t, with E(w_t) = μ, it follows that for the latter expression to
hold true we need plim(β̂ - β₀) ≠ 0. More specifically, we actually obtain that

plim β̂ = β₀ + Σ_xx^{-1} μ.
The general idea of the Generalized Method of Moments (GMM) is as
follows. Suppose we have some data {z_t}_{t=1}^T and we wish to estimate θ₀, which
satisfies the moment conditions

E ψ(z_t; θ₀) = 0,

where θ is a p × 1 column vector and ψ(·) is a q × 1 vector of equations such
that q ≥ p. So, it seems natural that our estimator θ̂ of θ₀ would be the one
for which the sample moments equal zero,

(56)   T^{-1} Σ_{t=1}^T ψ(z_t; θ̂) = 0.

However, if q > p, in general there is no solution to the set of equations given
in (56). In fact, we can find many "solutions" θ̂, each of them consistent.
Comparing our discussion to that of the IVE, what we would like to do is
to combine all the possible solutions by looking at

θ̂ = θ̂_GMM = arg min_{θ∈Θ} ( T^{-1} Σ_{t=1}^T ψ(z_t; θ) )' A_T ( T^{-1} Σ_{t=1}^T ψ(z_t; θ) ),

where A_T is some positive definite matrix such that plim A_T = A > 0. This
is the estimator explored in Hansen's (1982) Econometrica paper.

The asymptotic properties of θ̂ are: (i) it is consistent, and
(ii) T^{1/2}(θ̂ - θ₀) →d N(0, Ω), where Ω depends on the matrix A_T (A).

It can be shown that, under some suitable regularity conditions, the best
choice of A is given by

( E ψ(z_t; θ₀) ψ(z_t; θ₀)' )^{-1}

or

(57)   A_T^{-1} = T^{-1} Σ_{t=1}^T ψ(z_t; θ̃) ψ(z_t; θ̃)',

where θ̃ is some preliminary estimator of θ, say that obtained with A_T = I,

θ̃ = arg min_{θ∈Θ} ( T^{-1} Σ_{t=1}^T ψ(z_t; θ) )' ( T^{-1} Σ_{t=1}^T ψ(z_t; θ) ).

Then, choosing A_T as given in (57), we have that

T^{1/2}(θ̂ - θ₀) →d N(0, Ω₀),

where

Ω₀ = ( E[∂ψ(z_t; θ₀)'/∂θ] ( E ψ(z_t; θ₀) ψ(z_t; θ₀)' )^{-1} E[∂ψ(z_t; θ₀)/∂θ'] )^{-1}.
Example 5.1. Consider the IVE. The criterion function was

( T^{-1} Σ_{t=1}^T (y_t - x_t'β) w_t )' ( T^{-1} Σ_{t=1}^T w_t w_t' )^{-1} ( T^{-1} Σ_{t=1}^T (y_t - x_t'β) w_t ).

So, in this example ψ(z_t; β) = (y_t - x_t'β) w_t and A_T = ( T^{-1} Σ_{t=1}^T w_t w_t' )^{-1},
which converges in probability to A = ( plim T^{-1} Σ_{t=1}^T w_t w_t' )^{-1} = ( E w_t w_t' )^{-1}.

Obviously, instead of A_T we could have used a general matrix, say B_T,
and estimate β by

β̃ = arg min_{β∈Θ} ( T^{-1} Σ_{t=1}^T (y_t - x_t'β) w_t )' B_T ( T^{-1} Σ_{t=1}^T (y_t - x_t'β) w_t ),

which is

β̃ = ( X'W B_T W'X )^{-1} X'W B_T W'Y.

Observe that if the dimension of w_t, q, is equal to p, the dimension of the
parameter (which is that of x_t), then

β̃ = ( W'X )^{-1} W'Y,

that is, the estimator becomes independent of the choice of the weighting matrix
A_T (B_T). So, it is only when q > p that the estimator is not independent
of the choice of the matrix A_T. That is, the efficiency of the estimator depends
on A_T.

Following the general result, the question is: what is the best choice of
A_T? In our particular example ψ(z_t; β₀) = (y_t - x_t'β₀) w_t, so

E ψ(z_t; β₀) ψ(z_t; β₀)' = E( u_t² w_t w_t' ) = σ² E( w_t w_t' ),

which is estimated by σ̂² T^{-1} Σ_{t=1}^T w_t w_t'. Hence, as we know, the choice is

A_T = ( T^{-1} Σ_{t=1}^T w_t w_t' )^{-1}.

Remark 5.5. Note that multiplicative constants do not affect β̃ because

β̃ = arg min_{β∈Θ} ( T^{-1} Σ_{t=1}^T (y_t - x_t'β) w_t )' B_T ( T^{-1} Σ_{t=1}^T (y_t - x_t'β) w_t )
  = arg min_{β∈Θ} ( T^{-1} Σ_{t=1}^T (y_t - x_t'β) w_t )' σ^{-2} B_T ( T^{-1} Σ_{t=1}^T (y_t - x_t'β) w_t ).
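The following sketch implements two-step GMM for the linear IV moment ψ(z_t; β) = w_t(y_t - x_t'β) discussed in the example: first step with A_T = (T^{-1} Σ w_t w_t')^{-1}, second step with the weighting matrix of (57) built from first-step residuals. The data and the instrument choices are simulated and purely illustrative.

import numpy as np

def gmm_linear_iv(y, X, W):
    T = len(y)
    # step 1: weighting matrix A_T = (T^{-1} sum w_t w_t')^{-1}
    A1 = np.linalg.inv(W.T @ W / T)
    b1 = np.linalg.solve(X.T @ W @ A1 @ W.T @ X, X.T @ W @ A1 @ W.T @ y)
    # step 2: optimal weighting matrix from first-step residuals, as in (57)
    u = y - X @ b1
    S = (W * u[:, None]).T @ (W * u[:, None]) / T    # T^{-1} sum psi_t psi_t'
    A2 = np.linalg.inv(S)
    b2 = np.linalg.solve(X.T @ W @ A2 @ W.T @ X, X.T @ W @ A2 @ W.T @ y)
    return b2

# usage: x_t endogenous, w_t = (1, z_t, z_t^2) as instruments (q = 3 > p = 2)
rng = np.random.default_rng(3)
T = 1000
z = rng.standard_normal(T)
v = rng.standard_normal(T)
u = 0.8 * v + rng.standard_normal(T)
x = z + v
y = 1.0 + 2.0 * x + u
X = np.column_stack([np.ones(T), x])
W = np.column_stack([np.ones(T), z, z**2])
print(gmm_linear_iv(y, X, W))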

Up to now we have unconditional moments. Often we work with condi-


tional moments. That is, not only E (xt ut ) = 0 but we have the stronger
condition
E (ut j xt ) = 0
holds true. In mathematical terms, in the regression model
0
yt = xt + ut
not only E (xt ut ) = 0 but
(58) E ( (yt ; xt ; ) j xt ) = 0,
0
with (yt ; xt ; ) = yt xt . The implication of (58) is that not only
E (xt (yt ; xt ; )) = 0
but for any function g (xt ),
E (g (xt ) (yt ; xt ; )) = 0.
How can we estimate as the number of moments is huge, one per function
g (xt )? As we did before, we form the quadratic form
(59)
T
! T ! 1 T !
X X X
QT ( ) = g (xt ) (yt ; xt ; ) g (xt ) g 0 (xt ) g (xt ) (yt ; xt ; )
t=1 t=1 t=1

and the GM M estimator b becomes


(60) b = arg min QT ( ) .
2

We can expect that the e ciency of our estimator b will depend very
much on the choice of g ( ) in (59). Under suitable regularity conditions, the
best choice is
@
g (xt ) = E (yt ; xt ; ) j xt
@
since the lower bound for an estimator b given in (60) is
@ @
E (yt ; xt ; ) (yt ; xt ; ) .
@ @ 0
Remark 5.6. The variance-covariance matrix of b is
1
@ 1 @
E g (xt ) (yt ; xt ; ) E g (xt ) g 0 (xt ) E g (xt ) (yt ; xt ; ) .
@ 0 @
The optimal g (xt ) in (60) is not easy and no explicit functional form ex-
ists. Although work has been done in Econometrica by Robinson (1987) and
Newey (1992).

6. GENERALIZED LEAST SQUARES ESTIMATORS


We have examined the consequences, and how to solve the problem, of
dropping the condition that the mean of u_t is zero, i.e. when E(u_t x_t) ≠ 0.
In this section, we turn our attention to the consequences of dropping the
assumption of constant variance and/or uncorrelatedness of u_t in

(61)   Y = Xβ + U.

Hence, in this section, as was done in cases 1) and 2), we will examine
the consequences (and remedies) of having E(UU') = Ω, for general Ω.

Proposition 6.1. Assuming A1-A3, except that E(UU') = Ω, the LSE of β
is unbiased. Moreover, its variance-covariance matrix is given by

Var( β̂_LSE ) = (X'X)^{-1} X'ΩX (X'X)^{-1}.

Proof. β̂_LSE - β = (X'X)^{-1}X'U, so its expectation is 0. Next,

Var( β̂_LSE ) = E[ (X'X)^{-1} X'UU'X (X'X)^{-1} ]
             = (X'X)^{-1} X'E(UU')X (X'X)^{-1}
             = (X'X)^{-1} X'ΩX (X'X)^{-1},

which completes the proof.

Thus, the first thing we observe is that the usual formula σ²(X'X)^{-1} no
longer holds; the latter is only true if Ω = σ²I_T. Next, given that
the basic properties of the LSE are not affected by allowing E(UU') ≠ σ²I_T,
one question is how good the LSE is. We already know, by Theorem 3.11,
that under some regularity conditions β̂_LSE is BLUE. However, to obtain
that result, it was assumed that Ω = σ²I_T, which is not the case now.

Consider the model (61). We know by Theorem 3.11 that the BLUE
of β is the LSE when E(UU') = σ²I_T. So one idea to obtain the BLUE is
first to transform the model in (61) to restore the "ideal conditions". To
that end, consider Ω^{-1/2}. If we premultiply both sides of (61) by Ω^{-1/2}, we
obtain the model

(62)   Y* = X*β + U*,

where

Y* = Ω^{-1/2}Y,   X* = Ω^{-1/2}X,   and   U* = Ω^{-1/2}U.

Note that if U ~ (0, Ω), then U* ~ (0, I_T). So (62) satisfies all the "ideal"
conditions and thus the LSE of β in (62) is BLUE, that is,

(63)   β̃ = (X*'X*)^{-1} X*'Y* = (X'Ω^{-1}X)^{-1} X'Ω^{-1}Y.

Definition 6.1. The estimator β̃ in (63) is called the (UNFEASIBLE)
GENERALIZED LEAST SQUARES (UGLS) ESTIMATOR.

We know, by Theorem 3.11, that β̃ is BLUE because it is the LSE in the
model (62), which satisfies all the "ideal" conditions. So what we can
say is that Var(β̂_LSE) - Var(β̃) ≥ 0. Indeed, from (62), Var(β̃) =
(X'Ω^{-1}X)^{-1}, so

Var(β̂_LSE) - Var(β̃) = (X'X)^{-1} X'ΩX (X'X)^{-1} - (X'Ω^{-1}X)^{-1} ≥ 0.
Next we examine the asymptotic properties of the LSE.

Proposition 6.2. Assuming A1-A3, except that E(UU') = Ω, and that

lim_{T→∞} X'ΩX / T < ∞,

β̂_LSE is consistent.

Proof. Because β̂_LSE - β = (X'X/T)^{-1}(X'U/T), we have E[X'U] = 0
and, by assumption,

E[ (X'U/T)(X'U/T)' ] = T^{-2} E[X'UU'X] = T^{-2} X'ΩX → 0.

So β̂_LSE → β in second mean; then use Theorem 2.2 to conclude.

The statistical properties of β̃ do not need to be examined since, as we have
argued above, β̃ is simply the LSE in the transformed model (62),
and thus it is consistent, asymptotically normal and efficient in the
Gauss-Markov sense.

So, we can summarize our findings about β̂_LSE as:
(1) UNBIASED;
(2) INEFFICIENT COMPARED TO THE (U)GLS;
(3) σ²(X'X)^{-1} is not the true variance-covariance matrix, which is instead (X'X)^{-1}(X'ΩX)(X'X)^{-1}.
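A minimal numerical sketch of these points, assuming a known diagonal Ω (all data are simulated and illustrative): it computes the LSE and the UGLS of (63) and compares the corresponding variance formulas.

import numpy as np

def ugls(y, X, Omega):
    # beta_gls = (X' Omega^{-1} X)^{-1} X' Omega^{-1} y, as in (63)
    Oi = np.linalg.inv(Omega)
    return np.linalg.solve(X.T @ Oi @ X, X.T @ Oi @ y)

rng = np.random.default_rng(4)
T = 200
X = np.column_stack([np.ones(T), rng.standard_normal(T)])
sig2 = np.exp(0.5 + 0.8 * X[:, 1])          # known (illustrative) variances sigma_t^2
Omega = np.diag(sig2)
y = X @ np.array([1.0, 2.0]) + np.sqrt(sig2) * rng.standard_normal(T)

beta_ls = np.linalg.solve(X.T @ X, X.T @ y)
beta_gls = ugls(y, X, Omega)

# Var(LSE) = (X'X)^{-1} X'Omega X (X'X)^{-1};  Var(UGLS) = (X'Omega^{-1}X)^{-1}
XtXi = np.linalg.inv(X.T @ X)
var_ls = XtXi @ X.T @ Omega @ X @ XtXi
var_gls = np.linalg.inv(X.T @ np.linalg.inv(Omega) @ X)
print(beta_ls, beta_gls)
print(np.diag(var_ls), np.diag(var_gls))    # UGLS variances are no larger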
6.0.1. UNKNOWN.
It is quite unrealistic to pretend that we know . So, the question is,
what can we do? Although for the computation of the LSE, we do not
needed to know , if our aim is to perform hypothesis testing, we would
need to estimate as its variance-covariance depends on . As it stands,
the matrix has far too many parameters, i.e. has T (T + 1) =2 distinct
elements. Thus, it seems unlikely that we can estimate the components of
with only T observations, which is much smaller than the # of parameters.
So the standard procedure is to assume that = ( ), where =
0 b
( 1 ; :::; m ) . Let be an estimator of then we compute
b= b .

Denition 6.2. The (feasible) GLS estimator of is given by


1
(64) b = X0 b 1
X X0 b 1
Y.

We need to examine (i) is b consistent? (ii) Is b asymptotically normal?,


(iii) is there any loss of e ciency using b ? We now answer these questions.

Proposition 6.3. A su cient condition for the consistency of b is


X 0 b 1X X 0 b 1U
(a) 0 plim 1; (b) plim = 0.
T T
Proof. By denition in (64)
1
b = X0 b 1
X X0 b 1
U.

Now Theorem 2.3 implies that


! 1
X 0 b 1X X 0 b 1U
plim b = plim plim = 0,
T T

because (a) and (b).

Theorem 6.4. A su cient condition for e and b to have the same asymp-
totic distribution is
X0 b 1 1 X
(a) plim = 0
T
X0 b 1 1 U
(b) plim = 0.
T 1=2
Proof. Theorem 2.7 implies that it su ces to show that
P
(65) T 1=2 e T 1=2 b ! 0.

Now the left side of (65) is


1
! 1
X0 1X X 0 1U X 0 b 1X X 0 b 1U
T T 1=2 T T 1=2
1
!
X0 1X X 0 1U X 0 b 1U
=
T T 1=2 T 1=2
8 ! 19 ( )
< X 0 1X 1
X 0 b 1X = X 0 1U X 0 b 1U
+
: T T ; T 1=2 T 1=2
8 ! 19
< X 0 1X 1
X 0 b 1X = X 0 1U
(66) + :
: T T ; T 1=2

The rst term on the right of (66) converges to zero in probability, because
the rst factor converges in probability to somewhere nite and the second
factor to zero in probability by assumption. By assumption both factors of
the second term on the right of (66) converge to zero in probability, and thus
by Theorem 2.3 the product will as well. Finally, the third term on the right
of (66) also converges to zero in probability, because its rst factor converges
to zero in probability, whereas the second one converges in distribution to a
Normal random variable. Thus, we conclude that the third term converges
to zero in probability and again by Theorem 2.3 we conclude (65).

Remark 6.1. (1) The assumptions of Theorem 6.4 are satisfied almost
always, although they need to be checked.
(2) For the properties of the feasible GLS estimator β̂, we must rely on asymptotic approximations.
Also notice that β̂ is no longer a linear estimator.
(3) Finally, an interpretation of the (U)GLS of β is that β̃ minimizes
the objective function

Q(β) = (Y - Xβ)' Ω^{-1} (Y - Xβ).

We are now going to look at two scenarios for Ω = Ω(γ). In the first one we
assume that Ω is diagonal but with distinct components, whereas
in the second one the elements off the main diagonal of Ω are allowed to
differ from zero but the diagonal elements are the same.

6.1. HETEROSCEDASTICITY.
Heteroscedasticity appears when the variance of u_t varies across observations,
or in other words, when

E(u_t²) = σ_t² ≠ σ²,   t = 1, ..., T.

A standard situation is cross-section data: observations on households
or firms at some particular period. We may think that the errors
depend on x_t. In a consumption-income relationship, one can expect that
expected food consumption would be much the same for those with
low income as for those with high income.

Thus, it is reasonable to expect that when x_t is large, y_t - E(y_t | x_t) can
be larger. This effect might be captured by assuming that u_t is drawn from a
distribution with a different variance. When σ_t² = σ² for all t, we say that the
errors are homoscedastic.
6.1.1. Estimator of β under heteroscedasticity.
Our model is given by

y_t = β'x_t + u_t,   t = 1, 2, ..., T,

where E(u_t) = 0, E(u_t²) = σ_t² and E(u_t u_s) = 0 if t ≠ s.

In this case, Ω = diag(σ_1², σ_2², ..., σ_T²). If Ω were known, we could compute
the UGLS

β̃ = (X*'X*)^{-1} X*'Y*,

where X* = Ω^{-1/2}X and Y* = Ω^{-1/2}Y. In our case, the t-th rows of X* and
Y* are x_t'/σ_t and y_t/σ_t, respectively. Thus, β̃ is simply the LSE in

y_t/σ_t = β'(x_t/σ_t) + ε_t,   t = 1, 2, ..., T,

where E(ε_t²) = 1 for all t.

β̃ is often known as weighted least squares (WLS), because what we
are doing is to weight each observation by the inverse of the standard deviation
of u_t. Thus, if an observation has a huge variance then its relative
importance in the estimation procedure will be very small compared with
an observation with a small relative variance. The estimator becomes

β̃ = ( Σ_{t=1}^T x_t x_t'/σ_t² )^{-1} Σ_{t=1}^T x_t y_t/σ_t²

and its variance-covariance matrix is

Var(β̃) = ( Σ_{t=1}^T x_t x_t'/σ_t² )^{-1}.

6.1.2. The next step is what to do when σ_t² is unknown.
As we mentioned earlier, the standard procedure is to specify a parametric
function for σ_t². Typical functions are

σ_t² = exp( α₁ + α₂'z_t )   or   σ_t² = α₁ + α₂ z_t².

Then, how can we obtain the WLS estimator? We implement a step-wise
algorithm. (Note that Q(β, α) is nonlinear.)

1st) Compute β̂_LSE, and then the LSE residuals û_t = y_t - x_t'β̂_LSE, t ≥ 1.
2nd) Compute the (N)LSE of û_t² on g(α; z_t) to obtain α̂.
3rd) Use the WLS, replacing Ω by Ω̂ = Ω(α̂), to obtain β̂ in (64).

The sufficient conditions of Theorem 6.4 hold true, so that

β̂ = ( Σ_{t=1}^T x_t x_t'/σ̂_t² )^{-1} Σ_{t=1}^T x_t y_t/σ̂_t²

and β̃ have the same asymptotic properties. The only difficulty is to
show that Ω̂ → Ω, i.e. σ̂_t² → σ_t², in some appropriate way.
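Below is a sketch of the three-step feasible WLS just described, assuming the parametric specification σ_t² = exp(α₁ + α₂ z_t); the data-generating choices are arbitrary and only illustrate the mechanics.

import numpy as np

rng = np.random.default_rng(5)
T = 1000
z = rng.standard_normal(T)
X = np.column_stack([np.ones(T), rng.standard_normal(T)])
sig2 = np.exp(0.2 + 0.9 * z)
y = X @ np.array([1.0, -1.0]) + np.sqrt(sig2) * rng.standard_normal(T)

# step 1: LSE and squared residuals
b_ls = np.linalg.solve(X.T @ X, X.T @ y)
u2 = (y - X @ b_ls) ** 2

# step 2: LSE of log(u_hat_t^2) on (1, z_t); the intercept may be biased,
# but (as discussed in the example below) that does not affect the WLS step
Z = np.column_stack([np.ones(T), z])
alpha = np.linalg.solve(Z.T @ Z, Z.T @ np.log(u2))

# step 3: weighted LSE with weights 1 / sigma_hat_t^2
w = 1.0 / np.exp(Z @ alpha)
Xw = X * w[:, None]
b_wls = np.linalg.solve(Xw.T @ X, Xw.T @ y)
print(b_ls, b_wls)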
p
Remark 6.2. We are not saying that b ! , because it may not be true.
Indeed,
Example 6.1. Suppose that 2t = exp ( 1 + 02 zt ). Then, to estimate 1
and 2 we can perform LSE of log u2t on 1 + 02 zt . The problem that we
face is that the mean of the error term is di erent than zero. Indeed,
b2t + log
log u 2
t = 1 + 0
2 zt b2t
+ log u
so that
b2t =
log u 1 + 0
2 zt + vt ; b2t =
vt = log u 2
t
p
However, vt !d v, where Ev 6= 0. Then by Theorems 3.14 and 3.15, b 1 !
p
1 + c and b 2 ! 2 . Fortunately, this is not of concerned because
T
! 1 T !
X x t x 0 X xt yt
b= t
2
t=1
b t t=1
b2t
is invariant up to multiplicative constants. The factor exp (b 1 ) cancels in
both numerator and denominator of b , and so it is numerically identical to
T
! 1 T !
X x t x 0 X x t y t
b= t
.
0 0
t=1
exp b 2 z t t=1
exp b 2 z t

Thus, what we really need is a consistent estimator of 2. Finally, it is


worthwhile noticing that we estimate 2 as
1 0
b2 = Y Xb b 1 Y Xb
T

where b = diag exp b 02 z1 ; :::; exp b 02 zT . One consequence is that the


1 1
estimator of the covariance matrix of b is not X 0 b 1X but b2 X 0 b 1X .
A typical situation is that σ_t² is a function of x_t'β, that is,

σ_t² = g( x_t'β; γ ).

It is then noteworthy to observe:
1st) The GLS is not efficient, as it will not be the MLE even if the errors
were normally distributed.
2nd) However, if u_t follows a Gamma distribution, that is, f(u) = Γ(r)^{-1} λ^r u^{r-1} e^{-λu},
then the GLS and MLE coincide.


LSE
We nish this section discussing (i) How we can estimate the var of b
and (ii) how we can test for heteroscedasticity.
With regard to (i). We know that
LSE 1 1
V ar b = X 0X X0 X X 0X .

Hence, if we employ the formula 2 (X 0 X) 1 , the inferences on will be


invalid. So, what we can do even if we have no model for 2t . We can show
that, under some regularity conditions,
1 1
X 0X X 0 DX X 0X ,
LSE
where D = diag u b22 ; :::; u
b21 ; u b2T is a consistent estimator of Var b .
Next we examine (ii). There are several ones but perhaps a particular
procedure is due to White (1980). The test is as follows. If the model were
homoscedastic, we have that X 0 X = 2 (X 0 X), and then
T
1X2 P
(67) b xt x0t ! 2
E xt x0t .
T
t=1
However, we also have from our preceding comments
T
X 0 DX 1X 2
(68) = bt xt x0t
u
T T
t=1
is another consistent estimator. Thus, under homoscedasticity,
(67) (68) !P 0.
On the other hand, if the model were heteroscedastic, then
T T
1X 2 b2 X P
bt xt x0t
u xt x0t !
6 0.
T T
t=1 t=1
Thus, the idea of the test is to compare the two estimators given in (67) and
(68): under the null hypothesis they converge to the same quantity,
whereas under the alternative the difference is different from zero. Compare
with Hausman's test.

More precisely, let ψ_{ts} = x_{tk} x_{tl}, s = 1, 2, ..., K(K+1)/2, k ≥ l =
1, ..., K, where t denotes the observation and K is the dimension of the vector
x_t. Let ψ_t be the (K(K+1)/2 × 1) vector with elements ψ_{ts}; that is,
ψ_t is the vector containing the elements of the lower triangle of x_t x_t'.
Moreover, consider

ψ̄_{Ts} = T^{-1} Σ_{t=1}^T E(ψ_{ts}),   s = 1, 2, ..., K(K+1)/2,

B_T = T^{-1} Σ_{t=1}^T E[ (u_t² - σ²)² (ψ_t - ψ̄_T)(ψ_t - ψ̄_T)' ],

where ψ̄_T is the (K(K+1)/2 × 1) vector whose s-th element is ψ̄_{Ts}. Then,
the lower triangle of

T^{-1} Σ_{t=1}^T û_t² x_t x_t' - σ̂² T^{-1} Σ_{t=1}^T x_t x_t'

can be written as

D_T( β̂, σ̂² ) = T^{-1} Σ_{t=1}^T ψ_t ( û_t² - σ̂² ).

Similarly,

B̂_T = T^{-1} Σ_{t=1}^T ( û_t² - σ̂² )² [ ψ_t - ψ̂_T ][ ψ_t - ψ̂_T ]',

where ψ̂_T = T^{-1} Σ_{t=1}^T ψ_t. Then, we can test for heteroscedasticity using

(69)   W_h = T D_T( β̂, σ̂² )' B̂_T^{-1} D_T( β̂, σ̂² ) ≃ χ²_{K(K+1)/2}

under the null hypothesis of homoscedasticity.

A modification of the test given in (69) is based on the joint significance
of the parameters γ_s, s = 1, 2, ..., K(K+1)/2, in the regression model

û_t² = γ₀ + Σ_{k=1}^K Σ_{l=1}^k γ_{k(k-1)/2+l} x_{tk} x_{tl} + v_t.

The test is based on whether γ_s = 0 for all s against γ_s ≠ 0 for some s.

Remark 6.3. The test given in (69) is not only a test for heteroscedasticity,
as rejection might also be due to incorrect specification of the regression model.
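A common way to implement the regression form of the test is through T·R² from the auxiliary regression of û_t² on the cross-products x_tk x_tl, which is asymptotically chi-squared under homoscedasticity. The sketch below follows that route; the convention that the first column of X is a constant is an assumption of the sketch, not of the notes.

import numpy as np

def white_test(y, X):
    # X is assumed to contain a constant in its first column
    T, K = X.shape
    b = np.linalg.solve(X.T @ X, X.T @ y)
    u2 = (y - X @ b) ** 2
    # auxiliary regressors: the distinct products x_tk * x_tl, k <= l
    # (the first of these is the constant itself)
    cols = [X[:, k] * X[:, l] for k in range(K) for l in range(k, K)]
    Z = np.column_stack(cols)
    g = np.linalg.lstsq(Z, u2, rcond=None)[0]
    e = u2 - Z @ g
    R2 = 1.0 - e @ e / np.sum((u2 - u2.mean()) ** 2)
    return T * R2, Z.shape[1] - 1        # statistic and degrees of freedom

# usage with heteroscedastic simulated data
rng = np.random.default_rng(11)
T = 800
X = np.column_stack([np.ones(T), rng.standard_normal(T), rng.standard_normal(T)])
u = np.sqrt(np.exp(0.5 * X[:, 1])) * rng.standard_normal(T)
y = X @ np.array([1.0, 1.0, -1.0]) + u
print(white_test(y, X))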

6.2. AUTOCORRELATION.
We now examine the question of a linear regression model
0
yt = xt + ut , t = 1; :::; T
E (ut us ) 6= 0 for some t 6= s.

As with heteroscedastic errors, we need to put some constraints on . A


way to reduce the number of parameters is to assume that ut is stationary,
e.g. E(u_t) = 0 and Cov(u_t, u_s) = γ_{|t-s|}, so that

Ω = σ_u² × [ 1, ρ₁, ρ₂, ..., ρ_{T-1} ; ρ₁, 1, ρ₁, ..., ρ_{T-2} ; ... ; ρ_{T-1}, ..., ρ₁, 1 ],

i.e. the (t, s)-th element of Ω is σ_u² ρ_{|t-s|}, with ρ₀ = 1.
If Ω (and σ_u²) were known, we could find a matrix P such that P'P = σ²Ω^{-1}
(equivalently, PΩP' = σ²I_T) and perform LSE in the transformed model

PY = PXβ + PU,

where PU is such that E(PUU'P') = σ²I. But, as in the case of heteroscedasticity,
we cannot expect to know Ω and we need to put some additional
constraints on Ω. A typical example is the AR(1) model

(70)   u_t = ρu_{t-1} + ε_t,

although more general models can be used, such as the ARMA(p, q), e.g.

u_t = ρ₁u_{t-1} + ... + ρ_p u_{t-p} + ε_t + θ₁ε_{t-1} + ... + θ_q ε_{t-q}.

Under (70), we have that

Ω = σ_u² × [ 1, ρ, ρ², ..., ρ^{T-1} ; ρ, 1, ρ, ..., ρ^{T-2} ; ... ; ρ^{T-1}, ..., ρ, 1 ],

P = [ (1-ρ²)^{1/2}, 0, 0, ..., 0 ; -ρ, 1, 0, ..., 0 ; 0, -ρ, 1, ..., 0 ; ... ; 0, ..., 0, -ρ, 1 ].

The transformation P applied to the model gives

y*_1 = (1-ρ²)^{1/2} y_1,   x*_1 = (1-ρ²)^{1/2} x_1,
y*_t = y_t - ρy_{t-1},   x*_t = x_t - ρx_{t-1},   t = 2, ..., T.

Hence, once ρ is known, the GLS of β will be given by

arg min_β [ v_1² + Σ_{t=2}^T v_t² ],

where v_1 = y*_1 - β'x*_1 and v_t = y*_t - β'x*_t.
When v1 is not included, we have the Cochrane-Orcutt transformation.
6.2.1. What to do when ρ is unknown.
Proceeding as we did in the heteroscedastic case, we shall employ the
preceding (say Cochrane-Orcutt) transformation, but with ρ replaced by an estimator.
Several step-wise estimators have been proposed. Among them are:

(1)

(71)   ρ̂ = Σ_{t=2}^T û_t û_{t-1} / Σ_{t=2}^T û_{t-1}².

(2) Durbin's method, which is based on the following transformation:

y_t = β'x_t + u_t,  u_t = ρu_{t-1} + ε_t
⟹  y_t = ρy_{t-1} + β'x_t - (ρβ)'x_{t-1} + ε_t.

Then ρ̂ is the LSE of the parameter associated with y_{t-1}.

We can show that both estimators of ρ are consistent, and then we compute
the estimator of Ω, Ω̂ = Ω(ρ̂). From here we obtain the GLS as

(72)   β̂ = arg min_β [ v̂_1² + Σ_{t=2}^T ( (y_t - ρ̂y_{t-1}) - β'(x_t - ρ̂x_{t-1}) )² ],

where v̂_1 = (1 - ρ̂²)^{1/2} y_1 - β'(1 - ρ̂²)^{1/2} x_1.

We should notice that the inclusion of v̂_1 in (72) is irrelevant for the asymptotic
properties of β̂. The reason is that we only discard one observation
out of T and, as T becomes large, the contribution of one observation is
negligible. If we discard v̂_1 and use (71), then the estimator is known as
Cochrane-Orcutt, and it is an example of a step-wise algorithm. Obviously,
we can iterate the procedure up to any level of accuracy.

The asymptotic properties of β̂ are the same as those of β̃. In particular,

T^{1/2}( β̂ - β ) →d N( 0, V^{-1} ),   V = plim X'Ω^{-1}X / T.

So we observe that asymptotically the choice of ρ̂ is irrelevant.
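A sketch of the (iterated) Cochrane-Orcutt step-wise procedure: alternate between ρ̂ from (71) and LSE on the quasi-differenced data, dropping the first observation. The simulated data are purely illustrative.

import numpy as np

def cochrane_orcutt(y, X, n_iter=10):
    b = np.linalg.solve(X.T @ X, X.T @ y)           # initial LSE
    rho = 0.0
    for _ in range(n_iter):
        u = y - X @ b
        rho = (u[1:] @ u[:-1]) / (u[:-1] @ u[:-1])  # rho_hat as in (71)
        ys = y[1:] - rho * y[:-1]                   # quasi-differences
        Xs = X[1:] - rho * X[:-1]
        b = np.linalg.solve(Xs.T @ Xs, Xs.T @ ys)
    return b, rho

# usage with AR(1) errors
rng = np.random.default_rng(6)
T = 500
X = np.column_stack([np.ones(T), rng.standard_normal(T)])
u = np.zeros(T)
for t in range(1, T):
    u[t] = 0.7 * u[t - 1] + rng.standard_normal()
y = X @ np.array([1.0, 2.0]) + u
print(cochrane_orcutt(y, X))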
If "t N 0; 2" , we can obtain the M LE, which is the maximum of
f (y1 ; :::; yT ) = f (y1 ) f (y2 j y1 ) :::f (yT j yT 1)

1 2 1=2
1 2 1=2 0 2 1=2
2
= exp 2
y1 1 x1 1
(2 2 )T =2 2 "
"
( T
)
1 X 0 2
exp 2
(yt yt 1) (xt xt 1)
2 " t=2

and thus the log-likelihood, ignoring constants, becomes


T 1 1
ln 2" + ln 1 2
(Y X )0 (Y X ).
2 2 2 2"
Thus, conditional on 2" and , b (M LE) becomes the GLS estimator. The
estimator of , on the other hand, is dierent that the ones proposed earlier
on. If we discard the 2nd term, then the estimator of is just (71). In
fact, when we ignore this second term, the estimator is known as the Condi-
tional Maximum Likelihood Estimator. However, if the unconditional M LE
is performed, b is no longer a linear estimator, and we need non-linear
optimization algorithms.
Another point to mention is that (71) is asymptotically e cient, e.g. it
has the same asymptotic distribution as the M LE. The reason why the step-
wise algorithm is an e cient method is because the asymptotic distribution
of b and b are independent. That is,
b V 1 0
T avar = 2 .
b 1

To nish, we should mention that a consistent estimator of 2 is given by


"
0
Y Xb Pb0 Pb Y Xb
b2"=
T
0 0 11 0 1
b V 1 0 0
T @avar @ b AA = @ 1 2 0 A.
2
b" 2 4"
For a general AR (p) model, the only dierence is that depends on p
parameters, but otherwise the arguments are exactly the same.
6.2.2. ESTIMATION WITH MA(1) ERRORS.
The estimation with more general processes, like M A (q) or ARM A (p; q),
is the same although notationally more complicated. Thus we will focus on
yt = 0 xt + ut
, t = 1; 2; :::; T .
ut = "t + "t 1
The basic assumption on ut is invertibility, j j < 1. Here
0 1
1+ 2 0 ::: 0
B 1+ 2
::: 0 C
B C
B : : ::: : : C
B C
= 2" B
B : : : ::: : C.
C
B : : : : : C
B C
@ : : : 1+ 2 A
2
: : : 1+
The problem is how to estimate . Assume that "t is normally distributed.
Then, we can compute the M LE as
T
X
arg min "2t ( ; ) ,
;
t=1
where
"t ( ; ) = ut ( ) "t 1 ( ; )
0
ut ( ) = yt xt , t = 1; :::; T .
For t = 1, "1 ( ; ) = u1 ( ) "0 ( ; ) which depends on "0 . So what to
do? There are two main procedures:
(1) Treat "0 as a parameter
(2) Replace "0 by its mean, e.g. E"0 = 0.
Conditional on "0 = 0, the M LE and LSE coincide, we just minimize
T
X
0 2
Q( ; ) = yt xt "t 1 .
t=2
However Q ( ; ) is highly nonlinear in so that we need nonlinear opti-
mization algorithms, such as the Gauss-Newton. The question is then, how
can we implement the algorithm? Basically, we need to obtain the rst
derivatives,
@ @
@ "t = x t @ "t 1 t = 1; :::; T .
@ @
@ "t = "t 1 @ "t 1

One point to notice is that


T
1X @ @
plim "t "t = 0,
T @ @
t=1

e.g. the information matrix is block diagonal, that is b and b are asymptot-
ically independent, as it was the case with AR errors. The consequence is
that we can implement a step-wise algorithm without losing any e ciency,
i.e. we compute "t on @@ "t and then "t on @@ "t .
Consistent estimators of and can be obtained via the LSE of and
1=2
b= 1 1 4b21
2b1
1
respectively, where because 1 = 1 + 2 , and b1 given in (71). Then,
from these two initial consistent estimators, a two-step estimator will yield
asymptotic e cient estimators.

6.2.3. TESTING FOR FIRST-ORDER AUTOREGRESSIVE DIS-


TURBANCES.
Consider the model
yt = 0 xt + ut
ut = ut 1 + "t ,
and we are interested in the hypothesis H0 = 0 vs. H1 6= 0.
One obtains b as (71) and the test will be
1 PT
1=2 bt u
t=2 u bt 1 d
T 1=2b = T 1 PT ! N (0; 1) .
2
bt 1
T t=2 u

So, the test will be to reject if |T^{1/2}ρ̂| > 1.96 (5% significance level).
Although this is a possibility, perhaps the most well known and used test
is due to Durbin and Watson (1950). The D-W test is based on the statistic

d = Σ_{t=2}^T ( û_t - û_{t-1} )² / Σ_{t=2}^T û_{t-1}².

The finite sample distribution of d depends on X, although we can use
asymptotic approximations. First notice that d ∈ [0, 4]. Also,

d ≃ 2(1 - ρ̂).

Thus, d is approximately AN(2, 4/T), e.g. T^{1/2}(d - 2)/2 is approximately AN(0, 1).
But the contribution of Durbin and Watson was the finite sample behaviour
of d. Although its exact sample distribution is not known (or not tractable),
they provided two bounds d_L ≤ d_U and a one-sided test. Their test works
as follows:

reject H0: ρ = 0 vs. H1: ρ > 0 if d < d_L;
do not reject H0 if d > d_U;
the test is inconclusive if d ∈ (d_L, d_U).

One word of caution is that in order to implement this test we need a constant term
among the regressors X.

Portmanteau Test.
Denoting by ρ̂_r the r-th sample autocorrelation of the residuals, Box and Pierce proposed

Q = T Σ_{r=1}^P ρ̂_r² ≃ χ²_P,

and Box-Pierce-Ljung, with better finite sample behaviour,

Q* = T(T + 2) Σ_{r=1}^P (T - r)^{-1} ρ̂_r² ≃ χ²_P.

These tests are designed to detect any departure from randomness indicated
by the first P autocorrelations of the errors.
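A short sketch computing both statistics from a residual series (the choice of P and the residuals themselves are whatever the application supplies; here they are simulated).

import numpy as np

def portmanteau(u, P):
    T = len(u)
    u = u - u.mean()
    denom = u @ u
    r = np.array([(u[k:] @ u[:-k]) / denom for k in range(1, P + 1)])  # rho_hat_r
    Q = T * np.sum(r**2)
    Q_star = T * (T + 2) * np.sum(r**2 / (T - np.arange(1, P + 1)))
    return Q, Q_star        # both approximately chi-squared with P degrees of freedom

rng = np.random.default_rng(7)
print(portmanteau(rng.standard_normal(500), P=10))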

7. DYNAMIC MODELS
The models we have examined are static. That is, given data zt = (yt ; x0t )0
t 2 Z, we have that E [yt j xs ; s = 0; 1; :::] = 0 xt . However with time
series data we may have that not only xt inuence yt but also past values
of xt and/or yt .
So, we may have in mind that
1
X
E [yt j xs ; s = 0; 1; :::] = j xt j
j=0

P
1
or more generally E [yt j xs ; s = 0; 1; :::] = j xt j implying
j= 1

1
X
yt = j xt j + ut .
j= 1

They are known as distributed lag models, where ut ARM A (p; q).

Example 7.1. (Finite distributed lag model) If j =0 8j > K xed K


K
X
yt = j xt j + ut .
j=0

Example 7.2. Suppose that


P
r1
j
1
X jL
j j=0
jL = , with 0 = 1,
Pr2
j=0 j Lj
j=0

P
r2
j
where the roots of jz = 0 are outside unit circle, e.g. jzj > 1. Then if
j=0
(i) ut uncorrelated it is known as Rational Distributed Lag models. (ii) if
ut ARM A (p; q) then they are known as Transfer Models.

In general, we can have


(L) (L)
yt = xt + vt ,
(L) (L)
where vt are iid and w(L) a polynomial in L. If (L) = (L) (L),
vt
(L)yt = (L)xt +
(L)
known as Stochastic Di erence Equation, and ARMAX model if the errors
are M A ( ) instead of AR ( ), that is
(L)yt = (L)xt + (L)vt .
Let's give some definitions.

Definition 7.1. (a) TOTAL MULTIPLIER: the effect in the long run
of a permanent increase of one unit in the level of x_t,

β = Σ_{j=0}^∞ β_j.

(b) Impact multiplier: the effect of a unit increase in x_t on y_t, i.e. β₀.

(c) Interim multiplier: the change after J periods, i.e. Σ_{j=0}^J β_j, J ≥ 0.

(d) The mean lag: defined as Σ_{j=0}^∞ j β_j / Σ_{j=0}^∞ β_j.

7.1. FINITE DISTRIBUTED LAG MODELS.


Consider the following model
K
X
(73) yt = 0 + i xt i + ut .
i=1

Its estimation is not dierent from what we have already seen. If ut are
heteroscedastic or autocorrelated, then we can implement a GLS type of
estimator as (73) is just a standard regression model. One possible problem
is that (X 0 X) might be near singular making the computation quite di cult.

7.2. INFINITE DISTRIBUTED LAGS.


We will focus on the rational distributed lag and the stochastic di erence
equations.

7.2.1. Rational distributed lag models.


We start with the simplest Koyck model

y_t = β₀ x_t / (1 - λL) + u_t.

Its motivation comes from the idea that expectations are adaptive, e.g.

y_t = β x*_{t+1} + u_t,   x*_{t+1} = x*_t + δ( x_t - x*_t ),   0 < δ < 1,
⟹  y_t = β(1 - λ) x_t / (1 - λL) + u_t,   with λ = 1 - δ (so β₀ = β(1 - λ)).

Assume that u_t ~ NID(0, σ_u²). One possible way to estimate β₀ and λ
is to multiply by (1 - λL) to obtain

y_t = λy_{t-1} + β₀x_t + (u_t - λu_{t-1})

and perform the LSE. However, the LSE is inconsistent since E(y_{t-1} v_t) ≠
0, where v_t = u_t - λu_{t-1}. So what to do? As usual, we employ the IVE.
Thus, we should find instruments for y_{t-1} and x_t. Because x_t is not
correlated with v_t, all we need is an instrument for y_{t-1}. Now,
if x_t is related to y_t, then x_{t-1} and y_{t-1} will be related as well. Thus, one possible instrument
for y_{t-1} is x_{t-1}.
This method is not e cient though, since not all the available information
(the particular structure) of the model has been used. How can we exploit

the sample information to estimate the parameters? As usual, what we can


do is to minimize the RSS
0 12
XT T
X X1
S( ; )= u2t ( ; ) = @yt j
xt j A .
t=1 t=1 j=0

One problem is that we are only able to observe x1 ; :::; xT . Thus, what
are the possible solutions? ( )
P1
jx
tP1
jx t
P
1
jx
P1
jx
As t j = t j + j , denoting j = ,
j=0 j=0 j=0 j=0
0 0 1 12
T
X Xt 1
S( ; ) = @yt @ j
xt jA
t A .
t=1 j=0

Some solutions are


(1) To set equal to zero. The objective function is nonlinear but, once
is xed it becomes a linear one, and thus a step-wise algorithm
can be implemented with a grid-search for .
(2) Treat Ey0 as a parameter.
P
1
jx
tP2
jx t 1
P
1
jx
(3) Write t j = t j + 1 j . Because we have
j=0 j=0 j=0
P
1
jx
that 1 j = Ey1 , then we can replace by y1 .
j=0
(4) To look at
T
X T
X 2
u2t = (yt yt 1 xt + ut 1)
t=2 t=2
and to implement a Gauss-Newton iteration procedure, e.g.
@ @
ut = yt 1 + ut 1 + ut 1
@ @
@ @
ut = xt + ut 1 .
@ @
All of the preceding methods share the same asymptotic properties.
5 A fth procedure is to implementation of two-step algorithm. First
we obtain the IV E, and then one Gaussian-Newton iteration, since
that will give us asymptotic e cient estimators by Theorem 5.5.
This is in fact the usual procedure. When we have more than one
explanatory variable or the rational lag polynomial is of higher order,
the way to estimate the parameters of the model is exactly the same,
e.g. consider
(L)yt = (L)xt + (L)vt
p
X q
X p
X
0
yt = i yt i + j xt j + i vt i ,
i=1 j=0 i=0

with 0 = 1. Then, obtain the IV E of bi , i = 1; :::; p and b j ,


j = 1; :::; q, where the instruments are xt j , j = 0; :::; p + q.

Once, the IV E has been obtained, a two-step algorithm can be imple-


mented to obtain
XT
arg min u2t ( ; ).
; t=p+1
If the errors are, say, AR (1), we proceed identically. It is quite important
to emphasize that as in the static case, the parameter estimates of the
coe cients ( ; ) and the parameter of the AR(1) process governing the
error term are asymptotically independent. The latter implies that a step-
wise algorithm, such as the Cochrane-Orcutt, will be e cient.

7.3. STOCHASTIC DIFFERENCE EQUATION.


We should start by looking at the simplest model, e.g.
yt = yt 1 + xt + ut
with j j < 1 for stability conditions. If ut are uncorrelated then
!
2 1
1=2 b d 2 y yx
T b ! N 0; u 2 :
x

Again, as we did with the Koyck model, if the errors were correlated then
E [yt 1 ut ] 6= 0 and the LSE would be inconsistent. In this case, we need to
use IV E, with xt 1 as instrument for yt 1 .
On the other hand, a more e cient estimator could be obtained if one is
willing to use all the available information. Once again, we will focus mainly
with the case where ut AR(1), e.g.
ut = ut 1 + "t .
As we did with the Cochrane-Orcutt (the motivation of the method),
we multiply the model by (1 L), in order to eliminate the correlation
structure of the error term, e.g.
(1 L) yt = (1 L) yt 1 + (1 L) xt + "t
yt yt 1 = yt 1 yt 2 + xt xt 1 + "t .
Thus, an appropriate method will be based on
T
X
b = arg min "2t ( ; ; )
=( ; ; )0
t=3

@
"t = (yt 1 yt 2) = zt1
@
@
"t = (xt xt 1) = zt2
@
@
"t = (yt 1 yt 2 xt 1) = zt3 .
@
So, we can implement a Gauss-Newton iteration algorithm, i.e.
" T # 1 T
i+1 i X X
b =b zt z 0 z t "t , t
t=3 t=3

i
where "t and ztj , j = 1; 2; 3, are evaluated at b . We can show that
0 2 1
2q2 2q 1 3 1
y xy (1 )
d B 5 C
T 1=2 b ! @0; 4 2q2
t 0 A
2 1
0 1
where
T T
1X @"t 2
1X @"t 2
qy2 = p lim ; qx2 = p lim
T @ T @
t=3 t=3
XT
1 @"t @"t
qxy = p lim .
T @ @
t=3
A two-step procedure can be implemented, starting with the IV E of and
and b as the LSE of ubt on u
bt 1 , obtaining a fully e cient estimator.
Remark 7.1. In this model a step-wise algorithm will not be e cient to
estimate , although it will converge to the true value. The reason being that
the asymptotic variance-covariance matrix is not block diagonal, e.g. the
0
estimators of ; 0 are not independent to that of .
7.4. HATANAKAs TWO-STEP PROCEDURE.
(Residual adjusted Aitken estimator ).
Hatanakas device is a procedure which is asymptotically as e cient as
the M LE of , and . As a by-product, we can conclude or see why the
Cochrane-Orcutt (step-wise) method is not e cient.
STEP 1: Regress y_t on y_{t-1} and x_t using the IVE, with x_{t-1} as an
instrument for y_{t-1}, to obtain α̂ and β̂. Then perform the LSE of û_t on
û_{t-1} to obtain ρ̂.
STEP 2: Regress y_t - ρ̂y_{t-1} on y_{t-1} - ρ̂y_{t-2}, x_t - ρ̂x_{t-1} and û_{t-1}.
The key difference is the inclusion of û_{t-1}. The LSE here will be asymptotically
efficient, noting that the efficient estimator of ρ is ρ̂ + δ̂, where δ̂
is the coefficient associated with the regressor û_{t-1}.
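A sketch of the two steps, assuming the simple model y_t = αy_{t-1} + βx_t + u_t with AR(1) errors; the simulated data, names and parameter values are illustrative.

import numpy as np

rng = np.random.default_rng(8)
T, a, b, rho = 2000, 0.5, 1.0, 0.6
x = rng.standard_normal(T)
u = np.zeros(T); y = np.zeros(T)
for s in range(1, T):
    u[s] = rho * u[s - 1] + rng.standard_normal()
    y[s] = a * y[s - 1] + b * x[s] + u[s]

# STEP 1: IVE of y_t on (y_{t-1}, x_t) with instruments (x_{t-1}, x_t),
# then rho_hat from the LSE of u_hat_t on u_hat_{t-1}
t = np.arange(1, T)
W = np.column_stack([y[t - 1], x[t]])
Z = np.column_stack([x[t - 1], x[t]])
a_iv, b_iv = np.linalg.solve(Z.T @ W, Z.T @ y[t])
uh = y - np.r_[0.0, a_iv * y[:-1] + b_iv * x[1:]]   # u_hat_t for t >= 1 (uh[0] unused)
rho_hat = (uh[2:] @ uh[1:-1]) / (uh[1:-1] @ uh[1:-1])

# STEP 2: regress y_t - rho_hat*y_{t-1} on the quasi-differenced regressors and u_hat_{t-1};
# the efficient estimator of rho is rho_hat plus the coefficient on u_hat_{t-1}
s = np.arange(2, T)
ys = y[s] - rho_hat * y[s - 1]
R = np.column_stack([y[s - 1] - rho_hat * y[s - 2],
                     x[s] - rho_hat * x[s - 1],
                     uh[s - 1]])
a2, b2, delta = np.linalg.solve(R.T @ R, R.T @ ys)
print(a2, b2, rho_hat + delta)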
If the errors are M A (1) instead of AR (1), it proceeds similarly, e.g.
yt = yt 1 + xt + ut
ut = "t + "t 1 .
Assuming that y1 xed and "1 = 0, the M LE is equivalent to the mini-
mization of the conditional sum of squares, CSS,
T
X
"2t .
t=2
For a Gauss-Newton iteration algorithm, "t = yt yt 1 xt "t 1 and
@"t @
= yt 1 "t 1
@ @
@"t @
= xt "t 1
@ @
@"t @
= "t 1 "t 1
@ @

with initial values given by IV E b , b and b given by


P
b= 1 2
1 bt u
u bt 1
1 4b1 2
=2b1 ; b1 = P 2 .
bt 1
u
Remark 7.2. One of the di erence with the static case (e.g. no lagged
dependent variables) is that overparameterization leads to asymptotic ine -
ciency. Recall that in the static case this was not the case.
The next question is how to test for AR(1) in the presence of y_{t-1}. Obviously,
the Durbin-Watson test is no longer valid, since when y_{t-1} is present
the estimators of α and ρ are not independent. Due to this fact, Durbin
(1970) examined a suitable test known as the h-test,

h = ( 1 - d/2 ) ( T / ( 1 - T·avar(α̂) ) )^{1/2} ≃ ρ̂₁ ( T / ( 1 - T·avar(α̂) ) )^{1/2},

where

avar(α̂) = σ² q_x² / ( T ( q_x² q_y² - q_xy² ) )

and, under H0, h ≈ N(0, 1).
7.5. THE COMFAC ANALYSIS.
The idea is the following. Suppose that we have the model
(74) yt = yt 1 + 0 xt + 1 xt 1 + ut
or more generally A(L)yt = B(L)xt + ut .
The question to address is if A(L) and B(L) have or dont have roots in
common. That (74) has common roots, it is equivalent to say that
(1 L) yt = 0 (1 L) xt + ut
or that 0 + 1 = 0. Thus, we can be interested to know
H0 0 + 1 = 0 vs. H1 0 + 1 6= 0.
How to perform the test?
First estimate (74), via any of the previous methods we have pointed out,
and then try to see whether b 0 b + b 1 0 or not. Then:
9
@ >
@ ( 1 + 0 ) = 0 =
@
@ 0 ( 1 + 0 ) = =) R = ( 0 ; ; 1) ,
@ >
;
@ ( 1+ 0 )=1
1

thus, the test is based on


X 1 1
d
b b+b b2 R zt zt0 R0 b b+b ! 2
(1).
0 1 0 1

Obviously this analysis can be extended to more general models although it


complicates matters.
Notice that we can write H0 as
1 1
H0 + = 0 or H0 0 + =0
0
and it has been shown, by Monte-Carlo simulations, that the outcome of
the test depends on how we write H0 .

7.6. CAUSALITY.
Let y and x be two scalar variables. We say that x causes y if in some
sense x helps to predict y, that is
h i h i
E (yt E[yt j yt 1 ; :::])2 > E (yt E[yt j xt 1 ; :::; yt 1 ; :::])2 .
If, xt were in the Information Set, then we would say that there is Instan-
taneous Causality. Feedback if yt also helps to predict xt .
7.6.1. GRANGERS TEST.
11 (L) 12 (L) xt "t
= ,
21 (L) 22 (L) yt t
where ij (L) are polynomials in L. To test if x causes y is equivalent to
test if 21 (L) = 0 or not. This is sometimes known as a direct test.

7.6.2. SIMS TEST.


Consider
m
X
yt = xt + wt .
= m
The null hypothesis that y does not cause x is equivalent to test that 1 =
::: = m = 0. Here, some care should be taken about wt . We have to be
sure that wt is iid, otherwise the test would not be valid.
Prewhitening.
First we prewhiten y_t and x_t before testing for causality between y and
x. Denote the prewhitened variables by y*_t and x*_t. Then obtain the cross
moments ρ_τ of y*_t and x*_{t-τ}, τ = 0, ±1, ±2, ....
(a) If ρ_τ ≠ 0 for some τ > 0 ⟹ x → y.
(b) If ρ_τ ≠ 0 for some τ < 0 ⟹ y → x.
(c) ρ_τ ≠ 0 for τ = 0 ⟹ instantaneous causality.
The test is then based on U = T Σ_{τ=-n}^{m} r²(x*_τ, y*), which is asymptotically
distributed as a χ²_{n+m+1} under the null hypothesis, where r(x*_τ, y*)
denotes the cross-correlation at lag τ.

8. SYSTEMS OF EQUATIONS
So far, we have studied models where only one equation was given, e.g.
multiple regression models. Very frequently, we face models or problems
that involve the specication, estimation and inference of more than one
equation. We are now going to study this issue, e.g. systems of equations.
8.1. MULTIVARIATE REGRESSION.
In some sense multivariate regression models are not much dierent than
the models we have already examined. Its specication is as follows
(75) yt = Bxt + "t , t = 1; :::; T
where yt is (N 1) vector, xt is a (K 1) vector and B (N K) matrix of
coe cients, in which we are interested. So, the ` th row of B stand for
the coe cients of the ` th equation.
The multivariate least squares estimator of B is given by
T
! 1 T
X X
Bb =
0 0
xt xt xt yt0
t=1 t=1
b 0 = (X 0 X)
or B 1
X 0 Y , where Y 0 = (y1 ; :::; yT )N and X 0 = (x1 ; :::; xT )K
T T.

8.1.1. Motivation.
Denote the t-th observation of the rst equation of (75) by
yt1 = b01 xt + "t1 (1st row of B is b01 )
whereas the (T 1) vector of observations of the 1st equation by
y1 = Xb1 + "1 .
Thus we can write (75) as
0 1 0 X 10 1
0 1
O b1
y1 "1
B : C B B CB b2 C
B C X CB C B : C
B : C=B CB : C B C
B C B : CB C+B : C
@ : A B CB : C B C
@ : AB
@ :
C @
A : A
yN O : X "N
bN
or, in matrix notation,

(76)   y* = (I_N ⊗ X) b + ε*,

where y* = (y_1', ..., y_N')', b = (b_1', ..., b_N')' and ε* = (ε_1', ..., ε_N')'. So there
is not much difference with what we have seen already. That is, we have the
standard linear regression model with y* and b as our dependent variable
and vector of parameters, respectively. Thus, the LSE becomes

b̂ = ( I_N ⊗ X'X )^{-1} ( I_N ⊗ X' ) y*
  = ( I_N ⊗ (X'X)^{-1}X' ) y*
  = ( I_N ⊗ (X'X)^{-1}X' ) vec(Y)
  = vec( (X'X)^{-1}X' Y I_N ),

because Y = (y_1, ..., y_N) and vec(ABC) = (C' ⊗ A) vec(B). But

b̂ = vec( b̂_1, b̂_2, ..., b̂_N ) = vec( B̂' ),

where B̂' = (b̂_1, ..., b̂_N). Thus vec(B̂') = vec( (X'X)^{-1}X'Y ) ⟹ B̂' =
(X'X)^{-1}X'Y. Notice that this is identical to the LSE equation by equation.
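A quick numerical check of this equivalence (simulated data, arbitrary dimensions): the multivariate LSE B̂' = (X'X)^{-1}X'Y reproduces the equation-by-equation LSE column by column.

import numpy as np

rng = np.random.default_rng(9)
T, K, N = 300, 3, 2
X = rng.standard_normal((T, K))
B = rng.standard_normal((N, K))                 # N x K coefficient matrix
Y = X @ B.T + rng.standard_normal((T, N))       # y_t = B x_t + eps_t, stacked over t

B_hat_T = np.linalg.solve(X.T @ X, X.T @ Y)     # (X'X)^{-1} X'Y, i.e. B_hat'
for i in range(N):
    b_i = np.linalg.solve(X.T @ X, X.T @ Y[:, i])   # LSE of equation i alone
    assert np.allclose(b_i, B_hat_T[:, i])
print(B_hat_T.T)                                # B_hat (N x K)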

8.1.2. Properties.
Its properties follow straightforwardly by looking at the properties of the
LSE in a multiple regression model.
(i) It is unbiased (or consistent) and (ii) Its variance (asymptotic variance)
is equal to
!
X 8X 1
1
X 0X p lim ,
T

where = var ("t "0t ), where = vec (B 0 ) and b = vec B


b0 .
A test for the null hypothesis

H0 w0 = r, vs: H1 w0 6= r

is done as usual, i.e.


1 1 d
w0 b r w0 X 0X w w0 b r ! 2
(s) .

The model considered in (76) can be viewed as a particular case of the


more general one SURE (seemingly unrelated regression equations).

8.2. SURE.
Suppose that we have N equations (regression equations)

yi = Xi i + "i ; i = 1; :::; N

E"i = 0; E"i "0i = wii IT


E"it "jt = wij i; j = 1; :::; N
E"jt "js = 0 8t =6 s,

e.g. ("t1 ; "t2 ; :::; "tN )0 is an (N 1) column vector with covariance (wij )i;j=1;:::;N .
Thus, we have
0 1 0 1
0 1 X1 0 1 "1
y1 B C 1 B C
B : C B X2 CB : C B "2 C
B C B : CB C B : C
B : C=B CB : C+B C
B C B : C B C B : C
@ : A B C@ : A B C
@ : A @ : A
yN N
XN "N

Y = X + "; V = E""0 = I.

Then,

β̂ = [ w^{ij} X_i'X_j ]_{i,j=1,...,N}^{-1} ( Σ_{j=1}^N w^{1j} X_1'y_j , ..., Σ_{j=1}^N w^{Nj} X_N'y_j )',

where w^{ij} denotes the (i, j)-th element of Σ^{-1}.
This estimator is more efficient than the LSE equation by equation. But
there are two cases where the LSE is fully efficient, namely: (i) w_ij = 0 for all i ≠
j, and (ii) X_i = X_j for all i ≠ j.
The derivation of the SURE can be seen by observing what is the Variance-
Covariance matrix of " = ("01 ; :::; "0N )0
0 1
20 1 3 w11 IT w12 IT : : : w1N IT
"1 B C
6B : C 7 B w22 IT : : : w2N IT C
6B C 0 7 B : C
E6B
6B :
C "1 ; :::; "0N
C
7=B
7 B
C=(
C IT ) .
4@ : A 5 B : C
@ : A
"N
wN N IT

So, to obtain b we will use GLS estimators, since the variance covariance
matrix of " is not a diagonal matrix. That is, we will minimize the objective
function
1
(Y X ) I (Y X )

02 3 1 1
X10 20 1 32 3
B6 7 w11 : : : w1N X1 C
B6 X20 7 6B C 76 7C
B6 7 6B : C 76 : 7C
b = B6 : 7 6B C
B6 7 6B : C I7
76
6 : 7C
7C
B6 : 7 4@ A 54 5C
@4 5 : : A
:
0 wN N XN
XN
2 3 20 1 32 3
X10 w11 : : : w1N y1
6 : 7 6B : C 76 : 7
6 7 6B C 76 7
6 : 7 6B : C I7 6 : 7
6 7 6B C 76 7
4 : 5 4@ : A 54 : 5
0
XN wN N yN

0 1
0 1 1 P
N
w11 X10 X1 w12 X10 X2 ::: w1N X10 XN B w1j X10 yj C
B : C B j=1 C
B C B : C
=B
B : C
C
B
B
C:
C
@ A B : C
: @ A
:
wN 1 XN0 X
1 wN 2 XN
0 X
2 ::: wN N XN
0 X
N PN NjX0 y
j=1 w N j

If Xi = Xj ; 8i; j = 1; :::; N , then the above expression will be

b = 1
I X0 1
I (I X) I X0 1
I Y
0 1
(X 0 X) 1 X 0 y1
1
= I X 0X X0 Y =@ A
0 1 0
(X X) X yN

e.g. the LSE to every equation individually.

8.2.1. FEASIBLE SURE.
Since the (w_ij)_{i,j=1,...,N} are not known, we need to estimate them in order to
perform the SURE. To do that, we first obtain the LSE equation
by equation, which gives residuals ε̂_i, i = 1, ..., N; an estimate of w_ij is then
given by

ŵ_ij = T^{-1} Σ_{t=1}^T ε̂_it ε̂_jt,   i, j = 1, ..., N.

The feasible SURE is then obtained by replacing the elements w^{ij} of Σ^{-1}
in the formula above by the corresponding elements ŵ^{ij} of Σ̂^{-1}, where
Σ̂ = (ŵ_ij)_{i,j=1,...,N}; that is,

β̂ = [ ŵ^{ij} X_i'X_j ]_{i,j=1,...,N}^{-1} ( Σ_{j=1}^N ŵ^{1j} X_1'y_j , ..., Σ_{j=1}^N ŵ^{Nj} X_N'y_j )'.
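A sketch of the feasible SURE just described for a small system (two equations, simulated data; the helper function and its arguments are illustrative): step 1 computes Σ̂ from equation-by-equation LSE residuals, step 2 solves the stacked GLS equations with blocks ŵ^{ij} X_i'X_j.

import numpy as np

def feasible_sure(ys, Xs):
    # ys: list of N response vectors (length T); Xs: list of N design matrices (T x k_i)
    N, T = len(ys), len(ys[0])
    # step 1: LSE equation by equation and residual covariance Sigma_hat
    res = []
    for y, X in zip(ys, Xs):
        b = np.linalg.solve(X.T @ X, X.T @ y)
        res.append(y - X @ b)
    E = np.column_stack(res)
    Sigma = E.T @ E / T                 # w_hat_ij = T^{-1} sum_t e_it e_jt
    # step 2: stacked GLS with blocks w_hat^{ij} X_i'X_j and right-hand side sum_j w_hat^{ij} X_i'y_j
    Si = np.linalg.inv(Sigma)
    ks = [X.shape[1] for X in Xs]
    A = np.zeros((sum(ks), sum(ks)))
    c = np.zeros(sum(ks))
    r0 = 0
    for i in range(N):
        c0 = 0
        for j in range(N):
            A[r0:r0 + ks[i], c0:c0 + ks[j]] = Si[i, j] * (Xs[i].T @ Xs[j])
            c0 += ks[j]
        c[r0:r0 + ks[i]] = sum(Si[i, j] * (Xs[i].T @ ys[j]) for j in range(N))
        r0 += ks[i]
    return np.linalg.solve(A, c)

# usage with two equations and correlated errors
rng = np.random.default_rng(10)
T = 400
X1 = np.column_stack([np.ones(T), rng.standard_normal(T)])
X2 = np.column_stack([np.ones(T), rng.standard_normal(T)])
e = rng.multivariate_normal([0, 0], [[1.0, 0.7], [0.7, 1.0]], size=T)
y1 = X1 @ np.array([1.0, 2.0]) + e[:, 0]
y2 = X2 @ np.array([-1.0, 0.5]) + e[:, 1]
print(feasible_sure([y1, y2], [X1, X2]))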

8.3. SIMULTANEOUS EQUATION MODELS.


In economics it is very frequent to study the behaviour of more than two
variables, yt1 and yt2 , on other variables, such that yt2 might also depend
on yt1 and vice versa. That is, we have instantaneous model
0
yt1 = 1 yt2+ 1 xt + ut1
0
yt2 = y
2 t1 + 2 xt + ut2 .

The classical example is a demand-supply system, where demand and supply


depend on price, but at the same time price is determined by the quantity
traded.
Since we would expect that yt2 is correlated with ut1 and also yt1 with
ut2 , the consequence is that LSE will yield inconsistent estimators.
Another consequence is that in general the parameters of the model are
not identiable, e.g. 1 , 1 , 2 , 2 , since e.g. if we multiply the 2nd equation
by 21 and the rst by 2 and we add them up, we obtain
0
1 0 2 ut2
y1t = 2 1 + yt2 + 2 1 xt + 2u1t
2 2 2
0
= 0 e xt
1 y1t + 1 +u
~1t ,

which has an identical structure to the original equation.



8.3.1. IDENTIFICATION.
A simultaneous equation system can be written in general as

(77)   Ay_t + Bx_t = u_t,   or, stacking the T observations,   YA' + XB' = U,

where y_t is an (N \times 1) vector, x_t a (K \times 1) vector, A is (N \times N), B is (N \times K) and u_t \sim iid(0, \Sigma). We will not impose any constraint on \Sigma, and |A| \neq 0.
It is clear that we can write the following

(78)   y_t = -A^{-1}Bx_t + A^{-1}u_t = \Pi x_t + v_t.

Thus, (78) is a multivariate regression model known as the Reduced Form, in contrast with (77), known as the Structural Form. It is obvious that \Pi can be estimated consistently under the usual conditions, e.g. E(v_t \mid x_t) = 0.
That we can estimate \Pi amounts to saying that we can identify it. The question of identification then becomes: when can we recover the parameters in (A, B) from \Pi?
Assume that the errors are Gaussian. Then the log-likelihood is

Q(\theta, \Pi) = -\frac{NT}{2}\log(2\pi) - \frac{T}{2}\log|\Omega| - \frac{1}{2}\,\mathrm{trace}\left\{\Omega^{-1}\left(Y - X\Pi'\right)'\left(Y - X\Pi'\right)\right\},

where \theta stands for the parameters in C = (A, B) and \Omega = A^{-1}\Sigma A^{-1\prime}.
So, if \Pi and \Omega can be consistently estimated, one can think of those quantities as known. But

(79)   \Pi = -A^{-1}B,

and so the issue is: when do we have a unique solution to A\Pi + B = 0?
If there were no restrictions on C, then there is no unique solution, since if C^* = (A^*, B^*) is one solution, so is (A^0, B^0) = (PA^*, PB^*), where P is any nonsingular matrix. Indeed,

\Pi = -(A^0)^{-1}B^0 = -(PA^*)^{-1}(PB^*) = -A^{*-1}P^{-1}PB^* = -A^{*-1}B^*.

Thus, we cannot obtain C uniquely from \Pi.
Definition: We say that (77) is identifiable if the only matrix P linking (A^*, B^*) and (A^0, B^0) is the identity matrix.
Another way to see why, with no constraints on C, it is impossible to obtain C from \Pi is the following. Observe that \Pi is an (N \times K) matrix, so it has NK elements, whereas C contains N(N + K) parameters. So we would be trying to solve a system of NK equations in N(N + K) unknowns, which we know from elementary algebra is not possible.
Thus we need at least as many equations as unknowns, and hence we need to add to (79) at least an extra N^2 equations
involving the parameters \theta, e.g. W\theta = w, so that we have

\begin{pmatrix} V \\ W \end{pmatrix}\theta = \begin{pmatrix} 0 \\ w \end{pmatrix}.

The order condition: W needs to have at least N^2 rows.
The rank condition: rank\begin{pmatrix} V \\ W \end{pmatrix} = N(N + K).
Sometimes we do not need to know whether the whole system is identifiable, but just one equation, say the 1st one. When can we say that the 1st equation is identifiable? From (79) we have K restrictions. Indeed, denoting the first row of C by \alpha_1' = (a_1', b_1'), we have that

a_1'\Pi + b_1' = 0,
(1 \times N)(N \times K)   (1 \times K)

a system of K equations in N + K unknowns. So we need at least N more equations. Denote these restrictions by W_1\alpha_1 = w_1.
Theorem 8.1. \alpha_1 is identified iff rank(W_1C') = N (rank condition).
Remark 8.1. (i) If w_1 = 0, we call them homogeneous restrictions.
(ii) If there exist two independent sets of restrictions, W_1 and W_2, such that the above theorem holds for each of them, then the 1st equation (or the system) is overidentified. (This is important for efficiency.)
(iii) We always impose the normalization constraint a_{ii} = 1, \forall i.

Theorem 8.2. Assume that all the constraints other than the normalization are exclusion (zero) restrictions. Then \alpha_1 is identified iff r_1 \geq N - 1, i.e. the number of constraints is greater than or equal to the number of equations minus 1, and r(C^*) = N - 1, where C^* is the matrix formed from rows 2 to N of C, restricted to the columns corresponding to the zeroes of \alpha_1.
Example 8.1. Consider the three-equation system

y_{t1} + a_{12}y_{t2} + a_{13}y_{t3} + b_{11}x_{t1} + b_{12}x_{t2} + b_{13}x_{t3} + \cdots + b_{1K}x_{tK} = u_{t1}
a_{21}y_{t1} + y_{t2} + a_{23}y_{t3} + b_{21}x_{t1} + b_{22}x_{t2} + b_{23}x_{t3} + \cdots + b_{2K}x_{tK} = u_{t2}
a_{31}y_{t1} + a_{32}y_{t2} + y_{t3} + b_{31}x_{t1} + b_{32}x_{t2} + b_{33}x_{t3} + \cdots + b_{3K}x_{tK} = u_{t3}.

From the order condition, we need at least two constraints in the 1st equation. Assume that b_{11} and b_{12} are zero. Then consider the matrix formed by the columns of C whose first element is zero. If such a matrix has rank 2 (here N = 3), then the 1st equation is identified; otherwise it is not. A sketch of this rank check is given below.
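As a concrete illustration of the rank condition in Theorem 8.2 and Example 8.1, the helper below takes a coefficient matrix C = (A, B) and the column indices restricted to zero in the first equation, and checks whether the relevant submatrix has rank N - 1. The function name and interface are illustrative, not from the notes.

```python
import numpy as np

def first_equation_identified(C, zero_cols):
    """Rank condition of Theorem 8.2 / Example 8.1.
    C        : (N, N + K) structural coefficient matrix (A, B).
    zero_cols: column indices whose coefficient is restricted to zero
               in the first equation.
    Identified iff the submatrix of rows 2..N of those columns has rank N - 1."""
    N = C.shape[0]
    sub = C[1:, zero_cols]   # drop the first row, keep the restricted columns
    return np.linalg.matrix_rank(sub) == N - 1
```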
8.4. ESTIMATION OF A SIMULTANEOUS EQUATION MODEL.

Consider the following simultaneous equation system

Ay_t + Bx_t = u_t.

From now on we will assume that a_{ii} = 1, i = 1, \ldots, N, and that the system (or equation) is identifiable. Consider the 1st equation, that is,

(80)   y_{t1} = \sum_{i=2}^{m} a_{i1}y_{ti} + \sum_{i=1}^{k} b_{i1}x_{ti} + u_{t1}.
The first thing to notice is that the LSE of the a_{i1} and b_{i1} is inconsistent. This is expected from the fact that the RHS contains endogenous variables, the y_{ti}'s, and one would expect those variables to be correlated with the error term u_{t1}. In fact the LSE is inconsistent except in one special case, when the system is recursive, that is, when A is triangular and \Sigma is diagonal.
If the LSE is inconsistent, how can we estimate the parameters? In the standard linear regression model the method was the IVE.
But which instruments should we use? Recall that the efficiency of the IVE depends on the correlation between the regressors and the instruments: the higher the correlation, the better the IVE. Let us go back to the equation of interest, e.g. (80). In matrix form,

y_1 = Z_1\delta_1 + u_1 = Y_1a_1 + X_1b_1 + u_1,

where Z_1 = (Y_1, X_1), Y_1 is the matrix of observations on y_{ti}, i = 2, \ldots, m, and X_1 that on the included x_{ti}. All we need is to find the best instruments for Y_1. Obviously there is no need to instrument X_1, as it is not correlated with u_1. Thus, all we want is to find instruments W_1 as correlated as possible with Y_1.
To that end, consider the reduced form equation (78), that is,

Y = X\Pi' + V.

Recall that the i-th column of Y contains the (T \times 1) observations on y_i.
Based on this equation, and because X and V are uncorrelated, one candidate set of instruments is the best predictor of Y, which is X\hat{\Pi}', where \hat{\Pi} is the LSE of \Pi. Because we do not want all of \hat{Y} = X\hat{\Pi}', but only the columns corresponding to Y_1, and

\hat{Y} = X(X'X)^{-1}X'Y,

the instruments for Y_1 are \hat{Y}_1 = X(X'X)^{-1}X'Y_1. Observe that \hat{Y}_1 and u_1 are asymptotically uncorrelated because \hat{\Pi} \to_p \Pi, so that \hat{Y}_1 \approx X\Pi_1' and X \perp u.
So the IVE becomes

\hat{\delta}_1 = \left[\begin{pmatrix} \hat{Y}_1' \\ X_1' \end{pmatrix}(Y_1, X_1)\right]^{-1}\begin{pmatrix} \hat{Y}_1' \\ X_1' \end{pmatrix}y_1
= \begin{pmatrix} \hat{Y}_1'Y_1 & \hat{Y}_1'X_1 \\ X_1'Y_1 & X_1'X_1 \end{pmatrix}^{-1}\begin{pmatrix} \hat{Y}_1'y_1 \\ X_1'y_1 \end{pmatrix}
= \begin{pmatrix} \hat{Y}_1'\hat{Y}_1 & \hat{Y}_1'X_1 \\ X_1'\hat{Y}_1 & X_1'X_1 \end{pmatrix}^{-1}\begin{pmatrix} \hat{Y}_1'y_1 \\ X_1'y_1 \end{pmatrix},

since

\hat{Y}_1'X_1 = Y_1'X(X'X)^{-1}X'X_1 = Y_1'X_1

(because X_1 is a subset of the columns of X) and

\hat{Y}_1'\hat{Y}_1 = Y_1'X(X'X)^{-1}X'X(X'X)^{-1}X'Y_1 = Y_1'X(X'X)^{-1}X'Y_1 = Y_1'\hat{Y}_1 = \hat{Y}_1'Y_1.
This implies that the LSE of y_1 on \hat{Y}_1 and X_1 yields the same estimator. It can be shown that the asymptotic distribution of this estimator \hat{\delta}_1 is

T^{1/2}(\hat{\delta}_1 - \delta_1) \to_d N\left(0, \sigma^2\Phi_1^{-1}\right),

where \sigma^2 = E(u_{t1}^2) and

\Phi_1 = \mathrm{plim}\,\frac{1}{T}\begin{pmatrix} \hat{Y}_1'Y_1 & Y_1'X_1 \\ X_1'Y_1 & X_1'X_1 \end{pmatrix}
= \begin{pmatrix} \Pi_1'Q\Pi_1 & \Pi_1'Q_1 \\ Q_1'\Pi_1 & Q_{11} \end{pmatrix},

with \Pi_1'Q = \mathrm{plim}\,T^{-1}Y_1'X, \quad Q = \mathrm{plim}\,T^{-1}X'X, \quad Q_1 = \mathrm{plim}\,T^{-1}X'X_1, \quad Q_{11} = \mathrm{plim}\,T^{-1}X_1'X_1.
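A minimal numpy sketch of the single-equation 2SLS just derived (project Y_1 on X, then LS of y_1 on (\hat{Y}_1, X_1)); the names are illustrative and no small-sample refinements are included.

```python
import numpy as np

def two_sls(y1, Y1, X1, X):
    """2SLS for one structural equation.
    y1: (T,) dependent variable;  Y1: (T, m-1) included endogenous regressors;
    X1: (T, k1) included exogenous regressors;  X: (T, K) all exogenous variables."""
    # First stage: Y1_hat = X (X'X)^{-1} X' Y1.
    P_X = X @ np.linalg.solve(X.T @ X, X.T)
    Y1_hat = P_X @ Y1

    # Second stage: LS of y1 on (Y1_hat, X1) gives delta_1_hat = (a_1_hat, b_1_hat).
    Z_hat = np.column_stack([Y1_hat, X1])
    return np.linalg.lstsq(Z_hat, y1, rcond=None)[0]
```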
This is only for one equation. If what we want is the whole system, we can proceed in the same way equation by equation. We argued before that to estimate the parameters of the 1st equation, y_1 = Z_1\delta_1 + u_1, we perform LS in

y_1 = \hat{Z}_1\delta_1 + u_1,

where \hat{Z}_1 = (\hat{Y}_1, X_1). Then, for the whole system,

Y = Z\delta + U

with

Y = \begin{pmatrix} y_1 \\ \vdots \\ y_N \end{pmatrix}, \quad Z = \begin{pmatrix} Z_1 & & O \\ & \ddots & \\ O & & Z_N \end{pmatrix}, \quad U = \begin{pmatrix} u_1 \\ \vdots \\ u_N \end{pmatrix}.

So we now regress Y on \hat{Z}, that is Z with each Z_i replaced by \hat{Z}_i, i = 1, \ldots, N. This estimator is known as the 2SLS (two-stage least squares).
From the above equation we can easily observe that this estimator is not (generally) efficient, the reason being that \Sigma is not diagonal (compare with Theorem 3.11). What we did there was to transform the model in such a way that the transformed error had a scalar covariance matrix, \sigma^2 I say.
Let \Sigma = (\sigma_{ij}) with \sigma_{ij} = E(u_{ti}u_{tj}); then E(UU') = (\Sigma \otimes I_T). Transforming the model,

(\Sigma^{-1/2} \otimes I)Y = (\Sigma^{-1/2} \otimes I)\hat{Z}\delta + (\Sigma^{-1/2} \otimes I)U
\Longrightarrow \hat{\delta} = \left[\hat{Z}'(\Sigma^{-1} \otimes I)\hat{Z}\right]^{-1}\hat{Z}'(\Sigma^{-1} \otimes I)Y,

which by standard algebra is

(81)   \hat{\delta} = \left[Z'\left(\Sigma^{-1} \otimes X(X'X)^{-1}X'\right)Z\right]^{-1}Z'\left(\Sigma^{-1} \otimes X(X'X)^{-1}X'\right)Y.
Now, although this estimator of \delta is (generally) more efficient, it depends on \Sigma^{-1}. To make the estimator feasible, we replace \Sigma^{-1} by \hat{\Sigma}^{-1}, a consistent estimator. Consider

\hat{\sigma}_{ij} = \frac{1}{T}\sum_{t=1}^{T}\hat{u}_{it}\hat{u}_{jt},

where the \hat{u}_i are the residuals from the 2SLSE, and set \hat{\Sigma} = (\hat{\sigma}_{ij}). Equation (81) with \Sigma^{-1} replaced by \hat{\Sigma}^{-1} is known as the 3SLS (three-stage least squares). Its properties are that

T^{1/2}(\hat{\delta} - \delta) \to_d N(0, V),

where V = \left[\mathrm{plim}\,\frac{1}{T}Z'\left(\Sigma^{-1} \otimes X(X'X)^{-1}X'\right)Z\right]^{-1}.
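The following sketch strings the stages together for the whole system, in the spirit of (81): 2SLS equation by equation, \hat{\Sigma} from the 2SLS residuals, then system GLS with the instruments X. Function and variable names are illustrative assumptions.

```python
import numpy as np

def three_sls(y_list, Z_list, X):
    """Feasible 3SLS in the spirit of (81).
    y_list: list of (T,) dependent variables.
    Z_list: list of (T, p_i) right-hand-side matrices Z_i = (Y_i, X_i).
    X     : (T, K) matrix of all exogenous variables (the instruments)."""
    N, T = len(y_list), y_list[0].shape[0]
    P_X = X @ np.linalg.solve(X.T @ X, X.T)   # projection onto the columns of X

    # 2SLS equation by equation; keep the residuals for Sigma_hat.
    R = np.empty((T, N))
    for i, (y, Z) in enumerate(zip(y_list, Z_list)):
        d_2sls = np.linalg.lstsq(P_X @ Z, y, rcond=None)[0]
        R[:, i] = y - Z @ d_2sls
    Sigma_hat = R.T @ R / T

    # System step: weight matrix Sigma_hat^{-1} kron P_X, as in (81).
    Z_big = np.zeros((N * T, sum(Z.shape[1] for Z in Z_list)))
    col = 0
    for i, Z in enumerate(Z_list):
        Z_big[i * T:(i + 1) * T, col:col + Z.shape[1]] = Z
        col += Z.shape[1]
    y_big = np.concatenate(y_list)
    W = np.kron(np.linalg.inv(Sigma_hat), P_X)
    return np.linalg.solve(Z_big.T @ W @ Z_big, Z_big.T @ W @ y_big)
```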
To finish this section we discuss the perhaps most obvious way to estimate the \delta_i, i = 1, \ldots, N, though not one without problems. We have seen that there exists a relationship between \Pi, the reduced form parameters, and C, the structural form parameters, namely (79). We have also discussed that \Pi can always be consistently estimated. This suggests estimating the parameters in C by solving the system of equations

\hat{A}\hat{\Pi} + \hat{B} = 0.

When the only restrictions are exclusion restrictions, the parameters of the first equation appear in the 1st rows of \hat{A} and \hat{B}. Because the diagonal elements of A are one, partitioning \hat{\Pi} conformably with (1, \hat{a}_1', 0) (first, included and excluded endogenous variables) and (\hat{b}_1', 0) (included and excluded exogenous variables),

(1, \hat{a}_1', 0)\begin{pmatrix} \hat{\Pi}_{11} & \hat{\Pi}_{12} \\ \hat{\Pi}_{21} & \hat{\Pi}_{22} \\ \hat{\Pi}_{31} & \hat{\Pi}_{32} \end{pmatrix} + (\hat{b}_1', 0) = 0,

where \hat{\Pi}_{12} has K_2 columns, K_2 being the number of excluded exogenous variables. Therefore, we have

\hat{\Pi}_{11} + \hat{a}_1'\hat{\Pi}_{21} + \hat{b}_1' = 0, \quad \hat{\Pi}_{12} + \hat{a}_1'\hat{\Pi}_{22} = 0
\Longrightarrow \hat{a}_1' = -\hat{\Pi}_{12}\hat{\Pi}_{22}^{-1}, \quad \hat{b}_1' = -\hat{\Pi}_{11} + \hat{\Pi}_{12}\hat{\Pi}_{22}^{-1}\hat{\Pi}_{21},

and the same applies to the other equations. This estimator is called Indirect Least Squares (ILS) and is consistent by the consistency of \hat{\Pi} and Theorem 2.3. However, when the system is overidentified there will be more than one solution, with implications for efficiency.
Example 8.2. Suppose that a_1 is a scalar and \hat{\Pi}_{12} and \hat{\Pi}_{22} are (1 \times K_2), i.e. there are K_2 exogenous variables excluded from the 1st equation. Then, clearly, we have K_2 equations and 1 coefficient, so there is more than one solution. We should mention that all the solutions are consistent, i.e. they converge in probability to a_1. So when the system is overidentified the ILS is not efficient, as some of the information given by the system has not been taken into account. Compare with the IVE when the number of instruments exceeds the number of variables to be instrumented.
There is a special situation where these three estimators become the same: if the system is (just) identifiable, and thus a unique ILS exists, then ILS = 2SLS = 3SLS.
Also, 2SLS = 3SLS if
(i) there are no cross-equation restrictions and \Sigma = \sigma^2 I or \Sigma = diag(\sigma_1^2, \ldots, \sigma_N^2), or
(ii) each equation is just identifiable.
9. HYPOTHESIS TESTING
There are three procedures available: (i) the Wald (W), (ii) the Lagrange Multiplier (LM) and (iii) the Likelihood Ratio (LR) tests.
The purpose is, based on a sample \{z_t\}_{t=1}^T, to decide whether its mean or variance, or perhaps the conditional expectation, equals some specific value. For example, we would like to know whether \beta_1 = 0, or R\beta = r, in the model

y_t = \beta'x_t + u_t.

To estimate the parameters we looked at

\hat{\theta} = \arg\min_{\theta \in \Theta} Q(\theta),

which satisfies the FOC

(82)   \frac{\partial}{\partial\theta}Q(\hat{\theta}) = 0.
9.0.1. Wald (W).
Based on \hat{\theta}, the idea is to decide whether the constraints on \theta hold true for \hat{\theta}.
Example 9.1. If H_0: \theta_1 = 0, then the Wald test tries to decide whether \hat{\theta}_1 \simeq 0.

9.0.2. Lagrange multiplier (LM).
Because \hat{\theta} satisfies (82), the idea of this test is to decide whether the FOC evaluated under the null holds (approximately). That is, if \tilde{\theta} is the estimator obtained using the restrictions, we look at

\frac{\partial}{\partial\theta}Q(\tilde{\theta}) \simeq 0.

Example 9.2. Consider \theta = (\theta_1, \theta_2')'. If H_0: \theta_1 = 0, then the LM tries to decide whether

\frac{\partial}{\partial\theta}Q(0, \tilde{\theta}_2) \simeq 0, \qquad \tilde{\theta} = (0, \tilde{\theta}_2')'.

9.0.3. Likelihood Ratio Test (LR).
This test tries to decide whether the ratio between the minimum of Q(\theta) with and without the constraints is 1, that is,

Q(\tilde{\theta})^{-1}Q(\hat{\theta}) \simeq 1.
The standard LR formulation is

LR = 2\left(\log Q(\tilde{\theta}) - \log Q(\hat{\theta})\right).

A popular test is the LM, because of its comparative simplicity. Observe that to implement W we estimate the model under the alternative, for the LM we estimate the model under the null, whereas the LR requires estimation under both the null and the alternative hypotheses.
Example 9.3. Consider the following nonlinear regression model

(83)   y_t = \beta_1 x_{1t} + \beta_2(x_{2t} - \alpha)^{-1} + \varepsilon_t,

where x_{1t} is real national income, x_{2t} the rate of interest and y_t real money demand. We wish to test for the existence of a liquidity trap, i.e. H_0: \alpha = 0. If we can find a test that only requires estimating the model under the null, the computations simplify, since under H_0 the model is linear:

y_t = \beta_1 x_{1t} + \beta_2 x_{2t}^{-1} + \varepsilon_t.
9.0.4. The Wald Test.
Let Q(\theta) be the objective function. The W test is based on how far \hat{\theta} is from the null. Consider H_0: h(\theta_0) = 0; then we wish to know whether h(\hat{\theta}) is statistically different from zero. The form of the W test for this very general hypothesis is

W = T^{1/2}h(\hat{\theta})'\left[\frac{\partial h}{\partial\theta'}(\hat{\theta})\,\widehat{Asyvar}(\hat{\theta})\,\frac{\partial h'}{\partial\theta}(\hat{\theta})\right]^{-1}T^{1/2}h(\hat{\theta}),

which is asymptotically \chi^2(r), where r is the dimension of the vector h(\theta).
9.0.5. The Lagrange Multiplier.

The idea of the LM test is to decide how far from zero the first derivative of Q(\theta) is when evaluated at \tilde{\theta}, i.e. the minimizer of Q(\theta) under the null h(\theta) = 0. The question is how far \frac{\partial}{\partial\theta}Q(\tilde{\theta}) is from zero, where

\tilde{\theta} = \arg\min_{\theta \in \Theta,\; h(\theta) = 0} Q(\theta).

This test is sometimes called the score test. What is the form of the test? It is similar to testing \beta = 0, for which we used

\hat{\beta}'\left[\widehat{asyvar}(\hat{\beta})\right]^{-1}\hat{\beta}.

Here we want to know whether

\frac{\partial}{\partial\theta}Q(\theta_0) = 0,

so we can regard \frac{\partial}{\partial\theta}Q(\theta_0) as our \beta above, and the test becomes

LM = \frac{1}{T^{1/2}}\frac{\partial}{\partial\theta}Q(\tilde{\theta})'\left[\widehat{asyvar}\left(\frac{1}{T^{1/2}}\frac{\partial}{\partial\theta}Q(\tilde{\theta})\right)\right]^{-1}\frac{1}{T^{1/2}}\frac{\partial}{\partial\theta}Q(\tilde{\theta}).

The tests are such that in small samples

W \geq LR \geq LM.

This implies that the W test has greater power than the LR and LM, but also a higher type I error.
9.0.6. Properties.
The asymptotic properties of the W, LR and LM tests are:
(a) LR \to_d \chi^2(s), W \to_d \chi^2(s) and LM \to_d \chi^2(s), where s = number of constraints; and
(b) they are consistent. That is, if the null is not true, the tests reject with probability tending to 1 as T \nearrow \infty.
Example 9.4. Consider the following linear regression model

y_t = \beta'x_t + u_t

and the objective function (Least Squares)

Q(\beta) = \frac{1}{2\sigma^2}(Y - X\beta)'(Y - X\beta).

Our interest is H_0: R\beta_0 = r vs. H_1: R\beta_0 \neq r.
The LM principle is as follows:

\frac{\partial}{\partial\beta}Q(\tilde{\beta}) = -\frac{X'(Y - X\tilde{\beta})}{\sigma^2},

where \tilde{\beta} is the restricted least squares estimator. Now

E\left(\frac{\partial^2}{\partial\beta\partial\beta'}Q(\beta)\right) = \frac{1}{\sigma^2}X'X = Var\left(\frac{\partial}{\partial\beta}Q(\beta)\right).

Denoting by \tilde{U} the restricted least squares residuals, the LM is

LM = \frac{\tilde{U}'X(X'X)^{-1}X'\tilde{U}}{\sigma^2} if \sigma^2 is known, \quad \frac{\tilde{U}'X(X'X)^{-1}X'\tilde{U}}{\tilde{\sigma}^2} if \sigma^2 is unknown.

The W principle gives

W = \frac{(R\hat{\beta} - r)'\left[R(X'X)^{-1}R'\right]^{-1}(R\hat{\beta} - r)}{\sigma^2} if \sigma^2 is known, \quad \frac{(R\hat{\beta} - r)'\left[R(X'X)^{-1}R'\right]^{-1}(R\hat{\beta} - r)}{\hat{\sigma}^2} if \sigma^2 is unknown,

where \tilde{\sigma}^2 = \frac{1}{T}\sum_{t=1}^{T}\tilde{u}_t^2 and \hat{\sigma}^2 = \frac{1}{T}\sum_{t=1}^{T}\hat{u}_t^2.
Finally, the LR would be LR = T(\log\tilde{\sigma}^2 - \log\hat{\sigma}^2).
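A minimal sketch computing the three statistics of Example 9.4 for H_0: R\beta = r with \sigma^2 unknown; the restricted-LS formula and the variable names are standard textbook expressions rather than anything specific to these notes.

```python
import numpy as np

def wald_lm_lr(y, X, R, r):
    """W, LM and LR statistics for H0: R beta = r in y = X beta + u."""
    T = X.shape[0]
    XtX_inv = np.linalg.inv(X.T @ X)

    # Unrestricted LSE and residual variance.
    b_hat = XtX_inv @ X.T @ y
    u_hat = y - X @ b_hat
    s2_hat = u_hat @ u_hat / T

    # Restricted LSE: b_tilde = b_hat - (X'X)^{-1} R' [R (X'X)^{-1} R']^{-1} (R b_hat - r).
    RXR_inv = np.linalg.inv(R @ XtX_inv @ R.T)
    b_tilde = b_hat - XtX_inv @ R.T @ RXR_inv @ (R @ b_hat - r)
    u_tilde = y - X @ b_tilde
    s2_tilde = u_tilde @ u_tilde / T

    W = (R @ b_hat - r) @ RXR_inv @ (R @ b_hat - r) / s2_hat
    LM = u_tilde @ X @ XtX_inv @ X.T @ u_tilde / s2_tilde
    LR = T * (np.log(s2_tilde) - np.log(s2_hat))
    return W, LM, LR
```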
Example 9.5. The liquidity trap model.

Consider the model (83); we wish to test H_0: \alpha = 0 vs. H_1: \alpha \neq 0, with objective function

Q(\theta) = \frac{1}{2\sigma^2}\sum_{t=1}^{T}\left(y_t - \beta_1 x_{1t} - \beta_2(x_{2t} - \alpha)^{-1}\right)^2.

As mentioned above, we shall implement an LM test, as the model is linear under H_0. But how? The derivatives are

\frac{\partial}{\partial\beta_1}Q(\theta) = -\frac{1}{\sigma^2}\sum_{t=1}^{T}x_{1t}\left(y_t - \beta_1 x_{1t} - \beta_2(x_{2t} - \alpha)^{-1}\right)
\frac{\partial}{\partial\beta_2}Q(\theta) = -\frac{1}{\sigma^2}\sum_{t=1}^{T}(x_{2t} - \alpha)^{-1}\left(y_t - \beta_1 x_{1t} - \beta_2(x_{2t} - \alpha)^{-1}\right)
\frac{\partial}{\partial\alpha}Q(\theta) = -\frac{1}{\sigma^2}\sum_{t=1}^{T}\beta_2(x_{2t} - \alpha)^{-2}\left(y_t - \beta_1 x_{1t} - \beta_2(x_{2t} - \alpha)^{-1}\right),
and when evaluated at \tilde{\theta} = (\tilde{\beta}_1, \tilde{\beta}_2, 0)', the restricted estimator, the first two become

\sum_{t=1}^{T}x_{1t}\left(y_t - \tilde{\beta}_1 x_{1t} - \tilde{\beta}_2 x_{2t}^{-1}\right) = 0
\sum_{t=1}^{T}x_{2t}^{-1}\left(y_t - \tilde{\beta}_1 x_{1t} - \tilde{\beta}_2 x_{2t}^{-1}\right) = 0,

whereas

-\frac{1}{\tilde{\sigma}^2}\sum_{t=1}^{T}\tilde{\beta}_2 x_{2t}^{-2}\left(y_t - \tilde{\beta}_1 x_{1t} - \tilde{\beta}_2 x_{2t}^{-1}\right) = ?

What is the second moment of this score vector? We have

E\left[\begin{pmatrix} \sum_t x_{1t}\varepsilon_t \\ \sum_t x_{2t}^{-1}\varepsilon_t \\ \beta_2\sum_t x_{2t}^{-2}\varepsilon_t \end{pmatrix}\left(\sum_t x_{1t}\varepsilon_t,\; \sum_t x_{2t}^{-1}\varepsilon_t,\; \beta_2\sum_t x_{2t}^{-2}\varepsilon_t\right)\right] = \sigma^2 B,

where

B = \begin{pmatrix} \sum_t x_{1t}^2 & \sum_t x_{1t}x_{2t}^{-1} & \beta_2\sum_t x_{1t}x_{2t}^{-2} \\ \sum_t x_{1t}x_{2t}^{-1} & \sum_t x_{2t}^{-2} & \beta_2\sum_t x_{2t}^{-3} \\ \beta_2\sum_t x_{1t}x_{2t}^{-2} & \beta_2\sum_t x_{2t}^{-3} & \beta_2^2\sum_t x_{2t}^{-4} \end{pmatrix} = \sigma^2 E\left(\frac{\partial^2}{\partial\theta\partial\theta'}Q(\theta)\right),

since \varepsilon_t \sim iid(0, \sigma^2).
Hence, the LM test is

\frac{1}{\tilde{\sigma}^2}\left(0,\; 0,\; \tilde{\beta}_2\sum_t x_{2t}^{-2}\tilde{\varepsilon}_t\right)\tilde{B}^{-1}\left(0,\; 0,\; \tilde{\beta}_2\sum_t x_{2t}^{-2}\tilde{\varepsilon}_t\right)'
= \frac{1}{\tilde{\sigma}^2}\left(\tilde{\beta}_2\sum_t x_{2t}^{-2}\tilde{\varepsilon}_t\right)^2\left[\tilde{\beta}_2^2\sum_t x_{2t}^{-4} - \left(\tilde{\beta}_2\sum_t x_{1t}x_{2t}^{-2},\; \tilde{\beta}_2\sum_t x_{2t}^{-3}\right)\tilde{A}^{-1}\left(\tilde{\beta}_2\sum_t x_{1t}x_{2t}^{-2},\; \tilde{\beta}_2\sum_t x_{2t}^{-3}\right)'\right]^{-1},

where \tilde{A} is the upper-left 2 \times 2 block of \tilde{B}. This is TR^2, where R^2 is the coefficient of multiple correlation in the regression of \tilde{\varepsilon}_t on \partial\tilde{\varepsilon}_t/\partial\beta_1, \partial\tilde{\varepsilon}_t/\partial\beta_2 and \partial\tilde{\varepsilon}_t/\partial\alpha.
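The TR^2 form suggests a very simple implementation: estimate the restricted (linear) model, then regress its residuals on the three derivatives evaluated at the restricted estimates. The sketch below does exactly that; names are illustrative and no intercept is included, as in (83).

```python
import numpy as np

def lm_liquidity_trap(y, x1, x2):
    """LM (T * R^2) test of H0: alpha = 0 in y_t = b1*x1_t + b2/(x2_t - alpha) + e_t."""
    T = y.shape[0]

    # Restricted model (alpha = 0): LS of y on x1 and 1/x2.
    Z = np.column_stack([x1, 1.0 / x2])
    b_tilde = np.linalg.lstsq(Z, y, rcond=None)[0]
    e_tilde = y - Z @ b_tilde

    # Derivatives of the residual wrt (b1, b2, alpha), evaluated at the restricted point.
    D = np.column_stack([x1, 1.0 / x2, b_tilde[1] / x2**2])

    # T * R^2 from the regression of e_tilde on D (uncentered R^2, no intercept in the model).
    fitted = D @ np.linalg.lstsq(D, e_tilde, rcond=None)[0]
    R2 = (fitted @ fitted) / (e_tilde @ e_tilde)
    return T * R2
```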
10. UNIT ROOTS AND COINTEGRATION

Economic theory generally deals with equilibrium relationships, and econometricians use statistical tools to confront these theories with the data. Traditionally, econometric models were built on the assumption that the data are stationary, or stationary around a trend. However, data are rarely stationary, in that we observe that the level of the series changes over time.
The importance of the stationarity assumption for inference was highlighted by the problem of Spurious Regressions introduced by Granger and Newbold.
They generated two independent series with a strong trending component, e.g. ARIMA(0,1,0). Obviously, as the series are independent, neither can help to explain the other. However, when they regressed one variable on the other, they found that the t-test was highly significant and R^2 \simeq 1. The conclusion would be that not only are the variables related, but they explain each other extremely well! Obviously something is wrong here. The explanation lies in the type of data generated, i.e. they were generated with unit roots and, possibly, a drift. These types of variables are known as stochastic trends.
Definition 10.1. We say that x_t, t = 1, \ldots, T, has a unit root if

x_t = \mu + x_{t-1} + v_t, \quad t = 1, \ldots, T,

where v_t \sim ARMA. \mu is known as the drift parameter.
This feature is very often observed in economic data. Some examples are inflation, stock prices and real interest rates, with implications, for instance, for the permanent income hypothesis.
Remark 10.1. Economists have argued very strongly against the practice of differencing the data as in Box-Jenkins. The motivation comes from the observation that doing so makes it impossible to perform any inference on the long-run, equilibrium relationships of the economy, which after all is what economists are aiming at.
These arguments suggest studying the relations among the variables in levels (not differenced) and examining the possibility of unit roots in a statistical fashion, rather than by visual inspection as was done in Box-Jenkins.
Definition 10.2. (Unit roots): We say that x_t has a unit root if we need to difference x_t to obtain stationarity. Sometimes this is expressed by saying that x_t is integrated of order 1, I(1).
If one generates data from

TSP: y_t = \mu + \beta t + u_t, \quad t = 1, \ldots, T,
DSP: \Delta y_t = \beta + u_t, \quad t = 1, \ldots, T,

where u_t \sim iid, and we plot y_t, we see that the two series look very much alike.
That is, both the TSP and DSP models seem to generate similar patterns for y_t. On the other hand, we have discussed that the implications and consequences for statistical inference can be quite different depending on whether y_t is TSP or DSP. Also, \Delta y_t for both models looks very similar, in that it appears stationary around its mean. The question then is: which model should one choose, or where does the data come from?
Discriminating between TSP and DSP was first addressed by Nelson & Plosser (1982); in principle it is a non-nested testing problem. Their approach was to nest both models in an artificial equation and then test for unit roots. That is, they introduced

y_t = \mu + \beta t + \rho y_{t-1} + \varepsilon_t, \quad t = 1, \ldots, T,
y_t - y_{t-1} = \mu + \beta t + (\rho - 1)y_{t-1} + \varepsilon_t
             = \beta_0 + \beta_1 t + (\rho - 1)y_{t-1} + \varepsilon_t.
Remark 10.2. Similarly, one might have started from the specification

y_t = \beta_0 + \beta_1 t + u_t, \quad u_t = \rho u_{t-1} + \varepsilon_t, \quad t = 1, \ldots, T,

and then employed a Cochrane-Orcutt type transformation.
The hypothesis test is

H_0: \rho = 1 and \beta_1 = 0, \quad H_1: negation of the null.

If H_0 is rejected, the data belong to the TSP class, whereas if H_0 is not rejected, they belong to the DSP class. So how can we test H_0, and, more importantly, what are the properties of the tests?
The relevance of unit roots in economics comes from the observation that, under a unit root, a shock to the economy has a permanent effect; e.g. a change in monetary policy would have a permanent effect on output. On the contrary, if the data were TSP, stationary around a time trend, then the effect is only transitory.
We shall begin by considering the AR(1) model

(84)   x_t = \rho x_{t-1} + \varepsilon_t, \quad t = 1, \ldots, T,

where \varepsilon_t \sim iid, and testing H_0: \rho = 1 against H_1: \rho < 1.
We already know that if |\rho| < 1, then the LSE, that is (71), satisfies the CLT

(85)   T^{1/2}(\hat{\rho} - \rho) \to_d N(0, 1 - \rho^2).
But what if \rho = 1? The first issue we observe from (85) is that 1 - \rho^2 = 0, so that the asymptotic variance is zero. So it seems that the theory that works fine for |\rho| < 1 will not work for \rho = 1. This was examined by Dickey & Fuller (1974) (Fuller, 1976). They showed that, when \rho = 1,

T(\hat{\rho} - \rho) \to_d (a nonstandard) distribution.

The first point to mention is that we need to normalize the LSE of \rho by T instead of T^{1/2} to obtain a proper limiting distribution, which is

T(\hat{\rho} - 1) \to_d \frac{\frac{1}{2}\left(B(1)^2 - 1\right)}{\int_0^1 B(r)^2 dr} = \frac{\int_0^1 B(r)dB(r)}{\int_0^1 B(r)^2 dr},

if \varepsilon_t \sim iid in (84), where B(r) is the standard Brownian motion, that is, for fixed r, B(r) is distributed as N(0, r), and the r.v.'s B(r_4) - B(r_3) and B(r_2) - B(r_1) are independent for all 0 \leq r_1 < r_2 \leq r_3 < r_4 \leq 1. Moreover,
Phillips (1987, 1988) showed that

t_{\hat{\rho}} = \frac{\hat{\rho} - 1}{SE(\hat{\rho})} \to_d \frac{\int_0^1 B(r)dB(r)}{\left(\int_0^1 B(r)^2 dr\right)^{1/2}} = \frac{\frac{1}{2}\left(B(1)^2 - 1\right)}{\left(\int_0^1 B(r)^2 dr\right)^{1/2}}.

The immediate consequence is that the t distribution is no longer valid.
One question is: what happens if \varepsilon_t is not iid? Then

T(\hat{\rho} - 1) \to_d \frac{\sigma^2 B(1)^2 - \sigma_\varepsilon^2}{2\sigma^2\int_0^1 B(r)^2 dr},

where

\sigma_\varepsilon^2 = E(\varepsilon_t^2), \qquad \sigma^2 = \lim_{T \to \infty} E\left(\frac{1}{T^{1/2}}\sum_{t=1}^{T}\varepsilon_t\right)^2.

So what do we notice?
(a) The limiting distribution changes.
(b) Even though the error \varepsilon_t is autocorrelated, the LSE is still consistent, contrary to what we have seen so far.
How can we perform the test? We use the t-test, which satisfies

\frac{\sigma_\varepsilon}{\sigma}t_{\hat{\rho}} \to_d \frac{B(1)^2 - 1}{2\left(\int_0^1 B(r)^2 dr\right)^{1/2}} + \frac{\sigma^2 - \sigma_\varepsilon^2}{2\sigma^2\left(\int_0^1 B(r)^2 dr\right)^{1/2}}.
10.0.7. Serial Correlation.

When the \varepsilon_t are serially correlated, a popular approach is to model the serial correlation, e.g. we run the regression

\Delta x_t = (\rho - 1)x_{t-1} + \sum_{j=1}^{k}\phi_j\Delta x_{t-j} + u_t

for some k. This is known as the Augmented Dickey-Fuller (ADF) test, for which

T(\hat{\rho} - 1) \to_d \frac{B(1)^2 - 1}{2\int_0^1 B(r)^2 dr}

and its corresponding t-statistic satisfies

t_{\hat{\rho}} \to_d \frac{B(1)^2 - 1}{2\left(\int_0^1 B(r)^2 dr\right)^{1/2}}.

So the limiting distributions are the same as when \varepsilon_t was uncorrelated.
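A minimal sketch of the ADF regression just described, without intercept or trend; it returns \hat{\rho} - 1 and its t-ratio, which must then be compared with the Dickey-Fuller tables. The implementation choices (numpy, the lag construction) are illustrative.

```python
import numpy as np

def adf_regression(x, k):
    """ADF regression: dx_t on x_{t-1} and k lags of dx_t (no intercept, no trend).
    Returns (rho_hat - 1) and its t-ratio."""
    dx = np.diff(x)
    n = dx.shape[0]

    # Regressors: lagged level x_{t-1}, then dx_{t-1}, ..., dx_{t-k}.
    X = np.column_stack([x[k:n]] + [dx[k - j:n - j] for j in range(1, k + 1)])
    y = dx[k:n]

    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y
    resid = y - X @ b
    s2 = resid @ resid / (y.shape[0] - X.shape[1])
    se = np.sqrt(s2 * XtX_inv[0, 0])
    return b[0], b[0] / se
```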
The problem with this procedure is how to choose k. To address it, Phillips & Perron (1987) suggested estimating \sigma^2 by

\hat{\sigma}^2 = \frac{1}{T}\sum_{t=1}^{T}\hat{\varepsilon}_t^2 + \frac{2}{T}\sum_{j=1}^{\ell}w_j\sum_{t=j+1}^{T}\hat{\varepsilon}_t\hat{\varepsilon}_{t-j}

and modifying the t-test as

\frac{\sigma_\varepsilon}{\sigma}t_{\hat{\rho}} + \frac{\sigma_\varepsilon^2 - \sigma^2}{2\sigma^2\left(\int_0^1 B(r)^2 dr\right)^{1/2}} \to_d \frac{B(1)^2 - 1}{2\left(\int_0^1 B(r)^2 dr\right)^{1/2}},

whose sample counterpart is

\hat{z}_T = \frac{\hat{\sigma}_\varepsilon}{\hat{\sigma}}t_{\hat{\rho}} + \frac{\hat{\sigma}_\varepsilon^2 - \hat{\sigma}^2}{2\hat{\sigma}\left(T^{-2}\sum_{t=1}^{T}x_{t-1}^2\right)^{1/2}}.
10.0.8. What Happens if we Allow for an Intercept?

Recall that our basic model was

\Delta x_t = (\rho - 1)x_{t-1} + \varepsilon_t \qquad (x_t = \rho x_{t-1} + \varepsilon_t),

and suppose that we include an intercept \mu. That is, we estimate the model

\Delta x_t = \mu + (\rho - 1)x_{t-1} + \varepsilon_t,

so that the LSE of \rho - 1 is

\hat{\rho} - 1 = \frac{\sum_{t=1}^{T}(x_{t-1} - \bar{x}_{-1})\varepsilon_t}{\sum_{t=1}^{T}x_{t-1}^2 - T\bar{x}_{-1}^2}, \qquad \bar{x}_{-1} = \frac{1}{T}\sum_{t=2}^{T}x_{t-1}.

In this case, we have that

T(\hat{\rho} - 1) \to_d \frac{\frac{1}{2}\left(B(1)^2 - \sigma_\varepsilon^2/\sigma^2\right) - B(1)\int_0^1 B(r)dr}{\int_0^1 B(r)^2 dr - \left(\int_0^1 B(r)dr\right)^2}.
One consequence is that the corresponding t-test satisfies

\frac{\sigma_\varepsilon}{\sigma}t_{\hat{\rho}} \to_d \frac{B(1)^2 - \sigma_\varepsilon^2/\sigma^2 - 2B(1)\int_0^1 B(r)dr}{2\left(\int_0^1 B(r)^2 dr - \left(\int_0^1 B(r)dr\right)^2\right)^{1/2}}.
Fortunately, Dickey & Fuller also tabulated this distribution. However, one key requirement for its validity is that the true value of \mu be 0.
If \mu \neq 0, we then have that

T^{3/2}(\hat{\rho} - 1) \to_d N\left(0, \frac{12\sigma_\varepsilon^2}{\mu^2}\right),

and hence t_{\hat{\rho}} \to_d N(0, 1). So, once again we see that, contrary to the situation where the regressors are stationary, when the data have unit roots, changing the model implies that the distribution also changes!
What happens if the model also includes a time trend? That is,

x_t = \mu + \beta t + \rho x_{t-1} + \varepsilon_t.

We wish to test \rho = 1, \beta = 0, allowing \mu \neq 0 in general (in this sense the augmented model is regarded as a solution to the dependence of the previous test on whether \mu = 0 or \mu \neq 0). Then

\frac{\sigma_\varepsilon}{\sigma}t_{\hat{\rho}} \to_d \frac{\frac{1}{2}\left(B(1)^2 - \sigma_\varepsilon^2/\sigma^2\right) - B(1)\int_0^1 B(r)dr - 12\int_0^1\left(r - \frac{1}{2}\right)B(r)dr\int_0^1\left(r - \frac{1}{2}\right)dB(r)}{\left(\int_0^1 B(r)^2 dr - \left(\int_0^1 B(r)dr\right)^2 - 12\left(\int_0^1\left(r - \frac{1}{2}\right)B(r)dr\right)^2\right)^{1/2}}.
One word of caution. We observe that the distribution depends on \sigma^2, as in the Dickey-Fuller test. As we did there, the same modifications apply: we can build an ADF or a Phillips-Perron type modification to make the distribution of our statistics free of \sigma^2.
10.0.9. Testing the Null of I(0): No Unit Roots.

So far our null hypothesis has been I(1), that is, we test whether the data follow a unit root model. One modification, which avoids a possible bias towards finding a unit root, is to take the hypothesis of I(0) as our null. This is the approach taken by Kwiatkowski, Phillips, Schmidt and Shin (1992), KPSS. The corresponding test is

\eta_T = \frac{1}{T}\sum_{t=1}^{T}\left(\frac{1}{T^{1/2}}\sum_{s=1}^{t}\hat{e}_s\right)^2\Big/\hat{\sigma}^2 \to_d \int_0^1 V_2(r)^2 dr,

where \hat{e}_s = (x_s - \bar{x}) - (s - \bar{s})\left[\sum_{j=1}^{T}(j - \bar{s})x_j\big/\sum_{j=1}^{T}(j - \bar{s})^2\right] are the residuals from the regression of x_t on a constant and a linear trend, \hat{\sigma}^2 is a consistent estimator of their long-run variance, and

V_2(r) = B(r) - rB(1) - 6\left(r^2 - r\right)\int_0^1\left(s - \frac{1}{2}\right)dB(s);

the distribution has been tabulated by the authors.
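A sketch of the computation, under the (assumed) choice of Bartlett weights for the long-run variance; the resulting value is to be compared with the KPSS (1992) tables.

```python
import numpy as np

def kpss_trend_statistic(x, ell):
    """KPSS-type statistic with a linear trend: partial sums of the detrended
    residuals, normalized by T^2 and a long-run variance estimate."""
    T = x.shape[0]
    trend = np.arange(1, T + 1, dtype=float)

    # Residuals from the regression of x_t on (1, t).
    Z = np.column_stack([np.ones(T), trend])
    e = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]

    # Long-run variance with Bartlett weights (an assumed choice).
    s2 = e @ e / T
    for j in range(1, ell + 1):
        s2 += 2.0 * (1.0 - j / (ell + 1)) * (e[j:] @ e[:-j]) / T

    # T^{-2} * sum_t S_t^2 / s2, where S_t are the partial sums of the residuals.
    S = np.cumsum(e)
    return (S @ S) / (T**2 * s2)
```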
10.1. COINTEGRATION.
Consider x_t and y_t, two scalar I(d) sequences. We say that x_t and y_t are cointegrated if there exists a vector \alpha such that

\alpha_1 x_t + \alpha_2 y_t = z_t

is an integrated process of order d - b, with b > 0 (CI(d, b)). Note that if \alpha_1 x_t + \alpha_2 y_t = z_t is I(d - b), then also

x_t + \tilde{\alpha}y_t = \tilde{z}_t \sim I(d - b).

The vector \alpha = (\alpha_1, \alpha_2)' is therefore not identifiable, and we normalize, say, \alpha_1 = 1.
Assume that d = 1 and suppose that we have n variables, say x_t, each of which is I(1), e.g. \forall i = 1, \ldots, n, x_{ti} behaves like a (V)ARMA process with a unit root. Also,

\Delta x_t = C(L)\varepsilon_t, where C(1) is finite.

Moreover, assume that this can be written as

A^+(L)\Delta x_t = D(L)\varepsilon_t,

where both A^+(L) and D(L) are finite polynomials and A^+(L)^{-1}D(L) = C(L). Then, writing A(L) = A^+(L)\Delta, we have the VARMA representation

A(L)x_t = D(L)\varepsilon_t, \qquad A(L) = A(1) + A^*(L)\Delta

(this is like a Taylor expansion around L = 1). Assume that A(1) has rank s < n, i.e. A(1) = \gamma\alpha' with \gamma (n \times s) and \alpha' (s \times n). Then

A(L)x_t = \gamma\alpha'x_t + A^*(L)\Delta x_t \quad (= A^*(L)\Delta x_t + \gamma z_t, with z_t = \alpha'x_t)
        = D(L)\varepsilon_t,
or, similarly, we can write A^{**}(L)\Delta x_t + \gamma z_{t-1} = D(L)\varepsilon_t, because

A^*(L)\Delta x_t + \gamma z_t = D(L)\varepsilon_t
A^*(L)\Delta x_t + \gamma(z_t - z_{t-1}) + \gamma z_{t-1} = D(L)\varepsilon_t
A^*(L)\Delta x_t + \gamma\alpha'(x_t - x_{t-1}) + \gamma z_{t-1} = D(L)\varepsilon_t
(A^*(L) + \gamma\alpha')\Delta x_t + \gamma z_{t-1} = D(L)\varepsilon_t
A^{**}(L)\Delta x_t + \gamma z_{t-1} = D(L)\varepsilon_t.

Now, if A^{**}(L) has no roots at L = 1, then, because \Delta x_t and \varepsilon_t are I(0), it follows that \alpha'x_t = z_t \sim I(0). That is, s linear combinations of the vector x_t are I(0) r.v.'s, i.e. the vector x_t is cointegrated with cointegrating vectors \alpha. The representation

A^{**}(L)\Delta x_t + \gamma z_{t-1} = D(L)\varepsilon_t

is known as the ERROR-CORRECTION representation.
Example 10.1. Consider the following example:

\Delta y_t = \eta\Delta x_t + \gamma z_{t-1} + \varepsilon_{1t}
\Delta x_t = \varepsilon_{2t},

where z_t = y_t - \beta x_t. In VAR representation,

\begin{pmatrix} \Delta - \gamma L & -\eta\Delta + \gamma\beta L \\ 0 & \Delta \end{pmatrix}\begin{pmatrix} y_t \\ x_t \end{pmatrix} = \begin{pmatrix} \varepsilon_{1t} \\ \varepsilon_{2t} \end{pmatrix},

so that A(1) = \begin{pmatrix} -\gamma \\ 0 \end{pmatrix}(1, -\beta), which has rank one, with cointegrating vector \alpha = (1, -\beta)'.
10.1.1. ESTIMATION.
A two-step procedure can be implemented.
STEP 1: The cointegrating vector is estimated via LSE in the model

y_t = \beta x_t + z_t;

T(\hat{\beta} - \beta) converges in distribution to a random variable.
STEP 2: Compute the LSE residuals, \hat{z}_{t-1} = y_{t-1} - \hat{\beta}x_{t-1}, and use them as a proxy for z_{t-1}. Then perform LSE in the error-correction equation, e.g.

\Delta y_t = \eta\Delta x_t + \gamma\hat{z}_{t-1} + v_t.

The LSE of \eta and \gamma is asymptotically normal, basically because all the variables involved in this regression are I(0).
However, it is worth mentioning that \hat{\beta} has a lot of small-sample bias, and because this bias feeds into the second step, e.g. into the regression of \Delta y_t on \Delta x_t and \hat{z}_{t-1}, the finite-sample performance of the estimator is quite poor. An alternative way to handle this is to estimate the first equation unrestrictedly, e.g.

\Delta y_t = \eta\Delta x_t + \gamma y_{t-1} + \delta x_{t-1} + \varepsilon_t.

The long-run parameter (\beta) is then recovered as minus the LSE of the coefficient on x_{t-1} divided by the LSE of the coefficient on y_{t-1}. This method turns out to have much better small-sample properties than the one described in Steps 1 and 2.
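A minimal sketch of the two-step procedure for the bivariate case above (no intercepts, as in the text); names are illustrative.

```python
import numpy as np

def two_step_cointegration(y, x):
    """Step 1: LS of y_t on x_t gives beta_hat (the cointegrating coefficient).
    Step 2: LS of dy_t on dx_t and z_hat_{t-1} gives (eta_hat, gamma_hat)."""
    # Step 1: static cointegrating regression y_t = beta * x_t + z_t.
    beta_hat = (x @ y) / (x @ x)
    z_hat = y - beta_hat * x

    # Step 2: error-correction regression dy_t = eta*dx_t + gamma*z_hat_{t-1} + v_t.
    dy, dx = np.diff(y), np.diff(x)
    Z = np.column_stack([dx, z_hat[:-1]])
    eta_hat, gamma_hat = np.linalg.lstsq(Z, dy, rcond=None)[0]
    return beta_hat, eta_hat, gamma_hat
```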
Remark 10.3. To estimate the cointegrating vector, it makes no difference whether we regress y_t on x_t or x_t on y_t. In the latter case, \hat{\beta} = \hat{d}^{-1}, where \hat{d} is the LSE of x_t on y_t. The reason is that R^2 \to 1, i.e. it is like having \sigma_u^2 = 0.
10.1.2. TESTING FOR COINTEGRATION.
Compute the LSE in

y_t = \beta x_t + u_t

and \hat{u}_t = y_t - \hat{\beta}x_t. We can implement a D-W test: d \to 0 suggests the absence of cointegration, as it was shown that if y_t and x_t are not cointegrated, the D-W statistic satisfies d \simeq T^{-1}; see Phillips (1986), J.o.E.
Another procedure is to perform the augmented Dickey-Fuller approach:

\Delta\hat{u}_t = \rho\hat{u}_{t-1} + \sum_{i=1}^{P}b_i\Delta\hat{u}_{t-i} + e_t,

and test \rho = 0 vs. \rho < 0, e.g. via the t-ratio of \hat{\rho}. However, the tables provided by Dickey-Fuller are no longer valid, since \hat{u}_t is not observed data. The relevant tables are in Engle and Yoo (1987), J.o.E.
One of the ideas, from an economic perspective, is that cointegration gives us the long-run relationship, or equilibrium path, between two (or more) variables. In particular, if we consider the general distributed lag model

y_t = \alpha_1 y_{t-1} + \cdots + \alpha_r y_{t-r} + \beta_0 x_t + \beta_1 x_{t-1} + \cdots + \beta_s x_{t-s} + \varepsilon_t,

we can obtain the short- and long-run behaviour as follows. Consider the reparametrization

y_t = \alpha y_{t-1} + \sum_{j=1}^{r-1}\alpha_j^+\Delta y_{t-j} + \beta x_t + \sum_{j=0}^{s-1}\beta_j^+\Delta x_{t-j} + \varepsilon_t,

where

\alpha = \sum_{j=1}^{r}\alpha_j, \quad \alpha_j^+ = -\sum_{i=j+1}^{r}\alpha_i, \quad \beta_j^+ = -\sum_{i=j+1}^{s}\beta_i, \quad \beta = \sum_{j=0}^{s}\beta_j,

or

\Delta y_t = \sum_{j=1}^{r-1}\alpha_j^+\Delta y_{t-j} + \sum_{j=0}^{s-1}\beta_j^+\Delta x_{t-j} + (\alpha - 1)y_{t-1} + \beta x_t + \varepsilon_t
           = \sum_{j=1}^{r-1}\alpha_j^+\Delta y_{t-j} + \sum_{j=0}^{s-1}\tilde{\beta}_j^+\Delta x_{t-j} + (\alpha - 1)\left(y_{t-1} - \frac{\beta}{1 - \alpha}x_{t-1}\right) + \varepsilon_t,

where the j = 0 coefficient of the second sum absorbs \beta\Delta x_t (\tilde{\beta}_0^+ = \beta_0^+ + \beta, \tilde{\beta}_j^+ = \beta_j^+ otherwise),
and z_{t-1} := y_{t-1} - \frac{\beta}{1 - \alpha}x_{t-1}. If \alpha = 1, z_{t-1} is not present, so that the model is specified in first differences; this implies that there is no long-run relation, and the coefficients only give us the short-run behaviour. This model is known as the Error-Correction Model. The long-run multiplier is given by \beta/(1 - \alpha).
The idea behind this formulation is the following. Given the equilibrium relation y_t = \left[\beta/(1 - \alpha)\right]x_t, say, if yesterday y was above the path indicated by the equilibrium, then y_{t-1} - \left[\beta/(1 - \alpha)\right]x_{t-1} > 0 and, with (\alpha - 1) negative, the increase of y_t with respect to y_{t-1} will be smaller; i.e. this term forces the system back towards equilibrium.