
III 1

James B. McDonald
Brigham Young University
9/29/2010

III. Classical Normal Linear Regression Model Extended to the Case of k

Explanatory Variables

A. Basic Concepts

Let y denote an n x 1 vector of random variables, i.e., y = (y1, y2, . . ., yn)'.

1. The expected value of y is defined by

E(y) = \begin{bmatrix} E(y_1) \\ E(y_2) \\ \vdots \\ E(y_n) \end{bmatrix}.
2. The variance of the vector y is defined by

Var(y) = \begin{bmatrix} Var(y_1) & Cov(y_1, y_2) & \cdots & Cov(y_1, y_n) \\ Cov(y_2, y_1) & Var(y_2) & \cdots & Cov(y_2, y_n) \\ \vdots & \vdots & \ddots & \vdots \\ Cov(y_n, y_1) & Cov(y_n, y_2) & \cdots & Var(y_n) \end{bmatrix}.

NOTE: Let μ = E(y); then

Var(y) = E[(y - μ)(y - μ)']

= E\left[ \begin{bmatrix} y_1 - \mu_1 \\ \vdots \\ y_n - \mu_n \end{bmatrix} (y_1 - \mu_1, \ldots, y_n - \mu_n) \right]

= \begin{bmatrix} E(y_1 - \mu_1)^2 & E(y_1 - \mu_1)(y_2 - \mu_2) & \cdots & E(y_1 - \mu_1)(y_n - \mu_n) \\ E(y_2 - \mu_2)(y_1 - \mu_1) & E(y_2 - \mu_2)^2 & \cdots & E(y_2 - \mu_2)(y_n - \mu_n) \\ \vdots & \vdots & \ddots & \vdots \\ E(y_n - \mu_n)(y_1 - \mu_1) & E(y_n - \mu_n)(y_2 - \mu_2) & \cdots & E(y_n - \mu_n)^2 \end{bmatrix}

= \begin{bmatrix} Var(y_1) & Cov(y_1, y_2) & \cdots & Cov(y_1, y_n) \\ Cov(y_2, y_1) & Var(y_2) & \cdots & Cov(y_2, y_n) \\ \vdots & \vdots & \ddots & \vdots \\ Cov(y_n, y_1) & Cov(y_n, y_2) & \cdots & Var(y_n) \end{bmatrix}.

3. The n x 1 vector of random variables, y, is said to be distributed as a multivariate normal with mean vector μ and variance-covariance matrix Σ (denoted y ~ N(μ, Σ)) if the probability density function of y is given by

f(y; \mu, \Sigma) = \frac{e^{-\frac{1}{2}(y-\mu)'\Sigma^{-1}(y-\mu)}}{(2\pi)^{n/2}\,|\Sigma|^{1/2}}.

Special case (n = 1): y = (y1), μ = (μ1), Σ = (σ²):

f(y_1; \mu_1, \sigma^2) = \frac{e^{-\frac{1}{2}(y_1-\mu_1)(\sigma^2)^{-1}(y_1-\mu_1)}}{(2\pi)^{1/2}(\sigma^2)^{1/2}} = \frac{e^{-(y_1-\mu_1)^2/2\sigma^2}}{\sqrt{2\pi\sigma^2}}.

4. Some Useful Theorems

a. If y ~ N(μy, Σy), then z = Ay ~ N(μz = Aμy, Σz = AΣyA') where A is a matrix of constants.

b. If y ~ N(0, I) and A is a symmetric idempotent matrix, then y'Ay ~ χ²(m) where m = Rank(A) = trace(A).

c. If y ~ N(0, I) and L is a k x n matrix of rank k, then Ly and y'Ay are independently distributed if LA = 0.

d. If y ~ N(0, I), then the idempotent quadratic forms y'Ay and y'By are independently distributed χ² variables if AB = 0.

NOTE:

(1) Proof of (a)


E(z) = E(Ay) = AE(y) = Aμy
Var(z) = E[(z - E(z))(z - E(z))']
= E[(Ay - Aμy)(Ay - Aμy)']
= E[A(y - μy)(y - μy)'A']
= A E[(y - μy)(y - μy)'] A'
= AΣyA' = Σz

(2) Example: Let y1, . . ., yn denote a random sample drawn from N(μ, σ²). Then

y = \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix} \sim N\left( \begin{bmatrix} \mu \\ \vdots \\ \mu \end{bmatrix}, \begin{bmatrix} \sigma^2 & & 0 \\ & \ddots & \\ 0 & & \sigma^2 \end{bmatrix} \right).

The "Useful" Theorem 4.a implies that:

\bar{y} = \frac{1}{n}y_1 + \cdots + \frac{1}{n}y_n = \left( \frac{1}{n}, \ldots, \frac{1}{n} \right) y \sim N(\mu, \sigma^2/n).

Verify that

(a) \left( \frac{1}{n}, \ldots, \frac{1}{n} \right) \begin{bmatrix} \mu \\ \vdots \\ \mu \end{bmatrix} = \mu

(b) \left( \frac{1}{n}, \ldots, \frac{1}{n} \right) \sigma^2 I \begin{bmatrix} 1/n \\ \vdots \\ 1/n \end{bmatrix} = \sigma^2/n.
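The notes later use Stata, but Theorem 4.a and this verification can be checked numerically; the following is an illustrative numpy sketch (the library choice and the specific values n = 5, μ = 3, σ² = 2 are mine, not the notes'):

```python
import numpy as np

# Theorem 4.a: if y ~ N(mu_y, Sigma_y), then Ay ~ N(A mu_y, A Sigma_y A').
# For the sample mean, A = (1/n, ..., 1/n), mu_y = mu * 1, Sigma_y = sigma^2 I.
n, mu, sigma2 = 5, 3.0, 2.0
A = np.full((1, n), 1.0 / n)              # 1 x n averaging matrix
mu_y = np.full(n, mu)                     # mean vector mu * 1
Sigma_y = sigma2 * np.eye(n)              # variance-covariance matrix sigma^2 I

mean_of_ybar = (A @ mu_y).item()          # part (a): equals mu
var_of_ybar = (A @ Sigma_y @ A.T).item()  # part (b): equals sigma^2 / n

print(mean_of_ybar, var_of_ybar)
```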

B. The Basic Model

Consider the model defined by

(1) yt = β1xt1 + β2xt2 + . . . + βkxtk + εt (t = 1, . . ., n).

If we want to include an intercept, define xt1 = 1 for all t and we obtain

(2) yt = β1 + β2xt2 + . . . + βkxtk + εt.

Note that βi can be interpreted as the marginal impact of a unit increase in xi on the

expected value of y.

The error terms (εt) in (1) will be assumed to satisfy:

(A.1) εt is distributed normally

(A.2) E(εt) = 0 for all t

(A.3) Var(εt) = σ² for all t

(A.4) Cov(εt, εs) = 0 for t ≠ s.

Rewriting (1) for each t (t = 1, 2, . . ., n) we obtain

y1 = β1x11 + β2x12 + . . . + βkx1k + ε1

y2 = β1x21 + β2x22 + . . . + βkx2k + ε2

. . . .

. . . .

(3) . . . .

yn = β1xn1 + β2xn2 + . . . + βkxnk + εn.

The system of equations (3) is equivalent to the matrix representation

y = Xβ + ε

where the matrices y, X, β and ε are defined as follows:

y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} \ (n \times 1), \qquad X = \begin{bmatrix} x_{11} & \cdots & x_{1k} \\ x_{21} & \cdots & x_{2k} \\ \vdots & & \vdots \\ x_{n1} & \cdots & x_{nk} \end{bmatrix} \ (n \times k),

β = \begin{bmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_k \end{bmatrix} \quad and \quad ε = \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix}.

The columns of X contain the n observations on each of the k individual variables; the rows may represent observations at a given point in time.

NOTE: (1) Assumptions (A.1)-(A.4) can be written much more compactly as

(A.1)' ε ~ N(0, Σ = σ²I).

(2) The model to be discussed can then be summarized as

y = Xβ + ε.

(A.1)' ε ~ N(0, Σ = σ²I)

(A.5)' The xtj's are nonstochastic and

\lim_{n \to \infty} \frac{X'X}{n} = \Sigma_x \ \text{is nonsingular.}

C. Estimation

We will derive the least squares, MLE, BLUE and instrumental variables estimators in

this section.

1. Least Squares:

The basic model can be written as

y = Xβ + ε
  = Xβ̂ + e = Ŷ + e

where Ŷ = Xβ̂ is an n x 1 vector of predicted values for the dependent variable and e denotes a vector of residuals or estimated errors.

The sum of squared errors is defined by

SSE(β̂) = \sum_{t=1}^{n} e_t^2 = (e_1, e_2, \ldots, e_n) \begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_n \end{bmatrix}
= e'e
= (y - Xβ̂)'(y - Xβ̂)
= y'y - β̂'X'y - y'Xβ̂ + β̂'X'Xβ̂
= y'y - 2β̂'X'y + β̂'X'Xβ̂.

The least squares estimator of β is defined as the β̂ which minimizes SSE(β̂). A necessary condition for SSE(β̂) to be a minimum is that

\frac{dSSE(\hat\beta)}{d\hat\beta} = 0 \quad (see Appendix A for how to differentiate a real-valued function with respect to a vector)

\frac{dSSE(\hat\beta)}{d\hat\beta} = -2X'y + 2X'X\hat\beta = 0, \ \text{or}

X'Xβ̂ = X'y    (Normal Equations)

β̂ = (X'X)⁻¹X'y is the least squares estimator.

Note that β̂ is a vector of least squares estimators of β1, β2, ..., βk.
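The normal equations can be illustrated with a short numpy sketch (simulated data and the seed are my own choices; solving the normal equations directly rather than inverting X'X is a standard numerical refinement):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 100, 3
# Design matrix with an intercept column (x_t1 = 1 for all t).
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
beta = np.array([1.0, 2.0, -0.5])
y = X @ beta + rng.normal(size=n)

# Normal equations X'X b = X'y; solve rather than invert for stability.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ beta_hat

# The residuals satisfy X'e = 0 (the k restrictions noted in the MLE section).
print(np.allclose(X.T @ e, 0))  # True
```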

2. Maximum Likelihood Estimation (MLE)

Likelihood Function: (Recall y ~ N(Xβ, Σ = σ²I))

L(y; \beta, \Sigma = \sigma^2 I) = \frac{e^{-\frac{1}{2}(y-X\beta)'\Sigma^{-1}(y-X\beta)}}{(2\pi)^{n/2}|\Sigma|^{1/2}}

= \frac{e^{-\frac{1}{2\sigma^2}(y-X\beta)'(y-X\beta)}}{(2\pi)^{n/2}|\sigma^2 I|^{1/2}}

= \frac{e^{-(y-X\beta)'(y-X\beta)/2\sigma^2}}{(2\pi)^{n/2}(\sigma^2)^{n/2}}.

The natural log of the likelihood function,

ℓ = \ln L = -\frac{(y-X\beta)'(y-X\beta)}{2\sigma^2} - \frac{n}{2}\ln 2\pi - \frac{n}{2}\ln \sigma^2

= -\frac{1}{2\sigma^2}(y-X\beta)'(y-X\beta) - \frac{n}{2}\ln 2\pi - \frac{n}{2}\ln \sigma^2

= -\frac{1}{2\sigma^2}\left[ y'y - 2\beta'X'y + \beta'X'X\beta \right] - \frac{n}{2}\ln 2\pi - \frac{n}{2}\ln \sigma^2

is known as the log likelihood function. ℓ is a function of β and σ².

The MLEs of β and σ² are defined by the two equations (necessary conditions for a maximum):

\frac{\partial ℓ}{\partial \beta} = -\frac{1}{2\sigma^2}\left( -2X'y + 2(X'X)\beta \right) = 0

\frac{\partial ℓ}{\partial \sigma^2} = \frac{(y-X\beta)'(y-X\beta)}{2(\sigma^2)^2} - \frac{n}{2}\frac{1}{\sigma^2} = 0

i.e.,

β̃ = (X'X)⁻¹X'y

\tilde{\sigma}^2 = \frac{1}{n}(y - X\tilde\beta)'(y - X\tilde\beta) = \frac{e'e}{n} = \frac{\sum e_t^2}{n}.

NOTE: (1) β̃ = β̂, the least squares estimator.

(2) σ̃² is a biased estimator of σ²; whereas

s^2 = \frac{1}{n-k}(y - X\hat\beta)'(y - X\hat\beta) = \frac{e'e}{n-k} = \frac{SSE}{n-k}

is an unbiased estimator of σ². A proof of the unbiasedness of s² is given in Appendix B.

Only n-k of the estimated residuals are independent. The necessary conditions for least squares estimates impose k restrictions on the estimated residuals (e). The restrictions are summarized by the normal equations X'Xβ̂ = X'y, or equivalently

X'e = 0.

(3) Substituting σ̃² = SSE/n into the log likelihood function yields what is known as the concentrated log likelihood function

ℓ = -\frac{n}{2}\left[ 1 + \ln(2\pi) + \ln\frac{SSE}{n} \right]

which expresses the log-likelihood value as a function of β only. This equation also clearly demonstrates the equivalence of maximizing ℓ and minimizing SSE.
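The bias of σ̃² = e'e/n relative to s² = e'e/(n-k) can be seen in a small simulation; this numpy sketch (design, true σ² = 2.25, and replication count are illustrative choices of mine) averages both estimators over repeated samples:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, sigma2 = 200, 4, 2.25
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
beta = np.ones(k)

reps, mle, unbiased = 2000, [], []
for _ in range(reps):
    y = X @ beta + rng.normal(scale=np.sqrt(sigma2), size=n)
    e = y - X @ np.linalg.solve(X.T @ X, X.T @ y)
    sse = e @ e
    mle.append(sse / n)             # sigma-tilde^2: biased downward by (n-k)/n
    unbiased.append(sse / (n - k))  # s^2: unbiased for sigma^2

print(np.mean(mle), np.mean(unbiased))  # first average falls below the second
```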



3. BLUE Estimators of β, β̄.

We will demonstrate that assumptions (A.2)-(A.5) imply that the best (least variance) linear unbiased estimator (BLUE) of β is the least squares estimator. We first consider the desired properties and then derive the associated estimator.

Linear: β̄ = Ay where A is a k x n matrix of constants.

Unbiased: E(β̄) = AE(y) = AXβ. We note that E(β̄) = AXβ = β requires AX = I.

Minimum Variance:

Var(β̄i) = A_i Var(y) A_i' = σ²A_iA_i'

where A_i = the ith row of A and β̄_i = A_i y.

Thus, the construction of the BLUE is equivalent to selecting the matrix A so that the rows of A solve

min A_iA_i'   (i = 1, 2, . . ., k)
s.t. AX = I

or equivalently, min Var(β̄_i) s.t. AX = I (unbiasedness).

The solution to this problem is given by

A = (X'X)⁻¹X'; hence, the BLUE of β is given by β̄ = Ay = (X'X)⁻¹X'y.

The details of this derivation are contained in Appendix C.

NOTE: (1) β̄ = β̃ = β̂ = (X'X)⁻¹X'y

(2) AX = (X'X)⁻¹X'X = I; thus β̄ is unbiased.

4. Method of Moments Estimation

Method of moments parameter estimators are selected to equate sample and corresponding theoretical moments. The open question is what theoretical moments should be considered and what are the corresponding sample moments. With the regression model we might consider the following theoretical moments, which follow from the underlying theoretical assumptions:

(A.2) E(ε_t) = 0
(A.5) Cov(X_{it}, ε_t) = 0

The sample moment associated with (A.2) is

\sum_{t=1}^{n} e_t / n = \bar{e} = 0.

The sample covariances can be written as

\sum_{t=1}^{n} (X_{it} - \bar{X}_i)(e_t - \bar{e})/n = \sum_{t=1}^{n} (X_{it} - \bar{X}_i)e_t/n = \sum_{t=1}^{n} X_{it}e_t/n = 0.

These sample moments can be summarized using matrix notation as follows:

\begin{bmatrix} 1 & 1 & \cdots & 1 \\ x_{12} & x_{22} & \cdots & x_{n2} \\ \vdots & \vdots & & \vdots \\ x_{1k} & x_{2k} & \cdots & x_{nk} \end{bmatrix} \begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_n \end{bmatrix} / n = X'e/n = 0

which is equivalent to X'e = 0, also known as the normal equations in the OLS framework, and yields the OLS estimator by solving

X'e = X'(Y - Xβ̂) = 0

for β̂.

5. Instrumental Variables Estimators

y = Xβ + ε

Let Z denote an n x k matrix of "instruments" or "instrumental" variables. Consider the solution of the modified normal equations:

Z'Y = Z'Xβ̂_z; hence, β̂_z = (Z'X)⁻¹Z'Y.

β̂_z is referred to as the instrumental variables estimator of β based on the instrumental variables Z. Instrumental variables can be very useful if the variables on the right hand side include "endogenous" variables or in the case of measurement error. In these cases OLS will yield biased and inconsistent estimators; whereas instrumental variables can yield consistent estimators.

NOTE: (1) The motivation for the selection of the instruments (Z) is that the covariance of (Z, ε) approaches 0 and Z and X are correlated. Thus Z'Y = Z'(Xβ + ε) = Z'Xβ + Z'ε ≈ Z'Xβ.

(2) If \lim_{n\to\infty} \frac{Z'X}{n} is nonsingular and \lim_{n\to\infty} \frac{Z'\varepsilon}{n} = 0, then β̂_z is a consistent estimator of β.

(3) Many calculate an R² after instrumental variables estimation using the formula R² = 1 - SSE/SST. Since this can be negative, there is not a natural interpretation of R² for instrumental variables estimators. Further, the R² can't be used to construct F-statistics for IV estimators.

(4) If Z includes "weak" instruments (weakly correlated with the X's), then the variances of the IV estimator can be large and the corresponding asymptotic biases can be large if the Z and error are correlated. This can be seen by noting that the bias of the instrumental variables estimator is given by

E\left[ (Z'X/n)^{-1}(Z'\varepsilon/n) \right].

(5) As a special case, if Z = X, then β̂_z = β̂_x = β̂ = β̃ = β̄.

(6) If Z is an n x k* matrix where k < k* (Z contains more variables than X), then the IV estimator defined above must be modified. The most common approach in this case is to replace Z in the "IV" equation by the projections** of X on the columns of Z, i.e. X̂ = Z(Z'Z)⁻¹Z'X. This substitution yields the IV estimator

β_{IV} = (X̂'X)⁻¹X̂'Y = \left[ X'Z(Z'Z)^{-1}Z'X \right]^{-1} X'Z(Z'Z)^{-1}Z'Y

which yields estimates for k ≤ k*.
The Stata command for the instrumental variables estimator is given by

ivregress estimator depvar (varlist_1 = varlist_iv) [varlist_2]

where estimator = 2sls, gmm, or liml, with 2sls the default estimator, for the model

depvar = (varlist_1)b1 + (varlist_2)b2 + error

where varlist_iv are the instrumental variables for varlist_1.

A specific example is given by:

ivregress 2sls y1 (y2 = z1 z2 z3) x1 x2 x3

Identical results could be obtained with the command

ivregress 2sls y1 (y2 x1 x2 x3 = z1 z2 z3)

which is equivalent to regressing all of the right hand side variables on the set of instrumental variables. This can be thought of as being of the form

ivregress 2sls y (X = Z)

**The projections of X on Z can be obtained by estimating Π in the "reduced form" equation X = ZΠ + V to yield Π̂ = (Z'Z)⁻¹Z'X; hence, the estimate of X is given by

X̂ = ZΠ̂ = Z(Z'Z)⁻¹Z'X.
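The 2SLS construction above can be sketched numerically. The following numpy illustration (the data-generating process, coefficient values, and seed are my own assumptions, not from the notes) builds one endogenous regressor with two instruments, so k* = 2 > k = 1, and compares OLS with the projection-based IV estimator:

```python
import numpy as np

rng = np.random.default_rng(2)
n, beta = 20000, 1.5
Z = rng.normal(size=(n, 2))                  # instruments, uncorrelated with eps
v = rng.normal(size=n)
eps = 0.8 * v + rng.normal(size=n)           # error correlated with v ...
x = Z @ np.array([1.0, 0.5]) + v             # ... so x is endogenous
y = beta * x + eps
X = x[:, None]

# 2SLS: replace X by its projection on the columns of Z.
Xhat = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)     # Xhat = Z (Z'Z)^{-1} Z'X
beta_2sls = np.linalg.solve(Xhat.T @ X, Xhat.T @ y)
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

# OLS is inconsistent here (Cov(x, eps) != 0); 2SLS is close to beta = 1.5.
print(beta_ols.item(), beta_2sls.item())
```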
D. Distribution of β̂, β̃, β̄

Recall that under the assumptions (A.1) - (A.5), y ~ N(Xβ, Σ = σ²I) and

β̂ = β̃ = β̄ = (X'X)⁻¹X'y = Ay where A = (X'X)⁻¹X';

hence, by useful theorem (II'.A.4.a), we conclude that

β̂ = β̃ = β̄ ~ N(Aμ_y, AΣ_yA') = N[AXβ, A(σ²I)A'].

The desired derivations can be simplified by noting that

AXβ = (X'X)⁻¹X'Xβ = β

σ²AA' = σ²(X'X)⁻¹X'((X'X)⁻¹X')'
= σ²(X'X)⁻¹X'X((X'X)⁻¹)'
= σ²((X'X)⁻¹)'
= σ²((X'X)')⁻¹
= σ²(X'X)⁻¹.

Therefore β̂ = β̃ = β̄ ~ N(β, σ²(X'X)⁻¹).

NOTE: (1) σ²(X'X)⁻¹ can be shown to be the Cramer-Rao matrix, the matrix of lower bounds for the variances of unbiased estimators.

(2) β̂, β̃, β̄ are

unbiased

consistent

minimum variance of all (linear and nonlinear) unbiased estimators

normally distributed

(3) An unbiased estimator of σ²(X'X)⁻¹ is given by

s²(X'X)⁻¹

where s² = e'e/(n-k); this is the formula used to calculate the "estimated variance covariance matrix" in many computer programs.

(4) To report s2(X'X)-1 in STATA type

. reg y x

. estat vce

(5) Distribution of the variance estimator

\frac{(n-k)s^2}{\sigma^2} \sim \chi^2(n-k)

NOTE: This can be proven using theorem (II'.A.4(b)) and noting that

(n-k)s² = e'e = (Y - Xβ̂)'(Y - Xβ̂)
= (Xβ + ε)'(I - X(X'X)⁻¹X')(Xβ + ε)
= ε'(I - X(X'X)⁻¹X')ε.

Therefore,

\frac{(n-k)s^2}{\sigma^2} = \left(\frac{\varepsilon}{\sigma}\right)' (I - X(X'X)^{-1}X') \left(\frac{\varepsilon}{\sigma}\right) = \left(\frac{\varepsilon}{\sigma}\right)' M \left(\frac{\varepsilon}{\sigma}\right)

where ε/σ ~ N[0, I]; hence

\frac{(n-k)s^2}{\sigma^2} \sim \chi^2(n-k)

because M is idempotent with rank and trace equal to n - k.
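The key properties of M = I - X(X'X)⁻¹X' used in this proof (idempotency and trace n - k) can be checked directly; a small numpy sketch with an arbitrary design matrix of my own choosing:

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 30, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])

# M = I - X(X'X)^{-1}X' is symmetric idempotent and annihilates X.
M = np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)

print(np.allclose(M @ M, M))      # True: idempotent
print(np.allclose(M @ X, 0))      # True: MX = 0
print(round(np.trace(M)))         # 26, i.e. n - k
```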



E. Statistical Inference

1. Ho: β2 = β3 = . . . = βk = 0

This hypothesis tests for the statistical significance of overall explanatory power
of the explanatory variables by comparing the model with all variables included to
the model without any of the explanatory variables, i.e., yt = β1 + εt (all non-
intercept coefficients = 0). Recall that the total sum of squares (SST) can be
partitioned as follows:
\sum_{t=1}^{n} (y_t - \bar{y})^2 = \sum_{t=1}^{n} (y_t - \hat{y}_t)^2 + \sum_{t=1}^{n} (\hat{y}_t - \bar{y})^2, \ \text{or}

SST = SSE + SSR.

Dividing both sides of the equation by σ² yields quadratic forms, each having a chi-square distribution:

\frac{SST}{\sigma^2} = \frac{SSE}{\sigma^2} + \frac{SSR}{\sigma^2}

χ²(n - 1) = χ²(n - k) + χ²(k - 1).

This result provides the basis for using

F = \frac{SSR/(k-1)}{SSE/(n-k)} \sim F(k-1, n-k)

to test the hypothesis that β2 = β3 = . . . = βk = 0.

NOTE: (1) \frac{SSR}{SSE} = \frac{SSR}{SST - SSR} = \frac{SSR/SST}{1 - SSR/SST} = \frac{R^2}{1 - R^2};

hence, the F-statistic for this hypothesis can also be rewritten as

F = \frac{R^2/(k-1)}{(1 - R^2)/(n-k)} = \frac{n-k}{k-1} \cdot \frac{R^2}{1 - R^2} \sim F(k-1, n-k).

Recall that this decomposition of SST can be summarized in an ANOVA table as follows:

Source of Variation    SS     d.f.     MSE
Model                  SSR    K - 1    SSR/(K - 1)
Error                  SSE    n - K    SSE/(n - K) = s²
Total                  SST    n - 1

K = number of coefficients in model

where the ratio of the model and error MSE's yields the F statistic just discussed.
Additionally, remember that the adjusted R² (R̄²), defined by

\bar{R}^2 = 1 - \frac{(\sum e_t^2)/(n-K)}{\sum (Y_t - \bar{Y})^2/(n-1)},

will only increase with the addition of a new variable if the t-statistic associated with the new variable is greater than 1 in absolute value. This result follows from the equation

\bar{R}^2_{New} - \bar{R}^2_{Old} = \frac{n-1}{(n-K)(n-K-1)} \cdot \frac{SSE_{New}}{SST} \left[ \left( \frac{\hat{\beta}_{New\_var}}{s_{\hat{\beta}_{New\_var}}} \right)^2 - 1 \right]

where the last term in brackets is t² - 1, K denotes the number of coefficients in the "old" regression model, and the "new" regression model includes K + 1 coefficients.

The Lagrangian Multiplier (LM) and likelihood ratio (LR) tests can also be used to test this hypothesis, where

LM = NR² ~ᵃ χ²(k - 1)

LR = -N ln(1 - R²) ~ᵃ χ²(k - 1).
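The algebraic equivalence of the sums-of-squares and R² forms of the F statistic can be checked numerically; the following numpy sketch (simulated data of my own construction) computes both forms from one regression:

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 120, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([1.0, 0.5, 0.0, -0.3]) + rng.normal(size=n)

b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b
sse = e @ e
sst = np.sum((y - y.mean()) ** 2)
ssr = sst - sse                       # SST = SSE + SSR under OLS
r2 = ssr / sst

F_ss = (ssr / (k - 1)) / (sse / (n - k))       # sums-of-squares form
F_r2 = (r2 / (k - 1)) / ((1 - r2) / (n - k))   # R^2 form

print(np.isclose(F_ss, F_r2))  # True: the two forms are identical
```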

2. Testing hypotheses involving individual βi's

Recall that

β̂ ~ N(β, σ²(X'X)⁻¹)

where

\sigma^2 (X'X)^{-1} = \begin{bmatrix} \sigma^2_{\hat\beta_1} & \sigma_{\hat\beta_1\hat\beta_2} & \cdots & \sigma_{\hat\beta_1\hat\beta_k} \\ \sigma_{\hat\beta_2\hat\beta_1} & \sigma^2_{\hat\beta_2} & \cdots & \sigma_{\hat\beta_2\hat\beta_k} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{\hat\beta_k\hat\beta_1} & \sigma_{\hat\beta_k\hat\beta_2} & \cdots & \sigma^2_{\hat\beta_k} \end{bmatrix}

which can be estimated by

s^2 (X'X)^{-1} = \begin{bmatrix} s^2_{\hat\beta_1} & s_{\hat\beta_1\hat\beta_2} & \cdots & s_{\hat\beta_1\hat\beta_k} \\ s_{\hat\beta_2\hat\beta_1} & s^2_{\hat\beta_2} & \cdots & s_{\hat\beta_2\hat\beta_k} \\ \vdots & \vdots & \ddots & \vdots \\ s_{\hat\beta_k\hat\beta_1} & s_{\hat\beta_k\hat\beta_2} & \cdots & s^2_{\hat\beta_k} \end{bmatrix}.

Hypotheses of the form H0: βi = βi0 can be tested using the result

\frac{\hat\beta_i - \beta_{i0}}{s_{\hat\beta_i}} \sim t(n-k).

The validity of this distributional result follows from

\frac{N(0,1)}{\sqrt{\chi^2(d)/d}} \sim t(d)

since \frac{\hat\beta_i - \beta_i}{\sigma_{\hat\beta_i}} \sim N(0,1) and \frac{(n-k)}{\sigma^2_{\hat\beta_i}} s^2_{\hat\beta_i} \sim \chi^2(n-k).

3. Tests of hypotheses involving linear combinations of coefficients

A linear combination of the βi's can be written as

\sum_{i=1}^{k} \delta_i \beta_i = (\delta_1, \ldots, \delta_k) \begin{bmatrix} \beta_1 \\ \vdots \\ \beta_k \end{bmatrix} = \delta'\beta.

We now consider testing hypotheses of the form

H0: δ'β = γ.

Recall that

β̂ ~ N(β, σ²(X'X)⁻¹);

therefore,

δ'β̂ ~ N(δ'β, σ²δ'(X'X)⁻¹δ);

hence,

\frac{\delta'\hat\beta - \delta'\beta}{\sqrt{s^2\,\delta'(X'X)^{-1}\delta}} = \frac{\delta'\hat\beta - \gamma}{s_{\delta'\hat\beta}} \sim t(n-k) \ \text{under H0}.

The t-test of a hypothesis involving a linear combination of the coefficients involves running one regression and estimating the variance of δ'β̂ from s²(X'X)⁻¹ to construct the test statistic.
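The one-regression recipe above can be sketched in numpy (the simulated model, the particular hypothesis β2 + β3 = 5, and the seed are illustrative assumptions of mine):

```python
import numpy as np

rng = np.random.default_rng(5)
n, k = 80, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([1.0, 2.0, 3.0]) + rng.normal(size=n)

b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b
s2 = (e @ e) / (n - k)
V = s2 * np.linalg.inv(X.T @ X)          # estimated Var(beta_hat) = s^2 (X'X)^{-1}

# H0: beta_2 + beta_3 = 5, i.e. delta'beta = gamma with delta = (0, 1, 1)'.
delta, gamma = np.array([0.0, 1.0, 1.0]), 5.0
t_stat = (delta @ b - gamma) / np.sqrt(delta @ V @ delta)

print(t_stat)  # compare to t(n - k) critical values
```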

4. More general tests

a. Introduction

We have considered tests of the overall explanatory power of the

regression model (Ho: β2 = β3 = . . . βk = 0), tests involving individual parameters

(e.g., Ho: β3 = 6), and testing the validity of a linear constraint on the coefficients

(Ho: δ’β = γ). In this section we will consider how more general tests can be

performed. The testing procedures will be based on the Chow and Likelihood

ratio (LR) tests. The hypotheses may be of many different types and involve the

previous tests as special cases. Other examples might include joint hypotheses of

the form: Ho: β2 + 6 β5 = 4, β3 = β7 = 0. The basic idea is that if the hypothesis is

really valid, then goodness of fit measures such as SSE, R², and log-likelihood values (ℓ) will not be significantly impacted by imposing the valid hypothesis in

estimation. Hence, the SSE, R², or ℓ values will not be significantly different for

constrained (via the hypothesis) and unconstrained estimation of the underlying

regression model. The tests of the validity of the hypothesis are based on

constructing test statistics, with known exact or asymptotic distributions, to

evaluate the statistical significance of changes in SSE, R², or ℓ.

Consider the model

y=Xβ+ε

and a hypothesis, Ho: g(β) = 0 which imposes individual and/or multiple

constraints on the β vector.

The Chow and likelihood ratio tests for testing Ho: g(β) = 0 can be

constructed from the output obtained from estimating the two following

regression models.

(1) Estimate the regression model y = Xβ + ε without imposing any

constraints on the vector β. Let the associated sum of square errors,

coefficient of determination, log-likelihood value and degrees of freedom



be denoted by SSE, R², ℓ, and (n - k).

(2) Estimate the same regression model where the β is constrained as

specified by the hypothesis (Ho: g(β) = 0) in the estimation process. Let

the associated sum of squared errors, R2, log-likelihood value and degrees

of freedom be denoted by SSE*, R²*, ℓ*, and (n - k)*, respectively.

b. Chow test

The Chow test is defined by the following statistic:

F = \frac{(SSE^* - SSE)/r}{SSE/(n-k)} \sim F(r, n-k)

where r = (n-k)* - (n-k) is the number of independent restrictions imposed on β by the hypothesis. For example, if the hypothesis were Ho: β2 + 6β5 = 4, β3 = β7 = 0, then the numerator degrees of freedom (r) would equal 3. In applications where the SST is unaltered by imposing the restrictions, we can divide the numerator and denominator by SST to yield the Chow test rewritten in terms of the change in the R² between the constrained and unconstrained regressions:

F = \frac{(R^2 - R^{2*})/r}{(1 - R^2)/(n-k)} \sim F(r, n-k)

Note that if the hypothesis (H0: g(β) = 0) is valid, then we would expect R² (SSE) and R²* (SSE*) not to be significantly different from each other. Thus, it is only large values (greater than the critical value) of F which provide the basis for rejecting the hypothesis. Again, the R² form of the Chow test is only valid if the dependent variable is the same in the constrained and unconstrained regressions.
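Constrained and unconstrained estimation can be compared in a short numpy sketch (the model, the zero restrictions tested, and the simulated data are illustrative choices of mine):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 200
x2, x3, x4 = rng.normal(size=(3, n))
y = 1.0 + 0.5 * x4 + rng.normal(size=n)       # H0: beta2 = beta3 = 0 is true

def sse(Xmat):
    """Sum of squared errors from an OLS fit of y on Xmat."""
    b = np.linalg.solve(Xmat.T @ Xmat, Xmat.T @ y)
    e = y - Xmat @ b
    return e @ e

X_u = np.column_stack([np.ones(n), x2, x3, x4])  # unconstrained
X_c = np.column_stack([np.ones(n), x4])          # constrained (H0 imposed)

r, df = 2, n - 4                                 # r = (n-k)* - (n-k) = 2
F = ((sse(X_c) - sse(X_u)) / r) / (sse(X_u) / df)

print(F)  # compare to the F(2, n-4) critical value
```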

References:

(1) Chow, G. C., "Tests of Equality Between Subsets of Coefficients in Two Linear Regressions," Econometrica, 28 (1960), 591-605.

(2) Fisher, F. M., "Tests of Equality Between Sets of Coefficients in Two Linear Regressions: An Expository Note," Econometrica, 38 (1970), 361-66.

c. Likelihood ratio (LR) test.

The LR test is a common method of statistical inference in classical


statistics. The motivation behind the LR test is similar to that of the Chow test
except that it is based on determining whether there has been a significant
reduction in the value of the log-likelihood value as a result of imposing the
hypothesized constraints on β in the estimation process. The LR test statistic is defined to be twice the difference between the unconstrained and constrained log-likelihood values, 2(ℓ - ℓ*), and, under fairly general regularity conditions, is asymptotically distributed as a chi-square with degrees of freedom equal to the number of independent restrictions (r) imposed by the hypothesis. This may be summarized as follows:

LR = 2(ℓ - ℓ*) ~ᵃ χ²(r).

The LR test is more general than the Chow test and, for the case of independent and identically distributed normal errors with known σ², LR is equal to

LR = (SSE* - SSE)/σ².

Recall that s² = SSE/(n - k) appears in the denominator of the Chow test statistic and that for large values of (n - k), s² is "close" to σ²; hence, we can see the similarity of the LR and Chow tests. If σ² is unknown, substituting the concentrated log-likelihood function into LR yields

LR = 2(ℓ - ℓ*)
   = n[ln(SSE*) - ln(SSE)]
   = n ln(SSE*/SSE).



If the hypothesis Ho: β2 = β3 = . . . βk = 0 is being tested in the classical

normal linear regression model, then SSE* = SST and LR can be rewritten in

terms of the R2 as follows:

LR = n ln[1/(1 - R²)] = -n ln[1 - R²] ~ᵃ χ²(k - 1).

In this case, the Chow test is identical to the F test for overall explanatory power

discussed earlier.

Thus the Chow test and LR test are similar in structure and purpose. The

LR test is more general than the Chow test; however, its distribution is

asymptotically (not exact) chi-square even for non-normally distributed errors.

The LR test provides a unified method of testing hypotheses.

d. Applications of the Chow and LR tests:

(1) Model: yt = β1 + β2xt2 + β3xt3 + β4xt4 + εt

Ho: β2 = β3 = 0 (two independent constraints)

(a) Estimate yt = β1 + β2xt2 + β3xt3 + β4xt4 + εt to obtain

SSE = Σet² = (n - 4)s², R²,

ℓ = -\frac{n}{2}\left[ 1 + \ln(2\pi) + \ln\frac{SSE}{n} \right],

n - k = n - 4

(b) Estimate yt = β1 + β4xt4 + εt to obtain

SSE* = Σet*² = (n - 2)s*²

SSE*, R²*, ℓ* and (n - k)* = n - 2

(c) Construct the test statistics

Chow = \frac{(SSE^* - SSE)/[(n-k)^* - (n-k)]}{SSE/(n-k)} = \frac{n-4}{2} \cdot \frac{SSE^* - SSE}{SSE} = \frac{(R^2 - R^{2*})/2}{(1 - R^2)/(n-4)} \sim F(2, n-4)

LR = 2(ℓ - ℓ*) ~ᵃ χ²(2).

(2) Tests of equality of the regression coefficients in two different regressions

models.

(a) Consider the two regression models

y(1) = X(1) β(1) + ε(1) n1 observations, k independent variables

y(2) = X(2) β(2) + ε(2) n2 observations, k independent variables

Ho: β(1) = β(2) (k independent restrictions)

(b) Rewrite the model as

(1)'  y = \begin{bmatrix} y^{(1)} \\ y^{(2)} \end{bmatrix} = \begin{bmatrix} X^{(1)} & 0 \\ 0 & X^{(2)} \end{bmatrix} \begin{bmatrix} \beta^{(1)} \\ \beta^{(2)} \end{bmatrix} + \begin{bmatrix} \varepsilon^{(1)} \\ \varepsilon^{(2)} \end{bmatrix}

Estimate (1)' using least squares and determine SSE, R², ℓ and (n - k) = n1 + n2 - 2k.

Now impose the hypothesis that β(1) = β(2) = β and write (1)' as

(2)'  y = \begin{bmatrix} y^{(1)} \\ y^{(2)} \end{bmatrix} = \begin{bmatrix} X^{(1)} \\ X^{(2)} \end{bmatrix} \beta + \begin{bmatrix} \varepsilon^{(1)} \\ \varepsilon^{(2)} \end{bmatrix}

Estimate (2)' using least squares to obtain the constrained sum of squared errors (SSE*), R²*, ℓ* and (n - k)* = n1 + n2 - k.

(c) Construct the test statistics

Chow = \frac{(SSE^* - SSE)/[(n-k)^* - (n-k)]}{SSE/(n-k)} = \frac{(R^2 - R^{2*})/k}{(1 - R^2)/(n_1 + n_2 - 2k)} \sim F(k, n_1 + n_2 - 2k)

LR = 2(ℓ - ℓ*) ~ᵃ χ²(k).

5. Testing Hypotheses using Stata


a. Stata reports the log likelihood values when the command

estat ic

follows a regression command and can be used in constructing LR tests.

b. Stata can also perform many tests based on t or Chow-type tests.

Consider the model

(1) Yt = β1 + β2Xt2 + β3Xt3 + β4Xt4 + εt

with the hypotheses:

(2) H1: β2 = 1

H2: β3 = 0

H3: β3 + β4 = 1

H4: β3β4 = 1

H5: β2 = 1 and β3 = 0

The Stata commands to perform tests of these hypotheses follow OLS estimation of the unconstrained model:

reg Y X2 X3 X4                 estimates the unconstrained model
test X2 = 1                    (Tests H1)
test X3 = 0                    (Tests H2)
test X3 + X4 = 1               (Tests H3)
testnl _b[X3]*_b[X4] = 1       (Tests H4. The "testnl" command is for testing nonlinear hypotheses. The prefix "_b", along with the brackets, must be used when testing nonlinear hypotheses.)
test (X2 = 1) (X3 = 0)         (Tests H5)

95% confidence intervals on coefficient estimates are automatically calculated in Stata. To change the confidence level, use the "level" option as follows:

reg Y X2 X3 X4, level(90)      (changes the confidence level to 90%)

F. Stepwise Regression

Stepwise regression is a method for determining which variables might be

considered as being included in a regression model. It is a purely mechanical approach,

adding or removing variables in the model solely determined by their statistical

significance and not according to any theoretical reason. While stepwise regression can be

considered when deciding among many variables to include in a model, theoretical

considerations should be the primary factor for such a decision.

A stepwise regression may use forward selection or backward selection. Using

forward selection, a stepwise regression will add one independent variable at a time to see

if it is significant. If the variable is significant, it is kept in the model and another variable

is added. If the variable is not significant, or if a previously added variable becomes

insignificant, it is not included in the model. This process continues until no additional

variables are significant.

Stepwise regression using Stata

To perform a stepwise regression in Stata, use the following commands:

Forward:

stepwise, pe(#): reg dep_var indep_vars

stepwise, pe(#) lockterm: reg dep_var (forced in

variables) other indep_vars

Backward:

stepwise, pr(#): reg dep_var indep_vars



stepwise, pr(#) lockterm: reg dep_var (forced in

variables) other indep_vars

where the "#" in "pr(#)" is the significance level at which variables are removed, e.g., 0.051, and the "#" in "pe(#)" is the significance level at which variables are entered or added to the model. If pr(#1) and pe(#2) are both included in a stepwise regression command, #1 must be greater than #2. Also, "dep_var" represents the dependent variable, the "forced in variables" are the independent variables which the user wishes to remain in the model no matter what their significance level may be, and "other indep_vars" represents the other independent variables which the stepwise regression will consider including or excluding. Forward and backward stepwise regression may yield different results.

G. Forecasting

Let yt = F(Xt, β) + εt

denote the stochastic relationship between the variable yt and the vector of variables Xt

where Xt = (xt1,..., xtk). β represents a vector of unknown parameters.

Forecasts are generally made by estimating the vector of parameters β (β̂), determining the appropriate vector Xt (X̂t) and then evaluating

ŷt = F(X̂t, β̂).

The forecast error is FE = yt - ŷt.

There are at least four factors which contribute to forecast error.



1. Incorrect functional form (This is an example of specification error and will be

discussed later.)

[Figure: Yt plotted against Xt, illustrating a misspecified functional form]

2. Existence of random disturbance (εt)

Even if the "appropriate" future value of Xt and true parameter values, β,

were known with certainty

FE = yt - ŷt = yt - F(Xt,β) = εt

σ²_FE = Variance(FE) = Var(εt) = σ².

In this case confidence intervals for yt would be obtained from

Pr[F(Xt, β) - t(α/2) σ < yt < F(Xt, β) + t(α/2) σ] = 1 - α

which could be visualized as follows for the linear case:

[Figure: Yt plotted against Xt with confidence bands around the regression line]

3. Uncertainty about β

Assume F(Xt, β) = Xtβ in the model yt = F(Xt, β) + εt, then the predicted

value of yt for a given value of Xt is given by

ŷt = Xtβ̂,

and the variance of ŷt (sample regression line), σ²_ŷt, is given by

σ²_ŷt = Xt Var(β̂) Xt',

with the variance of the forecast error (actual y) given by

σ²_FE = σ² + σ²_ŷt.

Note that σ²_FE takes account of the uncertainty associated with the unknown regression line and the error term and can be used to construct confidence intervals for the actual value of Y rather than just the regression line. Unbiased sample estimators of σ²_ŷt and σ²_FE can be easily obtained by replacing σ² with its unbiased estimator s².

Confidence intervals for E(Yt | Xt), the population regression line:

Pr[Xtβ̂ - t(α/2) s_ŷt < E(Yt | Xt) < Xtβ̂ + t(α/2) s_ŷt] = 1 - α

Confidence intervals for Yt:

Pr[Xtβ̂ - t(α/2) s_FE < Yt < Xtβ̂ + t(α/2) s_FE] = 1 - α
[Figure: confidence bands for the population regression line and for actual values of Yt]

4. A comparison of confidence intervals.

Some students have found that the following table facilitates their understanding of the different confidence intervals for the population regression line and actual value of Y. The column for the estimated coefficients is only included to compare the organizational parallels between the different confidence intervals.

Statistic:
  β̂ = (X'X)⁻¹X'Y
  Ŷt = Xtβ̂ = sample regression line = predicted Y value corresponding to Xt
  FE = Yt - Ŷt = Yt - Xtβ̂ (forecast error)

Distribution:
  β̂ ~ N(β, σ²(X'X)⁻¹)
  Ŷt ~ N(Xtβ, σ²_Ŷt = σ²Xt(X'X)⁻¹Xt')
  FE ~ N(0, σ²_FE = σ² + σ²_Ŷt)

t-statistic:
  1 - α = Pr[-t(α/2) < (β̂i - βi)/s_β̂i < t(α/2)]
  1 - α = Pr[-t(α/2) < (Ŷt - Xtβ)/s_Ŷt < t(α/2)]
  1 - α = Pr[-t(α/2) < (FE - 0)/s_FE < t(α/2)]

Confidence intervals:
  βi:   (β̂i - t(α/2) s_β̂i, β̂i + t(α/2) s_β̂i)
  Xtβ:  (Xtβ̂ - t(α/2) s_Ŷt, Xtβ̂ + t(α/2) s_Ŷt)
  Yt:   (Xtβ̂ - t(α/2) s_FE, Xtβ̂ + t(α/2) s_FE)

where s_Ŷt is used to compute confidence intervals for the regression line (E(Yt | Xt)) and s_FE is used in the calculation of confidence intervals for the actual value of Y. Recall that s²_FE = s² + s²_Ŷt; hence, s²_FE > s²_Ŷt and the confidence intervals for Y are larger than for the population regression line.

5. Uncertainty about X. In many situations the value of the independent variable also

needs to be predicted along with the value of y. Not surprisingly, a “poor” estimate of

Xt will likely result in a poor forecast for y. This can be represented graphically as

follows:

[Figure: forecast of Yt based on a predicted value X̂t of Xt]

6. Hold out samples and a predictive test.

One way to explore the predictive ability of a model is to estimate the model on a

subset of the data and then use the estimated model to predict known outcomes which

are not used in the initial estimation.

7. Example: ŷt = 10 + 2.5Gt + 6Mt = β̂1 + β̂2Gt + β̂3Mt

where yt, Gt, Mt denote GDP, government expenditure, and money supply.

Assume that

s^2(X'X)^{-1} = \begin{bmatrix} 10 & 5 & 2 \\ 5 & 20 & 3 \\ 2 & 3 & 15 \end{bmatrix} \times 10^{-3}, \qquad s^2 = 10.

a. Calculate an estimate of GDP (y) which corresponds to Gt = 100, Mt = 200, i.e., Xt = (1, 100, 200).

ŷt = Xtβ̂ = (1, 100, 200) \begin{bmatrix} 10 \\ 2.5 \\ 6 \end{bmatrix} = 10 + 250 + 1200 = 1460.

b. Evaluate s²_ŷt and s²_FE corresponding to the Xt in question (a).

s^2_{\hat{y}_t} = X_t \left( s^2(X'X)^{-1} \right) X_t' = (1, 100, 200) \begin{bmatrix} 10 & 5 & 2 \\ 5 & 20 & 3 \\ 2 & 3 & 15 \end{bmatrix} \begin{bmatrix} 1 \\ 100 \\ 200 \end{bmatrix} \times 10^{-3} = 921.81

s_ŷt = 30.36

s²_FE = s² + s²_ŷt = 10 + 921.81 = 931.81

s_FE = 30.53
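The arithmetic in this worked example can be reproduced with a few lines of numpy (the numbers come straight from the example; only the code itself is my addition):

```python
import numpy as np

# s^2 (X'X)^{-1} from the example, stated in units of 10^-3, and s^2 = 10.
V = np.array([[10, 5, 2],
              [5, 20, 3],
              [2, 3, 15]]) * 1e-3
s2 = 10.0
x_t = np.array([1.0, 100.0, 200.0])   # X_t = (1, G_t, M_t)
b = np.array([10.0, 2.5, 6.0])        # estimated coefficients

y_hat = x_t @ b        # 1460.0: predicted GDP
s2_yhat = x_t @ V @ x_t   # 921.81: variance of the regression line at X_t
s2_fe = s2 + s2_yhat      # 931.81: variance of the forecast error

print(y_hat, s2_yhat, s2_fe)
```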

8. Forecasting—basic Stata commands

a) The data file should include values for the explanatory variables

corresponding to the desired forecast period, say in observations n1 + 1 to n2.

b) Estimate the model using least squares

reg Y X1 . . . XK, [options]

c) Use the predict command, picking the names you want for the predictions, in this case yhat, e, sfe, and syhat:

predict yhat, xb        this option predicts Ŷ
predict e, resid        this option predicts the residuals (e)
predict sfe, stdf       this option predicts the standard error of the forecast (s_FE)
predict syhat, stdp     this option predicts the standard error of the prediction (s_Ŷ)
list y yhat sfe         this command lists the indicated variables

These commands result in the calculation and reporting of Y, Ŷ, e, s_FE and s_Ŷ for observations 1 through n2. The predictions will show up in the Data Editor of Stata under the variable names you picked (in this case, yhat, e, sfe and syhat).

You may want to restrict the calculations to t= n1 + 1, .. , n2 by using

predict yhat if(_n> n1), xb

where “n1” is the numerical value of n1.

d) The variance of the predicted value can be calculated as follows:

s²_ŷt = s²_FE - s²

H. PROBLEM SETS: MULTIVARIATE REGRESSION

Problem Set 3.1

Theory

OBJECTIVE: The objective of problems 1 & 2 is to demonstrate that the matrix equations and
summation equations for the estimators and variances of the estimators are equivalent.
Remember \sum_{t=1}^{N} X_t = N\bar{X}, and don't get discouraged!!

1. BACKGROUND: Consider the model (1) Yt = β1 + β2 Xt+ εt (t = 1, . . ., N) or


equivalently,

(1)'  \begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix} = \begin{bmatrix} 1 & X_1 \\ 1 & X_2 \\ \vdots & \vdots \\ 1 & X_n \end{bmatrix} \begin{bmatrix} \beta_1 \\ \beta_2 \end{bmatrix} + \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix}

(1)''  Y = Xβ + ε

The least squares estimator of \begin{bmatrix} \beta_1 \\ \beta_2 \end{bmatrix} is β̂ = (X'X)⁻¹X'Y.

If (A.1) - (A.5) (see class notes) are satisfied, then

Var(\hat\beta) = \begin{bmatrix} Var(\hat\beta_1) & Cov(\hat\beta_1, \hat\beta_2) \\ Cov(\hat\beta_2, \hat\beta_1) & Var(\hat\beta_2) \end{bmatrix} = \sigma^2 (X'X)^{-1}

QUESTIONS: Verify the following:


*Hint: It might be helpful to work backwards on parts c and e.

               N      NX̄                       NȲ
a.  X'X =                      and   X'Y =
               NX̄     ΣXt²                     ΣXtYt

b.  β̂2 = (ΣXtYt − NX̄Ȳ) / (ΣXt² − NX̄²)

c.  β̂1 = Ȳ − β̂2X̄

d.  Var(β̂2) = σ² / (ΣXt² − NX̄²)

e.  Var(β̂1) = σ² [ 1/n + X̄² / (ΣXt² − NX̄²) ]
            = Var(Ȳ) + X̄² Var(β̂2)

f.  Cov(β̂1, β̂2) = −X̄ Var(β̂2)
(JM II’-A, JM Stats)
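A quick numerical check of the matrix/summation equivalence in problem 1 (an illustrative sketch with simulated data, not part of the assignment; all names are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 25
x = rng.normal(10, 3, N)
y = 2.0 + 0.5 * x + rng.normal(0, 1, N)

# Matrix form: beta-hat = (X'X)^-1 X'Y
X = np.column_stack([np.ones(N), x])
b_matrix = np.linalg.solve(X.T @ X, X.T @ y)

# Summation form (parts b and c)
b2 = (np.sum(x * y) - N * x.mean() * y.mean()) / (np.sum(x**2) - N * x.mean()**2)
b1 = y.mean() - b2 * x.mean()

print(np.allclose(b_matrix, [b1, b2]))  # True: the two forms agree
```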

2. Consider the model: Yt = Xt + t

a. Show that this model is equivalent to Y = Xβ + ε

              Y1             X1             ε1
              Y2             X2             ε2
where  Y =    .    ,   X =   .    ,   ε =   .
              Yn             Xn             εn

b. Using the matrices in 2(a), evaluate (X X )-1 X Y and compare your answer with
the results obtained in question 4 in Problem Set 2.1.

c. Using the matrices in 2(a), evaluate σ²(X'X)⁻¹.
(JM II’-A)

Applied

3. Use the data in HPRICE1.RAW to estimate the model

price = β0 + β1sqrft + β2bdrms + u


where price is the house price measured in thousands of dollars, sqrft is
the floorspace measured in square feet, and bdrms is the number of bedrooms.

a. Write out the results in equation form.



b. What is the estimated increase in price for a house with one more bedroom, holding
square footage constant?
c. What is the estimated increase in price for a house with an additional bedroom that is 140
square feet in size? Compare this to your answer in part (b).
d. What percentage of the variation in price is explained by square footage and number of
bedrooms?
e. The first house in the sample has sqrft = 2,438 and bdrms = 4. Find the predicted selling
price for this house from the OLS regression line.
f. The actual selling price of the first house in the sample was $300,000 (so price = 300).
Find the residual for this house. Does it suggest that the buyer underpaid or overpaid for
the house?

Problem Set 3.2

Theory

1. R2, Adjusted R2( R 2 ), F Statistic, and LR

The R2 (coefficient of determination) is defined by

R² = SSR/SST = 1 − SSE/SST

where SSE = Σet², SST = Σ(Yt − Ȳ)², and SSR = Σ(Ŷt − Ȳ)².

Given that SST = SSR + SSE when using OLS,

a. Demonstrate that 0 R2 1.

b. Demonstrate that n = k implies R² = 1. (Hint: n = k implies that X is square. Be

careful! Show Y = Ŷ = Xβ̂.)

c. If an additional independent variable is included in the regression equation, will


the R2 increase, decrease, or remain unaltered? (Hint: What is the effect upon
SST, SSE?)

d. The adjusted R², R̄², is defined by R̄² = 1 − [SSE/(n−k)] / [SST/(n−1)]. Demonstrate that

    (1−k)/(n−k) ≤ R̄² ≤ R² ≤ 1,

i.e., the adjusted R² can be negative.

    (Hint: 1 − R̄² = (SSE/SST)((n−1)/(n−k)) = (1 − R²)((n−1)/(n−k)))
e. Verify that

e. Verify that

    LR = (SSE* − SSE)/σ²      if σ² is known

       = n ln(SSE*/SSE)       if σ² is unknown,

where SSE* denotes the restricted SSE.

f. For the hypothesis H0: β2 = . . . = βk = 0, verify that the corresponding LR statistic

    can be written as LR = n ln(1/(1 − R²)) = −n ln(1 − R²).

FYI: The corresponding Lagrange multiplier (LM) test statistic for this
hypothesis can be written in terms of the coefficient of determination as LM = NR².
(JM II-B)
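The identity in part f can be checked numerically: when the restricted model contains only an intercept, SSE* = SST, so n ln(SSE*/SSE) = −n ln(1 − R²). A sketch with simulated data (all names are mine):

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(size=n)

b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b
SSE = e @ e
SST = np.sum((y - y.mean())**2)   # restricted SSE under H0: all slopes = 0
R2 = 1 - SSE / SST

LR = n * np.log(SST / SSE)        # n ln(SSE*/SSE) with SSE* = SST
print(np.isclose(LR, -n * np.log(1 - R2)))  # True
```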

2. Demonstrate that

a. X'e = 0 is equivalent to the normal equations X'Xβ̂ = X'Y.

b. X'e = 0 implies that the sum of the estimated error terms will equal zero if the regression
equation includes an intercept.

Remember: e = Y − Ŷ = Y − Xβ̂.
(JM II-B)
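Both facts in problem 2 are easy to see numerically. A sketch with simulated data (the names are mine):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 40
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # includes an intercept
y = X @ np.array([2.0, 1.5]) + rng.normal(size=n)

b = np.linalg.solve(X.T @ X, X.T @ y)  # normal equations X'Xb = X'y
e = y - X @ b

print(np.allclose(X.T @ e, 0))  # True: X'e = 0
print(np.isclose(e.sum(), 0))   # True: residuals sum to zero (intercept column)
```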

Applied

3. The following model can be used to study whether campaign expenditures affect election
outcomes:

voteA = β0 + β1ln(expendA) + β2 ln(expendB) + β3 prtystrA + u

where voteA is the percent of the vote received by Candidate A, expendA and expendB are
campaign expenditures by Candidates A and B, and prtystrA is a measure of party
strength for Candidate A (the percent of the most recent presidential vote that went to A's
party).
i) What is the interpretation of β1?
ii) In terms of the parameters, state the null hypothesis that a 1% increase in A's
expenditures is offset by a 1% increase in B's expenditures.
iii) Estimate the model above using the data in VOTE1.RAW and report the results in
the usual form. Do A's expenditures affect the outcome? What about B's
expenditures? Can you use these results to test the hypothesis in part (ii)?
iv) Estimate a model that directly gives the t statistic for testing the hypothesis in part
(ii). What do you conclude? (Use a two-sided alternative.) A possible approach:
define D = β1 + β2 and test H0: D = 0. Substitute β1 = D − β2 and simplify to obtain

voteA = β0 + D ln(expendA) + β2 (ln(expendB) − ln(expendA)) + β3 prtystrA + u.

Regress voteA on ln(expendA), (ln(expendB) − ln(expendA)), and prtystrA



and test the hypothesis that the coefficient, D, of ln(expendA) is 0.

You can check your results by constructing the “high-tech” t-test or by using the
Stata command, test ln(expendA) + ln(expendB) =0 following the estimation of
the unconstrained regression model.
(Wooldridge C. 4.1)

4. Consider the data (data from Solow’s paper on economic growth)

t Output (Yt) Labor (Lt) Capital (Kt)

1 40.26 64.63 133.14


2 40.84 66.30 139.24
3 42.83 65.27 141.64
4 43.89 67.32 148.77
5 46.10 67.20 151.02
6 44.45 65.18 143.38
7 43.87 65.57 148.19
8 49.99 71.42 167.12
9 52.64 77.52 171.33
10 57.93 79.46 176.41

The Cobb Douglas Production function is defined by


(1)  Yt = e^(β1 + β2 t) Lt^β3 Kt^β4 εt

where (β2 t) takes account of changes in output for any reason other than a change in Lt or
Kt; εt denotes a random disturbance having the property that ln εt is distributed N(0, σ²).
Labor's share (total wage receipts / total sales receipts) is given by β3 if β3 + β4 (the
returns to scale) is equal to one. β2 is frequently referred to as the rate of technological
change, (dYt/dt)/Yt for fixed L and K. Taking the natural logarithm of equation (1), we obtain

(2)  ln(Yt) = β1 + β2 t + β3 ln(Lt) + β4 ln(Kt) + ln(εt).



If β3 + β4 is equal to 1, then equation (2) can be rewritten as

(3)  ln(Yt / Kt) = β1 + β2 t + β3 ln(Lt / Kt) + ln(εt).

a. Estimate equation (2) using the technique of least squares.


b. Corresponding to equation (2)
1) Test the hypothesis H0: β2 = β3 = β4 = 0. Explain the implications of this
hypothesis. (95% confidence level)
2) Perform and interpret individual tests of significance of β2, β3, and β4, i.e., test
H0: βi = 0, α = .05.
3) Test the hypothesis of constant returns to scale, i.e., H0: β3 + β4 = 1, using
a. a t-test for the general linear hypothesis, with restriction vector δ = (0, 0, 1, 1);
b. a Chow test;
c. a LR test.
c. Estimate equation (3) and test the hypothesis that labor’s share is equal to .75, i.e., β3 =
.75.
d. Re-estimate the model (equation 2) with the first nine observations and check to see if the actual
log(output) for the 10th observation lies in the 95% forecast confidence interval.
(JM II)
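For checking answers outside Stata, equation (2) and the t-test of constant returns to scale in b.3.a can be sketched in numpy using the Solow data from the table above (the variable names are mine; this is a sketch of the computation, not the assigned solution):

```python
import numpy as np

# Solow data from the table above: output (Y), labor (L), capital (K)
Y = np.array([40.26, 40.84, 42.83, 43.89, 46.10, 44.45, 43.87, 49.99, 52.64, 57.93])
L = np.array([64.63, 66.30, 65.27, 67.32, 67.20, 65.18, 65.57, 71.42, 77.52, 79.46])
K = np.array([133.14, 139.24, 141.64, 148.77, 151.02, 143.38, 148.19, 167.12, 171.33, 176.41])
t = np.arange(1, 11)

# Equation (2): ln Y = b1 + b2 t + b3 ln L + b4 ln K + ln(eps)
X = np.column_stack([np.ones(10), t, np.log(L), np.log(K)])
y = np.log(Y)
b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b
s2 = e @ e / (10 - 4)                      # s^2 = SSE/(n-k)

# t statistic for H0: b3 + b4 = 1 (constant returns to scale), delta = (0,0,1,1)
d = np.array([0.0, 0.0, 1.0, 1.0])
V = s2 * np.linalg.inv(X.T @ X)            # estimated Var(b-hat)
t_stat = (d @ b - 1.0) / np.sqrt(d @ V @ d)
print(b, t_stat)
```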

5. The translog production function corresponding to the previous problem is given by

ln(Y) = β1 + β2 t + β3 ln(L) + β4 ln(K) + β5 (ln(L))² + β6 (ln(K))² + β7 ln(L) ln(K) + ln(εt)


a. What restrictions on the translog production function result in a Cobb-Douglas
production function?
b. Estimate the translog production function using the data in problem 4 and use the Chow and
LR tests to determine whether it provides a statistically significant improved fit to the data,
relative to the Cobb-Douglas function.
(JM II)

6. The transcendental production function corresponding to the data in problem 4 is defined by

    Y = e^(β1 + β2 t + β3 L + β4 K) L^β5 K^β6

a. What restrictions on the transcendental production function result in a Cobb-Douglas
production function?
b. Estimate the transcendental production function using the data in problem 4 and use the Chow
and LR tests to compare it with the Cobb-Douglas production function.
(JM II)

APPENDIX A
Some important derivatives:

          x1           a1            a11  a12
Let X =        ,  a =       ,   A =
          x2           a2            a21  a22

where A is symmetric (a12 = a21 = a).

1.  d(a'X)/dX = d(X'a)/dX = a

2.  d(X'AX)/dX = 2AX

Proof that d(X'a)/dX = a:

Note: a'X = X'a = a1x1 + a2x2

                 ∂(X'a)/∂x1        a1
d(X'a)/dX  =                  =         =  a
                 ∂(X'a)/∂x2        a2

Proof that d(X'AX)/dX = 2AX:

Note: X'AX = a11x1² + (a12 + a21)x1x2 + a22x2²

                  ∂(X'AX)/∂x1        2a11x1 + 2ax2
d(X'AX)/dX  =                   =
                  ∂(X'AX)/∂x2        2ax1 + 2a22x2

                  a11x1 + ax2
             = 2
                  ax1 + a22x2

                  a11  a      x1
             = 2
                  a    a22    x2

             = 2AX.
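Both derivative identities can be confirmed numerically with central finite differences; a small sketch (the helper function and the particular a, A, x values are mine):

```python
import numpy as np

# Check d(X'a)/dX = a and d(X'AX)/dX = 2AX by central finite differences.
a = np.array([1.5, -2.0])
A = np.array([[3.0, 0.5], [0.5, 2.0]])  # symmetric, as assumed in the appendix
x = np.array([0.7, -1.2])
h = 1e-6

def grad(f, x):
    """Numerical gradient of scalar function f at x."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        d = np.zeros_like(x)
        d[i] = h
        g[i] = (f(x + d) - f(x - d)) / (2 * h)
    return g

g1 = grad(lambda z: a @ z, x)        # gradient of a'X
g2 = grad(lambda z: z @ A @ z, x)    # gradient of X'AX
print(np.allclose(g1, a, atol=1e-4), np.allclose(g2, 2 * A @ x, atol=1e-4))
```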

APPENDIX B

An unbiased estimator of ζ2 is given by

s² = (1/(n−k)) y'(I − X(X'X)⁻¹X')y = SSE/(n−k).
n- k

Proof: To show this, we need some results on traces:

tr(A) = Σ aii   (the sum of the diagonal elements of A)

1) tr(I) = n

2) If A is idempotent, tr(A) = rank of A

3) tr(A+B) = tr(A) + tr(B)

4) tr(AB) = tr(BA) if both AB and BA are defined

5) tr(ABC) = tr(CAB)

6) tr(kA) = k tr(A)

Now, remember that

σ̂² = (1/n) e'e    and    s² = (1/(n−k)) e'e

e = y − Xβ̂ = y − X(X'X)⁻¹X'y = My
  = M(Xβ + ε) = MXβ + Mε
  = Mε   (since MX = 0),
where M = I − X(X'X)⁻¹X'.

Note that M is symmetric, and idempotent (problem set R.2).


So  σ̂² = (1/n) e'e = (1/n) ε'M'Mε

        = (1/n) ε'MMε

        = (1/n) ε'Mε

and s² = (1/(n−k)) ε'Mε.
n-k

E(σ̂²) = (1/n) E(ε'Mε) = (1/n) E(tr(ε'Mε))     (a scalar equals its trace)

       = (1/n) E(tr(Mεε')) = (1/n) tr(M E(εε'))

       = (1/n) tr(M σ²I)     (E(εε') = σ²I since cov(εi, εj) = 0 for i ≠ j)

       = (1/n) tr(σ²M)

       = (σ²/n) tr(M)

       = (σ²/n) tr(I − X(X'X)⁻¹X')

       = (σ²/n) (n − tr(X(X'X)⁻¹X'))

       = (σ²/n) (n − tr(X'X(X'X)⁻¹))

       = (σ²/n) (n − tr(Ik))

       = (σ²/n) (n − k)

       = ((n−k)/n) σ².

Therefore σ̂² is biased, but E(s²) = (n/(n−k)) E(σ̂²) = σ², so s² is unbiased.
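The bias result lends itself to a quick Monte Carlo illustration: averaging σ̂² over many simulated samples should land near ((n−k)/n)σ², while s² should land near σ². A sketch (design, sample size, and names are mine):

```python
import numpy as np

rng = np.random.default_rng(3)
n, k, sigma2 = 20, 3, 4.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
beta = np.array([1.0, 2.0, -1.0])

sig_hat2, s2 = [], []
for _ in range(20000):
    y = X @ beta + rng.normal(scale=np.sqrt(sigma2), size=n)
    e = y - X @ np.linalg.solve(X.T @ X, X.T @ y)
    sig_hat2.append(e @ e / n)        # biased: E = ((n-k)/n) sigma^2
    s2.append(e @ e / (n - k))        # unbiased: E = sigma^2

print(np.mean(sig_hat2), (n - k) / n * sigma2)  # the two should be close
print(np.mean(s2), sigma2)                      # the two should be close
```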
n-k

APPENDIX C

β̂ = AY = (X'X)⁻¹X'Y is BLUE.

Proof: Let β̂i = AiY where Ai denotes the ith row of the matrix A. Since the result will be
symmetric for each β̂i (hence, for each Ai), denote Ai by a' where a is an (n x 1) vector.

The problem then becomes:

Min a'Ia           where I is n x n

s.t. AX = I        where X is n x k (for unbiasedness)

or  min a'Ia

s.t. X'a = i,      where i is the ith column of the identity matrix.

Let ℒ = a'Ia + λ'(X'a − i) be the associated Lagrangian function, where λ is k x 1.

The necessary conditions for a solution are:

∂ℒ/∂a = 2Ia + Xλ = 0

∂ℒ/∂λ = X'a − i = 0.

This implies

a = (−1/2)Xλ.

Now substitute a = (−1/2)Xλ into the expression for ∂ℒ/∂λ = 0 and we obtain

(−1/2)X'Xλ = i
λ = −2(X'X)⁻¹i
a = (−1/2)(−2)X(X'X)⁻¹i = X(X'X)⁻¹i,

so a' = i'(X'X)⁻¹X' = Ai,

which implies

A = (X'X)⁻¹X';

hence, β̂ = (X'X)⁻¹X'y.
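The Gauss-Markov logic above can be illustrated numerically: any other linear unbiased estimator has rows A + C with CX = 0, and its row norms a'a (which are proportional to the coefficient variances) are at least as large as those of the OLS rows. A sketch with simulated X (all names are mine):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=n)])
A_ols = np.linalg.inv(X.T @ X) @ X.T                 # OLS: A = (X'X)^-1 X'

# Another linear unbiased estimator: A_alt = A_ols + C with CX = 0
C = rng.normal(size=(2, n)) * 0.1
C = C - C @ X @ np.linalg.inv(X.T @ X) @ X.T         # project so that C X = 0
A_alt = A_ols + C

print(np.allclose(A_ols @ X, np.eye(2)))             # True: unbiased
print(np.allclose(A_alt @ X, np.eye(2)))             # True: also unbiased
# Each Var(beta-hat_i) is sigma^2 a'a; the OLS rows are never larger:
print(np.all(np.sum(A_ols**2, axis=1) <= np.sum(A_alt**2, axis=1) + 1e-12))
```

The cross term between the OLS rows and C vanishes because A_ols C' = (X'X)⁻¹(CX)' = 0, so each alternative row norm equals the OLS row norm plus a nonnegative piece: exactly the minimization the Lagrangian solves.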