
III 1

James B. McDonald
Brigham Young University
9/29/2010

III. Classical Normal Linear Regression Model Extended to the Case of k

Explanatory Variables

A. Basic Concepts

Let y denote an n x 1 vector of random variables, i.e., y = (y1, y2, . . ., yn)'.

1. The expected value of y is defined by

E(y) = \begin{bmatrix} E(y_1) \\ E(y_2) \\ \vdots \\ E(y_n) \end{bmatrix}.
2. The variance of the vector y is defined by

Var(y) = \begin{bmatrix} Var(y_1) & Cov(y_1, y_2) & \cdots & Cov(y_1, y_n) \\ Cov(y_2, y_1) & Var(y_2) & \cdots & Cov(y_2, y_n) \\ \vdots & \vdots & \ddots & \vdots \\ Cov(y_n, y_1) & Cov(y_n, y_2) & \cdots & Var(y_n) \end{bmatrix}.

NOTE: Let μ = E(y); then

Var(y) = E[(y - μ)(y - μ)']

= E\left[ \begin{bmatrix} y_1 - \mu_1 \\ \vdots \\ y_n - \mu_n \end{bmatrix} (y_1 - \mu_1, \ldots, y_n - \mu_n) \right]

= \begin{bmatrix} E(y_1 - \mu_1)^2 & E(y_1 - \mu_1)(y_2 - \mu_2) & \cdots & E(y_1 - \mu_1)(y_n - \mu_n) \\ E(y_2 - \mu_2)(y_1 - \mu_1) & E(y_2 - \mu_2)^2 & \cdots & E(y_2 - \mu_2)(y_n - \mu_n) \\ \vdots & \vdots & \ddots & \vdots \\ E(y_n - \mu_n)(y_1 - \mu_1) & E(y_n - \mu_n)(y_2 - \mu_2) & \cdots & E(y_n - \mu_n)^2 \end{bmatrix}

= \begin{bmatrix} Var(y_1) & Cov(y_1, y_2) & \cdots & Cov(y_1, y_n) \\ Cov(y_2, y_1) & Var(y_2) & \cdots & Cov(y_2, y_n) \\ \vdots & \vdots & \ddots & \vdots \\ Cov(y_n, y_1) & Cov(y_n, y_2) & \cdots & Var(y_n) \end{bmatrix}.

3. The n x 1 vector of random variables, y, is said to be distributed as a multivariate normal with mean vector μ and variance-covariance matrix Σ (denoted y ~ N(μ, Σ)) if the probability density function of y is given by

f(y; \mu, \Sigma) = \frac{e^{-\frac{1}{2}(y-\mu)'\Sigma^{-1}(y-\mu)}}{(2\pi)^{n/2}\,|\Sigma|^{1/2}}.

Special case (n = 1): y = (y1), μ = (μ1), Σ = (σ²):

f(y_1; \mu_1, \sigma^2) = \frac{e^{-\frac{1}{2}(y_1-\mu_1)(\sigma^2)^{-1}(y_1-\mu_1)}}{(2\pi)^{1/2}(\sigma^2)^{1/2}} = \frac{e^{-(y_1-\mu_1)^2/2\sigma^2}}{\sqrt{2\pi\sigma^2}}.

4. Some Useful Theorems

a. If y ~ N(μy, Σy), then z = Ay ~ N(μz = Aμy, Σz = AΣyA') where A is a matrix of constants.

b. If y ~ N(0, I) and A is a symmetric idempotent matrix, then y'Ay ~ χ²(m) where m = Rank(A) = trace(A).

c. If y ~ N(0, I) and L is a k x n matrix of rank k, then Ly and y'Ay are independently distributed if LA = 0.

d. If y ~ N(0, I), then the idempotent quadratic forms y'Ay and y'By are independently distributed χ² variables if AB = 0.

NOTE:

(1) Proof of (a)


E(z) = E(Ay) = AE(y) = Aμy
Var(z) = E[(z - E(z))(z - E(z))']
= E[(Ay - Aμy)(Ay - Aμy)']
= E[A(y - μy)(y - μy)'A']
= A E[(y - μy)(y - μy)'] A'
= AΣyA' = Σz

(2) Example: Let y1, . . ., yn denote a random sample drawn from N(μ, σ²). Then

y = \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix} \sim N\left( \begin{bmatrix} \mu \\ \vdots \\ \mu \end{bmatrix}, \begin{bmatrix} \sigma^2 & & 0 \\ & \ddots & \\ 0 & & \sigma^2 \end{bmatrix} \right).

The "Useful" Theorem 4.a implies that:

\bar{y} = \frac{1}{n}y_1 + \cdots + \frac{1}{n}y_n = \left( \frac{1}{n}, \ldots, \frac{1}{n} \right) y \sim N(\mu, \sigma^2/n).

Verify that

(a) \left( \frac{1}{n}, \ldots, \frac{1}{n} \right) \begin{bmatrix} \mu \\ \vdots \\ \mu \end{bmatrix} = \mu

(b) \left( \frac{1}{n}, \ldots, \frac{1}{n} \right) \sigma^2 I \begin{bmatrix} 1/n \\ \vdots \\ 1/n \end{bmatrix} = \sigma^2/n.
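The notes later use Stata, but Theorem 4.a and this verification can be checked numerically; the following is an illustrative numpy sketch (the library choice and the specific values n = 5, μ = 3, σ² = 2 are mine, not the notes'):

```python
import numpy as np

# Theorem 4.a: if y ~ N(mu_y, Sigma_y), then Ay ~ N(A mu_y, A Sigma_y A').
# For the sample mean, A = (1/n, ..., 1/n), mu_y = mu * 1, Sigma_y = sigma^2 I.
n, mu, sigma2 = 5, 3.0, 2.0
A = np.full((1, n), 1.0 / n)              # 1 x n averaging matrix
mu_y = np.full(n, mu)                     # mean vector mu * 1
Sigma_y = sigma2 * np.eye(n)              # variance-covariance matrix sigma^2 I

mean_of_ybar = (A @ mu_y).item()          # part (a): equals mu
var_of_ybar = (A @ Sigma_y @ A.T).item()  # part (b): equals sigma^2 / n

print(mean_of_ybar, var_of_ybar)
```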

B. The Basic Model

Consider the model defined by

(1) yt = β1xt1 + β2xt2 + . . . + βkxtk + εt (t = 1, . . ., n).

If we want to include an intercept, define xt1 = 1 for all t and we obtain

(2) yt = β1 + β2xt2 + . . . + βkxtk + εt.

Note that βi can be interpreted as the marginal impact of a unit increase in xi on the

expected value of y.

The error terms (εt) in (1) will be assumed to satisfy:

(A.1) εt is distributed normally

(A.2) E(εt) = 0 for all t

(A.3) Var(εt) = σ² for all t

(A.4) Cov(εt, εs) = 0 for t ≠ s.

Rewriting (1) for each t (t = 1, 2, . . ., n) we obtain

y1 = β1x11 + β2x12 + . . . + βkx1k + ε1

y2 = β1x21 + β2x22 + . . . + βkx2k + ε2

. . . .

. . . .

(3) . . . .

yn = β1xn1 + β2xn2 + . . . + βkxnk + εn.

The system of equations (3) is equivalent to the matrix representation

y = Xβ + ε

where the matrices y, X, β and ε are defined as follows:

y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} \ (n \times 1), \qquad X = \begin{bmatrix} x_{11} & \cdots & x_{1k} \\ x_{21} & \cdots & x_{2k} \\ \vdots & & \vdots \\ x_{n1} & \cdots & x_{nk} \end{bmatrix} \ (n \times k),

β = \begin{bmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_k \end{bmatrix} \quad and \quad ε = \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix}.

The columns of X contain the n observations on each of the k individual variables; the rows may represent observations at a given point in time.

NOTE: (1) Assumptions (A.1)-(A.4) can be written much more compactly as

(A.1)' ε ~ N(0, Σ = σ²I).

(2) The model to be discussed can then be summarized as

y = Xβ + ε.

(A.1)' ε ~ N(0, Σ = σ²I)

(A.5)' The xtj's are nonstochastic and

\lim_{n \to \infty} \frac{X'X}{n} = \Sigma_x \ \text{is nonsingular.}

C. Estimation

We will derive the least squares, MLE, BLUE and instrumental variables estimators in

this section.

1. Least Squares:

The basic model can be written as

y = Xβ + ε
  = Xβ̂ + e = Ŷ + e

where Ŷ = Xβ̂ is an n x 1 vector of predicted values for the dependent variable and e denotes a vector of residuals or estimated errors.

The sum of squared errors is defined by

SSE(β̂) = \sum_{t=1}^{n} e_t^2 = (e_1, e_2, \ldots, e_n) \begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_n \end{bmatrix}
= e'e
= (y - Xβ̂)'(y - Xβ̂)
= y'y - β̂'X'y - y'Xβ̂ + β̂'X'Xβ̂
= y'y - 2β̂'X'y + β̂'X'Xβ̂.

The least squares estimator of β is defined as the β̂ which minimizes SSE(β̂). A necessary condition for SSE(β̂) to be a minimum is that

\frac{dSSE(\hat\beta)}{d\hat\beta} = 0 \quad (see Appendix A for how to differentiate a real-valued function with respect to a vector)

\frac{dSSE(\hat\beta)}{d\hat\beta} = -2X'y + 2X'X\hat\beta = 0, \ \text{or}

X'Xβ̂ = X'y    (Normal Equations)

β̂ = (X'X)⁻¹X'y is the least squares estimator.

Note that β̂ is a vector of least squares estimators of β1, β2, ..., βk.
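The normal equations can be illustrated with a short numpy sketch (simulated data and the seed are my own choices; solving the normal equations directly rather than inverting X'X is a standard numerical refinement):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 100, 3
# Design matrix with an intercept column (x_t1 = 1 for all t).
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
beta = np.array([1.0, 2.0, -0.5])
y = X @ beta + rng.normal(size=n)

# Normal equations X'X b = X'y; solve rather than invert for stability.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ beta_hat

# The residuals satisfy X'e = 0 (the k restrictions noted in the MLE section).
print(np.allclose(X.T @ e, 0))  # True
```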

2. Maximum Likelihood Estimation (MLE)

Likelihood Function: (Recall y ~ N(Xβ, Σ = σ²I))

L(y; \beta, \Sigma = \sigma^2 I) = \frac{e^{-\frac{1}{2}(y-X\beta)'\Sigma^{-1}(y-X\beta)}}{(2\pi)^{n/2}|\Sigma|^{1/2}}

= \frac{e^{-\frac{1}{2\sigma^2}(y-X\beta)'(y-X\beta)}}{(2\pi)^{n/2}|\sigma^2 I|^{1/2}}

= \frac{e^{-(y-X\beta)'(y-X\beta)/2\sigma^2}}{(2\pi)^{n/2}(\sigma^2)^{n/2}}.

The natural log of the likelihood function,

ℓ = \ln L = -\frac{(y-X\beta)'(y-X\beta)}{2\sigma^2} - \frac{n}{2}\ln 2\pi - \frac{n}{2}\ln \sigma^2

= -\frac{1}{2\sigma^2}(y-X\beta)'(y-X\beta) - \frac{n}{2}\ln 2\pi - \frac{n}{2}\ln \sigma^2

= -\frac{1}{2\sigma^2}\left[ y'y - 2\beta'X'y + \beta'X'X\beta \right] - \frac{n}{2}\ln 2\pi - \frac{n}{2}\ln \sigma^2

is known as the log likelihood function. ℓ is a function of β and σ².

The MLEs of β and σ² are defined by the two equations (necessary conditions for a maximum):

\frac{\partial ℓ}{\partial \beta} = -\frac{1}{2\sigma^2}\left( -2X'y + 2(X'X)\beta \right) = 0

\frac{\partial ℓ}{\partial \sigma^2} = \frac{(y-X\beta)'(y-X\beta)}{2(\sigma^2)^2} - \frac{n}{2}\frac{1}{\sigma^2} = 0

i.e.,

β̃ = (X'X)⁻¹X'y

\tilde{\sigma}^2 = \frac{1}{n}(y - X\tilde\beta)'(y - X\tilde\beta) = \frac{e'e}{n} = \frac{\sum e_t^2}{n}.

NOTE: (1) β̃ = β̂, the least squares estimator.

(2) σ̃² is a biased estimator of σ²; whereas

s^2 = \frac{1}{n-k}(y - X\hat\beta)'(y - X\hat\beta) = \frac{e'e}{n-k} = \frac{SSE}{n-k}

is an unbiased estimator of σ². A proof of the unbiasedness of s² is given in Appendix B.

Only n-k of the estimated residuals are independent. The necessary conditions for least squares estimates impose k restrictions on the estimated residuals (e). The restrictions are summarized by the normal equations X'Xβ̂ = X'y, or equivalently

X'e = 0.

(3) Substituting σ̃² = SSE/n into the log likelihood function yields what is known as the concentrated log likelihood function

ℓ = -\frac{n}{2}\left[ 1 + \ln(2\pi) + \ln\frac{SSE}{n} \right]

which expresses the log-likelihood value as a function of β only. This equation also clearly demonstrates the equivalence of maximizing ℓ and minimizing SSE.
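The bias of σ̃² = e'e/n relative to s² = e'e/(n-k) can be seen in a small simulation; this numpy sketch (design, true σ² = 2.25, and replication count are illustrative choices of mine) averages both estimators over repeated samples:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, sigma2 = 200, 4, 2.25
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
beta = np.ones(k)

reps, mle, unbiased = 2000, [], []
for _ in range(reps):
    y = X @ beta + rng.normal(scale=np.sqrt(sigma2), size=n)
    e = y - X @ np.linalg.solve(X.T @ X, X.T @ y)
    sse = e @ e
    mle.append(sse / n)             # sigma-tilde^2: biased downward by (n-k)/n
    unbiased.append(sse / (n - k))  # s^2: unbiased for sigma^2

print(np.mean(mle), np.mean(unbiased))  # first average falls below the second
```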



3. BLUE Estimators of β, β̄.

We will demonstrate that assumptions (A.2)-(A.5) imply that the best (least variance) linear unbiased estimator (BLUE) of β is the least squares estimator. We first consider the desired properties and then derive the associated estimator.

Linear: β̄ = Ay where A is a k x n matrix of constants.

Unbiased: E(β̄) = AE(y) = AXβ. We note that E(β̄) = AXβ = β requires AX = I.

Minimum Variance:

Var(β̄i) = A_i Var(y) A_i' = σ²A_iA_i'

where A_i = the ith row of A and β̄_i = A_i y.

Thus, the construction of the BLUE is equivalent to selecting the matrix A so that the rows of A solve

min A_iA_i'   (i = 1, 2, . . ., k)
s.t. AX = I

or equivalently, min Var(β̄_i) s.t. AX = I (unbiasedness).

The solution to this problem is given by

A = (X'X)⁻¹X'; hence, the BLUE of β is given by β̄ = Ay = (X'X)⁻¹X'y.

The details of this derivation are contained in Appendix C.

NOTE: (1) β̄ = β̃ = β̂ = (X'X)⁻¹X'y

(2) AX = (X'X)⁻¹X'X = I; thus β̄ is unbiased.

4. Method of Moments Estimation

Method of moments parameter estimators are selected to equate sample and corresponding theoretical moments. The open question is what theoretical moments should be considered and what are the corresponding sample moments. With the regression model we might consider the following theoretical moments, which follow from the underlying theoretical assumptions:

(A.2) E(ε_t) = 0
(A.5) Cov(X_{it}, ε_t) = 0

The sample moment associated with (A.2) is

\sum_{t=1}^{n} e_t / n = \bar{e} = 0.

The sample covariances can be written as

\sum_{t=1}^{n} (X_{it} - \bar{X}_i)(e_t - \bar{e})/n = \sum_{t=1}^{n} (X_{it} - \bar{X}_i)e_t/n = \sum_{t=1}^{n} X_{it}e_t/n = 0.

These sample moments can be summarized using matrix notation as follows:

\begin{bmatrix} 1 & 1 & \cdots & 1 \\ x_{12} & x_{22} & \cdots & x_{n2} \\ \vdots & \vdots & & \vdots \\ x_{1k} & x_{2k} & \cdots & x_{nk} \end{bmatrix} \begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_n \end{bmatrix} / n = X'e/n = 0

which is equivalent to X'e = 0, also known as the normal equations in the OLS framework, and yields the OLS estimator by solving

X'e = X'(Y - Xβ̂) = 0

for β̂.

5. Instrumental Variables Estimators

y = Xβ + ε

Let Z denote an n x k matrix of "instruments" or "instrumental" variables. Consider the solution of the modified normal equations:

Z'Y = Z'Xβ̂_z; hence, β̂_z = (Z'X)⁻¹Z'Y.

β̂_z is referred to as the instrumental variables estimator of β based on the instrumental variables Z. Instrumental variables can be very useful if the variables on the right hand side include "endogenous" variables or in the case of measurement error. In these cases OLS will yield biased and inconsistent estimators; whereas instrumental variables can yield consistent estimators.

NOTE: (1) The motivation for the selection of the instruments (Z) is that the covariance of (Z, ε) approaches 0 and Z and X are correlated. Thus Z'Y = Z'(Xβ + ε) = Z'Xβ + Z'ε ≈ Z'Xβ.

(2) If \lim_{n\to\infty} \frac{Z'X}{n} is nonsingular and \lim_{n\to\infty} \frac{Z'\varepsilon}{n} = 0, then β̂_z is a consistent estimator of β.

(3) Many calculate an R² after instrumental variables estimation using the formula R² = 1 - SSE/SST. Since this can be negative, there is not a natural interpretation of R² for instrumental variables estimators. Further, the R² can't be used to construct F-statistics for IV estimators.

(4) If Z includes "weak" instruments (weakly correlated with the X's), then the variances of the IV estimator can be large and the corresponding asymptotic biases can be large if the Z and error are correlated. This can be seen by noting that the bias of the instrumental variables estimator is given by

E\left[ (Z'X/n)^{-1}(Z'\varepsilon/n) \right].

(5) As a special case, if Z = X, then β̂_z = β̂_x = β̂ = β̃ = β̄.

(6) If Z is an n x k* matrix where k < k* (Z contains more variables than X), then the IV estimator defined above must be modified. The most common approach in this case is to replace Z in the "IV" equation by the projections** of X on the columns of Z, i.e. X̂ = Z(Z'Z)⁻¹Z'X. This substitution yields the IV estimator

β_{IV} = (X̂'X)⁻¹X̂'Y = \left[ X'Z(Z'Z)^{-1}Z'X \right]^{-1} X'Z(Z'Z)^{-1}Z'Y

which yields estimates for k ≤ k*.
The Stata command for the instrumental variables estimator is given by

ivregress estimator depvar (varlist_1 = varlist_iv) [varlist_2]

where estimator = 2sls, gmm, or liml, with 2sls the default estimator, for the model

depvar = (varlist_1)b1 + (varlist_2)b2 + error

where varlist_iv are the instrumental variables for varlist_1.

A specific example is given by:

ivregress 2sls y1 (y2 = z1 z2 z3) x1 x2 x3

Identical results could be obtained with the command

ivregress 2sls y1 (y2 x1 x2 x3 = z1 z2 z3)

which is equivalent to regressing all of the right hand side variables on the set of instrumental variables. This can be thought of as being of the form

ivregress 2sls y (X = Z)

**The projections of X on Z can be obtained by estimating Π in the "reduced form" equation X = ZΠ + V to yield Π̂ = (Z'Z)⁻¹Z'X; hence, the estimate of X is given by

X̂ = ZΠ̂ = Z(Z'Z)⁻¹Z'X.
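The 2SLS construction above can be sketched numerically. The following numpy illustration (the data-generating process, coefficient values, and seed are my own assumptions, not from the notes) builds one endogenous regressor with two instruments, so k* = 2 > k = 1, and compares OLS with the projection-based IV estimator:

```python
import numpy as np

rng = np.random.default_rng(2)
n, beta = 20000, 1.5
Z = rng.normal(size=(n, 2))                  # instruments, uncorrelated with eps
v = rng.normal(size=n)
eps = 0.8 * v + rng.normal(size=n)           # error correlated with v ...
x = Z @ np.array([1.0, 0.5]) + v             # ... so x is endogenous
y = beta * x + eps
X = x[:, None]

# 2SLS: replace X by its projection on the columns of Z.
Xhat = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)     # Xhat = Z (Z'Z)^{-1} Z'X
beta_2sls = np.linalg.solve(Xhat.T @ X, Xhat.T @ y)
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

# OLS is inconsistent here (Cov(x, eps) != 0); 2SLS is close to beta = 1.5.
print(beta_ols.item(), beta_2sls.item())
```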
D. Distribution of β̂, β̃, β̄

Recall that under the assumptions (A.1) - (A.5), y ~ N(Xβ, Σ = σ²I) and

β̂ = β̃ = β̄ = (X'X)⁻¹X'y = Ay where A = (X'X)⁻¹X';

hence, by useful theorem (II'.A.4.a), we conclude that

β̂ = β̃ = β̄ ~ N(Aμ_y, AΣ_yA') = N[AXβ, A(σ²I)A'].

The desired derivations can be simplified by noting that

AXβ = (X'X)⁻¹X'Xβ = β

σ²AA' = σ²(X'X)⁻¹X'((X'X)⁻¹X')'
= σ²(X'X)⁻¹X'X((X'X)⁻¹)'
= σ²((X'X)⁻¹)'
= σ²((X'X)')⁻¹
= σ²(X'X)⁻¹.

Therefore β̂ = β̃ = β̄ ~ N(β, σ²(X'X)⁻¹).

NOTE: (1) σ²(X'X)⁻¹ can be shown to be the Cramer-Rao matrix, the matrix of lower bounds for the variances of unbiased estimators.

(2) β̂, β̃, β̄ are

unbiased

consistent

minimum variance of all (linear and nonlinear) unbiased estimators

normally distributed

(3) An unbiased estimator of σ²(X'X)⁻¹ is given by

s²(X'X)⁻¹

where s² = e'e/(n-k); this is the formula used to calculate the "estimated variance covariance matrix" in many computer programs.

(4) To report s2(X'X)-1 in STATA type

. reg y x

. estat vce

(5) Distribution of the variance estimator

\frac{(n-k)s^2}{\sigma^2} \sim \chi^2(n-k)

NOTE: This can be proven using theorem (II'.A.4(b)) and noting that

(n-k)s² = e'e = (Y - Xβ̂)'(Y - Xβ̂)
= (Xβ + ε)'(I - X(X'X)⁻¹X')(Xβ + ε)
= ε'(I - X(X'X)⁻¹X')ε.

Therefore,

\frac{(n-k)s^2}{\sigma^2} = \left(\frac{\varepsilon}{\sigma}\right)' (I - X(X'X)^{-1}X') \left(\frac{\varepsilon}{\sigma}\right) = \left(\frac{\varepsilon}{\sigma}\right)' M \left(\frac{\varepsilon}{\sigma}\right)

where ε/σ ~ N[0, I]; hence

\frac{(n-k)s^2}{\sigma^2} \sim \chi^2(n-k)

because M is idempotent with rank and trace equal to n - k.
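The key properties of M = I - X(X'X)⁻¹X' used in this proof (idempotency and trace n - k) can be checked directly; a small numpy sketch with an arbitrary design matrix of my own choosing:

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 30, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])

# M = I - X(X'X)^{-1}X' is symmetric idempotent and annihilates X.
M = np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)

print(np.allclose(M @ M, M))      # True: idempotent
print(np.allclose(M @ X, 0))      # True: MX = 0
print(round(np.trace(M)))         # 26, i.e. n - k
```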



E. Statistical Inference

1. Ho: β2 = β3 = . . . = βk = 0

This hypothesis tests for the statistical significance of overall explanatory power
of the explanatory variables by comparing the model with all variables included to
the model without any of the explanatory variables, i.e., yt = β1 + εt (all non-
intercept coefficients = 0). Recall that the total sum of squares (SST) can be
partitioned as follows:
\sum_{t=1}^{n} (y_t - \bar{y})^2 = \sum_{t=1}^{n} (y_t - \hat{y}_t)^2 + \sum_{t=1}^{n} (\hat{y}_t - \bar{y})^2, \ \text{or}

SST = SSE + SSR.

Dividing both sides of the equation by σ² yields quadratic forms, each having a chi-square distribution:

\frac{SST}{\sigma^2} = \frac{SSE}{\sigma^2} + \frac{SSR}{\sigma^2}

χ²(n - 1) = χ²(n - k) + χ²(k - 1).

This result provides the basis for using

F = \frac{SSR/(k-1)}{SSE/(n-k)} \sim F(k-1, n-k)

to test the hypothesis that β2 = β3 = . . . = βk = 0.

NOTE: (1) \frac{SSR}{SSE} = \frac{SSR}{SST - SSR} = \frac{SSR/SST}{1 - SSR/SST} = \frac{R^2}{1 - R^2};

hence, the F-statistic for this hypothesis can also be rewritten as

F = \frac{R^2/(k-1)}{(1 - R^2)/(n-k)} = \frac{n-k}{k-1} \cdot \frac{R^2}{1 - R^2} \sim F(k-1, n-k).

Recall that this decomposition of SST can be summarized in an ANOVA table as follows:

Source of Variation    SS     d.f.     MSE
Model                  SSR    K - 1    SSR/(K - 1)
Error                  SSE    n - K    SSE/(n - K) = s²
Total                  SST    n - 1

K = number of coefficients in model

where the ratio of the model and error MSE's yields the F statistic just discussed.
Additionally, remember that the adjusted R² (R̄²), defined by

\bar{R}^2 = 1 - \frac{(\sum e_t^2)/(n-K)}{\sum (Y_t - \bar{Y})^2/(n-1)},

will only increase with the addition of a new variable if the t-statistic associated with the new variable is greater than 1 in absolute value. This result follows from the equation

\bar{R}^2_{New} - \bar{R}^2_{Old} = \frac{n-1}{(n-K)(n-K-1)} \cdot \frac{SSE_{New}}{SST} \left[ \left( \frac{\hat{\beta}_{New\_var}}{s_{\hat{\beta}_{New\_var}}} \right)^2 - 1 \right]

where the last term in brackets is t² - 1, K denotes the number of coefficients in the "old" regression model, and the "new" regression model includes K + 1 coefficients.

The Lagrangian Multiplier (LM) and likelihood ratio (LR) tests can also be used to test this hypothesis, where

LM = NR² ~ᵃ χ²(k - 1)

LR = -N ln(1 - R²) ~ᵃ χ²(k - 1).
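The algebraic equivalence of the sums-of-squares and R² forms of the F statistic can be checked numerically; the following numpy sketch (simulated data of my own construction) computes both forms from one regression:

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 120, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([1.0, 0.5, 0.0, -0.3]) + rng.normal(size=n)

b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b
sse = e @ e
sst = np.sum((y - y.mean()) ** 2)
ssr = sst - sse                       # SST = SSE + SSR under OLS
r2 = ssr / sst

F_ss = (ssr / (k - 1)) / (sse / (n - k))       # sums-of-squares form
F_r2 = (r2 / (k - 1)) / ((1 - r2) / (n - k))   # R^2 form

print(np.isclose(F_ss, F_r2))  # True: the two forms are identical
```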

2. Testing hypotheses involving individual βi's

Recall that

β̂ ~ N(β, σ²(X'X)⁻¹)

where

\sigma^2 (X'X)^{-1} = \begin{bmatrix} \sigma^2_{\hat\beta_1} & \sigma_{\hat\beta_1\hat\beta_2} & \cdots & \sigma_{\hat\beta_1\hat\beta_k} \\ \sigma_{\hat\beta_2\hat\beta_1} & \sigma^2_{\hat\beta_2} & \cdots & \sigma_{\hat\beta_2\hat\beta_k} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{\hat\beta_k\hat\beta_1} & \sigma_{\hat\beta_k\hat\beta_2} & \cdots & \sigma^2_{\hat\beta_k} \end{bmatrix}

which can be estimated by

s^2 (X'X)^{-1} = \begin{bmatrix} s^2_{\hat\beta_1} & s_{\hat\beta_1\hat\beta_2} & \cdots & s_{\hat\beta_1\hat\beta_k} \\ s_{\hat\beta_2\hat\beta_1} & s^2_{\hat\beta_2} & \cdots & s_{\hat\beta_2\hat\beta_k} \\ \vdots & \vdots & \ddots & \vdots \\ s_{\hat\beta_k\hat\beta_1} & s_{\hat\beta_k\hat\beta_2} & \cdots & s^2_{\hat\beta_k} \end{bmatrix}.

Hypotheses of the form H0: βi = βi0 can be tested using the result

\frac{\hat\beta_i - \beta_{i0}}{s_{\hat\beta_i}} \sim t(n-k).

The validity of this distributional result follows from

\frac{N(0,1)}{\sqrt{\chi^2(d)/d}} \sim t(d)

since \frac{\hat\beta_i - \beta_i}{\sigma_{\hat\beta_i}} \sim N(0,1) and \frac{(n-k)}{\sigma^2_{\hat\beta_i}} s^2_{\hat\beta_i} \sim \chi^2(n-k).

3. Tests of hypotheses involving linear combinations of coefficients

A linear combination of the βi's can be written as

\sum_{i=1}^{k} \delta_i \beta_i = (\delta_1, \ldots, \delta_k) \begin{bmatrix} \beta_1 \\ \vdots \\ \beta_k \end{bmatrix} = \delta'\beta.

We now consider testing hypotheses of the form

H0: δ'β = γ.

Recall that

β̂ ~ N(β, σ²(X'X)⁻¹);

therefore,

δ'β̂ ~ N(δ'β, σ²δ'(X'X)⁻¹δ);

hence,

\frac{\delta'\hat\beta - \delta'\beta}{\sqrt{s^2\,\delta'(X'X)^{-1}\delta}} = \frac{\delta'\hat\beta - \gamma}{s_{\delta'\hat\beta}} \sim t(n-k) \ \text{under H0}.

The t-test of a hypothesis involving a linear combination of the coefficients involves running one regression and estimating the variance of δ'β̂ from s²(X'X)⁻¹ to construct the test statistic.
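The one-regression recipe above can be sketched in numpy (the simulated model, the particular hypothesis β2 + β3 = 5, and the seed are illustrative assumptions of mine):

```python
import numpy as np

rng = np.random.default_rng(5)
n, k = 80, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([1.0, 2.0, 3.0]) + rng.normal(size=n)

b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b
s2 = (e @ e) / (n - k)
V = s2 * np.linalg.inv(X.T @ X)          # estimated Var(beta_hat) = s^2 (X'X)^{-1}

# H0: beta_2 + beta_3 = 5, i.e. delta'beta = gamma with delta = (0, 1, 1)'.
delta, gamma = np.array([0.0, 1.0, 1.0]), 5.0
t_stat = (delta @ b - gamma) / np.sqrt(delta @ V @ delta)

print(t_stat)  # compare to t(n - k) critical values
```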

4. More general tests

a. Introduction

We have considered tests of the overall explanatory power of the

regression model (Ho: β2 = β3 = . . . βk = 0), tests involving individual parameters

(e.g., Ho: β3 = 6), and testing the validity of a linear constraint on the coefficients

(Ho: δ’β = γ). In this section we will consider how more general tests can be

performed. The testing procedures will be based on the Chow and Likelihood

ratio (LR) tests. The hypotheses may be of many different types and involve the

previous tests as special cases. Other examples might include joint hypotheses of

the form: Ho: β2 + 6 β5 = 4, β3 = β7 = 0. The basic idea is that if the hypothesis is

really valid, then goodness of fit measures such as SSE, R², and log-likelihood values (ℓ) will not be significantly impacted by imposing the valid hypothesis in

estimation. Hence, the SSE, R², or ℓ values will not be significantly different for

constrained (via the hypothesis) and unconstrained estimation of the underlying

regression model. The tests of the validity of the hypothesis are based on

constructing test statistics, with known exact or asymptotic distributions, to

evaluate the statistical significance of changes in SSE, R², or ℓ.

Consider the model

y=Xβ+ε

and a hypothesis, Ho: g(β) = 0 which imposes individual and/or multiple

constraints on the β vector.

The Chow and likelihood ratio tests for testing Ho: g(β) = 0 can be

constructed from the output obtained from estimating the two following

regression models.

(1) Estimate the regression model y = Xβ + ε without imposing any

constraints on the vector β. Let the associated sum of square errors,

coefficient of determination, log-likelihood value and degrees of freedom



be denoted by SSE, R², ℓ, and (n - k).

(2) Estimate the same regression model where the β is constrained as

specified by the hypothesis (Ho: g(β) = 0) in the estimation process. Let

the associated sum of squared errors, R2, log-likelihood value and degrees

of freedom be denoted by SSE*, R²*, ℓ*, and (n - k)*, respectively.

b. Chow test

The Chow test is defined by the following statistic:

F = \frac{(SSE^* - SSE)/r}{SSE/(n-k)} \sim F(r, n-k)

where r = (n-k)* - (n-k) is the number of independent restrictions imposed on β by the hypothesis. For example, if the hypothesis were Ho: β2 + 6β5 = 4, β3 = β7 = 0, then the numerator degrees of freedom (r) would equal 3. In applications where the SST is unaltered by imposing the restrictions, we can divide the numerator and denominator by SST to yield the Chow test rewritten in terms of the change in the R² between the constrained and unconstrained regressions:

F = \frac{(R^2 - R^{2*})/r}{(1 - R^2)/(n-k)} \sim F(r, n-k)

Note that if the hypothesis (H0: g(β) = 0) is valid, then we would expect R² (SSE) and R²* (SSE*) not to be significantly different from each other. Thus, it is only large values (greater than the critical value) of F which provide the basis for rejecting the hypothesis. Again, the R² form of the Chow test is only valid if the dependent variable is the same in the constrained and unconstrained regressions.
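Constrained and unconstrained estimation can be compared in a short numpy sketch (the model, the zero restrictions tested, and the simulated data are illustrative choices of mine):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 200
x2, x3, x4 = rng.normal(size=(3, n))
y = 1.0 + 0.5 * x4 + rng.normal(size=n)       # H0: beta2 = beta3 = 0 is true

def sse(Xmat):
    """Sum of squared errors from an OLS fit of y on Xmat."""
    b = np.linalg.solve(Xmat.T @ Xmat, Xmat.T @ y)
    e = y - Xmat @ b
    return e @ e

X_u = np.column_stack([np.ones(n), x2, x3, x4])  # unconstrained
X_c = np.column_stack([np.ones(n), x4])          # constrained (H0 imposed)

r, df = 2, n - 4                                 # r = (n-k)* - (n-k) = 2
F = ((sse(X_c) - sse(X_u)) / r) / (sse(X_u) / df)

print(F)  # compare to the F(2, n-4) critical value
```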

References:

(1) Chow, G. C., "Tests of Equality Between Subsets of Coefficients in Two Linear Regressions," Econometrica, 28 (1960), 591-605.

(2) Fisher, F. M., "Tests of Equality Between Sets of Coefficients in Two Linear Regressions: An Expository Note," Econometrica, 38 (1970), 361-66.

c. Likelihood ratio (LR) test.

The LR test is a common method of statistical inference in classical


statistics. The motivation behind the LR test is similar to that of the Chow test
except that it is based on determining whether there has been a significant
reduction in the value of the log-likelihood value as a result of imposing the
hypothesized constraints on β in the estimation process. The LR test statistic is defined to be twice the difference between the unconstrained and constrained log-likelihood values, 2(ℓ - ℓ*), and, under fairly general regularity conditions, is asymptotically distributed as a chi-square with degrees of freedom equal to the number of independent restrictions (r) imposed by the hypothesis. This may be summarized as follows:

LR = 2(ℓ - ℓ*) ~ᵃ χ²(r).

The LR test is more general than the Chow test and, for the case of independent and identically distributed normal errors with known σ², LR is equal to

LR = (SSE* - SSE)/σ².

Recall that s² = SSE/(n - k) appears in the denominator of the Chow test statistic and that for large values of (n - k), s² is "close" to σ²; hence, we can see the similarity of the LR and Chow tests. If σ² is unknown, substituting the concentrated log-likelihood function into LR yields

LR = 2(ℓ - ℓ*)
   = n[ln(SSE*) - ln(SSE)]
   = n ln(SSE*/SSE).



If the hypothesis Ho: β2 = β3 = . . . βk = 0 is being tested in the classical

normal linear regression model, then SSE* = SST and LR can be rewritten in

terms of the R2 as follows:

LR = n ln[1/(1 - R²)] = -n ln[1 - R²] ~ᵃ χ²(k - 1).

In this case, the Chow test is identical to the F test for overall explanatory power

discussed earlier.

Thus the Chow test and LR test are similar in structure and purpose. The

LR test is more general than the Chow test; however, its distribution is

asymptotically (not exact) chi-square even for non-normally distributed errors.

The LR test provides a unified method of testing hypotheses.

d. Applications of the Chow and LR tests:

(1) Model: yt = β1 + β2xt2 + β3xt3 + β4xt4 + εt

Ho: β2 = β3 = 0 (two independent constraints)

(a) Estimate yt = β1 + β2xt2 + β3xt3 + β4xt4 + εt to obtain

SSE = Σet² = (n - 4)s², R²,

ℓ = -\frac{n}{2}\left[ 1 + \ln(2\pi) + \ln\frac{SSE}{n} \right],

n - k = n - 4

(b) Estimate yt = β1 + β4xt4 + εt to obtain

SSE* = Σet*² = (n - 2)s*²

SSE*, R²*, ℓ* and (n - k)* = n - 2

(c) Construct the test statistics

Chow = \frac{(SSE^* - SSE)/[(n-k)^* - (n-k)]}{SSE/(n-k)} = \frac{n-4}{2} \cdot \frac{SSE^* - SSE}{SSE} = \frac{(R^2 - R^{2*})/2}{(1 - R^2)/(n-4)} \sim F(2, n-4)

LR = 2(ℓ - ℓ*) ~ᵃ χ²(2).

(2) Tests of equality of the regression coefficients in two different regressions

models.

(a) Consider the two regression models

y(1) = X(1) β(1) + ε(1) n1 observations, k independent variables

y(2) = X(2) β(2) + ε(2) n2 observations, k independent variables

Ho: β(1) = β(2) (k independent restrictions)

(b) Rewrite the model as

(1)'  y = \begin{bmatrix} y^{(1)} \\ y^{(2)} \end{bmatrix} = \begin{bmatrix} X^{(1)} & 0 \\ 0 & X^{(2)} \end{bmatrix} \begin{bmatrix} \beta^{(1)} \\ \beta^{(2)} \end{bmatrix} + \begin{bmatrix} \varepsilon^{(1)} \\ \varepsilon^{(2)} \end{bmatrix}

Estimate (1)' using least squares and determine SSE, R², ℓ and (n - k) = n1 + n2 - 2k.

Now impose the hypothesis that β(1) = β(2) = β and write (1)' as

(2)'  y = \begin{bmatrix} y^{(1)} \\ y^{(2)} \end{bmatrix} = \begin{bmatrix} X^{(1)} \\ X^{(2)} \end{bmatrix} \beta + \begin{bmatrix} \varepsilon^{(1)} \\ \varepsilon^{(2)} \end{bmatrix}

Estimate (2)' using least squares to obtain the constrained sum of squared errors (SSE*), R²*, ℓ* and (n - k)* = n1 + n2 - k.

(c) Construct the test statistics

Chow = \frac{(SSE^* - SSE)/[(n-k)^* - (n-k)]}{SSE/(n-k)} = \frac{(R^2 - R^{2*})/k}{(1 - R^2)/(n_1 + n_2 - 2k)} \sim F(k, n_1 + n_2 - 2k)

LR = 2(ℓ - ℓ*) ~ᵃ χ²(k).

5. Testing Hypotheses using Stata


a. Stata reports the log likelihood values when the command

estat ic

follows a regression command and can be used in constructing LR tests.

b. Stata can also perform many tests based on t or Chow-type tests.

Consider the model

(1) Yt = β1 + β2Xt2 + β3Xt3 + β4Xt4 + εt

with the hypotheses:

(2) H1: β2 = 1

H2: β3 = 0

H3: β3 + β4 = 1

H4: β3β4 = 1

H5: β2 = 1 and β3 = 0

The Stata commands to perform tests of these hypotheses follow OLS estimation of the unconstrained model:

reg Y X2 X3 X4                 estimates the unconstrained model
test X2 = 1                    (Tests H1)
test X3 = 0                    (Tests H2)
test X3 + X4 = 1               (Tests H3)
testnl _b[X3]*_b[X4] = 1       (Tests H4. The "testnl" command is for testing nonlinear hypotheses. The prefix "_b", along with the brackets, must be used when testing nonlinear hypotheses.)
test (X2 = 1) (X3 = 0)         (Tests H5)

95% confidence intervals on coefficient estimates are automatically calculated in Stata. To change the confidence level, use the "level" option as follows:

reg Y X2 X3 X4, level(90)      (changes the confidence level to 90%)

F. Stepwise Regression

Stepwise regression is a method for determining which variables might be

considered as being included in a regression model. It is a purely mechanical approach,

adding or removing variables in the model solely determined by their statistical

significance and not according to any theoretical reason. While stepwise regression can be

considered when deciding among many variables to include in a model, theoretical

considerations should be the primary factor for such a decision.

A stepwise regression may use forward selection or backward selection. Using

forward selection, a stepwise regression will add one independent variable at a time to see

if it is significant. If the variable is significant, it is kept in the model and another variable

is added. If the variable is not significant, or if a previously added variable becomes

insignificant, it is not included in the model. This process continues until no additional

variables are significant.

Stepwise regression using Stata

To perform a stepwise regression in Stata, use the following commands:

Forward:

stepwise, pe(#): reg dep_var indep_vars

stepwise, pe(#) lockterm: reg dep_var (forced in

variables) other indep_vars

Backward:

stepwise, pr(#): reg dep_var indep_vars



stepwise, pr(#) lockterm: reg dep_var (forced in

variables) other indep_vars

where the "#" in "pr(#)" is the significance level at which variables are removed, e.g., 0.051, and the "#" in "pe(#)" is the significance level at which variables are entered or added to the model. If pr(#1) and pe(#2) are both included in a stepwise regression command, #1 must be greater than #2. Also, "dep_var" represents the dependent variable, the "forced in variables" are the independent variables which the user wishes to remain in the model no matter what their significance level may be, and "other indep_vars" represents the other independent variables which the stepwise regression will consider including or excluding. Forward and backward stepwise regression may yield different results.

G. Forecasting

Let yt = F(Xt, β) + εt

denote the stochastic relationship between the variable yt and the vector of variables Xt

where Xt = (xt1,..., xtk). β represents a vector of unknown parameters.

Forecasts are generally made by estimating the vector of parameters β (β̂), determining the appropriate vector Xt (X̂t) and then evaluating

ŷt = F(X̂t, β̂).

The forecast error is FE = yt - ŷt.

There are at least four factors which contribute to forecast error.



1. Incorrect functional form (This is an example of specification error and will be

discussed later.)

[Figure: Yt plotted against Xt, illustrating a misspecified functional form]

2. Existence of random disturbance (εt)

Even if the "appropriate" future value of Xt and true parameter values, β,

were known with certainty

FE = yt - ŷt = yt - F(Xt,β) = εt

σ²_FE = Variance(FE) = Var(εt) = σ².

In this case confidence intervals for yt would be obtained from

Pr[F(Xt, β) - t(α/2) σ < yt < F(Xt, β) + t(α/2) σ] = 1 - α

which could be visualized as follows for the linear case:

[Figure: Yt plotted against Xt with confidence bands around the regression line]

3. Uncertainty about β

Assume F(Xt, β) = Xtβ in the model yt = F(Xt, β) + εt, then the predicted

value of yt for a given value of Xt is given by

ŷt = Xtβ̂,

and the variance of ŷt (sample regression line), σ²_ŷt, is given by

σ²_ŷt = Xt Var(β̂) Xt',

with the variance of the forecast error (actual y) given by

σ²_FE = σ² + σ²_ŷt.

Note that σ²_FE takes account of the uncertainty associated with the unknown regression line and the error term and can be used to construct confidence intervals for the actual value of Y rather than just the regression line. Unbiased sample estimators of σ²_ŷt and σ²_FE can be easily obtained by replacing σ² with its unbiased estimator s².

Confidence intervals for E(Yt | Xt), the population regression line:

Pr[Xtβ̂ - t(α/2) s_ŷt < E(Yt | Xt) < Xtβ̂ + t(α/2) s_ŷt] = 1 - α

Confidence intervals for Yt:

Pr[Xtβ̂ - t(α/2) s_FE < Yt < Xtβ̂ + t(α/2) s_FE] = 1 - α
[Figure: confidence bands for the population regression line and for actual values of Yt]

4. A comparison of confidence intervals.

Some students have found that the following table facilitates their understanding of the different confidence intervals for the population regression line and actual value of Y. The column for the estimated coefficients is only included to compare the organizational parallels between the different confidence intervals.

Statistic:
  β̂ = (X'X)⁻¹X'Y
  Ŷt = Xtβ̂ = sample regression line = predicted Y value corresponding to Xt
  FE = Yt - Ŷt = Yt - Xtβ̂ (forecast error)

Distribution:
  β̂ ~ N(β, σ²(X'X)⁻¹)
  Ŷt ~ N(Xtβ, σ²_Ŷt = σ²Xt(X'X)⁻¹Xt')
  FE ~ N(0, σ²_FE = σ² + σ²_Ŷt)

t-statistic:
  1 - α = Pr[-t(α/2) < (β̂i - βi)/s_β̂i < t(α/2)]
  1 - α = Pr[-t(α/2) < (Ŷt - Xtβ)/s_Ŷt < t(α/2)]
  1 - α = Pr[-t(α/2) < (FE - 0)/s_FE < t(α/2)]

Confidence intervals:
  βi:   (β̂i - t(α/2) s_β̂i, β̂i + t(α/2) s_β̂i)
  Xtβ:  (Xtβ̂ - t(α/2) s_Ŷt, Xtβ̂ + t(α/2) s_Ŷt)
  Yt:   (Xtβ̂ - t(α/2) s_FE, Xtβ̂ + t(α/2) s_FE)

where s_Ŷt is used to compute confidence intervals for the regression line (E(Yt | Xt)) and s_FE is used in the calculation of confidence intervals for the actual value of Y. Recall that s²_FE = s² + s²_Ŷt; hence, s²_FE > s²_Ŷt and the confidence intervals for Y are larger than for the population regression line.

5. Uncertainty about X. In many situations the value of the independent variable also

needs to be predicted along with the value of y. Not surprisingly, a “poor” estimate of

Xt will likely result in a poor forecast for y. This can be represented graphically as

follows:

[Figure: forecast of Yt based on a predicted value X̂t of Xt]

6. Hold out samples and a predictive test.

One way to explore the predictive ability of a model is to estimate the model on a

subset of the data and then use the estimated model to predict known outcomes which

are not used in the initial estimation.

7. Example: ŷt = 10 + 2.5Gt + 6Mt = β̂1 + β̂2Gt + β̂3Mt

where yt, Gt, Mt denote GDP, government expenditure, and money supply.

Assume that

s^2(X'X)^{-1} = \begin{bmatrix} 10 & 5 & 2 \\ 5 & 20 & 3 \\ 2 & 3 & 15 \end{bmatrix} \times 10^{-3}, \qquad s^2 = 10.

a. Calculate an estimate of GDP (y) which corresponds to Gt = 100, Mt = 200, i.e., Xt = (1, 100, 200).

ŷt = Xtβ̂ = (1, 100, 200) \begin{bmatrix} 10 \\ 2.5 \\ 6 \end{bmatrix} = 10 + 250 + 1200 = 1460.

b. Evaluate s²_ŷt and s²_FE corresponding to the Xt in question (a).

s^2_{\hat{y}_t} = X_t \left( s^2(X'X)^{-1} \right) X_t' = (1, 100, 200) \begin{bmatrix} 10 & 5 & 2 \\ 5 & 20 & 3 \\ 2 & 3 & 15 \end{bmatrix} \begin{bmatrix} 1 \\ 100 \\ 200 \end{bmatrix} \times 10^{-3} = 921.81

s_ŷt = 30.36

s²_FE = s² + s²_ŷt = 10 + 921.81 = 931.81

s_FE = 30.53
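The arithmetic in this worked example can be reproduced with a few lines of numpy (the numbers come straight from the example; only the code itself is my addition):

```python
import numpy as np

# s^2 (X'X)^{-1} from the example, stated in units of 10^-3, and s^2 = 10.
V = np.array([[10, 5, 2],
              [5, 20, 3],
              [2, 3, 15]]) * 1e-3
s2 = 10.0
x_t = np.array([1.0, 100.0, 200.0])   # X_t = (1, G_t, M_t)
b = np.array([10.0, 2.5, 6.0])        # estimated coefficients

y_hat = x_t @ b        # 1460.0: predicted GDP
s2_yhat = x_t @ V @ x_t   # 921.81: variance of the regression line at X_t
s2_fe = s2 + s2_yhat      # 931.81: variance of the forecast error

print(y_hat, s2_yhat, s2_fe)
```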

8. Forecasting—basic Stata commands

a) The data file should include values for the explanatory variables

corresponding to the desired forecast period, say in observations n1 + 1 to n2.

b) Estimate the model using least squares

reg Y X1 . . . XK, [options]

c) Use the predict command, picking the names you want for the predictions, in this case yhat, e, sfe, and syhat:

predict yhat, xb        this option predicts Ŷ
predict e, resid        this option predicts the residuals (e)
predict sfe, stdf       this option predicts the standard error of the forecast (s_FE)
predict syhat, stdp     this option predicts the standard error of the prediction (s_Ŷ)
list y yhat sfe         this command lists the indicated variables

These commands result in the calculation and reporting of Y, Ŷ, e, s_FE and s_Ŷ for observations 1 through n2. The predictions will show up in the Data Editor of Stata under the variable names you picked (in this case, yhat, e, sfe and syhat).

You may want to restrict the calculations to t= n1 + 1, .. , n2 by using

predict yhat if(_n> n1), xb

where “n1” is the numerical value of n1.

d) The variance of the predicted value can be calculated as follows:

s²_ŷt = s²_FE - s²

H. PROBLEM SETS: MULTIVARIATE REGRESSION

Problem Set 3.1

Theory

OBJECTIVE: The objective of problems 1 & 2 is to demonstrate that the matrix equations and
summation equations for the estimators and variances of the estimators are equivalent.
Remember \sum_{t=1}^{N} X_t = N\bar{X}, and don't get discouraged!!

1. BACKGROUND: Consider the model (1) Yt = β1 + β2 Xt+ εt (t = 1, . . ., N) or


equivalently,

(1)'  \begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix} = \begin{bmatrix} 1 & X_1 \\ 1 & X_2 \\ \vdots & \vdots \\ 1 & X_n \end{bmatrix} \begin{bmatrix} \beta_1 \\ \beta_2 \end{bmatrix} + \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix}

(1)''  Y = Xβ + ε

The least squares estimator of \begin{bmatrix} \beta_1 \\ \beta_2 \end{bmatrix} is β̂ = (X'X)⁻¹X'Y.

If (A.1) - (A.5) (see class notes) are satisfied, then

Var(\hat\beta) = \begin{bmatrix} Var(\hat\beta_1) & Cov(\hat\beta_1, \hat\beta_2) \\ Cov(\hat\beta_2, \hat\beta_1) & Var(\hat\beta_2) \end{bmatrix} = \sigma^2 (X'X)^{-1}

QUESTIONS: Verify the following:


*Hint: It might be helpful to work backwards on parts c and e.

               N      NX̄                       NȲ
a.  X'X =                      and   X'Y =
               NX̄     ΣXt²                     ΣXtYt

b.  β̂2 = (ΣXtYt − NX̄Ȳ) / (ΣXt² − NX̄²)

c.  β̂1 = Ȳ − β̂2X̄

d.  Var(β̂2) = σ² / (ΣXt² − NX̄²)

e.  Var(β̂1) = σ² [ 1/n + X̄² / (ΣXt² − NX̄²) ]
            = Var(Ȳ) + X̄² Var(β̂2)

f.  Cov(β̂1, β̂2) = −X̄ Var(β̂2)
(JM II’-A, JM Stats)
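A quick numerical check of the matrix/summation equivalence in problem 1 (an illustrative sketch with simulated data, not part of the assignment; all names are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 25
x = rng.normal(10, 3, N)
y = 2.0 + 0.5 * x + rng.normal(0, 1, N)

# Matrix form: beta-hat = (X'X)^-1 X'Y
X = np.column_stack([np.ones(N), x])
b_matrix = np.linalg.solve(X.T @ X, X.T @ y)

# Summation form (parts b and c)
b2 = (np.sum(x * y) - N * x.mean() * y.mean()) / (np.sum(x**2) - N * x.mean()**2)
b1 = y.mean() - b2 * x.mean()

print(np.allclose(b_matrix, [b1, b2]))  # True: the two forms agree
```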

2. Consider the model: Yt = Xt + t

a. Show that this model is equivalent to Y = Xβ + ε

              Y1             X1             ε1
              Y2             X2             ε2
where  Y =    .    ,   X =   .    ,   ε =   .
              Yn             Xn             εn

b. Using the matrices in 2(a), evaluate (X X )-1 X Y and compare your answer with
the results obtained in question 4 in Problem Set 2.1.

c. Using the matrices in 2(a), evaluate σ²(X'X)⁻¹.
(JM II’-A)

Applied

3. Use the data in HPRICE1.RAW to estimate the model

price = β0 + β1sqrft + β2bdrms + u


where price is the house price measured in thousands of dollars, sqrft is
the floorspace measured in square feet, and bdrms is the number of bedrooms.

a. Write out the results in equation form.



b. What is the estimated increase in price for a house with one more bedroom, holding
square footage constant?
c. What is the estimated increase in price for a house with an additional bedroom that is 140
square feet in size? Compare this to your answer in part (b).
d. What percentage of the variation in price is explained by square footage and number of
bedrooms?
e. The first house in the sample has sqrft = 2,438 and bdrms = 4. Find the predicted selling
price for this house from the OLS regression line.
f. The actual selling price of the first house in the sample was $300,000 (so price = 300).
Find the residual for this house. Does it suggest that the buyer underpaid or overpaid for
the house?

Problem Set 3.2

Theory

1. R2, Adjusted R2( R 2 ), F Statistic, and LR

The R2 (coefficient of determination) is defined by

R² = SSR/SST = 1 − SSE/SST

where SSE = Σet², SST = Σ(Yt − Ȳ)², and SSR = Σ(Ŷt − Ȳ)².

Given that SST = SSR + SSE when using OLS,

a. Demonstrate that 0 R2 1.

b. Demonstrate that n = k implies R² = 1. (Hint: n = k implies that X is square. Be

careful! Show Y = Ŷ = Xβ̂.)

c. If an additional independent variable is included in the regression equation, will


the R2 increase, decrease, or remain unaltered? (Hint: What is the effect upon
SST, SSE?)

d. The adjusted R², R̄², is defined by R̄² = 1 − [SSE/(n−k)] / [SST/(n−1)]. Demonstrate that

    (1−k)/(n−k) ≤ R̄² ≤ R² ≤ 1,

i.e., the adjusted R² can be negative.

    (Hint: 1 − R̄² = (SSE/SST)((n−1)/(n−k)) = (1 − R²)((n−1)/(n−k)))
e. Verify that

e. Verify that

    LR = (SSE* − SSE)/σ²      if σ² is known

       = n ln(SSE*/SSE)       if σ² is unknown,

where SSE* denotes the restricted SSE.

f. For the hypothesis H0: β2 = . . . = βk = 0, verify that the corresponding LR statistic

    can be written as LR = n ln(1/(1 − R²)) = −n ln(1 − R²).

FYI: The corresponding Lagrange multiplier (LM) test statistic for this
hypothesis can be written in terms of the coefficient of determination as LM = NR².
(JM II-B)
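The identity in part f can be checked numerically: when the restricted model contains only an intercept, SSE* = SST, so n ln(SSE*/SSE) = −n ln(1 − R²). A sketch with simulated data (all names are mine):

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(size=n)

b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b
SSE = e @ e
SST = np.sum((y - y.mean())**2)   # restricted SSE under H0: all slopes = 0
R2 = 1 - SSE / SST

LR = n * np.log(SST / SSE)        # n ln(SSE*/SSE) with SSE* = SST
print(np.isclose(LR, -n * np.log(1 - R2)))  # True
```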

2. Demonstrate that

a. X'e = 0 is equivalent to the normal equations X'Xβ̂ = X'Y.

b. X'e = 0 implies that the sum of the estimated error terms will equal zero if the regression
equation includes an intercept.

Remember: e = Y − Ŷ = Y − Xβ̂.
(JM II-B)
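Both facts in problem 2 are easy to see numerically. A sketch with simulated data (the names are mine):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 40
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # includes an intercept
y = X @ np.array([2.0, 1.5]) + rng.normal(size=n)

b = np.linalg.solve(X.T @ X, X.T @ y)  # normal equations X'Xb = X'y
e = y - X @ b

print(np.allclose(X.T @ e, 0))  # True: X'e = 0
print(np.isclose(e.sum(), 0))   # True: residuals sum to zero (intercept column)
```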

Applied

3. The following model can be used to study whether campaign expenditures affect election
outcomes:

voteA = β0 + β1ln(expendA) + β2 ln(expendB) + β3 prtystrA + u

where voteA is the percent of the vote received by Candidate A, expendA and expendB are
campaign expenditures by Candidates A and B, and prtystrA is a measure of party
strength for Candidate A (the percent of the most recent presidential vote that went to A's
party).
i) What is the interpretation of β1?
ii) In terms of the parameters, state the null hypothesis that a 1% increase in A's
expenditures is offset by a 1% increase in B's expenditures.
iii) Estimate the model above using the data in VOTE1.RAW and report the results in
the usual form. Do A's expenditures affect the outcome? What about B's
expenditures? Can you use these results to test the hypothesis in part (ii)?
iv) Estimate a model that directly gives the t statistic for testing the hypothesis in part
(ii). What do you conclude? (Use a two-sided alternative.) A possible approach:
define D = β1 + β2 and test H0: D = 0. Substitute β1 = D − β2 and simplify to obtain

voteA = β0 + D ln(expendA) + β2 (ln(expendB) − ln(expendA)) + β3 prtystrA + u.

Regress voteA on ln(expendA), (ln(expendB) − ln(expendA)), and prtystrA



and test the hypothesis that the coefficient, D, of ln(expendA) is 0.

You can check your results by constructing the “high-tech” t-test or by using the
Stata command, test ln(expendA) + ln(expendB) =0 following the estimation of
the unconstrained regression model.
(Wooldridge C. 4.1)

4. Consider the data (data from Solow’s paper on economic growth)

t Output (Yt) Labor (Lt) Capital (Kt)

1 40.26 64.63 133.14


2 40.84 66.30 139.24
3 42.83 65.27 141.64
4 43.89 67.32 148.77
5 46.10 67.20 151.02
6 44.45 65.18 143.38
7 43.87 65.57 148.19
8 49.99 71.42 167.12
9 52.64 77.52 171.33
10 57.93 79.46 176.41

The Cobb Douglas Production function is defined by


(1)  Yt = e^(β1 + β2 t) Lt^β3 Kt^β4 εt

where (β2 t) takes account of changes in output for any reason other than a change in Lt or
Kt; εt denotes a random disturbance having the property that ln εt is distributed N(0, σ²).
Labor's share (total wage receipts / total sales receipts) is given by β3 if β3 + β4 (the
returns to scale) is equal to one. β2 is frequently referred to as the rate of technological
change, (dYt/dt)/Yt for fixed L and K. Taking the natural logarithm of equation (1), we obtain

(2)  ln(Yt) = β1 + β2 t + β3 ln(Lt) + β4 ln(Kt) + ln(εt).



If β3 + β4 is equal to 1, then equation (2) can be rewritten as

(3)  ln(Yt / Kt) = β1 + β2 t + β3 ln(Lt / Kt) + ln(εt).

a. Estimate equation (2) using the technique of least squares.


b. Corresponding to equation (2)
1) Test the hypothesis H0: β2 = β3 = β4 = 0. Explain the implications of this
hypothesis. (95% confidence level)
2) Perform and interpret individual tests of significance of β2, β3, and β4, i.e., test
H0: βi = 0, α = .05.
3) Test the hypothesis of constant returns to scale, i.e., H0: β3 + β4 = 1, using
a. a t-test for the general linear hypothesis, with restriction vector δ = (0, 0, 1, 1);
b. a Chow test;
c. a LR test.
c. Estimate equation (3) and test the hypothesis that labor’s share is equal to .75, i.e., β3 =
.75.
d. Re-estimate the model (equation 2) with the first nine observations and check to see if the actual
log(output) for the 10th observation lies in the 95% forecast confidence interval.
(JM II)
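For checking answers outside Stata, equation (2) and the t-test of constant returns to scale in b.3.a can be sketched in numpy using the Solow data from the table above (the variable names are mine; this is a sketch of the computation, not the assigned solution):

```python
import numpy as np

# Solow data from the table above: output (Y), labor (L), capital (K)
Y = np.array([40.26, 40.84, 42.83, 43.89, 46.10, 44.45, 43.87, 49.99, 52.64, 57.93])
L = np.array([64.63, 66.30, 65.27, 67.32, 67.20, 65.18, 65.57, 71.42, 77.52, 79.46])
K = np.array([133.14, 139.24, 141.64, 148.77, 151.02, 143.38, 148.19, 167.12, 171.33, 176.41])
t = np.arange(1, 11)

# Equation (2): ln Y = b1 + b2 t + b3 ln L + b4 ln K + ln(eps)
X = np.column_stack([np.ones(10), t, np.log(L), np.log(K)])
y = np.log(Y)
b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b
s2 = e @ e / (10 - 4)                      # s^2 = SSE/(n-k)

# t statistic for H0: b3 + b4 = 1 (constant returns to scale), delta = (0,0,1,1)
d = np.array([0.0, 0.0, 1.0, 1.0])
V = s2 * np.linalg.inv(X.T @ X)            # estimated Var(b-hat)
t_stat = (d @ b - 1.0) / np.sqrt(d @ V @ d)
print(b, t_stat)
```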

5. The translog production function corresponding to the previous problem is given by

ln(Y) = β1 + β2 t + β3 ln(L) + β4 ln(K) + β5 (ln(L))² + β6 (ln(K))² + β7 ln(L) ln(K) + ln(εt)


a. What restrictions on the translog production function result in a Cobb-Douglas
production function?
b. Estimate the translog production function using the data in problem 4 and use the Chow and
LR tests to determine whether it provides a statistically significant improved fit to the data,
relative to the Cobb-Douglas function.
(JM II)

6. The transcendental production function corresponding to the data in problem 4 is defined by

    Y = e^(β1 + β2 t + β3 L + β4 K) L^β5 K^β6

a. What restrictions on the transcendental production function result in a Cobb-Douglas
production function?
b. Estimate the transcendental production function using the data in problem 4 and use the Chow
and LR tests to compare it with the Cobb-Douglas production function.
(JM II)

APPENDIX A
Some important derivatives:

          x1           a1            a11  a12
Let X =        ,  a =       ,   A =
          x2           a2            a21  a22

where A is symmetric (a12 = a21 = a).

1.  d(a'X)/dX = d(X'a)/dX = a

2.  d(X'AX)/dX = 2AX

Proof that d(X'a)/dX = a:

Note: a'X = X'a = a1x1 + a2x2

                 ∂(X'a)/∂x1        a1
d(X'a)/dX  =                  =         =  a
                 ∂(X'a)/∂x2        a2

Proof that d(X'AX)/dX = 2AX:

Note: X'AX = a11x1² + (a12 + a21)x1x2 + a22x2²

                  ∂(X'AX)/∂x1        2a11x1 + 2ax2
d(X'AX)/dX  =                   =
                  ∂(X'AX)/∂x2        2ax1 + 2a22x2

                  a11x1 + ax2
             = 2
                  ax1 + a22x2

                  a11  a      x1
             = 2
                  a    a22    x2

             = 2AX.
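Both derivative identities can be confirmed numerically with central finite differences; a small sketch (the helper function and the particular a, A, x values are mine):

```python
import numpy as np

# Check d(X'a)/dX = a and d(X'AX)/dX = 2AX by central finite differences.
a = np.array([1.5, -2.0])
A = np.array([[3.0, 0.5], [0.5, 2.0]])  # symmetric, as assumed in the appendix
x = np.array([0.7, -1.2])
h = 1e-6

def grad(f, x):
    """Numerical gradient of scalar function f at x."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        d = np.zeros_like(x)
        d[i] = h
        g[i] = (f(x + d) - f(x - d)) / (2 * h)
    return g

g1 = grad(lambda z: a @ z, x)        # gradient of a'X
g2 = grad(lambda z: z @ A @ z, x)    # gradient of X'AX
print(np.allclose(g1, a, atol=1e-4), np.allclose(g2, 2 * A @ x, atol=1e-4))
```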

APPENDIX B

An unbiased estimator of ζ2 is given by

s² = (1/(n−k)) y'(I − X(X'X)⁻¹X')y = SSE/(n−k).
n- k

Proof: To show this, we need some results on traces:

tr(A) = Σ aii   (the sum of the diagonal elements of A)

1) tr(I) = n

2) If A is idempotent, tr(A) = rank of A

3) tr(A+B) = tr(A) + tr(B)

4) tr(AB) = tr(BA) if both AB and BA are defined

5) tr(ABC) = tr(CAB)

6) tr(kA) = k tr(A)

Now, remember that

σ̂² = (1/n) e'e    and    s² = (1/(n−k)) e'e

e = y − Xβ̂ = y − X(X'X)⁻¹X'y = My
  = M(Xβ + ε) = MXβ + Mε
  = Mε   (since MX = 0),
where M = I − X(X'X)⁻¹X'.

Note that M is symmetric, and idempotent (problem set R.2).


So  σ̂² = (1/n) e'e = (1/n) ε'M'Mε

        = (1/n) ε'MMε

        = (1/n) ε'Mε

and s² = (1/(n−k)) ε'Mε.
n-k

E(σ̂²) = (1/n) E(ε'Mε) = (1/n) E(tr(ε'Mε))     (a scalar equals its trace)

       = (1/n) E(tr(Mεε')) = (1/n) tr(M E(εε'))

       = (1/n) tr(M σ²I)     (E(εε') = σ²I since cov(εi, εj) = 0 for i ≠ j)

       = (1/n) tr(σ²M)

       = (σ²/n) tr(M)

       = (σ²/n) tr(I − X(X'X)⁻¹X')

       = (σ²/n) (n − tr(X(X'X)⁻¹X'))

       = (σ²/n) (n − tr(X'X(X'X)⁻¹))

       = (σ²/n) (n − tr(Ik))

       = (σ²/n) (n − k)

       = ((n−k)/n) σ².

Therefore σ̂² is biased, but E(s²) = (n/(n−k)) E(σ̂²) = σ², so s² is unbiased.
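The bias result lends itself to a quick Monte Carlo illustration: averaging σ̂² over many simulated samples should land near ((n−k)/n)σ², while s² should land near σ². A sketch (design, sample size, and names are mine):

```python
import numpy as np

rng = np.random.default_rng(3)
n, k, sigma2 = 20, 3, 4.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
beta = np.array([1.0, 2.0, -1.0])

sig_hat2, s2 = [], []
for _ in range(20000):
    y = X @ beta + rng.normal(scale=np.sqrt(sigma2), size=n)
    e = y - X @ np.linalg.solve(X.T @ X, X.T @ y)
    sig_hat2.append(e @ e / n)        # biased: E = ((n-k)/n) sigma^2
    s2.append(e @ e / (n - k))        # unbiased: E = sigma^2

print(np.mean(sig_hat2), (n - k) / n * sigma2)  # the two should be close
print(np.mean(s2), sigma2)                      # the two should be close
```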
n-k

APPENDIX C

β̂ = AY = (X'X)⁻¹X'Y is BLUE.

Proof: Let β̂i = AiY where Ai denotes the ith row of the matrix A. Since the result will be
symmetric for each β̂i (hence, for each Ai), denote Ai by a' where a is an (n x 1) vector.

The problem then becomes:

Min a'Ia           where I is n x n

s.t. AX = I        where X is n x k (for unbiasedness)

or  min a'Ia

s.t. X'a = i,      where i is the ith column of the identity matrix.

Let ℒ = a'Ia + λ'(X'a − i) be the associated Lagrangian function, where λ is k x 1.

The necessary conditions for a solution are:

∂ℒ/∂a = 2Ia + Xλ = 0

∂ℒ/∂λ = X'a − i = 0.

This implies

a = (−1/2)Xλ.

Now substitute a = (−1/2)Xλ into the expression for ∂ℒ/∂λ = 0 and we obtain

(−1/2)X'Xλ = i
λ = −2(X'X)⁻¹i
a = (−1/2)(−2)X(X'X)⁻¹i = X(X'X)⁻¹i,

so a' = i'(X'X)⁻¹X' = Ai,

which implies

A = (X'X)⁻¹X';

hence, β̂ = (X'X)⁻¹X'y.
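The Gauss-Markov logic above can be illustrated numerically: any other linear unbiased estimator has rows A + C with CX = 0, and its row norms a'a (which are proportional to the coefficient variances) are at least as large as those of the OLS rows. A sketch with simulated X (all names are mine):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=n)])
A_ols = np.linalg.inv(X.T @ X) @ X.T                 # OLS: A = (X'X)^-1 X'

# Another linear unbiased estimator: A_alt = A_ols + C with CX = 0
C = rng.normal(size=(2, n)) * 0.1
C = C - C @ X @ np.linalg.inv(X.T @ X) @ X.T         # project so that C X = 0
A_alt = A_ols + C

print(np.allclose(A_ols @ X, np.eye(2)))             # True: unbiased
print(np.allclose(A_alt @ X, np.eye(2)))             # True: also unbiased
# Each Var(beta-hat_i) is sigma^2 a'a; the OLS rows are never larger:
print(np.all(np.sum(A_ols**2, axis=1) <= np.sum(A_alt**2, axis=1) + 1e-12))
```

The cross term between the OLS rows and C vanishes because A_ols C' = (X'X)⁻¹(CX)' = 0, so each alternative row norm equals the OLS row norm plus a nonnegative piece: exactly the minimization the Lagrangian solves.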