
Introduction to Econometrics

Coordinator: prof. dr. Rob Alessie


email: r.j.m.alessie@rug.nl
office: DUI 727 tel: 050-3637240
February 2, 2016

Course outline
Course materials
Hayashi, F., Econometrics, Princeton University Press, 2000
Econometric software package STATA
Lecture sheets (available on Nestor)
Lecturers
1. prof. dr. Rob Alessie; email: r.j.m.alessie@rug.nl (weeks 1-4)
2. dr. Jochem de Bresser; email j.de.bresser@rug.nl (weeks 5-7)
3. Bert ter Bogt (homework); email: t.e.ter.bogt@student.rug.nl
Format of the course
A combination of lectures and tutorials (we will discuss the assignments).

Assessment
Weekly assignments (20%). Assignments will have both applied and
theoretical questions.
Important: A group of exactly 2 students can collaborate and can hand
in one set of answers!
Please note that you will have to enroll in a homework group consisting
of exactly two students. This can be done on Nestor, see below Group
enroll. The deadline for registering is February 10.
Assignments have to be typed!!!! Hand-written answers are not accepted.
Email submissions are not allowed; please hand in a hard copy.
Students who
1. followed the course Introduction to Econometrics last year or before
and
2. already made the weekly assignments,
don't need to hand in the assignments again. These students will retain
the old grade for the assignments.
individual exam (80 %);
Supplementary reading:
Verbeek, M., A guide to modern econometrics: Fourth edition, 2012.
Wooldridge, J.M., Introduction to econometrics, (4th edn.), 2006.

Prerequisite knowledge
Matrix algebra, e.g. matrix multiplication, rank of a matrix, inverse of
a matrix, determinant, trace, positive (semi-)definite matrices, projection
matrix etc. (cf. handouts Paul Bekker available on Nestor).
Calculus: functions of several variables, partial derivatives, first principles
of optimization, integrals.
Probability theory and statistics, e.g. random variables, joint distributions, conditional distribution, independence, conditional and unconditional expectations, (multivariate) normal distribution, χ2 distribution etc.
(cf. course Linear models in Statistics).
Students with insufficient knowledge are expected to study the required
material themselves.

Global overview of the lectures (tutorials)


1. The classical linear regression model
2. The algebra of Ordinary Least Squares (OLS)
3. Finite-sample properties of OLS
4. Hypothesis testing under normality
5. Interpretation of regression results
6. Generalized least squares
7. Large sample theory
8. Endogeneity bias (violation of the exogeneity assumption)
9. Instrumental variables
10. (Single equation) General Methods of Moments (GMM)
11. Panel data

See Nestor for details (cf. excel file course_schedule_2016_r.xlsx)

Lecture 1
Hayashi, chapter 1: section 1.1
Organisation of lecture 1
1. What is Econometrics?
2. The classical linear regression model (cf. section 1.1 Hayashi)
The linearity assumption
Matrix notation
The strict exogeneity assumption
Implications of strict exogeneity
Other assumptions

What is econometrics?
Econometrics is the development of statistical methods for
estimating economic relations
testing economic theories
evaluating economic and business policy (proposals)
Examples
Forecasting of the interest rate, inflation rate and GDP
What is the causal effect of schooling on wages?
What is the effect of colonial institutions on economic growth?
Econometrics focuses on problems inherent in analyzing non-experimental
economic data.
Steps in empirical economic analysis
1. Formulation of a research question
2. Construction of an economic model
3. Construction and estimation of an econometric model
Structure of economic data
Cross-sectional data (obtained from a random sample)
Time series data (more difficult to analyze because one cannot assume
that the data are obtained from a random sample).
Panel data
In this course we mainly focus on the analysis of cross-section data
Causality and the notion of ceteris paribus

The classical linear regression model


This model posits a linear relationship between a dependent variable y
on the one hand and some explanatory variables on the other.
Suppose we observe n values for those variables
Let
- yi :=i-th observation of the dependent variable, i = 1, . . . , n;
- (xi1 , . . . , xiK ) be the i-th observation of the K explanatory variables
The sample or data is a collection of those n observations.
The data in economics cannot be generated by experiments (except in experimental economics). So both the dependent and independent variables
have to be treated as random variables, variables whose values are subject
to chance.
Model: set of restrictions on the joint distribution of the dependent and
explanatory variables.
The classical regression model is a set of joint distributions satisfying the
4 following assumptions:
1. Linearity
2. Strict exogeneity
3. No perfect multicollinearity
4. Spherical error variance

2.1

The linearity assumption

Assumption 1.1: the relationship between the dependent variable and
the regressors is linear:
yi = β1 xi1 + β2 xi2 + . . . + βK xiK + εi ,    i = 1, . . . , n    (1)

where
yi := dependent variable; other terms: regressand or left hand side
variable
(xi1 , . . . , xiK ) := K explanatory variables. Other terms: a) regressors,
b) right hand side variables, c) independent variables.
β1 , . . . , βK := regression coefficients.
εi := disturbance term or error term with certain properties to be
specified below.
Remarks
The part of the right-hand side involving the regressors,
β1 xi1 + β2 xi2 + . . . + βK xiK ,
is called the regression or the regression function.
The disturbance term εi captures all influences on yi left unexplained by
the regressors.
In almost all cases, a regression model includes a constant term, i.e. xi1 = 1
for all i. In that case model (1) becomes
yi = β1 + β2 xi2 + . . . + βK xiK + εi ,    i = 1, . . . , n    (2)
where β1 is called the intercept (constant term). The other regression
coefficients are called slope coefficients.

Slope coefficients represent the marginal effect of the regressors. For
example,
β2 = ∂yi / ∂xi2
i.e. β2 represents the change in the dependent variable when the second
regressor increases by one unit while other regressors are held constant
(ceteris paribus). The linearity assumption implies that the marginal effect
is constant and does not depend on the level of the regressors.
The sample (data) should be used to estimate the unknown regression
coefficients.
In economics, three types of data are analyzed
Cross-section data originating from a random sample
Time series data (e.g. macro-economic data from National Accounts,
financial data). In case of time series data the assumption of a random
sample cannot be made.
Panel data (longitudinal data) where cross-sectional units (e.g. individuals) are followed over time
In the course Introduction to Econometrics
analysis of cross-section data
assumption of a random sample of size n, i.e.
{(x1i , . . . , xKi , yi ) : i = 1, . . . , n}
are independent, identically distributed (i.i.d.) random
variables across observations.
In many (older) textbooks: the assumption that xi is fixed. Only in controlled
experiments is the assumption of fixed regressors reasonable.

Example 1 (Engel curve for food)


Engel curve: relationship between expenditures on food, foodi , and disposable income ydi
Linear Engel curve:
foodi = β1 + β2 ydi + εi    (3)

Suppose that we have a budget survey (random sample) available which
can be used to estimate the unknown regression coefficients.
The linear Engel function can be written as (2) where K = 2, yi = foodi
and xi2 = ydi .
When the equation has only one nonconstant regressor, as in equation (3),
it is called the simple regression model.
The error term represents other variables besides ydi which influence consumption. Examples:

Financial assets and housing equity (increases of the house value);


Future (expected) income (think of the lifecycle hypothesis);
Demographic composition of the household;
Price of food and other consumption goods;
Other variables reflecting tastes (preferences) of the consumer, such as
risk attitude;

If ydi is an exogenous variable (to be defined below), the slope coefficient
β2 represents the Marginal Propensity to Consume: β2 = ∂foodi / ∂ydi .
Alternative specification: log-log specification
ln(foodi ) = β1 + β2 ln(ydi ) + εi    (4)
Notice that model (4) satisfies assumption 1.1 (linearity).
In this case β2 can be interpreted as an income elasticity:
∂ln(foodi ) / ∂ln(ydi ) = β2

In empirical studies, one typically finds that 0 < β2 < 1 (food is a necessary good).
In equation (4), the income elasticity is the same for everybody (this might
be a restrictive assumption).
Alternative specification:
ln(foodi ) = β1 + β2 ln(ydi ) + β3 ln²(ydi ) + εi    (5)
In this case, the income elasticity is equal to:
∂ln(foodi ) / ∂ln(ydi ) = β2 + 2β3 ln(ydi )
In empirical studies, one typically finds that β3 < 0: food is a more necessary good (lower income elasticity) for high income households than for
low income ones.

Example 1.2 (wage equation)


In labor economics, the wage equation is typically specified in a semi-log
form
ln(wagei ) = β1 + β2 Si + β3 tenurei + β4 experi + β5 experi² + εi    (6)
where wagei := hourly wage rate; Si := years of schooling; tenurei := no. of
years on the current job; experi := experience in the labor force.
Interpretation of the regression coefficients in a semi-log model:
∂ln(wagei ) / ∂Si = (∂wagei /wagei ) / ∂Si = β2
Suppose that β2 = 0.05.
Interpretation of β2 : a one year increase in Si leads to a 5 percent (β2 *100)
increase in the hourly wage rate keeping other factors constant (ceteris
paribus).
Equation (6) takes into account that the returns to experience,
∂ln(wagei ) / ∂experi = β4 + 2β5 experi ,
may decrease with experi : β5 < 0.
Again εi captures the impact of variables not included in the regression,
for instance innate ability or motivation.
Although the relation between wagei and experi is not linear, equation (6)
is still linear in the parameters, i.e. it satisfies assumption 1.1.
A model in which a parameter enters nonlinearly, for instance as an exponent of a regressor,
ln(wagei ) = β1 + β2 Si + β3 tenurei + β4 experi^β5 + εi ,    (7)
is not linear in the parameters.

2.2

Matrix notation

Let xi = (xi1 , . . . , xiK )′ and β = (β1 , . . . , βK )′ be (K × 1) vectors; then
linearity assumption 1.1 (model (1)) can be rewritten as follows:
yi = xi′β + εi ,    i = 1, . . . , n    (8)
because by the definition of the vector inner product:
xi′β = β1 xi1 + . . . + βK xiK
Stacking the n observations gives:
y = Xβ + ε    (9)
where y = (y1 , ..., yn )′ and ε = (ε1 , ..., εn )′ are (n × 1) vectors and X is an (n × K) matrix defined as follows:

      ( x1′ )   ( x11 . . . x1K )
      ( x2′ )   ( x21 . . . x2K )
  X = (  .  ) = (  .         .  )
      (  .  )   (  .         .  )
      ( xn′ )   ( xn1 . . . xnK )

2.3

The strict exogeneity assumption

Assumption 1.2 (strict exogeneity):
E(εi |X) = 0,    i = 1, . . . , n    (10)

Here, the expectation is conditional on the regressors for all observations.
This becomes more clear if we rewrite strict exogeneity as follows:
E(εi |x1 , . . . , xn ) = 0,    i = 1, . . . , n    (11)

To state the assumption differently, take, for any given observation i, the
joint distribution of the nK + 1 random variables, f (εi , x1 , . . . , xn ), and
consider the conditional distribution, f (εi |x1 , . . . , xn ).
The conditional mean E(εi |x1 , . . . , xn ) = g(x1 , . . . , xn ), where g(.) is in general a nonlinear function of x1 , . . . , xn .
The strict exogeneity assumption: E(εi |x1 , . . . , xn ) = 0.
Assuming this conditional mean to be zero rather than some constant μ is not restrictive if the regressors include
a constant.
To see this, suppose that E(εi |X) = μ and xi1 = 1. Then equation (1) can
be rewritten as:
yi = β1 + β2 xi2 + . . . + βK xiK + εi
   = (β1 + μ) + β2 xi2 + . . . + βK xiK + (εi − μ)
   = β1c + β2 xi2 + . . . + βK xiK + εic    (12)
Notice that E(εic |X) = 0.
Instead of assuming strict exogeneity, one could assume that εi and X are
independent of each other. The assumption of independence says that
f (εi |X) = f (εi )    (13)
Suppose that E(εi ) = 0. Then stochastic independence implies strict exogeneity (εi is conditional mean independent of X). This can be seen
as follows:
E(εi |X) = ∫ ε f (ε|X) dε = ∫ ε f (ε) dε = E(εi ) = 0    (14)

2.4

Implications of strict exogeneity

Fact 1: Law of total expectations:


E[E(y|x)] = E(y)
Fact 2: Two random variables x and y are orthogonal to each other if
E(xy) = 0
Fact 3: Law of iterated expectations:
E[E(y|x, z)|x] = E(y|x)
Fact 4: Linearity of conditional expectations:
E(f (x)y|x) = f (x)E(y|x)
Fact 5: The covariance of two random variables x and y:
Cov(x, y) = E((x − E(x))(y − E(y))) = E(xy) − E(x)E(y)    (15)

Fact 6: Consider the covariance between x and y. Suppose that E(y) = 0.
Then equation (15) implies that
Cov(x, y) = E(xy)    (16)

Fact 7: Correlation coefficient between two random variables:
ρ(x, y) = Cov(x, y) / (sd(x) sd(y))
Fact 8: Two random variables x and y are uncorrelated with each other if
ρ(x, y) = 0, or equivalently Cov(x, y) = 0


The strict exogeneity assumption, E(εi |X) = 0, has several implications:

1. The unconditional mean of the error term is zero, i.e. E(εi ) = 0.
Proof: According to the law of total expectations:
E(εi ) = E(E(εi |X)) = E(0) = 0    (17)
q.e.d.
2. Under strict exogeneity, the regressors are orthogonal to the error term
for all observations, i.e.,
E(xjk εi ) = 0,    i, j = 1, . . . , n; k = 1, . . . , K    (18)
Proof: Since xjk is an element of the data matrix X, we can apply
the law of iterated expectations:
E(εi |xjk ) = E[E(εi |X)|xjk ] = E[0|xjk ] = 0    (19)
This implies that
E(xjk εi ) = E[E(xjk εi |xjk )]    (by the law of total expectations)
          = E[xjk E(εi |xjk )]    (by linearity of conditional expectations)
          = E[xjk · 0] = 0    (20)
The last line of equation (20) follows from result (19). q.e.d.
3. Because the mean of the error term is zero, the orthogonality conditions
(18) are equivalent to zero-correlation conditions. This is because
Cov(εi , xjk ) = E(εi xjk ) − E(εi )E(xjk ) = E(εi xjk ) = 0    (21)
The second equality follows from E(εi ) = 0 (cf. equation (17)), the last
equality from the orthogonality conditions (18).
4. In particular, for i = j, Cov(xik , εi ) = 0.
Therefore, strict exogeneity implies that the regressors are uncorrelated
with the error term.
5. Together with the linearity assumption 1.1, the strict exogeneity assumption also implies that
E(yi |X) = E(xi′β + εi |X) = xi′β + E(εi |X) = xi′β    (22)

Example (Engel curve revisited)


Consider again the loglinear Engel function for food
ln(foodi ) = β1 + β2 ln(ydi ) + εi
Since the model includes a constant term, xi = (1, ln(ydi ))′, one could write
the strict exogeneity assumption as
E(εi |1, ln(yd1 ), 1, ln(yd2 ), . . . , 1, ln(ydn )) = 0
Since a constant provides no information and ydi embodies the same information as ln(ydi ), the conditional expectation above is the same as
E(εi |yd1 , yd2 , . . . , ydn ) = 0
Question: is it likely that the strict exogeneity assumption holds in this
case?
In the earlier example, we argued that the error term represents variables
besides disposable income which influence food consumption.
Examples of such variables:
Financial assets and housing equity (increases of the house value)
Future (expected) income (think of the lifecycle hypothesis)
Demographic composition of the household
Price of food and other consumption goods
Other variables reflecting tastes (preferences) of the consumer, such as
risk attitude
ln²(ydi )
For most of these variables it holds that they will be correlated with the
regressor ln(ydi ).
If this is true, the strict exogeneity assumption does not hold.


2.5

Strict exogeneity in time series models

For time-series models where the index i is time, the implication (18) of strict
exogeneity can be rephrased as: the regressors are orthogonal to the past,
current, and future error terms (or equivalently, the error term is orthogonal
to the past, current, and future regressors).
But for most time-series models, this condition (and a fortiori strict exogeneity) is not satisfied.
The clearest example of a failure of strict exogeneity is a model where the
regressors include the lagged dependent variable. Consider the simplest
such model:
yi = β yi−1 + εi    (23)
where we assume that yi−1 is uncorrelated with εi : E(yi−1 εi ) = 0.
Notice that
E(yi εi ) = E[(β yi−1 + εi )εi ] = β E(yi−1 εi ) + E(εi²) = E(εi²) > 0    (24)
But yi is the regressor for observation i + 1. Thus, the regressor is not
orthogonal to the past error term, which is a violation of strict exogeneity.


Example 4: models with a feedback mechanism

gGDPi = β1 + β2 ri + εi    (25)
where
gGDPi := GDP growth rate in period i
ri := interest rate
Suppose that E(ri εi ) = 0.
The treasury bill interest rate is typically set by the president of a central
bank, e.g. Janet Yellen (USA).
Behavior of Janet Yellen (feedback mechanism):
ri = γ1 + γ2 (gGDPi−1 − 3) + ηi ,    γ2 > 0    (26)
Equation (26) implies that ri+1 (the regressor in period i + 1 of equation
(25)) depends on gGDPi , which in turn depends on εi (cf. equation (25)).
Consequently, ri is NOT a strictly exogenous variable in model (25)!
We do not face feedback effects as illustrated in the example above if one
analyzes cross-section data drawn from a random sample.


2.6

Other assumptions of the classical regression model

Assumption 1.3 (no perfect multicollinearity): The column rank of
the n × K data matrix, X, is K with probability 1.
Assumption 1.4 (spherical error variance):
(homoskedasticity)
E(εi²|X) = σ² > 0,    i = 1, . . . , n    (27)
(no correlation between observations)
E(εi εj |X) = 0,    i, j = 1, . . . , n; i ≠ j    (28)

Comments on assumption 1.3

The rank of a matrix equals the number of linearly independent columns
of the matrix.
The assumption of full column rank says that none of the K columns
of the data matrix X can be expressed as a linear combination of the
other columns of X.
Since the K columns cannot be linearly independent if their dimension
is less than K, the assumption implies that n ≥ K, i.e., there must be
at least as many observations as there are regressors.
The regressors are said to be perfectly multicollinear if assumption
1.3 is not satisfied.


Wage example continued, disentangling age from cohort eects


Suppose that we have cross-section data available for the year 1996.
Consider the following model
log(wagei ) = β1 + β2 Si + β3 agei + β4 yobi + εi    (29)
where
agei = age of individual i (measured in years)
yobi = year of birth of individual i
Obviously, yobi = 1996 − agei for all i.
Since model (29) includes a constant term (xi1 = 1), the regressor yobi is a
linear combination of the constant term and the variable agei . In other
words, we face a problem of perfect multicollinearity.
Given the identity yobi = 1996 − agei , we can rewrite equation (29) as
follows:
log(wagei ) = (β1 + 1996 β4 ) + β2 Si + (β3 − β4 ) agei + εi    (30)
which shows that only the difference β3 − β4 , but not β3 and β4 separately,
can be estimated.


Comments on assumption 1.4


The homoskedasticity assumption (27) says that the conditional second moment of εi , which in general is a nonlinear function of X, is a
constant.
Thanks to strict exogeneity, this condition can be stated equivalently
in more familiar terms. Consider the conditional variance
Var(εi |X) = E(εi²|X) − [E(εi |X)]² = E(εi²|X)    (31)
where in the second equality we use the assumption of strict exogeneity:
E(εi |X) = 0
So we can restate assumption (27) as follows:
Var(εi |X) = σ² > 0,    i = 1, . . . , n    (32)

Similarly, assumption (28) is equivalent to the requirement that
Cov(εi , εj |X) = 0,    i, j = 1, . . . , n; i ≠ j    (33)
To see this, notice that by definition
Cov(εi , εj |X) = E(εi εj |X) − E(εi |X)E(εj |X)    (34)
Due to the assumption of strict exogeneity, the second term on the rhs
of equation (34) vanishes.
In the context of time-series models, (28) states that there is no serial
correlation in the error term.
Since the (i, j) element of the n × n matrix εε′ is εi εj , Assumption 1.4
can be written compactly as
E(εε′|X) = σ² In    (35)
where In denotes the identity matrix of order n.
Assumption 1.4 is sometimes called the spherical error variance assumption because the n × n matrix of second moments (which are also
variances and covariances) is proportional to the identity matrix In .


2.7

The classical regression model for random samples

As I said before, in this course we typically make the assumption that we
work with random samples.
The sample (y, X) is a random sample if {yi , xi } is i.i.d. (independently
and identically distributed) across observations.
Since
a) by the linearity assumption 1.1 (cf. equation (1)), εi is a function of
(yi , xi ) and
b) (yi , xi ) is independent of (yj , xj ) for i ≠ j,
(εi , xi ) is independent of xj for i ≠ j. So
E(εi |X) = E(εi |x1 , . . . , xi−1 , xi , xi+1 , . . . , xn ) = E(εi |xi )
E(εi²|X) = E(εi²|xi )    (36)
and E(εi εj |X) = E(εi |xi )E(εj |xj )    (37)
Therefore assumptions 1.2 (strict exogeneity) and 1.4 reduce to
Assumption 1.2: E(εi |xi ) = 0,    i = 1, . . . , n    (38)
Assumption 1.4: E(εi²|xi ) = σ²,    i = 1, . . . , n    (39)

2.8

Fixed regressors

Up to now, we have treated the regressors as random. This is in contrast to
the treatment in most older textbooks, where X is assumed to be fixed
or deterministic.
If X is fixed, then there is no need to distinguish between the conditional distribution of the error term, f (εi |x1 , . . . , xn ), and the marginal
distribution, f (εi ).
In other words, assumptions 1.2 and 1.4 can be written as
Assumption 1.2: E(εi ) = 0,    i = 1, . . . , n    (40)
Assumption 1.4: E(εi²) = σ²,    i = 1, . . . , n    (41)
E(εi εj ) = 0,    i ≠ j    (42)

Introduction to Econometrics
Coordinator: prof. dr. Rob Alessie
email: r.j.m.alessie@rug.nl
office: DUI 727 tel: 050-3637240
February 4, 2016

Lecture 2
Hayashi, chapter 1: section 1.2, 1.3
Organization of the lecture
1. The algebra of least squares
Minimizing the residual sum of squares
Normal equations
OLS estimator
Other concepts (fitted value, residual, R², uncentered and adjusted R²)

2. Finite sample properties of the OLS estimator b


Unbiasedness
Expression for the variance
Gauss Markov theorem
3. Finite sample properties of s2

The algebra of least squares


Consider the following population model
yi = xi′β + εi    (1)

Suppose that we have a sample (xi , yi ), i = 1, ..., n.
Question: how do we estimate the unknown parameter
vector β in equation (1)?
Answer: by means of Ordinary Least Squares.
First consider the following arbitrary linear combination:
β̃1 xi1 + β̃2 xi2 + . . . + β̃K xiK = xi′β̃    (2)
where β̃ = (β̃1 , . . . , β̃K )′ is to be chosen.
Consider the difference between yi and the linear approximation xi′β̃:
yi − xi′β̃    (3)
From these differences, form the sum of squared
residuals (SSR):
SSR(β̃) = Σi (yi − xi′β̃)² = (y − Xβ̃)′(y − Xβ̃)    (4)
This sum is also called the residual sum of squares
(RSS).
Approach of Ordinary Least Squares (OLS): the
OLS estimate, b, of β is the β̃ that minimizes function
(4) (see figure 1):
b ≡ argmin_β̃ SSR(β̃)    (5)

1.1

Normal equations

To solve minimization problem (5), we derive the
first order conditions (F.O.C.) by setting the partial
derivatives equal to zero.
Rewrite (4) as follows:
SSR(β̃) = (y − Xβ̃)′(y − Xβ̃)
        = y′y − β̃′X′y − y′Xβ̃ + β̃′X′Xβ̃
        = y′y − 2y′Xβ̃ + β̃′X′Xβ̃
        ≡ y′y − 2a′β̃ + β̃′Aβ̃    (6)
where a = X′y and A = X′X.
Recalling from matrix algebra (cf. handout Matrix
Algebra) that
∂(a′β̃)/∂β̃ = a and ∂(β̃′Aβ̃)/∂β̃ = 2Aβ̃ for A symmetric,
the K-dimensional vector of partial derivatives becomes
∂SSR(β̃)/∂β̃ = −2a + 2Aβ̃ = −2X′y + 2X′Xβ̃    (7)

The first order conditions:
−2X′y + 2X′Xb = 0    (8)
or
(X′X)b = X′y    (9)
Here, we replaced β̃ by b because the OLS estimate b
is the β̃ that satisfies the first order conditions.
The K equations (9) are called the normal equations.
The normal equations can also be written as follows
(check!):
(Σi xi xi′) b = Σi xi yi    (10)

The (n × 1) vector of (OLS) residuals:
e ≡ y − Xb    (11)
Its i-th element is ei = yi − xi′b.
The first order conditions (9) are just necessary. The
second order condition states that the Hessian
∂²SSR(β̃)/∂β̃∂β̃′ = 2X′X    (12)
should be positive definite at the optimum.
Indeed X′X is a positive definite matrix, so that b is
the unique solution of the minimization problem (4).
The system of normal equations is linear in b.

The system of equations (9) provides a unique solution as
long as the no perfect multicollinearity assumption 1.3 holds, i.e. the
following matrix is of full column rank: X′X = Σi xi xi′.
In that case, the solution of this problem, denoted by
b, is given by:
b = (Σi xi xi′)⁻¹ Σi xi yi = (X′X)⁻¹X′y    (13)
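As a quick illustrative check of formula (13), the sketch below recomputes b = (X′X)⁻¹X′y by hand with Stata's matrix commands and compares it with the output of regress. It uses Stata's built-in auto dataset, so the variables (price, mpg, weight) are illustrative only and have nothing to do with the course data.

. sysuse auto, clear
. regress price mpg weight                // OLS via the built-in command
. matrix accum XX = mpg weight            // forms X'X (a constant is added automatically)
. matrix vecaccum yX = price mpg weight   // forms y'X as a row vector (constant included)
. matrix b = invsym(XX) * yX'             // (X'X)^(-1) X'y
. matrix list b                           // identical to the coefficients reported by regress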

1.2

More concepts

Best linear approximation (fitted value):
ŷi = xi′b    (14)
Residual:
ei = yi − ŷi = yi − xi′b    (15)
Residual sum of squares (= minimum value of the objective function SSR(β̃), cf. equation (4)):
SSR = SSR(b) = e′e = Σi ei²    (16)

Notice that we can rewrite the first order conditions (9) as
follows:
X′(y − Xb) = 0 or X′e = 0 or
(1/n) Σi xi ei = 0 or (1/n) Σi xi (yi − xi′b) = 0    (17a)
If xi contains a constant then (17a) implies that
Σi ei = 0
and that
ȳ = x̄′b    (17b)

We already know that E(εi |X) = 0 (strict exogeneity)
implies the following K moment conditions:
E(xi εi ) = 0
(see notes of the previous lecture).
Equation (17a) can be viewed as the sample analogue
of these moment conditions.
One can therefore view OLS as a method of moments
estimator (see chapter 3 of the book).
(n × n) Projection matrix P:
P ≡ X(X′X)⁻¹X′    (18)
Annihilator matrix:
M ≡ In − P    (19)
Properties of the projection and annihilator matrices
P and M:
Both P and M are symmetric and idempotent.
PX = X
MX = O
Py = ŷ
e = y − ŷ = (In − P)y = My = M(Xβ + ε) = Mε
SSR = e′e = ε′M′Mε = ε′Mε
The non-econometricians should check the validity of
these statements (see handout Matrix Algebra).
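The properties of P and M listed above can also be verified numerically. The short Mata sketch below is purely illustrative (the toy matrix X is made up); it checks that PX = X, MX = O and that both matrices are idempotent.

. mata:
: X = (1,2 \ 1,3 \ 1,5 \ 1,7)          // toy (4 x 2) data matrix with a constant column
: P = X*invsym(X'X)*X'                 // projection matrix
: M = I(rows(X)) - P                   // annihilator matrix
: P*X                                  // reproduces X
: M*X                                  // zero matrix (up to rounding error)
: mreldif(P*P, P), mreldif(M*M, M)     // both essentially zero: P and M are idempotent
: end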

s²: OLS estimate of σ² (the variance of the error term):
s² ≡ SSR/(n − K) = e′e/(n − K)    (20)
Standard error of the regression (SER): s = √s².
It is an estimate of the standard deviation of the error
term.
Sampling error: b − β.
It can be related to ε (the error term) as follows:
b − β = (X′X)⁻¹X′y − β
      = (X′X)⁻¹X′(Xβ + ε) − β
      = (X′X)⁻¹(X′X)β + (X′X)⁻¹X′ε − β
      = β + (X′X)⁻¹X′ε − β
      = (X′X)⁻¹X′ε    (21)

Goodness of fit measures

Uncentered R²
Variability of the dependent variable can be measured
by Σi yi² = y′y. We can decompose y′y as follows:
y′y = (ŷ + e)′(ŷ + e)
    = ŷ′ŷ + 2ŷ′e + e′e
    = ŷ′ŷ + 2b′X′e + e′e
    = ŷ′ŷ + e′e    (22)
In the last step I use the property that X′e = 0 (by
the normal equations).
Definition of the uncentered R²:
R²uc ≡ 1 − e′e / (y′y)    (23)
Because of the decomposition (22) this equals
R²uc = ŷ′ŷ / (y′y)    (24)
Since both ŷ′ŷ and e′e are nonnegative, 0 ≤ R²uc ≤ 1.

Thus, the uncentered R2 has the interpretation of the


fraction of the variation of the dependent variable that
is attributable to the variation in the explanatory variables.


Centered R² (coefficient of determination)

If the only regressor is a constant, then b = ȳ = (1/n) Σi yi
(sample average of the dependent variable). In that
case
ŷi = ȳ
ŷ′ŷ = n ȳ²
e′e = Σi (yi − ȳ)²

If the regressors also include nonconstant
variables, it can be shown that Σi (yi − ȳ)² can be decomposed as
follows (check!):
Σi (yi − ȳ)² = Σi (ŷi − ȳ)² + Σi ei²    (25)

The coefficient of determination, R², is defined as
R² ≡ 1 − Σi ei² / Σi (yi − ȳ)²    (26)
Because of the decomposition (25) this equals
R² = Σi (ŷi − ȳ)² / Σi (yi − ȳ)²    (27)
It holds that 0 ≤ R² ≤ 1.

Interpretation of R²: 100·R² is the percentage of the sample variance in y explained by the nonconstant regressors (the model).
Low R²: bad fit of the model.
High R²: good fit of the model.
Be careful with the interpretation of the R²!!! The value of R²
increases with the inclusion of extra irrelevant regressors.
The adjusted R², R̄², addresses this problem somewhat:
R̄² = 1 − [e′e/(n − K)] / [Σi (yi − ȳ)²/(n − 1)]    (28)
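As a small check of definitions (26) and (28), the sketch below recomputes R² and the adjusted R² from the stored results of regress, using the built-in auto data rather than any course dataset.

. sysuse auto, clear
. quietly regress price mpg weight
. display "R2 by hand:     " 1 - e(rss)/(e(rss) + e(mss))
. display "adj R2 by hand: " 1 - (e(rss)/e(df_r))/((e(rss) + e(mss))/(e(N) - 1))
. display "reported by regress: " e(r2) "  " e(r2_a)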

Finite sample properties of OLS

Recapitulation of the assumptions

1. Assumption 1.1: the relationship between the dependent variable and the regressors is linear:
yi = β1 xi1 + β2 xi2 + . . . + βK xiK + εi ,    i = 1, . . . , n
2. Assumption 1.2 (strict exogeneity):
E(εi |X) = 0,    i = 1, . . . , n    (29)
3. Assumption 1.3 (no perfect multicollinearity):
The column rank of the n × K data matrix, X, is K
with probability 1.
4. Assumption 1.4 (spherical error variance):
E(εε′|X) = σ² In
Terminology
Estimator: algorithm (accounting rule, formula) which
provides the rule to estimate the unknown parameters.
This rule can be formulated before the sample is drawn.
An estimator is a random variable (because the dependent
and explanatory variables are random variables) with
a probability distribution.
Estimate: after the sample is drawn, one can apply the algorithm given the observed values of ((x1 , y1 ), . . . , (xn , yn )).
An estimate is NOT a random variable!! Different samples will yield different OLS estimates.
An estimator W of β is said to be unbiased if E(W ) = β.

2.1

Finite sample properties of b

Proposition 1.1.a: Under assumptions 1.1-1.3,
E(b|X) = β
Proof:
From equation (21) we know that the sampling error
is equal to:
b − β = (X′X)⁻¹X′ε
Basically we have to prove E(b − β|X) = 0.
Assumption 1.3 is needed in order to compute b (i.e.
(X′X)⁻¹ should exist).
Assumption 1.1 is needed to derive formula (21).
E(b − β|X) = E((X′X)⁻¹X′ε|X) = (X′X)⁻¹X′E(ε|X) = 0
q.e.d.
The OLS estimator b is a function of the sample (y, X).
Since (y, X) are random, so is b.
Now imagine that we fix X at some given value, calculate b for all samples corresponding to all possible
realizations of y, and take the average of b.
This average is the (population) conditional mean E(b|X).
Proposition (1.1.a) says that this average equals the
true value β.

By the law of total expectations:
E(b) = E(E(b|X)) = E(β) = β
If we calculated b for all possible different samples,
differing not only in y but also in X, the average would
be the true value.
This unconditional statement is probably more relevant in economics because samples do differ in both y
and X.
It should be emphasized that the strict exogeneity assumption is crucial.
It is not enough to assume that E(xi εi ) = 0.
Most time-series models do not satisfy strict exogeneity
even if they satisfy the weaker condition E(xi εi ) = 0.
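A small Monte Carlo sketch can make this unconditional statement concrete: across repeated samples that differ in both y and X, the average of the OLS slope estimates is close to the true coefficient. The program below is purely illustrative; the sample size, the seed and the true coefficients (intercept 1, slope 2) are made up.

. capture program drop olssim
. program define olssim, rclass
      drop _all
      set obs 100
      generate x = rnormal()               // a fresh regressor in every replication
      generate y = 1 + 2*x + rnormal()     // true intercept 1, true slope 2
      regress y x
      return scalar b2 = _b[x]
  end
. simulate b2 = r(b2), reps(1000) nodots seed(12345): olssim
. summarize b2                              // the mean of b2 is close to the true value 2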

Proposition 1.1.b: Under assumptions 1.1-1.4,
Var(b|X) = σ² (X′X)⁻¹
Proof:
Under assumptions 1.1-1.3, E(b|X) = β. Then
Var(b|X) = E((b − β)(b − β)′|X)
         = E((X′X)⁻¹X′εε′X(X′X)⁻¹|X)
         = (X′X)⁻¹X′E(εε′|X)X(X′X)⁻¹
         = (X′X)⁻¹X′(σ² In )X(X′X)⁻¹
         = σ² (X′X)⁻¹(X′X)(X′X)⁻¹
         = σ² (X′X)⁻¹

2.2

Gauss-Markov theorem

Fact: Let A and B be two square matrices of the same
size. We say that A ≥ B in the matrix sense if A − B
is positive semidefinite.
A K × K matrix C is said to be positive semidefinite
if x′Cx ≥ 0 for all K-dimensional vectors x.
Proposition 1.1.c (Gauss-Markov Theorem): Under Assumptions 1.1-1.4, the OLS estimator is efficient in the class of linear unbiased estimators. That
is, for any unbiased estimator β̂ that is linear in y,
Var(β̂|X) ≥ Var(b|X)
in the matrix sense.
Remarks
The matrix inequality in part (c) says that the K × K
matrix Var(β̂|X) − Var(b|X) is positive semidefinite,
so
a′[Var(β̂|X) − Var(b|X)]a ≥ 0 or a′Var(β̂|X)a ≥ a′Var(b|X)a    (30)
for any K-dimensional vector a.
Consider a = ek = (0, . . . , 1, 0, . . . , 0)′ (k-th element
equal to one, other elements zero); then equation (30)
implies that
Var(β̂k |X) ≥ Var(bk |X)    (31)
That is, for any regression coefficient, the variance of
the OLS estimator is no larger than that of any other
linear unbiased estimator.

The OLS estimator is linear in y. There are many
other estimators of β that are linear and unbiased.
The Gauss-Markov Theorem says that the OLS estimator is efficient in the sense that its conditional variance
matrix is smallest among linear unbiased estimators.
For this reason the OLS estimator is called the Best
Linear Unbiased Estimator (BLUE).
Proof of proposition 1.1.c:
Let β̂ = Cy and D ≡ C − A where A ≡ (X′X)⁻¹X′,
so β̂ = (D + A)y (and b = Ay).
Then
β̂ = (D + A)y
  = Dy + b
  = D(Xβ + ε) + b
  = DXβ + Dε + b
Taking the conditional expectation of both sides, we
obtain
E(β̂|X) = DXβ + DE(ε|X) + E(b|X) = DXβ + β
Since β̂ is unbiased it should hold that DX = O.
So β̂ = Dε + b and
β̂ − β = Dε + (b − β)
      = (D + A)ε

So
Var(β̂|X) = E((β̂ − β)(β̂ − β)′|X)
          = E((D + A)εε′(D′ + A′)|X)
          = (D + A)E(εε′|X)(D′ + A′)
          = σ² (D + A)(D′ + A′)
          = σ² (DD′ + AD′ + DA′ + AA′)
But DA′ = DX(X′X)⁻¹ = O since DX = O. Also
AA′ = (X′X)⁻¹. So
Var(β̂|X) − Var(b|X) = σ² (DD′ + (X′X)⁻¹) − σ² (X′X)⁻¹
                     = σ² DD′ ≥ 0    (32)
because DD′ is positive semidefinite.
q.e.d.

2.3

Finite sample properties of s2

We have defined before an estimator s² as follows:
s² = e′e / (n − K)
where n − K is called the degrees of freedom.
Lemma 1: e′e = ε′Mε
Proof:
Note that
e = y − Xb
  = y − X(X′X)⁻¹X′y = (In − P)y
  = My = M(Xβ + ε) = Mε    (33)
Since M is a symmetric and idempotent matrix it
holds that
e′e = ε′M′Mε = ε′MMε = ε′Mε
q.e.d.
Proposition 1.2 (unbiasedness of s²): Under assumptions 1.1-1.4, E(s²|X) = σ², provided n > K.
Proof: see book (and slides lecture 3).
Intuition for proposition 1.2:
Dividing e′e by n − K rather than by n makes this
estimate unbiased for σ².
The intuitive reason is that K parameters (β) have to
be estimated before obtaining the residual vector e.
More specifically, e has to satisfy the K normal equations, which limits the variability of the residuals.
Estimate of Var(b|X) = σ² (X′X)⁻¹:
V̂ar(b|X) = s² (X′X)⁻¹    (34)

Introduction to Econometrics
Coordinator: prof. dr. Rob Alessie
email: r.j.m.alessie@rug.nl
office: DUI 727 tel: 050-3637240
February 9, 2016

Lecture 3
Hayashi, chapter 1: section 1.3, 1.4
Organisation of the lecture
1. Finite sample properties of s2
2. Estimate of the variance V ar(b|X)
3. Examples
Omitted variable bias: the simple case
Inclusion of irrelevant variables in a regression model
The variance of the OLS estimator revisited
Interpretation dummy variables
4. Partitioned regression: Frisch-Waugh theorem

Finite sample properties of s2


We have defined before an estimator s² for the variance
of the error term σ²:
s² = e′e / (n − K)
where n − K is called the degrees of freedom.
Lemma 1: e′e = ε′Mε
Proof:
Note that
e = y − Xb
  = y − X(X′X)⁻¹X′y
  = (In − P)y
  = My
  = M(Xβ + ε) = Mε
Since M is a symmetric and idempotent matrix it
holds that
e′e = ε′M′Mε = ε′Mε
q.e.d.
The trace operator
Fact 1: The trace of a square matrix A is the sum of
its diagonal elements: trace(A) = Σi aii
Fact 2: The trace operator is linear: trace(A + B) =
trace(A) + trace(B)
Fact 3: trace(AB) = trace(BA)

Proposition 1.2 (unbiasedness of s²): Under assumptions 1.1-1.4,
E(s²|X) = σ²
provided n > K.
Proof:
We should show that
E(e′e|X) = (n − K) σ²
or equivalently (cf. Lemma 1):
E(ε′Mε|X) = σ² (n − K)
The proof consists of two parts:
1. E(ε′Mε|X) = σ² trace(M)
2. trace(M) = n − K
Proof of part 1:
Notice that ε′Mε = Σi Σj mij εi εj
Also
E(ε′Mε|X) = E(Σi Σj mij εi εj |X)
           = Σi Σj mij E(εi εj |X)
           = Σi mii E(εi²|X)
           = σ² Σi mii = σ² trace(M)
In the second step above, we use that the mij 's are a
function of X (linearity property of conditional
expectations).
In the third step we exploit assumption 1.4:
E(εε′|X) = σ² In ⇒ E(εi εj |X) = 0, for i ≠ j
Proof of part 2: trace(M) = n − K
We know that
Projection matrix: P = X(X′X)⁻¹X′
Annihilator: M = In − P = In − X(X′X)⁻¹X′
trace(P) = trace(X(X′X)⁻¹X′)
         = trace((X′X)⁻¹X′X)
         = trace(IK ) = K
In the second step we use the property
trace(AB) = trace(BA)
(A = X and B = (X′X)⁻¹X′)
trace(M) = trace(In − P)
         = trace(In ) − trace(P) = n − K    (1)
Intuition for proposition 1.2:
Dividing e′e by n − K rather than by n makes this
estimate unbiased for σ².
The intuitive reason is that K parameters (β) have to
be estimated before obtaining the residual vector e.
More specifically, e has to satisfy the K normal equations, which limits the variability of the residuals.
Estimate of Var(b|X) = σ² (X′X)⁻¹:
V̂ar(b|X) = s² (X′X)⁻¹    (2)
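The link between s², equation (2) and the Root MSE that Stata reports can be checked directly from the stored results of regress; the sketch below uses the built-in auto data rather than the wage data of the next example.

. sysuse auto, clear
. quietly regress price mpg weight
. display "s^2      = " e(rss)/e(df_r)           // e'e/(n-K)
. display "Root MSE = " sqrt(e(rss)/e(df_r))     // equals the stored e(rmse)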

Example: wage and education

We consider the following simple regression model
wagei = β1 + β2 educi + εi    (3)
where
wagei = wage measured in dollars per hour
educi = years of schooling
Model (3) has been estimated using Stata (the Stata
program and dataset are available on Nestor).
. sum wage, detail
                      average hourly earnings
-------------------------------------------------------------
      Percentiles      Smallest
 1%      1.67            .53
 5%      2.75           1.43
10%      2.92            1.5       Obs            526
25%      3.33            1.5       Sum of Wgt.    526
50%      4.65                      Mean          5.896103
                      Largest      Std. Dev.     3.693086
75%      6.88          21.86
90%       10            22.2       Variance      13.63888
95%       13           22.86       Skewness      2.007325
99%       20           24.98       Kurtosis      7.970083

Notice that the distribution of wage is skewed to the right
and has fat tails (cf. skewness and kurtosis).

Regression results
. reg wage educ

      Source |       SS       df       MS            No obs    =     526
-------------+------------------------------         F(1,524)  =  103.36
       Model |  1179.73204     1  1179.73204         Prob > F  =  0.0000
    Residual |  5980.68225   524  11.4135158         R^2       =  0.1648
-------------+------------------------------         Adj R^2   =  0.1632
       Total |  7160.41429   525  13.6388844         Root MSE  =  3.3784

-------------------------------------------
        wage |      Coef.   Std. Err.      t
-------------+-----------------------------
        educ |   .5413593     .053248  10.17
       _cons |  -.9048516    .6849678  -1.32
-------------------------------------------

beduc = .5413593
Interpretation: one year of extra schooling leads to
54 dollar cents extra wage per hour.
s² = e′e/(n − K) = 5980.68225/524 = 11.4135158
Standard error of the regression:
s = √s² = √11.4135158 = 3.3784 (cf. Root MSE)
se(beduc |X) = √V̂ar(beduc |X) = .053248
R² = 1 − e′e / Σi (yi − ȳ)² = 1 − 5980.68225/7160.41429 = 1179.73204/7160.41429 = 0.1648

Adjusted R² = 1 − [e′e/(n − K)] / [Σi (yi − ȳ)²/(n − 1)]
            = 1 − 11.4135158/13.6388844 = 0.1632

[Scatter plot of average hourly earnings and fitted values against years of education]

See week2_wage.log for more output.
The scattergram shows evidence of heteroskedasticity: the
absolute value of the residuals increases with educ.
The standard errors should therefore be viewed with care.

Simple solution: consider the following semi-log specification:
log(wagei ) = β1 + β2 educi + εi    (4)

The distribution of log(wage) is much less skewed to the right:
. sum log_wage, detail
                          log_wage
-------------------------------------------------------------
      Percentiles      Smallest
 1%    .5128236       -.6348783
 5%    1.011601        .3576744
10%    1.071584        .4054651    Obs            526
25%    1.202972        .4054651    Sum of Wgt.    526
50%    1.536867                    Mean          1.623268
                      Largest      Std. Dev.     .5315382
75%    1.928619        3.084659
90%    2.302585        3.100092    Variance      .2825329
95%    2.564949        3.129389    Skewness      .3909038
99%    2.995732        3.218076    Kurtosis      3.386529

Regression results
. reg log_wage educ

      Source |       SS       df       MS            Number of obs =    526
-------------+------------------------------         F(1, 524)     = 119.58
       Model |  27.5606288     1  27.5606288         Prob > F      = 0.0000
    Residual |  120.769123   524  .230475425         R-squared     = 0.1858
-------------+------------------------------         Adj R-squared = 0.1843
       Total |  148.329751   525   .28253286         Root MSE      = .48008

-------------------------------------------
    log_wage |      Coef.   Std. Err.      t
-------------+-----------------------------
        educ |   .0827444    .0075667  10.94
       _cons |   .5837727    .0973358   6.00
-------------------------------------------

Interpretation of beduc (see table 1 below):
The return to education is equal to 8.27 percent (due
to a one year increase in educ (years of schooling), the hourly wage rate increases by 8.27 percent
(= .0827444*100)).

[Scatter plot of log_wage and fitted values against years of education]

Table 1: Summary of functional forms involving logarithms

Model                  Dependent   Independent   Interpretation
                       Variable    Variable      of β
level-level            y           x             Δy = β Δx
level-log              y           ln(x)         Δy = (β/100) %Δx
log-level (semi-log)   ln(y)       x             %Δy = (100 β) Δx (1)
log-log                ln(y)       ln(x)         %Δy = β %Δx

(1): The log-level approximation is precise only for small values of β.

Semi-log specification
A useful approximation for the natural logarithm for
small x is
ln(1 + x) ≈ x
This linear approximation is very close for |x| < 0.1,
and reasonably close for 0.1 < |x| < 0.2, but the difference increases with |x|.
Now, if y′ is c% greater than y, then
y′ = (1 + c/100) y.
Taking natural logarithms,
ln(y′) = ln(y) + ln(1 + c/100)
or
Δ ≡ ln(y′) − ln(y) = ln(1 + c/100) ≈ c/100
or c ≈ 100 Δ (see table 1).
Exact calculation of the percentage difference:
[(y′ − y)/y] · 100 = 100 (exp(ln(y′) − ln(y)) − 1) = 100 (exp(Δ) − 1)
You should use this formula if Δ is large (rule of
thumb: |Δ| > 0.2).
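The quality of the ln(1 + x) ≈ x approximation is easy to check in Stata:

. display ln(1.05)                  // .04879..., close to .05
. display ln(1.30)                  // .26236..., no longer close to .30
. display 100*(exp(ln(1.30)) - 1)   // the exact formula recovers the 30 percent difference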

1.1

Omitted Variable Bias: The Simple Case

Suppose that ln(wagei ) is determined by
ln(wagei ) = β1 + β2 educi + β3 abili + εi    (5)
where E(εi |educi , abili ) = 0.
In the dataset used in the example above, abil (ability) is not
observed. We have instead estimated the model
ln(wagei ) = β1 + β2 educi + νi    (6)
where νi = β3 abili + εi .
Notice that
E(νi |educi ) = E(β3 abili + εi |educi ) = β3 E(abili |educi ) ≠ 0
because abili and educi are positively correlated (E(abili |educi ) > 0).
In the case of equation (6), assumption 1.2 (strict exogeneity) is violated! In other words, a problem of omitted
variable bias arises.
Omitted variable bias. Let
β̃2 be the estimator of β2 from a simple regression of
ln(wagei ) on educi (cf. equation (6)).
Let yi = ln(wagei ), x2i = educi and x3i = abili .
Model (5) can be rewritten as follows:
yi = β1 + β2 x2i + β3 x3i + εi    (7)
(Equation (6) can be rewritten in an analogous way.)
Then it can be shown that
E(β̃2 |x2 , x3 ) = β2 + β3 δ̃1    (8)
where δ̃1 is the estimate for δ1 from the following regression:
x3i = δ0 + δ1 x2i + ηi    (9)
The proof of this statement will be given later in the
course.
Equation (8) implies that the omitted variable bias
is equal to
bias(β̃2 |x2 , x3 ) = E(β̃2 |x2 , x3 ) − β2 = β3 δ̃1    (10)

From equation (10), we see that there are two cases
where the estimator β̃2 is unbiased:
β3 = 0
  In the wage example: ability has no impact on the
  hourly wage rate.
δ̃1 = 0, or equivalently the variable x2 (education)
  is uncorrelated with the omitted variable x3 (ability).
In the wage example above, it is rather unlikely that
those conditions are met. One should expect that
β3 > 0
δ̃1 > 0 (positive correlation between years of schooling and ability)
According to equation (6) and the bias formula (10),
the estimator β̃2 is biased upwards because
E(νi |educi ) = E(β3 abili + εi |educi ) = β3 E(abili |educi ) > 0

In the example above, we estimated the returns to
schooling to be equal to 0.083.
This estimate comes from only a single sample, so we
cannot say that .083 is greater than β2 ;
the true return to education could be lower or higher
than 8.3 percent (and we will never know for sure).
Nevertheless, we know that the average of the estimates across all random samples would be too large.
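The in-sample counterpart of formula (8) can be illustrated with any dataset in which the "omitted" variable is in fact observed. The sketch below uses the built-in auto data, where the variables merely play the roles of y, x2 and x3 (they have nothing to do with wages or ability); the short-regression slope equals the long-regression slope plus the coefficient of the omitted variable times the auxiliary slope.

. sysuse auto, clear
. regress price mpg weight      // "long" regression: coefficients on mpg and weight
. regress price mpg             // "short" regression: the coefficient on mpg changes
. regress weight mpg            // auxiliary regression (cf. equation (9))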


1.2

Including Irrelevant Variables in a Regression Model

To illustrate the issue, suppose we specify the model
as
yi = β1 + β2 x2i + β3 x3i + β4 x4i + εi    (11)
and this model satisfies Assumptions 1.1 through 1.4.
However, β4 = 0. In other words, the true model is
yi = β1 + β2 x2i + β3 x3i + εi    (12)

Let β̂ be the OLS estimator of regression (11) and
b the OLS estimator stemming from the true model
(12).
Let
xj = (x1j , . . . , xnj )′, j = 2, ..., 4: (n × 1) vectors
X̃ = (ιn , x2 , x3 , x4 )
X = (ιn , x2 , x3 )
β̂ = (X̃′X̃)⁻¹X̃′y
b = (X′X)⁻¹X′y
Then, one can easily show that
E(β̂|X̃) = E((X̃′X̃)⁻¹X̃′y|X̃) = (β1 , β2 , β3 , 0)′
(In the last step, we use β4 = 0.) Consequently, by the
law of total expectations,
E(β̂) = E(E(β̂|X̃)) = (β1 , β2 , β3 , 0)′
In other words, β̂ is an unbiased estimator. Moreover,
β̂ is linear in y.

Proposition 1.1.a (see slides lecture 2) says that b is
also an unbiased estimator.
Moreover, according to the Gauss-Markov theorem, b
is the Best Linear Unbiased Estimator (BLUE).
In other words, b is a more efficient estimator of
(β1 , β2 , β3 )′ than β̂.

1.3

The variance of the OLS estimator revisited

Proposition 1.1.b: Under assumptions 1.1-1.4,
Var(b|X) = σ² (X′X)⁻¹    (13)
See the slides of lecture 2 for the proof.
Consider the following regression model:
yi = β1 + β2 x2i + . . . + βK xKi + εi
Using the Frisch-Waugh theorem (see next section) it
can be shown that the elements on the main diagonal
of Var(b|X) can be written as
Var(bj |X) = σ² / [ Σi (xji − x̄j )² (1 − Rj²) ] ,    j = 2, . . . , K    (14)
where Rj² is the R-squared from regressing xj on all
other independent variables (and including an intercept).

The size of Var(bj |X) is practically important.
A larger variance means a less precise estimator, and
this translates into larger confidence intervals and less
accurate hypothesis tests (see section 1.4 of the book).

Var(bj |X) = σ² / [ Σi (xji − x̄j )² (1 − Rj²) ] ,    j = 2, . . . , K
From equations (13) and (14), it becomes clear that
the conditional variance of bj depends on three factors:
1. σ², the variance of the error term:
   σ² ↑  ⇒  Var(bj |X) ↑
2. The total sample variation SSTj = Σi (xji − x̄j )²:
   SSTj ↑  ⇒  Var(bj |X) ↓
   (Notice that SSTj increases with the sample size!)
3. Rj²: the linear relationships among the independent variables:
   Rj² ↑  ⇒  Var(bj |X) ↑
   Contrary to Hayashi, I call high (but not perfect)
   correlation between two or more independent variables multicollinearity (high Rj² for at least one
   of the explanatory variables).
   Perfect multicollinearity refers to the case that for
   at least one j, Rj² = 1. In that case assumption 1.3
   (data matrix X is of full column rank) is violated.
   (Hayashi calls this multicollinearity.)
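The terms 1/(1 − Rj²) in formula (14) are the variance inflation factors, which Stata reports after any regression; a minimal sketch with the built-in auto data:

. sysuse auto, clear
. quietly regress price mpg weight length
. estat vif                     // one VIF = 1/(1 - Rj^2) per nonconstant regressor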

Wage example revisited


We now consider the following model:
ln(wagei ) = β1 + β2 educi + β3 experi + β4 experi² + εi    (15)
where experi denotes the years of experience in the
labor market.
In this specification the returns to experience,
∂ln(wagei ) / ∂experi = β3 + 2β4 experi ,
decrease with exper (i.e. β3 > 0, β4 < 0).
In the example above, I had considerable sample
variation in exper.
. /* returns to experience plus regional wage differentials */
. /* In this dataset considerable variation in exper */
. sum exper
Variable |
Obs
Mean
Std. Dev.
Min
Max
-------------+-------------------------------------------------------exper |
526
17.01711
13.57216
1
51
. gen expersq=exper*exper
. reg log_wage educ exper expersq
Source |
SS
df
MS
-------------+-----------------------------Model | 44.5393713
3 14.8464571
Residual |
103.79038
522 .198832146
-------------+-----------------------------Total | 148.329751
525
.28253286

Number of obs
F( 3,
522)
Prob > F
R-squared
Adj R-squared
Root MSE

=
=
=
=
=
=

526
74.67
0.0000
0.3003
0.2963
.44591

-----------------------------------------------------------------------------log_wage |
Coef.
Std. Err.
t
P>|t|
[95% Conf. Interval]
-------------+---------------------------------------------------------------educ |
.0903658
.007468
12.10
0.000
.0756948
.1050368
exper |
.0410089
.0051965
7.89
0.000
.0308002
.0512175
expersq | -.0007136
.0001158
-6.16
0.000
-.000941
-.0004861
_cons |
.1279975
.1059323
1.21
0.227
-.0801085
.3361035
------------------------------------------------------------------------------

∂log(wagei ) / ∂experi > 0 if exper < 29 (≈ .0410089/(2 × .0007136))
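The turning point can be computed directly from the estimated coefficients above:

. display .0410089/(2*.0007136)   // roughly 28.7 years of experience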

β3 and β4 are precisely estimated (due to the large variation in exper).
I also consider an alternative dataset (NLSY) which
only covers young employees.
This sampling frame implies that there is less variation in exper than in the previous dataset:
. sum exper
    Variable |  Obs      Mean    Std. Dev.   Min   Max
-------------+-----------------------------------------
       exper |  935  11.56364    4.374586      1    23
Consequently, I face a problem of multicollinearity:
the parameters β3 and β4 aren't estimated with
great precision.
. reg log_wage educ exper expersq
Source|
SS
df
MS
--------+--------------------------Model| 21.688779
3 7.22959299
Residual|143.967504 931 .154637491
--------+--------------------------Total|165.656283 934 .177362188

Nobs
=
935
F(3,931)= 46.75
Prob > F=0.0000
R^2
=0.1309
Adj R^2 =0.1281
Root MSE=.39324

------------------------------------log_wage|
Coef. Std. Err.
t
--------+---------------------------educ| .0779866 .0066242
11.77
exper| .016256
.01354
1.20
expersq| .000152
.000567
0.27
_cons| 5.517432 .1248186
44.20
-------------------------------------


A regression of exper² on the other rhs variables
yields an R² of 0.9534.
Given the limited variability in exper, there is a
strong correlation among the rhs variables (especially between exper and exper²; see the R² (0.95)
of that auxiliary regression).
Given the limited variability, it might be wise to
drop the variable exper² from specification (15).
In that case we obtain the following regression results:
. reg log_wage educ exper
Source|
SS
df
MS
--------+---------------------------Model|21.6776613
2 10.8388306
Residual|143.978622
932
.1544835
--------+---------------------------Total|165.656283
934 .177362188
-------------------------------------log_wage|
Coef.
Std. Err.
t
--------+----------------------------educ| .077782
.0065769
11.83
exper| .0197768
.0033025
5.99
_cons| 5.50271
.112037
49.12
--------------------------------------


Nobs
=
935
F(2,932)= 70.16
Prob > F=0.0000
R^2
=0.1309
Adj R^2 =0.1290
Root MSE=.39304

A Single Dummy Independent Variable

I consider again the original data set consisting of


526 observations.
I extend the model with one dummy independent
variable nonwhite
. /* Dummy variables */
. reg log_wage educ exper expersq nonwhite
Source |
SS
df
MS
Number of obs =
526
-------------+-----------------------------F( 4,
521) =
55.90
Model |
44.545379
4 11.1363448
Prob > F
= 0.0000
Residual | 103.784372
521
.19920225
R-squared
= 0.3003
-------------+-----------------------------Adj R-squared = 0.2949
Total | 148.329751
525
.28253286
Root MSE
= .44632
-----------------------------------------------------------------------------log_wage |
Coef.
Std. Err.
t
P>|t|
[95% Conf. Interval]
-------------+---------------------------------------------------------------educ |
.090251
.0075041
12.03
0.000
.075509
.1049931
exper |
.0410322
.0052031
7.89
0.000
.0308106
.0512538
expersq | -.0007142
.0001159
-6.16
0.000
-.0009419
-.0004864
nonwhite | -.0111807
.064382
-0.17
0.862
-.1376609
.1152995
_cons |
.1304809
.1069908
1.22
0.223
-.0797054
.3406671
------------------------------------------------------------------------------

The dummy variable nonwhite is defined as follows:
nonwhitei = 1 if respondent i is non-white
          = 0 otherwise
Interpretation of bnonwhite : a nonwhite person earns
about 1.1 percent (= -.0111807*100) less than a
white person (reference group), keeping other factors constant (ceteris paribus) (log-level specification).
The wage differential between nonwhites and whites can be
computed more precisely as: (e^(-.0111807) − 1)·100 =
−1.1118428 percent.


Instead of the dummy variable nonwhite, I include


the dummy variable f emale in order to analyze the
gender wage gap
. reg log_wage

educ exper expersq female

Source |
SS
df
MS
-------------+-----------------------------Model | 59.2711314
4 14.8177829
Residual |
89.05862
521
.17093785
-------------+-----------------------------Total | 148.329751
525
.28253286

Number of obs
F( 4,
521)
Prob > F
R-squared
Adj R-squared
Root MSE

=
=
=
=
=
=

526
86.69
0.0000
0.3996
0.3950
.41345

-----------------------------------------------------------------------------log_wage |
Coef.
Std. Err.
t
P>|t|
[95% Conf. Interval]
-------------+---------------------------------------------------------------educ |
.0841361
.0069568
12.09
0.000
.0704692
.0978029
exper |
.03891
.0048235
8.07
0.000
.029434
.0483859
expersq |
-.000686
.0001074
-6.39
0.000
-.000897
-.0004751
female | -.3371868
.0363214
-9.28
0.000
-.4085411
-.2658324
_cons |
.390483
.1022096
3.82
0.000
.1896894
.5912767
------------------------------------------------------------------------------

Interpretation of bfemale : a female employee earns
(e^(-.3371868) − 1) · 100 = −28.622451 percent
more (i.e. 28.622451 percent less) than a male employee, ceteris paribus.
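Both the log-level approximation and the exact differential can be computed directly from this coefficient:

. display -.3371868*100              // approximation: about -33.7 percent
. display 100*(exp(-.3371868) - 1)   // exact differential: about -28.6 percent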


3 Using Dummy Variables for Multiple Categories


We can use several dummy independent variables
in the same equation. For example, we could add
the dummy variable married to the regression equation presented above (next to the dummy f emale)
. reg log_wage educ exper expersq female married
Source |
SS
df
MS
Number of obs =
526
-------------+-----------------------------F( 5,
520) =
70.00
Model | 59.6726652
5
11.934533
Prob > F
= 0.0000
Residual | 88.6570862
520 .170494397
R-squared
= 0.4023
-------------+-----------------------------Adj R-squared = 0.3966
Total | 148.329751
525
.28253286
Root MSE
= .41291
-----------------------------------------------------------------------------log_wage |
Coef.
Std. Err.
t
P>|t|
[95% Conf. Interval]
-------------+---------------------------------------------------------------educ |
.0828163
.0070008
11.83
0.000
.0690629
.0965696
exper |
.0357493
.0052391
6.82
0.000
.025457
.0460417
expersq |
-.000632
.0001129
-5.60
0.000
-.0008537
-.0004103
female | -.3290458
.0366601
-8.98
0.000
-.4010658
-.2570257
married |
.0645607
.042069
1.53
0.125
-.0180854
.1472068
_cons |
.3921025
.1020824
3.84
0.000
.1915579
.5926472
------------------------------------------------------------------------------

The marriage premium is estimated to be about 6.5%.
Limitation of the model: the marriage premium is assumed to be the same for men and women.


If the regression model is to have different intercepts for, say, g categories, we need to include g − 1
dummy variables in the model along with an intercept.
Example: In the example below, we distinguish 4 different regions (g = 4):
1. North-east (reference group)
2. North-central
3. West
4. South

. reg log_wage educ exper expersq female married northcen south west
Source |
SS
df
MS
Number of obs =
526
-------------+-----------------------------F( 8,
517) =
45.95
Model | 61.6387993
8 7.70484991
Prob > F
= 0.0000
Residual | 86.6909521
517 .167680758
R-squared
= 0.4156
-------------+-----------------------------Adj R-squared = 0.4065
Total | 148.329751
525
.28253286
Root MSE
= .40949
-----------------------------------------------------------------------------log_wage |
Coef.
Std. Err.
t
P>|t|
[95% Conf. Interval]
-------------+---------------------------------------------------------------educ |
.0808207
.0070101
11.53
0.000
.067049
.0945925
exper |
.0363615
.0052269
6.96
0.000
.0260929
.0466301
expersq |
-.000645
.0001128
-5.72
0.000
-.0008665
-.0004235
female | -.3345661
.0364315
-9.18
0.000
-.406138
-.2629941
married |
.0711934
.0417725
1.70
0.089
-.0108712
.153258
northcen |
-.070182
.0519674
-1.35
0.177
-.1722752
.0319111
south | -.1162238
.0486825
-2.39
0.017
-.2118637
-.0205839
west |
.04643
.0576308
0.81
0.421
-.0667894
.1596494
_cons |
.4625799
.1074952
4.30
0.000
.2513988
.673761
------------------------------------------------------------------------------

Interpretation bsouth : an employee living in the Southern part of the US earns about 11.6 % less than an
employee living in the North East (reference group)
keeping other factors fixed.


4 Partitioned regression, Frisch-Waugh theorem

Consider the following regression model
y = Xβ + ε    (16)
where y and ε are (n × 1) vectors and X is the (n × K)
data matrix.
We partition the model in the following way:
y = X1 β1 + X2 β2 + ε    (17)
where X1 and X2 are (n × K1) and (n × K2) matrices
respectively (β1 and β2 are (K1 × 1) and (K2 × 1)
vectors).
Let
- P1 = X1 (X1′X1)⁻¹X1′ (notice that P1 X1 = X1)
- M1 = In − P1. Notice that
  M1 X1 = (In − P1)X1 = X1 − X1 = O
- ỹ = M1 y:
  ỹ is the residual vector from the regression of y
  on X1
- X̃2 = M1 X2:
  the k-th column of X̃2 is the residual vector
  from a regression of the corresponding k-th column
  of X2 on X1
Let b be the OLS estimator for β (cf. model (16)); b
can be partitioned (cf. equation (17)) as
b = (b1′, b2′)′

Frisch-Waugh theorem
1. Under assumptions 1.1-1.3, b2 = (X̃2′X̃2)⁻¹X̃2′ỹ.
   That is, b2 can be obtained by regressing the residuals ỹ on the matrix of residuals X̃2.
2. The residuals from the regression of ỹ on X̃2 numerically equal e, the residuals from the regression of y on X (= (X1 : X2)).
3. Under assumptions 1.1-1.4, V̂ar(b2) = s² (X̃2′X̃2)⁻¹.
4. b2 = (X̃2′X̃2)⁻¹X̃2′y.
   That is, b2 can also be obtained by regressing y itself on the
   matrix of residuals X̃2.
Comments on the Frisch-Waugh theorem
Consider the following simple regression model
yi = β1 x1i + β2 x2i + εi
where x1i = 1 (the intercept).
If we stack the observations we obtain
y = β1 ιn + β2 x2 + ε
where ιn = (1, . . . , 1)′ is an (n × 1) vector of ones.
Now we apply the Frisch-Waugh theorem where x1 =
ιn and P1 = ιn (ιn′ιn )⁻¹ ιn′.
Notice that
P1 ιn = ιn
M1 ιn = (In − P1)ιn = ιn − ιn = 0
ỹ = M1 y = y − ȳ ιn where ȳ = (1/n) Σi yi (sample
average of yi )
ỹi = yi − ȳ (deviation from the average)
x̃2 = M1 x2 = x2 − x̄2 ιn
x̃2i = x2i − x̄2 (deviation from the average)
Part 1 of the Frisch-Waugh theorem says that we can
obtain the OLS estimate b2 by regressing ỹ on x̃2.
Then we obtain
b2 = Σi (x2i − x̄2)(yi − ȳ) / Σi (x2i − x̄2)² = (sy / sx2 ) rx2,y    (18)
where sx2 and sy denote the sample standard deviations
of x2i and yi respectively; rx2,y is the sample correlation
coefficient.
Notice that in a simple regression model the sign of b2
is the same as that of rx2,y .
See the Stata program week2_wage.log for an application
of the Frisch-Waugh theorem.
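Without reproducing week2_wage.log here, the same two-step idea is easy to mimic with the built-in auto data (the variables are illustrative only): partialling the constant and weight out of both price and mpg reproduces the multiple-regression coefficient on mpg, in line with parts 1 and 4 of the theorem.

. sysuse auto, clear
. regress price mpg weight             // full regression; note the coefficient on mpg
. quietly regress price weight
. predict ytilde, residuals            // ytilde = M1*y
. quietly regress mpg weight
. predict xtilde, residuals            // xtilde = M1*x2
. regress ytilde xtilde, noconstant    // part 1: same slope as the mpg coefficient above
. regress price xtilde, noconstant     // part 4: regressing y itself on xtilde also works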


Proof of the Frisch-Waugh theorem

Proof of part 1:
The OLS estimator b from a regression of y on X
can be obtained by solving the normal equations
(X′X)b = X′y.
Since X = (X1 : X2), the normal equations can also
be written as
(X1 : X2)′(X1 : X2) (b1′, b2′)′ = (X1 : X2)′ y
or
X1′X1 b1 + X1′X2 b2 = X1′y    (19)
X2′X1 b1 + X2′X2 b2 = X2′y    (20)
By premultiplying both sides of (19) by X1 (X1′X1)⁻¹,
we obtain
X1 (X1′X1)⁻¹X1′X1 b1 = −X1 (X1′X1)⁻¹X1′X2 b2 + X1 (X1′X1)⁻¹X1′y
X1 b1 = −P1 X2 b2 + P1 y
Substitution of this into equation (20) gives
X2′(−P1 X2 b2 + P1 y) + X2′X2 b2 = X2′y
−X2′P1 X2 b2 + X2′X2 b2 = X2′y − X2′P1 y
X2′(I − P1)X2 b2 = X2′(I − P1)y
X2′M1 X2 b2 = X2′M1 y
X2′M1′M1 X2 b2 = X2′M1′M1 y
X̃2′X̃2 b2 = X̃2′ỹ
Therefore
b2 = (X̃2′X̃2)⁻¹X̃2′ỹ
q.e.d.

Proof of part 2: We can obtain the OLS estimator
b = (b1′, b2′)′ by regressing y on X = (X1 : X2).
By premultiplying both sides of
y = X1 b1 + X2 b2 + e
by M1, we get
M1 y = M1 X1 b1 + M1 X2 b2 + M1 e
Since M1 X1 = O, M1 y = ỹ and M1 X2 = X̃2,
we can rewrite the equation above as
ỹ = X̃2 b2 + M1 e    (21)
M1 e = e because
M1 e = (I − P1)e
     = e − P1 e
     = e − X1 (X1′X1)⁻¹X1′e
     = e
because X1′e = 0 by the normal equation (19).
q.e.d.
Proof of part 3:
We obtain the estimate s² = e′e/(n − K) for σ² by
regressing y on X.
However, we obtain the same value of e′e by
regressing ỹ on X̃2 because M1 e = e (see part
2). In the computation of s² we should take into
account that K is equal to the number of columns
of X! From equation (21) it becomes clear that we
can apply the standard formula for the estimate of
Var(b2 |X):
V̂ar(b2 |X) = s² (X̃2′X̃2)⁻¹
q.e.d.

Proof of part 4:
From part 1, we know that b2 = (X̃2'X̃2)^{-1} X̃2'ỹ, or
    b2 = (X̃2'X̃2)^{-1} X̃2'ỹ
       = (X̃2'X̃2)^{-1} X2'M1'M1 y
       = (X̃2'X̃2)^{-1} X2'M1 y
       = (X̃2'X̃2)^{-1} X̃2'y
q.e.d.

Introduction to Econometrics
Coordinator: prof. dr. Rob Alessie
email: r.j.m.alessie@rug.nl
office: DUI 727, tel: 050-3637240
February 11, 2016

Lecture 4
Hayashi, chapter 1: section 1.4 (except page 45)
Organization of the lecture
1. Hypothesis Testing under Normality
   - Normally distributed error terms
   - Testing hypotheses about individual regression coefficients
   - Decision rule for the t-test
   - Confidence interval
   - p-value
   - Example t-test
   - Linear hypotheses
   - The F-test
   - A more convenient expression for the F-test
   - t versus F
Hypothesis Testing under Normality

Regression model
    yi = β1 + β2 xi2 + ... + βK xiK + εi,    i = 1, ..., n        (1)
Very often, the economic theory that motivated the regression equation also specifies the values that the regression coefficients should take.
Example: Cobb-Douglas production function with constant returns to scale:
    ln(Yi) = β1 + β2 ln(Li) + β3 ln(Ki) + εi,    i = 1, ..., n        (2)
where Yi denotes output, Li labor input, and Ki capital. If the production technology exhibits constant returns to scale, then β2 + β3 = 1.
The unbiasedness property of OLS (E(b|X) = β) guarantees that (cf. equation (2))
    E(b2 + b3) = β2 + β3 = 1
if the restriction of constant returns to scale is true.
However, because of sampling error, b2 + b3 is not exactly equal to 1 for the particular sample at hand.
Obviously, we cannot conclude that the restriction is false just because the estimate b2 + b3 differs from 1.
To decide whether the sampling error
    b2 + b3 - 1
is too large for the restriction to be true, we need to construct from the sampling error some test statistic whose probability distribution is known given the truth of the hypothesis.
In the language of hypothesis testing, the restriction to be tested (such as β2 + β3 = 1) is called the null hypothesis:
    H0: β2 + β3 = 1

1.1  Normally distributed error terms

Assumption 1.5 (normality of the error term): The distribution of ε|X is jointly normal.
Recall from probability theory that the normal distribution has several properties:
- The distribution (density function) depends only on the mean and the variance:
      ε|X normal, E(ε|X) = 0, Var(ε|X) = σ²In
      ⇒ f(ε|X) = f(ε) = (2πσ²)^{-n/2} exp(-ε'ε/(2σ²))
  In other words,
      ε|X ~ N(0, σ²In),    ε ~ N(0, σ²In)        (3)
  and ε and X are independent.
- In general, if two random variables are independent, then they are uncorrelated, but the converse is not true.
  Example: Consider two random variables X and Y where Y = X². Moreover, assume that E(X) = E(X³) = 0.
  Obviously, the variables X and Y are uncorrelated:
      cov(X, Y) = E(XY) - E(X)E(Y) = E(X³) - 0 = 0
  But if Y = X², then X and Y are clearly not independent: once we know X, we know Y.
- However, if two random variables X and Y are jointly normally distributed, the converse is also true, so that independence and a lack of correlation are equivalent.
  This carries over to conditional distributions: if two random variables are jointly normal and uncorrelated conditional on X, then they are independent conditional on X.
- A linear function of random variables that are jointly normally distributed is itself normally distributed.
  This also carries over to conditional distributions. If the distribution of u = (u1, ..., un)' conditional on X is normal, then Au, where the elements of the matrix A are functions of X, is normal conditional on X:
      u|X ~ N(μ, Σ), A a function of X   ⇒   Au|X ~ N(Aμ, AΣA')
- We know that the sampling error of b,
      b - β = (X'X)^{-1}X'y - β = (X'X)^{-1}X'(Xβ + ε) - β = (X'X)^{-1}X'ε        (4)
  is linear in ε given X. Since ε is normal given X, so is the sampling error:
      b - β |X ~ N(0, σ²(X'X)^{-1})        (5)

1.2  Testing hypotheses about individual regression coefficients

Consider the following null hypothesis on the k-th regression coefficient:
    H0: βk = β̄k
Here, β̄k is some known value specified by the null hypothesis.
We wish to test H0 against the alternative hypothesis (two-sided test)
    H1: βk ≠ β̄k
at a significance level α.
Significance level α = maximum accepted probability of a type I error: P(reject H0 | H0 true).
Looking at the k-th component of (5) and imposing the restriction of the null, we obtain
    bk - β̄k |X ~ N(0, σ² ((X'X)^{-1})_kk)        (6)
where ((X'X)^{-1})_kk is the (k,k) element of (X'X)^{-1}.
Standardizing bk - β̄k gives
    zk ≡ (bk - β̄k) / sqrt(σ² ((X'X)^{-1})_kk)  ~  N(0, 1)        (7)
Suppose for the moment that σ² is known (obviously this is an unrealistic assumption).
Then the statistic zk has some desirable properties:
1. its value can be calculated from the sample;
2. the conditional distribution zk|X ~ N(0,1) does not depend on X ⇒ zk and X are independently distributed and the unconditional distribution zk ~ N(0,1) is the same as the conditional one;
3. the distribution of zk is known and does not depend on unknown nuisance parameters;
4. using zk, we can determine whether or not the sampling error bk - β̄k is too large.
However, σ² is typically not known.
t-ratio: replace in equation (7) the nuisance parameter σ² by its OLS estimate s² = e'e/(n - K).
In this way we obtain the following test statistic:
    tk ≡ (bk - β̄k) / sqrt(Var̂(b|X)_kk) = (bk - β̄k) / sqrt(s² ((X'X)^{-1})_kk) = (bk - β̄k)/SE(bk)        (8)

Proposition 1.1.d: Under assumptions 1.1-1.4,
    Cov(b, e|X) = O
where e = y - Xb.
Proof: see assignment 2.

Proposition 1.3 (distribution of the t-ratio): Suppose assumptions 1.1-1.5 hold. Under the null hypothesis H0: βk = β̄k, the t-ratio defined as in equation (8) is distributed as t(n - K) (the t-distribution with n - K degrees of freedom).
Some facts:
Fact 1: If
  1. x ~ N(0, 1),
  2. y ~ χ²(m),
  3. x and y are independent,
then the ratio x / sqrt(y/m) has the t-distribution with m degrees of freedom.
Fact 2: If x ~ N(0, In) and A is symmetric and idempotent, then x'Ax ~ χ²(trace(A)).
Fact 3: If x and y are independently distributed, then so are f(x) and g(y).

Proof of proposition 1.3
We can rewrite equation (8) as follows:
    tk = (bk - β̄k) / sqrt(σ²((X'X)^{-1})_kk) · 1/sqrt(s²/σ²)
       = zk / sqrt(s²/σ²) = zk / sqrt( (e'e/σ²)/(n - K) ) = zk / sqrt( q/(n - K) )        (9)
where q = e'e/σ².
We know that zk ~ N(0, 1). We will show:
(1) q|X ~ χ²(n - K);
(2) the two random variables zk and q are independent conditional on X.
Then, due to fact 1, tk (cf. equation (9)) is distributed as t with n - K degrees of freedom, and we have proven proposition 1.3.
(1) Since e'e = ε'Mε (see the sheets of lecture 2), we have
        q = e'e/σ² = (ε/σ)'M(ε/σ)
    where M = In - X(X'X)^{-1}X' is symmetric and idempotent. We already know (proposition 1.2) that
        trace(M) = n - K
    Also, due to equation (3),
        ε/σ |X ~ N(0, In)
    Due to fact 2, q|X ~ χ²(trace(M)), i.e. q|X ~ χ²(n - K).
(2) Both b - β = (X'X)^{-1}X'ε and e = Mε are linear functions of ε. So
    - b and e are jointly normally distributed conditional on X;
    - from proposition 1.1.d, we know that b and e are uncorrelated conditional on X (see part (d) of proposition 1.1);
    - these two properties imply that b and e are independently distributed conditional on X.
    Notice that 1) zk is a function of b and 2) q is a function of e.
    So zk and q are independently distributed conditional on X (cf. fact 3).
q.e.d.
1.3  Decision rule of the t-test

Step 1: Formulate H0 and H1, e.g. H0: βk = β̄k and H1: βk ≠ β̄k (two-sided test).
Step 2: Specify the significance level (size) α. Typically, one chooses α = 0.10, 0.05 or 0.01.
Step 3: Given the hypothesized value β̄k of βk, form the t-ratio as in (8).
Step 4: Go to the t-table and look up the critical value t_{α/2}(n - K).
  Find the critical value such that the area in the t distribution to the right of t_{α/2}(n - K) is α/2, see figure 1.3.
  If n - K is large, the Student t distribution is very similar to the standard normal distribution.
  If n - K = 900 and α = 0.05, then t_{α/2}(n - K) = 1.96.
  Then, since the t distribution is symmetric around 0 (see figure 1.3),
      P( -t_{α/2}(n - K) < tk < t_{α/2}(n - K) ) = 1 - α
Step 5: Do not reject H0 if
      -t_{α/2}(n - K) < tk < t_{α/2}(n - K)
  Reject H0 otherwise (accept H1).

1.4  Confidence interval

Step 5 can also be stated in terms of bk and SE(bk).
Since tk is as in (8), you do not reject H0 whenever
    -t_{α/2}(n - K) < (bk - β̄k)/SE(bk) < t_{α/2}(n - K)
or
    bk - t_{α/2}(n - K) SE(bk) < β̄k < bk + t_{α/2}(n - K) SE(bk)
Therefore, we do not reject H0 if and only if the hypothesized value β̄k falls in the interval presented above.
This interval is called the level 1 - α confidence interval.
It is narrower the smaller the standard error.
Thus, the smallness of the standard error is a measure of the estimator's precision.

1.5  p-Value

The decision rule of the t-test can also be stated using the p-value.
Steps 1, 2, 3: same as above.
Step 4: Rather than finding the critical value, calculate
    p = P(t > |tk|) × 2
Since the t distribution is symmetric around 0, P(t < -|tk|) = P(t > |tk|), so
    P( -|tk| < t < |tk| ) = 1 - p        (10)
Step 5: Do not reject H0 if p > α. Otherwise, reject H0 and accept H1.
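The steps above can be carried out by hand after any regression. The sketch below uses Stata's built-in auto data (illustrative variables, not the course dataset) and reproduces the t-ratio, the two-sided p-value and the 95% confidence bounds that Stata prints automatically.

* Manual t-ratio, p-value and confidence interval for H0: beta_weight = 0
sysuse auto, clear
quietly regress price weight mpg
scalar tk    = (_b[weight] - 0) / _se[weight]
scalar pval  = 2 * ttail(e(df_r), abs(tk))                    // P(|t| > |tk|), n-K df
scalar lower = _b[weight] - invttail(e(df_r), 0.025) * _se[weight]
scalar upper = _b[weight] + invttail(e(df_r), 0.025) * _se[weight]
display "t = " tk "   p = " pval "   95% CI: [" lower ", " upper "]"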

1.6  Example t-test

We again consider the following model:
    log(wagei) = β1 + β2 educi + β3 experi + β4 experi² + εi        (11)
where experi denotes the years of experience in the labor market.
I again consider the (NLSY) dataset (wave 1976), which only covers young employees (wage2.dta).
Regression results:
. reg log_wage educ exper expersq

      Source |       SS       df       MS              Number of obs =     935
-------------+------------------------------           F(  3,   931) =   46.75
       Model |   21.688779     3  7.22959299           Prob > F      =  0.0000
    Residual |  143.967504   931  .154637491           R-squared     =  0.1309
-------------+------------------------------           Adj R-squared =  0.1281
       Total |  165.656283   934  .177362188           Root MSE      =  .39324

------------------------------------------------------------------------------
    log_wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   .0779866   .0066242    11.77   0.000     .0649864    .0909867
       exper |    .016256     .01354     1.20   0.230    -.0103164    .0428285
     expersq |    .000152    .000567     0.27   0.789    -.0009607    .0012647
       _cons |   5.517432   .1248186    44.20   0.000     5.272474    5.762391
------------------------------------------------------------------------------

Research question: does schooling (educi) have an impact on wages?
The burden of proof lies with the statistician, so H0: β2 = 0, H1: β2 ≠ 0 (step 1).
I choose α = 0.05 (significance level).
From the regression output, we can calculate t2:
    t2 = (b2 - 0)/se(b2) = .0779866/.0066242 = 11.77
see the column "t" of the regression output.
Critical value: t_{0.025}(931) = 1.96.
|t2| > t_{0.025}(931), so we reject H0.
If H0: β2 = 0 is rejected in favor of H1 at the 5% level, we usually say that years of schooling (educ) is statistically significant, or that b_educ is statistically different from zero, at the 5% level.
If H0 is not rejected, we say that years of schooling is statistically insignificant at the 5% level.
One also says that educ is individually significant at the 5% level.
The estimated education coefficient indicates a sizable return to schooling of 7.8%. This result indicates strong economic significance of schooling.
However, the estimated β2 parameter is unlikely to measure a causal effect because of ability bias (violation of assumption 1.2, strict exogeneity).
In the column P>|t|, p-values are reported for the tests βk = 0, k = 1, ..., K.
It appears that the p-value of t2 equals 0.000. Since this p-value is smaller than α = 0.05, we reject H0.
From the p-value, we can directly see that educ is statistically significant even at the 1% level.
Obviously, one can also test a null hypothesis like H0: β2 = 0.10. This again could involve the computation of the test statistic.
However, 0.10 is not in the 95% confidence interval (see the Stata output), so we can reject this H0 directly (without computing the t statistic).

Notice that the variables exper and expersq are not individually significant at the 5% level.
Can we conclude that years of experience does not affect wage formation?
ANSWER: NO!!!!
In that case, one has to test the joint hypothesis H0: β3 = 0, β4 = 0 by means of an F-test.

1.7  Linear hypotheses

The null hypothesis need not be a restriction about individual regression coefficients; it is often about linear combinations of them, written as a system of linear equations:
    Rβ = r        (12)
where R and r are known and specified by the hypothesis.
R is a (#r x K) matrix, where #r denotes the number of restrictions on the regression coefficients.
r is a (#r x 1) vector.
H1: H0 is not true.
Obviously, the matrix R should have full row rank, i.e. rank(R) = #r.
In other words, one should not include redundant equations (restrictions).
Obviously, #r ≤ K.

Example revisited
We would like to check whether years of experience has an effect on the hourly wage rate (cf. equation (11)), i.e.
    H0: β3 = 0, β4 = 0
In this case,
    R = [ 0 0 1 0 ]      and    r = [ 0 ]
        [ 0 0 0 1 ]                 [ 0 ]
Notice that the matrix R has full row rank (see the Stata sketch below).

One could also formulate H0 as:
    H0: β3 = 0, β4 = 0, β3 = β4
Notice that one of these restrictions is redundant.
In this case the matrices R and r are equal to
    R = [ 0 0 1  0 ]     and    r = [ 0 ]
        [ 0 0 0  1 ]                [ 0 ]
        [ 0 0 1 -1 ]                [ 0 ]
Here rank(R) = 2 < 3, because the third row can be written as a linear combination of the first and second rows; the redundant restriction should be dropped.
Obviously, one can also add a restriction on the educ coefficient, e.g. β2 = 0.05. In this case R and r become
    R = [ 0 0 1 0 ]      and    r = [ 0    ]
        [ 0 0 0 1 ]                 [ 0    ]
        [ 0 1 0 0 ]                 [ 0.05 ]
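As a sketch, the linear hypotheses above can be passed to Stata's test command after the wage regression (this assumes the data are set up as in the lecture's do-file; test builds R and r from the stated restrictions):

* Joint restrictions expressed with -test- after the regression from the example
quietly regress log_wage educ exper expersq
test (exper = 0) (expersq = 0)                   // R with two rows, r = (0, 0)'
test (exper = 0) (expersq = 0) (educ = 0.05)     // adds the restriction beta_2 = 0.05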

1.8  The F-test

Proposition 1.4 (distribution of the F-ratio): Suppose assumptions 1.1-1.5 hold. Under the null hypothesis H0: Rβ = r, where R is (#r x K) with rank(R) = #r, the F-ratio defined as
    F ≡ (Rb - r)'[R(X'X)^{-1}R']^{-1}(Rb - r)/#r / s²
      = (Rb - r)'[R Var̂(b|X) R']^{-1}(Rb - r)/#r        (13)
is distributed as F(#r, n - K) (the F distribution with #r and n - K degrees of freedom).
As in proposition 1.3, it suffices to show that F|X ~ F(#r, n - K); because the F distribution does not depend on X, it is also the unconditional distribution of the statistic.
Fact 4: Let x be an m-dimensional random vector. If x ~ N(μ, Σ) with Σ nonsingular, then
    (x - μ)'Σ^{-1}(x - μ) ~ χ²(m)

Proof of proposition 1.4
Since s² = e'e/(n - K), we can write F as
    F = (w/#r) / (q/(n - K))
where q = e'e/σ² and
    w = (Rb - r)'[σ²R(X'X)^{-1}R']^{-1}(Rb - r)
We need to show:
1) w|X ~ χ²(#r);
2) q|X ~ χ²(n - K) (this is part 1 in the proof of proposition 1.3);
3) w and q are independently distributed conditional on X.
Then, by the definition of the F distribution, F ~ F(#r, n - K).
1) Let v = Rb - r. Under H0: Rb - r = R(b - β).
   We already know that b - β|X ~ N(0, σ²(X'X)^{-1}).
   So conditional on X, v is normal with mean 0 and variance
       var(v|X) = var(R(b - β)|X) = R var(b - β|X) R' = σ²R(X'X)^{-1}R'
   which is the inverse of the middle matrix in the quadratic form for w. Hence
       w = v' var(v|X)^{-1} v
   Since R is of full row rank and X'X is nonsingular (assumption 1.3), var(v|X) = σ²R(X'X)^{-1}R' is nonsingular (and we can therefore take the inverse in the equation above).
   Moreover, due to fact 4, w|X ~ χ²(#r).
3) w is a function of b and q is a function of e.
   But b and e are independently distributed conditional on X (see the proof of proposition 1.3).
   So w and q are independently distributed conditional on X.
q.e.d.
If H0: Rβ = r is true, we expect Rb - r to be small.
So a large F should be taken as evidence of a failure of the null.
We therefore look at the right tail of the F distribution, see figure 1.4 of the book.

Steps of the F-test when using the critical value

Step 1: Formulate H0 and H1 (see above).
Step 2: Specify α.
Step 3: Calculate the F-ratio by formula (13).
Step 4: Go to the table of the F distribution and look up the entry for #r (the numerator degrees of freedom) and n - K (the denominator degrees of freedom). Find the critical value F_α(#r, n - K) as illustrated in figure 1.4.
  For example, when #r = 3, n - K = 30 and α = 0.05, then F_{0.05}(3, 30) = 2.92.
Step 5: Do not reject H0 if F < F_α(#r, n - K); otherwise reject H0.

Steps of the F-test when using p-values
Step 1: same
Step 2: same
Step 3: same
Step 4: Calculate p = the area of the upper tail of the F distribution to the right of the F-ratio.
Step 5: Do not reject H0 if p > α; reject otherwise.

1.9  A more convenient expression for F

The above derivation of the F-ratio is by the Wald principle: it is based on the unrestricted estimator b (see equation (13)).
A more convenient formula is available, involving two different sums of squared residuals:
1. the unrestricted sum of squares SSR_U = e'e, based on the unrestricted regression (b = OLS estimator of the unrestricted model);
2. the restricted sum of squares SSR_R = SSR(β̃), obtained from
       min_β SSR(β)   s.t.  Rβ = r        (14)
   (β̃ is the solution of this minimization problem).
Finding the β̃ that achieves this constrained minimization is called the restricted regression or restricted least squares.
Proposition 1.4.a: The F-ratio (13) can be rewritten as
    F = (SSR_R - SSR_U)/#r / ( SSR_U/(n - K) ) = (SSR_R - SSR_U)/#r / s²        (15)
This second derivation of the F-ratio is said to be by the likelihood-ratio principle (one needs to estimate both the restricted and the unrestricted model).
Equation (15) can be rewritten as follows:
    F = (R²_U - R²_R)/#r / ( (1 - R²_U)/(n - K) )        (16)
See the Stata program lecture4_wage1.do for an example of an application of the F-test; a sketch based on the two SSRs follows below.
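The following sketch computes F from the restricted and unrestricted sums of squared residuals, as in equation (15). It assumes the wage data and variables are set up as in the lecture's do-file (lecture4_wage1.do itself may differ in detail); the result should match what -test exper expersq- reports.

* F-test of H0: beta_3 = beta_4 = 0 via restricted vs. unrestricted SSR
quietly regress log_wage educ exper expersq        // unrestricted model
scalar ssr_u = e(rss)
scalar df_u  = e(df_r)
quietly regress log_wage educ                      // restricted model (exper, expersq dropped)
scalar ssr_r = e(rss)
scalar F = ((ssr_r - ssr_u)/2) / (ssr_u/df_u)
display "F(2, " df_u ") = " F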

Proof of proposition 1.4.a

Step 1: Derive the restricted least squares estimator.
Lagrangean of minimization problem (14):
    L(β̃, λ) = 0.5 (y - Xβ̃)'(y - Xβ̃) + λ'(r - Rβ̃)
            = 0.5 (y'y - 2y'Xβ̃ + β̃'X'Xβ̃) + λ'(r - Rβ̃)
First-order conditions (check!):
    X'Xβ̃ = X'y - R'λ   ⇒   β̃ = b - (X'X)^{-1}R'λ        (17)
    Rβ̃ = r                                               (18)
Premultiplying equation (17) by R and using equation (18) gives
    Rb - r = R(X'X)^{-1}R'λ   ⇒   λ = (R(X'X)^{-1}R')^{-1}(Rb - r)
Substituting this into equation (17):
    b - β̃ = (X'X)^{-1}R'(R(X'X)^{-1}R')^{-1}(Rb - r)        (19)

Step 2: Derivation of formula (15), given (13).
Let ε̃ = y - Xβ̃, the residuals of the restricted regression.
Obviously ε̃ = y - Xb + X(b - β̃) = e + X(b - β̃).
Then
    SSR_R = ε̃'ε̃ = (e + X(b - β̃))'(e + X(b - β̃))
          = SSR_U + 2(b - β̃)'X'e + (b - β̃)'(X'X)(b - β̃)
    SSR_R - SSR_U = (b - β̃)'(X'X)(b - β̃)                          (since X'e = 0)
    SSR_R - SSR_U = (Rb - r)'(R(X'X)^{-1}R')^{-1}(Rb - r)        (20)
In the last step I use equation (19) (I leave this check as an exercise).
Substituting (20) into (13) gives (notice that s² = SSR_U/(n - K)):
    F = (SSR_R - SSR_U)/#r / ( SSR_U/(n - K) )
q.e.d.
1.10  t versus F

Because hypotheses about individual coefficients are linear hypotheses, the t-test of H0: βk = β̄k is a special case of the F-test. To see this, note that the hypothesis can be written as Rβ = r with
    R = [0, ..., 0, 1, 0, ..., 0]   (1 in the k-th position),    r = β̄k
So by (13) the F-ratio is
    F = (bk - β̄k) [s² ((X'X)^{-1})_kk]^{-1} (bk - β̄k)
which is the square of the t-ratio in (8). So the F(1, n - K) statistic equals the square of the t(n - K) statistic.
. reg log_wage educ exper expersq

      Source |       SS       df       MS              Number of obs =     935
-------------+------------------------------           F(  3,   931) =   46.75
       Model |   21.688779     3  7.22959299           Prob > F      =  0.0000
    Residual |  143.967504   931  .154637491           R-squared     =  0.1309
-------------+------------------------------           Adj R-squared =  0.1281
       Total |  165.656283   934  .177362188           Root MSE      =  .39324

------------------------------------------------------------------------------
    log_wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   .0779866   .0066242    11.77   0.000     .0649864    .0909867
       exper |    .016256     .01354     1.20   0.230    -.0103164    .0428285
     expersq |    .000152    .000567     0.27   0.789    -.0009607    .0012647
       _cons |   5.517432   .1248186    44.20   0.000     5.272474    5.762391
------------------------------------------------------------------------------

. test exper expersq

 ( 1)  exper = 0
 ( 2)  expersq = 0

       F(  2,   931) =   17.95
            Prob > F =    0.0000

Notice that in the example the variables exper and expersq are individually insignificant (cf. the t-values). From this finding you should not conclude that you can drop the variables exper and expersq from the specification.
Perform the F-test first!
The F-test indicates that the two variables are jointly significant.
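For a single restriction the F statistic is just the squared t-ratio, which can be checked directly (sketch, assuming the wage regression above is still in memory):

* F(1, n-K) equals the square of the t(n-K) statistic for one restriction
quietly regress log_wage educ exper expersq
test educ                                         // one restriction: beta_2 = 0
display "t^2 = " (_b[educ]/_se[educ])^2 "   F = " r(F)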

Recapitulation of the assumptions

1. Assumption 1.1 (linearity): the relationship between the dependent variable and the regressors is linear:
       yi = β1 xi1 + β2 xi2 + ... + βK xiK + εi,    i = 1, ..., n
2. Assumption 1.2 (strict exogeneity):
       E(εi|X) = 0,    i = 1, ..., n        (21)
3. Assumption 1.3 (no perfect multicollinearity): the column rank of the (n x K) data matrix X is K with probability 1.
4. Assumption 1.4 (spherical error variance):
       E(εε'|X) = σ²In
5. Assumption 1.5 (normality of the error term): the distribution of ε|X is jointly normal.

Proposition 1.1.a: Under assumptions 1.1-1.3, E(b|X) = β.
Proposition 1.1.b: Under assumptions 1.1-1.4, Var(b|X) = σ²(X'X)^{-1}.
Gauss-Markov theorem.

Introduction to Econometrics
Coordinator: prof. dr. Rob Alessie
email: r.j.m.alessie@rug.nl
office: DUI 727, tel: 050-3637240
February 16, 2016

Asymptotic properties of the OLS estimator

Hayashi, chapter 2: 2.1 (except Lemma 2.5), 2.3, 2.4 (except pages 120-122), 2.6, 2.5, 2.7, 1.6, 2.8
Organization of the lecture
1. Overview Chapter 1
2. Large sample theory
3. Review of limit theorems (section 2.1 Hayashi)
   - Relation among modes of convergence (section 2.1)
   - Viewing estimators as sequences of random variables (section 2.1)
   - Laws of large numbers and central limit theorems (section 2.1)
4. Large sample distribution of the OLS estimator b (section 2.3)
5. Hypothesis testing (section 2.4)
6. Conditional homoskedasticity
   - Implications
   - Testing for conditional homoskedasticity
   - GLS (WLS)

Overview Chapter 1

Assumptions
Assumption 1.1 (linearity): the relationship between the dependent variable and the regressors is linear:
    yi = β1 xi1 + β2 xi2 + ... + βK xiK + εi,    i = 1, ..., n
Assumption 1.2 (strict exogeneity):
    E(εi|X) = 0,    i = 1, ..., n
Assumption 1.3 (no perfect multicollinearity): the column rank of the (n x K) data matrix X is K with probability 1.
Assumption 1.4 (spherical error variance):
    E(εε'|X) = σ²In;  no serial correlation, conditional homoskedasticity
Assumption 1.5 (normality of the error term): the distribution of ε|X is jointly normal.

Results Chapter 1
- Under assumptions 1.1-1.3, b is an unbiased estimator of β.
- Under assumptions 1.1-1.4,
      var(b|X) = σ²(X'X)^{-1}
- Gauss-Markov theorem: the OLS estimator is BLUE.
- s² = e'e/(n - K) is an unbiased estimator for σ².
- Under assumptions 1.1-1.5, b|X is normally distributed even in finite samples. Consequently, one can use t- and F-tests.

Large sample theory

The finite-sample theory of OLS breaks down if one of the following three assumptions is violated:
1. Assumption 1.2: strict exogeneity
2. Assumption 1.5: normality of the error term
3. Assumption 1.4: spherical error variance
Consequences of relaxing assumption 1.4:
- The Gauss-Markov theorem no longer holds for the OLS estimator b.
- The BLUE is some other estimator (the GLS estimator, to be discussed later in the course).
- The t-ratio is not distributed as the t distribution. Thus, the t-test is no longer valid. The same comments apply to the F-test.
- However, the OLS estimator is still unbiased, because the unbiasedness result (proposition 1.1(a)) does not require assumption 1.4.
Assumptions 1.2 (strict exogeneity) and 1.5 (normality) are also very strong.
Only weaker assumptions (to be specified below) are needed to derive an approximation to the distribution of the OLS estimator b which is valid in sufficiently large samples.
Moreover, the associated test statistics only require that the sample size is sufficiently large.

2.1  Various modes of convergence

A sequence (z1, z2, ..., zn) of random variables is denoted by {zn}.

1) Convergence in probability
A sequence of random scalars {zn} converges in probability to a constant (nonrandom) α if, for any ε > 0,
    lim_{n→∞} Pr(|zn - α| > ε) = 0        (1)
The constant α is called the probability limit of zn, also written as plim zn = α or as zn →p α.
This definition of convergence in probability is extended to a sequence of random vectors or random matrices. That is, a sequence of K-dimensional random vectors {zn} converges in probability to a K-dimensional vector of constants α if, for any ε > 0,
    lim_{n→∞} Pr(|znk - αk| > ε) = 0,   for all k = 1, ..., K        (2)

2) Almost sure convergence
A sequence of random scalars {zn} converges almost surely to a constant α if
    Pr( lim_{n→∞} zn = α ) = 1        (3)
We write this as zn →a.s. α.
The extension to random vectors (matrices) is obvious.
The concept of almost sure convergence is stronger than convergence in probability:
    zn →a.s. α  ⇒  zn →p α

3) Convergence in mean square
A sequence of random scalars {zn} converges in mean square to a constant α if
    lim_{n→∞} E[(zn - α)²] = 0        (4)
We write this as zn →m.s. α.

Convergence to a random vector
The limit can also be a random vector z. We say that a sequence of K-dimensional random vectors {zn} converges to a K-dimensional random vector z if {zn - z} converges to 0:
    zn →p z  is the same as  zn - z →p 0,  or  zn = z + op(1)
where op(1) is some suitable random vector (here zn - z) that converges to 0 in probability.
Likewise,
    zn →a.s. z  is the same as  zn - z →a.s. 0
and
    zn →m.s. z  is the same as  zn - z →m.s. 0

4) Convergence in distribution
Let {zn} be a sequence of random scalars and Fn be the cumulative distribution function (c.d.f.) of zn.
We say that {zn} converges in distribution to a random scalar z if
    lim_{n→∞} Fn(z) = F(z)
for all points z where F(.) is continuous.
We write zn →d z and call F the asymptotic (or limiting, or limit) distribution of zn.
Sometimes we write zn →d F when the distribution F is well known, e.g. zn →d N(0, 1).
It can be shown that
    zn →p z  ⇒  zn →d z
If z is a constant α,
    zn →d α  ⇒  zn →p α        (5)
The extension to a sequence of random vectors is immediate: zn →d z if the joint c.d.f. Fn of the random vector zn converges to the joint c.d.f. F of z at every continuity point of F.
Note, however, that, unlike the other concepts of convergence, for convergence in distribution element-by-element convergence does not necessarily imply convergence of the vector sequence. That is,
    znk →d zk, k = 1, ..., K   does not imply   zn →d z
A common way to establish the connection between scalar convergence and vector convergence in distribution is the
Multivariate Convergence in Distribution Theorem: Let {zn} be a sequence of K-dimensional random vectors. Then
    zn →d z  ⇔  λ'zn →d λ'z   for any K-dimensional vector of real numbers λ.

2.2  Relation among modes of convergence

Lemma 2.2 (relationship among the four modes of convergence):
(a) zn →m.s. α  ⇒  zn →p α.  So zn →m.s. z  ⇒  zn →p z.
(b) zn →a.s. α  ⇒  zn →p α.  So zn →a.s. z  ⇒  zn →p z.
(c) zn →p α  ⇒  zn →d α.

Lemma 2.3 (preservation of convergence for continuous transformation): Suppose a(.) is a vector-valued continuous function that does not depend on n.
(a) zn →p α  ⇒  a(zn) →p a(α), or stated differently,
        plim_{n→∞} a(zn) = a(plim zn)
    provided that plim(zn) exists.
(b) zn →d z  ⇒  a(zn) →d a(z).

Lemma 2.4 (parts (a) and (c) are called Slutsky's theorem):
(a) xn →d x, yn →p α  ⇒  xn + yn →d x + α
(b) xn →d x, yn →p 0  ⇒  yn'xn →p 0
(c) xn →d x, An →p A  ⇒  An xn →d Ax, provided that the matrix An and the vector xn are conformable.
    In particular, if x ~ N(0, Σ), then An xn →d N(0, AΣA').
(d) xn →d x, An →p A  ⇒  xn'An^{-1}xn →d x'A^{-1}x, provided that A is nonsingular.

By setting α = 0, part (a) implies:
    xn →d x, yn →p 0  ⇒  zn = xn + yn →d x        (6)
We say that the two sequences {zn} and {xn} are asymptotically equivalent: when yn = zn - xn →p 0, then zn ≃ xn, or zn = xn + op(1).
A standard trick in deriving the asymptotic distribution of a sequence of random variables is to find an asymptotically equivalent sequence whose asymptotic distribution is easier to derive.
In particular, by replacing yn by yn - α in part (b) of lemma 2.4, we obtain
    xn →d x, yn →p α  ⇒  yn'xn ≃ α'xn,  i.e.  yn'xn = α'xn + op(1)        (7)
Therefore, replacing yn by its probability limit does not change the asymptotic distribution of yn'xn, provided xn converges in distribution to some random vector x.

2.3  Viewing estimators as sequences of random variables

The most frequently used modes of convergence are convergence in probability and convergence in distribution (cf. lemmas 2.3 and 2.4).
Let θ̂n be an estimator of θ based on a sample of size n.
The sequence {θ̂n} is an example of a sequence of random variables, so the concepts introduced for sequences (e.g. convergence in probability) are applicable to {θ̂n}.
We say that {θ̂n} is consistent for θ if plim_{n→∞} θ̂n = θ.
The asymptotic bias of θ̂n is defined as plim_{n→∞} θ̂n - θ.
So the estimator θ̂n is consistent if the asymptotic bias is zero.
For example, if θ̂n is an unbiased estimator for the parameter θ and var(θ̂n) → 0, then
    lim_{n→∞} E[(θ̂n - θ)²] = 0,
i.e. θ̂n →m.s. θ. Consequently, θ̂n →p θ (lemma 2.2a).
So in that case θ̂n is a consistent estimator for θ.
A consistent estimator θ̂n is asymptotically normal if
    √n (θ̂n - θ)  →d  N(0, Σ)
Such an estimator is called √n-consistent.
The variance matrix Σ is called the asymptotic variance, Avar(θ̂n).

2.4  Laws of large numbers and central limit theorems

For a sequence of random scalars {zi}, the sample mean z̄n is defined as
    z̄n = (1/n) Σ_{i=1}^n zi
Consider the sequence {z̄n}. Laws of large numbers (LLNs) concern conditions under which {z̄n} converges either in probability or almost surely.
An LLN is called strong if the convergence is almost sure and weak if the convergence is in probability.
Part (a) of lemma 2.2 implies (cf. analytical exercise 2 on page 168):
A version of Chebychev's weak LLN:
    lim_{n→∞} E(z̄n) = μ,  lim_{n→∞} var(z̄n) = 0   ⇒   z̄n →m.s. μ
Moreover, lemma 2.2.a implies that z̄n →m.s. μ ⇒ z̄n →p μ.
The following strong LLN assumes that {zi} is i.i.d. (independently and identically distributed), but the variance does not need to be finite.
Kolmogorov's second strong law of large numbers: Let {zi} be i.i.d. with E(zi) = μ. Then z̄n →a.s. μ. Moreover, lemma 2.2.b implies that z̄n →p μ.
These LLNs extend readily to random vectors by requiring element-by-element convergence.

Simulation example: estimation of the population mean by means of the sample average
- 10,000 different random samples of size 30, (x1, ..., x30), drawn from a χ²(1) distribution (E(xi) = 1, var(xi) = 2)  ⇒  10,000 sample averages x̄30
- Likewise, 10,000 different samples of size 4 drawn  ⇒  10,000 sample averages x̄4
- Also, 10,000 sample averages x̄500
The table below confirms that the sample average is an unbiased estimator for the population mean:
    E[x̄n] = E(xi) = 1
irrespective of the sample size (4, 30 or 500). Moreover, it holds that
    var(x̄n) = σ²/n

. sum
    Variable |    Obs        Mean    Std. Dev.        Min        Max
-------------+--------------------------------------------------------
     xbar500 |  10000    1.000665    .0637855   .7707955   1.265255
       xbar4 |  10000    .9980116    .7081222   .0026005   5.855723
      xbar30 |  10000    1.000482    .2595308   .2922013   2.330844

Figure 1 confirms that x̄n is consistent for μ (confirmation of the LLN).

[Figure 1: The sampling distribution of the sample mean for 3 sample sizes (n = 4, n = 30, n = 500); horizontal axis: xbar, vertical axis: density.]
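A sketch of how such a simulation can be set up in Stata is given below. It is written in the spirit of the lecture's simulation program (the program name and stored-result names are illustrative assumptions, not the actual contents of the course do-file).

* LLN illustration: 10,000 sample averages of chi-square(1) data for n = 4, 30, 500
clear all
set seed 10101
program define chi2mean, rclass
    drop _all
    set obs 500
    gen x = rchi2(1)                       // E(x) = 1, var(x) = 2
    summarize x in 1/4,  meanonly
    return scalar xbar4   = r(mean)
    summarize x in 1/30, meanonly
    return scalar xbar30  = r(mean)
    summarize x,         meanonly
    return scalar xbar500 = r(mean)
end
simulate xbar4=r(xbar4) xbar30=r(xbar30) xbar500=r(xbar500), reps(10000) nodots: chi2mean
summarize xbar4 xbar30 xbar500             // means close to 1, sd shrinking with n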

Central limit theorems (CLTs) are about the limiting behavior of the difference between z̄n and E(z̄n) (which equals E(zi) if {zi} is i.i.d.), blown up by √n.
The only central limit theorem we need for the case of i.i.d. sequences is the
Lindeberg-Levy CLT: Let {zi} be i.i.d. with E(zi) = μ and var(zi) = Σ. Then
    √n (z̄n - μ) = (1/√n) Σ_{i=1}^n (zi - μ)  →d  N(0, Σ)
Usually, the Lindeberg-Levy CLT is stated for a sequence of scalar random variables. The vector version displayed above is derived from the scalar version as follows.
Let λ be any real vector of the same dimension as zi. Then {λ'zi} is a sequence of scalar random variables with E(λ'zi) = λ'μ and var(λ'zi) = λ'Σλ. The scalar version of Lindeberg-Levy then implies that
    √n (λ'z̄n - λ'μ) = λ' √n (z̄n - μ)  →d  N(0, λ'Σλ)
But this limiting distribution is the distribution of λ'x, where x ~ N(0, Σ).
So by the Multivariate Convergence in Distribution Theorem, √n(z̄n - μ) →d x (cf. the vector version of the Lindeberg-Levy theorem).

Example continued
The table below and figures 2 and 3 confirm the validity of the CLT (Lindeberg-Levy), i.e. as the sample size increases the ratio
    √n (x̄n - μ) = √n (x̄n - 1)
converges in distribution to N(0, 2) (remember that in the example xi ~ χ²(1), so var(xi) = 2).

. tabstat clt*, stat(mean sd sk ku) long col(stat)

    variable |      mean        sd   skewness   kurtosis
-------------+-------------------------------------------
        clt4 | -.0039768  1.416244    1.37186   5.586039
       clt30 |  .0026414  1.421509   .5447799    3.43055
      clt500 |  .0148735  1.426287   .1219383   2.968487
----------------------------------------------------------

[Figure 2: The central limit theorem — densities of sqrt(n)(ybar - 1) for n = 4, n = 30 and n = 500.]

[Figure 3: The central limit theorem 2 — density of sqrt(500)(ybar - 1) compared with the normal density with variance 2.]

3  Large sample distribution of the OLS estimator b

3.1  The model and assumptions

We use the term data generating process (DGP) for the stochastic process that generated the finite sample (y, X).
Therefore, if we specify the DGP, the joint distribution of the finite sample (y, X) can be determined.
The model we study is the set of DGPs satisfying the following set of assumptions:
Assumption 2.1 (linearity):
    yi = xi'β + εi
where xi is a K-dimensional vector of regressors.
Assumption 2.2: {yi, xi} is a random sample, i.e. {yi, xi} is i.i.d. (independently and identically distributed) across observations.
Assumption 2.3 (predetermined regressors): All the regressors are predetermined in the sense that they are orthogonal to the contemporaneous error term:
    E(xik εi) = 0 for all i and k (= 1, 2, ..., K).
This can be written as
    E(gi) = E[xi εi] = E[xi (yi - xi'β)] = 0
where gi ≡ xi εi.
Assumption 2.4 (rank condition): The K x K matrix E(xi xi') is nonsingular (and hence finite). We denote this matrix by Σxx.
Assumption 2.5: S ≡ E(gi gi') = E(εi² xi xi') exists and is finite.

Comments on the assumptions

- Instead of assumption 2.2, Hayashi assumes that {yi, xi} is jointly stationary and ergodic.
  In the case of time series data, one cannot make the random sample assumption 2.2. Therefore, Hayashi works with those weaker assumptions.
  In the remainder, substitute the phrase "ergodic stationarity" by "i.i.d." (random sample assumption).
- The model accommodates conditional heteroskedasticity.
  If {yi, xi} is i.i.d. (cf. assumption 2.2), then the error term εi = yi - xi'β is also i.i.d. Thus, assumption 2.2 implies that E(εi²) = σ² is constant across i. That is, the error term is unconditionally homoskedastic.
  Yet the error can be conditionally heteroskedastic, in that the conditional second moment E(εi²|xi) can depend on xi.
- However, the random sample assumption excludes the possibility of serially correlated errors. In other words, E(εi εj|X) = 0 for i ≠ j in the case of random samples.
- E(xi εi) = 0 (orthogonality condition) vs. E(εi|xi) = 0 (conditional mean independence):
  The condition of conditional mean independence is stronger because it implies that any function f(xi) is orthogonal to εi:
      E[f(xi) εi] = E[ E(f(xi) εi | xi) ] = E[ f(xi) E(εi|xi) ] = 0
  In the case of random samples, no distinction needs to be made between the assumption E(εi|xi) = 0 and the assumption of strict exogeneity E(εi|X) = 0.
  In the case of time series data, the assumption of strict exogeneity is far stronger than that of contemporaneous exogeneity (E(εi|xi) = 0). (See the slides of lecture 1.)
- Suppose that the regressors include a constant (almost always the case). Then assumption 2.3 implies
  1. E(εi) = 0;
  2. Cov(xi, εi) = E(xi εi) = 0, i.e. all (non-constant) regressors xi are uncorrelated with the error term εi.
- Assumption 2.4:
  Since by the random sample assumption 2.2 {xi} is i.i.d., {xi xi'} is i.i.d., so by Kolmogorov's second strong law of large numbers:
      Sxx →a.s. Σxx
  where
      Sxx ≡ (1/n) Σ_{i=1}^n xi xi' = X'X/n
  So, for n sufficiently large, the sample cross moment of the regressors Sxx is nonsingular by assumptions 2.2 and 2.4.
  Since X'X/n nonsingular ⇒ rank(X) = K, assumption 1.3 (no perfect multicollinearity) is satisfied with probability one for sufficiently large n.
  Notice that for sufficiently large n the OLS estimate b = Sxx^{-1} sxy can be computed (sxy ≡ (1/n) Σ_{i=1}^n xi yi).
- Assumption 2.5 is different from assumption 2.5 of the book.
  In the case of a random sample (no time series) assumption 2.5 suffices.
  Notice that S is a matrix of fourth moments.
  Consistent estimation of S will require an additional assumption to be specified later (cf. section 2.5 of the book).

3.2  Asymptotic distribution of the OLS estimator b

Let
    gi = xi εi,    ḡ = (1/n) Σ_{i=1}^n gi = (1/n) Σ_{i=1}^n xi εi
Suppose that a consistent estimator, denoted Ŝ, of S (= Avar(ḡ) = E(εi² xi xi')) is available.
Note: Avar(ḡ) is the variance of the asymptotic distribution of √n ḡ.
Hayashi assumes ergodic stationarity instead of i.i.d. Nonetheless, the proofs of the theorems below are very similar.

Proposition 2.1.a (consistency of b for β): Under assumptions 2.1-2.4,
    plim_{n→∞} b = β
Proof:
The sampling error b - β can be written as
    b - β = (X'X)^{-1} X'ε
          = (X'X/n)^{-1} (X'ε/n)
          = ( (1/n) Σ_{i=1}^n xi xi' )^{-1} ( (1/n) Σ_{i=1}^n xi εi )
          = Sxx^{-1} ḡ        (8)
where gi = xi εi and ḡ = (1/n) Σ_{i=1}^n gi.
Notice that the sample means Sxx and ḡ depend on the sample size n!
- Since by assumption 2.2 {xi xi'} is i.i.d., and given assumption 2.4,
      Sxx →a.s. Σxx
  by Kolmogorov's law of large numbers (see above).
- By lemma 2.2.b, Sxx →a.s. Σxx ⇒ Sxx →p Σxx.
- Since Σxx is invertible by assumption 2.4,
      Sxx^{-1} →p Σxx^{-1}
  by lemma 2.3.
- Since by assumptions 2.2 and 2.1 {xi εi} is i.i.d.,
      ḡ →p E(xi εi) = 0
  by assumption 2.3 and Kolmogorov's law of large numbers (see above).
- So by lemma 2.3.a,
      b - β = Sxx^{-1} ḡ  →p  Σxx^{-1} · 0 = 0
So
    plim_{n→∞} b = β
q.e.d.

Proposition 2.1.b (asymptotic normality of b): Under assumptions 2.1, 2.2, 2.3, 2.4 and 2.5, the OLS estimator b is asymptotically normal with asymptotic variance
    Avar(b) = Σxx^{-1} S Σxx^{-1}        (9)
In other words,
    √n (b - β)  →d  N(0, Σxx^{-1} S Σxx^{-1})
Proof:
Rewrite (8) as
    √n (b - β) = Sxx^{-1} (√n ḡ)        (10)
Notice that
1. by assumptions 2.1 and 2.2, {gi} (= {xi εi}) is i.i.d.;
2. by assumption 2.3, E(gi) = 0;
3. by assumptions 2.5 and 2.3, var(gi) = E(gi gi') = S.
We can therefore apply the Lindeberg-Levy CLT and state that
    √n ḡ  →d  N(0, S)
So, by lemma 2.4.c,
    √n (b - β)  →d  N(0, Σxx^{-1} S (Σxx^{-1})')
Since Σxx is symmetric, this expression is equal to (9).
q.e.d.

Proposition 2.1.b also indicates that the OLS estimator b is approximately normally distributed with mean β and variance Σxx^{-1} S Σxx^{-1}/n. If n is large, this variance goes to zero.
Proposition 2.1.c (consistent estimate of Avar(b)): Suppose there is available a consistent estimator Ŝ of S (K x K). Then, under assumptions 2.2 and 2.4, Avar(b) (cf. formula (9)) is consistently estimated by
    Avar̂(b) = Sxx^{-1} Ŝ Sxx^{-1}        (11)
Proof: direct application of lemma 2.3.a (check).

3.3  s² is consistent

Proposition 2.2 (consistent estimation of the error variance): Let ei ≡ yi - xi'b be the OLS residual for observation i. Under assumptions 2.1, 2.2, 2.3 and 2.4,
    s² = (1/(n - K)) Σ_{i=1}^n ei²  →p  E(εi²)
provided E(εi²) exists and is finite (if the regression includes a constant term, assumption 2.5 guarantees that this condition is met).
Proof: see the book and review exercise 4 on page 116 (check!).

Hypothesis testing

Statistical inference in large-sample theory is based on test statistics whose asymptotic distributions are known under the truth of the null hypothesis.
The derivation of the distribution of test statistics relies on a large-sample approximation to the exact finite-sample distribution.
In this section we derive test statistics, assuming throughout that a consistent estimator Ŝ of S (= E(gi gi')) is available.
The issue of consistent estimation of S will be taken up later.

4.1  Testing a hypothesis about the k-th coefficient of β

Proposition 2.1 implies that under H0: βk = β̄k,
    √n (bk - β̄k) →d N(0, Avar(bk))   and   Avar̂(bk) →p Avar(bk)
where bk is the k-th element of b and Avar(bk) is the (k,k)-th element of the K x K matrix Avar(b). So lemma 2.4.c guarantees that
    tk ≡ √n (bk - β̄k)/sqrt(Avar̂(bk)) = (bk - β̄k)/SE*(bk)  →d  N(0, 1)        (12)
where
    SE*(bk) ≡ sqrt( (1/n) Avar̂(bk) ) = sqrt( (1/n) (Sxx^{-1} Ŝ Sxx^{-1})_kk )
The denominator of this t-ratio, SE*(bk), is called the heteroskedasticity-consistent standard error, the (heteroskedasticity-)robust standard error, or White's standard error.
The reason for this terminology is that the error term can be conditionally heteroskedastic; recall that we have not assumed conditional homoskedasticity (that E(εi²|xi) does not depend on xi) to derive the asymptotic distribution of tk. This t-ratio is called the robust t-ratio, to distinguish it from the t-ratio of chapter 1.
Differences from the finite-sample t-test are:
- The way the standard error is calculated is different.
- The table of N(0,1) can be used to find the critical values instead of t(n - K).
- The actual size or exact size of the test (the probability of a Type I error given the sample size) equals the nominal size (i.e., the desired significance level α) only approximately, although the approximation becomes very good as the sample size gets large. The difference between the exact size and the nominal size of a test is called the size distortion.

So far we have proved the first part of the next proposition.
Proposition 2.3 (robust t-ratio and Wald statistic): Suppose assumptions 2.1-2.5 hold and a consistent estimate Ŝ of S (= E(gi gi')) is available. As before, let
    Avar̂(b) = Sxx^{-1} Ŝ Sxx^{-1}
Then:
(a) Under H0: βk = β̄k, tk as defined in (12) →d N(0, 1).
(b) Under H0: Rβ = r, where R is an (#r x K) matrix of full row rank,
        W ≡ n (Rb - r)'[ R Avar̂(b) R' ]^{-1} (Rb - r)  →d  χ²(#r)        (13)
Proof of part (b):
Write W as
    W = cn' Qn^{-1} cn
where
    cn ≡ √n (Rb - r)   and   Qn ≡ R Avar̂(b) R'
Under H0, cn = R √n (b - β). So by proposition 2.1 (part b),
    cn →d c,   where c ~ N(0, R Avar(b) R')
Also, by proposition 2.1 (part c),
    Qn →p Q,   where Q = R Avar(b) R'
Because R is of full row rank and Avar(b) is positive definite, Q is invertible.
Therefore, by lemma 2.4.d,
    W →d c'Q^{-1}c
Since the #r-dimensional random vector c is normally distributed and since Q = var(c),
    c'Q^{-1}c ~ χ²(#r)
q.e.d.

2.6  Implications of conditional homoskedasticity

Assumption 2.7 (conditional homoskedasticity):
    E(εi²|xi) = σ²        (14)
This assumption implies that the unconditional second moment E(εi²) = σ² (unconditional homoskedasticity).
Given assumption 2.7, it can be shown that the matrix of fourth-order moments S (cf. assumption 2.5) is equal to
    S = E(gi gi') = E(xi xi' εi²)
      = E[ E(xi xi' εi²|xi) ]
      = E[ xi xi' E(εi²|xi) ]
      = E[ xi xi' σ² ]
      = σ² E(xi xi') = σ² Σxx        (15)
This equation implies that assumption 2.4 (E(xi xi') = Σxx exists and is finite) together with assumption 2.7 (conditional homoskedasticity) implies assumption 2.5. There is no need to make assumption 2.5 if one adds the assumption of conditional homoskedasticity.
By the random sample (i.i.d.) assumption 2.2, Sxx is a consistent estimator for Σxx: Sxx →p Σxx.
Since by proposition 2.2 s² →p σ², the following estimator for S is consistent (lemma 2.3.a):
    Ŝ = s² Sxx = s² (X'X)/n
Substituting (15) into (11) yields the following simplified expression for Avar(b):
    Avar(b) = σ² Σxx^{-1}        (16)
which can be estimated by
    Avar̂(b) = s² Sxx^{-1} = n s² (X'X)^{-1}        (17)
Substituting this expression into (12) yields the usual t-ratio (see chapter 1 of the book):
    tk = √n (bk - β̄k)/sqrt( n s² ((X'X)^{-1})_kk ) = (bk - β̄k)/SE(bk)        (18)
where SE(bk) denotes the usual standard error of bk.
The Wald statistic defined in equation (13) simplifies to (see page 128 of the book)
    W = #r F = (SSR_R - SSR_U)/s²        (19)
Basically, we have proved:
Proposition 2.5 (large-sample properties of b, t, and F under conditional homoskedasticity): Suppose assumptions 2.1, 2.2, 2.3, 2.4 and 2.7 are satisfied. Then
(a) The OLS estimator b is consistent for β and asymptotically normal with Avar(b) = σ² Σxx^{-1}.
(b) Avar(b) is consistently estimated by Avar̂(b) = s² Sxx^{-1} = n s² (X'X)^{-1}.
(c) Under H0: βk = β̄k, the usual t-ratio (see chapter 1 of the book) is asymptotically distributed as N(0, 1).
(d) Under H0: Rβ = r, #r·F is asymptotically χ²(#r), where F is the usual F statistic defined in chapter 1 of the book and #r is the number of restrictions in H0.

4.2  Variations of asymptotic tests under conditional homoskedasticity

You need the N(0, 1) table to find the critical value to be compared with the t-ratio (18), and the χ² table for the statistic #r·F derived from (19).
An asymptotically equivalent procedure (applied in Stata): retain the degrees of freedom n - K and use the t(n - K) table for the t-ratio and the F(#r, n - K) table for F, which is exactly the prescription of finite-sample theory.
This result is due to n - K ≈ n for large n and fixed K. In that case t(n - K) converges to N(0, 1) and F(#r, n - K) to χ²(#r)/#r (check).

Simulation example: OLS with χ²(1) errors

We consider the following DGP:
    yi = β1 + β2 xi + εi;    εi ~ χ²(1) - 1;    xi ~ χ²(1)
where β1 = 1 and β2 = 2, and the sample size is n = 150 or n = 1500.
For this DGP:
1. the error εi is independent of xi;
2. E(εi) = 0, var(εi) = 2, skewness √8 and kurtosis 15 (by contrast, a normal error has skewness 0 and kurtosis 3);
3. 1000 simulations are performed: b2, se2, the t-value of the t-test of H0: β2 = 2 and the outcome of a two-sided test of H0 at (nominal) significance level 0.05 are stored;
4. notice that this example satisfies assumptions 2.1-2.5; therefore we expect b2 to be a consistent estimator;
5. moreover, we expect that the t-ratio formulated above follows a N(0, 1) distribution (especially in the case n = 1500; the t distribution and the standard normal distribution almost coincide if n is large (n > 120));
6. see the Stata program simulate_mean.do for the simulations (a sketch of such a program is given below);
7. the simulation results indicate that
   (a) the mean of the point estimates is very close to the true value of 2 (irrespective of sample size); this confirms that OLS is an unbiased estimator;
   (b) the standard deviation of the point estimates is close to the mean of the standard errors;
   (c) the distribution of the t-ratio converges to a standard normal distribution as the sample size increases (the distribution becomes less right-skewed as the sample size increases); this is a confirmation of proposition 2.3.b (2.5b, see below);
   (d) the row reject2f indicates the fraction of rejections (out of 1000 simulations) of H0: β2 = 2; it basically measures the actual size of the test (the probability of a Type I error); the actual size is close to the nominal size (see below).

n = 150
. tabstat b2f se2f t2f reject2f, stat(mean sd sk ku min max) long col(stat)
    variable |      mean        sd   skewness   kurtosis        min        max
-------------+-----------------------------------------------------------------
         b2f |  2.000506    .08427   .5324175   4.206323   1.719513    2.40565
        se2f |  .0839776  .0172588   .4718556   3.067525   .0415919    .145264
         t2f |  .0028714  .9932668   .5252773   3.664809  -2.824061   4.556576
    reject2f |      .046  .2095899   4.334438   19.78735          0          1
---------------------------------------------------------------------------------

n = 1500
. tabstat b2f se2f t2f reject2f, stat(mean sd sk ku min max) long col(stat)
    variable |      mean        sd   skewness   kurtosis        min        max
-------------+-----------------------------------------------------------------
         b2f |  1.999467  .0266265   .3857573   3.136166   1.925347   2.103468
        se2f |  .0258293   .001761   .2088724   3.007599   .0201466    .031988
         t2f | -.0211082   1.02924   .3641272   3.093189  -3.016825   3.843584
    reject2f |      .052  .2221381   4.035545   17.28562          0          1
---------------------------------------------------------------------------------
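The do-file itself is not reproduced on the slides; the following is a minimal sketch of what a data-generating/estimation program in the spirit of chi2data could look like (program name, stored-result names and the rejection rule are assumptions):

* One replication: draw (x, y), run OLS, store b2, se2, t and the test outcome
program define chi2data, rclass
    drop _all
    set obs 150                                  // or 1500
    gen x = rchi2(1)
    gen y = 1 + 2*x + (rchi2(1) - 1)             // error = chi2(1) - 1: mean 0, var 2
    regress y x
    local t = (_b[x] - 2)/_se[x]
    return scalar b2  = _b[x]
    return scalar se2 = _se[x]
    return scalar t2  = `t'
    return scalar r2  = abs(`t') > invttail(e(df_r), 0.025)   // reject at nominal 5%
end
simulate b2f=r(b2) se2f=r(se2) t2f=r(t2) reject2f=r(r2), reps(1000) nodots: chi2data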

2.5  Estimating S = E(εi² xi xi') consistently

Now we do not assume conditional homoskedasticity.
We need to find a consistent estimator for S = E(εi² xi xi') in order to apply proposition 2.1 (asymptotic distribution of the OLS estimator) and proposition 2.3 (robust t-ratio and Wald statistic).
A natural candidate is the following estimator:
    Ŝ = (1/n) Σ_{i=1}^n ei² xi xi'        (20)
where ei denotes the OLS residual of observation i.
In order to prove consistency, we need to make the following assumption:
Assumption 2.6 (finite fourth moments for regressors): E[(xij xik)²] exists and is finite for all k, j (= 1, ..., K).
Proposition 2.4 (consistent estimation of S): Under assumptions 2.1, 2.2, 2.3, 2.4, 2.5 and 2.6, the estimator (20) is consistent for S = E(gi gi') = E(εi² xi xi').
In that case the estimated asymptotic variance becomes
    Avar̂(b) = Sxx^{-1} ( (1/n) Σ_{i=1}^n ei² xi xi' ) Sxx^{-1}        (21)
Proof for the special case K = 1:
Since K = 1, b, β and xi are scalars.
Assumption 2.5: S exists and is finite.
b →p β (proposition 2.1).
Obviously,
    ei² = ((yi - βxi) - (b - β)xi)²
        = (εi - (b - β)xi)²
        = εi² - 2(b - β) xi εi + (b - β)² xi²
Pre-multiplying this equation by xi² and averaging over i yields
    (1/n) Σ_{i=1}^n xi² ei² - (1/n) Σ_{i=1}^n xi² εi²
        = -2(b - β) (1/n) Σ_{i=1}^n xi³ εi + (b - β)² (1/n) Σ_{i=1}^n xi⁴        (22)
Obviously, given the random sample assumption,
    (1/n) Σ_{i=1}^n xi² εi²  →p  S
Assumption 2.6 ensures that E(xi⁴) is finite.
1) Then, by the random sample assumption, (1/n) Σ_{i=1}^n xi⁴ converges to a finite number.
2) This also holds for (1/n) Σ_{i=1}^n xi³ εi (see analytical exercise 4 on page 169 of the book).
Given 1) and 2) and b →p β, the right-hand side of equation (22) converges in probability to 0.
Then (cf. equation (22))
    Ŝ - (1/n) Σ_{i=1}^n xi² εi²  →p  0
or
    Ŝ - S →p 0,   i.e.   Ŝ →p S
q.e.d.
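The estimator (20)-(21) can be computed by hand and compared with Stata's built-in robust option. The sketch below uses the auto data (illustrative variables); note that the manual version omits the small-sample correction discussed in the next subsection, so it will differ from Stata's output by the factor n/(n - K).

* White variance estimate: (X'X)^{-1} (sum_i e_i^2 x_i x_i') (X'X)^{-1}
sysuse auto, clear
quietly regress price weight mpg
predict e, resid
matrix accum XX   = weight mpg                      // X'X (constant added automatically)
matrix accum XeeX = weight mpg [iweight = e^2]      // sum_i e_i^2 x_i x_i'
matrix V = invsym(XX) * XeeX * invsym(XX)           // estimate of Var(b) = Avar-hat(b)/n
matrix list V
regress price weight mpg, vce(robust)               // applies the n/(n-K) correction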

5.1  Finite sample considerations

Davidson and MacKinnon (1993) report that in finite samples the robust t-ratio based on (20) rejects the null too often.
They propose a degrees-of-freedom correction of formula (20) by the factor n/(n - K), i.e.
    Ŝ = ( n/(n - K) ) (1/n) Σ_{i=1}^n ei² xi xi'        (23)
Stata carries out this degrees-of-freedom correction.
Use the robust option in Stata: regress y x1 x2 x3, robust

Simulation example revisited with robust standard errors (see the Stata program):

. * Simulation for finite-sample properties of OLS (robust standard errors)
. simulate b2f=r(b2) se2f=r(se2) t2f=r(t2) reject2f=r(r2) p2f=r(p2), reps(1000)
    saving(helpdatares, replace) nolegend nodots: chi2data_robust

. * simulation results
. tabstat b2f se2f t2f reject2f, stat(mean sd sk ku min max) long col(stat)
    variable |      mean        sd   skewness   kurtosis        min        max
-------------+-----------------------------------------------------------------
         b2f |  1.999467  .0266265   .3857573   3.136166   1.925347   2.103468
        se2f |  .0253119  .0040947   1.268405   6.061013   .0161489   .0493915
         t2f | -.1157212  1.054541  -.1146999   3.109138  -4.511499   3.327462
    reject2f |      .059  .2357426   3.743241   15.01185          0          1
---------------------------------------------------------------------------------

. set seed 10101
. global numobs 15000

. * Simulation for finite-sample properties of OLS (robust standard errors)
. simulate b2f=r(b2) se2f=r(se2) t2f=r(t2) reject2f=r(r2) p2f=r(p2), reps(1000)
    saving(helpdatares, replace) nolegend nodots: chi2data_robust

. * simulation results
. tabstat b2f se2f t2f reject2f, stat(mean sd sk ku min max) long col(stat)
    variable |      mean        sd   skewness   kurtosis        min        max
-------------+-----------------------------------------------------------------
         b2f |  2.000018  .0080713   .1267341   2.752077   1.977108    2.02643
        se2f |   .008128  .0004615   .6295791   4.125336   .0069674   .0106837
         t2f | -.0305723  .9933768  -.0500329   2.711769  -3.070038    2.82062
    reject2f |      .047  .2117447   4.280878   19.32591          0          1
---------------------------------------------------------------------------------

Testing for conditional heteroskedasticity

With the advent of robust standard errors, testing for the presence of heteroskedasticity has become less important.
Recall that Ŝ given in (23) is consistent for S (proposition 2.4).
Recall also that s² Sxx is consistent for σ² Σxx (cf. proposition 2.2).
Under the null of conditional homoskedasticity, H0: S = σ² Σxx, the difference between the two estimators should vanish:
    Ŝ - s² Sxx = (1/n) Σ_{i=1}^n ei² xi xi' - s² (1/n) Σ_{i=1}^n xi xi'
               = (1/n) Σ_{i=1}^n (ei² - s²) xi xi'  →p  O        (24)
Equation (24) implies
    cn ≡ (1/n) Σ_{i=1}^n (ei² - s²) ψi  →p  0        (25)
where ψi is an (m x 1) vector of the unique and non-constant elements of the matrix xi xi'.
Based on this result, White (1980) has developed a test statistic which is χ²(m) distributed, where m is the number of elements of ψi.
This test statistic can be computed as nR² from the following regression:
    regress ei² on a constant and ψi
If we do not reject H0 of conditional homoskedasticity, then statistical inference can be based on the conventional t- and F-ratios.

Example
We estimate a simple labor demand function for a sample of 569 Belgian firms (from 1996).
We explain the dependent variable labori from xi = (1, outputi, wagei, capitali)':
    labori = β1 + β2 capitali + β3 outputi + β4 wagei + εi
where
    labor   = number of workers
    capital = total fixed assets (in million euro)
    wage    = (total wage costs)/(number of workers) (in 1000 euro)
    output  = value added (in million euro)

There are 10 (= 4·5/2) unique elements in xi xi'.
ψi is a nine-dimensional vector excluding the constant element (1·1).
. use labour2
. /* step 1: estimate the model by means of OLS */
. reg labor wage output capital

      Source |       SS       df       MS              Number of obs =     569
-------------+------------------------------           F(  3,   565) = 2716.02
       Model |   198943126     3  66314375.3           Prob > F      =  0.0000
    Residual |  13795026.5   565  24415.9761           R-squared     =  0.9352
-------------+------------------------------           Adj R-squared =  0.9348
       Total |   212738152   568      374539           Root MSE      =  156.26

------------------------------------------------------------------------------
       labor |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        wage |  -6.741904   .5014054   -13.45   0.000     -7.72675   -5.757057
      output |   15.40047   .3556333    43.30   0.000     14.70194    16.09899
     capital |  -4.590491   .2689693   -17.07   0.000    -5.118793   -4.062189
       _cons |   287.7186   19.64175    14.65   0.000     249.1388    326.2984
------------------------------------------------------------------------------

. scalar s=sqrt(e(rss)/e(df_r))
. outreg2 using labour2, replace ctitle("OLS") addstat("s", s, "adj R^2", e(r2_a)) seeout
. predict residu, resid
. /* White test */
. /* create e_i^2 */
. gen res2=residu^2
. /* create psi_i */
. for var output capital wage: gen outputX=output*X \ gen capitalX=capital*X \ gen wageX=wage*X
....

. /* auxiliary regression for the White test */
. reg res2 output capital wage outputoutput outputwage outputcapital wagewage wagecapital capitalcapital

      Source |       SS       df       MS              Number of obs =     569
-------------+------------------------------           F(  9,   559) =  279.41
       Model |  9.8053e+12     9  1.0895e+12           Prob > F      =  0.0000
    Residual |  2.1796e+12   559  3.8992e+09           R-squared     =  0.8181
-------------+------------------------------           Adj R-squared =  0.8152
       Total |  1.1985e+13   568  2.1100e+10           Root MSE      =   62443

------------------------------------------------------------------------------
        res2 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      output |  -2573.292   512.1794    -5.02   0.000    -3579.324   -1567.261
     capital |   2810.435   663.0731     4.24   0.000     1508.016    4112.854
        wage |   554.3507   833.0281     0.67   0.506    -1081.897    2190.598
outputoutput |   27.59449   1.836334    15.03   0.000     23.98753    31.20144
  outputwage |   58.53846    8.11748     7.21   0.000     42.59397    74.48295
outputcapi~l |  -40.02937   3.746344   -10.68   0.000      -47.388   -32.67073
    wagewage |  -10.07189   9.290223    -1.08   0.279     -28.3199    8.176121
 wagecapital |  -48.24574   14.01991    -3.44   0.001    -75.78389   -20.70759
capitalcap~l |   14.41762   2.010053     7.17   0.000     10.46944     18.3658
       _cons |  -260.8893   18478.55    -0.01   0.989    -36556.77    36034.99
------------------------------------------------------------------------------

. scalar LM=e(r2)*e(N)
. scalar pval2=chi2tail(e(df_m),LM)
. dis LM
465.51928
. dis pval2
1.377e-94

. /* rerun the regression with White standard errors;
   outreg2 also reports the results of the White test (rejection of H_0) */
. reg labor output capital wage, robust
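As an aside, the same nR² statistic can be obtained without building the auxiliary regression by hand, using Stata's post-estimation command (a sketch; the exact figures on the slide come from the manual computation in labour2.do):

* White's test via the information-matrix test post-estimation command
quietly regress labor wage output capital
estat imtest, white        // reports the nR^2 statistic and its chi-squared p-value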

7  Generalized Least Squares (GLS), section 1.6

Consequences for the OLS estimator
Gauss-Markov assumptions:
Assumption 1.4: V{ε|X} = V{ε} = σ²In. This assumption implies
  A1.4a: homoskedasticity: V(εi) = σ² (constant variance);
  A1.4b: no autocorrelation: E(εi εj) = 0 if i ≠ j.
Suppose that we relax assumption (A1.4) in the following way:
    V{ε|X} = σ² V(X)        (26)
where V(X) is nonsingular and known.
Consider the OLS estimator b = (X'X)^{-1}X'y.
Notice that we do not need assumption A1.4 in order to prove that the OLS estimator is unbiased:
    E(b|X) = E{(X'X)^{-1}X'y|X} = β + (X'X)^{-1}X' E{ε|X} = β        (27)
However, in the case of equation (26), the sampling variance-covariance matrix of b becomes
    V{b|X} = σ² (X'X)^{-1} X'V(X)X (X'X)^{-1}        (28)
which only reduces to the simpler expression σ²(X'X)^{-1} if V(X) = In.
Consequently, routinely computed standard errors are based on the wrong expressions. Thus, standard t and F tests will no longer be valid and inferences will be misleading.
In addition, the OLS estimator is no longer BLUE.
Two ways of handling the problems of heteroskedasticity (and autocorrelation):
1. Formulate an alternative estimator which is BLUE.
2. Compute White's standard errors (see above). This method is now more popular.

Deriving an alternative estimator

How to derive a BLUE estimator under assumption (26)?
Idea: we transform the model so that it satisfies the Gauss-Markov assumptions.
Assumption (A1.4) is violated, i.e. V{ε|X} = σ² V(X).

We start by writing

    V^{-1} = C'C                                                              (29)

for some square, non-singular (n x n) matrix C, not necessarily unique. Note that such a matrix C exists.

Given equation (29), we know that

    V = (C'C)^{-1} = C^{-1}(C')^{-1},   so   C V C' = C C^{-1}(C')^{-1} C' = I_n      (30)

Idea: we transform the regression model in the following way:

    Cy = CXβ + Cε   or   ỹ = X̃β + ε̃                                          (31)

By applying OLS to the transformed model (31), one obtains the Generalized Least Squares (GLS) estimator β̂(V):

    β̂(V) = (X̃'X̃)^{-1} X̃'ỹ = (X'(C'C)X)^{-1} X'(C'C)y = (X'V^{-1}X)^{-1} X'V^{-1}y      (32)

Notice that the GLS estimator is unbiased:

    E{β̂(V)|X} = β + (X̃'X̃)^{-1} X̃' E(ε̃|X) = β + 0 = β

Notice that

    V{ε̃|X} = V{Cε|X} = C V{ε|X} C' = σ² C V C' = σ² I_n                       (33)

In other words, the transformed model (31) satisfies the Gauss-Markov assumptions, including (A1.4).
Therefore the GLS estimator β̂(V) is BLUE. Obviously

    V(β̂(V)|X) = σ² (X'V^{-1}X)^{-1}

Unbiased estimator of σ²:

    σ̂² = (1/(n-K)) (Cy - CXβ̂)'(Cy - CXβ̂)                                      (34)
       = (1/(n-K)) (y - Xβ̂)' V^{-1} (y - Xβ̂)                                   (35)

We can only compute the GLS estimator if V is known.

In many cases one has only a consistent estimator V̂ of the unknown V available. In that case, one can use the feasible generalized least squares (FGLS) estimator:

    β̂(V̂) = (X'V̂^{-1}X)^{-1} X'V̂^{-1}y                                          (36)

Notice that if we run OLS on the transformed model (31), with C based on V̂, we obtain the FGLS estimate of β. This estimate is consistent.
Moreover, we also obtain consistent estimates of the standard errors, t-values etc. The estimate σ̂² (see equation (35)) is automatically computed in that case.

Heteroskedasticity (h_i function unknown)

    y_i = x_i'β + ε_i,   i = 1, ..., n                                        (37)

Suppose that

    V{ε_i|x_i} = σ_i² = σ² h_i² = σ² exp{α₁ z_i1 + ... + α_J z_iJ} = σ² exp{z_i'α}      (38)

Problem: the parameter vector α is not known.
We need a consistent estimate of this vector. This estimate can be obtained by running the following regression:

    log e_i² = log σ² + z_i'α + v_i                                           (39)

Consequently, the FGLS estimator (other name: Weighted Least Squares (WLS)) β̂(V̂) can be obtained as follows:
1. Estimate model (37) by means of OLS. This gives b.
2. Compute log e_i² = log((y_i - x_i'b)²) from the OLS residuals e_i.
3. Estimate equation (39) by means of OLS. This gives a consistent estimate α̂ of the parameter vector α.
4. Compute ĥ_i² = exp{z_i'α̂}.
5. FGLS boils down to regressing y_i/ĥ_i on x_i/ĥ_i, with ĥ_i = exp(z_i'α̂/2), i.e. the square root of ĥ_i² (do not forget to transform the constant term!).
6. See the Stata do-file labour2.do for an implementation of feasible GLS; a minimal sketch is given below.
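
The following is a minimal Stata sketch of steps 1-5. It is not the actual labour2.do; it assumes that the z_i are simply the regressors themselves and uses the variable names from the tables below (labor, output, capital, wage):

. * Sketch of FGLS/WLS; assumption: z_i = (output, capital, wage)
. reg labor output capital wage                 // step 1: OLS in levels
. predict ehat, residuals                       // OLS residuals e_i
. gen logesq = log(ehat^2)                      // step 2: log of squared residuals
. reg logesq output capital wage                // step 3: consistent estimate of alpha
. predict zalpha, xb                            // fitted log variance (the constant only rescales)
. gen h = exp(zalpha/2)                         // step 4: h_i (square root of h_i^2)
. gen one_w     = 1/h                           // step 5: transform the constant as well
. gen labor_w   = labor/h
. gen output_w  = output/h
. gen capital_w = capital/h
. gen wage_w    = wage/h
. reg labor_w output_w capital_w wage_w one_w, noconstant

Equivalently, one could run the weighted regression with analytic weights proportional to 1/ĥ_i².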

                        (1)           (2)            (3)           (4)
                        OLS           OLS level      alpha param   FGLS level
VARIABLES                             white s.e.

wage                 -6.742***     -6.742***         0.00842      -2.638***
                     (0.501)       (1.858)          (0.00713)     (0.318)
output                15.40***      15.40***         0.0341***     17.06***
                     (0.356)       (2.491)          (0.00506)     (0.500)
capital              -4.590***     -4.590***        -0.0201***    -3.724***
                     (0.269)       (1.719)          (0.00383)     (0.235)
const                                                 6.937***
                                                     (0.279)
Constant              287.7***      287.7***                       116.6***
                     (19.64)       (65.11)                        (11.46)

Observations          569           569              569           569
R-squared             0.935         0.935            0.124         0.900
s                     156.3
adj R squared         0.935         0.935
White test                          465.5
p-value white test                  0

Standard errors in parentheses
*** p<0.01, ** p<0.05, * p<0.1

2.9: Least squares projection

Suppose that the assumptions justifying the large-sample properties of the OLS estimator are not satisfied (except for the random sample assumption) and that we nevertheless apply the OLS estimator.
Question: What do we estimate?
Answer: OLS provides an estimate of the best way to linearly combine the explanatory variables in order to predict the dependent variable.
This linear combination is called the least squares projection.

8.1  Optimally predict the dependent variable

Normally the econometrician is concerned with estimating unknown parameters from a sample.
However, assume for the moment that we know the joint distribution of (y, x), the dependent variable y and the random vector x. Moreover, we know the value of x.
On the basis of this knowledge we wish to predict y.
A predictor is a function f(x) of x (f(.) is determined by the joint distribution of (y, x)).
Forecast error: y − f(x)
Mean squared error: E[(y − f(x))²]

Proposition 2.7: E(y|x) is the best predictor of y in that it minimizes the mean squared error.

Proof
Let f(x) be any forecast. Then we can write the forecast error as:

    y − f(x) = (y − E(y|x)) + (E(y|x) − f(x))                                 (40)

So the squared forecast error is:

    (y − f(x))² = (y − E(y|x))² + 2(y − E(y|x))(E(y|x) − f(x)) + (E(y|x) − f(x))²      (41)

The mean squared error is obtained by taking the expectation of (41):

    E[(y − f(x))²] = E[(y − E(y|x))²] + 2E[(y − E(y|x))(E(y|x) − f(x))] + E[(E(y|x) − f(x))²]      (42)

It is a straightforward application of the Law of Total Expectations to show that the second term on the right-hand side of equation (42) is equal to zero (see below).
Therefore

    E[(y − f(x))²] = E[(y − E(y|x))²] + E[(E(y|x) − f(x))²] ≥ E[(y − E(y|x))²]      (43)

In other words, the mean squared error is bounded from below by E[(y − E(y|x))²], and this lower bound is achieved by the conditional expectation. q.e.d.

PS: the second term on the right-hand side of equation (42) is equal to zero because

    2E[(y − E(y|x))(E(y|x) − f(x))] = 2E( E[(y − E(y|x))(E(y|x) − f(x)) | x] )
                                    = 2E( (E(y|x) − f(x)) E[y − E(y|x) | x] )
                                    = 2E( (E(y|x) − f(x)) [E(y|x) − E(y|x)] ) = 0      (44)

8.2  Best Linear Predictor

It requires knowledge of the joint distribution of (y, x) to calculate E(y|x), which may be highly nonlinear.
We now restrict the predictor to being a linear function of x.
Question: what is the best (in the sense of minimizing the mean squared error) linear predictor of y based on x?
For this purpose, consider the β that satisfies the orthogonality condition

    E[x(y − x'β)] = E(xy) − E(xx')β = 0   or   E(xx')β = E(xy)                (45)

or (if E(xx') is nonsingular)

    β = [E(xx')]^{-1} E(xy)

The least squares projection of y on x:

    Ê(y|x) = x'β

where β is called the vector of least squares projection coefficients (cf. (45)).

Proposition 2.8: The least squares projection Ê(y|x) is the best linear predictor of y in that it minimizes the mean squared error.

Proof (add-and-subtract strategy)
For any linear predictor x'β̃ the mean squared error is

    E[(y − x'β̃)²] = E[((y − x'β) + x'(β − β̃))²]
                  = E[(y − x'β)²] + 2(β − β̃)' E[x(y − x'β)] + (β − β̃)' E(xx') (β − β̃)
                  = E[(y − x'β)²] + (β − β̃)' E(xx') (β − β̃)
                  ≥ E[(y − x'β)²]

q.e.d.

In contrast to the best predictor E(y|x), the best linear predictor requires only knowledge of the second moments of the joint distribution of (y, x) to calculate β (see (45)).

If one of the regressors in x is a constant, the least squares projection coefficients can be written in terms of variances and covariances. Let x = (1, x̃')'; then one can show that Ê(y|x) can be written as:

    Ê(y|x) ≡ Ê(y|1, x̃) = α + x̃'γ                                             (46a)

where

    γ = Var(x̃)^{-1} Cov(x̃, y)   and   α = E(y) − E(x̃)'γ                       (46b)

Proof of equation (46):
If (α, γ')' is the vector of least squares projection coefficients, it satisfies

    E(xx') [α  γ']' = E(xy)                                                   (47)

or

    [ 1       E(x̃)'  ] [α]   [ E(y)  ]
    [ E(x̃)   E(x̃x̃')  ] [γ] = [ E(x̃y) ]                                        (48)

or

    α = E(y) − E(x̃)'γ                                                         (49)
    E(x̃)α + E(x̃x̃')γ = E(x̃y)                                                  (50)

Equation (49) is the first part of the proof (cf. equation (46b)).
Substitution of (49) into (50) yields

    E(x̃)(E(y) − E(x̃)'γ) + E(x̃x̃')γ = E(x̃y)                                    (51)

or

    (E(x̃x̃') − E(x̃)E(x̃)')γ = E(x̃y) − E(x̃)E(y)                                 (52)

or

    Var(x̃)γ = Cov(x̃, y)   so that   γ = Var(x̃)^{-1} Cov(x̃, y)                 (53)

8.3  OLS Consistently Estimates the Projection Coefficients

Task of the econometrician: estimating β.

Suppose that we have a sample of size n drawn from an i.i.d. stochastic process {y_i, x_i} (random sample assumption) with the same joint distribution as (y, x) (so, for example, E(x_i x_i') = E(xx')).
By the law of large numbers the second moments in (45) can be consistently estimated by the corresponding sample second moments. Thus a consistent estimator of the projection coefficients is

    ( (1/n) Σ_{i=1}^n x_i x_i' )^{-1} ( (1/n) Σ_{i=1}^n x_i y_i ) = (X'X)^{-1} X'y

which is the usual OLS estimator b.
That is, under Assumption 2.2 (random sample) and Assumption 2.4 (guaranteeing the nonsingularity of E(xx')), the OLS estimator is always consistent for the projection coefficient vector, i.e. the β that satisfies the orthogonality condition (45).
One also needs Assumptions 2.1 (linearity) and 2.3 (predetermined regressors) in order to attach an economic interpretation to β and its OLS estimate b.
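
As an illustration (not in the original notes), the sample-moment formula can be checked directly with Stata's matrix commands; a minimal sketch using the auto data set shipped with Stata:

. * Sketch: projection coefficients as sample second moments
. sysuse auto, clear
. gen cons = 1
. mkmat price, matrix(Y)                    // y vector
. mkmat mpg weight cons, matrix(X)          // regressor matrix, constant included
. matrix b = invsym(X'*X) * (X'*Y)          // sample analogue of [E(xx')]^{-1} E(xy)
. matrix list b
. reg price mpg weight                      // regress reproduces the same coefficients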

                        (1)           (2)            (3)           (4)
                        OLS log       OLS log        alpha param   FGLS log
VARIABLES                             white s.e.

ln wage              -0.928***     -0.928***        -0.0611       -0.856***
                     (0.0714)      (0.0867)         (0.344)       (0.0719)
ln output             0.990***      0.990***         0.267**       1.035***
                     (0.0264)      (0.0468)         (0.127)       (0.0273)
ln capital           -0.00370      -0.00370         -0.331***     -0.0569***
                     (0.0188)      (0.0379)         (0.0904)      (0.0216)
const                                                -3.254***
                                                     (1.185)
Constant              6.177***      6.177***                       5.895***
                     (0.246)       (0.294)                        (0.248)

Observations          569           569              569           569
R-squared             0.843         0.843            0.024         0.990
s                     0.465
adj R squared         0.842         0.842
White test                          58.54
p-value white test                  2.56e-09

Standard errors in parentheses
*** p<0.01, ** p<0.05, * p<0.1

Least Squares Projection; Instrumental Variables; GMM


Jochem de Bresser
University of Groningen; Dui 758

March 2016


Outline
1

Least squares projection

Introduction instrumental variables (IV)


Endogeneity (when NOT to use OLS to estimate causal effects)
Intuition of instrumental variables (IV) estimation
General formulation

Generalized Method of Moments (GMM)


Large sample properties of GMM
Testing overidentifying restrictions
Conditional homoscedasticity (2SLS)

Example: what makes a valid instrument?

GMM in Stata

Jochem de Bresser (RUG)

Week 5

2/84

March 2016

2 / 84

Least squares projection

OLS is consistent under four key conditions (giving b_OLS):
  - Linearity: y_i = x_i'β + ε_i
  - The sample is random: {y_i, x_i} is i.i.d.
  - Predetermined regressors (contemporaneous exogeneity): E(x_i ε_i) = 0
  - No perfect multicollinearity when the sample size grows large:
    (1/n) Σ_{i=1}^n x_i x_i' converges to a finite nonsingular matrix Σ_xx

Question: what does OLS estimate if we let go of linearity and contemporaneous exogeneity?
Answer: OLS estimates the best way to predict the dependent variable from a linear combination of the independent variables.
Least squares projection: "best" according to the squared loss criterion.

Optimal prediction

So far we tried to estimate structural parameters from a sample
  - Causal interpretation

Our new goal: predict y_i from x_i based on the sample {y_i, x_i}
  - Do this optimally: minimize some measure of the difference between forecasts f(x) and data y
  - Two important definitions:
      Forecast error: y − f(x)
      Mean squared error: E[(y − f(x))²]
  - E(y|x) is the best predictor of y: it minimizes the mean squared error

Proof of proposition 2.7

Proposition: E(y|x) minimizes the mean squared error (and is the best predictor of y)

Proof.
Let f(x) be any forecast. Rewrite the forecast error as:

    y − f(x) = (y − E(y|x)) + (E(y|x) − f(x))

    (y − f(x))² = (y − E(y|x))² + 2(y − E(y|x))(E(y|x) − f(x)) + (E(y|x) − f(x))²

Proof continued.
Take the expectation to obtain the mean squared error:

    E[(y − f(x))²] = E[(y − E(y|x))²]
                   + 2E[(y − E(y|x))(E(y|x) − f(x))]   (= 0, since E[(y − E(y|x)) g(x)] = 0 for any function g(x))
                   + E[(E(y|x) − f(x))²]

Proof continued.
Therefore

    E[(y − f(x))²] = E[(y − E(y|x))²] + E[(E(y|x) − f(x))²] ≥ E[(y − E(y|x))²]

So the mean squared error can never be below E[(y − E(y|x))²], and the lower bound is achieved by choosing for f(x) the conditional expectation E(y|x).

Best Linear Predictor

Now we restrict the predictor f(x) to be a linear function of x
  - Consider the β that satisfies the orthogonality condition

        E[x(y − x'β)] = E(xy) − E(xx')β = 0

  - If E(xx') is non-singular:

        β = [E(xx')]^{-1} E(xy)

  - We show that predicting y according to Ê(y|x) = x'β is optimal in that it minimizes the mean squared error
  - Ê(y|x) is the least squares projection
  - β is the vector of least squares projection coefficients

Proof of proposition 2.8

Proposition: Ê(y|x) is the best linear predictor of y in that it minimizes the mean squared error

Proof.
We follow the add-and-subtract strategy once more:

    mean squared error = E[(y − x'β̃)²]
                       = E[((y − x'β) + x'(β − β̃))²]
                       = E[(y − x'β)²] + 2(β − β̃)' E[x(y − x'β)] + (β − β̃)' E(xx') (β − β̃)
                       = E[(y − x'β)²] + (β − β̃)' E(xx') (β − β̃)
                       ≥ E[(y − x'β)²]

Least squares projection coefficients in terms of variances and covariances

If one of the regressors is a constant, we can write the least squares projection coefficients in terms of variances and covariances
  - x = (1, x̃')'
  - Ê(y|x) can be written as

        Ê(y|x) ≡ Ê(y|1, x̃) = α + γ'x̃

    where γ = Var(x̃)^{-1} Cov(x̃, y) and α = E(y) − γ'E(x̃)

Proof:
If (α, γ')' are the least squares projection coefficients:

    E(xx') [α  γ']' = E(xy)

or

    [ 1       E(x̃)'  ] [α]   [ E(y)  ]
    [ E(x̃)   E(x̃x̃')  ] [γ] = [ E(x̃y) ]

or

    α = E(y) − E(x̃)'γ
    E(x̃)α + E(x̃x̃')γ = E(x̃y)

Done with α; now substitute the expression for α into the second equation:

    E(x̃)(E(y) − E(x̃)'γ) + E(x̃x̃')γ = E(x̃y)

    (E(x̃x̃') − E(x̃)E(x̃)')γ = E(x̃y) − E(x̃)E(y)

or

    Var(x̃)γ = Cov(x̃, y)   so that   γ = Var(x̃)^{-1} Cov(x̃, y)

In conclusion

The OLS estimator is consistent for the projection coefficient vector β under 2 assumptions:
  - The sample is random: {y_i, x_i} is i.i.d.
  - No perfect multicollinearity when the sample size grows large:
    (1/n) Σ_{i=1}^n x_i x_i' converges to a finite nonsingular matrix Σ_xx

We need 2 additional assumptions to attach economic (causal) meaning to β and its OLS estimator:
  - Linearity: y_i = x_i'β + ε_i
  - Predetermined regressors (contemporaneous exogeneity): E(x_i ε_i) = 0

The problem of endogeneity

Model: y_i = x_i'β + ε_i
Endogeneity: E(x_i ε_i) ≠ 0
Consequence: the OLS estimator b_OLS no longer converges to β as the sample size grows:

    b_OLS − β = ( (1/n) Σ_{i=1}^n x_i x_i' )^{-1} ( (1/n) Σ_{i=1}^n x_i ε_i )

where the first factor converges to Σ_xx^{-1} and the second factor does not converge to zero.

There are several possible reasons for endogeneity:
  - Omitted variable bias
  - Measurement error in explanatory variable(s)
  - Simultaneity bias

Source of endogeneity 1: omitted variable bias

Consider a regression of (log) wages on education, ability and other controls:

    y_i = x_1i'β_1 + x_2i β_2 + γ u_i + η_i

  - y_i = log hourly wage rate
  - x_1i = vector of individual characteristics
  - x_2i = years of schooling
  - u_i = natural ability
  - Assume that exogeneity holds if we control for all variables mentioned above
  - OLS is consistent IF we can control for x_1i, x_2i and u_i
  - Persons with higher ability are likely to have higher wages (γ > 0)
  - Persons with higher ability are likely to have more schooling (cov(x_2i, u_i) > 0)
  - In practice ability is unobserved

Source of endogeneity 1: omitted variable bias

We proceed by regressing y_i on the observed variables x_1i and x_2i, collected in x_i:

    y_i = x_i'β + ε_i

where x_i' = (x_1i', x_2i), β' = (β_1', β_2) and ε_i = γ u_i + η_i.
The OLS estimator b_OLS is no longer consistent for β:

    b_OLS = ( (1/n) Σ x_i x_i' )^{-1} ( (1/n) Σ x_i y_i )
          = ( (1/n) Σ x_i x_i' )^{-1} ( (1/n) Σ x_i (x_i'β + γ u_i + η_i) )
          = β + ( (1/n) Σ x_i x_i' )^{-1} ( (1/n) Σ x_i u_i ) γ + ( (1/n) Σ x_i x_i' )^{-1} ( (1/n) Σ x_i η_i )

Source of endogeneity 1: omitted variable bias

Assume E(x_ik η_i) = 0 for all k:

    b_OLS = β + ( (1/n) Σ x_i x_i' )^{-1} ( (1/n) Σ x_i u_i ) γ + ( (1/n) Σ x_i x_i' )^{-1} ( (1/n) Σ x_i η_i )

    b_OLS converges in probability to β + Σ_xx^{-1} E(x_i u_i) γ

  - If γ ≠ 0 we need E(x_i u_i) = 0 for b_OLS to be consistent: unobserved ability should be uncorrelated with schooling and the other regressors
  - In case γ > 0 and cov(x_2i, u_i) > 0 the OLS coefficient overestimates the return to schooling
  - Interpretation: the OLS estimate shows the difference in expected wages between people with different levels of schooling yet identical characteristics x_1
  - NOT a causal effect: people with different levels of education also differ in unobserved ways (ability)

Source of endogeneity 2: measurement error in explanatory variables

Example: regress household savings y_i on disposable income w_i:

    y_i = β_1 + β_2 w_i + ε_i,    E(ε_i) = 0,  var(ε_i) = σ_ε²

  - Assume that w_i is exogenous: E(w_i ε_i) = 0 and even E(ε_i|w_i) = 0, so E(y_i|w_i) = β_1 + β_2 w_i
  - w_i is measured with error (survey response): we observe x_i given by

        x_i = w_i + u_i

  - u_i = measurement error: E(u_i) = 0, var(u_i) = σ_u²
  - u_i is independent of true income w_i

Source of endogeneity 2: measurement error in explanatory variables

The regression is estimated with the observed x_i instead of the unobserved w_i:

    y_i = β_1 + β_2 x_i + v_i,    v_i = ε_i − β_2 u_i

  - The OLS estimator b_OLS is not consistent for β = (β_1, β_2)'
  - x_i and v_i both depend on u_i: E(x_i v_i) ≠ 0
  - If β_2 > 0 then E(x_i v_i) = cov(x_i, v_i) < 0
  - Analogously to the previous example:

        plim b_2,OLS = β_2 + cov(x_i, v_i)/var(x_i)

        E(x_i v_i) = E[(w_i + u_i)(ε_i − β_2 u_i)] = −β_2 σ_u²
        var(x_i) = var(w_i + u_i) = σ_w² + σ_u²

  - Attenuation bias: b_2 is biased towards zero

Source of endogeneity 2: measurement error in explanatory variables

So we get

    plim b_2,OLS = β_2 (1 − σ_u²/(σ_w² + σ_u²))

  - Conclusion: b_2,OLS is consistent only if σ_u² = 0: no measurement error
  - The asymptotic bias is worse if σ_u² is large relative to σ_w²
  - The inconsistency of b_2,OLS carries over to the estimator b_1,OLS of the intercept

So what do you estimate if you regress y_i on x_i?
  - The coefficients in the (linear approximation of the) conditional expectation of y_i given reported disposable income x_i
  - The projection coefficient vector of savings on reported income
  - No problem if we interpret the coefficients in terms of reported rather than true variables
  - But we want to know the conditional expectation of y_i given actual income w_i
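
An illustrative simulation of attenuation bias (not in the original slides; all variable names and parameter values are hypothetical):

* Simulate the measurement-error model with beta_2 = 2, sigma_w^2 = sigma_u^2 = 1
clear
set obs 10000
set seed 12345
gen w = rnormal(0, 1)             // true income
gen u = rnormal(0, 1)             // measurement error, independent of w
gen x = w + u                     // observed (mismeasured) income
gen y = 1 + 2*w + rnormal(0, 1)   // true model
reg y w                           // consistent: slope close to 2
reg y x                           // attenuated: slope close to 2 * 1/(1+1) = 1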

Source of endogeneity 3: simultaneity

Consider demand and supply of coffee:

    q_i^d = α_0 + α_1 p_i + u_i   (demand)
    q_i^s = β_0 + β_1 p_i + ε_i   (supply)
    q_i^d = q_i^s                 (market equilibrium)

  - α_1 < 0 and β_1 > 0
  - Assume E(u_i) = E(ε_i) = 0 and cov(u_i, ε_i) = 0
  - Market equilibrium q_i = q_i^d = q_i^s:

        q_i = α_0 + α_1 p_i + u_i   (demand)
        q_i = β_0 + β_1 p_i + ε_i   (supply)

  - Solve for p_i and q_i:

        p_i = (α_0 − β_0)/(β_1 − α_1) + (u_i − ε_i)/(β_1 − α_1)
        q_i = (β_1 α_0 − α_1 β_0)/(β_1 − α_1) + (β_1 u_i − α_1 ε_i)/(β_1 − α_1)

Source of endogeneity 3: simultaneity

Now p_i is endogenous in both equations:

    E(p_i u_i) = cov(p_i, u_i) = σ_u²/(β_1 − α_1) > 0;    cov(p_i, ε_i) = −σ_ε²/(β_1 − α_1) < 0

  - A regression of quantity on price estimates neither the demand nor the supply curve
  - It does estimate the least squares projection of quantity on price, Ê(q_i|1, p_i) = π_0 + π_1 p_i, with

        π_1 = cov(p_i, q_i)/var(p_i)

  - π_1 is neither consistent for α_1 nor for β_1:

        cov(p_i, q_i)/var(p_i) = cov(p_i, α_0 + α_1 p_i + u_i)/var(p_i) = α_1 + cov(p_i, u_i)/var(p_i)

Source of endogeneity 3: simultaneity

  - So-called simultaneity bias: regressor and error term are related through a system of simultaneous equations
  - This does not occur if u_i ≡ 0 or ε_i ≡ 0

[Figure: Identification of the demand curve in the absence of demand shifts]

Estimation in the presence of endogeneity: simultaneity

  - Think again of coffee demand and supply in the Netherlands
  - Problem: we cannot disentangle demand and supply shifts
  - Solution: find a variable that affects q_i only through either demand or supply
      A demand-shifter identifies the supply curve
      A supply-shifter identifies the demand curve
  - Example of a supply-shifter: temperature in coffee-growing regions (temp_i)

        q_i = β_0 + β_1 p_i + β_2 temp_i + ε_i   (supply)

      E(temp_i ε_i) = cov(temp_i, ε_i) = 0 by construction
      Likely to affect the endogenous coffee price (cov(temp_i, p_i) ≠ 0)
      Unlikely to be related to the unobserved taste for coffee in NL (cov(temp_i, u_i) = 0)

Estimation in the presence of endogeneity: simultaneity

An instrument should satisfy 2 criteria:
  - Relevance: the instrument should be correlated with the endogenous regressor
  - Validity: the instrument should NOT be correlated with the error term of the equation of interest

For our example:
  - Temperature is correlated with price if β_2 ≠ 0 (solve the demand and extended supply equations):

        p_i = (α_0 − β_0)/(β_1 − α_1) − β_2/(β_1 − α_1) temp_i + (u_i − ε_i)/(β_1 − α_1)

        cov(temp_i, p_i) = −β_2/(β_1 − α_1) var(temp_i) ≠ 0

  - Temperature is assumed to be uncorrelated with u_i
  - So temperature is a useful instrument for price that identifies the demand equation

Estimation in the presence of endogeneity: simultaneity

We use the demand equation to calculate cov(temp_i, q_i):

    cov(temp_i, q_i) = cov(temp_i, α_0 + α_1 p_i + u_i)
                     = α_1 cov(temp_i, p_i) + cov(temp_i, u_i)
                     = α_1 cov(temp_i, p_i)

Hence our parameter of interest (α_1) satisfies

    α_1 = cov(temp_i, q_i) / cov(temp_i, p_i)

We can estimate α_1 consistently as (see the Stata sketch below)

    α̂_1,IV = [ (1/n) Σ_{i=1}^n (temp_i − temp̄)(q_i − q̄) ] / [ (1/n) Σ_{i=1}^n (temp_i − temp̄)(p_i − p̄) ]
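
A minimal Stata sketch of this IV estimator for the coffee example. The variable names q, p and temp are hypothetical (no such data set accompanies the slides); the ratio-of-covariances formula and ivregress give the same slope:

* Hypothetical variables: quantity q, price p, instrument temp
quietly correlate temp q, covariance
matrix C1 = r(C)                                 // 2x2 covariance matrix of (temp, q)
quietly correlate temp p, covariance
matrix C2 = r(C)                                 // 2x2 covariance matrix of (temp, p)
scalar cov_tq = el(C1,2,1)
scalar cov_tp = el(C2,2,1)
display "IV estimate of alpha_1: " cov_tq/cov_tp  // cov(temp,q)/cov(temp,p)
ivregress 2sls q (p = temp)                       // same slope from 2SLS in the just-identified case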

Estimation in presence of endogeneity omitted variable


bias
Recall example of returns to schooling (using only schooling as
regressor):
yi = 0 + 1 xi + ui

where yi is (log) wage; xi is schooling and ui is the error term, which


includes ability
OLS assumes that E (xi ui ) = 0:
Schooling uncorrelated with error term
Implies following path analysis diagram:

Jochem de Bresser (RUG)

Week 5

27/84

March 2016

27 / 84

Estimation in the presence of endogeneity: omitted variable bias

  - But education may be correlated with the errors
  - Suppose a person has a high u_i due to high (unobserved) ability
  - This increases earnings directly, since y_i = β_0 + β_1 x_i + u_i:
        u_i up  =>  y_i up
  - But it may also increase elements of x_i, since schooling is likely to be higher for those with high ability:
        u_i up  =>  x_i up  =>  y_i up
  - The new path diagram becomes:

    [path diagram]

    y_i = β_0 + β_1 x_i + u(x_i) implies

        dy/dx = β_1 + du/dx

  - OLS is inconsistent for β_1 since it measures dy/dx, not just β_1

Estimation in the presence of endogeneity: omitted variable bias

Now assume we have an instrument z_i with the properties
  - Validity: z_i is not correlated with the error term u_i (this implies that z_i does not directly lead to changes in y_i)
  - Relevance: z_i is correlated with the endogenous regressor schooling

In a path diagram:

    [path diagram]

Note that z_i does not directly cause y_i, but z_i and y_i are correlated through x_i.

Estimation in the presence of endogeneity: omitted variable bias

Example: a 1-unit change in z_i is associated with
  - 0.2 years of schooling x_i
  - a $500 increase in annual earnings y_i (z_i up => x_i up => y_i up)

Then 0.2 years of schooling is associated with $500 extra earnings
  - A 1-year increase in schooling is associated with a 500/0.2 = $2,500 increase in earnings
  - Causal estimate of β_1: 2500

Econometric approach: estimate the changes dx/dz and dy/dz and calculate the causal estimator as

    β̂_1,IV = (dy/dz) / (dx/dz)

  - dy/dz estimated by OLS of y on z with slope estimate (z'z)^{-1} z'y
  - dx/dz estimated by OLS of x on z with slope estimate (z'z)^{-1} z'x

The IV estimator is

    b_IV = (z'x)^{-1} z'y

General formulation

Notation:
  - y_i: dependent variable of interest
  - z_i: L-dimensional vector of regressors (some of which endogenous)
  - δ: parameter vector of interest
  - x_i: K-dimensional vector of instruments (exogenous regressors and at least 1 instrument for each endogenous regressor)
  - w_i: vector of unique and non-constant elements of (y_i, z_i, x_i)

The model consists of the following assumptions:
  - 3.1 linearity: the equation to be estimated is linear: y_i = z_i'δ + ε_i
  - 3.2 random sample: {w_i} is an i.i.d. stochastic process
  - 3.3 instrument exogeneity or instrument validity: all K variables in x_i are orthogonal to the error term:

        E(x_i ε_i) = E[x_i (y_i − z_i'δ)] = E(g_i) = 0,    where g_i ≡ x_i ε_i

General formulation coffee example

The coffee example in general notation:


yi = q i
zi = (1, pi )0 (L = 2)
= ( 0 , 1 ) 0
xi = (1, tempi )0 (K = 2)
wi = (qi , 1, pi , tempi )0
Remember that we assumed E (tempi i ) = 0 (valid instrument)
xi and zi may share variables
Here they share the constant
In general: each exogenous (pre-determined) regressor is an
instrument for itself (both in zi and in xi )
Endogenous regressors are only included in the list of regressors zi
Excluded instruments are only included in list of instruments xi

Jochem de Bresser (RUG)

Week 5

32/84

March 2016

32 / 84

General formulation returns to schooling example


The returns to schooling example in general notation:
yi = (log) wage
zi = (1, educi , experi )0 (L = 3)
xi = (1, mothereduci , fathereduci , agei )0 (K = 4)
wi = (wagei , 1, educi , experi , mothereduci , fathereduci , agei )0

Are the instruments useful?


Are they relevant (correlated with the endogenous regressors)?
Probably

Are they valid (uncorrelated with the error term)?


Probably not: fathereduci and mothereduci are likely to be correlated
with unobserved ability

IV estimation based on different but equally fragile assumption as


OLS:
OLS assumes: E (zi i ) = 0
IV assumes: E (xi i ) = 0
Finding valid instruments is difficult
Jochem de Bresser (RUG)

Week 5

33/84

March 2016

33 / 84

Identification

3.4 rank condition for identification: the K x L matrix

    Σ_xz ≡ E(x_i z_i')

is of full column rank (rank equals L, the number of its columns).
To see why we need this, look at the orthogonality condition:

    E(g_i) = E(g(w_i; δ)) = E(x_i ε_i) = E[x_i (y_i − z_i'δ)] = 0
    E(x_i y_i) − E(x_i z_i') δ = 0
    Σ_xz δ = σ_xy

where Σ_xz = E(x_i z_i') and σ_xy = E(x_i y_i).
This is a system of K equations in L unknowns; it has a unique solution if and only if Σ_xz is of full column rank. In that case δ is identified.

Identification

A necessary condition for identification is the order condition:

    K ≥ L
    # instruments ≥ # regressors
    # excluded instruments ≥ # endogenous regressors
    # orthogonality conditions ≥ # parameters

Depending on whether the order condition is satisfied, the parameters can be
  - Overidentified: 3.4 satisfied and K > L
  - Just identified: 3.4 satisfied and K = L
  - Not identified: 3.4 NOT satisfied or K < L (either the instruments are not correlated with the endogenous regressors or there are not enough instruments)

Final assumption required for asymptotic normality

Assumptions so far:
  - 3.1 linearity: the equation to be estimated is linear
  - 3.2 random sample: {w_i} is an i.i.d. stochastic process
  - 3.3 instrument exogeneity or instrument validity: all K variables in x_i are orthogonal to the error term: E(x_i ε_i) = E(g_i) = 0
  - 3.4 rank condition for identification: the K x L matrix Σ_xz = E(x_i z_i') is of full column rank (rank equals L, the number of its columns)

One more assumption for the asymptotic normality of the GMM estimators:
  - 3.5 finite second moments of the orthogonality conditions: S ≡ E(g_i g_i') = E(ε_i² x_i x_i') exists and is finite

Basic principle of the Method of Moments

The Method of Moments (MM) starts from the orthogonality conditions.
A set of population moments is equal to zero:

    E(g_i) = E(g(w_i; δ)) = E(x_i ε_i) = E[x_i (y_i − z_i'δ)] = 0
    E(x_i y_i) − E(x_i z_i') δ = 0
    Σ_xz δ = σ_xy

Basic principle of MM: choose the estimator δ̃ so that the corresponding sample moments are also equal to zero:

    g_n(δ̃) = (1/n) Σ_{i=1}^n g(w_i; δ̃) = (1/n) Σ_{i=1}^n x_i ε̃_i = (1/n) Σ_{i=1}^n x_i (y_i − z_i'δ̃) = 0

where ε̃_i = y_i − z_i'δ̃.
MM amounts to choosing the δ̃ that solves this system of K equations in L unknowns.

Instrumental variables estimator: K = L

Population moment conditions:

    E(x_i y_i) − E(x_i z_i') δ = 0
    Σ_xz δ = σ_xy

Sample moment conditions:

    (1/n) Σ_{i=1}^n x_i y_i − (1/n) Σ_{i=1}^n x_i z_i' δ̃ = 0
    S_xz δ̃ = s_xy

where S_xz = (1/n) Σ_{i=1}^n x_i z_i' and s_xy = (1/n) Σ_{i=1}^n x_i y_i.
If K = L:
  - Σ_xz is square and invertible
  - S_xz converges in probability to Σ_xz
  - The system of equations has a unique solution

        δ̂_IV = S_xz^{-1} s_xy = ( (1/n) Σ_{i=1}^n x_i z_i' )^{-1} ( (1/n) Σ_{i=1}^n x_i y_i )

OLS as an instrumental variables estimator

IV reduces to OLS if all regressors are exogenous (x_i = z_i):
  - Population moment conditions:

        E(z_i ε_i) = E[z_i (y_i − z_i'δ)] = 0
        E(z_i y_i) − E(z_i z_i') δ = 0
        Σ_zz δ = σ_zy

  - Sample moment conditions:

        (1/n) Σ_{i=1}^n z_i ε̃_i = (1/n) Σ_{i=1}^n z_i (y_i − z_i'δ̃) = 0
        (1/n) Σ_{i=1}^n z_i y_i − (1/n) Σ_{i=1}^n z_i z_i' δ̃ = 0
        S_zz δ̃ = s_zy

    δ̂_IV = δ̂_OLS = ( (1/n) Σ_{i=1}^n z_i z_i' )^{-1} ( (1/n) Σ_{i=1}^n z_i y_i )

Why do we focus on asymptotic properties of the IV estimator? (I)

For OLS we showed that b_OLS is unbiased:

    E(b_OLS|Z) = E[(Z'Z)^{-1} Z'y | Z]
               = E[(Z'Z)^{-1} Z'(Zδ + ε) | Z]
               = E[(Z'Z)^{-1} Z'Z δ + (Z'Z)^{-1} Z'ε | Z]
               = δ + (Z'Z)^{-1} Z' E(ε|Z)
               = δ,    since we assumed E(ε|Z) = 0

Hence, E(b_OLS) = E[E(b_OLS|Z)] = δ.

Why do we focus on asymptotic properties of the IV estimator? (II)

For IV we cannot obtain a similar finite-sample result:

    E(δ̂_IV|X, Z) = E[(X'Z)^{-1} X'y | X, Z]
                 = E[(X'Z)^{-1} X'(Zδ + ε) | X, Z]
                 = δ + E[(X'Z)^{-1} X'ε | X, Z]
                 = δ + (X'Z)^{-1} X' E(ε|X, Z)

However, E(ε|X, Z) is not assumed to be equal to zero!
  - That would imply E(ε|Z) = 0
  - IV assumes E(X'ε) = 0 (implied by E(ε|X) = 0)
Hence, E(δ̂_IV) = δ + E[(X'Z)^{-1} X' E(ε|X, Z)] ≠ δ in general.

Coffee demand and supply in IV notation

    q_i = α_0 + α_1 p_i + ε_i

Endogenous regressor p_i; instrument temp_i.

Orthogonality conditions:

    E(x_i ε_i) = E[ (1, temp_i)' ε_i ] = E[ (1, temp_i)' (q_i − α_0 − α_1 p_i) ] = (0, 0)'

IV estimator:

    [α̂_0]   [ 1                 (1/n) Σ p_i        ]^{-1} [ (1/n) Σ q_i        ]
    [α̂_1] = [ (1/n) Σ temp_i    (1/n) Σ temp_i p_i ]      [ (1/n) Σ temp_i q_i ]

Generalized Method of Moments

  - Overidentified case: K > L (more orthogonality conditions than parameters)
  - We cannot choose δ̃ to satisfy the K equations exactly
  - Choose δ̃ such that g_n(δ̃) is as close as possible to zero
  - Closeness is measured through a sum-of-squares objective function
  - GMM estimator:

        δ̂(Ŵ) = argmin over δ̃ of  n g_n(δ̃)' Ŵ g_n(δ̃)

    where g_n(δ̃) = (1/n) Σ_{i=1}^n x_i ε̃_i = (1/n) Σ_{i=1}^n x_i y_i − (1/n) Σ_{i=1}^n x_i z_i' δ̃ = s_xy − S_xz δ̃,
    and Ŵ is a weighting matrix that can be random and depend on the sample size (estimated); Ŵ converges in probability to W, with W symmetric and positive definite.

Generalized Method of Moments

First order conditions:

    S_xz' Ŵ s_xy = S_xz' Ŵ S_xz δ̃

  - A system of L equations in L unknowns
  - S_xz' Ŵ S_xz is nonsingular, because
      S_xz is of full column rank
      Ŵ is symmetric and positive definite
  - So the system has a unique solution, which is the GMM estimator δ̂(Ŵ):

        δ̂(Ŵ) = (S_xz' Ŵ S_xz)^{-1} S_xz' Ŵ s_xy

  - If K = L then S_xz is square and the GMM estimator reduces to the IV estimator

        δ̂_IV = S_xz^{-1} s_xy

    which does not depend on the weighting matrix.

Sampling error

Model: y_i = z_i'δ + ε_i.
Multiply both sides from the left by x_i and take averages:

    s_xy = S_xz δ + ḡ,    where ḡ = (1/n) Σ_{i=1}^n x_i ε_i = (1/n) Σ_{i=1}^n g(w_i; δ) = g_n(δ)

Substitute into the definition of the GMM estimator:

    δ̂(Ŵ) = (S_xz' Ŵ S_xz)^{-1} S_xz' Ŵ s_xy
          = (S_xz' Ŵ S_xz)^{-1} S_xz' Ŵ (S_xz δ + ḡ)

    δ̂(Ŵ) − δ = (S_xz' Ŵ S_xz)^{-1} S_xz' Ŵ ḡ

Large sample properties of GMM

  - Each different weighting matrix Ŵ yields a different GMM estimator
  - Let's look at
      the large sample properties of GMM for a given weighting matrix
      the optimal choice of weighting matrix
  - We will find that
      GMM estimators are consistent under assumptions 3.1-3.4 regardless of Ŵ
      Ŵ does affect the standard errors: optimal weighting yields precise estimates

Asymptotic distribution of the GMM estimator: consistency

    δ̂(Ŵ) − δ = (S_xz' Ŵ S_xz)^{-1} S_xz' Ŵ ḡ,    with key term ḡ = (1/n) Σ_{i=1}^n x_i ε_i = g_n(δ)

Proposition
Under assumptions 3.1-3.4 the GMM estimator is consistent:

    plim (as n goes to infinity) δ̂(Ŵ) = δ

since ḡ converges in probability to 0 (orthogonality conditions).

Asymptotic distribution of the GMM estimator: asymptotic normality

Proposition
Under assumptions 3.1-3.5 the GMM estimator is asymptotically normal:

    √n (δ̂(Ŵ) − δ)  converges in distribution to  N(0, avar[δ̂(Ŵ)])

where

    avar[δ̂(Ŵ)] = (Σ_xz' W Σ_xz)^{-1} Σ_xz' W S W Σ_xz (Σ_xz' W Σ_xz)^{-1}

and Σ_xz = E(x_i z_i'); plim Ŵ = W; S = E(g_i g_i') = E(ε_i² x_i x_i').
  - S_xz converges in probability to Σ_xz; √n ḡ converges in distribution to N(0, S) (CLT)
  - √n ḡ asymptotically N(0, S) implies √n A ḡ asymptotically N(0, A S A'), with A = (Σ_xz' W Σ_xz)^{-1} Σ_xz' W

Asymptotic distribution of the GMM estimator: consistent estimate of avar[δ̂(Ŵ)]

Proposition
Suppose we have a consistent estimator Ŝ of S. Then, under random sampling, we can estimate avar[δ̂(Ŵ)] consistently by

    (S_xz' Ŵ S_xz)^{-1} S_xz' Ŵ Ŝ Ŵ S_xz (S_xz' Ŵ S_xz)^{-1}

Consistent estimation of the error variance

For any consistent estimator δ̂ of δ define ε̂_i = y_i − z_i'δ̂.
Under 3.1, 3.2 and a finite second moment of the regressors (E(z_i z_i') finite) we can show:

Proposition
The error variance can be estimated using any consistent estimator δ̂:

    (1/n) Σ_{i=1}^n ε̂_i²  converges in probability to  E(ε_i²)

if E(ε_i²) exists and is finite.

Estimation of S

Suppose E[(x_ik z_il)²] exists and is finite for all k = 1, ..., K and l = 1, ..., L (assumption 3.6).

Proposition
Use a consistent estimator δ̂ to calculate the residuals ε̂_i. Then, under 3.1, 3.2, 3.5 and 3.6,

    Ŝ = (1/n) Σ_{i=1}^n ε̂_i² x_i x_i'

is consistent for S.

Optimal weighting matrix Ŵ

The asymptotic variances of the GMM estimators indexed by W are given by

    avar[δ̂(Ŵ)] = (Σ_xz' W Σ_xz)^{-1} Σ_xz' W S W Σ_xz (Σ_xz' W Σ_xz)^{-1}

Proposition
A lower bound on the asymptotic variance is given by (Σ_xz' S^{-1} Σ_xz)^{-1}. That bound is achieved if plim Ŵ = S^{-1}:

    Ŵ = Ŝ^{-1} = ( (1/n) Σ_{i=1}^n ε̂_i² x_i x_i' )^{-1}

Two-step estimation procedure

We need a consistent estimator Ŝ to calculate the efficient GMM estimator.
Ŝ based on any consistent estimator of δ is consistent for S.
Two-step procedure (see the Stata sketch below):
  1. Choose any convenient matrix Ŵ, such as Ŵ = I_K or Ŵ = S_xx^{-1}. Use this Ŵ to compute a consistent estimate of δ, δ̂(Ŵ), and then the ε̂_i² and Ŝ.
  2. Use Ŝ to calculate the efficient GMM estimator

        δ̂(Ŝ^{-1}) = (S_xz' Ŝ^{-1} S_xz)^{-1} S_xz' Ŝ^{-1} s_xy
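
In Stata the two-step efficient GMM estimator is available through ivregress gmm; a minimal sketch with hypothetical variable names (dependent variable y, exogenous regressor x1, endogenous regressor x2, excluded instruments z1 and z2):

* Hypothetical variables: y, x1 (exogenous), x2 (endogenous), z1 z2 (excluded instruments)
ivregress 2sls y x1 (x2 = z1 z2)                   // a convenient first-round estimator (2SLS)
ivregress gmm  y x1 (x2 = z1 z2), wmatrix(robust)  // two-step efficient GMM; S estimated from first-round residuals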

Hypothesis testing

    √n (δ̂(Ŵ) − δ)  is asymptotically  N(0, avar[δ̂(Ŵ)])

Under H0: δ_l = δ̄_l,

    t_l = √n (δ̂_l(Ŵ) − δ̄_l) / √( [estimated avar(δ̂(Ŵ))]_ll ) = (δ̂_l(Ŵ) − δ̄_l) / SE_l  converges in distribution to  N(0, 1)

Under H0: Rδ = r,

    n (Rδ̂(Ŵ) − r)' { R [estimated avar(δ̂(Ŵ))] R' }^{-1} (Rδ̂(Ŵ) − r)  converges in distribution to  χ²(#r)

where #r is the number of restrictions and R (#r x L) is of full row rank.

Hansen (1982) test for overidentifying restrictions

  - If the parameters are exactly identified (K = L) and Σ_xz is of full column rank, the GMM estimator solves the orthogonality conditions exactly:

        J(δ̃, Ŵ) ≡ n g_n(δ̃)' Ŵ g_n(δ̃) = 0

  - If we have more orthogonality conditions than parameters (K > L), then J(δ̃, Ŵ) ≠ 0
  - But: if the orthogonality conditions hold, the minimized distance should be close to 0
  - If we knew δ and have a consistent estimator Ŝ of S:

        J(δ, Ŝ^{-1}) = n g_n(δ)' Ŝ^{-1} g_n(δ) = √n ḡ' Ŝ^{-1} √n ḡ

    Since √n ḡ converges in distribution to N(0, S) and Ŝ converges in probability to S, J(δ, Ŝ^{-1}) converges in distribution to χ²(K).

Hansen (1982) test for overidentifying restrictions

Overidentifying restrictions test:

    H0: E(g_i) = E(x_i ε_i) = 0

Test statistic: for Ŵ = Ŝ^{-1},

    J(δ̂(Ŝ^{-1}), Ŝ^{-1}) = n g_n(δ̂(Ŝ^{-1}))' Ŝ^{-1} g_n(δ̂(Ŝ^{-1}))  converges in distribution to  χ²(K − L)

A large J is evidence against the orthogonality assumptions (if we trust the other assumptions).
  - The degrees of freedom drop from K to K − L: we estimate L parameters before computing the sample average g_n.
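
In Stata the Hansen J statistic is reported by estat overid after an overidentified ivregress gmm fit; a minimal sketch with the same hypothetical variables as above:

* Hypothetical overidentified model: one endogenous regressor x2, two excluded instruments z1 z2
ivregress gmm y x1 (x2 = z1 z2), wmatrix(robust)
estat overid        // Hansen J statistic, chi2(K - L) = chi2(1) in this sketch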

Two stage least squares

3.7 Conditional homoskedasticity:

    E(ε_i²|x_i) = σ²
    S = E(ε_i² x_i x_i') = σ² Σ_xx

Efficient GMM then becomes two stage least squares (2SLS):

    δ̂(Ŝ^{-1}) = (S_xz' (σ̂² S_xx)^{-1} S_xz)^{-1} S_xz' (σ̂² S_xx)^{-1} s_xy
              = (S_xz' S_xx^{-1} S_xz)^{-1} S_xz' S_xx^{-1} s_xy
              ≡ δ̂_2SLS

δ̂_2SLS does not depend on σ̂²: it can be computed in one step.

Asymptotic variance of the 2SLS estimator (I)

    δ̂_2SLS = (S_xz' S_xx^{-1} S_xz)^{-1} S_xz' S_xx^{-1} s_xy
           = δ + (S_xz' S_xx^{-1} S_xz)^{-1} S_xz' S_xx^{-1} (1/n) Σ_{i=1}^n x_i ε_i

    δ̂_2SLS − δ = (S_xz' S_xx^{-1} S_xz)^{-1} S_xz' S_xx^{-1} ḡ

We derive the asymptotic distribution of √n(δ̂_2SLS − δ) from the asymptotic distribution of √n ḡ:

    √n ḡ  is asymptotically  N(0, σ² Σ_xx)
    √n (δ̂_2SLS − δ)  is asymptotically  N(0, A σ² Σ_xx A')

where A = (Σ_xz' Σ_xx^{-1} Σ_xz)^{-1} Σ_xz' Σ_xx^{-1}.

Asymptotic variance of the 2SLS estimator (II)

Notice how terms cancel out. With B ≡ Σ_xz' Σ_xx^{-1} Σ_xz:

    A σ² Σ_xx A' = B^{-1} Σ_xz' Σ_xx^{-1} σ² Σ_xx Σ_xx^{-1} Σ_xz B^{-1}
                 = σ² B^{-1} Σ_xz' Σ_xx^{-1} Σ_xz B^{-1}
                 = σ² B^{-1} B B^{-1}
                 = σ² B^{-1}
                 = σ² (Σ_xz' Σ_xx^{-1} Σ_xz)^{-1}

Two stage least squares

Asymptotic variance:

    avar(δ̂_2SLS) = σ² (Σ_xz' Σ_xx^{-1} Σ_xz)^{-1}

Estimator of the asymptotic variance:

    estimated avar(δ̂_2SLS) = σ̂² (S_xz' S_xx^{-1} S_xz)^{-1},    with σ̂² = (1/(n − L)) Σ_{i=1}^n (y_i − z_i'δ̂_2SLS)²

Under conditional homoskedasticity the Hansen J statistic becomes Sargan's statistic:

    Sargan's statistic = n (s_xy − S_xz δ̂_2SLS)' S_xx^{-1} (s_xy − S_xz δ̂_2SLS) / σ̂²  converges in distribution to  χ²(K − L)

Computation of the Sargan statistic (see the Stata sketch below):
  1. Estimate the equation y_i = z_i'δ + ε_i by means of 2SLS (with x_i as instruments). Compute the residuals ε̂_i = y_i − z_i'δ̂_2SLS.
  2. Regress ε̂_i on x_i and compute the Sargan statistic as n·R².
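
A minimal Stata sketch of this two-step computation (hypothetical variables as before; estat overid after ivregress 2sls reports essentially the same statistic):

* Hypothetical variables: y, x1 (exogenous), x2 (endogenous), z1 z2 (excluded instruments)
ivregress 2sls y x1 (x2 = z1 z2)
predict ehat, residuals            // step 1: 2SLS residuals
quietly reg ehat x1 z1 z2          // step 2: regress residuals on the full instrument set
display "Sargan statistic = " e(N)*e(r2) ", p-value = " chi2tail(1, e(N)*e(r2))   // df = K - L = 1 here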

Why call it two stage least squares?

δ̂_2SLS can be computed in 2 steps:
  1. First stage regression: regress the (endogenous) regressors in z_i on the instruments and exogenous regressors x_i. Calculate the fitted values ẑ_i.
  2. Second stage regression: regress the dependent variable y_i on ẑ_i.

Check the instrument relevance assumption with an F-test on the exclusion restrictions in the first stage regression (see the Stata sketch below):
  - Test the joint significance of the instruments in the regression explaining the endogenous regressor
  - The F statistic should exceed 10

In general: it is always important to check instrument relevance
  - GMM is very inefficient if the instruments are weakly correlated with the endogenous regressors
  - The bias of GMM estimators is large if the correlation is very weak
  - See Stock and Yogo (2002) for test statistics
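
A minimal Stata sketch of the relevance check (hypothetical variables as before):

* Check relevance of the excluded instruments z1 z2
ivregress 2sls y x1 (x2 = z1 z2)
estat firststage                   // first-stage F statistic and partial R-squared
* equivalently, run the first stage by hand:
reg x2 x1 z1 z2
test z1 z2                         // F-test of the exclusion restrictions (rule of thumb: F > 10)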

Hausman test for 2SLS

  - 2SLS is consistent but less efficient than OLS if all regressors are exogenous
  - OLS is preferable to 2SLS if the regressors are exogenous
  - Hausman test: compare δ̂_OLS and δ̂_2SLS:

Table: Logic of the Hausman test

                  H0: E(z_i ε_i) = 0                  Ha: E(z_i ε_i) ≠ 0 (and E(x_i ε_i) = 0)
    δ̂_OLS         consistent                          inconsistent
    δ̂_2SLS        consistent                          consistent
                  plim(δ̂_2SLS − δ̂_OLS) = 0            plim(δ̂_2SLS − δ̂_OLS) ≠ 0

  - Test whether the OLS and 2SLS estimates are significantly different
      If not different: regressors are exogenous, use OLS
      If different: regressors are endogenous, use 2SLS

Hausman test for 2SLS

Reject H0 if the difference δ̂_2SLS − δ̂_OLS is large. Test statistic:

    H = (δ̂_2SLS − δ̂_OLS)' [ V̂(δ̂_2SLS − δ̂_OLS) ]^{-1} (δ̂_2SLS − δ̂_OLS)  asymptotically  χ²(q)

where q is the number of parameters being tested.
  - Tricky bit: estimating

        V(δ̂_2SLS − δ̂_OLS) = V(δ̂_2SLS) + V(δ̂_OLS) − 2 Cov(δ̂_2SLS, δ̂_OLS)

  - Usual assumption: one of the estimators is efficient (OLS efficient). Then V(δ̂_2SLS − δ̂_OLS) = V(δ̂_2SLS) − V(δ̂_OLS)
  - The assumption of efficiency is often not met (OLS is not efficient under heteroskedasticity)
  - Heteroskedasticity-robust versions of the Hausman test:
      Estimate V(δ̂_2SLS − δ̂_OLS) directly (for the formula, see Cameron and Trivedi, 2005)
      Bootstrap
      Use the auxiliary regression from the efficient case, but perform the subset-of-regressors test using robust standard errors

Hausman test for 2SLS

Computation of the Hausman test (see the Stata sketch below):
  1. Estimate the first stage regression by OLS and obtain the residuals.
  2. Estimate the second stage by OLS and include the first stage residuals as an additional regressor.
  3. Test the significance of the first stage residuals (t-test), using robust SEs in the second step.

If there are multiple RHS variables that may be endogenous: include all first stage residuals and do an F-test of joint significance, using a robust estimate of the covariance matrix.
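
A minimal Stata sketch of the regression-based (robust) Hausman test, with the same hypothetical variables as before:

* Regression-based endogeneity test for x2
reg x2 x1 z1 z2                     // first stage
predict v2hat, residuals
reg y x1 x2 v2hat, vce(robust)      // second stage augmented with the first-stage residual
test v2hat                          // significant => reject exogeneity of x2
* built-in alternative after ivregress:
quietly ivregress 2sls y x1 (x2 = z1 z2), vce(robust)
estat endogenous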

GMM in practice

Now you know the theory of GMM


Credibility of any GMM analysis rests on
Relevance of the instrument(s)
Partial correlation between instruments and endogenous regressors

Validity of the instrument(s)


Test overidentifying restrictions (if applicable)

The most credible instruments rely on variation that is completely


outside control of the subjects being studied
Actual randomized experiments (randomization done with analysis in
mind)
Natural experiments (randomization occurs by accident)

Jochem de Bresser (RUG)

Week 5

65/84

March 2016

65 / 84

Expert Opinion and Compensation: Evidence from a


Musical Competition

Example of a valid instrument: Ginsburgh and Van Ours (2003)


Many economic decisions made by experts
Financial analysts
Academic committees
Olympic juries

Question: do experts opinion reflect true quality, or do they influence


economic outcomes independently of their value as a signal of quality?
Why not regress measure of success on expert opinions using OLS?

Jochem de Bresser (RUG)

Week 5

66/84

March 2016

66 / 84

Expert Opinion and Compensation: Evidence from a


Musical Competition

Question: do experts opinion reflect true quality, or do they influence


economic outcomes independently of their value as a signal of quality?
Why not regress measure of success on expert opinions using OLS?
True quality is likely to drive success
True quality unobservable and likely to be correlated with experts
opinions
Omitted variable bias

Solution: use an instrument that is


Correlated with ranking (relevant)
Uncorrelated with unobservable quality (valid)

Jochem de Bresser (RUG)

Week 5

67/84

March 2016

67 / 84

Expert Opinion and Compensation: Evidence from a


Musical Competition
Ginsburgh and Van Ours look at Queen Elizabeth musical
competition
Held every year in Belgium
Best known international competition for piano and violin

Unusual characteristic: finalists get a single week to study a


completely new concerto
Order in which finalists play is randomized
Two finalists perform per evening

Instrument: order in which finalists play


Order is random: uncorrelated with quality (valid instrument)
Order may be correlated with ranking in competition (relevant
instrument)

Jochem de Bresser (RUG)

Week 5

68/84

March 2016

68 / 84

Expert opinion and compensation: the endogeneity problem

In terms of our general notation:
  - y_i = measures of success: presence in catalogues and ratings by Belgian music critics
  - z_i = ranking in the competition (r_i, endogenous)
  - x_i = order of the finalists (exclusion restriction)

Second stage:

    y_i = β_0 + β_1 r_i + β_2 q_i + u_i
        = β_0 + β_1 r_i + ε_i

where ε_i = β_2 q_i + u_i (quality q_i is not observed).

r_i is endogenous if
  - β_2 ≠ 0 (quality matters for success)
  - cov(r_i, q_i) ≠ 0 (ranking is correlated with quality)
Jochem de Bresser (RUG)

Week 5

69/84

March 2016

69 / 84

Expert opinion and compensation: descriptive statistics

Jochem de Bresser (RUG)

Week 5

70/84

March 2016

70 / 84

Expert opinion and compensation: OLS estimation

Estimated equation:
yi = 0 + 1 ri + i
Success measured by presence in catalogues:

Jochem de Bresser (RUG)

Week 5

71/84

March 2016

71 / 84

Expert opinion and compensation: OLS estimation

Estimated equation:
yi = 0 + 1 ri + i
Success measured by critics reviews:

Jochem de Bresser (RUG)

Week 5

72/84

March 2016

72 / 84

Expert opinion and compensation: first stage(s)

First stage:
ri = 0 + 1 firsti + 2 femalei + 3 latei + i
where firsti = 1{i was first to perform in competition};
femalei = 1{i is female};
latei = 1{i was second to play on particular evening}
Test for relevance: test for joint significance of instruments

Jochem de Bresser (RUG)

Week 5

73/84

March 2016

73 / 84

Expert opinion and compensation: first stage(s)


First stage:
ri = 0 + 1 firsti + 2 femalei + 3 latei + i

Jochem de Bresser (RUG)

Week 5

74/84

March 2016

74 / 84

Expert opinion and compensation: first stage(s)


First stage:
ri = 0 + 1 firsti + i

Jochem de Bresser (RUG)

Week 5

75/84

March 2016

75 / 84

Expert opinion and compensation: conclusions from the first stage

  - The instrument first is highly relevant
  - Joint test for significance of the instruments: F-statistic smaller than 10
  - Musicians who play during the first evening rank 3 places lower on average
  - Females are ranked 2 places lower on average
  - Those who perform second on an evening gain 1 rank
  - Though the order of performing is random, it does affect the final ranking
  - The jury hears the concerto for the first time too: maybe they become less severe as the competition unfolds

Jochem de Bresser (RUG)

Week 5

76/84

March 2016

76 / 84

Expert opinion and compensation: second stage


Second stage success measured by presence in catalogues
yi = 0 + 1 rbi + i
One instrument:

(OLS estimate 0.0919, SE 0.0282)


Jochem de Bresser (RUG)

Week 5

77/84

March 2016

77 / 84

Expert opinion and compensation: second stage


Second stage success measured by presence in catalogues
yi = 0 + 1 rbi + i
Multiple instruments:

(OLS estimate 0.0919, SE 0.0282)


Jochem de Bresser (RUG)

Week 5

78/84

March 2016

78 / 84

Expert opinion and compensation: second stage


Second stage success measured by critics reviews
yi = 0 + 1 rbi + i
One instrument:

(OLS estimate 1.475, SE 0.274)


Jochem de Bresser (RUG)

Week 5

79/84

March 2016

79 / 84

Expert opinion and compensation: second stage


Second stage success measured by critics reviews
yi = 0 + 1 rbi + i
Multiple instruments:

(OLS estimate 1.475, SE 0.274)


Jochem de Bresser (RUG)

Week 5

80/84

March 2016

80 / 84

Post-estimation commands after ivregress


Relevance of instruments

Overidentifying restrictions (Hansen/Sargans test)

Endogeneity of regressor (Hausman test)

Jochem de Bresser (RUG)

Week 5

81/84

March 2016

81 / 84

Expert opinion and compensation: conclusions from


second stage

Effect of ranking on success is larger for 2SLS than OLS


Ranking and true quality negatively correlated?
Expert judges have different opinion than critics and record buyers

Expert opinion has effect on success over and above signaling


It pays to do well in competitions
But: results from competitions are affected by unexpected factors
Sheds doubt on ability of experts to cast objective judgements

Jochem de Bresser (RUG)

Week 5

82/84

March 2016

82 / 84

Expert opinion and compensation: obtain 2SLS estimates


in two stages:
1

Estimate first stage by OLS, obtain residuals

Add first stage residuals to second stage regression, estimate by OLS:

t-test on resid1st constitutes Hausman test


We fail to reject the null hypothesis that ranking is exogenous
Jochem de Bresser (RUG)

Week 5

83/84

March 2016

83 / 84

Useful Stata commands

  - Just-identified case (IV estimator): ivregress 2sls
  - Overidentified case (GMM):
      Homoskedasticity (2SLS): ivregress 2sls
      Use a robust VCE to guard against heteroskedastic errors (Stata option vce(robust))
      Optimal GMM: ivregress gmm
  - Hausman test for endogeneity:
      Non-robust Hausman test (OLS is efficient): hausman
      Robust Hausman test (OLS is not efficient): estat endogenous
  - Test of overidentifying restrictions: estat overid after ivregress gmm
  - First stage estimates: estat firststage after ivregress
      Or simply run the first stage regression(s) of the endogenous regressors on all instruments

A combined sketch follows below.
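
A minimal end-to-end sketch tying these commands together (hypothetical variables y, exogenous x1, endogenous x2, excluded instruments z1 and z2; not taken from the course do-files):

* Hypothetical data: y, x1 (exogenous), x2 (endogenous), z1 z2 (excluded instruments)
ivregress 2sls y x1 (x2 = z1 z2), vce(robust)    // 2SLS with heteroskedasticity-robust SEs
estat firststage                                  // instrument relevance (first-stage F)
estat endogenous                                  // robust test of exogeneity of x2
ivregress gmm y x1 (x2 = z1 z2), wmatrix(robust)  // efficient two-step GMM
estat overid                                      // Hansen J test (requires K > L)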
