Recap
y = Xβ + u,  where y is n×1, X is n×(k+1), β is (k+1)×1, and u is n×1

β̂ = (X′X)⁻¹X′y
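I As an illustration, a minimal numpy sketch of this formula on simulated data (the design and parameter values are our own, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
# Design matrix with an intercept column and k = 2 regressors
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta = np.array([1.0, 2.0, -0.5])          # "true" parameters (made up)
y = X @ beta + rng.normal(size=n)          # population model y = Xβ + u

# OLS: solve the normal equations (X'X) b = X'y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)                            # close to beta, but not equal
```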
I A consequence of orthogonality of residuals and columns of X is that
Σ (yi − ȳ)² = Σ (ŷi − ȳ)² + Σ ûi²  (all sums over i = 1, …, n)
or
SST = SSE + SSR
I This leads to the definition of the coefficient of determination R², which is a measure of goodness of fit:
R² = SSE/SST = 1 − SSR/SST
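I A short numeric check of this decomposition and of R², continuing the kind of simulated example above (all names are our own):

```python
import numpy as np

# Same simulated sample as in the earlier sketch
rng = np.random.default_rng(0)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

y_hat = X @ beta_hat                       # fitted values
u_hat = y - y_hat                          # residuals
SST = np.sum((y - y.mean()) ** 2)
SSE = np.sum((y_hat - y.mean()) ** 2)
SSR = np.sum(u_hat ** 2)
print(np.isclose(SST, SSE + SSR))          # True: the decomposition holds
print(1 - SSR / SST)                       # R-squared
```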
I The OLS formula gives us the fitted values that are closest to the actual values, in the sense that the length of the residual vector is the smallest possible.
I But why should this be a good estimate of the unknown parameters of the conditional expectation function?
I We ask whether this formula produces the best information we can get from our data about the unknown parameters β.
Lecture Outline
I Interpretation of the OLS estimates in multiple regression (textbook
reference 3-2a to 3-2e)
I Properties of good estimators
I Properties of the OLS estimator β̂
1. OLS is unbiased: The expected value of β̂ (Textbook reference 3-3, and Appendix E-2)
2. The Variance of β̂ (Textbook reference 3-4, and Appendix E-2)
3. OLS is BLUE (Gauss-Markov Theorem): The Efficiency of β̂ (Textbook reference 3-5, and Appendix E-2)
I Units of measurement: do the results qualitatively change if we
change the units of measurement? (textbook reference 2-4a, 6-1
(exclude 6-1a))
Interpretation of OLS estimates
Example: The causal effect of education on wage
wage = β0 + β1 educ + β2 IQ + u
I The coefficient of educ now shows that for two people with the same IQ score, the one with one more year of education is expected to earn about $42 more per month.
E(y | x1, x2) = β0 + β1 x1 + β2 x2
I What if we increase x1 by 1 unit and hold x2 fixed, that is, ∆x1 = 1 and ∆x2 = 0? Then ∆ŷ = β̂1.
I Similarly,
∆ŷ = β̂2 if ∆x1 = 0 and ∆x2 = 1
I Let’s go back to regression output and interpret the parameters.
ŵage = −128.89 + 42.06 educ + 5.14 IQ
n = 935, R² = .134
I 42.06 shows that for two people with the same IQ, the one with one
more year of education is predicted to earn $42.06 more in monthly
wages.
I Or: every extra year of education increases the predicted wage by $42.06, keeping IQ constant (or “after controlling for IQ”, or “after removing the effect of IQ”, or “all else constant”, or “all else equal”, or “ceteris paribus”).
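I To make the ceteris paribus reading concrete, a small sketch that evaluates the fitted equation above (the helper function is ours, purely illustrative):

```python
def wage_hat(educ, iq):
    """Predicted monthly wage from the fitted equation above."""
    return -128.89 + 42.06 * educ + 5.14 * iq

# Two people with the same IQ, one more year of education:
print(wage_hat(13, 100) - wage_hat(12, 100))   # approximately 42.06
```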
Effects of rescaling
The upshot: changing units of measurement does not create any extra
information, so it only changes OLS results in predictable and
non-substantive ways.
I Scaling x → cx: Any change of unit of measurement that involves multiplying x by a constant c (such as changing dollars to cents or pounds to kilograms) does not change the column space of X, so it will not change ŷ. Therefore, it must be that the coefficient of x changes such that xβ̂ stays the same: the new coefficient is the old one divided by c.
I Change of units of the form x → a + cx: Any change of unit of measurement that involves multiplying x by a constant and adding or subtracting a constant (such as changing from °C to °F) does not change the column space of X either, because the intercept’s column of ones absorbs the shift a; so it will not change ŷ. Therefore, it must be that β̂ changes in such a way that Xβ̂ stays the same.
I In both cases, since ŷ does not change, the residuals, SST, SSE and SSR all stay the same, so R² will not change.
I Changing the units of the dependent variable, y → cy: This changes the length of y, but the column space of X is still the same. So ŷ will be multiplied by c, and since both y and ŷ are multiplied by c, the residuals will be multiplied by c also.
I In this case SST, SSE and SSR all change, but all will be multiplied by c², so again R² will not change.
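I A quick numeric check of these claims, using the same kind of simulated design as the earlier sketch (the scaling constants are arbitrary choices of ours):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Rescale a regressor: x -> 100 x (e.g., dollars to cents)
X2 = X.copy()
X2[:, 1] *= 100.0
b2 = np.linalg.solve(X2.T @ X2, X2.T @ y)
print(np.allclose(X2 @ b2, X @ beta_hat))    # True: fitted values unchanged
print(np.isclose(b2[1], beta_hat[1] / 100))  # True: coefficient divided by 100

# Rescale the dependent variable: y -> 10 y
b3 = np.linalg.solve(X.T @ X, X.T @ (10.0 * y))
print(np.allclose(b3, 10.0 * beta_hat))      # True: beta_hat multiplied by 10
```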
Estimators and their Unbiasedness
The Expected Value of the OLS Estimator
I We use an important property of conditional expectations to prove that the OLS estimator is unbiased: if E(z | w) = 0, then E(g(w)z) = 0 for any function g. For example, E(wz) = 0, E(w²z) = 0, etc.
I Now let’s show E(β̂) = β.
I The assumption of no perfect collinearity (E.2) is immediately required, because without it X′X is not invertible and β̂ is not even defined.
I Step 1: Using the population model (Assumption E.1), substitute for y in the estimator’s formula and simplify:
β̂ = (X′X)⁻¹X′y
  = (X′X)⁻¹X′(Xβ + u)
  = β + (X′X)⁻¹X′u
I Step 2: Take expectations:
E(β̂) = E(β + (X′X)⁻¹X′u)
     = β + E((X′X)⁻¹X′u)
     = β
where the last equality follows from the zero conditional mean assumption (E.3), E(u | X) = 0, together with the property above applied with g(X) = (X′X)⁻¹X′.
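I A small Monte Carlo sketch of what unbiasedness means (the simulation design is our own): across repeated samples drawn from the same population model, the average of β̂ is close to β.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # fixed design
beta = np.array([1.0, 2.0, -0.5])

estimates = []
for _ in range(5000):                       # repeated samples: redraw u only
    y = X @ beta + rng.normal(size=n)
    estimates.append(np.linalg.solve(X.T @ X, X.T @ y))

print(np.mean(estimates, axis=0))           # close to [1.0, 2.0, -0.5]
```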
I Are these assumptions too strong?
I Linearity in parameters is not too strong, and does not exclude
non-linear relationships between y and x (more on this later)
I Random sample is not too strong for cross-sectional data if
participation in the sample is not voluntary. Randomness is
obviously not correct for time series data.
I Perfect multicollinearity is quite unlikely unless we have done something silly like using income in dollars and income in cents in the list of independent variables, or we have fallen into the “dummy variable trap” (more on this later; a numeric sketch follows this list)
I Zero conditional mean is not a problem for predictive analytics,
because for best prediction, we always want our best estimate of
E (y | x1 , x2 , . . . , xk ) for a set of observed predictors.
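I A minimal numeric sketch of the dollars-and-cents example from the list above (simulated data, all names our own): the duplicated information makes X lose rank, so X′X is singular.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 50
income_dollars = rng.normal(50_000, 10_000, size=n)
income_cents = 100 * income_dollars        # same information, different units
X = np.column_stack([np.ones(n), income_dollars, income_cents])

print(np.linalg.matrix_rank(X))            # 2, not 3: perfect collinearity
# X'X is singular (up to rounding), so (X'X)^{-1} does not exist
# and the OLS formula is not defined
```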
I Zero conditional mean can be a problem for prescriptive analytics (causal analysis) when we want to establish the causal effect of one of the x variables, say x1, on y while controlling for an attribute that we cannot measure. E.g. the causal effect of education on wage keeping ability constant: since ability is not observed, it ends up in the error term, and if ability is correlated with education then E(u | educ) ≠ 0.
I Zero conditional mean can be quite restrictive in time series analysis
as well.
I It implies that the error term in any time period t is uncorrelated
with each of the regressors, in all time periods, past, present and
future.
I Assumption MLR.4 is violated when the regression model contains a lag of the dependent variable as a regressor. E.g. we want to predict this quarter’s GDP using GDP outcomes for the past 4 quarters.
I In this case, the OLS estimators of the regression parameters are biased.
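I A tiny simulation of this bias (an AR(1) design of our own choosing): OLS on y_t = ρ y_{t−1} + e_t comes out noticeably below ρ in short samples.

```python
import numpy as np

rng = np.random.default_rng(3)
rho, T, reps = 0.9, 25, 20000
est = []
for _ in range(reps):
    y = np.zeros(T)
    for t in range(1, T):                   # generate an AR(1) series
        y[t] = rho * y[t - 1] + rng.normal()
    x, yy = y[:-1], y[1:]
    est.append((x @ yy) / (x @ x))          # OLS slope on the lagged value
print(np.mean(est))                         # clearly below 0.9: biased
```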
The Variance of the OLS Estimator
I Coming up with unbiased estimators is not too hard. But how the estimates they produce are dispersed around their mean determines how precise they are.
I To study the variance of β̂, we need to learn about variances of a vector of random variables: the variance-covariance matrix (see the digression at the end of these slides).
I We introduce an extra assumption, homoskedasticity (E.4, or MLR.5): Var(u | X) = σ²Iₙ, i.e., conditional on X the errors have constant variance σ² and are uncorrelated across observations.
I Under these assumptions, the variance-covariance matrix of the OLS estimator conditional on X is
Var(β̂ | X) = σ²(X′X)⁻¹
because β̂ − β = (X′X)⁻¹X′u, so, using the property Var(Av) = A Var(v) A′,
Var(β̂ | X) = (X′X)⁻¹X′ Var(u | X) X(X′X)⁻¹ = (X′X)⁻¹X′(σ²Iₙ)X(X′X)⁻¹ = σ²(X′X)⁻¹
I We can immediately see that, given X, the OLS estimator will be more precise (i.e. its variance will be smaller) the smaller σ² is.
I It can also be seen (though it is not as obvious) that as we add observations to the sample, the variance of the estimator decreases, which also makes sense.
I Gauss-Markov Theorem: Under Assumptions E.1 to E.4 (or MLR.1 to MLR.5), β̂ is the best linear unbiased estimator (B.L.U.E.) of β.
I This means that there is no other estimator that can be written as a linear combination of the elements of y that is unbiased and has a lower variance than β̂.
I This is the reason that everybody loves the OLS estimator.
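I A numeric check of the variance formula under a simulated design of our own (with σ² = 1): the Monte Carlo variance-covariance matrix of β̂ should be close to σ²(X′X)⁻¹.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # fixed design
beta = np.array([1.0, 2.0, -0.5])

# Monte Carlo: repeated samples with sigma^2 = 1
estimates = np.array([
    np.linalg.solve(X.T @ X, X.T @ (X @ beta + rng.normal(size=n)))
    for _ in range(20000)
])

V_theory = np.linalg.inv(X.T @ X)              # sigma^2 (X'X)^{-1}, sigma^2 = 1
V_empirical = np.cov(estimates, rowvar=False)  # var-cov of simulated estimates
print(np.max(np.abs(V_theory - V_empirical)))  # small
```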
Estimating the Error Variance
I We can calculate β̂, and we showed that
Var(β̂ | X) = σ²(X′X)⁻¹
I But σ² is the variance of the unobserved errors, so it has to be estimated; the estimator used is σ̂² = SSR/(n − k − 1).
I Why do we divide SSR by n − k − 1 instead of n?
I In order to get an unbiased estimator of σ²:
E(û′û / (n − k − 1)) = σ²
If we divide by n, the expected value of the estimator will be slightly different from the true parameter (the proof of unbiasedness of σ̂² is not required).
I Of course if the sample size n is large, this bias is very small
I Think about dimensions: ŷ is in the column space of X, so it is in a
subspace with dimension k + 1
I û is orthogonal to the column space of X, so it is in a subspace with dimension n − (k + 1) = n − k − 1. So even though there are n coordinates in û, only n − k − 1 of them are free (it has n − k − 1 degrees of freedom)
Standard error of each parameter estimate
The estimated variance-covariance matrix replaces σ² with σ̂²:
V̂ar(β̂ | X) = σ̂²(X′X)⁻¹
The standard error of each β̂j is the square root of the corresponding diagonal element of this matrix.
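I A minimal sketch of these formulas in numpy (simulated data, all names our own):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

kp1 = X.shape[1]                                # k + 1 parameters
u_hat = y - X @ beta_hat
sigma2_hat = (u_hat @ u_hat) / (n - kp1)        # SSR / (n - k - 1)

V_hat = sigma2_hat * np.linalg.inv(X.T @ X)     # estimated var-cov of beta_hat
se = np.sqrt(np.diag(V_hat))                    # standard error of each estimate
print(se)
```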
I Example: Predicting wage with education and experience
Summary
Digression: The Variance-Covariance Matrix
I We define the variance-covariance matrix of a k × 1 random vector v as
Var(v) = E[(v − E(v))(v − E(v))′],
the k × k matrix whose (i, j) element is σij = Cov(vi, vj).
I Note that since σij = Cov(vi, vj) = Cov(vj, vi) = σji, this matrix is symmetric.
I An important property: If A is an n × k matrix of constants, then
Var(Av) = A Var(v) A′
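I A quick numeric check of this property (Σ and A below are arbitrary choices of ours): the sample var-cov matrix of Av over many draws should be close to A Var(v) A′.

```python
import numpy as np

rng = np.random.default_rng(2)
Sigma = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.0, 0.3],
                  [0.0, 0.3, 1.5]])            # Var(v), symmetric
A = rng.normal(size=(2, 3))                    # matrix of constants

# Draw many v ~ N(0, Sigma) and compare Var(Av) with A Sigma A'
v = rng.multivariate_normal(np.zeros(3), Sigma, size=200000)
Av = v @ A.T
print(np.round(A @ Sigma @ A.T, 3))
print(np.round(np.cov(Av, rowvar=False), 3))   # close to A Sigma A'
```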