
Problem Set 4

1. Paper & Pen


a) Consider the bivariate regression
$$ y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, $$
where $x_i$ is endogenous, and let $z$ be an instrument satisfying the moment conditions
$$ E(\varepsilon) = 0, \qquad E(z\varepsilon) = 0. $$
Stacking the instruments as $z_i = (1, z_i)'$ and the regressors as $x_i = (1, x_i)'$, the moment conditions read
$$ E[z_i \varepsilon_i] = E[z_i(y_i - x_i'\beta)] = E[z_i y_i] - E[z_i x_i']\beta = 0, $$
so that
$$ E[z_i x_i']\beta = E[z_i y_i]. $$
If one can assume that $E[z_i x_i']$ exists, is finite and invertible, we can solve for $\beta$ (the $2\times 1$ vector containing $\beta_0$ and $\beta_1$). Hence,
$$ \beta = E[z_i x_i']^{-1} E[z_i y_i]. $$

Another way to solve the question is to work with the two scalar moment conditions directly. From $E(\varepsilon) = 0$:
$$ E[y - \beta_0 - \beta_1 x] = 0 \;\Rightarrow\; \beta_0 = E(y) - \beta_1 E(x). $$
From $E(z\varepsilon) = 0$:
$$ E[z(y - \beta_0 - \beta_1 x)] = 0 \;\Rightarrow\; E(zy) - \beta_0 E(z) - \beta_1 E(zx) = 0. $$
Substituting $\beta_0 = E(y) - \beta_1 E(x)$ into the second condition,
$$ E(zy) - \left[E(y) - \beta_1 E(x)\right] E(z) - \beta_1 E(zx) = 0, $$
and collecting the terms in $\beta_1$,
$$ \beta_1 \left[ E(zx) - E(z)E(x) \right] = E(zy) - E(z)E(y). $$
Therefore
$$ \beta_1 = \frac{E(zy) - E(z)E(y)}{E(zx) - E(z)E(x)} = \frac{\mathrm{Cov}(z, y)}{\mathrm{Cov}(z, x)}, \qquad \beta_0 = E(y) - \frac{\mathrm{Cov}(z, y)}{\mathrm{Cov}(z, x)}\, E(x), $$
with $\mathrm{Cov}(z, x) \neq 0$, the condition for relevance of the instrumental variable $z$.

b) From a) we have that $\beta = E[z_i x_i']^{-1} E[z_i y_i]$. By the Law of Large Numbers we can replace $E[z_i x_i']$ by $\frac{1}{n}\sum_{i=1}^{n} z_i x_i'$, which converges in probability to $E[z_i x_i']$; the same can be applied to $E[z_i y_i]$, substituting $\frac{1}{n}\sum_{i=1}^{n} z_i y_i$ into the equation for $\beta$. Hence,
$$ \hat\beta_{IV} = \left( \frac{1}{n}\sum_{i=1}^{n} z_i x_i' \right)^{-1} \left( \frac{1}{n}\sum_{i=1}^{n} z_i y_i \right). $$
With $x_i = (1, x_i)'$ and $z_i = (1, z_i)'$, if $z_i = x_i$ (each regressor instruments itself) the expression reduces to the OLS estimator.

For the slope coefficient, the sample analogue of $\beta_1 = \mathrm{Cov}(z,y)/\mathrm{Cov}(z,x)$ is
$$ \hat\beta_1^{IV} = \frac{\widehat{\mathrm{Cov}}(z,y)}{\widehat{\mathrm{Cov}}(z,x)} = \frac{\frac{1}{n}\sum_{i=1}^{n}(z_i - \bar z)(y_i - \bar y)}{\frac{1}{n}\sum_{i=1}^{n}(z_i - \bar z)(x_i - \bar x)}. $$
Since $\sum_{i=1}^{n}(z_i - \bar z) = 0$, the means $\bar y$ and $\bar x$ drop out of the numerator and denominator, so
$$ \hat\beta_1^{IV} = \frac{\sum_{i=1}^{n}(z_i - \bar z)\, y_i}{\sum_{i=1}^{n}(z_i - \bar z)\, x_i}, \qquad \hat\beta_0^{IV} = \bar y - \hat\beta_1^{IV}\, \bar x. $$

c) Now $z \in \{0, 1\}$. Taking again $\hat\beta_1 = \widehat{\mathrm{Cov}}(y,z)/\widehat{\mathrm{Cov}}(x,z)$ and dividing numerator and denominator by the variance of $z$ brings
$$ \hat\beta_1 = \frac{\mathrm{Cov}(y,z)/\mathrm{Var}(z)}{\mathrm{Cov}(x,z)/\mathrm{Var}(z)} = \frac{\pi_y}{\pi_x}, $$
where $\pi_y$ is the slope of the regression of $y$ on $z$ and $\pi_x$ is the slope of the regression of $x$ on $z$.

As $z$ is binary, $\pi_y = E(y \mid z=1) - E(y \mid z=0)$ and $\pi_x = E(x \mid z=1) - E(x \mid z=0)$. So the IV estimator becomes
$$ \hat\beta_1^{IV} = \frac{E(y \mid z=1) - E(y \mid z=0)}{E(x \mid z=1) - E(x \mid z=0)}, \qquad \hat\beta_0^{IV} = \bar y - \hat\beta_1^{IV}\, \bar x, $$
which is called the Wald estimator; in the sample, the conditional expectations are replaced by the corresponding subgroup means.
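As an illustration, a minimal Stata sketch of the Wald estimator computed from subgroup means; y, x and z are placeholder variable names for a hypothetical dataset in memory:

    * Wald estimator from subgroup means (y, x, z are placeholder names)
    summarize y if z == 1, meanonly
    scalar ybar1 = r(mean)
    summarize y if z == 0, meanonly
    scalar ybar0 = r(mean)
    summarize x if z == 1, meanonly
    scalar xbar1 = r(mean)
    summarize x if z == 0, meanonly
    scalar xbar0 = r(mean)
    scalar b1_wald = (ybar1 - ybar0) / (xbar1 - xbar0)
    display "Wald estimate of beta1: " b1_wald
    * The same number comes out of the just-identified 2SLS fit:
    * ivregress 2sls y (x = z)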

2. Data
Table 1
                          (1)          (2)          (3)          (4)
VARIABLES                 lwage76      ed76         lwage76      lwage76

Education (ed76)          0.0561***                 0.0671***    0.0679***
                          (0.00437)                 (0.0114)     (0.0113)
Lived Metrop. Area        0.163***     0.332***     0.159***     0.159***
(smsa76)                  (0.0150)     (0.0921)     (0.0155)     (0.0155)
Lived in South            -0.120***    -0.147*      -0.118***    -0.118***
(south76)                 (0.0151)     (0.0864)     (0.0154)     (0.0154)
Enrolled (enroll76)       -0.122***    1.012***     -0.134***    -0.134***
                          (0.0249)     (0.116)      (0.0267)     (0.0267)
Black                     -0.119***    -0.0107      -0.116***    -0.116***
                          (0.0192)     (0.112)      (0.0196)     (0.0196)
KWW score (kww)           0.00750***   0.110***     0.00605***   0.00593***
                          (0.00110)    (0.00548)    (0.00181)    (0.00180)
Married (mar76)           -0.0323***   0.108***     -0.0338***   -0.0338***
                          (0.00356)    (0.0190)     (0.00383)    (0.00383)
Experience (exp76)        0.0573***                 0.0507***    0.0512***
                          (0.00690)                 (0.0168)     (0.0168)
Experience^2 (exp762)     -0.00154***               -0.00115     -0.00117
                          (0.000314)                (0.000826)   (0.000826)
Mother's Educ. (momed)                 0.162***
                                       (0.0170)
Father's Educ. (daded)                 0.142***
                                       (0.0146)
Constant                  4.951***     5.970***     4.879***     4.870***
                          (0.0713)     (0.234)      (0.134)      (0.134)

Observations              2,956        2,956        2,956        2,956
R-squared                 0.324        0.365        0.322        0.321
Robust standard errors in parentheses
*** p<0.01, ** p<0.05, * p<0.1

a) Column (1) of Table 1 presents the output of the regression of the dependent variable log(wage) on the control variables that we chose because we believe they have a direct impact on wages. We ruled out variables that we thought affect wages only indirectly, through the education variable; hence we use the variables with values in the first column of the table above for this estimation. Variables such as age and IQ were discarded: the first would create a multicollinearity problem with the experience variables, since one accumulates experience with age; the second was excluded because information on IQ is missing for nearly a third of the individuals, and dropping it gives us better results in terms of R².

We can see that education has a positive effect on wages: one more year of education, on average and ceteris paribus, increases wages by about 5.6%. Being married, living in the South, or being black has a negative impact on the wage level. Individuals from metropolitan areas earn higher wages, on average. Furthermore, individuals who score well on the Knowledge of the World of Work (KWW) test have a slightly better wage, everything else constant.
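For reference, a minimal sketch of the Stata command behind column (1), assuming the variable names shown in Table 1 (robust standard errors as reported there):

    * OLS wage regression of Table 1, column (1); exp762 is squared experience
    reg lwage76 ed76 smsa76 south76 enroll76 black kww mar76 exp76 exp762, robust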
b) The economic interpretation is that staying more years in school could be correlated with the individual having more ability, so the returns to education might be overestimated. At the same time, people who stay longer in school expect a higher wage at the end of their academic formation, while on the opposite side there are people who do not expect any wage return from more years of education. Therefore we are measuring not only the difference between having one more year of education or not, but at the same time the difference between people who expect one more year of education to translate into a significant wage return and those who do not, and hence we are selecting the good outcomes when using OLS estimation.

Statistically speaking, the problem stated above means that the error term is correlated with the regressors in the estimation, so $E(\varepsilon \mid x) \neq 0$ and $\mathrm{Cov}(\varepsilon, x)$ is probably positive. This implies that the OLS estimator is no longer consistent or unbiased.
c) In the regression for education, presented in Table 1, column (2), we used all the variables from the log-wage regression except experience and its square. As asked, we also included the parents' education variables, momed and daded. In linear models there are two main requirements for using an IV:
- The instrument must be correlated with the endogenous explanatory variable, conditional on the other covariates (relevance).
- The instrument must not be correlated with the error term in the structural equation (conditional on the other covariates); that is, the instrument cannot suffer from the same problem as the original predicting variable (exogeneity).
We calculated the F-statistic for the joint test of both variables, momed and daded, and obtained a value of 167.54, which suggests that our instruments are strong and supports the validity of using them as instruments.
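A sketch of the first stage and of the joint F-test, under the same naming assumptions:

    * First stage for education (Table 1, column 2) and joint F-test of the instruments
    reg ed76 momed daded smsa76 south76 enroll76 black kww mar76, robust
    test momed daded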
d) There was no squared-age variable in our database, so we created the variable age762, which is nothing but the square of the variable age76. In Table 2 we present the results of the three first-stage regressions needed to answer this question. They enable us to check the strength of our instruments, using a joint F-test for our four instrumental variables. For the three regressions we get the following outcomes:
ed76: F-value of 98.85;
exp76: F-value of 1694.41;
exp762: F-value of 1024.78.

The F-statistics show that our instruments (mother's and father's education, age and age squared) are strong. In Table 1, column (3), we report the IV regression of log(wage) on education and experience, using the parents' education and age as instruments. The IV estimate for education is higher than the OLS estimate, with a larger standard error than in (1), while the estimate for experience is smaller than in the OLS estimation. Experience squared is no longer significant.
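A sketch of the 2SLS command behind column (3), under the same naming assumptions (Table 2 below shows the corresponding first stages):

    * 2SLS: ed76, exp76 and exp762 instrumented by parents' education and age
    gen age762 = age76^2
    ivregress 2sls lwage76 smsa76 south76 enroll76 black kww mar76 ///
        (ed76 exp76 exp762 = momed daded age76 age762), robust
    estat firststage   // per-equation first-stage statistics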
Table 2
                          (1)          (2)          (3)
VARIABLES                 ed76         exp76        exp762

momed                     0.154***     -0.154***    -3.071***
                          (0.0168)     (0.0168)     (0.370)
daded                     0.133***     -0.133***    -2.177***
                          (0.0145)     (0.0145)     (0.328)
age76                     0.512**      0.488*       -43.22***
                          (0.256)      (0.256)      (5.879)
age762                    -0.0106**    0.0106**     1.134***
                          (0.00445)    (0.00445)    (0.104)
smsa76                    0.300***     -0.300***    -7.102***
                          (0.0906)     (0.0906)     (2.019)
south76                   -0.146*      0.146*       5.123***
                          (0.0854)     (0.0854)     (1.820)
enroll76                  0.926***     -0.926***    -15.01***
                          (0.117)      (0.117)      (2.196)
black                     0.0766       -0.0766      -4.277*
                          (0.112)      (0.112)      (2.585)
kww                       0.127***     -0.127***    -2.705***
                          (0.00606)    (0.00606)    (0.146)
mar76                     0.0840***    -0.0840***   -0.778**
                          (0.0195)     (0.0195)     (0.384)
Constant                  -0.227       -5.773       554.1***
                          (3.631)      (3.631)      (81.49)

Observations              2,956        2,956        2,956
R-squared                 0.377        0.741        0.715
Robust standard errors in parentheses
*** p<0.01, ** p<0.05, * p<0.1

e) A test of overidentifying restrictions regresses the residuals from an IV or 2SLS regression on all instruments in Z. Under the null hypothesis that all instruments are uncorrelated with u, the statistic has a large-sample $\chi^2(r)$ distribution, where r is the number of overidentifying restrictions. Under the assumption of i.i.d. errors this is known as the Sargan test; it is routinely produced in Stata by ivreg2 for IV and 2SLS estimates, it can also be calculated after ivreg estimation with the overid command (part of the ivreg2 suite), and after ivregress the command estat overid provides the test.

We just have to compute the statistic and compare it to the critical value. With n observations, R instruments and K regressors, the statistic is n times the R² of the auxiliary regression:
$$ nR^2 \sim \chi^2(R - K) $$
$$ 2956 \times 0.0009 = 2.6604 < \chi^2_{0.95}(11 - 10) = \chi^2_{0.95}(1) = 3.841 $$
The test statistic is lower than the critical value from the $\chi^2$ table (one restriction, 95% level), so the null hypothesis, which in this case is that all the instrumental variables are exogenous, cannot be rejected. To sum up, we can say that our instruments are strong and valid.
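The test can be reproduced with a sketch like the following (estat overid reports Sargan's statistic after a non-robust ivregress fit):

    * Overidentification test: 4 instruments for 3 endogenous regressors (1 restriction)
    ivregress 2sls lwage76 smsa76 south76 enroll76 black kww mar76 ///
        (ed76 exp76 exp762 = momed daded age76 age762)
    estat overid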
f) We applied the same method as in e) and computed the over-identification test statistic:
$$ 2956 \times 0.0012 = 3.5472 < \chi^2_{0.95}(12 - 10) = \chi^2_{0.95}(2) = 5.99 $$
The test statistic is lower than the critical value, so we cannot reject the null hypothesis that all our instrumental variables are exogenous.

We can see in Table 1, column (4), that the education and experience coefficients change very little when we add the lived-near-a-college dummy as an instrument. To check whether this variable should be included, we can run the first-stage regressions for the instrumented variables and include the dummy: if the dummy is significant in explaining the instrumented variable and the joint F-test is high enough, including it yields a better IV model. We can also perform likelihood-ratio tests between the versions of the three first-stage regressions with and without nearc, as sketched below. None of the likelihood-ratio tests rejects the null hypothesis, so in this case it is not good to use nearc as an instrument. The F-statistics for the joint significance of the instruments fall in the first two regressions, to 88.48 for schooling and 1434.23 for experience, while the F for experience squared increases to 1293.90. We can now say that the variable nearc is not a good instrument, since it has no additional explanatory power.
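A sketch of that check for the schooling first stage (we assume the dummy is called nearc4, as in Card's dataset; the text's "nearc" may correspond to nearc2 or nearc4):

    * Does nearc4 add explanatory power to the first stage for schooling?
    reg ed76 momed daded age76 age762 smsa76 south76 enroll76 black kww mar76
    estimates store restricted
    reg ed76 momed daded age76 age762 nearc4 smsa76 south76 enroll76 black kww mar76
    estimates store unrestricted
    test momed daded age76 age762 nearc4   // joint F with nearc4 included
    lrtest restricted unrestricted         // LR test for adding nearc4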
g) The returns to education estimated with the IV model are higher than with OLS (0.0561 < 0.0679), but the standard errors of the IV estimator are much larger than those of OLS. We expected the OLS estimator to give biased results with higher returns to education than the IV estimates, yet it turns out to be the opposite. Even though it is more precise, the OLS estimator is biased, so it is not a good option.
Table 3
                          OLS (Table 1, column 1)    IV (Table 1, column 4)
Return to education       0.0561***                  0.0679***
Std. Error                (0.00437)                  (0.0113)

3. Simulation
a)-e) We start our analysis with the definition of the model:
$$ y = \beta_0 + \beta_1 x + \varepsilon. $$
As we know, x is an endogenous regressor, which means that $E(\varepsilon x) \neq 0$. Our model is parameterized as follows:
$$ \beta_0 = 1, \qquad \beta_1 = 0.1, $$
$$ x = 12\left(\frac{x_1 + x_2}{2} + 0.5\right), \quad x_1 \sim U(-0.5,\, 0.5), \; x_2 \sim U(-0.5,\, 0.5), $$
$$ \varepsilon = u + 0.5\, x_1, \quad u \sim U(-0.5,\, 0.5). $$

We will run a Monte Carlo simulation with different sample sizes and different strengths of the instrument. From this model we start by analysing a small sample of 20 observations. We can immediately observe that we have an endogeneity problem in the OLS estimation: $E(\varepsilon x) \neq 0$, which means that the OLS estimator b is biased and inconsistent.

In which sense is there endogeneity? There is endogeneity because the error term is correlated with x1, and x1 enters our x variable. Moreover, our instrument z1 is correlated with x2, and consequently also with x, but it is not correlated with the error term. On the other hand, the instrument z2 is correlated with x1 (and therefore with x, as for z1) but also with the error term, since x1 is correlated with ε.
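The equations for z1 and z2 are not written out above, so the following Monte Carlo sketch assumes z1 = u1 + 1.2·x2 (0.05 in the weak case of point g) and z2 = u2 + x1, with u1, u2 ~ U(-0.5, 0.5); it reproduces the design behind the deviations reported in Table 1:

    * Monte Carlo sketch; the z1/z2 equations are assumptions, not given in the text
    clear all
    set seed 12345
    program define ivmc, rclass
        drop _all
        set obs 20                          // 2000 for point f)
        gen x1 = runiform() - 0.5
        gen x2 = runiform() - 0.5
        gen x  = 12*((x1 + x2)/2 + 0.5)
        gen e  = runiform() - 0.5 + 0.5*x1  // endogeneity enters through x1
        gen z1 = runiform() - 0.5 + 1.2*x2  // valid instrument (0.05 => weak case)
        gen z2 = runiform() - 0.5 + x1      // invalid: correlated with e
        gen y  = 1 + 0.1*x + e
        regress y x
        return scalar dols = _b[x] - 0.1    // deviation of OLS slope from truth
        ivregress 2sls y (x = z1)
        return scalar div1 = _b[x] - 0.1    // deviation of IV1 slope
        ivregress 2sls y (x = z1 z2)
        return scalar div2 = _b[x] - 0.1    // deviation of IV2 slope
    end
    simulate dols=r(dols) div1=r(div1) div2=r(div2), reps(1000): ivmc
    summarize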
It is easy to see that, with endogeneity of this sign, OLS overestimates the slope parameter and underestimates the intercept. The underestimation of the intercept is shown in the table by b0,OLS − β0 (the deviation of the estimated intercept from its true value), which has a negative mean: -0.242. Conversely, the overestimation of the slope shows up in the positive mean deviation of the estimated slope from its true value (b1,OLS − β1): 0.040.

On the contrary, using the IV1 regression, which has only z1 as instrumental variable, the deviations are centred around 0: slightly positive for b0 (0.051) and slightly negative for b1 (-0.008). The problem is that, even if the IV1 regression gives us a more centred deviation, we can see from Fig. 1 that the IV estimator is less precise. To sum up: dropping z2 as an instrumental variable and focusing our analysis on z1 alone makes the deviations less biased, but in exchange we get a less precise estimation (visible in the higher standard error of the IV regression compared to OLS).

We also computed the Sargan test, which is a test of the validity of the instrumental variables and is used for testing over-identifying restrictions. A model is over-identified if the number of instruments is larger than the number of endogenous regressors. The test examines the exogeneity of all implemented instruments: when Sargan's null hypothesis is not rejected, the variables are treated as valid instruments because they are uncorrelated with the residuals. Our Sargan test shows that the null hypothesis is rejected more often the stronger the instrument z1 is and the more observations we have, which implies that the weaker z1 is, the less often the Sargan test detects the endogeneity of z2.

If we look at Table 1 and Fig. 1, we see that the IV2 estimation, which uses the two instruments z1 and z2, has the same problems as the OLS estimation. Indeed, the IV2 estimators are also biased, and IV2 is moreover less precise than OLS, since z2 is a function of x1 (which generates the endogeneity). In any case, we reject the null hypothesis of the Sargan test in only 23% of the cases, meaning that in most replications the IV2 regression appears correctly specified and the instruments appear exogenous.

Fig. 1: N=20 (strong instrument)

Table 1: N=20 (strong instrument)

Variable        Obs     Mean      Std. Dev.   Min       Max
b0,OLS - β0     1000    -0.242    0.184       -1.194    0.400
b0,IV(1) - β0   1000    0.051     0.399       -1.401    2.953
b0,IV(2) - β0   1000    -0.155    0.268       -1.518    1.451
b1,OLS - β1     1000    0.040     0.028       -0.061    0.156
b1,IV(1) - β1   1000    -0.008    0.065       -0.455    0.238
b1,IV(2) - β1   1000    0.026     0.043       -0.246    0.259
Sargan          1000    0.162     0.369       0.000     1.000
b1,ex           1000    0.040     0.028       -0.061    0.156
ρxz             1000    0.603     0.132       0.073     0.901
F-test value    1000    13.194    9.443       0.097     77.242

In Table 1, looking at the F-test, we can see that the mean for a sample of 20 observations is 13.194 (above the rule-of-thumb value of 10), which means that the coefficient in the first-stage regression of x on z1 is significantly different from zero. The ρxz value shows the correlation between the instrument and the endogenous regressor. We can notice that it is quite strong (0.603); consequently the instrument is quite strong too.
f) From Table 2 we can see that with a larger sample of 2000 observations our distributions become more concentrated. For OLS this means concentrating around the bias: its estimators become more precise around the wrong values, so the estimation remains far from the true values. In IV1 we have a smaller standard deviation with the larger sample; even though it is still higher than in the OLS estimation, the estimator is consistent. Also in the IV2 regression the estimators are more precise around the wrong values than with a sample of 20 observations, because of the endogeneity of z2.

The Sargan test in this case shows that we have to reject the null hypothesis: the instruments are not exogenous (because of z2). When the sample becomes larger, it becomes easier for the test to detect the endogeneity of z2. If we look at the correlation (ρxz) in the sample with N=2000, we observe that the instrument is strong, so we can use it to approximate the ratio of the IV and OLS standard errors.
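The approximation referred to here is the standard asymptotic result for one regressor and one instrument under homoskedasticity:
$$ \frac{\mathrm{Avar}(\hat\beta_{1,IV})}{\mathrm{Avar}(\hat\beta_{1,OLS})} = \frac{1}{\rho_{xz}^2}, $$
so with $\rho_{xz} \approx 0.61$ the IV standard error should be roughly $1/0.61 \approx 1.6$ times the OLS standard error.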

Fig. 2 shows that IV1 is centred around 0 and is quite precise, while OLS and IV2 are far from 0 and, consequently, biased. It is also important to highlight that the scale used here is very small; if we enlarged the scale, the distributions would be clearly distinct from one another.

Table 2: N=2000 (strong instrument)

Fig. 2: N=2000 (strong instrument)

g) We want to create a weak instrument. How can we do it? We decided to reduce the weight of the component that captures the influence of x2 on z1 from 1.2 to 0.05.

Table 3: N=20 (weak instrument)

Table 4: N=2000 (weak instrument)

From Fig. 3 and Fig. 4 we can see that OLS does not change; IV1, on the other hand, no longer converges to the true value, both with a sample of 20 observations and with a sample of 2000 observations. Furthermore, the standard deviations of our estimated coefficients increase dramatically. Examining the IV2 regression, the astonishing thing is that with a weak instrument it is more precise than IV1; on the other hand, it is still more biased than OLS.

Using a weak instrument, the correlation between instrument and regressor falls steeply (N=20: from 0.603 to 0.044; N=2000: from 0.611 to 0.05), so a much larger sample would be needed to bring the standard deviations down. In this case the Sargan test suggests that we accept the null hypothesis of exogenous instruments (even though z2 is endogenous). Moreover, the F-test, as shown in Fig. 3 and Fig. 4, is far below the rule-of-thumb value of 10: it takes a value of 1.17 with a sample of 20 observations and 5.88 with a sample of 2000 (not unexpected, since we deliberately created a very weak instrument). Comparing these graphs with those in point f), we see how fundamental the strength of an instrument is for our analyses: weak instruments can cause severe inefficiency.

Fig. 3: N=20 (weak instrument)

Fig. 4: N=2000 (weak instrument)
