
Instrumental Variables

- Alternative, simpler approach


- Less parametric
Ti = binary treatment, Homogeneous Additive T.E.
$y_i = \alpha + \beta T_i + X_i\gamma + u_i$
$T_i = X_i\pi_1 + w_i\pi_2 + v_i$

Omitted vars. bias: $E(u_i v_i) \neq 0$


For now, abstract from Xi
Ti not independent of (y0i, y1i)
Definition: wi are I.V. if they are randomly assigned.
- wi independent of (y0i, y1i) ⇒ $E(w_i u_i) = 0$
- wi correlated with Ti ⇒ $E(w_i T_i) \neq 0$
- For linear model above, just need $E(w_i u_i) = 0$ (orthogonality)
If also assume constant additive T.E.: $y_{1i} = y_{0i} + \beta$
Can use I.V. to identify causal ATE = $\beta$
In practice want:
1. Strong 1st-stage relation between wi and Ti
2. wi valid exclusion restriction (quality and validity of research design)
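Single I.V. wi = Just Identified:
$y_i = y_{0i} + \beta T_i$
Reduced-form: $E(y_i \mid w_i) = \pi_{11} + \pi_{12} w_i$
1st-stage: $E(T_i \mid w_i) = \pi_{21} + \pi_{22} w_i$
$E(y_i \mid w_i) = E(y_{0i} \mid w_i) + \beta E(T_i \mid w_i) = (\mu_0 + \beta\pi_{21}) + \beta\pi_{22} w_i$, where $\mu_0 = E(y_{0i})$

Unless $\pi_{22} = 0$ (no 1st-stage; weak IV if $\pi_{22} \approx 0$):

$\beta = \pi_{12} / \pi_{22}$, ratio of reduced-form to 1st-stage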

Two-stage least squares (2SLS)


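$\hat\beta_{2sls} = \hat\pi_{12} / \hat\pi_{22}$, from 2 OLS regressions

$\hat\beta_{IV} = \frac{\sum_i (w_i - \bar w)(y_i - \bar y)}{\sum_i (w_i - \bar w)(T_i - \bar T)} = \frac{\hat\pi_{12}}{\hat\pi_{22}} = \hat\beta_{2sls} = \hat\beta_{ils}$

2SLS:
1. LS of Ti on 1, wi ⇒ $\hat T_i = \hat\pi_{21} + \hat\pi_{22} w_i$
2. LS of yi on 1, $\hat T_i$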
Form IV/2SLS estimator from 2 reduced-forms
See anatomy of 2SLS (how IV estimate formed)
STATA: ivreg y (T=w), robust
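The equivalence of the three just-identified estimators (ratio of reduced form to first stage, covariance ratio, and two-step least squares) can be checked numerically. The sketch below is only an illustration under assumed values: the data-generating process, the true effect of 2, and all variable names are hypothetical, and the treatment is simulated as continuous for simplicity (the algebra is the same for binary Ti).

```python
import random

def mean(xs):
    return sum(xs) / len(xs)

def cov(xs, ys):
    """Sample covariance (divides by n; the factor cancels in every ratio below)."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

def ols(x, y):
    """Intercept and slope from a least-squares regression of y on (1, x)."""
    slope = cov(x, y) / cov(x, x)
    return mean(y) - slope * mean(x), slope

# Hypothetical data-generating process: true effect beta = 2, E(u v) != 0.
rng = random.Random(0)
beta, n = 2.0, 5000
w, T, y = [], [], []
for _ in range(n):
    wi, vi = rng.gauss(0, 1), rng.gauss(0, 1)
    ui = 0.8 * vi + rng.gauss(0, 1)      # u correlated with v, so OLS is biased
    Ti = 0.5 + 1.0 * wi + vi             # first stage with slope 1 on w
    w.append(wi)
    T.append(Ti)
    y.append(1.0 + beta * Ti + ui)

_, pi12_hat = ols(w, y)                  # reduced-form slope
pi21_hat, pi22_hat = ols(w, T)           # first-stage intercept and slope
beta_ils = pi12_hat / pi22_hat           # ratio of reduced form to first stage
beta_iv = cov(w, y) / cov(w, T)          # covariance-ratio (IV) form
T_hat = [pi21_hat + pi22_hat * wi for wi in w]
_, beta_2sls = ols(T_hat, y)             # second stage: y on (1, T_hat)
_, beta_ols = ols(T, y)                  # naive OLS, for comparison

print(beta_ils, beta_iv, beta_2sls, beta_ols)
```

In this design the three IV numbers agree up to floating-point error, while OLS is pulled away from the true effect by the correlation between u and v. The second-stage standard errors from a manual two-step regression are not valid; that is one reason to use a packaged routine such as the STATA command above.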
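If wi binary (0-1), has Wald estimator interpretation (grouped estimation)
wi = 1: $\bar y_1, \bar T_1$
wi = 0: $\bar y_0, \bar T_0$
$\hat\beta_{IV} = \frac{\bar y_1 - \bar y_0}{\bar T_1 - \bar T_0}$, check: $\bar u_1 - \bar u_0 = 0$? $\bar X_1 - \bar X_0 = 0$?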
Question: How much bias does wi reduce relative to reduction in
treatment variation?
Ex.: Quarter 4 vs. Quarter 1 babies
$\bar T_1 - \bar T_0 = 0.1$ yrs of education, very small difference
$\bar X_1 - \bar X_0 \neq 0$, mother's education
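A small sketch of the grouped (Wald) calculation, and of why a small difference in mean treatment matters: without controls, the Wald estimate equals the true effect plus (difference in mean errors) / (difference in mean treatment), so any imbalance across instrument groups is scaled up by the inverse of the first stage. The eight-observation sample and the 0.01 imbalance are made up for illustration; only the 0.1 treatment gap comes from the example above.

```python
def mean(xs):
    return sum(xs) / len(xs)

def wald(y, T, w):
    """Wald estimator: difference in mean outcomes over difference in mean treatment."""
    y1 = [yi for yi, wi in zip(y, w) if wi == 1]
    y0 = [yi for yi, wi in zip(y, w) if wi == 0]
    T1 = [ti for ti, wi in zip(T, w) if wi == 1]
    T0 = [ti for ti, wi in zip(T, w) if wi == 0]
    return (mean(y1) - mean(y0)) / (mean(T1) - mean(T0))

# Made-up toy sample (8 observations), for illustration only.
w = [0, 0, 0, 0, 1, 1, 1, 1]
T = [0, 0, 0, 1, 0, 1, 1, 1]
y = [1, 2, 1, 4, 2, 4, 5, 5]
beta_wald = wald(y, T, w)

def wald_bias(du_bar, dT_bar):
    """Bias of the Wald estimator when group mean errors differ by du_bar."""
    return du_bar / dT_bar

# Same hypothetical imbalance in unobservables, two different first stages.
bias_small_first_stage = wald_bias(0.01, 0.1)   # treatment gap of 0.1, as in the QOB example
bias_large_first_stage = wald_bias(0.01, 1.0)   # treatment gap of 1.0

print(beta_wald, bias_small_first_stage, bias_large_first_stage)
```

With a treatment gap of only 0.1, the same imbalance produces ten times the bias it would with a gap of 1.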

Multiple I.V. wi = Over-Identified:


dim(Ti)=1, dim(wi)=p > 1
Observe Xi, but Ti not R.A. conditional on Xi
$y_i = \beta T_i + X_i\gamma + u_i$
$T_i = X_i\pi_1 + w_i\pi_2 + v_i$
$E(u_i v_i) \neq 0$
$T_i = \hat T_i + \hat v_i$,
$\hat T_i$ = exogenous component
$\hat v_i$ = potentially endogenous component, may be correlated with $u_i$
Definition: wi are I.V. conditional on Xi if they are R.A. conditional on Xi
$(y_{0i}, y_{1i}) \perp w_i \mid X_i$
In linear case, need E(wiui|Xi) = 0
2SLS
1. OLS of Ti on Xi, wi ⇒ $\hat T_i = X_i\hat\pi_{21} + w_i\hat\pi_{22}$, $\hat v_i = T_i - \hat T_i$
2. OLS of yi on Xi, $\hat T_i$
STATA: ivreg y X (T=w), robust
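2 alternative 2nd-stages
i) use wi as instruments:
$y_i = \beta\hat T_i + X_i\gamma + \varepsilon_i$, $E(\hat T_i\varepsilon_i) = 0$
ii) use $\hat v_i$ as control for omitted variables (mathematically identical):
$y_i = \beta T_i + X_i\gamma + \rho\hat v_i + \varepsilon_i$, $E(T_i\varepsilon_i) = 0$
$\hat\beta_{CF} = \hat\beta_{2sls}$
- $\hat v_i$ ⇒ selection correction, single-index control function
- $\hat\rho$ measures corr(ui, vi), the source of the gap $\hat\beta_{ols} - \hat\beta_{2sls}$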


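- Before (Heckit): $E(u_i \mid v_i > -z_i\pi) = \sigma_{uv}\lambda(z_i\pi)$, with $\lambda(\cdot)$ the inverse Mills ratio and $Var(v_i)$ normalized to 1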
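The claim that the control-function second stage reproduces 2SLS exactly can be verified on simulated data. The sketch below uses a hand-rolled least-squares helper to stay self-contained; the data-generating process and all names (`solve`, `ols`, the coefficient values) are hypothetical choices for illustration, not part of the notes.

```python
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    k = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, k):
            f = M[r][c] / M[c][c]
            for j in range(c, k + 1):
                M[r][j] -= f * M[c][j]
    x = [0.0] * k
    for i in reversed(range(k)):
        x[i] = (M[i][k] - sum(M[i][j] * x[j] for j in range(i + 1, k))) / M[i][i]
    return x

def ols(rows, y):
    """OLS coefficients from the normal equations (rows already include a constant)."""
    k = len(rows[0])
    A = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    b = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(A, b)

# Hypothetical design: one control X, one instrument w, E(u v) != 0.
rng = random.Random(1)
n = 2000
X = [rng.gauss(0, 1) for _ in range(n)]
w = [rng.gauss(0, 1) for _ in range(n)]
v = [rng.gauss(0, 1) for _ in range(n)]
u = [0.8 * vi + rng.gauss(0, 1) for vi in v]
T = [0.3 * x + 1.0 * wi + vi for x, wi, vi in zip(X, w, v)]
y = [1.0 + 2.0 * t + 0.5 * x + ui for t, x, ui in zip(T, X, u)]

# 1st stage: T on (1, X, w), giving fitted values and residuals.
p = ols([[1.0, x, wi] for x, wi in zip(X, w)], T)
T_hat = [p[0] + p[1] * x + p[2] * wi for x, wi in zip(X, w)]
v_hat = [t - th for t, th in zip(T, T_hat)]

# 2nd stage (i): y on (1, T_hat, X).
b_2sls = ols([[1.0, th, x] for th, x in zip(T_hat, X)], y)[1]

# 2nd stage (ii): y on (1, T, X, v_hat), the control-function form.
cf = ols([[1.0, t, x, vh] for t, x, vh in zip(T, X, v_hat)], y)
b_cf, rho_hat = cf[1], cf[3]

print(b_2sls, b_cf, rho_hat)
```

The two treatment coefficients match to machine precision because the first-stage residual is orthogonal to the constant, X, and the fitted treatment. The coefficient on the residual is positive here because u and v were simulated with positive covariance.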

Indirect validity tests of instruments (2SLS better than OLS?)


OLS:
- regress yi on Ti ⇒ $\hat\beta_{ols}$
- regress each Xi on Ti ⇒ how correlated are controls with Ti?
2SLS:
- 2SLS of yi on Ti using wi as instrument ⇒ $\hat\beta_{2sls}$
- regress each Xi on wi ⇒ how correlated are controls with wi?
- 2SLS of each Xi on Ti using wi as instrument ⇒ how correlated are Xi with variation in Ti due to wi?
Testing whether instruments reduce association between treatment and
observables. Cannot test association with unobservables.
Also examine whether $\hat\beta_{2sls}$ is less sensitive to controlling for Xi than $\hat\beta_{ols}$
I.V. versus Heckit:
I.V.: - robust since not assuming (ui, vi) ~ joint normal
- inefficient since assuming all variation in Ti not due to wi is endogenous ⇒ may throw away good variation
- only 1 correction term
Heckit: - biased if (ui, vi) not joint normal
- efficient if f(ui, vi) correctly specified ⇒ additional ID assumption
i) only correct for part of vi correlated with ui since specifying their correlation; ii) using nonlinear transformation of $\hat v_i$: $\lambda(\cdot) = \phi(\cdot)/\Phi(\cdot)$
- Can make inferences on selection mechanism (e.g., cream skimming, comparative advantage, absolute advantage, etc.)
Issue with 2SLS:
In finite samples, if 1st-stage weak, then $\hat\beta_{2sls}$ biased toward $\hat\beta_{ols}$, and conventional s.e.($\hat\beta_{2sls}$) biased.
Overfitting and poor research design ⇒ small sample bias


(Should present 1st-stage results)
Testing Over-ID restrictions:
1. outcome equation residual $\hat\varepsilon_i$ should be uncorrelated with wi, Xi (exogenous vars.)
2SLS: regress yi on Xi, $\hat T_i$ ⇒ $\hat\varepsilon_i = y_i - \hat\beta_{2sls} T_i - X_i\hat\gamma$ (residual computed with actual $T_i$); regress $\hat\varepsilon_i$ on Xi, wi ⇒ $R^2$
$N \cdot R^2 \sim \chi^2(p-1)$, p = dim(wi)
- Intuitive omnibus test of orthogonality conditions $E(w_i\varepsilon_i) = 0$ and model specification
- Similar to test for heteroskedasticity (Lagrange Multiplier test)
2. If $\beta$ homogeneous and additive, then each I.V. (w1i, ..., wpi) should identify the same $\hat\beta_{2sls}$. If different I.V. lead to different $\hat\beta$s then not all I.V. valid (or model misspecified).
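The N·R² over-identification test in item 1 can be sketched as follows for p = 2 instruments and no controls. Everything here is an assumed illustration: the data-generating process, the helper names `solve`, `ols`, `overid_stat`, `simulate`, and the size of the direct effect used to break the exclusion restriction.

```python
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    k = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, k):
            f = M[r][c] / M[c][c]
            for j in range(c, k + 1):
                M[r][j] -= f * M[c][j]
    x = [0.0] * k
    for i in reversed(range(k)):
        x[i] = (M[i][k] - sum(M[i][j] * x[j] for j in range(i + 1, k))) / M[i][i]
    return x

def ols(rows, y):
    """OLS coefficients from the normal equations (rows already include a constant)."""
    k = len(rows[0])
    A = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    b = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(A, b)

def overid_stat(y, T, w1, w2):
    """N * R^2 from regressing the 2SLS residual on the instruments."""
    n = len(y)
    Z = [[1.0, a, b] for a, b in zip(w1, w2)]
    p = ols(Z, T)                                        # first stage
    T_hat = [p[0] + p[1] * a + p[2] * b for a, b in zip(w1, w2)]
    a_hat, b_hat = ols([[1.0, th] for th in T_hat], y)   # second stage
    e = [yi - a_hat - b_hat * ti for yi, ti in zip(y, T)]  # residual uses actual T
    g = ols(Z, e)
    fitted = [g[0] + g[1] * a + g[2] * b for a, b in zip(w1, w2)]
    e_bar = sum(e) / n
    ssr = sum((ei - fi) ** 2 for ei, fi in zip(e, fitted))
    sst = sum((ei - e_bar) ** 2 for ei in e)
    return n * (1.0 - ssr / sst)

def simulate(direct_effect_w2, seed, n=2000):
    """Hypothetical DGP; a nonzero direct effect of w2 on y violates exclusion."""
    rng = random.Random(seed)
    y, T, w1, w2 = [], [], [], []
    for _ in range(n):
        a, b, v = rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)
        u = 0.8 * v + rng.gauss(0, 1)
        t = a + b + v
        w1.append(a)
        w2.append(b)
        T.append(t)
        y.append(1.0 + 2.0 * t + direct_effect_w2 * b + u)
    return overid_stat(y, T, w1, w2)

stat_valid = simulate(0.0, seed=2)     # both instruments valid
stat_invalid = simulate(1.0, seed=2)   # w2 enters the outcome directly
print(stat_valid, stat_invalid)        # compare with chi-square(1), 5% critical value 3.84
```

With p = 2 there is one over-identifying restriction, so the statistic is compared with a chi-square with 1 degree of freedom. A rejection says that the two instruments imply different effects; it cannot say which instrument is at fault, and it has no power if both are invalid in the same direction.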
Can motivate as Minimum Distance problem
- 1 endogenous treatment, p instruments
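Structural eqn.: $y_i = \beta T_i + u_i$
1st-stage: $T_i = \pi_1 w_{1i} + \pi_2 w_{2i} + \dots + \pi_p w_{pi} + v_i$
Reduced-form: $y_i = \tau_1 w_{1i} + \tau_2 w_{2i} + \dots + \tau_p w_{pi} + \varepsilon_i$
Restrictions: $\tau_1 = \beta\pi_1, \tau_2 = \beta\pi_2, \dots, \tau_p = \beta\pi_p$
RF parameters: $\Pi = [\tau_1 \dots \tau_p, \pi_1 \dots \pi_p]$
Structural parameters: $\theta = (\beta, \pi_1, \dots, \pi_p)$, with $f(\theta) = [\beta\pi_1 \dots \beta\pi_p, \pi_1 \dots \pi_p]$
2-steps
1. Estimate $\Pi$ with two OLS regressions of reduced-form and first-stage.
2. Fit reduced-form estimates to structural parameters.
$\min_\theta F(\theta) = (\hat\Pi - f(\theta))' [\widehat{Var}(\hat\Pi)]^{-1} (\hat\Pi - f(\theta))$ = Optimal Min. Distance
$F_{OMD} \xrightarrow{D} \chi^2(p-1)$ degrees of freedom (2p reduced-form parameters, p+1 structural parameters)

Use to test whether (p-1) over-identifying restrictions hold.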



- I.V./2SLS simple and works well if additive, homogeneous T.E.s


- Commonly used
- Heckit may work better if there is Roy-like self-selection
(differential sorting based on heterogeneous T.E.s)
Relax Homogeneous Additive T.E. Assumption
Angrist, Imbens and Rubin; Wooldridge; Garen
wi are R.A. conditional on Xi (often, just need mean independence)
wi finite valued, valid exclusion restriction

$\beta_i$ varies over i (Random Coeffs.)

Can we estimate ATE = $E(\beta_i) = \bar\beta$ = avg. effect for randomly selected i?
Ex. $\beta_i$ = return to college
$f(\beta_i)$ = population density function
r = marginal cost of education (e.g., interest rate)
$\beta_i \geq r$, go to college (1)
$\beta_i < r$, don't (0)
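[Figure: density function of returns to college, $f(\beta_i)$. Horizontal axis: return to college, with $\bar\beta_0$, $\bar\beta$, r and $\bar\beta_1$ marked. Individuals to the left of r choose 0; individuals to the right of r choose 1.]

$\bar\beta$ = ATE
r = Marginal T.E. (effect for marginal person)
$\bar\beta_1$ = Selected ATE = "Treatment on Treated Effect"
$\bar\beta_0$ = Average effect for untreated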
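The ordering of the four parameters in the figure can be illustrated with a deterministic example. The grid of returns and the cutoff r = 0.75 below are made-up values; the only substantive assumption is the selection rule from the notes (treat when the return is at least r).

```python
def mean(xs):
    return sum(xs) / len(xs)

# Hypothetical population: returns evenly spread from -1.00 to 2.00.
betas = [i / 100 for i in range(-100, 201)]
r = 0.75                                   # uniform marginal cost

treated = [b for b in betas if b >= r]     # choose 1
untreated = [b for b in betas if b < r]    # choose 0

ate = mean(betas)                          # average effect, whole population
tt = mean(treated)                         # treatment on the treated
tu = mean(untreated)                       # average effect for the untreated
share_treated = len(treated) / len(betas)

print(tu, ate, r, tt)
```

Because people sort on their own gain, the untreated average lies below r and the treated average lies above it, and the ATE is the share-weighted average of the two. Where the ATE falls relative to r depends on the distribution and the cost; in this made-up example it is below r.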

$T_i = 1(\beta_i = y_{1i} - y_{0i} > r)$
- Pure Roy model ⇒ all variation in choice due to self-selection on benefits relative to uniform cost.
- Can't ID T.E.s without strong parametric assumptions.
$T_i = 1(y_{1i} - y_{0i} - c_i > 0)$
- Some variation in choice due to heterogeneity in costs (ci). If costs unrelated to $\beta_i$ and omitted vars., then can ID model using cost variables as instruments.

Binary Treatment: $T_i = 1(T_i^* > 0)$

$\beta_i = y_{1i} - y_{0i}$, $Cov(\beta_i, T_i) \neq 0$

Neither Heckit nor IV can ID $E(y_{1i} - y_{0i})$ without strong assumptions.

Heckit: (u0i, u1i, vi) ~ trivariate normal
2SLS: $E(y_{1i} - y_{0i} \mid T_i, w_i) = E(y_{1i} - y_{0i})$
- gain from treatment varies across i, but mean independent of Ti, wi
- loosely, $Cov(y_{1i} - y_{0i}, T_i) = 0$, $Cov(y_{1i} - y_{0i}, w_i) = 0$
- otherwise, $E(\hat\beta_{2sls}) \neq \bar\beta$

For both Heckit and I.V., semi-parametric identification "at infinity",
i.e., need incredible instrumental variable.
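Continuous Treatment $T_i^*$
$y_i = \beta_i T_i^* + X_i\gamma + u_i$
$E(u_i T_i^*) \neq 0$ ⇒ omitted vars. bias
$E(\beta_i T_i^*) \neq 0$ ⇒ selectivity bias
Ex. $T_i^*$ = yrs. of education
$E(\beta_i T_i^*) > 0$?

Rewrite:
$y_i = \bar\beta T_i^* + a_i T_i^* + X_i\gamma + u_i$, $\bar\beta = E(\beta_i)$, $a_i = \beta_i - \bar\beta$
$T_i^* = X_i\pi_1 + w_i\pi_2 + v_i$

$T_i^*$ interacts with unobserved $a_i$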


$E(\hat\beta_{2sls}) = \bar\beta$ if: (Wooldridge)

A1: $E(u_i \mid w_i) = 0$, wi = valid I.V.
A2: $E(v_i \mid w_i) = 0$, $E(v_i^2 \mid w_i) = \sigma_v^2$ [holds if (wi, vi) independent]
A3: $E(a_i \mid w_i, v_i) = E(a_i \mid v_i) = \psi v_i$, wi unrelated to T.E. heterogeneity conditional on vi
A4: $\pi_2 \neq 0$, valid 1st-stage
- can condition on zi = (Xi, wi)
- A2 unlikely to hold if Ti* discrete (e.g., binary) unless wi is purely randomly assigned
- A3 pretty restrictive ⇒ single control function (vi) absorbs both omitted vars. and selectivity biases
A1-A4 ⇒ $E(a_i T_i^* \mid w_i) = E(a_i T_i^*)$ = constant ⇒ $E(\hat\beta_{2sls}) = E(\hat\beta_{CF}) = \bar\beta$

Alternatively,
B1 = A1, B4 = A4
B2: $E(T_i^* \mid w_i, a_i) = \pi_0 + X_i\pi_1 + w_i\pi_2 + \kappa a_i$, [$E(v_i \mid w_i, a_i) = \kappa a_i$]
B3: $E(a_i \mid w_i) = 0$, $E(a_i^2 \mid w_i) = \sigma_a^2$
- Fewer restrictions on Ti* reduced-form (B2)
- More restrictions on (ai, wi) relation (B3)
- Ex.: B1-B4 satisfied for binary Ti when Pr(Ti = 1|Xi, wi, ai) is linear probability model
B1-B4 ⇒ $E(\hat\beta_{2sls}) = \bar\beta$

Augmented control function approach to random coefficients (Garen)


- include additional control for selectivity bias
- based on assumption that conditional means of ui, ai are linear in Ti*, wi

What if A1-A4 (B1-B4) violated?

C1: $E(u_i \mid w_i) = E(a_i \mid w_i) = 0$
C2: $E(u_i \mid T_i^*, w_i) = \eta_T T_i^* + \eta_w w_i$
C3: $E(a_i \mid T_i^*, w_i) = \psi_T T_i^* + \psi_w w_i$

C1-C3 ⇒ $E(u_i \mid T_i^*, w_i) = \eta_T v_i$
and $E(a_i \mid T_i^*, w_i) = \psi_T v_i$ (*) [2nd control function]

$T_i^* = \hat T_i^* + \hat v_i$, $\hat T_i^*$ = exogenous, $\hat v_i$ = potentially endogenous

$y_i = \bar\beta T_i^* + X_i\gamma + \rho_1\hat v_i + \rho_2 T_i^*\hat v_i + \varepsilon_i$

$\rho_1\hat v_i$ = control for omitted vars. bias
$\rho_2 T_i^*\hat v_i$ = control for selectivity bias

$\rho_1 = \eta_T = Cov(u_i, v_i)/Var(v_i)$, $\rho_2 = \psi_T = Cov(a_i, v_i)/Var(v_i)$

- $\rho_1$ tests for $E(u_i \mid T_i^*) = 0$, O.V.B.
- $\rho_2$ tests for self-selection due to T.E. heterogeneity
$\rho_2 \neq 0$ ⇒ T.E. heterogeneity and self-selection
- Under assumptions, adding $\rho_2 T_i^*\hat v_i$ eliminates selectivity bias due to $E(a_i \mid T_i^*) \neq 0$
- More general than 2SLS: 2 control functions instead of 1
- Similar to p-score and Heckit selection correction approaches
- No joint normality assumptions (leveraging Ti* continuous)
- works if C2 and C3 hold
- Test for robustness ⇒ include $\hat v_i^2$, $T_i^*\hat v_i^2$, etc.
- Calculate standard errors via bootstrap.
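A simulation sketch of the augmented control-function regression with a continuous treatment. The design is hypothetical: normal errors (so the linear conditional means in C2 and C3 hold by construction), an average effect of 1, no Xi, and made-up coefficient values; `solve` and `ols` are small helper functions written for this example.

```python
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    k = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, k):
            f = M[r][c] / M[c][c]
            for j in range(c, k + 1):
                M[r][j] -= f * M[c][j]
    x = [0.0] * k
    for i in reversed(range(k)):
        x[i] = (M[i][k] - sum(M[i][j] * x[j] for j in range(i + 1, k))) / M[i][i]
    return x

def ols(rows, y):
    """OLS coefficients from the normal equations (rows already include a constant)."""
    k = len(rows[0])
    A = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    b = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(A, b)

# Hypothetical DGP: both u (omitted vars.) and a (T.E. heterogeneity) load on v.
rng = random.Random(3)
n, beta_bar = 20000, 1.0
w, T, y = [], [], []
for _ in range(n):
    wi, vi = rng.gauss(0, 1), rng.gauss(0, 1)
    ai = 0.5 * vi + 0.5 * rng.gauss(0, 1)   # Cov(a, v)/Var(v) = 0.5
    ui = 0.5 * vi + rng.gauss(0, 1)         # Cov(u, v)/Var(v) = 0.5
    Ti = 1.0 + wi + vi                      # continuous treatment
    w.append(wi)
    T.append(Ti)
    y.append(2.0 + (beta_bar + ai) * Ti + ui)

# 1st stage residual.
p = ols([[1.0, wi] for wi in w], T)
v_hat = [t - p[0] - p[1] * wi for t, wi in zip(T, w)]

# Augmented regression: y on (1, T, v_hat, T * v_hat).
cf = ols([[1.0, t, vh, t * vh] for t, vh in zip(T, v_hat)], y)
b_cf, rho1_hat, rho2_hat = cf[1], cf[2], cf[3]
b_ols = ols([[1.0, t] for t in T], y)[1]

print(b_cf, rho1_hat, rho2_hat, b_ols)
```

In this design the coefficient on the treatment in the augmented regression is close to the average effect, the two control-function coefficients are both near 0.5, and plain OLS is well above 1. The printed point estimates come with no standard errors; as noted above, those would need a bootstrap that repeats both stages.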
Application: Chay and Greenstone (JPE, April 2005)
- Air pollution and housing prices = hedonics
- Go through this in Applied Exercise #4
What can 2SLS Semiparametrically Identify? (AIR, JASA 96)
Heterogeneous T.E.s
- minimal assumptions on functional form and T.E. heterogeneity
Binary Ti = (0, 1)
Binary I.V. wi = (0, 1)
Potential Treatment Status
T0i if wi = 0
T1i if wi = 1
Ti = T0i(1 - wi) + T1iwi = T0i + (T1i - T0i)wi
Assumptions
1. Independence (strengthened I.V. condition)
$(y_{0i}, y_{1i}, T_{0i}, T_{1i}) \perp w_i$
Ex. wi = 1 ⇒ i encouraged to take treatment, encouragement R.A.
2. Monotonicity
$\Pr(T_i = 1 \mid w_i = 1) \geq \Pr(T_i = 1 \mid w_i = 0)$, for all i, or
$\Pr(T_i = 1 \mid w_i = 1) \leq \Pr(T_i = 1 \mid w_i = 0)$, for all i
⇒ $T_{1i} \geq T_{0i}$ for all i, $\Pr(T_{1i} \geq T_{0i}) = 1$
Potential treatment status by type:
             T1i = 0         T1i = 1
T0i = 0      Never takers    Compliers
T0i = 1      Defiers         Always takers

Compliers: T1i > T0i (T0i = 0, T1i = 1)


Monotonicity ⇒ No Defiers
- use intuition on plausibility
- sometimes not true of latent variable models (especially if wi takes
on many values)
Ex. Ti = f(wi), f() may not be monotonic
With 1. and 2., 2SLS (IV) identifies Local Average T.E. (LATE)
(Ahn and Powell type assumptions)
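wi binary:
Reduced-form: $y_i = \alpha_2 + \delta_2 w_i + u_i$
1st-stage: $T_i = \alpha_1 + \delta_1 w_i + v_i$
$\hat\beta_{2sls} = \hat\delta_2 / \hat\delta_1$
$plim\ \hat\beta_{2sls} = \frac{E(y_i \mid w_i = 1) - E(y_i \mid w_i = 0)}{E(T_i \mid w_i = 1) - E(T_i \mid w_i = 0)} = E(y_{1i} - y_{0i} \mid T_{1i} - T_{0i} = 1)$
= ATE for compliers (those whose treatment status changed by I.V. wi)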
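The LATE result can be checked on a small constructed population in which independence holds exactly (every unit appears once with wi = 0 and once with wi = 1) and there are no defiers. The types, counts and potential outcomes below are made up for illustration.

```python
def mean(xs):
    return sum(xs) / len(xs)

# (T0, T1, y0, y1, number of units): hypothetical population with no defiers.
types = {
    "never-taker":  (0, 0, 1.0, 2.0, 3),
    "complier":     (0, 1, 1.0, 4.0, 4),
    "always-taker": (1, 1, 2.0, 8.0, 3),
}

rows = []  # (w, T, y, gain, is_complier)
for T0, T1, y0, y1, count in types.values():
    for _ in range(count):
        for w in (0, 1):                     # exact balance across instrument values
            T = T1 if w == 1 else T0         # observed treatment
            y = y1 if T == 1 else y0         # observed outcome
            rows.append((w, T, y, y1 - y0, T1 - T0 == 1))

def cond_mean(col, w_val):
    """Mean of one column among units with instrument value w_val."""
    return mean([r[col] for r in rows if r[0] == w_val])

wald = (cond_mean(2, 1) - cond_mean(2, 0)) / (cond_mean(1, 1) - cond_mean(1, 0))
late = mean([r[3] for r in rows if r[4]])    # average gain among compliers
ate = mean([r[3] for r in rows])             # average gain in the population

print(wald, late, ate)
```

The Wald ratio recovers the compliers' average gain (3 in this example) and not the population average (3.3), because never-takers and always-takers contribute nothing to either the numerator or the denominator.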
Note: Ti* continuous, 2SLS has similar interpretation (Angrist and Imbens)
$plim\ \hat\beta_{2sls} = \frac{E(\beta_g T_g^*)}{E(T_g^*)}$ = weighted avg. of LATEs for each group g

ATE = $E(y_{1i} - y_{0i})$ = avg. effect for population
SATE = $E(y_{1i} - y_{0i} \mid T_i = 1)$ = avg. effect among treated
= Effect of unionism on unionized
LATE = $E(y_{1i} - y_{0i} \mid T_{1i} - T_{0i} = 1)$ = avg. effect among compliers
SATE and LATE are avg. effects among non-random subpopulations
- not clear which are the interesting policy parameters
- LATE interesting if wi can be linked to a clear policy (wi =
regulation vs. wi = QOB)
Latent Vars. Interpretation of 2SLS
- Abstract from Xi
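$y_{0i} = \mu_0 + u_{0i}$
$y_{1i} = \mu_1 + u_{1i}$
$T_i = 1(\pi_0 + \pi_2 w_i + v_i \geq 0)$, wi = (0, 1)
T0i = 1 if $\pi_0 + v_i \geq 0$, 0 otherwise
T1i = 1 if $\pi_0 + \pi_2 + v_i \geq 0$, 0 otherwise

I.V. condition: $(u_{0i}, u_{1i}, v_i) \perp w_i$
Monotonicity: $\pi_2 \geq 0$ (or $\pi_2 < 0$), Ex. linearity, homogeneity
Compliers: $T_{1i} - T_{0i} = 1$ ⇔ $-\pi_0 - \pi_2 \leq v_i < -\pi_0$
$plim\ \hat\beta_{2sls} = E(y_{1i} - y_{0i} \mid -\pi_0 - \pi_2 \leq v_i < -\pi_0)$

$\pi_2 \to \infty$ ⇒ R.A. encouragement has huge effect on treatment status,
then $plim\ \hat\beta_{2sls} = E(y_{1i} - y_{0i} \mid v_i < -\pi_0)$

Therefore, wi = (0, 1) not sufficient for semiparametric ID of $(\mu_1 - \mu_0)$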



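Also need $\pi_0 \to -\infty$, $\pi_0 + \pi_2 \to \infty$, to ID $(\mu_1 - \mu_0)$
wi = 1 ⇒ Ti = 1, wi = 0 ⇒ Ti = 0: treatment status R.A., everyone a complier
Now: wi = (-1, 0, 1), focus on wi = (-1, 1)
Complier condition: $-\pi_0 - \pi_2 \leq v_i < -\pi_0 + \pi_2$
$\pi_2 \to \infty$ ⇒ $plim\ \hat\beta_{2sls} = \mu_1 - \mu_0$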

Roy Model: $T_i = 1(\pi_0' + \pi_2 w_i + y_{1i} - y_{0i} \geq 0)$
$\pi_0'$ ⇒ uniform cost of participating
- $\hat\beta_{2sls}$ provides little information on $(\mu_1 - \mu_0)$

Latent Var. Model: $v_i = u_{1i} - u_{0i}$, $\pi_0 = \pi_0' + \mu_1 - \mu_0$
Complier condition: $-\pi_0' - \pi_2 \leq y_{1i} - y_{0i} < -\pi_0'$
Higher potential gain ⇒ more likely to participate
$\pi_2 \to 0$ ⇒ $plim\ \hat\beta_{2sls} = E(y_{1i} - y_{0i} \mid y_{1i} - y_{0i} = -\pi_0') = -\pi_0'$
$\hat\beta_{2sls}$ estimates cost of participating (useless)

Cannot identify Roy Model without joint normality assumption or panel
data (and stationarity assumption)?
SATE = ATE and I.V. estimate of $(\mu_1 - \mu_0)$ consistent only if
$E(u_{1i} - u_{0i} \mid T_i = 1, X_i, w_i) = 0$

Semiparametric selection corrections and Identification at Infinity


1. $(u_{0i}, u_{1i}, v_i) \perp w_i$
2. Index-sufficiency and sufficient variation in wi
[traces out Pr(selection)]
wi such that $\Pr(T_i = 1 \mid w_i, X_i) \to 1$ ⇒ $E(u_{1i} \mid T_i = 1, w_i, X_i) \to 0$
⇒ wi infinite-valued
Can derive semi-parametric selection correction estimate of ATE and
SATE without functional form assumption on (u0i, u1i, vi) joint
distribution
Identical Issue in I.V./2SLS
With just independence and monotonicity I.V./2SLS can point identify
LATE, but can say nothing about ATE.
What can latent variable selection correction approach identify?
- Vytlacil, Econometrica: under independence and monotonicity,
latent variable approach can derive bounds on ATE.
- Does not derive a point estimator of anything, though.

Returns to Education example


- How to interpret the fact that IV estimates often larger than OLS,
and twins estimate little different from OLS?
- Go back to human capital model and allow for heterogeneity.
- What are implications for role of benefits and costs in educational
attainment?
Put graphs of education production functions and cost functions here
- Two types of individuals
- Ability, marginal returns and marginal costs for person i (ai, bi, ri);
person j (aj, bj, rj)


What about continuous IV, wi?


1. Local Instrumental Variables (Heckman and Vytlacil, 2000)
- local Wald estimation of T.E.s at different values of wi
- E(yi|wi), E(Ti|wi): take ratio at different values of wi
- Bandwidth choice
2. Manning, BE Journals, 2004 (very clear) binary Ti
- binary treatment and heterogeneous T.E.s
- heterogeneous T.E.s ⇒ nonlinear relation between mean of
outcome conditional on I.V. and mean of treatment conditional on
I.V.
- linear I.V. (2SLS) estimator is misspecification of functional form
- therefore, will depend on I.V. used due to misspecification
- estimate correct functional form, then T.E.s independent of I.V.
used
- in practice, need rich data to identify T.E.s without strong
distributional assumptions; e.g., a lot of variation in E(Ti|wi) in the
data (identification at infinity). Otherwise, must extrapolate.
- o.w., settle for estimates of T.E.s that are instrument-dependent
3. Blundell and Powell
- control function approaches when the outcome equation is a
nonlinear transformation of some function (e.g., binary response)
- 2SLS invalid for nonlinear models
