Lawrence Katz
9/22/14
E[Yi|di=1] - E[Yi|di=0]
= E[Y1i|di=1] - E[Y0i|di=1] + {E[Y0i|di=1] - E[Y0i|di=0]}
= E[Δi|di=1] + {E[Y0i|di=1] - E[Y0i|di=0]}
The first term is the parameter of interest, but the term in brackets is the selection bias term. In the case of
random assignment of treatment it is zero (up to sampling error when population moments are replaced by
sample moments). If assignment is nonrandom, then omitted variables that affect both Y0i and selection
into the program will generate selection bias. Selection bias arises when the non-participants differ from
the participants in the non-participant state.
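The decomposition above can be checked in a small simulation. A minimal sketch, assuming a constant treatment effect of 2 and negative selection (worse-off individuals participate); all numbers are illustrative:

```python
import random

random.seed(0)

# Simulate a program where individuals with low untreated outcomes (Y0)
# are more likely to participate: classic negative selection.
y0_treated, y0_control = [], []
for _ in range(200_000):
    u = random.gauss(0, 1)            # unobserved determinant of Y0
    y0 = 10 + u                       # untreated outcome
    d = u < 0                         # worse-off individuals select in
    (y0_treated if d else y0_control).append(y0)

mean = lambda xs: sum(xs) / len(xs)
alpha = 2.0                           # constant treatment effect: Y1 = Y0 + 2
# naive comparison: E[Y|d=1] - E[Y|d=0] = alpha + selection bias
selection_bias = mean(y0_treated) - mean(y0_control)   # E[Y0|d=1] - E[Y0|d=0]
naive = (mean(y0_treated) + alpha) - mean(y0_control)
```

With negative selection the bias term is negative (about -1.6 in this design), so the naive comparison badly understates the true effect of 2.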
Linear Model with Constant Treatment Effect (simplifying assumption):
Y0i = Xiβ + ui where Xi are observed covariates and ui are unobserved outcome determinants.
E[ui|Xi] = 0 by construction.
The actual outcome for i (Yi) is then given by:
Yi = Y0i + αdi
The goal is to estimate α. Selection bias is present (ignoring observed covariates for the moment) if
E[Y0i | di=1] - E[Y0i | di=0] ≠ 0.
If ui is correlated with selection into the program even after conditioning on observed covariates then
selection bias will bias estimates of program effects.
More General Model with Covariates with outcomes function of observables (X) and unobservables
(u1, u0):
Y1i = g1(Xi) + u1i
Y0i = g0(Xi) + u0i
where E(u1i) = E(u0i) = 0
TOT = E[Y1i - Y0i | Xi, di=1] = E(Δi|Xi, di=1) = g1(Xi) - g0(Xi) + E[u1i - u0i| Xi, di=1]
The TOT combines both the structure (the g0 and g1 functions) and the means of the error terms for the
treated in this more general set-up. Experimental (random assignment) approaches allow the
identification of the TOT but without further assumptions do not necessarily allow one to identify the
underlying structural parameters of g1 and g0.
Approaches to Estimating the Average Treatment Effect on the Treated (TOT):
1. Randomized Social Experiment: Random assignment of treatment among applicants to programs
(those that would have participated). A randomized social experiment generates an experimental control
group consisting of those persons who would have participated but were randomly denied access to the
program or treatment. This requires no randomization bias: randomization must not change the
pool of applicants or their behavior per se. The control group provides an estimate of E[Y0|d=1], so one
can compare sample means of the treatment and control groups in the experiment.
Under ideal conditions, social experiments recover F(Y0 | d=1,X) from the distribution of outcomes of
the control group and F(Y1 | d=1,X) from the outcomes of the treatment group, provided randomization is
administered at the application stage and there is no attrition and no randomization bias. The experiment supplies the
missing data by providing an estimate of E[Y0|d=1,X] from the sample mean for the control group and
E[Y1|d=1,X] from the mean of the treatment group (the latter is also available in observational studies).
Thus the TOT can be estimated, but one can't recover the overall distribution of gains (treatment effects)
F(Δ|d=1,X) without stronger additional assumptions.
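The point that an experiment pins down the marginals F(Y0|d=1) and F(Y1|d=1) but not the distribution of gains can be made concrete: two pairings of the same marginals imply the same TOT but very different gain distributions. A toy sketch (all numbers illustrative):

```python
import random
import statistics

random.seed(1)
y0 = sorted(random.gauss(0, 1) for _ in range(10_000))
y1 = sorted(x + 1.0 for x in y0)      # marginal of Y1 = marginal of Y0 shifted by 1

# pairing (a): comonotonic -- every unit gains exactly 1
gains_a = [b - a for a, b in zip(y0, y1)]

# pairing (b): random matching -- identical marginals, dispersed gains
y1_shuffled = y1[:]
random.shuffle(y1_shuffled)
gains_b = [b - a for a, b in zip(y0, y1_shuffled)]

mean_a = statistics.mean(gains_a)     # both means equal the TOT of 1 ...
mean_b = statistics.mean(gains_b)
sd_a = statistics.pstdev(gains_a)     # ... but the spread of gains differs:
sd_b = statistics.pstdev(gains_b)     # 0 under (a), large under (b)
```

Both pairings generate exactly the same experimental data (the same two marginal distributions), so the experiment cannot distinguish them.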
How does randomization identify the TOT? Drop i subscripts for ease of presentation.
Assumption of no randomization bias: Let Y1* and Y0* be the outcomes observed under a regime of
randomization. Absence of randomization bias for the mean gain in the program implies:
E[Y1 | d=1,X] = E[Y1* | d=1,X]
E[Y0 | d=1,X] = E[Y0* | d=1,X]
Randomization operates conditional on d*=1 (those who would participate), which is appropriate since we are trying to get the mean
treatment effect for those who would participate in the program. Let R=1 denote random assignment to treatment and R=0 assignment to control:
E[Y | d*=1, R=1, X] = E[Y1 | d=1,X] = g1(X) + E[u1| d=1,X]
E[Y | d*=1, R=0, X] = E[Y0 | d=1,X] = g0(X) + E[u0| d=1,X]
Thus the difference of the mean of the treatment group and control group yields the TOT:
E[Y | d*=1, R=1, X] - E[Y | d*=1, R=0, X] = E[Y1 - Y0|d=1,X] = E[Δ|d=1,X]
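A quick simulation of the argument above, assuming applicants (d*=1) are negatively selected and the true TOT is 2 (illustrative values):

```python
import random

random.seed(2)
treat, control = [], []
for _ in range(200_000):
    u = random.gauss(0, 1)
    if u < 0:                         # only worse-off individuals apply: d* = 1
        y0 = 10 + u
        y1 = y0 + 2.0                 # true TOT = 2 among applicants
        if random.random() < 0.5:     # R = 1: randomized into treatment
            treat.append(y1)
        else:                         # R = 0: randomized-out control
            control.append(y0)

# the control group reveals E[Y0 | d=1], so the difference in means is the TOT
tot_hat = sum(treat) / len(treat) - sum(control) / len(control)
```

Note the naive observational comparison is unavailable here by design: everyone in the sample is an applicant, and randomization alone separates them into treatment and control.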
2. Eligibility Randomization: Randomization of eligibility to a program is sometimes a less disruptive
approach to implementing a social experiment. Under this approach, eligibility is randomly assigned (say
across hospitals or training centers), but then individuals and program operators can freely choose to
participate.
Eligibility randomization allows one to directly estimate the mean effect of eligibility for the program on
the population included in the experiment: the effect of eligibility for the program on outcomes is known
as the Intent to Treat (ITT) effect.
Consider a population of persons normally eligible for a program. Let e=1 if the person is kept eligible
after randomization and e=0 if the person loses eligibility. Let d* denote "willingness to participate": d*=1 if
the person would participate if eligible. Assume eligibility e is randomly assigned, and ignore other covariates.
Actual participation is d = e·d*: a person participates only if both eligible and willing to participate.
Intent-to-Treat Effect (ITT) = E[Y | e=1] - E[Y | e=0]
= difference in mean outcomes for eligibles and ineligibles if eligibility is randomly assigned
Randomization of eligibility directly allows the estimation of the ITT. But can one estimate the TOT
from an eligibility randomization experiment? Yes, but one needs additional assumptions:
The TOT can be estimated from an eligibility randomization experiment under the assumptions that (1)
treatment group (eligibility) assignment is truly random; (2) the effect of treatment group assignment on
outcomes operates only through participating in the program (e.g., using a housing voucher or getting
training) with no direct effect of eligibility per se; and (3) control group members (the ineligibles) cannot
participate in the program.
Under these assumptions, the difference in average outcomes of eligibles and ineligibles divided by
fraction of eligibles who participate provides an unbiased estimate of the TOT:
TOT = E[Y1 - Y0|d=1] = (E[Y|e=1] - E[Y|e=0])/P(d=1|e=1) = ITT/P(d=1|e=1)
where P( ) is the probability function, so that P(d=1 | e=1) is the program participation rate.
Proof: Because eligibility is randomly assigned, willingness to participate (d*) has the same distribution among eligibles and ineligibles, and by assumption (3) ineligibles cannot participate:
(*) E[Y|e=1] = P(d*=1)·E[Y1|d*=1] + P(d*=0)·E[Y0|d*=0]
(**) E[Y|e=0] = P(d*=1)·E[Y0|d*=1] + P(d*=0)·E[Y0|d*=0]
Subtracting (**) from (*): ITT = P(d*=1)·E[Y1 - Y0|d*=1] = P(d=1|e=1)·TOT, which gives the formula above.
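The ITT/participation-rate formula (the "Bloom estimator") can be verified by simulation under assumptions (1)-(3); the effect size of 2 and the roughly 50% willingness rate are illustrative:

```python
import random

random.seed(3)
elig_y, inelig_y = [], []
n_elig, n_elig_part = 0, 0
for _ in range(200_000):
    u = random.gauss(0, 1)
    willing = u < 0                   # d* = 1: would participate if eligible
    e = random.random() < 0.5         # eligibility randomly assigned
    y = 10 + u + (2.0 if (e and willing) else 0.0)   # participants gain 2
    if e:
        elig_y.append(y)
        n_elig += 1
        n_elig_part += willing
    else:
        inelig_y.append(y)

itt = sum(elig_y) / len(elig_y) - sum(inelig_y) / len(inelig_y)
p_part = n_elig_part / n_elig         # P(d=1 | e=1): the participation rate
tot_hat = itt / p_part                # scales the diluted ITT back up to the TOT
```

With about half of eligibles participating, the ITT is about 1 while the TOT is 2: dividing by the participation rate undoes the dilution from eligible non-participants.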
3. Instrumental Variables and the Local Average Treatment Effect (LATE): Let di(z) denote the treatment
status of individual i when the instrument Zi is set to the value z, and let P(z) = P(di=1 | Zi=z).
LATE(z,w) = E[Y1i - Y0i | di(z) - di(w) = 1]
= expected treatment effect for individuals who change treatment status as the instrument changes value from
w to z.
The LATE can be identified if Z is a legitimate instrument (can be excluded from the Y equations) and if
the monotonicity condition holds: di(z) ≥ di(w) for all i or di(z) ≤ di(w) for all i. Thus we are assuming that
there are no "defiers" in the terminology of Angrist-Imbens-Rubin (1996).
Assume di(z) ≥ di(w):
(***) E[Yi|Zi=z] - E[Yi|Zi=w] = (P(z) - P(w)) · E[Y1i - Y0i | di(z) - di(w) = 1]
The LATE is consistently estimated by the ratio of the difference in sample mean outcomes for those with
values of z and w for the instrument over the difference in fraction who are treated.
Proof: The monotonicity condition allows the expected value of Y given the instrument to be decomposed
into 3 groups (never takers (1-P(z)), compliers (P(z)-P(w)), and always-takers (P(w))). The LATE is the
average treatment effect for the compliers (those that change treatment status with different values of the
instrument).
E[Yi|Zi=z] = {P(z)-P(w)}·E[Y1i|di(z)-di(w)=1] + P(w)·E[Y1i|di(z)=di(w)=1]
+ {1-P(z)}·E[Y0i|di(z)=di(w)=0]
E[Yi|Zi=w] = {P(z)-P(w)}·E[Y0i|di(z)-di(w)=1] + P(w)·E[Y1i|di(z)=di(w)=1]
+ {1-P(z)}·E[Y0i|di(z)=di(w)=0]
Subtracting the second from the first equation yields equation (***).
Case of binary instrument Z:
LATE = E[Y1i - Y0i | di(1) - di(0) = 1] = (E[Yi|Zi=1] - E[Yi|Zi=0])/(P(1)-P(0))
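The binary-instrument Wald ratio can be simulated directly. Here the population shares (20% always-takers, 40% compliers, 40% never-takers) and the heterogeneous gains (3 for always-takers, 1.5 for compliers) are hypothetical; the point is that the ratio recovers the compliers' 1.5, not a population average:

```python
import random

random.seed(5)
mean = lambda xs: sum(xs) / len(xs)

y_z1, y_z0, d_z1, d_z0 = [], [], [], []
gains = {"always": 3.0, "complier": 1.5, "never": 0.0}
for _ in range(300_000):
    r = random.random()
    typ = "always" if r < 0.2 else ("complier" if r < 0.6 else "never")
    z = random.random() < 0.5                    # binary instrument, randomized
    d = typ == "always" or (typ == "complier" and z)
    y = random.gauss(0, 1) + gains[typ] * d      # outcome with type-specific gain
    (y_z1 if z else y_z0).append(y)
    (d_z1 if z else d_z0).append(1 if d else 0)

first_stage = mean(d_z1) - mean(d_z0)            # P(1) - P(0): the complier share
wald = (mean(y_z1) - mean(y_z0)) / first_stage   # LATE estimate
```

The always-takers' larger gain of 3 never enters the estimate: they are treated under both values of the instrument, so their outcomes difference out.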
One can't estimate the LATE if there exists a fourth category of "defiers" (those with di(z)=0 but di(w)=1), which arises with a failure of the monotonicity assumption -- see Angrist, Imbens, and Rubin (JASA, 1996).
Let Yi = Yi(Zi , di )
Exclusion restriction for instrument Z: Yi(1 , di ) = Yi(0 , di ) for d= 0, 1. The instrument Z only affects Y
through D.
Causal Effects of Z on Y for Population Units Classified by di(0) and di(1)

                        di(1) = 0                                di(1) = 1
di(0) = 0   Never-Taker:                           Complier:
            Yi(1,0) - Yi(0,0) = 0                  Yi(1,1) - Yi(0,0) = Yi(1) - Yi(0)
di(0) = 1   Defier:                                Always-Taker:
            Yi(1,0) - Yi(0,1) = -(Yi(1) - Yi(0))   Yi(1,1) - Yi(0,1) = 0
Nonexperimental (Econometric) Estimators: Let the outcome for person i in period t be given by equation (1):
(1) Yit = Xitβ + diαt + uit
with selection bias present when E[uit|di,Xit] ≠ 0.
The decision-making rule for program assignment can be described in terms of a latent index function INi
that depends on both observed (Zi) and unobserved (vi) covariates:
(3) INi = Ziγ + vi, with di = 1 if INi > 0 and di = 0 otherwise.
Alternative nonexperimental estimators try to undo the dependence between uit and di by making
alternative assumptions about the forms of equations (1), (2), and (3).
Dependence between uit and di can arise for two reasons: (1) dependence between Zi and uit (selection on
observables); and (2) dependence between vi and uit (selection on unobservables). Dependence on
observables is easily solved by controlling for those observables; selection on unobservables is a more
difficult problem.
B. Selection on observables (Zi): E(uit|di,Xi) ≠ 0 and E(uit|di,Xi,Zi) ≠ 0, but
E(uit|di,Xi,Zi) = E(uit|Xi,Zi).
In this case, controlling for the observed selection variables solves the selection bias problem. The only issue is
getting the right functional form for the control function E(uit|Xi,Zi), which can then be inserted into equation (1)
and estimated by regression methods.
1. Propensity score and blocking approach -- Dehejia and Wahba (1999)
2. Exact match comparison approach if can discretize observables -- Card and Sullivan (1988)
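A toy version of the exact-match idea: selection depends only on an observed binary covariate Z, so comparing treated and untreated within Z-cells (and weighting cells by treated counts) removes the bias that contaminates the raw comparison. The numbers (true effect 2, cell shift 3, participation rates 0.8/0.2) are invented for illustration:

```python
import random

random.seed(4)
cells = {0: {"t": [], "c": []}, 1: {"t": [], "c": []}}
for _ in range(200_000):
    z = int(random.random() < 0.5)
    y0 = 10 + 3 * z + random.gauss(0, 1)          # Z shifts the untreated outcome
    d = random.random() < (0.8 if z else 0.2)     # Z also drives participation
    y = y0 + 2.0 * d                              # true effect = 2
    cells[z]["t" if d else "c"].append(y)

mean = lambda xs: sum(xs) / len(xs)
# raw comparison mixes the Z effect into the estimate (badly upward biased here)
treated = cells[0]["t"] + cells[1]["t"]
untreated = cells[0]["c"] + cells[1]["c"]
naive = mean(treated) - mean(untreated)

# exact match: within-cell differences, weighted by the number of treated per cell
num = sum(len(c["t"]) * (mean(c["t"]) - mean(c["c"])) for c in cells.values())
matched = num / len(treated)
```

The treated-count weighting targets the TOT: each cell's difference is weighted by how many treated observations it contains.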
C. Selection on unobservables (vi): E(uit|di,Xi) ≠ 0 and E(uit|di,Xi,Zi) ≠ E(uit|Xi,Zi). In this case, one
needs assumptions about the distributions of vi, uit, and Zi to get an estimate of αt using Control Function
Estimators (Heckits, controlling for the propensity score), or one needs an instrument (a variable in Z not included in
X). The availability of longitudinal data (pre- and post-program) allows one to try alternatives such as (1)
Fixed Effects estimators, assuming selection is based on permanent earnings components; (2) the Random-growth
estimator, allowing individual-specific trends to affect selection; or (3) Transitory-shock models -- see
Ashenfelter and Card (1985) and Heckman and Hotz (1989). On the benefits of pre-program data on the comparison
group, see Heckman et al. (1998, EMA).
"Natural or Quasi Experiments" (see Angrist-Krueger 1999; Meyer 1995): Natural experiments result
when exogenous variation in independent variables of interest is created by (1) sharp exogenous shocks
to markets (baby boom, Black Death, Mariel boatlift); (2) institutional quirks (e.g. draft lotteries;
Maimonides rule for maximum class size in Israel); or (3) exogenous policy changes that affect some
groups but not other groups (e.g. changes in maximum UI that leave the replacement rate unchanged for
workers not at the maximum in one state but not another).
Basic approach: A comparison of changes for treatment and comparison groups (differences-in-differences) or a further difference relative to placebo treatment and comparison groups (differences-in-differences-in-differences). All this can be done in a simple components-of-variance scheme (time
effects, location effects, treatment group effects, placebo group effects, interaction terms) or by using an
IV - instrument variables - strategy in which one instruments for the treatment dummy variable with the
natural experiment indicator variables. IV Estimates can be interpreted as natural experiments:
Legitimate instruments generate a natural experiment that assigns treatment in a manner independent of
unobserved covariates:
--Vietnam Draft Lottery and Effects of Military service on earnings (Angrist, 1990)
--Date of Birth, Compulsory Schooling Laws, and Returns to Education (Angrist and Krueger, 1991)
--Mariel boatlift (Card, 1990) and impact of mass immigration on local labor markets
--prison overcrowding legislation to estimate impacts of incarceration on crime (Levitt, 1996)
Diffs-in-Diffs and DDD examples: Mariel boatlift (high and low skill workers and low skill immigration):
Treatment city Miami; Placebo city Atlanta (p)
Experimentals: Low Education workers
Controls: High Education workers
                 Before    After
Experimentals     E_b       E_a
Controls          C_b       C_a

Diff-in-diff = (E_a - E_b) - (C_a - C_b)
DDD = [(E_a - E_b) - (C_a - C_b)] - [(E_a^p - E_b^p) - (C_a^p - C_b^p)]
where the p superscript denotes the placebo city.
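The two differencing formulas are simple arithmetic on the four (or eight) cell means. A sketch with hypothetical mean log wages (all numbers invented for illustration, not Card's actual estimates):

```python
def diff_in_diff(e_before, e_after, c_before, c_after):
    """DD: change for experimentals minus change for controls."""
    return (e_after - e_before) - (c_after - c_before)

def ddd(treat_city, placebo_city):
    """DDD: DD in the treatment city minus DD in the placebo city.
    Each argument is a tuple (E_before, E_after, C_before, C_after)."""
    return diff_in_diff(*treat_city) - diff_in_diff(*placebo_city)

# hypothetical mean log wages: low-education (E) and high-education (C) workers
miami   = (2.00, 1.90, 2.50, 2.48)   # low-ed wages fall 0.10, high-ed fall 0.02
placebo = (2.05, 2.00, 2.55, 2.51)   # common shock hits the placebo city too

dd  = diff_in_diff(*miami)    # -0.08: nets out the high-ed (control) trend
dd3 = ddd(miami, placebo)     # -0.07: further nets out the placebo-city DD
```

The DDD removes any shock common to low-education workers in both cities that the within-city DD would wrongly attribute to the treatment.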
Regression approach to DDD with covariates (X) where S= skill group, A = after, and M = Miami:
Yit = Xitβ + γ1Sit + γ2Mit + γ3At + γ4SitMit + γ5SitAt + γ6AtMit + δSitAtMit + εit
where δ, the coefficient on the triple interaction, is the DDD estimate of the treatment effect.
Ashenfelter and Card (1985) application to CETA training:
(1) Fixed-Effects / Differences-in-Differences Estimates: Assume earnings follow
yit = βi + dt + αDit + εit
where βi is an individual fixed effect, dt a common time effect, Dit a training dummy, and α the training effect.
Selection into training depends on the permanent component of earnings: individuals participate if
βi < ȳ
where ȳ is a constant based on potential trainees' discount rates and tastes for training.
In this case, a simple differences-in-differences estimate comparing the change in earnings for
trainees between some pre-training period (τ-j) and the post-training period (τ+1) to the change in
earnings over the same period for the comparison group provides an unbiased estimate of the training
effect:
E[yiτ+1 - yiτ-j | Diτ+1=1] = (dτ+1 - dτ-j) + α
E[yiτ+1 - yiτ-j | Comparison Group] = (dτ+1 - dτ-j) + θα
where θ is the fraction of the comparison group that participates in CETA training (contamination
effect). If θ is trivially small (approximately 0), then one gets an unbiased estimate of α through the
differences-in-differences estimator:
E[yiτ+1 - yiτ-j | Diτ+1=1] - E[yiτ+1 - yiτ-j | Comparison Group] = (1-θ)α ≈ α
If multiple years of data are available, then there are multiple difference-in-differences estimates, which
should all be equal up to sampling error. A test of correct specification is a test of the equality of the
alternative d-d estimates.
Table 2: The choice of initial years greatly affects the A&C estimates. Using 1975 as the base year makes
it look like there is a large positive effect (from Ashenfelter's dip and mean reversion), while earlier base
years make it look like there are negative effects (from the fact that trainees are individuals with flatter
age-earnings profiles than the comparison group).
Furthermore, Ashenfelter's dip strongly indicates that transitory earnings are likely to play a key role
in training program entry, not just the permanent (average) earnings component. Shocks to earnings
also appear to be serially correlated. Thus, one needs a more sophisticated model of earnings
dynamics and program selection.
(2) Components of Variance Estimates: Assume that selection is based on an individual-specific fixed
effect and an individual-year-specific disturbance term which displays first-order autoregressive serial
correlation.
(i) yit = βi + dt + αDit + εit
where εit = ρεit-1 + ηit (first-order autoregressive disturbances).
(ii) Selection into training (in period τ, based on earnings observed k periods earlier):
yiτ-k + vi < ȳ
Substituting from (i), this is equivalent to
zi ≡ (βi - β̄) + εiτ-k + vi < ȳ - β̄ - dτ-k ≡ z̄, where β̄ is the mean of the βi.
Use the method of moments: Predict the means and covariances of the earnings of the comparison group
and the trainees using (i) and (ii). Estimate the means and covariances for the comparison group
and the trainee group with the sample moments. Match the estimated sample moments to the predicted
moments to get parameter estimates. Use the parameter estimates to predict trainee earnings if they had
not received training. The difference between the predicted trainee earnings without training
and the actual earnings of trainees is the estimate of the effect of training α.
For the comparison group, the means and covariances are the unconditional means and covariances
from (i):
E(yit) = β̄ + dt
var(yit) = σβ² + σε²
cov(yit, yis) = σβ² + ρ^|t-s| σε²  for t ≠ s
where σβ² is the variance of the βi and σε² the variance of the εit.
For the trainees (those with zi < z̄):
E[yit | zi < z̄] = E(yit) + αDit + [σβ² + ρ^|t-k| σε²]·φ*
where the bracketed term is cov(yit, zi) and φ* = E[zi | zi < z̄]/var(zi), which is negative because the
truncated mean E[zi | zi < z̄] lies below zero (generating Ashenfelter's dip in pre-program earnings).
The mean of trainee earnings differs from the mean of comparison earnings by the training effect plus
the sum of two components (a permanent component and a geometrically declining transitory
component centered symmetrically around the selection period), each proportional to φ*. The model
imposes the restriction that in the pre- and post-training periods the earnings of the trainees and
comparisons diverge in a systematic pattern that depends on only one free parameter, φ*.
The restrictions of the model are rejected since they fail to capture a systematically weaker trend in
trainees' earnings than in comparison group earnings. A&C supplement the model with individual-specific
earnings growth-rate trends (gi):
yit = βi + dt + git + αDit + εit
This model does better, but there is still much instability in the estimates.
LaLonde (1986 AER): A classic study in which the estimates of the impact of a training program
from a randomized social experiment (the National Supported Work (NSW) demonstration) are used as
a benchmark (the "true" estimates) against which to compare alternative non-experimental estimates using
alternative comparison groups and econometric specifications.
Experimental Estimates: Difference in mean earnings of NSW treatment group and NSW controls
(applicants randomized out of access to the program).
Non-experimental estimates: LaLonde throws away the experimental controls and uses comparison
groups with longitudinal earnings histories from the PSID and matched CPS-SSA earnings samples.
He estimates differences-in-differences models, more detailed regression models, and Heckit (control function)
estimates.
Key Insights: Experimental treatments and controls look identical (balanced) up to sampling error.
None of the standard non-experimental approaches provide reliable estimates; different estimators
passing standard specification tests give widely varying estimates.
Dehejia and Wahba (1999 JASA): This paper re-examines the use of non-experimental estimators to
estimate the treatment impact on earnings of the NSW demonstration using propensity score
methods.
Propensity Score Method: A semi-parametric generalization of the Heckman selection correction
model. Its advantages are a more general first stage equation and a better diagnostic for assessing the
comparability of the treatment and comparison groups (how balanced are the covariates of treatment
and comparison group members with similar propensity scores). It is an approach to doing "selection
on observables." Thus, the propensity score method is most useful when the econometrician
observes all of the variables used in selection but does not know the exact form of the "rule" that
leads to selection into treatment.
Rosenbaum and Rubin (1983): If treatment and potential outcomes are independent conditional on
the observed covariates X, then they are independent conditional on the conditional probability of
receiving treatment given the covariates.
Let Yi = diYi1 + (1-di)Yi0 where di = 1 if treated, Yi1 = outcome for i with treatment, and Yi0 is the
outcome for i without treatment.
Thus, in this case of selection on observables, adjusting for the propensity score removes the biases
associated with differences in covariates. Why is it sufficient to condition just on the propensity
score? The reason is that under the Rosenbaum-Rubin assumptions for selection on observables
(covariates X), the covariates are independent of assignment to treatment conditional on the
propensity score. In other words, the distribution of covariates should be the same across treatment
and comparison groups for observations with the same propensity score. This implication of the
assumptions for the propensity score approach to be appropriate provides a diagnostic: one can
group observations into strata based on the estimated propensity score and check whether the
covariates are balanced across the treatment and comparison groups in each stratum.
Implementing the Propensity Score Approach:
(1) Start with a parsimonious logit or probit selection equation for treatment and estimate the
propensity to select into the treatment group.
(2) Sort the data according to the estimated propensity score from lowest to highest.
(3) Divide the observations into blocks (or strata) of equal propensity score range
(0-0.1, 0.1-0.2, 0.2-0.3, etc.).
(4) Do t-tests for the difference in means for all covariates across treatment and control
observations in each block.
(5a) If all covariates are balanced (no significant differences in means), stop. Use the estimated
propensity scores.
(5b) If a particular block has one or more unbalanced covariates (but there is balance elsewhere),
divide the block into finer blocks and re-evaluate
(5c) If there are still problems with unbalanced covariates, modify the initial logit or probit equation to add
higher-order terms in the problem covariates and/or further interactions. Re-evaluate.
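Steps (2)-(5a) can be sketched in code. Here the propensity score is taken as given rather than estimated (in practice it comes from the step (1) logit/probit), and the simulated data, ten-block width, and 1.96 cutoff are all illustrative choices:

```python
import math
import random
import statistics

def t_stat(a, b):
    """Two-sample t statistic for a difference in means (unequal variances)."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / se

def check_balance(units, n_blocks=10, crit=1.96):
    """Steps (2)-(4): block on the score, t-test the covariate within each block.
    units: list of (score, covariate, treated). Returns {block: balanced?}."""
    blocks = {}
    for p, x, d in units:                       # steps (2)/(3): assign to blocks
        blocks.setdefault(min(int(p * n_blocks), n_blocks - 1), []).append((x, d))
    result = {}
    for b, obs in blocks.items():               # step (4): balance test per block
        xt = [x for x, d in obs if d]
        xc = [x for x, d in obs if not d]
        if len(xt) > 1 and len(xc) > 1:
            result[b] = abs(t_stat(xt, xc)) < crit   # step (5a) criterion
    return result

random.seed(7)
units = []
for _ in range(20_000):
    x = random.gauss(0, 1)                      # a single covariate
    p = 1 / (1 + math.exp(-x))                  # stand-in for an estimated score
    units.append((p, x, random.random() < p))

balance = check_balance(units)
# blocks flagged False would be split further (5b) or the score re-specified (5c)
```

In a real application the loop over (5b)/(5c) repeats until every block passes, and all covariates (not just one) are tested in step (4).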
There are a number of different semi-parametric ways to use the propensity score to estimate the
TOT given that you have achieved "balance" in the covariates. The multiplicity of methods arises
when the true functional form of the second stage equation is unknown:
(1) Control Function: Use your first-stage equation to form the Heckman selection correction
term and add it to your second-stage regression.
(2) Stratify: Divide the data into blocks based on the propensity score. Run the second-stage
equation within each block (this might just be the mean difference in outcomes for
treatment and comparison observations in each block). Calculate the weighted mean of the
within-block estimates to get the TOT (weight by the number of treatment observations in each
block).
(3) Match: Match each treatment observation with a comparison observation based on
similar propensity scores (find the closest match). Treat the data like panel data (like twins
data) and run within-match (match fixed effects) models of the treatment effect.
(4) Weight: Weight each observation by its propensity score and estimate the second-stage
equation (Hirano, Imbens and Ridder 2003).
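Method (2), stratification, can be sketched end-to-end. Here the true propensity score is used in place of an estimated one, and the data-generating process (true effect 2, a confounder x that raises both participation and earnings) is invented for illustration:

```python
import math
import random

random.seed(6)
data = []
for _ in range(100_000):
    x = random.gauss(0, 1)
    p = 1 / (1 + math.exp(-x))            # true propensity score
    d = random.random() < p
    y = x + 2.0 * d + random.gauss(0, 1)  # x confounds: raises both p and y
    data.append((p, d, y))

mean = lambda xs: sum(xs) / len(xs)
# raw treated/untreated comparison is badly biased upward by the confounder
naive = mean([y for _, d, y in data if d]) - mean([y for _, d, y in data if not d])

# stratify on the score, then weight within-block differences by treated counts
blocks = {}
for p, d, y in data:
    blocks.setdefault(min(int(p * 10), 9), {"t": [], "c": []})["t" if d else "c"].append(y)

num = den = 0.0
for b in blocks.values():
    if b["t"] and b["c"]:
        num += len(b["t"]) * (mean(b["t"]) - mean(b["c"]))
        den += len(b["t"])
tot_hat = num / den                       # close to the true effect of 2
```

A small residual bias remains because the covariate still varies within coarse blocks; finer blocks (step 5b of the balancing algorithm) shrink it further.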
Dehejia and Wahba (1999) illustrate these methods and show that, once one achieves "balance" of
covariates within blocks, these propensity score methods come quite close to the experimental
estimates for the NSW demonstration, in contrast to the unreliability of the traditional
econometric non-experimental estimators examined by LaLonde (1986). See table below.