http://www.aeaweb.org/articles.php?doi=10.1257/jel.47.1.5
Many empirical questions in economics and other social sciences depend on causal
effects of programs or policies. In the last two decades, much research has been done
on the econometric and statistical analysis of such causal effects. This recent theoreti-
cal literature has built on, and combined features of, earlier work in both the statistics
and econometrics literatures. It has by now reached a level of maturity that makes
it an important tool in many areas of empirical research in economics, including
labor economics, public finance, development economics, industrial organization,
and other areas of empirical microeconomics. In this review, we discuss some of the
recent developments. We focus primarily on practical issues for empirical research-
ers, as well as provide a historical overview of the area and give references to more
technical research.
and Jeffrey A. Smith (1999), Heckman and Edward Vytlacil (2007a, 2007b), and Jaap H. Abbring and Heckman (2007) provide an excellent overview of the important theoretical work by Heckman and his coauthors in this area.
The central problem studied in this literature is that of evaluating the effect of the exposure of a set of units to a program, or treatment, on some outcome. In economic studies, the units are typically economic agents such as individuals, households, markets, firms, counties, states, or countries but, in other disciplines where evaluation methods are used, the units can be animals, plots of land, or physical objects. The treatments can be job search assistance programs, educational programs, vouchers, laws or regulations, medical drugs, environmental exposure, or technologies. A critical feature is that, in principle, each unit can be exposed to multiple levels of the treatment. Moreover, this literature is focused on settings with observations on units exposed, and not exposed, to the treatment, with the evaluation based on comparisons of units exposed and not exposed.1 For example, an individual may enroll or not in a training program, or he or she may receive or not receive a voucher, or be subject to a particular regulation or not. The object of interest is a comparison of the two outcomes for the same unit when exposed, and when not exposed, to the treatment. The problem is that we can at most observe one of these outcomes because the unit can be exposed to only one level of the treatment. Paul W. Holland (1986) refers to this as the fundamental problem of causal inference. In order to evaluate the effect of the treatment, we therefore always need to compare distinct units receiving the different levels of the treatment. Such a comparison can involve different physical units or the same physical unit at different times.
1 As opposed to studies where the causal effect of fundamentally new programs is predicted through direct identification of preferences and production functions.
The problem of evaluating the effect of a binary treatment or program is a well-studied problem with a long history in both econometrics and statistics. This is true both in the theoretical literature and in the more applied literature. The econometric literature goes back to early work by Orley Ashenfelter (1978) and subsequent work by Ashenfelter and David Card (1985), Heckman and Richard Robb (1985), LaLonde (1986), Thomas Fraker and Rebecca Maynard (1987), Card and Daniel G. Sullivan (1988), and Charles F. Manski (1990). Motivated primarily by applications to the evaluation of labor market programs in observational settings, the focus in the econometric literature is traditionally on endogeneity, or self-selection, issues. Individuals who choose to enroll in a training program are by definition different from those who choose not to enroll. These differences, if they influence the response, may invalidate causal comparisons of outcomes by treatment status, possibly even after adjusting for observed covariates. Consequently, many of the initial theoretical studies focused on the use of traditional econometric methods for dealing with endogeneity, such as fixed effect methods from panel data analyses, and instrumental variables methods. Subsequently, the econometrics literature has combined insights from the semiparametric literature to develop new estimators for a variety of settings, requiring fewer functional form and homogeneity assumptions.
The statistics literature starts from a different perspective. This literature originates in the analysis of randomized experiments by Ronald A. Fisher (1935) and Jerzy Splawa-Neyman (1990). From the early 1970s, Rubin (1973a, 1973b, 1974, 1977, 1978), in a series of papers, formulated the now dominant approach to the analysis of causal effects in observational studies. Rubin proposed the
Imbens and Wooldridge: Econometrics of Program Evaluation 7
will discuss several of them. One approach (Rosenbaum and Rubin 1983b; Rosenbaum 1995) consists of sensitivity analyses, where robustness of estimates to specific limited departures from unconfoundedness is investigated. A second approach, developed by Manski (1990, 2003, 2007), consists of bounds analyses, where ranges of estimands consistent with the data and the limited assumptions the researcher is willing to make are derived and estimated. A third approach, instrumental variables, relies on the presence of additional treatments, the so-called instruments, that satisfy specific exogeneity and exclusion restrictions. The formulation of this method in the context of the potential outcomes framework is presented in Imbens and Angrist (1994) and Angrist, Imbens, and Rubin (1996). A fourth approach applies to settings where, in its pure form, overlap is completely absent because the assignment is a deterministic function of covariates, but comparisons can be made exploiting continuity of average outcomes as a function of covariates. This setting, known as the regression discontinuity design, has a long tradition in statistics (see William R. Shadish, Thomas D. Cook, and Donald T. Campbell 2002 and Cook 2008 for historical perspectives), but has recently been revived in the economics literature through work by Wilbert van der Klaauw (2002), Hahn, Todd, and van der Klaauw (2001), David S. Lee (2001), and Jack R. Porter (2003). Finally, a fifth approach, referred to as difference-in-differences, relies on the presence of additional data in the form of samples of treated and control units before and after the treatment. An early application is Ashenfelter and Card (1985). Recent theoretical work includes Abadie (2005), Bertrand, Duflo, and Mullainathan (2004), Stephen G. Donald and Kevin Lang (2007), and Susan Athey and Imbens (2006).
In this review, we will discuss in detail some of the new methods that have been developed in this literature. We will pay particular attention to the practical issues raised by the implementation of these methods. At this stage, the literature has matured to the extent that it has much to offer the empirical researcher. Although the evaluation problem is one where identification problems are important, there is currently a much better understanding of which assumptions are most useful, as well as a better set of methods for inference given different sets of assumptions.
Most of this review will be limited to settings with binary treatments. This is in keeping with the literature, which has largely focused on the binary treatment case. There are some extensions of these methods to multivalued, and even continuous, treatments (e.g., Imbens 2000; Michael Lechner 2001; Lechner and Ruth Miquel 2005; Richard D. Gill and James M. Robins 2001; Hirano and Imbens 2004), and some of these extensions will be discussed in the current review. But the work in this area is ongoing, and much remains to be done here.
The running example we will use throughout the paper is that of a job market training program. Such programs have been among the leading applications in the economics literature, starting with Ashenfelter (1978) and including LaLonde (1986) as a particularly influential study. In such settings, a number of individuals do, or do not, enroll in a training program, with labor market outcomes, such as yearly earnings or employment status, as the main outcome of interest. An individual not participating in the program may have chosen not to do so, or may have been ineligible for various reasons. Understanding the choices made, and constraints faced, by the potential participants is a crucial component of any analysis. In addition to observing participation status and outcome measures, we typically observe individual background characteristics, such as education levels and age, as well as information regarding prior labor market histories, such as earnings at various
levels of aggregation (e.g., yearly, quarterly, or monthly). In addition, we may observe some of the constraints faced by the individuals, including measures used to determine eligibility, as well as measures of general labor market conditions in the local labor markets faced by potential participants.
2. The Rubin Causal Model: Potential Outcomes, the Assignment Mechanism, and Interactions
In this section, we describe the essential elements of the modern approach to program evaluation, based on the work by Rubin. Suppose we wish to analyze a job training program using observations on $N$ individuals, indexed by $i = 1, \ldots, N$. Some of these individuals were enrolled in the training program. Others were not enrolled, either because they were ineligible or chose not to enroll. We use the indicator $W_i$ to indicate whether individual $i$ enrolled in the training program, with $W_i = 0$ if individual $i$ did not, and $W_i = 1$ if individual $i$ did, enroll in the program. We use $W$ to denote the $N$-vector with $i$-th element equal to $W_i$, and $N_0$ and $N_1$ to denote the number of control and treated units, respectively. For each unit, we also observe a $K$-dimensional column vector of covariates or pretreatment variables, $X_i$, with $X$ denoting the $N \times K$ matrix with $i$-th row equal to $X_i'$.
2.1 Potential Outcomes
The first element of the RCM is the notion of potential outcomes. For individual $i$, for $i = 1, \ldots, N$, we postulate the existence of two potential outcomes, denoted by $Y_i(0)$ and $Y_i(1)$. The first, $Y_i(0)$, denotes the outcome that would be realized by individual $i$ if he or she did not participate in the program. Similarly, $Y_i(1)$ denotes the outcome that would be realized by individual $i$ if he or she did participate in the program. Individual $i$ can either participate or not participate in the program, but not both, and thus only one of these two potential outcomes can be realized. Prior to the assignment being determined, both are potentially observable, hence the label potential outcomes. If individual $i$ participates in the program, $Y_i(1)$ will be realized and $Y_i(0)$ will ex post be a counterfactual outcome. If, on the other hand, individual $i$ does not participate in the program, $Y_i(0)$ will be realized and $Y_i(1)$ will be the ex post counterfactual. We will denote the realized outcome by $Y_i$, with $Y$ the $N$-vector with $i$-th element equal to $Y_i$. The preceding discussion implies that
$$Y_i = Y_i(W_i) = Y_i(0)\,(1 - W_i) + Y_i(1)\,W_i = \begin{cases} Y_i(0) & \text{if } W_i = 0, \\ Y_i(1) & \text{if } W_i = 1. \end{cases}$$
The potential outcomes are tied to the specific manipulation that would have made one of them the realized outcome. The more precise the specification of the manipulation, the more well-defined the potential outcomes are.
This distinction between the pair of potential outcomes $(Y_i(0), Y_i(1))$ and the realized outcome $Y_i$ is the hallmark of modern statistical and econometric analyses of treatment effects. We offer some comments on it. The potential outcomes framework has important precursors in a variety of other settings. Most directly, in the context of randomized experiments, the potential outcome framework was introduced by Splawa-Neyman (1990) to derive the properties of estimators and confidence intervals under repeated sampling.
The potential outcomes framework also has important antecedents in econometrics. Specifically, it is interesting to compare the distinction between potential outcomes $Y_i(0)$ and $Y_i(1)$ and the realized outcome $Y_i$ in Rubin’s approach to Trygve Haavelmo’s (1943) work on simultaneous equations models (SEMs). Haavelmo discusses identification of
10 Journal of Economic Literature, Vol. XLVII (March 2009)
supply and demand models. He makes a distinction between “any imaginable price $\pi$” as the argument in the demand and supply functions, $q^d(\pi)$ and $q^s(\pi)$, and the “actual price $p$,” which is the observed equilibrium price satisfying $q^s(p) = q^d(p)$. The supply and demand functions play the same role as the potential outcomes in Rubin’s approach, with the equilibrium price similar to the realized outcome. Curiously, Haavelmo’s notational distinction between equilibrium and potential prices has gotten blurred in many textbook discussions of simultaneous equations. In such discussions, the starting point is often the general formulation $Y\Gamma + XB = U$ for an $N \times M$ matrix of realized outcomes $Y$, an $N \times L$ matrix of exogenous covariates $X$, and an $N \times M$ matrix of unobserved components $U$. A nontrivial byproduct of the potential outcomes approach is that it forces users of SEMs to articulate what the potential outcomes are, thereby leading to better applications of SEMs. A related point is made in Pearl (2000).
Another area where potential outcomes are used explicitly is in the econometric analyses of production functions. Similar to the potential outcomes framework, a production function $g(x, \varepsilon)$ describes production levels that would be achieved for each value of a vector of inputs, some observed ($x$) and some unobserved ($\varepsilon$). Observed inputs may be chosen partly as a function of (expected) values of unobserved inputs. Only for the level of inputs actually chosen do we observe the level of the output. Potential outcomes are also used explicitly in labor market settings by A. D. Roy (1951). Roy models individuals choosing from a set of occupations. Individuals know what their earnings would be in each of these occupations and choose the occupation (treatment) that maximizes their earnings. Here we see the explicit use of the potential outcomes, combined with a specific selection/assignment mechanism, namely, choosing the treatment with the highest potential outcome.
The potential outcomes framework has a number of advantages over a framework based directly on realized outcomes. The first advantage of the potential outcome framework is that it allows us to define causal effects before specifying the assignment mechanism, and without making functional form or distributional assumptions. The most common definition of the causal effect at the unit level is as the difference $Y_i(1) - Y_i(0)$, but we may wish to look at ratios $Y_i(1)/Y_i(0)$, or other functions. Such definitions do not require us to take a stand on whether the effect is constant or varies across the population. Further, defining individual-specific treatment effects using potential outcomes does not require us to assume endogeneity or exogeneity of the assignment mechanism. By contrast, the causal effects are more difficult to define in terms of the realized outcomes. Often, researchers write down a regression function $Y_i = \alpha + \tau \cdot W_i + \varepsilon_i$. This regression function is then interpreted as a structural equation, with $\tau$ as the causal effect. Left unclear is whether the causal effect is constant or not, and what the properties of the unobserved component, $\varepsilon_i$, are. The potential outcomes approach separates these issues, and allows the researcher to first define the causal effect of interest without considering probabilistic properties of the outcomes or assignment.
The second advantage of the potential outcome approach is that it links the analysis of causal effects to explicit manipulations. Considering the two potential outcomes forces the researcher to think about scenarios under which each outcome could be observed, that is, to consider the kinds of experiments that could reveal the causal effects. Doing so clarifies the interpretation of causal effects. For illustration, consider a couple of recent examples from the economics literature. First, consider the causal effects of gender or ethnicity on outcomes of job applications. Simple comparisons of
economic outcomes by ethnicity are difficult to interpret. Are they the result of discrimination by employers, or are they the result of differences between applicants, possibly arising from discrimination at an earlier stage of life? Now, one can obtain unambiguous causal interpretations by linking comparisons to specific manipulations. A recent example is the study by Bertrand and Mullainathan (2004), who compare callback rates for job applications submitted with names that suggest African-American or Caucasian ethnicity. Their study has a clear manipulation (a name change) and therefore a clear causal effect. As a second example, consider some recent economic studies that have focused on causal effects of individual characteristics such as beauty (e.g., Daniel S. Hamermesh and Jeff E. Biddle 1994) or height. Do the differences in earnings by ratings on a beauty scale represent causal effects? One possible interpretation is that they represent causal effects of plastic surgery. Such a manipulation would make differences causal, but it appears unclear whether cross-sectional correlations between beauty and earnings in a survey from the general population represent causal effects of plastic surgery.
A third advantage of the potential outcome approach is that it separates the modeling of the potential outcomes from that of the assignment mechanism. Modeling the realized outcome is complicated by the fact that it combines the potential outcomes and the assignment mechanism. The researcher may have very different sources of information to bear on each. For example, in the labor market program example we can consider the outcome, say, earnings, in the absence of the program: $Y_i(0)$. We can model this in terms of individual characteristics and labor market histories. Similarly, we can model the outcome given enrollment in the program, again conditional on individual characteristics and labor market histories. Then finally we can model the probability of enrolling in the program given the earnings in both treatment arms conditional on individual characteristics. This sequential modeling will lead to a model for the realized outcome, but it may be easier than directly specifying a model for the realized outcome.
A fourth advantage of the potential outcomes approach is that it allows us to formulate probabilistic assumptions in terms of potentially observable variables, rather than in terms of unobserved components. In this approach, many of the critical assumptions will be formulated as (conditional) independence assumptions involving the potential outcomes. Assessing their validity requires the researcher to consider the dependence structure if all potential outcomes were observed. By contrast, models in terms of realized outcomes often formulate the critical assumptions in terms of errors in regression functions. To be specific, consider again the regression function $Y_i = \alpha + \tau \cdot W_i + \varepsilon_i$. Typically (conditional independence) assumptions are made on the relationship between $\varepsilon_i$ and $W_i$. Such assumptions implicitly bundle a number of assumptions, including functional-form assumptions and substantive exogeneity assumptions. This bundling makes the plausibility of these assumptions more difficult to assess.
A fifth advantage of the potential outcome approach is that it clarifies where the uncertainty in the estimators comes from. Even if we observe the entire (finite) population (as is increasingly common with the growing availability of administrative data sets), so that we can estimate population averages with no uncertainty, causal effects will be uncertain because for each unit at most one of the two potential outcomes is observed. One may still use superpopulation arguments to justify approximations to the finite sample distributions, but such arguments are not required to motivate the existence of uncertainty about the causal effect.
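The bundling noted under the fourth advantage can be made explicit with a short derivation. This is a sketch rather than a formula taken from the text: it assumes that $\alpha$ is defined as the population mean of $Y_i(0)$ and $\tau$ as the population mean of the unit-level effect $\tau_i = Y_i(1) - Y_i(0)$, which is one common way to map the regression function into potential outcomes.

```latex
\begin{align*}
Y_i &= Y_i(0)\,(1 - W_i) + Y_i(1)\,W_i
     = Y_i(0) + \tau_i\,W_i,
     \qquad \tau_i = Y_i(1) - Y_i(0), \\
    &= \alpha + \tau\,W_i + \varepsilon_i,
     \qquad \alpha = E[Y_i(0)],\quad \tau = E[\tau_i], \\
\varepsilon_i &= \bigl(Y_i(0) - \alpha\bigr) + W_i\,\bigl(\tau_i - \tau\bigr).
\end{align*}
```

Under these definitions the error term mixes heterogeneity in the no-treatment outcome with heterogeneity in the treatment effect, so a restriction such as $E[\varepsilon_i \mid W_i] = 0$ constrains both at once. It holds, for example, when $W_i$ is independent of $(Y_i(0), Y_i(1))$.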
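The point made under the fifth advantage, that uncertainty remains even when the entire finite population is observed, can be illustrated with a small simulation. The sketch below is purely illustrative: the population, its earnings distribution, the function names, and the use of complete randomization as the assignment mechanism are all assumptions made for the example, not part of the review. Both potential outcomes are generated for every unit, which is never possible with real data. The code then repeatedly reassigns treatment and recomputes the difference in mean realized outcomes.

```python
import random
import statistics


def simulate_population(n, seed=0):
    """Hypothetical finite population in which BOTH potential outcomes are known."""
    rng = random.Random(seed)
    # Y_i(0): earnings without the program (made-up distribution).
    y0 = [rng.gauss(10.0, 2.0) for _ in range(n)]
    # Y_i(1): earnings with the program; the unit-level effect is heterogeneous.
    y1 = [y + rng.gauss(1.0, 1.0) for y in y0]
    return y0, y1


def difference_in_means(y0, y1, w):
    """Difference in mean realized outcomes, Y_i = Y_i(0)(1 - W_i) + Y_i(1) W_i."""
    realized = [a * (1 - wi) + b * wi for a, b, wi in zip(y0, y1, w)]
    treated = [y for y, wi in zip(realized, w) if wi == 1]
    control = [y for y, wi in zip(realized, w) if wi == 0]
    return statistics.fmean(treated) - statistics.fmean(control)


def randomization_draws(y0, y1, n_treated, n_draws, seed=1):
    """Recompute the estimate under many completely randomized assignments."""
    rng = random.Random(seed)
    n = len(y0)
    draws = []
    for _ in range(n_draws):
        treated_ids = set(rng.sample(range(n), n_treated))
        w = [1 if i in treated_ids else 0 for i in range(n)]
        draws.append(difference_in_means(y0, y1, w))
    return draws


y0, y1 = simulate_population(n=200)
# Average effect in this fixed population; computable only because the
# simulation reveals both potential outcomes for every unit.
true_effect = statistics.fmean([b - a for a, b in zip(y0, y1)])
draws = randomization_draws(y0, y1, n_treated=100, n_draws=2000)

print(f"finite-population average effect: {true_effect:.3f}")
print(f"mean of estimates across assignments: {statistics.fmean(draws):.3f}")
print(f"std. dev. of estimates across assignments: {statistics.stdev(draws):.3f}")
```

The population is held fixed throughout, so the spread of the estimates comes entirely from which potential outcome happens to be revealed for each unit. No superpopulation argument is needed to generate it.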
independence.4 Although the analysis of data with such assignment mechanisms is not as straightforward as that of randomized experiments, there are now many practical methods available for this case. We review them in section 5.
4 E.g., Lechner 2001; A. Colin Cameron and Pravin K. Trivedi 2005.
The third class of assignment mechanisms contains all remaining assignment mechanisms with some dependence on potential outcomes.5 Many of these create substantive problems for the analysis, for which there is no general solution. There are a number of special cases that are by now relatively well understood, and we discuss these in section 6. The most prominent of these cases are instrumental variables, regression discontinuity, and differences-in-differences. In addition, we discuss two general methods that also relax the unconfoundedness assumption but do not replace it with additional assumptions. The first relaxes the unconfoundedness assumption in a limited way and investigates the sensitivity of the estimates to such violations. The second drops the unconfoundedness assumption entirely and establishes bounds on estimands of interest. The latter is associated with the work by Manski (1990, 1995, 2007).
5 This includes some mechanisms where the dependence on potential outcomes does not create any problems in the analyses. Most prominent in this category are sequential assignment mechanisms. For example, one could randomly assign the first ten units to the treatment or control group with probability 1/2. From then on one could skew the assignment probability to the treatment with the most favorable outcomes so far. For example, if the active treatment looks better than the control treatment based on the first $N$ units, then the $(N + 1)$th unit is assigned to the active treatment with probability 0.8 and vice versa. Such assignment mechanisms are not very common in economics settings, and we ignore them in this discussion.
2.3 Interactions and General Equilibrium Effects
In most of the literature, it is assumed that treatments received by one unit do not affect outcomes for another unit. Only the level of the treatment applied to the specific individual is assumed to potentially affect outcomes for that particular individual. In the statistics literature, this assumption is referred to as the Stable-Unit-Treatment-Value-Assumption (Rubin 1978). In this paper, we mainly focus on settings where this assumption is maintained. In the current section, we discuss some of the literature motivated by concerns about this assumption.
This lack-of-interaction assumption is very plausible in many biomedical applications. Whether or not one individual receives a new treatment for a stroke is unlikely to have a substantial impact on health outcomes for any other individual. However, there are also many cases in which such interactions are a major concern and the assumption is not plausible. Even in the early experimental literature, with applications to the effect of various fertilizers on crop yields, researchers were cognizant of potential problems with this assumption. In order to minimize leaking of fertilizer applied to one plot into an adjacent plot, experimenters used guard rows to physically separate the plots that were assigned different fertilizers. A different concern arises in epidemiological applications when the focus is on treatments such as vaccines for contagious diseases. In that case, it is clear that the vaccination of one unit can affect the outcomes of others in their proximity, and such effects are a large part of the focus of the evaluation.
In economic applications, interactions between individuals are also a serious concern. It is clear that a labor market program that affects the labor market outcomes for one individual potentially has an effect on the labor market outcomes for others. In a world with a fixed number of jobs, a training program could only redistribute the jobs, and ignoring this constraint on the number of jobs by using a partial, instead of a general, equilibrium analysis could lead one to
erroneously conclude that extending the program to the entire population would raise aggregate employment. Such concerns have rarely been addressed in the recent program evaluation literature. Exceptions include Heckman, Lance Lochner, and Christopher Taber (1999), who provide some simulation evidence for the potential biases that may result from ignoring these issues.
In practice these general equilibrium effects may, or may not, be a serious problem. The indirect effect on one individual of exposure to the treatment of a few other units is likely to be much smaller than the direct effect of the exposure of the first unit itself. Hence, with most labor market programs both small in scope and with limited effects on the individual outcomes, it appears unlikely that general equilibrium effects are substantial and they can probably be ignored for most purposes.
One general solution to these problems is to redefine the unit of interest. If the interactions between individuals are at an intermediate level, say a local labor market, or a classroom, rather than global, one can analyze the data using the local labor market or classroom as the unit and change the no-interaction assumption to require the absence of interactions among local labor markets or classrooms. Such aggregation is likely to make the no-interaction assumption more plausible, albeit at the expense of reduced precision.
An alternative solution is to directly model the interactions. This involves specifying which individuals interact with each other, and possibly relative magnitudes of these interactions. In some cases it may be plausible to assume that interactions are limited to individuals within well-defined, possibly overlapping groups, with the intensity of the interactions equal within this group. This would be the case in a world with a fixed number of jobs in a local labor market. Alternatively, it may be that interactions occur in broader groups but decline in importance depending on some distance metric, either geographical distance or proximity in some economic metric.
The most interesting literature in this area views the interactions not as a nuisance but as the primary object of interest. This literature, which includes models of social interactions and peer effects, has been growing rapidly in the last decade, following the early work by Manski (1993). See Manski (2000a) and William Brock and Steven N. Durlauf (2000) for recent surveys. Empirical work includes Jeffrey R. Kling, Jeffrey B. Liebman, and Katz (2007), who look at the effect of households moving to neighborhoods with higher average socioeconomic status; Bruce I. Sacerdote (2001), who studies the effect of college roommate behavior on a student’s grades; Edward L. Glaeser, Sacerdote, and Jose A. Scheinkman (1996), who study social interactions in criminal behavior; Anne C. Case and Lawrence F. Katz (1991), who look at neighborhood effects on disadvantaged youths; Bryan S. Graham (2008), who infers interactions from the effect of class size on the variation in grades; and Angrist and Lang (2004), who study the effect of desegregation programs on students’ grades. Many identification and inferential questions remain unanswered in this literature.
3. What Are We Interested In? Estimands and Hypotheses
In this section, we discuss some of the questions that researchers have asked in this literature. A key feature of the current literature, and one that makes it more important to be precise about the questions of interest, is the accommodation of general heterogeneity in treatment effects. In contrast, in many early studies it was assumed that the effect of a treatment was constant, implying that the effect of various policies could be captured by a single parameter. The essentially unlimited heterogeneity in the effects of the
treatment allowed for in the current literature implies that it is generally not possible to capture the effects of all policies of interest in terms of a few summary statistics. In practice researchers have reported estimates of the effects of a few focal policies. In this section we describe some of these estimands. Most of these estimands are average treatment effects, either for the entire population or for some subpopulation, although some correspond to other features of the joint distribution of potential outcomes.
Most of the empirical literature has focused on estimation. Much less attention has been devoted to testing hypotheses regarding the properties or presence of treatment effects. Here we discuss null and alternative hypotheses that may be of interest in settings with heterogeneous effects. Finally, we discuss some of the recent literature on decision-theoretic approaches to program evaluation that ties estimands more closely to optimal policies.
3.1 Average Treatment Effects
The econometric literature has largely focused on average effects of the treatment. The two most prominent average effects are defined over an underlying population. In cases where the entire population can be sampled, population treatment effects rely on the notion of a superpopulation, where the current population that is available is viewed as just one of many possibilities. In either case, the sample of size $N$ is viewed as a random sample from a large (super-)population, and interest is in the average effect in the superpopulation.6 The most popular treatment effect is the Population Average Treatment Effect (PATE), the population expectation of the unit-level causal effect, $Y_i(1) - Y_i(0)$:
$$\tau_{\mathrm{PATE}} = E[Y_i(1) - Y_i(0)].$$
If the policy under consideration would expose all units to the treatment or none at all, this is the most relevant quantity. Another popular estimand is the Population Average Treatment Effect on the Treated (PATT), the average over the subpopulation of treated units:
$$\tau_{\mathrm{PATT}} = E[Y_i(1) - Y_i(0) \mid W_i = 1].$$
6 For simplicity, we restrict ourselves to random sampling. Some data sets are obtained by stratified sampling. Most of the estimators we consider can be adjusted for stratified sampling. See, for example, Wooldridge (1999, 2007) on inverse probability weighting of averages and objective functions.
In many observational studies, $\tau_{\mathrm{PATT}}$ is a more interesting estimand than the overall average effect. As an example, consider the case where a well-defined population was exposed to a treatment, say a job training program. There may be various possibilities for a comparison group, including subjects drawn from public use data sets. In that case, it is generally not interesting to consider the effect of the program for the comparison group: for many members of the comparison group (e.g., individuals with stable, high-wage jobs) it is difficult and uninteresting to imagine their being enrolled in the labor market program. (Of course, the problem of averaging across units that are unlikely to receive future treatments can be mitigated by more carefully constructing the comparison group to be more like the treatment group, making $\tau_{\mathrm{PATE}}$ a more meaningful parameter. See the discussion below.) A second case where $\tau_{\mathrm{PATT}}$ is the estimand of most interest is in the setting of a voluntary program where those not enrolled will never be required to participate in the program. A specific example is the effect of serving in the military where an interesting question concerns the foregone earnings for those who served (Angrist 1998).
In practice, there is typically little motivation presented for the focus on the overall
16 Journal of Economic Literature, Vol. XLVII (March 2009)
average effect or the average effect for the treated. Take a job training program. The overall average effect would be the parameter of interest if the policy under consideration is a mandatory exposure to the treatment versus complete elimination. It is rare that these are the alternatives, with more typically exemptions granted to various subpopulations. Similarly the average effect for the treated would be informative about the effect of entirely eliminating the current program. More plausible regime changes would correspond to a modest extension of the program to other jurisdictions, or a contraction to a more narrow population.

A somewhat subtle issue is that we may wish to separate the extrapolation from the sample to the superpopulation from the problem of inference for the sample at hand. This suggests that, rather than focusing on PATE or PATT, we might first focus on the average causal effect conditional on the covariates in the sample,

$$\tau_{CATE} = \frac{1}{N} \sum_{i=1}^{N} E[Y_i(1) - Y_i(0) \mid X_i],$$

and, similarly, the average over the subsample of treated units:

$$\tau_{CATT} = \frac{1}{N_1} \sum_{i: W_i = 1} E[Y_i(1) - Y_i(0) \mid X_i].$$

If the effect of the treatment or intervention is constant ($Y_i(1) - Y_i(0) = \tau$ for some constant $\tau$), all four estimands, $\tau_{PATE}$, $\tau_{PATT}$, $\tau_{CATE}$, and $\tau_{CATT}$, are obviously identical. However, if there is heterogeneity in the effect of the treatment, the estimands may all be different. The difference between $\tau_{PATE}$ and $\tau_{CATE}$ (and between $\tau_{PATT}$ and $\tau_{CATT}$) is relatively subtle. Most estimators that are attractive for the population treatment effect are also attractive for the corresponding conditional average treatment effect, and vice versa. Therefore, we do not have to be particularly concerned with the distinction between the two estimands at the estimation stage. However, there is an important difference between the population and conditional estimands at the inference stage. If there is heterogeneity in the effect of the treatment, we can estimate the sample average treatment effect $\tau_{CATE}$ more precisely than the population average treatment effect $\tau_{PATE}$. When one estimates the variance of an estimator $\hat{\tau}$ (which can serve as an estimate for $\tau_{PATE}$ or $\tau_{CATE}$), one therefore needs to be explicit about whether one is interested in the variance relative to the population or to the conditional average treatment effect. We will return to this issue in section 5.

A more general class of estimands includes average causal effects for subpopulations and weighted average causal effects. Let $\mathbb{A}$ be a subset of the covariate space $\mathbb{X}$, and let $\tau_{CATE,\mathbb{A}}$ denote the conditional average causal effect for the subpopulation with $X_i \in \mathbb{A}$:

$$\tau_{CATE,\mathbb{A}} = \frac{1}{N_{\mathbb{A}}} \sum_{i: X_i \in \mathbb{A}} E[Y_i(1) - Y_i(0) \mid X_i],$$

where $N_{\mathbb{A}}$ is the number of units with $X_i \in \mathbb{A}$. Richard K. Crump et al. (2009) argue for considering such estimands. Their argument is not based on the intrinsic interest of these subpopulations. Rather, they show that such estimands may be much easier to estimate than $\tau_{CATE}$ (or $\tau_{CATT}$). Instead of solely reporting an imprecisely estimated average effect for the overall population, they suggest it may be informative to also report a precise estimate for the average effect of some subpopulation. They then propose a particular set $\mathbb{A}$ for which the average effect is most easily estimable. See section 5.10.2 for more details. The Crump et al. estimates would not necessarily have as much external validity as estimates for the overall population, but they may be much more informative for the sample at hand. In any case, in many instances the larger policy questions concern
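The sample estimands defined above can be illustrated with a minimal Python sketch. It assumes the conditional effects E[Yi(1) − Yi(0) | Xi] are known for every unit, which is never the case in practice; the simulated covariate, the assignment rule, and the set 𝔸 = {x : |x| ≤ 1} are hypothetical choices made only for illustration.

```python
import numpy as np

def sample_estimands(tau_x, w, in_a):
    """Average the conditional effects tau(X_i) over three sets of units.

    tau_x : E[Y_i(1) - Y_i(0) | X_i] for each unit (assumed known here)
    w     : binary treatment indicators W_i
    in_a  : boolean mask, True when X_i lies in the set A
    """
    tau_x = np.asarray(tau_x, dtype=float)
    w = np.asarray(w)
    in_a = np.asarray(in_a, dtype=bool)
    return {
        "CATE": tau_x.mean(),           # average over all N units
        "CATT": tau_x[w == 1].mean(),   # average over the N_1 treated units
        "CATE_A": tau_x[in_a].mean(),   # average over the N_A units with X_i in A
    }

rng = np.random.default_rng(0)
x = rng.normal(size=1_000)
# Hypothetical assignment rule: units with larger x are more likely to be treated.
w = (rng.uniform(size=x.size) < 1.0 / (1.0 + np.exp(-x))).astype(int)
in_a = np.abs(x) <= 1.0  # hypothetical set A

constant = sample_estimands(np.full_like(x, 2.0), w, in_a)  # tau(x) = 2 for all x
heterogeneous = sample_estimands(2.0 + x, w, in_a)          # tau(x) = 2 + x
```

With a constant effect the three averages coincide; with an effect that increases in x, the treated units (who tend to have larger x under this assignment rule) have a larger average effect than the full sample.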
Imbens and Wooldridge: Econometrics of Program Evaluation 17
the entire distribution of outcomes, or solely about average outcomes, and may also take into account costs associated with participation. If the administrator knew exactly the conditional distribution of the potential outcomes given the covariate information this would be a simple problem: the administrator would simply compare the expected welfare for different rules and choose the one with the highest value. However, the administrator does not have this knowledge and needs to make a decision given uncertainty about these distributions. In these settings, it is clearly important that the statistical model allows for heterogeneity in the treatment effects.

Graham, Imbens, and Ridder (2006) extend the type of problems studied in this literature by incorporating resource constraints. They focus on problems that include as a special case the problem of allocating a fixed number of slots in a program to a set of individuals on the basis of observable characteristics of these individuals given a random sample of individuals for whom outcome and covariate information is available.

4. Randomized Experiments

Experimental evaluations have traditionally been rare in economics. In many cases ethical considerations, as well as the reluctance of administrators to deny services to randomly selected individuals after they have been deemed eligible, have made it difficult to get approval for, and implement, randomized evaluations. Nevertheless, the few experiments that have been conducted, including some of the labor market training programs, have generally been influential, sometimes extremely so. More recently, many exciting and thought-provoking experiments have been conducted in development economics, raising new issues of design and analysis (see Duflo, Rachel Glennerster, and Kremer 2008 for a review).

With experimental data the statistical analysis is generally straightforward. Differencing average outcomes by treatment status or, equivalently, regressing the outcome on an intercept and an indicator for the treatment, leads to an unbiased estimator for the average effect of the treatment. Adding covariates to the regression function typically improves precision without jeopardizing consistency because the randomization implies that in large samples the treatment indicator and the covariates are independent. In practice, researchers have rarely gone beyond basic regression methods. In principle, however, there are additional methods that can be useful in these settings. In section 4.2, we review one important experimental technique, randomization-based inference, including Fisher’s method for calculating exact p-values, that deserves wider usage in social sciences. See Rosenbaum (1995) for a textbook discussion.

4.1 Randomized Experiments in Economics

Randomized experiments have a long tradition in biostatistics. In this literature they are often viewed as the only credible approach to establishing causality. For example, the United States Food and Drug Administration typically requires evidence from randomized experiments in order to approve new drugs and medical procedures. A first comment concerns the fact that even randomized experiments rely to some extent on substantive knowledge. It is only once the researcher is willing to limit interactions between units that randomization can establish causal effects. In settings with potentially unrestricted interactions between units, randomization by itself cannot solve the identification problems required for establishing causality. In biomedical settings, where such interaction effects are often arguably absent, randomized experiments are therefore particularly attractive. Moreover, in biomedical settings it is often possible to
keep the units ignorant of their treatment status, further enhancing the interpretation of the estimated effects as causal effects of the treatment, and thus improving the external validity.

In the economics literature randomization has played a much less prominent role. At various times social experiments have been conducted, but they have rarely been viewed as the sole method for establishing causality, and in fact they have sometimes been regarded with some suspicion concerning the relevance of the results for policy purposes (e.g., Heckman and Smith 1995; see Gary Burtless 1995 for a more positive view of experiments in social sciences). Part of this may be due to the fact that for the treatments of interest to economists, e.g., education and labor market programs, it is generally impossible to do blind or double-blind experiments, creating the possibility of placebo effects that compromise the internal validity of the estimates. Nevertheless, this suspicion often downplays the fact that many of the concerns that have been raised in the context of randomized experiments, including those related to missing data and external validity, are often equally present in observational studies.

Among the early social experiments in economics were the negative income tax experiments in Seattle and Denver in the early 1970s, formally referred to as the Seattle and Denver Income Maintenance Experiments (SIME and DIME). In the 1980s, a number of papers called into question the reliability of econometric and statistical methods for estimating causal effects in observational studies. In particular, LaLonde (1986) and Fraker and Maynard (1987), using data from the National Supported Work (NSW) programs, suggested that widely used econometric methods were unable to replicate the results from experimental evaluations. These influential conclusions encouraged government agencies to insist on the inclusion of experimental evaluation components in job training programs. Examples of such programs include the Greater Avenues to INdependence (GAIN) programs (e.g., James Riccio and Daniel Friedlander 1992), the WIN programs (e.g., Judith M. Gueron and Edward Pauly 1991; Friedlander and Gueron 1992; Friedlander and Philip K. Robins 1995), the Self Sufficiency Project in Canada (Card and Dean R. Hyslop 2005, and Card and Robins 1996), and the Statistical Assistance for Programme Selection in Switzerland (Stefanie Behncke, Markus Frölich, and Lechner 2006). Like the NSW evaluation, these experiments have been useful not merely in establishing the effects of particular programs but also in providing fertile testing grounds for new statistical evaluation methods.

Recently there have been a large number of exciting and innovative experiments, mainly in development economics but also in other areas, including public finance (Duflo and Emmanuel Saez 2003; Duflo et al. 2006; Raj Chetty, Adam Looney, and Kory Kroft forthcoming). The experiments in development economics include many educational experiments (e.g., T. Paul Schultz 2001; Orazio Attanasio, Costas Meghir, and Ana Santiago 2005; Duflo and Rema Hanna 2005; Banerjee et al. 2007; Duflo 2001; Miguel and Kremer 2004). Others study topics as wide-ranging as corruption (Benjamin A. Olken 2007; Claudio Ferraz and Frederico Finan 2008) or gender issues in politics (Raghabendra Chattopadhyay and Duflo 2004). In a number of these experiments, economists have been involved from the beginning in the design of the evaluations, leading to closer connections between the substantive economic questions and the design of the experiments, thus improving the ability of these studies to lead to conclusive answers to interesting questions. These experiments have also led to renewed interest in questions of optimal design. Some of these issues are discussed in Duflo, Glennerster, and Kremer (2008), Miriam
Bruhn and David McKenzie (2008), and Imbens et al. (2008).

4.2 Randomization-Based Inference and Fisher’s Exact P-Values

Fisher (1935) was interested in calculating p-values for hypotheses regarding the effect of treatments. The aim is to provide exact inferences for a finite population of size N. This finite population may be a random sample from a large superpopulation, but that is not exploited in the analysis. The inference is nonparametric in that it does not make functional form assumptions regarding the effects; it is exact in that it does not rely on large sample approximations. In other words, the p-values coming out of this analysis are exact and valid irrespective of the sample size.

The most common null hypothesis in Fisher’s framework is that of no effect of the treatment for any unit in this population, against the alternative that, at least for some units, there is a non-zero effect:

$$H_0: Y_i(0) = Y_i(1), \ \forall i = 1, \ldots, N, \quad \text{against} \quad H_a: \exists i \text{ such that } Y_i(0) \neq Y_i(1).$$

It is not important that the null hypothesis is that the effects are all zero. What is essential is that the null hypothesis is sharp, that is, the null hypothesis specifies the value of all unobserved potential outcomes for each unit. A more general null hypothesis could be that $Y_i(0) = Y_i(1) + c$ for some prespecified $c$, or that $Y_i(0) = Y_i(1) + c_i$ for some set of prespecified $c_i$. Importantly, this framework cannot accommodate null hypotheses such as the average effect of the treatment is zero, against the alternative hypothesis of a non-zero average effect, or

$$H_0': \frac{1}{N} \sum_{i} \bigl( Y_i(1) - Y_i(0) \bigr) = 0,$$

Whether the null of no effect for any unit versus the null of no effect on average is more interesting was the subject of a testy exchange between Fisher (who focused on the first) and Neyman (who thought the latter was the interesting hypothesis, and who stated that the first was only of academic interest) in Splawa-Neyman (1990). Putting the argument about its ultimate relevance aside, Fisher’s test is a powerful tool for establishing whether a treatment has any effect. It is not essential in this framework that the probabilities of assignment to the treatment group are equal for all units. It is crucial, however, that the probability of any particular assignment vector is known. These probabilities may differ by unit provided the probabilities are known.

The implication of Fisher’s framework is that, under the null hypothesis, we know the exact value of all the missing potential outcomes. Thus there are no nuisance parameters under the null hypothesis. As a result, we can deduce the distribution of any statistic, that is, any function of the realized values of $(Y_i, W_i)_{i=1}^{N}$, generated by the randomization. For example, suppose the statistic is the average difference between treated and control outcomes, $T(W, Y) = \bar{Y}_1 - \bar{Y}_0$, where $\bar{Y}_w = \sum_{i: W_i = w} Y_i / N_w$, for $w = 0, 1$. Now suppose we had assigned a different set of units to the treatment. Denote the vector of alternative treatment assignments by $\tilde{W}$. Under the null hypothesis we know all the potential outcomes and thus we can deduce what the value of the statistic would have been under that alternative assignment, namely $T(\tilde{W}, Y)$. We can infer the value of the statistic for all possible values of the assignment vector $W$, and since we know the distribution of $W$ we can deduce the distribution of $T(W, Y)$. The distribution generated by the randomization of the treatment assignment is referred to as the randomization distribution. The p-value
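The construction of the randomization distribution described above can be sketched in Python for a small sample. This is a minimal sketch under stated assumptions: it takes the sharp null of no effect for any unit, uses the difference in means as the statistic, and assumes a completely randomized design in which every assignment vector with the observed number of treated units is equally likely. The function names and the toy data are hypothetical.

```python
from itertools import combinations

import numpy as np

def diff_in_means(w, y):
    """T(W, Y): average treated outcome minus average control outcome."""
    return y[w == 1].mean() - y[w == 0].mean()

def fisher_exact_p_value(w_obs, y_obs, statistic=diff_in_means):
    """Exact p-value for the sharp null Y_i(0) = Y_i(1) for all i.

    Assumption: completely randomized design, so all assignment vectors with
    the observed number of treated units have the same known probability.
    """
    w_obs = np.asarray(w_obs)
    y_obs = np.asarray(y_obs, dtype=float)
    n, n_treated = w_obs.size, int(w_obs.sum())
    t_obs = abs(statistic(w_obs, y_obs))
    as_extreme = total = 0
    for treated in combinations(range(n), n_treated):
        w_alt = np.zeros(n, dtype=int)
        w_alt[list(treated)] = 1
        # Under the sharp null the observed outcomes are unchanged by reassignment.
        if abs(statistic(w_alt, y_obs)) >= t_obs - 1e-12:
            as_extreme += 1
        total += 1
    return as_extreme / total

y = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
w = np.array([0, 0, 0, 0, 1, 1, 1, 1])
p_value = fisher_exact_p_value(w, y)  # 2 of the 70 possible assignments are as extreme
```

Passing a different `statistic` function (for example, one computed on ranks) changes the test without changing the enumeration. Full enumeration is only practical for small N, since the number of assignment vectors grows combinatorially.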
least as large, in absolute value, as that of the observed statistic, $T(W, Y)$.

In moderately large samples, it is typically not feasible to calculate the exact p-values for these tests. In that case, one can approximate the p-value by basing it on a large number of draws from the randomization distribution. Here the approximation error is of a very different nature than that in typical large sample approximations: it is controlled by the researcher, and if more precision is desired one can simply increase the number of draws from the randomization distribution.

In the form described above, with the statistic equal to the difference in averages by treatment status, the results are typically not that different from those using Wald tests based on large sample normal approximations to the sampling distribution of the difference in means $\bar{Y}_1 - \bar{Y}_0$, as long as the sample size is moderately large. The Fisher approach to calculating p-values is much more interesting with other choices for the statistic. For example, as advocated by Rosenbaum in a series of papers (Rosenbaum 1984a, 1995), a generally attractive choice is the difference in average ranks by treatment status. First the outcome is converted into ranks (typically with, in case of ties, all possible rank orderings averaged), and then the test is applied using the average difference in ranks by treatment status as the statistic. The test is still exact, with its exact distribution under the null hypothesis known as the Wilcoxon distribution. Naturally, the test based on ranks is less sensitive to outliers than the test based on the difference in means.

If the focus is on establishing whether the treatment has some effect on the outcomes, rather than on estimating the average size of the effect, such rank tests are much more likely to provide informative conclusions than standard Wald tests based on differences in averages by treatment status. To illustrate this point, we took data from eight randomized evaluations of labor market programs. Four of the programs are from the WIN demonstration programs. The four evaluations took place in Arkansas, Baltimore, San Diego, and Virginia. See Gueron and Pauly (1991), Friedlander and Gueron (1992), David Greenberg and Michael Wiseman (1992), and Friedlander and Robins (1995) for more detailed discussions of each of these evaluations. The second set of four programs is from the GAIN programs in California. The four locations are Alameda, Los Angeles, Riverside, and San Diego. See Riccio and Friedlander (1992), Riccio, Friedlander, and Freedman (1994), and Dehejia (2003) for more details on these programs and their evaluations. In each location, we take as the outcome total earnings for the first (GAIN) or second (WIN) year following the program, and we focus on the subsample of individuals who had positive earnings at some point prior to the program. We calculate three p-values for each location. The first p-value is based on the normal approximation to the t-statistic calculated as the difference in average outcomes for treated and control individuals divided by the estimated standard error. The second p-value is based on randomization inference using the difference in average outcomes by treatment status. And the third p-value is based on the randomization distribution using the difference in average ranks by treatment status as the statistic. The results are in table 1.

In all eight cases, the p-values based on the t-test are very similar to those based on randomization inference. This outcome is not surprising given the reasonably large sample sizes, ranging from 71 (Arkansas, WIN) to 4,779 (San Diego, GAIN). However, in a number of cases, the p-value for the rank test is fairly different from that based on the level difference. In both sets of four locations there is one location where the rank test suggests a clear rejection at the
Table 1
P-values for Fisher Exact Tests: Ranks versus Levels
5 percent level whereas the level-based test would suggest that the null hypothesis of no effect should not be rejected at the 5 percent level. In the WIN (San Diego) evaluation, the p-value goes from 0.068 (levels) to 0.024 (ranks), and in the GAIN (San Diego) evaluation, the p-value goes from 0.136 (levels) to 0.018 (ranks). It is not surprising that the tests give different results. Earnings data are very skewed. A large proportion of the populations participating in these programs have zero earnings even after conditioning on positive past earnings, and the earnings distribution for those with positive earnings is skewed. In those cases, a rank-based test is likely to have more power against alternatives that shift the distribution toward higher earnings than tests based on the difference in means.

As a general matter it would be useful in randomized experiments to include such results for rank-based p-values, as a generally applicable way of establishing whether the treatment has any effect. As with all omnibus tests, one should use caution in interpreting a rejection, as the test can pick up interesting changes in the distribution (such as a mean or median effect) but also less interesting changes (such as higher moments about the mean).

5. Estimation and Inference under Unconfoundedness

Methods for estimation of average treatment effects under unconfoundedness are the most widely used in this literature. The central paper in this literature, which introduces the key assumptions, is Rosenbaum and Rubin (1983b), although the literature goes further back (e.g., William G. Cochran 1968; Cochran and Rubin 1973; Rubin 1977). Often the unconfoundedness assumption, which requires that conditional on observed covariates there are no unobserved factors that are associated both with the assignment and with the potential outcomes, is controversial. Nevertheless, in practice, where often data have been collected in order to make this assumption more plausible, there are many cases where there is no clearly superior alternative, and the only alternative is to abandon the attempt to get precise inferences. In this section, we discuss some of these methods and the issues related to them. A general theme of this literature is that the concern is more with biases than with efficiency.

Among the many recent economic applications relying on assumptions of this type are Blundell et al. (2001), Angrist (1998), Card and Hyslop (2005), Card and Brian P.
McCall (1996), V. Joseph Hotz, Imbens, and Jacob A. Klerman (2006), Card and Phillip B. Levine (1994), Card, Carlos Dobkin, and Nicole Maestas (2004), Hotz, Imbens, and Julie H. Mortimer (2005), Lechner (2002a), Abadie and Javier Gardeazabal (2003), and Bloom (2005).

This setting is closely related to that underlying standard multiple regression analysis with a rich set of controls. See, for example, Burt S. Barnow, Glen G. Cain, and Arthur S. Goldberger (1980). Unconfoundedness implies that we have a sufficiently rich set of predictors for the treatment indicator, contained in the vector of covariates $X_i$, such that adjusting for differences in these covariates leads to valid estimates of causal effects. Combined with linearity assumptions of the conditional expectations of the potential outcomes given covariates, the unconfoundedness assumption justifies linear regression. But in the last fifteen years the literature has moved away from the earlier emphasis on regression methods. The main reason is that, although local linearity of the regression functions may be a reasonable approximation, in many cases the estimated average treatment effects based on regression methods can be severely biased if the linear approximation is not accurate globally. To assess the potential problems with (global) regression methods, it is useful to report summary statistics of the covariates by treatment status. In particular, one may wish to report, for each covariate, the difference in averages by treatment status, scaled by the square root of the sum of the variances, as a scale-free measure of the difference in distributions. To be specific, one may wish to report the normalized difference

$$\Delta_X = \frac{\bar{X}_1 - \bar{X}_0}{\sqrt{S_0^2 + S_1^2}}, \tag{3}$$

where for $w = 0, 1$, $S_w^2 = \sum_{i: W_i = w} (X_i - \bar{X}_w)^2 / (N_w - 1)$, the sample variance of $X_i$ in the subsample with treatment $W_i = w$. Imbens and Rubin (forthcoming) suggest as a rule of thumb that with a normalized difference exceeding one quarter, linear regression methods tend to be sensitive to the specification. Note the difference with the often reported t-statistic for the null hypothesis of equal means,

$$T = \frac{\bar{X}_1 - \bar{X}_0}{\sqrt{S_0^2 / N_0 + S_1^2 / N_1}}. \tag{4}$$

The reason for focusing on the normalized difference, (3), rather than on the t-statistic, (4), as a measure of the degree of difficulty in the statistical problem of adjusting for differences in covariates, comes from their relation to the sample size. Clearly, simply increasing the sample size does not make the problem of inference for the average treatment effect inherently more difficult. However, quadrupling the sample size leads, in expectation, to a doubling of the t-statistic. In contrast, increasing the sample size does not systematically affect the normalized difference. In the landmark LaLonde (1986) paper the normalized difference in means exceeds unity for many of the covariates, immediately showing that standard regression methods are unlikely to lead to credible results for those data, even if one views unconfoundedness as a reasonable assumption.

As a result of the concerns with the sensitivity of results based on linear regression methods to seemingly minor changes in specification, the literature has moved to more sophisticated methods for adjusting for differences in covariates. Some of these more sophisticated methods use the propensity score (the conditional probability of receiving the treatment) in various ways. Others rely on pairwise matching of treated units to control units, using values of the covariates to match. Although these estimators appear at first sight to be quite different, many (including
nonparametric versions of the regression estimators) in fact achieve the semiparametric efficiency bound; thus, they would tend to be similar in large samples. Choices among them typically rely on small sample arguments, which are rarely formalized, and which do not uniformly favor one estimator over another. Most estimators currently in use can be written as the difference between weighted averages of the treated and control outcomes, with the weights in both groups adding up to one:

$$\hat{\tau} = \sum_{i=1}^{N} \lambda_i \cdot Y_i, \quad \text{with} \quad \sum_{i: W_i = 1} \lambda_i = 1, \quad \sum_{i: W_i = 0} \lambda_i = -1.$$

The estimators differ in the way the weights $\lambda_i$ depend on the full vector of assignments and matrix of covariates (including those of other units). For example, some estimators implicitly allow the weights to be negative for the treated units and positive for control units, whereas others do not. In addition, some depend on essentially all other units whereas others depend only on units with similar covariate values. Nevertheless, despite the commonalities of the estimators and large sample equivalence results, in practice the performance of the estimators can be quite different, particularly in terms of robustness and bias. Little is known about finite sample properties. The few simulation studies include Zhong Zhao (2004), Frölich (2004a), and Matias Busso, John DiNardo, and Justin McCrary (2008). On a more positive note, some understanding has been reached regarding the sensitivity of specific estimators to particular configurations of the data, such as limited overlap in covariate distributions. Currently, the best practice is to combine linear regression with either propensity score or matching methods in ways that explicitly rely on local, rather than global, linear approximations to the regression functions.

An ongoing discussion concerns the role of the propensity score, $e(x) = \mathrm{pr}(W_i = 1 \mid X_i = x)$, introduced by Rosenbaum and Rubin (1983b), and indeed whether there is any role for this concept. For recent contributions to this discussion, see Hahn (1998), Imbens (2004), Angrist and Hahn (2004), Peter C. Austin (2008a, 2008b), Dehejia (2005a), Smith and Todd (2001, 2005), Heckman, Ichimura, and Todd (1998), Frölich (2004a, 2004b), B. B. Hansen (2008), Jennifer Hill (2008), Robins and Ya’acov Ritov (1997), Rubin (1997, 2006), and Elizabeth A. Stuart (2008).

In this section, we first discuss the key assumptions underlying an analysis based on unconfoundedness. We then review some of the efficiency bound results for average treatment effects. Next, in sections 5.3 to 5.5, we briefly review the basic methods relying on regression, propensity score methods, and matching. Although these methods are still fairly widely used, we do not recommend them in practice. In sections 5.6 to 5.8, we discuss three of the combination methods that we view as more attractive and recommend in practice. We discuss estimating variances in section 5.9. Next we discuss implications of lack of overlap in the covariate distributions. In particular, we discuss two general methods for constructing samples with improved covariate balance, both relying heavily on the propensity score. In section 5.11, we describe methods that can be used to assess the plausibility of the unconfoundedness assumption, even though this assumption is not directly testable. We discuss methods for testing for the presence of average treatment effects and for the presence of treatment effect heterogeneity under unconfoundedness in section 5.12.

5.1 Identification

The key assumption is unconfoundedness, introduced by Rosenbaum and Rubin (1983b),
where the second equality follows by unconfoundedness: $E[Y_i(w) \mid W_i = w, X_i]$ does not depend on $w$. By the overlap assumption, we can estimate both terms in the last line, and therefore we can identify $\tau(x)$. Given that we can identify $\tau(x)$ for all $x$, we can identify the expected value across the population distribution of the covariates,

$$\tau_{PATE} = E[\tau(X_i)], \tag{6}$$

as well as $\tau_{PATT}$ and other estimands.

5.2 Efficiency Bounds

Before discussing specific estimation methods, it is useful to see what we can learn about the parameters of interest, given just the strong ignorability of treatment assignment assumption, without functional form or distributional assumptions. In order to do so, we need some additional notation. Let $\sigma_0^2(x) = \mathbb{V}(Y_i(0) \mid X_i = x)$ and $\sigma_1^2(x) = \mathbb{V}(Y_i(1) \mid X_i = x)$ denote the conditional variances of the potential outcomes given the covariates. Hahn (1998) derives the lower bounds for asymptotic variances of $\sqrt{N}$-consistent estimators for $\tau_{PATE}$ as

$$\mathbb{V}_{PATE} = E\left[ \frac{\sigma_1^2(X_i)}{e(X_i)} + \frac{\sigma_0^2(X_i)}{1 - e(X_i)} + (\tau(X_i) - \tau)^2 \right], \tag{7}$$

where $p = E[e(X_i)]$ is the unconditional treatment probability. Interestingly, this lower bound holds irrespective of whether the propensity score is known or not. The form of this variance bound is informative. It is no surprise that $\tau_{PATE}$ is more difficult to estimate the larger are the variances $\sigma_0^2(x)$ and $\sigma_1^2(x)$. However, as shown by the presence of the third term, it is also more difficult to estimate $\tau_{PATE}$, the more variation there is in the average treatment effect conditional on the covariates. If we focus instead on estimating $\tau_{CATE}$, the conditional average treatment effect, the third term drops out, and the variance bound for $\tau_{CATE}$ is

$$\mathbb{V}_{CATE} = E\left[ \frac{\sigma_1^2(X_i)}{e(X_i)} + \frac{\sigma_0^2(X_i)}{1 - e(X_i)} \right]. \tag{8}$$

Still, the role of heterogeneity in the treatment effect is potentially important. Suppose we actually had prior knowledge that the average treatment effect conditional on the covariates is constant, or $\tau(x) = \tau_{PATE}$ for all $x$. Given this assumption, the model is closely related to the partial linear model (Peter M. Robinson 1988; James H. Stock 1989). Given this prior knowledge, the variance bound is

$$\mathbb{V}_{const} = \left( E\left[ \left( \frac{\sigma_1^2(X_i)}{e(X_i)} + \frac{\sigma_0^2(X_i)}{1 - e(X_i)} \right)^{-1} \right] \right)^{-1}. \tag{9}$$

This variance bound can be much lower than (8) if there is variation in the propensity score. Knowledge of lack of variation in the treatment effect can be very valuable, or, conversely, allowing for general heterogeneity in the treatment effect can be expensive in terms of precision.

In addition to the conditional variances of the counterfactual outcomes, a third important determinant of the efficiency bound is the propensity score. Because it enters into (7) in the denominator, the presence of units with the propensity score close to zero or one will make it difficult to obtain precise estimates of the average effect of the treatment. One approach to address this problem, developed by Crump et al. (2009) and discussed in more detail in section 5.10, is to drop observations with the propensity score close to zero or one, and focus on the average effect of the treatment in the subpopulation with propensity scores away from zero and one. Suppose we focus on $\tau_{CATE,\mathbb{A}}$, the average of $\tau(X_i)$ for $X_i \in \mathbb{A}$. Then the variance bound is
28 Journal of Economic Literature, Vol. XLVII (March 2009)
$$\mathbb{V}_{\mathbb{A}} = \frac{1}{\mathrm{pr}(X_i \in \mathbb{A})} \times E\left[\frac{\sigma_1^2(X_i)}{e(X_i)} + \frac{\sigma_0^2(X_i)}{1 - e(X_i)} \,\Big|\, X_i \in \mathbb{A}\right]. \tag{10}$$

By excluding from the set $\mathbb{A}$ subsets of the covariate space where the propensity score is close to zero or one, we may be able to estimate $\tau_{CATE,\mathbb{A}}$ more precisely than $\tau_{CATE}$. (If we are instead interested in $\tau_{CATT}$, we only need to worry about covariate values where $e(x)$ is close to one.)

Having displayed these lower bounds on variances for the average treatment effects, a natural question is: Are there estimators that achieve these lower bounds that do not require parametric models or functional form restrictions on either the conditional means or the propensity score? The answer in general is yes, and we now consider different classes of estimators in turn.

5.3 Regression Methods

To describe the general approach to regression methods for estimating average treatment effects, define $\mu_0(x)$ and $\mu_1(x)$ to be the two regression functions for the potential outcomes:

$$\mu_0(x) = E[Y_i(0) \mid X_i = x] \quad\text{and}\quad \mu_1(x) = E[Y_i(1) \mid X_i = x].$$

By definition, the average treatment effect conditional on $X = x$ is $\tau(x) = \mu_1(x) - \mu_0(x)$. As we discussed in the identification subsection, under the unconfoundedness assumption, $\mu_0(x) = E[Y_i \mid W_i = 0, X_i = x]$ and $\mu_1(x) = E[Y_i \mid W_i = 1, X_i = x]$, which means we can estimate $\mu_0(\cdot)$ using regression methods for the untreated subsample and $\mu_1(\cdot)$ using the treated subsample. Given consistent estimators $\hat\mu_0(\cdot)$ and $\hat\mu_1(\cdot)$, a consistent estimator for either $\tau_{PATE}$ or $\tau_{CATE}$ is

$$\hat\tau_{reg} = \frac{1}{N}\sum_{i=1}^{N}\left(\hat\mu_1(X_i) - \hat\mu_0(X_i)\right). \tag{11}$$

Given parametric models for $\mu_0(\cdot)$ and $\mu_1(\cdot)$, estimation and inference are straightforward.8

8 There is a somewhat subtle issue in estimating treatment effects from stratified samples or samples with missing values of the covariates. If the missingness or stratification are determined by outcomes on the covariates, $X_i$, and the conditional means are correctly specified, then the missing data or stratification can be ignored for the purposes of estimating the regression parameters; see, for example, Wooldridge (1999, 2007). However, sample selection or stratification based on $X_i$ cannot be ignored in estimating, say, $\tau_{PATE}$, because $\tau_{PATE}$ equals the expected difference in regression functions across the population distribution of $X_i$. Therefore, consistent estimation of $\tau_{PATE}$ requires applying inverse probability weights or sampling weights to the average in (11).

In the simplest case, we assume each conditional mean can be expressed as functions linear in parameters, say

$$\mu_0(x) = \alpha_0 + \beta_0'(x - \psi_X), \qquad \mu_1(x) = \alpha_1 + \beta_1'(x - \psi_X), \tag{12}$$

where we take deviations from the overall population covariate mean $\psi_X$ so that the treatment effect is the difference in intercepts. (Naturally, as in any regression context, we can replace $x$ with general functions of $x$.) Of course, we rarely know the population mean of the covariates, so in estimation we replace $\psi_X$ with the sample average across all units, $\bar{X}$. Then $\hat\tau_{reg}$ is simply

$$\hat\tau_{reg} = \hat\alpha_1 - \hat\alpha_0. \tag{13}$$

This estimator is also obtained from the coefficient on the treatment indicator $W_i$ in the regression $Y_i$ on $1$, $W_i$, $X_i$, $W_i \cdot (X_i - \bar{X})$. Standard errors can be obtained from standard least squares regression output. (As we show below, in the case of estimating $\tau_{PATE}$, the usual standard error, whether or not it is made robust to heteroskedasticity, ignores the estimation error in $\bar{X}$ as an estimator of $\psi_X$; technically, the conventional
Imbens and Wooldridge: Econometrics of Program Evaluation 29
standard error is only valid for $\tau_{CATE}$ and not for $\tau_{PATE}$.)

A different representation of $\hat\tau_{reg}$ is useful in order to illustrate some of the concerns with regression estimators in this setting. Suppose we do use the linear model in (12). It can be shown that

$$\hat\tau_{reg} = \bar{Y}_1 - \bar{Y}_0 - \left(\frac{N_0}{N_0 + N_1}\cdot\hat\beta_1 + \frac{N_1}{N_0 + N_1}\cdot\hat\beta_0\right)'(\bar{X}_1 - \bar{X}_0). \tag{14}$$

To adjust for differences in covariates between treated and control units, the simple difference in average outcomes, $\bar{Y}_1 - \bar{Y}_0$, is adjusted by the difference in average covariates, $\bar{X}_1 - \bar{X}_0$, multiplied by the weighted average of the regression coefficients $\hat\beta_0$ and $\hat\beta_1$ in the two treatment regimes. This is a useful representation. It shows that if the averages of the covariates in the two treatment arms are very different, then the adjustment to the simple mean difference can be large. We can see that even more clearly by inspecting the predicted outcome for the treated units had they been subject to the control treatment:

$$\hat{E}[Y_i(0) \mid W_i = 1] = \bar{Y}_0 + \hat\beta_0'(\bar{X}_1 - \bar{X}_0).$$

The regression parameter $\hat\beta_0$ is estimated on the control sample, where the average of the covariates is equal to $\bar{X}_0$. It therefore likely provides a good approximation to the conditional mean function around that value. However, this estimated regression function is then used to predict outcomes in the treated sample, where the average of the covariates is equal to $\bar{X}_1$. If these covariate averages are very different, and thus the regression model is used to predict outcomes far away from where the parameters were estimated, the results can be sensitive to minor changes in the specification. Unless the linear approximation to the regression function is globally accurate, regression may lead to severe biases. Another way of interpreting this problem is as a multicollinearity problem. If the averages of the covariates in the two treatment arms are very different, the correlation between the covariates and the treatment indicator is relatively high. Although conventional least squares standard errors take the degree of multicollinearity into account, they do so conditional on the specification of the regression function. Here the concern is that any misspecification may be exacerbated by the collinearity problem. As noted in the introduction to section 5, an easy way to establish the severity of this problem is to inspect the normalized differences $(\bar{X}_1 - \bar{X}_0)/\sqrt{S_0^2 + S_1^2}$.

In the case of the standard regression estimator it is straightforward to derive and to estimate the variance when we view the estimator as an estimator of $\tau_{CATE}$. Assuming the linear regression model is correctly specified, we have

$$\sqrt{N}\,(\hat\tau_{reg} - \tau_{CATE}) \xrightarrow{d} \mathcal{N}(0, V_0 + V_1), \tag{15}$$

where $V_w = N \cdot E[(\hat\alpha_w - \alpha_w)^2]$, which can be obtained directly from standard regression output. Estimating the variance when we view the estimator as an estimator of $\tau_{PATE}$ requires adding a term capturing the variation in the treatment effect conditional on the covariates. The form is then

$$\sqrt{N}\,(\hat\tau_{reg} - \tau_{PATE}) \xrightarrow{d} \mathcal{N}(0, V_0 + V_1 + V_\tau),$$

where the third term in the normalized variance is

$$V_\tau = (\beta_1 - \beta_0)'\,E\left[(X_i - E[X_i])(X_i - E[X_i])'\right](\beta_1 - \beta_0),$$

which can be estimated as
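
To make the regression estimator of this subsection concrete, the following Python sketch fits the linear model (12) separately in each treatment arm, with covariates centred at the full-sample mean, and returns the intercept difference (13) together with the equivalent representation (14) and the normalized differences. It is only a sketch under stated assumptions: NumPy is assumed, the function names `fit_arm` and `regression_ate` are hypothetical, and the last returned quantity is a sample-analog plug-in for $V_\tau$ chosen for this illustration rather than an expression taken from the text.

```python
import numpy as np


def fit_arm(y, xc):
    """OLS of y on a constant and centred covariates; returns (alpha, beta)."""
    design = np.column_stack([np.ones(len(y)), xc])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[0], coef[1:]


def regression_ate(y, w, x):
    """Return tau_hat_reg (13), its form (14), normalized differences, plug-in V_tau."""
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=int)
    x = np.asarray(x, dtype=float).reshape(len(y), -1)
    xc = x - x.mean(axis=0)              # deviations from the full-sample mean
    treated, control = w == 1, w == 0

    alpha0, beta0 = fit_arm(y[control], xc[control])
    alpha1, beta1 = fit_arm(y[treated], xc[treated])
    tau_reg = alpha1 - alpha0            # equation (13)

    # Representation (14): mean difference minus a covariate adjustment.
    n0, n1 = control.sum(), treated.sum()
    dx = x[treated].mean(axis=0) - x[control].mean(axis=0)
    beta_bar = (n0 * beta1 + n1 * beta0) / (n0 + n1)
    tau_14 = y[treated].mean() - y[control].mean() - beta_bar @ dx

    # Normalized differences in covariate means between the two arms.
    s2_0 = x[control].var(axis=0, ddof=1)
    s2_1 = x[treated].var(axis=0, ddof=1)
    norm_diff = dx / np.sqrt(s2_0 + s2_1)

    # Assumed plug-in for V_tau: sample covariance of X in place of the
    # population moment, fitted slopes in place of beta_1 - beta_0.
    d = beta1 - beta0
    v_tau = d @ (xc.T @ xc / len(y)) @ d
    return tau_reg, tau_14, norm_diff, v_tau


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(1000, 2))
    w = (rng.uniform(size=1000) < 0.5).astype(int)
    y = x.sum(axis=1) + 2.0 * w + rng.normal(size=1000)
    print(regression_ate(y, w, x))
```

Because each arm is fitted with an intercept, the values from (13) and (14) agree up to floating-point error, which gives a quick check on an implementation. Large normalized differences signal that the adjustment term in (14) relies heavily on extrapolation.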
$$\lambda_i = K\left(\frac{X_i - x}{h}\right) \Big/ \sum_{i: W_i = w} K\left(\frac{X_i - x}{h}\right).$$

Although the rate of convergence of the kernel estimator to the regression function is slower than the conventional parametric rate $N^{-1/2}$, the rate of convergence of the implied estimator for the average treatment effect, $\hat\tau_{reg}$ in (11), is the regular parametric rate under regularity conditions. These conditions include smoothness of the regression functions and require the use of higher order kernels (with the order of the kernel depending on the dimension of the covariates). In

choice is that of the bandwidth $h$. In practice, researchers have used ad hoc methods for bandwidth selection. Formal results on bandwidth selection from the literature on nonparametric regression are not directly applicable. Those results are based on minimizing a global criterion such as the expected value of the squared difference between the estimated and true regression function, with the expectation taken with respect to the marginal distribution of the covariates. Thus, they focus on estimating the regression function well everywhere. Here the focus is on a particular scalar functional of the regression function, and it is not clear whether the
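
The kernel weights $\lambda_i$ and the implied estimator of the average treatment effect in (11) can be sketched in a few lines of Python. This is an illustration under assumptions that are not in the text: a scalar covariate, a Gaussian (second order) kernel rather than the higher order kernels the theory calls for, a bandwidth `h` picked by hand, and hypothetical function names. Each regression function is estimated as the $\lambda$-weighted average of outcomes in the relevant arm, which is the usual kernel regression form.

```python
import numpy as np


def kernel_weights(x_arm, x0, h):
    """lambda_i = K((X_i - x0)/h) / sum over the arm of K((X_j - x0)/h)."""
    k = np.exp(-0.5 * ((x_arm - x0) / h) ** 2)   # Gaussian kernel (assumption)
    return k / k.sum()


def kernel_mu(y_arm, x_arm, x_eval, h):
    """Kernel regression estimate of mu_w at each point of x_eval."""
    return np.array([kernel_weights(x_arm, x0, h) @ y_arm for x0 in x_eval])


def kernel_ate(y, w, x, h):
    """Average of mu1_hat(X_i) - mu0_hat(X_i) over the full sample, as in (11)."""
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=int)
    x = np.asarray(x, dtype=float)
    mu1 = kernel_mu(y[w == 1], x[w == 1], x, h)   # fitted on treated units only
    mu0 = kernel_mu(y[w == 0], x[w == 0], x, h)   # fitted on control units only
    return float(np.mean(mu1 - mu0))


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    x = rng.uniform(-1, 1, size=500)
    w = (rng.uniform(size=500) < 0.5).astype(int)
    y = np.sin(2 * x) + w + 0.3 * rng.normal(size=500)
    print(kernel_ate(y, w, x, h=0.2))
```

Rerunning `kernel_ate` over a grid of bandwidths is a simple way to see how sensitive the average effect estimate is to `h`, which is the practical concern raised above.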
Before we turn to propensity score methods, we should comment on estimating the average treatment effects on the treated, $\tau_{PATT}$ and $\tau_{CATT}$. In this case, $\hat\tau(X_i)$ gets averaged across observations with $W_i = 1$, rather than across the entire sample as in (11). Because $\hat\mu_1(x)$ is estimated on the treated subsample, in estimating PATT or CATT there is no problem if $\mu_1(x)$ is poorly estimated at covariate values that are common in the control group but scarce in the treatment group. But we must have a good estimate of $\mu_0(x)$ at covariate values common in the treatment group, and this is not ensured because we can only use the control group to obtain $\hat\mu_0(x)$. Nevertheless, in many settings $\mu_0(x)$ can be estimated well over the entire range of the covariates because the control group often includes units that are similar to those in the treatment group. By contrast, often there are numerous control group units (for example, high-income workers in the context of a job training program) that are quite different from any units in the treatment group, making the ATE parameters considerably more difficult to estimate than ATT parameters. (Further, the ATT parameters are more interesting from a policy perspective in such cases, unless one redefines the population to exclude some units that are unlikely to ever be in the treatment group.)

5.4 Methods Based on the Propensity Score

The first set of alternatives to regression estimators relies on estimates of the propensity score. These methods were introduced in Rosenbaum and Rubin (1983b). An early economic discussion is in Card and Sullivan (1988). Rosenbaum and Rubin show that, under unconfoundedness, independence of potential outcomes and treatment indicators also holds after conditioning solely on the propensity score, $e(x) = \mathrm{pr}(W_i = 1 \mid X_i = x)$:

$$W_i \perp\!\!\!\perp \left(Y_i(0), Y_i(1)\right) \mid X_i \;\Longrightarrow\; W_i \perp\!\!\!\perp \left(Y_i(0), Y_i(1)\right) \mid e(X_i).$$

The basic insight is that for any binary variable $W_i$, and any random vector $X_i$, it is true (without assuming unconfoundedness) that

$$W_i \perp\!\!\!\perp X_i \mid e(X_i).$$

Hence, within subpopulations with the same value for the propensity score, covariates are independent of the treatment indicator and thus cannot lead to biases (the same way in a regression framework omitted variables that are uncorrelated with included covariates do not introduce bias). Since under unconfoundedness all biases can be removed by adjusting for differences in covariates, this means that within subpopulations homogeneous in the propensity score there are no biases in comparisons between treated and control units.

Given the Rosenbaum–Rubin result, it is sufficient, under the maintained assumption of unconfoundedness, to adjust solely for differences in the propensity score between treated and control units. This result can be exploited in a number of ways. Here we discuss three of these that have been used in practice. The first two of these methods exploit the fact that the propensity score can be viewed as a covariate that is sufficient to remove biases in estimation of average treatment effects. For this purpose, any one-to-one function of the propensity score could also be used. The third method further uses the fact that the propensity score is the conditional probability of receiving the treatment.

The first method simply uses the propensity score in place of the covariates in regression analysis. Define $\nu_w(e) = E[Y_i \mid W_i = w, e(X_i) = e]$. Unconfoundedness in combination with the Rosenbaum–Rubin result implies that $\nu_w(e) = E[Y_i(w) \mid e(X_i) = e]$. Then we can estimate $\nu_w(e)$ very generally using kernel or series estimation on the propensity score, something which is greatly simplified by the fact that the propensity score is a scalar. Heckman, Ichimura, and Todd (1998) consider local smoothers and Hahn (1998)
considers a series estimator. In either case we have the consistent estimator

$$\hat\tau_{regprop} = \frac{1}{N}\cdot\sum_{i=1}^{N}\left(\hat\nu_1(e(X_i)) - \hat\nu_0(e(X_i))\right),$$

which is simply the average of the differences in predicted values for the treated and untreated outcomes. Interestingly, Hahn shows that, unlike when we use regression to adjust for the full set of covariates, the series regression estimator based on adjusting for the known propensity score does not achieve the efficiency bound.

Although methods of this type have been used in practice, probably because of their simplicity, regression on simple functions of the propensity score is not recommended. Because the propensity score does not have a substantive meaning, it is difficult to motivate a low order polynomial as a good approximation to the conditional expectation. For exam-

= 1 be boundary values. Then define $B_{ij}$, for $i = 1, \ldots, N$, and $j = 1, \ldots, J - 1$, as the indicators

$$B_{ij} = \begin{cases} 1 & \text{if } c_{j-1} \le e(X_i) < c_j, \\ 0 & \text{otherwise,} \end{cases} \qquad\text{and}\qquad B_{iJ} = 1 - \sum_{j=1}^{J-1} B_{ij}.$$

Now estimate within stratum $j$ the average treatment effect $\tau_j = E[Y_i(1) - Y_i(0) \mid B_{ij} = 1]$ as

$$\hat\tau_j = \bar{Y}_{j1} - \bar{Y}_{j0},$$

where

$$\bar{Y}_{jw} = \frac{1}{N_{jw}}\sum_{i: W_i = w} B_{ij} \times Y_i,$$

and
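
The blocking construction can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: it assumes NumPy, equally spaced boundary values $c_j = j/J$, a propensity score passed in as an array `e`, and a hypothetical function name `blocking_ate`. Within-block estimates $\hat\tau_j$ are combined using block shares $(N_{j0} + N_{j1})/N$, the weighting the text attributes to the blocking estimator; blocks that contain units from only one arm raise an error rather than being handled silently.

```python
import numpy as np


def blocking_ate(y, w, e, n_blocks=5):
    """Blocking (subclassification) estimate of the average treatment effect."""
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=int)
    e = np.asarray(e, dtype=float)

    # Boundary values c_0 = 0 < c_1 < ... < c_J = 1 (equal spacing is an assumption).
    edges = np.linspace(0.0, 1.0, n_blocks + 1)
    # Block j collects units with c_j <= e < c_{j+1} (0-based); e = 1 falls in the last block.
    block = np.digitize(e, edges[1:-1])

    n = len(y)
    tau = 0.0
    for j in range(n_blocks):
        in_block = block == j
        if not in_block.any():
            continue                      # an empty block contributes nothing
        treated = in_block & (w == 1)
        control = in_block & (w == 0)
        if not treated.any() or not control.any():
            raise ValueError(f"block {j} lacks treated or control units")
        tau_j = y[treated].mean() - y[control].mean()
        tau += in_block.sum() / n * tau_j  # weight by the block share of the sample
    return tau


if __name__ == "__main__":
    rng = np.random.default_rng(2)
    x = rng.uniform(size=1000)
    e = 0.2 + 0.6 * x
    w = (rng.uniform(size=1000) < e).astype(int)
    y = 2 * x + w + rng.normal(size=1000)
    print(blocking_ate(y, w, e))
```

Raising on one-sided blocks is a deliberate choice in this sketch: such blocks indicate limited overlap, and silently dropping them would change the population the estimate refers to.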
calculations, researchers have often used five strata, although depending on the sample size and the joint distribution of the data, fewer or more blocks will generally lead to a lower expected mean squared error.

The variance for this estimator is typically calculated conditional on the strata indicators, and assuming random assignment within the strata. That is, for stratum $j$, the estimator is $\hat\tau_j$, and its variance is estimated as $\hat{V}_j = \hat{V}_{j0} + \hat{V}_{j1}$, where

$$\hat{V}_{jw} = \frac{S_{jw}^2}{N_{jw}}, \quad\text{where}\quad S_{jw}^2 = \frac{1}{N_{jw}}\sum_{i: B_{ij} = 1, W_i = w}(Y_i - \bar{Y}_{jw})^2.$$

The overall variance is then estimated as

$$\hat{\mathbb{V}}(\hat\tau_{block}) = \sum_{j=1}^{J}\left(\hat{V}_{j0} + \hat{V}_{j1}\right)\cdot\left(\frac{N_{j0} + N_{j1}}{N}\right)^2.$$

This variance estimator is appropriate for $\tau_{CATE}$, although it ignores biases arising from variation in the propensity score within strata.

The third method exploiting the propensity score is based on weighting. Recall that $\tau_{PATE} = E[Y_i(1) - Y_i(0)] = E[Y_i(1)] - E[Y_i(0)]$. We consider the two terms separately. Because $W_i \cdot Y_i = W_i \cdot Y_i(1)$, we have

$$\begin{aligned}
E\left[\frac{W_i \cdot Y_i}{e(X_i)}\right] &= E\left[\frac{W_i \cdot Y_i(1)}{e(X_i)}\right] \\
&= E\left[E\left[\frac{W_i \cdot Y_i(1)}{e(X_i)} \,\Big|\, X_i\right]\right] \\
&= E\left[\frac{E(W_i \mid X_i)\cdot E(Y_i(1) \mid X_i)}{e(X_i)}\right] \\
&= E\left[\frac{e(X_i)\cdot E(Y_i(1) \mid X_i)}{e(X_i)}\right] \\
&= E\left[E(Y_i(1) \mid X_i)\right] = E[Y_i(1)],
\end{aligned}$$

where the second and final equalities follow by iterated expectations and the third equality holds by unconfoundedness. The implication is that weighting the treated population by the inverse of the propensity score recovers the expectation of the unconditional response under treatment. A similar calculation shows $E[((1 - W_i) Y_i)/(1 - e(X_i))] = E[Y_i(0)]$, and together these imply

$$\tau_{PATE} = E\left[\frac{W_i \cdot Y_i}{e(X_i)} - \frac{(1 - W_i)\cdot Y_i}{1 - e(X_i)}\right]. \tag{16}$$

Equation (16) suggests an obvious estimator of $\tau_{PATE}$:

$$\hat\tau_{weight} = \frac{1}{N}\sum_{i=1}^{N}\left[\frac{W_i \cdot Y_i}{e(X_i)} - \frac{(1 - W_i)\cdot Y_i}{1 - e(X_i)}\right], \tag{17}$$

which, as a sample average from a random sample, is consistent for $\tau_{PATE}$ and $\sqrt{N}$-asymptotically normally distributed. The estimator in (17) is essentially due to D. G. Horvitz and D. J. Thompson (1952).9

9 Because the Horvitz–Thompson estimator is based on sample averages, adjustments for stratified sampling are straightforward if one is provided sampling weights.

In practice, (17) is not a feasible estimator because it depends on the propensity score function $e(\cdot)$, which is rarely known. A surprising result is that, even if we know the propensity score, $\hat\tau_{weight}$ does not achieve the efficiency bound given in (7). It turns out to be better, in terms of large sample efficiency, to weight using the estimated rather than the true propensity score. Hirano, Imbens, and Ridder (2003) establish conditions under which replacing $e(\cdot)$ with a logistic sieve estimator results in a weighted propensity score estimator that achieves the variance bound. The estimator is practically simple to compute, as estimation of the propensity score involves a straightforward logit estimation
involving flexible functions of the covariates. Theoretically, the number of terms in the approximation should increase with the sample size. In the second step, given the estimated propensity score $\hat{e}(x)$, one estimates

$$\hat\tau_{ipw} = \sum_{i=1}^{N}\frac{W_i \cdot Y_i}{\hat{e}(X_i)} \Big/ \sum_{i=1}^{N}\frac{W_i}{\hat{e}(X_i)} \;-\; \sum_{i=1}^{N}\frac{(1 - W_i)\cdot Y_i}{1 - \hat{e}(X_i)} \Big/ \sum_{i=1}^{N}\frac{1 - W_i}{1 - \hat{e}(X_i)}. \tag{18}$$

We refer to this as the inverse probability weighting (IPW) estimator. See Hirano, Imbens, and Ridder (2003) for intuition as to why estimating the propensity score leads to a more efficient estimator, asymptotically, than knowing the propensity score.

Ichimura and Oliver Linton (2005) studied $\hat\tau_{ipw}$ when $\hat{e}(\cdot)$ is obtained via kernel regression, and they consider the problem of optimal bandwidth choice when the object of interest is $\tau_{PATE}$. More recently, Li, Racine, and Wooldridge (forthcoming) consider kernel estimation for discrete as well as continuous covariates. The estimator proposed by Li, Racine, and Wooldridge achieves the variance lower bound. See Hirano, Imbens, and Ridder (2003) and Wooldridge (2007) for methods for estimating the variance for these estimators.

Note that the blocking estimator can also be interpreted as a weighting estimator. Consider observations in block $j$. Within the block, the $N_{j1}$ treated observations all get equal weight $1/N_{j1}$. In the estimator for the overall average treatment effect, this block gets weight $(N_{j0} + N_{j1})/N$, so we can write $\hat\tau = \sum_{i=1}^{N}\lambda_i \cdot Y_i$, where for treated observations in block $j$ the weight normalized by $N$ is $N \cdot \lambda_i = (N_{j0} + N_{j1})/N_{j1}$, and for control observations it is $N \cdot \lambda_i = -(N_{j0} + N_{j1})/N_{j0}$. Implicitly this estimator is based on an estimate of the propensity score in block $j$ equal to $N_{j1}/(N_{j0} + N_{j1})$. Compared to the IPW estimator, the propensity score is smoothed within the block. This has the advantage of avoiding particularly large weights, but comes at the expense of introducing bias if the propensity score is correctly specified.

A particular concern with IPW estimators arises again when the covariate distributions are substantially different for the two treatment groups. That implies that the propensity score gets close to zero or one for some values of the covariates. Small or large values of the propensity score raise a number of issues. One concern is that alternative parametric models for the binary data, such as probit and logit models that can provide similar approximations in terms of estimated probabilities over the middle ranges of their arguments, tend to be more different when the probabilities are close to zero or one. Thus the choice of model and specification becomes more important, and it is often difficult to make well motivated choices in treatment effect settings. A second concern is that for units with propensity scores close to zero or one, the weights can be large, making those units particularly influential in the estimates of the average treatment effects, and thus making the estimator imprecise. These concerns are less serious than those regarding regression estimators because at least the IPW estimates will accurately reflect uncertainty. Still, these concerns make the simple IPW estimators less attractive. (As for regression cases, the problem can be less severe for the ATT parameters because propensity score values close to zero play no role. Problems for estimating ATT arise when some units, as described by their observed covariates, are almost certain to receive treatment.)

5.5 Matching

Matching estimators impute the missing potential outcomes using only the outcomes of a few nearest neighbors of the opposite treatment group. In that sense, matching is similar to nonparametric kernel regression, with the number of neighbors playing the role of the bandwidth in the kernel regression. A
formal difference with kernel methods is that the asymptotic distribution for matching estimators is derived conditional on the implicit bandwidth, that is, the number of neighbors, often fixed at a small number, e.g., one. Using such asymptotics, the implicit estimate $\hat\mu_w(x)$ is (close to) unbiased, but not consistent, for $\mu_w(x)$. In contrast, the kernel regression estimators discussed in the previous section implied consistency of $\hat\mu_w(x)$.

Matching estimators have the attractive feature that the smoothing parameters are easily interpretable. Given the matching metric, the researcher only has to choose the number of matches. Using only a single match leads to the most credible inference with the least bias, at the cost of sacrificing some precision. This sits well with the focus in the literature on reducing bias rather than variance. It also can make the matching estimator easier to use than those estimators that require more complex choices of smoothing parameters, and this may be another explanation for its popularity.

Matching estimators have been widely studied in practice and theory (e.g., X. Gu and Rosenbaum 1993; Rosenbaum 1989, 1995, 2002; Rubin 1973b, 1979; Rubin and Neal Thomas 1992a, 1992b, 1996, 2000; Heckman, Ichimura, and Todd 1998; Dehejia and Sadek Wahba 1999; Abadie and Imbens 2006; Alexis Diamond and Jasjeet S. Sekhon 2008; Sekhon forthcoming; Sekhon and Richard Grieve 2008; Rosenbaum and Rubin 1985; Stefano M. Iacus, Gary King, and Giuseppe Porro 2008). Most often they have been applied in settings where, (1) the interest is in the average treatment effect for the treated, and (2) there is a large reservoir of potential controls, although recent work (Abadie and Imbens 2006) shows that matching estimators can be modified to estimate the overall average effect. The setting with many potential controls allows the researcher to match each treated unit to one or more distinct controls, hence the label “matching without replacement.” Given the matched pairs, the treatment effect within a pair is estimated as the difference in outcomes, and the overall average as the average of the within-pair difference. Exploiting the representation of the estimator as a difference in two sample means, inference is based on standard methods for differences in means or methods for paired randomized experiments, ignoring any remaining bias. Fully efficient matching algorithms that take into account the effect of a particular choice of match for treated unit $i$ on the pool of potential matches for unit $j$ are computationally cumbersome. In practice, researchers use greedy algorithms that sequentially match units. Most commonly the units are ordered by the value of the propensity score with the highest propensity score units matched first. See Gu and Rosenbaum (1993) and Rosenbaum (1995) for discussions.

Abadie and Imbens (2006) study formal asymptotic properties of matching estimators in a different setting, where both treated and control units are (potentially) matched and matching is done with replacement. Code for the Abadie–Imbens estimator is available in Matlab and Stata (see Abadie et al. 2004).10

10 See Sascha O. Becker and Andrea Ichino (2002) and Edwin Leuven and Barbara Sianesi (2003) for alternative Stata implementations of matching estimators.

Formally, given a sample, $\{(Y_i, X_i, W_i)\}_{i=1}^{N}$, let $\ell_1(i)$ be the nearest neighbor to $i$, that is, $\ell_1(i)$ is equal to the nonnegative integer $j$, for $j \in \{1, \ldots, N\}$, if $W_j \neq W_i$, and

$$\lVert X_j - X_i \rVert = \min_{k: W_k \neq W_i} \lVert X_k - X_i \rVert.$$

More generally, let $\ell_m(i)$ be the index that satisfies $W_{\ell_m(i)} \neq W_i$ and that is the $m$-th closest to unit $i$:

$$\sum_{l: W_l \neq W_i} 1\left\{\lVert X_l - X_i \rVert \le \lVert X_{\ell_m(i)} - X_i \rVert\right\} = m,$$
where $1\{\cdot\}$ is the indicator function, equal to one if the expression in brackets is true and zero otherwise. In other words, $\ell_m(i)$ is the index of the unit in the opposite treatment group that is the $m$-th closest to unit $i$ in terms of the distance measure based on the norm $\lVert\cdot\rVert$. Let $\mathcal{J}_M(i) \subset \{1, \ldots, N\}$ denote the set of indices for the first $M$ matches for unit $i$: $\mathcal{J}_M(i) = \{\ell_1(i), \ldots, \ell_M(i)\}$. Now impute the missing potential outcomes as the average of the outcomes for the matches, by defining $\hat{Y}_i(0)$ and $\hat{Y}_i(1)$ as

$$\hat{Y}_i(0) = \begin{cases} Y_i & \text{if } W_i = 0, \\ \frac{1}{M}\sum_{j \in \mathcal{J}_M(i)} Y_j & \text{if } W_i = 1, \end{cases} \qquad \hat{Y}_i(1) = \begin{cases} \frac{1}{M}\sum_{j \in \mathcal{J}_M(i)} Y_j & \text{if } W_i = 0, \\ Y_i & \text{if } W_i = 1. \end{cases}$$

The simple matching estimator discussed in Abadie and Imbens is then

$$\hat\tau_{match} = \frac{1}{N}\sum_{i=1}^{N}\left(\hat{Y}_i(1) - \hat{Y}_i(0)\right). \tag{19}$$

Abadie and Imbens show that the bias of this estimator is of order $O(N^{-1/K})$, where $K$ is the dimension of the covariates. Hence, if one studies the asymptotic distribution of the estimator by normalizing by $\sqrt{N}$ (as can be justified by the fact that the variance of the estimator is of order $O(1/N)$), the bias does not disappear if the dimension of the covariates is equal to two, and will dominate the large sample variance if $K$ is at least three. To put this result in perspective, it is useful to relate it to bias properties of estimators based on kernel regression. Kernel estimators can be viewed as matching estimators where all observations within some bandwidth $h_N$ receive some weight. As the sample size $N$ increases, the bandwidth $h_N$ shrinks, but sufficiently slowly to ensure that the number of units receiving non-zero weights diverges. If all the weights are positive, the bias for kernel estimators would generally be worse. In order to achieve root-$N$ consistency, it is therefore critical that some weights are negative through the device of higher order kernels, with the exact order required dependent on the dimension of the covariates (see, e.g., Heckman, Ichimura, and Todd 1998). In practice, however, researchers have not used higher order kernels, and so bias concerns for nearest-neighbor matching estimators are even more relevant for kernel matching methods.

There are three caveats to the Abadie–Imbens bias result. First, it is only the continuous covariates that should be counted in the dimension of the covariates. With discrete covariates the matching will be exact in large samples, and as a result such covariates do not contribute to the order of the bias. Second, if one matches only the treated, and the number of potential controls is much larger than the number of treated units, one can justify ignoring the bias by appealing to an asymptotic sequence where the number of potential controls increases faster with the sample size than the number of treated units. Specifically, if the number of controls, $N_0$, and the number of treated, $N_1$, satisfy $N_1/N_0^{4/K} \to 0$, then the bias disappears in large samples after normalization by $\sqrt{N_1}$. Third, even though the order of the bias may be high, the actual bias may still be small if the coefficients in the leading term are small. This is possible if the biases for different units are at least partially offsetting. For example, the leading term in the bias relies on the regression function being nonlinear, and the density of the covariates having a nonzero slope. If either the regression function is well approximated by a linear function, or the density is approximately flat, the bias may be fairly limited.

Abadie and Imbens (2006) also show that matching estimators are generally not efficient. Even in the case where the bias is of low enough order to be dominated by the variance, the estimators do not reach the efficiency bound given a fixed number
of matches. To reach the bound the number of matches would need to increase with the sample size. If $M \to \infty$, with $M/N \to 0$, then the matching estimator is essentially like a nonparametric regression estimator. However, it is not clear that using an approximation based on a sequence with an increasing number of matches improves the accuracy of the approximation. Given that in an actual data set one uses a specific number of matches, $M$, it would appear appropriate to calculate the asymptotic variance conditional on that number, rather than approximate the distribution as if this number is large. Calculations in Abadie and Imbens show that the efficiency loss from even a very small number of matches is quite modest, and so the concerns about the inefficiency of matching estimators may not be very relevant in practice. Little is known about the optimal number of matches, or about data-dependent ways of choosing it.

All of the distance metrics used in practice standardize the covariates in some manner. Abadie and Imbens use a diagonal matrix with each diagonal element equal to the inverse of the corresponding covariate variance. The most common metric is the Mahalanobis metric, which is based on the inverse of the full covariance matrix. Zhao (2004), in an interesting discussion of the choice of metrics, suggests some alternatives that depend on the correlation between covariates, treatment assignment, and outcomes. So far there is little experience with any metrics beyond inverse-of-the-variances and the Mahalanobis metrics. Zhao (2004) reports the results of some simulations using his proposed metrics, finding no clear winner given his specific design.

5.6 Combining Regression and Propensity Score Weighting

In sections 5.3 and 5.4, we describe methods for estimating average causal effects based on two strategies: the first is based on estimating $\mu_w(x) = E[Y_i(w) \mid X_i = x]$ for $w = 0, 1$ and averaging the difference as in (11), and the second is based on estimating the propensity score $e(x) = \mathrm{pr}(W_i = 1 \mid X_i = x)$ and using that to weight the outcomes as in (18). For each approach, we have discussed estimators that achieve the asymptotic efficiency bound. If we have large sample sizes, relative to the dimension of $X_i$, we might think our nonparametric estimators of the conditional means or propensity score are sufficiently accurate to invoke the asymptotic efficiency results described above.

In other cases, however, we might choose flexible parametric models without being confident that they necessarily approximate the means or propensity score well. As we discussed earlier, one reason for viewing estimators of conditional means or propensity scores as flexible parametric models is that it greatly simplifies standard error calculations for treatment effect estimates. In such cases, one might want to adopt a strategy that combines regression and propensity score methods in order to achieve some robustness to misspecification of the parametric models. It may be helpful to think about the analogy to omitted variable bias. Suppose we are interested in the coefficient on $W_i$ in the (long) linear regression of $Y_i$ on a constant, $W_i$ and $X_i$. Suppose we omit $X_i$ from the long regression, and just run the short regression of $Y_i$ on a constant and $W_i$. The bias in the estimate from the short regression is equal to the product of the coefficient on $X_i$ in the long regression, and the coefficient on $X_i$ in a regression of $W_i$ on a constant and $X_i$. Weighting can be interpreted as removing the correlation between $W_i$ and $X_i$, and regression as removing the direct effect of $X_i$. Weighting therefore removes the bias from omitting $X_i$ from the regression. As a result, combining regression and weighting can lead to additional robustness by both removing the correlation between the omitted covariates, and by reducing the correlation between
Imbens and Wooldridge: Econometrics of Program Evaluation 39
the omitted and included variables. This is the idea behind the doubly-robust estimators developed in Robins and Rotnitzky (1995), Robins, Rotnitzky and Lue Ping Zhao (1995), and Mark J. van der Laan and Robins (2003).

Suppose we model the two regression functions as $\mu_w(x) = \alpha_w + \beta_w'(x - \bar{X})$, for $w = 0, 1$ (where we abuse notation a bit and insert the sample averages of the covariates for their population means). More generally, we may use a nonlinear model for the conditional expectation, or just a more flexible linear approximation. Suppose we model the propensity score as $e(x) = p(x; \gamma)$, for example as $p(x; \gamma) = \exp(\gamma_0 + x'\gamma_1)/(1 + \exp(\gamma_0 + x'\gamma_1))$. In the first step, we estimate $\gamma$ by maximum likelihood and obtain the estimated propensity scores as $\hat{e}(X_i) = p(X_i; \hat{\gamma})$. In the second step, we use linear regression, where we weight the objective function by the inverse probability of treatment or non-treatment. Specifically, to estimate $(\alpha_0, \beta_0)$ and $(\alpha_1, \beta_1)$, we would solve the weighted least squares problems

$$\min_{\alpha_0, \beta_0} \sum_{i: W_i = 0} \frac{\bigl(Y_i - \alpha_0 - \beta_0'(X_i - \bar{X})\bigr)^2}{1 - p(X_i; \hat{\gamma})}, \tag{20}$$

and

$$\min_{\alpha_1, \beta_1} \sum_{i: W_i = 1} \frac{\bigl(Y_i - \alpha_1 - \beta_1'(X_i - \bar{X})\bigr)^2}{p(X_i; \hat{\gamma})}.$$

Given the estimated conditional mean functions, we estimate $\tau_{PATE}$, using the expression for $\hat{\tau}_{reg} = \hat{\alpha}_1 - \hat{\alpha}_0$ as in equation (13). But what is the motivation for weighting by the inverse propensity score when we did not use such weighting in section 5.3? The motivation is the double robustness result due to Robins and Rotnitzky (1995); see also Daniel O. Scharfstein, Rotnitzky, and Robins (1999).

First, suppose that the conditional expectation is indeed linear, or $E[Y_i(w) \mid X_i = x] = \alpha_w + \beta_w'(x - \bar{X})$. Then, as discussed in the treatment effect context by Wooldridge (2007), weighting the objective function by any nonnegative function of $X_i$ does not affect consistency of least squares.11 As a result, even if the logit model for the propensity score is misspecified, the binary response MLE $\hat{\gamma}$ still has a well-defined probability limit, say $\gamma^*$, and the IPW estimator that uses weights $1/p(X_i; \hat{\gamma})$ for treated observations and $1/(1 - p(X_i; \hat{\gamma}))$ for control observations is asymptotically equivalent to the estimator that uses weights based on $\gamma^*$.12 It does not matter that for some $x$, $e(x) \neq p(x; \gamma^*)$. This is the first part of the double robustness result: if the parametric conditional means for $E[Y(w) \mid X = x]$ are correctly specified, the model for the propensity score can be arbitrarily misspecified for the true propensity score. Equation (20) still leads to a consistent estimator for $\tau_{PATE}$.

When the conditional means are correctly specified, weighting will generally hurt in terms of asymptotic efficiency. The optimal weight is the inverse of the variance, and in general there is no reason to expect that weighting by the inverse of (one minus) the propensity score gives a good approximation to that. Specifically, under homoskedasticity of $Y_i(w)$, so that $\sigma_w^2 = \sigma_w^2(x)$, in the context of least squares the IPW estimator of $(\alpha_w, \beta_w)$ is less efficient than the unweighted estimator; see Wooldridge (2007). The motivation for propensity score weighting is different: it offers a robustness advantage for estimating $\tau_{PATE}$.

The second part of the double robustness result assumes that the logit model (or an alternative binary response model) is correctly specified for the propensity score, so that $e(x) = p(x; \gamma^*)$, but allows the conditional mean functions to be misspecified.

11 More generally, it does not affect the consistency of any quasi-likelihood method that is robust for estimating the parameters of the conditional mean. These are likelihoods in the linear exponential family, as described in C. Gourieroux, A. Monfort, and A. Trognon (1984a, 1984b).
12 See Wooldridge (2007).
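The two-step procedure in this section (a logit propensity score, then least squares weighted by the inverse probability of the treatment actually received) can be sketched as follows. This is a minimal illustration under assumptions, not the authors' implementation: the function names `fit_logit` and `weighted_regression_ate` are hypothetical, the logit is fit by a plain Newton-Raphson loop, and no standard errors are computed.

```python
import numpy as np


def fit_logit(X, w, iters=25):
    """Logit MLE by Newton-Raphson (illustrative helper; intercept first)."""
    Z = np.column_stack([np.ones(len(w)), X])
    gamma = np.zeros(Z.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-Z @ gamma))
        grad = Z.T @ (w - p)                         # score of the log-likelihood
        hess = (Z * (p * (1.0 - p))[:, None]).T @ Z  # negative Hessian
        gamma = gamma + np.linalg.solve(hess, grad)
    return gamma


def weighted_regression_ate(y, w, X):
    """Difference of intercepts from two inverse-probability-weighted regressions."""
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=int)
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:
        X = X[:, None]
    # Step 1: estimated propensity scores e_hat(X_i) = p(X_i; gamma_hat).
    Z = np.column_stack([np.ones(len(w)), X])
    e_hat = 1.0 / (1.0 + np.exp(-Z @ fit_logit(X, w)))
    # Step 2: weighted least squares on covariates centered at the sample mean,
    # with weight 1/(1 - e_hat) for controls and 1/e_hat for treated units.
    D = np.column_stack([np.ones(len(y)), X - X.mean(axis=0)])
    alpha = {}
    for arm, weights in ((0, 1.0 / (1.0 - e_hat)), (1, 1.0 / e_hat)):
        m = w == arm
        root = np.sqrt(weights[m])
        coef, *_ = np.linalg.lstsq(D[m] * root[:, None], y[m] * root, rcond=None)
        alpha[arm] = coef[0]  # intercept = fitted mean at the covariate average
    return alpha[1] - alpha[0]
```

Because the covariates are centered at the full-sample average, each intercept is the fitted conditional mean evaluated at that average, so the difference of intercepts plays the role of the estimator built from the two weighted problems. Estimated propensity scores very close to zero or one would produce extreme weights; the sketch does not guard against that.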
40 Journal of Economic Literature, Vol. XLVII (March 2009)
The weights imply that $E[(1 - W_i)\lambda_i Y_i] = E[Y_i(0)]$ and $E[(1 - W_i)\lambda_i (X_i - \bar{X})] = E[X_i - \bar{X}] = 0$, and as a result $\hat{\alpha}_0 \to E[Y_i(0)]$. Similarly, the average of the predicted values for $Y_i(1)$ converges to $E[Y_i(1)]$, and so the resulting estimator $\hat{\tau}_{ipw} = \hat{\alpha}_1 - \hat{\alpha}_0$ is consistent for $\tau_{PATE}$ and $\tau_{CATE}$ irrespective of the shape of the regression functions. This is the second part of the double robustness result, at least for linear regression.

For certain kinds of responses, including binary responses, fractional responses, and count responses, linearity of $E[Y_i(w) \mid X_i = x]$ is a poor assumption. Using linear conditional expectations for limited dependent variables effectively abdicates the first part of the double robustness result. Instead, we should use coherent models of the conditional means, as well as a sensible model for the propensity score, with the hope that the mean functions, propensity score, or both are correctly specified. Beyond specifying logically coherent models for $E[Y_i(w) \mid X_i = x]$ so that the first part of double robustness has a chance, for the second part we need to choose functional forms and estimators with the following property: even when the mean functions are misspecified, $E[Y_i(w)] = E[\mu(X_i, \delta_w^*)]$, where $\delta_w^*$ is the probability limit of $\hat{\delta}_w$. Fortunately, for the common kinds of limited dependent variables used in applications, such functional forms and estimators exist; see Wooldridge (2007) for further discussion.

and similar for $V_1$. In general, we may again want to adjust for the estimation of the parameters in $\gamma$. See Wooldridge (2007) for details.

Although combining weighting and regression is more attractive than either weighting or regression on their own, it still requires at least one of the two specifications to be accurate globally. It has been used regularly in the epidemiology literature, partly through the efforts of Robins and his coauthors, but has not been widely used in the economics literature.

5.7 Subclassification and Regression

We can also combine subclassification with regression. The advantage relative to weighting and regression is that we do not use global approximations to the regression function. The idea is that within stratum $j$, we estimate the average treatment effect by regressing the outcome on a constant, an indicator for the treatment, and the covariates, instead of simply taking the difference in averages by treatment status as in section 5.4. The latter can be viewed as a regression estimate based on a regression with only an intercept and the treatment indicator. The further regression adjustment simply adds (some of) the covariates to that regression. The key difference with using regression in the full sample is that, within a stratum, the propensity score varies relatively little. As a
result, the covariate distributions are similar, and the regression function is not used to extrapolate far out of sample.

To be precise, we estimate on the observations with $B_{ij} = 1$, the regression function

$$Y_i = \alpha_j + \tau_j \cdot W_i + \beta_j' X_i + \varepsilon_i,$$

ages the difference. This estimator may still be biased due to discrepancies between the covariates of the matched observations and their matches. One can attempt to reduce this bias by using regression methods. This use of regression is very different from using regression methods on the full sample. Here the covariate distributions are likely to be similar in the matched sample, and so regression is not used to extrapolate far out of sample. The idea behind the regression adjustment is to replace $\hat{Y}_i(0)$ and $\hat{Y}_i(1)$ by

$$\hat{Y}_i(0) = \begin{cases} Y_i & \text{if } W_i = 0, \\ \dfrac{1}{M} \sum_{j \in \mathcal{J}_M(i)} \bigl(Y_j + \hat{\beta}_0'(X_i - X_j)\bigr) & \text{if } W_i = 1, \end{cases}$$

$$\hat{Y}_i(1) = \begin{cases} \dfrac{1}{M} \sum_{j \in \mathcal{J}_M(i)} \bigl(Y_j + \hat{\beta}_1'(X_i - X_j)\bigr) & \text{if } W_i = 0, \\ Y_i & \text{if } W_i = 1, \end{cases}$$

and let $\hat{\beta}_w$ be based on a regression of $\hat{Y}_i(w)$ on a constant and $\hat{X}_i(w)$:

$$\begin{pmatrix} \hat{\alpha}_w \\ \hat{\beta}_w \end{pmatrix} = \left( \sum_{i=1}^{N} \begin{pmatrix} 1 & \hat{X}_i(w)' \\ \hat{X}_i(w) & \hat{X}_i(w) \hat{X}_i(w)' \end{pmatrix} \right)^{-1} \left( \sum_{i=1}^{N} \begin{pmatrix} \hat{Y}_i(w) \\ \hat{X}_i(w) \hat{Y}_i(w) \end{pmatrix} \right).$$
$$\hat{\sigma}^2_{W_i}(X_i) = (Y_i - Y_{\nu(i)})^2 / 2.$$

This way we can estimate $\sigma^2_{W_i}(X_i)$ for all units. Note that these are not consistent estimators of the conditional variances. As the sample size increases, the bias of these estimators will disappear, just as we saw that the bias of the matching estimator for the average treatment effect disappears under similar conditions.

We then use these estimates of the conditional variance to estimate the variance of the estimator:

$$\hat{V}(\hat{\tau}) = \sum_{i=1}^{N} \lambda_i^2 \cdot \hat{\sigma}^2_{W_i}(X_i).$$

An extension to allow for clustering has been developed by Samuel Hanson and Adi Sunderam (2008).

5.10 Overlap in Covariate Distributions

In practice, a major concern in applying methods under the assumption of unconfoundedness is lack of overlap in the covariate distributions. In fact, once one is committed to the unconfoundedness assumption, this may well be the main problem facing the analyst. The overlap issue was highlighted in papers by Dehejia and Wahba (1999) and Heckman, Ichimura, and Todd (1998). Dehejia and Wahba reanalyzed data on a job training program originally analyzed by LaLonde (1986). LaLonde (1986) had attempted to replicate results from an experimental evaluation of a job training program, the National Supported Work (NSW) program, using a comparison group constructed from two public use data sets, the Panel Study of Income Dynamics (PSID) and the Current Population Survey (CPS). The NSW program targeted individuals who were disadvantaged with very poor labor market histories. As a result, they were very different from the raw comparison groups constructed by LaLonde from the CPS and PSID. LaLonde partially addressed this problem by limiting his raw comparison samples based on single covariate criteria (e.g., limiting it to individuals with zero earnings in the year prior to the program). Dehejia and Wahba looked at this problem more systematically and found that a major concern is the lack of overlap in the covariate distributions.

Traditionally, overlap in the covariate distributions was assessed by looking at summary statistics of the covariate distributions by treatment status. As discussed before in the introduction to section 5, it is particularly useful to report differences in average covariates normalized by the square root of the sum of the within-treatment group variances. In table 2, we report, for the LaLonde data, averages and standard deviations of the basic covariates, and the normalized difference. For four out of the ten covariates the means are more than a standard deviation apart. This immediately suggests that the technical task of adjusting for differences in the covariates is a challenging one. Although reporting normalized differences in covariates by treatment status is a sensible starting point, inspecting differences one covariate at a time is not generally sufficient. Even if all these differences are small, there may still be areas with limited overlap. Formally, we are concerned with regions in the covariate space where the density of covariates in one treatment group is zero and the density in the other treatment group is not. This corresponds to the propensity score being equal to zero or one. Therefore, a more direct way of assessing the overlap in covariate distributions is to inspect histograms of the estimated propensity score by treatment status.

Once it has been established that overlap is a concern, several strategies can be used. We briefly discuss two of the earlier specific suggestions, and then describe in more detail two general methods. In practice, researchers have often simply dropped observations with propensity score close to zero or one, with the actual cutoff value chosen in an ad hoc fashion. Dehejia and Wahba (1999) focus on the average effect for the treated. After estimating the propensity score, they find the smallest
Table 2
Balance Improvements in the LaLonde Data (Dehejia–Wahba Sample)
value of the estimated propensity score among the treated units, $e_1 = \min_{i: W_i = 1} \hat{e}(X_i)$. They then drop all control units with an estimated propensity score lower than this threshold $e_1$. The idea behind this suggestion is that control units with very low values for the propensity score may be so different from treated units that including them in the analysis is likely to be counterproductive. (In effect, the population over which the treatment effects are calculated is redefined.) A concern is that the results may be sensitive to the choice of specific threshold $e_1$. If, for example, one used as the threshold the $K$-th order statistic of the estimated propensity score among the treated (Lechner 2002a, 2002b), the results might change considerably. In the sixth column of table 2, we report the normalized difference (normalized using the same denominator equal to the square root of the sum of the within treatment group sample variances) after removing 9,891 (out of a total 16,177) control observations whose estimated propensity score was smaller than the smallest value of the estimated propensity score among the treated, $e_1 = 0.00051$. This improves the covariate balance, but many of the normalized differences are still substantial.

Heckman, Ichimura, and Todd (1997) and Heckman et al. (1998) develop a different method. They focus on estimation of the set where the density of the propensity score conditional on the treatment is bounded away from zero for both treatment regimes. Specifically, they first estimate the density functions $f(e \mid W = w)$, for $w = 0, 1$, nonparametrically. They then evaluate the estimated density $\hat{f}(\hat{e}(X_i) \mid W_i = 0)$ for all $N$ values $X_i$, and the same for the estimated density $\hat{f}(\hat{e}(X_i) \mid W_i = 1)$ for all $N$ values $X_i$. Given these $2N$ values they calculate the $2N \cdot q$ order statistic of these $2N$ estimated densities. Denote this order statistic by $\hat{f}_q$. Then, for each unit $i$, they compare the estimated density $\hat{f}(\hat{e}(X_i) \mid W_i = 0)$ to $\hat{f}_q$, and $\hat{f}(\hat{e}(X_i) \mid W_i = 1)$ to $\hat{f}_q$. If either of those estimated densities is below the order statistic, the observation gets dropped from the analysis. Smith and Todd (2005) implement this method with $q = 0.02$, but provide no motivation for the choice of the threshold.
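The Dehejia and Wahba rule and the normalized difference used to judge balance are simple enough to sketch in a few lines. The function names `dw_trim` and `normalized_difference` are hypothetical, the propensity scores are assumed to have been estimated already, and the sample variances use the usual degrees-of-freedom correction, which is an assumption of this sketch.

```python
import numpy as np


def dw_trim(e_hat, w):
    """Keep all treated units and the controls with e_hat >= min treated e_hat."""
    e_hat = np.asarray(e_hat, dtype=float)
    w = np.asarray(w, dtype=int)
    e1 = e_hat[w == 1].min()         # smallest estimated score among the treated
    keep = (w == 1) | (e_hat >= e1)  # controls below e1 are dropped
    return keep, e1


def normalized_difference(x, w):
    """Difference in means over the root of the sum of within-group variances."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=int)
    x1, x0 = x[w == 1], x[w == 0]
    return (x1.mean() - x0.mean()) / np.sqrt(x1.var(ddof=1) + x0.var(ddof=1))
```

Applying `normalized_difference` to each covariate before and after subsetting with the `keep` mask gives a before/after balance comparison of the kind described for table 2.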
covariate space to minimize the asymptotic variance of the efficient estimator of the average treatment effect for that set. Under some conditions (in particular homoskedasticity), they show that the optimal set $\mathbb{A}^*$ depends only on the value of the propensity score. This method suggests discarding observations with a propensity score less than $\alpha$ away from the two extremes, zero and one:

$$\mathbb{A}^* = \{x \in \mathbb{X} \mid \alpha \leq e(x) \leq 1 - \alpha\},$$

where $\alpha$ satisfies a condition based on the marginal distribution of the propensity score:

$$\frac{1}{\alpha \cdot (1 - \alpha)} = 2 \cdot E\left[ \frac{1}{e(X) \cdot (1 - e(X))} \,\middle|\, \frac{1}{e(X) \cdot (1 - e(X))} < \frac{1}{\alpha \cdot (1 - \alpha)} \right].$$

Based on empirical examples and numerical calculations with beta distributions for the propensity score, Crump et al. (2009) suggest that the rule-of-thumb fixing $\alpha$ at 0.10 gives good results.

To illustrate this method, table 3 presents summary statistics for data from Imbens, Rubin and Sacerdote (2001) on lottery players, including "winners" who won big prizes, and "losers" who did not. Even though winning the lottery is obviously random, variation in the number of tickets bought, and nonresponse, creates imbalances in the covariate distributions. In the full sample (sample size $N = 496$), some of the covariates differ by as much as 0.64 standard deviations. Following the Crump et al. calculations leads to a bound of 0.0914. Discarding the observations with an estimated propensity score outside the interval [0.0914, 0.9086] leads to a sample size of 388. In this subsample, the largest normalized difference is 0.35, about half of what it is in the full sample, with this improvement obtained by dropping approximately 20 percent of the original sample.

A potentially controversial feature of all these methods is that they change what is being estimated. Instead of estimating $\tau_{PATE}$, the Crump et al. (2009) approach estimates $\tau_{CATE,\mathbb{A}}$. This results in reduced external validity, but it is likely to improve internal validity.

5.11 Assessing the Unconfoundedness Assumption

The unconfoundedness assumption used in section 5 is not testable. It states that the conditional distribution of the outcome under the control given receipt of the active treatment and covariates, is identical to the distribution of the control outcome conditional on being in the control and covariates. A similar assumption is made for the distribution of the treatment outcome. Yet since the data are completely uninformative about the distribution of $Y_i(0)$ for those who received the active treatment and of $Y_i(1)$ for those receiving the control, the data can never reject the unconfoundedness assumption. Nevertheless, there are often indirect ways of assessing this assumption. The most important of these were developed in Rosenbaum (1987) and Heckman and Hotz (1989). Both methods rely on testing the null hypothesis that an average causal effect is zero, where the particular average causal effect is known to equal zero. If the testing procedure rejects the null hypothesis, this is interpreted as weakening the support for the unconfoundedness assumption. These tests can be divided into two groups.

The first set of tests focuses on estimating the causal effect of a treatment that is known not to have an effect. It relies on the presence of two or more control groups (Rosenbaum 1987). Suppose one has two potential control groups, for example eligible nonparticipants and ineligibles, as in Heckman, Ichimura and
Table 3
Balance Improvements in the Lottery Data
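The cutoff used in the Crump et al. (2009) trimming rule behind table 3 solves a fixed-point condition on the distribution of the propensity score, and its sample analogue can be sketched as below. The function name `crump_alpha` is hypothetical, and the sketch takes the first crossing of the sample condition; it is an illustration under those assumptions, not the authors' code.

```python
import numpy as np


def crump_alpha(e_hat):
    """Sample analogue of 1/(a(1-a)) = 2 E[g | g < 1/(a(1-a))], g = 1/(e(1-e))."""
    e_hat = np.asarray(e_hat, dtype=float)
    g = 1.0 / (e_hat * (1.0 - e_hat))
    # If even the largest g is within twice the overall mean, nothing is trimmed.
    if g.max() <= 2.0 * g.mean():
        return 0.0
    gs = np.sort(g)
    running_mean = np.cumsum(gs) / np.arange(1, gs.size + 1)
    # First sorted value that exceeds twice the mean of all values up to it.
    k = int(np.argmax(gs > 2.0 * running_mean))
    # Between gs[k-1] and gs[k] the conditional mean is constant, so the
    # fixed point kappa = 2 * mean(g | g <= kappa) is available in closed form.
    kappa = 2.0 * running_mean[k - 1]
    # Invert kappa = 1 / (alpha * (1 - alpha)) for the root below one half.
    return 0.5 - np.sqrt(0.25 - 1.0 / kappa)
```

Units with an estimated propensity score outside `[alpha, 1 - alpha]` would then be discarded before estimation.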
and this is not testable. Instead we focus on testing an implication of the stronger conditional independence relation

$$\bigl(Y_i(0), Y_i(1)\bigr) \perp\!\!\!\perp G_i \mid X_i. \tag{24}$$

This independence condition implies (23), but in contrast to that assumption, it also implies testable restrictions. In particular, we focus on the implication that

$$Y_i \perp\!\!\!\perp G_i \mid X_i, \quad G_i \in \{-1, 0\}, \tag{25}$$

because $G_i \in \{-1, 0\}$ implies that $Y_i = Y_i(0)$.

Because condition (24) is slightly stronger than unconfoundedness, the question is whether there are interesting settings where the weaker condition of unconfoundedness holds, but not the stronger condition. To discuss this question, it is useful to consider two alternative conditional independence conditions, both of which are implied by (24):

$$\bigl(Y_i(0), Y_i(1)\bigr) \perp\!\!\!\perp W_i \mid X_i, \quad G_i \in \{-1, 1\}, \tag{26}$$

and

$$\bigl(Y_i(0), Y_i(1)\bigr) \perp\!\!\!\perp W_i \mid X_i, \quad G_i \in \{0, 1\}. \tag{27}$$

If (26) holds, then we can estimate the average causal effect by invoking the unconfoundedness assumption using only the first control group. Similarly, if (27) holds, then we can estimate the average causal effect by invoking the unconfoundedness assumption using only the second control group. The point is that it is difficult to envision a situation where unconfoundedness based on the two comparison groups holds (that is, (23) holds) but it does not hold using only one of the two comparison groups at the time. In practice, it seems likely that if unconfoundedness holds then so would the stronger condition (24), and we have the testable implication (25).

Next, we turn to implementation of the tests. We can simply test whether there is a difference in average values of $Y_i$ between the two control groups, after adjusting for differences in $X_i$. That is, we effectively test whether

$$E\bigl[ E[Y_i \mid G_i = -1, X_i] - E[Y_i \mid G_i = 0, X_i] \bigr] = 0.$$

More generally we may wish to test

$$E[Y_i \mid G_i = -1, X_i = x] - E[Y_i \mid G_i = 0, X_i = x] = 0$$

for all $x$ in the support of $X_i$ using the methods discussed in Crump et al. (2008b). We can also include transformations of the basic outcomes in the procedure to test for difference in other aspects of the conditional distributions.

A second set of tests of unconfoundedness focuses on estimating the causal effect of the treatment on a variable known to be unaffected by it, typically because its value is determined prior to the treatment itself. Such a variable can be time-invariant, but the most interesting case is in considering the treatment effect on a lagged outcome. If it is not zero, this implies that the treated observations are distinct from the controls; namely that the distribution of $Y_i(0)$ for the treated units is not comparable to the distribution of $Y_i(0)$ for the controls. If the treatment effect is instead zero, it is more plausible that the unconfoundedness assumption holds. Of course this does not directly test the unconfoundedness assumption; in this setting, being able to reject the null of no effect does not directly reflect on the hypothesis of interest, unconfoundedness. Nevertheless, if the variables used in this proxy test are closely related to the outcome of interest, the test arguably has more power. For these tests it is clearly helpful to have a number of lagged outcomes.
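The second kind of check, estimating the "effect" of the treatment on a lagged outcome that it cannot have affected, can be sketched with a simple regression adjustment. The function name `lagged_outcome_check` is hypothetical, and the linear specification with a conventional homoskedastic standard error is an assumption made for brevity; any of the estimators valid under unconfoundedness could be used in its place.

```python
import numpy as np


def lagged_outcome_check(y_lag, w, X):
    """OLS of a pre-treatment outcome on a constant, treatment, and covariates.

    Returns the coefficient on the treatment indicator and its conventional
    (homoskedastic) standard error. An estimate far from zero relative to the
    standard error casts doubt on unconfoundedness.
    """
    y_lag = np.asarray(y_lag, dtype=float)
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:
        X = X[:, None]
    D = np.column_stack([np.ones(len(y_lag)), np.asarray(w, dtype=float), X])
    coef, *_ = np.linalg.lstsq(D, y_lag, rcond=None)
    resid = y_lag - D @ coef
    dof = D.shape[0] - D.shape[1]
    sigma2 = resid @ resid / dof
    cov = sigma2 * np.linalg.inv(D.T @ D)
    return coef[1], np.sqrt(cov[1, 1])
```

With several lagged outcomes available, the same check can be repeated for each one, using the earlier lags as additional covariates.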
the groups adjusted for differences in $X_{ir}$ are zero, or test whether the average difference is zero for all values of the covariates (e.g., Crump et al. 2008).

5.12 Testing

Most of the focus in the evaluation literature has been on estimating average treatment effects. Testing has largely been limited to the null hypothesis that the average effect is zero. In that case testing is straightforward since many estimators exist for the average treatment effect that are approximately normally distributed in large samples with zero asymptotic bias. In addition there is some testing based on the Fisher approach using the randomization distribution. In many cases, however, there are other null hypotheses of interest. Crump et al. (2008) develop tests of the null hypotheses of zero average effects conditional on the covariates, and of a constant average effect conditional on the covariates. Formally, in the first case the null hypothesis

if on average it does not affect outcomes.13 They show that in some data sets they reject the null hypothesis (30) even though they cannot reject the null hypothesis of a zero average effect.

Taking the motivation in Crump et al. (2008) one step further, one may also be interested in testing the null hypothesis that the conditional distribution of $Y_i(0)$ given $X_i = x$ is the same as the conditional distribution of $Y_i(1)$ given $X_i = x$. Under the maintained hypothesis of unconfoundedness, this is equivalent to testing the null hypothesis that

$$H_0: Y_i \perp\!\!\!\perp W_i \mid X_i,$$

against the alternative hypothesis that $Y_i$ is not independent of $W_i$ given $X_i$. Tests of this type can be implemented using the methods of Linton and Pedro Gozalo (2003). There have been no applications of these tests in the program evaluation literature.

5.13 Selection of Covariates
parameters should change with the sample size. For example, using regression estimators, one would have to choose the bandwidth if using kernel estimators, or the number of terms in the series if using series estimators. The program evaluation literature does not provide much guidance as to how to choose these smoothing parameters in practice. More generally, the nonparametric estimation literature has little to offer in this regard. Most of the results in this literature offer optimal choices for smoothing parameters if the criterion is integrated squared error. In the current setting the interest is in a scalar parameter, and the choice of smoothing parameter that is optimal for the regression function itself need not be close to optimal for the average treatment effect.

Hirano and Imbens (2001) consider an estimator that combines weighting with the propensity score and regression. In their application they have a large number of covariates, and they suggest deciding which ones to include on the basis of t-statistics. They find that the results are fairly insensitive to the actual cutoff point if they use the weight/regression estimator, but find more sensitivity if they only use weighting or regression. They do not provide formal properties for these choices.

Ichimura and Linton (2005) consider inverse probability weighting estimators and analyze the formal problem of bandwidth selection with the focus on the average treatment effect. Imbens, Newey and Ridder (2005) look at series regression estimators and analyze the choice of the number of terms to be included, again with the objective being the average treatment effect. Imbens and Rubin (forthcoming) discuss some stepwise covariate selection methods for finding a specification for the propensity score.

It is clear that more work needs to be done in this area, both for the case where the choice is which covariates to include from a large set of potential covariates, and in the case where the choice concerns functional form and functions of a small set of covariates.

6. Selection on Unobservables

In this section we discuss a number of methods that relax the pair of assumptions made in section 5. Unlike in the setting under unconfoundedness, there is not a unified set of methods for this case. In a number of special cases there are well-understood methods, but there are many cases without clear recommendations. We will highlight some of the controversies and different approaches. First we discuss some methods that simply drop the unconfoundedness assumption. Next, in section 6.2, we discuss sensitivity analyses that relax the unconfoundedness assumption in a more limited manner. In section 6.3, we discuss instrumental variables methods. Then, in section 6.4 we discuss regression discontinuity designs, and in section 6.5 we discuss difference-in-differences methods.

6.1 Bounds

In a series of papers and books, Manski (1990, 1995, 2003, 2005, 2007) has developed a general framework for inference in settings where the parameters of interest are not identified. Manski's key insight is that even if in large samples one cannot infer the exact value of the parameter, one may be able to rule out some values that one could not rule out a priori. Prior to Manski's work, researchers had typically dismissed models that are not point-identified as not useful in practice. This framework is not restricted to causal settings, and the reader is referred to Manski (2007) for a general discussion of the approach. Here we limit the discussion to program evaluation settings.

We start by discussing Manski's perspective in a very simple case. Suppose we have no covariates and a binary outcome $Y_i \in \{0, 1\}$. Let the goal be inference for the
average effect in the population, $\tau_{PATE}$. We can decompose the population average treatment effect as

$$\begin{aligned} \tau_{PATE} = {} & E[Y_i(1) \mid W_i = 1] \cdot \mathrm{pr}(W_i = 1) + E[Y_i(1) \mid W_i = 0] \cdot \mathrm{pr}(W_i = 0) \\ & - \bigl( E[Y_i(0) \mid W_i = 1] \cdot \mathrm{pr}(W_i = 1) + E[Y_i(0) \mid W_i = 0] \cdot \mathrm{pr}(W_i = 0) \bigr). \end{aligned}$$

Of the eight components of this expression, we can estimate six. The data contain no information about the remaining two, $E[Y_i(1) \mid W_i = 0]$ and $E[Y_i(0) \mid W_i = 1]$. Because the outcome is binary, and before seeing any data, we can deduce that these two conditional expectations must lie inside the interval [0, 1], but we cannot say any more without additional assumptions. This implies that without additional assumptions we can be sure that

$$\tau_{PATE} \in [\tau_l, \tau_u],$$

where we can express the lower and upper bound in terms of estimable quantities,

$$\tau_l = E[Y_i(1) \mid W_i = 1] \cdot \mathrm{pr}(W_i = 1) - \mathrm{pr}(W_i = 1) - E[Y_i(0) \mid W_i = 0] \cdot \mathrm{pr}(W_i = 0),$$

and

$$\tau_u = E[Y_i(1) \mid W_i = 1] \cdot \mathrm{pr}(W_i = 1) + \mathrm{pr}(W_i = 0) - E[Y_i(0) \mid W_i = 0] \cdot \mathrm{pr}(W_i = 0).$$

In other words, we can bound the average treatment effect. In this example the bounds are tight, meaning that without additional assumptions we cannot rule out any value inside the bounds. See Manski et al. (1992) for an empirical example of these particular bounds.

In this specific case the bounds are not particularly informative. The width of the bounds, the difference $\tau_u - \tau_l$, with $\tau_l$ and $\tau_u$ given above, is always equal to one, implying we can never rule out a zero average treatment effect. (In some sense this is obvious: if we refrain from making any assumptions regarding the treatment effects we cannot rule out that the treatment effect is zero for any unit.) In general, however, we can add some assumptions, short of making the type of assumption as strong as unconfoundedness that gets us back to the point-identified case. With such weaker assumptions we may be able to tighten the bounds and obtain informative results, without making the strong assumptions that strain credibility. The presence of covariates increases the scope for additional assumptions that may tighten the bounds. Examples of such assumptions include those in the spirit of instrumental variables, where some covariates are known not to affect the potential outcomes (e.g., Manski 2007), or monotonicity assumptions where expected outcomes are monotonically related to covariates or treatments (e.g., Manski and John V. Pepper 2000). For an application of these methods, see Hotz, Charles H. Mullin, and Seth G. Sanders (1997). We return to some of these settings in section 6.3.

This discussion has focused on identification and demonstrated what can be learned in large samples. In practice these bounds need to be estimated, which leads to additional uncertainty regarding the estimands. A fast developing literature (e.g., Horowitz and Manski 2000; Imbens and Manski 2004; Chernozhukov, Hong, and Elie Tamer 2007; Arie Beresteanu and Francesca Molinari 2006; Romano and Azeem M. Shaikh 2006a, 2006b; Ariel Pakes et al. 2006; Adam M. Rosen 2006; Donald W. K. Andrews and
Gustavo Soares 2007; Ivan A. Canay 2007; and Jörg Stoye 2007) discusses construction of confidence intervals in general settings with partial identification. One point of contention in this literature has been whether the focus should be on confidence intervals for the parameter of interest (τ_PATE in this case), or for the identified set. Imbens and Manski (2004) develop confidence sets for the parameter. In large samples, and at a 95 percent confidence level, the Imbens–Manski confidence intervals amount to taking the lower bound minus 1.645 times the standard error of the lower bound and the upper bound plus 1.645 times its standard error. The reason for using 1.645 rather than 1.96 is to take account of the fact that, even in the limit, the width of the confidence set will not shrink to zero, and therefore one only needs to be concerned with one-sided errors. Chernozhukov, Hong, and Tamer (2007) focus on confidence sets that include the entire partially identified set itself with fixed probability. For a given confidence level, the latter approach generally leads to larger confidence sets than the Imbens–Manski approach. See also Romano and Shaikh (2006a, 2006b) for subsampling approaches to inference in these settings.

6.2 Sensitivity Analysis

Unconfoundedness has traditionally been seen as an all or nothing assumption: either it is satisfied and one proceeds accordingly using the methods appropriate under unconfoundedness, such as matching, or the assumption is deemed implausible and one considers alternative methods. The latter include the bounds approach discussed in section 6.1, as well as approaches relying on alternative assumptions, such as instrumental variables, which will be discussed in section 6.3. However, there is an important alternative that has received much less attention in the economics literature. Instead of completely relaxing the unconfoundedness assumption, the idea is to relax it slightly. More specifically, violations of unconfoundedness are interpreted as evidence of the presence of unobserved covariates that are correlated both with the potential outcomes and with the treatment indicator. The size of the bias these violations of unconfoundedness can induce depends on the strength of these correlations. Sensitivity analyses investigate whether results obtained under the maintained assumption of unconfoundedness can be changed substantially, or even overturned entirely, by modest violations of the unconfoundedness assumption.

To be specific, consider a job training program with voluntary enrollment. Suppose that we have monthly labor market histories for a two year period prior to the program. We may be concerned that individuals choosing to enroll in the program are more motivated to find a job than those that choose not to enroll in the program. This unobserved motivation may be related to subsequent earnings both in the presence and in the absence of training. Conditioning on the recent labor market histories of individuals may limit the bias associated with this unobserved motivation, but it need not eliminate it entirely. However, we may be willing to limit how highly correlated unobserved motivation is with the enrollment decision and the earnings outcomes in the two regimes, conditional on the labor market histories. For example, if we compare two individuals with the same labor market history for the last two years, e.g., not employed the last six months and working the eighteen months before, and both with one two-year old child, it may be reasonable to assume that these cannot differ radically in their unobserved motivation given that their recent labor market outcomes have been so similar. The sensitivity analyses developed by Rosenbaum and Rubin (1983a) formalize this idea and provide a
54 Journal of Economic Literature, Vol. XLVII (March 2009)
tool for making such assessments. Imbens (2003) applies this sensitivity analysis to data from labor market training programs.

The second approach is associated with work by Rosenbaum (1995). Similar to the Rosenbaum–Rubin approach, Rosenbaum’s method relies on an unobserved covariate that generates the deviations from unconfoundedness. The analysis differs in that sensitivity is measured using only the relation between the unobserved covariate and the treatment assignment, with the focus on the correlation required to overturn, or change substantially, p-values of statistical tests of no effect of the treatment.

6.2.1 The Rosenbaum–Rubin Approach to Sensitivity Analysis

The starting point is that unconfoundedness is satisfied only conditional on the observed covariates Xi and an unobserved scalar covariate Ui:

\[ Y_i(0),\, Y_i(1) \perp\!\!\!\perp W_i \mid X_i, U_i. \]

This set up in itself is not restrictive, although once parametric assumptions are made the assumption of a scalar unobserved covariate Ui is restrictive.

Now consider both the conditional distribution of the potential outcomes given observed and unobserved covariates and the conditional probability of assignment given observed and unobserved covariates. Rather than attempting to estimate both these conditional distributions, the idea behind the sensitivity analysis is to specify the form and the amount of dependence of these conditional distributions on the unobserved covariate, and estimate only the dependence on the observed covariate. Conditional on the specification of the first part, estimation of the latter is typically straightforward. The idea is then to vary the amount of dependence of the conditional distributions on the unobserved covariate and assess how much this changes the point estimate of the average treatment effect.

Typically the sensitivity analysis is done in fully parametric settings, although since the models can be arbitrarily flexible, this is not particularly restrictive. Following Rosenbaum and Rubin (1983b), we illustrate this approach in a setting with binary outcomes. See Imbens (2003) and Lee (2005b) for examples in economics. Rosenbaum and Rubin (1983a) fix the marginal distribution of the unobserved covariate to be binomial with p = pr(Ui = 1), and assume independence of Ui and Xi. They specify a logistic distribution for the treatment assignment:

\[ \mathrm{pr}(W_i = 1 \mid X_i = x, U_i = u) = \frac{\exp(\alpha_0 + \alpha_1' x + \alpha_2 \cdot u)}{1 + \exp(\alpha_0 + \alpha_1' x + \alpha_2 \cdot u)}. \]

They also specify logistic regression functions for the two potential outcomes:

\[ \mathrm{pr}(Y_i(w) = 1 \mid X_i = x, U_i = u) = \frac{\exp(\beta_{w0} + \beta_{w1}' x + \beta_{w2} \cdot u)}{1 + \exp(\beta_{w0} + \beta_{w1}' x + \beta_{w2} \cdot u)}. \]

For the subpopulation with Xi = x and Ui = u, the average treatment effect is

\[ E[Y_i(1) - Y_i(0) \mid X_i = x, U_i = u] = \frac{\exp(\beta_{10} + \beta_{11}' x + \beta_{12} \cdot u)}{1 + \exp(\beta_{10} + \beta_{11}' x + \beta_{12} \cdot u)} - \frac{\exp(\beta_{00} + \beta_{01}' x + \beta_{02} \cdot u)}{1 + \exp(\beta_{00} + \beta_{01}' x + \beta_{02} \cdot u)}. \]

The average treatment effect τ_CATE can be expressed in terms of the parameters of this model and the distribution of the observable covariates by averaging over Xi, and integrating out the unobserved covariate U:
\[ \tau_{CATE} \equiv \tau(p, \alpha_2, \beta_{02}, \beta_{12}, \alpha_0, \alpha_1, \beta_{00}, \beta_{01}, \beta_{10}, \beta_{11}) \]
\[ = \frac{1}{N} \sum_{i=1}^{N} \left\{ p \left( \frac{\exp(\beta_{10} + \beta_{11}' X_i + \beta_{12})}{1 + \exp(\beta_{10} + \beta_{11}' X_i + \beta_{12})} - \frac{\exp(\beta_{00} + \beta_{01}' X_i + \beta_{02})}{1 + \exp(\beta_{00} + \beta_{01}' X_i + \beta_{02})} \right) + (1 - p) \left( \frac{\exp(\beta_{10} + \beta_{11}' X_i)}{1 + \exp(\beta_{10} + \beta_{11}' X_i)} - \frac{\exp(\beta_{00} + \beta_{01}' X_i)}{1 + \exp(\beta_{00} + \beta_{01}' X_i)} \right) \right\}. \]

functional form assumptions, and so attempts to estimate θsens are therefore unlikely to be effective. Given θsens, however, estimating the remaining parameters is considerably easier.

In the second step the plan is therefore to fix the first set of parameters and estimate the others by maximum likelihood, and then translate this into an estimate for τ. Thus, for fixed θsens, we first estimate the remaining parameters through maximum likelihood:

\[ \hat{\theta}_{other}(\theta_{sens}) = \arg\max_{\theta_{other}} L(\theta_{other} \mid \theta_{sens}), \]
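As a rough illustration of the last step, translating parameter values into an estimate of τ, the sketch below evaluates the expression for τ_CATE by averaging the logistic outcome probabilities over the observed Xi and mixing over Ui ∈ {0, 1} with weight p. The function name `tau_cate`, the argument layout, and all numerical values are assumptions made for illustration. In an actual sensitivity analysis the remaining parameters would be re-estimated by maximum likelihood for each fixed value of the sensitivity parameters; here they are simply held at made-up values, and p, β02, and β12 play the role of the fixed sensitivity parameters.

```python
import numpy as np


def logistic(a):
    """Logistic function exp(a) / (1 + exp(a))."""
    return 1.0 / (1.0 + np.exp(-a))


def tau_cate(X, p, beta0, beta1):
    """Average effect implied by the two logistic outcome models.

    X     : (N, K) array of observed covariates
    p     : pr(U_i = 1) for the binary unobserved covariate
    beta0 : (b00, b01, b02), parameters for Y_i(0)
    beta1 : (b10, b11, b12), parameters for Y_i(1)
    b_w1 is a length-K vector; b_w0 and b_w2 are scalars.
    """
    b00, b01, b02 = beta0
    b10, b11, b12 = beta1
    # Effect for units with U_i = 1 and with U_i = 0, at each observed X_i
    effect_u1 = logistic(b10 + X @ b11 + b12) - logistic(b00 + X @ b01 + b02)
    effect_u0 = logistic(b10 + X @ b11) - logistic(b00 + X @ b01)
    # Integrate out U_i, then average over the sample of X_i
    return float(np.mean(p * effect_u1 + (1.0 - p) * effect_u0))


# Hypothetical covariates and parameter values (not estimates from any data)
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
b01 = np.array([0.5, -0.2])
b11 = np.array([0.4, 0.1])

# Vary the dependence of the outcomes on U_i and see how tau moves
for b_u in (0.0, 0.5, 1.0):
    tau = tau_cate(X, p=0.5, beta0=(-0.3, b01, b_u), beta1=(0.2, b11, b_u))
    print(f"beta_02 = beta_12 = {b_u:.1f}: tau = {tau:.4f}")
```

When β02 = β12 = 0 the unobserved covariate drops out of the outcome models, so the result no longer depends on p; that case corresponds to unconfoundedness given Xi alone.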
covariate that has partial correlations with treatment and potential outcomes as high as any of the observed covariates. For example, Imbens considers, in the labor market training example, what the effect would be of omitting unobserved motivation, if in fact motivation had as much explanatory power for future earnings and for treatment choice as did earnings in the year prior to the training program. A bounds analysis, in contrast, would implicitly allow unobserved motivation to completely determine both selection into the program and future earnings. Even though putting hard limits on the effect of motivation on earnings and treatment choice may be difficult, it may be reasonable to put some limits on it, and the Rosenbaum–Rubin sensitivity analysis provides a useful framework for doing so.

6.2.2 Rosenbaum’s Method for Sensitivity Analysis

Rosenbaum (1995) developed a slightly different approach. The advantage of his approach is that it requires fewer tuning parameters than the Rosenbaum–Rubin approach. Specifically, it only requires the researcher to consider the effect unobserved confounders may have on the probability of treatment assignment. Rosenbaum’s focus is on the effect the presence of unobserved covariates could have on the p-value for the test of no effect of the treatment based on the unconfoundedness assumption, in contrast to the Rosenbaum–Rubin focus on point estimates for average treatment effects. Consider two units i and j with the same value for the covariates, xi = xj. If the unconfoundedness assumption conditional on Xi holds, both units must have the same probability of assignment to the treatment, e(xi) = e(xj). Now suppose unconfoundedness only holds conditional on both Xi and a binary unobserved covariate Ui. In that case the assignment probabilities for these two units may differ. Rosenbaum suggests bounding the ratio of the odds e(xi)/(1 − e(xi)) and e(xj)/(1 − e(xj)):

\[ \frac{1}{\Gamma} \le \frac{e(x_i) \cdot (1 - e(x_j))}{(1 - e(x_i)) \cdot e(x_j)} \le \Gamma. \]

If Γ = 1, we are back in the setting with unconfoundedness. If we allow Γ = ∞, we are not restricting the association between the treatment indicator and the potential outcomes. Rosenbaum investigates how much the odds would have to be different in order to substantially change the p-value. Or, starting from the other side, he investigates for fixed values of Γ what the implication is on the p-value.

For example, suppose that a test of the null hypothesis of no effect has a p-value of 0.0001 under the assumption of unconfoundedness. If the data suggest it would take the presence of an unobserved covariate that changes the odds of participation by a factor of ten in order to increase that p-value to 0.05, then one would likely consider the result to be very robust. If instead a small change in the odds of participation, say with a value of Γ = 1.5, would be sufficient for a change of the p-value to 0.05, the study would be much less robust.

6.3 Instrumental Variables

In this section, we review the recent literature on instrumental variables. We focus on the part of the literature concerned with heterogeneous effects. In the current section, we limit the discussion to the case with a binary endogenous variable. The early literature focused on identification of the population average treatment effect and the average effect on the treated. Identification of these estimands ran into serious problems once researchers wished to allow for unrestricted heterogeneity in the effect of the treatment. In an important early result, Bloom (1984) showed that if eligibility for the program is used as an instrument, then one
can identify the average effect of the treatment for those who received the treatment. Key for the Bloom result is that the instrument changes the probability of receiving the treatment to zero. In order to identify the average effect on the overall population, the instrument would also need to shift the probability of receiving the treatment to one. This type of identification is sometimes referred to as identification at infinity (Gary Chamberlain 1986; Heckman 1990) in settings with a continuous instrument. The practical usefulness of such identification results is fairly limited outside of cases where eligibility is randomized. Finding a credible instrument is typically difficult enough, without also requiring that the instrument shifts the probability of the treatment close to zero and one. In fact, the focus of the current literature on instruments that can credibly be expected to satisfy exclusion restrictions makes it even more difficult to find instruments that even approximately satisfy these support conditions. Imbens and Angrist (1994) got around this problem by changing the focus to average effects for the subpopulation that is affected by the instrument.

Initially we focus on the case with a binary instrument. This case provides some of the clearest insight into the identification problems. In that case the identification at infinity arguments are obviously not satisfied and so one cannot (point-)identify the population average treatment effect.

6.3.1 A Binary Instrument

Imbens and Angrist adopt a potential outcome notation for the receipt of the treatment, as well as for the outcome itself. Let Zi denote the value of the instrument for individual i. Let Wi(0) and Wi(1) denote the level of the treatment received if the instrument takes on the values 0 and 1 respectively. As before, let Yi(0) and Yi(1) denote the potential values for the outcome of interest. The observed treatment is, analogously to the relation between the observed outcome Yi and the potential outcomes Yi(0) and Yi(1),

\[ W_i = W_i(0) \cdot (1 - Z_i) + W_i(1) \cdot Z_i = \begin{cases} W_i(0) & \text{if } Z_i = 0 \\ W_i(1) & \text{if } Z_i = 1. \end{cases} \]

Exogeneity of the instrument is captured by the assumption that all potential outcomes are independent of the instrument, or

\[ (Y_i(0), Y_i(1), W_i(0), W_i(1)) \perp\!\!\!\perp Z_i. \]

Formulating exogeneity in this way is attractive compared to conventional residual-based definitions, as it does not require the researcher to specify a regression function in order to define the residuals. This assumption captures two properties of the instrument. First, it captures random assignment of the instrument so that causal effects of the instrument on the outcome and treatment received can be estimated consistently. This part of the assumption, which is implied by explicit randomization of the instrument, as for example in the seminal draft lottery study by Angrist (1990), is not sufficient for causal interpretations of instrumental variables methods. The second part of the assumption captures an exclusion restriction that there is no direct effect of the instrument on the outcome. This second part is captured by the absence of z in the definition of the potential outcome Yi(w). This part of the assumption is not implied by randomization of the instrument and it has to be argued on a case by case basis. See Angrist, Imbens, and Rubin (1996) for more discussion on the distinction between these two assumptions, and for a formulation that separates them.

Imbens and Angrist introduce a new concept, the compliance type of an individual. The type of an individual describes the level of the treatment that an individual would receive given each value of the instrument.
In other words, it is captured by the pair of values (Wi(0), Wi(1)). With both the treatment and instrument binary, there are four types of responses for the potential treatment. It is useful to define the compliance types explicitly:

\[ T_i = \begin{cases} \text{never-taker} & \text{if } W_i(0) = W_i(1) = 0 \\ \text{complier} & \text{if } W_i(0) = 0,\ W_i(1) = 1 \\ \text{defier} & \text{if } W_i(0) = 1,\ W_i(1) = 0 \\ \text{always-taker} & \text{if } W_i(0) = W_i(1) = 1. \end{cases} \]

The labels never-taker, complier, defier, and always-taker (e.g., Angrist, Imbens, and Rubin 1996) refer to the setting of a randomized experiment with noncompliance, where the instrument is the (random) assignment to the treatment and the endogenous regressor is an indicator for the actual receipt of the treatment. Compliers are in that case individuals who (always) comply with their assignment, that is, take the treatment if assigned to it and not take it if assigned to the control group. One cannot infer from the observed data (Zi, Wi, Yi) whether a particular individual is a complier or not. It is important not to confuse compliers (who comply with their actual assignment and would have complied with the alternative assignment) with individuals who are observed to comply with their actual assignment: that is, individuals who complied with the assignment they actually received, Zi = Wi. For such individuals we do not know what they would have done had their assignment been different, that is, we do not know the value of Wi(1 − Zi).

Imbens and Angrist then invoke an additional assumption they refer to as monotonicity. Monotonicity requires that Wi(1) ≥ Wi(0) for all individuals, or that increasing the level of the instrument does not decrease the level of the treatment. This assumption is equivalent to ruling out the presence of defiers, and it is therefore sometimes referred to as the “no-defiance” assumption (Alexander Balke and Pearl 1994; Pearl 2000). Note that in the Bloom set up with one-sided noncompliance both always-takers and defiers are absent by assumption.

Under these two assumptions, independence of all four potential outcomes (Yi(0), Yi(1), Wi(0), Wi(1)) and the instrument Zi, and monotonicity, Imbens and Angrist show that one can identify the average effect of the treatment for the subpopulation of compliers. Before going through their argument, it is useful to see why we cannot generally identify the average effect of the treatment for other subpopulations. Clearly, one cannot identify the average effect of the treatment for never-takers because they are never observed receiving the treatment, and so E[Yi(1) | Ti = n] is not identified. Thus, only compliers are observed in both treatment groups, so only for this group is there any chance of identifying the average treatment effect. In order to understand the positive component of the Imbens–Angrist result, that we can identify the average effect for compliers, it is useful to consider the subpopulations defined by instrument and treatment. Table 4 shows the information we have about the individual’s type given the monotonicity assumption. Consider individuals with (Zi = 1, Wi = 0). Because of monotonicity such individuals can only be never-takers. Similarly, individuals with (Zi = 0, Wi = 1) can only be always-takers. However, consider individuals with (Zi = 0, Wi = 0). Such individuals can be either compliers or never-takers. We cannot infer the type of such individuals from the observed data alone. Similarly, individuals with (Zi = 1, Wi = 1) can be either compliers or always-takers.

The intuition for the identification result is as follows. The first step is to see that we can infer the population proportions of the three remaining subpopulations, never-takers, always-takers and compliers (using the fact that the monotonicity assumption rules out the presence of defiers). Call these
Table 4
Type by Observed Variables

            Zi = 0                  Zi = 1
Wi = 0      Never-taker/Complier    Never-taker
Wi = 1      Always-taker            Always-taker/Complier
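The content of Table 4, and the first step of the identification argument (recovering the population shares of the three types), can be checked in a small simulation. The sketch below is only an illustration under assumed values: the type shares, the outcome model, and all variable names are hypothetical. The final line uses the standard Imbens–Angrist ratio of the two intention-to-treat differences, which is not displayed in this passage, to recover the average effect for compliers.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Hypothetical type shares; defiers are excluded (monotonicity)
types = rng.choice(["n", "c", "a"], size=n, p=[0.3, 0.5, 0.2])
z = rng.integers(0, 2, size=n)              # randomized binary instrument

w0 = (types == "a").astype(int)             # W_i(0): only always-takers
w1 = np.isin(types, ["c", "a"]).astype(int) # W_i(1): compliers and always-takers
w = w0 * (1 - z) + w1 * z                   # observed treatment

# Hypothetical potential outcomes: effect 2.0 for compliers, 0.5 otherwise
y0 = rng.normal(size=n)
y1 = y0 + np.where(types == "c", 2.0, 0.5)
y = y0 * (1 - w) + y1 * w                   # observed outcome


def types_in_cell(z_value, w_value):
    """Set of compliance types present among units with (Z_i, W_i)."""
    cell = (z == z_value) & (w == w_value)
    return {str(t) for t in np.unique(types[cell])}


for z_value in (0, 1):
    for w_value in (0, 1):
        print((z_value, w_value), sorted(types_in_cell(z_value, w_value)))

# Shares of the three types, using only the observed (Z_i, W_i)
share_never = 1.0 - w[z == 1].mean()        # untreated despite Z_i = 1
share_always = w[z == 0].mean()             # treated despite Z_i = 0
share_complier = w[z == 1].mean() - w[z == 0].mean()

# Ratio of the effect of Z on Y to the effect of Z on W
late_hat = (y[z == 1].mean() - y[z == 0].mean()) / share_complier
print(share_never, share_always, share_complier, late_hat)
```

In this design the always-takers' effect of 0.5 cancels out of the numerator, so the ratio targets the compliers' effect rather than a population average.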
The only quantities not consistently estimable are the average effects for never-takers and always-takers. Even for those we have some information. For example, we can write E[Yi(1) − Yi(0) | Ti = n] = E[Yi(1) | Ti = n] − E[Yi(0) | Ti = n]. The second term we can estimate, and the data are completely uninformative about the first term. Hence, if there are natural bounds on Yi(1) (for example, if the outcome is binary), we can use that to bound E[Yi(1) | Ti = n], and then in turn use that to bound τ_PATE. These bounds are tight. See Manski (1990), Toru Kitagawa (2008), and Balke and Pearl (1994).

6.3.2 Multivalued Instruments and Weighted Local Average Treatment Effects

The previous discussion was in terms of a single binary instrument. In that case there is no other average effect of the treatment that can be estimated consistently other than the local average treatment effect, τ_LATE. With a multivalued instrument, or with multiple binary instruments (still maintaining the setting of a binary treatment; for extensions of the local average treatment effect concept to the multiple treatment case, see Angrist and Imbens (1995) and Card (2001)), we can estimate a variety of local average treatment effects. Let ℤ = {z1, …, zK} denote the set of values for the instruments. Initially we take the set of values to be finite. Then for each pair (zk, zl) with pr(Wi = 1 | Zi = zk) > pr(Wi = 1 | Zi = zl) one can define a local average treatment effect:

\[ \tau_{LATE}(z_k, z_l) = E[Y_i(1) - Y_i(0) \mid W_i(z_l) = 0,\ W_i(z_k) = 1]. \]

Imbens and Angrist show that the standard instrumental variables estimand, using g(Zi) as an instrument for Wi, is equal to a particular weighted average:

\[ \frac{E[Y_i \cdot (g(Z_i) - E[g(Z_i)])]}{E[W_i \cdot (g(Z_i) - E[g(Z_i)])]} = \tau_{LATE,\lambda}, \]

for a particular set of nonnegative weights as long as E[Wi | g(Zi) = g] increases in g.

Heckman and Vytlacil (2006) and Heckman, Sergio Urzua, and Vytlacil (2006) study the case with a continuous instrument. They use an additive latent single index setup where the treatment received is equal to

\[ W_i = 1\{h(Z_i) + V_i \ge 0\}, \]

where h(·) is strictly monotonic, and the latent type Vi is independent of Zi. In general, in the presence of multiple instruments, this latent single index framework imposes substantive restrictions.14 Without loss of generality we can take the marginal distribution of Vi to be uniform. Given this framework, Heckman, Urzua, and Vytlacil (2006) define the marginal treatment effect as a function of the latent type v of an individual,

\[ \tau_{MTE}(v) = E[Y_i(1) - Y_i(0) \mid V_i = v]. \]

14 See Vytlacil (2002) for a discussion in the case with binary instruments, where the latent index set up implies no loss of generality.

In the single continuous instrument case, τ_MTE(v) is, under some differentiability and invertibility conditions, equal to a limit of local average treatment effects:
\[ \tau_{MTE}(v) = \lim_{z \downarrow h^{-1}(v)} \tau_{LATE}(h^{-1}(v), z). \]

A parametric version of this concept goes back to work by Anders Björklund and Robert Moffitt (1987). All average treatment effects, including the overall average effect, the average effect for the treated, and any local average treatment effect can now be expressed in terms of integrals of this marginal treatment effect, as shown in Heckman and Vytlacil (2005). For example, τ_PATE = ∫_0^1 τ_MTE(v) dv. A complication in practice is that not necessarily all the marginal treatment effects can be estimated. For example, if the instrument is binary, Zi ∈ {0, 1}, then for individuals with Vi < min(−h(0), −h(1)), it follows that Wi = 0, and for these never-takers we cannot estimate τ_MTE(v). Any average effect that requires averaging over such values of v is therefore also not point-identified. Moreover, average effects that can be expressed as integrals of τ_MTE(v) may be identified even if some of the τ_MTE(v) that are being integrated over are not identified. Again, in a binary instrument example with pr(Wi = 1 | Zi = 1) = 1, and pr(Wi = 1 | Zi = 0) = 0, the average treatment effect τ_PATE is identified, but τ_MTE(v) is not identified for any value of v.

6.4 Regression Discontinuity Designs

Regression discontinuity (RD) methods have been around for a long time in the psychology and applied statistics literature, going back to the early 1960s. For discussions and references from this literature, see Donald L. Thistlethwaite and Campbell (1960), William M. K. Trochim (2001), Shadish, Cook, and Campbell (2002), and Cook (2008). Except for some important foundational work by Goldberger (1972a, 1972b), it is only recently that these methods have attracted much attention in the economics literature. For some of the recent applications, see van der Klaauw (2002, 2008a), Lee (2008), Angrist and Victor Lavy (1999), DiNardo and Lee (2004), Kenneth Y. Chay and Michael Greenstone (2005), Card, Alexandre Mas, and Jesse Rothstein (2007), Lee, Enrico Moretti, and Matthew J. Butler (2004), Jens Ludwig and Douglas L. Miller (2007), Patrick J. McEwan and Joseph S. Shapiro (2008), Sandra E. Black (1999), Susan Chen and van der Klaauw (2008), Ginger Zhe Jin and Phillip Leslie (2003), Thomas Lemieux and Kevin Milligan (2008), Per Pettersson-Lidbom (2007, 2008), and Pettersson-Lidbom and Björn Tyrefors (2007). Key theoretical and conceptual contributions include the interpretation of estimates for fuzzy regression discontinuity designs allowing for general heterogeneity of treatment effects (Hahn, Todd, and van der Klaauw 2001), adaptive estimation methods (Yixiao Sun 2005), methods for bandwidth selection tailored to the RD setting (Ludwig and Miller 2005; Imbens and Karthik Kalyanaraman 2008), and various tests for discontinuities in means and distributions of nonaffected variables (Lee 2008; McCrary 2008) and for misspecification (Lee and Card 2008). For recent reviews in the economics literature, see van der Klaauw (2008b), Imbens and Lemieux (2008), and Lee and Lemieux (2008).

The basic idea behind the RD design is that assignment to the treatment is determined, either completely or partly, by the value of a predictor (the forcing variable Xi) being on either side of a common threshold. This generates a discontinuity, sometimes of size one, in the conditional probability of receiving the treatment as a function of this particular predictor. The forcing variable is often itself associated with the potential outcomes, but this association is assumed to be smooth. As a result any discontinuity of the conditional distribution of the outcome as a function of this covariate at the threshold is interpreted as evidence of a causal effect of the treatment. The design often arises from administrative decisions, where the incentives for individuals to participate in a program are rationed
for reasons of resource constraints, and clear, transparent rules, rather than discretion, by administrators are used for the allocation of these incentives.

It is useful to distinguish between two general settings, the sharp and the fuzzy regression discontinuity designs (e.g., Trochim 1984, 2001; Hahn, Todd, and van der Klaauw 2001; Imbens and Lemieux 2008; van der Klaauw 2008b; Lee and Lemieux 2008).

6.4.1 The Sharp Regression Discontinuity Design

In the sharp regression discontinuity (SRD) design, the assignment Wi is a deterministic function of one of the covariates, the forcing (or treatment-determining) variable Xi:

\[ W_i = 1[X_i \ge c], \]

where 1[·] is the indicator function, equal to one if the event in brackets is true and zero otherwise. All units with a covariate value of at least c are in the treatment group (and participation is mandatory for these individuals), and all units with a covariate value less than c are in the control group (members of this group are not eligible for the treatment). In the SRD design, we focus on estimation of

averaging we make a smoothness assumption that the two conditional expectations E[Yi(w) | Xi = x], for w = 0, 1, are continuous in x. Under this assumption, E[Yi(0) | Xi = c] = lim_{x↑c} E[Yi(0) | Xi = x] = lim_{x↑c} E[Yi | Xi = x], implying that

\[ \tau_{SRD} = \lim_{x \downarrow c} E[Y_i \mid X_i = x] - \lim_{x \uparrow c} E[Y_i \mid X_i = x], \]

where this expression uses the fact that Wi is a deterministic function of Xi (a key feature of the SRD). The statistical problem becomes one of estimating a regression function nonparametrically at a boundary point. We discuss the statistical problem in more detail in section 6.4.4.

6.4.2 The Fuzzy Regression Discontinuity Design

In the fuzzy regression discontinuity (FRD) design, the probability of receiving the treatment need not change from zero to one at the threshold. Instead the design only requires a discontinuity in the probability of assignment to the treatment at the threshold:

\[ \lim_{x \downarrow c} \mathrm{pr}(W_i = 1 \mid X_i = x) \ne \lim_{x \uparrow c} \mathrm{pr}(W_i = 1 \mid X_i = x). \]
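To make the two limit expressions concrete, the sketch below simulates a fuzzy design and approximates each one-sided limit by a simple average over a narrow window next to the threshold. Everything in it (sample size, window width, the jump in the treatment probability from 0.2 to 0.7, the constant effect of 2.0, and the variable names) is an assumption chosen for illustration. The last line takes the ratio of the jump in the outcome to the jump in the treatment probability, which is the ratio-of-differences form of the FRD estimand mentioned in section 6.4.4.

```python
import numpy as np

rng = np.random.default_rng(2)
n, c, h = 400_000, 0.0, 0.05     # hypothetical sample size, threshold, window

x = rng.uniform(-1.0, 1.0, size=n)           # forcing variable
# Fuzzy design: pr(W = 1 | X = x) jumps from 0.2 to 0.7 at the threshold
p_treat = np.where(x >= c, 0.7, 0.2)
w = rng.binomial(1, p_treat)
# Outcome is smooth in x apart from a constant treatment effect of 2.0
y = 1.0 + 0.5 * x + 2.0 * w + rng.normal(scale=0.5, size=n)

right = (x >= c) & (x < c + h)               # units just above the threshold
left = (x < c) & (x >= c - h)                # units just below the threshold

# Crude stand-ins for the one-sided limits: window averages
jump_w = w[right].mean() - w[left].mean()    # discontinuity in pr(W = 1 | X)
jump_y = y[right].mean() - y[left].mean()    # discontinuity in E[Y | X]

frd_ratio = jump_y / jump_w
print(jump_w, jump_y, frd_ratio)
```

Window averages ignore the slope of the regression function inside the window, so jump_y carries a small bias here (about 0.5 times the window width). In a sharp design the same code would give a jump in the treatment probability of exactly one.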
\[ N_k = \sum_{i=1}^{N} 1[b_k \le X_i \le b_{k+1}], \]

\[ \bar{Y}_k = \frac{1}{N_k} \cdot \sum_{i=1}^{N} Y_i \cdot 1[b_k \le X_i \le b_{k+1}]. \]

The key plot is that of the \bar{Y}_k, for k = 1, …, K against the mid point of the bins, \tilde{b}_k = (b_k + b_{k+1})/2. The question is whether around the threshold c (by construction on the edge of one of the bins) there is any evidence of a jump in the conditional mean of the outcome. The formal statistical analyses discussed below are essentially just sophisticated versions of this, and if the basic plot does not show any evidence of a discontinuity, there is relatively little chance that the more sophisticated analyses will lead to robust

and plot the \bar{W}_k against the bin centers \tilde{b}_k, in the same way as described above.

6.4.4 Estimation and Inference

The object of interest in regression discontinuity designs is a difference in two regression functions at a particular point (in the SRD case), and the ratio of two differences of regression functions (in the FRD case). These estimands are identified without functional form assumptions, and in general one might therefore like to use nonparametric regression methods that allow for flexible functional forms. Because we are interested in the behavior of the regression functions around a single value of the covariate, it is attractive
to use local smoothing methods such as kernel regression rather than global smoothing methods such as sieves or series regression because the latter will generally be sensitive to behavior of the regression function away from the threshold. Local smoothing methods are generally well understood (e.g., Charles J. Stone 1977; Herman J. Bierens 1987; Härdle 1990; Adrian Pagan and Aman Ullah 1999). For a particular choice of the kernel, K(·), e.g., a rectangular kernel K(z) = 1[−h ≤ z ≤ h], or a Gaussian kernel K(z) = exp(−z²/2)/√(2π), the regression function at x, m(x) = E[Yi | Xi = x], is estimated as

\[ \hat{m}(x) = \sum_{i=1}^{N} Y_i \cdot \lambda_i, \quad \text{with weights} \quad \lambda_i = \frac{K\left(\frac{X_i - x}{h}\right)}{\sum_{j=1}^{N} K\left(\frac{X_j - x}{h}\right)}. \]

An important difference with the primary focus in the nonparametric regression literature is that in the RD setting we are interested in the value of the regression functions at boundary points. Standard kernel regression methods do not work well in such cases. More attractive methods for this case are local linear regression (Fan and Gijbels 1996; Porter 2003; Burkhardt Seifert and Theo Gasser 1996, 2000; Ichimura and Todd 2007), where locally a linear regression function, rather than a constant regression function, is fitted. This leads to an estimator for the regression function at x equal to

\[ \hat{m}(x) = \hat{\alpha}, \quad \text{where} \quad (\hat{\alpha}, \hat{\beta}) = \arg\min_{\alpha, \beta} \sum_{i=1}^{N} \lambda_i \cdot (Y_i - \alpha - \beta \cdot (X_i - x))^2, \]

with the same weights λi as before. In that case the main remaining choice concerns the bandwidth, denoted by h. Suppose one uses a rectangular kernel, K(z) = 1[−h ≤ z ≤ h] (and typically the results are relatively robust with respect to the choice of the kernel). The choice of bandwidth then amounts to dropping all observations such that Xi ∉ [c − h, c + h]. The question becomes how to choose the bandwidth h.

Most standard methods for choosing bandwidths in nonparametric regression, including both cross-validation and plug-in methods, are based on criteria that integrate the squared error over the entire distribution of the covariates: ∫_z (m̂(z) − m(z))² f_X(z) dz. For our purposes this criterion does not reflect the object of interest. We are specifically interested in the regression function at a single point; moreover, this point is always a boundary point. Thus we would like to choose h to minimize E[(m̂(c) − m(c))²] (using the data with Xi ≤ c only, or using the data with Xi ≥ c only). If the density of the forcing variable is high at the threshold, a bandwidth selection procedure based on global criteria may lead to a bandwidth that is much larger than is appropriate.

There are few attempts to formalize or standardize the choice of a bandwidth for such cases. Ludwig and Miller (2005) and Imbens and Lemieux (2008) discuss some cross-validation methods that target more directly the object of interest in RD designs. Assuming the density of Xi is continuous at c, and that the conditional variance of Yi given Xi is continuous and equal to σ² at Xi = c, Imbens and Kalyanaraman (2009) show that the optimal bandwidth depends on the second derivatives of the regression functions at the threshold and has the form

\[ h_{opt} = N^{-1/5} \cdot C_K \cdot \sigma^2 \times \left( \frac{\frac{1}{p} + \frac{1}{1 - p}}{\left( \lim_{x \downarrow c} \frac{\partial^2 m}{\partial x^2}(x) \right)^2 + \left( \lim_{x \uparrow c} \frac{\partial^2 m}{\partial x^2}(x) \right)^2} \right)^{1/5}, \]

where p is the fraction of observations with Xi ≥ c, and C_K is a constant that depends on the kernel. For a rectangular kernel K(z) = 1[−h ≤ z ≤ h], the constant equals C_K = 2.70.
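A minimal sketch of local linear estimation at the threshold with a rectangular kernel follows. With that kernel the weights λi are constant inside the window, so the weighted least squares problem reduces to ordinary least squares on the observations within distance h of c, fitted separately on each side. The function names, the simulated data, and the bandwidth h = 0.25 are assumptions for illustration; the bandwidth is picked by hand and is not the plug-in choice.

```python
import numpy as np


def local_linear_at(c, x, y, h, side):
    """Local linear estimate of m(c) from one side, rectangular kernel.

    side="right" uses units with c <= x <= c + h,
    side="left"  uses units with c - h <= x < c.
    Returns alpha_hat, the intercept of a regression of y on (x - c).
    """
    if side == "right":
        keep = (x >= c) & (x <= c + h)
    else:
        keep = (x < c) & (x >= c - h)
    design = np.column_stack([np.ones(keep.sum()), x[keep] - c])
    coef, *_ = np.linalg.lstsq(design, y[keep], rcond=None)
    return coef[0]


def srd_estimate(x, y, c, h):
    """Difference of the two boundary estimates at the threshold c."""
    return (local_linear_at(c, x, y, h, "right")
            - local_linear_at(c, x, y, h, "left"))


# Hypothetical sharp design with a jump of 1.5 at c = 0
rng = np.random.default_rng(3)
x = rng.uniform(-1.0, 1.0, size=5000)
w = (x >= 0.0).astype(float)
y = 1.0 + 0.8 * x + 1.5 * w + rng.normal(scale=0.3, size=x.size)

tau_hat = srd_estimate(x, y, c=0.0, h=0.25)
print(tau_hat)
```

Fitting a line rather than a constant on each side removes the first-order bias that a plain window average has at a boundary point, which is the reason the text gives for preferring local linear regression here.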
Imbens and Kalyanaraman propose and implement a plug-in method for the bandwidth.16 If one uses a rectangular kernel, and given a choice for the bandwidth, estimation for the SRD and FRD designs can be based on ordinary least squares and two-stage least squares, respectively. If the bandwidth goes to zero sufficiently fast, so that the asymptotic bias can be ignored, one can also base inference on these methods. (See HTV and Imbens and Lemieux 2008.)

6.4.5 Specification Checks

There are two important concerns in the application of RD designs, be they sharp or fuzzy. These concerns can sometimes be assuaged by investigating various implications of the identification argument underlying the regression discontinuity design.

A first concern about RD designs is the possibility of other changes at the same threshold value of the covariate. For example, the same age limit may affect eligibility for multiple programs. If all the programs whose eligibility changes at the same cutoff value affect the outcome of interest, an RD analysis may mistakenly attribute the combined effect to the treatment of interest. The second concern is that of manipulation by the individuals of the covariate value that underlies the assignment mechanism. The latter is less of a concern when the forcing variable is a fixed, immutable characteristic of an individual such as age. It is a particular concern when eligibility criteria are known to potential participants and are based on variables that are affected by individual choices. For example, if eligibility for financial aid depends on test scores that are graded by teachers who know the cutoff values, there may be a tendency to push grades high enough to make students eligible. Alternatively, if thresholds are known to students, they may take the test multiple times in an attempt to raise their score above the threshold.

There are two sets of specification checks that researchers can typically perform to at least partly assess the empirical relevance of these concerns. Although the proposed procedures do not directly test null hypotheses that are required for the RD approach to be valid, it is typically difficult to argue for the validity of the approach when these null hypotheses do not hold. First, one may look for discontinuities in the average value of the covariates around the threshold. In most cases, the reason for the discontinuity in the probability of the treatment does not suggest a discontinuity in the average value of covariates. Finding a discontinuity in other covariates typically casts doubt on the assumptions underlying the RD design. Specifically, for covariates Zi, the test would look at the difference

$$\tau_Z = \lim_{x \downarrow c} E[Z_i \mid X_i = x] - \lim_{x \uparrow c} E[Z_i \mid X_i = x].$$

Second, McCrary (2008) suggests testing the null hypothesis of continuity of the density of the covariate that underlies the assignment at the threshold, against the alternative of a jump in the density function at that point. A discontinuity in the density of this covariate at the particular point where the discontinuity in the conditional expectation occurs is suggestive of violations of the no-manipulation assumption. Here the focus is on the difference

$$\tau_{f(x)} = \lim_{x \downarrow c} f_X(x) - \lim_{x \uparrow c} f_X(x).$$

In both cases a substantially and statistically significant difference in the left and right limits suggests that there may be problems with the RD approach. In practice, more useful than formal statistical tests are graphical analyses of the type discussed in section 6.4.3 where histogram-type estimates of the conditional expectation E[Zi | Xi = x] and of the marginal density f_X(x) are graphed.

16 Code in Matlab and Stata for calculating the optimal bandwidth is available on their website.
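The sharp RD estimate with a rectangular kernel and the two specification checks above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the evenly spaced forcing variable, and the noiseless outcome and covariate are all hypothetical, and the density comparison is a crude histogram-type contrast rather than McCrary's (2008) test.

```python
def boundary_fit(xs, vs, c, h, side):
    """Local linear (OLS within the bandwidth) estimate of E[V | X = c] from one side."""
    if side == "right":
        pts = [(x - c, v) for x, v in zip(xs, vs) if c <= x <= c + h]
    else:
        pts = [(x - c, v) for x, v in zip(xs, vs) if c - h <= x < c]
    n = len(pts)
    if n < 2:
        raise ValueError("need at least two observations on each side")
    mean_d = sum(d for d, _ in pts) / n
    mean_v = sum(v for _, v in pts) / n
    sxx = sum((d - mean_d) ** 2 for d, _ in pts)
    slope = sum((d - mean_d) * (v - mean_v) for d, v in pts) / sxx
    return mean_v - slope * mean_d  # fitted value at the cutoff


def rd_jump(xs, vs, c, h):
    """Right limit minus left limit of E[V | X = x] at c (SRD effect or tau_Z)."""
    return boundary_fit(xs, vs, c, h, "right") - boundary_fit(xs, vs, c, h, "left")


def density_jump(xs, c, h):
    """Histogram-type contrast of the density of X just right and just left of c."""
    n = len(xs)
    right = sum(1 for x in xs if c <= x < c + h) / (n * h)
    left = sum(1 for x in xs if c - h <= x < c) / (n * h)
    return right - left


# Hypothetical data: forcing variable on [-1, 1], cutoff c = 0, true jump of 2 in Y.
xs = [(i - 200) / 200 for i in range(401)]
ys = [1.0 + 0.5 * x + (2.0 if x >= 0 else 0.0) for x in xs]
zs = [3.0 + x for x in xs]  # covariate with no discontinuity at the cutoff

print("SRD estimate:", rd_jump(xs, ys, c=0.0, h=0.5))
print("tau_Z:       ", rd_jump(xs, zs, c=0.0, h=0.5))
print("density jump:", density_jump(xs, c=0.0, h=0.5))
```

In applied work one would pair these point estimates with standard errors and with the graphical analyses the text recommends; the sketch only shows which quantities are being compared.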
Imbens and Wooldridge: Econometrics of Program Evaluation 67
6.5 Difference-in-Differences Methods

Since the seminal work by Ashenfelter (1978) and Ashenfelter and Card (1985), the use of Difference-In-Differences (DID) methods has become widespread in empirical economics. Influential applications include Philip J. Cook and George Tauchen (1982, 1984), Card (1990), Bruce D. Meyer, W. Kip Viscusi, and David L. Durbin (1995), Card and Krueger (1993, 1994), Nada Eissa and Liebman (1996), Blundell, Alan Duncan, and Meghir (1998), and many others. The DID approach is often associated with so-called “natural experiments,” where policy changes can be used to effectively define control and treatment groups. See Angrist and Krueger (1999), Angrist and Pischke (2009), and Blundell and Thomas MaCurdy (1999) for textbook discussions.

The simplest setting is one where outcomes are observed for units observed in one of two groups, in one of two time periods. Only units in one of the two groups, in the second time period, are exposed to a treatment. There are no units exposed to the treatment in the first period, and units from the control group are never observed to be exposed to the treatment. The average gain over time in the non-exposed (control) group is subtracted from the gain over time in the exposed (treatment) group. This double differencing removes biases in second period comparisons between the treatment and control group that could result from permanent differences between those groups, as well as biases from comparisons over time in the treatment group that could be the result of time trends unrelated to the treatment. In general this allows for the endogenous adoption of the new treatment (see Timothy Besley and Case 2000 and Athey and Imbens 2006). We discuss here the conventional setup, and recent work on inference (Bertrand, Duflo, and Mullainathan 2004; Hansen 2007a, 2007b; Donald and Lang 2007), as well as the recent extensions by Athey and Imbens (2006) who develop a functional form-free version of the difference-in-differences methodology, and Abadie, Diamond, and Jens Hainmueller (2007), who develop a method for constructing an artificial control group from multiple nonexposed groups.

6.5.1 Repeated Cross Sections

The standard model for the DID approach is as follows. Individual i belongs to a group, Gi ∈ {0, 1} (where group 1 is the treatment group), and is observed in time period Ti ∈ {0, 1}. For i = 1, … , N, a random sample from the population, individual i’s group identity and time period can be treated as random variables. In the standard DID model, we can write the outcome for individual i in the absence of the intervention, Yi(0) as

$$Y_i(0) = \alpha + \beta \cdot T_i + \gamma \cdot G_i + \varepsilon_i, \tag{33}$$

with unknown parameters α, β, and γ. We ignore the potential presence of other covariates, which introduce no special complications. The second coefficient in this specification, β, represents the time component common to both groups. The third coefficient, γ, represents a group-specific, time-invariant component. The fourth term, εi, represents unobservable characteristics of the individual. This term is assumed to be independent of the group indicator and have the same distribution over time, i.e., $\varepsilon_i \perp\!\!\!\perp (G_i, T_i)$, and is normalized to have mean zero.

An alternative setup leading to the same estimator allows for a time-invariant individual-specific fixed effect, γi, potentially correlated with Gi, and models Yi(0) as

$$Y_i(0) = \alpha + \beta \cdot T_i + \gamma_i + \varepsilon_i. \tag{34}$$

(See, e.g., Angrist and Krueger 1999.) This generalization of the standard model does
not affect the standard DID estimand, and it will be subsumed as a special case of the model we propose.

The equation for the outcome without the treatment is combined with an equation for the outcome given the treatment: Yi(1) = Yi(0) + τ_DID. Under this model, the standard DID estimand equals

$$\begin{aligned} \tau_{DID} &= E[Y_i(1)] - E[Y_i(0)] \\ &= \big(E[Y_i \mid G_i = 1, T_i = 1] - E[Y_i \mid G_i = 1, T_i = 0]\big) \\ &\quad - \big(E[Y_i \mid G_i = 0, T_i = 1] - E[Y_i \mid G_i = 0, T_i = 0]\big). \end{aligned} \tag{35}$$

In other words, the population average difference over time in the control group (Gi = 0) is subtracted from the population average difference over time in the treatment group (Gi = 1) to remove biases associated with a common time trend unrelated to the intervention.

6.5.2 Multiple Groups and Multiple Periods

With multiple time periods and multiple groups we can use a natural extension of the two-group two-time-period model for the outcome in the absence of the intervention. Let T and G denote the number of time periods and groups respectively. Then:

$$Y_i(0) = \alpha + \sum_{t=1}^{T} \beta_t \cdot 1[T_i = t] + \sum_{g=1}^{G} \gamma_g \cdot 1[G_i = g] + \varepsilon_i \tag{36}$$

with separate parameters for each group and time period, γg and βt, for g = 1, … , G and t = 1, … , T, where the initial time period coefficient and first group coefficient have implicitly been normalized to zero. This model is then combined with the additive model for the treatment effect, Yi(1) = Yi(0) + τ_DID, implying that the parameters of this model can still be estimated by ordinary least squares based on the regression function

$$Y_i = \alpha + \sum_{t=1}^{T} \beta_t \cdot 1[T_i = t] + \cdots \tag{37}$$
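The double difference in (35) can be computed directly from the four group/period sample means. The sketch below is illustrative only: the data are generated from model (33) plus an additive effect, with hypothetical parameter values and a noise pattern that averages to zero within each cell, and the helper names are not from the article.

```python
def cell_mean(rows, g, t):
    """Sample mean of Y among units with G_i = g and T_i = t."""
    ys = [y for gi, ti, y in rows if gi == g and ti == t]
    return sum(ys) / len(ys)


def did_estimate(rows):
    """(Ybar_11 - Ybar_10) - (Ybar_01 - Ybar_00), the sample analog of (35)."""
    treated_change = cell_mean(rows, 1, 1) - cell_mean(rows, 1, 0)
    control_change = cell_mean(rows, 0, 1) - cell_mean(rows, 0, 0)
    return treated_change - control_change


# Hypothetical parameters: common time effect beta, group effect gamma, effect tau.
alpha, beta, gamma, tau = 1.0, 0.5, 2.0, 3.0
noise = [-0.1, 0.0, 0.1]  # averages to zero within every group/period cell
rows = [
    (g, t, alpha + beta * t + gamma * g + tau * g * t + e)
    for g in (0, 1)
    for t in (0, 1)
    for e in noise
]

# Second-period comparison across groups picks up gamma + tau.
post_gap = cell_mean(rows, 1, 1) - cell_mean(rows, 0, 1)
# Before-after comparison within the treatment group picks up beta + tau.
before_after = cell_mean(rows, 1, 1) - cell_mean(rows, 1, 0)

print("post-period gap:", post_gap)
print("before-after:   ", before_after)
print("DID:            ", did_estimate(rows))
```

The two single differences are contaminated by the permanent group difference and the common time trend respectively; only the double difference isolates the effect in this design.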
testable restrictions on the four group/period means.

6.5.3 Standard Errors in the Multiple Group and Multiple Period Case

Recent work has called attention to the concern that ordinary least squares standard errors for the DID estimator may not be accurate in the presence of correlations between outcomes within groups and between time periods. This is a particular case of clustering where the regressor of interest does not vary within clusters. See Brent R. Moulton (1990), Moulton and William C. Randolph (1989), and Wooldridge (2002) for a general discussion. The specific problem has been analyzed recently by Donald and Lang (2007), Bertrand, Duflo, and Mullainathan (2004), and Hansen (2007a, 2007b).

The starting point of these analyses is a particular structure on the error term εi:

$$\varepsilon_i = \eta_{G_i, T_i} + \nu_i,$$

where νi is an individual-level idiosyncratic error term, and $\eta_{gt}$ is a group/time specific component. The unit level error term νi is independent across all units, E[νi · νj] = 0 if i ≠ j and $E[\nu_i^2] = \sigma_\nu^2$. Now suppose we also assume that $\eta_{gt} \sim \mathcal{N}(0, \sigma_\eta^2)$, and all the $\eta_{gt}$ are independent. Let us focus initially on the two-group, two-time-period case. With a large number of units in each group and time period, $\bar{Y}_{gt} \to \alpha + \beta_t + \gamma_g + 1_{g=1,t=1} \cdot \tau_{DID} + \eta_{gt}$, so that

$$\hat{\tau}_{DID} = (\bar{Y}_{11} - \bar{Y}_{10}) - (\bar{Y}_{01} - \bar{Y}_{00}) \to \tau_{DID} + (\eta_{11} - \eta_{10}) - (\eta_{01} - \eta_{00}) \sim \mathcal{N}(\tau_{DID}, 4 \cdot \sigma_\eta^2).$$

Thus, in this case with two groups and two time periods, the conventional DID estimator is not consistent. In fact, no consistent estimator exists because there is no way to eliminate the influence of the four unobserved components $\eta_{gt}$. In this two-group, two-time-period case the problem is even worse than the absence of a consistent estimator, because one cannot even establish whether there is a clustering problem: the data are not informative about the value of $\sigma_\eta^2$. If we have data from more than two groups or from more than two time periods, we can typically estimate $\sigma_\eta^2$, and thus, at least under the normality and independence assumptions for $\eta_{gt}$, construct confidence intervals for τ_DID. Consider, for example, the case with three groups, and two time periods. If groups Gi = 0, 1 are both not treated in the second period, then $(\bar{Y}_{11} - \bar{Y}_{10}) - (\bar{Y}_{01} - \bar{Y}_{00}) \sim \mathcal{N}(0, 4 \cdot \sigma_\eta^2)$, which can be used to obtain an unbiased estimator for $\sigma_\eta^2$. See Donald and Lang (2007) for details.

Bertrand, Duflo, and Mullainathan (2004) and Hansen (2007a, 2007b) focus on the case with multiple (more than two) time periods. In that case we may wish to relax the assumption that the $\eta_{gt}$ are independent over time. Note that with data from only two time periods there is no information in the data that allows one to establish the absence of independence over time. The typical generalization is to allow for an autoregressive structure on the $\eta_{gt}$, for example,

$$\eta_{gt} = \alpha \cdot \eta_{g,t-1} + \omega_{gt},$$

with a serially uncorrelated $\omega_{gt}$. More generally, with T time periods, one can allow for an autoregressive process of order T − 2. Using simulations and real data calculations based on data for fifty states and multiple time periods, Bertrand, Duflo, and Mullainathan (2004) show that corrections to the conventional standard errors taking into account the clustering and autoregressive structure make a substantial difference. Hansen (2007a, 2007b) provides additional large sample results under sequences where the number of time periods increases with the sample size.
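The inconsistency result for the two-group, two-period case can be checked with a small Monte Carlo sketch. Everything in it is an assumption chosen for illustration: the function names, the parameter values, the number of replications, and the seed. The parameters α, βt, and γg are set to zero because they difference out of the estimator.

```python
import random
import statistics


def simulate_did(n_per_cell, sigma_eta, sigma_nu, tau, rng):
    """Draw one 2x2 data set with errors eta_gt + nu_i and return the DID estimate."""
    means = {}
    for g in (0, 1):
        for t in (0, 1):
            eta = rng.gauss(0.0, sigma_eta)  # group/time shock shared by the cell
            base = tau if (g == 1 and t == 1) else 0.0
            ys = [base + eta + rng.gauss(0.0, sigma_nu) for _ in range(n_per_cell)]
            means[g, t] = sum(ys) / n_per_cell
    return (means[1, 1] - means[1, 0]) - (means[0, 1] - means[0, 0])


def did_variance(n_per_cell, sigma_eta, reps=1500, seed=0):
    """Sampling variance of the DID estimator across simulated data sets."""
    rng = random.Random(seed)
    draws = [simulate_did(n_per_cell, sigma_eta, 1.0, 1.0, rng) for _ in range(reps)]
    return statistics.variance(draws)


# With sigma_eta = 1 the variance should stay near 4 * sigma_eta**2 however many
# units are observed per cell; without group/time shocks it shrinks with n.
for n in (10, 100):
    print(n, "units per cell:",
          "with shocks", round(did_variance(n, 1.0), 3),
          "| without shocks", round(did_variance(n, 0.0), 3))
```

The design choice here is to compare the same estimator with and without the $\eta_{gt}$ component, which makes visible that adding units within cells does not remove the group/time shocks.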
While it appears that the analysis based on unconfoundedness is necessarily less restrictive because it allows a free coefficient in Yi0, this is not the case. The DID assumption implies that adjusting for lagged

with h0(u, t) increasing in u. The random variable Ui represents all unobservable characteristics of individual i, and (38) incorporates the idea that the outcome of an individual with Ui = u will be the same in a given time period, irrespective of the group membership. The distribution of Ui is allowed to vary across groups, but not over time within groups, so that $U_i \perp\!\!\!\perp T_i \mid G_i$. Athey and Imbens call the resulting model the changes-in-changes (CIC) model.

The standard DID model in (33) adds three additional assumptions to the CIC model, namely

$$U_i - E[U_i \mid G_i] \perp\!\!\!\perp G_i \quad \text{(additivity)} \tag{39}$$

$$h_0(u, t) = \phi(u + \delta \cdot t), \quad \text{(single index model)} \tag{40}$$

for a strictly increasing function ϕ( · ), and

the treatment group, no assumptions are required about how the intervention affects outcomes.

The average effect of the treatment for the second period treatment group is τ_cic = E[Yi(1) − Yi(0) | Gi = 1, Ti = 1]. Because the first term of this expression is equal to E[Yi(1) | Gi = 1, Ti = 1] = E[Yi | Gi = 1, Ti = 1], it can be estimated directly from the data. The difficulty is in estimating the second term. Under the assumptions of monotonicity of h0(u, t) in u, and conditional independence of Ti and Ui given Gi, Athey and Imbens show that in fact the full distribution of Y(0) given Gi = Ti = 1 is identified through the equality
prior to the treatment, namely,

$$\left\| \begin{pmatrix} \bar{Y}_{G0} - \sum_{g=0}^{G-1} \lambda_g \cdot \bar{Y}_{g0} \\ \vdots \\ \bar{Y}_{G,T-1} - \sum_{g=0}^{G-1} \lambda_g \cdot \bar{Y}_{g,T-1} \end{pmatrix} \right\|,$$

where ‖ · ‖ denotes a measure of distance. One can also add group level covariates to the criterion to determine the weights. These

to assume unconfoundedness of the treatment assignment. In that case straightforward extensions of the binary treatment case can be used to obtain estimates and inferences for causal effects. Second, we look at the case with a continuous treatment under unconfoundedness. In that case, the definition of the propensity score requires some modification but many of the insights from the binary treatment case still carry over. Third, we look at the case where units can be exposed to a sequence of binary treatments.
For example, an individual may remain in a training program for a number of periods. In each period the assignment to the program is assumed to be unconfounded, given permanent characteristics and outcomes up to that point. In the last two cases we briefly discuss multivalued endogenous treatments. In the fourth case, we look at settings with a discrete multivalued treatment in the presence of endogeneity. We allow the treatment to be continuous in the final case. The last two cases tie in closely with the simultaneous equations literature, where, somewhat separately from the program evaluation literature, there has been much recent work on nonparametric identification and estimation. Especially in the discrete case, many of the results in this literature are negative in the sense that, without unattractive restrictions on heterogeneity or functional form, few objects of interest are point-identified. Some of the literature has turned toward establishing bounds. This is an area with much ongoing work and considerable scope for further research.

7.1 Multivalued Discrete Treatments with Unconfounded Treatment Assignment

If there are a few different levels of the treatment, rather than just two, essentially all of the methods discussed before go through in the unconfoundedness case. Suppose, for example, that the treatment can be one of three levels, say Wi ∈ {0, 1, 2}. In order to estimate the effect of treatment level 2 relative to treatment level 1, one can simply put aside the data for units exposed to treatment level 0 if one is willing to assume unconfoundedness. More specifically, one can estimate the average outcome for each treatment level conditional on the covariates, E[Yi(w) | Xi = x], using data on units exposed to treatment level w, and average these over the (estimated) marginal distribution of the covariates, $\hat{F}_X(x)$. In practice, the overlap assumption may be more likely to be violated with more than two treatments. For example, with three treatments, it may be that no units are exposed to treatment level 2 if Xi is in some subset of the covariate space. The insights from the binary case directly extend to this multiple (but few) treatment case. If the number of treatments is relatively large, one may wish to smooth across treatment levels in order to improve precision of the inferences.

7.2 Continuous Treatments with Unconfounded Treatment Assignment

In the case where the treatment takes on many values, Imbens (2000), Lechner (2001, 2004), Hirano and Imbens (2004), and Carlos A. Flores (2005) extended some of the propensity score methodology under unconfoundedness. The key maintained assumption is that adjusting for pre-treatment differences removes all biases, and thus solves the problem of drawing causal inferences. This is formalized by using the concept of weak unconfoundedness, introduced by Imbens (2000). Assignment to treatment Wi is weakly unconfounded, given pre-treatment variables Xi, if

$$W_i \perp\!\!\!\perp Y_i(w) \mid X_i,$$

for all w. Compare this to the stronger assumption made by Rosenbaum and Rubin (1983b) in the binary case:

$$W_i \perp\!\!\!\perp (Y_i(0), Y_i(1)) \mid X_i,$$

which requires the treatment Wi to be independent of the entire set of potential outcomes. Instead, weak unconfoundedness requires only pairwise independence of the treatment with each of the potential outcomes. A similar assumption is used in Robins and Rotnitzky (1995). The definition of weak unconfoundedness is also similar to that of “missing at random” (Rubin 1976, 1987; Roderick J. A. Little and Rubin 1987) in the missing data literature.
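The regression-adjustment procedure described in section 7.1 (estimate E[Yi(w) | Xi = x] from units with Wi = w, then average over the covariate distribution of the full sample) can be sketched as below. The linear specification, the single covariate, the function names, and the noiseless data are assumptions made for illustration; the text itself does not prescribe a functional form.

```python
def fit_line(xs, ys):
    """Ordinary least squares of y on a constant and x; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return mean_y - slope * mean_x, slope


def average_potential_outcome(rows, w):
    """Estimate E[Y(w)]: fit E[Y | W = w, X = x] on units with W_i = w,
    then average the fitted values over the covariates of all units."""
    xs_w = [x for wi, x, _ in rows if wi == w]
    ys_w = [y for wi, _, y in rows if wi == w]
    intercept, slope = fit_line(xs_w, ys_w)
    all_x = [x for _, x, _ in rows]
    return sum(intercept + slope * x for x in all_x) / len(all_x)


# Hypothetical data with three treatment levels; the covariate distribution
# differs somewhat across levels but every level is observed over the range of x.
rows = []
for i in range(30):
    x = i / 10
    w = (i + (1 if x > 1 else 0)) % 3
    y = (1 + w) + 0.5 * (w + 1) * x  # Y_i(w) is linear in x by construction
    rows.append((w, x, y))

# Effect of level 2 relative to level 1: units with W_i = 0 play no role here.
effect_2_vs_1 = average_potential_outcome(rows, 2) - average_potential_outcome(rows, 1)
print("estimated E[Y(2)] - E[Y(1)]:", effect_2_vs_1)
```

A check of overlap (whether each treatment level is observed across the relevant range of the covariates) would precede this step in practice, since the sketch extrapolates linearly wherever a level is not observed.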
Although in substantive terms the weak unconfoundedness assumption is not very different from the assumption used by Rosenbaum and Rubin (1983b), it is important that one does not need the stronger assumption to validate estimation of the expected value of Yi(w) by adjusting for Xi: under weak unconfoundedness, we have E[Yi(w) | Xi] = E[Yi(w) | Wi = w, Xi] = E[Yi | Wi = w, Xi], and expected outcomes can then be estimated by averaging these conditional means: E[Yi(w)] = E[E[Yi(w) | Xi]]. In practice, it can be difficult to estimate E[Yi(w)] in this manner when the dimension of Xi is large, or if w takes on many values, because the first step requires estimation of the expectation of Yi(w) given the treatment level and all pretreatment variables. It was this difficulty that motivated Rosenbaum and Rubin (1983b) to develop the propensity score methodology.

Imbens (2000) introduces the generalized propensity score for the multiple treatment case. It is the conditional probability of receiving a particular level of the treatment given the pretreatment variables:

$$r(w, x) \equiv \mathrm{pr}(W_i = w \mid X_i = x).$$

In the continuous case, where, say, Wi takes values in the unit interval, $r(w, x) = F_{W|X}(w \mid x)$. Suppose assignment to treatment Wi is weakly unconfounded given pretreatment variables Xi. Then, by the same argument as in the binary treatment case, assignment is weakly unconfounded given the generalized propensity score, as δ → 0,

$$1\{w - \delta \le W_i \le w + \delta\} \perp\!\!\!\perp Y_i(w) \mid r(w, X_i),$$

for all w. This is the point where using the weak form of the unconfoundedness assumption is important. There is, in general, no scalar function of the covariates such that the level of the treatment Wi is independent of the set of potential outcomes $\{Y_i(w)\}_{w \in [0,1]}$, unless additional structure is imposed on the assignment mechanism; see for example, Marshall M. Joffe and Rosenbaum (1999).

Because weak unconfoundedness given all pretreatment variables implies weak unconfoundedness given the generalized propensity score, one can estimate average outcomes by conditioning solely on the generalized propensity score. If assignment to treatment is weakly unconfounded given pretreatment variables X, then two results follow. First, for all w,

$$\beta(w, r) \equiv E[Y_i(w) \mid r(w, X_i) = r] = E[Y_i \mid W_i = w, r(W_i, X_i) = r],$$

which can be estimated using data on Yi, Wi, and r(Wi, Xi). Second, the average outcome given a particular level of the treatment, E[Yi(w)], can be estimated by appropriately averaging β(w, r):

$$E[Y_i(w)] = E[\beta(w, r(w, X_i))].$$

As with the implementation of the binary treatment propensity score methodology, the implementation of the generalized propensity score method consists of three steps. In the first step the score r(w, x) is estimated. With a binary treatment the standard approach (Rosenbaum and Rubin 1984; Rosenbaum 1995) is to estimate the propensity score using a logistic regression. More generally, if the treatments correspond to ordered levels of a treatment, such as the dose of a drug or the time over which a treatment is applied, one may wish to impose smoothness of the score in w. For continuous Wi, Hirano and Imbens (2004) use a lognormal distribution. In the second step, the conditional expectation β(w, r) = E[Yi | Wi = w, r(Wi, Xi) = r] is estimated. Again, the implementation may be different in the case where the levels of the treatment are qualitatively distinct than in the case where smoothness of the conditional expectation function in w is appropriate.
Here, some form of linear or nonlinear regression may be used. In the third step the average response at treatment level w is estimated as the average of the estimated conditional expectation, $\hat{\beta}(w, r(w, X_i))$, averaged over the distribution of the pretreatment variables, X1, … , XN. Note that to get the average E[Yi(w)], the second argument in the conditional expectation β(w, r) is evaluated at r(w, Xi), not at r(Wi, Xi).

7.2.1 Dynamic Treatments with Unconfounded Treatment Assignment

Multiple-valued treatments can arise because at any point in time individuals can be assigned to multiple different treatment arms, or because they can be assigned sequentially to different treatments. Gill and Robins (2001) analyze this case, where they assume that at any point in time an unconfoundedness assumption holds. Lechner and Miquel (2005) (see also Lechner 1999, and Lechner, Miquel, and Conny Wunsch 2004) study a related case, where again a sequential unconfoundedness assumption is maintained to identify the average effects of interest. Abbring and Gerard J. van den Berg (2003) study settings with duration data. These methods hold great promise but, until now, there have been few substantive applications.

7.3 Multivalued Discrete Endogenous Treatments

In settings with general heterogeneity in the effects of the treatment, the case with more than two treatment levels is considerably more challenging than the binary case. There are few studies investigating identification in these settings. Angrist and Imbens (1995) and Angrist, Kathryn Graddy and Imbens (2000) study the interpretation of the standard instrumental variable estimand, the ratio of the covariances of outcome and instrument and treatment and instrument. They show that in general, with a valid instrument, the instrumental variables estimand can still be interpreted as an average causal effect, but with a complicated weighting scheme. There are essentially two levels of averaging going on. First, at each level of the treatment we can only get the average effect of a unit increase in the treatment for compliers at that level. In addition, there is averaging over all levels of the treatment, with the weights equal to the proportion of compliers at that level.

Imbens (2007) studies, in more detail, the case where the endogenous treatment takes on three values and shows the limits to identification in the case with heterogeneous treatment effects.

7.4 Continuous Endogenous Treatments

Perhaps surprisingly, there are many more results for the case with continuous endogenous treatments than for the discrete case that do not impose restrictive assumptions. Much of the focus has been on triangular systems, with a single unobserved component of the equation determining the treatment:

$$W_i = h(Z_i, \eta_i),$$

where ηi is scalar, and an essentially unrestricted outcome equation:

$$Y_i = g(W_i, \varepsilon_i),$$

where εi may be a vector. Blundell and James L. Powell (2003, 2004), Chernozhukov and Hansen (2005), Imbens and Newey (forthcoming), and Andrew Chesher (2003) study various versions of this setup. Imbens and Newey (forthcoming) show that if h(z, η) is strictly monotone in η, then one can identify average effects of the treatment subject to support conditions on the instrument. They suggest a control function approach to estimation. First η is normalized to have a uniform distribution on [0, 1] (e.g., Rosa L. Matzkin 2003). Then ηi is estimated
as $\hat{\eta}_i = \hat{F}_{W|Z}(W_i \mid Z_i)$. In the second stage, Yi is regressed nonparametrically on Xi and $\hat{\eta}_i$. Chesher (2003) studies local versions of this problem.

When the treatment equation has an additive form, say Wi = h1(Zi) + ηi, where ηi is independent of Zi, Blundell and Powell (2003, 2004) derive nonparametric control function methods for estimating the average structural function, E[g(w, εi)]. The general idea is to first obtain residuals, $\hat{\eta}_i = W_i - \hat{h}_1(Z_i)$ from a nonparametric regression. Next, a nonparametric regression of Yi on Wi and $\hat{\eta}_i$ is used to recover m(w, η) = E(Yi | Wi = w, ηi = η). Blundell and Powell show that the average structural function is generally identified as E[m(w, ηi)], which is easily estimated by averaging out $\hat{\eta}_i$ across the sample.

8. Conclusion

Over the last two decades, there has been a proliferation of the literature on program evaluation. This includes theoretical econometrics work, as well as empirical work. Important features of the modern literature are the convergence of the statistical and econometric literatures, with the Rubin potential outcomes framework now the dominant framework. The modern literature has stressed the importance of relaxing functional form and distributional assumptions, and has allowed for general heterogeneity in the effects of the treatment. This has led to renewed interest in identification questions, leading to unusual and controversial estimands such as the local average treatment effect (Imbens and Angrist 1994), as well as to the literature on partial identification (Manski 1990). It has also borrowed heavily from the semiparametric literature, using both efficiency bound results (Hahn 1998) and methods for inference based on series and kernel estimation (Newey 1994a, 1994b). It has by now matured to the point that it is of great use for practitioners.

References

Abadie, Alberto. 2002. “Bootstrap Tests of Distributional Treatment Effects in Instrumental Variable Models.” Journal of the American Statistical Association, 97(457): 284–92.
Abadie, Alberto. 2003. “Semiparametric Instrumental Variable Estimation of Treatment Response Models.” Journal of Econometrics, 113(2): 231–63.
Abadie, Alberto. 2005. “Semiparametric Difference-in-Differences Estimators.” Review of Economic Studies, 72(1): 1–19.
Abadie, Alberto, Joshua D. Angrist, and Guido W. Imbens. 2002. “Instrumental Variables Estimates of the Effect of Subsidized Training on the Quantiles of Trainee Earnings.” Econometrica, 70(1): 91–117.
Abadie, Alberto, Alexis Diamond, and Jens Hainmueller. 2007. “Synthetic Control Methods for Comparative Case Studies: Estimating the Effect of California’s Tobacco Control Program.” National Bureau of Economic Research Working Paper 12831.
Abadie, Alberto, David Drukker, Jane Leber Herr, and Guido W. Imbens. 2004. “Implementing Matching Estimators for Average Treatment Effects in Stata.” Stata Journal, 4(3): 290–311.
Abadie, Alberto, and Javier Gardeazabal. 2003. “The Economic Costs of Conflict: A Case Study of the Basque Country.” American Economic Review, 93(1): 113–32.
Abadie, Alberto, and Guido W. Imbens. 2006. “Large Sample Properties of Matching Estimators for Average Treatment Effects.” Econometrica, 74(1): 235–67.
Abadie, Alberto, and Guido W. Imbens. 2008a. “Bias Corrected Matching Estimators for Average Treatment Effects.” Unpublished.
Abadie, Alberto, and Guido W. Imbens. Forthcoming. “Estimation of the Conditional Variance in Paired Experiments.” Annales d’Economie et de Statistique.
Abadie, Alberto, and Guido W. Imbens. 2008b. “On the Failure of the Bootstrap for Matching Estimators.” Econometrica, 76(6): 1537–57.
Abbring, Jaap H., and James J. Heckman. 2007. “Econometric Evaluation of Social Programs, Part III: Distributional Treatment Effects, Dynamic Treatment Effects, Dynamic Discrete Choice, and General Equilibrium Policy Evaluation.” In Handbook of Econometrics, Volume 6B, ed. James J. Heckman and Edward E. Leamer, 5145–5303. Amsterdam; New York and Oxford: Elsevier Science, North-Holland.
Abbring, Jaap H., and Gerard J. van den Berg. 2003. “The Nonparametric Identification of Treatment Effects in Duration Models.” Econometrica, 71(5): 1491–1517.
Andrews, Donald W. K., and Gustavo Soares. 2007. “Inference for Parameters Defined By Moment Inequalities Using Generalized Moment Selection.” Cowles Foundation Discussion Paper 1631.
Angrist, Joshua D. 1990. “Lifetime Earnings and the Vietnam Era Draft Lottery: Evidence from Social
Security Administrative Records.” American Eco- Economic Research Working Paper 6600.
nomic Review, 80(3): 313–36.
Angrist, Joshua D. 1998. “Estimating the Labor Market Impact of Voluntary Military Service Using Social Security Data on Military Applicants.” Econometrica, 66(2): 249–88.
Angrist, Joshua D. 2004. “Treatment Effect Heterogeneity in Theory and Practice.” Economic Journal, 114(494): C52–83.
Angrist, Joshua D., Eric Bettinger, and Michael Kremer. 2006. “Long-Term Educational Consequences of Secondary School Vouchers: Evidence from Administrative Records in Colombia.” American Economic Review, 96(3): 847–62.
Angrist, Joshua D., Kathryn Graddy, and Guido W. Imbens. 2000. “The Interpretation of Instrumental Variables Estimators in Simultaneous Equations Models with an Application to the Demand for Fish.” Review of Economic Studies, 67(3): 499–527.
Angrist, Joshua D., and Jinyong Hahn. 2004. “When to Control for Covariates? Panel Asymptotics for Estimates of Treatment Effects.” Review of Economics and Statistics, 86(1): 58–72.
Angrist, Joshua D., and Guido W. Imbens. 1995. “Two-Stage Least Squares Estimation of Average Causal Effects in Models with Variable Treatment Intensity.” Journal of the American Statistical Association, 90(430): 431–42.
Angrist, Joshua D., Guido W. Imbens, and Donald B. Rubin. 1996. “Identification of Causal Effects Using Instrumental Variables.” Journal of the American Statistical Association, 91(434): 444–55.
Angrist, Joshua D., and Alan B. Krueger. 1999. “Empirical Strategies in Labor Economics.” In Handbook of Labor Economics, Volume 3A, ed. Orley Ashenfelter and David Card, 1277–1366. Amsterdam; New York and Oxford: Elsevier Science, North-Holland.
Angrist, Joshua D., and Kevin Lang. 2004. “Does School Integration Generate Peer Effects? Evidence from Boston’s Metco Program.” American Economic Review, 94(5): 1613–34.
Angrist, Joshua D., and Victor Lavy. 1999. “Using Maimonides’ Rule to Estimate the Effect of Class Size on Scholastic Achievement.” Quarterly Journal of Economics, 114(2): 533–75.
Angrist, Joshua D., and Jörn-Steffen Pischke. 2009. Mostly Harmless Econometrics: An Empiricist’s Companion. Princeton: Princeton University Press.
Ashenfelter, Orley. 1978. “Estimating the Effect of Training Programs on Earnings.” Review of Economics and Statistics, 60(1): 47–57.
Ashenfelter, Orley, and David Card. 1985. “Using the Longitudinal Structure of Earnings to Estimate the Effect of Training Programs.” Review of Economics and Statistics, 67(4): 648–60.
Athey, Susan, and Guido W. Imbens. 2006. “Identification and Inference in Nonlinear Difference-in-Differences Models.” Econometrica, 74(2): 431–97.
Athey, Susan, and Scott Stern. 1998. “An Empirical Framework for Testing Theories About Complementarity in Organizational Design.” National Bureau of
Attanasio, Orazio, Costas Meghir, and Ana Santiago. 2005. “Education Choices in Mexico: Using a Structural Model and a Randomized Experiment to Evaluate Progresa.” Institute for Fiscal Studies Centre for the Evaluation of Development Policies Working Paper EWP05/01.
Austin, Peter C. 2008a. “A Critical Appraisal of Propensity-Score Matching in the Medical Literature between 1996 and 2003.” Statistics in Medicine, 27(12): 2037–49.
Austin, Peter C. 2008b. “Discussion of ‘A Critical Appraisal of Propensity-Score Matching in the Medical Literature between 1996 and 2003’: Rejoinder.” Statistics in Medicine, 27(12): 2066–69.
Balke, Alexander, and Judea Pearl. 1994. “Nonparametric Bounds of Causal Effects from Partial Compliance Data.” University of California Los Angeles Cognitive Systems Laboratory Technical Report R-199.
Banerjee, Abhijit V., Shawn Cole, Esther Duflo, and Leigh Linden. 2007. “Remedying Education: Evidence from Two Randomized Experiments in India.” Quarterly Journal of Economics, 122(3): 1235–64.
Barnow, Burt S., Glen G. Cain, and Arthur S. Goldberger. 1980. “Issues in the Analysis of Selectivity Bias.” In Evaluation Studies, Volume 5, ed. Ernst W. Stromsdorfer and George Farkas, 43–59. San Francisco: Sage.
Becker, Sascha O., and Andrea Ichino. 2002. “Estimation of Average Treatment Effects Based on Propensity Scores.” Stata Journal, 2(4): 358–77.
Behncke, Stefanie, Markus Frölich, and Michael Lechner. 2006. “Statistical Assistance for Programme Selection—For a Better Targeting of Active Labour Market Policies in Switzerland.” University of St. Gallen Department of Economics Discussion Paper 2006-09.
Beresteanu, Arie, and Francesca Molinari. 2006. “Asymptotic Properties for a Class of Partially Identified Models.” Institute for Fiscal Studies Centre for Microdata Methods and Practice Working Paper CWP10/06.
Bertrand, Marianne, Esther Duflo, and Sendhil Mullainathan. 2004. “How Much Should We Trust Differences-in-Differences Estimates?” Quarterly Journal of Economics, 119(1): 249–75.
Bertrand, Marianne, and Sendhil Mullainathan. 2004. “Are Emily and Greg More Employable than Lakisha and Jamal? A Field Experiment on Labor Market Discrimination.” American Economic Review, 94(4): 991–1013.
Besley, Timothy, and Anne C. Case. 2000. “Unnatural Experiments? Estimating the Incidence of Endogenous Policies.” Economic Journal, 110(467): F672–94.
Bierens, Herman J. 1987. “Kernel Estimators of Regression Functions.” In Advances in Econometrics: Fifth World Congress, Volume 1, ed. Truman F. Bewley, 99–144. Cambridge and New York: Cambridge University Press.
Bitler, Marianne, Jonah Gelbach, and Hilary Hoynes. 2006. “What Mean Impacts Miss: Distributional Effects of Welfare Reform Experiments.” American Economic Review, 96(4): 988–1012.
Björklund, Anders, and Robert Moffitt. 1987. “The Estimation of Wage Gains and Welfare Gains in Self-Selection.” Review of Economics and Statistics, 69(1): 42–49.
Black, Sandra E. 1999. “Do Better Schools Matter? Parental Valuation of Elementary Education.” Quarterly Journal of Economics, 114(2): 577–99.
Bloom, Howard S. 1984. “Accounting for No-Shows in Experimental Evaluation Designs.” Evaluation Review, 8(2): 225–46.
Bloom, Howard S., ed. 2005. Learning More from Social Experiments: Evolving Analytic Approaches. New York: Russell Sage Foundation.
Blundell, Richard, and Monica Costa Dias. 2002. “Alternative Approaches to Evaluation in Empirical Microeconomics.” Institute for Fiscal Studies Centre for Microdata Methods and Practice Working Paper CWP10/02.
Blundell, Richard, Monica Costa Dias, Costas Meghir, and John Van Reenen. 2001. “Evaluating the Employment Impact of a Mandatory Job Search Assistance Program.” Institute for Fiscal Studies Working Paper WP01/20.
Blundell, Richard, Alan Duncan, and Costas Meghir. 1998. “Estimating Labor Supply Responses Using Tax Reforms.” Econometrica, 66(4): 827–61.
Blundell, Richard, Amanda Gosling, Hidehiko Ichimura, and Costas Meghir. 2004. “Changes in the Distribution of Male and Female Wages Accounting for Employment Composition Using Bounds.” Institute for Fiscal Studies Working Paper W04/25.
Blundell, Richard, and Thomas MaCurdy. 1999. “Labor Supply: A Review of Alternative Approaches.” In Handbook of Labor Economics, Volume 3A, ed. Orley Ashenfelter and David Card, 1559–1695. Amsterdam; New York and Oxford: Elsevier Science, North-Holland.
Blundell, Richard, and James L. Powell. 2003. “Endogeneity in Nonparametric and Semiparametric Regression Models.” In Advances in Economics and Econometrics: Theory and Applications, Eighth World Congress, Volume 2, ed. Mathias Dewatripont, Lars Peter Hansen, and Stephen J. Turnovsky, 312–57. Cambridge and New York: Cambridge University Press.
Blundell, Richard, and James L. Powell. 2004. “Endogeneity in Semiparametric Binary Response Models.” Review of Economic Studies, 71(3): 655–79.
Brock, William, and Steven N. Durlauf. 2000. “Interactions-Based Models.” National Bureau of Economic Research Technical Working Paper 258.
Bruhn, Miriam, and David McKenzie. 2008. “In Pursuit of Balance: Randomization in Practice in Development Field Experiments.” World Bank Policy Research Working Paper 4752.
Burtless, Gary. 1995. “The Case for Randomized Field Trials in Economic and Policy Research.” Journal of Economic Perspectives, 9(2): 63–84.
Busso, Matias, John DiNardo, and Justin McCrary. 2008. “Finite Sample Properties of Semiparametric Estimators of Average Treatment Effects.” Unpublished.
Caliendo, Marco. 2006. Microeconometric Evaluation of Labour Market Policies. Heidelberg: Springer, Physica-Verlag.
Cameron, A. Colin, and Pravin K. Trivedi. 2005. Microeconometrics: Methods and Applications. Cambridge and New York: Cambridge University Press.
Canay, Ivan A. 2007. “EL Inference for Partially Identified Models: Large Deviations Optimality and Bootstrap Validity.” Unpublished.
Card, David. 1990. “The Impact of the Mariel Boatlift on the Miami Labor Market.” Industrial and Labor Relations Review, 43(2): 245–57.
Card, David. 2001. “Estimating the Return to Schooling: Progress on Some Persistent Econometric Problems.” Econometrica, 69(5): 1127–60.
Card, David, Carlos Dobkin, and Nicole Maestas. 2004. “The Impact of Nearly Universal Insurance Coverage on Health Care Utilization and Health: Evidence from Medicare.” National Bureau of Economic Research Working Paper 10365.
Card, David, and Dean R. Hyslop. 2005. “Estimating the Effects of a Time-Limited Earnings Subsidy for Welfare-Leavers.” Econometrica, 73(6): 1723–70.
Card, David, and Alan B. Krueger. 1993. “Trends in Relative Black–White Earnings Revisited.” American Economic Review, 83(2): 85–91.
Card, David, and Alan B. Krueger. 1994. “Minimum Wages and Employment: A Case Study of the Fast-Food Industry in New Jersey and Pennsylvania.” American Economic Review, 84(4): 772–93.
Card, David, and Phillip B. Levine. 1994. “Unemployment Insurance Taxes and the Cyclical and Seasonal Properties of Unemployment.” Journal of Public Economics, 53(1): 1–29.
Card, David, Alexandre Mas, and Jesse Rothstein. 2007. “Tipping and the Dynamics of Segregation.” National Bureau of Economic Research Working Paper 13052.
Card, David, and Brian P. McCall. 1996. “Is Workers’ Compensation Covering Uninsured Medical Costs? Evidence from the ‘Monday Effect.’” Industrial and Labor Relations Review, 49(4): 690–706.
Card, David, and Philip K. Robins. 1996. “Do Financial Incentives Encourage Welfare Recipients to Work? Evidence from a Randomized Evaluation of the Self-Sufficiency Project.” National Bureau of Economic Research Working Paper 5701.
Card, David, and Daniel G. Sullivan. 1988. “Measuring the Effect of Subsidized Training Programs on Movements In and Out of Employment.” Econometrica, 56(3): 497–530.
Case, Anne C., and Lawrence F. Katz. 1991. “The Company You Keep: The Effects of Family and Neighborhood on Disadvantaged Youths.” National Bureau of Economic Research Working Paper 3705.
Chamberlain, Gary. 1986. “Asymptotic Efficiency in Semi-parametric Models with Censoring.” Journal of Econometrics, 32(2): 189–218.
Chattopadhyay, Raghabendra, and Esther Duflo. 2004. “Women as Policy Makers: Evidence from a Randomized Policy Experiment in India.” Econometrica, 72(5): 1409–43.
Chay, Kenneth Y., and Michael Greenstone. 2005. “Does Air Quality Matter? Evidence from the Housing Market.” Journal of Political Economy, 113(2): 376–424.
Chen, Susan, and Wilbert van der Klaauw. 2008. “The Work Disincentive Effects of the Disability Insurance Program in the 1990s.” Journal of Econometrics, 142(2): 757–84.
Chen, Xiaohong. 2007. “Large Sample Sieve Estimation of Semi-nonparametric Models.” In Handbook of Econometrics, Volume 6B, ed. James J. Heckman and Edward E. Leamer, 5549–5632. Amsterdam and Oxford: Elsevier, North-Holland.
Chen, Xiaohong, Han Hong, and Alessandro Tarozzi. 2008. “Semiparametric Efficiency in GMM Models with Auxiliary Data.” Annals of Statistics, 36(2): 808–43.
Chernozhukov, Victor, and Christian B. Hansen. 2005. “An IV Model of Quantile Treatment Effects.” Econometrica, 73(1): 245–61.
Chernozhukov, Victor, Han Hong, and Elie Tamer. 2007. “Estimation and Confidence Regions for Parameter Sets in Econometric Models.” Econometrica, 75(5): 1243–84.
Chesher, Andrew. 2003. “Identification in Nonseparable Models.” Econometrica, 71(5): 1405–41.
Chetty, Raj, Adam Looney, and Kory Kroft. Forthcoming. “Salience and Taxation: Theory and Evidence.” American Economic Review.
Cochran, William G. 1968. “The Effectiveness of Adjustment by Subclassification in Removing Bias in Observational Studies.” Biometrics, 24(2): 295–314.
Cochran, William G., and Donald B. Rubin. 1973. “Controlling Bias in Observational Studies: A Review.” Sankhya, 35(4): 417–46.
Cook, Thomas D. 2008. “‘Waiting for Life to Arrive’: A History of the Regression–Discontinuity Design in Psychology, Statistics and Economics.” Journal of Econometrics, 142(2): 636–54.
Cook, Philip J., and George Tauchen. 1982. “The Effect of Liquor Taxes on Heavy Drinking.” Bell Journal of Economics, 13(2): 379–90.
Cook, Philip J., and George Tauchen. 1984. “The Effect of Minimum Drinking Age Legislation on Youthful Auto Fatalities, 1970–1977.” Journal of Legal Studies, 13(1): 169–90.
Crump, Richard K., V. Joseph Hotz, Guido W. Imbens, and Oscar A. Mitnik. 2009. “Dealing with Limited Overlap in Estimation of Average Treatment Effects.” Biometrika, 96(1): 187–99.
Crump, Richard K., V. Joseph Hotz, Guido W. Imbens, and Oscar A. Mitnik. 2008. “Nonparametric Tests for Treatment Effect Heterogeneity.” Review of Economics and Statistics, 90(3): 389–405.
Davison, A. C., and D. V. Hinkley. 1997. Bootstrap Methods and Their Application. Cambridge and New York: Cambridge University Press.
Dehejia, Rajeev H. 2003. “Was There a Riverside Miracle? A Hierarchical Framework for Evaluating Programs with Grouped Data.” Journal of Business and Economic Statistics, 21(1): 1–11.
Dehejia, Rajeev H. 2005a. “Practical Propensity Score Matching: A Reply to Smith and Todd.” Journal of Econometrics, 125(1–2): 355–64.
Dehejia, Rajeev H. 2005b. “Program Evaluation as a Decision Problem.” Journal of Econometrics, 125(1–2): 141–73.
Dehejia, Rajeev H., and Sadek Wahba. 1999. “Causal Effects in Nonexperimental Studies: Reevaluating the Evaluation of Training Programs.” Journal of the American Statistical Association, 94(448): 1053–62.
Diamond, Alexis, and Jasjeet S. Sekhon. 2008. “Genetic Matching for Estimating Causal Effects: A General Multivariate Matching Method for Achieving Balance in Observational Studies.” Unpublished.
DiNardo, John, and David S. Lee. 2004. “Economic Impacts of New Unionization on Private Sector Employers: 1984–2001.” Quarterly Journal of Economics, 119(4): 1383–1441.
Doksum, Kjell. 1974. “Empirical Probability Plots and Statistical Inference for Nonlinear Models in the Two-Sample Case.” Annals of Statistics, 2(2): 267–77.
Donald, Stephen G., and Kevin Lang. 2007. “Inference with Difference-in-Differences and Other Panel Data.” Review of Economics and Statistics, 89(2): 221–33.
Duflo, Esther. 2001. “Schooling and Labor Market Consequences of School Construction in Indonesia: Evidence from an Unusual Policy Experiment.” American Economic Review, 91(4): 795–813.
Duflo, Esther, William Gale, Jeffrey B. Liebman, Peter Orszag, and Emmanuel Saez. 2006. “Saving Incentives for Low- and Middle-Income Families: Evidence from a Field Experiment with H&R Block.” Quarterly Journal of Economics, 121(4): 1311–46.
Duflo, Esther, Rachel Glennerster, and Michael Kremer. 2008. “Using Randomization in Development Economics Research: A Toolkit.” In Handbook of Development Economics, Volume 4, ed. T. Paul Schultz and John Strauss, 3895–3962. Amsterdam and Oxford: Elsevier, North-Holland.
Duflo, Esther, and Rema Hanna. 2005. “Monitoring Works: Getting Teachers to Come to School.” National Bureau of Economic Research Working Paper 11880.
Duflo, Esther, and Emmanuel Saez. 2003. “The Role of Information and Social Interactions in Retirement Plan Decisions: Evidence from a Randomized Experiment.” Quarterly Journal of Economics, 118(3): 815–42.
Efron, Bradley, and Robert J. Tibshirani. 1993. An Introduction to the Bootstrap. New York and London: Chapman and Hall.
Eissa, Nada, and Jeffrey B. Liebman. 1996. “Labor Supply Response to the Earned Income Tax Credit.” Quarterly Journal of Economics, 111(2): 605–37.
Engle, Robert F., David F. Hendry, and Jean-Francois Richard. 1983. “Exogeneity.” Econometrica, 51(2): 277–304.
Fan, J., and I. Gijbels. 1996. Local Polynomial Modelling and Its Applications. London: Chapman and Hall.
Ferraz, Claudio, and Frederico Finan. 2008. “Exposing Corrupt Politicians: The Effects of Brazil’s Publicly Released Audits on Electoral Outcomes.” Quarterly Journal of Economics, 123(2): 703–45.
Firpo, Sergio. 2007. “Efficient Semiparametric Estimation of Quantile Treatment Effects.” Econometrica, 75(1): 259–76.
Fisher, Ronald A. 1935. The Design of Experiments, First edition. London: Oliver and Boyd.
Flores, Carlos A. 2005. “Estimation of Dose-Response Functions and Optimal Doses with a Continuous Treatment.” Unpublished.
Fraker, Thomas, and Rebecca Maynard. 1987. “The Adequacy of Comparison Group Designs for Evaluations of Employment-Related Programs.” Journal of Human Resources, 22(2): 194–227.
Friedlander, Daniel, and Judith M. Gueron. 1992. “Are High-Cost Services More Effective than Low-Cost Services?” In Evaluating Welfare and Training Programs, ed. Charles F. Manski and Irwin Garfinkel, 143–98. Cambridge and London: Harvard University Press.
Friedlander, Daniel, and Philip K. Robins. 1995. “Evaluating Program Evaluations: New Evidence on Commonly Used Nonexperimental Methods.” American Economic Review, 85(4): 923–37.
Frölich, Markus. 2004a. “Finite-Sample Properties of Propensity-Score Matching and Weighting Estimators.” Review of Economics and Statistics, 86(1): 77–90.
Frölich, Markus. 2004b. “A Note on the Role of the Propensity Score for Estimating Average Treatment Effects.” Econometric Reviews, 23(2): 167–74.
Gill, Richard D., and James M. Robins. 2001. “Causal Inference for Complex Longitudinal Data: The Continuous Case.” Annals of Statistics, 29(6): 1785–1811.
Glaeser, Edward L., Bruce Sacerdote, and Jose A. Scheinkman. 1996. “Crime and Social Interactions.” Quarterly Journal of Economics, 111(2): 507–48.
Goldberger, Arthur S. 1972a. “Selection Bias in Evaluating Treatment Effects: Some Formal Illustrations.” Unpublished.
Goldberger, Arthur S. 1972b. “Selection Bias in Evaluating Treatment Effects: The Case of Interaction.” Unpublished.
Gourieroux, C., A. Monfort, and A. Trognon. 1984a. “Pseudo Maximum Likelihood Methods: Applications to Poisson Models.” Econometrica, 52(3): 701–20.
Gourieroux, C., A. Monfort, and A. Trognon. 1984b. “Pseudo Maximum Likelihood Methods: Theory.” Econometrica, 52(3): 681–700.
Graham, Bryan S. 2008. “Identifying Social Interactions through Conditional Variance Restrictions.” Econometrica, 76(3): 643–60.
Graham, Bryan S., Guido W. Imbens, and Geert Ridder. 2006. “Complementarity and Aggregate Implications of Assortative Matching: A Nonparametric Analysis.” Unpublished.
Greenberg, David, and Michael Wiseman. 1992. “What Did the OBRA Demonstrations Do?” In Evaluating Welfare and Training Programs, ed. Charles F. Manski and Irwin Garfinkel, 25–75. Cambridge and London: Harvard University Press.
Gu, X., and Paul R. Rosenbaum. 1993. “Comparison of Multivariate Matching Methods: Structures, Distances and Algorithms.” Journal of Computational and Graphical Statistics, 2(4): 405–20.
Gueron, Judith M., and Edward Pauly. 1991. From Welfare to Work. New York: Russell Sage Foundation.
Haavelmo, Trygve. 1943. “The Statistical Implications of a System of Simultaneous Equations.” Econometrica, 11(1): 1–12.
Hahn, Jinyong. 1998. “On the Role of the Propensity Score in Efficient Semiparametric Estimation of Average Treatment Effects.” Econometrica, 66(2): 315–31.
Hahn, Jinyong, Petra E. Todd, and Wilbert van der Klaauw. 2001. “Identification and Estimation of Treatment Effects with a Regression-Discontinuity Design.” Econometrica, 69(1): 201–09.
Ham, John C., and Robert J. LaLonde. 1996. “The Effect of Sample Selection and Initial Conditions in Duration Models: Evidence from Experimental Data on Training.” Econometrica, 64(1): 175–205.
Hamermesh, Daniel S., and Jeff E. Biddle. 1994. “Beauty and the Labor Market.” American Economic Review, 84(5): 1174–94.
Hansen, B. B. 2008. “The Essential Role of Balance Tests in Propensity-Matched Observational Studies: Comments on ‘A Critical Appraisal of Propensity-Score Matching in the Medical Literature between 1996 and 2003’ by Peter Austin.” Statistics in Medicine, 27(12): 2050–54.
Hansen, Christian B. 2007a. “Asymptotic Properties of a Robust Variance Matrix Estimator for Panel Data When T Is Large.” Journal of Econometrics, 141(2): 597–620.
Hansen, Christian B. 2007b. “Generalized Least Squares Inference in Panel and Multilevel Models with Serial Correlation and Fixed Effects.” Journal of Econometrics, 140(2): 670–94.
Hanson, Samuel, and Adi Sunderam. 2008. “The Variance of Average Treatment Effect Estimators in the Presence of Clustering.” Unpublished.
Härdle, Wolfgang. 1990. Applied Nonparametric Regression. Cambridge; New York and Melbourne: Cambridge University Press.
Heckman, James J. 1990. “Varieties of Selection Bias.” American Economic Review, 80(2): 313–18.
Heckman, James J., and V. Joseph Hotz. 1989. “Choosing among Alternative Nonexperimental
Methods for Estimating the Impact of Social Programs: The Case of Manpower Training.” Journal of the American Statistical Association, 84(408): 862–74.
Heckman, James J., Hidehiko Ichimura, Jeffrey A. Smith, and Petra E. Todd. 1998. “Characterizing Selection Bias Using Experimental Data.” Econometrica, 66(5): 1017–98.
Heckman, James J., Hidehiko Ichimura, and Petra E. Todd. 1997. “Matching as an Econometric Evaluation Estimator: Evidence from Evaluating a Job Training Programme.” Review of Economic Studies, 64(4): 605–54.
Heckman, James J., Hidehiko Ichimura, and Petra E. Todd. 1998. “Matching as an Econometric Evaluation Estimator.” Review of Economic Studies, 65(2): 261–94.
Heckman, James J., Robert J. LaLonde, and Jeffrey A. Smith. 1999. “The Economics and Econometrics of Active Labor Market Programs.” In Handbook of Labor Economics, Volume 3A, ed. Orley Ashenfelter and David Card, 1865–2097. Amsterdam; New York and Oxford: Elsevier Science, North-Holland.
Heckman, James J., Lance Lochner, and Christopher Taber. 1999. “Human Capital Formation and General Equilibrium Treatment Effects: A Study of Tax and Tuition Policy.” Fiscal Studies, 20(1): 25–40.
Heckman, James J., and Salvador Navarro-Lozano. 2004. “Using Matching, Instrumental Variables, and Control Functions to Estimate Economic Choice Models.” Review of Economics and Statistics, 86(1): 30–57.
Heckman, James J., and Richard Robb Jr. 1985. “Alternative Methods for Evaluating the Impact of Interventions.” In Longitudinal Analysis of Labor Market Data, ed. James J. Heckman and Burton Singer, 156–245. Cambridge; New York and Sydney: Cambridge University Press.
Heckman, James J., and Jeffrey A. Smith. 1995. “Assessing the Case for Social Experiments.” Journal of Economic Perspectives, 9(2): 85–110.
Heckman, James J., and Jeffrey A. Smith. 1997. “Making the Most Out of Programme Evaluations and Social Experiments: Accounting for Heterogeneity in Programme Impacts.” Review of Economic Studies, 64(4): 487–535.
Heckman, James J., Sergio Urzua, and Edward Vytlacil. 2006. “Understanding Instrumental Variables in Models with Essential Heterogeneity.” Review of Economics and Statistics, 88(3): 389–432.
Heckman, James J., and Edward Vytlacil. 2005. “Structural Equations, Treatment Effects, and Econometric Policy Evaluation.” Econometrica, 73(3): 669–738.
Heckman, James J., and Edward Vytlacil. 2007a. “Econometric Evaluation of Social Programs, Part I: Causal Models, Structural Models and Econometric Policy Evaluation.” In Handbook of Econometrics, Volume 6B, ed. James J. Heckman and Edward E. Leamer, 4779–4874. Amsterdam and Oxford: Elsevier, North-Holland.
Heckman, James J., and Edward Vytlacil. 2007b. “Econometric Evaluation of Social Programs, Part II: Using the Marginal Treatment Effect to Organize Alternative Econometric Estimators to Evaluate Social Programs, and to Forecast Their Effects in New Environments.” In Handbook of Econometrics, Volume 6B, ed. James J. Heckman and Edward E. Leamer, 4875–5143. Amsterdam and Oxford: Elsevier, North-Holland.
Hill, Jennifer. 2008. “Discussion of Research Using Propensity-Score Matching: Comments on ‘A Critical Appraisal of Propensity-Score Matching in the Medical Literature between 1996 and 2003’ by Peter Austin.” Statistics in Medicine, 27(12): 2055–61.
Hirano, Keisuke, and Guido W. Imbens. 2001. “Estimation of Causal Effects Using Propensity Score Weighting: An Application to Data on Right Heart Catheterization.” Health Services and Outcomes Research Methodology, 2(3–4): 259–78.
Hirano, Keisuke, and Guido W. Imbens. 2004. “The Propensity Score with Continuous Treatments.” In Applied Bayesian Modeling and Causal Inference from Incomplete-Data Perspectives, ed. Andrew Gelman and Xiao-Li Meng, 73–84. Hoboken, N.J.: Wiley.
Hirano, Keisuke, Guido W. Imbens, and Geert Ridder. 2003. “Efficient Estimation of Average Treatment Effects Using the Estimated Propensity Score.” Econometrica, 71(4): 1161–89.
Hirano, Keisuke, Guido W. Imbens, Donald B. Rubin, and Xiao-Hua Zhou. 2000. “Assessing the Effect of an Influenza Vaccine in an Encouragement Design.” Biostatistics, 1(1): 69–88.
Hirano, Keisuke, and Jack R. Porter. 2008. “Asymptotics for Statistical Treatment Rules.” http://www.u.arizona.edu/~hirano/hp3_2008_08_10.pdf.
Holland, Paul W. 1986. “Statistics and Causal Inference.” Journal of the American Statistical Association, 81(396): 945–60.
Horowitz, Joel L. 2001. “The Bootstrap.” In Handbook of Econometrics, Volume 5, ed. James J. Heckman and Edward Leamer, 3159–3228. Amsterdam; London and New York: Elsevier Science, North-Holland.
Horowitz, Joel L., and Charles F. Manski. 2000. “Nonparametric Analysis of Randomized Experiments with Missing Covariate and Outcome Data.” Journal of the American Statistical Association, 95(449): 77–84.
Horvitz, D. G., and D. J. Thompson. 1952. “A Generalization of Sampling without Replacement from a Finite Universe.” Journal of the American Statistical Association, 47(260): 663–85.
Hotz, V. Joseph, Guido W. Imbens, and Jacob A. Klerman. 2006. “Evaluating the Differential Effects of Alternative Welfare-to-Work Training Components: A Reanalysis of the California GAIN Program.” Journal of Labor Economics, 24(3): 521–66.
Hotz, V. Joseph, Guido W. Imbens, and Julie H. Mortimer. 2005. “Predicting the Efficacy of Future Training Programs Using Past Experiences at Other
Locations.” Journal of Econometrics, 125(1–2): 241–70.
Hotz, V. Joseph, Charles H. Mullin, and Seth G. Sanders. 1997. “Bounding Causal Effects Using Data from a Contaminated Natural Experiment: Analysing the Effects of Teenage Childbearing.” Review of Economic Studies, 64(4): 575–603.
Iacus, Stefano M., Gary King, and Giuseppe Porro. 2008. “Matching for Causal Inference without Balance Checking.” Unpublished.
Ichimura, Hidehiko, and Oliver Linton. 2005. “Asymptotic Expansions for Some Semiparametric Program Evaluation Estimators.” In Identification and Inference for Econometric Models: Essays in Honor of Thomas Rothenberg, ed. Donald W. K. Andrews and James H. Stock, 149–70. Cambridge and New York: Cambridge University Press.
Ichimura, Hidehiko, and Petra E. Todd. 2007. “Implementing Nonparametric and Semiparametric Estimators.” In Handbook of Econometrics, Volume 6B, ed. James J. Heckman and Edward E. Leamer, 5369–5468. Amsterdam and Oxford: Elsevier, North-Holland.
Imbens, Guido W. 2000. “The Role of the Propensity Score in Estimating Dose-Response Functions.” Biometrika, 87(3): 706–10.
Imbens, Guido W. 2003. “Sensitivity to Exogeneity Assumptions in Program Evaluation.” American Economic Review, 93(2): 126–32.
Imbens, Guido W. 2004. “Nonparametric Estimation of Average Treatment Effects under Exogeneity: A Review.” Review of Economics and Statistics, 86(1): 4–29.
Imbens, Guido W. 2007. “Non-additive Models with Endogenous Regressors.” In Advances in Economics and Econometrics: Theory and Applications, Ninth World Congress, Volume 3, ed. Richard Blundell, Whitney K. Newey, and Torsten Persson, 17–46. Cambridge and New York: Cambridge University Press.
Imbens, Guido W., and Joshua D. Angrist. 1994. “Identification and Estimation of Local Average Treatment Effects.” Econometrica, 62(2): 467–75.
Imbens, Guido W., and Karthik Kalyanaraman. 2009. “Optimal Bandwidth Choice for the Regression Discontinuity Estimator.” National Bureau of Economic Research Working Paper 14726.
Imbens, Guido W., Gary King, David McKenzie, and Geert Ridder. 2008. “On the Benefits of Stratification in Randomized Experiments.” Unpublished.
Imbens, Guido W., and Thomas Lemieux. 2008. “Regression Discontinuity Designs: A Guide to Practice.” Journal of Econometrics, 142(2): 615–35.
Imbens, Guido W., and Charles F. Manski. 2004. “Confidence Intervals for Partially Identified Parameters.” Econometrica, 72(6): 1845–57.
Imbens, Guido W., and Whitney K. Newey. Forthcoming. “Identification and Estimation of Triangular Simultaneous Equations Models without Additivity.” National Bureau of Economic Research Technical Econometrica.
Imbens, Guido W., Whitney K. Newey, and Geert Ridder. 2005. “Mean-Squared-Error Calculations for Average Treatment Effects.” Unpublished.
Imbens, Guido W., and Donald B. Rubin. 1997a. “Bayesian Inference for Causal Effects in Randomized Experiments with Noncompliance.” Annals of Statistics, 25(1): 305–27.
Imbens, Guido W., and Donald B. Rubin. 1997b. “Estimating Outcome Distributions for Compliers in Instrumental Variables Models.” Review of Economic Studies, 64(4): 555–74.
Imbens, Guido W., and Donald B. Rubin. Forthcoming. Causal Inference in Statistics and the Social Sciences. Cambridge and New York: Cambridge University Press.
Imbens, Guido W., Donald B. Rubin, and Bruce I. Sacerdote. 2001. “Estimating the Effect of Unearned Income on Labor Earnings, Savings, and Consumption: Evidence from a Survey of Lottery Players.” American Economic Review, 91(4): 778–94.
Jin, Ginger Zhe, and Phillip Leslie. 2003. “The Effect of Information on Product Quality: Evidence from Restaurant Hygiene Grade Cards.” Quarterly Journal of Economics, 118(2): 409–51.
Joffe, Marshall M., and Paul R. Rosenbaum. 1999. “Invited Commentary: Propensity Scores.” American Journal of Epidemiology, 150(4): 327–33.
Kitagawa, Toru. 2008. “Identification Bounds for the Local Average Treatment Effect.” Unpublished.
Kling, Jeffrey R., Jeffrey B. Liebman, and Lawrence F. Katz. 2007. “Experimental Analysis of Neighborhood Effects.” Econometrica, 75(1): 83–119.
Lalive, Rafael. 2008. “How Do Extended Benefits Affect Unemployment Duration? A Regression Discontinuity Approach.” Journal of Econometrics, 142(2): 785–806.
LaLonde, Robert J. 1986. “Evaluating the Econometric Evaluations of Training Programs with Experimental Data.” American Economic Review, 76(4): 604–20.
Lechner, Michael. 1999. “Earnings and Employment Effects of Continuous Off-the-Job Training in East Germany after Unification.” Journal of Business and Economic Statistics, 17(1): 74–90.
Lechner, Michael. 2001. “Identification and Estimation of Causal Effects of Multiple Treatments under the Conditional Independence Assumption.” In Econometric Evaluation of Labour Market Policies, ed. Michael Lechner and Friedhelm Pfeiffer, 43–58. Heidelberg and New York: Physica; Mannheim: Centre for European Economic Research.
Lechner, Michael. 2002a. “Program Heterogeneity and Propensity Score Matching: An Application to the Evaluation of Active Labor Market Policies.” Review of Economics and Statistics, 84(2): 205–20.
Lechner, Michael. 2002b. “Some Practical Issues in the Evaluation of Heterogeneous Labour Market Programmes by Matching Methods.” Journal of the Royal Statistical Society: Series A (Statistics in Society), 165(1): 59–82.
Lechner, Michael. 2004. “Sequential Matching
McEwan, Patrick J., and Joseph S. Shapiro. 2008. “The Quade, D. 1982. “Nonparametric Analysis of Covari-
Benefits of Delayed Primary School Enrollment: ance By Matching.” Biometrics, 38(3): 597–611.
Discontinuity Estimates Using Exact Birth Dates.” Racine, Jeffrey S., and Qi Li. 2004. “Nonparametric
Journal of Human Resources, 43(1): 1–29. Estimation of Regression Functions with Both Cat-
Mealli, Fabrizia, Guido W. Imbens, Salvatore Ferro, egorical and Continuous Data.” Journal of Econo-
and Annibale Biggeri. 2004. “Analyzing a Random- metrics, 119(1): 99–130.
ized Trial on Breast Self-Examination with Noncom- Riccio, James, and Daniel Friedlander. 1992. GAIN:
pliance and Missing Outcomes.” Biostatistics, 5(2): Program Strategies, Participation Patterns, and
207–22. First-Year Impacts in Six Counties. New York:
Meyer, Bruce D., W. Kip Viscusi, and David L. Durbin. Manpower Demonstration Research Corporation.
1995. “Workers’ Compensation and Injury Duration: Riccio, James, Daniel Friedlander, and Stephen Freed-
Evidence from a Natural Experiment.” American man. 1994. GAIN: Benefits, Costs, and Three-Year
Economic Review, 85(3): 322–40. Impacts of a Welfare-to-Work Program. New York:
Miguel, Edward, and Michael Kremer. 2004. “Worms: Manpower Demonstration Research Corporation.
Identifying Impacts on Education and Health in the Robins, James M., and Ya’acov Ritov. 1997. “Toward
Presence of Treatment Externalities.” Economet- a Curse of Dimensionality Appropriate (CODA)
rica, 72(1): 159–217. Asymptotic Theory for Semi-parametric Models.”
Morgan, Stephen L., and Christopher Winship. 2007. Statistics in Medicine, 16(3): 285–319.
Counterfactuals and Causal Inference: Methods and Robins, James M., and Andrea Rotnitzky. 1995. “Semi-
Principles for Social Research. Cambridge and New parametric Efficiency in Multivariate Regression
York: Cambridge University Press. Models with Missing Data.” Journal of the American
Moulton, Brent R. 1990. “An Illustration of a Pitfall Statistical Association, 90(429): 122–29.
in Estimating the Effects of Aggregate Variables on Robins, James M., Andrea Rotnitzky, and Lue Ping
Micro Unit.” Review of Economics and Statistics, Zhao. 1995. “Analysis of Semiparametric Regression
72(2): 334–38. Models for Repeated Outcomes in the Presence of
Moulton, Brent R., and William C. Randolph. 1989. Missing Data.” Journal of the American Statistical
“Alternative Tests of the Error Components Model.” Association, 90(429): 106–21.
Econometrica, 57(3): 685–93. Robinson, Peter M. 1988. “Root-N-Consistent Semi-
Newey, Whitney K. 1994a. “Kernel Estimation of parametric Regression.” Econometrica, 56(4):
Partial Means and a General Variance Estimator.” 931–54.
Econometric Theory, 10(2): 233–53. Romano, Joseph P., and Azeem M. Shaikh. 2006a.
Newey, Whitney K. 1994b. “Series Estimation of “Inference for Identifiable Parameters in Partially
Regression Functionals.” Econometric Theory, Identified Econometric Models.” Stanford University
10(1): 1–28. Department of Statistics Technical Report 2006-9.
Olken, Benjamin A. 2007. “Monitoring Corruption: Romano, Joseph P., and Azeem M. Shaikh. 2006b.
Evidence from a Field Experiment in Indonesia.” “Inference for the Identified Set in Partially Identi-
Journal of Political Economy, 115(2): 200–249. fied Econometric Models.” Unpublished.
Pagan, Adrian, and Aman Ullah. 1999. Nonparamet- Rosen, Adam M. 2006. “Confidence Sets for Partially
ric Econometrics. Cambridge; New York and Mel- Identified Parameters That Satisfy a Finite Number
bourne: Cambridge University Press. of Moment Inequalities.” Institute for Fiscal Studies
Pakes, Ariel, Jack R. Porter, Kate Ho, and Joy Ishii. Centre for Microdata Methods and Practice Work-
2006. “Moment Inequalities and Their Application.” ing Paper CWP25/06.
Institute for Fiscal Studies Centre for Microdata Rosenbaum, Paul R. 1984a. “Conditional Permutation
Methods and Practice Working Paper CWP16/07. Tests and the Propensity Score in Observational
Pearl, Judea. 2000. Causality: Models, Reasoning, and Studies.” Journal of the American Statistical Asso-
Inference. Cambridge; New York and Melbourne: ciation, 79(387): 565–74.
Cambridge University Press. Rosenbaum, Paul R. 1984b. “The Consequences of
Pettersson-Lidbom, Per. 2007. “The Policy Conse- Adjustment for a Concomitant Variable That Has
quences of Direct versus Representative Democracy: Been Affected By the Treatment.” Journal of the
A Regression-Discontinuity Approach.” Unpublished. Royal Statistical Society: Series A (Statistics in Soci-
Pettersson-Libdom, Per. 2008. “Does the Size of the ety), 147(5): 656–66.
Legislature Affect the Size of Government? Evidence Rosenbaum, Paul R. 1987. “The Role of a Second Con-
from Two Natural Experiments.” Unpublished. trol Group in an Observational Study.” Statistical
Pettersson-Lidbom, Per, and Björn Tyrefors. 2007. “Do Science, 2(3): 292–306.
Parties Matter for Economic Outcomes? A Regres- Rosenbaum, Paul R. 1989. “Optimal Matching for
sion-Discontinuity Approach.” Unpublished. Observational Studies.” Journal of the American
Politis, Dimitris N., Joseph P. Romano, and Michael Statistical Association, 84(408): 1024–32.
Wolf. 1999. Subsampling. New York: Springer, Rosenbaum, Paul R. 1995. Observational Studies. New
Verlag York; Heidelberg and London: Springer.
Porter, Jack R. 2003. “Estimation in the Regression Rosenbaum, Paul R. 2002. “Covariance Adjustment
Discontinuity Model.” Unpublished. in Randomized Experiments and Observational
Studies.” Statistical Science, 17(3): 286–327.
Rosenbaum, Paul R., and Donald B. Rubin. 1983a. “Assessing Sensitivity to an Unobserved Binary Covariate in an Observational Study with Binary Outcome.” Journal of the Royal Statistical Society: Series B (Statistical Methodology), 45(2): 212–18.
Rosenbaum, Paul R., and Donald B. Rubin. 1983b. “The Central Role of the Propensity Score in Observational Studies for Causal Effects.” Biometrika, 70(1): 41–55.
Rosenbaum, Paul R., and Donald B. Rubin. 1984. “Reducing Bias in Observational Studies Using Subclassification on the Propensity Score.” Journal of the American Statistical Association, 79(387): 516–24.
Rosenbaum, Paul R., and Donald B. Rubin. 1985. “Constructing a Control Group Using Multivariate Matched Sampling Methods That Incorporate the Propensity Score.” American Statistician, 39(1): 33–38.
Rotnitzky, Andrea, and James M. Robins. 1995. “Semiparametric Regression Estimation in the Presence of Dependent Censoring.” Biometrika, 82(4): 805–20.
Roy, A. D. 1951. “Some Thoughts on the Distribution of Earnings.” Oxford Economic Papers, 3(2): 135–46.
Rubin, Donald B. 1973a. “Matching to Remove Bias in Observational Studies.” Biometrics, 29(1): 159–83.
Rubin, Donald B. 1973b. “The Use of Matched Sampling and Regression Adjustment to Remove Bias in Observational Studies.” Biometrics, 29(1): 184–203.
Rubin, Donald B. 1974. “Estimating Causal Effects of Treatments in Randomized and Nonrandomized Studies.” Journal of Educational Psychology, 66(5): 688–701.
Rubin, Donald B. 1976. “Inference and Missing Data.” Biometrika, 63(3): 581–92.
Rubin, Donald B. 1977. “Assignment to Treatment Group on the Basis of a Covariate.” Journal of Educational Statistics, 2(1): 1–26.
Rubin, Donald B. 1978. “Bayesian Inference for Causal Effects: The Role of Randomization.” Annals of Statistics, 6(1): 34–58.
Rubin, Donald B. 1979. “Using Multivariate Matched Sampling and Regression Adjustment to Control Bias in Observational Studies.” Journal of the American Statistical Association, 74(366): 318–28.
Rubin, Donald B. 1987. Multiple Imputation for Nonresponse in Surveys. New York: Wiley.
Rubin, Donald B. 1990. “Formal Modes of Statistical Inference for Causal Effects.” Journal of Statistical Planning and Inference, 25(3): 279–92.
Rubin, Donald B. 1997. “Estimating Causal Effects from Large Data Sets Using Propensity Scores.” Annals of Internal Medicine, 127(5 Part 2): 757–63.
Rubin, Donald B. 2006. Matched Sampling for Causal Effects. Cambridge and New York: Cambridge University Press.
Rubin, Donald B., and Neal Thomas. 1992a. “Affinely Invariant Matching Methods with Ellipsoidal Distributions.” Annals of Statistics, 20(2): 1079–93.
Rubin, Donald B., and Neal Thomas. 1992b. “Characterizing the Effect of Matching Using Linear Propensity Score Methods with Normal Distributions.” Biometrika, 79(4): 797–809.
Rubin, Donald B., and Neal Thomas. 1996. “Matching Using Estimated Propensity Scores: Relating Theory to Practice.” Biometrics, 52(1): 249–64.
Rubin, Donald B., and Neal Thomas. 2000. “Combining Propensity Score Matching with Additional Adjustments for Prognostic Covariates.” Journal of the American Statistical Association, 95(450): 573–85.
Sacerdote, Bruce. 2001. “Peer Effects with Random Assignment: Results for Dartmouth Roommates.” Quarterly Journal of Economics, 116(2): 681–704.
Scharfstein, Daniel O., Andrea Rotnitzky, and James M. Robins. 1999. “Adjusting for Nonignorable Drop-Out Using Semiparametric Nonresponse Models.” Journal of the American Statistical Association, 94(448): 1096–1120.
Schultz, T. Paul. 2001. “School Subsidies for the Poor: Evaluating the Mexican Progresa Poverty Program.” Yale University Economic Growth Center Discussion Paper 834.
Seifert, Burkhardt, and Theo Gasser. 1996. “Finite-Sample Variance of Local Polynomials: Analysis and Solutions.” Journal of the American Statistical Association, 91(433): 267–75.
Seifert, Burkhardt, and Theo Gasser. 2000. “Data Adaptive Ridging in Local Polynomial Regression.” Journal of Computational and Graphical Statistics, 9(2): 338–60.
Sekhon, Jasjeet S. Forthcoming. “Multivariate and Propensity Score Matching Software with Automated Balance Optimization: The Matching Package for R.” Journal of Statistical Software.
Sekhon, Jasjeet S., and Richard Grieve. 2008. “A New Non-parametric Matching Method for Bias Adjustment with Applications to Economic Evaluations.” http://sekhon.berkeley.edu/papers/GeneticMatching_SekhonGrieve.pdf.
Shadish, William R., Thomas D. Cook, and Donald T. Campbell. 2002. Experimental and Quasi-Experimental Designs for Generalized Causal Inference. Boston: Houghton Mifflin.
Smith, Jeffrey A., and Petra E. Todd. 2001. “Reconciling Conflicting Evidence on the Performance of Propensity-Score Matching Methods.” American Economic Review, 91(2): 112–18.
Smith, Jeffrey A., and Petra E. Todd. 2005. “Does Matching Overcome LaLonde’s Critique of Nonexperimental Estimators?” Journal of Econometrics, 125(1–2): 305–53.
Splawa-Neyman, Jerzy. 1990. “On the Application of Probability Theory to Agricultural Experiments. Essays on Principles. Section 9.” Statistical Science, 5(4): 465–72. (Orig. pub. 1923.)
Stock, James H. 1989. “Nonparametric Policy Analysis.” Journal of the American Statistical Association, 84(406): 567–75.
Stone, Charles J. 1977. “Consistent Nonparametric Regression.” Annals of Statistics, 5(4): 595–620.
Stoye, Jörg. 2007. “More on Confidence Intervals for