
Lecture Notes on Nonparametrics

Bruce E. Hansen
University of Wisconsin
Spring 2009
1 Introduction
Parametric means finite-dimensional. Non-parametric means infinite-dimensional. The differences are profound.

Typically, parametric estimates converge at a $n^{-1/2}$ rate. Non-parametric estimates typically converge at a rate slower than $n^{-1/2}$.

Typically, in parametric models there is no distinction between the true model and the fitted model. In contrast, non-parametric methods typically distinguish between the true and fitted models.

Non-parametric methods make the complexity of the fitted model depend upon the sample. The more information is in the sample (i.e., the larger the sample size), the greater the degree of complexity of the fitted model. Taking this seriously requires a distinct distribution theory.

Non-parametric theory acknowledges that fitted models are approximations, and therefore are inherently misspecified. Misspecification implies estimation bias. Typically, increasing the complexity of a fitted model decreases this bias but increases the estimation variance. Nonparametric methods acknowledge this trade-off and attempt to set model complexity to minimize an overall measure of fit, typically mean-squared error (MSE).

There are many nonparametric statistical objects of potential interest, including density functions (univariate and multivariate), density derivatives, conditional density functions, conditional distribution functions, regression functions, median functions, quantile functions, and variance functions. Sometimes these nonparametric objects are of direct interest. Sometimes they are of interest only as an input to a second-stage estimation problem. If this second-stage problem is described by a finite dimensional parameter we call the estimation problem semiparametric.

Nonparametric methods typically involve some sort of approximation or smoothing method. Some of the main methods are called kernels, series, and splines.

Nonparametric methods are typically indexed by a bandwidth or tuning parameter which controls the degree of complexity. The choice of bandwidth is often critical to implementation. Data-dependent rules for determination of the bandwidth are therefore essential for nonparametric methods. Nonparametric methods which require a bandwidth, but do not have an explicit data-dependent rule for selecting the bandwidth, are incomplete. Unfortunately this is quite common, due to the difficulty in developing rigorous rules for bandwidth selection. Often in these cases the bandwidth is selected based on a related statistical problem. This is a feasible yet worrisome compromise.

Many nonparametric problems are generalizations of univariate density estimation. We will start with this simple setting, and explore its theory in considerable detail.
2 Kernel Density Estimation
2.1 Discrete Estimator
Let $X$ be a random variable with continuous distribution $F(x)$ and density $f(x) = \frac{d}{dx}F(x)$. The goal is to estimate $f(x)$ from a random sample $X_1, \dots, X_n$.

The distribution function $F(x)$ is naturally estimated by the EDF
$$\hat F(x) = n^{-1}\sum_{i=1}^n 1\left(X_i \le x\right).$$
It might seem natural to estimate the density $f(x)$ as the derivative of $\hat F(x)$, $\frac{d}{dx}\hat F(x)$, but this estimator would be a set of mass points, not a density, and as such is not a useful estimate of $f(x)$.

Instead, consider a discrete derivative. For some small $h > 0$, let
$$\hat f(x) = \frac{\hat F(x+h) - \hat F(x-h)}{2h}.$$
We can write this as
$$\frac{1}{2nh}\sum_{i=1}^n 1\left(x - h < X_i \le x + h\right) = \frac{1}{2nh}\sum_{i=1}^n 1\left(\frac{|X_i - x|}{h} \le 1\right) = \frac{1}{nh}\sum_{i=1}^n k\left(\frac{X_i - x}{h}\right)$$
where
$$k(u) = \begin{cases} \dfrac{1}{2}, & |u| \le 1 \\ 0, & |u| > 1 \end{cases}$$
is the uniform density function on $[-1, 1]$.

The estimator $\hat f(x)$ counts the percentage of observations which are close to the point $x$. If many observations are near $x$, then $\hat f(x)$ is large. Conversely, if only a few $X_i$ are near $x$, then $\hat f(x)$ is small. The bandwidth $h$ controls the degree of smoothing.

$\hat f(x)$ is a special case of what is called a kernel estimator. The general case is
$$\hat f(x) = \frac{1}{nh}\sum_{i=1}^n k\left(\frac{X_i - x}{h}\right)$$
where $k(u)$ is a kernel function.
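The estimator is simple to compute directly. The following is a minimal sketch (in Python, not part of the original notes) of the uniform-kernel estimator just described; the function and variable names are my own choices.

```python
# Minimal sketch: uniform-kernel density estimator (discrete derivative of the EDF).
import numpy as np

def kde_uniform(x, data, h):
    """Kernel density estimate at points x with the uniform kernel and bandwidth h."""
    x = np.atleast_1d(x)
    u = (data[None, :] - x[:, None]) / h        # (X_i - x)/h for each evaluation point
    k = 0.5 * (np.abs(u) <= 1)                  # uniform kernel weights
    return k.sum(axis=1) / (len(data) * h)      # (nh)^{-1} sum_i k((X_i - x)/h)

rng = np.random.default_rng(0)
X = rng.normal(size=200)
grid = np.linspace(-3, 3, 5)
print(kde_uniform(grid, X, h=0.5))
```

Any other kernel function can be substituted for the uniform weights without changing the structure of the calculation.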
2.2 Kernel Functions
A kernel function $k(u) : \mathbb{R} \to \mathbb{R}$ is any function which satisfies $\int_{-\infty}^{\infty} k(u)\,du = 1$.

A non-negative kernel satisfies $k(u) \ge 0$ for all $u$. In this case, $k(u)$ is a probability density function.

The moments of a kernel are $\kappa_j(k) = \int_{-\infty}^{\infty} u^j k(u)\,du$.

A symmetric kernel function satisfies $k(u) = k(-u)$ for all $u$. In this case, all odd moments are zero. Most nonparametric estimation uses symmetric kernels, and we focus on this case.

The order of a kernel, $\nu$, is defined as the order of the first non-zero moment. For example, if $\kappa_1(k) = 0$ and $\kappa_2(k) > 0$ then $k$ is a second-order kernel and $\nu = 2$. If $\kappa_1(k) = \kappa_2(k) = \kappa_3(k) = 0$ but $\kappa_4(k) \ne 0$ then $k$ is a fourth-order kernel and $\nu = 4$. The order of a symmetric kernel is always even.

Symmetric non-negative kernels are second-order kernels.

A kernel is a higher-order kernel if $\nu > 2$. These kernels will have negative parts and are not probability densities. They are also referred to as bias-reducing kernels.

Common second-order kernels are listed in the following table.

Table 1: Common Second-Order Kernels

Kernel | Equation | $R(k)$ | $\kappa_2(k)$ | eff$(k)$
Uniform | $k_0(u) = \frac{1}{2}\,1(|u| \le 1)$ | $1/2$ | $1/3$ | 1.0758
Epanechnikov | $k_1(u) = \frac{3}{4}\left(1 - u^2\right)1(|u| \le 1)$ | $3/5$ | $1/5$ | 1.0000
Biweight | $k_2(u) = \frac{15}{16}\left(1 - u^2\right)^2 1(|u| \le 1)$ | $5/7$ | $1/7$ | 1.0061
Triweight | $k_3(u) = \frac{35}{32}\left(1 - u^2\right)^3 1(|u| \le 1)$ | $350/429$ | $1/9$ | 1.0135
Gaussian | $k_\phi(u) = \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{u^2}{2}\right)$ | $1/(2\sqrt{\pi})$ | $1$ | 1.0513

In addition to the kernel formula we have listed its roughness $R(k)$, its second moment $\kappa_2(k)$, and its efficiency eff$(k)$, the last of which will be defined later. The roughness of a function is
$$R(g) = \int_{-\infty}^{\infty} g(u)^2\,du.$$
The most commonly used kernels are the Epanechnikov and the Gaussian.
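As a numerical cross-check of Table 1, the sketch below (my own illustration, not from the notes) codes the second-order kernels and approximates $R(k)$ and $\kappa_2(k)$ by Riemann sums.

```python
# Sketch: the Table 1 kernels as functions, with numerical R(k) and kappa_2(k).
import numpy as np

def uniform(u):      return 0.5 * (np.abs(u) <= 1)
def epanechnikov(u): return 0.75 * (1 - u**2) * (np.abs(u) <= 1)
def biweight(u):     return (15/16) * (1 - u**2)**2 * (np.abs(u) <= 1)
def triweight(u):    return (35/32) * (1 - u**2)**3 * (np.abs(u) <= 1)
def gaussian(u):     return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

u, du = np.linspace(-5, 5, 200001, retstep=True)
for name, k in [("uniform", uniform), ("epanechnikov", epanechnikov),
                ("biweight", biweight), ("triweight", triweight), ("gaussian", gaussian)]:
    R = (k(u)**2).sum() * du            # roughness R(k)
    kappa2 = (u**2 * k(u)).sum() * du   # second moment kappa_2(k)
    print(f"{name:13s} R(k) = {R:.4f}  kappa_2(k) = {kappa2:.4f}")
```

The printed values should reproduce the $R(k)$ and $\kappa_2(k)$ columns of Table 1 up to discretization error.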
The kernels in the Table are special cases of the polynomial family
$$k_s(u) = \frac{(2s+1)!!}{2^{s+1}\,s!}\left(1 - u^2\right)^s 1(|u| \le 1)$$
where the double factorial means $(2s+1)!! = (2s+1)\cdot(2s-1)\cdots 5\cdot 3\cdot 1$. The Gaussian kernel is obtained by taking the limit as $s \to \infty$ after rescaling. The kernels with higher $s$ are smoother, yielding estimates $\hat f(x)$ which are smoother and possess more derivatives. Estimates using the Gaussian kernel have derivatives of all orders.

For the purpose of nonparametric estimation the scale of the kernel is not uniquely defined. That is, for any kernel $k(u)$ we could have defined the alternative kernel $k^*(u) = b^{-1}k(u/b)$ for some constant $b > 0$. These two kernels are equivalent in the sense of producing the same density estimator, so long as the bandwidth is rescaled. That is, if $\hat f(x)$ is calculated with kernel $k$ and bandwidth $h$, it is numerically identical to a calculation with kernel $k^*$ and bandwidth $h^* = h/b$. Some authors use different definitions for the same kernels. This can cause confusion unless you are attentive.

Higher-order kernels are obtained by multiplying a second-order kernel by a $(\nu/2 - 1)$th order polynomial in $u^2$. Explicit formulae for the general polynomial family can be found in B. Hansen (Econometric Theory, 2005), and for the Gaussian family in Wand and Schucany (Canadian Journal of Statistics, 1990). 4th and 6th order kernels of interest are given in Tables 2 and 3.

Table 2: Fourth-Order Kernels

Kernel | Equation | $R(k)$ | $\kappa_4(k)$ | eff$(k)$
Epanechnikov | $k_{4,1}(u) = \frac{15}{8}\left(1 - \frac{7}{3}u^2\right)k_1(u)$ | $5/4$ | $-1/21$ | 1.0000
Biweight | $k_{4,2}(u) = \frac{7}{4}\left(1 - 3u^2\right)k_2(u)$ | $805/572$ | $-1/33$ | 1.0056
Triweight | $k_{4,3}(u) = \frac{27}{16}\left(1 - \frac{11}{3}u^2\right)k_3(u)$ | $3780/2431$ | $-3/143$ | 1.0134
Gaussian | $k_{4,\phi}(u) = \frac{1}{2}\left(3 - u^2\right)k_\phi(u)$ | $27/(32\sqrt{\pi})$ | $-3$ | 1.0729

Table 3: Sixth-Order Kernels

Kernel | Equation | $R(k)$ | $\kappa_6(k)$ | eff$(k)$
Epanechnikov | $k_{6,1}(u) = \frac{175}{64}\left(1 - 6u^2 + \frac{33}{5}u^4\right)k_1(u)$ | $1575/832$ | $5/429$ | 1.0000
Biweight | $k_{6,2}(u) = \frac{315}{128}\left(1 - \frac{22}{3}u^2 + \frac{143}{15}u^4\right)k_2(u)$ | $29295/14144$ | $1/143$ | 1.0048
Triweight | $k_{6,3}(u) = \frac{297}{128}\left(1 - \frac{26}{3}u^2 + 13u^4\right)k_3(u)$ | $301455/134368$ | $1/221$ | 1.0122
Gaussian | $k_{6,\phi}(u) = \frac{1}{8}\left(15 - 10u^2 + u^4\right)k_\phi(u)$ | $2265/(2048\sqrt{\pi})$ | $15$ | 1.0871
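To illustrate how the higher-order kernels in Table 2 are built from the second-order ones, here is a small sketch (my own, not from the notes) of the fourth-order Epanechnikov kernel with a numerical check that its second moment vanishes while its fourth moment matches the table.

```python
# Sketch: fourth-order Epanechnikov kernel k_{4,1}(u) = (15/8)(1 - (7/3)u^2) k_1(u).
import numpy as np

def epanechnikov(u):
    return 0.75 * (1 - u**2) * (np.abs(u) <= 1)

def epanechnikov4(u):
    return (15/8) * (1 - (7/3) * u**2) * epanechnikov(u)

u, du = np.linspace(-1, 1, 400001, retstep=True)
k = epanechnikov4(u)
print("integral :", (k * du).sum())          # ~ 1
print("kappa_2  :", (u**2 * k * du).sum())   # ~ 0  (second moment killed)
print("kappa_4  :", (u**4 * k * du).sum())   # ~ -1/21, as in Table 2
```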
2.3 Density Estimator
We now discuss some of the numerical properties of the kernel estimator
$$\hat f(x) = \frac{1}{nh}\sum_{i=1}^n k\left(\frac{X_i - x}{h}\right)$$
viewed as a function of $x$.

First, if $k(u)$ is non-negative then it is easy to see that $\hat f(x) \ge 0$. However, this is not guaranteed if $k$ is a higher-order kernel. That is, in this case it is possible that $\hat f(x) < 0$ for some values of $x$. When this happens it is prudent to zero-out the negative bits and then rescale:
$$\tilde f(x) = \frac{\hat f(x)\,1\left(\hat f(x) \ge 0\right)}{\int_{-\infty}^{\infty}\hat f(x)\,1\left(\hat f(x) \ge 0\right)dx}.$$
$\tilde f(x)$ is non-negative yet has the same asymptotic properties as $\hat f(x)$. Since the integral in the denominator is not analytically available it needs to be calculated numerically.

Second, $\hat f(x)$ integrates to one. To see this, first note that by the change-of-variables $u = (X_i - x)/h$, which has Jacobian $h$,
$$\int_{-\infty}^{\infty}\frac{1}{h}k\left(\frac{X_i - x}{h}\right)dx = \int_{-\infty}^{\infty}k(u)\,du = 1.$$
The change-of-variables $u = (X_i - x)/h$ will be used frequently, so it is useful to be familiar with this transformation. Thus
$$\int_{-\infty}^{\infty}\hat f(x)\,dx = \int_{-\infty}^{\infty}\frac{1}{n}\sum_{i=1}^n\frac{1}{h}k\left(\frac{X_i - x}{h}\right)dx = \frac{1}{n}\sum_{i=1}^n\int_{-\infty}^{\infty}\frac{1}{h}k\left(\frac{X_i - x}{h}\right)dx = \frac{1}{n}\sum_{i=1}^n 1 = 1$$
as claimed. Thus $\hat f(x)$ is a valid density function when $k$ is non-negative.

Third, we can also calculate the numerical moments of the density $\hat f(x)$. Again using the change-of-variables $u = (X_i - x)/h$ and the symmetry of the kernel, the mean of the estimated density is
$$\int_{-\infty}^{\infty}x\hat f(x)\,dx = \frac{1}{n}\sum_{i=1}^n\int_{-\infty}^{\infty}x\frac{1}{h}k\left(\frac{X_i - x}{h}\right)dx = \frac{1}{n}\sum_{i=1}^n\int_{-\infty}^{\infty}\left(X_i + uh\right)k(u)\,du = \frac{1}{n}\sum_{i=1}^n X_i\int_{-\infty}^{\infty}k(u)\,du + \frac{1}{n}\sum_{i=1}^n h\int_{-\infty}^{\infty}u\,k(u)\,du = \frac{1}{n}\sum_{i=1}^n X_i,$$
the sample mean of the $X_i$.

The second moment of the estimated density is
$$\int_{-\infty}^{\infty}x^2\hat f(x)\,dx = \frac{1}{n}\sum_{i=1}^n\int_{-\infty}^{\infty}x^2\frac{1}{h}k\left(\frac{X_i - x}{h}\right)dx = \frac{1}{n}\sum_{i=1}^n\int_{-\infty}^{\infty}\left(X_i + uh\right)^2 k(u)\,du = \frac{1}{n}\sum_{i=1}^n X_i^2 + \frac{2}{n}\sum_{i=1}^n X_i h\int_{-\infty}^{\infty}u\,k(u)\,du + \frac{1}{n}\sum_{i=1}^n h^2\int_{-\infty}^{\infty}u^2 k(u)\,du = \frac{1}{n}\sum_{i=1}^n X_i^2 + h^2\kappa_2(k).$$
It follows that the variance of the density $\hat f(x)$ is
$$\int_{-\infty}^{\infty}x^2\hat f(x)\,dx - \left(\int_{-\infty}^{\infty}x\hat f(x)\,dx\right)^2 = \frac{1}{n}\sum_{i=1}^n X_i^2 + h^2\kappa_2(k) - \left(\frac{1}{n}\sum_{i=1}^n X_i\right)^2 = \hat\sigma^2 + h^2\kappa_2(k)$$
where $\hat\sigma^2$ is the sample variance. Thus the density estimate inflates the sample variance by the factor $h^2\kappa_2(k)$.

These are the numerical mean and variance of the estimated density $\hat f(x)$, not its sampling mean and variance.
2.4 Estimation Bias
It is useful to observe that expectations of kernel transformations can be written as integrals which take the form of a convolution of the kernel and the density function:
$$E\left[\frac{1}{h}k\left(\frac{X_i - x}{h}\right)\right] = \int_{-\infty}^{\infty}\frac{1}{h}k\left(\frac{z - x}{h}\right)f(z)\,dz.$$
Using the change-of-variables $u = (z - x)/h$, this equals
$$\int_{-\infty}^{\infty}k(u)\,f(x + hu)\,du.$$
By the linearity of the estimator we see
$$E\hat f(x) = \frac{1}{n}\sum_{i=1}^n E\left[\frac{1}{h}k\left(\frac{X_i - x}{h}\right)\right] = \int_{-\infty}^{\infty}k(u)\,f(x + hu)\,du.$$
The last expression shows that the expected value is an average of $f(z)$ locally about $x$.

This integral (typically) is not analytically solvable, so we approximate it using a Taylor expansion of $f(x + hu)$ in the argument $hu$, which is valid as $h \to 0$. For a $\nu$th-order kernel we take the expansion out to the $\nu$th term:
$$f(x + hu) = f(x) + f^{(1)}(x)hu + \frac{1}{2}f^{(2)}(x)h^2u^2 + \frac{1}{3!}f^{(3)}(x)h^3u^3 + \cdots + \frac{1}{\nu!}f^{(\nu)}(x)h^\nu u^\nu + o\left(h^\nu\right).$$
The remainder is of smaller order than $h^\nu$ as $h \to 0$, which is written as $o(h^\nu)$. (This expansion assumes $f^{(\nu+1)}(x)$ exists.)

Integrating term by term and using $\int_{-\infty}^{\infty}k(u)\,du = 1$ and the definition $\int_{-\infty}^{\infty}k(u)u^j\,du = \kappa_j(k)$,
$$\int_{-\infty}^{\infty}k(u)f(x + hu)\,du = f(x) + f^{(1)}(x)h\kappa_1(k) + \frac{1}{2}f^{(2)}(x)h^2\kappa_2(k) + \frac{1}{3!}f^{(3)}(x)h^3\kappa_3(k) + \cdots + \frac{1}{\nu!}f^{(\nu)}(x)h^\nu\kappa_\nu(k) + o\left(h^\nu\right) = f(x) + \frac{1}{\nu!}f^{(\nu)}(x)h^\nu\kappa_\nu(k) + o\left(h^\nu\right),$$
where the second equality uses the assumption that $k$ is a $\nu$th order kernel (so $\kappa_j(k) = 0$ for $j < \nu$).

This means that
$$E\hat f(x) = \frac{1}{n}\sum_{i=1}^n E\left[\frac{1}{h}k\left(\frac{X_i - x}{h}\right)\right] = f(x) + \frac{1}{\nu!}f^{(\nu)}(x)h^\nu\kappa_\nu(k) + o\left(h^\nu\right).$$
The bias of $\hat f(x)$ is then
$$\mathrm{Bias}\left(\hat f(x)\right) = E\hat f(x) - f(x) = \frac{1}{\nu!}f^{(\nu)}(x)h^\nu\kappa_\nu(k) + o\left(h^\nu\right).$$
For second-order kernels, this simplifies to
$$\mathrm{Bias}\left(\hat f(x)\right) = \frac{1}{2}f^{(2)}(x)h^2\kappa_2(k) + O\left(h^4\right).$$
For second-order kernels, the bias is increasing in the square of the bandwidth. Smaller bandwidths imply reduced bias. The bias is also proportional to the second derivative of the density $f^{(2)}(x)$. Intuitively, the estimator $\hat f(x)$ smooths data local to $X_i = x$, so is estimating a smoothed version of $f(x)$. The bias results from this smoothing, and is larger the greater the curvature in $f(x)$.

When higher-order kernels are used (and the density has enough derivatives), the bias is proportional to $h^\nu$, which is of lower order than $h^2$. Thus the bias of estimates using higher-order kernels is of lower order than estimates from second-order kernels, and this is why they are called bias-reducing kernels. This is the advantage of higher-order kernels.
2.5 Estimation Variance
Since the kernel estimator is a linear estimator, and $k\left(\frac{X_i - x}{h}\right)$ is iid,
$$\mathrm{var}\left(\hat f(x)\right) = \frac{1}{nh^2}\mathrm{var}\left(k\left(\frac{X_i - x}{h}\right)\right) = \frac{1}{nh^2}E\left[k\left(\frac{X_i - x}{h}\right)^2\right] - \frac{1}{n}\left(\frac{1}{h}E\left[k\left(\frac{X_i - x}{h}\right)\right]\right)^2.$$
From our analysis of bias we know that $\frac{1}{h}E\,k\left(\frac{X_i - x}{h}\right) = f(x) + o(1)$, so the second term is $O\left(\frac{1}{n}\right)$. For the first term, write the expectation as an integral, make a change-of-variables and a first-order Taylor expansion:
$$\frac{1}{h}E\left[k\left(\frac{X_i - x}{h}\right)^2\right] = \frac{1}{h}\int_{-\infty}^{\infty}k\left(\frac{z - x}{h}\right)^2 f(z)\,dz = \int_{-\infty}^{\infty}k(u)^2 f(x + hu)\,du = \int_{-\infty}^{\infty}k(u)^2\left(f(x) + O(h)\right)du = f(x)R(k) + O(h)$$
where $R(k) = \int_{-\infty}^{\infty}k(u)^2\,du$ is the roughness of the kernel. Together, we see
$$\mathrm{var}\left(\hat f(x)\right) = \frac{f(x)R(k)}{nh} + O\left(\frac{1}{n}\right).$$
The remainder $O\left(\frac{1}{n}\right)$ is of smaller order than the $O\left(\frac{1}{nh}\right)$ leading term, since $h^{-1} \to \infty$.
2.6 Mean-Squared Error
A common and convenient measure of estimation precision is the mean-squared error
$$\mathrm{MSE}\left(\hat f(x)\right) = E\left(\hat f(x) - f(x)\right)^2 = \mathrm{Bias}\left(\hat f(x)\right)^2 + \mathrm{var}\left(\hat f(x)\right) \simeq \left(\frac{1}{\nu!}f^{(\nu)}(x)h^\nu\kappa_\nu(k)\right)^2 + \frac{f(x)R(k)}{nh} = \frac{\kappa_\nu^2(k)}{(\nu!)^2}f^{(\nu)}(x)^2 h^{2\nu} + \frac{f(x)R(k)}{nh} = \mathrm{AMSE}\left(\hat f(x)\right).$$
Since this approximation is based on asymptotic expansions it is called the asymptotic mean-squared error (AMSE). Note that it is a function of the sample size $n$, the bandwidth $h$, the kernel function (through $\kappa_\nu$ and $R(k)$), and varies with $x$ as $f^{(\nu)}(x)$ and $f(x)$ vary.

Notice as well that the first term (the squared bias) is increasing in $h$ and the second term (the variance) is decreasing in $nh$. For $\mathrm{AMSE}(\hat f(x))$ to decline as $n \to \infty$ both of these terms must get small. Thus as $n \to \infty$ we must have $h \to 0$ and $nh \to \infty$. That is, the bandwidth must decrease, but not at a rate faster than sample size. This is sufficient to establish the pointwise consistency of the estimator. That is, for all $x$, $\hat f(x) \to_p f(x)$ as $n \to \infty$. We call this pointwise convergence as it is valid for each $x$ individually. We discuss uniform convergence later.

A global measure of precision is the asymptotic mean integrated squared error (AMISE)
$$\mathrm{AMISE} = \int_{-\infty}^{\infty}\mathrm{AMSE}\left(\hat f(x)\right)dx = \frac{\kappa_\nu^2(k)}{(\nu!)^2}R\left(f^{(\nu)}\right)h^{2\nu} + \frac{R(k)}{nh},$$
where $R\left(f^{(\nu)}\right) = \int_{-\infty}^{\infty}\left(f^{(\nu)}(x)\right)^2 dx$ is the roughness of $f^{(\nu)}$.
2.7 Asymptotically Optimal Bandwidth
The AMISE formula expresses the MSE as a function of $h$. The value of $h$ which minimizes this expression is called the asymptotically optimal bandwidth. The solution is found by taking the derivative of the AMISE with respect to $h$ and setting it equal to zero:
$$\frac{d}{dh}\mathrm{AMISE} = \frac{d}{dh}\left(\frac{\kappa_\nu^2(k)}{(\nu!)^2}R\left(f^{(\nu)}\right)h^{2\nu} + \frac{R(k)}{nh}\right) = 2\nu h^{2\nu-1}\frac{\kappa_\nu^2(k)}{(\nu!)^2}R\left(f^{(\nu)}\right) - \frac{R(k)}{nh^2} = 0$$
with solution
$$h_0 = C_\nu(k, f)\,n^{-1/(2\nu+1)}, \qquad C_\nu(k, f) = R\left(f^{(\nu)}\right)^{-1/(2\nu+1)}A_\nu(k), \qquad A_\nu(k) = \left(\frac{(\nu!)^2 R(k)}{2\nu\,\kappa_\nu^2(k)}\right)^{1/(2\nu+1)}.$$

The optimal bandwidth is proportional to $n^{-1/(2\nu+1)}$. We say that the optimal bandwidth is of order $O\left(n^{-1/(2\nu+1)}\right)$. For second-order kernels the optimal rate is $O\left(n^{-1/5}\right)$. For higher-order kernels the rate is slower, suggesting that bandwidths are generally larger than for second-order kernels. The intuition is that since higher-order kernels have smaller bias, they can afford a larger bandwidth.

The constant of proportionality $C_\nu(k, f)$ depends on the kernel through the function $A_\nu(k)$ (which can be calculated from Table 1), and on the density through $R\left(f^{(\nu)}\right)$ (which is unknown).

If the bandwidth is set to $h_0$, then with some simplification the AMISE equals
$$\mathrm{AMISE}_0(k) = (1 + 2\nu)\left(\frac{R\left(f^{(\nu)}\right)\kappa_\nu^2(k)R(k)^{2\nu}}{(\nu!)^2(2\nu)^{2\nu}}\right)^{1/(2\nu+1)}n^{-2\nu/(2\nu+1)}.$$
For second-order kernels, this equals
$$\mathrm{AMISE}_0(k) = \frac{5}{4}\left(\kappa_2^2(k)R(k)^4 R\left(f^{(2)}\right)\right)^{1/5}n^{-4/5}.$$
As $\nu$ gets large, the convergence rate approaches the parametric rate $n^{-1}$. Thus, at least asymptotically, the slow convergence of nonparametric estimation can be mitigated through the use of higher-order kernels.

This seems a bit magical. What's the catch? For one, the improvement in convergence rate requires that the density is sufficiently smooth that derivatives exist up to the $(\nu+1)$th order. As the density becomes increasingly smooth, it is easier to approximate by a low-dimensional curve, and gets closer to a parametric-type problem. This is exploiting the smoothness of $f$, which is inherently unknown. The other catch is that there is some evidence that the benefits of higher-order kernels only develop when the sample size is fairly large. My sense is that in small samples a second-order kernel would be the best choice, in moderate samples a 4th order kernel, and in larger samples a 6th order kernel could be used.
2.8 Asymptotically Optimal Kernel
Given that we have picked the kernel order, which kernel should we use? Examining the expression $\mathrm{AMISE}_0$ we can see that for fixed $\nu$ the choice of kernel affects the asymptotic precision through the quantity $\kappa_\nu(k)R(k)^\nu$. All else equal, AMISE will be minimized by selecting the kernel which minimizes this quantity. As we discussed earlier, only the shape of the kernel is important, not its scale, so we can set $\kappa_\nu = 1$. Then the problem reduces to minimization of $R(k) = \int_{-\infty}^{\infty}k(u)^2\,du$ subject to the constraints $\int_{-\infty}^{\infty}k(u)\,du = 1$ and $\int_{-\infty}^{\infty}u^\nu k(u)\,du = 1$. This is a problem in the calculus of variations. It turns out that the solution is a rescaling of $k_{\nu,1}$ (see Muller (Annals of Statistics, 1984)). As the scale is irrelevant, this means that for estimation of the density function, the higher-order Epanechnikov kernel $k_{\nu,1}$ with optimal bandwidth yields the lowest possible AMISE. For this reason, the Epanechnikov kernel is often called the optimal kernel.

To compare kernels, the relative efficiency of a kernel $k$ is defined as
$$\mathrm{eff}(k) = \left(\frac{\mathrm{AMISE}_0(k)}{\mathrm{AMISE}_0(k_{\nu,1})}\right)^{(1+2\nu)/2\nu} = \frac{\left(\kappa_\nu^2(k)\right)^{1/2\nu}R(k)}{\left(\kappa_\nu^2(k_{\nu,1})\right)^{1/2\nu}R(k_{\nu,1})}.$$
The ratio of the AMISEs is raised to the power $(1+2\nu)/2\nu$ because, for large $n$, the AMISE will be the same whether we use $n$ observations with kernel $k_{\nu,1}$ or $n\cdot\mathrm{eff}(k)$ observations with kernel $k$. Thus the penalty $\mathrm{eff}(k)$ is expressed as a percentage of observations.

The efficiencies of the various kernels are given in Tables 1-3. Examining the second-order kernels, we see that relative to the Epanechnikov kernel, the uniform kernel pays a penalty of about 7%, the Gaussian kernel a penalty of about 5%, the Triweight kernel about 1.4%, and the Biweight kernel less than 1%. Examining the 4th and 6th-order kernels, we see that the relative efficiency of the Gaussian kernel deteriorates, while that of the Biweight and Triweight slightly improves.

The differences are not big. Still, the calculation suggests that the Epanechnikov and Biweight kernel classes are good choices for density estimation.
2.9 Rule-of-Thumb Bandwidth
The optimal bandwidth depends on the unknown quantity $R\left(f^{(\nu)}\right)$. Silverman proposed that we try the bandwidth computed by replacing $R\left(f^{(\nu)}\right)$ in the optimal formula by $R\left(g^{(\nu)}_{\hat\sigma}\right)$, where $g_\sigma$ is a reference density (a plausible candidate for $f$) and $\hat\sigma$ is the sample standard deviation. The standard choice is to set $g_{\hat\sigma} = \phi_{\hat\sigma}$, the $N(0, \hat\sigma^2)$ density. The idea is that if the true density is normal, then the computed bandwidth will be optimal. If the true density is reasonably close to the normal, then the bandwidth will be close to optimal. While not a perfect solution, it is a good place to start looking.

For any density $g$, if we set $g_\sigma(x) = \sigma^{-1}g(x/\sigma)$, then $g_\sigma^{(\nu)}(x) = \sigma^{-1-\nu}g^{(\nu)}(x/\sigma)$. Thus
$$R\left(g_\sigma^{(\nu)}\right)^{-1/(2\nu+1)} = \left(\int g_\sigma^{(\nu)}(x)^2\,dx\right)^{-1/(2\nu+1)} = \left(\sigma^{-2-2\nu}\int g^{(\nu)}(x/\sigma)^2\,dx\right)^{-1/(2\nu+1)} = \left(\sigma^{-1-2\nu}\int g^{(\nu)}(x)^2\,dx\right)^{-1/(2\nu+1)} = \sigma R\left(g^{(\nu)}\right)^{-1/(2\nu+1)}.$$
Furthermore,
$$R\left(\phi^{(\nu)}\right)^{-1/(2\nu+1)} = 2\left(\frac{\pi^{1/2}\,\nu!}{(2\nu)!}\right)^{1/(2\nu+1)}.$$
Thus
$$R\left(\phi^{(\nu)}_{\hat\sigma}\right)^{-1/(2\nu+1)} = 2\hat\sigma\left(\frac{\pi^{1/2}\,\nu!}{(2\nu)!}\right)^{1/(2\nu+1)}.$$
The rule-of-thumb bandwidth is then $h = \hat\sigma\,C_\nu(k)\,n^{-1/(2\nu+1)}$ where
$$C_\nu(k) = R\left(\phi^{(\nu)}\right)^{-1/(2\nu+1)}A_\nu(k) = 2\left(\frac{\pi^{1/2}(\nu!)^3 R(k)}{2\nu\,(2\nu)!\,\kappa_\nu^2(k)}\right)^{1/(2\nu+1)}.$$
We collect these constants in Table 4.

Table 4: Rule-of-Thumb Constants $C_\nu(k)$

Kernel | $\nu = 2$ | $\nu = 4$ | $\nu = 6$
Epanechnikov | 2.34 | 3.03 | 3.53
Biweight | 2.78 | 3.39 | 3.84
Triweight | 3.15 | 3.72 | 4.13
Gaussian | 1.06 | 1.08 | 1.08

Silverman Rule-of-Thumb: $h = \hat\sigma\,C_\nu(k)\,n^{-1/(2\nu+1)}$ where $\hat\sigma$ is the sample standard deviation, $\nu$ is the order of the kernel, and $C_\nu(k)$ is the constant from Table 4.

If a Gaussian kernel is used, this is often simplified to $h = \hat\sigma\,n^{-1/(2\nu+1)}$. In particular, for the standard second-order normal kernel, $h = \hat\sigma\,n^{-1/5}$.
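A hedged implementation sketch of this rule (my own code, not from the notes), with the constants hard-coded from Table 4:

```python
# Sketch: Silverman rule-of-thumb bandwidth h = sigma_hat * C_nu(k) * n^(-1/(2 nu + 1)).
import numpy as np

ROT_CONSTANTS = {                      # Table 4: C_nu(k)
    ("epanechnikov", 2): 2.34, ("epanechnikov", 4): 3.03, ("epanechnikov", 6): 3.53,
    ("biweight", 2): 2.78,     ("biweight", 4): 3.39,     ("biweight", 6): 3.84,
    ("triweight", 2): 3.15,    ("triweight", 4): 3.72,    ("triweight", 6): 4.13,
    ("gaussian", 2): 1.06,     ("gaussian", 4): 1.08,     ("gaussian", 6): 1.08,
}

def rule_of_thumb_bandwidth(data, kernel="gaussian", nu=2):
    sigma = np.std(data, ddof=1)                        # sample standard deviation
    C = ROT_CONSTANTS[(kernel, nu)]
    return sigma * C * len(data) ** (-1 / (2 * nu + 1))

rng = np.random.default_rng(0)
X = rng.normal(size=500)
print(rule_of_thumb_bandwidth(X, "gaussian", 2))        # roughly 1.06 * n^(-1/5) here
```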
2.10 Density Derivatives
Consider the problem of estimating the $r$th derivative of the density:
$$f^{(r)}(x) = \frac{d^r}{dx^r}f(x).$$
A natural estimator is found by taking derivatives of the kernel density estimator. Using the symmetry of the kernel (which implies $k^{(r)}(-u) = (-1)^r k^{(r)}(u)$), this takes the form
$$\hat f^{(r)}(x) = \frac{d^r}{dx^r}\hat f(x) = \frac{1}{nh^{1+r}}\sum_{i=1}^n k^{(r)}\left(\frac{x - X_i}{h}\right)$$
where
$$k^{(r)}(u) = \frac{d^r}{du^r}k(u).$$
This estimator only makes sense if $k^{(r)}(u)$ exists and is non-zero. Since the Gaussian kernel has derivatives of all orders it is a common choice for derivative estimation.

The asymptotic analysis of this estimator is similar to that of the density, but with a couple of extra wrinkles and noticeably different results. First, to calculate the bias we observe that
$$E\left[\frac{1}{h^{1+r}}k^{(r)}\left(\frac{x - X_i}{h}\right)\right] = \int_{-\infty}^{\infty}\frac{1}{h^{1+r}}k^{(r)}\left(\frac{x - z}{h}\right)f(z)\,dz.$$
To simplify this expression we use integration by parts. As an antiderivative (in $z$) of $h^{-1}k^{(r)}\left(\frac{x - z}{h}\right)$ is $-k^{(r-1)}\left(\frac{x - z}{h}\right)$, and the boundary terms vanish, we find that the above expression equals
$$\int_{-\infty}^{\infty}\frac{1}{h^r}k^{(r-1)}\left(\frac{x - z}{h}\right)f^{(1)}(z)\,dz.$$
Repeating this a total of $r$ times, we obtain
$$\int_{-\infty}^{\infty}\frac{1}{h}k\left(\frac{x - z}{h}\right)f^{(r)}(z)\,dz.$$
Next, apply the change of variables to obtain
$$\int_{-\infty}^{\infty}k(u)\,f^{(r)}(x + hu)\,du.$$
Now expand $f^{(r)}(x + hu)$ in a $\nu$th-order Taylor expansion about $x$, and integrate the terms to find that the above equals
$$f^{(r)}(x) + \frac{1}{\nu!}f^{(r+\nu)}(x)h^\nu\kappa_\nu(k) + o\left(h^\nu\right)$$
where $\nu$ is the order of the kernel. Hence the asymptotic bias is
$$\mathrm{Bias}\left(\hat f^{(r)}(x)\right) = E\hat f^{(r)}(x) - f^{(r)}(x) = \frac{1}{\nu!}f^{(r+\nu)}(x)h^\nu\kappa_\nu(k) + o\left(h^\nu\right).$$
This of course presumes that $f$ is differentiable of order at least $r + \nu + 1$.

For the variance, we find
$$\mathrm{var}\left(\hat f^{(r)}(x)\right) = \frac{1}{nh^{2+2r}}\mathrm{var}\left(k^{(r)}\left(\frac{x - X_i}{h}\right)\right) = \frac{1}{nh^{2+2r}}E\left[k^{(r)}\left(\frac{x - X_i}{h}\right)^2\right] - \frac{1}{n}\left(\frac{1}{h^{1+r}}E\,k^{(r)}\left(\frac{x - X_i}{h}\right)\right)^2$$
$$= \frac{1}{nh^{2+2r}}\int_{-\infty}^{\infty}k^{(r)}\left(\frac{x - z}{h}\right)^2 f(z)\,dz - \frac{1}{n}f^{(r)}(x)^2 + O\left(\frac{1}{n}\right) = \frac{1}{nh^{1+2r}}\int_{-\infty}^{\infty}k^{(r)}(u)^2 f(x + hu)\,du + O\left(\frac{1}{n}\right)$$
$$= \frac{f(x)}{nh^{1+2r}}\int_{-\infty}^{\infty}k^{(r)}(u)^2\,du + O\left(\frac{1}{n}\right) = \frac{f(x)R\left(k^{(r)}\right)}{nh^{1+2r}} + O\left(\frac{1}{n}\right).$$
The AMSE and AMISE are
$$\mathrm{AMSE}\left(\hat f^{(r)}(x)\right) = \frac{f^{(r+\nu)}(x)^2 h^{2\nu}\kappa_\nu^2(k)}{(\nu!)^2} + \frac{f(x)R\left(k^{(r)}\right)}{nh^{1+2r}}$$
and
$$\mathrm{AMISE}\left(\hat f^{(r)}\right) = \frac{R\left(f^{(r+\nu)}\right)h^{2\nu}\kappa_\nu^2(k)}{(\nu!)^2} + \frac{R\left(k^{(r)}\right)}{nh^{1+2r}}.$$
Note that the order of the bias is the same as for estimation of the density. But the variance is now of order $O\left(\frac{1}{nh^{1+2r}}\right)$, which is much larger than the $O\left(\frac{1}{nh}\right)$ found earlier.

The asymptotically optimal bandwidth is
$$h_r = C_{r,\nu}(k, f)\,n^{-1/(1+2r+2\nu)}, \qquad C_{r,\nu}(k, f) = R\left(f^{(r+\nu)}\right)^{-1/(1+2r+2\nu)}A_{r,\nu}(k), \qquad A_{r,\nu}(k) = \left(\frac{(1+2r)(\nu!)^2 R\left(k^{(r)}\right)}{2\nu\,\kappa_\nu^2(k)}\right)^{1/(1+2r+2\nu)}.$$
Thus the optimal bandwidth converges at a slower rate than for density estimation. Given this bandwidth, the rate of convergence for the AMISE is $O\left(n^{-2\nu/(2r+2\nu+1)}\right)$, which is slower than the $O\left(n^{-2\nu/(2\nu+1)}\right)$ rate obtained when $r = 0$.

We see that we need a different bandwidth for estimation of derivatives than for estimation of the density. This is a common situation which arises in nonparametric analysis. The optimal amount of smoothing depends upon the object being estimated, and the goal of the analysis.

The AMISE with the optimal bandwidth is
$$\mathrm{AMISE}_0\left(\hat f^{(r)}\right) = (1 + 2r + 2\nu)\left(\frac{R\left(f^{(r+\nu)}\right)\kappa_\nu^2(k)}{(\nu!)^2(1+2r)}\right)^{(2r+1)/(1+2r+2\nu)}\left(\frac{R\left(k^{(r)}\right)}{2\nu}\right)^{2\nu/(1+2r+2\nu)}n^{-2\nu/(1+2r+2\nu)}.$$
We can also ask which kernel function is optimal, and this is addressed by Muller (1984). The problem amounts to minimizing $R\left(k^{(r)}\right)$ subject to a moment condition, and the solution is to set $k$ equal to $k_{\nu,r+1}$, the polynomial kernel of $\nu$th order and exponent $r+1$. Thus for a first derivative it is optimal to use a member of the Biweight class, and for a second derivative a member of the Triweight class.

The relative efficiency of a kernel $k$ is then
$$\mathrm{eff}(k) = \left(\frac{\mathrm{AMISE}_0(k)}{\mathrm{AMISE}_0(k_{\nu,r+1})}\right)^{(1+2\nu+2r)/2\nu} = \left(\frac{\kappa_\nu^2(k)}{\kappa_\nu^2(k_{\nu,r+1})}\right)^{(1+2r)/2\nu}\frac{R\left(k^{(r)}\right)}{R\left(k^{(r)}_{\nu,r+1}\right)}.$$
The relative efficiencies of the various kernels are presented in Table 5. (The Epanechnikov kernel is not considered as it is inappropriate for derivative estimation, and similarly the Biweight kernel for $r = 2$.) In contrast to the case $r = 0$, we see that the Gaussian kernel is highly inefficient, with the efficiency loss increasing with $r$ and $\nu$. These calculations suggest that when estimating density derivatives it is important to use the appropriate kernel.

Table 5: Relative Efficiency eff$(k)$

 | Biweight | Triweight | Gaussian
$r = 1$, $\nu = 2$ | 1.0000 | 1.0185 | 1.2191
$r = 1$, $\nu = 4$ | 1.0000 | 1.0159 | 1.2753
$r = 1$, $\nu = 6$ | 1.0000 | 1.0136 | 1.3156
$r = 2$, $\nu = 2$ | - | 1.0000 | 1.4689
$r = 2$, $\nu = 4$ | - | 1.0000 | 1.5592
$r = 2$, $\nu = 6$ | - | 1.0000 | 1.6275

The Silverman Rule-of-Thumb may also be applied to density derivative estimation. Again using the reference density $g_\sigma = \phi_\sigma$, we find the rule-of-thumb bandwidth is $h = C_{r,\nu}(k)\,\hat\sigma\,n^{-1/(2r+2\nu+1)}$ where
$$C_{r,\nu}(k) = 2\left(\frac{\pi^{1/2}(1+2r)(\nu!)^2(r+\nu)!\,R\left(k^{(r)}\right)}{2\nu\,\kappa_\nu^2(k)\,(2r+2\nu)!}\right)^{1/(2r+2\nu+1)}.$$
The constants $C_{r,\nu}$ are collected in Table 6. For all kernels, the constants $C_{r,\nu}$ are similar but slightly decreasing as $r$ increases.

Table 6: Rule-of-Thumb Constants $C_{r,\nu}(k)$

 | Biweight | Triweight | Gaussian
$r = 1$, $\nu = 2$ | 2.49 | 2.83 | 0.97
$r = 1$, $\nu = 4$ | 3.18 | 3.49 | 1.03
$r = 1$, $\nu = 6$ | 3.44 | 3.96 | 1.04
$r = 2$, $\nu = 2$ | - | 2.70 | 0.94
$r = 2$, $\nu = 4$ | - | 3.35 | 1.00
$r = 2$, $\nu = 6$ | - | 3.84 | 1.02
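As an illustration (a sketch of my own, assuming the Gaussian kernel and the sign convention used above), the first density derivative can be estimated as follows:

```python
# Sketch: density derivative estimation with the Gaussian kernel,
# hat f^{(r)}(x) = (n h^{1+r})^{-1} sum_i k^{(r)}((x - X_i)/h).
import numpy as np

def gaussian_kernel_deriv(u, r):
    """r-th derivative of the Gaussian kernel for r = 0, 1, 2."""
    phi = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)
    if r == 0:
        return phi
    if r == 1:
        return -u * phi
    if r == 2:
        return (u**2 - 1) * phi
    raise ValueError("r > 2 not implemented in this sketch")

def density_derivative(x, data, h, r=1):
    x = np.atleast_1d(x)
    u = (x[:, None] - data[None, :]) / h            # (x - X_i)/h
    return gaussian_kernel_deriv(u, r).sum(axis=1) / (len(data) * h ** (1 + r))

rng = np.random.default_rng(0)
X = rng.normal(size=1000)
grid = np.array([-1.0, 0.0, 1.0])
print(density_derivative(grid, X, h=0.5, r=1))      # compare with -x*phi(x) for N(0,1) data
```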
2.11 Multivariate Density Estimation
Now suppose that $X_i$ is a $q$-vector and we want to estimate its density $f(x) = f(x_1, \dots, x_q)$. A multivariate kernel estimator takes the form
$$\hat f(x) = \frac{1}{n|H|}\sum_{i=1}^n K\left(H^{-1}(X_i - x)\right)$$
where $K(u)$ is a multivariate kernel function depending on a bandwidth vector $H = (h_1, \dots, h_q)'$ and $|H| = h_1 h_2 \cdots h_q$. A multivariate kernel satisfies
$$\int K(u)\,du = \int\cdots\int K(u)\,du_1\cdots du_q = 1.$$
Typically, $K(u)$ takes the product form:
$$K(u) = k(u_1)\,k(u_2)\cdots k(u_q).$$
As in the univariate case, $\hat f(x)$ has the property that it integrates to one, and it is non-negative if $K(u) \ge 0$. When $K(u)$ is a product kernel the marginal densities of $\hat f(x)$ equal univariate kernel density estimators with kernel function $k$ and bandwidths $h_j$.

With some work, you can show that when $K(u)$ takes the product form, the bias of the estimator is
$$\mathrm{Bias}\left(\hat f(x)\right) = \frac{\kappa_\nu(k)}{\nu!}\sum_{j=1}^q\frac{\partial^\nu}{\partial x_j^\nu}f(x)\,h_j^\nu + o\left(h_1^\nu + \cdots + h_q^\nu\right)$$
and the variance is
$$\mathrm{var}\left(\hat f(x)\right) = \frac{f(x)R(K)}{n|H|} + O\left(\frac{1}{n}\right) = \frac{f(x)R(k)^q}{nh_1h_2\cdots h_q} + O\left(\frac{1}{n}\right).$$
Hence the AMISE is
$$\mathrm{AMISE}\left(\hat f(x)\right) = \frac{\kappa_\nu^2(k)}{(\nu!)^2}\int\left(\sum_{j=1}^q\frac{\partial^\nu}{\partial x_j^\nu}f(x)\,h_j^\nu\right)^2 dx + \frac{R(k)^q}{nh_1h_2\cdots h_q}.$$
There is no closed-form solution for the bandwidth vector which minimizes this expression. However, even without doing so, we can make a couple of observations.

First, the AMISE depends on the kernel function only through $R(k)$ and $\kappa_\nu^2(k)$, so it is clear that for any given $\nu$, the optimal kernel minimizes $R(k)$, the same as in the univariate case.

Second, the optimal bandwidths will all be of order $n^{-1/(2\nu+q)}$ and the optimal AMISE of order $n^{-2\nu/(2\nu+q)}$. These rates are slower than in the univariate ($q = 1$) case. The fact that dimension has an adverse effect on convergence rates is called the curse of dimensionality. Many theoretical papers circumvent this problem through the following trick. Suppose you need the AMISE of the estimator to converge at a rate $O\left(n^{-1/2}\right)$ or faster. This requires $2\nu/(2\nu+q) > 1/2$, or $q < 2\nu$. For second-order kernels ($\nu = 2$) this restricts the dimension to be 3 or less. What some authors will do is slip in an assumption of the form: "Assume $f(x)$ is differentiable of order $\nu + 1$ where $\nu > q/2$," and then claim that their results hold for all $q$. The trouble is that the author is imposing greater smoothness as the dimension increases. This doesn't really avoid the curse of dimensionality, rather it hides it behind what appears to be a technical assumption. The bottom line is that nonparametric objects are much harder to estimate in higher dimensions, and that is why it is called a curse.

To derive a rule-of-thumb, suppose that $h_1 = h_2 = \cdots = h_q = h$. Then
$$\mathrm{AMISE}\left(\hat f(x)\right) = \frac{\kappa_\nu^2(k)R\left(\nabla^\nu f\right)}{(\nu!)^2}h^{2\nu} + \frac{R(k)^q}{nh^q}$$
where
$$\nabla^\nu f(x) = \sum_{j=1}^q\frac{\partial^\nu}{\partial x_j^\nu}f(x).$$
We find that the optimal bandwidth is
$$h_0 = \left(\frac{(\nu!)^2 R(k)^q}{2\nu\,\kappa_\nu^2(k)R\left(\nabla^\nu f\right)}\right)^{1/(2\nu+q)}n^{-1/(2\nu+q)}.$$
For a rule-of-thumb bandwidth, we replace $f$ by the multivariate normal density $\phi$. We can calculate that
$$R\left(\nabla^\nu\phi\right) = \frac{(2\nu-1)!! + (q-1)\left((\nu-1)!!\right)^2}{\pi^{q/2}\,2^{q+\nu}}.$$
Making this substitution, we obtain $h_0 = C_\nu(k, q)\,n^{-1/(2\nu+q)}$ where
$$C_\nu(k, q) = \left(\frac{\pi^{q/2}\,2^{q+\nu-1}(\nu!)^2 R(k)^q}{\nu\,\kappa_\nu^2(k)\left[(2\nu-1)!! + (q-1)\left((\nu-1)!!\right)^2\right]}\right)^{1/(2\nu+q)}.$$
Now this assumed that all variables have unit variance. Rescaling the bandwidths by the standard deviation of each variable, we obtain the rule-of-thumb bandwidth for the $j$th variable:
$$h_j = \hat\sigma_j\,C_\nu(k, q)\,n^{-1/(2\nu+q)}.$$
Numerical values for the constants $C_\nu(k, q)$ are given in Table 7 for $q = 2, 3, 4$.

Table 7: Rule-of-Thumb Constants $C_\nu(k, q)$

$\nu = 2$ | $q = 2$ | $q = 3$ | $q = 4$
Epanechnikov | 2.20 | 2.12 | 2.07
Biweight | 2.61 | 2.52 | 2.46
Triweight | 2.96 | 2.86 | 2.80
Gaussian | 1.00 | 0.97 | 0.95

$\nu = 4$ | $q = 2$ | $q = 3$ | $q = 4$
Epanechnikov | 3.12 | 3.20 | 3.27
Biweight | 3.50 | 3.59 | 3.67
Triweight | 3.84 | 3.94 | 4.03
Gaussian | 1.12 | 1.16 | 1.19

$\nu = 6$ | $q = 2$ | $q = 3$ | $q = 4$
Epanechnikov | 3.69 | 3.83 | 3.96
Biweight | 4.02 | 4.18 | 4.32
Triweight | 4.33 | 4.50 | 4.66
Gaussian | 1.13 | 1.18 | 1.23
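A sketch (my own, not from the notes) of the product-Gaussian-kernel estimator with the rule-of-thumb bandwidths of this section; the constant is read from Table 7 (Gaussian, $\nu = 2$, $q = 2$).

```python
# Sketch: multivariate product-kernel density estimator with rule-of-thumb bandwidths.
import numpy as np

def gaussian(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def mv_kde(x, data, h):
    """Product Gaussian kernel estimate at a single q-vector x."""
    u = (data - x) / h                               # H^{-1}(X_i - x), elementwise
    K = gaussian(u).prod(axis=1)                     # K(u) = k(u_1) ... k(u_q)
    return K.sum() / (len(data) * h.prod())          # (n|H|)^{-1} sum_i K(...)

rng = np.random.default_rng(0)
n, q, nu = 500, 2, 2
X = rng.normal(size=(n, q))
C = 1.00                                             # Table 7: Gaussian, nu = 2, q = 2
h = X.std(axis=0, ddof=1) * C * n ** (-1 / (2 * nu + q))
print(h, mv_kde(np.zeros(q), X, h))                  # true N(0, I_2) density at 0 is 1/(2 pi)
```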
2.12 Least-Squares Cross-Validation
Rule-of-thumb bandwidths are a useful starting point, but they are inflexible and can be far from optimal.

Plug-in methods take the formula for the optimal bandwidth, and replace the unknowns by estimates, e.g. $R\left(\hat f^{(\nu)}\right)$. But these initial estimates themselves depend on bandwidths. And each situation needs to be individually studied. Plug-in methods have been thoroughly studied for univariate density estimation, but are less well developed for multivariate density estimation and other contexts.

A flexible and generally applicable data-dependent method is cross-validation. This method attempts to make a direct estimate of the squared error, and pick the bandwidth which minimizes this estimate. In many senses the idea is quite close to model selection based on an information criterion, such as Mallows or AIC.

Given a bandwidth $h$ and density estimate $\hat f(x)$ of $f(x)$, define the mean integrated squared error (MISE)
$$\mathrm{MISE}(h) = \int\left(\hat f(x) - f(x)\right)^2 dx = \int\hat f(x)^2\,dx - 2\int\hat f(x)f(x)\,dx + \int f(x)^2\,dx.$$
Optimally, we want $\hat f(x)$ to be as close to $f(x)$ as possible, and thus for $\mathrm{MISE}(h)$ to be as small as possible. As $\mathrm{MISE}(h)$ is unknown, cross-validation replaces it with an estimate. The goal is to find an estimate of $\mathrm{MISE}(h)$, and find the $h$ which minimizes this estimate.

As the third term in the above expression does not depend on the bandwidth $h$, it can be ignored.

The first term can be directly calculated. For the univariate case
$$\int\hat f(x)^2\,dx = \int\left(\frac{1}{nh}\sum_{i=1}^n k\left(\frac{X_i - x}{h}\right)\right)^2 dx = \frac{1}{n^2h^2}\sum_{i=1}^n\sum_{j=1}^n\int k\left(\frac{X_i - x}{h}\right)k\left(\frac{X_j - x}{h}\right)dx.$$
The convolution of $k$ with itself is $\bar k(x) = \int k(u)k(x - u)\,du = \int k(u)k(u - x)\,du$ (by symmetry of $k$). Then making the change of variables $u = \frac{X_i - x}{h}$,
$$\frac{1}{h}\int k\left(\frac{X_i - x}{h}\right)k\left(\frac{X_j - x}{h}\right)dx = \int k(u)\,k\left(u - \frac{X_i - X_j}{h}\right)du = \bar k\left(\frac{X_i - X_j}{h}\right).$$
Hence
$$\int\hat f(x)^2\,dx = \frac{1}{n^2h}\sum_{i=1}^n\sum_{j=1}^n\bar k\left(\frac{X_i - X_j}{h}\right).$$
Discussion of $\bar k(x)$ can be found in the following section.

In the multivariate case,
$$\int\hat f(x)^2\,dx = \frac{1}{n^2|H|}\sum_{i=1}^n\sum_{j=1}^n\bar K\left(H^{-1}(X_i - X_j)\right)$$
where $\bar K(u) = \bar k(u_1)\cdots\bar k(u_q)$.

The second term in the expression for $\mathrm{MISE}(h)$ depends on $f(x)$, so it is unknown and must be estimated. An integral with respect to $f(x)$ is an expectation with respect to the random variable $X_i$. While we don't know the true expectation, we have the sample, so we can estimate this expectation by taking the sample average. In general, a reasonable estimate of the integral $\int g(x)f(x)\,dx$ is $\frac{1}{n}\sum_{i=1}^n g(X_i)$, suggesting the estimate $\frac{1}{n}\sum_{i=1}^n\hat f(X_i)$. In this case, however, the function $\hat f(x)$ is itself a function of the data. In particular, it is a function of the observation $X_i$. A way to clean this up is to replace $\hat f(X_i)$ with the leave-one-out estimate $\hat f_{-i}(X_i)$, where
$$\hat f_{-i}(x) = \frac{1}{(n-1)|H|}\sum_{j \ne i}K\left(H^{-1}(X_j - x)\right)$$
is the density estimate computed without observation $X_i$, and thus
$$\hat f_{-i}(X_i) = \frac{1}{(n-1)|H|}\sum_{j \ne i}K\left(H^{-1}(X_j - X_i)\right).$$
That is, $\hat f_{-i}(X_i)$ is the density estimate at $x = X_i$, computed with all the observations except $X_i$. We end up suggesting to estimate $\int\hat f(x)f(x)\,dx$ with
$$\frac{1}{n}\sum_{i=1}^n\hat f_{-i}(X_i) = \frac{1}{n(n-1)|H|}\sum_{i=1}^n\sum_{j \ne i}K\left(H^{-1}(X_j - X_i)\right).$$
It turns out that this is an unbiased estimate, in the sense that
$$E\left[\frac{1}{n}\sum_{i=1}^n\hat f_{-i}(X_i)\right] = E\left[\int\hat f(x)f(x)\,dx\right].$$
To see this, the LHS is (by symmetry across observations)
$$E\,\hat f_{-n}(X_n) = E\left[E\left(\hat f_{-n}(X_n)\mid X_1, \dots, X_{n-1}\right)\right] = E\left[\int\hat f_{-n}(x)f(x)\,dx\right] = \int E\left[\hat f(x)\right]f(x)\,dx = E\left[\int\hat f(x)f(x)\,dx\right],$$
the second-to-last equality exchanging expectation and integration, and using the fact that $E\left[\hat f(x)\right]$ depends only on the bandwidth, not the sample size.

Together, the least-squares cross-validation criterion is
$$CV(h_1, \dots, h_q) = \frac{1}{n^2|H|}\sum_{i=1}^n\sum_{j=1}^n\bar K\left(H^{-1}(X_i - X_j)\right) - \frac{2}{n(n-1)|H|}\sum_{i=1}^n\sum_{j \ne i}K\left(H^{-1}(X_j - X_i)\right).$$
Another way to write this is
$$CV(h_1, \dots, h_q) = \frac{\bar K(0)}{n|H|} + \frac{1}{n^2|H|}\sum_{i=1}^n\sum_{j \ne i}\bar K\left(H^{-1}(X_i - X_j)\right) - \frac{2}{n(n-1)|H|}\sum_{i=1}^n\sum_{j \ne i}K\left(H^{-1}(X_j - X_i)\right)$$
$$\simeq \frac{R(k)^q}{n|H|} + \frac{1}{n^2|H|}\sum_{i=1}^n\sum_{j \ne i}\left(\bar K\left(H^{-1}(X_i - X_j)\right) - 2K\left(H^{-1}(X_j - X_i)\right)\right),$$
using $\bar K(0) = \bar k(0)^q$ and $\bar k(0) = \int k(u)^2\,du$; the approximation replaces $n-1$ by $n$.

The cross-validation bandwidth vector is the value $(\hat h_1, \dots, \hat h_q)$ which minimizes $CV(h_1, \dots, h_q)$. The cross-validation function is a complicated function of the bandwidths, so this needs to be done numerically.

In the univariate case, where $h$ is one-dimensional, this is typically done by a grid search. Pick a lower and upper value $[h_1, h_2]$, define a grid on this set, and compute $CV(h)$ for each $h$ in the grid. A plot of $CV(h)$ against $h$ is a useful diagnostic tool.

The $CV(h)$ function can be misleading for small values of $h$. This arises when there is data rounding. Some authors define the cross-validation bandwidth as the largest local minimizer of $CV(h)$ (rather than the global minimizer). This can also be avoided by picking a sensible initial range $[h_1, h_2]$. The rule-of-thumb bandwidth can be useful here. If $h_0$ is the rule-of-thumb bandwidth, then use $h_1 = h_0/3$ and $h_2 = 3h_0$ or similar.

As we discussed above, $CV(h_1, \dots, h_q) + \int f(x)^2\,dx$ is an unbiased estimate of $\mathrm{MISE}(h)$. This by itself does not mean that $\hat h$ is a good estimate of $h_0$, the minimizer of $\mathrm{MISE}(h)$, but it turns out that this is indeed the case. That is,
$$\frac{\hat h - h_0}{h_0} \to_p 0.$$
Thus $\hat h$ is asymptotically close to $h_0$, but the rate of convergence is very slow.

The CV method is quite flexible, as it can be applied for any kernel function.

If the goal, however, is estimation of density derivatives, then the CV bandwidth $\hat h$ is not appropriate. A practical solution is the following. Recall that the asymptotically optimal bandwidth for estimation of the density takes the form $h_0 = C_\nu(k, f)\,n^{-1/(2\nu+1)}$ and that for the $r$th derivative it is $h_r = C_{r,\nu}(k, f)\,n^{-1/(1+2r+2\nu)}$. Thus if the CV bandwidth $\hat h$ is an estimate of $h_0$, we can estimate $C_\nu(k, f)$ by $\hat C_\nu = \hat h\,n^{1/(2\nu+1)}$. We also saw (at least for the normal reference family) that $C_{r,\nu}(k, f)$ was relatively constant across $r$. Thus we can replace $C_{r,\nu}(k, f)$ with $\hat C_\nu$ to find
$$\hat h_r = \hat C_\nu\,n^{-1/(1+2r+2\nu)} = \hat h\,n^{1/(2\nu+1) - 1/(1+2r+2\nu)} = \hat h\,n^{2r/((2\nu+1)(1+2r+2\nu))}.$$
Alternatively, some authors use the rescaling
$$\hat h_r = \hat h^{(1+2\nu)/(1+2r+2\nu)}.$$
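A minimal univariate sketch (my own code) of the least-squares cross-validation criterion for the Gaussian kernel, using the convolution kernel $\bar k$ given in the next section and a grid search around the rule-of-thumb bandwidth:

```python
# Sketch: least-squares cross-validation for a univariate Gaussian-kernel density estimate.
import numpy as np

def cv_criterion(h, data):
    n = len(data)
    d = data[:, None] - data[None, :]                # X_i - X_j
    kbar = np.exp(-(d / h)**2 / 4) / np.sqrt(4 * np.pi)   # Gaussian convolution kernel
    k = np.exp(-(d / h)**2 / 2) / np.sqrt(2 * np.pi)
    term1 = kbar.sum() / (n**2 * h)                  # estimate of int fhat^2
    off_diag = k.sum() - k.trace()                   # exclude j = i terms
    term2 = 2 * off_diag / (n * (n - 1) * h)         # 2/n sum_i fhat_{-i}(X_i)
    return term1 - term2

rng = np.random.default_rng(0)
X = rng.normal(size=300)
h0 = 1.06 * X.std(ddof=1) * len(X) ** (-0.2)         # rule-of-thumb starting point
grid = np.linspace(h0 / 3, 3 * h0, 60)
cv = np.array([cv_criterion(h, X) for h in grid])
print("CV bandwidth:", grid[cv.argmin()])
```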
2.13 Convolution Kernels
If $k(x) = \phi(x)$ then $\bar k(x) = \exp\left(-x^2/4\right)/\sqrt{4\pi}$.

When $k(x)$ is a higher-order Gaussian kernel, Wand and Schucany (Canadian Journal of Statistics, 1990, p. 201) give an expression for $\bar k(x)$.

For the polynomial class, because the kernel $k(u)$ has support on $[-1, 1]$, it follows that $\bar k(x)$ has support on $[-2, 2]$ and for $x \ge 0$ equals $\bar k(x) = \int_{x-1}^{1}k(u)k(x - u)\,du$. This integral can be easily solved using algebraic software (Maple, Mathematica), but the expression can be rather cumbersome.

For the 2nd order Epanechnikov, Biweight and Triweight kernels, for $0 \le x \le 2$,
$$\bar k_1(x) = \frac{3}{160}(2-x)^3\left(x^2 + 6x + 4\right)$$
$$\bar k_2(x) = \frac{5}{3584}(2-x)^5\left(x^4 + 10x^3 + 36x^2 + 40x + 16\right)$$
$$\bar k_3(x) = \frac{35}{1757184}(2-x)^7\left(5x^6 + 70x^5 + 404x^4 + 1176x^3 + 1616x^2 + 1120x + 320\right).$$
These functions are symmetric, so the values for $x < 0$ are found by $\bar k(-x) = \bar k(x)$.

For the 4th and 6th order Epanechnikov kernels, for $0 \le x \le 2$,
$$\bar k_{4,1}(x) = \frac{5}{2048}(2-x)^3\left(7x^6 + 42x^5 + 48x^4 - 160x^3 - 144x^2 + 96x + 64\right)$$
$$\bar k_{6,1}(x) = \frac{105}{3407872}(2-x)^3\left(495x^{10} + 2970x^9 + 2052x^8 - 19368x^7 - 32624x^6 + 53088x^5 + 68352x^4 - 48640x^3 - 46720x^2 + 11520x + 7680\right).$$
2.14 Asymptotic Normality
The kernel estimator is the sample average
$$\hat f(x) = \frac{1}{n}\sum_{i=1}^n\frac{1}{|H|}K\left(H^{-1}(X_i - x)\right).$$
We can therefore apply the central limit theorem.

But the convergence rate is not $\sqrt{n}$. We know that
$$\mathrm{var}\left(\hat f(x)\right) = \frac{f(x)R(k)^q}{nh_1h_2\cdots h_q} + O\left(\frac{1}{n}\right),$$
so the convergence rate is $\sqrt{nh_1h_2\cdots h_q}$. When we apply the CLT we scale by this, rather than the conventional $\sqrt{n}$.

As the estimator is biased, we also center at its expectation, rather than the true value. Thus
$$\sqrt{nh_1\cdots h_q}\left(\hat f(x) - E\hat f(x)\right) = \frac{\sqrt{nh_1\cdots h_q}}{n}\sum_{i=1}^n\left(\frac{1}{|H|}K\left(H^{-1}(X_i - x)\right) - E\left[\frac{1}{|H|}K\left(H^{-1}(X_i - x)\right)\right]\right) = \frac{1}{\sqrt{n}}\sum_{i=1}^n Z_{ni}$$
where
$$Z_{ni} = \sqrt{h_1\cdots h_q}\left(\frac{1}{|H|}K\left(H^{-1}(X_i - x)\right) - E\left[\frac{1}{|H|}K\left(H^{-1}(X_i - x)\right)\right]\right).$$
We see that
$$\mathrm{var}\left(Z_{ni}\right) \to f(x)R(k)^q.$$
Hence by the CLT,
$$\sqrt{nh_1\cdots h_q}\left(\hat f(x) - E\hat f(x)\right) \to_d N\left(0,\; f(x)R(k)^q\right).$$
We also know that
$$E\left(\hat f(x)\right) = f(x) + \frac{\kappa_\nu(k)}{\nu!}\sum_{j=1}^q\frac{\partial^\nu}{\partial x_j^\nu}f(x)\,h_j^\nu + o\left(h_1^\nu + \cdots + h_q^\nu\right),$$
so another way of writing this is
$$\sqrt{nh_1\cdots h_q}\left(\hat f(x) - f(x) - \frac{\kappa_\nu(k)}{\nu!}\sum_{j=1}^q\frac{\partial^\nu}{\partial x_j^\nu}f(x)\,h_j^\nu\right) \to_d N\left(0,\; f(x)R(k)^q\right).$$
In the univariate case this is
$$\sqrt{nh}\left(\hat f(x) - f(x) - \frac{\kappa_\nu(k)}{\nu!}f^{(\nu)}(x)h^\nu\right) \to_d N\left(0,\; f(x)R(k)\right).$$
This expression is most useful when the bandwidth is selected to be of optimal order, that is $h = Cn^{-1/(2\nu+1)}$, for then $\sqrt{nh}\,h^\nu = C^{\nu+1/2}$ and we have the equivalent statement
$$\sqrt{nh}\left(\hat f(x) - f(x)\right) \to_d N\left(C^{\nu+1/2}\frac{\kappa_\nu(k)}{\nu!}f^{(\nu)}(x),\; f(x)R(k)\right).$$
This says that the density estimator is asymptotically normal, with a non-zero asymptotic bias and variance.

Some authors play a dirty trick, by using the assumption that $h$ is of smaller order than the optimal rate, e.g. $h = o\left(n^{-1/(2\nu+1)}\right)$. For then they obtain the result
$$\sqrt{nh}\left(\hat f(x) - f(x)\right) \to_d N\left(0,\; f(x)R(k)\right).$$
This appears much nicer. The estimator is asymptotically normal, with mean zero! There are several costs. One, if the bandwidth is really selected to be sub-optimal, the estimator is simply less precise. A sub-optimal bandwidth results in a slower convergence rate. This is not a good thing. The reduction in bias is obtained at an increase in variance. Another cost is that the asymptotic distribution is misleading. It suggests that the estimator is unbiased, which is not honest. Finally, it is unclear how to pick this sub-optimal bandwidth. I call this assumption a dirty trick, because it is slipped in by authors to make their results cleaner and derivations easier. This type of assumption should be avoided.
2.15 Pointwise Confidence Intervals
The asymptotic distribution may be used to construct pointwise confidence intervals for $f(x)$. In the univariate case conventional confidence intervals take the form
$$\hat f(x) \pm 2\left(\frac{\hat f(x)R(k)}{nh}\right)^{1/2}.$$
These are not necessarily the best choice, since the variance is proportional to the mean. This set has the unfortunate property that it can contain negative values, for example.

Instead, consider constructing the confidence interval by inverting a test statistic. To test $H_0: f(x) = f_0$, a t-ratio is
$$t(f_0) = \frac{\hat f(x) - f_0}{\sqrt{f_0 R(k)/(nh)}}.$$
We reject $H_0$ if $|t(f_0)| > 2$. By the no-rejection rule, an asymptotic 95% confidence interval for $f$ is the set of $f_0$ which do not reject, i.e. the set of $f$ such that $|t(f)| \le 2$. This is
$$C(x) = \left\{f : \left|\frac{\hat f(x) - f}{\sqrt{fR(k)/(nh)}}\right| \le 2\right\}.$$
This set must be found numerically.
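A small numerical sketch (my own, not from the notes) of the test-inversion interval, scanning candidate values $f_0$ on a grid; the roughness value assumes a second-order Gaussian kernel.

```python
# Sketch: pointwise confidence interval by inverting the t-ratio |fhat(x) - f| <= 2 sqrt(f R(k)/(nh)).
import numpy as np

def ci_by_inversion(fhat_x, n, h, Rk=1 / (2 * np.sqrt(np.pi)), z=2.0):
    f0 = np.linspace(1e-6, 3 * fhat_x + 1e-6, 100000)           # candidate density values
    accept = np.abs(fhat_x - f0) <= z * np.sqrt(f0 * Rk / (n * h))
    return f0[accept].min(), f0[accept].max()

print(ci_by_inversion(fhat_x=0.25, n=500, h=0.3))               # Gaussian-kernel R(k) assumed
```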
3 Nonparametric Regression
3.1 Nadaraya-Watson Regression
Let the data be $(y_i, X_i)$ where $y_i$ is real-valued and $X_i$ is a $q$-vector, and assume that all are continuously distributed with a joint density $f(y, x)$. Let $f(y \mid x) = f(y, x)/f(x)$ be the conditional density of $y_i$ given $X_i$, where $f(x) = \int f(y, x)\,dy$ is the marginal density of $X_i$. The regression function for $y_i$ on $X_i$ is
$$g(x) = E\left(y_i \mid X_i = x\right).$$
We want to estimate this nonparametrically, with minimal assumptions about $g$.

If we had a large number of observations where $X_i$ exactly equals $x$, we could take the average value of the $y_i$'s for these observations. But since $X_i$ is continuously distributed, we won't observe multiple observations equalling the same value.

The solution is to consider a neighborhood of $x$, and note that if $X_i$ has a positive density at $x$, we should observe a number of observations in this neighborhood, and this number is increasing with the sample size. If the regression function $g(x)$ is continuous, it should be reasonably constant over this neighborhood (if it is small enough), so we can take the average of the $y_i$ values for these observations. The trick is to determine the size of the neighborhood to trade off the variation in $g(x)$ over the neighborhood (estimation bias) against the number of observations in the neighborhood (estimation variance).

Take the one-regressor case $q = 1$. Let a neighborhood of $x$ be $x \pm h$ for some bandwidth $h > 0$. Then a simple nonparametric estimator of $g(x)$ is the average value of the $y_i$'s for the observations $i$ such that $X_i$ is in this neighborhood, that is,
$$\hat g(x) = \frac{\sum_{i=1}^n 1\left(|X_i - x| \le h\right)y_i}{\sum_{i=1}^n 1\left(|X_i - x| \le h\right)} = \frac{\sum_{i=1}^n k\left(\frac{X_i - x}{h}\right)y_i}{\sum_{i=1}^n k\left(\frac{X_i - x}{h}\right)}$$
where $k(u)$ is the uniform kernel.

In general, the kernel regression estimator takes this form, where $k(u)$ is a kernel function. It is known as the Nadaraya-Watson estimator, or local constant estimator.

When $q > 1$ the estimator is
$$\hat g(x) = \frac{\sum_{i=1}^n K\left(H^{-1}(X_i - x)\right)y_i}{\sum_{i=1}^n K\left(H^{-1}(X_i - x)\right)}$$
where $K(u)$ is a multivariate kernel function.

As an alternative motivation, note that the regression function can be written as
$$g(x) = \frac{\int y f(y, x)\,dy}{f(x)}$$
where $f(x) = \int f(y, x)\,dy$ is the marginal density of $X_i$. Now consider estimating $g$ by replacing the density functions by the nonparametric estimates we have already studied. That is,
$$\hat f(y, x) = \frac{1}{n|H|h_y}\sum_{i=1}^n K\left(H^{-1}(X_i - x)\right)k\left(\frac{y_i - y}{h_y}\right)$$
where $h_y$ is a bandwidth for smoothing in the $y$-direction. Then
$$\hat f(x) = \int\hat f(y, x)\,dy = \frac{1}{n|H|h_y}\sum_{i=1}^n K\left(H^{-1}(X_i - x)\right)\int k\left(\frac{y_i - y}{h_y}\right)dy = \frac{1}{n|H|}\sum_{i=1}^n K\left(H^{-1}(X_i - x)\right)$$
and
$$\int y\hat f(y, x)\,dy = \frac{1}{n|H|h_y}\sum_{i=1}^n K\left(H^{-1}(X_i - x)\right)\int y\,k\left(\frac{y_i - y}{h_y}\right)dy = \frac{1}{n|H|}\sum_{i=1}^n K\left(H^{-1}(X_i - x)\right)y_i,$$
and thus taking the ratio
$$\hat g(x) = \frac{\frac{1}{n|H|}\sum_{i=1}^n K\left(H^{-1}(X_i - x)\right)y_i}{\frac{1}{n|H|}\sum_{i=1}^n K\left(H^{-1}(X_i - x)\right)} = \frac{\sum_{i=1}^n K\left(H^{-1}(X_i - x)\right)y_i}{\sum_{i=1}^n K\left(H^{-1}(X_i - x)\right)},$$
again obtaining the Nadaraya-Watson estimator. Note that the bandwidth $h_y$ has disappeared.

The estimator is ill-defined for values of $x$ such that $\hat f(x) \le 0$. This can occur in the tails of the distribution of $X_i$. As higher-order kernels can yield $\hat f(x) < 0$, many authors suggest using only second-order kernels for regression. I am unsure if this is a correct recommendation. If a higher-order kernel is used and for some $x$ we find $\hat f(x) < 0$, this suggests that the data is so sparse in that neighborhood of $x$ that it is unreasonable to estimate the regression function. It does not require the abandonment of higher-order kernels. We will follow convention and typically assume that $k$ is second order ($\nu = 2$) for our presentation.
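A minimal sketch (my own code, not from the notes) of the Nadaraya-Watson estimator with a product Gaussian kernel:

```python
# Sketch: Nadaraya-Watson (local constant) regression estimator at a single point.
import numpy as np

def gaussian(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def nadaraya_watson(x, X, y, h):
    """Estimate g(x) = E[y | X = x] at a single point x (X is n-by-q, h is length q)."""
    u = (X - x) / h                                   # H^{-1}(X_i - x)
    K = gaussian(u).prod(axis=1)                      # product kernel weights
    return (K * y).sum() / K.sum()

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(400, 1))
y = np.sin(X[:, 0]) + 0.2 * rng.normal(size=400)
h = np.array([0.3])
print(nadaraya_watson(np.array([1.0]), X, y, h))      # compare with sin(1) ~ 0.84
```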
3.2 Asymptotic Distribution
We analyze the asymptotic distribution of the NW estimator $\hat g(x)$ for the case $q = 1$.

Since $E(y_i \mid X_i) = g(X_i)$, we can write the regression equation as $y_i = g(X_i) + e_i$ where $E(e_i \mid X_i) = 0$. We can also write the conditional variance as $E\left(e_i^2 \mid X_i = x\right) = \sigma^2(x)$.

Fix $x$. Note that
$$y_i = g(X_i) + e_i = g(x) + \left(g(X_i) - g(x)\right) + e_i$$
and therefore
$$\frac{1}{nh}\sum_{i=1}^n k\left(\frac{X_i - x}{h}\right)y_i = \frac{1}{nh}\sum_{i=1}^n k\left(\frac{X_i - x}{h}\right)g(x) + \frac{1}{nh}\sum_{i=1}^n k\left(\frac{X_i - x}{h}\right)\left(g(X_i) - g(x)\right) + \frac{1}{nh}\sum_{i=1}^n k\left(\frac{X_i - x}{h}\right)e_i = \hat f(x)g(x) + \hat m_1(x) + \hat m_2(x),$$
say. It follows that
$$\hat g(x) = g(x) + \frac{\hat m_1(x)}{\hat f(x)} + \frac{\hat m_2(x)}{\hat f(x)}.$$
We now analyze the asymptotic distributions of the components $\hat m_1(x)$ and $\hat m_2(x)$.

First take $\hat m_2(x)$. Since $E(e_i \mid X_i) = 0$ it follows that $E\left[k\left(\frac{X_i - x}{h}\right)e_i\right] = 0$ and thus $E\left(\hat m_2(x)\right) = 0$. Its variance is
$$\mathrm{var}\left(\hat m_2(x)\right) = \frac{1}{nh^2}E\left[\left(k\left(\frac{X_i - x}{h}\right)e_i\right)^2\right] = \frac{1}{nh^2}E\left[k\left(\frac{X_i - x}{h}\right)^2\sigma^2(X_i)\right]$$
(by conditioning), and this is
$$\frac{1}{nh^2}\int k\left(\frac{z - x}{h}\right)^2\sigma^2(z)f(z)\,dz$$
(where $f(z)$ is the density of $X_i$). Making the change of variables, this equals
$$\frac{1}{nh}\int k(u)^2\sigma^2(x + hu)f(x + hu)\,du = \frac{1}{nh}\int k(u)^2\sigma^2(x)f(x)\,du + o\left(\frac{1}{nh}\right) = \frac{R(k)\sigma^2(x)f(x)}{nh} + o\left(\frac{1}{nh}\right)$$
if $\sigma^2(x)f(x)$ is smooth in $x$. We can even apply the CLT to obtain that as $h \to 0$ and $nh \to \infty$,
$$\sqrt{nh}\,\hat m_2(x) \to_d N\left(0,\; R(k)\sigma^2(x)f(x)\right).$$

Now take $\hat m_1(x)$. Its mean is
$$E\hat m_1(x) = \frac{1}{h}E\left[k\left(\frac{X_i - x}{h}\right)\left(g(X_i) - g(x)\right)\right] = \frac{1}{h}\int k\left(\frac{z - x}{h}\right)\left(g(z) - g(x)\right)f(z)\,dz = \int k(u)\left(g(x + hu) - g(x)\right)f(x + hu)\,du.$$
Now expanding both $g$ and $f$ in Taylor expansions, this equals, up to $o(h^2)$,
$$\int k(u)\left(uh\,g^{(1)}(x) + \frac{u^2h^2}{2}g^{(2)}(x)\right)\left(f(x) + uh\,f^{(1)}(x)\right)du = \left(\int k(u)u\,du\right)h\,g^{(1)}(x)f(x) + \left(\int k(u)u^2\,du\right)h^2\left(\frac{1}{2}g^{(2)}(x)f(x) + g^{(1)}(x)f^{(1)}(x)\right) = h^2\kappa_2\left(\frac{1}{2}g^{(2)}(x)f(x) + g^{(1)}(x)f^{(1)}(x)\right) = h^2\kappa_2 B(x)f(x),$$
where
$$B(x) = \frac{1}{2}g^{(2)}(x) + f(x)^{-1}g^{(1)}(x)f^{(1)}(x).$$
(If $k$ is a higher-order kernel, this is $O(h^\nu)$ instead.) A similar expansion shows that $\mathrm{var}\left(\hat m_1(x)\right) = O\left(\frac{h^2}{nh}\right)$, which is of smaller order than $O\left(\frac{1}{nh}\right)$. Thus
$$\sqrt{nh}\left(\hat m_1(x) - h^2\kappa_2 B(x)f(x)\right) \to_p 0$$
and since $\hat f(x) \to_p f(x)$,
$$\sqrt{nh}\left(\frac{\hat m_1(x)}{\hat f(x)} - h^2\kappa_2 B(x)\right) \to_p 0.$$
In summary, we have
$$\sqrt{nh}\left(\hat g(x) - g(x) - h^2\kappa_2 B(x)\right) = \sqrt{nh}\left(\frac{\hat m_1(x)}{\hat f(x)} - h^2\kappa_2 B(x)\right) + \frac{\sqrt{nh}\,\hat m_2(x)}{\hat f(x)} \to_d \frac{N\left(0,\; R(k)\sigma^2(x)f(x)\right)}{f(x)} = N\left(0,\; \frac{R(k)\sigma^2(x)}{f(x)}\right).$$
When $X_i$ is a $q$-vector, the result is
$$\sqrt{n|H|}\left(\hat g(x) - g(x) - \kappa_2\sum_{j=1}^q h_j^2 B_j(x)\right) \to_d N\left(0,\; \frac{R(k)^q\sigma^2(x)}{f(x)}\right)$$
where
$$B_j(x) = \frac{1}{2}\frac{\partial^2}{\partial x_j^2}g(x) + f(x)^{-1}\frac{\partial}{\partial x_j}g(x)\frac{\partial}{\partial x_j}f(x).$$
3.3 Mean Squared Error
The AMSE of the NW estimator $\hat g(x)$ is
$$\mathrm{AMSE}\left(\hat g(x)\right) = \kappa_2^2\left(\sum_{j=1}^q h_j^2 B_j(x)\right)^2 + \frac{R(k)^q\sigma^2(x)}{n|H|f(x)}.$$
A weighted integrated MSE takes the form
$$\mathrm{WIMSE} = \int\mathrm{AMSE}\left(\hat g(x)\right)f(x)M(x)\,dx = \kappa_2^2\int\left(\sum_{j=1}^q h_j^2 B_j(x)\right)^2 f(x)M(x)\,dx + \frac{R(k)^q\int\sigma^2(x)M(x)\,dx}{nh_1h_2\cdots h_q}$$
where $M(x)$ is a weight function. Possible choices include $M(x) = f(x)$ and $M(x) = 1\left(f(x) \ge \epsilon\right)$ for some $\epsilon > 0$. The integrated MSE needs the weighting; otherwise the integral may not exist.
3.4 Observations about the Asymptotic Distribution
In univariate regression, the optimal rate for the bandwidth is $h_0 = Cn^{-1/5}$ with mean-squared convergence rate $O\left(n^{-2/5}\right)$. In the multiple regressor case, the optimal bandwidths are $h_j = Cn^{-1/(q+4)}$ with convergence rate $O\left(n^{-2/(q+4)}\right)$. This is the same as for univariate and $q$-variate density estimation.

If higher-order kernels are used, the optimal bandwidth and convergence rates are again the same as for density estimation.

The asymptotic distribution depends on the kernel through $R(k)$ and $\kappa_2$. The optimal kernel minimizes $R(k)$, the same as for density estimation. Thus the Epanechnikov family is optimal for regression.

As the WIMSE depends on the first and second derivatives of the mean function $g(x)$, the optimal bandwidth will depend on these values. When the derivative functions $B_j(x)$ are larger, the optimal bandwidths are smaller, to capture the fluctuations in the function $g(x)$. When the derivatives are smaller, optimal bandwidths are larger, smooth more, and thus reduce the estimation variance.

For nonparametric regression, reference bandwidths are not natural. This is because there is no natural reference $g(x)$ which dictates the first and second derivative. Many authors use the rule-of-thumb bandwidth for density estimation (for the regressors $X_i$) but there is absolutely no justification for this choice. The theory shows that the optimal bandwidth depends on the curvature in the conditional mean $g(x)$, and this is independent of the marginal density $f(x)$ for which the rule-of-thumb is designed.
3.5 Limitations of the NW estimator
Suppose that $q = 1$ and the true conditional mean is linear: $g(x) = \alpha + x\beta$. As this is a very simple situation, we might expect that a nonparametric estimator will work reasonably well. This is not necessarily the case with the NW estimator.

Take the absolutely simplest case that there is no regression error, i.e. $y_i = \alpha + X_i\beta$, identically. A simple scatter plot would reveal the deterministic relationship. How will NW perform?

The answer depends on the marginal distribution of the $X_i$. If they are not spaced at uniform distances, then $\hat g(x) \ne g(x)$. The NW estimator applied to purely linear data yields a nonlinear output!

One way to see the source of the problem is to consider the problem of nonparametrically estimating $E\left(X_i - x \mid X_i = x\right) = 0$. The numerator of the NW estimator of this expectation is
$$\sum_{i=1}^n k\left(\frac{X_i - x}{h}\right)\left(X_i - x\right),$$
but this is (generally) non-zero.

Can the problem be resolved by choice of bandwidth? Actually, it can make things worse. As the bandwidth increases (to increase smoothing), $\hat g(x)$ collapses to a flat function. Recall that the NW estimator is also called the local constant estimator. It is approximating the regression function by a (local) constant. As smoothing increases, the estimator simplifies to a constant, not to a linear function.

Another limitation of the NW estimator occurs at the edges of the support. Again consider the case $q = 1$. For a value of $x \le \min(X_i)$, the NW estimator $\hat g(x)$ is an average only of $y_i$ values for observations to the right of $x$. If $g(x)$ is positively sloped, the NW estimator will be upward biased. In fact, the estimator is inconsistent at the boundary. This effectively restricts application of the NW estimator to values of $x$ in the interior of the support of the regressors, and this may be too limiting.
3.6 Local Linear Estimator
We started this chapter by motivating the NW estimator at $x$ by taking an average of the $y_i$ values for observations such that the $X_i$ are in a neighborhood of $x$. This is a local constant approximation. Instead, we could fit a linear regression line through the observations in the same neighborhood. If we use a weighting function, this is called the local linear (LL) estimator, and it is quite popular in the recent nonparametric regression literature.

The idea is to fit the local model
$$y_i = \alpha + \beta'\left(X_i - x\right) + e_i.$$
The reason for using the regressor $X_i - x$ rather than $X_i$ is so that the intercept equals $g(x) = E\left(y_i \mid X_i = x\right)$. Once we get the estimates $\hat\alpha(x)$, $\hat\beta(x)$, we then set $\hat g(x) = \hat\alpha(x)$. Furthermore, we can use $\hat\beta(x)$ as an estimate of $\frac{\partial}{\partial x}g(x)$.

If we simply fit a linear regression through observations such that $|X_i - x| \le h$, this can be written as
$$\min_{\alpha,\beta}\sum_{i=1}^n\left(y_i - \alpha - \beta'\left(X_i - x\right)\right)^2 1\left(|X_i - x| \le h\right),$$
or setting
$$Z_i = \begin{pmatrix}1 \\ X_i - x\end{pmatrix}$$
we have the explicit expression
$$\begin{pmatrix}\hat\alpha(x) \\ \hat\beta(x)\end{pmatrix} = \left(\sum_{i=1}^n 1\left(|X_i - x| \le h\right)Z_iZ_i'\right)^{-1}\left(\sum_{i=1}^n 1\left(|X_i - x| \le h\right)Z_iy_i\right) = \left(\sum_{i=1}^n K\left(H^{-1}(X_i - x)\right)Z_iZ_i'\right)^{-1}\left(\sum_{i=1}^n K\left(H^{-1}(X_i - x)\right)Z_iy_i\right)$$
where the second expression is valid for any (multivariate) kernel function. This is a (locally) weighted regression of $y_i$ on $X_i$. Algebraically, this equals a WLS estimator.

In contrast to the NW estimator, the LL estimator preserves linear data. That is, if the true data lie on a line $y_i = \alpha + X_i'\beta$, then for any sub-sample, a local linear regression fits exactly, so $\hat g(x) = g(x)$. In fact, we will see that the distribution of the LL estimator is invariant to the first derivative of $g$. It has zero bias when the true regression is linear.

As $h \to \infty$ (smoothing is increased), the LL estimator collapses to the OLS regression of $y_i$ on $X_i$. In this sense LL is a natural nonparametric generalization of least-squares regression.

The LL estimator also has much better properties at the boundary than the NW estimator. Intuitively, even if $x$ is at the boundary of the regression support, as the local linear estimator fits a (weighted) least-squares line through data near the boundary, if the true relationship is linear this estimator will be unbiased.

Deriving the asymptotic distribution of the LL estimator is similar to that of the NW estimator, but much more involved, so I will not present the argument here. It has the following asymptotic distribution. Let $\hat g(x) = \hat\alpha(x)$. Then
$$\sqrt{n|H|}\left(\hat g(x) - g(x) - \kappa_2\sum_{j=1}^q h_j^2\frac{1}{2}\frac{\partial^2}{\partial x_j^2}g(x)\right) \to_d N\left(0,\; \frac{R(k)^q\sigma^2(x)}{f(x)}\right).$$
This is quite similar to the distribution for the NW estimator, with one important difference: the bias term has been simplified. The term involving $f(x)^{-1}\frac{\partial}{\partial x_j}g(x)\frac{\partial}{\partial x_j}f(x)$ has been eliminated. The asymptotic variance is unchanged.

Strictly speaking, we cannot rank the AMSE of the NW versus the LL estimator. While a bias term has been eliminated, it is possible that the two terms have opposite signs and thereby cancel somewhat. However, the standard intuition is that a simplified bias term suggests reduced bias in practice. The AMSE of the LL estimator only depends on the second derivative of $g(x)$, while that of the NW estimator also depends on the first derivative. We expect this to translate into reduced bias.

Magically, this does not come at a cost in the asymptotic variance. These facts have led the statistics literature to focus on the LL estimator as the preferred approach.

While I agree with this general view, a side note of caution is warranted. Simple simulation experiments show that the LL estimator does not always beat the NW estimator. When the regression function $g(x)$ is quite flat, the NW estimator does better. When the regression function is steeper and curvier, the LL estimator tends to do better. The explanation is that while the two have identical asymptotic variance formulae, in finite samples the NW estimator tends to have a smaller variance. This gives it an advantage in contexts where estimation bias is low (such as when the regression function is flat). The reason why I mention this is that in many economic contexts, it is believed that the regression function may be quite flat with respect to many regressors. In this context it may be better to use NW rather than LL.
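A sketch (my own, not from the notes) of the local linear estimator as a kernel-weighted least-squares fit; the intercept is $\hat g(x)$ and the remaining coefficients estimate the gradient.

```python
# Sketch: local linear regression estimator at a single point x.
import numpy as np

def gaussian(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def local_linear(x, X, y, h):
    u = (X - x) / h
    w = gaussian(u).prod(axis=1)                      # kernel weights K(H^{-1}(X_i - x))
    Z = np.column_stack([np.ones(len(X)), X - x])     # Z_i = (1, X_i - x)'
    ZtWZ = Z.T @ (w[:, None] * Z)
    ZtWy = Z.T @ (w * y)
    coef = np.linalg.solve(ZtWZ, ZtWy)
    return coef[0], coef[1:]                          # (ghat(x), gradient estimate)

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(400, 1))
y = np.sin(X[:, 0]) + 0.2 * rng.normal(size=400)
ghat, grad = local_linear(np.array([1.0]), X, y, np.array([0.3]))
print(ghat, grad)                                     # compare with sin(1) and cos(1)
```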
3.7 Local Polynomial Estimation
If LL improves on NW, why not local polynomial? The intuition is quite straightforward. Rather than fitting a local linear equation, we can fit a local quadratic, cubic, or polynomial of arbitrary order.

Let $p$ denote the order of the local polynomial. Thus $p = 0$ is the NW estimator, $p = 1$ is the LL estimator, and $p = 2$ is a local quadratic.

Interestingly, the asymptotic behavior differs depending on whether $p$ is even or odd.

When $p$ is odd (e.g. LL), the bias is of order $O\left(h^{p+1}\right)$ and is proportional to $g^{(p+1)}(x)$.

When $p$ is even (e.g. NW or local quadratic), the bias is of order $O\left(h^{p+2}\right)$ but is proportional to $g^{(p+2)}(x)$ and $g^{(p+1)}(x)f^{(1)}(x)/f(x)$.

In either case, the variance is $O\left(\frac{1}{n|H|}\right)$.

What happens is that by increasing the polynomial order from even to the next odd number, the order of the bias does not change, but the bias simplifies. By increasing the polynomial order from odd to the next even number, the bias order decreases. This effect is analogous to the bias reduction achieved by higher-order kernels.

While local linear estimation is gaining popularity in econometric practice, local polynomial methods are not typically used. I believe this is mostly because typical econometric applications have $q > 1$, and it is difficult to apply polynomial methods in this context.
3.8 Weighted Nadaraya-Watson Estimator
In the context of conditional distribution estimation, Hall et. al. (1999, JASA) and Cai (2002,
ET) proposed a weighted NW estimator with the same asymptotic distribution as the LL estimator.
This is discussed on pp. 187-188 of Li-Racine.
The estimator takes the form
\[
\hat g(x) = \frac{\sum_{i=1}^n p_i(x) K\left(H^{-1}(X_i - x)\right) y_i}{\sum_{i=1}^n p_i(x) K\left(H^{-1}(X_i - x)\right)}
\]
where p_i(x) are weights. The weights satisfy
\[
p_i(x) \ge 0, \qquad \sum_{i=1}^n p_i(x) = 1, \qquad \sum_{i=1}^n p_i(x) K\left(H^{-1}(X_i - x)\right)(X_i - x) = 0.
\]
The first two requirements set up the p_i(x) as weights. The third equality requires the weights to force the kernel function to satisfy local linearity.
The weights are determined by empirical likelihood. Specifically, for each x, you maximize \sum_{i=1}^n \ln p_i(x) subject to the above constraints. The solutions take the form
\[
p_i(x) = \frac{1}{n\left(1 + \lambda'(X_i - x) K\left(H^{-1}(X_i - x)\right)\right)}
\]
where \lambda is a Lagrange multiplier and is found by numerical optimization. For details about empirical likelihood, see my Econometrics lecture notes.
The above authors show that the estimator \hat g(x) has the same asymptotic distribution as LL. When the dependent variable is non-negative, y_i \ge 0, the standard and weighted NW estimators also satisfy \hat g(x) \ge 0. This is an advantage since it is obvious in this case that g(x) \ge 0. In contrast, the LL estimator is not necessarily non-negative.
An important disadvantage of the weighted NW estimator is that it is considerably more computationally cumbersome than the LL estimator. The EL weights must be found separately for each x at which \hat g(x) is calculated.
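As a computational illustration (not from the notes), here is a minimal Python sketch of the weighted NW estimator at a single point for scalar X, using a Gaussian kernel and solving for the Lagrange multiplier by a one-dimensional root search. The function name wnw_estimate and the bracketing of the multiplier are illustrative choices.

```python
import numpy as np
from scipy.optimize import brentq

def wnw_estimate(x0, X, y, h):
    """Weighted NW estimate of g(x0) for scalar X, with empirical-likelihood
    weights p_i chosen so that sum_i p_i K_i (X_i - x0) = 0 (local linearity)."""
    K = np.exp(-0.5 * ((X - x0) / h) ** 2)   # Gaussian kernel weights K((X_i - x0)/h)
    u = K * (X - x0)                          # constraint terms K_i (X_i - x0)

    # p_i = 1 / (n (1 + lam * u_i)); the constraint sum_i p_i u_i = 0 pins down lam.
    def constraint(lam):
        return np.sum(u / (1.0 + lam * u))

    # Bracket lam so that all 1 + lam*u_i stay positive (weights remain valid).
    lo = -1.0 / u.max() if u.max() > 0 else -1e6
    hi = -1.0 / u.min() if u.min() < 0 else 1e6
    lam = brentq(constraint, lo + 1e-8, hi - 1e-8)

    p = 1.0 / (len(X) * (1.0 + lam * u))      # empirical likelihood weights
    return np.sum(p * K * y) / np.sum(p * K)

# Example: data from y = sin(X) + noise
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, 300)
y = np.sin(X) + 0.3 * rng.standard_normal(300)
print(wnw_estimate(0.5, X, y, h=0.3))         # compare with sin(0.5) ~ 0.479
```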
3.9 Residual and Fit
Given any nonparametric estimator \hat g(x) we can define the residual \hat e_i = y_i - \hat g(X_i). Numerically, this requires computing the regression estimate at each observation. For example, in the case of NW estimation,
\[
\hat e_i = y_i - \frac{\sum_{j=1}^n K\left(H^{-1}(X_j - X_i)\right) y_j}{\sum_{j=1}^n K\left(H^{-1}(X_j - X_i)\right)}.
\]
From \hat e_i we can compute many conventional regression statistics. For example, the residual variance estimate is n^{-1} \sum_{i=1}^n \hat e_i^2, and R^2 has the standard formula.
One cautionary remark is that since the convergence rate for \hat g is slower than n^{1/2}, the same is true for many statistics computed from \hat e_i.
We can also compute the leave-one-out residuals
\[
\hat e_{i,-i} = y_i - \hat g_{-i}(X_i) = y_i - \frac{\sum_{j \ne i} K\left(H^{-1}(X_j - X_i)\right) y_j}{\sum_{j \ne i} K\left(H^{-1}(X_j - X_i)\right)}.
\]
3.10 Cross-Validation
For NW, LL and local polynomial regression, it is critical to have a reliable data-dependent rule for bandwidth selection. One popular and practical approach is cross-validation. The motivation starts by considering the sum of squared errors \sum_{i=1}^n \hat e_i^2. One could think about picking h to minimize this quantity. But this is analogous to picking the number of regressors in least-squares by minimizing the sum of squared errors. In that context the solution is to pick all possible regressors, as the sum of squared errors is monotonically decreasing in the number of regressors. The same is true in nonparametric regression. As the bandwidth h decreases, the in-sample fit of the model improves and \sum_{i=1}^n \hat e_i^2 decreases. As h shrinks to zero, \hat g(X_i) collapses on y_i to obtain a perfect fit, \hat e_i shrinks to zero and so does \sum_{i=1}^n \hat e_i^2. It is clearly a poor choice to pick h based on this criterion.
Instead, we can consider the sum of squared leave-one-out residuals \sum_{i=1}^n \hat e_{i,-i}^2. This is a reasonable criterion. Because the quality of \hat g(X_i) can be quite poor for tail values of X_i, it may be more sensible to use a trimmed version of the sum of squared residuals, and this is called the cross-validation criterion
\[
CV(h) = \frac{1}{n} \sum_{i=1}^n \hat e_{i,-i}^2\, M(X_i).
\]
(We have also divided by the sample size for convenience.) The function M(x) is a trimming function, the same as introduced in the definition of WIMSE earlier.
The cross-validation bandwidth h is that which minimizes CV(h). As in the case of density estimation, this needs to be done numerically.
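A minimal sketch of this numerical search, assuming scalar X, a Gaussian kernel, and a simple grid search over h; the function names and the absence of trimming (M = 1) are illustrative simplifications.

```python
import numpy as np

def nw_loo_residuals(X, y, h):
    """Leave-one-out NW residuals e_{i,-i} = y_i - g_{-i}(X_i), scalar X, Gaussian kernel."""
    K = np.exp(-0.5 * ((X[:, None] - X[None, :]) / h) ** 2)  # K[i, j] = k((X_j - X_i)/h)
    np.fill_diagonal(K, 0.0)                                  # drop own observation
    ghat = K @ y / K.sum(axis=1)
    return y - ghat

def cv_criterion(X, y, h, M=None):
    """CV(h) = n^{-1} sum_i e_{i,-i}^2 M(X_i); M defaults to no trimming."""
    e = nw_loo_residuals(X, y, h)
    if M is None:
        M = np.ones_like(y)
    return np.mean(e ** 2 * M)

# Example: pick h over a grid
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, 200)
y = np.sin(2 * np.pi * X) + 0.3 * rng.standard_normal(200)
grid = np.linspace(0.02, 0.3, 30)
cv = [cv_criterion(X, y, h) for h in grid]
print("CV-selected bandwidth:", grid[int(np.argmin(cv))])
```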
To see that the CV criterion is sensible, let us calculate its expectation. Since y_i = g(X_i) + e_i,
\[
E\left(CV(h)\right) = E\left[ \left( e_i + g(X_i) - \hat g_{-i}(X_i) \right)^2 M(X_i) \right]
= E\left[ \left( g(X_i) - \hat g_{-i}(X_i) \right)^2 M(X_i) \right] + 2 E\left[ e_i \left( g(X_i) - \hat g_{-i}(X_i) \right) M(X_i) \right] + E\left[ e_i^2 M(X_i) \right].
\]
The third term does not depend on the bandwidth so can be ignored. For the second term we use the law of iterated expectations, conditioning on X_i and Y_{-i} (the sample excluding the i-th observation) to obtain
\[
E\left( e_i \left( g(X_i) - \hat g_{-i}(X_i) \right) M(X_i) \mid Y_{-i}, X_i \right) = E\left( e_i \mid X_i \right) \left( g(X_i) - \hat g_{-i}(X_i) \right) M(X_i) = 0,
\]
so the unconditional expectation is zero. For the first term we take expectations conditional on Y_{-i} to obtain
\[
E\left( \left( g(X_i) - \hat g_{-i}(X_i) \right)^2 M(X_i) \mid Y_{-i} \right) = \int \left( g(x) - \hat g_{-i}(x) \right)^2 M(x) f(x)\, dx
\]
and thus the unconditional expectation is
\[
E\left( \left( g(X_i) - \hat g_{-i}(X_i) \right)^2 M(X_i) \right) = \int E\left( g(x) - \hat g_{-i}(x) \right)^2 M(x) f(x)\, dx = \int E\left( g(x) - \hat g(x) \right)^2 M(x) f(x)\, dx = \int MSE\left( \hat g(x) \right) M(x) f(x)\, dx,
\]
which is WIMSE(h).
We have shown that
\[
E\left(CV(h)\right) = WIMSE(h) + E\left( e_i^2 M(X_i) \right).
\]
Thus CV is an estimator of the weighted integrated squared error.
As in the case of density estimation, it can be shown that it is a good estimator of WIMSE(h), in the sense that the minimizer of CV(h) is consistent for the minimizer of WIMSE(h). This holds true for NW, LL and other nonparametric methods. In this sense, cross-validation is a general, practical method for bandwidth selection.
3.11 Displaying Estimates and Pointwise Confidence Bands
When q = 1 it is simple to display \hat g(x) as a function of x, by calculating the estimator on a grid of values.
When q > 1 it is less simple. Writing the estimator as \hat g(x_1, x_2, ..., x_q), you can display it as a function of one variable, holding the others fixed. The variables held fixed can be set at their sample means, or varied across a few representative values.
When displaying an estimated regression function, it is good to include confidence bands. Typically these are pointwise confidence intervals, and can be computed using the \hat g(x) \pm 2 s(x) method, where s(x) is a standard error. Recall that the asymptotic distribution of the NW and LL estimators takes the form
\[
\sqrt{n h_1 \cdots h_q} \left( \hat g(x) - g(x) - Bias(x) \right) \overset{d}{\to} N\left( 0, \frac{R(k)^q \sigma^2(x)}{f(x)} \right).
\]
Ignoring the bias (as it cannot be estimated well), this suggests the standard error formula
\[
s(x) = \sqrt{ \frac{R(k)^q \hat\sigma^2(x)}{n h_1 \cdots h_q \hat f(x)} }
\]
where \hat f(x) is an estimate of f(x) and \hat\sigma^2(x) is an estimate of \sigma^2(x) = E\left( e_i^2 \mid X_i = x \right).
A simple choice for \hat\sigma^2(x) is the sample mean of the squared residuals, \hat\sigma^2. But this is valid only under conditional homoskedasticity. We discuss nonparametric estimates of \sigma^2(x) shortly.
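A sketch of these pointwise bands for scalar x with a Gaussian kernel, using the homoskedastic variance choice for simplicity; R(k) = 1/(2 sqrt(pi)) is the Gaussian-kernel value, and the names and bandwidth are illustrative.

```python
import numpy as np

R_GAUSS = 1.0 / (2.0 * np.sqrt(np.pi))   # R(k) = int k(u)^2 du for the Gaussian kernel

def nw_fit_and_band(xgrid, X, y, h):
    """NW estimate ghat(x) with pointwise bands ghat(x) +/- 2 s(x), scalar X.
    Uses the homoskedastic variance estimate (mean of squared residuals)."""
    n = len(X)
    K = np.exp(-0.5 * ((xgrid[:, None] - X[None, :]) / h) ** 2)
    ghat = K @ y / K.sum(axis=1)

    # residuals at the sample points, for sigma^2-hat
    Ks = np.exp(-0.5 * ((X[:, None] - X[None, :]) / h) ** 2)
    resid = y - Ks @ y / Ks.sum(axis=1)
    sigma2 = np.mean(resid ** 2)

    fhat = K.sum(axis=1) / (n * h * np.sqrt(2 * np.pi))   # kernel density estimate of f(x)
    s = np.sqrt(R_GAUSS * sigma2 / (n * h * fhat))        # s(x) = sqrt(R(k) sigma^2 / (n h f(x)))
    return ghat, ghat - 2 * s, ghat + 2 * s

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, 300)
y = np.sin(2 * np.pi * X) + 0.3 * rng.standard_normal(300)
xgrid = np.linspace(0.05, 0.95, 50)
ghat, lo, hi = nw_fit_and_band(xgrid, X, y, h=0.08)
print(np.column_stack([xgrid, ghat, lo, hi])[:5])
```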
3.12 Uniform Convergence
For theoretical purposes we often need nonparametric estimators such as \hat f(x) or \hat g(x) to converge uniformly. The primary applications are two-step and semiparametric estimators which depend on the first-step nonparametric estimator. For example, if a two-step estimator depends on the residual \hat e_i, we note that
\[
\hat e_i - e_i = g(X_i) - \hat g(X_i)
\]
is hard to handle (in terms of stochastic bounding), as it is an estimated function evaluated at a random variable. If \hat g(x) converges to g(x) pointwise in x, but not uniformly in x, then we don't know whether the difference g(X_i) - \hat g(X_i) is converging to zero or not. One solution is to apply a uniform convergence result. That is, the above expression is bounded in absolute value, for |X_i| \le C with some C < \infty, by
\[
\sup_{|x| \le C} \left| g(x) - \hat g(x) \right|,
\]
and this is the object of study for uniform convergence.
It turns out that there is some cost to obtaining uniformity. While the NW and LL estimators converge pointwise at the rate n^{-2/(q+4)} (the square root of the MSE convergence rate), the uniform convergence rate is
\[
\sup_{|x| \le C} \left| g(x) - \hat g(x) \right| = O_p\left( \left( \frac{\ln n}{n} \right)^{2/(q+4)} \right).
\]
The O_p(\cdot) symbol means bounded in probability, meaning that the LHS is bounded by a constant times this rate with probability arbitrarily close to one. Alternatively, the same rate holds almost surely. The difference with the pointwise case is the addition of the extra \ln n term. This is a very slow penalty, but it is a penalty nonetheless.
This rate was shown by Stone to be the best possible, so the penalty is not an artifact of the proof technique.
A recent paper of mine provides some generalizations of this result, allowing for dependent data (time series): B. Hansen, Econometric Theory, 2008.
One important feature of this type of bound is the restriction of x to the compact set |x| \le C. This is a bit unfortunate, as in applications we often want to apply uniform convergence over the entire support of the regressors, and the latter can be unbounded. One solution is to ignore this technicality and just assume that the regressors are bounded. Another solution is to apply the result using trimming, a technique which we will probably discuss later, when we do semiparametrics. Finally, as shown in my 2008 paper, it is also possible to allow the constant C = C_n to diverge with n, but at the cost of slowing down the rate of convergence on the RHS.
3.13 Nonparametric Variance Estimation
Let \sigma^2(x) = var(y_i \mid X_i = x). It is sometimes of direct economic interest to estimate \sigma^2(x). In other cases we just want to estimate it to get a confidence interval for g(x).
The following method is recommended. Write the model as
\[
y_i = g(X_i) + e_i, \qquad E(e_i \mid X_i) = 0,
\]
\[
e_i^2 = \sigma^2(X_i) + \eta_i, \qquad E(\eta_i \mid X_i) = 0.
\]
Then \sigma^2(x) is the regression function of e_i^2 on X_i.
If e_i^2 were observed, this could be done using NW, weighted NW, or LL regression. While e_i^2 is not observed, it can be replaced by \hat e_i^2, where \hat e_i = y_i - \hat g(X_i) are the nonparametric regression residuals. Using a NW estimator,
\[
\hat\sigma^2(x) = \frac{\sum_{i=1}^n K\left(H^{-1}(X_i - x)\right) \hat e_i^2}{\sum_{i=1}^n K\left(H^{-1}(X_i - x)\right)},
\]
and similarly using weighted NW or LL. The bandwidths H are not the same as for estimation of \hat g(x), although we use the same notation.
As discussed earlier, the LL estimator \hat\sigma^2(x) is not guaranteed to be non-negative, while the NW and weighted NW estimators are always non-negative (if non-negative kernels are used).
Fan and Yao (1998, Biometrika) analyze the asymptotic distribution of this estimator. They obtain the surprising result that the asymptotic distribution of this two-step estimator is identical to that of the one-step idealized estimator
\[
\tilde\sigma^2(x) = \frac{\sum_{i=1}^n K\left(H^{-1}(X_i - x)\right) e_i^2}{\sum_{i=1}^n K\left(H^{-1}(X_i - x)\right)}.
\]
That is, the nonparametric regression of \hat e_i^2 on X_i is asymptotically equivalent to the nonparametric regression of e_i^2 on X_i.
Technically, they demonstrated this result when \hat g and \hat\sigma^2 are computed using LL, but from the nature of the argument it appears that the same holds for the NW estimator. They also only demonstrated the result for q = 1, but it extends to the q > 1 case.
This is a neat result, and is not typical in two-step estimation. One convenient implication is that we can pick bandwidths in each step based on conventional one-step regression methods, ignoring the two-step nature of the problem. Additionally, we do not have to worry about the first-step estimation of g(x) when computing confidence intervals for \sigma^2(x).
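A minimal sketch of the two-step variance estimator for scalar X with Gaussian kernels and NW in both steps; the bandwidths and function names are illustrative, and no claim is made about optimal bandwidth choice.

```python
import numpy as np

def nw(xgrid, X, v, h):
    """NW regression estimate of E(v | X = x) on xgrid, scalar X, Gaussian kernel."""
    K = np.exp(-0.5 * ((np.asarray(xgrid)[:, None] - X[None, :]) / h) ** 2)
    return K @ v / K.sum(axis=1)

def variance_function(xgrid, X, y, h_mean, h_var):
    """Two-step estimate of sigma^2(x): regress squared NW residuals on X."""
    resid = y - nw(X, X, y, h_mean)          # first step: residuals e_i = y_i - ghat(X_i)
    return nw(xgrid, X, resid ** 2, h_var)   # second step: NW regression of e_i^2 on X_i

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, 500)
y = np.sin(2 * np.pi * X) + (0.2 + 0.5 * X) * rng.standard_normal(500)  # sd(x) = 0.2 + 0.5x
print(variance_function([0.2, 0.5, 0.8], X, y, h_mean=0.08, h_var=0.12))
# compare with (0.2 + 0.5x)^2 = 0.09, 0.2025, 0.36
```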
4 Conditional Distribution Estimation
4.1 Estimators
The conditional distribution function (CDF) of y_i given X_i = x is
\[
F(y \mid x) = P\left( y_i \le y \mid X_i = x \right) = E\left( 1\left( y_i \le y \right) \mid X_i = x \right).
\]
This is the conditional mean of the random variable 1(y_i \le y). Thus the CDF is a regression, and can be estimated using regression methods.
One difference is that 1(y_i \le y) is a function of the argument y, so CDF estimation is a set of regressions, one for each value of y.
Standard CDF estimators include the NW, LL, and WNW. The NW can be written as
\[
\hat F(y \mid x) = \frac{\sum_{i=1}^n K\left(H^{-1}(X_i - x)\right) 1\left( y_i \le y \right)}{\sum_{i=1}^n K\left(H^{-1}(X_i - x)\right)}.
\]
The NW and WNW estimators have the advantages that they are non-negative and non-decreasing in y, and are thus valid CDFs.
The LL estimator does not necessarily satisfy these properties. It can be negative, and need not be monotonic in y.
As we learned for regression estimation, the LL and WNW estimators both have better bias and boundary properties. Putting these two observations together, it seems reasonable to consider using the WNW estimator.
The estimator \hat F(y \mid x) is smooth in x, but a step function in y. We discuss later estimators which are smooth in y.
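A short sketch of this NW conditional CDF estimator for scalar X with a Gaussian kernel (the names and bandwidth are illustrative):

```python
import numpy as np

def nw_cdf(y, x, X, Y, h):
    """NW estimate of F(y | x) = P(Y <= y | X = x) for scalar X (Gaussian kernel)."""
    K = np.exp(-0.5 * ((X - x) / h) ** 2)
    return np.sum(K * (Y <= y)) / np.sum(K)

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, 1000)
Y = X + 0.5 * rng.standard_normal(1000)      # Y | X=x ~ N(x, 0.25)
# F(0.5 | x=0.5) should be about 0.5
print(nw_cdf(0.5, 0.5, X, Y, h=0.1))
```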
4.2 Asymptotic Distribution
Recall that in the case of kernel regression, we had
\[
\sqrt{n |H|} \left( \hat g(x) - g(x) - \kappa_2 \sum_{j=1}^q h_j^2 B_j(x) \right) \overset{d}{\to} N\left( 0, \frac{R(k)^q \sigma^2(x)}{f(x)} \right)
\]
where \sigma^2(x) was the conditional variance of the regression, and B_j(x) equals (for NW)
\[
B_j(x) = \frac{1}{2} \frac{\partial^2}{\partial x_j^2} g(x) + f(x)^{-1} \frac{\partial}{\partial x_j} g(x) \frac{\partial}{\partial x_j} f(x),
\]
while for LL and WNW the bias term is just the first part.
Clearly, for any fixed y, the same theory applies. In the case of CDF estimation, the regression equation is
\[
1\left( y_i \le y \right) = F(y \mid X_i) + e_i(y)
\]
where e_i(y) is conditionally mean zero and has conditional variance function
\[
\sigma^2(x) = F(y \mid x) \left( 1 - F(y \mid x) \right).
\]
(We know the conditional variance takes this form because the dependent variable is binary.) I write the error as a function of y to emphasize that it is different for each y.
In the case of LL or WNW, the bias terms are
\[
B_j(y \mid x) = \frac{1}{2} \frac{\partial^2}{\partial x_j^2} F(y \mid x),
\]
the curvature of the CDF with respect to the conditioning variables.
We thus find, for all (y, x),
\[
\sqrt{n |H|} \left( \hat F(y \mid x) - F(y \mid x) - \kappa_2 \sum_{j=1}^q h_j^2 B_j(y \mid x) \right) \overset{d}{\to} N\left( 0, \frac{R(k)^q F(y \mid x) \left( 1 - F(y \mid x) \right)}{f(x)} \right)
\]
and
\[
MSE\left( \hat F(y \mid x) \right) = \kappa_2^2 \left( \sum_{j=1}^q h_j^2 B_j(y \mid x) \right)^2 + \frac{R(k)^q F(y \mid x) \left( 1 - F(y \mid x) \right)}{n |H| f(x)}.
\]
In the q = 1 case,
\[
MSE\left( \hat F(y \mid x) \right) = h^4 \kappa_2^2 B(y \mid x)^2 + \frac{R(k) F(y \mid x) \left( 1 - F(y \mid x) \right)}{n h f(x)}.
\]
In the regression case we defined the WIMSE as the integral of the AMSE, weighting by f(x) M(x). Here we also integrate over y. For q = 1,
\[
WIMSE = \int\!\!\int MSE\left( \hat F(y \mid x) \right) f(x) M(x)\, dx\, dy
= h^4 \kappa_2^2 \int\!\!\int B(y \mid x)^2\, dy\, f(x) M(x)\, dx + \frac{R(k) \int\!\!\int F(y \mid x) \left( 1 - F(y \mid x) \right) dy\, M(x)\, dx}{n h}.
\]
The integral over y does not need weighting since F(y \mid x)(1 - F(y \mid x)) declines to zero as y tends to either limit.
Observe that the convergence rate is the same as in regression. The optimal bandwidth rates are also the same as in regression.
4.3 Bandwidth Selection
I do not believe that bandwidth choice for nonparametric CDF estimation is widely studied. Li-Racine suggest using a CV method based on conditional density estimation.
It should also be possible to directly apply CV methods to CDF estimation.
The leave-one-out residuals are
\[
\hat e_{i,-i}(y) = 1\left( y_i \le y \right) - \hat F_{-i}(y \mid X_i),
\]
so the CV criterion for any fixed y is
\[
CV(y, h) = \frac{1}{n} \sum_{i=1}^n \hat e_{i,-i}(y)^2 M(X_i) = \frac{1}{n} \sum_{i=1}^n \left( 1\left( y_i \le y \right) - \hat F_{-i}(y \mid X_i) \right)^2 M(X_i).
\]
If you wanted to estimate the CDF at a single value of y you could pick h to minimize this criterion. For estimation of the entire function, we want to integrate over the values of y. One method is
\[
CV(h) = \int CV(y, h)\, dy \approx c \sum_{j=1}^N CV(y_j^*, h)
\]
where y_j^*, j = 1, ..., N, is an equally spaced grid over the support of y_i with spacing c. To calculate this quantity involves N times the number of calculations as for regression, as the leave-one-out computations are done for each y_j^* on the grid. My guess is that the grid over the y values could be coarse, e.g. one could set N = 10.
4.4 Smoothed Distribution Estimators - Unconditional Case
The CDF estimators introduced above are not smooth, but are discontinuous step functions. For some applications this may be inconvenient. It may be desirable to have a smooth CDF estimate as an input for a semiparametric estimator. It is also the case that smoothing will improve higher-order estimation efficiency. To see this, we need to return to the case of univariate data.
Recall that the univariate DF estimator for iid data y_i is
\[
\hat F(y) = \frac{1}{n} \sum_{i=1}^n 1\left( y_i \le y \right).
\]
It is easy to see that this estimator is unbiased and has variance F(y)(1 - F(y))/n.
Now consider a smoothed estimator
\[
\tilde F(y) = \frac{1}{n} \sum_{i=1}^n G\left( \frac{y - y_i}{h} \right)
\]
where G(x) = \int_{-\infty}^x k(u)\, du is a kernel distribution function (the integral of a univariate kernel function). Thus \tilde F(y) = \int_{-\infty}^y \hat f(x)\, dx, where \hat f(x) is the kernel density estimate.
To calculate its expectation,
\[
E\, \tilde F(y) = E\, G\left( \frac{y - y_i}{h} \right) = \int G\left( \frac{y - x}{h} \right) f(x)\, dx = h \int G(u) f(y - hu)\, du,
\]
the last using the change of variables u = (y - x)/h, or x = y - hu, with Jacobian h.
Next, do not expand f(y - hu) in a Taylor expansion, because the moments of G do not exist. Instead, first use integration by parts. The integral of f is F, so the integral of h f(y - hu) with respect to u is -F(y - hu), and the derivative of G(u) is k(u). Thus the above equals
\[
\int k(u) F(y - hu)\, du,
\]
which can now be expanded using a Taylor expansion, yielding
\[
E\, \tilde F(y) = F(y) + \frac{1}{2} \kappa_2 h^2 f^{(1)}(y) + o\left( h^2 \right).
\]
Just as in other estimation contexts, we see that the bias of \tilde F(y) is of order h^2, and is proportional to the second derivative of what we are estimating, as F^{(2)}(y) = f^{(1)}(y). Thus smoothing introduces estimation bias.
The interesting part comes in the analysis of variance.
\[
var\left( \tilde F(y) \right) = \frac{1}{n} var\left( G\left( \frac{y - y_i}{h} \right) \right) = \frac{1}{n}\left( E\, G\left( \frac{y - y_i}{h} \right)^2 - \left( E\, G\left( \frac{y - y_i}{h} \right) \right)^2 \right) \approx \frac{1}{n}\left( \int G\left( \frac{y - x}{h} \right)^2 f(x)\, dx - F(y)^2 \right).
\]
Let's calculate this integral. By a change of variables,
\[
\int G\left( \frac{y - x}{h} \right)^2 f(x)\, dx = h \int G(u)^2 f(y - hu)\, du.
\]
Once again we cannot directly apply a Taylor expansion, but need to first use integration by parts. Again the integral of h f(y - hu) with respect to u is -F(y - hu). The derivative of G(u)^2 is 2 G(u) k(u). So the above is
\[
2 \int G(u) k(u) F(y - hu)\, du,
\]
and then applying a Taylor expansion, we obtain
\[
F(y) \left( 2 \int G(u) k(u)\, du \right) - f(y)\, h \left( 2 \int G(u) k(u)\, u\, du \right) + o(h),
\]
since F^{(1)}(y) = f(y).
Now since the derivative of G(u)^2 is 2 G(u) k(u), it follows that the integral of 2 G(u) k(u) is G(u)^2, and thus the first integral over (-\infty, \infty) is G(\infty)^2 - G(-\infty)^2 = 1 - 0 = 1 since G(u) is a distribution function. Thus the first part is simply F(y). Define
\[
c(k) = 2 \int G(u) k(u)\, u\, du > 0.
\]
For any symmetric kernel k, c(k) > 0. This is because for u > 0, G(u) > G(-u), thus
\[
\int_0^\infty G(u) k(u)\, u\, du > \int_0^\infty G(-u) k(u)\, u\, du = -\int_{-\infty}^0 G(u) k(u)\, u\, du,
\]
and so the integral over (-\infty, \infty) is positive. Integrated kernels and the value c(k) are given in the following table.
Kernel | Integrated kernel G(u) | c(k)
Epanechnikov | G_1(u) = (1/4)(2 + 3u - u^3) 1(|u| <= 1) | 9/35
Biweight | G_2(u) = (1/16)(8 + 15u - 10u^3 + 3u^5) 1(|u| <= 1) | 50/231
Triweight | G_3(u) = (1/32)(16 + 35u - 35u^3 + 21u^5 - 5u^7) 1(|u| <= 1) | 245/1287
Gaussian | G_\phi(u) = \Phi(u) | 1/\sqrt{\pi}
Together, we have
\[
var\left( \tilde F(y) \right) \approx \frac{1}{n}\left( \int G\left( \frac{y - x}{h} \right)^2 f(x)\, dx - F(y)^2 \right) = \frac{1}{n}\left( F(y) - F(y)^2 - c(k) f(y) h + o(h) \right) = \frac{F(y)\left( 1 - F(y) \right)}{n} - c(k) f(y) \frac{h}{n} + o\left( \frac{h}{n} \right).
\]
The first part is the variance of \hat F(y), the unsmoothed estimator. Smoothing reduces the variance by c(k) f(y) h/n.
Its MSE is
\[
MSE\left( \tilde F(y) \right) = \frac{F(y)\left( 1 - F(y) \right)}{n} - c(k) f(y) \frac{h}{n} + \frac{\kappa_2^2 h^4}{4} f^{(1)}(y)^2.
\]
The integrated MSE is
\[
MISE\left( \tilde F \right) = \int MSE\left( \tilde F(y) \right) dy = \frac{\int F(y)\left( 1 - F(y) \right) dy}{n} - c(k) \frac{h}{n} + \frac{\kappa_2^2 h^4}{4} R\left( f^{(1)} \right)
\]
where
\[
R\left( f^{(1)} \right) = \int f^{(1)}(y)^2\, dy.
\]
The first term is independent of the smoothing parameter h (and corresponds to the integrated variance of the unsmoothed EDF estimator). To find the optimal bandwidth, take the FOC:
\[
\frac{d}{dh} MISE\left( \tilde F \right) = -\frac{c(k)}{n} + \kappa_2^2 h^3 R\left( f^{(1)} \right) = 0
\]
and solve to find
\[
h_0 = \left( \frac{c(k)}{\kappa_2^2 R\left( f^{(1)} \right)} \right)^{1/3} n^{-1/3}.
\]
The optimal bandwidth converges to zero at the fast n^{-1/3} rate.
Does smoothing help? The unsmoothed estimator has MISE of order n^{-1}, and the smoothed estimator (with optimal bandwidth) is of order n^{-1} - n^{-4/3}. We can thus think of the gain in the scaled MISE as being of order n^{-4/3}, which is of smaller order than the original n^{-1} rate.
It is important that the bandwidth not be too large. Suppose you set h \sim n^{-1/5} as for density estimation. Then the squared bias term is of order h^4 \sim n^{-4/5}, which is larger than the leading term. In this case the smoothed estimator has larger MSE than the usual estimator! Indeed, you need h to be of smaller order than n^{-1/4} for the MSE to be no worse than the unsmoothed case.
For practical bandwidth selection, Li-Racine and Bowman et al. (1998) recommend a CV method. For fixed y the criterion is
\[
CV(h, y) = \frac{1}{n} \sum_{i=1}^n \left( 1\left( y_i \le y \right) - \tilde F_{-i}(y) \right)^2,
\]
which is the average of the squared leave-one-out residuals. For a global estimate the criterion is
\[
CV(h) = \int CV(h, y)\, dy
\]
and this can be approximated by a summation over a grid of values for y.
This is essentially the same as the CV criterion we introduced above in the conditional case.
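A sketch of the smoothed distribution estimator with the Gaussian integrated kernel G = Phi (from the table above), together with the grid-approximated CV criterion; the grid and bandwidth values are illustrative choices.

```python
import numpy as np
from scipy.stats import norm

def smoothed_edf(y, data, h):
    """Smoothed distribution estimator F~(y) = n^{-1} sum_i G((y - Y_i)/h), with G = Phi."""
    return np.mean(norm.cdf((y - data) / h))

def cv_smoothed_edf(data, h, ygrid):
    """CV(h) approximated on a grid of y values: mean squared leave-one-out residuals."""
    n = len(data)
    total = 0.0
    for y in ygrid:
        G = norm.cdf((y - data) / h)
        F_loo = (G.sum() - G) / (n - 1)        # leave-one-out smoothed EDF at y
        total += np.mean(((data <= y).astype(float) - F_loo) ** 2)
    return total / len(ygrid)

rng = np.random.default_rng(0)
data = rng.standard_normal(400)
ygrid = np.linspace(-2.0, 2.0, 11)
grid_h = [0.05, 0.1, 0.2, 0.4, 0.8]
cv = [cv_smoothed_edf(data, h, ygrid) for h in grid_h]
print("CV-selected h:", grid_h[int(np.argmin(cv))], " F~(0):", smoothed_edf(0.0, data, 0.2))
```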
4.5 Smoothed Conditional Distribution Estimators
The smoothed versions of the CDF estimators replace the indicator functions 1(y_i \le y) with the integrated kernel G\left( \frac{y - y_i}{h_0} \right), where we use h_0 to denote the bandwidth for smoothing in the y direction.
The NW version is
\[
\tilde F(y \mid x) = \frac{\sum_{i=1}^n K\left( H^{-1}(X_i - x) \right) G\left( \frac{y - y_i}{h_0} \right)}{\sum_{i=1}^n K\left( H^{-1}(X_i - x) \right)}
\]
with H = diag\{h_1, ..., h_q\}. The LL version is obtained by a local linear regression of G\left( \frac{y - y_i}{h_0} \right) on X_i - x with bandwidths H. And similarly the WNW.
What is its distribution? It is essentially that of \hat F(y \mid x), plus an additional bias term, minus a variance term.
First take bias. Recall
\[
Bias\left( \hat F(y \mid x) \right) \approx \kappa_2 \sum_{j=1}^q h_j^2 B_j(y \mid x)
\]
where for LL and WNW
\[
B_j(y \mid x) = \frac{1}{2} \frac{\partial^2}{\partial x_j^2} F(y \mid x).
\]
And for smoothed DF estimation, the bias term is
\[
\kappa_2 h^2\, \frac{1}{2} \frac{\partial^2}{\partial y^2} F(y).
\]
If you work out the bias of the smoothed conditional CDF estimator \tilde F(y \mid x), you find it is the sum of these two, that is,
\[
Bias\left( \tilde F(y \mid x) \right) \approx \kappa_2 \sum_{j=0}^q h_j^2 B_j(y \mid x)
\]
where for j \ge 1 the B_j(y \mid x) are the same as before, and for j = 0,
\[
B_0(y \mid x) = \frac{1}{2} \frac{\partial^2}{\partial y^2} F(y \mid x).
\]
For variance, recall
\[
var\left( \hat F(y \mid x) \right) = \frac{R(k)^q F(y \mid x) \left( 1 - F(y \mid x) \right)}{f(x)\, n |H|},
\]
and for smoothed DF estimation, the variance was reduced by the term c(k) f(y) h/n. In the conditional CDF case it turns out to be similarly adjusted:
\[
var\left( \tilde F(y \mid x) \right) = \frac{R(k)^q \left[ F(y \mid x) \left( 1 - F(y \mid x) \right) - h_0\, c(k) f(y \mid x) \right]}{f(x)\, n |H|}.
\]
In sum, the MSE is
\[
MSE\left( \tilde F(y \mid x) \right) = \kappa_2^2 \left( \sum_{j=0}^q h_j^2 B_j(y \mid x) \right)^2 + \frac{R(k)^q \left[ F(y \mid x) \left( 1 - F(y \mid x) \right) - h_0\, c(k) f(y \mid x) \right]}{f(x)\, n |H|}.
\]
The WIMSE, in the q = 1 case, is
\[
WIMSE = \int\!\!\int MSE\left( \tilde F(y \mid x) \right) f(x) M(x)\, dx\, dy
= \kappa_2^2 \int\!\!\int \left( h_0^2 B_0(y \mid x) + h_1^2 B_1(y \mid x) \right)^2 dy\, f(x) M(x)\, dx + \frac{R(k) \left[ \int\!\!\int F(y \mid x) \left( 1 - F(y \mid x) \right) dy\, M(x)\, dx - h_0\, c(k) \int M(x)\, dx \right]}{n h_1}.
\]
4.6 Bandwidth Choice
First, consider the optimal bandwidth rates.
As smoothing in the y direction only affects the higher-order asymptotic distribution, it should be clear that the optimal rates for h_1, ..., h_q are unchanged from the unsmoothed case, and are therefore equal to those of the regression setting. Thus the optimal bandwidth rates are h_j \sim n^{-1/(4+q)} for j \ge 1.
Substituting these rates into the MSE equation, and ignoring constants, we have
\[
MSE\left( \tilde F(y \mid x) \right) \sim \left( h_0^2 + n^{-2/(4+q)} \right)^2 + \frac{1}{n^{4/(4+q)}} - \frac{h_0}{n^{4/(4+q)}}.
\]
Differentiating with respect to h_0,
\[
0 = 4 \left( h_0^2 + n^{-2/(4+q)} \right) h_0 - \frac{1}{n^{4/(4+q)}},
\]
and since h_0 will be of smaller order than n^{-1/(4+q)}, we can ignore the h_0^3 term; solving the remainder we obtain h_0 \sim n^{-2/(4+q)}. E.g. for q = 1 the optimal rate is h_0 \sim n^{-2/5}.
What is the gain from smoothing? With the optimal bandwidths, the MISE is reduced by a term of order n^{-6/(4+q)}. This is n^{-6/5} for q = 1 and n^{-1} for q = 2. This gain increases as q increases. Thus the gain in efficiency (from smoothing) is increased when X is of higher dimension. Intuitively, increasing the dimension of X is equivalent to reducing the effective sample size, increasing the gain from smoothing.
How should the bandwidth be selected?
Li-Racine recommend picking the bandwidths by using a CV method for conditional density estimation, and then rescaling.
As an alternative, we can use CV directly for the CDF estimate. That is, define the CV criterion
\[
CV(y, h) = \frac{1}{n} \sum_{i=1}^n \left( 1\left( y_i \le y \right) - \tilde F_{-i}(y \mid X_i) \right)^2 M(X_i),
\]
\[
CV(h) = \int CV(y, h)\, dy,
\]
where h = (h_0, h_1, ..., h_q) includes smoothing in both the y and x directions. The estimator \tilde F_{-i} is the smoothed leave-one-out estimator of F. This formulation covers NW, LL and WNW estimation.
The second integral can be approximated using a grid.
To my knowledge, this procedure has not been formally investigated.
5 Conditional Density Estimation
5.1 Estimators
The conditional density of y_i given X_i = x is f(y \mid x) = f(y, x)/f(x). A natural estimator is
\[
\hat f(y \mid x) = \frac{\hat f(y, x)}{\hat f(x)} = \frac{\sum_{i=1}^n K\left( H^{-1}(X_i - x) \right) k_{h_0}(y_i - y)}{\sum_{i=1}^n K\left( H^{-1}(X_i - x) \right)}
\]
where H = diag\{h_1, ..., h_q\} and k_h(u) = h^{-1} k(u/h). This is the derivative of the smooth NW-type estimator \tilde F(y \mid x). The bandwidth h_0 smooths in the y direction and the bandwidths h_1, ..., h_q smooth in the X directions.
This is the NW estimator of the conditional mean of Z_i = k_{h_0}(y - y_i) given X_i = x. Notice that
\[
E(Z_i \mid X_i = x) = \int \frac{1}{h_0} k\left( \frac{v - y}{h_0} \right) f(v \mid x)\, dv = \int k(u) f(y - u h_0 \mid x)\, du \simeq f(y \mid x) + \frac{h_0^2 \kappa_2}{2} \frac{\partial^2}{\partial y^2} f(y \mid x).
\]
We can view conditional density estimation as a regression problem. In addition to NW, we can use LL and WNW estimation. The local polynomial method was proposed in a paper by Fan, Yao and Tong (Biometrika, 1996) and has been called the double kernel method.
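A minimal sketch of the NW-type conditional density estimator for scalar X with Gaussian kernels in both directions; the bandwidths here are fixed, illustrative values rather than data-driven choices.

```python
import numpy as np

def cond_density_nw(y, x, X, Y, hx, h0):
    """NW-type conditional density estimate f^(y|x) for scalar X (Gaussian kernels):
    the NW regression of Z_i = k_{h0}(Y_i - y) on X_i = x."""
    K = np.exp(-0.5 * ((X - x) / hx) ** 2)                               # kernel weights in x
    Z = np.exp(-0.5 * ((Y - y) / h0) ** 2) / (h0 * np.sqrt(2 * np.pi))   # k_{h0}(Y_i - y)
    return np.sum(K * Z) / np.sum(K)

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, 2000)
Y = X + 0.5 * rng.standard_normal(2000)   # Y | X=x ~ N(x, 0.25)
# f(y=0.5 | x=0.5) should be about 1/(0.5*sqrt(2*pi)) ~ 0.798
print(cond_density_nw(0.5, 0.5, X, Y, hx=0.1, h0=0.15))
```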
5.2 Bias
By the formula for NW regression of Z_i on X_i = x,
\[
E\, \hat f(y \mid x) = E(Z_i \mid X_i = x) + \kappa_2 \sum_{j=1}^q h_j^2 B_j(y \mid x) = f(y \mid x) + \kappa_2 \sum_{j=0}^q h_j^2 B_j(y \mid x)
\]
where
\[
B_0(y \mid x) = \frac{1}{2} \frac{\partial^2}{\partial y^2} f(y \mid x),
\]
\[
B_j(y \mid x) = \frac{1}{2} \frac{\partial^2}{\partial x_j^2} f(y \mid x) + f(x)^{-1} \frac{\partial}{\partial x_j} f(y \mid x) \frac{\partial}{\partial x_j} f(x), \qquad j > 0,
\]
as the B_j are the curvature of E(Z_i \mid X_i = x) \simeq f(y \mid x) with respect to x_j. For LL or WNW,
\[
B_j(y \mid x) = \frac{1}{2} \frac{\partial^2}{\partial x_j^2} f(y \mid x), \qquad j > 0.
\]
The bias of \hat f(y \mid x) for f(y \mid x) is \kappa_2 \sum_{j=0}^q h_j^2 B_j(y \mid x).
For the bias to converge to zero with n, all bandwidths must decline to zero.
For the bias to converge to zero with n; all bandwidths must decline to zero.
5.3 Variance
By the formula for NW regression of Z
i
on X
i
= x;
var
_
^
f (y j x)
_
'
R(k)
q
nh
1
h
q
f(x)
var (Z
i
j X
i
= x)
We calculate that
var (Z
i
j X
i
= x) = E
_
Z
2
i
j X
i
= x
_
(E(Z
i
j X
i
= x))
2
'
1
h
2
0
_
k
_
v y
h
0
_
2
f (v j x) dv
=
1
h
0
_
k (u)
2
f (y uh j x) du
'
R(k)f (y j x)
h
0
:
Substituting this into the expression for the estimation variance,
var
_
^
f (y j x)
_
'
R(k)
q
nh
1
h
q
f(x)
var (Z
i
j X
i
= x)
=
R(k)
q+1
f (y j x)
nh
0
h
1
h
q
f(x)
What is key is that the variance of the conditional density depends inversely upon all bandwidths.
For the variance to tend to zero, we thus need nh
0
h
1
h
q
!1:
5.4 MSE
\[
AMSE\left( \hat f(y \mid x) \right) = \kappa_2^2 \left( \sum_{j=0}^q h_j^2 B_j(y \mid x) \right)^2 + \frac{R(k)^{q+1} f(y \mid x)}{n h_0 h_1 \cdots h_q f(x)}.
\]
In this problem, the bandwidths enter symmetrically. Thus the optimal rates for h_0 and the other bandwidths will be equal. To see this, let h be a common bandwidth; ignoring constants,
\[
AMSE\left( \hat f(y \mid x) \right) \sim h^4 + \frac{1}{n h^{1+q}}
\]
with optimal solution
\[
h \sim n^{-1/(5+q)}.
\]
Thus if q = 1, h \sim n^{-1/6}; if q = 2, h \sim n^{-1/7}. This is the same rate as for multivariate density estimation (estimation of the joint density f(y, x)). The resulting convergence rate for the estimator is the same as for multivariate density estimation.
5.5 Cross-validation
Fan and Yim (2004, Biometrika) and Hall, Racine and Li (2004) have proposed a cross-validation method appropriate for nonparametric conditional density estimators. In this section we describe this method and its application to our estimators. For an estimator \hat f(y \mid x) of f(y \mid x), define the integrated squared error
\[
I(h) = \int\!\!\int \left( \hat f(y \mid x) - f(y \mid x) \right)^2 M(x) f(x)\, dy\, dx
\]
\[
= \int\!\!\int \hat f(y \mid x)^2 M(x) f(x)\, dy\, dx - 2 \int\!\!\int \hat f(y \mid x) M(x) f(y \mid x) f(x)\, dy\, dx + \int\!\!\int f(y \mid x)^2 M(x) f(x)\, dy\, dx
\]
\[
= E\left( \int \hat f(y \mid X_i)^2 M(X_i)\, dy \right) - 2 E\left( \hat f(y_i \mid X_i) M(X_i) \right) + E\left( \int f(y \mid X_i)^2 M(X_i)\, dy \right)
\]
\[
= I_1(h) - 2 I_2(h) + I_3.
\]
Note that I_3 does not depend on the bandwidths and is thus irrelevant.
Let \hat f_{-i}(y \mid X_i) denote the estimator \hat f(y \mid x) at x = X_i with observation i omitted. For the NW estimator this equals
\[
\hat f_{-i}(y \mid X_i) = \frac{\sum_{j \ne i} K\left( H^{-1}(X_i - X_j) \right) k_{h_0}(y_j - y)}{\sum_{j \ne i} K\left( H^{-1}(X_i - X_j) \right)}.
\]
The cross-validation estimators of I_1 and I_2 are
\[
\hat I_1(h) = \frac{1}{n} \sum_{i=1}^n M(X_i) \int \hat f_{-i}(y \mid X_i)^2\, dy,
\]
\[
\hat I_2(h) = \frac{1}{n} \sum_{i=1}^n M(X_i)\, \hat f_{-i}(Y_i \mid X_i).
\]
The cross-validation criterion is
\[
CV(h) = \hat I_1(h) - 2 \hat I_2(h).
\]
The cross-validated bandwidths h_0, h_1, ..., h_q are those which jointly minimize CV(h).
For the case of NW estimation,
\[
\hat I_1(h) = \frac{1}{n} \sum_{i=1}^n M(X_i) \frac{\sum_{j \ne i} \sum_{k \ne i} K\left( H^{-1}(X_i - X_j) \right) K\left( H^{-1}(X_i - X_k) \right) \int k_{h_0}(y_j - y)\, k_{h_0}(y_k - y)\, dy}{\left( \sum_{j \ne i} K\left( H^{-1}(X_i - X_j) \right) \right)^2}
\]
\[
= \frac{1}{n} \sum_{i=1}^n M(X_i) \frac{\sum_{j \ne i} \sum_{k \ne i} K\left( H^{-1}(X_i - X_j) \right) K\left( H^{-1}(X_i - X_k) \right) \bar k_{h_0}(y_j - y_k)}{\left( \sum_{j \ne i} K\left( H^{-1}(X_i - X_j) \right) \right)^2},
\]
where \bar k is the convolution of k with itself, and
\[
\hat I_2(h) = \frac{1}{n} \sum_{i=1}^n M(X_i) \frac{\sum_{j \ne i} K\left( H^{-1}(X_i - X_j) \right) k_{h_0}(y_i - y_j)}{\sum_{j \ne i} K\left( H^{-1}(X_i - X_j) \right)}.
\]
5.6 Two-Step Conditional Density Estimator
We can write
\[
y = g(X) + e
\]
where g(x) is the conditional mean function and e is the regression error. Let f_e(e \mid x) be the conditional density of e given X = x. Then the conditional density of y is
\[
f(y \mid x) = f_e\left( y - g(x) \mid x \right).
\]
That is, we can write the conditional density of y in terms of the regression function and the conditional density of the error.
This decomposition suggests an alternative two-step estimator of f. First, estimate g. Second, estimate f_e.
The estimator \hat g(x) for g can be NW, WNW, or LL.
The residuals are \hat e_i = y_i - \hat g(X_i).
The second step is a conditional density estimator (NW, WNW or LL) applied to the residuals \hat e_i as if they were observed data. This gives an estimator \hat f_e(e \mid x).
The estimator for f is then
\[
\hat f(y \mid x) = \hat f_e\left( y - \hat g(x) \mid x \right).
\]
The first-order asymptotic distribution of \hat f turns out to be identical to the ideal case where e_i is directly observed. This is because the first-step conditional mean estimator \hat g(x) converges at a rate faster than the second-step estimator (at least if the first step is done with a bandwidth of the optimal order). E.g. if q = 1 then \hat g(x) is optimally computed with a bandwidth h \sim n^{-1/5}, so that \hat g converges at the rate n^{-2/5}, yet the estimator \hat f_e converges at the best rate n^{-1/3}, so the error induced by estimation of \hat g is of lower stochastic order.
The gain from the two-step estimator is that the conditional density of e typically has less dependence on X than the conditional density of y itself. This is because the conditional mean g(X) has been removed, leaving only the higher-order dependence. The accuracy of nonparametric estimation improves as the estimated function becomes smoother and less dependent on the conditioning variables. Partially this occurs because reduced dependence allows for larger bandwidths, which reduces estimation variance.
As an extreme case, if f_e(e \mid x) does not depend on one of the X variables, then \hat f_e can converge at the n^{-2/(q+4)} rate of the conditional mean. In this case the two-step estimator actually has an improved rate of convergence relative to the conventional estimator.
Two-step estimators of this form are often employed in practical applications, but do not seem to have been discussed much in the theoretical literature.
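A sketch of the two-step estimator for scalar X with NW in both steps and Gaussian kernels. Note that the x-bandwidth in the second step can be taken larger when the error is nearly independent of X, which is the source of the efficiency gain discussed above. All names and bandwidths are illustrative.

```python
import numpy as np

def nw(xgrid, X, v, h):
    K = np.exp(-0.5 * ((np.asarray(xgrid)[:, None] - X[None, :]) / h) ** 2)
    return K @ v / K.sum(axis=1)

def twostep_cond_density(y, x, X, Y, h_mean, hx, h0):
    """Two-step estimate of f(y|x): (1) NW regression to get ghat and residuals,
    (2) conditional density of the residuals, evaluated at y - ghat(x)."""
    ghat_i = nw(X, X, Y, h_mean)                       # fitted values at the data points
    ehat = Y - ghat_i                                   # residuals
    ghat_x = nw([x], X, Y, h_mean)[0]                   # ghat at the evaluation point
    e0 = y - ghat_x
    # NW-type conditional density of ehat given X = x, evaluated at e0
    K = np.exp(-0.5 * ((X - x) / hx) ** 2)
    Z = np.exp(-0.5 * ((ehat - e0) / h0) ** 2) / (h0 * np.sqrt(2 * np.pi))
    return np.sum(K * Z) / np.sum(K)

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, 2000)
Y = np.sin(2 * np.pi * X) + 0.5 * rng.standard_normal(2000)
# f(y | x) at y = sin(2*pi*0.25) = 1, x = 0.25, should be about 0.798
print(twostep_cond_density(1.0, 0.25, X, Y, h_mean=0.08, hx=0.2, h0=0.15))
```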
We could also consider a 3-step estimator, based on the expressions
\[
y = g(X) + e,
\]
\[
e^2 = \sigma^2(X) + \eta,
\]
\[
\varepsilon = \frac{e}{\sigma(X)}, \qquad \varepsilon \mid x \sim f_\varepsilon(\varepsilon \mid x),
\]
\[
f(y \mid x) = \frac{1}{\sigma(x)} f_\varepsilon\left( \frac{y - g(x)}{\sigma(x)} \,\Big|\, x \right).
\]
The 3-step estimator is: First, obtain \hat g(x) by nonparametric regression and compute the residuals \hat e_i. Second, obtain \hat\sigma^2(x) by nonparametric regression using \hat e_i^2 as the dependent variable, and compute the rescaled residuals \hat\varepsilon_i = \hat e_i / \hat\sigma(X_i). Third, compute \hat f_\varepsilon(\varepsilon \mid x) as the nonparametric conditional density estimator applied to the \hat\varepsilon_i. Then we can set
\[
\hat f(y \mid x) = \frac{1}{\hat\sigma(x)} \hat f_\varepsilon\left( \frac{y - \hat g(x)}{\hat\sigma(x)} \,\Big|\, x \right).
\]
In cases of strong variance effects (such as in financial data) this method may be desirable.
As the variance estimator \hat\sigma^2(x) converges at the same rate as the mean \hat g(x), the same first-order properties apply to the 3-step estimator as to the 2-step estimator. Namely, f_\varepsilon should have reduced dependence on x, so it should be relatively well estimated even with large x-bandwidths, resulting in reduced MSE relative to the 1-step and 2-step estimators.
Given these insights, it might seem sensible to apply the 2-step or 3-step idea to conditional distribution estimation. Unfortunately the analysis is not quite as simple. In this setting, the nonparametric conditional mean, conditional variance, and conditional distribution estimators all converge at the same rates. Thus the distribution of the estimate of the CDF of e_i depends on the fact that it is a 2-step estimator, and it is not immediately obvious how this affects the asymptotic distribution. I have not seen an investigation of this issue.
6 Conditional Quantile Estimation
6.1 Quantiles
Suppose Y is univariate with distribution F.
If F is continuous and strictly increasing then its inverse function is uniquely defined. In this case the \alpha-th quantile of F is q_\alpha = F^{-1}(\alpha).
If F is not strictly increasing then the inverse function is not well defined and thus quantiles are not unique but are interval-valued. To allow for this case it is conventional to simply define the quantile as the lower bound of this interval. Thus the general definition of the \alpha-th quantile is
\[
q_\alpha = \inf\left\{ y : F(y) \ge \alpha \right\}.
\]
Quantiles are functions from probabilities to the sample space, and are monotonically increasing in \alpha.
Multivariate quantiles are not well defined. Thus quantiles are used in univariate and conditional settings.
If you know a distribution function F then you know the quantile function q_\alpha. If you have an estimate \hat F(y) of F(y) then you can define the estimate
\[
\hat q_\alpha = \inf\left\{ y : \hat F(y) \ge \alpha \right\}.
\]
If \hat F(y) is monotonic in y then \hat q_\alpha will also be monotonic in \alpha. When a smoothed estimator \tilde F(y) is used, then we can write the quantile estimator more simply as \hat q_\alpha = \tilde F^{-1}(\alpha).
Suppose that \hat F(y) is the (unsmoothed) EDF from a sample of size n. In this case \hat q_\alpha equals Y_{([\alpha n])}, the [\alpha n]-th order statistic from the sample. If \alpha n is not an integer, [\alpha n] is the greatest integer less than \alpha n. We could also view the interval \left[ Y_{([\alpha n])}, Y_{([\alpha n]+1)} \right] as the quantile estimate. We ignore these distinctions in practice.
When \hat F(y) is the EDF we can also write the quantile estimator as
\[
\hat q_\alpha = \operatorname{argmin}_q \sum_{i=1}^n \rho_\alpha\left( Y_i - q \right)
\]
where
\[
\rho_\alpha(u) = u \left[ \alpha - 1\left( u < 0 \right) \right] = \begin{cases} -(1 - \alpha)\, u & u < 0 \\ \alpha\, u & u \ge 0 \end{cases}
\]
is called the check function.
6.2 Conditional Quantiles
If the conditional distribution of Y given X = x is F(y \mid x) then the conditional quantile of Y given X = x is
\[
q_\alpha(x) = \inf\left\{ y : F(y \mid x) \ge \alpha \right\} = F^{-1}(\alpha \mid x).
\]
Conditional quantiles are functions from probabilities to the sample space, for a fixed value of the conditioning variables.
One method for nonparametric conditional quantile estimation is to invert an estimated distribution function. Take an estimate \hat F(y \mid x) of F(y \mid x). Then we can define
\[
\hat q_\alpha(x) = \inf\left\{ y : \hat F(y \mid x) \ge \alpha \right\}.
\]
When \hat F(y \mid x) is smooth in y we can write this as \hat q_\alpha(x) = \hat F^{-1}(\alpha \mid x).
This method is particularly appropriate for inversion of the smoothed CDF estimators \tilde F(y \mid x). The inversion method requires that \hat F(y \mid x) be a distribution function (that it lies in [0, 1] and is monotonic), which is not ensured if \hat F(y \mid x) is computed by LL. The NW, WNW and smoothed versions are all appropriate. When \hat F(y \mid x) is a distribution function then \hat q_\alpha(x) will satisfy the properties of a quantile function.
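A sketch of the inversion method using the (unsmoothed) NW CDF estimator for scalar X with a Gaussian kernel; since that estimator is monotone in y, the infimum can be located by a simple search over the sorted sample values. Names and the bandwidth are illustrative.

```python
import numpy as np

def nw_cdf(y, x, X, Y, h):
    K = np.exp(-0.5 * ((X - x) / h) ** 2)
    return np.sum(K * (Y <= y)) / np.sum(K)

def cond_quantile(alpha, x, X, Y, h):
    """q_alpha(x) = inf{ y : F^(y|x) >= alpha }, searched over the observed support."""
    ys = np.sort(Y)
    F = np.array([nw_cdf(y, x, X, Y, h) for y in ys])  # non-decreasing in y for the NW estimator
    i = min(np.searchsorted(F, alpha), len(ys) - 1)     # first index with F >= alpha
    return ys[i]

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, 1000)
Y = X + 0.5 * rng.standard_normal(1000)
# median of Y | X = 0.5 is 0.5
print(cond_quantile(0.5, 0.5, X, Y, h=0.1))
```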
6.3 Check Function Approach
Another estimation method is to minimize a locally weighted check function. This can be done using either a locally constant or a locally linear specification.
The locally constant (NW) method uses the criterion
\[
S_\alpha(q \mid x) = \sum_{i=1}^n K\left( H^{-1}(X_i - x) \right) \rho_\alpha\left( Y_i - q \right).
\]
It is a locally weighted check function, for observations close to X_i = x. The nonparametric quantile estimator is
\[
\hat q_\alpha(x) = \operatorname{argmin}_q S_\alpha(q \mid x).
\]
The local linear (LL) criterion is
\[
S_\alpha(q, \beta \mid x) = \sum_{i=1}^n K\left( H^{-1}(X_i - x) \right) \rho_\alpha\left( Y_i - q - (X_i - x)'\beta \right).
\]
The estimator is
\[
\left( \hat q_\alpha(x), \hat\beta_\alpha(x) \right) = \operatorname{argmin}_{q, \beta} S_\alpha(q, \beta \mid x).
\]
The conditional quantile estimator is \hat q_\alpha(x), with derivative estimate \hat\beta_\alpha(x). Numerically, these problems are identical to weighted linear quantile regression.
6.4 Asymptotic Distribution
The asymptotic distributions of the quantile estimators are scaled versions of the asymptotic distributions of the CDF estimators (see the Li-Racine text for details).
The CDF inversion method and the check function method have the same asymptotic distributions.
The asymptotic bias of the quantile estimators depends on whether a local constant or local linear method was used, and on whether smoothing in the y direction is used.
6.5 Bandwidth Selection
Optimal bandwidth selection for nonparametric quantile regression is less well studied than for the other methods.
As the asymptotic distributions seem to be scaled versions of those of the CDF estimators, and the quantile estimator can be viewed as a by-product of CDF estimation, it seems reasonable to select bandwidths by a method optimal for CDF estimation, e.g. cross-validation for conditional distribution function estimation.
7 Semiparametric Methods and Partially Linear Regression
7.1 Overview
A model is called semiparametric if it is described by (\theta, \tau) where \theta is finite-dimensional (e.g. parametric) and \tau is infinite-dimensional (nonparametric). All moment condition models are semiparametric in the sense that the distribution of the data (\tau) is unspecified and infinite-dimensional. But the settings more typically called semiparametric are those where there is explicit estimation of \tau.
In many contexts the nonparametric part \tau is a conditional mean, variance, density or distribution function.
Often \theta is the parameter of interest and \tau is a nuisance parameter, but this is not necessarily the case.
In many semiparametric contexts, \tau is estimated first, and then \hat\theta is a two-step estimator. But in other contexts (\theta, \tau) are jointly estimated.
7.2 Feasible Nonparametric GLS
A classic semiparametric model, which is not in Li-Racine, is feasible GLS with an unknown variance function. The seminal papers are Carroll (1982, Annals of Statistics) and Robinson (1987, Econometrica). The setting is a linear regression
\[
y_i = X_i'\theta + e_i, \qquad E(e_i \mid X_i) = 0, \qquad E\left( e_i^2 \mid X_i \right) = \sigma^2(X_i),
\]
where the variance function \sigma^2(x) is unknown but smooth in x, and x \in R^q. (The idea also applies to non-linear but parametric regression functions.) In this model, the nonparametric nuisance parameter is \tau = \sigma^2(\cdot).
As the model is a regression, the efficiency bound for \theta is attained by GLS regression
\[
\tilde\theta = \left( \sum_{i=1}^n \frac{1}{\sigma^2(X_i)} X_i X_i' \right)^{-1} \left( \sum_{i=1}^n \frac{1}{\sigma^2(X_i)} X_i y_i \right).
\]
This of course is infeasible. Carroll and Robinson suggested replacing \sigma^2(X_i) with \hat\sigma^2(X_i), where \hat\sigma^2(x) is a nonparametric estimator. (Carroll used kernel methods; Robinson used nearest neighbor methods.) Specifically, letting \hat\sigma^2(x) be the NW estimator of \sigma^2(x), we can define the feasible estimator
\[
\hat\theta = \left( \sum_{i=1}^n \frac{1}{\hat\sigma^2(X_i)} X_i X_i' \right)^{-1} \left( \sum_{i=1}^n \frac{1}{\hat\sigma^2(X_i)} X_i y_i \right).
\]
This seems sensible. The question is to find its asymptotic distribution, and in particular to find out whether it is asymptotically equivalent to \tilde\theta.
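A minimal sketch of one common implementation, assuming the variance depends on a single scalar variable and using squared OLS residuals in the nonparametric first step; these choices, and the function names, are illustrative simplifications rather than the exact constructions of Carroll or Robinson.

```python
import numpy as np

def nw_scalar(xgrid, X, v, h):
    """NW regression of v on scalar X, evaluated at xgrid (Gaussian kernel)."""
    K = np.exp(-0.5 * ((np.asarray(xgrid)[:, None] - X[None, :]) / h) ** 2)
    return K @ v / K.sum(axis=1)

def feasible_gls(y, X, Z, h):
    """Feasible GLS with variance function sigma^2(z) estimated by NW on a scalar Z.
    X is the n x k regressor matrix; Z is the (scalar) variable driving the variance."""
    # step 1: OLS residuals
    theta_ols = np.linalg.solve(X.T @ X, X.T @ y)
    e = y - X @ theta_ols
    # step 2: nonparametric variance estimate sigma^2(Z_i), floored away from zero
    sig2 = np.maximum(nw_scalar(Z, Z, e ** 2, h), 1e-6)
    # step 3: weighted least squares with weights 1 / sigma^2(Z_i)
    w = 1.0 / sig2
    A = (X * w[:, None]).T @ X
    b = (X * w[:, None]).T @ y
    return np.linalg.solve(A, b)

rng = np.random.default_rng(0)
n = 1000
Z = rng.uniform(0, 1, n)
X = np.column_stack([np.ones(n), Z, rng.standard_normal(n)])
theta0 = np.array([1.0, 2.0, -1.0])
y = X @ theta0 + (0.2 + Z) * rng.standard_normal(n)   # sigma(z) = 0.2 + z
print(feasible_gls(y, X, Z, h=0.1))
```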
7.3 Generated Regressors
The model is
\[
y_i = \theta\, \tau(X_i) + e_i, \qquad E(e_i \mid X_i) = 0,
\]
where \theta is finite-dimensional but \tau is an unknown function. Suppose \tau is identified by another equation so that we have a consistent estimate \hat\tau(x) of \tau(x). (Imagine a nonparametric Heckman estimator.)
Then we could estimate \theta by least-squares of y_i on \hat\tau(X_i).
This problem is called generated regressors, as the regressor is a (consistent) estimate of an infeasible regressor.
In general, \hat\theta is consistent. But what is its distribution?
7.4 Andrews MINPIN Theory
A useful framework to study the type of problem from the previous section is given in Andrews (Econometrica, 1994). It is reviewed in section 7.3 of Li-Racine, but the discussion is incomplete and there is at least one important omission. If you really want to learn this theory I suggest reading Andrews' paper.
The setting is when the estimator \hat\theta MINimizes a criterion function which depends on a Preliminary Infinite dimensional Nuisance parameter estimator, hence MINPIN.
Let \theta be the parameter of interest, and let \tau \in T denote the infinite-dimensional nuisance parameter. Let \theta_0 and \tau_0 denote the true values.
Let \hat\tau be a first-step estimate of \tau, and assume that it is consistent: \hat\tau \to_p \tau_0.
Now suppose that the criterion function for estimation of \theta depends on the first-step estimate \hat\tau. Let the criterion be Q_n(\theta, \tau) and suppose that
\[
\hat\theta = \operatorname{argmin}_\theta Q_n(\theta, \hat\tau).
\]
Thus \hat\theta is a two-step estimator.
Assume that
\[
\frac{\partial}{\partial\theta} Q_n(\theta, \tau) = m_n(\theta, \tau) + o_p\left( \frac{1}{\sqrt n} \right), \qquad m_n(\theta, \tau) = \frac{1}{n} \sum_{i=1}^n m_i(\theta, \tau),
\]
where m_i(\theta, \tau) is a function of the i-th observation. In just-identified models, there is no o_p(1/\sqrt n) error (and we now ignore the presence of this error).
In the FGLS example, \tau(\cdot) = \sigma^2(\cdot), \hat\tau(\cdot) = \hat\sigma^2(\cdot), and
\[
m_i(\theta, \tau) = \frac{1}{\tau(X_i)} X_i \left( y_i - X_i'\theta \right).
\]
In the generated regressor problem,
\[
m_i(\theta, \tau) = \tau(X_i) \left( y_i - \theta\, \tau(X_i) \right).
\]
In general, the first-order condition (FOC) for \hat\theta is
\[
0 = m_n\left( \hat\theta, \hat\tau \right).
\]
Assume \hat\theta \to_p \theta_0. Implicit in obtaining consistency is the requirement that the population expectation of the FOC is zero, namely
\[
E\, m_i(\theta_0, \tau_0) = 0,
\]
and we assume that this is the case.
We expand the FOC in the first argument:
\[
0 = \sqrt n\, m_n\left( \hat\theta, \hat\tau \right) = \sqrt n\, m_n(\theta_0, \hat\tau) + M_n(\theta_0, \hat\tau)\, \sqrt n \left( \hat\theta - \theta_0 \right) + o_p(1)
\]
where
\[
M_n(\theta, \tau) = \frac{\partial}{\partial\theta'} m_n(\theta, \tau).
\]
It follows that
\[
\sqrt n \left( \hat\theta - \theta_0 \right) \approx -M_n(\theta_0, \hat\tau)^{-1} \sqrt n\, m_n(\theta_0, \hat\tau).
\]
If M_n(\theta, \tau) converges to its expectation
\[
M(\theta, \tau) = E\, \frac{\partial}{\partial\theta'} m_i(\theta, \tau)
\]
uniformly in its arguments, then M_n(\theta_0, \hat\tau) \to_p M(\theta_0, \tau_0) = M, say. Then
\[
\sqrt n \left( \hat\theta - \theta_0 \right) \approx -M^{-1} \sqrt n\, m_n(\theta_0, \hat\tau).
\]
We cannot take a Taylor expansion in \tau because it is infinite-dimensional. Instead, Andrews uses a stochastic equicontinuity argument. Define the population version of m_n(\theta, \tau):
\[
m(\theta, \tau) = E\, m_i(\theta, \tau).
\]
Note that at the true values m(\theta_0, \tau_0) = 0 (as discussed above), but the function is non-zero for generic values.
Define the function
\[
\nu_n(\tau) = \sqrt n \left( m_n(\theta_0, \tau) - m(\theta_0, \tau) \right) = \frac{1}{\sqrt n} \sum_{i=1}^n \left( m_i(\theta_0, \tau) - E\, m_i(\theta_0, \tau) \right).
\]
Notice that \nu_n(\tau) is a normalized sum of mean-zero random variables. Thus for any fixed \tau, \nu_n(\tau) converges to a normal random vector. Viewed as a function of \tau, we might expect \nu_n(\tau) to vary smoothly in the argument \tau. The stochastic formulation of this is called stochastic equicontinuity. Roughly, as n \to \infty, \nu_n(\tau) remains well-behaved as a function of \tau. Andrews' first key assumption is that \nu_n(\tau) is stochastically equicontinuous.
The important implication of stochastic equicontinuity is that \hat\tau \to_p \tau_0 implies
\[
\nu_n(\hat\tau) - \nu_n(\tau_0) \to_p 0.
\]
Intuitively, if g(\tau) is continuous, then g(\hat\tau) - g(\tau_0) \to_p 0. More generally, if g_n(\tau) converges in probability uniformly to a continuous function g(\tau), then g_n(\hat\tau) - g(\tau_0) \to_p 0. The case of stochastic equicontinuity is the most general, but still has the same implication.
Since E\, m_i(\theta_0, \tau_0) = 0, when we evaluate the empirical process at the true value \tau_0 we have a zero-mean normalized sum, which is asymptotically normal by the CLT:
\[
\nu_n(\tau_0) = \frac{1}{\sqrt n} \sum_{i=1}^n \left( m_i(\theta_0, \tau_0) - E\, m_i(\theta_0, \tau_0) \right) = \frac{1}{\sqrt n} \sum_{i=1}^n m_i(\theta_0, \tau_0) \overset{d}{\to} N(0, \Omega)
\]
where
\[
\Omega = E\, m_i(\theta_0, \tau_0)\, m_i(\theta_0, \tau_0)'.
\]
It follows that
\[
\nu_n(\hat\tau) = \nu_n(\tau_0) + o_p(1) \overset{d}{\to} N(0, \Omega)
\]
and thus
\[
\sqrt n\, m_n(\theta_0, \hat\tau) = \sqrt n \left( m_n(\theta_0, \hat\tau) - m(\theta_0, \hat\tau) \right) + \sqrt n\, m(\theta_0, \hat\tau) = \nu_n(\hat\tau) + \sqrt n\, m(\theta_0, \hat\tau).
\]
The final detail is what to do with \sqrt n\, m(\theta_0, \hat\tau). Andrews directly assumes that
\[
\sqrt n\, m(\theta_0, \hat\tau) \to_p 0.
\]
This is the second key assumption, and we discuss it below. Under this assumption,
\[
\sqrt n\, m_n(\theta_0, \hat\tau) \overset{d}{\to} N(0, \Omega),
\]
and combining this with our earlier expansion, we obtain:
Andrews' MINPIN Theorem. Under Assumptions 1-6 below,
\[
\sqrt n \left( \hat\theta - \theta_0 \right) \overset{d}{\to} N(0, V)
\]
where
\[
V = M^{-1} \Omega M^{-1\prime}, \qquad M = E\, \frac{\partial}{\partial\theta'} m_i(\theta_0, \tau_0), \qquad \Omega = E\, m_i(\theta_0, \tau_0)\, m_i(\theta_0, \tau_0)'.
\]
Assumptions. As n \to \infty, with m(\theta, \tau) = E\, m_i(\theta, \tau):
1. \hat\theta \to_p \theta_0
2. \hat\tau \to_p \tau_0
3. \sqrt n\, m(\theta_0, \hat\tau) \to_p 0
4. m(\theta_0, \tau_0) = 0
5. \nu_n(\tau) is stochastically equicontinuous at \tau_0.
6. m_n(\theta, \tau) and \frac{\partial}{\partial\theta'} m_n(\theta, \tau) satisfy uniform WLLNs over \Theta \times T. (They converge in probability to their expectations, uniformly over the parameter space.)
Discussion of result: The theorem says that the semiparametric estimator \hat\theta has the same asymptotic distribution as the idealized estimator obtained by replacing the nonparametric estimate \hat\tau with the true function \tau_0. Thus the estimator is adaptive. This might seem too good to be true. The key is Assumption 3, which holds in some cases, but not in others.
Discussion of assumptions: Assumptions 1 and 2 state that the estimators are consistent, which should be separately verified. Assumption 4 states that the FOC identifies \theta when evaluated at the true \tau_0. Assumptions 5 and 6 are regularity conditions, essentially smoothness of the underlying functions, plus sufficient moments.
7.5 Orthogonality Assumption
The key Assumption 3 for Andrews' MINPIN theory was somehow missed in the write-up in Li-Racine. Assumption 3 is not always true, and is not just a regularity condition. It requires a sort of orthogonality between the estimation of \theta and \tau.
Suppose that \tau is finite-dimensional. Then by a Taylor expansion,
\[
\sqrt n\, m(\theta_0, \hat\tau) \approx \sqrt n\, m(\theta_0, \tau_0) + \frac{\partial}{\partial\tau'} m(\theta_0, \tau_0)\, \sqrt n \left( \hat\tau - \tau_0 \right) = \frac{\partial}{\partial\tau'} m(\theta_0, \tau_0)\, \sqrt n \left( \hat\tau - \tau_0 \right),
\]
since m(\theta_0, \tau_0) = 0 by Assumption 4. Since \tau is parametric, we should expect \sqrt n (\hat\tau - \tau_0) to converge to a normal vector. Thus this expression will converge in probability to zero only if
\[
\frac{\partial}{\partial\tau'} m(\theta_0, \tau_0) = 0.
\]
Recall that m(\theta, \tau) is the expectation of the FOC, which is the derivative of the criterion with respect to \theta. Thus \frac{\partial}{\partial\tau'} m(\theta, \tau) is the cross-derivative of the criterion, and the above statement is that this cross-derivative is zero, which is an orthogonality condition (e.g. block diagonality of the Hessian).
Now when \tau is infinite-dimensional, the above argument does not work, but it lends intuition. An analog of the derivative condition, which is sufficient for Assumption 3, is that
\[
m(\theta_0, \tau) = 0
\]
for all \tau in a neighborhood of \tau_0.
In the FGLS example,
\[
m(\theta_0, \tau) = E\left( \frac{1}{\tau(X_i)} X_i e_i \right) = 0
\]
for all \tau, so Assumption 3 is satisfied in this example. We have the implication:
Theorem. Under regularity conditions, the feasible nonparametric GLS estimator of the previous section is asymptotically equivalent to infeasible GLS.
In the generated regressor example,
\[
m(\theta_0, \tau) = E\left( \tau(X_i) \left( y_i - \theta_0\, \tau(X_i) \right) \right) = E\left( \tau(X_i) \left( e_i + \theta_0 \left( \tau_0(X_i) - \tau(X_i) \right) \right) \right) = \theta_0\, E\left( \tau(X_i) \left( \tau_0(X_i) - \tau(X_i) \right) \right) = \theta_0 \int \tau(x) \left( \tau_0(x) - \tau(x) \right) f(x)\, dx.
\]
Assumption 3 requires \sqrt n\, m(\theta_0, \hat\tau) \to_p 0. But note that \sqrt n\, m(\theta_0, \hat\tau) \approx \sqrt n \left( \tau_0(x) - \hat\tau(x) \right), which certainly does not converge to zero. Assumption 3 is generically violated when there are generated regressors.
There is one interesting exception. When \theta_0 = 0 then m(\theta_0, \tau) = 0 and thus \sqrt n\, m(\theta_0, \hat\tau) = 0, so Assumption 3 is satisfied.
We see that Andrews' MINPIN assumptions do not apply in all semiparametric models, only those which satisfy Assumption 3, and this needs to be verified. The other key condition is stochastic equicontinuity, which is difficult to verify but is generally satisfied for well-behaved estimators.
The remaining assumptions are smoothness and regularity conditions, and typically are not of concern in applications.
7.6 Partially Linear Regression Model
The semiparametric partially linear regression model is
\[
y_i = X_i'\beta + g(Z_i) + e_i, \qquad E(e_i \mid X_i, Z_i) = 0, \qquad E\left( e_i^2 \mid X_i = x, Z_i = z \right) = \sigma^2(x, z).
\]
That is, the regressors are (X_i, Z_i), and the model specifies the conditional mean as linear in X_i but possibly non-linear in Z_i \in R^q. This is a very useful compromise between fully nonparametric and fully parametric. Often the binary (dummy) variables are put in X_i. Often there is just one nonlinear variable, q = 1, to keep things simple.
The goal is to estimate \beta and g, and to obtain confidence intervals.
The first issue to consider is identification. Since g is unconstrained, the elements of X_i cannot be collinear with any function of Z_i. This means that we must exclude from X_i intercepts and any deterministic function of Z_i. The function g includes these components.
7.7 Robinson's Transformation
Robinson (Econometrica, 1988) is the seminal treatment of the partially linear model. His first contribution is to show that we can concentrate out the unknown g by using a generalization of residual regression.
Take the equation
\[
y_i = X_i'\beta + g(Z_i) + e_i
\]
and apply the conditional expectation operator E(\,\cdot \mid Z_i). We obtain
\[
E(y_i \mid Z_i) = E\left( X_i'\beta \mid Z_i \right) + E\left( g(Z_i) \mid Z_i \right) + E(e_i \mid Z_i) = E(X_i \mid Z_i)'\beta + g(Z_i)
\]
(using the law of iterated expectations). Defining the conditional expectations
\[
g_y(z) = E(y_i \mid Z_i = z), \qquad g_x(z) = E(X_i \mid Z_i = z),
\]
we can write this expression as
\[
g_y(z) = g_x(z)'\beta + g(z).
\]
Subtracting from the original equation, the function g disappears:
\[
y_i - g_y(Z_i) = \left( X_i - g_x(Z_i) \right)'\beta + e_i.
\]
We can write this as
\[
e_{yi} = e_{xi}'\beta + e_i, \qquad y_i = g_y(Z_i) + e_{yi}, \qquad X_i = g_x(Z_i) + e_{xi}.
\]
That is, \beta is the coefficient of the regression of e_{yi} on e_{xi}, where these are the conditional expectation errors from the regression of y_i (and X_i) on Z_i only.
This is a conditional expectation generalization of the idea of residual regression.
This transformed equation immediately suggests an infeasible estimator for \beta, by LS of e_{yi} on e_{xi}:
\[
\tilde\beta = \left( \sum_{i=1}^n e_{xi} e_{xi}' \right)^{-1} \sum_{i=1}^n e_{xi} e_{yi}.
\]
7.8 Robinson's Estimator
Robinson suggested first estimating g_y and g_x by NW regression, using these to obtain residuals \hat e_{yi} and \hat e_{xi}, and replacing these in the formula for \tilde\beta.
Specifically, let \hat g_y(z) and \hat g_x(z) denote the NW estimates of the conditional means of y_i and X_i given Z_i = z. Assuming q = 1,
\[
\hat g_y(z) = \frac{\sum_{i=1}^n k\left( \frac{Z_i - z}{h} \right) y_i}{\sum_{i=1}^n k\left( \frac{Z_i - z}{h} \right)}, \qquad
\hat g_x(z) = \frac{\sum_{i=1}^n k\left( \frac{Z_i - z}{h} \right) X_i}{\sum_{i=1}^n k\left( \frac{Z_i - z}{h} \right)}.
\]
The estimator \hat g_x(z) is a vector of the same dimension as X_i.
Notice that you are regressing each variable (y_i and each element of X_i) separately on the continuous variable Z_i. You should view each of these regressions as a separate NW regression. You should probably use a different bandwidth h for each of these regressions, as the dependence on Z_i will vary with the variable. (For example, some regressors in X_i might be independent of Z_i, so you would want to use an infinite bandwidth in those cases.) While Robinson discussed NW regression, this is not essential. You could substitute LL or WNW instead.
Given the regression functions, we obtain the regression residuals
\[
\hat e_{yi} = y_i - \hat g_y(Z_i), \qquad \hat e_{xi} = X_i - \hat g_x(Z_i).
\]
Our first attempt at an estimator for \beta is then
\[
\hat\beta = \left( \sum_{i=1}^n \hat e_{xi} \hat e_{xi}' \right)^{-1} \sum_{i=1}^n \hat e_{xi} \hat e_{yi}.
\]
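A minimal sketch of this estimator for scalar Z with a Gaussian kernel and a common bandwidth for all first-step regressions; trimming (discussed in the next section) is omitted, and the names and bandwidth are illustrative.

```python
import numpy as np

def nw(zgrid, Z, v, h):
    """NW regression of v (vector, or matrix column by column) on scalar Z, at zgrid."""
    K = np.exp(-0.5 * ((np.asarray(zgrid)[:, None] - Z[None, :]) / h) ** 2)
    W = K / K.sum(axis=1, keepdims=True)
    return W @ v

def robinson(y, X, Z, h):
    """Robinson's estimator of beta in y = X'beta + g(Z) + e, scalar Z."""
    ey = y - nw(Z, Z, y, h)        # e_{y,i} = y_i - ghat_y(Z_i)
    ex = X - nw(Z, Z, X, h)        # e_{x,i} = X_i - ghat_x(Z_i)
    return np.linalg.solve(ex.T @ ex, ex.T @ ey)

rng = np.random.default_rng(0)
n = 1000
Z = rng.uniform(0, 1, n)
X = np.column_stack([rng.standard_normal(n), Z + rng.standard_normal(n)])
beta0 = np.array([1.0, -0.5])
y = X @ beta0 + np.sin(2 * np.pi * Z) + 0.3 * rng.standard_normal(n)
print(robinson(y, X, Z, h=0.08))    # should be close to (1.0, -0.5)
```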
7.9 Trimming
The asymptotic theory for semiparametric estimators typically requires that the first-step estimator converges uniformly at some rate. The difficulty is that \hat g_y(z) and \hat g_x(z) do not converge uniformly over unbounded sets. Equivalently, the problem is due to the estimated density of Z_i in the denominator. Another way of viewing the problem is that these estimates are quite noisy in sparse regions of the sample space, so residuals in such regions are noisy, and the estimate of \beta can be unduly influenced by unstable residuals from observations in those regions. The nonparametric regression estimates depend inversely on the density estimate
\[
\hat f_z(z) = \frac{1}{nh} \sum_{i=1}^n k\left( \frac{Z_i - z}{h} \right).
\]
For values of z where f_z(z) is close to zero, \hat f_z(z) is not bounded away from zero, so the NW estimates at such points can be poor. Consequently the residuals for observations i for which f_z(Z_i) is small will be quite unreliable, and can have an undue influence on \hat\beta.
A standard solution is to introduce trimming. Let b > 0 be a trimming constant and let 1_i(b) = 1\left( \hat f_z(Z_i) \ge b \right) denote an indicator variable for those observations for which the estimated density of Z_i is above b.
The trimmed version of \hat\beta is
\[
\hat\beta = \left( \sum_{i=1}^n \hat e_{xi} \hat e_{xi}'\, 1_i(b) \right)^{-1} \sum_{i=1}^n \hat e_{xi} \hat e_{yi}\, 1_i(b).
\]
This is a trimmed LS residual regression.
The asymptotic theory requires that b = b_n \to 0, but unfortunately there is not good guidance about how to select b in practice. Often trimming is ignored in applications. One practical suggestion is to estimate \beta with and without trimming to assess robustness.
7.10 Asymptotic Distribution
Robinson (1988), Andrews (1994) and Li (1996) are references. The needed regularity conditions are that the data are iid, Z_i has a density, and the regression functions, density, and conditional variance function are sufficiently smooth with respect to their arguments. Assuming a second-order kernel, and for simplicity writing h = h_1 = \cdots = h_q, the important condition on the bandwidth sequence is
\[
\sqrt n \left( h^4 + \frac{1}{n h^q} \right) \to 0.
\]
Technically, this is not quite enough, as this ignores the interaction with the trimming parameter b. But since b_n \to 0 can be set at an extremely slow rate, it can be safely ignored. The above condition is similar to the standard convergence rates for nonparametric estimation, multiplied by \sqrt n. Equivalently, what is essential is that the uniform MSE of the nonparametric estimators converges faster than \sqrt n, or that the estimators themselves converge faster than n^{-1/4}. That is, what we need is
\[
n^{1/4} \sup_z \left| \hat g_y(z) - g_y(z) \right| \to_p 0, \qquad n^{1/4} \sup_z \left| \hat g_x(z) - g_x(z) \right| \to_p 0.
\]
From the theory for nonparametric regression, these rates hold when bandwidths are picked optimally and q \le 3.
In practice, q \le 3 is probably sufficient. If q > 3 is desired, then higher-order kernels can be used to improve the rate of convergence. So long as the rate is faster than n^{-1/4}, the following result applies.
Theorem (Robinson). Under regularity conditions, including q \le 3, the trimmed estimator satisfies
\[
\sqrt n \left( \hat\beta - \beta \right) \overset{d}{\to} N(0, V),
\]
\[
V = \left( E\left( e_{xi} e_{xi}' \right) \right)^{-1} \left( E\left( e_{xi} e_{xi}'\, \sigma^2(X_i, Z_i) \right) \right) \left( E\left( e_{xi} e_{xi}' \right) \right)^{-1}.
\]
That is, \hat\beta is asymptotically equivalent to the infeasible estimator \tilde\beta.
The variance matrix may be estimated using conventional LS methods.
7.11 Verification of Andrews' MINPIN Condition
This theorem states that Robinson's two-step estimator for \beta is asymptotically equivalent to the infeasible one-step estimator. This is an example of the application of Andrews' MINPIN theory. Andrews specifically mentions that the n^{-1/4} convergence rates for \hat g_y(z) and \hat g_x(z) are essential to obtain this result.
To see this, note that the estimator \hat\beta solves the FOC
\[
\frac{1}{n} \sum_{i=1}^n \left( X_i - \hat g_x(Z_i) \right) \left( y_i - \hat g_y(Z_i) - \hat\beta'\left( X_i - \hat g_x(Z_i) \right) \right) = 0.
\]
In Andrews' MINPIN notation, let \tau_x = \hat g_x and \tau_y = \hat g_y denote fixed (function) values of the regression estimates; then
\[
m_i(\theta_0, \tau) = \left( X_i - \tau_x(Z_i) \right) \left( y_i - \tau_y(Z_i) - \theta_0'\left( X_i - \tau_x(Z_i) \right) \right).
\]
Since
\[
y_i = g_y(Z_i) + \left( X_i - g_x(Z_i) \right)'\beta + e_i,
\]
then
\[
E\left( y_i - \tau_y(Z_i) - \theta_0'\left( X_i - \tau_x(Z_i) \right) \mid X_i, Z_i \right) = g_y(Z_i) - \tau_y(Z_i) - \left( g_x(Z_i) - \tau_x(Z_i) \right)'\theta_0
\]
and
\[
m(\theta_0, \tau) = E\, m_i(\theta_0, \tau) = E\, E\left( m_i(\theta_0, \tau) \mid X_i, Z_i \right)
= E\left[ \left( X_i - \tau_x(Z_i) \right) \left( g_y(Z_i) - \tau_y(Z_i) - \left( g_x(Z_i) - \tau_x(Z_i) \right)'\theta_0 \right) \right]
\]
\[
= E\left[ \left( g_x(Z_i) - \tau_x(Z_i) \right) \left( g_y(Z_i) - \tau_y(Z_i) - \left( g_x(Z_i) - \tau_x(Z_i) \right)'\theta_0 \right) \right]
= \int \left( g_x(z) - \tau_x(z) \right) \left( g_y(z) - \tau_y(z) - \left( g_x(z) - \tau_x(z) \right)'\theta_0 \right) f_z(z)\, dz,
\]
where the second-to-last line uses conditional expectations given Z_i. Then replacing \tau_x with \hat g_x and \tau_y with \hat g_y,
\[
\sqrt n\, m(\theta_0, \hat\tau) = \sqrt n \int \left( g_x(z) - \hat g_x(z) \right) \left( g_y(z) - \hat g_y(z) - \left( g_x(z) - \hat g_x(z) \right)'\theta_0 \right) f_z(z)\, dz.
\]
Taking bounds,
\[
\left\| \sqrt n\, m(\theta_0, \hat\tau) \right\| \le \sqrt n \left( \sup_z \left| g_x(z) - \hat g_x(z) \right| \sup_z \left| g_y(z) - \hat g_y(z) \right| + \left\| \theta_0 \right\| \sup_z \left| g_x(z) - \hat g_x(z) \right|^2 \right) \sup_z f_z(z) \to_p 0
\]
when the nonparametric regression estimates converge faster than n^{-1/4}.
Indeed, we see that the o_p(n^{-1/4}) uniform convergence rates imply the key condition \sqrt n\, m(\theta_0, \hat\tau) \to_p 0.
7.12 Estimation of Nonparametric Component
Recall that the model is
\[
y_i = X_i'\beta + g(Z_i) + e_i
\]
and the goal is to estimate \beta and g. We have described Robinson's estimator for \beta. We now discuss estimation of g.
Since \hat\beta converges at the n^{-1/2} rate, which is faster than the nonparametric rate, we can simply pretend that \beta is known, and do nonparametric regression of y_i - X_i'\hat\beta on Z_i:
\[
\hat g(z) = \frac{\sum_{i=1}^n k\left( \frac{Z_i - z}{h} \right) \left( y_i - X_i'\hat\beta \right)}{\sum_{i=1}^n k\left( \frac{Z_i - z}{h} \right)}.
\]
The bandwidth h = (h_1, ..., h_q) is distinct from those used for the first-stage regressions.
It is not hard to see that this estimator is asymptotically equivalent to the infeasible estimator in which \hat\beta is replaced with the true \beta_0.
Standard errors for \hat g(z) may be computed as for standard nonparametric regression.
7.13 Bandwidth Selection
In a semiparametric context, it is important to study the effect a bandwidth has on the performance of the estimator of interest before determining the bandwidth. In many cases, this requires a nonconventional bandwidth rate.
However, this problem does not occur in the partially linear model. The first-step bandwidths h used for \hat g_y(z) and \hat g_x(z) are inputs for the calculation of \hat\beta. The goal is presumably accurate estimation of \beta. The bandwidth h impacts the theory for \hat\beta through the uniform convergence rates for \hat g_y(z) and \hat g_x(z), suggesting that we use conventional bandwidth rules, e.g. cross-validation.
8 Semiparametric Single Index Models
8.1 Index Models
An object of interest such as the conditional density f(y \mid x) or conditional mean E(y \mid x) is a single index model when it depends on the vector x only through a single linear combination x'\beta.
Most parametric models are single index, including Normal regression, Logit, Probit, Tobit, and Poisson regression.
In a semiparametric single index model, the object of interest depends on x through the function g(x'\beta), where \beta \in R^k and g : R \to R are unknown. g is sometimes called a link function. In single index models there is only one nonparametric dimension. These methods fall in the class of dimension reduction techniques.
The semiparametric single index regression model is
\[
E(y \mid x) = g\left( x'\beta \right) \tag{1}
\]
where g is an unknown link function.
The semiparametric single index binary choice model is
\[
P(y = 1 \mid x) = E(y \mid x) = g\left( x'\beta \right) \tag{2}
\]
where g is an unknown distribution function. We use g (rather than, say, F) to emphasize the connection with the regression model.
In both contexts, the function g includes any location and level shift, so the vector X_i cannot include an intercept. The scale of \beta is not identified, so some normalization criterion for \beta is needed. It is typically easier to impose this on \beta than on g. One approach is to set \beta'\beta = 1. A second approach is to set one component of \beta equal to one. (This second approach requires that this variable correctly has a non-zero coefficient.)
The vector X_i must be of dimension 2 or larger. If X_i is one-dimensional, then \beta is simply normalized to one, and the model is the one-dimensional nonparametric regression E(y \mid x) = g(x) with no semiparametric component.
Identification of \beta and g also requires that X_i contain at least one continuously distributed variable, and that this variable have a non-zero coefficient. If not, X_i'\beta would take only a discrete set of values, and it would be impossible to identify a continuous function g on this discrete support.
8.2 Single Index Regression and Ichimura's Estimator

The semiparametric single index regression model is

y_i = g(X_i'β) + e_i
E(e_i | X_i) = 0

This model generalizes the linear regression model (which sets g(z) to be linear), and is a restriction of the nonparametric regression model.

The gain over full nonparametrics is that there is only one nonparametric dimension, so the curse of dimensionality is avoided.

Suppose g were known. Then you could estimate β by (nonlinear) least-squares. The LS criterion would be

S_n(β, g) = Σ_{i=1}^n ( y_i − g(X_i'β) )².

We could think about replacing g with an estimate ĝ, but since g(z) is the conditional mean of y_i given X_i'β = z, g depends on β, so a two-step estimator is likely to be inefficient.

In his PhD thesis, Ichimura proposed a semiparametric estimator, published later in the Journal of Econometrics (1993).

Ichimura suggested replacing g with the leave-one-out NW estimator

ĝ_i(X_i'β) = Σ_{j≠i} k( (X_j − X_i)'β / h ) y_j / Σ_{j≠i} k( (X_j − X_i)'β / h ).

The leave-one-out version is used since we are estimating the regression at the i-th observation.

Since the NW estimator only converges uniformly over compact sets, Ichimura introduces trimming for the sum of squared errors. The criterion is then

S_n(β) = Σ_{i=1}^n ( y_i − ĝ_i(X_i'β) )² 1_i(b)

He is not too specific about how to pick the trimming function, and it is likely that it is not important in applications.

The estimator of β is then

β̂ = argmin_β S_n(β).

The criterion is somewhat similar to cross-validation. Indeed, Hardle, Hall, and Ichimura (Annals of Statistics, 1993) suggest picking β and the bandwidth h jointly by minimization of S_n(β).

In his paper, Ichimura claims that ĝ_i(X_i'β) could be replaced by any other uniformly consistent estimator and the consistency of β̂ would be maintained, but his asymptotic normality result would be lost. In particular, his proof rests on the asymptotic orthogonality of the derivative of ĝ_i(X_i'β) with e_i, which holds because the former is a leave-one-out estimator, and fails if it is a conventional NW estimator.
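As a concrete illustration, here is a minimal Python sketch (not from the original notes) of Ichimura's profile criterion with a leave-one-out Nadaraya–Watson fit; the Gaussian kernel, the trivial trimming rule, the normalization of the first coefficient to one, and the use of a generic numerical minimizer are all illustrative assumptions, not Ichimura's exact implementation.

```python
import numpy as np
from scipy.optimize import minimize

def loo_nw(index, y, h):
    """Leave-one-out NW estimate of E(y | index) at each observation."""
    u = (index[None, :] - index[:, None]) / h      # (n, n) pairwise index distances
    K = np.exp(-0.5 * u**2)                        # Gaussian kernel weights
    np.fill_diagonal(K, 0.0)                       # drop own observation
    return (K @ y) / K.sum(axis=1)

def ichimura_criterion(b_free, y, X, h):
    """Profile SSE with the first coefficient normalized to one."""
    beta = np.concatenate(([1.0], b_free))         # normalization: beta_1 = 1
    fit = loo_nw(X @ beta, y, h)
    trim = np.ones_like(y)                         # trivial trimming (illustrative)
    return np.sum(trim * (y - fit) ** 2)

# usage: res = minimize(ichimura_criterion, b0, args=(y, X, 0.3), method="Nelder-Mead")
```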
8.3 Asymptotic Distribution of Ichimura's Estimator

Let β_0 denote the true value of β.

The tricky thing is that ĝ_i(X_i'β) is not estimating g(X_i'β_0); rather, it is estimating

G(X_i'β) = E( y_i | X_i'β ) = E( g(X_i'β_0) | X_i'β ),

the second equality since y_i = g(X_i'β_0) + e_i.

That is,

G(z) = E( y_i | X_i'β = z )

and G(X_i'β) is this function evaluated at X_i'β.

Note that

G(X_i'β_0) = g(X_i'β_0)

but for other values of β,

G(X_i'β) ≠ g(X_i'β).

Hardle, Hall, and Ichimura (1993) show that the LS criterion is asymptotically equivalent to replacing ĝ_i(X_i'β) with G(X_i'β), so

S_n(β) ≃ S*_n(β) = Σ_{i=1}^n ( y_i − G(X_i'β) )².

This approximation is essentially the same as Andrews' MINPIN argument, and relies on the estimator ĝ_i(X_i'β) being a leave-one-out estimator, so that it is orthogonal to the error e_i.

This means that β̂ is asymptotically equivalent to the minimizer of S*_n(β), a NLLS problem. As we know from Econ 710, the asymptotic distribution of the NLLS estimator is identical to least-squares on

X*_i = ∂/∂β G(X_i'β).

This implies

√n ( β̂ − β_0 ) →_d N(0, V)
V = Q^{−1} Ω Q^{−1}
Q = E( X*_i X*_i' )
Ω = E( X*_i X*_i' e_i² )

To complete the derivation, we now find this X*_i.

As β̂ is n^{−1/2}-consistent, we can use a Taylor expansion of g(X_i'β_0) to find

g(X_i'β_0) ≃ g(X_i'β) + g^(1)(X_i'β) X_i'(β_0 − β)

where

g^(1)(z) = d/dz g(z).
Then

G(X_i'β) = E( g(X_i'β_0) | X_i'β )
         ≃ E( g(X_i'β) + g^(1)(X_i'β) X_i'(β_0 − β) | X_i'β )
         = g(X_i'β) − g^(1)(X_i'β) E( X_i | X_i'β )' (β − β_0)

since g(X_i'β) and g^(1)(X_i'β) are measurable with respect to X_i'β. Another Taylor expansion for g(X_i'β) yields that this is approximately

G(X_i'β) ≃ g(X_i'β_0) + g^(1)(X_i'β) ( X_i − E(X_i | X_i'β) )' (β − β_0)
         ≃ g(X_i'β_0) + g^(1)(X_i'β_0) ( X_i − E(X_i | X_i'β_0) )' (β − β_0),

the final approximation holding for β in a n^{−1/2} neighborhood of β_0. (The error is of smaller stochastic order.)

We see that

X*_i = ∂/∂β G(X_i'β) ≃ g^(1)(X_i'β_0) ( X_i − E(X_i | X_i'β_0) ).

Ichimura rigorously establishes this result.

This asymptotic distribution is slightly different from that which would be obtained if the function g were known a priori. In that case the asymptotic design depends on X_i, not X_i − E(X_i | X_i'β_0):

Q = E( g^(1)(X_i'β_0)² X_i X_i' ).

This is the cost of the semiparametric estimation.

Recall that when we described identification we required the dimension of X_i to be 2 or larger. Suppose that X_i is one-dimensional. Then X_i − E(X_i | X_i'β_0) = 0, so Q = 0 and the above theory is vacuous (as it should be).

The Ichimura estimator achieves the semiparametric efficiency bound for estimation of β when the error is conditionally homoskedastic. Ichimura also considers a weighted least-squares estimator, setting the weight to be the inverse of an estimate of the conditional variance function (as in Robinson's FGLS estimator). This weighted LS estimator is then semiparametrically efficient.
8.4 Klein and Spady's Binary Choice Estimator

Klein and Spady (Econometrica, 1993) proposed an estimator of the semiparametric single index binary choice model which has strong similarities with Ichimura's estimator.

The model is

y_i = 1( X_i'β − e_i ≥ 0 )

where e_i is an error.

If e_i is independent of X_i and has distribution function g, then the data satisfy the single-index regression

E(y | x) = g(x'β).

It follows that Ichimura's estimator can be directly applied to this model.

Klein and Spady instead suggest a semiparametric likelihood approach. Given g, the log-likelihood is

L_n(β, g) = Σ_{i=1}^n [ y_i ln g(X_i'β) + (1 − y_i) ln( 1 − g(X_i'β) ) ].

This is analogous to the sum-of-squared-errors function S_n(β, g) for the semiparametric regression model.

As with Ichimura, Klein and Spady suggest replacing g with the leave-one-out NW estimator

ĝ_i(X_i'β) = Σ_{j≠i} k( (X_j − X_i)'β / h ) y_j / Σ_{j≠i} k( (X_j − X_i)'β / h ).

Making this substitution, and adding a trimming function, this leads to the feasible likelihood criterion

L_n(β) = Σ_{i=1}^n [ y_i ln ĝ_i(X_i'β) + (1 − y_i) ln( 1 − ĝ_i(X_i'β) ) ] 1_i(b).

Klein and Spady emphasize that the trimming indicator should not be a function of β, but instead of a preliminary estimator. They suggest

1_i(b) = 1( f̂_{X'β̃}(X_i'β̃) ≥ b )

where β̃ is a preliminary estimator of β, and f̂ is an estimate of the density of X_i'β̃. Klein and Spady observe that trimming does not seem to matter in their simulations.

The Klein–Spady estimator of β is the value β̂ which maximizes L_n(β).

In many respects the Ichimura and Klein–Spady estimators are quite similar.

Unlike Ichimura, Klein–Spady impose the assumption that the kernel k must be fourth-order (i.e. bias reducing). They also impose that the bandwidth h satisfy the rate n^{−1/6} < h < n^{−1/8}, which is smaller than the optimal n^{−1/9} rate for a 4th-order kernel. It is unclear to me whether these are merely technical sufficient conditions, or whether there is a substantive difference with the semiparametric regression case.

Klein and Spady also have no discussion of how to select the bandwidth. Following the ideas of Hardle, Hall and Ichimura, it seems sensible that it could be selected jointly with β by minimization of L_n(β), but this is just a conjecture.

They establish the asymptotic distribution of their estimator. Similarly as in Ichimura, letting g denote the distribution of e_i, define the function

G(X_i'β) = E( g(X_i'β_0) | X_i'β ).

Then

√n ( β̂ − β_0 ) →_d N( 0, H^{−1} )
H = E[ ( ∂/∂β G(X_i'β) ) ( ∂/∂β G(X_i'β) )' / ( g(X_i'β_0)(1 − g(X_i'β_0)) ) ]

They are not specific about the derivative component, but if I understand it correctly it is the same as in Ichimura, so

∂/∂β G(X_i'β) ≃ g^(1)(X_i'β_0) ( X_i − E(X_i | X_i'β_0) ).

The Klein–Spady estimator achieves the semiparametric efficiency bound for the single-index binary choice model.

Thus in the context of binary choice it is preferable to use Klein–Spady over Ichimura. Ichimura's LS estimator is inefficient (as the regression model is heteroskedastic), and it is much easier and cleaner to use the Klein–Spady estimator rather than a two-step weighted LS estimator.
8.5 Average Derivative Estimator

Let the conditional mean be

E(y | x) = μ(x).

Then the derivative is

μ^(1)(x) = ∂/∂x μ(x)

and a weighted average is

E( μ^(1)(X) w(X) )

where w(x) is a weight function. It is particularly convenient to set w(x) = f(x), the marginal density of X. Thus Powell, Stock and Stoker (Econometrica, 1989) define the average derivative as

δ = E( μ^(1)(X) f(X) ).

This is a measure of the average effect of X on y. It is a simple vector, and therefore easier to report than a full nonparametric estimate.

There is a connection with the single index model, where

μ(x) = g(x'β),

for then

μ^(1)(x) = β g^(1)(x'β)
δ = c β

where

c = E( g^(1)(X'β) f(X) ).

Since β is identified only up to scale, the constant c doesn't matter. That is, a (normalized) estimate of δ is an estimate of normalized β.

PSS observe that by integration by parts

δ = E( μ^(1)(X) f(X) )
  = ∫ μ^(1)(x) f(x)² dx
  = −2 ∫ μ(x) f(x) f^(1)(x) dx
  = −2 E( μ(X) f^(1)(X) )
  = −2 E( y f^(1)(X) ).

By the reasoning used for CV, an estimator of this is

δ̂ = −( 2/(n − 1) ) Σ_{i=1}^n y_i f̂^(1)_(i)(X_i)

where f̂_(i)(X_i) is the leave-one-out density estimator and f̂^(1)_(i)(X_i) is its first derivative.

This is a convenient estimator. There is no denominator messing with uniform convergence. There is only a density estimator; no conditional mean is needed.

PSS show that δ̂ is n^{−1/2}-consistent and asymptotically normal, with a convenient covariance matrix. The asymptotic bias is a bit more complicated.

Let q = dim(X). Set p = (q + 4)/2 if q is even and p = (q + 3)/2 if q is odd; e.g. p = 2 for q = 1, p = 3 for q = 2 or q = 3, and p = 4 for q = 4.

PSS require that the kernel used to estimate f be of order at least p. Thus a second-order kernel for q = 1, and a fourth-order kernel for q = 2, 3, or 4.

PSS then show that the asymptotic bias is

√n ( E δ̂ − δ ) = O( √n h^p )

which is o(1) if the bandwidth is selected so that n h^{2p} → 0. This is violated (h is too big) if h is selected to be optimal for estimation of f̂ or f̂^(1). The requirement forces the bandwidth to undersmooth in order to reduce the bias. This type of result is commonly seen in semiparametric methods. Unfortunately, it does not lead to a practical rule for bandwidth selection.
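For intuition, here is a minimal Python sketch (not part of the original notes) of the PSS density-weighted average derivative for scalar X with a Gaussian kernel; the kernel choice and function names are illustrative, while the leave-one-out construction and the −2/(n−1) normalization follow the formula above.

```python
import numpy as np

def average_derivative(y, X, h):
    """Powell-Stock-Stoker density-weighted average derivative (scalar X):
    delta_hat = -(2/(n-1)) * sum_i y_i * fhat'_{-i}(X_i), Gaussian kernel."""
    n = len(y)
    u = (X[:, None] - X[None, :]) / h               # (n, n) pairwise scaled differences
    phi = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)  # Gaussian kernel values
    dK = -u / h * phi                               # d/dx of k((x - X_j)/h) at x = X_i
    np.fill_diagonal(dK, 0.0)                       # leave-one-out
    fprime = dK.sum(axis=1) / ((n - 1) * h)         # fhat'_{-i}(X_i)
    return -2.0 / (n - 1) * np.sum(y * fprime)
```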
9 Selectivity Models

9.1 Semiparametric Selection Models

The Type-2 Tobit model is the four-equation system

y*_{1i} = X_{1i}'β_1 + u_{1i}
y*_{2i} = X_{2i}'β_2 + u_{2i}
y_{1i} = 1( y*_{1i} > 0 )
y_{2i} = y*_{2i} 1( y_{1i} = 1 )

The variables (y*_{1i}, y*_{2i}) are latent (unobserved). The observed variables are (y_{1i}, y_{2i}, X_{1i}, X_{2i}). Effectively, y*_{2i} is observed only when y_{1i} = 1, equivalently when y*_{1i} > 0.

The Type-3 Tobit model is the four-equation system

y*_{1i} = X_{1i}'β_1 + u_{1i}
y*_{2i} = X_{2i}'β_2 + u_{2i}
y_{1i} = max( y*_{1i}, 0 )
y_{2i} = y*_{2i} 1( y_{1i} > 0 )

The difference is that y_{1i} is censored rather than binary. We observe y*_{2i} only when there is no censoring.

Typically the second equation is of interest, e.g. the coefficient β_2.

The Type-2 model is the classic selection model introduced by Heckman.

It is conventional to assume that the errors (u_{1i}, u_{2i}) are independent of X_i = (X_{1i}, X_{2i}).

As you recall from 710, Heckman showed that if we try to estimate β_2 by a regression using the available data, this is estimating the regression of y_{2i} on X_{2i} conditional on y_{1i} > 0, which is

E( y_{2i} | X_i, y_{1i} = 1 ) = X_{2i}'β_2 + E( u_{2i} | X_i, y_{1i} > 0 )
                             = X_{2i}'β_2 + E( u_{2i} | u_{1i} > −X_{1i}'β_1 )
                             = X_{2i}'β_2 + g( X_{1i}'β_1 )

for some function g(z). When (u_{1i}, u_{2i}) are bivariate normal then g(z) is a scaled inverse Mills ratio. But when the errors are non-normal, the functional form of g(z) is unknown and nonparametric.

The one constraint it satisfies is

lim_{z→∞} g(z) = lim_{z→∞} E( u_{2i} | u_{1i} > −z ) = E( u_{2i} ) = 0

by normalization.

We can then write the regression for y_{2i} as

y_{2i} = X_{2i}'β_2 + g( X_{1i}'β_1 ) + e_i
E( e_i | X_i, y_{1i} > 0 ) = 0

This is a partially linear single index model.
9.2 Two-Step Estimator

This method was developed in a working paper by Powell (1987) and in Li and Wooldridge (Econometric Theory, 2002).

Define Z_i = X_{1i}'β_1. If Z_i were observed, the regression would be

y_{2i} = X_{2i}'β_2 + g(Z_i) + e_i

which is a partially linear model, and can be estimated using Robinson's approach.

For the partially linear model the intercept is absorbed by g, so it must be excluded from X_{2i}.

Since Z_i is not observed, we can use a two-step approach.

In step 1, β_1 is estimated by a semiparametric estimator, say β̂_1 (a semiparametric binary choice estimator from the previous section, or a semiparametric Tobit estimator from the next section). Set Ẑ_i = X_{1i}'β̂_1.

In the second step, β_2 and g are estimated by Robinson's estimator (using the observations for which y_{1i} = 1).

Since the second step uses the generated regressor Ẑ_i = X_{1i}'β̂_1, the asymptotic distribution is affected. From the text,

√n ( β̂_2 − β_2 ) →_d N( 0, Q^{−1} ( Ω_1 + Ω_2 ) Q^{−1} )

with the covariance terms defined in the text, as is typical for two-step estimators.
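To fix ideas, here is a minimal Python sketch (not from the original notes) of the second step: a Robinson-type residual regression on the selected subsample, with the generated index Ẑ_i = X_{1i}'β̂_1 as the conditioning variable. The Gaussian kernel and helper names are illustrative, and standard errors from a plain OLS routine would ignore the generated-regressor correction discussed above.

```python
import numpy as np

def nw_fit(Z, target, h):
    """NW regression of target on Z, evaluated at the sample points."""
    u = (Z[None, :] - Z[:, None]) / h
    K = np.exp(-0.5 * u**2)
    return (K @ target) / K.sum(axis=1)

def selection_second_step(y2, X2, Z_hat, selected, h):
    """Robinson-type estimator of beta_2 on the subsample with y1 = 1."""
    y, X, Z = y2[selected], X2[selected], Z_hat[selected]
    ey = y - nw_fit(Z, y, h)                        # partial out g(Z) from y2
    eX = X - np.column_stack([nw_fit(Z, X[:, j], h) for j in range(X.shape[1])])
    beta2, *_ = np.linalg.lstsq(eX, ey, rcond=None) # residual-on-residual regression
    return beta2
```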
9.3 Ichimura and Lee's Estimator

Ichimura and Lee (1991) proposed a joint estimator for β = (β_1, β_2) based on the nonlinear regression

y_{2i} = X_{2i}'β_2 + g( X_{1i}'β_1 ) + e_i

for observations i such that y_{2i} is observed. Thus the first equation is ignored.

Their criterion is

S_n(β) = (1/n) Σ_{i=1}^n ( y_{2i} − X_{2i}'β_2 − ĝ( X_{1i}'β_1, β_2 ) )² 1_i(b)

where 1_i(b) is a trimming function and

ĝ( X_{1i}'β_1, β ) = Σ_{j≠i} k( (X_{1i} − X_{1j})'β_1 / h ) ( y_{2j} − X_{2j}'β_2 ) / Σ_{j≠i} k( (X_{1i} − X_{1j})'β_1 / h )

is a leave-one-out NW estimator of E( y_{2i} − X_{2i}'β_2 | X_{1i}'β_1 ). (Again, this is computed only for the observations for which y_{2i} is observed.)

This works, but it ignores the first equation of the system. It is a semiparametric extension of a NLLS Heckit estimator based on the equation

y_{2i} = X_{2i}'β_2 + σ_{12} λ( X_{1i}'β_1 ) + e_i.

Such estimators ignore the first equation. This is convenient as it simplifies estimation, but ignoring relevant information reduces efficiency. My view is that identification of β_2 versus g(X_1'β_1) is based on (dubious) exclusion restrictions plus assuming linearity of the first part.
9.4 Powell's Estimator

An alternative creative estimator was proposed by Powell (1987, unpublished working paper), reviewed in his chapter 41 of the Handbook of Econometrics.

As for Ichimura and Lee, we ignore the first equation and only consider observations for which y_{2i} is observed.

Take two observations i and j,

y_{2i} = X_{2i}'β_2 + g(Z_i) + e_i
y_{2j} = X_{2j}'β_2 + g(Z_j) + e_j

and their pairwise difference

y_{2i} − y_{2j} = ( X_{2i} − X_{2j} )'β_2 + g(Z_i) − g(Z_j) + e_i − e_j.

Now focus on observations for which Z_i ≃ Z_j. For these observations g(Z_i) − g(Z_j) ≃ 0, and β_2 can be estimated by a regression of y_{2i} − y_{2j} on X_{2i} − X_{2j}.

This is made operational by using a kernel for Z_i − Z_j, and by replacing Z_i with Ẑ_i = X_{1i}'β̂_1, where β̂_1 is a first-stage estimate of β_1, yielding

β̂_2 = [ Σ_{j≠i} k( (X_{1i} − X_{1j})'β̂_1 / h ) ( X_{2i} − X_{2j} )( X_{2i} − X_{2j} )' ]^{−1} [ Σ_{j≠i} k( (X_{1i} − X_{1j})'β̂_1 / h ) ( X_{2i} − X_{2j} )( y_{2i} − y_{2j} ) ]

Unfortunately, Powell didn't publish the original paper. This type of estimator effectively identifies β_2 from a small subset of observations, so it is unlikely to be precise. The good side is that the nonparametric function g does not need to be estimated in any sense.
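Here is a minimal Python sketch (not from the original notes) of this pairwise-difference estimator; the Gaussian kernel and the brute-force double loop are illustrative choices.

```python
import numpy as np

def powell_pairwise(y2, X2, Z_hat, h):
    """Powell-type pairwise-difference estimator of beta_2:
    kernel-weighted LS of (y2_i - y2_j) on (X2_i - X2_j), weights k((Zhat_i - Zhat_j)/h)."""
    n, p = X2.shape
    A = np.zeros((p, p))
    b = np.zeros(p)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            w = np.exp(-0.5 * ((Z_hat[i] - Z_hat[j]) / h) ** 2)  # Gaussian kernel weight
            dx = X2[i] - X2[j]
            A += w * np.outer(dx, dx)
            b += w * dx * (y2[i] - y2[j])
    return np.linalg.solve(A, b)
```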
9.5 Estimation of the Intercept

While the conditional equation

y_{2i} = X_{2i}'β_2 + g(Z_i) + e_i

excludes an intercept (it is absorbed in g), the original equation of interest,

y*_{2i} = μ + X_{2i}'β_2 + u_{2i}

say, contains an intercept. Its value can be relevant in practice. That is, the parameters of interest for policy evaluation may be (μ, β_2), not (β_2, g).

To estimate μ, Heckman (AER, 1990) suggested using the observation that the function g satisfies g(∞) = 0. Thus

μ = E( y_{2i} − X_{2i}'β_2 | y_{1i} = 1, X_{1i}'β_1 = ∞ )
  ≃ E( y_{2i} − X_{2i}'β_2 | y_{1i} = 1, X_{1i}'β_1 > γ_n )

where γ_n → ∞ is a bandwidth. This can be estimated by

μ̂ = Σ_{i=1}^n 1( X_{1i}'β̂_1 > γ_n ) ( y_{2i} − X_{2i}'β̂_2 ) / Σ_{i=1}^n 1( X_{1i}'β̂_1 > γ_n )

where the sample includes only those observations for which y_{2i} is observed (those for which y_{1i} = 1). Note that this estimator depends on the first-step estimate β̂_1.

Andrews and Schafgans (1998, Review of Economic Studies) suggest that a better estimator is obtained by replacing the indicator function with a DF (distribution-function) kernel. They find that the asymptotic distribution has a non-standard rate, depending on the distribution of X_{1i}'β_1.
10 Censored Models

10.1 Censoring

Suppose that y*_i is a latent variable, and the observed variable is censored:

y_i = y*_i 1( y*_i > 0 ) = { 0 if y*_i ≤ 0 ; y*_i if y*_i > 0 }

Notationally we have set the censoring point at zero, but this is not essential.

If the distribution of y*_i is nonparametric, moments of y*_i are unidentified. We don't observe the full range of the support of y*_i, so anything can happen in that part of the distribution. It follows that moment restrictions are inherently unidentified. When there is censoring, we should be very cautious about moment restriction models.

In contrast, (some) quantiles are identified. Let Q_α(y*_i) and Q_α(y_i) denote the α-th quantiles of y*_i and y_i.

If P(y*_i ≤ 0) < α, then Q_α(y*_i) = Q_α(y_i). That is, so long as less than α percent of the observations are censored, censoring does not affect the α-th quantile.

Furthermore, if P(y*_i ≤ 0) > α, then Q_α(y_i) = 0. (E.g., if there is 30% censoring, then quantiles below the 30th percentile are identically zero.)

This means that we have the relationship

Q_α(y_i) = max( Q_α(y*_i), 0 ).

It follows that we can consistently (and efficiently) estimate quantiles above the censoring probability from the observed data y_i. These observations lead to the strong conclusion that in the presence of censoring we should identify parameters through quantile restrictions, not moment restrictions.

Of particular interest is the median Med(y*_i). We have

Med(y_i) = max( Med(y*_i), 0 ).
10.2 Powell's CLAD Estimator

The Tobit or censored regression model is

y*_i = X_i'β + e_i
y_i = y*_i 1( y*_i > 0 )

The classic Tobit estimator of β is the MLE when e_i is independent of X_i and distributed N(0, σ²).

Powell (1984, Journal of Econometrics) made the brilliant observation that when e_i is nonparametric, β is not identified through moment restrictions. Instead, identify X_i'β as the conditional median of y*_i, so

Med( y*_i | X_i ) = X_i'β.

As we showed in the previous section,

Med( y_i | X_i ) = max( Med(y*_i | X_i), 0 ) = max( X_i'β, 0 ).

Thus the conditional median is a specific nonlinear function of the single index X_i'β.

This shows that the censored observation obeys the nonlinear median regression model

y_i = max( X_i'β, 0 ) + ε_i
Med( ε_i | X_i ) = 0.

We know that the appropriate method to estimate a conditional median is least absolute deviations (LAD). This applies as well to nonlinear models. Hence Powell suggested the criterion

S_n(β) = Σ_{i=1}^n | y_i − max( X_i'β, 0 ) |

or equivalently we can use the criterion

S_n(β) = Σ_{i=1}^n 1( X_i'β > 0 ) | y_i − X_i'β |.

The estimator β̂ which minimizes S_n(β) is called the censored least absolute deviations (CLAD) estimator. The estimator satisfies the asymptotic FOC

Σ_{i=1}^n 1( X_i'β̂ > 0 ) X_i sgn( y_i − X_i'β̂ ) = 0.

This is the same as the FOC for LAD, but only over the observations for which X_i'β̂ > 0.

Minimization of S_n(β) is somewhat more tricky than standard LAD. Buchinsky (PhD dissertation) worked out numerical methods to solve this problem.

Powell showed that the estimator has the asymptotic distribution

√n ( β̂ − β ) →_d N(0, V)
V = Q^{−1} Ω Q^{−1}
Ω = E( 1_i X_i X_i' )
Q = 2 E( f(0 | X_i) 1_i X_i X_i' )
1_i = 1( X_i'β > 0 )

where f(0 | x) is the conditional density of e_i given X_i = x, evaluated at the origin.

The derivation of this result is not much different from that for standard LAD regression. Since the criterion function is not smooth in β, you need to use an empirical process approach, as outlined for example in Section 7 of Newey and McFadden's Handbook chapter.

Identification requires that Ω and Q are full rank. This requires that there is not too much censoring. As the censoring rate increases, the information in Ω diminishes and precision falls.
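As an illustration, here is a minimal Python sketch (not from the original notes) that minimizes the CLAD criterion with a derivative-free simplex search; using scipy's Nelder–Mead here is purely illustrative and is not Buchinsky's algorithm.

```python
import numpy as np
from scipy.optimize import minimize

def clad_criterion(beta, y, X):
    """Censored LAD criterion: sum_i | y_i - max(X_i'beta, 0) |."""
    fitted = np.maximum(X @ beta, 0.0)
    return np.sum(np.abs(y - fitted))

def clad(y, X, beta0):
    """CLAD estimate via a derivative-free simplex search (illustrative only)."""
    res = minimize(clad_criterion, beta0, args=(y, X), method="Nelder-Mead",
                   options={"xatol": 1e-6, "fatol": 1e-6, "maxiter": 20000})
    return res.x

# usage: beta_hat = clad(y, X, beta0=np.linalg.lstsq(X, y, rcond=None)[0])
```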
10.3 Variance Estimation

When e_i is independent of X_i, then f(0 | x) = f(0) and V = ( 4 f(0)² Ω )^{−1}. Practical standard error estimation seems to focus on estimating V under this assumption:

V̂ = ( 4 f̂(0)² Ω̂ )^{−1}
Ω̂ = (1/n) Σ_{i=1}^n 1( X_i'β̂ > 0 ) X_i X_i'

The difficult part is f(0), in part because ê_i is only observed for observations with y_i > 0. Hall and Horowitz (1990, Econometric Theory) recommend

f̂(0) = Σ_{i=1}^n k( ê_i / h ) 1( y_i > 0 ) / ( h Σ_{i=1}^n G( X_i'β̂ / h ) )

where ê_i = y_i − X_i'β̂, h is a bandwidth, k(u) is a symmetric kernel, and G(u) is the integrated kernel. They find that the optimal rate is h ∼ n^{−1/5} if k is a second-order kernel, and h ∼ n^{−1/(2ν+1)} if k is a ν-th order kernel. Their paper has an expression for the optimal bandwidth, and discusses possible methods to estimate the bandwidth, but does not present a fully automatic bandwidth method.

An obvious alternative to asymptotic methods is the bootstrap. It is quite common to use bootstrap percentile methods to compute standard errors, confidence intervals, and p-values for LAD, quantile estimation, and CLAD estimation.
10.4 Khan and Powell's Two-Step Estimator

Recall the CLAD criterion

S_n(β) = Σ_{i=1}^n 1( X_i'β > 0 ) | y_i − X_i'β |.

In this criterion the coefficient β plays two roles. Khan and Powell (2001, Journal of Econometrics) suggest that this double role induces bias in finite samples, and that this can be avoided by a two-step estimator.

They suggest first estimating β̃ using a semiparametric binary choice estimator, and using this to define the observations for trimming. The second-stage criterion is then

S_n(β) = Σ_{i=1}^n 1( X_i'β̃ > 0 ) | y_i − X_i'β |.

(In the theoretical treatment the indicator function is replaced with a smooth weighting function, but they claim this is only to make the theory easy, and they use the indicator function in their simulations.) The second-stage estimator minimizes this criterion, which is just LAD on the trimmed sub-sample. Khan and Powell argue that this two-step estimator falls in the class of Andrews' MINPIN estimators, so its asymptotic distribution is identical to Powell's estimator.
10.5 Newey and Powell's Weighted CLAD Estimator

When e_i is not independent of X_i, the asymptotic covariance matrix of the CLAD estimator suggests that it is inefficient and can be improved. Newey and Powell (1990, Econometric Theory) compute the semiparametric efficiency bound, and find that it is attained by the estimator minimizing the weighted criterion

S_n(β) = Σ_{i=1}^n w_i | y_i − max( X_i'β, 0 ) |
w_i = 2 f( 0 | X_i )

The estimator β̂ which minimizes this criterion is a weighted CLAD estimator, and the authors show that it has the asymptotic distribution

√n ( β̂ − β ) →_d N(0, V)
V = ( 4 E[ f(0 | X_i)² 1_i X_i X_i' ] )^{−1}

The conditional density plays a role similar to the conditional variance in GLS regression.

This efficiency result is general to median regression, not just censored regression. That is, the weighted LAD estimator achieves the asymptotic efficiency bound for median regression. The unweighted estimator is efficient when f(0 | x) = f(0) (essentially, when e_i is independent of X_i).

Feasible versions of this estimator are challenging to construct. Newey and Powell suggest a method based on nearest neighbor estimation of the conditional distribution function.

I don't know if this has been noticed elsewhere, but here is a useful observation. Suppose that the error e_i depends on X_i only through a scale effect. That is,

e_i = σ(X_i) z_i

where z_i is independent of X_i, with density f_z(z) and median zero. Then the conditional density of e_i given X_i = x is

f(e | x) = (1/σ(x)) f_z( e / σ(x) )

so at the origin

f(0 | x) = f_z(0) / σ(x).

Thus the optimal weighting is w_i ∝ σ(X_i)^{−1}, which takes the same form as for GLS in regression. The interpretation of σ(x) is a bit different (it is identified from median restrictions).
10.6 Nonparametric Censored Regression

The models discussed in the previous sections assume that the conditional median is linear in X_i, a highly parametric assumption. It would be desirable to extend the censored regression model to allow for nonparametric median functions. A nonparametric model would take the form

Med( y*_i | X_i ) = g(X_i)
y_i = y*_i 1( y*_i > 0 )

with g nonparametric. The conditional median of the observed dependent variable is

Med( y_i | X_i ) = max( g(X_i), 0 ).

We can define the conditional median function

g*(x) = max( g(x), 0 ).

Since g(x) is nonparametric then so is g*(x), although it does satisfy g*(x) ≥ 0.

A feasible approach to estimate g*(x) is to simply use standard nonparametric median regression. It is unclear whether any information is lost by ignoring the censoring. My only thought is that the function g*(x) will typically have a kink where g(x) = 0, and this kink is smoothed over by nonparametric methods, which suggests inefficiency.

An alternative suggestion is Lewbel and Linton (Econometrica, 2002). They impose the strong assumption that the error e_i = y*_i − g(X_i) is independent of X_i, and develop nonparametric estimates of g(x) using kernel methods.
11 Nearest Neighbor Methods

11.1 kth Nearest Neighbor

An alternative nonparametric method is called k-nearest neighbors, or k-nn. It is similar to kernel methods with a random and variable bandwidth. The idea is to base estimation on a fixed number of observations k which are closest to the desired point.

Suppose X ∈ R^q and we have a sample X_1, ..., X_n.

For any fixed point x ∈ R^q, we can calculate how close each observation X_i is to x using the Euclidean norm ||x|| = (x'x)^{1/2}. This distance is

D_i = ||x − X_i|| = ( (x − X_i)'(x − X_i) )^{1/2}

This is just a simple calculation on the data set.

The order statistics for the distances D_i are 0 ≤ D_(1) ≤ D_(2) ≤ ··· ≤ D_(n).

The observations corresponding to these order statistics are the nearest neighbors of x. The first nearest neighbor is the observation closest to x, the second nearest neighbor is the observation second closest, etc.

This ranks the data by how close they are to x. Imagine drawing a small ball about x and slowly inflating it. As the ball hits the first observation X_i, that observation is the first nearest neighbor of x. As the ball further inflates and hits a second observation, that observation is the second nearest neighbor.

The observations ranked by the distances, or nearest neighbors, are X_(1), X_(2), X_(3), ..., X_(n). The kth nearest neighbor of x is X_(k).

For a given k, let

R_x = || X_(k) − x || = D_(k)

denote the Euclidean distance between x and X_(k). R_x is just the kth order statistic of the distances D_i.

Side comment: when X is multivariate, the nearest neighbor ordering is not invariant to data scaling. Before applying nearest neighbor methods, it is therefore essential that the elements of X be scaled so that they are similar and comparable across elements.
11.2 k-nn Density Estimate

Suppose X ∈ R^q has multivariate density f(x) and we are estimating f(x) at x.

A multivariate uniform kernel is

w(||u||) = c_q^{−1} 1( ||u|| ≤ 1 )

where

c_q = π^{q/2} / Γ( (q + 2)/2 )

is the volume of the unit ball in R^q. If q = 1 then c_1 = 2.

Treating R_x as a bandwidth and using this uniform kernel,

f̃(x) = (1 / (n R_x^q)) Σ_{i=1}^n c_q^{−1} 1( ||x − X_i|| ≤ R_x )
      = (1 / (n R_x^q)) Σ_{i=1}^n c_q^{−1} 1( D_i ≤ R_x )

But as R_x = D_(k) is the kth order statistic of the D_i, there are precisely k observations with ||x − X_i|| ≤ R_x. Thus the above equals

f̃(x) = k / ( n R_x^q c_q )

To compute f̃(x), all you need to know is R_x.

The estimator is inversely proportional to R_x. Intuitively, if R_x is small this means that there are many observations near x, so f(x) must be large, while if R_x is large this means that there are not many observations near x, so f(x) must be small.

A motivation for this estimator is that the effective number of observations used to estimate f̃(x) is k, which is constant regardless of x. This is in contrast to the conventional kernel estimator, where the effective number of observations varies with x.
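For concreteness, here is a minimal Python sketch (not from the original notes) of the classic uniform-kernel k-nn density estimate f̃(x) = k/(n R_x^q c_q); the brute-force distance computation is an illustrative choice.

```python
import numpy as np
from math import pi, gamma

def knn_density(x, X, k):
    """Classic k-nn density estimate f(x) = k / (n * R_x^q * c_q),
    where R_x is the distance from x to its k-th nearest neighbor in X."""
    X = np.atleast_2d(X)
    n, q = X.shape
    c_q = pi ** (q / 2) / gamma((q + 2) / 2)       # volume of the unit ball in R^q
    D = np.sqrt(((X - np.asarray(x)) ** 2).sum(axis=1))
    R_x = np.sort(D)[k - 1]                        # k-th order statistic of the distances
    return k / (n * R_x ** q * c_q)

# usage: fhat = knn_density(x=[0.0, 0.0], X=np.random.randn(500, 2), k=25)
```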
While the traditional k-nn estimator uses a uniform kernel, smooth kernels can also be used. A smooth k-nn estimator is

f̃(x) = (1 / (n R_x^q)) Σ_{i=1}^n w( ||x − X_i|| / R_x )

where w is a kernel weight function such that ∫_{R^q} w(||u||) du = 1. In this case the estimator does not simplify to a function of R_x only.

The analysis of k-nn estimates is complicated by the fact that R_x is random. The solution is to calculate the bias and variance of f̃(x) conditional on R_x, which is similar to treating R_x as fixed. It turns out that the conditional bias and variance are identical to those of the standard kernel estimator:

E( f̃(x) | R_x ) ≃ f(x) + σ²(w) ∇²f(x) R_x² / 2
var( f̃(x) | R_x ) ≃ R(w) f(x) / ( n R_x^q )

We can then approximate the unconditional bias and variance by taking expectations:

E( f̃(x) ) ≃ f(x) + ( σ²(w) ∇²f(x) / 2 ) E( R_x² )
var( f̃(x) ) ≃ ( R(w) f(x) / n ) E( R_x^{−q} )

We see that to evaluate these expressions we need the moments of R_x = D_(k), the kth order statistic of the D_i. The distribution function of order statistics is well known. Asymptotic moments for the order statistics were found by Mack and Rosenblatt (Journal of Multivariate Analysis, 1979):

E( R_x^α ) ≃ ( k / ( n c_q f(x) ) )^{α/q}

This depends on the ratio k/n and the density f(x) at x. Thus

E( R_x² ) ≃ ( k / ( n c_q f(x) ) )^{2/q}
E( R_x^{−q} ) ≃ c_q f(x) n / k

Substituting,

Bias( f̃(x) ) ≃ ( σ²(w) ∇²f(x) / 2 ) ( k / ( n c_q f(x) ) )^{2/q} = ( σ²(w) ∇²f(x) / ( 2 ( c_q f(x) )^{2/q} ) ) ( k / n )^{2/q}
var( f̃(x) ) ≃ ( R(w) f(x) / n ) ( c_q f(x) n / k ) = R(w) c_q f(x)² / k

For k-nn estimation, the integer k is similar to the bandwidth h for kernel density estimation, except that we need k → ∞ and k/n → 0 as n → ∞.

The MSE is of order

MSE( f̃(x) ) = O( ( k/n )^{4/q} + 1/k )

This is minimized by setting

k ∼ n^{4/(4+q)}.

The optimal rate for the MSE is

MSE( f̃(x) ) = O( n^{−4/(4+q)} )

which is the same as for kernel density estimation with a second-order kernel.

Kernel estimates f̂ and k-nn estimates f̃ behave differently in the tails of f(x) (where f(x) is small). The contrast is

Bias( f̂(x) ) ∼ ∇²f(x)            var( f̂(x) ) ∼ f(x)
Bias( f̃(x) ) ∼ ∇²f(x) / f(x)^{2/q}    var( f̃(x) ) ∼ f(x)²

In the tails, where f(x) is small, f̃(x) will have larger bias but smaller variance than f̂(x). This is because the k-nn estimate uses more effective observations than the kernel estimator. It is difficult to rank one estimator against the other based on this comparison. Another way of viewing this is that in the tails f̃(x) will tend to be smoother than f̂(x).
11.3 Regression

Nearest neighbor methods are more typically used for regression than for density estimation. The regression model is

y_i = g(X_i) + e_i
E( e_i | X_i ) = 0

The classic k-nn estimate of g(x) is

g̃(x) = (1/k) Σ_{i=1}^n 1( ||x − X_i|| ≤ R_x ) y_i

This is the average value of y_i among the observations which are the k nearest neighbors of x.

A smooth k-nn estimator is

g̃(x) = Σ_{i=1}^n w( ||x − X_i|| / R_x ) y_i / Σ_{i=1}^n w( ||x − X_i|| / R_x ),

a weighted average of the k nearest neighbors.

The asymptotic analysis is the same as for density estimation. Conditional on R_x, the bias and variance are approximately as for NW regression. The conditional bias is proportional to R_x² and the variance to 1/(n R_x^q). Taking unconditional expectations and using the formula for the moments of R_x gives expressions for the bias and variance of g̃(x). The optimal rate is k ∼ n^{4/(4+q)} and the optimal convergence rate is the same as for NW estimation.

As for density estimation, in the tails of the density of X the bias of the k-nn estimator is larger, and the variance smaller, than for the NW estimator ĝ(x). Since the effective number of observations k is held constant across x, g̃(x) is smoother than ĝ(x) in the tails.
11.4 Local Linear k-nn Regression

As pointed out by Li and Racine, local linear estimation can be combined with the nearest neighbor method.

A simple estimator (corresponding to a uniform kernel) is to take the k observations nearest to x and fit a linear regression of y_i on X_i using these observations.

A smooth local linear k-nn estimator fits a weighted linear regression, using the kernel weights w( ||x − X_i|| / R_x ).
11.5 Cross-Validation

To use nearest neighbor methods, the integer k must be selected. This is similar to bandwidth selection, although here k is discrete, not continuous.

K.C. Li (Annals of Statistics, 1987) showed that for the k-nn regression estimator under conditional homoskedasticity, it is asymptotically optimal to pick k by Mallows, Generalized CV, or CV. Andrews (Journal of Econometrics, 1991) generalized this result to the case of heteroskedasticity, and showed that CV is asymptotically optimal. The CV criterion is

CV(k) = Σ_{i=1}^n ( y_i − g̃_{−i}(X_i) )²

where g̃_{−i}(X_i) is the leave-one-out k-nn estimator of g(X_i). The method is to select k by minimizing CV(k). As k is discrete, this amounts to computing CV(k) for a set of values of k and finding the minimizing value.
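For concreteness, here is a minimal Python sketch (not from the original notes) of uniform-kernel k-nn regression together with leave-one-out CV over a grid of k values; the brute-force distance computation is illustrative and inefficient for large n.

```python
import numpy as np

def knn_regression(x0, X, y, k):
    """Uniform-kernel k-nn estimate: average of y over the k nearest neighbors of x0."""
    D = np.sqrt(((X - x0) ** 2).sum(axis=1))
    nearest = np.argsort(D)[:k]
    return y[nearest].mean()

def cv_choose_k(X, y, k_grid):
    """Select k by leave-one-out cross-validation, CV(k) = sum_i (y_i - g_{-i}(X_i))^2."""
    n = len(y)
    best_k, best_cv = None, np.inf
    for k in k_grid:
        resid = [y[i] - knn_regression(X[i], np.delete(X, i, axis=0),
                                       np.delete(y, i), k) for i in range(n)]
        cv = np.sum(np.square(resid))
        if cv < best_cv:
            best_k, best_cv = k, cv
    return best_k
```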
12 Series Methods

12.1 General Approach

A model has parameters (β, μ) where β is finite-dimensional and μ is nonparametric. (Sometimes there is no β.) We will focus on regression.

The function μ is approximated by a series: a finite-dimensional model which depends on an integer K and a K-dimensional parameter θ. Let μ_K(θ) denote this approximating function.

Typically the parameters (β, θ) are estimated by a conventional parametric technique, giving (β̂, θ̂). Then μ̂ = μ_K(θ̂).

Tasks:
- Find a class of functions μ_K(θ) which are good approximations to μ.
- Study the bias (due to the finite-dimensional approximation) and variance of the estimators.
- Find optimal rates for K to diverge to infinity.
- Find rules for selection of K.
- Show that β̂ and μ̂ are asymptotically normal.
- Asymptotic variance computation and standard error calculation.

Data transformation: typically the methods are applied after transforming the regressors X to lie in a specific compact space, such as [0, 1].
12.2 Regression and Splines

Take the univariate regression

y_i = g(X_i) + e_i

In this case μ = g.

Series approximations:
- Power series (polynomials): work for low-order polynomials, but are unstable for high-order polynomials.
- Trigonometric series (sine and cosine functions): bounded functions, but can produce wiggly, implausible nonparametric function estimates.
- Splines: piecewise polynomials of order r with continuous derivatives up to order r − 1; cubic splines are popular; the join points (knots) can be selected evenly or estimated.
12.3 Splines

It is useful to define the positive part function

(a)_+ = max(0, a) = { 0 if a < 0 ; a if a ≥ 0 }

Linear, quadratic and cubic splines with knots at t_1 < t_2 < ··· < t_{J−1} are

g_K(x) = θ_0 + θ_1 x + Σ_{j=1}^{J−1} θ_{1+j} (x − t_j)_+
g_K(x) = θ_0 + θ_1 x + θ_2 x² + Σ_{j=1}^{J−1} θ_{2+j} (x − t_j)²_+
g_K(x) = θ_0 + θ_1 x + θ_2 x² + θ_3 x³ + Σ_{j=1}^{J−1} θ_{3+j} (x − t_j)³_+

This model is set up so that it is everywhere a polynomial of order r, with continuous derivatives up to order r − 1, and the r-th derivative changing discontinuously at the knots. Cubic splines are smooth approximating functions, flexible, and popular. The approximation improves as the number of knots increases. The dimension of θ is K = J + r.

For a given set of knots the function g_K is linear in the parameters. Define

z = z(x) = ( 1, x, x², x³, (x − t_1)³_+, ..., (x − t_{J−1})³_+ )',

then

g_K(x) = θ_K'z.
12.4 B-Splines

Another popular class of series approximations are the B-splines. These are basis functions which are bounded, integrable and density-shaped. They can be constructed from a variety of basic shapes; polynomials are common.

Let X ∈ [0, 1] and divide the support into J equal subintervals, with knots t_j = j/J, j = 0, 1, ..., J. We also need knots outside of [0, 1], so let t_j = j/J for all integers j.

An r-th order B-spline is a piecewise (r − 1)-order polynomial.

The linear (r = 2) B-spline basis functions are linear on two adjacent subintervals and zero elsewhere. They take the form

B_2( x | t_j, t_{j+1}, t_{j+2} ) = (x − t_j)_+ − 2 (x − t_{j+1})_+ + (x − t_{j+2})_+

A quadratic (r = 3) B-spline basis function is piecewise quadratic over three subintervals:

B_3( x | t_j, t_{j+1}, t_{j+2}, t_{j+3} ) = (x − t_j)²_+ − 3 (x − t_{j+1})²_+ + 3 (x − t_{j+2})²_+ − (x − t_{j+3})²_+

For general r,

B_r( x | t_j, ..., t_{j+r} ) = Σ_{s=0}^{r} (−1)^s ( r choose s ) (x − t_{j+s})^{r−1}_+

The B-spline approximation is a linear combination of these basis functions,

g_K(x) = Σ_{j=1−r}^{J−1} θ_j B_r( x | t_j, ..., t_{j+r} ) = θ_K'z

where z = z(x) is the vector of basis functions. The dimension of θ is K = J + r − 1.
12.5 Estimation

For all of the examples, the function g_K is linear in the parameters (at least if the knots are fixed). Define the vector Z_i = z(X_i) of the sample basis function transformations. For example, in the case of a cubic spline

Z_i = ( 1, X_i, X_i², X_i³, (X_i − t_1)³_+, ..., (X_i − t_{J−1})³_+ )'

From the Z_i, construct the regressor matrix Z. The LS estimate of θ_K is θ̂_K = (Z'Z)^{−1} Z'y. The estimate of g(x) is ĝ(x) = z(x)'θ̂_K, that of g(X_i) is ĝ(X_i) = Z_i'θ̂_K, and that of the vector g = ( g(X_1), ..., g(X_n) )' is

ĝ = Z θ̂_K = P y

where

P = Z (Z'Z)^{−1} Z'

is a projection matrix.
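As a concrete illustration of this estimation step, here is a minimal Python sketch (not from the original notes) that builds the cubic-spline regressor vector z(x) with evenly spaced knots and computes the least-squares coefficients and fitted function; the knot placement and function names are illustrative.

```python
import numpy as np

def cubic_spline_basis(x, knots):
    """z(x) = (1, x, x^2, x^3, (x - t_1)_+^3, ..., (x - t_{J-1})_+^3) for each x."""
    x = np.asarray(x)
    powers = np.column_stack([np.ones_like(x), x, x**2, x**3])
    plus = np.column_stack([np.maximum(x - t, 0.0) ** 3 for t in knots])
    return np.hstack([powers, plus])

def spline_regression(y, x, n_knots=5):
    """LS series estimate: theta_hat = (Z'Z)^{-1} Z'y, ghat(x) = z(x)'theta_hat."""
    knots = np.linspace(x.min(), x.max(), n_knots + 2)[1:-1]  # interior knots, evenly spaced
    Z = cubic_spline_basis(x, knots)
    theta_hat, *_ = np.linalg.lstsq(Z, y, rcond=None)
    ghat = lambda x0: cubic_spline_basis(np.atleast_1d(x0), knots) @ theta_hat
    return theta_hat, ghat
```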
12.6 Bias

Since y = g + e,

E( θ̂_K | X ) = (Z'Z)^{−1} Z' E( y | X ) = (Z'Z)^{−1} Z' g = θ*_K,

the coefficient from a regression of g on Z. This is the effective projection or pseudo-true value. Similarly,

E( ĝ | X ) = P g = g*_K

is the projection of g on Z.

The bias in estimation of g is

E( ĝ − g | X ) = g*_K − g.

If the series approximation works well, the bias will decrease as K increases. If g is α-times differentiable, then for splines and power series

sup_x | g*_K(x) − g(x) | ≤ O( K^{−α} ).

The integrated squared bias is

ISB_K = ∫ ( g*_K(x) − g(x) )² dF(x) ≤ O( K^{−2α} )

where F(x) is the marginal distribution of X.

This is approximately the same as the empirical average

(1/n) Σ_{i=1}^n ( g*_K(X_i) − g(X_i) )² = (1/n) ( g*_K − g )'( g*_K − g ) = (1/n) g'(I − P)'(I − P) g = (1/n) g'(I − P) g
12.7 Integrated Squared Error

The integrated squared error of ĝ(x) for g(x) is

ISE = ∫ ( ĝ(x) − g(x) )² dF(x) ≃ (1/n) Σ_{i=1}^n ( ĝ(X_i) − g(X_i) )² = (1/n) ( ĝ − g )'( ĝ − g )

Since

ĝ − g = P(g + e) − g = P e − (I − P) g,

then

ISE_K = (1/n) ( P e − (I − P) g )'( P e − (I − P) g )
      = (1/n) e'P P e + (1/n) g'(I − P)'(I − P) g − (2/n) e'P (I − P) g

and when P is a projection matrix (as for LS estimation) this simplifies to

ISE_K = (1/n) e'P e + ISB_K     (1)

The first part represents estimation variance, the second is the integrated squared bias.

If the error is conditionally homoskedastic, then the conditional expectation of the first part is

E( (1/n) e'P e | X ) = (1/n) tr( P E( e e' | X ) ) = (1/n) tr(P) σ² = (K/n) σ²

In general, it can be shown that

(1/n) e'P e = O_p( K/n )

Put together with the analysis of the ISB, we have

ISE_K ≤ O_p( K/n ) + O( K^{−2α} ).

The optimal rate for K is K ∼ n^{1/(2α+1)}, yielding an MSE convergence rate of n^{−2α/(2α+1)}. This is the same as the best rate attained by kernel regression using higher-order kernels or local polynomials.
12.8 Asymptotic Normality

The dimension of θ̂_K grows with n, so we do not discuss its asymptotic distribution.

At any x, the estimate of g(x) is ĝ(x) = z'θ̂_K, a linear function of the OLS estimator θ̂_K. Let V̂_K be the conventional (White) asymptotic covariance matrix estimator for θ̂_K, so that the variance estimate for z'θ̂_K is z'V̂_K z. Applying the CLT we can find

√n ( ĝ(x) − g*_K(x) ) / ( z'V̂_K z )^{1/2} →_d N(0, 1)

Since the estimator is nonparametric, it is biased, so the estimator should be centered at the projection or pseudo-true value g*_K(x) rather than the true g(x). Alternatively, if K is larger than optimal, so the estimator is undersmoothed, then the squared bias will be of smaller order than the variance and it can be omitted from the asymptotic expression.

The bottom line is that for series estimation we calculate standard errors using the conventional formula, as if the model were parametric. However, it is not constructive to focus on standard errors for individual coefficients, as they do not have individual meaning. Rather, standard errors should be reported for identifiable parameters, such as the conditional mean g(x).
12.9 Selection of Series Terms

The role of K is similar to that of the bandwidth in kernel regression. Automatic data-dependent procedures are necessary for implementation.

As we worked out before, the integrated squared error is

ISE_K = (1/n) e'P e + ISB_K

The optimal K minimizes this expression, but it is unknown.

We can estimate it using the sum of squared residuals from the model. For a given K, the regressors define a projection matrix P, fitted values ĝ = P y and residual vector ê_K = y − P y. Note that

ê_K = (I − P) y = (I − P) g + (I − P) e

Thus the SSE is

(1/n) ê_K'ê_K = (1/n) g'(I − P) g + (2/n) g'(I − P) e + (1/n) e'(I − P) e
             = ISB_K − (1/n) e'P e + (2/n) g'(I − P) e + (1/n) e'e
             = ISE_K − (2/n) e'P e + (2/n) g'(I − P) e + (1/n) e'e

Taking expectations conditional on X,

E( (1/n) ê_K'ê_K | X ) = E( ISE_K | X ) − E( (2/n) e'P e | X ) + σ²
                      = E( ISE_K | X ) − 2Kσ²/n + σ²

where the second line holds under conditional homoskedasticity.

Thus (1/n) ê_K'ê_K is biased for ISE_K, but the bias can be corrected. This leads to the Mallows (1973) criterion

C_K = ê_K'ê_K + 2 K σ̂²

where σ̂² is a preliminary estimate of σ². The scale doesn't matter, so I have multiplied through by n as is conventional, and the final σ² term doesn't matter, as it is independent of K.

The Mallows estimate K̂ is the value which minimizes C_K.

A method which does not require homoskedasticity is cross-validation. The CV criterion is

CV_K = Σ_{i=1}^n ( y_i − ĝ_{K,−i}(X_i) )²

where ĝ_{K,−i} is a K-th order series estimator omitting observation i. The CV estimate K̂ is the value which minimizes CV_K.

Li (1987, Annals of Statistics) showed under quite minimal conditions that Mallows, GCV, and CV are asymptotically optimal for selection of K, in the sense that

ISE_{K̂} / inf_K ISE_K →_p 1

Andrews (1991, JoE) showed that this optimality only extends to the heteroskedastic case if CV is used for selection. The reason is that the Mallows criterion uses homoskedasticity to calculate the bias adjustment, as we showed above, and this is not needed under CV.
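As an illustration of CV selection, here is a minimal Python sketch (not from the original notes) that computes CV_K for a polynomial series of increasing order using the standard leave-one-out shortcut for linear smoothers; the choice of a polynomial basis (rather than splines) is purely for brevity.

```python
import numpy as np

def cv_series(y, x, K_grid):
    """Leave-one-out CV for a polynomial series regression of y on x.
    Uses the identity e_{-i} = e_i / (1 - P_ii) for least-squares fits."""
    best_K, best_cv = None, np.inf
    for K in K_grid:
        Z = np.column_stack([x**j for j in range(K)])      # 1, x, ..., x^{K-1}
        P = Z @ np.linalg.solve(Z.T @ Z, Z.T)              # projection matrix
        e = y - P @ y                                      # residuals
        cv = np.sum((e / (1.0 - np.diag(P))) ** 2)         # CV_K
        if cv < best_cv:
            best_K, best_cv = K, cv
    return best_K
```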
12.10 Partially Linear and Additive Models

Suppose

y_i = W_i'γ + g(X_i) + e_i

with g nonparametric. A series approximation for g is z_i'θ_K, yielding the model for estimation

y_i = W_i'γ + z_i'θ_K + error_i

which is estimated by least squares. The estimate of γ is similar to that from the Robinson kernel estimator, which had a residual-regression interpretation.

The asymptotic distribution of γ̂ is the same as for the Robinson estimator, under the condition that the nonparametric component has MSE converging faster than n^{−1/2}, e.g. if K/n + K^{−2α} = o( n^{−1/2} ). This is similar to the requirement for the Robinson estimator.

You can easily generalize this idea to multiple additive nonparametric components,

y_i = W_i'γ + g_1(X_{1i}) + g_2(X_{2i}) + e_i

In practice, the components X_{1i} and X_{2i} are real-valued.

As discussed in Li–Racine, W_i can contain nonlinear interaction effects between X_{1i} and X_{2i}, such as X_{1i} X_{2i}. The main requirement is that the components of W_i cannot be additively separable in X_{1i} and X_{2i}. So in this sense the additive model can allow for simple interaction effects.
13 Endogeneity and Nonparametric IV

13.1 Nonparametric Endogeneity

A nonparametric IV equation is

Y_i = g(X_i) + e_i     (1)
E( e_i | Z_i ) = 0

In this model, some elements of X_i are potentially endogenous, and Z_i is exogenous.

We have studied this model in 710 when g is linear. The extension to nonlinear g is not obvious. The first and primary issue is identification. What does g mean?

Let

λ(z) = E( Y_i | Z_i = z )

and take the conditional expectation of (1):

λ(z) = E( Y_i | Z_i = z )
     = E( g(X_i) + e_i | Z_i = z )
     = E( g(X_i) | Z_i = z )
     = ∫ g(x) f(x | z) dx.

The functions λ(z) and f(x | z) are identified. The unknown nonparametric g(x) is a solution to the integral equation

λ(z) = ∫ g(x) f(x | z) dx.

The difficulty is that the solution g(x) is not necessarily unique. The mathematical problem is that the solution g is not necessarily continuous in the function f. The non-uniqueness of g is called the ill-posed inverse problem.

A solution is to restrict the space of allowable functions g. For example, if the model is linear, g(x) = x'β, then the above equation reduces to

λ(z) = β' ∫ x f(x | z) dx = β' E( X_i | Z_i = z ).

Identification of β in the linear model exploits this simple relationship.
13.2 Newey–Powell–Vella's Triangular Simultaneous Equations

Newey, Powell, Vella (1999, Econometrica).

The model is

Y_i = g(X_i) + e_i     (2)
E( e_i | Z_i ) = 0

plus a reduced-form equation for X_i,

X_i = π(Z_i) + u_i     (3)
E( u_i | Z_i ) = 0

Thus π(z) is the conditional mean of X_i given Z_i = z. The vectors X_i and Z_i may overlap, so X_i can contain both endogenous and exogenous variables.

NPV then take expectations of (2) given X_i and Z_i:

E( Y_i | X_i, Z_i ) = E( g(X_i) + e_i | X_i, Z_i ) = g(X_i) + E( e_i | X_i, Z_i )

Since X_i is endogenous, the latter conditional expectation is not zero. In general, this cannot be simplified further. But NPV observe the following. From (3), X_i is a function of Z_i and u_i, so conditioning on X_i and Z_i is equivalent to conditioning on u_i and Z_i. Hence

E( Y_i | X_i, Z_i ) = g(X_i) + E( e_i | u_i, Z_i )

Next, suppose that Z_i is strongly exogenous in the sense that

E( e_i | u_i, Z_i ) = E( e_i | u_i ) = g_2(u_i)

That is, conditional on u_i, Z_i provides no information about the mean of the error e_i. In this case we have the simplification

E( Y_i | X_i, Z_i ) = g(X_i) + g_2(u_i)

which implies

Y_i = g(X_i) + g_2(u_i) + ε_i
E( ε_i | u_i, X_i ) = 0

This is an additive regression model, with the regressor u_i unobserved but identified.

Is g identified? Since (3) is a (reduced-form) regression, π is identified. Thus u_i is identified. Then the functions g and g_2 are identified so long as X_i and u_i are distinct. NPV discuss several identification conditions. One is:

Theorem: If there is no functional relationship between x and u, then g(x) is identified up to an additive constant.

The additive constant qualification is required in all additive nonparametric models.

The authors propose the following series estimator (a code sketch follows the list):

1. Estimate π̂_L(z) = θ̂_L'z_{Li} using a series in Z_i with L terms, say
   (a) θ̂_L = (Z_L'Z_L)^{−1} Z_L'X, where Z_L are the basis functions of Z;
   (b) residuals û_i = X_i − π̂(Z_i).
2. Create a basis transformation for X_i, and a separate one for û_i, with K coefficients:
   (a) spline functions of X_i;
   (b) spline functions of û_i.
3. Least-squares regression of Y_i on these basis functions, to obtain ĝ and ĝ_2.
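Here is a minimal Python sketch (not from the original notes) of this two-step series estimator with scalar X and Z, using simple polynomial bases in place of splines; the basis choices, numbers of terms, and function names are illustrative assumptions.

```python
import numpy as np

def poly_basis(v, K):
    """Polynomial basis (1, v, ..., v^{K-1})."""
    return np.column_stack([v**j for j in range(K)])

def npv_estimator(Y, X, Z, L=4, K=4):
    """Two-step NPV-style estimator: (1) series regression of X on Z to get residuals u_hat,
    (2) series regression of Y on bases in X and u_hat (additive model)."""
    ZL = poly_basis(Z, L)
    theta_L, *_ = np.linalg.lstsq(ZL, X, rcond=None)
    u_hat = X - ZL @ theta_L                                   # first-stage residuals
    W = np.hstack([poly_basis(X, K), poly_basis(u_hat, K)[:, 1:]])  # drop duplicate constant
    coef, *_ = np.linalg.lstsq(W, Y, rcond=None)
    g_hat = lambda x0: poly_basis(np.atleast_1d(x0), K) @ coef[:K]  # structural g, up to an additive constant
    return g_hat, coef
```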
NPV show that this estimator is consistent and asymptotically normal. The conditions require that the functions g and π be sufficiently smooth (enough derivatives), and that the numbers of terms K and L diverge to infinity in a controlled way. The regularity conditions are not particularly helpful in practice.

It is not clear how K and L should be selected in practice. A reasonable suggestion is to select L by cross-validation on the reduced-form regression, and then select K by cross-validation on the second-stage regression. The trouble is that the two stages are not orthogonal, so the MSE of the second stage is affected by the first stage, and it is unlikely that the CV criterion will correctly reflect this.

We are primarily interested in the estimate ĝ of g (the structural-form equation). The estimates (ĝ, ĝ_2) from the second stage are asymptotically normal, but affected by the first stage (the generated regressors problem).

One solution is to write out the correct asymptotic covariance matrix for the two-step estimator, as discussed in NPV.

The other, easier approach is to view the problem as a one-step estimator. Stack the moment equations from each step. Then the two-step estimates are equivalent to one-step just-identified GMM on the stacked equations. The covariance matrix may be calculated for the estimates using the standard GMM formula.

The authors include an application to the wage/hours profile.
13.3 Newey and Powell's Estimator

Newey and Powell (Econometrica, 2003) propose a nonparametric method which avoids the strong exogeneity assumption, but imposes restrictions on the allowable g.

Return to the base model

Y_i = g(X_i) + e_i
E( e_i | Z_i ) = 0

and the integral equation

E( Y_i | Z_i ) = ∫ g(x) f( x | Z_i ) dx

To identify g, NP point out that one solution is to assume that g lives in a compact space. Their paper is based on this assumption, and they impose it on their estimates of g.

Next, suppose that g(x) can be approximated by a series approximation. Write this as

g(x) ≃ g_K(x) = γ_K'p_K(x)

where γ_K is a K-vector of parameters and p_K(x) is a K-vector of basis functions. Compactness of g can be imposed by assuming that γ_K is bounded. They use γ_K'W_K γ_K ≤ C, where W_K is a specific weight matrix and C is a pre-determined constant.

Substituting the series expansion into the integral equation,

E( Y_i | Z_i ) ≃ γ_K' ∫ p_K(x) f( x | Z_i ) dx = γ_K' E( p_K(X_i) | Z_i ) = γ_K'h_K(Z_i)

where

h_K(z) = E( p_K(X_i) | Z_i = z )

is the K-vector of conditional expectations of the basis function transformations of X_i.

We thus have the regression models

Y_i = γ_K'h_K(Z_i) + v_i
E( v_i | Z_i ) = 0

and

p_K(X_i) = h_K(Z_i) + η_i
E( η_i | Z_i ) = 0

NP suggest a two-step estimator (a code sketch follows the list):

1. Select the basis functions p_K(x).
2. Nonparametrically regress each element of p_K(X_i) on Z_i using series methods. The estimates are collected into the vector ĥ_K(z).
3. Regress Y_i on ĥ_K(Z_i) (least squares) to obtain γ̂_K.
4. The estimate of interest is ĝ(x) = γ̂_K'p_K(x).
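Here is a minimal Python sketch (not from the original notes) of the unconstrained version of this procedure with scalar X and Z and polynomial bases; the compactness (bound) constraint on γ_K is omitted for brevity and the basis choices and function names are illustrative.

```python
import numpy as np

def poly_basis(v, K):
    """Polynomial basis (1, v, ..., v^{K-1})."""
    return np.column_stack([v**j for j in range(K)])

def np_iv_estimator(Y, X, Z, K=4, L=6):
    """Newey-Powell style series IV: regress each basis term p_K(X) on a series in Z,
    then regress Y on the fitted values h_hat to obtain gamma_hat."""
    PK = poly_basis(X, K)                                  # p_K(X_i), n x K
    ZL = poly_basis(Z, L)                                  # instrument series, n x L
    H_hat = ZL @ np.linalg.lstsq(ZL, PK, rcond=None)[0]    # h_hat_K(Z_i), n x K
    gamma_hat, *_ = np.linalg.lstsq(H_hat, Y, rcond=None)
    g_hat = lambda x0: poly_basis(np.atleast_1d(x0), K) @ gamma_hat
    return g_hat, gamma_hat
```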
Identification requires that g (and thus ĝ) satisfy a compactness condition. NP recommend that this be imposed on ĝ by restricting the estimate γ̂_K to satisfy γ̂_K'W_K γ̂_K ≤ C. (This can be easily imposed by constrained LS regression.) As the constant C is arbitrary, it is unclear what this means in practice.

NP discuss conditions for consistency of ĝ.

The estimator γ̂_K is a two-step estimator in the generated regressor class. Thus the conventional standard errors for γ̂_K (and thus ĝ) are incorrect.

The method is extended and applied to Engel curve estimation by Blundell, Chen and Kristensen (Econometrica, 2007). They extend the analysis of identification, include a proof of asymptotic normality, show how to calculate standard errors, and address computational implementation issues.

Other important related papers include Hall and Horowitz (Annals of Statistics, 2005), and Darolles, Florens and Renault, "Nonparametric Instrumental Regression" (early version 2002, current version 2009, still unpublished).

This general topic is clearly very important to econometrics and underdeveloped.
13.4 Ai and Chen (Econometrica, 2003)

Take the conditional moment restriction model

E[ ρ(Z, α_0) | X ] = 0

(Notationally, we have switched Z and X relative to the previous sections; I do this to follow Li–Racine, who simply followed the notation in the original papers.) Here Z contains the endogenous variables and X is exogenous. The function ρ is a residual (or moment) equation, e.g. ρ(Z, α) = y − z_1'α in a linear framework. Ai and Chen are interested in the semiparametric framework in which α = (θ, g), where θ is parametric and g is nonparametric. Their focus is on efficient estimation of the parametric component θ. (In this sense their work takes a different focus from the papers reviewed earlier, which focused on the nonparametric component.)

An example is the partially linear regression model with an endogenous regressor, where the focus is on the parametric component.

For the moment, consider estimation of α assuming that it is parametric.

Define the conditional mean and variance of ρ(Z, α) for generic values of α:

m(x, α) = E[ ρ(Z, α) | X = x ]
σ²(x) = var[ ρ(Z, α) | X = x ]

Note that at the true value α_0,

m(x, α_0) = 0

for all x.

If the functions m and σ² were known and α were parametric, a reasonable estimator for α would be found by minimizing the squared error criterion

Σ_{i=1}^n m(X_i, α)² / σ²(X_i).

For simplicity, suppose σ²(x) = 1; then the criterion simplifies to

Σ_{i=1}^n m(X_i, α)² = m(α)'m(α)

where m(α) is the vector of stacked m(X_i, α).

As m is unknown this is infeasible. We can replace m with an estimate.

For any fixed α, estimate m by a series regression. That is, approximate

m(x, α) ≃ p_K(x)'γ_K(α)

where p_K(x) is a K-vector of basis functions in x. Then

m(α) = P γ_K(α)

where P is the matrix of the regressors p_K(X_i).

Let ρ(α) be the vector of stacked ρ(Z_i, α). We estimate γ_K(α) by LS of ρ(α) on P:

γ̂_K(α) = ( P'P )^{−1} P'ρ(α)

The estimate of m(α) is

m̂(α) = P γ̂_K(α) = P ( P'P )^{−1} P'ρ(α)
and the squared error criterion is estimated by

m̂(α)'m̂(α) = ρ(α)'P ( P'P )^{−1} P'P ( P'P )^{−1} P'ρ(α) = ρ(α)'P ( P'P )^{−1} P'ρ(α)

which is a GMM criterion. If α were parametric, it could be estimated by minimizing this criterion. Indeed, this is conventional GMM using the instrument set P under the assumption of homoskedasticity.

Now, as α = (θ, g) includes a nonparametric component, Ai and Chen suggest replacing g by a series approximation:

g(z) ≃ ψ_L(z)'β_L

where ψ_L(z) is an L-vector of basis functions in z.

The moment equation

ρ( z, θ, g(z) ) ≃ ρ( z, θ, ψ_L(z)'β_L )

is then a function of the parameters θ and β_L. For fixed (θ, β_L), define the n × 1 vector ρ(θ, β_L) of stacked elements ρ( Z_i, θ, ψ_L(Z_i)'β_L ). Replacing ρ(α) with ρ(θ, β_L), we have the revised GMM criterion

ρ(θ, β_L)'P ( P'P )^{−1} P'ρ(θ, β_L)
  = Σ_{i=1}^n ρ( Z_i, θ, ψ_L(Z_i)'β_L ) p_K(X_i)' ( Σ_{i=1}^n p_K(X_i) p_K(X_i)' )^{−1} Σ_{i=1}^n p_K(X_i) ρ( Z_i, θ, ψ_L(Z_i)'β_L )

The estimates (θ̂, β̂_L) then minimize this function.

This is not Ai and Chen's preferred estimator. For efficiency, they suggest estimating σ̂²(X_i), the conditional variance, and using the weighted criterion

Σ_{i=1}^n ( p_K(X_i)' ( P'P )^{−1} P'ρ(θ, β_L) )² / σ̂²(X_i)
  = ρ(θ, β_L)'P ( P'P )^{−1} P' Σ̂^{−1} P ( P'P )^{−1} P'ρ(θ, β_L)

where Σ̂ is the diagonal matrix of the σ̂²(X_i). This is GMM with the efficient weight matrix ( P'P )^{−1} P' Σ̂^{−1} P ( P'P )^{−1}.

Ai and Chen demonstrate that the estimate θ̂ is root-n asymptotically normal and asymptotically efficient (in the semiparametric sense).
14 Time Series

14.1 Stationarity

A (multivariate) time series y_t is an m × 1 vector observed over time t = 1, ..., n. We think of the sample as a window out of an infinite past and infinite future.

A series y_t is strictly stationary if the joint distribution of (y_t, ..., y_{t+h}) is constant across t for all h. An implication of stationarity is that any finite moment is time-invariant.

A linear measure of dependence is the autocovariance

Γ(k) = cov( y_t, y_{t−k} ) = E[ ( y_t − E y_t )( y_{t−k} − E y_{t−k} )' ]

Stationarity implies that Γ(k) is constant over time t.

A loose definition of ergodicity is that Γ(k) → 0 as k → ∞. A rigorous definition requires a measure-theoretic treatment, but intuitively it means that the present is asymptotically independent of the infinite past.

Simple time series models are built using fundamental shocks e_t, typically normalized so that E e_t = 0.

(1) The most basic shock e_t is iid.
(2) Martingale difference sequence (MDS): e_t is a MDS if E( e_t | e_{t−1}, e_{t−2}, ... ) = 0.
(3) White noise: E( e_t e_{t−h}' ) = 0 for all h ≥ 1.

An iid shock is a MDS, and a MDS is white noise, but the reverse is not true. An iid shock is unpredictable. A MDS is unpredictable in the mean. A white noise shock is linearly unpredictable.

A process is m-dependent if y_t and y_{t+k} are independent for k > m. In this case, Γ(k) = 0 for k > m.

A simple model of time dependence is a moving average. An MA(1) is

y_t = e_t + θ e_{t−1}

with e_t white noise. You can calculate that Γ(1) = θ σ²_e. An MA(q) is q-dependent.

Another simple model is an autoregression. An AR(1) is

y_t = α y_{t−1} + e_t

where e_t is iid. The series is stationary if |α| < 1. You can calculate that σ²_y = Γ(0) = σ²_e / (1 − α²) and Γ(k) = α^k σ²_y. An AR process is not m-dependent.

An example of a MDS which is not iid is an ARCH process,

e_t = σ_t z_t
σ²_t = ω + α e²_{t−1}

or a GARCH(1,1),

σ²_t = ω + β σ²_{t−1} + α e²_{t−1}

where z_t is an iid shock. This series is predictable in the square (in the variance) but not in the mean.

An example of a white noise process which is not an MDS is the nonlinear MA

y_t = e_t + e_{t−1} e_{t−2}

You can calculate that this is white noise. Yet E( y_t | e_{t−1}, e_{t−2}, ... ) = e_{t−1} e_{t−2} ≠ 0. This is a nonlinear process.
14.2 Time Series Averages
Let 0 = 1j
t
be estimated by
^
0 =
1
:
n

t=1
j
t
If j
t
is stationary,
1
^
0 = 1j
t
= 0
so the estimator is unbiased.
The variance of the standardized estimator is
ar(
_
:
_
^
0 0
_
) =
1
:
1
_
n

t=1
(j
t
0)
__
n

t=1
(j
t
0)
_
0
=
1
:
1
_
_
n

t=1
n

j=1
(j
t
0) (j
j
0)
0
_
_
=
1
:
n

t=1
n

j=1
(t ,)
This can be simplied to equal
n1

k=(n1)
_
: /
:
_
(,)
Notice that it is bounded by the inequality
ar(
_
:
_
^
0 0
_
) _
1

k=1
[(,)[
Hence, if

1
k=0
[(,)[ < , then ar(
_
:
_
^
0 0
_
) is bounded. By Markovs inequality, it follows
that
^
0
p
0. This is a rather simple proof of consistency for time-series averages. This proof uses
105
stronger conditions than necessary.
The Ergodic Theorem states that if j
t
is strictly stationary and ergodic, and 1 [j
t
[ < , then
^
0 0 a.s.
For a central limit theorem, we need a stronger summability condition on the covariances. As
we showed above,
ar(
_
:
_
^
0 0
_
) =
n1

k=(n1)
_
: /
:
_
(,)

k=1
(,)
=
as : . Thus the asymptotic variance of
^
0 is , not ar(j
t
). Under regularity conditions the
estimator is asymptotically normal
_
:
_
^
0 0
_

d
(0, )
The variance is sometimes called the long-run covariance matrix and is also a scale of the
spectral density of j
t
at frequency zero.
The fact that involves an innite sum of covariances means that standard error calculations for
time-series averages needs to take this into account. Estimation of is called HAC estimation in
econometrics (for heteroskedasticity and autocorrelation consistent covariance matrix estimation),
and is often called the Newey-West estimator due to an early inuential paper (Econometrica,
1987) by Whitney Newey and Ken West.
14.3 GMM
This carries over to GMM estimation. If 0 is the solution to
1:(j
t
, 0) = 0
where : is a known function, the GMM estimator minimizes a quadratic funtion in the sample
moment of :(j
t
, 0) to nd
^
0. The asymptotic distribution of
^
0 is determined by the sample average
of :(j
t
, 0
0
) at the true value 0
0
, so if these are autocorrelated, then the asymptotic distribution of
^
0 will involve the long-run covariance matrix.
Theorem 1 Under general regularity conditions the GMM estimator for stationary time-series
data satises
_
:
_
^

_
d
N
_
0,
_
G
0

1
G
_
1
_
where
G = E
0
0
0
:(j
t
, 0
0
)
106
and
=
1

k=1
E
_
m
t
m
0
tk
_
.
A important simplication occurs when :(j
t
, 0
0
) is serially uncorrelated. This occurs in dy-
namically well-specied models or correctly-specied MLE. In this case, :(j
t
, 0
0
) will be a MDS,
serially uncorrelated, and thus
= E
_
m
t
m
0
t
_
.
14.4 Linear Regression
Consider linear regression
j
t
= 0
0
r
t1
+ c
t
This includes autoregressions, as r
t1
can include j
t1
, j
t2
, ....
In this model, the LS estimator of 0 is GMM, using the moment condition
:(j
t
, r
t1
, 0) = r
t1
_
j
t
0
0
r
t1
_
Note that at the true 0
0
,
:(j
t
, r
t1
, 0
0
) = r
t1
c
t
The properties of :
t
depend on the properties of c
t
.
If the model is a true regression, then 0
0
r
t1
is the conditional mean, and the error is condi-
tionally mean zero and thus a MDS:
1 (c
t
[ 1
t1
) = 0
where 1
t1
contains all lagged information. In this case :
t
is a MDS as well:
1 (:
t
[ 1
t1
) = 1 (r
t1
c
t
[ 1
t1
) = r
t1
1 (c
t
[ 1
t1
) = 0
Thus the LS estimator is asymptotically normal, with a conventional covariance matrix.
On the other hand, if the model is an approximation, a linear projection, then 0
0
r
t1
is not
necessarily the conditional mean, so c
t
is not necessarily a MDS. Then :
t
will not necessarily be
serially uncorrelated, and will contain the autocovariances of :
t
14.5 Density Estimation
Suppose j
t
is univariate and strictly stationary with marginal distribution 1(j) with density
)(j). The kernel density estimator of )(j) is
^
)(j) =
1
:/
n

t=1
/
_
j j
t
/
_
107
where /(n) is a kernel function and / is a bandwidth.
As the function is linear, the expectation is not aected by time-series dependence. That is,
1
^
)(j) = 1
1
/
/
_
j j
t
/
_
and this is the same as in the cross-section case. Hence the bias of
^
)(j) is unchanged by dependence.
To calculate the variance, for simplicity assume that j
t
is :-depenent. The variance is
ar(
^
)(j) = 1
_
1
:
n

t=1
_
1
/
/
_
j j
t
/
_
1
1
/
/
_
j j
t
/
__
_
2
=
1
:
2
n

t=1
n

j=1
co
_
1
/
/
_
j j
t
/
_
,
1
/
/
_
j j
j
/
__
=
1
:
n1

k=(n1)
_
: /
:
_
1
/
2
co
_
/
_
j j
t
/
_
, /
_
j j
tk
/
__
=
1
:
m

k=m
_
: /
:
_
1
/
2
co
_
/
_
j j
t
/
_
, /
_
j j
tk
/
__

1
:/
2
1/
_
j j
t
/
_
2
+
2
:
m

k=1
1
/
2
1/
_
j j
t
/
_
/
_
j j
tk
/
_
The second-to-last step uses :-dependence. The rst part in the nal line is the same as in the
cross section case, and is asymptotically
1(/))(j)
:/
.
Now take the second part in the nal line, which is the sum of the : components. Let )(n
0
, n
k
)
be the joint density of (j
t
, j
tk
) for / 0. Then
1
/
2
1/
_
j j
t
/
_
/
_
j j
tk
/
_
=
_ _
1
/
2
/
_
j n
0
/
_
/
_
j n
k
/
_
) (n
0
, n
k
) dn
0
dn
k
=
_ _
/ (
0
) / (
k
) ) (j /
0
, j /
k
) d
0
d
k
where I made two change-of-variables, n
0
= j /
0
and n
k
= j /
k
, which has Jacobian /
2
.
Expanding the joint density and integrating, this equals
_
/ (
0
) d
0
_
/ (
k
) d
k
) (j, j) + o(1) = ) (j, j)
the joint density, evaluated at (j, j). We have found that
ar(
^
)(j) =
1(/))(j)
:/
+
2:
:
) (j, j) + o
_
1
:
_

1(/))(j)
:/
108
which is dominated by the same term as in the cross-section case. Time-series dependence has no
eect on the asymptotic bias and variance of the kernel estimator! This is in strong contrast to the
parametric case, where correlation aects the asymptotic variance.
The technical requirement is that the joint density of the observations (j
t
, j
tk
) is smooth at
(j, j). In the cross-section case we only required smoothness of the marginal density. In the time
series case we need smoothness of the joint densities, even though we are just estimating a marginal
density.
The intuition is that kernel (nonparametric) estimation is averaging the data locally in the
jdimension, where there is no time-series dependence, not in the time-dimension. That is,
^
)(j)
is a average of observations j
t
that are close to j. This subset of observations are not necessarily
close to each other in time. Thus they have low joint dependence, and the contribution of joint
dependence to the asymptotic variance is of small order relative to the nonparametric smoothing.
The theoretical implication of this result is that the theory of nonparametric kernel density
estimation carries over from the iid case to the time-series case without essential modication.
(However, proofs of the theorems requires attention to dependence and stronger smoothness con-
ditions.)
14.6 One-Step-Ahead Point Forecasting
Given information up to :, we want a forecast )
n+1
of j
n+1
. What does this mean? We want
)
n+1
to be close to the realized j
n+1
. Let 1(), j) denote the loss associated with a forecast ) of
realized j. The risk of the estimator is the expected loss 1() [ 1
n
) = 1 (1(), j) [ 1
n
) . A convenient
loss function is quadratic 1(), j, 1
n
) = () j)
2
in which case the best forecast is the conditional
mean ) = 1 (j
n+1
[ 1
n
) . In this sense it is common for a point forecast for j
n+1
to be an estimate
of the conditional mean.
Let r
t1
= (j
t1
, j
t2
, ..., j
tq
), and assume that the conditional mean of j
t
is (approximately)
a function only of r
t1
. Then
j
t
= q (r
t1
) + c
t
1 (c
t
[ 1
t1
) = 0
(One issue in nonparametrics is letting as : so to allow for innite dependence.)
Then the optimal point forecast (given quadratic loss) for j
n+1
is q(r
n
). A feasible point forecast
is ^ q(r
n
) where ^ q(r) is an estimator of q.
The conventional linear approach is to set ^ q(r) = r
0
^
0 where
^
0 is LS of j on A. This imposes a
linear model.
Any other nonparametric estimator can be used. In particular, a local linear estimator nests
the linear (AR) model as a special case, but allows for nonlinearity.
109
14.7 One-Step-Ahead Interval Forecasting
An interval forecast is
^
C = [)
1
, )
2
]. The goal is select
^
C so that 1(j
n+1

^
C) = .9 (or some
other pre-specied coverage probability) and so that
^
C is as short as possible. Often people dont
mention the goal of a short
^
C. But this is critical, otherwise we can design silly
^
C with the desired
coverage. For example, let
^
C = R with probability .9, and
^
C = ^ q(r
n
) with probability .1. This has
exact coverage .9 but is clearly not using sample information intelligently.
A desired forecast inteval sets )
1
and )
2
equal to the .05 and .95 quantiles of the conditional
distribution of j given r.
We discussed this problem earlier. Given an estimate ^ q (r) for the conditional, set ^ c
t
= j
t

^ q(r
t1
) and estimate the CDF
^
1(c [ r), conditional distribution of ^ c
t
given r
t1
. (As a special case,
you can use the unconditional DF
^
1(c).) Let ^

(r) be the c quantile of this conditional distribution.


Then setting
^
)
1
= ^ q(r
n
) + ^
:05
(r
n
) and
^
)
2
= ^ q(r
n
) + ^
:95
(r
n
) the forecast interval is
^
C = [
^
)
1
,
^
)
2
].
An alternative method to display forecast uncertainty is to use a density forecast. Let )(j [ r)
denote the conditional density of j
t
given r
t1
. An estimate of this conditional density is
^
)(j [ r) =
^
)
e
(j ^ q(r) [ r)
where
^
)
e
is a (kernel) estimate of the conditional density of ^ c
t
given r
t
. (As a special case you can
use the unconditional density estimate, which is approximating the error c
t
as being independent
of r
t1
). The forecast density is then
^
)(j [ r
n
) =
^
)
e
(j ^ q(r
n
) [ r
n
)
14.8 Multi-Step-Ahead Forecasting
Practical forecasts are typically for multiple periods out of sample. Let / _ 1 be the forecast
horizon (a positive integer). It is convenient to switch notation and write the problem as forecasting
j
t+h
given r
t
.
Given squared error loss, the optimal forecast is the conditional mean ) = 1 (j
n+h
[ 1
n
) .
There are two main approaches to multi-step forecasting.
One method, the direct approach, is to specify a model for the /-step conditional mean, e.g.
j
t+h
= q (r
t
) + c
t+h
1 (c
t+h
[ 1
t
) = 0
This is estimated by a (parametric or nonparametric) regression of j
t+h
on r
t
. The forecast is then
) = ^ q(r
n
).
This requires a dierent model for each forecast horizon /, and could even imply dierent
selected r
t
for dierent horizons.
The strengths of this approach is that it directly models the object of interest. It is believed to
110
be relatively robust to misspecication. One weakness is that the error c
t+h
cannot be uncorrelated
whe / 1. It is necessarily a MA(/1) process. This might invalidate conventional order selection
methods (this is an open question)
The other method, called the iterated or plug-in approach, is to estimate a one-step-ahead
forecast, and iterate. When q(r) = r
0
0 is linear and the goal is point forecasting, this is relatively
simple, as iterated mean forecasts are functions only of 0. But for forecast intervals or when q(r)
is nonlinear, 0 is insucient. The only method is to estimate the one-step-ahead distribution
^
1(j [ r) =
^
1
e
(j ^ q(r) [ r)
(or density), and iterate the entire one-step-ahead distribution. As this involves /-fold integration,
it is easiest accomplished through simulation.
If
^
1
e
is estimated by NW, it can be written as
^
1
e
(c [ r) =
n

t=1
j
t
(r)1 (^ c
t
_ c)
where
j
t
(r) =
1
_
H
1
(A
t
r)
_

n
j=1
1 (H
1
(A
j
r))
This is a discrete distribution, with probability mass j
t
(r) at ^ c
t
. Thus to draw from
^
1
e
(c [ r) is
simply to draw from the empirical distribution of the residuals ^ c
t
, with the weighted probabilities
j
t
(r). (The simpliest case treats the error as independent of r
t
, in which case j
t
(r) = :
1
for all
t.)
[If a smoothed distribution estimator is used, then a simulation draw from the kernel is added.]
To make a draw from
^
1(j [ r), draw c

from
^
1
e
(c [ r) as described above, and then set
j

= ^ q(r) + c

. Then j

has the conditional distribution


^
1(j [ r).
To make multi-step-ahead draws:
1. Given r
n
, draw c
n+1
from
^
1
e
(c [ r
n
) and set j
n+1
= ^ q(r
n
) + c
n+1
.
2. Dene r
n+1
= (j
n+1
, r
n
).
3. Given r
n+1
, draw c
n+2
from
^
1
e
(c [ r
n+1
) and set j
n+2
= ^ q(r
n+1
) + c
n+2
.
4. Iterate until you attain j
n+h
.
This creates one draw from the estimated /-step-ahead distribution, say j
b;n+h
. Repeat for
/ = 1, ..., 1, similar to bootstrapping.
The point forecast, the conditional mean, is the expected value of the conditional distribution,
so is estimated by the average of the simulations
)
h
=
1
1
B

b=1
j
b;n+h
.
111
A 90% forecast interval is constructed from the 5% and 95% quantiles of the j
b;n+h
. A forecast
density can be calculated by applying a kernel density estimator to j
b;n+h
.
This procedure actually creates a joint distribution on the multi-step-ahead conditional distri-
bution (j
n+1
, ..., j
n+h
). Perhaps this could be constructively used for some purpose. (e.g., what is
the probability that unemployment rate will remain above 7% for all of the next 12 months?)
The acknowledged good feature of the iterative method (in parametric models) is that it is more
accurate when the one-step-ahead distribution is correctly specied. The ackowledged downside is
that it is is believed to be non-robust to misspecication. Errors in the one-step-ahead distribution
can be magnide when iterated multiple times. It is unclear how these statements apply to the non-
parametric setting. The models are both correctly specied (as the bandwidth decreases the models
become more accurate) yet are explcitly misspecied (in nite samples, any tted nonparametric
model is incomplete and biased).
14.9 Model Selection
Information criterion are widely used to select linear forecasting models.
Suppose the 1th model has 1 parameters and residual variance ^ o
2
K
= :
1

^ c
2
t
where ^ c
t
=
j
t

^
0
0
r
t1
Popular methods in econometrics include AIC, BIC
1C
K
= :ln
_
^ o
2
k
_
+ 21
11C
K
= :ln
_
^ o
2
k
_
+ ln(:)1
Less popular in econometrics, but widely used in time-series more generally, is Predictive Least
Squares (PLS) introduced by Rissanen
11o
K
=
n

t=P
~ c
2
t
~ c
t
= j
t
r
0
t1
~
0
t1
~
0
t1
=
_
_
t1

j=1
r
j1
r
0
j1
_
_
1
t1

j=1
r
j1
j
j
This is a time-series genearlization of CV. ~ c
t
is a predictive residual. The sequential estimate
~
0
t1
uses observations up to t 1 for a one-step forecast. The PLS criterion is the sum of squared out-
of-sample prediction errors. The PLS criterion needs a start-up sub-sample 1 before evaluating
the rst residual. Unfortunately the criterion can be sensitive to 1, and there is no good guide for
its selection.
While PLS is not typically used in econometrics for explicit model selection, it is very com-
monly used for model comparison. Models are frequently compared by so-called out of sample
performance. In practice, this involves comparing the PLS criterion across models. When this is
112
done, it is frequently described as if this is an objective comparison of performance. In fact, it
is just a comparison of the PLS criterion. While a good criterion, it is not necessarily superior to
other criterion, and is not at all infallible as a model selection criterion.
It is widely asserted that these methods can be applied to the direct multi-step-ahead context.
I am skeptical, as the proofs are typically omitted, and the multi-step-ahead model has correlated
errors. This may be worth investigating.
113
15 Model Selection
15.1 KLIC
Suppose a random sample is y = j
1
, , ..., j
n
has unknown density f (y) =

)(j
i
).
A model density g(y) =

q(j
i
).
How can we assess the t of g as an approximation to f ?
One useful measure is the Kullback-Leibler information criterion (KLIC)
111C(f , g) =
_
f (y) log
_
f (y)
g(y)
_
dy
You can decompose the KLIC as
111C(f , g) =
_
f (y) log f (y)dy
_
f (y) log g(y)dy
= C
f
1 log g(y)
The constant C
f
=
_
f (y) log f (y)dy is independent of the model q.
Notice that 111C(), q) _ 0, and 111C(), q) = 0 i q = ). Thus a good approximating
model q is one with a low KLIC.
15.2 Estimation
Let the model density q(j, 0) depend on a parameter vector 0. The negative log-likelihood
function is
/(0) =
n

i=1
log q(j
i
, 0) = log g(y, 0)
and the MLE is
^
0 = argmin

/(0). Sometimes this is called a quasi-MLE when q(j, 0) is acknowl-


edged to be an approximation, rather than the truth.
Let the minimizer of 1 log q(j, 0) be written 0
0
and called the pseudo-true value. This value
also minimizes 111C(), q(0)). As the likelihood divided by : is an estimator of 1 log q(j, 0), the
MLE
^
0 converges in probability to 0
0
. That is,
^
0
p
0
0
= argmin

111C(), q(0))
Thus QMLE estimates the best-tting density, where best is measured in terms of the KLIC.
From conventional asymptotic theory, we know
_
:
_
^
0
QMLE
0
0
_

d
(0, \ )
114
\ = Q
1
Q
1
Q = 1
0
2
0000
0
log q(j, 0)
= 1
_
0
00
log q(j, 0)
0
00
log q(j, 0)
0
_
If the model is correctly specied (q (j, 0
0
) = )(j)), then Q = (the information matrix equality).
Otherwise Q ,= .
15.3 Expected KLIC
The MLE
^
0 =
^
0(y) is a function of the data vector y.
The tted model at any ~ y is ^ g(~ y) = g(~ y,
^
0(y)) .
The tted likelihood is /(
^
0) = log g(y,
^
0(y)) (the model evaluated at the observed data).
The KLIC of the tted model is is
111C(f , ^ g) = C
f

_
f (~ y) log g(~ y,
^
0(y))d~ y
= C
f
1
~ y
log g(~ y,
^
0(y))
where ~ y has density f , independent of y.
The expected KLIC is the expectation over the observed values y
1 (111C(f , ^ g)) = C
f
1
y
1
~ y
log g(~ y,
^
0(y))
= C
f
1
~ y
1
y
log g(y,
^
0(~ y))
the second equality by symmetry. In this expression, ~ y and y are independent vectors each with
density f . Letting
~
0 =
^
0(~ y), the estimator of 0 when the data is ~ y, we can write this more compactly
as
1.
y
(111C(f , ^ g)) = C
f
1 log g(y,
~
0)
where y and
~
0 are independent.
An alternative interpretation is in terms of predicted likelihood. The expected KLIC is the
expected likelihood when the sample ~ y is used to construct the estimate
~
0, and an independent
sample y used for evaluation. In linear regression, the quasi-likelihood is Gaussian, and the expected
KLIC is the expected squared prediction error.
15.4 Estimating KLIC
We want an estimate of the expected KLIC.
As C
f
is constant across models, it is ignored.
We want to estimate
T = 1 log g(y,
~
0)
115
Make a second-order Taylor expansion of log g
_
y,
~
0
_
about
^
0 :
log g(y,
~
0) log g(y,
^
0)
0
00
log g(y,
^
0)
0
_
~
0
^
0
_

1
2
_
~
0
^
0
_
0
_
0
2
0000
0
log g(y,
^
0)
_
_
~
0
^
0
_
The rst term on the RHS is /(
^
0), the second is linear in the FOC, so only the third term remains.
Writing
^
Q = :
1
0
2
0000
0
log g(y,
^
0),
~
0
^
0 =
_
~
0 0
0
_

_
^
0 0
0
_
and expanding the quadratic, we nd
log g(y,
~
0) /(
^
0) +
1
2
:
_
~
0 0
0
_
0
^
Q
_
~
0 0
0
_
+
1
2
:
_
^
0 0
0
_
0
^
Q
_
^
0 0
0
_
+:
_
~
0 0
0
_
^
Q
_
^
0 0
0
_
.
Now
_
:
_
^
0 0
0
_

d
7
1
~ (0, \ )
_
:
_
~
0 0
0
_

d
7
2
~ (0, \ )
which are independent, and
^
Q
p
Q. Thus for large :,
log g(y,
~
0) /(
^
0) +
1
2
7
0
1
Q7
1
+
1
2
7
0
2
Q7
2
+7
0
1
Q7
2
.
Taking expectations
T = 1 log g(y,
~
0)
1/(
^
0) +1
_
1
2
7
0
1
Q7
1
+
1
2
7
0
2
Q7
2
+7
0
1
Q7
2
_
= 1/(
^
0) + tr (Q\ )
= 1/(
^
0) + tr
_
Q
1

_
An (asymptotically) unbiased estimate of T is then
^
T = /(
^
0) +
\
tr (Q
1
)
where
\
tr (Q
1
) is an estimate of tr
_
Q
1

_
.
116
15.5 AIC
When q(r, 0
0
) = )(r) (the model is correctly specied) then Q = (the information matrix
equality). Hence
tr
_
Q
1

_
= / = dim(0)
so
^
T = /(
^
0) +/
This is the the Akaike Information Criterion (AIC). It is typically written as 2
^
T, e.g.
1C = 2/(
^
0) + 2/
AIC is an estimate of the expected KLIC, based on the approximation that q includes the
correct model.
Picking a model with the smalled AIC is picking the model with the smallest estimated KLIC.
In this sense it is picking is the best-tting model.
15.6 TIC
Takeuchi (1976) proposed a robust AIC, and is known as the Takeuchi Information Criterion
(TIC)
T1C = 2/(
^
0) + 2 tr
_
^
Q
1
^

_
where
^
Q =
1
:
n

i=1
0
2
0000
0
log q(j
i
,
^
0)
^
=
1
:
n

i=1
_
0
00
log q(j
i
,
^
0)
0
00
log q(j
i
,
^
0)
0
_
The does not require that q is correctly specied.
15.7 Comments on AIC and TIC
The AIC and TIC are designed for the likelihood (or quasi-likelihood) context. For proper
application, the model needs to be a conditional density, not just a conditional mean or set of
moment conditions. This is a strength and limitation.
The benet of AIC/TIC is that it selects tted models whose densities are close to the true
density. This is a broad and useful feature.
The relation of the TIC to the AIC is very similar to the relationship between the conventional
and White covariance matrix estimators for the MLE/QMLE or LS. The TIC does not appear
to be widely appreciated nor used.
117
The AIC is known to be asymptotically optimal in linear regression (we discuss this below),
but in the general context I do not know of an optimality result. The desired optimality would
be that if a model is selected by minimizing AIC (or TIC) then the tted KLIC of this model is
asymptotically equivalent to the KLIC of the infeasible best-tting model.
15.8 AIC and TIC in Linear Regression
In linear regression or projection
j
i
= A
0
i
0 +c
i
1 (A
i
c
i
) = 0
AIC or TIC cannot be directly applied, as the density of c
i
is unspecied. However, the LS estimator
is the same as the Gaussian MLE, so it is natural to calculate the AIC or TIC for the Gaussian
quasi-MLE.
The Gaussian quasi-likelihood is
log q
i
(0) =
1
2
log
_
2o
2
_

1
2o
2
_
j
i
A
0
i
,
_
2
where 0 = (,, o
2
) and o
2
= 1c
2
i
. The MLE
^
0 = (
^
,, ^ o
2
) is LS. The pseudo-true value ,
0
is the
projection coecient , = 1 (A
i
A
0
i
)
1
1 (A
i
j
i
) . If , is / 1 then the number of parameters is
/ + 1.
The sample log-likelihood is
2/(
^
0) = :log
_
^ o
2
_
+:log (2) +:
The second/third parts can be ignored. The AIC is
1C = :log
_
^ o
2
_
+ 2 (/ + 1) .
Often this is written
1C = :log
_
^ o
2
_
+ 2/
as adding/subtracting constants do not matter for model selection, or sometimes
1C = log
_
^ o
2
_
+ 2
/
:
as scaling doesnt matter.
118
Also
0
0,
log q(j
i
, 0) =
1
o
2
A
i
_
j
i
A
0
i
,
_
0
0o
2
log q(j
i
, 0) =
1
2o
2
+
1
2o
4
_
j
i
A
0
i
,
_
2
,
and

0
2
0,0,
0
log q(j
i
, 0) =
1
o
2
A
i
A
0
i

0
2
0,0o
2
log q(j
i
, 0) =
1
o
4
A
i
_
j
i
A
0
i
,
_

0
2
0 (o
2
)
2
log q(j
i
, 0) =
1
2o
4
+
1
o
6
_
j
i
A
0
i
,
_
2
Evaluated at the pseudo-true values,
0
0,
log q(j
i
, 0
0
) =
1
o
2
A
i
c
i
0
0o
2
log q(j
i
, 0
0
) =
1
2o
4
_
c
2
i
o
2
_
,
and

0
2
0,0,
0
log q(j
i
, 0
0
) =
1
o
2
A
i
A
0
i

0
2
0,0o
2
log q(j
i
, 0
0
) =
1
o
4
A
i
c

0
2
0 (o
2
)
2
log q(j
i
, 0
0
) =
1
2o
6
_
2
_
j
i
A
0
i
,
_
2
o
2
_
Thus
Q = 1
_

_
0
2
0,0,
0
log q(j
i
, 0
0
)
0
2
0,0o
2
log q(j
i
, 0
0
)
0
2
0o
2
0,
0
log q(j
i
, 0
0
)
0
2
0 (o
2
)
2
log q(j
i
, 0
0
)
_

_
= o
2
_
_
1 (A
i
A
0
i
) 0
0
1
2o
2
_
_
119
and
= 1
_

_
0
0,
log q(j
i
, 0
0
)
0
0,
log q(j
i
, 0
0
)
0
0
0,
log q(j
i
, 0
0
)
0
0o
2
log q(j
i
, 0
0
)
0
0o
2
log q(j
i
, 0
0
)
0
0,
log q(j
i
, 0
0
)
0
_
0
0o
2
log q(j
i
, 0
0
)
_
2
_

_
= o
2
_

_
1
_
A
i
A
0
i
c
2
i
o
2
_
1
2o
4
1
_
A
0
i
c
3
i
_
1
2o
4
1
_
A
i
c
3
i
_
i
4
4o
2
_

_
where
i
4
= ar
_
c
2
i
o
2
_
=
1
_
c
2
i
o
2
_
2
o
4
=
1
_
c
4
i
_
o
4
o
4
We see that = Q if
1
_
c
2
i
o
2
[ A
i
_
= 1
1
_
A
i
c
3
i
_
= 0
i
4
= 2
Essentially, this requies that c
i
~ (0, o
2
). Otherwise ,= Q.
Thus the AIC is appropriate in Gaussian regression. It is an approximation in non-Gaussian
regression, heteroskedastic regression, or projection.
To calculate the TIC, note that since Q is block diagonal you do not need to estimate the
o-diagonal component of . Note that
tr Q
1
= tr
_
1
_
A
i
A
0
i
_
1
1
_
A
i
A
0
i
c
2
i
o
2
__
+
_
1
2o
2
_
1
i
4
4o
2
= tr
_
1
_
A
i
A
0
i
_
1
1
_
A
i
A
0
i
c
2
i
o
2
__
+
i
4
2
Let
^ i
4
=
1
:
n

i=1
_
^ c
2
i
^ o
2
_
2
The TIC is then
T1C = :log
_
^ o
2
_
+ tr
_
^
Q
1
^

_
= :log
_
^ o
2
_
+ 2
_
_
tr
_
_
_
n

i=1
A
i
A
0
i
_
1
_
n

i=1
A
i
A
0
i
^ c
2
i
^ o
2
_
_
_
+ ^ i
4
_
_
= :log
_
^ o
2
_
+
2
^ o
2
n

i=1
/
i
^ c
2
i
+ ^ i
4
120
where /
i
= A
0
i
(X
0
X)
1
A
i
.
When the errors are close to homoskedastic and Gaussian, then /
i
and c
2
i
will be uncorrelated
^ i
4
will be close to 2, so the penalty will be close to
2
n

i=1
/
i
+ 2 = 2 (/ + 1)
as for AIC. In this case TIC will be close to AIC. In applications, the dierences will arise under
heteroskedasticity and non-Gaussianity.
The primary use of AIC and TIC is to compare models. As we change models, typically the
residuals ^ c
i
do not change too much, so my guess is that the estimate ^ i will not change much. In
this event, the TIC correction for estimation of o
2
will not matter much.
15.9 Asymptotic Equivalence
Let ~ o
2
be a preliminary (model-free) estimate of o
2
. The AIC is equivalent to
~ o
2
_
1C :log ~ o
2
+:
_
= :o
2
_
log
_
^ o
2
~ o
2
_
+ 1
_
+ 2~ o
2
/
:~ o
2
_
^ o
2
~ o
2
_
+ 2~ o
2
/
= ^ c
0
^ c + 2~ o
2
/
= C
k
The approximation is log(1 + a) a for a small. This is the Mallows criterion. Thus AIC is
approximately equal to Mallows, and the approximation is close when /,: is small.
Furtheremore, this expression approximately equalts
^ c
0
^ c
_
1 +
2
:/
_
= o
k
which is known as Shibatas condition (Annals of Statistics, 1980; Biometrick, 1981).
The TIC (ignoring the correction for estimation of o
2
) is equivalent to
~ o
2
_
T1C :log ~ o
2
+:
_
= :~ o
2
_
log
_
^ o
2
~ o
2
_
+ 1
_
+
2~ o
2
^ o
2
n

i=1
/
i
^ c
2
i
^ c
0
^ c + 2
n

i=1
/
i
^ c
2
i

i=1
^ c
2
i
(1 /
i
)
2
= C\,
121
the cross-validation criterion. Thus T1C C\.
They are both asymptotically equivalent to a Heteroskedastic-Robust Mallows Criterion
C

k
= ^ c
0
^ c + 2
n

i=1
/
i
^ c
2
i
which, strangely enough, I have not seen in the literature.
15.10 Mallows Criterion
Ker-Chau Li (1987, Annals of Statistics) provided a important treatment of the optimality of
model selection methods for homoskedastic linear regression. Andrews (1991, JoE) extended his
results to allow conditional heteroskedasticity.
Take the regression model
j
i
= q (A
i
) +c
i
= q
i
+c
i
1 (c
i
[ A
i
) = 0
1
_
c
2
i
[ A
i
_
= 0
Written as an : 1 vector
j = q +c.
Li assumed that the A
i
are non-random, but his analysis can be re-interpreted by treating every-
thing as conditional on A
i
.
Li considered estimators of the : 1 vector q which are linear in j and thus take the form
^ q(/) = '(/)j
where '(/) is : :, a function of the A matrix, indexed by / H, and H is a discrete set.
For example, a series estimator sets '(/) = A
h
(A
0
h
A
h
)
1
A
h
where A
h
is an : /
h
set of basis
functions of the regressors, and H = 1, ...,

/. The goal is to pick / to minimize the average squared


error
1(/) =
1
:
(q ^ q(/))
0
(q ^ q(/)) .
The index / is selected by minimizing the Mallows, Generalized CV, or CV criterion. We discuss
Mallows in detail, as it is the easiest to analyze. Andrews showed that only CV is optimal under
heteroskedasticity.
The Mallows criterion is
C(/) =
1
:
(j ^ q(/))
0
(j ^ q(/)) +
2o
2
:
tr '(/)
122
The rst term is the residual variance from model /, the second is the penalty. For series estimators,
tr '(/) = /
h
. The Mallows selected index
^
/ minimizes C(/).
Since j = c +q, then j ^ q(/) = c +q ^ q(/), so
C(/) =
1
:
(j ^ q(/))
0
(j ^ q(/)) +
2o
2
:
tr '(/)
=
1
:
c
0
c +1(/) + 2
1
:
c
0
(q ^ q(/)) +
2o
2
:
tr '(/)
And
^ q(/) = '(/)j = '(/)q +'(/)c
then
q ^ q(/) = (1 '(/)) q '(/)c
= /(/) '(/)c
where /(/) = (1 '(/)) q, and C(/) equals
1
:
c
0
c +1(/) + 2
1
:
c
0
/(/) +
2
:
_
o
2
tr '(/) c
0
'(/)c
_
As the rst term doesnt involve /, it follows that
^
/ minimizes
C

(/) = 1(/) + 2
1
:
c
0
/(/) +
2
:
_
o
2
tr '(/) c
0
'(/)c
_
over / H.
The idea is that empirical criterion C

(/) equals the desired criterion 1(/) plus a stochastically


small error.
We calculate that
1(/) =
1
:
(q ^ q(/))
0
(q ^ q(/))
=
1
:
/(/)
0
/(/)
2
:
/(/)
0
'(/)c +
1
:
c
0
'(/)
0
'(/)c
and
1(/) = 1 (1(/) [ X)
=
1
:
/(/)
0
/(/) +1
_
1
:
c
0
'(/)
0
'(/)c [ X
_
=
1
:
/(/)
0
/(/) +
o
2
:
tr '(/)
0
'(/)
The optimality result is:
123
Theorem 1. Let `
max
() denote the maximum eigenvalue of . If for some positive integer :,
lim
n!1
sup
h2H
`
max
('(/)) <
1
_
c
4m
i
[ A
i
_
_ i <

h2H
(:1(/))
m
0 (1)
then
1(
^
/)
inf
h2H
1(/)

p
1.
15.11 Whittles Inequalities
To prove Theorem 1, Li (1987) used two key inequalities from Whittle (1960, Theory of Prob-
ability and Its Applications).
Theorem. Suppose the observations are independent. Let b be any : 1 vector and A any
: : matrix, functions of X. If for some : _ 2
max
i
1 ([c
i
[
s
[ A
i
) _ i
s
<
then
1
_

b
0
e

s
[ X
_
_ 1
1s
_
b
0
b
_
s=2
(2)
and
1
_

e
0
Ae 1
_
e
0
Ae [ X
_

s
[ X
_
_ 1
2s
_
tr A
0
A
_
s=2
(3)
where
1
1s
=
2
s3=2
_


_
: + 1
2
_
i
s
1
2s
= 2
s
1
1s
1
1=2
1;2s
i
2s
15.12 Proof of Theorem 1
The main idea is similar to that of consistent estimation. Recall that if o
n
(0)
p
o(0) uniformly
in 0, then the minimizer of o
n
(0) converges to the minimizer of o(0). We can write the uniform
convergence as
sup

o
n
(0)
o(0)
1

p
0
In the present case, we will show (below) that
sup
h

(/) 1(/)
1(/)

p
0 (4)
124
Let /
0
denote the minimizer of 1(/). Then
0 _
1(
^
/) 1(/
0
)
1(
^
/)
=
C

(
^
/) 1(/
0
)
1(
^
/)

C

(
^
/) 1(
^
/)
1(
^
/)
=
C

(
^
/) 1(/
0
)
1(
^
/)
+o
p
(1)
_
C

(/
0
) 1(/
0
)
1(
^
/)
+o
p
(1)
_
C

(/
0
) 1(/
0
)
1(/
0
)
+o
p
(1)
= o
p
(1)
This uses (4) twice, and the facts 1(/
0
) _ 1(
^
/) and C

(
^
/) _ C

(/
0
). This shows that
1(/
0
)
1(
^
/)

p
1
which is equivalent to the Theorem.
The key is thus (4). We show below that
sup
h

1(/)
1(/)
1

p
0 (5)
which says that 1(/) and 1(/) are asymptotically equivalent, and thus (4) is equivalent to
sup
h

(/) 1(/)
1(/)

p
0. (6)
From our earlier equation for C

(/), we have
sup
h

(/) 1(/)
1(/)

_ 2 sup
h
[c
0
/(/)[
:1(/)
+ 2 sup
h

o
2
tr '(/) c
0
'(/)c

:1(/)
. (7)
Take the rst term on the right-hand-side. By Whittles rst inequality,
1
_

c
0
/(/)

2m
[ X
_
_ 1
_
/(/)
0
/(/)
_
m
Now recall
:1(/) = /(/)
0
/(/) +o
2
tr '(/)
0
'(/) (8)
Thus
:1(/) _ /(/)
0
/(/)
125
Hence
1
_

c
0
/(/)

2m
[ X
_
_ 1
_
/(/)
0
/(/)
_
m
_ 1 (:1(/))
m
Then, since H is discrete, by applying Markovs inequality and this bound,
1
_
sup
h
[c
0
/ (/)[
:1(/)
c [ X
_
_

h2H
1
_
[c
0
/(/)[
:1(/)
c [ X
_
_

h2H
c
2m
1
_
[c
0
/(/)[
2m
[ X
_
(:1(/))
2m
_

h2H
c
2m
1 (:1(/))
m
(:1(/))
2m
=
1
c
2m

h2H
(:1(/))
m
0
by assumption (1). This shows
sup
h
[c
0
/(/)[
:1(/)

p
0
Now take the second term in (7). By Whittles second inequality, since
1
_
c
0
'(/)c [ X
_
= o
2
tr '(/),
then
1
_

c
0
'(/)c o
2
tr '(/)

2m
[ X
_
_ 1
_
tr
_
'(/)
0
'(/)
__
m
_ o
2m
1 (:1(/))
m
the second inequality since (8) implies
tr '(/)
0
'(/) _ o
2
:1(/)
126
Applying Markovs inequality
1
_
sup
h

c
0
'(/)c o
2
tr '(/)

:1(/)
c [ X
_
_

h2H
1
_

c
0
'(/)c o
2
tr '(/)

:1(/)
c [ X
_
_

h2H
c
2m
1
_

c
0
'(/)c o
2
tr '(/)

2m
[ X
_
(:1(/))
2m
_

h2H
c
2m
o
2m
1 (:1(/))
m
(:1(/))
2m
= 1
_
c
2
o
2
_
m

h2H
(:1(/))
m
0
For completeness, let us show (5). The demonstration is essentially the same as the above.
We calculate
1(/) 1(/) =
2
:
/(/)
0
'(/)c +
1
:
c
0
'(/)
0
'(/)c
o
2
:
tr '(/)
0
'(/)
=
2
:
/(/)
0
'(/)c +
1
:
_
c
0
'(/)
0
'(/)c 1
_
c
0
'(/)
0
'(/)c [ X
__
Thus
sup
h

1(/) 1(/)
1(/)

_ 2 sup
h
[c
0
'(/)
0
/(/)[
:1(/)
+ 2 sup
h
[c
0
'(/)
0
'(/)c 1 (c
0
'(/)
0
'(/)c [ X)[
:1(/)
.
By Whittles rst inequality,
1
_

c
0
'(/)
0
/(/)

2m
[ X
_
_ 1
_
/(/)
0
'(/)'(/)
0
/(/)
_
m
Use the matrix inequality
tr (1) _ `
max
() tr (1)
and letting

' = lim
n!1
sup
h2H
`
max
('(/)) <
then
/(/)
0
'(/)'(/)
0
/(/) = tr
_
'(/)'(/)
0
/(/)/(/)
0
_
_

'
2
tr
_
/(/)/(/)
0
_
_

'
2
/(/)
0
/(/)
_

'
2
:1(/)
127
Thus
1
_

c
0
'(/)
0
/(/)

2m
[ X
_
_ 1
_
/(/)
0
'(/)'(/)
0
/(/)
_
m
_ 1

'
2
(:1(/))
m
Thus
1
_
sup
h
[c
0
'(/)
0
/ (/)[
:1(/)
c [ X
_
_

h2H
1
_
[c
0
'(/)
0
/(/)[
:1(/)
c [ X
_
_

h2H
c
2m
1
_
[c
0
'(/)
0
/(/)[
2m
[ X
_
(:1(/))
2m
_

h2H
c
2m
1

'
2
(:1(/))
m
(:1(/))
2m
=
1

'
2
c
2m

h2H
(:1(/))
m
0
Similarly,
1
_

c
0
'(/)
0
'(/)c 1
_
c
0
'(/)
0
'(/)c [ X
_

2m
[ X
_
_ 1
_
tr
_
'(/)
0
'(/)'(/)
0
'(/)
__
m
_ 1

'
2m
_
tr
_
'(/)
0
'(/)
__
m
_ o
2m
1

'
2m
(:1(/))
m
and thus
1
_
sup
h
[c
0
'(/)
0
'(/)c 1 (c
0
'(/)
0
'(/)c [ X)[
:1(/)
c [ X
_
_

h2H
1
_
[c
0
'(/)
0
'(/)c 1 (c
0
'(/)
0
'(/)c [ X)[
:1(/)
c [ X
_
_

h2H
c
2m
1
_
[c
0
'(/)
0
'(/)c 1 (c
0
'(/)
0
'(/)c [ X)[
2m
[ X
_
(:1(/))
2m
_

h2H
c
2m
o
2m
1

'
2m
(:1(/))
m
(:1(/))
2m
= 1
_

'
2
c
2
o
2
_
m

h2H
(:1(/))
m
0
We have shown
sup
h

1(/) 1(/)
1(/)

p
0
which is (5).
128
15.13 Mallows Model Selection
Lis Theorem 1 applies to a variety of linear estimators. Of particular interest is model selection
(e.g. series estimation).
Lets verify Lis conditions, which were
lim
n!1
sup
h2H
`
max
('(/)) <
1
_
c
4m
i
[ A
i
_
_ i <

h2H
(:1(/))
m
0 (9)
In linear estimation, '(/) is a projection matrix, so `
max
('(/)) = 1 and the rst equation is
automatically satised.
The key is equation (9).
Suppose that for sample size :, there are
n
models. Let

n
= inf
h2H
:1(/)
and assume

n

A crude bound is

h2H
(:1(/))
m
_
n

m
n
If
n

m
n
0 then (9) holds. Notice that by increasing :, we can allow for larger
n
(more
models) but a tighter moment bound.
The condition
n
0 says that for all nite models /, the there is non-zero approximation error,
so that 1(/) is non-zero. In contrast, if there is a nite dimensional model /
0
for which /(/
0
) = 0,
then :1(/
0
) = /
0
o
2
does not diverge. In this case, Mallows (and AIC) are asymptotically sub-
optimal.
We can improve this condition if we consider the case of selection among models of increasing
size. Suppose that model / has /
h
regressors, and /
1
< /
2
< and for some : _ 2,
1

h=1
/
m
h
<
This includes nested model selection, where /
h
= / and : = 2. Note that
:1(/) = /(/)
0
/(/) +/
h
o
2
_ /
h
o
2
129
Now pick 1
n
so that 1
n

m
n
0 (which is possible since
n
.) Then
1

h=1
(:1(/))
m
=
Bn

h=1
(:1(/))
m
+
1

h=Bn+1
(:1(/))
m
_ 1
n

m
n
+o
2
1

h=Bn+1
/
m
h
0
as required.
15.14 GMM Model Selection
This is an underdeveloped area. I list a few papers.
Andrews (1999, Econometrica). He considers selecting moment conditions to be used for GMM
estimation. Let j be the number of parameters, c represent a list of selected moment conditions,
[c[ denote the cardinality (number) of these moments, and J
n
(c) the GMM criterion computed
using these c moments. Andrews proposes criteria of the form
1C(c) = J
n
(c) r
n
([c[ j)
where [c[ j is the number of overidentifying restrictions and r
n
is a sequence. For an AIC-like
criterion, he sets r
n
= 2, for a BIC-like criterion, he sets r
n
= log :.
The model selection rule picks the moment conditions c which minimize J
n
(c).
Assuming that a subset of the moments are incorrect, Andrews shows that the BIC-like rule
asymptotically selects the correct subset.
Andrews and Lu (2001, JoE) extend the above analysis to the case of jointly picking the moments
and the parameter vector (that is, imposing zero restrictions on the parameters). They show that
the same criterion has similar properties that it can asymptotically select the correct moments
and correct zero restrictions.
Hong, Preston and Shum (ET, 2003) extend the analysis of the above papers to empirical
likelihood. They show that that this criterion has the same interpretation when J
n
(c) is replaced
by the empirical likelihood.
These papers are an interesting rst step, but they do not address the issue of GMM selection
when the true model is potentially innite dimensional and/or misspecied. That is, the analysis
is not analogous to that of Li (1987) for the regression model.
In order to properly understand GMM selection, I believe we need to understand the behavior
of GMM under misspecication.
Hall and Inoue (2003, Joe) is one of the few contributions on GMM under misspecicaiton.
They did not investigate model selection.
Suppose that the model is
:(0) = 1:
i
(0) = 0
130
where :
i
is / 1 and 0 is / 1. Assume / r (overidentication). The model is misspecied if
there is no 0 such that this moment condition holds. That is, for all 0,
:(0) ,= 0
Suppose we apply GMM. What happens?
The rst question is, what is the pseudo-true value? The GMM criterion is
J
n
(0) = : :
n
(0)
0
\
n
:
n
(0)
If \
n

p
\, then
:
1
J
n
(0)
p
:(0)
0
\:(0).
Thus the GMM estimator
^
0 is consistent for the pseudo-true value
0
0
(\) = argmin:(0)
0
\:(0).
Interstingly, the pseudo-true value 0
0
(\) is a function of \. This is a fundamental dierence
from the correctly specied case, where the weight matrix only aects eciency. In the misspecied
case, it aects what is being estimated.
This means that when we apply iterated GMM, the pseudo-true value changes with each step
of the iteration!
Hall and Inoue also derive the distribution of the GMM estimator. They nd that the distri-
bution depends not only on the randomness in the moment conditions, but on the randomness in
the weight matrix. Specically, they assume that :
1=2
(\
n
\)
d
or:a|, and nd that this
aects the asymptotic distributions.
Furthermore, the distribution of test statistics is non-standard (a mixture of chi-squares). So
inference on the pseudo-true values is troubling.
This subject deserves more study.
15.15 KLIC for Moment Condition Models Under Misspecication
Suppose that the true density is )(j), and we have an over-identied moment condition model,
e.g. for some function :(j), the model is
1:(j) = 0
However, we want to allow for misspecication, namely that
1:(j) ,= 0
To explore misspecation, we have to ask: What is a desirable pseudo-true model?
131
Temporarily ignoring parameter estimation, we can ask: Which density q(j) satisfying this
moment condition is closest to )(j) in the sense of minimizing KLIC? We can call this q
0
(j) the
pseudo-true density.
The solution is nicely explained in Appendix A of Chen, Hong, and Shum (JoE, 2007). Recall
111C(), q) =
_
)(j) log
_
)(j)
q(j)
_
dj
The problem is
min
g
111C(), q)
subject to
_
q(j)dj = 1
_
:(j)q(j)dj = 0
The Lagrangian is
_
)(j) log
_
)(j)
q(j)
_
dj +j
__
q(j)dj 1
_
+`
0
_
:(j)q(j)dj
The FOC with respect to q(j) at some j is
0 =
)(j)
q(j)
+j +`
0
:(j)
Multiplying by q(j) and integrating,
0 =
_
)(j)dj +j
_
q(j)dj +`
0
_
:(j)q(j)dj
= 1 +j
so j = 1. Solving for q(j) we nd
q(j) =
)(j)
1 +`
0
:(j)
,
a tilted version of the true density )(j). Inserting this solution we nd
111C(), q) =
_
)(j) log
_
1 +`
0
:(j)
_
dj
By duality, the optimal Lagrange multiplier `
0
maximizes this expression
`
0
= argmax

_
)(j) log
_
1 +`
0
:(j)
_
dj.
132
The pseudo-true density is
q
0
(j) =
)(j)
1 +`
0
0
:(j)
,
with associated minimized KLIC
111C (), q
0
) =
_
)(j) log
_
1 +`
0
0
:(j)
_
dj
= 1 log
_
1 +`
0
0
:(j)
_
This is the smallest possible KLIC(f,g) for moment condition models.
This solution looks like empirical likelihood. Indeed, EL minimizes the empirical KLIC, and
this connection is widely used to motivate EL.
When the moment :(j, 0) depends on a parameter 0,then the pseudo-true values (0
0
, `
0
) are
the joint solution to the problem
min

max

1 log
_
1 +`
0
:(j, 0)
_
Theorem (Chen, Hong and Shum, JoE, 2007). If [:(j, 0)[ is bounded, then the EL estimates
(
^
0,
^
`) are :
1=2
consistent for the pseudo-true values (0
0
, `
0
).
This gives a simple interpretation to the denition of KLIC under misspecication.
15.16 Schennachs Impossibility Result
Schennach (Annals of Statistics, 2007) claims a fundamental aw in the application of KLIC
to moment condition models. She shows that the assumption of bounded [:(j, 0)[ is not merely a
technical condition, it is binding.
[Notice: In the linear model, :(j, 0) = .(j r
0
0) is unbounded if the data has unbounded
support. Thus the assumption is highly relevant.]
The key problem is that for any ` ,= 0, if :(j, 0) is unbounded, so is 1+`
0
:(j, 0). In particular,
it can take on negative values. Thus log
_
1 +`
0
:(j, 0)
_
is ill-dened. Thus there is no pseudo-true
value of `. (It must be non-zero, but it cannot be non-zero!) Without a non-zero `, there is no way
to dene a pseudo-true 0
0
which satises the moment condition.
Technically, Schennach shows that when there is no 0 such that 1:(j, 0) = 0 and :(j, 0) is
unbounded, then there is no 0
0
such that
_
:
_
^
0 0
0
_
= O
p
(1).
Her paper leaves open the question: For what is
^
0 consistent? Is there a pseudo-true value?
One possibility is that the pseudo-true value 0
n
needs to be indexed by sample size. (This idea is
used in Hal Whites work.)
Never-the-less, Schennachs theorem suggests that empirical likelihood is non-robust to mis-
specication.
133
15.17 Exponential Tilting
Instead of
111C(), q) =
_
)(j) log
_
)(j)
q(j)
_
dj
consider the reverse distance
111C(q, )) =
_
q(j) log
_
q(j)
)(j)
_
dj.
The pseudo-true q which minimizes this criterion is
min
g
_
q(j) log
_
q(j)
)(j)
_
dj
subject to
_
q(j)dj = 1
_
:(j)q(j)dj = 0
The Lagrangian is
_
q(j) log
_
q(j)
)(j)
_
dj j
__
q(j)dj 1
_
`
0
_
:(j)q(j)dj
with FOC
0 = log
_
q(j)
)(j)
_
+ 1 j `
0
:(j).
Solving
q(j) = )(j) exp(1 +j) exp
_
`
0
:(j)
_
.
Imposing
_
q(j)dj = 1 we nd
q(j) =
)(j) exp
_
`
0
:(j)
_
_
)(j) exp
_
`
0
:(j)
_
dj
. (10)
Hence the name exponential tilting or ET
134
Inserting this into 111C(q, )) we nd
111C(q, )) =
_
q(j) log
_
exp
_
`
0
:(j)
_
_
)(j) exp
_
`
0
:(j)
_
dj
_
dj
= `
0
_
:(j)q(j)dj
_
q(j)dj log
__
)(j) exp
_
`
0
:(j)
_
dj
_
= log
__
)(j) exp
_
`
0
:(j)
_
dj
_
(11)
= log 1 exp
_
`
0
:(j)
_
(12)
By duality, the optimal Lagrange multiplier `
0
maximizes this expression, equivalently
`
0
= argmin

1 exp
_
`
0
:(j)
_
(13)
The pseudo-true density q
0
(j) is (10) with this `
0
, with associated minimized KLIC (11). This is
the smallest possible KLIC(g,f) for moment condition models.
Notice: the q
0
which minimize KLIC(g,f) and KLIC(f,g) are dierent.
In contrast to the EL case, the ET problem (13) does not restrict `, and there are no trouble
spots. Thus ET is more robust than EL. The pseudo-true `
0
and q
0
are well dened under
misspecication, unlike EL.
When the moment :(j, 0) depends on a parameter 0,then the pseudo-true values (0
0
, `
0
) are
the joint solution to the problem
max

min

1 exp
_
`
0
:(j, 0)
_
.
15.18 Exponential Tilting Estimation
The ET or exponential tilting estimator solves the problem
min
;p
1
;:::;pn
n

i=1
j
i
log j
i
subject to
n

i=1
j
i
= 1
n

i=1
j
i
:(j
i
, 0) = 0
135
First, we concentrate out the probabilities. For any 0, the Lagrangian is
n

i=1
j
i
log j
i
j
_
n

i=1
j
i
1
_
`
0
n

i=1
j
i
:(j
i
, 0)
with FOC
0 = log ^ j
i
1 j `
0
:(j
i
, 0).
Solving for ^ j
i
and imposing the summability,
^ j
i
(`) =
exp
_
`
0
:(j
i
, 0)
_

n
i=1
exp
_
`
0
:(j
i
, 0)
_
When ` = 0 then ^ j
i
= :
1
, same as EL. The concentrated entropy criterion is then
n

i=1
^ j
i
(`) log ^ j
i
(`) =
n

i=1
^ j
i
(`)
_
`
0
:(j
i
, 0) log
_
n

i=1
exp
_
`
0
:(j
i
, 0)
_
__
= log
_
n

i=1
exp
_
`
0
:(j
i
, 0)
_
_
By duality, the Lagrange multiplier maximizes this criterion, or equivalently
^
`(0) = argmin

i=1
exp
_
`
0
:(j
i
, 0)
_
The ET estimator
^
0 maximizes this concentrated function, e.g.
^
0 = argmax

i=1
exp
_
^
`(0)
0
:(j
i
, 0)
_
The ET probabilities are ^ j
i
= ^ j
i
(
^
`)
15.19 Schennachs Estimator
Schennach (2007) observed that while the ET probabilities have desirable properties, the EL
estimator for 0 has better bias properties. She suggested a hybrid estimator which achieves the
best of both worlds, called exponentially tilted empirical likelihood (ETEL).
This is
^
0 = argmax

111T(0)
136
1T11(0) =
n

i=1
log (^ j
i
(0))
=
^
`(0)
0
n

i=1
:(j
i
, 0) log
_
n

i=1
exp
_
^
`(0)
0
:(j
i
, 0)
_
_
^ j
i
(0) =
exp
_
^
`(0)
0
:(j
i
, 0)
_

n
i=1
exp
_
^
`(0)
0
:(j
i
, 0)
_
^
`(0) = argmin

i=1
exp
_
`
0
:(j
i
, 0)
_
She claims the following advantages for the ETEL estimator
^
0
Under correct specication,
^
0 is asymptotically second-order equivalent to EL
Under misspecication, the pseudo-true values `
0
, 0
0
are generically well dened, and mini-
mize a KLIC analog

_
:
_
^
0 0
0
_

d

_
0,
1

10
_
where = 1
0
00
0
:(j, 0) and = 1 :(j, 0) :(j, 0)
0
.
137
16 Model Averaging
16.1 Framework
Let q be a (non-parametric) object of interest, such as a conditional mean, variance, density,
or distribution function. Let ^ q
m
, : = 1, ..., ' be a discrete set of estimators. Most commonly,
this set is the same as we might consider for the problem of model selection. In linear regression,
typically ^ q
m
correspond to dierent sets of regressors. We will sometimes call the :th estimator
the :th model.
Let n
m
be a set of weights for the :th estimator. Let w = (n
1
, ..., n
M
) be the vector of
weights. Typically we will require
0 _ n
m
_ 1
M

m=1
n
m
= 1
The set of weights satisfying this condition is H
M
, the unit simplex in R
M
.
An averaging estimator is
^ q (w) =
M

m=1
n
m
^ q
m
It is commonly called a model average estimator.
Selection estimators are the special case where we impose the restriction n
m
0, 1.
16.2 Model Weights
The most common method for weight specication is Bayesian Model Averaging (BMA). As-
sume that there are ' potential models and one of the models is the true model. Specify prior
probabilities that each of the potential models is the true model. For each model specify a prior
over the parameters. Then the posterior distribution is the weighted average of the individual
models, where the weights are Bayesian posterior probabilities that the given model is the true
model, conditional on the data.
Given diuse priors and equal model prior probabilities, the BMA weights are approximately
n
m
=
exp
_

1
2
11C
m
_

M
j=1
exp
_

1
2
11C
j
_
where
11C
m
= 2/
m
+/
m
log(:)
/
m
is the negative log-likelihood, and /
m
is the number of parameters in model :. 11C
m
is the
Bayesian information criterion for model :. It is similar to AIC, but with the 2 replaced by
138
log(n).
The BMA estimator has the nice interpretation as a Bayesian estimator. The downside is that
it does not allow for misspecication. It is designed to search for the true model, not to select
an estimator with low loss.
To remedy this situation, Burnham and Anderson have suggested replacing BIC with AIC,
resulting in what has been called smoothed AIC (AIC) or weighted AIC (WAIC). The weights are
n
m
=
exp
_

1
2
1C
m
_

M
j=1
exp
_

1
2
1C
j
_
where
1C
m
= 2/
m
+ 2/
m
The suggestion goes back to Akaike, who suggested that these n
m
may be interpreted as model
probabilities. It is convenient and simple to implement. The idea can be applied quite broadly, in
any context where AIC is dened.
In simulation studies, the SAIC estimator performs very well. (In particular, better than
conventional AIC.) However, to date I have seen no formal justication for the procedure. It
is unclear in what sense SAIC is producing a good approximation.
16.3 Linear Regression
In the case of linear regression, let A
m
be regressor matrix for the :th estimator. Then
the list of all regressors. Then the :th estimator is
^
,
m
=
_
A
0
m
A
m
_
1
A
m
j
^ q
m
= A
m
^
,
m
= 1
m
j
where
1
m
= A
m
_
A
0
m
A
m
_
1
A
m
The averaging estimator is
^ q (w) =
M

m=1
n
m
^ q
m
=
M

m=1
n
m
1
m
j
= 1 (w) j
139
where
1 (w) =
M

m=1
n
m
1
m
Let A be the matrix of all regressors. We can also write
^ q (w) =
M

m=1
n
m
A
m
_
A
0
m
A
m
_
1
A
m
j
=
M

m=1
n
m
A
m
^
,
m
= A
M

m=1
n
m
_
^
,
m
0
_
= A
^
, (w)
where
^
, (w) =
M

m=1
n
m
_
^
,
m
0
_
is the average of the coecient estimates.
^
, (w) is the model average estimator for ,. In lin-
ear regression, there is a direct correspondence between the average estimator for the conditional
mean and the average estimator of the parameters, but this correspondence breaks down when the
estimator is not linear in the parameters.
16.4 Mallows Weight Selection
As pointed out above, in the linear regression setting, ^ q (w) = 1 (w) j is a linear estimator, so
falls in the class studied by Li (1987). His framework allows for estimators indexed by w H
M
Under homoskedasticity, an optimal method for selection of w is the Mallows criterion. As we
discussed before, for estimators ^ q (w) = 1 (w) j, the Mallows criterion is
C(w) = ^ c (w)
0
^ c (w) + 2o
2
tr 1 (w)
where
^ c (w) = j ^ q (w)
is the residual.
140
In averaging linear regression
tr 1 (w) = tr
M

m=1
n
m
1
m
=
M

m=1
n
m
tr 1
m
=
M

m=1
n
m
/
m
= w
0
K
where /
m
is the number of coecients in the :th model, and K = (/
1
, ..., /
M
)
0
. The penalty is
twice w
0
K, the (weighted) average number of coecients.
Also
^ c (w) = j ^ q (w)
=
M

m=1
n
m
(j ^ q
m
)
=
M

m=1
n
m
^ c
m
= ^ew
where ^ c
m
is the :1 residual vector from the :th model, and ^e = [^ c
1
, ..., ^ c
M
] is the :' matrix
of residuals from all ' models.
We can then write the criterion as
C(w) = w
0
^e
0
^ew + 2o
2
w
0
K
This is quadratic in the vector w.
The Mallows selected weight vector minimizes the criterion C(w) over w H
M
, the unit
simplex.
^ w =argmin
w2H
M
C(w)
This is a quadratic programming problem with inequality constraints, which is pre-programmed in
Gauss and Matlab, so computation of ^ w is a simple command.
The Mallows selected estimator is then
^ q = ^ q( ^ w)
=
M

m=1
^ n
m
^ q
m
141
This is an
16.5 Weight Selection Optimality
As we discussed in the section on model selection, Li (1987) provided a set of sucient con-
ditions for the Mallows selected estimator to be optimal, in the sense that the squared error is
asymptotically equivalent to the infeasible optimum. The key condition was

w
(:1(w))
s
0 (1)
In Hansen (Econometrica, 2007), I show that this condition is satised if we restrict the set of
weights to a discrete set.
Recall that H
M
is the unit simplex in R
M
.
Now restrict w H

M
H
M
, where the weights in H

M
are elements of
1

,
2

, ..., 1 for some


integer . In that paper, I show that Lis condition (1) over w H

M
holds under the similar
conditions as model selection, namely if the models are nested,

n
= inf
w2H
M
:1(w)
and
1
_
c
4(N+1)
i
[ A
i
_
_ i < .
Thus model averaging is asymptotically optimal, in the sense that
1( ^ w)
inf
w2H

M
1(w)

p
1
where, again
1(w) =
1
:
(^ g (w) g)
0
(^ g (w) g)
The proof is similar to that for model selection in linear regression. The restriction of w to a
discrete set is necessary to directly apply Lis theorem, as the summation requires discreteness.
The discreteness was relaxed in a paper by Wan, Zhang, and Zou (2008, Least Squares Model
Combining by Mallows Criterion, working paper). Rather than proving (1), they provided a more
basic derivation, although using stronger conditions. Recall that the proof requires showing uniform
convergence results of the form
sup
w2H
M
[c
0
/(w)[
:1(w)

p
0
142
where
/(w) =
M

m=1
n
m
/
m
/
m
= (1 1
m
) g
Here is their proof: First,
sup
w2H
M
[c
0
/(w)[
:1(w)
_
M

m=1
n
m
[c
0
/
m
[

n
_ max
1mM
[c
0
/
m
[

n
Second, by Markovs and Whittles inequalities
1
_
max
1mM
[c
0
/
m
[

n
c
_
_
M

m=1
1
_
[c
0
/
m
[

n
c
_
_
M

m=1
1 [c
0
/
m
[
2G
c
2G

2G
n
_ 1
M

m=1
[/
0
m
/
m
[
G
c
2G

2G
n
_ 1
M

m=1
_
:1(w
0
m
)
_
G
c
2G

2G
n
where w
0
m
is the weight vector with a 1 in the :th place and zeros elsewhere. Equivalently,
:1(w
0
m
) is the expected squared error from the :th model. The nal inequality uses the fact from
the analysis for model selection that
:1(w
0
m
) = /
0
m
/
m
+o
2
/
m
_ /
0
m
/
m
Wan, Zhang, and Zou then assume

M
m=1
_
:1(w
0
m
)
_
G

2G
n
0
This is stronger than the condition from my paper
n
, as it requires that

M
m=1
_
:1(w
0
m
)
_
G
diverges slower than
2G
n
. They also do not directly assume that the models are nested.
16.6 Cross-Validation Selection
Hansen and Racine (Jacknife Model Averaging, working paper).
In this paper, we substitute CV for the Mallows criterion. As a result, we do not require
homoskedasticity.
143
For the :
0
t/ model, let ~ c
m
i
denote the leave-one-out (LOO) residuals for the ith observation,
e.g.
~ c
m
i
= j
i
A
m0
i
_
X
m0
i
X
m
i
_
1
X
m0
i
y
i
and let ~ c
m
denote the : 1 vector of the ~ c
m
i
. Then the LOO averaging residuals are
~ c
i
(w) =
M

m=1
n
m
~ c
m
i
~ c (w) =
M

m=1
n
m
~ c
m
= ~ew
where ~e is an : ' matrix whose :th column is ~ c
m
. Then the sum-of-squared LOO residuals is
C\ (w) = ~ c (w)
0
~ c (w) = w
0
~ c
0
~ cw
which is quadratic in w .
The CV (or jacknife) selected weight vector ^ w minimizes the criterion C\ (w) over the unit
simplex. As for Mallows selection, this is solved by quadratic programming.
The JMA estimator is then ^ q( ^ w)
In Hansen-Racine, we show that the CV estimator is asymptotically equivalent to the infeasible
best weight vector, under the conditions
0 < min
i
1
_
c
2
i
[ A
i
_
_ min
i
1
_
c
2
i
[ A
i
_
<
1
_
c
4(N+1)
i
[ A
i
_
_ i <

n
= inf
w2H
M
:1(w)
max
1mM
max
1in
A
m0
i
_
X
m0
X
m
_
1
X
m0
A
m
i
0
16.7 Many Unsolved Issues
Model averaging for other estimators: e.g. densities or conditional densities
IV, GMM, EL, ET
Standard errors?
Inference
144
17 Shrinkage
17.1 Mallows Averaging and Shrinkage
Suppose there are two models or estimators of q = 1(j [ A)
(1) q
0
= 0
(2) ^ q
1
= A
^
,
Given weights (1 n) and n an averaging estimator is ^ q = nA
^
,.
The Mallows criterion is
C(n) = w
0
^e
0
^ew+ 2^ o
2
w
0
K
=
_
1 n n
_
_
j
0
j j
0
^ c
^ c
0
j ^ c
0
^ c
__
1 n
n
_
+ 2^ o
2
n/
= (1 n)
2
j
0
j +
_
n
2
+ 2n(1 n)
_
^ c
0
^ c + 2^ o
2
n/
= (1 n)
2
_
j
0
j ^ c
0
^ c
_
+ ^ c
0
^ c + 2^ o
2
n/
The FOC for minimization is
d
dn
C(n) = 2(1 n)
_
j
0
j ^ c
0
^ c
_
+ 2^ o
2
/ = 0
with solution
^ n = 1
/
1
where
1 =
j
0
j ^ c
0
^ c
^ o
2
is the Wald statistic for , = 0. Imposing the constraint ^ n [0, 1] we obtain
^ n =
_

_
1
/
1
1 _ /
0 1 < /
The Mallow averaging estimator thus equals
^
,

=
^
,
_
1
/
1
_
+
where (a)
+
= a if a _ 0, 0 else.
This is a Stein-type shrinkage estimator.
145
17.2 Loss and Risk
A great reference is Theory of Point Estimation, 2nd Edition, by Lehmann and Casella.
Let
^
0 be an estimator for 0, / 1. Suppose
^
0 is an (asymptotic) sucient statistic for 0 so that
any other estimator can be written as a function of
^
0. We call
^
0 the usualestimator.
Suppose that
_
:
_
^
0 0
_

d
(0, V). Thus, approximately,
^
0 ~
a
(0, V
n
)
where V
n
= :
1
V. Most of Stein-type theory is developed for the exact distribution case. It
carries over to the asymptotic setting as approximations. For now on we will assume that
^
0 has an
exact normal distribution, and that V
n
= V is known. (Equivalently, we can rewrite the statistical
problem as local to 0 using the Limits of Experiments theory.
Is
^
0 the best estimator for 0, in the sense of minimizing the risk (expected loss)?
The risk of
~
0 under weighted squared error loss is
1(0,
~
0, W) = 1
_
_
~
0 0
_
0
W
_
~
0 0
_
_
= tr
_
W1
__
~
0 0
__
~
0 0
__
0
_
A convenient choice for the weight matrix is W = V
1
. Then
1(0,
^
0, V
1
) = tr
_
V
1
1
__
~
0 0
__
~
0 0
__
0
_
= tr
_
V
1
V
_
= /.
If W,= V
1
then
1(0,
^
0, W) = tr
_
W1
__
~
0 0
__
~
0 0
__
0
_
= tr (WV)
which depends on WV.
Again, we want to know if the risk of another feasible estimator is smaller than tr (WV) .
Take the simple (or silly) estimator
~
0 = 0. This has risk
1(0, 0, W) = 0
0
W0.
Thus
~
0 = 0 has smaller risk than
^
0 when 0
0
W0 < tr (WV) , and larger risk when 0
0
W0 tr (WV) .
Neither
^
0 nor
~
0 = 0 is better in the sense of having (uniformly) smaller risk! It is not enough to
ask that one estimator has smaller risk than another, as in general the risk is a function depending
146
on unknowns.
As another example, take the simple averaging (or shrinkage) estimator
~
0 = n
^
0
where n is a xed constant. Since
~
0 0 = n
_
^
0 0
_
(1 n)0
we can calculate that
1(0,
~
0, W) = n
2
1(0,
~
0, W) + (1 n)
2
0
0
W0
= n
2
tr (WV) + (1 n)
2
0
0
W0
This is minimized by setting
n =
0
0
W0
tr (WV) +0
0
W0
which is strictly in (0,1). [This is illustrative, and does not suggest an empirical rule for selecting
n.]
17.3 Admissibile and Minimax Estimators
For reference.
To compare the risk functions of two estimators, we have the following concepts.
Denition 1
^
0 weakly dominates
~
0 if 1(0,
^
0) _ 1(0,
~
0) for all 0
Denition 2
^
0 dominates
~
0 if 1(0,
^
0) _ 1(0,
~
0) for all 0, and 1(0,
^
0) < 1(0,
~
0) for at least one 0.
Clearly, we should prefer an estimator if it dominates the other.
Denition 3 An estimator is admissible if it is not dominated by another estimator. An estimator
is inadmissible if it is dominated by another estimator.
Admissibility is a desirable property for an estimator.
If the risk functions of two estimators cross, then neither dominates the other. How do we
compare these two estimators?
One approach is to calculate the worst-case scenario. Specifically, we define the maximum risk of an estimator $\tilde{\theta}$ as
$$\bar{R}(\tilde{\theta}) = \sup_{\theta} R\left(\theta, \tilde{\theta}\right).$$
We can think of it this way: suppose we use $\tilde{\theta}$ to estimate $\theta$. What is the worst case? How badly can this estimator do?
For example, for the usual estimator, $R(\theta, \hat{\theta}, W) = \mathrm{tr}(WV)$ for all $\theta$, so
$$\bar{R}(\hat{\theta}) = \mathrm{tr}(WV)$$
while for the silly estimator $\tilde{\theta} = 0$,
$$\bar{R}(0) = \sup_{\theta} \theta' W \theta = \infty.$$
The latter is an example of an estimator with unbounded risk. To guard against extreme worst
cases, it seems sensible to avoid estimators with unbounded risk.
The minimum value of the maximum risk $\bar{R}(\tilde{\theta})$ across all estimators $\delta = \delta(\hat{\theta})$ is
$$\inf_{\delta} \bar{R}(\delta) = \inf_{\delta} \sup_{\theta} R(\theta, \delta).$$

Definition 4 An estimator $\tilde{\theta}$ of $\theta$ which minimizes the maximum risk,
$$\inf_{\delta} \sup_{\theta} R(\theta, \delta) = \sup_{\theta} R(\theta, \tilde{\theta}),$$
is called a minimax estimator.
It is desirable for an estimator to be minimax, again as a protection against the worst-case scenario.
There is no general rule for determining the minimax bound. However, in the case $\hat{\theta} \sim N(\theta, I_k)$, it is known that $\hat{\theta}$ is minimax for $\theta$.
17.4 Shrinkage Estimators
Suppose
$$\hat{\theta} \sim N(\theta, V).$$
A general form for a shrinkage estimator for $\theta$ is
$$\hat{\theta}^{*} = \left(1 - h(\hat{\theta}' W \hat{\theta})\right)\hat{\theta}$$
where $h : [0, \infty) \to [0, \infty)$. Sometimes this is written as
$$\hat{\theta}^{*} = \left(1 - \frac{c(\hat{\theta}' W \hat{\theta})}{\hat{\theta}' W \hat{\theta}}\right)\hat{\theta}$$
where $c(q) = q\, h(q)$.
This notation includes the James-Stein estimator, pretest estimators, selection estimators, and the model averaging estimator of section 17.1. Pretest and selection estimators take the form
$$h(q) = 1\left(q < a\right)$$
where $a = 2k$ for Mallows selection, and $a$ is the critical value from a chi-square distribution for a pretest estimator.
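To make the common structure concrete, here is a minimal sketch (my own, with arbitrary numbers) that plugs two choices of $h$ into the same shrinkage formula: $h(q) = c/q$ gives a James-Stein-type shrinker, while $h(q) = 1(q < a)$ gives a pretest/selection rule.

import numpy as np

def shrink(theta_hat, W, h):
    # General shrinkage estimator (1 - h(theta_hat' W theta_hat)) * theta_hat
    q = theta_hat @ W @ theta_hat
    return (1.0 - h(q)) * theta_hat

theta_hat = np.array([1.5, -0.8, 0.9, 0.4])    # hypothetical estimate
V = np.eye(4)
W = np.linalg.inv(V)
k = theta_hat.size
c = k - 2                                      # James-Stein constant
a = 2 * k                                      # threshold: 2k for Mallows selection

js      = shrink(theta_hat, W, lambda q: c / q)          # h(q) = c/q
pretest = shrink(theta_hat, W, lambda q: float(q < a))   # h(q) = 1(q < a)
print(js, pretest)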
We now calculate the risk of $\hat{\theta}^{*}$. Note
$$\hat{\theta}^{*} - \theta = \left(\hat{\theta} - \theta\right) - h(\hat{\theta}' W \hat{\theta})\,\hat{\theta}.$$
Thus
$$\left(\hat{\theta}^{*} - \theta\right)' W \left(\hat{\theta}^{*} - \theta\right) = \left(\hat{\theta} - \theta\right)' W \left(\hat{\theta} - \theta\right) + h(\hat{\theta}' W \hat{\theta})^2\, \hat{\theta}' W \hat{\theta} - 2 h(\hat{\theta}' W \hat{\theta})\, \hat{\theta}' W \left(\hat{\theta} - \theta\right).$$
Taking expectations:
$$R(\theta, \hat{\theta}^{*}, W) = \mathrm{tr}(WV) + E\left[h(\hat{\theta}' W \hat{\theta})^2\, \hat{\theta}' W \hat{\theta}\right] - 2 E\left[h(\hat{\theta}' W \hat{\theta})\, \hat{\theta}' W \left(\hat{\theta} - \theta\right)\right].$$
To simplify the second expectation when $h$ is continuous we use:

Lemma 1 (Stein's Lemma) If $\eta(\theta) : \mathbb{R}^k \to \mathbb{R}^k$ is absolutely continuous and $\hat{\theta} \sim N(\theta, V)$ then
$$E\left[\eta(\hat{\theta})'\left(\hat{\theta} - \theta\right)\right] = E\,\mathrm{tr}\left[\frac{\partial}{\partial \theta}\eta(\hat{\theta})'\, V\right].$$

Proof: Let
$$\phi(x) = \frac{1}{(2\pi)^{k/2}\det(V)^{1/2}} \exp\left(-\frac{1}{2}x' V^{-1} x\right)$$
denote the $N(0, V)$ density. Then
$$\frac{\partial}{\partial x}\phi(x) = -V^{-1} x\, \phi(x)$$
and
$$\frac{\partial}{\partial x}\phi(x - \theta) = -V^{-1}(x - \theta)\, \phi(x - \theta).$$
By multivariate integration by parts
$$E\left[\eta(\hat{\theta})'\left(\hat{\theta} - \theta\right)\right] = \int \eta(x)' V\, V^{-1}(x - \theta)\, \phi(x - \theta)\, dx$$
$$= -\int \eta(x)' V\, \frac{\partial}{\partial x}\phi(x - \theta)\, dx$$
$$= \int \mathrm{tr}\left[\frac{\partial}{\partial x}\eta(x)'\, V\right]\phi(x - \theta)\, dx$$
$$= E\,\mathrm{tr}\left[\frac{\partial}{\partial \theta}\eta(\hat{\theta})'\, V\right]$$
as stated.
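Below is a small Monte Carlo check of the lemma (my illustration, not from the notes), using the smooth function $\eta(x) = (x'x)\,x$, whose Jacobian is $(\partial/\partial x)\eta(x)' = (x'x)I + 2xx'$, so that $\mathrm{tr}[(\partial/\partial x)\eta(x)'V] = (x'x)\,\mathrm{tr}(V) + 2x'Vx$; the values of $\theta$ and $V$ are arbitrary.

import numpy as np

rng = np.random.default_rng(2)

# Hypothetical theta and V for the check.
theta = np.array([0.5, -1.0, 0.3])
V = np.array([[1.0, 0.3, 0.0],
              [0.3, 1.5, 0.2],
              [0.0, 0.2, 0.8]])

# Test function eta(x) = (x'x) x, with tr[(d/dx) eta(x)' V] = (x'x) tr(V) + 2 x'V x.
draws = rng.multivariate_normal(theta, V, size=500_000)
ss = np.einsum('ij,ij->i', draws, draws)                 # x'x for each draw

lhs = np.mean(ss * np.einsum('ij,ij->i', draws, draws - theta))                 # E[eta(th)'(th - theta)]
rhs = np.mean(ss * np.trace(V) + 2 * np.einsum('ij,jk,ik->i', draws, V, draws)) # E tr[(d/dth) eta' V]

print(lhs, rhs)   # should agree up to simulation error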
Let $\eta(\theta)' = h\left(\theta' W \theta\right)\theta' W$, for which
$$\frac{\partial}{\partial \theta}\eta(\theta)' = h(\theta' W \theta)\, W + 2 W \theta \theta' W\, h'(\theta' W \theta)$$
and
$$\mathrm{tr}\left[\frac{\partial}{\partial \theta}\eta(\theta)'\, V\right] = \mathrm{tr}(WV)\, h(\theta' W \theta) + 2\, \theta' W V W \theta\, h'(\theta' W \theta).$$
Then by Stein's Lemma
$$E\left[h(\hat{\theta}' W \hat{\theta})\, \hat{\theta}' W \left(\hat{\theta} - \theta\right)\right] = \mathrm{tr}(WV)\, E\, h(\hat{\theta}' W \hat{\theta}) + 2 E\left[\left(\hat{\theta}' W V W \hat{\theta}\right) h'(\hat{\theta}' W \hat{\theta})\right].$$
Applying this to the risk calculation, we obtain

Theorem.
$$R(\theta, \hat{\theta}^{*}, W) = \mathrm{tr}(WV) + E\left[h(\hat{\theta}' W \hat{\theta})^2\, \hat{\theta}' W \hat{\theta}\right] - 2\,\mathrm{tr}(WV)\, E\, h(\hat{\theta}' W \hat{\theta}) - 4 E\left[\left(\hat{\theta}' W V W \hat{\theta}\right) h'(\hat{\theta}' W \hat{\theta})\right]$$
$$= \mathrm{tr}(WV) + E\left[\frac{c(\hat{\theta}' W \hat{\theta})}{\hat{\theta}' W \hat{\theta}}\left(c(\hat{\theta}' W \hat{\theta}) - 2\,\mathrm{tr}(WV) + 4\, \frac{\hat{\theta}' W V W \hat{\theta}}{\hat{\theta}' W \hat{\theta}}\right) - 4\, \frac{\hat{\theta}' W V W \hat{\theta}}{\hat{\theta}' W \hat{\theta}}\, c'(\hat{\theta}' W \hat{\theta})\right]$$
where the final equality uses the alternative expression $h(q) = c(q)/q$.

We are trying to find cases where $R(\theta, \hat{\theta}^{*}, W) < R(\theta, \hat{\theta}, W)$. This requires the term in the expectation to be negative.
We now explore some special cases.
17.5 Default Weight Matrix
Set $W = V^{-1}$ and write
$$R(\theta, \hat{\theta}^{*}) = R(\theta, \hat{\theta}^{*}, V^{-1}).$$
Then
$$R(\theta, \hat{\theta}^{*}) = k + E\left[\frac{c(\hat{\theta}' V^{-1} \hat{\theta})}{\hat{\theta}' V^{-1} \hat{\theta}}\left(c(\hat{\theta}' V^{-1} \hat{\theta}) - 2k + 4\right) - 4\, c'(\hat{\theta}' V^{-1} \hat{\theta})\right].$$

Theorem 1 For any absolutely continuous and non-decreasing function $c(q)$ such that
$$0 < c(q) < 2(k-2) \tag{1}$$
then
$$R(\theta, \hat{\theta}^{*}) < R(\theta, \hat{\theta}),$$
the risk of $\hat{\theta}^{*}$ is strictly less than the risk of $\hat{\theta}$. This inequality holds for all values of the parameter $\theta$.
Note: Condition (1) can only hold if $k > 2$. (Since $k$, the dimension of $\theta$, is an integer, this means $k \geq 3$.)

Proof. Let
$$g(q) = \frac{c(q)\left(c(q) - 2(k-2)\right)}{q} - 4 c'(q).$$
For all $q > 0$, $g(q) < 0$ by the assumptions. Thus $E\, g(Q) < 0$ for any non-negative random variable $Q$. Setting
$$Q_k = \hat{\theta}' V^{-1} \hat{\theta},$$
$$R(\theta, \hat{\theta}^{*}) = k + E\, g(Q_k) < k = R(\theta, \hat{\theta})$$
which proves the result.
It is also useful to note that
$$Q_k = \hat{\theta}' V^{-1} \hat{\theta} \sim \chi^2_k(\lambda),$$
a non-central chi-square random variable with $k$ degrees of freedom and non-centrality parameter
$$\lambda = \theta' V^{-1} \theta.$$
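For instance, the mean of a $\chi^2_k(\lambda)$ variate is $k + \lambda$ (a standard fact), which can be checked by simulation; the sketch below uses made-up $\theta$ and $V$.

import numpy as np

rng = np.random.default_rng(3)

theta = np.array([1.0, 0.0, -0.5, 0.25])       # hypothetical values
V = np.diag([0.5, 1.0, 2.0, 1.5])
Vinv = np.linalg.inv(V)
k = theta.size
lam = theta @ Vinv @ theta                     # non-centrality parameter

draws = rng.multivariate_normal(theta, V, size=300_000)
Q = np.einsum('ij,jk,ik->i', draws, Vinv, draws)

print(Q.mean(), k + lam)                       # mean of chi^2_k(lambda) is k + lambda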
17.6 James-Stein Estimator
Set $c(q) = c$, a constant. This is the James-Stein estimator
$$\hat{\theta}^{*} = \left(1 - \frac{c}{\hat{\theta}' V^{-1} \hat{\theta}}\right)\hat{\theta}. \tag{2}$$

Theorem 2 If $\hat{\theta} \sim N(\theta, V)$, $k > 2$, and $0 < c < 2(k-2)$, then for (2),
$$R(\theta, \hat{\theta}^{*}) < R(\theta, \hat{\theta}),$$
the risk of the James-Stein estimator is strictly less than that of the usual estimator. This inequality holds for all values of the parameter $\theta$.

Since the risk is quadratic in $c$, we can also see that the risk is minimized by setting $c = k - 2$. This yields the classic form of the James-Stein estimator
$$\hat{\theta}^{*} = \left(1 - \frac{k-2}{\hat{\theta}' V^{-1} \hat{\theta}}\right)\hat{\theta}.$$
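A minimal Monte Carlo comparison (mine, with arbitrary $\theta$ and $V$) of the classic James-Stein estimator against the usual estimator under $W = V^{-1}$:

import numpy as np

rng = np.random.default_rng(4)

k = 5
theta = np.full(k, 0.5)                        # hypothetical true parameter
V = np.eye(k)
Vinv = np.linalg.inv(V)
c = k - 2                                      # classic James-Stein constant

draws = rng.multivariate_normal(theta, V, size=200_000)
Q = np.einsum('ij,jk,ik->i', draws, Vinv, draws)
js = (1.0 - c / Q)[:, None] * draws            # James-Stein estimator, draw by draw

def risk(est):                                 # weighted squared-error risk with W = V^{-1}
    d = est - theta
    return np.mean(np.einsum('ij,jk,ik->i', d, Vinv, d))

print(risk(draws), risk(js))                   # usual risk is k = 5; James-Stein risk is smaller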
17.7 Positive-Part James-Stein
If $\hat{\theta}' V^{-1} \hat{\theta} < c$ then
$$1 - \frac{c}{\hat{\theta}' V^{-1} \hat{\theta}} < 0$$
and the James-Stein estimator over-shrinks, and flips the sign of $\hat{\theta}^{*}$ relative to $\hat{\theta}$. This is corrected by using the positive-part version
$$\hat{\theta}^{+} = \left(1 - \frac{c}{\hat{\theta}' V^{-1} \hat{\theta}}\right)_{+}\hat{\theta} = \begin{cases} \left(1 - \dfrac{c}{\hat{\theta}' V^{-1} \hat{\theta}}\right)\hat{\theta} & \hat{\theta}' V^{-1} \hat{\theta} \geq c \\ 0 & \text{else} \end{cases}$$
This bears some resemblance to selection estimators.
The positive-part estimator takes the shrinkage form with
$$c(q) = \begin{cases} c & q \geq c \\ q & q < c \end{cases}$$
or
$$h(q) = \begin{cases} \dfrac{c}{q} & q \geq c \\ 1 & q < c \end{cases}$$
In general the positive-part version of
$$\hat{\theta}^{*} = \left(1 - h(\hat{\theta}' W \hat{\theta})\right)\hat{\theta}$$
is
$$\hat{\theta}^{+} = \left(1 - h(\hat{\theta}' W \hat{\theta})\right)_{+}\hat{\theta}.$$

Theorem. For any shrinkage estimator, $R(\theta, \hat{\theta}^{+}) \leq R(\theta, \hat{\theta}^{*})$.

The proof is a bit technical, so we will skip it.
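As a sketch (my own code, not from the notes), the positive-part adjustment is a one-line change; the example input has $\hat{\theta}' V^{-1} \hat{\theta} < c$, so the unadjusted James-Stein estimator would flip the sign of every component, while the positive-part version returns zero.

import numpy as np

def james_stein_plus(theta_hat, V, c=None):
    # Positive-part James-Stein: (1 - c/(theta_hat' V^{-1} theta_hat))_+ * theta_hat
    k = theta_hat.size
    if c is None:
        c = k - 2
    Q = theta_hat @ np.linalg.solve(V, theta_hat)
    return max(1.0 - c / Q, 0.0) * theta_hat

# A draw with theta_hat' V^{-1} theta_hat < c, where plain James-Stein would flip signs.
theta_hat = np.array([0.3, -0.2, 0.1, 0.05, 0.1])   # hypothetical estimate
V = np.eye(5)
print(james_stein_plus(theta_hat, V))               # returns the zero vector instead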
17.8 General Weight Matrix
Recall that for general $c(q)$ and weight $W$ we had
$$R(\theta, \hat{\theta}^{*}, W) = \mathrm{tr}(WV) + E\left[\frac{c(\hat{\theta}' W \hat{\theta})}{\hat{\theta}' W \hat{\theta}}\left(c(\hat{\theta}' W \hat{\theta}) - 2\,\mathrm{tr}(WV) + 4\,\frac{\hat{\theta}' W V W \hat{\theta}}{\hat{\theta}' W \hat{\theta}}\right) - 4\,\frac{\hat{\theta}' W V W \hat{\theta}}{\hat{\theta}' W \hat{\theta}}\, c'(\hat{\theta}' W \hat{\theta})\right]$$
Using a result about eigenvalues and setting $a = W^{1/2}\theta$,
$$\frac{\hat{\theta}' W V W \hat{\theta}}{\hat{\theta}' W \hat{\theta}} \leq \max_{\theta}\frac{\theta' W V W \theta}{\theta' W \theta} = \max_{a}\frac{a' W^{1/2} V W^{1/2} a}{a' a} = \lambda_{\max}(W^{1/2} V W^{1/2}) = \lambda_{\max}(WV).$$
Thus if $c'(q) \geq 0$,
$$R(\theta, \hat{\theta}^{*}, W) \leq \mathrm{tr}(WV) + E\left[\frac{c(\hat{\theta}' W \hat{\theta})}{\hat{\theta}' W \hat{\theta}}\left(c(\hat{\theta}' W \hat{\theta}) - 2\,\mathrm{tr}(WV) + 4\,\lambda_{\max}(WV)\right)\right] < \mathrm{tr}(WV),$$
the final inequality holding if
$$0 < c(q) < 2\left(\mathrm{tr}(WV) - 2\,\lambda_{\max}(WV)\right). \tag{3}$$
When $W = V^{-1}$, the upper bound is $2(k-2)$, so this is the same as for the default weight matrix.
Theorem 3 For any absolutely continuous and non-decreasing function $c(q)$ such that (3) holds, then
$$R(\theta, \hat{\theta}^{*}, W) < R(\theta, \hat{\theta}, W),$$
the risk of $\hat{\theta}^{*}$ is strictly less than the risk of $\hat{\theta}$.
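Condition (3) is easy to check for a given pair $(W, V)$; a small sketch with hypothetical matrices (the bound can be negative, in which case the theorem offers no valid choice of $c$):

import numpy as np

# Hypothetical V and a weight matrix W different from V^{-1}, chosen for illustration.
V = np.diag([1.0, 1.0, 0.5, 0.5, 0.25])
W = np.diag([1.0, 0.5, 1.0, 2.0, 1.0])

WV = W @ V
tr_WV = np.trace(WV)
lam_max = np.max(np.linalg.eigvals(WV).real)

upper = 2 * (tr_WV - 2 * lam_max)    # condition (3): 0 < c(q) < 2(tr(WV) - 2 lambda_max(WV))
print(tr_WV, lam_max, upper)         # shrinkage is guaranteed to help only if upper > 0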
17.9 Shrinkage Towards Restrictions
The classic James-Stein estimator shrinks towards the zero vector. More generally, shrinkage
can be towards restricted estimators, or towards linear or non-linear subspaces.
These estimators take the form
$$\hat{\theta}^{*} = \hat{\theta} - h\left(\left(\hat{\theta} - \tilde{\theta}\right)' W \left(\hat{\theta} - \tilde{\theta}\right)\right)\left(\hat{\theta} - \tilde{\theta}\right)$$
where $\hat{\theta}$ is the unrestricted estimator (e.g. the long regression) and $\tilde{\theta}$ is the restricted estimator (e.g. the short regression).
The classic form is
$$\hat{\theta}^{*} = \hat{\theta} - \left(\frac{r-2}{\left(\hat{\theta} - \tilde{\theta}\right)' \hat{V}^{-1}\left(\hat{\theta} - \tilde{\theta}\right)}\right)_{1}\left(\hat{\theta} - \tilde{\theta}\right)$$
where $(a)_{1} = \min(a, 1)$, $\hat{V}$ is the covariance matrix for $\hat{\theta}$, and $r$ is the number of restrictions (imposed by the restriction from $\hat{\theta}$ to $\tilde{\theta}$).
This estimator shrinks $\hat{\theta}$ towards $\tilde{\theta}$, with the degree of shrinkage depending on the magnitude of $\hat{\theta} - \tilde{\theta}$.
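A minimal sketch (hypothetical numbers and my own variable names) of the classic form, shrinking a long-regression estimate toward a short-regression estimate with the truncation $(a)_1 = \min(a, 1)$:

import numpy as np

def shrink_toward(theta_long, theta_short, V_hat, r):
    # Classic shrinkage of the unrestricted estimator toward the restricted one (a sketch)
    d = theta_long - theta_short
    D = d @ np.linalg.solve(V_hat, d)          # (th^ - th~)' V^{-1} (th^ - th~)
    weight = min((r - 2) / D, 1.0)             # (a)_1 = min(a, 1): never shrink past theta_short
    return theta_long - weight * d

# Hypothetical inputs: 4 coefficients, 3 of them set to zero by the short model.
theta_long  = np.array([1.2, 0.4, -0.2, 0.1])
theta_short = np.array([1.0, 0.0,  0.0, 0.0])
V_hat = 0.05 * np.eye(4)
print(shrink_toward(theta_long, theta_short, V_hat, r=3))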
This approach works for nested models, so that $\left(\hat{\theta} - \tilde{\theta}\right)' \hat{V}^{-1}\left(\hat{\theta} - \tilde{\theta}\right)$ is approximately (non-central) chi-square.
It is unclear how to extend the idea to non-nested models, where $\left(\hat{\theta} - \tilde{\theta}\right)' \hat{V}^{-1}\left(\hat{\theta} - \tilde{\theta}\right)$ is not chi-square.
17.10 Inference
We discussed shrinkage estimation.
Model averaging, selection, and shrinkage estimators have non-standard, non-normal distributions.
Standard errors, testing, and confidence intervals need development.