Computation and Resampling

Extremum Estimators: Algorithms and Bootstrap
Cristine Campos de Xavier Pinto
CEDEPLAR/UFMG
May 2010
Direct computation of an extremum estimator is in general not possible, so we need numerical methods to compute these estimators.
In this lecture, we review iterative algorithms that search for the maximum of a function of several arguments.
When we deal with computation, we face problems such as multiple local maxima, discontinuities, numerical instability, and large dimensions.
Grid Search
Consider the one-dimensional maximization problem
$$\max_{\theta \in [a,b]} Q(\theta)$$
The interval $[a, b]$ can be divided into a number of subintervals,
$$[a, \theta_1],\; [\theta_1, \theta_2],\; \ldots,\; [\theta_N, b]$$
We compute the function value at each boundary and infer that the maximum lies in one of the intervals whose boundary attains the highest function value:
$$\left\{ [\theta_i, \theta_{i+1}] \;:\; \max_{j} Q(\theta_j) = \max\left[ Q(\theta_i),\, Q(\theta_{i+1}) \right] \right\}$$
One then repeats the process in each of the chosen intervals, treating it as the original interval (iterations).
The process leads to smaller and smaller intervals that contain local maxima.
Sometimes this method does not find the global maximum: we can mistakenly drop the interval that contains the global maximum if the grid is not fine enough.
If we choose many short intervals at each iteration, we increase computation time.
An exhaustive search is infeasible.
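A minimal Python sketch of the iterative grid search (the refinement rule of keeping the subintervals adjacent to the best grid point, the tolerance, and the example function are illustrative assumptions, not part of the slides):

```python
import numpy as np

def grid_search(Q, a, b, n_intervals=10, tol=1e-8, max_iter=100):
    """Iteratively refine a grid on [a, b] to locate a (local) maximizer of Q."""
    for _ in range(max_iter):
        grid = np.linspace(a, b, n_intervals + 1)
        values = np.array([Q(theta) for theta in grid])
        j = int(np.argmax(values))              # boundary point with the highest value
        # keep the subintervals adjacent to the best boundary point
        a = grid[max(j - 1, 0)]
        b = grid[min(j + 1, n_intervals)]
        if b - a < tol:
            break
    return 0.5 * (a + b)

# Example: maximize Q(theta) = -(theta - 2)^2 on [0, 5]
theta_hat = grid_search(lambda t: -(t - 2.0) ** 2, 0.0, 5.0)
```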
Multidimensional settings: the grid search must cover every dimension, so calculations increase exponentially with the dimension of the parameter space.
If we use $n$ intervals per dimension and $\theta \in \mathbb{R}^k$, each iteration requires on the order of $n^k$ calculations.
Sometimes we have information about the function that can help in the search.
Polynomial Approximation
We can exploit the differentiability of the maximand and approximate $Q(\theta)$ with a polynomial.
The optimum of the polynomial approximation is an approximation to the optimum of $Q$.
Let's use a quadratic approximation:
$$Q(\theta) \approx a + b\,(\theta - \theta_0) + \tfrac{1}{2}\, c\,(\theta - \theta_0)^2$$
where $a$, $b$ and $c$ are chosen to fit $Q(\theta)$ well in a neighborhood of the starting value $\theta_0$.
Given values $a$, $b$ and $c$ with $c < 0$, the approximation to the location of the optimum of $Q$ is $\theta_0 - b/c$.
There are many ways to choose these parameters.
If $Q(\theta)$ is differentiable, a second-order Taylor series yields a quadratic approximation based on $Q$ and its first two derivatives,
$$Q(\theta) \approx Q(\theta_0) + \nabla_{\theta} Q(\theta_0)\,(\theta - \theta_0) + \tfrac{1}{2}\, \nabla^2_{\theta\theta} Q(\theta_0)\,(\theta - \theta_0)^2$$
Another way is to fit three points at which $Q(\theta)$ has been computed,
$$Q(\theta_0) = a + b\,\theta_0 + \tfrac{1}{2}\, c\,\theta_0^2$$
$$Q(\theta_1) = a + b\,\theta_1 + \tfrac{1}{2}\, c\,\theta_1^2$$
$$Q(\theta_2) = a + b\,\theta_2 + \tfrac{1}{2}\, c\,\theta_2^2$$
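A short sketch of the three-point fit in Python; the slides only state the three equations, so the function name, the concavity check, and the example points are illustrative choices:

```python
import numpy as np

def quadratic_step(theta_pts, Q):
    """Fit Q(t) ~ a + b*t + 0.5*c*t^2 through three points and return the vertex -b/c."""
    t = np.asarray(theta_pts, dtype=float)
    q = np.array([Q(ti) for ti in t])
    # solve the 3x3 linear system for (a, b, c)
    A = np.column_stack([np.ones(3), t, 0.5 * t ** 2])
    a, b, c = np.linalg.solve(A, q)
    if c >= 0:
        raise ValueError("fitted quadratic is not concave; choose other points")
    return -b / c   # maximizer of the fitted quadratic

# Example: one step on Q(t) = -(t - 1.5)^2 using the points 0, 1, 2
theta_next = quadratic_step([0.0, 1.0, 2.0], lambda t: -(t - 1.5) ** 2)
```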
Line Searches
Idea: overcome high-dimensional maximization by using a grid search in one dimension (a line search) through a parameter space with several dimensions.
Given a starting point $\theta_1$ and a search direction ("line") $\delta$, each iteration attempts to solve the one-dimensional problem
$$\alpha^{*} = \arg\max_{\alpha} Q(\theta_1 + \alpha \delta)$$
where $\alpha$ is the step length. The starting point of the next iteration is
$$\theta_2 = \theta_1 + \alpha^{*} \delta$$
There are many possible choices of $\delta$ and of the method for approximating $\alpha^{*}$.
By convention, we restrict $\alpha \geq 0$.
The directional derivative of $Q$ is
$$\frac{\partial Q(\theta_1 + \alpha \delta)}{\partial \alpha} = \nabla_{\theta} Q(\theta_1 + \alpha \delta)'\, \delta$$
and all line search methods require
$$\left. \frac{\partial Q(\theta_1 + \alpha \delta)}{\partial \alpha} \right|_{\alpha = 0} = \nabla_{\theta} Q(\theta_1)'\, \delta > 0$$
so that $Q$ is increasing with respect to the step length in a neighborhood of the starting value $\theta_1$.
A positive value of $\alpha$ that increases $Q$ will then always exist.
We will see two types of line search: steepest ascent and quadratic methods.
The Method of Steepest Ascent
In this case, $\delta = \nabla_{\theta} Q(\theta_1)$.
The elements of the gradient are the rates of change in the function for a small ceteris paribus change in each element of $\theta$.
This search direction guarantees that the function value will improve if the entire vector $\theta$ is moved (at least locally) in that direction:
$$\left. \frac{\partial Q\left(\theta_1 + \alpha \nabla_{\theta} Q(\theta_1)\right)}{\partial \alpha} \right|_{\alpha = 0} = \nabla_{\theta} Q(\theta_1)'\, \nabla_{\theta} Q(\theta_1) > 0$$
unless $\theta_1$ is a critical value.
The gradient has an optimality property: among all directions with the same length, setting $\delta = \nabla_{\theta} Q(\theta_1)$ gives the fastest rate of increase of $Q(\theta_1 + \alpha \delta)$ with respect to $\alpha$:
$$\nabla_{\theta} Q(\theta_1) = \arg\max_{\delta \,:\, \|\delta\| = \|\nabla_{\theta} Q(\theta_1)\|} \; \frac{\partial Q(\theta_1 + \alpha \delta)}{\partial \alpha}$$
This method implicitly approximates the maximand $Q(\theta)$ as a linear function in the neighborhood of $\theta_1$:
$$Q(\theta) \approx Q(\theta_1) + \nabla_{\theta} Q(\theta_1)'\,(\theta - \theta_1)$$
This method gives no guidance about the step length $\alpha$.
Maximization involves the curvature of a function. This method does not exploit curvature, which makes the algorithm slow for many practical problems.
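A hedged sketch of steepest ascent with a crude line search over a fixed grid of step lengths (the step-length grid, the stopping rule, and the example objective are assumptions for illustration):

```python
import numpy as np

def steepest_ascent(Q, grad, theta0, n_iter=200, alphas=np.geomspace(1e-4, 1.0, 25)):
    """Steepest ascent: move along the gradient, picking the step from a coarse alpha grid."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_iter):
        d = grad(theta)                       # search direction = gradient
        if np.linalg.norm(d) < 1e-8:          # (near-)critical point: stop
            break
        # crude line search over a fixed grid of step lengths
        alpha = max(alphas, key=lambda a: Q(theta + a * d))
        theta = theta + alpha * d
    return theta

# Example: Q(theta) = -0.5 * ||theta - 1||^2, gradient = -(theta - 1)
Q = lambda t: -0.5 * np.sum((t - 1.0) ** 2)
grad = lambda t: -(t - 1.0)
theta_hat = steepest_ascent(Q, grad, theta0=np.zeros(3))
```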
Example: OLS
Let's apply this algorithm to solve the following problem:
$$\max_{\beta} \; -\tfrac{1}{2}\,(Y - X\beta)'(Y - X\beta)$$
where $\theta = \beta$ and $Q(\beta) = -\tfrac{1}{2}\,(Y - X\beta)'(Y - X\beta)$.
On the $i$-th iteration, let the starting point be $\beta_i$, so $\delta_i = X'(y - X\beta_i)$, and each line search solves
$$\alpha_i = \arg\max_{\alpha} \; -\tfrac{1}{2}\left[y - X(\beta_i + \alpha \delta_i)\right]'\left[y - X(\beta_i + \alpha \delta_i)\right]$$
$$= \arg\max_{\alpha} \; \alpha\, \delta_i' X'(y - X\beta_i) - \tfrac{1}{2}\,\alpha^2\, \delta_i' X' X \delta_i$$
$$= \frac{\delta_i' X'(y - X\beta_i)}{\delta_i' X' X \delta_i} = \frac{\delta_i' \delta_i}{\delta_i' X' X \delta_i}$$
and the best step yields
$$\beta_{i+1} = \beta_i + \alpha_i \delta_i = \beta_i + \frac{(y - X\beta_i)' X\, X'(y - X\beta_i)}{(y - X\beta_i)' X\, X'X\, X'(y - X\beta_i)}\; X'(y - X\beta_i)$$
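The derivation above translates directly into code; a sketch with the exact step length $\alpha_i$ (the simulated data, starting value, and tolerance are illustrative):

```python
import numpy as np

def ols_steepest_ascent(y, X, n_iter=500, tol=1e-10):
    """Steepest ascent for Q(beta) = -0.5*(y - X beta)'(y - X beta) with the exact step length."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        d = X.T @ (y - X @ beta)              # gradient = search direction
        denom = d @ (X.T @ (X @ d))           # d' X'X d
        if denom < tol:
            break
        alpha = (d @ d) / denom               # optimal step length from the line search
        beta = beta + alpha * d
    return beta

# Quick check against the closed-form OLS solution
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=200)
beta_sa = ols_steepest_ascent(y, X)
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
```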
Quadratic Methods
Let's assume that $Q$ is exactly quadratic,
$$Q(\theta) = a + b'\theta + \tfrac{1}{2}\,\theta' C \theta$$
where
$$\nabla_{\theta} Q(\theta) = b + C\theta, \qquad \nabla^2_{\theta\theta'} Q(\theta) = C$$
The Hessian $C$ is negative definite if $Q$ is strictly concave. In that case, $Q$ attains its maximum at
$$\theta^{*} = -C^{-1} b = \theta - C^{-1}(b + C\theta) = \theta - \left[\nabla^2_{\theta\theta'} Q(\theta)\right]^{-1} \nabla_{\theta} Q(\theta)$$
for any $\theta$.
This expression suggests a modification of the search direction of steepest ascent.
For quadratic functions,
$$\delta = -\left[\nabla^2_{\theta\theta'} Q(\theta)\right]^{-1} \nabla_{\theta} Q(\theta)$$
A single line search would then yield the optimal value of $\theta$ at a step length equal to one, no matter the starting value.
Example: Let's redo the OLS example using the quadratic method:
$$\nabla_{\beta} Q(\beta) = X'(y - X\beta), \qquad \nabla^2_{\beta\beta'} Q(\beta) = -X'X$$
In this case, the best step (with step length one) yields
$$\beta_{i+1} = \beta_i + \left(X'X\right)^{-1} X'(y - X\beta_i) = \left(X'X\right)^{-1} X' y$$
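A small sketch showing that a single quadratic (Newton) step with unit step length lands on the OLS solution, whatever the starting value (the simulated data are illustrative):

```python
import numpy as np

def ols_newton_step(y, X, beta0):
    """One quadratic (Newton) step for the OLS objective: lands exactly on the OLS solution."""
    grad = X.T @ (y - X @ beta0)          # gradient at beta0
    hess = -(X.T @ X)                     # Hessian (constant)
    return beta0 - np.linalg.solve(hess, grad)

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = X @ np.array([0.5, -1.0]) + rng.normal(size=100)
beta1 = ols_newton_step(y, X, beta0=np.zeros(2))   # equals (X'X)^{-1} X'y
```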
Quadratic optimization methods approximate general functions with quadratic functions,
$$Q(\theta) \approx Q(\theta_1) + \nabla_{\theta} Q(\theta_1)'(\theta - \theta_1) + \tfrac{1}{2}\,(\theta - \theta_1)' \nabla^2_{\theta\theta'} Q(\theta_1)\,(\theta - \theta_1)$$
The maximum of the quadratic approximation serves as a further approximation to the maximum of the original function.
For the Taylor-series approximation, the search direction is
$$\delta = -\left[\nabla^2_{\theta\theta'} Q(\theta_1)\right]^{-1} \nabla_{\theta} Q(\theta_1)$$
We will explore some examples of quadratic methods.
Newton-Raphson
Newton-Raphson uses a first-order expansion of the score (equivalently, a quadratic expansion of the objective):
$$\sum_{i=1}^{N} s_i(\theta_{g+1}) = \sum_{i=1}^{N} s_i(\theta_g) + \left[\sum_{i=1}^{N} H_i(\theta_g)\right](\theta_{g+1} - \theta_g) + r_g$$
where $s_i(\theta)$ is the $P \times 1$ score with respect to $\theta$, $H_i(\theta)$ is the $P \times P$ Hessian, and $r_g$ is a $P \times 1$ vector of remainder terms.
Setting the left-hand side to zero and ignoring the remainder term,
$$\theta_{g+1} = \theta_g - \left[\sum_{i=1}^{N} H_i(\theta_g)\right]^{-1}\left[\sum_{i=1}^{N} s_i(\theta_g)\right]$$
Idea: as we get close to the solution, $\sum_{i=1}^{N} s_i(\theta_g)$ will get close to zero, and the search direction will get smaller.
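A minimal Newton-Raphson sketch; the logit example, its score and Hessian, the simulated data, and the stopping tolerance are illustrative assumptions rather than material from the slides:

```python
import numpy as np

def newton_raphson(score, hessian, theta0, max_iter=50, tol=1e-8):
    """Newton-Raphson updating: theta_{g+1} = theta_g - H(theta_g)^{-1} s(theta_g)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        s = score(theta)
        H = hessian(theta)
        step = np.linalg.solve(H, s)
        theta = theta - step
        if np.abs(step).max() < tol:          # stop when the largest parameter change is small
            break
    return theta

# Example: logit log-likelihood, score = X'(y - p), Hessian = -X' diag(p(1-p)) X
rng = np.random.default_rng(2)
X = np.column_stack([np.ones(500), rng.normal(size=500)])
y = (rng.uniform(size=500) < 1 / (1 + np.exp(-(0.5 + X[:, 1])))).astype(float)
p = lambda b: 1 / (1 + np.exp(-X @ b))
score = lambda b: X.T @ (y - p(b))
hessian = lambda b: -(X * (p(b) * (1 - p(b)))[:, None]).T @ X
theta_hat = newton_raphson(score, hessian, np.zeros(2))
```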
In general, we can use a stopping rule: the requirement that the largest absolute change $\left|\theta_{g+1} - \theta_g\right|$ be smaller than a constant.
Another stopping criterion used by these quadratic methods is
$$\left[\sum_{i=1}^{N} s_i(\theta_g)\right]' \left[\sum_{i=1}^{N} H_i(\theta_g)\right]^{-1} \left[\sum_{i=1}^{N} s_i(\theta_g)\right]$$
being smaller in absolute value than a small number, e.g. 0.0001.
This expression will be zero when a maximum has been reached.
We need to check that the Hessian is negative definite before claiming convergence.
We need many different starting values to make sure that, in the end, the maximum is a global one and not a local one.
Drawbacks:
Computation of the second derivative.
The sum of the Hessians may not be negative definite at a particular value of $\theta$, and we can move in the wrong direction.
We check that progress is being made by computing the difference in the values of the objective function at each iteration:
$$\sum_{i=1}^{N} Q_i(\theta_{g+1}) - \sum_{i=1}^{N} Q_i(\theta_g)$$
Since we are maximizing the objective function, we should expect the step from $g$ to $g+1$ to make this difference positive.
BHHH Algorithm
Use the outer product of the score in place of the Hessian,
$$\theta_{g+1} = \theta_g + \alpha \left[\sum_{i=1}^{N} s_i(\theta_g)\, s_i(\theta_g)'\right]^{-1} \left[\sum_{i=1}^{N} s_i(\theta_g)\right]$$
where $\alpha$ is the step length.
It avoids the problem of having to compute second derivatives.
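A sketch of one BHHH update; the array layout for the individual scores and the example values are hypothetical:

```python
import numpy as np

def bhhh_step(scores_i, theta, alpha=1.0):
    """One BHHH update: replace the (negative) Hessian by the outer product of individual scores.

    scores_i: (N, P) array with s_i(theta) in each row (hypothetical layout)."""
    S = np.asarray(scores_i)
    opg = S.T @ S                      # sum_i s_i s_i'
    total_score = S.sum(axis=0)        # sum_i s_i
    return theta + alpha * np.linalg.solve(opg, total_score)

# Example with hypothetical scores for N = 4 observations and P = 2 parameters
scores = np.array([[0.2, -0.1], [0.5, 0.3], [-0.4, 0.2], [0.1, -0.3]])
theta_next = bhhh_step(scores, theta=np.zeros(2), alpha=1.0)
```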
The Generalized Gauss-Newton Method
Another possibility for estimating the Hessian is to use the expected value of $H(z, \theta_0)$ conditional on $x$, where $z$ is partitioned into $y$ and $x$.
We call this conditional expectation $A(x, \theta_0)$.
The generalized Gauss-Newton method uses the updating equation
$$\theta_{g+1} = \theta_g - \left[\sum_{i=1}^{N} A_i(\theta_g)\right]^{-1}\left[\sum_{i=1}^{N} s_i(\theta_g)\right]$$
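A sketch of the Gauss-Newton update for nonlinear least squares, where the expected Hessian given $x$ reduces to (minus) the outer product of the gradients of the regression function; the exponential regression example, starting values, and tolerance are assumptions:

```python
import numpy as np

def gauss_newton_nls(y, x, m, grad_m, beta0, max_iter=100, tol=1e-8):
    """Gauss-Newton for NLS: update beta by (G'G)^{-1} G'u with G the gradients of m."""
    beta = np.asarray(beta0, dtype=float)
    for _ in range(max_iter):
        G = grad_m(x, beta)                   # (N, P) gradients of the regression function
        u = y - m(x, beta)                    # residuals
        step = np.linalg.solve(G.T @ G, G.T @ u)
        beta = beta + step
        if np.abs(step).max() < tol:
            break
    return beta

# Example: exponential regression m(x, b) = exp(b0 + b1 * x)
rng = np.random.default_rng(3)
x = rng.normal(size=300)
y = np.exp(0.2 + 0.5 * x) + 0.1 * rng.normal(size=300)
m = lambda x, b: np.exp(b[0] + b[1] * x)
grad_m = lambda x, b: np.column_stack([m(x, b), x * m(x, b)])
beta_hat = gauss_newton_nls(y, x, m, grad_m, np.zeros(2))
```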
Sometimes it is computationally convenient to concentrate out one set of parameters.
Suppose that we can partition $\theta$ into the vectors $\beta$ and $\gamma$. In this case, the first-order conditions are
$$\sum_{i=1}^{N} \nabla_{\beta} Q(z_i, \beta, \gamma) = 0, \qquad \sum_{i=1}^{N} \nabla_{\gamma} Q(z_i, \beta, \gamma) = 0$$
Suppose that the second set of equations can be solved for $\gamma$ as a function of $z$ and $\beta$ within the parameter set, $\gamma = g(z, \beta)$, so that
$$\sum_{i=1}^{N} \nabla_{\gamma} Q(z_i, \beta, g(z, \beta)) = 0$$
When we plug $g(z, \beta)$ into the original objective function, we get the concentrated objective function, which depends only on $\beta$:
$$Q^{c}(z, \beta) = \sum_{i=1}^{N} Q\left(z_i, \beta, g(z, \beta)\right)$$
Under some regularity conditions, the $\hat{\beta}$ that solves the maximization problem using the concentrated objective function is the same as the one for the original problem.
Having found $\hat{\beta}$, we can get $\hat{\gamma} = g\left(z, \hat{\beta}\right)$.
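A small illustration of concentrating out a parameter: in the hypothetical model $y = \gamma \exp(\beta x) + u$, the first-order condition in $\gamma$ has a closed form, so only $\beta$ is maximized numerically (the model, data, and bounds are illustrative assumptions):

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Model: y = gamma * exp(beta * x) + u. For fixed beta, the FOC in gamma gives
# gamma(beta) = m'y / m'm with m = exp(beta * x), so gamma can be concentrated out.
rng = np.random.default_rng(4)
x = rng.normal(size=200)
y = 2.0 * np.exp(0.3 * x) + 0.1 * rng.normal(size=200)

def concentrated_Q(beta):
    m = np.exp(beta * x)
    gamma = (m @ y) / (m @ m)             # gamma = g(z, beta)
    u = y - gamma * m
    return -0.5 * np.sum(u ** 2)          # concentrated objective Q^c(beta)

res = minimize_scalar(lambda b: -concentrated_Q(b), bounds=(-2, 2), method="bounded")
beta_hat = res.x
m_hat = np.exp(beta_hat * x)
gamma_hat = (m_hat @ y) / (m_hat @ m_hat)  # recover gamma = g(z, beta_hat)
```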
Resampling also allows us to improve the asymptotic distribution approximation.
Sometimes we know that the approximate distribution of $\hat{\theta}$ works well, but we are interested in a function of the parameters,
$$\lambda_0 = g(\theta_0)$$
One way to obtain the approximate distribution of this function is to use the Delta Method applied to $\hat{\lambda} = g\left(\hat{\theta}\right)$.
Sometimes it is hard to apply the Delta Method, or the approximations are not good.
Resampling can improve on the usual asymptotics (standard errors and confidence intervals).
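For comparison with the bootstrap below, a minimal Delta Method sketch (the function $g$, the estimate, and the covariance matrix are hypothetical values):

```python
import numpy as np

def delta_method_se(theta_hat, V_hat, grad_g):
    """Delta-method standard error of g(theta_hat): sqrt(grad_g' V grad_g)."""
    G = np.asarray(grad_g(theta_hat))
    return float(np.sqrt(G @ V_hat @ G))

# Example: g(theta) = theta1 / theta2, with a hypothetical estimate and covariance
theta_hat = np.array([1.0, 2.0])
V_hat = np.array([[0.04, 0.01], [0.01, 0.09]])
grad_g = lambda t: np.array([1 / t[1], -t[0] / t[1] ** 2])
se = delta_method_se(theta_hat, V_hat, grad_g)
```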
Bootstrapping
There are several variants of the bootstrap.
Idea: approximate the distribution of $\hat{\theta}$ without relying on first-order asymptotic theory.
Let $z_1, \ldots, z_N$ be the outcome of a random sample.
At each bootstrap iteration $b$, a random sample of size $N$ is drawn from $z_1, \ldots, z_N$ with replacement,
$$\left\{ z_1^{(b)}, \ldots, z_N^{(b)} \right\}$$
At each iteration, we use the bootstrap sample to obtain the estimate $\hat{\theta}^{(b)}$ by solving
$$\max_{\theta} \sum_{i=1}^{N} Q\left(z_i^{(b)}, \theta\right)$$
We repeat the process $B$ times, obtaining $\hat{\theta}^{(b)}$, $b = 1, \ldots, B$.
Then we compute the average of the $\hat{\theta}^{(b)}$, say $\bar{\theta}$, and use this average as the estimated value of the parameter.
The sample variance
$$\frac{1}{B - 1} \sum_{b=1}^{B} \left( \hat{\theta}^{(b)} - \bar{\theta} \right)^2$$
can be used to estimate the variance of $\hat{\theta}$; its square root estimates the standard error.
A 95% bootstrap confidence interval for $\theta_0$ can be obtained by finding the 2.5 and 97.5 percentiles of the list of values $\left\{ \hat{\theta}^{(b)} : b = 1, \ldots, B \right\}$.
This is the nonparametric bootstrap.
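A sketch of the nonparametric bootstrap just described (the estimator, the number of replications $B$, and the example data are illustrative choices):

```python
import numpy as np

def nonparametric_bootstrap(z, estimator, B=999, seed=0):
    """Nonparametric bootstrap: resample z with replacement and re-estimate B times."""
    rng = np.random.default_rng(seed)
    N = len(z)
    estimates = np.array([estimator(z[rng.integers(0, N, size=N)]) for _ in range(B)])
    se = estimates.std(ddof=1)                         # bootstrap standard error
    ci = np.percentile(estimates, [2.5, 97.5])         # 95% percentile interval
    return estimates.mean(), se, ci

# Example: bootstrap the sample median
z = np.random.default_rng(5).exponential(size=100)
mean_b, se_b, ci_b = nonparametric_bootstrap(z, np.median)
```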
Parametric bootstrap: assume that the distribution of $z$ is known up to the parameter $\theta_0$.
Let $f(\cdot, \theta)$ denote the parametric density.
On each bootstrap iteration, we draw a random sample of size $N$ from $f\left(\cdot, \hat{\theta}\right)$, which gives $\left\{ z_1^{(b)}, \ldots, z_N^{(b)} \right\}$.
We do the resampling thousands of times.
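A short parametric-bootstrap sketch; the exponential model is an illustrative assumption, since the slides do not specify $f$:

```python
import numpy as np

rng = np.random.default_rng(6)
z = rng.exponential(scale=2.0, size=100)     # original sample
theta_hat = z.mean()                         # MLE of the exponential mean

# Parametric bootstrap: redraw samples from f(., theta_hat) and re-estimate
B = 999
theta_b = np.array([rng.exponential(scale=theta_hat, size=z.size).mean() for _ in range(B)])
se_parametric = theta_b.std(ddof=1)
```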
Another alternative: in a regression model, we first estimate $\hat{\theta}$ by NLS and obtain the residuals
$$\hat{u}_i = y_i - m\left(x_i, \hat{\theta}\right)$$
We then draw a bootstrap sample of the residuals $\left\{ \hat{u}_i^{(b)} : i = 1, \ldots, N \right\}$ and obtain
$$y_i^{(b)} = m\left(x_i, \hat{\theta}\right) + \hat{u}_i^{(b)}$$
Using the generated data $\left\{ \left(x_i, y_i^{(b)}\right) : i = 1, \ldots, N \right\}$, we compute $\hat{\theta}^{(b)}$.
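A sketch of the residual bootstrap; for brevity the regression function is taken to be linear, $m(x, \theta) = x'\theta$, which is an illustrative simplification of the NLS setting in the slides:

```python
import numpy as np

def residual_bootstrap(y, X, B=999, seed=7):
    """Residual bootstrap for a linear regression m(x, theta) = x'theta."""
    rng = np.random.default_rng(seed)
    theta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    fitted = X @ theta_hat
    resid = y - fitted
    N = len(y)
    draws = np.empty((B, X.shape[1]))
    for b in range(B):
        u_b = resid[rng.integers(0, N, size=N)]        # resample residuals with replacement
        y_b = fitted + u_b                             # rebuild the dependent variable
        draws[b] = np.linalg.solve(X.T @ X, X.T @ y_b) # re-estimate on (X, y_b)
    return theta_hat, draws.std(axis=0, ddof=1)        # point estimate and bootstrap SEs

rng = np.random.default_rng(8)
X = np.column_stack([np.ones(150), rng.normal(size=150)])
y = X @ np.array([1.0, 0.7]) + rng.normal(size=150)
theta_hat, se_boot = residual_bootstrap(y, X)
```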
References
Amemiya, ch. 4
Wooldridge, ch. 12
Ruud, ch. 16
Newey, W. and D. McFadden (1994). "Large Sample Estimation and Hypothesis Testing", Handbook of Econometrics, Volume IV, chapter 36.