Maximum Likelihood Estimation
Eric Zivot
May 14, 2001
This version: November 15, 2009
1.1 The Likelihood Function

The joint density is an n-dimensional function of the data x₁, …, xₙ given the (k×1) parameter vector θ. The joint density¹ satisfies

    f(x₁, …, xₙ; θ) ≥ 0,
    ∫ ⋯ ∫ f(x₁, …, xₙ; θ) dx₁ ⋯ dxₙ = 1.

The likelihood function is defined as the joint density treated as a function of the parameters θ:

    L(θ|x₁, …, xₙ) = f(x₁, …, xₙ; θ) = ∏ᵢ₌₁ⁿ f(xᵢ; θ),

where the factorization into the product holds under random sampling (iid data). Notice that the likelihood function is a k-dimensional function of θ given the data x₁, …, xₙ. It is important to keep in mind that the likelihood function, being a function of θ and not the data, is not a proper pdf. It is always positive but

    ∫ ⋯ ∫ L(θ|x₁, …, xₙ) dθ₁ ⋯ dθₖ ≠ 1.

¹To simplify notation, let the vector x = (x₁, …, xₙ) denote the observed sample. Then the joint pdf and likelihood function may be expressed as f(x; θ) and L(θ|x).
Example 1 Bernoulli Sampling

Let Xᵢ ~ Bernoulli(θ). That is, Xᵢ = 1 with probability θ and Xᵢ = 0 with probability 1 − θ, where 0 ≤ θ ≤ 1. The pdf for Xᵢ is

    f(xᵢ; θ) = θ^{xᵢ}(1 − θ)^{1−xᵢ},  xᵢ = 0, 1.

Let X₁, …, Xₙ be an iid sample with Xᵢ ~ Bernoulli(θ). The joint density/likelihood function is given by

    f(x; θ) = L(θ|x) = ∏ᵢ₌₁ⁿ θ^{xᵢ}(1 − θ)^{1−xᵢ} = θ^{Σᵢ₌₁ⁿ xᵢ}(1 − θ)^{n − Σᵢ₌₁ⁿ xᵢ}.
For a given value of θ and observed sample x, f(x; θ) gives the probability of observing
the sample. For example, suppose n = 5 and x = (0, …, 0). Now some values of θ
are more likely to have generated this sample than others. In particular, it is more
likely that θ is close to zero than one. To see this, note that the likelihood function
for this sample is

    L(θ|(0, …, 0)) = (1 − θ)⁵.

This function is illustrated in figure xxx. The likelihood function has a clear maximum
at θ = 0. That is, θ = 0 is the value of θ that makes the observed sample x = (0, …, 0)
most likely (highest probability).
Similarly, suppose x = (1, …, 1). Then the likelihood function is

    L(θ|(1, …, 1)) = θ⁵,

which is illustrated in figure xxx. Now the likelihood function has a maximum at θ = 1.
Example 2 Normal Sampling

Let X₁, …, Xₙ be an iid sample with Xᵢ ~ N(μ, σ²). The pdf for Xᵢ is

    f(xᵢ; θ) = (2πσ²)^{−1/2} exp( −(1/(2σ²))(xᵢ − μ)² ),  −∞ < μ < ∞, σ² > 0, −∞ < x < ∞,

so that θ = (μ, σ²)′. The likelihood function is given by

    L(θ|x) = ∏ᵢ₌₁ⁿ (2πσ²)^{−1/2} exp( −(1/(2σ²))(xᵢ − μ)² )
           = (2πσ²)^{−n/2} exp( −(1/(2σ²)) Σᵢ₌₁ⁿ (xᵢ − μ)² ).

Figure xxx illustrates the normal likelihood for a representative sample of size n = 25.
Notice that the likelihood has the same bell shape as a bivariate normal density.
Suppose σ² = 1. Then

    L(μ|x) = (2π)^{−n/2} exp( −(1/2) Σᵢ₌₁ⁿ (xᵢ − μ)² ).

Now

    Σᵢ₌₁ⁿ (xᵢ − μ)² = Σᵢ₌₁ⁿ (xᵢ − x̄ + x̄ − μ)²
                    = Σᵢ₌₁ⁿ [ (xᵢ − x̄)² + 2(xᵢ − x̄)(x̄ − μ) + (x̄ − μ)² ]
                    = Σᵢ₌₁ⁿ (xᵢ − x̄)² + n(x̄ − μ)²,

since Σᵢ₌₁ⁿ (xᵢ − x̄) = 0, so that

    L(μ|x) = (2π)^{−n/2} exp( −(1/2) [ Σᵢ₌₁ⁿ (xᵢ − x̄)² + n(x̄ − μ)² ] ).
Now consider the linear regression model with normal errors,

    yᵢ = xᵢ′ β + εᵢ, i = 1, …, n,
        (1×k)(k×1)

where xᵢ is a (k×1) vector of regressors and the errors satisfy εᵢ|xᵢ ~ iid N(0, σ²), so that

    f(εᵢ|xᵢ; θ) = (2πσ²)^{−1/2} exp( −(1/(2σ²)) εᵢ² ).

The Jacobian of the transformation from εᵢ to yᵢ is one, so the pdf of yᵢ|xᵢ is normal
with mean xᵢ′β and variance σ²:

    f(yᵢ|xᵢ; θ) = (2πσ²)^{−1/2} exp( −(1/(2σ²))(yᵢ − xᵢ′β)² ),

where θ = (β′, σ²)′. Given an iid sample of n observations, y and X, the joint density
of the sample is

    f(y|X; θ) = (2πσ²)^{−n/2} exp( −(1/(2σ²)) Σᵢ₌₁ⁿ (yᵢ − xᵢ′β)² )
              = (2πσ²)^{−n/2} exp( −(1/(2σ²)) (y − Xβ)′(y − Xβ) ).
1.2 The Maximum Likelihood Estimator

Suppose we have a random sample from the pdf f(xᵢ; θ) and we are interested in
estimating θ. The previous examples motivate an estimator as the value of θ that
makes the observed sample most likely. Formally, the maximum likelihood estimator,
denoted θ̂_mle, is the value of θ that maximizes L(θ|x). That is, θ̂_mle solves

    max_θ L(θ|x).

Since ln(·) is monotonically increasing, maximizing the log-likelihood ln L(θ|x) gives the same answer as maximizing the likelihood, and is usually more convenient. With random sampling, the log-likelihood has the particularly simple form

    ln L(θ|x) = ln( ∏ᵢ₌₁ⁿ f(xᵢ; θ) ) = Σᵢ₌₁ⁿ ln f(xᵢ; θ).
Since the MLE is defined through a maximization problem, we would like to know the
conditions under which we may determine the MLE using the techniques of calculus.
A regular pdf f(x; θ) provides a sufficient set of such conditions. We say that f(x; θ)
is regular if

1. The support of the random variables X, S_X = {x : f(x; θ) > 0}, does not
depend on θ;

2. f(x; θ) is at least three times differentiable with respect to θ;

3. The true value of θ lies in a compact set.
If f(x; θ) is regular then we may find the MLE by differentiating ln L(θ|x) and
solving the first order conditions

    ∂ ln L(θ̂_mle|x)/∂θ = 0.

Since θ is (k×1), the first order conditions stack the k partial derivatives:

    ∂ ln L(θ̂_mle|x)/∂θ = ( ∂ ln L(θ̂_mle|x)/∂θ₁, …, ∂ ln L(θ̂_mle|x)/∂θₖ )′.

The vector of derivatives of the log-likelihood function is called the score vector
and is denoted

    S(θ|x) = ∂ ln L(θ|x)/∂θ.

Since the log-likelihood is additive in ln f(xᵢ; θ), the score is additive in the scores for the individual observations,

    S(θ|x) = Σᵢ₌₁ⁿ S(θ|xᵢ),

where S(θ|xᵢ) = ∂ ln f(xᵢ; θ)/∂θ.
For the Bernoulli model, the log-likelihood is

    ln L(θ|x) = ln[ θ^{Σᵢ₌₁ⁿ xᵢ}(1 − θ)^{n − Σᵢ₌₁ⁿ xᵢ} ]
              = ( Σᵢ₌₁ⁿ xᵢ ) ln(θ) + ( n − Σᵢ₌₁ⁿ xᵢ ) ln(1 − θ),

and the score is

    S(θ|x) = ∂ ln L(θ|x)/∂θ = (1/θ) Σᵢ₌₁ⁿ xᵢ − (1/(1 − θ)) ( n − Σᵢ₌₁ⁿ xᵢ ).

The MLE satisfies S(θ̂_mle|x) = 0, which, after a little algebra, produces the MLE

    θ̂_mle = (1/n) Σᵢ₌₁ⁿ xᵢ.

Hence, the sample average is the MLE for θ in the Bernoulli model.
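As a quick numerical check of this closed form, the sketch below (using NumPy; the simulated data, seed, and grid are illustrative assumptions) compares the sample average with a grid search over the Bernoulli log-likelihood.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.binomial(1, 0.3, size=200)        # iid Bernoulli(0.3) sample

def log_lik(theta):
    s = x.sum()
    return s * np.log(theta) + (len(x) - s) * np.log(1 - theta)

theta_mle = x.mean()                      # closed-form MLE: the sample average

# brute-force grid search confirms the analytic maximizer
grid = np.linspace(0.01, 0.99, 981)       # step 0.001
theta_grid = grid[np.argmax([log_lik(t) for t in grid])]
```

The grid maximizer matches the sample mean up to the grid resolution, and the score (1/θ)Σxᵢ − (1/(1−θ))(n − Σxᵢ) evaluates to zero at θ̂_mle.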
For the normal model, the score vector has two elements:

    S(θ|x) = ( ∂ ln L(θ|x)/∂μ, ∂ ln L(θ|x)/∂σ² )′,

where

    ∂ ln L(θ|x)/∂μ = (1/σ²) Σᵢ₌₁ⁿ (xᵢ − μ),
    ∂ ln L(θ|x)/∂σ² = −(n/2)(σ²)⁻¹ + (1/2)(σ²)⁻² Σᵢ₌₁ⁿ (xᵢ − μ)².

The score for an observation is

    S(θ|xᵢ) = ( ∂ ln f(xᵢ; θ)/∂μ, ∂ ln f(xᵢ; θ)/∂σ² )′
            = ( (σ²)⁻¹(xᵢ − μ), −(1/2)(σ²)⁻¹ + (1/2)(σ²)⁻²(xᵢ − μ)² )′,

so that S(θ|x) = Σᵢ₌₁ⁿ S(θ|xᵢ).
Solving S(θ̂_mle|x) = 0 gives the normal equations

    ∂ ln L(θ̂_mle|x)/∂μ = (1/σ̂²_mle) Σᵢ₌₁ⁿ (xᵢ − μ̂_mle) = 0,
    ∂ ln L(θ̂_mle|x)/∂σ² = −(n/2)(σ̂²_mle)⁻¹ + (1/2)(σ̂²_mle)⁻² Σᵢ₌₁ⁿ (xᵢ − μ̂_mle)² = 0.

Solving the first equation gives

    μ̂_mle = (1/n) Σᵢ₌₁ⁿ xᵢ = x̄,

and substituting μ̂_mle = x̄ into the second equation and solving for σ² gives

    σ̂²_mle = (1/n) Σᵢ₌₁ⁿ (xᵢ − x̄)².

Notice that σ̂²_mle is not equal to the sample variance, which uses the divisor n − 1.
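The two normal equations can be verified numerically. A minimal sketch (synthetic data, seed, and parameter values are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, scale=1.5, size=500)   # iid N(2, 1.5^2) sample

mu_mle = x.mean()                              # solves the first normal equation
sigma2_mle = np.mean((x - mu_mle)**2)          # divisor n, not n - 1

s2 = x.var(ddof=1)                             # sample variance with divisor n - 1
```

The identity σ̂²_mle = ((n − 1)/n)·s² makes explicit that the MLE of σ² is slightly smaller than the sample variance.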
For the linear regression model, the elements of the score are

    ∂ ln L(θ|y, X)/∂β = −(1/(2σ²)) ∂/∂β [ y′y − 2β′X′y + β′X′Xβ ]
                      = (σ²)⁻¹ [ X′y − X′Xβ ],
    ∂ ln L(θ|y, X)/∂σ² = −(n/2)(σ²)⁻¹ + (1/2)(σ²)⁻² (y − Xβ)′(y − Xβ).

Solving ∂ ln L(θ|y, X)/∂β = 0 for β gives

    β̂_mle = (X′X)⁻¹X′y = β̂_OLS.

Next, solving ∂ ln L(θ|y, X)/∂σ² = 0 for σ² gives

    σ̂²_mle = (1/n)(y − Xβ̂_mle)′(y − Xβ̂_mle)
            ≠ σ̂²_OLS = (1/(n − k))(y − Xβ̂_OLS)′(y − Xβ̂_OLS).
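The equivalence of the MLE and OLS for β, and the differing divisors for the two variance estimators, can be checked directly. A sketch with simulated regression data (design, coefficients, and seed are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])  # intercept + 2 regressors
beta_true = np.array([1.0, -0.5, 2.0])
y = X @ beta_true + rng.normal(scale=0.8, size=n)

beta_mle = np.linalg.solve(X.T @ X, X.T @ y)   # (X'X)^(-1) X'y
resid = y - X @ beta_mle
sigma2_mle = resid @ resid / n                 # ML divisor n
s2 = resid @ resid / (n - k)                   # unbiased divisor n - k
```

`beta_mle` coincides with the least-squares solution, and the residuals are orthogonal to the columns of X, which is exactly the first order condition X′(y − Xβ̂) = 0.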
1.3 The Score, Hessian and Information Matrix

The (k×k) matrix of second derivatives of the log-likelihood is called the Hessian:

    H(θ|x) = ∂² ln L(θ|x)/∂θ∂θ′
           = [ ∂² ln L(θ|x)/∂θ₁∂θ₁  ⋯  ∂² ln L(θ|x)/∂θ₁∂θₖ ]
             [         ⋮            ⋱          ⋮           ]
             [ ∂² ln L(θ|x)/∂θₖ∂θ₁  ⋯  ∂² ln L(θ|x)/∂θₖ∂θₖ ].

With random sampling, the Hessian is additive in the Hessians for the individual observations,

    H(θ|x) = Σᵢ₌₁ⁿ ∂² ln f(xᵢ; θ)/∂θ∂θ′ = Σᵢ₌₁ⁿ H(θ|xᵢ),

and the information matrix is defined as minus the expected Hessian:

    I(θ|x) = −E[H(θ|x)] = −Σᵢ₌₁ⁿ E[H(θ|xᵢ)] = n·I(θ|xᵢ).

The last result says that the sample information matrix is equal to n times the
information matrix for an observation.
The following proposition relates some properties of the score function to the
information matrix.

Proposition 8 Let f(xᵢ; θ) be a regular pdf. Then

1. E[S(θ|xᵢ)] = ∫ S(θ|xᵢ) f(xᵢ; θ) dxᵢ = 0.

2. If θ is a scalar then

    var(S(θ|xᵢ)) = E[S(θ|xᵢ)²] = −E[H(θ|xᵢ)] = I(θ|xᵢ).

If θ is a vector then

    var(S(θ|xᵢ)) = E[S(θ|xᵢ)S(θ|xᵢ)′] = −E[H(θ|xᵢ)] = I(θ|xᵢ).

Proof. For part 1, we have

    E[S(θ|xᵢ)] = ∫ ( ∂ ln f(xᵢ; θ)/∂θ ) f(xᵢ; θ) dxᵢ
               = ∫ (1/f(xᵢ; θ)) ( ∂f(xᵢ; θ)/∂θ ) f(xᵢ; θ) dxᵢ
               = ∫ ∂f(xᵢ; θ)/∂θ dxᵢ
               = (∂/∂θ) ∫ f(xᵢ; θ) dxᵢ
               = (∂/∂θ) 1 = 0.

The key part of the proof is the ability to interchange the order of differentiation and
integration.
For part 2, consider the scalar case for simplicity. Proceeding as above, we get

    E[S(θ|xᵢ)²] = ∫ S(θ|xᵢ)² f(xᵢ; θ) dxᵢ = ∫ ( ∂ ln f(xᵢ; θ)/∂θ )² f(xᵢ; θ) dxᵢ
                = ∫ ( (1/f(xᵢ; θ)) ∂f(xᵢ; θ)/∂θ )² f(xᵢ; θ) dxᵢ
                = ∫ (1/f(xᵢ; θ)) ( ∂f(xᵢ; θ)/∂θ )² dxᵢ.

Next, note that

    ∂² ln f(xᵢ; θ)/∂θ² = ∂/∂θ [ (1/f(xᵢ; θ)) ∂f(xᵢ; θ)/∂θ ]
                       = −f(xᵢ; θ)⁻² ( ∂f(xᵢ; θ)/∂θ )² + f(xᵢ; θ)⁻¹ ∂²f(xᵢ; θ)/∂θ².

Then

    E[H(θ|xᵢ)] = ∫ [ −f(xᵢ; θ)⁻² ( ∂f(xᵢ; θ)/∂θ )² + f(xᵢ; θ)⁻¹ ∂²f(xᵢ; θ)/∂θ² ] f(xᵢ; θ) dxᵢ
               = −∫ f(xᵢ; θ)⁻¹ ( ∂f(xᵢ; θ)/∂θ )² dxᵢ + ∫ ∂²f(xᵢ; θ)/∂θ² dxᵢ
               = −E[S(θ|xᵢ)²] + (∂²/∂θ²) ∫ f(xᵢ; θ) dxᵢ
               = −E[S(θ|xᵢ)²],

since ∫ f(xᵢ; θ) dxᵢ = 1 implies that (∂²/∂θ²) ∫ f(xᵢ; θ) dxᵢ = 0.
1.4 The Concentrated Log-likelihood

In the normal model, the first order condition for σ²,

    −(n/2)(σ²)⁻¹ + (1/2)(σ²)⁻² Σᵢ₌₁ⁿ (xᵢ − μ)² = 0,

may be solved for σ² as a function of μ:

    σ²(μ) = (1/n) Σᵢ₌₁ⁿ (xᵢ − μ)².

Notice that any value of σ²(μ) defined this way satisfies the first order condition
∂ ln L(θ|x)/∂σ² = 0. If we substitute σ²(μ) for σ² in the log-likelihood function we get
the following concentrated log-likelihood function for μ:

    ln Lᶜ(μ|x) = −(n/2) ln(2π) − (n/2) ln(σ²(μ)) − (1/(2σ²(μ))) Σᵢ₌₁ⁿ (xᵢ − μ)²
               = −(n/2) ln(2π) − (n/2) ln( (1/n) Σᵢ₌₁ⁿ (xᵢ − μ)² )
                 − (1/2) ( (1/n) Σᵢ₌₁ⁿ (xᵢ − μ)² )⁻¹ (1/n) Σᵢ₌₁ⁿ (xᵢ − μ)²
               = −(n/2)(ln(2π) + 1) − (n/2) ln( (1/n) Σᵢ₌₁ⁿ (xᵢ − μ)² ).
Now we may determine the MLE for μ by maximizing the concentrated log-likelihood function ln Lᶜ(μ|x). The first order condition is

    ∂ ln Lᶜ(μ̂_mle|x)/∂μ = ( Σᵢ₌₁ⁿ (xᵢ − μ̂_mle) ) / ( (1/n) Σᵢ₌₁ⁿ (xᵢ − μ̂_mle)² ) = 0,

which is satisfied by μ̂_mle = x̄, provided not all of the xᵢ values are identical.
For some models it may not be possible to analytically concentrate the log-likelihood with respect to a subset of parameters. Nonetheless, it is still possible
in principle to concentrate the log-likelihood numerically.
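A numerical version of the concentration argument is easy to sketch: profile out σ² analytically, then maximize the concentrated log-likelihood over a grid of μ values. The data, seed, and grid below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(loc=1.0, scale=2.0, size=300)
n = len(x)

def concentrated_loglik(mu):
    # -(n/2)(ln(2*pi) + 1) - (n/2) * ln( (1/n) sum (x_i - mu)^2 )
    return -0.5 * n * (np.log(2 * np.pi) + 1) - 0.5 * n * np.log(np.mean((x - mu)**2))

grid = np.linspace(x.mean() - 1, x.mean() + 1, 2001)   # grid centered at x-bar
mu_hat = grid[np.argmax([concentrated_loglik(m) for m in grid])]
```

The grid maximizer lands on the sample mean, as the first order condition for the concentrated log-likelihood predicts.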
1.5 The Precision of the MLE
The likelihood, log-likelihood and score functions for a typical model are illustrated
in figure xxx. The likelihood function is always positive (since it is the joint density
of the sample) but the log-likelihood function is typically negative (being the log of
a number less than 1). Here the log-likelihood is globally concave and has a unique
maximum at θ̂_mle. Consequently, the score function is positive to the left of the
maximum, crosses zero at the maximum, and becomes negative to the right of the
maximum.
Intuitively, the precision of θ̂_mle depends on the curvature of the log-likelihood
function near θ̂_mle. If the log-likelihood is very curved or "steep" around θ̂_mle, then
θ will be precisely estimated. In this case, we say that we have a lot of information
about θ. On the other hand, if the log-likelihood is not curved or "flat" near θ̂_mle,
then θ will not be precisely estimated. Accordingly, we say that we do not have much
information about θ.
The extreme case of a completely flat likelihood in θ is illustrated in figure xxx.
Here, the sample contains no information about the true value of θ because every
value of θ produces the same value of the likelihood function. When this happens we
say that θ is not identified. Formally, θ is identified if for all θ₁ ≠ θ₂ there exists a
sample x for which L(θ₁|x) ≠ L(θ₂|x).
The curvature of the log-likelihood is measured by its second derivative (Hessian)
H(θ|x) = ∂² ln L(θ|x)/∂θ∂θ′. Since the Hessian is negative semi-definite, the information in
the sample about θ may be measured by −H(θ|x). If θ is a scalar then −H(θ|x) is
a positive number. The expected amount of information in the sample about the
parameter θ is the information matrix I(θ|x) = −E[H(θ|x)]. As we shall see, the
information matrix is directly related to the precision of the MLE.
1.5.1 The Cramer-Rao Lower Bound

For the Bernoulli model, recall that the score for an observation is S(θ|xᵢ) = (xᵢ − θ)/(θ(1 − θ)). Further, due to random sampling, I(θ|x) = n·I(θ|xᵢ) = n·var(S(θ|xᵢ)). Now, using
the chain rule it can be shown that

    H(θ|xᵢ) = (d/dθ) S(θ|xᵢ) = (d/dθ) [ (xᵢ − θ)/(θ(1 − θ)) ]
            = −( 1 + S(θ|xᵢ)(1 − 2θ) ) / ( θ(1 − θ) ).

The information for an observation is then

    I(θ|xᵢ) = −E[H(θ|xᵢ)] = ( 1 + E[S(θ|xᵢ)](1 − 2θ) ) / ( θ(1 − θ) ) = 1/( θ(1 − θ) ),

since

    E[S(θ|xᵢ)] = ( E[xᵢ] − θ )/( θ(1 − θ) ) = ( θ − θ )/( θ(1 − θ) ) = 0.

Alternatively,

    I(θ|xᵢ) = var(S(θ|xᵢ)) = var( (xᵢ − θ)/(θ(1 − θ)) )
            = var(xᵢ)/( θ²(1 − θ)² ) = θ(1 − θ)/( θ²(1 − θ)² )
            = 1/( θ(1 − θ) ).

The information for the sample is then

    I(θ|x) = n·I(θ|xᵢ) = n/( θ(1 − θ) ),

so that

    CRLB = I(θ|x)⁻¹ = θ(1 − θ)/n.

This is the lower bound on the variance of any unbiased estimator of θ.
Consider the MLE for θ, θ̂_mle = x̄. Now,

    E[θ̂_mle] = E[x̄] = θ,
    var(θ̂_mle) = var(x̄) = θ(1 − θ)/n.

Notice that the MLE is unbiased and its variance is equal to the CRLB. Therefore,
θ̂_mle is efficient.
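The equality var(θ̂_mle) = θ(1 − θ)/n can be verified by simulation. A Monte Carlo sketch (the true θ, sample size, replication count, and seed are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)
theta, n, reps = 0.4, 50, 20000
samples = rng.binomial(1, theta, size=(reps, n))
theta_hats = samples.mean(axis=1)          # MLE computed in each replication

mc_var = theta_hats.var()                  # Monte Carlo variance of the MLE
crlb = theta * (1 - theta) / n             # theta(1-theta)/n
```

Across replications the MLE is centered at θ and its variance matches the CRLB up to simulation noise.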
Remarks
For the normal model, the Hessian for an observation is

    H(θ|xᵢ) = [ ∂² ln f(xᵢ; θ)/∂μ²      ∂² ln f(xᵢ; θ)/∂μ∂σ²   ]
              [ ∂² ln f(xᵢ; θ)/∂σ²∂μ    ∂² ln f(xᵢ; θ)/∂(σ²)²  ].

Now

    ∂² ln f(xᵢ; θ)/∂μ² = −(σ²)⁻¹,
    ∂² ln f(xᵢ; θ)/∂μ∂σ² = −(σ²)⁻²(xᵢ − μ),
    ∂² ln f(xᵢ; θ)/∂σ²∂μ = −(σ²)⁻²(xᵢ − μ),
    ∂² ln f(xᵢ; θ)/∂(σ²)² = (1/2)(σ²)⁻² − (σ²)⁻³(xᵢ − μ)²,

so that

    I(θ|xᵢ) = −E[H(θ|xᵢ)]
            = [ (σ²)⁻¹                  (σ²)⁻²E[xᵢ − μ]                        ]
              [ (σ²)⁻²E[xᵢ − μ]         −(1/2)(σ²)⁻² + (σ²)⁻³E[(xᵢ − μ)²]      ].

Since E[xᵢ − μ] = 0 and E[(xᵢ − μ)²/σ²] = 1,² we then have

    I(θ|xᵢ) = [ (σ²)⁻¹   0             ]
              [ 0        (1/2)(σ²)⁻²   ],

and

    I(θ|x) = n·I(θ|xᵢ) = [ n(σ²)⁻¹   0               ]
                         [ 0         (n/2)(σ²)⁻²     ].
²(xᵢ − μ)²/σ² is a chi-square random variable with one degree of freedom. The expected value
of a chi-square random variable is equal to its degrees of freedom.
The CRLB is

    CRLB = I(θ|x)⁻¹ = [ σ²/n   0       ]
                      [ 0      2σ⁴/n   ].

Notice that the information matrix and the CRLB are diagonal matrices. The CRLB
for an unbiased estimator of μ is σ²/n, and the CRLB for an unbiased estimator of σ²
is 2σ⁴/n.
The MLEs for μ and σ² are

    μ̂_mle = x̄,
    σ̂²_mle = (1/n) Σᵢ₌₁ⁿ (xᵢ − μ̂_mle)².

Now

    E[μ̂_mle] = μ,
    E[σ̂²_mle] = ((n − 1)/n) σ²,

so that μ̂_mle is unbiased whereas σ̂²_mle is biased. This illustrates the fact that MLEs
are not necessarily unbiased. Furthermore,

    var(μ̂_mle) = σ²/n = CRLB,

and so μ̂_mle is efficient.
The MLE for σ² is biased and so the CRLB result does not apply. Consider the
unbiased estimator of σ²

    s² = (1/(n − 1)) Σᵢ₌₁ⁿ (xᵢ − x̄)².

Is the variance of s² equal to the CRLB? No. To see this, recall that

    (n − 1)s²/σ² ~ χ²(n − 1).

Further, if X ~ χ²(n − 1) then E[X] = n − 1 and var(X) = 2(n − 1). Therefore,

    s² = ( σ²/(n − 1) ) X,
    var(s²) = ( σ⁴/(n − 1)² ) var(X) = 2σ⁴/(n − 1).

Hence, var(s²) = 2σ⁴/(n − 1) > CRLB = 2σ⁴/n.

Remarks
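The variance comparison for s² can also be checked by simulation. A Monte Carlo sketch (true σ², sample size, replication count, and seed are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)
sigma2, n, reps = 4.0, 20, 40000
x = rng.normal(loc=0.0, scale=np.sqrt(sigma2), size=(reps, n))
s2 = x.var(axis=1, ddof=1)                 # unbiased estimator in each replication

mc_var = s2.var()                          # Monte Carlo variance of s^2
exact_var = 2 * sigma2**2 / (n - 1)        # 2 sigma^4/(n-1)
crlb = 2 * sigma2**2 / n                   # 2 sigma^4/n
```

The simulated variance of s² matches 2σ⁴/(n − 1), which strictly exceeds the CRLB 2σ⁴/n.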
For the linear regression model, the score is

    S(θ|y, X) = [ (σ²)⁻¹[X′y − X′Xβ]                                 ]
                [ −(n/2)(σ²)⁻¹ + (1/2)(σ²)⁻²(y − Xβ)′(y − Xβ)        ]
              = [ (σ²)⁻¹X′ε                                          ]
                [ −(n/2)(σ²)⁻¹ + (1/2)(σ²)⁻²ε′ε                      ],

where ε = y − Xβ. Now E[ε] = 0 and E[ε′ε] = nσ² (since ε′ε/σ² ~ χ²(n)), so that

    E[S(θ|y, X)] = [ (σ²)⁻¹X′E[ε]                                    ]
                   [ −(n/2)(σ²)⁻¹ + (1/2)(σ²)⁻²E[ε′ε]                ]
                 = [ 0 ]
                   [ 0 ].

To determine the Hessian and information matrix we need the second derivatives of
ln L(θ|y, X):

    ∂² ln L(θ|y, X)/∂β∂β′ = ∂/∂β′ ( (σ²)⁻¹[X′y − X′Xβ] ) = −(σ²)⁻¹X′X,
    ∂² ln L(θ|y, X)/∂β∂σ² = ∂/∂σ² ( (σ²)⁻¹[X′y − X′Xβ] ) = −(σ²)⁻²X′ε,
    ∂² ln L(θ|y, X)/∂σ²∂β′ = −(σ²)⁻²ε′X,
    ∂² ln L(θ|y, X)/∂(σ²)² = ∂/∂σ² ( −(n/2)(σ²)⁻¹ + (1/2)(σ²)⁻²ε′ε )
                           = (n/2)(σ²)⁻² − (σ²)⁻³ε′ε.

Therefore,

    H(θ|y, X) = [ −(σ²)⁻¹X′X     −(σ²)⁻²X′ε                   ]
                [ −(σ²)⁻²ε′X     (n/2)(σ²)⁻² − (σ²)⁻³ε′ε      ],

and

    I(θ|y, X) = −E[H(θ|y, X)]
              = [ (σ²)⁻¹X′X        (σ²)⁻²X′E[ε]                       ]
                [ (σ²)⁻²E[ε]′X     −(n/2)(σ²)⁻² + (σ²)⁻³E[ε′ε]        ]
              = [ (σ²)⁻¹X′X   0              ]
                [ 0           (n/2)(σ²)⁻²    ].
Notice that the information matrix is block diagonal in β and σ². The CRLB for
unbiased estimators of θ is then

    CRLB = I(θ|y, X)⁻¹ = [ σ²(X′X)⁻¹   0       ]
                         [ 0           2σ⁴/n   ].

Do the MLEs of β and σ² achieve the CRLB? First, β̂_mle is unbiased and var(β̂_mle|X) =
σ²(X′X)⁻¹ = CRLB for an unbiased estimator of β. Hence, β̂_mle is the most efficient
unbiased estimator (BUE). This is an improvement over the Gauss-Markov theorem,
which says that β̂_mle = β̂_OLS is the most efficient linear and unbiased estimator
(BLUE). Next, note that σ̂²_mle is not unbiased (why?), so the CRLB result does not
apply. What about the unbiased estimator s² = (n − k)⁻¹(y − Xβ̂_OLS)′(y − Xβ̂_OLS)?
It can be shown that var(s²|X) = 2σ⁴/(n − k) > 2σ⁴/n = CRLB for an unbiased estimator of
σ². Hence s² is not the most efficient unbiased estimator of σ².
1.6 Invariance Property of the MLE

One of the attractive features of the method of maximum likelihood is its invariance
to one-to-one transformations of the parameters of the log-likelihood. That is, if θ̂_mle
is the MLE of θ and α = h(θ) is a one-to-one function of θ, then α̂_mle = h(θ̂_mle) is the
MLE for α.

Example 13 Normal Model Continued

The log-likelihood is parameterized in terms of μ and σ², and we have the MLEs

    μ̂_mle = x̄,
    σ̂²_mle = (1/n) Σᵢ₌₁ⁿ (xᵢ − μ̂_mle)².

Since σ = (σ²)^{1/2} is a one-to-one function of σ², the invariance property gives

    σ̂_mle = (σ̂²_mle)^{1/2} = ( (1/n) Σᵢ₌₁ⁿ (xᵢ − μ̂_mle)² )^{1/2}.
1.7 Asymptotic Properties of the MLE

Under regularity conditions, the MLE has the following asymptotic properties:

1. θ̂_mle →p θ (consistency);

2. √n(θ̂_mle − θ) →d N(0, I(θ|xᵢ)⁻¹), where

    I(θ|xᵢ) = −E[H(θ|xᵢ)] = −E[ ∂² ln f(θ|xᵢ)/∂θ∂θ′ ].

That is,

    avar( √n(θ̂_mle − θ) ) = I(θ|xᵢ)⁻¹.

Alternatively, for large enough n,

    θ̂_mle ~ N( θ, n⁻¹ I(θ|xᵢ)⁻¹ ) = N( θ, I(θ|x)⁻¹ ).

The asymptotic normality in property 2 follows from an exact first order Taylor series
expansion of the first order conditions for a maximum of the log-likelihood about
θ₀:

    0 = S(θ̂_mle|x) = S(θ₀|x) + H(θ̄|x)(θ̂_mle − θ₀),  θ̄ = λθ̂_mle + (1 − λ)θ₀, λ ∈ [0, 1],

so that

    H(θ̄|x)(θ̂_mle − θ₀) = −S(θ₀|x)
    ⟹ √n(θ̂_mle − θ₀) = −( n⁻¹H(θ̄|x) )⁻¹ · n^{−1/2} S(θ₀|x).

Now

    n⁻¹H(θ̄|x) = (1/n) Σᵢ₌₁ⁿ H(θ̄|xᵢ) →p E[H(θ₀|xᵢ)] = −I(θ₀|xᵢ),
    n^{−1/2} S(θ₀|x) = (1/√n) Σᵢ₌₁ⁿ S(θ₀|xᵢ) →d N(0, I(θ₀|xᵢ)).

Therefore,

    √n(θ̂_mle − θ₀) →d I(θ₀|xᵢ)⁻¹ · N(0, I(θ₀|xᵢ)) = N(0, I(θ₀|xᵢ)⁻¹).

In practice, the information matrix for an observation is estimated using the Hessian evaluated at θ̂_mle,

    Î(θ̂_mle|xᵢ) = −(1/n) Σᵢ₌₁ⁿ H(θ̂_mle|xᵢ),

so that

    avâr( √n(θ̂_mle − θ) ) = Î(θ̂_mle|xᵢ)⁻¹.
To prove consistency of the MLE, one must show that Q₀(θ) = E[ln f(xᵢ|θ)] is
uniquely maximized at θ = θ₀. To do this, let f(x, θ₀) denote the true density and
let f(x, θ₁) denote the density evaluated at any θ₁ ≠ θ₀. Define the Kullback-Leibler
Information Criterion (KLIC) as

    K( f(x, θ₀), f(x, θ₁) ) = E₀[ ln( f(x, θ₀)/f(x, θ₁) ) ] = ∫ ln( f(x, θ₀)/f(x, θ₁) ) f(x, θ₀) dx,

where

    ln( f(x, θ₀)/f(x, θ₁) ) = ∞ if f(x, θ₁) = 0 and f(x, θ₀) > 0,
    K( f(x, θ₀), f(x, θ₁) ) = 0 if f(x, θ₀) = 0.

The KLIC is a measure of the ability of the likelihood ratio to distinguish between
f(x, θ₀) and f(x, θ₁) when f(x, θ₀) is true. The Shannon-Kolmogorov Information
Inequality gives the following result:

    K( f(x, θ₀), f(x, θ₁) ) ≥ 0,

with equality if and only if f(x, θ₀) = f(x, θ₁) for all values of x.
Example 14 Asymptotic results for MLE of Bernoulli distribution parameters
For the Bernoulli model,

    I(θ|xᵢ) = 1/( θ(1 − θ) ),

so that

    √n(θ̂_mle − θ) →d N( 0, θ(1 − θ) ).

Alternatively, for large enough n,

    θ̂_mle ~ N( θ, θ(1 − θ)/n ),

and the estimated asymptotic variance is

    avâr(θ̂_mle) = θ̂_mle(1 − θ̂_mle)/n = x̄(1 − x̄)/n.
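The asymptotic normality of the Bernoulli MLE can be illustrated by simulation: standardize θ̂_mle using the asymptotic variance and check that the result behaves like a standard normal. The true θ, sample size, replication count, and seed below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)
theta, n, reps = 0.3, 400, 20000
theta_hats = rng.binomial(1, theta, size=(reps, n)).mean(axis=1)

# standardize: z = sqrt(n) (theta_hat - theta) / sqrt(theta(1-theta))
z = np.sqrt(n) * (theta_hats - theta) / np.sqrt(theta * (1 - theta))
frac_within_196 = np.mean(np.abs(z) < 1.96)   # should be close to 0.95
```

The standardized statistic has mean near 0, standard deviation near 1, and about 95% of its mass within ±1.96, as the N(0, 1) limit predicts.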
Example 15 Asymptotic results for MLE of linear regression model parameters

In the linear regression model with normal errors,

    yᵢ = xᵢ′β + εᵢ, i = 1, …, n,
    εᵢ|xᵢ ~ iid N(0, σ²),

the MLE for θ = (β′, σ²)′ is

    β̂_mle = (X′X)⁻¹X′y,
    σ̂²_mle = n⁻¹(y − Xβ̂_mle)′(y − Xβ̂_mle).

For large enough n,

    ( β̂_mle, σ̂²_mle )′ ~ N( (β, σ²)′, V ),  V = [ σ²(X′X)⁻¹   0       ]
                                                [ 0           2σ⁴/n   ].

Further, the block diagonality of the information matrix implies that β̂_mle is asymptotically independent of σ̂²_mle.
1.8 Relationship Between ML and GMM

For general models, the first order conditions are p nonlinear equations in p unknowns.
Under regularity conditions, the MLE is consistent, asymptotically normally distributed, and efficient in the class of asymptotically normal estimators:

    θ̂_mle ~ N( θ, n⁻¹ I(θ|xᵢ)⁻¹ ),

where I(θ|xᵢ) = −E[H(θ|xᵢ)] = E[S(θ|xᵢ)S(θ|xᵢ)′].
To do GMM estimation, you need to know k ≥ p population moment conditions

    E[g(xᵢ, θ)] = 0.

The GMM estimator matches sample moments with the population moments. The
sample moments are

    gₙ(θ) = (1/n) Σᵢ₌₁ⁿ g(xᵢ, θ).

If k > p, the efficient GMM estimator minimizes the objective function

    J(θ, Ŝ⁻¹) = n·gₙ(θ)′ Ŝ⁻¹ gₙ(θ),

where S = E[g(xᵢ, θ)g(xᵢ, θ)′]. The first order conditions are

    ∂J(θ̂_gmm, Ŝ⁻¹)/∂θ = Gₙ(θ̂_gmm)′ Ŝ⁻¹ gₙ(θ̂_gmm) = 0,

where Gₙ(θ) denotes the Jacobian ∂gₙ(θ)/∂θ′. Under regularity conditions, the efficient GMM estimator is consistent, asymptotically normally distributed, and efficient in the class of asymptotically normal GMM
estimators for a given set of moment conditions.
The asymptotic efficiency of the MLE in the class of consistent and asymptotically
normal estimators implies that

    avar(θ̂_mle) − avar(θ̂_gmm) ≤ 0.

That is, the efficient GMM estimator is generally less efficient than the ML estimator.
The GMM estimator will be equivalent to the ML estimator if the moment conditions happen to correspond with the score associated with the pdf of an observation,
that is, if

    g(xᵢ, θ) = S(θ|xᵢ).

In this case, there are p moment conditions and the model is just identified. The
GMM estimator then satisfies the sample moment equations

    gₙ(θ̂_gmm) = (1/n) S(θ̂_gmm|x) = 0,

which implies that θ̂_gmm = θ̂_mle. Since

    G = E[ ∂S(θ|xᵢ)/∂θ′ ] = E[H(θ|xᵢ)] = −I(θ|xᵢ),
    S = E[S(θ|xᵢ)S(θ|xᵢ)′] = I(θ|xᵢ),

the asymptotic variance of the GMM estimator becomes

    (G′S⁻¹G)⁻¹ = ( I(θ|xᵢ) I(θ|xᵢ)⁻¹ I(θ|xᵢ) )⁻¹ = I(θ|xᵢ)⁻¹,

which is the asymptotic variance of the MLE.
1.9 Hypothesis Testing

Let X₁, …, Xₙ be iid with pdf f(x, θ), and assume that θ is a scalar. The hypotheses
to be tested are

    H₀: θ = θ₀ vs. H₁: θ ≠ θ₀.

A statistical test is a decision rule, based on the observed data, to either reject H₀ or
not reject H₀.
1.9.1 Likelihood Ratio Statistic

The likelihood ratio is

    λ = L(θ₀|x) / max_θ L(θ|x) = L(θ₀|x) / L(θ̂_mle|x),

which is the ratio of the likelihood evaluated under the null to the likelihood evaluated
at the MLE. By construction, 0 ≤ λ ≤ 1. If H₀: θ = θ₀ is true, then we should see
λ close to 1. The likelihood ratio (LR) statistic is

    LR = −2 ln λ = −2[ ln L(θ₀|x) − ln L(θ̂_mle|x) ],

and under H₀,

    LR →d χ²(1).

In general, the degrees of freedom of the chi-square limiting distribution depend on
the number of restrictions imposed under the null hypothesis. The decision rule for
the LR statistic is to reject H₀: θ = θ₀ at the α·100% level if LR > χ²₁₋α(1), where
χ²₁₋α(1) is the (1 − α)·100% quantile of the chi-square distribution with 1 degree of
freedom.
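For the Bernoulli model, the LR statistic has a simple closed form, sketched below with NumPy (the simulated data, null value, and seed are illustrative assumptions; 3.841 is the 95% quantile of χ²(1)).

```python
import numpy as np

rng = np.random.default_rng(7)
theta0, n = 0.5, 100
x = rng.binomial(1, theta0, size=n)        # data generated under H0
s = x.sum()
theta_hat = s / n                          # unrestricted MLE

def loglik(t):
    return s * np.log(t) + (n - s) * np.log(1 - t)

LR = -2 * (loglik(theta0) - loglik(theta_hat))
reject = LR > 3.841                        # reject H0 at the 5% level?
```

Because θ̂_mle maximizes the log-likelihood, ln L(θ₀|x) ≤ ln L(θ̂_mle|x) and the LR statistic is always nonnegative.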
1.9.2 Wald Statistic

The Wald statistic is based directly on the asymptotic normal distribution of θ̂_mle:

    θ̂_mle ~ N( θ, Î(θ̂_mle|x)⁻¹ ),

where Î(θ̂_mle|x) is a consistent estimate of the sample information matrix. An implication of the asymptotic normality result is that the usual t-ratio for testing
H₀: θ = θ₀,

    t = ( θ̂_mle − θ₀ ) / SÊ(θ̂_mle) = ( θ̂_mle − θ₀ ) / √( Î(θ̂_mle|x)⁻¹ )
      = ( θ̂_mle − θ₀ ) · √( Î(θ̂_mle|x) ),

is asymptotically distributed as a standard normal random variable. Using the continuous mapping theorem, it follows that the square of the t-statistic is asymptotically
distributed as a chi-square random variable with 1 degree of freedom. The Wald
statistic is

    Wald = t² = ( θ̂_mle − θ₀ )² / Î(θ̂_mle|x)⁻¹ = ( θ̂_mle − θ₀ )² · Î(θ̂_mle|x),

and under H₀,

    Wald →d χ²(1).

The intuition behind the Wald statistic is illustrated in Figure xxx. If the curvature of ln L(θ|x) near θ = θ̂_mle is big (high information), then the squared distance
(θ̂_mle − θ₀)² gets blown up when constructing the Wald statistic. If the curvature is small (low information), then the squared distance
(θ̂_mle − θ₀)² gets attenuated when constructing the Wald statistic.
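In the Bernoulli model, Î(θ̂_mle|x) = n/(θ̂_mle(1 − θ̂_mle)), so the Wald statistic is easy to compute. A sketch where the data are generated away from the null, so the test should reject (the true θ, null value, sample size, and seed are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(8)
theta0, theta_true, n = 0.5, 0.7, 500
x = rng.binomial(1, theta_true, size=n)    # data generated under H1
theta_hat = x.mean()

# estimated sample information: I_hat(theta_hat|x) = n / (theta_hat (1 - theta_hat))
info_hat = n / (theta_hat * (1 - theta_hat))
wald = (theta_hat - theta0)**2 * info_hat
reject = wald > 3.841                      # 95% quantile of chi-square(1)
```

With the true θ far from θ₀ and a moderately large sample, the Wald statistic is large and the null is rejected.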
1.9.3 Lagrange Multiplier (Score) Statistic

Recall that the score evaluated at the MLE is zero,

    S(θ̂_mle|x) = d ln L(θ̂_mle|x)/dθ = 0,

whereas the score evaluated at the null value,

    S(θ₀|x) = d ln L(θ₀|x)/dθ,

is generally nonzero. The Lagrange multiplier (score) statistic is based on how far S(θ₀|x) is from zero.
Recall the following properties of the score S(θ|xᵢ). If H₀: θ = θ₀ is true, then

    E[S(θ₀|xᵢ)] = 0,
    var(S(θ₀|xᵢ)) = I(θ₀|xᵢ).

Further, it can be shown that

    n^{−1/2} S(θ₀|x) = (1/√n) Σᵢ₌₁ⁿ S(θ₀|xᵢ) →d N( 0, I(θ₀|xᵢ) ),
so that, approximately,

    S(θ₀|x) ~ N( 0, I(θ₀|x) ).

This result motivates the statistic

    LM = S(θ₀|x)² / I(θ₀|x) = S(θ₀|x)² · I(θ₀|x)⁻¹,

and under H₀,

    LM →d χ²(1).
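A sketch computing all three statistics for the Bernoulli model makes their different ingredients explicit: the LM statistic needs only the null value θ₀, the Wald statistic needs only the MLE, and the LR statistic needs both. The simulated data, null value, and seed are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(9)
theta0, n = 0.4, 200
x = rng.binomial(1, theta0, size=n)        # data generated under H0
s = x.sum()
theta_hat = s / n

# LM: score and information evaluated at the null value theta0
score0 = s / theta0 - (n - s) / (1 - theta0)
info0 = n / (theta0 * (1 - theta0))
LM = score0**2 / info0

# LR: log-likelihood at the null vs. at the MLE
def loglik(t):
    return s * np.log(t) + (n - s) * np.log(1 - t)
LR = -2 * (loglik(theta0) - loglik(theta_hat))

# Wald: squared distance scaled by information estimated at the MLE
info_hat = n / (theta_hat * (1 - theta_hat))
wald = (theta_hat - theta0)**2 * info_hat
```

All three statistics are nonnegative and share the same χ²(1) limit under H₀; in this model the LM statistic simplifies to n(θ̂ − θ₀)²/(θ₀(1 − θ₀)).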