Maximum Likelihood Estimation
Eric Zivot
May 14, 2001
This version: November 15, 2009
1.1 The Likelihood Function

The joint density is an n-dimensional function of the data x₁, …, xₙ given the (k×1) parameter vector θ. The joint density¹ satisfies

    f(x₁, …, xₙ; θ) ≥ 0,
    ∫ ⋯ ∫ f(x₁, …, xₙ; θ) dx₁ ⋯ dxₙ = 1.

The likelihood function is defined as the joint density treated as a function of the parameters θ:

    L(θ|x₁, …, xₙ) = f(x₁, …, xₙ; θ) = ∏ᵢ₌₁ⁿ f(xᵢ; θ),

where the factorization into the product holds under random sampling (iid data). Notice that the likelihood function is a k-dimensional function of θ given the data x₁, …, xₙ. It is important to keep in mind that the likelihood function, being a function of θ and not the data, is not a proper pdf. It is always positive but

    ∫ ⋯ ∫ L(θ|x₁, …, xₙ) dθ₁ ⋯ dθₖ ≠ 1.

¹To simplify notation, let the vector x = (x₁, …, xₙ) denote the observed sample. Then the joint pdf and likelihood function may be expressed as f(x; θ) and L(θ|x).
Example 1 Bernoulli Sampling

Let Xᵢ ~ Bernoulli(θ). That is, Xᵢ = 1 with probability θ and Xᵢ = 0 with probability 1 − θ, where 0 ≤ θ ≤ 1. The pdf for Xᵢ is

    f(xᵢ; θ) = θ^{xᵢ}(1 − θ)^{1−xᵢ},  xᵢ = 0, 1.

Let X₁, …, Xₙ be an iid sample with Xᵢ ~ Bernoulli(θ). The joint density/likelihood function is given by

    f(x; θ) = L(θ|x) = ∏ᵢ₌₁ⁿ θ^{xᵢ}(1 − θ)^{1−xᵢ} = θ^{Σᵢ₌₁ⁿ xᵢ}(1 − θ)^{n − Σᵢ₌₁ⁿ xᵢ}.
For a given value of θ and observed sample x, f(x; θ) gives the probability of observing
the sample. For example, suppose n = 5 and x = (0, …, 0). Now some values of θ
are more likely to have generated this sample than others. In particular, it is more
likely that θ is close to zero than one. To see this, note that the likelihood function
for this sample is

    L(θ|(0, …, 0)) = (1 − θ)⁵.

This function is illustrated in figure xxx. The likelihood function has a clear maximum
at θ = 0. That is, θ = 0 is the value of θ that makes the observed sample x = (0, …, 0)
most likely (highest probability).
Similarly, suppose x = (1, …, 1). Then the likelihood function is

    L(θ|(1, …, 1)) = θ⁵,

which is illustrated in figure xxx. Now the likelihood function has a maximum at θ = 1.
Example 2 Normal Sampling

Let X₁, …, Xₙ be an iid sample with Xᵢ ~ N(μ, σ²). The pdf for Xᵢ is

    f(xᵢ; θ) = (2πσ²)^{−1/2} exp( −(1/(2σ²))(xᵢ − μ)² ),  −∞ < μ < ∞, σ² > 0, −∞ < x < ∞,

so that θ = (μ, σ²)′. The likelihood function is given by

    L(θ|x) = ∏ᵢ₌₁ⁿ (2πσ²)^{−1/2} exp( −(1/(2σ²))(xᵢ − μ)² )
           = (2πσ²)^{−n/2} exp( −(1/(2σ²)) Σᵢ₌₁ⁿ (xᵢ − μ)² ).

Figure xxx illustrates the normal likelihood for a representative sample of size n = 25.
Notice that the likelihood has the same bell shape as a bivariate normal density.
Suppose σ² = 1. Then

    L(μ|x) = (2π)^{−n/2} exp( −(1/2) Σᵢ₌₁ⁿ (xᵢ − μ)² ).

Now

    Σᵢ₌₁ⁿ (xᵢ − μ)² = Σᵢ₌₁ⁿ (xᵢ − x̄ + x̄ − μ)²
                    = Σᵢ₌₁ⁿ [ (xᵢ − x̄)² + 2(xᵢ − x̄)(x̄ − μ) + (x̄ − μ)² ]
                    = Σᵢ₌₁ⁿ (xᵢ − x̄)² + n(x̄ − μ)²,

since Σᵢ₌₁ⁿ (xᵢ − x̄) = 0, so that

    L(μ|x) = (2π)^{−n/2} exp( −(1/2) [ Σᵢ₌₁ⁿ (xᵢ − x̄)² + n(x̄ − μ)² ] ).
Now consider the linear regression model with normal errors,

    yᵢ = xᵢ′ β + εᵢ, i = 1, …, n,
        (1×k)(k×1)

where xᵢ is a (k×1) vector of regressors and the errors satisfy εᵢ|xᵢ ~ iid N(0, σ²), so that

    f(εᵢ|xᵢ; θ) = (2πσ²)^{−1/2} exp( −(1/(2σ²)) εᵢ² ).

The Jacobian of the transformation from εᵢ to yᵢ is one, so the pdf of yᵢ|xᵢ is normal
with mean xᵢ′β and variance σ²:

    f(yᵢ|xᵢ; θ) = (2πσ²)^{−1/2} exp( −(1/(2σ²))(yᵢ − xᵢ′β)² ),

where θ = (β′, σ²)′. Given an iid sample of n observations, y and X, the joint density
of the sample is

    f(y|X; θ) = (2πσ²)^{−n/2} exp( −(1/(2σ²)) Σᵢ₌₁ⁿ (yᵢ − xᵢ′β)² )
              = (2πσ²)^{−n/2} exp( −(1/(2σ²)) (y − Xβ)′(y − Xβ) ).
1.2 The Maximum Likelihood Estimator

Suppose we have a random sample from the pdf f(xᵢ; θ) and we are interested in
estimating θ. The previous examples motivate an estimator as the value of θ that
makes the observed sample most likely. Formally, the maximum likelihood estimator,
denoted θ̂_mle, is the value of θ that maximizes L(θ|x). That is, θ̂_mle solves

    max_θ L(θ|x).

Since ln(·) is monotonically increasing, maximizing the log-likelihood ln L(θ|x) gives the same answer as maximizing the likelihood, and is usually more convenient. With random sampling, the log-likelihood has the particularly simple form

    ln L(θ|x) = ln( ∏ᵢ₌₁ⁿ f(xᵢ; θ) ) = Σᵢ₌₁ⁿ ln f(xᵢ; θ).
Since the MLE is defined through a maximization problem, we would like to know the
conditions under which we may determine the MLE using the techniques of calculus.
A regular pdf f(x; θ) provides a sufficient set of such conditions. We say that f(x; θ)
is regular if

1. The support of the random variables X, S_X = {x : f(x; θ) > 0}, does not
depend on θ;

2. f(x; θ) is at least three times differentiable with respect to θ;

3. The true value of θ lies in a compact set.
If f(x; θ) is regular then we may find the MLE by differentiating ln L(θ|x) and
solving the first order conditions

    ∂ ln L(θ̂_mle|x)/∂θ = 0.

Since θ is (k×1), the first order conditions stack the k partial derivatives:

    ∂ ln L(θ̂_mle|x)/∂θ = ( ∂ ln L(θ̂_mle|x)/∂θ₁, …, ∂ ln L(θ̂_mle|x)/∂θₖ )′.

The vector of derivatives of the log-likelihood function is called the score vector
and is denoted

    S(θ|x) = ∂ ln L(θ|x)/∂θ.

Since the log-likelihood is additive in ln f(xᵢ; θ), the score is additive in the scores for the individual observations,

    S(θ|x) = Σᵢ₌₁ⁿ S(θ|xᵢ),

where S(θ|xᵢ) = ∂ ln f(xᵢ; θ)/∂θ.
For the Bernoulli model, the log-likelihood is

    ln L(θ|x) = ln[ θ^{Σᵢ₌₁ⁿ xᵢ}(1 − θ)^{n − Σᵢ₌₁ⁿ xᵢ} ]
              = ( Σᵢ₌₁ⁿ xᵢ ) ln(θ) + ( n − Σᵢ₌₁ⁿ xᵢ ) ln(1 − θ),

and the score is

    S(θ|x) = ∂ ln L(θ|x)/∂θ = (1/θ) Σᵢ₌₁ⁿ xᵢ − (1/(1 − θ)) ( n − Σᵢ₌₁ⁿ xᵢ ).

The MLE satisfies S(θ̂_mle|x) = 0, which, after a little algebra, produces the MLE

    θ̂_mle = (1/n) Σᵢ₌₁ⁿ xᵢ.

Hence, the sample average is the MLE for θ in the Bernoulli model.
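As a quick numerical check of this closed form, the sketch below (using NumPy; the simulated data, seed, and grid are illustrative assumptions) compares the sample average with a grid search over the Bernoulli log-likelihood.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.binomial(1, 0.3, size=200)        # iid Bernoulli(0.3) sample

def log_lik(theta):
    s = x.sum()
    return s * np.log(theta) + (len(x) - s) * np.log(1 - theta)

theta_mle = x.mean()                      # closed-form MLE: the sample average

# brute-force grid search confirms the analytic maximizer
grid = np.linspace(0.01, 0.99, 981)       # step 0.001
theta_grid = grid[np.argmax([log_lik(t) for t in grid])]
```

The grid maximizer matches the sample mean up to the grid resolution, and the score (1/θ)Σxᵢ − (1/(1−θ))(n − Σxᵢ) evaluates to zero at θ̂_mle.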
For the normal model, the score vector has two elements:

    S(θ|x) = ( ∂ ln L(θ|x)/∂μ, ∂ ln L(θ|x)/∂σ² )′,

where

    ∂ ln L(θ|x)/∂μ = (1/σ²) Σᵢ₌₁ⁿ (xᵢ − μ),
    ∂ ln L(θ|x)/∂σ² = −(n/2)(σ²)⁻¹ + (1/2)(σ²)⁻² Σᵢ₌₁ⁿ (xᵢ − μ)².

The score for an observation is

    S(θ|xᵢ) = ( ∂ ln f(xᵢ; θ)/∂μ, ∂ ln f(xᵢ; θ)/∂σ² )′
            = ( (σ²)⁻¹(xᵢ − μ), −(1/2)(σ²)⁻¹ + (1/2)(σ²)⁻²(xᵢ − μ)² )′,

so that S(θ|x) = Σᵢ₌₁ⁿ S(θ|xᵢ).
Solving S(θ̂_mle|x) = 0 gives the normal equations

    ∂ ln L(θ̂_mle|x)/∂μ = (1/σ̂²_mle) Σᵢ₌₁ⁿ (xᵢ − μ̂_mle) = 0,
    ∂ ln L(θ̂_mle|x)/∂σ² = −(n/2)(σ̂²_mle)⁻¹ + (1/2)(σ̂²_mle)⁻² Σᵢ₌₁ⁿ (xᵢ − μ̂_mle)² = 0.

Solving the first equation gives

    μ̂_mle = (1/n) Σᵢ₌₁ⁿ xᵢ = x̄,

and substituting μ̂_mle = x̄ into the second equation and solving for σ² gives

    σ̂²_mle = (1/n) Σᵢ₌₁ⁿ (xᵢ − x̄)².

Notice that σ̂²_mle is not equal to the sample variance, which uses the divisor n − 1.
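The two normal equations can be verified numerically. A minimal sketch (synthetic data, seed, and parameter values are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, scale=1.5, size=500)   # iid N(2, 1.5^2) sample

mu_mle = x.mean()                              # solves the first normal equation
sigma2_mle = np.mean((x - mu_mle)**2)          # divisor n, not n - 1

s2 = x.var(ddof=1)                             # sample variance with divisor n - 1
```

The identity σ̂²_mle = ((n − 1)/n)·s² makes explicit that the MLE of σ² is slightly smaller than the sample variance.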
For the linear regression model, the elements of the score are

    ∂ ln L(θ|y, X)/∂β = −(1/(2σ²)) ∂/∂β [ y′y − 2β′X′y + β′X′Xβ ]
                      = (σ²)⁻¹ [ X′y − X′Xβ ],
    ∂ ln L(θ|y, X)/∂σ² = −(n/2)(σ²)⁻¹ + (1/2)(σ²)⁻² (y − Xβ)′(y − Xβ).

Solving ∂ ln L(θ|y, X)/∂β = 0 for β gives

    β̂_mle = (X′X)⁻¹X′y = β̂_OLS.

Next, solving ∂ ln L(θ|y, X)/∂σ² = 0 for σ² gives

    σ̂²_mle = (1/n)(y − Xβ̂_mle)′(y − Xβ̂_mle)
            ≠ σ̂²_OLS = (1/(n − k))(y − Xβ̂_OLS)′(y − Xβ̂_OLS).
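The equivalence of the MLE and OLS for β, and the differing divisors for the two variance estimators, can be checked directly. A sketch with simulated regression data (design, coefficients, and seed are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])  # intercept + 2 regressors
beta_true = np.array([1.0, -0.5, 2.0])
y = X @ beta_true + rng.normal(scale=0.8, size=n)

beta_mle = np.linalg.solve(X.T @ X, X.T @ y)   # (X'X)^(-1) X'y
resid = y - X @ beta_mle
sigma2_mle = resid @ resid / n                 # ML divisor n
s2 = resid @ resid / (n - k)                   # unbiased divisor n - k
```

`beta_mle` coincides with the least-squares solution, and the residuals are orthogonal to the columns of X, which is exactly the first order condition X′(y − Xβ̂) = 0.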
1.3 The Score, Hessian and Information Matrix

The (k×k) matrix of second derivatives of the log-likelihood is called the Hessian:

    H(θ|x) = ∂² ln L(θ|x)/∂θ∂θ′
           = [ ∂² ln L(θ|x)/∂θ₁∂θ₁  ⋯  ∂² ln L(θ|x)/∂θ₁∂θₖ ]
             [         ⋮            ⋱          ⋮           ]
             [ ∂² ln L(θ|x)/∂θₖ∂θ₁  ⋯  ∂² ln L(θ|x)/∂θₖ∂θₖ ].

With random sampling, the Hessian is additive in the Hessians for the individual observations,

    H(θ|x) = Σᵢ₌₁ⁿ ∂² ln f(xᵢ; θ)/∂θ∂θ′ = Σᵢ₌₁ⁿ H(θ|xᵢ),

and the information matrix is defined as minus the expected Hessian:

    I(θ|x) = −E[H(θ|x)] = −Σᵢ₌₁ⁿ E[H(θ|xᵢ)] = n·I(θ|xᵢ).

The last result says that the sample information matrix is equal to n times the
information matrix for an observation.
The following proposition relates some properties of the score function to the
information matrix.

Proposition 8 Let f(xᵢ; θ) be a regular pdf. Then

1. E[S(θ|xᵢ)] = ∫ S(θ|xᵢ) f(xᵢ; θ) dxᵢ = 0.

2. If θ is a scalar then

    var(S(θ|xᵢ)) = E[S(θ|xᵢ)²] = −E[H(θ|xᵢ)] = I(θ|xᵢ).

If θ is a vector then

    var(S(θ|xᵢ)) = E[S(θ|xᵢ)S(θ|xᵢ)′] = −E[H(θ|xᵢ)] = I(θ|xᵢ).

Proof. For part 1, we have

    E[S(θ|xᵢ)] = ∫ ( ∂ ln f(xᵢ; θ)/∂θ ) f(xᵢ; θ) dxᵢ
               = ∫ (1/f(xᵢ; θ)) ( ∂f(xᵢ; θ)/∂θ ) f(xᵢ; θ) dxᵢ
               = ∫ ∂f(xᵢ; θ)/∂θ dxᵢ
               = (∂/∂θ) ∫ f(xᵢ; θ) dxᵢ
               = (∂/∂θ) 1 = 0.

The key part of the proof is the ability to interchange the order of differentiation and
integration.
For part 2, consider the scalar case for simplicity. Proceeding as above, we get

    E[S(θ|xᵢ)²] = ∫ S(θ|xᵢ)² f(xᵢ; θ) dxᵢ = ∫ ( ∂ ln f(xᵢ; θ)/∂θ )² f(xᵢ; θ) dxᵢ
                = ∫ ( (1/f(xᵢ; θ)) ∂f(xᵢ; θ)/∂θ )² f(xᵢ; θ) dxᵢ
                = ∫ (1/f(xᵢ; θ)) ( ∂f(xᵢ; θ)/∂θ )² dxᵢ.

Next, note that

    ∂² ln f(xᵢ; θ)/∂θ² = ∂/∂θ [ (1/f(xᵢ; θ)) ∂f(xᵢ; θ)/∂θ ]
                       = −f(xᵢ; θ)⁻² ( ∂f(xᵢ; θ)/∂θ )² + f(xᵢ; θ)⁻¹ ∂²f(xᵢ; θ)/∂θ².

Then

    E[H(θ|xᵢ)] = ∫ [ −f(xᵢ; θ)⁻² ( ∂f(xᵢ; θ)/∂θ )² + f(xᵢ; θ)⁻¹ ∂²f(xᵢ; θ)/∂θ² ] f(xᵢ; θ) dxᵢ
               = −∫ f(xᵢ; θ)⁻¹ ( ∂f(xᵢ; θ)/∂θ )² dxᵢ + ∫ ∂²f(xᵢ; θ)/∂θ² dxᵢ
               = −E[S(θ|xᵢ)²] + (∂²/∂θ²) ∫ f(xᵢ; θ) dxᵢ
               = −E[S(θ|xᵢ)²],

since ∫ f(xᵢ; θ) dxᵢ = 1 implies that (∂²/∂θ²) ∫ f(xᵢ; θ) dxᵢ = 0.
1.4 The Concentrated Log-likelihood

In the normal model, the first order condition for σ²,

    −(n/2)(σ²)⁻¹ + (1/2)(σ²)⁻² Σᵢ₌₁ⁿ (xᵢ − μ)² = 0,

may be solved for σ² as a function of μ:

    σ²(μ) = (1/n) Σᵢ₌₁ⁿ (xᵢ − μ)².

Notice that any value of σ²(μ) defined this way satisfies the first order condition
∂ ln L(θ|x)/∂σ² = 0. If we substitute σ²(μ) for σ² in the log-likelihood function we get
the following concentrated log-likelihood function for μ:

    ln Lᶜ(μ|x) = −(n/2) ln(2π) − (n/2) ln(σ²(μ)) − (1/(2σ²(μ))) Σᵢ₌₁ⁿ (xᵢ − μ)²
               = −(n/2) ln(2π) − (n/2) ln( (1/n) Σᵢ₌₁ⁿ (xᵢ − μ)² )
                 − (1/2) ( (1/n) Σᵢ₌₁ⁿ (xᵢ − μ)² )⁻¹ (1/n) Σᵢ₌₁ⁿ (xᵢ − μ)²
               = −(n/2)(ln(2π) + 1) − (n/2) ln( (1/n) Σᵢ₌₁ⁿ (xᵢ − μ)² ).
Now we may determine the MLE for μ by maximizing the concentrated log-likelihood function ln Lᶜ(μ|x). The first order condition is

    ∂ ln Lᶜ(μ̂_mle|x)/∂μ = ( Σᵢ₌₁ⁿ (xᵢ − μ̂_mle) ) / ( (1/n) Σᵢ₌₁ⁿ (xᵢ − μ̂_mle)² ) = 0,

which is satisfied by μ̂_mle = x̄, provided not all of the xᵢ values are identical.
For some models it may not be possible to analytically concentrate the log-likelihood with respect to a subset of parameters. Nonetheless, it is still possible
in principle to concentrate the log-likelihood numerically.
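A numerical version of the concentration argument is easy to sketch: profile out σ² analytically, then maximize the concentrated log-likelihood over a grid of μ values. The data, seed, and grid below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(loc=1.0, scale=2.0, size=300)
n = len(x)

def concentrated_loglik(mu):
    # -(n/2)(ln(2*pi) + 1) - (n/2) * ln( (1/n) sum (x_i - mu)^2 )
    return -0.5 * n * (np.log(2 * np.pi) + 1) - 0.5 * n * np.log(np.mean((x - mu)**2))

grid = np.linspace(x.mean() - 1, x.mean() + 1, 2001)   # grid centered at x-bar
mu_hat = grid[np.argmax([concentrated_loglik(m) for m in grid])]
```

The grid maximizer lands on the sample mean, as the first order condition for the concentrated log-likelihood predicts.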
1.5 The Precision of the MLE
The likelihood, log-likelihood and score functions for a typical model are illustrated
in figure xxx. The likelihood function is always positive (since it is the joint density
of the sample) but the log-likelihood function is typically negative (being the log of
a number less than 1). Here the log-likelihood is globally concave and has a unique
maximum at θ̂_mle. Consequently, the score function is positive to the left of the
maximum, crosses zero at the maximum, and becomes negative to the right of the
maximum.
Intuitively, the precision of θ̂_mle depends on the curvature of the log-likelihood
function near θ̂_mle. If the log-likelihood is very curved or "steep" around θ̂_mle, then
θ will be precisely estimated. In this case, we say that we have a lot of information
about θ. On the other hand, if the log-likelihood is not curved or "flat" near θ̂_mle,
then θ will not be precisely estimated. Accordingly, we say that we do not have much
information about θ.
The extreme case of a completely flat likelihood in θ is illustrated in figure xxx.
Here, the sample contains no information about the true value of θ because every
value of θ produces the same value of the likelihood function. When this happens we
say that θ is not identified. Formally, θ is identified if for all θ₁ ≠ θ₂ there exists a
sample x for which L(θ₁|x) ≠ L(θ₂|x).
The curvature of the log-likelihood is measured by its second derivative (Hessian)
H(θ|x) = ∂² ln L(θ|x)/∂θ∂θ′. Since the Hessian is negative semi-definite, the information in
the sample about θ may be measured by −H(θ|x). If θ is a scalar then −H(θ|x) is
a positive number. The expected amount of information in the sample about the
parameter θ is the information matrix I(θ|x) = −E[H(θ|x)]. As we shall see, the
information matrix is directly related to the precision of the MLE.
1.5.1 The Cramer-Rao Lower Bound

For the Bernoulli model, recall that the score for an observation is S(θ|xᵢ) = (xᵢ − θ)/(θ(1 − θ)). Further, due to random sampling, I(θ|x) = n·I(θ|xᵢ) = n·var(S(θ|xᵢ)). Now, using
the chain rule it can be shown that

    H(θ|xᵢ) = (d/dθ) S(θ|xᵢ) = (d/dθ) [ (xᵢ − θ)/(θ(1 − θ)) ]
            = −( 1 + S(θ|xᵢ)(1 − 2θ) ) / ( θ(1 − θ) ).

The information for an observation is then

    I(θ|xᵢ) = −E[H(θ|xᵢ)] = ( 1 + E[S(θ|xᵢ)](1 − 2θ) ) / ( θ(1 − θ) ) = 1/( θ(1 − θ) ),

since

    E[S(θ|xᵢ)] = ( E[xᵢ] − θ )/( θ(1 − θ) ) = ( θ − θ )/( θ(1 − θ) ) = 0.

Alternatively,

    I(θ|xᵢ) = var(S(θ|xᵢ)) = var( (xᵢ − θ)/(θ(1 − θ)) )
            = var(xᵢ)/( θ²(1 − θ)² ) = θ(1 − θ)/( θ²(1 − θ)² )
            = 1/( θ(1 − θ) ).

The information for the sample is then

    I(θ|x) = n·I(θ|xᵢ) = n/( θ(1 − θ) ),

so that

    CRLB = I(θ|x)⁻¹ = θ(1 − θ)/n.

This is the lower bound on the variance of any unbiased estimator of θ.
Consider the MLE for θ, θ̂_mle = x̄. Now,

    E[θ̂_mle] = E[x̄] = θ,
    var(θ̂_mle) = var(x̄) = θ(1 − θ)/n.

Notice that the MLE is unbiased and its variance is equal to the CRLB. Therefore,
θ̂_mle is efficient.
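The equality var(θ̂_mle) = θ(1 − θ)/n can be verified by simulation. A Monte Carlo sketch (the true θ, sample size, replication count, and seed are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)
theta, n, reps = 0.4, 50, 20000
samples = rng.binomial(1, theta, size=(reps, n))
theta_hats = samples.mean(axis=1)          # MLE computed in each replication

mc_var = theta_hats.var()                  # Monte Carlo variance of the MLE
crlb = theta * (1 - theta) / n             # theta(1-theta)/n
```

Across replications the MLE is centered at θ and its variance matches the CRLB up to simulation noise.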
Remarks
For the normal model, the Hessian for an observation is

    H(θ|xᵢ) = [ ∂² ln f(xᵢ; θ)/∂μ²      ∂² ln f(xᵢ; θ)/∂μ∂σ²   ]
              [ ∂² ln f(xᵢ; θ)/∂σ²∂μ    ∂² ln f(xᵢ; θ)/∂(σ²)²  ].

Now

    ∂² ln f(xᵢ; θ)/∂μ² = −(σ²)⁻¹,
    ∂² ln f(xᵢ; θ)/∂μ∂σ² = −(σ²)⁻²(xᵢ − μ),
    ∂² ln f(xᵢ; θ)/∂σ²∂μ = −(σ²)⁻²(xᵢ − μ),
    ∂² ln f(xᵢ; θ)/∂(σ²)² = (1/2)(σ²)⁻² − (σ²)⁻³(xᵢ − μ)²,

so that

    I(θ|xᵢ) = −E[H(θ|xᵢ)]
            = [ (σ²)⁻¹                  (σ²)⁻²E[xᵢ − μ]                        ]
              [ (σ²)⁻²E[xᵢ − μ]         −(1/2)(σ²)⁻² + (σ²)⁻³E[(xᵢ − μ)²]      ].

Since E[xᵢ − μ] = 0 and E[(xᵢ − μ)²/σ²] = 1,² we then have

    I(θ|xᵢ) = [ (σ²)⁻¹   0             ]
              [ 0        (1/2)(σ²)⁻²   ],

and

    I(θ|x) = n·I(θ|xᵢ) = [ n(σ²)⁻¹   0               ]
                         [ 0         (n/2)(σ²)⁻²     ].
²(xᵢ − μ)²/σ² is a chi-square random variable with one degree of freedom. The expected value
of a chi-square random variable is equal to its degrees of freedom.
The CRLB is

    CRLB = I(θ|x)⁻¹ = [ σ²/n   0       ]
                      [ 0      2σ⁴/n   ].

Notice that the information matrix and the CRLB are diagonal matrices. The CRLB
for an unbiased estimator of μ is σ²/n, and the CRLB for an unbiased estimator of σ²
is 2σ⁴/n.
The MLEs for μ and σ² are

    μ̂_mle = x̄,
    σ̂²_mle = (1/n) Σᵢ₌₁ⁿ (xᵢ − μ̂_mle)².

Now

    E[μ̂_mle] = μ,
    E[σ̂²_mle] = ((n − 1)/n) σ²,

so that μ̂_mle is unbiased whereas σ̂²_mle is biased. This illustrates the fact that MLEs
are not necessarily unbiased. Furthermore,

    var(μ̂_mle) = σ²/n = CRLB,

and so μ̂_mle is efficient.
The MLE for σ² is biased and so the CRLB result does not apply. Consider the
unbiased estimator of σ²

    s² = (1/(n − 1)) Σᵢ₌₁ⁿ (xᵢ − x̄)².

Is the variance of s² equal to the CRLB? No. To see this, recall that

    (n − 1)s²/σ² ~ χ²(n − 1).

Further, if X ~ χ²(n − 1) then E[X] = n − 1 and var(X) = 2(n − 1). Therefore,

    s² = ( σ²/(n − 1) ) X,
    var(s²) = ( σ⁴/(n − 1)² ) var(X) = 2σ⁴/(n − 1).

Hence, var(s²) = 2σ⁴/(n − 1) > CRLB = 2σ⁴/n.

Remarks
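The variance comparison for s² can also be checked by simulation. A Monte Carlo sketch (true σ², sample size, replication count, and seed are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)
sigma2, n, reps = 4.0, 20, 40000
x = rng.normal(loc=0.0, scale=np.sqrt(sigma2), size=(reps, n))
s2 = x.var(axis=1, ddof=1)                 # unbiased estimator in each replication

mc_var = s2.var()                          # Monte Carlo variance of s^2
exact_var = 2 * sigma2**2 / (n - 1)        # 2 sigma^4/(n-1)
crlb = 2 * sigma2**2 / n                   # 2 sigma^4/n
```

The simulated variance of s² matches 2σ⁴/(n − 1), which strictly exceeds the CRLB 2σ⁴/n.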
For the linear regression model, the score is

    S(θ|y, X) = [ (σ²)⁻¹[X′y − X′Xβ]                                 ]
                [ −(n/2)(σ²)⁻¹ + (1/2)(σ²)⁻²(y − Xβ)′(y − Xβ)        ]
              = [ (σ²)⁻¹X′ε                                          ]
                [ −(n/2)(σ²)⁻¹ + (1/2)(σ²)⁻²ε′ε                      ],

where ε = y − Xβ. Now E[ε] = 0 and E[ε′ε] = nσ² (since ε′ε/σ² ~ χ²(n)), so that

    E[S(θ|y, X)] = [ (σ²)⁻¹X′E[ε]                                    ]
                   [ −(n/2)(σ²)⁻¹ + (1/2)(σ²)⁻²E[ε′ε]                ]
                 = [ 0 ]
                   [ 0 ].

To determine the Hessian and information matrix we need the second derivatives of
ln L(θ|y, X):

    ∂² ln L(θ|y, X)/∂β∂β′ = ∂/∂β′ ( (σ²)⁻¹[X′y − X′Xβ] ) = −(σ²)⁻¹X′X,
    ∂² ln L(θ|y, X)/∂β∂σ² = ∂/∂σ² ( (σ²)⁻¹[X′y − X′Xβ] ) = −(σ²)⁻²X′ε,
    ∂² ln L(θ|y, X)/∂σ²∂β′ = −(σ²)⁻²ε′X,
    ∂² ln L(θ|y, X)/∂(σ²)² = ∂/∂σ² ( −(n/2)(σ²)⁻¹ + (1/2)(σ²)⁻²ε′ε )
                           = (n/2)(σ²)⁻² − (σ²)⁻³ε′ε.

Therefore,

    H(θ|y, X) = [ −(σ²)⁻¹X′X     −(σ²)⁻²X′ε                   ]
                [ −(σ²)⁻²ε′X     (n/2)(σ²)⁻² − (σ²)⁻³ε′ε      ],

and

    I(θ|y, X) = −E[H(θ|y, X)]
              = [ (σ²)⁻¹X′X        (σ²)⁻²X′E[ε]                       ]
                [ (σ²)⁻²E[ε]′X     −(n/2)(σ²)⁻² + (σ²)⁻³E[ε′ε]        ]
              = [ (σ²)⁻¹X′X   0              ]
                [ 0           (n/2)(σ²)⁻²    ].
Notice that the information matrix is block diagonal in β and σ². The CRLB for
unbiased estimators of θ is then

    CRLB = I(θ|y, X)⁻¹ = [ σ²(X′X)⁻¹   0       ]
                         [ 0           2σ⁴/n   ].

Do the MLEs of β and σ² achieve the CRLB? First, β̂_mle is unbiased and var(β̂_mle|X) =
σ²(X′X)⁻¹ = CRLB for an unbiased estimator of β. Hence, β̂_mle is the most efficient
unbiased estimator (BUE). This is an improvement over the Gauss-Markov theorem,
which says that β̂_mle = β̂_OLS is the most efficient linear and unbiased estimator
(BLUE). Next, note that σ̂²_mle is not unbiased (why?), so the CRLB result does not
apply. What about the unbiased estimator s² = (n − k)⁻¹(y − Xβ̂_OLS)′(y − Xβ̂_OLS)?
It can be shown that var(s²|X) = 2σ⁴/(n − k) > 2σ⁴/n = CRLB for an unbiased estimator of
σ². Hence s² is not the most efficient unbiased estimator of σ².
1.6 Invariance Property of the MLE

One of the attractive features of the method of maximum likelihood is its invariance
to one-to-one transformations of the parameters of the log-likelihood. That is, if θ̂_mle
is the MLE of θ and α = h(θ) is a one-to-one function of θ, then α̂_mle = h(θ̂_mle) is the
MLE for α.

Example 13 Normal Model Continued

The log-likelihood is parameterized in terms of μ and σ², and we have the MLEs

    μ̂_mle = x̄,
    σ̂²_mle = (1/n) Σᵢ₌₁ⁿ (xᵢ − μ̂_mle)².

Since σ = (σ²)^{1/2} is a one-to-one function of σ², the invariance property gives

    σ̂_mle = (σ̂²_mle)^{1/2} = ( (1/n) Σᵢ₌₁ⁿ (xᵢ − μ̂_mle)² )^{1/2}.
1.7 Asymptotic Properties of the MLE

Under regularity conditions, the MLE has the following asymptotic properties:

1. θ̂_mle →p θ (consistency);

2. √n(θ̂_mle − θ) →d N(0, I(θ|xᵢ)⁻¹), where

    I(θ|xᵢ) = −E[H(θ|xᵢ)] = −E[ ∂² ln f(θ|xᵢ)/∂θ∂θ′ ].

That is,

    avar( √n(θ̂_mle − θ) ) = I(θ|xᵢ)⁻¹.

Alternatively, for large enough n,

    θ̂_mle ~ N( θ, n⁻¹ I(θ|xᵢ)⁻¹ ) = N( θ, I(θ|x)⁻¹ ).

The asymptotic normality in property 2 follows from an exact first order Taylor series
expansion of the first order conditions for a maximum of the log-likelihood about
θ₀:

    0 = S(θ̂_mle|x) = S(θ₀|x) + H(θ̄|x)(θ̂_mle − θ₀),  θ̄ = λθ̂_mle + (1 − λ)θ₀, λ ∈ [0, 1],

so that

    H(θ̄|x)(θ̂_mle − θ₀) = −S(θ₀|x)
    ⟹ √n(θ̂_mle − θ₀) = −( n⁻¹H(θ̄|x) )⁻¹ · n^{−1/2} S(θ₀|x).

Now

    n⁻¹H(θ̄|x) = (1/n) Σᵢ₌₁ⁿ H(θ̄|xᵢ) →p E[H(θ₀|xᵢ)] = −I(θ₀|xᵢ),
    n^{−1/2} S(θ₀|x) = (1/√n) Σᵢ₌₁ⁿ S(θ₀|xᵢ) →d N(0, I(θ₀|xᵢ)).

Therefore,

    √n(θ̂_mle − θ₀) →d I(θ₀|xᵢ)⁻¹ · N(0, I(θ₀|xᵢ)) = N(0, I(θ₀|xᵢ)⁻¹).

In practice, the information matrix for an observation is estimated using the Hessian evaluated at θ̂_mle,

    Î(θ̂_mle|xᵢ) = −(1/n) Σᵢ₌₁ⁿ H(θ̂_mle|xᵢ),

so that

    avâr( √n(θ̂_mle − θ) ) = Î(θ̂_mle|xᵢ)⁻¹.
To prove consistency of the MLE, one must show that Q₀(θ) = E[ln f(xᵢ|θ)] is
uniquely maximized at θ = θ₀. To do this, let f(x, θ₀) denote the true density and
let f(x, θ₁) denote the density evaluated at any θ₁ ≠ θ₀. Define the Kullback-Leibler
Information Criterion (KLIC) as

    K( f(x, θ₀), f(x, θ₁) ) = E₀[ ln( f(x, θ₀)/f(x, θ₁) ) ] = ∫ ln( f(x, θ₀)/f(x, θ₁) ) f(x, θ₀) dx,

where

    ln( f(x, θ₀)/f(x, θ₁) ) = ∞ if f(x, θ₁) = 0 and f(x, θ₀) > 0,
    K( f(x, θ₀), f(x, θ₁) ) = 0 if f(x, θ₀) = 0.

The KLIC is a measure of the ability of the likelihood ratio to distinguish between
f(x, θ₀) and f(x, θ₁) when f(x, θ₀) is true. The Shannon-Kolmogorov Information
Inequality gives the following result:

    K( f(x, θ₀), f(x, θ₁) ) ≥ 0,

with equality if and only if f(x, θ₀) = f(x, θ₁) for all values of x.
Example 14 Asymptotic results for MLE of Bernoulli distribution parameters
For the Bernoulli model,

    I(θ|xᵢ) = 1/( θ(1 − θ) ),

so that

    √n(θ̂_mle − θ) →d N( 0, θ(1 − θ) ).

Alternatively, for large enough n,

    θ̂_mle ~ N( θ, θ(1 − θ)/n ),

and the estimated asymptotic variance is

    avâr(θ̂_mle) = θ̂_mle(1 − θ̂_mle)/n = x̄(1 − x̄)/n.
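The asymptotic normality of the Bernoulli MLE can be illustrated by simulation: standardize θ̂_mle using the asymptotic variance and check that the result behaves like a standard normal. The true θ, sample size, replication count, and seed below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)
theta, n, reps = 0.3, 400, 20000
theta_hats = rng.binomial(1, theta, size=(reps, n)).mean(axis=1)

# standardize: z = sqrt(n) (theta_hat - theta) / sqrt(theta(1-theta))
z = np.sqrt(n) * (theta_hats - theta) / np.sqrt(theta * (1 - theta))
frac_within_196 = np.mean(np.abs(z) < 1.96)   # should be close to 0.95
```

The standardized statistic has mean near 0, standard deviation near 1, and about 95% of its mass within ±1.96, as the N(0, 1) limit predicts.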
Example 15 Asymptotic results for MLE of linear regression model parameters

In the linear regression model with normal errors,

    yᵢ = xᵢ′β + εᵢ, i = 1, …, n,
    εᵢ|xᵢ ~ iid N(0, σ²),

the MLE for θ = (β′, σ²)′ is

    β̂_mle = (X′X)⁻¹X′y,
    σ̂²_mle = n⁻¹(y − Xβ̂_mle)′(y − Xβ̂_mle).

For large enough n,

    ( β̂_mle, σ̂²_mle )′ ~ N( (β, σ²)′, V ),  V = [ σ²(X′X)⁻¹   0       ]
                                                [ 0           2σ⁴/n   ].

Further, the block diagonality of the information matrix implies that β̂_mle is asymptotically independent of σ̂²_mle.
1.8 Relationship Between ML and GMM

For general models, the first order conditions are p nonlinear equations in p unknowns.
Under regularity conditions, the MLE is consistent, asymptotically normally distributed, and efficient in the class of asymptotically normal estimators:

    θ̂_mle ~ N( θ, n⁻¹ I(θ|xᵢ)⁻¹ ),

where I(θ|xᵢ) = −E[H(θ|xᵢ)] = E[S(θ|xᵢ)S(θ|xᵢ)′].
To do GMM estimation, you need to know k ≥ p population moment conditions

    E[g(xᵢ, θ)] = 0.

The GMM estimator matches sample moments with the population moments. The
sample moments are

    gₙ(θ) = (1/n) Σᵢ₌₁ⁿ g(xᵢ, θ).

If k > p, the efficient GMM estimator minimizes the objective function

    J(θ, Ŝ⁻¹) = n·gₙ(θ)′ Ŝ⁻¹ gₙ(θ),

where S = E[g(xᵢ, θ)g(xᵢ, θ)′]. The first order conditions are

    ∂J(θ̂_gmm, Ŝ⁻¹)/∂θ = Gₙ(θ̂_gmm)′ Ŝ⁻¹ gₙ(θ̂_gmm) = 0,

where Gₙ(θ) denotes the Jacobian ∂gₙ(θ)/∂θ′. Under regularity conditions, the efficient GMM estimator is consistent, asymptotically normally distributed, and efficient in the class of asymptotically normal GMM
estimators for a given set of moment conditions.
The asymptotic efficiency of the MLE in the class of consistent and asymptotically
normal estimators implies that

    avar(θ̂_mle) − avar(θ̂_gmm) ≤ 0.

That is, the efficient GMM estimator is generally less efficient than the ML estimator.
The GMM estimator will be equivalent to the ML estimator if the moment conditions happen to correspond with the score associated with the pdf of an observation,
that is, if

    g(xᵢ, θ) = S(θ|xᵢ).

In this case, there are p moment conditions and the model is just identified. The
GMM estimator then satisfies the sample moment equations

    gₙ(θ̂_gmm) = (1/n) S(θ̂_gmm|x) = 0,

which implies that θ̂_gmm = θ̂_mle. Since

    G = E[ ∂S(θ|xᵢ)/∂θ′ ] = E[H(θ|xᵢ)] = −I(θ|xᵢ),
    S = E[S(θ|xᵢ)S(θ|xᵢ)′] = I(θ|xᵢ),

the asymptotic variance of the GMM estimator becomes

    (G′S⁻¹G)⁻¹ = ( I(θ|xᵢ) I(θ|xᵢ)⁻¹ I(θ|xᵢ) )⁻¹ = I(θ|xᵢ)⁻¹,

which is the asymptotic variance of the MLE.
1.9 Hypothesis Testing

Let X₁, …, Xₙ be iid with pdf f(x, θ), and assume that θ is a scalar. The hypotheses
to be tested are

    H₀: θ = θ₀ vs. H₁: θ ≠ θ₀.

A statistical test is a decision rule, based on the observed data, to either reject H₀ or
not reject H₀.
1.9.1 Likelihood Ratio Statistic

The likelihood ratio is

    λ = L(θ₀|x) / max_θ L(θ|x) = L(θ₀|x) / L(θ̂_mle|x),

which is the ratio of the likelihood evaluated under the null to the likelihood evaluated
at the MLE. By construction, 0 ≤ λ ≤ 1. If H₀: θ = θ₀ is true, then we should see
λ close to 1. The likelihood ratio (LR) statistic is

    LR = −2 ln λ = −2[ ln L(θ₀|x) − ln L(θ̂_mle|x) ],

and under H₀,

    LR →d χ²(1).

In general, the degrees of freedom of the chi-square limiting distribution depend on
the number of restrictions imposed under the null hypothesis. The decision rule for
the LR statistic is to reject H₀: θ = θ₀ at the α·100% level if LR > χ²₁₋α(1), where
χ²₁₋α(1) is the (1 − α)·100% quantile of the chi-square distribution with 1 degree of
freedom.
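For the Bernoulli model, the LR statistic has a simple closed form, sketched below with NumPy (the simulated data, null value, and seed are illustrative assumptions; 3.841 is the 95% quantile of χ²(1)).

```python
import numpy as np

rng = np.random.default_rng(7)
theta0, n = 0.5, 100
x = rng.binomial(1, theta0, size=n)        # data generated under H0
s = x.sum()
theta_hat = s / n                          # unrestricted MLE

def loglik(t):
    return s * np.log(t) + (n - s) * np.log(1 - t)

LR = -2 * (loglik(theta0) - loglik(theta_hat))
reject = LR > 3.841                        # reject H0 at the 5% level?
```

Because θ̂_mle maximizes the log-likelihood, ln L(θ₀|x) ≤ ln L(θ̂_mle|x) and the LR statistic is always nonnegative.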
1.9.2 Wald Statistic

The Wald statistic is based directly on the asymptotic normal distribution of θ̂_mle:

    θ̂_mle ~ N( θ, Î(θ̂_mle|x)⁻¹ ),

where Î(θ̂_mle|x) is a consistent estimate of the sample information matrix. An implication of the asymptotic normality result is that the usual t-ratio for testing
H₀: θ = θ₀,

    t = ( θ̂_mle − θ₀ ) / SÊ(θ̂_mle) = ( θ̂_mle − θ₀ ) / √( Î(θ̂_mle|x)⁻¹ )
      = ( θ̂_mle − θ₀ ) · √( Î(θ̂_mle|x) ),

is asymptotically distributed as a standard normal random variable. Using the continuous mapping theorem, it follows that the square of the t-statistic is asymptotically
distributed as a chi-square random variable with 1 degree of freedom. The Wald
statistic is

    Wald = t² = ( θ̂_mle − θ₀ )² / Î(θ̂_mle|x)⁻¹ = ( θ̂_mle − θ₀ )² · Î(θ̂_mle|x),

and under H₀,

    Wald →d χ²(1).

The intuition behind the Wald statistic is illustrated in Figure xxx. If the curvature of ln L(θ|x) near θ = θ̂_mle is big (high information), then the squared distance
(θ̂_mle − θ₀)² gets blown up when constructing the Wald statistic. If the curvature is small (low information), then the squared distance
(θ̂_mle − θ₀)² gets attenuated when constructing the Wald statistic.
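In the Bernoulli model, Î(θ̂_mle|x) = n/(θ̂_mle(1 − θ̂_mle)), so the Wald statistic is easy to compute. A sketch where the data are generated away from the null, so the test should reject (the true θ, null value, sample size, and seed are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(8)
theta0, theta_true, n = 0.5, 0.7, 500
x = rng.binomial(1, theta_true, size=n)    # data generated under H1
theta_hat = x.mean()

# estimated sample information: I_hat(theta_hat|x) = n / (theta_hat (1 - theta_hat))
info_hat = n / (theta_hat * (1 - theta_hat))
wald = (theta_hat - theta0)**2 * info_hat
reject = wald > 3.841                      # 95% quantile of chi-square(1)
```

With the true θ far from θ₀ and a moderately large sample, the Wald statistic is large and the null is rejected.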
1.9.3 Lagrange Multiplier (Score) Statistic

Recall that the score evaluated at the MLE is zero,

    S(θ̂_mle|x) = d ln L(θ̂_mle|x)/dθ = 0,

whereas the score evaluated at the null value,

    S(θ₀|x) = d ln L(θ₀|x)/dθ,

is generally nonzero. The Lagrange multiplier (score) statistic is based on how far S(θ₀|x) is from zero.
Recall the following properties of the score S(θ|xᵢ). If H₀: θ = θ₀ is true, then

    E[S(θ₀|xᵢ)] = 0,
    var(S(θ₀|xᵢ)) = I(θ₀|xᵢ).

Further, it can be shown that

    n^{−1/2} S(θ₀|x) = (1/√n) Σᵢ₌₁ⁿ S(θ₀|xᵢ) →d N( 0, I(θ₀|xᵢ) ),
so that, approximately,

    S(θ₀|x) ~ N( 0, I(θ₀|x) ).

This result motivates the statistic

    LM = S(θ₀|x)² / I(θ₀|x) = S(θ₀|x)² · I(θ₀|x)⁻¹,

and under H₀,

    LM →d χ²(1).
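A sketch computing all three statistics for the Bernoulli model makes their different ingredients explicit: the LM statistic needs only the null value θ₀, the Wald statistic needs only the MLE, and the LR statistic needs both. The simulated data, null value, and seed are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(9)
theta0, n = 0.4, 200
x = rng.binomial(1, theta0, size=n)        # data generated under H0
s = x.sum()
theta_hat = s / n

# LM: score and information evaluated at the null value theta0
score0 = s / theta0 - (n - s) / (1 - theta0)
info0 = n / (theta0 * (1 - theta0))
LM = score0**2 / info0

# LR: log-likelihood at the null vs. at the MLE
def loglik(t):
    return s * np.log(t) + (n - s) * np.log(1 - t)
LR = -2 * (loglik(theta0) - loglik(theta_hat))

# Wald: squared distance scaled by information estimated at the MLE
info_hat = n / (theta_hat * (1 - theta_hat))
wald = (theta_hat - theta0)**2 * info_hat
```

All three statistics are nonnegative and share the same χ²(1) limit under H₀; in this model the LM statistic simplifies to n(θ̂ − θ₀)²/(θ₀(1 − θ₀)).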