
Theoretical Statistics. Lecture 15.

Peter Bartlett

M-Estimators.
Consistency of M-Estimators.
Nonparametric maximum likelihood.

M-estimators

Goal: estimate a parameter $\theta$ of the distribution $P$ of observations
$X_1, \ldots, X_n$.

Define a criterion $\theta \mapsto M_n(\theta)$ in terms of functions $m_\theta : \mathcal{X} \to \mathbb{R}$,
$$M_n(\theta) = P_n m_\theta.$$
The estimator $\hat\theta = \arg\max_\theta M_n(\theta)$ is called an M-estimator (M for maximum).

Example: maximum likelihood uses
$$m_\theta(x) = \log p_\theta(x).$$
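As an aside (not part of the slides), here is a minimal numerical sketch of an M-estimator. It assumes a $N(\theta, 1)$ location model, so $m_\theta(x) = \log p_\theta(x)$, and maximizes the empirical criterion $P_n m_\theta$ directly; the simulated data and the use of scipy's bounded scalar optimizer are illustrative choices only.

    import numpy as np
    from scipy.optimize import minimize_scalar

    rng = np.random.default_rng(0)
    x = rng.normal(loc=2.0, scale=1.0, size=500)   # observations X_1, ..., X_n

    # m_theta(x) = log p_theta(x) for an assumed N(theta, 1) model
    def M_n(theta):
        return np.mean(-0.5 * (x - theta) ** 2 - 0.5 * np.log(2 * np.pi))

    # M-estimator: maximize P_n m_theta (minimize its negative)
    theta_hat = minimize_scalar(lambda t: -M_n(t), bounds=(-10.0, 10.0), method="bounded").x
    print(theta_hat, x.mean())   # for this model the M-estimator is the sample mean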

Z-estimators

Can maximize by setting derivatives to zero:
$$\Psi_n(\theta) = P_n \psi_\theta = 0.$$
These are estimating equations. van der Vaart calls this a Z-estimator (Z for zero), but it's often called an M-estimator (even if there's no maximization).

Example: maximum likelihood uses the score,
$$\psi_\theta(x) = \nabla_\theta \log p_\theta(x).$$
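Again as an illustrative aside, the estimating-equation view amounts to root-finding. This sketch assumes the $N(\theta, 1)$ model, whose score is $\psi_\theta(x) = x - \theta$; Brent's method is just one convenient root finder.

    import numpy as np
    from scipy.optimize import brentq

    rng = np.random.default_rng(1)
    x = rng.normal(loc=2.0, scale=1.0, size=500)

    # Score of the assumed N(theta, 1) model: psi_theta(x) = x - theta
    def Psi_n(theta):
        return np.mean(x - theta)

    theta_hat = brentq(Psi_n, -10.0, 10.0)   # zero of the estimating equation Psi_n(theta) = 0
    print(theta_hat, x.mean())               # agree up to the root-finder's tolerance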

M-estimators and Z-estimators

Of course, sometimes we cannot transform an M-estimator into a Z-estimator.
Example: $p_\theta$ = uniform on $[0, \theta]$ is not differentiable in $\theta$, and there is no natural Z-estimator. The M-estimator chooses
$$\hat\theta = \arg\max_\theta P_n m_\theta
= \arg\max_\theta P_n \log \frac{1[X \in [0, \theta]]}{\theta}
= \max_i X_i.$$
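To spell out the last step (a short expansion of the displayed chain, not an additional result):
$$P_n \log \frac{1[X \in [0, \theta]]}{\theta} =
\begin{cases} -\log\theta & \text{if } \theta \ge \max_i X_i,\\[2pt] -\infty & \text{otherwise}, \end{cases}$$
so the criterion is maximized by taking $\theta$ as small as the constraint allows, namely $\hat\theta = \max_i X_i$.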

M-estimators and Z-estimators: Examples

Mean:
$$m_\theta(x) = -(x - \theta)^2, \qquad \psi_\theta(x) = x - \theta.$$

Median:
$$m_\theta(x) = -|x - \theta|, \qquad \psi_\theta(x) = \mathrm{sign}(x - \theta).$$
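A quick numerical check (an aside; the exponential data and the grid search are arbitrary illustrative choices): maximizing $P_n m_\theta$ for these two criteria recovers the sample mean and the sample median.

    import numpy as np

    rng = np.random.default_rng(4)
    x = rng.exponential(size=301)
    grid = np.linspace(x.min(), x.max(), 4001)

    # Maximize P_n m_theta over a grid for each criterion
    mean_hat = grid[np.argmax([-np.mean((x - t) ** 2) for t in grid])]
    median_hat = grid[np.argmax([-np.mean(np.abs(x - t)) for t in grid])]

    print(mean_hat, x.mean())           # agree up to the grid resolution
    print(median_hat, np.median(x))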

M-estimators and Z-estimators: Examples

Huber: [picture: the Huber loss $r_k$ and its clipped-linear derivative]
$$m_\theta(x) = -r_k(x - \theta), \qquad
r_k(x) = \begin{cases}
\tfrac{1}{2}k^2 - k(x + k) & \text{if } x < -k,\\[2pt]
\tfrac{1}{2}x^2 & \text{if } |x| \le k,\\[2pt]
\tfrac{1}{2}k^2 + k(x - k) & \text{if } x > k.
\end{cases}$$
$$\psi_\theta(x) = [x - \theta]_{-k}^{k}, \qquad
[x]_{-k}^{k} = \begin{cases}
-k & \text{if } x < -k,\\[2pt]
x & \text{if } |x| \le k,\\[2pt]
k & \text{if } x > k.
\end{cases}$$

These are all location estimators: $m_\theta(x) = m(x - \theta)$, $\psi_\theta(x) = \psi(x - \theta)$.
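A sketch of the corresponding Z-estimator (an aside; the contaminated sample and the value $k = 1.345$ are illustrative choices): the clipped $\psi$ makes the location estimate far less sensitive to outliers than the mean.

    import numpy as np
    from scipy.optimize import brentq

    rng = np.random.default_rng(2)
    x = np.concatenate([rng.normal(0.0, 1.0, 95), rng.normal(50.0, 1.0, 5)])  # 5% gross outliers

    k = 1.345  # truncation level; any fixed k > 0 works here

    def Psi_n(theta):
        # psi_theta(x) = [x - theta] clipped to [-k, k]
        return np.mean(np.clip(x - theta, -k, k))

    theta_huber = brentq(Psi_n, x.min(), x.max())
    print(theta_huber, x.mean(), np.median(x))  # the Huber estimate stays near 0; the mean does not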

Consistency of M-estimators and Z-estimators

We want to show that $\hat\theta \stackrel{P}{\to} \theta_0$, where $\hat\theta$ approximately maximizes
$M_n(\theta) = P_n m_\theta$ and $\theta_0$ maximizes $M(\theta) = P m_\theta$. We use a ULLN.

Theorem: Suppose that

1. $\sup_\theta |M_n(\theta) - M(\theta)| \stackrel{P}{\to} 0$,
2. for all $\epsilon > 0$, $\sup\{M(\theta) : d(\theta, \theta_0) \ge \epsilon\} < M(\theta_0)$, and
3. $M_n(\hat\theta_n) \ge M_n(\theta_0) - o_P(1)$.

Then $\hat\theta_n \stackrel{P}{\to} \theta_0$.

(2) is an identifiability condition: approximately maximizing $M(\theta)$ unambiguously specifies $\theta_0$. It suffices if there is a unique maximizer, $\Theta$ is compact, and $M$ is continuous.

Proof

From (2), for all $\epsilon > 0$ there is a $\delta > 0$ such that
$$\begin{aligned}
\Pr\big(d(\hat\theta_n, \theta_0) \ge \epsilon\big)
&\le \Pr\big(M(\theta_0) - M(\hat\theta_n) \ge \delta\big) \\
&= \Pr\big(M(\theta_0) - M_n(\theta_0) + M_n(\theta_0) - M_n(\hat\theta_n) + M_n(\hat\theta_n) - M(\hat\theta_n) \ge \delta\big) \\
&\le \Pr\big(M(\theta_0) - M_n(\theta_0) \ge \delta/3\big) + \Pr\big(M_n(\theta_0) - M_n(\hat\theta_n) \ge \delta/3\big) \\
&\qquad + \Pr\big(M_n(\hat\theta_n) - M(\hat\theta_n) \ge \delta/3\big).
\end{aligned}$$

Then (1) implies the first and third probabilities go to zero, and (3) implies the second probability goes to zero.

Consistency of M-estimators and Z-estimators

Same thing for Z-estimators: finding $\hat\theta$ that is an approximate zero of
$\Psi_n(\theta) = P_n \psi_\theta$ leads to $\hat\theta \stackrel{P}{\to} \theta_0$, the unique zero of $\Psi(\theta) = P \psi_\theta$.

Theorem: Suppose that

1. $\sup_\theta \|\Psi_n(\theta) - \Psi(\theta)\| \stackrel{P}{\to} 0$,
2. for all $\epsilon > 0$, $\inf\{\|\Psi(\theta)\| : d(\theta, \theta_0) \ge \epsilon\} > 0 = \|\Psi(\theta_0)\|$, and
3. $\Psi_n(\hat\theta_n) = o_P(1)$.

Then $\hat\theta_n \stackrel{P}{\to} \theta_0$.

Proof: Choosing $M_n(\theta) = -\|\Psi_n(\theta)\|$ and $M(\theta) = -\|\Psi(\theta)\|$ in the previous theorem implies the result.

Example: Sample median

The sample median $\hat\theta_n$ is the zero of
$$\Psi_n(\theta) = P_n \psi_\theta(X) = P_n\, \mathrm{sign}(X - \theta).$$
Suppose that $P$ is continuous, with positive density around its median, and check the conditions:

1. The class $\{x \mapsto \mathrm{sign}(x - \theta) : \theta \in \mathbb{R}\}$ is Glivenko-Cantelli.
2. The population median $\theta_0$ is unique, so for all $\epsilon > 0$,
$$P(X < \theta_0 - \epsilon) < \frac{1}{2} < P(X < \theta_0 + \epsilon).$$
3. The sample median always has $|P_n\, \mathrm{sign}(X - \hat\theta_n)| = 0$ (with the convention $\mathrm{sign}(0) = 0$).
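A small simulation (an aside; Cauchy data are an arbitrary choice with a continuous, positive density at its median 0) illustrating both condition (3) and the consistency conclusion:

    import numpy as np

    rng = np.random.default_rng(3)
    for n in [100, 1000, 10000]:
        x = rng.standard_cauchy(n)
        theta_hat = np.median(x)
        # theta_hat approaches the population median 0, and P_n sign(X - theta_hat) is (numerically) 0
        print(n, theta_hat, np.mean(np.sign(x - theta_hat)))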

ULLN and M-estimators

Notice the ULLN condition:
$$\sup_\theta |M_n(\theta) - M(\theta)| \stackrel{P}{\to} 0.$$
Typically, this requires the empirical process $\theta \mapsto P_n m_\theta$ to be totally bounded. This can be problematic if $m_\theta$ is unbounded. For instance:

Mean: $m_\theta(x) = -(x - \theta)^2$,
Median: $m_\theta(x) = -|x - \theta|$.

We can get around the problem by restricting to a compact set where most of the mass of $P$ lies, and showing that this does not affect the asymptotics. In that case, we can also restrict $\theta$ to an appropriate compact subset.

Non-parametric maximum likelihood

Estimate $P$ on $\mathcal{X}$. Suppose it has a density
$$p_0 = \frac{dP}{d\mu} \in \mathcal{P},$$
where $\mathcal{P}$ is a family of densities. Define the maximum likelihood estimate
$$\hat p_n = \arg\max_{p \in \mathcal{P}} P_n \log p.$$

We'll show conditions for which $\hat p_n$ is Hellinger consistent, that is,
$h(\hat p_n, p_0) \stackrel{a.s.}{\to} 0$, where $h$ is the Hellinger distance:
$$h(p, q) = \left( \frac{1}{2} \int \left( p^{1/2} - q^{1/2} \right)^2 d\mu \right)^{1/2}.$$
[The 1/2 ensures $0 \le h(p, q) \le 1$.]
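A numerical sketch of this definition (an aside; the two unit-variance normal densities and the grid approximation of the integral are illustrative choices):

    import numpy as np
    from scipy.stats import norm

    t = np.linspace(-12.0, 12.0, 20001)
    dt = t[1] - t[0]
    p = norm.pdf(t, loc=0.0, scale=1.0)
    q = norm.pdf(t, loc=1.0, scale=1.0)

    # h(p, q)^2 = (1/2) * integral of (sqrt(p) - sqrt(q))^2
    h2 = 0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2) * dt
    print(np.sqrt(h2))
    # closed form for N(mu1, 1) vs N(mu2, 1): h^2 = 1 - exp(-(mu1 - mu2)^2 / 8)
    print(np.sqrt(1.0 - np.exp(-1.0 / 8.0)))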

Hellinger distance

We have
$$\begin{aligned}
h(p, q)^2 &= \frac{1}{2} \int \left( p^{1/2} - q^{1/2} \right)^2 d\mu \\
&= \frac{1}{2} \int \left( p + q - 2 p^{1/2} q^{1/2} \right) d\mu \\
&= 1 - \int p^{1/2} q^{1/2}\, d\mu.
\end{aligned}$$
This latter integral is called the Hellinger affinity. Expressing $h$ in this form can simplify its calculation for product densities. Notice that, by Cauchy-Schwarz,
$$\int p^{1/2} q^{1/2}\, d\mu \le \left( \int p\, d\mu \int q\, d\mu \right)^{1/2} = 1,$$
so $h(p, q) \in [0, 1]$.
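For instance (an added remark expanding the comment about product densities): if $p = p_1 \otimes \cdots \otimes p_n$ and $q = q_1 \otimes \cdots \otimes q_n$, the affinity factorizes,
$$\int p^{1/2} q^{1/2}\, d(\mu_1 \times \cdots \times \mu_n) = \prod_{i=1}^n \int p_i^{1/2} q_i^{1/2}\, d\mu_i,
\qquad\text{so}\qquad
h(p, q)^2 = 1 - \prod_{i=1}^n \left( 1 - h(p_i, q_i)^2 \right).$$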

Non-parametric maximum likelihood

The Kullback-Leibler divergence between $p$ and $q$ is
$$d_{KL}(p, q) = \int \log\frac{q}{p}\, q\, d\mu.$$
Clearly, $d_{KL}(p, p) = 0$. Also, since $-\log(\cdot)$ is convex,
$$d_{KL}(p, q) = -\int \log\frac{p}{q}\, q\, d\mu \ge -\log \int \frac{p}{q}\, q\, d\mu = 0.$$
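A standard side remark (not on the slide) connecting this to the Hellinger distance of the previous slides: by Jensen's inequality and $e^{-x} \ge 1 - x$,
$$h(p, q)^2 = 1 - \int p^{1/2} q^{1/2}\, d\mu \;\le\; \tfrac{1}{2}\, d_{KL}(p, q),$$
so controlling the KL divergence of $\hat p_n$ from $p_0$ also controls their Hellinger distance.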

Non-parametric maximum likelihood

Relating KL-divergence to a ULLN:
$$\begin{aligned}
d_{KL}(\hat p_n, p_0) &= \int \log\frac{p_0}{\hat p_n}\, p_0\, d\mu \\
&\le \int \log\frac{p_0}{\hat p_n}\, p_0\, d\mu - P_n \log\frac{p_0}{\hat p_n} \\
&= P \log\frac{p_0}{\hat p_n} - P_n \log\frac{p_0}{\hat p_n} \\
&\le \|P - P_n\|_{\mathcal{G}},
\end{aligned}$$
where the first inequality follows from the fact that $\hat p_n$ maximizes $P_n \log p$ over $p \in \mathcal{P}$ (so $P_n \log(p_0/\hat p_n) \le 0$), and the class $\mathcal{G}$ is defined as
$$\mathcal{G} = \left\{ 1[p_0 > 0] \log\frac{p_0}{p} : p \in \mathcal{P} \right\}.$$

Non-parametric maximum likelihood

One problem here is that $\log(p_0/p)$ is unbounded, since $p$ can be zero.
We'll take a different approach. For any $p \in \mathcal{P}$, consider the mixture
$$\bar p = \frac{p + p_0}{2}.$$
If the class $\mathcal{P}$ is convex and $\hat p_n, p_0 \in \mathcal{P}$, this mixture satisfies
$P_n \log \bar p_n \le P_n \log \hat p_n$. This is behind the following lemma.

Lemma: Define
$$\bar p_n = \frac{\hat p_n + p_0}{2}.$$
If $\mathcal{P}$ is convex,
$$h(\bar p_n, p_0)^2 \le \int \frac{\hat p_n}{\bar p_n}\, d(P_n - P).$$
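A proof sketch (filling in steps not shown on the slide; it uses only $\log x \le x - 1$ and an elementary comparison of Hellinger and chi-square distances): since $\hat p_n$ maximizes $P_n \log p$ over the convex class and $\bar p_n \in \mathcal{P}$,
$$0 \le P_n \log\frac{\hat p_n}{\bar p_n} \le P_n\!\left(\frac{\hat p_n}{\bar p_n}\right) - 1,$$
so
$$\int \frac{\hat p_n}{\bar p_n}\, d(P_n - P) \;\ge\; 1 - P\!\left(\frac{\hat p_n}{\bar p_n}\right)
\;=\; \int \frac{(p_0 - \bar p_n)^2}{\bar p_n}\, d\mu
\;\ge\; 2\, h(\bar p_n, p_0)^2 \;\ge\; h(\bar p_n, p_0)^2,$$
where the middle equality uses $p_0 - \hat p_n = 2(p_0 - \bar p_n)$ and $\int \bar p_n\, d\mu = 1$, and the next step uses $(\sqrt{p_0} - \sqrt{\bar p_n})^2 \le (p_0 - \bar p_n)^2/\bar p_n$.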

Non-parametric maximum likelihood

Theorem: For a convex class $\mathcal{P}$ of densities, if $P$ has density $p_0 \in \mathcal{P}$ and
$\hat p_n$ maximizes likelihood over $\mathcal{P}$, then with $\bar p_n = (\hat p_n + p_0)/2$,
$$h(\bar p_n, p_0)^2 \le \|P - P_n\|_{\mathcal{G}},$$
where
$$\mathcal{G} = \left\{ \frac{2p}{p + p_0} : p \in \mathcal{P} \right\}.$$

Notice that functions in $\mathcal{G}$ are bounded between 0 and 2.
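A remark not on the slide, closing the loop with the Hellinger-consistency claim for $\hat p_n$ itself: one can check pointwise that $(\sqrt{\bar p_n} - \sqrt{p_0})^2 \ge (1 - 1/\sqrt{2})^2 (\sqrt{\hat p_n} - \sqrt{p_0})^2$, so
$$h(\hat p_n, p_0) \;\le\; (2 + \sqrt{2})\, h(\bar p_n, p_0),$$
and the bound in the theorem, together with $\|P - P_n\|_{\mathcal{G}} \to 0$, gives $h(\hat p_n, p_0) \to 0$.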

Non-parametric maximum likelihood: Example

Lemma: Suppose $\mathcal{P}$ is a set of densities on a compact subset $\mathcal{X}$ of $\mathbb{R}^d$.
Fix a norm $\|\cdot\|$ on $\mathbb{R}^d$. Suppose that, for all $p \in \mathcal{P}$,
$$\left| \frac{p(x)}{p(y)} - 1 \right| \le L \|x - y\|.$$
Then:

1. For all $p \in \mathrm{conv}\,\mathcal{P}$, $\left| \frac{p(x)}{p(y)} - 1 \right| \le L \|x - y\|$.
2. For all $p, p_0 \in \mathrm{conv}\,\mathcal{P}$, $\frac{2p}{p + p_0}$ is $O(L^2)$-Lipschitz w.r.t. $\|\cdot\|$.
3. $\|P - P_n\|_{\mathcal{G}} \stackrel{a.s.}{\to} 0$, where
$$\mathcal{G} = \left\{ \frac{2p}{p + p_0} : p \in \mathrm{conv}\,\mathcal{P} \right\}.$$

Non-parametric maximum likelihood: Example

But notice that the dependence on the dimension $d$ is terrible: the rate is exponentially slow in $d$. The Lipschitz property is a very weak restriction.

