
Theoretical Statistics. Lecture 15.

Peter Bartlett

M-Estimators.
Consistency of M-Estimators.
Nonparametric maximum likelihood.

M-estimators

Goal: estimate a parameter $\theta$ of the distribution $P$ of observations
$X_1, \ldots, X_n$.

Define a criterion $\theta \mapsto M_n(\theta)$ in terms of functions $m_\theta : \mathcal{X} \to \mathbb{R}$,
$$M_n(\theta) = P_n m_\theta.$$
The estimator $\hat\theta = \arg\max_\theta M_n(\theta)$ is called an M-estimator (M for maximum).

Example: maximum likelihood uses
$$m_\theta(x) = \log p_\theta(x).$$
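As an aside (not part of the slides), here is a minimal numerical sketch of an M-estimator. It assumes a $N(\theta, 1)$ location model, so $m_\theta(x) = \log p_\theta(x)$, and maximizes the empirical criterion $P_n m_\theta$ directly; the simulated data and the use of scipy's bounded scalar optimizer are illustrative choices only.

    import numpy as np
    from scipy.optimize import minimize_scalar

    rng = np.random.default_rng(0)
    x = rng.normal(loc=2.0, scale=1.0, size=500)   # observations X_1, ..., X_n

    # m_theta(x) = log p_theta(x) for an assumed N(theta, 1) model
    def M_n(theta):
        return np.mean(-0.5 * (x - theta) ** 2 - 0.5 * np.log(2 * np.pi))

    # M-estimator: maximize P_n m_theta (minimize its negative)
    theta_hat = minimize_scalar(lambda t: -M_n(t), bounds=(-10.0, 10.0), method="bounded").x
    print(theta_hat, x.mean())   # for this model the M-estimator is the sample mean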

Z-estimators

Can maximize by setting derivatives to zero:
$$\Psi_n(\theta) = P_n \psi_\theta = 0.$$
These are estimating equations. van der Vaart calls this a Z-estimator (Z for zero), but it's often called an M-estimator (even if there's no maximization).

Example: maximum likelihood uses the score,
$$\psi_\theta(x) = \nabla_\theta \log p_\theta(x).$$
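Again as an illustrative aside, the estimating-equation view amounts to root-finding. This sketch assumes the $N(\theta, 1)$ model, whose score is $\psi_\theta(x) = x - \theta$; Brent's method is just one convenient root finder.

    import numpy as np
    from scipy.optimize import brentq

    rng = np.random.default_rng(1)
    x = rng.normal(loc=2.0, scale=1.0, size=500)

    # Score of the assumed N(theta, 1) model: psi_theta(x) = x - theta
    def Psi_n(theta):
        return np.mean(x - theta)

    theta_hat = brentq(Psi_n, -10.0, 10.0)   # zero of the estimating equation Psi_n(theta) = 0
    print(theta_hat, x.mean())               # agree up to the root-finder's tolerance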

M-estimators and Z-estimators

Of course, sometimes we cannot transform an M-estimator into a Z-estimator.
Example: $p_\theta$ = uniform on $[0, \theta]$ is not differentiable in $\theta$, and there is no natural Z-estimator. The M-estimator chooses
$$\hat\theta = \arg\max_\theta P_n m_\theta
= \arg\max_\theta P_n \log \frac{1[X \in [0, \theta]]}{\theta}
= \max_i X_i.$$
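To spell out the last step (a short expansion of the displayed chain, not an additional result):
$$P_n \log \frac{1[X \in [0, \theta]]}{\theta} =
\begin{cases} -\log\theta & \text{if } \theta \ge \max_i X_i,\\[2pt] -\infty & \text{otherwise}, \end{cases}$$
so the criterion is maximized by taking $\theta$ as small as the constraint allows, namely $\hat\theta = \max_i X_i$.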

M-estimators and Z-estimators: Examples

Mean:
$$m_\theta(x) = -(x - \theta)^2, \qquad \psi_\theta(x) = x - \theta.$$

Median:
$$m_\theta(x) = -|x - \theta|, \qquad \psi_\theta(x) = \mathrm{sign}(x - \theta).$$
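A quick numerical check (an aside; the exponential data and the grid search are arbitrary illustrative choices): maximizing $P_n m_\theta$ for these two criteria recovers the sample mean and the sample median.

    import numpy as np

    rng = np.random.default_rng(4)
    x = rng.exponential(size=301)
    grid = np.linspace(x.min(), x.max(), 4001)

    # Maximize P_n m_theta over a grid for each criterion
    mean_hat = grid[np.argmax([-np.mean((x - t) ** 2) for t in grid])]
    median_hat = grid[np.argmax([-np.mean(np.abs(x - t)) for t in grid])]

    print(mean_hat, x.mean())           # agree up to the grid resolution
    print(median_hat, np.median(x))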

M-estimators and Z-estimators: Examples

Huber: [picture: the Huber loss $r_k$ and its clipped-linear derivative]
$$m_\theta(x) = -r_k(x - \theta), \qquad
r_k(x) = \begin{cases}
\tfrac{1}{2}k^2 - k(x + k) & \text{if } x < -k,\\[2pt]
\tfrac{1}{2}x^2 & \text{if } |x| \le k,\\[2pt]
\tfrac{1}{2}k^2 + k(x - k) & \text{if } x > k.
\end{cases}$$
$$\psi_\theta(x) = [x - \theta]_{-k}^{k}, \qquad
[x]_{-k}^{k} = \begin{cases}
-k & \text{if } x < -k,\\[2pt]
x & \text{if } |x| \le k,\\[2pt]
k & \text{if } x > k.
\end{cases}$$

These are all location estimators: $m_\theta(x) = m(x - \theta)$, $\psi_\theta(x) = \psi(x - \theta)$.
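A sketch of the corresponding Z-estimator (an aside; the contaminated sample and the value $k = 1.345$ are illustrative choices): the clipped $\psi$ makes the location estimate far less sensitive to outliers than the mean.

    import numpy as np
    from scipy.optimize import brentq

    rng = np.random.default_rng(2)
    x = np.concatenate([rng.normal(0.0, 1.0, 95), rng.normal(50.0, 1.0, 5)])  # 5% gross outliers

    k = 1.345  # truncation level; any fixed k > 0 works here

    def Psi_n(theta):
        # psi_theta(x) = [x - theta] clipped to [-k, k]
        return np.mean(np.clip(x - theta, -k, k))

    theta_huber = brentq(Psi_n, x.min(), x.max())
    print(theta_huber, x.mean(), np.median(x))  # the Huber estimate stays near 0; the mean does not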

Consistency of M-estimators and Z-estimators

We want to show that $\hat\theta \stackrel{P}{\to} \theta_0$, where $\hat\theta$ approximately maximizes
$M_n(\theta) = P_n m_\theta$ and $\theta_0$ maximizes $M(\theta) = P m_\theta$. We use a ULLN.

Theorem: Suppose that

1. $\sup_\theta |M_n(\theta) - M(\theta)| \stackrel{P}{\to} 0$,
2. for all $\epsilon > 0$, $\sup\{M(\theta) : d(\theta, \theta_0) \ge \epsilon\} < M(\theta_0)$, and
3. $M_n(\hat\theta_n) \ge M_n(\theta_0) - o_P(1)$.

Then $\hat\theta_n \stackrel{P}{\to} \theta_0$.

(2) is an identifiability condition: approximately maximizing $M(\theta)$ unambiguously specifies $\theta_0$. It suffices if there is a unique maximizer, $\Theta$ is compact, and $M$ is continuous.

Proof

From (2), for all $\epsilon > 0$ there is a $\delta > 0$ such that
$$\begin{aligned}
\Pr\big(d(\hat\theta_n, \theta_0) \ge \epsilon\big)
&\le \Pr\big(M(\theta_0) - M(\hat\theta_n) \ge \delta\big) \\
&= \Pr\big(M(\theta_0) - M_n(\theta_0) + M_n(\theta_0) - M_n(\hat\theta_n) + M_n(\hat\theta_n) - M(\hat\theta_n) \ge \delta\big) \\
&\le \Pr\big(M(\theta_0) - M_n(\theta_0) \ge \delta/3\big) + \Pr\big(M_n(\theta_0) - M_n(\hat\theta_n) \ge \delta/3\big) \\
&\qquad + \Pr\big(M_n(\hat\theta_n) - M(\hat\theta_n) \ge \delta/3\big).
\end{aligned}$$

Then (1) implies the first and third probabilities go to zero, and (3) implies the second probability goes to zero.

Consistency of M-estimators and Z-estimators

Same thing for Z-estimators: finding $\hat\theta$ that is an approximate zero of
$\Psi_n(\theta) = P_n \psi_\theta$ leads to $\hat\theta \stackrel{P}{\to} \theta_0$, the unique zero of $\Psi(\theta) = P \psi_\theta$.

Theorem: Suppose that

1. $\sup_\theta \|\Psi_n(\theta) - \Psi(\theta)\| \stackrel{P}{\to} 0$,
2. for all $\epsilon > 0$, $\inf\{\|\Psi(\theta)\| : d(\theta, \theta_0) \ge \epsilon\} > 0 = \|\Psi(\theta_0)\|$, and
3. $\Psi_n(\hat\theta_n) = o_P(1)$.

Then $\hat\theta_n \stackrel{P}{\to} \theta_0$.

Proof: Choosing $M_n(\theta) = -\|\Psi_n(\theta)\|$ and $M(\theta) = -\|\Psi(\theta)\|$ in the previous theorem implies the result.

Example: Sample median

The sample median $\hat\theta_n$ is the zero of
$$\Psi_n(\theta) = P_n \psi_\theta(X) = P_n\, \mathrm{sign}(X - \theta).$$
Suppose that $P$ is continuous, with positive density around its median, and check the conditions:

1. The class $\{x \mapsto \mathrm{sign}(x - \theta) : \theta \in \mathbb{R}\}$ is Glivenko-Cantelli.
2. The population median $\theta_0$ is unique, so for all $\epsilon > 0$,
$$P(X < \theta_0 - \epsilon) < \frac{1}{2} < P(X < \theta_0 + \epsilon).$$
3. The sample median always has $|P_n\, \mathrm{sign}(X - \hat\theta_n)| = 0$ (with the convention $\mathrm{sign}(0) = 0$).
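A small simulation (an aside; Cauchy data are an arbitrary choice with a continuous, positive density at its median 0) illustrating both condition (3) and the consistency conclusion:

    import numpy as np

    rng = np.random.default_rng(3)
    for n in [100, 1000, 10000]:
        x = rng.standard_cauchy(n)
        theta_hat = np.median(x)
        # theta_hat approaches the population median 0, and P_n sign(X - theta_hat) is (numerically) 0
        print(n, theta_hat, np.mean(np.sign(x - theta_hat)))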

ULLN and M-estimators

Notice the ULLN condition:
$$\sup_\theta |M_n(\theta) - M(\theta)| \stackrel{P}{\to} 0.$$
Typically, this requires the empirical process $\theta \mapsto P_n m_\theta$ to be totally bounded. This can be problematic if $m_\theta$ is unbounded. For instance:

Mean: $m_\theta(x) = -(x - \theta)^2$,
Median: $m_\theta(x) = -|x - \theta|$.

We can get around the problem by restricting to a compact set where most of the mass of $P$ lies, and showing that this does not affect the asymptotics. In that case, we can also restrict $\theta$ to an appropriate compact subset.

Non-parametric maximum likelihood

Estimate $P$ on $\mathcal{X}$. Suppose it has a density
$$p_0 = \frac{dP}{d\mu} \in \mathcal{P},$$
where $\mathcal{P}$ is a family of densities. Define the maximum likelihood estimate
$$\hat p_n = \arg\max_{p \in \mathcal{P}} P_n \log p.$$

We'll show conditions for which $\hat p_n$ is Hellinger consistent, that is,
$h(\hat p_n, p_0) \stackrel{a.s.}{\to} 0$, where $h$ is the Hellinger distance:
$$h(p, q) = \left( \frac{1}{2} \int \left( p^{1/2} - q^{1/2} \right)^2 d\mu \right)^{1/2}.$$
[The 1/2 ensures $0 \le h(p, q) \le 1$.]
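A numerical sketch of this definition (an aside; the two unit-variance normal densities and the grid approximation of the integral are illustrative choices):

    import numpy as np
    from scipy.stats import norm

    t = np.linspace(-12.0, 12.0, 20001)
    dt = t[1] - t[0]
    p = norm.pdf(t, loc=0.0, scale=1.0)
    q = norm.pdf(t, loc=1.0, scale=1.0)

    # h(p, q)^2 = (1/2) * integral of (sqrt(p) - sqrt(q))^2
    h2 = 0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2) * dt
    print(np.sqrt(h2))
    # closed form for N(mu1, 1) vs N(mu2, 1): h^2 = 1 - exp(-(mu1 - mu2)^2 / 8)
    print(np.sqrt(1.0 - np.exp(-1.0 / 8.0)))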

Hellinger distance

We have
$$\begin{aligned}
h(p, q)^2 &= \frac{1}{2} \int \left( p^{1/2} - q^{1/2} \right)^2 d\mu \\
&= \frac{1}{2} \int \left( p + q - 2 p^{1/2} q^{1/2} \right) d\mu \\
&= 1 - \int p^{1/2} q^{1/2}\, d\mu.
\end{aligned}$$
This latter integral is called the Hellinger affinity. Expressing $h$ in this form can simplify its calculation for product densities. Notice that, by Cauchy-Schwarz,
$$\int p^{1/2} q^{1/2}\, d\mu \le \left( \int p\, d\mu \int q\, d\mu \right)^{1/2} = 1,$$
so $h(p, q) \in [0, 1]$.
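For instance (an added remark expanding the comment about product densities): if $p = p_1 \otimes \cdots \otimes p_n$ and $q = q_1 \otimes \cdots \otimes q_n$, the affinity factorizes,
$$\int p^{1/2} q^{1/2}\, d(\mu_1 \times \cdots \times \mu_n) = \prod_{i=1}^n \int p_i^{1/2} q_i^{1/2}\, d\mu_i,
\qquad\text{so}\qquad
h(p, q)^2 = 1 - \prod_{i=1}^n \left( 1 - h(p_i, q_i)^2 \right).$$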

Non-parametric maximum likelihood

The Kullback-Leibler divergence between $p$ and $q$ is
$$d_{KL}(p, q) = \int \log\frac{q}{p}\, q\, d\mu.$$
Clearly, $d_{KL}(p, p) = 0$. Also, since $-\log(\cdot)$ is convex,
$$d_{KL}(p, q) = -\int \log\frac{p}{q}\, q\, d\mu \ge -\log \int \frac{p}{q}\, q\, d\mu = 0.$$
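A standard side remark (not on the slide) connecting this to the Hellinger distance of the previous slides: by Jensen's inequality and $e^{-x} \ge 1 - x$,
$$h(p, q)^2 = 1 - \int p^{1/2} q^{1/2}\, d\mu \;\le\; \tfrac{1}{2}\, d_{KL}(p, q),$$
so controlling the KL divergence of $\hat p_n$ from $p_0$ also controls their Hellinger distance.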

Non-parametric maximum likelihood

Relating KL-divergence to a ULLN:
$$\begin{aligned}
d_{KL}(\hat p_n, p_0) &= \int \log\frac{p_0}{\hat p_n}\, p_0\, d\mu \\
&\le \int \log\frac{p_0}{\hat p_n}\, p_0\, d\mu - P_n \log\frac{p_0}{\hat p_n} \\
&= P \log\frac{p_0}{\hat p_n} - P_n \log\frac{p_0}{\hat p_n} \\
&\le \|P - P_n\|_{\mathcal{G}},
\end{aligned}$$
where the first inequality follows from the fact that $\hat p_n$ maximizes $P_n \log p$ over $p \in \mathcal{P}$ (so $P_n \log(p_0/\hat p_n) \le 0$), and the class $\mathcal{G}$ is defined as
$$\mathcal{G} = \left\{ 1[p_0 > 0] \log\frac{p_0}{p} : p \in \mathcal{P} \right\}.$$

Non-parametric maximum likelihood

One problem here is that $\log(p_0/p)$ is unbounded, since $p$ can be zero.
We'll take a different approach. For any $p \in \mathcal{P}$, consider the mixture
$$\bar p = \frac{p + p_0}{2}.$$
If the class $\mathcal{P}$ is convex and $\hat p_n, p_0 \in \mathcal{P}$, this mixture satisfies
$P_n \log \bar p_n \le P_n \log \hat p_n$. This is behind the following lemma.

Lemma: Define
$$\bar p_n = \frac{\hat p_n + p_0}{2}.$$
If $\mathcal{P}$ is convex,
$$h(\bar p_n, p_0)^2 \le \int \frac{\hat p_n}{\bar p_n}\, d(P_n - P).$$
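A proof sketch (filling in steps not shown on the slide; it uses only $\log x \le x - 1$ and an elementary comparison of Hellinger and chi-square distances): since $\hat p_n$ maximizes $P_n \log p$ over the convex class and $\bar p_n \in \mathcal{P}$,
$$0 \le P_n \log\frac{\hat p_n}{\bar p_n} \le P_n\!\left(\frac{\hat p_n}{\bar p_n}\right) - 1,$$
so
$$\int \frac{\hat p_n}{\bar p_n}\, d(P_n - P) \;\ge\; 1 - P\!\left(\frac{\hat p_n}{\bar p_n}\right)
\;=\; \int \frac{(p_0 - \bar p_n)^2}{\bar p_n}\, d\mu
\;\ge\; 2\, h(\bar p_n, p_0)^2 \;\ge\; h(\bar p_n, p_0)^2,$$
where the middle equality uses $p_0 - \hat p_n = 2(p_0 - \bar p_n)$ and $\int \bar p_n\, d\mu = 1$, and the next step uses $(\sqrt{p_0} - \sqrt{\bar p_n})^2 \le (p_0 - \bar p_n)^2/\bar p_n$.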

Non-parametric maximum likelihood

Theorem: For a convex class $\mathcal{P}$ of densities, if $P$ has density $p_0 \in \mathcal{P}$ and
$\hat p_n$ maximizes likelihood over $\mathcal{P}$, then with $\bar p_n = (\hat p_n + p_0)/2$,
$$h(\bar p_n, p_0)^2 \le \|P - P_n\|_{\mathcal{G}},$$
where
$$\mathcal{G} = \left\{ \frac{2p}{p + p_0} : p \in \mathcal{P} \right\}.$$

Notice that functions in $\mathcal{G}$ are bounded between 0 and 2.
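A remark not on the slide, closing the loop with the Hellinger-consistency claim for $\hat p_n$ itself: one can check pointwise that $(\sqrt{\bar p_n} - \sqrt{p_0})^2 \ge (1 - 1/\sqrt{2})^2 (\sqrt{\hat p_n} - \sqrt{p_0})^2$, so
$$h(\hat p_n, p_0) \;\le\; (2 + \sqrt{2})\, h(\bar p_n, p_0),$$
and the bound in the theorem, together with $\|P - P_n\|_{\mathcal{G}} \to 0$, gives $h(\hat p_n, p_0) \to 0$.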

Non-parametric maximum likelihood: Example

Lemma: Suppose $\mathcal{P}$ is a set of densities on a compact subset $\mathcal{X}$ of $\mathbb{R}^d$.
Fix a norm $\|\cdot\|$ on $\mathbb{R}^d$. Suppose that, for all $p \in \mathcal{P}$,
$$\left| \frac{p(x)}{p(y)} - 1 \right| \le L \|x - y\|.$$
Then:

1. For all $p \in \mathrm{conv}\,\mathcal{P}$, $\left| \frac{p(x)}{p(y)} - 1 \right| \le L \|x - y\|$.
2. For all $p, p_0 \in \mathrm{conv}\,\mathcal{P}$, $\frac{2p}{p + p_0}$ is $O(L^2)$-Lipschitz w.r.t. $\|\cdot\|$.
3. $\|P - P_n\|_{\mathcal{G}} \stackrel{a.s.}{\to} 0$, where
$$\mathcal{G} = \left\{ \frac{2p}{p + p_0} : p \in \mathrm{conv}\,\mathcal{P} \right\}.$$

Non-parametric maximum likelihood: Example

But notice that the dependence on the dimension $d$ is terrible: the rate is exponentially slow in $d$. The Lipschitz property is a very weak restriction.

