
Learning with Maximum Likelihood

Note to other teachers and users of these slides. Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew's tutorials: http://www.cs.cmu.edu/~awm/tutorials . Comments and corrections gratefully received.

Andrew W. Moore
Professor
School of Computer Science
Carnegie Mellon University


www.cs.cmu.edu/~awm awm@cs.cmu.edu 412-268-7599
Sep 6th, 2001

Copyright 2001, 2004, Andrew W. Moore

Maximum Likelihood learning of Gaussians for Data Mining


Why we should care
Learning Univariate Gaussians
Learning Multivariate Gaussians
What's a biased estimator?
Bayesian Learning of Gaussians

Why we should care


Maximum Likelihood Estimation is a very very very very fundamental part of data analysis.
MLE for Gaussians is training wheels for our future techniques.
Learning Gaussians is more useful than you might guess.

Learning Gaussians from Data


Suppose you have x1, x2, ... xR ~ (i.i.d.) N(μ, σ²), but you don't know μ (you do know σ²).

MLE: For which μ is x1, x2, ... xR most likely?
MAP: Which μ maximizes p(μ | x1, x2, ... xR, σ²)?   (Sneer)

Despite this, we'll spend 95% of our time on MLE. Why? Wait and see...

MLE for univariate Gaussian


Suppose you have x1, x2, ... xR ~ (i.i.d.) N(μ, σ²), but you don't know μ (you do know σ²).
MLE: For which μ is x1, x2, ... xR most likely?

$$\mu^{\text{mle}} = \arg\max_{\mu} \; p(x_1, x_2, \ldots, x_R \mid \mu, \sigma^2)$$


Algebra Euphoria
$$
\begin{aligned}
\mu^{\text{mle}} &= \arg\max_{\mu} \; p(x_1, x_2, \ldots, x_R \mid \mu, \sigma^2) \\
&= \arg\max_{\mu} \; \prod_{i=1}^{R} p(x_i \mid \mu, \sigma^2) && \text{(by i.i.d.)} \\
&= \arg\max_{\mu} \; \sum_{i=1}^{R} \log p(x_i \mid \mu, \sigma^2) && \text{(monotonicity of log)} \\
&= \arg\max_{\mu} \; -\frac{1}{2\sigma^2} \sum_{i=1}^{R} (x_i - \mu)^2 && \text{(plug in formula for Gaussian)} \\
&= \arg\min_{\mu} \; \sum_{i=1}^{R} (x_i - \mu)^2 && \text{(after simplification)}
\end{aligned}
$$

Intermission: A General Scalar MLE strategy


Task: Find MLE θ assuming known form for p(Data | θ, stuff)

1. Write LL = log P(Data | θ, stuff)
2. Work out ∂LL/∂θ using high-school calculus
3. Set ∂LL/∂θ = 0 for a maximum, creating an equation in terms of θ
4. Solve it*
5. Check that you've found a maximum rather than a minimum or saddle-point, and be careful if θ is constrained
*This is a perfect example of something that works perfectly in all textbook examples and usually involves surprising pain if you need it for something new.
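To make the recipe concrete, here is a small illustrative sketch (my own, not from the original slides) that runs steps 1-4 symbolically for the Gaussian-mean problem above using the sympy library, with three symbolic datapoints standing in for x1 ... xR:

```python
# Hypothetical illustration of the scalar MLE recipe for the mean of a
# Gaussian with known variance, worked symbolically with sympy.
import sympy as sp

mu, sigma = sp.symbols('mu sigma', positive=True)
x1, x2, x3 = sp.symbols('x1 x2 x3', real=True)
data = [x1, x2, x3]                         # R = 3 symbolic datapoints

def gaussian_pdf(x):
    return sp.exp(-(x - mu)**2 / (2 * sigma**2)) / (sigma * sp.sqrt(2 * sp.pi))

# Step 1: LL = log P(Data | mu, sigma^2)
LL = sum(sp.log(gaussian_pdf(x)) for x in data)

# Step 2: work out dLL/dmu
dLL_dmu = sp.diff(LL, mu)

# Steps 3-4: set dLL/dmu = 0 and solve for mu
mu_mle = sp.solve(sp.Eq(dLL_dmu, 0), mu)[0]
print(sp.simplify(mu_mle))     # x1/3 + x2/3 + x3/3 -- the sample mean

# Step 5: the second derivative is negative, so this is indeed a maximum.
print(sp.diff(LL, mu, 2))      # -3/sigma**2
```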


The MLE
$$
\begin{aligned}
\mu^{\text{mle}} &= \arg\max_{\mu} \; p(x_1, x_2, \ldots, x_R \mid \mu, \sigma^2) \\
&= \arg\min_{\mu} \; \sum_{i=1}^{R} (x_i - \mu)^2 \\
&= \text{the } \mu \text{ such that } \;\; 0 = \frac{\partial}{\partial \mu} \sum_{i=1}^{R} (x_i - \mu)^2 = -2 \sum_{i=1}^{R} (x_i - \mu)
\end{aligned}
$$

Thus

$$\mu^{\text{mle}} = \frac{1}{R} \sum_{i=1}^{R} x_i$$
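As a quick numeric sanity check (an illustrative sketch, not part of the original slides; the simulated data and grid are arbitrary), the snippet below maximizes the Gaussian log-likelihood over a grid of candidate μ values and confirms that the winner sits at the sample mean:

```python
# Hypothetical numeric check: grid-search the log-likelihood over mu and
# compare the arg-max against the closed-form MLE (the sample mean).
import numpy as np

rng = np.random.default_rng(0)
sigma = 2.0                                    # known sigma (so sigma^2 = 4)
x = rng.normal(loc=5.0, scale=sigma, size=50)  # R = 50 simulated datapoints

def log_likelihood(mu):
    return np.sum(-np.log(sigma * np.sqrt(2 * np.pi)) - (x - mu) ** 2 / (2 * sigma ** 2))

grid = np.linspace(0.0, 10.0, 10001)           # candidate values of mu
best_mu = grid[np.argmax([log_likelihood(m) for m in grid])]

print(best_mu, x.mean())                       # agree to within the grid spacing (0.001)
```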

Lawks-a-lawdy!
$$\mu^{\text{mle}} = \frac{1}{R} \sum_{i=1}^{R} x_i$$

The best estimate of the mean of a distribution is the mean of the sample!

At first sight: this kind of pedantic, algebra-filled and ultimately unsurprising fact is exactly the reason people throw down their Statistics book and pick up their "Agent Based Evolutionary Data Mining Using The Neuro-Fuzz Transform" book.

A General MLE strategy


Suppose θ = (θ1, θ2, ..., θn)^T is a vector of parameters.
Task: Find the MLE θ assuming a known form for p(Data | θ, stuff)

1. Write LL = log P(Data | θ, stuff)
2. Work out ∂LL/∂θ using high-school calculus:

$$\frac{\partial LL}{\partial \boldsymbol{\theta}} =
\begin{pmatrix}
\partial LL / \partial \theta_1 \\
\partial LL / \partial \theta_2 \\
\vdots \\
\partial LL / \partial \theta_n
\end{pmatrix}$$

3. Solve the set of simultaneous equations

$$\frac{\partial LL}{\partial \theta_1} = 0, \quad
\frac{\partial LL}{\partial \theta_2} = 0, \quad \ldots, \quad
\frac{\partial LL}{\partial \theta_n} = 0$$

4. Check that you're at a maximum

If you can't solve the equations in step 3, what should you do?
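One standard answer is to maximize LL numerically. The sketch below (illustrative only, not from the slides) does this with scipy.optimize.minimize on the negative log-likelihood of a univariate Gaussian, parameterized as θ = (μ, log σ) so that σ stays positive; since this case has a closed form, the numeric answer can be checked against the sample mean and variance:

```python
# Hypothetical sketch: when the simultaneous equations have no closed-form
# solution, maximize LL numerically (shown here on a case that does have
# one, so the result is easy to verify).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
x = rng.normal(loc=3.0, scale=1.5, size=200)

def neg_log_likelihood(theta):
    mu, log_sigma = theta                  # work with log(sigma) so sigma stays positive
    sigma = np.exp(log_sigma)
    return -np.sum(-np.log(sigma * np.sqrt(2 * np.pi)) - (x - mu) ** 2 / (2 * sigma ** 2))

result = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0]))
mu_hat, sigma2_hat = result.x[0], np.exp(result.x[1]) ** 2
print(mu_hat, sigma2_hat)                  # ~ x.mean(), ~ x.var() -- the closed-form MLEs
```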

MLE for univariate Gaussian


Suppose you have x1, x2, ... xR ~ (i.i.d.) N(μ, σ²), but you don't know μ or σ².
MLE: For which θ = (μ, σ²) is x1, x2, ... xR most likely?

$$\log p(x_1, x_2, \ldots, x_R \mid \mu, \sigma^2) = -R\left(\log \sigma + \tfrac{1}{2}\log 2\pi\right) - \frac{1}{2\sigma^2}\sum_{i=1}^{R}(x_i - \mu)^2$$

$$\frac{\partial LL}{\partial \mu} = \frac{1}{\sigma^2}\sum_{i=1}^{R}(x_i - \mu)$$

$$\frac{\partial LL}{\partial \sigma^2} = -\frac{R}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{R}(x_i - \mu)^2$$

MLE for univariate Gaussian


Suppose you have x1, x2, ... xR ~ (i.i.d.) N(μ, σ²), but you don't know μ or σ².
MLE: For which θ = (μ, σ²) is x1, x2, ... xR most likely? Set both partial derivatives to zero and solve:

$$0 = \frac{1}{\sigma^2}\sum_{i=1}^{R}(x_i - \mu) \quad\Longrightarrow\quad \mu^{\text{mle}} = \frac{1}{R}\sum_{i=1}^{R} x_i$$

$$0 = -\frac{R}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{R}(x_i - \mu)^2 \quad\Longrightarrow\quad \sigma^2_{\text{mle}} = \frac{1}{R}\sum_{i=1}^{R}\left(x_i - \mu^{\text{mle}}\right)^2$$
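In code each estimate is one line; the sketch below (illustrative, not from the slides) computes both with numpy and notes that np.var's default ddof=0 normalization matches the MLE's 1/R:

```python
# Hypothetical sketch: the univariate Gaussian MLEs from a data vector x.
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(loc=10.0, scale=3.0, size=100)   # simulated data, R = 100

R = len(x)
mu_mle = x.sum() / R                            # (1/R) * sum_i x_i
sigma2_mle = ((x - mu_mle) ** 2).sum() / R      # (1/R) * sum_i (x_i - mu_mle)^2

# numpy's built-ins agree: np.mean, and np.var with its default ddof=0 (divide by R).
assert np.isclose(mu_mle, x.mean())
assert np.isclose(sigma2_mle, x.var())
print(mu_mle, sigma2_mle)
```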

Unbiased Estimators
An estimator of a parameter is unbiased if the expected value of the estimate equals the true value of the parameter. If x1, x2, ... xR ~ (i.i.d.) N(μ, σ²) then

$$E\left[\mu^{\text{mle}}\right] = E\left[\frac{1}{R}\sum_{i=1}^{R} x_i\right] = \mu$$

so μ^mle is unbiased.

Biased Estimators
An estimator of a parameter is biased if the expected value of the estimate differs from the true value of the parameter. If x1, x2, ... xR ~ (i.i.d.) N(μ, σ²) then

$$E\left[\sigma^2_{\text{mle}}\right] = E\left[\frac{1}{R}\sum_{i=1}^{R}\left(x_i - \mu^{\text{mle}}\right)^2\right] = E\left[\frac{1}{R}\sum_{i=1}^{R}\left(x_i - \frac{1}{R}\sum_{j=1}^{R} x_j\right)^2\right] \neq \sigma^2$$

so σ²_mle is biased.

MLE Variance Bias


If x1, x2, ... xR ~ (i.i.d.) N(μ, σ²) then

$$E\left[\sigma^2_{\text{mle}}\right] = E\left[\frac{1}{R}\sum_{i=1}^{R}\left(x_i - \frac{1}{R}\sum_{j=1}^{R} x_j\right)^2\right] = \left(1 - \frac{1}{R}\right)\sigma^2$$

Intuition check: consider the case of R = 1. Why should our guts expect that σ²_mle would be an underestimate of the true σ²? How could you prove that?
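One way to answer "how could you prove that?" empirically (a Monte Carlo sketch of my own, not from the slides) is to draw many datasets of size R, compute σ²_mle for each, and compare the average against (1 − 1/R)σ²:

```python
# Hypothetical Monte Carlo check that E[sigma2_mle] ~= (1 - 1/R) * sigma^2.
import numpy as np

rng = np.random.default_rng(3)
sigma2_true = 4.0
R = 5                                           # a small R makes the bias easy to see
n_datasets = 200_000

x = rng.normal(0.0, np.sqrt(sigma2_true), size=(n_datasets, R))
sigma2_mle = x.var(axis=1, ddof=0)              # (1/R) * sum (x_i - sample mean)^2, per dataset

print(sigma2_mle.mean())                        # ~ 3.2
print((1 - 1 / R) * sigma2_true)                # = 3.2, the formula above
print(x.var(axis=1, ddof=1).mean())             # ~ 4.0, the 1/(R-1) version is unbiased
```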


Unbiased estimate of Variance


If x1, x2, ... xR ~ (i.i.d.) N(μ, σ²) then

$$E\left[\sigma^2_{\text{mle}}\right] = E\left[\frac{1}{R}\sum_{i=1}^{R}\left(x_i - \frac{1}{R}\sum_{j=1}^{R} x_j\right)^2\right] = \left(1 - \frac{1}{R}\right)\sigma^2$$

So define

$$\sigma^2_{\text{unbiased}} = \frac{\sigma^2_{\text{mle}}}{1 - \frac{1}{R}} = \frac{1}{R-1}\sum_{i=1}^{R}\left(x_i - \mu^{\text{mle}}\right)^2$$

so that $E\left[\sigma^2_{\text{unbiased}}\right] = \sigma^2$.


Unbiaseditude discussion
Which is best?
$$\sigma^2_{\text{mle}} = \frac{1}{R}\sum_{i=1}^{R}\left(x_i - \mu^{\text{mle}}\right)^2 \qquad\qquad \sigma^2_{\text{unbiased}} = \frac{1}{R-1}\sum_{i=1}^{R}\left(x_i - \mu^{\text{mle}}\right)^2$$

Answer: It depends on the task. And it doesn't make much difference once R gets large.

Don't get too excited about being unbiased


Assume x1, x2, ... xR ~ (i.i.d.) N(μ, σ²). Suppose we had these estimators for the mean:

$$\mu^{\text{suboptimal}} = \frac{1}{R+7}\sum_{i=1}^{R} x_i \qquad\qquad \mu^{\text{crap}} = x_1$$

Are either of these unbiased?
Will either of them asymptote to the correct value as R gets large?
Which is more useful?
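A quick simulation (illustrative, not from the slides; the estimator names just mirror the labels above) makes the answers to those three questions visible:

```python
# Hypothetical simulation: bias and consistency of three estimators of the mean.
import numpy as np

rng = np.random.default_rng(4)
mu_true = 5.0
n_datasets = 5_000

for R in (10, 100, 1000):
    x = rng.normal(loc=mu_true, scale=2.0, size=(n_datasets, R))
    mle = x.mean(axis=1)                    # (1/R) sum x_i : unbiased and consistent
    suboptimal = x.sum(axis=1) / (R + 7)    # biased for every R, but the bias -> 0 as R grows
    crap = x[:, 0]                          # just x_1      : unbiased, but never improves
    print(R, mle.mean(), suboptimal.mean(), crap.mean(), crap.std())
```

The printout shows μ^crap staying unbiased while its spread never shrinks, whereas μ^suboptimal is biased for every finite R yet converges to the right answer as R grows, which is why unbiasedness alone is not the whole story.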


MLE for m-dimensional Gaussian


Suppose you have x1, x2, ... xR ~ (i.i.d.) N(μ, Σ), but you don't know μ or Σ.
MLE: For which θ = (μ, Σ) is x1, x2, ... xR most likely?

$$\boldsymbol{\mu}^{\text{mle}} = \frac{1}{R}\sum_{k=1}^{R} \mathbf{x}_k \qquad\qquad \boldsymbol{\Sigma}^{\text{mle}} = \frac{1}{R}\sum_{k=1}^{R}\left(\mathbf{x}_k - \boldsymbol{\mu}^{\text{mle}}\right)\left(\mathbf{x}_k - \boldsymbol{\mu}^{\text{mle}}\right)^T$$

Component-wise, for 1 ≤ i ≤ m and 1 ≤ j ≤ m:

$$\mu_i^{\text{mle}} = \frac{1}{R}\sum_{k=1}^{R} x_{ki} \qquad\qquad \Sigma_{ij}^{\text{mle}} = \frac{1}{R}\sum_{k=1}^{R}\left(x_{ki} - \mu_i^{\text{mle}}\right)\left(x_{kj} - \mu_j^{\text{mle}}\right)$$

where xki is the value of the ith component of xk (the ith attribute of the kth record), μi^mle is the ith component of μ^mle, and Σij^mle is the (i, j)th component of Σ^mle.
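An illustrative numpy sketch of these two formulas (not from the slides; the simulated data is arbitrary). Note that np.cov defaults to the unbiased 1/(R−1) normalization, so bias=True is needed to reproduce the MLE:

```python
# Hypothetical sketch: MLE of the mean vector and covariance matrix
# for an m-dimensional Gaussian from an (R x m) data matrix X.
import numpy as np

rng = np.random.default_rng(5)
R, m = 500, 3
true_mu = np.array([1.0, -2.0, 0.5])
true_Sigma = np.array([[2.0, 0.5, 0.0],
                       [0.5, 1.0, 0.3],
                       [0.0, 0.3, 1.5]])
X = rng.multivariate_normal(true_mu, true_Sigma, size=R)

mu_mle = X.mean(axis=0)                       # (1/R) * sum_k x_k
centered = X - mu_mle
Sigma_mle = centered.T @ centered / R         # (1/R) * sum_k (x_k - mu)(x_k - mu)^T

# np.cov gives the unbiased 1/(R-1) estimate by default; bias=True gives the MLE.
assert np.allclose(Sigma_mle, np.cov(X, rowvar=False, bias=True))
print(mu_mle)
print(Sigma_mle)
```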

Suppose you have x1, x2, xR ~(i.i.d) through ) MLE A: Just plug N(, the recipe. But you dont know or Note , mle is forced to be MLE: For which =(,) is x1, xhow xR most likely? 2
symmetric non-negative definite Note the unbiased case How many datapoints would you need before the Gaussian has a chance of being non-degenerate?

MLE for m-dimensional Gaussian Q: How would you prove this?

mle

1 R = xk R k =1 1 R x k mle x k mle R k =1
unbiased =

mle =

)(

Copyright 2001, 2004, Andrew W. Moore

mle 1 R = x mle x k mle 1 R 1 k k =1 1 R

)(

Maximum Likelihood: Slide 32

16

Confidence intervals
We need to talk.
We need to discuss how accurate we expect μ^mle and Σ^mle to be as a function of R.
And we need to consider how to estimate these accuracies from data:
Analytically*
Non-parametrically (using randomization and bootstrapping)*
But we won't. Not yet.

*Will be discussed in future Andrew lectures, just before we need this technology.

Structural error
Actually, we need to talk about something else too...
What if we do all this analysis when the true distribution is in fact not Gaussian?
How can we tell?*
How can we survive?*

*Will be discussed in future Andrew lectures, just before we need this technology.

Gaussian MLE in action


Using R=392 cars from the MPG UCI dataset supplied by Ross Quinlan


Data-starved Gaussian MLE


Using three subsets of MPG. Each subset has 6 randomly-chosen cars.



Bivariate MLE in action


Multivariate MLE

Covariance matrices are not exciting to look at


Being Bayesian: MAP estimates for Gaussians


Suppose you have x1, x2, ... xR ~ (i.i.d.) N(μ, Σ), but you don't know μ or Σ.
MAP: Which (μ, Σ) maximizes p(μ, Σ | x1, x2, ... xR)?

Step 1: Put a prior on (μ, Σ).

Step 1a: Put a prior on Σ:

Σ ~ IW(ν₀, (ν₀ − m − 1) Σ₀)

This thing is called the Inverse-Wishart distribution: a PDF over SPD (symmetric positive definite) matrices!
Σ₀: (roughly) my best guess of Σ, so E[Σ] = Σ₀.
ν₀ small: I am not sure about my guess of Σ₀. ν₀ large: I'm pretty sure about my guess of Σ₀.

Step 1b: Put a prior on μ | Σ:

μ | Σ ~ N(μ₀, Σ / κ₀)

μ₀: my best guess of μ, so E[μ] = μ₀.
κ₀ small: I am not sure about my guess of μ₀. κ₀ large: I'm pretty sure about my guess of μ₀.
Together, the priors on Σ and on μ | Σ define a joint distribution on (μ, Σ). Notice how we are forced to express our ignorance of μ proportionally to Σ.

Why do we use this form of prior? Actually, we don't have to. But it is computationally and algebraically convenient: it's a conjugate prior.

Being Bayesian: MAP estimates for Gaussians


Suppose you have x1, x2, ... xR ~ (i.i.d.) N(μ, Σ).
MAP: Which (μ, Σ) maximizes p(μ, Σ | x1, x2, ... xR)?

Step 1: Prior: Σ ~ IW(ν₀, (ν₀ − m − 1) Σ₀),  μ | Σ ~ N(μ₀, Σ / κ₀)

Step 2: Update with the data:

$$\bar{\mathbf{x}} = \frac{1}{R}\sum_{k=1}^{R} \mathbf{x}_k \qquad
\boldsymbol{\mu}_R = \frac{\kappa_0 \boldsymbol{\mu}_0 + R\,\bar{\mathbf{x}}}{\kappa_0 + R} \qquad
\kappa_R = \kappa_0 + R \qquad
\nu_R = \nu_0 + R$$

$$(\nu_R - m - 1)\,\boldsymbol{\Sigma}_R = (\nu_0 - m - 1)\,\boldsymbol{\Sigma}_0
+ \sum_{k=1}^{R}\left(\mathbf{x}_k - \bar{\mathbf{x}}\right)\left(\mathbf{x}_k - \bar{\mathbf{x}}\right)^T
+ \frac{\left(\bar{\mathbf{x}} - \boldsymbol{\mu}_0\right)\left(\bar{\mathbf{x}} - \boldsymbol{\mu}_0\right)^T}{1/\kappa_0 + 1/R}$$

Step 3: Posterior: Σ | x1, ... xR ~ IW(ν_R, (ν_R − m − 1) Σ_R),  μ | Σ, x1, ... xR ~ N(μ_R, Σ / κ_R)

Result: μ^map = μ_R, and E[Σ | x1, x2, ... xR] = Σ_R.

Look carefully at what these formulae are doing. It's all very sensible.
Conjugate priors: the prior form and the posterior form are the same, and the posterior is characterized by sufficient statistics of the data.
The marginal distribution on μ is a Student-t.
One point of view: it's pretty academic if R > 30.
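For concreteness, here is a sketch of the Step 1-3 update (my own code, not from the slides), using the same symbol conventions as above: ν₀ is the Inverse-Wishart degrees of freedom, κ₀ the confidence in μ₀; the function name and the example prior values are made up for illustration:

```python
# Hypothetical sketch of the conjugate MAP update for a Gaussian, following
# the Step 1-3 formulas above. X is an (R x m) data matrix; mu0, Sigma0 are
# the prior guesses; kappa0, nu0 encode confidence in them.
import numpy as np

def gaussian_map_update(X, mu0, Sigma0, kappa0, nu0):
    R, m = X.shape
    xbar = X.mean(axis=0)

    mu_R = (kappa0 * mu0 + R * xbar) / (kappa0 + R)
    kappa_R = kappa0 + R
    nu_R = nu0 + R

    centered = X - xbar
    S = centered.T @ centered                            # sum_k (x_k - xbar)(x_k - xbar)^T
    d = (xbar - mu0).reshape(-1, 1)
    scatter = (nu0 - m - 1) * Sigma0 + S + (d @ d.T) / (1.0 / kappa0 + 1.0 / R)
    Sigma_R = scatter / (nu_R - m - 1)                   # posterior mean of Sigma

    return mu_R, Sigma_R, kappa_R, nu_R                  # mu_R is also the MAP estimate of mu

# Made-up example: a weak prior centred on zero mean / identity covariance, m = 2.
rng = np.random.default_rng(6)
X = rng.multivariate_normal([2.0, -1.0], [[1.0, 0.3], [0.3, 2.0]], size=20)
print(gaussian_map_update(X, mu0=np.zeros(2), Sigma0=np.eye(2), kappa0=1.0, nu0=6.0))
```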

Where we're at
                                   Categorical           Real-valued          Mixed Real / Cat
  Inputs                           inputs only           inputs only          okay

  Classifier (predict category)    Joint BC, Naive BC                         Dec Tree
  Density Estimator (probability)  Joint DE, Naive DE    Gauss DE
  Regressor (predict real no.)

What you should know


The Recipe for MLE
Why we sometimes prefer MLE to MAP
Understand MLE estimation of Gaussian parameters
Understand biased estimators versus unbiased estimators
Appreciate the outline behind Bayesian estimation of Gaussian parameters

Useful exercise
We'd already done some MLE in this class without even telling you!

Suppose categorical arity-n inputs x1, x2, ... xR ~ (i.i.d.) from a multinomial M(p1, p2, ... pn), where P(xk = j | p) = pj.

What is the MLE p = (p1, p2, ... pn)?
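If you want to check your answer numerically, here is an illustrative sketch (not from the slides): it computes the empirical class frequencies as a candidate p and verifies, by random search, that no other probability vector it tries achieves a higher log-likelihood:

```python
# Hypothetical numeric check for the multinomial MLE exercise: compare the
# log-likelihood of the empirical frequencies against random alternatives.
import numpy as np

rng = np.random.default_rng(7)
n = 4                                          # arity of the categorical input
x = rng.integers(0, n, size=300)               # simulated data x_1 ... x_R in {0, ..., n-1}

counts = np.bincount(x, minlength=n)
p_candidate = counts / counts.sum()            # empirical frequencies -- the candidate MLE

def log_likelihood(p):
    return np.sum(counts * np.log(p))

best_random = max(log_likelihood(rng.dirichlet(np.ones(n))) for _ in range(10_000))
print(log_likelihood(p_candidate), best_random)   # the first number should be the larger one
```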
