Note to other teachers and users of these slides. Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew's tutorials: http://www.cs.cmu.edu/~awm/tutorials . Comments and corrections gratefully received.
MLE: For which $\mu$ is $x_1, x_2, \ldots, x_R$ most likely?
MAP: Which $\mu$ maximizes $p(\mu \mid x_1, x_2, \ldots, x_R, \sigma^2)$?
Despite this, we'll spend 95% of our time on MLE. Why? Wait and see...
Algebra Euphoria

$$\mu_{mle} = \arg\max_\mu p(x_1, x_2, \ldots, x_R \mid \mu, \sigma^2)$$
$$= \arg\max_\mu \prod_{i=1}^R p(x_i \mid \mu, \sigma^2) \quad \text{(by i.i.d.)}$$
$$= \arg\max_\mu \sum_{i=1}^R \log p(x_i \mid \mu, \sigma^2) \quad \text{(monotonicity of log)}$$
$$= \arg\max_\mu \sum_{i=1}^R -\frac{(x_i - \mu)^2}{2\sigma^2} \quad \text{(plug in formula for Gaussian)}$$
$$= \arg\min_\mu \sum_{i=1}^R (x_i - \mu)^2 \quad \text{(after simplification)}$$
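As a quick sanity check (not part of the original slides): a minimal Python sketch, assuming numpy and scipy, that numerically maximizes this likelihood over $\mu$ and confirms it lands on the sample mean. The data and the known $\sigma^2$ are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Hypothetical data: R = 20 draws from N(mu=3, sigma^2=4)
rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=20)
sigma2 = 4.0  # the variance is assumed known on this slide

def neg_log_lik(mu):
    # -log p(x_1..x_R | mu, sigma^2), dropping mu-independent constants
    return np.sum((x - mu) ** 2) / (2 * sigma2)

result = minimize_scalar(neg_log_lik)
print(result.x)   # numeric arg max of the likelihood...
print(x.mean())   # ...agrees with the sample mean
```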
The MLE

$$\mu_{mle} = \arg\max_\mu p(x_1, x_2, \ldots, x_R \mid \mu, \sigma^2) = \arg\min_\mu \sum_{i=1}^R (x_i - \mu)^2$$

$=$ the $\mu$ such that

$$0 = \frac{\partial}{\partial\mu} \sum_{i=1}^R (x_i - \mu)^2 = \sum_{i=1}^R -2(x_i - \mu)$$

Thus

$$\mu_{mle} = \frac{1}{R}\sum_{i=1}^R x_i$$
Lawks-a-lawdy!
$$\mu_{mle} = \frac{1}{R}\sum_{i=1}^R x_i$$
The best estimate of the mean of a distribution is the mean of the sample!
At first sight: this kind of pedantic, algebra-filled and ultimately unsurprising fact is exactly the reason people throw down their Statistics book and pick up their "Agent-Based Evolutionary Data Mining Using the Neuro-Fuzz Transform" book.
For a general parameter vector $\theta = (\theta_1, \theta_2, \ldots, \theta_n)^T$, the gradient of the log-likelihood is

$$\frac{\partial LL}{\partial \theta} = \begin{pmatrix} \partial LL / \partial\theta_1 \\ \partial LL / \partial\theta_2 \\ \vdots \\ \partial LL / \partial\theta_n \end{pmatrix}$$
MLE for the mean and variance of a univariate Gaussian follows the same recipe: write out the log-likelihood, differentiate with respect to each parameter, and set the derivatives to zero.

$$LL = \log p(x_1, x_2, \ldots, x_R \mid \mu, \sigma^2) = -R\log(\sigma\sqrt{2\pi}) - \frac{1}{2\sigma^2}\sum_{i=1}^R (x_i - \mu)^2$$

$$\frac{\partial LL}{\partial \mu} = \frac{1}{\sigma^2}\sum_{i=1}^R (x_i - \mu)$$

$$\frac{\partial LL}{\partial \sigma^2} = -\frac{R}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^R (x_i - \mu)^2$$

Setting $0 = \frac{1}{\sigma^2}\sum_{i=1}^R (x_i - \mu)$ gives

$$\mu_{mle} = \frac{1}{R}\sum_{i=1}^R x_i$$

Setting $0 = -\frac{R}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^R (x_i - \mu)^2$ and plugging in $\mu_{mle}$ gives

$$\sigma^2_{mle} = \frac{1}{R}\sum_{i=1}^R (x_i - \mu_{mle})^2$$
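The two closed-form estimates are one line each in code. A minimal sketch (assuming numpy; the sample values are hypothetical):

```python
import numpy as np

x = np.array([2.1, 3.4, 1.9, 4.0, 2.7])  # hypothetical sample

mu_mle = x.mean()                        # (1/R) * sum(x_i)
sigma2_mle = np.mean((x - mu_mle) ** 2)  # (1/R) * sum((x_i - mu_mle)^2)

# np.var uses the same 1/R (MLE) normalization by default
assert np.isclose(sigma2_mle, np.var(x))
```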
Unbiased Estimators
An estimator of a parameter is unbiased if the expected value of the estimate is the same as the true value of the parameter. If $x_1, x_2, \ldots, x_R \sim$ (i.i.d.) $N(\mu, \sigma^2)$ then
$$E[\mu_{mle}] = E\left[\frac{1}{R}\sum_{i=1}^R x_i\right] = \mu$$

$\mu_{mle}$ is unbiased.
Biased Estimators
An estimator of a parameter is biased if the expected value of the estimate is different from the true value of the parameter. If $x_1, x_2, \ldots, x_R \sim$ (i.i.d.) $N(\mu, \sigma^2)$ then
$$E[\sigma^2_{mle}] = E\left[\frac{1}{R}\sum_{i=1}^R (x_i - \mu_{mle})^2\right] = E\left[\frac{1}{R}\sum_{i=1}^R \left(x_i - \frac{1}{R}\sum_{j=1}^R x_j\right)^2\right] \neq \sigma^2$$

$\sigma^2_{mle}$ is biased. In fact,

$$E[\sigma^2_{mle}] = E\left[\frac{1}{R}\sum_{i=1}^R \left(x_i - \frac{1}{R}\sum_{j=1}^R x_j\right)^2\right] = \left(1 - \frac{1}{R}\right)\sigma^2$$

Intuition check: consider the case of $R = 1$. Why should our guts expect that $\sigma^2_{mle}$ would be an underestimate of the true $\sigma^2$? How could you prove that?
$$E[\sigma^2_{mle}] = E\left[\frac{1}{R}\sum_{i=1}^R \left(x_i - \frac{1}{R}\sum_{j=1}^R x_j\right)^2\right] = \left(1 - \frac{1}{R}\right)\sigma^2$$

So define

$$\sigma^2_{unbiased} = \frac{\sigma^2_{mle}}{1 - \frac{1}{R}} \qquad \text{so that} \qquad E[\sigma^2_{unbiased}] = \sigma^2$$

$$\sigma^2_{unbiased} = \frac{1}{R-1}\sum_{i=1}^R (x_i - \mu_{mle})^2$$
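A small simulation (my own illustration, assuming numpy) that checks the $E[\sigma^2_{mle}] = (1 - \frac{1}{R})\sigma^2$ result by averaging both estimators over many repeated samples:

```python
import numpy as np

rng = np.random.default_rng(1)
R, sigma2, trials = 5, 4.0, 200_000

x = rng.normal(0.0, np.sqrt(sigma2), size=(trials, R))
mu_mle = x.mean(axis=1, keepdims=True)

s2_mle = np.mean((x - mu_mle) ** 2, axis=1)  # 1/R normalization
s2_unbiased = s2_mle * R / (R - 1)           # 1/(R-1) normalization

print(s2_mle.mean())       # ~ (1 - 1/R) * sigma2 = 3.2
print(s2_unbiased.mean())  # ~ sigma2 = 4.0
```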
Unbiaseditude discussion
Which is best?
$$\sigma^2_{mle} = \frac{1}{R}\sum_{i=1}^R (x_i - \mu_{mle})^2$$

$$\sigma^2_{unbiased} = \frac{1}{R-1}\sum_{i=1}^R (x_i - \mu_{mle})^2$$
Answer: it depends on the task. And it doesn't make much difference once $R \to$ large.
Consider two more estimators of $\mu$:

$$\mu_{suboptimal} = \frac{1}{R+7}\sum_{i=1}^R x_i \qquad\qquad \mu_{crap} = x_1$$
Are either of these unbiased? Will either of them asymptote to the correct value as R gets large? Which is more useful?
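To make those questions concrete, a quick simulation sketch (assuming numpy; the true $\mu$ is hypothetical). $\mu_{crap}$ is unbiased but never improves with $R$; $\mu_{suboptimal}$ is biased for any finite $R$ but converges to $\mu$ as $R$ grows:

```python
import numpy as np

rng = np.random.default_rng(2)
mu, trials = 10.0, 100_000

for R in (5, 50, 500):
    x = rng.normal(mu, 1.0, size=(trials, R))
    mu_suboptimal = x.sum(axis=1) / (R + 7)  # biased: E = R*mu/(R+7), but -> mu as R grows
    mu_crap = x[:, 0]                        # unbiased: E = mu, but variance never shrinks
    print(R, mu_suboptimal.mean(), mu_crap.mean(), mu_crap.std())
```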
For an $m$-dimensional Gaussian, the same recipe gives

$$\mu_{mle} = \frac{1}{R}\sum_{k=1}^R x_k \qquad\qquad \Sigma_{mle} = \frac{1}{R}\sum_{k=1}^R (x_k - \mu_{mle})(x_k - \mu_{mle})^T$$

Component-wise:

$$\mu^{mle}_i = \frac{1}{R}\sum_{k=1}^R x_{ki}$$

where $1 \le i \le m$, $x_{ki}$ is the value of the $i$th component of $x_k$ (the $i$th attribute of the $k$th record), and $\mu^{mle}_i$ is the $i$th component of $\mu_{mle}$.
$$\Sigma^{mle}_{ij} = \frac{1}{R}\sum_{k=1}^R (x_{ki} - \mu^{mle}_i)(x_{kj} - \mu^{mle}_j)$$

where $x_{ki}$ is the value of the $i$th component of $x_k$ (the $i$th attribute of the $k$th record) and $\Sigma^{mle}_{ij}$ is the $(i,j)$th component of $\Sigma_{mle}$.
Suppose you have $x_1, x_2, \ldots, x_R \sim$ (i.i.d.) $N(\mu, \Sigma)$, but you don't know $\mu$ or $\Sigma$. MLE: for which $\theta = (\mu, \Sigma)$ is $x_1, \ldots, x_R$ most likely? A: just plug through the recipe.

$$\mu_{mle} = \frac{1}{R}\sum_{k=1}^R x_k$$

$$\Sigma_{mle} = \frac{1}{R}\sum_{k=1}^R (x_k - \mu_{mle})(x_k - \mu_{mle})^T$$

$$\Sigma_{unbiased} = \frac{1}{R-1}\sum_{k=1}^R (x_k - \mu_{mle})(x_k - \mu_{mle})^T$$

Note how $\Sigma_{mle}$ is forced to be symmetric and non-negative definite. Note the unbiased case. How many datapoints would you need before the Gaussian has a chance of being non-degenerate?
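A minimal numpy sketch of the multivariate estimates above (my own illustration); note that np.cov uses the $1/(R-1)$ "unbiased" normalization unless told otherwise:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 3))  # R = 100 records, m = 3 attributes

mu_mle = X.mean(axis=0)                    # (1/R) * sum_k x_k
D = X - mu_mle
Sigma_mle = (D.T @ D) / len(X)             # (1/R) * sum_k (x_k - mu)(x_k - mu)^T
Sigma_unbiased = (D.T @ D) / (len(X) - 1)  # 1/(R-1) normalization

# Symmetric and non-negative definite by construction (D^T D is a Gram matrix)
assert np.allclose(Sigma_mle, np.cov(X, rowvar=False, bias=True))
assert np.allclose(Sigma_unbiased, np.cov(X, rowvar=False))
```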
Confidence intervals
We need to talk. We need to discuss how accurate we expect $\mu_{mle}$ and $\Sigma_{mle}$ to be as a function of $R$. And we need to consider how to estimate these accuracies from data: analytically*, or non-parametrically (using randomization and bootstrapping)*. But we won't. Not yet.

*Will be discussed in future Andrew lectures, just before we need this technology.
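Although the slides defer this topic, here is a minimal sketch of the non-parametric (bootstrap) idea mentioned above, assuming numpy: resample the data with replacement many times and look at the spread of the resulting $\mu_{mle}$ values.

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(3.0, 2.0, size=30)  # hypothetical dataset

boot_means = np.array([
    rng.choice(x, size=len(x), replace=True).mean()
    for _ in range(10_000)
])
# Rough 95% confidence interval for mu_mle from the bootstrap distribution
print(np.percentile(boot_means, [2.5, 97.5]))
```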
Structural error
Actually, we need to talk about something else too. What if we do all this analysis when the true distribution is in fact not Gaussian? How can we tell?* How can we survive?*

*Will be discussed in future Andrew lectures, just before we need this technology.
Multivariate MLE
Being Bayesian: MAP estimates for Gaussians

Suppose you have $x_1, x_2, \ldots, x_R \sim$ (i.i.d.) $N(\mu, \Sigma)$. MAP: which $(\mu, \Sigma)$ maximizes $p(\mu, \Sigma \mid x_1, x_2, \ldots, x_R)$?

Step 1: Prior:

$$\Sigma \sim IW(\nu_0, (\nu_0 - m - 1)\Sigma_0) \qquad\qquad \mu \mid \Sigma \sim N(\mu_0, \Sigma / \kappa_0)$$

$E[\Sigma] = \Sigma_0$. $\nu_0$ large: I'm pretty sure about my guess of $\Sigma_0$. This thing is called the Inverse-Wishart distribution: a PDF over SPD matrices! $\kappa_0$ large: I'm pretty sure about my guess of $\mu_0$. Together, $\Sigma$ and $\mu \mid \Sigma$ define a joint distribution on $(\mu, \Sigma)$.

Step 2:

$$\bar{x} = \frac{1}{R}\sum_{k=1}^R x_k$$

$$\mu_R = \frac{\kappa_0 \mu_0 + R\bar{x}}{\kappa_0 + R} \qquad \kappa_R = \kappa_0 + R \qquad \nu_R = \nu_0 + R$$

$$(\nu_R - m - 1)\Sigma_R = (\nu_0 - m - 1)\Sigma_0 + \sum_{k=1}^R (x_k - \bar{x})(x_k - \bar{x})^T + \frac{(\bar{x} - \mu_0)(\bar{x} - \mu_0)^T}{1/\kappa_0 + 1/R}$$

Step 3: Posterior:

$$\Sigma \sim IW(\nu_R, (\nu_R - m - 1)\Sigma_R) \qquad\qquad \mu \mid \Sigma \sim N(\mu_R, \Sigma / \kappa_R)$$

Result: $\mu_{map} = \mu_R$, $E[\Sigma \mid x_1, x_2, \ldots, x_R] = \Sigma_R$

Look carefully at what these formulae are doing. It's all very sensible. Conjugate priors: the prior form and the posterior form are the same, and are characterized by sufficient statistics of the data. The marginal distribution on $\mu$ is a student-t. One point of view: it's pretty academic if $R > 30$.
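A direct transcription of Steps 2 and 3 into numpy (a sketch under the slide's parameterization; the function and variable names are mine):

```python
import numpy as np

def niw_posterior(X, mu0, kappa0, nu0, Sigma0):
    """Conjugate update for a Gaussian with a Normal-Inverse-Wishart prior.

    Prior: Sigma ~ IW(nu0, (nu0 - m - 1) * Sigma0), mu | Sigma ~ N(mu0, Sigma / kappa0)
    """
    R, m = X.shape
    xbar = X.mean(axis=0)

    kappaR = kappa0 + R
    nuR = nu0 + R
    muR = (kappa0 * mu0 + R * xbar) / (kappa0 + R)

    D = X - xbar
    S = D.T @ D  # sum_k (x_k - xbar)(x_k - xbar)^T
    d = (xbar - mu0).reshape(-1, 1)
    SigmaR = ((nu0 - m - 1) * Sigma0 + S
              + (d @ d.T) / (1.0 / kappa0 + 1.0 / R)) / (nuR - m - 1)

    # mu_map = muR, E[Sigma | data] = SigmaR
    return muR, kappaR, nuR, SigmaR
```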
Where we're at
| Inputs → | Categorical inputs only | Real-valued inputs only | Mixed Real / Cat okay |
| --- | --- | --- | --- |
| Classifier (predict category) | Joint BC, Naive BC | | Dec Tree |
| Density Estimator (probability) | Joint DE, Naive DE | Gauss DE | |
| Regressor (predict real no.) | | | |
Useful exercise
We'd already done some MLE in this class without even telling you! Suppose categorical arity-$n$ inputs $x_1, x_2, \ldots, x_R \sim$ (i.i.d.) from a multinomial $M(p_1, p_2, \ldots, p_n)$

where $P(x = \text{the } j\text{th value}) = p_j$. What is the MLE $p = (p_1, p_2, \ldots, p_n)$?
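The standard result this exercise leads to is the empirical frequencies, $p^{mle}_j = (\text{\# records with value } j) / R$, which in code (a numpy sketch of my own; the data are hypothetical) is just a normalized bincount:

```python
import numpy as np

x = np.array([0, 2, 1, 0, 2, 2, 1, 0])  # hypothetical categorical data, arity n = 3

R = len(x)
p_mle = np.bincount(x, minlength=3) / R  # p_j = (# records with value j) / R
print(p_mle)  # [0.375 0.25 0.375]
```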