
Maximum-Likelihood & Bayesian Parameter Estimation

Srihari: CSE 555


Maximum Likelihood Versus Bayesian Parameter Estimation

• An optimal classifier can be designed if the priors P(ωi) and the class-conditional densities p(x | ωi) are known
• In practice they are obtained from training samples, assuming known forms for the pdfs; e.g., p(x | ωi) ~ N(µi, Σi) has the two parameters µi and Σi
• Estimation techniques (contrasted in the numerical sketch below):
  – Maximum-Likelihood (ML): find the parameters that maximize the probability of the observations
  – Bayesian estimation: parameters are random variables with a known prior distribution that is sharpened by the observations
  – The results are nearly identical, but the approaches are different
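As a minimal numerical sketch (not from the slides), the Python snippet below contrasts the two approaches for the mean of a 1-D Gaussian with known variance. The prior N(mu0, sigma0²), the data, and all constants are illustrative assumptions, chosen so the conjugate Bayesian update has a closed form.

    import numpy as np

    # Toy data: samples from N(true_mu, sigma^2); sigma is assumed known.
    rng = np.random.default_rng(0)
    true_mu, sigma = 2.0, 1.0
    x = rng.normal(true_mu, sigma, size=20)

    # Maximum likelihood: a single point estimate (the sample mean).
    mu_ml = x.mean()

    # Bayesian: mu is a random variable with an assumed prior N(mu0, sigma0^2);
    # the Gaussian prior is conjugate, so the posterior is again Gaussian and
    # is "sharpened" (its variance shrinks) as more samples arrive.
    mu0, sigma0 = 0.0, 2.0
    n = len(x)
    post_var = 1.0 / (n / sigma**2 + 1.0 / sigma0**2)
    post_mean = post_var * (x.sum() / sigma**2 + mu0 / sigma0**2)

    print(mu_ml, post_mean, post_var)  # posterior mean -> ML estimate as n grows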
Maximum Likelihood Parameter Estimation

• Parameters are fixed but unknown
• The best parameters are obtained by maximizing the probability of obtaining the samples observed
  – Has good convergence properties as the sample size increases
  – Simpler than alternative techniques
• The general principle (see the per-class sketch below):
  Assume c classes and p(x | ωj) ~ N(µj, Σj)
  p(x | ωj) ≡ p(x | ωj, θj), where θj = (µj, Σj)
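As a hedged illustration of this principle, the sketch below (the function name and data layout are assumptions, not part of the slides) fits a separate (µj, Σj) for each class using only that class's own samples, which is all the ML setup requires.

    import numpy as np

    def fit_class_gaussians(X, y):
        # Illustrative only: X is an (n, d) array of samples, y holds class labels.
        # Each theta_j = (mu_j, Sigma_j) is estimated from class omega_j's samples alone.
        params = {}
        for j in np.unique(y):
            Xj = X[y == j]
            mu_j = Xj.mean(axis=0)                          # ML estimate of the mean
            Sigma_j = np.cov(Xj, rowvar=False, bias=True)   # ML (1/n) covariance
            params[j] = (mu_j, Sigma_j)
        return params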
Maximum Likelihood Estimation
• Use the n training samples of a class to estimate θ
• If D contains n independently drawn samples x1, x2, ..., xn, then

  p(D | θ) = ∏_{k=1}^{n} p(x_k | θ)

  p(D | θ) is called the likelihood of θ with respect to the set of samples, and
  l(θ) = ln p(D | θ) is the log-likelihood of θ
• The ML estimate of θ is, by definition, the value θ̂ that maximizes p(D | θ):
  "It is the value of θ̂ that best agrees with the actually observed training samples"
  (a numerical sketch of these definitions follows)
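The small sketch below puts these definitions into code; the data values and the known σ are made-up assumptions. It also hints at why the log form is preferred in practice: a product of many densities underflows, while a sum of logs does not.

    import numpy as np
    from scipy.stats import norm

    # Hypothetical 1-D samples; theta is the unknown Gaussian mean, sigma assumed known.
    x = np.array([1.8, 2.3, 1.9, 2.6, 2.1])
    sigma = 1.0

    def likelihood(theta):
        # p(D | theta) = product over the independently drawn samples of p(x_k | theta)
        return np.prod(norm.pdf(x, loc=theta, scale=sigma))

    def log_likelihood(theta):
        # l(theta) = ln p(D | theta) = sum_k ln p(x_k | theta)
        return np.sum(norm.logpdf(x, loc=theta, scale=sigma))

    print(likelihood(2.0), log_likelihood(2.0))
    print(log_likelihood(x.mean()))   # the maximum is attained at the sample mean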
One-Dimensional Example

[Figure: four of an infinite number of candidate source distributions p(x | θ); the likelihood p(D | θ) as a function of the mean, which peaks at the sample mean and would show a sharper peak with many samples; and the log-likelihood function l(θ), which also peaks at the mean.]
Maximizing the log-likelihood function
• Let θ = (θ1, θ2, ..., θp)^t and let ∇θ be the gradient operator

  ∇θ = [∂/∂θ1, ∂/∂θ2, ..., ∂/∂θp]^t

• Define l(θ) as the log-likelihood function

  l(θ) = ln p(D | θ)

• The ML estimate is the value of θ that maximizes the log-likelihood

  θ̂ = arg max_θ l(θ)

• A set of necessary conditions for an optimum (checked numerically in the sketch below) is

  ∇θ l = ∑_{k=1}^{n} ∇θ ln p(x_k | θ) = 0
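For a concrete check of this necessary condition, the sketch below (toy data and a known σ, both assumptions) evaluates the gradient of the 1-D Gaussian log-likelihood with respect to the mean; it vanishes at the sample mean and not elsewhere.

    import numpy as np

    # Toy samples; sigma assumed known, theta is the mean.
    x = np.array([1.8, 2.3, 1.9, 2.6, 2.1])
    sigma = 1.0

    def grad_log_likelihood(theta):
        # sum_k d/dtheta ln p(x_k | theta) = sum_k (x_k - theta) / sigma^2
        return np.sum(x - theta) / sigma**2

    print(grad_log_likelihood(x.mean()))  # ~0: necessary condition holds at the sample mean
    print(grad_log_likelihood(0.0))       # nonzero away from the maximizer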
MLE: The Gaussian Case with Unknown µ

• Suppose p(x_k | µ) ~ N(µ, Σ) with Σ known. Then

  ln p(x_k | µ) = −(1/2) ln[(2π)^d |Σ|] − (1/2)(x_k − µ)^t Σ^{-1} (x_k − µ)

  and ∇µ ln p(x_k | µ) = Σ^{-1}(x_k − µ)

• With θ = µ, the ML estimate of µ must therefore satisfy

  ∑_{k=1}^{n} Σ^{-1}(x_k − µ̂) = 0

• Multiplying by Σ and rearranging gives

  µ̂ = (1/n) ∑_{k=1}^{n} x_k

  which is just the sample mean (verified numerically in the sketch below)
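The multivariate sketch below (synthetic data and a made-up known Σ, both assumptions) checks that the sample mean indeed satisfies the condition ∑_k Σ^{-1}(x_k − µ̂) = 0.

    import numpy as np

    # Assumed known covariance and synthetic 2-D data for illustration.
    rng = np.random.default_rng(1)
    Sigma = np.array([[2.0, 0.3],
                      [0.3, 1.0]])
    X = rng.multivariate_normal(mean=[1.0, -2.0], cov=Sigma, size=100)

    mu_hat = X.mean(axis=0)                                    # ML estimate: the sample mean
    residual = np.linalg.inv(Sigma) @ (X - mu_hat).sum(axis=0)
    print(mu_hat)
    print(residual)                                            # ~[0, 0]: the ML condition holds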
MLE: The Gaussian Case with Unknown µ and σ²

• For the univariate case, let θ = (θ1, θ2) = (µ, σ²). Then

  ln p(x_k | θ) = −(1/2) ln(2πθ2) − (1/(2θ2))(x_k − θ1)²

• Setting the gradient to zero,

  ∇θ l = [∂/∂θ1 ln p(x_k | θ), ∂/∂θ2 ln p(x_k | θ)]^t = 0

  gives, for each sample x_k,

  (1/θ2)(x_k − θ1) = 0
  −1/(2θ2) + (x_k − θ1)²/(2θ2²) = 0

• Summing over all n samples:

  (1)  ∑_{k=1}^{n} (1/θ̂2)(x_k − θ̂1) = 0
  (2)  −∑_{k=1}^{n} 1/θ̂2 + ∑_{k=1}^{n} (x_k − θ̂1)²/θ̂2² = 0

• Solving (1) and (2) yields the familiar estimates (computed in the sketch below):

  µ̂ = (1/n) ∑_{k=1}^{n} x_k   ;   σ̂² = (1/n) ∑_{k=1}^{n} (x_k − µ̂)²
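These closed-form estimates are easy to verify numerically; the sketch below uses made-up 1-D data (an assumption) and compares against NumPy's default variance.

    import numpy as np

    # Both parameters unknown: the ML estimates are the sample mean and the
    # 1/n (not 1/(n-1)) sample variance derived above.
    x = np.array([1.8, 2.3, 1.9, 2.6, 2.1])
    n = len(x)

    mu_hat = x.sum() / n
    sigma2_hat = np.sum((x - mu_hat) ** 2) / n

    print(mu_hat, sigma2_hat)
    print(np.var(x))   # NumPy's default ddof=0 matches the ML (1/n) estimate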
MLE Bias
• The ML estimate for σ² is biased:

  E[(1/n) ∑ (x_i − x̄)²] = ((n − 1)/n) σ² ≠ σ²

• An elementary unbiased estimator for Σ is the sample covariance matrix (its behavior is checked empirically in the sketch below):

  C = (1/(n − 1)) ∑_{k=1}^{n} (x_k − µ̂)(x_k − µ̂)^t
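A quick simulation (sample size, σ², and trial count are arbitrary assumptions) makes the bias visible: averaged over many repetitions, the 1/n estimator concentrates near ((n−1)/n)σ², while the 1/(n−1) estimator concentrates near σ².

    import numpy as np

    rng = np.random.default_rng(2)
    sigma2, n, trials = 4.0, 5, 200_000

    samples = rng.normal(0.0, np.sqrt(sigma2), size=(trials, n))
    var_ml = samples.var(axis=1, ddof=0)        # the 1/n (ML, biased) estimator
    var_unbiased = samples.var(axis=1, ddof=1)  # the 1/(n-1) (unbiased) estimator

    print(var_ml.mean(), (n - 1) / n * sigma2)  # both close to 3.2
    print(var_unbiased.mean(), sigma2)          # both close to 4.0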
