http://www.autonlab.org/tutorials/gmm.html
Bias of Maximum Likelihood
• Consider the expectations of the maximum likelihood estimates under the Gaussian distribution
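The equations on this slide did not survive extraction; the standard results (Bishop, 2003) for a Gaussian fit to N samples by maximum likelihood are:

$$\mathbb{E}[\mu_{\text{ML}}] = \mu, \qquad \mathbb{E}[\sigma^2_{\text{ML}}] = \frac{N-1}{N}\,\sigma^2$$

The mean estimate is unbiased, but the variance estimate is systematically too small; the bias vanishes as N grows.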
Intuitive Explanation of Over-fitting

[Figure: time between eruptions (minutes); Bishop, 2003]
Idea: Use a Mixture of Gaussians
• Linear superposition of Gaussians
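The superposition formula itself did not survive extraction; the standard mixture density (Bishop, 2003) is:

$$p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k\,\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k), \qquad \pi_k \ge 0, \quad \sum_{k=1}^{K} \pi_k = 1$$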
Example: Mixture of 3 Gaussians
Sampling from a GMM

Sampling from a General GMM
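A minimal sketch of two-stage (ancestral) sampling from a general GMM; the function name and example parameters below are assumptions, not from the slides. First draw a component index z with probabilities given by the mixing weights, then draw x from that component's Gaussian:

```python
import numpy as np

def sample_gmm(pis, mus, Sigmas, n, rng=None):
    """Draw n points from a mixture of Gaussians.

    pis    : (K,) mixing weights, non-negative, summing to 1
    mus    : (K, d) component means
    Sigmas : (K, d, d) component covariances
    """
    rng = np.random.default_rng() if rng is None else rng
    zs = rng.choice(len(pis), size=n, p=pis)  # pick a component per point
    X = np.array([rng.multivariate_normal(mus[z], Sigmas[z]) for z in zs])
    return X, zs

# Example: mixture of 3 Gaussians in 2-D
pis = np.array([0.5, 0.3, 0.2])
mus = np.array([[0.0, 0.0], [3.0, 3.0], [-3.0, 2.0]])
Sigmas = np.array([np.eye(2), 0.5 * np.eye(2), np.eye(2)])
X, z = sample_gmm(pis, mus, Sigmas, 500)
```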
Labeled vs Unlabeled Data
• Labeled: easy to estimate params (do each color separately); see the sketch below.
• Unlabeled: hard to estimate params (we need to assign colors).
[Figure: Bishop, 2003]
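"Do each color separately" is just maximum-likelihood estimation within each labeled class; a minimal numpy sketch (function name assumed, not from the slides):

```python
import numpy as np

def fit_labeled_gaussians(X, y):
    """ML mean/covariance per class label ('color')."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        # bias=True gives the ML (divide-by-N) covariance estimate
        params[c] = (Xc.mean(axis=0), np.cov(Xc, rowvar=False, bias=True))
    return params
```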
Side-Trip: Clustering using K-means

Some Data
K-means
1. Ask user how many clusters they'd like (e.g., k=5).
2. Randomly guess k cluster Center locations.
3. Each datapoint finds out which Center it's closest to (thus each Center "owns" a set of datapoints).
4. Each Center finds the centroid of the points it owns…
5. …and jumps there.
6. …Repeat until terminated!

(Start) Example generated by Dan Pelleg's super-duper fast K-means system: Dan Pelleg and Andrew Moore. Accelerating Exact k-means Algorithms with Geometric Reasoning. Proc. Conference on Knowledge Discovery in Databases 1999 (KDD99), available on www.autonlab.org/pap.html.
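Steps 1-6 map directly onto a few lines of numpy; a minimal sketch (function name and random-datapoint initialization are assumptions, not from the slides):

```python
import numpy as np

def kmeans(X, k, n_iter=100, rng=None):
    """X: (N, d) data; returns (centers, assignments)."""
    rng = np.random.default_rng() if rng is None else rng
    centers = X[rng.choice(len(X), size=k, replace=False)]  # step 2: random guess
    for _ in range(n_iter):
        # step 3: each datapoint finds the Center it's closest to
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)  # (N, k)
        owner = d2.argmin(axis=1)
        # steps 4-5: each Center jumps to the centroid of the points it owns
        new_centers = np.array([X[owner == j].mean(axis=0) if np.any(owner == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):  # step 6: terminate on convergence
            break
        centers = new_centers
    return centers, owner
```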
K-means continues…

[Animation frames: the assignment and re-centering steps repeat for several more iterations]
K-means terminates

Questions
• What is it trying to optimize?
• Are we sure it will terminate?
• Are we sure it will find an optimal clustering?
• How should we start it?
• How could we automatically choose the number of centers?
…we'll deal with these questions over the next few slides
Distortion
Given…
• an encoder function: ENCODE : \(\Re^m \to [1..k]\)
• a decoder function: DECODE : \([1..k] \to \Re^m\)
Define…

$$\text{Distortion} = \sum_{i=1}^{R} \big(\mathbf{x}_i - \text{DECODE}[\text{ENCODE}(\mathbf{x}_i)]\big)^2$$

What properties must centers c1, c2, …, ck have when distortion is minimized?
(1) x_i must be encoded by its nearest center
…why?
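As a concrete reading of the definition, a small sketch (the function name and nearest-center ENCODE are assumptions, not from the slides):

```python
import numpy as np

def distortion(X, centers):
    """Sum of squared distances from each point to its nearest center,
    i.e. ENCODE = nearest-center index, DECODE = center lookup."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)  # (N, k)
    return d2.min(axis=1).sum()
```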
The Minimal Distortion (1)

$$\text{Distortion} = \sum_{i=1}^{R} \big(\mathbf{x}_i - \mathbf{c}_{\text{ENCODE}(\mathbf{x}_i)}\big)^2$$

What properties must centers c1, c2, …, ck have when distortion is minimized?
(1) x_i must be encoded by its nearest center. …why? Otherwise distortion could be reduced by replacing ENCODE[x_i] by the nearest center.

The Minimal Distortion (2)

(2) The partial derivative of Distortion with respect to each center location must be zero. Grouping the sum by which Center owns each record:

$$\text{Distortion} = \sum_{i=1}^{R} \big(\mathbf{x}_i - \mathbf{c}_{\text{ENCODE}(\mathbf{x}_i)}\big)^2 = \sum_{j=1}^{k} \sum_{i \in \text{OwnedBy}(\mathbf{c}_j)} \big(\mathbf{x}_i - \mathbf{c}_j\big)^2$$

where OwnedBy(c_j) = the set of records owned by Center c_j. Then

$$\frac{\partial\,\text{Distortion}}{\partial \mathbf{c}_j} = \frac{\partial}{\partial \mathbf{c}_j} \sum_{i \in \text{OwnedBy}(\mathbf{c}_j)} \big(\mathbf{x}_i - \mathbf{c}_j\big)^2 = -2 \sum_{i \in \text{OwnedBy}(\mathbf{c}_j)} \big(\mathbf{x}_i - \mathbf{c}_j\big) = 0$$

So, what properties must centers c1, c2, …, ck have when distortion is minimized?
(1) x_i must be encoded by its nearest center.
(2) Each Center must be at the centroid of the points it owns.

And what can be changed for centers c1, c2, …, ck when distortion is not minimized?
(1) Change encoding so that x_i is encoded by its nearest center.
(2) Set each Center to the centroid of the points it owns.

There's no point applying either operation twice in succession. But it can be profitable to alternate. …And that's K-means!
Easy to prove this procedure will terminate in a state at which neither (1) nor (2) change the configuration. Why?
Improving a suboptimal configuration…
(1) Change encoding so that x_i is encoded by its nearest center.
(2) Set each Center to the centroid of the points it owns.

Why K-means terminates: There are only a finite number of ways of partitioning R records into k groups. So there are only a finite number of possible configurations in which all Centers are the centroids of the points they own. If the configuration changes on an iteration, it must have improved the distortion. So each time the configuration changes, it must go to a configuration it's never been to before. So if it tried to go on forever, it would eventually run out of configurations.

Will we find the optimal configuration?
• Not necessarily.
• Can you invent a configuration that has converged, but does not have the minimum distortion? (Hint: try a fiendish k=3 configuration here…)
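The slide hints at a fiendish k=3 configuration; an even simpler invented k=2 example (not from the slides) already shows a converged but suboptimal fixed point. Four points at the corners of a long rectangle admit two configurations that both satisfy conditions (1) and (2), yet have very different distortions:

```python
# Two converged K-means configurations on the same data (k = 2).
# Both are fixed points of the assign/re-center loop, but only one is optimal.
import numpy as np

X = np.array([[0., 0.], [0., 1.], [4., 0.], [4., 1.]])

def distortion(X, centers):
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).sum()

good = np.array([[0., 0.5], [4., 0.5]])  # centers own the left/right pairs
bad  = np.array([[2., 0.], [2., 1.]])    # centers own the bottom/top pairs

for centers in (good, bad):
    # condition (1): assign each point to its nearest center
    owner = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
    # condition (2): is every center already at the centroid of its points?
    centroids = np.array([X[owner == j].mean(axis=0) for j in range(2)])
    print(centers.tolist(), "converged:", np.allclose(centroids, centers),
          "distortion:", distortion(X, centers))
# Both print converged: True, but the distortions are 1.0 vs 16.0.
```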
Common uses of K-means
• Often used as an exploratory data analysis tool.
• In one dimension, a good way to quantize real-valued variables into k non-uniform buckets.
• Used on acoustic data in speech understanding to convert waveforms into one of k categories (known as Vector Quantization).
• Also used for choosing color palettes on old-fashioned graphical display devices!
• Used to initialize clusters for the EM algorithm!!!

Back to Estimating GMMs
• Recall: p(x, z)
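The formula following "Recall:" did not survive extraction. In the standard latent-variable formulation (Bishop, 2003), with a 1-of-K indicator vector z, the joint factors as:

$$p(\mathbf{x}, \mathbf{z}) = p(\mathbf{z})\,p(\mathbf{x}\mid\mathbf{z}) = \prod_{k=1}^{K} \big[\pi_k\,\mathcal{N}(\mathbf{x}\mid\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)\big]^{z_k}$$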
Expected Complete-Data Log Likelihood
• Suppose we make a guess for the parameter values (means, covariances and mixing coefficients).
• Use these to evaluate the responsibilities (ownership weights).
• Consider the expected complete-data log likelihood, with means μ_k, covariances Σ_k, and mixing probabilities π_k:

$$\mathbb{E}_{\mathbf{Z}}\big[\ln p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}, \boldsymbol{\pi})\big] = \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma(z_{nk}) \big\{ \ln \pi_k + \ln \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \big\}$$

• To summarize what we just did: we replaced the unknown discrete value (0 or 1) z_{nk} with the ownership weights (responsibilities) γ(z_{nk}).
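A sketch of the responsibility (E-step) computation under these definitions, using scipy's multivariate normal density (function name assumed, not from the slides):

```python
import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(X, pis, mus, Sigmas):
    """gamma[n, k] = pi_k N(x_n | mu_k, Sigma_k) / sum_j pi_j N(x_n | mu_j, Sigma_j)."""
    K = len(pis)
    dens = np.column_stack([pis[k] * multivariate_normal.pdf(X, mus[k], Sigmas[k])
                            for k in range(K)])   # (N, K) weighted densities
    return dens / dens.sum(axis=1, keepdims=True)  # normalize over components
```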
After first iteration

After 2nd iteration
After 20th iteration

Homework
4) Change your code for generating N points from a single multivariate Gaussian so that it instead generates N points from a mixture of Gaussians. Assume K Gaussians, each of which is specified by a mixing parameter 0 <= p_i <= 1 (the p_i should sum to 1), a 2x1 mean vector mu_i, and a 2x2 covariance matrix C_i.

5) Write code to perform the K-means algorithm, given a set of N data points and a number K of desired clusters. You can either start the algorithm with random cluster centers, or try something smarter that you can think up.