K-means and Hierarchical Clustering

Andrew W. Moore
Associate Professor
School of Computer Science
Carnegie Mellon University
www.cs.cmu.edu/~awm
awm@cs.cmu.edu
412-268-7599

Note to other teachers and users of these slides: Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew's tutorials: http://www.cs.cmu.edu/~awm/tutorials. Comments and corrections gratefully received.
Some Data

[Scatter plot of the example dataset; image not reproduced here.]
Lossy Compression

Suppose you transmit the coordinates of points drawn randomly from this dataset. You can install decoding software at the receiver. You're only allowed to send two bits per point. It'll have to be a "lossy transmission".

Loss = Sum Squared Error between decoded coords and original coords.

What encoder/decoder will lose the least information?
Idea Two

Suppose you transmit the coordinates of points drawn randomly from this dataset. You can install decoding software at the receiver. You're only allowed to send two bits per point. It'll have to be a "lossy transmission".

Break into a grid; decode each bit-pair as the centroid of all data in that grid-cell. [2×2 grid over the data, cells labelled 00, 01, 10, 11; figure not reproduced here.]

Loss = Sum Squared Error between decoded coords and original coords.

What encoder/decoder will lose the least information?

Any Further Ideas?
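A minimal sketch of Idea Two in Python/NumPy. The dataset, grid midpoints, and all names here are illustrative assumptions (the slides only show a picture):

```python
import numpy as np

# Illustrative stand-in for the slide's dataset (the real data isn't given).
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(1000, 2))

def encode(x, xmid=0.5, ymid=0.5):
    """ENCODE: two bits per point, naming which quadrant of the grid it's in."""
    return (int(x[0] >= xmid) << 1) | int(x[1] >= ymid)

codes = np.array([encode(x) for x in X])

# DECODE table: each bit-pair decodes to the centroid of all data in that cell.
decode = np.array([X[codes == c].mean(axis=0) for c in range(4)])

# Loss = Sum Squared Error between decoded coords and original coords.
distortion = ((X - decode[codes]) ** 2).sum()
print(f"distortion = {distortion:.2f}")
```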
K-means

1. Ask user how many clusters they'd like. (e.g. k=5)
2. Randomly guess k cluster Center locations.
3. Each datapoint finds out which Center it's closest to. (Thus each Center "owns" a set of datapoints.)
4. Each Center finds the centroid of the points it owns…
5. …and jumps there.
6. …Repeat until terminated!
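These six steps translate almost line-for-line into code. A minimal NumPy sketch (function and variable names are mine; this is plain K-means, not an accelerated implementation):

```python
import numpy as np

def kmeans(X, k, rng=None):
    """Minimal K-means over the rows of X; returns (centers, owner)."""
    if rng is None:
        rng = np.random.default_rng(0)
    # 2. Randomly guess k cluster Center locations (here: k distinct datapoints).
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    while True:
        # 3. Each datapoint finds out which Center it's closest to.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        owner = dists.argmin(axis=1)
        # 4./5. Each Center finds the centroid of the points it owns, and jumps there.
        new_centers = np.array([X[owner == j].mean(axis=0) if (owner == j).any()
                                else centers[j] for j in range(k)])
        # 6. Repeat until terminated (no Center moves).
        if np.allclose(new_centers, centers):
            return centers, owner
        centers = new_centers
```

Calling `centers, owner = kmeans(X, k=5)` runs the loop on this slide to convergence.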
K-means Start

Advance apologies: in Black and White this example will deteriorate.

Example generated by Dan Pelleg's super-duper fast K-means system: Dan Pelleg and Andrew Moore. Accelerating Exact k-means Algorithms with Geometric Reasoning. Proc. Conference on Knowledge Discovery in Databases 1999 (KDD99). (Available on www.autonlab.org/pap.html)
K-means continues…

[Eight slides of figures follow, showing the Centers jumping to their centroids and the ownership regions updating on each iteration; images not reproduced here.]

K-means terminates
K-means Questions

• What is it trying to optimize?
• Are we sure it will terminate?
• Are we sure it will find an optimal clustering?
• How should we start it?
• How could we automatically choose the number of centers?

…we'll deal with these questions over the next few slides.
Distortion

Given…
• an encoder function: $\mathrm{ENCODE}: \Re^m \to [1..k]$
• a decoder function: $\mathrm{DECODE}: [1..k] \to \Re^m$

Define…

$$\mathrm{Distortion} = \sum_{i=1}^{R} \left( x_i - \mathrm{DECODE}[\mathrm{ENCODE}(x_i)] \right)^2$$
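Stated in code, with the decoder represented as a table of k points. A sketch under the definitions above (names mine):

```python
import numpy as np

def distortion(X, centers, codes):
    """Distortion = sum_i (x_i - DECODE[ENCODE(x_i)])^2,
    where DECODE[j] = centers[j] and codes[i] = ENCODE(x_i)."""
    return ((X - centers[codes]) ** 2).sum()

# With K-means' encoder, ENCODE(x_i) is the index of x_i's nearest center:
def nearest_center_codes(X, centers):
    return ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
```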
The Minimal Distortion (1)

Writing $c_j$ for $\mathrm{DECODE}[j]$ (so each code $j$ decodes to a Center $c_j$):

$$\mathrm{Distortion} = \sum_{i=1}^{R} \left( x_i - c_{\mathrm{ENCODE}(x_i)} \right)^2$$
The Minimal Distortion (2)

The partial derivative of Distortion with respect to each center location must be zero.

$$\mathrm{Distortion} = \sum_{i=1}^{R} \left( x_i - c_{\mathrm{ENCODE}(x_i)} \right)^2 = \sum_{j=1}^{k} \sum_{i \in \mathrm{OwnedBy}(c_j)} \left( x_i - c_j \right)^2$$

$$\frac{\partial\,\mathrm{Distortion}}{\partial c_j} = \frac{\partial}{\partial c_j} \sum_{i \in \mathrm{OwnedBy}(c_j)} (x_i - c_j)^2 = -2 \sum_{i \in \mathrm{OwnedBy}(c_j)} (x_i - c_j) = 0 \quad \text{(for a minimum)}$$

Thus, at a minimum:

$$c_j = \frac{1}{|\mathrm{OwnedBy}(c_j)|} \sum_{i \in \mathrm{OwnedBy}(c_j)} x_i$$
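The conclusion, that each center should sit at the centroid of the points it owns, is easy to sanity-check numerically. A throwaway check on made-up data (everything here is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(1)
owned = rng.normal(size=(50, 2))        # stand-in for the points OwnedBy(c_j)

sse = lambda c: ((owned - c) ** 2).sum()
centroid = owned.mean(axis=0)           # the claimed minimizer

# Nudging c_j away from the centroid should only ever increase the SSE.
for _ in range(100):
    assert sse(centroid) <= sse(centroid + 0.01 * rng.normal(size=2))
```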
Improving a suboptimal configuration…

$$\mathrm{Distortion} = \sum_{i=1}^{R} \left( x_i - c_{\mathrm{ENCODE}(x_i)} \right)^2$$

What can be changed for centers $c_1, c_2, \ldots, c_k$ when distortion is not minimized?

(1) Change the encoding so that $x_i$ is encoded by its nearest center.
(2) Set each Center to the centroid of the points it owns.

There's no point applying either operation twice in succession. But it can be profitable to alternate. …And that's K-means!

Easy to prove this procedure will terminate in a state at which neither (1) nor (2) changes the configuration. Why?

[Side note:] There are only a finite number of ways of partitioning R records into k groups, so there are only a finite number of possible configurations in which all Centers are the centroids of the points they own. If the configuration changes on an iteration, it must have improved the distortion. So each time the configuration changes, it must go to a configuration it's never been to before. So if it tried to go on forever, it would eventually run out of configurations.
Will we find the optimal configuration?

• Not necessarily.
• Can you invent a configuration that has converged, but does not have the minimum distortion? (Hint: try a fiendish k=3 configuration here…)
Trying to find good optima

• Idea 1: Be careful about where you start.
• Idea 2: Do many runs of k-means, each from a different random start configuration.
• Many other ideas floating around.

Neat trick:
Place first center on top of a randomly chosen datapoint.
Place second center on the datapoint that's as far away as possible from the first center.
…
Place the j'th center on the datapoint that's as far away as possible from the closest of Centers 1 through j−1.
…
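The neat trick is a greedy farthest-point placement. A sketch (names mine):

```python
import numpy as np

def farthest_first_centers(X, k, rng=None):
    """The 'neat trick': greedy farthest-point placement of the k start centers."""
    if rng is None:
        rng = np.random.default_rng(0)
    # Place first center on top of a randomly chosen datapoint.
    centers = [X[rng.integers(len(X))]]
    for _ in range(1, k):
        # Distance from each datapoint to the closest already-placed center...
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        # ...place the next center on the datapoint that is farthest away.
        centers.append(X[d.argmax()])
    return np.array(centers)
```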
Common uses of K-means

• Often used as an exploratory data analysis tool.
• In one dimension, a good way to quantize real-valued variables into k non-uniform buckets (see the sketch below).
• Used on acoustic data in speech understanding to convert waveforms into one of k categories (known as Vector Quantization).
• Also used for choosing color palettes on old-fashioned graphical display devices!
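For instance, the one-dimensional bucketing use might look like this (illustrative, made-up data; reuses the `kmeans` sketch from earlier):

```python
import numpy as np

# Skewed 1-D data; k-means places more buckets where the data is dense.
values = np.random.default_rng(2).lognormal(size=10_000)

centers, owner = kmeans(values.reshape(-1, 1), k=8)   # kmeans sketch from above
levels = np.sort(centers.ravel())                     # 8 non-uniform bucket centers
```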
Single Linkage Hierarchical Clustering

1. Say "Every point is its own cluster."
2. Find "most similar" pair of clusters.
3. Merge it into a parent cluster.
4. Repeat… until you've merged the whole dataset into one cluster.

How do we define similarity between clusters?
• Minimum distance between points in clusters (in which case we're simply doing Euclidean Minimum Spanning Trees)
• Maximum distance between points in clusters
• Average distance between points in clusters

You're left with a nice dendrogram, or taxonomy, or hierarchy of datapoints (not shown here).
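SciPy ships this algorithm directly; a minimal sketch (the data and the cut level here are made up):

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

X = np.random.default_rng(3).normal(size=(30, 2))   # made-up datapoints

# method='single' merges the pair of clusters with the smallest
# minimum distance between their points, exactly as in steps 1-4.
Z = linkage(X, method='single')

labels = fcluster(Z, t=3, criterion='maxclust')     # cut the hierarchy into 3 clusters
# dendrogram(Z)                                     # draws the taxonomy/hierarchy
```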
What you should know

• All the details of K-means.
• The theory behind K-means as an optimization algorithm.
• How K-means can get stuck.
• The outline of Hierarchical Clustering.
• Be able to contrast which problems would be relatively well/poorly suited to K-means vs Gaussian Mixtures vs Hierarchical Clustering.