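Agglomerative approach (bottom-up)
Initialization: each object is a cluster.
Iteration: clusters are merged step by step.
[Figure: starting from single objects at Step 0, successive steps merge them into ab, de, cde and finally abcde.]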
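The bottom-up iteration can be sketched in a few lines of Python. This is a minimal illustration, not the slides' own procedure: the 1-D coordinates in `COORDS` are hypothetical values chosen so that the merge order matches the figure, and single-link (smallest pairwise distance) is assumed as the cluster distance.

```python
from itertools import combinations

# Hypothetical 1-D coordinates standing in for the objects a..e (an assumption).
COORDS = {"a": 0.0, "b": 1.0, "c": 5.0, "d": 7.0, "e": 8.5}

def single_link(c1, c2):
    """Smallest distance between any object in c1 and any object in c2."""
    return min(abs(COORDS[p] - COORDS[q]) for p in c1 for q in c2)

def agglomerate(names):
    """Merge the two closest clusters until one remains; return the merges in order."""
    clusters = [(n,) for n in names]          # Step 0: each object is a cluster
    merges = []
    while len(clusters) > 1:
        # Pick the pair of clusters with the smallest single-link distance.
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: single_link(*pair))
        clusters = [c for c in clusters if c not in (c1, c2)]
        merged = tuple(sorted(c1 + c2))
        clusters.append(merged)
        merges.append("".join(merged))
    return merges

merges = agglomerate(sorted(COORDS))
print(merges)  # ['ab', 'de', 'cde', 'abcde']
```

Each pass rescans all cluster pairs, which is fine for a handful of objects but is not how one would cluster a large data set.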
Hierarchical Clustering
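Divisive Approaches (top-down)
Initialization: all objects stay in one cluster.
Iteration: clusters are split step by step.
[Figure: the same tree read in reverse, from abcde down through cde and de (Step 4, Step 3, ...) to single objects.]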
Dendrogram
Distance?
Single-link
Complete-link
Average-link
Centroid distance
Average-link: the distance between two clusters is represented by the average distance of all pairs of data objects belonging to different clusters.
$$d_{avg}(C_i, C_j) = \operatorname{avg}_{p \in C_i,\, q \in C_j} d(p, q)$$
Centroid distance: the distance between two clusters is represented by the distance between the means of the clusters.
$$d_{mean}(C_i, C_j) = d(m_i, m_j)$$
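The four cluster distances named on these slides can be written side by side. The sketch below assumes Euclidean distance between 2-D points (the slides leave d(p, q) unspecified), and the two small clusters `ci` and `cj` are made-up data for illustration.

```python
from itertools import product
from math import dist  # Euclidean distance, Python 3.8+

def single_link(ci, cj):
    """Minimum distance over all cross-cluster pairs."""
    return min(dist(p, q) for p, q in product(ci, cj))

def complete_link(ci, cj):
    """Maximum distance over all cross-cluster pairs."""
    return max(dist(p, q) for p, q in product(ci, cj))

def average_link(ci, cj):
    """Mean distance over all cross-cluster pairs."""
    return sum(dist(p, q) for p, q in product(ci, cj)) / (len(ci) * len(cj))

def centroid(c):
    """Coordinate-wise mean of a cluster."""
    return tuple(sum(xs) / len(c) for xs in zip(*c))

def centroid_distance(ci, cj):
    """Distance between the two cluster means m_i and m_j."""
    return dist(centroid(ci), centroid(cj))

# Hypothetical clusters: two vertical pairs of points, 3 units apart.
ci = [(0.0, 0.0), (0.0, 2.0)]
cj = [(3.0, 0.0), (3.0, 2.0)]

for f in (single_link, complete_link, average_link, centroid_distance):
    print(f.__name__, round(f(ci, cj), 4))
```

On this data single-link and centroid distance both give 3, complete-link gives the diagonal (the square root of 13), and average-link falls between the two.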
[Figure: worked example on six points, labelled 1 to 6, with the nested clusters obtained under average-link, complete-link and centroid distance.]
Compare Dendrograms
[Figure: dendrograms of the same points under single-link, complete-link, centroid distance and average-link.]
Single-link (2 clusters)
Complete-link (2 clusters)
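To see how the two criteria can disagree when the tree is cut at two clusters, here is a small sketch on made-up 1-D data. The function name `cluster` and the point values are assumptions for illustration; passing `min` selects single-link and `max` selects complete-link.

```python
from itertools import combinations

def cluster(points, k, linkage):
    """Agglomerate 1-D points until k clusters remain.

    linkage is the built-in min (single-link) or max (complete-link),
    applied to all cross-cluster distances.
    """
    clusters = [[p] for p in points]
    while len(clusters) > k:
        # Choose the pair of clusters with the smallest linkage distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage(
                abs(p - q) for p in clusters[ij[0]] for q in clusters[ij[1]]
            ),
        )
        merged = clusters[i] + clusters[j]
        del clusters[j]          # j > i, so index i is still valid
        clusters[i] = merged
    return sorted(sorted(c) for c in clusters)

# Hypothetical points whose gaps grow slowly from left to right.
points = [0.0, 1.0, 2.1, 3.3, 4.6, 6.5]

single = cluster(points, 2, min)
complete = cluster(points, 2, max)
print(single)    # [[0.0, 1.0, 2.1, 3.3, 4.6], [6.5]]
print(complete)  # [[0.0, 1.0, 2.1, 3.3], [4.6, 6.5]]
```

Single-link keeps chaining neighbours and isolates only the last point, while complete-link avoids the long chain and produces a more balanced split.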
Strength of Single-link
Original Points
Two Clusters
Limitations of Single-Link
Original Points
Two Clusters
Strength of Complete-link
Original Points
Two Clusters
Robust to outliers
Tends to break large clusters
Prefers spherical clusters
UPGMA
Average-link approach;
The distance between two clusters is
measured by the average distance between
two objects belonging to different clusters.
$$d_{avg}(C_i, C_j) = \frac{1}{n_i n_j} \sum_{p \in C_i} \sum_{q \in C_j} d(p, q)$$
where n_i and n_j are the numbers of objects in C_i and C_j.
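The average distance above can be checked numerically. The sketch below uses made-up 1-D clusters `a`, `b` and `c` with absolute difference as d(p, q). It also illustrates a well-known property of this measure: after merging `a` and `b`, the distance to `c` equals the size-weighted mean of the two previous distances, so it need not be recomputed from all pairs.

```python
from itertools import product

def d(p, q):
    """Distance between two 1-D objects (an assumed choice of d)."""
    return abs(p - q)

def d_avg(ci, cj):
    """Average of d(p, q) over all p in ci and q in cj."""
    return sum(d(p, q) for p, q in product(ci, cj)) / (len(ci) * len(cj))

# Hypothetical clusters.
a, b, c = [1.0, 2.0], [4.0], [8.0, 10.0]

# Distance from the merged cluster (a + b) to c, computed from all pairs.
direct = d_avg(a + b, c)

# The same value from the pre-merge distances, weighted by cluster size.
updated = (len(a) * d_avg(a, c) + len(b) * d_avg(b, c)) / (len(a) + len(b))

print(d_avg(a, c), d_avg(b, c))  # 7.5 5.0
print(abs(direct - updated) < 1e-12)  # True
```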
TreeView
UPGMA
Order the objects
The color intensity represents
expression level.
A large patch of similar color
indicates a cluster.
http://genome-www.stanford.edu/serum/fig2cluster.html
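One way to obtain an ordering in which similar objects sit next to each other is to read the leaves of an average-link tree from left to right. The sketch below is only an illustration of that idea and makes no claim about how TreeView itself orders rows; the gene labels and expression profiles are hypothetical.

```python
from itertools import combinations

# Hypothetical expression profiles (one list of levels per gene).
profiles = {
    "g1": [1.0, 1.1, 0.9],
    "g2": [5.0, 5.2, 4.9],
    "g3": [1.1, 1.0, 1.0],
    "g4": [5.1, 5.0, 5.1],
}

def leaves(tree):
    """Left-to-right leaf labels of a nested-tuple tree."""
    if isinstance(tree, str):
        return [tree]
    left, right = tree
    return leaves(left) + leaves(right)

def dist(x, y):
    """Euclidean distance between the profiles of two labelled objects."""
    return sum((a - b) ** 2 for a, b in zip(profiles[x], profiles[y])) ** 0.5

def d_avg(t1, t2):
    """Average-link distance between the leaf sets of two subtrees."""
    l1, l2 = leaves(t1), leaves(t2)
    return sum(dist(x, y) for x in l1 for y in l2) / (len(l1) * len(l2))

def upgma_tree(labels):
    """Repeatedly join the two closest subtrees into a (left, right) pair."""
    trees = list(labels)
    while len(trees) > 1:
        t1, t2 = min(combinations(trees, 2), key=lambda pair: d_avg(*pair))
        trees = [t for t in trees if t not in (t1, t2)]
        trees.append((t1, t2))
    return trees[0]

order = leaves(upgma_tree(profiles))
print(order)  # ['g1', 'g3', 'g2', 'g4']
```

Drawing the rows of a heat map in this order places g1 next to g3 and g2 next to g4, which is what produces the large patches of similar color.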
DIANA - Explored
[Figure: a cluster being split into two groups, C1 and C2.]
6. Otherwise, stop splitting process.
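A single DIANA split can be sketched as follows. Only the final "otherwise, stop" step is given in the text here, so the rest follows the commonly described splinter-group procedure, which is an assumption about what the earlier steps specify. The function names, the 1-D data and the use of absolute difference as the dissimilarity are all illustrative, and the points are assumed to be distinct.

```python
def avg_dist(x, group):
    """Average |x - y| over y in group (0.0 for an empty group)."""
    return sum(abs(x - y) for y in group) / len(group) if group else 0.0

def diana_split(cluster):
    """Split one cluster of distinct 1-D points into (c1, c2)."""
    c1 = list(cluster)   # objects that stay in the original group
    c2 = []              # splinter group
    # Seed the splinter group with the object farthest, on average, from the rest.
    seed = max(c1, key=lambda x: avg_dist(x, [y for y in c1 if y != x]))
    c1.remove(seed)
    c2.append(seed)
    while len(c1) > 1:
        # Positive gain: x is closer, on average, to the splinter group than to c1.
        def gain(x):
            return avg_dist(x, [y for y in c1 if y != x]) - avg_dist(x, c2)
        best = max(c1, key=gain)
        if gain(best) <= 0:
            break        # otherwise, stop the splitting process
        c1.remove(best)
        c2.append(best)
    return sorted(c1), sorted(c2)

# Hypothetical data: a tight group near 1..3 and a looser one near 10..13.
c1, c2 = diana_split([1.0, 2.0, 3.0, 10.0, 11.0, 13.0])
print(c1, c2)  # [1.0, 2.0, 3.0] [10.0, 11.0, 13.0]
```

A full divisive run would apply this split repeatedly, each time choosing which remaining cluster to split next.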
Strengths
Weakness
single-link
complete-link
e.g. TreeView
Inconvenient when data set is large
Coarse granularity
Fine granularity