Professional Documents
Culture Documents
Chapter 3
PPDM Cl
Class
Tan,Steinbach, Kumar
4/18/2004
Intra-cluster
distances are
minimized
Tan,Steinbach, Kumar
4/18/2004
Discovered Clusters
Understanding
Group
p related documents
for browsing, group genes
and proteins that have
similar functionality, or
group stocks with similar
price fluctuations
Industry Group
Applied-Matl-DOWN,Bay-Network-Down,3-COM-DOWN,
Cabletron-Sys-DOWN,CISCO-DOWN,HP-DOWN,
DSC-Comm-DOWN,INTEL-DOWN,LSI-Logic-DOWN,
Micron-Tech-DOWN,Texas-Inst-Down,Tellabs-Inc-Down,
Natl-Semiconduct-DOWN,Oracl-DOWN,SGI-DOWN,
Sun-DOWN
Apple-Comp-DOWN,Autodesk-DOWN,DEC-DOWN,
ADV-Micro-Device-DOWN,Andrew-Corp-DOWN,
Computer-Assoc-DOWN,Circuit-City-DOWN,
Compaq-DOWN, EMC-Corp-DOWN, Gen-Inst-DOWN,
Motorola-DOWN,Microsoft-DOWN,Scientific-Atl-DOWN
1
2
3
4
Baker-Hughes-UP,Dresser-Inds-UP,Halliburton-HLD-UP,
Louisiana-Land-UP,Phillips-Petro-UP,Unocal-UP,
Schlumberger-UP
Technology1-DOWN
Technology2-DOWN
Financial-DOWN
Oil-UP
Summarization
Reduce the size of large
data sets
C uste g p
Clustering
precipitation
ec p tat o
in Australia
Tan,Steinbach, Kumar
4/18/2004
Supervised classification
Have class label information
Simple segmentation
Di
Dividing
idi students
t d t into
i t different
diff
t registration
i t ti groups
alphabetically, by last name
Results of a query
Groupings are a result of an external specification
Graph partitioning
Some mutual relevance and synergy, but areas are not
identical
Tan,Steinbach, Kumar
4/18/2004
Six Clusters
Two Clusters
Four Clusters
Tan,Steinbach, Kumar
4/18/2004
Types of Clusterings
z
Partitional Clustering
A division data objects into non-overlapping subsets (clusters)
such that each data object is in exactly one subset
Hierarchical clustering
A set of nested clusters organized as a hierarchical tree
Tan,Steinbach, Kumar
4/18/2004
Partitional Clustering
Original
g
Points
Tan,Steinbach, Kumar
A Partitional Clustering
g
4/18/2004
Hierarchical Clustering
p1
p3
p4
p2
p1 p2
Traditional Hierarchical Clustering
p3 p4
Traditional Dendrogram
p1
p3
p4
p2
p1 p2
Non-traditional Hierarchical Clustering
Tan,Steinbach, Kumar
p3 p4
Non-traditional Dendrogram
4/18/2004
Tan,Steinbach, Kumar
4/18/2004
Types of Clusters
z
Well-separated clusters
Center-based clusters
Contiguous clusters
Density-based clusters
P
Property
t or Conceptual
C
t l
Tan,Steinbach, Kumar
4/18/2004
10
Well-Separated Clusters:
A cluster is a set of points such that any point in a cluster is
closer (or more similar) to every other point in the cluster than
to any point not in the cluster.
3 well-separated clusters
Tan,Steinbach, Kumar
4/18/2004
11
Center-based
A cluster is a set of objects such that an object in a cluster is
closer (more similar) to the center of a cluster, than to the
center of any other cluster
The center of a cluster is often a centroid, the average of all
th points
the
i t iin th
the cluster,
l t or a medoid,
d id the
th mostt representative
t ti
point of a cluster
4 center-based clusters
Tan,Steinbach, Kumar
4/18/2004
12
8 contiguous clusters
Tan,Steinbach, Kumar
4/18/2004
13
Density-based
A cluster is a dense region of points, which is separated by
low-density regions, from other regions of high density.
Used when the clusters are irregular or intertwined, and when
noise and outliers are present.
6 density-based clusters
Tan,Steinbach, Kumar
4/18/2004
14
2 Overlapping Circles
Tan,Steinbach, Kumar
4/18/2004
15
P titi
Partitional
l algorithms
l ith
typically
t i ll h
have global
l b l objectives
bj ti
Tan,Steinbach, Kumar
4/18/2004
16
Tan,Steinbach, Kumar
4/18/2004
17
Sparseness
Dictates type of similarity
Adds to efficiency
Attribute type
Dictates type of similarity
Type of Data
Dictates type of similarity
Other characteristics, e.g., autocorrelation
z
z
z
Dimensionality
Di
i
lit
Noise and Outliers
Type of Distribution
Tan,Steinbach, Kumar
4/18/2004
18
Clustering Algorithms
z
Hierarchical clustering
Density-based clustering
Tan,Steinbach, Kumar
4/18/2004
19
K-means Clustering
z
z
Number of clusters,
clusters K
K, must be specified
The basic algorithm is very simple
Tan,Steinbach, Kumar
4/18/2004
20
z
z
Complexity is O( n * K * I * d )
Tan,Steinbach, Kumar
4/18/2004
21
2.5
Original Points
15
1.5
0.5
0
-2
-1.5
-1
-0.5
0.5
1.5
2.5
2.5
1.5
1.5
0.5
0.5
-2
-1.5
-1
-0.5
0.5
1.5
-1.5
-1
-0.5
0.5
1.5
Optimal Clustering
Tan,Steinbach, Kumar
-2
Sub-optimal Clustering
4/18/2004
22
2.5
1.5
0.5
0
-2
-1.5
-1
-0.5
0.5
1.5
Tan,Steinbach, Kumar
4/18/2004
23
Iteration 2
Iteration 3
2.5
2.5
2.5
1.5
1.5
1.5
0.5
0.5
0.5
-2
-1.5
-1
-0.5
0.5
1.5
-2
-1.5
-1
-0.5
0.5
1.5
-2
It ti 4
Iteration
It ti 5
Iteration
2.5
1.5
1.5
1.5
0.5
0.5
0.5
-0.5
0.5
Tan,Steinbach, Kumar
1.5
0.5
1.5
1.5
2.5
2.5
-1
-0.5
It ti 6
Iteration
-1.5
-1
-2
-1.5
-2
-1.5
-1
-0.5
0.5
1.5
-2
-1.5
-1
-0.5
0.5
4/18/2004
24
SSE = dist 2 ( mi , x )
i =1 xCi
Given two clusters, we can choose the one with the smallest
error
One easy way to reduce SSE is to increase K, the number of
clusters
A good clustering with smaller K can have a lower SSE than a poor
clustering
l t i with
ith hi
higher
h K
Tan,Steinbach, Kumar
4/18/2004
25
2.5
1.5
0.5
0
-2
-1.5
-1
-0.5
0.5
1.5
Tan,Steinbach, Kumar
4/18/2004
26
Iteration 2
2.5
2.5
1.5
1.5
0.5
0.5
-2
-1.5
-1
-0.5
0.5
1.5
-2
-1.5
-1
-0.5
0.5
Iteration 3
2.5
1.5
1.5
1.5
2.5
2.5
0.5
0.5
0.5
-1
1
-0.5
05
05
0.5
Tan,Steinbach, Kumar
Iteration 5
-1.5
15
1.5
Iteration 4
-2
2
15
1.5
-2
2
-1.5
15
-1
1
-0.5
05
05
0.5
15
1.5
-2
2
-1.5
15
-1
1
-0.5
05
05
0.5
15
1.5
4/18/2004
27
Ch
Chance
is
i relatively
l ti l smallll when
h K iis llarge
Tan,Steinbach, Kumar
4/18/2004
28
10 Clusters Example
Iteration 4
1
2
3
8
6
4
2
0
-2
-4
-6
0
10
15
20
x
Starting with two initial centroids in one cluster of each pair of clusters
Tan,Steinbach, Kumar
4/18/2004
29
10 Clusters Example
Iteration 2
8
Iteration 1
8
-2
-2
-4
-4
-6
-6
0
10
15
20
15
20
-2
-2
-4
-4
-6
-6
10
20
Iteration 4
8
Iteration 3
15
10
15
20
10
Starting with two initial centroids in one cluster of each pair of clusters
Tan,Steinbach, Kumar
4/18/2004
30
10 Clusters Example
Iteration 4
1
2
3
8
6
4
2
0
-2
-4
-6
0
10
15
20
x
Starting with some pairs of clusters having three initial centroids, while other have only one.
Tan,Steinbach, Kumar
4/18/2004
31
10 Clusters Example
Iteration 2
8
Iteration 1
8
-2
-2
-4
-4
-6
-6
0
10
15
20
-2
-4
-4
-6
-6
5
10
15
20
15
20
-2
10
x
Iteration
4
x
Iteration
3
15
20
10
Starting with some pairs of clusters having three initial centroids, while other have only one.
Tan,Steinbach, Kumar
4/18/2004
32
Multiple runs
Helps,
p , but probability
p
y is not on yyour side
Postprocessing
z Bisecting K-means
z
Tan,Steinbach, Kumar
4/18/2004
33
Several strategies
Choose the point that contributes most to SSE
Choose a point from the cluster with the highest SSE
If there are several empty clusters, the above can be
repeated several times.
Tan,Steinbach, Kumar
4/18/2004
34
Tan,Steinbach, Kumar
4/18/2004
35
Pre-processing
Normalize the data
Eliminate outliers
Post processing
Post-processing
Eliminate small clusters that may represent outliers
Split loose
loose clusters,
clusters ii.e.,
e clusters with relatively high
SSE
Merge
g clusters that are close and that have relatively
y
low SSE
Can use these steps during the clustering process
ISODATA
SO
Tan,Steinbach, Kumar
4/18/2004
36
Bisecting K-means
z
Tan,Steinbach, Kumar
4/18/2004
37
Tan,Steinbach, Kumar
4/18/2004
38
Limitations of K-means
z
Tan,Steinbach, Kumar
4/18/2004
39
K-means (3 Clusters)
Original Points
Tan,Steinbach, Kumar
4/18/2004
40
K-means (3 Clusters)
Original Points
Tan,Steinbach, Kumar
4/18/2004
41
Original Points
Tan,Steinbach, Kumar
K-means (2 Clusters)
4/18/2004
42
Original Points
K-means Clusters
4/18/2004
43
Original Points
Tan,Steinbach, Kumar
K-means Clusters
4/18/2004
44
Original Points
Tan,Steinbach, Kumar
K-means Clusters
4/18/2004
45