
K-Means Cluster Analysis

Chapter 3
PPDM Class

Tan, Steinbach, Kumar: Introduction to Data Mining, 4/18/2004

What is Cluster Analysis?


Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups.

[Figure: two groups of points; inter-cluster distances are maximized, intra-cluster distances are minimized]


Applications of Cluster Analysis


Understanding: Group related documents for browsing, group genes and proteins that have similar functionality, or group stocks with similar price fluctuations.

Discovered Clusters -> Industry Group
1. Applied-Matl-DOWN, Bay-Network-Down, 3-COM-DOWN, Cabletron-Sys-DOWN, CISCO-DOWN, HP-DOWN, DSC-Comm-DOWN, INTEL-DOWN, LSI-Logic-DOWN, Micron-Tech-DOWN, Texas-Inst-Down, Tellabs-Inc-Down, Natl-Semiconduct-DOWN, Oracl-DOWN, SGI-DOWN, Sun-DOWN -> Technology1-DOWN
2. Apple-Comp-DOWN, Autodesk-DOWN, DEC-DOWN, ADV-Micro-Device-DOWN, Andrew-Corp-DOWN, Computer-Assoc-DOWN, Circuit-City-DOWN, Compaq-DOWN, EMC-Corp-DOWN, Gen-Inst-DOWN, Motorola-DOWN, Microsoft-DOWN, Scientific-Atl-DOWN -> Technology2-DOWN
3. Fannie-Mae-DOWN, Fed-Home-Loan-DOWN, MBNA-Corp-DOWN, Morgan-Stanley-DOWN -> Financial-DOWN
4. Baker-Hughes-UP, Dresser-Inds-UP, Halliburton-HLD-UP, Louisiana-Land-UP, Phillips-Petro-UP, Unocal-UP, Schlumberger-UP -> Oil-UP

Summarization: Reduce the size of large data sets.

[Figure: clustering precipitation in Australia]


What is not Cluster Analysis?


- Supervised classification: Have class label information
- Simple segmentation: Dividing students into different registration groups alphabetically, by last name
- Results of a query: Groupings are a result of an external specification
- Graph partitioning: Some mutual relevance and synergy, but the areas are not identical


Notion of a Cluster can be Ambiguous

How many clusters? [Figure: the same set of points can be seen as two clusters, four clusters, or six clusters]


Types of Clusterings
A clustering is a set of clusters.

Important distinction between hierarchical and partitional sets of clusters:

- Partitional Clustering: A division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset
- Hierarchical clustering: A set of nested clusters organized as a hierarchical tree


Partitional Clustering

[Figure: original points (left) and a partitional clustering of the same points (right)]

Hierarchical Clustering

[Figure: a traditional hierarchical clustering of points p1, p2, p3, p4 with its traditional dendrogram, and a non-traditional hierarchical clustering of the same points with its non-traditional dendrogram]

Other Distinctions Between Sets of Clusters


- Exclusive versus non-exclusive: In non-exclusive clusterings, points may belong to multiple clusters. Can represent multiple classes or 'border' points.
- Fuzzy versus non-fuzzy: In fuzzy clustering, a point belongs to every cluster with some weight between 0 and 1. Weights must sum to 1. Probabilistic clustering has similar characteristics.
- Partial versus complete: In some cases, we only want to cluster some of the data.
- Heterogeneous versus homogeneous: Clusters of widely different sizes, shapes, and densities.


Types of Clusters
- Well-separated clusters
- Center-based clusters
- Contiguous clusters
- Density-based clusters
- Property or Conceptual
- Described by an Objective Function


Types of Clusters: Well-Separated


Well-Separated Clusters: A cluster is a set of points such that any point in a cluster is closer (or more similar) to every other point in the cluster than to any point not in the cluster.

[Figure: 3 well-separated clusters]

Types of Clusters: Center-Based


Center-based: A cluster is a set of objects such that an object in a cluster is closer (more similar) to the center of the cluster than to the center of any other cluster. The center of a cluster is often a centroid, the average of all the points in the cluster, or a medoid, the most representative point of the cluster.

[Figure: 4 center-based clusters]

Types of Clusters: Contiguity-Based


Contiguous Cluster (Nearest neighbor or Transitive): A cluster is a set of points such that a point in a cluster is closer (or more similar) to one or more other points in the cluster than to any point not in the cluster.

[Figure: 8 contiguous clusters]

Types of Clusters: Density-Based


Density-based: A cluster is a dense region of points that is separated from other regions of high density by low-density regions. Used when the clusters are irregular or intertwined, and when noise and outliers are present.

[Figure: 6 density-based clusters]

Types of Clusters: Conceptual Clusters


Shared Property or Conceptual Clusters: Finds clusters that share some common property or represent a particular concept.

[Figure: 2 overlapping circles]

Types of Clusters: Objective Function


Clusters Defined by an Objective Function:
- Finds clusters that minimize or maximize an objective function.
- Enumerate all possible ways of dividing the points into clusters and evaluate the 'goodness' of each potential set of clusters using the given objective function. (NP-hard)
- Can have global or local objectives.
  - Hierarchical clustering algorithms typically have local objectives.
  - Partitional algorithms typically have global objectives.
- A variation of the global objective function approach is to fit the data to a parameterized model.
  - Parameters for the model are determined from the data.
  - Mixture models assume that the data is a 'mixture' of a number of statistical distributions.


Types of Clusters: Objective Function


Map the clustering problem to a different domain and solve a related problem in that domain:
- The proximity matrix defines a weighted graph, where the nodes are the points being clustered and the weighted edges represent the proximities between points.
- Clustering is equivalent to breaking the graph into connected components, one for each cluster.
- Want to minimize the edge weight between clusters and maximize the edge weight within clusters (a minimal sketch follows).
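A minimal sketch of this graph view, assuming Euclidean proximity and an arbitrary distance threshold of 0.5 (both choices are illustrative, not from the slides): thresholding the proximity matrix gives an adjacency graph whose connected components act as clusters.

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

X = np.random.rand(100, 2)              # illustrative data
D = cdist(X, X)                         # pairwise distances (the proximity matrix)
adjacency = csr_matrix(D < 0.5)         # keep only edges between "close" points
n_clusters, labels = connected_components(adjacency, directed=False)
```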


Characteristics of the Input Data Are Important


- Type of proximity or density measure: This is a derived measure, but central to clustering
- Sparseness: Dictates type of similarity; adds to efficiency
- Attribute type: Dictates type of similarity
- Type of Data: Dictates type of similarity; other characteristics, e.g., autocorrelation
- Dimensionality
- Noise and Outliers
- Type of Distribution


Clustering Algorithms
- K-means and its variants
- Hierarchical clustering
- Density-based clustering


K-means Clustering
- Partitional clustering approach
- Each cluster is associated with a centroid (center point)
- Each point is assigned to the cluster with the closest centroid
- Number of clusters, K, must be specified
- The basic algorithm is very simple (see the sketch below)
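A minimal NumPy sketch of this basic algorithm (choose K points as initial centroids, assign every point to its nearest centroid, recompute each centroid as a mean, repeat until the centroids stop changing). The function name, the Euclidean distance choice, and the convergence test are illustrative assumptions, not part of the original slides.

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """Basic K-means sketch: assign points to the nearest centroid, recompute centroids, repeat."""
    rng = np.random.default_rng(seed)
    # 1. Select K points as the initial centroids (chosen randomly here).
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(max_iters):
        # 2. Form K clusters by assigning every point to its closest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Recompute each centroid as the mean of the points assigned to it.
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        # 4. Repeat until the centroids no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```

Calling `kmeans(X, 3)` on a 2-D data matrix returns the final centroids and one cluster label per row of X.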


K-means Clustering Details


- Initial centroids are often chosen randomly (see the usage sketch below).
  - Clusters produced vary from one run to another.
- The centroid is (typically) the mean of the points in the cluster.
- 'Closeness' is measured by Euclidean distance, cosine similarity, correlation, etc.
- K-means will converge for the common similarity measures mentioned above.
  - Most of the convergence happens in the first few iterations.
  - Often the stopping condition is changed to 'until relatively few points change clusters'.
- Complexity is O(n * K * I * d), where n = number of points, K = number of clusters, I = number of iterations, d = number of attributes.
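For reference, a widely used implementation exposes these same details as parameters; a brief usage sketch with scikit-learn, where the synthetic data and the specific parameter values are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))                    # illustrative data

# init="random" picks initial centroids randomly, so with n_init=1 the result
# can vary from run to run; max_iter bounds the number of iterations I.
km = KMeans(n_clusters=3, init="random", n_init=1, max_iter=100, random_state=0).fit(X)
print(km.cluster_centers_)                       # centroids (cluster means)
print(km.inertia_)                               # SSE of the final clustering
```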


Two different K-means Clusterings


[Figure: the same set of original points clustered two different ways by K-means; one clustering is optimal, the other sub-optimal]

Importance of Choosing Initial Centroids


[Figure: K-means iterations 1 through 6 for one choice of initial centroids]

Importance of Choosing Initial Centroids


[Figure: snapshots of iterations 1 through 6 of the same run, shown side by side]

Evaluating K-means Clusters


Most common measure is Sum of Squared Error (SSE):
- For each point, the error is the distance to the nearest cluster centroid.
- To get SSE, we square these errors and sum them (computed in the sketch below):

  $$\mathrm{SSE} = \sum_{i=1}^{K} \sum_{x \in C_i} \mathrm{dist}^2(m_i, x)$$

- x is a data point in cluster C_i and m_i is the representative point for cluster C_i.
  - One can show that m_i corresponds to the center (mean) of the cluster.
- Given two clusterings, we can choose the one with the smallest error.
- One easy way to reduce SSE is to increase K, the number of clusters.
  - A good clustering with a smaller K can have a lower SSE than a poor clustering with a higher K.


Importance of Choosing Initial Centroids


[Figure: K-means iterations 1 through 5 for another choice of initial centroids]


Importance of Choosing Initial Centroids


[Figure: snapshots of iterations 1 through 5 of that run, shown side by side]

Problems with Selecting Initial Points


- If there are K 'real' clusters, then the chance of selecting one centroid from each cluster is small.
  - The chance is relatively small when K is large.
  - If the clusters are the same size, n, then the probability is

    $$P = \frac{K!\,n^K}{(Kn)^K} = \frac{K!}{K^K}$$

  - For example, if K = 10, then probability = 10!/10^10 = 0.00036.
- Sometimes the initial centroids will readjust themselves in the 'right' way, and sometimes they don't.
- Consider an example of five pairs of clusters (a quick check of the probability follows).
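A quick check of that number; the variable names are arbitrary:

```python
from math import factorial

# Probability that K randomly chosen initial centroids land in K different
# equal-sized clusters, one per cluster: K! / K**K.
K = 10
p = factorial(K) / K**K
print(p)   # 0.00036288
```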


10 Clusters Example
[Figure: K-means iterations on ten clusters arranged as five pairs]
Starting with two initial centroids in one cluster of each pair of clusters

10 Clusters Example
[Figure: iterations 1 through 4 shown side by side]
Starting with two initial centroids in one cluster of each pair of clusters

10 Clusters Example
[Figure: K-means iterations on the same ten clusters with a different initialization]
Starting with some pairs of clusters having three initial centroids, while others have only one.

10 Clusters Example
[Figure: iterations 1 through 4 shown side by side]
Starting with some pairs of clusters having three initial centroids, while others have only one.

Solutions to Initial Centroids Problem


- Multiple runs
  - Helps, but probability is not on your side (a restart sketch follows)
- Sample and use hierarchical clustering to determine initial centroids
- Select more than k initial centroids and then select among these initial centroids
  - Select the most widely separated
- Postprocessing
- Bisecting K-means
  - Not as susceptible to initialization issues
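A minimal sketch of the 'multiple runs' idea, reusing the kmeans() and sse() sketches from earlier; the number of restarts and the seeding scheme are illustrative:

```python
def kmeans_multiple_runs(X, k, n_runs=10):
    """Run the kmeans() sketch n_runs times and keep the result with the lowest SSE."""
    best_error, best_result = float("inf"), None
    for seed in range(n_runs):
        centroids, labels = kmeans(X, k, seed=seed)   # kmeans() sketch from earlier
        error = sse(X, labels, centroids)             # sse() sketch from earlier
        if error < best_error:
            best_error, best_result = error, (centroids, labels)
    return best_result
```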


Handling Empty Clusters


- The basic K-means algorithm can yield empty clusters.
- Several strategies (one is sketched below):
  - Choose the point that contributes most to SSE.
  - Choose a point from the cluster with the highest SSE.
  - If there are several empty clusters, the above can be repeated several times.
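A minimal sketch of the first strategy (re-seed an empty cluster with the point that contributes most to SSE); the helper name and the in-place updates are illustrative assumptions:

```python
import numpy as np

def fix_empty_clusters(X, labels, centroids):
    """Re-seed any empty cluster with the point farthest from its assigned centroid."""
    for j in range(len(centroids)):
        if not np.any(labels == j):                            # cluster j is empty
            dists = np.linalg.norm(X - centroids[labels], axis=1)
            worst = int(np.argmax(dists))                      # largest SSE contribution
            centroids[j] = X[worst]
            labels[worst] = j
    return labels, centroids
```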


Updating Centers Incrementally


- In the basic K-means algorithm, centroids are updated after all points are assigned to a centroid.
- An alternative is to update the centroids after each assignment (incremental approach):
  - Each assignment updates zero or two centroids (sketched below).
  - More expensive.
  - Introduces an order dependency.
  - Never get an empty cluster.
  - Can use 'weights' to change the impact.
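A minimal sketch of the incremental update, assuming running per-cluster counts are kept alongside the centroids; the function name and the running-mean formulas are illustrative:

```python
import numpy as np

def reassign(x, old, new, centroids, counts):
    """Move point x from cluster `old` to cluster `new`, updating only those two centroids."""
    if old == new:
        return                                        # zero centroids change
    # Remove x's contribution from the old centroid (running-mean downdate).
    counts[old] -= 1
    if counts[old] > 0:
        centroids[old] -= (x - centroids[old]) / counts[old]
    # Add x's contribution to the new centroid (running-mean update).
    counts[new] += 1
    centroids[new] += (x - centroids[new]) / counts[new]
```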


Pre-processing and Post-processing


- Pre-processing (a small sketch follows)
  - Normalize the data
  - Eliminate outliers
- Post-processing
  - Eliminate small clusters that may represent outliers
  - Split 'loose' clusters, i.e., clusters with relatively high SSE
  - Merge clusters that are 'close' and that have relatively low SSE
  - Can use these steps during the clustering process (ISODATA)
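A minimal pre-processing sketch: z-score normalization plus a crude outlier filter at 3 standard deviations; the threshold is an illustrative choice, not from the slides.

```python
import numpy as np

def preprocess(X):
    """Normalize each attribute to zero mean / unit variance and drop extreme points."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)   # assumes no constant attributes
    keep = np.all(np.abs(Z) <= 3, axis=1)      # drop points > 3 std devs on any attribute
    return Z[keep]
```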


Bisecting K-means
Bisecting K-means algorithm:
- Variant of K-means that can produce a partitional or a hierarchical clustering (a sketch follows).
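A minimal sketch of the bisecting idea, reusing the kmeans() sketch from earlier; choosing the cluster with the largest SSE to split next, and using a single trial per bisection, are simplifying assumptions:

```python
import numpy as np

def bisecting_kmeans(X, K):
    """Repeatedly split one cluster into two with 2-means until K clusters exist."""
    clusters = [X]                                         # start with all points in one cluster
    while len(clusters) < K:
        # Split the cluster with the largest SSE next.
        errors = [np.sum((c - c.mean(axis=0)) ** 2) if len(c) else 0.0 for c in clusters]
        worst = clusters.pop(int(np.argmax(errors)))
        centroids, labels = kmeans(worst, k=2)             # kmeans() sketch from earlier
        clusters.extend([worst[labels == 0], worst[labels == 1]])
    return clusters
```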


Bisecting K-means Example


Limitations of K-means
- K-means has problems when clusters are of differing:
  - Sizes
  - Densities
  - Non-globular shapes
- K-means has problems when the data contains outliers.


Limitations of K-means: Differing Sizes

[Figure: original points with clusters of differing sizes, and the K-means result with 3 clusters]


Limitations of K-means: Differing Density

[Figure: original points with clusters of differing densities, and the K-means result with 3 clusters]


Limitations of K-means: Non-globular Shapes

[Figure: original points with non-globular clusters, and the K-means result with 2 clusters]

Overcoming K-means Limitations

[Figure: original points and a K-means clustering that uses many clusters]

One solution is to use many clusters: K-means then finds parts of the natural clusters, which need to be put back together (a merging sketch follows).
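A minimal sketch of that idea, reusing the kmeans() sketch from earlier and merging the resulting centroids with single-link hierarchical clustering; the choice of single linkage and the parameter names are illustrative assumptions:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def kmeans_then_merge(X, many_k, final_k):
    """Over-cluster with K-means, then merge nearby centroids into final_k groups."""
    centroids, labels = kmeans(X, many_k)                  # kmeans() sketch from earlier
    Z = linkage(centroids, method="single")                # hierarchy over the centroids
    groups = fcluster(Z, t=final_k, criterion="maxclust")  # group 1..final_k per centroid
    return groups[labels] - 1                              # 0-based final label per point
```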

Overcoming K-means Limitations

[Figure: original points and the corresponding K-means clustering with many clusters]

Overcoming K-means Limitations

[Figure: original points and the corresponding K-means clustering with many clusters]
