
Clustering

Basic Concepts and Algorithms

Bamshad Mobasher
DePaul University
What is Clustering in Data Mining?
Clustering is a process of partitioning a set of data (or objects) into a set of meaningful sub-classes, called clusters.

Helps users understand the natural grouping or structure in a data set

• Cluster:
  – a collection of data objects that are “similar” to one another and thus can be treated collectively as one group
  – but, as a collection, they are sufficiently different from other groups
• Clustering
  – unsupervised classification
  – no predefined classes
Applications of Cluster Analysis
• Data reduction
  – Summarization: preprocessing for regression, PCA, classification, and association analysis
  – Compression: image processing, e.g., vector quantization
• Hypothesis generation and testing
• Prediction based on groups
  – Cluster & find characteristics/patterns for each group
• Finding K-nearest neighbors
  – Localizing search to one or a small number of clusters
• Outlier detection: outliers are often viewed as those “far away” from any cluster
Basic Steps to Develop a Clustering Task
• Feature selection / preprocessing
  – Select information concerning the task of interest
  – Minimal information redundancy
  – May need to do normalization/standardization
• Distance/similarity measure
  – Similarity of two feature vectors
• Clustering criterion
  – Expressed via a cost function or some rules
• Clustering algorithms
  – Choice of algorithm(s)
• Validation of the results
• Interpretation of the results with applications

Distance or Similarity Measures
• Common distance measures (a code sketch follows the list):
  – Manhattan distance:  dist(X, Y) = \sum_i |x_i - y_i|
  – Euclidean distance:  dist(X, Y) = \sqrt{\sum_i (x_i - y_i)^2}
  – Cosine similarity:   sim(X, Y) = \frac{\sum_i (x_i \cdot y_i)}{\sqrt{\sum_i x_i^2} \cdot \sqrt{\sum_i y_i^2}},  with  dist(X, Y) = 1 - sim(X, Y)
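As a concrete illustration, here is a minimal NumPy sketch of the three measures above; the function names are my own, and the example vectors reuse the D1 and D2 rows of the document-term matrix that appears later in the deck.

```python
import numpy as np

def manhattan(x, y):
    # dist(X, Y) = sum_i |x_i - y_i|
    return np.sum(np.abs(x - y))

def euclidean(x, y):
    # dist(X, Y) = sqrt(sum_i (x_i - y_i)^2)
    return np.sqrt(np.sum((x - y) ** 2))

def cosine_sim(x, y):
    # sim(X, Y) = sum_i x_i*y_i / (sqrt(sum_i x_i^2) * sqrt(sum_i y_i^2))
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

x = np.array([0, 3, 3, 0, 2], dtype=float)   # D1 from the later example
y = np.array([4, 1, 0, 1, 2], dtype=float)   # D2 from the later example

print(manhattan(x, y), euclidean(x, y), 1 - cosine_sim(x, y))  # cosine distance = 1 - similarity
```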
More Similarity Measures
• In the vector-space model, many similarity measures can be used in clustering (a code sketch of Dice and Jaccard follows):

  Simple Matching:       sim(X, Y) = \sum_i (x_i \cdot y_i)

  Dice’s Coefficient:    sim(X, Y) = \frac{2 \sum_i (x_i \cdot y_i)}{\sum_i x_i^2 + \sum_i y_i^2}

  Cosine Coefficient:    sim(X, Y) = \frac{\sum_i (x_i \cdot y_i)}{\sqrt{\sum_i x_i^2} \cdot \sqrt{\sum_i y_i^2}}

  Jaccard’s Coefficient: sim(X, Y) = \frac{\sum_i (x_i \cdot y_i)}{\sum_i x_i^2 + \sum_i y_i^2 - \sum_i (x_i \cdot y_i)}
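A minimal sketch of the Dice and Jaccard coefficients in their weighted (vector-space) forms given above; the cosine coefficient is identical to the cosine similarity sketched on the previous slide, and the function names are my own.

```python
import numpy as np

def dice(x, y):
    # 2 * sum_i x_i*y_i / (sum_i x_i^2 + sum_i y_i^2)
    return 2 * np.dot(x, y) / (np.sum(x ** 2) + np.sum(y ** 2))

def jaccard(x, y):
    # sum_i x_i*y_i / (sum_i x_i^2 + sum_i y_i^2 - sum_i x_i*y_i)
    dot = np.dot(x, y)
    return dot / (np.sum(x ** 2) + np.sum(y ** 2) - dot)

x = np.array([0, 3, 3, 0, 2], dtype=float)
y = np.array([4, 1, 0, 1, 2], dtype=float)
print(dice(x, y), jaccard(x, y))
```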
Quality: What Is Good Clustering?

• A good clustering method will produce high-quality clusters with
  – high intra-class similarity: cohesive within clusters
  – low inter-class similarity: distinctive between clusters
• The quality of a clustering method depends on
  – the similarity measure used
  – its implementation, and
  – its ability to discover some or all of the hidden patterns

Major Clustering Approaches
• Partitioning approach:
  – Construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of squared errors
  – Typical methods: k-means, k-medoids, CLARANS
• Hierarchical approach:
  – Create a hierarchical decomposition of the set of data (or objects) using some criterion
  – Typical methods: DIANA, AGNES, BIRCH, CHAMELEON
• Density-based approach:
  – Based on connectivity and density functions
  – Typical methods: DBSCAN, OPTICS, DenClue
• Model-based approach:
  – A model is hypothesized for each of the clusters, and the goal is to find the best fit of the data to the given model
  – Typical methods: EM, SOM, COBWEB
Partitioning Approaches
• The notion of comparing item similarities can be extended to clusters themselves, by focusing on a representative vector for each cluster
  – cluster representatives can be actual items in the cluster or other “virtual” representatives such as the centroid
  – this methodology reduces the number of similarity computations in clustering
  – clusters are revised successively until a stopping condition is satisfied, or until no more changes to clusters can be made
• Reallocation-based partitioning methods
  – Start with an initial assignment of items to clusters and then move items from cluster to cluster to obtain an improved partitioning
  – Most common algorithm: k-means

The K-Means Clustering Method
• Given the number of desired clusters k, the k-means algorithm follows four steps (a code sketch follows the list):
  1. Randomly assign objects to create k nonempty initial partitions (clusters)
  2. Compute the centroids of the clusters of the current partitioning (the centroid is the center, i.e., mean point, of the cluster)
  3. Assign each object to the cluster with the nearest centroid (reallocation step)
  4. Go back to Step 2; stop when the assignment does not change
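A minimal NumPy sketch of these four steps, assuming Euclidean distance as the dissimilarity measure; function and variable names are my own, and for simplicity the sketch does not handle clusters that become empty.

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    """Basic k-means: X is an (n_objects, n_features) array."""
    rng = np.random.default_rng(seed)
    # Step 1: randomly assign objects to k initial clusters
    labels = rng.integers(0, k, size=len(X))
    for _ in range(max_iter):
        # Step 2: compute the centroid (mean point) of each cluster
        # (assumes every cluster is nonempty; empty clusters are not handled here)
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 3: reallocate each object to the cluster with the nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 4: stop when the assignment does not change
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, centroids
```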
K-Means Example: Document Clustering

Initial (arbitrary) assignment:
C1 = {D1, D2}, C2 = {D3, D4}, C3 = {D5, D6}

[Table: document-term matrix with the cluster centroids of C1, C2, and C3]

Now compute the similarity (or distance) of each item to each cluster, resulting in a cluster-document similarity matrix (here we use the dot product as the similarity measure).

Example (Continued)

For each document, reallocate the document to the cluster to which it has the highest
similarity (shown in red in the above table). After the reallocation we have the following
new clusters. Note that the previously unassigned D7 and D8 have been assigned, and that
D1 and D6 have been reallocated from their original assignment.

C1 = {D2,D7,D8}, C2 = {D1,D3,D4,D6}, C3 = {D5}

This is the end of the first iteration (i.e., the first reallocation).
Next, we repeat the process for another reallocation…

Example (Continued)
Now compute new cluster centroids using the original document-term matrix:

C1 = {D2, D7, D8}, C2 = {D1, D3, D4, D6}, C3 = {D5}

      T1    T2    T3    T4    T5
D1     0     3     3     0     2
D2     4     1     0     1     2
D3     0     4     0     0     2
D4     0     3     0     3     3
D5     0     1     3     0     1
D6     2     2     0     0     4
D7     1     0     3     2     0
D8     3     1     0     0     2
C1   8/3   2/3   3/3   3/3   4/3
C2   2/4  12/4   3/4   3/4  11/4
C3   0/1   1/1   3/1   0/1   1/1

This will lead to a new cluster-document similarity matrix similar to the one on the previous slide. Again, the items are reallocated to the clusters with which they have the highest similarity.

        D1     D2     D3     D4     D5     D6     D7     D8
C1    7.67  15.01   5.34   9.00   5.00  12.00   7.67  11.34
C2   16.75  11.25  17.50  19.50   8.00   6.68   4.25  10.00
C3   14.00   3.00   6.00   6.00  11.00   9.34   9.00   3.00

New assignment → C1 = {D2, D6, D8}, C2 = {D1, D3, D4}, C3 = {D5, D7}

Note: This process is now repeated with the new clusters. However, the next iteration in this example will show no change to the clusters, thus terminating the algorithm.
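To make the computations above concrete, the following NumPy sketch recomputes the centroids of C1 = {D2, D7, D8}, C2 = {D1, D3, D4, D6}, C3 = {D5} from the document-term matrix and forms the cluster-document dot-product similarities used for reallocation; the variable names are my own.

```python
import numpy as np

# Document-term matrix (rows D1..D8, columns T1..T5) from the slide above
D = np.array([
    [0, 3, 3, 0, 2],  # D1
    [4, 1, 0, 1, 2],  # D2
    [0, 4, 0, 0, 2],  # D3
    [0, 3, 0, 3, 3],  # D4
    [0, 1, 3, 0, 1],  # D5
    [2, 2, 0, 0, 4],  # D6
    [1, 0, 3, 2, 0],  # D7
    [3, 1, 0, 0, 2],  # D8
], dtype=float)

# Current clusters as 0-based document indices: C1={D2,D7,D8}, C2={D1,D3,D4,D6}, C3={D5}
clusters = [[1, 6, 7], [0, 2, 3, 5], [4]]

# Centroid = mean of the member documents, e.g. C1 = (D2+D7+D8)/3 = [8/3, 2/3, 3/3, 3/3, 4/3]
centroids = np.array([D[idx].mean(axis=0) for idx in clusters])

# Cluster-document similarity matrix using the dot product as the similarity measure
sim = centroids @ D.T            # shape: (3 clusters, 8 documents)

# Reallocate each document to the cluster with the highest similarity
new_labels = sim.argmax(axis=0)
```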
K-Means Algorithm
• Strengths of k-means:
  – Relatively efficient: O(tkn), where n is the # of objects, k is the # of clusters, and t is the # of iterations. Normally, k, t << n
  – Often terminates at a local optimum
• Weaknesses of k-means:
  – Applicable only when a mean is defined; what about categorical data?
  – Need to specify k, the number of clusters, in advance
  – Unable to handle noisy data and outliers
• Variations of k-means usually differ in:
  – Selection of the initial k means
  – Distance or similarity measures used
  – Strategies to calculate cluster means
A Disk Version of k-means
• k-means can be implemented with data on disk
  – In each iteration, it scans the database once
  – The centroids are computed incrementally (sketched in code below)
• It can be used to cluster large datasets that do not fit in main memory
• We need to control the number of iterations
  – In practice, a limit is set (e.g., < 50)
• There are better algorithms that scale up for large data sets, e.g., BIRCH
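A minimal sketch of the incremental centroid computation, assuming the data are read from disk in chunks during a single scan; the chunk-reading interface and all names are illustrative, not a specific library API.

```python
import numpy as np

def centroids_one_scan(chunks, labels, k, n_features):
    """One pass over the data: accumulate per-cluster sums and counts,
    then divide to obtain the centroids, without holding all data in memory."""
    sums = np.zeros((k, n_features))
    counts = np.zeros(k)
    offset = 0
    for chunk in chunks:                 # each chunk is a small array read from disk
        for row in chunk:
            j = labels[offset]           # current cluster assignment of this object
            sums[j] += row
            counts[j] += 1
            offset += 1
    # centroid = running sum / running count (assumes every cluster is nonempty)
    return sums / counts[:, None]
```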

BIRCH
• Designed for very large data sets
  – Time and memory are limited
  – Incremental and dynamic clustering of incoming objects
  – Only one scan of the data is necessary
  – Does not need the whole data set in advance
• Two key phases:
  – Scans the database to build an in-memory tree
  – Applies a clustering algorithm to cluster the leaf nodes

Hierarchical Clustering Algorithms
• Two main types of hierarchical clustering
  – Agglomerative:
    • Start with the points as individual clusters
    • At each step, merge the closest pair of clusters until only one cluster (or k clusters) is left
  – Divisive:
    • Start with one, all-inclusive cluster
    • At each step, split a cluster until each cluster contains a single point (or there are k clusters)
• Traditional hierarchical algorithms use a similarity or distance matrix
  – Merge or split one cluster at a time
Hierarchical Clustering Algorithms
• Use the distance/similarity matrix as the clustering criterion
  – does not require the number of clusters as input, but needs a termination condition

[Figure: agglomerative clustering merges a and b into ab, c and d into cd, cd and e into cde, and finally ab and cde into abcde over steps 0–4; divisive clustering proceeds in the reverse direction, from step 4 back to step 0]

Hierarchical Agglomerative Clustering
• Basic procedure (a code sketch follows the list)
  1. Place each of the N items into a cluster of its own.
  2. Compute all pairwise item-item similarity coefficients
     • a total of N(N−1)/2 coefficients
  3. Form a new cluster by combining the most similar pair of current clusters i and j
     • (methods for determining which clusters to merge: single-link, complete-link, group average, etc.);
     • update the similarity matrix by deleting the rows and columns corresponding to i and j;
     • calculate the entries in the row corresponding to the new cluster i+j.
  4. Repeat step 3 as long as the number of clusters left is greater than 1.
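A minimal sketch of hierarchical agglomerative clustering using SciPy's linkage routines; the data points are made up, and the 'single' linkage method could be swapped for 'complete', 'average', or 'ward' (these variants are discussed on the following slides).

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy data: N items as 2-D points (arbitrary example values)
X = np.array([[0.0, 0.0], [0.1, 0.2], [2.0, 2.1],
              [2.2, 1.9], [5.0, 5.0], [5.1, 4.8]])

# Agglomerative clustering: starts with each point in its own cluster and
# repeatedly merges the closest pair of clusters (here, single-link distance)
Z = linkage(X, method="single", metric="euclidean")

# Cut the resulting hierarchy to obtain a flat partition with k = 3 clusters
labels = fcluster(Z, t=3, criterion="maxclust")

# Z encodes the full merge history; scipy.cluster.hierarchy.dendrogram(Z)
# would plot the corresponding dendrogram with matplotlib
```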

Hierarchical Agglomerative Clustering :: Example

[Figure: nested clusters of points 1–6 and the corresponding dendrogram showing the order and heights at which the clusters are merged]


Distance Between Two Clusters
• The basic procedure varies based on the method used to determine inter-cluster distances or similarities
• Different methods result in different variants of the algorithm:
  – Single link
  – Complete link
  – Average link
  – Ward’s method
  – etc.

Single Link Method
• The distance between two clusters is the distance between the two closest data points in the two clusters, one data point from each cluster
• It can find arbitrarily shaped clusters, but
  – it may cause the undesirable “chain effect” due to noisy points

[Figure: “Two natural clusters are split into two”]
Distance between two clusters
• Single-link distance between clusters Ci and Cj is the minimum distance between any object in Ci and any object in Cj
  – The distance is defined by the two most similar objects

  D_{sl}(C_i, C_j) = \min_{x, y} \{\, d(x, y) \mid x \in C_i,\ y \in C_j \,\}
Complete Link Method
• The distance between two clusters is the distance between the two furthest data points in the two clusters
• It is sensitive to outliers because they are far away

Distance between two clusters
• Complete-link distance between clusters Ci and Cj is the maximum distance between any object in Ci and any object in Cj
  – The distance is defined by the two least similar objects

  D_{cl}(C_i, C_j) = \max_{x, y} \{\, d(x, y) \mid x \in C_i,\ y \in C_j \,\}
Average link and centroid methods
• Average link: a compromise between
  – the sensitivity of complete-link clustering to outliers and
  – the tendency of single-link clustering to form long chains that do not correspond to the intuitive notion of clusters as compact, spherical objects
  – In this method, the distance between two clusters is the average distance over all pairwise distances between the data points in the two clusters
• Centroid method: in this method, the distance between two clusters is the distance between their centroids

Distance between two clusters
• Group-average distance between clusters Ci and Cj is the average distance between objects in Ci and objects in Cj
  – The distance is defined by the average of all pairwise distances

  D_{avg}(C_i, C_j) = \frac{1}{|C_i| \cdot |C_j|} \sum_{x \in C_i} \sum_{y \in C_j} d(x, y)
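All three inter-cluster distances above can be computed directly from the matrix of pairwise point distances; a minimal NumPy/SciPy sketch with two made-up example clusters:

```python
import numpy as np
from scipy.spatial.distance import cdist

# Two small example clusters of 2-D points (arbitrary values)
Ci = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
Cj = np.array([[4.0, 4.0], [5.0, 5.0]])

pairwise = cdist(Ci, Cj)            # |Ci| x |Cj| matrix of distances d(x, y)

d_single   = pairwise.min()         # D_sl: distance of the two closest objects
d_complete = pairwise.max()         # D_cl: distance of the two furthest objects
d_average  = pairwise.mean()        # D_avg: average over all |Ci|*|Cj| pairs
```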
Clustering
Basic Concepts and Algorithms

Bamshad Mobasher
DePaul University
