Clustering Algorithms
Bill Andreopoulos
Biotec, TU Dresden, Germany, and
Department of Computer Science and Engineering
York University, Toronto, Ontario, Canada
June 27, 2006
Outline
What is Cluster Analysis?
Types of Data in Cluster Analysis
A Categorization of Major Clustering Methods
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Model-Based Clustering Methods
Supervised Classification
Objective of clustering algorithms for categorical data
Partition the objects into groups.
Objects with similar categorical attribute values are placed in the same group.
Objects in different groups contain dissimilar categorical attribute values.
General Applications of
Clustering
Pattern Recognition
Spatial Data Analysis
create thematic maps in GIS by clustering feature spaces
detect spatial clusters and explain them in spatial data mining
Image Processing
Economic Science (especially market research)
WWW
Document classification
Cluster Weblog data to discover groups of similar access
patterns
Requirements of Clustering in
Data Mining
Ability to deal with different types of attributes
Discovery of clusters with arbitrary shape
Minimal requirements for domain knowledge to
determine input parameters
Able to deal with noise and outliers
Insensitive to order of input records
Scalability to high dimensions
Interpretability and usability
Incorporation of user-specified constraints
Outline
What is Cluster Analysis?
Types of Data in Cluster Analysis
A Categorization of Major Clustering Methods
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Model-Based Clustering Methods
Supervised Classification
Data Structures
Data matrix (n objects described by p variables):
\[
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}
\]
Dissimilarity matrix (n-by-n; symmetric, so only the lower triangle is stored):
\[
\begin{bmatrix}
0 \\
d(2,1) & 0 \\
d(3,1) & d(3,2) & 0 \\
\vdots & \vdots & \vdots & \ddots \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}
\]
Mixed types
Nominal (categorical)
A generalization of the binary variable in that it can
take more than 2 states, e.g., red, yellow, blue, green
Method 1: Simple matching
m: # of matches, p: total # of variables
\[ d(i,j) = \frac{p - m}{p} \]
Method 2: use a large number of binary variables
creating a new binary variable for each of the M nominal states
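A minimal Python sketch of Method 1 (the function name and example values are illustrative, not from the slides):

```python
def nominal_dissimilarity(obj_i, obj_j):
    """Simple matching dissimilarity d(i,j) = (p - m) / p for nominal tuples."""
    p = len(obj_i)                                  # total number of variables
    m = sum(a == b for a, b in zip(obj_i, obj_j))   # number of matches
    return (p - m) / p

# e.g., two objects described by three nominal attributes
print(nominal_dissimilarity(("red", "small", "round"),
                            ("red", "large", "round")))   # 1/3, one mismatch
```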
Interval-scaled variables
Standardize data
Calculate the mean absolute deviation:
\[ s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right) \]
where
\[ m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \cdots + x_{nf}\right) \]
Calculate the standardized measurement (z-score):
\[ z_{if} = \frac{x_{if} - m_f}{s_f} \]
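The standardization above, as a short Python sketch (names are illustrative):

```python
def standardize(column):
    """z-score a single variable f using the mean absolute deviation s_f."""
    n = len(column)
    m_f = sum(column) / n                            # mean of variable f
    s_f = sum(abs(x - m_f) for x in column) / n      # mean absolute deviation
    return [(x - m_f) / s_f for x in column]

print(standardize([2.0, 4.0, 6.0, 8.0]))  # [-1.5, -0.5, 0.5, 1.5]
```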
Ordinal (numerical)
An ordinal variable can be discrete or continuous
order is important, e.g., rank
Can be treated like interval-scaled
replacing \( x_{if} \) by its rank \( r_{if} \in \{1, \ldots, M_f\} \)
map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by
\[ z_{if} = \frac{r_{if} - 1}{M_f - 1} \]
Minkowski distance:
\[ d(i,j) = \sqrt[q]{\,|x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q\,} \]
where \( i = (x_{i1}, x_{i2}, \ldots, x_{ip}) \) and \( j = (x_{j1}, x_{j2}, \ldots, x_{jp}) \) are two p-dimensional data objects, and q is a positive integer
If q = 1, d is Manhattan distance:
\[ d(i,j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}| \]
Properties
d(i,j) ≥ 0
d(i,i) = 0
d(i,j) = d(j,i)
d(i,j) ≤ d(i,k) + d(k,j)
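A short Python sketch of the Minkowski distance (q = 1 gives Manhattan, q = 2 Euclidean; names are illustrative):

```python
def minkowski(i, j, q=2):
    """Minkowski distance between two p-dimensional objects."""
    return sum(abs(a - b) ** q for a, b in zip(i, j)) ** (1.0 / q)

print(minkowski((0, 0), (3, 4), q=1))  # 7.0  (Manhattan)
print(minkowski((0, 0), (3, 4), q=2))  # 5.0  (Euclidean)
```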
Binary Variables
A contingency table for binary data:

                    Object j
                     1      0     sum
  Object i   1       a      b     a+b
             0       c      d     c+d
           sum      a+c    b+d     p

Simple matching coefficient (invariant, if the binary variable is symmetric):
\[ d(i,j) = \frac{b + c}{a + b + c + d} \]
Jaccard coefficient (noninvariant if the binary variable is asymmetric):
\[ d(i,j) = \frac{b + c}{a + b + c} \]
Example (Gender is a symmetric attribute; the remaining attributes are asymmetric binary; let the values Y and P be set to 1, and the value N be set to 0):

  Object     Gender  Fever  Cough  Test-1  Test-2  Test-3  Test-4
  Object 1     M       Y      N      P       N       N       N
  Object 2     F       Y      N      P       N       P       N
  Object 3     M       Y      P      N       N       N       N
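Treating Y and P as 1 and N as 0, and leaving out the symmetric Gender attribute, a Python sketch reproduces the pairwise Jaccard dissimilarities for the three objects in the table (object labels are illustrative):

```python
def jaccard_dissim(x, y):
    """d(i,j) = (b + c) / (a + b + c) for asymmetric binary vectors."""
    a = sum(p == 1 and q == 1 for p, q in zip(x, y))   # positive matches
    b = sum(p == 1 and q == 0 for p, q in zip(x, y))
    c = sum(p == 0 and q == 1 for p, q in zip(x, y))
    return (b + c) / (a + b + c)

# Fever, Cough, Test-1 .. Test-4 with Y/P -> 1, N -> 0 (Gender omitted)
obj1 = (1, 0, 1, 0, 0, 0)
obj2 = (1, 0, 1, 0, 1, 0)
obj3 = (1, 1, 0, 0, 0, 0)
print(jaccard_dissim(obj1, obj2))  # 0.33...
print(jaccard_dissim(obj1, obj3))  # 0.67...
print(jaccard_dissim(obj2, obj3))  # 0.75
```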
f is binary or nominal:
\( d_{ij}^{(f)} = 0 \) if \( x_{if} = x_{jf} \); otherwise \( d_{ij}^{(f)} = 1 \)
f is interval-based: use the normalized distance
f is ordinal:
compute ranks \( r_{if} \) and treat
\[ z_{if} = \frac{r_{if} - 1}{M_f - 1} \]
as interval-scaled
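A sketch combining the three cases into one averaged dissimilarity (Python; the helper name, type tags, and example values are assumptions for illustration):

```python
def mixed_dissim(x, y, types, ranges):
    """Average of per-variable distances d_ij(f) over mixed variable types.
    types[f] in {'nominal', 'interval', 'ordinal'};
    ranges[f] = (min, max) for interval, M_f for ordinal, None otherwise."""
    total = 0.0
    for f, (a, b) in enumerate(zip(x, y)):
        t = types[f]
        if t == 'nominal':                       # binary or nominal
            total += 0.0 if a == b else 1.0
        elif t == 'interval':                    # normalized distance
            lo, hi = ranges[f]
            total += abs(a - b) / (hi - lo)
        elif t == 'ordinal':                     # rank mapped onto [0, 1]
            M_f = ranges[f]
            za, zb = (a - 1) / (M_f - 1), (b - 1) / (M_f - 1)
            total += abs(za - zb)
    return total / len(x)

# one nominal, one interval on [0, 100], one ordinal with M_f = 3 ranks
print(mixed_dissim(('red', 40, 1), ('blue', 60, 3),
                   types=('nominal', 'interval', 'ordinal'),
                   ranges=(None, (0, 100), 3)))  # (1 + 0.2 + 1) / 3 = 0.733...
```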
Software clustering
Group system files such that files with similar functionality are in the
same cluster, while files in different clusters perform dissimilar
functions.
Each object is a file x of the software system.
Categorical data set on a software system: for each file, which other
files it may invoke during runtime.
After the filename x there is a list of the other filenames that x may invoke.
Outline
What is Cluster Analysis?
Types of Data in Cluster Analysis
A Categorization of Major Clustering Methods
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Model-Based Clustering Methods
Supervised Classification
Weaknesses of k-means
Applicable only when the mean is defined (then what about categorical data?)
Need to specify k, the number of clusters, in advance
Unable to handle noisy data and outliers
Not suitable to discover clusters with non-convex shapes
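For contrast, a minimal k-means sketch in Python that makes these weaknesses visible: it needs a mean, needs k up front, and is pulled around by outliers (all names are illustrative):

```python
import random

def k_means(points, k, iters=100):
    """Plain k-means on tuples of numbers."""
    centers = random.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # assignment step: each point goes to its nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: sum((a - b) ** 2
                                                for a, b in zip(p, centers[c])))
            clusters[j].append(p)
        # update step: recompute each center as the cluster mean
        for j, cl in enumerate(clusters):
            if cl:
                centers[j] = tuple(sum(xs) / len(cl) for xs in zip(*cl))
    return centers, clusters
```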
[Figure: k-medoids (PAM) swap cost, comparing current medoid i with candidate h for each non-selected object j; the case shown has Cjih = 0.]
CLARA (Clustering LARge Applications)
It draws a sample of the data set, applies PAM on the sample, and gives the best clustering as the output.
Fuzzy k-Means
Each object belongs to every cluster with a fractional membership weight; cluster centers are recomputed as membership-weighted means, and the loop repeats until the memberships stabilize.
K-Modes algorithm
K-Modes deals with categorical attributes.
Insert the first K objects into K new clusters.
Calculate the initial K modes for the K clusters.
Repeat {
    Allocate each object to the cluster whose mode is nearest, using the simple-matching dissimilarity;
    Recompute the mode of each cluster;
} Until no object changes cluster membership.
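A small Python sketch of this procedure (function names are illustrative, not from the original K-Modes paper):

```python
from collections import Counter

def matches(x, y):
    """Simple-matching dissimilarity between two categorical tuples."""
    return sum(a != b for a, b in zip(x, y))

def mode_of(cluster):
    """Per-attribute most frequent value."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*cluster))

def k_modes(objects, k, iters=20):
    clusters = [[o] for o in objects[:k]]       # first K objects seed K clusters
    modes = [mode_of(c) for c in clusters]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for o in objects:                       # nearest mode wins
            j = min(range(k), key=lambda c: matches(o, modes[c]))
            clusters[j].append(o)
        modes = [mode_of(c) if c else modes[j]  # keep old mode for empty clusters
                 for j, c in enumerate(clusters)]
    return modes, clusters
```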
Fuzzy K-Modes
The fuzzy k-modes algorithm contains extensions to the
fuzzy k-means algorithm for clustering categorical data.
Bunch
Bunch clusters a software system by searching (hill climbing, genetic algorithms) for a partition of the module dependency graph that maximizes a modularization quality measure.
Hierarchical Clustering
Use distance matrix as clustering criteria. This
method does not require the number of clusters
k as an input, but needs a termination condition
[Diagram: at Step 0 the objects a, b, c, d, e are singletons; agglomerative clustering (AGNES) merges a and b into ab and d and e into de, then forms cde, and by Step 4 reaches abcde; divisive clustering (DIANA) runs the same sequence in reverse, from abcde back to singletons.]
AGNES (Agglomerative
Nesting)
Introduced in Kaufmann and Rousseeuw (1990)
Merge nodes that have the least dissimilarity
Go on in a non-descending fashion
Eventually all nodes belong to the same cluster
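A single-link agglomerative sketch in Python (illustrative, not the AGNES implementation of Kaufmann and Rousseeuw; `dist` can be any dissimilarity from the earlier slides):

```python
def agnes(points, dist):
    """Single-link agglomerative clustering: repeatedly merge the pair of
    clusters with the least dissimilarity until one cluster remains."""
    clusters = [[p] for p in points]
    merges = []                                  # the dendrogram, bottom-up
    while len(clusters) > 1:
        # find the two closest clusters (minimum over cross-cluster pairs)
        i, j = min(((i, j) for i in range(len(clusters))
                           for j in range(i + 1, len(clusters))),
                   key=lambda ij: min(dist(a, b)
                                      for a in clusters[ij[0]]
                                      for b in clusters[ij[1]]))
        merges.append((clusters[i], clusters[j]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return merges
```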
BIRCH (1996)
Birch: Balanced Iterative Reducing and Clustering using
Hierarchies, by Zhang, Ramakrishnan, and Livny (SIGMOD'96)
Incrementally construct a CF (Clustering Feature) tree, a
hierarchical data structure for multiphase clustering
Phase 1: scan DB to build an initial in-memory CF tree (a multi-level
compression of the data that tries to preserve the inherent clustering
structure of the data)
Phase 2: use an arbitrary clustering algorithm to cluster the leaf
nodes of the CF-tree
Clustering Feature Vector
Clustering Feature: CF = (N, LS, SS)
N: number of data points
LS: \( \sum_{i=1}^{N} X_i \) (linear sum of the N points)
SS: \( \sum_{i=1}^{N} X_i^2 \) (square sum of the N points)
Example: for the five points (3,4), (2,6), (4,5), (4,7), (3,8), CF = (5, (16,30), (54,190)).
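The CF triplet can be checked with a few lines of Python (function name illustrative):

```python
def clustering_feature(points):
    """CF = (N, LS, SS): count, per-dimension linear sum, per-dimension square sum."""
    N = len(points)
    LS = tuple(sum(xs) for xs in zip(*points))
    SS = tuple(sum(x * x for x in xs) for xs in zip(*points))
    return N, LS, SS

# the five example points from the slide
print(clustering_feature([(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]))
# (5, (16, 30), (54, 190))
```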
CF Tree - A nonleaf node in this tree contains summaries of the CFs of its
children. A CF tree is a multilevel summary of the data that preserves the
inherent structure of the data.
L = 6
[Diagram: a CF tree. The root holds entries CF1, CF2, CF3, ..., CF6, each with a pointer to a child node. A non-leaf node likewise holds entries CF1 ... CF5 with child pointers. Leaf nodes hold up to L CF entries and are chained together with prev/next pointers.]
CURE (Clustering Using REpresentatives)
Figure 11: Clustering a set of objects using CURE. (a) A random sample of objects. (b) Partial clusters; representative points for each cluster are marked with a +. (c) The partial clusters are further clustered; the representative points are moved toward the cluster center. (d) The final clusters are nonspherical.
Shrink the multiple representative points towards the gravity center by a fraction α.
Rock: Algorithm
Links: The number of common neighbours for
the two points.
CHAMELEON
CHAMELEON: hierarchical clustering using
dynamic modeling, by G. Karypis, E. H. Han, and V. Kumar, 1999
Measures the similarity based on a dynamic
model
Two clusters are merged only if the
interconnectivity and closeness (proximity)
between two clusters are high relative to the
internal interconnectivity of the clusters and
closeness of items within the clusters
CHAMELEON
A two phase algorithm
Use a graph partitioning algorithm: cluster objects into a
large number of relatively small sub-clusters
Use an agglomerative hierarchical clustering algorithm:
find the genuine clusters by repeatedly combining these
sub-clusters
Overall Framework of CHAMELEON
Data Set → Construct Sparse Graph → Partition the Graph → Merge Partitions → Final Clusters
Scan the similarity matrix to find the highest value, representing the pair of genes with the most similar interaction patterns.
The two most similar genes are grouped in a cluster, and the similarity matrix is recomputed using the average properties of all genes in the cluster.
More genes are progressively added to the initial pairs to form clusters of genes [Eisen98].
The process is repeated until all genes have been grouped into clusters.
LIMBO
LIMBO, introduced in [Andritsos04], is a scalable hierarchical categorical clustering algorithm that builds on the Information Bottleneck (IB) framework for quantifying the relevant information preserved when clustering.
LIMBO uses the IB framework to define a distance measure for categorical tuples.
LIMBO handles large data sets by producing a memory-bounded summary model for the data.
Outline
What is Cluster Analysis?
Types of Data in Cluster Analysis
A Categorization of Major Clustering Methods
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Model-Based Clustering Methods
Supervised Classification
Density-Based Clustering:
Background
Two parameters:
Eps: maximum radius of the neighbourhood
MinPts: minimum number of points in an Eps-neighbourhood of that point
[Figure: point p is directly density-reachable from core point q (MinPts = 5, Eps = 1 cm).]
Density-Based Clustering:
Background (II)
Density-reachable:
A point p is density-reachable from a point q wrt. Eps, MinPts if there is a chain of points p1, ..., pn, with p1 = q and pn = p, such that each p(i+1) is directly density-reachable from p(i).
Density-connected:
A point p is density-connected to a point q wrt. Eps, MinPts if there is a point o such that both p and q are density-reachable from o wrt. Eps and MinPts.
[Figure: chains illustrating density-reachability through p1 and density-connectivity through o (Eps = 1 cm, MinPts = 5).]
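These definitions are what DBSCAN builds on; a minimal, illustrative Python sketch (not the original implementation):

```python
def dbscan(points, eps, min_pts, dist):
    """Expand clusters from core points via density-reachability."""
    labels = {}                                    # index -> cluster id (-1 = noise)

    def neighbours(i):
        return [j for j in range(len(points))
                if dist(points[i], points[j]) <= eps]

    cluster_id = 0
    for i in range(len(points)):
        if i in labels:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:                   # i is not a core point
            labels[i] = -1                         # tentatively noise
            continue
        cluster_id += 1                            # start a new cluster from core i
        labels[i] = cluster_id
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels.get(j) == -1:                # noise reached by a core point
                labels[j] = cluster_id             # becomes a border point
            if j in labels:
                continue
            labels[j] = cluster_id
            nbrs = neighbours(j)
            if len(nbrs) >= min_pts:               # j is also core: expand through it
                queue.extend(nbrs)
    return labels
```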
OPTICS: A Cluster-Ordering
Method (1999)
OPTICS: Ordering Points To Identify the
Clustering Structure
Ankerst, Breunig, Kriegel, and Sander (SIGMOD99)
Produces a special order of the database wrt its
density-based clustering structure
Good for both automatic and interactive cluster
analysis, including finding intrinsic clustering structure
[Figure: reachability-distance plot of the OPTICS cluster ordering.]
Density Attractor
All remaining tuples of the data set are placed in one of the
clusters such that
at each step, the increase in the entropy of the resulting clustering is
minimized.
Let's take a small market basket database with 5 transactions {(apple, banana),
(apple, banana, cake), (apple, cake, dish), (dish, egg), (dish, egg, fish)}.
For simplicity, transaction (apple, banana) is abbreviated to ab, etc.
For this small database, we want to compare the following two clusterings:
(1) { {ab, abc, acd}, {de, def} } and (2) { {ab, abc}, {acd, de, def} }.
  Cluster           H      W
  {ab, abc, acd}    2.0    4
  {de, def}         1.67   3
  {ab, abc}         1.67   3
  {acd, de, def}    1.6    5

clustering (1) = { {ab, abc, acd}, {de, def} }; clustering (2) = { {ab, abc}, {acd, de, def} }
Histograms of the two clusterings. Adopted from [Yang2002].
We judge the qualities of these two clusterings by analyzing the heights and widths of the clusters. Leaving out the two identical histograms for cluster {de, def} and cluster {ab, abc}, the other two histograms are of different quality.
The histogram for cluster {ab, abc, acd} has H/W = 0.5, but the one for cluster {acd, de, def} has H/W = 0.32.
Clustering (1) is better since we prefer more overlapping among transactions in the same cluster.
Thus, a larger height-to-width ratio of the histogram means better intra-cluster similarity.
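The height and width used above can be computed with a short Python sketch (names illustrative):

```python
from collections import Counter

def histogram_stats(cluster):
    """Height H = S / W and width W of a cluster's item histogram."""
    occ = Counter(item for txn in cluster for item in txn)
    W = len(occ)                 # number of distinct items
    S = sum(occ.values())        # total item occurrences
    return S / W, W

print(histogram_stats([{'a','b'}, {'a','b','c'}, {'a','c','d'}]))  # (2.0, 4)
print(histogram_stats([{'a','c','d'}, {'d','e'}, {'d','e','f'}]))  # (1.6, 5)
```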
Outline
What is Cluster Analysis?
Types of Data in Cluster Analysis
A Categorization of Major Clustering Methods
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Model-Based Clustering Methods
Supervised Classification
STING: A Statistical
Information Grid Approach
Wang, Yang and Muntz (VLDB97)
The spatial area is divided into rectangular cells
There are several levels of cells corresponding to
different levels of resolution
WaveCluster (1998)
Sheikholeslami, Chatterjee, and Zhang (VLDB98)
A multi-resolution clustering approach which
applies wavelet transform to the feature space
A wavelet transform is a signal processing technique that decomposes a signal into different frequency sub-bands.
Input parameters:
the wavelet, and the # of applications of wavelet transform.
WaveCluster (1998)
How to apply wavelet transform to find clusters
Summarize the data by imposing a multidimensional grid structure onto the data space
These multidimensional spatial data objects are represented in an n-dimensional feature space
Apply wavelet transform on the feature space to find the dense regions in the feature space
Apply wavelet transform multiple times, which results in clusters at different scales from fine to coarse
WaveCluster (1998)
Why is wavelet transformation useful for
clustering
Unsupervised clustering
It uses hat-shaped filters to emphasize regions where points cluster, while simultaneously suppressing weaker information at their boundaries
Effective removal and detection of outliers
Multi-resolution
Cost efficiency
Major features:
Complexity O(N)
Detect arbitrary shaped clusters at different scales
Not sensitive to noise, not sensitive to input order
Only applicable to low dimensional data
Quantization and Transformation
[Figure: salary (10,000), vacation (weeks), and age dimensions partitioned into grid intervals; dense units found in the (salary, age) and (vacation, age) subspaces are intersected to form candidate clusters (τ = 3).]
Weakness
The accuracy of the clustering result may be
degraded
Outline
What is Cluster Analysis?
Types of Data in Cluster Analysis
A Categorization of Major Clustering Methods
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Model-Based Clustering Methods
Supervised Classification
COBWEB (Fisher87)
A popular and simple method of incremental conceptual learning
Creates a hierarchical clustering in the form of a classification tree
Each node refers to a concept and contains a probabilistic
description of that concept
COBWEB Clustering
Method
A classification tree
E is the set of attribute values of a data item, given to us by the data set. For example, if each data item is a coin, the evidence E might be represented as follows for a coin i:
Ei = {"land tail"}, meaning that in one trial the coin i landed tail.
If there were many trials, then the evidence E might be represented as follows:
Ei = {"land tail", "land tail", "land head"}, meaning that in 3 separate trials the coin i landed tail, tail, and head.
For example, H might state that coin i belongs in the class "two-headed coin".
We usually do not know the H for a data set. Thus, AutoClass tests many hypotheses.
\[ \pi(H \mid E) = \frac{L(E \mid H)\,\pi(H)}{\pi(E)} = \frac{L(E \mid H)\,\pi(H)}{\sum_{H} L(E \mid H)\,\pi(H)} \]
AutoClass uses a Bayesian method for determining the optimal class H for each object.
Prior distribution for each attribute, symbolizing the prior beliefs of the user about the attribute.
Change the classifications of items in clusters and change the means and variances of the distributions
in each cluster, until the means and variances stabilize.
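The posterior formula above, as a small Python sketch using the coin example (the class names and numbers are made up for illustration):

```python
def posteriors(likelihoods, priors):
    """pi(H|E) = L(E|H) pi(H) / sum_H L(E|H) pi(H)."""
    joint = {H: likelihoods[H] * priors[H] for H in priors}
    total = sum(joint.values())                  # pi(E)
    return {H: v / total for H, v in joint.items()}

# coin i observed to land tail once; two candidate classes for the coin
likelihoods = {'fair coin': 0.5, 'two-headed coin': 0.0}
priors = {'fair coin': 0.9, 'two-headed coin': 0.1}
print(posteriors(likelihoods, priors))
# {'fair coin': 1.0, 'two-headed coin': 0.0}
```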
Competitive learning
Involves a hierarchical architecture of several units
(neurons)
Neurons compete in a winner-takes-all fashion for
the object currently being presented
Outline
What is Cluster Analysis?
Types of Data in Cluster Analysis
A Categorization of Major Clustering Methods
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Model-Based Clustering Methods
Supervised Classification
Supervised classification
Bases the classification on prior
knowledge about the correct classification
of the objects.
CCA-S
Clustering and Classification Algorithm Supervised
(CCA-S)
for detecting intrusions into computer network systems [Ye01].
Outlier Discovery
Problem: find the top n outlier points
Applications: fraud detection, etc.
Statistical
Approaches
Assume a model of the underlying distribution that generates the data set (e.g., a normal distribution)
Use discordancy tests depending on
data distribution
distribution parameter (e.g., mean, variance)
number of expected outliers
Drawbacks
Most distribution tests are for a single attribute
In many cases, data distribution may not be known
Outlier Discovery:
Distance-Based Approach
Introduced to counter the main limitations imposed by
statistical methods
We need multi-dimensional analysis without knowing data
distribution.
Outlier Discovery:
Deviation-Based Approach
Identifies outliers by examining the main
characteristics of objects in a group
Objects that deviate from this description are
considered outliers
Sequential exception technique
simulates the way in which humans can distinguish unusual
objects from among a series of supposedly like objects
Constraint-Based Clustering
Analysis
Clustering analysis: fewer parameters but more user-desired constraints, e.g., an ATM allocation problem
Summary
Cluster analysis groups objects based on their similarity
and has wide applications
Measure of similarity can be computed for various types
of data
Clustering algorithms can be categorized into partitioning
methods, hierarchical methods, density-based methods,
grid-based methods, and model-based methods
Outlier detection and analysis are very useful for fraud
detection, etc. and can be performed by statistical,
distance-based or deviation-based approaches
There are still lots of research issues on cluster analysis,
such as constraint-based clustering
References (1)
M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering
clusters in large spatial databases. KDD'96.
M. Ester, H.-P. Kriegel, and X. Xu. Knowledge discovery in large spatial databases:
Focusing techniques for efficient class identification. SSD'95.
S. Guha, R. Rastogi, and K. Shim. Cure: An efficient clustering algorithm for large
databases. SIGMOD'98.
A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.
References (2)
E. Knorr and R. Ng. Algorithms for mining distance-based outliers in large datasets. VLDB'98.
R. Ng and J. Han. Efficient and effective clustering method for spatial data mining.
VLDB'94.
E. Schikuta. Grid clustering: An efficient hierarchical clustering method for very large data
sets. Proc. 1996 Int. Conf. on Pattern Recognition, 101-105.
W. Wang, J. Yang, and R. Muntz. STING: A statistical information grid approach to spatial data mining. VLDB'97.
T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH : an efficient data clustering method for
very large databases. SIGMOD'96.
Ratio-Scaled Variables
Ratio-scaled variable: a positive measurement on a nonlinear scale, approximately at exponential scale, such as \( Ae^{Bt} \) or \( Ae^{-Bt} \)
Methods:
treat them like interval-scaled variables: not a good choice! (why?)
apply logarithmic transformation: \( y_{if} = \log(x_{if}) \)
treat them as continuous ordinal data and treat their rank as interval-scaled
Clustering is applied to files with logged behavior of system users over time, to separate instances
of normal activity from instances of abusive or attack activity. Clustering is primarily used in
Intrusion Detection Systems, especially when the log files of system user behavior are too large
for a human expert to analyze.
IDS based on signature recognition focus on two main types of activity data: network traffic data and computer audit data. Some categorical activity attributes that can be obtained from this data include the user ID, event type, process ID, command, and remote IP address. Some numerical activity attributes that can be obtained from this data include the time stamp, CPU time, etc.
Squeezer
Squeezer reads the tuples of a categorical data set one by one; each tuple is either assigned to an existing cluster or used to start a new cluster, depending on whether its similarity to the cluster exceeds a given threshold.
The farthest-first traversal k-center algorithm (FFT) is a fast, greedy algorithm that minimizes the maximum cluster radius [Hochbaum85]. In FFT, k points are first selected as cluster centers, and the remaining points are added to the cluster whose center is the closest. The first center is chosen randomly. The second center is greedily chosen as the point farthest from the first. Each remaining center is determined by greedily choosing the point farthest from the set of already chosen centers, where the farthest point x from a set D is defined as \( \arg\max_x \min\{\, d(x,j) : j \in D \,\} \).
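A Python sketch of FFT under these definitions (names illustrative):

```python
import random

def farthest_first(points, k, dist):
    """Greedy k-center seeding: each new center is the point farthest from
    the already chosen set D, i.e. argmax_x min{ d(x, j) : j in D }."""
    centers = [random.choice(points)]              # first center: random
    while len(centers) < k:
        x = max(points, key=lambda p: min(dist(p, c) for c in centers))
        centers.append(x)                          # farthest from chosen set
    return centers
```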
Clustering Feature
Vector
BIRCH's operation is based on clustering features (CF)
and clustering feature trees. These are used to
summarize the information in a cluster.
A CF is a triplet summarizing information about
subclusters of objects.
A CF typically holds the following information about a
subcluster: the number of objects N in the subcluster, a
vector holding the linear sum of the N objects in the
subcluster, and a vector holding the square sum of the
N objects in the subcluster.
Thus, a CF is a summary of statistics for the given
subcluster
Influence function of a data point y on a point x (Gaussian):
\[ f_{Gaussian}(x, y) = e^{-\frac{d(x,y)^2}{2\sigma^2}} \]
Density function over a data set D of N points:
\[ f^{D}_{Gaussian}(x) = \sum_{i=1}^{N} e^{-\frac{d(x,x_i)^2}{2\sigma^2}} \]
Gradient, used for hill-climbing towards a density attractor:
\[ \nabla f^{D}_{Gaussian}(x, x_i) = \sum_{i=1}^{N} (x_i - x)\, e^{-\frac{d(x,x_i)^2}{2\sigma^2}} \]
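A numpy sketch of the density function and its gradient (σ and the sample points are made-up values):

```python
import numpy as np

def gaussian_density(x, data, sigma):
    """f^D(x) = sum_i exp(-d(x, x_i)^2 / (2 sigma^2))."""
    d2 = np.sum((data - x) ** 2, axis=1)
    return np.sum(np.exp(-d2 / (2 * sigma ** 2)))

def gaussian_gradient(x, data, sigma):
    """sum_i (x_i - x) exp(-d(x, x_i)^2 / (2 sigma^2));
    hill-climbing along this gradient leads to a density attractor."""
    d2 = np.sum((data - x) ** 2, axis=1)
    w = np.exp(-d2 / (2 * sigma ** 2))
    return np.sum((data - x) * w[:, None], axis=0)

data = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
print(gaussian_density(np.array([0.05, 0.0]), data, sigma=1.0))
print(gaussian_gradient(np.array([0.05, 0.0]), data, sigma=1.0))
```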
STIRR produces a dynamical system from a table of categorical data. STIRR begins with a table of relational data, consisting of k fields (or columns), each of which can assume one of many possible values. Figure 15 shows an example of the data representation. Each possible value in each field is represented by a node, and the data is represented as a set of tuples; each tuple consists of one node for each field.
A configuration is an assignment of a weight w to each node. A normalization function N() is needed to rescale the weights of the nodes associated with each field so that their squares add up to 1.
CLICK finds clusters in categorical datasets based on a search method for k-partite maximal cliques
[Peters04].
The basic CLICK approach consists of three principal stages, shown in Figure 17, as follows:
Pre-Processing: the k-partite graph is created from the input database D, and the attributes are ranked for efficiency reasons.
Clique Detection: all the maximal k-partite cliques in the graph of D are enumerated.
Post-Processing: the support of the candidate cliques within the original dataset is verified to form the final clusters. Moreover, the final clusters are optionally merged to partially relax the strict cluster conditions.
CLOPE proposes a novel global criterion function that tries to increase the intra-cluster overlapping of transaction items
by increasing the height-to-width ratio of the cluster histogram.
Let's take a small market basket database with 5 transactions {(apple, banana), (apple, banana, cake), (apple, cake, dish), (dish, egg), (dish, egg, fish)}. For simplicity, transaction (apple, banana) is abbreviated to ab, etc. For this small database, we want to compare the following two clusterings: (1) { {ab, abc, acd}, {de, def} } and (2) { {ab, abc}, {acd, de, def} }. For each cluster, we count the occurrence of every distinct item, and then obtain the height (H) and width (W) of the cluster. For example, cluster {ab, abc, acd} has the occurrences a:3, b:2, c:2, and d:1, with H=2.0 and W=4. H is computed by summing the numbers of occurrences and dividing by the number of distinct items. Figure 17 shows the values of these results as histograms, with items sorted in reverse order of their occurrences, only for the sake of easier visual interpretation.
  Cluster           H      W
  {ab, abc, acd}    2.0    4
  {de, def}         1.67   3
  {ab, abc}         1.67   3
  {acd, de, def}    1.6    5

clustering (1) = { {ab, abc, acd}, {de, def} }; clustering (2) = { {ab, abc}, {acd, de, def} }
Figure 17 - Histograms of the two clusterings. Adopted from [Yang2002].
We judge the qualities of these two clusterings by analyzing the heights and widths of the clusters. Leaving out the two identical histograms for cluster {de, def} and cluster {ab, abc}, the other two histograms are of different quality. The histogram for cluster {ab, abc, acd} has H/W=0.5, but the one for cluster {acd, de, def} has H/W=0.32. Clearly, clustering (1) is better since we prefer more overlapping among transactions in the same cluster. From the above example, we can see that a larger height-to-width ratio of the histogram means better intra-cluster similarity. This intuition is the basis of CLOPE, which defines its global criterion function using the geometric properties of the cluster histograms.
WaveCluster (1998)
Wavelets emphasize regions where the objects cluster, but suppress less dense
regions outside of clusters.
The clusters in the data stand out because they are denser than the regions around them [Sheikholeslami98].
CLIQUE
Partition the data space and find the number of points that lie inside each cell of the partition.
Identify clusters:
Determine dense units in all subspaces of interest.
Determine connected dense units in all subspaces of interest.
ACDC
ACDC works in a different way from the algorithms we mentioned above [Tzerpos00]. ACDC performs the task of clustering in two stages. In the first stage, it creates a skeleton of the final decomposition by identifying subsystems using a pattern-driven approach. There are many patterns that have been used in ACDC; Figure 24 shows some of these patterns. Depending on the pattern used, the subsystems are given appropriate names. In the second stage, ACDC completes the decomposition by using an extended version of a technique known as Orphan Adoption. Orphan Adoption is an incremental clustering technique based on the assumption that the existing structure is well established. It attempts to place each newly introduced resource (called an orphan) in the subsystem that seems most appropriate. This is usually a subsystem that has a larger amount of connectivity to the orphan than any other subsystem [Tzerpos00].