Clustering Algorithms
Bill Andreopoulos
Biotec, TU Dresden, Germany, and
Department of Computer Science and Engineering
York University, Toronto, Ontario, Canada
June 27, 2006
Outline
What is Cluster Analysis?
Types of Data in Cluster Analysis
A Categorization of Major Clustering Methods
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Model-Based Clustering Methods
Supervised Classification
Objective of clustering algorithms for categorical data
Partition the objects into groups.
Objects with similar categorical attribute values are placed in the same group.
Objects in different groups contain dissimilar categorical attribute values.
General Applications of
Clustering
Pattern Recognition
Spatial Data Analysis
create thematic maps in GIS by clustering feature spaces
detect spatial clusters and explain them in spatial data mining
Image Processing
Economic Science (especially market research)
WWW
Document classification
Cluster Weblog data to discover groups of similar access
patterns
Requirements of Clustering in
Data Mining
Ability to deal with different types of attributes
Discovery of clusters with arbitrary shape
Minimal requirements for domain knowledge to
determine input parameters
Able to deal with noise and outliers
Insensitive to order of input records
Scalability to high dimensions
Interpretability and usability
Incorporation of user-specified constraints
Outline
What is Cluster Analysis?
Types of Data in Cluster Analysis
A Categorization of Major Clustering Methods
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Model-Based Clustering Methods
Supervised Classification
Data Structures
Data matrix (n objects described by p variables):
\[
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}
\]
Dissimilarity matrix (n-by-n; symmetric, so only the lower triangle is stored):
\[
\begin{bmatrix}
0 \\
d(2,1) & 0 \\
d(3,1) & d(3,2) & 0 \\
\vdots & \vdots & \vdots & \ddots \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}
\]
Mixed types
Nominal (categorical)
A generalization of the binary variable in that it can
take more than 2 states, e.g., red, yellow, blue, green
Method 1: Simple matching
m: # of matches, p: total # of variables
\[ d(i,j) = \frac{p - m}{p} \]
Method 2: use a large number of binary variables
creating a new binary variable for each of the M nominal states
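A minimal Python sketch of Method 1 (the function name and example values are illustrative, not from the slides):

```python
def nominal_dissimilarity(obj_i, obj_j):
    """Simple matching dissimilarity d(i,j) = (p - m) / p for nominal tuples."""
    p = len(obj_i)                                  # total number of variables
    m = sum(a == b for a, b in zip(obj_i, obj_j))   # number of matches
    return (p - m) / p

# e.g., two objects described by three nominal attributes
print(nominal_dissimilarity(("red", "small", "round"),
                            ("red", "large", "round")))   # 1/3, one mismatch
```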
Interval-scaled variables
Standardize data
Calculate the mean absolute deviation:
\[ s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right) \]
where
\[ m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \cdots + x_{nf}\right) \]
Calculate the standardized measurement (z-score):
\[ z_{if} = \frac{x_{if} - m_f}{s_f} \]
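The standardization above, as a short Python sketch (names are illustrative):

```python
def standardize(column):
    """z-score a single variable f using the mean absolute deviation s_f."""
    n = len(column)
    m_f = sum(column) / n                            # mean of variable f
    s_f = sum(abs(x - m_f) for x in column) / n      # mean absolute deviation
    return [(x - m_f) / s_f for x in column]

print(standardize([2.0, 4.0, 6.0, 8.0]))  # [-1.5, -0.5, 0.5, 1.5]
```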
Ordinal (numerical)
An ordinal variable can be discrete or continuous
order is important, e.g., rank
Can be treated like interval-scaled
replacing \( x_{if} \) by its rank \( r_{if} \in \{1, \ldots, M_f\} \)
map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by
\[ z_{if} = \frac{r_{if} - 1}{M_f - 1} \]
Minkowski distance:
\[ d(i,j) = \sqrt[q]{\,|x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q\,} \]
where \( i = (x_{i1}, x_{i2}, \ldots, x_{ip}) \) and \( j = (x_{j1}, x_{j2}, \ldots, x_{jp}) \) are two p-dimensional data objects, and q is a positive integer
If q = 1, d is Manhattan distance:
\[ d(i,j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}| \]
Properties
d(i,j) ≥ 0
d(i,i) = 0
d(i,j) = d(j,i)
d(i,j) ≤ d(i,k) + d(k,j)
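A short Python sketch of the Minkowski distance (q = 1 gives Manhattan, q = 2 Euclidean; names are illustrative):

```python
def minkowski(i, j, q=2):
    """Minkowski distance between two p-dimensional objects."""
    return sum(abs(a - b) ** q for a, b in zip(i, j)) ** (1.0 / q)

print(minkowski((0, 0), (3, 4), q=1))  # 7.0  (Manhattan)
print(minkowski((0, 0), (3, 4), q=2))  # 5.0  (Euclidean)
```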
Binary Variables
A contingency table for binary data:

                    Object j
                     1      0     sum
  Object i   1       a      b     a+b
             0       c      d     c+d
           sum      a+c    b+d     p

Simple matching coefficient (invariant, if the binary variable is symmetric):
\[ d(i,j) = \frac{b + c}{a + b + c + d} \]
Jaccard coefficient (noninvariant if the binary variable is asymmetric):
\[ d(i,j) = \frac{b + c}{a + b + c} \]
Example (Gender is a symmetric attribute; the remaining attributes are asymmetric binary; let the values Y and P be set to 1, and the value N be set to 0):

  Object     Gender  Fever  Cough  Test-1  Test-2  Test-3  Test-4
  Object 1     M       Y      N      P       N       N       N
  Object 2     F       Y      N      P       N       P       N
  Object 3     M       Y      P      N       N       N       N
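Treating Y and P as 1 and N as 0, and leaving out the symmetric Gender attribute, a Python sketch reproduces the pairwise Jaccard dissimilarities for the three objects in the table (object labels are illustrative):

```python
def jaccard_dissim(x, y):
    """d(i,j) = (b + c) / (a + b + c) for asymmetric binary vectors."""
    a = sum(p == 1 and q == 1 for p, q in zip(x, y))   # positive matches
    b = sum(p == 1 and q == 0 for p, q in zip(x, y))
    c = sum(p == 0 and q == 1 for p, q in zip(x, y))
    return (b + c) / (a + b + c)

# Fever, Cough, Test-1 .. Test-4 with Y/P -> 1, N -> 0 (Gender omitted)
obj1 = (1, 0, 1, 0, 0, 0)
obj2 = (1, 0, 1, 0, 1, 0)
obj3 = (1, 1, 0, 0, 0, 0)
print(jaccard_dissim(obj1, obj2))  # 0.33...
print(jaccard_dissim(obj1, obj3))  # 0.67...
print(jaccard_dissim(obj2, obj3))  # 0.75
```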
f is binary or nominal:
\( d_{ij}^{(f)} = 0 \) if \( x_{if} = x_{jf} \); otherwise \( d_{ij}^{(f)} = 1 \)
f is interval-based: use the normalized distance
f is ordinal:
compute ranks \( r_{if} \) and treat
\[ z_{if} = \frac{r_{if} - 1}{M_f - 1} \]
as interval-scaled
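A sketch combining the three cases into one averaged dissimilarity (Python; the helper name, type tags, and example values are assumptions for illustration):

```python
def mixed_dissim(x, y, types, ranges):
    """Average of per-variable distances d_ij(f) over mixed variable types.
    types[f] in {'nominal', 'interval', 'ordinal'};
    ranges[f] = (min, max) for interval, M_f for ordinal, None otherwise."""
    total = 0.0
    for f, (a, b) in enumerate(zip(x, y)):
        t = types[f]
        if t == 'nominal':                       # binary or nominal
            total += 0.0 if a == b else 1.0
        elif t == 'interval':                    # normalized distance
            lo, hi = ranges[f]
            total += abs(a - b) / (hi - lo)
        elif t == 'ordinal':                     # rank mapped onto [0, 1]
            M_f = ranges[f]
            za, zb = (a - 1) / (M_f - 1), (b - 1) / (M_f - 1)
            total += abs(za - zb)
    return total / len(x)

# one nominal, one interval on [0, 100], one ordinal with M_f = 3 ranks
print(mixed_dissim(('red', 40, 1), ('blue', 60, 3),
                   types=('nominal', 'interval', 'ordinal'),
                   ranges=(None, (0, 100), 3)))  # (1 + 0.2 + 1) / 3 = 0.733...
```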
Software clustering
Group system files such that files with similar functionality are in the
same cluster, while files in different clusters perform dissimilar
functions.
Each object is a file x of the software system.
Categorical data set on a software system: for each file, which other
files it may invoke during runtime.
After the filename x there is a list of the other filenames that x may invoke.
Outline
What is Cluster Analysis?
Types of Data in Cluster Analysis
A Categorization of Major Clustering Methods
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Model-Based Clustering Methods
Supervised Classification
Weaknesses of k-means
Applicable only when the mean is defined (then what about categorical data?)
Need to specify k, the number of clusters, in advance
Unable to handle noisy data and outliers
Not suitable to discover clusters with non-convex shapes
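For contrast, a minimal k-means sketch in Python that makes these weaknesses visible: it needs a mean, needs k up front, and is pulled around by outliers (all names are illustrative):

```python
import random

def k_means(points, k, iters=100):
    """Plain k-means on tuples of numbers."""
    centers = random.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # assignment step: each point goes to its nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: sum((a - b) ** 2
                                                for a, b in zip(p, centers[c])))
            clusters[j].append(p)
        # update step: recompute each center as the cluster mean
        for j, cl in enumerate(clusters):
            if cl:
                centers[j] = tuple(sum(xs) / len(cl) for xs in zip(*cl))
    return centers, clusters
```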
[Figure: k-medoids (PAM) swap cost, comparing current medoid i with candidate h for each non-selected object j; the case shown has Cjih = 0.]
CLARA (Clustering LARge Applications)
It draws a sample of the data set, applies PAM on the sample, and gives the best clustering as the output.
Fuzzy k-Means
Each object belongs to every cluster with a fractional membership weight; cluster centers are recomputed as membership-weighted means, and the loop repeats until the memberships stabilize.
K-Modes algorithm
K-Modes deals with categorical attributes.
Insert the first K objects into K new clusters.
Calculate the initial K modes for the K clusters.
Repeat {
    Allocate each object to the cluster whose mode is nearest, using the simple-matching dissimilarity;
    Recompute the mode of each cluster;
} Until no object changes cluster membership.
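A small Python sketch of this procedure (function names are illustrative, not from the original K-Modes paper):

```python
from collections import Counter

def matches(x, y):
    """Simple-matching dissimilarity between two categorical tuples."""
    return sum(a != b for a, b in zip(x, y))

def mode_of(cluster):
    """Per-attribute most frequent value."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*cluster))

def k_modes(objects, k, iters=20):
    clusters = [[o] for o in objects[:k]]       # first K objects seed K clusters
    modes = [mode_of(c) for c in clusters]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for o in objects:                       # nearest mode wins
            j = min(range(k), key=lambda c: matches(o, modes[c]))
            clusters[j].append(o)
        modes = [mode_of(c) if c else modes[j]  # keep old mode for empty clusters
                 for j, c in enumerate(clusters)]
    return modes, clusters
```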
Fuzzy K-Modes
The fuzzy k-modes algorithm contains extensions to the
fuzzy k-means algorithm for clustering categorical data.
Bunch
Bunch clusters a software system by searching (hill climbing, genetic algorithms) for a partition of the module dependency graph that maximizes a modularization quality measure.
Hierarchical Clustering
Use distance matrix as clustering criteria. This
method does not require the number of clusters
k as an input, but needs a termination condition
[Diagram: at Step 0 the objects a, b, c, d, e are singletons; agglomerative clustering (AGNES) merges a and b into ab and d and e into de, then forms cde, and by Step 4 reaches abcde; divisive clustering (DIANA) runs the same sequence in reverse, from abcde back to singletons.]
AGNES (Agglomerative
Nesting)
Introduced in Kaufmann and Rousseeuw (1990)
Merge nodes that have the least dissimilarity
Go on in a non-descending fashion
Eventually all nodes belong to the same cluster
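A single-link agglomerative sketch in Python (illustrative, not the AGNES implementation of Kaufmann and Rousseeuw; `dist` can be any dissimilarity from the earlier slides):

```python
def agnes(points, dist):
    """Single-link agglomerative clustering: repeatedly merge the pair of
    clusters with the least dissimilarity until one cluster remains."""
    clusters = [[p] for p in points]
    merges = []                                  # the dendrogram, bottom-up
    while len(clusters) > 1:
        # find the two closest clusters (minimum over cross-cluster pairs)
        i, j = min(((i, j) for i in range(len(clusters))
                           for j in range(i + 1, len(clusters))),
                   key=lambda ij: min(dist(a, b)
                                      for a in clusters[ij[0]]
                                      for b in clusters[ij[1]]))
        merges.append((clusters[i], clusters[j]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return merges
```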
BIRCH (1996)
Birch: Balanced Iterative Reducing and Clustering using
Hierarchies, by Zhang, Ramakrishnan, and Livny (SIGMOD'96)
Incrementally construct a CF (Clustering Feature) tree, a
hierarchical data structure for multiphase clustering
Phase 1: scan DB to build an initial in-memory CF tree (a multi-level
compression of the data that tries to preserve the inherent clustering
structure of the data)
Phase 2: use an arbitrary clustering algorithm to cluster the leaf
nodes of the CF-tree
Clustering Feature Vector
Clustering Feature: CF = (N, LS, SS)
N: number of data points
LS: \( \sum_{i=1}^{N} X_i \) (linear sum of the N points)
SS: \( \sum_{i=1}^{N} X_i^2 \) (square sum of the N points)
Example: for the five points (3,4), (2,6), (4,5), (4,7), (3,8), CF = (5, (16,30), (54,190)).
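The CF triplet can be checked with a few lines of Python (function name illustrative):

```python
def clustering_feature(points):
    """CF = (N, LS, SS): count, per-dimension linear sum, per-dimension square sum."""
    N = len(points)
    LS = tuple(sum(xs) for xs in zip(*points))
    SS = tuple(sum(x * x for x in xs) for xs in zip(*points))
    return N, LS, SS

# the five example points from the slide
print(clustering_feature([(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]))
# (5, (16, 30), (54, 190))
```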
CF Tree - A nonleaf node in this tree contains summaries of the CFs of its
children. A CF tree is a multilevel summary of the data that preserves the
inherent structure of the data.
L = 6
[Diagram: a CF tree. The root holds entries CF1, CF2, CF3, ..., CF6, each with a pointer to a child node. A non-leaf node likewise holds entries CF1 ... CF5 with child pointers. Leaf nodes hold up to L CF entries and are chained together with prev/next pointers.]
CURE (Clustering Using REpresentatives)
Figure 11: Clustering a set of objects using CURE. (a) A random sample of objects. (b) Partial clusters; representative points for each cluster are marked with a +. (c) The partial clusters are further clustered; the representative points are moved toward the cluster center. (d) The final clusters are nonspherical.
Shrink the multiple representative points towards the gravity center by a fraction α.
Rock: Algorithm
Links: The number of common neighbours for
the two points.
CHAMELEON
CHAMELEON: hierarchical clustering using
dynamic modeling, by G. Karypis, E. H. Han, and V. Kumar, 1999
Measures the similarity based on a dynamic
model
Two clusters are merged only if the
interconnectivity and closeness (proximity)
between two clusters are high relative to the
internal interconnectivity of the clusters and
closeness of items within the clusters
CHAMELEON
A two phase algorithm
Use a graph partitioning algorithm: cluster objects into a
large number of relatively small sub-clusters
Use an agglomerative hierarchical clustering algorithm:
find the genuine clusters by repeatedly combining these
sub-clusters
Overall Framework of CHAMELEON
Data Set → Construct Sparse Graph → Partition the Graph → Merge Partitions → Final Clusters
Scan the similarity matrix to find the highest value, representing the pair of genes with the most similar interaction patterns.
The two most similar genes are grouped in a cluster, and the similarity matrix is recomputed using the average properties of all genes in the cluster.
More genes are progressively added to the initial pairs to form clusters of genes [Eisen98].
The process is repeated until all genes have been grouped into clusters.
LIMBO
LIMBO, introduced in [Andritsos04], is a scalable hierarchical categorical clustering algorithm that builds on the Information Bottleneck (IB) framework for quantifying the relevant information preserved when clustering.
LIMBO uses the IB framework to define a distance measure for categorical tuples.
LIMBO handles large data sets by producing a memory-bounded summary model for the data.
Outline
What is Cluster Analysis?
Types of Data in Cluster Analysis
A Categorization of Major Clustering Methods
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Model-Based Clustering Methods
Supervised Classification
Density-Based Clustering:
Background
Two parameters:
Eps: maximum radius of the neighbourhood
MinPts: minimum number of points in an Eps-neighbourhood of that point
[Figure: point p is directly density-reachable from core point q (MinPts = 5, Eps = 1 cm).]
Density-Based Clustering:
Background (II)
Density-reachable:
A point p is density-reachable from a point q wrt. Eps, MinPts if there is a chain of points p1, ..., pn, with p1 = q and pn = p, such that each p(i+1) is directly density-reachable from p(i).
Density-connected:
A point p is density-connected to a point q wrt. Eps, MinPts if there is a point o such that both p and q are density-reachable from o wrt. Eps and MinPts.
[Figure: chains illustrating density-reachability through p1 and density-connectivity through o (Eps = 1 cm, MinPts = 5).]
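These definitions are what DBSCAN builds on; a minimal, illustrative Python sketch (not the original implementation):

```python
def dbscan(points, eps, min_pts, dist):
    """Expand clusters from core points via density-reachability."""
    labels = {}                                    # index -> cluster id (-1 = noise)

    def neighbours(i):
        return [j for j in range(len(points))
                if dist(points[i], points[j]) <= eps]

    cluster_id = 0
    for i in range(len(points)):
        if i in labels:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:                   # i is not a core point
            labels[i] = -1                         # tentatively noise
            continue
        cluster_id += 1                            # start a new cluster from core i
        labels[i] = cluster_id
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels.get(j) == -1:                # noise reached by a core point
                labels[j] = cluster_id             # becomes a border point
            if j in labels:
                continue
            labels[j] = cluster_id
            nbrs = neighbours(j)
            if len(nbrs) >= min_pts:               # j is also core: expand through it
                queue.extend(nbrs)
    return labels
```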
OPTICS: A Cluster-Ordering
Method (1999)
OPTICS: Ordering Points To Identify the
Clustering Structure
Ankerst, Breunig, Kriegel, and Sander (SIGMOD99)
Produces a special order of the database wrt its
density-based clustering structure
Good for both automatic and interactive cluster
analysis, including finding intrinsic clustering structure
[Figure: reachability-distance plot of the OPTICS cluster ordering.]
Density Attractor
All remaining tuples of the data set are placed in one of the
clusters such that
at each step, the increase in the entropy of the resulting clustering is
minimized.
Let's take a small market basket database with 5 transactions {(apple, banana),
(apple, banana, cake), (apple, cake, dish), (dish, egg), (dish, egg, fish)}.
For simplicity, transaction (apple, banana) is abbreviated to ab, etc.
For this small database, we want to compare the following two clusterings:
(1) { {ab, abc, acd}, {de, def} } and (2) { {ab, abc}, {acd, de, def} }.
  Cluster           H      W
  {ab, abc, acd}    2.0    4
  {de, def}         1.67   3
  {ab, abc}         1.67   3
  {acd, de, def}    1.6    5

clustering (1) = { {ab, abc, acd}, {de, def} }; clustering (2) = { {ab, abc}, {acd, de, def} }
Histograms of the two clusterings. Adopted from [Yang2002].
We judge the qualities of these two clusterings by analyzing the heights and widths of the clusters. Leaving out the two identical histograms for cluster {de, def} and cluster {ab, abc}, the other two histograms are of different quality.
The histogram for cluster {ab, abc, acd} has H/W = 0.5, but the one for cluster {acd, de, def} has H/W = 0.32.
Clustering (1) is better since we prefer more overlapping among transactions in the same cluster.
Thus, a larger height-to-width ratio of the histogram means better intra-cluster similarity.
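The height and width used above can be computed with a short Python sketch (names illustrative):

```python
from collections import Counter

def histogram_stats(cluster):
    """Height H = S / W and width W of a cluster's item histogram."""
    occ = Counter(item for txn in cluster for item in txn)
    W = len(occ)                 # number of distinct items
    S = sum(occ.values())        # total item occurrences
    return S / W, W

print(histogram_stats([{'a','b'}, {'a','b','c'}, {'a','c','d'}]))  # (2.0, 4)
print(histogram_stats([{'a','c','d'}, {'d','e'}, {'d','e','f'}]))  # (1.6, 5)
```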
Outline
What is Cluster Analysis?
Types of Data in Cluster Analysis
A Categorization of Major Clustering Methods
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Model-Based Clustering Methods
Supervised Classification
STING: A Statistical
Information Grid Approach
Wang, Yang and Muntz (VLDB97)
The spatial area is divided into rectangular cells
There are several levels of cells corresponding to
different levels of resolution
WaveCluster (1998)
Sheikholeslami, Chatterjee, and Zhang (VLDB98)
A multi-resolution clustering approach which
applies wavelet transform to the feature space
A wavelet transform is a signal processing technique that decomposes a signal into different frequency sub-bands.
Input parameters:
the wavelet, and the # of applications of wavelet transform.
WaveCluster (1998)
How to apply wavelet transform to find clusters
Summarize the data by imposing a multidimensional grid structure onto the data space
These multidimensional spatial data objects are represented in an n-dimensional feature space
Apply wavelet transform on the feature space to find the dense regions in the feature space
Apply wavelet transform multiple times, which results in clusters at different scales from fine to coarse
WaveCluster (1998)
Why is wavelet transformation useful for
clustering
Unsupervised clustering
It uses hat-shaped filters to emphasize regions where points cluster, while simultaneously suppressing weaker information at their boundaries
Effective removal and detection of outliers
Multi-resolution
Cost efficiency
Major features:
Complexity O(N)
Detect arbitrary shaped clusters at different scales
Not sensitive to noise, not sensitive to input order
Only applicable to low dimensional data
Quantization and Transformation
[Figure: salary (10,000), vacation (weeks), and age dimensions partitioned into grid intervals; dense units found in the (salary, age) and (vacation, age) subspaces are intersected to form candidate clusters (τ = 3).]
Weakness
The accuracy of the clustering result may be
degraded
Outline
What is Cluster Analysis?
Types of Data in Cluster Analysis
A Categorization of Major Clustering Methods
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Model-Based Clustering Methods
Supervised Classification
COBWEB (Fisher87)
A popular and simple method of incremental conceptual learning
Creates a hierarchical clustering in the form of a classification tree
Each node refers to a concept and contains a probabilistic
description of that concept
COBWEB Clustering
Method
A classification tree
E is the set of attribute values of a data item, given to us by the data set. For example, if each data item is a coin, the evidence E might be represented as follows for a coin i:
Ei = {"land tail"}, meaning that in one trial the coin i landed tail.
If there were many trials, then the evidence E might be represented as follows:
Ei = {"land tail", "land tail", "land head"}, meaning that in 3 separate trials the coin i landed tail, tail, and head.
For example, H might state that coin i belongs in the class "two-headed coin".
We usually do not know the H for a data set. Thus, AutoClass tests many hypotheses.
\[ \pi(H \mid E) = \frac{L(E \mid H)\,\pi(H)}{\pi(E)} = \frac{L(E \mid H)\,\pi(H)}{\sum_{H} L(E \mid H)\,\pi(H)} \]
AutoClass uses a Bayesian method for determining the optimal class H for each object.
Prior distribution for each attribute, symbolizing the prior beliefs of the user about the attribute.
Change the classifications of items in clusters and change the means and variances of the distributions
in each cluster, until the means and variances stabilize.
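The posterior formula above, as a small Python sketch using the coin example (the class names and numbers are made up for illustration):

```python
def posteriors(likelihoods, priors):
    """pi(H|E) = L(E|H) pi(H) / sum_H L(E|H) pi(H)."""
    joint = {H: likelihoods[H] * priors[H] for H in priors}
    total = sum(joint.values())                  # pi(E)
    return {H: v / total for H, v in joint.items()}

# coin i observed to land tail once; two candidate classes for the coin
likelihoods = {'fair coin': 0.5, 'two-headed coin': 0.0}
priors = {'fair coin': 0.9, 'two-headed coin': 0.1}
print(posteriors(likelihoods, priors))
# {'fair coin': 1.0, 'two-headed coin': 0.0}
```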
Competitive learning
Involves a hierarchical architecture of several units
(neurons)
Neurons compete in a winner-takes-all fashion for
the object currently being presented
Outline
What is Cluster Analysis?
Types of Data in Cluster Analysis
A Categorization of Major Clustering Methods
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Model-Based Clustering Methods
Supervised Classification
Supervised classification
Bases the classification on prior
knowledge about the correct classification
of the objects.
CCA-S
Clustering and Classification Algorithm Supervised
(CCA-S)
for detecting intrusions into computer network systems [Ye01].
Outlier Discovery
Problem: find the top n outlier points
Applications: fraud detection, etc.
Statistical
Approaches
Assume a model of the underlying distribution that generates the data set (e.g., a normal distribution)
Use discordancy tests depending on
data distribution
distribution parameter (e.g., mean, variance)
number of expected outliers
Drawbacks
Most distribution tests are for a single attribute
In many cases, data distribution may not be known
Outlier Discovery:
Distance-Based Approach
Introduced to counter the main limitations imposed by
statistical methods
We need multi-dimensional analysis without knowing data
distribution.
Outlier Discovery:
Deviation-Based Approach
Identifies outliers by examining the main
characteristics of objects in a group
Objects that deviate from this description are
considered outliers
Sequential exception technique
simulates the way in which humans can distinguish unusual
objects from among a series of supposedly like objects
Constraint-Based Clustering
Analysis
Clustering analysis: fewer parameters but more user-desired constraints, e.g., an ATM allocation problem
Summary
Cluster analysis groups objects based on their similarity
and has wide applications
Measure of similarity can be computed for various types
of data
Clustering algorithms can be categorized into partitioning
methods, hierarchical methods, density-based methods,
grid-based methods, and model-based methods
Outlier detection and analysis are very useful for fraud
detection, etc. and can be performed by statistical,
distance-based or deviation-based approaches
There are still lots of research issues on cluster analysis,
such as constraint-based clustering
References (1)
M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering
clusters in large spatial databases. KDD'96.
M. Ester, H.-P. Kriegel, and X. Xu. Knowledge discovery in large spatial databases:
Focusing techniques for efficient class identification. SSD'95.
S. Guha, R. Rastogi, and K. Shim. Cure: An efficient clustering algorithm for large
databases. SIGMOD'98.
A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.
References (2)
E. Knorr and R. Ng. Algorithms for mining distance-based outliers in large datasets. VLDB'98.
R. Ng and J. Han. Efficient and effective clustering method for spatial data mining.
VLDB'94.
E. Schikuta. Grid clustering: An efficient hierarchical clustering method for very large data
sets. Proc. 1996 Int. Conf. on Pattern Recognition, 101-105.
W. Wang, J. Yang, and R. Muntz. STING: A statistical information grid approach to spatial data mining. VLDB'97.
T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH : an efficient data clustering method for
very large databases. SIGMOD'96.
Ratio-Scaled Variables
Ratio-scaled variable: a positive measurement on a nonlinear scale, approximately at exponential scale, such as \( Ae^{Bt} \) or \( Ae^{-Bt} \)
Methods:
treat them like interval-scaled variables: not a good choice! (why?)
apply logarithmic transformation: \( y_{if} = \log(x_{if}) \)
treat them as continuous ordinal data and treat their rank as interval-scaled
Clustering is applied to files with logged behavior of system users over time, to separate instances
of normal activity from instances of abusive or attack activity. Clustering is primarily used in
Intrusion Detection Systems, especially when the log files of system user behavior are too large
for a human expert to analyze.
IDS based on signature recognition focus on two main types of activity data: network traffic data and computer audit data. Some categorical activity attributes that can be obtained from this data include the user ID, event type, process ID, command, and remote IP address. Some numerical activity attributes that can be obtained from this data include the time stamp, CPU time, etc.
Squeezer
Squeezer reads the tuples of a categorical data set one by one; each tuple is either assigned to an existing cluster or used to start a new cluster, depending on whether its similarity to the cluster exceeds a given threshold.
The farthest-first traversal k-center algorithm (FFT) is a fast, greedy algorithm that minimizes the maximum cluster radius [Hochbaum85]. In FFT, k points are first selected as cluster centers, and the remaining points are added to the cluster whose center is the closest. The first center is chosen randomly. The second center is greedily chosen as the point farthest from the first. Each remaining center is determined by greedily choosing the point farthest from the set of already chosen centers, where the farthest point x from a set D is defined as \( \arg\max_x \min\{\, d(x,j) : j \in D \,\} \).
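A Python sketch of FFT under these definitions (names illustrative):

```python
import random

def farthest_first(points, k, dist):
    """Greedy k-center seeding: each new center is the point farthest from
    the already chosen set D, i.e. argmax_x min{ d(x, j) : j in D }."""
    centers = [random.choice(points)]              # first center: random
    while len(centers) < k:
        x = max(points, key=lambda p: min(dist(p, c) for c in centers))
        centers.append(x)                          # farthest from chosen set
    return centers
```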
Clustering Feature
Vector
BIRCH's operation is based on clustering features (CF)
and clustering feature trees. These are used to
summarize the information in a cluster.
A CF is a triplet summarizing information about
subclusters of objects.
A CF typically holds the following information about a
subcluster: the number of objects N in the subcluster, a
vector holding the linear sum of the N objects in the
subcluster, and a vector holding the square sum of the
N objects in the subcluster.
Thus, a CF is a summary of statistics for the given
subcluster
Influence function of a data point y on a point x (Gaussian):
\[ f_{Gaussian}(x, y) = e^{-\frac{d(x,y)^2}{2\sigma^2}} \]
Density function over a data set D of N points:
\[ f^{D}_{Gaussian}(x) = \sum_{i=1}^{N} e^{-\frac{d(x,x_i)^2}{2\sigma^2}} \]
Gradient, used for hill-climbing towards a density attractor:
\[ \nabla f^{D}_{Gaussian}(x, x_i) = \sum_{i=1}^{N} (x_i - x)\, e^{-\frac{d(x,x_i)^2}{2\sigma^2}} \]
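A numpy sketch of the density function and its gradient (σ and the sample points are made-up values):

```python
import numpy as np

def gaussian_density(x, data, sigma):
    """f^D(x) = sum_i exp(-d(x, x_i)^2 / (2 sigma^2))."""
    d2 = np.sum((data - x) ** 2, axis=1)
    return np.sum(np.exp(-d2 / (2 * sigma ** 2)))

def gaussian_gradient(x, data, sigma):
    """sum_i (x_i - x) exp(-d(x, x_i)^2 / (2 sigma^2));
    hill-climbing along this gradient leads to a density attractor."""
    d2 = np.sum((data - x) ** 2, axis=1)
    w = np.exp(-d2 / (2 * sigma ** 2))
    return np.sum((data - x) * w[:, None], axis=0)

data = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
print(gaussian_density(np.array([0.05, 0.0]), data, sigma=1.0))
print(gaussian_gradient(np.array([0.05, 0.0]), data, sigma=1.0))
```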
STIRR produces a dynamical system from a table of categorical data. STIRR begins with a table of relational data, consisting of k fields (or columns), each of which can assume one of many possible values. Figure 15 shows an example of the data representation. Each possible value in each field is represented by a node, and the data is represented as a set of tuples; each tuple consists of one node for each field.
A configuration is an assignment of a weight w to each node. A normalization function N() is needed to rescale the weights of the nodes associated with each field so that their squares add up to 1.
CLICK finds clusters in categorical datasets based on a search method for k-partite maximal cliques
[Peters04].
The basic CLICK approach consists of three principal stages, shown in Figure 17, as follows:
Pre-Processing: the k-partite graph is created from the input database D, and the attributes are ranked for efficiency reasons.
Clique Detection: all the maximal k-partite cliques in the graph of D are enumerated.
Post-Processing: the support of the candidate cliques within the original dataset is verified to form the final clusters. Moreover, the final clusters are optionally merged to partially relax the strict cluster conditions.
CLOPE proposes a novel global criterion function that tries to increase the intra-cluster overlapping of transaction items
by increasing the height-to-width ratio of the cluster histogram.
Let's take a small market basket database with 5 transactions {(apple, banana), (apple, banana, cake), (apple, cake, dish), (dish, egg), (dish, egg, fish)}. For simplicity, transaction (apple, banana) is abbreviated to ab, etc. For this small database, we want to compare the following two clusterings: (1) { {ab, abc, acd}, {de, def} } and (2) { {ab, abc}, {acd, de, def} }. For each cluster, we count the occurrence of every distinct item, and then obtain the height (H) and width (W) of the cluster. For example, cluster {ab, abc, acd} has the occurrences a:3, b:2, c:2, and d:1, with H=2.0 and W=4. H is computed by summing the numbers of occurrences and dividing by the number of distinct items. Figure 17 shows the values of these results as histograms, with items sorted in reverse order of their occurrences, only for the sake of easier visual interpretation.
  Cluster           H      W
  {ab, abc, acd}    2.0    4
  {de, def}         1.67   3
  {ab, abc}         1.67   3
  {acd, de, def}    1.6    5

clustering (1) = { {ab, abc, acd}, {de, def} }; clustering (2) = { {ab, abc}, {acd, de, def} }
Figure 17 - Histograms of the two clusterings. Adopted from [Yang2002].
We judge the qualities of these two clusterings by analyzing the heights and widths of the clusters. Leaving out the two identical histograms for cluster {de, def} and cluster {ab, abc}, the other two histograms are of different quality. The histogram for cluster {ab, abc, acd} has H/W=0.5, but the one for cluster {acd, de, def} has H/W=0.32. Clearly, clustering (1) is better since we prefer more overlapping among transactions in the same cluster. From the above example, we can see that a larger height-to-width ratio of the histogram means better intra-cluster similarity. This intuition is the basis of CLOPE, which defines its global criterion function using the geometric properties of the cluster histograms.
WaveCluster (1998)
Wavelets emphasize regions where the objects cluster, but suppress less dense
regions outside of clusters.
The clusters in the data stand out because they are denser than the regions around them [Sheikholeslami98].
CLIQUE
Partition the data space and find the number of points that lie inside each cell of the partition.
Identify clusters:
Determine dense units in all subspaces of interest.
Determine connected dense units in all subspaces of interest.
ACDC
ACDC works in a different way from the algorithms we mentioned above [Tzerpos00]. ACDC performs the task of clustering in two stages. In the first stage, it creates a skeleton of the final decomposition by identifying subsystems using a pattern-driven approach. There are many patterns that have been used in ACDC; Figure 24 shows some of these patterns. Depending on the pattern used, the subsystems are given appropriate names. In the second stage, ACDC completes the decomposition by using an extended version of a technique known as Orphan Adoption. Orphan Adoption is an incremental clustering technique based on the assumption that the existing structure is well established. It attempts to place each newly introduced resource (called an orphan) in the subsystem that seems most appropriate. This is usually a subsystem that has a larger amount of connectivity to the orphan than any other subsystem [Tzerpos00].