
Literature Survey of Clustering

Algorithms
Bill Andreopoulos
Biotec, TU Dresden, Germany, and
Department of Computer Science and Engineering
York University, Toronto, Ontario, Canada
June 27, 2006

Outline
What is Cluster Analysis?
Types of Data in Cluster Analysis
A Categorization of Major Clustering Methods
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Model-Based Clustering Methods
Supervised Classification

Objective of clustering
algorithms for categorical data
Partition the objects into groups.
Objects with similar categorical attribute
values are placed in the same group.
Objects in different groups contain
dissimilar categorical attribute values.

An issue with clustering in general is defining the goals.


Papadimitriou et al. (2000) propose:
Seek k groups G1, ..., Gk and a policy Pi for each group i.
Pi : a vector of categorical attribute values.
Maximize: Σ_{i=1..k} Σ_{c ∈ Gi} Pi ⊙ c
where ⊙ is the overlap operator between 2 vectors.

The clustering problem is NP-complete:
Ideally, one would search all possible groupings and all possible assignments of objects.
The best clustering is the one maximizing a quality measure.

General Applications of
Clustering
Pattern Recognition
Spatial Data Analysis
create thematic maps in GIS by clustering feature spaces
detect spatial clusters and explain them in spatial data
mining

Image Processing
Economic Science (especially market research)
WWW
Document classification
Cluster Weblog data to discover groups of similar access
patterns

Examples of Clustering Applications

Software clustering: cluster files in software systems based on their functionality
Intrusion detection: discover instances of anomalous (intrusive) user behavior in large system log files
Gene expression data: discover genes with similar functions in DNA microarray data
Marketing: help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
Land use: identification of areas of similar land use in an earth observation database
Insurance: identifying groups of motor insurance policy holders with a high average claim cost

What Is Good Clustering?

A good clustering method will produce high quality clusters with
high intra-class similarity
low inter-class similarity
The quality of a clustering depends on:
appropriateness of the method for the dataset
the (dis)similarity measure used
its implementation
The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.

Requirements of Clustering in
Data Mining
Ability to deal with different types of attributes
Discovery of clusters with arbitrary shape
Minimal requirements for domain knowledge to
determine input parameters
Able to deal with noise and outliers
Insensitive to order of input records
Scalability to High dimensions
Interpretability and usability
Incorporation of user-specified constraints

Outline
What is Cluster Analysis?
Types of Data in Cluster Analysis
A Categorization of Major Clustering Methods
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Model-Based Clustering Methods
Supervised Classification

Data Structures
Data matrix (n objects, p variables):

    | x_11  ...  x_1f  ...  x_1p |
    | ...   ...  ...   ...  ...  |
    | x_i1  ...  x_if  ...  x_ip |
    | ...   ...  ...   ...  ...  |
    | x_n1  ...  x_nf  ...  x_np |

Dissimilarity matrix (n-by-n, lower triangular):

    |   0                          |
    | d(2,1)    0                  |
    | d(3,1)  d(3,2)    0          |
    |   :       :       :          |
    | d(n,1)  d(n,2)   ...    0    |

Measure the Quality of Clustering
Dissimilarity/Similarity metric: similarity is expressed in terms of a distance function d(i, j), which is typically a metric.
There is a separate quality function that measures the goodness of a cluster.
The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal and ratio variables.
It is hard to define "similar enough" or "good enough"; the answer is typically highly subjective.

Type of data in clustering


analysis
Nominal (Categorical)
Interval-scaled variables
Ordinal (Numerical)
Binary variables

Mixed types

Nominal (categorical)
A generalization of the binary variable in that it can
take more than 2 states, e.g., red, yellow, blue, green
Method 1: Simple matching
m: # of matches, p: total # of variables
d(i, j) = (p - m) / p
Method 2: use a large number of binary variables
creating a new binary variable for each of the M nominal states
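A minimal Python sketch (not from the source) of Method 1, the simple-matching dissimilarity; the example attribute values are made up for illustration:

def simple_matching_dissimilarity(obj_i, obj_j):
    # d(i, j) = (p - m) / p for two equal-length tuples of nominal values
    p = len(obj_i)                                       # total number of variables
    m = sum(a == b for a, b in zip(obj_i, obj_j))        # number of matches
    return (p - m) / p

# Example: two objects described by three nominal attributes (colour, shape, size)
print(simple_matching_dissimilarity(("red", "round", "small"),
                                    ("red", "square", "small")))   # 1/3 ≈ 0.33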

Interval-scaled variables
Standardize data
Calculate the mean absolute deviation:
s_f = (1/n) (|x_1f - m_f| + |x_2f - m_f| + ... + |x_nf - m_f|)
where m_f = (1/n) (x_1f + x_2f + ... + x_nf)
Calculate the standardized measurement (z-score):
z_if = (x_if - m_f) / s_f
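A small Python sketch (illustrative only) of this standardization for one variable; the sample values are made up:

def standardize(values):
    # standardize one variable f using the mean absolute deviation s_f
    n = len(values)
    m_f = sum(values) / n                                # mean
    s_f = sum(abs(x - m_f) for x in values) / n          # mean absolute deviation
    return [(x - m_f) / s_f for x in values]             # z-scores

print(standardize([2.0, 4.0, 6.0, 8.0]))   # [-1.5, -0.5, 0.5, 1.5]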

Ordinal (numerical)
An ordinal variable can be discrete or continuous
order is important, e.g., rank
Can be treated like interval-scaled
replacing x_if by their rank r_if ∈ {1, ..., M_f}
map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by
z_if = (r_if - 1) / (M_f - 1)
compute the dissimilarity using methods for interval-scaled variables

Similarity and Dissimilarity Between Objects
Distances are normally used to measure the similarity or dissimilarity between two data objects.
Some popular ones include the Minkowski distance:
d(i, j) = (|x_i1 - x_j1|^q + |x_i2 - x_j2|^q + ... + |x_ip - x_jp|^q)^(1/q)
where i = (x_i1, x_i2, ..., x_ip) and j = (x_j1, x_j2, ..., x_jp) are two p-dimensional data objects, and q is a positive integer.
If q = 1, d is the Manhattan distance:
d(i, j) = |x_i1 - x_j1| + |x_i2 - x_j2| + ... + |x_ip - x_jp|

Similarity and Dissimilarity Between Objects (Cont.)
If q = 2, d is the Euclidean distance:
d(i, j) = sqrt(|x_i1 - x_j1|^2 + |x_i2 - x_j2|^2 + ... + |x_ip - x_jp|^2)
Properties:
d(i, j) >= 0
d(i, i) = 0
d(i, j) = d(j, i)
d(i, j) <= d(i, k) + d(k, j)
One can also use a weighted distance, the parametric Pearson product-moment correlation, or other dissimilarity measures.
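A small Python sketch (illustrative only) of the Minkowski distance and its Manhattan (q = 1) and Euclidean (q = 2) special cases; the example vectors are made up:

def minkowski(i, j, q):
    # Minkowski distance of order q between two p-dimensional objects
    return sum(abs(a - b) ** q for a, b in zip(i, j)) ** (1.0 / q)

x, y = (1.0, 2.0, 3.0), (4.0, 6.0, 3.0)
print(minkowski(x, y, 1))   # Manhattan distance: 3 + 4 + 0 = 7.0
print(minkowski(x, y, 2))   # Euclidean distance: sqrt(9 + 16 + 0) = 5.0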

Binary Variables
A contingency table for binary data (object i vs. object j):

                 Object j
                 1        0        sum
Object i   1     a        b        a + b
           0     c        d        c + d
           sum   a + c    b + d    p

Simple matching coefficient (invariant, if the binary variable is symmetric):
d(i, j) = (b + c) / (a + b + c + d)

Jaccard coefficient (noninvariant if the binary variable is asymmetric):
d(i, j) = (b + c) / (a + b + c)

Dissimilarity between Binary Variables
Example

Name   Gender  Fever  Cough  Test-1  Test-2  Test-3  Test-4
Jack   M       Y      N      P       N       N       N
Mary   F       Y      N      P       N       P       N
Jim    M       Y      P      N       N       N       N

gender is a symmetric attribute
the remaining attributes are asymmetric binary
let the values Y and P be set to 1, and the value N be set to 0

d(jack, mary) = (0 + 1) / (2 + 0 + 1) = 0.33
d(jack, jim)  = (1 + 1) / (1 + 1 + 1) = 0.67
d(jim, mary)  = (1 + 2) / (1 + 1 + 2) = 0.75
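The example above can be checked with a short Python sketch (illustrative, not from the source); Y/P are coded as 1 and N as 0, using only the asymmetric attributes:

def jaccard_dissimilarity(i, j):
    # d(i, j) = (b + c) / (a + b + c) for asymmetric binary vectors (1 = present)
    a = sum(x == 1 and y == 1 for x, y in zip(i, j))
    b = sum(x == 1 and y == 0 for x, y in zip(i, j))
    c = sum(x == 0 and y == 1 for x, y in zip(i, j))
    return (b + c) / (a + b + c)

# asymmetric attributes only: Fever, Cough, Test-1, Test-2, Test-3, Test-4
jack = (1, 0, 1, 0, 0, 0)
mary = (1, 0, 1, 0, 1, 0)
jim  = (1, 1, 0, 0, 0, 0)
print(round(jaccard_dissimilarity(jack, mary), 2))   # 0.33
print(round(jaccard_dissimilarity(jack, jim), 2))    # 0.67
print(round(jaccard_dissimilarity(jim, mary), 2))    # 0.75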

Variables of Mixed Types
A database may contain all six types of variables: symmetric binary, asymmetric binary, nominal, ordinal, interval and ratio.
One may use a weighted formula to combine their effects:
d(i, j) = Σ_{f=1..p} δ_ij^(f) d_ij^(f) / Σ_{f=1..p} δ_ij^(f)
f is binary or nominal: d_ij^(f) = 0 if x_if = x_jf, or d_ij^(f) = 1 otherwise
f is interval-based: use the normalized distance
f is ordinal: compute ranks r_if, set z_if = (r_if - 1) / (M_f - 1), and treat z_if as interval-scaled

Clustering of genomic data sets

Clustering of gene expression


data sets

Clustering of synthetic mutant


lethality data sets

Clustering applied to yeast data sets

Clustering the yeast genes in response to environmental


changes

Clustering the cell cycle-regulated yeast genes

Functional analysis of the yeast genome : Finding gene


functions - Functional prediction

Software clustering

Group system files such that files with similar functionality are in the
same cluster, while files in different clusters perform dissimilar
functions.
Each object is a file x of the software system.

Both categorical and numerical data sets:
Categorical data set on a software system: for each file, which other files it may invoke during runtime.
After the filename x there is a list of the other filenames that x may invoke.
Numerical data set on a software system: the results of profiling the execution of the system, i.e., how many times each file invoked other files during run time.
After the filename x there is a list of the other filenames that x invoked, and how many times x invoked them during run time.

Outline
What is Cluster Analysis?
Types of Data in Cluster Analysis
A Categorization of Major Clustering Methods
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Model-Based Clustering Methods
Supervised Classification

Major Clustering Approaches


Partitioning algorithms: Construct various partitions and then evaluate
them by some criterion
Hierarchical algorithms: Create a hierarchical decomposition of the set
of data (or objects) using some criterion
Density-based: based on connectivity and density functions
Grid-based: based on a multiple-level granularity structure
Model-based: A model is hypothesized for each of the clusters and the
idea is to find the best fit of the data to the given model
Unsupervised vs. Supervised: clustering may or may not be based on
prior knowledge of the correct classification.

Outline
What is Cluster Analysis?
Types of Data in Cluster Analysis
A Categorization of Major Clustering Methods
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Model-Based Clustering Methods
Supervised Classification

Partitioning Algorithms: Basic Concept


Partitioning method: Construct a partition of a
database D of n objects into a set of k clusters
Given a k, find a partition of k clusters that optimizes
the chosen partitioning criterion
Heuristic methods: k-means and k-medoids algorithms
k-means (MacQueen67): Each cluster is represented by the
center of the cluster
k-medoids or PAM (Partition around medoids) (Kaufman &
Rousseeuw87): Each cluster is represented by one of the
objects in the cluster

The k-Means Clustering Method


Given k, the k-Means algorithm is
implemented in 4 steps:
Partition objects into k nonempty subsets
Compute seed points as the centroids of the
clusters of the current partition. The centroid is the
center (mean point) of the cluster.
Assign each object to the cluster with the nearest
seed point.
Go back to Step 2; stop when there are no more new
assignments.
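A minimal Python sketch of these steps (an illustration, not a reference implementation); it seeds the clusters with k randomly chosen objects rather than a random partition, and works on lists of numeric tuples:

import random

def dist2(p, q):
    # squared Euclidean distance between two points
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(pts):
    # component-wise mean of a non-empty list of points
    return tuple(sum(c) / len(pts) for c in zip(*pts))

def k_means(points, k, max_iter=100):
    centroids = random.sample(points, k)          # seed with k random objects
    labels = []
    for _ in range(max_iter):
        # assign each object to the cluster with the nearest seed point
        labels = [min(range(k), key=lambda c: dist2(p, centroids[c])) for p in points]
        # recompute each centroid as the mean point of its cluster
        new_centroids = []
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            new_centroids.append(mean(members) if members else centroids[c])
        if new_centroids == centroids:            # stop when no assignment changes the means
            break
        centroids = new_centroids
    return centroids, labels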

The K-Means Clustering Method


Example
(Figure: k-means iterations on a 2-D example data set, showing repeated reassignment of objects and recomputation of centroids.)

Comments on the K-Means


Method
Strength
Relatively efficient: O(tkn), where n is # objects, k is #
clusters, and t is # iterations. Normally, k, t << n.
Often terminates at a local optimum. The global optimum
may be found using techniques such as: deterministic
annealing and genetic algorithms

Weakness
Applicable only when mean is defined, then what about
categorical data?
Need to specify k, the number of clusters, in advance
Unable to handle noisy data and outliers
Not suitable to discover clusters with non-convex shapes

Variations of the K-Means Method


A few variants of the k-means which differ in:
Selection of the initial k means
Dissimilarity calculations
Strategies to calculate cluster means

The K-Medoids Clustering Method


Find representative objects, called medoids, in clusters
PAM (Partitioning Around Medoids, 1987)
starts from an initial set of medoids and iteratively replaces one of
the medoids by one of the non-medoids if it improves the total
distance of the resulting clustering
PAM works effectively for small data sets, but does not scale well
for large data sets

CLARA (Kaufmann & Rousseeuw, 1990)


CLARANS (Ng & Han, 1994): Randomized sampling

PAM (Partitioning Around Medoids)


(1987)
PAM (Kaufman and Rousseeuw, 1987), built in Splus
Use real object to represent the cluster
Select k representative objects arbitrarily
For each pair of non-selected object h and selected object i,
calculate the total swapping cost TCih
For each pair of i and h,
If TCih < 0, i is replaced by h
Then assign each non-selected object to the most similar
representative object
repeat steps 2-3 until there is no change

PAM Clustering: Total swapping cost
TC_ih = Σ_j C_jih

(Figure: the four cases for the contribution C_jih of an object j when medoid i is swapped with non-medoid h; t denotes another current medoid of j.)
j is reassigned from i to h:    C_jih = d(j, h) - d(j, i)
j stays with its medoid t:      C_jih = 0
j is reassigned from i to t:    C_jih = d(j, t) - d(j, i)
j is reassigned from t to h:    C_jih = d(j, h) - d(j, t)

CLARA (Clustering Large


Applications) (1990)

CLARA (Kaufmann and Rousseeuw in 1990)

It draws a sample of the data set, applies PAM on the sample, and
gives the best clustering as the output.

Strength: deals with larger data sets than PAM


Weakness:
Efficiency depends on the sample size
A good clustering based on samples will not necessarily
represent a good clustering of the whole data set if the sample is
biased

CLARANS (Randomized CLARA)


(1994)
CLARANS (A Clustering Algorithm based on Randomized
Search) (Ng and Han94)
CLARANS draws multiple samples dynamically.
A different sample can be chosen at each loop.
The clustering process can be presented as searching a
graph where every node is a potential solution, that is, a
set of k medoids
It is more efficient and scalable than both PAM and
CLARA

Squeezer Single linkage clustering


Not the most effective and accurate clustering algorithm that
exists, but it is efficient as it has a complexity of O(n)
where n is the number of data objects [Portnoy01].
1) Initialize the set of clusters, S, to the empty set.
2) Obtain an object d from the data set. If S is empty, then
create a cluster with d and add it to S. Otherwise, find
the cluster in S that is closest to this object. In other
words, find the closest cluster C to d in S.
3) If the distance between d and C is less than or equal to
a user specified threshold W then associate d with the
cluster C. Else, create a new cluster for d in S.
4) Repeat steps 2 and 3 until no objects are left in the
data set.
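A rough Python sketch of steps 1 to 4 (illustrative only; the distance from an object to a cluster is taken here as the distance to the cluster's first member, since the slide does not fix that choice):

def squeezer(objects, distance, threshold_w):
    clusters = []                                   # step 1: S starts empty
    for d in objects:                               # steps 2 and 4: scan every object once
        if not clusters:
            clusters.append([d])                    # the first object founds the first cluster
            continue
        # find the closest existing cluster C to d (distance to a representative member is an assumption)
        best = min(clusters, key=lambda c: distance(d, c[0]))
        if distance(d, best[0]) <= threshold_w:     # step 3: close enough, so associate d with C
            best.append(d)
        else:
            clusters.append([d])                    # otherwise create a new cluster for d
    return clusters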

Fuzzy k-Means

Clusters produced by k-Means: "hard" or "crisp" clusters


since any feature vector x either is or is not a member of a
particular cluster.

In contrast to "soft" or "fuzzy" clusters


a feature vector x can have a degree of membership in each
cluster.

The fuzzy-k-means procedure of Bezdek [Bezdek81,


Dembele03, Gasch02] allows each feature vector x to
have a degree of membership in Cluster i

Fuzzy k-Means

Fuzzy k-Means algorithm

Choose the number of classes k, with 1<k<n.


Choose a value for the fuzziness exponent f, with f>1.
Choose a definition of distance in the variable-space.
Choose a value for the stopping criterion e (e= 0.001 gives reasonable
convergence).
Make initial guesses for the means m1, m2,..., mk
Until there are no changes in any mean:
Use the estimated means to find the degree of membership u(j,i) of xj
in Cluster i. For example, if a(j,i) = exp(- || xj - mi ||2 ), one
might use u(j,i) = a(j,i) / sum_j a(j,i).
For i from 1 to k
Replace mi with the fuzzy mean of all of the examples for Cluster i
end_for
end_until
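A small Python sketch of the two update steps (illustrative; it uses the Gaussian affinity a(j,i) = exp(-||xj - mi||^2) mentioned above and normalises each object's affinities over the k clusters so its memberships sum to one, which is one reading of the normalisation step):

import math

def memberships(points, means):
    # degree of membership u(j, i) of point x_j in cluster i
    u = []
    for x in points:
        a = [math.exp(-sum((xc - mc) ** 2 for xc, mc in zip(x, m))) for m in means]
        total = sum(a)
        u.append([ai / total for ai in a])
    return u

def fuzzy_means(points, u, f=2.0):
    # replace each mean m_i with the fuzzy (membership-weighted) mean, fuzziness exponent f
    k, dim = len(u[0]), len(points[0])
    means = []
    for i in range(k):
        w = [u[j][i] ** f for j in range(len(points))]
        means.append([sum(wj * points[j][d] for j, wj in enumerate(w)) / sum(w)
                      for d in range(dim)])
    return means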

K-Modes for categorical data


(Huang98)
Variation of the K-Means Method
Replacing means of clusters with modes
Using new dissimilarity measures to deal with
categorical objects
Using a frequency-based method to update modes of
clusters

A mixture of categorical and numerical data: k-prototypes method

K-Modes algorithm
K-Modes deals with categorical attributes.
Insert the first K objects into K new clusters.
Calculate the initial K modes for K clusters.
Repeat {
For (each object O) {
Calculate the similarity between object O and the modes of all clusters.
Insert object O into the cluster C whose mode is the least dissimilar to object O.
Recalculate the cluster modes so that the cluster similarity between mode and objects is maximized.
} until (no or few objects change clusters).
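A compact Python sketch of these steps (illustrative only; the dissimilarity is the simple-matching count of mismatched attribute values, and empty clusters keep their previous mode):

from collections import Counter

def mode_of(cluster):
    # column-wise most frequent value of the categorical objects in a cluster
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*cluster))

def mismatches(a, b):
    # simple-matching dissimilarity: number of attributes on which a and b differ
    return sum(x != y for x, y in zip(a, b))

def k_modes(objects, k, max_iter=20):
    k = min(k, len(objects))
    clusters = [[o] for o in objects[:k]]             # insert the first K objects
    modes = [mode_of(c) for c in clusters]            # calculate the initial K modes
    for o in objects[k:]:                             # allocate the remaining objects
        best = min(range(k), key=lambda c: mismatches(o, modes[c]))
        clusters[best].append(o)
        modes[best] = mode_of(clusters[best])
    for _ in range(max_iter):                         # repeat until no object changes clusters
        new = [[] for _ in range(k)]
        for o in objects:
            new[min(range(k), key=lambda c: mismatches(o, modes[c]))].append(o)
        if new == clusters:
            break
        clusters = new
        modes = [mode_of(c) if c else modes[i] for i, c in enumerate(clusters)]
    return clusters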

Fuzzy K-Modes
The fuzzy k-modes algorithm contains extensions to the
fuzzy k-means algorithm for clustering categorical data.

Bunch

Bunch is a clustering tool intended to aid the software


developer and maintainer in understanding and
maintaining source code [Mancoridis99].
Input : Module Dependency Graph (MDG).

Bunch "good partition": highly interdependent modules are grouped in the same subsystems (clusters).
Independent modules are assigned to separate subsystems.

Figure b shows a good partitioning of Figure a.


Finding a good graph partition involves:
systematically navigating through a very large search space of all
possible partitions for that graph.

Major Clustering Approaches


Partitioning algorithms: Construct various partitions and then evaluate
them by some criterion
Hierarchical algorithms: Create a hierarchical decomposition of the set
of data (or objects) using some criterion
Density-based: based on connectivity and density functions
Grid-based: based on a multiple-level granularity structure
Model-based: A model is hypothesized for each of the clusters and the
idea is to find the best fit of the data to the given model
Unsupervised vs. Supervised: clustering may or may not be based on
prior knowledge of the correct classification.

Hierarchical Clustering
Use distance matrix as clustering criteria. This
method does not require the number of clusters
k as an input, but needs a termination condition
(Figure: agglomerative clustering (AGNES) proceeds from step 0 to step 4, merging the singletons a, b, c, d, e into {a,b}, {d,e}, {c,d,e} and finally {a,b,c,d,e}; divisive clustering (DIANA) follows the same levels in reverse, from step 4 down to step 0.)

AGNES (Agglomerative
Nesting)
Introduced in Kaufmann and Rousseeuw (1990)
Merge nodes that have the least dissimilarity
Go on in a non-descending fashion
Eventually all nodes belong to the same cluster
(Figure: AGNES on a 2-D example data set; nearby clusters are merged step by step.)

A Dendrogram Shows How the


Clusters are Merged Hierarchically
Decompose data objects into several levels of nested
partitioning (tree of clusters), called a dendrogram.
A clustering of the data objects is obtained by cutting the
dendrogram at the desired level, then each connected
component forms a cluster.

DIANA (Divisive Analysis)


Introduced in Kaufmann and Rousseeuw (1990)
Inverse order of AGNES
Eventually each node forms a cluster on its own

(Figure: DIANA on the same 2-D example data set; the single initial cluster is split step by step.)

More on Hierarchical Clustering


Methods
Weaknesses of agglomerative clustering methods
do not scale well: time complexity of at least O(n2),
where n is the number of total objects.
can never undo what was done previously.

More on Hierarchical Clustering


Methods
Next..
Integration of hierarchical with distancebased clustering
BIRCH (1996): uses CF-tree and incrementally
adjusts the quality of sub-clusters
CURE (1998): selects well-scattered points from
the cluster and then shrinks them towards the
center of the cluster by a specified fraction
CHAMELEON (1999): hierarchical clustering
using dynamic modeling

BIRCH (1996)
Birch: Balanced Iterative Reducing and Clustering using
Hierarchies, by Zhang, Ramakrishnan, Livny (SIGMOD96)
Incrementally construct a CF (Clustering Feature) tree, a
hierarchical data structure for multiphase clustering
Phase 1: scan DB to build an initial in-memory CF tree (a multi-level
compression of the data that tries to preserve the inherent clustering
structure of the data)
Phase 2: use an arbitrary clustering algorithm to cluster the leaf
nodes of the CF-tree

Clustering Feature
Vector
Clustering Feature: CF = (N, LS, SS)
N: number of data points
LS: Σ_{i=1..N} X_i (linear sum of the points)
SS: Σ_{i=1..N} X_i^2 (square sum of the points)
Example: for the five points (3,4), (2,6), (4,5), (4,7), (3,8),
CF = (5, (16,30), (54,190))
CF Tree - A nonleaf node in this tree contains summaries of the CFs of its
children. A CF tree is a multilevel summary of the data that preserves the
inherent structure of the data.

(Figure: a CF tree with leaf capacity L = 6. The root and non-leaf nodes hold CF entries CF1, CF2, ..., each with a pointer to a child node; leaf nodes hold CF entries and are chained together by prev/next pointers.)

BIRCH (1996)

Scales linearly: finds a good clustering with a single scan and


improves the quality with a few additional scans

Weakness: handles only numeric data, and sensitive to the


order of the data record.

CURE (Clustering Using


REpresentatives)

CURE: proposed by Guha, Rastogi & Shim, 1998


CURE goes a step beyond BIRCH by not favoring clusters
with spherical shape thus being able to discover clusters
with arbitrary shape. CURE is also more robust with
respect to outliers [Guha98].
Uses multiple representative points to evaluate the
distance between clusters, adjusts well to arbitrary shaped
clusters and avoids single-link effect.

Drawbacks of distance-based Methods

Consider only one point as representative of a cluster
Good only for convex-shaped clusters of similar size and density, and only if k can be reasonably estimated

CURE: The Algorithm

Draw random sample s.

Partition sample to p partitions with size s/p. Each partition


is a partial cluster.

Eliminate outliers by random sampling. If a cluster grows too slowly, eliminate it.
Cluster the partial clusters.
The representative points falling in each new cluster are shrunk, or moved toward the cluster center, by a user-specified shrinking factor.
These objects then represent the shape of the newly formed cluster.

Data Partitioning and Clustering

Figure 11 Clustering a
set of objects using
CURE. (a) A random
sample of objects. (b)
Partial clusters.
Representative points
for each cluster are
marked with a +. (c)
The partial clusters are
further clustered. The
representative points
are moved toward the
cluster center. (d) The
final clusters are
nonspherical.
Cure: Shrinking Representative Points
Shrink the multiple representative points towards the gravity center by a fraction α.
Multiple representatives capture the shape of the cluster.

Clustering Categorical Data:


ROCK
ROCK: Robust Clustering using linKs,
by S. Guha, R. Rastogi, K. Shim (ICDE99).

Use links to measure similarity/proximity


Computational complexity (worst-case cubic): O(n^2 + n m_m m_a + n^2 log n), where m_a and m_m are the average and maximum number of neighbours of a point

Rock: Algorithm
Links: The number of common neighbours for
the two points.

Example: the ten point sets {1,2,3}, {1,2,4}, {1,2,5}, {1,3,4}, {1,3,5}, {1,4,5}, {2,3,4}, {2,3,5}, {2,4,5}, {3,4,5}.
For instance, {1,2,3} and {1,2,4} share many common neighbours, so the link count between them is high.

Initially, each tuple is assigned to a separate cluster and then


clusters are merged repeatedly according to the closeness
between clusters.
The closeness between clusters is defined as the sum of the
number of links between all pairs of tuples, where the
number of links represents the number of common neighbors
between two clusters.
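A short Python sketch of link counting (illustrative; it assumes Jaccard similarity with a neighbour threshold θ = 0.5, a common choice in ROCK examples, and counts a point as its own neighbour):

def neighbours(points, sim, theta):
    # nbr(p): the points whose similarity to p is at least theta (p counts as its own neighbour)
    return {p: {q for q in points if sim(p, q) >= theta} for p in points}

def links(p, q, nbrs):
    # link(p, q): the number of common neighbours of p and q
    return len(nbrs[p] & nbrs[q])

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

pts = [frozenset(s) for s in ({1,2,3}, {1,2,4}, {1,2,5}, {1,3,4}, {1,3,5},
                              {1,4,5}, {2,3,4}, {2,3,5}, {2,4,5}, {3,4,5})]
nbrs = neighbours(pts, jaccard, 0.5)
print(links(frozenset({1,2,3}), frozenset({1,2,4}), nbrs))   # 5 common neighbours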

CHAMELEON
CHAMELEON: hierarchical clustering using
dynamic modeling, by G. Karypis, E.H. Han
and V. Kumar99
Measures the similarity based on a dynamic
model
Two clusters are merged only if the
interconnectivity and closeness (proximity)
between two clusters are high relative to the
internal interconnectivity of the clusters and
closeness of items within the clusters

CHAMELEON
A two phase algorithm
Use a graph partitioning algorithm: cluster objects into a
large number of relatively small sub-clusters
Use an agglomerative hierarchical clustering algorithm:
find the genuine clusters by repeatedly combining these
sub-clusters

Overall Framework of
CHAMELEON
Data Set → Construct Sparse Graph → Partition the Graph → Merge Partitions → Final Clusters

Eisens hierarchical clustering of


gene expression data
The hierarchical clustering algorithm by Eisen et al. [Eisen98, Eisen99]
commonly used for cancer clustering
clustering of genomic (gene expression) data sets in general.

For a set of n genes, compute an upper-diagonal similarity matrix


containing similarity scores for all pairs of genes
use a similarity metric

Scan to find the highest value, representing the pair of genes with the
most similar interaction patterns
The two most similar genes are grouped in a cluster
the similarity matrix is recomputed, using the average properties of both or all
genes in the cluster

More genes are progressively added to the initial pairs to form clusters
of genes [Eisen98]
process is repeated until all genes have been grouped into clusters

LIMBO
LIMBO, introduced in [Andritsos04], is a scalable hierarchical categorical clustering algorithm that builds on the Information Bottleneck (IB) framework for quantifying the relevant information preserved when clustering.
LIMBO uses the IB framework to define a distance measure for categorical tuples.
LIMBO handles large data sets by producing a memory-bounded summary model for the data.

Outline
What is Cluster Analysis?
Types of Data in Cluster Analysis
A Categorization of Major Clustering Methods
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Model-Based Clustering Methods
Supervised Classification

Major Clustering Approaches


Partitioning algorithms: Construct various partitions and then evaluate
them by some criterion
Hierarchical algorithms: Create a hierarchical decomposition of the set
of data (or objects) using some criterion
Density-based: based on connectivity and density functions
Grid-based: based on a multiple-level granularity structure
Model-based: A model is hypothesized for each of the clusters and the
idea is to find the best fit of the data to the given model
Unsupervised vs. Supervised: clustering may or may not be based on
prior knowledge of the correct classification.

Density-Based Clustering Methods


Clustering based on density (local cluster
criterion), such as density-connected points
Major features:
Discover clusters of arbitrary shape
Handle noise
Need user-specified parameters
One scan

Density-Based Clustering:
Background
Two parameters:
Eps: Maximum radius of the neighbourhood
MinPts: Minimum number of points in an Epsneighbourhood of that point

(Figure: points p and q with MinPts = 5 and Eps = 1 cm, illustrating the Eps-neighbourhood and direct density-reachability.)

Density-Based Clustering:
Background (II)
Density-reachable:

A point p is density-reachable from a


point q wrt. Eps, MinPts if there is a
chain of points p1, , pn, p1 = q, pn =
p such that pi+1 is directly densityreachable from pi

Density-connected:
A point p is density-connected to a
point q wrt. Eps, MinPts if there is a
point o such that both, p and q are
density-reachable from o wrt. Eps
and MinPts.

(Figure: a chain q, p1, ..., p illustrating density-reachability, and a point o from which both p and q are density-reachable, illustrating density-connectivity.)

DBSCAN: Density Based Spatial


Clustering of Applications with Noise
Relies on a density-based notion of cluster: A
cluster is defined as a maximal set of densityconnected points
Discovers clusters of arbitrary shape in spatial
databases with noise
(Figure: core, border and outlier points of a cluster; Eps = 1 cm, MinPts = 5.)

DBSCAN: The Algorithm


Check the ε-neighborhood of each object in the database.
If the ε-neighborhood of an object o contains more than MinPts objects, a new cluster with o as a core object is created.
Iteratively collect directly density-reachable objects from these core objects, which may involve merging a few density-reachable clusters.
Terminate the process when no new object can be added to any cluster.
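A compact Python sketch of these steps (illustrative, not the original formulation); points are hashable tuples, dist is any distance function, and -1 marks noise:

def dbscan(points, eps, min_pts, dist):
    labels = {p: None for p in points}          # None = unvisited, -1 = noise, 0.. = cluster id
    cluster_id = -1
    for p in points:
        if labels[p] is not None:
            continue
        nbrs = [q for q in points if dist(p, q) <= eps]
        if len(nbrs) < min_pts:                 # not a core object (may still become a border point)
            labels[p] = -1
            continue
        cluster_id += 1                         # new cluster with p as a core object
        labels[p] = cluster_id
        seeds = list(nbrs)
        while seeds:                            # iteratively collect density-reachable objects
            q = seeds.pop()
            if labels[q] == -1:
                labels[q] = cluster_id          # former noise becomes a border point
            if labels[q] is not None:
                continue
            labels[q] = cluster_id
            q_nbrs = [r for r in points if dist(q, r) <= eps]
            if len(q_nbrs) >= min_pts:          # q is itself a core object: expand from it too
                seeds.extend(q_nbrs)
    return labels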

OPTICS: A Cluster-Ordering
Method (1999)
OPTICS: Ordering Points To Identify the
Clustering Structure
Ankerst, Breunig, Kriegel, and Sander (SIGMOD99)
Produces a special order of the database wrt its
density-based clustering structure
Good for both automatic and interactive cluster
analysis, including finding intrinsic clustering structure

OPTICS: An Extension from DBSCAN


OPTICS was developed to overcome
the difficulty of selecting appropriate
parameter values for DBSCAN
[Ankerst99].
The OPTICS algorithm finds clusters
using the following steps:
1) Create an ordering of the objects in a
database, storing the core-distance and
a suitable reachability-distance for each
object. Clusters with highest density will
be finished first.
2) Based on the ordering information
produced by OPTICS, use another
algorithm to extract clusters.
3) Extract density-based clusters with
respect to any distance e that is smaller
than the distance e used in generating
the order.

Figure 15a: OPTICS. The core-distance of p is the distance ε' between p and the fourth-closest object. The reachability-distance of q1 with respect to p is the core-distance of p (ε' = 3 mm), since this is greater than the distance between p and q1. The reachability-distance of q2 with respect to p is the distance between p and q2, since this is greater than the core-distance of p (ε' = 3 mm). Adopted from [Ankerst99].

(Figure: reachability plot; x-axis: cluster-order of the objects, y-axis: reachability-distance.)
DENCLUE: using density functions


DENsity-based CLUstEring by Hinneburg & Keim
(KDD98)
Major features:
Solid mathematical foundation
Good for data sets with large amounts of noise
Allows a compact mathematical description of arbitrarily
shaped clusters in high-dimensional data sets
Significantly faster than existing algorithms (faster than
DBSCAN by a factor of up to 45)
But needs a large number of parameters

DENCLUE: Technical Essence


Influence function: describes the impact of a data point
within its neighborhood.
Overall density of the data space can be calculated as the
sum of the influence function of all data points.
Clusters can be determined mathematically by identifying
density attractors.
Density attractors are local maxima of the overall density function.

(Figures: a density attractor; center-defined clusters and arbitrary-shape clusters.)

CACTUS Categorical Clustering


CACTUS is presented in [Ganti99].
Distinguishing sets: clusters are uniquely identified by a
core set of attribute values that occur in no other cluster.
A distinguishing number represents the minimum size of
the distinguishing sets i.e. attribute value sets that
uniquely occur within only one cluster.
While this assumption may hold true for many real world
datasets, it is unnatural and unnecessary for the
clustering process.

COOLCAT Categorical Clustering

COOLCAT is introduced in [Barbara02] as an entropy-based


algorithm for categorical clustering.

COOLCAT starts with a sample of data objects


identifies a set of k initial tuples such that the minimum pairwise
distance among them is maximized.

All remaining tuples of the data set are placed in one of the
clusters such that
at each step, the increase in the entropy of the resulting clustering is
minimized.

CLOPE Categorical Clustering

Let's take a small market basket database with 5 transactions {(apple, banana),
(apple, banana, cake), (apple, cake, dish), (dish, egg), (dish, egg, fish)}.
For simplicity, transaction (apple, banana) is abbreviated to ab, etc.

For this small database, we want to compare the following two clusterings:
(1) { {ab, abc, acd}, {de, def} } and (2) { {ab, abc}, {acd, de, def} }.

Cluster {ab, abc, acd}: H = 2.0, W = 4      Cluster {de, def}: H = 1.67, W = 3      (clustering (1))
Cluster {ab, abc}: H = 1.67, W = 3          Cluster {acd, de, def}: H = 1.6, W = 5  (clustering (2))
(Histograms of the two clusterings. Adopted from [Yang2002].)

We judge the qualities of these two clusterings, by analyzing the heights and widths of the
clusters. Leaving out the two identical histograms for cluster {de, def} and cluster {ab, abc},
the other two histograms are of different quality.

The histogram for cluster {ab, abc, acd} has H/W=0.5, but the one for cluster {acd, de, def}
has H/W=0.32.

Clustering (1) is better since we prefer more overlapping among transactions in the same cluster.
Thus, a larger height-to-width ratio of the histogram means better intra-cluster similarity.
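A tiny Python sketch (illustrative only) that reproduces the H and W values above for the two non-identical clusters:

from collections import Counter

def histogram_quality(cluster):
    # height H, width W, and H/W for one cluster of transactions (sets of items)
    occ = Counter(item for txn in cluster for item in txn)
    size = sum(occ.values())              # total item occurrences S
    width = len(occ)                      # number of distinct items W
    height = size / width                 # H = S / W
    return height, width, height / width

print(histogram_quality([{"a", "b"}, {"a", "b", "c"}, {"a", "c", "d"}]))
# (2.0, 4, 0.5)
print(histogram_quality([{"a", "c", "d"}, {"d", "e"}, {"d", "e", "f"}]))
# (1.6, 5, 0.32)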

Outline
What is Cluster Analysis?
Types of Data in Cluster Analysis
A Categorization of Major Clustering Methods
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Model-Based Clustering Methods
Supervised Classification

Major Clustering Approaches


Partitioning algorithms: Construct various partitions and then evaluate
them by some criterion
Hierarchical algorithms: Create a hierarchical decomposition of the set
of data (or objects) using some criterion
Density-based: based on connectivity and density functions
Grid-based: based on a multiple-level granularity structure
Model-based: A model is hypothesized for each of the clusters and the
idea is to find the best fit of the data to the given model
Unsupervised vs. Supervised: clustering may or may not be based on
prior knowledge of the correct classification.

Grid-Based Clustering Method


Using multi-resolution grid data structure
Several interesting methods
STING (a STatistical INformation Grid approach)
by Wang, Yang and Muntz (1997)
WaveCluster by Sheikholeslami, Chatterjee, and
Zhang (VLDB98)
A multi-resolution clustering approach using
wavelet method
CLIQUE: Agrawal, et al. (SIGMOD98)

STING: A Statistical
Information Grid Approach
Wang, Yang and Muntz (VLDB97)
The spatial area is divided into rectangular cells
There are several levels of cells corresponding to
different levels of resolution

STING: A Statistical Information


Grid Approach
Each cell at a high level is partitioned into a number of
smaller cells in the next lower level
Statistical info of each cell is calculated and stored
beforehand and is used to answer queries
Parameters of higher level cells can be easily calculated
from parameters of lower level cell
count, mean, s, min, max
type of distribution: normal, uniform, etc.
Use a top-down approach to answer spatial data queries
Start from a pre-selected layer, typically one with a small number of cells

STING: A Statistical Information


Grid Approach
When finished examining the current layer, proceed to
the next lower level
Repeat this process until the bottom layer is reached
Advantages:
O(K), where K is the number of grid cells at the
lowest level
Disadvantages:
All the cluster boundaries are either horizontal or
vertical, and no diagonal boundary is detected

WaveCluster (1998)
Sheikholeslami, Chatterjee, and Zhang (VLDB98)
A multi-resolution clustering approach which
applies wavelet transform to the feature space
A wavelet transform is a signal processing technique that
decomposes a signal into different frequency sub-band.

Input parameters:
the wavelet, and the # of applications of wavelet transform.

WaveCluster (1998)
How to apply wavelet transform to find clusters
Summarize the data by imposing a multidimensional grid structure onto the data space
These multidimensional spatial data objects are represented in an n-dimensional feature space
Apply the wavelet transform on the feature space to find the dense regions in the feature space
Apply the wavelet transform multiple times, which results in clusters at different scales, from fine to coarse

What Is Wavelet (2)?

WaveCluster (1998)
Why is wavelet transformation useful for
clustering
Unsupervised clustering
It uses hat-shaped filters to emphasize regions where points cluster, while simultaneously suppressing weaker information at their boundaries
Effective removal and detection of outliers
Multi-resolution
Cost efficiency

Major features:

Complexity O(N)
Detect arbitrary shaped clusters at different scales
Not sensitive to noise, not sensitive to input order
Only applicable to low dimensional data

(Figures: quantization of the feature space and its wavelet transformation.)

CLIQUE (CLustering In QUEst)


Agrawal, Gehrke, Gunopulos, Raghavan
(SIGMOD98).
CLIQUE can be considered as both density-based and grid-based.
It partitions each dimension into the same number of equal-length intervals
It partitions an m-dimensional data space into non-overlapping rectangular units
A unit is dense if the fraction of total data points contained in the unit exceeds the input model parameter
A cluster is a maximal set of connected dense units within a subspace

(Figure: CLIQUE example over the dimensions age, salary (10,000) and vacation (week), with density threshold τ = 3; dense units found in the (age, salary) and (age, vacation) subspaces are intersected to locate candidate dense regions in the 3-D subspace.)

Strength and Weakness of CLIQUE


Strengths
It is insensitive to the order of records in input
and does not presume some canonical data
distribution
It scales linearly with the size of input and has
good scalability as the number of dimensions in
the data increases

Weakness
The accuracy of the clustering result may be degraded at the expense of the simplicity of the method

Outline
What is Cluster Analysis?
Types of Data in Cluster Analysis
A Categorization of Major Clustering Methods
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Model-Based Clustering Methods
Supervised Classification

Major Clustering Approaches


Partitioning algorithms: Construct various partitions and then evaluate
them by some criterion
Hierarchical algorithms: Create a hierarchical decomposition of the set
of data (or objects) using some criterion
Density-based: based on connectivity and density functions
Grid-based: based on a multiple-level granularity structure
Model-based: A model is hypothesized for each of the clusters and the
idea is to find the best fit of the data to the given model
Unsupervised vs. Supervised: clustering may or may not be based on
prior knowledge of the correct classification.

Model-Based Clustering Methods


Attempt to optimize the fit between the data and some
mathematical model
Statistical and AI approach
Conceptual clustering
A form of clustering in machine learning
Produces a classification scheme for a set of unlabeled objects
Finds characteristic description for each concept (class)

COBWEB (Fisher87)
A popular and simple method of incremental conceptual learning
Creates a hierarchical clustering in the form of a classification tree
Each node refers to a concept and contains a probabilistic description of that concept

COBWEB Clustering
Method
A classification tree

More on COBWEB Clustering


Limitations of COBWEB
The assumption that the attributes are
independent of each other is often too strong
because correlation may exist
Not suitable for clustering large database data: skewed tree and expensive probability distributions

AutoClass (Cheeseman and Stutz, 1996)

E is the set of attributes of a data item, given by the data set. For example, if each data item is a coin, the evidence E for coin i might be Ei = {"land tail"}, meaning that in one trial the coin landed tail. With more attributes, the evidence might be Ei = {"land tail", "land tail", "land head"}, meaning that in 3 separate trials the coin landed tail, tail and head.

H is a hypothesis about the classification of a data item. For example, H might state that coin i belongs in the class "two-headed coin". We usually do not know the H for a data set; thus, AutoClass tests many hypotheses.

AutoClass uses a Bayesian method for determining the optimal class H for each object:

π(H | E) = L(E | H) π(H) / π(E) = L(E | H) π(H) / Σ_H L(E | H) π(H)

A prior distribution for each attribute symbolizes the prior beliefs of the user about the attribute.
The classifications of items in clusters, and the means and variances of the distributions in each cluster, are changed until the means and variances stabilize.
A normal distribution is used for a numerical attribute in a cluster.

Neural Networks and SelfOrganizing Maps


Neural network approaches
Represent each cluster as an exemplar, acting as a
prototype of the cluster
New objects are distributed to the cluster whose
exemplar is the most similar according to some
distance measure

Competitive learning
Involves a hierarchical architecture of several units
(neurons)
Neurons compete in a winner-takes-all fashion for
the object currently being presented

Self-organizing feature maps (SOMs)


type of neural network

Several units (clusters) compete for the current object


The unit whose weight vector is closest to the current object wins
The winner and its neighbors learn by having their weights adjusted
SOMs are believed to resemble processing that can occur in the brain

Outline
What is Cluster Analysis?
Types of Data in Cluster Analysis
A Categorization of Major Clustering Methods
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Model-Based Clustering Methods
Supervised Classification

Major Clustering Approaches


Partitioning algorithms: Construct various partitions and then evaluate
them by some criterion
Hierarchical algorithms: Create a hierarchical decomposition of the set
of data (or objects) using some criterion
Density-based: based on connectivity and density functions
Grid-based: based on a multiple-level granularity structure
Model-based: A model is hypothesized for each of the clusters and the
idea is to find the best fit of the data to the given model
Unsupervised vs. Supervised: clustering may or may not be based on
prior knowledge of the correct classification.

Supervised classification
Bases the classification on prior
knowledge about the correct classification
of the objects.

Support Vector Machines


Support Vector Machines (SVMs) were invented by Vladimir
Vapnik for classification based on prior knowledge [Vapnik98,
Burges98].
Create classification functions from a set of labeled training data.
The output is binary: is the input in a category?

SVMs find a hypersurface in the space of possible inputs.


Split the positive examples from the negative examples.
The split is chosen to have the largest distance from the hypersurface to
the nearest of the positive and negative examples.

Training an SVM on a large data set can be slow.


Testing data should be near the training data.

CCA-S
Clustering and Classification Algorithm Supervised
(CCA-S)
for detecting intrusions into computer network systems [Ye01].

CCA-S learns signature patterns of both normal and


intrusive activities in the training data.
Then, it classifies the activities in the testing data as normal or
intrusive based on the learned signature patterns of
normal and intrusive activities.

Classification (supervised) applied to


cancer tumor data sets

Class Discovery: dividing tumor samples into groups with similar


behavioral properties and molecular characteristics.
Previously unknown tumor subtypes may be identified this way

Class Prediction: determining the correct class for a new tumor sample, given a set of known classes.
Correct class prediction may suggest whether a patient will benefit from treatment, or how the patient will respond after treatment with a certain drug

Examples of Cancer Tumor Classification:


Classification of acute leukemias
Classification of diffuse large B-cell lymphoma tumors
Classification of 60 cancer cell lines derived from a variety of tumors

Chapter 8. Cluster Analysis


What is Cluster Analysis?
Types of Data in Cluster Analysis
A Categorization of Major Clustering Methods
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Model-Based Clustering Methods
Supervised Classification
Outlier Analysis

What Is Outlier Discovery?


What are outliers?
Objects that are considerably dissimilar from the remainder of the data
Example: Sports: Michael Jordan, Wayne Gretzky, ...

Problem
Find top n outlier points

Applications:

Credit card fraud detection


Telecom fraud detection
Customer segmentation
Medical analysis

Outlier Discovery:
Statistical
Approaches
Assume a model of the underlying distribution that generates the data set (e.g., a normal distribution)
Use discordancy tests depending on
data distribution
distribution parameter (e.g., mean, variance)
number of expected outliers

Drawbacks
Most distribution tests are for single attribute
In many cases, data distribution may not be known

Outlier Discovery:
Distance-Based Approach
Introduced to counter the main limitations imposed by
statistical methods
We need multi-dimensional analysis without knowing data
distribution.

Distance-based outlier: A (p,D)-outlier is an object O in


a dataset T such that at least a fraction p of the
objects in T lies at a distance greater than D from O.

Outlier Discovery:
Deviation-Based Approach
Identifies outliers by examining the main
characteristics of objects in a group
Objects that deviate from this description are
considered outliers
Sequential exception technique
simulates the way in which humans can distinguish unusual
objects from among a series of supposedly like objects

OLAP data cube technique


uses data cubes to identify regions of anomalies in large
multidimensional data

Chapter 8. Cluster Analysis


What is Cluster Analysis?
Types of Data in Cluster Analysis
A Categorization of Major Clustering Methods
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Model-Based Clustering Methods
Outlier Analysis
Summary

Problems and Challenges


Considerable progress has been made in scalable
clustering methods
Partitioning: k-means, k-medoids, CLARANS
Hierarchical: BIRCH, CURE, LIMBO
Density-based: DBSCAN, CLIQUE, OPTICS
Grid-based: STING, WaveCluster
Model-based: Autoclass, Denclue, Cobweb

Current clustering techniques do not address all the


requirements adequately
Constraint-based clustering analysis: Constraints exist
in data space (bridges and highways) or in user queries

Constraint-Based Clustering
Analysis

Clustering analysis: fewer parameters but more user-desired constraints, e.g., an ATM allocation problem

Summary
Cluster analysis groups objects based on their similarity
and has wide applications
Measure of similarity can be computed for various types
of data
Clustering algorithms can be categorized into partitioning
methods, hierarchical methods, density-based methods,
grid-based methods, and model-based methods
Outlier detection and analysis are very useful for fraud
detection, etc. and can be performed by statistical,
distance-based or deviation-based approaches
There are still lots of research issues on cluster analysis,
such as constraint-based clustering

References (1)

R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of


high dimensional data for data mining applications. SIGMOD'98

M. R. Anderberg. Cluster Analysis for Applications. Academic Press, 1973.


M. Ankerst, M. Breunig, H.-P. Kriegel, and J. Sander. Optics: Ordering points to identify
the clustering structure, SIGMOD99.
P. Arabie, L. J. Hubert, and G. De Soete. Clustering and Classification. World Scientific, 1996.

M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering
clusters in large spatial databases. KDD'96.

M. Ester, H.-P. Kriegel, and X. Xu. Knowledge discovery in large spatial databases:
Focusing techniques for efficient class identification. SSD'95.

D. Fisher. Knowledge acquisition via incremental conceptual clustering. Machine Learning,


2:139-172, 1987.

D. Gibson, J. Kleinberg, and P. Raghavan. Clustering categorical data: An approach based


on dynamic systems. In Proc. VLDB98.

S. Guha, R. Rastogi, and K. Shim. Cure: An efficient clustering algorithm for large
databases. SIGMOD'98.

A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.

References (2)

L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: an Introduction to Cluster


Analysis. John Wiley & Sons, 1990.

E. Knorr and R. Ng. Algorithms for mining distance-based outliers in large datasets.
VLDB98.

G. J. McLachlan and K. E. Basford. Mixture Models: Inference and Applications to


Clustering. John Wiley and Sons, 1988.

P. Michaud. Clustering techniques. Future Generation Computer systems, 13, 1997.

R. Ng and J. Han. Efficient and effective clustering method for spatial data mining.
VLDB'94.

E. Schikuta. Grid clustering: An efficient hierarchical clustering method for very large data
sets. Proc. 1996 Int. Conf. on Pattern Recognition, 101-105.

G. Sheikholeslami, S. Chatterjee, and A. Zhang. WaveCluster: A multi-resolution


clustering approach for very large spatial databases. VLDB98.

W. Wang, Yang, R. Muntz, STING: A Statistical Information grid Approach to Spatial Data
Mining, VLDB97.

T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH : an efficient data clustering method for
very large databases. SIGMOD'96.

Thank you !!!

Ratio-Scaled Variables
Ratio-scaled variable: a positive measurement on a
nonlinear scale, approximately at exponential scale,
such as AeBt or Ae-Bt
Methods:
treat them like interval-scaled variables (not a good choice! why?)
apply a logarithmic transformation: y_if = log(x_if)
treat them as continuous ordinal data and treat their rank as interval-scaled.

Intrusion detection systems

Clustering is applied to files with logged behavior of system users over time, to separate instances
of normal activity from instances of abusive or attack activity. Clustering is primarily used in
Intrusion Detection Systems, especially when the log files of system user behavior are too large
for a human expert to analyze.

Unsupervised clustering: When unsupervised clustering is applied to Intrusion Detection


Systems data on which no prior knowledge exists the data is first clustered and then different
clusters are labeled as either normal or intrusions based on the size of the cluster.
Given a new object d, classification proceeds by finding a cluster C which is closest to d and
classifying d according to the label of C as either normal or anomalous [Li04, Portnoy01].
IDSs based on unsupervised clustering focus on activity data that may contain features of an
individual TCP connection, such as its duration, protocol type, number of bytes transferred and a
flag indicating the connection's normal or error status. Other features may include the number of
file creation operations, number of failed login attempts, whether root shell was obtained, the
number of connections to the same host as the current connection within the past two seconds,
the number of connections to the same service as the current connection within the past two
seconds and others.

Intrusion detection systems

Clustering is applied to files with logged behavior of system users over time, to separate instances
of normal activity from instances of abusive or attack activity. Clustering is primarily used in
Intrusion Detection Systems, especially when the log files of system user behavior are too large
for a human expert to analyze.

Supervised clustering: Supervised clustering is applied to Intrusion Detection Systems for


Signature Recognition purposes. This means that signatures representing patterns of previous
intrusion activities are learned by the system. Then, new patterns of activity are compared to the
previously learned signatures to determine if the activities are normal or if they may be intrusions
[Ye01].

IDSs based on signature recognition focus on two main types of activity data: network traffic data
and computer audit data. Some categorical activity attributes that can be obtained from this data
include the user id, event type, process id, command, remote IP address. Some numerical activity
attributes that can be obtained from this data include the time stamp, CPU time, etc.

Squeezer

Squeezer is introduced in [He2002] as a one-pass algorithm. Squeezer


repeatedly reads tuples from the data set one by one. When the first tuple
arrives, it forms a cluster alone. The consequent tuples are either put into an
existing cluster or rejected by all existing clusters to form a new cluster by the
given similarity function.
Squeezer clustering consists of the following steps:

1) Initialize the set of clusters, S, to the empty set.


2) Obtain an object d from the data set. If S is empty, then create a cluster
with d and add it to S. Otherwise, find the cluster in S that is closest
to this object. In other words, find the closest cluster C to d in S.
3) If the distance between d and C is less than or equal to a user specified
threshold W then associate d with the cluster C. Else, create a new
cluster for d in S.
4) Repeat steps 2 and 3 until no objects are left in the data set.

Farthest First Traversal k-center Algorithm

The farthest-first traversal k-center algorithm (FFT) is a fast, greedy algorithm that
minimizes the maximum cluster radius [Hochbaum85]. In FFT, k points are first selected
as cluster centers. The remaining points are added to the cluster whose center is the
closest. The first center is chosen randomly. The second center is greedily chosen as the
point furthest from the first. Each remaining center is determined by greedily choosing
the point farthest from the set of already chosen centers, where the farthest point x from
a set D is defined as max_x { min { d(x, j) : j ∈ D } }.

Farthest_first_traversal(D: data set, k: integer) {
    randomly select the first center;
    // select centers
    for (i = 2, ..., k) {
        for (each remaining point) { calculate the distance to the current center set; }
        select the point with maximum distance as a new center;
    }
    // assign remaining points
    for (each remaining point) {
        calculate the distance to each cluster center;
        put it in the cluster with minimum distance;
    }
}

Clustering Feature
Vector
BIRCH's operation is based on clustering features (CF) and clustering feature trees (CF trees). These are used to summarize the information in a cluster.
A CF is a triplet summarizing information about
subclusters of objects.
A CF typically holds the following information about a
subcluster: the number of objects N in the subcluster, a
vector holding the linear sum of the N objects in the
subcluster, and a vector holding the square sum of the
N objects in the subcluster.
Thus, a CF is a summary of statistics for the given
subcluster


Gradient: The steepness of a slope
Example (Gaussian influence function):

f_Gaussian(x, y) = e^( - d(x, y)^2 / (2 σ^2) )

f^D_Gaussian(x) = Σ_{i=1..N} e^( - d(x, x_i)^2 / (2 σ^2) )

∇f^D_Gaussian(x, x_i) = Σ_{i=1..N} (x_i - x) · e^( - d(x, x_i)^2 / (2 σ^2) )
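A small Python sketch of the three quantities above (illustrative; σ and the data points are whatever the caller supplies):

import math

def gaussian_influence(x, y, sigma):
    # f_Gaussian(x, y) = exp(-d(x, y)^2 / (2 sigma^2)), with Euclidean d
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-d2 / (2 * sigma ** 2))

def density(x, data, sigma):
    # overall density at x: sum of the influences of all data points
    return sum(gaussian_influence(x, xi, sigma) for xi in data)

def gradient(x, data, sigma):
    # gradient of the density at x; hill-climbing along it leads to a density attractor
    return [sum((xi[d] - x[d]) * gaussian_influence(x, xi, sigma) for xi in data)
            for d in range(len(x))]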

Mutual Information Clustering

Mutual information refers to a measure of correlation between the information content of


two information elements [Michaels98]. For example, when clustering gene expression
data , the formula M(A,B) = H(A) + H(B) - H(A,B) represents the mutual information M,
that is shared by two temporal gene expression patterns A and B.
H refers to the Shannon entropy. H(A) is a measure of the number of expression levels
exhibited by the gene A expression pattern, over a given time period. The higher the
value of H, the more expression levels gene A exhibits over a temporal expression
pattern, and thus the more information the expression pattern contains.
An H of zero means that a gene expression pattern over a time course was completely
flat, or constant, and thus the expression pattern carries no information.
H(A,B) is a measure of the number of expression levels observed in either gene
expression pattern A or B.
Thus, the mutual information M(A,B) = H(A) + H(B) - H(A,B) is defined as the number of
expression levels that are observed in both gene expression patterns A and B, over a
given time period [26].
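A short Python sketch (illustrative only) of M(A,B) = H(A) + H(B) - H(A,B) for discretized expression patterns; the low/high example values are made up:

import math
from collections import Counter

def entropy(levels):
    # Shannon entropy H of a sequence of (discretized) expression levels
    counts = Counter(levels)
    n = len(levels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def mutual_information(a, b):
    # M(A, B) = H(A) + H(B) - H(A, B) for two patterns of equal length
    joint = list(zip(a, b))
    return entropy(a) + entropy(b) - entropy(joint)

# two toy patterns over 4 time points, discretized into "low"/"high" levels
print(mutual_information(["low", "low", "high", "high"],
                         ["low", "low", "high", "high"]))   # 1.0 (perfectly correlated)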

CACTUS Categorical Clustering

CACTUS is presented in [Ganti99], introducing a novel formalization of a


cluster for categorical attributes by generalizing a definition of a cluster for
numerical data.
CACTUS is a summary-based algorithm that utilizes summary information
of the dataset. It characterizes the detected categorical clusters by building
summaries.
The authors assume the existence of a distinguishing number that
represents the minimum size of the distinguishing sets i.e. attribute value
sets that uniquely occur within only one cluster.
The distinguishing sets in CACTUS rely on the assumption that clusters
are uniquely identified by a core set of attribute values that occur in no
other cluster.
While this assumption may hold true for many real world datasets, it is
unnatural and unnecessary for the clustering process. The distinguishing
sets are then extended to cluster projections.

COOLCAT Categorical Clustering


COOLCAT is introduced in [Barbara02] as an entropy-based algorithm
for categorical clustering. The COOLCAT algorithm is based on the
idea of entropy reduction within the generated clusters.
It first bootstraps itself using a sample of maximally dissimilar points
from the dataset to create initial clusters.
COOLCAT starts with a sample of data objects and identifies a set of k
initial tuples such that the minimum pairwise distance among them is
maximized.
The remaining points are then added incrementally. Clusters are
created by "cooling" them down, i.e. reducing their entropy.
All remaining tuples of the data set are placed in one of the clusters
such that, at each step, the increase in the entropy of the resulting
clustering is minimized.

STIRR Categorical Clustering

STIRR produces a dynamical system from a table of categorical data. STIRR begins with a table of relational data, consisting of k fields (or columns), each of which can assume one of many possible values. Figure 15 shows an example of the data representation. Each possible value in each field is represented by a node, and the data is represented as a set of tuples; each tuple consists of nodes, with one node for each field.
A configuration is an assignment of a weight w to each node. A normalization function N() is needed to rescale the weights of the nodes associated with each field so that their squares add up to 1.

Figure 15 - Representation of a collection of tuples. Adopted from [Gibson98].


A dynamical system is the repeated application of a function f on some set of values. A fixed point of a
dynamical system is a point u for which f(u) = u. So it is a point which remains the same under the repeated
application of f.
The dynamical system for a set of tuples is based on a function f, which maps a current configuration of
weights to a new configuration. We define f as follows. We choose a combiner function defined below and
update each weight wv as follows:

CLICK Categorical Clustering

CLICK finds clusters in categorical datasets based on a search method for k-partite maximal cliques
[Peters04].

Figure 16 The CLICK algorithm. Adopted from [Peters04].

The basic CLICK approach consists of three principal stages, shown in Figure 17, as follows:
Pre-Processing: the k-partite graph is created from the input database D, and the attributes are ranked for efficiency reasons.
Clique Detection: given the k-partite graph of D, all the maximal k-partite cliques in the graph are enumerated.
Post-Processing: the support of the candidate cliques within the original dataset is verified to form the final clusters. Moreover, the final clusters are optionally merged to partially relax the strict cluster conditions.

CLOPE Categorical Clustering

CLOPE proposes a novel global criterion function that tries to increase the intra-cluster overlapping of transaction items
by increasing the height-to-width ratio of the cluster histogram.
Let's take a small market basket database with 5 transactions {(apple, banana), (apple, banana, cake), (apple, cake,
dish), (dish, egg), (dish, egg, fish)}. For simplicity, transaction (apple, banana) is abbreviated to ab, etc. For this small
database, we want to compare the following two clustering (1) { {ab, abc, acd}, {de, def} } and (2) { {ab, abc}, {acd, de,
def} }. For each cluster, we count the occurrence of every distinct item, and then obtain the height (H) and width (W) of the
cluster. For example, cluster {ab, abc, acd} has the occurrences of a:3, b:2, c:2, and d:l, with H=2.0 and W=4. H is
computed by summing the numbers of occurrences and dividing by the number of distinct items. Figure 17 shows the
values of these results as histograms, with items sorted in reverse order of their occurrences, only for the sake of easier
visual interpretation.

Cluster {ab, abc, acd}: H = 2.0, W = 4      Cluster {de, def}: H = 1.67, W = 3      (clustering (1))
Cluster {ab, abc}: H = 1.67, W = 3          Cluster {acd, de, def}: H = 1.6, W = 5  (clustering (2))
Figure 17 - Histograms of the two clusterings. Adopted from [Yang2002].

We judge the qualities of these two clusterings, by analyzing the heights and widths of the clusters. Leaving out the two
identical histograms for cluster {de, def} and cluster {ab, abc}, the other two histograms are of different quality. The
histogram for cluster {ab, abc, acd} has H/W=0.5, but the one for cluster {acd, de, def} has H/W=0.32. Clearly, clustering
(1) is better since we prefer more overlapping among transactions in the same cluster. From the above example, we can
see that a larger height-to-width ratio of the histogram means better intra-cluster similarity. This intuition is the basis of
CLOPE and defines the global criterion function using the geometric properties of the cluster histograms.

WaveCluster (1998)

WaveCluster uses wavelet transformations to cluster data.


It uses a wavelet transformation to transform the original feature space, finding
dense regions in the transformed space.

A wavelet transform is a signal processing technique that decomposes a signal


into different frequency subbands.
The wavelet model can be applied to a data set with n dimensions by applying a
one-dimensional wavelet n times.
Each time that a wavelet transform is applied, data are transformed so as to
preserve the relative distance between objects at different levels of resolution,
although the absolute distance increases.
This allows the natural clusters in the data to become more distinguishable.
Clusters are then identified by searching for dense regions in the new domain.

Wavelets emphasize regions where the objects cluster, but suppress less dense
regions outside of clusters.
The clusters in the data stand out because they are denser and clearer than the regions around them [Sheikholeslami98].

CLIQUE: The Major Steps


finds subspaces of the highest dimensionality such that high
density clusters exist in those subspaces

Partition the data space and find the number of points that
lie inside each cell of the partition.
Identify clusters:
Determine dense units in all subspaces of interest.
Determine connected dense units in all subspaces of interest.

Generate minimal description for the clusters


Determine maximal regions that cover a cluster of connected
dense units for each cluster
Determination of minimal cover for each cluster

ACDC

ACDC works in a different way than the algorithms mentioned above [Tzerpos00]. ACDC
performs the task of clustering in two stages. In the first one it creates a skeleton of the final
decomposition by identifying subsystems using a pattern-driven approach. There are many
patterns that have been used in ACDC. Figure 24 shows some of these patterns.

Depending on the pattern used the subsystems are given appropriate names. In the second
stage ACDC completes the decomposition by using an extended version of a technique known
as Orphan Adoption. Orphan Adoption is an incremental clustering technique based on the
assumption that the existing structure is well established. It attempts to place each newly
introduced resource (called an orphan) in the subsystem that seems more appropriate. This is
usually a subsystem that has a larger amount of connectivity to the orphan than any other
subsystem [Tzerpos00].
