Hierarchical Clustering
Uses the distance (dissimilarity) matrix as the clustering criterion.
This method does not require the number of clusters k as input,
but it does need a termination condition.
AGNES (Agglomerative Nesting)
Introduced in Kaufmann and Rousseeuw (1990)
Implemented in statistical analysis packages
Uses the single-link method and the dissimilarity matrix.
Merges the nodes that have the least dissimilarity.
Proceeds in non-descending order of dissimilarity.
Eventually all nodes belong to the same cluster
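The AGNES merge loop above can be sketched in pure Python; the sample points and the "stop at k clusters" termination condition are illustrative, not part of the original algorithm description:

```python
from math import dist

def single_link_agnes(points, n_clusters):
    """AGNES-style single-link merging until n_clusters remain."""
    clusters = [[p] for p in points]          # start: each point is its own cluster
    while len(clusters) > n_clusters:
        # Single-link dissimilarity: the minimum pairwise distance between
        # two clusters; merge the pair with the least dissimilarity.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist(p, q) for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters.pop(j))   # merge the least-dissimilar pair
    return clusters

print(single_link_agnes([(0, 0), (0.2, 0.1), (5, 5), (5.1, 4.9)], 2))
```

Without a termination condition the loop would simply continue until all points belong to one cluster, as the outline notes.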
DIANA (Divisive Analysis)
Introduced in Kaufmann and Rousseeuw (1990)
Implemented in statistical analysis packages, e.g., S-Plus
Inverse order of AGNES
Eventually each node forms a cluster on its own
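The inverse (divisive) order can be sketched as follows. This is a simplification: full DIANA grows a splinter group point by point, whereas here the split is seeded with the most mutually distant pair; the sample data is illustrative:

```python
from math import dist

def diana_sketch(points, n_clusters):
    """Divisive (DIANA-style) clustering: repeatedly split the widest cluster."""
    clusters = [list(points)]                 # start: one all-inclusive cluster
    while len(clusters) < n_clusters:
        # Pick the cluster with the largest diameter (max pairwise distance).
        widest = max(clusters,
                     key=lambda c: max((dist(p, q) for p in c for q in c),
                                       default=0.0))
        if len(widest) < 2:
            break                             # nothing left to split
        # Seed the two halves with the most mutually distant pair, then
        # assign every remaining point to the nearer seed.
        a, b = max(((p, q) for p in widest for q in widest),
                   key=lambda pq: dist(*pq))
        clusters.remove(widest)
        clusters.append([p for p in widest if dist(p, a) <= dist(p, b)])
        clusters.append([p for p in widest if dist(p, a) > dist(p, b)])
    return clusters

print(diana_sketch([(0, 0), (0.1, 0), (4, 4), (4.1, 4)], 2))
```

Left to run until every cluster is a singleton, this reproduces the outline's end state: each node forms a cluster on its own.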
Distance Measures
Minimum distance: dmin(Ci, Cj) = min over p in Ci, p' in Cj of |p - p'|
Used by the nearest-neighbor algorithm
Single-linkage: the algorithm terminates when the distance exceeds a threshold
An agglomerative hierarchical algorithm using dmin is also called a minimal-spanning-tree algorithm
Maximum distance: dmax(Ci, Cj) = max over p in Ci, p' in Cj of |p - p'|
Used by the farthest-neighbor clustering algorithm
Complete-linkage: the algorithm terminates when the distance exceeds a threshold
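The two inter-cluster distances above, written out directly (the example clusters are illustrative):

```python
from math import dist

def d_min(Ci, Cj):
    # Single-link distance: the closest pair across the two clusters.
    return min(dist(p, q) for p in Ci for q in Cj)

def d_max(Ci, Cj):
    # Complete-link distance: the most distant pair across the two clusters.
    return max(dist(p, q) for p in Ci for q in Cj)

Ci = [(0, 0), (1, 0)]
Cj = [(3, 0), (5, 0)]
print(d_min(Ci, Cj))  # 2.0, from (1, 0) to (3, 0)
print(d_max(Ci, Cj))  # 5.0, from (0, 0) to (5, 0)
```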
Difficulties faced in Hierarchical Clustering
Selecting merge/split points is critical
Merge or split operations, once done, cannot be undone
Poor scalability to large data sets
CF-Tree in BIRCH
Clustering feature (CF):
a summary of the statistics for a given subcluster
registers the crucial measurements for computing clusters and uses storage efficiently
A CF tree is a height-balanced tree that stores the
clustering features for a hierarchical clustering
A nonleaf node in the tree has descendants (children)
The nonleaf nodes store the sums of the CFs of their children
A CF tree has two parameters
Branching factor: specifies the maximum number of children
Threshold: the maximum diameter of the sub-clusters stored at the leaf nodes
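In BIRCH a clustering feature is kept as the triple (N, LS, SS): the count, linear sum, and squared sum of the points in a subcluster. A minimal sketch of the additivity that lets a nonleaf node store the sums of its children's CFs (the points are illustrative):

```python
def cf(points):
    """Clustering feature (N, LS, SS) of a list of points."""
    n = len(points)
    ls = [sum(p[d] for p in points) for d in range(len(points[0]))]
    ss = sum(x * x for p in points for x in p)
    return n, ls, ss

def cf_add(a, b):
    """CFs add component-wise, so a parent CF is the sum of its children's."""
    return a[0] + b[0], [x + y for x, y in zip(a[1], b[1])], a[2] + b[2]

left = cf([(1.0, 2.0), (3.0, 4.0)])
right = cf([(5.0, 6.0)])
parent = cf_add(left, right)
print(parent)  # (3, [9.0, 12.0], 91.0)
```

Because the parent CF equals the CF computed over all underlying points, the tree can summarize a subcluster without ever revisiting the raw data.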
CURE (Clustering Using Representatives)
Supports Non-Spherical Clusters
A fixed number of representative points is used
Scattered objects are selected and shrunk by moving them toward the cluster center
Clusters with the closest pair of representatives are merged
For large databases:
Draw a random sample S
Partition the sample
Partially cluster each partition
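The representative-point step can be sketched as below. The selection rule (a farthest-point heuristic) and the shrink fraction alpha = 0.5 are illustrative choices, not prescribed by CURE:

```python
from math import dist

def representatives(cluster, k=2, alpha=0.5):
    """Pick k scattered points, then shrink each toward the cluster center."""
    center = tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
    # Farthest-point heuristic: start with the point farthest from the
    # center, then greedily add the point farthest from those chosen.
    reps = [max(cluster, key=lambda p: dist(p, center))]
    while len(reps) < min(k, len(cluster)):
        reps.append(max(cluster, key=lambda p: min(dist(p, r) for r in reps)))
    # Shrink each representative toward the center by the fraction alpha.
    return [tuple(r[d] + alpha * (center[d] - r[d]) for d in range(len(center)))
            for r in reps]

print(representatives([(0.0, 0.0), (2.0, 0.0), (1.0, 2.0)]))
```

Shrinking the scattered points toward the center dampens the effect of outliers while the multiple representatives still let the cluster take a non-spherical shape.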
A two-phase algorithm
Phase 1: use a graph-partitioning algorithm to cluster the objects into a large number of relatively small sub-clusters
Phase 2: use an agglomerative hierarchical clustering algorithm to find the genuine clusters by repeatedly combining these sub-clusters
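A hedged sketch of the two-phase idea: here a coarse grid snap stands in for the graph-partitioning phase (the real first phase is more sophisticated), and single-link merging plays the agglomerative second phase; the grid size and data are illustrative:

```python
from math import dist

def two_phase(points, grid=1.0, n_clusters=2):
    # Phase 1: over-partition the data into many small sub-clusters,
    # here simply by grouping points into coarse grid cells.
    cells = {}
    for p in points:
        key = tuple(int(x // grid) for x in p)
        cells.setdefault(key, []).append(p)
    subs = list(cells.values())
    # Phase 2: agglomeratively merge the closest sub-clusters
    # (single-link) until the desired number of clusters remains.
    while len(subs) > n_clusters:
        i, j = min(((i, j) for i in range(len(subs))
                    for j in range(i + 1, len(subs))),
                   key=lambda ij: min(dist(p, q)
                                      for p in subs[ij[0]] for q in subs[ij[1]]))
        subs[i] += subs.pop(j)
    return subs

print(two_phase([(0.1, 0.1), (0.9, 0.9), (1.1, 1.1), (8.0, 8.0), (8.9, 8.9)]))
```

The point of the two phases is that the cheap first pass shrinks the problem so the expensive agglomerative pass only has to merge sub-clusters, not individual points.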