Dji
d ( j, i )
Cji = Dji - d ( j, i )
PAM Algorithm ( SWAP )
Considering all pairs of objects ( i , h )
object i has been selected and object h has not
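Consider a nonselected object j and
calculate its contribution Cjih to the swap, where
Dj = dissimilarity from j to its closest selected object
Ej = dissimilarity from j to its second-closest selected object
Case a : d ( j ,i ) > Dj and d ( j ,h ) >= Dj
Cjih = 0
Case b : d ( j ,i ) = Dj
if d ( j ,h ) < Ej then Cjih = d ( j ,h ) - d ( j ,i )
if d ( j ,h ) >= Ej then Cjih = Ej - Dj
Case c : d ( j ,i ) > Dj and d ( j ,h ) < Dj
Cjih = d ( j ,h) - Dj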
Tih = sum of Cjih over all nonselected objects j
Select the pair ( i , h ) that minimizes Tih
If min(Tih) < 0 , swap i and h and return to step 1
If min(Tih) >= 0 , STOP
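The SWAP cases can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names (`nearest_two`, `contribution`, `pam_swap`) and the toy data are assumptions made for this sketch. It also sums Cjih over all objects rather than only the nonselected ones, so that Tih equals the exact change in total dissimilarity; the case analysis still applies to i and h themselves.

```python
def nearest_two(dist, medoids, j):
    """Return (D_j, E_j): dissimilarity from j to its closest and second-closest medoid."""
    ds = sorted(dist[j][m] for m in medoids)
    return ds[0], (ds[1] if len(ds) > 1 else float("inf"))

def contribution(dist, medoids, j, i, h):
    """C_jih: change in j's dissimilarity if medoid i is swapped for object h."""
    D_j, E_j = nearest_two(dist, medoids, j)
    d_ji, d_jh = dist[j][i], dist[j][h]
    if d_ji > D_j:
        # i is not j's closest medoid: 0 (case a) or d(j,h) - D_j (case c)
        return min(d_jh - D_j, 0)
    if d_jh < E_j:
        # case b, j moves from i to h
        return d_jh - d_ji
    # case b, j moves to its second-closest medoid
    return E_j - D_j

def pam_swap(dist, medoids):
    """Repeat the best swap (i, h) while min(T_ih) is negative."""
    medoids = list(medoids)
    n = len(dist)
    while True:
        best_t, best_i, best_h = -1e-12, None, None  # small tolerance guards float noise
        for i in medoids:
            for h in range(n):
                if h in medoids:
                    continue
                t_ih = sum(contribution(dist, medoids, j, i, h) for j in range(n))
                if t_ih < best_t:
                    best_t, best_i, best_h = t_ih, i, h
        if best_i is None:          # min(T_ih) >= 0: stop
            return sorted(medoids)
        medoids[medoids.index(best_i)] = best_h

# Toy example: six points on a line, starting from a poor pair of medoids.
points = [0, 1, 2, 10, 11, 12]
dist = [[abs(p - q) for q in points] for p in points]
print(pam_swap(dist, [0, 1]))  # [1, 4]
```

On this toy input the first pass swaps index 0 for index 4 (T = -27), and the second pass finds no negative Tih, so the search stops.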
CLARA :
Clustering Large Applications
Designed to reduce the time and space cost of PAM on large data sets
Also based on k-medoid approach
Draws five ( or more ) random samples of objects
the size of each sample depends on the number
of clusters k : 40 + 2k
CLARA Algorithm
Draw a sample of 40 + 2k objects randomly
from the entire data set
Use PAM to find k medoids of the sample
Assign each object to the nearest
representative object (medoid)
The average distance obtained for the
assignment is used as a measure of the
quality of the clustering
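The CLARA steps above can be sketched as follows. This is a rough sketch under stated assumptions: `pam` here is a simplified stand-in (first-improvement swaps with direct cost recomputation, not the full BUILD/SWAP procedure), objects are assumed to be distinct values, and the names `clara`, `pam`, `total_cost`, `n_samples` and `seed` are hypothetical. The result with the lowest average distance over the whole data set is kept.

```python
import random

def total_cost(points, medoids, d):
    """Sum of dissimilarities from each point to its nearest medoid."""
    return sum(min(d(p, m) for m in medoids) for p in points)

def pam(points, k, d):
    """Simplified PAM stand-in: accept any swap that lowers the total cost."""
    medoids = list(points[:k])
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for h in points:
                if h in medoids:
                    continue
                trial = medoids[:i] + [h] + medoids[i + 1:]
                if total_cost(points, trial, d) < total_cost(points, medoids, d):
                    medoids, improved = trial, True
    return medoids

def clara(data, k, d, n_samples=5, seed=0):
    """Run PAM on several random samples; keep the medoids with the best average distance."""
    rng = random.Random(seed)
    size = min(len(data), 40 + 2 * k)           # sample size from the slides
    best_medoids, best_avg = None, float("inf")
    for _ in range(n_samples):
        sample = rng.sample(data, size)
        medoids = pam(sample, k, d)
        # Quality measure: average distance over the ENTIRE data set
        avg = total_cost(data, medoids, d) / len(data)
        if avg < best_avg:
            best_medoids, best_avg = medoids, avg
    return best_medoids, best_avg

# Toy example: two well-separated groups of 1-D values.
data = [float(x) for x in range(50)] + [float(x) for x in range(1000, 1050)]
medoids, avg = clara(data, k=2, d=lambda a, b: abs(a - b))
print(sorted(medoids), avg)
```

Only the sample is passed to PAM, so the expensive swap search runs on 40 + 2k objects, while the full data set is touched once per sample for the quality measure.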
Hierarchical Cluster Algorithms
The clustering method is determined by how
the similarity between two clusters is
measured
UPGMA : Unweighted Pair-Group Method
using Arithmetic averages
SLINK : Single Linkage
CLINK : Complete Linkage
Ward's minimum variance
Agglomerative
Divisive
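To make the difference between the four linkage criteria concrete, the sketch below computes each between-cluster measure for two small clusters. The function names are illustrative assumptions, and the Ward function is restricted to 1-D numeric data for brevity; it uses the standard identity that merging clusters A and B increases the within-cluster sum of squares by |A||B| / (|A| + |B|) times the squared distance between their means.

```python
def single_link(a, b, d):
    """SLINK: distance between the two closest members."""
    return min(d(x, y) for x in a for y in b)

def complete_link(a, b, d):
    """CLINK: distance between the two farthest members."""
    return max(d(x, y) for x in a for y in b)

def average_link(a, b, d):
    """UPGMA: arithmetic average over all between-cluster pairs."""
    return sum(d(x, y) for x in a for y in b) / (len(a) * len(b))

def ward_increase(a, b):
    """Ward: increase in within-cluster sum of squares if a and b merge (1-D data)."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return len(a) * len(b) / (len(a) + len(b)) * (mean_a - mean_b) ** 2

d = lambda x, y: abs(x - y)
a, b = [0.0, 1.0], [4.0, 6.0]
print(single_link(a, b, d))    # 3.0
print(complete_link(a, b, d))  # 6.0
print(average_link(a, b, d))   # 4.5
print(ward_increase(a, b))     # 20.25
```

An agglomerative method merges, at each step, the pair of clusters for which the chosen measure is smallest.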
Hierarchical Clustering Methods
STEP 1 : Obtain the Data Matrix
STEP 2 : Standardize the Data Matrix
STEP 3 : Compute the Resemblance Matrix
STEP 4 : Execute the Clustering Method
STEP 5 : Rearrange the Data and
Resemblance Matrices
STEP 6 : Compute the Cophenetic
Correlation Coefficient
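The six steps can be sketched end to end in plain Python. This is an illustrative sketch, not the lecture's own procedure: it assumes z-score standardization, Euclidean distance as the resemblance coefficient, and UPGMA as the clustering method; Step 5 (rearranging the matrices for display) is omitted; and all function names and the toy data matrix are made up for the example.

```python
import math
from itertools import combinations

def standardize(data):
    """Step 2: z-score each column (assumes no column is constant)."""
    cols = list(zip(*data))
    means = [sum(c) / len(c) for c in cols]
    sds = [math.sqrt(sum((v - m) ** 2 for v in c) / (len(c) - 1))
           for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in data]

def resemblance(data):
    """Step 3: Euclidean distance for every pair of objects (i < j)."""
    return {(i, j): math.dist(data[i], data[j])
            for i, j in combinations(range(len(data)), 2)}

def upgma(dist, n):
    """Step 4: merge the two closest clusters until one remains.
    Returns the cophenetic distance (merge height) of every object pair."""
    def avg(a, b):
        return sum(dist[min(x, y), max(x, y)] for x in a for y in b) / (len(a) * len(b))
    clusters = [[i] for i in range(n)]
    coph = {}
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda pair: avg(*pair))
        height = avg(a, b)
        for x in a:
            for y in b:
                coph[min(x, y), max(x, y)] = height
        clusters = [c for c in clusters if c is not a and c is not b] + [a + b]
    return coph

def cophenetic_correlation(dist, coph):
    """Step 6: Pearson correlation between original and cophenetic distances."""
    keys = sorted(dist)
    xs = [dist[k] for k in keys]
    ys = [coph[k] for k in keys]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / math.sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))

# Step 1: a toy data matrix (rows = objects, columns = attributes)
data = [[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0], [5.5, 8.5]]
dist = resemblance(standardize(data))
coph = upgma(dist, len(data))
r = cophenetic_correlation(dist, coph)
print(round(r, 3))
```

A coefficient close to 1 indicates that the tree distorts the original resemblance matrix only slightly.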
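Hierarchical Clustering Method : worked example ( figures not reproduced )
Steps 1 to 3 : data matrix, standardized data matrix, resemblance matrix
Step 4 using UPGMA : hierarchical tree
Step 5 : rearranged data and resemblance matrices
Step 6 : cophenetic correlation coefficient
Step 4 using SLINK ( chaining )
Panels : SLINK ( chaining ), CLINK, UPGMA
Step 4 using Ward's minimum variance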
Density Search Algorithm
A cluster is defined as a region in which the
density of objects is locally higher than in
other regions
Two ways to define density :
The density near an object is defined as the
number of objects within a sphere with a fixed
radius R
The density is defined as the inverse of the
dissimilarity to the Tth nearest object or the
inverse of the average dissimilarity to the T
nearest objects
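Both density definitions can be written down directly. The sketch below is a minimal illustration; the function names and the toy data are assumptions, and the object itself is excluded from its own neighbourhood, which is a choice made for this sketch rather than something the definitions specify.

```python
def others(j, points, d):
    """Sorted dissimilarities from object j to every other object."""
    return sorted(d(points[j], p) for i, p in enumerate(points) if i != j)

def density_in_radius(j, points, d, radius):
    """Type 1: number of objects within a sphere of fixed radius R around object j."""
    return sum(1 for dist in others(j, points, d) if dist <= radius)

def density_tth_nearest(j, points, d, t):
    """Type 2: inverse of the dissimilarity to the T-th nearest object."""
    return 1.0 / others(j, points, d)[t - 1]

def density_avg_t_nearest(j, points, d, t):
    """Type 2 (variant): inverse of the average dissimilarity to the T nearest objects."""
    return t / sum(others(j, points, d)[:t])

d = lambda a, b: abs(a - b)
points = [0.0, 1.0, 2.0, 10.0]
print(density_in_radius(1, points, d, radius=1.5))   # 2
print(density_in_radius(3, points, d, radius=1.5))   # 0
print(density_tth_nearest(1, points, d, t=2))        # 1.0
print(density_avg_t_nearest(0, points, d, t=2))      # 2 / (1 + 2)
```

The isolated object at 10.0 has low density under either definition, while the objects near 1.0 sit in a locally dense region.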
Density-connected sets
Spatial Database
characterized by
spatial attributes : points or spatially extended
objects such as polygons in some d-dimensional
space
non-spatial attributes : represent additional
properties of a spatial object
Requirements for clustering algorithms
that target spatial databases
Minimal requirements of domain knowledge to
determine the input parameters
Discovery of clusters with arbitrary shape
Good efficiency on large databases
Fast algorithms for clustering
very large data sets
Revisions of existing clustering methods that use
carefully designed search methods
CLARANS : randomized search
organizing structures
BIRCH : CF tree
organizing indices
DBSCAN : R*-tree
=> these methods target numeric data
CLARANS - for large databases
BIRCH - for large databases, reduces
memory usage and I/O cost
CURE - arbitrary shape
DBSCAN, GDBSCAN
DBCLASD - arbitrary shape, no input
parameter required
[ Figure : timeline of clustering algorithms by year ( axis ticks 84 to 98 ).
K-medoid : PAM, CLARA, CLARANS. Density-based : DBSCAN, GDBSCAN, DBCLASD. Also shown : BIRCH. ]