Professional Documents
Culture Documents
Cluster Analysis
Cluster analysis
Unsupervised learning: no
predefined classes
Typical applications
Clustering: Rich
Applications and
Multidisciplinary Efforts
Pattern Recognition
Image Processing
Economic Science (especially
market research)
WWW
Document classification
Cluster Weblog data to discover groups
of similar access patterns
Examples of Clustering
Applications
Clustering Quality
enough
Requirements of
Clustering in Data Mining
Scalability
Ability to deal with different types of
attributes
Discovery of clusters with arbitrary shape
Minimal requirements for domain
knowledge to determine input parameters
Able to deal with noise and outliers
Insensitive to order of input records
High dimensionality
Incorporation of user-specified constraints
Interpretability and usability
Types of Data
Clustering Data Structures
Data matrix
Two modes
Object by variable structure
Dissimilarity matrix
One mode
Type of data in
clustering analysis
Interval-scaled variables
Binary variables
Nominal, ordinal, and ratio
variables
Variables of mixed types
Interval-scaled variables
Continuous measurements
Standardize data
where
Calculate the standardized
measurement (z-score)
Similarity and
Dissimilarity Between
Objects
If q = 1, d is Manhattan distance
Similarity and
Dissimilarity Between
Objects (Cont.)
If q = 2, d is Euclidean distance
d(i,j) 0
d(i,i) = 0
d(i,j) = d(j,i)
d(i,j) d(i,k) + d(k,j)
Binary Variables
Example
Categorical or Nominal
Variables
Ordinal Variables
Ratio-Scaled Variables
yif = log(xif)
Variables of Mixed
Types
f is binary or nominal:
dij(f) = 0 if xif = xjf , or dij(f) = 1 otherwise
f is interval-based: use the normalized
distance
f is ordinal or ratio-scaled
compute ranks rif and
and treat zif as interval-scaled
Vector Objects
Cosine measure
Tanimoto distance
t
s(x,y) = xt. y / x . x + y . y - x .
y
Ratio of attributes shared by x
and y to number of attributes
possessed by x or y