
CLUSTERING: MODEL-BASED TECHNIQUES AND HANDLING HIGH-DIMENSIONAL DATA


Model-Based Clustering Methods
Attempt to optimize the fit between the data and
some mathematical model
Assumption: Data are generated by a mixture of
underlying probability distributions
Techniques
Expectation-Maximization
Conceptual Clustering
Neural Networks Approach
Expectation Maximization
Each cluster is represented mathematically by a
parametric probability distribution
Component distribution
Data is a mixture of these distributions
Mixture density model
Problem: To estimate parameters of probability
distributions
Iterative Refinement Algorithm used to find
parameter estimates
Extension of k-means
Assigns an object to a cluster according to a
weight representing probability of membership
Initial estimate of parameters
Iteratively reassigns scores
Initial guess for parameters; randomly select k
objects to represent cluster means or centers
Iteratively refine parameters / clusters
Expectation Step
Assign each object xi to cluster Ck with probability
P(xi ∈ Ck) = p(Ck | xi) = p(Ck) p(xi | Ck) / p(xi)
Maximization Step
Re-estimate model parameters using these membership probabilities

Simple and easy to implement
Complexity depends on the number of features, objects, and iterations
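As an illustration of the two steps above, here is a minimal sketch of EM for a one-dimensional Gaussian mixture; the choice of k, the synthetic data, and the fixed iteration count are illustrative assumptions, not part of the original material.

```python
# Minimal EM sketch for a 1-D Gaussian mixture (illustrative only):
# k, the data, and the iteration count are assumptions.
import numpy as np

def em_gaussian_mixture(x, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    n = len(x)
    # Initial guess: randomly select k objects as cluster means (as in k-means)
    means = rng.choice(x, size=k, replace=False)
    stds = np.full(k, x.std() + 1e-9)
    priors = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # Expectation step: P(xi in Ck) = p(Ck) p(xi|Ck) / p(xi)
        dens = np.exp(-0.5 * ((x[:, None] - means) / stds) ** 2) \
               / (stds * np.sqrt(2 * np.pi))
        weighted = priors * dens
        resp = weighted / weighted.sum(axis=1, keepdims=True)  # membership weights
        # Maximization step: re-estimate parameters from the weights
        nk = resp.sum(axis=0)
        means = (resp * x[:, None]).sum(axis=0) / nk
        stds = np.sqrt((resp * (x[:, None] - means) ** 2).sum(axis=0) / nk) + 1e-9
        priors = nk / n
    return means, stds, priors

data = np.concatenate([np.random.normal(0, 1, 200), np.random.normal(5, 1, 200)])
print(em_gaussian_mixture(data, k=2))
```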
Conceptual Clustering
Conceptual clustering

A form of clustering in machine learning
Produces a classification scheme for a set of unlabeled objects
Finds a characteristic description for each concept (class)
COBWEB

A popular and simple method of incremental conceptual learning
Creates a hierarchical clustering in the form
of a classification tree
Each node refers to a concept and contains a
probabilistic description of that concept

COBWEB Clustering Method


Classification tree
Each node: a concept and its probabilistic distribution (a summary of the objects under that node)
Description: conditional probabilities P(Ai = vij | Ck)
Sibling nodes at a given level form a partition
Category Utility

The increase in the expected number of attribute values that can be correctly guessed given a partition
Category Utility rewards:
Intra-class similarity P(Ai = vij | Ck)
A high value indicates that many class members share this attribute-value pair
Inter-class dissimilarity P(Ck | Ai = vij)
A high value indicates that few objects in different classes share this attribute-value pair
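A small sketch of how Category Utility can be computed for a candidate partition, following the standard CU definition; the list-of-dicts data layout is an illustrative assumption.

```python
# Sketch of Category Utility for a partition of objects with categorical
# attributes; the list-of-dicts layout is an illustrative assumption.
from collections import Counter

def category_utility(partition):
    """partition: list of clusters, each a list of {attribute: value} dicts."""
    objects = [obj for cluster in partition for obj in cluster]
    total = len(objects)
    attrs = objects[0].keys()

    def sq_prob_sum(objs):
        # sum over attributes Ai and values vij of P(Ai = vij)^2 within objs
        s = 0.0
        for a in attrs:
            counts = Counter(o[a] for o in objs)
            s += sum((c / len(objs)) ** 2 for c in counts.values())
        return s

    baseline = sq_prob_sum(objects)  # expected correct guesses with no partition
    gain = sum(len(c) / total * (sq_prob_sum(c) - baseline) for c in partition)
    return gain / len(partition)     # averaged over the k clusters

clusters = [[{"color": "red", "size": "s"}, {"color": "red", "size": "m"}],
            [{"color": "blue", "size": "l"}]]
print(category_utility(clusters))
```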
Placement of new objects
Descend tree
Identify best host

Temporarily place the object in each node and compute the CU of the resulting partition
The placement giving the highest CU is chosen
COBWEB may also form a new node if the object does not fit into the existing tree

COBWEB is sensitive to order of records


Additional operations
Merging and Splitting
Two best hosts are considered for merging
Best host is considered for splitting
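Continuing the sketch above, host selection at one level can be expressed as scoring each candidate partition with category_utility; merge and split candidates would be scored with the same criterion. The flat list of sibling clusters is a simplification of the real classification tree.

```python
# Sketch of COBWEB's host selection at one tree level, reusing the
# category_utility function sketched above. A flat list of siblings is a
# simplification; merging/splitting would be scored the same way.
def best_placement(clusters, obj):
    candidates = []
    # Temporarily place the object in each existing node and score the partition
    for i in range(len(clusters)):
        trial = [c + [obj] if j == i else list(c) for j, c in enumerate(clusters)]
        candidates.append((category_utility(trial), trial))
    # Also consider forming a new node for the object
    new_node = clusters + [[obj]]
    candidates.append((category_utility(new_node), new_node))
    return max(candidates, key=lambda t: t[0])[1]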
Limitations
The assumption that attributes are independent of each other is often too strong, because correlations may exist

Not suitable for clustering large databases


CLASSIT - an extension of COBWEB for
incremental clustering of continuous data
Neural Network Approach
Represent each cluster as an exemplar, acting as a
prototype of the cluster
New objects are distributed to the cluster whose
exemplar is the most similar according to some
distance measure
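A one-line version of this assignment rule, assuming Euclidean distance and a NumPy array of exemplars:

```python
# Nearest-exemplar assignment under Euclidean distance (assumed measure)
import numpy as np

def assign(obj, exemplars):
    return int(np.argmin(np.linalg.norm(exemplars - obj, axis=1)))
```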
Self-Organizing Map (SOM)
Competitive learning
Involves a hierarchical architecture of several
units (neurons)
Neurons compete in a winner-takes-all fashion
for the object currently being presented
Organization of units forms a feature map
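A minimal sketch of this winner-takes-all training on a 1-D map; the grid size, learning rate, and Gaussian neighborhood width are illustrative choices, not prescribed by the slides.

```python
# Minimal 1-D SOM with winner-takes-all competition; grid size, learning
# rate, and neighbourhood width are illustrative assumptions.
import numpy as np

def train_som(data, n_units=10, epochs=20, lr=0.5, sigma=2.0, seed=0):
    rng = np.random.default_rng(seed)
    weights = rng.random((n_units, data.shape[1]))   # one exemplar per unit
    for _ in range(epochs):
        for x in data:
            # The unit closest to the presented object wins
            winner = np.argmin(np.linalg.norm(weights - x, axis=1))
            # The winner and its map neighbours move toward the object,
            # which is what organizes the units into a feature map
            grid_dist = np.abs(np.arange(n_units) - winner)
            h = np.exp(-(grid_dist ** 2) / (2 * sigma ** 2))
            weights += lr * h[:, None] * (x - weights)
    return weights
```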
Web document clustering is a typical application of the Kohonen SOM

CLUSTERING HIGH-DIMENSIONAL DATA


As dimensionality increases:
the number of irrelevant dimensions may produce noise and mask the real clusters
the data becomes sparse
distance measures become meaningless
Feature transformation methods
PCA and SVD summarize the data by creating linear combinations of the attributes
But they do not remove any attributes, and the transformed attributes are hard to interpret
Feature Selection methods
Find the most relevant subset of attributes with respect to the class labels
Entropy analysis
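A sketch of such a feature transformation via SVD; it shows that each new attribute is a linear combination of all the originals, so nothing is removed.

```python
# Feature transformation with SVD/PCA: each new attribute is a linear
# combination of all original attributes, so none are removed.
import numpy as np

def pca_transform(X, n_components):
    Xc = X - X.mean(axis=0)                    # center the data
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T            # project onto top components

X = np.random.rand(100, 5)
Z = pca_transform(X, n_components=2)           # 2 attributes, each mixing all 5
```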

Subspace clustering searches for groups of clusters within different subspaces of the same data set
CLIQUE: CLustering In QUest
Dimension growth subspace clustering
Starts at 1-D and grows upwards to higher
dimensions
Partitions each dimension into a grid and determines whether each cell is dense
CLIQUE
Determines sparse and crowded units
Dense unit: the fraction of data points it contains exceeds an input threshold
Cluster: a maximal set of connected dense units
First partitions the d-dimensional space into non-overlapping rectangular units
Performed in 1-D
Based on the Apriori property: if a k-dimensional unit is dense, so are its projections in (k-1)-dimensional space
Search space size is reduced

Determines the maximal dense regions and generates a minimal description of them
Finds the subspace of highest dimensionality
Insensitive to the order of inputs
Performance depends on the grid size and the density threshold, which are difficult to determine across all dimensions
Several lower-dimensional subspaces will have to be processed
Can use an adaptive strategy
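A sketch of CLIQUE's bottom-up pass under assumed parameter names xi (grid size) and tau (density threshold): find dense 1-D units, then use the Apriori property to prune 2-D candidates.

```python
# Sketch of CLIQUE's bottom-up pass: dense 1-D units on a grid, then an
# Apriori-style join to 2-D. xi and tau are the two input parameters the
# slides mention; their names here are assumptions.
import numpy as np
from collections import Counter
from itertools import combinations

def dense_units(X, xi=10, tau=0.05):
    n, d = X.shape
    lo = X.min(axis=0)
    span = X.max(axis=0) - lo
    cell = ((X - lo) / (span + 1e-12) * xi).astype(int).clip(0, xi - 1)
    # 1-D pass: a unit (dim, interval) is dense if its fraction of points > tau
    dense1 = {(j, c) for j in range(d)
              for c, cnt in Counter(cell[:, j]).items() if cnt / n > tau}
    # Apriori join: a 2-D unit can be dense only if both 1-D projections are
    dense2 = set()
    for (j1, c1), (j2, c2) in combinations(sorted(dense1), 2):
        if j1 < j2:
            cnt = np.sum((cell[:, j1] == c1) & (cell[:, j2] == c2))
            if cnt / n > tau:
                dense2.add(((j1, c1), (j2, c2)))
    return dense1, dense2

X = np.random.rand(500, 3)
d1, d2 = dense_units(X)
print(len(d1), len(d2))   # uniform data: many dense 1-D units, few 2-D ones
```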
PROCLUS: PROjected CLUStering
Dimension-reduction Subspace Clustering technique
Finds initial approximation of clusters in high
dimensional space
Avoids generating a large number of overlapping lower-dimensional clusters
Finds best set of medoids by hill-climbing process
(Similar to CLARANS)
Uses the Manhattan segmental distance measure
Initialization phase
A greedy algorithm selects a set of initial medoids that are far apart
Iteration Phase
Selects a random set of k-medoids
Replaces bad medoids
For each medoid a set of dimensions is chosen
whose average distances are small
Refinement Phase
Computes new dimensions for each medoid based on the clusters found, reassigns points to medoids, and removes outliers
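A sketch of the Manhattan segmental distance, which restricts the Manhattan distance to a medoid's chosen dimension set and averages by its size:

```python
# Manhattan segmental distance: Manhattan distance restricted to a chosen
# dimension set D and normalized by |D|.
def manhattan_segmental(x, y, dims):
    return sum(abs(x[i] - y[i]) for i in dims) / len(dims)

print(manhattan_segmental([1.0, 5.0, 9.0], [2.0, 5.5, 0.0], dims=[0, 1]))  # 0.75
```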
Frequent Pattern based Clustering
Frequent patterns may also form clusters
Instead of growing clusters dimension by dimension, sets of frequent itemsets are determined
Two common techniques
Frequent term-based text Clustering
Clustering by Pattern similarity

Frequent-term-based text clustering
Text documents are clustered based on the frequent terms they contain
Documents are represented as sets of terms
Dimensionality is very high
Frequent-term-based analysis
A well-selected subset of the set of all frequent term sets must be discovered
Fi: a set of frequent term sets; cov(Fi): the set of documents covered by Fi
Requirement: cov(F1) ∪ ... ∪ cov(Fk) = D, and the overlap between the cov(Fi) must be minimized
Description of the clusters: their frequent term sets
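One greedy way to approach the coverage-with-minimal-overlap requirement is sketched below; the toy documents and candidate term sets are assumptions for illustration.

```python
# Greedy selection of frequent term sets: repeatedly pick the set whose
# cover adds the most uncovered documents, until cov(F1) u ... u cov(Fk) = D.
# The toy documents and term sets are illustrative assumptions.
def cov(term_set, docs):
    return {i for i, d in enumerate(docs) if term_set <= d}

def select_term_sets(freq_sets, docs):
    uncovered, chosen = set(range(len(docs))), []
    while uncovered:
        best = max(freq_sets, key=lambda F: len(cov(F, docs) & uncovered))
        gained = cov(best, docs) & uncovered
        if not gained:
            break                  # remaining documents match no term set
        chosen.append(best)
        uncovered -= gained
    return chosen

docs = [{"data", "mining"}, {"data", "cluster"}, {"web", "mining"}]
freq = [frozenset({"data"}), frozenset({"mining"}), frozenset({"cluster"})]
print(select_term_sets(freq, docs))
```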
Clustering by Pattern Similarity
pCluster: applied to DNA micro-array data analysis
In micro-array analysis, the expression levels of two genes may rise and fall synchronously in response to stimuli
Two objects are similar if they exhibit a coherent
pattern on a subset of dimensions
pCluster
Shift Pattern discovery
Euclidean distance not suitable
Derive new attributes
Bi-Clustering based on mean squared residue
score
pCluster
Objects x, y; attributes a, b
pScore of a 2 x 2 submatrix: pScore = |(d_xa - d_xb) - (d_ya - d_yb)|
A pair (O, T) forms a δ-pCluster if, for any 2 x 2 submatrix X in (O, T), pScore(X) ≤ δ
Each pair of objects and their features must satisfy the threshold
Scaling patterns fit the same framework after a logarithmic transformation of the data

The pCluster model can be used in other applications as well
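A sketch of the pScore test from the δ-pCluster definition above, with the data held in a dict keyed by (object, attribute):

```python
# pScore test for a delta-pCluster: every 2x2 submatrix of (objects, attrs)
# must satisfy the threshold. Data layout (dict keyed by pairs) is assumed.
from itertools import combinations

def p_score(d, x, y, a, b):
    return abs((d[(x, a)] - d[(x, b)]) - (d[(y, a)] - d[(y, b)]))

def is_delta_pcluster(d, objects, attrs, delta):
    return all(p_score(d, x, y, a, b) <= delta
               for x, y in combinations(objects, 2)
               for a, b in combinations(attrs, 2))

d = {("g1", "c1"): 2, ("g1", "c2"): 5, ("g2", "c1"): 4, ("g2", "c2"): 7}
print(is_delta_pcluster(d, ["g1", "g2"], ["c1", "c2"], delta=0.5))  # True: shift pattern
```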
