Introduction
"Knowledge is good only if it is shared. I hope this guide will help those who
are finding their way around, just like me."
Clustering analysis has been an emerging research issue in data mining due to
its wide variety of applications. The advent of many data clustering
algorithms in recent years, and their extensive use in applications including
image processing, computational biology, mobile communication, medicine, and
economics, has led to their popularity. The main problem with data clustering
algorithms is that they cannot be standardized: an algorithm may give the best
result with one type of data set but fail, or give poor results, with data
sets of other types. Although there have been many attempts to design a
standard algorithm that performs well in all scenarios, no major breakthrough
has been achieved so far. Many clustering algorithms have been proposed, but
each has its own merits and demerits and cannot work in all real situations.
Before exploring the various clustering algorithms in detail, let us take a
brief look at what clustering is.
Clustering is a process that partitions a given data set into homogeneous
groups based on given features, such that similar objects are kept in the same
group while dissimilar objects fall into different groups. It is the most
important unsupervised learning problem: it deals with finding structure in a
collection of unlabeled data. For a better understanding, please refer to Fig I.
Fig I: showing four clusters formed from the set of unlabeled data
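The grouping idea can be made concrete with a tiny sketch. This is a toy
illustration only: the points and the two "centers" below are hypothetical
values chosen by hand for the demo, not part of any real algorithm yet. Points
near each other receive the same label; distant points receive different labels.

```python
from math import dist

# Toy illustration of the clustering idea: similar (nearby) objects end up
# in the same group, dissimilar (distant) objects in different groups.
# The points and centers are made up for this demo.
points = [(1.0, 1.2), (0.8, 1.0), (1.1, 0.9),   # a tight group near (1, 1)
          (5.0, 5.1), (5.2, 4.9), (4.8, 5.0)]   # a tight group near (5, 5)
centers = [(1.0, 1.0), (5.0, 5.0)]

def nearest_center(p, centers):
    """Index of the center closest to point p (Euclidean distance)."""
    return min(range(len(centers)), key=lambda j: dist(p, centers[j]))

labels = [nearest_center(p, centers) for p in points]
print(labels)   # → [0, 0, 0, 1, 1, 1]
```

The first three points form one cluster and the last three the other, purely
because of their mutual distances.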
Fig II: an example where poor scalability may lead to a wrong result
A good clustering algorithm should satisfy the following requirements:
1) Clustering algorithm must be scalable; otherwise we may get a wrong result
(see Fig II).
2) Clustering algorithm must be able to deal with different types of attributes.
3) Clustering algorithm must be able to find clustered data with the arbitrary
shape.
4) Clustering algorithm must be insensitive to noise and outliers.
5) Interpretability and usability - the results obtained must be interpretable
and usable so that maximum knowledge about the input parameters can be
obtained.
6) Clustering algorithm must be able to deal with data set of high
dimensionality.
Clustering algorithms can be broadly classified into two categories:
1) Unsupervised linear clustering algorithms and
2) Unsupervised non-linear clustering algorithms
I. Unsupervised linear clustering algorithms
k-means clustering algorithm
Fuzzy c-means clustering algorithm
Hierarchical clustering algorithm
Gaussian (EM) clustering algorithm
Quality threshold clustering algorithm
II. Unsupervised non-linear clustering algorithms
MST based clustering algorithm
Kernel k-means clustering algorithm
Density based clustering algorithm
References:
1) Data Clustering: A Review by A.K. Jain, M.N. Murty and P.J. Flynn.
2) http://home.dei.polimi.it/matteucc/Clustering/tutorial_html
3) Introduction to Clustering Techniques by Leo Wanner.
4) A Comprehensive Overview of Basic Clustering Algorithms by Glenn Fung.
5) Survey of Clustering Algorithms by Rui Xu and Donald Wunsch.
where,
||xi - vj|| is the Euclidean distance between xi and vj,
ci is the number of data points in the ith cluster, and
c is the number of cluster centers.
5) Recalculate the distance between each data point and the newly obtained
cluster centers.
6) If no data point was reassigned, stop; otherwise repeat from step 3).
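The iterative procedure above can be sketched in a few lines of Python. This
is a minimal, illustrative implementation under stated assumptions (the toy
data set, the fixed random seed, and the 100-iteration cap are all choices
made for the demo), not an optimized or definitive one.

```python
import random
from math import dist

def kmeans(points, c, iters=100, seed=0):
    """Minimal k-means sketch following the steps in the text."""
    rng = random.Random(seed)
    centers = rng.sample(points, c)          # pick c cluster centers at random
    for _ in range(iters):
        # assign each data point to its nearest center (Euclidean distance)
        clusters = [[] for _ in range(c)]
        for p in points:
            j = min(range(c), key=lambda m: dist(p, centers[m]))
            clusters[j].append(p)
        # recompute each center as the mean of its cluster:
        # v_i = (1 / c_i) * sum of the points assigned to cluster i
        new_centers = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        # if no center (hence no assignment) changed, we have converged
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters

# toy usage: two well-separated groups of three points each
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, clusters = kmeans(pts, 2)
```

On this well-separated toy data the loop recovers the two groups and places
each center at the mean of its three points.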
Advantages
1) Fast, robust, and easy to understand.
2) Relatively efficient: O(tknd), where n is the number of objects, k the
number of clusters, d the dimension of each object, and t the number of
iterations; normally k, t, d << n.
3) Gives the best result when the data sets are distinct or well separated
from each other.
Disadvantages
2) The use of exclusive assignment - if two data sets overlap heavily, k-means
will not be able to resolve that there are two clusters.
3) The learning algorithm is not invariant to non-linear transformations,
i.e. different representations of the data give different results (data
represented in Cartesian coordinates and in polar coordinates will yield
different clusterings).
4) Euclidean distance measures can weight underlying factors unequally.
5) The learning algorithm finds only a local optimum of the squared error
function.
6) Randomly choosing the cluster centers may not lead to a fruitful result.
Please refer to the figure below.
Fig II: A non-linear data set on which the k-means algorithm fails
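The representation-dependence noted in point 3 can be checked directly: the
same pair of points is a different Euclidean distance apart when expressed in
Cartesian versus polar coordinates, so distance-based cluster assignments can
differ too. A small sketch (the two coordinate values are chosen for
illustration):

```python
from math import atan2, hypot, sqrt, pi

# Two points on the unit circle, in Cartesian coordinates.
a, b = (1.0, 0.0), (0.0, 1.0)

def to_polar(p):
    """Convert a Cartesian point (x, y) to polar form (r, theta)."""
    x, y = p
    return (hypot(x, y), atan2(y, x))

# Euclidean distance in the Cartesian representation.
cart_dist = sqrt((a[0] - b[0])**2 + (a[1] - b[1])**2)

# Euclidean distance between the same points in the polar representation.
ap, bp = to_polar(a), to_polar(b)          # (1, 0) and (1, pi/2)
polar_dist = sqrt((ap[0] - bp[0])**2 + (ap[1] - bp[1])**2)

print(cart_dist)    # sqrt(2), about 1.414
print(polar_dist)   # pi/2, about 1.571 - a different value
```

Because the pairwise distances change with the representation, k-means run on
the polar form of a data set can produce a different partition than k-means
run on the Cartesian form.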
References
1) An Efficient k-means Clustering Algorithm: Analysis and Implementation
by Tapas Kanungo, David M. Mount, Nathan S. Netanyahu, Christine D.
Piatko, Ruth Silverman and Angela Y. Wu.
2) Research issues on K-means Algorithm: An Experimental Trial Using
Matlab by Joaquin Perez Ortega, Ma. Del Rocio Boone Rojas and Maria J.
Somodevilla Garcia.
3) The k-means algorithm - Notes by Tan, Steinbach, Kumar and Ghosh.
4) http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/kmeans.html
5) k-means Clustering by Ke Chen.