
A practical approach: Comparative study of clustering techniques (DATA MINING)

Dinesh Gomber
MCA Department

dinesh2_engg@pdm.ac.in

PDM University, Bahadurgarh

Abstract:-

Data mining is used to discover patterns and relationships in data in order to help make better business decisions. It is a powerful technique by which we can discover important information that was hidden in existing data and use that information to take good decisions. Clustering is a data mining technique in which groups, called clusters, are formed on the basis of similarity; images, word documents, and patterns can all be clustered. Many problems arise when choosing a clustering technique, because there are a large number of datasets and many algorithms exist. The main motive of this paper is to make this choice easy and effective, so that anybody can select the right technique and take the most benefit from data mining.

Introduction:-
Data mining is a process of using tools to extract useful knowledge from large datasets; data mining is an essential part of knowledge management.

Data mining is an essential step in the knowledge discovery in databases (KDD) process, which produces useful patterns or models from data [7]. The terms KDD and data mining are different: KDD refers to the overall process of discovering useful knowledge from data, while data mining refers to discovering new patterns from a wealth of data in databases by focusing on the algorithms used to extract useful knowledge.

Data mining is a series of tools and techniques for uncovering hidden patterns and relationships among data (Dunham, 2003). Educational data mining (EDM) is an emerging discipline that focuses on applying data mining tools and techniques to educationally related data (Baker & Yacef, 2009). Researchers within EDM focus on topics ranging from using data mining to improve institutional effectiveness to applying data mining to improve student learning processes. There is a wide range of topics within educational data mining, so this paper will focus exclusively on ways that data mining is used to improve student success and processes directly related to student learning. For example, student success, recommender systems, and the evaluation of student learning within a course are all topics within the broad field of educational data mining.

Data mining is also one step in an overall knowledge discovery process, where organizations want to discover new information from the data in order to aid in decision-making. There are a variety of different data mining techniques and approaches, such as clustering, classification, and association rule mining. Each of these approaches can be used to quantitatively analyze large data sets to find hidden meaning and patterns.

Main functions of data mining:

1. Classification is finding models that analyze and classify a data item into one of several predefined classes.

2. Regression is mapping a data item to a real-valued prediction variable.

3. Clustering is identifying a finite set of categories or clusters to describe the data.

4. Dependency Modeling (Association Rule Learning) is finding a model which describes significant dependencies between variables.

5. Deviation Detection (Anomaly Detection) is discovering the most significant changes in the data.

6. Summarization is finding a compact description for a subset of the data.

Data mining has two main objectives:

1. Prediction involves using some variables in the dataset to predict unknown values of other relevant variables, as in classification, regression, and anomaly detection.

2. Description involves finding human-understandable patterns and trends in the data, as in clustering, association rule mining, and summarization.

Related Work

Finally, [1] high dimensional data is only one issue that needs to be considered when performing cluster analysis. In closing, we mention some other, only partially resolved, issues in cluster analysis: scalability to large data sets, independence of the order of input, effective means of evaluating the validity of the clusters that are produced, easy interpretability of results, an ability to estimate any parameters required by the clustering technique, an ability to function in an incremental manner, and robustness in the presence of different underlying data and cluster characteristics.
A fundamental issue [2] related to clustering is its stability or consistency. A good clustering principle should result in a data partitioning that is stable with respect to perturbations in the data. We need to develop clustering methods that lead to stable solutions.

Comparison of the various clustering algorithms of the Weka tool

K-Means, a partitioning-based clustering algorithm, requires the number of final clusters (k) to be defined beforehand. Such algorithms also have problems like susceptibility [3] to local optima, sensitivity to outliers, memory space, and an unknown number of iteration steps required to cluster. The time complexity of the K-Means algorithm is O(ncdi) and the time complexity of the FCM algorithm is O(ndc²i), where n is the number of data points, c the number of clusters, d the number of dimensions, and i the number of iterations. From the obtained results we may conclude that the K-Means algorithm is better than the FCM algorithm: FCM produces results close to those of K-Means clustering, but it still requires more computation time than K-Means because of the fuzzy measure calculations involved in the algorithm (a small timing sketch is given at the end of this section).

Automatic and Visualization Approaches

Since clustering is a process of unsupervised learning, setting appropriate [4] parameters is a problem for many algorithms, and most clustering algorithms need some parameters. Although these may be straightforward to set in some cases, they are difficult to set in many environments. Furthermore, current cluster representation techniques can be easily understood only when the data is in low-dimensional space. Therefore, some algorithms are built for automatic clustering. Meanwhile, other efforts have been made to visualize the process of clustering, so that the user can set the parameters easily and the result can be more understandable.
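To make the K-Means/FCM comparison above concrete, here is a rough, self-contained timing sketch. This is not the experimental setup of [3]: the data is synthetic, the FCM loop is a standard textbook formulation written here for illustration, and the absolute times partly reflect the implementations (compiled scikit-learn vs. plain NumPy), so treat the output as illustrative of the extra per-iteration membership update only.

    # Rough timing sketch: K-Means (scikit-learn) vs. a small NumPy FCM loop.
    import time
    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(mu, 1.0, (1000, 2)) for mu in (-5, 0, 5)])

    def fcm(X, c, m=2.0, iters=100):
        u = rng.random((len(X), c))
        u /= u.sum(axis=1, keepdims=True)          # initial fuzzy memberships
        for _ in range(iters):
            um = u ** m
            centers = (um.T @ X) / um.sum(axis=0)[:, None]
            d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
            inv = d ** (-2.0 / (m - 1.0))          # the extra fuzzy update step
            u = inv / inv.sum(axis=1, keepdims=True)
        return u.argmax(axis=1)

    t0 = time.perf_counter()
    KMeans(n_clusters=3, n_init=1, max_iter=100, random_state=0).fit(X)
    print(f"K-Means: {time.perf_counter() - t0:.3f} s")

    t0 = time.perf_counter()
    fcm(X, c=3)
    print(f"FCM:     {time.perf_counter() - t0:.3f} s")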
Concept of Data Clustering

Clustering can be considered the most important unsupervised learning technique.

Clustering is "the process of organizing objects into groups whose members are similar in some way". A cluster is therefore a collection of objects which are "similar" to one another and "dissimilar" to the objects belonging to other clusters.

The goal of data clustering, also known as cluster analysis, is to discover the natural grouping(s) of a set of patterns, points, or objects: find groups based on a measure of similarity such that the similarities between objects in the same group are high while the similarities between objects in different groups are low. But what is the notion of similarity, and what is the definition of a cluster? Clusters can differ in terms of their shape, size, and density, and the presence of noise in the data makes their detection even more difficult. An ideal cluster can be defined as a set of points that is compact and isolated.

Clustering algorithms can be broadly divided into two groups: hierarchical and partitional. Hierarchical clustering algorithms recursively find nested clusters, either in agglomerative (bottom-up) mode (starting with each data point in its own cluster and merging the most similar pair of clusters successively to form a cluster hierarchy) or in divisive (top-down) mode (starting with all the data points in one cluster and recursively dividing each cluster into smaller clusters). Clustering is usually only one step in an analysis that typically involves other steps and techniques. A further difficulty is finding clusters in data where there are clusters of different shapes, sizes, and density, or where the data has lots of noise and outliers; these issues become more important in the context of high dimensionality data sets.

For high dimensional data, traditional clustering techniques have sometimes been used. For example, the K-means algorithm and agglomerative hierarchical clustering techniques [DJ88] have been used extensively for clustering document data. While K-means is efficient and often produces "reasonable" results, in high dimensions K-means still retains all of its low dimensional limitations, i.e., it has difficulty with outliers and does not do a good job when the clusters in the data are of different sizes, shapes, and densities.
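As a small illustration of this common practice, the sketch below (with toy documents invented here) clusters a handful of texts with K-means over a high-dimensional tf-idf representation using scikit-learn:

    # Sketch: K-means on a high-dimensional tf-idf representation of toy documents.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    docs = [
        "database systems store relational data",
        "sql queries retrieve data from databases",
        "neural networks learn from training data",
        "deep learning uses neural networks",
    ]

    X = TfidfVectorizer().fit_transform(docs)  # sparse, high-dimensional matrix
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    print(labels)  # e.g. [0 0 1 1]: database documents vs. learning documents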
Clustering Techniques-

K-means:

1. It accepts the number of clusters to group data into, and the dataset to cluster, as input values.

2. It then creates the first K initial clusters (K = number of clusters needed) from the dataset by choosing K rows of data randomly from the dataset. For example, if there are 10,000 rows of data in the dataset and 3 clusters need to be formed, then the first K = 3 initial clusters will be created by selecting 3 records randomly from the dataset as the initial clusters. Each of the 3 initial clusters formed will have just one row of data.

3. The K-Means algorithm calculates the Arithmetic Mean of each cluster formed in the dataset. The Arithmetic Mean of a cluster is the mean of all the individual records in the cluster. In each of the first K initial clusters, there is only one record, and the Arithmetic Mean of a cluster with one record is the set of values that make up that record. For example, suppose the dataset is a set of Height, Weight and Age measurements for students in a university, where a record P in the dataset S is represented as P = {Age, Height, Weight}. A record containing the measurements of a student John would be represented as John = {20, 170, 80}, where John's Age = 20 years, Height = 170 centimetres and Weight = 80 pounds. Since there is only one record in each initial cluster, the Arithmetic Mean of a cluster with only the record for John as a member = {20, 170, 80}.

4. Next, K-Means assigns each record in the dataset to only one of the initial clusters. Each record is assigned to the nearest cluster (the cluster which it is most similar to) using a measure of distance or similarity like the Euclidean Distance Measure or the Manhattan/City-Block Distance Measure.

5. K-Means re-assigns each record in the dataset to the most similar cluster and re-calculates the arithmetic mean of all the clusters in the dataset. The arithmetic mean of a cluster is the arithmetic mean of all the records in that cluster. For example, if a cluster contains the two records John = {20, 170, 80} and Henry = {30, 160, 120}, then the arithmetic mean Pmean is represented as Pmean = {Agemean, Heightmean, Weightmean}, where Agemean = (20 + 30)/2, Heightmean = (170 + 160)/2 and Weightmean = (80 + 120)/2. The arithmetic mean of this cluster is therefore {25, 165, 100}. This new arithmetic mean becomes the center of this new cluster. Following the same procedure, new cluster centers are formed for all the existing clusters.

6. K-Means re-assigns each record in the dataset to only one of the new clusters formed. A record or data point is assigned to the nearest cluster (the cluster which it is most similar to) using a measure of distance or similarity.

7. The preceding steps are repeated until stable clusters are formed and the K-Means clustering procedure is completed. Stable clusters are formed when new iterations or repetitions of the K-Means clustering algorithm do not create new clusters, i.e., the cluster center or Arithmetic Mean of each cluster formed is the same as the old cluster center. There are different techniques for determining when a stable cluster is formed or when the K-Means clustering algorithm procedure is completed.
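For readers who want to try steps 1-7 end to end, here is a minimal sketch using scikit-learn's KMeans on the Age/Height/Weight example above (all records besides John and Henry are invented for illustration). init="random" mirrors steps 1-2, and the fitted cluster centers are the final arithmetic means of step 7.

    # Minimal K-Means sketch for the student-records example (illustrative data).
    import numpy as np
    from sklearn.cluster import KMeans

    # Each row is one student record: {Age, Height, Weight}
    S = np.array([
        [20, 170, 80],   # John
        [30, 160, 120],  # Henry
        [21, 172, 78],
        [29, 158, 118],
        [45, 182, 95],
        [44, 180, 99],
    ])

    # Quick check of the step-5 arithmetic for the John/Henry cluster:
    print((S[0] + S[1]) / 2)  # -> [ 25. 165. 100.]

    # init="random" picks K rows of the data as initial centers (steps 1-2);
    # fit() then alternates steps 3-6 until the centers stop changing (step 7).
    km = KMeans(n_clusters=3, init="random", n_init=1, random_state=0).fit(S)

    print("cluster of each record:", km.labels_)
    print("final centers (arithmetic means):")
    print(km.cluster_centers_)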
be lost.

The Algorithm - Given a set of numeric points in d-dimensional space and an integer k, the algorithm generates k (or fewer) clusters as follows:-

1. Assign all points to a cluster at random.
2. Repeat until stable:
   3. Compute the centroid for each cluster.
   4. Reassign each point to the nearest centroid.

Step 1: Make random assignments and compute centroids.
Step 2: Assign points to the nearest centroids.
Step 3: Re-compute centroids (in this example, the solution is now stable).
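The four-step procedure above can be transcribed almost line for line. The NumPy sketch below does so; the re-seeding of empty clusters is our own small addition, since the pseudo-code does not say what to do when a cluster loses all of its points.

    # Sketch: k-means starting from a random partition, as in steps 1-4 above.
    import numpy as np

    def kmeans_random_partition(X, k, seed=0):
        rng = np.random.default_rng(seed)
        labels = rng.integers(0, k, size=len(X))      # 1. random assignment
        while True:                                   # 2. repeat until stable
            centroids = np.array([
                X[labels == j].mean(axis=0) if np.any(labels == j)
                else X[rng.integers(len(X))]          # re-seed an empty cluster
                for j in range(k)])                   # 3. centroid per cluster
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            new_labels = dists.argmin(axis=1)         # 4. nearest centroid
            if np.array_equal(new_labels, labels):
                return labels, centroids
            labels = new_labels

    X = np.vstack([np.random.default_rng(1).normal(m, 0.5, (50, 2))
                   for m in (0, 5, 10)])
    labels, centroids = kmeans_random_partition(X, k=3)
    print(centroids)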
K-means: Weaknesses:-

• Must choose the parameter k in advance, or try many values.
• Data must be numerical and must be compared via Euclidean distance (there is a variant called the k-medians algorithm that addresses these concerns).
• The algorithm works best on data which contains spherical clusters; clusters with other geometry may not be found.
• The algorithm is sensitive to outliers, i.e., points which do not belong in any cluster. These can distort the centroid positions and ruin the clustering.
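The outlier weakness is easy to demonstrate. In the sketch below (synthetic data), a single extreme point captures one of the two centroids, forcing the two genuine clusters to merge:

    # Sketch: one extreme point captures a centroid and merges the real clusters.
    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 0.5, (20, 2)),   # genuine cluster A
                   rng.normal(5, 0.5, (20, 2)),   # genuine cluster B
                   [[100.0, 100.0]]])             # a single outlier
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    print(km.cluster_centers_)  # one center sits on the outlier; A and B merge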
Hierarchical Clustering: The Algorithm:-

• Hierarchical clustering takes as input a set of points.
• It creates a tree in which the points are leaves and the internal nodes reveal the similarity structure of the points. The tree is often called a "dendrogram."
• The method is summarized below:
   o Place all points into their own clusters.
   o While there is more than one cluster, merge the closest pair of clusters.
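A short sketch of this procedure using SciPy's single-link agglomerative clustering; the linkage matrix Z encodes the dendrogram and, as noted under the weaknesses below, the user still has to decide where to cut it:

    # Sketch: single-link hierarchical clustering with SciPy.
    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 0.3, (10, 2)),
                   rng.normal(3, 0.3, (10, 2))])

    Z = linkage(X, method="single")                  # repeatedly merge the closest pair
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 groups
    print(labels)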

Hierarchical Clustering: Weaknesses

• The most commonly used type, single-link clustering, is particularly greedy: if two points from disjoint clusters happen to be near each other, the distinction between the clusters will be lost.
• On the other hand, average- and complete-link clustering methods are biased towards spherical clusters in the same way as k-means.
• It does not really produce clusters; the user must decide where to split the tree into groups. Some automated tools exist for this.
• As with k-means, it is sensitive to noise and outliers.

Density-Based Clustering Methods:

• Clustering based on density (a local cluster criterion), such as density-connected points.
• Major features:
   o Discover clusters of arbitrary shape
   o Handle noise
   o One scan
   o Need density parameters as a termination condition
• Two parameters:
   o Eps: the maximum radius of the neighborhood
   o MinPts: the minimum number of points in an Eps-neighborhood of that point
• Core object: an object with at least MinPts objects within its Eps-neighborhood.
• Border object: an object that lies on the border of a cluster.
• Directly density-reachable: a point p is directly density-reachable from a point q w.r.t. Eps, MinPts if (see the sketch after this list):
   o p belongs to NEps(q), and
   o the core point condition holds: |NEps(q)| >= MinPts.
• Density-reachable: a point p is density-reachable from a point q w.r.t. Eps, MinPts if there is a chain of points p1, ..., pn with p1 = q and pn = p such that pi+1 is directly density-reachable from pi.
• Density-connected: a point p is density-connected to a point q w.r.t. Eps, MinPts if there is a point o such that both p and q are density-reachable from o w.r.t. Eps and MinPts.
• DBSCAN relies on a density-based notion of cluster: a cluster is defined as a maximal set of density-connected points.
• It discovers clusters of arbitrary shape in spatial databases with noise.
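The "directly density-reachable" definition translates directly into code. The helper below (a hypothetical name, written here for illustration) checks its two conditions on an array of points:

    # Illustrative helper: test the two "directly density-reachable" conditions.
    import numpy as np

    def directly_density_reachable(X, p, q, eps, min_pts):
        # N_Eps(q): indices of all points within distance Eps of point q
        n_eps_q = np.where(np.linalg.norm(X - X[q], axis=1) <= eps)[0]
        belongs = p in n_eps_q               # p belongs to N_Eps(q)
        is_core = len(n_eps_q) >= min_pts    # core condition: |N_Eps(q)| >= MinPts
        return belongs and is_core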

The DBSCAN algorithm:

• Arbitrarily select a point p.
• Retrieve all points density-reachable from p w.r.t. Eps and MinPts.
• If p is a core point, a cluster is formed.
• If p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the database.
• Continue the process until all of the points have been processed.
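The procedure is implemented, for example, in scikit-learn. In this minimal sketch (synthetic data), eps and min_samples play the roles of Eps and MinPts, and label -1 marks noise points that are neither core nor border objects:

    # Sketch: DBSCAN via scikit-learn; eps/min_samples correspond to Eps/MinPts.
    import numpy as np
    from sklearn.cluster import DBSCAN

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 0.2, (30, 2)),
                   rng.normal(4, 0.2, (30, 2)),
                   [[10.0, 10.0]]])           # an isolated point

    db = DBSCAN(eps=0.5, min_samples=5).fit(X)
    print(db.labels_)  # cluster ids; -1 marks noise (the isolated point)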
Challenges in Clustering

• Similarity calculation:
   o The results of clustering algorithms depend entirely on the similarity measure used.
   o Clustering systems provide little guidance on how to pick the similarity measure.
   o Computing the similarity of mixed-type data is hard.
   o Similarity is very dependent on data representation. Should one normalize? Represent one's data numerically, categorically, etc.? Cluster on only a subset of the data? The computer should do more to help the user figure this out!
• Parameter selection:
   o Current algorithms require too many arbitrary, user-specified parameters.

Conclusion-

• Clustering is a useful way of exploring data, but it is still very ad hoc.
• Good results are often dependent on choosing the right data representation and similarity metric:
   o Data: categorical, numerical, boolean
   o Similarity: distance, correlation, etc.
• There are many different choices of algorithm (k-means, hierarchical, graph partitioning, etc.), each with different strengths and weaknesses.

References:-

[1] Michael Steinbach, Levent Ertöz, and Vipin Kumar, "The Challenges of Clustering High Dimensional Data".

[2] Anil K. Jain, "Data clustering: 50 years beyond K-means", Department of Computer Science and Engineering, Michigan State University, East Lansing, Michigan 48824, USA, and Department of Brain and Cognitive Engineering, Korea University, Anam-dong, Seoul, 136-713, Korea.

[3] Soumi Ghosh and Sanjay Kumar Dubey, "Comparative Analysis of K-Means and Fuzzy C-Means Algorithms", Department of Computer Science and Engineering, Amity University, Uttar Pradesh, Noida, India.

[4] QIAN Wei-ning and ZHOU Ao-ying, "Analyzing Popular Clustering Algorithms from Different Viewpoints", Department of Computer Science, Fudan University, Shanghai 200433, China; Laboratory for Intelligent Information Processing, Fudan University, Shanghai 200433, China.
