(IJAEST) International Journal of Advanced Engineering Sciences and Technologies, Vol. 2, Issue 2, pp. 152-156, ISSN 2230-7818

Representative Based Method of Categorical Data Clustering

G. Hanumantha Rao*, G. Narender+, T. Balaji*, Y. Anitha*

* Assistant Professor, Department of CSE, Vasavi College of Engineering, Hyderabad-500 031, INDIA
+ Associate Professor, Department of CSE, Murthy Institute of Tech. & Science, Ranga Reddy-501301, INDIA
hanu.abc@gmail.com, guggillanarender@gmail.com, balaji_075@yahoo.co.in, anitha_yella@yahoo.co.in

Abstract— Clustering of categorical (unordered) data is one of the data mining techniques that helps in identifying clusters within the domain space. The objective of clustering is the grouping of the most similar objects; here we group objects with minimum dissimilarity, so that the most similar objects fall in the same cluster. In this paper we present the proposed method and experiment with well-known data sets from the UCI data repository, taking the soybean dataset as an example. The data set consists of 47 records, and each record contains 35 attributes describing the features of plants with four classes of disease. There are three phases in this method. The dissimilarity matrix, the neighbor matrix and the initial clusters are formed in the first phase. Clusters are merged in the second phase by relocating objects using the neighborhood concept. In the third phase, the mode of the attributes of each cluster is computed, and Phase I and Phase II are applied to the tuples formed from these representatives.

Keywords— Data Mining, Clustering, Categorical data, Dissimilarity, Mode.
I. INTRODUCTION
Data mining [1] involves the use of sophisticated data analysis tools to discover previously unknown, valid patterns and relationships in large data sets [6]. Data mining is a technique to extract valid, novel, potentially useful patterns and information from complex and huge data sets [2]. The two fundamental goals of data mining are prediction and description, otherwise known as the verification model and the discovery model. Some important data mining tasks are association rules, classification rules and clustering.

There are two major types of prediction: one can either try to predict some unavailable data values or pending trends, or predict a class label for some data. Prediction, however, more often refers to the forecast of missing numerical values, or of increase/decrease trends in time-related data; the major idea is to use a large number of past values to estimate probable future values. Description presents the data in terms of human-interpretable patterns. Association rule mining discovers relations between variables in large databases [7]. Data classification is the categorization of data for its most effective and efficient use [8].

A data set (or dataset) is a collection of data, usually presented in tabular form. Each column represents a particular variable, and each row corresponds to a given member of the data set in question [13], listing its values for each of the variables, such as the height and weight of an object. Each value is known as a datum. The data set may comprise data for one or more members, corresponding to the number of rows.

II. OVERVIEW OF CLUSTERING

A cluster is a collection of data objects that are more similar to one another within the same cluster and dissimilar to the objects in other clusters. Unsupervised learning deals with instances which have not been pre-classified in any way, so they do not have a class attribute associated with them. Clustering is a process of grouping data into groups based on a similarity measure, and it also serves as a preprocessing step for other tasks such as characterization and classification.

Clustering in data mining is a discovery process that groups a set of data such that the intra-cluster similarity is maximized and the inter-cluster similarity is minimized. In general, clustering is divided into two broad categories, viz. hierarchical and partitional. A partitional clustering technique partitions the database into a pre-defined number of clusters based on some criterion. Hierarchical clustering is divided into agglomerative and divisive: agglomerative clustering [3] follows the bottom-up strategy, whereas the divisive approach follows the top-down strategy.

The basic principle of clustering hinges on the concept of a distance metric. Since the data are invariably real numbers in statistical applications and pattern recognition, a large class of metrics exists, and one can define one's own metric according to the specific requirements.

Data mining primarily works with large databases. The objects in the database have attributes of various data types, and the values may be either numeric or non-numeric. Clustering can be performed on both numerical and categorical data. For clustering numerical data, geometric properties such as a distance function are used as the criterion.


As data clustering is mostly applied to real-time or transactional data sets, the attributes are of both numerical and categorical type.

Data objects are clustered based on a similarity measurement, and the most common similarity measurement is the distance function: the similarity between two objects Oi and Oj is measured using the Euclidean or Manhattan distance. Clustering categorical data is very different from clustering numerical data in terms of how the similarity measure is defined, because distance-based metrics cannot be used to cluster categorical data. Numerical clustering methods can be applied to categorical data through data preprocessing [4], but these preprocessing techniques do not always produce quality clusters, so it is widely accepted to apply clustering to the raw categorical data. Here we use a similarity concept as the measurement; to maximize the intra-cluster similarity, the minimum-dissimilarity concept is used.
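The paper does not write this dissimilarity out as a formula; a minimal sketch of the simple-matching dissimilarity commonly used for categorical data, which also reproduces the worked example in Section IV-B, is given below (the function name is ours):

```python
def dissimilarity(a, b):
    """Simple-matching dissimilarity between two categorical tuples:
    the number of attribute positions in which they differ."""
    assert len(a) == len(b)
    return sum(1 for x, y in zip(a, b) if x != y)

# Two objects described by (graduation, branch, percentage):
print(dissimilarity(("b-tech", "cse", "70%"), ("b-tech", "ece", "70%")))  # -> 1
```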
III. RELATED WORKS

The basic idea of clustering is grouping together similar objects [15]. A few existing categorical clustering algorithms are discussed here. The k-means problem [14] is based on a simple iterative scheme for finding a locally minimal solution: given a set of N points X(I) in M dimensions, the goal is to arrange these points into K clusters, with each cluster having a representative point Z(J), usually chosen as the centroid of the points in the cluster. This algorithm is often called the k-means algorithm [9]. The K-prototypes algorithm is based on K-means [5] but removes the numeric-data limitation; it is applicable to both numeric and categorical data. The following are the steps of the K-modes algorithm [11].

Given a set X of categorical data vectors in P^m and a specified number k of desired clusters, do:

Step 1: Begin with an initial partition X = X1(0) ∪ X2(0) ∪ … ∪ Xk(0).

Step 2: Update the modes of each cluster to obtain C1(t), C2(t), …, Ck(t).

Step 3: Re-test the similarity of all data vectors from cluster to cluster with each mode vector in the following way: if a vector from Xi(t) is found to be strictly nearer to Cj(t) than to the current Ci(t), reallocate that vector to the cluster Xj(t+1) to obtain a new partition X = X1(t+1) ∪ X2(t+1) ∪ … ∪ Xk(t+1). Notice that ties here are biased so that the mode of a data vector's current cluster is preferred.

Step 4: Go back to Step 2 and repeat the iteration until no object has changed cluster assignment after a full cycle through the whole data set, that is, Xi(t+1) = Xi(t) for all i = 1, …, k.
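As an illustration only (this is not the authors' code), a compact Python sketch of these four steps might look as follows; the helper names and the random initial partition are our own choices:

```python
import random
from collections import Counter

def matching_dissim(a, b):
    # Simple-matching dissimilarity: the number of differing attributes.
    return sum(x != y for x, y in zip(a, b))

def mode_vector(rows):
    # Column-wise mode: the most frequent value of each attribute.
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))

def k_modes(data, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    labels = [rng.randrange(k) for _ in data]  # Step 1: random initial partition
    modes = []
    for _ in range(max_iter):
        # Step 2: update the mode of each cluster (random row if empty).
        modes = [mode_vector([x for x, l in zip(data, labels) if l == j]
                             or [rng.choice(data)])
                 for j in range(k)]
        # Step 3: reallocate each vector to its nearest mode; on ties the
        # current cluster's mode is preferred, as the paper notes.
        new_labels = [min(range(k),
                          key=lambda j: (matching_dissim(x, modes[j]), j != l))
                      for x, l in zip(data, labels)]
        if new_labels == labels:  # Step 4: stop when no object moves
            break
        labels = new_labels
    return labels, modes
```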
K-means is one of the simplest unsupervised learning algorithms [12]. K-means clustering is a method of cluster analysis which aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean. The procedure follows a simple and easy way to classify a given data set through a certain number of clusters (assume k clusters) fixed a priori.

The following are the steps of the K-means algorithm [10].

Step 1: Place K points into the space represented by the objects that are being clustered; these points represent the initial group centroids (i.e., start with a random partition into K clusters).

Step 2: Generate a new partition by assigning each pattern to its closest cluster center.

Step 3: Compute new cluster centers as the centroids of the clusters.

Step 4: Steps 2 and 3 are repeated until there is no change in membership (the cluster centers remain the same).
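For concreteness, a minimal NumPy sketch of these steps (ours, for illustration; numerical data only) could be:

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: choose K of the objects as the initial group centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each pattern to its closest cluster center.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute the centers as the centroids of the clusters.
        new_centroids = np.array([X[labels == j].mean(axis=0)
                                  if np.any(labels == j) else centroids[j]
                                  for j in range(k)])
        # Step 4: stop once membership (and hence the centers) is stable.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```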
The K-modes algorithm is thus an extension of K-means that clusters categorical data, with the means replaced by modes. The K-representative algorithm is another extension of K-means, using relative frequencies as the measure to cluster categorical data.
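A sketch (ours, not the published algorithm) of this relative-frequency idea: a cluster is summarized by the relative frequency of each attribute value, and an object is close to a cluster when its values are frequent there. All names here are illustrative assumptions:

```python
from collections import Counter

def representative(cluster_rows):
    # One relative-frequency table per attribute,
    # e.g. {"cse": 0.75, "ece": 0.25} for a branch attribute.
    n = len(cluster_rows)
    return [{v: c / n for v, c in Counter(col).items()}
            for col in zip(*cluster_rows)]

def dissim_to_representative(x, rep):
    # Sum over attributes of (1 - relative frequency of x's value).
    return sum(1.0 - freqs.get(v, 0.0) for v, freqs in zip(x, rep))
```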
We now introduce a new representative based algorithm.

IV. PROPOSED ALGORITHM

Let T be the set of n objects having categorical data with m attributes. The proposed algorithm is divided into three phases. The process of clustering is carried out in the first phase. Merging of clusters is performed in the second phase. The mode of the attributes of each cluster is generated in the third phase, and Phase I and Phase II are then applied to the dataset obtained in the third phase.

A. Proposed Representative Based Algorithm

Let the object list = [1, 2, 3, …, n], where
n = number of tuples/records,
m = number of attributes.

Phase I

The steps involved in this phase are detailed below:

Step 1: Construct a dissimilarity matrix 'd' using the measurement.


Step 2: Compute the threshold value and the minimum dissimilarity of each object.

Step 3: Construct the neighbor matrix 'neigh'.

Step 4: Select the first member of the object list and form a new cluster with this object as a member. Group the neighbors of the object into this cluster based on the criteria, and remove the clustered objects from the object list.

Step 5: Repeat the above step until the object list becomes empty.
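The paper states the threshold and grouping criteria informally. The following sketch is our interpretation, in which the neighbors of an object are the objects at its minimum dissimilarity; this reproduces Tables 2-4 of the illustration below, and the threshold test is omitted:

```python
def phase_one(data, dissim):
    """Phase I sketch: dissimilarity matrix, neighbor matrix, initial
    clusters. 'dissim' is a pairwise measure such as simple matching."""
    n = len(data)
    # Step 1: dissimilarity matrix d.
    d = [[dissim(data[i], data[j]) for j in range(n)] for i in range(n)]
    # Step 2: minimum dissimilarity of each object (self excluded).
    m = [min(d[i][j] for j in range(n) if j != i) for i in range(n)]
    # Step 3: neighbor matrix 'neigh' -- objects at minimum dissimilarity.
    neigh = [{j for j in range(n) if j != i and d[i][j] == m[i]}
             for i in range(n)]
    # Steps 4-5: take the first remaining object, cluster it with its
    # still-unclustered neighbors, and repeat until the list is empty.
    object_list, clusters = list(range(n)), []
    while object_list:
        o = object_list[0]
        cluster = {o} | (neigh[o] & set(object_list))
        clusters.append(cluster)
        object_list = [x for x in object_list if x not in cluster]
    return d, neigh, clusters
```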

Phase II

The steps involved in merging clusters are detailed below:

Step 1: Select the cluster with the least number of objects.

Step 2: The objects in the selected cluster are relocated based on the cluster merging criteria.

Step 3: The above steps are repeated until no more merging is possible.
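The cluster merging criteria are likewise left informal. One reading that reproduces the illustration below (object 5 moves to the cluster containing its nearest neighbor, object 2) is sketched here; the criterion and all names are our assumptions:

```python
def phase_two(d, clusters, max_rounds=100):
    """Phase II sketch: repeatedly try to dissolve the smallest cluster by
    relocating each of its objects to the cluster of its nearest neighbor,
    provided that neighbor is at least as close as the object's own cluster."""
    clusters = [set(c) for c in clusters]
    for _ in range(max_rounds):
        if len(clusters) <= 1:
            break
        smallest = min(clusters, key=len)
        rest = [c for c in clusters if c is not smallest]
        moved = False
        for o in sorted(smallest):
            nearest = min((x for c in rest for x in c), key=lambda x: d[o][x])
            own = min((d[o][x] for x in smallest if x != o),
                      default=float("inf"))
            if d[o][nearest] <= own:
                next(c for c in rest if nearest in c).add(o)
                smallest.discard(o)
                moved = True
        clusters = [c for c in clusters if c]  # drop emptied clusters
        if not moved:                          # no relocation possible: stop
            break
    return clusters
```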
Phase III

Compute the mode of each column (attribute) over all objects in each cluster, where the attribute value with the maximum frequency is defined as the mode of an attribute. If the number of clusters produced in Phase II is K1, then this phase results in K1 tuples. Consider these as a dataset of K1 tuples with m attributes and repeat Phase I and Phase II on it.
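A small sketch of this representative-building step (again ours, not the authors' code):

```python
from collections import Counter

def cluster_modes(data, clusters):
    """Phase III sketch: one representative tuple per cluster, formed by
    the most frequent value (mode) of every attribute in that cluster."""
    reps = []
    for cluster in clusters:
        rows = [data[i] for i in sorted(cluster)]
        reps.append(tuple(Counter(col).most_common(1)[0][0]
                          for col in zip(*rows)))
    return reps  # K1 tuples with m attributes; re-run Phases I and II on these
```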
B. Illustration

The proposed algorithm is illustrated with the sample dataset given in Table 1. The sample dataset consists of 7 tuples/objects with 4 attributes. The dissimilarity matrix, computed over the first three attributes (the EMAIL-ID attribute is not used in the dissimilarity computation), is depicted in Table 2; here n = 7 and m = 3.

TABLE-1 SAMPLE DATA SET

GRADUATION | BRANCH | PERCENTAGE | EMAIL-ID
b-tech     | cse    | 70%        | yes
b-tech     | ece    | 70%        | no
m-tech     | eee    | 60%        | no
m-tech     | eee    | 55%        | no
b-tech     | ece    | 65%        | no
b-tech     | cse    | 80%        | yes
b-tech     | cse    | 75%        | yes

TABLE-2 DISSIMILARITY MATRIX

OBJECT | 1 | 2 | 3 | 4 | 5 | 6 | 7
1      | - | 1 | 3 | 3 | 2 | 1 | 1
2      | 1 | - | 3 | 3 | 1 | 2 | 2
3      | 3 | 3 | - | 1 | 3 | 3 | 3
4      | 3 | 3 | 1 | - | 3 | 3 | 3
5      | 2 | 1 | 3 | 3 | - | 2 | 2
6      | 1 | 2 | 3 | 3 | 2 | - | 1
7      | 1 | 2 | 3 | 3 | 2 | 1 | -
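As a quick consistency check (ours, not from the paper), applying the simple-matching count to the first three columns of Table 1 reproduces Table 2:

```python
rows = [  # Table 1 without the EMAIL-ID column
    ("b-tech", "cse", "70%"), ("b-tech", "ece", "70%"),
    ("m-tech", "eee", "60%"), ("m-tech", "eee", "55%"),
    ("b-tech", "ece", "65%"), ("b-tech", "cse", "80%"),
    ("b-tech", "cse", "75%"),
]
d = [[sum(x != y for x, y in zip(a, b)) for b in rows] for a in rows]
for i, row in enumerate(d, start=1):
    print(i, row)  # row i matches row i of Table 2 (diagonal shown as 0)
```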

TABLE-3 NEIGHBOR MATRIX

OBJECT | NEIGHBORS
1      | 2, 6, 7
2      | 1, 5
3      | 4
4      | 3
5      | 2
6      | 1, 7
7      | 1, 6

The minimum dissimilarity value of each object is m(.) = [1, 1, 1, 1, 1, 1, 1], and the neighbors 'neigh' of each object are given in Table 3. The object list is initialized with the values 1, 2, …, n, where n is the number of objects. Consider the first object O1 = 1: neigh(O1) is {2, 6, 7}, so objects 2, 6 and 7 are grouped with it into cluster C1 = {1, 2, 6, 7}. Now the object list contains the objects 3, 4 and 5. In the next iteration object 3 is selected and C2 is formed with object 3 as a member. Repeating the process, at the end of Phase I we get 3 clusters, which are given in Table 4.

TABLE-4 RESULTANT CLUSTERS AFTER PHASE I

CLUSTER NUMBER | OBJECTS
1              | 1, 2, 6, 7
2              | 3, 4
3              | 5

In Phase II, the cluster with the least number of objects, C3, is selected. Using the cluster merging criteria, object 5 is placed in cluster C1. Thus 3 clusters are reduced to 2 clusters, viz. C1 = {1, 2, 6, 7, 5} and C2 = {3, 4}. The modes of clusters C1 and C2 are {b-tech, cse, 70%} and {m-tech, eee, 60%}. Phase I and Phase II are then applied to these two tuples.

For this sample data set, Phase III again yields the same two clusters.

C. Experimental Results

The proposed method was experimented with a real data set, the small soybean data set, which consists of 47 records; each record contains 35 attributes describing the features of plants, with four classes of diseases. The method produces 10 clusters after Phase II and 4 clusters after Phase III.

D. Measuring the Purity of Clusters

A cluster is called a pure cluster if all of its objects belong to a single class. To measure the efficiency of the proposed method, we have used the clustering accuracy measure. The clustering accuracy r is defined as

r = (1/n) Σ_{l=1..k} a_l

where a_l is the number of data objects that occur in both cluster C_l and its corresponding labeled class, and n is the number of objects in the data set.
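Reading "corresponding labeled class" as the majority class of each cluster (a common interpretation; the paper does not spell it out), the measure can be computed as:

```python
from collections import Counter

def clustering_accuracy(clusters, labels):
    """r = (1/n) * sum of a_l, where a_l counts the objects of cluster
    C_l whose class label is C_l's corresponding (majority) class."""
    n = sum(len(c) for c in clusters)
    a = sum(Counter(labels[i] for i in c).most_common(1)[0][1]
            for c in clusters)
    return a / n

# Toy usage: six objects, two clusters, known class labels.
print(clustering_accuracy([{0, 1, 2}, {3, 4, 5}],
                          ["A", "A", "B", "B", "B", "B"]))  # -> 0.8333...
```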
21  25
In Our proposed algorithm the number of clusters
‗K‘ is not given as input. But during merging using
ES
representative the number of clusters is reduced, evident that
when the number of clusters is more the purity is also more.
.So it is proposed here to select the cluster with high purity.
22

23


27

27

The proposed method is compared with K-modes algorithm 24  25,30


and the results depend on the mode selection. In the proposed
method whatever may be the order the results will be same. 25  21,24
Execution is necessary to minimize the cost function. The
sample dataset and dissimilarity matrix of Soya bean dataset 26  29
can be represented in power point representation and next step
is the neighbor matrix can be shown below as follows: 27  22,23
TABLE-5 NEIGHBOR MATRIX OF SOYBEAN DATASET

OBJECT | NEIGHBORS
1  | 10
2  | 6, 7
3  | 8
4  | 10
5  | 1, 10
6  | 2, 4, 9
7  | 2, 3, 4
8  | 3
9  | 10
10 | 1
11 | 15, 18
12 | 16
13 | 18, 20
14 | 19, 20
15 | 11
16 | 12
17 | 13, 15, 19
18 | 11, 13
19 | 14
20 | 13, 14
21 | 25
22 | 27
23 | 27
24 | 25, 30
25 | 21, 24
26 | 29
27 | 22, 23
28 | 24, 25
29 | 26
30 | 24
31 | 34
32 | 37
33 | 39, 43
34 | 31, 41
35 | 43
36 | 43
37 | 47
38 | 33, 45
39 | 33, 40
40 | 39
41 | 32, 34, 37
42 | 34
43 | 33, 36
44 | 31, 36
45 | 38
46 | 31
47 | 37

TABLE-6 RESULTANT CLUSTERS OF SOYBEAN DATASET AFTER PHASE II

CLUSTER NUMBER | OBJECTS
1  | 1, 5, 10
2  | 2, 3, 4, 6, 7, 8, 9
11 | 11, 13, 14, 15, 17, 18, 19, 20
12 | 12, 16
21 | 21, 24, 25, 28, 30
22 | 22, 23, 27
26 | 26, 29
31 | 31, 32, 34, 37, 47
33 | 33, 36, 39, 40, 43
42 | 44, 45, 46, 47

Because of the computational overhead of constructing the dissimilarity matrix, we have experimented only with the small soybean data set; we plan to extend this work to large data sets.

V. CONCLUSION

In many existing clustering methods, the number of clusters is given as an input parameter. In the proposed method, a dissimilarity measurement is used to cluster the categorical data, and the objects are clustered without requiring 'K' or the threshold value as input. The experimental results demonstrate the efficiency of the proposed method.

References

[1] Arun K. Pujari, Data Mining Techniques, Universities Press, 2001.
[2] UCI Machine Learning Repository, www.ics.uci.edu/~mlearn/MLRepository.html
[3] S. Aranganayagi and K. Thangavel, "A Novel Clustering Algorithm for Categorical Data," Computational Mathematics, Narosa Publishing House, New Delhi, India.
[4] Maria Halkidi, Yannis Batistakis and Michalis Vazirgiannis, "On Clustering Validation Techniques," Journal of Intelligent Information Systems.
[5] Ohn Mar San, Van-Nam Huynh and Yoshiteru Nakamori, "An Alternative Extension of the K-Means Algorithm for Clustering Categorical Data," J. Appl. Math. Comput. Science.
[6] http://www.fas.org/irp/crs/RL31798.pdf
[7] http://en.wikipedia.org/wiki/Association_rule_learning
[8] http://searchdatamanagement.techtarget.com/definition/data-classification
[9] Tapas Kanungo et al., "An Efficient k-Means Clustering Algorithm: Analysis and Implementation," IEEE Transactions on Pattern Analysis and Machine Intelligence.
[10] Johan Everts, "Clustering Algorithms," Kunstmatige Intelligentie / RuG.
[11] N. Orlowski, D. Schlorff, J. Blevins, D. Cañas, M. T. Chu and R. E. Funderlic, "The Effects of Ties on Convergence in K-Modes Variants for Clustering Categorical Data," N.C. State University, Department of Computer Science.
[12] http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/kmeans.html
[13] http://en.wikipedia.org/wiki/Data_set
[14] http://people.sc.fsu.edu/~jburkardt/m_src/kmeans/kmeans.html
[15] http://www.astro.princeton.edu/~gk/A542/matt.pdf
