
International Journal of Computer Trends and Technology (IJCTT), Volume 4, Issue 6, June 2013

ISSN: 2231-2803 http://www.ijcttjournal.org Page 1564



Improved Clustering using Hierarchical Approach

1. Megha Gupta, 2. Vishal Shrivastava
1. M.Tech Scholar (Rajasthan Technical University, Kota)
2. M.Tech, Associate Professor (Rajasthan Technical University, Kota)
India
Abstract- This paper presents a method for finding better and more compact clusters by using a partitioning method together with a hierarchical method. The k-means algorithm is used to find initial clusters, and the resulting clusters are further divided using the CFNG (Colored Farthest Neighbor Graph). The final clusters are tighter and better separated than those formed by the k-means algorithm alone.
Keywords: Data mining, K-means algorithm, CFNG
I. INTRODUCTION
With the enormous amount of data stored in files, databases, and other repositories, it is increasingly important, if not necessary, to develop powerful means for analyzing and interpreting such data and for extracting knowledge that could help in decision-making.
Data Mining [4], also popularly known as
Knowledge Discovery in Databases (KDD), refers to
the nontrivial extraction of implicit, previously
unknown and potentially useful information from
data in databases. While data mining and knowledge
discovery in databases (or KDD) are frequently
treated as synonyms, data mining is actually part of
the knowledge discovery process.
This paper presents a new method for developing clusters using the k-means algorithm (a partitioning method) along with the CFNG (Colored Farthest Neighbor Graph). Cluster analysis studies the problem of
finding groups of similar objects. Given a set of
objects, this problem amounts to splitting the set into
clusters or groups, such that the objects in each
group are more similar to each other than to the
objects in other groups. Clustering is the
organization of data in classes. In clustering,
class labels are unknown and it is up to the clustering
algorithm to discover acceptable classes. Clustering
is also known as unsupervised classification, because
the classification is not directed by given class labels.
There are many clustering approaches, all based on the principle of maximizing the similarity between objects in the same class and minimizing the similarity between objects in different classes.
II. METHODS AND TECHNIQUES
The methods used in this paper are the k-means algorithm (a partitioning method) and the CFNG (a hierarchical clustering method).
2.1 K-means algorithm
The k-means algorithm [1] is a partitioning method in which each cluster's center is represented by the mean value of the objects in the cluster.
Input:
K: the number of clusters
D: a data set containing n objects.
Output: A set of k clusters
2.1.1 Algorithm
1. Arbitrarily choose k objects from D as the initial cluster centers.
2. Repeat:
3. (Re)assign each object to the cluster to which the object is most similar, based on the mean value of the objects in the cluster.
4. Update the cluster means, that is, calculate the mean value of the objects for each cluster.
5. Until no change.
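The steps above can be sketched in Python. This is an illustrative implementation, not the authors' code; it assumes 2D points given as (x, y) tuples and uses squared Euclidean distance as the similarity measure.

```python
import random

def kmeans(points, k, max_iter=100):
    # Step 1: arbitrarily choose k objects as the initial cluster centers.
    centers = random.sample(points, k)
    for _ in range(max_iter):
        # Step 3: (re)assign each object to the nearest (most similar) center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda c: (p[0] - centers[c][0]) ** 2
                                      + (p[1] - centers[c][1]) ** 2)
            clusters[nearest].append(p)
        # Step 4: update each center to the mean of the objects in its cluster.
        new_centers = [
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
            if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        # Step 5: stop when no center changes.
        if new_centers == centers:
            break
        centers = new_centers
    return clusters, centers
```

On well-separated data the loop typically converges in a few iterations; like any k-means variant, the result can depend on the random initial centers.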
2.2 CFNG (Colored Farthest Neighbor Graph)
The basic approach of the CFNG is to split one set of objects into two subsets. Two steps are used in this process. In the first step, the objects are used to build a farthest neighbor graph (FNG). Then the graph vertices are colored using two different colors. Finally, the vertices are separated into two subsets, one for each color. Objects in each subset tend to be near each other and far from the objects in the other subset. This process continues, splitting the data into smaller and smaller subsets, until single objects are obtained. The following steps are followed to obtain the FNG.
Graph Construction
A graph is a set of objects along with a set of links between some of the objects. Each object in a graph is known as a vertex and each link is known as an edge. The set of vertices is usually denoted by V and the set of edges by E.
The FNG consists of the objects and the edges between them. In this graph, each object is connected to its farthest neighbor.
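As a sketch of the construction just described, the following Python function connects each object to its farthest neighbor; the (x, y) tuple representation and squared Euclidean distance are assumptions made for illustration.

```python
def farthest_neighbor_graph(points):
    # Connect each point (vertex) to its farthest neighbor.
    # Edges are stored as undirected index pairs (i, j) with i < j.
    edges = set()
    for i, p in enumerate(points):
        farthest = max(
            (j for j in range(len(points)) if j != i),
            key=lambda j: (p[0] - points[j][0]) ** 2
                        + (p[1] - points[j][1]) ** 2,
        )
        edges.add((min(i, farthest), max(i, farthest)))
    return edges
```

Since two points can be each other's farthest neighbor, the undirected edge set may contain fewer edges than there are points.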
Vertex Coloring
A graph can be colored so that the two endpoints of every edge receive different colors. To color a graph, vertices do not receive actual visible colors. A proper vertex coloring of a graph assigns each vertex a label such that adjacent vertices receive different labels. The four color theorem states that any planar graph can be colored with four colors [1]: four colors suffice so that any two adjacent vertices receive different colors. If k different labels are used in a coloring, it is referred to as a k-coloring.
Once the FNG is built, the method uses a vertex coloring to partition the objects into two well-separated subsets. For an arbitrary graph, a coloring can be found quickly with a simple greedy algorithm, which starts with an uncolored graph and visits the vertices in some order. When it visits a vertex, it examines the colors of the neighboring vertices. If none of the neighbors has been colored, it chooses a color for the vertex arbitrarily; otherwise, it chooses a color that differs from those of the colored neighbors. Such an algorithm always produces a proper coloring, but finding a coloring that uses the minimum number of colors is very hard in general.
So, instead of a general graph, a forest is used for this purpose.
A forest is a collection of trees. A tree is a set of vertices and edges in which any two vertices are connected by exactly one path, so no cycle can be formed in a forest.
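Because a forest contains no cycles, it can always be properly colored with just two colors. A minimal sketch, assuming the forest is given as a vertex count and an edge list:

```python
from collections import defaultdict, deque

def two_color_forest(n, edges):
    # Build an adjacency list for the forest on vertices 0..n-1.
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    color = [None] * n
    # Color each tree of the forest by breadth-first traversal,
    # giving every vertex the opposite color of its parent.
    for start in range(n):
        if color[start] is not None:
            continue
        color[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if color[v] is None:
                    color[v] = 1 - color[u]
                    queue.append(v)
    return color
```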
The colored farthest neighbor graph shares many characteristics with the shared farthest neighbors (SFN) approach of Rovetta and Masulli [3]. The CFNG yields binary partitions of the objects into subsets, whereas the number of subsets obtained by SFN can vary. The SFN algorithm can easily split a cluster where no natural partition exists, while the CFNG often avoids such splits.
2.2.1 Algorithm
S is a set of objects in a metric space.

BuildTreeWithStack(S)
  clear stack
  (A, B) ← split(S)
  push(A), push(B)
  root ← a new tree node
  root.(left, right) ← (A, B)
  while stack is not empty do
    B ← pop()
    A ← pop()
    if |A| > 1 or |B| > 1 then
      (c, d) ← split(A)
      (e, f) ← split(B)
      if one of c, d, e, f is in the wrong subset then
        adjust A and B (merge one of c, d, e, f into A or B)
        (c, d) ← split(A)
        (e, f) ← split(B)
      end
      A.(left, right) ← (c, d)
      B.(left, right) ← (e, f)
      push(c), push(d), push(e), push(f)
    end
  end
  return root
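As a hedged illustration of the split(S) step above, the following Python sketch builds the farthest neighbor graph, drops any edge that would close a cycle (so the kept edges form a forest, as discussed earlier), 2-colors that forest, and returns the two color classes as index lists. The point representation and distance measure are assumptions, and the wrong-subset adjustment from the pseudocode is omitted for brevity.

```python
def split(points):
    n = len(points)
    dist2 = lambda a, b: (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2
    # Union-find, used to discard any farthest-neighbor edge that
    # would close a cycle, so the kept edges form a forest.
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    adj = {i: [] for i in range(n)}
    for i in range(n):
        j = max((k for k in range(n) if k != i),
                key=lambda k: dist2(points[i], points[k]))
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            adj[i].append(j)
            adj[j].append(i)
    # 2-color the forest: each vertex gets the opposite color
    # of the vertex that discovered it.
    color = [None] * n
    for s in range(n):
        if color[s] is None:
            color[s] = 0
            stack = [s]
            while stack:
                u = stack.pop()
                for v in adj[u]:
                    if color[v] is None:
                        color[v] = 1 - color[u]
                        stack.append(v)
    # The two color classes are the two subsets.
    return ([i for i in range(n) if color[i] == 0],
            [i for i in range(n) if color[i] == 1])
```

Because farthest neighbors always end up in different color classes, the two returned subsets tend to separate the two extremes of the data, which is the behavior the CFNG relies on.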
III. CONCLUSION
In this paper, we have analyzed the k-means algorithm and the CFNG for hierarchical clustering, with the aim of producing better and more compact clusters. Other algorithms could also be used for the hierarchical step. With hierarchical clustering, we obtained tightly bonded clusters. With k-means we obtained two clusters, one strong and one with a weak tie; by further clustering the strong cluster with the CFNG, we finally obtained three clusters, two strong and one with a weak tie.
















REFERENCES
[1] Carl Endorf, Gene Schultz and Jim Mellander, Intrusion Detection & Prevention. McGraw-Hill Osborne Media, first edition, 2003.
[2] M. Charikar, C. Chekuri, T. Feder and R. Motwani, "Incremental Clustering and Dynamic Information Retrieval," Proceedings of the 29th ACM Symposium on Theory of Computing, 1997.
[3] S. Rovetta and F. Masulli, "Shared farthest neighbor approach to clustering of high dimensionality, low cardinality data," Pattern Recognition 39 (12) (2006) 2415-2425.
[4] Jiawei Han, Micheline Kamber and Jian Pei, Data Mining: Concepts and Techniques.
