Kankanala Laxmi
Supervisor: Prof. M. Narasimha Murthy
Computer Science and Automation
Indian Institute of Science, Bangalore
1.4 Outline of the Report

This report is organized as follows. Section 2 provides a review of relevant work in this area, i.e., two tree-based clustering algorithms, namely the CF tree and the KD tree, and the Divide-and-Conquer approach. Section 3 describes different classification approaches. Section 4 describes incremental mining. Section 5 gives the details of the different data sets which have been used for clustering. Section 6 presents experimental results of our preliminary implementation of the two clustering algorithms. We conclude this report with a road map to future work.

1.5 Review of Literature

We give a brief review of the related work done in this area. There are two main approaches to clustering: hierarchical clustering and partitioning clustering. The partitioning clustering technique partitions the database into a predefined number of clusters; k-means and k-medoid are examples of partitioning clustering algorithms. The hierarchical clustering technique produces a sequence of partitions in which each partition is nested into the next partition in the sequence; it creates a hierarchy of clusters from small to big or from big to small. Numerous algorithms have been developed for clustering large data sets. Here we give a brief description of some of these algorithms.

• K-means algorithm [4]
The algorithm is composed of the following steps (a sketch follows this list):
1. Place K points into the space represented by the objects that are being clustered. These points represent the initial group centroids.
2. Assign each object to the group that has the closest centroid.
3. When all objects have been assigned, recalculate the positions of the K centroids.
4. Repeat steps 2 and 3 until the centroids no longer move. This produces a separation of the objects into groups from which the metric to be minimized can be calculated.
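As an illustration, the following is a minimal Python/NumPy sketch of these four steps; the data matrix X, the number of clusters k, the random seed, and the iteration cap are hypothetical inputs, not part of [4].

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal k-means sketch: X is an (n, d) array of data objects."""
    rng = np.random.default_rng(seed)
    # Step 1: place K points into the space (here: K distinct data objects).
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each object to the group with the closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recalculate the positions of the K centroids
        # (an empty group keeps its old centroid in this sketch).
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        # Step 4: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```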
• CLARANS (Clustering Large Applications based on Randomized Search) [5]
CLARANS is a medoid-based method which is more efficient than earlier k-medoid algorithms, but it suffers from two major drawbacks: it assumes that all objects fit into main memory, and its run time becomes prohibitive on large databases.

• DBSCAN (Density Based Spatial Clustering of Applications with Noise) [6]
DBSCAN uses a density-based notion of clusters to discover clusters of arbitrary shapes. The key idea is that, for each object of a cluster, the neighbourhood of a given radius has to contain at least a minimum number of data objects. In other words, the density of the neighbourhood must exceed a threshold.
The classification problem is concerned with generating a description or a model for each class from the given data set. Using the training set, the classification process attempts to generate the descriptions of the classes, and those descriptions help to classify the unknown records. It is possible to use frequent-itemset mining algorithms to generate the clusters and their descriptions efficiently. Two popular algorithms are:

• Apriori algorithm [7]
It is the most popular algorithm for finding all the frequent itemsets. The first pass of the algorithm simply counts item occurrences to determine the frequent 1-itemsets. A subsequent pass, say pass k, consists of two phases. First, the frequent itemsets found in pass k-1 are used to generate the candidate itemsets. Next, the database is scanned and the support of the candidates is counted. A sketch of both phases follows.
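The Python sketch below illustrates the two phases of pass k (candidate generation by joining frequent (k-1)-itemsets, then a database scan to count support); transactions (a list of item sets) and min_sup (an absolute support threshold) are hypothetical inputs.

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Minimal Apriori sketch over a list of transactions (sets of items)."""
    # First pass: count item occurrences to find the frequent 1-itemsets.
    counts = {}
    for t in transactions:
        for item in t:
            s = frozenset([item])
            counts[s] = counts.get(s, 0) + 1
    frequent = {s for s, c in counts.items() if c >= min_sup}
    all_frequent = set(frequent)
    k = 2
    while frequent:
        # Phase 1 of pass k: join frequent (k-1)-itemsets into k-candidates,
        # keeping only candidates whose (k-1)-subsets are all frequent.
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
        # Phase 2 of pass k: scan the database and count candidate support.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {c for c, cnt in counts.items() if cnt >= min_sup}
        all_frequent |= frequent
        k += 1
    return all_frequent
```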
• FP-tree
A frequent-pattern tree is a tree structure consisting of an item-prefix tree and a frequent-item header table.
The item-prefix tree consists of a root node labelled null. Each non-root node consists of three fields: item name, support count and node link.
The frequent-item header table consists of the item name and the head of the node link, which points to the first node in the FP-tree carrying that item name. A sketch of this structure follows.
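The following minimal Python sketch shows the two components described above, the node fields and the header table; the insert routine assumes, as in the FP-growth literature, that each transaction's items arrive sorted by descending frequency.

```python
class FPNode:
    """Item-prefix-tree node: item name, support count, and node link."""
    def __init__(self, item, parent=None):
        self.item = item          # item name (None for the root, labelled null)
        self.count = 0            # support count
        self.link = None          # node link to the next node with the same item
        self.parent = parent
        self.children = {}        # item name -> child FPNode

class FPTree:
    def __init__(self):
        self.root = FPNode(None)  # root node labelled null
        self.header = {}          # frequent-item header table: item -> first node

    def insert(self, items):
        """Insert one transaction (items pre-sorted by descending frequency)."""
        node = self.root
        for item in items:
            if item not in node.children:
                child = FPNode(item, parent=node)
                node.children[item] = child
                # Thread the node link: the header points to the first node
                # carrying the item name; later nodes are chained behind it.
                if item not in self.header:
                    self.header[item] = child
                else:
                    cur = self.header[item]
                    while cur.link:
                        cur = cur.link
                    cur.link = child
            node = node.children[item]
            node.count += 1
```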
2 Clustering

Clustering is a method of grouping data into different groups, so that the data in each group share similar trends and patterns. Clustering constitutes a major class of data mining algorithms. The algorithm attempts to automatically partition the data space into a set of regions or clusters, to which the examples in the tables are assigned, either deterministically or probabilistically. The goal of the process is to identify all sets of similar examples in the data, in some optimal fashion.
The objectives of clustering are:

• to find a consistent and valid organization of the data

In this section we briefly discuss two tree-based clustering algorithms, namely the CF tree and the KD tree, and we give a brief overview of the Divide-and-Conquer approach.
2.1 CF Tree

The CF tree [8, 9] is based on the principle of agglomerative clustering: at any given stage there are smaller subclusters, and the decision at the current stage is whether to merge subclusters based on some criterion. It maintains a set of cluster features for each subcluster. The cluster features (CFs) of the different subclusters are maintained in a tree (in a B+-tree fashion) [10]; this tree is called the CF tree.
CF vector = (n, ls, ss), where n is the number of data objects in the CF, ls is the linear sum of the data objects, and ss is the square sum of the data objects in the CF.
A CF tree is a height-balanced tree with two parameters: branching factor B and threshold T. Each non-leaf node contains at most B entries of the form [CF_i, child_i], where i ∈ {1, 2, ..., B}, child_i is a pointer to its i-th child node, and CF_i is the CF of the subcluster represented by this child. So a non-leaf node represents a cluster made up of all the subclusters represented by its entries. A leaf node contains at most L entries, each of the form [CF_i], where i = 1, 2, ..., L. In addition, each leaf node has two pointers, "prev" and "next", which are used to chain all leaf nodes together for efficient scans. A leaf node also represents a cluster made up of all the subclusters represented by its entries, but all entries in a leaf node must satisfy a threshold requirement with respect to the threshold value T: the diameter has to be less than T. A sketch of the CF vector follows.
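A useful property of the CF vector, stated in the BIRCH paper [8], is additivity: the CF of a merged subcluster is the componentwise sum of the two CF vectors, so merges never require revisiting the raw data. A minimal Python sketch, with the centroid and radius computed from (n, ls, ss):

```python
import numpy as np

class CF:
    """Cluster Feature vector (n, ls, ss) for a subcluster."""
    def __init__(self, point):
        x = np.asarray(point, dtype=float)
        self.n = 1            # number of data objects in the CF
        self.ls = x.copy()    # linear sum of the data objects
        self.ss = x @ x       # square sum of the data objects

    def merge(self, other):
        """Additivity: merging two subclusters adds their CF vectors."""
        self.n += other.n
        self.ls += other.ls
        self.ss += other.ss

    def centroid(self):
        return self.ls / self.n

    def radius(self):
        # Average distance of members to the centroid, derivable from (n, ls, ss):
        # R^2 = ss/n - ||ls/n||^2.
        return np.sqrt(max(self.ss / self.n - (self.ls @ self.ls) / self.n ** 2, 0.0))
```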
2.2 KD Tree

A KD-tree [11] is a data structure for storing a finite set of points from a k-dimensional space. It was examined in detail by Friedman, Bentley et al. (1977). A KD-tree is a binary tree designed to handle spatial data in a simple way. At each step, choose one of the coordinates as the basis for dividing the rest of the points. For example, at the root, choose x as the basis: as in binary search trees, all items to the left of the root will have an x-coordinate less than that of the root, and all items to the right of the root will have an x-coordinate greater than (or equal to) that of the root.
Example: the KD tree for (3,7), (5,10), (8,3) and (6,12):

[Figure: a tree with root (3,7), left child (8,3), right child (5,10), and (6,12) as the right child of (5,10)]

It is a 2d-tree of four elements. The (3,7) node splits along the Y=7 plane and the (5,10) node splits along the X=5 plane. The figure below shows the way the nodes partition the plane; a construction sketch follows.

[Figure: the partition of the plane induced by the nodes (3,7), (5,10), (8,3) and (6,12)]
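The following minimal Python sketch inserts points into a 2d-tree, cycling through the coordinates level by level; starting the cycle on the y-coordinate reproduces the example tree above (the choice of starting axis is otherwise arbitrary).

```python
from typing import Optional, Tuple

class KDNode:
    def __init__(self, point: Tuple[float, float], axis: int):
        self.point = point
        self.axis = axis      # splitting coordinate at this node
        self.left: Optional["KDNode"] = None
        self.right: Optional["KDNode"] = None

def insert(root: Optional[KDNode], point, axis=1, k=2) -> KDNode:
    """Insert a point, cycling through the k coordinates level by level."""
    if root is None:
        return KDNode(point, axis)
    if point[root.axis] < root.point[root.axis]:
        root.left = insert(root.left, point, (root.axis + 1) % k, k)
    else:  # greater than (or equal to) goes to the right, as in the text
        root.right = insert(root.right, point, (root.axis + 1) % k, k)
    return root

# Build the 2d-tree for the example points (3,7), (5,10), (8,3), (6,12).
tree = None
for p in [(3, 7), (5, 10), (8, 3), (6, 12)]:
    tree = insert(tree, p)
```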
2.3 Divide-and-Conquer

We study clustering under the data-stream model of computation where, given a sequence of points, the objective is to maintain a consistently good clustering of the sequence observed so far, using a small amount of memory and time. One of the first requisites of clustering a data stream is that the computation be carried out in small space.

Algorithm Small-Space(S) [12] (a sketch follows the steps):
1. Divide S into l disjoint pieces X_1, X_2, ..., X_l.
2. For each i, find O(k) centers in X_i. Assign each point in X_i to its closest center.
3. Let X' be the O(lk) centers obtained in (2), where each center c is weighted by the number of points assigned to it.
4. Cluster X' to find k centers.
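A minimal Python sketch of the four steps; `cluster(points, weights, k)` is a hypothetical black box standing in for the weighted k-clustering subroutine (e.g. a weighted variant of k-means), and the simple equal-size split of S is an assumption made here for concreteness.

```python
import numpy as np

def small_space(S, k, l, cluster):
    """Small-Space(S) sketch; `cluster(points, weights, k)` returns k centers."""
    # Step 1: divide S into l disjoint pieces X_1, ..., X_l.
    pieces = np.array_split(S, l)
    centers, weights = [], []
    for X in pieces:
        # Step 2: find centers in each piece (here exactly k, a special case
        # of O(k)) and assign each point of the piece to its closest center.
        c = cluster(X, np.ones(len(X)), k)
        assign = np.linalg.norm(X[:, None] - c[None, :], axis=2).argmin(axis=1)
        # Step 3: weight each center by the number of points assigned to it.
        for j in range(len(c)):
            centers.append(c[j])
            weights.append(np.sum(assign == j))
    # Step 4: cluster the O(lk) weighted centers X' to find the final k centers.
    return cluster(np.array(centers), np.array(weights), k)
```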
One of the major drawbacks of KNN classifiers is that the classifier needs all the available data. This may lead to considerable overhead if the training data set is large. To reduce that overhead, we use the KNN classifier (KNNC) on the clusters which have been generated by some clustering method.
In some clusters all points belong to the same class, so if a pattern belongs to such a cluster we do not have to find its K nearest neighbours, because all points are in the same class. We find KNNs only for those patterns which belong to border clusters, i.e., the clusters which contain data points of multiple classes. This reduces the complexity of classification. For each test pattern which belongs to a border cluster, we find the K nearest training points with respect to Euclidean distance; the classification label of that test pattern is the majority among those K neighbours. A sketch follows.
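A minimal Python sketch of this scheme, assuming the clusters (each with its training points and class labels) have already been produced by some clustering method; routing a test pattern to the cluster with the nearest centroid is an assumption made here for concreteness, not a rule stated above.

```python
import numpy as np
from collections import Counter

def classify(x, clusters, k):
    """clusters: list of (points, labels) pairs from some clustering method.
    Pure clusters answer directly; border clusters fall back to k-NN."""
    # Route the test pattern to the cluster with the nearest centroid
    # (a hypothetical routing rule, chosen for this sketch).
    cid = np.argmin([np.linalg.norm(x - pts.mean(axis=0)) for pts, _ in clusters])
    pts, labels = clusters[cid]
    if len(set(labels)) == 1:
        # Pure cluster: every point has the same class, so no k-NN is needed.
        return labels[0]
    # Border cluster: take the K nearest training points by Euclidean
    # distance and label the pattern by the majority among those neighbours.
    nearest = np.argsort(np.linalg.norm(pts - x, axis=1))[:k]
    return Counter(labels[i] for i in nearest).most_common(1)[0][0]
```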
References

[1] A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: a review. ACM Computing Surveys, 31(3):264–323, 1999.

[2] M. Prakash and M. N. Murty. Growing subspace pattern-recognition methods and their neural-network models. IEEE Transactions on Neural Networks, 8(1):161–168, January 1997.

[3] L. I. Kuncheva and L. C. Jain. Nearest neighbor classifier: Simultaneous editing and feature selection. Pattern Recognition Letters, 20(11-13):1149–1156, November 1999.

[4] J. MacQueen. Some methods for classification and analysis of multivariate observations. In L. M. Le Cam and J. Neyman, editors, Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 281–297, Berkeley, California, 1967. University of California Press.

[5] Raymond T. Ng and Jiawei Han. CLARANS: A method for clustering objects for spatial data mining. IEEE Transactions on Knowledge and Data Engineering, 14(5):1003–1016, 2002.

[6] Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Evangelos Simoudis, Jiawei Han, and Usama Fayyad, editors, Second International Conference on Knowledge Discovery and Data Mining, pages 226–231, Portland, Oregon, 1996. AAAI Press.

[7] Rakesh Agrawal and Ramakrishnan Srikant. Fast algorithms for mining association rules. Research Report RJ 9839, IBM Almaden Research Center, San Jose, California 95120, U.S.A., June 1994.

[8] Tian Zhang, Raghu Ramakrishnan, and Miron Livny. BIRCH: An efficient data clustering method for very large databases. In H. V. Jagadish and Inderpal Singh Mumick, editors, Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, pages 103–114, Montreal, Quebec, Canada, 4–6 June 1996.

[11] Andrew W. Moore. An introductory tutorial on kd-trees, October 8, 1997.

[12] S. Guha, N. Mishra, R. Motwani, and L. O'Callaghan. Clustering data streams. In Danielle C. Young, editor, Proceedings of the 41st Annual Symposium on Foundations of Computer Science, pages 359–366, Los Alamitos, California, November 12–14, 2000. IEEE Computer Society.

[13] Bing Liu, Wynne Hsu, and Yiming Ma. Integrating classification and association rule mining. In KDD, pages 80–86, 1998.

[14] B. Zhang and S. N. Srihari. Fast K-nearest neighbor classification using cluster-based trees. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(4):525–528, April 2004.

[15] Xiaoxin Yin and Jiawei Han. CPAR: Classification based on predictive association rules. In Daniel Barbará and Chandrika Kamath, editors, Proceedings of the Third SIAM International Conference on Data Mining, San Francisco, CA, USA, May 1–3, 2003. SIAM, 2003.

[16] J. R. Quinlan and R. M. Cameron-Jones. FOIL: A midterm report. In P. B. Brazdil, editor, Machine Learning (ECML-93): European Conference on Machine Learning Proceedings, Vienna, Austria, pages 3–20, Berlin, Germany, 1993. Springer-Verlag.

[17] Carson Kai-Sang Leung, Quamrul I. Khan, and Tariqul Hoque. CanTree: A tree structure for efficient incremental mining of frequent patterns. In ICDM, pages 274–281. IEEE Computer Society, 2005.