Scalable Distribution in Data Summarization Clustering

N.V.S.K. VIJAYALAKSHMI K.*, V. PURUSHOTHAM RAJU#
*Shri Vishnu Engineering College For Women, Vishnupur, Bhimavaram, (A.P.), India
#Department of Computer Science Engineering, Shri Vishnu Engineering College For Women, Vishnupur, Bhimavaram, (A.P.), India
Email: *vijayakathari@gmail.com, #pstraju@yahoo.co.in

ABSTRACT
This paper is based on a multilayer overlay network of peer neighbourhoods. Supernodes, which act as representatives of neighbourhoods, are recursively grouped to form higher-level neighbourhoods. Within a given level of the hierarchy, peers cooperate within their respective neighbourhoods to perform P2P clustering. Using this model, we can partition the clustering problem in a modular way across neighbourhoods, solve each part separately using a distributed K-means variant, then successively combine clusterings up the hierarchy, where increasingly more global solutions are computed. In addition, for document clustering applications, we summarize the distributed document clusters using a distributed keyphrase extraction algorithm, thus providing interpretation of the clusters.

KEYWORDS: distributed document clustering, multilayer overlay peer-to-peer networks.

I. INTRODUCTION
Vast data sets are being collected daily in different fields, e.g., retail chains, banking, biomedicine, astronomy, and so forth, but it is still extremely difficult to draw conclusions or make decisions based on the collective characteristics of such disparate data. Four main approaches for performing distributed data mining (DDM) can be identified. A common approach is to bring the data to a central site, then apply centralized data mining on the pooled data. Such an approach clearly suffers from a vast communication and computation cost to pool and mine the global data. In addition, we cannot protect data privacy in such scenarios. A smarter approach is to perform local mining at each site to produce a local model. All local models can then be transmitted to a central site that combines them into a global model. The approaches above all involve a central site to facilitate the DDM process. A more radical approach does not involve centralized operation, and thus belongs to the peer-to-peer (P2P) class of algorithms. P2P networks can be unstructured or structured. In this paper, we introduce an approach for distributed data clustering based on a hierarchical P2P network architecture. The goal is to realize a flexible DDM model that can be tailored to various scenarios. The proposed model is called Hierarchically distributed P2P Clustering (HP2PC). It involves a hierarchy of P2P neighbourhoods, in which the peers in each neighbourhood are responsible for building a clustering solution, using P2P communication, based on the data they have access to. As we move up the hierarchy, clusters are combined from lower levels. At the root of the hierarchy, one global clustering can be derived. Using the HP2PC model, we can partition the problem in a modular way, solve each part individually, then successively combine solutions if it is desired to derive a global solution.
This way, we avoid two problems in the current state of the art in DDM: 1) the high communication cost usually associated with a structured, fully connected network, and 2) the uncertainty in network topology usually introduced by unstructured P2P networks. Experiments performed on document clustering show that we can achieve results comparable to centralized clustering with a high gain in speedup. In addition, when applied to document clustering, we also provide interpretation of the distributed document clusters using a distributed keyphrase extraction algorithm, which is a distributed variant of the CorePhrase single-cluster summarization algorithm. The algorithm finds the core phrases within a distributed document cluster by iteratively intersecting candidate keyphrases between nodes in a neighbourhood. Once it converges to a set of core phrases, they are attached to the cluster as an interpretation of its contents.
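The iterative intersection idea can be sketched in Java as follows. This is an illustrative simplification, not the paper's implementation: the class and method names are hypothetical, and each node's candidate keyphrases are modeled as a plain set of strings.

```java
import java.util.*;

// Sketch of the keyphrase-intersection idea: each node in a neighbourhood holds
// a set of candidate keyphrases for a distributed cluster; intersecting the sets
// across all nodes leaves only the core phrases shared by the whole cluster.
public class CorePhraseSketch {
    // Hypothetical helper: intersect the candidate sets of all nodes in a neighbourhood.
    static Set<String> corePhrases(List<Set<String>> candidatesPerNode) {
        Set<String> core = new HashSet<>(candidatesPerNode.get(0));
        for (int i = 1; i < candidatesPerNode.size(); i++) {
            core.retainAll(candidatesPerNode.get(i)); // keep only phrases every node has
        }
        return core;
    }

    public static void main(String[] args) {
        List<Set<String>> nodes = Arrays.asList(
            new HashSet<>(Arrays.asList("data mining", "peer to peer", "clustering")),
            new HashSet<>(Arrays.asList("peer to peer", "clustering", "overlay network")),
            new HashSet<>(Arrays.asList("clustering", "peer to peer")));
        System.out.println(corePhrases(nodes)); // the phrases common to all three nodes
    }
}
```

In the actual distributed setting, the intersection would proceed iteratively around the neighbourhood rather than in one pass, but the converged result is the same set of shared core phrases.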

Fig. 1. The HP2PC hierarchy architecture.

II. MATHEMATICAL APPROACH
An HP2PC network is built recursively, starting from level 0 up to the height of the hierarchy, H. The number of neighbourhoods and the size of each neighbourhood are controlled through the partitioning factor f_h, which is specified for each level of the hierarchy (except the root level). The construction process is given in the algorithm below. Given the initial set of nodes, P0, and the set of partitioning factors F = {f_h}, 0 <= h <= H - 1, the algorithm recursively constructs the network. At each level, we partition the current set of nodes into the appropriate number of neighbourhoods and assign a supernode to each one. The set of supernodes at a given level forms the set of nodes for the next higher level, which is passed to the next recursive call. Construction stops when the root is reached.

Algorithm: HP2PC Construction. The following Java fragment from the implementation computes term frequencies for the documents of each category on a peer:

    try {
        for (int ic = 0; ic < 10; ic++) {
            // Directory holding this peer's documents for one category,
            // e.g. "C:\\ReutersTranscribedSubset\\" + category[ic]
            String dirname = peerpath + category[ic];
            File f1 = new File(dirname);
            String s[] = f1.list();
            for (int i = 0; i < s.length; i++) {
                // Compute term frequencies for each document in the category
                frequency fr = new frequency();
                fr.freq(s[i]);
                fr1.freq(s[i], "", "", false);
                jTextArea1.append("Frequency Computation : " + s[i] + " Done\n");
                jTextArea1.setCaretPosition(jTextArea1.getDocument().getLength());
            }
        }
    } catch (Exception ex) {
        ex.printStackTrace(); // do not silently swallow errors
    }
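The recursive construction described above can be sketched as follows. This is an illustrative reading of the algorithm, not the paper's code: nodes are modeled as integer ids, each level is partitioned into f[h] contiguous neighbourhoods, and the first node of each neighbourhood is (arbitrarily) promoted to supernode.

```java
import java.util.*;

// Sketch of the recursive HP2PC construction: partition the nodes at each level
// into neighbourhoods, promote one supernode per neighbourhood, and recurse on
// the supernodes until the root neighbourhood is reached.
public class Hp2pcConstruction {
    // f[h] = number of neighbourhoods to form at level h (the partitioning factor).
    static List<Integer> construct(List<Integer> nodes, int[] f, int level) {
        if (nodes.size() == 1 || level >= f.length) {
            return nodes; // root reached: no further partitioning factors
        }
        int parts = Math.min(f[level], nodes.size());
        List<Integer> supernodes = new ArrayList<>();
        for (int p = 0; p < parts; p++) {
            // one contiguous slice of the current node set forms a neighbourhood
            int from = p * nodes.size() / parts;
            int to = (p + 1) * nodes.size() / parts;
            List<Integer> neighbourhood = nodes.subList(from, to);
            supernodes.add(neighbourhood.get(0)); // first node acts as supernode
        }
        // the supernodes become the node set of the next higher level
        return construct(supernodes, f, level + 1);
    }
}
```

For example, eight nodes with partitioning factors {4, 2, 1} form four level-0 neighbourhoods of two nodes each, then two level-1 neighbourhoods, then a single root neighbourhood.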

For evaluation against the baseline K-means algorithm, we compute the centroids of each neighbourhood using centralized K-means, then compare them against the HP2PC centroids with respect to the merged data set of the respective neighbourhood. For neighbourhoods at level 0, the merged data set is the union of the data from all nodes in the neighbourhood. For those at higher levels, the merged data set is the union of all data reachable from each node in the neighbourhood through its respective lower-level nodes. The EvaluateAndCompare function evaluates both the centralized and distributed solutions and compares them against each other, as reported in the different experiments. One of the major benefits of this algorithm is the ability to zoom in to more refined clusters by descending the hierarchy and to zoom out to more generalized clusters by ascending it. The other major benefit is the ability to merge a forest of independent hierarchies into one hierarchy by putting all roots of the forest into one neighbourhood and invoking the merge algorithm on that neighbourhood.
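One simple way to compare the two centroid sets is to match each distributed centroid to its nearest centralized centroid and average the distances. The sketch below illustrates this idea only; the names and the error measure are assumptions, not the paper's actual EvaluateAndCompare function.

```java
// Sketch of centroid comparison: match each HP2PC (distributed) centroid to its
// nearest centralized K-means centroid and report the average Euclidean distance.
public class CentroidCompare {
    static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Average distance from each distributed centroid to its closest centralized one;
    // zero means the two solutions produced identical centroids.
    static double centroidError(double[][] centralized, double[][] distributed) {
        double total = 0;
        for (double[] dc : distributed) {
            double best = Double.MAX_VALUE;
            for (double[] cc : centralized) {
                best = Math.min(best, distance(dc, cc));
            }
            total += best;
        }
        return total / distributed.length;
    }
}
```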

Fig. 2. Phrase matching using DIG.

The CorePhrase algorithm compares each pair of documents to extract matching phrases. This procedure of matching every pair of documents is inherently O(n^2). However, using a document phrase indexing graph structure, known as the Document Index Graph (DIG), the algorithm can achieve this goal in near-linear time.
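The reason an index avoids the O(n^2) pairwise comparison can be seen with a much simpler structure: post each phrase into a shared index in one pass, then read matching documents off the posting lists. The sketch below is a plain inverted index, not the actual Document Index Graph, and its names are illustrative.

```java
import java.util.*;

// Sketch of index-based phrase matching: instead of comparing every document
// pair, each phrase is posted once and co-occurring documents are read off the
// posting lists, so matching cost grows with total phrase count, not pairs.
public class PhraseIndexSketch {
    // Build phrase -> list of documents containing it, in one pass over the corpus.
    static Map<String, List<Integer>> buildIndex(List<List<String>> docs) {
        Map<String, List<Integer>> index = new HashMap<>();
        for (int d = 0; d < docs.size(); d++) {
            for (String phrase : docs.get(d)) {
                index.computeIfAbsent(phrase, k -> new ArrayList<>()).add(d);
            }
        }
        return index;
    }
}
```

Any posting list of length greater than one identifies a phrase shared between documents, without ever comparing two documents directly. The DIG refines this idea to a graph over words so that shared phrases of any length are found while documents are being indexed.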

III. RESULTS & DISCUSSION


The concept of this paper was implemented, and the different results are shown below.

IV. CONCLUSIONS
In this paper, we have introduced a novel architecture and algorithm for scalable distributed clustering, the HP2PC model, which allows building hierarchical networks for clustering data. We demonstrated the flexibility of the model, showing that it achieves quality comparable to its centralized counterpart while providing significant speedup, and that it can be made equivalent to traditional distributed clustering models (e.g., facilitator-worker models) by manipulating the neighbourhood size and height parameters. The model shows good scalability with respect to network size and hierarchy height, without degrading the distributed clustering quality significantly. The importance of this contribution stems from its flexibility to encompass the usual types of P2P networks as well as modularized networks through neighbourhood and hierarchy formation. It also allows privacy within neighbourhood boundaries (no data is shared between neighbourhoods). In addition, we provide interpretation capability for document clustering through document cluster summarization using distributed keyphrase extraction.

REFERENCES
[1] H. Kargupta, I. Hamzaoglu, and B. Stafford, "Scalable, Distributed Data Mining Using an Agent-Based Architecture," Proc. Third Int'l Conf. Knowledge Discovery and Data Mining (KDD '97), pp. 211-214, 1997.
[2] S. Datta, C. Giannella, and H. Kargupta, "K-Means Clustering over a Large, Dynamic Network," Proc. Sixth SIAM Int'l Conf. Data Mining (SDM '06), pp. 153-164, 2006.
[3] S. Datta, C. Giannella, and H. Kargupta, "K-Means Clustering over Peer-to-Peer Networks," Proc. Eighth Int'l Workshop High Performance and Distributed Mining (HPDM), SIAM Int'l Conf. Data Mining (SDM), 2005.
[4] S. Datta, K. Bhaduri, C. Giannella, R. Wolff, and H. Kargupta, "Distributed Data Mining in Peer-to-Peer Networks," IEEE Internet Computing, vol. 10, no. 4, pp. 18-26, 2006.
[5] K. Hammouda and M. Kamel, "CorePhrase: Keyphrase Extraction for Document Clustering," Proc. IAPR Int'l Conf. Machine Learning and Data Mining in Pattern Recognition (MLDM '05), P. Perner and A. Imiya, eds., pp. 265-274, July 2005.
[6] K. Hammouda and M. Kamel, "Document Similarity Using a Phrase Indexing Graph Model," Knowledge and Information Systems, vol. 6, no. 6, pp. 710-727, Nov. 2004.
[7] D. Boley, M. Gini, R. Gross, S. Han, K. Hastings, G. Karypis, V. Kumar, B. Mobasher, and J. Moore, "Document Categorization and Query Generation on the World Wide Web Using WebACE," AI Rev., vol. 13, nos. 5/6, pp. 365-391, 1999.
[8] A. Strehl, "Relationship-Based Clustering and Cluster Ensembles for High-Dimensional Data Mining," PhD dissertation, Univ. of Texas at Austin, 2002.
[9] D.D. Lewis, Y. Yang, T. Rose, and F. Li, "RCV1: A New Benchmark Collection for Text Categorization Research," J. Machine Learning Research, vol. 5, pp. 361-397, 2004.
