
Hierarchically Distributed Peer-to-Peer Document Clustering and Cluster Summarization

Khaled M. Hammouda and Mohamed S. Kamel, Fellow, IEEE

. K.M. Hammouda is with Desire2Learn Inc., 305 King St. West, Suite 200, Kitchener, ON N2G 1B9, Canada. E-mail: khaledh@gmail.com.
. M.S. Kamel is with the Department of Electrical and Computer Engineering, University of Waterloo, 200 University Ave. West, Waterloo, ON N2L 3G1, Canada. E-mail: mkamel@uwaterloo.ca.

Manuscript received 23 July 2007; revised 4 Jan. 2008; accepted 13 Aug. 2008; published online 9 Sept. 2008. Recommended for acceptance by S. Zhang. Digital Object Identifier no. 10.1109/TKDE.2008.189.

Abstract—In distributed data mining, adopting a flat node distribution model can affect scalability. To address the problem of
modularity, flexibility, and scalability, we propose a Hierarchically distributed Peer-to-Peer (HP2PC) architecture and clustering
algorithm. The architecture is based on a multilayer overlay network of peer neighborhoods. Supernodes, which act as representatives
of neighborhoods, are recursively grouped to form higher level neighborhoods. Within a certain level of the hierarchy, peers cooperate
within their respective neighborhoods to perform P2P clustering. Using this model, we can partition the clustering problem in a modular
way across neighborhoods, solve each part individually using a distributed K-means variant, then successively combine clusterings up
the hierarchy where increasingly more global solutions are computed. In addition, for document clustering applications, we summarize
the distributed document clusters using a distributed keyphrase extraction algorithm, thus providing interpretation of the clusters.
Results show decent speedup, reaching 165 times faster than centralized clustering for a 250-node simulated network, with
comparable clustering quality to the centralized approach. We also provide comparison to the P2P K-means algorithm and show that
HP2PC accuracy is better for typical hierarchy heights. Results for distributed cluster summarization match those of their centralized
counterparts with up to 88 percent accuracy.

Index Terms—Distributed data mining, distributed document clustering, hierarchical peer-to-peer networks.

1 INTRODUCTION

A recent shift toward distributed data mining (DDM) was sparked by the data mining community in the mid-1990s. It was realized that analyzing massive data sets, which often span different sites, using traditional centralized approaches can be intractable. In addition, DDM is being fueled by recent advances in grid infrastructures and distributed computing platforms.

Huge data sets are being collected daily in different fields, e.g., retail chains, banking, biomedicine, and astronomy, but it is still extremely difficult to draw conclusions or make decisions based on the collective characteristics of such disparate data.

Four main approaches for performing DDM can be identified. A common approach is to bring the data to a central site, then apply centralized data mining to the collected data. Such an approach clearly suffers from a huge communication and computation cost to pool and mine the global data. In addition, data privacy cannot be preserved in such scenarios.

A smarter approach is to perform local mining at each site to produce a local model. All local models can then be transmitted to a central site that combines them into a global model [1], [2], [3]. Ensemble methods also fall into this category [4]. While this approach may not scale well with the number of sites, it is a better solution than pooling the data.

Another approach is for each site to carefully select a small set of representative data objects and transmit it to a central site, which combines the local representatives into one global representative data set. Data mining can then be carried out on the global representative data set [5], [6].

All three previous approaches involve a central site to facilitate the DDM process. A fourth approach departs from this model entirely: it involves no centralized operation and thus belongs to the peer-to-peer (P2P) class of algorithms. P2P networks can be unstructured or structured. Unstructured networks are formed arbitrarily by establishing and dropping links over time, and they usually suffer from flooding of traffic to resolve certain requests. Structured networks, on the other hand, make an assumption about the network topology and implement a certain protocol that exploits that topology. In P2P DDM, sites communicate directly with each other to perform the data mining task [7], [8], [9], [10]. Communication in P2P DDM can be very costly if care is not taken to localize traffic instead of relying on flooding of control or data messages.

Regardless of any particular DDM approach, the distributed nature of DDM usually entails a tradeoff between accuracy and scalability. If better accuracy is desired, the granularity of information exchanged between distributed nodes should become finer and/or the connectedness of the nodes should be increased. If better scalability is desired, the granularity should be coarser and/or the connectedness should be reduced. Bandyopadhyay et al. [11] derive for their P2P K-means algorithm an upper bound on the clustering error incurred while computing the distributed solution, which measures the degree to which accuracy is sacrificed in exchange for lower communication cost. The same tradeoff is evident in many other DDM methods, since the prevalent approaches employ approximate rather than exact algorithms.
In this paper, we introduce an approach for distributed data clustering based on a structured P2P network architecture. The goal is to achieve a flexible DDM model that can be tailored to various scenarios. The proposed model is called Hierarchically distributed P2P Clustering (HP2PC). It involves a hierarchy of P2P neighborhoods, in which the peers in each neighborhood are responsible for building a clustering solution, using P2P communication, based on the data they have access to. As we move up the hierarchy, clusters from lower levels are merged. At the root of the hierarchy, one global clustering can be derived.

The model deviates from the standard definition of P2P networks, which typically involve loose structure (or no structure at all), based on peer connections that are created and dropped frequently. The HP2PC model, on the other hand, is based on a static hierarchical structure that is designed up front, upon which the peer network is formed. We plan to introduce a dynamic structure extension to this model in future work.

Using the HP2PC model, we can partition the problem in a modular way, solve each part individually, then successively combine solutions if a global solution is desired. This way, we avoid two problems in the current state of the art in DDM: 1) the high communication cost usually associated with a structured, fully connected network, and 2) the uncertainty in network topology usually introduced by unstructured P2P networks. Experiments on document clustering show that we can achieve results comparable to centralized clustering with a high gain in speedup.

The model lends itself to real-world structures, such as hierarchically distributed organizations or government agencies. In such a scenario, different departments or branches can perform local clustering to draw conclusions from local data. Parent departments or organizations can combine results from lower levels to draw conclusions based on a more holistic view of the data.

In addition, when applied to document clustering, we also provide interpretation of the distributed document clusters using a distributed keyphrase extraction algorithm, which is a distributed variant of the CorePhrase single-cluster summarization algorithm. The algorithm finds the core phrases within a distributed document cluster by iteratively intersecting relevant keyphrases between nodes in a neighborhood. Once converged to a set of core phrases, they are attached to the cluster as an interpretation of its contents.

This paper is organized as follows: Section 2 provides some background on the topic and identifies related work. Section 3 introduces the HP2PC distributed architecture, and Section 4 discusses the foundation behind the HP2PC distributed clustering algorithm. Section 5 discusses the distributed cluster summarization algorithm. Section 6 presents experimental results and discussion. Finally, conclusions and future directions are presented in Section 7.

2 BACKGROUND AND RELATED WORK

DDM started to gain attention during the late 1990s. Although it is still a young area of research, the body of literature on DDM constitutes a sizeable portion of the broader data mining literature.

Basic definitions. Data mining in distributed environments is known as DDM, and sometimes as Distributed Knowledge Discovery (DKD). The central assumption in DDM is that data are distributed over a number of sites and that it is desirable to derive, through data mining techniques, a global model that reflects the characteristics of the whole data set.

A number of challenges (often conflicting) arise when developing DDM methods:

. Communication model and complexity,
. Quality of the global model, and
. Privacy of local data.

It is desirable to develop methods that have low communication complexity, especially in mobile applications such as sensor networks, where communication consumes battery power. The quality of the global model derived from the data should be equal or comparable to that of a model derived using a centralized method. Finally, in some situations where local data are sensitive and not easily shared, it is desirable to achieve a certain level of privacy of local data while deriving the global model. Although not formally proven, deriving high-quality models usually requires sharing as much data as possible, thus incurring higher communication cost and sacrificing privacy at the same time.

Homogeneous versus heterogeneous distributed data. We can differentiate between two types of data distribution. The first is homogeneous, where data are partitioned horizontally across the sites; i.e., each site holds a subset of the data objects. The second is heterogeneous, where data are partitioned vertically; i.e., each site holds a subset of the attribute space, and the data are linked among sites via a common key.

Exact versus approximate DDM algorithms. A DDM algorithm can be described as either exact or approximate. Exact algorithms produce a final model identical to a hypothetical model generated by a centralized process having access to the full data set. Fig. 1 illustrates the hypothetical process that is modeled by an exact distributed clustering algorithm. The exact algorithm works as if the data subsets, D_i, from each node were first brought together into one data set, D; then a centralized clustering algorithm, A, performed the clustering on the whole data set. The clustering solutions are then distributed again by intersecting the data subsets with the global clustering solution.

Fig. 1. Exact distributed clustering model.

Approximate algorithms, on the other hand, produce a model that closely approximates a centrally generated model. Most DDM research focuses on approximate algorithms, as they tend to produce results comparable to exact algorithms with far less complexity [3].

Communication models. Communication between nodes in distributed clustering algorithms can be categorized into three classes (in increasing order of communication cost): 1) communicating models, 2) communicating representatives, and 3) communicating actual data. The first case involves calculating local models that are then sent to peers or to a central site. Models are often comprised of cluster centroids, e.g., P2P K-means [9], cluster dendrograms, e.g., RACHET [1], or generative models, e.g., DMC [2]. In the second case, nodes select a number of representative samples of the local data to be sent to a central site for global model generation, as in the KDEC distributed clustering algorithm [6] and the DBDC algorithm [5]. The last model of communication is for nodes to exchange actual data objects; i.e., data objects can change sites to facilitate construction of clusters that exist only in certain sites, as in the collaborative clustering scheme in [12].

Applications. Applications of DDM are numerous and are usually manifested as distributed computing projects. They often try to solve problems in mathematics and science. Specific areas and sample projects include astronomy (SETI@home), biology (Folding@home, Predictor@home), climate change (CPDN), physics (LHC@home), cryptography (distributed.net), and biomedicine (grid.org). These projects are usually built on top of a common platform providing low-level services for distributed or grid computing. Examples of such platforms include the Berkeley Open Infrastructure for Network Computing (BOINC), Grid.org, World Community Grid, and the Data Mining Grid.

Text mining. Applications of DDM in the text mining area are rare and usually employ a form of distributed information retrieval. Distributed text classification and clustering have received little attention. PADMA is an early example of parallel text classification [13].

The work presented by Eisenhardt et al. [7] achieves document clustering using a distributed P2P network. They use the K-means clustering algorithm, modified to work in a distributed P2P fashion using a probe-and-echo mechanism. They report improvement in speedup compared to centralized clustering. Their algorithm is an exact algorithm, although it requires global synchronization at each iteration.

A similar system can be found in [14], but the problem is posed from the information retrieval point of view. In this work, a subset of the document collection is centrally partitioned into clusters, for which "cluster signatures" are created. Each cluster is then assigned to a node, and later documents are classified to their respective clusters by comparing their signatures with all cluster signatures. Queries are handled in the same way: they are directed from a root node to the node handling the cluster most similar to the query.

State of the art. In the latest issue of IEEE Internet Computing [15] (at the time of writing this paper), a few algorithms were presented representing the state of the art in DDM. Datta et al. [10] described an exact local algorithm for monitoring a K-means clustering (originally proposed by Wolff et al. [16]), as well as an approximate local K-means clustering algorithm for P2P networks (originally proposed by Datta et al. [8], [9]).

Although the K-means monitoring algorithm does not produce a distributed clustering, it helps a centralized K-means process know when to recompute the clusters by monitoring the distribution of centroids across peers and triggering a reclustering if the data distribution changes significantly over time.

The P2P K-means algorithm in [8] and [9], on the other hand, works by updating the centroids at each peer based on information received from its immediate neighbors. The algorithm terminates when the information received does not result in a significant update to the centroids of any peer. The P2P K-means algorithm finds its roots in a parallel implementation of K-means proposed by Dhillon and Modha [17].

3 THE HP2PC DISTRIBUTED ARCHITECTURE

HP2PC is a hierarchically distributed P2P architecture for scalable distributed clustering of horizontally partitioned data. We argue that a scalable distributed clustering system (or any data mining system, for that matter) should involve hierarchical distribution. A hierarchical processing strategy allows for delegation of responsibility and modularity.

Central to this hierarchical architecture is the formation of neighborhoods. A neighborhood is a group of peers forming a logical unit of isolation in an otherwise unrestricted open P2P network. Peers in a neighborhood can communicate directly but not with peers in other neighborhoods. Each neighborhood has a supernode, and communication between neighborhoods is achieved through their respective supernodes. This model reduces the flooding problems usually encountered in large P2P networks.

The notion of a neighborhood accompanied by a supernode can be applied recursively to construct a multilevel overlay hierarchy of peers; i.e., a group of supernodes can form a higher level neighborhood, which can communicate with other neighborhoods on the same level of the hierarchy through their respective (higher level) supernodes. This type of hierarchy is illustrated in Fig. 2.

Fig. 2. The HP2PC hierarchy architecture.

3.1 Notations

. p. A set of peers comprising a P2P network.
. p_i. Peer i in p.
. q. A set of neighborhoods.
. q_j. Neighborhood j in q.
. sp_j. The peer designated as the supernode of q_j.
. H. The hierarchy height of an HP2PC network.
. h. A specific level within an HP2PC network.
. p^(h). The set of peers at level h.
. q^(h). The set of neighborhoods at level h.
. α. Network partitioning factor.
. D. Data set.
. D_i. Horizontally partitioned subset of D held by peer i.
. d_i. Data object i in D.
. c. A set of clusters.
. c_k. Cluster k in c.
. m_k. Mean (centroid) of cluster k.
. s_k. The set of pairwise similarities within cluster k.
. H_k. Histogram of the similarity distribution within cluster k.
. φ_k. Skew of the similarity histogram of cluster k.
. w_k. Weight of cluster k.
. Σ_k. Summary of cluster k.
. K. Number of clusters.
. I. Number of algorithm iterations.
. n(·). Cardinality of a set or vector.

3.2 Hierarchical Overlays
A P2P network is comprised of a set of peers, or nodes, p = {p_i}, 1 ≤ i ≤ n(p), where n(p) is the number of nodes in the network. An overlay network is a logical network on top of p that connects a certain subset of the nodes in p. Most work in the P2P literature refers to a single level of overlay; i.e., there exists one overlay network on top of the original P2P network. In our work, this concept is extended to allow multiple overlay levels on top of p.

To distinguish between the different levels of overlay, we use the level of an overlay network. An overlay network at level h is denoted p^(h). The lowest level network (the physical P2P network) is at level h = 0 and is thus denoted p^(0), while the highest possible overlay is at level H, denoted p^(H), and consists of a single node (the root of the overlay hierarchy). In subsequent formulations, we drop the superscript (h) if the formulation refers to an arbitrary level; otherwise, we use it to distinguish network levels.

The size of the overlay network at each level differs according to how many peers comprise the overlay. However, since a higher level overlay is always a subset of its immediate lower level overlay, we maintain the following inequality:

  0 < n(p^(h)) < n(p^(h−1)),  for all h > 0.    (1)

The choice of which subset of peers forms the next level overlay is closely related to the notion of peer neighborhoods, discussed in Section 3.3.

3.3 Neighborhoods
We divide a network overlay into a set of neighborhoods, q. A neighborhood, q_j ∈ q, comprises a set of peers that is a subset of an overlay p and that has a designated peer known as a supernode, sp_j; thus

  Neighborhood_j ≡ (q_j, sp_j) : q_j ⊆ p, sp_j ∈ q_j.

The following neighborhood properties are enforced in the HP2PC architecture:

. A set of neighborhoods, q = {q_j}, 1 ≤ j ≤ n(q), covers the overlay network p:

    p = ∪_{j=1..n(q)} q_j.

. Neighborhoods do not overlap:

    for all i and j ≠ i : q_i ∩ q_j = ∅.

. A node must belong to some neighborhood:

    for all p ∈ p : p ∈ q_j for some q_j ∈ q.

A network partitioning factor, α ∈ [0, 1], partitions the P2P network into equally sized neighborhoods. The number of neighborhoods in an overlay network as a function of α is given by

  n(q) = ⌊1 + α(n(p) − 1)⌋.    (2)

At one extreme, when α = 0, there is only one neighborhood, containing all the peers in p. At the other extreme, when α = 1, there are n(p) neighborhoods, each containing exactly one peer. In between, α determines the number of neighborhoods and, consequently, the size of each neighborhood.

An initial attempt to determine the size of each neighborhood as a fraction of the size of the network was to use

  n(q_j) = ⌊n(p)/n(q)⌋,                      1 ≤ j ≤ n(q) − 1,
  n(q_j) = n(p) − Σ_{i=1..n(q)−1} n(q_i),    j = n(q).    (3)

However, this only works well when n(p) ≫ n(q). When n(p) and n(q) are of the same order of magnitude, n(p)/n(q) is a small number, so all neighborhoods except the n(q)th receive very few nodes, while the n(q)th absorbs the large remainder. For example, if n(p) is 250 and n(q) is 150, we get 149 neighborhoods of size 1 and one neighborhood of size 101.

To solve this problem, we use a binary function that generates the neighborhood sizes based on a specified probability for each of two values. Let r represent this probability, given by

  r = n(p)/n(q) − ⌊n(p)/n(q)⌋.

The following function then generates two neighborhood sizes based on r:

  n(q_j) = ⌊n(p)/n(q)⌋  with probability (1 − r),
           ⌈n(p)/n(q)⌉  with probability r.    (4)

This method produces a more even distribution of neighborhood sizes, even when n(p) and n(q) are of the same order of magnitude. In the above example, using (4) we have n(p)/n(q) = 1.67 (r = 0.67), so 67 percent of the neighborhoods will be of size 2 and 33 percent will be of size 1, which is far more balanced than with (3).
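To make the partitioning scheme concrete, the following Python sketch generates neighborhood sizes according to (2) and (4). It is illustrative only: the function names are ours, and the final repair loop, which redistributes any rounding drift so the sizes sum exactly to n(p), is an implementation assumption rather than part of the published algorithm.

import math
import random

def neighborhood_sizes(n_p, alpha, rng=random.Random(0)):
    # Partition n_p peers into neighborhoods following Eqs. (2) and (4).
    if alpha == 0:
        return [n_p]                             # a single neighborhood holds every peer
    n_q = math.floor(1 + alpha * (n_p - 1))      # number of neighborhoods, Eq. (2)
    base = n_p // n_q
    r = n_p / n_q - base                         # probability of rounding up, Eq. (4)
    sizes = [base + 1 if rng.random() < r else base for _ in range(n_q)]
    # Repair rounding drift so the sizes still cover exactly n_p peers (our assumption).
    diff = n_p - sum(sizes)
    i = 0
    while diff != 0:
        step = 1 if diff > 0 else -1
        if sizes[i % n_q] + step >= 1:
            sizes[i % n_q] += step
            diff -= step
        i += 1
    return sizes

# Roughly the example from the text: 250 peers, about 150 neighborhoods (r = 0.67),
# yielding a mix of size-1 and size-2 neighborhoods rather than 149 + 101.
print(neighborhood_sizes(250, alpha=0.6))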

All peers at the same level, h, of the hierarchy are denoted by p^(h). Let the function level(p) determine the level of a peer; i.e., level(p^(h)) = h. A peer p_i can communicate with a peer p_j if and only if level(p_i) = level(p_j) and p_i ∈ q_l ⟺ p_j ∈ q_l; i.e., they belong to the same neighborhood.

Peer hierarchy formation is bottom-up, so the lowest level of the hierarchy is h = 0. The supernodes of level 0 neighborhoods form the overlay network at level h = 1. Recursively, at level h = 2 are the supernodes of level 1 neighborhoods (groups of level 1 supernodes). The root supernode is at level H, the height of the hierarchy; i.e., there exists exactly one p^(H) in the system.

The network partitioning factor, α, can be different for different levels of the hierarchy; i.e., the neighborhood count and size are not necessarily the same at each level. If we apply the same network partitioning factor to every level, we can deduce the height, H, of the hierarchy. We can approximate (2) as

  n(q) ≈ α n(p).

Since the number of nodes at a certain level equals the number of supernodes (or neighborhoods) at the level below it, we can say that

  n(p)^(h) = α n(p)^(h−1),  for all h > 0.

At the top of the hierarchy (level H), we have one node, so

  α^H n(p) = 1,    (5)

from which we can deduce H:

  H = ⌈log n(p) / (−log α)⌉,  0 < α < 1;
      1,                       α = 0;
      ∞,                       α = 1.    (6)

If, however, the partitioning factor is chosen to be different for different levels of the hierarchy, then we cannot deduce the full height of the hierarchy up front. We can, however, deduce the maximum height reachable from a certain level using (6), and hence iteratively calculate the full hierarchy height if all level α's are known a priori.

If, instead of specifying α's, a certain hierarchy height is desired, we can calculate the proper α (the same for all levels) using the following equation, which is derived from (6):

  α = e^(−log n(p)/H),  H > 1;
      0,                H = 1.    (7)
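A minimal Python sketch of these two closed forms follows (function names are ours); it simply evaluates (6) and (7) for a uniform partitioning factor.

import math

def hierarchy_height(n_p, alpha):
    # Height H reachable with a uniform partitioning factor alpha, Eq. (6).
    if alpha == 0:
        return 1
    if alpha == 1:
        return math.inf     # every peer is its own neighborhood; the levels never shrink
    return math.ceil(math.log(n_p) / -math.log(alpha))

def partitioning_factor(n_p, H):
    # Uniform alpha that yields a desired hierarchy height H, Eq. (7).
    if H == 1:
        return 0.0
    return math.exp(-math.log(n_p) / H)

print(hierarchy_height(250, 0.2))    # height of a 250-node network with alpha = 0.2
print(partitioning_factor(250, 3))   # alpha giving a three-level hierarchy (about 0.159)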
3.4 Example
Fig. 3 illustrates the HP2PC architecture with an example.

Fig. 3. Example of an HP2PC network.

The network shown consists of 16 nodes and four hierarchy levels. The set of nodes at level 0, p^(0), is divided into four neighborhoods, subject to the network partitioning factor α^(0) = 0.2. Each supernode of level 0 becomes a regular node at level 1, forming the set of four nodes p^(1). Those in turn are grouped into two neighborhoods forming q^(1), satisfying α^(1) = 0.33. At level 2, only one neighborhood is formed out of the level 1 supernodes, satisfying α^(2) = 0. Finally, the root of the hierarchy is found at level 3.

3.5 HP2PC Network Construction
An HP2PC network is constructed recursively, starting from level 0 up to the height of the hierarchy, H. The number of neighborhoods and the size of each neighborhood are controlled through the partitioning factor α, which is specified for each level of the hierarchy except the root level.

The construction process is given in Algorithm 1. Given the initial set of nodes, p^(0), and the set of partitioning factors B = {α^(h)}, 0 ≤ h ≤ H − 1, the algorithm recursively constructs the network. At each level, we partition the current p^(h) into the proper number of neighborhoods and assign a supernode to each one. The set of supernodes at a certain level forms the set of nodes for the next higher level, which is passed to the next recursive call. Construction stops when the root is reached.

Algorithm 1 HP2PC Construction
Input: p^(0), B = {α^(h)}, 0 ≤ h ≤ H − 1
Output: {p^(h)}, {q^(h)}, 0 ≤ h ≤ H
1: for h = 0 to H − 1 do
2:   n(p)^(h) = |p^(h)|
3:   Calculate n(q)^(h) by substituting n(p)^(h) and α^(h) into (2)
4:   q^(h) = ∅, p^(h+1) = ∅
5:   a = 1, b = 1   {partition p^(h) into n(q)^(h) neighborhoods}
6:   for j = 1 to n(q)^(h) do
7:     Calculate n(q_j) using (4)
8:     b = b + n(q_j)
9:     q_j = {p_i^(h)}, a ≤ i ≤ b
10:    a = b + 1
11:    Add q_j to q^(h)
12:    sp_j = first node in q_j   {supernode for q_j}
13:    Add sp_j to p^(h+1)
14:  end for
15: end for
16: q^(H) = p^(H)   {root node}

4 THE HP2PC DISTRIBUTED CLUSTERING ALGORITHM

The HP2PC algorithm is a distributed, iterative, centroid-based clustering process: a set of cluster centroids is generated to describe the clustering solution. In HP2PC, each neighborhood converges to a set of centroids that describes the data set in that neighborhood. The distributed clustering strategy within a single neighborhood is similar to the parallel K-means algorithm [17] in that the final set of centroids of a neighborhood is identical to the set produced by centralized K-means on the data within that neighborhood. Other neighborhoods, either on the same level or at higher levels of the hierarchy, may converge to different sets of centroids.

Once a neighborhood converges to a set of centroids, those centroids are acquired by the supernode of that neighborhood. The supernode, in turn, as part of its higher level neighborhood, collaborates with its peers to form a set of centroids for that neighborhood. This process continues hierarchically until a set of centroids is generated at the root of the hierarchy.

4.1 Estimating Clustering Quality
The distributed search for cluster centroids is guided by a cluster quality measure that estimates intracluster cohesiveness and intercluster separation.

Cluster cohesiveness. The distribution of pairwise similarities within a cluster is represented using a cluster similarity histogram, a concise statistical representation of the cluster tightness [18]. Let sim(·) be a similarity measure between two objects, and let s_k be the set of pairwise similarities between objects of cluster c_k:

  n(s_k) = n(c_k)(n(c_k) + 1)/2,    (8a)
  s_k = {s_l : 1 ≤ l ≤ n(s_k)},    (8b)
  s_l = sim(d_i, d_j),  d_i, d_j ∈ c_k.    (8c)

The histogram of the similarities in cluster c_k is represented as

  H_k = {h_i : 1 ≤ i ≤ B},    (9a)
  h_i = count(s_l),    (9b)
  s_l ∈ s_k,  δ(i − 1) ≤ s_l < δ i,    (9c)

where
. B is the number of histogram bins,
. h_i is the count of similarities in bin i, and
. δ is the bin width of the histogram.

To estimate the cohesiveness of cluster c_k, we calculate the histogram skew. Skew is the third central moment of a distribution; it tells us whether one tail of the distribution is longer than the other. A positive skew indicates a longer tail in the positive direction (the higher interval of the histogram), while a negative skew indicates a longer tail in the negative (lower interval) direction. A similarity histogram that is negatively skewed therefore indicates a tight cluster.

Skew is calculated as

  skew = Σ_i (x_i − μ)³ / (N σ³),    (10)

so the tightness of a cluster, φ(c_k), calculated as the skew of its histogram H_k, is

  φ(c_k) = skew(H_k) = Σ_l (s_l − s̄_k)³ / (n(s_k) σ_{s_k}³),  s_l ∈ s_k.    (11)

A clustering quality measure based on the skewness of the similarity histograms of individual clusters can then be derived as a weighted average of the individual cluster skews:

  φ(c) = Σ_k n(c_k) φ(c_k) / Σ_k n(c_k),  c_k ∈ c.    (12)
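To illustrate (8)-(12), the short Python sketch below computes the skew-based tightness of each cluster and the size-weighted quality measure. It assumes dense vectors and cosine similarity and, following Algorithm 3 rather than (8a), uses only the n(n − 1)/2 distinct pairs (self-similarities excluded); all names are illustrative.

import numpy as np

def pairwise_similarities(X):
    # Cosine similarities of all distinct row pairs of X (cf. the pairwise similarity routine).
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    S = Xn @ Xn.T
    iu = np.triu_indices(len(X), k=1)
    return S[iu]

def cluster_skew(sims):
    # Skew of the similarity distribution, Eq. (11); negative skew indicates a tight cluster.
    mu, sigma = sims.mean(), sims.std()
    return float(np.sum((sims - mu) ** 3) / (len(sims) * sigma ** 3))

def clustering_quality(clusters):
    # Size-weighted average of cluster skews, Eq. (12); clusters is a list of (n_k, d) arrays.
    skews = [cluster_skew(pairwise_similarities(C)) for C in clusters]
    sizes = np.array([len(C) for C in clusters], dtype=float)
    return float(np.dot(sizes, skews) / sizes.sum())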

4.2 Distributed Clustering (Level h = 0)
We define a general function for updating cluster models in a fully connected neighborhood:

  c^(i,t) = f({c^(j,t−1)}),  i, j ∈ q,    (13a)
  c^(i,0) = c_0,    (13b)

where c^(i,t) is the clustering model (a set of clusters) calculated by peer i at iteration t, and f(·) is an aggregating function. The equation is illustrated in Fig. 4, where the output of each peer at iteration t depends on the models calculated by all other peers at iteration t − 1. In P2P K-means [8], f(·) ≡ avg(·), and the neighborhood is based on an ad hoc network topology.

Fig. 4. Iterative level 0 neighborhood clustering.

Algorithm 2 shows the iterative HP2PC process for updating the cluster centroids at each node. Utility routines for the main algorithm are given in Algorithm 3. For each neighborhood, an initial set of centroids is calculated by the supernode and transmitted to all peers in the neighborhood. Like K-means, during each iteration, each peer assigns its local data to their nearest centroids and calculates new centroids. In addition, it also calculates the cluster skews using (11).

Algorithm 2 Level 0 Clustering
Input: Number of clusters K
Output: Set of clusters c_r for each neighborhood q_r in q^(0)
1: for all q_r ∈ q^(0) do
2:   for all p_i ∈ q_r do
3:     D_i = data set at p_i
4:     s^i = CalcPairwiseSimilarity(D_i)
5:     {m^r, w^r} = ReceiveFrom(sp_r)   {sp_r: supernode of q_r}
6:     for all j ≠ r : m^j = 0, w^j = 0
7:     {m^i, w^i} = UpdateClusters(D_i, {m^i})
8:     while change in {m^i} > ε do
9:       for all p_j ∈ q_r, j ≠ i do
10:        SendTo(p_j, {m^i, w^i})
11:        {m^j, w^j} = ReceiveFrom(p_j)
12:      end for
13:      for k = 1 to K do
14:        m_k^i = 0, w_k^i = 0
15:        for j = 1 to n(q_r) do
16:          m_k^i = m_k^i + (m_k^j · w_k^j)
17:          w_k^i = w_k^i + w_k^j
18:        end for
19:        m_k^i = m_k^i / w_k^i
20:      end for
21:      {m^i, w^i} = UpdateClusters(D_i, {m^i})
22:    end while
23:  end for
24:  if p_i = sp_r then
25:    c_r = c^i   {set of clusters stored at the supernode}
26:  end if
27: end for

Algorithm 3 Utility Routines for Level 0 Clustering
Function UpdateClusters(D_i, {m^i})
1: c^i = ∅
2: for all d_j ∈ D_i do
3:   l = argmin_k {|d_j − m_k|}
4:   Add d_j to c_l^i
5: end for
6: for all c_k^i do
7:   m_k^i = avg(d_j), j ∈ c_k^i
8:   s_k^i = s^i # c_k^i   {part of s^i indexed by objects in c_k^i}
9:   φ_k^i = CalcSkew(s_k^i)
10:  w_k^i = φ_k^i · n(c_k^i)
11: end for
12: return {m^i, w^i}

Function CalcSkew(s)
1: φ = 0, μ = avg(s), σ = stddev(s)
2: for all s_l ∈ s do
3:   φ = φ + (s_l − μ)³
4: end for
5: φ = φ / (n(s) σ³)
6: return φ

Function CalcPairwiseSimilarity(D)
1: s = ∅
2: for i = 1 to n(D) do
3:   for j = 1 to i − 1 do
4:     Add sim(d_i, d_j) to s
5:   end for
6: end for
7: return s

The final set of centroids at each iteration is calculated from all peer centroids. Unlike K-means (or P2P K-means), these final centroids are weighted by the skew and size of the clusters at the individual peers. The weight of a cluster c_k at peer p_j is defined as

  w_k^j = φ(c_k^j) · n(c_k^j).

At the end of each iteration, each node transmits its cluster centroids and their corresponding weights to all its peers:

  c^(j,t) ≡ {m_k^j, w_k^j}_t,  k ∈ [1, K].

At peer p_i, the centroid of cluster k is updated according to the following equation, which favors tight and dense clusters:

  m_k^(i,t) = Σ_j w_k^(j,t−1) m_k^(j,t−1) / Σ_j w_k^(j,t−1),  j ∈ q_r.    (14)

This is followed by assigning objects to their nearest centroids and calculating the new set of cluster skews, {φ_k^i}, and sizes, {n(c_k^i)}, which are used in the next iteration. The algorithm terminates when the object assignment does not change, or when, for all i and k, ||m_k^(i,t) − m_k^(i,t−1)|| < ε, where ε is a sufficiently small parameter.
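The aggregation step of (14) reduces to a weighted average over the models received from the neighborhood. A minimal numpy sketch follows (illustrative names; dense centroid matrices assumed):

import numpy as np

def weighted_centroid_update(peer_models):
    # One HP2PC centroid update at a peer, Eq. (14).
    # peer_models: list of (centroids, weights) pairs, one per peer in the neighborhood,
    # where centroids is a (K, d) array and weights a (K,) array of skew-times-size weights.
    centroids = np.stack([m for m, _ in peer_models])    # shape (P, K, d)
    weights = np.stack([w for _, w in peer_models])      # shape (P, K)
    num = np.einsum('pk,pkd->kd', weights, centroids)    # sum over peers of w_k^j * m_k^j
    den = weights.sum(axis=0)[:, None]                   # sum over peers of w_k^j
    return num / den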
4.3 Distributed Clustering (Level h > 0)
Once a neighborhood converges to a set of clusters, the centroids and weights of those clusters are acquired by the supernode as its initial set of clusters; i.e., for neighborhood q_r with supernode sp_r,

  c^(r,0,(h)) = c^(r,T,(h−1)),

where T is the final iteration of the algorithm at level h − 1 for neighborhood q_r.

Since at level h of the hierarchy the actual data objects are not available, we rely on metaclustering: merging the clusters using centroid and weight information alone. At level h > 0, clusters are merged in a bottom-up fashion, up to the root of the hierarchy; i.e., c^(h) = f(c^(h−1)). This means that once a neighborhood at level h converges to a set of clusters, it is frozen, and the higher level clustering is invoked. (A more elaborate technique would involve bidirectional traffic, making c^(h−1) = f(c^(h)) as well, but the complexity of this approach could be prohibitive, so we leave it for future work.)

A neighborhood at level h consists of a set of peers, each having a set of K centroids. To merge those clusters, the centroids are collected and clustered at the supernode of that neighborhood using K-means. This process repeats until one set of clusters is computed at the root of the hierarchy. The formal procedure representing this clustering process is presented in Algorithm 4.

Algorithm 4 HP2PC Clustering
1: for all q_i ∈ q^(0) do
2:   {m_i}^(0) = NeighborhoodCluster(q_i)
3: end for
4: for h = 1 to H do
5:   for all q_i ∈ q^(h) do
6:     for all p_j ∈ q_i do
7:       {m_j} = {m_j}^(h−1)
8:       SendTo(sp_i, {m_j})
9:     end for
10:    {m_i}^(h) = K-means({m_j})   {only at peer sp_i}
11:  end for
12: end for

For comparison against the baseline K-means algorithm, we calculate the centroids of each neighborhood using centralized K-means, then compare them against the HP2PC centroids with respect to the merged data set of the respective neighborhood (Algorithm 5). For neighborhoods at level 0, the merged data set is the union of the data from all nodes in the neighborhood. For those at higher levels, the merged data set is the union of all data reachable from every node in the neighborhood through its respective lower level nodes. The EvaluateAndCompare function evaluates both the centralized and distributed solutions and compares them against each other, as reported in the various experiments in Section 6.

Algorithm 5 Centralized K-means Clustering Comparison
1: for h = 0 to H do
2:   for all q_i ∈ q^(h) do
3:     D_i = ∪_{j∈q_i} D_j   {merged data set of neighborhood q_i}
4:     {m_i}_ctr^(h) = kmeans(D_i)
5:     EvaluateAndCompare({m_i}_ctr^(h), {m_i}^(h), D_i)
6:   end for
7: end for

One of the major benefits of this algorithm is the ability to zoom in to more refined clusters by descending the hierarchy and to zoom out to more generalized clusters by ascending it. The other major benefit is the ability to merge a forest of independent hierarchies into one hierarchy by putting all roots of the forest into one neighborhood and invoking the merge algorithm on that neighborhood.

4.4 Complexity Analysis
We divide the complexity of HP2PC into computational complexity and communication complexity.

4.4.1 Computation Complexity
Assume the entire data set size across all nodes is D. The data set is equally partitioned among nodes, so each node holds D_P = D/n(p) data objects. For level 0, we have n(q) neighborhoods, each of size S_Q = n(p)/n(q).

Each node has to compute a pairwise similarity matrix before it begins the P2P clustering process, requiring D_P(D_P − 1)/2 similarity computations. In each iteration, each node computes a new set of K centroids by averaging all neighborhood centroids (K · S_Q), assigns the data objects to those centroids (K · D_P), recomputes centroids based on the new data assignment (D_P), and calculates the skew of the clusters (D_P(D_P − 1)/2). These requirements are summarized as

  T_sim = D_P(D_P − 1)/2,
  T_update = K(S_Q + D_P) + D_P,
  T_skew = D_P(D_P − 1)/2.

Let the number of iterations required to converge to a solution be I. Then the total number of computations required by each node to converge is

  T_P = T_sim + I[T_update + T_skew].    (15)

If we assume that D_P ≫ 1 and I ≫ 1, we can rewrite (15) as

  T_P = I[K(S_Q + D_P) + D_P²/2].    (16)

For levels above 0, each neighborhood is responsible for metaclustering a set of K·S_Q centroids into K centroids using K-means. Then, for each neighborhood at level h, the required computation is

  T_h = I S_Q^(h) K².    (17)

Since each neighborhood's computation is done in parallel with the others, we need only T_h computations per level. However, since computations at higher levels of the hierarchy must wait for lower levels to complete, we have to sum T_h over all levels. The total computation required for all levels above 0 is thus

  T_H = Σ_{h=1..H−1} T_h.    (18)

Finally, we combine (16) and (18) to find the total computation complexity of HP2PC:

  T = T_P + T_H.    (19)

It can be seen that the computation complexity is largely affected by the data set size of each node (D_P). By increasing the total number of nodes, we can decrease D_P (since the data are equally partitioned among nodes), but at the expense of increasing communication complexity, as well as decreasing clustering quality due to fragmentation of the data set.
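As a rough worked example of (16)-(19), the sketch below plugs a parameter set into the formulas and returns the estimated per-node operation count; the sample numbers and the uniform-neighborhood assumption are ours, purely for illustration.

def hp2pc_computation_estimate(D, n_p, n_q, K, I, H):
    # Approximate per-node computation count from Eqs. (16)-(19),
    # assuming every neighborhood at every level has S_Q = n_p / n_q peers.
    D_P = D / n_p                                        # objects held by each node
    S_Q = n_p / n_q                                      # peers per neighborhood
    T_P = I * (K * (S_Q + D_P) + D_P ** 2 / 2)           # Eq. (16), level 0
    T_H = sum(I * S_Q * K ** 2 for _ in range(1, H))     # Eqs. (17)-(18)
    return T_P + T_H                                     # Eq. (19)

# e.g., 20,000 documents over 250 nodes, 50 neighborhoods, K = 20, I = 10, H = 3
print(f"{hp2pc_computation_estimate(20000, 250, 50, 20, 10, 3):,.0f}")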

4.4.2 Communication Complexity
In neighborhoods at level 0, at each iteration every peer sends S_Q − 1 messages to its neighbors, each message of size K. For S_Q peers in one neighborhood, S_Q(S_Q − 1) messages are exchanged. The communication complexity for the n(q)^(0) neighborhoods at level 0 is then

  M_0 ≈ n(q)^(0) (S_Q^(0))² I K.    (20)

Since n(q) = n(p)/S_Q, we have

  M_0 ≈ n(p)^(0) S_Q^(0) I K.    (21)

For levels above 0, each neighborhood requires S_Q − 1 messages to be sent to its supernode, each message of size K. The communication complexity for the n(q)^(h) neighborhoods at level h is then

  M_h ≈ n(q)^(h) S_Q^(h) K ≈ n(p)^(h) K.    (22)

The total communication requirement of HP2PC is then

  M ≈ M_0 + Σ_{h=1..H−1} M_h.    (23)

We can see that the communication complexity is greatly influenced by the size of the neighborhoods at level 0. The worst case is when all nodes are put in one neighborhood, resulting in quadratic complexity (in terms of the number of nodes). As we adopt more fine-grained neighborhoods, we can reduce both computation and communication complexity, but at the expense of clustering quality, as discussed in the results section.

5 DISTRIBUTED CLUSTER SUMMARIZATION USING KEYPHRASE EXTRACTION

Summarizing the clusters generated by HP2PC using CorePhrase poses two challenges. First, since CorePhrase works by intersecting the documents in a cluster, generating a summary for a document cluster that is distributed across various nodes cannot be done directly; thus, CorePhrase needs modification to work in this kind of environment. Second, merging cluster summaries up the hierarchy requires working with the keyphrases extracted at level 0 only, without any access to the actual documents.

5.1 Single Cluster Summarization
The summary of a document cluster is represented as a set of core keyphrases that describe the topic of the cluster. The CorePhrase [19] keyphrase extraction algorithm is used for this purpose. CorePhrase works by first constructing a list of candidate keyphrases for each cluster, scoring each candidate keyphrase according to its features, ranking the keyphrases by score, and finally selecting a number of the top ranking keyphrases for output. We briefly review CorePhrase here for completeness.

Extraction of candidate keyphrases. Candidate keyphrases naturally lie at the intersection of the documents of a cluster. The CorePhrase algorithm compares every pair of documents to extract matching phrases. This process of matching every pair of documents is inherently O(n²). However, using a document phrase indexing graph structure, known as the Document Index Graph (DIG), the algorithm can achieve this goal in near-linear time [20].

In essence, the DIG model keeps a cumulative graph representing the currently processed documents: G_i = G_{i−1} ∪ g_i, where g_i is the subgraph representation of a new document. Upon introducing a new document, its subgraph is matched with the existing cumulative graph to extract the matching phrases between the new document and all previous documents. That is, the list of matching phrases between document d_i and the previous documents is given by M_i = g_i ∩ G_{i−1}. The graph maintains complete phrase structure identifying the containing document and phrase location, so cycles can be uniquely identified. This process produces complete phrase-matching output between every pair of documents in near-linear time, with arbitrary length phrases. Fig. 5 illustrates the process of phrase matching between two documents: the subgraphs of the two documents are matched to obtain the list of phrases shared between them.

Fig. 5. Phrase matching using DIG.
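The following Python sketch conveys the cumulative matching idea in a much simplified form, using bounded-length word n-gram sets instead of the actual DIG structure (which matches arbitrary-length phrases in near-linear time); the function names and the length bound are our own assumptions.

def shared_phrases(words_a, words_b, max_len=4):
    # Word n-grams (length 2..max_len) shared by two tokenized documents.
    def ngrams(words):
        return {tuple(words[i:i + n])
                for n in range(2, max_len + 1)
                for i in range(len(words) - n + 1)}
    return ngrams(words_a) & ngrams(words_b)

def candidate_keyphrases(cluster_docs):
    # Candidate keyphrases of a cluster: phrases shared by at least one document pair,
    # accumulated incrementally as each new document is matched against previous ones.
    candidates, seen = set(), []
    for doc in cluster_docs:
        for prev in seen:
            candidates |= shared_phrases(prev, doc)
        seen.append(doc)
    return candidates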
Phrase features. Quantitative features are needed to judge the quality of the candidate keyphrases. Each candidate keyphrase p is assigned the following features:

. df: document frequency; the number of documents in which the phrase appears, normalized by the total number of documents:

    df = |documents containing p| / |all documents|.

. w: average weight; the average weight of the phrase over all documents. The weight of a phrase in a document is calculated using structural text cues; for example, title phrases have maximum weight, section headings are weighted less, and body text is weighted lowest.

. pf: average phrase frequency; the average number of times the phrase appears in one document, normalized by the length of the document in words:

    pf = avg over documents ( |occurrences of p| / |words in document| ).

. d: average phrase depth; based on the location of the first occurrence of the phrase in each document:

    d = avg over documents ( 1 − |words before first occurrence| / |words in document| ).

Phrase ranking. The phrase features are used to calculate a score for each phrase. Phrases are then ranked by score, and a number of the top phrases are selected as the ones describing the cluster topic. The score of each phrase p is

  score(p) = ( sqrt(w · pf) · d² ) · ( −log(1 − df) ).    (24)

The equation is derived from the tf-idf term weighting measure; however, we reward phrases that appear in more documents (high df) rather than punishing them. By examining the distribution of the values of each feature in a typical corpus, it was found that the weight and frequency features usually have low values compared to the depth feature. To take this into account, it was necessary to "expand" the weight and frequency features by taking their square root and to "compact" the depth by squaring it. This helps even out the feature distributions and prevents one feature from dominating the score equation.
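A direct transcription of (24) is given below; the reading of the last factor as −log(1 − df), which rewards high document frequency, follows the discussion above, and the sample feature values are invented for illustration.

import math

def corephrase_score(w, pf, d, df):
    # CorePhrase score of a candidate phrase, Eq. (24).
    # w, pf, d, df are the average weight, average frequency, average depth,
    # and document frequency of the phrase, each normalized to [0, 1);
    # df < 1 keeps the log term finite.
    return math.sqrt(w * pf) * d ** 2 * -math.log(1.0 - df)

# a phrase appearing early in 60 percent of the cluster's documents
print(corephrase_score(w=0.5, pf=0.05, d=0.9, df=0.6))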
Word weight-based score assignment. Another method for scoring phrases was also used, based on individual word weights. This method is referred to as CorePhrase-M:

. First, assign initial scores to each phrase based on the phrase scoring formula given above.
. Construct a list of unique individual words out of the candidate phrases.
. For each word, add up the scores of the phrases in which this word appears to create a word weight.
. For each phrase, assign the final phrase score by adding the individual word weights of the constituent words and averaging them.

5.2 Summarizing Level 0 Clusters
Let every node p_i generate a cluster summary Σ_k^i (a set of keyphrases) for each cluster c_k^i using CorePhrase. Nodes in the same neighborhood then enter multiple rounds of communication to agree on a common summary for each cluster. For cluster c_k^i, in each round, node p_i receives a cluster summary Σ_k^j, j ∈ q, from every other node in its neighborhood. Node p_i then produces two sets of keyphrases based on {Σ_k^j}: one is called the core summary, Σ_k^q, and the other is called the local cluster summary, Σ_k^i. The core summary is generated by intersecting all keyphrases in {Σ_k^j, Σ_k^i}:

  Σ_k^q = ∩_{j∈q} Σ_k^j.    (25)

The local cluster summary is generated by intersecting all summaries from other nodes with the local documents from cluster c_k^i:

  Σ_k^i = [ ∪_{j∈q, j≠i} Σ_k^j ] ∩ D_{i,k}.    (26)

Note that, by definition, the core summary will be the same at all nodes, since the operation is identical at all nodes. The local cluster summary, however, will differ due to the intersection with local documents. In the next round, each node sends its local cluster summary to all other nodes. The local cluster summaries are intersected together again according to (25), and the result is appended to the core summary Σ_k^q:

  Σ_k^(q,t+1) = Σ_k^(q,t) ∪ [ ∩_{j∈q} Σ_k^(j,t) ].    (27)

This process repeats until the desired number of keyphrases per cluster summary is acquired or the intersection yields an empty set.

5.3 Summarizing Higher Level Clusters
At higher levels, (26) is not applicable, since no local data are available. Summarization of a cluster at a higher level is simply an intersection of the keyphrase summaries of the clusters chosen to be merged into that cluster by the higher level K-means algorithm. Let C_k = {c_i} be the set of clusters chosen to be merged into one cluster, c_k, and let {Σ_i} be their corresponding summaries. The summary of cluster c_k is the intersection of the constituent cluster summaries, merged with an equal subset from every constituent summary, up to L keyphrases.

Let Σ_k^∩ represent the core intersection:

  Σ_k^∩ = ∩_{i∈C_k} Σ_i.

If |Σ_k^∩| > L, then Σ_k^∩ is truncated to L keyphrases. Otherwise, Σ_k^∩ is merged with an equal subset from every constituent cluster summary Σ_i. Let M = (L − |Σ_k^∩|)/K, where K is the number of clusters, and let Σ̄_i = Σ_i \ Σ_k^∩; then the final cluster summary is

  Σ_k = Σ_k^∩ ∪ { ∪_i Σ̄_i^(M) },    (28)

where Σ̄_i^(M) denotes the top M keyphrases in Σ̄_i. Thus, the core cluster summary is augmented with an equal subset of the top keyphrases from each constituent cluster summary that are not already in the core summary. This ensures that the core summary is representative of all constituent clusters.
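The set operations of (25)-(28) translate almost directly into code. The sketch below is a minimal, single-round illustration; it assumes summaries are ranked lists of phrase strings (best first) and that local documents are represented by the set of phrases they contain, both of which are our assumptions.

def core_summary(neighborhood_summaries):
    # Core summary: keyphrases extracted by every node for the cluster, Eq. (25).
    it = iter(neighborhood_summaries)
    core = set(next(it))
    for s in it:
        core &= set(s)
    return core

def local_summary(other_summaries, local_phrases):
    # Phrases proposed by other nodes that also occur in the local documents, Eq. (26).
    proposed = set().union(*(set(s) for s in other_summaries))
    return proposed & local_phrases

def merge_higher_level(constituent_summaries, L):
    # Higher level cluster summary, Eq. (28): the core intersection, padded with an
    # equal share of each constituent's top remaining keyphrases, up to L phrases.
    core = core_summary(constituent_summaries)
    summary = list(core)[:L]
    if len(summary) < L:
        share = (L - len(summary)) // len(constituent_summaries)
        for s in constituent_summaries:
            summary.extend([p for p in s if p not in core][:share])
    return summary[:L]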

6 EXPERIMENTAL RESULTS

6.1 Data Sets
Experiments were performed on four document data sets with various characteristics and sizes. Table 1 lists the data sets used for evaluation.

TABLE 1. Data Sets

YAHOO, 20NG, and RCV1 are standard text mining data sets, while SN was manually collected but was already labeled. Below is a brief description of each data set.

YAHOO is a collection of 2,340 news articles from Yahoo! News. It contains 20 categories (such as health, entertainment, and so forth), which have a rather unbalanced distribution. It has been used in document clustering related research, including [21], [22], [23], and [24]. The data set is available at ftp://ftp.cs.umn.edu/dept/users/boley/.

SN is a collection of 3,271 metadata records collected from the SchoolNet learning resources website (http://www.schoolnet.ca/). We extracted the fields containing text from the metadata records (title, description, and keywords) and combined them to form one document per metadata record. We used the 17 top-level categories of the SchoolNet data set.

20NG is the standard 20-newsgroups data set, which contains 18,828 documents from 20 Usenet newsgroups divided into 20 balanced categories. This data set is available at http://people.csail.mit.edu/jrennie/20Newsgroups/.

RCV1 is a subset of 23,149 documents selected from the standard Reuters RCV1 text categorization data set, converted from the original Reuters RCV1 data set by Lewis et al. [25]. The documents in the RCV1 data set are assigned multiple labels. In order to properly evaluate the clustering algorithms using single-label validity measures, we restricted the labels of the documents to the first document label that appears in the data set.

6.2 Text Preprocessing
All texts were preprocessed in the following way. First, words were tokenized using a specially built finite-state-machine tokenizer that can detect both alphanumeric and special entities (such as currencies, dates, and so forth). Then, tokens were lowercased, stop words were removed, and finally the remaining words were stemmed using the popular Porter stemmer algorithm [26].

In addition to text preprocessing, we applied simple feature selection to reduce the number of features for every data set. The method is based on the Document Frequency (DF) feature selection measure and on the argument that terms with very low DF tend to be noninformative in categorization-related tasks [27], [28], [29]. After ranking the set of terms in descending order of DF, we pruned the list by keeping only the top 20 percent of terms. A threshold as low as 10 percent has been used in the literature for selecting features to increase categorization accuracy [28] (or at least not affect it), but we opted for a more conservative threshold.

6.3 Evaluation Measures
Three aspects of the algorithm were evaluated: clustering accuracy, speedup, and distributed summarization accuracy. For evaluating clustering accuracy, we used two measures: Entropy, which evaluates clusters with respect to an external predefined categorization, and the Separation Index, which does not rely on predefined categories.

6.3.1 Entropy
Entropy reflects the homogeneity of a set of objects and thus can be used to indicate the homogeneity of a cluster. This is referred to as cluster entropy, introduced by Boley et al. [21]. Lower cluster entropy indicates more homogeneous clusters. On the other hand, we can also measure the entropy of a prelabeled class of objects, which indicates the homogeneity of a class with respect to the generated clusters. The less fragmented a class is across clusters, the lower its entropy, and vice versa. This is referred to as class entropy, due to He et al. [30] and Tan et al. [32].

Cluster entropy [21]. For every cluster c_j in the clustering result c, we compute n(l_i, c_j)/n(c_j), the probability that a member of cluster c_j belongs to class l_i. The entropy of each cluster c_j is calculated using the standard formula

  E_{c_j} = −Σ_i (n(l_i, c_j)/n(c_j)) log(n(l_i, c_j)/n(c_j)),

where the sum is taken over all classes. The total entropy for a set of clusters is calculated as the sum of the individual cluster entropies weighted by cluster size:

  E_c = Σ_{j=1..n(c)} (n(c_j)/n(D)) E_{c_j}.    (29)

Class entropy [30], [32]. A drawback of cluster entropy is that it rewards small clusters: a result in which a class is fragmented across many clusters can still obtain a low cluster entropy. To counter this problem, we also calculate the class entropy. The entropy of each class l_i is calculated using

  E_{l_i} = −Σ_j (n(l_i, c_j)/n(l_i)) log(n(l_i, c_j)/n(l_i)),

where the sum is taken over all clusters. The total entropy for the set of classes is calculated as the weighted average of the individual class entropies:

  E_l = Σ_{i=1..n(l)} (n(l_i)/n(D)) E_{l_i}.    (30)

As with cluster entropy, a drawback of class entropy is that if multiple small classes are lumped into one cluster, their class entropy is still small.

Overall entropy [30], [32]. To avoid the drawbacks of either cluster or class entropy, their values can be combined into an overall entropy measure:

  E(β) = β E_c + (1 − β) E_l.    (31)

In our experiments, we set β to 0.5.
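For reference, the three entropy measures of (29)-(31) can be computed from a label/cluster contingency table as in the following sketch (natural logarithms and illustrative names assumed):

import numpy as np

def entropy_measures(labels, clusters, beta=0.5):
    # Cluster entropy, class entropy, and their combination, Eqs. (29)-(31).
    # labels and clusters are equal-length integer arrays of class and cluster ids.
    labels, clusters = np.asarray(labels), np.asarray(clusters)
    n = len(labels)
    counts = np.zeros((labels.max() + 1, clusters.max() + 1))
    np.add.at(counts, (labels, clusters), 1)              # n(l_i, c_j)

    def H(p):                                             # entropy with 0*log(0) := 0
        p = p[p > 0]
        return float(-(p * np.log(p)).sum())

    cluster_sizes, class_sizes = counts.sum(axis=0), counts.sum(axis=1)
    E_c = sum(cluster_sizes[j] / n * H(counts[:, j] / cluster_sizes[j])
              for j in range(counts.shape[1]) if cluster_sizes[j] > 0)    # Eq. (29)
    E_l = sum(class_sizes[i] / n * H(counts[i] / class_sizes[i])
              for i in range(counts.shape[0]) if class_sizes[i] > 0)      # Eq. (30)
    return E_c, E_l, beta * E_c + (1 - beta) * E_l                        # Eq. (31)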

We evaluated the quality of clustering at different levels of the hierarchy. At level h = 0, we evaluated the quality of clustering of each neighborhood with respect to the subset of the data in that neighborhood, i.e.,

  E_r = E_{c_r} | D_r,

where c_r is the set of clusters obtained for neighborhood r, and D_r is the union of the data sets of all nodes in that neighborhood (D_r = ∪_{i∈q_r} D_i).

At level h > 0, we evaluated the clustering acquired by a supernode with respect to the data subsets of the level 0 nodes reachable from that supernode. Thus, the evaluation of the clustering acquired at the root node reflects the quality with respect to the whole data set.

6.3.2 Separation Index
The Separation Index (SI) is another cluster validity measure that utilizes cluster centroids to measure the distance between clusters, as well as the distance between the points in a cluster and their respective cluster centroid. It is defined as the ratio of the average within-cluster variance (cluster scatter) to the square of the minimum pairwise distance between clusters:

  SI = Σ_{i=1..N_C} Σ_{x_j ∈ c_i} dist(x_j, m_i)² / ( N_D · min_{1≤r<s≤N_C} dist(m_r, m_s)² )
     = Σ_{i=1..N_C} Σ_{x_j ∈ c_i} dist(x_j, m_i)² / ( N_D · dist_min² ),    (32)

where m_i is the centroid of cluster c_i, and dist_min is the minimum pairwise distance between cluster centroids. Clustering solutions with more compact clusters and larger separation have a lower Separation Index; thus, lower values indicate better solutions. This index is more computationally efficient than other validity indices, such as Dunn's index [31], which is also used to validate clusters that are compact and well separated. In addition, it is less sensitive to noisy data.
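A compact implementation of (32) for dense data is shown below (Euclidean distance and illustrative names assumed):

import numpy as np

def separation_index(X, assign, centroids):
    # Separation Index, Eq. (32): within-cluster scatter divided by the number of
    # points times the squared minimum distance between centroids; lower is better.
    X, centroids = np.asarray(X, float), np.asarray(centroids, float)
    assign = np.asarray(assign)
    scatter = np.sum((X - centroids[assign]) ** 2)
    K = len(centroids)
    dmin2 = min(np.sum((centroids[r] - centroids[s]) ** 2)
                for r in range(K) for s in range(r + 1, K))
    return float(scatter / (len(X) * dmin2))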
where L is the maximum number of top keyphrases
where mi is the centroid of cluster ci , and distmin is the extracted.
minimum pairwise distance between cluster centroids.
Clustering solutions with more compact clusters and larger 6.4 Experimental Setup
separation have lower Separation Index, thus lower values A simulation environment was used for evaluating the
indicate better solutions. This index is more computation- HP2PC algorithm. During simulation, data were parti-
ally efficient than other validity indices, such as Dunn’s tioned randomly over all nodes of the network. The number
index [31], which is also used to validate clusters that are of clusters was specified to the algorithm such that it
compact and well separated. In addition, it is less sensitive corresponds to the actual number of classes in each data set.
to noisy data. A random set of centroids was chosen by each supernode,
Speedup is a measure of the relative increase in speed of one algorithm over the other. For evaluating HP2PC, it is calculated as the ratio of the time taken in the centralized case (T_c) to the time taken in the distributed case (T_d), including communication time, i.e.,

S = \frac{T_c}{T_d}.    (33)

To take communication time into consideration in the simulations, we factored the time taken to transmit a message from one node to another on a 100 Mbps link.1 Thus, the time required to transmit a message of size |M| bytes is calculated as

T_M = |M| / (100,000,000 / 8) seconds.

1. This is a simplified assumption. Real networks exhibit communication overhead due to network protocols and network congestion.

During simulation, each time a message is sent from (or received by) one node to another, its time is calculated and is added to the total time taken by that node. Since in a real environment all nodes on the same level of the hierarchy run in parallel, the total time taken by that level is calculated as the maximum time taken by any node on the same level. The time taken by different levels is added to arrive at the global T_d.
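To make the timing model concrete, here is a small Python sketch under the stated assumptions (100 Mbps link, per-level parallelism). The function and variable names are illustrative only and do not come from the paper's simulator.

LINK_BPS = 100_000_000  # 100 Mbps link assumed in the simulation

def transmission_time(message_bytes):
    """Time to push one message over the link: T_M = |M| / (100,000,000 / 8) seconds."""
    return message_bytes / (LINK_BPS / 8)

def distributed_time(per_node_times_by_level):
    """Total distributed time T_d: nodes on one level run in parallel, so each level
    contributes the maximum per-node time (compute plus accumulated message time);
    levels run in sequence, so their contributions are summed."""
    return sum(max(times) for times in per_node_times_by_level.values())

def speedup(t_centralized, t_distributed):
    """Speedup S = T_c / T_d, as in (33)."""
    return t_centralized / t_distributed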

For cluster summarization accuracy, evaluation of the produced cluster summaries was based on how much the extracted keyphrases agree with the centralized version of CorePhrase when run on the centralized cluster. Assume HP2PC produced a cluster c_k that spanned n(p) nodes, each holding a subset of the documents, D_{ki}, from that cluster. If all documents were pooled into a centralized cluster, we have D_k documents in that cluster. The percentage of correct keyphrases is calculated as

percent correct keyphrases = \frac{| CorePhrase(D_k) \cap DistCorePhrase(\{D_{ki}\}) |}{L},

where L is the maximum number of top keyphrases extracted.
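A minimal Python sketch of this agreement measure is shown below, assuming each extractor returns its keyphrases already ranked by score; the function names are illustrative, not the paper's API.

def percent_correct_keyphrases(central_top, distributed_top, L):
    """Percentage of the top-L centralized CorePhrase keyphrases that the
    distributed run also produced for the same cluster."""
    overlap = set(central_top[:L]) & set(distributed_top[:L])
    return 100.0 * len(overlap) / L

# Example: percent_correct_keyphrases(central, distributed, L=100)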
6.4 Experimental Setup
A simulation environment was used for evaluating the HP2PC algorithm. During simulation, data were partitioned randomly over all nodes of the network. The number of clusters was specified to the algorithm such that it corresponds to the actual number of classes in each data set. A random set of centroids was chosen by each supernode, and the centroids were distributed to all nodes in its neighborhood at the beginning of the process. Clustering was invoked at level 0 neighborhoods and was propagated to the root of the hierarchy as described in Section 4.
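A rough Python sketch of this setup is given below. The data structures (document vectors, node identifiers) and helper names are assumptions made purely for illustration and do not mirror the actual simulator.

import random

def partition_data(documents, nodes):
    """Randomly partition the data set over all nodes (one list of documents per node)."""
    assignment = {node: [] for node in nodes}
    for doc in documents:
        assignment[random.choice(nodes)].append(doc)
    return assignment

def seed_neighborhood(neighborhood_nodes, vocabulary_dim, k):
    """The supernode draws k random centroids and pushes the same seed set to every
    node in its neighborhood before level 0 clustering starts."""
    centroids = [[random.random() for _ in range(vocabulary_dim)] for _ in range(k)]
    return {node: centroids for node in neighborhood_nodes}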

In the next sections, we evaluate the effect of network size on clustering accuracy, the effect of scaling the hierarchy height, the quality of clustering at different levels within a single hierarchy, and the accuracy of distributed cluster summarization using the distributed CorePhrase algorithm.

6.5 Network Size and Height
Experiments on different network sizes and heights were performed, and their effect on clustering accuracy (Entropy and SI) and speedup over centralized clustering were measured. Table 2 summarizes those results for the YAHOO data set, and Table 3 summarizes the same results for the SN data set. The same results are illustrated in Figs. 6 and 7, respectively.

TABLE 3
Accuracy and Performance of HP2PC [SN]

The first observation here is that for networks of height H = 1 (partitioning factor = 0), the distributed clustering accuracy stays almost the same as the network size increases. This is evident through both the Entropy and SI. Since for networks of height 1 all nodes at level 0 are in the same neighborhood, every node can update its centroids based on complete information received from all other nodes at the end of each iteration (at the cost of increased communication). This means that increasing the network size does not affect the accuracy of clustering, as long as the network is of height 1.
The second observation is that, for networks of the same
size, larger network heights cause clustering accuracy to
drop. It is not surprising that this is the case, since at higher
levels metaclustering of lower level centroids is expected to
produce some deviation from the true centroids. It is also noticeable that unlike networks of height 1, networks with height H > 0 tend to have less accuracy as the number of nodes is increased. As we keep H constant and increase n(p), the network partitioning factor increases. This in turn means neighborhoods become smaller, thus causing the more accurate centroids at level 0 to become more fragmented.

Fig. 7. HP2PC accuracy and speedup [SN].
An interesting observation is that there is a noticeable plateau region between the centralized case (n(p) = 1) and a point where the data are finely partitioned (n(p) > some value), after which quality degrades rapidly. This plateau provides a clue on the relation between the data set size and the number of nodes, beyond which the number of nodes should not be increased without increasing the data set size. An appropriate strategy for automatically detecting the higher boundary of this region (in scenarios where the network grows arbitrarily) is to compare the SI measure before and after adding nodes; if a sufficiently large difference in SI is noticed, then network growth should be suspended until more data are available (and equally partitioned).

Fig. 6. HP2PC accuracy and speedup [YAHOO].

Since the increase in hierarchy height has the biggest effect on the accuracy of the resulting clustering, a strategy based on the SI measure can be adopted to select the most appropriate hierarchy for a certain application. Given a
sufficiently small accuracy parameter ε, we can use the following strategy to recommend a certain hierarchy height (a code sketch of this procedure is given after the list):

1. Initialize i = 1.
2. Set H = i and compute the corresponding SI_i measure for the resulting clustering solution.
3. Set i = i + 1.
4. Compute SI_i.
5. If ΔSI = SI_i - SI_{i-1} < ε, go to step 3.
6. Output H = i as the recommended height.
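The following Python sketch implements the steps above. Here, build_and_cluster is a hypothetical helper assumed to build an HP2PC hierarchy of the given height, run the clustering, and return the resulting SI value; it is not part of the published algorithm.

def recommend_height(build_and_cluster, epsilon):
    """SI-based height selection: keep growing the hierarchy while the Separation
    Index changes by less than epsilon between consecutive heights, and report the
    height at which the change first reaches epsilon (steps 1-6 above)."""
    i = 1
    si_prev = build_and_cluster(i)          # steps 1-2
    while True:
        i += 1                               # step 3
        si_curr = build_and_cluster(i)       # step 4
        if si_curr - si_prev >= epsilon:     # step 5 fails: quality dropped noticeably
            return i                         # step 6
        si_prev = si_curr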
Fig. 8. Two-dimensional mixture of 10 Gaussians data set [10G].
Fig. 9. PMP comparison between HP2PC and P2P K-means [10G].

TABLE 4
PMP Comparison between HP2PC and P2P K-Means [10G]

We investigate the effect of increasing hierarchy heights, as well as the accuracy at different levels within a single hierarchy, in more detail in the next sections.

Fig. 10. Clustering accuracy versus hierarchy level, H = 5 [20NG].

In terms of speedup, the trends show that the HP2PC algorithm exhibits decent speedup over the centralized case. For H = 1, however, speedup does not scale well with the network size, largely due to the increased communication cost for networks of that height. For H > 0, speedup becomes more scalable, as we can notice a bigger difference between H = 1 and H = 2 than between H = 2 and H = 3. This result carries an assertion that the hierarchical architecture of HP2PC is indeed scalable compared to flat P2P networks.

6.5.1 Comparison with P2P K-Means
The accuracy of HP2PC is compared with P2P K-means [8], which is the current state of the art in P2P-based distributed clustering. Since the implementation of P2P K-means is nontrivial, we used their benchmark synthetic data set and results to compare against. The data set is a 2D mixture of 10 Gaussians, containing 78,200 points (referred to hereafter as 10G). The actual data were not available from the authors but rather the parameters of the Gaussians, which we used to regenerate the data.2 The 10G data set is illustrated in Fig. 8.

2. This means that there could be a difference in the actual data points between our and their generated data due to the random number generation. However, we assume that the very large number of points will offset differences due to sampling.

The measure of accuracy in [8] was based on the difference between the cluster membership produced by P2P K-means and that of the same data point as produced by centralized K-means. To ensure accurate comparison, initial seeds for both the centralized and the P2P algorithms were the same. They report the total number of mislabeled data points as a percentage of the size of the data set. The percentage of mislabeled points (PMP) is

PMP = 100 \cdot \frac{| \{ d \in D : L_{cent}(d) \ne L_{p2p}(d) \} |}{|D|}.
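A direct Python rendering of the PMP measure follows; it assumes the two label sequences are aligned so that entry j refers to the same data point in both runs.

def percent_mislabeled_points(labels_centralized, labels_p2p):
    """PMP: percentage of points whose distributed cluster label disagrees with the
    label assigned by centralized K-means started from the same initial seeds."""
    assert len(labels_centralized) == len(labels_p2p)
    mismatches = sum(1 for a, b in zip(labels_centralized, labels_p2p) if a != b)
    return 100.0 * mismatches / len(labels_centralized)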

Table 4 reports the results for P2P K-means and HP2PC (with various hierarchy heights). Nodes vary between 50 and 500, as reported in [8]. Fig. 9 illustrates the trend in the results. HP2PC has zero error for networks of height 1, as expected. It is clear that for networks of low height, HP2PC is superior to P2P K-means. As the height increases, HP2PC starts to approach the error rate of P2P K-means (H = 4, H = 5), but interestingly, HP2PC does not suffer from the sharp increase in PMP at a very large number of nodes (n(p) > 300).

P2P K-means has the advantage of being a model for unstructured P2P networks. It assumes that each node has a finite number of reachable neighbors, which are randomly selected from the node population. HP2PC, on the other hand, has a fixed hierarchical structure that allows it to produce superior results by avoiding random peering and propagation delay and error, a common disadvantage in P2P networks.

6.6 Clustering Quality at Different Hierarchy Levels
To test the effect of the hierarchical structure on clustering quality at different levels, we performed experiments on a network of size 250 nodes and a fixed height of 5 (partitioning factor = 0.33), on 20NG and RCV1. The number of nodes at each level is 250, 83, 28, 9, 3, and 1, from level 0 to level 5, respectively. We compared the results to centralized K-means performed at each level of the hierarchy and took the average over all neighborhoods on that level.

Fig. 10 shows the accuracy achieved at each level of the hierarchy for 20NG and compares it to the average centralized K-means accuracy at the same level. Fig. 10a shows Entropy accuracy, while Fig. 10b shows the SI change with hierarchy level. We notice that the clustering quality achieved by HP2PC is comparable to centralized K-means and that it slightly degrades as we go up the hierarchy. Fig. 11 shows the same results for RCV1 and verifies the same trend. Since at higher levels of the hierarchy we only rely on cluster centroid information, this result is justifiable. Nevertheless, it is clear that at level 0 we can achieve clustering quality close to the centralized K-means algorithm. In scenarios where tall hierarchies are necessary (e.g., deep hierarchical organization), we can still achieve results that do not deviate much from the centralized case.

Fig. 11. Clustering accuracy versus hierarchy level, H = 5 [RCV1].

6.7 Hierarchy Height Scalability
Finally, we performed a set of experiments to test the effect of increasing hierarchy heights. Experimenting on tall hierarchies requires a large number of nodes so as to keep the neighborhood sizes reasonable. For this reason, only the larger data sets 20NG and RCV1 were used in those experiments to avoid fine-grained partitioning of data across such a large number of nodes.

Table 5 reports the outcome of those experiments for a network of 250 nodes. Performance measures for both HP2PC and centralized K-means clustering are reported (those suffixed with ctr). Fig. 12 reports the same information, but only for RCV1 with respect to the clustering quality (20NG exhibits similar trends) and compares it to the centralized clustering quality. Note that centralized clustering produces roughly the same quality regardless of the hierarchy height, since all data in the network are centrally clustered in this case. We can see that as we increase the hierarchy height, HP2PC clustering quality (which is measured at the root of the hierarchy) is affected (Figs. 12a and 12b). The sharpest decrease happens as soon as the height increases from 1 to 2, and then the degradation in Entropy and SI tends to stabilize. A similar trend can be seen in speedup (Fig. 12c). Both observations can be related to the size of neighborhoods at level 0 (S_Q^{(0)}), which decreases significantly when the height is increased from 1 (250 nodes) to 2 (16.67 nodes).

TABLE 5
Performance of HP2PC versus Hierarchy Heights, n(p) = 250 [20NG, RCV1]

Fig. 12. HP2PC performance versus hierarchy height. (a) Entropy versus hierarchy height—RCV1. (b) SI versus hierarchy height—RCV1. (c) Speedup versus hierarchy height—20NG, RCV1.

In fact, to demonstrate that increasing the hierarchy height is not the primary cause of the drop in quality, we show in Fig. 13 the change in clustering quality as we keep the hierarchy height constant at 2 and increase the number of neighborhoods from 1 to 15. We notice that the quality is only slightly reduced as we partition the network to include one more neighborhood. Thus, although the drop in quality seems to be related to a slight increase in hierarchy height, the actual cause is the big jump in the number of neighborhoods. A well-designed network should take this observation into consideration, so as not to overpartition the network.

Fig. 13. Entropy versus number of level 0 neighborhoods [H = 2].

From those observations, we can conclude that neighborhood size plays a key role in determining both clustering quality and speedup. The more fine-grained the neighborhoods in the network, the less the final clustering solution is dependent on the actual data (only available to level 0 nodes), because we have to go up the hierarchy several levels before we can converge to one solution for the whole network. Conversely, the more coarse-grained the neighborhoods, the better the final clustering solution is, due to creating more accurate clustering at lower levels before the less accurate merging of centroids takes place at higher levels of the hierarchy.

A similar argument can be made about speedup. The fewer nodes in a neighborhood, the less communication is needed between peers. However, from Fig. 12c, we can see that we do not gain much speedup after a certain height (around H = 4 or 5). In fact, speedup tends to decrease slightly after that point. This can be explained by looking at the size of neighborhoods in Table 5. As soon as S_Q^{(0)} decreases from 250 to 16.67, we notice a big jump in speedup (from 94.60 to 135.07). S_Q^{(0)} then tends to decrease slowly as we increase the height, and after H = 4 it stays almost the same. So, in effect, no gain is achieved; on the other hand, due to the increased height, we have to go through several cluster merging layers before the final solution is achieved. Thus, our conclusion here is that hierarchy height should not be increased unless there is a corresponding increase in the number of nodes at level 0.

6.8 Distributed Cluster Summarization
Generation of cluster summaries using the distributed version of CorePhrase was evaluated using different network sizes (n(p)) and heights (H). Experiments were performed on the 20NG data set, where the summary of each distributed cluster is compared to that of its centralized counterpart, and an average is taken over all clusters.

Fig. 14 illustrates the accuracy of the distributed cluster summarization compared with the baseline centralized cluster summarization. The first observation is that distributed cluster summaries can agree with their centralized counterparts up to 88 percent (H = 1, n(p) = 50), which shows the feasibility of the distributed summarization algorithm.

The second observation is that the number of top keyphrases, L, has a direct effect on accuracy. Lower values of L (usually lower than 100) tend to produce poor results, as do higher values (usually above 500). For networks of low height (here H = 1), 100 < L < 500 produces the best results; while for those of larger heights (here H = 3, 5), 400 < L < 700 produces the best results, albeit less accurate than those of lower height networks. An interpretation of
this observation is that at level 0, distributed summarization is directly dependent on the actual data, while at higher levels only keyphrases from level 0 are merged together.

The third observation is that networks with a smaller number of nodes, n(p), produce more accurate results. Since the whole data set is partitioned among n(p) nodes, it is expected that a coarse-grained partitioning (smaller n(p)) means that each node has access to a larger portion of the distributed cluster and thus is able to get more accurate keyphrases.

To summarize those findings: 1) results of distributed cluster summarization can agree with centralized summarization with up to 88 percent accuracy; 2) for networks of small height, 100 < L < 500 should be used, while for networks of large height, 400 < L < 700 should be used; and 3) accuracy of distributed summarization increases as the network size and height are decreased.

Fig. 14. Distributed cluster summarization accuracy [20NG].

7 CONCLUSION AND FUTURE WORK
In this paper, we have introduced a novel architecture and algorithm for distributed clustering, the HP2PC model, which allows building hierarchical networks for clustering data. We demonstrated the flexibility of the model, showing that it achieves comparable quality to its centralized counterpart while providing significant speedup, and that it is possible to make it equivalent to traditional distributed clustering models (e.g., facilitator-worker models) by manipulating the neighborhood size and height parameters. The model shows good scalability with respect to network size and hierarchy height, without degrading the distributed clustering quality significantly.

The importance of this contribution stems from its flexibility to accommodate regular types of P2P networks as well as modularized networks through neighborhood and hierarchy formation. It also allows privacy within neighborhood boundaries (no data shared between neighborhoods). In addition, we provide interpretation capability for document clustering through document cluster summarization using distributed keyphrase extraction.

For future work, we plan to extend this model to be dynamic, allowing nodes to join and leave the network, which requires maintaining a balanced network in terms of partitioning and height. This will also lead us to a way to find the optimal network height for certain applications. We also plan to extend it to allow merging and splitting of complete hierarchies.

We are also investigating the possibility of making the clustering algorithm more global by allowing centroids to cross neighborhoods through higher levels; i.e., clusters at lower level neighborhoods should be a function of higher level centroids. We believe that this will create an opportunity for better global clustering solutions, but at the expense of computational complexity.
REFERENCES
[1] N.F. Samatova, G. Ostrouchov, A. Geist, and A.V. Melechko, "RACHET: An Efficient Cover-Based Merging of Clustering Hierarchies from Distributed Datasets," Distributed and Parallel Databases, vol. 11, no. 2, pp. 157-180, 2002.
[2] S. Merugu and J. Ghosh, "Privacy-Preserving Distributed Clustering Using Generative Models," Proc. Third IEEE Int'l Conf. Data Mining (ICDM '03), pp. 211-218, 2003.
[3] J. da Silva, C. Giannella, R. Bhargava, H. Kargupta, and M. Klusch, "Distributed Data Mining and Agents," Eng. Applications of Artificial Intelligence, vol. 18, no. 7, pp. 791-807, 2005.
[4] A. Strehl and J. Ghosh, "Cluster Ensembles—A Knowledge Reuse Framework for Combining Multiple Partitions," J. Machine Learning Research, vol. 3, pp. 583-617, Dec. 2002.
[5] E. Januzaj, H.-P. Kriegel, and M. Pfeifle, "DBDC: Density Based Distributed Clustering," Proc. Ninth Int'l Conf. Extending Database Technology (EDBT '04), pp. 88-105, 2004.
[6] M. Klusch, S. Lodi, and G. Moro, "Agent-Based Distributed Data Mining: The KDEC Scheme," Proc. AgentLink, pp. 104-122, 2003.
[7] M. Eisenhardt, W. Muller, and A. Henrich, "Classifying Documents by Distributed P2P Clustering," Informatik 2003: Innovative Information Technology Uses, 2003.
[8] S. Datta, C. Giannella, and H. Kargupta, "K-Means Clustering over Peer-to-Peer Networks," Proc. Eighth Int'l Workshop High Performance and Distributed Mining (HPDM), SIAM Int'l Conf. Data Mining (SDM), 2005.
[9] S. Datta, C. Giannella, and H. Kargupta, "K-Means Clustering over a Large, Dynamic Network," Proc. Sixth SIAM Int'l Conf. Data Mining (SDM '06), pp. 153-164, 2006.

[10] S. Datta, K. Bhaduri, C. Giannella, R. Wolff, and H. Kargupta, "Distributed Data Mining in Peer-to-Peer Networks," IEEE Internet Computing, vol. 10, no. 4, pp. 18-26, 2006.
[11] S. Bandyopadhyay, C. Giannella, U. Maulik, H. Kargupta, K. Liu, and S. Datta, "Clustering Distributed Data Streams in Peer-to-Peer Environments," Information Sciences, vol. 176, pp. 1952-1985, 2006.
[12] K. Hammouda and M. Kamel, "Collaborative Document Clustering," Proc. Sixth SIAM Int'l Conf. Data Mining (SDM '06), pp. 453-463, Apr. 2006.
[13] H. Kargupta, I. Hamzaoglu, and B. Stafford, "Scalable, Distributed Data Mining Using an Agent-Based Architecture," Proc. Third Int'l Conf. Knowledge Discovery and Data Mining (KDD '97), pp. 211-214, 1997.
[14] J. Li and R. Morris, "Document Clustering for Distributed Fulltext Search," Proc. Second MIT Student Oxygen Workshop, Aug. 2002.
[15] A. Kumar, M. Kantardzic, and S. Madden, "Guest Editors' Introduction: Distributed Data Mining—Framework and Implementations," IEEE Internet Computing, vol. 10, no. 4, pp. 15-17, 2006.
[16] R. Wolff, K. Bhaduri, and H. Kargupta, "Local L2-Thresholding Based Data Mining in Peer-to-Peer Systems," Proc. Sixth SIAM Int'l Conf. Data Mining (SDM '06), pp. 430-441, 2006.
[17] I.S. Dhillon and D.S. Modha, "A Data-Clustering Algorithm on Distributed Memory Multiprocessors," Large-Scale Parallel Data Mining, pp. 245-260, Springer, 2000.
[18] K. Hammouda and M. Kamel, "Incremental Document Clustering Using Cluster Similarity Histograms," Proc. IEEE/WIC Int'l Conf. Web Intelligence (WI '03), pp. 597-601, Oct. 2003.
[19] K. Hammouda and M. Kamel, "Corephrase: Keyphrase Extraction for Document Clustering," Proc. IAPR Int'l Conf. Machine Learning and Data Mining in Pattern Recognition (MLDM '05), P. Perner and A. Imiya, eds., pp. 265-274, July 2005.
[20] K. Hammouda and M. Kamel, "Document Similarity Using a Phrase Indexing Graph Model," Knowledge and Information Systems, vol. 6, no. 6, pp. 710-727, Nov. 2004.
[21] D. Boley, "Principal Direction Divisive Partitioning," Data Mining and Knowledge Discovery, vol. 2, no. 4, pp. 325-344, 1998.
[22] D. Boley, M. Gini, R. Gross, S. Han, K. Hastings, G. Karypis, V. Kumar, B. Mobasher, and J. Moore, "Partitioning-Based Clustering for Web Document Categorization," Decision Support Systems, vol. 27, pp. 329-341, 1999.
[23] D. Boley, M. Gini, R. Gross, S. Han, K. Hastings, G. Karypis, V. Kumar, B. Mobasher, and J. Moore, "Document Categorization and Query Generation on the World Wide Web Using WebACE," AI Rev., vol. 13, nos. 5/6, pp. 365-391, 1999.
[24] A. Strehl, "Relationship-Based Clustering and Cluster Ensembles for High-Dimensional Data Mining," PhD dissertation, Faculty of Graduate School, Univ. of Texas at Austin, 2002.
[25] D.D. Lewis, Y. Yang, T. Rose, and F. Li, "RCV1: A New Benchmark Collection for Text Categorization Research," J. Machine Learning Research, vol. 5, pp. 361-397, 2004.
[26] M.F. Porter, "An Algorithm for Suffix Stripping," Program, vol. 14, no. 3, pp. 130-137, July 1980.
[27] G. Salton, A. Wong, and C. Yang, "A Vector Space Model for Automatic Indexing," Comm. ACM, vol. 18, no. 11, pp. 613-620, Nov. 1975.
[28] W. Wong and A. Fu, "Incremental Document Clustering for Web Page Classification," Proc. Int'l Conf. Information Soc. in the 21st Century: Emerging Technologies and New Challenges (IS), 2000.
[29] Y. Yang and J.P. Pedersen, "A Comparative Study on Feature Selection in Text Categorization," Proc. 14th Int'l Conf. Machine Learning (ICML '97), pp. 412-420, 1997.
[30] J. He, A.-H. Tan, C.-L. Tan, and S.-Y. Sung, "On Quantitative Evaluation of Clustering Systems," Clustering and Information Retrieval, pp. 105-133, Kluwer Academic, 2003.
[31] J.C. Dunn, "Well Separated Clusters and Optimal Fuzzy Partitions," J. Cybernetica, vol. 4, pp. 95-104, 1974.
[32] A.-H. Tan, H.-L. Ong, H. Pan, J. Ng, and Q.-X. Li, "Towards Personalized Web Intelligence," Knowledge and Information Systems, vol. 6, no. 5, pp. 595-616, May 2004.

Khaled M. Hammouda received the BSc (Hons) degree in computer engineering from Cairo University in 1997 and the MASc and PhD degrees in systems design engineering from the University of Waterloo in 2002 and 2007, respectively. He is currently a professional software engineer at Desire2Learn Inc., where he works on emerging learning object repository technology. He received numerous awards, including the NSERC Postgraduate Scholarship, the Ontario Graduate Scholarship in Science and Technology, and the University of Waterloo President's Graduate Scholarship and Faculty of Engineering Scholarship. He is a former member of the PAMI Research Group, University of Waterloo, where his research interests were in document clustering and distributed text mining, especially keyphrase extraction and summarization. He authored several papers in this field.

Mohamed S. Kamel received the BSc (Hons) degree in electrical engineering from Alexandria University, the MASc degree from McMaster University, and the PhD degree from the University of Toronto. In 1985, he joined the University of Waterloo, Waterloo, Ontario, where he is currently a professor and the director of the Pattern Analysis and Machine Intelligence Laboratory, Department of Electrical and Computer Engineering, and holds a university research chair. He held a Canada research chair in cooperative intelligent systems from 2001 to 2008. His research interests are in computational intelligence, pattern recognition, machine learning, and cooperative intelligent systems. He has authored and coauthored more than 350 papers in journals and conference proceedings, 10 edited volumes, two patents, and numerous technical and industrial project reports. Under his supervision, 75 PhD and MASc students have completed their degrees. He is the editor in chief of the International Journal of Robotics and Automation and an associate editor of the IEEE Transactions on Systems, Man, and Cybernetics, Part A, Pattern Recognition Letters, Cognitive Neurodynamics Journal, and Pattern Recognition Journal. He is also a member of the editorial advisory board of the International Journal of Image and Graphics and the Intelligent Automation and Soft Computing Journal. He also served as an associate editor of Simulation, the journal of the Society for Computer Simulation. Based on his work at NCR, he received the NCR Inventor Award. He is also a recipient of the Systems Research Foundation Award for outstanding presentation in 1985 and the ISRAM Best Paper Award in 1992. In 1994, he was awarded the IEEE Computer Society Press Outstanding Referee Award. He was also a coauthor of the best paper in the 2000 IEEE Canadian Conference on Electrical and Computer Engineering. He is a two-time recipient of the University of Waterloo Outstanding Performance Award and a recipient of the Faculty of Engineering Distinguished Performance Award. He is a member of the ACM and the PEO, a fellow of the IEEE, the Engineering Institute of Canada (EIC), and the Canadian Academy of Engineering (CAE), and was selected to be a fellow of the International Association of Pattern Recognition (IAPR) in 2008. He served as a consultant for General Motors, NCR, IBM, Northern Telecom, and Spar Aerospace. He is a cofounder of Virtek Vision of Waterloo and the chair of its Technology Advisory Group. He served as a member of the board from 1992 to 2008 and the vice president for research and development from 1987 to 1992.

For more information on this or any other computing topic, please visit our Digital Library at www.computer.org/publications/dlib.