Abstract—In distributed data mining, adopting a flat node distribution model can affect scalability. To address the problem of
modularity, flexibility, and scalability, we propose a Hierarchically distributed Peer-to-Peer (HP2PC) architecture and clustering
algorithm. The architecture is based on a multilayer overlay network of peer neighborhoods. Supernodes, which act as representatives
of neighborhoods, are recursively grouped to form higher level neighborhoods. Within a certain level of the hierarchy, peers cooperate
within their respective neighborhoods to perform P2P clustering. Using this model, we can partition the clustering problem in a modular
way across neighborhoods, solve each part individually using a distributed K-means variant, then successively combine clusterings up
the hierarchy where increasingly more global solutions are computed. In addition, for document clustering applications, we summarize
the distributed document clusters using a distributed keyphrase extraction algorithm, thus providing interpretation of the clusters.
Results show decent speedup, reaching 165 times faster than centralized clustering for a 250-node simulated network, with
comparable clustering quality to the centralized approach. We also provide comparison to the P2P K-means algorithm and show that
HP2PC accuracy is better for typical hierarchy heights. Results for distributed cluster summarization match those of their centralized
counterparts with up to 88 percent accuracy.
Index Terms—Distributed data mining, distributed document clustering, hierarchical peer-to-peer networks.
1 INTRODUCTION
Authorized licensed use limited to: Naga krishna. Downloaded on October 27, 2009 at 00:45 from IEEE Xplore. Restrictions apply.
HAMMOUDA AND KAMEL: HIERARCHICALLY DISTRIBUTED PEER-TO-PEER DOCUMENT CLUSTERING AND CLUSTER SUMMARIZATION 683
hypothetical process that is modeled by an exact distributed clustering algorithm. The exact algorithm works as if the data subsets, Di, from each node were first brought together into one data set, D; then a centralized clustering algorithm, A, performed the clustering procedure on the whole data set. The clustering solutions are then distributed again by intersecting the data subsets with the global clustering solution.

Approximate algorithms, on the other hand, produce a model that closely approximates a centrally generated model. Most DDM research studies focus on approximate algorithms, as they tend to produce results comparable to exact algorithms with far less complexity [3].

Communication models. Communication between nodes in distributed clustering algorithms can be categorized into three classes (in increasing order of communication cost): 1) communicating models, 2) communicating representatives, and 3) communicating actual data. The first case involves calculating local models that are then sent to peers or a central site. Models often comprise cluster centroids, e.g., P2P K-means [9], cluster dendrograms, e.g., RACHET [1], or generative models, e.g., DMC [2]. In the second case, nodes select a number of representative samples of the local data to be sent to a central site for global model generation, as in the KDEC distributed clustering algorithm [6] and the DBDC algorithm [5]. The last model of communication is for nodes to exchange actual data objects; i.e., data objects can change sites to facilitate construction of clusters that exist in certain sites only, as in the collaborative clustering scheme in [12].

Applications. Applications of DDM are numerous and are usually manifested as distributed computing projects. They often try to solve problems in mathematics and science. Specific areas and sample projects include astronomy (SETI@home), biology (Folding@home, Predictor@home), climate change (CPDN), physics (LHC@home), cryptography (distributed.net), and biomedicine (grid.org). Those projects are usually built on top of a common platform providing low-level services for distributed or grid computing. Examples of such platforms include the Berkeley Open Infrastructure for Network Computing (BOINC), Grid.org, World Community Grid, and Data Mining Grid.

Text mining. Applications of DDM in the text mining area are rare but usually employ a form of distributed information retrieval. Distributed text classification and clustering have received little attention. PADMA is an early example of parallel text classification [13].

The work presented by Eisenhardt et al. [7] achieves document clustering using a distributed P2P network. They use the K-means clustering algorithm, modified to work in a distributed P2P fashion using a probe-and-echo mechanism. They report improvement in speedup compared to centralized clustering. Their algorithm is an exact algorithm, although it requires global synchronization at each iteration.

A similar system can be found in [14], but the problem is posed from the information retrieval point of view. In this work, a subset of the document collection is centrally partitioned into clusters, for which "cluster signatures" are created. Each cluster is then assigned to a node, and later documents are classified to their respective clusters by comparing their signatures with all cluster signatures. Queries are handled in the same way: they are directed from a root node to the node handling the cluster most similar to the query.

State of the art. In the latest issue of IEEE Internet Computing [15] (at the time of writing this paper), a few algorithms were presented representing the state of the art in DDM. Datta et al. [10] described an exact local algorithm for monitoring K-means clustering (originally proposed by Wolff et al. [16]), as well as an approximate local K-means clustering algorithm for P2P networks (originally proposed by Datta et al. [8], [9]).

Although the K-means monitoring algorithm does not produce a distributed clustering, it helps a centralized K-means process know when to recompute the clusters by monitoring the distribution of centroids across peers and triggering a reclustering if the data distribution significantly changes over time.

On the other hand, the P2P K-means algorithm in [8] and [9] works by updating the centroids at each peer based on information received from its immediate neighbors. The algorithm terminates when the information received does not result in a significant update to the centroids of any peer. The P2P K-means algorithm finds its roots in a parallel implementation of K-means proposed by Dhillon and Modha [17].

3 THE HP2PC DISTRIBUTED ARCHITECTURE

HP2PC is a hierarchically distributed P2P architecture for scalable distributed clustering of horizontally partitioned data. We argue that a scalable distributed clustering system (or any data mining system, for that matter) should involve hierarchical distribution. A hierarchical processing strategy allows for delegation of responsibility and modularity.

Central to this hierarchical architecture design is the formation of neighborhoods. A neighborhood is a group of peers forming a logical unit of isolation in an otherwise unrestricted open P2P network. Peers in a neighborhood can communicate directly but not with peers in other neighborhoods. Each neighborhood has a supernode. Communication between neighborhoods is achieved through their respective supernodes. This model reduces flooding problems usually encountered in large P2P networks.

The notion of a neighborhood accompanied by a supernode can be applied recursively to construct a multilevel overlay hierarchy of peers; i.e., a group of supernodes can form a higher level neighborhood, which can communicate with other neighborhoods on the same level of the hierarchy through their respective (higher level) supernodes. This type of hierarchy is illustrated in Fig. 2.

3.1 Notations

- p: A set of peers comprising a P2P network.
- pi: Peer i in p.
- q: A set of neighborhoods.
- qj: Neighborhood j in q.
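The recursive neighborhood/supernode construction described above can be sketched in a few lines. This is a minimal illustration under simplifying assumptions (fixed neighborhood size, the first peer of each neighborhood acting as supernode); the function names are hypothetical, not from the paper.

```python
# Sketch of the multilevel HP2PC overlay: peers are grouped into
# neighborhoods, each neighborhood designates a supernode, and the
# supernodes recursively form the next level's peer set.
# Names (build_hierarchy, group) are illustrative only.

def group(peers, size):
    """Split a peer list into consecutive neighborhoods of up to `size` peers."""
    return [peers[i:i + size] for i in range(0, len(peers), size)]

def build_hierarchy(peers, size):
    """Return a list of levels; each level is a list of neighborhoods (peer lists)."""
    levels = []
    while len(peers) > 1:
        neighborhoods = group(peers, size)
        levels.append(neighborhoods)
        # The first peer of each neighborhood acts as its supernode and
        # becomes a regular node at the next level up.
        peers = [n[0] for n in neighborhoods]
    levels.append([peers])  # root level: exactly one node
    return levels

levels = build_hierarchy(list(range(16)), 4)
# 16 peers -> 4 neighborhoods -> 4 supernodes -> 1 neighborhood -> root
```

Note how the supernodes of one level are exactly the peer set of the next, mirroring the bottom-up formation described in the text.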
684 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 21, NO. 5, MAY 2009
3.3 Neighborhoods
We divide a network overlay into a set of neighborhoods, q. A neighborhood, qj ∈ q, comprises a set of peers that is a subset of the overlay p and that has a designated peer known as a supernode, spj; thus
The following function then generates two neighborhood sizes based on r:

    n(qj) = ⌊n(p)/n(q)⌋  with probability (1 − r),
            ⌈n(p)/n(q)⌉  with probability r.                     (4)

This method produces a more even distribution of neighborhood sizes, even if n(p) and n(q) are of the same order of magnitude. In the above example, using (4) we have n(p)/n(q) = 1.67 (r = 0.67), so 67 percent of the neighborhoods will be of size 2, and 33 percent will be of size 1; far more balanced than with (3).

All peers at the same level, h, of the hierarchy are denoted by p(h). Let the function level(p) determine the level of a peer; i.e., level(p(h)) = h. A peer pi can communicate with peer pj if and only if level(pi) = level(pj) and pi ∈ ql ⟺ pj ∈ ql.

Peer hierarchy formation is bottom-up, so the lowest level of the hierarchy is h = 0. The supernodes of level 0 neighborhoods form the overlay network at level h = 1. Recursively, at level h = 2 are the supernodes of level 1 neighborhoods (groups of level 1 supernodes). The root supernode is at level H, the height of the hierarchy; i.e., there exists exactly one p(H) in the system.

The network partitioning factor, β, can be different for different levels of the hierarchy; i.e., the neighborhood count and size are not necessarily the same at each level. If we apply the same network partitioning factor to every level, we can deduce the height, H, of the hierarchy. We can approximate (2) to

    n(q) = β n(p).

Since the number of nodes at a certain level is equal to the number of supernodes (or neighborhoods) in the lower level, we can say that

    n(p)(h) = β n(p)(h−1),  ∀h > 0.

At the top of the hierarchy (level H), we have one node, so

    β^H n(p) = 1,                                                (5)

from which we can deduce H:

    H = ⌈ log n(p) / log(1/β) ⌉,  0 < β < 1;
        1,                        β = 0;                         (6)
        ∞,                        β = 1.

If, however, the partitioning factor β is chosen to be different for different levels of the hierarchy, then we cannot deduce the full height of the hierarchy up front. However, we can deduce the maximum height reachable from a certain level using (6), and hence, we can iteratively calculate the full hierarchy height if all level β's are known a priori.

If, instead of specifying β's, a certain hierarchy height is desired, we can calculate the proper β (the same for all levels) using the following equation (which is derived from (6)):

    β = e^(−ln n(p)/H),  H > 1;
        0,               H = 1.                                  (7)

3.4 Example

Fig. 3 illustrates the HP2PC architecture with an example. The network shown consists of 16 nodes and four hierarchy levels. The set of nodes at level 0, p(0), is divided into four neighborhoods, subject to the network partitioning factor β(0) = 0.2. Each supernode of level 0 becomes a regular node at level 1, forming the set of four nodes p(1). Those in turn are grouped into two neighborhoods forming q(1), satisfying β(1) = 0.33. At level 2, only one neighborhood is formed out of the level 1 supernodes, satisfying β(2) = 0. Finally, the root of the hierarchy is found at level 3.

3.5 HP2PC Network Construction

An HP2PC network is constructed recursively, starting from level 0 up to the height of the hierarchy, H. The number of neighborhoods and the size of each neighborhood are controlled through the partitioning factor β, which is specified for each level of the hierarchy (except the root level). The construction process is given in Algorithm 1.
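The height relations above can be checked numerically. The sketch below assumes the approximation n(q) = β·n(p), so that β^H·n(p) = 1 gives H = ⌈log n(p)/log(1/β)⌉ and, for a desired height H, β = e^(−ln n(p)/H); the function names are illustrative, not from the paper.

```python
import math

# Numeric check of the hierarchy-height relations, under the
# approximation n(q) = beta * n(p). Function names are illustrative.

def height(n_p, beta):
    """Hierarchy height H for n(p) peers and partitioning factor 0 < beta < 1."""
    return math.ceil(math.log(n_p) / math.log(1.0 / beta))

def partition_factor(n_p, H):
    """Partitioning factor beta (same at all levels) yielding height H > 1."""
    return math.exp(-math.log(n_p) / H)

# Example: 100 peers with beta = 0.3 keeps roughly a third of the nodes
# at each level, so H = ceil(log 100 / log(1/0.3)) = ceil(3.83) = 4.
H = height(100, 0.3)
beta = partition_factor(100, 4)  # the beta that would give exactly height 4
```

Round-tripping through the two functions recovers β = n(p)^(−1/H), which is the closed-form solution of (5).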
Given the initial set of nodes, p(0), and the set of partitioning factors B = {β(h)}, 0 ≤ h ≤ H − 1, the algorithm recursively constructs the network. At each level, we partition the current p(h) into the proper number of neighborhoods and assign a supernode to each one. The set of supernodes at a certain level forms the set of nodes for the next higher level, which is passed to the next recursive call. Construction stops when the root is reached.

Algorithm 1 HP2PC Construction
Input: p(0), B = {β(h)}, 0 ≤ h ≤ H − 1
Output: {p(h)}, {q(h)}, 0 ≤ h ≤ H
1: for h = 0 to H − 1 do
2:   n(p)(h) = |p(h)|
3:   Calculate n(q)(h) by substituting n(p)(h) and β(h) into (2)
4:   q(h) = ∅, p(h+1) = ∅
5:   a = 1, b = 1   {Partition p(h) into n(q)(h) neighborhoods}
6:   for j = 1 to n(q)(h) do
7:     Calculate n(qj) using (4)
8:     b = b + n(qj)
9:     qj = {pi(h)}, a ≤ i ≤ b
10:    a = b + 1
11:    Add qj to q(h)
12:    spj = first node in qj   {supernode for qj}
13:    Add spj to p(h+1)
14:  end for
15: end for
16: q(H) = p(H)   {root node}

4 THE HP2PC DISTRIBUTED CLUSTERING ALGORITHM

The HP2PC algorithm is a distributed iterative clustering process. It is a centroid-based clustering algorithm, where a set of cluster centroids is generated to describe the clustering solution. In HP2PC, each neighborhood converges to a set of centroids that describe the data set in that neighborhood. The distributed clustering strategy within a single neighborhood is similar to the parallel K-means algorithm [17] in that the final set of centroids of a neighborhood will be identical to those produced by centralized K-means on the data within that neighborhood. Other neighborhoods, either on the same level or at higher levels of the hierarchy, may converge to another set of centroids.

Once a neighborhood converges to a set of centroids, those centroids are acquired by the supernode of that neighborhood. The supernode, in turn, as part of its higher level neighborhood, collaborates with its peers to form a set of centroids for its neighborhood. This process continues hierarchically until a set of centroids is generated at the root of the hierarchy.

4.1 Estimating Clustering Quality

The distributed search for cluster centroids is guided by a cluster quality measure that estimates intracluster cohesiveness and intercluster separation.

Cluster cohesiveness. The distribution of pairwise similarities within a cluster is represented using a cluster similarity histogram, which is a concise statistical representation of the cluster tightness [18]. Let sim() be a similarity measure between two objects, and sk be the set of pairwise similarities between objects of cluster ck:

    n(sk) = n(ck)(n(ck) + 1) / 2,                                (8a)
    sk = {sl : 1 ≤ l ≤ n(sk)},                                   (8b)
    sl = sim(di, dj),  di, dj ∈ ck.                              (8c)

The histogram of the similarities in cluster ck is represented as

    Hk = {hi : 1 ≤ i ≤ B},                                       (9a)
    hi = count(sl),                                              (9b)
    sl ∈ sk,  (i − 1)·δ ≤ sl < i·δ,                              (9c)

where

- B is the number of histogram bins,
- hi is the count of similarities in bin i, and
- δ is the bin width of the histogram.

To estimate the cohesiveness of cluster ck, we calculate the histogram skew. Skew is the third central moment of a distribution; it tells us if one tail of the distribution is longer than the other. A positive skew indicates a longer tail in the positive direction (higher interval of the histogram), while a negative skew indicates a longer tail in the negative (lower interval) direction. A similarity histogram that is negatively skewed indicates a tight cluster.

Skew is calculated as

    skew = Σi (xi − μ)³ / (N σ³),                                (10)

so the tightness of a cluster, φ(ck), calculated as the skew of its histogram Hk, is

    φ(ck) = skew(Hk) = Σl (sl − μ_sk)³ / (n(sk) σ_sk³),  sl ∈ sk.   (11)

A clustering quality measure based on the skewness of the similarity histograms of individual clusters can be derived as a weighted average of the individual cluster skews:

    φ(c) = Σk n(ck) φ(ck) / Σk n(ck),  ck ∈ c.                   (12)

4.2 Distributed Clustering (Level h = 0)

We define a general function for updating cluster models in a fully connected neighborhood:

    c(i,t) = f({c(j,t−1)}),  i, j ∈ q,                           (13a)
    c(i,0) = c0,                                                 (13b)

where c(i,t) is the clustering model (a set of clusters) calculated by peer i at iteration t, and f() is an aggregating function. The equation can be illustrated by Fig. 4, where the output of each peer at iteration t depends on the models calculated by all other peers at iteration t − 1.
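The cluster-tightness measure in (8)-(11) can be sketched as follows: collect the pairwise similarities within a cluster, then score the cluster by the skew (third standardized moment) of that similarity distribution. This is an illustration only; the similarity values and function names are hypothetical.

```python
# Sketch of the similarity-histogram tightness measure: a negative skew
# of the within-cluster similarity distribution indicates a tight cluster.
# Names (pairwise_similarities, skew) are illustrative only.

def pairwise_similarities(docs, sim):
    """All pairwise similarities s_l = sim(d_i, d_j) within one cluster."""
    return [sim(docs[i], docs[j])
            for i in range(len(docs)) for j in range(i)]

def skew(values):
    """Third standardized moment, as in (10)/(11)."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return sum((v - mean) ** 3 for v in values) / (n * std ** 3)

# Toy example: most pairs are highly similar, one pair is not, so the
# distribution has a long lower tail, i.e., negative skew (tight cluster).
tight = [0.9, 0.92, 0.88, 0.91, 0.5]
phi = skew(tight)  # negative
```

A symmetric similarity distribution yields a skew of zero, and a long upper tail (many dissimilar pairs plus a few similar ones) yields a positive skew, matching the interpretation in the text.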
Fig. 4. Iterative level 0 neighborhood clustering.

In P2P K-means [8], f() ≡ avg(), and the neighborhood is based on an ad hoc network topology.

Algorithm 2 shows the iterative HP2PC process for updating the cluster centroids at each node. Utility routines for the main algorithm are given in Algorithm 3. For each neighborhood, an initial set of centroids is calculated by the supernode and transmitted to all peers in the neighborhood. Like K-means, during each iteration, each peer assigns its local data to their nearest centroids and calculates its new centroids. In addition, it also calculates the cluster skews using (11).

Algorithm 2 Level 0 Clustering
Input: Number of clusters K
Output: Set of clusters cr for each neighborhood qr in q(0)
1: for all qr ∈ q(0) do
2:   for all pi ∈ qr do
3:     Di = data set at pi
4:     si = CalcPairwiseSimilarity(Di)
5:     {mr, wr} = ReceiveFrom(spr)   {spr: supernode of qr}
6:     ∀j ≠ r: mj = 0, wj = 0
7:     {mi, wi} = UpdateClusters(Di, {mi})
8:     while change in {mi} > ε do
9:       for all pj ∈ qr, j ≠ i do
10:        SendTo(pj, {mi, wi})
11:        {mj, wj} = ReceiveFrom(pj)
12:      end for
13:      for k = 1 to K do
14:        mik = 0, wik = 0
15:        for j = 1 to n(qr) do
16:          mik = mik + (mjk · wjk)
17:          wik = wik + wjk
18:        end for
19:        mik = mik / wik
20:      end for
21:      {mi, wi} = UpdateClusters(Di, {mi})
22:    end while
23:  end for
24:  if pi = spr then
25:    cr = ci   {set of clusters stored at supernode}
26:  end if
27: end for

Algorithm 3 Utility Routines for Level 0 Clustering
Function UpdateClusters(Di, {mi})
1: ci = ∅
2: for all dj ∈ Di do
3:   l = argmin_k {|dj − mk|}
4:   Add dj to cil
5: end for
6: for all cik do
7:   mik = avg(dj), j ∈ cik
8:   sik = si # cik   {part of si indexed by objects in cik}
9:   φik = CalcSkew(sik)
10:  wik = φik · n(cik)
11: end for
12: return {mi, wi}

Function CalcSkew(s)
1: φ = 0, μ = avg(s), σ = stddev(s)
2: for all sl ∈ s do
3:   φ = φ + (sl − μ)³
4: end for
5: φ = φ / (n(s) · σ³)
6: return φ

Function CalcSimilarityMatrix(D)
1: s = ∅
2: for i = 1 to n(D) do
3:   for j = 0 to i − 1 do
4:     Add sim(di, dj) to s
5:   end for
6: end for
7: return s

The final set of centroids for each iteration is calculated from all peer centroids. Unlike K-means (or P2P K-means), those final centroids are weighted by the skew and size of the clusters at individual peers. The weight of a cluster ck at peer pj is defined as

    wjk = φ(cjk) · n(cjk).

At the end of each iteration, each node transmits the cluster centroids and their corresponding weights to all its peers:

    c(j,t) ≡ {mjk, wjk}t,  k ∈ [1, K].

At peer pi, the centroid of cluster k is updated according to the following equation, which favors tight and dense clusters:

    mk(i,t) = Σj wk(j,t−1) · mk(j,t−1) / Σj wk(j,t−1),  j ∈ qr.   (14)

This is followed by assigning objects to their nearest centroids and calculating the new set of cluster skews, {φik}, and sizes, {n(cik)}, which are used in the next iteration. The algorithm terminates when the object assignment does not change, or when ∀i, k: ‖mk(i,t) − mk(i,t−1)‖ < ε, where ε is a sufficiently small parameter.

4.3 Distributed Clustering (Level h > 0)

Once a neighborhood converges to a set of clusters, the centroids and weights of those clusters are acquired by the supernode as its initial set of clusters; i.e., for neighborhood qr with supernode spr,

    c(r,0,(h)) = c(r,T,(h−1)),

where T is the final iteration of the algorithm at level h − 1 for neighborhood qr.
Since at level h of the hierarchy the actual data objects are not available, we rely on metaclustering: merging the clusters using centroid and weight information alone. At level h > 0, clusters are merged in a bottom-up fashion, up to the root of the hierarchy; i.e., c(h) = f(c(h−1)). This means that once a neighborhood at level h converges to a set of clusters, it is frozen, and the higher level clustering is invoked. (A more elaborate technique would involve bidirectional traffic, making c(h−1) = f(c(h)) as well, but the complexity of this approach could be prohibitive, so we leave it for future work.)

A neighborhood at level h consists of a set of peers, each having a set of K centroids. To merge those clusters, the centroids are collected and clustered at the supernode of this neighborhood, using K-means clustering. This process repeats until one set of clusters is computed at the root of the hierarchy. The formal procedure representing this clustering process is presented in Algorithm 4.

Algorithm 4 HP2PC Clustering
1: for all qi ∈ q(0) do
2:   {mi}(0) = NeighborhoodCluster(qi)
3: end for
4: for h = 1 to H do
5:   for all qi ∈ q(h) do
6:     for all pj ∈ qi do
7:       {mj} = {mj}(h−1)
8:       SendTo(spi, {mj})
9:     end for
10:    {mi}(h) = K-means({mj})   {only at peer spi}
11:  end for
12: end for

For comparison against the baseline K-means algorithm, we calculate the centroids of each neighborhood based on centralized K-means, then compare against the HP2PC centroids with respect to the merged data set of the respective neighborhood (Algorithm 5). For neighborhoods at level 0, the merged data set is the union of the data from all nodes in the neighborhood. For those at higher levels, the merged data set is the union of all data reachable from every node in the neighborhood through its respective lower level nodes. The EvaluateAndCompare function evaluates both the centralized and distributed solutions and compares them against each other, as reported in the various experiments in Section 6.

Algorithm 5 Centralized K-means Clustering Comparison
1: for h = 0 to H do
2:   for all qi ∈ q(h) do
3:     Di = ⋃_{j∈qi} Dj   {merged data set of neighborhood qi}
4:     {mi}h_ctr = kmeans(Di)
5:     EvaluateAndCompare({mi}h_ctr, {mi}h, Di)
6:   end for
7: end for

One of the major benefits of this algorithm is the ability to zoom in to more refined clusters by descending the hierarchy and to zoom out to more generalized clusters by ascending it. The other major benefit is the ability to merge a forest of independent hierarchies into one hierarchy by putting all roots of the forest into one neighborhood and invoking the merge algorithm on that neighborhood.

4.4 Complexity Analysis
We divide the complexity of HP2PC into computational complexity and communication complexity.

4.4.1 Computation Complexity
Assume the entire data set size across all nodes is D. The data set is equally partitioned among nodes, so each node holds DP = D/n(p) data objects. For level 0, we have n(q) neighborhoods, each of size SQ = n(p)/n(q).

Each node has to compute a pairwise similarity matrix before it begins the P2P clustering process, requiring DP(DP − 1)/2 similarity computations. In each iteration, each node computes a new set of K centroids by averaging all neighborhood centroids (K·SQ), assigns the data objects to those centroids (K·DP), recomputes centroids based on the new data assignment (DP), and calculates the skew of the clusters (DP(DP − 1)/2). Those requirements are summarized as

    Tsim = DP(DP − 1)/2,
    Tupdate = K(SQ + DP) + DP,
    Tskew = DP(DP − 1)/2.

Let the number of iterations required to converge to a solution be I. Then, the total number of computations required by each node to converge is

    TP = Tsim + I[Tupdate + Tskew].                              (15)

If we assume that DP ≫ 1 and I ≥ 1, then we can rewrite (15) as

    TP = I[K(SQ + DP) + DP²/2].                                  (16)

For levels above 0, each neighborhood is responsible for metaclustering a set of K·SQ centroids into K centroids using K-means. Then, for each neighborhood at level h, the required computation is

    Th = I·SQ(h)·K².                                             (17)

Since each neighborhood computation is done in parallel with the others, we need only Th computations per level. However, since computations at higher levels of the hierarchy need to wait for lower levels to complete, we have to sum Th over all levels. The total computation required for all levels above 0 is thus

    TH = Σ_{h=1}^{H−1} Th.                                       (18)

Finally, we can combine (16) and (18) to find the total computation complexity for HP2PC:

    T = TP + TH.                                                 (19)

It can be seen that the computation complexity is largely affected by the data set size at each node (DP). By increasing the total number of nodes, we can decrease DP (since the data are equally partitioned among nodes), but at the expense of increasing communication complexity, as well as decreasing clustering quality due to fragmentation of the data set.
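The operation counts in (16)-(18) can be plugged in directly to estimate per-node work. The sketch below follows the symbols in the text (I iterations, K clusters, SQ neighborhood size, DP objects per node) under the same DP ≫ 1 assumption; the function names and the example figures are illustrative, not results from the paper.

```python
# Back-of-envelope sketch of the HP2PC computation cost:
# level-0 per-node cost T_P = I*(K*(S_Q + D_P) + D_P^2/2)  -- eq. (16)
# per-level metaclustering cost T_h = I*S_Q*K^2            -- eq. (17)
# summed over the upper levels                             -- eq. (18)
# Names are illustrative only.

def level0_cost(I, K, S_Q, D_P):
    return I * (K * (S_Q + D_P) + D_P ** 2 / 2)

def upper_levels_cost(I, K, S_Q_per_level):
    return sum(I * S_Q * K ** 2 for S_Q in S_Q_per_level)

def total_cost(I, K, S_Q, D_P, S_Q_per_level):
    return level0_cost(I, K, S_Q, D_P) + upper_levels_cost(I, K, S_Q_per_level)

# Hypothetical setting: 10,000 documents over 100 peers (D_P = 100),
# K = 10 clusters, neighborhoods of 10 peers, one upper level, I = 20.
T = total_cost(I=20, K=10, S_Q=10, D_P=100, S_Q_per_level=[10])
```

The quadratic DP² term (the similarity matrix and skew computations) dominates, which is why spreading the data over more nodes reduces per-node work so sharply.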
    pf = avg( |occurrences of p| / |words in document| )
TABLE 1
Data Sets
including [21], [22], [23], and [24]. The data set is available at ftp://ftp.cs.umn.edu/dept/users/boley/.

SN is a collection of 3,271 metadata records collected from the SchoolNet learning resources website (http://www.schoolnet.ca/). We extracted the fields containing text from the metadata records (title, description, and keywords) and combined them to form one document per metadata record. We used the 17 top-level categories from the SchoolNet data set.

20NG is the standard 20-newsgroups data set, which contains 18,828 documents from 20 Usenet newsgroups divided into 20 balanced categories. This data set is available at http://people.csail.mit.edu/jrennie/20Newsgroups/.

RCV1 is a subset of 23,149 documents selected from the standard Reuters RCV1 text categorization data set, converted from the original Reuters RCV1 data set by Lewis et al. [25]. The documents in the RCV1 data set are assigned multiple labels. In order to properly evaluate the clustering algorithms using single-label validity measures, we restricted the labels of the documents to the first document label that appears in the data set.

6.2 Text Preprocessing
All texts were preprocessed in the following way. First, words were tokenized using a specially built finite-state-machine tokenizer that can detect both alphanumeric and special entities (such as currencies, dates, and so forth). Then, tokens were lowercased, stop-words were removed, and finally the remaining words were stemmed using the popular Porter stemmer algorithm [26].

Further to text preprocessing, we applied simple feature selection to reduce the number of features for every data set. The method is based on the Document Frequency (DF) feature selection measure and on the argument that terms with very low DF tend to be noninformative in categorization-related tasks [27], [28], [29]. After ranking the set of terms in descending order with respect to their DF, we pruned the list by keeping only the top 20 percent of terms. A threshold as low as 10 percent has been used in the literature for selecting features to increase categorization accuracy [28] (or at least not affect it), but we opted for a more conservative threshold.

6.3 Evaluation Measures
Three aspects of the algorithm were evaluated: clustering accuracy, speedup, and distributed summarization accuracy. For evaluating clustering accuracy, we used two evaluation measures: Entropy, which evaluates clusters with respect to an external predefined categorization, and Separation Index, which does not rely on predefined categories.

6.3.1 Entropy
Entropy reflects the homogeneity of a set of objects and thus can be used to indicate the homogeneity of a cluster. This is referred to as cluster entropy, introduced by Boley et al. [21]. Lower cluster entropy indicates more homogeneous clusters. On the other hand, we can also measure the entropy of a prelabeled class of objects, which indicates the homogeneity of a class with respect to the generated clusters. The less fragmented a class is across clusters, the lower its entropy, and vice versa. This is referred to as class entropy, due to He et al. [30] and Tan et al. [32].

Cluster entropy [21]. For every cluster cj in the clustering result c, we compute n(li, cj)/n(cj), the probability that a member of cluster cj belongs to class li. The entropy of each cluster cj is calculated using the standard formula

    E_cj = − Σi [n(li, cj)/n(cj)] · log[n(li, cj)/n(cj)],

where the sum is taken over all classes. The total entropy for a set of clusters is calculated as the sum of the entropies of the individual clusters, weighted by the size of each cluster:

    Ec = Σ_{j=1}^{n(c)} [n(cj)/n(D)] · E_cj.                     (29)

Class entropy [30], [32]. A drawback of cluster entropy is that it rewards small clusters, which means that if a class is fragmented across many clusters it could still get a low entropy value. To counter this problem, we also calculate the class entropy.

The entropy of each class li is calculated using

    E_li = − Σj [n(li, cj)/n(li)] · log[n(li, cj)/n(li)],

where the sum is taken over all clusters. The total entropy for a set of classes is calculated as the weighted average of the individual class entropies:

    El = Σ_{i=1}^{n(l)} [n(li)/n(D)] · E_li.                     (30)

As with cluster entropy, a drawback of class entropy is that if multiple small classes are lumped into one cluster, their class entropy would still be small.
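The two entropy measures can be computed from a cluster-class contingency table. The sketch below is illustrative, not the paper's evaluation code: `table[j][i]` plays the role of n(li, cj), and the toy table is hypothetical.

```python
import math

# Sketch of cluster entropy (29) and class entropy (30) from a
# contingency table table[j][i] = n(l_i, c_j). Names are illustrative.

def _entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

def cluster_entropy(table):
    sizes = [sum(row) for row in table]          # n(c_j)
    n_D = sum(sizes)
    return sum(s / n_D * _entropy(row) for row, s in zip(table, sizes))

def class_entropy(table):
    cols = list(zip(*table))                     # counts per class l_i
    sizes = [sum(col) for col in cols]           # n(l_i)
    n_D = sum(sizes)
    return sum(s / n_D * _entropy(col) for col, s in zip(cols, sizes))

# Perfect clustering: each cluster holds exactly one class, so both
# measures are zero; any mixing or fragmentation raises them.
perfect = [[5, 0], [0, 5]]
Ec, El = cluster_entropy(perfect), class_entropy(perfect)
```

A cluster that mixes two classes evenly raises `cluster_entropy`, while a class split evenly across two clusters raises `class_entropy`, matching the drawbacks each measure is meant to expose.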
Ec ð Þ ¼ Ec þ ð1 Þ El : ð31Þ
TABLE 2
In our experiments, we set to 0.5. Accuracy and Performance of HP2PC [YAHOO]
We evaluated the quality of clustering at different levels
of the hierarchy. At level h ¼ 0, we evaluated the quality of
clustering for each neighborhood, with respect to the subset
of the data in the neighborhood, i.e.
Er ¼ Ecr jDr ;
where cr is the set of clusters obtained for neighborhood r,
and Dr is the union S of data sets of all nodes in that
neighborhood ðDr ¼ i2qr Di Þ.
At level h > 0, we evaluated the clustering acquired by a
supernode with respect to the data subset of the nodes at
the level 0 reachable from the supernode. Thus, evaluation run in parallel, the total time taken by that level is
of the clustering acquired at the root node reflects the calculated as the maximum time taken by any node on
quality with respect to the whole data set. the same level. The time taken by different levels is added
to arrive at the global Td .
6.3.2 Separation Index
SI is another cluster validity measure that utilizes cluster centroids to measure the distance between clusters, as well as the distance between the points in a cluster and their respective cluster centroid. It is defined as the ratio of the average within-cluster variance (cluster scatter) to the square of the minimum pairwise distance between clusters:

SI = [ Σ_{i=1}^{N_C} Σ_{x_j ∈ c_i} dist(x_j, m_i)² ] / [ N_D · min_{1 ≤ r, s ≤ N_C, r ≠ s} dist(m_r, m_s)² ]
   = [ Σ_{i=1}^{N_C} Σ_{x_j ∈ c_i} dist(x_j, m_i)² ] / ( N_D · dist_min² ),  (32)

where m_i is the centroid of cluster c_i, and dist_min is the minimum pairwise distance between cluster centroids. Clustering solutions with more compact clusters and larger separation have a lower Separation Index; thus, lower values indicate better solutions. This index is more computationally efficient than other validity indices, such as Dunn's index [31], which is also used to validate clusters that are compact and well separated. In addition, it is less sensitive to noisy data.

Speedup is a measure of the relative increase in speed of one algorithm over another. For evaluating HP2PC, it is calculated as the ratio of the time taken in the centralized case (T_c) to the time taken in the distributed case (T_d), including communication time, i.e.,

S = T_c / T_d.  (33)

To take communication time into consideration in the simulations, we factored in the time taken to transmit a message from one node to another over a 100 Mbps link.1 Thus, the time required to transmit a message of size |M| bytes is calculated as

T_M = |M| / (100,000,000 / 8) seconds.

During simulation, each time a message is sent from (or received by) one node to another, its time is calculated and added to the total time taken by that node. Since in a real environment all nodes on the same level of the hierarchy run in parallel, the total time taken by that level is calculated as the maximum time taken by any node on that level. The times taken by the different levels are added to arrive at the global T_d.

1. This is a simplified assumption. Real networks exhibit communication overhead due to network protocols and network congestion.

For cluster summarization accuracy, evaluation of the produced cluster summaries was based on how much the extracted keyphrases agree with the centralized version of CorePhrase when run on the centralized cluster. Assume HP2PC produced a cluster c_k that spanned n(p) nodes, each holding a subset of the documents, D_ki, from that cluster. If all documents were pooled into a centralized cluster, we would have D_k documents in that cluster. The percentage of correct keyphrases is calculated as

percent correct keyphrases = | CorePhrase(D_k) ∩ DistCorePhrase({D_ki}) | / L,

where L is the maximum number of top keyphrases extracted.

6.4 Experimental Setup
A simulation environment was used for evaluating the HP2PC algorithm. During simulation, data were partitioned randomly over all nodes of the network. The number of clusters was specified to the algorithm such that it corresponds to the actual number of classes in each data set. A random set of centroids was chosen by each supernode, and the centroids were distributed to all nodes in its neighborhood at the beginning of the process. Clustering was invoked at level 0 neighborhoods and was propagated to the root of the hierarchy as described in Section 4.

In the next sections, we evaluate the effect of network size on clustering accuracy, the effect of scaling the hierarchy height, the quality of clustering at different levels within a single hierarchy, and the accuracy of distributed cluster summarization using the distributed CorePhrase algorithm.

6.5 Network Size and Height
Experiments on different network sizes and heights were performed, and their effect on clustering accuracy (Entropy and SI) and speedup over centralized clustering was measured. Table 2 summarizes those results for the YAHOO data set, and Table 3 summarizes the same results for the SN data set. The same results are illustrated in Figs. 6 and 7, respectively.

The first observation here is that for networks of height H = 1, the distributed clustering accuracy stays almost the same as the network size increases. This is evident through both the Entropy and SI. Since for networks
TABLE 3
Accuracy and Performance of HP2PC [SN]
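For reference, the SI values reported in these tables follow directly from (32) given the point assignments and centroids. A minimal NumPy sketch, with helper names of our own choosing:

```python
import numpy as np

def separation_index(X, labels, centroids):
    """SI = within-cluster scatter / (N_D * min squared distance between
    distinct centroids); lower values indicate better solutions."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    centroids = np.asarray(centroids, dtype=float)
    # total squared distance of each point to its own cluster centroid
    scatter = sum(np.sum((X[labels == k] - centroids[k]) ** 2)
                  for k in range(len(centroids)))
    # pairwise squared distances between centroids; mask self-distances
    d2 = ((centroids[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d2, np.inf)
    return scatter / (len(X) * d2.min())
```

Compact, well-separated solutions drive the numerator down and the denominator up, which is why lower entries in the tables indicate better clusterings.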
Fig. 8. Two-dimensional mixture of 10 Gaussians data set [10G].

Fig. 9. PMP comparison between HP2PC and P2P K-means [10G].

TABLE 4
PMP Comparison between HP2PC and P2P K-Means [10G]

1. Initialize i = 1.
2. Set H = i and compute the corresponding SI_i measure for the resulting clustering solution.
3. Set i = i + 1.
4. Compute SI_i.
5. If ΔSI = SI_i − SI_{i−1} < ε, go to step 3.
6. Output H = i as the recommended height.
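The steps above can be transcribed directly as a loop. In this sketch, `si_for_height`, the threshold `eps` (standing in for the threshold symbol lost in the scan), and the safety cap `h_max` are our own names, not the paper's:

```python
def recommended_height(si_for_height, eps, h_max=10):
    """Literal transcription of steps 1-6: grow the hierarchy height
    while the Separation Index keeps improving by more than eps."""
    i = 1
    si_prev = si_for_height(i)        # steps 1-2
    while i < h_max:
        i += 1                        # step 3
        si_cur = si_for_height(i)     # step 4
        if si_cur - si_prev >= eps:   # step 5 condition fails: stop
            break
        si_prev = si_cur
    return i                          # step 6: recommended height H
```

Since lower SI is better, the loop continues as long as increasing the height does not make SI noticeably worse.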
We investigate the effect of increasing hierarchy heights, as well as the accuracy at different levels within a single hierarchy, in more detail in the next sections.

In terms of speedup, the trends show that the HP2PC algorithm exhibits decent speedup over the centralized case. For H = 1, however, speedup does not scale well with the network size, largely due to the increased communication cost for networks of that height. For H > 1, speedup becomes more scalable, as we notice a bigger difference between H = 1 and H = 2 than between H = 2 and H = 3. This result supports the assertion that the hierarchical architecture of HP2PC is indeed scalable compared to flat P2P networks.

6.5.1 Comparison with P2P K-Means
The accuracy of HP2PC is compared with that of P2P K-means [8], which is the current state of the art in P2P-based distributed clustering. Since the implementation of P2P K-means is nontrivial, we used their benchmark synthetic data set and results to compare against. The data set is a 2D mixture of 10 Gaussians, containing 78,200 points (referred to hereafter as 10G). The actual data were not available from the authors, but rather the parameters of the Gaussians, which we used to regenerate the data.2 The 10G data set is illustrated in Fig. 8.

Fig. 10. Clustering accuracy versus hierarchy level, H = 5 [20NG].

The measure of accuracy in [8] was based on the difference between the cluster membership produced by P2P K-means and that of the same data point as produced by the centralized K-means. To ensure accurate comparison, the initial seeds for both the centralized and the P2P algorithms were the same. They report the total number of mislabeled data points as a percentage of the size of the data set. The percentage of mislabeled points (PMP) is

PMP = 100 · |{ d ∈ D : L_cent(d) ≠ L_p2p(d) }| / |D|.

2. This means that there could be a difference in the actual data points between our and their generated data due to the random number generation. However, we assume that the very large number of points will offset differences due to sampling.
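Once both labelings are available, the PMP measure is straightforward to compute. A sketch, assuming (as stated above) that identical initial seeds make the two sets of cluster labels directly comparable; the function name is ours:

```python
import numpy as np

def pmp(labels_centralized, labels_p2p):
    """Percentage of mislabeled points: share of points whose distributed
    cluster label differs from the centralized one, times 100."""
    a = np.asarray(labels_centralized)
    b = np.asarray(labels_p2p)
    return 100.0 * np.count_nonzero(a != b) / a.size
```

Without shared seeds, a label-permutation matching step (e.g., Hungarian assignment between centroids) would be needed before counting disagreements.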
TABLE 5
Performance of HP2PC versus Hierarchy Heights, nðpÞ ¼ 250 [20NG, RCV1]
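The timing model behind these performance figures can be sketched as follows, under the same simplifications as the simulator described above: fixed 100 Mbps links, nodes on a level running in parallel, and levels running in sequence. The helper names are ours:

```python
def message_time(num_bytes, link_mbps=100):
    """Transmission time of a message: size in bytes over bytes per second,
    i.e., T_M = |M| / (link_rate_bits / 8)."""
    return num_bytes / (link_mbps * 1_000_000 / 8)

def distributed_time(level_node_times):
    """T_d: per-level time is the maximum over that level's nodes (parallel);
    level times add up the hierarchy (sequential)."""
    return sum(max(times) for times in level_node_times)

def speedup(t_centralized, level_node_times):
    """S = T_c / T_d, as in (33)."""
    return t_centralized / distributed_time(level_node_times)
```

This is why adding levels can help speedup: each extra level shrinks the per-neighborhood work that dominates the level maxima, at the cost of one more sequential combining step.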
Fig. 14. Distributed cluster summarization accuracy [20NG].

this observation is that at level 0, distributed summarization is directly dependent on the actual data, while at higher levels only keyphrases from level 0 are merged together.

The third observation is that networks with a smaller number of nodes, n(p), produce more accurate results. Since the whole data set is partitioned among n(p) nodes, it is expected that a coarse-grained partitioning (smaller n(p)) means that each node has access to a larger portion of the distributed cluster and is thus able to extract more accurate keyphrases.

To summarize those findings: 1) results of distributed cluster summarization can agree with centralized summarization with up to 88 percent accuracy; 2) for networks of small height, 100 < L < 500 should be used, while for networks of large height, 400 < L < 700 should be used; and 3) the accuracy of distributed summarization increases as the network size and height are decreased.

REFERENCES
[1] N.F. Samatova, G. Ostrouchov, A. Geist, and A.V. Melechko, "RACHET: An Efficient Cover-Based Merging of Clustering Hierarchies from Distributed Datasets," Distributed and Parallel Databases, vol. 11, no. 2, pp. 157-180, 2002.
[2] S. Merugu and J. Ghosh, "Privacy-Preserving Distributed Clustering Using Generative Models," Proc. Third IEEE Int'l Conf. Data Mining (ICDM '03), pp. 211-218, 2003.
[3] J. da Silva, C. Giannella, R. Bhargava, H. Kargupta, and M. Klusch, "Distributed Data Mining and Agents," Eng. Applications of Artificial Intelligence, vol. 18, no. 7, pp. 791-807, 2005.
[4] A. Strehl and J. Ghosh, "Cluster Ensembles—A Knowledge Reuse Framework for Combining Multiple Partitions," J. Machine Learning Research, vol. 3, pp. 583-617, Dec. 2002.
[5] E. Januzaj, H.-P. Kriegel, and M. Pfeifle, "DBDC: Density Based Distributed Clustering," Proc. Ninth Int'l Conf. Extending Database Technology (EDBT '04), pp. 88-105, 2004.
[6] M. Klusch, S. Lodi, and G. Moro, "Agent-Based Distributed Data Mining: The KDEC Scheme," Proc. AgentLink, pp. 104-122, 2003.
[7] M. Eisenhardt, W. Muller, and A. Henrich, "Classifying Documents by Distributed P2P Clustering," Informatik 2003: Innovative Information Technology Uses, 2003.
[8] S. Datta, C. Giannella, and H. Kargupta, "K-Means Clustering over Peer-to-Peer Networks," Proc. Eighth Int'l Workshop High Performance and Distributed Mining (HPDM), SIAM Int'l Conf. Data Mining (SDM), 2005.
[9] S. Datta, C. Giannella, and H. Kargupta, "K-Means Clustering over a Large, Dynamic Network," Proc. Sixth SIAM Int'l Conf. Data Mining (SDM '06), pp. 153-164, 2006.
[10] S. Datta, K. Bhaduri, C. Giannella, R. Wolff, and H. Kargupta, "Distributed Data Mining in Peer-to-Peer Networks," IEEE Internet Computing, vol. 10, no. 4, pp. 18-26, 2006.
[11] S. Bandyopadhyay, C. Giannella, U. Maulik, H. Kargupta, K. Liu, and S. Datta, "Clustering Distributed Data Streams in Peer-to-Peer Environments," Information Sciences, vol. 176, pp. 1952-1985, 2006.
[12] K. Hammouda and M. Kamel, "Collaborative Document Clustering," Proc. Sixth SIAM Int'l Conf. Data Mining (SDM '06), pp. 453-463, Apr. 2006.
[13] H. Kargupta, I. Hamzaoglu, and B. Stafford, "Scalable, Distributed Data Mining Using an Agent-Based Architecture," Proc. Third Int'l Conf. Knowledge Discovery and Data Mining (KDD '97), pp. 211-214, 1997.
[14] J. Li and R. Morris, "Document Clustering for Distributed Fulltext Search," Proc. Second MIT Student Oxygen Workshop, Aug. 2002.
[15] A. Kumar, M. Kantardzic, and S. Madden, "Guest Editors' Introduction: Distributed Data Mining—Framework and Implementations," IEEE Internet Computing, vol. 10, no. 4, pp. 15-17, 2006.
[16] R. Wolff, K. Bhaduri, and H. Kargupta, "Local L2-Thresholding Based Data Mining in Peer-to-Peer Systems," Proc. Sixth SIAM Int'l Conf. Data Mining (SDM '06), pp. 430-441, 2006.
[17] I.S. Dhillon and D.S. Modha, "A Data-Clustering Algorithm on Distributed Memory Multiprocessors," Large-Scale Parallel Data Mining, pp. 245-260, Springer, 2000.
[18] K. Hammouda and M. Kamel, "Incremental Document Clustering Using Cluster Similarity Histograms," Proc. IEEE/WIC Int'l Conf. Web Intelligence (WI '03), pp. 597-601, Oct. 2003.
[19] K. Hammouda and M. Kamel, "CorePhrase: Keyphrase Extraction for Document Clustering," Proc. IAPR Int'l Conf. Machine Learning and Data Mining in Pattern Recognition (MLDM '05), P. Perner and A. Imiya, eds., pp. 265-274, July 2005.
[20] K. Hammouda and M. Kamel, "Document Similarity Using a Phrase Indexing Graph Model," Knowledge and Information Systems, vol. 6, no. 6, pp. 710-727, Nov. 2004.
[21] D. Boley, "Principal Direction Divisive Partitioning," Data Mining and Knowledge Discovery, vol. 2, no. 4, pp. 325-344, 1998.
[22] D. Boley, M. Gini, R. Gross, S. Han, K. Hastings, G. Karypis, V. Kumar, B. Mobasher, and J. Moore, "Partitioning-Based Clustering for Web Document Categorization," Decision Support Systems, vol. 27, pp. 329-341, 1999.
[23] D. Boley, M. Gini, R. Gross, S. Han, K. Hastings, G. Karypis, V. Kumar, B. Mobasher, and J. Moore, "Document Categorization and Query Generation on the World Wide Web Using WebACE," AI Rev., vol. 13, nos. 5/6, pp. 365-391, 1999.
[24] A. Strehl, "Relationship-Based Clustering and Cluster Ensembles for High-Dimensional Data Mining," PhD dissertation, Faculty of Graduate School, Univ. of Texas at Austin, 2002.
[25] D.D. Lewis, Y. Yang, T. Rose, and F. Li, "RCV1: A New Benchmark Collection for Text Categorization Research," J. Machine Learning Research, vol. 5, pp. 361-397, 2004.
[26] M.F. Porter, "An Algorithm for Suffix Stripping," Program, vol. 14, no. 3, pp. 130-137, July 1980.
[27] G. Salton, A. Wong, and C. Yang, "A Vector Space Model for Automatic Indexing," Comm. ACM, vol. 18, no. 11, pp. 613-620, Nov. 1975.
[28] W. Wong and A. Fu, "Incremental Document Clustering for Web Page Classification," Proc. Int'l Conf. Information Soc. in the 21st Century: Emerging Technologies and New Challenges (IS), 2000.
[29] Y. Yang and J.P. Pedersen, "A Comparative Study on Feature Selection in Text Categorization," Proc. 14th Int'l Conf. Machine Learning (ICML '97), pp. 412-420, 1997.
[30] J. He, A.-H. Tan, C.-L. Tan, and S.-Y. Sung, "On Quantitative Evaluation of Clustering Systems," Clustering and Information Retrieval, pp. 105-133, Kluwer Academic, 2003.
[31] J.C. Dunn, "Well Separated Clusters and Optimal Fuzzy Partitions," J. Cybernetica, vol. 4, pp. 95-104, 1974.
[32] A.-H. Tan, H.-L. Ong, H. Pan, J. Ng, and Q.-X. Li, "Towards Personalized Web Intelligence," Knowledge and Information Systems, vol. 6, no. 5, pp. 595-616, May 2004.

Khaled M. Hammouda received the BSc (Hons) degree in computer engineering from Cairo University in 1997 and the MASc and PhD degrees in systems design engineering from the University of Waterloo in 2002 and 2007, respectively. He is currently a professional software engineer at Desire2Learn Inc., where he works on emerging learning object repository technology. He received numerous awards, including the NSERC Postgraduate Scholarship, the Ontario Graduate Scholarship in Science and Technology, and the University of Waterloo President's Graduate Scholarship and Faculty of Engineering Scholarship. He is a former member of the PAMI Research Group, University of Waterloo, where his research interests were in document clustering and distributed text mining, especially keyphrase extraction and summarization. He has authored several papers in this field.

Mohamed S. Kamel received the BSc (Hons) degree in electrical engineering from Alexandria University, the MASc degree from McMaster University, and the PhD degree from the University of Toronto. In 1985, he joined the University of Waterloo, Waterloo, Ontario, where he is currently a professor and the director of the Pattern Analysis and Machine Intelligence Laboratory, Department of Electrical and Computer Engineering, and holds a university research chair. He held a Canada research chair in cooperative intelligent systems from 2001 to 2008. His research interests are in computational intelligence, pattern recognition, machine learning, and cooperative intelligent systems. He has authored and coauthored more than 350 papers in journals and conference proceedings, 10 edited volumes, two patents, and numerous technical and industrial project reports. Under his supervision, 75 PhD and MASc students have completed their degrees. He is the editor in chief of the International Journal of Robotics and Automation and an associate editor of the IEEE Transactions on Systems, Man, and Cybernetics, Part A, Pattern Recognition Letters, Cognitive Neurodynamics Journal, and Pattern Recognition Journal. He is also a member of the editorial advisory board of the International Journal of Image and Graphics and the Intelligent Automation and Soft Computing Journal. He also served as an associate editor of Simulation, the journal of the Society for Computer Simulation. Based on his work at NCR, he received the NCR Inventor Award. He is also a recipient of the Systems Research Foundation Award for outstanding presentation in 1985 and the ISRAM Best Paper Award in 1992. In 1994, he was awarded the IEEE Computer Society Press Outstanding Referee Award. He was also a coauthor of the best paper at the 2000 IEEE Canadian Conference on Electrical and Computer Engineering. He is a two-time recipient of the University of Waterloo Outstanding Performance Award and a recipient of the Faculty of Engineering Distinguished Performance Award. He is a member of the ACM and the PEO, a fellow of the IEEE, the Engineering Institute of Canada (EIC), and the Canadian Academy of Engineering (CAE), and was selected as a fellow of the International Association of Pattern Recognition (IAPR) in 2008. He served as a consultant for General Motors, NCR, IBM, Northern Telecom, and Spar Aerospace. He is a cofounder of Virtek Vision of Waterloo and the chair of its Technology Advisory Group. He served as a member of the board from 1992 to 2008 and as the vice president for research and development from 1987 to 1992.