Web Site: www.ijettcs.org Email: editor@ijettcs.org, editorijettcs@gmail.com Volume 2, Issue 2, March-April 2013 ISSN 2278-6856
1. INTRODUCTION
Data clustering is a statistical data analysis technique used to partition a data set into meaningful groups according to given criteria. Different clustering algorithms have implicit strengths and weaknesses with respect to different definitions of cluster meaningfulness. However, the main goal of clustering algorithms remains the same: the sum of the distances between data points within a cluster (the intra-cluster distance) should be minimal, while the sum of the distances between clusters (the inter-cluster distance) should be maximal. Document clustering is often a harder task than clustering numerical data, because one only has a numerical representation of textual data, defined mainly by word frequency. In human languages, many statistical correlations are noise that depends on context, and this makes document clustering even harder. Over the years, mapping documents to vectors of term-frequency weights and computing distances between them in the Vector Space Model (VSM) has remained an effective strategy. With reliable frequency information and proper noise removal, good results are achieved with many clustering algorithms. Search Result Clustering (SRC) is a sub-topic of document clustering in which the clustering methods are expected to produce meaningful, labeled groups of the results returned by a query.
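To make the VSM representation concrete, the following is a minimal Python sketch (with illustrative names of our own; the paper does not prescribe any implementation) of how documents can be mapped to the sparse term-frequency vectors on which the distances of the next section operate:

    # Sketch: documents as sparse term-frequency vectors (VSM).
    from collections import Counter

    def tf_vector(text):
        """Map a document to a sparse vector of raw term frequencies."""
        return Counter(text.lower().split())

    docs = ["data clustering groups similar documents",
            "clustering partitions a data set into meaningful groups"]
    vectors = [tf_vector(d) for d in docs]

In practice, stop-word removal and term weighting would be applied before clustering, but raw counts are enough to define the distances that follow.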
2. CLUSTERING
Clustering algorithms can be classified according to many different characteristics. One of the most important is the strategy used by the algorithm to partition the space [3, 4]:
Partitional clustering: given a set O = {o1, . . . , on} of n data objects, the goal is to create a partition C = {c1, . . . , ck} such that: ci ≠ ∅ for all i ∈ [1, k]; ci ∩ cj = ∅ for all i ≠ j; and the union of all the ci is O. Partitional clustering is said to be hard if each data object is assigned to exactly one cluster, and soft (or fuzzy) if a data object can belong to more than one cluster with some degree of membership.
Hierarchical clustering: given a set O = {o1, . . . , on} of n data objects, the goal is to build a tree-like structure (called a dendrogram) H = {h1, . . . , hq}, with q ≤ n, such that, given two clusters ci ∈ hm and cj ∈ hl with hl an ancestor of hm, one of the following conditions holds: ci ⊆ cj or ci ∩ cj = ∅, for all i, j, m, l ∈ [1, q].
2.1 Distance measures
Cosine similarity: given two objects oa and ob represented as term vectors, the cosine similarity is
cos(oa, ob) = (oa · ob) / (‖oa‖ ‖ob‖).
The most important property of cosine similarity is that it does not depend on the length of the vectors: cos(α oa, ob) = cos(oa, ob) for any α > 0. A distance can be easily derived from cosine similarity by
d(oa, ob) = 1 − cos(oa, ob).
Jaccard coefficient: it is defined as
J(oa, ob) = |oa ∩ ob| / |oa ∪ ob|,
where oa ∩ ob is the set of features in common between oa and ob and oa ∪ ob is the total set of features. Many variants of the Jaccard coefficient were proposed in the literature. The most interesting is the Generalized Jaccard Coefficient (GJC), which also takes into account the weight of each term. It is defined as
GJC(oa, ob) = Σt min(oa,t, ob,t) / Σt max(oa,t, ob,t).
The distance 1 − GJC is proven to be a metric [Charikar, 2002]. The Generalized Jaccard Coefficient defines a very flexible distance that works well with both text and video data.
Minkowski distance: it is defined as
dp(oa, ob) = (Σt |oa,t − ob,t|^p)^(1/p).
It is the standard family of distances for geometrical problems. Varying the value of the parameter p, we obtain different well-known distance functions: when p = 1 the Minkowski distance reduces to the Manhattan distance, and for p = 2 we obtain the well-known Euclidean distance.
2.2 FPF algorithm for the k-center problem
Partitional algorithms typically optimize an objective function such as the k-center objective
max over j ∈ [1, k] of max over p ∈ cj of d(p, cj),
or the sum of squared distances to the centroids (the k-means objective)
Σj Σ over p ∈ cj of d(p, μj)²,
where C = {c1, . . . , ck} are the k clusters, cj is the center of the j-th cluster and μj is its centroid. For all these functions it is known that finding the global minimum is NP-hard. Thus, heuristics are always employed to find a local minimum. We first describe the original Furthest Point First (FPF) algorithm proposed by Gonzalez [5], which is the basic algorithm we adopt in this paper.
Basic algorithm: Given a set O of n points, FPF incrementally computes the set of centers C1 ⊆ C2 ⊆ . . . ⊆ Ck ⊆ O, where Ck is the solution to the k-center problem and C1 = {c1} is the starting set, built by randomly choosing c1 in O. At a generic iteration 1 < i ≤ k, the algorithm knows the set of centers Ci−1 (computed at the previous iteration) and a mapping μ that associates, to each point p ∈ O, its closest center μ(p) ∈ Ci−1. Iteration i consists of the following two steps:
1. Find the point p ∈ O for which the distance to its closest center, d(p, μ(p)), is maximum; make p a new center ci and let Ci = Ci−1 ∪ {ci}.
2. Compute the distance of ci to all points in O \ Ci to update the mapping μ of points to their closest center.
After k iterations, the set of centers Ck = {c1, . . . , ck} and the mapping μ define the clustering: cluster Ci is the set of points {p ∈ O \ Ck such that μ(p) = ci}, for i ∈ [1, k]. Each iteration can be done in time O(n), hence the overall cost of the algorithm is O(kn).
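As an illustration, the following Python sketch implements the two distances used later in the paper (the cosine-derived distance and the GJC distance) and the basic FPF iteration described above. Function and variable names are our own, and dense non-negative vectors are assumed for simplicity:

    import math
    import random

    def cosine_distance(a, b):
        # d = 1 - cos(a, b); assumes non-zero dense vectors
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return 1.0 - dot / (na * nb)

    def gjc_distance(a, b):
        # d = 1 - GJC(a, b) = 1 - sum_t min(a_t, b_t) / sum_t max(a_t, b_t);
        # assumes non-negative weights
        num = sum(min(x, y) for x, y in zip(a, b))
        den = sum(max(x, y) for x, y in zip(a, b))
        return 1.0 - num / den

    def fpf(points, k, dist):
        """Gonzalez's Furthest Point First heuristic for k centers."""
        centers = [random.randrange(len(points))]          # c1 chosen at random
        mu = {p: centers[0] for p in range(len(points))}   # closest-center map
        for _ in range(1, k):
            # Step 1: the point furthest from its closest center is promoted
            ci = max(mu, key=lambda p: dist(points[p], points[mu[p]]))
            centers.append(ci)
            # Step 2: update the mapping with distances to the new center
            for p in mu:
                if dist(points[p], points[ci]) < dist(points[p], points[mu[p]]):
                    mu[p] = ci
        return centers, mu

Each iteration touches every point a constant number of times, which matches the O(kn) bound stated above.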
3. CLUSTERING VALIDATION
Since the clustering task has an ambiguous definition, the assessment of the quality of the results is also not well defined. There are two main philosophies for evaluating clustering quality. An internal criterion is based on evaluating how well the output clustering optimizes a certain objective function. An external criterion is based on comparing the output clustering with a predefined, hand-made classification of the data called the ground truth. When a ground truth is available, it is usually preferable to use an external criterion to assess clustering effectiveness, because it deals with real data, while an internal criterion only measures how well founded the clustering is according to a given mathematical definition.
3.1 Internal measures
There is a wide number of indexes used to measure the overall quality of a clustering. Some of them (e.g. the mean squared error) are also used as goal functions by the clustering algorithms themselves.
Homogeneity and separation: according to intuition, the more homogeneous the objects a cluster contains, the better the cluster; likewise, the better two clusters are separated, the better the clustering. Following this intuition, homogeneity and separation [6] attempt to measure how compact clusters are and how far apart they lie from one another. More formally, given a set of objects O = {o1, . . . , on}:
Homogeneity of a clustering: the average similarity between mates, i.e. pairs of objects assigned to the same cluster. If M is the number of mate pairs, homogeneity is the sum of the similarities of all mate pairs divided by M.
Separation of a clustering: the average similarity between non-mates. As M is the number of mate pairs, the number of non-mate pairs is given by n(n − 1)/2 − M, and separation is the sum of the similarities of all non-mate pairs divided by this quantity.
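A direct (quadratic) Python sketch of these two internal measures, under the assumption of our own that a pairwise similarity function sim and one hard cluster label per object are available:

    from itertools import combinations

    def homogeneity_separation(objects, labels, sim):
        """Average similarity over mate pairs (H) and non-mate pairs (S)."""
        mates, non_mates = [], []
        for i, j in combinations(range(len(objects)), 2):
            s = sim(objects[i], objects[j])
            (mates if labels[i] == labels[j] else non_mates).append(s)
        # len(mates) = M, len(non_mates) = n(n-1)/2 - M
        return sum(mates) / len(mates), sum(non_mates) / len(non_mates)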
3.2 External measures
In the following, we denote with GT(S) = {GT1, . . . , GTk} the ground truth partition, formed by a collection of classes, and with C = {c1, . . . , ck} the outcome of the clustering algorithm, which is a collection of clusters.
F-measure: The F-measure was introduced in [Larsen and Aone, 1999] and is based on precision and recall, two concepts well known in the information retrieval literature [Kowalski, 1997], [Van Rijsbergen, 1979]. Given a cluster cj and a class GTi, we have:
precision(GTi, cj) = |GTi ∩ cj| / |cj|,
recall(GTi, cj) = |GTi ∩ cj| / |GTi|.
Note that precision and recall are real numbers in the range [0, 1]. Intuitively, precision measures the probability that an element of the cluster cj belongs to the class GTi, while recall is the probability that an element of the class GTi falls in the cluster cj. The F-measure F(GTi, cj) of a cluster cj and a class GTi is the harmonic mean of precision and recall:
F(GTi, cj) = 2 · precision(GTi, cj) · recall(GTi, cj) / (precision(GTi, cj) + recall(GTi, cj)).
The overall F-measure is the weighted average of the best F-measure achieved by each class:
F(GT, C) = Σi (|GTi| / n) maxj F(GTi, cj),
where n is the sum of the cardinalities of all the classes. The value of F is in the range [0, 1], and a higher value indicates better quality.
Entropy: Entropy is a widely used measure in information theory. In a nutshell, we can use relative entropy to measure the amount of uncertainty that we have about the ground truth class of the elements of a cluster. The entropy of a cluster cj is
E(cj) = −Σi (|GTi ∩ cj| / |cj|) log(|GTi ∩ cj| / |cj|),
and the entropy of a clustering is the weighted average over all clusters:
E(C) = Σj (|cj| / n) E(cj),
where n is the number of elements of the whole clustering. The value of E is in the range [0, log n], and a lower value indicates better quality.
Accuracy: While the entropy of a clustering is an average of the entropies of the single clusters, a notion of accuracy is obtained by simply using the maximum operator:
A(GT, C) = (1/n) Σj maxi |GTi ∩ cj|.
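The three measures above can be computed directly from the intersection cardinalities |GTi ∩ cj|. The following Python sketch mirrors the formulas, under the assumption of ours that classes and clusters are represented as sets of object identifiers:

    import math

    def overall_f_measure(gt, clusters, n):
        # F = sum_i (|GT_i|/n) * max_j F(GT_i, c_j)
        total = 0.0
        for g in gt:
            best = 0.0
            for c in clusters:
                inter = len(g & c)
                if inter:
                    p, r = inter / len(c), inter / len(g)
                    best = max(best, 2 * p * r / (p + r))
            total += (len(g) / n) * best
        return total

    def entropy(gt, clusters, n):
        # E(C) = sum_j (|c_j|/n) * E(c_j)
        e = 0.0
        for c in clusters:
            ec = 0.0
            for g in gt:
                p = len(g & c) / len(c)
                if p > 0:
                    ec -= p * math.log(p)
            e += (len(c) / n) * ec
        return e

    def accuracy(gt, clusters, n):
        # A = (1/n) * sum_j max_i |GT_i intersect c_j|
        return sum(max(len(g & c) for g in gt) for c in clusters) / n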
The accuracy A is in the range [0, 1], and a higher value indicates better quality.
Normalized mutual information: The normalized mutual information comes from information theory and is defined as follows:
NMI(C, GT) = (2 / log(|C| · |GT|)) Σ over c ∈ C Σ over g ∈ GT of P(c, g) log( P(c, g) / (P(c) P(g)) ),
where P(c) represents the probability that a randomly selected object oj belongs to c, and P(c, g) represents the probability that a randomly selected object oj belongs to both c and g. The normalization, achieved by the 2 / log(|C| · |GT|) factor, is necessary in order to account for the fact that the cardinalities of C and GT are in general different [7]. Higher values of NMI mean better clustering quality. NMI is designed for hard clustering.
Normalized complementary entropy: In order to evaluate soft clustering, the normalized complementary entropy [8] is often used. Here we describe a version of the normalized complementary entropy in which we have changed the normalization factor so as to take overlapping clusters into account. The entropy of a cluster cj ∈ C is
E(cj) = −Σi (|GTi ∩ cj| / |cj|) log(|GTi ∩ cj| / |cj|),
its normalized complementary entropy is
NCE(cj) = 1 − E(cj) / log |GT|,
and the overall NCE is obtained as the sum of the NCE(cj) weighted by |cj| / Σl |cl|; normalizing by Σl |cl| instead of n is what accounts for overlapping clusters. NCE ranges in [0, 1], and higher values mean better quality.
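Under the same set-based representation as before, NMI reduces to a few lines of Python; the probabilities are estimated by simple counting over the n objects, and the 2 / log(|C| · |GT|) normalization is applied at the end:

    import math

    def nmi(gt, clusters, n):
        total = 0.0
        for c in clusters:
            for g in gt:
                p_cg = len(c & g) / n                  # P(c, g)
                if p_cg > 0:
                    p_c, p_g = len(c) / n, len(g) / n  # P(c), P(g)
                    total += p_cg * math.log(p_cg / (p_c * p_g))
        return 2.0 * total / math.log(len(clusters) * len(gt))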
4. IMPROVEMENTS TO THE FPF ALGORITHM
A first improvement over the basic FPF algorithm is to run it on a random sample of the input of size √(|O| · k), and then to insert each remaining point into the cluster of its closest center. Also in the operation of inserting the (n − √(nk)) remaining points, the bottleneck is the time spent computing the distance from each point to its closest center. According to [Phillips, 2002], this operation can be made more efficient by exploiting the triangular inequality, even if the worst-case running time does not change (Figure 2.5: Exploiting the triangular inequality). Suppose the distances between all pairs of centers are available, and let p be the new point to be inserted in the clustering. By the triangular inequality, if (1/2) d(ci, cj) > d(ci, p) then d(cj, p) > d(ci, p), which means that the computation of the distance d(cj, p) can be safely avoided. We will refer to this algorithm as M-FPF.
M-FPF:
Data: Let O be the input set, k the number of desired clusters.
Result: C, a k-partition of O.
Initialize R with a random sample of √(|O| · k) elements of O;
C = FPF(R, k);
forall Ci ∈ C do
ci = getCenter(Ci);
end
forall p ∈ O \ R do
assign p to the cluster Ci such that d(p, ci) ≤ d(p, cj) for all j ≠ i;
end
Algorithm 3: M-FPF.
Using medoids as centers: The concept of medoid was introduced by Kaufman and Rousseeuw in [11]. Medoids have two main advantages with respect to centroids. First of all, they are elements of the input and not artificial objects; this makes medoids available also in those settings in which the concept of centroid is not well defined or turns out to be artificial. Moreover, in many settings (e.g. texts) centroids tend to become dense objects with a high number of features, most of which carry little meaning. This makes centroids lose representativeness, and computing distances with them becomes more expensive than computing distances between normal objects. The main drawback of the original definition is that the clustering algorithm (Partitioning Around Medoids) and the computation of medoids are expensive; to overcome this disadvantage, many different re-definitions of medoids were introduced in the literature. In the context of the Furthest Point First heuristic, where some input points are elected as cluster centers and are used to determine which input points belong to the cluster, these restrictions on the use of centroids do not apply. However, we observed that, although the objects selected by FPF as centers determine the points belonging to the cluster, they are not centers in the sense suggested by human intuition. This motivates M-FPF-MD, which maintains for each cluster an approximate diametral pair (ai, bi) and uses as center the medoid mi, i.e. the cluster point that best approximates their midpoint.
M-FPF-MD:
Data: Let O be the input set, k the number of desired clusters.
Result: C, a k-partition of O.
Initialize R with a random sample of √(|O| · k) elements of O;
C = FPF(R, k);
forall Ci ∈ C do
ti = getRandomPoint(Ci);
ai = the point c ∈ Ci maximizing d(c, ti);
bi = the point c ∈ Ci maximizing d(c, ai);
mi = the point c ∈ Ci minimizing |d(c, ai) − d(c, bi)| + |d(c, ai) + d(c, bi) − d(ai, bi)|;
end
forall p ∈ O \ R do
assign p to the cluster Ci such that d(p, mi) < d(p, mj) for all j ≠ i;
if d(p, bi) > d(ai, bi) then ai = p;
if d(p, ai) > d(ai, bi) then bi = p;
end
Algorithm: M-FPF-MD.
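The medoid selection at the heart of M-FPF-MD can be sketched as follows in Python (our own rendering of the pseudocode above; dist is any of the distances of Section 2.1):

    import random

    def approximate_medoid(cluster, dist):
        """Pick the cluster point closest to the midpoint of an
        approximate diametral pair (a, b), as in M-FPF-MD."""
        t = random.choice(cluster)                   # random seed point
        a = max(cluster, key=lambda x: dist(x, t))   # furthest from t
        b = max(cluster, key=lambda x: dist(x, a))   # furthest from a
        dab = dist(a, b)
        # m minimizes |d(m,a) - d(m,b)| + |d(m,a) + d(m,b) - d(a,b)|
        m = min(cluster, key=lambda x: abs(dist(x, a) - dist(x, b))
                                     + abs(dist(x, a) + dist(x, b) - dab))
        return a, b, m

The first term of the objective prefers points equidistant from a and b, while the second prefers points lying close to the segment between them; together they approximate the midpoint of the diametral pair.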
5. TEXT CLUSTERING
We have described clustering algorithms based on the Furthest Point First heuristic by Gonzalez [5, 12]. In this section we compare all these algorithms (FPF, M-FPF and M-FPF-MD) in the setting of text clustering. We compared them using two different metric spaces: the well-known Cosine Similarity (CS) and the Generalized Jaccard Coefficient (GJC). Moreover, we are also interested in comparing our algorithms against k-means, which is the most widely used algorithm for text clustering. In order to compare the clustering algorithms we used two different data sets:
Reuters: one of the most widely used data sets in information retrieval. After an initial cleaning to remove multi-labelled documents, we obtained a corpus of 8538 documents of various lengths, organized in 65 mutually exclusive classes. We further removed 10 classes, each containing only a single document.
Snippets: a collection of 30 small corpora of 200 web snippets each, returned by the Open Directory Project (ODP) search engine for the 30 most popular queries according to Google statistics.
The two datasets differ in many respects. Reuters contains many more documents, and they are much longer than those in Snippets; documents in Reuters have a large size variability and are completely flat. Snippets, instead, contains semi-structured documents (a field with the snippet title and one with the snippet body). Since both datasets are endowed with a manual classification, it was possible to establish a ground truth. We used four measures to validate the correspondence of the clustering with the ground truth: the Normalized Mutual Information (NMI), the Normalized Complementary Entropy (NCE), the F-measure and the Accuracy. All these measures are in the range [0, 1], where higher values mean better quality.
5.1 FPF evaluation
In Table 1 we show a comparison of the three clustering algorithms for the k-center problem. Each clustering was run using two different distance functions, with the additional goal of comparing the metrics for text clustering. As shown in Table 1, when clustering Reuters data, all the algorithms behave very differently depending on the distance used; in all cases cosine similarity performs better than GJC. The Reuters data also put in evidence some differences among the clustering algorithms: FPF seems to achieve a worse quality with respect to the others. When clustering Snippets, these differences become much less evident: the F-measure and accuracy are always comparable, and while the NMI of FPF is considerably higher than that of M-FPF and M-FPF-MD, its NCE is lower than that of the other algorithms.
Table 1: Comparison of FPF, M-FPF and M-FPF-MD on the Reuters and Snippets datasets with the Generalized Jaccard Coefficient and Cosine Similarity.
Dataset    Measure   FPF CS   FPF GJC   M-FPF CS   M-FPF GJC   M-FPF-MD CS   M-FPF-MD GJC
Reuters    NMI       0.29     0.089     0.339      0.0802      0.281         0.088
Reuters    NCE       0.76     0.596     0.8        0.633       0.751         0.637
Reuters    F-mea     0.45     0.217     0.561      0.377       0.439         0.33
Reuters    Acc.      0.704    0.514     0.76       0.579       0.696         0.577
Snippets   NMI       0.807    0.456     0.524      0.774       0.472         0.5
Snippets   NCE       0.538    0.733     0.798      0.498       0.738         0.776
Snippets   F-mea     0.47     0.404     0.448      0.486       0.386         0.439
Snippets   Acc.      0.678    0.612     0.662      0.656       0.602         0.649
6. CONCLUSIONS
To give a global overview, we reported all the quality and running-time results for the Reuters and Snippets datasets. After analyzing the data, it is clear that the perfect clustering algorithm does not yet exist, and this motivates the continuing effort of researchers. Nevertheless, some observations can be made. First of all, on Reuters data k-means outperforms the k-center algorithms by roughly 10%, but it requires at least two orders of magnitude more time. Using GJC, the differences in time and performance become even more evident. Clearly, the best clustering/metric scheme is strictly dependent on the problem requirements and on the data size. In off-line applications computational efficiency can probably be sacrificed in favor of quality, but this is not always true.
REFERENCES
[1] Geraci, F., Pellegrini, M., Sebastiani, F., Maggini, M.: Cluster generation and cluster labeling for web snippets: A fast and accurate hierarchical solution.