Keywords
topic directory, document clustering, hierarchical clustering
1. INTRODUCTION
Text-based or semi-structured data abounds in cyberspace, e.g., emails and Web pages, and in our daily lives, e.g., newspapers and magazines. To provide easy access to such data, various processing methods have been extensively researched, including text classification, clustering, and summarization. A topic directory, a practical application for ordinary users yet built from cutting-edge text processing techniques, provides a view of a document set at different levels of abstraction and thus is ideal for the interactive exploration and visualization of a document set. In this paper, we target the dynamic construction of a topic directory from a text-based data (or document) set. A topic directory is a hierarchical document tree or graph structure in which each node has a topic label (or a cluster
as accurate as the other leading document clustering algorithms (i.e., bisecting k-means and UPGMA) in terms of clustering quality [6], and is more efficient since it substantially reduces the dimensionality when constructing clusters. However, this approach introduces another critical issue: determination of the support threshold. Since clusters are constructed from frequent termsets and the cluster size is the support of the termset, the number of clusters (previously a user parameter) is now determined by the support threshold (a new user parameter). The support threshold affects the entire clustering process in terms of quality and scalability. If it is set very low, the final topic directory becomes more descriptive, but the entire clustering method becomes unscalable because the mining time increases dramatically and the number of termsets becomes very large. Using a faster mining algorithm (e.g., FP-growth [7]) does not solve this, because mining all the frequent itemsets is fundamentally a combinatorial problem; thus, as the support threshold decreases linearly, the mining time increases exponentially regardless of the mining algorithm [7]. As noted in [6], mining time becomes the bottleneck of the entire clustering process in this kind of method. If the support threshold is set too high, the number of mined termsets becomes too small, and the directory constructed from the termsets might not cover every document of the document set. The directory also becomes too abstract, as each termset includes too many documents. The seemingly best way to adjust the support threshold is to run the mining algorithm multiple times with different support thresholds, from small to large, and probe information about the abstraction level of the directory or the cluster coverage, i.e., what portion of documents is covered by the clusters. However, we would then lose the benefit of using a mining algorithm for document clustering: the entire clustering process becomes unscalable and needs tedious manual optimization.
consists of the documents containing the same frequent termset. The previous clustering methods [6, 3] use all the frequent termsets to construct hierarchical clusters, but only closed termsets are meaningful in hierarchical clustering, as we will discuss in Section 3.1.3. Using only closed termsets also significantly reduces processing time, as it substantially reduces the number of cluster candidates in the process. Since the FT-tree subsumes the FP-tree, we can run the most recent frequent closed termset mining algorithm, CLOSET+ [10], on the structure without any modifications. After mining closed termsets, the FT-tree also allows us to efficiently construct the initial clusters by traversing the tree without scanning the documents (Section 3.2). We finally present an efficient way to build the soft clusters, i.e., the directory, from the initial clusters. To control the softness of clustering, we introduce a semantic parameter max_dup, denoting the maximal number of document duplications in the directory. For example, if max_dup = 1 and the cluster structure is a tree, then our method generates a hard cluster tree as do other clustering methods [6, 3]. Soft-clustering is necessary for many applications because a document can belong to multiple clusters. By allowing the directory to be a graph and max_dup > 1, our clustering method naturally supports soft-clustering. From our experiments on the document sets commonly used for evaluating hierarchical document clustering algorithms, our method generates results as high in quality as those of the most recent document clustering methods but is more efficient. It also naturally produces topic labels for the clusters using frequent closed termsets. We implement our method in the D2K environment [1] and show a screen shot of an experimental result using D2K. This paper is organized as follows: We first present the framework of our method in Section 2, together with some basic term definitions, and we perform a step-by-step analysis of our method in Section 3. In Section 4, we experimentally evaluate the performance of our method. We conclude our study in Section 5.
Example 1. The first two columns of Table 1 show the document set D in our running example. Suppose sup_thr is 2; we can find and sort the list of frequent items in support-descending order. The sorted item list is called the f_list. In this example, f_list = <e:6, c:5, b:4, d:4, a:3, f:2>. The frequent terms in each document are sorted according to the f_list and shown in the third column of Table 1. The termset ac is a frequent termset with support 2 but is not closed, because it has a superset acd whose support is also 2; acd is a frequent closed termset.
Table 1: The document set D and the ordered frequent term lists

Doc. ID | set of terms | ordered frequent term list
d1      | a, b, e, f   | e, b, a, f
d2      | a, c, d      | c, d, a
d3      | e            | e
d4      | b, f         | b, f
d5      | a, c, d, e   | e, c, d, a
d6      | c, d         | c, d
d7      | c, d         | c, d
d8      | c, e         | e, c
d9      | b, e         | e, b
d10     | b, e         | e, b
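To make Example 1 concrete, the following is a minimal Python sketch (ours, not the paper's implementation) that recomputes the f_list and checks which frequent termsets are closed for the toy document set of Table 1, assuming sup_thr = 2. The exhaustive enumeration of termsets is only workable for a toy example; the paper relies on a closed-termset mining algorithm such as CLOSET+ instead.

from itertools import combinations
from collections import Counter

# Document set D from Table 1: document ID -> set of terms.
docs = {
    "d1": {"a", "b", "e", "f"}, "d2": {"a", "c", "d"}, "d3": {"e"},
    "d4": {"b", "f"}, "d5": {"a", "c", "d", "e"}, "d6": {"c", "d"},
    "d7": {"c", "d"}, "d8": {"c", "e"}, "d9": {"b", "e"}, "d10": {"b", "e"},
}
sup_thr = 2

# f_list: frequent items sorted by support in descending order
# (ties broken alphabetically here, which matches the running example).
support = Counter(t for terms in docs.values() for t in terms)
f_list = sorted((t for t, s in support.items() if s >= sup_thr),
                key=lambda t: (-support[t], t))
print([(t, support[t]) for t in f_list])
# [('e', 6), ('c', 5), ('b', 4), ('d', 4), ('a', 3), ('f', 2)]

# Enumerate the support of every candidate termset exhaustively (toy-sized only).
ts_support = Counter()
for terms in docs.values():
    ordered = [t for t in f_list if t in terms]   # ordered frequent term list
    for k in range(1, len(ordered) + 1):
        for sub in combinations(ordered, k):
            ts_support[frozenset(sub)] += 1
frequent = {ts: s for ts, s in ts_support.items() if s >= sup_thr}

# A frequent termset is closed if no proper superset has the same support.
closed = {ts for ts, s in frequent.items()
          if not any(ts < other and s == frequent[other] for other in frequent)}
print(frozenset("ac") in closed)   # False: superset 'acd' has the same support (2)
print(frozenset("acd") in closed)  # True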
Figure 1: Framework for constructing the topic directory

Every text clustering method preprocesses documents in several steps, such as removing stopwords (e.g., "I", "am", "and") and word stemming (i.e., merging different forms of the same word, such as "term" and "terms"). After we preprocess the raw documents (Step (1) in Figure 1), each document can be represented as a vector of weighted term frequencies, i.e., term frequency-inverse document frequency (TFIDF), which the information retrieval community calls a vector space model. Our algorithm applies TFIDF; however, in our running examples we will simply use TF for easier understanding. Starting from this vector space model, we construct the FT-tree to mine closed termsets and then construct the initial clusters (Step (2)). Note that termsets are found based on word presence, not on the TFIDF values. After that, we construct the initial clusters from the FT-tree (Step (3)), which can be done without scanning the TFIDF vectors. The initial clusters are a list of frequent closed termsets, each paired with the documents that contain the termset, so documents may be duplicated in multiple clusters within the initial clusters. When we construct the final topic directory with at most max_dup document duplications (Step (4)), we use the original TFIDF vectors to trim the duplication from the initial clusters.
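The preprocessing and vector space model of Step (1) can be sketched roughly as follows. This is an illustrative stand-in, not the authors' pipeline: the stopword list and the trailing-'s' "stemmer" are placeholders for real components such as a full stopword dictionary and a Porter stemmer.

import math
import re
from collections import Counter

STOPWORDS = {"i", "am", "and", "the", "a", "of", "to", "in"}  # illustrative subset

def preprocess(text):
    """Step (1): lowercase, tokenize, drop stopwords, and crudely merge word
    forms by stripping a trailing 's' (a stand-in for a real stemmer)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t.rstrip("s") or t for t in tokens if t not in STOPWORDS]

def tfidf_vectors(raw_docs):
    """Represent each document as a TFIDF-weighted term vector (vector space model)."""
    tokenized = {doc_id: preprocess(text) for doc_id, text in raw_docs.items()}
    n_docs = len(tokenized)
    df = Counter(t for tokens in tokenized.values() for t in set(tokens))
    vectors = {}
    for doc_id, tokens in tokenized.items():
        tf = Counter(tokens)
        vectors[doc_id] = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
    return vectors

# Termsets are later mined on term presence only; the TFIDF weights are kept
# aside and reused in Step (4) to trim document duplications.
print(tfidf_vectors({"d1": "terms and term frequencies", "d2": "frequent terms"}))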
Figure 2: FT-trees for the running example: (a) the FT-tree constructed from Table 1, with its term header/link table; (b) the tree after pruning the term f; (c) the tree after pruning the term a. Thick-lined nodes are the ones being pruned next.
termsets) generated from the sup_thr cover every document in the document set (or cluster coverage = 1.0)? Based on the above FT-tree, we can efficiently identify this sup_thr without running a mining algorithm. The basic idea is to monitor whether each document ID is still covered by some node as we prune the nodes of lower supports from the bottom of the tree. If any document ID can no longer be covered by any node during the pruning process, we stop the pruning, and the current support is the maximal sup_thr that covers every document. To illustrate how to identify this sup_thr, consider the FT-tree of Figure 2(a), which is constructed from Table 1. We start pruning the tree from the bottom. (Since an FT-tree is a prefix tree with sorted items, as is an FP-tree, the lower nodes contain the items of lower supports.) The item f of support 2, i.e., the two thick-lined nodes in Figure 2(a), is pruned first. If the pruned nodes contain any document IDs, we pass the IDs to their parent nodes. Thus, the FT-tree after pruning f becomes the tree of Figure 2(b), where the parent nodes a and b now contain the IDs d1 and d4, respectively. This means that after we prune the term f, documents d1 and d4, previously covered by termsets {e, b, a, f} and {b, f} respectively, are now covered by termsets {e, b, a} and {b}. Next, we prune the term a of support 3, i.e., the three thick-lined nodes in Figure 2(b), and the tree of Figure 2(b) becomes the tree of Figure 2(c). In other words, documents d1, d5, and d2, previously covered by termsets {e, b, a}, {e, c, d, a}, and {c, d, a} respectively, are now covered by termsets {e, b}, {e, c, d}, and {c, d}. When we start pruning the terms b and d of the next higher support 4, i.e., the four thick-lined nodes in Figure 2(c), we find that d4 would no longer be covered by any node, since its parent is Null. Thus, we stop the pruning procedure here, and the maximal sup_thr that covers every document is 4. Note that we can compute this maximal sup_thr without actually pruning the tree, simply by searching the tree from the bottom for the first node whose parent is Null. However, showing the flow of document IDs in the tree as sup_thr increases helps users understand the relations among the sup_thr, the covered documents, and the length of the termsets that cover them; thus, it helps users determine the proper sup_thr. For instance, suppose that a document set contains very few outlier documents that do not share any terms with other documents in the set; then the maximal sup_thr becomes
very low for the mined termsets to cover such outlier documents. In such cases, the document coverage information of Table 2 becomes very useful in determining the proper sup_thr. Column Coverage in the table denotes the portion of documents that is covered by the corresponding sup_thr. Column Not Covered Doc. IDs denotes the actual document IDs that are not covered by that sup_thr. This coverage table can be efficiently generated from the FT-tree before mining the frequent termsets.
Table 2: Document coverage table

sup_thr | Coverage | Not Covered Doc. IDs
4       | 1.0      |
5       | 0.9      | d4
6       | 0.6      | d2, d6, d7
7       | 0.0      | d1, d3, d5, d8, d9, d10

To get a better intuition about the document coverage, we can easily draw the coverage graph from the FT-tree before running a mining algorithm. Our experiments in Section 4 show such coverage graphs (Figure 6). Tree pruning can be done after the sup_thr is determined.
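The probing idea above can be sketched as follows, using a bare prefix tree in place of the full FT-tree (the header/link table is omitted). This is only a sketch of the bookkeeping, not the paper's data structure: document IDs flow upward as supports are pruned bottom-up, and the coverage rows it prints reproduce Table 2 for the running example.

from collections import Counter

class Node:
    """A bare prefix-tree node: a term, its parent, and the IDs of the documents
    whose ordered frequent-term list ends exactly at this node."""
    def __init__(self, term=None, parent=None):
        self.term, self.parent = term, parent
        self.children = {}
        self.doc_ids = set()

def build_tree(docs, f_list):
    root, order = Node(), {t: i for i, t in enumerate(f_list)}
    for doc_id, terms in docs.items():
        node = root
        for t in sorted((t for t in terms if t in order), key=order.get):
            node = node.children.setdefault(t, Node(t, node))
        node.doc_ids.add(doc_id)
    return root

def coverage_table(docs, sup_thr=2):
    """Prune terms from the lowest support upward; document IDs held by a pruned
    node flow to its parent, and a document becomes uncovered once its ID reaches
    the root.  Returns (sup_thr, coverage, uncovered document IDs) rows."""
    support = Counter(t for terms in docs.values() for t in terms)
    f_list = sorted((t for t, s in support.items() if s >= sup_thr),
                    key=lambda t: (-support[t], t))
    root = build_tree(docs, f_list)
    nodes, stack = [], list(root.children.values())
    while stack:                                    # collect all tree nodes
        n = stack.pop()
        nodes.append(n)
        stack.extend(n.children.values())
    rows = []
    for s in sorted({support[t] for t in f_list}):  # prune one support level at a time
        for n in (n for n in nodes if support[n.term] == s):
            n.parent.doc_ids |= n.doc_ids           # IDs flow upward
            n.doc_ids.clear()
        rows.append((s + 1, 1 - len(root.doc_ids) / len(docs), sorted(root.doc_ids)))
    return rows

docs = {"d1": set("abef"), "d2": set("acd"), "d3": set("e"), "d4": set("bf"),
        "d5": set("acde"), "d6": set("cd"), "d7": set("cd"), "d8": set("ce"),
        "d9": set("be"), "d10": set("be")}
for row in coverage_table(docs):
    print(row)   # includes (4, 1.0, []) and (5, 0.9, ['d4']), as in Table 2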
Before building the topic directory, we prune the directory by (1) removing inner termsets (Section 3.3.1) and (2) constraining the maximal number of document duplications (Section 3.3.2). After that, the topic directory is constructed (Section 3.3.3) and the first-level nodes are finally merged (Section 3.3.4).
Table 4: Clusters for each document (termsets within parentheses are inner termsets)

If multiple nodes on the same path in a directory contain the same document, to minimize document redundancy we keep the document only in the lowest node and remove it from the others. This is done by removing inner termsets among the frequent closed termsets, i.e., the termsets whose superset exists in the same document; e.g., in Table 4, termset <c> in document d2 is an inner termset because its superset <cd> also exists in d2.

Lemma 1. Removing inner termsets will not cause an empty node in the directory and will not affect the clustering quality.

Rationale. Only closed termsets constitute the nodes in the directory. Thus, for any termset, there must be at least one document that does not contain its superset; otherwise, the termset would not be closed by Definition 1. For example, in Table 4, although termset <c> is removed from d2, d5, d6, and d7, it still exists in d8, and thus node <c> will not be an empty node in the directory (otherwise, <c> would not be a closed termset). Removing inner termsets also does not affect the clustering quality, since, in the common hierarchical clustering evaluation method, the documents in a cluster (or node) include those in its descendant nodes.
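A small sketch of the inner-termset removal of Section 3.3.1 follows; the document-to-termset mapping below is an illustrative slice rather than the full Table 4, and the helper name is ours.

def remove_inner_termsets(doc_clusters):
    """For each document, drop any termset that has a proper superset assigned
    to the same document (an inner termset); keep only the maximal ones."""
    return {doc_id: [ts for ts in termsets
                     if not any(ts < other for other in termsets)]
            for doc_id, termsets in doc_clusters.items()}

# Illustrative slice of Table 4: d2 contains both <c> and <cd>, so <c> is an
# inner termset for d2 and is removed there, while d8 keeps <c> (no superset).
doc_clusters = {
    "d2": [frozenset("c"), frozenset("cd")],
    "d8": [frozenset("c"), frozenset("e")],
}
print(remove_inner_termsets(doc_clusters))
# {'d2': [frozenset({'c', 'd'})], 'd8': [frozenset({'c'}), frozenset({'e'})]}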
The score of a document d with respect to a termset T is defined as

score(d, T) = Σ_{t ∈ T} d_t

where d_t denotes the weight of term t in the feature vector of document d. For instance, according to the TFIDF vectors of Table 5, score(d5, <cd>) = 1.0 + 2.0 = 3.0.
Table 5: Feature vectors of the document set

Doc. ID |  a  |  b  |  c  |  d  |  e  |  f
d1      | 1.0 | 1.0 |  0  |  0  | 2.0 | 1.0
d2      | 1.0 |  0  | 2.0 | 1.0 |  0  |  0
d3      |  0  |  0  |  0  |  0  | 2.0 |  0
d4      |  0  | 2.0 |  0  |  0  |  0  | 1.0
d5      | 1.0 |  0  | 1.0 | 2.0 | 1.0 |  0
d6      |  0  |  0  | 1.0 | 2.0 |  0  |  0
d7      |  0  |  0  | 2.0 | 1.0 |  0  |  0
d8      |  0  |  0  | 1.0 |  0  | 2.0 |  0
d9      |  0  | 1.0 |  0  |  0  | 2.0 |  0
d10     |  0  | 2.0 |  0  |  0  | 1.0 |  0
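The scoring used when trimming duplications, and the max_dup selection it drives, can be sketched as follows. The candidate cluster list for d5 is illustrative, and apply_max_dup is a hypothetical helper name, not something from the paper.

def score(doc_vec, termset):
    """score(d, T) = sum over t in T of d_t, the weight of term t in d's vector."""
    return sum(doc_vec.get(t, 0.0) for t in termset)

def apply_max_dup(doc_vec, candidate_termsets, max_dup=1):
    """Keep only the max_dup best-scoring cluster labels for one document."""
    ranked = sorted(candidate_termsets, key=lambda ts: score(doc_vec, ts), reverse=True)
    return ranked[:max_dup]

# d5's feature vector from Table 5 and two of its candidate clusters.
d5 = {"a": 1.0, "c": 1.0, "d": 2.0, "e": 1.0}
print(score(d5, frozenset("cd")))                               # 3.0, as in the text
print(apply_max_dup(d5, [frozenset("cd"), frozenset("e")], 1))  # [<cd>], as in Table 6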
Input: nodes (termsets), document-cluster list
Output: topic directory

Main:
    for m = 1 to maximal length of nodes
        for each node of length m
            link(node, m)
    connect document IDs to corresponding nodes using the document-cluster list

link(node, m):
    if m = 0, then link node to root
    else if there exist inner nodes of length m - 1, then link the node to them as a child
    else link(node, m - 1)
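Here is one possible Python rendering of the linking pseudocode above. Reading "inner nodes of length m - 1" as already-placed termsets of that length which are proper subsets of the node being linked is our interpretation, and the final step of attaching document IDs is left out.

def build_directory(termsets):
    """Link each node (closed termset) under existing subset nodes of length m-1,
    falling back toward the root when none exist."""
    children = {frozenset(): []}                 # parent termset -> child termsets
    by_length = {}
    for ts in termsets:
        by_length.setdefault(len(ts), []).append(ts)

    def link(node, m):
        if m == 0:
            children[frozenset()].append(node)   # link to the root
            return
        parents = [p for p in by_length.get(m - 1, []) if p < node]
        if parents:
            for p in parents:                    # a node may have several parents (graph)
                children[p].append(node)
        else:
            link(node, m - 1)

    for m in range(1, max(by_length) + 1):
        for node in by_length.get(m, []):
            children.setdefault(node, [])
            link(node, m)
    # (connecting document IDs to the nodes via the document-cluster list is omitted)
    return children

# Closed termsets from the running example.
d = build_directory([frozenset("e"), frozenset("b"), frozenset("c"),
                     frozenset("cd"), frozenset("acd")])
print(d[frozenset()])          # first level: <e>, <b>, <c>
print(d[frozenset("cd")])      # child: <acd>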
Table 6: Cluster label for each document when max_dup = 1

Doc. ID | Cluster Label
d1      | <b>
d2      | <cd>
d3      | <e>
d4      | <b>
d5      | <cd>
d6      | <cd>
d7      | <cd>
d8      | <e>
d9      | <e>
d10     | <b>
Table 6 shows the document-cluster list after applying max_dup = 1 to each document, using the above scoring function with the TFIDF vectors of Table 5.
4. EXPERIMENTAL EVALUATION
This section presents the experimental evaluation of our method (TDC) by comparing it with the most recent document clustering methods: agglomerative UPGMA [5], bisecting k-means [9, 4], and the methods using frequent itemset mining, FIHC [6] and HFTC [3]. We use the CLUTO-2.0 clustering toolkit [8] to generate the results of UPGMA and bisecting k-means, and we use the authors' implementation for FIHC [6]. We could not obtain an implementation of HFTC, but as shown in [6], FIHC always performs better than HFTC in accuracy, efficiency, and scalability.
We evaluate clustering accuracy with the F-measure:

F(Ki, Cj) = 2 · nij / (|Ki| + |Cj|)

where nij is the number of members of natural class Ki in cluster Cj (i.e., the true positives). Intuitively, F(Ki, Cj) measures the quality of cluster Cj in describing the natural class Ki by the harmonic mean of precision (nij / |Cj|) and recall (nij / |Ki|). When computing F(Ki, Cj) in a hierarchical structure, all the documents in the subtree of Cj are considered as documents in Cj. The success of capturing a natural class Ki is measured by using the best cluster Cj for Ki, i.e., the Cj maximizing F(Ki, Cj). We measure the quality of a clustering result, the overall F-measure F(C), using the weighted sum of such maximum F-measures over all natural classes, as follows:

F(C) = Σ_{Ki ∈ K} (|Ki| / |D|) · max_{Cj ∈ C} F(Ki, Cj)

where K denotes all natural classes, C denotes all clusters at all levels, |Ki| denotes the number of documents in natural class Ki, and |D| denotes the total number of documents in the data set. The range of F(C) is [0, 1], and a larger F(C) value indicates a higher clustering accuracy.

4.3 Results

Due to space limitations, we report the main results and leave the details to a technical report.

Table 7: F-measure comparison (# of clus: # of clusters; †: not scalable to run, as was also done in [6])

Dataset  | # of clus | TDC  | FIHC | Bi k-means | UPGMA
Hitech   |           | 0.57 | 0.45 | 0.54       | 0.33
         |           | 0.52 | 0.42 | 0.44       | 0.33
         |           | 0.48 | 0.41 | 0.29       | 0.47
         |           | 0.44 | 0.41 | 0.21       | 0.40
         | Average   | 0.50 | 0.42 | 0.37       | 0.38
Re0      |           | 0.57 | 0.53 | 0.34       | 0.36
         |           | 0.51 | 0.45 | 0.38       | 0.47
         |           | 0.47 | 0.43 | 0.38       | 0.42
         |           | 0.41 | 0.38 | 0.28       | 0.34
         | Average   | 0.49 | 0.45 | 0.34       | 0.40
Wap      |           | 0.47 | 0.40 | 0.40       | 0.39
         |           | 0.45 | 0.56 | 0.57       | 0.49
         |           | 0.43 | 0.57 | 0.44       | 0.58
         |           | 0.41 | 0.55 | 0.37       | 0.59
         | Average   | 0.44 | 0.52 | 0.45       | 0.51
Classic4 |           | 0.61 | 0.62 | 0.59       | †
         |           | 0.53 | 0.52 | 0.46       | †
         |           | 0.48 | 0.52 | 0.43       | †
         |           | 0.41 | 0.51 | 0.27       | †
         | Average   | 0.50 | 0.54 | 0.44       | †
Reuters  |           | 0.46 | 0.37 | 0.40       | †
         |           | 0.45 | 0.40 | 0.34       | †
         |           | 0.42 | 0.40 | 0.31       | †
         |           | 0.40 | 0.39 | 0.26       | †
         | Average   | 0.43 | 0.39 | 0.33       | †

Table 8 shows the sup_thr of coverage = 1.0 determined for each data set. Figure 6 illustrates the coverage change for each sup_thr on the data sets Hitech and Reuters, which is also generated by the probing algorithm. As noted at the end of Section 3.1.2, for very large document sets having few outlier documents, this coverage information helps to determine a proper sup_thr such that the sup_thr covers enough documents while the frequent closed termsets can still be mined efficiently.

Table 8: sup_thr of coverage = 1.0 (the total # of documents in each data set is within parentheses)

Dataset  | sup_thr
Hitech   | 363 (2301)
Re0      | 138 (1504)
Wap      | 333 (1560)
Classic4 | 70 (7094)
Reuters  | 174 (10802)
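As a sanity check on the metric defined above, the following is a small sketch of the overall F-measure. The class and cluster sets are made-up toy data, and the clusters are assumed to be given as flat document sets that already include the documents of their subtrees.

def f_measure(class_docs, cluster_docs):
    """F(Ki, Cj) = 2 * nij / (|Ki| + |Cj|), the harmonic mean of precision and recall."""
    nij = len(class_docs & cluster_docs)
    return 2.0 * nij / (len(class_docs) + len(cluster_docs)) if nij else 0.0

def overall_f(classes, clusters, n_docs):
    """F(C) = sum over Ki of (|Ki| / |D|) * max over Cj of F(Ki, Cj).
    Every entry in `clusters` must already contain the documents of its subtree."""
    return sum(len(k) / n_docs * max(f_measure(k, c) for c in clusters)
               for k in classes.values())

# Tiny illustration with two natural classes and three subtree-expanded clusters.
classes = {"grain": {"d1", "d2", "d3"}, "trade": {"d4", "d5"}}
clusters = [{"d1", "d2"}, {"d3", "d4", "d5"}, {"d4", "d5"}]
print(round(overall_f(classes, clusters, 5), 2))   # 0.88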
Figure 6: Document coverage in Hitech (left) and Reuters (right). X-axis: sup_thr; Y-axis: document coverage.
Figure 7: Screen shot of the TDC system. Topic directory constructed on the Reuters data set.

ods for prediction, discovery, and deviation detection with data and information visualization tools.2 Figure 7 shows a screen shot of our running system. The topic directory is constructed from Reuters. The entire tree is shown in the navigator box located on the upper-left side. We can see the list of documents and the actual documents by selecting a node on the screen. The left popup window shows the documents retrieved by the keyword "year", and the right popup window shows the selected document; the circled node indicates the clusters containing the selected document.

2 It offers a visual programming environment that allows users to connect programming modules together to build data mining applications and supplies a core set of modules, application templates, and a standard API for software component development.

5. CONCLUSIONS

Using frequent termsets for document clustering is promising because it substantially reduces the large dimensionality of the document vector space [6]. It also naturally provides a topic label for each cluster using frequent termsets. In this paper, we presented a method that efficiently generates a topic directory from a set of documents using a frequent closed termset mining algorithm. We also presented a nonparametric closed termset mining method to automatically determine a proper support threshold for a topic directory. Our method experimentally shows performance as high as that of the most recent document clustering methods and has additional benefits: automatic generation of topic labels and automatic determination of a cluster parameter.

6. REFERENCES

[1] http://alg.ncsa.uiuc.edu/do/tools/d2k.
[2] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proc. Int. Conf. Very Large Databases (VLDB'94), pages 487-499, 1994.
[3] F. Beil, M. Ester, and X. Xu. Frequent term-based text clustering. In Proc. ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining (KDD'02), pages 436-442, 2002.
[4] D. R. Cutting, D. R. Karger, J. O. Pedersen, and J. W. Tukey. Scatter/Gather: A cluster-based approach to browsing large document collections. In Proc. ACM SIGIR Int. Conf. Information Retrieval (SIGIR'92), pages 318-329, 1992.
[5] R. C. Dubes and A. K. Jain, editors. Algorithms for Clustering Data. Prentice Hall, 1988.
[6] B. C. M. Fung, K. Wang, and M. Ester. Hierarchical document clustering using frequent itemsets. In SIAM Int. Conf. Data Mining, 2003.
[7] J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. In Proc. ACM SIGMOD Int. Conf. Management of Data (SIGMOD'00), 2000.
[8] G. Karypis. CLUTO 2.0 clustering toolkit, 2002.
[9] M. Steinbach, G. Karypis, and V. Kumar. A comparison of document clustering techniques. In KDD Workshop on Text Mining, 2000.
[10] J. Wang, J. Han, and J. Pei. CLOSET+: Searching for the best strategies for mining frequent closed itemsets. In Proc. ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining (KDD'03), pages 236-245, 2003.
[11] O. Zamir and O. Etzioni. Web document clustering: A feasibility demonstration. In Proc. ACM SIGIR Int. Conf. Information Retrieval (SIGIR'98), pages 46-54, 1998.