
Scalable Construction of Topic Directory with Nonparametric Closed Termset Mining

Hwanjo Yu, Duane Searsmith, Xiaolei Li, Jiawei Han


Department of Computer Science University of Illinois at Urbana-Champaign Urbana, IL 61801

hwanjoyu@uiuc.edu, dsears@ncsa.uiuc.edu, xli10@uiuc.edu, hanj@cs.uiuc.edu

ABSTRACT


A topic directory, e.g., the Yahoo directory, provides a view of a document set at different levels of abstraction and thus is ideal for the interactive exploration and visualization of a document set. We present a method that dynamically generates a topic directory from a document set using a frequent closed termset mining algorithm. Our method naturally produces topic labels for the clusters using the frequent termsets and is more scalable than other recent document clustering algorithms because it substantially reduces the high dimensionality of documents. We also present a nonparametric closed termset mining method that automatically determines a proper support threshold for a topic directory. Experimentally, our method produces results of quality equal to recent document clustering methods and has additional benefits: automatic generation of topic labels and automatic determination of a cluster parameter.

Keywords
topic directory, document clustering, hierarchical clustering

1. INTRODUCTION

Text-based or semi-structured data abounds in cyberspace, e.g., emails and Web pages, and in our daily lives, e.g., newspapers and magazines. To provide easy access to such data, various processing methods have been extensively researched, including text classification, clustering, and summarization. A topic directory, a practical application for ordinary users that combines cutting-edge text processing techniques, provides a view of a document set at different levels of abstraction and thus is ideal for the interactive exploration and visualization of a document set. In this paper, we target the dynamic construction of a topic directory from a text-based data (or document) set. A topic directory is a hierarchical document tree or graph structure in which each node has a topic label (or a cluster description) and corresponding documents. The topic of a higher node conceptually covers those of its children nodes. For a static topic directory, e.g., the Yahoo Web directory, the taxonomy is static and manually constructed by domain experts, and documents are classified into the taxonomy by manual or automatic classifiers; such static directories are therefore typically used for organizing and searching targeted documents. Dynamic topic directories, in contrast, are constructed for a given, fixed document set or for a temporal interest across document sets, e.g., browsing the main news of the year 2000 from the AP news data. The directory and its topic labels are thus constructed dynamically based on the contents of the document set.

1.1 Challenges and Related Work

Construction of such a dynamic topic directory requires techniques for hierarchical document soft-clustering and for cluster summarization to construct topic labels. Recent studies in document clustering show that UPGMA [5] and bisecting k-means [9, 4] are the most accurate algorithms in the categories of agglomerative and partitioning clustering algorithms, respectively, and that they outperform other recent hierarchical clustering methods in terms of clustering quality [9, 5, 4]. However, such typical clustering methods (1) do not provide cluster descriptions, (2) are not scalable to large document sets (in the case of UPGMA), (3) require the user to decide the number of clusters a priori, which is usually unknown in real applications, and (4) focus on hard-clustering, whereas in the real world a document can belong to multiple categories. Suffix-tree clustering [11] is the first method to provide cluster descriptions, by forming clusters of documents that share common terms or phrases. However, the size of the suffix tree increases quadratically with the length of the documents, so it cannot be used with long documents; for this reason, the suffix-tree method for clustering Web search results uses only the returned sentences for clustering. Another recent approach is to use frequent itemset mining to construct clusters with corresponding topic labels [3, 6]. This approach first runs a frequent itemset mining algorithm, i.e., Apriori [2], to mine frequent termsets from a document set. (An itemset corresponds to a termset, i.e., a set of terms, in this case.) Documents are then clustered based only on the low-dimensional frequent termsets; each frequent termset serves as the topic label of a cluster, and the corresponding cluster consists of the set of documents containing the termset. This method indeed turns out to be as accurate as the other leading document clustering algorithms (i.e., bisecting k-means and UPGMA) in terms of clustering quality [6], and is more efficient since it substantially reduces the dimensionality when constructing clusters. However, this approach introduces another critical issue: determination of the support threshold. Since clusters are constructed from frequent termsets and the cluster size is the support of the termset, the number of clusters (previously a user parameter) is now determined by the support threshold (a new user parameter). The support threshold affects the entire clustering process in terms of both quality and scalability. If it is set very low, the final topic directory becomes more descriptive, but the entire clustering method becomes unscalable, because the mining time increases dramatically and the number of termsets becomes very large. Using a faster mining algorithm (e.g., FP-growth [7]) does not solve this, because mining all the frequent itemsets is fundamentally a combinatorial problem; as the support threshold decreases linearly, the mining time increases exponentially regardless of the mining algorithm [7]. As noted in [6], mining time becomes the bottleneck of the entire clustering process in this kind of method. If the threshold is set too high, the number of mined termsets becomes too small, and the directory constructed from the termsets might not cover every document of the document set; the directory also becomes too abstract, as each termset includes too many documents. The seemingly best way to adjust the support threshold is to run the mining algorithm multiple times with different support thresholds, from small to large, and probe information about the abstraction level of the directory or the cluster coverage, i.e., what portion of the documents is covered by the clusters. However, we would then lose the benefit of using a mining algorithm for document clustering: the entire clustering process becomes unscalable and needs tedious manual optimization.




1.2 Contribution and Organization


Is it possible to adjust the support threshold, without running the mining algorithm, so that the directory constructed from the mined termsets is at the right abstraction level and covers enough documents? To answer this question, we propose a nonparametric closed termset mining method for efficient topic directory construction. The primary new contributions of our method are summarized as follows.

We present an algorithm that produces the cluster coverage, with the corresponding covered documents, for each support threshold without running the mining algorithm. To achieve this, we introduce a new structure, the frequent termset tree (FT-tree), which is similar to the FP-tree but additionally records document ID information (Section 3.1). Our algorithm can be generalized to other frequent itemset mining applications, e.g., mining the top frequent (closed) itemsets that cover x% of the dataset and printing out the data not covered by those itemsets.

The FT-tree also facilitates two other intermediate steps: (1) mining closed termsets and (2) constructing initial clusters, i.e., a list of clusters in which each cluster consists of the documents containing the same frequent termset. The previous clustering methods [6, 3] use all the frequent termsets to construct hierarchical clusters, but only closed termsets are meaningful in hierarchical clustering, as we discuss in Section 3.1.3. Using only closed termsets also significantly reduces processing time, as it substantially reduces the number of cluster candidates in the process. Since the FT-tree subsumes the FP-tree, we can run the most recent frequent closed termset mining algorithm, CLOSET+ [10], without any modification of the structure. After mining closed termsets, the FT-tree also allows us to efficiently construct the initial clusters by traversing the tree without scanning the documents (Section 3.2).

We finally present an efficient way to build the soft clusters, i.e., the directory, from the initial clusters. To control the softness of the clustering, we introduce a semantic parameter max dup, denoting the maximal number of duplications of a document in the directory. For example, if max dup = 1 and the cluster structure is a tree, our method generates a hard cluster tree, as other clustering methods do [6, 3]. Soft-clustering is necessary for many applications because a document can belong to multiple clusters; by allowing the directory to be a graph and max dup > 1, our clustering method naturally supports soft-clustering.

In our experiments on the document sets commonly used for evaluating hierarchical document clustering algorithms, our method generates results as high in quality as the most recent document clustering methods but is more efficient. It also naturally produces topic labels for the clusters using frequent closed termsets. We implemented our method in the D2K environment [1] and show a screen shot of an experimental result using D2K.

This paper is organized as follows. We first present the framework of our method in Section 2, with some basic term definitions, and we perform a step-by-step analysis of our method in Section 3. In Section 4, we experimentally evaluate the performance of our method. We conclude our study in Section 5.

2. OUR FRAMEWORK

2.1 Term Definitions


We first define some basic terms. We have a document set D that contains a list of tuples <did, T>, where did is a document ID and T is the termset representing the document. A term is any sequence of characters separated from other terms by some delimiter in a document. Let DT = {t1, t2, ..., tn} be the complete set of distinct terms appearing in D. Each document d is represented by a termset T, a non-empty subset of DT. Note that termsets capture only term presence, not term frequency. (We will take the term frequency information into account later.) The number of documents in D containing a termset T is called the support of T, denoted sup(T). Given a support threshold sup thr, a termset T is frequent if sup(T) ≥ sup thr.

Definition 1 (Frequent closed termset). A termset T is a frequent closed termset if it is frequent and there exists no proper superset T' ⊃ T such that sup(T') = sup(T).

Example 1. The first two columns of Table 1 show the document set D in our running example. Suppose sup thr is 2; we can find and sort the list of frequent terms in support-descending order. The sorted list is called the f list. In this example, f list = <e:6, c:5, b:4, d:4, a:3, f:2>. The frequent terms in each termset, sorted according to the f list, are shown in the third column of Table 1. The termset ac is a frequent termset with support 2 but is not closed, because it has a superset acd whose support is also 2; acd is a frequent closed termset.
Doc. ID   set of terms    ordered frequent term list
d1        a, b, e, f      e, b, a, f
d2        a, c, d         c, d, a
d3        e               e
d4        b, f            b, f
d5        a, c, d, e      e, c, d, a
d6        c, d            c, d
d7        c, d            c, d
d8        c, e            e, c
d9        b, e            e, b
d10       b, e            e, b

Table 1: Document set D

2.2 Our Framework

[Figure 1: Framework for constructing a topic directory. (1) Raw documents D = {d1, ..., dn} are preprocessed into TFIDF vectors; (2) an FT-tree is built and closed termsets are mined; (3) initial clusters (pairs of a termset and its document IDs) are constructed; (4) the topic directory is built from the initial clusters and the TFIDF vectors.]

Every text clustering method preprocesses documents in several steps, such as removing stopwords (e.g., "I", "am", "and") and word stemming (e.g., merging different forms of the same word, such as "term" and "terms"). After we preprocess the raw documents (Step (1) in Figure 1), each document can be represented as a vector of weighted term frequencies, i.e., term frequency-inverse document frequency (TFIDF), which the information retrieval community calls the vector space model. Our algorithm applies TFIDF; however, in our running examples we simply use TF for ease of understanding. Starting from this vector space model, we construct the FT-tree and mine closed termsets (Step (2)). Note that termsets are found based on term presence, not on the TFIDF weights. We then construct the initial clusters from the FT-tree (Step (3)), which can be done without scanning the TFIDF vectors. The initial clusters are a list of frequent closed termsets, each paired with the documents that contain the termset, so a document may be duplicated in multiple initial clusters. When we construct the final topic directory with at most max dup duplications per document (Step (4)), we use the original TFIDF vectors to trim the duplication from the initial clusters.
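As an illustration of Step (1) only, the preprocessing into TFIDF vectors might look as follows; this sketch uses scikit-learn's TfidfVectorizer for brevity (the paper does not prescribe an implementation, and word stemming is omitted here):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical raw documents; any corpus works here.
raw_docs = ["stocks rallied as tech earnings beat estimates",
            "the central bank held interest rates steady"]

# Step (1): stopword removal and TFIDF weighting (stemming omitted in this sketch).
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(raw_docs)       # documents x terms sparse matrix
terms = vectorizer.get_feature_names_out()       # the distinct-term vocabulary DT
```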

3. SCALABLE CONSTRUCTION OF TOPIC DIRECTORY WITH NONPARAMETRIC CLOSED TERMSET MINING


In this section we analyze Steps (2), (3), and (4) of Figure 1, in Sections 3.1, 3.2, and 3.3, respectively; together they form a scalable method for constructing a topic directory from a document set.

3.1 Nonparametric Closed Termset Mining for Document Clustering


3.1.1 FT-tree Construction
The FP-tree published in [7] is a prefix tree with sorted items in which each node contains an item and the support of the itemset on the path from the root to that node. The FP-tree has proven to be an efficient structure for mining frequent (closed) itemsets [10]. The FT-tree is similar to the FP-tree except that it additionally stores document IDs. For instance, Figure 2(a) shows the FT-tree constructed from the document set D of Table 1. Each document ID from Table 1 is stored in the last node of the corresponding termset path in the FT-tree; for instance, the document d2 = {c, d, a} is recorded in the node a of the path Null-c-d-a. The FT-tree can thus be defined as an FP-tree that additionally records the document ID at the last node of the corresponding path. Why these document IDs are needed in the tree is explained in Sections 3.1.2, 3.1.3, and 3.2. Constructing an FT-tree is also similar to constructing an FP-tree, except that when we insert the termset representing a document (i.e., a pattern in FP-tree terms) into the tree, we insert the document ID at the last node. For instance, inserting d1 = {e, b, a, f} creates four nodes in the tree; when we create the last node f, we insert the ID d1 into that node. As another example, assume that we have inserted d1 through d9 of Table 1 into the tree and are now inserting d10 = {e, b}. Since the nodes e and b already exist, the FT-tree only increments the support of each node along the path, as the FP-tree does; when updating the support of the last node b, the FT-tree also inserts the ID d10 into that node, as Figure 2(a) shows.

Remark 1. Each document ID appears in exactly one node of the FT-tree, without any duplication.

Rationale. Each document (termset) is represented by exactly one path in the FT-tree, so multiple paths cannot contain the same document ID, and the document ID is inserted only in the last node of its termset path.
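To make the construction concrete, here is a minimal Python sketch of an FT-tree builder following the description above; the names FTNode and build_ft_tree are our own illustration, not code from the paper:

```python
from collections import defaultdict

class FTNode:
    """One FT-tree node: an item, its support, child links, and the IDs of
    documents whose termset path ends at this node (Remark 1)."""
    def __init__(self, item, parent=None):
        self.item, self.parent = item, parent
        self.support = 0
        self.children = {}      # item -> FTNode
        self.doc_ids = []       # documents ending at this node

def build_ft_tree(docs, sup_thr):
    """docs: dict doc_id -> set of terms. Returns (root, header, f_list)."""
    counts = defaultdict(int)
    for terms in docs.values():
        for t in terms:
            counts[t] += 1
    # f_list: frequent items in support-descending order (ties broken alphabetically).
    f_list = [t for t, c in sorted(counts.items(), key=lambda x: (-x[1], x[0]))
              if c >= sup_thr]
    rank = {t: i for i, t in enumerate(f_list)}

    root, header = FTNode(None), defaultdict(list)   # header: item -> side-link nodes
    for doc_id, terms in docs.items():
        ordered = sorted((t for t in terms if t in rank), key=rank.get)
        node = root
        for t in ordered:
            if t not in node.children:
                node.children[t] = FTNode(t, parent=node)
                header[t].append(node.children[t])
            node = node.children[t]
            node.support += 1
        if node is not root:
            node.doc_ids.append(doc_id)   # the ID goes only in the last node of the path
    return root, header, f_list

# Running example of Table 1 with sup_thr = 2.
D = {"d1": {"a","b","e","f"}, "d2": {"a","c","d"}, "d3": {"e"}, "d4": {"b","f"},
     "d5": {"a","c","d","e"}, "d6": {"c","d"}, "d7": {"c","d"}, "d8": {"c","e"},
     "d9": {"b","e"}, "d10": {"b","e"}}
root, header, f_list = build_ft_tree(D, sup_thr=2)
print(f_list)   # ['e', 'c', 'b', 'd', 'a', 'f'], as in Example 1
```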


3.1.2 Probing Support Threshold


Note that constructing an FT-tree is very fast compared to mining frequent termsets from it; running a mining algorithm, e.g., FP-growth, is the bottleneck of the entire running time of frequent pattern mining [6]. How, then, can we efficiently identify the maximal sup thr, without running a mining algorithm, such that the clusters (i.e., the mined termsets) generated with that sup thr cover every document in the document set (i.e., cluster coverage = 1.0)?

[Figure 2: Determining the support threshold. (a) The FT-tree constructed from the document set D of Table 1; (b) the tree after pruning term f; (c) the tree after pruning term a.]

Based on the FT-tree, we can identify this sup thr efficiently without running a mining algorithm. The basic idea is to monitor whether every document ID is still covered by some node as we prune the nodes of lower support from the bottom of the tree. If some document ID would no longer be covered by any node during the pruning process, we stop the pruning; the support at that point is the maximal sup thr that covers every document.

To illustrate how the sup thr is identified, consider the FT-tree of Figure 2(a), constructed from Table 1. We start pruning the tree from the bottom. (Since an FT-tree is a prefix tree with items sorted by support, as is an FP-tree, the lower nodes contain the items of lower support.) The item f with support 2, i.e., the two nodes drawn with thick lines in Figure 2(a), is pruned first. If the pruned nodes contain any document IDs, we pass those IDs to their parent nodes; the FT-tree after pruning f thus becomes the tree of Figure 2(b), in which the parent nodes a and b now contain the IDs d1 and d4, respectively. This means that after pruning the term f, documents d1 and d4, previously covered by the termsets {e, b, a, f} and {b, f}, are now covered by the termsets {e, b, a} and {b}. Next, we prune the term a with support 3, i.e., the three nodes drawn with thick lines in Figure 2(b), which yields the tree of Figure 2(c): documents d1, d5, and d2, previously covered by the termsets {e, b, a}, {e, c, d, a}, and {c, d, a}, are now covered by {e, b}, {e, c, d}, and {c, d}. When we start to prune the terms b and d with the next higher support, 4, i.e., the four nodes drawn with thick lines in Figure 2(c), we find that d4 would no longer be covered by any node, since its parent is Null. We therefore stop the pruning procedure here, and the maximal sup thr that covers every document is 4.

Note that we can compute this maximal sup thr without actually pruning the tree, by searching the tree from the bottom for the first node whose parent is Null. However, showing how the document IDs flow upward in the tree as sup thr increases helps users understand the relations among sup thr, the covered documents, and the length of the termsets that cover them, and thus helps users determine a proper sup thr. For instance, suppose a document set contains a few outlier documents that do not share any terms with the other documents in the set; then the maximal sup thr becomes very low in order for the mined termsets to cover those outlier documents. In such cases, the document coverage information of Table 2 becomes very useful for determining a proper sup thr. The column Coverage denotes the portion of documents covered at the corresponding sup thr, and the column Not Covered Doc. IDs denotes the document IDs that are not covered at that sup thr. This coverage table can be generated efficiently from the FT-tree before mining frequent termsets.
sup thr   Coverage   Not Covered Doc. IDs
4         1.0        -
5         0.9        d4
6         0.6        d2, d6, d7
7         0.0        d1, d3, d5, d8, d9, d10

Table 2: Document coverage table

To get better intuition about the document coverage, we can also draw a coverage graph from the FT-tree before running a mining algorithm; our experiments in Section 4 show such coverage graphs (Figure 6). Tree pruning can be done after the sup thr is determined.
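A minimal sketch of this probing step (the function name coverage_table is ours): it derives the coverage table directly from term supports, which mirrors the bottom-up pruning described above, since a document stays covered exactly as long as at least one of its terms remains unpruned:

```python
from collections import defaultdict

def coverage_table(docs):
    """Probe candidate support thresholds without running a mining algorithm.
    docs: dict doc_id -> set of terms."""
    support = defaultdict(int)
    for terms in docs.values():
        for t in terms:
            support[t] += 1
    # the largest threshold at which each document is still covered
    max_thr = {d: max(support[t] for t in terms) for d, terms in docs.items()}
    table = []
    for s in sorted(set(support.values())):
        uncovered = sorted(d for d, m in max_thr.items() if m < s)
        table.append((s, 1 - len(uncovered) / len(docs), uncovered))
    return table

# On the running example this gives coverage 1.0 at sup_thr = 4, 0.9 at 5 and
# 0.6 at 6, matching the Coverage column of Table 2; the maximal sup_thr with
# full coverage is therefore 4.
```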

3.1.3 Mining Closed Termsets from FT-tree


As noted in [10], mining frequent closed termsets can yield orders of magnitude fewer result termsets than mining all frequent termsets while retaining completeness, i.e., from the concise result set it is straightforward to generate all the frequent termsets with accurate support counts. Likewise, only closed termsets are meaningful for constructing a topic directory, since non-closed termsets are always covered by closed ones. Mining frequent closed itemsets has been extensively researched, and the FP-tree is known as the most efficient data structure for closed itemset mining [10].¹ Since the FT-tree subsumes the FP-tree, we can simply apply the most recent closed itemset mining algorithm, CLOSET+, to an FT-tree; CLOSET+ has shown superior performance over other recent mining algorithms in terms of runtime, memory usage, and scalability [10]. Running CLOSET+ on the FT-tree of Figure 2(c) with sup thr = 4 generates the following closed termsets: <e>, <c>, <b>, <cd>.
¹ Reference [10] presents the pros and cons of different data structures and strategies for closed itemset mining.
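CLOSET+ itself is beyond the scope of a short example, but the closedness condition of Definition 1 can be illustrated with a brute-force miner; this is only adequate for tiny examples such as Table 1 and is not the algorithm used by our method:

```python
from itertools import combinations

def closed_termsets(docs, sup_thr):
    """Brute-force frequent closed termset mining (illustration only; CLOSET+ [10]
    achieves the same result without enumerating all candidate termsets)."""
    def sup(ts):
        return sum(1 for terms in docs.values() if ts <= terms)

    vocab = sorted(set().union(*docs.values()))
    frequent = {}
    for k in range(1, len(vocab) + 1):
        for cand in combinations(vocab, k):
            s = sup(frozenset(cand))
            if s >= sup_thr:
                frequent[frozenset(cand)] = s
    # A frequent termset is closed if no proper superset has the same support.
    return {ts: s for ts, s in frequent.items()
            if not any(ts < ts2 and s == s2 for ts2, s2 in frequent.items())}

# On the document set of Table 1 with sup_thr = 4 this yields {e}:6, {c}:5,
# {b}:4, and {c,d}:4, the closed termsets listed above.
```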

3.2 Constructing Initial Clusters


For each frequent closed termset, we construct an initial cluster containing all the documents that contain the termset. The initial clusters are not disjoint, because one document may contain several termsets; we will restrict the maximal number of duplications of each document in the clusters in Section 3.3. The termset of each cluster serves as the cluster label and identifies the cluster, and the cluster labels also specify, via set containment, the hierarchical structure of the topic directory. Using the FT-tree, we do not need to scan the documents to construct the initial clusters, whereas the previous methods [6] do, because the document IDs are already included in the FT-tree. To retrieve all the documents containing a closed termset, we find all the paths containing the termset; the document IDs in and below those paths are exactly the documents containing the termset. For example, the initial cluster of the termset <cd> is {d2, d5, d6, d7}, the documents stored in the last nodes of that termset's paths. As another example, the initial cluster of the termset <c> is {d2, d5, d6, d7, d8}, the documents stored in the nodes of term c and their descendant nodes. To find all the paths containing a termset, we follow the side-link of its last term from the header table and check whether the path from each linked node to the root contains the termset. For example, to locate the paths of the termset <cd>, we follow the side-link of term d from the header table and check whether the path from each d node to the root contains c; if a path contains the termset, we retrieve all the document IDs in and below that node. Figure 3 describes the method and rationale for retrieving the document IDs of each closed termset to construct the initial clusters, and Table 3 shows the initial clusters constructed from the FT-tree of our running example (Figure 2(c)).
Input: frequent closed termsets
Output: initial clusters (pairs of a termset and its document IDs)
Method:
  for each closed termset T
    for each node t in the side-link of the last term of T in the header table
      if the path from the root to t contains the termset T,
        assign to T's cluster the document IDs stored in t and in the nodes below t
Rationale: a document contains T exactly when its termset path in the FT-tree contains T; its ID is stored in the last node of that path, which lies in or below such a node t.

Figure 3: Constructing initial clusters from FT-tree
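A sketch of the retrieval step of Figure 3, reusing the FTNode and header structures of the earlier FT-tree sketch (again, the function names are ours):

```python
def collect_ids(node):
    """All document IDs stored in this node and in its descendants."""
    ids = list(node.doc_ids)
    for child in node.children.values():
        ids.extend(collect_ids(child))
    return ids

def initial_clusters(header, closed_termsets, rank):
    """Retrieve the documents of each closed termset directly from the FT-tree,
    following the side-links of its last (lowest-ranked) term, as in Figure 3.
    rank = {t: i for i, t in enumerate(f_list)} from the earlier sketch."""
    clusters = {}
    for T in closed_termsets:
        last = max(T, key=lambda t: rank[t])      # last term of T in f_list order
        docs = []
        for node in header[last]:                 # side-link of the last term
            path, n = set(), node                 # items on the path up to the root
            while n.item is not None:
                path.add(n.item)
                n = n.parent
            if set(T) <= path:
                docs.extend(collect_ids(node))    # IDs in and below this node
        clusters[frozenset(T)] = sorted(docs)
    return clusters

# e.g. initial_clusters(header, [{"c","d"}, {"c"}, {"b"}, {"e"}], rank) recovers
# Table 3, e.g. <cd> -> {d2, d5, d6, d7}, without rescanning the documents.
```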

Cluster Label   Doc. IDs
<e>             {d1, d3, d5, d8, d9, d10}
<c>             {d2, d5, d6, d7, d8}
<b>             {d1, d4, d9, d10}
<cd>            {d2, d5, d6, d7}

Table 3: Initial clusters

3.3 Topic Directory Construction

After the initial clusters are constructed, Step (4) builds a topic directory from the initial clusters and the TFIDF vectors. Before building the topic directory, we prune the clusters by (1) removing inner termsets (Section 3.3.1) and (2) constraining the maximal number of document duplications (Section 3.3.2). After that, the topic directory is constructed (Section 3.3.3) and the first-level nodes are finally merged (Section 3.3.4).

3.3.1 Removing Inner Termsets

Doc. ID   Cluster Labels
d1        <e>, <b>
d2        (<c>), <cd>
d3        <e>
d4        <b>
d5        <e>, (<c>), <cd>
d6        (<c>), <cd>
d7        (<c>), <cd>
d8        <e>, <c>
d9        <e>, <b>
d10       <e>, <b>

Table 4: Clusters for each document; termsets within parentheses are inner termsets

If multiple nodes on the same path in the directory contain the same document, we keep only the occurrence in the lowest node and remove the others, to minimize document redundancy. This is done by removing inner termsets among the frequent closed termsets, i.e., termsets one of whose supersets exists in the same document. For example, in Table 4 the termset <c> in document d2 is an inner termset, because its superset <cd> also exists in d2.

Lemma 1. Removing inner termsets will not cause an empty node in the directory and will not affect the clustering quality.

Rationale. Only closed termsets constitute the nodes of the directory. Thus, for any termset there must be at least one document that does not contain any of its supersets; otherwise, the termset would not be closed by Definition 1. For example, in Table 4, although the termset <c> is removed from d2, d5, d6, and d7, it still exists in d8, and thus the node <c> will not be empty in the directory; otherwise, <c> would not be a closed termset. Removing inner termsets also does not affect the clustering quality since, in the common hierarchical clustering evaluation method, the documents in a cluster (node) include those in its descendant nodes.

3.3.2 Constraining Document Duplication

We allow the user to set max dup, the maximal number of duplications of each document in the directory. For example, if the directory is a tree and max dup = 1, our method generates a hard cluster tree, as many other methods do [6, 3, 9]. By allowing the directory to be a graph and max dup > 1, our method naturally supports soft-clustering, which is necessary for many applications (e.g., constructing the Yahoo directory) because a document can belong to multiple clusters. To illustrate, assume that max dup = 1 in our running example. After we remove the inner termsets in Table 4, documents d1, d5, d8, d9, and d10 still belong to two nodes each, so one more node must be trimmed for each of them. We refer to the original TFIDF vectors to exclude inferior nodes for each document by applying a heuristic score function such as

  score(d, T) = Σ_{t ∈ T} d_t,

where d_t denotes the weight of term t in the vector of document d. For instance, according to the TFIDF vectors of Table 5, score(d5, <cd>) = 1.0 + 2.0 = 3.0.

Doc. ID   a     b     c     d     e     f
d1        1.0   1.0   0     0     2.0   1.0
d2        1.0   0     2.0   1.0   0     0
d3        0     0     0     0     2.0   0
d4        0     2.0   0     0     0     1.0
d5        1.0   0     1.0   2.0   1.0   0
d6        0     0     1.0   2.0   0     0
d7        0     0     2.0   1.0   0     0
d8        0     0     1.0   0     2.0   0
d9        0     1.0   0     0     2.0   0
d10       0     2.0   0     0     1.0   0

Table 5: Document set with TFIDF
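A sketch of this trimming step under the stated score function (the helper name trim_to_max_dup is ours; ties and alternative score functions are not addressed):

```python
def trim_to_max_dup(doc_clusters, tfidf, max_dup):
    """doc_clusters: dict doc_id -> list of candidate cluster labels (termsets),
    after inner termsets have been removed; tfidf: dict doc_id -> {term: weight}.
    Keep, for each document, only its max_dup highest-scoring clusters,
    using score(d, T) = sum of the weights of T's terms in d."""
    def score(d, T):
        return sum(tfidf[d].get(t, 0.0) for t in T)

    trimmed = {}
    for d, labels in doc_clusters.items():
        ranked = sorted(labels, key=lambda T: score(d, T), reverse=True)
        trimmed[d] = ranked[:max_dup]
    return trimmed

# e.g. with the weights of Table 5, score("d5", {"c", "d"}) = 1.0 + 2.0 = 3.0,
# so with max_dup = 1 document d5 is kept only under <cd>.
```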

Table 6 shows the document-cluster list after applying max dup = 1 to each document, using the above scoring function with the TFIDF vectors of Table 5.

Doc. ID   Cluster Labels
d1        <b>
d2        <cd>
d3        <e>
d4        <b>
d5        <cd>
d6        <cd>
d7        <cd>
d8        <e>
d9        <e>
d10       <b>

Table 6: Clusters for each document when max dup = 1

3.3.3 Constructing Topic Directory

Constructing a topic directory from the document-cluster list, e.g., Table 4 with max dup = 2 or Table 6 with max dup = 1, can be done in a top-down way. We start building the directory from the root: we link the nodes of length one at the first level, and link the nodes of greater length to their inner nodes as children. Figure 4 describes the method of constructing a topic directory. The topic directory built from Table 4, i.e., with max dup = 2, is shown in Figure 5.

Input: nodes (termsets), document-cluster list
Output: topic directory
Main:
  for m = 1 to the maximal length of the nodes
    for each node of length m
      link(node, m)
  connect document IDs to the corresponding nodes using the document-cluster list
link(node, m):
  if m = 0, then link the node to the root
  else if there exist inner nodes of length m - 1, then link the node to them as a child
  else link(node, m - 1)

Figure 4: Constructing topic directory

[Figure 5: Topic directory of the running example with max dup = 2. The root has first-level children <e> {d1, d3, d5, d8, d9, d10}, <b> {d1, d4, d9, d10}, and <c> {d8}; the node <cd> {d2, d5, d6, d7} is linked under <c>.]
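A compact sketch of the linking procedure of Figure 4, representing the directory as a child map; the names and data layout are our illustration:

```python
from collections import defaultdict

def build_directory(termsets, doc_clusters):
    """termsets: the closed-termset nodes; doc_clusters: doc_id -> kept labels.
    Links each node to the existing nodes that are its longest proper subsets
    (its inner nodes), falling back toward the root as in Figure 4."""
    nodes = [frozenset(t) for t in termsets]
    children = defaultdict(list)          # parent (frozenset or "ROOT") -> child nodes
    members = defaultdict(list)           # node -> documents attached to it

    for node in sorted(nodes, key=len):
        m = len(node) - 1
        while m > 0:
            parents = [p for p in nodes if len(p) == m and p < node]
            if parents:
                for p in parents:
                    children[p].append(node)   # a graph: possibly several parents
                break
            m -= 1
        else:
            children["ROOT"].append(node)      # no inner node found: first level

    for d, labels in doc_clusters.items():
        for T in labels:
            members[frozenset(T)].append(d)
    return children, members

# On the running example (nodes <e>, <b>, <c>, <cd>) this places <e>, <b>, <c>
# at the first level and <cd> under <c>, as in Figure 5.
```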

3.3.4 Merging the First Level Nodes

Common mining algorithms usually generate a large number of frequent termsets of length one, so a clustering method based on frequent termset mining tends to generate many first-level nodes; merging the first-level nodes therefore helps provide users with a more comprehensible interface. We merge nodes of high similarity by creating a higher-level node between the root and the similar nodes, until the total number of first-level nodes is less than or equal to a user-specified number. We use the following heuristic similarity function:

  sim(n1, n2) = (# of documents common to n1 and n2) / (# of documents in n1 and n2).   (1)
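The merging step can be sketched as follows, interpreting the denominator of Eq. (1) as the number of distinct documents in the union of the two nodes; the greedy pairwise loop is our illustration, as the paper specifies only the similarity function and the stopping condition:

```python
def sim(docs1, docs2):
    """Eq. (1): fraction of shared documents between two first-level nodes."""
    d1, d2 = set(docs1), set(docs2)
    return len(d1 & d2) / len(d1 | d2)

def merge_first_level(first_level, max_nodes):
    """first_level: dict label -> set of documents (including descendants).
    Greedily merge the most similar pair under a new higher-level node until
    at most max_nodes first-level nodes remain."""
    level = dict(first_level)
    while len(level) > max_nodes:
        a, b = max(((x, y) for x in level for y in level if x < y),
                   key=lambda p: sim(level[p[0]], level[p[1]]))
        merged_label = a + "+" + b            # label of the new higher-level node
        level[merged_label] = level.pop(a) | level.pop(b)
    return level
```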

4. EXPERIMENTAL EVALUATION

This section presents an experimental evaluation of our method (TDC) by comparing it with the most recent document clustering methods: agglomerative UPGMA [5], bisecting k-means [9, 4], and the methods based on frequent itemset mining, FIHC [6] and HFTC [3]. We use the CLUTO-2.0 clustering toolkit [8] to generate the results of UPGMA and bisecting k-means, and the authors' implementation for FIHC [6]. We could not obtain an implementation of HFTC, but as shown in [6], FIHC always performs better than HFTC in accuracy, efficiency, and scalability.

4.1 Data Sets


Five data sets widely used in document clustering research [6, 3, 9] were used for our evaluation: Classic4, Hitech, Re0, Wap, and Reuters. They are heterogeneous in terms of document size, cluster size, number of classes, and document distribution. For Reuters, we remove the articles that are not assigned to any category, but we do not exclude articles assigned to multiple categories. All of these data sets except Reuters can be obtained from [8].

4.2 Evaluation Method


The F-measure [6, 3, 9] is the most commonly used measure for evaluating both flat and hierarchical clustering structures. We regard each cluster as the result of a query and each natural class as the relevant set of documents for that query. The precision, recall, and F-measure for a natural class Ki and a cluster Cj are then formulated as follows:

  precision(Ki, Cj) = n_ij / |Cj|
  recall(Ki, Cj) = n_ij / |Ki|
  F(Ki, Cj) = 2 · precision(Ki, Cj) · recall(Ki, Cj) / (precision(Ki, Cj) + recall(Ki, Cj))

where n_ij is the number of members of natural class Ki in cluster Cj (i.e., the true positives). Intuitively, F(Ki, Cj) measures how well cluster Cj describes the natural class Ki, by the harmonic mean of precision and recall. When computing F(Ki, Cj) in a hierarchical structure, all the documents in the subtree of Cj are considered documents of Cj. The success of capturing a natural class Ki is measured by the best cluster Cj for Ki, i.e., the Cj maximizing F(Ki, Cj). We measure the quality of a clustering result by the overall F-measure F(C), the weighted sum of these maximum F-measures over all natural classes:

  F(C) = Σ_{Ki ∈ K} (|Ki| / |D|) · max_{Cj ∈ C} {F(Ki, Cj)}

where K denotes all natural classes, C denotes all clusters at all levels, |Ki| denotes the number of documents in natural class Ki, and |D| denotes the total number of documents in the data set. The range of F(C) is [0, 1], and a larger F(C) value indicates a higher accuracy of clustering.
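The overall F-measure can be computed directly from these definitions; a minimal sketch, assuming each cluster's document set has already been expanded to include its subtree as stated above:

```python
def overall_f_measure(classes, clusters, n_docs):
    """classes: dict class_id -> set of doc_ids (natural classes K_i);
    clusters: dict cluster_id -> set of doc_ids (each including its subtree);
    n_docs: |D|. Returns F(C) as defined above."""
    def f(ki, cj):
        nij = len(ki & cj)
        if nij == 0:
            return 0.0
        p, r = nij / len(cj), nij / len(ki)
        return 2 * p * r / (p + r)

    return sum(len(ki) / n_docs * max(f(ki, cj) for cj in clusters.values())
               for ki in classes.values())
```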

4.3 Results
Due to space limitations, we report the main results and leave the details to a technical report.

4.3.1 Performance comparison


Table 7 shows the overall performance of the four methods on the five data sets. TDC outperforms the other methods on Hitech, Re0, and Reuters, and shows performance similar to FIHC on the others. We fix max dup = 10 for TDC in our experiments, which actually improves its performance over the case of max dup = 1. When we fix max dup = 1 (i.e., hard-clustering), TDC performs close to FIHC overall; TDC performs better with the soft-clustering setting max dup > 1, which implies that soft-clustering is more desirable for document clustering. FIHC is more efficient than traditional document clustering methods such as bisecting k-means and UPGMA, given a proper setting of sup thr [6]. TDC is always faster than FIHC at the same sup thr, since TDC mines only closed termsets with CLOSET+, while FIHC mines all frequent termsets with Apriori. Additionally, TDC provides an automatic probing method to effectively determine sup thr: in our experiments we use the sup thr of coverage = 1.0, which is determined automatically by the probing algorithm, whereas the sup thr of FIHC had to be tuned for best performance. (We tuned it from 3% to 6%, as was also done in [6].)

Dataset    # of clus   TDC    FIHC   Bi k-means   UPGMA
Hitech     3           0.57   0.45   0.54         0.33
           15          0.52   0.42   0.44         0.33
           30          0.48   0.41   0.29         0.47
           60          0.44   0.41   0.21         0.40
           Ave.        0.50   0.42   0.37         0.38
Re0        3           0.57   0.53   0.34         0.36
           15          0.51   0.45   0.38         0.47
           30          0.47   0.43   0.38         0.42
           60          0.41   0.38   0.28         0.34
           Ave.        0.49   0.45   0.34         0.40
Wap        3           0.47   0.40   0.40         0.39
           15          0.45   0.56   0.57         0.49
           30          0.43   0.57   0.44         0.58
           60          0.41   0.55   0.37         0.59
           Ave.        0.44   0.52   0.45         0.51
Classic4   3           0.61   0.62   0.59         -
           15          0.53   0.52   0.46         -
           30          0.48   0.52   0.43         -
           60          0.41   0.51   0.27         -
           Ave.        0.50   0.54   0.44         -
Reuters    3           0.46   0.37   0.40         -
           15          0.45   0.40   0.34         -
           30          0.42   0.40   0.31         -
           60          0.40   0.39   0.26         -
           Ave.        0.43   0.39   0.33         -

Table 7: F-measure comparison. # of clus: # of clusters; -: not scalable to run.

Table 8 shows the sup thr of coverage = 1.0 determined for each data set. Figure 6 illustrates the coverage change over sup thr for the data sets Hitech and Reuters, which is also generated by the probing algorithm. As noted at the end of Section 3.1.2, for very large document sets that have a few outlier documents, this coverage information helps determine a proper sup thr such that it covers enough documents while the frequent closed termsets can still be mined efficiently.

Dataset    sup thr
Hitech     363 (2301)
Re0        138 (1504)
Wap        333 (1560)
Classic4   70 (7094)
Reuters    174 (10802)

Table 8: sup thr of coverage = 1.0. The total number of documents in each data set is within the parentheses.

[Figure 6: Document coverage in Hitech (left) and Reuters (right). X-axis: sup thr; Y-axis: document coverage.]

4.3.2 Illustration of running system


We have implemented our system in the D2K environment [1]. D2K (Data to Knowledge) is a data mining and machine learning system that integrates analytical data mining methods for prediction, discovery, and deviation detection with data and information visualization tools.² Figure 7 shows a screen shot of our running system. The topic directory is constructed from Reuters; the entire tree is shown in the navigator box in the upper-left corner, and we can see the list of documents and the actual documents by selecting a node on the screen. The left popup window shows the documents found by a keyword search for "year", and the right popup window shows the selected document; the circled node indicates the clusters containing the selected document.

[Figure 7: Screen shot of the TDC system: a topic directory constructed on the Reuters data set.]

5. CONCLUSIONS

Using frequent termsets for document clustering is promising because it substantially reduces the large dimensionality of the document vector space [6], and it naturally provides a topic label for each cluster. In this paper, we presented a method that efficiently generates a topic directory from a set of documents using a frequent closed termset mining algorithm, together with a nonparametric closed termset mining method that automatically determines a proper support threshold for the topic directory. Our method experimentally performs as well as the most recent document clustering methods and has additional benefits: automatic generation of topic labels and automatic determination of a cluster parameter.

6. REFERENCES

[1] http://alg.ncsa.uiuc.edu/do/tools/d2k.
[2] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proc. Int. Conf. Very Large Databases (VLDB'94), pages 487-499, 1994.
[3] F. Beil, M. Ester, and X. Xu. Frequent term-based text clustering. In Proc. ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining (KDD'02), pages 436-442, 2002.
[4] D. R. Cutting, D. R. Karger, J. O. Pedersen, and J. W. Tukey. Scatter/Gather: A cluster-based approach to browsing large document collections. In Proc. ACM SIGIR Int. Conf. Information Retrieval (SIGIR'92), pages 318-329, 1992.
[5] R. C. Dubes and A. K. Jain, editors. Algorithms for Clustering Data. Prentice Hall, 1998.
[6] B. C. M. Fung, K. Wang, and M. Ester. Hierarchical document clustering using frequent itemsets. In SIAM Int. Conf. Data Mining, 2003.
[7] J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. In Proc. ACM SIGMOD Int. Conf. Management of Data (SIGMOD'00), 2000.
[8] G. Karypis. CLUTO 2.0 clustering toolkit, 2002.
[9] M. Steinbach, G. Karypis, and V. Kumar. A comparison of document clustering techniques. In KDD Workshop on Text Mining, 2000.
[10] J. Wang, J. Han, and J. Pei. CLOSET+: Searching for the best strategies for mining frequent closed itemsets. In Proc. ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining (KDD'03), pages 236-245, 2003.
[11] O. Zamir and O. Etzioni. Web document clustering: A feasibility demonstration. In Proc. ACM SIGIR Int. Conf. Information Retrieval (SIGIR'98), pages 46-54, 1998.
² It offers a visual programming environment that allows users to connect programming modules together to build data mining applications, and supplies a core set of modules, application templates, and a standard API for software component development.
