Document Clustering using Compound Words

Yong Wang and Julia Hodges


Department of Computer Science & Engineering, Mississippi State University
Mississippi State, MS 39762-9637
ywang@cse.msstate.edu, hodges@cse.msstate.edu

Authors’ names and mailing addresses:

Yong Wang (corresponding author)


Box 9637
Mississippi State, MS 39762
e-mail: ywang@cse.msstate.edu
phone: (662) 325-3945
fax: (662) 325-8997

Julia Hodges
Box 9637
Mississippi State, MS 39762

Key Words: Document Clustering, Information Retrieval, Text Data Mining

Submitted to: ICAI'05 - The 2005 International Conference on Artificial Intelligence



Abstract

Document clustering is a text data mining and organization technique that automatically groups related documents into clusters. Traditionally, the single words occurring in the documents are identified to determine the similarities among documents. In this work, we investigate using compound words as features for document clustering. Our experimental results demonstrate that using compound words alone cannot improve the performance of the clustering system. Promising results are achieved when the compound words are combined with the original single words as features. An evaluation of several basic clustering algorithms was also performed in our work for algorithm selection. Although the bisecting K-means method has been proposed as a good document clustering algorithm by other investigators, our experimental results demonstrate that for small datasets, a traditional hierarchical clustering algorithm still achieves the best performance.

1. Introduction

Data clustering partitions a set of unlabeled objects into disjoint or overlapping groups called clusters. In a good clustering, all the objects within a cluster are very similar, while objects in different clusters are very different. When the data processed are a set of documents, the process is called document clustering. Document clustering is very important and useful in the information retrieval area. It can be applied to a document database so that similar documents are placed in the same cluster; during the retrieval process, documents belonging to the same cluster as the retrieved documents can also be returned to the user, which can improve the recall of an information retrieval system. Document clustering can also be applied to the retrieved documents to help the user locate the useful ones. Generally, the feedback of an information retrieval system is a list of documents ranked by their estimated relevance to the query. When the volume of the information database is small and the query formulated by the user is well defined, this ranked-list approach is efficient. But for a very large information source such as the World Wide Web, and for a poorly defined query (just one or two key words), it is difficult for the retrieval system to identify the items of interest to the user; sometimes most of the retrieved documents are of no interest at all. Applying document clustering to the retrieved documents could make it easier for users to browse their results and quickly locate what they want. A successful example of this application is VIVISIMO (http://vivisimo.com/), a Web search engine that organizes search results with document clustering. Another application of document clustering is the automated or semi-automated creation of document taxonomies; a good taxonomy for Web documents is Yahoo (www.yahoo.com).

Although document clustering and document classification (document categorization) are both text mining tasks that share a set of basic principles, they differ in several essential aspects. In document classification, a set of categories is predefined, and each document can be assigned only one of the labels in the fixed schema. In document clustering, documents are grouped according to their similarity. From this view, document clustering reveals the inherent organizational structure of the document corpus, while document classification imposes a predefined organization scheme on the corpus [31]. Document classification is a supervised learning problem: a set of labeled documents must be provided to train the classifier, and the quality of the labeled examples significantly affects the performance of the classifier. Document clustering is an unsupervised learning problem: it groups the documents without any training, with the goal of putting similar documents in the same cluster and dissimilar documents in different clusters. From this view, document classification is learning from examples and document clustering is learning from observation [14].

The vector space model is a widely used method for document representation in information retrieval. In this model, each document is represented by a feature vector. The unique terms occurring in the whole document collection are identified as the attributes (or features) of the feature vector. Different term weighting methods may be used in the vector space model, such as the binary method, the tf (term frequency) method [28], and the tf-idf (term frequency-inverse document frequency) method [27]. Traditionally, the single words occurring in the document set are used as the features. Because of the synonym and polysemy problems, such a “bag of words” generally cannot reflect the semantic content of a document. Other researchers have tried to address this problem by investigating more precise syntactic units, such as phrases or compound words, as the features. Both promising and discouraging results have been reported in the literature (see the related work in section 3).

In this paper, we investigate the use of compound words provided by WordNet as features for document clustering. A brief introduction to some basic clustering algorithms and some related work on using phrases for information retrieval are given in sections 2 and 3. Section 4 is a brief introduction to the electronic dictionary WordNet and the set of compound words included in it. Before presenting our experiments, the natural language processing tools used for data preprocessing in our system are described in section 5. We present our experimental data, evaluation methods, and experimental results in section 6. Section 7 lists our final conclusions and describes promising future work.

2. Document clustering algorithms

Hierarchical clustering generates a hierarchical tree of clusters, also called a dendrogram [4]. Hierarchical methods can be further classified into agglomerative methods and divisive methods. In an agglomerative method, each object initially forms its own cluster; the two most similar clusters are then merged iteratively until some termination criterion is satisfied. This is a bottom-up approach. In a divisive method, starting from a single cluster that contains all the objects, one cluster is selected and split into smaller clusters recursively until some termination criterion is satisfied. A divisive method is a top-down method. Steinbach, Karypis, and Kumar compared the performance of three agglomerative clustering algorithms: IST (Intra-Cluster Similarity Technique), CST (Centroid Similarity Technique), and UPGMA [29]. Their experimental results show that UPGMA is the best among them. Maarek, Fagin, Ben-Shaul, and Pelleg proposed the HAC algorithm for on-line ephemeral web document clustering [20].

Partitioning clustering methods allocate data into a fixed number of non-empty clusters, with all the clusters at the same level. The most well-known partitioning methods are the K-means method and its variants. The basic K-means method initially allocates a set of objects randomly into a number of clusters. In every iteration, the mean of each cluster is calculated and each object is re-assigned to the nearest mean. This loop stops when none of the clusters change. The use of the K-means method for document clustering can be found in [3, 15]. Some variants of the K-means method include the K-medoids method [14] and the global k-means method [17].

The buckshot method is a combination of the K-means method and the HAC method. In buckshot, a random sample of √(kn) objects is drawn from the whole collection. The HAC method is applied to the sample set, and the centers of the k clusters found on the sample are used as the initial seeds for the whole collection. The K-means iterations are then performed to partition the whole collection. The buckshot method is successfully used in a well-known document clustering system, the Scatter/Gather (SG) system [7].

The K-means method can also be used to generate hierarchical clusters. Steinbach, Karypis, and Kumar proposed the bisecting K-means algorithm, which generates hierarchical clusters by applying the basic K-means method recursively [29]. The bisecting K-means algorithm is a divisive hierarchical clustering algorithm. Initially, the whole document set is considered one cluster. The algorithm then recursively selects the largest cluster and uses the basic K-means algorithm to divide it into two sub-clusters, until the desired number of clusters is reached.

Besides these basic clustering algorithms, some algorithms designed particularly for document clustering have been proposed. Zamir has described the use of a suffix tree for document clustering [32]. Beil, Ester, and Xu proposed two clustering methods based on frequent term sets, FTC (Frequent Term-based Clustering) and HFTC (Hierarchical Frequent Term-based Clustering) [2]. Fung proposed another hierarchical document clustering method based on frequent term sets, FIHC (Frequent Itemset-based Hierarchical Clustering), to improve upon the HFTC method [12]. Hammouda proposed an incremental clustering algorithm that represents each cluster with a similarity histogram [13]. Weiss, White, and Apte described a lightweight document clustering method using nearest neighbors [30]. Lin and Ravikumar described a soft document clustering system called WBSC (Word-based Soft Clustering) [18].

3. Related work

Zamir proposed using a suffix tree to find the maximal word sequences (phrases) shared between two documents [32]. Two documents sharing more common phrases are more similar to each other. Hammouda proposed a graph structure, the Document Index Graph (DIG), to represent documents [13]. The shared word sequences (phrases), which form a path within this graph, are identified to measure the distances among documents.

Bakus, Hussin, and Kamel used a hierarchical phrase grammar extraction procedure to identify phrases in documents and used these phrases as features for document clustering [1]. The self-organizing map (SOM) method was used as the clustering algorithm. An improvement was demonstrated when using phrases rather than single words as features.
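The shared-phrase intuition behind the suffix-tree and Document Index Graph approaches can be illustrated with a small sketch. This is a simplified n-gram overlap measure of our own for illustration only, not Zamir's or Hammouda's actual algorithm, and the length-proportional weighting is an assumption:

```python
def shared_phrase_similarity(doc_a, doc_b, max_len=4):
    """Score two tokenized documents by their shared word sequences.

    Longer shared sequences (phrases) count more, mirroring the idea
    that a common phrase is stronger evidence of similarity than the
    same words occurring separately.
    """
    def ngrams(tokens, n):
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    score = 0.0
    for n in range(2, max_len + 1):
        shared = ngrams(doc_a, n) & ngrams(doc_b, n)
        score += n * len(shared)  # weight a shared n-gram by its length
    return score
```

Under this toy measure, two documents sharing the phrase "document clustering" score higher than two documents that merely share the words "document" and "clustering" in unrelated positions.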
Mladenic and Grobelnik used a naive Bayesian method to classify documents based on word sequences of different lengths [23]. Their experimental results show that using word sequences no more than 3 words long can improve the performance of a text classification system, but when the average length of the word sequences used exceeds 3 words, there is no difference between using word sequences and using single words.

Furnkranz, Mitchell, and Riloff investigated the use of phrases to classify text on the WWW [11]. An information extraction system, AUTOSLOG-TS, was used to extract linguistic phrases from web documents. A naive Bayes classifier, “RAINBOW,” and a rule learning algorithm, “RIPPER,” were used to evaluate the use of phrasal features with the measures precision and recall. The results show that phrasal features can improve precision at the low-recall end but not at the high-recall end. Mitra et al. investigated the impact of both syntactic and statistical phrases in IR [22]. They report the opposite result, i.e., that little benefit is achieved at low recall levels when using phrase indexing, but at high recall levels, larger benefits can result from phrasal indexing. Caropreso, Matwin, and Sebastiani reported their evaluation results from using statistical phrases for automated text categorization tasks, independent of the classifier used [6].

4. WordNet

WordNet (http://www.cogsci.princeton.edu/~wn/) is a widely used lexical database for English that provides the sense information of words [9]. Unlike traditional dictionaries, WordNet is organized as a semantic network. The whole corpus consists of four lexical databases, for nouns, verbs, adjectives, and adverbs. The basic unit in each database is a set of synonyms called a synset. A synset represents a meaning, and all words that have that meaning are included in the synset. If a word occurs in several synsets, it is polysemous. Each synset is assigned a definition, or gloss, to explain its meaning. Various relationships are defined between synsets. For example, the hypernymy relationship indicates that one synset is a kind of another synset (the IS-A relationship); the hyponymy relationship is the inverse of hypernymy. In the semantic network, each synset is represented as a node and the relationships among the synsets are represented as arcs.

Besides ordinary single words, WordNet also provides a set of widely used compound words. There are 63,218 compound words collected in WordNet, partitioned into the four lexical databases and organized into a semantic network: 58,856 noun compound words, 2,794 verb compound words, 682 adjective compound words, and 886 adverb compound words. In this paper, we investigate using these compound words instead of the original single words as features to cluster the documents.

5. Document Preprocessing and Natural Language Processing

The data preprocessing tasks for general text data mining problems include tokenization, morphological analysis, part-of-speech tagging, phrase identification, syntactic analysis, and semantic analysis. NLP techniques provide good support for this step.

Tokenization is the very first step in most NLP tasks. A tokenizer separates a text into a set of component elements called tokens. The simplest tokenization method is to split the text at blanks and punctuation marks. A simple sed-script implementation of a tokenizer is provided by the Penn Treebank project group (http://www.cis.upenn.edu/~treebank/). Qtoken (http://web.bham.ac.uk/O.Mason/software/tokeniser/) is a portable tokenizer implemented in Java. MXTERMINATOR (http://www.cis.upenn.edu/~adwait/statnlp.html) is a Java tool implemented by Adwait Ratnaparkhi [26]; it is used to identify sentence boundaries and separate the sentences in a text.

Morphological analysis converts the morphological variations of a word, such as inflections and derivations, to its base form. One traditional method uses a stemmer. Stemmers try to identify the stem of a raw word in a text in order to reduce all similar words to a common form, making the statistical data more useful. The stemming process removes the commoner morphological and inflexional endings from English words. For example, the words analysis, analyzer, and analyzing all have the stem form analy. The two most widely used stemmers are the Porter stemmer [24] (http://www.tartarus.org/~martin/PorterStemmer/index.html) and the Lovins stemmer [19] (http://www.cs.waikato.ac.nz/~eibe/stemmers/index.html).

Another morphological analysis technique is lemmatization. A lemmatizer is a linguistic suffix stripper that is more accurate than a stemmer. Instead of identifying the stem of a word, a lemmatizer converts a word to its normalized form, called a lemma. The implementation of a lemmatizer requires part-of-speech tagging, an extensive lexicon, and case normalization. For example, given the words compute, computer, computing, computers, and computed, a stemmer will convert all of them into comput. For a lemmatizer, compute, computing, and computed have the same lemma, compute, whereas computer and computers have the same lemma, computer. A good lemmatizer is provided within WordNet [21].

Part-of-speech (POS) tagging, also called grammatical tagging, is a fundamental procedure in many NLP tasks. In this process, each word is assigned a POS tag that reflects its syntactic category. POS tagging is often seen as the first stage of more comprehensive syntactic annotation such as phrase boundary identification. Probably the best-known POS tagger is Brill's TBL tagger (http://www.cs.jhu.edu/~brill/), a simple rule-based tagger using transformation-based learning [5]. Another is MXPOST (Maximum Entropy POS Tagger, http://www.cis.upenn.edu/~adwait/statnlp.html), which was developed by Adwait Ratnaparkhi [25].

6. Experiments

6.1 Experimental Data

We collected 1,600 abstracts from journals belonging to ten different areas; for each area, 160 abstracts were collected. Table 1 lists the areas and the names of the journals. This data set was divided evenly into eight subsets, each containing 200 abstracts. These datasets were named DS 1 - 8.

Table 1: Journal Abstracts Data Set

Area                     Size  Journal Name
Artificial Intelligence  160   Artificial Intelligence
Ecology                  160   Journal of Ecology
Economy                  160   Economic Journal
History                  160   Historical Abstracts
Linguistics              160   Journal of Linguistics
Material                 160   Journal of Electronic Materials
Nuclear                  160   IEEE Transactions on Nuclear Science
Proteomics               160   PubMed
Sociology                160   Journal of Sociology
Statistics               160   Journal of Applied Statistics, Regression Analysis

The abstracts were split into sentences with MXTERMINATOR. The tokens were then identified in each sentence with the Penn Treebank tokenizer. The lemmatizer in WordNet was used to convert each token into a lemma, and all the stop words were filtered out. Finally, each document was converted into a list of lemmas. The MXPOST part-of-speech tagger was used to assign a part-of-speech tag to each lemma. The lemmas with POS tags were used to construct the feature vector for each document.

6.2 Evaluation Methods

The evaluation measures of entropy and F-measure, which have been used by a number of researchers, including Steinbach, Karypis, and Kumar [29], were used in this study. Both entropy and F-measure are external quality measures.

Given a set of labeled documents belonging to I classes, assume the clustering algorithm partitions them into J clusters. Let n be the size of the document set, n_i the size of class i, n_j the size of cluster j, and n_ij the number of documents belonging to both class i and cluster j. Then, for a document in cluster j, the probability that it belongs to class i is

    P(i, j) = n_ij / n_j

The entropy of cluster j is

    E(j) = - Σ_{i=1}^{I} P(i, j) · log2 P(i, j)

The entropy of the whole clustering is the sum of the entropies of the clusters, each weighted by cluster size:

    E = Σ_{j=1}^{J} (n_j / n) · E(j)

The F-measure value of each class is calculated, and the classes' F-measures are combined to obtain the F-measure of the entire set. Given a cluster j and a class i, assume cluster j is the retrieval result for class i. Then the recall, precision, and F-measure of this retrieval are

    Recall(i, j) = n_ij / n_i,   Precision(i, j) = n_ij / n_j

    F(i, j) = (2 · Recall(i, j) · Precision(i, j)) / (Recall(i, j) + Precision(i, j))

Since there is no one-to-one mapping between classes and clusters, any cluster can be considered a candidate retrieval result for a class. The best F-measure over all clusters is selected as the F-measure for the query of a particular class:

    F(i) = max_{1 ≤ j ≤ J} F(i, j)

The F-measure of the whole clustering is the sum of the F-measures of the classes, each weighted by class size:

    F = Σ_{i=1}^{I} (n_i / n) · F(i)

6.3 Experimental Results and Analysis

Comparison of clustering algorithms

Four basic clustering algorithms, K-means, buckshot, HAC, and bisecting K-means, were selected for comparison. In this experiment, the K-means method, the buckshot method, and the bisecting K-means method were executed 20 times to alleviate the effect of randomness. All the results are listed in Table 2. The F-measure and entropy values listed are the averages over the 20 runs.
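The K-means, bisecting K-means, and HAC procedures compared in this section can be sketched minimally as follows. These are simplified illustrations of the textbook algorithms over numeric feature vectors, not the implementations used in our experiments; buckshot, which seeds K-means with the output of HAC on a random sample, is omitted:

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    # Basic K-means: pick k random documents as seeds, then alternate
    # (1) assigning each vector to its nearest mean and (2) recomputing
    # the means, stopping when no assignment changes.
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = None
    for _ in range(iters):
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new, labels):
            break
        labels = new
        for j in range(k):
            members = X[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return labels

def bisecting_kmeans(X, k):
    # Divisive: treat the corpus as one cluster, then repeatedly split
    # the largest cluster with 2-means until k clusters exist.
    clusters = [np.arange(len(X))]
    while len(clusters) < k:
        clusters.sort(key=len)
        largest = clusters.pop()
        sub = kmeans(X[largest], 2)
        clusters += [largest[sub == 0], largest[sub == 1]]
    labels = np.empty(len(X), dtype=int)
    for cid, members in enumerate(clusters):
        labels[members] = cid
    return labels

def hac(X, k):
    # Agglomerative (average-link, UPGMA-style): start from singleton
    # clusters and merge the closest pair until k clusters remain.
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    clusters = [[i] for i in range(len(X))]
    while len(clusters) > k:
        best = (0, 1, np.inf)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = D[np.ix_(clusters[a], clusters[b])].mean()
                if d < best[2]:
                    best = (a, b, d)
        a, b, _ = best
        clusters[a] += clusters[b]
        del clusters[b]
    labels = np.empty(len(X), dtype=int)
    for cid, members in enumerate(clusters):
        labels[members] = cid
    return labels
```

The key structural difference discussed below is visible here: K-means re-assigns every document on every iteration, while HAC never revisits a merge once it is made.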
Table 2: Comparison of clustering algorithms

Data Set            DS 1     DS 2     DS 3     DS 4     DS 5     DS 6     DS 7     DS 8
F-measure
K-means             44.55%   45.16%   40.56%   41.86%   39.57%   44.36%   43.34%   37.21%
Buckshot            48.36%   54.27%   47.94%   47.77%   48.05%   48.82%   49.67%   45.31%
Bisecting K-means   34.66%   38.40%   34.06%   35.16%   36.45%   34.98%   37.27%   36.66%
HAC                 69.00%   74.20%   69.67%   64.45%   69.73%   77.55%   69.44%   64.36%
Entropy
K-means             145.55%  141.34%  155.46%  153.84%  155.46%  147.48%  146.95%  162.77%
Buckshot            134.34%  120.35%  139.00%  137.74%  136.67%  137.43%  131.97%  144.89%
Bisecting K-means   165.31%  156.38%  171.55%  167.16%  164.71%  169.19%  162.00%  164.19%
HAC                 74.40%   51.29%   80.33%   79.01%   75.16%   59.85%   76.86%   96.30%
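The entropy and F-measure values reported in these tables follow the definitions in section 6.2. As an illustrative sketch (our own code, assuming the true class labels and cluster assignments are given as integer arrays):

```python
import numpy as np

def entropy_and_fmeasure(classes, clusters):
    # classes[d] is the true class of document d; clusters[d] its cluster.
    classes = np.asarray(classes)
    clusters = np.asarray(clusters)
    n = len(classes)

    # E = sum_j (n_j / n) * E(j), where E(j) = -sum_i P(i,j) log2 P(i,j)
    E = 0.0
    for j in np.unique(clusters):
        members = classes[clusters == j]
        nj = len(members)
        p = np.array([np.mean(members == i) for i in np.unique(classes)])
        p = p[p > 0]  # 0 * log(0) is taken as 0
        E += (nj / n) * -(p * np.log2(p)).sum()

    # F = sum_i (n_i / n) * max_j F(i, j), treating each cluster as a
    # candidate retrieval result for class i
    F = 0.0
    for i in np.unique(classes):
        ni = np.sum(classes == i)
        best = 0.0
        for j in np.unique(clusters):
            nij = np.sum((classes == i) & (clusters == j))
            if nij == 0:
                continue
            recall = nij / ni
            precision = nij / np.sum(clusters == j)
            best = max(best, 2 * recall * precision / (recall + precision))
        F += (ni / n) * best
    return E, F
```

A perfect clustering yields entropy 0 and F-measure 1; lower entropy and higher F-measure are better, which is the sense in which the tables are read.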

In all eight data sets, the HAC method outperformed all the other methods on both F-measure and entropy. The results of the K-means, buckshot, and bisecting K-means methods are quite similar to each other. This result differs from that reported by Steinbach, Karypis, and Kumar [29], in whose experiments agglomerative hierarchical clustering performed poorly and the bisecting K-means method achieved the best performance. The analysis provided by Steinbach, Karypis, and Kumar explains that in a document clustering problem, the nearest neighbors of a document belong to different clusters under most conditions. This is the “nature of documents” [29]. The global adjusting function of the K-means method can compensate for this weakness, but in the HAC method, once two clusters are merged, they are never re-assigned.

Why, then, does the HAC method work better here? We think the reason is the size of the data set. Generally, two documents sharing more of the same words are considered more similar to each other. When the document set is larger, the problem caused by the “nature of documents” becomes more serious. In Steinbach, Karypis, and Kumar's experiments, eight data sets were used for evaluation [29]; the largest contains about 3000 documents and the smallest about 1000. In our datasets, by contrast, there are only 200 documents in each set. In a small data set, there are fewer shared words between documents, so it is more likely that two documents with shared words belong to the same cluster. This makes the HAC method the best one for small data sets. To verify this hypothesis, we combined all eight datasets into a large data set of 1600 documents and executed the four clustering algorithms again on this large set. The results are listed in Table 3. For this large data set, the bisecting K-means method obtained the highest F-measure and the lowest entropy, while HAC performed worst. This result is consistent with that of Steinbach, Karypis, and Kumar's experiments [29].

Table 3: Clustering results of large data set

F-measure
K-means             72.30%
Buckshot            70.90%
Bisecting K-means   87.51%
HAC                 67.13%
Entropy
K-means             72.49%
Buckshot            72.09%
Bisecting K-means   45.90%
HAC                 86.60%

Using compound words

In this experiment, we identified the compound words provided by WordNet in the documents and used them as features. To evaluate the effect of using compound words, we executed the clustering algorithm with three different feature sets on each data set. The first feature set consisted of just the single words; for example, “artificial intelligence” contributed two features, the adjective “artificial” and the noun “intelligence”. In the second feature set, the compound words were used; “artificial intelligence” was then a single feature, the noun “artificial intelligence”. The third feature set was the combination of the first two; the word sequence “artificial intelligence” then provided three features: the adjective “artificial”, the noun “intelligence”, and the noun (and compound word) “artificial intelligence”. The HAC clustering algorithm was used in this experiment because of its good performance in the first experiment. All the results are listed in Table 4.
Table 4: Experimental results of using compound words

Data Set        DS 1     DS 2     DS 3     DS 4     DS 5     DS 6     DS 7     DS 8
F-measure
Original word   69.00%   74.20%   69.67%   64.45%   69.73%   77.55%   69.44%   64.36%
Compound word   55.75%   63.88%   77.38%   57.06%   76.62%   75.47%   76.44%   64.52%
Combined        77.37%   75.36%   67.66%   64.58%   59.68%   83.81%   77.52%   58.00%
Entropy
Original word   74.40%   51.29%   80.33%   79.01%   75.16%   59.85%   76.86%   96.30%
Compound word   100.77%  82.37%   57.40%   92.20%   62.38%   61.41%   64.04%   95.39%
Combined        57.71%   52.66%   83.24%   77.28%   91.57%   46.55%   59.25%   109.89%
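The three feature sets compared in Table 4 can be sketched as follows. This is a simplified greedy matcher of our own, not the extraction code used in our experiments: `compounds` stands in for the WordNet compound list (truncated here to compounds of at most three words), and keeping tokens that are not covered by any compound in the compound-only feature set is an assumption about the setup:

```python
def extract_features(tokens, compounds, mode="combined"):
    # Build a feature list from a token sequence.
    #   "single":   every token is a feature.
    #   "compound": greedily replace known multi-word compounds
    #               (e.g. "artificial intelligence") with one feature.
    #   "combined": compound matches plus their component tokens.
    # `compounds` is a set of space-joined word sequences standing in
    # for WordNet's compound word list.
    features = []
    i = 0
    while i < len(tokens):
        match = None
        for length in (3, 2):  # try the longest match first
            window = tokens[i:i + length]
            if len(window) == length and " ".join(window) in compounds:
                match = length
                break
        if match:
            if mode in ("compound", "combined"):
                features.append(" ".join(tokens[i:i + match]))
            if mode in ("single", "combined"):
                features.extend(tokens[i:i + match])
            i += match
        else:
            features.append(tokens[i])
            i += 1
    return features
```

For the running example, the combined mode yields the three features described above for “artificial intelligence”, alongside the surrounding single words.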

From these results, we found that using compound words alone did not significantly improve the system performance. Comparing the results of using single words and compound words, single words obtained better F-measure and entropy values in four data sets (data sets 1, 2, 4, and 6). But when we combined the single words and compound words, the combined feature set performed better. Over the eight data sets, the combined feature set obtained the best F-measure for five data sets and the best entropy for four; the compound-word feature set obtained the best F-measure and entropy for three; and the single-word feature set had the best entropy only once (for data set 2). For both the F-measure and the entropy measure, the combined feature set achieved the best performance on most data sets.

The fair performance of compound words in our experiments results from the nature of our dataset. We expected that compound words would improve the accuracy of clustering because a compound word can provide more detailed information than a single word. What we found is that when the distance between two document clusters is short and single words cannot distinguish them, compound words are helpful. But when the distance is long, using compound words introduces dissimilarity between documents within the same cluster, which degrades the performance rather than improving it. The documents in our dataset were collected from ten different disciplines, so the inter-cluster distance is relatively large. Using single words is more appropriate for distinguishing documents belonging to different clusters while keeping the documents within the same cluster similar.

When the combined feature set is used, the compound words act as a complement to the single words. If a compound word is found in two documents, it makes them more similar. If two documents share only part of a compound word, they can still have some similarity under the combined feature set, whereas if only the compound words are used, they have no similarity at all. This is why using compound words and single words together is better than using either alone.

An additional experiment was executed on the large data set, the combination of the eight small data sets. The results are listed in Table 5. These results are consistent with those on the small data sets: using compound words degraded the F-measure by about 20 percentage points, while the combined feature set achieved a slight improvement.

Table 5: Experimental results of using compound words on large data set

F-measure
Original word   66.89%
Compound word   47.85%
Combined        67.12%
Entropy
Original word   87.91%
Compound word   137.03%
Combined        87.48%

7. Conclusions and Future Works

In this paper, we investigate using compound words provided by WordNet as features for document clustering. A brief overview of different document clustering algorithms and some related work on using phrases for information retrieval are provided. After describing the document preprocessing with some related natural language processing tools, the results of our experiments are given. From these results, we can conclude:

Regarding clustering algorithms, the HAC method outperforms all the other methods on our small data sets, but for large data sets, the bisecting K-means method performs best.

Using compound words alone does not improve the clustering performance on our data sets, but when the compound words are combined with single words, the combined feature set achieves the best performance on most data sets.

Our future work will focus on the following four aspects:

Identifying more meaningful compound words. In our current experiments, the compound words provided by WordNet are used. The problem is that most compound words included in WordNet are very general. A lot of terms used in
different disciplines are not included. Since our data are abstracts from technical journals, those technical terms are important. A set of technical terms from each area will be collected and incorporated into the current compound word set.

Performing semantic analysis and using the senses of words or compound words instead of the original word forms. The synonym problem and the polysemy problem are two major obstacles for text data mining. Most words used in technical papers are polysemous, and often their correct senses are the second or third senses. Finding the correct sense of a word is important for understanding the content of a document and distinguishing one document from another. Two key problems of semantic analysis are using a good dictionary and adopting a good sense disambiguation algorithm. The Merriam-Webster Online Dictionary & Thesaurus (9) and WordNet are two widely used on-line dictionaries [21]. Some related work on word sense disambiguation can be found in [10, 16].

Performing syntactic analysis to find the important words in a context. Currently, all the words and compound words in a sentence are considered independent and equally important. Actually, words with different part-of-speech (POS) and syntactic attributes should be assigned different weights according to their relatedness to the content of the documents. One assumption may be that nouns, subjects, and objects are more important than other parts of speech for determining the topic of a document. There are many syntactic analysis tools that can be used for mining more information from raw text. Michael Collins proposed a syntactic parser that uses statistical methods to identify the syntactic structure of a sentence and generate a syntactic tree (10) [8].

Combining the previous three approaches to realize further improvement. Syntactic information is helpful for identifying the important words or key words in raw text. By combining this with semantic information in the form of the senses of these key words, we expect to see an improvement in the performance of the document clustering system.

(9) Direct access to Merriam-Webster dictionary is

8. References

[1] J. Bakus, M. F. Hussin, and M. Kamel, “A SOM-Based Document Clustering Using Phrases,” Proceedings of the 9th International Conference on Neural Information Processing (ICONIP'02), vol. 5, 2002, pp. 2212-2216.
[2] F. Beil, M. Ester, and X. Xu, “Frequent Term-Based Text Clustering,” Proceedings of the 8th International Conference on Knowledge Discovery and Data Mining (KDD 2002), Edmonton, Alberta, Canada, 2002.
[3] P. Bellot and M. El-Beze, A Clustering Method for Information Retrieval, Technical Report IR-0199, Laboratoire d'Informatique d'Avignon, France, 1999.
[4] P. Berkhin, “Survey of Clustering Data Mining Techniques,” Accrue Software, http://citeseer.nj.nec.com/berkhin02survey.html (current 12 Feb. 2004).
[5] E. Brill, “A Simple Rule-Based Part of Speech Tagger,” Proceedings of the Third Conference on Applied Natural Language Processing (ACL), Trento, Italy, 1992.
[6] M. F. Caropreso, S. Matwin, and F. Sebastiani, “A Learner-Independent Evaluation of the Usefulness of Statistical Phrases for Automated Text Categorization,” in A. G. Chin, editor, Text Databases and Document Management: Theory and Practice, pp. 78-102, Hershey, USA, 2001, Idea Group Publishing.
[7] D. Cutting, D. Karger, J. Pedersen, and J. Tukey, “Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections,” Proceedings of the 15th ACM SIGIR Conference, Copenhagen, Denmark, 1992, pp. 318-329.
[8] M. Collins, Head-Driven Statistical Models for Natural Language Parsing, Ph.D. Dissertation, University of Pennsylvania, 1999.
[9] C. Fellbaum, WordNet: An Electronic Lexical Database, MIT Press, 1998.
[10] K. Fragos, Y. Maistros, and C. Skourlas, “Word Sense Disambiguation using WordNet Relations,” Proceedings of the 1st Balkan Conference in Informatics, Thessaloniki, Greece, 2003.
[11] J. Furnkranz, T. Mitchell, and E. Riloff, “A Case Study in Using Linguistic Phrases for Text Categorization on the WWW,” Proceedings of the 1st AAAI Workshop on Learning for Text Categorization, 1998, pp. 5-12.
[12] B. C. M. Fung, Hierarchical Document Clustering Using Frequent Itemsets, Master's Thesis, Dept. of Computer Science, Simon Fraser University, Canada, 2002.
[13] K. M. Hammouda, Web Mining: Identifying Document Structure for Web Document Clustering, Master's Thesis, Department of Systems Design Engineering, University of Waterloo, Waterloo, Ontario, Canada, 2002.
[14] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, 2001.
[15] I. Iliopoulos, A. J. Enright, and C. A. Ouzounis, “Textquest: Document Clustering of Medline Abstracts for Concept Discovery in Molecular Biology,” Proceedings of the Sixth Annual
http://www.m-w.com. Pacific Symposium on Biocomputing (PSB 001),
10
Collins parser can be downloaded at 2001.
http://www.ai.mit.edu/people/mcollins/.
[16] M. Lesk, "Automatic Sense Disambiguation: How to Tell a Pine Cone from an Ice Cream Cone," Proceedings of the 1986 SIGDOC Conference, New York: Association for Computing Machinery, 1986, pp. 24-26.
[17] A. Likas, N. Vlassis, and J. J. Verbeek, "The Global K-Means Clustering Algorithm," Pattern Recognition, vol. 36, no. 2, 2003, pp. 451-461.
[18] K. Lin and R. Kondadadi, "A Word-Based Soft Clustering Algorithm for Documents," Proceedings of the 16th International Conference on Computers and Their Applications, Mar. 2001.
[19] J. B. Lovins, "Development of a Stemming Algorithm," Mechanical Translation and Computational Linguistics, vol. 11, 1968, pp. 22-31.
[20] Y. S. Maarek, R. Fagin, I. Z. Ben-Shaul, and D. Pelleg, Ephemeral Document Clustering for Web Applications, Technical Report RJ 10186, IBM Research, 2000.
[21] G. A. Miller, R. Beckwith, C. Fellbaum, D. Gross, and K. J. Miller, "Introduction to WordNet: An On-Line Lexical Database," International Journal of Lexicography, vol. 3, no. 4, 1990, pp. 235-312.
[22] M. Mitra, C. Buckley, A. Singhal, and C. Cardie, "An Analysis of Statistical and Syntactic Phrases," Proceedings of RIAO-97, 5th International Conference "Recherche d'Information Assistee par Ordinateur," Montreal, Canada, 1997, pp. 200-214.
[23] D. Mladenic and M. Grobelnik, "Word Sequences as Features in Text-Learning," Proceedings of the 17th Electrotechnical and Computer Science Conference (ERK-98), Ljubljana, Slovenia, 1998.
[24] M. F. Porter, "An Algorithm for Suffix Stripping," Program, vol. 14, no. 3, 1980, pp. 130-137.
[25] A. Ratnaparkhi, "A Maximum Entropy Part-of-Speech Tagger," Proceedings of the Conference on Empirical Methods in Natural Language Processing, University of Pennsylvania, May 17-18, 1996.
[26] J. C. Reynar and A. Ratnaparkhi, "A Maximum Entropy Approach to Identifying Sentence Boundaries," Proceedings of the Fifth Conference on Applied Natural Language Processing, Washington, D.C., March 31-April 3, 1997.
[27] G. Salton, The SMART Retrieval System: Experiments in Automatic Document Processing, Englewood Cliffs, NJ: Prentice Hall, 1971.
[28] G. Salton and C. Buckley, "Term-Weighting Approaches in Automatic Text Retrieval," Information Processing & Management, vol. 24, no. 5, 1988, pp. 513-523.
[29] M. Steinbach, G. Karypis, and V. Kumar, "A Comparison of Document Clustering Techniques," KDD Workshop on Text Mining, 2000.
[30] S. Weiss, H. White, and C. Apte, "Lightweight Document Clustering," Proceedings of PKDD-2000, Springer, 2000, pp. 665-672.
[31] K. Yang, "Literature Review of Dissertation," http://www.ils.unc.edu/yangk/dissertation/litrevcontent.htm (current 12 Feb. 2004).
[32] O. Zamir, Clustering Web Documents: A Phrase-Based Method for Grouping Search Engine Results, Ph.D. dissertation, Dept. of Computer Science & Engineering, University of Washington, 1999.