Julia Hodges
Box 9637
Mississippi State, MS 39762
Table 1: Journal Abstracts Data Set

Area                      Size   Journal Name
Artificial Intelligence   160    Artificial Intelligence
Ecology                   160    Journal of Ecology
Economy                   160    Economic Journal
History                   160    Historical Abstracts
Linguistics               160    Journal of Linguistics
Material                  160    Journal of Electronic Materials
Nuclear                   160    IEEE Transactions on Nuclear Science
Proteomics                160    PubMed
Sociology                 160    Journal of Sociology
Statistics                160    Journal of Applied Statistics Regression Analysis

The abstracts were cut into sentences with MXTERMINATOR. Then the tokens were identified from each sentence with the Penn Treebank tokenizer. The lemmatizer in WordNet was used to convert each token into a lemma. All the stop words were filtered out. Finally, each document was converted into a list of lemmas. The MXPOST part-of-speech tagger⁸ was used to assign a part-of-speech tag to each lemma. The lemmas with their POS tags were used to construct the feature vector for each document.

⁷ The homepage of Brill's tagger is located at http://www.cs.jhu.edu/~brill/.
⁸ MXPOST can be downloaded at http://www.cis.upenn.edu/~adwait/statnlp.html.

The F-measure value of each class is calculated, and the classes' F-measures are combined to get the F-measure of the entire set. Given a cluster j and a class i, assume cluster j is the retrieval result of class i. Then the recall, precision, and F-measure of this retrieval are:

    Recall(i, j) = n_ij / n_i,    Precision(i, j) = n_ij / n_j

    F(i, j) = 2 × Recall(i, j) × Precision(i, j) / (Recall(i, j) + Precision(i, j))

Since there is no one-to-one mapping between each class and each cluster, any cluster can be considered a candidate retrieval result of a class. The best F-measure among all the clusters is selected as the F-measure for the query of a particular class:

    F(i) = MAX_{0 < j < J} F(i, j)

The F-measure of all of the clusters is the sum of the F-measures of each class, each weighted by its size:

    F = Σ_{i = 1..I} (n_i / n) F(i)

6.3 Experimental Results and Analysis

Comparison of clustering algorithms

Four basic clustering algorithms, K-means, buckshot, HAC, and bisecting K-means, were selected for comparison. In this experiment, the K-means method, the buckshot method, and the bisecting K-means method were executed 20 times to alleviate the effect of random factors. All the results are listed in Table 2. The F-measure and entropy values listed here are the averages of the 20 runs.
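Of the four algorithms compared, bisecting K-means is the simplest to state: repeatedly split the largest cluster in two with a 2-means pass. The sketch below is a hypothetical pure-Python illustration, not the implementation used in the paper; cosine similarity on normalized vectors is assumed, as is common for document clustering.

```python
import random
from math import sqrt

def norm(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    s = sqrt(sum(x * x for x in v))
    return [x / s for x in v] if s else v

def cosine(u, v):
    """Cosine similarity of two unit-length vectors is just a dot product."""
    return sum(a * b for a, b in zip(u, v))

def kmeans2(points, iters=10, rng=random):
    """One 2-means pass: split a group of unit vectors into two groups."""
    centers = rng.sample(points, 2)
    for _ in range(iters):
        groups = ([], [])
        for p in points:
            groups[0 if cosine(p, centers[0]) >= cosine(p, centers[1]) else 1].append(p)
        for g in (0, 1):
            if groups[g]:
                dim = len(points[0])
                mean = [sum(p[i] for p in groups[g]) / len(groups[g]) for i in range(dim)]
                centers[g] = norm(mean)
    return groups

def bisecting_kmeans(points, k, seed=0):
    """Bisecting K-means: split the largest cluster until k clusters exist."""
    rng = random.Random(seed)
    clusters = [[norm(p) for p in points]]
    while len(clusters) < k:
        clusters.sort(key=len)
        largest = clusters.pop()
        a, b = kmeans2(largest, rng=rng)
        if not a or not b:          # degenerate split; restore and stop
            clusters.append(largest)
            break
        clusters.extend([a, b])
    return clusters

# Toy usage: two x-heavy and two y-heavy documents separate into two clusters.
docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(sorted(len(c) for c in bisecting_kmeans(docs, 2)))  # → [2, 2]
```

HAC differs in the opposite direction: it starts from singleton clusters and merges the closest pair, so early merges are never revisited, which is the weakness discussed below.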
Table 2: Comparison of clustering algorithms
Data Set DS 1 DS 2 DS 3 DS 4 DS 5 DS 6 DS 7 DS 8
F-measure
K-means 44.55% 45.16% 40.56% 41.86% 39.57% 44.36% 43.34% 37.21%
Buckshot 48.36% 54.27% 47.94% 47.77% 48.05% 48.82% 49.67% 45.31%
Bisecting K-means 34.66% 38.40% 34.06% 35.16% 36.45% 34.98% 37.27% 36.66%
HAC 69.00% 74.20% 69.67% 64.45% 69.73% 77.55% 69.44% 64.36%
Entropy
K-means 145.55% 141.34% 155.46% 153.84% 155.46% 147.48% 146.95% 162.77%
Buckshot 134.34% 120.35% 139.00% 137.74% 136.67% 137.43% 131.97% 144.89%
Bisecting K-means 165.31% 156.38% 171.55% 167.16% 164.71% 169.19% 162.00% 164.19%
HAC 74.40% 51.29% 80.33% 79.01% 75.16% 59.85% 76.86% 96.30%
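The F-measure values in these tables follow the definitions given earlier: for each class, the best F(i, j) over all clusters is taken, then weighted by class size. A minimal sketch of that computation (the toy labels are hypothetical, not the paper's data):

```python
from collections import Counter

def clustering_f_measure(labels, clusters):
    """Overall F-measure of a clustering, per the definitions above:
    for each class i take the best F(i, j) over all clusters j,
    then weight by class size n_i / n."""
    n = len(labels)
    class_sizes = Counter(labels)              # n_i
    cluster_sizes = Counter(clusters)          # n_j
    overlap = Counter(zip(labels, clusters))   # n_ij

    total = 0.0
    for i, n_i in class_sizes.items():
        best = 0.0
        for j, n_j in cluster_sizes.items():
            n_ij = overlap[(i, j)]
            if n_ij == 0:
                continue
            recall = n_ij / n_i
            precision = n_ij / n_j
            f = 2 * recall * precision / (recall + precision)
            best = max(best, f)
        total += (n_i / n) * best
    return total

# Toy example: two classes, two clusters (hypothetical data).
labels   = ["ecology", "ecology", "history", "history"]
clusters = [0, 0, 0, 1]
print(round(clustering_f_measure(labels, clusters), 4))  # → 0.7333
```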
In all eight data sets, the HAC method outperformed all the other methods for both F-measure and entropy. The results of the K-means method, the buckshot method, and the bisecting K-means method are quite similar to each other. This result differs from that reported by Steinbach, Karypis, and Kumar [29]. In their experiments, agglomerative hierarchical clustering performed poorly and the bisecting K-means method achieved the best performance. Their analysis explains that in a document clustering problem, the nearest neighbors of a document belong to different clusters under most conditions; this is the "nature of documents" [29]. The global adjusting function of the K-means method can compensate for this weakness. But in the HAC method, once two clusters are merged together, they will never be re-assigned.

Then why does the HAC method work better here? We think the reason is the size of the data set. Generally, two documents sharing more of the same words are considered more similar to each other. When the size of the document set is larger, the problem caused by the "nature of documents" becomes more serious. In Steinbach, Karypis, and Kumar's experiments, eight data sets were used for evaluation [29]. The largest data set contains about 3000 documents and the smallest one contains about 1000 documents. But in our data sets, there are only 200 documents in each set. For a small data set, there are fewer shared words between documents, so it is more likely that two documents with shared words belong to the same cluster. This makes the HAC method the best one for small data sets. In order to verify our hypothesis, we combined all ten data sets into a large data set of 1600 documents. The four clustering algorithms were executed again on this large set. The result is listed in Table 3. For this large data set, the bisecting K-means method got the highest F-measure and the lowest entropy, while HAC got the poorest performance. This result is consistent with that of Steinbach, Karypis, and Kumar's experiments [29].

Table 3: Clustering results of large data set

                    F-measure   Entropy
K-means              72.30%      72.49%
Buckshot             70.90%      72.09%
Bisecting K-means    87.51%      45.90%
HAC                  67.13%      86.60%

Using compound words

In this experiment, we identified the compound words provided by WordNet in the documents and used them as features. In order to evaluate the effect of using compound words, we executed the clustering algorithm on three different feature sets for each data set. The first feature set contained just the single words. For example, "artificial intelligence" was two features, the adjective "artificial" and the noun "intelligence". In the second feature set, the compound words were used; "artificial intelligence" was then a single feature, the noun "artificial intelligence". The third feature set is a combination of the first and second ones: the word sequence "artificial intelligence" provides three features, the adjective "artificial", the noun "intelligence", and the noun (compound word) "artificial intelligence". The HAC clustering algorithm was used in this experiment because of its good performance in the first experiment. All the results are listed in Table 4.
Table 4: Experimental results of using compound words
Data Set DS 1 DS 2 DS 3 DS 4 DS 5 DS 6 DS 7 DS 8
F-measure
Original word 69.00% 74.20% 69.67% 64.45% 69.73% 77.55% 69.44% 64.36%
Compound word 55.75% 63.88% 77.38% 57.06% 76.62% 75.47% 76.44% 64.52%
Combined 77.37% 75.36% 67.66% 64.58% 59.68% 83.81% 77.52% 58.00%
Entropy
Original word 74.40% 51.29% 80.33% 79.01% 75.16% 59.85% 76.86% 96.30%
Compound word 100.77% 82.37% 57.40% 92.20% 62.38% 61.41% 64.04% 95.39%
Combined 57.71% 52.66% 83.24% 77.28% 91.57% 46.55% 59.25% 109.89%
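The three feature sets compared in Table 4 can be sketched as follows; the small compound list is a hypothetical stand-in for the compound words provided by WordNet, and the function names are illustrative only.

```python
# Sketch of the three feature sets compared above. COMPOUNDS is a tiny
# hypothetical stand-in for the compound-word list provided by WordNet.
COMPOUNDS = {("artificial", "intelligence"), ("nuclear", "science")}

def find_compounds(tokens):
    """Mark adjacent token pairs that form a known compound word."""
    return [a + " " + b for a, b in zip(tokens, tokens[1:]) if (a, b) in COMPOUNDS]

def feature_set(tokens, mode):
    if mode == "single":        # feature set 1: each token is its own feature
        return list(tokens)
    if mode == "compound":      # feature set 2: compounds replace their member tokens
        feats, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in COMPOUNDS:
                feats.append(tokens[i] + " " + tokens[i + 1])
                i += 2
            else:
                feats.append(tokens[i])
                i += 1
        return feats
    if mode == "combined":      # feature set 3: single words plus any compounds
        return list(tokens) + find_compounds(tokens)
    raise ValueError(mode)

tokens = ["artificial", "intelligence", "research"]
print(feature_set(tokens, "single"))    # → ['artificial', 'intelligence', 'research']
print(feature_set(tokens, "compound"))  # → ['artificial intelligence', 'research']
print(feature_set(tokens, "combined"))  # → ['artificial', 'intelligence', 'research', 'artificial intelligence']
```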
From these results, we found that using compound words alone did not improve the system performance significantly. Comparing the results of using single words and compound words, single words got better F-measure and entropy values in four data sets (data sets 1, 2, 4, and 6). But when we combined the single words and the compound words, the combined feature set got better performance. Within the eight data sets, the combined feature set got the best F-measure for five data sets and the best entropy for four data sets. The compound words feature set got the best F-measure and entropy for three data sets. The single words feature set had the best entropy only once (i.e., for data set 2). For both the F-measure and the entropy measure, the combined feature set got the best performance for most data sets.

The fair performance of using compound words in our experiments resulted from the nature of our data set. We expected that compound words would improve the accuracy of clustering because a compound word can provide more detailed information than a single word. What we found is that when the distance between two document clusters is short and single words cannot distinguish them, compound words are helpful. But when the distance is long, using compound words will result in dissimilarity between documents within the same cluster, which degrades the performance rather than improving it. The documents in our data set are collected from ten different disciplines, so the inter-cluster distance is relatively large. Using single words is more appropriate for distinguishing documents that belong to different clusters, and it makes the documents within the same cluster more similar.

When the combined feature set is used, the compound words act as a complement to the single words. If a compound word is found in two documents, it makes them more similar. If two documents share only part of a compound word, they can still have some similarity under the combined feature set; if only the compound words are used, they have no similarity. This is the reason why using compound words and single words together is better than using either of them alone.

An additional experiment was executed on the large data set, which was a combination of the eight small data sets. The results are listed in Table 5. These results are consistent with those on our small data sets. Using compound words degraded the F-measure by about 20 percentage points. When the combined feature set was used, a slight improvement was achieved.

Table 5: Experimental results of using compound words on large data set

                 F-measure   Entropy
Original word      66.89%     87.91%
Compound word      47.85%    137.03%
Combined           67.12%     87.48%

7. Conclusions and Future Works

In this paper, we investigate using compound words provided by WordNet as features for document clustering. A brief overview of different document clustering algorithms and some related work on using phrases for information retrieval is provided. After describing the document preprocessing with some related natural language processing tools, the results of our experiments are given. From these results, we can conclude:

For the document clustering algorithms, the HAC method outperforms all the other methods on our small data sets, but for large data sets the bisecting K-means method gets the best performance.

Using compound words alone does not improve the clustering performance for our data sets, but when the compound words are combined with single words, the combined feature set gets the best performance for most data sets.

Our future work will focus on the following four aspects:

Identifying more meaningful compound words. In our current experiments, the compound words provided by WordNet are used. The problem is that most compound words included in WordNet are very general. A lot of terms used in different disciplines are not included. Since our data are abstracts from technical journals, those technical terms are important. A set of technical terms from each area will be collected and incorporated into the current compound word set.

Performing semantic analysis and using the senses of words or compound words instead of the original word forms. The synonym problem and the polysemy problem are two major obstacles for text data mining. Most words used in technical papers are polysemous, and generally their correct senses are the second or third senses. Finding the correct sense of a word is important for understanding the content of a document and distinguishing one document from another. Two key problems of semantic analysis are using a good dictionary and adopting a good sense disambiguation algorithm. The Merriam-Webster Online Dictionary & Thesaurus⁹ and WordNet are two widely used on-line dictionaries [21]. Some related work on word sense disambiguation can be found in [10, 16].

Performing syntactic analysis to find the important words in a context. Currently, all the words or compound words in a sentence are considered to be independent and of the same importance. Actually, words with different part-of-speech (POS) and syntactic attributes should be assigned different weights according to their relatedness to the content of the documents. One assumption may be that nouns, subjects, and objects are more important than other parts of speech for determining the topic of a document. There are many syntactic analysis tools that can be used for mining more information from raw text. Michael Collins proposed a syntactic parser which uses statistical methods to identify the syntactic structure of a sentence and generate a syntactic tree¹⁰ [8].

Combining the previous three approaches to realize further improvement. Syntactic information is helpful for identifying the important words or key words in raw text. By combining this with semantic information in the form of the senses of these key words, we expect to see an improvement in the performance of the document clustering system.

⁹ Direct access to the Merriam-Webster dictionary is http://www.m-w.com.
¹⁰ The Collins parser can be downloaded at http://www.ai.mit.edu/people/mcollins/.

8. References

[1] J. Bakus, M. F. Hussin, and M. Kamel, "A SOM-Based Document Clustering Using Phrases," In Proceedings of the 9th International Conference on Neural Information Processing (ICONIP'02), vol. 5, 2002, pp. 2212-2216.
[2] F. Beil, M. Ester, and X. Xu, "Frequent Term-Based Text Clustering," Proceedings of the 8th International Conference on Knowledge Discovery and Data Mining (KDD 2002), Edmonton, Alberta, Canada, 2002.
[3] P. Bellot and M. El-Beze, A Clustering Method for Information Retrieval, Technical Report IR-0199, Laboratoire d'Informatique d'Avignon, France, 1999.
[4] P. Berkhin, "Survey of Clustering Data Mining Techniques," Accrue Software, http://citeseer.nj.nec.com/berkhin02survey.html (current 12 Feb. 2004).
[5] E. Brill, "A Simple Rule-Based Part of Speech Tagger," Proceedings of the Third Conference on Applied Natural Language Processing (ACL), Trento, Italy, 1992.
[6] M. F. Caropreso, S. Matwin, and F. Sebastiani, "A Learner-Independent Evaluation of the Usefulness of Statistical Phrases for Automated Text Categorization," In A. G. Chin, editor, Text Databases and Document Management: Theory and Practice, pp. 78-102, Hershey, USA, 2001, Idea Group Publishing.
[7] D. Cutting, D. Karger, J. Pedersen, and J. Tukey, "Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections," Proceedings of the 15th ACM SIGIR Conference, Copenhagen, Denmark, 1992, pp. 318-329.
[8] M. Collins, Head-Driven Statistical Models for Natural Language Parsing, Ph.D. Dissertation, University of Pennsylvania, 1999.
[9] C. Fellbaum, WordNet: An Electronic Lexical Database, MIT Press, 1998.
[10] K. Fragos, Y. Maistros, and C. Skourlas, "Word Sense Disambiguation using WordNet Relations," In Proceedings of the 1st Balkan Conference in Informatics, Thessaloniki, Greece, 2003.
[11] J. Fürnkranz, T. Mitchell, and E. Riloff, "A Case Study in Using Linguistic Phrases for Text Categorization on the WWW," In Proceedings of the 1st AAAI Workshop on Learning for Text Categorization, 1998, pp. 5-12.
[12] B. C. M. Fung, Hierarchical Document Clustering Using Frequent Itemsets, Master's Thesis, Dept. of Computer Science, Simon Fraser University, Canada, 2002.
[13] K. M. Hammouda, Web Mining: Identifying Document Structure for Web Document Clustering, Master's Thesis, Department of Systems Design Engineering, University of Waterloo, Waterloo, Ontario, Canada, 2002.
[14] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, 2001.
[15] I. Iliopoulos, A. J. Enright, and C. A. Ouzounis, "Textquest: Document Clustering of Medline Abstracts for Concept Discovery in Molecular Biology," Proceedings of the Sixth Annual Pacific Symposium on Biocomputing (PSB 2001), 2001.
[16] M. Lesk, "Automatic Sense Disambiguation: How to Tell a Pine Cone from an Ice Cream Cone," In Proceedings of the 1986 SIGDOC Conference, New York, 1986, pp. 24-26. Association for Computing Machinery.
[17] A. Likas, N. Vlassis, and J. J. Verbeek, "The Global K-Means Clustering Algorithm," Pattern Recognition, vol. 36, no. 2, 2003, pp. 451-461.
[18] K. Lin and R. Kondadadi, "A Word-Based Soft Clustering Algorithm for Documents," Proceedings of the 16th International Conference on Computers and Their Applications, Mar. 2001.
[19] J. B. Lovins, "Development of a Stemming Algorithm," Mechanical Translation and Computational Linguistics, vol. 11, 1968, pp. 22-31.
[20] Y. S. Maarek, R. Fagin, I. Z. Ben-Shaul, and D. Pelleg, Ephemeral Document Clustering for Web Applications, Technical Report RJ 10186, IBM Research, 2000.
[21] G. A. Miller, R. Beckwith, C. Fellbaum, D. Gross, and K. J. Miller, "Introduction to WordNet: An On-Line Lexical Database," International Journal of Lexicography, vol. 3, no. 4, 1990, pp. 235-312.
[22] M. Mitra, C. Buckley, A. Singhal, and C. Cardie, "An Analysis of Statistical and Syntactic Phrases," In Proceedings of RIAO-97, 5th International Conference "Recherche d'Information Assistee par Ordinateur", Montreal, CA, 1997, pp. 200-214.
[23] D. Mladenic and M. Grobelnik, "Word Sequences as Features in Text-learning," In Proceedings of the 17th Electrotechnical and Computer Science Conference (ERK-98), Ljubljana, Slovenia, 1998.
[24] M. F. Porter, "An Algorithm for Suffix Stripping," Program, vol. 14, no. 3, 1980, pp. 130-137.
[25] A. Ratnaparkhi, "A Maximum Entropy Part-Of-Speech Tagger," Proceedings of the Empirical Methods in Natural Language Processing Conference, University of Pennsylvania, May 1996, pp. 17-18.
[26] J. C. Reynar and A. Ratnaparkhi, "A Maximum Entropy Approach to Identifying Sentence Boundaries," Proceedings of the Fifth Conference on Applied Natural Language Processing, Washington, D.C., March 31-April 3, 1997.
[27] G. Salton, The SMART Retrieval System: Experiments in Automatic Document Retrieval, Englewood Cliffs, New Jersey: Prentice Hall Inc., 1971.
[28] G. Salton and C. Buckley, "Term-Weighting Approaches in Automatic Text Retrieval," Information Processing & Management, vol. 24, no. 5, 1988, pp. 513-523.
[29] M. Steinbach, G. Karypis, and V. Kumar, "A Comparison of Document Clustering Techniques," KDD Workshop on Text Mining, 2000.
[30] S. Weiss, H. White, and C. Apté, "Lightweight Document Clustering," Proceedings of PKDD-2000, Springer, 2000, pp. 665-672.
[31] K. Yang, "Literature review of dissertation," http://www.ils.unc.edu/yangk/dissertation/litrevcontent.htm (current 12 Feb. 2004).
[32] O. Zamir, Clustering Web Documents: A Phrase-Based Method for Grouping Search Engine Results, Ph.D. dissertation, Dept. of Computer Science & Engineering, Univ. of Washington, 1999.