Expert Systems with Applications 36 (2009) 227–232
www.elsevier.com/locate/eswa
Abstract
Topics often transit among documents in a document collection. To improve the accuracy of topic detection and tracking (TDT) algorithms in discovering topics or classifying documents, it is necessary to make full use of this topic transition information. However, TDT algorithms usually find topics based on topic models such as LDA and pLSI, which are mixture models and make topic transitions difficult to represent and implement. A topic transition model based on the hidden Markov model is presented, and learning topic transitions from documents is discussed. Based on this model, two TDT algorithms incorporating topic transition, one for topic discovery and one for document classification, are provided to show the application of the proposed model. Experiments on two real-world document collections were conducted with the two algorithms; a performance comparison with similar algorithms shows that the accuracy reaches 93% for topic discovery on Reuters-21578 and 97.3% for document classification. Furthermore, the topic transitions discovered by the algorithm on a dataset collected from a BBS website are consistent with manual analysis results.
© 2007 Elsevier Ltd. All rights reserved.
Keywords: Topic transition; Topic detection and tracking; Hidden Markov model
1. Introduction
In recent years, the volume of text of all kinds on the Internet has been increasing dramatically. For example, in an interactive BBS, thousands of new articles are posted every day. A meaningful article discusses one or more topics that exist in the BBS. Finding and understanding these topics can be helpful to different users: a reporter needs to keep up with hot topics in online Internet news, while a scientific researcher may be interested in topics correlated with his research. Hence, it is important to find and keep track of topics across large numbers of documents. However, in contrast to articles, topics are invisible and must be inferred. Manually finding topics is impossible when there are large amounts of text, so topic detection and tracking has recently become a new research area.
where

\[
\xi_t(i,j) = \frac{\alpha_t(i)\, a_{ij}\, b_j(\mathrm{TF\_SEQ}_{t+1})\, \beta_{t+1}(j)}{P(\mathrm{TF\_SEQ} \mid \lambda)}, \quad 1 \le t \le n-1, \tag{5}
\]

\[
\gamma_t(i) = \sum_{j=1}^{N} \xi_t(i,j), \tag{6}
\]

where \(\alpha_t(i)\) and \(\beta_t(j)\) are the forward and backward variables (Rabiner, 1989), respectively.

To maximize the likelihood, a series of iterations is performed following the Baum-Welch procedure (Rabiner, 1989). That is, let

\[
\bar{a}_{ij} = \frac{\sum_{t=1}^{n-1} \xi_t(i,j)}{\sum_{t=1}^{n-1} \gamma_t(i)}, \tag{7}
\]

\[
\bar{\mu}_j = \frac{\sum_{t=1}^{n} \gamma_t(j)\, \mathrm{TF\_SEQ}_t}{\sum_{t=1}^{n} \gamma_t(j)}, \tag{8}
\]

\[
\bar{R}_j = \frac{\sum_{t=1}^{n} \gamma_t(j)\, (\mathrm{TF\_SEQ}_t - \mu_j)(\mathrm{TF\_SEQ}_t - \mu_j)^{\mathrm{T}}}{\sum_{t=1}^{n} \gamma_t(j)}, \tag{9}
\]

\[
\bar{\pi}_i = \gamma_1(i), \quad i = 1, 2, \ldots, N. \tag{10}
\]

The cluster centers and covariances are computed from the document term frequency vectors:

\[
c_i = \frac{1}{m_i} \sum_{j=1}^{m_i} v_j, \quad i = 1, 2, \ldots, N, \tag{13}
\]

\[
\Sigma_i = \frac{1}{m_i} \sum_{j=1}^{m_i} (v_j - c_i)(v_j - c_i)^{\mathrm{T}}, \tag{14}
\]

where \(m_i\) is the number of documents in cluster \(i\), \(c_i\) is the center of cluster \(i\), and \(v_j\) is the document term frequency vector.
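The re-estimation formulas above can be sketched in code. The following is a minimal illustration for one Baum-Welch iteration, assuming the forward variables, backward variables, and emission likelihoods have already been computed; the variable names and NumPy formulation are ours, not the paper's.

```python
import numpy as np

def reestimate(alpha, beta, A, B_obs):
    """One Baum-Welch re-estimation step (Rabiner, 1989).

    alpha : (n, N) forward variables alpha_t(i)
    beta  : (n, N) backward variables beta_t(i)
    A     : (N, N) transition matrix a_ij
    B_obs : (n, N) emission likelihoods b_j(TF_SEQ_t)
    """
    n, N = alpha.shape
    # Eq. (5): xi_t(i,j) proportional to alpha_t(i) a_ij b_j(o_{t+1}) beta_{t+1}(j);
    # the per-t normalizer equals P(TF_SEQ | lambda)
    xi = np.zeros((n - 1, N, N))
    for t in range(n - 1):
        num = alpha[t][:, None] * A * (B_obs[t + 1] * beta[t + 1])[None, :]
        xi[t] = num / num.sum()
    # Eq. (6): gamma_t(i) = sum_j xi_t(i,j)
    gamma = xi.sum(axis=2)
    # Eq. (7): expected transitions out of i divided by expected visits to i
    A_new = xi.sum(axis=0) / gamma.sum(axis=0)[:, None]
    # Eq. (10): probability of starting in state i
    pi_new = gamma[0]
    return A_new, pi_new, gamma
```

By construction every row of the re-estimated transition matrix sums to one, which is a convenient sanity check after each iteration.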
3.2. Document clustering based on TPHMM
The number of topics \(N\) is selected with the Bayesian information criterion (Schwarz, 1978; Bicego et al., 2003):

\[
\mathrm{BIC}(N) = \log P(D \mid \lambda, N) - \frac{k}{2} \log n, \tag{11}
\]

where \(P(D \mid \lambda, N)\) is the likelihood \(P(\mathrm{TF\_SEQ} \mid \lambda)\) of the observation sequence under the model with \(N\) topics, \(k\) is the number of free model parameters, and \(n\) is the number of observations. The backward variables are evaluated for \(t = M-1, M-2, \ldots, 1\).

For each cluster \(c_j\), a membership \(\mathrm{memb}(w_i, c_j)\) is computed from the distance \(\mathrm{dis}(w_i, c_j)\) between document \(w_i\) and the cluster center \(c_j\); if \(\mathrm{memb}(w_i, c_j) > m_t\), then \(w_i\) is assigned to cluster \(c_j\).
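BIC-based selection of the topic number can be sketched as follows. Here `loglik_of` and `num_params` are hypothetical stand-ins for the fitted log-likelihood \(\log P(D \mid \lambda, N)\) and the parameter count \(k\); they are placeholders for illustration, not functions from the paper.

```python
import math

def bic(loglik, k, n):
    """BIC(N) = log P(D | lambda, N) - (k/2) * log n; larger is better."""
    return loglik - 0.5 * k * math.log(n)

def select_topic_number(candidates, loglik_of, num_params, n):
    """Pick the topic number N with the best BIC score."""
    return max(candidates, key=lambda N: bic(loglik_of(N), num_params(N), n))
```

The penalty term grows linearly in the parameter count, so once the log-likelihood gain of adding a topic falls below \((k_{N+1}-k_N)\log n / 2\), a smaller model wins.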
3.3. Document classifier based on TPHMM
In many domains, documents can be considered as being generated from a stream. For example, in a BBS or in newswires, documents appear as time goes on. We would like to track new documents by classifying them into the topics that have been found from the historical documents. This can be treated as a classification problem solved with a classification algorithm.
Suppose that we have learnt a TPHMM model which
includes N topics for a document stream. Now, we expect
to assign a new document from the stream to one of the
N topics. The algorithm is as follows.
Algorithm. Classifier algorithm based on TPHMM (CATP)
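The details of the CATP algorithm are not reproduced here, but the classification step can be sketched as follows: assign a new document to the topic with the highest score under the learned model, combining the transition probability from the previous document's topic with the topic's emission likelihood. The Gaussian emission form and all variable names below are our illustrative assumptions.

```python
import numpy as np

def classify(tf_vec, prev_topic, A, means, covs):
    """Assign a document's term-frequency vector to one of N topics.

    Scores topic j by log a_{prev,j} plus the log of a Gaussian emission
    density N(tf_vec; mu_j, Sigma_j), an illustrative emission model.
    """
    N = A.shape[0]
    scores = np.empty(N)
    for j in range(N):
        diff = tf_vec - means[j]
        # log Gaussian density up to the dimension-dependent constant
        _, logdet = np.linalg.slogdet(covs[j])
        mahal = diff @ np.linalg.solve(covs[j], diff)
        scores[j] = np.log(A[prev_topic, j] + 1e-12) - 0.5 * (logdet + mahal)
    return int(np.argmax(scores))
```

With a uniform transition row this reduces to nearest-Gaussian assignment; a peaked transition row biases the decision toward topics that typically follow the previous one.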
Another dataset, which we refer to as BBS-1544, contains 1544 documents collected from the BBS website of a well-known university between July 25, 2006 and March 25, 2007. During this period, several hot issues arose. There were two holidays, National Day and the Spring Festival, during which many people travel. In September 2006, a well-known government official corruption scandal broke out in China.
Table 2
Performance comparison between document clustering algorithms

Membership threshold    FKTDC    Mixture-model-based method
m_t = 0.5               0.72     0.65
m_t = 0.6               0.80     0.73
m_t = 0.7               0.89     0.81
m_t = 0.8               0.93     0.85
m_t = 0.9               0.88     0.79
increases as well; however, our method is superior to the mixture-model-based method.
Table 3
Topic description (topics 1–7; table body not recoverable from the extraction)
Table 1
BIC values for different topic numbers N (of the header row only the labels 10 and 11 survive; the alignment of N values to BIC values was lost in extraction)

BIC: 2017.0, 1995.7, 1966.3, 1925.2, 1920.1, 1803.6, 1910.8, 1988.1, 2133
Table 4
Topic transition probability

From\To    1      2      3      4      5      6      7
1        0.35   0.10   0.05   0.26   0      0      0
2        0.32   0.30   0.12   0.10   0      0      0
3        0.12   0.35   0.25   0.16   0      0      0
4        0.16   0.12   0.38   0.25   0      0      0
5        0      0      0      0      0.80   0.22   0.01
6        0      0      0      0      0.10   0.36   0.20
7        0      0      0      0      0.03   0.20   0.79
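As a small usage illustration, the transition probabilities of Table 4 can be queried to find the most likely successor of a given topic. The matrix below is transcribed from the table, assuming the extracted groups of seven values correspond to rows, with entry (i, j) read as the probability of moving from topic i to topic j.

```python
import numpy as np

# Topic transition probabilities transcribed from Table 4 (topics 1-7)
P = np.array([
    [0.35, 0.10, 0.05, 0.26, 0.00, 0.00, 0.00],
    [0.32, 0.30, 0.12, 0.10, 0.00, 0.00, 0.00],
    [0.12, 0.35, 0.25, 0.16, 0.00, 0.00, 0.00],
    [0.16, 0.12, 0.38, 0.25, 0.00, 0.00, 0.00],
    [0.00, 0.00, 0.00, 0.00, 0.80, 0.22, 0.01],
    [0.00, 0.00, 0.00, 0.00, 0.10, 0.36, 0.20],
    [0.00, 0.00, 0.00, 0.00, 0.03, 0.20, 0.79],
])

def most_likely_next(topic):
    """Return the 1-based topic most likely to follow `topic` (1-based)."""
    return int(np.argmax(P[topic - 1])) + 1
```

Notably, topics 1–4 and topics 5–7 form two almost disconnected blocks in the table, i.e. transitions occur within each group but not between them.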
Our further work will focus on developing a hierarchical clustering algorithm based on TPHMM, since topics are granularity-sensitive and a hierarchical topic description is useful for better understanding topic evolution in BBS systems and other kinds of dynamic Web spaces.
Acknowledgements
The Reuters-21578 dataset (http://www.research.att.com/~lewis) provides a good and sound text clustering testbed, and we would like to acknowledge everyone who has contributed to it. We would also like to thank J.F. Xie for providing the BBS-1544 dataset.
References
Bicego, M., Murino, V., & Figueiredo, M. A. T. (2003). A sequential pruning strategy for the selection of the number of states in hidden Markov models. Pattern Recognition Letters, 24, 1395–1407.
Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022.
Bun, K. H., & Ishizuka, M. (2006). Emerging topic tracking system in WWW. Knowledge-Based Systems, 19, 164–171.
Griffiths, T. L., & Steyvers, M. (2004). Finding scientific topics. Proceedings of the National Academy of Sciences of the United States of America, 101, 5228–5235.
Hofmann, T. (1999). Probabilistic latent semantic indexing. In Proceedings of the twenty-second annual international SIGIR conference, Berkeley, California, USA (pp. 35–44).
Makkonen, J., Ahonen-Myka, H., & Salmenkivi, M. (2004). Simple semantics in topic detection and tracking. Information Retrieval, 7, 347–368.
McCallum, A., Corrada-Emanuel, A., & Wang, X. (2005). Topic and role discovery in social networks. In Proceedings of the 19th international joint conference on artificial intelligence, Edinburgh, UK (pp. 786–791).
Mei, Q., & Zhai, C. X. (2006). A mixture model for contextual text mining. In Proceedings of the twelfth ACM SIGKDD international conference on knowledge discovery and data mining, Philadelphia, USA (pp. 649–655).
Morinaga, S., & Yamanishi, K. (2004). Tracking dynamics of topic trends using a finite mixture model. In Proceedings of the 10th ACM SIGKDD international conference on knowledge discovery and data mining, Seattle, USA (pp. 811–816).
Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77, 257–286.
Rosen-Zvi, M., Griffiths, T., Steyvers, M., & Smyth, P. (2004). The author-topic model for authors and documents. In Proceedings of the 20th conference on uncertainty in artificial intelligence, Banff, Canada (pp. 487–494).
Salton, G., & McGill, M. (1983). Introduction to modern information retrieval. New York: McGraw-Hill.
Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6(2), 461–464.
Torra, V. (2005). Fuzzy c-means for fuzzy hierarchical clustering. In The 2005 IEEE international conference on fuzzy systems (pp. 646–651).
Wang, X. R., & McCallum, A. (2006). Topics over time: A non-Markov continuous-time model of topical trends. In Proceedings of the twelfth ACM SIGKDD international conference on knowledge discovery and data mining, Philadelphia, USA (pp. 424–433).
Xie, J. H. (1995). Hidden Markov model and its application in speech processing (pp. 5–15). Wuhan, China: Huazhong University of Science and Technology Press (in Chinese).