Expert Systems with Applications 36 (2009) 227–232
www.elsevier.com/locate/eswa
Abstract
Topics often transit among documents in a document collection. To improve the accuracy of topic detection and tracking (TDT) algorithms in discovering topics or classifying documents, it is necessary to make full use of this topic transition information. However, TDT algorithms usually find topics based on topic models such as LDA and pLSI, which are mixture models and make topic transitions difficult to represent and implement. A topic transition model based on the hidden Markov model is presented, and learning topic transitions from documents is discussed. Based on this model, two TDT algorithms incorporating topic transition, one for topic discovery and one for document classification, are provided to show the application of the proposed model. Experiments on two real-world document collections were conducted with the two algorithms; a performance comparison with similar algorithms shows that the accuracy reaches 93% for topic discovery on Reuters-21578 and 97.3% for document classification. Furthermore, the topic transitions discovered by the algorithm on a dataset collected from a BBS website are consistent with manual analysis results.
© 2007 Elsevier Ltd. All rights reserved.
Keywords: Topic transition; Topic detection and tracking; Hidden Markov model
1. Introduction
In recent years, the volume of text of all kinds on the Internet has been increasing dramatically. For example, in an interactive BBS, thousands of new articles are posted every day. A meaningful article discusses one or more topics that exist in the BBS. Finding and understanding these topics can be helpful to different users: a reporter needs to keep up with hot topics in online Internet news, while a scientific researcher may be interested in topics correlated with his research. Hence, it is important to find and keep track of topics across large numbers of documents. However, in contrast to articles, topics are invisible and must be inferred. Manually finding topics is impossible when there are large amounts of text, so topic detection and tracking has recently become a new research area.
where

\[
\xi_t(i,j) = \frac{\alpha_t(i)\, a_{ij}\, b_j(\mathrm{TF\_SEQ}_{t+1})\, \beta_{t+1}(j)}{P(\mathrm{TF\_SEQ} \mid \lambda)}, \quad 1 \le t \le n-1, \tag{5}
\]

\[
\gamma_t(i) = \sum_{j=1}^{N} \xi_t(i,j), \tag{6}
\]

where \(\alpha_t(i)\) and \(\beta_t(j)\) are the forward and backward variables (Rabiner, 1989), respectively.

To maximize the likelihood, a series of iterations is performed following the Baum-Welch procedure (Rabiner, 1989). That is, let

\[
\bar{a}_{ij} = \frac{\sum_{t=1}^{n-1} \xi_t(i,j)}{\sum_{t=1}^{n-1} \gamma_t(i)}, \tag{7}
\]

\[
\bar{\mu}_j = \frac{\sum_{t=1}^{n} \gamma_t(j)\, \mathrm{TF\_SEQ}_t}{\sum_{t=1}^{n} \gamma_t(j)}, \tag{8}
\]

\[
\bar{R}_j = \frac{\sum_{t=1}^{n} \gamma_t(j)\, (\mathrm{TF\_SEQ}_t - \mu_j)(\mathrm{TF\_SEQ}_t - \mu_j)^{\mathrm{T}}}{\sum_{t=1}^{n} \gamma_t(j)}, \tag{9}
\]

\[
\bar{\pi}_i = \gamma_1(i), \quad i = 1, 2, \ldots, N. \tag{10}
\]

The cluster centers and covariances are computed from the document term frequency vectors:

\[
c_i = \frac{1}{m_i} \sum_{j=1}^{m_i} v_j, \quad i = 1, 2, \ldots, N, \tag{13}
\]

\[
\Sigma_i = \frac{1}{m_i} \sum_{j=1}^{m_i} (v_j - c_i)(v_j - c_i)^{\mathrm{T}}, \tag{14}
\]

where \(m_i\) is the number of documents in cluster \(i\), \(c_i\) is the center of cluster \(i\), and \(v_j\) is the document term frequency vector.
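The re-estimation formulas above can be sketched in code. The following is a minimal illustration for one Baum-Welch iteration, assuming the forward variables, backward variables, and emission likelihoods have already been computed; the variable names and NumPy formulation are ours, not the paper's.

```python
import numpy as np

def reestimate(alpha, beta, A, B_obs):
    """One Baum-Welch re-estimation step (Rabiner, 1989).

    alpha : (n, N) forward variables alpha_t(i)
    beta  : (n, N) backward variables beta_t(i)
    A     : (N, N) transition matrix a_ij
    B_obs : (n, N) emission likelihoods b_j(TF_SEQ_t)
    """
    n, N = alpha.shape
    # Eq. (5): xi_t(i,j) proportional to alpha_t(i) a_ij b_j(o_{t+1}) beta_{t+1}(j);
    # the per-t normalizer equals P(TF_SEQ | lambda)
    xi = np.zeros((n - 1, N, N))
    for t in range(n - 1):
        num = alpha[t][:, None] * A * (B_obs[t + 1] * beta[t + 1])[None, :]
        xi[t] = num / num.sum()
    # Eq. (6): gamma_t(i) = sum_j xi_t(i,j)
    gamma = xi.sum(axis=2)
    # Eq. (7): expected transitions out of i divided by expected visits to i
    A_new = xi.sum(axis=0) / gamma.sum(axis=0)[:, None]
    # Eq. (10): probability of starting in state i
    pi_new = gamma[0]
    return A_new, pi_new, gamma
```

By construction every row of the re-estimated transition matrix sums to one, which is a convenient sanity check after each iteration.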
3.2. Document clustering based on TPHMM
The number of topics \(N\) is selected with the Bayesian information criterion (Schwarz, 1978; Bicego et al., 2003):

\[
\mathrm{BIC}(N) = \log P(D \mid \lambda, N) - \frac{k}{2} \log n, \tag{11}
\]

where \(P(D \mid \lambda, N)\) is the likelihood \(P(\mathrm{TF\_SEQ} \mid \lambda)\) of the observation sequence under the model with \(N\) topics, \(k\) is the number of free model parameters, and \(n\) is the number of observations. The backward variables are evaluated for \(t = M-1, M-2, \ldots, 1\).

For each cluster \(c_j\), a membership \(\mathrm{memb}(w_i, c_j)\) is computed from the distance \(\mathrm{dis}(w_i, c_j)\) between document \(w_i\) and the cluster center \(c_j\); if \(\mathrm{memb}(w_i, c_j) > m_t\), then \(w_i\) is assigned to cluster \(c_j\).
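BIC-based selection of the topic number can be sketched as follows. Here `loglik_of` and `num_params` are hypothetical stand-ins for the fitted log-likelihood \(\log P(D \mid \lambda, N)\) and the parameter count \(k\); they are placeholders for illustration, not functions from the paper.

```python
import math

def bic(loglik, k, n):
    """BIC(N) = log P(D | lambda, N) - (k/2) * log n; larger is better."""
    return loglik - 0.5 * k * math.log(n)

def select_topic_number(candidates, loglik_of, num_params, n):
    """Pick the topic number N with the best BIC score."""
    return max(candidates, key=lambda N: bic(loglik_of(N), num_params(N), n))
```

The penalty term grows linearly in the parameter count, so once the log-likelihood gain of adding a topic falls below \((k_{N+1}-k_N)\log n / 2\), a smaller model wins.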
3.3. Document classifier based on TPHMM
In many domains, documents can be considered as being generated from a stream. For example, in a BBS or in newswires, documents appear as time goes on. We would like to track new documents by classifying them into the topics that have been found from the historical documents. This can be treated as a classification problem solved with a classification algorithm.
Suppose that we have learnt a TPHMM model which
includes N topics for a document stream. Now, we expect
to assign a new document from the stream to one of the
N topics. The algorithm is as follows.
Algorithm. Classifier algorithm based on TPHMM (CATP)
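The details of the CATP algorithm are not reproduced here, but the classification step can be sketched as follows: assign a new document to the topic with the highest score under the learned model, combining the transition probability from the previous document's topic with the topic's emission likelihood. The Gaussian emission form and all variable names below are our illustrative assumptions.

```python
import numpy as np

def classify(tf_vec, prev_topic, A, means, covs):
    """Assign a document's term-frequency vector to one of N topics.

    Scores topic j by log a_{prev,j} plus the log of a Gaussian emission
    density N(tf_vec; mu_j, Sigma_j), an illustrative emission model.
    """
    N = A.shape[0]
    scores = np.empty(N)
    for j in range(N):
        diff = tf_vec - means[j]
        # log Gaussian density up to the dimension-dependent constant
        _, logdet = np.linalg.slogdet(covs[j])
        mahal = diff @ np.linalg.solve(covs[j], diff)
        scores[j] = np.log(A[prev_topic, j] + 1e-12) - 0.5 * (logdet + mahal)
    return int(np.argmax(scores))
```

With a uniform transition row this reduces to nearest-Gaussian assignment; a peaked transition row biases the decision toward topics that typically follow the previous one.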
Another dataset, which we refer to as BBS-1544, contains 1544 documents collected from the BBS website of a well-known university between July 25, 2006 and March 25, 2007. During this period, several hot issues arose. There were two holidays, National Day and the Spring Festival, during which many people travel. In September 2006, a well-known government official corruption scandal broke out in China.
Table 2
Performance comparison between document clustering algorithms

Membership threshold    FKTDC    Mixture-model-based method
m_t = 0.5               0.72     0.65
m_t = 0.6               0.80     0.73
m_t = 0.7               0.89     0.81
m_t = 0.8               0.93     0.85
m_t = 0.9               0.88     0.79
increases as well; however, our method is superior to the mixture-model-based method.
Table 3
Topic description (topics 1–7; table body not recoverable from the extraction)
Table 1
BIC values for different topic numbers N (of the header row only the labels 10 and 11 survive; the alignment of N values to BIC values was lost in extraction)

BIC: 2017.0, 1995.7, 1966.3, 1925.2, 1920.1, 1803.6, 1910.8, 1988.1, 2133
Table 4
Topic transition probability

From\To    1      2      3      4      5      6      7
1        0.35   0.10   0.05   0.26   0      0      0
2        0.32   0.30   0.12   0.10   0      0      0
3        0.12   0.35   0.25   0.16   0      0      0
4        0.16   0.12   0.38   0.25   0      0      0
5        0      0      0      0      0.80   0.22   0.01
6        0      0      0      0      0.10   0.36   0.20
7        0      0      0      0      0.03   0.20   0.79
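As a small usage illustration, the transition probabilities of Table 4 can be queried to find the most likely successor of a given topic. The matrix below is transcribed from the table, assuming the extracted groups of seven values correspond to rows, with entry (i, j) read as the probability of moving from topic i to topic j.

```python
import numpy as np

# Topic transition probabilities transcribed from Table 4 (topics 1-7)
P = np.array([
    [0.35, 0.10, 0.05, 0.26, 0.00, 0.00, 0.00],
    [0.32, 0.30, 0.12, 0.10, 0.00, 0.00, 0.00],
    [0.12, 0.35, 0.25, 0.16, 0.00, 0.00, 0.00],
    [0.16, 0.12, 0.38, 0.25, 0.00, 0.00, 0.00],
    [0.00, 0.00, 0.00, 0.00, 0.80, 0.22, 0.01],
    [0.00, 0.00, 0.00, 0.00, 0.10, 0.36, 0.20],
    [0.00, 0.00, 0.00, 0.00, 0.03, 0.20, 0.79],
])

def most_likely_next(topic):
    """Return the 1-based topic most likely to follow `topic` (1-based)."""
    return int(np.argmax(P[topic - 1])) + 1
```

Notably, topics 1–4 and topics 5–7 form two almost disconnected blocks in the table, i.e. transitions occur within each group but not between them.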
Our further work will focus on developing a hierarchical clustering algorithm based on TPHMM, since topics are granularity-sensitive and a hierarchical topic description is useful for better understanding topic evolution in BBS systems and other kinds of dynamic Web spaces.
Acknowledgements
The Reuters-21578 dataset (http://www.research.att.com/~lewis) provides a good and sound text clustering testbed, and we would like to acknowledge everyone who has contributed to it. We would also like to thank J.F. Xie for providing the BBS-1544 dataset.
References
Bicego, M., Murino, V., & Figueiredo, M. A. T. (2003). A sequential pruning strategy for the selection of the number of states in hidden Markov models. Pattern Recognition Letters, 24, 1395–1407.
Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022.
Bun, K. H., & Ishizuka, M. (2006). Emerging topic tracking system in WWW. Knowledge-Based Systems, 19, 164–171.
Griffiths, T. L., & Steyvers, M. (2004). Finding scientific topics. Proceedings of the National Academy of Sciences of the United States of America, 101, 5228–5235.
Hofmann, T. (1999). Probabilistic latent semantic indexing. In Proceedings of the twenty-second annual international SIGIR conference, Berkeley, California, USA (pp. 35–44).
Makkonen, J., Ahonen-Myka, H., & Salmenkivi, M. (2004). Simple semantics in topic detection and tracking. Information Retrieval, 7, 347–368.
McCallum, A., Corrada-Emanuel, A., & Wang, X. (2005). Topic and role discovery in social networks. In Proceedings of the 19th international joint conference on artificial intelligence, Edinburgh, UK (pp. 786–791).
Mei, Q., & Zhai, C. X. (2006). A mixture model for contextual text mining. In Proceedings of the twelfth ACM SIGKDD international conference on knowledge discovery and data mining, Philadelphia, USA (pp. 649–655).
Morinaga, S., & Yamanishi, K. (2004). Tracking dynamics of topic trends using a finite mixture model. In Proceedings of the 10th ACM SIGKDD international conference on knowledge discovery and data mining, Seattle, USA (pp. 811–816).
Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77, 257–286.
Rosen-Zvi, M., Griffiths, T., Steyvers, M., & Smyth, P. (2004). The author-topic model for authors and documents. In Proceedings of the 20th conference on uncertainty in artificial intelligence, Banff, Canada (pp. 487–494).
Salton, G., & McGill, M. (1983). Introduction to modern information retrieval. New York: McGraw-Hill.
Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6(2), 461–464.
Torra, V. (2005). Fuzzy c-means for fuzzy hierarchical clustering. In The 2005 IEEE international conference on fuzzy systems (pp. 646–651).
Wang, X. R., & McCallum, A. (2006). Topics over time: A non-Markov continuous-time model of topical trends. In Proceedings of the twelfth ACM SIGKDD international conference on knowledge discovery and data mining, Philadelphia, USA (pp. 424–433).
Xie, J. H. (1995). Hidden Markov model and its application in speech processing (pp. 5–15). Wuhan, China: Huazhong University of Science and Technology Press (in Chinese).