
2010 International Conference on Future Information Technology and Management Engineering

The Key Technology of Topic Detection Based on K-means
Shengdong Li
Chinese Information Processing Research Center
Beijing Information Science and Technology University
Beijing, China
lishengdong_666@sina.com

Xueqiang Lv, Tao Wang, and Shuicai Shi
Chinese Information Processing Research Center
Beijing Information Science and Technology University
Beijing, China
{lv.xueqiang&wang.tao&shi.shuicai}@trs.com.cn

Abstract - Text clustering is the key technology for topic
detection, because topic detection is essentially an unsupervised
clustering task. However, general clustering is based on global
information, whereas clustering in topic detection proceeds
incrementally. Topic detection should therefore be studied in terms
of the clustering algorithm, and the clustering algorithm deserves
in-depth and extensive research. The vector space model (VSM) is one
of the simplest and most effective topic representation models, and
K-means is a well-known and widely used partitional clustering
method. We therefore develop a topic detection prototype system to
study how K in K-means affects topic detection. We obtain the
variation law by which K affects topic detection and determine its
optimal value. Finally, the TDT evaluation method shows that the
validity of the value of K in the algorithm is 83.33% in the topic
detection prototype system based on K-means. This shows that the
K-means clustering algorithm is suited to topic detection.
Index Terms - k-means; vsm; tdt evaluation; topic detection.
I. INTRODUCTION
The emergence of the Internet has caused information to expand
rapidly. In the current information explosion, a major concern is
how to access information of interest quickly and accurately. In
this context, researchers have turned their attention to a new
technology, topic detection and tracking (TDT) [1]. TDT is an
intelligent information access technology that studies how to detect
new events and track their subsequent developments. It can help
people bring scattered information together effectively, and thus
understand all the details of an event and its relationships with
other events in the overall situation.
TDT topic detection [6] is defined as the task of detecting and
tracking topics not previously known to the system. Topic detection
is in essence a kind of unsupervised learning, and its key
technology is the text clustering algorithm. However, general
clustering is based on global information, whereas clustering in
topic detection proceeds incrementally. As a well-known and widely
used partitional clustering method, K-means has attracted great
interest in the literature. We therefore use the vector space model
(VSM) to represent topics and the K-means algorithm for text
clustering to develop a topic detection prototype system. On the
basis of this prototype system, we study how K in K-means affects
topic detection. Finally, we use the TDT evaluation method to
evaluate system performance.
II. TOPIC DETECTION PROTOTYPE SYSTEM
The topic detection prototype system consists of a topic
representation model, the K-means algorithm for text clustering,
and the TDT evaluation method. Fig. 1 shows its architecture.
Fig. 1 Architecture Diagram for Topic Detection Prototype System (recoverable steps: initial cluster centers; calculate similarity distance between cluster centers and all the remaining vectors).
A. Topics Representation Model
Topic representation is the most important issue that topic
detection is confronted with. VSM is currently one of the most
widely used and best-performing topic representation methods. In
our VSM, features are weighted by the TF-IDF algorithm and selected
by weight of evidence for text.
1) Topics Representation
In VSM, a text is regarded as a point in a vector space spanned by
a set of orthogonal term vectors. After a text is divided into words
by the Chinese word segmentation procedure, stop words are removed
first, then word frequencies are calculated, and finally the text is
expressed as a vector.
If the total number of features in the data set is n, they
constitute an n-dimensional vector space. Each text d is expressed
as an n-dimensional feature vector
$$V(d) = (t_1, w_1(d);\ t_2, w_2(d);\ \ldots;\ t_n, w_n(d)) \qquad (1)$$
where $t_i$ (i = 1, ..., n) is term i, and $w_i(d)$ (i = 1, ..., n)
is the weight of term i in text d, calculated by the TF-IDF
algorithm.
Eq. 2 is the TF-IDF algorithm.
$$w_i(d) = tf_i(d) \times \log \frac{q}{n_i} \qquad (2)$$
where $tf_i(d)$ is the frequency of term i in text d, q is the total
number of texts, and $n_i$ is the number of texts in which term
$t_i$ appears.
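As an illustration of the weighting step, the following Python sketch computes TF-IDF vectors from the three quantities defined above (term frequency, total number of texts q, and the number of texts containing each term). It assumes the basic form tf × log(q / n) without any length normalization; the function name `tfidf_vectors` and the toy token lists are hypothetical and are not taken from the prototype system.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists (already segmented, stop words removed)."""
    q = len(docs)  # total number of texts
    # n_i: number of texts in which term i appears
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vocab = sorted(doc_freq)
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # frequency of each term in this text
        # Assumed basic form: weight = tf * log(q / n_i)
        vectors.append([tf[t] * math.log(q / doc_freq[t]) for t in vocab])
    return vocab, vectors


# Toy, made-up example documents
docs = [["stock", "market", "stock"], ["market", "football"], ["football", "match"]]
vocab, vectors = tfidf_vectors(docs)
```

Each row of `vectors` is the feature vector of one text, with one weight per term in `vocab`.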
2) Weight of Evidence for Text
Weight of evidence for text is an excellent feature selection
algorithm. It reflects the difference between the probability of a
topic and the conditional probability of that topic given a term.
For term t and m topics, its weight of evidence for text is as
follows [8].
$$WET(t) = P(t) \sum_{s=1}^{m} P(C_s) \left| \log \frac{P(C_s \mid t)\,(1 - P(C_s))}{P(C_s)\,(1 - P(C_s \mid t))} \right| \qquad (3)$$
where $P(C_s)$ is the probability that topic s appears in the
corpus, and $P(C_s \mid t)$ is the conditional probability that a
text belongs to topic s given that it contains term t.
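A minimal sketch of this score follows, assuming the commonly cited form of weight of evidence for text: P(t) times the sum over topics of P(C_s) times the absolute log odds ratio between P(C_s | t) and P(C_s). The function name, the clamping constant `eps`, and the example probabilities are our own assumptions for illustration.

```python
import math


def weight_of_evidence(p_t, p_topics, p_topics_given_t, eps=1e-9):
    """p_t: P(t); p_topics[s]: P(C_s); p_topics_given_t[s]: P(C_s | t)."""
    total = 0.0
    for p_c, p_c_t in zip(p_topics, p_topics_given_t):
        # Assumption: clamp probabilities away from 0 and 1 to keep the log finite
        p_c = min(max(p_c, eps), 1 - eps)
        p_c_t = min(max(p_c_t, eps), 1 - eps)
        odds_ratio = (p_c_t * (1 - p_c)) / (p_c * (1 - p_c_t))
        total += p_c * abs(math.log(odds_ratio))
    return p_t * total


# A term that does not change the topic probabilities scores zero
uninformative = weight_of_evidence(0.1, [0.5, 0.5], [0.5, 0.5])
# A term concentrated in one topic scores higher
informative = weight_of_evidence(0.2, [0.5, 0.5], [0.9, 0.1])
```

Terms are then ranked by this score and the highest-scoring ones are kept as features.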
B. K-means Algorithm for Text Clustering
K-means is a prototype-based clustering algorithm [2], where the
prototype is defined as the cluster center. The algorithm requires
the number of clusters K and the initial cluster centers to be
specified. It iteratively reduces an objective function until the
final clustering result reaches a minimum of that function. The
objective function is defined as follows [3].
$$E = \sum_{i=1}^{K} \sum_{X \in C_i} \left| X - \bar{X}_i \right|^2 \qquad (4)$$
where X is a data object in topic $C_i$ and $\bar{X}_i$ is the mean
of topic $C_i$.
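Definition 1: The cosine similarity function between text i and
text j is defined as (5) according to VSM.
$$sim(d_i, d_j) = \frac{\sum_{k=1}^{n} w_k(d_i)\, w_k(d_j)}{\sqrt{\sum_{k=1}^{n} w_k^2(d_i)}\ \sqrt{\sum_{k=1}^{n} w_k^2(d_j)}} \qquad (5)$$
where $d_i$ is the feature vector of text i, $d_j$ is the feature
vector of text j, and n is the dimension of the feature vectors.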
Definition 2: The cosine distance between two text vectors is
defined as (6) [7].
$$dis(d_i, d_j) = 1 - sim(d_i, d_j) \qquad (6)$$
The larger the cosine similarity is, the more similar texts i and j
are, and thus the smaller the cosine distance between the two texts
is.
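The two definitions can be sketched in Python as follows. The sketch assumes that the cosine distance is one minus the cosine similarity, which is consistent with the statement that a larger similarity means a smaller distance; treating an all-zero vector as having similarity 0 is our own convention.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: an empty vector is similar to nothing
    return dot / (norm_u * norm_v)


def cosine_distance(u, v):
    """Assumed form: distance = 1 - similarity."""
    return 1.0 - cosine_similarity(u, v)
```

For example, two texts with no terms in common have similarity 0 and distance 1, while two texts whose vectors point in the same direction have distance close to 0.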
The specific steps of the K-means algorithm are as follows.
Step 1: Randomly select K objects from the data set; each object
represents an initial cluster center, i.e., the mean of a topic.
Step 2: According to (5) and (6), calculate the cosine distance
between each remaining object and each cluster center, and assign
each object to the nearest cluster center.
Step 3: Recalculate the mean of each cluster as the new cluster
center.
Step 4: If no cluster center changes, the objective function has
converged and the algorithm ends; otherwise, update the cluster
centers and repeat Step 2 and Step 3.
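The four steps above can be sketched in Python as below. This is a minimal illustration rather than the prototype system's implementation: the function names, the `seed` and `max_iter` parameters, the rule of keeping the old center when a cluster becomes empty, and the toy vectors are all our own assumptions.

```python
import math
import random


def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm else 1.0


def kmeans(vectors, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Step 1: pick K objects at random as the initial cluster centers.
    centers = [list(v) for v in rng.sample(vectors, k)]
    labels = [0] * len(vectors)
    for _ in range(max_iter):
        # Step 2: assign every object to the nearest center by cosine distance.
        labels = [min(range(k), key=lambda c: cosine_distance(v, centers[c]))
                  for v in vectors]
        # Step 3: recompute each center as the mean of its members.
        new_centers = []
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(centers[c])  # assumption: keep an empty cluster's center
        # Step 4: stop once no center changes.
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers


# Toy example: two texts per topic, pointing in clearly different directions
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels, centers = kmeans(vectors, k=2, seed=1)
```

Because the initial centers are chosen at random, the cluster indices can differ between runs, but texts with similar vectors should receive the same label.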
C. TDT Evaluation
Topic detection performance is characterized in terms of the
probabilities of miss and false alarm errors
($P_{Miss} = c / (a + c)$ and $P_{FA} = b / (b + d)$) [6].
TABLE I.
PARAMETERS OF MISS AND FALSE ALARM ERRORS
                        In topic (test set)   Not in topic (test set)
In topic (system)       a                     b
Not in topic (system)   c                     d
These error probabilities are then combined into a single detection
cost $C_{Det}$ [6], by assigning costs to miss and false alarm
errors:
$$C_{Det} = C_{Miss} \cdot P_{Miss} \cdot P_{target} + C_{FA} \cdot P_{FA} \cdot P_{non-target} \qquad (7)$$
where $C_{Miss}$ and $C_{FA}$ are the costs of a Miss and a False
Alarm, respectively, $P_{Miss}$ and $P_{FA}$ are the conditional
probabilities of a Miss and a False Alarm, respectively, and
$P_{target}$ and $P_{non-target}$ are the a priori target
probabilities ($P_{non-target} = 1 - P_{target}$).
The evaluation cost parameters used for the TDT2004 evaluation are
given in Table 2 [6].
TABLE II.
TOPIC DETECTION EVALUATION COST PARAMETERS
Parameter   Value
P_target    0.02
C_Miss      1.0
C_FA        0.1
Because $C_{Det}$ varies with the application, it is normalized to
$(C_{Det})_{Norm}$ so as to be a direct measure of the value of the
TDT system. This is done as follows [6]:
$$(C_{Det})_{Norm} = \frac{C_{Det}}{\min(C_{Miss} \cdot P_{target},\ C_{FA} \cdot P_{non-target})} \qquad (8)$$
In order to test topic detection overall performance, we define
(9).
$$S\_Detection\_p = \frac{\sum_{i=1}^{m} ((C_{Det})_{Norm})_i}{m} \qquad (9)$$
where S_Detection_p is the topic detection overall performance, m
is the number of topics, and $((C_{Det})_{Norm})_i$ is the TDT
evaluation score of topic i.
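To make the evaluation concrete, the sketch below computes the normalized detection cost of one topic from its contingency counts a, b, c, d and averages the scores over topics. It uses the TDT2004 cost parameters as defaults; the function names, the handling of empty denominators, and the example counts are hypothetical.

```python
def normalized_detection_cost(a, b, c, d, p_target=0.02, c_miss=1.0, c_fa=0.1):
    """a, b, c, d: contingency counts for one topic, as in Table I."""
    p_miss = c / (a + c) if (a + c) else 0.0  # missed on-topic texts
    p_fa = b / (b + d) if (b + d) else 0.0    # off-topic texts wrongly accepted
    p_non_target = 1.0 - p_target
    c_det = c_miss * p_miss * p_target + c_fa * p_fa * p_non_target
    # Normalize by the cost of the better trivial system
    return c_det / min(c_miss * p_target, c_fa * p_non_target)


def overall_performance(tables):
    """Mean normalized cost over all topics; smaller is better."""
    scores = [normalized_detection_cost(*t) for t in tables]
    return sum(scores) / len(scores)


# Made-up counts: one perfectly detected topic and one completely missed topic
perfect = normalized_detection_cost(10, 0, 0, 90)
all_missed = normalized_detection_cost(0, 0, 10, 90)
average = overall_performance([(10, 0, 0, 90), (0, 0, 10, 90)])
```

A normalized cost of 0 corresponds to perfect detection, and a value of 1 matches a system that rejects every text.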
III. EXPERIMENTAL RESULTS AND ANALYSIS
We use VSM and K-means to build the topic detection prototype
system and the TDT evaluation method to assess its performance. In
the experiment, the Chinese word segmentation procedure is ICTCLAS,
provided by the Chinese Academy of Sciences. The corpus [4] consists
of 2359 Chinese-language news texts [5] provided by Dr. Tan Songbo
of the CAS Institute of Computing. There are 12 topics in the
corpus. To simulate a real network environment, the distribution of
texts across topics is irregular. Table 3 shows the detailed
experimental data.
TABLE III.
DISTRIBUTION TABLE OF EXPERIMENTAL DATA
Topic           Number of texts
Finance         87
Computer        401
Premises        201
Education       96
Technique       215
Automobile      133
Talents         116
Sports          84
Health          420
Entertainment   149
Locale          362
Arts            95
Total           2359
To test how K in K-means affects topic detection, we conducted an
experiment to find the trend between K in K-means and topic
detection performance, as well as the optimal value of K.
A. Experimental Results
In the experiment, the text clustering algorithm is K-means, the
feature dimension in VSM is 500, and we vary the value of K. We then
obtain the TDT evaluation results for topic detection overall
performance, which are given in Table 4.
TABLE IV.
TDT EVALUATION RESULTS
K    S_Detection_p
In order to facilitate analysis, we plot the data in Table 4 in
Fig. 2.
Fig. 2 Trend Chart Between K in K-means and Topic Detection Overall Performance (x-axis: K in K-means, K = 6, 8, 10, 12, 14, 16; y-axis: 0.66 to 0.74).
The smaller the TDT evaluation result is, the better the topic
detection performance is. According to Table 4 and Fig. 2, when K is
10, the topic detection overall performance (S_Detection_p) is
0.6713, which is the minimum value. At this point, the topic
detection prototype system has the best overall performance.
B. Analysis of Experimental Results
In the experiment, according to Table 4 and Fig. 2, different
values of K lead to different TDT evaluation results for the same
topics. When the value of K increases from 6 to 10, the TDT
evaluation result gradually decreases from 0.7345 to 0.6713; that
is, as K increases, topic detection performance gradually improves,
and it is best when K reaches 10. This shows that when the value of
K is 6, the algorithm ignores some features that carry a relatively
large amount of information. As the value of K increases, these
ignored features are included, and topic detection performance
gradually improves until it peaks at K = 10. When the value of K
increases from 10 to 16, the TDT evaluation result gradually
increases from 0.6713 to 0.6946; that is, as K increases further,
topic detection performance gradually declines. This shows that
when the value of K is 10, the feature space contains all the
helpful features for text clustering. As the value of K increases
further, noise features also gradually increase. These gradually
offset the effect of some helpful features, and topic detection
performance gradually declines. Therefore, when the value of K is
10, the topic detection overall performance is 0.6713, which is the
minimum value and corresponds to the optimal topic detection
performance.
According to Table 4, when the value of K increases from 10 to 12,
the TDT evaluation result increases from 0.6713 to 0.6769. The
result at K = 10 is only 0.827% lower than that at K = 12; that is
to say, K = 10 and K = 12 have almost the same effect. According to
Fig. 2, 10 is the optimal result obtained by the K-means clustering
algorithm, and 12, the number of topics in the corpus, is the
standard result. Therefore, the validity of the value of K in the
algorithm is 83.33% in the topic detection prototype system based on
K-means. This shows that the K-means clustering algorithm is suited
to topic detection.
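The two percentages in this paragraph can be checked with simple arithmetic. The first is the relative difference between the reported scores at K = 10 and K = 12. The second is consistent with the ratio of the optimal K to the standard value of 12; the paper does not define "validity" explicitly, so this reading is an assumption.

```latex
\frac{0.6769 - 0.6713}{0.6769} = \frac{0.0056}{0.6769} \approx 0.00827 = 0.827\%,
\qquad
\frac{10}{12} \approx 0.8333 = 83.33\%
```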
IV. CONCLUSIONS
Topic detection is a task of the TDT evaluation, and its key
technology is the text clustering algorithm. However, general
clustering is based on global information, whereas clustering in
topic detection proceeds incrementally. In this paper, we use VSM to
represent topics and the K-means algorithm to develop a topic
detection prototype system. On the basis of this prototype system,
we study how K in K-means affects topic detection performance, and
we use the TDT evaluation method to assess the results. Finally, the
experiment shows that the validity of the value of K in the
algorithm is 83.33% in the topic detection prototype system based on
K-means. This shows that the K-means clustering algorithm is suited
to topic detection.
ACKNOWLEDGMENT
The research work is supported by the 863 Key Program of China
(2006AA010105), the National Natural Science Foundation of China
(60872133), the Beijing Municipal National Natural Science
Foundation (4092015), the Funding Project for Academic Human
Resources Development in Institutions of Higher Learning under the
Jurisdiction of Beijing Municipality (PXM2007_014224_044677), and
the Scientific Research Common Program of Beijing Municipal
Commission of Education (KM201010772023).
REFERENCES
[1] Z. Jin, H. F. Lin, and J. Zhao, "Study on Topic Tracking and
Tendency Classification Based on HowNet," Journal of The China
Society For Scientific and Technical Information, vol. 24, no. 5,
pp. 555-561, 2005.
[2] H. Xiong, J. J. Wu, and J. Chen, "K-Means Clustering Versus
Validation Measures: A Data-Distribution Perspective," IEEE
Transactions on Systems, Man, and Cybernetics-Part B: Cybernetics,
vol. 39, no. 2, pp. 318-331, 2009.
[3] L. Liu, "Adaptive Clustering Algorithm Analysis Based on
K-Means," Beijing University of Posts and Telecommunications,
pp. 27-42, 2009.
[4] S. B. Tan and Y. F. Wang, "TanCorpV1.0,"
http://www.searchforum.org.cn/tansongbo/corpus.htm.
[5] S. B. Tan, X. Q. Cheng, M. M. Ghanem, B. Wang, and H. B. Xu, "A
Novel Refinement Approach for Text Categorization," ACM CIKM 2005,
pp. 469-476, 2005.
[6] NIST, "The 2004 Topic Detection and Tracking (TDT2004) Task
Definition and Evaluation Plan,"
http://www.itl.nist.gov/iad/mig/tests/tdt/2004/TDT04.Eval.Plan.v1.2.pdf
[7] X. W. Li, "Research on Text Clustering Algorithm Based on
K_means and SOM," International Symposium on Intelligent Information
Technology Application Workshops, pp. 341-344, 2008.
[8] J. N. Hu, W. R. Xu, J. Guo, and W. H. Dong, "Study on feature
selection methods in Chinese text categorization," Study on
Communications, no. 03, pp. 44-46, 2005.
