IEEE TRANSACTIONS ON EMERGING TOPICS IN COMPUTING

Wikipedia-Based Semantic Similarity Measurements for Noisy Short Texts

M. Shirakawa, K. Nakayama, T. Hara, and S. Nishio
Received 25 February 2014; revised 3 February 2015; accepted 23 March 2015. Date of publication 30 March 2015; date of current version 10 June 2015.
Digital Object Identifier 10.1109/TETC.2015.2418716
ABSTRACT
This paper proposes a Wikipedia-based semantic similarity measurement method intended for real-world noisy short texts. Our method is a kind of explicit semantic analysis (ESA): it adds a bag of Wikipedia entities (Wikipedia pages) to a text as its semantic representation and uses the vector of entities to compute the semantic similarity. Adding related entities to a text, rather than to a single word or phrase, is a challenging practical problem because it usually consists of several subproblems, e.g., key term extraction from texts, related entity finding for each key term, and weight aggregation of related entities. Our proposed method solves this aggregation problem using extended naive Bayes, a probabilistic weighting mechanism based on Bayes' theorem. Our method is especially effective when short texts are semantically noisy, i.e., when they contain meaningless or misleading terms for estimating their main topic. Experimental results on Twitter message and Web snippet clustering revealed that our method outperforms ESA on noisy short texts. We also found that reducing the dimension of the vector to representative Wikipedia entities scarcely affected the performance while decreasing the vector size and hence the storage space and the processing time of computing the cosine similarity.
INDEX TERMS
I. INTRODUCTION
related entity finding for each key term, and weight aggregation of related entities. To solve this aggregation problem, ESA simply sums the weighted vectors of related entities for each word, relying on the majority rule.
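To make the contrast concrete, here is a minimal sketch of this ESA-style aggregation (our own simplification; entity_vector is a hypothetical lookup that returns the weighted Wikipedia-entity vector of a word):

    from collections import defaultdict

    def esa_aggregate(words, entity_vector):
        """ESA-style aggregation: sum the per-word entity vectors, so entities
        backed by many words win by sheer majority."""
        text_vector = defaultdict(float)
        for word in words:
            for entity, weight in entity_vector(word).items():
                text_vector[entity] += weight
        return dict(text_vector)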
The approach of summing up the vectors is not suited to real-world noisy short texts, in which both key terms and irrelevant terms occur only a few times: the majority rule does not work well with so little information. In such a situation, it is important to focus on key terms while filtering out noisy terms.
The main purpose of this work is to accomplish better
aggregation of weighted vectors than ESA. Our proposed
method generates the output vector of related Wikipedia
entities from a given text in the same way as ESA. Here,
we define probabilistic scores and introduce extended Naive
Bayes (ENB) to aggregate related entities.
The main contributions of our work are as follows:
• We propose a probabilistic method of semantic similarity measurement that adds related Wikipedia entities as the semantics of short texts. Our method is more robust to noisy short texts than ESA because its weighting mechanism is based on Bayes' theorem: it can amplify the score of an entity that is related to multiple terms in a text even if none of those terms alone is characteristic.
• We carried out experiments on both clean and noisy short texts and demonstrated that our method is effective for noisy short texts. Experimental results on short text similarity benchmark datasets indicated that ESA with well-adjusted parameters is better suited to clean short texts than our method. In contrast, our method outperformed ESA when computing the semantic distance between Twitter messages or Web snippets; such texts are semantically noisy because they often contain meaningless or misleading terms for understanding their topics. Moreover, we found that dimension reduction of the output vector can increase space and time efficiency while achieving competitive performance.
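The dimension reduction mentioned above can be pictured with a simple sketch (our own illustration, not the paper's exact procedure): keep only the top-n highest-weighted entities of each output vector before computing the cosine similarity.

    import math

    def truncate_top_n(vec, n):
        """Keep only the n highest-weighted entities of a sparse vector (dict)."""
        top = sorted(vec.items(), key=lambda kv: kv[1], reverse=True)[:n]
        return dict(top)

    def cosine(u, v):
        """Cosine similarity between two sparse vectors represented as dicts."""
        dot = sum(w * v[e] for e, w in u.items() if e in v)
        norm = math.sqrt(sum(w * w for w in u.values())) * \
               math.sqrt(sum(w * w for w in v.values()))
        return dot / norm if norm else 0.0

Truncating shrinks both the storage per text and the work done inside the cosine computation, which is consistent with the space and time savings reported later.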
This paper is organized as follows. Section II describes previous research on short text analysis and semantic similarity measurement using Wikipedia. Section III explains why ESA is weak against real-world noisy short texts. Section IV details our method for finding related Wikipedia entities for short texts to measure the semantic similarity. Section V presents the experiments we carried out to evaluate our method. Finally, we conclude the paper in Section VI. A part of this work was presented at the 22nd ACM International Conference on Information and Knowledge Management (CIKM 2013) [7].
II. RELATED WORK
A. SHORT TEXT ANALYSIS
The probability $P(e \mid t)$ that term $t$ refers to Wikipedia entity $e$ is estimated from anchor texts:
$$P(e \mid t) = \frac{\mathrm{CountAnchortexts}(t, e)}{\sum_{e_i \in E} \mathrm{CountAnchortexts}(t, e_i)}, \qquad (2)$$
where $\mathrm{CountAnchortexts}(t, e)$ is the number of anchor texts with surface form $t$ that link to entity $e$, and $E$ is the set of all Wikipedia entities.
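A minimal sketch of this estimation (ours, not the authors' code), assuming anchor_counts maps a surface form to the per-entity counts of its anchor texts, as extracted from a Wikipedia dump:

    from collections import Counter

    def term_to_entity_probs(term, anchor_counts):
        """Estimate P(e|t) per Equation (2) from anchor-text counts."""
        counts = anchor_counts.get(term, Counter())
        total = sum(counts.values())
        if total == 0:
            return {}
        return {entity: n / total for entity, n in counts.items()}

For instance, if the surface form "apple" appeared 500 times as an anchor linking to Apple Inc. and 300 times linking to Apple (the fruit), the sketch would yield probabilities 0.625 and 0.375 (numbers invented for illustration).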
The prior probability $P(c)$ of entity $c$ is estimated from the number of incoming links it receives on Wikipedia:
$$P(c) = \frac{\mathrm{CountLinks}(c)}{\sum_{c_j \in C} \mathrm{CountLinks}(c_j)}, \qquad (6)$$
where $C$ is the set of all Wikipedia entities. When all members of $T$ can be observed, the conventional naive Bayes posterior is specifically computed by
$$P(c \mid T = \{t_1, \ldots, t_K\}) = \frac{P(c)\prod_{k=1}^{K} P(t_k \mid c)}{P(T = \{t_1, \ldots, t_K\})} = \frac{\prod_{k=1}^{K} P(c \mid t_k)\,P(t_k)}{P(T = \{t_1, \ldots, t_K\})\,P(c)^{K-1}} = \frac{\prod_{k=1}^{K} P(c \mid t_k)}{P(c)^{K-1}}, \qquad (7)$$
where the last equality assumes the terms are independent, i.e., $P(T = \{t_1, \ldots, t_K\}) = \prod_{k=1}^{K} P(t_k)$.
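For intuition, consider a hypothetical two-term text (all numbers are invented for illustration). Suppose entity $c$ is related to both terms, with $P(c \mid t_1) = 0.3$, $P(c \mid t_2) = 0.2$, and prior $P(c) = 0.01$, while entity $c'$ is related only to $t_1$, with $P(c' \mid t_1) = 0.3$ and $P(c' \mid t_2) = P(c') = 0.01$. Equation (7) then gives the scores
$$\frac{0.3 \times 0.2}{0.01} = 6 \qquad \text{and} \qquad \frac{0.3 \times 0.01}{0.01} = 0.3,$$
so the entity supported by both terms is amplified by a factor of 20, even though neither conditional probability is large on its own.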
Next, we tackle the case where members of T cannot be observed, i.e., where it is not clear whether a term in a text is a key term or not. Because candidate key terms in a text can be detected using Wikipedia anchor texts and titles, this assumption matches the setting considered in this work. One possible approach to this challenge is a two-phase method that first determines the key terms and then applies conventional naive Bayes to them. However, this approach raises the problem of how the key terms are determined. Threshold-based methods can be employed to select or discard terms, but they require parameter adjustment, and adjusting thresholds is difficult because the optimal thresholds may change from text to text.
Instead of using threshold-based methods, we propose extended Naive Bayes (ENB). The basic idea of ENB was originally proposed for entity disambiguation in our technical report [30]. In this paper, we develop ENB for finding related entities. ENB can be applied to a set whose members are probabilistically determined. Given a set of key terms T, P(c|T') is computed for all possible states T'. Figure 5 outlines an example of ENB for a set of candidate key terms t1, ..., tK. ENB computes P(T = T'), the probability that the set of key terms T is exactly the state T'. It then computes P(c|T') for each state T' and sums up P(c|T') weighted by P(T = T').
Specifically,
$$P(T = T') = \prod_{t_k \in T'} P(t_k \in T) \prod_{t_k \notin T'} \bigl(1 - P(t_k \in T)\bigr), \qquad (8)$$
where $P(t_k \in T)$ is the probability that candidate term $t_k$ is a key term.
$P(c \mid T)$ is then obtained by marginalizing over all possible states $T'$:
$$P(c \mid T) = \sum_{T'} P(T = T')\,\frac{\prod_{t_k \in T'} P(c \mid t_k)}{P(c)^{|T'|-1}} \qquad (9)$$
$$= \sum_{T'} P(T = T')\,\frac{\prod_{t_k \in T'} P(c \mid t_k) \prod_{t_k \notin T'} P(c)}{P(c)^{|T'|-1}\,P(c)^{K-|T'|}} = \frac{\displaystyle\sum_{T'} \prod_{t_k \in T'} P(t_k \in T)\,P(c \mid t_k) \prod_{t_k \notin T'} \bigl(1 - P(t_k \in T)\bigr)\,P(c)}{P(c)^{K-1}}. \qquad (10)$$
The numerator of Equation (10) is then decomposed term by term so that it can be computed efficiently:
$$\sum_{T'} \prod_{t_k \in T'} P(t_k \in T)\,P(c \mid t_k) \prod_{t_k \notin T'} \bigl(1 - P(t_k \in T)\bigr)\,P(c)$$
$$= \Bigl(P(t_1 \in T)\,P(c \mid t_1) + \bigl(1 - P(t_1 \in T)\bigr)\,P(c)\Bigr) \sum_{T'} \prod_{\substack{t_k \in T' \\ t_k \notin \{t_1\}}} P(t_k \in T)\,P(c \mid t_k) \prod_{\substack{t_k \notin T' \\ t_k \notin \{t_1\}}} \bigl(1 - P(t_k \in T)\bigr)\,P(c)$$
$$= \Bigl(P(t_1 \in T)\,P(c \mid t_1) + \bigl(1 - P(t_1 \in T)\bigr)\,P(c)\Bigr)\Bigl(P(t_2 \in T)\,P(c \mid t_2) + \bigl(1 - P(t_2 \in T)\bigr)\,P(c)\Bigr)$$
$$\quad \times \sum_{T'} \prod_{\substack{t_k \in T' \\ t_k \notin \{t_1, t_2\}}} P(t_k \in T)\,P(c \mid t_k) \prod_{\substack{t_k \notin T' \\ t_k \notin \{t_1, t_2\}}} \bigl(1 - P(t_k \in T)\bigr)\,P(c)$$
$$= \prod_{k=1}^{K} \Bigl(P(t_k \in T)\,P(c \mid t_k) + \bigl(1 - P(t_k \in T)\bigr)\,P(c)\Bigr).$$
Hence $P(c \mid T)$ reduces to a product of $K$ per-term factors divided by $P(c)^{K-1}$, which can be evaluated in $O(K)$ time per entity instead of enumerating all $2^K$ possible states.
FIGURE 6. Example of extended Naive Bayes (ENB) for the input text "Apple and tree". For each term t, the probability of related entity c is computed by a linear combination of P(c|t) (if t is a key term) and the prior probability P(c) (if t is not a key term).
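This per-term linear combination makes the closed form straightforward to implement. Below is a minimal sketch (our own, not the authors' released code), assuming the probabilities P(t in T), P(c|t), and P(c) have already been estimated, e.g., from anchor texts and incoming links as in Equations (2) and (6); all identifiers are hypothetical.

    def enb_scores(terms, p_key, p_cond, p_prior):
        """Score related entities for a text with extended Naive Bayes (ENB).

        terms   : candidate key terms t_1, ..., t_K found in the text
        p_key   : dict t -> P(t in T), probability that t is a key term
        p_cond  : dict t -> {c: P(c|t)}, entities related to each term
        p_prior : dict c -> P(c), link-based prior of each entity
        Returns : dict c -> score computed as
                  prod_k [P(t_k in T) P(c|t_k) + (1 - P(t_k in T)) P(c)] / P(c)^(K-1)
        """
        K = len(terms)
        # Only entities related to at least one term can rise above the prior.
        candidates = {c for t in terms for c in p_cond.get(t, {})}
        scores = {}
        for c in candidates:
            prior = p_prior[c]
            score = 1.0
            for t in terms:
                # Assumption: an unrelated term carries no information about c,
                # so P(c|t) falls back to the prior P(c).
                cond = p_cond.get(t, {}).get(c, prior)
                score *= p_key[t] * cond + (1.0 - p_key[t]) * prior
            scores[c] = score / prior ** (K - 1)
        return scores

Each factor interpolates between P(c|t) and the prior P(c) exactly as in Figure 6, so an entity supported by several probable key terms accumulates factors well above its prior and is amplified by the division by P(c)^(K-1).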
We used hashtags, which are defined by Twitter, to create Twitter datasets for the clustering tasks. Hashtags are tags, such as #Obama and #MacBook, that Twitter users intentionally add to their tweets in order to clarify the topic of the tweet [41]. Hashtags are often used to create datasets for short text clustering [14], [42]. In our experiment, we first sampled tweets during January 2012 and then listed frequently occurring hashtags (topics). Among them, we selected independent, unambiguous hashtags so that each cluster contained as many on-topic tweets as possible. Note that the collected tweets still contain ambiguous terms, so this setup does not make the clustering task easier. Also, to ensure fairness, we did not inspect the output vectors of the collected tweets before selecting the hashtags.
Table 6 lists the four Twitter datasets used in the evaluation and their statistics. The first dataset includes six discrete topics (D) from politics, entertainment, sports, information technology, health, and religion. The second dataset assumes fine-grained topics of IT products (IT), and the third assumes sports (S). We built these two datasets to investigate the ability to distinguish similar but different topics. In addition, we built a dataset that combines the IT and S datasets (Mix) to better imitate a real situation. The procedure for constructing the datasets was as follows: 1) search tweets by using the predefined hashtags and store those written in English, 2) delete tweets that contain more than one predefined hashtag, 3) delete retweets (tweets starting with RT), 4) remove URLs in tweets, 5) remove hashtags at the end of tweets (to hide explicit clues for the topic) and the '#' symbol of hashtags not at the end of tweets, and 6) delete tweets that contain fewer than four words.
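Steps 2) through 6) of this procedure can be sketched as follows (our own approximation; the regular expressions and thresholds are ours, and step 1, the hashtag search and English filtering, is assumed to have happened upstream):

    import re

    def clean_tweet(text, predefined_hashtags):
        """Apply cleaning steps 2)-6); return the cleaned text or None to discard."""
        tags = [t.lower() for t in re.findall(r"#\w+", text)]
        # 2) discard tweets containing more than one predefined hashtag
        if sum(t in predefined_hashtags for t in tags) > 1:
            return None
        # 3) discard retweets
        if text.startswith("RT"):
            return None
        # 4) remove URLs
        text = re.sub(r"https?://\S+", "", text).strip()
        # 5) drop trailing hashtags entirely, then strip '#' from the rest
        text = re.sub(r"(?:#\w+\s*)+$", "", text).strip()
        text = text.replace("#", "")
        # 6) discard tweets with fewer than four words
        if len(text.split()) < 4:
            return None
        return text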
To verify that our method works on different types of short texts, we additionally used the Web snippet dataset by Phan et al. [5]. This dataset was created for short text classification and contains Web search snippets for queries belonging to one of eight topics; we used it for the short text clustering task. Table 7 lists the Web snippet (Web) dataset and its statistics.
We employed a bag-of-words model (BOW) as the baseline and ESA as the comparative method. In BOW, we used all words in the short texts except stop words to compute the semantic similarity. In ESA, we employed two parameter
TABLE 7. Web snippet dataset for evaluation and its statistics.
FIGURE 7. Results of short text clustering. The horizontal axis shows the number of non-zero elements of the output vector. The maximum NMI score achieved by each method is given in the figures.
FIGURE 8. Results of short text clustering when dimension reduction of the output vector is applied. The horizontal axis shows the number of non-zero elements of the output vector. The maximum NMI score achieved by each method is given in the figures.
TABLE 8. Average processing time for generating the vector of
TABLE 9. Top related Wikipedia entities obtained with ESA and our method.
not food but computing. Our method thus selects the most
dominant topic of a text and then filters out entities related to
other topics.
VI. CONCLUSION
[5] X.-H. Phan, L.-M. Nguyen, and S. Horiguchi, "Learning to classify short and sparse text & Web with hidden topics from large-scale data collections," in Proc. 17th Int. World Wide Web Conf. (WWW), Apr. 2008, pp. 91–100.
[6] E. Gabrilovich and S. Markovitch, "Computing semantic relatedness using Wikipedia-based explicit semantic analysis," in Proc. Int. Joint Conf. Artif. Intell. (IJCAI), Jan. 2007, pp. 1606–1611.
[7] M. Shirakawa, K. Nakayama, T. Hara, and S. Nishio, "Probabilistic semantic similarity measurements for noisy short texts using Wikipedia entities," in Proc. ACM Int. Conf. Inf. Knowl. Manage. (CIKM), Oct./Nov. 2013, pp. 903–908.
[8] B. Leuf and W. Cunningham, The Wiki Way: Collaboration and Sharing on the Internet. Reading, MA, USA: Addison-Wesley, Apr. 2001.
[9] K. Nakayama, T. Hara, and S. Nishio, "Wikipedia mining for an association Web thesaurus construction," in Proc. Int. Conf. Web Inf. Syst. Eng. (WISE), Dec. 2007, pp. 322–334.
[10] E. Meij, W. Weerkamp, and M. de Rijke, "Adding semantics to microblog posts," in Proc. ACM Int. Conf. Web Search Data Mining (WSDM), Feb. 2012, pp. 563–572.
[11] D. M. Blei, A. Y. Ng, and M. I. Jordan, "Latent Dirichlet allocation," J. Mach. Learn. Res., vol. 3, pp. 993–1022, Mar. 2003.
[12] X. Hu, N. Sun, C. Zhang, and T.-S. Chua, "Exploiting internal and external semantics for the clustering of short texts using world knowledge," in Proc. ACM Conf. Inf. Knowl. Manage. (CIKM), Nov. 2010, pp. 919–928.
[13] D. Metzler, S. Dumais, and C. Meek, "Similarity measures for short segments of text," in Proc. Eur. Conf. Inf. Retr. (ECIR), Apr. 2007, pp. 16–27.
[14] Y. Song, H. Wang, Z. Wang, H. Li, and W. Chen, "Short text conceptualization using a probabilistic knowledgebase," in Proc. Int. Joint Conf. Artif. Intell. (IJCAI), Jul. 2011, pp. 2330–2336.
[15] M. Strube and S. P. Ponzetto, "WikiRelate! Computing semantic relatedness using Wikipedia," in Proc. Nat. Conf. Artif. Intell. (AAAI), Jul. 2006, pp. 1419–1424.
[16] C. Fellbaum, WordNet: An Electronic Lexical Database. Cambridge, MA, USA: MIT Press, May 1998.
[17] S. P. Ponzetto and M. Strube, "Exploiting semantic role labeling, WordNet and Wikipedia for coreference resolution," in Proc. Human Lang. Technol. Conf. North Amer. Chapter Assoc. Comput. Linguistics (HLT-NAACL), Jun. 2006, pp. 192–199.
[18] D. Milne and I. H. Witten, "An effective, low-cost measure of semantic relatedness obtained from Wikipedia links," in Proc. AAAI Workshop Wikipedia Artif. Intell. (WIKIAI), Jul. 2008, pp. 25–30.
[19] Y. Ollivier and P. Senellart, "Finding related pages using Green measures: An illustration with Wikipedia," in Proc. Nat. Conf. Artif. Intell. (AAAI), Jul. 2007, pp. 1427–1433.
[20] E. Yeh, D. Ramage, C. D. Manning, E. Agirre, and A. Soroa, "WikiWalk: Random walks on Wikipedia for semantic relatedness," in Proc. Workshop Graph-Based Methods Natural Lang. Process. (TextGraphs), Aug. 2009, pp. 41–49.
[21] M. Ito, K. Nakayama, T. Hara, and S. Nishio, "Association thesaurus construction methods based on link co-occurrence analysis for Wikipedia," in Proc. ACM Conf. Inf. Knowl. Manage. (CIKM), Oct. 2008, pp. 817–826.
[22] S. Hassan and R. Mihalcea, "Cross-lingual semantic relatedness using encyclopedic knowledge," in Proc. Conf. Empirical Methods Natural Lang. Process. (EMNLP), Aug. 2009, pp. 1192–1201.
[23] M. Yazdani and A. Popescu-Belis, "Computing text semantic relatedness using the contents and links of a hypertext encyclopedia," Artif. Intell., vol. 194, pp. 176–202, Jan. 2013.
[24] M. A. H. Taieb, M. B. Aouicha, and A. B. Hamadou, "Computing semantic relatedness using Wikipedia features," Knowl.-Based Syst., vol. 50, pp. 260–278, Sep. 2013.
[25] R. Mihalcea and A. Csomai, "Wikify! Linking documents to encyclopedic knowledge," in Proc. ACM Conf. Inf. Knowl. Manage. (CIKM), Nov. 2007, pp. 233–242.
[26] G. Salton and C. Buckley, "Term-weighting approaches in automatic text retrieval," Inf. Process. Manage., vol. 24, no. 5, pp. 513–523, 1988.
[27] C. D. Manning and H. Schütze, Foundations of Statistical Natural Language Processing. Cambridge, MA, USA: MIT Press, May 1999.
[28] G. J. Lidstone, "Note on the general case of the Bayes–Laplace formula for inductive or a posteriori probabilities," Trans. Faculty Actuaries, vol. 8, pp. 182–192, 1920.
[29] D. Milne and I. H. Witten, "Learning to link with Wikipedia," in Proc. ACM Conf. Inf. Knowl. Manage. (CIKM), Oct. 2008, pp. 509–518.
MASUMI SHIRAKAWA received the Ph.D. degree in information science and technology from Osaka University, Japan, in 2013. He was an Intern with Microsoft Research Asia in 2011. He is currently a Research Assistant Professor with the Department of Multimedia Engineering, Osaka University. His research interests include Web mining, text mining, knowledge bases, and natural language processing. He is a member of the Association for Computing Machinery and three other learned societies.
KOTARO NAKAYAMA received the Ph.D. degree in information science and technology from Osaka University, Osaka, Japan, in 2008. While he was a student, he started a software development company named Kansai Informatics Laboratory and served as its CEO and Executive Director from 2000 to 2003. Meanwhile, he held a lecturing position with the Doshisha Women's College of Liberal Arts from 2002 to 2004. He was a Post-Doctoral Researcher with Osaka University from 2008 to 2009, and an Assistant Professor/Lecturer with the Center of Knowledge Structuring, University of Tokyo, Japan, from 2009 to 2014. He is currently an Assistant Professor/Lecturer with the School of Engineering, University of Tokyo. His research interests cover Web mining, artificial intelligence, scalability, and practical software development. He is a member of the Association for Computing Machinery and two other CS/AI-related societies.
SHOJIRO NISHIO (F'12) received the B.E., M.E., and Ph.D. degrees from Kyoto University, Japan, in 1975, 1977, and 1980, respectively. He has been a Full Professor with Osaka University since 1992 and was bestowed the prestigious title of Distinguished Professor of Osaka University in 2013. He served as a Vice President and Trustee of Osaka University from 2007 to 2011. He has authored or co-authored over 600 refereed journal and conference papers. He is a Fellow of the Institute of Electronics, Information and Communication Engineers and the Information Processing Society of Japan, and a member of four learned societies, including the Association for Computing Machinery (ACM). He has received numerous awards for his research contributions, including the Medal with Purple Ribbon from the Government of Japan in 2011. He served as the Program Committee Co-Chair of several international conferences, including DOOD 1989, VLDB 1995, and IEEE ICDE 2005. He has served as an Editor of several renowned journals, including the IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, the VLDB Journal, the ACM Transactions on Internet Technology, and Data & Knowledge Engineering.