
IEEE TRANSACTIONS ON EMERGING TOPICS IN COMPUTING
Received 25 February 2014; revised 3 February 2015; accepted 23 March 2015. Date of publication 30 March, 2015;
date of current version 10 June, 2015.
Digital Object Identifier 10.1109/TETC.2015.2418716

Wikipedia-Based Semantic Similarity Measurements for Noisy Short Texts Using Extended Naive Bayes

MASUMI SHIRAKAWA1, KOTARO NAKAYAMA2, TAKAHIRO HARA1, (Senior Member, IEEE), AND SHOJIRO NISHIO1, (Fellow, IEEE)
1 Department of Multimedia Engineering, Graduate School of Information Science and Technology, Osaka University, Osaka 565-0871, Japan
2 School of Engineering, The University of Tokyo, Tokyo 113-8654, Japan

CORRESPONDING AUTHOR: M. SHIRAKAWA (shirakawa.masumi@ist.osaka-u.ac.jp)


This work was supported in part by CPS-IIP Project (Integrated Platforms for Cyber-Physical Systems to Accelerate Implementation of
Efficient Social Systems) in the research promotion program for national level challenges research and development for the realization of
next-generation IT platforms and by the Grant-in-Aid for Scientific Research (A)(26240013) by MEXT, Japan.

ABSTRACT
This paper proposes a Wikipedia-based semantic similarity measurement method that is
intended for real-world noisy short texts. Our method is a kind of explicit semantic analysis (ESA), which
adds a bag of Wikipedia entities (Wikipedia pages) to a text as its semantic representation and uses the vector
of entities for computing the semantic similarity. Adding related entities to a text, not a single word or phrase,
is a challenging practical problem because it usually consists of several subproblems, e.g., key term extraction
from texts, related entity finding for each key term, and weight aggregation of related entities. Our proposed
method solves this aggregation problem using extended naive Bayes, a probabilistic weighting mechanism
based on the Bayes theorem. Our method is especially effective when the short text is semantically noisy,
i.e., when it contains meaningless or misleading terms for estimating its main topic. Experimental results
on Twitter message and Web snippet clustering revealed that our method outperformed ESA for noisy short
texts. We also found that reducing the dimension of the vector to representative Wikipedia entities scarcely
affected the performance while decreasing the vector size and hence the storage space and the processing
time of computing the cosine similarity.
INDEX TERMS

Semantic similarity, semantic representation, naive Bayes, short text clustering.

I. INTRODUCTION

Recently the focus of text analysis has been shifting toward short texts such as microblogs, search queries,
search results, ads, and news feeds. Semantic similarity
measurement between short texts is a fundamental task
and can be used for various applications including text
clustering [1] and text classification [2]. The challenge in
measuring the similarity between short texts lies in their sparsity, i.e., there is likely to be no term co-occurrence between two texts. For example, the two short texts "Apple's new product" and "iPhone 6 was launched" refer to similar topics, even though they share no terms.
semantic representation of short texts using external data or
knowledge is needed.


Wikipedia [3], a collaborative online encyclopedia, can be used as an external knowledge source for short text analysis [1], [4], [5]. Wikipedia has a dense link structure and wide coverage of entities, including named entities, domain-specific entities, and emerging entities. Moreover, the dump
data of Wikipedia can be freely obtained from the Web. These
advantages have encouraged researchers and developers to
use it for applications.
Wikipedia-based Explicit Semantic Analysis (ESA) [6] is
a widely used method to measure the semantic similarity
between texts of any length. ESA creates a vector of related
Wikipedia entities for a given text as its semantic representation and uses the vector for measuring the similarity.
Finding related Wikipedia entities from a text generally
consists of several subproblems such as key term extraction,

related entity finding for each key term, and weight
aggregation of related entities. To solve this problem, ESA simply sums the weighted vectors of related entities for each word based on the majority rule.
The approach of summing up the vectors is not suited for
real-world noisy short texts where both key and irrelevant
terms occur very few times. The majority rule does not work
well due to insufficient information. In such a situation,
focusing on key terms while filtering out noisy terms is
important.
The main purpose of this work is to accomplish better
aggregation of weighted vectors than ESA. Our proposed
method generates the output vector of related Wikipedia
entities from a given text in the same way as ESA. Here,
we define probabilistic scores and introduce extended Naive
Bayes (ENB) to aggregate related entities.
The main contributions of our work are as follows:
- We proposed a probabilistic method of semantic
similarity measurements by adding related Wikipedia
entities as semantics of short texts. Our method is
more robust for noisy short texts than ESA because the
weighting mechanism of our method is based on the
Bayes theorem. Our method can amplify the score
of the related entity that is related to multiple terms
in a text even if each of the terms alone is not
characteristic.
- We carried out experiments on both clean and noisy short texts and demonstrated that our method was effective for noisy short texts. Experimental results on short text similarity benchmark datasets indicated that ESA with well-adjusted parameters was more suited to clean short texts than our method. In contrast to these results, our method was able to outperform ESA when
computing the semantic distance between Twitter
messages or Web snippets. These texts are semantically
noisy because they often contain many meaningless
or misleading terms for understanding their topics.
Moreover, we found that dimension reduction of the
output vector can increase space and time efficiency
while achieving competitive performance.
This paper is organized into the following sections.
Section II describes previous research on short text
analysis and semantic similarity measurements using
Wikipedia. Section III explains why ESA struggles with real-world noisy short texts. Section IV details our
method to find related Wikipedia entities for short texts
to measure the semantic similarity. Section V presents the
experiments we carried out to evaluate our method. Finally,
we conclude the paper in Section VI. A part of this work was presented at the 22nd ACM International Conference on
Information and Knowledge Management (CIKM 2013) [7].
II. RELATED WORK
A. SHORT TEXT ANALYSIS

Short texts differ from traditional documents in their brevity and sparsity, which makes statistical approaches to short texts

less effective. Thus, enriching the semantics of short texts using external knowledge, such as Wikipedia, is essential.
Wikipedia is a large-scale collaborative encyclopedia based on Wiki [8] that enables editing through Web browsers, so articles can be easily created and edited by anyone. This collaborative nature is why Wikipedia covers
a wide range of articles, such as named entities, domain
specific entities, and new entities, in addition to general
entities. As of February 2014, there were over 4 million
articles (entities). Wikipedia also has many features that are useful for knowledge extraction, such as a dense link structure between articles, refined anchor texts (link texts), and entity
disambiguation with URLs [9]. It is notable that the dump
data of Wikipedia can be freely accessed online. For these
reasons, research on Wikipedia mining has been accelerated.
Most of the work on Wikipedia-based short text analysis
focused on specific tasks. Ferragina and Scaiella [4]
proposed a simple and fast method for entity disambiguation (entity linking) for short texts using Wikipedia.
Meij et al. [10] also tackled entity disambiguation by using
various features (e.g., anchor texts, links between articles)
derived from Wikipedia for machine learning. Phan et al. [5]
utilized hidden topics obtained from Wikipedia for learning
the LDA [11] classifier of short texts. Hu et al. [12] exploited
features from Wikipedia for clustering of short texts.
Their work demonstrated that Wikipedia was effective as an
external knowledge source.
Research on representing semantics of short texts was
intended for multiple purposes [6], [13]. Especially, Explicit
Semantic Analysis (ESA) [6] has been widely used because
of its availability and versatility. ESA was developed for
computing word similarity as well as text similarity written
in natural languages. ESA builds a weighted inverted index
that maps each word into a list of Wikipedia articles in
which it appears, and computes the similarity between vectors
generated from two words or texts.
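To make the ESA pipeline described above concrete, the following is a minimal sketch (not the authors' implementation and not Gabrilovich and Markovitch's code) of how an ESA-style entity vector can be built from a weighted inverted index and compared with cosine similarity; the index contents, weights, and whitespace tokenization are invented for illustration.

from collections import defaultdict
from math import sqrt

# Toy weighted inverted index: word -> {Wikipedia article: TF-IDF-like weight}.
# In real ESA this index is built from the full Wikipedia dump (assumption here).
INVERTED_INDEX = {
    "apple":  {"Apple Inc.": 0.8, "Apple": 0.6, "IPhone": 0.3},
    "iphone": {"IPhone": 0.9, "Apple Inc.": 0.7, "IOS": 0.5},
    "tree":   {"Tree": 0.9, "Tree (data structure)": 0.4},
}

def esa_vector(text):
    """Sum the per-word article vectors (ESA's majority-rule aggregation)."""
    vector = defaultdict(float)
    for word in text.lower().split():
        for article, weight in INVERTED_INDEX.get(word, {}).items():
            vector[article] += weight
    return dict(vector)

def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

print(cosine(esa_vector("apple iphone"), esa_vector("apple tree")))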
Song et al. [14] illustrated the availability of ESA
for short text clustering (as a comparative method),
i.e., measuring semantic distance (semantic dissimilarity)
between short texts using ESA. Banerjee et al. [1] also
employed a similar approach to ESA for the purpose of
clustering short texts. Sun et al. [2] utilized ESA to classify
short texts with a support vector machine (SVM), which is a
supervised machine learning technique.
Thus, ESA has been demonstrated to be effective for
measuring semantic similarity for short texts. However,
ESA has a problem in its weighting system when it comes
to analyzing real-world noisy short texts. We will describe
this in Section III.
B. SEMANTIC SIMILARITY MEASUREMENTS

Though this paper focuses on improving ESA, it is worthwhile to describe some representative work on
semantic similarity measurements using Wikipedia.
WikiRelate by Strube and Ponzetto [15] applied several
simple techniques that have been developed for WordNet [16]

to Wikipedia. Given two Wikipedia articles, they specifically compute the distance in the category structure or the overlap
degree between texts. They demonstrated the effectiveness of
Wikipedia-based methods on standard datasets for similarity
measurements (MC, RG, and WordSim353) and coreference
resolution tasks [17]. Milne et al. [18] proposed WLM that
efficiently computes the similarity between two articles using
the overlap degree of their incoming and outgoing links.
Graph-based methods [9], [19], [20] construct a graph
in which nodes are Wikipedia articles and edges are links
between articles. Using the graph, they create a vector
of entities [20] or directly find related entities [9], [19].
Ito et al. [21] proposed link co-occurrence analysis to
speedily build an association thesaurus (defining the
similarity between entities). Hassan and Mihalcea [22]
utilized cross-language links of Wikipedia to compute the
similarity across languages. More recently, hybrid methods have been shown to be more accurate [23], [24]. Yazdani and
Popescu-Belis [23] utilized both text contents and links
in articles, and Taieb et al. [24] leveraged text contents,
categories, Wikipedia category graph, and redirection to
achieve competitive or sometimes better results.
Among the Wikipedia-based methods described above, ESA still has advantages in practical use: it is easily understandable and implementable, robust, applicable to texts written in natural languages, and fast in generating the vector from input texts. In terms of precision, several studies [18], [20], [23], [24] reported that ESA achieved the best performance on the WordSim353 dataset among Wikipedia-based methods. We therefore focus on ESA and propose a better aggregation method based on the Bayes theorem.
III. WEAKNESS OF ESA

Explicit Semantic Analysis (ESA) [6] is a method to represent the semantics of short texts for semantic similarity
measurements. ESA builds a weighted inverted index for all
Wikipedia articles and then creates a vector of articles for
a given text. This vector is used to compute the semantic
similarity between texts. To make a short list of related
Wikipedia entities for a text that contains multiple words,
ESA sums the weighted vectors of related entities for each
word. This simple weighting works well for long texts such
as news articles and web pages because the scores of related
entities belonging to the most dominant topic of the text
naturally increase based on the majority rule.
However, we posit that ESA is not well-designed for
finding related Wikipedia entities for real-world noisy short
texts. Noisy short texts may contain few key terms and some
noisy terms, and the majority rule ESA has employed may
not work well. Noisy terms in this paper indicate meaningless
or misleading terms for grasping the main topic of the
text containing them. It is important to focus on key
terms as well as filter out noisy terms to correctly derive
related entities from short texts and understand their
topics.

Noisy terms cannot be filtered out statically because a noisy term in one text can be a key term in another text, depending on the context. For example, the general term tree may be a noisy term in many texts, but it can be a key term that indicates a data structure in the domain of computer science. A plant tree can also be a key term in the topic of botany. Even named entities can be noisy terms in some situations. The city name Liverpool may merely indicate John Lennon's birthplace while the main topic of the text is popular music. Uniformly giving low scores to such noisy terms therefore does not resolve the problem.
Another issue is that a vector generated by ESA tends to be just a mixture of entities strongly related to each respective term. This situation is undesirable especially when the text contains ambiguous terms. For example, iPhone is strongly related to the term Apple but not to the short text "Apple is a tree", where the context is not about Apple Inc. In this case, ESA sums the vectors of Apple and tree. Though the vector of tree does not contain iPhone, ESA cannot subtract the score of iPhone generated from Apple. Thus, the output vector ends up containing iPhone, which is strongly related only to Apple. Under the majority rule, the score of such incorrect entities becomes relatively small. However, the majority rule does not hold in short texts.
IV. OUR METHOD

To achieve robust finding of related Wikipedia entities for short texts to measure semantic similarity, we propose a
method that adapts Wikipedia-based techniques to define
probabilistic scores and integrates the scores based on the
Bayes theorem. As described in Section III, ESA is not
suited to real-world noisy short texts because of its simple
weighting mechanism of summing weighted vectors. Our
method addresses the problem by extending the Naive Bayes method, which enables us to emphasize key terms while filtering out noisy terms. The definitions of terms and symbols that
are used in this and later sections are summarized in Table 1.
Here we define "linked" as the mapping of a term (surface form) to corresponding entities (unambiguous meanings), and "related" as the mapping of a term, an entity, or a set of them to related entities.
Figure 1 outlines our method. Our method obtains probabilistic scores for key terms and related entities by analyzing
Wikipedia. After that, our method synthesizes these
probabilities and computes the output vector of related
entities using extended Naive Bayes (ENB). To measure the
semantic similarity between two texts, the similarity of their related entity lists ranked by P(c|T), the probability that related entity c is related to a set of key terms T, is computed using the cosine or other metrics.
Our method solves the compound problem of key term
extraction, related entity finding, and the aggregation of
related entities in a probabilistic scheme. In Section IV-A,
we explain the probabilistic scores of key terms and
related entities, as well as the prior probabilities of entities.
In Section IV-B, we describe how to aggregate related
entities for each key term using the probabilities introduced in Section IV-A.

TABLE 1. Definition of terms and symbols.

FIGURE 1. Outline of our method to find related Wikipedia entities for texts.
A. PROBABILISTIC SCORES FROM WIKIPEDIA
1) KEY TERM EXTRACTION

P(t ∈ T), which is the probability that term t in a text T is a key term, is computed using anchor texts in Wikipedia articles [25]. According to the editorial policy of Wikipedia called wikify,1 a specific term in a Wikipedia article that indicates another article (entity) should be linked to that article. Hence, the more often a term is selected as an anchor text for a corresponding article, the more likely the term is to be important. Based on this heuristic, we use the rate of articles that contain a term as an anchor text. According to the literature [25], this method of extracting key terms outperformed other common techniques, such as TF-IDF [26] and the χ² independence test [27].

Given that CountArticlesHavingAnchortexts(t) is the number of articles that contain term t as an anchor text and CountArticlesHavingTerms(t) is the number of articles that contain term t, the probability is computed as

P(t \in T) = \frac{CountArticlesHavingAnchortexts(t)}{CountArticlesHavingTerms(t)}.   (1)
1 http://en.wikipedia.org/wiki/Wikipedia:WikiProject_Wikify


In order to avoid black or white probabilities (i.e., 0 or 1), Laplace smoothing [28] is introduced.
Figure 2 illustrates an example of how to compute P(t ∈ T) using anchor texts. The probability P(t ∈ T) for the term Apple is computed from the ratio of articles in which Apple appears as an anchor text at least once. Table 2 summarizes examples of the probability P(t ∈ T). Specific terms, such as Apple Inc. and Steve Jobs, become key terms with high probability, whereas general terms, such as black and house, are unlikely to be chosen as key terms (i.e., they tend to be noisy terms).

FIGURE 2. How to compute P(t ∈ T), which is the probability that term t is a key term. The example uses the term Apple.

TABLE 2. Example of P(t ∈ T).
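As a concrete illustration of Equation (1) with Laplace smoothing, the sketch below estimates P(t ∈ T) from two hypothetical article counts; the count values and the add-one smoothing constant are assumptions for illustration, not the authors' exact implementation.

def keyphraseness(count_articles_having_anchortexts, count_articles_having_terms, alpha=1.0):
    """P(t in T): rate of articles that use the term as an anchor text,
    with Laplace (add-alpha) smoothing to avoid probabilities of exactly 0 or 1."""
    return (count_articles_having_anchortexts + alpha) / (count_articles_having_terms + 2.0 * alpha)

# Hypothetical counts: a specific term is almost always linked, a general term rarely is.
print(keyphraseness(9500, 10000))   # ~0.95 -> likely a key term
print(keyphraseness(120, 20000))    # ~0.006 -> likely a noisy term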
2) RELATED ENTITY FINDING

P(c|t), which is the probability that related entity c is related to term t, is computed from P(e|t) and P(c|e).
P(e|t), which is the probability that term t is linked to entity e, is computed using anchor texts and their destination articles [29]. Following the wikify policy, a specific term that indicates an entity is linked to the corresponding article. This term then becomes an anchor text for the entity. The relationship between terms and entities can therefore be extracted by analyzing anchor texts. Given that CountAnchortexts(t, e) is the number of times that the anchor text t is linked to entity e, the probability is computed as follows:

P(e|t) = \frac{CountAnchortexts(t, e)}{\sum_{e_i \in E} CountAnchortexts(t, e_i)}.   (2)

E denotes the set of all Wikipedia entities.
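For illustration only, Equation (2) amounts to normalizing the anchor-text counts observed for one term; the toy counts below are assumptions, not statistics from the Wikipedia dump used in the paper.

def link_probabilities(anchor_counts):
    """P(e|t) for one term t: normalize CountAnchortexts(t, e) over all entities e."""
    total = sum(anchor_counts.values())
    return {entity: count / total for entity, count in anchor_counts.items()}

# Hypothetical anchor-text counts for the term "Apple".
print(link_probabilities({"Apple Inc.": 7000, "Apple": 2500, "Apple Records": 500}))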


Figure 3 illustrates an example of how P(e|t) is computed using anchor texts and their destinations. The term Apple most likely means the entity Apple Inc. because the term is frequently linked to the article on Apple Inc. Table 3 lists the probability of each entity to which the term Apple is linked (the top eight entities). In most cases, the term Apple means Apple Inc., the IT company; Apple, the fruit; or Apple Records, the record label.

FIGURE 3. How to compute P(e|t), which is the probability that term t is linked to an entity e. The example uses the term Apple.

TABLE 3. Example of P(e|t) when term t is Apple.
P(c|e), which is the probability that related entity c is related to entity e, is computed based on the incoming and outgoing links of e. We introduce two methods here: a link-based method and an ESA-based recalculation method.

The link-based method simply uses the number of links between e and c. Given that CountLinks(e, c) is the number of links (regardless of whether they are incoming or outgoing links) between e and c, the probability is computed as

P(c|e) = \frac{CountLinks(e, c)}{\sum_{c_j \in C} CountLinks(e, c_j)}.   (3)

C denotes the set of all related entities. We can use all Wikipedia articles for C, but just using representative Wikipedia articles works reasonably well in practice (Section V-B.3).

The ESA-based method recomputes the score for e and c using ESA. There are other choices for the recomputation, including WikiRelate [15] and WLM [18]. However, many studies [18], [20], [23], [24] reported that ESA stably outperformed other Wikipedia-based methods as a word similarity measurement. We therefore employed ESA for recomputing P(c|e). Specifically, given that EsaSim(e, c) is the cosine similarity of the two ESA vectors generated from e and c, the probability is computed as follows:

P(c|e) = \frac{EsaSim(e, c)}{\sum_{c_j \in C} EsaSim(e, c_j)}.   (4)

The similarity scores are normalized in order to convert them into probabilities.
Figure 4 illustrates an example of how P(c|e) is computed
using incoming and outgoing links of entity e. Note that our
method does not distinguish between incoming and outgoing links because it just needs related entities, regardless of the type of relation. Table 4 summarizes the related entities of
an entity, Apple Inc., and their probabilities (the top eight
entities). It is clear from these tables that the top related
entities are strongly related to Apple Inc.
By using Equations (2), (3), and (4), P(c|t), the probability that related entity c is related to term t (concretely, the probability that term t is linked to entity e_i and related entity c is related to entity e_i), is computed as

P(c|t) = \sum_{e_i \in E} P(c|e_i) P(e_i|t).   (5)
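Putting Equations (2)-(5) together, a minimal sketch of computing P(c|t) by marginalizing over candidate entities could look as follows; the probability tables below are toy assumptions rather than values produced by the paper's pipeline.

from collections import defaultdict

def related_entity_probabilities(p_entity_given_term, p_related_given_entity):
    """Equation (5): P(c|t) = sum over entities e of P(c|e) * P(e|t)."""
    p_c_given_t = defaultdict(float)
    for entity, p_e in p_entity_given_term.items():
        for related, p_c in p_related_given_entity.get(entity, {}).items():
            p_c_given_t[related] += p_c * p_e
    return dict(p_c_given_t)

# Toy distributions for the term "Apple" (illustrative values only).
p_e_given_t = {"Apple Inc.": 0.7, "Apple": 0.3}
p_c_given_e = {
    "Apple Inc.": {"IPhone": 0.5, "Steve Jobs": 0.5},
    "Apple":      {"Malus sieversii": 0.6, "Fruit": 0.4},
}
print(related_entity_probabilities(p_e_given_t, p_c_given_e))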

3) PRIOR PROBABILITY OF ENTITIES

P(c), which is the prior probability of related entity c, represents the generality of c. Because our method computes P(c|t) using links between articles, we also use them to determine the prior probability. Namely, the prior probability is proportional to the number of incoming and outgoing links. Given that CountLinks(c) is the number of incoming and outgoing links that a related entity c has, the prior probability is computed as

P(c) = \frac{CountLinks(c)}{\sum_{c_j \in C} CountLinks(c_j)}.   (6)

FIGURE 4. How to compute P(c|e), which is the probability that related entity c is related to entity e. The example is for entity Apple Inc.

TABLE 4. Example of related entities of entity Apple Inc. and their probability P(c|e).

B. EXTENDED NAIVE BAYES

We attempt to integrate the probabilistic scores extracted from Wikipedia to find related entities for texts. First, we start by assuming that multiple key terms are input. In other words, we calculate P(c|T') for a set of key terms T' = {t_1, ..., t_K}.2 Using P(c|t) and P(c), P(c|T') can be derived using conventional Naive Bayes [14]. Given that each term t is conditionally independent, the probability is specifically computed by

P(c|T' = \{t_1, \ldots, t_K\}) = \frac{P(c) \prod_{k=1}^{K} P(t_k|c)}{P(T' = \{t_1, \ldots, t_K\})}
                              = \frac{\prod_{k=1}^{K} P(c|t_k) P(t_k)}{P(T' = \{t_1, \ldots, t_K\}) \, P(c)^{K-1}}
                              = \frac{\prod_{k=1}^{K} P(c|t_k)}{P(c)^{K-1}}.   (7)

2 In this paper, we use T for a set of key terms whose members cannot be observed and T' (with a prime) for a set of key terms whose members can be observed.

Next, we tackle the case where the members of T cannot be observed, i.e., it is not clear whether a term in a text is a key term or not. Because candidates for the key terms in a text can be determined using anchor texts and titles of Wikipedia, this assumption matches the setting considered in this work. One possible approach to this challenge is a two-phase method that first determines the key terms and then applies conventional Naive Bayes to them. However, this approach raises the problem of how the key terms are determined. Threshold-based methods can be employed to select or discard terms, although this requires parameter adjustment. Adjusting thresholds is difficult because the optimal thresholds may change from text to text.

Instead of using threshold-based methods, we propose extended Naive Bayes (ENB). The basic idea of ENB was originally proposed for entity disambiguation in our technical report [30]. In this paper, we develop ENB for finding related entities. ENB can be applied to a set whose members are probabilistically determined. Given a set of key terms T, P(c|T') is computed for all possible states T'. Figure 5 outlines an example of ENB for a set of candidate key terms t_1, ..., t_K. ENB computes P(T = T'), which is the probability that the set of key terms T takes state T'. It then computes P(c|T') for each state T' and sums up P(c|T') weighted by P(T = T').

FIGURE 5. Extended Naive Bayes (ENB) for a set of key terms whose members cannot be observed.

Given that each term t is conditionally independent, P(T = T') is computed as

P(T = T') = \prod_{t_k \in T'} P(t_k \in T) \prod_{t_k \notin T'} P(t_k \notin T)
          = \prod_{t_k \in T'} P(t_k \in T) \prod_{t_k \notin T'} \left(1 - P(t_k \in T)\right).   (8)
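As a small aside, Equation (8) can be evaluated directly for any one observed state T'; the sketch below does exactly that, with made-up key-term probabilities for the text "Apple and tree" (the values and variable names are assumptions for illustration).

def state_probability(keyterm_probs, state):
    """Equation (8): P(T = T') for one observed state T' (a set of terms)."""
    p = 1.0
    for term, p_key in keyterm_probs.items():
        p *= p_key if term in state else (1.0 - p_key)
    return p

# Hypothetical P(t in T) values for the text "Apple and tree".
probs = {"Apple": 0.9, "and": 0.01, "tree": 0.4}
print(state_probability(probs, {"Apple", "tree"}))   # both treated as key terms
print(state_probability(probs, {"Apple"}))           # only "Apple" treated as a key term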


Therefore, related entities are estimated by using the ENB in Figure 5, utilizing Equations (7) and (8):

P(c|T) = \sum_{T'} P(T = T') P(c|T')
       = \sum_{T'} P(T = T') \frac{\prod_{t_k \in T'} P(c|t_k)}{P(c)^{|T'|-1}}.   (9)

Here, |T'| denotes the number of key terms contained in T'. The computation of Equation (9) requires time exponential in the number of terms K because it separately applies conventional Naive Bayes to each state T'. Equation (9) can be decomposed according to the dual cases t_k \in T' and t_k \notin T' as

P(c|T) = \sum_{T'} P(T = T') \frac{\prod_{t_k \in T'} P(c|t_k) \prod_{t_k \notin T'} P(c)}{P(c)^{|T'|-1} \, P(c)^{K-|T'|}}
       = \frac{\sum_{T'} \prod_{t_k \in T'} P(t_k \in T) P(c|t_k) \prod_{t_k \notin T'} \left(1 - P(t_k \in T)\right) P(c)}{P(c)^{K-1}}.   (10)

The numerator of Equation (10) is then decomposed for each t_k to compute it efficiently:

\sum_{T'} \prod_{t_k \in T'} P(t_k \in T) P(c|t_k) \prod_{t_k \notin T'} \left(1 - P(t_k \in T)\right) P(c)
  = \left(P(t_1 \in T) P(c|t_1) + \left(1 - P(t_1 \in T)\right) P(c)\right)
    \sum_{T'} \prod_{t_k \in T', t_k \neq t_1} P(t_k \in T) P(c|t_k) \prod_{t_k \notin T', t_k \neq t_1} \left(1 - P(t_k \in T)\right) P(c)
  = \left(P(t_1 \in T) P(c|t_1) + \left(1 - P(t_1 \in T)\right) P(c)\right) \left(P(t_2 \in T) P(c|t_2) + \left(1 - P(t_2 \in T)\right) P(c)\right)
    \sum_{T'} \prod_{t_k \in T', t_k \notin \{t_1, t_2\}} P(t_k \in T) P(c|t_k) \prod_{t_k \notin T', t_k \notin \{t_1, t_2\}} \left(1 - P(t_k \in T)\right) P(c)
  = \cdots
  = \prod_{k=1}^{K} \left(P(t_k \in T) P(c|t_k) + \left(1 - P(t_k \in T)\right) P(c)\right).

As a result, the following expression is derived:

P(c|T) = \frac{\prod_{k=1}^{K} \left(P(t_k \in T) P(c|t_k) + \left(1 - P(t_k \in T)\right) P(c)\right)}{P(c)^{K-1}}.   (11)

Consequently, Equation (11) replaces each probability P(c|t_k) in conventional Naive Bayes (Equation (7)) with a linear combination of P(c|t_k) and the prior probability P(c). We found that the form of this linear combination is the same as the bigram interpolated model of Jelinek-Mercer smoothing [31], with P(t_k \in T) playing the role of the smoothing weight. That is, ENB naturally includes a smoothing mechanism governed by P(t_k \in T) that focuses on key terms while filtering out noisy terms. We define P(c|t_k)_{JM} as the probability that related entity c is related to the set of key terms T under Jelinek-Mercer smoothing:

P(c|t_k)_{JM} = P(t_k \in T) P(c|t_k) + \left(1 - P(t_k \in T)\right) P(c).   (12)

Equation (11) can then be written in the same form as conventional Naive Bayes:

P(c|T) = \frac{\prod_{k=1}^{K} P(c|t_k)_{JM}}{P(c)^{K-1}}.   (13)
Using Equation (13) or (11), the scores of related entities that are related to multiple terms are amplified. That is, the scores of noisy related entities are relatively decreased. Figure 6 shows an example of ENB for the input text "Apple and tree". The strongly related entities for each term include some noisy ones, such as iPhone (for Apple) and Tree structure (for tree). ENB can filter out such related entities and obtain those related to both terms, such as Fire blight and Malus sieversii. A ranked list of related entities obtained with ENB can therefore be more refined than that obtained by ESA, which is just based on the majority rule for related entity aggregation.

FIGURE 6. Example of Extended Naive Bayes (ENB) for the input text "Apple and tree". For each term t, the probability of related entity c is computed by a linear combination of P(c|t) (if t is a key term) and its prior probability P(c) (if t is not a key term).

On the assumption that there is at least one key term, P(t_k ∈ T) can be normalized by dividing it by the maximum probability. Also, P(c|T) may require normalization because P(t_k ∈ T), P(c|t_k), and P(c) are approximate probabilities. The similarity of the related entity lists ranked by P(c|T) (output vectors) obtained from two texts is computed using some similarity function such as the cosine.
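Because the derivation above reduces the aggregation to a simple product over terms, an ENB scoring step can be sketched as below; this is a toy illustration of Equations (11)-(13) under invented probability tables, not the authors' released implementation, and it also shows the cosine comparison of two output vectors mentioned above.

from math import sqrt

def enb_scores(terms, p_key, p_rel, p_prior):
    """Equations (11)-(13): score every related entity c for a set of candidate
    key terms, using the Jelinek-Mercer-style linear combination per term."""
    scores = {}
    for c in p_prior:
        score = 1.0
        for t in terms:
            pk = p_key.get(t, 0.0)                       # P(t in T)
            pct = p_rel.get(t, {}).get(c, 0.0)           # P(c|t)
            score *= pk * pct + (1.0 - pk) * p_prior[c]  # Equation (12)
        scores[c] = score / (p_prior[c] ** (len(terms) - 1))  # Equation (11)
    return scores

def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

# Toy probabilities for the text "Apple tree" (illustrative values only).
p_key = {"Apple": 0.9, "tree": 0.4}
p_rel = {
    "Apple": {"IPhone": 0.4, "Malus sieversii": 0.1, "Fire blight": 0.05},
    "tree":  {"Tree (data structure)": 0.3, "Malus sieversii": 0.2, "Fire blight": 0.2},
}
p_prior = {"IPhone": 0.02, "Malus sieversii": 0.001, "Fire blight": 0.001,
           "Tree (data structure)": 0.01}
scores = enb_scores(["Apple", "tree"], p_key, p_rel, p_prior)
print(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))

With these invented numbers, entities related to both terms (Malus sieversii, Fire blight) are amplified relative to entities tied to only one term (IPhone, Tree (data structure)), which mirrors the behavior illustrated in Figure 6.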
C. IMPLEMENTATION

We implemented our method using the Berkeley DB3 [32] and marisa-trie4 (a trie index [33] implementation). The
Berkeley DB was used to store the probabilities given
in Section IV-A, and marisa-trie was used to extract terms
(candidates of the key term) from texts. The trie index generally extracts all possible terms registered in knowledge bases; hence, we adopted the longest-match approach to detect appropriate terms. For example, from a snippet ". . . and New York Times said . . .", we only extract New York Times and discard New York and Times. In rare cases, two terms may cross each other. When this happens, we use both terms as candidates of the key term because it is also rare for both probabilities, P(t ∈ T), to be high. We employed anchor texts and titles as candidates of the key term. We discarded terms that appeared
less than three times in Wikipedia articles to filter out invalid terms. In order to reduce the computation time, we also discarded candidates of the entity for each key term that were ranked below 20 and related entities for each entity that were ranked below 1,000. From a preliminary experiment, we confirmed that limiting these numbers scarcely affects the output. We used an English version of the Wikipedia dump data as of March 6, 2009. Our implementation can be found on the Web.5

3 http://www.oracle.com/technetwork/products/berkeleydb/
4 http://code.google.com/p/marisa-trie/
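A rough sketch of the longest-match term detection described above is given below; it stands in a plain Python set for the marisa-trie index, and the vocabulary and maximum phrase length are toy assumptions.

def longest_match_terms(tokens, vocabulary, max_len=5):
    """Greedy longest-match detection of candidate key terms over a token list.
    `vocabulary` stands in for the trie of anchor texts and titles (assumption)."""
    terms, i = [], 0
    while i < len(tokens):
        match = None
        for length in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + length])
            if candidate in vocabulary:
                match = (candidate, length)
                break
        if match:
            terms.append(match[0])
            i += match[1]
        else:
            i += 1
    return terms

vocab = {"New York Times", "New York", "Times", "Apple"}
print(longest_match_terms("and New York Times said".split(), vocab))
# -> ['New York Times']  (the shorter overlapping terms are discarded)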
V. EVALUATION
A. SEMANTIC SIMILARITY MEASUREMENTS
ON BENCHMARK DATASETS

We evaluated our method and ESA with a variety of parameter combinations on benchmarks of short text semantic similarity. We particularly leveraged the Pilot short text semantic similarity benchmark dataset [34], which contains 30 sentence pairs and their similarity scores rated by 32 people. Additionally, we created three datasets using ConceptSim [35] and WordNet [16]. We followed the manner of the literature [34] to build short text similarity datasets, i.e., we replaced a synset (a single meaning of a word) of WordNet with its definition. As a result, we obtained three datasets based on the gold standards of the word similarity datasets MC [36], RG [37], and WordSim353 (WS) [38], [39]. Spearman's rank correlation coefficient is used to compare the similarity scores with those given by human judgments.
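For reference, Spearman's rank correlation between method scores and human judgments can be computed as in the following sketch; the two score lists are placeholders, not values from the benchmark datasets.

from scipy.stats import spearmanr

# Placeholder similarity scores: one value per sentence pair.
human_scores = [0.1, 0.4, 0.35, 0.9, 0.7]
method_scores = [0.2, 0.3, 0.5, 0.8, 0.6]

rho, p_value = spearmanr(human_scores, method_scores)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3f})")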
5 http://sigwp.org/wikibbq/


We examined 16 combinations of parameter settings of ESA: keyphraseness [25] (KEY) or IDF [26] for key term
extraction, count of anchor texts (A) or logarithmic count of
anchor texts (logA) for linking a key term to entities, count
of links (L) or logarithmic count of links (logL) for finding
related entities from an entity, and cosine-normalized scores
of related entities (COS) or unnormalized scores. Moreover, we also implemented the original ESA according to the literature [6] by Gabrilovich and Markovitch. Since our method and ESA generate a ranked list of entities as output, we used
the top 100, 200, 500, 1,000, and 2,000 entities to compute
similarity scores.6 We then marked the best score among them
per method.
TABLE 5. Spearman's rank correlation coefficient for the short text similarity datasets.

Table 5 shows the results of semantic similarity measurements on the benchmark datasets. Our method outperformed ESA with KEY-A-L (the parameter settings that are the same as our method's) for all the datasets. Compared with the original ESA, the performance of our method was marginally better. However, by adjusting parameters, ESA was able to achieve higher scores than our method (e.g., IDF-A-logL-COS, IDF-logA-logL-COS) for the MC and RG datasets. These datasets consist of formal texts, and ESA is accurate enough to measure the semantic similarity between such short texts. Our method has no significant advantage on these datasets because it is not necessary to filter out noisy terms to correctly grasp the topic of the texts.

6 We did not evaluate the methods when the number of related entities was less than 100, because the similarity scores became 0 for many unrelated sentence pairs. Spearman's rank correlation coefficient cannot be measured properly when many of the scores are identical.
In spite of these results, our method is more effective than
ESA for real-world noisy short texts. In Section V-B, we will demonstrate that our method can surpass ESA even with its best parameter settings adjusted on these datasets (i.e., IDF-A-logL-COS).
B. CLUSTERING OF NOISY SHORT TEXTS
1) SETUP

We carried out an experiment on clustering of Twitter messages (tweets)7 and Web snippets. Under the same clustering algorithm, the clustering performance depends on how the semantic distance (semantic dissimilarity) is measured. Thus, the performance of semantic similarity measurements can be evaluated via clustering. We employed k-means [40] as the clustering algorithm.
7 In Twitter, users can post and share tweets (messages) up to
140 characters in real time.
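A minimal sketch of this evaluation protocol (cluster the entity vectors with k-means and score the result against hashtag-derived labels with NMI) might look as follows; the feature matrix, labels, and cluster count are random placeholders rather than the Twitter data used in the paper.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score

rng = np.random.default_rng(0)
X = rng.random((60, 20))              # placeholder: rows = short texts, cols = entity weights
true_labels = rng.integers(0, 4, 60)  # placeholder: hashtag-derived topic labels

nmi_scores = []
for run in range(20):                 # 20 runs with random initial clusters
    kmeans = KMeans(n_clusters=4, n_init=1, random_state=run)
    predicted = kmeans.fit_predict(X)
    nmi_scores.append(normalized_mutual_info_score(true_labels, predicted))

print("average NMI:", sum(nmi_scores) / len(nmi_scores))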

We utilized the hashtags, which are defined by Twitter, to create Twitter datasets for clustering tasks. Hashtags are tags, such as #Obama and #MacBook, that Twitter
users intentionally add to their tweets in order to clarify the
topic of the tweet [41]. Hashtags are often used to create
datasets for short text clustering [14], [42]. In our experiment,
we first sampled tweets during January 2012 and then listed
frequently occurring hashtags (topics). Among them, we
selected independent, unambiguous hashtags so that each
cluster contained a maximum of appropriate tweets. Note that
collected tweets still contain ambiguous terms and therefore
this setup does not ease the clustering task. Also, we did not
check the output vector for collected tweets before selecting
the hashtags to ensure the fairness.
Table 6 lists four Twitter datasets and their statistics
that were used in the evaluation. The first dataset includes
six discrete topics (D) from politics, entertainment, sports,
information technology, health, and religion. The second
dataset assumed fine-grained topics of IT products (IT)
and the third dataset assumed sports (S). We built these
two datasets to investigate the ability to distinguish similar but different topics. In addition, we built a dataset that contains both the IT and S datasets (Mix) to better imitate a real situation. The procedure for constructing the datasets was as follows: 1) search tweets by using predefined hashtags and store those written in English, 2) delete tweets that contain more than one predefined hashtag, 3) delete retweets (tweets starting with RT), 4) remove URLs in tweets, 5) remove hashtags at the end of tweets (to hide explicit clues for the topic) and the # of hashtags not at the end of tweets, and 6) delete tweets that contain fewer than four words.
To verify that our method works on different types of short texts, we additionally used the Web snippet dataset by Phan et al. [5]. The Web snippet dataset was created for short text classification and contains Web search snippets for queries belonging to one of eight topics. We used it for the short text clustering task. Table 7 lists the Web snippet (Web) dataset and its statistics.
We employed a bag-of-words model (BOW) as the baseline
and ESA as the comparative method. In BOW, we used
all words except stop words in short texts to compute the
semantic similarity. In ESA, we employed two parameter

settings: the same parameter settings as our method (ESA-same) and the best parameter settings for the benchmark datasets in Section V-A (ESA-adjusted, i.e., IDF-A-logL-COS). We used the top 10, 20, 50, 100, 200, 500, 1,000, and 2,000 related entities for measuring semantic similarity in both ESA and our method. We did not use methods combining BOW with the Wikipedia-based methods (ESA and our method) because the purpose of this experiment was to assess the performance of each method for semantic similarity measurements.

We employed normalized mutual information (NMI) [43] as the metric to evaluate the performance. NMI expresses scores based on information theory and is regarded as one of the most reliable metrics for clustering. NMI scores are between 0 and 1, and larger scores indicate better results. We conducted k-means clustering 20 times with random initial clusters and recorded the average score for each method.

TABLE 6. Four Twitter datasets for evaluation and their statistics.

TABLE 7. Web snippet dataset for evaluation and its statistics.

2) RESULTS

FIGURE 7. Results of short text clustering. The horizontal axis indicates the number of non-zero elements of the output vector. Maximum NMI scores achieved by each method are described in the figures.

Figure 7 shows the results of tweet clustering (the maximum NMI scores achieved by each method are given in the figure). The horizontal axis indicates the number of non-zero elements of the output vector. In comparison with the
bag-of-words (BOW) method, our methods achieved good
performance because they were able to finely enhance the
semantics of short texts to increase the co-occurrences
of Wikipedia entities among tweets. ESA-adjusted also
achieved better performance than BOW in IT, S, and
Mix datasets. From Tables 6 and 7, the average number of
words per text is between 13 and 18. This indicates that
there are few co-occurrences of terms in short texts and
the BOW method often fails to measure semantic similarity
between tweets. The same tendency can be observed in the
literature [14], which has reported that BOW or statistical
approaches, such as LDA [11], are ineffective for computing
semantic distance in short text clustering. The features generated by ESA-same were not superior to BOW because of its inappropriate parameter settings.
The best performance of our method (ESA-based) was
better than that of ESA-adjusted in D, S, Mix, and Web
datasets with the p-value < 0.01. Our method (link-based)
also achieved better performance than ESA-adjusted in

FIGURE 7. Results of short text clustering. The horizontal axis means the number of non-zero elements of the output vector.

Maximum NMI scores achieved by each method are described in the figures.

214

VOLUME 3, NO. 2, JUNE 2015

IEEE TRANSACTIONS ON

EMERGING TOPICS
IN COMPUTING

Shirakawa et al.: Wikipedia-Based Semantic Similarity Measurements for Noisy Short Texts

D, S, and Mix datasets (p-value < 0.01). Our methods


were not at least inferior to ESA-adjusted. It is noteworthy
that ESA-adjusted achieved the significant performance for
short text semantic similarity datasets and outperformed our
methods. From the results, our methods are more suited to
real-world noisy short texts than ESA even if the parameters
of ESA are well-adjusted.
The best performances by our methods (link-based and
ESA-based) were similar in most datasets. However, the
ESA-based method was able to achieve better performance
than the link-based method when the performances with
the same number of related entities (non-zero elements)
were compared. Because using fewer non-zero elements in the vector saves computation time for similarity measurements and storage space, the ESA-based method seems to be more favorable than the link-based
method. It is possible that the link-based method obtained
too fine-grained related entities to discriminate the topics
of the dataset. This can explain why the link-based method
was able to be competitive with the ESA-based method in
some datasets when 500 or more related entities were used to
measure the semantic similarity. This hypothesis is supported by the evaluation of the effect of dimension reduction (Section V-B.3).
With respect to datasets, more related entities resulted
in better performance in D and Web datasets whereas too
many entities deteriorated performance in other datasets. This
is because IT, S, and Mix datasets include similar topics
that share the same major topics. In these datasets, adding
fine-grained related entities, while discarding coarsely related
entities, is important to distinguish similar topics. The Mix dataset contains both totally different topics and similar topics all together, and the Wikipedia-based methods still achieved better performance than the BOW method. This demonstrates that the semantic representation by Wikipedia entities is effective in real environments where topics of various granularity are contained in texts. In the IT dataset, the difference between the performances of our methods and ESA-adjusted was not remarkable. This is because tweets in the IT dataset mostly contain unambiguous and characteristic terms. In such situations, ESA-adjusted was able to focus on the most characteristic terms in texts to correctly derive related entities. In the Web dataset, our method (link-based) could not come close to the best performance of our method (ESA-based) because each topic of the dataset is broad (e.g., Business). This can be improved by effectively reducing the dimension of the vector of Wikipedia entities (see Section V-B.3).
A good example that illustrates the advantages of our method over ESA-adjusted is "Kobe's 48 will be the highlight of the Lakers season lol" (the topic is NBA). Of all the terms in this sentence, Kobe (indicating NBA player Kobe Bryant) and Lakers (indicating the NBA team Los Angeles Lakers) are key terms, and highlight and lol are likely to be noisy terms. Additionally, Kobe is highly ambiguous, as it usually denotes the Japanese city Kobe. The outputs of ESA-adjusted and ESA-same contained many unrelated entities that were derived from the noisy or ambiguous terms. Only the proposed methods were able to derive many fine-grained related entities, such as basketball players who belong or belonged to the Los Angeles Lakers, by filtering out noisy terms and amplifying related entities that were related to multiple key terms.
3) EFFECT OF DIMENSION REDUCTION

We examined how dimension reduction of the vector of Wikipedia entities affects the performance. Normally the set of all related entities C, which defines the dimension of the output vector, is equal to the set of all entities E. We reduced C to a set much smaller than E using a simple rule-based technique. Specifically, we used entities that have a Wikipedia category of the same name (allowing plural forms and parentheses). This technique can extract representative Wikipedia entities across domains. The number of dimensions of C became 98,311 (the number of all entities E is 2,810,087). We then recomputed the probabilistic scores P(c|e), P(c|t), and P(c).
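The selection rule can be sketched as a simple title-versus-category-name match, as below; the title normalization (dropping a trailing parenthetical, allowing a plural "s") and the toy inputs are assumptions approximating the rule described above, not the exact implementation.

import re

def is_representative(entity_title, category_names):
    """Keep an entity if a Wikipedia category with the same name exists,
    allowing plural forms and trailing parentheses (an approximation of the rule)."""
    base = re.sub(r"\s*\(.*\)$", "", entity_title).lower()   # drop a "(...)" suffix
    singular_or_plural = {base, base + "s", base.rstrip("s")}
    return any(name.lower() in singular_or_plural for name in category_names)

categories = {"Smartphones", "Fruits", "Operating systems"}
print(is_representative("Smartphone", categories))       # True: matches "Smartphones"
print(is_representative("IPhone 5S", categories))        # False: no matching category
print(is_representative("Fruit (botany)", categories))   # True: "(botany)" is dropped, matches "Fruits"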
FIGURE 8. Results of short text clustering when the dimension reduction of the output vector is applied. The horizontal axis indicates the number of non-zero elements of the output vector. Maximum NMI scores achieved by each method are described in the figures.

Figure 8 shows the clustering results when the dimension reduction of the output vector is applied to each method. Compared to Figure 7, the peaks moved to the left in all datasets, i.e., the best performances were achieved with fewer non-zero elements of the vector. Unlike in Figure 7, our method (link-based) tended to outperform the other methods, including our method (ESA-based). The best performance of our method (link-based) was superior to that of ESA-adjusted in the D, S, and Mix datasets, and to that of our method (ESA-based) in the D and S datasets, with p-value < 0.01. There were no significant differences in the best performance on the other datasets. Also, our method (link-based) achieved better performance than the other methods in most cases when the same number of related entities was used.
From the results above, we can say that the link-based
method is capable of generating more accurate related entities
than the ESA-based method. Using all Wikipedia entities as
the dimension of the output vector, related entities generated
by the link-based method were too fine-grained to discriminate the given topics of datasets (Figure 7). By reducing
the dimension to representative entities, the output vector of
the link-based method stably achieved good performances
(Figure 8). Just using around 50 or 100 related entities was enough to discriminate topics of various granularity.
C. PROCESSING TIME

We compared the processing time between our method and ESA. We sampled 353 tweets from the Twitter datasets that were
used in Section V-B and measured the processing time for
generating the vectors from the tweets.

TABLE 8. Average processing time for generating the vector of Wikipedia articles for tweets.

Table 8 shows the average processing time per tweet. Note that our method on HDD needs to store the prior probability P(c) in memory (requiring 150 MB with a hash-based naive implementation); otherwise, the processing time substantially increases. From Table 8, we found that the differences in processing time were not significant in the same environment. This is because the time complexities of both our method and ESA are linearly proportional to the number of terms in the input texts. While ESA
can incrementally add the vector for each term to obtain the output vector, our method requires a two-phase process that first collects the probabilistic scores for each term and then aggregates the scores using ENB. This forces our method to take extra time. The processing time was drastically reduced using the in-memory implementation, which instead required approximately 6 GB (or 3 GB with dimension reduction) of memory space with the hash-based naive implementation.
D. OUTPUT COMPARISON

In order to further investigate the behavior of our method, we obtained related Wikipedia entities for three input texts

"Apple CPU", "Cooking Apple", and "Cooking Apple CPU". Table 9 lists the top eight related Wikipedia entities obtained with ESA-same, ESA-adjusted, and our methods (link-based and ESA-based).

TABLE 9. Top related Wikipedia entities obtained with ESA and our method.
For the input text "Apple CPU" (the topic is computing), all the methods were able to find entities related to computing and Apple Inc. For the input text "Cooking Apple" (the topic is food), ESA with the same parameters as our methods (i.e., the only difference is the scoring mechanism) mistakenly obtained entities related to computing and Apple Inc. Because the probabilistic score of the key term Cooking was far lower than that of Apple, the related entities were derived almost entirely from Apple. ESA with adjusted parameters also wrongly derived some computing-related entities such as iTunes Store and iPhone. In this case, the IDF scores of Apple and Cooking were similar, but the topics of the related entities were mixed. Both of our methods were able to obtain entities related to cooking and food by amplifying the scores for them while decreasing the scores for computing-related entities.
The input text "Cooking Apple CPU" (the topic is computing) does not mean cooking food (here, Cooking may be a noisy term). The outputs of ESA with adjusted parameters were again mixed, i.e., both computing and food topics appeared among the top related entities. On the other hand, our method was able to ignore Cooking and obtain entities related to computing. Here, CPU was more characteristic than Cooking, and our probabilistic scoring mechanism inferred that the topic was
not food but computing. Our method thus selects the most
dominant topic of a text and then filters out entities related to
other topics.
VI. CONCLUSION

We proposed a method for semantic similarity measurements based on the Bayes theorem. Our method is a kind of
Wikipedia-based Explicit Semantic Analysis (ESA), which
generates a vector of related Wikipedia entities as the semantic representation of a given text and uses the vector for
measuring the semantic similarity. Whereas ESA just sums
the vectors for each term that occurs in the text based on
the majority rule, our method aggregates the vectors using
extended Naive Bayes (ENB). This enables us to generate
more refined vectors from texts, especially when the texts are short and semantically noisy, i.e., when the majority rule does not hold.
The performance results showed that our method outperformed ESA for real-world noisy short texts such as tweets and Web snippets, even when the parameters of ESA were well adjusted using short text semantic similarity datasets. With a little extra processing time, our probabilistic method can generate more refined vectors of Wikipedia entities from texts without parameter adjustment. Using dimension reduction of the output vector, our method can achieve reasonable performance with fewer non-zero elements of the vector, saving processing time and memory space.
In future work, we will seek a robust way to measure semantic similarity across languages. Using the inter-language links of Wikipedia, vectors of Wikipedia entities in different languages can be handled together. However, due to differences in article content and missing inter-language links, directly comparing vectors of different languages deteriorates the performance. Another challenge is to address state-of-the-art distributed (not distributional, like the bag-of-words model) representations, which come from neural network language models and are an alternative semantic representation of texts. Combining our knowledge-based representation with such novel representations may further improve the performance of semantic similarity measurements.
REFERENCES
[1] S. Banerjee, K. Ramanathan, and A. Gupta, "Clustering short texts using Wikipedia," in Proc. Int. ACM SIGIR Conf. Res. Develop. Inf. Retr. (SIGIR), Jul. 2007, pp. 787-788.
[2] X. Sun, H. Wang, and Y. Yu, "Towards effective short text deep classification," in Proc. Int. ACM SIGIR Conf. Res. Develop. Inf. Retr. (SIGIR), Jul. 2011, pp. 1143-1144.
[3] Wikipedia. [Online]. Available: http://www.wikipedia.org/, accessed Jan. 23, 2015.
[4] P. Ferragina and U. Scaiella, "TAGME: On-the-fly annotation of short text fragments (by Wikipedia entities)," in Proc. ACM Conf. Inf. Knowl. Manage. (CIKM), Oct. 2010, pp. 1625-1628.
[5] X.-H. Phan, L.-M. Nguyen, and S. Horiguchi, "Learning to classify short and sparse text & Web with hidden topics from large-scale data collections," in Proc. 17th Int. World Wide Web Conf. (WWW), Apr. 2008, pp. 91-100.
[6] E. Gabrilovich and S. Markovitch, "Computing semantic relatedness using Wikipedia-based explicit semantic analysis," in Proc. Int. Joint Conf. Artif. Intell. (IJCAI), Jan. 2007, pp. 1606-1611.
[7] M. Shirakawa, K. Nakayama, T. Hara, and S. Nishio, "Probabilistic semantic similarity measurements for noisy short texts using Wikipedia entities," in Proc. ACM Int. Conf. Inf. Knowl. Manage. (CIKM), Oct./Nov. 2013, pp. 903-908.
[8] B. Leuf and W. Cunningham, The Wiki Way: Collaboration and Sharing on the Internet. Reading, MA, USA: Addison-Wesley, Apr. 2001.
[9] K. Nakayama, T. Hara, and S. Nishio, "Wikipedia mining for an association Web thesaurus construction," in Proc. Int. Conf. Web Inf. Syst. Eng. (WISE), Dec. 2007, pp. 322-334.
[10] E. Meij, W. Weerkamp, and M. de Rijke, "Adding semantics to microblog posts," in Proc. ACM Int. Conf. Web Search Data Mining (WSDM), Feb. 2012, pp. 563-572.
[11] D. M. Blei, A. Y. Ng, and M. I. Jordan, "Latent Dirichlet allocation," J. Mach. Learn. Res., vol. 3, pp. 993-1022, Mar. 2003.
[12] X. Hu, N. Sun, C. Zhang, and T.-S. Chua, "Exploiting internal and external semantics for the clustering of short texts using world knowledge," in Proc. ACM Conf. Inf. Knowl. Manage. (CIKM), Nov. 2010, pp. 919-928.
[13] D. Metzler, S. Dumais, and C. Meek, "Similarity measures for short segments of text," in Proc. Eur. Conf. Inf. Retr. (ECIR), Apr. 2007, pp. 16-27.
[14] Y. Song, H. Wang, Z. Wang, H. Li, and W. Chen, "Short text conceptualization using a probabilistic knowledgebase," in Proc. Int. Joint Conf. Artif. Intell. (IJCAI), Jul. 2011, pp. 2330-2336.
[15] M. Strube and S. P. Ponzetto, "WikiRelate! Computing semantic relatedness using Wikipedia," in Proc. Nat. Conf. Artif. Intell. (AAAI), Jul. 2006, pp. 1419-1424.
[16] C. Fellbaum, WordNet: An Electronic Lexical Database. Cambridge, MA, USA: MIT Press, May 1998.
[17] S. P. Ponzetto and M. Strube, "Exploiting semantic role labeling, WordNet and Wikipedia for coreference resolution," in Proc. Human Lang. Technol. Conf. North Amer. Chapter Assoc. Comput. Linguistics (HLT-NAACL), Jun. 2006, pp. 192-199.
[18] D. Milne and I. H. Witten, "An effective, low-cost measure of semantic relatedness obtained from Wikipedia links," in Proc. AAAI Workshop Wikipedia Artif. Intell. (WIKIAI), Jul. 2008, pp. 25-30.
[19] Y. Ollivier and P. Senellart, "Finding related pages using green measures: An illustration with Wikipedia," in Proc. Nat. Conf. Artif. Intell. (AAAI), Jul. 2007, pp. 1427-1433.
[20] E. Yeh, D. Ramage, C. D. Manning, E. Agirre, and A. Soroa, "WikiWalk: Random walks on Wikipedia for semantic relatedness," in Proc. Workshop Graph-Based Methods Natural Lang. Process. (TextGraphs), Aug. 2009, pp. 41-49.
[21] M. Ito, K. Nakayama, T. Hara, and S. Nishio, "Association thesaurus construction methods based on link co-occurrence analysis for Wikipedia," in Proc. ACM Conf. Inf. Knowl. Manage. (CIKM), Oct. 2008, pp. 817-826.
[22] S. Hassan and R. Mihalcea, "Cross-lingual semantic relatedness using encyclopedic knowledge," in Proc. Conf. Empirical Methods Natural Lang. Process. (EMNLP), Aug. 2009, pp. 1192-1201.
[23] M. Yazdani and A. Popescu-Belis, "Computing text semantic relatedness using the contents and links of a hypertext encyclopedia," Artif. Intell., vol. 194, pp. 176-202, Jan. 2013.
[24] M. A. H. Taieb, M. B. Aouicha, and A. B. Hamadou, "Computing semantic relatedness using Wikipedia features," Knowl.-Based Syst., vol. 50, pp. 260-278, Sep. 2013.
[25] R. Mihalcea and A. Csomai, "Wikify! Linking documents to encyclopedic knowledge," in Proc. ACM Conf. Inf. Knowl. Manage. (CIKM), Nov. 2007, pp. 233-242.
[26] G. Salton and C. Buckley, "Term-weighting approaches in automatic text retrieval," Inf. Process. Manage., vol. 24, no. 5, pp. 513-523, 1988.
[27] C. D. Manning and H. Schütze, Foundations of Statistical Natural Language Processing. Cambridge, MA, USA: MIT Press, May 1999.
[28] G. J. Lidstone, "Note on the general case of the Bayes-Laplace formula for inductive or a posteriori probabilities," Trans. Faculty Actuaries, vol. 8, pp. 182-192, 1920.
[29] D. Milne and I. H. Witten, "Learning to link with Wikipedia," in Proc. ACM Conf. Inf. Knowl. Manage. (CIKM), Oct. 2008, pp. 509-518.
[30] M. Shirakawa et al., "Entity disambiguation based on a probabilistic taxonomy," Microsoft Research Asia, Beijing, China, Tech. Rep. MSR-TR-2011-125, Nov. 2011.
[31] F. Jelinek and R. L. Mercer, "Interpolated estimation of Markov source parameters from sparse data," in Proc. Workshop Pattern Recognit. Pract., 1980, pp. 381-397.
[32] M. A. Olson, K. Bostic, and M. Seltzer, "Berkeley DB," in Proc. USENIX Annu. Tech. Conf., Jun. 1999, pp. 183-192.
[33] E. Fredkin, "Trie memory," Commun. ACM, vol. 3, no. 9, pp. 490-499, Sep. 1960.
[34] Y. Li, D. McLean, Z. A. Bandar, J. D. O'Shea, and K. Crockett, "Sentence similarity based on semantic nets and corpus statistics," IEEE Trans. Knowl. Data Eng., vol. 18, no. 8, pp. 1138-1150, Aug. 2006.
[35] H. A. Schwartz and F. Gomez, "Evaluating semantic metrics on tasks of concept similarity," in Proc. Int. Florida Artif. Intell. Res. Soc. Conf. (FLAIRS), May 2011, p. 324.
[36] G. A. Miller and W. G. Charles, "Contextual correlates of semantic similarity," Lang. Cognit. Process., vol. 6, no. 1, pp. 1-28, 1991.
[37] H. Rubenstein and J. B. Goodenough, "Contextual correlates of synonymy," Commun. ACM, vol. 8, no. 10, pp. 627-633, Oct. 1965.
[38] E. Agirre, E. Alfonseca, K. Hall, J. Kravalova, M. Paşca, and A. Soroa, "A study on similarity and relatedness using distributional and WordNet-based approaches," in Proc. Human Lang. Technol. Conf. North Amer. Chapter Assoc. Comput. Linguistics (HLT-NAACL), May/Jun. 2009, pp. 19-27.
[39] L. Finkelstein et al., "Placing search in context: The concept revisited," ACM Trans. Inf. Syst., vol. 20, no. 1, pp. 116-131, Jan. 2002.
[40] J. MacQueen, "Some methods for classification and analysis of multivariate observations," in Proc. Berkeley Symp. Math. Statist. Probab., 1967, pp. 281-297.
[41] D. Laniado and P. Mika, "Making sense of Twitter," in Proc. Int. Semantic Web Conf. (ISWC), Nov. 2010, pp. 470-485.
[42] K. D. Rosa, B. L. R. Shah, A. Gershman, and R. Frederking, "Topical clustering of tweets," in Proc. SIGIR Workshop Social Web Search Mining (SWSM), Jul. 2011, pp. 1-8.
[43] A. Strehl and J. Ghosh, "Cluster ensembles: A knowledge reuse framework for combining multiple partitions," J. Mach. Learn. Res., vol. 3, pp. 583-617, Dec. 2002.

MASUMI SHIRAKAWA received the Ph.D. degree in information science and technology from Osaka University, Japan, in 2013. He was
an Intern with Microsoft Research Asia in 2011.
He is currently a Research Assistant Professor
with the Department of Multimedia Engineering,
Osaka University. His research interests include
Web mining, text mining, knowledge base, and
natural language processing. He is a member of the
Association for Computing Machinery and three
other learned societies.

KOTARO NAKAYAMA received the Ph.D. degree in information science and technology from Osaka University, Osaka, Japan, in 2008.
While he was a student, he started a software
development company named Kansai Informatics
Laboratory and served as a CEO and the Executive
Director from 2000 to 2003. Meanwhile, he held
a lecturing position with Doshisha Women's College of Liberal Arts from 2002 to 2004.
He was a Post-Doctoral Researcher with Osaka
University from 2008 to 2009, and an Assistant Professor/Lecturer Professor
with the Center of Knowledge Structuring, University of Tokyo, Japan, from
2009 to 2014, after receiving his Ph.D. degree. He is currently an Assistant
Professor/Lecturer Professor with the School of Engineering, University
of Tokyo. His research interests cover Web mining, artificial intelligence,
scalability, and practical software development. He is a member of the
Association for Computing Machinery and two other CS/AI-related societies.
TAKAHIRO HARA (SM'06) received the B.E., M.E., and Dr.Ing. degrees in information systems engineering from Osaka University, Osaka,
Japan, in 1995, 1997, and 2000, respectively.
He is currently an Associate Professor with the
Department of Multimedia Engineering, Osaka
University. He has authored over 300 international journal and conference papers in databases,
mobile computing, peer-to-peer systems, WWW,
and wireless networking. His research interests
include distributed databases, peer-to-peer systems, mobile networks, and
mobile computing systems. He is a Distinguished Scientist of the Association
for Computing Machinery and a member of three other learned societies. He
served as the Program Chair of the IEEE International Conference on Mobile
Data Management in 2006 and 2010, the IEEE International Conference
on Advanced Information Networking and Applications in 2009 and 2014,
and the IEEE International Symposium on Reliable Distributed Systems
in 2012. He was a Guest Editor of the IEEE JOURNAL ON SELECTED AREAS IN COMMUNICATIONS for the Special Issue on Peer-to-Peer Communications and Applications.

SHOJIRO NISHIO (F'12) received the B.E., M.E., and Ph.D. degrees from Kyoto University, Japan, in 1975, 1977, and 1980, respectively.
He has been a Full Professor with Osaka University since 1992, and was bestowed the prestigious
title as a Distinguished Professor of Osaka University in 2013. He served as the Vice President and a
Trustee with Osaka University from 2007 to 2011.
He has authored or co-authored over 600 refereed
journal and conference papers. He is a fellow of
the Institute of Electronics, Information and Communication Engineers and
the Information Processing Society of Japan, and a member of four learned
societies, including the Association for Computing Machinery (ACM).
He has received numerous awards for his research contributions, including
the Medal with Purple Ribbon from the Government of Japan in 2011.
He served as the Program Committee Co-Chair of several international
conferences, including DOOD 1989, VLDB 1995, and the IEEE ICDE 2005.
He has served as an Editor of several renowned journals, including the
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, the VLDB Journal,
the ACM Transactions on Internet Technology, and Data & Knowledge
Engineering.
