
A Semantic Similarity Approach Based on Web Resources
Mrs. M. Karthiga
PG Student, Dept. of CSE, Kongu Engineering College, Perundurai, Tamilnadu, India
mkarthiga22@gmail.com

Mrs. P.C.D. Kalaivaani
Assistant Professor, Dept. of CSE, Kongu Engineering College, Perundurai, Tamilnadu, India
kalaivaani.ramesh@gmail.com

Mr. S. Sankarananth
Assistant Professor, Dept. of EEE, Excel College of Engineering, Nammakkal, Tamilnadu, India
sankarananth99@gmail.com

Abstract— The ability to accurately judge semantic similarity is important in various tasks on the web, such as relation extraction, document clustering, and automatic metadata extraction. An empirical method is proposed to provide a semantics-aware search that uses, on one hand, a technical English dictionary and, on the other hand, a page count based metric and a text snippet based metric retrieved from a web search engine for two words. To identify the numerous semantic relations between words, a novel pattern extraction algorithm and a pattern clustering algorithm are proposed. The optimal combination of the page count based co-occurrence measures and the lexical pattern clusters extracted from snippets is learned using support vector machines. The page count, text snippet, and dictionary based metrics are integrated to measure semantic similarity more accurately than a normal search.

Index Terms— Web search, information extraction, natural language processing, user generated content, snippet.

I. INTRODUCTION

Information on the web is vast, with much hidden information interconnected by various semantic relations. Information semantics identifies the concepts that help to extract useful information from data. Providing a semantics-aware search is a challenging task in many Natural Language Processing and information retrieval applications such as query expansion, query suggestion, and word sense disambiguation. Since the semantic similarity between words changes over time and across domains, WordNet (a general purpose ontology) is not sufficient.

Using an automatic method to estimate semantic similarity from search engines is more efficient [1]. Page counts, dictionary based metrics, and snippets are some of the useful information provided by a search engine. The page count for a query is the number of web pages returned by the search engine as a result for that query. Page counts for two words approximate the global co-occurrence of the two words on the web. If two words have a higher page count, they are considered more similar.

But page counts alone as a measure of co-occurrence of two words present several drawbacks. First, page count analysis ignores the position of a word within a page. Second, the page count for a polysemous word (a word with multiple senses) contains a combination of all its senses. Moreover, given the scale and noise of the web, words might co-occur on pages without actually being related. For these reasons, the page count measure alone is unreliable for measuring semantic similarity.

A snippet is the text that a web search engine shows on its result page to give users a sample of a document's content. It provides a convenient summary of a search result, so it avoids the need to download the entire source document from the web. Snippets provide useful information regarding the local context of the query term. Consider a snippet from Google for the query Ostrich AND Africa:

"Ostrich, a flightless bird that lives in the dry grasslands of Africa"

A drawback of using snippets is that, due to the huge scale of the web and the large number of documents in a result set, only the snippets of the top-ranking results for a query can be processed efficiently. There is no guarantee that all the information needed to measure the semantic similarity between a given pair of words will be present in the top-ranking snippets returned by a web search engine. Moreover, much information on the web carries technical distinctions, so including a technical dictionary is useful for determining semantic similarity.

In this paper, a method that considers a technical dictionary based metric, page counts, and lexico-syntactic patterns extracted from snippets is proposed to overcome the above mentioned problems.

The rest of the document is organized as follows: Section II introduces related work on semantic similarity methods. Section III presents an approach to measure the semantic similarity between words. Section IV describes the experimental setup. Section V presents the conclusion and some future perspectives.

II. RELATED WORK

When the knowledge base is considered as a graph, the semantic similarity between words can be computed using a metric called distance [16]. The shortest path in an is-a hierarchy is used to measure the distance between concepts. The conceptual distance between two words, represented by two nodes in an
is-a semantic net, is the minimum number of edges separating the nodes. The drawback of this approach is that it assumes all links in the taxonomy have a uniform distance.

Besides evaluating semantic similarity by distance, Resnik [17] provided an alternative way to measure similarity based on the notion of information content. The similarity between two concepts is based on the extent to which they share common information. If two concepts have more common information, they are considered highly specific content. In the case of multiple inheritance, word similarity is also taken into account. The widely acknowledged problem with this approach is that word sense disambiguation is not considered, so it produces similarity measures for words on the basis of irrelevant word senses.

Li et al [13] combined structural semantic information from a lexical taxonomy and information content from a corpus in a nonlinear model. A similarity measure that uses shortest path length, depth, and local density in a taxonomy is proposed. The experiment reported a high Pearson correlation coefficient of 0.8914 on the Miller and Charles benchmark data set. But the proposed work did not evaluate the method in terms of similarities among named entities.

Lin [11] defined the similarity between two concepts in terms of the information that is common to both concepts and the information contained in each individual concept. A universal definition of similarity in terms of information theory was presented. The definition provided by Lin achieves two goals: universality and theoretical justification. Universality means the definition of similarity applies to many different domains. Theoretical justification means the similarity measure is not defined directly by a formula but is instead derived from a set of assumptions about similarity.

Cilibrasi and Vitanyi [3] proposed a distance metric between words, named Normalized Google Distance (NGD), that uses only page counts retrieved from a web search engine. NGD is based on normalized information distance. Because NGD does not take into account the context in which the words co-occur, it suffers from the drawbacks of measuring similarity using only page counts.

Sahami and Heilman [19] measured the semantic similarity between two queries using snippets returned for those queries by a search engine. For each query, they collect snippets from a search engine and represent each snippet as a TF-IDF weighted term vector. Each vector is normalized and the centroid of the set of vectors is computed. The semantic similarity is then defined as the inner product between the corresponding centroid vectors. But the proposed similarity measure was not compared with taxonomy-based similarity measures.

Chen et al [2] proposed a double-checking model using text snippets returned by a web search engine to compute semantic similarity between words. For two words P and Q, snippets are collected for each word from a web search engine. Then, the occurrences of word P in the snippets for word Q and the occurrences of word Q in the snippets for word P are counted. These values are combined nonlinearly to compute the similarity between P and Q. But this method depends heavily on the search engine's ranking algorithm. Though two words P and Q might be very similar, one cannot assume that the word Q will be found in the snippets for P, or P in the snippets for Q, because a search engine considers many other factors besides semantic similarity, such as publication date (novelty) and link structure (authority), when ranking the result set for a query.

Akermi and Faiz [8] introduced a new similarity measure between words using an online English dictionary provided by the Semantic Atlas project of the French National Centre for Scientific Research and page counts returned by the social website Digg.com, whose content is generated by users. The proposed work deals with the polysemy and semantic disambiguation problem.

Bollegala et al [1] proposed a web search engine based approach to measure semantic similarity, which is useful in query expansion and word sense disambiguation. The idea is to calculate the semantic similarity between words using the page counts and snippets returned by web search engines. Various similarity scores are calculated using page counts and lexico-syntactic patterns extracted from snippets. The calculated similarity scores are integrated using support vector machines to measure similarity.

III. PROPOSED SYSTEM

Our method for measuring semantic similarity between words uses, on one hand, a technical dictionary provided by the Semantic Atlas (SA) project and, on the other hand, page counts and text snippets returned by a web search engine (see Fig. 1).

The proposed method measures the semantic similarity between words with high accuracy. It has three phases:

1. The calculation of the similarity (TD) between two words based on the technical dictionary.
2. The calculation of the similarity (SPS) between two words based on the page counts and text snippets retrieved from a web search engine.
3. The integration of the two similarity measures TD and SPS.

Fig 1. Outline of the proposed method

Phase 1: Technical Dictionary based similarity measure

In this phase, technical synonyms for each word are extracted from the technical English dictionary. The technical dictionary is used to differentiate the technical meaning of a word from its ordinary meaning. Once the two sets of technical synonyms, one per word, are collected, the degree of similarity S(w1, w2) is calculated using the Jaccard coefficient, as in (1):

S(w1, w2) = mc / (mw1 + mw2 - mc)    (1)

where
mc: the number of words common to the two synonym sets
mw1: the number of words contained in the w1 synonym set
mw2: the number of words contained in the w2 synonym set

If the set of synonyms for the word w1 explicitly contains the word w2, or vice versa, the value 1 is assigned directly to S(w1, w2).
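As an illustration, this dictionary based measure reduces to a set Jaccard computation. The following is a minimal Python sketch, assuming the synonym sets have already been retrieved from the technical dictionary; the example synonym sets are hypothetical.

```python
def dictionary_similarity(w1, w2, synonyms):
    """Jaccard similarity (1) between the technical synonym sets of w1 and w2."""
    s1, s2 = synonyms[w1], synonyms[w2]
    # If either word appears in the other's synonym set, the two words
    # are treated as fully similar.
    if w2 in s1 or w1 in s2:
        return 1.0
    mc = len(s1 & s2)                     # words common to both sets
    return mc / (len(s1) + len(s2) - mc)  # mc / (mw1 + mw2 - mc)

# Hypothetical synonym sets, for illustration only.
synonyms = {"kernel": {"core", "nucleus"},
            "core": {"nucleus", "heart", "centre"}}
print(dictionary_similarity("kernel", "core", synonyms))  # 0.25
```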
Phase 2: Page counts and snippets similarity measure

In this phase, four page count based co-occurrence measures, WebJaccard, WebDice, WebOverlap and WebPMI, are defined using page counts and are integrated with lexical patterns extracted from text snippets. A lexical pattern extraction algorithm and a sequential pattern clustering algorithm are proposed for extracting the lexical patterns from the text snippets and for clustering similar patterns. The optimal combination of the page count based co-occurrence measures and the lexical pattern clusters is learned using support vector machines. The page count and text snippet based metric is depicted in Fig 2.

Fig 2. A framework for page count and text snippet based metric

WebJaccard, WebOverlap, WebDice, and WebPMI are the four similarity scores considered for page counts. Then snippets for the two words are collected from the web search engine, and the frequencies of numerous lexico-syntactic patterns are counted in the snippets returned for the conjunctive query of the two words. The lexical patterns are extracted automatically using the lexical pattern extraction algorithm. A semantic relation can be expressed by more than one lexical pattern, so the different lexical patterns that convey the same semantic relation are grouped using the sequential pattern clustering algorithm; this represents the semantic relation between the two words more accurately. Both the page count based similarity scores and the lexical pattern clusters are used to define various features that represent the relation between two words. A two-class support vector machine is trained using that feature representation of word pairs.

A. Page Count-Based Co-occurrence Measures

Page counts for the query P AND Q are considered as an approximation of the co-occurrence of the two words P and Q on the web. But page counts for the query P AND Q alone do not accurately express semantic similarity, so the page counts for the individual words P and Q are also considered to assess the semantic similarity between P and Q accurately. Four page count based co-occurrence measures, WebJaccard, WebOverlap, WebDice and Web pointwise mutual information (WebPMI), are calculated to compute the semantic similarity. Writing H(P) for the page count of a query P, the WebJaccard coefficient between words P and Q is defined as

WebJaccard(P, Q) = 0 if H(P ∧ Q) < c, and H(P ∧ Q) / (H(P) + H(Q) - H(P ∧ Q)) otherwise    (2)

where P ∧ Q denotes the conjunctive query P AND Q. Due to the scale and noise of web data, two words may appear on the same pages even though they are not related. To reduce these adverse effects, the WebJaccard coefficient is set to zero if the page count for the query P ∧ Q is less than a threshold c; c is set to 5 experimentally [1].

The WebDice coefficient is a variant of the Dice coefficient. WebDice(P, Q) is defined as

WebDice(P, Q) = 0 if H(P ∧ Q) < c, and 2 H(P ∧ Q) / (H(P) + H(Q)) otherwise    (3)

WebOverlap is a natural modification of the Overlap (Simpson) coefficient. WebOverlap(P, Q) is defined as

WebOverlap(P, Q) = 0 if H(P ∧ Q) < c, and H(P ∧ Q) / min(H(P), H(Q)) otherwise    (4)
Pointwise mutual information [4] is a measure motivated by information theory; it is intended to reflect the dependence between two probabilistic events. WebPMI, a variant of pointwise mutual information using page counts, is defined as

WebPMI(P, Q) = 0 if H(P ∧ Q) < c, and log2( (H(P ∧ Q)/N) / ((H(P)/N)(H(Q)/N)) ) otherwise    (5)

where N is the number of documents indexed by the search engine. N is set to 10^10 according to the number of indexed pages reported by Google [1].
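For concreteness, the four measures (2)-(5) can be computed directly from raw page counts, as in the minimal Python sketch below; obtaining the counts H(P), H(Q) and H(P ∧ Q) from a search engine API is assumed to happen elsewhere.

```python
import math

C = 5        # page count threshold c, set experimentally [1]
N = 1e10     # approximate number of pages indexed by Google [1]

def web_jaccard(hp, hq, hpq):
    # hp = H(P), hq = H(Q), hpq = H(P AND Q): raw page counts
    return 0.0 if hpq < C else hpq / (hp + hq - hpq)

def web_dice(hp, hq, hpq):
    return 0.0 if hpq < C else 2 * hpq / (hp + hq)

def web_overlap(hp, hq, hpq):
    return 0.0 if hpq < C else hpq / min(hp, hq)

def web_pmi(hp, hq, hpq):
    # log base 2 of the ratio of joint to independent probabilities
    return 0.0 if hpq < C else math.log2((hpq / N) / ((hp / N) * (hq / N)))
```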
B. Lexical Pattern Extraction

A snippet contains a window of text selected from a document that includes the queried words. A user can read the snippet and decide whether a particular search result is relevant without even opening the URL. A search engine might produce a snippet by selecting multiple text fragments from different portions of a document; a predefined delimiter is used to separate the different fragments. For example, in Google, the delimiter "..." is used to separate the different fragments in a snippet. Such delimiters are used to split a snippet before running the proposed lexical pattern extraction algorithm on each fragment.

For a snippet δ retrieved for a word pair (P, Q), replace the two words P and Q with two variables X and Y. Then replace all numeric values by D, a marker for digits. Next, generate all subsequences of words from δ that satisfy all of the following conditions:

1. A subsequence must contain exactly one occurrence of each of X and Y.
2. The maximum length of a subsequence is L words.
3. A subsequence may skip one or more words, but it must not skip more than g words consecutively, and the total number of words skipped in a subsequence must not exceed G.
4. All negation contractions in a context are expanded; for example, didn't is expanded to did not. The word not is never skipped when generating subsequences.

Finally, the frequency of each generated subsequence is counted, and only the subsequences that occur more than T times are used as lexical patterns. L=5, g=2, G=4 and T=5 are set experimentally [1]. A sketch of this procedure is given below.
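The following is a minimal Python sketch of the subsequence generation under the constraints above. It assumes whitespace tokenization; contraction expansion and other text normalization are left out for brevity.

```python
import re
from itertools import combinations
from collections import Counter

L, g, G, T = 5, 2, 4, 5   # parameters set experimentally [1]

def normalize(fragment, p, q):
    # Replace the query words with variables X and Y, and digits with D.
    tokens = fragment.lower().split()
    tokens = ['X' if t == p else 'Y' if t == q else t for t in tokens]
    return ['D' if re.fullmatch(r'\d+', t) else t for t in tokens]

def subsequences(tokens):
    # Brute-force enumeration; adequate for short snippet fragments.
    for k in range(2, L + 1):
        for idx in combinations(range(len(tokens)), k):
            if (idx[-1] - idx[0] + 1) - k > G:            # total skips <= G
                continue
            if any(b - a - 1 > g for a, b in zip(idx, idx[1:])):
                continue                                  # consecutive skips <= g
            sub = [tokens[i] for i in idx]
            if sub.count('X') == 1 and sub.count('Y') == 1:
                yield ' '.join(sub)

def extract_patterns(snippets, p, q):
    counts = Counter()
    for snippet in snippets:
        for fragment in snippet.split('...'):             # split on the delimiter
            counts.update(subsequences(normalize(fragment, p, q)))
    return {pat for pat, n in counts.items() if n > T}    # keep patterns with > T hits
```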
C. Lexical Clustering Algorithm

Different patterns that express the same semantic relation are grouped automatically using the sequential pattern clustering algorithm. Identifying the different patterns that express the same semantic relation helps to represent the semantic relation between two words accurately.

Algorithm 1. Grouping of different lexical patterns
Input: A set of lexical patterns, a clustering threshold θ
Output: Clusters of patterns
Method:
First, sort the patterns into descending order of their total occurrence over all word pairs. The total occurrence μ(a) of a pattern a is the sum of its frequencies of occurrence in all word pairs, given by (6):

μ(a) = Σ_(P,Q) f(P, Q, a)    (6)

where f(P, Q, a) denotes the frequency of pattern a in the snippets for the word pair (P, Q). Then:

1. Initialize the set of clusters C to the empty set.
2. The outer loop repeatedly takes a pattern ai from the ordered set of lexical patterns.
3. Set the maximum similarity to -∞.
4. Set the most similar cluster c* to null.
5. The inner loop finds the cluster c* (∈ C) that is most similar to the pattern ai.
6. To compute the similarity between a pattern and a cluster, represent the cluster by the centroid of the word-pair frequency vectors of the patterns in that cluster; then compute the cosine similarity between the cluster centroid cj and the word-pair frequency vector of the pattern ai.
7. If the similarity between the pattern ai and its most similar cluster c* is greater than the threshold θ, append ai to c*.
8. Otherwise, if ai is not similar to any existing cluster beyond the threshold θ, form a new cluster {ai} and append it to the set of clusters C.
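A compact Python rendering of Algorithm 1 might look as follows. It assumes each pattern is already paired with its word-pair frequency vector, and it recomputes a cluster's centroid whenever a pattern is appended.

```python
import numpy as np

def cluster_patterns(pattern_vectors, theta):
    # pattern_vectors: list of (pattern, word-pair frequency vector) pairs.
    # Sort by total occurrence mu(a), eq. (6), in descending order.
    ordered = sorted(pattern_vectors, key=lambda pv: pv[1].sum(), reverse=True)
    clusters, centroids = [], []           # parallel lists
    for pattern, v in ordered:
        best_sim, best = -np.inf, None
        for j, c in enumerate(centroids):  # find the most similar cluster
            sim = v @ c / (np.linalg.norm(v) * np.linalg.norm(c))
            if sim > best_sim:
                best_sim, best = sim, j
        if best is not None and best_sim > theta:
            clusters[best].append((pattern, v))
            vecs = np.array([w for _, w in clusters[best]])
            centroids[best] = vecs.mean(axis=0)   # refresh the centroid
        else:
            clusters.append([(pattern, v)])       # start a new cluster
            centroids.append(v.astype(float))
    return clusters
```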
D. Measuring semantic similarity

A machine learning approach is used to combine both the page count based co-occurrence measures and the snippet based measures to construct a robust semantic similarity measure. Given N clusters of lexical patterns, a pair of words (P, Q) is first represented by an (N+4)-dimensional feature vector fPQ. The four page count based co-occurrence measures are used as four distinct features in fPQ. Then a feature is computed from each of the N clusters as follows: first, a weight wij is assigned to each pattern ai in a cluster cj, as in (7):

wij = μ(ai) / Σ_(a ∈ cj) μ(a)    (7)

Here, μ(a) is the total frequency of a pattern a in all word pairs, given by (6). Finally, the value of the jth feature in the feature vector for a word pair (P, Q) is computed as in (8):

fj(P, Q) = Σ_(ai ∈ cj) wij f(P, Q, ai)    (8)

The value of the jth feature of the feature vector fPQ representing a word pair (P, Q) is thus the weighted sum of the frequencies of all patterns in cluster cj that co-occur with the words P and Q. All patterns in a cluster represent the same semantic relation.

Consequently, the value of the jth feature given by (8) represents the significance, for the word pair (P, Q), of the semantic relation represented by cluster cj.

Using these features, a two-class support vector machine (SVM) is trained to distinguish synonymous from non-synonymous word pairs. The training data set S is generated automatically from WordNet synsets. After training the SVM on synonymous and non-synonymous word pairs, it is used to compute the semantic similarity between two given words. LibSVM is used as the SVM implementation. A sketch of this step is given below.
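The sketch below assembles the feature vector of (2)-(5) and (8) and feeds it to an SVM. It is a minimal illustration, assuming the clusters, the pattern statistics μ and f, and the page count scores have already been computed; scikit-learn's SVC (which is backed by LibSVM) stands in for a direct LibSVM binding.

```python
import numpy as np
from sklearn.svm import SVC   # scikit-learn's SVC wraps LibSVM

def feature_vector(page_scores, clusters, mu, pair_freq):
    # page_scores: [WebJaccard, WebDice, WebOverlap, WebPMI] for (P, Q).
    # clusters: list of pattern clusters; mu[a]: total frequency of pattern a;
    # pair_freq[a]: frequency f(P, Q, a) of pattern a for this word pair.
    features = list(page_scores)
    for cj in clusters:
        total = sum(mu[a] for a in cj)
        # eq. (8): mu-weighted sum of the cluster's pattern frequencies
        features.append(sum(mu[a] / total * pair_freq.get(a, 0) for a in cj))
    return np.array(features)

# Training: label +1 for synonymous pairs (drawn from WordNet synsets)
# and -1 for randomly shuffled, non-synonymous pairs, e.g.:
#   X = np.vstack([feature_vector(...) for each training pair])
#   clf = SVC(kernel='rbf', probability=True).fit(X, y)
# clf.predict_proba(...) then yields a graded similarity score.
```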
Phase 3: The overall similarity measure

In this phase, the two similarity measures (the technical dictionary based metric and the page count and text snippet based metric) are integrated to measure the semantic similarity between words more accurately.

In the proposed work, the input given to the machine is a collection of word pairs. The synonymous words are extracted directly from WordNet synsets and the non-synonymous words are generated by a random shuffling technique. Once the machine is trained, it can be used to compute the similarity between words. The computed similarity is then used to design a semantic search engine which returns semantically related results for the user query.
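The paper does not fix a specific integration formula, so as an illustration only, one simple realization is a convex combination of the two scores, assuming both are normalized to [0, 1]; the mixing weight below is hypothetical and would be tuned on held-out data.

```python
def overall_similarity(td_score, sps_score, alpha=0.5):
    # td_score: dictionary based similarity TD, eq. (1)
    # sps_score: page count and snippet based similarity SPS
    # alpha: hypothetical mixing weight (0.5 is a placeholder)
    return alpha * td_score + (1 - alpha) * sps_score
```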
IV. EXPERIMENTAL SETUP

The proposed method has been evaluated on the Miller-Charles dataset, a dataset of 30 word pairs. Table I compares the results of the proposed method with the Chen Co-occurrence Double Checking (CODC) measure [2], the Sahami and Heilman metric [19], Normalised Google Distance (NGD) [3], NoClust (a variant that does not use any clustering information in feature vector creation), and the four popular co-occurrence measures WebJaccard, WebOverlap, WebDice, and WebPMI (pointwise mutual information). These measures use page counts returned by the Google search engine.

Table I. The semantic similarity scores on the MC data set

The proposed method uses the Pearson coefficient and the Spearman coefficient for evaluation; these are two popular measures for evaluating semantic similarity. A short example of computing them is shown below.
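Both coefficients can be computed directly with scipy, as in this sketch; the scores shown are hypothetical placeholders, not actual Miller-Charles ratings or results.

```python
from scipy.stats import pearsonr, spearmanr

# Hypothetical illustration: human ratings vs. scores from the method.
human_ratings = [3.92, 3.05, 1.18, 0.55]   # placeholder MC-style ratings
method_scores = [0.92, 0.74, 0.22, 0.10]   # placeholder method outputs

pearson, _ = pearsonr(human_ratings, method_scores)
spearman, _ = spearmanr(human_ratings, method_scores)
print(f"Pearson: {pearson:.3f}, Spearman: {spearman:.3f}")
```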
A search engine could be created that consists of two searches. One is the Normal search, which returns all pages for a query word. The other is the Semantic search, which returns the pages that are semantically related to the query keyword.
The comparison of the result pages returned by the Semantic search and the Normal search is given in Fig 3. The Semantic search provides higher accuracy than the Normal search.

Fig 3. Comparison of Semantic search and Normal search

V. CONCLUSION

Semantic similarity between words is fundamental to various fields such as cognitive science, natural language processing and information retrieval. Therefore, relying on a robust semantic similarity measure is crucial.

In this paper, a new similarity-aware search is introduced which uses, on one hand, a technical dictionary and, on the other hand, page counts and text snippets returned by a web search engine. The proposed method gives high accuracy for the classification of synonymous and non-synonymous words.

There are several lines of future work that the proposed measure lays the foundation for. The measure could be incorporated into other similarity-based applications to determine its ability to provide improvements in tasks such as the classification and clustering of text.

Besides, we intend to go further in this domain by developing new semantic similarity measures for sentences and documents.

REFERENCES

1. Bollegala D, Matsuo Y, and Ishizuka M (2011), "Measuring semantic similarity between words using web search engines", IEEE Transactions on Knowledge and Data Engineering, vol. 23, no. 7, pp. 977-990.
2. Chen H, Lin M, and Wei Y (2006), "Novel Association Measures Using Web Search with Double Checking", Proceedings of the 21st International Conference on Computational Linguistics, pp. 1009-1016.
3. Cilibrasi R and Vitanyi P (2007), "The Google Similarity Distance," IEEE Transactions on Knowledge and Data Engineering, vol. 19, no. 3, pp. 370-383.
4. Church K and Hanks P (1991), "Word Association Norms, Mutual Information and Lexicography," Computational Linguistics, vol. 16, pp. 22-29.
5. Hearst M (1992), "Automatic Acquisition of Hyponyms from Large Text Corpora," Proceedings of the 14th Conference on Computational Linguistics (COLING), pp. 539-545.
6. Hirst G and St-Onge D (1998), "Lexical Chains as Representations of Context for the Detection and Correction of Malapropisms," WordNet: An Electronic Lexical Database, pp. 305-332, MIT Press.
7. Hughes T and Ramage D (2007), "Lexical Semantic Relatedness with Random Graph Walks," Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL '07), pp. 581-589.
8. Akermi I and Faiz R (2012), "Semantic similarity measure based on multiple resources", Proceedings of the International Conference on Information Technology and e-Services, pp. 546-550.
9. Kilgarriff A (2007), "Googleology Is Bad Science," Computational Linguistics, vol. 33, pp. 147-151.
10. Lapata M and Keller F (2005), "Web-Based Models for Natural Language Processing," ACM Transactions on Speech and Language Processing, vol. 2, no. 1, pp. 1-3.
11. Lin D (1998), "An Information-Theoretic Definition of Similarity," Proceedings of the 15th International Conference on Machine Learning (ICML), pp. 296-304.
12. Matsuo Y, Sakaki T, Uchiyama K and Ishizuka M (2006), "Graph-based word clustering using a web search engine", Proceedings of EMNLP, pp. 523-530.
13. Mclean D, Li Y, and Bandar Z. A (2003), "An Approach for Measuring Semantic Similarity between Words Using Multiple Information Sources," IEEE Transactions on Knowledge and Data Engineering, vol. 15, no. 4, pp. 871-882.
14. Pei T, Han J, Mortazavi-Asi B, Wang J, Pinto H, Chen Q, Dayal U, and Hsu M (2004), "Mining Sequential Patterns by Pattern-Growth: The Prefixspan Approach," IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 11, pp. 1424-1440.
15. Pasca M, Lin D, Bigham J, Lifchits A, and Jain A (2006), "Organizing and Searching the World Wide Web of Facts - Step One: The One-Million Fact Extraction Challenge," Proceedings of the National Conference on Artificial Intelligence (AAAI '06).
16. Rada R, Mili H, Bichnell E, and Blettner M (1989), "Development and Application of a Metric on Semantic Nets", IEEE Transactions on Systems, Man and Cybernetics, vol. 19, no. 1, pp. 17-30.
17. Resnik P (1995), "Using Information Content to Evaluate Semantic Similarity in a Taxonomy", Proceedings of the 14th International Joint Conference on Artificial Intelligence, pp. 448-453.
18. Rosenfield R (1996), "A Maximum Entropy Approach to Adaptive Statistical Modelling," Computer Speech and Language, vol. 10, pp. 187-228.
19. Sahami M and Heilman T (2006), "A Web-Based Kernel Function for Measuring the Similarity of Short Text Snippets", Proceedings of the 15th International World Wide Web Conference, pp. 326-331.
20. Schickel-Zuber V and Faltings B (2007), "OSS: A Semantic Similarity Function Based on Hierarchical Ontologies," Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI '07), pp. 551-556.
21. Siddharth P, Banerjee S and Pedersen T (2003), "Using measures of semantic relatedness for word sense disambiguation", Proceedings of the Fourth International Conference on Intelligent Text Processing and Computational Linguistics, Mexico City, Mexico, pp. 241-257.
22. Snow R, Jurafsky D, and Ng A (2005), "Learning Syntactic Patterns for Automatic Hypernym Discovery," Proceedings of Advances in Neural Information Processing Systems (NIPS), pp. 1297-1304.
23. Strube M and Ponzetto S.P (2006), "WikiRelate! Computing Semantic Relatedness Using Wikipedia," Proceedings of the National Conference on Artificial Intelligence (AAAI '06), pp. 1419-1424.
24. Turney P.D (2001), "Mining the web for synonyms: PMI-IR versus LSA on TOEFL", Proceedings of ECML, pp. 491-502.
25. Wu Z and Palmer M (1994), "Verb Semantics and Lexical Selection," Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL '94), pp. 133-138.
