
Keyphrase Extraction

by
Arun Balaji.A
07MSS02
Under the guidance of
Ms. T. Santha, M.Sc., M.Phil., M.Phil. (CS), (Ph.D.)
KEYPHRASE EXTRACTION
Keyphrases provide a simple way of
describing a document, giving the reader
some clues about its contents.
Keyphrases can be useful in various applications such as retrieval engines, browsing interfaces, thesaurus construction, text mining, etc.
Keyphrases help readers rapidly understand, organize, access, and share the information in a document.
Keyphrases are phrases consisting of one or more significant words.
Keyphrases can be incorporated in the
search results as subject metadata to
facilitate information search on the web.
A list of keyphrases associated with a document may serve as an indicative summary or as document metadata, which helps readers search for relevant information.
For example, if we want to find a paper on compilers, we can simply use keyphrases that normally relate to that concept, such as compiler, compile, system software, etc., to retrieve it.
Keyphrases are meant to serve various goals.
For example,
when they are printed on the first page of a
journal document, the goal is summarization.
They enable the reader to quickly determine whether the given article is worth in-depth reading.
When they are added to the cumulative index for a journal, the goal is indexing. They enable the reader to quickly find an article relevant to a specific need.
When a search engine form contains a field
labeled keywords, the goal is to enable the
reader to make the search more precise.
EXTRACTION USING NAÏVE BAYES
Keyphrase extraction can be done using several methods such as genetic algorithms, decision trees, and neural networks. One simple way to perform it is with a Naïve Bayes classifier.
A Naïve Bayes classifier is a simple probabilistic classifier based on applying Bayes' theorem (from Bayesian statistics) with strong (naïve) independence assumptions.
In simple terms, a naive Bayes classifier assumes
that the presence (or absence) of a particular feature
of a class is unrelated to the presence (or absence)
of any other feature.
For example, a fruit may be considered to be an
apple if it is red, round, and about 4" in diameter.
Even if these features depend on each other or upon
the existence of the other features, a naive Bayes
classifier considers all of these properties to
independently contribute to the probability that this
fruit is an apple.
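As a minimal sketch of this independence assumption (all probabilities below are made-up illustrative numbers, not taken from the slides), the score for a class is its prior multiplied by the per-feature likelihoods:

```python
# Minimal sketch of the Naive Bayes scoring rule with made-up numbers.
# P(class | features) is proportional to P(class) * product of P(feature | class).

priors = {"apple": 0.6, "orange": 0.4}          # assumed class priors
likelihoods = {                                  # assumed per-feature likelihoods
    "apple":  {"red": 0.8, "round": 0.9, "about_4in": 0.7},
    "orange": {"red": 0.1, "round": 0.9, "about_4in": 0.6},
}

def score(cls, features):
    """Unnormalized posterior: prior times independent feature likelihoods."""
    p = priors[cls]
    for f in features:
        p *= likelihoods[cls][f]
    return p

features = ["red", "round", "about_4in"]
scores = {c: score(c, features) for c in priors}
total = sum(scores.values())
posteriors = {c: s / total for c, s in scores.items()}
print(posteriors)  # the fruit is classified as the class with the highest posterior
```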
We use the Naïve Bayes method because it is simple to implement and needs much less training than neural networks and other methods. Its results are also effective.
Depending on the precise nature of the probability model, Naïve Bayes classifiers can be trained very efficiently in a supervised learning setting. Supervised learning is the machine learning task of inferring a function from supervised training data.
A simple example of training and applying Naïve Bayes follows.
TRAINING DATA
To explain Naïve Bayes classification, consider training data containing two kinds of objects, each classified as either GREEN or RED.
Our task is to classify new cases as they arrive, i.e., to decide to which class they belong, based on the currently existing objects.
Since there are twice as many
GREEN objects as RED, it is
reasonable to believe that a
new case is twice as likely to
have membership GREEN
rather than RED.
Prior probabilities are based on previous experience, in this case the percentage of GREEN and RED objects, and are often used to predict outcomes before they actually happen.
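As a worked version of the prior (assuming, for illustration, 40 GREEN and 20 RED objects, 60 in total, matching the two-to-one ratio above):

\[
P(\text{GREEN}) = \frac{\#\,\text{GREEN}}{\text{total objects}} = \frac{40}{60},
\qquad
P(\text{RED}) = \frac{\#\,\text{RED}}{\text{total objects}} = \frac{20}{60}
\]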
CALCULATING FOR NEW INPUT
Having formulated our prior probability, we are now ready to classify a new object X (the WHITE circle).
Since the objects are well clustered, it is reasonable to assume that the more GREEN (or RED) objects there are in the vicinity of X, the more likely it is that the new case belongs to that particular color.
To measure this likelihood,
we draw a circle around X
which encompasses a
number of points irrespective
of their class labels.
Then we calculate the
number of points in the circle
belonging to each class label.
From this we calculate the
likelihood:
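Continuing the illustrative counts above, suppose the circle around X encloses 1 GREEN point and 3 RED points; then:

\[
P(X \mid \text{GREEN}) = \frac{\#\,\text{GREEN near } X}{\text{total GREEN}} = \frac{1}{40},
\qquad
P(X \mid \text{RED}) = \frac{\#\,\text{RED near } X}{\text{total RED}} = \frac{3}{20}
\]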
Although the prior probabilities indicate that X may belong to GREEN (given that there are twice as many GREEN objects as RED), the likelihood indicates otherwise: that the class membership of X is RED (given that there are more RED objects in the vicinity of X than GREEN).
In the Bayesian analysis,
the final classification is
produced by combining
both sources of
information, i.e., the prior
and the likelihood, to form
a posterior probability using
the so-called Bayes' rule.
Finally, we classify X as RED, since its class membership achieves the largest posterior probability.
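With the same illustrative counts, Bayes' rule gives:

\[
P(\text{GREEN} \mid X) \propto P(\text{GREEN})\,P(X \mid \text{GREEN}) = \frac{40}{60} \cdot \frac{1}{40} = \frac{1}{60}
\]
\[
P(\text{RED} \mid X) \propto P(\text{RED})\,P(X \mid \text{RED}) = \frac{20}{60} \cdot \frac{3}{20} = \frac{3}{60}
\]

Since 3/60 > 1/60, X is classified as RED.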
STEPS INVOLVED IN
EXTRACTION
Document Tagging
Candidate Phrase
Identification
Feature Calculations
Training with an experimental dataset
Document Tagging
Document tagging involves tagging every single word as a noun, verb, adjective, etc.
To perform tagging, we have to use a POS (part-of-speech) tagger. Here, we use a tagger called TreeTagger. We call the tagger function to create a new tagged output file. Once the tagged output is generated, we have to identify the candidate phrases.
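A minimal sketch of this step, using NLTK's tagger as a self-contained stand-in for TreeTagger (the tagger choice and resource names are assumptions, not the slides' exact setup):

```python
# Sketch of document tagging. The slides use TreeTagger; NLTK's tagger is a
# stand-in here so the example runs on its own. Resource names can vary
# slightly across NLTK versions.
import nltk

nltk.download("punkt", quiet=True)                       # tokenizer model
nltk.download("averaged_perceptron_tagger", quiet=True)  # tagger model

text = "Keyphrase extraction provides a simple summary of the document."
tokens = nltk.word_tokenize(text)
tagged = nltk.pos_tag(tokens)   # list of (word, POS tag) pairs
print(tagged)
# e.g. [('Keyphrase', 'NN'), ('extraction', 'NN'), ('provides', 'VBZ'), ...]
```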
Generally, we use simple nouns or noun phrases as keyphrases. So, we have to eliminate all other possibilities, such as verbs, adjectives, prepositions, etc. This is done by using conditions that can be explained with the following DFA.
After performing the tagging functions, we have to reduce those tagged words to a collection of noun phrases. This operation can be explained using the DFA below.

[DFA figure: states and transitions labeled START, article, Adjective, and noun]
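A minimal sketch of candidate identification, assuming the DFA accepts sequences of the form: optional article, then any adjectives, then one or more nouns (the tag mapping below is an assumption over the Penn Treebank tag set):

```python
# Sketch of candidate phrase identification over (word, tag) pairs.
import re

def candidate_phrases(tagged):
    """Extract noun-phrase candidates matching: article? adjective* noun+."""
    # Map Penn Treebank tags to the DFA's symbols; everything else is 'x'.
    symbol = {"DT": "a", "JJ": "j",
              "NN": "n", "NNS": "n", "NNP": "n", "NNPS": "n"}
    tags = "".join(symbol.get(tag, "x") for _, tag in tagged)
    words = [w for w, _ in tagged]
    # One character per token, so regex spans map directly to token spans.
    return [" ".join(words[m.start():m.end()])
            for m in re.finditer(r"a?j*n+", tags)]

tagged = [("the", "DT"), ("naive", "JJ"), ("bayes", "NN"),
          ("classifier", "NN"), ("is", "VBZ"), ("simple", "JJ")]
print(candidate_phrases(tagged))   # ['the naive bayes classifier']
```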
Features
After identifying the noun phrases, we have to select a small number of phrases (5 to 10, based on the size of the document) as keyphrases.
Whether a candidate noun phrase is a
keyphrase or not can be decided by a classifier
based on a set of features characterizing a
phrase.
The features that we can use are the frequency of the phrase, its position, its length, its links to other phrases, etc.
Frequency & Link Count
Frequency is the number of times the
phrase occurs in a particular document.
If a phrase is very frequent in a document, it is likely to be important to that document, so this feature is used.
Link count measures a phrase's links to other phrases, i.e., if phrase A is present as a part of phrase B, then A is considered to have a link to B.
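A minimal sketch of these two features (the helper names and the toy inputs are hypothetical):

```python
# Sketch of the frequency and link-count features for candidate phrases.
def frequency(phrase, document):
    """Number of times the phrase occurs in the document (case-insensitive)."""
    return document.lower().count(phrase.lower())

def link_count(phrase, candidates):
    """Number of other candidates that contain this phrase, i.e. phrase A
    'links to' phrase B when A appears inside B."""
    return sum(1 for other in candidates
               if other != phrase and phrase.lower() in other.lower())

candidates = ["bayes", "naive bayes", "naive bayes classifier"]
doc = "The naive Bayes classifier applies Bayes' theorem. Naive Bayes is simple."
for p in candidates:
    print(p, frequency(p, doc), link_count(p, candidates))
```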
Phrase Position, Length & Word
Length
Phrase position is a feature used to assess where the phrase occurs, i.e., if a phrase occurs in the title or abstract, it should be given more priority.
Phrase length is an important feature because the length of keyphrases usually varies from 1 to 3 words.
Similarly, word length also plays a major role, because shorter words occur more frequently than longer ones. So, the length of the longest word in the phrase is taken as a feature.
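A minimal sketch of these three features (helper names are hypothetical):

```python
# Sketch of the position, phrase-length, and word-length features.
def position(phrase, document):
    """Relative position of the phrase's first occurrence (0.0 = start of the
    document, so title/abstract occurrences get small values)."""
    idx = document.lower().find(phrase.lower())
    return idx / len(document) if idx >= 0 else 1.0

def phrase_length(phrase):
    """Number of words in the phrase (keyphrases are usually 1-3 words)."""
    return len(phrase.split())

def longest_word_length(phrase):
    """Length of the longest word in the phrase."""
    return max(len(w) for w in phrase.split())

doc = "Keyphrase extraction with naive Bayes. Naive Bayes is a simple classifier."
p = "naive bayes"
print(position(p, doc), phrase_length(p), longest_word_length(p))
```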
Thus, we calculate these features for each candidate phrase in our document.
Then, we use the Naïve Bayes formula to calculate the probability that a phrase is a keyphrase by combining all of its features.
The same process is repeated for every candidate phrase.
Based on the calculated probability values, we then rank the candidate phrases and decide which ones are keyphrases.
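A minimal end-to-end sketch of this ranking step, with made-up conditional probabilities standing in for values that would be learned from training documents:

```python
# Sketch of the final ranking step: combine discretized feature values with
# the Naive Bayes rule, then keep the highest-scoring candidates. All the
# probabilities below are made-up placeholders; in practice they are
# estimated from a training set of documents with known keyphrases.
PRIOR_KEY = 0.05          # assumed prior P(keyphrase)
PRIOR_NOT = 0.95

# Assumed P(feature value | class) tables, one per discretized feature.
LIKELIHOODS = {
    "frequent": {"key": 0.6, "not": 0.2},
    "early":    {"key": 0.5, "not": 0.1},
    "short":    {"key": 0.7, "not": 0.5},
}

def keyphrase_probability(features):
    """Posterior P(keyphrase | features) under the independence assumption."""
    p_key, p_not = PRIOR_KEY, PRIOR_NOT
    for name, present in features.items():
        table = LIKELIHOODS[name]
        p_key *= table["key"] if present else 1 - table["key"]
        p_not *= table["not"] if present else 1 - table["not"]
    return p_key / (p_key + p_not)

candidates = {
    "naive bayes":   {"frequent": True,  "early": True,  "short": True},
    "simple way":    {"frequent": False, "early": False, "short": True},
    "training data": {"frequent": True,  "early": False, "short": True},
}
ranked = sorted(candidates, key=lambda c: keyphrase_probability(candidates[c]),
                reverse=True)
print(ranked)   # top-ranked candidates become the document's keyphrases
```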
Thank You
