
International Journal of Research in Advent Technology, Vol. 2, No. 8, August 2014
E-ISSN: 2321-9637

Question Classification Using Support Vector Machine and Lexical, Semantic and Syntactic Features

Kiran Yadav, Megha Mishra
M.E. Scholar, SSCET Bhilai; Professor, SSCET Bhilai
Yadavkiran64@gmail.com

Abstract - Question classification plays an important role in question answering systems: its results largely determine the quality of the overall system. In this paper we present a question classification approach based on a Support Vector Machine (SVM) and a rich feature set. An SVM model is trained to classify questions into coarse categories, and lexical, syntactic and semantic features are used for the classification. SVMs have previously been applied to question classification with good results, and we likewise adopt the SVM as our classifier. The experimental results show that the extracted features perform well with the SVM and that our approach achieves high classification accuracy.

Index Terms - Question answering, text classification, machine learning, support vector machine.

1. INTRODUCTION
In this work, we use a machine learning approach to question classification, casting it as a supervised classification task. To prepare the learning model, we designed a rich set of features that are predictive of question categories.
In a question answering system, this classification serves an important purpose: it provides constraints on the answer types, which allows further processing to precisely locate and verify the answer. For the question "Which city has the largest population?", for example, we do not want to test every phrase in a document to see whether it provides an answer.
However, question classification has characteristics that distinguish it from ordinary text classification. On one hand, questions are relatively short and carry less word-based information than whole documents. On the other hand, short questions are amenable to more accurate, deeper-level analysis. In this way, work on question classification can also be seen as a case study in bringing semantic information into text classification. As with syntactic information such as part-of-speech tags, a natural way to use lexical semantic information is to replace or augment each word with its semantic class in the given context, then generate a feature-based representation and learn a mapping from this representation to the desired property. This general scheme leaves several issues open that make the analogy to syntactic categories nontrivial.
First, it is not obvious which semantic categories should be allowed and how to derive them. Second, it is not obvious how to handle the harder problem of semantic ambiguity when deciding on the representation of a sentence. We merge three types of features (lexical, syntactic and semantic) to increase the accuracy of question classification. Question classification plays an important role in question answering, and features are the key to obtaining an accurate question classifier.

Question answering systems approach this problem by providing a natural language interface in which users express their information need in the form of a natural language question, and by retrieving the exact answer to that question, rather than a set of documents, from a (typically large) collection of documents such as the WWW.
The development period of question answering systems in different fields is long and their reuse rate is low. We developed a state-of-the-art machine learning based question classifier that uses a rich set of lexical, syntactic and semantic features.

2. QUESTION CLASSIFICATION
Question classification assigns a question to the category that its answer belongs to, and is mainly used in question answering systems. It works category-wise: when a question arrives and its answer is searched for within the predicted category, the result is found quickly. When we search for something in a search engine such as Google, it returns everything related to the search terms; by classifying the question into a category, a question answering system can instead present only the answer to the question itself.

Table 1. The coarse question categories

Coarse: ABBR, DESC, ENTY, HUM, LOC, NUM

To simplify the following experiments, we assume that each question resides in only one category; that is, an ambiguous question is labeled with its most probable category.

2.1 Question types

What is the fastest fish in the world?
What's the colored part of the eye called?
What color is Mr. Spock's blood?
Name a novel written by John Steinbeck.
What currency is used in Australia?
What is the fear of cockroaches called?
What are the historical trials following World War II
called?
What is the world's best selling cookie?
What instrument is Ray Charles best known for
playing?
What language is mostly spoken in Brazil?
What letter adorns the flag of Rwanda?
What's the highest hand in straight poker?
What is the state tree of Nebraska?
What is the best brand for a laptop computer?
What religion has the most members?
What game is Garry Kasparov really good at?

3. RELATED WORK
Hand-made rule-based systems extract names using large sets of human-made rules. Basically, these systems consist of a set of patterns using grammatical (e.g. part-of-speech), syntactic (e.g. word precedence) and orthographic features (e.g. capitalization) in combination with dictionaries. An example sentence for this type of system is: "President Rao said bankers talks will make discussions on private, U.S. forces to leave Iraq." In this example, a proper noun that follows a person's title (President) is a person's name, and a proper noun that starts with a capital letter (Iraq) and follows the verb "to leave" is a location name.
In this family of approaches, Appelt et al. propose a name identification system called FASTUS, based on carefully handcrafted regular expressions. They divided the task into three steps: recognizing phrases, recognizing patterns, and merging incidents. These approaches rely on manually coded rules and manually compiled corpora. Such models achieve better results for restricted domains and are capable of detecting complex entities that learning models have difficulty with. However, rule-based NE systems lack portability and robustness, and the high cost of rule maintenance grows even when the data changes only slightly. These approaches are often domain- and language-specific and do not necessarily adapt well to new domains and languages.
In machine learning-based NER systems, the identification problem is converted into a classification problem, and a statistical classification model is employed to solve it. In this type of approach, the system looks for patterns and relationships in text and builds a model using statistical methods and machine learning algorithms; based on this model, it identifies and classifies nouns into particular classes such as persons, locations, times, etc. Two types of machine learning model are used for NER: supervised and unsupervised. Supervised learning involves a program that learns to classify a given set of labeled examples, each made up of the same set of features.

Each example is thus represented with respect to the different feature spaces. The learning process is called supervised because the people who marked up the training examples are teaching the program the right distinctions. The supervised learning approach requires labeled training data to construct a statistical model, but it cannot achieve good performance without a large amount of such data, because of the data sparseness problem. In recent years, several statistical methods based on supervised learning have been proposed. Bikel et al. propose a learning name-finder called Nymble, based on a hidden Markov model [8], while Borthwick et al. investigate exploiting diverse knowledge sources via maximum entropy in named entity recognition [9, 10]. A system for tagging unknown proper names with a decision tree model was proposed by Bechet et al. [5], while Wu et al. presented a named entity recognition system based on support vector machines [2]. Unsupervised learning is another type of machine learning model, in which the model learns without any feedback. In unsupervised learning, the goal of the program is to build representations from data; these representations can then be used for data compression, classification, decision making, and other purposes. Unsupervised learning is not a very popular approach for NER, and the systems that do use it are usually not completely unsupervised. Among these approaches, Collins et al. discuss an unsupervised model for named entity classification that uses unlabeled examples [7],

while Kim et al. propose unsupervised named entity classification models, and ensembles of them, that use a small-scale named entity dictionary and an unlabeled corpus for classifying named entities [4]. Unlike rule-based methods, these approaches can be easily ported to different domains or languages. Hybrid NER systems combine rule-based and machine learning-based methods, creating new methods that take the strongest points of each.



4. QUESTION FEATURES
One of the main challenges in developing a supervised classifier for a particular domain is to identify and design a rich set of features, a process generally referred to as feature engineering. In the subsections that follow, we present the different types of features that were used in the question classifier and how they are extracted from a given question.

4.1 Lexical Features
Lexical features refer to word-related features that are extracted directly from the question. In this work, we use word-level n-grams as lexical features. We also include in this section the techniques of stemming and stop word removal, which can be used to reduce the dimensionality of the feature set.
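
As a minimal sketch (our own illustration, not part of the original system), word-level unigram and bigram features can be extracted with scikit-learn's CountVectorizer; the sample questions are assumed for demonstration:

from sklearn.feature_extraction.text import CountVectorizer

# Illustrative questions; any question set would do.
questions = [
    "Which countries are bordered by France ?",
    "What is the state tree of Nebraska ?",
]

# ngram_range=(1, 2) yields both unigrams and bigrams as lexical features.
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(questions)

print(vectorizer.get_feature_names_out())  # the n-gram vocabulary
print(X.toarray())                         # one feature vector per question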

4.1.1 Stemming and Stop Word Removal
Stemming is a technique that reduces words to their grammatical roots, or stems, by removing their affixes. For instance, after applying stemming, the words "inventing" and "invented" both become "invent". We exploit this technique in our question classifier in the following manner. First, we represent the question using the bag-of-words model as previously described. Second, we apply Porter's stemming algorithm (Porter, 1980) to transform each word into its stem. The following two examples depict a question before and after stemming is applied, respectively.

(1) Which countries are bordered by France?
(2) Which countri are border by Franc?

Another related technique is to remove stop words, which are frequently occurring words with little semantic value, such as the articles "the" and "an". Both of these techniques are mainly used to reduce the feature space of the classifier, i.e., the number of total features that need to be considered. This is achieved either by collapsing several different forms of the same word into one distinct term through stemming, or by eliminating words that are likely to be present in most questions (stop words) and do not provide useful information for the classifier.
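
A minimal sketch of these two steps, assuming NLTK's Porter stemmer and English stop-word list as stand-ins for the paper's own pipeline:

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("stopwords", quiet=True)

stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

def preprocess(question):
    """Lower-case, drop stop words, and reduce each word to its stem."""
    tokens = question.lower().split()
    return [stemmer.stem(t) for t in tokens if t not in stop_words]

print(preprocess("Which countries are bordered by France ?"))
# -> ['countri', 'border', 'franc', '?']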

4.2 Syntactic Features
In addition to the information that is readily available in the input instance, it is common in natural language processing tasks to augment the sentence representation with syntactic categories, under the assumption that the sought-after property, for which we seek the classifier, depends on the syntactic role of a word in the sentence rather than on the specific word.

4.2.1 Question Headword
The question headword is the word in a given question that represents the information being sought. In the following examples, the headwords are "flower", "motorcycle", "country", and "mountain", respectively:

(1) What is Australia's national flower?
(2) Name an American-made motorcycle.
(3) Which country are Godiva chocolates from?
(4) What is the name of the highest mountain in
Africa?

In Example 1, the headword "flower" provides the classifier with an important clue for correctly classifying the question as ENTITY:PLANT. By the same token, "motorcycle" in Example 2 provides hints that help classify the question as ENTITY:VEHICLE. Indeed, in each of the examples above the headword serves as an important feature for unveiling the question's category, which is why we dedicate great effort to its accurate extraction. Our baseline classifier makes use of standard POS information and phrase information extracted by a shallow parser. Specifically, we use chunks (non-overlapping phrases) and head chunks. The following example illustrates the information available when generating the syntax-augmented feature-based representation.

Question: Who was the first woman killed in the Vietnam War?
Chunking: [NP Who] [VP was] [NP the first woman] [VP killed] [PP in] [NP the Vietnam War] ?

The head chunks denote the first noun or verb chunk after the question word in a question. For example, in the above question, the first noun chunk after the question word "who" is "the first woman". The features are represented as abstract tags in each example.
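
As a rough sketch of head-chunk extraction (using NLTK's POS tagger and a simple regular-expression NP chunker in place of the paper's shallow parser, which we do not have access to):

import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

# A toy NP grammar: optional determiner, any adjectives, then nouns.
chunker = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN.*>+}")

def head_noun_chunk(question):
    """Return the first noun chunk after the question word."""
    tagged = nltk.pos_tag(nltk.word_tokenize(question))
    tree = chunker.parse(tagged)
    for subtree in tree.subtrees(lambda t: t.label() == "NP"):
        return " ".join(word for word, tag in subtree.leaves())
    return ""

print(head_noun_chunk("Who was the first woman killed in the Vietnam War ?"))
# -> 'the first woman'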

4.3 Semantic Features
Similar logic can be applied to semantic categories. In many cases, the sought-after property seems to depend not on the specific word used in the sentence, which could be replaced without affecting this property, but rather on its meaning. For example, given the question "What Cuban dictator did Fidel Castro force out of power in 1958?", we would like to determine that its answer should be the name of a person. Knowing that "dictator" refers to a person is essential for correct classification.

This work systematically studies four semantic information sources and their contribution to classification: (1) automatically acquired named entity categories (NE), (2) word senses in WordNet (SemWN), (3) manually constructed word lists related to specific categories of interest (SemCSR), and (4) automatically generated semantically similar word lists (SemSWL) (Zhang & Lee, 2003).
For the four external semantic information sources, we define semantic categories of words and incorporate the information into question classification in the same way: if a word w occurs in a question, the question representation is augmented with the semantic category (or categories) of that word. For example, in the question "What is the state flower of California?", given that "plant" (for example) is the only semantic class of "flower", the feature extractor adds "plant" as an abstract label to the question representation.
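
A minimal sketch of this augmentation scheme; the tiny word-to-class dictionary below is purely illustrative:

# Hypothetical word-to-semantic-class mapping; a real system would draw
# these classes from NE tags, WordNet, or curated word lists.
SEMANTIC_CLASS = {"flower": "plant", "dictator": "person"}

def augment(tokens):
    """Append an abstract semantic label for every word with a known class."""
    features = list(tokens)
    for t in tokens:
        if t in SEMANTIC_CLASS:
            features.append("SEM=" + SEMANTIC_CLASS[t])
    return features

print(augment("what is the state flower of california".split()))
# -> the original words plus the abstract label 'SEM=plant'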

4.3.1 Named Entities
A named entity (NE) recognizer assigns a semantic category to some of the noun phrases in the question. The scope of the categories used here is broader than that of a common named entity recognizer. With additional categories that could help question answering, such as profession, event, holiday, plant, sport and medical, we redefine our task in the direction of semantic categorization. The named entity recognizer was built on the shallow parser described in (Voorhees, 2004) and was trained to categorize noun phrases into one of 34 different semantic categories of varying specificity. Its overall accuracy (F1) is above 90%. For the question "Who was the first woman killed in the Vietnam War?", the named entity tagger will return:

NE: Who was the [Num first] woman killed in the [Event Vietnam War]?

As described above, the identified named entities are added to the question representation.
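
As an illustrative stand-in for the paper's 34-category tagger (which is not publicly available), an off-the-shelf recognizer such as spaCy's can supply coarser entity labels:

import spacy

nlp = spacy.load("en_core_web_sm")

doc = nlp("Who was the first woman killed in the Vietnam War?")
for ent in doc.ents:
    # Each (text, label) pair becomes an abstract feature of the question.
    print(ent.text, ent.label_)
# expected output along the lines of: 'first' ORDINAL, 'the Vietnam War' EVENT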

4.3.2 WordNet Senses
In WordNet (Peters, 2005), words are organized according to their senses (meanings). Words of the same sense can, in principle, be exchanged in some contexts. The senses are organized in a hierarchy of hypernyms and hyponyms. Word senses provide another effective way to describe the semantic category of a word. For example, in WordNet 1.7 the word "water" belongs to 5 senses. The first two senses are:

Sense 1: binary compound that occurs at room temperature as a colorless odorless liquid;
Sense 2: body of water.

Sense 1 contains the words {H2O, water}, while Sense 2 contains {water, body of water}. Sense 1 has a hypernym (Sense 3: binary compound), and one hyponym of Sense 2 is (Sense 4: tap water). For each word in a question, all of its sense IDs and direct hypernym and hyponym IDs are extracted as features. This approach possibly introduces significant noise into classification, since only a small proportion of the senses are really related.
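
A minimal sketch of this feature extraction with NLTK's WordNet interface (modern WordNet versions will differ from the 1.7 sense inventory cited above):

import nltk
from nltk.corpus import wordnet as wn

nltk.download("wordnet", quiet=True)

def wordnet_features(word):
    """Collect sense IDs plus direct hypernym and hyponym IDs for a word."""
    ids = set()
    for synset in wn.synsets(word):
        ids.add(synset.name())                        # sense ID, e.g. 'water.n.01'
        ids.update(h.name() for h in synset.hypernyms())
        ids.update(h.name() for h in synset.hyponyms())
    return ids

print(sorted(wordnet_features("water"))[:5])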

5. SUPPORT VECTOR MACHINE
Machine learning tasks can take several forms. In supervised learning, the computer is presented with example inputs and their desired outputs, given by a "teacher", and the goal is to learn a general rule that maps inputs to outputs. Spam filtering is an example of supervised learning, in particular classification, where the learning algorithm is presented with email (or other) messages labeled beforehand as "spam" or "not spam", in order to produce a program that labels unseen messages as either spam or not.
In unsupervised learning, no labels are given to the learning algorithm, leaving it on its own to find groups of similar inputs (clustering), density estimates, or projections of high-dimensional data that can be visualized effectively.
Unsupervised learning can be a goal in itself (discovering hidden patterns in data) or a means towards an end. Topic modeling is an example of unsupervised learning, where a program is given a list of human language documents and is tasked with finding out which documents cover similar topics. Supervised learning is the machine learning task of inferring a function from labeled training data.
The training data consist of a set of training examples. In supervised learning, each example is a pair consisting of an input object (typically a vector) and a desired output value (also called the supervisory signal). A supervised learning algorithm analyzes the training data and produces an inferred function, which can be used for mapping new examples. An optimal scenario will allow the algorithm to correctly determine the class labels for unseen instances. This requires the learning algorithm to generalize from the training data to unseen situations in a "reasonable" way (see inductive bias).
In machine learning, the problem of unsupervised learning is that of trying to find hidden structure in unlabeled data. Since the examples given to the learner are unlabeled, there is no error or reward signal to evaluate a potential solution; this distinguishes unsupervised learning from supervised learning and reinforcement learning. Unsupervised learning is closely related to the problem of density estimation in statistics, but it also encompasses many other techniques that seek to summarize and explain key features of the data.
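
To make the classification setting concrete, here is a minimal supervised sketch with scikit-learn's LinearSVC over the six coarse categories; the tiny training set is illustrative only, and real accuracy requires a full labeled corpus:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# A handful of labeled questions; a real classifier needs thousands.
questions = [
    "What is the fastest fish in the world ?",
    "Which country are Godiva chocolates from ?",
    "Who was the first woman killed in the Vietnam War ?",
    "What currency is used in Australia ?",
]
labels = ["ENTY", "LOC", "HUM", "ENTY"]

# TF-IDF over word unigrams/bigrams feeds a linear-kernel SVM.
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
clf.fit(questions, labels)

print(clf.predict(["What religion has the most members ?"]))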



6. CONCLUSION
In this paper we presented a detailed overview of learning-based question classification approaches. Question classification is a hard problem: the machine needs to understand the question and classify it into the right category, which is done through a series of complicated steps. We reviewed different learning methods and feature extraction techniques for question classification. Deciding on the best model and the optimal set of features is not a simple problem, but enhancing the feature space with syntactic and semantic features can usually improve the classification accuracy.
7. FUTURE WORK
In the question classification task, we have shown that a machine learning-based classifier can perform well using solely superficial features. In future work, we aim to further increase the accuracy of the question answering system by combining the three feature types with the SVM (support vector machine) method.

8. RESULT
The proposed approach increases the accuracy of answer detection, achieving 95.2% classification accuracy.

Acknowledgements
I thank Prof. Megha Mishra for several valuable suggestions, and the entire SSCET team for help with various components, feature suggestions and guidance.

REFERENCES
[1] Zhang, D., & Lee, W. S. (2003). Question classification using support vector machines. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 26-32).
[2] Voorhees, E. M. (2004). Overview of the TREC 2004 question answering track. In E. M. Voorhees & L. P. Buckland (Eds.), TREC (Vol. Special Publication 500-261). National Institute of Standards and Technology (NIST).
[3] Wang, Y.-C., Wu, J.-C., Liang, T., & Chang, J. S. (2005). Web-based unsupervised learning for query formulation in question answering. In IJCNLP (pp. 519-529).
[4] Vallin, A., Magnini, B., Giampiccolo, D., Aunimo, L., & Ayache, C. (2006). Multilingual question answering track. In C. Peters (Ed.), Accessing Multilingual Information Repositories. Berlin, Heidelberg: Springer-Verlag.
[5] Turmo, J., Ageno, A., & Català, N. (2006). Adaptive information extraction. ACM Computing Surveys, 38(2), 4.
[6] Petrov, S., & Klein, D. (2007, April). Improved inference for unlexicalized parsing. In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics.
[7] Pan, Y., Tang, Y., Lin, L., & Luo, Y. (2008). Question classification with semantic tree kernel. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 837-838). New York, NY, USA: ACM.
[8] Quarteroni, S., & Manandhar, S. (2009). Designing an interactive open-domain question answering system. Journal of Natural Language Engineering, 15(1).
[9] Chanlekha, H., & Collier, N. (2010). Journal of Biomedical Semantics, 1:3. http://www.jbiomedsem.com/content/1/1/3
[10] Mertsalov, K. (2009, January). Document classification with support vector machines. Rational Retention, LLC. kmertsalov@rationalretention.com
[11] Moschitti, A., & Quarteroni, S. (2011). Linguistic kernels for answer re-ranking in question answering systems. Information Processing and Management. University of Trento, Via Sommarive 14, 38050 Povo, Trento, Italy. www.elsevier.com/locate/infoproman
