
An Analysis of Sentence Level Text Classification for the Kannada Language

Jayashree R
Department of Computer Science
PES Institute of Technology
Bangalore, India
Jayashree@pes.edu

Srikanta Murthy K
Department of Computer Science
PES School of Engineering
Bangalore, India
srikantamurthy@pes.edu

Abstract— With the rapid growth of the Internet, a huge amount of data is available online, and the ability to draw useful information from this digital data is quite challenging. Exploring and extracting information from native-language content available online is a very useful task. The work presented here focuses on sentence-level classification in the Kannada language. Two of the most popular approaches in text categorization, Naive Bayesian and Bag of Words (BOW), are used in this work. It is evident that the Bag of Words approach performs significantly better than the Naive Bayesian approach. The objective of the work is to find how sentence-level classification works for the Kannada language, as it can be extended further to sentiment classification, Question Answering, Text Summarization, and customer reviews in Kannada blogs. Because most user comments, queries and opinions are expressed as sentences, sentence-level Text Classification becomes a special case of the Text Classification problem. Though the work presently focuses on very basic approaches, it can later be extended to other methods such as SVM and KNN.

Keywords- sentence level classification; Kannada text classification; Naive Bayesian; Bag of Words; single label

I. INTRODUCTION

With large sets of digital data online, there is a growing need for methods to sort, retrieve, filter and manage digital resources. Information Retrieval (IR) techniques such as Text Classification offer tools for converting unstructured data into structured data. Text Classification is a process in which a set of "documents" is labeled with "classes", indicating that the documents in a class share a relationship which the documents across classes do not.
When a document can belong to more than one class, it is called Multi-label Classification. When there are two classes involved, it is called Binary Classification.
Text categorization is the task of assigning a Boolean value to each pair (dj, ci) ∈ D × C, where D is a domain of documents and C = {c1, . . . , c|C|} is a set of predefined categories. A value of T assigned to (dj, ci) indicates a decision to file dj under ci, while a value of F indicates a decision not to file dj under ci. More formally, the task is to approximate the unknown target function Φ̆ : D × C → {T, F} (which describes how documents ought to be classified) by means of a function Φ : D × C → {T, F}, called the classifier (also known as a rule, hypothesis, or model), such that Φ̆ and Φ coincide as much as possible [3].

978-1-4577-1196-1/11/$26.00 ©2011 IEEE
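As a toy illustration of this Boolean-valued formulation, a classifier Φ can be viewed as a function from (document, category) pairs to {T, F}; the categories and keyword sets below are purely hypothetical:

```python
# Minimal sketch of the Boolean text-categorization formulation:
# Phi maps each (document, category) pair to True (T) or False (F).
# The category keyword sets are hypothetical, for illustration only.

CATEGORIES = {
    "sports": {"match", "score", "team"},
    "politics": {"election", "party", "vote"},
}

def phi(document: str, category: str) -> bool:
    """Return True if `document` should be filed under `category`."""
    words = set(document.lower().split())
    return len(words & CATEGORIES[category]) > 0

print(phi("The team won the match", "sports"))    # True
print(phi("The team won the match", "politics"))  # False
```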
This work looks at single-label classification and uses the Bag of Words and Naive Bayesian approaches to solve it. The work classifies individual sentences into classes and thus differs from document classification. The work presented here is novel in that we develop sentence-level classification for the Kannada language, wherein the classification problem is addressed at a more fine-grained level.
II. LITERATURE SURVEY

The survey shows that sentence-level classification can be extended to sentiment classification. The ability to correctly classify sentences that describe events is important for many Natural Language Processing applications such as Question Answering (QA) and Text Summarization [4], [5].
Recent years have seen a large growth in online customer reviews. Classifying reviews into positive and negative ones would be helpful in business intelligence applications and recommender systems. The challenging aspect of this problem, which distinguishes it from the traditional classification problem, is that sentiment expression is more free-style, so classification features are more difficult to determine.
The sentence-level classification method differs from many earlier research methods in that the processing unit is a sentence, which means a finer-grained analysis [5].
Martina Naughton et al. have investigated statistical techniques for sentence-level classification. Event detection is a core NLP task that focuses on the automatic identification and classification of various event types in text. This task has applications in automatic Text Summarization and Question Answering (QA) [6]. There are many subjective texts on the web, such as product reviews, movie reviews, news, editorials and blogs. Extracting these subjective texts and analyzing their orientations play significant roles in many applications such as electronic commerce. One of the most important tasks in this field is sentiment classification, which can be performed at several levels: word level, sentence level, passage level, etc. [7]. Sentence-level text categorization has also been applied to named entity and relationship extraction [8], [9], automatic text summarization [10], and rhetorical analysis [11]. Event detection is a sentence-level Text Classification problem.
The majority of questions, concerns and opinions posed on the web relate to events and daily-life situations. Events can be expressed as phrases; hence event detection can be looked at from a sentence-level perspective, which in turn can help in generating extractive summaries. Martina Naughton et al. [6] have used statistical methods for identifying the sentences in a document that describe one or more instances of a specified event type. They treat this task as a classification problem, where each sentence in a given document is classified as either containing an instance of the target event or not. Sentence-level event classification is an important first step for many Natural Language Processing applications such as Question Answering (QA) and summarization systems [12]. The objective of sentiment classification is to classify a text according to the sentimental polarities of the opinions it contains, e.g. favorable or unfavorable, positive or negative. Since sentiments are expressed in sentences, sentiment classification can be treated as a sentence-level classification problem. Jun Zhao et al. present a novel method for sentiment classification based on Conditional Random Fields (CRFs), in response to two special characteristics of sentence sentiment classification: contextual dependency and label redundancy [13]. The work of Ainur Yessenalina et al. proposes a two-level approach for document sentiment classification, where useful sentences are extracted and document-level sentiment is predicted based on the extracted sentences [14].
III. ABOUT THE CORPUS

The labeled corpus used for training the classifier is a custom-developed corpus whose source is Kannada Wikipedia. A set of documents was extracted and manually categorized into classes. The documents were merged and the sentence boundaries in the documents were marked manually. A total of 3225 sentences were used for the classification process.

Table 1.1 Class-wise Distribution of Sentences
Class           Number of Sentences
Biotechnology   840
Literature      1040
Politics        308
Technology      1037
Total           3225

IV. METHODOLOGY

A. Naive Bayes
A Naive Bayesian classifier with an estimator was used as an alternative to the Bag of Words approach supported by cardinality of the intersection. A word vector is created based on the training data. The dimensions in the vector indicate the presence of a word; no special weightage parameter was used in classification.
The word occurrence probability is given by the relation

P(W_i | C_j) = (1 + N(W_i, C_j)) / (|V| + Σ_{k=1..|V|} N(W_k, C_j))

where N(W_i, C_j) is the frequency of word W_i in the training sentences of class C_j and |V| is the vocabulary size; the resulting values P(W_i | C_j) form the probability vector of the characteristic words for every class.
To find the class that a text d belongs to, the following quantity must be maximized:

P(C_j | d) = [ P(C_j) · Π_{i=1..n} P(W_i | C_j)^N(W_i, d) ] / [ Σ_{k=1..|C|} P(C_k) · Π_{i=1..n} P(W_i | C_k)^N(W_i, d) ]

where P(C_j | d) is the probability of text d belonging to class C_j, |C| is the number of classes, N(W_i, d) is the frequency of word W_i in d, and n is the number of characteristic words.

1) Dimensionality Reduction
The morphological richness of the Kannada language causes the feature dimensions to be in the order of tens of thousands. For practical classification, large amounts of training samples are required to train the classifier, and the size of the feature set has a significant impact on the time required for classification. This makes dimensionality reduction a requirement for text classification in Kannada (and Indian languages in general).

a) Using stop words

Stop words are words that do not hold information about the class of the text. The function words of a language are usually identified as stop words. These words are considered noise and are removed from the classification process. Due to the non-availability of a standard stop-word list for Kannada, a list was developed and used in this work. The corpus developed for use in this work was also used to develop the stop-word list: the words were manually examined, and the following set of stop words was created for use in the classification process.
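Stop-word removal of this kind can be sketched as follows; the file name kannada_stopwords.txt is hypothetical, standing in for the manually developed list:

```python
# Sketch of stop-word filtering. The file name below is hypothetical,
# standing in for the manually developed Kannada stop-word list
# (one stop word per line, UTF-8 encoded).
def load_stopwords(path: str) -> set:
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

def remove_stopwords(tokens: list, stopwords: set) -> list:
    """Drop tokens that carry no class information."""
    return [t for t in tokens if t not in stopwords]
```

For example, remove_stopwords(sentence.split(), load_stopwords("kannada_stopwords.txt")) would yield the content-bearing tokens of a sentence.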

Table 1.2 List of Stop words
[The Kannada stop-word entries of this table were lost in text extraction and are not reproducible here.]

2011 International Conference of Soft Computing and Pattern Recognition (SoCPaR)

b) Using a restriction based on word occurrence
Another method used to reduce the number of features is removing the words that occur only once in the entire data. While words that occur once do add information that a classifier can use, the number of such words is very large, which adds an unrealistic requirement for training data.

Table 1.3 Word Occurrence and the Number of Unique Words
Word Occurrence (M)   Number of Unique Words
1                     10644
2                     2109
3                     755
4                     404
5                     238
Rest                  1932
Total                 16082
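Putting the pieces of Section IV-A together, a minimal multinomial Naive Bayes sketch with add-one (Laplace) smoothing and the minimum-occurrence cutoff M might look like this. It is an illustration under stated assumptions, not the authors' implementation; whitespace tokenization is assumed:

```python
from collections import Counter, defaultdict
import math

# Multinomial Naive Bayes sketch with add-one smoothing, matching
# P(W|C) = (1 + N(W,C)) / (|V| + sum_k N(W_k,C)), plus the
# minimum-occurrence cutoff M from Table 1.3.
class NaiveBayesSentenceClassifier:
    def __init__(self, min_occurrence=2):
        self.min_occurrence = min_occurrence  # the cutoff M

    def fit(self, sentences, labels):
        total = Counter(w for s in sentences for w in s.split())
        # Keep only words seen at least M times in the whole data.
        self.vocab = {w for w, c in total.items() if c >= self.min_occurrence}
        self.class_word_counts = defaultdict(Counter)
        self.class_counts = Counter(labels)
        for s, y in zip(sentences, labels):
            for w in s.split():
                if w in self.vocab:
                    self.class_word_counts[y][w] += 1

    def predict(self, sentence):
        def log_score(c):
            counts = self.class_word_counts[c]
            denom = len(self.vocab) + sum(counts.values())
            # log prior P(C) plus log likelihoods of the sentence words
            score = math.log(self.class_counts[c] / sum(self.class_counts.values()))
            for w in sentence.split():
                if w in self.vocab:
                    score += math.log((1 + counts[w]) / denom)
            return score
        return max(self.class_counts, key=log_score)
```

Log probabilities are summed rather than multiplied to avoid floating-point underflow on long sentences.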

B. Bag of words based model for high precision
Every class in this approach is represented by the set of words that occur in at least one sentence belonging to that class. Every sentence is represented by the set of words that form the sentence. This model uses training data and hence belongs to the supervised form of classification.
To assign a class to a test sentence, its bag of words (the set of words that form the sentence) is derived and set intersection is performed with each candidate class. The class with the highest number of common words is the class the sentence is assigned to.
A sentence s is denoted by a bag of words S, where S = {w | w ∈ s}. A class c is denoted by a bag of words C, where C = {w | w ∈ s, s ∈ c}. A test sentence is assigned to the class C that maximizes the cardinality of the intersection of the bags of words:

assigned class = argmax_C |S ∩ C|
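The cardinality-of-intersection rule can be sketched as follows (a minimal illustration, not the authors' exact implementation; whitespace tokenization is an assumption):

```python
# Sketch of the cardinality-of-intersection Bag of Words classifier:
# each class is the set of words occurring in its training sentences,
# and a test sentence goes to the class sharing the most words with it.
def build_class_bags(sentences, labels):
    bags = {}
    for s, y in zip(sentences, labels):
        bags.setdefault(y, set()).update(s.split())
    return bags

def classify(sentence, bags):
    words = set(sentence.split())
    # argmax over classes of |S ∩ C|; ties resolve to the first class seen
    return max(bags, key=lambda c: len(words & bags[c]))
```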

V. EVALUATION

K-fold Cross Validation is used in this work for evaluating classifier performance. This technique involves splitting the data into K disjoint partitions and carrying out K rounds of testing, with one partition as the test set and the remaining partitions as training data. It is ensured that each partition is used as a test set only once.
The parameters used in measuring the performance of the classifier are Precision (P), Recall (R) (also called the TP rate) and F-Score (F), defined as:

Precision = (proportion of the examples which truly have class x) / (total classified as class x)
Recall (TP rate / True Positive rate) = (proportion classified as class x) / (actual total of class x)
FP rate (False Positive rate) = (proportion incorrectly classified as class x) / (actual total of all classes, except x)
F-Measure = (2 × Precision × Recall) / (Precision + Recall)

VI. RESULTS AND DISCUSSION

The two models developed are evaluated against the test set using 10-fold Cross Validation, and the results are as shown below. The behavior of the classifier for varying values of the minimum word occurrence requirement (M) is analyzed, the values being 2 to 5; the evaluation measures for M = 1 are derived by plotting the points on a graph, deriving a best-fit curve, and interpolating co-ordinates from the curve.

A. Naive Bayes
Weighted averages for precision, recall and F-scores:
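The per-class precision, recall and F-score figures reported in the tables that follow can be computed from a confusion matrix, as sketched here (a minimal illustration; rows are actual classes, columns are predicted classes):

```python
# Compute per-class precision, recall (TP rate) and F-score from a
# confusion matrix given as a dict of dicts: matrix[actual][predicted].
def per_class_metrics(matrix, cls):
    classes = list(matrix)
    tp = matrix[cls][cls]
    predicted_as_cls = sum(matrix[a][cls] for a in classes)  # column sum
    actual_cls = sum(matrix[cls][p] for p in classes)        # row sum
    precision = tp / predicted_as_cls if predicted_as_cls else 0.0
    recall = tp / actual_cls if actual_cls else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f
```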


Table 1.4 Effects of Stop Words on Performance

      Without stop-word filter       With stop-word filter
M     P      R      F                P      R      F
5     0.548  0.517  0.501            0.678  0.509  0.468
4     0.588  0.562  0.556            0.674  0.534  0.508
3     0.607  0.596  0.593            0.705  0.567  0.546
2     0.699  0.685  0.682            0.728  0.641  0.638
1*    0.729  0.725  0.728            0.742  0.670  0.677
* interpolated

As M decreases, the evaluation parameters show a significant rise, which shows the effect that low-occurrence words have on classification. To estimate the impact if single-occurrence words were included, the values from the table are used and the corresponding values for M = 1 are derived using best-fit regression.
Taking M = 2, the class-wise breakup of the classification results is as shown:
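As a sketch of this regression, a least-squares line fitted to the measured (M, F) points for M = 2..5 (F-scores without the stop-word filter, from Table 1.4) reproduces the reported M = 1 value; the linear curve family is an assumption, since the paper does not state which curve was fitted:

```python
# Fit a least-squares line to the measured (M, F-score) points for
# M = 2..5 (without the stop-word filter, Table 1.4) and extrapolate
# to M = 1. The linear family is an assumption.
def linear_fit(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
            sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx

M = [5, 4, 3, 2]
F = [0.501, 0.556, 0.593, 0.682]
slope, intercept = linear_fit(M, F)
f_at_1 = slope * 1 + intercept
print(round(f_at_1, 3))  # 0.728, matching the starred row of Table 1.4
```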
Table 1.5 Performance Measures for Naive Bayesian Approach
Class        TP Rate  FP Rate  Precision  F-Score
Biotech      0.586    0.034    0.859      0.696
Literature   0.478    0.034    0.869      0.617
Politics     0.403    0.024    0.639      0.494
Technology   0.921    0.426    0.506      0.653

Apart from politics, the F-Scores of the classes are very similar. The inverse relation of Precision and Recall (TP rate) can be seen here: the technology class is classified with low precision but high recall, whereas the biotech and literature classes have high precision and low recall.
The confusion matrix representing the distribution of errors in classification is as shown.

Table 1.6 Confusion Matrix for Naive Bayesian Approach
                 classified as
                 a      b      c      d
a = biotech      533    68     25     214
b = literature   58     685    39     258
c = politics     22     69     133    84
d = technology   73     89     18     857

B. Bag of words based model for high precision

The new model developed in this work yields high precision and low recall. The overall precision of the classification is 89.05%. The class-wise breakup for the evaluation measures used is as shown:

Table 1.7 Performance Measures for BOW Approach
Class        Precision  TP Rate  FP Rate  F-Score
Biotech      0.848      0.173    0.009    0.345
Literature   0.867      0.160    0.010    0.319
Technology   0.965      0.136    0.002    0.272
Politics     0.935      0.101    0.001    0.201

Certain applications require high precision: the classified sentences should be classified correctly, while achieving high recall is not a concern. This method suits such cases.

Table 1.8 Confusion Matrix for BOW Approach
                 classified as
                 a      b      c      d
a = biotech      123    0      2      0
b = literature   7      144    3      2
c = politics     11     8      136    0
d = technology   4      14     0      29

Error analysis in this case shows confusion between the politics and biotech classes, and between the literature and technology classes. Otherwise, the classification is shown to perform very well where high precision is required.

VII. CONCLUSION

Manual error analysis has shown that there is a significant possibility of sentences belonging to multiple classes. In some cases, sentences might not have sufficient class information as standalone entities and might rely on neighboring sentences to convey the class information. This can be captured to increase the performance of the classifier. Multi-label classification is appropriate for sentences that can be part of two or more classes. Also, a possible hierarchy among the classes can be explored to support multi-label classification. The work can be extended to online customer reviews in Kannada blogs. It can also be used in extracting opinions from Kannada articles posted online. Fine-grained classification, i.e. at sentence level or sub-sentence level, can be achieved using this approach. Depending on whether the application requires high precision or high recall, the appropriate method can be chosen.


REFERENCES
[1] Fengxia Pan, "Multi-Dimensional Fragment Classification in Biomedical Text", M.S. thesis, Queen's University, Kingston, Ontario, Canada, pages 1-6, 2006.
[2] K. Raghuveer and Kavi Narayana Murthy, "Text Categorization in Indian Languages using Machine Learning Approaches", IICAI, pages 1-20, 2007.
[3] Fabrizio Sebastiani, "Machine Learning in Automated Text Categorization", ACM Computing Surveys, Vol. 34, No. 1, pages 1-47, 2002.
[4] Guohong Fu and Xin Wang, "Chinese Sentence-Level Sentiment Classification Based on Fuzzy Sets", Coling 2010 poster volume, pages 312-319, 2010.
[5] Linlin Li and Tianfang Yao, "Kernel-based Sentiment Classification for Chinese Sentences", CSIE '09: Proceedings of the 2009 WRI World Congress on Computer Science and Information Engineering, Volume 5, pages 1-6, 2009.
[6] Martina Naughton, Nicola Stokes and Joe Carthy, "Investigating Statistical Techniques for Sentence-Level Event Classification", Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008), pages 1-8, 2008.
[7] Jun Zhao, Kang Liu and Gen Wang, "Adding Redundant Features for CRFs-based Sentence Sentiment Classification", Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 117-126, Honolulu, October 2008.
[8] A. Borthwick, J. Sterling, E. Agichtein and R. Grishman, "Exploiting Diverse Knowledge Sources via Maximum Entropy in Named Entity Recognition", Proceedings of the 6th Workshop on Very Large Corpora, pages 1-9, 1998.
[9] M. Skounakis, M. Craven and S. Ray, "Hierarchical Hidden Markov Models for Information Extraction", Proceedings of the 18th International Joint Conference on Artificial Intelligence (IJCAI '03), pages 427-433, 2003.
[10] J. Kupiec, J. Pedersen and F. Chen, "A Trainable Document Summarizer", Proceedings of the 18th Annual International ACM SIGIR Conference, 1995.
[11] D. Marcu and A. Echihabi, "An Unsupervised Approach to Recognizing Discourse Relations", Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), pages 368-375, 2002.
[12] Shoushan Li, Sophia Yat Mei Lee, Ying Chen, Chu-Ren Huang and Guodong Zhou, "Sentiment Classification and Polarity Shifting", Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010), pages 635-643, Beijing, August 2010.
[13] Jun Zhao, Kang Liu and Gen Wang, "Adding Redundant Features for CRFs-based Sentence Sentiment Classification", Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 117-126, Honolulu, October 2008, ACL.
[14] Ainur Yessenalina, Yisong Yue and Claire Cardie, "Multi-Level Structured Models for Document-Level Sentiment Classification", Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 1046-1056, MIT, USA, October 2010, ACL.
