
An Analysis of Sentence Level Text Classification for the Kannada Language

Jayashree R
Department of Computer Science
PES Institute of Technology
Bangalore, India
Jayashree@pes.edu

Srikanta Murthy K
Department of Computer Science
PES School of Engineering
Bangalore, India
srikantamurthy@pes.edu

Abstract— With the rapid growth of the Internet, a huge amount of data is available online, and the ability to draw useful information from this digital data is quite challenging. Exploring and extracting information from native-language content available online is a very useful task. The work presented here focuses on sentence-level classification in the Kannada language. Two of the most popular approaches in text categorization, Naive Bayesian and Bag of Words (BOW), are used in this work. It is evident that the Bag of Words approach performs significantly better than the Naive Bayesian approach. The objective of the work is to find how sentence-level classification works for the Kannada language, as it can be extended further to sentiment classification, Question Answering, Text Summarization, and customer reviews in Kannada blogs. Because most user comments, queries and opinions are expressed as sentences, sentence-level Text Classification becomes a special case of the Text Classification problem. Though the work presently focuses on very basic approaches, it can later be extended to other methods such as SVM and KNN.

Keywords- sentence level classification; Kannada text classification; Naive Bayesian; Bag of Words; single label

I. INTRODUCTION

With large sets of digital data online, there is a growing need for methods to sort, retrieve, filter and manage digital resources. Information Retrieval (IR) techniques such as Text Classification offer tools for converting unstructured data into structured data. Text Classification is a process in which a set of "documents" is labeled with "classes", indicating that the documents in a class share a relationship which the documents across classes do not.
When a document can belong to more than one class, it is called Multi-label Classification. When there are two classes involved, it is called Binary Classification.
Text categorization is the task of assigning a Boolean value to each pair (dj, ci) ∈ D × C, where D is a domain of documents and C = {c1, . . . , c|C|} is a set of predefined categories. A value of T assigned to (dj, ci) indicates a decision to file dj under ci, while a value of F indicates a decision not to file dj under ci. More formally, the task is to approximate the unknown target function Φ̆ : D × C → {T, F} (which describes how documents ought to be classified) by means of a function Φ : D × C → {T, F}, called the classifier (also known as a rule, hypothesis, or model), such that Φ̆ and Φ coincide as much as possible [3].

978-1-4577-1196-1/11/$26.00 ©2011 IEEE
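As a toy illustration of this Boolean-valued formulation, a classifier Φ can be viewed as a function from (document, category) pairs to {T, F}; the categories and keyword sets below are purely hypothetical:

```python
# Minimal sketch of the Boolean text-categorization formulation:
# Phi maps each (document, category) pair to True (T) or False (F).
# The category keyword sets are hypothetical, for illustration only.

CATEGORIES = {
    "sports": {"match", "score", "team"},
    "politics": {"election", "party", "vote"},
}

def phi(document: str, category: str) -> bool:
    """Return True if `document` should be filed under `category`."""
    words = set(document.lower().split())
    return len(words & CATEGORIES[category]) > 0

print(phi("The team won the match", "sports"))    # True
print(phi("The team won the match", "politics"))  # False
```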
This work looks at single-label classification and uses the Bag of Words and Naive Bayesian approaches to solve it. The work classifies individual sentences into classes and thus differs from document classification. The work presented here is novel in that we develop sentence-level classification for the Kannada language, wherein the classification problem is addressed at a more fine-grained level.
II. LITERATURE SURVEY

The survey shows that sentence-level classification can be extended to sentiment classification. The ability to correctly classify sentences that describe events is important for many Natural Language Processing applications such as Question Answering (QA) and Text Summarization [4], [5].
Recent years have seen a large growth in online customer reviews. Classifying reviews into positive and negative ones would be helpful in business intelligence applications and recommender systems. The challenging aspect of this problem, which distinguishes it from the traditional classification problem, is that sentiment expression is more free-style, so classification features are more difficult to determine.
The sentence-level classification method differs from many earlier research methods in that the processing unit is a sentence, which means a finer-grained analysis [5].
Martina Naughton et al. have investigated statistical techniques for sentence-level classification. Event detection is a core NLP task that focuses on the automatic identification and classification of various event types in text. This task has applications in automatic Text Summarization and Question Answering (QA) [6]. There are many subjective texts on the web, such as product reviews, movie reviews, news, editorials and blogs. Extracting these subjective texts and analyzing their orientations play significant roles in many applications such as electronic commerce. One of the most important tasks in this field is sentiment classification, which can be performed at several levels: word level, sentence level, passage level, etc. [7]. Sentence-level text categorization has also been applied to named entity and relationship extraction [8], [9], automatic text summarization [10], and rhetorical analysis [11]. Event detection is a sentence-level Text Classification problem.
The majority of questions, concerns and opinions posed on the web relate to events and daily-life situations. Events can be expressed as phrases; hence event detection can be looked at from a sentence-level perspective, which in turn can help in generating extractive summaries. Martina Naughton et al. [6] have used statistical methods for identifying the sentences in a document that describe one or more instances of a specified event type. They treat this task as a classification problem, where each sentence in a given document is classified as either containing an instance of the target event or not. Sentence-level event classification is an important first step for many Natural Language Processing applications such as Question Answering (QA) and summarization systems [12]. The objective of sentiment classification is to classify a text according to the sentimental polarities of the opinions it contains, e.g. favorable or unfavorable, positive or negative. Since sentiments are expressed in sentences, sentiment classification can be treated as a sentence-level classification problem. Jun Zhao et al. present a novel method for sentiment classification based on Conditional Random Fields (CRFs), in response to two special characteristics of sentence sentiment classification: contextual dependency and label redundancy [13]. The work of Ainur Yessenalina et al. proposes a two-level approach for document sentiment classification, where useful sentences are extracted and document-level sentiment is predicted based on the extracted sentences [14].
III. ABOUT THE CORPUS

The labeled corpus used for training the classifier is a custom-developed corpus whose source is Kannada Wikipedia. A set of documents was extracted and manually categorized into classes. The documents were merged and the sentence boundaries in the documents were marked manually. A total of 3225 sentences were used for the classification process.

Table 1.1 Class-wise Distribution of Sentences
Class           Number of Sentences
Biotechnology   840
Literature      1040
Politics        308
Technology      1037
Total           3225

IV. METHODOLOGY

A. Naive Bayes
A Naive Bayesian classifier with an estimator was used as an alternative to the Bag of Words approach supported by cardinality of the intersection. A word vector is created based on the training data. The dimensions in the vector indicate the presence of a word; no special weightage parameter was used in classification.
The word occurrence probability is given by the relation

P(W_i | C_j) = (1 + N(W_i, C_j)) / (|V| + Σ_{k=1..|V|} N(W_k, C_j))

where N(W_i, C_j) is the frequency of word W_i in the training sentences of class C_j and |V| is the vocabulary size; the resulting values P(W_i | C_j) form the probability vector of the characteristic words for every class.
To find the class that a text d belongs to, the following quantity must be maximized:

P(C_j | d) = [ P(C_j) · Π_{i=1..n} P(W_i | C_j)^N(W_i, d) ] / [ Σ_{k=1..|C|} P(C_k) · Π_{i=1..n} P(W_i | C_k)^N(W_i, d) ]

where P(C_j | d) is the probability of text d belonging to class C_j, |C| is the number of classes, N(W_i, d) is the frequency of word W_i in d, and n is the number of characteristic words.

1) Dimensionality Reduction
The morphological richness of the Kannada language causes the feature dimensions to be in the order of tens of thousands. For practical classification, large amounts of training samples are required to train the classifier, and the size of the feature set has a significant impact on the time required for classification. This makes dimensionality reduction a requirement for text classification in Kannada (and Indian languages in general).

a) Using stop words

Stop words are words that do not hold information about the class of the text. The function words of a language are usually identified as stop words. These words are considered noise and are removed from the classification process. Due to the non-availability of a standard stop-word list for Kannada, a list was developed and used in this work. The corpus developed for use in this work was also used to develop the stop-word list: the words were manually examined, and the following set of stop words was created for use in the classification process.
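Stop-word removal of this kind can be sketched as follows; the file name kannada_stopwords.txt is hypothetical, standing in for the manually developed list:

```python
# Sketch of stop-word filtering. The file name below is hypothetical,
# standing in for the manually developed Kannada stop-word list
# (one stop word per line, UTF-8 encoded).
def load_stopwords(path: str) -> set:
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

def remove_stopwords(tokens: list, stopwords: set) -> list:
    """Drop tokens that carry no class information."""
    return [t for t in tokens if t not in stopwords]
```

For example, remove_stopwords(sentence.split(), load_stopwords("kannada_stopwords.txt")) would yield the content-bearing tokens of a sentence.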

Table 1.2 List of Stop words
[The Kannada stop-word entries of this table were lost in text extraction and are not reproducible here.]

2011 International Conference of Soft Computing and Pattern Recognition (SoCPaR)

b) Using a restriction based on word occurrence
Another method used to reduce the number of features is removing the words that occur only once in the entire data. While words that occur once do add information that a classifier can use, the number of such words is very large, which adds an unrealistic requirement for training data.

Table 1.3 Word Occurrence and the Number of Unique Words
Word Occurrence (M)   Number of Unique Words
1                     10644
2                     2109
3                     755
4                     404
5                     238
Rest                  1932
Total                 16082
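Putting the pieces of Section IV-A together, a minimal multinomial Naive Bayes sketch with add-one (Laplace) smoothing and the minimum-occurrence cutoff M might look like this. It is an illustration under stated assumptions, not the authors' implementation; whitespace tokenization is assumed:

```python
from collections import Counter, defaultdict
import math

# Multinomial Naive Bayes sketch with add-one smoothing, matching
# P(W|C) = (1 + N(W,C)) / (|V| + sum_k N(W_k,C)), plus the
# minimum-occurrence cutoff M from Table 1.3.
class NaiveBayesSentenceClassifier:
    def __init__(self, min_occurrence=2):
        self.min_occurrence = min_occurrence  # the cutoff M

    def fit(self, sentences, labels):
        total = Counter(w for s in sentences for w in s.split())
        # Keep only words seen at least M times in the whole data.
        self.vocab = {w for w, c in total.items() if c >= self.min_occurrence}
        self.class_word_counts = defaultdict(Counter)
        self.class_counts = Counter(labels)
        for s, y in zip(sentences, labels):
            for w in s.split():
                if w in self.vocab:
                    self.class_word_counts[y][w] += 1

    def predict(self, sentence):
        def log_score(c):
            counts = self.class_word_counts[c]
            denom = len(self.vocab) + sum(counts.values())
            # log prior P(C) plus log likelihoods of the sentence words
            score = math.log(self.class_counts[c] / sum(self.class_counts.values()))
            for w in sentence.split():
                if w in self.vocab:
                    score += math.log((1 + counts[w]) / denom)
            return score
        return max(self.class_counts, key=log_score)
```

Log probabilities are summed rather than multiplied to avoid floating-point underflow on long sentences.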

B. Bag of words based model for high precision
Every class in this approach is represented by the set of words that occur in at least one sentence belonging to that class. Every sentence is represented by the set of words that form the sentence. This model uses training data and hence belongs to the supervised form of classification.
To assign a class to a test sentence, its bag of words (the set of words that form the sentence) is derived and set intersection is performed with each candidate class. The class with the highest number of common words is the class the sentence is assigned to.
A sentence s is denoted by a bag of words S, where S = {w | w ∈ s}. A class c is denoted by a bag of words C, where C = {w | w ∈ s, s ∈ c}. A test sentence is assigned to the class C that maximizes the cardinality of the intersection of the bags of words:

assigned class = argmax_C |S ∩ C|
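The cardinality-of-intersection rule can be sketched as follows (a minimal illustration, not the authors' exact implementation; whitespace tokenization is an assumption):

```python
# Sketch of the cardinality-of-intersection Bag of Words classifier:
# each class is the set of words occurring in its training sentences,
# and a test sentence goes to the class sharing the most words with it.
def build_class_bags(sentences, labels):
    bags = {}
    for s, y in zip(sentences, labels):
        bags.setdefault(y, set()).update(s.split())
    return bags

def classify(sentence, bags):
    words = set(sentence.split())
    # argmax over classes of |S ∩ C|; ties resolve to the first class seen
    return max(bags, key=lambda c: len(words & bags[c]))
```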

V. EVALUATION

K-fold Cross Validation is used in this work for evaluating classifier performance. This technique involves splitting the data into K disjoint partitions and carrying out K rounds of testing, with one partition as the test set and the remaining partitions as training data. It is ensured that each partition is used as a test set only once.
The parameters used in measuring the performance of the classifier are Precision (P), Recall (R) (also called the TP rate) and F-Score (F), defined as:

Precision = (proportion of the examples which truly have class x) / (total classified as class x)
Recall (TP rate / True Positive rate) = (proportion classified as class x) / (actual total of class x)
FP rate (False Positive rate) = (proportion incorrectly classified as class x) / (actual total of all classes, except x)
F-Measure = (2 × Precision × Recall) / (Precision + Recall)

VI. RESULTS AND DISCUSSION

The two models developed are evaluated against the test set using 10-fold Cross Validation, and the results are as shown below. The behavior of the classifier for varying values of the minimum word occurrence requirement (M) is analyzed, the values being 2 to 5; the evaluation measures for M = 1 are derived by plotting the points on a graph, deriving a best-fit curve, and interpolating co-ordinates from the curve.

A. Naive Bayes
Weighted averages for precision, recall and F-scores:
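The per-class precision, recall and F-score figures reported in the tables that follow can be computed from a confusion matrix, as sketched here (a minimal illustration; rows are actual classes, columns are predicted classes):

```python
# Compute per-class precision, recall (TP rate) and F-score from a
# confusion matrix given as a dict of dicts: matrix[actual][predicted].
def per_class_metrics(matrix, cls):
    classes = list(matrix)
    tp = matrix[cls][cls]
    predicted_as_cls = sum(matrix[a][cls] for a in classes)  # column sum
    actual_cls = sum(matrix[cls][p] for p in classes)        # row sum
    precision = tp / predicted_as_cls if predicted_as_cls else 0.0
    recall = tp / actual_cls if actual_cls else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f
```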


Table 1.4 Effects of Stop Words on Performance

      Without stop-word filter       With stop-word filter
M     P      R      F                P      R      F
5     0.548  0.517  0.501            0.678  0.509  0.468
4     0.588  0.562  0.556            0.674  0.534  0.508
3     0.607  0.596  0.593            0.705  0.567  0.546
2     0.699  0.685  0.682            0.728  0.641  0.638
1*    0.729  0.725  0.728            0.742  0.670  0.677
* interpolated

As M decreases, the evaluation parameters show a significant rise, which shows the effect that low-occurrence words have on classification. To estimate the impact if single-occurrence words were included, the values from the table are used and the corresponding values for M = 1 are derived using best-fit regression.
Taking M = 2, the class-wise breakup of the classification results is as shown:
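As a sketch of this regression, a least-squares line fitted to the measured (M, F) points for M = 2..5 (F-scores without the stop-word filter, from Table 1.4) reproduces the reported M = 1 value; the linear curve family is an assumption, since the paper does not state which curve was fitted:

```python
# Fit a least-squares line to the measured (M, F-score) points for
# M = 2..5 (without the stop-word filter, Table 1.4) and extrapolate
# to M = 1. The linear family is an assumption.
def linear_fit(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
            sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx

M = [5, 4, 3, 2]
F = [0.501, 0.556, 0.593, 0.682]
slope, intercept = linear_fit(M, F)
f_at_1 = slope * 1 + intercept
print(round(f_at_1, 3))  # 0.728, matching the starred row of Table 1.4
```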
Table 1.5 Performance Measures for Naive Bayesian Approach
Class        TP Rate  FP Rate  Precision  F-Score
Biotech      0.586    0.034    0.859      0.696
Literature   0.478    0.034    0.869      0.617
Politics     0.403    0.024    0.639      0.494
Technology   0.921    0.426    0.506      0.653

Apart from politics, the F-Scores of the classes are very similar. The inverse relation of Precision and Recall (TP rate) can be seen here: the technology class is classified with low precision but high recall, whereas the biotech and literature classes have high precision and low recall.
The confusion matrix representing the distribution of errors in classification is as shown.

Table 1.6 Confusion Matrix for Naive Bayesian Approach
                 classified as
                 a      b      c      d
a = biotech      533    68     25     214
b = literature   58     685    39     258
c = politics     22     69     133    84
d = technology   73     89     18     857

B. Bag of words based model for high precision

The new model developed in this work yields high precision and low recall. The overall precision of the classification is 89.05%. The class-wise breakup for the evaluation measures used is as shown:

Table 1.7 Performance Measures for BOW Approach
Class        Precision  TP Rate  FP Rate  F-Score
Biotech      0.848      0.173    0.009    0.345
Literature   0.867      0.160    0.010    0.319
Technology   0.965      0.136    0.002    0.272
Politics     0.935      0.101    0.001    0.201

Certain applications require high precision: the classified sentences should be classified correctly, while achieving high recall is not a concern. This method suits such cases.

Table 1.8 Confusion Matrix for BOW Approach
                 classified as
                 a      b      c      d
a = biotech      123    0      2      0
b = literature   7      144    3      2
c = politics     11     8      136    0
d = technology   4      14     0      29

Error analysis in this case shows confusion between the politics and biotech classes, and between the literature and technology classes. Otherwise, the classification is shown to perform very well where high precision is required.

VII. CONCLUSION

Manual error analysis has shown that there is a significant possibility of sentences belonging to multiple classes. In some cases, sentences might not have sufficient class information as standalone entities and might rely on neighboring sentences to convey the class information. This can be captured to increase the performance of the classifier. Multi-label classification is appropriate for sentences that can be part of two or more classes. Also, a possible hierarchy among the classes can be explored to support multi-label classification. The work can be extended to online customer reviews in Kannada blogs. It can also be used in extracting opinions from Kannada articles posted online. Fine-grained classification, i.e. at sentence level or sub-sentence level, can be achieved using this approach. Depending on whether the application requires high precision or high recall, the appropriate method can be chosen.


REFERENCES
[1] Fengxia Pan, "Multi-Dimensional Fragment Classification in Biomedical Text", M.S. thesis, Queen's University, Kingston, Ontario, Canada, pages 1-6, 2006.
[2] K. Raghuveer and Kavi Narayana Murthy, "Text Categorization in Indian Languages using Machine Learning Approaches", IICAI, pages 1-20, 2007.
[3] Fabrizio Sebastiani, "Machine Learning in Automated Text Categorization", ACM Computing Surveys, Vol. 34, No. 1, pages 1-47, 2002.
[4] Guohong Fu and Xin Wang, "Chinese Sentence-Level Sentiment Classification Based on Fuzzy Sets", Coling 2010 poster volume, pages 312-319, 2010.
[5] Linlin Li and Tianfang Yao, "Kernel-based Sentiment Classification for Chinese Sentences", CSIE '09: Proceedings of the 2009 WRI World Congress on Computer Science and Information Engineering, Volume 5, pages 1-6, 2009.
[6] Martina Naughton, Nicola Stokes and Joe Carthy, "Investigating Statistical Techniques for Sentence-Level Event Classification", Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008), pages 1-8, 2008.
[7] Jun Zhao, Kang Liu and Gen Wang, "Adding Redundant Features for CRFs-based Sentence Sentiment Classification", Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 117-126, Honolulu, October 2008.
[8] A. Borthwick, J. Sterling, E. Agichtein and R. Grishman, "Exploiting Diverse Knowledge Sources via Maximum Entropy in Named Entity Recognition", Proceedings of the 6th Workshop on Very Large Corpora, pages 1-9, 1998.
[9] M. Skounakis, M. Craven and S. Ray, "Hierarchical Hidden Markov Models for Information Extraction", Proceedings of the 18th International Joint Conference on Artificial Intelligence (IJCAI '03), pages 427-433, 2003.
[10] J. Kupiec, J. Pedersen and F. Chen, "A Trainable Document Summarizer", Proceedings of the 18th Annual International ACM SIGIR Conference, 1995.
[11] D. Marcu and A. Echihabi, "An Unsupervised Approach to Recognizing Discourse Relations", Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), pages 368-375, 2002.
[12] Shoushan Li, Sophia Yat Mei Lee, Ying Chen, Chu-Ren Huang and Guodong Zhou, "Sentiment Classification and Polarity Shifting", Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010), pages 635-643, Beijing, August 2010.
[13] Jun Zhao, Kang Liu and Gen Wang, "Adding Redundant Features for CRFs-based Sentence Sentiment Classification", Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 117-126, Honolulu, October 2008, ACL.
[14] Ainur Yessenalina, Yisong Yue and Claire Cardie, "Multi-Level Structured Models for Document-Level Sentiment Classification", Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 1046-1056, MIT, USA, October 2010, ACL.
