WSEAS TRANSACTIONS on COMPUTERS, Issue 8, Volume 4, August 2005, pp. 966-974

Text Classification Using Machine Learning Techniques

M. IKONOMAKIS
Department of Mathematics, University of Patras, GREECE
ikonomakis@mailbox.gr

S. KOTSIANTIS
Department of Mathematics, University of Patras, GREECE
sotos@math.upatras.gr

V. TAMPAKAS
Technological Educational Institute of Patras, GREECE
tampakas@teipat.gr

Abstract: Automated text classification is considered a vital method for managing and processing the vast and continuously growing number of documents in digital form. In general, text classification plays an important role in information extraction and summarization, text retrieval, and question answering. This paper illustrates the text classification process using machine learning techniques. The references cited cover the major theoretical issues and guide the researcher to interesting research directions.

Key-Words: text mining, learning algorithms, feature selection, text representation

1 Introduction

Automatic text classification has been an important application and research topic since the inception of digital documents. Today, text classification is a necessity due to the very large amount of text documents that we have to deal with daily.

In general, text classification includes topic-based text classification and text genre-based classification. Topic-based text categorization classifies documents according to their topics [33]. Texts can also be written in many genres, for instance: scientific articles, news reports, movie reviews, and advertisements. Genre is defined by the way a text was created, the way it was edited, the register of language it uses, and the kind of audience to whom it is addressed. Previous work on genre classification recognized that this task differs from topic-based categorization [13].

Typically, most data for genre classification are collected from the web, through newsgroups, bulletin boards, and broadcast or printed news. They are multi-source, and consequently have different formats, different preferred vocabularies and often significantly different writing styles even for documents within one genre. Namely, the data are heterogeneous.

Intuitively, Text Classification is the task of classifying a document under a predefined category. More formally, if $d_i$ is a document of the entire set of documents $D$ and $\{c_1, c_2, \ldots, c_n\}$ is the set of all the categories, then text classification assigns one category $c_j$ to a document $d_i$.

As in every supervised machine learning task, an initial dataset is needed. A document may be assigned to more than one category (Ranking Classification), but in this paper only research on Hard Categorization (assigning a single category to each document) is taken into consideration. Moreover, approaches that take into consideration other information besides the pure text, such as the hierarchical structure of the texts or the date of publication, are not presented. This is because the main aim of this paper is to present techniques that make the most of the text of each document and perform best under this condition.

Sebastiani gave an excellent review of the text classification domain [25]. Thus, in this work, apart from a brief description of text classification, we refer to some works more recent than those in Sebastiani's article, as well as a few articles that were not cited by Sebastiani. Figure 1 gives the graphical representation of the Text Classification process.

[Figure 1 flowchart; boxes: Read Document, Tokenize Text, Stemming, Delete Stopwords, Vector Representation of Text, Feature Selection and/or Feature Transformation, Learning algorithm]
Fig. 1. Text Classification Process

*The Project is co-funded by the European Social Fund and National Resources.

The task of constructing a classifier for documents does not differ a lot from other tasks of Machine Learning. The main issue is the representation of a document [16]. In Section 2 the document representation is presented. One particularity of the text categorization problem is
that the number of features (unique words or phrases) can easily reach orders of tens of thousands. This raises big hurdles in applying many sophisticated learning algorithms to text categorization. Thus dimension reduction methods are called for. Two possibilities exist: either selecting a subset of the original features [3], or transforming the features into new ones, that is, computing new features as some functions of the old ones [10]. We examine both in turn in Section 3 and Section 4.

After the previous steps a Machine Learning algorithm can be applied. Some algorithms, such as Support Vector Machines, have been shown to perform better in Text Classification tasks and are more often used. A brief description of recent modifications of learning algorithms for application to Text Classification is given in Section 5. There are a number of methods to evaluate the performance of machine learning algorithms in Text Classification. Most of these methods are described in Section 6. Some open problems are mentioned in the last section.

2 Vector space document representations

A document is a sequence of words [16]. So each document is usually represented by an array of words. The set of all the words of a training set is called the vocabulary, or feature set. So a document can be represented by a binary vector, assigning the value 1 if the document contains the feature-word or 0 if the word does not appear in the document. This can be translated as positioning a document in an $\mathbb{R}^{|V|}$ space, where $|V|$ denotes the size of the vocabulary $V$.

Not all of the words present in a document can be used in order to train the classifier [19]. There are useless words such as auxiliary verbs, conjunctions and articles. These words are called stopwords. There exist many lists of such words, which are removed as a preprocessing task. This is done because these words appear in most of the documents.

Stemming is another common preprocessing step. Its purpose is to reduce the size of the initial feature set by removing misspelled words or words with the same stem. A stemmer (an algorithm which performs stemming) removes words with the same stem and keeps the stem or the most common of them as the feature. For example, the words train, training, trainer and trains can be replaced with train.

Although stemming is considered by the Text Classification community to amplify the classifiers' performance, there are some doubts about the actual importance of aggressive stemming, such as that performed by the Porter Stemmer [25].

An ancillary feature engineering choice is the representation of the feature value [16]. Often a Boolean indicator of whether the word occurred in the document is sufficient. Other possibilities include the count of the number of times the word occurred in the document, the frequency of its occurrence normalized by the length of the document, and the count normalized by the inverse document frequency of the word. In situations where the document length varies widely, it may be important to normalize the counts. Further, in short documents words are unlikely to repeat, making Boolean word indicators nearly as informative as counts. This yields a great saving in training resources and in the search space of the induction algorithm, which may otherwise try to discretize each feature optimally, searching over the number of bins and each bin's threshold.

Most of the text categorization algorithms in the literature represent documents as collections of words. An alternative which has not been sufficiently explored is the use of word meanings, also known as senses. Using several algorithms, Kehagias et al. compared the categorization accuracy of classifiers based on words to that of classifiers based on senses [12]. The document collection on which this comparison took place is a subset of the annotated Brown Corpus semantic concordance. A series of experiments indicated that the use of senses does not result in any significant categorization improvement.

3 Feature Selection

The aim of feature-selection methods is the reduction of the dimensionality of the dataset by removing features that are considered irrelevant for the classification [6]. This transformation procedure has been shown to present a number of advantages, including smaller dataset size, smaller computational requirements for the text categorization algorithms (especially those that do not scale well with the feature set size) and considerable shrinking of the search space. The goal is to reduce the curse of dimensionality and so yield improved classification accuracy. Another benefit of feature selection is its tendency to reduce overfitting, i.e. the phenomenon by which a classifier is tuned also to the contingent characteristics of the training data rather than the constitutive characteristics of the categories, and therefore to increase generalization.

Methods for feature subset selection for the text document classification task use an evaluation function that is applied to a single word [27]. Scoring of individual words (Best Individual Features) can be performed using measures such as document frequency, term frequency, mutual information, information gain, odds ratio, the $\chi^2$ statistic and term strength [3], [30], [6], [28], [27]. What is common to all of these feature-scoring methods is that they conclude by ranking the features by their independently determined scores, and then select the top scoring features. The most common metrics are presented in Table 1. The symbols used in Table 1 are described in Table 2.

In contrast to Best Individual Features (BIF) methods, sequential forward selection (SFS) methods first select the best single word according to a given criterion [20], and then add one word at a time until the number of selected words reaches the desired k words. SFS methods do not result in the optimal word subset, but they take note of dependencies between words, as opposed to the BIF methods. Therefore SFS often gives better results than BIF. However, SFS is not usually used in text classification because of its computational cost due to the large vocabulary size.

Forman has presented a benchmark comparison of 12 metrics on well known training sets [6]. According to Forman, Bi-Normal Separation (BNS) performed best by a wide margin using 500 to 1000 features, while Information Gain outperforms the other metrics when the number of features varies between 20 and 50. Accuracy 2 performed as well as Information Gain. Chi-square performed consistently worse than Information Gain. Since there is no metric that performs consistently better than all others, researchers often combine two metrics in order to benefit from both [6].

Novovicova et al. used SFS that took into account not only the mutual information between a class and a word but also between a class and two words [22]. The results were slightly better.

Although machine learning based text classification is a good method as far as performance is concerned, it is inefficient at handling very large training corpora. Thus, apart from feature selection, instance selection is often needed.
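The preprocessing, representation and Best Individual Features steps described in Sections 2 and 3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure of any cited work: the stopword list, the toy documents and all function names are hypothetical, stemming is omitted, and document frequency is used as the scoring metric only because it is the simplest entry in Table 1.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"a", "an", "the", "is", "are", "and", "or", "of", "to", "in", "on"}


def tokenize(text):
    """Lowercase the text, split it into alphabetic tokens, drop stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def build_vocabulary(token_lists):
    """Return the sorted set of all words seen in the training documents."""
    return sorted({w for tokens in token_lists for w in tokens})


def binary_vector(tokens, vocabulary):
    """1 if the feature-word occurs in the document, 0 otherwise."""
    present = set(tokens)
    return [1 if w in present else 0 for w in vocabulary]


def tf_idf_vector(tokens, vocabulary, doc_freq, n_docs):
    """Count normalized by the most frequent term, times log(|D| / #(f))."""
    counts = Counter(tokens)
    max_count = max(counts.values()) if counts else 1
    return [
        (counts[w] / max_count) * math.log(n_docs / doc_freq[w]) if counts[w] else 0.0
        for w in vocabulary
    ]


def top_k_by_document_frequency(token_lists, k):
    """Best Individual Features: score each word alone, keep the k best."""
    doc_freq = Counter(w for tokens in token_lists for w in set(tokens))
    # Sort by descending score; ties are broken alphabetically.
    ranked = sorted(doc_freq.items(), key=lambda item: (-item[1], item[0]))
    return [w for w, _ in ranked[:k]]


docs = [
    "The train is on time",
    "The train and the trainer are late",
    "The market is on the rise",
]
token_lists = [tokenize(d) for d in docs]
vocabulary = build_vocabulary(token_lists)
vectors = [binary_vector(t, vocabulary) for t in token_lists]
doc_freq = Counter(w for tokens in token_lists for w in set(tokens))
weights = tf_idf_vector(token_lists[0], vocabulary, doc_freq, len(docs))
selected = top_k_by_document_frequency(token_lists, 2)
```

Each document becomes a point in a space with one dimension per vocabulary word, and the selection step keeps only the k highest-scoring dimensions. Swapping in another metric from Table 1 only changes how the score of each word is computed.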

Metrics and their mathematical forms

Information Gain
$$IG(t) = -\sum_{i=1}^{m} P(c_i)\log P(c_i) + P(t)\sum_{i=1}^{m} P(c_i \mid t)\log P(c_i \mid t) + P(\bar{t})\sum_{i=1}^{m} P(c_i \mid \bar{t})\log P(c_i \mid \bar{t})$$

Gain Ratio
$$GR(t_k, c_i) = \frac{\sum_{c \in \{c_i, \bar{c}_i\}} \sum_{t \in \{t_k, \bar{t}_k\}} P(t, c)\log\frac{P(t, c)}{P(t)P(c)}}{-\sum_{c \in \{c_i, \bar{c}_i\}} P(c)\log P(c)}$$

Conditional Mutual Information
$$CMI(C \mid S) = H(C) - H(C \mid S_1, S_2, \ldots, S_n)$$

Document Frequency
$$DF(t_k) = P(t_k)$$

Term Frequency
$$tf(f_i, d_j) = \frac{freq_{ij}}{\max_k freq_{kj}}$$

Inverse Document Frequency
$$idf_i = \log\frac{|D|}{\#(f_i)}$$

Chi-square
$$\chi^2(f_i, c_j) = \frac{|D| \left( \#(c_j, f_i)\,\#(\bar{c}_j, \bar{f}_i) - \#(c_j, \bar{f}_i)\,\#(\bar{c}_j, f_i) \right)^2}{\left( \#(c_j, f_i) + \#(c_j, \bar{f}_i) \right)\left( \#(\bar{c}_j, f_i) + \#(\bar{c}_j, \bar{f}_i) \right)\left( \#(c_j, f_i) + \#(\bar{c}_j, f_i) \right)\left( \#(c_j, \bar{f}_i) + \#(\bar{c}_j, \bar{f}_i) \right)}$$

Term Strength
$$s(t) = P(t \in y \mid t \in x)$$

Weighted Odds Ratio
$$WOddsRatio(w) = P(w) \times OddsRatio(w)$$

Odds Ratio
$$OddsRatio(f_i, c_j) = \log\frac{P(f_i \mid c_j)\left(1 - P(f_i \mid \bar{c}_j)\right)}{\left(1 - P(f_i \mid c_j)\right)P(f_i \mid \bar{c}_j)}$$

Logarithmic Probability Ratio
$$LogProbRatio(w) = \log\frac{P(w \mid c)}{P(w \mid \bar{c})}$$

Pointwise Mutual Information
$$I(x, y) = \log\frac{P(x, y)}{P(x)P(y)}$$

Category Relevance Factor (CRF)
$$CRF(f_i, c_j) = \log\frac{\#(f_i) / \#(c_j)}{\#(f_i, \bar{c}_j) / \#(\bar{c}_j)}$$

Odds Numerator
$$OddsNum(w, c) = P(w \mid c)\left(1 - P(w \mid \bar{c})\right)$$

Probability Ratio
$$PrR(w \mid c) = \frac{P(w \mid c)}{P(w \mid \bar{c})}$$

Bi-Normal Separation
$$F^{-1}\left(P(w \mid c)\right) - F^{-1}\left(P(w \mid \bar{c})\right)$$

Pow
$$\left(1 - P(w \mid \bar{c})\right)^k - \left(1 - P(w \mid c)\right)^k$$

Topic Relevance using Relative Word Position
$$M_{DF_n}(w, c) = \log\frac{DF_n(w, c)}{DF_n(w, db)} + \log\frac{|db|}{|c|}$$

Topic Relevance using Document Frequency
$$M_{DF_n}(w, c) = \log\frac{DF(w, c)}{DF(w, db)} + \log\frac{|db|}{|c|}$$

Topic Relevance using Modified Document Frequency
$$M_{DF_n}(w, c) = \log\frac{\widetilde{DF}(w, c)}{\widetilde{DF}(w, db)} + \log\frac{|db|}{|c|}$$

Topic Relevance using Term Frequency
$$M_{TF_n}(w, c) = \log\frac{TF(w, c)}{TF(w, db)} + \log\frac{|db|}{|c|}$$

Weight of evidence for Text
$$Weight(w) = \sum_i P(c_i)\,P(w)\,\log\frac{P(c_i \mid w)\left(1 - P(c_i)\right)}{P(c_i)\left(1 - P(c_i \mid w)\right)}$$

Table 1. Feature Selection metrics
$c$ : a class of the training set
$C$ : the set of classes of the training set
$d$ : a document of the training set
$D$ or $db$ : the set of documents of the training set
$t$ or $w$ : a term or word
$P(c)$ or $P(c_i)$ : the probability of the class $c$ or $c_i$ respectively, i.e. how often the class appears in the training set
$P(\bar{c})$ : the probability of the class not occurring
$P(c \mid t)$ : the probability of the class $c$ given that the term $t$ appears. Respectively, $P(\bar{c} \mid t)$ denotes the probability of class $c$ not occurring, given that the term $t$ appears
$P(c, t)$ : the probability of the class $c$ and term $t$ occurring simultaneously
$H(C)$ : the entropy of the set $C$
$DF(t_k)$ : the document frequency of term $t_k$
$DF_n(t)$ : the frequency of term $t$ in documents containing $t$ in each of their $n$ splits
$\widetilde{DF}(t)$ : the document frequency, taking into consideration only documents in which $t$ appears more than once
$\#(c)$ or $\#(t)$ : the number of documents which belong to class $c$ or, respectively, contain the term $t$
$\#(c, t)$ : the number of documents containing term $t$ and belonging to class $c$
Table 2. Symbolisms
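As a worked illustration of two of the metrics in Table 1, the sketch below computes chi-square and information gain for a single term and a single class from the four document counts $\#(c, t)$, $\#(c, \bar{t})$, $\#(\bar{c}, t)$ and $\#(\bar{c}, \bar{t})$. The function and parameter names are hypothetical, the counts are made up for the example, information gain is restricted to the two-class case ($c$ versus $\bar{c}$), and base-2 logarithms are an assumption since Table 1 does not fix the base.

```python
import math


def chi_square(n_ct, n_c_not_t, n_not_c_t, n_not_c_not_t):
    """Chi-square statistic for one term/class pair from document counts.

    n_ct          -- documents in class c that contain term t
    n_c_not_t     -- documents in class c that do not contain t
    n_not_c_t     -- documents outside c that contain t
    n_not_c_not_t -- documents outside c that do not contain t
    """
    total = n_ct + n_c_not_t + n_not_c_t + n_not_c_not_t
    numerator = total * (n_ct * n_not_c_not_t - n_c_not_t * n_not_c_t) ** 2
    denominator = (
        (n_ct + n_c_not_t)
        * (n_not_c_t + n_not_c_not_t)
        * (n_ct + n_not_c_t)
        * (n_c_not_t + n_not_c_not_t)
    )
    return numerator / denominator if denominator else 0.0


def entropy(counts):
    """Entropy in bits of the distribution given by a list of counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in counts if n)


def information_gain(n_ct, n_c_not_t, n_not_c_t, n_not_c_not_t):
    """Information gain of term t for the two-class case (c versus not c).

    Written as H(C) - P(t) H(C | t) - P(not t) H(C | not t), which is the
    Table 1 expression with the entropies expanded.
    """
    total = n_ct + n_c_not_t + n_not_c_t + n_not_c_not_t
    n_t = n_ct + n_not_c_t
    n_not_t = n_c_not_t + n_not_c_not_t
    h_class = entropy([n_ct + n_c_not_t, n_not_c_t + n_not_c_not_t])
    h_given_t = entropy([n_ct, n_not_c_t])
    h_given_not_t = entropy([n_c_not_t, n_not_c_not_t])
    return h_class - (n_t / total) * h_given_t - (n_not_t / total) * h_given_not_t


# A term that appears in 40 of 50 documents of the class and in 10 of 50
# documents outside it is informative; one spread evenly is not.
informative_chi2 = chi_square(40, 10, 10, 40)          # 36.0
informative_ig = information_gain(40, 10, 10, 40)
uninformative_chi2 = chi_square(25, 25, 25, 25)        # 0.0
uninformative_ig = information_gain(25, 25, 25, 25)
```

Both scores rank the first term above the second, which is all a Best Individual Features method needs: the absolute values matter less than the ordering they induce over the vocabulary.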

Guan and Zhou proposed a training-corpus pruning based approach to speed up the process [8]. According to their experiments, this approach significantly reduces the size of the training corpus while keeping classification performance at a level close to that obtained without pruning the training documents.

Fragoudis et al. [7] integrated Feature and Instance Selection for Text Classification with even better results. Their method works in two steps. In the first step, their method sequentially selects features that have high precision in predicting the target class. All documents that do not contain at least one such feature are dropped from the training set. In the second step, their method searches within this subset of the initial dataset for a set of features that tend to predict the complement of the target class, and these features are also selected. The union of the features selected during these two steps is the new feature set, and the documents selected in the first step comprise the training set.

4 Feature Transformation

Feature Transformation varies significantly from Feature Selection approaches, but like them its purpose is to reduce the feature set size [10]. This approach does not weight terms in order to discard the lower weighted ones, but compacts the vocabulary based on feature co-occurrences.

Principal Component Analysis is a well known method for feature transformation [38]. Its aim is to learn a discriminative transformation matrix that maps the initial feature space into a lower dimensional feature space, in order to reduce the complexity of the classification task without any trade-off in accuracy. The transform is derived from the eigenvectors corresponding to the dominant eigenvalues of the covariance matrix. The covariance matrix of data in PCA corresponds to the document-term matrix multiplied by its transpose. Entries in the covariance matrix represent co-occurring terms in the documents. Eigenvectors of this matrix corresponding to the dominant eigenvalues are directions related to dominant combinations of terms, which can be called topics or semantic concepts. A transform matrix constructed from these eigenvectors projects a document onto these latent semantic concepts, and the new low dimensional representation consists of the magnitudes of these projections. The eigenanalysis can be computed efficiently by a sparse variant of singular value decomposition of the document-term matrix [11].

In the information retrieval community this method has been named Latent Semantic Indexing (LSI) [23]. This approach is not intuitively
discernible for a human but has a good performance.

Qiang et al. [37] performed experiments using k-NN LSI, a new combination of the standard k-NN method on top of LSI, applying a new matrix decomposition algorithm, Semi-Discrete Matrix Decomposition, to decompose the vector matrix. The experimental results showed that text categorization effectiveness in this space was better, and that it was also computationally less costly because it needed a lower dimensional space.

The authors of [4] present a comparison of the performance of a number of text categorization methods on two different data sets. In particular, they evaluate the Vector and LSI methods, a classifier based on Support Vector Machines (SVM) and the k-Nearest Neighbor variations of the Vector and LSI models. Their results show that overall, SVMs and k-NN LSI perform better than the other methods, in a statistically significant way.

5 Machine learning algorithms

After feature selection and transformation the documents can be easily represented in a form that can be used by a ML algorithm. Many text classifiers have been proposed in the literature using machine learning techniques, probabilistic models, etc. They often differ in the approach adopted: decision trees, naive Bayes, rule induction, neural networks, nearest neighbors, and lately, support vector machines. Although many approaches have been proposed, automated text classification is still a major area of research, primarily because the effectiveness of current automated text classifiers is not faultless and still needs improvement.

Naive Bayes is often used in text classification applications and experiments because of its simplicity and effectiveness [14]. However, its performance is often degraded because it does not model text well. Schneider addressed these problems and showed that they can be solved by some simple corrections [24]. Klopotek and Woch presented results of an empirical evaluation of a Bayesian multinet classifier based on a new method of learning very large tree-like Bayesian networks [15]. The study suggests that tree-like Bayesian networks are able to handle a text classification task in one hundred thousand variables with sufficient speed and accuracy.

Support vector machines (SVM), when applied to text classification, provide excellent precision but poor recall. One means of customizing SVMs to improve recall is to adjust the threshold associated with an SVM. Shanahan and Roma described an automatic process for adjusting the thresholds of generic SVMs [26] with better results.

Johnson et al. described a fast decision tree construction algorithm that takes advantage of the sparsity of text data, and a rule simplification method that converts the decision tree into a logically equivalent rule set [9].

Lim proposed a method which improves the performance of kNN based text classification by using well estimated parameters [18]. Some variants of the kNN method with different decision functions, k values, and feature sets were proposed and evaluated to find adequate parameters.

The corner classification (CC) network is a kind of feed forward neural network for instant document classification. A training algorithm named TextCC is presented in [34].

The level of difficulty of text classification tasks naturally varies. As the number of distinct classes increases, so does the difficulty, and therefore the size of the training set needed. In any multi-class text classification task, inevitably some classes will be more difficult than others to classify. Reasons for this may be: (1) very few positive training examples for the class, and/or (2) lack of good predictive features for that class.

When training a binary classifier per category in text categorization, we use all the documents in the training corpus that belong to that category as relevant training data and all the documents in the training corpus that belong to all the other categories as non-relevant training data. It is often the case that there is an overwhelming number of non-relevant training documents, especially when there is a large collection of categories with each assigned to a small number of documents, which is typically an "imbalanced data problem". This problem presents a particular challenge to classification algorithms, which can achieve high accuracy by simply classifying every example as negative. To overcome this problem, cost sensitive learning is needed [5].

A scalability analysis of a number of classifiers in text categorization is given in [32]. Vinciarelli presents categorization experiments performed over noisy texts [31]. By noisy is meant any text obtained through an extraction process (affected by errors) from media other than digital texts (e.g. transcriptions of speech recordings extracted with a recognition system). The performance of the categorization system over the clean and noisy (Word Error Rate between ~10 and ~50 percent)
versions of the same documents is compared. The noisy texts are obtained through Handwriting Recognition and simulation of Optical Character Recognition. The results show that the performance loss is acceptable.

Other authors [36] also proposed to parallelize and distribute the process of text classification. With such a procedure, the performance of classifiers can be improved in both accuracy and time complexity.

Recently in the area of Machine Learning the concept of combining classifiers has been proposed as a new direction for the improvement of the performance of individual classifiers. Numerous methods have been suggested for the creation of ensembles of classifiers. Mechanisms that are used to build ensembles of classifiers include: i) using different subsets of training data with a single learning method, ii) using different training parameters with a single training method (e.g. using different initial weights for each neural network in an ensemble), iii) using different learning methods.

In the context of combining multiple classifiers for text categorization, a number of researchers have shown that combining different classifiers can improve classification accuracy [1], [29]. When the best individual classifier is compared with the combined method, it is observed that the performance of the combined method is superior [2]. Nardiello et al. [21] also proposed algorithms in the family of "boosting"-based learners for automated text classification with good results.

6 Evaluation

There are various methods to determine effectiveness; however, precision, recall, and accuracy are most often used. To determine these, one must first begin by understanding if the classification of a document was a true positive (TP), false positive (FP), true negative (TN), or false negative (FN) (see Table 3).

TP : Determined as a document being classified correctly as relating to a category.
FP : Determined as a document that is said to be related to the category incorrectly.
FN : Determined as a document that is not marked as related to a category but should be.
TN : Documents that should not be marked as being in a particular category and are not.
Table 3. Classification of a document

Precision ($\pi_i$) is determined as the conditional probability that a random document $d$ is classified under $c_i$, or what would be deemed the correct category. It represents the classifier's ability to place a document under the correct category as opposed to all documents placed in that category, both correct and incorrect:

$$\pi_i = \frac{TP_i}{TP_i + FP_i}$$

Recall ($\rho_i$) is defined as the probability that, if a random document $d_x$ should be classified under category $c_i$, this decision is taken.

$$\rho_i = \frac{TP_i}{TP_i + FN_i}$$

Accuracy is commonly used as a measure for categorization techniques. Accuracy values, however, are much less sensitive to variations in the number of correct decisions than precision and recall:

$$A_i = \frac{TP_i + TN_i}{TP_i + TN_i + FP_i + FN_i}$$

Many times there are very few instances of the interesting category in text categorization. This overrepresentation of the negative class in information retrieval problems can cause problems in evaluating classifiers' performances using accuracy. Since accuracy is not a good metric for skewed datasets, the classification performance of algorithms in this case is measured by precision and recall [5].

Furthermore, precision and recall are often combined in order to get a better picture of the performance of the classifier. This is done by combining them in the following formula:

$$F_\beta = \frac{(\beta^2 + 1)\,\pi\rho}{\beta^2\pi + \rho},$$

where $\pi$ and $\rho$ denote precision and recall respectively. $\beta$ is a positive parameter, which represents the goal of the evaluation task. If precision is considered to be more important than recall, then the value of $\beta$ converges to zero. On the other hand, if recall is more important than precision then $\beta$ converges to infinity. Usually $\beta$ is set to 1, because in this way equal importance is given to precision and recall.

Reuters Corpus Volume I (RCV1) is an archive of over 800,000 manually categorized newswire stories recently made available by Reuters, Ltd. for research purposes [17]. Using this collection, we can compare the learning algorithms.

Although research in past years has shown that the training corpus can impact classification performance, little work has been done to explore the underlying causes. The authors of [35] try to
propose an approach to build high-quality training corpuses semi-automatically for better classification performance, by first exploring the properties of training corpuses and then giving an algorithm for constructing training corpuses semi-automatically.

7 Conclusion

The text classification problem is an Artificial Intelligence research topic, especially given the vast number of documents available in the form of web pages and other electronic texts like emails, discussion forum postings and other electronic documents.

It has been observed that even for a specified classification method, the classification performances of classifiers based on different training text corpuses are different, and in some cases such differences are quite substantial. This observation implies that a) classifier performance depends to some degree on its training corpus, and b) good or high quality training corpuses may yield classifiers of good performance. Unfortunately, up to now little research work in the literature has been seen on how to exploit training text corpuses to improve classifier performance.

Some important conclusions have not been reached yet, including:
• Which feature selection methods are both computationally scalable and high-performing across classifiers and collections? Given the high variability of text collections, do such methods even exist?
• Would combining uncorrelated, but well-performing methods yield a performance increase?
• Change the thinking from word frequency based vector space to concepts based vector space. Study the methodology of feature selection under concepts, to see if these will help in text categorization.
• Make the dimensionality reduction more efficient over large corpora.

Moreover, there are two other open problems in text mining: polysemy and synonymy. Polysemy refers to the fact that a word can have multiple meanings. Distinguishing between different meanings of a word (called word sense disambiguation) is not easy, often requiring the context in which the word appears. Synonymy means that different words can have the same or similar meaning.

References:
[1] Bao Y. and Ishii N., Combining Multiple kNN Classifiers for Text Categorization by Reducts, LNCS 2534, 2002, pp. 340-347.
[2] Bi Y., Bell D., Wang H., Guo G., Greer K., Combining Multiple Classifiers Using Dempster's Rule of Combination for Text Categorization, MDAI, 2004, pp. 127-138.
[3] Brank J., Grobelnik M., Milic-Frayling N., Mladenic D., Interaction of Feature Selection Methods and Linear Classification Models, Proc. of the 19th International Conference on Machine Learning, Australia, 2002.
[4] Ana Cardoso-Cachopo, Arlindo L. Oliveira, An Empirical Comparison of Text Categorization Methods, Lecture Notes in Computer Science, Volume 2857, Jan 2003, pp. 183-196.
[5] Chawla N. V., Bowyer K. W., Hall L. O., Kegelmeyer W. P., SMOTE: Synthetic Minority Over-sampling Technique, Journal of AI Research, 16, 2002, pp. 321-357.
[6] Forman G., An Experimental Study of Feature Selection Metrics for Text Categorization, Journal of Machine Learning Research, 3, 2003, pp. 1289-1305.
[7] Fragoudis D., Meretakis D., Likothanassis S., Integrating Feature and Instance Selection for Text Classification, SIGKDD '02, July 23-26, 2002, Edmonton, Alberta, Canada.
[8] Guan J., Zhou S., Pruning Training Corpus to Speedup Text Classification, DEXA 2002, pp. 831-840.
[9] D. E. Johnson, F. J. Oles, T. Zhang, T. Goetz, A decision-tree-based symbolic rule induction system for text categorization, IBM Systems Journal, September 2002.
[10] Han X., Zu G., Ohyama W., Wakabayashi T., Kimura F., Accuracy Improvement of Automatic Text Classification Based on Feature Transformation and Multi-classifier Combination, LNCS, Volume 3309, Jan 2004, pp. 463-468.
[11] Ke H., Shaoping M., Text categorization based on Concept indexing and principal component analysis, Proc. TENCON 2002 Conference on Computers, Communications, Control and Power Engineering, 2002, pp. 51-56.
[12] Kehagias A., Petridis V., Kaburlasos V., Fragkou P., A Comparison of Word- and Sense-Based Text Categorization Using Several Classification Algorithms, JIIS, Volume 21, Issue 3, 2003, pp. 227-247.
[13] B. Kessler, G. Nunberg, and H. Schutze, Automatic detection of text genre, Proceedings of the Thirty-Fifth ACL and EACL, 1997, pp. 32-38.
[14] Kim S. B., Rim H. C., Yook D. S. and Lim H. S., Effective Methods for Improving Naive Bayes Text Classifiers, LNAI 2417, 2002, pp. 414-423.
[15] Klopotek M. and Woch M., Very Large Bayesian Networks in Text Classification, ICCS 2003, LNCS 2657, 2003, pp. 397-406.
[16] Leopold, Edda & Kindermann, Jörg, Text Categorization with Support Vector Machines. How to Represent Texts in Input Space?, Machine Learning 46, 2002, pp. 423-444.
[17] Lewis D., Yang Y., Rose T., Li F., RCV1: A New Benchmark Collection for Text Categorization Research, Journal of Machine Learning Research 5, 2004, pp. 361-397.
[18] Heui Lim, Improving kNN Based Text Classification with Well Estimated Parameters, LNCS, Vol. 3316, Oct 2004, pp. 516-523.
[19] Madsen R. E., Sigurdsson S., Hansen L. K. and Lansen J., Pruning the Vocabulary for Better Context Recognition, 7th International Conference on Pattern Recognition, 2004.
[20] Montanes E., Quevedo J. R. and Diaz I., A Wrapper Approach with Support Vector Machines for Text Categorization, LNCS 2686, 2003, pp. 230-237.
[21] Nardiello P., Sebastiani F., Sperduti A., Discretizing Continuous Attributes in AdaBoost for Text Categorization, LNCS, Volume 2633, Jan 2003, pp. 320-334.
[22] Novovicova J., Malik A., and Pudil P., Feature Selection Using Improved Mutual Information for Text Classification, SSPR&SPR 2004, LNCS 3138, 2004, pp. 1010-1017.
[23] Qiang W., XiaoLong W., Yi G., A Study of Semi-discrete Matrix Decomposition for LSI in Automated Text Categorization, LNCS, Volume 3248, Jan 2005, pp. 606-615.
[24] Schneider K., Techniques for Improving the Performance of Naive Bayes for Text Classification, LNCS, Vol. 3406, 2005, pp. 682-693.
[25] Sebastiani F., Machine Learning in Automated Text Categorization, ACM Computing Surveys, vol. 34 (1), 2002, pp. 1-47.
[26] Shanahan J. and Roma N., Improving SVM Text Classification Performance through Threshold Adjustment, LNAI 2837, 2003, pp. 361-372.
[27] Soucy P. and Mineau G., Feature Selection Strategies for Text Categorization, AI 2003, LNAI 2671, 2003, pp. 505-509.
[28] Sousa P., Pimentao J. P., Santos B. R. and Moura-Pires F., Feature Selection Algorithms to Improve Documents Classification Performance, LNAI 2663, 2003, pp. 288-296.
[29] Sung-Bae Cho, Jee-Haeng Lee, Learning Neural Network Ensemble for Practical Text Classification, Lecture Notes in Computer Science, Volume 2690, Aug 2003, pp. 1032-1036.
[30] Torkkola K., Discriminative Features for Text Document Classification, Proc. International Conference on Pattern Recognition, Canada, 2002.
[31] Vinciarelli A., Noisy Text Categorization, Pattern Recognition, 17th International Conference on (ICPR'04), 2004, pp. 554-557.
[32] Y. Yang, J. Zhang and B. Kisiel, A scalability analysis of classifiers in text categorization, ACM SIGIR'03, 2003, pp. 96-103.
[33] Y. Yang, An evaluation of statistical approaches to text categorization, Journal of Information Retrieval, 1(1/2), 1999, pp. 67-88.
[34] Zhenya Zhang, Shuguang Zhang, Enhong Chen, Xufa Wang, Hongmei Cheng, TextCC: New Feed Forward Neural Network for Classifying Documents Instantly, Lecture Notes in Computer Science, Volume 3497, Jan 2005, pp. 232-237.
[35] Shuigeng Zhou, Jihong Guan, Evaluation and Construction of Training Corpuses for Text Classification: A Preliminary Study, Lecture Notes in Computer Science, Volume 2553, Jan 2002, pp. 97-108.
[36] Verayuth Lertnattee, Thanaruk Theeramunkong, Parallel Text Categorization for Multi-dimensional Data, Lecture Notes in Computer Science, Volume 3320, Jan 2004, pp. 38-41.
[37] Wang Qiang, Wang XiaoLong, Guan Yi, A Study of Semi-discrete Matrix Decomposition for LSI in Automated Text Categorization, Lecture Notes in Computer Science, Volume 3248, Jan 2005, pp. 606-615.
[38] Zu G., Ohyama W., Wakabayashi T., Kimura F., Accuracy improvement of automatic text classification based on feature transformation, Proc. of the 2003 ACM Symposium on Document Engineering, November 20-22, 2003, pp. 118-120.