
A Software System for Topic Extraction and Document Classification

Davide Magatti and Fabio Stella
Department of Informatics, Systems and Communications
Università degli Studi di Milano-Bicocca
Milan, Italy
Email: {magatti, stella}@disco.unimib.it

Marco Faini
DocFlow Italia S.p.A.
Centro Direzionale Milanofiori, Strada 4 Palazzo Q8
20089 Rozzano, Italy
Email: marco.faini@docflow.it

Abstract

A software system for topic extraction and automatic document classification is presented. Given a set of documents, the system automatically extracts the mentioned topics and assists the user in selecting their optimal number. The user-validated topics are exploited to build a model for multi-label document classification. While topic extraction is performed by using an optimized implementation of the Latent Dirichlet Allocation model, multi-label document classification is performed by using a specialized version of the Multi-Net Naive Bayes model. The performance of the proposed conceptual model is investigated by using 10,056 documents retrieved from the WEB through a set of queries formed by exploiting the Italian Google Directory. This dataset is used for topic extraction, while an independent dataset, consisting of 1,012 elements labeled by humans, is used to evaluate the performance of the Multi-Net Naive Bayes model. The results are satisfactory, with precision being consistently better than recall for the labels associated with the four most frequent topics.

1. Introduction

The continuously increasing amount of text available on the WEB, news wires, forums and chat lines, business company intranets, personal computers, e-mails and elsewhere is overwhelming [1]. Information is switching from being useful to being troublesome. Indeed, it is becoming more and more evident that while the amount of data is rapidly increasing, our capability to process information remains constant. This trend strongly limits the extraction of valuable knowledge from text and thus drastically reduces the competitive advantage we can gain. Search engines have exacerbated this problem by dramatically increasing the amount of text available in a matter of a few keystrokes.

While Wigner [2] examines the reasons why a great body of physics can be neatly explained with elementary mathematical formulas, the same does not apply to sciences that involve human beings. Indeed, in this case it is becoming increasingly evident that complex theories will never share the elegance of physics. Therefore, when dealing with sciences that involve human beings,

    ... we should stop acting as if our goal is to author extremely elegant theories, and instead embrace complexity and make use of the best ally we have: the unreasonable effectiveness of data.

as discussed by Alon Halevy, Peter Norvig and Fernando Pereira in their recent paper titled "The Unreasonable Effectiveness of Data" [3]. The authors conclude their paper by stating

    Choose a representation that can use unsupervised learning on unlabeled data, which is so much more plentiful than labeled data.

and by giving the following suggestion:

    Represent all the data with a nonparametric model rather than trying to summarize it with a parametric model, because with very large data sources, the data holds a lot of detail.

In this paper the authors share this point of view and follow the suggestion to design and implement a software prototype for topic extraction and multi-label document classification. The software prototype implements a conceptual model which processes a corpus of unstructured textual data to discover which topics are mentioned, and then exploits the extracted topics to learn a supervised classification model for multi-label document classification.

The rest of the paper is organized as follows: Section 2 introduces Text Mining and gives basic elements concerning the Latent Dirichlet Allocation model and the Multi-Net Naive Bayes model. Section 3 describes the main components of the software prototype along with their functionalities. Section 4 is devoted to numerical experiments and to multi-label document classification performance evaluation. Finally, Section 5 presents conclusions and discusses further developments.
2. Text Mining

Text Mining (TM) [1], [4] is an emerging research area which aims to solve the problem of information overload, i.e. to automatically extract knowledge from semi- and unstructured text. Therefore, TM can be interpreted as the natural answer to the unreasonable effectiveness of data, aiming to efficiently exploit the huge amount of data available through the WEB. The main techniques used by TM are: data mining, machine learning, natural language processing, information retrieval and knowledge management. Typical tasks of TM are: text categorization, document clustering and organization, and information extraction.
2.1. Probabilistic Topic Extraction

Probabilistic Topic Extraction (PTE) is a particular form of document clustering and organization used to analyze the content of documents and the meaning of words, with the aim of discovering the topics mentioned in a document collection. A variety of models have been proposed, described and analyzed in the specialized literature [5], [6], [7], [8]. These models differ in terms of the assumptions they make concerning the data generating process. However, they all share the same rationale, i.e. a document is a mixture of topics.

To describe how the PTE model works, let P(z) be the probability distribution over topics z and P(w|z) be the probability distribution over words w given topic z. The topic-word distribution P(w|z) specifies the weight given to thematically related words. A document is assumed to be formed as follows: the ith word wi is generated by first extracting a sample from the topic distribution P(z), then by choosing a word from P(w|z). We let P(zi = j) be the probability that the jth topic was sampled for the ith word token, while P(wi | zi = j) is the probability of word wi under topic j. Therefore, the PTE model induces the following distribution over words within a document:

    P(wi) = Σ_{j=1}^{T} P(wi | zi = j) P(zi = j)    (1)

where T is the number of topics.
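To make the mixture in Eq. (1) concrete, the following minimal sketch (in Python, using hypothetical numbers that do not come from the paper) computes the probability of each word in a toy vocabulary as a topic-weighted average of topic-word distributions.

    import numpy as np

    # Hypothetical toy example: T = 2 topics and a vocabulary of 4 words.
    # Each row of phi is a topic-word distribution P(w | z = j) and sums to 1.
    phi = np.array([[0.5, 0.3, 0.1, 0.1],    # P(w | z = 1)
                    [0.1, 0.1, 0.4, 0.4]])   # P(w | z = 2)

    # theta holds the document-specific topic weights P(z = j); it sums to 1.
    theta = np.array([0.7, 0.3])

    # Eq. (1): P(wi) = sum_j P(wi | zi = j) * P(zi = j)
    p_w = theta @ phi
    print(p_w)        # [0.38 0.24 0.19 0.19]
    print(p_w.sum())  # 1 (up to floating point), a proper distribution over the vocabulary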
Hofmann [6], [9] proposed the probabilistic Latent Semantic Indexing (pLSI) method, which makes no assumptions about how the mixture weights in (1), i.e. P(zi = j), are generated. Blei et al. [7] improved the generalizability of this model to new documents. They introduced a Dirichlet prior, with hyperparameter α, on P(zi = j), thus originating the Latent Dirichlet Allocation (LDA) model. In 2004, Griffiths and Steyvers [10] introduced an extension of the original LDA model which associates a Dirichlet prior, with hyperparameter β, also with P(wi | zi = j). The authors suggested that this hyperparameter be interpreted as the prior observation count on the number of times words are sampled from a topic before any word from the corpus is observed. This choice smooths the word distribution in every topic, with the amount of smoothing determined by β. The authors showed that good choices for the hyperparameters α and β depend on the number of topics T and the vocabulary size W, and that, according to the results of their empirical investigation, α = 50/T and β = 0.01 work well with many different document collections.
Topic extraction, i.e. estimation of the topic-word distributions and of the topic distributions for each document, can be implemented through different algorithms. Hofmann [6] used a direct estimation approach based on the Expectation-Maximization (EM) algorithm. However, such an approach suffers from problems involving local maxima of the likelihood function. A better alternative has been proposed by Blei et al. [7], who directly estimate the posterior distribution over z given the observed words w. However, many text collections contain millions of word tokens and thus the estimation of the posterior over z requires the adoption of efficient procedures. Gibbs sampling, a form of Markov Chain Monte Carlo (MCMC), is easy to implement and provides a relatively efficient method of extracting a set of topics from a large corpus.
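As an illustration of this estimation strategy, the following minimal sketch shows one possible collapsed Gibbs sampler for LDA in Python. It is not the C++ implementation used by the software system described later; the corpus format, the variable names and the single flat sampling loop are illustrative assumptions.

    import numpy as np

    def gibbs_lda(docs, W, T, alpha, beta, iterations=500, seed=0):
        """One possible collapsed Gibbs sampler for LDA.
        docs: list of documents, each a list of word ids in [0, W)."""
        rng = np.random.default_rng(seed)
        D = len(docs)
        n_wt = np.zeros((W, T))    # word-topic counts
        n_dt = np.zeros((D, T))    # document-topic counts
        n_t = np.zeros(T)          # tokens assigned to each topic
        z = [np.zeros(len(doc), dtype=int) for doc in docs]

        # Random initialization of the topic assignments.
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = rng.integers(T)
                z[d][i] = t
                n_wt[w, t] += 1; n_dt[d, t] += 1; n_t[t] += 1

        for _ in range(iterations):
            for d, doc in enumerate(docs):
                for i, w in enumerate(doc):
                    t = z[d][i]
                    # Remove the current assignment from the counts.
                    n_wt[w, t] -= 1; n_dt[d, t] -= 1; n_t[t] -= 1
                    # Full conditional P(zi = j | z_-i, w), up to a normalizing constant.
                    p = (n_wt[w] + beta) / (n_t + W * beta) * (n_dt[d] + alpha)
                    t = rng.choice(T, p=p / p.sum())
                    z[d][i] = t
                    n_wt[w, t] += 1; n_dt[d, t] += 1; n_t[t] += 1

        # Point estimates: column j of phi is P(w | z = j), row d of theta is P(z | d).
        phi = (n_wt + beta) / (n_t + W * beta)
        theta = (n_dt + alpha) / (n_dt + alpha).sum(axis=1, keepdims=True)
        return phi, theta

Under this sketch, phi provides the topic-word distributions and theta the per-document topic distributions discussed above.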
It is worthwhile to mention that topic extraction can be performed when the number of topics T is given. However, in many cases this quantity is unknown and we have to resort to empirical procedures and/or to statistical measures, such as perplexity, to choose the optimal number of topics to be retained after the unsupervised learning task has been performed.
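As a minimal sketch of one such measure, perplexity can be computed from the per-word likelihood of held-out documents under the fitted model; lower values indicate a better fit. The function below assumes phi and theta estimates such as those produced by the sampler sketched above.

    import numpy as np

    def perplexity(docs, phi, theta):
        """Perplexity of held-out documents.
        docs: list of documents, each a list of word ids.
        phi: W x T matrix of P(w | z = j); theta: D x T matrix of P(z = j | d)."""
        log_likelihood, n_tokens = 0.0, 0
        for d, doc in enumerate(docs):
            for w in doc:
                # Eq. (1): the probability of word w in document d is a topic mixture.
                log_likelihood += np.log(phi[w] @ theta[d])
                n_tokens += 1
        return np.exp(-log_likelihood / n_tokens)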
2.2. Text Categorization

Document classification, the task of classifying natural language documents into a predefined set of semantic categories, has become one of the key methods for organizing online information. It is commonly referred to as text categorization and represents a building block of several applications, such as web page categorization, newswire filtering and automatic routing of incoming messages at call centers.

Text categorization is distinguished into binary, multi-class and multi-label settings. In the binary setting there are exactly two classes, e.g. relevant and non-relevant, spam and non-spam, or sport and non-sport. Some classification tasks require more than two classes, e.g. an e-mail routing agent at a call center needs to forward an incoming message to the right operator depending on the specific nature of the message contents. Such cases belong to the multi-class setting, where documents can be labeled with exactly one out of K classes. Finally, in the multi-label setting there is no one-to-one correspondence between class and document.
In such a setting, each document can belong to many categories, to exactly one, or to none at all.

Several supervised learning models have been described in the specialized literature to cope with text categorization. However, the most studied models are Support Vector Machines and Naive Bayes.

The Support Vector Machines (SVMs) approach has been proposed by Joachims [11] and has been extensively investigated by many researchers [12], [13], [14], [15], to mention a few. The SVM model has been further extended to cope with cases where little training data is available. For example, a news-filtering service which requires thousands of labeled documents before becoming useful is very unlikely to please even the most patient user. To cope with such cases, the Transductive Support Vector Machine (TSVM) has been presented and discussed in [16] and further developed in [17], [18], [19].

Several works have extensively studied the Naive Bayes model for text categorization [20], [21]. However, these pure Naive Bayes classifier models considered a document as a binary feature vector, and so they cannot exploit the term frequencies in a document, resulting in poor performance. The multinomial Naive Bayes text classifier has been shown to be an effective alternative to the basic Naive Bayes model by a number of researchers [22], [23], [24]. However, the same researchers have also reported disappointing results compared to many other statistical learning methods, such as nearest neighbor classifiers [25], support vector machines [11] and boosting [26]. Recently, Sang-Bum Kim et al. [27] revisited the Naive Bayes framework and proposed a Poisson Naive Bayes model for text classification with a statistical feature weighting method. Feature weighting has many advantages when compared to previous feature selection approaches, especially when new training examples are continuously provided. In this paper a Multi-Net Poisson Naive Bayes model is used to implement the multi-label document classification functionality.

3. The Software System

The software system described in this paper consists of three main components, namely the Text Pre-processor, the Topic Extractor and the Multi-label Classifier. These components, which are also available as stand-alone applications, have been integrated to deploy a software system working on the Windows XP and Vista operating systems. A brief description of the aims and functionalities offered by the three software components follows.
• Text Pre-processor. This software component implements functionalities devoted to document pre-processing and document corpus representation. It offers stopword removal, different word stemming options and several filters to exclude those words which are judged to be too frequent/rare within a document and/or across the document corpus. This software component exploits a general purpose Italian vocabulary to obtain the word-document matrix following the bag-of-words model. Furthermore, the following document representations are allowed: binary, i.e. whether or not a word token belongs to a given document; term frequency, i.e. how many times a word token is mentioned in a given document; and term frequency inverse document frequency, first introduced in [28]. The following document formats are valid inputs for the software system: pdf, word and txt. The user is allowed to interact with the Text Pre-processor software component through the GUI depicted in Figure 1.

• Topic Extractor. This software component offers topic extraction and topic selection functionalities. The topic extraction functionality is implemented through a customized version of the Latent Dirichlet Allocation (LDA) model [10]. LDA learning, i.e. topic extraction, is obtained by using the Gibbs sampling algorithm, which has been implemented in the C++ programming language on a single processor machine. The topic selection functionality assists the user in choosing the optimal number of topics to be retained. Topic selection is implemented through a hierarchical clustering procedure based on the symmetrized Kullback-Leibler distance between topic distributions [8]. Each retained topic z = j is summarized through the estimate of its prior probability P(z = j) and a sorted list of its most frequent words w, together with the estimate of their conditional probabilities of occurrence given the topic, i.e. the value of the conditional probability P(w|z = j).

Figure 1. Text Pre-Processor GUI.
• Multi-label Classifier. This software component implements a supervised multi-label classification model. The model exploits the output of the Topic Extractor software component, i.e. the conditional probability distributions of words given the topics. It uses a customized version of the Multi-Net Poisson Naive Bayes (MNPNB) model [27]. The MNPNB model allows the user to select among the following bag-of-words representations: binary, term frequency and term frequency inverse document frequency. Each new document, represented according to the bag-of-words model, is automatically labeled depending on the user-specified value of the posterior threshold: if the posterior probability of a given topic is greater than the user-specified posterior threshold, then the document receives the label associated with the considered topic (a small sketch of this labeling rule follows this list). This software component has been implemented in C# and is available through dedicated GUIs (Figure 2 and Figure 3) as well as a WEB service.
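The following is a minimal sketch of the labeling rule just described: given per-topic posterior probabilities for a new document (however they have been obtained), every topic whose posterior exceeds the threshold contributes one label, so a document may receive zero, one or several labels. The posterior values and the threshold used here are hypothetical, not actual output of the system.

    def assign_labels(posteriors, threshold=0.5):
        """Return every topic label whose posterior probability exceeds the threshold."""
        return [topic for topic, p in posteriors.items() if p > threshold]

    # Hypothetical per-topic posteriors for a single new document.
    posteriors = {"SALUTE": 0.81, "COMPUTER": 0.12,
                  "TEMPO LIBERO": 0.64, "AMMINISTRAZIONE": 0.07}
    print(assign_labels(posteriors, threshold=0.5))   # ['SALUTE', 'TEMPO LIBERO']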
Figure 2. MNPNB classifier: labeling GUI.

Figure 3. MNPNB classifier: results GUI.

4. The Italian Google Directory

The performance of the software system is investigated using a document corpus collected by exploiting the topic structure offered by the Italian Google Directory (gDir, http://www.google.com/Top/World/Italiano/). This topic structure (Figure 4) relies on the Open Directory Project (DMOZ), which manages the largest human-edited directory available on the web; each branch of the directory can be extended only by editors having a deep knowledge of the specific topic to be edited. Furthermore, the editors' community guarantees the fairness and correctness of the gDir topic structure.

The topics associated with the first level of gDir are the following: ACQUISTI, AFFARI, ARTE, CASA, COMPUTER, CONSULTAZIONE, GIOCHI, NOTIZIE, REGIONALE, SALUTE, SCIENZA, SOCIETÀ, SPORT, TEMPO LIBERO. Each first level topic is associated with a second and third level sub-topic structure summarized in a words list.

Figure 4. The Italian Google Directory.

4.1. Document Corpus

The document corpus has been collected by submitting a set of 273 queries to the Google search engine. Each query contains a pair of words, randomly selected from the union of the words lists associated with the second and third level sub-topic structures. Some examples of the submitted random queries are as follows: "veicoli realtà virtuale", "body art altre culture" and "multimedia politica".
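A minimal sketch of this query construction procedure is given below; the word list is a hypothetical stand-in for the pooled gDir sub-topic word lists, and the sampling details (pairing without replacement, number of queries) are assumptions rather than the exact procedure used.

    import random

    # Hypothetical stand-in for the union of the second and third level sub-topic word lists.
    subtopic_words = ["veicoli", "realtà virtuale", "body art",
                      "altre culture", "multimedia", "politica"]

    def random_queries(words, n_queries, seed=0):
        """Build n_queries two-word queries by randomly pairing words from the pooled lists."""
        rng = random.Random(seed)
        return [" ".join(rng.sample(words, 2)) for _ in range(n_queries)]

    for query in random_queries(subtopic_words, 3):
        print(f'"{query}"')   # e.g. "multimedia politica"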
The PDF filter offered by the Google search engine has been used to ensure that only pdf format files are retrieved. The random query process retrieved 14,037 documents, each associated with one or more gDir first level topics.

4.2. Text Pre-processing

The document corpus, consisting of the 14,037 retrieved documents, has been submitted to the Text Pre-processor software component. Therefore, PDF files have been transformed to plain text and submitted to stopword removal and word stemming. Furthermore, size-based file selection has been applied to include only those PDF files with size between 2 and 400 KB.

The obtained document corpus consists of 10,056 documents (D = 10,056). The global vocabulary, which has been formed by including only those words occurring in at least 10 and in no more than 450 documents, consists of 48,750 word tokens (W = 48,750). The document corpus is represented in a word-document matrix WD consisting of W × D (term frequency) elements (48,750 × 10,056). The word-document matrix is given as input to the Topic Extractor software component, as described in the next subsection.
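A minimal sketch of this representation step is given below, assuming the documents have already been converted to plain text, stripped of stopwords and stemmed; the toy corpus, the permissive thresholds in the usage example and the use of a dense NumPy matrix (rather than a sparse one) are illustrative simplifications.

    import numpy as np
    from collections import Counter

    def word_document_matrix(docs, min_df=10, max_df=450):
        """Term-frequency word-document matrix, keeping only words that occur
        in at least min_df and in at most max_df documents."""
        df = Counter(word for doc in docs for word in set(doc))   # document frequencies
        vocabulary = sorted(w for w, n in df.items() if min_df <= n <= max_df)
        index = {w: i for i, w in enumerate(vocabulary)}

        WD = np.zeros((len(vocabulary), len(docs)), dtype=int)    # W x D term frequencies
        for d, doc in enumerate(docs):
            for word, count in Counter(doc).items():
                if word in index:
                    WD[index[word], d] = count
        return vocabulary, WD

    # Toy usage; the corpus described above uses min_df=10 and max_df=450.
    docs = [["dieta", "proteine", "dieta"], ["linux", "browser"], ["dieta", "linux"]]
    vocab, WD = word_document_matrix(docs, min_df=1, max_df=3)
    print(vocab)      # ['browser', 'dieta', 'linux', 'proteine']
    print(WD.shape)   # (4, 3)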
4.3. Topic Extraction

The Topic Extractor software component has been invoked with the following learning parameters: 12 topics (T = 12); an alpha prior equal to 1.67 (α = 1.67), which implements the α = 20/T rule cited in [7]; and a beta prior equal to 0.0041 (β = 0.0041), which implements the β = 200/W rule cited in [7]. The Gibbs sampling procedure has been run 100 times with different initial conditions and different initialization seeds. Each run consisted of 500 sampling iterations. The topics extracted through the last 99 runs of the Gibbs sampling learning procedure have been re-ordered to correspond as closely as possible with the topics obtained through the first run. Correspondence was measured by means of the symmetrized Kullback-Leibler distance.
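A minimal sketch of this correspondence measure and of a simple re-ordering of topics across runs is given below; the greedy matching strategy is an assumption, since the paper does not specify how the distances are turned into an alignment.

    import numpy as np

    def symmetrized_kl(p, q, eps=1e-12):
        """Symmetrized Kullback-Leibler distance between two topic-word distributions."""
        p, q = p + eps, q + eps   # avoid log(0) for words with zero probability
        return 0.5 * (np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

    def reorder_topics(reference, candidate):
        """Greedily map each reference topic to its closest, still unused candidate topic.
        reference, candidate: T x W matrices whose rows are topic-word distributions."""
        order, used = [], set()
        for ref_row in reference:
            distances = [symmetrized_kl(ref_row, cand_row) if j not in used else np.inf
                         for j, cand_row in enumerate(candidate)]
            best = int(np.argmin(distances))
            order.append(best)
            used.add(best)
        return candidate[order]   # candidate topics re-ordered to match the reference run

The same distance could equally drive the hierarchical clustering used for topic selection in Section 3; only the pairwise alignment across runs is sketched here.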
The 12 topics extracted by the Topic Extractor software component have been summarized through their corresponding 500 most frequent words. Among the extracted topics, the four most interesting ones have been manually labeled as follows:

• SALUTE (medicine and health),
• COMPUTER (information and communication technologies),
• TEMPO LIBERO (travels and holidays),
• AMMINISTRAZIONE (bureaucracy, public services).

The structure of the four topics is described in Table 1 and Table 2. Each topic is associated with an estimate of its prior probability:

• P(SALUTE) = 0.0787,
• P(COMPUTER) = 0.0696,
• P(TEMPO LIBERO) = 0.0884,
• P(AMMINISTRAZIONE) = 0.1021.

Furthermore, each topic is summarized in a words list, whose 20 most frequent words are reported in Table 1 and Table 2. It is worthwhile to mention that for each pair of word wi and topic j, an estimate of the conditional probability of the word wi given the topic j, i.e. P(wi|j) (Eq. 1), is provided.

Table 1. SALUTE and COMPUTER.

SALUTE          0.0787    COMPUTER       0.0696
cellule         0.0032    blog           0.0063
emissioni       0.0029    google         0.0035
nutrizione      0.0028    linux          0.0034
molecolare      0.0026    copyright      0.0033
proteine        0.0022    wireless       0.0030
dieta           0.0022    source         0.0029
climatici       0.0021    access         0.0028
foreste         0.0021    client         0.0027
cancro          0.0021    multimedia     0.0027
aids            0.0020    hacker         0.0026
disturbi        0.0020    password       0.0026
infermiere      0.0019    giornalismo    0.0025
cibi            0.0019    browser        0.0023
tumori          0.0019    provider       0.0022
veterinaria     0.0018    telecom        0.0022
obesità         0.0018    brand          0.0022
clinico         0.0018    book           0.0021
serra           0.0017    chat           0.0021
virus           0.0017    wiki           0.0021
infezioni       0.0017    piattaforme    0.0021

Table 2. TEMPO LIBERO and AMMINISTRAZIONE.

TEMPO LIBERO    0.0884    AMMINISTRAZIONE  0.1021
sconto          0.0077    locazione        0.0021
aeroporto       0.0038    federale         0.0021
salone          0.0035    direttivo        0.0021
spiaggia        0.0028    finanze          0.0020
lago            0.0028    versamento       0.0020
colazione       0.0026    lire             0.0019
albergo         0.0025    commi            0.0019
vacanza         0.0025    prescrizioni     0.0018
piscina         0.0024    vietato          0.0018
vini            0.0023    contrattuale     0.0018
bagni           0.0023    richiedente      0.0018
voli            0.0021    utilizzatore     0.0017
pensione        0.0021    agevolazioni     0.0017
biglietto       0.0020    contabile        0.0017
notti           0.0020    appalto          0.0017
escursioni      0.0020    affidamento      0.0017
agevolazioni    0.0020    redditi          0.0017
archeologico    0.0019    sanzione         0.0017
piatti          0.0019    somme            0.0016
bicicletta      0.0019    indennità        0.0016
4.4. Multi-Label Classification

The performance of the software system as a whole has been estimated by submitting a new document corpus to the Multi-label Classifier. This document corpus has been collected by using the same random querying procedure described in subsection 4.1. Its documents have been manually labeled, according to the 12 first level gDir topics, independently by three humans.

The labeled document corpus consists of 1,012 documents. It is worthwhile to mention that each document can be associated with one or more labels, i.e. a document can mention one or more topics. In detail, 478 documents are singly labeled, 457 are doubly labeled, while only 77 are associated with three labels.

The Multi-label Classifier, queried by using the binary document representation and by setting a posterior threshold equal to 0.5, achieves an accuracy equal to 73%, which can be considered satisfactory. The estimates of precision and recall for the four selected topics, namely COMPUTER, SALUTE, AMMINISTRAZIONE and TEMPO LIBERO, are reported in Table 3.

Table 3. Precision/Recall (%) for SALUTE (SAL.), COMPUTER (COM.), TEMPO LIBERO (TEL.) and AMMINISTRAZIONE (AMM.).

            SAL.   COM.   TEL.   AMM.
Precision    76     85     78     92
Recall       41     59     44     79
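As a minimal sketch of how such per-label figures can be obtained, the following computes precision and recall for a single label from manually assigned and predicted label sets; the small example data are hypothetical.

    def precision_recall(label, true_labels, predicted_labels):
        """Per-label precision and recall over a corpus of multi-label documents.
        true_labels, predicted_labels: lists of label sets, one set per document."""
        tp = sum(1 for t, p in zip(true_labels, predicted_labels) if label in t and label in p)
        predicted = sum(1 for p in predicted_labels if label in p)
        relevant = sum(1 for t in true_labels if label in t)
        precision = tp / predicted if predicted else 0.0
        recall = tp / relevant if relevant else 0.0
        return precision, recall

    # Hypothetical manual and automatic labelings of four documents.
    true = [{"SALUTE"}, {"SALUTE", "COMPUTER"}, {"COMPUTER"}, {"AMMINISTRAZIONE"}]
    pred = [{"SALUTE"}, {"COMPUTER"}, {"COMPUTER", "SALUTE"}, {"AMMINISTRAZIONE"}]
    print(precision_recall("SALUTE", true, pred))   # (0.5, 0.5)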
The best result is achieved for the topic AMMINISTRAZIONE, where the precision equals 92%, i.e. if the Multi-label Classifier labels a document with the label AMMINISTRAZIONE, then the labeling is wrong with respect to the manual labeling with probability 0.08. Furthermore, the recall equals 79%, which means that the documents manually labeled with AMMINISTRAZIONE are correctly labeled by the Multi-label Classifier with probability 0.79. The topic COMPUTER achieves a precision value equal to 85%, which is slightly lower than that achieved for AMMINISTRAZIONE. However, the achieved recall value drops from the 79% of AMMINISTRAZIONE to 59%. The topic TEMPO LIBERO achieves a precision value equal to 78%, i.e. slightly lower than that achieved for COMPUTER, while the achieved recall value is equal to 44%. Thus, the achieved recall value is significantly lower than that achieved for COMPUTER. Finally, the topic SALUTE achieves performances comparable to those of TEMPO LIBERO. Indeed, the precision equals 76%, while the recall equals 41%. Around 57% of the documents manually labeled with TEMPO LIBERO and/or with SALUTE are not identified as such by the Multi-label Classifier.

It is worthwhile to notice that for each label the achieved recall value is consistently lower than the precision value. A possible explanation for this behavior is as follows: the manual labeling procedure is both complex and ambiguous, and it may label documents by using a broader meaning for each topic. Therefore, it is somewhat expected that automatic document classification cannot achieve excellent performance with respect to both precision and recall. It is expected that the 'purer' a topic is, the better the performance achieved for the automatic document classification task will be. However, it is important to keep in mind the difficulty of the considered labeling task, together with the fact that human labeling of documents can result in ambiguous and contradictory label assignments.

5. Conclusions and Future Work

The overwhelming amount of textual data calls for efficient and effective methods to automatically summarize information and to extract valuable knowledge. Text Mining offers a rich set of computational models and algorithms to automatically extract valuable knowledge from huge amounts of semi- and unstructured data.

In this paper a software system for topic extraction and document classification has been described. The software system assists the user in correctly discovering which main topics are mentioned in a document collection. The discovered topic structure, after being user-validated, is used to implement an automatic document classifier. This model suggests labels to be used for each new document submitted by the user. It is important to mention that each document is not restricted to receiving a single label but can be labeled with more topics. Furthermore, each topic, for a given document, is associated with a probability value that informs the user about the fit of the topic to the considered document. This feature offers an important opportunity to the user, who can sort his/her document collection in descending order of probability for each topic.

However, it must be clearly stated that many improvements can be achieved by taking into account specific user requirements. This aspect is under investigation, and particular attention is being dedicated to non-parametric models for the discovery of hierarchical topic structures. Finally, the interplay between topic taxonomies and topic extraction algorithms offers an interesting research direction to explore.

References

[1] R. Feldman and J. Sanger, The Text Mining Handbook. New York: Cambridge University Press, 2007.

[2] E. Wigner, "The unreasonable effectiveness of mathematics in the natural sciences," Communications in Pure and Applied Mathematics, vol. 13, no. 1, pp. 1–14, February 1960.

[3] A. Halevy, P. Norvig, and F. Pereira, "The unreasonable effectiveness of data," IEEE Intelligent Systems, vol. 24, no. 2, pp. 8–12, 2009.

[4] M. W. Berry and M. Castellanos, Survey of Text Mining II: Clustering, Classification and Retrieval. London: Springer, 2008.
[5] T. L. Griffiths and M. Steyvers, "A probabilistic approach to semantic representation," in Proceedings of the Twenty-Fourth Annual Conference of the Cognitive Science Society, W. Gray and C. Schunn, Eds., 2002, pp. 381–386.

[6] T. Hofmann, "Probabilistic latent semantic analysis," in Proceedings of Uncertainty in Artificial Intelligence, UAI'99, 1999, pp. 289–296.

[7] D. M. Blei, A. Y. Ng, and M. I. Jordan, "Latent Dirichlet allocation," Journal of Machine Learning Research, vol. 3, pp. 993–1022, January 2003.

[8] T. L. Griffiths and M. Steyvers, "Probabilistic topic models," in Handbook of Latent Semantic Analysis, T. Landauer, D. McNamara, S. Dennis, and W. Kintsch, Eds. Erlbaum, 2007.

[9] T. Hofmann, "Unsupervised learning by probabilistic latent semantic analysis," Machine Learning, vol. 42, no. 1–2, pp. 177–196, 2001. [Online]. Available: http://portal.acm.org/citation.cfm?id=599631

[10] T. L. Griffiths and M. Steyvers, "Finding scientific topics," Proc. Natl. Acad. Sci. U.S.A., vol. 101, suppl. 1, pp. 5228–5235, April 2004.

[11] T. Joachims, Learning to Classify Text Using Support Vector Machines: Methods, Theory and Algorithms. New York: Springer, 2002.

[12] R. Cooley, "Classification of news stories using support vector machines," in Proc. 16th International Joint Conference on Artificial Intelligence Text Mining Workshop, 1999.

[13] H. Lodhi, C. Saunders, N. Cristianini, C. Watkins, and B. Scholkopf, "Text classification using string kernels," Journal of Machine Learning Research, vol. 2, pp. 563–569, 2002.

[14] L. Chen, J. Huang, and Z.-H. Gong, "An anti-noise text categorization method based on support vector machines," in AWIC, 2005, pp. 272–278.

[15] A. Lehmann and J. Shawe-Taylor, "A probabilistic model for text kernels," in ICML '06: Proceedings of the 23rd International Conference on Machine Learning. New York, NY, USA: ACM, 2006, pp. 537–544.

[16] V. N. Vapnik, The Nature of Statistical Learning Theory (Information Science and Statistics). Springer, 1999.

[17] T. Joachims, "Transductive inference for text classification using support vector machines," in Proceedings of ICML-99, 16th International Conference on Machine Learning, I. Bratko and S. Dzeroski, Eds. San Francisco, US: Morgan Kaufmann Publishers, 1999, pp. 200–209.

[18] S. Tong and D. Koller, "Support vector machine active learning with applications to text classification," Journal of Machine Learning Research, vol. 2, pp. 45–66, November 2001.

[19] C. Xu and Y. Zhou, "Transductive support vector machine for personal inboxes spam categorization," in International Conference on Computational Intelligence and Security Workshops, 2007, pp. 459–463.

[20] A. McCallum and K. Nigam, "A comparison of event models for naive Bayes text classification," in AAAI-98 Workshop on Learning for Text Categorization. AAAI Press, 1998, pp. 41–48.

[21] D. D. Lewis, "Naive (Bayes) at forty: The independence assumption in information retrieval," in ECML '98: Proceedings of the 10th European Conference on Machine Learning. London, UK: Springer-Verlag, 1998, pp. 4–15.

[22] S. Dumais, J. Platt, D. Heckerman, and M. Sahami, "Inductive learning algorithms and representations for text categorization," in CIKM '98: Proceedings of the Seventh International Conference on Information and Knowledge Management. New York, NY, USA: ACM Press, 1998, pp. 148–155.

[23] Y. Yang and X. Liu, "A re-examination of text categorization methods," in SIGIR '99: Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. New York, NY, USA: ACM Press, 1999, pp. 42–49.

[24] A. K. McCallum and T. Mitchell, "Text classification from labeled and unlabeled documents using EM," Machine Learning, 2000, pp. 103–134.

[25] Y. Yang and C. G. Chute, "An example-based mapping method for text categorization and retrieval," ACM Transactions on Information Systems, vol. 12, no. 3, pp. 252–277, 1994.

[26] R. E. Schapire and Y. Singer, "BoosTexter: A boosting-based system for text categorization," Machine Learning, 2000, pp. 135–168.

[27] S.-B. Kim, K.-S. Han, H.-C. Rim, and S. H. Myaeng, "Some effective techniques for naive Bayes text classification," IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 11, pp. 1457–1466, 2006.

[28] G. Salton and M. J. McGill, Introduction to Modern Information Retrieval. New York, NY, USA: McGraw-Hill, Inc., 1986.
