by Jana Novovičová
During the last twenty years the number of text documents in digital form has grown
enormously. As a consequence, it is of great practical importance to be
able to automatically organize and classify documents. Research into text
classification aims to partition unstructured sets of documents into groups that
describe the contents of the documents. There are two main variants of text
classification: text clustering and text categorization. The former is concerned
with finding a latent group structure in the set of documents, while the latter
(also known as text classification) can be seen as the task of structuring the
repository of documents according to a group structure that is known in advance.
Dimensionality reduction (DR) in text categorization (TC) often takes the form of
feature selection. Methods for feature subset
selection for TC tasks use some evaluation function that is applied to a single
feature. The best individual features (BIF) method evaluates all words individually
according to a given criterion, sorts them and selects the best subset of words.
Since the vocabulary usually contains several thousand or tens of thousands of
words, BIF methods are popular in TC. However, such methods evaluate each word
separately, and completely ignore the existence of other words and the manner in
which the words work together.
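The BIF procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: documents are represented as sets of words, information gain is used as the evaluation function, and the four toy documents, their labels and the function names are hypothetical and not taken from the experiments reported here.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy (in bits) of a distribution given as raw counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)


def information_gain(docs, labels, word):
    """Information gain of the binary feature 'word occurs in the document'."""
    n = len(docs)
    h_class = entropy(Counter(labels).values())
    present = Counter(l for d, l in zip(docs, labels) if word in d)
    absent = Counter(l for d, l in zip(docs, labels) if word not in d)
    h_cond = 0.0
    for part in (present, absent):
        size = sum(part.values())
        if size:
            h_cond += size / n * entropy(part.values())
    return h_class - h_cond


def best_individual_features(docs, labels, k):
    """Score every word on its own, sort by score, keep the k best words."""
    vocab = sorted(set().union(*docs))
    ranked = sorted(vocab, key=lambda w: information_gain(docs, labels, w),
                    reverse=True)
    return ranked[:k]


# Hypothetical toy corpus: each document is a set of words.
docs = [
    {"wheat", "price", "export"},
    {"wheat", "harvest"},
    {"oil", "price", "barrel"},
    {"oil", "export", "barrel"},
]
labels = ["grain", "grain", "crude", "crude"]

top = best_individual_features(docs, labels, 3)
```

In this toy corpus "oil" and "barrel" occur in exactly the same documents, yet both land in the top three because each is scored in isolation. This is the redundancy that individual evaluation cannot detect.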
UTIA proposed the use of the sequential forward selection (SFS) method based on
novel improved mutual information measures as criteria for reducing the
dimensionality of text data. These criteria take into consideration how features
work together. The proposed criteria, used with SFS, have been compared with
mutual information, information gain, the chi-square statistic and odds ratio,
each used with the BIF method. Experiments using a naive Bayes classifier
based on the multinomial model, a linear support vector machine (SVM) and k-nearest
neighbour classifiers on the Reuters data sets were performed and the results were
analysed from various perspectives, including precision, recall and F1-measure.
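For readers unfamiliar with these measures, the following Python sketch computes precision, recall and F1 for a single category from true and predicted labels. The function name and the toy labels are hypothetical, and returning 0.0 when a denominator is zero is an assumed convention rather than one stated in this article.

```python
def precision_recall_f1(true_labels, predicted_labels, category):
    """Precision, recall and F1 for one category, treated as the positive class."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(t == category and p == category for t, p in pairs)
    fp = sum(t != category and p == category for t, p in pairs)
    fn = sum(t == category and p != category for t, p in pairs)
    # Assumed convention: a measure with a zero denominator is reported as 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Hypothetical classifier output for five documents.
true_labels = ["grain", "grain", "crude", "crude", "grain"]
predicted_labels = ["grain", "crude", "crude", "grain", "grain"]

p, r, f = precision_recall_f1(true_labels, predicted_labels, "grain")
```

Here two of the three documents predicted as "grain" are correct, and two of the three true "grain" documents are found, so all three measures equal 2/3.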
Preliminary experimental results on the Reuters data indicate that SFS methods
significantly outperform BIF based on the above-mentioned evaluation functions.
Furthermore, SVM on average outperforms both naive Bayes and k-nearest neighbour
classifiers on the test data.
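The sequential forward selection idea mentioned above can be illustrated with a short Python sketch. The criterion used here, the mutual information between the class and the joint presence pattern of the selected words, is a simple stand-in chosen for illustration. It is not one of the improved mutual information measures proposed by UTIA, which are not specified in this article. The toy corpus and all function names are hypothetical.

```python
import math
from collections import Counter


def joint_mutual_information(docs, labels, words):
    """I(pattern; class), where pattern is the tuple of presence flags for `words`."""
    n = len(docs)
    patterns = [tuple(w in d for w in words) for d in docs]
    joint = Counter(zip(patterns, labels))
    pattern_counts = Counter(patterns)
    class_counts = Counter(labels)
    return sum(
        c / n * math.log2(c * n / (pattern_counts[p] * class_counts[l]))
        for (p, l), c in joint.items()
    )


def sequential_forward_selection(docs, labels, k):
    """Greedily add the word that most improves the criterion of the whole set."""
    candidates = sorted(set().union(*docs))
    selected = []
    while candidates and len(selected) < k:
        best = max(
            candidates,
            key=lambda w: joint_mutual_information(docs, labels, selected + [w]),
        )
        selected.append(best)
        candidates.remove(best)
    return selected


# Hypothetical toy corpus with three topics; each document is a set of words.
docs = [
    {"wheat", "harvest"},
    {"wheat", "harvest", "price"},
    {"oil", "barrel"},
    {"oil", "barrel", "price"},
    {"gold", "price"},
    {"gold", "ounce"},
]
labels = ["grain", "grain", "crude", "crude", "metal", "metal"]

selected = sequential_forward_selection(docs, labels, 2)
```

Because each candidate is scored together with the words already chosen, a word that duplicates an earlier pick (such as "barrel" after "oil") adds nothing to the criterion. The second word is therefore drawn from a different topic, and the two selected words together separate all three classes in this toy corpus.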
Currently, text classification research at UTIA is heading in two directions.
First, investigation of sequential floating search methods and oscillating
algorithms (developed at UTIA) for reducing the dimensionality of text data; and
second, design of a new probabilistic model for document modelling based on
mixtures for simultaneously solving the problems of feature selection and
classification. These phases of the project rely on the involvement of PhD students
from the Faculty of Mathematics and Physics at Charles University in Prague.
Please contact:
Jana Novovičová, CRCIM (UTIA), Czech Republic
Tel: +420 2 6605 2224
E-mail: novovic@utia.cas.cz