
ON THE APPLICATION OF

LINGUISTIC QUANTIFIERS
FOR TEXT CATEGORIZATION

Slawomir Zadrozny and Janusz Kacprzyk

Systems Research Institute


Polish Academy of Sciences

Warsaw School of Information Technology

Warsaw, POLAND

• Natural interpretation of various concepts of information retrieval in terms of fuzzy logic
• Multilabel text categorization solved via a similarity-based classifier
• Linguistic quantifier as a flexible aggregation operator for similarity assessment and thresholding strategy
BASICS OF TEXT CATEGORIZATION (1)

Notation

D = {d_i}, i = 1,…,N - a set of text documents

T = {t_j}, j = 1,…,M - a set of index terms

F: D × T → [0, 1] - assignment of weights to terms appearing in a document

C = {c_i}, i = 1,…,S - a set of categories

Ξ: D × C → {0, 1} - assignment of categories to documents

D1 ⊂ D - a set of training text documents
BASICS OF TEXT CATEGORIZATION (2)

Document representation

“a bag of words” (index terms)

vector space model:
d_i → [w_i1, …, w_iM],   w_ij = F(d_i, t_j),   d_i ∈ [0,1]^M

tf × idf is a popular version of the function F:

F(d_i, t_j) = (f_ij · log(N/n_j)) / max_j (f_ij · log(N/n_j))

f_ij - frequency of term t_j in document d_i
N - number of all documents in the set D
n_j - number of documents (in D) containing term t_j

first factor - frequency of term t_j in d_i (tf, term frequency)
second factor - inverted frequency (in the collection D) of documents in which term t_j appears at least once (idf, inverted document frequency)
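
A minimal Python sketch of this weighting, assuming term frequencies and document counts are already available (function and variable names are ours, not from the slides):

```python
import math

def tfidf_weights(freqs, doc_counts, n_docs):
    """Normalized tf x idf weights of one document, per the formula above.

    freqs[j]      - f_ij, frequency of term t_j in document d_i
    doc_counts[j] - n_j, number of documents in D containing t_j
    n_docs        - N, total number of documents in D
    """
    raw = [f * math.log(n_docs / n) for f, n in zip(freqs, doc_counts)]
    peak = max(raw)  # normalize by the largest tf x idf value in the document
    return [r / peak if peak > 0 else 0.0 for r in raw]
```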
BASICS OF TEXT CATEGORIZATION (3)

Classification problem

Ξ: D1 × C → {0, 1} ⇒ Ξ: D × C → {0, 1}

Two phases:

• learning of classification rules (explicit or implicit; building a classifier) from examples of documents with known class assignment (supervised learning),

• classification of previously unseen documents using the derived rules.

Special cases

1. C = {c1, c2} (single class)

2. ∀d ∃c_i: Ξ(d, c_i) = 1 ∧ ∀c_j ≠ c_i: Ξ(d, c_j) = 0 (single label)

Here: the multiclass, multilabel problem is considered


ROCCHIO’S ALGORITHM

Learning phase

Computation of a centroid vector p_i for each category c_i of documents:

P = {p_i}, i = 1,…,S,   p_i = [u_i1, …, u_iM]

u_ij = (1/K) ∑_{l=1,…,K} w_lj

K is the number of documents of category c_i (the sum runs over these documents)

M is the number of terms used to index the documents

Classification phase

Selection of the category whose centroid is most similar to the document under consideration

Classically, similarity assessment via the cosine measure:

sim(d, p) = ∑_{i=1}^{M} w_i·u_i / ( √(∑_{i=1}^{M} w_i²) · √(∑_{i=1}^{M} u_i²) )

where d = [w_1,…,w_M] and p = [u_1,…,u_M].
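
A minimal sketch of both phases, assuming each training document is already represented as a weight vector (helper names are ours):

```python
import math

def centroid(doc_vectors):
    """Average the weight vectors of the K training documents of one category."""
    K, M = len(doc_vectors), len(doc_vectors[0])
    return [sum(v[j] for v in doc_vectors) / K for j in range(M)]

def cosine(d, p):
    """Cosine similarity between a document vector d and a centroid p."""
    dot = sum(w * u for w, u in zip(d, p))
    norm = math.sqrt(sum(w * w for w in d)) * math.sqrt(sum(u * u for u in p))
    return dot / norm if norm > 0 else 0.0

def classify(d, centroids):
    """Pick the category whose centroid is most similar to d (single-label case).

    centroids - a dict mapping each category to its centroid vector
    """
    return max(centroids, key=lambda c: cosine(d, centroids[c]))
```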


PROPOSED ALGORITHM (1)

Centroids

u_ij = (f_ij · log(S/n_j)) / max_j (f_ij · log(S/n_j))

f_ij - frequency of term t_j in all documents of category c_i
n_j - number of categories in whose documents term t_j appears (“category frequency”)

By analogy to tf×idf, it may be called a tf×icf representation, where icf stands for inverted category frequency.

A document d to be classified is represented, as previously, by a vector:

d = [w_1,…,w_M]
w_j = (f_j · log(S/n_j)) / max_j (f_j · log(S/n_j))

f_j - frequency of term t_j in the document d
n_j - category frequency of this term
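
The same normalization pattern as in the tfidf_weights sketch above applies; a hedged sketch, with S and category frequencies in place of N and document frequencies:

```python
import math

def tficf_weights(freqs, cat_counts, n_cats):
    """tf x icf weights: like tfidf_weights, but with the number of categories S
    in place of N and category frequencies n_j in place of document frequencies."""
    raw = [f * math.log(n_cats / n) for f, n in zip(freqs, cat_counts)]
    peak = max(raw)  # normalize by the largest weight
    return [r / peak if peak > 0 else 0.0 for r in raw]
```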
PROPOSED ALGORITHM (2)

As in the original algorithm, the classification decision is based on the similarity of a document and a centroid.

Intuitive interpretation of this similarity:

“terms representing both a document and a centroid have comparable weights, relatively or absolutely”

The centroids are aggregates of many documents, thus one should not expect a match between a centroid and a document along all dimensions of their representation. It is more reasonable to formulate the requirement that

“along most of the dimensions there is a match”

This may be formalized as follows using the calculus of linguistically quantified propositions:

“A document matches a category if most of the important terms of the document are present in the centroid of the category”
CALCULUS OF LINGUISTICALLY
QUANTIFIED PROPOSITIONS

Most (Q) elements of a set X possess property S:

Q_{x∈X} S(x)

Most (Q) elements of a set X possessing property F possess also property S:

Q_{x∈X} (F(x), S(x))   or   Q_{x∈X} F S(x)

X = {x_1, …, x_N}
μ_S: X → [0,1],   μ_F: X → [0,1],   μ_Q: [0,1] → [0,1]

truth(Q_{x∈X} S(x)) = μ_Q( (1/N) ∑_{i=1}^{N} μ_S(x_i) )

truth(Q_{x∈X} (F(x), S(x))) = μ_Q( ∑_{i=1}^{N} (μ_F(x_i) ∧ μ_S(x_i)) / ∑_{i=1}^{N} μ_F(x_i) )
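
A sketch of these truth computations in Python, with ∧ interpreted as min; the piecewise-linear shape chosen for μ_Q (“most”) is an assumption, as the slides do not fix it:

```python
def mu_most(y):
    """A piecewise-linear membership function for the quantifier "most".
    The breakpoints 0.3 and 0.8 are illustrative, not taken from the slides."""
    if y <= 0.3:
        return 0.0
    if y >= 0.8:
        return 1.0
    return (y - 0.3) / 0.5

def truth_q(mu_S, xs, mu_Q=mu_most):
    """truth(Q_{x in X} S(x)) = mu_Q( (1/N) * sum of mu_S(x_i) )."""
    return mu_Q(sum(mu_S(x) for x in xs) / len(xs))

def truth_q_fs(mu_F, mu_S, xs, mu_Q=mu_most):
    """truth(Q_{x in X} (F(x), S(x))), with min as the t-norm for the conjunction."""
    num = sum(min(mu_F(x), mu_S(x)) for x in xs)
    den = sum(mu_F(x) for x in xs)
    return mu_Q(num / den) if den > 0 else 0.0
```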
PROPOSED ALGORITHM (3)

“A document matches a category if most of the important terms of the document are present in the centroid of the category”

Q B’s are G’s

X, the universe considered, is the set T of all index terms,
B is a fuzzy set of terms important for the document d, i.e., μ_B(t_j) = w_j,
G is a fuzzy set of terms present in the centroid p_i of category c_i, i.e., μ_G(t_j) = u_ij.

Then,

similarity(d, p) = truth(Q B’s are G’s)

We also tested a modified version where only terms weighted in the document higher than a certain threshold, or only, say, the 10 top-ranked terms, are considered.
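
A sketch of this similarity, reusing the mu_most function and the min t-norm from the calculus sketch above; the top_k argument implements the modified version, and all names are ours:

```python
def quantifier_similarity(doc_weights, centroid_weights, mu_Q=mu_most, top_k=None):
    """similarity(d, p) = truth(Q B's are G's).

    doc_weights[j]      - w_j = mu_B(t_j), importance of term t_j for document d
    centroid_weights[j] - u_ij = mu_G(t_j), presence of t_j in the centroid
    top_k               - if set, restrict the universe to the top-ranked terms
    """
    terms = range(len(doc_weights))
    if top_k is not None:
        terms = sorted(terms, key=lambda j: doc_weights[j], reverse=True)[:top_k]
    num = sum(min(doc_weights[j], centroid_weights[j]) for j in terms)
    den = sum(doc_weights[j] for j in terms)
    return mu_Q(num / den) if den > 0 else 0.0
```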
THRESHOLDING STRATEGY (1)

Many classifiers, including the one considered here, produce, for a document to be classified and each category, a matching degree expressing the extent to which the document possibly belongs to the given category.

For a single-label problem, the category with the highest matching degree is chosen.

In the multilabel case, a subset of categories has to be chosen somehow; a thresholding strategy is required.

Classically:

• rank-based thresholding (RCut):
chooses the r top categories for each document;
• proportion-based assignment (PCut):
for “batch categorization” (a set/batch of documents has to be classified at once); assigns to each category such a number of documents as to preserve the proportions of the cardinalities of the categories in the training set;
• score-based local optimization (SCut):
assigns a document to a category only if the matching score of this category and the document is higher than a threshold.
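
For illustration, minimal sketches of RCut and SCut (the rank r and the per-category thresholds are inputs here, not derived; names are ours):

```python
def rcut(scores, r):
    """RCut: assign the document to its r top-scoring categories.

    scores - a dict mapping each category to its matching degree
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:r])

def scut(scores, thresholds):
    """SCut: assign every category whose matching score exceeds its threshold."""
    return {c for c, s in scores.items() if s > thresholds[c]}
```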
THRESHOLDING STRATEGY (2)

Our proposed thresholding strategy (of the RCut type) may be intuitively expressed as follows:

"Select such a threshold r that most of the important categories had a number of sibling categories similar to r in the training data set"

For each r ∈ [1, R], compute the truth degree of the clause quoted above (R is a parameter), using Zadeh’s calculus of linguistically quantified propositions:

Q B’s are G’s

X, the universe considered, is a subset of C: the 10 categories with the highest matching scores,
B is a fuzzy set of categories important for a given document d, i.e., μ_B(c_i) = sim(d, c_i),
G is a fuzzy set of categories that, on average, had in the training set a number of sibling categories similar to r. This similarity is modeled by a similarity relation, which is another parameter of the method. By a sibling category of a category c_i we mean a category that is assigned to the same document as c_i.
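
A sketch of this selection procedure, reusing mu_most from the calculus sketch above; the piecewise-linear similarity relation and its spread parameter are illustrative assumptions, since the slides leave the relation as a free parameter of the method:

```python
def similar(a, b, spread=2.0):
    """An assumed similarity relation between category counts (a method parameter)."""
    return max(0.0, 1.0 - abs(a - b) / spread)

def select_r(scores, avg_siblings, R=10, mu_Q=mu_most):
    """Pick the threshold r whose quantified proposition is most true.

    scores[c]       - sim(d, c), matching degree of category c for document d
    avg_siblings[c] - average number of sibling categories of c in the training set
    """
    top = sorted(scores, key=scores.get, reverse=True)[:10]  # the universe X

    def truth_for(r):
        # truth(Q B's are G's) with mu_B(c) = scores[c], mu_G(c) = similar(..., r)
        num = sum(min(scores[c], similar(avg_siblings[c], r)) for c in top)
        den = sum(scores[c] for c in top)
        return mu_Q(num / den) if den > 0 else 0.0

    return max(range(1, R + 1), key=truth_for)
```

The selected r is then used as in RCut: the document receives the r top-scoring categories.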
COMPUTATIONAL EXPERIMENT

Text corpus used: Reuters-21578
7728 training documents, 3005 test documents, and 114 categories.
Stop-word removal and stemming were performed.
Term-space dimensionality reduction using document and category frequency yielded 5565 index terms.
                  |      micro-averaging       |      macro-averaging       | 11-point
Matching scheme   | precision  recall   F1     | precision  recall   F1     | avg. precision
Method1           |  0.3914    0.8215   0.5302 |  0.4038    0.5322   0.4592 |  0.8311
Method2*          |  0.4226    0.6765   0.5203 |  0.3416    0.6174   0.4398 |  0.7673
Cosine            |  0.2226    0.6462   0.3311 |  0.1235    0.4943   0.1976 |  0.6511

* only terms weighted above 0.2 are considered in the matching degree computation

Table 1: Comparison of matching schemes for the T.II thresholding strategy
                  |      micro-averaging       |      macro-averaging       | 11-point
Thresh. strategy  | precision  recall   F1     | precision  recall   F1     | avg. precision
T.I.              |  0.5531    0.7765   0.6460 |  0.4891    0.4785   0.4837 |  0.8311
T.II.             |  0.3914    0.8215   0.5302 |  0.4038    0.5322   0.4592 |  0.8311
T.III.*           |  0.4642    0.7478   0.5728 |  0.4776    0.4309   0.4530 |  0.8311

Table 2: Comparison of different thresholding strategies for the Method1 matching scheme
