
Text Categorization:

Support Vector Machines

Remo Frey

Algorithms for Data Base Systems (Fachseminar)

ETH Zürich
Summer Semester 2007

Content

1. Introduction
2. Text Categorization
   2.1 Modelling of Text Categorization
   2.2 Properties of Text Categorization
3. Support Vector Machines (SVM)
   3.1 Big Picture of SVM
   3.2 Mathematical Formulation of SVM
4. A Quality Measure
   4.1 Odds Ratio and TCat-Concept
   4.2 Lower Bound for the Margin
5. References


1. Introduction
Text categorization is the process of sorting text documents into one or more predefined categories of similar documents. This task is becoming more and more important: data volumes are growing everywhere, and users want to find the information they are looking for in less time.
Such amounts of information can no longer reasonably be sorted by hand; it would simply take too long. The WWW in particular, with its almost unbounded and constantly changing volume of data, depends on good automatic search algorithms, which often rely on text categorization; keeping such collections up to date by hand is impossible. In section 2, I first specify the problem of text categorization, in particular in conjunction with Support Vector Machines (SVM).
Support Vector Machines, introduced by Vapnik et al. [3], offer a suitable solution to the challenge of text categorization [1, 2]. They have many advantages: for example, they are extremely robust against bad input data, and they are easy to use. In section 3, I present the general idea and the mathematical basics of Support Vector Machines.
Thanks to empirical knowledge (for example Zipf's law), it is possible to make stronger statements than pure mathematics alone can provide. In the final section I introduce the TCat-concept, which helps us say something about the quality of a text categorization task.

2. Text Categorization

2.1 Modelling of Text Categorization


„Text categorization“ is a classification problem: given a text, we want to know to which category it belongs. In our case, the categories are predefined and their number is finite. Predefined categories only make sense with supervised learning, so training data are required. A good introduction to classification problems is offered by the slides [5] of Joachim Buhmann, which are unfortunately not publicly available.

A mathematical formulation of this problem, specific to text categorization with Support Vector Machines, reads as follows (vectors are bold):
• Each text is converted into a vector xi. A component of xi describes the frequency of a certain word in this text.
• Our dictionary contains d words. These are all the words we want to consider in our problem; we call them features. Together they span the feature space χ. Thus: xi ∈ χ ⊆ ℜd.
• We have a predefined set of categories: {Category1, …, Categoryk}.
• The label yi is the category of xi: yi ∈ {Category1, …, Categoryk}.
• The training data are (x1,y1), (x2,y2), …, (xn,yn).


• A classifier is a decision function that maps a text to a category:
  y = c(x): χ → {Category1, …, Categoryk}
• The text we want to classify is xn+1.
• The categorization problem is the following: what is yn+1 ∈ {Category1, …, Categoryk} for xn+1 ∈ χ? (A small code sketch of this setup follows after the example below.)

Simple Example

The figure below shows a strongly reduced and therefore unrealistic example for demonstrating the concept. There are only d = 3 words used for classification („Euro08“, „Bush“, „Beatles“), so the feature space is χ ⊆ ℜ³, with one axis per word count. The classifier c with k = 3 categories maps χ to {Sport, Politics, Music}. We now search for the label y20 of the document x20.

[Figure: documents plotted in the three-dimensional feature space spanned by the counts of „Euro08“, „Bush“ and „Beatles“; the point x20 with unknown label y20 = ? is highlighted.]
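To make the modelling concrete, here is a minimal Python sketch of the vectorization step. The toy documents, labels and the use of scikit-learn's CountVectorizer are my own illustration and are not part of the original example:

```python
# A minimal sketch of the modelling above; the documents and labels are made up.
from sklearn.feature_extraction.text import CountVectorizer

train_texts = [
    "the ball hit the goal in the final match",   # Sport
    "the parliament voted on the new budget",     # Politics
    "the band played a new rock song",            # Music
]
train_labels = ["Sport", "Politics", "Music"]     # y_1, ..., y_n

vectorizer = CountVectorizer()                    # one feature (dimension) per word
X_train = vectorizer.fit_transform(train_texts)   # sparse n x d matrix of word counts

print(X_train.shape)                          # (n, d): one row x_i per document
print(sorted(vectorizer.vocabulary_)[:5])     # a few of the d feature words
```

Each row of X_train is a text vector xi; a new document xn+1 would be mapped into the same feature space with vectorizer.transform.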

2.2 Properties of Text Categorization [1, 2]

High-Dimensional Feature Space χ


Each word occurring in the training documents x1,…,xn is used as a feature. Thus the feature space χ is high-dimensional: χ ⊆ ℜd, where d is the number of considered words. In many text categorization experiments, d is greater than 10,000.

Sparse Text Vector xi


While there is a large choice of potential words, each text contains only a small number of distinct words. Therefore the text vector xi is very sparse.

Heterogeneous Use of Terms


There are documents that do not share any content words and nevertheless belong to the same category.

Few Irrelevant Words


In text categorization there are only very few irrelevant words. Even words that, from an empirical point of view, carry little information compared with other words still contain considerable information and are somewhat relevant. Removing features aggressively may therefore result in a loss of information. So if we try to decrease the complexity or the runtime, it is not advisable to shrink the high-dimensional feature space χ aggressively.

Stopwords
A few words in the dictionary are typically considered stopwords. Stopwords do not help discriminate between documents; examples are „if“, „the“, „and“, „of“, „for“, „an“, „a“, „not“, „that“, „in“. Unfortunately, these words occur with high frequency in every typical text document.
Stopwords are of course irrelevant for text categorization, and thus the dimension of the feature space χ can be reduced without loss.
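As a small sketch of this reduction (scikit-learn's built-in English stopword list is used here purely for illustration; it is not the list from the text above):

```python
from sklearn.feature_extraction.text import CountVectorizer

texts = ["the ball is in the goal", "a new song for the fans"]   # toy documents

d_full = len(CountVectorizer().fit(texts).vocabulary_)
d_reduced = len(CountVectorizer(stop_words="english").fit(texts).vocabulary_)

# Dropping stopwords shrinks the feature space without losing information
# that is useful for categorization.
print(d_full, d_reduced)
```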


Zipf’s Law [4]


Originally, Zipf's law stated that, in a corpus of natural language utterances, the frequency of any word is roughly inversely proportional to its rank in the frequency table. So the most frequent word occurs approximately twice as often as the second most frequent word, which in turn occurs twice as often as the fourth most frequent word, and so on.
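The law is easy to check empirically. The sketch below (with a placeholder corpus; any real document collection would do) counts word frequencies and inspects rank × frequency, which should stay roughly constant under Zipf's law:

```python
from collections import Counter

# Placeholder corpus: in practice, concatenate the training documents here.
corpus = "the ball the match the goal a ball a team the final a goal".split()

ranked = Counter(corpus).most_common()     # [(word, frequency), ...] by frequency

for rank, (word, freq) in enumerate(ranked, start=1):
    # Under Zipf's law, rank * freq is roughly the same for every word.
    print(rank, word, freq, rank * freq)
```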

3. Support Vector Machines (SVM)

3.1 Big Picture of SVM [7]


A Support Vector Machine (SVM) performs classification by constructing a hyperplane in the feature space that optimally separates the data into two categories.
First, let us look at a 2-dimensional example. Assume our training data, consisting of two features, have a categorical target variable with two categories {Category1, Category2}, represented by rectangles and circles:
[Figure: two panels plotting the training data over Feature 1 and Feature 2; each panel shows a candidate separating hyperplane (a line) with its margin m1 resp. m2 marked by dashed lines through the support vectors.]
In this simple example, the cases with one category are in the lower left corner and the cases with the other category are in the upper right corner; the cases are completely separated. The SVM analysis attempts to find a line (a 1-dimensional hyperplane) that separates the cases based on their labelled categories. There is an infinite number of possible lines; two candidate lines are shown in the figures above. Our question is which line is better, and how we define the optimal line.
The two dashed lines drawn parallel to each separating line mark the distance between the dividing line and the closest vectors to it. The distance between a dashed line and the hyperplane is called the margin. The rectangles and circles that constrain the width of the margin are the support vectors.
An SVM analysis finds the line (or, in general, hyperplane) that is oriented so that the margin between the support vectors is maximized. In the figure above, the line in the right panel is superior to the line in the left panel.
In the example we had only two features, and we were able to plot the points on a 2-dimensional plane. If we add a third feature, we can use its value for a third dimension and plot the points in a 3-dimensional cube. Points on a 2-dimensional plane can be separated by a 1-dimensional line; similarly, points in a 3-dimensional cube can be separated by a 2-dimensional plane. In general, the data points are represented in the d-dimensional feature space χ, and a (d−1)-dimensional hyperplane can separate them.

The simplest way to divide two groups is with a straight line, a flat plane or, in general, a (d−1)-dimensional hyperplane. But what if the points can only be separated by a nonlinear boundary, as in the picture below?
We then need a nonlinear dividing curve. Rather than fitting nonlinear curves to the data, an SVM handles this by using a kernel function to map the data into a different space where a hyperplane can be used to do the separation. The kernel function may transform the data into a higher-dimensional space.
An example is given in the following picture. The data have two features F1 and F2. Our example mapping Φ takes them to Z1, Z2 and Z3:

Φ : ℜ² → ℜ³, (F1, F2) → (Z1, Z2, Z3) := (F1², √2·F1F2, F2²)

[Figure: the data in the original (F1, F2) plane on the left and, after applying Φ, in the (Z1, Z2, Z3) space on the right, where a separating hyperplane exists.]
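This particular Φ is the feature map of the degree-2 polynomial kernel: Φ(u)ᵀΦ(v) = (uᵀv)², so the SVM can work with dot products in the original space and never has to compute Φ explicitly. A small sketch (the sample points u and v are arbitrary) checks the identity numerically:

```python
import numpy as np

def phi(x):
    # Explicit feature map from the figure: (F1, F2) -> (F1^2, sqrt(2)*F1*F2, F2^2)
    f1, f2 = x
    return np.array([f1 ** 2, np.sqrt(2) * f1 * f2, f2 ** 2])

u = np.array([1.5, -0.5])
v = np.array([0.3, 2.0])

lhs = phi(u) @ phi(v)      # dot product in the mapped space R^3
rhs = (u @ v) ** 2         # degree-2 polynomial kernel evaluated in R^2

print(lhs, rhs)            # both values agree up to floating-point rounding
```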

Ideally an SVM analysis should produce a hyperplane that completely separates the feature vectors into two non-overlapping groups. However, perfect separation may not be possible, or it may result in a model with so many feature vector dimensions that the model does not generalize well to other data; this is known as overfitting.
To allow some flexibility in separating the categories, SVM models have a cost parameter C that controls the trade-off between allowing training errors and forcing rigid margins. It creates a soft margin that permits some misclassifications. Increasing the value of C increases the cost of misclassifying points and forces the creation of a more accurate model that may not generalize well. In text categorization this idea of a soft margin is important, because in this application perfect separation is rare.

The idea of using a hyperplane to separate the feature vectors into two groups works well when there are only two categories, but how does an SVM handle the case where the label yi has more than two categories? Several approaches have been suggested, but two are the most popular: “one against many”, where one category is split out and all of the other categories are merged, and “one against one”, where k(k−1)/2 models are constructed, k being the number of categories. Both strategies are sketched in the code below.
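A hedged sketch of both strategies on made-up data, using scikit-learn's generic wrappers (my choice of tooling, not something prescribed by the text):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

# Toy data with two features and k = 3 categories (labels 0, 1, 2).
X = np.array([[0, 0], [0, 1], [5, 5], [5, 6], [10, 0], [10, 1]], dtype=float)
y = np.array([0, 0, 1, 1, 2, 2])

# "One against many": k binary SVMs, each separating one category from the rest.
ovr = OneVsRestClassifier(SVC(kernel="linear", C=1.0)).fit(X, y)

# "One against one": k*(k-1)/2 binary SVMs, one per pair of categories.
ovo = OneVsOneClassifier(SVC(kernel="linear", C=1.0)).fit(X, y)

print(len(ovo.estimators_))                                   # 3 = k*(k-1)/2 models
print(ovr.predict([[5.0, 5.5]]), ovo.predict([[5.0, 5.5]]))   # both predict [1]
```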


3.2 Mathematical Formulation of SVM


[Figure: the training points over Feature 1 and Feature 2, the separating hyperplane with normal vector w, the margin m marked by dashed lines through the support vectors, and a slack ξi for a point violating the margin.]

• Training data: (x1,y1), (x2,y2), …, (xn,yn). For the mathematics the two categories are encoded as yi = +1 (Category1) and yi = −1 (Category2).
• The separating hyperplane is described by a normal vector w and a translation parameter b, so it holds: wᵀx + b = 0.
• For the support vectors (on the dashed lines) it holds: wᵀxi + b = ±m.
• Label yi:
  yi = +1 (Category1) if wᵀxi + b ≥ m
  yi = −1 (Category2) if wᵀxi + b ≤ −m
• Classifier c:
  yn+1 = c(xn+1) = sgn(wᵀxn+1 + b)
• Learning problem: find w (normalized: ||w|| = 1) and b such that the margin m is maximized:
  Maximize m (the geometric margin, see figure)
  Subject to ∀xi ∈ χ: yi(wᵀxi + b) ≥ m
• Alternative formulation without m:
  Rescaling w := w/m and b := b/m gives the support vectors the functional margin 1, i.e. wᵀxi + b = ±1, and the geometric margin becomes m = 1/||w||, so m² = 1/||w||² = 1/(2·½wᵀw).
  Minimize ||w|| for the fixed functional margin 1 ⇒ Minimize ½wᵀw
  Subject to ∀xi ∈ χ: yi(wᵀxi + b) ≥ 1
⇒ Generalized Lagrange function:

  L(w,b,α) = ½wᵀw − ∑i=1…n αi[yi(wᵀxi + b) − 1]

We look for a saddle point of L: minimize over w and b, maximize over the αi (αi ≥ 0). We find the solution via the dual problem. Setting the derivatives with respect to w and b to zero,

  ∂L/∂w = w − ∑i=1…n αiyixi = 0   and   ∂L/∂b = −∑i=1…n αiyi = 0,

we receive

  w = ∑i=1…n αiyixi   and   ∑i=1…n αiyi = 0.

If we insert this result into L(w,b,α) and simplify, we obtain the dual problem:

  L(w,b,α) = ½wᵀw − ∑i=1…n αi[yi(wᵀxi + b) − 1]
           = ½wᵀw − ∑i=1…n αiyi(wᵀxi) − b·∑i=1…n αiyi + ∑i=1…n αi
           = −½ ∑i=1…n ∑j=1…n αiαjyiyjxiᵀxj + ∑i=1…n αi


  Maximize W(α) = −½ ∑i=1…n ∑j=1…n αiαjyiyjxiᵀxj + ∑i=1…n αi
  Subject to αi ≥ 0 and ∑i=1…n αiyi = 0

Solution: first we solve the dual problem and receive the α that maximizes W(α). Then we calculate the normal vector w by putting this α into the equation w = ∑i=1…n αiyixi.

• Decision function (= classifier c) of our text classification problem:

  yn+1 = c(xn+1) = sgn(wᵀxn+1 + b) = sgn(∑i=1…n αiyixiᵀxn+1 + b)

• Include a soft margin (a short sketch using an off-the-shelf solver follows below):

  Minimize ½wᵀw + C·∑i=1…n ξi, where C is the cost parameter,
  Subject to ∀xi ∈ χ: yi(wᵀxi + b) ≥ 1 − ξi
             ∀xi ∈ χ: ξi ≥ 0
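As a sketch of how this formulation is used in practice (toy data; scikit-learn's SVC is just one possible solver and is not part of the derivation above), the dual coefficients αiyi and the support vectors returned by the solver reproduce w = ∑ αiyixi and the decision function sgn(wᵀx + b):

```python
import numpy as np
from sklearn.svm import SVC

# Toy data; the labels are encoded as +1 / -1 as in the derivation above.
X = np.array([[1, 1], [2, 1], [1, 2], [5, 5], [6, 5], [5, 6]], dtype=float)
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=10.0).fit(X, y)   # C is the soft-margin cost parameter

# dual_coef_ stores alpha_i * y_i for the support vectors, so
# w = sum_i alpha_i * y_i * x_i is a single matrix product:
w = clf.dual_coef_ @ clf.support_vectors_
b = clf.intercept_

x_new = np.array([4.0, 4.0])
print(np.sign(w @ x_new + b))      # classifier c(x) = sgn(w^T x + b)
print(clf.predict([x_new]))        # the same decision via the library
```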

A Java applet for SVMs is available here:
http://www.inf.ethz.ch/personal/porbanz/ml2/applets/Classifier/JSupportVectorApplet.html

4. A Quality Measure

4.1 Odds Ratio and TCat-Concept


The larger the margin m, the better the separation and the easier it is to decide on the category. A lower bound on the margin is therefore a measure of the quality of our categorization problem. The so-called odds ratio and the TCat-concept are useful for finding such a bound.
We build this section on the example of section 2.1; the notation stays the same. First we set up our example:
• We have 3 predefined categories: „Music“, „Politics“, „Sport“.
• Training data: 100 documents per category. Each document consists of exactly 150 words.
• Feature space: we choose 20,000 words for the dictionary, so the feature space χ has a dimension of 20,000. We assume that every word in the training documents is in the dictionary.
• We use „one against many“ (see section 3.1):
  „Sport“ against ¬„Sport“ (¬„Sport“ = „Music“ ∪ „Politics“)

We now order the features in the training data by their odds ratio Ψ, described in [4]. As an example, consider the words „ball“ and „Iraq“:


                       # docs containing „ball“    # docs not containing „ball“
  # docs ∈ „Sport“     a := 59                     b := 41
  # docs ∉ „Sport“     c := 3                      d := 97

  ⇒ Odds ratio of „ball“: Ψ = ad/(bc) = (59·97)/(41·3) = 46.5

                       # docs containing „Iraq“    # docs not containing „Iraq“
  # docs ∈ „Sport“     a := 1                      b := 99
  # docs ∉ „Sport“     c := 11                     d := 89

  ⇒ Odds ratio of „Iraq“: Ψ = ad/(bc) = (1·89)/(99·11) = 0.1

An odds ratio of
• 1 means that the feature fits „Sport“ as well as ¬„Sport“. Such a feature does not carry information (e.g. stopwords).
• > 1 means that the feature helps to identify the category „Sport“ (the larger the better).
• < 1 means that the feature helps to identify the category ¬„Sport“ (the smaller the better).
(A small sketch of this computation follows below.)
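The computation itself is a one-liner; the following sketch just re-checks the two values from the tables above (the helper function is my own, not from the cited work):

```python
def odds_ratio(a, b, c, d):
    # a, b, c, d as in the contingency tables above
    return (a * d) / (b * c)

print(round(odds_ratio(59, 41, 3, 97), 1))   # "ball": 46.5, strong indicator for "Sport"
print(round(odds_ratio(1, 99, 11, 89), 1))   # "Iraq": 0.1, indicator for non-"Sport"
```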

Because of Zipf's law we use the simple strategy [6] of sorting features by Ψ:

                          „Sport“          ¬„Sport“
  irrelevant, stopwords          0.5 < Ψ < 2
  high frequency          2 < Ψ < 5        0.2 < Ψ < 0.5
  medium frequency        5 < Ψ < 10       0.1 < Ψ < 0.2
  low frequency           10 < Ψ < ∞       0 < Ψ < 0.1
In our example we calculate Ψ from the training data for each of the 20,000 words. The resulting (disjoint) subsets could then look like this (example values):

                          „Sport“                                   ¬„Sport“
  stopwords (high freq.)       if the and of for an a not that in … (105 words)  (1)
  high frequency          Euro08 ball match final score             sound Bush Iraq parliament flag
                          fight golf player … (96)  (2)             EU cover song … (158)
  medium frequency        Saturday season ice Federer               concert war rock trance USA
                          concentration … (864)                     world food love … (1602)
  low frequency           parachute 1:0 0:1 FC FCL fight … (2108)   Beatles Berlin Putin drugs George … (6231)
  irrelevant (low freq.)       car mouse off look hold noise travel … (8836)

The markers (1) and (2) are explained below.

This leads us to the TCat-concept [2]: the TCat-concept

  TCat([p1:n1:f1], …, [ps:ns:fs])

describes a binary classification task with s disjoint sets of features (i.e. words). The i-th set includes fi features. Each positive example (∈ „Sport“) contains pi occurrences of features from the respective set, and each negative example (∉ „Sport“) contains ni occurrences. The same feature can occur multiple times in one document.


Our training data could produce the following values:

  TCatSport([58:42:105],             # stopwords        (1)
            [26:8:96], [11:27:158],  # high freq.       (2)
            [14:3:864], [6:27:1602], # medium freq.
            [4:1:2108], [2:10:6231], # low freq.
            [29:32:8836])            # irrelevant

Explanation of (1): a „Sport“ document (150 words long) contains on average 58 stopwords, a ¬„Sport“ document on average 42 stopwords. Our dictionary contains 105 stopwords (see table above).
Explanation of (2): consider the subset of 96 high-frequency words. A „Sport“ document contains on average many more words from this subset (26 words) than a ¬„Sport“ document (8 words), so the subset is a good indicator for „Sport“ documents.
The figure below visualizes all our training results: we see at a glance which subsets are good indicators for a document ∈ „Sport“ (or ∈ ¬„Sport“) and which subsets are ill-suited for text categorization.

[Figure: the 20,000 dictionary words ordered by their frequency (following Zipf's law) and grouped into the subsets from the table above (105 stopwords; 96 and 158 high-frequency, 864 and 1602 medium-frequency, 2108 and 6231 low-frequency, 8836 irrelevant words); for each subset the average number of occurrences in a document ∈ „Sport“ (top, e.g. 58 stopwords, 26 high-frequency „Sport“ words) and in a document ∈ ¬„Sport“ (bottom, e.g. 42 stopwords, 8 high-frequency „Sport“ words) is shown.]

The idea of grouping words and using average values brings a great advantage: we no longer need a constraint (in the optimization problem) for each individual training document. This simplifies the problem and lets us derive a lower bound for the margin m in the next section.

4.2 Lower Bound for the Margin

Thanks to the TCat-concept we are able to make the following statement about the margin m [2]:

For TCat([p1:n1:f1], …, [ps:ns:fs])-concepts, there is always a hyperplane passing through the origin that has a margin m bounded by

  m² ≥ (ac − b²) / (a + 2b + c),   with   a = ∑i=1…s pi²/fi,   b = ∑i=1…s pini/fi,   c = ∑i=1…s ni²/fi.


Proof.
• Define p = (p1,…,ps)ᵀ, n = (n1,…,ns)ᵀ, F = diag(f1,…,fs).
• For SVMs with a hyperplane passing through the origin and without soft margin, the following optimization problem holds (see section 3.2):
  W(w) = min ½wᵀw,   s.t. ∀xi ∈ χ: yi(wᵀxi) ≥ 1
  For the solution vector w* it holds: m² = 1/||w*||² = 1/(2·½w*ᵀw*) = 1/(2W(w*)).
• Simplification of the optimization problem:
  Let us add the constraint that within each group of fi features the weights are required to be identical. Then wᵀw = vᵀFv, v ∈ ℜs.
  By definition, each example contains a fixed number of features from each group. This means that all constraints for positive examples are equivalent to pᵀv ≥ 1, and all constraints for negative examples are equivalent to nᵀv ≤ −1.
  ⇒ V(v) = min ½vᵀFv,   s.t. pᵀv ≥ 1, nᵀv ≤ −1
• Let v* be the solution vector. The restricted problem can only have a larger optimum, so we get a lower bound: V(v*) ≥ W(w*) ⇒ m² ≥ 1/(2V(v*)).
• Introducing and solving Lagrange multipliers:
  L(v, α+, α−) = ½vᵀFv − α+(vᵀp − 1) + α−(vᵀn + 1),   α+ ≥ 0, α− ≥ 0
  dL(v,α+,α−)/dv = 0 ⇔ v = F⁻¹(α+p − α−n)
  For ease of notation we write v = F⁻¹XYα, with X = (p, n), Y = diag(1, −1), αᵀ = (α+, α−).
  ⇒ L(α) = 1ᵀα − ½αᵀYXᵀF⁻¹XYα
• Maximize L(α), s.t. α+ ≥ 0, α− ≥ 0.
  Since only a lower bound on the margin is needed, it is possible to drop the constraints α+ ≥ 0 and α− ≥ 0, because removing these constraints can only increase the objective function at the solution. So the unconstrained maximum L'(α)* is greater than or equal to L(α)* = V(v*).
  dL'(α)/dα = 0 ⇔ α = (YXᵀF⁻¹XY)⁻¹1
  ⇒ L'(α)* = ½·1ᵀ(YXᵀF⁻¹XY)⁻¹1   (*)
• The special form of YXᵀF⁻¹XY makes it possible to compute its inverse in closed form:
  YXᵀF⁻¹XY = ( pᵀF⁻¹p   −pᵀF⁻¹n ) = ( a   −b )
             ( −nᵀF⁻¹p   nᵀF⁻¹n )   ( −b   c )
  ⇒ (YXᵀF⁻¹XY)⁻¹ = 1/(ac − b²) · ( c   b )
                                  ( b   a )
  Substituting into (*) gives L'(α)* = (a + 2b + c) / (2(ac − b²)), and with m² ≥ 1/(2V(v*)) = 1/(2L(α)*) ≥ 1/(2L'(α)*) this completes the proof. □

Let us have a look at what we get in our example:

  a = 58²/105 + 26²/96 + 11²/158 + 14²/864 + 6²/1602 + 4²/2108 + 2²/6231 + 29²/8836 = 40.20
  b = 58·42/105 + 26·8/96 + 11·27/158 + 14·3/864 + 6·27/1602 + 4·1/2108 + 2·10/6231 + 29·32/8836 = 27.51
  c = 42²/105 + 8²/96 + 27²/158 + 3²/864 + 27²/1602 + 1²/2108 + 10²/6231 + 32²/8836 = 22.68

  m² ≥ (40.20·22.68 − 27.51²) / (40.20 + 2·27.51 + 22.68) = 1.32

⇒ The lower bound is m ≥ 1.15!
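The numbers above are easy to reproduce programmatically. The following sketch plugs the TCatSport values from section 4.1 into the bound (pure arithmetic, no assumptions beyond those values):

```python
from math import sqrt

# TCat groups as (p_i, n_i, f_i): stopwords, high, medium, low frequency, irrelevant.
tcat = [(58, 42, 105), (26, 8, 96), (11, 27, 158), (14, 3, 864),
        (6, 27, 1602), (4, 1, 2108), (2, 10, 6231), (29, 32, 8836)]

a = sum(p * p / f for p, n, f in tcat)
b = sum(p * n / f for p, n, f in tcat)
c = sum(n * n / f for p, n, f in tcat)

m_squared = (a * c - b * b) / (a + 2 * b + c)
print(round(a, 2), round(b, 2), round(c, 2))            # ~ 40.20, 27.51, 22.68
print(round(m_squared, 2), round(sqrt(m_squared), 2))   # ~ 1.32, 1.15
```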

Remark
Each document is assumed to exactly follow the same generalized Zipf’s law, neglecting
variance and discretization inaccuracies that occur especially for short documents. In
particular, this implies that all documents are of equal length.


5. References
1. T. Joachims. Text Categorization with Support Vector Machines: Learning with Many Relevant Features. Proceedings of the European Conference on Machine Learning (ECML), Springer, 1998.
2. T. Joachims. A Statistical Learning Model of Text Classification for Support Vector Machines. Proceedings of the Conference on Research and Development in Information Retrieval (SIGIR), ACM, 2001.
3. C. Cortes and V. Vapnik. Support-Vector Networks. Machine Learning, 20:273–297, November 1995.
4. Wikipedia, http://www.wikipedia.org, 2007.
5. J. Buhmann. Course slides of „Introduction to Machine Learning“, winter semester 2006/07.
6. T. Joachims. The Maximum-Margin Approach to Learning Text Classifiers: Methods, Theory, and Algorithms. PhD thesis, Universität Dortmund, 2001. Kluwer, to appear.
7. http://www.dtreg.com/svm.htm, 2007.
