Remo Frey
ETH Zürich
Summer Semester 2007
Text Categorization: Support Vector Machines
Content
1. Introduction
2. Text Categorization
3. Support Vector Machines
4. A Quality Measure
5. References
1. Introduction
Text categorization is the process of sorting text documents into one or more predefined
categories or classes of similar documents. This task is becoming more and more important,
because data volumes are growing everywhere while searchers want to find their desired
information in ever less time.
Such enormous amounts of information can no longer reasonably be sorted by hand; it would
simply take too long. The WWW in particular, with its nearly unbounded volume of data, relies
on good automatic search algorithms, which often use text categorization. Furthermore, the
data in the WWW change constantly, so keeping them sorted by hand is an impossibility.
In section 2, I first specify the problem of text categorization, in particular in
conjunction with Support Vector Machines (SVMs).
Support Vector Machines, introduced by Vapnik et al. [3], offer a suitable solution to the
challenge of text categorization [1, 2]. They have many advantages; for example, they are
extremely robust against bad input data, and they are easy to use. In section 3, I present
the general idea and the mathematical basics of Support Vector Machines.
Thanks to empirical knowledge (for example Zipf’s law), it is possible to make stronger
statements than pure mathematics alone allows. In the final section I introduce the TCat
concept, which helps us say something about the quality of text categorization.
2. Text Categorization
Simple example
The figure below shows a strongly reduced and thus unrealistic example to demonstrate the
concept. Here there are only d = 3 words for classification („Euro08“, „Bush“, „Beatles“),
so the feature space is χ = ℜ³. The classifier c with k = 3 categories maps χ to
{Sport, Politics, Music}. Now we search for the label y20 of the document x20.
[Figure: documents plotted in the 3-dimensional word space spanned by the axes „Euro08“,
„Bush“ and „Beatles“; the unlabelled document x20 is marked y20 = ?]
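The mapping from documents to such count vectors can be sketched as follows (Python; the three-word vocabulary mirrors the example above, and the sample document is invented for illustration):

```python
# Build 3-dimensional word-count feature vectors as in the example above.
# The vocabulary and the sample document are invented for illustration.
VOCABULARY = ["euro08", "bush", "beatles"]   # the d = 3 words

def to_feature_vector(document):
    """Map a document to its word-count vector in R^3."""
    tokens = document.lower().split()
    return [tokens.count(word) for word in VOCABULARY]

doc = "Bush spoke about the Euro08 security plans with Bush aides"
print(to_feature_vector(doc))  # [1, 2, 0]
```

Real systems use vocabularies of tens of thousands of words, but the principle is the same: each document becomes one point in ℜᵈ.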
Stopwords
A few words in the dictionary are typically considered stopwords. Stopwords don’t help
discriminate between documents. Examples: „if“, „the“, „and“, „of“, „for“, „an“, „a“, „not“,
„that“, „in“. These words occur with high frequency in every typical text document.
Stopwords are therefore irrelevant for text categorization, and the dimension of the
feature space χ can be reduced without loss.
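This dimensionality reduction can be sketched as follows (Python, using exactly the stopword list quoted above; the sample sentence is invented):

```python
# Removing stopwords shrinks the feature space without losing information.
# The stopword list is the one quoted in the text above.
STOPWORDS = {"if", "the", "and", "of", "for", "an", "a", "not", "that", "in"}

def remove_stopwords(tokens):
    """Drop stopwords from a token list; the remaining tokens form the features."""
    return [t for t in tokens if t.lower() not in STOPWORDS]

tokens = "the final score of the match was not that surprising".split()
print(remove_stopwords(tokens))  # ['final', 'score', 'match', 'was', 'surprising']
```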
3. Support Vector Machines
[Figure: two panels plotting the training cases by Feature 1 and Feature 2; each panel
shows one candidate separating hyperplane.]
In this simple example, the cases of one category lie in the lower left corner and the cases
of the other category in the upper right corner; the cases are completely separated. The
SVM analysis attempts to find a line (a 1-dimensional hyperplane) that separates the cases
based on their labelled categories. There are infinitely many possible lines; two
candidate lines are shown in the figures above. The question is which line is better, and
how we define the optimal line.
The two dashed lines drawn parallel to the separating line mark the distance between the
dividing line and the vectors closest to it. The distance between a dashed line and
the hyperplane is called the margin. The rectangles and circles that constrain the width of
the margin are the support vectors.
An SVM analysis finds the line (or, in general, hyperplane) that is oriented so that the margin
between the support vectors is maximized. In the figure above, the line in the right panel is
superior to the line in the left panel.
In the example, we had only two features, and we were able to plot the points on a 2-
dimensional plane. If we add a third feature, then we can use its value for a third dimension
and plot the points in a 3-dimensional cube. Points on a 2-dimensional plane can be separated
by a 1-dimensional line, and points in a 3-dimensional cube by a 2-dimensional plane.
The simplest way to divide two groups is with a straight line, a flat plane or, in general,
a (d−1)-dimensional hyperplane. But what if the points are separated by a nonlinear region,
as shown below?
We need a nonlinear dividing line. Rather than fitting nonlinear curves to the data, SVM
handles this by using a kernel function to map the data into a different space where a
hyperplane can be used to do the separation. The kernel function may transform the data into
a higher dimensional space.
An example is given in the following picture. The data has two features F1 and F2. Our
example map Φ takes them to Z1, Z2 and Z3:

  Φ : ℜ² → ℜ³, (F1, F2) ↦ (Z1, Z2, Z3) := (F1², √2·F1·F2, F2²)

[Figure: the data in the original (F1, F2) space on the left and, after applying Φ, in the
(Z1, Z2, Z3) space on the right, where a separating hyperplane exists.]
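This particular Φ has the convenient property that the inner product in ℜ³ can be computed directly in ℜ², without ever constructing Φ(x): Φ(x)ᵀΦ(y) = (xᵀy)². A small sketch (Python; the sample points are arbitrary) verifies this identity:

```python
import math

def phi(f1, f2):
    """Explicit feature map Φ: R^2 -> R^3 from the example above."""
    return (f1 * f1, math.sqrt(2) * f1 * f2, f2 * f2)

def kernel(x, y):
    """Polynomial kernel k(x, y) = (x·y)^2, equal to Φ(x)·Φ(y)."""
    dot = x[0] * y[0] + x[1] * y[1]
    return dot * dot

x, y = (1.0, 2.0), (3.0, 0.5)
explicit = sum(a * b for a, b in zip(phi(*x), phi(*y)))   # inner product in R^3
print(abs(explicit - kernel(x, y)) < 1e-9)  # True: both sides agree
```

This is the "kernel trick": the SVM only ever needs inner products, so it can work in the high-dimensional space at the cost of the low-dimensional one.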
Ideally an SVM analysis should produce a hyperplane that completely separates the feature
vectors into two non-overlapping groups. However, perfect separation may not be possible, or
it may result in a model with so many feature vector dimensions that the model does not
generalize well to other data; this is known as overfitting.
To allow some flexibility in separating the categories, SVM models have a cost parameter, C,
that controls the trade-off between allowing training errors and forcing rigid margins. It
creates a soft margin that permits some misclassifications, see left figure. Increasing the
value of C increases the cost of misclassifying points and forces the creation of a more
accurate model that may not generalize well. In text categorization this idea of a soft
margin is important, because in this application a perfect separation is rare.
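As an illustration of the role of C, the following sketch minimizes the soft-margin objective ½wᵀw + C·∑ max(0, 1 − yᵢ(wᵀxᵢ + b)) by plain subgradient descent. This is not how SVM packages actually solve the problem (they solve the dual quadratic program), and the toy data and all parameter values are invented:

```python
# Minimal soft-margin SVM sketch: full-batch subgradient descent on
#   0.5*(w·w) + C * sum_i max(0, 1 - y_i*(w·x_i + b))
# Not a production solver; toy data and parameters are invented.
def train_soft_margin_svm(points, labels, C=1.0, lr=0.001, epochs=5000):
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        gw, gb = [w[0], w[1]], 0.0                    # gradient of the 0.5*(w·w) term
        for (x1, x2), y in zip(points, labels):
            if y * (w[0] * x1 + w[1] * x2 + b) < 1:   # hinge loss is active here
                gw[0] -= C * y * x1
                gw[1] -= C * y * x2
                gb -= C * y
        w = [w[0] - lr * gw[0], w[1] - lr * gw[1]]
        b -= lr * gb
    return w, b

points = [(2.0, 2.0), (3.0, 3.0), (0.0, 0.0), (-1.0, 0.5)]
labels = [1, 1, -1, -1]
w, b = train_soft_margin_svm(points, labels, C=10.0)  # large C: few errors tolerated
predictions = [1 if w[0] * x1 + w[1] * x2 + b >= 0 else -1 for x1, x2 in points]
print(predictions == labels)
```

With a small C the same trainer would tolerate points inside (or beyond) the margin in exchange for a flatter, better-generalizing w.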
[Figure: a separating hyperplane in the (Feature 1, Feature 2) plane; the dashed lines mark
the margin m, and ξi denotes the slack of a point lying inside the margin.]

The hyperplane is described by a normal vector w and a translation parameter b,
so it holds: wᵀx + b = 0.

- For support vectors (on the dashed lines) it holds: wᵀxi + b = ±m
- Label yi:
  yi = +1 (Category 1) if wᵀxi + b ≥ m
  yi = −1 (Category 2) if wᵀxi + b ≤ −m
- Classifier c:
  yi+1 = c(xi+1) = sgn(wᵀxi+1 + b)
- Learning problem:
  Find w (w normalized: ||w|| = 1), such that the margin m is maximized:
  Maximize m (= geometric margin, see figure)
  Subject to ∀xi ∈ χ: yi(wᵀxi + b) ≥ m
- Alternative formulation without m:
  Rescaling w := w/m, b := b/m ⇒ m² = 1/||w||² = 1/(2·½wᵀw) (without derivation!)
  Minimize ||w|| for a given margin m = 1 (= functional margin) ⇒ Minimize ½wᵀw
  Subject to ∀xi ∈ χ: yi(wᵀxi + b) ≥ 1
⇒ Generalized Lagrange function:

  L(w,b,α) = ½wᵀw − ∑i=1..n αi [yi(wᵀxi + b) − 1]

Saddle point: minimize over w and b, maximize over the αi.
We find the solution via the dual problem.
We set

  ∂L/∂b = −∑i=1..n αi yi = 0 and ∂L/∂w = w − ∑i=1..n αi yi xi = 0

and receive

  ∑i=1..n αi yi = 0 and w = ∑i=1..n αi yi xi
If we insert this result into L(w,b,α) and transform, we obtain the dual problem:

  L(w,b,α) = ½wᵀw − ∑i αi [yi(wᵀxi + b) − 1]
           = ½wᵀw − ∑i αi yi (wᵀxi) − b ∑i αi yi + ∑i αi
           = −½ ∑i,j αi αj yi yj xiᵀxj + ∑i αi

(using ∑i αi yi = 0 and wᵀw = ∑i,j αi αj yi yj xiᵀxj).
Maximize W(α) = −½ ∑i,j αi αj yi yj xiᵀxj + ∑i=1..n αi

Subject to αi ≥ 0 and ∑i=1..n αi yi = 0

Solution:
First we solve the dual problem and receive the α that maximizes W(α).
Then we calculate the normal vector w by inserting α into the equation w = ∑i=1..n αi yi xi.
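For only two training points the dual can be solved by hand, which makes a useful sanity check: the constraint ∑αᵢyᵢ = 0 forces α₁ = α₂ = α, and maximizing W(α) = 2α − ½α²·||x₁ − x₂||² gives α = 2/||x₁ − x₂||². A sketch with invented points:

```python
import math

# Hard-margin SVM dual for two points x1 (y = +1) and x2 (y = -1), solved
# in closed form: a1 = a2 = alpha = 2 / ||x1 - x2||^2. Points are invented.
x1, x2 = (2.0, 2.0), (0.0, 0.0)

sq_dist = sum((u - v) ** 2 for u, v in zip(x1, x2))   # ||x1 - x2||^2
alpha = 2.0 / sq_dist

# w = sum_i alpha_i * y_i * x_i = alpha*x1 - alpha*x2
w = tuple(alpha * (u - v) for u, v in zip(x1, x2))

# b from the support-vector condition w·x1 + b = +1
b = 1.0 - sum(wi * ui for wi, ui in zip(w, x1))

margin = 1.0 / math.sqrt(sum(wi * wi for wi in w))    # m = 1/||w||
print(w, b, margin)
```

As expected, the resulting geometric margin equals half the distance between the two points, and the hyperplane lies exactly midway between them.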
4. A Quality Measure
We now order the features in the test data by their „Odds Ratio“ Ψ, described in [4]. As
examples, consider the words „ball“ and „Iraq“.
An Odds Ratio of
- = 1 means that the feature fits „Sport“ as well as ¬„Sport“. Such a feature does not
  carry information (e.g. stopwords).
- > 1 means that the feature helps to identify the category „Sport“ (the larger, the better).
- < 1 means that the feature helps to identify the category ¬„Sport“ (the smaller, the better).
Because of Zipf’s law, we use the simple strategy [6] of sorting features by Ψ:
                        „Sport“         ¬„Sport“
irrelevant, stopwords      0.5 < Ψ < 2
high frequency          2 < Ψ < 5       0.2 < Ψ < 0.5
medium frequency        5 < Ψ < 10      0.1 < Ψ < 0.2
low frequency           10 < Ψ < ∞      0 < Ψ < 0.1
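The Odds Ratio itself can be sketched as follows, estimating P(word | class) as the fraction of documents of that class containing the word; this estimation choice and all counts below are invented for illustration:

```python
# Odds-ratio sketch: Psi = odds(word | "Sport") / odds(word | not "Sport"),
# with odds(p) = p / (1 - p). The document counts are invented.
def odds_ratio(pos_with, pos_total, neg_with, neg_total):
    p = pos_with / pos_total      # P(word | "Sport")
    q = neg_with / neg_total      # P(word | not "Sport")
    return (p / (1 - p)) / (q / (1 - q))

# "ball" appears in 80 of 100 Sport documents but only 20 of 100 others:
print(odds_ratio(80, 100, 20, 100))   # 16.0 -> strongly indicates "Sport"
# A stopword-like term appears equally often in both classes:
print(odds_ratio(50, 100, 50, 100))   # 1.0 -> carries no information
```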
In our example we calculate Ψ from the training data for each of the 20,000 words. The
resulting (disjoint) subsets could then look like this (example values):
                        „Sport“                              ¬„Sport“
stopwords (high freq.)  if the and of for an a not that in … (105 words)
high frequency          Euro08 ball match final score        sound Bush Iraq parlament flag
                        fight golf player … (96)             EU cover song … (158)
medium frequency        Saturday saison ice Federer          concert war rock trance USA
                        concentration … (864)                world food love … (1602)
low frequency           parachute 1:0 0:1 FC FCL … (2108)    fight Beatles Berlin Putin drugs
                                                             George … (6231)
irrelevant (low freq.)  car mouse off look hold noise travel … (8836)
[Figure: TCat diagram of an example document ∈ ¬„Sport“. Its word occurrences (e.g. 105
stopwords) are distributed over the sets of words described in the table above.]
The idea of grouping words and using average values brings a great advantage: we no longer
need a constraint (in the optimization problem) for each individual training document. This
simplifies the problem and allows us to build a lower bound for the margin m in what follows.
Proof.
- Define p = (p1,…,ps)ᵀ, n = (n1,…,ns)ᵀ, F = diag(f1,…,fs).
- For SVMs with a hyperplane passing through the origin and without soft margin, the
  following optimization problem holds (see section 3.2):

    W(w) = min ½wᵀw, s.t. ∀xi ∈ χ: yi(wᵀxi) ≥ 1

  For the solution vector w* it holds: m² = 1/||w*||² = 1/(2·½w*ᵀw*) = 1/(2W(w*)).
- Simplification of the optimization problem:
  Let us add the constraint that within each group of fi features the weights are required
  to be identical. Then wᵀw = vᵀFv, v ∈ ℜs.
  By definition, each example contains a certain number of features from each group. This
  means that all constraints for positive examples are equivalent to pᵀv ≥ 1, and all
  constraints for negative examples to nᵀv ≤ −1.

    ⇒ V(v) = min ½vᵀFv, s.t. pᵀv ≥ 1, nᵀv ≤ −1

- Let v* be the solution vector. Adding constraints can only increase the minimum, so we
  get a lower bound: V(v*) ≥ W(w*) ⇒ m² ≥ 1/(2V(v*)).
- Introducing and solving Lagrange multipliers:

    L(v, α+, α−) = ½vᵀFv − α+(vᵀp − 1) + α−(vᵀn + 1), α+ ≥ 0, α− ≥ 0

    dL(v, α+, α−)/dv = 0 ⇔ v = F⁻¹(α+p − α−n)

  For ease of notation we write v = F⁻¹XYα, with X = (p, n), Y = diag(1, −1), αᵀ = (α+, α−)

    ⇒ L(α) = 1ᵀα − ½αᵀYXᵀF⁻¹XYα

- Maximize L(α), s.t. α+ ≥ 0, α− ≥ 0.
  Since only a lower bound on the margin is needed, it is possible to drop the constraints
  α+ ≥ 0 and α− ≥ 0, because removing these constraints can only increase the objective
  function at the solution. So the unconstrained maximum L′(α)* is greater than or equal
  to L(α)*.

    dL′(α)/dα = 0 ⇔ α = (YXᵀF⁻¹XY)⁻¹1

    ⇒ L′(α) = ½·1ᵀ(YXᵀF⁻¹XY)⁻¹1
- The special form of (YXᵀF⁻¹XY) makes it possible to compute its inverse in closed form.
  With a := pᵀF⁻¹p, b := pᵀF⁻¹n, c := nᵀF⁻¹n:

    (YXᵀF⁻¹XY)⁻¹ = [  pᵀF⁻¹p  −pᵀF⁻¹n ]⁻¹ = [  a  −b ]⁻¹ = 1/(ac − b²) · [ c  b ]
                   [ −nᵀF⁻¹p   nᵀF⁻¹n ]     [ −b   c ]                   [ b  a ]

  Substituting this inverse into the expression for L′(α) completes the proof. □
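Carrying the substitution out, 2L′(α*) = 1ᵀ(YXᵀF⁻¹XY)⁻¹1 = (a + 2b + c)/(ac − b²), so the bound reads m² ≥ (ac − b²)/(a + 2b + c). A sketch computing it for invented group statistics p, n, f (occurrence counts per positive/negative document and group sizes):

```python
# Margin lower bound from the proof above: with
#   a = p'F^{-1}p, b = p'F^{-1}n, c = n'F^{-1}n and F = diag(f),
# the bound is m^2 >= (a*c - b^2) / (a + 2b + c).
# The group statistics p, n, f below are invented example values.
def margin_bound(p, n, f):
    a = sum(pi * pi / fi for pi, fi in zip(p, f))
    b = sum(pi * ni / fi for pi, ni, fi in zip(p, n, f))
    c = sum(ni * ni / fi for ni, fi in zip(n, f))
    return (a * c - b * b) / (a + 2 * b + c)

# s = 3 feature groups; p/n give occurrences per positive/negative document,
# f gives the number of features in each group.
p = [10.0, 4.0, 1.0]
n = [10.0, 1.0, 4.0]
f = [100.0, 50.0, 50.0]
print(margin_bound(p, n, f))  # a positive lower bound on m^2
```

Note how the bound grows when the positive and negative occurrence profiles p and n differ more strongly (b shrinks relative to a and c), which matches the intuition that well-separated vocabulary yields a large margin.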
Remark
Each document is assumed to exactly follow the same generalized Zipf’s law, neglecting
variance and discretization inaccuracies that occur especially for short documents. In
particular, this implies that all documents are of equal length.
5. References
1. T. Joachims. Text Categorization with Support Vector Machines: Learning with Many
Relevant Features. Proceedings of the European Conference on Machine Learning
(ECML), Springer, 1998.
2. T. Joachims. A Statistical Learning Model of Text Classification for Support Vector
Machines. Proceedings of the Conference on Research and Development in Information
Retrieval (SIGIR), ACM, 2001.
3. C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20:273–297, 1995.
4. http://www.wikipedia.org, 2007
5. Joachim Buhmann. Course slides of „Introduction to Machine Learning“. Winter
semester 2006/07
6. T. Joachims. The Maximum-Margin Approach to Learning Text Classifiers: Methods,
Theory, and Algorithms. PhD thesis, Universität Dortmund, 2001. Kluwer, to appear.
7. http://www.dtreg.com/svm.htm, 2007