Introduction to Information Retrieval
CS276: Information Retrieval and Web Search
Christopher Manning, Pandu Nayak, and Prabhakar Raghavan
Text classification
Last week: 3 algorithms for text classification
Naive Bayes classifier
Simple, cheap, high bias, linear
K Nearest Neighbor classification
Simple, expensive at test time, high variance, non-linear
Vector space classification: Rocchio
Simple linear discriminant classifier; perhaps too simple
Today
Support Vector Machines (SVMs)
Soft margin SVMs and kernels for non-linear classifiers
Some empirical evaluation and comparison
Text-specific issues in classification
Another intuition
If you have to place a fat separator between classes, you have fewer choices of where to put it, and so the capacity of the model is decreased.
Geometric Margin
Distance from an example to the separator is r = y(wTx + b)/|w|.
Examples closest to the hyperplane are support vectors.
Margin ρ of the separator is the width of separation between the support vectors of the two classes.
Derivation of r:
Let x′ be the point where the perpendicular from x meets the decision boundary. The segment from x′ to x is perpendicular to the boundary, so it is parallel to w.
The unit vector in that direction is w/|w|, so the segment is rw/|w|, and x′ = x − yrw/|w|.
Since x′ lies on the boundary, it satisfies wTx′ + b = 0:
wT(x − yrw/|w|) + b = 0
Recall that |w| = sqrt(wTw), so wTw/|w| = |w|, giving
wTx − yr|w| + b = 0
Solving for r (and using the fact that 1/y = y for y ∈ {−1, +1}):
r = y(wTx + b)/|w|
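A quick numerical check of this derivation, as a minimal sketch (the separator w, b and the point x are made-up values, not from the lecture): compute r, project x onto the boundary, and verify that the projected point satisfies wTx′ + b = 0.

```python
import numpy as np

# Made-up separator and point, only for checking the derivation.
w = np.array([3.0, 4.0])   # |w| = 5
b = -1.0
x = np.array([2.0, 1.0])
y = 1

r = y * (w @ x + b) / np.linalg.norm(w)      # r = y (w^T x + b) / |w|
x_proj = x - y * r * w / np.linalg.norm(w)   # x' = x - y r w / |w|
print("r =", r)                              # distance from x to the boundary
print("w^T x' + b =", w @ x_proj + b)        # ~0: x' lies on the boundary
```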
Linear SVM Mathematically
Assume all data is at least distance 1 from the hyperplane; then for a training set {(xi, yi)}:
wTxi + b ≥ 1 if yi = 1
wTxi + b ≤ −1 if yi = −1
For support vectors, the inequality becomes an equality.
Then, since each example's distance from the hyperplane is r = y(wTx + b)/|w|, the margin is:
ρ = 2/|w|
Take support vectors xa and xb on opposite margin boundaries, with xa − xb perpendicular to the hyperplane wTx + b = 0:
wTxa + b = 1
wTxb + b = −1
This implies: wT(xa − xb) = 2
So the margin is ρ = |xa − xb|2 = 2/|w|2 (Euclidean norm).
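A tiny numpy check of this identity, as a sketch with a made-up w and b: the two support vectors are constructed on the margin boundaries along the direction of w.

```python
import numpy as np

# Made-up separator, for illustration only.
w = np.array([3.0, 4.0])   # |w| = 5
b = 0.0

# Start from a point on the hyperplane and step +-w/|w|^2 along the normal
# to land on the two margin boundaries.
norm_w = np.linalg.norm(w)
x0 = np.array([0.0, 0.0])     # satisfies w^T x0 + b = 0 when b = 0
xa = x0 + w / norm_w**2       # w^T xa + b = 1
xb = x0 - w / norm_w**2       # w^T xb + b = -1

print("w^T(xa - xb) =", w @ (xa - xb))            # 2
print("|xa - xb|    =", np.linalg.norm(xa - xb))  # equals 2/|w|
print("2/|w|        =", 2 / norm_w)
```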
Solving the optimization problem yields the classifier:
f(x) = Σi αi yi xiTx + b
Notice that it relies on an inner product between the test point x and the support vectors xi (we will return to this later).
Also keep in mind that solving the optimization problem involved computing the inner products xiTxj between all pairs of training points.
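A minimal numpy sketch of this decision function, assuming the support vectors, their labels, and the learned αi and b are already available (the values below are hypothetical, not from a real training run):

```python
import numpy as np

# Hypothetical learned quantities: support vectors, labels, dual weights, bias.
sv = np.array([[1.0, 2.0], [2.0, 0.5]])   # support vectors x_i
sv_y = np.array([1, -1])                  # their labels y_i
alpha = np.array([0.7, 0.7])              # dual coefficients alpha_i
b = -0.3

def decision(x):
    """f(x) = sum_i alpha_i * y_i * (x_i . x) + b"""
    return np.sum(alpha * sv_y * (sv @ x)) + b

x_test = np.array([1.5, 1.5])
score = decision(x_test)
print(score, "-> class", 1 if score > 0 else -1)
```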
The most important training points are the support vectors; they define the hyperplane.
Both in the dual formulation of the problem and in its solution, training points appear only inside inner products:
Maximize Σi αi − ½ Σi Σj αi αj yi yj xiTxj subject to αi ≥ 0 and Σi αi yi = 0
f(x) = Σi αi yi xiTx + b
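A small sketch, with a toy training set and made-up dual variables (this only evaluates the dual objective and checks its constraints; it is not a solver):

```python
import numpy as np

# Toy training set and hypothetical alphas, for illustration only.
X = np.array([[1.0, 1.0], [2.0, 2.0], [-1.0, -1.0]])
y = np.array([1, 1, -1])
alpha = np.array([0.25, 0.0, 0.25])

G = (X @ X.T) * np.outer(y, y)    # y_i y_j x_i^T x_j: only inner products appear
dual_obj = alpha.sum() - 0.5 * alpha @ G @ alpha
print("dual objective:", dual_obj)
print("sum_i alpha_i y_i = 0:", np.isclose(alpha @ y, 0))
print("alpha_i >= 0:", np.all(alpha >= 0))
```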
Non-linear SVMs
Datasets that are linearly separable (with some noise) work out great.
But what are we going to do if the dataset is just too hard to separate linearly?
How about mapping the data to a higher-dimensional space?
[Figure: 1-D datasets on the x axis; a dataset that is not linearly separable becomes separable after mapping x to (x, x²).]
General idea: the original feature space can be mapped to some higher-dimensional feature space where the training set is separable, via a feature map Φ: x → φ(x).
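As a concrete illustration, here is a tiny sketch with toy 1-D data (made up for this example) and the map φ(x) = (x, x²): in the new space a horizontal threshold on x² separates the classes.

```python
import numpy as np

# Toy 1-D data: positives surround negatives, so no threshold on x separates them.
x = np.array([-3.0, -2.0, -0.5, 0.5, 2.0, 3.0])
y = np.array([1, 1, -1, -1, 1, 1])

# Feature map phi: x -> (x, x^2). Here y = +1 exactly when x^2 >= 1,
# so the line x2 = 1 separates the classes in the mapped space.
phi = np.column_stack([x, x ** 2])
print(phi)
```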
Kernels
Why use kernels?
Make non-separable problems separable.
Map data into a better representational space.
Common kernels (see the sketch below):
Linear: K(x,z) = xTz
Polynomial: K(x,z) = (1 + xTz)^d
Gives feature conjunctions
Radial basis function: K(x,z) = exp(−|x − z|²/(2σ²)) (infinite-dimensional feature space)
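A short numpy sketch of the two non-linear kernels named above; the parameter values d and sigma are arbitrary choices for illustration:

```python
import numpy as np

def poly_kernel(x, z, d=2):
    """Polynomial kernel K(x, z) = (1 + x^T z)^d; d = 2 yields pairwise conjunctions."""
    return (1.0 + x @ z) ** d

def rbf_kernel(x, z, sigma=1.0):
    """Radial basis function kernel K(x, z) = exp(-|x - z|^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

x = np.array([1.0, 0.0])
z = np.array([0.5, 0.5])
print(poly_kernel(x, z), rbf_kernel(x, z))
```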
Per-class evaluation measures
Let cij be the number of documents of (true) class i that are assigned to class j.
Recall: fraction of docs in class i classified correctly: cii / Σj cij
Precision: fraction of docs assigned class i that are actually about class i: cii / Σj cji
Accuracy (1 − error rate): fraction of docs classified correctly: Σi cii / Σi Σj cij
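These formulas translate directly into code; a minimal sketch with a made-up 3-class confusion matrix C, where C[i, j] counts docs of true class i assigned to class j:

```python
import numpy as np

# Hypothetical confusion matrix, for illustration only.
C = np.array([[50,  5,  5],
              [10, 40, 10],
              [ 0,  5, 45]])

recall    = np.diag(C) / C.sum(axis=1)   # c_ii / sum_j c_ij (row sums)
precision = np.diag(C) / C.sum(axis=0)   # c_ii / sum_j c_ji (column sums)
accuracy  = np.trace(C) / C.sum()        # sum_i c_ii / sum_ij c_ij

print("recall:   ", recall)
print("precision:", precision)
print("accuracy: ", accuracy)
```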
[Figure: precision-recall plot comparing linear SVM (LSVM), Decision Tree, Naïve Bayes, and Rocchio classifiers; precision 0-1 on the horizontal axis, recall 0-0.5 on the vertical axis; from Dumais (1998).]
Upweighting
You can get a lot of value by differentially weighting contributions from different document zones: that is, counting a word as, say, two occurrences when it appears in the abstract (see the sketch below).
Upweighting title words helps (Cohen & Singer 1996)
Doubling the weight of title words is a good rule of thumb
Upweighting the first sentence of each paragraph helps (Murata, 1999)
Upweighting sentences that contain title words helps (Ko et al., 2002)
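A small sketch of zone upweighting for a bag-of-words count vector, doubling title words per the rule of thumb above; the zones, weights, and example document are illustrative, not from the lecture:

```python
from collections import Counter

# Illustrative zone weights: title words count double, everything else once.
ZONE_WEIGHTS = {"title": 2, "body": 1}

def weighted_bag_of_words(doc):
    """Build term counts where each occurrence is multiplied by its zone's weight."""
    counts = Counter()
    for zone, text in doc.items():
        for term in text.lower().split():
            counts[term] += ZONE_WEIGHTS.get(zone, 1)
    return counts

doc = {"title": "svm text classification",
       "body": "svm classifiers separate classes with a maximum margin"}
print(weighted_bag_of_words(doc))   # "svm" gets weight 2 + 1 = 3
```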
Measuring Classification Figures of Merit
Not just accuracy; in the real world there are economic measures. Your choices are:
Do no classification
That has a cost (hard to compute)
Do it all manually
Has an easy-to-compute cost if you are doing it that way now
Do it all with an automatic classifier
Mistakes have a cost
Do it with a combination of automatic classification and manual review of uncertain/difficult/new cases
Commonly the last method is the most cost-efficient and is adopted.
Summary
Support vector machines (SVMs)
Choose the hyperplane based on support vectors
Support vector = critical point close to the decision boundary
(Degree-1) SVMs are linear classifiers.
Kernels: powerful and elegant way to define a similarity metric
Perhaps the best-performing text classifier
But other methods perform about as well, such as regularized logistic regression (Zhang & Oles 2001)
Partly popular due to the availability of good software
SVMlight is accurate, fast, and free (for research)
Now lots of good software: libsvm, TinySVM, ...
Comparative evaluation of methods
Real world: exploit domain-specific structure!