
IAML: Support Vector Machines, Part II

Charles Sutton and Victor Lavrenko
School of Informatics
Semester 1

Last Time

- Max margin trick
- Geometry of the margin and how to compute it
- Finding the max margin hyperplane using a constrained optimization problem
- Max margin = Min norm

This Time

- Non-separable data
- The kernel trick

The SVM optimization problem

- Last time: the max margin weights can be computed by solving a constrained optimization problem (a small sketch of this problem as a quadratic program follows below):

    min_w ||w||²
    s.t.  y_i (w^T x_i + w_0) ≥ +1 for all i

- Many algorithms have been proposed to solve this. One of the earliest efficient algorithms is called SMO [Platt, 1998]. This is outside the scope of the course, but it does explain the name of the SVM method in Weka.
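
As a rough illustration (not part of the original slides), the hard-margin problem above can be written directly as a small quadratic program. The sketch below uses the cvxpy modelling library on a toy separable data set; cvxpy and the toy data are my own assumptions, and the variable names w, w0 mirror the slides.

```python
# Minimal sketch: the hard-margin SVM as a quadratic program.
# Assumes cvxpy and numpy are installed; the data is a tiny separable toy set.
import numpy as np
import cvxpy as cp

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])           # labels in {-1, +1}

w = cp.Variable(2)                              # weight vector
w0 = cp.Variable()                              # bias

# min ||w||^2  s.t.  y_i (w^T x_i + w_0) >= 1 for all i
constraints = [cp.multiply(y, X @ w + w0) >= 1]
problem = cp.Problem(cp.Minimize(cp.sum_squares(w)), constraints)
problem.solve()

print("w =", w.value, " w0 =", w0.value)
```

In practice one would use a dedicated solver such as SMO rather than a generic QP solver, but the generic formulation matches the slide exactly.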
Finding the optimum

- If you go through some advanced maths (Lagrange multipliers, etc.), it turns out that you can show something remarkable. The optimal parameters look like

    w = Σ_i α_i y_i x_i

- Furthermore, the solution is sparse. The optimal hyperplane is determined by just a few examples: call these support vectors (a small numerical check of this appears after the next slide)
- α_i = 0 for non-support patterns
- The optimization problem has no local minima (like logistic regression)
- Prediction on a new data point x:

    f(x) = sign(w^T x + w_0)
         = sign(Σ_{i=1}^n α_i y_i (x_i^T x) + w_0)

Non-separable training sets

- If the data set is not linearly separable, the optimization problem that we have given has no solution:

    min_w ||w||²
    s.t.  y_i (w^T x_i + w_0) ≥ +1 for all i

- Why?
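
Returning to the sparse expansion above: a quick way to see it in practice is to fit a linear SVM in scikit-learn and rebuild w from its support vectors. This is an illustrative sketch under the assumption that scikit-learn is available; its dual_coef_ attribute stores the products α_i y_i for the support vectors.

```python
# Sketch: recover w = sum_i alpha_i y_i x_i from a fitted linear SVM.
# Assumes scikit-learn; dual_coef_ holds alpha_i * y_i for each support vector.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=200, centers=2, random_state=1)
clf = SVC(kernel="linear", C=10.0).fit(X, y)

# Only a few alpha_i are non-zero: those of the support vectors.
print("support vectors:", len(clf.support_), "of", len(X), "training points")

# Rebuild the weight vector from the sparse expansion and compare with coef_.
w_from_alphas = clf.dual_coef_ @ clf.support_vectors_   # shape (1, n_features)
print(np.allclose(w_from_alphas, clf.coef_))            # True
```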

Non-separable training sets (continued)

- Solution: Don't require that we classify all points correctly. Allow the algorithm to choose to ignore some of the points.
- This is obviously dangerous (why not ignore all of them?), so we need to give it a penalty for doing so.

[Figure: a training set of x's and o's with the weight vector w and its margin; one x falls on the wrong side of the margin.]
Slack

- Solution: Add a "slack" variable ξ_i ≥ 0 for each training example.
- If the slack variable is high, we get to relax the constraint, but we pay a price.
- The new optimization problem is to minimize

    ||w||² + C (Σ_{i=1}^n ξ_i)^k

  subject to the constraints

    w^T x_i + w_0 ≥ +1 − ξ_i   for y_i = +1
    w^T x_i + w_0 ≤ −1 + ξ_i   for y_i = −1

- Usually we set k = 1. C is a trade-off parameter: a large C gives a large penalty to errors (a sketch of this trade-off follows after the next slide).

Think about ridge regression again

- The max margin + slack optimization problem above looks even more like ridge regression than the non-slack problem: we have one term that measures how well we fit the data and another that penalizes weight vectors with a large norm.
- So C can be viewed as a regularization parameter, like λ in ridge regression or regularized logistic regression.
- You're allowed to make this trade-off even when the data set is separable!
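
To make the role of C concrete, here is a small sketch (my own illustration, assuming scikit-learn) that fits a linear SVM for several values of C and reports the two competing terms of the objective: ||w||² and the total slack Σ_i ξ_i, where ξ_i = max(0, 1 − y_i (w^T x_i + w_0)).

```python
# Sketch: C trades off margin size (via ||w||^2) against total slack.
# Assumes scikit-learn; the blob data is a toy, slightly overlapping set.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y01 = make_blobs(n_samples=100, centers=2, cluster_std=2.0, random_state=0)
y = 2 * y01 - 1                                  # labels in {-1, +1}, as in the slides

for C in [0.01, 1.0, 100.0]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w, w0 = clf.coef_.ravel(), clf.intercept_[0]
    slack = np.maximum(0.0, 1.0 - y * (X @ w + w0))          # xi_i for every point
    print(f"C={C:>6}: ||w||^2 = {w @ w:.3f}   sum of slack = {slack.sum():.2f}")
```

A small C accepts a lot of slack in exchange for a small ||w||² (strong regularization); a large C works much harder to classify every training point correctly.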

Why you might want slack in a separable data set

[Figure: two versions of a separable data set of x's and o's in the (x1, x2) plane. Classifying every point correctly forces a narrow margin; paying a little slack ξ for one outlying point allows a hyperplane with a much larger margin.]

Non-linear SVMs

- SVMs can be made nonlinear just like any other linear algorithm we've seen, i.e., using a basis expansion (a short sketch of this route follows below)
- But in an SVM, the basis expansion is implemented in a very special way, using something called a kernel
- The reason for this is that kernels can be faster to compute with if the expanded feature space is very high dimensional (even infinite)!
- This is a fairly advanced topic mathematically, so we will just go through a high-level version
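
For reference, here is the explicit basis-expansion route mentioned above: expand the features by hand and run an ordinary linear SVM on the expanded vectors. This is my own sketch, assuming scikit-learn; the point of the next slides is that the kernel trick gets the same effect without ever building φ(x) explicitly.

```python
# Sketch: make a linear SVM nonlinear the "obvious" way, by explicitly
# expanding the inputs with a basis expansion phi(x) and training on phi(X).
# Assumes scikit-learn; PolynomialFeatures builds all monomials up to degree 2.
from sklearn.datasets import make_circles
from sklearn.preprocessing import PolynomialFeatures
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.4, noise=0.05, random_state=0)

phi = PolynomialFeatures(degree=2, include_bias=False)
X_expanded = phi.fit_transform(X)       # phi(x) = (x1, x2, x1^2, x1*x2, x2^2)

clf = SVC(kernel="linear", C=1.0).fit(X_expanded, y)
print("training accuracy:", clf.score(X_expanded, y))   # the circles become separable
```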
Kernel

- A kernel is in some sense an alternate "API" for specifying to the classifier what your expanded feature space is.
- Up to now, we have always given the classifier a new set of training vectors φ(x_i) for all i, e.g., just as a list of numbers,

    φ : R^d → R^D

- If D is large, this will be expensive; if D is infinite, this will be impossible.

Non-linear SVMs

- Transform x to φ(x)
- The linear algorithm depends only on x^T x_i. Hence the transformed algorithm depends only on φ(x)^T φ(x_i)
- Use a kernel function k(x_i, x_j) such that

    k(x_i, x_j) = φ(x_i)^T φ(x_j)

- (This is called the "kernel trick", and can be used with a wide variety of learning algorithms, not just max margin.)
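
One way to see the "API" point in code: scikit-learn's SVC will accept a precomputed Gram matrix of kernel values in place of feature vectors, so the classifier never sees φ(x) at all, only k(x_i, x_j). A minimal sketch, assuming scikit-learn; the RBF kernel here is just a stand-in example.

```python
# Sketch: the classifier only ever touches the data through k(x_i, x_j).
# We hand SVC a precomputed Gram matrix instead of feature vectors.
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

X_train, y_train = make_moons(n_samples=200, noise=0.1, random_state=0)
X_test, y_test = make_moons(n_samples=50, noise=0.1, random_state=1)

K_train = rbf_kernel(X_train, X_train, gamma=1.0)   # k(x_i, x_j) for training pairs
K_test = rbf_kernel(X_test, X_train, gamma=1.0)     # k(x, x_i) for test vs. training

clf = SVC(kernel="precomputed", C=1.0).fit(K_train, y_train)
print("test accuracy:", clf.score(K_test, y_test))
```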

Example of kernel

- Example 1: for a 2-d input space,

    φ(x) = (x1², √2 x1 x2, x2²)^T

  with

    k(x_i, x_j) = (x_i^T x_j)²

Kernels, dot products, and distance

- The (squared) Euclidean distance between two vectors can be computed using dot products:

    d(x1, x2) = (x1 − x2)^T (x1 − x2)
              = x1^T x1 − 2 x1^T x2 + x2^T x2

- Using a linear kernel k(x1, x2) = x1^T x2 we can rewrite this as

    d(x1, x2) = k(x1, x1) − 2 k(x1, x2) + k(x2, x2)

- Any kernel gives you an associated distance measure this way. Think of a kernel as an indirect way of specifying distances.
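
A quick numerical check of both identities above (my own sketch, plain NumPy): the explicit quadratic map φ reproduces the polynomial kernel (x^T z)², and the kernel-induced distance with a linear kernel reproduces the squared Euclidean distance.

```python
# Sketch: verify the two identities above numerically with plain NumPy.
import numpy as np

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel in 2-d."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

rng = np.random.default_rng(0)
x, z = rng.normal(size=2), rng.normal(size=2)

# Example 1: k(x, z) = (x^T z)^2 equals phi(x)^T phi(z).
print(np.isclose((x @ z) ** 2, phi(x) @ phi(z)))           # True

# The kernel-induced distance with the linear kernel k(x, z) = x^T z
# equals the squared Euclidean distance.
k = lambda a, b: a @ b
d_kernel = k(x, x) - 2 * k(x, z) + k(z, z)
print(np.isclose(d_kernel, np.sum((x - z) ** 2)))          # True
```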
Support Vector Machine

- A support vector machine is a kernelized maximum margin classifier.
- For max margin, remember that we had the magic property

    w = Σ_i α_i y_i x_i

- This means we would predict the label of a test example x as

    ŷ = sign[w^T x + w_0] = sign[Σ_i α_i y_i (x_i^T x) + w_0]

- Kernelizing this, we get (a numerical check follows after the next slide)

    ŷ = sign[Σ_i α_i y_i k(x_i, x) + w_0]

Prediction on new example

[Figure: the SVM prediction architecture. The input vector x is compared with each support vector x_1, ..., x_4 via the kernel k(x, x_i), e.g. k(x, x_i) = (x·x_i)^d, k(x, x_i) = exp(−||x − x_i||²/c), or k(x, x_i) = tanh(κ(x·x_i) + Θ); the comparisons are combined with weights α_i and the classification is f(x) = sgn(Σ_i α_i k(x, x_i) + b). Figure credit: Bernhard Schölkopf]
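
This prediction rule can be checked directly against a fitted model. In scikit-learn's SVC (an assumption here, not part of the course materials), dual_coef_ stores α_i y_i and intercept_ stores w_0, so the decision value is exactly Σ_i α_i y_i k(x_i, x) + w_0 over the support vectors.

```python
# Sketch: reproduce SVC's decision values from the kernelized prediction rule
#   f(x) = sign( sum_i alpha_i y_i k(x_i, x) + w_0 )
# using only the support vectors. Assumes scikit-learn; RBF kernel as example.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.1, random_state=0)
clf = SVC(kernel="rbf", gamma=1.0, C=1.0).fit(X, y)

X_new = X[:5]                                             # a few "new" points to check
K = rbf_kernel(clf.support_vectors_, X_new, gamma=1.0)    # k(x_i, x) for support vectors x_i

# dual_coef_ holds alpha_i * y_i; intercept_ holds w_0.
decision_manual = clf.dual_coef_ @ K + clf.intercept_
print(np.allclose(decision_manual.ravel(), clf.decision_function(X_new)))   # True
```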

Example of kernel (continued)

- Example 2:

    k(x_i, x_j) = exp(−||x_i − x_j||² / α²)

- In this case the dimension of φ is infinite, i.e., it can be shown that no φ that maps into a finite-dimensional space will give you this kernel.
- We can never calculate φ(x), but the algorithm only needs us to calculate k for different pairs of points:

    f(x) = sgn(Σ_{i=1}^n α_i y_i k(x_i, x) + w_0)

[Figure: a nonlinear mapping Φ from the input space to the feature space. Figure credit: Bernhard Schölkopf]

Support Vector Regression

- The support vector algorithm can also be used for regression problems.
- Instead of using squared error, the algorithm uses the ε-insensitive error

    E_ε(z) = |z| − ε   if |z| ≥ ε,
             0         otherwise.

- Again a sparse solution is obtained from a QP problem:

    f(x) = Σ_{i=1}^n β_i k(x, x_i) + w_0

Choosing φ, C

- There are theoretical results, but we will not cover them. (If you want to look them up, there are actually upper bounds on the generalization error: look for VC-dimension and structural risk minimization.)
- However, in practice cross-validation methods are commonly used.
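
The practical advice above, choosing the kernel parameters and C by cross-validation, is a one-liner in most libraries. A sketch assuming scikit-learn, where gamma plays the role of 1/α² in the RBF kernel above; the candidate grid values are arbitrary placeholders, not recommendations from the course.

```python
# Sketch: choose C and the RBF kernel width by cross-validation.
# Assumes scikit-learn; the candidate values below are arbitrary examples.
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)

param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1, 10]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)

print("best parameters:", search.best_params_)
print("cross-validated accuracy:", search.best_score_)
```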
Example application

- US Postal Service digit data (7291 examples, 16 × 16 images). Three SVMs using polynomial, RBF and MLP-type kernels were used (see Schölkopf and Smola, Learning with Kernels, 2002, for details)
- The three SVMs use almost the same (≈ 90% overlapping) small sets of SVs (about 4% of the data base)
- All systems perform well (≈ 4% error)
- Many other applications, e.g.
  - Text categorization
  - Face detection
  - DNA analysis

Comparison with linear and logistic regression

- The underlying basic idea of linear prediction is the same, but the error functions differ
- Logistic regression (non-sparse) vs SVM ("hinge loss", sparse solution)
- Linear regression (squared error) vs the ε-insensitive error
- Linear regression and logistic regression can be "kernelized" too
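
For a rough sense of this kind of experiment (not a reproduction of the USPS results), here is a sketch on scikit-learn's much smaller 8×8 digits dataset, comparing a polynomial and an RBF kernel and reporting what fraction of the training points end up as support vectors.

```python
# Sketch: a small digit-classification experiment in the spirit of the USPS
# example, using scikit-learn's 8x8 digits data (not the USPS set).
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for kernel in ["poly", "rbf"]:
    clf = SVC(kernel=kernel, C=10.0).fit(X_train, y_train)
    frac_sv = len(clf.support_) / len(X_train)
    print(f"{kernel}: test error = {1 - clf.score(X_test, y_test):.3f}, "
          f"support vectors = {frac_sv:.1%} of the training set")
```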

SVM summary

- SVMs are the combination of max margin and the kernel trick
- Learn linear decision boundaries (like logistic regression, perceptrons)
- Pick the hyperplane that maximizes the margin
- Use slack variables to deal with non-separable data
- The optimal hyperplane can be written in terms of support patterns
- Transform to a higher-dimensional space using kernel functions
- Good empirical results on many problems
- Appears to avoid overfitting in high-dimensional spaces (cf. regularization)
- Sorry for all the maths!
