
IAML: Support Vector Machines, Part II

Charles Sutton and Victor Lavrenko
School of Informatics
Semester 1

Last Time

- Max margin trick
- Geometry of the margin and how to compute it
- Finding the max margin hyperplane using a constrained optimization problem
- Max margin = Min norm

This Time

- Non-separable data
- The kernel trick

The SVM optimization problem

- Last time: the max margin weights can be computed by solving a constrained optimization problem (a small sketch of this problem as a quadratic program follows below):

    min_w ||w||²
    s.t.  y_i (w^T x_i + w_0) ≥ +1 for all i

- Many algorithms have been proposed to solve this. One of the earliest efficient algorithms is called SMO [Platt, 1998]. This is outside the scope of the course, but it does explain the name of the SVM method in Weka.
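
As a rough illustration (not part of the original slides), the hard-margin problem above can be written directly as a small quadratic program. The sketch below uses the cvxpy modelling library on a toy separable data set; cvxpy and the toy data are my own assumptions, and the variable names w, w0 mirror the slides.

```python
# Minimal sketch: the hard-margin SVM as a quadratic program.
# Assumes cvxpy and numpy are installed; the data is a tiny separable toy set.
import numpy as np
import cvxpy as cp

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])           # labels in {-1, +1}

w = cp.Variable(2)                              # weight vector
w0 = cp.Variable()                              # bias

# min ||w||^2  s.t.  y_i (w^T x_i + w_0) >= 1 for all i
constraints = [cp.multiply(y, X @ w + w0) >= 1]
problem = cp.Problem(cp.Minimize(cp.sum_squares(w)), constraints)
problem.solve()

print("w =", w.value, " w0 =", w0.value)
```

In practice one would use a dedicated solver such as SMO rather than a generic QP solver, but the generic formulation matches the slide exactly.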
Finding the optimum

- If you go through some advanced maths (Lagrange multipliers, etc.), it turns out that you can show something remarkable. The optimal parameters look like

    w = Σ_i α_i y_i x_i

- Furthermore, the solution is sparse. The optimal hyperplane is determined by just a few examples: call these support vectors (a small numerical check of this appears after the next slide)
- α_i = 0 for non-support patterns
- The optimization problem has no local minima (like logistic regression)
- Prediction on a new data point x:

    f(x) = sign(w^T x + w_0)
         = sign(Σ_{i=1}^n α_i y_i (x_i^T x) + w_0)

Non-separable training sets

- If the data set is not linearly separable, the optimization problem that we have given has no solution:

    min_w ||w||²
    s.t.  y_i (w^T x_i + w_0) ≥ +1 for all i

- Why?
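
Returning to the sparse expansion above: a quick way to see it in practice is to fit a linear SVM in scikit-learn and rebuild w from its support vectors. This is an illustrative sketch under the assumption that scikit-learn is available; its dual_coef_ attribute stores the products α_i y_i for the support vectors.

```python
# Sketch: recover w = sum_i alpha_i y_i x_i from a fitted linear SVM.
# Assumes scikit-learn; dual_coef_ holds alpha_i * y_i for each support vector.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=200, centers=2, random_state=1)
clf = SVC(kernel="linear", C=10.0).fit(X, y)

# Only a few alpha_i are non-zero: those of the support vectors.
print("support vectors:", len(clf.support_), "of", len(X), "training points")

# Rebuild the weight vector from the sparse expansion and compare with coef_.
w_from_alphas = clf.dual_coef_ @ clf.support_vectors_   # shape (1, n_features)
print(np.allclose(w_from_alphas, clf.coef_))            # True
```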

Non-separable training sets (continued)

- Solution: Don't require that we classify all points correctly. Allow the algorithm to choose to ignore some of the points.
- This is obviously dangerous (why not ignore all of them?), so we need to give it a penalty for doing so.

[Figure: a training set of x's and o's with the weight vector w and its margin; one x falls on the wrong side of the margin.]
Slack

- Solution: Add a "slack" variable ξ_i ≥ 0 for each training example.
- If the slack variable is high, we get to relax the constraint, but we pay a price.
- The new optimization problem is to minimize

    ||w||² + C (Σ_{i=1}^n ξ_i)^k

  subject to the constraints

    w^T x_i + w_0 ≥ +1 − ξ_i   for y_i = +1
    w^T x_i + w_0 ≤ −1 + ξ_i   for y_i = −1

- Usually we set k = 1. C is a trade-off parameter: a large C gives a large penalty to errors (a sketch of this trade-off follows after the next slide).

Think about ridge regression again

- The max margin + slack optimization problem above looks even more like ridge regression than the non-slack problem: we have one term that measures how well we fit the data and another that penalizes weight vectors with a large norm.
- So C can be viewed as a regularization parameter, like λ in ridge regression or regularized logistic regression.
- You're allowed to make this trade-off even when the data set is separable!
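
To make the role of C concrete, here is a small sketch (my own illustration, assuming scikit-learn) that fits a linear SVM for several values of C and reports the two competing terms of the objective: ||w||² and the total slack Σ_i ξ_i, where ξ_i = max(0, 1 − y_i (w^T x_i + w_0)).

```python
# Sketch: C trades off margin size (via ||w||^2) against total slack.
# Assumes scikit-learn; the blob data is a toy, slightly overlapping set.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y01 = make_blobs(n_samples=100, centers=2, cluster_std=2.0, random_state=0)
y = 2 * y01 - 1                                  # labels in {-1, +1}, as in the slides

for C in [0.01, 1.0, 100.0]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w, w0 = clf.coef_.ravel(), clf.intercept_[0]
    slack = np.maximum(0.0, 1.0 - y * (X @ w + w0))          # xi_i for every point
    print(f"C={C:>6}: ||w||^2 = {w @ w:.3f}   sum of slack = {slack.sum():.2f}")
```

A small C accepts a lot of slack in exchange for a small ||w||² (strong regularization); a large C works much harder to classify every training point correctly.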

Why you might want slack in a separable data set

[Figure: two versions of a separable data set of x's and o's in the (x1, x2) plane. Classifying every point correctly forces a narrow margin; paying a little slack ξ for one outlying point allows a hyperplane with a much larger margin.]

Non-linear SVMs

- SVMs can be made nonlinear just like any other linear algorithm we've seen, i.e., using a basis expansion (a short sketch of this route follows below)
- But in an SVM, the basis expansion is implemented in a very special way, using something called a kernel
- The reason for this is that kernels can be faster to compute with if the expanded feature space is very high dimensional (even infinite)!
- This is a fairly advanced topic mathematically, so we will just go through a high-level version
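
For reference, here is the explicit basis-expansion route mentioned above: expand the features by hand and run an ordinary linear SVM on the expanded vectors. This is my own sketch, assuming scikit-learn; the point of the next slides is that the kernel trick gets the same effect without ever building φ(x) explicitly.

```python
# Sketch: make a linear SVM nonlinear the "obvious" way, by explicitly
# expanding the inputs with a basis expansion phi(x) and training on phi(X).
# Assumes scikit-learn; PolynomialFeatures builds all monomials up to degree 2.
from sklearn.datasets import make_circles
from sklearn.preprocessing import PolynomialFeatures
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.4, noise=0.05, random_state=0)

phi = PolynomialFeatures(degree=2, include_bias=False)
X_expanded = phi.fit_transform(X)       # phi(x) = (x1, x2, x1^2, x1*x2, x2^2)

clf = SVC(kernel="linear", C=1.0).fit(X_expanded, y)
print("training accuracy:", clf.score(X_expanded, y))   # the circles become separable
```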
Kernel

- A kernel is in some sense an alternate "API" for specifying to the classifier what your expanded feature space is.
- Up to now, we have always given the classifier a new set of training vectors φ(x_i) for all i, e.g., just as a list of numbers,

    φ : R^d → R^D

- If D is large, this will be expensive; if D is infinite, this will be impossible.

Non-linear SVMs

- Transform x to φ(x)
- The linear algorithm depends only on x^T x_i. Hence the transformed algorithm depends only on φ(x)^T φ(x_i)
- Use a kernel function k(x_i, x_j) such that

    k(x_i, x_j) = φ(x_i)^T φ(x_j)

- (This is called the "kernel trick", and can be used with a wide variety of learning algorithms, not just max margin.)
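
One way to see the "API" point in code: scikit-learn's SVC will accept a precomputed Gram matrix of kernel values in place of feature vectors, so the classifier never sees φ(x) at all, only k(x_i, x_j). A minimal sketch, assuming scikit-learn; the RBF kernel here is just a stand-in example.

```python
# Sketch: the classifier only ever touches the data through k(x_i, x_j).
# We hand SVC a precomputed Gram matrix instead of feature vectors.
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

X_train, y_train = make_moons(n_samples=200, noise=0.1, random_state=0)
X_test, y_test = make_moons(n_samples=50, noise=0.1, random_state=1)

K_train = rbf_kernel(X_train, X_train, gamma=1.0)   # k(x_i, x_j) for training pairs
K_test = rbf_kernel(X_test, X_train, gamma=1.0)     # k(x, x_i) for test vs. training

clf = SVC(kernel="precomputed", C=1.0).fit(K_train, y_train)
print("test accuracy:", clf.score(K_test, y_test))
```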

Example of kernel

- Example 1: for a 2-d input space,

    φ(x) = (x1², √2 x1 x2, x2²)^T

  with

    k(x_i, x_j) = (x_i^T x_j)²

Kernels, dot products, and distance

- The (squared) Euclidean distance between two vectors can be computed using dot products:

    d(x1, x2) = (x1 − x2)^T (x1 − x2)
              = x1^T x1 − 2 x1^T x2 + x2^T x2

- Using a linear kernel k(x1, x2) = x1^T x2 we can rewrite this as

    d(x1, x2) = k(x1, x1) − 2 k(x1, x2) + k(x2, x2)

- Any kernel gives you an associated distance measure this way. Think of a kernel as an indirect way of specifying distances.
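
A quick numerical check of both identities above (my own sketch, plain NumPy): the explicit quadratic map φ reproduces the polynomial kernel (x^T z)², and the kernel-induced distance with a linear kernel reproduces the squared Euclidean distance.

```python
# Sketch: verify the two identities above numerically with plain NumPy.
import numpy as np

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel in 2-d."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

rng = np.random.default_rng(0)
x, z = rng.normal(size=2), rng.normal(size=2)

# Example 1: k(x, z) = (x^T z)^2 equals phi(x)^T phi(z).
print(np.isclose((x @ z) ** 2, phi(x) @ phi(z)))           # True

# The kernel-induced distance with the linear kernel k(x, z) = x^T z
# equals the squared Euclidean distance.
k = lambda a, b: a @ b
d_kernel = k(x, x) - 2 * k(x, z) + k(z, z)
print(np.isclose(d_kernel, np.sum((x - z) ** 2)))          # True
```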
Support Vector Machine

- A support vector machine is a kernelized maximum margin classifier.
- For max margin, remember that we had the magic property

    w = Σ_i α_i y_i x_i

- This means we would predict the label of a test example x as

    ŷ = sign[w^T x + w_0] = sign[Σ_i α_i y_i (x_i^T x) + w_0]

- Kernelizing this, we get (a numerical check follows after the next slide)

    ŷ = sign[Σ_i α_i y_i k(x_i, x) + w_0]

Prediction on new example

[Figure: the SVM prediction architecture. The input vector x is compared with each support vector x_1, ..., x_4 via the kernel k(x, x_i), e.g. k(x, x_i) = (x·x_i)^d, k(x, x_i) = exp(−||x − x_i||²/c), or k(x, x_i) = tanh(κ(x·x_i) + Θ); the comparisons are combined with weights α_i and the classification is f(x) = sgn(Σ_i α_i k(x, x_i) + b). Figure credit: Bernhard Schölkopf]
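
This prediction rule can be checked directly against a fitted model. In scikit-learn's SVC (an assumption here, not part of the course materials), dual_coef_ stores α_i y_i and intercept_ stores w_0, so the decision value is exactly Σ_i α_i y_i k(x_i, x) + w_0 over the support vectors.

```python
# Sketch: reproduce SVC's decision values from the kernelized prediction rule
#   f(x) = sign( sum_i alpha_i y_i k(x_i, x) + w_0 )
# using only the support vectors. Assumes scikit-learn; RBF kernel as example.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.1, random_state=0)
clf = SVC(kernel="rbf", gamma=1.0, C=1.0).fit(X, y)

X_new = X[:5]                                             # a few "new" points to check
K = rbf_kernel(clf.support_vectors_, X_new, gamma=1.0)    # k(x_i, x) for support vectors x_i

# dual_coef_ holds alpha_i * y_i; intercept_ holds w_0.
decision_manual = clf.dual_coef_ @ K + clf.intercept_
print(np.allclose(decision_manual.ravel(), clf.decision_function(X_new)))   # True
```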

Example of kernel (continued)

- Example 2:

    k(x_i, x_j) = exp(−||x_i − x_j||² / α²)

- In this case the dimension of φ is infinite, i.e., it can be shown that no φ that maps into a finite-dimensional space will give you this kernel.
- We can never calculate φ(x), but the algorithm only needs us to calculate k for different pairs of points:

    f(x) = sgn(Σ_{i=1}^n α_i y_i k(x_i, x) + w_0)

[Figure: a nonlinear mapping Φ from the input space to the feature space. Figure credit: Bernhard Schölkopf]

Support Vector Regression

- The support vector algorithm can also be used for regression problems.
- Instead of using squared error, the algorithm uses the ε-insensitive error

    E_ε(z) = |z| − ε   if |z| ≥ ε,
             0         otherwise.

- Again a sparse solution is obtained from a QP problem:

    f(x) = Σ_{i=1}^n β_i k(x, x_i) + w_0

Choosing φ, C

- There are theoretical results, but we will not cover them. (If you want to look them up, there are actually upper bounds on the generalization error: look for VC-dimension and structural risk minimization.)
- However, in practice cross-validation methods are commonly used.
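
The practical advice above, choosing the kernel parameters and C by cross-validation, is a one-liner in most libraries. A sketch assuming scikit-learn, where gamma plays the role of 1/α² in the RBF kernel above; the candidate grid values are arbitrary placeholders, not recommendations from the course.

```python
# Sketch: choose C and the RBF kernel width by cross-validation.
# Assumes scikit-learn; the candidate values below are arbitrary examples.
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)

param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1, 10]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)

print("best parameters:", search.best_params_)
print("cross-validated accuracy:", search.best_score_)
```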
Example application

- US Postal Service digit data (7291 examples, 16 × 16 images). Three SVMs using polynomial, RBF and MLP-type kernels were used (see Schölkopf and Smola, Learning with Kernels, 2002, for details)
- The three SVMs use almost the same (≈ 90% overlapping) small sets of SVs (about 4% of the data base)
- All systems perform well (≈ 4% error)
- Many other applications, e.g.
  - Text categorization
  - Face detection
  - DNA analysis

Comparison with linear and logistic regression

- The underlying basic idea of linear prediction is the same, but the error functions differ
- Logistic regression (non-sparse) vs SVM ("hinge loss", sparse solution)
- Linear regression (squared error) vs the ε-insensitive error
- Linear regression and logistic regression can be "kernelized" too
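
For a rough sense of this kind of experiment (not a reproduction of the USPS results), here is a sketch on scikit-learn's much smaller 8×8 digits dataset, comparing a polynomial and an RBF kernel and reporting what fraction of the training points end up as support vectors.

```python
# Sketch: a small digit-classification experiment in the spirit of the USPS
# example, using scikit-learn's 8x8 digits data (not the USPS set).
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for kernel in ["poly", "rbf"]:
    clf = SVC(kernel=kernel, C=10.0).fit(X_train, y_train)
    frac_sv = len(clf.support_) / len(X_train)
    print(f"{kernel}: test error = {1 - clf.score(X_test, y_test):.3f}, "
          f"support vectors = {frac_sv:.1%} of the training set")
```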

SVM summary

- SVMs are the combination of max margin and the kernel trick
- Learn linear decision boundaries (like logistic regression, perceptrons)
- Pick the hyperplane that maximizes the margin
- Use slack variables to deal with non-separable data
- The optimal hyperplane can be written in terms of support patterns
- Transform to a higher-dimensional space using kernel functions
- Good empirical results on many problems
- Appears to avoid overfitting in high-dimensional spaces (cf. regularization)
- Sorry for all the maths!
