Finding the optimum

I If you go through some advanced maths (Lagrange multipliers, etc.), it turns out that you can show something remarkable. The optimal parameters look like

      w = Σ_i αi yi xi

I Furthermore, the solution is sparse: the optimal hyperplane is determined by just a few examples. We call these support vectors.
I αi = 0 for non-support patterns.
I The optimization problem has no local minima (like logistic regression).
I Prediction on a new data point x (a small sketch follows the next slide):

      f(x) = sign(w^T x + w0) = sign(Σ_{i=1}^n αi yi (xi^T x) + w0)

Non-separable training sets

I If the data set is not linearly separable, the optimization problem that we have given has no solution:

      min_w ||w||²   s.t.  yi (w^T xi + w0) ≥ +1 for all i

I Why?
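A minimal sketch of the dual-form prediction rule from the "Finding the optimum" slide (illustrative, not from the slides; the αi, the support vectors, and w0 are assumed to have already been found by the optimizer):

    import numpy as np

    def svm_predict(x, support_x, support_y, alphas, w0):
        """Predict +1/-1 for a point x using only the support vectors.

        support_x: (m, d) array of support vectors
        support_y: (m,) array of their labels (+1 or -1)
        alphas:    (m,) array of the corresponding multipliers alpha_i
        w0:        bias term
        """
        # f(x) = sign( sum_i alpha_i * y_i * (x_i . x) + w0 )
        score = np.sum(alphas * support_y * (support_x @ x)) + w0
        return np.sign(score)

Because αi = 0 for non-support patterns, only the support vectors need to be stored and summed over.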
      min_w ||w||²   s.t.  yi (w^T xi + w0) ≥ +1 for all i

I Why?

[Figure: data points (o's and x's), the separating hyperplane w, and the margin.]

I Solution: Don't require that we classify all points correctly. Allow the algorithm to choose to ignore some of the points.
I This is obviously dangerous (why not ignore all of them?), so we need to give it a penalty for doing so.
Slack

I Solution: Add a "slack" variable ξi ≥ 0 for each training example.
I If the slack variable is high, we get to relax the constraint, but we pay a price:

      ||w||² + C (Σ_{i=1}^n ξi)^k

Think about ridge regression again

I Our max margin + slack optimization problem is:

      min   ||w||² + C (Σ_{i=1}^n ξi)^k
      s.t.  w^T xi + w0 ≥ +1 − ξi   for yi = +1
            w^T xi + w0 ≤ −1 + ξi   for yi = −1
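A rough sketch of this objective in code (illustrative, not from the slides), using the fact that for k = 1 the smallest feasible slack for a fixed (w, w0) is ξi = max(0, 1 − yi (w^T xi + w0)):

    import numpy as np

    def soft_margin_objective(w, w0, X, y, C):
        """||w||^2 + C * sum_i xi_i, with each slack set to its smallest
        feasible value xi_i = max(0, 1 - y_i (w . x_i + w0))."""
        margins = y * (X @ w + w0)                 # y_i (w . x_i + w0) for every example
        slacks = np.maximum(0.0, 1.0 - margins)    # smallest slack satisfying the constraints
        return np.dot(w, w) + C * np.sum(slacks)

Points that are classified correctly with margin at least 1 contribute zero slack; misclassified or margin-violating points are what the C term penalizes.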
Why you might want slack in a separable data set

[Figure: two scatter plots of o's and x's in the (x1, x2) plane, each with a separating hyperplane w; one point is assigned slack ξ.]

Non-linear SVMs

I SVMs can be made nonlinear just like any other linear algorithm we've seen, i.e., using a basis expansion (see the sketch after this list).
I But in an SVM, the basis expansion is implemented in a very special way, using something called a kernel.
I The reason for this is that kernels can be faster to compute with if the expanded feature space is very high dimensional (even infinite)!
I This is a fairly advanced topic mathematically, so we will just go through a high-level version.
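A minimal sketch of the basis-expansion idea from the first bullet (illustrative names and weights; this is the explicit expansion, not the kernel shortcut introduced next):

    import numpy as np

    def phi(x):
        """Quadratic basis expansion of a 2-D point: R^2 -> R^5 (illustrative choice)."""
        x1, x2 = x
        return np.array([x1, x2, x1 * x1, x2 * x2, x1 * x2])

    # A linear decision function in the expanded space is nonlinear in the original space:
    #   f(x) = sign(w_phi . phi(x) + w0)
    w_phi = np.array([0.0, 0.0, 1.0, 1.0, 0.0])   # example weights: x1^2 + x2^2
    w0 = -1.0
    x = np.array([0.5, 0.5])
    print(np.sign(w_phi @ phi(x) + w0))           # classifies points by distance from the origin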
Kernel

I A kernel is in some sense an alternate "API" for specifying to the classifier what your expanded feature space is.
I Up to now, we have always given the classifier a new set of training vectors φ(xi) for all i, e.g., just as a list of numbers, with φ : R^d → R^D.
I If D is large, this will be expensive; if D is infinite, this will be impossible.

Non-linear SVMs

I Transform x to φ(x).
I The linear algorithm depends only on x^T xi. Hence the transformed algorithm depends only on φ(x)^T φ(xi).
I Use a kernel function k(xi, xj) such that (see the numerical check below)

      k(xi, xj) = φ(xi)^T φ(xj)

I (This is called the "kernel trick", and can be used with a wide variety of learning algorithms, not just max margin.)
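A small numerical check of the kernel trick (illustrative, not from the slides): for the quadratic kernel k(x, z) = (x^T z)² there is an explicit feature map φ, and evaluating k directly gives the same inner product without ever forming φ(x):

    import numpy as np

    def phi(x):
        """Explicit feature map for k(x, z) = (x . z)^2 on R^2:
        phi(x) = (x1^2, x2^2, sqrt(2) x1 x2), so phi(x) . phi(z) = (x . z)^2."""
        x1, x2 = x
        return np.array([x1 * x1, x2 * x2, np.sqrt(2) * x1 * x2])

    def k(x, z):
        # The kernel computes the same inner product without forming phi
        return (x @ z) ** 2

    x = np.array([1.0, 2.0])
    z = np.array([3.0, -1.0])
    print(phi(x) @ phi(z), k(x, z))   # both print 1.0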
Applications

Support Vector Machine

[Figure: SVM architecture — the input vector x is compared to the support vectors x1, …, x4 via a kernel k(x, xi), e.g. k(x, xi) = (x·xi)^d, k(x, xi) = exp(−||x − xi||² / c), or k(x, xi) = tanh(κ(x·xi) + θ); the kernel outputs are combined with weights λ1, …, λ4 to form the output. Figure credit: Bernhard Schoelkopf.]

Prediction on new example

I Recall that

      w = Σ_i αi yi xi

I This means we would predict the label of a test example x as

      ŷ = sign[w^T x + w0] = sign[Σ_i αi yi (xi^T x) + w0]

I Kernelizing this, we get

      ŷ = sign[Σ_i αi yi k(xi, x) + w0]
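The three example kernels listed in the figure, written out as code (a sketch; the parameters d, c, κ, θ are free choices, not values from the slides):

    import numpy as np

    def poly_kernel(x, xi, d=3):
        """Polynomial kernel k(x, xi) = (x . xi)^d."""
        return (x @ xi) ** d

    def gaussian_kernel(x, xi, c=1.0):
        """Gaussian / RBF kernel k(x, xi) = exp(-||x - xi||^2 / c)."""
        diff = x - xi
        return np.exp(-(diff @ diff) / c)

    def sigmoid_kernel(x, xi, kappa=1.0, theta=0.0):
        """Sigmoid kernel k(x, xi) = tanh(kappa * (x . xi) + theta)."""
        return np.tanh(kappa * (x @ xi) + theta)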
I In this case the dimension of φ is infinite, i.e., it can be shown that no φ that maps into a finite-dimensional space will give you this kernel.
I We can never calculate φ(x), but the algorithm only needs us to calculate k for different pairs of points:

      f(x) = sgn(Σ_{i=1}^n αi yi k(xi, x) + w0)

I To test a new input x, the decision value is a weighted sum of kernel evaluations:

      f(x) = Σ_{i=1}^n βi k(x, xi) + w0
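One way to see the infinite-dimensionality claim above, assuming "this case" refers to the Gaussian kernel k(x, z) = exp(−||x − z||²/c) from the figure earlier (a sketch, not an argument from the slides):

      e^{-\|x - z\|^2 / c} = e^{-\|x\|^2 / c} \; e^{-\|z\|^2 / c} \; \sum_{n=0}^{\infty} \frac{(2\, x^\top z)^n}{c^n \, n!}

Each term (x^T z)^n is a polynomial kernel of degree n, so a feature map φ with k(x, z) = φ(x)^T φ(z) would need coordinates for monomials of every degree, i.e. infinitely many; yet k(x, z) itself is a single exponential evaluation.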
Example application

Comparison with linear and logistic regression
SVM summary