Kernel Methods & Support Vector Machines
Dr. Robi Polikar
Lecture 7-8, Week 8
Reading: Chapter 13 in Alpaydin; Chapter 14 in Murphy
This Week in PR
• A new approach to linear classifiers
• Remembering the geometry
• Maximum margin hyperplane
• Constrained optimization
• The Lagrangian dual problem
• Karush-Kuhn-Tucker conditions
• Support vector machines
• SVMs with noisy data
• Slack variables
• SVMs as nonlinear classifiers (Duda, Hart & Stork, Pattern Classification)
Original graphics created / generated by Robi Polikar. All Rights Reserved © 2001-2013. May be used with permission and citation.
Computational Intelligence and Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
Intuitive Idea
Given many boundaries that can all solve a given linearly separable problem, which one is the best, i.e., most likely to result in the smallest generalization error?
Intuitively, our answer is the one that provides the largest separation between the classes: this leaves more room for noisy samples to move around.
The SVM is a classifier that can find such an optimal hyperplane, the one that provides the maximum margin between the nearest instances of opposing classes.
[Figure: many viable decision boundaries vs. the optimal decision boundary; the support vectors lie on the maximum margin. Axes: Feature 1 vs. Feature 2.]
Maximizing Margins
[Figure-only slide; graphic credit: TK]
Recall: Geometry of a Decision Boundary
Any point x can be written as x = x_p + r·(w/‖w‖), where x_p is its projection onto the hyperplane g(x) = wᵀx + w₀ = 0, and r is its signed distance to the hyperplane. Then
    g(x) = wᵀx + w₀ = wᵀ(x_p + r·w/‖w‖) + w₀ = g(x_p) + r·(wᵀw)/‖w‖ = r‖w‖
since g(x_p) = 0. Therefore
    r = g(x)/‖w‖
Let's take a look at the value of the function g(x) at the origin, x = 0:
    g(x)|_{x=0} = wᵀx + w₀ = w₀  ⇒  r₀ = w₀/‖w‖
w₀ determines the location of the hyperplane; w determines its orientation.
[Figure: the hyperplane g(x) = 0, a point x at distance r from it, and the origin at distance r₀.]
Formalizing the Problem
Recall that the separating hyperplane (decision boundary) is given by (using b instead of w₀, as commonly done in the SVM literature)
    g(x) = wᵀx + w₀ = wᵀx + b
Given two-class training data xᵢ with labels yᵢ = +1 (class ω₁) and yᵢ = −1 (class ω₂):
    wᵀx + b ≥ 1 ⇒ x ∈ ω₁
    wᵀx + b = 0 ⇒ x on the boundary
    wᵀx + b ≤ −1 ⇒ x ∈ ω₂
    yᵢ(wᵀxᵢ + b) ≥ 1, ∀i ⇒ all x correctly classified
The distance of a point x on the margin (where g(x) = 1) to the hyperplane g(x) = 0 is
    r = g(x)/‖w‖ = (wᵀx + b)/‖w‖  ⇒  r* = 1/‖w‖
Then the length of the margin is m = 2/‖w‖.
[Figure: the two classes, the hyperplanes wᵀx + b = 0 and wᵀx + b = ±1, and the margin m. Axes: Feature 1 vs. Feature 2.]
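As a quick numeric sanity check of these formulas (the w, b, and point below are made up for illustration, not from the slides):

```python
import numpy as np

# A hypothetical hyperplane w^T x + b = 0 in 2-D (illustrative values).
w = np.array([3.0, 4.0])   # ||w|| = 5
b = -2.0

def g(x):
    return w @ x + b

# Signed distance of a point to the hyperplane: r = g(x) / ||w||
x = np.array([2.0, 1.0])          # g(x) = 6 + 4 - 2 = 8
r = g(x) / np.linalg.norm(w)      # 8 / 5 = 1.6

# A point on the margin g(x) = 1 lies at distance r* = 1/||w||,
# so the full margin width is m = 2/||w||.
m = 2.0 / np.linalg.norm(w)       # 2 / 5 = 0.4
print(r, m)
```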
Constrained Optimization
The best hyperplane, the one that provides the maximum separation, is therefore the one that maximizes m = 2/‖w‖.
However, by arbitrarily choosing w we can make its length as small as we want; in fact, w = 0 would provide an infinite margin. This is clearly not an interesting, or even viable, solution. There has to be a constraint on this problem.
The constraint comes from the correct classification of all data points, which requires that
    yᵢ(wᵀxᵢ + b) ≥ 1, ∀i
Therefore, the problem of finding the optimal decision boundary is converted into the following constrained optimization problem:
    min ½‖w‖²
    subject to yᵢ(wᵀxᵢ + b) ≥ 1, ∀i
Note that maximizing m = 2/‖w‖ is equivalent to minimizing ‖w‖²/2.
We take the square of the length of the vector, which (along with the ½ factor) does not change the solution, but makes the solution process easier.
Among other things, since the function to be minimized is quadratic, it has only a single (global) minimum.
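For a concrete feel, this primal problem can be handed directly to a generic constrained optimizer. A minimal sketch using scipy's SLSQP on a tiny made-up dataset (the data are illustrative, not from the slides):

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data (hypothetical, for illustration only).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

# Variables: theta = [w1, w2, b]; objective is (1/2)||w||^2.
def objective(theta):
    w = theta[:2]
    return 0.5 * w @ w

# One inequality constraint per sample: y_i (w^T x_i + b) - 1 >= 0
constraints = [{'type': 'ineq',
                'fun': (lambda th, i=i: y[i] * (th[:2] @ X[i] + th[2]) - 1.0)}
               for i in range(len(X))]

res = minimize(objective, x0=np.zeros(3), method='SLSQP',
               constraints=constraints)
w, b = res.x[:2], res.x[2]

# Every training point must end up on the correct side of its margin.
margins = y * (X @ w + b)
print(w, b, margins.min())
```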
A Primer on Constrained Optimization
Recall from our previous discussion that constrained optimization can be solved through Lagrange multipliers:
If we wish to find the extremum of a function f(x) subject to some constraint g(x) = 0, the extremum point x* can be found as follows:
1. Form the Lagrange function to convert the problem to an unconstrained one, where α, whose value needs to be determined, is the Lagrange multiplier:
    L(x, α) = f(x) + α·g(x)
2. Set the derivative of the Lagrangian to zero and enforce the constraint:
    ∂L(x, α)/∂x |_{x=x*} = 0,  g(x*) = 0
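A small worked example (not from the slides): minimize f(x) = x₁² + x₂² subject to x₁ + x₂ − 1 = 0. The Lagrangian conditions can be solved by hand and then checked numerically:

```python
import numpy as np

# Worked example: minimize f(x) = x1^2 + x2^2  subject to  x1 + x2 - 1 = 0.
# L(x, a) = x1^2 + x2^2 + a*(x1 + x2 - 1)
# dL/dx1 = 2*x1 + a = 0 and dL/dx2 = 2*x2 + a = 0  =>  x1 = x2 = -a/2
# Constraint: x1 + x2 = -a = 1  =>  a = -1, so x* = (1/2, 1/2).
a = -1.0
x = np.array([-a / 2, -a / 2])

# Verify stationarity and feasibility:
grad_L = 2 * x + a                  # gradient of L w.r.t. x, should be [0, 0]
constraint = x.sum() - 1.0          # should be 0
print(x, grad_L, constraint)
```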
Constrained Optimization with Multiple Constraints
If we have many constraints, such as gᵢ(x) = 0, i = 1,…,n, then we need a Lagrange multiplier αᵢ for each constraint, and the constraints appear as a summation in the Lagrangian:
    L(x, αᵢ) = f(x) + Σᵢ₌₁ⁿ αᵢ·gᵢ(x)
    ∂L(x, αᵢ)/∂x |_{x=x*} = ∂/∂x [f(x) + Σᵢ₌₁ⁿ αᵢ·gᵢ(x)] |_{x=x*} = 0
    gᵢ(x*) = 0,  i = 1,…,n
Inequality Constraints: Karush-Kuhn-Tucker Conditions
If we have several equality and inequality constraints, in the following form
    min f(x)
    subject to gᵢ(x) ≤ 0, i = 1,…,n
    and hᵢ(x) = 0, i = 1,…,m
the necessary conditions, known as the Karush-Kuhn-Tucker (KKT) conditions (for convex problems they are also sufficient), for (x*, α*, β*) to be the solution are:
    ∂L(x*, α*, β*)/∂x = 0,  ∂L(x*, α*, β*)/∂β = 0    (gradient of the Lagrangian with respect to the parameters to be selected is zero)
    αᵢ*·gᵢ(x*) = 0,  αᵢ* ≥ 0    (Lagrange multipliers must be nonnegative, and each multiplier times its inequality constraint must be zero: complementary slackness)
    hᵢ(x*) = 0,  gᵢ(x*) ≤ 0,  i = 1,…,n    (original equality and inequality constraints must be satisfied: primal feasibility)
where the generalized Lagrangian is now given as
    L(x, αᵢ, βᵢ) = f(x) + Σᵢ₌₁ⁿ αᵢ·gᵢ(x) + Σᵢ₌₁ᵐ βᵢ·hᵢ(x)
and αᵢ, βᵢ are the Lagrange multipliers, one for each constraint.
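A tiny worked KKT example (made up for illustration, not from the slides): minimize f(x) = x² subject to the single inequality g(x) = 1 − x ≤ 0. Solving the conditions by hand forces the constraint to be active, and the result can be checked directly:

```python
# KKT worked example:
# minimize f(x) = x^2  subject to  g(x) = 1 - x <= 0  (i.e., x >= 1)
# L(x, a) = x^2 + a*(1 - x)
# Stationarity:            dL/dx = 2x - a = 0
# Complementary slackness: a * (1 - x) = 0
# Trying a > 0 forces the constraint active: x = 1, and then a = 2.
x_star, a_star = 1.0, 2.0

print(2 * x_star - a_star,        # stationarity residual, should be 0
      a_star * (1 - x_star),      # complementary slackness, should be 0
      1 - x_star <= 0,            # primal feasibility
      a_star >= 0)                # dual feasibility
```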
Back to Our Own Problem
We simply have one set of inequality constraints, so the KKT conditions become as follows for our problem:
    min ½‖w‖²   (note ‖w‖² = wᵀw)
    subject to yᵢ(wᵀxᵢ + b) ≥ 1, ∀i
The constraints can be rewritten in standard form:
    yᵢ(wᵀxᵢ + b) ≥ 1 ⇔ gᵢ(x) = 1 − yᵢ(wᵀxᵢ + b) ≤ 0
The Lagrangian is then
    L(w, b, α) = ½wᵀw + Σᵢ₌₁ⁿ αᵢ[1 − yᵢ(wᵀxᵢ + b)]
               = ½wᵀw + Σᵢ₌₁ⁿ αᵢ − Σᵢ₌₁ⁿ αᵢyᵢ(wᵀxᵢ + b)
Taking the derivatives, setting them to zero, and solving for all parameters:
    ∂L(w, b, α)/∂w = 0 ⇒ w = Σᵢ₌₁ⁿ αᵢyᵢxᵢ = Σ_{xᵢ∈S} αᵢyᵢxᵢ
    ∂L(w, b, α)/∂b = 0 ⇒ Σᵢ₌₁ⁿ αᵢyᵢ = 0
    αᵢ[1 − yᵢ(wᵀxᵢ + b)] = 0,  1 − yᵢ(wᵀxᵢ + b) ≤ 0,  αᵢ ≥ 0
Note that the last three KKT conditions have an interesting interpretation: a point xᵢ satisfying the constraint with equality is on the margin, and the corresponding αᵢ > 0. All points xᵢ satisfying the constraint with strict inequality must have αᵢ = 0. Hence, all points that are on the margin (the support vectors) have αᵢ > 0; for all other points, αᵢ = 0.
    S = {xᵢ | αᵢ ≠ 0}: the set of support vectors
The Dual Problem
Now, let's substitute the expression for w into the Lagrangian L:
    L(w, b, α) = ½wᵀw + Σᵢ₌₁ⁿ αᵢ − Σᵢ₌₁ⁿ αᵢyᵢ(wᵀxᵢ + b)
    = ½(Σᵢ αᵢyᵢxᵢ)ᵀ(Σⱼ αⱼyⱼxⱼ) + Σᵢ αᵢ − Σᵢ αᵢyᵢ[(Σⱼ αⱼyⱼxⱼ)ᵀxᵢ + b]
    = ½ΣᵢΣⱼ αᵢαⱼyᵢyⱼxᵢᵀxⱼ + Σᵢ αᵢ − ΣᵢΣⱼ αᵢαⱼyᵢyⱼxⱼᵀxᵢ − b·Σᵢ αᵢyᵢ    (the last term vanishes, since Σᵢ αᵢyᵢ = 0)
    = −½ΣᵢΣⱼ αᵢαⱼyᵢyⱼxᵢᵀxⱼ + Σᵢ αᵢ,    with αᵢ ≥ 0,  Σᵢ αᵢyᵢ = 0
The Dual Problem
    max L_D = −½Σᵢ₌₁ⁿΣⱼ₌₁ⁿ αᵢαⱼyᵢyⱼxᵢᵀxⱼ + Σᵢ₌₁ⁿ αᵢ
    subject to αᵢ ≥ 0,  Σᵢ₌₁ⁿ αᵢyᵢ = 0
    with αᵢ[1 − yᵢ(wᵀxᵢ + b)] = 0,  1 − yᵢ(wᵀxᵢ + b) ≤ 0,  αᵢ ≥ 0
The first equation states that either αᵢ = 0, or 1 − yᵢ(wᵀxᵢ + b) is zero (or both).
Therefore, if 1 − yᵢ(wᵀxᵢ + b) is not zero, that is, if xᵢ is not on the margin, then the corresponding Lagrange multiplier must necessarily be zero!
For those xᵢ that do lie on the margin hyperplane, αᵢ > 0, in which case those points define the hyperplane, and hence are called support vectors.
It is possible for both conditions to be satisfied at zero, that is, αᵢ = 0 for some points that do lie on the margin. These points are not considered support vectors, since they are not required to define the hyperplane.
[Figure: the margin hyperplanes wᵀx + b = ±1, the margin width m, and the support vectors on the margin.]
Hence, we could replace the entire dataset with the few support vectors we find by solving the optimization problem. Only the support vectors matter for determining the optimal hyperplane; the rest of the data points might as well be thrown away.
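The dual can be solved with any generic QP solver. A minimal sketch using scipy's SLSQP on a tiny made-up dataset (illustrative, not from the slides); for this data the maximum-margin separator should come out approximately as w = (1, 0), b = −1, with only the two closest points getting nonzero αᵢ:

```python
import numpy as np
from scipy.optimize import minimize

# Toy data (hypothetical): two points per class along the first axis.
X = np.array([[2.0, 0.0], [3.0, 0.0], [0.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(X)
Q = (y[:, None] * X) @ (y[:, None] * X).T   # Q_ij = y_i y_j x_i^T x_j

# Maximize L_D = -1/2 a^T Q a + sum(a)  <=>  minimize its negative.
def neg_LD(a):
    return 0.5 * a @ Q @ a - a.sum()

cons = [{'type': 'eq', 'fun': lambda a: a @ y}]   # sum_i a_i y_i = 0
bounds = [(0, None)] * n                           # a_i >= 0
res = minimize(neg_LD, np.ones(n) / n, method='SLSQP',
               bounds=bounds, constraints=cons)
alpha = res.x

# Recover w from the support vectors; b from any point on the margin.
w = (alpha * y) @ X
sv = int(np.argmax(alpha))
b = y[sv] - w @ X[sv]
print(alpha.round(3), w.round(3), round(float(b), 3))
```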
How About the Nonseparable Case?
So far, we assumed that we have a two-class, linearly separable problem. What if the classes are not linearly separable due to noisy data?
We introduce a slack variable ξᵢ ≥ 0 for each instance and relax the constraints to yᵢ(wᵀxᵢ + b) ≥ 1 − ξᵢ. There are then three types of instances:
1. Those that fall on the correct side of the margin. These satisfy yᵢ(wᵀxᵢ + b) ≥ 1, i.e., ξᵢ = 0.
2. Those that fall inside the margin but on the correct side of the boundary: 0 < ξᵢ ≤ 1.
3. Those that fall on the wrong side of the boundary (misclassified): ξᵢ > 1.
The new objective penalizes the total slack:
    min ½‖w‖² + C·Σᵢ ξᵢ
The tradeoff parameter C controls the relative importance of the two competing terms: minimize error vs. maximize margin. Smaller values of C emphasize maximizing the margin, whereas larger values of C emphasize reducing the error (smaller margin). The linearly separable case corresponds to C = ∞.
Once again we have a quadratic Lagrangian:
    L(w, b, ξ, α, μ) = ½wᵀw + C·Σᵢ₌₁ⁿ ξᵢ − Σᵢ₌₁ⁿ μᵢξᵢ − Σᵢ₌₁ⁿ αᵢ[yᵢ(wᵀxᵢ + b) − 1 + ξᵢ]
Constrained Optimization with Slack Variables
Similar to the earlier case, setting the derivatives of the Lagrangian to zero gives
    ∂L(w, b, ξ, α, μ)/∂w = 0 ⇒ w = Σᵢ₌₁ⁿ αᵢyᵢxᵢ
    ∂L(w, b, ξ, α, μ)/∂b = 0 ⇒ Σᵢ₌₁ⁿ αᵢyᵢ = 0
    ∂L(w, b, ξ, α, μ)/∂ξᵢ = 0 ⇒ C − μᵢ − αᵢ = 0
    αᵢ[yᵢ(wᵀxᵢ + b) − 1 + ξᵢ] = 0,  μᵢξᵢ = 0,  αᵢ ≥ 0,  μᵢ ≥ 0,  ξᵢ ≥ 0
The dual can then be obtained by substituting w in the original Lagrangian, which results in
    max L_D = −½Σᵢ₌₁ⁿΣⱼ₌₁ⁿ αᵢαⱼyᵢyⱼxᵢᵀxⱼ + Σᵢ₌₁ⁿ αᵢ
    subject to 0 ≤ αᵢ ≤ C,  Σᵢ₌₁ⁿ αᵢyᵢ = 0
Final classification: y(x) = wᵀx + b = Σᵢ₌₁ⁿ αᵢyᵢxᵢᵀx + b
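A sketch of evaluating this final classifier once the multipliers are known; the support vectors, αᵢ values, and test points below are all hypothetical, chosen only to illustrate the formula:

```python
import numpy as np

# Evaluate y(x) = sum_i a_i y_i x_i^T x + b with hypothetical values.
X_sv  = np.array([[2.0, 0.0], [0.0, 0.0]])   # assumed support vectors
y_sv  = np.array([1.0, -1.0])
alpha = np.array([0.5, 0.5])                 # their assumed multipliers
w = (alpha * y_sv) @ X_sv                    # w = sum_i a_i y_i x_i  -> [1, 0]
b = y_sv[0] - w @ X_sv[0]                    # from a margin point: y_i(w^T x_i + b) = 1

def decide(x):
    score = (alpha * y_sv) @ (X_sv @ x) + b  # same sum written with dot products
    return 1 if score >= 0 else -1

print(decide(np.array([3.0, 1.0])), decide(np.array([-2.0, 0.5])))
```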
Some Strange Observations
First of all, let's compare the two problems with and without slack variables:
    Without slack: max L_D = −½ΣᵢΣⱼ αᵢαⱼyᵢyⱼxᵢᵀxⱼ + Σᵢ αᵢ, subject to αᵢ ≥ 0, Σᵢ αᵢyᵢ = 0
    With slack:    max L_D = −½ΣᵢΣⱼ αᵢαⱼyᵢyⱼxᵢᵀxⱼ + Σᵢ αᵢ, subject to 0 ≤ αᵢ ≤ C, Σᵢ αᵢyᵢ = 0
They are remarkably similar. Specifically, the slack variables do not appear in the new formulation; the dual problem still depends only on the Lagrange multipliers αᵢ; the desired hyperplane is still found as w = Σᵢ₌₁ⁿ αᵢyᵢxᵢ; and the training data points still appear as dot products only (the ξᵢ enter the formulation indirectly, through C). The only difference is that there is now an upper bound C on the values of αᵢ.
On the other hand, the μᵢξᵢ = 0 condition of the original formulation requires that for all points residing within the margin (for which ξᵢ ≠ 0), μᵢ = 0. Therefore, for those points αᵢ = C must be satisfied (from C − μᵢ − αᵢ = 0). Points exactly on the margin have 0 < αᵢ < C; for all other points, αᵢ = 0.
Note that it is these points with nonzero αᵢ, those that fall within or on the margin, that define the optimal decision boundary w (from w = Σᵢ₌₁ⁿ αᵢyᵢxᵢ).
The Effect of C
[Figure-only slide.]
Quadprog in Matlab
[Figure-only slide: quadprog usage from the Matlab Optimization Toolbox User's Guide, © Mathworks, 2009.]
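For reference, quadprog solves min ½αᵀHα + fᵀα subject to Aeq·α = beq and lb ≤ α ≤ ub. A Python sketch of assembling those same inputs for the soft-margin SVM dual (the data and C below are made up for illustration):

```python
import numpy as np

# Assemble quadprog-style inputs for the soft-margin SVM dual:
#   min 1/2 a'Ha + f'a   s.t.  Aeq a = beq,  lb <= a <= ub
X = np.array([[2.0, 0.0], [3.0, 0.0], [0.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
C, n = 10.0, len(X)

H   = (y[:, None] * X) @ (y[:, None] * X).T  # H_ij = y_i y_j x_i^T x_j
f   = -np.ones(n)                            # maximizing sum(a) -> minimize -sum(a)
Aeq = y.reshape(1, n)                        # equality constraint: y^T a = 0
beq = np.zeros(1)
lb  = np.zeros(n)                            # 0 <= a_i
ub  = C * np.ones(n)                         # a_i <= C (box constraint)

# H is a Gram matrix, so it must be symmetric positive semidefinite.
print(H.shape, np.allclose(H, H.T), np.linalg.eigvalsh(H).min() >= -1e-9)
```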
ν-SVM
Note that the width of the margin is not actually involved in the direct calculations of the optimization. We set a margin of 1 around the decision boundary and try to maximize this margin.
In fact, the only parameter we have in our control, the C parameter, only indirectly controls the width of the margin.
Can we involve the width of the margin more directly in the cost function? This approach gives us the ν-SVM (sometimes called the "soft-margin" SVM).
Let's define ρ as a free parameter that will allow us to control the width of the margin: the boundary remains wᵀx + b = 0, and the margin hyperplanes become wᵀx + b = ±ρ.
ν-SVM
Now the new optimization problem, along with its constraints, is
    min ½‖w‖² − νρ + (1/n)Σᵢ₌₁ⁿ ξᵢ
    subject to yᵢ(wᵀxᵢ + b) ≥ ρ − ξᵢ,  ξᵢ ≥ 0,  ρ ≥ 0
where ν is the parameter that controls the influence of the new margin variable ρ.
Note the following:
• For ξᵢ = 0, the margin separating the two classes is 2ρ/‖w‖.
• Previously (in what is also known as the C-SVM), we simply tried to minimize the number of instances with ξᵢ > 0 (those that fell into the margin and/or were misclassified).
• In ν-SVM, in addition to minimizing the "average number of instances" with ξᵢ > 0, we also directly target the margin width through the ρ parameter. The larger the ρ, the wider the margin, and the higher the number of points in the margin.
• The parameter ν, on the other hand, controls the influence of ρ; its value will be in the [0, 1] range.
ν-SVM
Using the same procedure of substituting the value of w, found through setting the derivative of the Lagrangian to zero, into the original formulation (along with the appropriate constraints), the dual problem can be written as follows:
    max L_D = −½Σᵢ₌₁ⁿΣⱼ₌₁ⁿ αᵢαⱼyᵢyⱼxᵢᵀxⱼ
    subject to 0 ≤ αᵢ ≤ 1/n,  Σᵢ₌₁ⁿ αᵢyᵢ = 0,  Σᵢ₌₁ⁿ αᵢ ≥ ν
Remarks About ν-SVM
The original C-SVM and ν-SVM will generate the same results for appropriate selections of the C and ν parameters. For the optimization problem to be feasible, however, ν must satisfy 0 ≤ ν ≤ 1, unlike C, which has an infinite range of values.
ν-SVM has some additional advantages, such as an easy-to-interpret geometric interpretation for nonseparable classes. More importantly, the ν parameter provides two crucial bounds:
• The error rate: the total number of errors that can be committed by the ν-SVM is at most nν. Hence the error rate on the training data is Pₑ ≤ ν (why? Exercise).
• The number of support vectors: we can also show that nν ≤ nₛ, the number of support vectors.
Hence, by choosing the ν parameter, we have a clearer sense of the error rate (think of it like the error goal in an MLP) and of the number of support vectors (think of it like the number of hidden layer nodes in an MLP), which directly controls the computational complexity of the problem.
How About the Nonlinear Case…?
Note that the SVM is essentially a linear classifier. Even with the slack variables, it still finds a linear boundary between the classes.
What if the problem is fundamentally not linearly separable?
Perhaps one of the most dramatic twists in pattern recognition allows the modest linear SVM to turn into one of the strongest nonlinear classifiers.
Cover's theorem: a complex problem that is not linearly separable in the given input space is more likely to be linearly separable in a higher dimensional space.
• Input space: the space in which the given training data points xᵢ reside.
• Feature space: a higher dimensional space, obtained by transforming the xᵢ through a (kernel) transformation function φ(xᵢ).
Hence SVMs solve a nonlinear problem by:
• performing a nonlinear mapping from the input space to the higher dimensional feature space, a mapping that remains hidden from both the input and the output;
• constructing an optimal (linear) hyperplane in that high dimensional space.
An Example
[Figure: every input point is passed through the mapping φ(·); a problem that is not linearly separable in the input space becomes linearly separable in the φ space.]
For the XOR-type problem with x = [x₁ x₂]ᵀ, map each point through two Gaussians centered at t₁ = [1 1]ᵀ and t₂ = [0 0]ᵀ:
    φ₁(x) = e^(−‖x − t₁‖²),  φ₂(x) = e^(−‖x − t₂‖²)
[Figure: in the (φ₁, φ₂) space, the four corner points (0,0), (0,1), (1,0), (1,1) map to points that are separable by a straight line.]
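The mapping above can be verified numerically. In the sketch below, the XOR-true corners (0,1) and (1,0) are labeled +1, and the separating line φ₁ + φ₂ = 1 in feature space (read off the figure) is used as the linear classifier:

```python
import numpy as np

# The XOR mapping from the slide: two Gaussian features centered at
# t1 = [1, 1] and t2 = [0, 0].
t1, t2 = np.array([1.0, 1.0]), np.array([0.0, 0.0])

def phi(x):
    return np.array([np.exp(-np.sum((x - t1) ** 2)),
                     np.exp(-np.sum((x - t2) ** 2))])

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, -1])        # XOR labels: not linearly separable in input space

Z = np.array([phi(x) for x in X])   # mapped points in (phi1, phi2) space

# In feature space the line phi1 + phi2 = 1 separates the two classes:
# the XOR-true points fall below it, the XOR-false points above it.
scores = 1.0 - Z.sum(axis=1)
print(Z.round(3))
print(np.sign(scores))
```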
Another Example
Consider the degree-2 polynomial kernel in 2-D. Expanding the dot product of the explicit mapping φ(x) = [1, √2x₁, √2x₂, x₁², x₂², √2x₁x₂]ᵀ:
    φ(xᵢ)ᵀφ(xⱼ) = [1, √2xᵢ₁, √2xᵢ₂, xᵢ₁², xᵢ₂², √2xᵢ₁xᵢ₂]·[1, √2xⱼ₁, √2xⱼ₂, xⱼ₁², xⱼ₂², √2xⱼ₁xⱼ₂]ᵀ
    = 1 + 2xᵢ₁xⱼ₁ + 2xᵢ₂xⱼ₂ + xᵢ₁²xⱼ₁² + xᵢ₂²xⱼ₂² + 2xᵢ₁xᵢ₂xⱼ₁xⱼ₂
    = (1 + xᵢ₁xⱼ₁ + xᵢ₂xⱼ₂)² = (1 + xᵢᵀxⱼ)²
That is, the 6-D dot product can be computed entirely in the original 2-D space.
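This identity is easy to check numerically (random test points, chosen only for illustration):

```python
import numpy as np

# Check the identity: for d = 2, the explicit 6-D map
#   phi(x) = [1, sqrt(2)x1, sqrt(2)x2, x1^2, x2^2, sqrt(2)x1x2]
# satisfies phi(xi) . phi(xj) = (1 + xi^T xj)^2.
def phi(x):
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([1.0, s * x1, s * x2, x1**2, x2**2, s * x1 * x2])

rng = np.random.default_rng(0)
xi, xj = rng.normal(size=2), rng.normal(size=2)

explicit = phi(xi) @ phi(xj)          # dot product in the 6-D feature space
kernel   = (1.0 + xi @ xj) ** 2       # kernel evaluated in the 2-D input space
print(explicit, kernel)
```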
How to Use & Choose Kernels
The mapping to the high dimensional space is implicit; that is, we typically do not (need to) know what the mapping function is once we define an appropriate kernel.
For example, by choosing K(xᵢ, xⱼ) = (1 + xᵢᵀxⱼ)³ we are implicitly moving a 2-D x into a 10-D z, whose specific mapping is analogous to the one on the previous slide.
In SVMs, we normally pick a kernel and determine, if we wish to, the corresponding implicit mapping function φ. We do not need to know, or compute, what that mapping is, however, as it is not necessary for the computation of the SVM.
How do we select a kernel? Is K(xᵢ, xⱼ) = (1 + xᵢᵀxⱼ)³ "the" kernel?
Since the SVM formulation has the input data in dot product form, we must select a kernel that can be expressed as a dot product providing the implicit mapping.
…and which kernels K(xᵢ, xⱼ) have an implicit mapping in the form of a dot product? Those that satisfy Mercer's conditions.
Mercer's Theorem
Let K(x₁, x₂) be a continuous and symmetric kernel defined on some closed interval a ≤ x ≤ b. This kernel has an equivalent representation using the dot product of a mapping function φ: x → φ(x) ∈ ℋ, as follows:
    K(x₁, x₂) = Σᵢ₌₁^∞ λᵢ·φᵢ(x₁)·φᵢ(x₂),  λᵢ > 0
where ℋ is a Hilbert space: a generalization of Euclidean space in which the inner product can be defined more generally, not just as the Euclidean dot product.
This expansion is valid if and only if
    ∫ₐᵇ∫ₐᵇ K(x₁, x₂)·ψ(x₁)·ψ(x₂) dx₁dx₂ ≥ 0
is satisfied for any and all arbitrary functions ψ(·) for which ∫ₐᵇ ψ²(x) dx < ∞.
The mapping functions φᵢ(x) are then called the eigenfunctions, and the λᵢ the eigenvalues, of the kernel representation. Note that the condition λᵢ > 0 makes the kernel positive definite: the matrix K, whose (i, j)th entry is K(xᵢ, xⱼ), is positive (semi)definite.
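The practical consequence is easy to check: a Gram matrix built from a Mercer kernel has no negative eigenvalues. A sketch with random data and a Gaussian kernel (σ = 1, values chosen for illustration):

```python
import numpy as np

# Build a Gram matrix K_ij = exp(-||xi - xj||^2 / (2 sigma^2)) on random data
# and confirm it is (numerically) positive semidefinite.
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 2))

sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq_dists / 2.0)           # Gaussian kernel with sigma = 1

eigvals = np.linalg.eigvalsh(K)       # all eigenvalues of a PSD matrix are >= 0
print(K.shape, eigvals.min() >= -1e-10)
```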
Uhm… whatever…
Just Tell Me Which Kernels Satisfy These Mercer's Conditions
Polynomial kernel with degree d: K(xᵢ, xⱼ) = (1 + γxᵢᵀxⱼ)^d
    The user defines the value of d (and γ), which then controls how large the feature space dimensionality will be. As seen earlier, a choice of d = 2 moves a 2-D x into a 6-D z. Similarly, using d = 3 moves a 2-D x into a 10-D z.
Radial basis (Gaussian) kernel with width σ: K(xᵢ, xⱼ) = exp(−‖xᵢ − xⱼ‖²/(2σ²))
    The user defines the kernel width σ. This SVM is closely related to the RBF network. It increases the dimensionality to ∞, as every data point is replaced by a continuous Gaussian. The number of RBFs and their centers are determined by the number of support vectors and their values, respectively.
Hyperbolic tangent (sigmoid; a two-layer MLP): K(xᵢ, xⱼ) = tanh(κxᵢᵀxⱼ + θ)
    The parameters κ and θ are user defined. The number of hidden layer nodes and their values are determined by the number of support vectors and their values, respectively. The hidden-to-output weights are then the Lagrange multipliers αᵢ. This kernel satisfies Mercer's conditions only for certain values of κ and θ.
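The three kernels above, written as functions (the parameter values are illustrative defaults, not prescriptions):

```python
import numpy as np

def poly_kernel(xi, xj, gamma=1.0, d=3):
    """Polynomial kernel (1 + gamma * xi.xj)^d."""
    return (1.0 + gamma * (xi @ xj)) ** d

def rbf_kernel(xi, xj, sigma=0.3):
    """Gaussian (RBF) kernel exp(-||xi - xj||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((xi - xj) ** 2) / (2.0 * sigma ** 2))

def tanh_kernel(xi, xj, kappa=1.0, theta=-1.0):
    """Hyperbolic tangent kernel tanh(kappa * xi.xj + theta)."""
    return np.tanh(kappa * (xi @ xj) + theta)

xi, xj = np.array([1.0, 2.0]), np.array([0.5, -1.0])
for k in (poly_kernel, rbf_kernel, tanh_kernel):
    # every kernel must be symmetric: K(xi, xj) == K(xj, xi)
    assert np.isclose(k(xi, xj), k(xj, xi))

print(poly_kernel(xi, xj), rbf_kernel(xi, xi), tanh_kernel(xi, xj))
```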
SVMs in High Dimensional Space
So then, exactly how do we use the kernel trick with the SVM to obtain nonlinear decision boundaries?
First, recall the original formulation:
    max L_D = −½Σᵢ₌₁ⁿΣⱼ₌₁ⁿ αᵢαⱼyᵢyⱼxᵢᵀxⱼ + Σᵢ₌₁ⁿ αᵢ
The data enter only through the dot products xᵢᵀxⱼ; replacing each dot product with the kernel K(xᵢ, xⱼ) performs the optimization implicitly in the feature space. The decision boundary follows the same substitution.
SVMs in High Dimensional Space
So the new decision boundary is the weighted sum of the kernel function evaluated at the support vectors xᵢ. Recall that only the support vectors have nonzero dual variables αᵢ!
    g(x) = Σᵢ₌₁ⁿ αᵢyᵢK(xᵢ, x) + b = Σ_{xᵢ∈S} αᵢyᵢK(xᵢ, x) + b = 0
    with 0 ≤ αᵢ ≤ C,  Σᵢ₌₁ⁿ αᵢyᵢ = 0
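A sketch of evaluating g(x) with a Gaussian kernel; the support vectors, multipliers, and b below are hypothetical values chosen only to illustrate the sum:

```python
import numpy as np

# Evaluate g(x) = sum_{xi in S} a_i y_i K(xi, x) + b with a Gaussian kernel.
def K(xi, x, sigma=0.5):
    return np.exp(-np.sum((xi - x) ** 2) / (2.0 * sigma ** 2))

X_sv  = np.array([[0.0, 0.0], [1.0, 1.0]])   # assumed support vectors
y_sv  = np.array([1.0, -1.0])
alpha = np.array([1.0, 1.0])                 # nonzero only for support vectors
b     = 0.0

def g(x):
    return sum(a * y * K(xi, x) for a, y, xi in zip(alpha, y_sv, X_sv)) + b

# Points near the +1 support vector score positive; near the -1 one, negative.
print(np.sign(g(np.array([0.1, 0.0]))), np.sign(g(np.array([0.9, 1.0]))))
```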
Some Examples (Gaussian Kernel, σ = 0.3, C = 10)
[Figures: rotating checkerboard data with N = 300 points (a = 0.5, rotation 30); the decision boundaries generated by the SVM classifier with C = 10; and the resulting SVM classification on the rotating checkerboard data.]
Effect of C (C = 0.1)
[Figures: the decision boundaries generated by the SVM classifier and the resulting classification on the rotating checkerboard data, C = 0.1.]
Effect of C (C = 1)
[Figures: the decision boundaries generated by the SVM classifier and the resulting classification on the rotating checkerboard data, C = 1.]
Effect of C (C = 100)
[Figures: the decision boundaries generated by the SVM classifier and the resulting classification on the rotating checkerboard data, C = 100.]
Effect of Kernel Type
[Figures: SVM decision boundaries using the polynomial kernel (C = 100, d = 10) and the Gaussian kernel (C = 100, σ = 0.3).]
Effect of Kernel Type
[Figures: SVM classification using the polynomial kernel (C = 100, d = 10) and the Gaussian kernel (C = 100, σ = 0.3).]
Effect of Kernel Type
[Figures: SVM decision boundaries and classification using the polynomial kernel (C = 100, d = 2) and the Gaussian kernel (C = 100, σ = 0.1).]
When We Come Back…
Why do SVMs work?
• Structural risk minimization
• Statistical learning theory
• VC dimension
Strengths and weaknesses of SVMs
Implementation of SVMs
HOMEWORK: I have placed two tutorial papers on SVMs on the class BB page. Make sure that you read these papers.
• An Introduction to Kernel-Based Learning Algorithms; Müller, Mika, Rätsch, Tsuda and Schölkopf, IEEE TNN, 2001.
• A Tutorial on Support Vector Machines for Pattern Recognition; Burges, Data Mining and Knowledge Discovery, 1998.