Kernel Methods & Support Vector Machines
Dr. Robi Polikar
Lecture 7-8, Week 8
Reading: Chapter 13 in Alpaydin; Chapter 14 in Murphy
This Week in PR
• A new approach to linear classifiers
• Remembering the geometry
• Maximum margin hyperplane
• Constrained optimization
• The Lagrangian dual problem
• Karush-Kuhn-Tucker conditions
• Support vector machines
• SVMs with noisy data
• Slack variables
• SVMs as nonlinear classifiers (Duda, Hart & Stork, Pattern Classification)
Original graphics created / generated by Robi Polikar. All Rights Reserved © 2001-2013. May be used with permission and citation.
Computational Intelligence and Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
Intuitive Idea
Given many boundaries that can all solve a given linearly separable problem, which one is the best, i.e., most likely to result in the smallest generalization error?
Intuitively, our answer is the one that provides the largest separation between the classes: this leaves more room for noisy samples to move around.
The SVM is a classifier that can find such an optimal hyperplane, the one that provides the maximum margin between the nearest instances of opposing classes.
[Figure: many viable decision boundaries vs. the optimal decision boundary; the support vectors lie on the maximum margin. Axes: Feature 1 vs. Feature 2.]
Maximizing Margins
[Figure-only slide; graphic credit: TK]
Recall: Geometry of a Decision Boundary
Any point x can be written as x = x_p + r·(w/‖w‖), where x_p is its projection onto the hyperplane g(x) = wᵀx + w₀ = 0, and r is its signed distance to the hyperplane. Then
    g(x) = wᵀx + w₀ = wᵀ(x_p + r·w/‖w‖) + w₀ = g(x_p) + r·(wᵀw)/‖w‖ = r‖w‖
since g(x_p) = 0. Therefore
    r = g(x)/‖w‖
Let's take a look at the value of the function g(x) at the origin, x = 0:
    g(x)|_{x=0} = wᵀx + w₀ = w₀  ⇒  r₀ = w₀/‖w‖
w₀ determines the location of the hyperplane; w determines its orientation.
[Figure: the hyperplane g(x) = 0, a point x at distance r from it, and the origin at distance r₀.]
Formalizing the Problem
Recall that the separating hyperplane (decision boundary) is given by (using b instead of w₀, as commonly done in the SVM literature)
    g(x) = wᵀx + w₀ = wᵀx + b
Given two-class training data xᵢ with labels yᵢ = +1 (class ω₁) and yᵢ = −1 (class ω₂):
    wᵀx + b ≥ 1 ⇒ x ∈ ω₁
    wᵀx + b = 0 ⇒ x on the boundary
    wᵀx + b ≤ −1 ⇒ x ∈ ω₂
    yᵢ(wᵀxᵢ + b) ≥ 1, ∀i ⇒ all x correctly classified
The distance of a point x on the margin (where g(x) = 1) to the hyperplane g(x) = 0 is
    r = g(x)/‖w‖ = (wᵀx + b)/‖w‖  ⇒  r* = 1/‖w‖
Then the length of the margin is m = 2/‖w‖.
[Figure: the two classes, the hyperplanes wᵀx + b = 0 and wᵀx + b = ±1, and the margin m. Axes: Feature 1 vs. Feature 2.]
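As a quick numeric sanity check of these formulas (the w, b, and point below are made up for illustration, not from the slides):

```python
import numpy as np

# A hypothetical hyperplane w^T x + b = 0 in 2-D (illustrative values).
w = np.array([3.0, 4.0])   # ||w|| = 5
b = -2.0

def g(x):
    return w @ x + b

# Signed distance of a point to the hyperplane: r = g(x) / ||w||
x = np.array([2.0, 1.0])          # g(x) = 6 + 4 - 2 = 8
r = g(x) / np.linalg.norm(w)      # 8 / 5 = 1.6

# A point on the margin g(x) = 1 lies at distance r* = 1/||w||,
# so the full margin width is m = 2/||w||.
m = 2.0 / np.linalg.norm(w)       # 2 / 5 = 0.4
print(r, m)
```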
Constrained Optimization
The best hyperplane, the one that provides the maximum separation, is therefore the one that maximizes m = 2/‖w‖.
However, by arbitrarily choosing w we can make its length as small as we want; in fact, w = 0 would provide an infinite margin. This is clearly not an interesting, or even viable, solution. There has to be a constraint on this problem.
The constraint comes from the correct classification of all data points, which requires that
    yᵢ(wᵀxᵢ + b) ≥ 1, ∀i
Therefore, the problem of finding the optimal decision boundary is converted into the following constrained optimization problem:
    min ½‖w‖²
    subject to yᵢ(wᵀxᵢ + b) ≥ 1, ∀i
Note that maximizing m = 2/‖w‖ is equivalent to minimizing ‖w‖²/2.
We take the square of the length of the vector, which (along with the ½ factor) does not change the solution, but makes the solution process easier.
Among other things, since the function to be minimized is quadratic, it has only a single (global) minimum.
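For a concrete feel, this primal problem can be handed directly to a generic constrained optimizer. A minimal sketch using scipy's SLSQP on a tiny made-up dataset (the data are illustrative, not from the slides):

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data (hypothetical, for illustration only).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

# Variables: theta = [w1, w2, b]; objective is (1/2)||w||^2.
def objective(theta):
    w = theta[:2]
    return 0.5 * w @ w

# One inequality constraint per sample: y_i (w^T x_i + b) - 1 >= 0
constraints = [{'type': 'ineq',
                'fun': (lambda th, i=i: y[i] * (th[:2] @ X[i] + th[2]) - 1.0)}
               for i in range(len(X))]

res = minimize(objective, x0=np.zeros(3), method='SLSQP',
               constraints=constraints)
w, b = res.x[:2], res.x[2]

# Every training point must end up on the correct side of its margin.
margins = y * (X @ w + b)
print(w, b, margins.min())
```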
A Primer on Constrained Optimization
Recall from our previous discussion that constrained optimization can be solved through Lagrange multipliers:
If we wish to find the extremum of a function f(x) subject to some constraint g(x) = 0, the extremum point x* can be found as follows:
1. Form the Lagrange function to convert the problem to an unconstrained one, where α, whose value needs to be determined, is the Lagrange multiplier:
    L(x, α) = f(x) + α·g(x)
2. Set the derivative of the Lagrangian to zero and enforce the constraint:
    ∂L(x, α)/∂x |_{x=x*} = 0,  g(x*) = 0
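A small worked example (not from the slides): minimize f(x) = x₁² + x₂² subject to x₁ + x₂ − 1 = 0. The Lagrangian conditions can be solved by hand and then checked numerically:

```python
import numpy as np

# Worked example: minimize f(x) = x1^2 + x2^2  subject to  x1 + x2 - 1 = 0.
# L(x, a) = x1^2 + x2^2 + a*(x1 + x2 - 1)
# dL/dx1 = 2*x1 + a = 0 and dL/dx2 = 2*x2 + a = 0  =>  x1 = x2 = -a/2
# Constraint: x1 + x2 = -a = 1  =>  a = -1, so x* = (1/2, 1/2).
a = -1.0
x = np.array([-a / 2, -a / 2])

# Verify stationarity and feasibility:
grad_L = 2 * x + a                  # gradient of L w.r.t. x, should be [0, 0]
constraint = x.sum() - 1.0          # should be 0
print(x, grad_L, constraint)
```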
Constrained Optimization with Multiple Constraints
If we have many constraints, such as gᵢ(x) = 0, i = 1,…,n, then we need a Lagrange multiplier αᵢ for each constraint, and the constraints appear as a summation in the Lagrangian:
    L(x, αᵢ) = f(x) + Σᵢ₌₁ⁿ αᵢ·gᵢ(x)
    ∂L(x, αᵢ)/∂x |_{x=x*} = ∂/∂x [f(x) + Σᵢ₌₁ⁿ αᵢ·gᵢ(x)] |_{x=x*} = 0
    gᵢ(x*) = 0,  i = 1,…,n
Inequality Constraints: Karush-Kuhn-Tucker Conditions
If we have several equality and inequality constraints, in the following form
    min f(x)
    subject to gᵢ(x) ≤ 0, i = 1,…,n
    and hᵢ(x) = 0, i = 1,…,m
the necessary conditions, known as the Karush-Kuhn-Tucker (KKT) conditions (for convex problems they are also sufficient), for (x*, α*, β*) to be the solution are:
    ∂L(x*, α*, β*)/∂x = 0,  ∂L(x*, α*, β*)/∂β = 0    (gradient of the Lagrangian with respect to the parameters to be selected is zero)
    αᵢ*·gᵢ(x*) = 0,  αᵢ* ≥ 0    (Lagrange multipliers must be nonnegative, and each multiplier times its inequality constraint must be zero: complementary slackness)
    hᵢ(x*) = 0,  gᵢ(x*) ≤ 0,  i = 1,…,n    (original equality and inequality constraints must be satisfied: primal feasibility)
where the generalized Lagrangian is now given as
    L(x, αᵢ, βᵢ) = f(x) + Σᵢ₌₁ⁿ αᵢ·gᵢ(x) + Σᵢ₌₁ᵐ βᵢ·hᵢ(x)
and αᵢ, βᵢ are the Lagrange multipliers, one for each constraint.
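A tiny worked KKT example (made up for illustration, not from the slides): minimize f(x) = x² subject to the single inequality g(x) = 1 − x ≤ 0. Solving the conditions by hand forces the constraint to be active, and the result can be checked directly:

```python
# KKT worked example:
# minimize f(x) = x^2  subject to  g(x) = 1 - x <= 0  (i.e., x >= 1)
# L(x, a) = x^2 + a*(1 - x)
# Stationarity:            dL/dx = 2x - a = 0
# Complementary slackness: a * (1 - x) = 0
# Trying a > 0 forces the constraint active: x = 1, and then a = 2.
x_star, a_star = 1.0, 2.0

print(2 * x_star - a_star,        # stationarity residual, should be 0
      a_star * (1 - x_star),      # complementary slackness, should be 0
      1 - x_star <= 0,            # primal feasibility
      a_star >= 0)                # dual feasibility
```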
Back to Our Own Problem
We simply have one set of inequality constraints, so the KKT conditions become as follows for our problem:
    min ½‖w‖²   (note ‖w‖² = wᵀw)
    subject to yᵢ(wᵀxᵢ + b) ≥ 1, ∀i
The constraints can be rewritten in standard form:
    yᵢ(wᵀxᵢ + b) ≥ 1 ⇔ gᵢ(x) = 1 − yᵢ(wᵀxᵢ + b) ≤ 0
The Lagrangian is then
    L(w, b, α) = ½wᵀw + Σᵢ₌₁ⁿ αᵢ[1 − yᵢ(wᵀxᵢ + b)]
               = ½wᵀw + Σᵢ₌₁ⁿ αᵢ − Σᵢ₌₁ⁿ αᵢyᵢ(wᵀxᵢ + b)
Taking the derivatives, setting them to zero, and solving for all parameters:
    ∂L(w, b, α)/∂w = 0 ⇒ w = Σᵢ₌₁ⁿ αᵢyᵢxᵢ = Σ_{xᵢ∈S} αᵢyᵢxᵢ
    ∂L(w, b, α)/∂b = 0 ⇒ Σᵢ₌₁ⁿ αᵢyᵢ = 0
    αᵢ[1 − yᵢ(wᵀxᵢ + b)] = 0,  1 − yᵢ(wᵀxᵢ + b) ≤ 0,  αᵢ ≥ 0
Note that the last three KKT conditions have an interesting interpretation: a point xᵢ satisfying the constraint with equality is on the margin, and the corresponding αᵢ > 0. All points xᵢ satisfying the constraint with strict inequality must have αᵢ = 0. Hence, all points that are on the margin (the support vectors) have αᵢ > 0; for all other points, αᵢ = 0.
    S = {xᵢ | αᵢ ≠ 0}: the set of support vectors
The Dual Problem
Now, let's substitute the expression for w into the Lagrangian L:
    L(w, b, α) = ½wᵀw + Σᵢ₌₁ⁿ αᵢ − Σᵢ₌₁ⁿ αᵢyᵢ(wᵀxᵢ + b)
    = ½(Σᵢ αᵢyᵢxᵢ)ᵀ(Σⱼ αⱼyⱼxⱼ) + Σᵢ αᵢ − Σᵢ αᵢyᵢ[(Σⱼ αⱼyⱼxⱼ)ᵀxᵢ + b]
    = ½ΣᵢΣⱼ αᵢαⱼyᵢyⱼxᵢᵀxⱼ + Σᵢ αᵢ − ΣᵢΣⱼ αᵢαⱼyᵢyⱼxⱼᵀxᵢ − b·Σᵢ αᵢyᵢ    (the last term vanishes, since Σᵢ αᵢyᵢ = 0)
    = −½ΣᵢΣⱼ αᵢαⱼyᵢyⱼxᵢᵀxⱼ + Σᵢ αᵢ,    with αᵢ ≥ 0,  Σᵢ αᵢyᵢ = 0
The Dual Problem
    max L_D = −½Σᵢ₌₁ⁿΣⱼ₌₁ⁿ αᵢαⱼyᵢyⱼxᵢᵀxⱼ + Σᵢ₌₁ⁿ αᵢ
    subject to αᵢ ≥ 0,  Σᵢ₌₁ⁿ αᵢyᵢ = 0
    with αᵢ[1 − yᵢ(wᵀxᵢ + b)] = 0,  1 − yᵢ(wᵀxᵢ + b) ≤ 0,  αᵢ ≥ 0
The first equation states that either αᵢ = 0, or 1 − yᵢ(wᵀxᵢ + b) is zero (or both).
Therefore, if 1 − yᵢ(wᵀxᵢ + b) is not zero, that is, if xᵢ is not on the margin, then the corresponding Lagrange multiplier must necessarily be zero!
For those xᵢ that do lie on the margin hyperplane, αᵢ > 0, in which case those points define the hyperplane, and hence are called support vectors.
It is possible for both conditions to be satisfied at zero, that is, αᵢ = 0 for some points that do lie on the margin. These points are not considered support vectors, since they are not required to define the hyperplane.
[Figure: the margin hyperplanes wᵀx + b = ±1, the margin width m, and the support vectors on the margin.]
Hence, we could replace the entire dataset with the few support vectors we find by solving the optimization problem. Only the support vectors matter for determining the optimal hyperplane; the rest of the data points might as well be thrown away.
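The dual can be solved with any generic QP solver. A minimal sketch using scipy's SLSQP on a tiny made-up dataset (illustrative, not from the slides); for this data the maximum-margin separator should come out approximately as w = (1, 0), b = −1, with only the two closest points getting nonzero αᵢ:

```python
import numpy as np
from scipy.optimize import minimize

# Toy data (hypothetical): two points per class along the first axis.
X = np.array([[2.0, 0.0], [3.0, 0.0], [0.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(X)
Q = (y[:, None] * X) @ (y[:, None] * X).T   # Q_ij = y_i y_j x_i^T x_j

# Maximize L_D = -1/2 a^T Q a + sum(a)  <=>  minimize its negative.
def neg_LD(a):
    return 0.5 * a @ Q @ a - a.sum()

cons = [{'type': 'eq', 'fun': lambda a: a @ y}]   # sum_i a_i y_i = 0
bounds = [(0, None)] * n                           # a_i >= 0
res = minimize(neg_LD, np.ones(n) / n, method='SLSQP',
               bounds=bounds, constraints=cons)
alpha = res.x

# Recover w from the support vectors; b from any point on the margin.
w = (alpha * y) @ X
sv = int(np.argmax(alpha))
b = y[sv] - w @ X[sv]
print(alpha.round(3), w.round(3), round(float(b), 3))
```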
How About the Nonseparable Case?
So far, we assumed that we have a two-class, linearly separable problem. What if the classes are not linearly separable due to noisy data?
We introduce a slack variable ξᵢ ≥ 0 for each instance and relax the constraints to yᵢ(wᵀxᵢ + b) ≥ 1 − ξᵢ. There are then three types of instances:
1. Those that fall on the correct side of the margin. These satisfy yᵢ(wᵀxᵢ + b) ≥ 1, i.e., ξᵢ = 0.
2. Those that fall inside the margin but on the correct side of the boundary: 0 < ξᵢ ≤ 1.
3. Those that fall on the wrong side of the boundary (misclassified): ξᵢ > 1.
The new objective penalizes the total slack:
    min ½‖w‖² + C·Σᵢ ξᵢ
The tradeoff parameter C controls the relative importance of the two competing terms: minimize error vs. maximize margin. Smaller values of C emphasize maximizing the margin, whereas larger values of C emphasize reducing the error (smaller margin). The linearly separable case corresponds to C = ∞.
Once again we have a quadratic Lagrangian:
    L(w, b, ξ, α, μ) = ½wᵀw + C·Σᵢ₌₁ⁿ ξᵢ − Σᵢ₌₁ⁿ μᵢξᵢ − Σᵢ₌₁ⁿ αᵢ[yᵢ(wᵀxᵢ + b) − 1 + ξᵢ]
Constrained Optimization with Slack Variables
Similar to the earlier case, setting the derivatives of the Lagrangian to zero gives
    ∂L(w, b, ξ, α, μ)/∂w = 0 ⇒ w = Σᵢ₌₁ⁿ αᵢyᵢxᵢ
    ∂L(w, b, ξ, α, μ)/∂b = 0 ⇒ Σᵢ₌₁ⁿ αᵢyᵢ = 0
    ∂L(w, b, ξ, α, μ)/∂ξᵢ = 0 ⇒ C − μᵢ − αᵢ = 0
    αᵢ[yᵢ(wᵀxᵢ + b) − 1 + ξᵢ] = 0,  μᵢξᵢ = 0,  αᵢ ≥ 0,  μᵢ ≥ 0,  ξᵢ ≥ 0
The dual can then be obtained by substituting w in the original Lagrangian, which results in
    max L_D = −½Σᵢ₌₁ⁿΣⱼ₌₁ⁿ αᵢαⱼyᵢyⱼxᵢᵀxⱼ + Σᵢ₌₁ⁿ αᵢ
    subject to 0 ≤ αᵢ ≤ C,  Σᵢ₌₁ⁿ αᵢyᵢ = 0
Final classification: y(x) = wᵀx + b = Σᵢ₌₁ⁿ αᵢyᵢxᵢᵀx + b
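A sketch of evaluating this final classifier once the multipliers are known; the support vectors, αᵢ values, and test points below are all hypothetical, chosen only to illustrate the formula:

```python
import numpy as np

# Evaluate y(x) = sum_i a_i y_i x_i^T x + b with hypothetical values.
X_sv  = np.array([[2.0, 0.0], [0.0, 0.0]])   # assumed support vectors
y_sv  = np.array([1.0, -1.0])
alpha = np.array([0.5, 0.5])                 # their assumed multipliers
w = (alpha * y_sv) @ X_sv                    # w = sum_i a_i y_i x_i  -> [1, 0]
b = y_sv[0] - w @ X_sv[0]                    # from a margin point: y_i(w^T x_i + b) = 1

def decide(x):
    score = (alpha * y_sv) @ (X_sv @ x) + b  # same sum written with dot products
    return 1 if score >= 0 else -1

print(decide(np.array([3.0, 1.0])), decide(np.array([-2.0, 0.5])))
```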
Some Strange Observations
First of all, let's compare the two problems with and without slack variables:
    Without slack: max L_D = −½ΣᵢΣⱼ αᵢαⱼyᵢyⱼxᵢᵀxⱼ + Σᵢ αᵢ, subject to αᵢ ≥ 0, Σᵢ αᵢyᵢ = 0
    With slack:    max L_D = −½ΣᵢΣⱼ αᵢαⱼyᵢyⱼxᵢᵀxⱼ + Σᵢ αᵢ, subject to 0 ≤ αᵢ ≤ C, Σᵢ αᵢyᵢ = 0
They are remarkably similar. Specifically, the slack variables do not appear in the new formulation; the dual problem still depends only on the Lagrange multipliers αᵢ; the desired hyperplane is still found as w = Σᵢ₌₁ⁿ αᵢyᵢxᵢ; and the training data points still appear as dot products only (the ξᵢ enter the formulation indirectly, through C). The only difference is that there is now an upper bound C on the values of αᵢ.
On the other hand, the μᵢξᵢ = 0 condition of the original formulation requires that for all points residing within the margin (for which ξᵢ ≠ 0), μᵢ = 0. Therefore, for those points αᵢ = C must be satisfied (from C − μᵢ − αᵢ = 0). Points exactly on the margin have 0 < αᵢ < C; for all other points, αᵢ = 0.
Note that it is these points with nonzero αᵢ, those that fall within or on the margin, that define the optimal decision boundary w (from w = Σᵢ₌₁ⁿ αᵢyᵢxᵢ).
The Effect of C
[Figure-only slide.]
Quadprog in Matlab
[Figure-only slide: quadprog usage from the Matlab Optimization Toolbox User's Guide, © Mathworks, 2009.]
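For reference, quadprog solves min ½αᵀHα + fᵀα subject to Aeq·α = beq and lb ≤ α ≤ ub. A Python sketch of assembling those same inputs for the soft-margin SVM dual (the data and C below are made up for illustration):

```python
import numpy as np

# Assemble quadprog-style inputs for the soft-margin SVM dual:
#   min 1/2 a'Ha + f'a   s.t.  Aeq a = beq,  lb <= a <= ub
X = np.array([[2.0, 0.0], [3.0, 0.0], [0.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
C, n = 10.0, len(X)

H   = (y[:, None] * X) @ (y[:, None] * X).T  # H_ij = y_i y_j x_i^T x_j
f   = -np.ones(n)                            # maximizing sum(a) -> minimize -sum(a)
Aeq = y.reshape(1, n)                        # equality constraint: y^T a = 0
beq = np.zeros(1)
lb  = np.zeros(n)                            # 0 <= a_i
ub  = C * np.ones(n)                         # a_i <= C (box constraint)

# H is a Gram matrix, so it must be symmetric positive semidefinite.
print(H.shape, np.allclose(H, H.T), np.linalg.eigvalsh(H).min() >= -1e-9)
```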
ν-SVM
Note that the width of the margin is not actually involved in the direct calculations of the optimization. We set a margin of 1 around the decision boundary and try to maximize this margin.
In fact, the only parameter we have in our control, the C parameter, only indirectly controls the width of the margin.
Can we involve the width of the margin more directly in the cost function? This approach gives us the ν-SVM (sometimes called the "soft-margin" SVM).
Let's define ρ as a free parameter that will allow us to control the width of the margin: the boundary remains wᵀx + b = 0, and the margin hyperplanes become wᵀx + b = ±ρ.
ν-SVM
Now the new optimization problem, along with its constraints, is
    min ½‖w‖² − νρ + (1/n)Σᵢ₌₁ⁿ ξᵢ
    subject to yᵢ(wᵀxᵢ + b) ≥ ρ − ξᵢ,  ξᵢ ≥ 0,  ρ ≥ 0
where ν is the parameter that controls the influence of the new margin variable ρ.
Note the following:
• For ξᵢ = 0, the margin separating the two classes is 2ρ/‖w‖.
• Previously (in what is also known as the C-SVM), we simply tried to minimize the number of instances with ξᵢ > 0 (those that fell into the margin and/or were misclassified).
• In ν-SVM, in addition to minimizing the "average number of instances" with ξᵢ > 0, we also directly target the margin width through the ρ parameter. The larger the ρ, the wider the margin, and the higher the number of points in the margin.
• The parameter ν, on the other hand, controls the influence of ρ; its value will be in the [0, 1] range.
ν-SVM
Using the same procedure of substituting the value of w, found through setting the derivative of the Lagrangian to zero, into the original formulation (along with the appropriate constraints), the dual problem can be written as follows:
    max L_D = −½Σᵢ₌₁ⁿΣⱼ₌₁ⁿ αᵢαⱼyᵢyⱼxᵢᵀxⱼ
    subject to 0 ≤ αᵢ ≤ 1/n,  Σᵢ₌₁ⁿ αᵢyᵢ = 0,  Σᵢ₌₁ⁿ αᵢ ≥ ν
Remarks About ν-SVM
The original C-SVM and ν-SVM will generate the same results for appropriate selections of the C and ν parameters. For the optimization problem to be feasible, however, ν must satisfy 0 ≤ ν ≤ 1, unlike C, which has an infinite range of values.
ν-SVM has some additional advantages, such as an easy-to-interpret geometric interpretation for nonseparable classes. More importantly, the ν parameter provides two crucial bounds:
• The error rate: the total number of errors that can be committed by the ν-SVM is at most nν. Hence the error rate on the training data is Pₑ ≤ ν (why? Exercise).
• The number of support vectors: we can also show that nν ≤ nₛ, the number of support vectors.
Hence, by choosing the ν parameter, we have a clearer sense of the error rate (think of it like the error goal in an MLP) and of the number of support vectors (think of it like the number of hidden layer nodes in an MLP), which directly controls the computational complexity of the problem.
How About the Nonlinear Case…?
Note that the SVM is essentially a linear classifier. Even with the slack variables, it still finds a linear boundary between the classes.
What if the problem is fundamentally not linearly separable?
Perhaps one of the most dramatic twists in pattern recognition allows the modest linear SVM to turn into one of the strongest nonlinear classifiers.
Cover's theorem: a complex problem that is not linearly separable in the given input space is more likely to be linearly separable in a higher dimensional space.
• Input space: the space in which the given training data points xᵢ reside.
• Feature space: a higher dimensional space, obtained by transforming the xᵢ through a (kernel) transformation function φ(xᵢ).
Hence SVMs solve a nonlinear problem by:
• performing a nonlinear mapping from the input space to the higher dimensional feature space, a mapping that remains hidden from both the input and the output;
• constructing an optimal (linear) hyperplane in that high dimensional space.
An Example
[Figure: every input point is passed through the mapping φ(·); a problem that is not linearly separable in the input space becomes linearly separable in the φ space.]
For the XOR-type problem with x = [x₁ x₂]ᵀ, map each point through two Gaussians centered at t₁ = [1 1]ᵀ and t₂ = [0 0]ᵀ:
    φ₁(x) = e^(−‖x − t₁‖²),  φ₂(x) = e^(−‖x − t₂‖²)
[Figure: in the (φ₁, φ₂) space, the four corner points (0,0), (0,1), (1,0), (1,1) map to points that are separable by a straight line.]
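The mapping above can be verified numerically. In the sketch below, the XOR-true corners (0,1) and (1,0) are labeled +1, and the separating line φ₁ + φ₂ = 1 in feature space (read off the figure) is used as the linear classifier:

```python
import numpy as np

# The XOR mapping from the slide: two Gaussian features centered at
# t1 = [1, 1] and t2 = [0, 0].
t1, t2 = np.array([1.0, 1.0]), np.array([0.0, 0.0])

def phi(x):
    return np.array([np.exp(-np.sum((x - t1) ** 2)),
                     np.exp(-np.sum((x - t2) ** 2))])

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, -1])        # XOR labels: not linearly separable in input space

Z = np.array([phi(x) for x in X])   # mapped points in (phi1, phi2) space

# In feature space the line phi1 + phi2 = 1 separates the two classes:
# the XOR-true points fall below it, the XOR-false points above it.
scores = 1.0 - Z.sum(axis=1)
print(Z.round(3))
print(np.sign(scores))
```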
Another Example
Consider the degree-2 polynomial kernel in 2-D. Expanding the dot product of the explicit mapping φ(x) = [1, √2x₁, √2x₂, x₁², x₂², √2x₁x₂]ᵀ:
    φ(xᵢ)ᵀφ(xⱼ) = [1, √2xᵢ₁, √2xᵢ₂, xᵢ₁², xᵢ₂², √2xᵢ₁xᵢ₂]·[1, √2xⱼ₁, √2xⱼ₂, xⱼ₁², xⱼ₂², √2xⱼ₁xⱼ₂]ᵀ
    = 1 + 2xᵢ₁xⱼ₁ + 2xᵢ₂xⱼ₂ + xᵢ₁²xⱼ₁² + xᵢ₂²xⱼ₂² + 2xᵢ₁xᵢ₂xⱼ₁xⱼ₂
    = (1 + xᵢ₁xⱼ₁ + xᵢ₂xⱼ₂)² = (1 + xᵢᵀxⱼ)²
That is, the 6-D dot product can be computed entirely in the original 2-D space.
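This identity is easy to check numerically (random test points, chosen only for illustration):

```python
import numpy as np

# Check the identity: for d = 2, the explicit 6-D map
#   phi(x) = [1, sqrt(2)x1, sqrt(2)x2, x1^2, x2^2, sqrt(2)x1x2]
# satisfies phi(xi) . phi(xj) = (1 + xi^T xj)^2.
def phi(x):
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([1.0, s * x1, s * x2, x1**2, x2**2, s * x1 * x2])

rng = np.random.default_rng(0)
xi, xj = rng.normal(size=2), rng.normal(size=2)

explicit = phi(xi) @ phi(xj)          # dot product in the 6-D feature space
kernel   = (1.0 + xi @ xj) ** 2       # kernel evaluated in the 2-D input space
print(explicit, kernel)
```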
How to Use & Choose Kernels
The mapping to the high dimensional space is implicit; that is, we typically do not (need to) know what the mapping function is once we define an appropriate kernel.
For example, by choosing K(xᵢ, xⱼ) = (1 + xᵢᵀxⱼ)³ we are implicitly moving a 2-D x into a 10-D z, whose specific mapping is analogous to the one on the previous slide.
In SVMs, we normally pick a kernel and determine, if we wish to, the corresponding implicit mapping function φ. We do not need to know, or compute, what that mapping is, however, as it is not necessary for the computation of the SVM.
How do we select a kernel? Is K(xᵢ, xⱼ) = (1 + xᵢᵀxⱼ)³ "the" kernel?
Since the SVM formulation has the input data in dot product form, we must select a kernel that can be expressed as a dot product providing the implicit mapping.
…and which kernels K(xᵢ, xⱼ) have an implicit mapping in the form of a dot product? Those that satisfy Mercer's conditions.
Mercer's Theorem
Let K(x₁, x₂) be a continuous and symmetric kernel defined on some closed interval a ≤ x ≤ b. This kernel has an equivalent representation using the dot product of a mapping function φ: x → φ(x) ∈ ℋ, as follows:
    K(x₁, x₂) = Σᵢ₌₁^∞ λᵢ·φᵢ(x₁)·φᵢ(x₂),  λᵢ > 0
where ℋ is a Hilbert space: a generalization of Euclidean space in which the inner product can be defined more generally, not just as the Euclidean dot product.
This expansion is valid if and only if
    ∫ₐᵇ∫ₐᵇ K(x₁, x₂)·ψ(x₁)·ψ(x₂) dx₁dx₂ ≥ 0
is satisfied for any and all arbitrary functions ψ(·) for which ∫ₐᵇ ψ²(x) dx < ∞.
The mapping functions φᵢ(x) are then called the eigenfunctions, and the λᵢ the eigenvalues, of the kernel representation. Note that the condition λᵢ > 0 makes the kernel positive definite: the matrix K, whose (i, j)th entry is K(xᵢ, xⱼ), is positive (semi)definite.
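The practical consequence is easy to check: a Gram matrix built from a Mercer kernel has no negative eigenvalues. A sketch with random data and a Gaussian kernel (σ = 1, values chosen for illustration):

```python
import numpy as np

# Build a Gram matrix K_ij = exp(-||xi - xj||^2 / (2 sigma^2)) on random data
# and confirm it is (numerically) positive semidefinite.
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 2))

sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq_dists / 2.0)           # Gaussian kernel with sigma = 1

eigvals = np.linalg.eigvalsh(K)       # all eigenvalues of a PSD matrix are >= 0
print(K.shape, eigvals.min() >= -1e-10)
```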
Uhm… whatever…
Just Tell Me Which Kernels Satisfy These Mercer's Conditions
Polynomial kernel with degree d: K(xᵢ, xⱼ) = (1 + γxᵢᵀxⱼ)^d
    The user defines the value of d (and γ), which then controls how large the feature space dimensionality will be. As seen earlier, a choice of d = 2 moves a 2-D x into a 6-D z. Similarly, using d = 3 moves a 2-D x into a 10-D z.
Radial basis (Gaussian) kernel with width σ: K(xᵢ, xⱼ) = exp(−‖xᵢ − xⱼ‖²/(2σ²))
    The user defines the kernel width σ. This SVM is closely related to the RBF network. It increases the dimensionality to ∞, as every data point is replaced by a continuous Gaussian. The number of RBFs and their centers are determined by the number of support vectors and their values, respectively.
Hyperbolic tangent (sigmoid; a two-layer MLP): K(xᵢ, xⱼ) = tanh(κxᵢᵀxⱼ + θ)
    The parameters κ and θ are user defined. The number of hidden layer nodes and their values are determined by the number of support vectors and their values, respectively. The hidden-to-output weights are then the Lagrange multipliers αᵢ. This kernel satisfies Mercer's conditions only for certain values of κ and θ.
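The three kernels above, written as functions (the parameter values are illustrative defaults, not prescriptions):

```python
import numpy as np

def poly_kernel(xi, xj, gamma=1.0, d=3):
    """Polynomial kernel (1 + gamma * xi.xj)^d."""
    return (1.0 + gamma * (xi @ xj)) ** d

def rbf_kernel(xi, xj, sigma=0.3):
    """Gaussian (RBF) kernel exp(-||xi - xj||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((xi - xj) ** 2) / (2.0 * sigma ** 2))

def tanh_kernel(xi, xj, kappa=1.0, theta=-1.0):
    """Hyperbolic tangent kernel tanh(kappa * xi.xj + theta)."""
    return np.tanh(kappa * (xi @ xj) + theta)

xi, xj = np.array([1.0, 2.0]), np.array([0.5, -1.0])
for k in (poly_kernel, rbf_kernel, tanh_kernel):
    # every kernel must be symmetric: K(xi, xj) == K(xj, xi)
    assert np.isclose(k(xi, xj), k(xj, xi))

print(poly_kernel(xi, xj), rbf_kernel(xi, xi), tanh_kernel(xi, xj))
```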
SVMs in High Dimensional Space
So then, exactly how do we use the kernel trick with the SVM to obtain nonlinear decision boundaries?
First, recall the original formulation:
    max L_D = −½Σᵢ₌₁ⁿΣⱼ₌₁ⁿ αᵢαⱼyᵢyⱼxᵢᵀxⱼ + Σᵢ₌₁ⁿ αᵢ
The data enter only through the dot products xᵢᵀxⱼ; replacing each dot product with the kernel K(xᵢ, xⱼ) performs the optimization implicitly in the feature space. The decision boundary follows the same substitution.
SVMs in High Dimensional Space
So the new decision boundary is the weighted sum of the kernel function evaluated at the support vectors xᵢ. Recall that only the support vectors have nonzero dual variables αᵢ!
    g(x) = Σᵢ₌₁ⁿ αᵢyᵢK(xᵢ, x) + b = Σ_{xᵢ∈S} αᵢyᵢK(xᵢ, x) + b = 0
    with 0 ≤ αᵢ ≤ C,  Σᵢ₌₁ⁿ αᵢyᵢ = 0
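A sketch of evaluating g(x) with a Gaussian kernel; the support vectors, multipliers, and b below are hypothetical values chosen only to illustrate the sum:

```python
import numpy as np

# Evaluate g(x) = sum_{xi in S} a_i y_i K(xi, x) + b with a Gaussian kernel.
def K(xi, x, sigma=0.5):
    return np.exp(-np.sum((xi - x) ** 2) / (2.0 * sigma ** 2))

X_sv  = np.array([[0.0, 0.0], [1.0, 1.0]])   # assumed support vectors
y_sv  = np.array([1.0, -1.0])
alpha = np.array([1.0, 1.0])                 # nonzero only for support vectors
b     = 0.0

def g(x):
    return sum(a * y * K(xi, x) for a, y, xi in zip(alpha, y_sv, X_sv)) + b

# Points near the +1 support vector score positive; near the -1 one, negative.
print(np.sign(g(np.array([0.1, 0.0]))), np.sign(g(np.array([0.9, 1.0]))))
```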
Some Examples (Gaussian Kernel, σ = 0.3, C = 10)
[Figures: rotating checkerboard data with N = 300 points (a = 0.5, rotation 30); the decision boundaries generated by the SVM classifier with C = 10; and the resulting SVM classification on the rotating checkerboard data.]
Effect of C (C = 0.1)
[Figures: the decision boundaries generated by the SVM classifier and the resulting classification on the rotating checkerboard data, C = 0.1.]
Effect of C (C = 1)
[Figures: the decision boundaries generated by the SVM classifier and the resulting classification on the rotating checkerboard data, C = 1.]
Effect of C (C = 100)
[Figures: the decision boundaries generated by the SVM classifier and the resulting classification on the rotating checkerboard data, C = 100.]
Effect of Kernel Type
[Figures: SVM decision boundaries using the polynomial kernel (C = 100, d = 10) and the Gaussian kernel (C = 100, σ = 0.3).]
Effect of Kernel Type
[Figures: SVM classification using the polynomial kernel (C = 100, d = 10) and the Gaussian kernel (C = 100, σ = 0.3).]
Effect of Kernel Type
[Figures: SVM decision boundaries and classification using the polynomial kernel (C = 100, d = 2) and the Gaussian kernel (C = 100, σ = 0.1).]
When We Come Back…
Why do SVMs work?
• Structural risk minimization
• Statistical learning theory
• VC dimension
Strengths and weaknesses of SVMs
Implementation of SVMs
HOMEWORK: I have placed two tutorial papers on SVMs on the class BB page. Make sure that you read these papers.
• An Introduction to Kernel-Based Learning Algorithms; Müller, Mika, Rätsch, Tsuda and Schölkopf, IEEE TNN, 2001.
• A Tutorial on Support Vector Machines for Pattern Recognition; Burges, Data Mining and Knowledge Discovery, 1998.