
Support Vector Machines & Kernels

Lecture 5

David Sontag
New York University

Slides adapted from Luke Zettlemoyer and Carlos Guestrin


Multi-class SVM
As with the binary SVM, we introduce slack variables and maximize the margin:

To predict, we use the highest-scoring class:

Now, how can we learn it?
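A sketch of one standard way to write this objective and prediction rule, assuming one weight vector w^(y) per class and one slack variable ξ_j per example (this notation is assumed, not taken verbatim from the slide):

% Sketch of a standard multi-class SVM objective (assumed notation)
\min_{w,\,\xi}\;\; \tfrac{1}{2}\sum_{y}\|w^{(y)}\|^2 \;+\; C\sum_{j}\xi_j
\quad\text{s.t.}\quad
w^{(y_j)}\cdot x_j \;\ge\; w^{(y')}\cdot x_j + 1 - \xi_j
\quad \forall j,\; \forall y'\neq y_j,\qquad \xi_j \ge 0

% Prediction rule: pick the class with the highest score
\hat{y} \;=\; \arg\max_{y}\; w^{(y)}\cdot x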


How to deal with imbalanced data?

•  In many practical applications we may have imbalanced data sets
•  We may want errors to be equally distributed between the positive and negative classes
•  A slight modification to the SVM objective does the trick!

Class-specific weighting of the slack variables
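A sketch of the class-weighted objective, assuming separate penalties C_+ and C_- for the two classes (these symbols are introduced here for illustration):

% Sketch: class-specific weighting of the slack variables (assumed symbols C_+, C_-)
\min_{w,\,b,\,\xi}\;\; \tfrac{1}{2}\|w\|^2
\;+\; C_{+}\!\!\sum_{j:\,y_j=+1}\!\!\xi_j
\;+\; C_{-}\!\!\sum_{j:\,y_j=-1}\!\!\xi_j
\quad\text{s.t.}\quad y_j\,(w\cdot x_j + b)\;\ge\;1-\xi_j,\qquad \xi_j\ge 0

% Choosing C_+ > C_- penalizes errors on the positive class more heavily,
% which balances the two error rates when positives are rare.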


What’s Next!
•  Learn one of the most interesting and exciting recent advancements in machine learning
–  The “kernel trick”
–  High dimensional feature spaces at no extra cost!
•  But first, a detour
–  Constrained optimization!

Constrained optimization

[Figure: minimizing x² under different constraints: no constraint gives x* = 0; constraint x ≥ -1 gives x* = 0; constraint x ≥ 1 gives x* = 1]

How do we solve with constraints?


 Lagrange Multipliers!!!
Lagrange multipliers – Dual variables

Example: minimize x² subject to x ≥ b.  Rewrite the constraint as x − b ≥ 0 and add a Lagrange multiplier α ≥ 0.

Introduce Lagrangian (objective):   L(x, α) = x² − α(x − b)

We will solve:   min_x max_{α≥0} L(x, α)

Why is this equivalent?
•  min is fighting max!
•  x < b  ⇒  (x − b) < 0  ⇒  max_{α≥0} −α(x − b) = ∞; min won’t let this happen!
•  x > b, α ≥ 0  ⇒  (x − b) > 0  ⇒  max_{α≥0} −α(x − b) = 0 with α* = 0; min is cool with 0, and L(x, α) = x² (original objective)
•  x = b  ⇒  α can be anything, and L(x, α) = x² (original objective)

The min on the outside forces max to behave, so constraints will be satisfied.
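A minimal numerical sketch of this min-max (not from the lecture), assuming the toy problem min x² s.t. x ≥ 1; the unbounded max over α is approximated with a capped grid:

# Approximate  min_x max_{alpha >= 0}  x^2 - alpha*(x - b)  for b = 1 on a grid.
# The constrained optimum of  min x^2  s.t.  x >= 1  is x* = 1.
import numpy as np

b = 1.0
xs = np.linspace(-2.0, 2.0, 401)        # candidate x values
alphas = np.linspace(0.0, 100.0, 1001)  # alpha >= 0, capped to approximate the unbounded max

# Lagrangian L(x, alpha) = x^2 - alpha*(x - b) evaluated on the grid
L = xs[:, None] ** 2 - alphas[None, :] * (xs[:, None] - b)

inner_max = L.max(axis=1)               # max over alpha for each x (blows up when x < b)
x_star = xs[np.argmin(inner_max)]       # outer min over x
print(f"approximate x* = {x_star:.2f}") # about 1.00, the constrained optimum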
Dual SVM derivation (1) – the linearly separable case

Original optimization problem:

   min_{w,b}  ½ ||w||²   s.t.   yj (w·xj + b) ≥ 1   for all j

Rewrite the constraints as yj (w·xj + b) − 1 ≥ 0, with one Lagrange multiplier αj ≥ 0 per example.

Lagrangian:

   L(w, b, α) = ½ ||w||² − Σj αj [ yj (w·xj + b) − 1 ]

Our goal now is to solve:

   min_{w,b} max_{α≥0} L(w, b, α)


Dual SVM derivation (2) – the linearly separable case

(Primal)   min_{w,b} max_{α≥0} L(w, b, α)

Swap min and max

(Dual)   max_{α≥0} min_{w,b} L(w, b, α)

Slater’s condition from convex optimization guarantees that these two optimization problems are equivalent!
 
Dual SVM derivation (3) – the linearly separable case

(Dual)   max_{α≥0} min_{w,b} L(w, b, α)

Can solve for the optimal w, b as a function of α:

   ∂L/∂w = w − Σj αj yj xj = 0   ⇒   w = Σj αj yj xj
   ∂L/∂b = −Σj αj yj = 0

Substituting these values back in (and simplifying), we obtain:

(Dual)   max_{α≥0}  Σj αj − ½ Σj Σk αj αk yj yk (xj·xk)   s.t.   Σj αj yj = 0

The sums run over all training examples; the αj αk yj yk are scalars, and xj·xk is a dot product.
So, in the dual formulation we will solve for α directly!


•  w and b are computed from α (if needed)
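A small sketch (not from the lecture) of this recovery with scikit-learn, whose SVC stores αj·yj for each support vector in dual_coef_ and the support vectors in support_vectors_:

# Sketch: recover w and b from the dual variables of a linear SVM.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=60, centers=2, random_state=0)
clf = SVC(kernel="linear", C=1.0).fit(X, y)

# w = sum_j alpha_j y_j x_j, summing over the support vectors (alpha_j > 0 there)
w_from_alpha = clf.dual_coef_ @ clf.support_vectors_
print(np.allclose(w_from_alpha, clf.coef_))  # True: matches the primal weight vector
print(clf.intercept_)                        # b, recovered from the tight constraints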
Dual SVM derivation (3) – the linearly separable case

Lagrangian:

   L(w, b, α) = ½ ||w||² − Σj αj [ yj (w·xj + b) − 1 ]

αj > 0 for some j implies the constraint yj (w·xj + b) ≥ 1 is tight. We use this to obtain b:

(1)   yj (w·xj + b) = 1
(2)   b = yj − w·xj            (multiply both sides by yj and use yj² = 1)
(3)   b = yj − Σk αk yk (xk·xj)
Dual for the non-separable case – same basic story (we will skip details)

Primal:

   min_{w,b,ξ}  ½ ||w||² + C Σj ξj   s.t.   yj (w·xj + b) ≥ 1 − ξj,   ξj ≥ 0

Dual:

   max_α  Σj αj − ½ Σj Σk αj αk yj yk (xj·xk)   s.t.   Σj αj yj = 0,   0 ≤ αj ≤ C

What changed?
•  Added upper bound of C on αj!
•  Intuitive explanation:
•  Without slack, αj → ∞ when constraints are violated (points misclassified)
•  Upper bound of C limits the αj, so misclassifications are allowed
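A quick sketch of this effect on assumed synthetic data: a smaller C caps the αj sooner, softens the margin, and typically leaves more training points as support vectors:

# Sketch (assumed data): compare support-vector counts as C varies.
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, flip_y=0.1, random_state=0)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C = {C:>6}: {len(clf.support_)} support vectors")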
Wait a minute: why did we learn about the dual SVM?

•  There are some quadratic programming algorithms that can solve the dual faster than the primal
–  At least for small datasets

•  But, more importantly, the “kernel trick”!!!


Reminder: What if the data is not linearly separable?

Use features of features of features of features….

   φ(x) = [ x(1), …, x(n), x(1) x(2), x(1) x(3), …, e^x(1), … ]ᵀ

Feature space can get really large really quickly!


Higher order polynomials

[Plot: number of monomial terms vs. number of input dimensions, with one curve for each of d = 2, 3, 4]

m – input features
d – degree of polynomial

grows fast!  For d = 6, m = 100: about 1.6 billion terms
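A quick check of that count (a sketch, counting monomials of degree exactly d, of which there are C(m + d − 1, d)):

# Count monomials of degree exactly d in m input features: C(m + d - 1, d).
from math import comb

m, d = 100, 6
print(comb(m + d - 1, d))  # 1609344100, i.e. about 1.6 billion terms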
Dual formulation only depends on dot-products, not on w!

First, we introduce features:

   max_{α≥0}  Σj αj − ½ Σj Σk αj αk yj yk ( φ(xj)·φ(xk) )   s.t.   Σj αj yj = 0

Remember: the examples x only appear in one dot product.

Next, replace the dot product with a Kernel:

   max_{α≥0}  Σj αj − ½ Σj Σk αj αk yj yk K(xj, xk)   s.t.   Σj αj yj = 0

Why is this useful???



Efficient dot-product of polynomials

Polynomials of degree exactly d

d = 1:
   φ(u)·φ(v) = [u1, u2]·[v1, v2] = u1 v1 + u2 v2 = u·v

d = 2:
   φ(u)·φ(v) = [u1², u1 u2, u2 u1, u2²]·[v1², v1 v2, v2 v1, v2²]
             = u1² v1² + 2 u1 v1 u2 v2 + u2² v2²
             = (u1 v1 + u2 v2)²
             = (u·v)²

For any d (we will skip the proof):   φ(u)·φ(v) = (u·v)^d

•  Cool! Taking a dot product and exponentiating gives the same result as mapping into the high-dimensional feature space and then taking the dot product
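A tiny numerical check of the d = 2 identity above (an assumed example, not from the slides):

# For the degree-2 feature map phi(u) = [u1^2, u1*u2, u2*u1, u2^2],
# the explicit dot product phi(u).phi(v) equals (u.v)^2.
import numpy as np

def phi(u):
    return np.array([u[0]**2, u[0]*u[1], u[1]*u[0], u[1]**2])

rng = np.random.default_rng(0)
u, v = rng.normal(size=2), rng.normal(size=2)

explicit = phi(u) @ phi(v)            # O(m^2) explicit features
kernel = (u @ v) ** 2                 # O(m) work, same value
print(np.isclose(explicit, kernel))   # True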
Finally: the “kernel trick”!

•  Never compute features explicitly!!!
–  Compute dot products in closed form
•  Constant-time high-dimensional dot-products for many classes of features
•  But, O(n²) time in size of dataset to compute objective
–  Naïve implementations are slow
–  Much work on speeding this up
Common kernels
•  Polynomials of degree exactly d:   K(u, v) = (u·v)^d

•  Polynomials of degree up to d:   K(u, v) = (u·v + 1)^d

•  Gaussian kernels:   K(u, v) = exp( −||u − v||² / 2σ² )

•  Sigmoid:   K(u, v) = tanh( η u·v + ν )

•  And many others: very active area of research!
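A minimal sketch of these kernels as plain functions (the parameter names sigma, eta, nu are illustrative choices, not fixed by the slides):

# Sketch implementations of the kernels listed above.
import numpy as np

def poly_exact(u, v, d):            # polynomials of degree exactly d
    return (u @ v) ** d

def poly_up_to(u, v, d):            # polynomials of degree up to d
    return (u @ v + 1.0) ** d

def gaussian(u, v, sigma=1.0):      # Gaussian (RBF) kernel
    return np.exp(-np.sum((u - v) ** 2) / (2.0 * sigma ** 2))

def sigmoid(u, v, eta=1.0, nu=0.0): # sigmoid kernel
    return np.tanh(eta * (u @ v) + nu)

u, v = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(poly_exact(u, v, 2), gaussian(u, v))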
