
Support Vector Machine

SVM
Slides for guest lecture presented by Linda Sellie in Spring 2012 for
CS6923, Machine Learning, NYU-Poly
with a few corrections...

http://www.svms.org/tutorials/Hearst-etal1998.pdf

http://www.cs.cornell.edu/courses/cs578/2003fa/slides_sigir03_tutorial-modified.v3.pdf

These slides were prepared by Linda Sellie and Lisa Hellerstein

Which Hyperplane?

[Figure: + and - training points; which line g(x) should separate them?]

g(x) = w^T x + w_0
If g(x) > 0 then f(x) = +1
If g(x) ≤ 0 then f(x) = -1

If w = (3, 4)^T and w_0 = -10, then g(x) = (3, 4)^T x - 10

[Figure: the line g(x) = 0 passes through (2, 1) and (0, 2.5); + points such as (2, 2) and (3, 1) lie above it, - points such as (1, 1) lie below it]


g(2, 1) = (3, 4)·(2, 1) - 10 = 0
g(0, 5/2) = (3, 4)·(0, 5/2) - 10 = 0
so f(2, 1) = f(0, 5/2) = -1
g(2, 2) = (3, 4)·(2, 2) - 10 = 4 > 0, so f(2, 2) = +1
g(3, 1) = (3, 4)·(3, 1) - 10 = 3 > 0, so f(3, 1) = +1
g(1, 1) = (3, 4)·(1, 1) - 10 = -3 ≤ 0, so f(1, 1) = -1
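A quick numeric check of these values (a minimal NumPy sketch; the hyperplane and points are the ones above):

```python
import numpy as np

w, w0 = np.array([3.0, 4.0]), -10.0

def f(x):
    """Classify: +1 if g(x) = w.x + w0 > 0, else -1."""
    return 1 if np.dot(w, x) + w0 > 0 else -1

for p in [(2, 1), (0, 2.5), (2, 2), (3, 1), (1, 1)]:
    print(p, np.dot(w, p) + w0, f(p))
# (2, 1) and (0, 2.5) lie on the hyperplane (g = 0) and get label -1;
# (2, 2) and (3, 1) get +1; (1, 1) gets -1.
```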

(Note: Gaussian Naive Bayes also yields a linear discriminant when there is shared variance for each feature, i.e., for each x_i the estimated variances of the distributions p[x_i | +] and p[x_i | -] are required to be the same.

For the usual Gaussian Naive Bayes, where you don't require shared variance for each feature, the discriminant function is quadratic.

Similarly, if you have Boolean features and you treat them as discrete/categorical features, running the standard NB algorithm for discrete/categorical features will produce a linear discriminant.)

Which line (hyperplane) to choose?


Maximal Margin Hyperplane

[Figures: two separating hyperplanes g(x) for the same + and - points, each shown with its margin; the maximal margin hyperplane is the one whose margin is largest]

How to compute the distance from a point x to the hyperplane g(x) = w^T x + w_0 = 0

Write x = x_p + r (w / ||w||), where x_p is the projection of x onto the hyperplane (so g(x_p) = 0) and r is the signed distance from x to the hyperplane, measured along the normal direction w.

Then
g(x) = w^T x + w_0
     = w^T (x_p + r w/||w||) + w_0
     = (w^T x_p + w_0) + r (w^T w)/||w||
     = r ||w||          (observe that w^T w = ||w||^2)

Thus r = g(x) / ||w||.
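A direct transcription of this formula (a minimal sketch; the function name is mine, not from the slides):

```python
import numpy as np

def signed_distance(x, w, w0):
    """Signed distance r = g(x) / ||w|| from point x to the hyperplane w.x + w0 = 0."""
    return (np.dot(w, x) + w0) / np.linalg.norm(w)
```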

g(x) = w^T x + w_0, with w = (3, 4)^T and w_0 = -10

Distance Formula:  r = g(x) / ||w||

[Figure: the hyperplane g(x) = 0 with + points (2, 2), (3, 1) and - points (1, 1), (1, .5)]

g(2, 2)/||(3, 4)|| = ((3, 4)·(2, 2) - 10)/5 = 4/5
g(1, .5)/||(3, 4)|| = ((3, 4)·(1, .5) - 10)/5 = -1
g(3, 1)/||(3, 4)|| = ((3, 4)·(3, 1) - 10)/5 = 3/5
g(1, 1)/||(3, 4)|| = ((3, 4)·(1, 1) - 10)/5 = -3/5
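These four values can be checked with the distance formula (a small sketch; w and w_0 are from the slide):

```python
import numpy as np

w, w0 = np.array([3.0, 4.0]), -10.0

for p in [(2, 2), (1, 0.5), (3, 1), (1, 1)]:
    print(p, (np.dot(w, p) + w0) / np.linalg.norm(w))
# prints 0.8, -1.0, 0.6, -0.6, i.e. 4/5, -1, 3/5, -3/5
```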

Rescale the hyperplane: g'(x) = w'^T x + w'_0 = (1/3) g(x), so w' = (1/3) w = (1, 4/3)^T and w'_0 = -10/3.

Distance Formula:  r = g'(x) / ||w'||, with ||w'|| = ||(1, 4/3)|| = 5/3.

g'(2, 2)/||w'|| = ((1, 4/3)·(2, 2) - 10/3) / (5/3) = 4/5
g'(1, .5)/||w'|| = ((1, 4/3)·(1, .5) - 10/3) / (5/3) = -1
g'(3, 1)/||w'|| = ((1, 4/3)·(3, 1) - 10/3) / (5/3) = 3/5
g'(1, 1)/||w'|| = ((1, 4/3)·(1, 1) - 10/3) / (5/3) = -3/5

The distances are the same as before: rescaling (w, w_0) changes neither the hyperplane nor the geometric distances.

[Figure: the same hyperplane and labeled points as on the previous slide]
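A one-line check (a sketch; same points as above) that rescaling (w, w_0) by 1/3 leaves the geometric distances unchanged:

```python
import numpy as np

dist = lambda p, w, w0: (np.dot(w, p) + w0) / np.linalg.norm(w)

for p in [(2, 2), (1, 0.5), (3, 1), (1, 1)]:
    print(dist(p, np.array([3.0, 4.0]), -10.0),
          dist(p, np.array([1.0, 4.0 / 3.0]), -10.0 / 3.0))   # identical pairs
```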

We want to classify points in space.


Which hyperplane does SVM choose?

[Figure: linearly separable + and - points; many different separating hyperplanes are possible]

Maximal Margin Hyperplane

[Figure: the maximal margin hyperplane g(x); the closest training points on each side, marked as support vectors, determine the margin]

The margin is the geometric distance from the closest training example to the hyperplane,

γ = y^(+) g(x^(+)) / ||w|| = y^(-) g(x^(-)) / ||w||,

where x^(+) and x^(-) are the closest positive and negative training examples.

We use the hyperplane to classify a point x

f(x) = +1 if w·x + w_0 > 0
f(x) = -1 if w·x + w_0 ≤ 0

g(x) = (3, 4)·x - 10

The hyperplane is defined by all the points which satisfy g(x) = (3, 4)·x - 10 = 0, e.g.

[Figure: the line g(x) = 0 through (2, 1) and (0, 2.5); + points (2, 2) and (3, 1) above it, - point (1, 1) below it]

g(2, 1) = (3, 4)·(2, 1) - 10 = 0
g(0, 2.5) = 0

All the points above the line are positive: g(x) = (3, 4)·x - 10 > 0
e.g. g(2, 2) = (3, 4)·(2, 2) - 10 = 4
     g(3, 1) = (3, 4)·(3, 1) - 10 = 3

All the points below the line are negative: g(x) = (3, 4)·x - 10 < 0
e.g. g(1, 1) = (3, 4)·(1, 1) - 10 = -3

Notice that for any hyperplane we have an infinite number of formulas that describe it!
If (3, 4)·x - 10 = 0, then so does (1/3)((3, 4)·x - 10) = 0,
so does 23((3, 4)·x - 10) = 0,
so does .9876((3, 4)·x - 10) = 0.

(Note, referring to the canonical hyperplane below: this choice of scaling works if it is a maximum margin hyperplane -- since for such a hyperplane, the distance to the closest + example must be equal to the distance to the closest - example.)

The canonical hyperplane for a set of training examples
S = {<(1, 1), -1>, <(2, 2), +1>, <(1, 1/2), -1>, <(3, 2), +1>, ...}
is g'(x) = (1, 4/3)·x - 10/3 (the functional margin is 1).

[Figure: the hyperplane g'(x) = 0 with + points (2, 2), (3, 2), (3, 1) and - points (1, 1), (1, .5); the closest examples (3, 1) and (1, 1) have functional margin 1]

g'(3, 1) = (1, 4/3)·(3, 1) - 10/3 = 1
g'(1, 1) = (1, 4/3)·(1, 1) - 10/3 = -1

For every training example, y^(i) (w'·x^(i) + w'_0) ≥ 1, e.g.
+1 · ((1, 4/3)·(2, 2) - 10/3) ≥ 1
-1 · ((1, 4/3)·(1, 1/2) - 10/3) ≥ 1
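A quick numerical check of these inequalities (a minimal sketch; the points and labels are the ones listed above, plus the + point (3, 1) shown in the figure):

```python
import numpy as np

w_c, w0_c = np.array([1.0, 4.0 / 3.0]), -10.0 / 3.0   # canonical hyperplane g'
S = [((1, 1), -1), ((2, 2), +1), ((1, 0.5), -1), ((3, 2), +1), ((3, 1), +1)]

for x, y in S:
    print(x, y, y * (np.dot(w_c, x) + w0_c))
# Every value is >= 1; the closest examples (1, 1) and (3, 1) give exactly 1.
```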

Remember: the distance from a point x to the hyperplane g(x) = w^T x + w_0 is g(x) / ||w||.

For a canonical hyperplane g(x) = w^T x + w_0 w.r.t. a fixed set of training examples S, the margin is

γ = |g(x^+)| / ||w|| = |g(x^-)| / ||w|| = 1 / ||w||.

Here g(x) = (1, 4/3)^T x - 10/3, so

γ = 1 / ||(1, 4/3)|| = 1 / sqrt(1 + 16/9) = 1 / (5/3) = 3/5.

[Figure: the canonical hyperplane with + points (2, 2), (3, 2), (3, 1) and - points (1, 1), (1, .5); the closest points on each side lie at distance 3/5]

Distance of a closest point x (a support vector) to the hyperplane is r = g(x)/||w|| = 1/||w||.

For the canonical hyperplane the margin is 1/||w|| (assuming a canonical hyperplane).

To find the maximal canonical hyperplane, the goal is to minimize ||w|| (equivalently, to maximize the margin 1/||w||).

For a set of training examples, we can find the maximum margin hyperplane in polynomial time!
To do this, we reformulate the problem as an optimization problem.

There is an algorithm to solve an optimization problem if it has this form:

  min: f(w)
  subject to: ∀i, g_i(w) ≤ 0
              ∀i, h_i(w) = 0

where f is convex, each g_i is convex, and each h_i is affine.

We can use standard techniques to find the optimum.

Finding the largest geometric margin by finding g(x) which solves

  max: γ
  subject to: ∀i,  y^(i)(w^T x^(i) + w_0) / ||w|| ≥ γ

Finding the largest geometric margin by finding g(x) which solves

  min: ||w||
  subject to: ∀i,  y^(i)(w^T x^(i) + w_0) ≥ 1

Finding the largest margin by finding g(x) which solves

  min: (1/2) ||w||^2
  subject to: ∀i,  y^(i)(w^T x^(i) + w_0) ≥ 1
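For small problems this quadratic program can be handed to an off-the-shelf solver. A minimal sketch (not from the slides) using scikit-learn, where a linear kernel with a very large C approximates the hard-margin problem above; the toy points are the handful spelled out on these slides, so the learned hyperplane need not match the slides' (3, 4)·x - 10 exactly (the full training set is only partially listed):

```python
import numpy as np
from sklearn.svm import SVC

# A few of the labeled points named on the slides (the full set is elided with "...").
X = np.array([[1, 1], [1, 0.5], [2, 1.75], [2, 2], [3, 2], [3, 1]])
y = np.array([-1, -1, +1, +1, +1, +1])

# Very large C ~ hard margin: solves min (1/2)||w||^2 s.t. y_i (w.x_i + w0) >= 1.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

print("w  =", clf.coef_[0])                           # learned weight vector
print("w0 =", clf.intercept_[0])                      # learned bias
print("margin =", 1.0 / np.linalg.norm(clf.coef_))    # geometric margin 1/||w||
```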

Solving this constrained quadratic optimization requires that the Karush-Kuhn-Tucker (KKT) conditions are met.

The Karush-Kuhn-Tucker (KKT) conditions imply that v_i is non-zero only if x^(i) is a support vector!

The hypothesis for the set of training examples
S = {<(1, 1), -1>, <(2, 2), +1>, <(1, 1/2), -1>, <(3, 2), +1>, ...}
is g(x) = v_1 (2, 7/4)·x + v_2 (3, 1)·x + v_3 (1, 1)·x + w_0.


Note that only the support vectors are in the
hypothesis.
[Figure: the separating hyperplane w^T x + b = 0 with its margin boundaries w^T x + b = +1 and w^T x + b = -1; the support vectors (2, 7/4), (3, 1) (+) and (1, 1) (-) lie on the margin boundaries, while points such as (2, 2), (3, 2), (1, .5) lie outside the margin]
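Continuing the earlier sketch, a fitted scikit-learn model exposes exactly the quantities in this hypothesis: the support vectors and their coefficients are the only training data that enter the decision function (refit here so the snippet stands alone):

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 1], [1, 0.5], [2, 1.75], [2, 2], [3, 2], [3, 1]])
y = np.array([-1, -1, +1, +1, +1, +1])
clf = SVC(kernel="linear", C=1e6).fit(X, y)

print(clf.support_vectors_)   # only the support vectors appear in the hypothesis
print(clf.dual_coef_)         # their signed coefficients in g(x) = sum_i c_i (x_i . x) + w0
print(clf.intercept_)         # w0
```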

Non-Linearly Separable Data (III)

[Figure: a negative region and a positive region that cannot be separated by a line]

Linearly separable?

g(x) = w^T x + w_0

[Figure: - points clustered near the origin (e.g. (0, 0), (0.5, 0), (0.5, 0.5), (0.75, 0.25)) surrounded by + points farther out (e.g. (1, 1), (0, 1.25), (1.25, 0), (1, 0.75)); no single line g(x) separates the two classes]

Linearly separable?
Transform feature space

Φ: x → Φ(x), where Φ(x) = (x_1^2, x_2^2)^T
g(x) = w^T Φ(x) + w_0, with w = (1, 1) and w_0 = -1

[Figure: the original (x_1, x_2) points, as on the previous slide]

[Figure: the transformed points Φ(x) = (x_1^2, x_2^2): the - points map near the origin (e.g. (0, 0), (0, 0.25), (0.25, 0.25), (0.56, 0.0625)) and the + points map farther out (e.g. (1, 1), (0, 1.5625), (1.5625, 0), (1, 0.56)); the line x_1 + x_2 = 1 now separates them]


g(0.5, 0.5) = (1, 1)·(0.25, 0.25) - 1 ≤ 0
g(1, 1) = (1, 1)·(1, 1) - 1 > 0
g(1, -1) = (1, 1)·(1, 1) - 1 > 0

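A small sketch of this transformed classifier (the weights and example points are taken from the slides):

```python
import numpy as np

def phi(x):
    """Feature map Phi(x) = (x1^2, x2^2)."""
    return np.array([x[0] ** 2, x[1] ** 2])

w, w0 = np.array([1.0, 1.0]), -1.0

for p in [(0.5, 0.5), (1, 1), (1, -1), (0, 1.25), (0.75, 0.25)]:
    g = np.dot(w, phi(p)) + w0
    print(p, g, 1 if g > 0 else -1)
# (0.5, 0.5) and (0.75, 0.25) come out negative; (1, 1), (1, -1), (0, 1.25) come out positive.
```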

Linearly separable?

[Figure: training points on the real line, ticks at 0, 1, 2, 3; the points between 1 and 2 are labeled +, the points below 1 and above 2 are labeled -]

Linearly separable? Yes, by transforming the feature space!
Φ(x) = (x, x^2)
[There is an error in this slide; see the note below.]

[Figure: the same labeled points on the real line, ticks at 0, 1, 2, 3]

The given g is equal to x - 3x^2 + 2, which is positive iff -2/3 < x < 1 (check this by factoring).
So this slide and the next one can be fixed by relabeling the points on the line accordingly.
(Alternatively, change Φ(x) to be (x^2, x) instead of (x, x^2). Then the labeling on the line is correct, but the rest of the example needs to be changed.)

g(Φ(x)) = (1, -3)·Φ(x) + 2

f(x) = +1 if g(Φ(x)) = (1, -3)·Φ(x) + 2 > 0
f(x) = -1 if g(Φ(x)) = (1, -3)·Φ(x) + 2 < 0

g(Φ(1/2)) = g(1/2, 1/4) = (1, -3)·(1/2, 1/4) + 2 = 7/4 > 0
g(Φ(3/2)) = g(3/2, 9/4) = (1, -3)·(3/2, 9/4) + 2 = -13/4 < 0
g(Φ(2)) = g(2, 4) = (1, -3)·(2, 4) + 2 = -8 < 0
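Evaluating the printed g at these three points (a sketch; the signs it produces are exactly what the correction note above refers to):

```python
import numpy as np

phi = lambda x: np.array([x, x ** 2])    # Phi(x) = (x, x^2) as printed on the slide
w, w0 = np.array([1.0, -3.0]), 2.0       # g(Phi(x)) = (1, -3).Phi(x) + 2

for x in [0.5, 1.5, 2.0]:
    print(x, np.dot(w, phi(x)) + w0)
# 0.5 -> 1.75, 1.5 -> -3.25, 2.0 -> -8.0
# (positive only for -2/3 < x < 1, as the correction note says)
```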

These points become linearly separable by transforming the feature space using Φ(x) = (x, x^2).

[Figure: the number line (ticks at 0, 1, 2, 3) and the transformed points in the (x, x^2) plane: (3/4, 9/16) -, (5/4, 25/16) +, (3/2, 9/4) +, (7/4, 49/16) +, with the line g(Φ(x)) = (1, -3)·Φ(x) + 2 = 0]

The points are now linearly separable.

Transform the feature space: map x to Φ(x).

Kernel Function:  K(x, z) = Φ(x)·Φ(z)

KERNEL TRICK
Never compute Φ(x). Just compute K(x, z).
Why is this enough?
If we work with the dual representation of the hyperplane (and the dual quadratic program), the only use of the new features is in inner products!
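A small illustration of the trick (not from the slides): for the degree-2 monomial feature map Φ(x) = (x_i x_j for all i, j), the inner product Φ(x)·Φ(z) equals (x·z)^2, so it can be computed without ever forming Φ(x):

```python
import numpy as np

def phi(x):
    """Explicit degree-2 monomial feature map: all products x_i * x_j."""
    return np.outer(x, x).ravel()

def K(x, z):
    """Kernel trick: K(x, z) = (x . z)^2 equals phi(x) . phi(z), without computing phi."""
    return np.dot(x, z) ** 2

x, z = np.array([1.0, 2.0, 3.0]), np.array([0.5, -1.0, 2.0])
print(np.dot(phi(x), phi(z)))   # 20.25
print(K(x, z))                  # 20.25 -- same value, no explicit feature map needed
```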

Non-Separable Data (IV)

What if the data is not linearly separable because of only a few points?

[Figure: + and - points, with a few points (labeled (3, 1) and (1, 1)) lying on the wrong side of the natural boundary]

What if a small number of points prevents the margin from being large?

[Figure: + and - points that are linearly separable, but a few points (labeled (3, 1) and (1, 1)) lie close to the other class and leave little room between them]

What if a small number of points prevents the margin from being large?

large:
[Figure: a hyperplane with a large margin; a few points (near (3, 1) and (1, 1)) fall inside the margin or on the wrong side of it]
What if =?

small:
[Figure: the same points with a hyperplane whose margin is small]
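The standard way to handle this tradeoff (soft-margin SVM; a sketch using scikit-learn and synthetic data, not anything specific to these slides) is a penalty parameter C that trades margin width against margin violations: a small C prefers a wider margin and tolerates a few misclassified points, while a very large C approaches hard-margin behaviour.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two noisy blobs: mostly separable, with a little overlap.
X = np.vstack([rng.normal([0, 0], 1.0, (50, 2)), rng.normal([3, 3], 1.0, (50, 2))])
y = np.array([-1] * 50 + [+1] * 50)

for C in [0.01, 1.0, 1e6]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 1.0 / np.linalg.norm(clf.coef_)   # geometric margin = 1 / ||w||
    print(f"C={C:g}: margin={margin:.3f}, support vectors={len(clf.support_vectors_)}")
# Smaller C -> larger margin but more margin violations; larger C -> smaller margin.
```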
