
Support Vector Machine

SVM
Slides for guest lecture presented by Linda Sellie in Spring 2012 for
CS6923, Machine Learning, NYU-Poly
with a few corrections...

http://www.svms.org/tutorials/Hearst-etal1998.pdf

http://www.cs.cornell.edu/courses/cs578/2003fa/slides_sigir03_tutorial-modified.v3.pdf

These slides were prepared by Linda Sellie and Lisa Hellerstein

Which Hyperplane?

[Figure: + and - training points; which line g(x) should separate them?]

g(x) = w^T x + w_0
If g(x) > 0 then f(x) = +1
If g(x) ≤ 0 then f(x) = -1

If w = (3, 4)^T and w_0 = -10, then g(x) = (3, 4)^T x - 10

[Figure: the line g(x) = 0 passes through (2, 1) and (0, 2.5); + points such as (2, 2) and (3, 1) lie above it, - points such as (1, 1) lie below it]


g(2, 1) = (3, 4)·(2, 1) - 10 = 0
g(0, 5/2) = (3, 4)·(0, 5/2) - 10 = 0
so f(2, 1) = f(0, 5/2) = -1
g(2, 2) = (3, 4)·(2, 2) - 10 = 4 > 0, so f(2, 2) = +1
g(3, 1) = (3, 4)·(3, 1) - 10 = 3 > 0, so f(3, 1) = +1
g(1, 1) = (3, 4)·(1, 1) - 10 = -3 ≤ 0, so f(1, 1) = -1
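A quick numeric check of these values (a minimal NumPy sketch; the hyperplane and points are the ones above):

```python
import numpy as np

w, w0 = np.array([3.0, 4.0]), -10.0

def f(x):
    """Classify: +1 if g(x) = w.x + w0 > 0, else -1."""
    return 1 if np.dot(w, x) + w0 > 0 else -1

for p in [(2, 1), (0, 2.5), (2, 2), (3, 1), (1, 1)]:
    print(p, np.dot(w, p) + w0, f(p))
# (2, 1) and (0, 2.5) lie on the hyperplane (g = 0) and get label -1;
# (2, 2) and (3, 1) get +1; (1, 1) gets -1.
```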

(Note: Gaussian Naive Bayes also yields a linear discriminant when there is shared variance for each feature, i.e., for each x_i the estimated variances of the distributions p[x_i | +] and p[x_i | -] are required to be the same.

For the usual Gaussian Naive Bayes, where you don't require shared variance for each feature, the discriminant function is quadratic.

Similarly, if you have Boolean features and you treat them as discrete/categorical features, running the standard NB algorithm for discrete/categorical features will produce a linear discriminant.)

Which line (hyperplane) to choose?


Maximal Margin Hyperplane

[Figures: two separating hyperplanes g(x) for the same + and - points, each shown with its margin; the maximal margin hyperplane is the one whose margin is largest]

How to compute the distance from a point x to the hyperplane g(x) = w^T x + w_0 = 0

Write x = x_p + r (w / ||w||), where x_p is the projection of x onto the hyperplane (so g(x_p) = 0) and r is the signed distance from x to the hyperplane, measured along the normal direction w.

Then
g(x) = w^T x + w_0
     = w^T (x_p + r w/||w||) + w_0
     = (w^T x_p + w_0) + r (w^T w)/||w||
     = r ||w||          (observe that w^T w = ||w||^2)

Thus r = g(x) / ||w||.
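A direct transcription of this formula (a minimal sketch; the function name is mine, not from the slides):

```python
import numpy as np

def signed_distance(x, w, w0):
    """Signed distance r = g(x) / ||w|| from point x to the hyperplane w.x + w0 = 0."""
    return (np.dot(w, x) + w0) / np.linalg.norm(w)
```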

g(x) = w^T x + w_0, with w = (3, 4)^T and w_0 = -10

Distance Formula:  r = g(x) / ||w||

[Figure: the hyperplane g(x) = 0 with + points (2, 2), (3, 1) and - points (1, 1), (1, .5)]

g(2, 2)/||(3, 4)|| = ((3, 4)·(2, 2) - 10)/5 = 4/5
g(1, .5)/||(3, 4)|| = ((3, 4)·(1, .5) - 10)/5 = -1
g(3, 1)/||(3, 4)|| = ((3, 4)·(3, 1) - 10)/5 = 3/5
g(1, 1)/||(3, 4)|| = ((3, 4)·(1, 1) - 10)/5 = -3/5
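These four values can be checked with the distance formula (a small sketch; w and w_0 are from the slide):

```python
import numpy as np

w, w0 = np.array([3.0, 4.0]), -10.0

for p in [(2, 2), (1, 0.5), (3, 1), (1, 1)]:
    print(p, (np.dot(w, p) + w0) / np.linalg.norm(w))
# prints 0.8, -1.0, 0.6, -0.6, i.e. 4/5, -1, 3/5, -3/5
```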

Rescale the hyperplane: g'(x) = w'^T x + w'_0 = (1/3) g(x), so w' = (1/3) w = (1, 4/3)^T and w'_0 = -10/3.

Distance Formula:  r = g'(x) / ||w'||, with ||w'|| = ||(1, 4/3)|| = 5/3.

g'(2, 2)/||w'|| = ((1, 4/3)·(2, 2) - 10/3) / (5/3) = 4/5
g'(1, .5)/||w'|| = ((1, 4/3)·(1, .5) - 10/3) / (5/3) = -1
g'(3, 1)/||w'|| = ((1, 4/3)·(3, 1) - 10/3) / (5/3) = 3/5
g'(1, 1)/||w'|| = ((1, 4/3)·(1, 1) - 10/3) / (5/3) = -3/5

The distances are the same as before: rescaling (w, w_0) changes neither the hyperplane nor the geometric distances.

[Figure: the same hyperplane and labeled points as on the previous slide]
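A one-line check (a sketch; same points as above) that rescaling (w, w_0) by 1/3 leaves the geometric distances unchanged:

```python
import numpy as np

dist = lambda p, w, w0: (np.dot(w, p) + w0) / np.linalg.norm(w)

for p in [(2, 2), (1, 0.5), (3, 1), (1, 1)]:
    print(dist(p, np.array([3.0, 4.0]), -10.0),
          dist(p, np.array([1.0, 4.0 / 3.0]), -10.0 / 3.0))   # identical pairs
```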

We want to classify points in space.


Which hyperplane does SVM choose?

[Figure: linearly separable + and - points; many different separating hyperplanes are possible]

Maximal Margin Hyperplane

[Figure: the maximal margin hyperplane g(x); the closest training points on each side, marked as support vectors, determine the margin]

The margin is the geometric distance from the closest training example to the hyperplane,

γ = y^(+) g(x^(+)) / ||w|| = y^(-) g(x^(-)) / ||w||,

where x^(+) and x^(-) are the closest positive and negative training examples.

We use the hyperplane to classify a point x

f(x) = +1 if w·x + w_0 > 0
f(x) = -1 if w·x + w_0 ≤ 0

g(x) = (3, 4)·x - 10

The hyperplane is defined by all the points which satisfy g(x) = (3, 4)·x - 10 = 0, e.g.

[Figure: the line g(x) = 0 through (2, 1) and (0, 2.5); + points (2, 2) and (3, 1) above it, - point (1, 1) below it]

g(2, 1) = (3, 4)·(2, 1) - 10 = 0
g(0, 2.5) = 0

All the points above the line are positive: g(x) = (3, 4)·x - 10 > 0
e.g. g(2, 2) = (3, 4)·(2, 2) - 10 = 4
     g(3, 1) = (3, 4)·(3, 1) - 10 = 3

All the points below the line are negative: g(x) = (3, 4)·x - 10 < 0
e.g. g(1, 1) = (3, 4)·(1, 1) - 10 = -3

Notice that for any hyperplane we have an infinite number of formulas that describe it!
If (3, 4)·x - 10 = 0, then so does (1/3)((3, 4)·x - 10) = 0,
so does 23((3, 4)·x - 10) = 0,
so does .9876((3, 4)·x - 10) = 0.

(Note, referring to the canonical hyperplane below: this choice of scaling works if it is a maximum margin hyperplane -- since for such a hyperplane, the distance to the closest + example must be equal to the distance to the closest - example.)

The canonical hyperplane for a set of training examples
S = {<(1, 1), -1>, <(2, 2), +1>, <(1, 1/2), -1>, <(3, 2), +1>, ...}
is g'(x) = (1, 4/3)·x - 10/3 (the functional margin is 1).

[Figure: the hyperplane g'(x) = 0 with + points (2, 2), (3, 2), (3, 1) and - points (1, 1), (1, .5); the closest examples (3, 1) and (1, 1) have functional margin 1]

g'(3, 1) = (1, 4/3)·(3, 1) - 10/3 = 1
g'(1, 1) = (1, 4/3)·(1, 1) - 10/3 = -1

For every training example, y^(i) (w'·x^(i) + w'_0) ≥ 1, e.g.
+1 · ((1, 4/3)·(2, 2) - 10/3) ≥ 1
-1 · ((1, 4/3)·(1, 1/2) - 10/3) ≥ 1
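A quick numerical check of these inequalities (a minimal sketch; the points and labels are the ones listed above, plus the + point (3, 1) shown in the figure):

```python
import numpy as np

w_c, w0_c = np.array([1.0, 4.0 / 3.0]), -10.0 / 3.0   # canonical hyperplane g'
S = [((1, 1), -1), ((2, 2), +1), ((1, 0.5), -1), ((3, 2), +1), ((3, 1), +1)]

for x, y in S:
    print(x, y, y * (np.dot(w_c, x) + w0_c))
# Every value is >= 1; the closest examples (1, 1) and (3, 1) give exactly 1.
```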

Remember: the distance from a point x to the hyperplane g(x) = w^T x + w_0 is g(x) / ||w||.

For a canonical hyperplane g(x) = w^T x + w_0 w.r.t. a fixed set of training examples S, the margin is

γ = |g(x^+)| / ||w|| = |g(x^-)| / ||w|| = 1 / ||w||.

Here g(x) = (1, 4/3)^T x - 10/3, so

γ = 1 / ||(1, 4/3)|| = 1 / sqrt(1 + 16/9) = 1 / (5/3) = 3/5.

[Figure: the canonical hyperplane with + points (2, 2), (3, 2), (3, 1) and - points (1, 1), (1, .5); the closest points on each side lie at distance 3/5]

Distance of a closest point x (a support vector) to the hyperplane is r = g(x)/||w|| = 1/||w||.

For the canonical hyperplane the margin is 1/||w|| (assuming a canonical hyperplane).

To find the maximal canonical hyperplane, the goal is to minimize ||w|| (equivalently, to maximize the margin 1/||w||).

For a set of training examples, we can find the maximum margin hyperplane in polynomial time!
To do this, we reformulate the problem as an optimization problem.

There is an algorithm to solve an optimization problem if it has this form:

  min: f(w)
  subject to: ∀i, g_i(w) ≤ 0
              ∀i, h_i(w) = 0

where f is convex, each g_i is convex, and each h_i is affine.

We can use standard techniques to find the optimum.

Finding the largest geometric margin by finding g(x) which solves

  max: γ
  subject to: ∀i,  y^(i)(w^T x^(i) + w_0) / ||w|| ≥ γ

Finding the largest geometric margin by finding g(x) which solves

  min: ||w||
  subject to: ∀i,  y^(i)(w^T x^(i) + w_0) ≥ 1

Finding the largest margin by finding g(x) which solves

  min: (1/2) ||w||^2
  subject to: ∀i,  y^(i)(w^T x^(i) + w_0) ≥ 1
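For small problems this quadratic program can be handed to an off-the-shelf solver. A minimal sketch (not from the slides) using scikit-learn, where a linear kernel with a very large C approximates the hard-margin problem above; the toy points are the handful spelled out on these slides, so the learned hyperplane need not match the slides' (3, 4)·x - 10 exactly (the full training set is only partially listed):

```python
import numpy as np
from sklearn.svm import SVC

# A few of the labeled points named on the slides (the full set is elided with "...").
X = np.array([[1, 1], [1, 0.5], [2, 1.75], [2, 2], [3, 2], [3, 1]])
y = np.array([-1, -1, +1, +1, +1, +1])

# Very large C ~ hard margin: solves min (1/2)||w||^2 s.t. y_i (w.x_i + w0) >= 1.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

print("w  =", clf.coef_[0])                           # learned weight vector
print("w0 =", clf.intercept_[0])                      # learned bias
print("margin =", 1.0 / np.linalg.norm(clf.coef_))    # geometric margin 1/||w||
```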

Solving this constrained quadratic optimization requires that the Karush-Kuhn-Tucker (KKT) conditions are met.

The Karush-Kuhn-Tucker (KKT) conditions imply that v_i is non-zero only if x^(i) is a support vector!

The hypothesis for the set of training examples
S = {<(1, 1), -1>, <(2, 2), +1>, <(1, 1/2), -1>, <(3, 2), +1>, ...}
is g(x) = v_1 (2, 7/4)·x + v_2 (3, 1)·x + v_3 (1, 1)·x + w_0.


Note that only the support vectors are in the
hypothesis.
[Figure: the separating hyperplane w^T x + b = 0 with its margin boundaries w^T x + b = +1 and w^T x + b = -1; the support vectors (2, 7/4), (3, 1) (+) and (1, 1) (-) lie on the margin boundaries, while points such as (2, 2), (3, 2), (1, .5) lie outside the margin]
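Continuing the earlier sketch, a fitted scikit-learn model exposes exactly the quantities in this hypothesis: the support vectors and their coefficients are the only training data that enter the decision function (refit here so the snippet stands alone):

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 1], [1, 0.5], [2, 1.75], [2, 2], [3, 2], [3, 1]])
y = np.array([-1, -1, +1, +1, +1, +1])
clf = SVC(kernel="linear", C=1e6).fit(X, y)

print(clf.support_vectors_)   # only the support vectors appear in the hypothesis
print(clf.dual_coef_)         # their signed coefficients in g(x) = sum_i c_i (x_i . x) + w0
print(clf.intercept_)         # w0
```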

Non-Linearly Separable Data (III)

[Figure: a negative region and a positive region that cannot be separated by a line]

Linearly separable?

g(x) = w^T x + w_0

[Figure: - points clustered near the origin (e.g. (0, 0), (0.5, 0), (0.5, 0.5), (0.75, 0.25)) surrounded by + points farther out (e.g. (1, 1), (0, 1.25), (1.25, 0), (1, 0.75)); no single line g(x) separates the two classes]

Linearly separable?
Transform feature space

Φ: x → Φ(x), where Φ(x) = (x_1^2, x_2^2)^T
g(x) = w^T Φ(x) + w_0, with w = (1, 1) and w_0 = -1

[Figure: the original (x_1, x_2) points, as on the previous slide]

[Figure: the transformed points Φ(x) = (x_1^2, x_2^2): the - points map near the origin (e.g. (0, 0), (0, 0.25), (0.25, 0.25), (0.56, 0.0625)) and the + points map farther out (e.g. (1, 1), (0, 1.5625), (1.5625, 0), (1, 0.56)); the line x_1 + x_2 = 1 now separates them]


g(0.5, 0.5) = (1, 1)·(0.25, 0.25) - 1 ≤ 0
g(1, 1) = (1, 1)·(1, 1) - 1 > 0
g(1, -1) = (1, 1)·(1, 1) - 1 > 0

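A small sketch of this transformed classifier (the weights and example points are taken from the slides):

```python
import numpy as np

def phi(x):
    """Feature map Phi(x) = (x1^2, x2^2)."""
    return np.array([x[0] ** 2, x[1] ** 2])

w, w0 = np.array([1.0, 1.0]), -1.0

for p in [(0.5, 0.5), (1, 1), (1, -1), (0, 1.25), (0.75, 0.25)]:
    g = np.dot(w, phi(p)) + w0
    print(p, g, 1 if g > 0 else -1)
# (0.5, 0.5) and (0.75, 0.25) come out negative; (1, 1), (1, -1), (0, 1.25) come out positive.
```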

Linearly separable?

[Figure: training points on the real line, ticks at 0, 1, 2, 3; the points between 1 and 2 are labeled +, the points below 1 and above 2 are labeled -]

Linearly separable? Yes, by transforming the feature space!
Φ(x) = (x, x^2)
[There is an error in this slide; see the note below.]

[Figure: the same labeled points on the real line, ticks at 0, 1, 2, 3]

The given g is equal to x - 3x^2 + 2, which is positive iff -2/3 < x < 1 (check this by factoring).
So this slide and the next one can be fixed by relabeling the points on the line accordingly.
(Alternatively, change Φ(x) to be (x^2, x) instead of (x, x^2). Then the labeling on the line is correct, but the rest of the example needs to be changed.)

g(Φ(x)) = (1, -3)·Φ(x) + 2

f(x) = +1 if g(Φ(x)) = (1, -3)·Φ(x) + 2 > 0
f(x) = -1 if g(Φ(x)) = (1, -3)·Φ(x) + 2 < 0

g(Φ(1/2)) = g(1/2, 1/4) = (1, -3)·(1/2, 1/4) + 2 = 7/4 > 0
g(Φ(3/2)) = g(3/2, 9/4) = (1, -3)·(3/2, 9/4) + 2 = -13/4 < 0
g(Φ(2)) = g(2, 4) = (1, -3)·(2, 4) + 2 = -8 < 0
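Evaluating the printed g at these three points (a sketch; the signs it produces are exactly what the correction note above refers to):

```python
import numpy as np

phi = lambda x: np.array([x, x ** 2])    # Phi(x) = (x, x^2) as printed on the slide
w, w0 = np.array([1.0, -3.0]), 2.0       # g(Phi(x)) = (1, -3).Phi(x) + 2

for x in [0.5, 1.5, 2.0]:
    print(x, np.dot(w, phi(x)) + w0)
# 0.5 -> 1.75, 1.5 -> -3.25, 2.0 -> -8.0
# (positive only for -2/3 < x < 1, as the correction note says)
```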

These points become linearly separable by transforming the feature space using Φ(x) = (x, x^2).

[Figure: the number line (ticks at 0, 1, 2, 3) and the transformed points in the (x, x^2) plane: (3/4, 9/16) -, (5/4, 25/16) +, (3/2, 9/4) +, (7/4, 49/16) +, with the line g(Φ(x)) = (1, -3)·Φ(x) + 2 = 0]

The points are now linearly separable.

Transform the feature space: map x to Φ(x).

Kernel Function:  K(x, z) = Φ(x)·Φ(z)

KERNEL TRICK
Never compute Φ(x). Just compute K(x, z).
Why is this enough?
If we work with the dual representation of the hyperplane (and the dual quadratic program), the only use of the new features is in inner products!
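A small illustration of the trick (not from the slides): for the degree-2 monomial feature map Φ(x) = (x_i x_j for all i, j), the inner product Φ(x)·Φ(z) equals (x·z)^2, so it can be computed without ever forming Φ(x):

```python
import numpy as np

def phi(x):
    """Explicit degree-2 monomial feature map: all products x_i * x_j."""
    return np.outer(x, x).ravel()

def K(x, z):
    """Kernel trick: K(x, z) = (x . z)^2 equals phi(x) . phi(z), without computing phi."""
    return np.dot(x, z) ** 2

x, z = np.array([1.0, 2.0, 3.0]), np.array([0.5, -1.0, 2.0])
print(np.dot(phi(x), phi(z)))   # 20.25
print(K(x, z))                  # 20.25 -- same value, no explicit feature map needed
```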

Non-Separable Data (IV)

What if the data is not linearly separable because of only a few points?

[Figure: + and - points, with a few points (labeled (3, 1) and (1, 1)) lying on the wrong side of the natural boundary]

What if a small number of points prevents the margin from being large?

[Figure: + and - points that are linearly separable, but a few points (labeled (3, 1) and (1, 1)) lie close to the other class and leave little room between them]

What if a small number of points prevents the margin from being large?

large:
[Figure: a hyperplane with a large margin; a few points (near (3, 1) and (1, 1)) fall inside the margin or on the wrong side of it]
What if =?

small:
[Figure: the same points with a hyperplane whose margin is small]
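The standard way to handle this tradeoff (soft-margin SVM; a sketch using scikit-learn and synthetic data, not anything specific to these slides) is a penalty parameter C that trades margin width against margin violations: a small C prefers a wider margin and tolerates a few misclassified points, while a very large C approaches hard-margin behaviour.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two noisy blobs: mostly separable, with a little overlap.
X = np.vstack([rng.normal([0, 0], 1.0, (50, 2)), rng.normal([3, 3], 1.0, (50, 2))])
y = np.array([-1] * 50 + [+1] * 50)

for C in [0.01, 1.0, 1e6]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 1.0 / np.linalg.norm(clf.coef_)   # geometric margin = 1 / ||w||
    print(f"C={C:g}: margin={margin:.3f}, support vectors={len(clf.support_vectors_)}")
# Smaller C -> larger margin but more margin violations; larger C -> smaller margin.
```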
