
Kernel Method

and Support Vector Machines


Nguyen Duc Dung, Ph.D.
IOIT, VAST

Outline

Reference
  Books, papers, slides, software

Support vector machines (SVMs)
  The maximum-margin hyperplane
  Kernel method

Implementation
  Approaches
  Sequential minimal optimization (SMO)
  Open problems

Reference

Book
  Cristianini, N., Shawe-Taylor, J. An Introduction to Support Vector Machines. Cambridge University Press, 2000. http://www.support-vector.net/index.html
  Schölkopf, B., Smola, A. Learning with Kernels. MIT Press, Cambridge, MA, 2002.

Paper
  Burges, C. J. C. A Tutorial on Support Vector Machines for Pattern Recognition. Knowledge Discovery and Data Mining, 2(2), 1998.

Slide
  Cristianini, N. ICML'01 tutorial, 2001.

Software
  LibSVM (NTU), SVMlight (joachims.org)

Online resource
  http://www.kernel-machines.org/

Classification Problem

How would we classify this data set?



Linear Classifiers

Many different lines separate this data set; each one is a valid linear classifier.

Which one is the best classifier?

SVM Solution

The SVM solution is the linear classifier with the maximum margin (the maximum-margin linear classifier).

Margin
of a Linear Function f(x) = w·x + b

Functional margin of an example (x_i, y_i):
  y_i (w·x_i + b)

Geometric margin:
  y_i (w·x_i + b) / ||w||

Margin of a classifier: the minimum geometric margin over the training examples.

SVM solution: the hyperplane with the maximum margin.
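As a minimal sketch (not from the slides; names are illustrative), these definitions translate directly into Python:

```python
import numpy as np

def functional_margin(w, b, x, y):
    # y_i * (w . x_i + b): positive iff the point is on the correct side
    return y * (np.dot(w, x) + b)

def geometric_margin(w, b, x, y):
    # functional margin normalized by ||w||, i.e. the signed distance to the hyperplane
    return functional_margin(w, b, x, y) / np.linalg.norm(w)

def classifier_margin(w, b, X, Y):
    # margin of the classifier on a data set: the smallest geometric margin
    return min(geometric_margin(w, b, x, y) for x, y in zip(X, Y))
```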

A Bound on Expected Risk
of a Linear Classifier f = sign(w·x)

With probability at least (1 - δ), δ ∈ (0,1):

  R[f] ≤ R_emp[f] + sqrt( (c/l) ( (R²Λ²/γ_f²) ln²l + ln(1/δ) ) )

where R_emp is the training error, l is the training set size, γ_f is the margin of f, ||w|| ≤ Λ, ||x|| ≤ R, and c is a constant.

The larger the margin, the smaller the bound.

Finding the Maximum-Margin Classifier

Geometric margin of f on the training set {(x_i, y_i)}:
  γ_f = min_i y_i (w·x_i + b) / ||w||

To maximize γ_f:
  Constrain the functional margin: y_i (w·x_i + b) ≥ 1 for all i
  Minimize the norm of the normal vector, ||w||

Soft and Hard Margin

Hard (maximum) margin:
  min_{w,b} (1/2) ||w||²
  s.t. y_i (w·x_i + b) ≥ 1, i = 1,...,l

Soft (maximum) margin:
  min_{w,b,ξ} (1/2) ||w||² + C Σ_{i=1}^{l} ξ_i^p
  s.t. y_i (w·x_i + b) ≥ 1 - ξ_i,
       ξ_i ≥ 0, i = 1,...,l
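A minimal sketch (not part of the slides) of training a soft-margin linear SVM with scikit-learn; the data and the value C=1.0 are illustrative, and C plays the role of the trade-off constant in the soft-margin objective above:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.5], [3.0, 3.0]])  # toy inputs
y = np.array([-1, -1, 1, 1])                                    # labels in {-1, +1}

clf = SVC(kernel="linear", C=1.0)   # larger C -> closer to the hard-margin solution
clf.fit(X, y)
print(clf.coef_, clf.intercept_)    # w and b of the learned hyperplane
print(clf.support_vectors_)         # training points with alpha_i > 0
```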

Lagrangian Optimization


Kuhn-Tucker Theorem


Optimization

Primal problem:
  min_{w,b,ξ} (1/2) ||w||² + C Σ_{i=1}^{l} ξ_i^p
  s.t. y_i (w·x_i + b) ≥ 1 - ξ_i,
       ξ_i ≥ 0, i = 1,...,l

At the optimum: w = Σ_{i: α_i > 0} y_i α_i x_i

Dual problem:
  min_α (1/2) Σ_{i,j=1}^{l} y_i y_j α_i α_j <x_i, x_j> - Σ_{i=1}^{l} α_i
  s.t. 0 ≤ α_i ≤ C, i = 1,...,l,
       Σ_{i=1}^{l} y_i α_i = 0.
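As an illustration (not from the slides), the dual above can be handed to a generic QP solver; this sketch assumes cvxopt and illustrative names such as svm_dual:

```python
import numpy as np
from cvxopt import matrix, solvers

def svm_dual(X, y, C):
    l = len(y)
    y = y.astype(float)
    K = X @ X.T                                       # Gram matrix <x_i, x_j>
    P = matrix(np.outer(y, y) * K)                    # P_ij = y_i y_j <x_i, x_j>
    q = matrix(-np.ones(l))                           # minus sum_i alpha_i
    G = matrix(np.vstack([-np.eye(l), np.eye(l)]))    # box constraint 0 <= alpha_i <= C
    h = matrix(np.hstack([np.zeros(l), C * np.ones(l)]))
    A = matrix(y.reshape(1, -1))                      # equality constraint sum_i y_i alpha_i = 0
    b = matrix(0.0)
    solvers.options["show_progress"] = False
    alpha = np.ravel(solvers.qp(P, q, G, h, A, b)["x"])
    w = (alpha * y) @ X                               # w = sum_i y_i alpha_i x_i
    free = (alpha > 1e-6) & (alpha < C - 1e-6)        # support vectors strictly inside the box
    b_val = np.mean(y[free] - X[free] @ w)            # b from the margin support vectors
    return w, b_val, alpha
```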

(Linear) Support Vector Machines

Training: quadratic optimization
  min_α (1/2) Σ_{i,j=1}^{l} y_i y_j α_i α_j <x_i, x_j> - Σ_{i=1}^{l} α_i
  s.t. 0 ≤ α_i ≤ C, i = 1,...,l,
       Σ_{i=1}^{l} y_i α_i = 0.
  l variables, l² coefficients

Testing: f(x) = w·x + b

Normal vector of the hyperplane: w = Σ_{i: α_i > 0} y_i α_i x_i

(x_i, α_i) with α_i > 0: support vector
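A small sketch (illustrative names, not from the slides) showing that testing can use either the explicit normal vector or the equivalent dual expansion over support vectors:

```python
import numpy as np

def predict_primal(w, b, x):
    # f(x) = sign(w.x + b)
    return np.sign(np.dot(w, x) + b)

def predict_dual(alpha, b, X_train, y_train, x):
    # f(x) = sign(sum_{alpha_i > 0} y_i alpha_i <x_i, x> + b)
    sv = alpha > 1e-6
    return np.sign(np.sum(alpha[sv] * y_train[sv] * (X_train[sv] @ x)) + b)
```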

Kernel Method

Problem

Most datasets are linearly non-separable

Solution

Map input data into a higher dimensional feature space


Find the optimal hyperplane in feature space


Hyperplane in Feature Space

VC dimension of a class of functions: the maximum number of points that can be shattered.
The VC dimension of linear functions in R^d is d + 1.
The dimension of the feature space is high.
Linear functions in feature space therefore have a high VC dimension, i.e. high capacity.

VC Dimension: Example

Gaussian RBF SVMs of sufficiently small width can classify an arbitrarily large number of training points correctly, and thus have infinite VC dimension.

Linear SVMs

Training: quadratic optimization
  min_α (1/2) Σ_{i,j=1}^{l} y_i y_j α_i α_j <x_i, x_j> - Σ_{i=1}^{l} α_i
  s.t. 0 ≤ α_i ≤ C, i = 1,...,l,
       Σ_{i=1}^{l} y_i α_i = 0.
  l variables, l² coefficients

Testing: f(x) = sign( Σ_{i: α_i > 0} y_i α_i <x, x_i> + b )

Normal vector of the hyperplane: w = Σ_{i: α_i > 0} y_i α_i x_i

(x_i, α_i) with α_i > 0: support vector

SVMs work with pairs of data points (dot products), not with individual samples.

Non-linear SVMs

Kernel: computes the dot product between two vectors in feature space, K(x,y) = <Φ(x), Φ(y)>

Training:
  min_α (1/2) Σ_{i,j=1}^{l} y_i y_j α_i α_j K(x_i, x_j) - Σ_{i=1}^{l} α_i
  s.t. 0 ≤ α_i ≤ C, i = 1,...,l,
       Σ_{i=1}^{l} y_i α_i = 0.

Testing: f(x) = sign( Σ_{i: α_i > 0} y_i α_i K(x, x_i) + b )

Normal vector of the hyperplane: w = Σ_{i: α_i > 0} y_i α_i Φ(x_i)

The maximum-margin algorithm works in feature space only indirectly, through the kernel; the map Φ need not be known explicitly.
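A minimal sketch (illustrative names, not from the slides) of the kernelized decision function above; alpha_sv and b are assumed to come from solving the dual:

```python
import numpy as np

def decision_function(x, X_sv, y_sv, alpha_sv, b, kernel):
    # f(x) = sum over support vectors of y_i alpha_i K(x, x_i), plus b
    return sum(a * yi * kernel(x, xi)
               for a, yi, xi in zip(alpha_sv, y_sv, X_sv)) + b

def predict(x, X_sv, y_sv, alpha_sv, b, kernel):
    return np.sign(decision_function(x, X_sv, y_sv, alpha_sv, b, kernel))
```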

Kernel

Linear: K(x,y) = <x, y>

Gaussian (RBF): K(x,y) = exp(-||x - y||² / (2σ²))
  Dimension of feature space: infinite

Polynomial: K(x,y) = <x, y>^p
  Dimension of feature space: C(d+p-1, p), where d is the input space dimension
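A minimal sketch of the three kernels above; the sigma and degree defaults are illustrative:

```python
import numpy as np

def linear_kernel(x, y):
    return np.dot(x, y)

def gaussian_kernel(x, y, sigma=1.0):
    # RBF kernel; smaller sigma = narrower width (cf. the VC-dimension example)
    return np.exp(-np.linalg.norm(x - y) ** 2 / (2.0 * sigma ** 2))

def polynomial_kernel(x, y, p=3):
    # homogeneous polynomial kernel <x, y>^p
    return np.dot(x, y) ** p
```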

Support Vector Learning

Task: given a set of labeled data
  T = {(x_i, y_i)}_{i=1,...,l} ⊂ R^d × {-1, +1},
find the decision function
  f(x) = sign( Σ_{i: α_i > 0} y_i α_i K(x, x_i) + b )

Training: solve
  min_α (1/2) Σ_{i,j=1}^{l} y_i y_j α_i α_j K(x_i, x_j) - Σ_{i=1}^{l} α_i
  s.t. 0 ≤ α_i ≤ C, i = 1,...,l,
       Σ_{i=1}^{l} y_i α_i = 0.
  Time: O(l³), memory: O(l²)

Testing: time O(N_S), where N_S is the number of support vectors

MNIST Data: SVM vs. Other Methods

Data: handwritten digits, 60,000 training / 10,000 testing examples

Performance (source: http://yann.lecun.com/):

  Method                                            Testing error (%)
  Linear classifier (1-layer NN)                    12.0
  K-nearest-neighbors                               5.0
  40 PCA + quadratic classifier                     3.3
  SVM, Gaussian kernel                              1.4
  2-layer NN, 300 hidden units, mean square error   4.7
  Convolutional net LeNet-4                         1.1

SVM: Probability Output

SVM solution:
  f(x) = Σ_{i: α_i > 0} y_i α_i K(x_i, x) + b

Probability estimation:
  p(y = 1 | x) = 1 / (1 + exp(A f(x) + B))

Maximum likelihood approach:
  (A, B) = argmin_{a,b} F(a,b) = - Σ_{i=1}^{l} [ t_i log(p_i) + (1 - t_i) log(1 - p_i) ]
  where p_i = p(y = 1 | x_i) = 1 / (1 + exp(a f(x_i) + b)), and
    t_i = (N_+ + 1) / (N_+ + 2)  if y_i = +1,
    t_i = 1 / (N_- + 2)          if y_i = -1,   i = 1,...,l
  (N_+: number of positive examples, N_-: number of negative examples)
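A minimal sketch (not the exact published pseudo-code) of fitting the sigmoid parameters A, B by minimizing the negative log-likelihood above with a generic optimizer; fit_platt and its defaults are illustrative:

```python
import numpy as np
from scipy.optimize import minimize

def fit_platt(f_values, y):
    # f_values: decision values f(x_i); y: labels in {-1, +1}
    n_pos, n_neg = np.sum(y == 1), np.sum(y == -1)
    t = np.where(y == 1, (n_pos + 1.0) / (n_pos + 2.0), 1.0 / (n_neg + 2.0))

    def nll(params):
        A, B = params
        p = 1.0 / (1.0 + np.exp(A * f_values + B))
        p = np.clip(p, 1e-12, 1 - 1e-12)          # numerical safety
        return -np.sum(t * np.log(p) + (1 - t) * np.log(1 - p))

    A, B = minimize(nll, x0=[0.0, 0.0], method="Nelder-Mead").x
    return A, B
```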

Outline

Reference
  Books, papers, slides, software

Support vector machines (SVMs)
  The maximum-margin hyperplane
  Kernel method

Implementation
  Approaches
  Sequential minimal optimization
  Open problems

SVM Training

Problem:
  min_α F(α) = (1/2) Σ_{i,j=1}^{l} y_i y_j α_i α_j K_ij - Σ_{i=1}^{l} α_i
  s.t. 0 ≤ α_i ≤ C, i = 1,...,l,
       Σ_{i=1}^{l} y_i α_i = 0.

  Objective function: quadratic w.r.t. α
  Number of variables: l
  Number of parameters: l²
  Constraints: box, linear
  Complexity
    Time: O(l³), or O(N_S³ + N_S² l + N_S d l)
    Memory: O(l²)

Approaches:
  Gradient method
    Modified gradient projection (Bottou et al., 94)
  Quadratic programming (QP)
  Decomposition / divide-and-conquer
    Decomposition algorithms (e.g. Osuna et al., 97; Joachims, 99)
    Sequential minimal optimization (SMO) (Platt, 99)
  Parallelization
    Cascade SVM (Graf et al., 05)
    Parallel mixture of SVMs (Collobert et al., 02)
  Approximation
    Core SVM (Tsang et al., 05, 07)
  Online and active learning (e.g. Bordes et al., 05)
  Combination of methods

Optimality

The Karush-Kuhn-Tucker (KKT) conditions:
  y_i f(x_i) ≥ 1  for α_i = 0,
  y_i f(x_i) ≤ 1  for α_i = C,
  y_i f(x_i) = 1  for 0 < α_i < C,
where
  f(x) = Σ_{i=1}^{l} y_i α_i K(x, x_i) + b
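A small sketch (illustrative, not from the slides) of checking these KKT conditions, given decision values f(x_i) on the training set and a tolerance:

```python
import numpy as np

def kkt_violations(alpha, y, f_values, C, tol=1e-3):
    yf = y * f_values
    viol = np.zeros_like(alpha, dtype=bool)
    viol |= (alpha <= 0) & (yf < 1 - tol)            # alpha_i = 0 requires y_i f(x_i) >= 1
    viol |= (alpha >= C) & (yf > 1 + tol)            # alpha_i = C requires y_i f(x_i) <= 1
    viol |= (alpha > 0) & (alpha < C) & (np.abs(yf - 1) > tol)  # free alphas sit on the margin
    return viol                                      # True where a condition is violated
```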

SMO Algorithm

Initialize the solution (all α_i = 0)
While (!StoppingCondition)
  Select a pair of variables {i, j}
  Optimize analytically over {α_i, α_j}
EndWhile

SMO: Optimization

Problem:
  min_α F(α) = (1/2) Σ_{i,j=1}^{l} y_i y_j α_i α_j K_ij - Σ_{i=1}^{l} α_i
  s.t. 0 ≤ α_i ≤ C, i = 1,...,l,
       Σ_{k=1}^{l} y_k α_k = 0.

For a working pair (i, j): y_i α_i + y_j α_j = const, so α_j = y_j (const - y_i α_i).

Fixing all α_k, k ≠ i, j, the objective becomes a quadratic in α_i alone:
  F(α) = F(α_i) = A α_i² + B α_i + C

Updating scheme (without the box constraint):
  α_i^new = α_i^old + y_i (E_j^old - E_i^old) / η_ij,
  α_j^new = α_j^old + y_j (E_i^old - E_j^old) / η_ij,
where
  E_i = Σ_{k=1}^{l} y_k α_k K(x_k, x_i) - y_i,  i = 1,...,l,
  η_ij = K_ii + K_jj - 2 K_ij
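A minimal sketch (simplified, illustrative names) of this analytic two-variable update, including the clipping to the box [0, C] that a practical SMO step needs so the equality constraint is preserved:

```python
import numpy as np

def optimize_pair(alpha, K, y, C, i, j):
    E = K @ (alpha * y) - y                       # E_k = sum_m y_m alpha_m K(x_m, x_k) - y_k
    eta = K[i, i] + K[j, j] - 2 * K[i, j]
    if eta <= 0:                                  # degenerate pair; real SMO handles this case
        return alpha
    # unconstrained update of alpha_j, then clip to the feasible segment [L, H]
    a_j = alpha[j] + y[j] * (E[i] - E[j]) / eta
    if y[i] == y[j]:
        L, H = max(0, alpha[i] + alpha[j] - C), min(C, alpha[i] + alpha[j])
    else:
        L, H = max(0, alpha[j] - alpha[i]), min(C, C + alpha[j] - alpha[i])
    a_j = np.clip(a_j, L, H)
    # move alpha_i so that y_i alpha_i + y_j alpha_j stays constant
    a_i = alpha[i] + y[i] * y[j] * (alpha[j] - a_j)
    alpha = alpha.copy()
    alpha[i], alpha[j] = a_i, a_j
    return alpha
```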

Selection Heuristic and Stopping Condition

Maximum violating pair:
  i = argmax { E_k | k ∈ I_up }
  j = argmin { E_k | k ∈ I_low }

Maximum gain:
  i = argmax { E_k | k ∈ I_up }
  j = argmax { F_ik | k ∈ I_low, E_k < E_i }   (F_ik: gain from optimizing the pair (i, k))

where
  I_up  = { t | α_t < C, y_t = 1  or  α_t > 0, y_t = -1 }
  I_low = { t | α_t < C, y_t = -1  or  α_t > 0, y_t = 1 }

Stopping condition:
  E_i - E_j ≤ ε  (ε = 10^-3)
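A minimal sketch (illustrative) of the maximum-violating-pair selection and the stopping test, following the index sets I_up / I_low defined above; E is the error vector from the previous slide:

```python
import numpy as np

def select_pair(alpha, y, E, C, eps=1e-3):
    I_up = ((alpha < C) & (y == 1)) | ((alpha > 0) & (y == -1))
    I_low = ((alpha < C) & (y == -1)) | ((alpha > 0) & (y == 1))
    i = np.argmax(np.where(I_up, E, -np.inf))     # most violating index in I_up
    j = np.argmin(np.where(I_low, E, np.inf))     # most violating index in I_low
    if E[i] - E[j] <= eps:                        # stopping condition: no violating pair left
        return None, None
    return i, j
```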

Sequential Minimal Optimization

Training problem:
  min_α (1/2) Σ_{i,j=1}^{l} y_i y_j α_i α_j K(x_i, x_j) - Σ_{i=1}^{l} α_i
  s.t. 0 ≤ α_i ≤ C, i = 1,...,l,
       Σ_{i=1}^{l} y_i α_i = 0.

Functional margin (error term):
  E_i = Σ_{k=1}^{l} y_k α_k K(x_k, x_i) - y_i

Updating scheme:
  α_i^new = α_i^old + y_i (E_j^old - E_i^old) / η_ij,
  α_j^new = α_j^old + y_j (E_i^old - E_j^old) / η_ij.

Selection heuristic:
  i = argmax_k { E_k | k ∈ I_up(α) }
  j = argmax_k { L_ik | k ∈ I_low(α), E_k < E_i }

Stopping condition:
  E_i - E_j ≤ ε

Support Vector Regression (1)

Training data S = {(x_i, y_i)}_{i=1,...,l} ⊂ R^N × R

Linear regressor: y = f(x) = w·x + b

ε-insensitive loss function: |y - f(x)|_ε = max(0, |y - f(x)| - ε)

Support Vector Regression (2)

Optimization: minimize
  (1/2) ||w||² + C Σ_{i=1}^{l} (ξ_i + ξ_i*)
  s.t. y_i - (w·x_i + b) ≤ ε + ξ_i,
       (w·x_i + b) - y_i ≤ ε + ξ_i*,
       ξ_i, ξ_i* ≥ 0, i = 1,...,l

Dual problem:
  min_{α,α*} (1/2) Σ_{i,j=1}^{l} (α_i - α_i*)(α_j - α_j*) <x_i, x_j> + ε Σ_{i=1}^{l} (α_i + α_i*) - Σ_{i=1}^{l} y_i (α_i - α_i*)
  s.t. Σ_{i=1}^{l} (α_i - α_i*) = 0,
       0 ≤ α_i, α_i* ≤ C, i = 1,...,l
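A minimal sketch (not from the slides) of the ε-insensitive loss and a standard library SVR fit; the data, C, and epsilon values are illustrative:

```python
import numpy as np
from sklearn.svm import SVR

def eps_insensitive_loss(y_true, y_pred, eps=0.1):
    # zero inside the epsilon tube, linear outside
    return np.maximum(0.0, np.abs(y_true - y_pred) - eps)

X = np.linspace(0, 1, 50).reshape(-1, 1)
y = 2.0 * X.ravel() + 0.1 * np.random.randn(50)       # noisy linear target
reg = SVR(kernel="linear", C=1.0, epsilon=0.1).fit(X, y)
print(reg.predict([[0.5]]))
```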

Open Problems

Model selection
  Kernel type
  Parameter setting

Multi-class application
  One-versus-rest
  One-versus-one

Categorical data

Speed and size
  Training: time O(N_S² l), space O(N_S l)
  Testing: O(N_S)

Thank you!

dungduc@gmail.com
