
Kernel Method

and Support Vector Machines


Nguyen Duc Dung, Ph.D.
IOIT, VAST

Outline

Reference
  Books, papers, slides, software

Support vector machines (SVMs)
  The maximum-margin hyperplane
  Kernel method

Implementation
  Approaches
  Sequential minimal optimization (SMO)
  Open problems

Reference

Book
  Cristianini, N., Shawe-Taylor, J. An Introduction to Support Vector Machines. Cambridge University Press, 2000. http://www.support-vector.net/index.html
  Schölkopf, B., Smola, A. Learning with Kernels. MIT Press, Cambridge, MA, 2002.

Paper
  Burges, C. J. C. A Tutorial on Support Vector Machines for Pattern Recognition. Knowledge Discovery and Data Mining, 2(2), 1998.

Slide
  Cristianini, N. ICML'01 tutorial, 2001.

Software
  LibSVM (NTU), SVMlight (joachims.org)

Online resource
  http://www.kernel-machines.org/

Classification Problem

How would we classify this data set?



Linear Classifiers

Many different lines separate this data set; each one is a valid linear classifier.

Which one is the best classifier?

SVM Solution

The SVM solution is the linear classifier with the maximum margin (the maximum-margin linear classifier).

Margin
of a Linear Function f(x) = w·x + b

Functional margin of an example (x_i, y_i):
  y_i (w·x_i + b)

Geometric margin:
  y_i (w·x_i + b) / ||w||

Margin of a classifier: the minimum geometric margin over the training examples.

SVM solution: the hyperplane with the maximum margin.
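As a minimal sketch (not from the slides; names are illustrative), these definitions translate directly into Python:

```python
import numpy as np

def functional_margin(w, b, x, y):
    # y_i * (w . x_i + b): positive iff the point is on the correct side
    return y * (np.dot(w, x) + b)

def geometric_margin(w, b, x, y):
    # functional margin normalized by ||w||, i.e. the signed distance to the hyperplane
    return functional_margin(w, b, x, y) / np.linalg.norm(w)

def classifier_margin(w, b, X, Y):
    # margin of the classifier on a data set: the smallest geometric margin
    return min(geometric_margin(w, b, x, y) for x, y in zip(X, Y))
```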

A Bound on Expected Risk
of a Linear Classifier f = sign(w·x)

With probability at least (1 - δ), δ ∈ (0,1):

  R[f] ≤ R_emp[f] + sqrt( (c/l) ( (R²Λ²/γ_f²) ln²l + ln(1/δ) ) )

where R_emp is the training error, l is the training set size, γ_f is the margin of f, ||w|| ≤ Λ, ||x|| ≤ R, and c is a constant.

The larger the margin, the smaller the bound.

Finding the Maximum-Margin Classifier

Geometric margin of f on the training set {(x_i, y_i)}:
  γ_f = min_i y_i (w·x_i + b) / ||w||

To maximize γ_f:
  Constrain the functional margin: y_i (w·x_i + b) ≥ 1 for all i
  Minimize the norm of the normal vector, ||w||

Soft and Hard Margin

Hard (maximum) margin:
  min_{w,b} (1/2) ||w||²
  s.t. y_i (w·x_i + b) ≥ 1, i = 1,...,l

Soft (maximum) margin:
  min_{w,b,ξ} (1/2) ||w||² + C Σ_{i=1}^{l} ξ_i^p
  s.t. y_i (w·x_i + b) ≥ 1 - ξ_i,
       ξ_i ≥ 0, i = 1,...,l
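A minimal sketch (not part of the slides) of training a soft-margin linear SVM with scikit-learn; the data and the value C=1.0 are illustrative, and C plays the role of the trade-off constant in the soft-margin objective above:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.5], [3.0, 3.0]])  # toy inputs
y = np.array([-1, -1, 1, 1])                                    # labels in {-1, +1}

clf = SVC(kernel="linear", C=1.0)   # larger C -> closer to the hard-margin solution
clf.fit(X, y)
print(clf.coef_, clf.intercept_)    # w and b of the learned hyperplane
print(clf.support_vectors_)         # training points with alpha_i > 0
```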

Lagrangian Optimization


Kuhn-Tucker Theorem


Optimization

Primal problem:
  min_{w,b,ξ} (1/2) ||w||² + C Σ_{i=1}^{l} ξ_i^p
  s.t. y_i (w·x_i + b) ≥ 1 - ξ_i,
       ξ_i ≥ 0, i = 1,...,l

At the optimum: w = Σ_{i: α_i > 0} y_i α_i x_i

Dual problem:
  min_α (1/2) Σ_{i,j=1}^{l} y_i y_j α_i α_j <x_i, x_j> - Σ_{i=1}^{l} α_i
  s.t. 0 ≤ α_i ≤ C, i = 1,...,l,
       Σ_{i=1}^{l} y_i α_i = 0.
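As an illustration (not from the slides), the dual above can be handed to a generic QP solver; this sketch assumes cvxopt and illustrative names such as svm_dual:

```python
import numpy as np
from cvxopt import matrix, solvers

def svm_dual(X, y, C):
    l = len(y)
    y = y.astype(float)
    K = X @ X.T                                       # Gram matrix <x_i, x_j>
    P = matrix(np.outer(y, y) * K)                    # P_ij = y_i y_j <x_i, x_j>
    q = matrix(-np.ones(l))                           # minus sum_i alpha_i
    G = matrix(np.vstack([-np.eye(l), np.eye(l)]))    # box constraint 0 <= alpha_i <= C
    h = matrix(np.hstack([np.zeros(l), C * np.ones(l)]))
    A = matrix(y.reshape(1, -1))                      # equality constraint sum_i y_i alpha_i = 0
    b = matrix(0.0)
    solvers.options["show_progress"] = False
    alpha = np.ravel(solvers.qp(P, q, G, h, A, b)["x"])
    w = (alpha * y) @ X                               # w = sum_i y_i alpha_i x_i
    free = (alpha > 1e-6) & (alpha < C - 1e-6)        # support vectors strictly inside the box
    b_val = np.mean(y[free] - X[free] @ w)            # b from the margin support vectors
    return w, b_val, alpha
```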

(Linear) Support Vector Machines

Training: quadratic optimization
  min_α (1/2) Σ_{i,j=1}^{l} y_i y_j α_i α_j <x_i, x_j> - Σ_{i=1}^{l} α_i
  s.t. 0 ≤ α_i ≤ C, i = 1,...,l,
       Σ_{i=1}^{l} y_i α_i = 0.
  l variables, l² coefficients

Testing: f(x) = w·x + b

Normal vector of the hyperplane: w = Σ_{i: α_i > 0} y_i α_i x_i

(x_i, α_i) with α_i > 0: support vector
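A small sketch (illustrative names, not from the slides) showing that testing can use either the explicit normal vector or the equivalent dual expansion over support vectors:

```python
import numpy as np

def predict_primal(w, b, x):
    # f(x) = sign(w.x + b)
    return np.sign(np.dot(w, x) + b)

def predict_dual(alpha, b, X_train, y_train, x):
    # f(x) = sign(sum_{alpha_i > 0} y_i alpha_i <x_i, x> + b)
    sv = alpha > 1e-6
    return np.sign(np.sum(alpha[sv] * y_train[sv] * (X_train[sv] @ x)) + b)
```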

Kernel Method

Problem

Most datasets are linearly non-separable

Solution

Map input data into a higher dimensional feature space


Find the optimal hyperplane in feature space


Hyperplane in Feature Space

VC dimension of a class of functions: the maximum number of points that can be shattered.
The VC dimension of linear functions in R^d is d + 1.
The dimension of the feature space is high.
Linear functions in feature space therefore have a high VC dimension, i.e. high capacity.

VC Dimension: Example

Gaussian RBF SVMs of sufficiently small width can classify an arbitrarily large number of training points correctly, and thus have infinite VC dimension.

Linear SVMs

Training: quadratic optimization
  min_α (1/2) Σ_{i,j=1}^{l} y_i y_j α_i α_j <x_i, x_j> - Σ_{i=1}^{l} α_i
  s.t. 0 ≤ α_i ≤ C, i = 1,...,l,
       Σ_{i=1}^{l} y_i α_i = 0.
  l variables, l² coefficients

Testing: f(x) = sign( Σ_{i: α_i > 0} y_i α_i <x, x_i> + b )

Normal vector of the hyperplane: w = Σ_{i: α_i > 0} y_i α_i x_i

(x_i, α_i) with α_i > 0: support vector

SVMs work with pairs of data points (dot products), not with individual samples.

Non-linear SVMs

Kernel: computes the dot product between two vectors in feature space, K(x,y) = <Φ(x), Φ(y)>

Training:
  min_α (1/2) Σ_{i,j=1}^{l} y_i y_j α_i α_j K(x_i, x_j) - Σ_{i=1}^{l} α_i
  s.t. 0 ≤ α_i ≤ C, i = 1,...,l,
       Σ_{i=1}^{l} y_i α_i = 0.

Testing: f(x) = sign( Σ_{i: α_i > 0} y_i α_i K(x, x_i) + b )

Normal vector of the hyperplane: w = Σ_{i: α_i > 0} y_i α_i Φ(x_i)

The maximum-margin algorithm works in feature space only indirectly, through the kernel; the map Φ need not be known explicitly.
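A minimal sketch (illustrative names, not from the slides) of the kernelized decision function above; alpha_sv and b are assumed to come from solving the dual:

```python
import numpy as np

def decision_function(x, X_sv, y_sv, alpha_sv, b, kernel):
    # f(x) = sum over support vectors of y_i alpha_i K(x, x_i), plus b
    return sum(a * yi * kernel(x, xi)
               for a, yi, xi in zip(alpha_sv, y_sv, X_sv)) + b

def predict(x, X_sv, y_sv, alpha_sv, b, kernel):
    return np.sign(decision_function(x, X_sv, y_sv, alpha_sv, b, kernel))
```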

Kernel

Linear: K(x,y) = <x, y>

Gaussian (RBF): K(x,y) = exp(-||x - y||² / (2σ²))
  Dimension of feature space: infinite

Polynomial: K(x,y) = <x, y>^p
  Dimension of feature space: C(d+p-1, p), where d is the input space dimension
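A minimal sketch of the three kernels above; the sigma and degree defaults are illustrative:

```python
import numpy as np

def linear_kernel(x, y):
    return np.dot(x, y)

def gaussian_kernel(x, y, sigma=1.0):
    # RBF kernel; smaller sigma = narrower width (cf. the VC-dimension example)
    return np.exp(-np.linalg.norm(x - y) ** 2 / (2.0 * sigma ** 2))

def polynomial_kernel(x, y, p=3):
    # homogeneous polynomial kernel <x, y>^p
    return np.dot(x, y) ** p
```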

Support Vector Learning

Task: given a set of labeled data
  T = {(x_i, y_i)}_{i=1,...,l} ⊂ R^d × {-1, +1},
find the decision function
  f(x) = sign( Σ_{i: α_i > 0} y_i α_i K(x, x_i) + b )

Training: solve
  min_α (1/2) Σ_{i,j=1}^{l} y_i y_j α_i α_j K(x_i, x_j) - Σ_{i=1}^{l} α_i
  s.t. 0 ≤ α_i ≤ C, i = 1,...,l,
       Σ_{i=1}^{l} y_i α_i = 0.
  Time: O(l³), memory: O(l²)

Testing: time O(N_S), where N_S is the number of support vectors

MNIST Data: SVM vs. Other Methods

Data: handwritten digits, 60,000 training / 10,000 testing examples

Performance (source: http://yann.lecun.com/):

  Method                                            Testing error (%)
  Linear classifier (1-layer NN)                    12.0
  K-nearest-neighbors                               5.0
  40 PCA + quadratic classifier                     3.3
  SVM, Gaussian kernel                              1.4
  2-layer NN, 300 hidden units, mean square error   4.7
  Convolutional net LeNet-4                         1.1

SVM: Probability Output

SVM solution:
  f(x) = Σ_{i: α_i > 0} y_i α_i K(x_i, x) + b

Probability estimation:
  p(y = 1 | x) = 1 / (1 + exp(A f(x) + B))

Maximum likelihood approach:
  (A, B) = argmin_{a,b} F(a,b) = - Σ_{i=1}^{l} [ t_i log(p_i) + (1 - t_i) log(1 - p_i) ]
  where p_i = p(y = 1 | x_i) = 1 / (1 + exp(a f(x_i) + b)), and
    t_i = (N_+ + 1) / (N_+ + 2)  if y_i = +1,
    t_i = 1 / (N_- + 2)          if y_i = -1,   i = 1,...,l
  (N_+: number of positive examples, N_-: number of negative examples)
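A minimal sketch (not the exact published pseudo-code) of fitting the sigmoid parameters A, B by minimizing the negative log-likelihood above with a generic optimizer; fit_platt and its defaults are illustrative:

```python
import numpy as np
from scipy.optimize import minimize

def fit_platt(f_values, y):
    # f_values: decision values f(x_i); y: labels in {-1, +1}
    n_pos, n_neg = np.sum(y == 1), np.sum(y == -1)
    t = np.where(y == 1, (n_pos + 1.0) / (n_pos + 2.0), 1.0 / (n_neg + 2.0))

    def nll(params):
        A, B = params
        p = 1.0 / (1.0 + np.exp(A * f_values + B))
        p = np.clip(p, 1e-12, 1 - 1e-12)          # numerical safety
        return -np.sum(t * np.log(p) + (1 - t) * np.log(1 - p))

    A, B = minimize(nll, x0=[0.0, 0.0], method="Nelder-Mead").x
    return A, B
```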

Outline

Reference
  Books, papers, slides, software

Support vector machines (SVMs)
  The maximum-margin hyperplane
  Kernel method

Implementation
  Approaches
  Sequential minimal optimization
  Open problems

SVM Training

Problem:
  min_α F(α) = (1/2) Σ_{i,j=1}^{l} y_i y_j α_i α_j K_ij - Σ_{i=1}^{l} α_i
  s.t. 0 ≤ α_i ≤ C, i = 1,...,l,
       Σ_{i=1}^{l} y_i α_i = 0.

  Objective function: quadratic w.r.t. α
  Number of variables: l
  Number of parameters: l²
  Constraints: box, linear
  Complexity
    Time: O(l³), or O(N_S³ + N_S² l + N_S d l)
    Memory: O(l²)

Approaches:
  Gradient method
    Modified gradient projection (Bottou et al., 94)
  Quadratic programming (QP)
  Decomposition / divide-and-conquer
    Decomposition algorithms (e.g. Osuna et al., 97; Joachims, 99)
    Sequential minimal optimization (SMO) (Platt, 99)
  Parallelization
    Cascade SVM (Graf et al., 05)
    Parallel mixture of SVMs (Collobert et al., 02)
  Approximation
    Core SVM (Tsang et al., 05, 07)
  Online and active learning (e.g. Bordes et al., 05)
  Combination of methods

Optimality

The Karush-Kuhn-Tucker (KKT) conditions:
  y_i f(x_i) ≥ 1  for α_i = 0,
  y_i f(x_i) ≤ 1  for α_i = C,
  y_i f(x_i) = 1  for 0 < α_i < C,
where
  f(x) = Σ_{i=1}^{l} y_i α_i K(x, x_i) + b
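A small sketch (illustrative, not from the slides) of checking these KKT conditions, given decision values f(x_i) on the training set and a tolerance:

```python
import numpy as np

def kkt_violations(alpha, y, f_values, C, tol=1e-3):
    yf = y * f_values
    viol = np.zeros_like(alpha, dtype=bool)
    viol |= (alpha <= 0) & (yf < 1 - tol)            # alpha_i = 0 requires y_i f(x_i) >= 1
    viol |= (alpha >= C) & (yf > 1 + tol)            # alpha_i = C requires y_i f(x_i) <= 1
    viol |= (alpha > 0) & (alpha < C) & (np.abs(yf - 1) > tol)  # free alphas sit on the margin
    return viol                                      # True where a condition is violated
```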

SMO Algorithm

Initialize the solution (all α_i = 0)
While (!StoppingCondition)
  Select a pair of variables {i, j}
  Optimize analytically over {α_i, α_j}
EndWhile

SMO: Optimization

Problem:
  min_α F(α) = (1/2) Σ_{i,j=1}^{l} y_i y_j α_i α_j K_ij - Σ_{i=1}^{l} α_i
  s.t. 0 ≤ α_i ≤ C, i = 1,...,l,
       Σ_{k=1}^{l} y_k α_k = 0.

For a working pair (i, j): y_i α_i + y_j α_j = const, so α_j = y_j (const - y_i α_i).

Fixing all α_k, k ≠ i, j, the objective becomes a quadratic in α_i alone:
  F(α) = F(α_i) = A α_i² + B α_i + C

Updating scheme (without the box constraint):
  α_i^new = α_i^old + y_i (E_j^old - E_i^old) / η_ij,
  α_j^new = α_j^old + y_j (E_i^old - E_j^old) / η_ij,
where
  E_i = Σ_{k=1}^{l} y_k α_k K(x_k, x_i) - y_i,  i = 1,...,l,
  η_ij = K_ii + K_jj - 2 K_ij
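A minimal sketch (simplified, illustrative names) of this analytic two-variable update, including the clipping to the box [0, C] that a practical SMO step needs so the equality constraint is preserved:

```python
import numpy as np

def optimize_pair(alpha, K, y, C, i, j):
    E = K @ (alpha * y) - y                       # E_k = sum_m y_m alpha_m K(x_m, x_k) - y_k
    eta = K[i, i] + K[j, j] - 2 * K[i, j]
    if eta <= 0:                                  # degenerate pair; real SMO handles this case
        return alpha
    # unconstrained update of alpha_j, then clip to the feasible segment [L, H]
    a_j = alpha[j] + y[j] * (E[i] - E[j]) / eta
    if y[i] == y[j]:
        L, H = max(0, alpha[i] + alpha[j] - C), min(C, alpha[i] + alpha[j])
    else:
        L, H = max(0, alpha[j] - alpha[i]), min(C, C + alpha[j] - alpha[i])
    a_j = np.clip(a_j, L, H)
    # move alpha_i so that y_i alpha_i + y_j alpha_j stays constant
    a_i = alpha[i] + y[i] * y[j] * (alpha[j] - a_j)
    alpha = alpha.copy()
    alpha[i], alpha[j] = a_i, a_j
    return alpha
```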

Selection Heuristic and Stopping Condition

Maximum violating pair:
  i = argmax { E_k | k ∈ I_up }
  j = argmin { E_k | k ∈ I_low }

Maximum gain:
  i = argmax { E_k | k ∈ I_up }
  j = argmax { F_ik | k ∈ I_low, E_k < E_i }   (F_ik: gain from optimizing the pair (i, k))

where
  I_up  = { t | α_t < C, y_t = 1  or  α_t > 0, y_t = -1 }
  I_low = { t | α_t < C, y_t = -1  or  α_t > 0, y_t = 1 }

Stopping condition:
  E_i - E_j ≤ ε  (ε = 10^-3)
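A minimal sketch (illustrative) of the maximum-violating-pair selection and the stopping test, following the index sets I_up / I_low defined above; E is the error vector from the previous slide:

```python
import numpy as np

def select_pair(alpha, y, E, C, eps=1e-3):
    I_up = ((alpha < C) & (y == 1)) | ((alpha > 0) & (y == -1))
    I_low = ((alpha < C) & (y == -1)) | ((alpha > 0) & (y == 1))
    i = np.argmax(np.where(I_up, E, -np.inf))     # most violating index in I_up
    j = np.argmin(np.where(I_low, E, np.inf))     # most violating index in I_low
    if E[i] - E[j] <= eps:                        # stopping condition: no violating pair left
        return None, None
    return i, j
```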

Sequential Minimal Optimization

Training problem:
  min_α (1/2) Σ_{i,j=1}^{l} y_i y_j α_i α_j K(x_i, x_j) - Σ_{i=1}^{l} α_i
  s.t. 0 ≤ α_i ≤ C, i = 1,...,l,
       Σ_{i=1}^{l} y_i α_i = 0.

Functional margin (error term):
  E_i = Σ_{k=1}^{l} y_k α_k K(x_k, x_i) - y_i

Updating scheme:
  α_i^new = α_i^old + y_i (E_j^old - E_i^old) / η_ij,
  α_j^new = α_j^old + y_j (E_i^old - E_j^old) / η_ij.

Selection heuristic:
  i = argmax_k { E_k | k ∈ I_up(α) }
  j = argmax_k { L_ik | k ∈ I_low(α), E_k < E_i }

Stopping condition:
  E_i - E_j ≤ ε

Support Vector Regression (1)

Training data S = {(x_i, y_i)}_{i=1,...,l} ⊂ R^N × R

Linear regressor: y = f(x) = w·x + b

ε-insensitive loss function: |y - f(x)|_ε = max(0, |y - f(x)| - ε)

Support Vector Regression (2)

Optimization: minimize
  (1/2) ||w||² + C Σ_{i=1}^{l} (ξ_i + ξ_i*)
  s.t. y_i - (w·x_i + b) ≤ ε + ξ_i,
       (w·x_i + b) - y_i ≤ ε + ξ_i*,
       ξ_i, ξ_i* ≥ 0, i = 1,...,l

Dual problem:
  min_{α,α*} (1/2) Σ_{i,j=1}^{l} (α_i - α_i*)(α_j - α_j*) <x_i, x_j> + ε Σ_{i=1}^{l} (α_i + α_i*) - Σ_{i=1}^{l} y_i (α_i - α_i*)
  s.t. Σ_{i=1}^{l} (α_i - α_i*) = 0,
       0 ≤ α_i, α_i* ≤ C, i = 1,...,l
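A minimal sketch (not from the slides) of the ε-insensitive loss and a standard library SVR fit; the data, C, and epsilon values are illustrative:

```python
import numpy as np
from sklearn.svm import SVR

def eps_insensitive_loss(y_true, y_pred, eps=0.1):
    # zero inside the epsilon tube, linear outside
    return np.maximum(0.0, np.abs(y_true - y_pred) - eps)

X = np.linspace(0, 1, 50).reshape(-1, 1)
y = 2.0 * X.ravel() + 0.1 * np.random.randn(50)       # noisy linear target
reg = SVR(kernel="linear", C=1.0, epsilon=0.1).fit(X, y)
print(reg.predict([[0.5]]))
```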

Open Problems

Model selection
  Kernel type
  Parameter setting

Multi-class application
  One-versus-rest
  One-versus-one

Categorical data

Speed and size
  Training: time O(N_S² l), space O(N_S l)
  Testing: O(N_S)

Thank you!

dungduc@gmail.com
