Outline
- Reference
- Implementation
  - Approaches
  - Sequential minimal optimization (SMO)
  - Open problems

Reference
- Book
- Paper
- Software
- Slide
- Online resource: http://www.kernel-machines.org/
Classification Problem
Linear Classifiers
SVM Solution
Margin of a linear function f(x) = w \cdot x + b
- Functional margin of (x_i, y_i): y_i (w \cdot x_i + b)
- Geometric margin: y_i (w \cdot x_i + b) / \|w\|

SVM solution: hard (maximum) margin

\min_{w,b} \frac{1}{2} \|w\|^2
\text{s.t. } y_i (w \cdot x_i + b) \ge 1, \quad i = 1, \dots, l

Soft margin

\min_{w,b,\xi} \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{l} \xi_i
\text{s.t. } y_i (w \cdot x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \dots, l
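The two margin definitions can be checked numerically; a minimal sketch, where `w`, `b`, and the sample point are made up for illustration:

```python
import numpy as np

# Functional and geometric margins of a linear function f(x) = w.x + b.
w = np.array([3.0, 4.0])   # ||w|| = 5
b = -1.0

def functional_margin(x, y):
    # y_i * (w.x_i + b)
    return y * (np.dot(w, x) + b)

def geometric_margin(x, y):
    # functional margin normalized by ||w||
    return functional_margin(x, y) / np.linalg.norm(w)

x, y = np.array([1.0, 2.0]), 1       # f(x) = 3 + 8 - 1 = 10
print(functional_margin(x, y))       # 10.0
print(geometric_margin(x, y))        # 2.0
```

Note that scaling (w, b) rescales the functional margin but leaves the geometric margin unchanged, which is why the hard-margin problem fixes the functional margin to 1.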
Lagrangian Optimization
Karush-Kuhn-Tucker Theorem
Optimization

Primal problem

\min_{w,b,\xi} \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{l} \xi_i
\text{s.t. } y_i (w \cdot x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \dots, l

w = \sum_{\alpha_i > 0} y_i \alpha_i x_i

Dual problem

\min_{\alpha} \frac{1}{2} \sum_{i,j=1}^{l} y_i y_j \alpha_i \alpha_j \langle x_i, x_j \rangle - \sum_{i=1}^{l} \alpha_i
\text{s.t. } 0 \le \alpha_i \le C, \quad i = 1, \dots, l, \quad \sum_{i=1}^{l} y_i \alpha_i = 0.
Training

\min_{\alpha} \frac{1}{2} \sum_{i,j=1}^{l} y_i y_j \alpha_i \alpha_j \langle x_i, x_j \rangle - \sum_{i=1}^{l} \alpha_i
\text{s.t. } 0 \le \alpha_i \le C, \quad i = 1, \dots, l, \quad \sum_{i=1}^{l} y_i \alpha_i = 0.

Quadratic optimization: l variables, l^2 coefficients

Testing

f(x) = w \cdot x + b, \quad w = \sum_{\alpha_i > 0} y_i \alpha_i x_i

The (x_i, \alpha_i) with \alpha_i > 0 are the support vectors.
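At testing time the dual expansion and the primal weight vector give the same decision value; a small sketch, with the multipliers below made up rather than taken from a solved QP:

```python
import numpy as np

# Illustrative "support vectors" with multipliers alpha_i > 0 (made up).
X = np.array([[1.0, 2.0], [2.0, 0.5], [0.0, 1.0]])
y = np.array([1.0, -1.0, 1.0])
alpha = np.array([0.3, 0.5, 0.2])
b = 0.1
x_new = np.array([1.5, 1.0])

# Dual form: f(x) = sum_i alpha_i y_i <x_i, x> + b
f_dual = np.sum(alpha * y * (X @ x_new)) + b

# Primal form: w = sum_i alpha_i y_i x_i, then f(x) = w.x + b
w = (alpha * y) @ X
f_primal = w @ x_new + b

print(np.isclose(f_dual, f_primal))  # True
```

For a linear kernel the primal form is cheaper at test time (one dot product instead of one per support vector).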
Kernel Method
Problem
Solution
VC-dimension of a class of functions: the maximum number of points that can be shattered.

The VC-dimension of linear functions in R^d is d + 1.

The dimension of the feature space is high, so linear functions in feature space have high VC-dimension, i.e. high capacity.
VC Dimension: Example
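The shattering claim can be checked numerically; a sketch, where the capped-perceptron separability test is my own heuristic (the perceptron converges iff a labeling is linearly separable, so a capped run that fails is taken as non-separable):

```python
import itertools
import numpy as np

def separable(X, y, max_epochs=2000):
    # Heuristic linear-separability test via the perceptron.
    Xa = np.hstack([X, np.ones((len(X), 1))])  # absorb the bias term
    w = np.zeros(Xa.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(Xa, y):
            if yi * (w @ xi) <= 0:
                w += yi * xi
                mistakes += 1
        if mistakes == 0:
            return True
    return False

def shattered(X):
    # A set is shattered if every one of the 2^n labelings is separable.
    return all(separable(X, np.array(lab))
               for lab in itertools.product([-1, 1], repeat=len(X)))

three = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
print(shattered(three))  # True: lines in R^2 shatter 3 points (VC-dim 3)

xor = np.array([[0.0, 0.0], [1.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
print(separable(xor, np.array([1, 1, -1, -1])))  # False: XOR labeling
```

No 4-point set in R^2 can be shattered; the XOR labeling above is the classic witness.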
Linear SVMs

Training

\min_{\alpha} \frac{1}{2} \sum_{i,j=1}^{l} y_i y_j \alpha_i \alpha_j \langle x_i, x_j \rangle - \sum_{i=1}^{l} \alpha_i
\text{s.t. } 0 \le \alpha_i \le C, \quad i = 1, \dots, l, \quad \sum_{i=1}^{l} y_i \alpha_i = 0.

Quadratic optimization: l variables, l^2 coefficients

Testing

f(x) = \operatorname{sign}\left( \sum_{\alpha_i > 0} y_i \alpha_i \langle x, x_i \rangle + b \right), \quad w = \sum_{\alpha_i > 0} y_i \alpha_i x_i

The (x_i, \alpha_i) with \alpha_i > 0 are the support vectors.
Non-linear SVMs

Training

\min_{\alpha} \frac{1}{2} \sum_{i,j=1}^{l} y_i y_j \alpha_i \alpha_j K(x_i, x_j) - \sum_{i=1}^{l} \alpha_i
\text{s.t. } 0 \le \alpha_i \le C, \quad i = 1, \dots, l, \quad \sum_{i=1}^{l} y_i \alpha_i = 0.

Testing

f(x) = \operatorname{sign}\left( \sum_{\alpha_i > 0} y_i \alpha_i K(x, x_i) + b \right), \quad w = \sum_{\alpha_i > 0} y_i \alpha_i \phi(x_i)

Kernel (polynomial, degree p): dimension of the feature space is \binom{d + p - 1}{p}, where d is the input space dimension.
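The kernel trick can be verified directly for a small case; a sketch for the homogeneous polynomial kernel with d = 2, p = 2, where the explicit feature map phi is the standard one (not from the slides):

```python
import math
import numpy as np

# Homogeneous polynomial kernel of degree p: K(x, z) = <x, z>^p.
# For d = 2, p = 2 the explicit feature map is
# phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2), of dimension C(d+p-1, p) = 3.
def K(x, z, p=2):
    return (x @ z) ** p

def phi(x):
    return np.array([x[0]**2, math.sqrt(2) * x[0] * x[1], x[1]**2])

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])

print(np.isclose(K(x, z), phi(x) @ phi(z)))  # True: kernel = dot in feature space
print(math.comb(2 + 2 - 1, 2))               # 3: feature-space dimension
```

Evaluating K costs O(d) regardless of the feature-space dimension, which is the point of the kernel method.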
Task

Training: time O(l^3), memory O(l^2)

\min_{\alpha} \frac{1}{2} \sum_{i,j=1}^{l} y_i y_j \alpha_i \alpha_j K(x_i, x_j) - \sum_{i=1}^{l} \alpha_i
\text{s.t. } 0 \le \alpha_i \le C, \quad i = 1, \dots, l, \quad \sum_{i=1}^{l} y_i \alpha_i = 0.

Testing: time O(N_s), where N_s is the number of support vectors

f(x) = \operatorname{sign}\left( \sum_{\alpha_i > 0} y_i \alpha_i K(x, x_i) + b \right)
Data
60,000/10,000 training/testing

Performance

Method                           Testing error (%)
linear classifier (1-layer NN)   12.0
K-nearest-neighbors              5.0
40 PCA + quadratic classifier    3.3
                                 1.4
                                 4.7
SVM solution

f(x) = \sum_{\alpha_i > 0} y_i \alpha_i K(x_i, x) + b

Probability estimation

p(y = 1 \mid x) = \frac{1}{1 + e^{A f(x) + B}}

(A, B) = \arg\min_{a,b} \left\{ -\sum_{i=1}^{l} \left[ t_i \log p_i + (1 - t_i) \log(1 - p_i) \right] \right\},

where p_i = p(y = 1 \mid x_i) = \frac{1}{1 + e^{a f(x_i) + b}} and

t_i = \begin{cases} \dfrac{N_+ + 1}{N_+ + 2} & \text{if } y_i = 1, \\[4pt] \dfrac{1}{N_- + 2} & \text{if } y_i = -1, \end{cases} \quad i = 1, \dots, l \quad (N_+ \text{: \# positive}, \; N_- \text{: \# negative}).
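Platt's sigmoid fit can be sketched with plain gradient descent on the cross-entropy objective; this is a simplification (Platt used a Newton-type solver), and the scores `f` below are made-up SVM outputs:

```python
import numpy as np

# Fit p(y=1|x) = 1/(1 + exp(A*f(x) + B)) against Platt's targets t_i.
f = np.array([-2.0, -1.2, -0.4, 0.3, 1.1, 2.5])   # made-up SVM scores
y = np.array([-1, -1, -1, 1, 1, 1])

n_pos, n_neg = np.sum(y == 1), np.sum(y == -1)
t = np.where(y == 1, (n_pos + 1.0) / (n_pos + 2.0), 1.0 / (n_neg + 2.0))

A, B, lr = 0.0, 0.0, 0.1
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(A * f + B))
    g = t - p                    # d(loss)/d(A*f + B) for this parameterization
    A -= lr * np.sum(g * f)
    B -= lr * np.sum(g)

p = 1.0 / (1.0 + np.exp(A * f + B))
print(A < 0)                     # A ends up negative: larger score -> larger p
print(np.all(np.diff(p) > 0))    # probabilities increase with the score
```

The regularized targets t_i (rather than hard 0/1 labels) keep the fit from saturating on small sample counts.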
Outline
- Reference
- Implementation
  - Approaches
  - Sequential minimal optimization
  - Open problems
SVM Training

Problem

\min_{\alpha} F(\alpha) = \frac{1}{2} \sum_{i,j=1}^{l} y_i y_j \alpha_i \alpha_j K_{ij} - \sum_{i=1}^{l} \alpha_i
\text{s.t. } 0 \le \alpha_i \le C, \quad i = 1, \dots, l, \quad \sum_{i=1}^{l} y_i \alpha_i = 0.

Approach
- Approximation
- Parallelization
- Divide-and-conquer
- Gradient method
- Combination of methods
Optimality

The Karush-Kuhn-Tucker (KKT) conditions:

y_i f(x_i) \ge 1 \text{ for } \alpha_i = 0,
y_i f(x_i) \le 1 \text{ for } \alpha_i = C,
y_i f(x_i) = 1 \text{ for } 0 < \alpha_i < C,

where f(x) = \sum_{i=1}^{l} y_i \alpha_i K(x, x_i) + b.
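The three KKT cases translate directly into a violation checker; a sketch, with a tolerance and the example numbers chosen by me:

```python
import numpy as np

def kkt_violations(alpha, yf, C, tol=1e-3):
    # KKT conditions at optimality:
    #   alpha_i = 0      =>  y_i f(x_i) >= 1
    #   alpha_i = C      =>  y_i f(x_i) <= 1
    #   0 < alpha_i < C  =>  y_i f(x_i) == 1
    bad = []
    for i, (a, m) in enumerate(zip(alpha, yf)):
        if a <= tol and m < 1 - tol:
            bad.append(i)
        elif a >= C - tol and m > 1 + tol:
            bad.append(i)
        elif tol < a < C - tol and abs(m - 1) > tol:
            bad.append(i)
    return bad

alpha = np.array([0.0, 1.0, 0.5, 0.0])
yf    = np.array([1.7, 0.4, 1.0, 0.6])   # y_i * f(x_i), made up
print(kkt_violations(alpha, yf, C=1.0))  # [3]: alpha=0 but margin < 1
```

SMO uses exactly this kind of check to pick which multipliers still need optimizing.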
SMO Algorithm

While the stopping condition is not met
  Select a pair (i, j) that violates the KKT conditions
  Optimize \alpha_i and \alpha_j analytically, keeping all other \alpha_k fixed
EndWhile
SMO: Optimization

Problem

\min_{\alpha} F(\alpha) = \frac{1}{2} \sum_{i,j=1}^{l} y_i y_j \alpha_i \alpha_j K_{ij} - \sum_{i=1}^{l} \alpha_i
\text{s.t. } 0 \le \alpha_i \le C, \quad i = 1, \dots, l, \quad \sum_{k=1}^{l} y_k \alpha_k = 0.

Fixing all \alpha_k, k \ne i, j:

(i, j): \quad y_i \alpha_i + y_j \alpha_j = \text{const} \;\Rightarrow\; \alpha_j = y_j (\text{const} - y_i \alpha_i)

F(\alpha) = F(\alpha_i) = A \alpha_i^2 + B \alpha_i + C

Updating scheme (without the box constraint):

\alpha_i^{new} = \alpha_i^{old} + \frac{y_i (E_j^{old} - E_i^{old})}{\eta_{ij}}, \quad
\alpha_j^{new} = \alpha_j^{old} + \frac{y_j (E_i^{old} - E_j^{old})}{\eta_{ij}},

where

E_i = \sum_{k=1}^{l} y_k \alpha_k K(x_k, x_i) - y_i, \quad i = 1, \dots, l,
\eta_{ij} = K_{ii} + K_{jj} - 2 K_{ij}.
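The two-variable update can be sanity-checked on toy data: it must preserve the equality constraint sum(y_i alpha_i) = 0 and decrease the dual objective. A sketch with a linear kernel and made-up starting multipliers, following the update formulas above without the box constraint:

```python
import numpy as np

X = np.array([[1.0, 0.0], [0.8, 0.3], [0.0, 1.0], [0.2, 0.9]])
y = np.array([1.0, 1.0, -1.0, -1.0])
K = X @ X.T                                # linear kernel matrix
alpha = np.array([0.2, 0.1, 0.25, 0.05])   # satisfies sum(y * alpha) = 0

def dual_objective(a):
    # F(a) = 1/2 sum_ij y_i y_j a_i a_j K_ij - sum_i a_i
    return 0.5 * (y * a) @ K @ (y * a) - a.sum()

def E(a, i):
    # E_i = sum_k y_k a_k K(x_k, x_i) - y_i
    return (y * a) @ K[:, i] - y[i]

i, j = 0, 2
eta = K[i, i] + K[j, j] - 2 * K[i, j]
Ei, Ej = E(alpha, i), E(alpha, j)

new = alpha.copy()
new[i] += y[i] * (Ej - Ei) / eta
new[j] += y[j] * (Ei - Ej) / eta

print(np.isclose((y * new).sum(), (y * alpha).sum()))  # constraint preserved
print(dual_objective(new) < dual_objective(alpha))      # objective decreased
```

The step is exactly the minimizer of F along the feasible line y_i alpha_i + y_j alpha_j = const, which is why the objective can only go down.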
Selection: maximum gain

i = \arg\max \{ E_k \mid k \in I_{up}(\alpha) \}
j = \arg\max \{ L_{ik} \mid k \in I_{low}(\alpha), E_k < E_i \}

where

I_{up} = \{ t \mid \alpha_t < C, y_t = 1 \text{ or } \alpha_t > 0, y_t = -1 \}
I_{low} = \{ t \mid \alpha_t < C, y_t = -1 \text{ or } \alpha_t > 0, y_t = 1 \}

Stopping condition: E_i - E_j \le \epsilon \quad (\epsilon = 10^{-3})
Training problem

\min_{\alpha} \frac{1}{2} \sum_{i,j=1}^{l} y_i y_j \alpha_i \alpha_j K(x_i, x_j) - \sum_{i=1}^{l} \alpha_i
\text{s.t. } 0 \le \alpha_i \le C, \quad i = 1, \dots, l, \quad \sum_{i=1}^{l} y_i \alpha_i = 0.

Updating scheme

\alpha_i^{new} = \alpha_i^{old} + \frac{y_i (E_j^{old} - E_i^{old})}{\eta_{ij}}, \quad
\alpha_j^{new} = \alpha_j^{old} + \frac{y_j (E_i^{old} - E_j^{old})}{\eta_{ij}}.

Functional margin: E_i = \sum_{k=1}^{l} y_k \alpha_k K(x_k, x_i) - y_i

Stopping condition: E_i - E_j \le \epsilon

Selection heuristic

i = \arg\max_k \{ E_k \mid k \in I_{up}(\alpha) \}
j = \arg\max_k \{ L_{ik} \mid k \in I_{low}(\alpha), E_k < E_i \}
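The pieces above assemble into a small trainer. This sketch follows the widely used "simplified SMO" variant (random second index, a running bias b, box clipping) rather than the slide's maximum-gain working-set selection; the toy dataset is made up:

```python
import numpy as np

def smo_train(X, y, C=10.0, tol=1e-3, max_passes=5, seed=0):
    """Simplified SMO with a linear kernel -- a sketch, not the
    maximum-gain selection heuristic from the slides."""
    l = len(y)
    K = X @ X.T
    alpha, b = np.zeros(l), 0.0
    rng = np.random.default_rng(seed)
    f = lambda i: (alpha * y) @ K[:, i] + b
    passes = 0
    while passes < max_passes:
        changed = 0
        for i in range(l):
            Ei = f(i) - y[i]
            if (y[i] * Ei < -tol and alpha[i] < C) or \
               (y[i] * Ei > tol and alpha[i] > 0):
                j = int(rng.integers(l - 1))
                if j >= i:
                    j += 1                      # ensure j != i
                Ej = f(j) - y[j]
                ai, aj = alpha[i], alpha[j]
                if y[i] != y[j]:                # box constraint along the line
                    L, H = max(0.0, aj - ai), min(C, C + aj - ai)
                else:
                    L, H = max(0.0, ai + aj - C), min(C, ai + aj)
                eta = K[i, i] + K[j, j] - 2 * K[i, j]
                if L >= H or eta <= 0:
                    continue
                aj_new = float(np.clip(aj + y[j] * (Ei - Ej) / eta, L, H))
                if abs(aj_new - aj) < 1e-7:
                    continue
                ai_new = ai + y[i] * y[j] * (aj - aj_new)
                # Update the bias from whichever multiplier stays interior.
                b1 = b - Ei - y[i]*(ai_new-ai)*K[i,i] - y[j]*(aj_new-aj)*K[i,j]
                b2 = b - Ej - y[i]*(ai_new-ai)*K[i,j] - y[j]*(aj_new-aj)*K[j,j]
                b = b1 if 0 < ai_new < C else b2 if 0 < aj_new < C else (b1 + b2) / 2
                alpha[i], alpha[j] = ai_new, aj_new
                changed += 1
        passes = passes + 1 if changed == 0 else 0
    return alpha, b

X = np.array([[2.0, 2.0], [2.0, 3.0], [3.0, 3.0],
              [0.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
alpha, b = smo_train(X, y)
pred = np.sign((alpha * y) @ (X @ X.T) + b)
print(np.array_equal(pred, y))
```

On this well-separated toy set the trained multipliers classify every training point correctly; real implementations add the gain-based selection, shrinking, and kernel caching.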
Optimization: minimizing the dual problem
Open Problems
- Model selection
  - Kernel type
  - Parameter setting
- Multi-class application
  - One-versus-rest
  - One-versus-one
- Categorical data
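The two multi-class reductions differ in how many binary SVMs they train and how predictions are combined; a sketch where the pairwise decision function is a stub standing in for trained binary SVMs (my own illustration):

```python
from itertools import combinations

# One-versus-rest trains k classifiers; one-versus-one trains k*(k-1)/2
# and predicts by majority vote over the pairwise decisions.
k = 4
classes = list(range(k))

ovr_models = classes                          # one binary SVM per class
ovo_models = list(combinations(classes, 2))   # one binary SVM per pair
print(len(ovr_models), len(ovo_models))       # 4 6

def ovo_predict(x, pairwise_decision):
    votes = {c: 0 for c in classes}
    for a, b in ovo_models:
        winner = a if pairwise_decision(x, a, b) > 0 else b
        votes[winner] += 1
    return max(votes, key=votes.get)

# Stub: pretend class 2 beats every other class for this input.
stub = lambda x, a, b: 1.0 if a == 2 else (-1.0 if b == 2 else 0.5)
print(ovo_predict(None, stub))  # 2
```

One-versus-one trains more models, but each on a smaller two-class subset, which often makes it cheaper overall for SVMs whose training cost is superlinear in l.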
Thank you!
dungduc@gmail.com