You are on page 1of 25

Support Vector Machines

(SVM)

Y.H. Hu SVM − 1
Outline

Linear pattern classfiers and optimal hyperplane

Optimization problem formulation

Statistical properties of optimal hyperplane

The case of non-separable patterns

Applications to general pattern classification

Mercer's Theorem

Y.H. Hu SVM − 2
Linear Hyperplane Classifier
Given: {(xi, di); i = 1 to N, di ∈
x2
{+1, −1}}.
A linear hyperplane classifier is
a hyperplane consisting of
x points x such that
r
w H = {x| g(x) = wTx + b = 0}

−b/|w| x1 g(x): discriminant function!.

For x in the side of o : wTx + b ≥ 0; d = +1;


For x in the side of ¾: wTx + b ≤ 0; d = −1.
Distance from x to H: r = wTx/|w| − (−b/|w|) = g(x) /|w|

Y.H. Hu SVM − 3
Distance from a Point to a Hyper-plane
The hyperplane H is characterized by
(*) wTx + b = 0 H

w: normal vector perpendicular to H A


C

X*
(*) says any vector x on H that r
w
B
project to w will have a length of
OA = −b/|w|. O

Consider a special point C corresponding to vector x*. Its


projection onto vector w is
wTx*/|w| = OA + BC. Or equivalently, wTx*/|w| = r−b/|w|.
Hence r = (wTx*+b)/|w| = g(x*)/|w|

Y.H. Hu SVM − 4
Optimal Hyperplane: Linearly Separable Case

x2 • Optimal hyperplane should


be in the center of the gap.
• Supporting Vectors −
Samples on the boundaries.
Supporting vectors alone
ρ
ρ can determine optimal
hyperplane.
x1
• Question: How to find
optimal hyperplane?
For di = +1, g(xi) = wTxi + b ≥ ρ|w| ⇒ woTxi + bo ≥ 1
For di = −1, g(xi) = wTxi + b ≤ −ρ|w| ⇒ woTxi + bo ≤ −1

Y.H. Hu SVM − 5
Separation Gap
For xi being a supporting vector,
For di = +1, g(xi) = wTxi + b = ρ|w| ⇒ woTxi + bo = 1
For di = −1, g(xi) = wTxi + b = −ρ|w| ⇒ woTxi + bo = −1
Hence wo = w/(ρ|w|), bo = b/(ρ|w|). But distance from xi to
hyperplane is ρ = g(xi)/|w|. Thus wo = w/g(xi), and ρ = 1/|wo|.
The maximum distance between the two classes is
2ρ = 2/|wo|.
Hence the objective is to find wo, bo to minimize |wo| (so that ρ
is maximized) subject to the constraints that
woTxi + bo ≥ 1 for di = +1; and woTxi + bo ≤ 1 for di = −1.
Combine these constraints, one has: di•(woTxi + bo) ≥ 1

Y.H. Hu SVM − 6
Quadratic Optimization Problem Formulation
Given {(xi, di); i = 1 to N}, find w and b such that
φ(w) = wTw/2
is minimized subject to N constraints
di • (wTxi + b) ≥ 0; 1 ≤ i ≤ N.

Method of Lagrange Multiplier


N

J(w, b, α ) = φ(w) − ∑ α i [d ( w
i =1
i
T
xi + b) − 1]

N
Set ∂J(w,b,α)/∂w = 0 ⇒ w = ∑ α i d i xi
i =1

N
∂J(w,b,α)/∂α = 0 ⇒ ∑ α i di = 0
i =1

Y.H. Hu SVM − 7
Optimization (continued)
The solution of Lagrange multiplier problem is at a saddle
point where the minimum is sought w.r.t. w and b, while the
maximum is sought w.r.t. αi.
Kuhn-Tucker Condition: at the saddle point,
αi[di (wTxi + b) − 1] = 0 for 1 ≤ i ≤ N.
⇒ If xi is NOT a suppor vector, the corresponding αi = 0!
⇒ Hence, only support vector will affect the result of
optimization!

Y.H. Hu SVM − 8
A Numerical Example

(1,−1) (2,1) (3,1)

3 inequalities: 1•w + b ≤ −1; 2•w + b ≥ +1; 3•w + b ≥ + 1


J = w2/2 − α1(−w−b−1) − α2(2w+b−1) − α3(3w+b−1)
∂J/∂w = 0 ⇒ w = − α1 + 2α2 + 3α3
∂J/∂b = 0 ⇒ 0 = α1 − α2 − α3
Solve: (a) −w−b−1 = 0 (b) 2w+b−1 = 0 (c); 3w + b −1 = 0
(b) and (c) conflict each other. Solve (a), (b) yield w = 2,
b = −3. From the Kuhn-Tucker condition, α3 = 0. Thus, α1 = α2
= 2. Hence the solution of decision boundary is: 2x − 3 = 0. or
x = 1.5! This is shown as the dash line in above figure.

Y.H. Hu SVM − 9
Primal/Dual Problem Formulation
Given a constrained optimization problem with a convex cost
function and linear constraints; a dual problem with the
Lagrange multipliers providing the solution can be formulated.
Duality Theorem (Bertsekas 1995)
(a) If the primal problem has an optimal solution, then the dual
problem has an optimal solution with the same optimal
values.
(b) In order for wo to be an optimal solution and αo to be an
optimal dual solution, it is necessary and sufficient that wo
is feasible for the primal problem and
Φ(wo) = J(wo,bo,α o) = Minw J(w,bo,α o)

Y.H. Hu SVM − 10
Formulating the Dual Problem
1 T N N N
J ( w, b,α ) = w w − ∑ α i di w xi − b ∑α i di + ∑α i
T

2 i =1 i =1 i =1

N N
With w = ∑ α i d i xi and ∑ α i d i = 0 . These lead to a
i =1 i =1

Dual Problem
N 1 N N
Maximize Q (α ) = ∑ α i − ∑ ∑ α iα j d i d j xiT x j
i =1 2 i =1 j =1
N
Subject to: ∑ α i d i = 0 and αi ≥ 0 for i = 1, 2, …, N.
i =1

 x12 L x1 xN   α1d1 
 
M   M 
N

Note Q(α ) = ∑α i − [α1d1 L α N d N ] M


1
O
2
i =1
 xN x1 L xN2  α N d N 
 

Y.H. Hu SVM − 11
Numerical Example (cont’d)
1 2 3 − α1 
Q(α ) = ∑ α i − [− α1 α 2 α 3 ] 2 4 6  α 2 
3 1
i =1 2   
 3 6 9  α 3 
or Q(α) = α1 + α2 + α3 − [0.5α12 + 2α22 + 4.5α32 − 2α1α2 −
3α1α3 + 6α2α3]
subject to constraints: −α1 + α2 + α3 = 0, and
α1 ≥ 0, α2 ≥ 0, and α3 ≥ 0.
Use Matlab Optimization tool box command:
x=fmincon(‘qalpha’,X0, A, B, Aeq, Beq)
The solution is [α1 α2 α3] = [2 2 0] as expected.

Y.H. Hu SVM − 12
Implication of Minimizing ||w||
Let D denote the diameter of the smallest hyper-ball that
encloses all the input training vectors {x1, x2, …, xN}. The set
of optimal hyper-planes described by the equation
WoTx + bo = 0
has a VC-dimension h bounded from above as
h ≤ min { D2/ρ2, m0} + 1
where m0 is the dimension of the input vectors, and ρ = 2/||wo||
is the margin of the separation of the hyper-planes.
VC-dimension determines the complexity of the classifier
structure, and usually the smaller the better.

Y.H. Hu SVM − 13
Non-separable Cases
Recall that in linearly separable case, each training sample
pair (xi, di) represents a linear inequality constraint
di(wTxi + b) ≥ 1, i = 1, 2, …, N
If the training samples are not linearly separable, the
constraint can be modified to yield a soft constraint:
di(wTxi + b) ≥ 1−ζi , i = 1, 2, …, N
{ζi; 1 ≤ i ≤ N} are known as slack variables. If ζi > 1, then the
corresponding (xi, di) will be mis-classified.
N
The minimum error classifier would minimize ∑ I (ζ i − 1) , but it
i =1
is non-convex w.r.t. w. Hence an approximation is to minimize
1 T N
Φ ( w, ζ ) = w w + C ∑ ζ i
2 i =1

Y.H. Hu SVM − 14
Primal and Dual Problem Formulation
Primal Optimization Problem Given {(xi, ζi);1≤ i≤ N}. Find w, b
such that
1 T N
Φ ( w, ζ ) = w w + C ∑ ζ i
2 i =1

is minimized subject to the constraints (i) ζi ≥ 0, and (ii)


di(wTxi + b) ≥ 1−ζi for i = 1, 2, …, N.
Dual Optimization Problem Given {(xi, ζi);1≤ i≤ N}. Find
Lagrange multipliers {αi; 1 ≤ i ≤ N} such that
N 1N N
Q(α ) = ∑ α i − ∑ ∑ α iα j d i d j xiT xi
i =1 2 i=1 j =1
is maximized subject to the constraints (i) 0 ≤ αi ≤ C (a user-
N
specified positive number) and (ii) ∑ α i d i = 0
i =1

Y.H. Hu SVM − 15
Solution to the Dual Problem
Optimal Solution to the Dual problem is:
Ns
wo = ∑ α o ,i d i xi Ns: # of support vectors.
i =1

Kuhn-Tucker condition implies for i = 1, 2, …, N,


(i) αi [di (wTxi + b) − 1 + ζi] = 0 (*)
(ii) µi ζi = 0
{µi; 1 ≤ i ≤ N} are Lagrange multipliers to enforce the condition
ζi ≥ 0. At optimal point of the primal problem, ∂ Φ/∂ζi = 0. One
may deduce that ζi = 0 if αi ≤ C. Solving (*), we have
1 N
bo = ∑ α i ( wT xi − d i )
N i =1

Y.H. Hu SVM − 16
Matlab Implementation
% svm1.m: basic support vector machine
% X: N by m matrix. i-th row is x_i
% d: N by 1 vector. i-th element is d_i
% X, d should be loaded from file or read from input.
% call MATLAB optimization tool box function fmincon
a0=eps*ones(N,1);
C = 1;
a=fmincon('qfun',a0,[],[],d',0,zeros(N,1),C*ones(N,1),…
[],[],X,d)
wo=X'*(a.*d)
bo=sum(diag(a)*(X*wo-d))/sum([a > 10*eps])

function y=qfun(a,X,d);
% the Q(a) function. Note that it is actually -Q(a)
% because we call fmincon to minimize -Q(a) is
% the same as to maximize Q(a)
[N,m]=size(X);
y=-ones(1,N)*a+0.5*a'*diag(d)*X*X'*diag(d)*a;

Y.H. Hu SVM − 17
Inner Product Kernels
In general, if the input is first transformed via a set of nonlinear
functions {φi(x)} and then subject to the hyperplane classifier
p p
g ( x) = ∑ w jϕ j ( x) + b =∑ w jϕ j ( x) = wT ϕ b = w0 ; ϕ 0 ( x) = 1
j =1 j =0

Define the inner product kernel as


p
K ( x, y ) = ∑ ϕ j ( x)ϕ j ( y ) = ϕϕ T
j =0

one may obtain a dual optimization problem formulation as:


N
1 N N
Q (α ) = ∑ α i − ∑∑ α iα j d i d j K ( xi , x j )
i =1 2 i=1 j =1

Often, dim of ϕ (=p+1) >> dim of x!

Y.H. Hu SVM − 18
General Pattern Recognition with SVM

ϕ1(x)
w1 b
x1
ϕ2(x) w2
x2
xi •
+ di

• • •
• •
• •
xm
wp
ϕp(x)

By careful selection of the nonlinear transformation {φj(x); 1 ≤ j


≤ p}, any pattern recognition problem can be solved.

Y.H. Hu SVM − 19
Polynomial Kernel
Consider a polynomial kernel
m m m m
K ( x, y ) = (1 + x y ) = 1 + 2∑ xi yi + 2∑ ∑ xi yi x j y j + ∑ xi2 yi2
T 2

i =1 i =1 j = i +1 i =1

Let K(x,y) = ϕT(x) ϕ(x), then


ϕ(x) = [1 x12, …, xm2, √2 x1, …, √2xm, √2 x1 x2, …, √2 x1xm,
√2 x2 x3, …, √2 x2xm, …,√2 xm−1xm]
= [1 ϕ1(x), …, ϕp(x)]
where p = 1 +m + m + (m−1) + (m−2) + … + 1 = (m+2)(m+1)/2
Hence, using a kernel, a low dimensional pattern classification
problem (with dimension m) is solved in a higher dimensional
space (dimension p+1). But only φj(x) corresponding to
support vectors are used for pattern classification!

Y.H. Hu SVM − 20
Numerical Example: XOR Problem
Training samples:
(−1 −1; −1), (−1 1 +1),
(1 −1 +1), (1 1 −1)

x = [x1, x2]T. Use K(x,xi) = (1 + xTxi)2 one has


ϕ(x) = [1 x12 x22 √2 x1, √2 x2, √2 x1x2]T
1 1 1 − 2 − 2 2  9 1 1 1
  1 1
1 1 1 − 2 2 − 2 9 1
Φ= ; K ( x i , x j ) = ΦΦ T =  ;
1 1 0 2 − 2 − 2 1 1 9 1
  1
1 1 1 2 2 2   1 1 9

ϕ(x)] = 6 > dim[x] = 2!


Note dim[ϕ

Y.H. Hu SVM − 21
XOR Problem (Continued)
Note that K(xi, xj) can be calculated directly without using Φ!
−1 −1
E.g. K1,1 = (1 + [− 1 − 1]  ) 2 = 9; K1, 2 = (1 + [− 1 − 1]  ) 2 = 1
− 1 1
The corresponding Lagrange multiplier α = (1/8)[1 1 1 1]T.
N
w= ∑ α i d iϕ (xi ) = ΦT [α1d1 α 2 d 2 L α N d N ]
T

i =1

= (1/8)(−1) ϕ(x1) + (1/8)(1)ϕ(x2) + (1/8)(1)ϕ(x3) + (1/8)(−1)ϕ(x4)


= [0 0 0 0 0 −1/√2]T

Hence the hyperplane is: y = wTϕ(x) = − x1x2


(x1, x2) (−1, −1) (−1, +1) (+1,−1) (+1,+1)

y = −1 x1x2 −1 +1 +1 −1
Y.H. Hu SVM − 22
Other Types of Kernels

type of SVM K(x,y) Comments


Polynomial learning (xTy + 1)p p: selected a priori
machine
Radial basis  1 2  σ : selected a priori
2
exp − 2 || x − y || 
function  2σ 
Two-layer tanh(βoxTy + β1) only some βo and β1
perceptron values are feasible.

What kernel is feasible? It must satisfy the "Mercer's


theorem"!

Y.H. Hu SVM − 23
Mercer's Theorem
Let K(x,y) be a continuous, symmetric kernel, defined on a≤
x,y ≤ b. K(x,y) admits an eigen-function expansion

K ( x, y ) = ∑ λiϕ i (x)ϕ i ( y )
i =1

with λi > 0 for each i. This expansion converges absolutely and


uniformly if and only if
aa
∫ ∫ K ( x, y )ψ (x)ψ ( y )dxdy ≥ 0
bb

a
for all ψ(x) such that ∫ ψ 2 (x)dx < ∞ .
b

Y.H. Hu SVM − 24
Testing with Kernels
For many types of kernels, ϕ(x) can not be explicitly
represented or even found. However,
N
w= ∑ i i i
α d ϕ (x ) = Φ T
[α d
1 1 α d
2 2 L α d
N N ] = Φ T
f
T

i =1

y(x) = wTϕ(x) = fT Φ ϕ(x) = fT K(xi,x) = K(x, xj) f

Hence there is no need to know ϕ(x) explicitly! For example,


in the XOR problem, f = (1/8)[−1 +1 +1 −1]T. Suppose that x
= (−1, +1), then
− 1 / 8
  
− 1 − 1 − 1 − 1 2   1 / 8 
y ( x) = K ( x, x j ) f = 1 + [ − 1 − 1] ) 2 1 + [ − 1 1]  ) 2 1 + [1 − 1]  ) 2 1 + [1 1]  ) 
 1 1 1  1   1 / 8 
 
− 1 / 8
= [1 9 1 1][− 1 / 8 1 / 8 1 / 8 − 1 / 8] = 1
T

Y.H. Hu SVM − 25