Proceedings of ICSP2000

Classification Mechanism of Support Vector Machines

Chen Junli    Jiao Licheng
Key Lab. for Radar Signal Processing, Xidian Univ., Xi'an 710071, China
Email: chenj1@rsp.xidian.edu.cn
Abstract
The purpose of this paper is to provide an introductory tutorial on the basic ideas behind Support Vector Machines (SVMs). The paper starts with an overview of the Structural Risk Minimization (SRM) principle and describes the mechanism of how to construct SVMs. For a two-class pattern recognition problem, we discuss in detail the classification mechanism of SVMs in three cases: linearly separable, linearly non-separable and non-linear. Finally, for the non-linear case, we give a new function mapping technique: by choosing an appropriate kernel function, the SVMs can map the low dimensional input space into a high dimensional feature space, and construct an optimal separating hyperplane with maximum margin in the feature space.

Key words
Support Vector Machines (SVM)
Structural Risk Minimization (SRM)
Kernel function
Feature space

1. Introduction
For a given learning task with a finite amount of training data, the best generalization performance will be achieved if the right balance is struck between the accuracy attained on that particular training set and the capacity of the machine, that is, the ability of the machine to learn any training set without error. In practical application, traditional neural network approaches have suffered difficulties with generalization, producing models that can overfit the data, especially for small data sets. Vapnik proposed the SVM algorithm in the late seventies; it combines several techniques from statistics, machine learning and neural networks. Its formulation embodies the Structural Risk Minimization (SRM) principle. By use of the kernel function mapping technique, SVMs can achieve good classification generalization from learning on small data sets. Moreover, SVMs have been receiving increasing attention in both theoretical and engineering application areas due to their many attractive features. For the pattern recognition case, SVMs have been used for isolated handwritten digit recognition, object recognition, speaker identification, face detection in images, and text categorization, etc. [5],[6]. For regression estimation, SVMs have been compared on benchmark time series prediction tests and linear operator inversion, etc. [7].

2. Classification Mechanism of SVMs
2.1 Structural Risk Minimization (SRM)
The SVM algorithm is based on statistical learning theory and embodies the SRM principle. Firstly we describe the SRM principle.

The idea of the SRM principle is to create a structure of nested hypothesis spaces

S_1 \subset S_2 \subset \cdots \subset S_n \subset \cdots

such that S_h is a hypothesis space of VC dimension h, where the VC dimension is a scalar value that measures the capacity of a set of functions. SRM consists in solving the following problem:

\min_{S_h} \; R_{emp}[f] + \sqrt{\frac{h\left(\ln\frac{2l}{h}+1\right)-\ln\frac{\delta}{4}}{l}}    (1)

where R_{emp}[f] is the empirical risk, l is the number of training data, h is the VC dimension, and 1-\delta is the confidence with which the bound holds.

Equation (1) shows that the construction idea of SVMs is to optimize over the structure so as to minimize the second term of (1) while keeping the first term unchanged. This method is substantially different from Empirical Risk Minimization (ERM), which is adopted in traditional neural networks.
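To make the trade-off in (1) concrete, the short Python sketch below evaluates the bound for a few hypothetical models of increasing VC dimension; the sample size, confidence level and per-model empirical risks are illustrative assumptions, not values taken from the paper.

# Illustrative sketch of the SRM trade-off in equation (1).
# All numbers here (l, delta, the per-model empirical risks) are made up
# for demonstration; they do not come from the paper.
import math

def vc_confidence(h, l, delta):
    """Second term of (1): VC confidence for VC dimension h, l samples, confidence 1 - delta."""
    return math.sqrt((h * (math.log(2.0 * l / h) + 1.0) - math.log(delta / 4.0)) / l)

def guaranteed_risk(r_emp, h, l, delta):
    """Right-hand side of (1): empirical risk plus VC confidence."""
    return r_emp + vc_confidence(h, l, delta)

l, delta = 200, 0.05
# A nested structure S_1 < S_2 < ...: richer models fit the training set better
# (lower empirical risk) but have larger VC dimension.
models = [(0.20, 5), (0.10, 20), (0.05, 60), (0.02, 150)]  # (R_emp, h) pairs

for r_emp, h in models:
    print(f"h={h:4d}  R_emp={r_emp:.2f}  bound={guaranteed_risk(r_emp, h, l, delta):.3f}")
# SRM selects the element of the structure with the smallest bound,
# not the one with the smallest empirical risk.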

2.2 Classification Mechanism of SVMs


We discuss in detail the classification mechanism of SVMs in the three cases of linearly separable, linearly non-separable and non-linear data, through a two-class pattern recognition problem.

2.2.1 Linearly Separable Example
Assume that we are given a set of training data which belong to two separate classes,

(x_1, y_1), \ldots, (x_m, y_m), \quad x \in R^N, \; y \in \{-1, +1\}

The goal is to find some decision function g(x) = \mathrm{sgn}(f(x)) that accurately predicts the labels of unseen data (x, y) and minimizes the classification error. f(x) is a linear function,

f(x) = (x \cdot w) + b, \quad w \in R^N, \; b \in R    (2)

This gives a classification rule whose decision boundary \{x \mid f(x) = 0\} is an (N-1)-dimensional hyperplane separating the classes +1 and -1 from each other. Figure 1 depicts the situation. The problem of learning from data can be formulated as finding a set of parameters (w, b) such that \mathrm{sgn}((w \cdot x_i) + b) = y_i for all 1 \le i \le m.

Figure 1. A binary classification problem: the separating hyperplane (w \cdot x) + b = 0 with margin boundaries (w \cdot x) + b = +1 and (w \cdot x) + b = -1.

Figure 1 shows that the margin \rho(w, b) is

\rho(w, b) = \frac{2}{\|w\|}    (3)

The optimal separating hyperplane is given by maximizing the margin, because a large margin makes the estimate reliable on the training set and also makes the estimate perform well on unseen examples. Thus we could solve the following optimization problem:

Minimize   \Phi(w) = \frac{1}{2}\|w\|^2    (4)

Subject to   y_i((w \cdot x_i) + b) \ge 1, \quad i = 1, \ldots, m    (5)

This constrained optimization problem is dealt with by introducing Lagrange multipliers \alpha_i \ge 0 and a Lagrangian

L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{m} \alpha_i \left[ y_i((w \cdot x_i) + b) - 1 \right]    (6)

The Lagrangian L has to be minimized with respect to w and b and maximized with respect to \alpha_i. Classical Lagrangian duality enables the primal problem, (6), to be transformed to its dual problem, which is easier to solve. The dual problem is given by

\max_{\alpha} W(\alpha) = \max_{\alpha} \left[ \min_{w, b} L(w, b, \alpha) \right]    (7)

The minimum with respect to w and b of the Lagrangian L is given by

\frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{m} \alpha_i y_i = 0, \qquad \frac{\partial L}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i=1}^{m} \alpha_i y_i x_i    (8)

Hence from (6), (7) and (8), the dual problem is

\max_{\alpha} W(\alpha) = \max_{\alpha} \left[ -\frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) + \sum_{i=1}^{m} \alpha_i \right]    (9)

with constraints

\alpha_i \ge 0, \; i = 1, \ldots, m, \qquad \sum_{i=1}^{m} \alpha_i y_i = 0    (10)
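Before turning to the role of the multipliers, here is a minimal numerical illustration of (9)-(10), assuming NumPy/SciPy and a made-up linearly separable 2-D data set; it is a sketch, not the authors' implementation.

# Minimal sketch: solve the dual problem (9) under the constraints (10)
# for a toy linearly separable data set. The data and the solver choice
# are illustrative assumptions, not taken from the paper.
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [2.5, 3.0], [3.0, 2.5],    # class +1
              [0.0, 0.5], [0.5, 0.0], [-0.5, 0.5]])  # class -1
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
m = len(y)

G = (y[:, None] * y[None, :]) * (X @ X.T)   # G_ij = y_i y_j (x_i . x_j)

def neg_dual(a):                            # minimizing -W(alpha) maximizes W(alpha)
    return 0.5 * a @ G @ a - a.sum()

cons = [{"type": "eq", "fun": lambda a: a @ y}]   # sum_i alpha_i y_i = 0
bounds = [(0.0, None)] * m                        # alpha_i >= 0

res = minimize(neg_dual, np.zeros(m), bounds=bounds, constraints=cons)
alpha = res.x
print("alpha =", np.round(alpha, 4))
# The non-zero multipliers mark the support vectors; they enter (8) and (12).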
The second term of (8) shows that the solution vector has an expansion in terms of a subset of the training patterns, namely those patterns whose Lagrange multiplier \alpha_i is non-zero. By the Karush-Kuhn-Tucker complementarity conditions, these training patterns are the ones for which

\alpha_i \left( y_i((x_i \cdot w) + b) - 1 \right) = 0, \quad i = 1, \ldots, m    (11)

and therefore they correspond precisely to the Support Vectors (SVs). If the data is linearly separable, all the support vectors lie on the margin, so the hyperplane is determined by a small subset of the training set; the other points could be removed from the training set and recalculating the hyperplane would produce the same answer.

Thus the decision function can be written as

f(x) = \mathrm{sgn}\left( \sum_{SVs} \alpha_i y_i (x_i \cdot x) + b \right)    (12)

where b is computed using (11).
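A self-contained way to reproduce (8), (11) and (12) in practice is sketched below using scikit-learn's SVC with a very large C as a stand-in for the hard-margin machine; the library choice and the toy data are assumptions made for illustration.

# Self-contained sketch (not the authors' implementation): a linear SVC with
# a very large C approximates the hard-margin classifier of section 2.2.1.
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [2.5, 3.0], [3.0, 2.5],
              [0.0, 0.5], [0.5, 0.0], [-0.5, 0.5]])
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)

# dual_coef_ stores y_i * alpha_i for the support vectors, so (8) gives w:
w = clf.dual_coef_ @ clf.support_vectors_
b = clf.intercept_
print("support vector indices:", clf.support_)
print("w =", w.ravel(), " b =", b)

# Decision function (12): sign of the SV expansion plus the bias b.
x_new = np.array([[1.5, 1.5]])
print("sign of f(x_new):", np.sign(clf.decision_function(x_new)))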
2.2.2 Linearly Non-Separable Example
For the linearly non-separable example, one can introduce slack variables \xi_i \ge 0. The constraints of (5) are modified to

y_i((w \cdot x_i) + b) \ge 1 - \xi_i, \quad i = 1, \ldots, m    (13)

The generalized optimal separating hyperplane is determined by the vector w that minimizes the functional

\Phi(w, \xi) = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} \xi_i    (14)

subject to the constraints of (13). The solution to the optimization problem of (14) under the constraints of (13) is given by the saddle point of the Lagrangian

L(w, b, \xi, \alpha, \beta) = \frac{1}{2}(w \cdot w) + C \sum_{i=1}^{m} \xi_i - \sum_{i=1}^{m} \alpha_i \left[ ((x_i \cdot w) + b) y_i - 1 + \xi_i \right] - \sum_{i=1}^{m} \beta_i \xi_i    (15)

where \alpha_i, \beta_i are the Lagrange multipliers. As before, Lagrangian duality enables the primal problem, (15), to be transformed to its dual problem. The dual problem is given by

\max_{\alpha, \beta} W(\alpha, \beta) = \max_{\alpha, \beta} \left[ \min_{w, b, \xi} L(w, b, \xi, \alpha, \beta) \right]    (16)

Taking the minimum with respect to w, b and \xi_i of the Lagrangian L, the dual problem is

\max_{\alpha} W(\alpha) = \max_{\alpha} \left[ -\frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) + \sum_{i=1}^{m} \alpha_i \right]    (17)

with constraints

0 \le \alpha_i \le C, \; i = 1, \ldots, m, \qquad \sum_{i=1}^{m} \alpha_i y_i = 0    (18)

The solution to this minimization problem is identical to the separable case except for a modification of the bounds of the Lagrange multipliers. The parameter C introduces additional capacity control within the classifier. In some circumstances C can be directly related to a regularization parameter.
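As a rough illustration of how C controls capacity in (14) and (18), the sketch below (again an assumption built on scikit-learn and synthetic data, not the paper's experiment) counts how many multipliers reach the upper bound C for different values of C.

# Illustrative sketch of the capacity-control role of C (section 2.2.2).
# The synthetic data and the use of scikit-learn are assumptions for the example.
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.5, random_state=0)
y = 2 * y - 1   # relabel classes to {-1, +1}

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    alpha = np.abs(clf.dual_coef_).ravel()      # alpha_i of the support vectors
    at_bound = np.sum(np.isclose(alpha, C))     # multipliers clipped at C (xi_i > 0)
    print(f"C={C:7.2f}  support vectors={len(alpha):3d}  alpha_i at C: {at_bound:3d}")
# A small C tolerates many margin violations (many alpha_i at the bound);
# a large C behaves more like the separable case of section 2.2.1.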
2.2.3 Non-Linear Example
In the case where a linear boundary is inappropriate, SVMs can map the input vector into a high dimensional feature space. By choosing a non-linear mapping, the SVMs construct an optimal separating hyperplane in this higher dimensional space. K(x, y) is the kernel function performing the non-linear mapping into feature space.

Figure 2. The non-linear mapping mechanism.

The optimization problem of (17) becomes

\max_{\alpha} W(\alpha) = \max_{\alpha} \left[ -\frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j K(x_i, x_j) + \sum_{i=1}^{m} \alpha_i \right]    (19)

and the constraints (18) are unchanged. Solving (19) with the constraints (18) determines the Lagrange multipliers, and a classifier implementing the optimal separating hyperplane in the feature space is given by

f(x) = \mathrm{sgn}\left( \sum_{SVs} \alpha_i y_i K(x_i, x) + b \right)    (20)
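The following sketch illustrates (19)-(20) on data that no linear boundary can separate, using a Gaussian RBF kernel; the data set and the scikit-learn implementation are assumptions made for this example.

# Illustrative sketch of the non-linear case (section 2.2.3): a Gaussian RBF
# kernel separates concentric rings that defeat any linear boundary.
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_circles(n_samples=400, factor=0.4, noise=0.08, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

linear = SVC(kernel="linear", C=1.0).fit(X_tr, y_tr)
rbf = SVC(kernel="rbf", gamma=2.0, C=1.0).fit(X_tr, y_tr)  # K(x,y)=exp(-gamma*||x-y||^2)

print("linear kernel accuracy:", linear.score(X_te, y_te))
print("RBF kernel accuracy:   ", rbf.score(X_te, y_te))
# The RBF machine only ever evaluates kernel values K(x_i, x), as in (20);
# the high dimensional feature space is never constructed explicitly.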
3. Feature Space and Mapping Mechanism of Kernel Function
By the use of reproducing kernels, we can construct a mapping into a high dimensional feature space. The idea of the kernel function is to enable operations to be performed in the input space rather than in the potentially high dimensional feature space. Hence the inner product does not need to be evaluated in the feature space.

The following theory is based upon Reproducing Kernel Hilbert Spaces (RKHS). If K is a symmetric positive definite function which satisfies Mercer's conditions,

K(x, y) = \sum_{m} a_m \phi_m(x) \phi_m(y), \quad a_m \ge 0, \qquad \iint K(x, y)\, g(x)\, g(y)\, dx\, dy > 0, \quad g \in L_2

then the kernel represents a legitimate inner product in feature space. Valid functions that satisfy Mercer's conditions include, among others:

Gaussian radial basis function:   K(x, y) = \exp\left( -\frac{\|x - y\|^2}{2\sigma^2} \right)

Exponential radial basis function:   K(x, y) = \exp\left( -\frac{\|x - y\|}{2\sigma^2} \right)

B splines:   K(x, y) = B_{2n+1}(x - y)

The obvious question that arises is, with so many different mappings to choose from, which is the best for a particular problem? At present there is no strong theoretical method for selecting a kernel. Unless a choice can be validated using independent test sets on a large number of problems, methods such as cross-validation will remain the preferred approach to kernel selection.
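As a small numerical companion to Mercer's condition, the sketch below (an illustrative assumption, not material from the paper) builds the Gaussian RBF kernel matrix for a random sample and verifies that it is symmetric and positive semi-definite.

# Illustrative check of Mercer's condition for the Gaussian RBF kernel:
# the Gram matrix K_ij = K(x_i, x_j) should be symmetric and positive
# semi-definite. The random sample is an assumption for the example.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))          # 50 points in R^3
sigma = 1.5

sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # ||x_i - x_j||^2
K = np.exp(-sq_dists / (2.0 * sigma ** 2))                  # Gaussian RBF kernel matrix

eigvals = np.linalg.eigvalsh(K)       # eigenvalues of the symmetric Gram matrix
print("symmetric:", np.allclose(K, K.T))
print("smallest eigenvalue:", eigvals.min())   # >= 0 up to numerical error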
4. Conclusion
SVMs are an attractive approach to data modelling. They combine statistical learning theory with generalization control. The formulation results in a global quadratic optimization problem with convex constraints, which is readily solved by interior point methods. The kernel mapping provides a unifying framework for most of the commonly employed model architectures. Techniques for choosing the kernel function and for additional capacity control remain topics for further research.

REFERENCES
1. Burges C J C. A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery, Boston, 1998.
2. Gunn S. Support Vector Machines for Classification and Regression. Image Speech and Intelligent Systems Technical Report, 1998.
3. Lu Zengxiang, et al. Supervised Support Vector Machines Learning Algorithm and Application. Journal of Tsinghua University, No. 7, 1999.
4. Weston J, Herbrich R. Adaptive Margin Support Vector Machines. London: The MIT Press, 1999.
5. Blanz V, et al. Comparison of View-based Object Recognition Algorithms Using Realistic 3D Models. ICANN'96, 1996.
6. Cortes C, Vapnik V. Support Vector Networks. Machine Learning, 1995.
7. Osuna E, et al. An Improved Training Algorithm for Support Vector Machines. In: Principe E J, et al. (eds.), Proceedings of the 1997 IEEE Workshop on Neural Networks for Signal Processing, 1997.
