
Support Vector Machine (SVM)


M. Rupp, G. Schneider

The Support Vector Machine (SVM) belongs to a class of machine learning algorithms that
are based on linear classifiers and the “kernel trick”. The latter is a general method, based on
the concepts of duality and kernels, which allows algorithms that can be formulated in terms
of inner products to be systematically extended to non-linear cases. For brevity and simplicity,
we restrict the discussion of SVMs to binary classification tasks.

Linear classification and margins:


Consider l samples x_1, …, x_l ∈ ℜ^d with labels y_i ∈ {−1, +1} (see Fig. 1 for an example). If the
samples are linearly separable, there exist (infinitely many) hyperplanes f(x) = ⟨w, x⟩ + b
which correctly classify all samples, that is, sgn(f(x_i)) = y_i. Here, a hyperplane is determined
by a weight vector w perpendicular to it and an offset b from the origin. Assuming that
both given and future samples (and their labels) are drawn independently from the same
distribution, the hyperplane that will perform best on new samples is the one "furthest
away" from both classes. To formalize this notion, we define the margin of f on (x_i, y_i) as
y_i f(x_i). The margin is positive for correctly classified samples and negative for
misclassified ones. Geometrically, for ‖w‖ = 1, the margin equals the distance between the sample and the
class-separating hyperplane. The functional margin of f with respect to the whole dataset is
the smallest margin over all samples, min_{1≤i≤l} y_i f(x_i). The optimal hyperplane is the one
maximizing the functional margin. Fig. 1 shows the optimal hyperplane and its margin for an
example dataset. Geometrically, the optimal hyperplane perpendicularly bisects the line segment between the
two closest points on the convex hulls of the two classes [1] (Fig. 2).
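To make these definitions concrete, the following sketch computes per-sample margins and the functional margin for a small artificial dataset; the data, the weight vector w, and the offset b are illustrative choices and are not the values behind Fig. 1.

import numpy as np

# Toy data: two linearly separable classes in two dimensions (illustrative values).
X = np.array([[1.0, 3.0], [2.0, 4.0], [1.5, 3.5],    # class +1
              [3.0, 1.0], [4.0, 1.5], [5.0, 2.0]])   # class -1
y = np.array([+1, +1, +1, -1, -1, -1])

# A candidate hyperplane f(x) = <w, x> + b; w is normalized so that ||w|| = 1
# and the margins below are also geometric distances to the hyperplane.
w = np.array([-1.0, 1.0])
w = w / np.linalg.norm(w)
b = 0.0

f = X @ w + b        # decision values f(x_i)
margins = y * f      # margin of f on (x_i, y_i): positive iff correctly classified

print("per-sample margins:", np.round(margins, 3))
print("functional margin (smallest margin):", round(margins.min(), 3))

The optimal hyperplane is the choice of (w, b) that maximizes the last printed quantity.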

Figure 1. Linear separation of two classes (indoles ■ and piperidines □) in two dimensions (clogP and
molar refractivity). The solid line shows the optimal hyperplane, together with its margin (dashed
lines) and support vectors (circled). Note that the computation was done on standardized descriptors;
otherwise, the difference in scale between the values of the two descriptors would have given more
weight to molar refractivity, leading to a different hyperplane and support vectors. This hyperplane
would have also separated the two classes, but with a worse generalization performance on new
samples.

Figure 2. Relationship of the optimal hyperplane to the convex hull. The convex hull of a set of points
{x_1, …, x_l} is the smallest convex set containing those points. It consists of all linear combinations
with weights in [0, 1] that sum to one, that is, all points ∑_{i=1}^{l} α_i x_i with ∑_i α_i = 1 and α_i ≥ 0 (see Eq. (2)). For
linearly separable datasets as in the given example, the hyperplane with the largest margin bisects the
line between the two closest points on the convex hulls of both classes. If the classes are not linearly
separable, the optimal hyperplane instead bisects the line between the closest points of the reduced convex
hulls of the classes (not shown). In the reduced convex hull, the weights α_i are additionally bounded from above by a constant D < 1, which results in a
“shrunk” convex hull.

Quadratic optimization and duality:


The optimal hyperplane can be found by solving the following optimization problem:
max_{w,b,γ}  γ    subject to    ∀i: γ ≤ y_i f(x_i)   and   ‖w‖² = 1        (1)

Note that while the predicted class sgn(f(x)) is invariant to a positive re-scaling of w and b, the functional
margin is not. Therefore, the length of w is fixed in the above optimization problem. For
‖w‖ = 1, the term ⟨w, x_i⟩ is the length of the perpendicular projection of x_i onto w, so that
f(x_i) measures the signed distance of x_i to the hyperplane. Eq. (1) is called the hard margin SVM, as it requires
all training points to lie outside the hyperplane's margin. It can be rewritten as a
minimization problem by fixing the functional margin to unity and minimizing the norm of
the weight vector instead. Using Lagrange multipliers then yields the dual form of the problem
(Eq. (2)). Lagrange multipliers are a tool for transforming an optimization problem over a
differentiable function in n variables with k constraints into an equivalent problem in
n+k variables without constraints [2,3].
min_{α_i ≥ 0}  (1/2) ∑_{i,j=1}^{l} α_i α_j y_i y_j ⟨x_i, x_j⟩ − ∑_{i=1}^{l} α_i    subject to    ∑_{i=1}^{l} y_i α_i = 0        (2)
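To make the transition from the primal to the dual explicit, here is a brief outline of the standard Lagrangian calculation, starting from the rescaled primal problem with the functional margin fixed to one (a sketch only, not part of the formal development above):

% Rescaled primal problem: min_{w,b} (1/2)||w||^2  s.t.  y_i(<w, x_i> + b) >= 1 for all i.
% Its Lagrangian, with one multiplier alpha_i >= 0 per constraint:
L(w, b, \alpha) = \tfrac{1}{2}\lVert w \rVert^{2}
  - \sum_{i=1}^{l} \alpha_i \left[ y_i \left( \langle w, x_i \rangle + b \right) - 1 \right]

% Setting the partial derivatives with respect to w and b to zero gives
\frac{\partial L}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i=1}^{l} \alpha_i y_i x_i ,
\qquad
\frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{l} \alpha_i y_i = 0 .

% Substituting w back into L eliminates w and b and leaves, up to sign, exactly the
% objective minimized in Eq. (2); the data enter only through the inner products <x_i, x_j>.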

Eq. (2) has several important properties: It belongs to a well-studied class of problems,
namely convex quadratic programs. For such problems, efficient solvers exist, which find the
(unique) global optimum, that is, the optimal hyperplane. According to the Karush-Kuhn-
Tucker conditions [4,5], α_i ≠ 0 only for samples x_i that lie on the margin of the optimal hyperplane.
Such samples are called the support vectors of the solution (Fig. 1). The solution to Eq. (2) is
sparse: the resulting classifier can be written as f(x) = ∑_i α_i y_i ⟨x_i, x⟩ + b, so only the support
vectors are needed for its evaluation, which allows for fast classification of new samples. This
is an important distinction between SVM and ANN (Box 4.4).
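As an illustration of this solution structure, the following sketch fits a linear SVM to a small artificial dataset using scikit-learn (an external library assumed here only for convenience; a very large C approximates the hard margin case) and reads off the support vectors and dual coefficients.

import numpy as np
from sklearn.svm import SVC   # assumes scikit-learn is installed

# Toy linearly separable data (illustrative values only).
X = np.array([[1.0, 3.0], [2.0, 4.0], [1.5, 3.5],
              [3.0, 1.0], [4.0, 1.5], [5.0, 2.0]])
y = np.array([+1, +1, +1, -1, -1, -1])

# A very large C approximates the hard margin SVM of Eqs. (1)/(2);
# the solver internally determines the dual coefficients alpha_i y_i.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

print("support vectors:\n", clf.support_vectors_)        # samples with alpha_i != 0
print("dual coefficients (alpha_i * y_i):", clf.dual_coef_)
print("w =", clf.coef_, " b =", clf.intercept_)

# Only the support vectors enter the decision function
#   f(x) = sum_i alpha_i y_i <x_i, x> + b
x_new = np.array([[2.5, 2.5]])
print("f(x_new) =", clf.decision_function(x_new), " class =", clf.predict(x_new))

Removing any non-support vector from the training set would leave this solution unchanged, which is the practical meaning of sparsity here.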

Non-linearity and kernels:


The most important property of Eq. (2), however, is that the minimized function uses the
training samples only through inner products. An inner product (also dot product or scalar
product) is a real-valued symmetric bilinear form ⟨·,·⟩ that is positive definite, i.e. ⟨x, x⟩ ≥ 0 with ⟨x, x⟩ = 0 ⇔ x = 0. It
encodes information about the lengths of and the angle between its arguments (Eq. (3)).

cos ∠(x, y) = ⟨x, y⟩ / (‖x‖ ‖y‖) = ⟨x, y⟩ / √(⟨x, x⟩ ⟨y, y⟩).        (3)
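A quick numerical check of Eq. (3) with the standard dot product on ℜ²; the two vectors are arbitrary examples.

import numpy as np

x = np.array([3.0, 4.0])
y = np.array([4.0, 3.0])

dot = np.dot(x, y)                    # <x, y>
len_x = np.sqrt(np.dot(x, x))         # ||x|| = sqrt(<x, x>)
len_y = np.sqrt(np.dot(y, y))         # ||y|| = sqrt(<y, y>)

cos_angle = dot / (len_x * len_y)     # Eq. (3)
print(cos_angle, np.degrees(np.arccos(cos_angle)))   # 0.96, about 16.3 degrees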

A kernel is a function that corresponds to an inner product in a Hilbert space H, i.e.
κ: ℜ^d × ℜ^d → ℜ with κ(x_i, x_j) = ⟨φ(x_i), φ(x_j)⟩ for some transformation φ: ℜ^d → H. By
replacing the standard dot product ⟨x_i, x_j⟩ in Eq. (2) with a kernel κ(x_i, x_j), the linear
classification takes place in H (which can be high-dimensional and non-linearly derived from
ℜ^d), while the computation itself is still done in the original space ℜ^d. As the evaluation of
κ requires only x_i and x_j, and not φ(x_i) and φ(x_j), the computation remains feasible even for very
high-dimensional H. This procedure is known as the kernel trick. Fig. 3 shows an example of
φ: ℜ → ℜ². Another example is the mapping of two circles in ℜ² (one enclosing the other,
each circle representing a class) by (x, y) ↦ (x, y, x² + y²). Although it can be helpful, it is in
general not necessary to know φ or H explicitly, as long as they exist.
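A minimal numerical sketch of the kernel trick, assuming the homogeneous degree-2 polynomial kernel κ(x, y) = ⟨x, y⟩² on ℜ², whose explicit feature map φ(x) = (x₁², √2·x₁x₂, x₂²) into H = ℜ³ happens to be known; this particular kernel and map are chosen only for illustration.

import numpy as np

def poly2_kernel(x, y):
    """Homogeneous degree-2 polynomial kernel kappa(x, y) = <x, y>**2."""
    return np.dot(x, y) ** 2

def phi(x):
    """Explicit feature map into H = R^3 for the kernel above (d = 2)."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x = np.array([1.0, 2.0])
y = np.array([3.0, 0.5])

# Both routes give the same value, but the kernel never constructs phi(x) explicitly.
print(poly2_kernel(x, y))          # <x, y>^2 = (3 + 1)^2 = 16
print(np.dot(phi(x), phi(y)))      # <phi(x), phi(y)> = 16

In practice only κ is ever evaluated; the map φ is never constructed, which is what keeps very high-dimensional (or infinite-dimensional) H tractable.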

Figure 3. Example of how linear separability can be achieved through non-linear transformation. In
the original one-dimensional descriptor space (x-axis), classes A ■ and B □ are not linearly separable
as there is no zero-dimensional hyperplane (point) with all samples from A on one side and all
samples from B on the other side of it. However, if a second dimension is added (the y-axis) which is
quadratic in the first one, e.g. x ↦ (x, ax² + b), there is a one-dimensional hyperplane (line) which
separates the two classes.

To characterize kernels, consider a kernel matrix G = (κ(x_i, x_j))_{i,j=1,…,l} of kernel
evaluations on a finite subset of input samples {x_i ∈ ℜ^d | i = 1, …, l}. A square matrix M ∈ ℜ^{l×l} is
called positive semidefinite if ∀v ∈ ℜ^l: v′Mv ≥ 0. A function κ: ℜ^d × ℜ^d → ℜ is a kernel if
and only if it is symmetric and all its kernel matrices are positive semidefinite [3]. Popular
kernels include the polynomial kernel κ(x, y) = (⟨x, y⟩ + a)^b and the radial basis function
kernel κ(x, y) = exp(−‖x − y‖² / (2σ²)). The sigmoid κ(x, y) = tanh(a⟨x, y⟩ + r), which
corresponds to a two-layer neural network (Box 4.4), is not a kernel in general.
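The positive semidefiniteness criterion can be checked numerically on a sample of data. The sketch below builds Gram matrices for an RBF and a sigmoid kernel (parameters σ, a, r chosen arbitrarily) and inspects their smallest eigenvalues; a single such check is of course only a necessary condition, but it is enough to expose the sigmoid's failure.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))          # 20 random samples in R^5 (illustrative only)

def rbf_kernel(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

def sigmoid_kernel(x, y, a=1.0, r=-1.0):
    return np.tanh(a * np.dot(x, y) + r)

def gram(k, X):
    """Kernel matrix G_ij = k(x_i, x_j)."""
    return np.array([[k(xi, xj) for xj in X] for xi in X])

for name, k in [("rbf", rbf_kernel), ("sigmoid", sigmoid_kernel)]:
    G = gram(k, X)
    min_eig = np.linalg.eigvalsh(G).min()
    print(name, "smallest eigenvalue of G:", round(min_eig, 6))

# The RBF Gram matrix is positive semidefinite (smallest eigenvalue nonnegative up to
# numerical error), whereas the sigmoid Gram matrix can have clearly negative
# eigenvalues for these parameters, illustrating that tanh(a<x,y> + r) is not a
# kernel in general.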

Extensions:
Real molecule datasets are often not linearly separable, either because the chosen descriptors
do not provide enough information or because the data are corrupted by noise such as
measurement errors. Enforcing linear separability in H by choosing a sufficiently complex
kernel risks severe overfitting, as the training data are learned by rote, degrading
generalization capability. As a solution, one allows misclassified samples and tries to
minimize the resulting violations of the margin: for each training sample x_i, define a slack
variable ξ_i = max(γ − y_i f(x_i), 0). It is zero for samples whose margin is at least γ
and measures the violation of the margin of its class otherwise. Reformulating Eq. (1) to
take the slack variables into account yields Eq. (4).
max_{w,b,γ}  γ − C ∑_{i=1}^{l} ξ_i    subject to    ∀i: γ − ξ_i ≤ y_i f(x_i)   and   ‖w‖² = 1        (4)

The parameter C controls the trade-off between model complexity (for C → ∞ one
recovers the hard margin SVM) and the amount of margin violation tolerated on the training data. The corresponding dual problem is
given by Eq. (5).
min_{0 ≤ α_i ≤ C}  (1/2) ∑_{i,j=1}^{l} α_i α_j y_i y_j ⟨x_i, x_j⟩ − ∑_{i=1}^{l} α_i    subject to    ∑_{i=1}^{l} y_i α_i = 0        (5)

This is the most commonly used SVM formulation, the soft margin SVM. Finding the
optimal soft margin hyperplane corresponds to finding the two closest points on the reduced
convex hulls of the two classes (Fig. 2) [6]. The ideas underlying SVMs can be extended to
other applications such as regression (estimation of real-valued instead of binary labels) and novelty
detection (also called one-class SVM; estimating whether a new sample belongs to the same distribution as
the training samples).
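A brief soft margin sketch using scikit-learn (assumed available; its SVC solves a dual of the form of Eq. (5) with the functional margin fixed to one, so the slack variables can be recovered as ξ_i = max(1 − y_i f(x_i), 0)). The two overlapping point clouds are artificial.

import numpy as np
from sklearn.svm import SVC   # assumes scikit-learn is installed

rng = np.random.default_rng(1)

# Two overlapping Gaussian clouds: not linearly separable (illustrative data).
X = np.vstack([rng.normal(loc=-1.0, size=(30, 2)),
               rng.normal(loc=+1.0, size=(30, 2))])
y = np.array([-1] * 30 + [+1] * 30)

for C in (0.1, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    f = clf.decision_function(X)          # f(x_i), functional margin fixed to 1
    xi = np.maximum(0.0, 1.0 - y * f)     # slack variables xi_i = max(1 - y_i f(x_i), 0)
    print(f"C={C:>6}: support vectors={len(clf.support_)}, "
          f"total slack={xi.sum():.2f}, training errors={(y * f < 0).sum()}")

# Smaller C typically tolerates more slack and yields more support vectors;
# larger C penalizes violations more strongly and, on separable data,
# approaches the hard margin SVM.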

Summary:
SVMs represent a popular class of machine learning algorithms based on the ideas of linear
classification, duality and kernels. By solving a convex quadratic optimization problem, a
hyperplane is found which optimally separates the training samples in a (non-linearly)
transformed space. The solution depends only on a subset of the training samples, the support
vectors. SVMs have received considerable attention, mainly due to solid theoretical
foundations as well as good and robust performance in practice. The kernel trick underlying
SVMs has also been applied to many other algorithms, e.g. nonlinear principal component
analysis [7].

[1] K. Bennett, C. Campbell. Support vector machines: Hype or hallelujah? SIGKDD
Explorations 2000, 2, 1–13.
[2] J.-L. Lagrange. Théorie des fonctions analytiques. Imprimerie de la République, Paris,
1797.
[3] J. Shawe-Taylor, N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge
University Press, Cambridge, 2004.
[4] H. Kuhn, A. Tucker. Nonlinear programming. In: Proceedings of the 2nd Berkeley
Symposium on Mathematical Statistics and Probability. University of California Press,
1951, pp. 481–492.
[5] B. Schölkopf, A. Smola. Learning with Kernels. MIT Press, Cambridge, 2002.
[6] K. Bennett, E. Bredensteiner. Duality and geometry in SVM classifiers. In: Proceedings of
the 17th International Conference on Machine Learning (ICML 2000; P. Langley, Ed.),
Stanford, California, U.S.A., 2000, pp. 57–64.
[7] B. Schölkopf, A. Smola, K.-R. Müller. Nonlinear component analysis as a kernel
eigenvalue problem. Neural Computation 1998, 10, 1299–1319.
