The Support Vector Machine (SVM) belongs to a class of machine learning algorithms that
are based on linear classifiers and the “kernel trick”. The latter is a general method, based on
the concepts of duality and kernels, which allows algorithms that can be formulated in terms
of inner products to be systematically extended to non-linear cases. For brevity and simplicity,
we restrict the discussion of SVMs to binary classification tasks.
Figure 1. Linear separation of two classes (indoles ■ and piperidines □) in two dimensions (clogP and
molar refractivity). The solid line shows the optimal hyperplane, together with its margin (dashed
lines) and support vectors (circled). Note that the computation was done on standardized descriptors;
otherwise, the difference in scale between the values of the two descriptors would have given more
weight to molar refractivity, leading to a different hyperplane and support vectors. This hyperplane
would have also separated the two classes, but with a worse generalization performance on new
samples.
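The effect of standardization mentioned in the caption can be sketched numerically. The descriptor values below are made up for illustration (clogP spans a few units, molar refractivity tens of units), and the z-scoring shown is one common choice, not necessarily the exact procedure used for the figure:

```python
# Sketch of descriptor standardization (z-scoring), with made-up values:
# without it, the descriptor with the larger numeric range dominates
# distances and hence the resulting hyperplane and support vectors.
from statistics import mean, pstdev

clogp = [1.2, 2.5, 3.1, 0.8, 4.0]                      # small numeric range
molar_refractivity = [45.0, 88.0, 120.0, 52.0, 99.0]   # much larger range

def standardize(values):
    """Shift to zero mean, scale to unit (population) standard deviation."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

z_clogp = standardize(clogp)
z_mr = standardize(molar_refractivity)

# After standardization both descriptors contribute on the same scale.
print(round(pstdev(z_clogp), 6), round(pstdev(z_mr), 6))   # 1.0 1.0
```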
Support Vector Machine 2
Figure 2. Relationship of the optimal hyperplane to the convex hull. The convex hull of a set of points {x_1, …, x_l} is the smallest convex set containing those points. It consists of all linear combinations with weights in [0, 1] that sum to one, that is, all points ∑_{i=1}^{l} α_i x_i with ∑_i α_i = 1 and α_i ≥ 0 (see Eq. (2)). For linearly separable datasets as in the given example, the hyperplane with the largest margin bisects the line between the two closest points on the convex hulls of the two classes. If the classes are not linearly separable, the optimal hyperplane bisects the reduced convex hulls of the classes (not shown). In the reduced convex hull, the weights α_i are bounded from above by a constant D < 1, which results in a “shrunk” convex hull.
Note that while the target function sgn f is invariant to re-scaling of w, the functional margin is not. Therefore, in the above optimization problem the length of w is fixed. For ‖w‖ = 1, the term ⟨w, x_i⟩ measures the length of the perpendicular projection of x_i onto w, that is, the distance of x_i to the hyperplane. Eq. (1) is called the hard margin SVM, as it requires the training points to lie outside the hyperplane’s margin. It can be rewritten as a minimization problem by fixing the functional margin to unity and minimizing the norm of the weight vector instead. Using Lagrange multipliers yields the dual form of the problem (Eq. (2)). Lagrange multipliers are a tool for transforming an optimization problem over a differentiable function in n variables and with k constraints into an equivalent problem with n + k variables and no constraints [2,3].
min_{α_i ≥ 0}  ½ ∑_{i,j=1}^{l} α_i α_j y_i y_j ⟨x_i, x_j⟩ − ∑_{i=1}^{l} α_i   subject to   ∑_{i=1}^{l} y_i α_i = 0    (2)
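The step from the primal problem to Eq. (2) can be sketched as follows; the intermediate Lagrangian is the standard textbook construction and is not spelled out in the source:

```latex
% Primal (functional margin fixed to 1):
\min_{w,b}\; \tfrac{1}{2}\|w\|^2
\quad\text{s.t.}\quad y_i\bigl(\langle w, x_i\rangle + b\bigr) \ge 1,\quad i = 1,\dots,l

% Lagrangian with multipliers \alpha_i \ge 0:
L(w, b, \alpha) = \tfrac{1}{2}\|w\|^2
  - \sum_{i=1}^{l} \alpha_i \bigl[\, y_i(\langle w, x_i\rangle + b) - 1 \,\bigr]

% Stationarity in w and b:
\partial_w L = 0 \;\Rightarrow\; w = \sum_{i=1}^{l} \alpha_i y_i x_i,
\qquad
\partial_b L = 0 \;\Rightarrow\; \sum_{i=1}^{l} \alpha_i y_i = 0

% Substituting w back eliminates w and b and yields the dual, Eq. (2):
\min_{\alpha_i \ge 0}\;
  \tfrac{1}{2}\sum_{i,j=1}^{l} \alpha_i \alpha_j y_i y_j \langle x_i, x_j\rangle
  - \sum_{i=1}^{l} \alpha_i
\quad\text{s.t.}\quad \sum_{i=1}^{l} y_i \alpha_i = 0
```

Note that the data enter the dual only through the inner products ⟨x_i, x_j⟩, which is what makes the kernel substitution below possible.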
Eq. (2) has several important properties: It belongs to a well-studied class of problems,
namely convex quadratic programs. For such problems, efficient solvers exist, which find the
(unique) global optimum, that is, the optimal hyperplane. According to the Karush-Kuhn-Tucker conditions [4,5], α_i ≠ 0 if and only if x_i lies on the margin of the optimal hyperplane. Such samples are called the support vectors of the solution (Fig. 1). The solution to Eq. (2) is
sparse in the sense that only the support vectors are necessary for its evaluation, which allows
for fast classification of new samples. This is an important distinction between SVM and
ANN (Box 4.4).
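The sparse evaluation can be sketched on a toy dataset where the optimal hyperplane is known in closed form; the data and multipliers below are worked out by hand for this example, not produced by a solver:

```python
# Sparse evaluation of an SVM decision function: only support vectors
# (samples with alpha_i != 0) enter the sum. Toy 2-D data, hand-derived.
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Training set: the third point lies beyond the margin, so its alpha is 0.
X = [(1.0, 1.0), (-1.0, -1.0), (2.0, 2.0)]
y = [+1, -1, +1]
alpha = [0.25, 0.25, 0.0]   # hand-derived optimum for this toy problem
b = 0.0

def f(x):
    """Decision value f(x) = sum_i alpha_i y_i <x_i, x> + b over support vectors only."""
    return sum(a * yi * dot(xi, x)
               for a, yi, xi in zip(alpha, y, X) if a > 0) + b

# The implied weight vector is w = sum_i alpha_i y_i x_i = (0.5, 0.5).
print(f((1.0, 1.0)), f((-1.0, -1.0)))   # margin points: 1.0 and -1.0
print(f((3.0, 0.0)))                    # new sample: 1.5, i.e. class +1
```

Classifying a new sample thus touches two points, not three; on realistic datasets the support vectors are typically a small fraction of the training set.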
By replacing the standard dot product ⟨x_i, x_j⟩ in Eq. (2) with a kernel κ(x_i, x_j), the linear classification takes place in H (which can be high-dimensional and non-linearly derived from ℝ^d), while the computation itself is still done in the original space ℝ^d. As the evaluation of κ requires only x_i, x_j and not φ(x_i), φ(x_j), this makes the computation feasible even for very high-dimensional H. This procedure is known as the kernel trick. Fig. 3 shows an example of φ : ℝ → ℝ². Another example is the mapping of two circles in ℝ² (one enclosing the other, each circle representing a class) by (x, y) ↦ (x, y, x² + y²). Although it can be helpful, it is in general not necessary to know φ or H explicitly, as long as they exist.
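A minimal numerical check of the kernel trick, using the homogeneous polynomial kernel κ(x, y) = ⟨x, y⟩² (a standard textbook kernel/feature-map pair, chosen here for illustration rather than taken from the text):

```python
# For kappa(x, y) = <x, y>^2 on R^2, an explicit feature map into R^3 is
# phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2). The kernel evaluates
# <phi(x), phi(y)> without ever computing phi.
from math import sqrt

def kappa(x, y):
    return (x[0] * y[0] + x[1] * y[1]) ** 2

def phi(x):
    return (x[0] ** 2, sqrt(2) * x[0] * x[1], x[1] ** 2)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

x, y = (1.0, 2.0), (3.0, -1.0)
print(kappa(x, y), dot(phi(x), phi(y)))   # both approximately 1
```

The kernel needs 2 multiplications and a square; the explicit route needs the 3-dimensional images first. For higher polynomial degrees the dimension of H grows combinatorially while the kernel cost stays constant.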
Figure 3. Example of how linear separability can be achieved through non-linear transformation. In the original one-dimensional descriptor space (x-axis), classes A ■ and B □ are not linearly separable, as there is no zero-dimensional hyperplane (point) with all samples from A on one side and all samples from B on the other side of it. However, if a second dimension is added (the y-axis) which is quadratic in the first one, e.g. x ↦ (x, ax² + b), there is a one-dimensional hyperplane (line) which separates the two classes.
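The situation in Figure 3 can be reproduced with made-up class positions (class B sitting between the samples of class A on the axis):

```python
# 1-D classes: A = {-2, 2}, B = {-0.5, 0.5}. No single threshold (point)
# separates them, but the quadratic lift x -> (x, x^2) makes them
# separable by the horizontal line y = 1.
A = [-2.0, 2.0]
B = [-0.5, 0.5]

def separable_by_point(a, b):
    """True if some threshold puts all of a on one side and all of b on the other."""
    return max(a) < min(b) or max(b) < min(a)

lift = lambda x: (x, x * x)

print(separable_by_point(A, B))              # False: not separable in 1-D
print(all(lift(x)[1] > 1.0 for x in A))      # True: A lies above y = 1
print(all(lift(x)[1] < 1.0 for x in B))      # True: B lies below y = 1
```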
A function κ is a kernel if and only if it is symmetric and all its kernel matrices are positive semidefinite [3]. Popular kernels include the polynomial kernel κ(x, y) = (⟨x, y⟩ + a)^b and the radial basis function kernel κ(x, y) = exp(−‖x − y‖² / 2σ²). The sigmoid κ(x, y) = tanh(a⟨x, y⟩ + r), which corresponds to a two-layer neural network (Box 4.4), is not a kernel in general.
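The positive-semidefiniteness condition can be probed numerically for the radial basis function kernel. A full check would compute eigenvalues of the kernel matrix; sampling random quadratic forms zᵀKz ≥ 0 is a lightweight sketch (the points and σ below are arbitrary):

```python
# Build the kernel (Gram) matrix K of the RBF kernel on a few arbitrary
# 1-D points, then spot-check positive semidefiniteness via z^T K z >= 0
# for random vectors z (a necessary condition; eigenvalues would be exact).
import math, random

points = [0.0, 0.7, 1.5, 3.0]
sigma = 1.0

def rbf(x, y):
    return math.exp(-((x - y) ** 2) / (2 * sigma ** 2))

K = [[rbf(a, b) for b in points] for a in points]

random.seed(0)
n = len(points)
worst = 0.0   # most negative quadratic form seen
for _ in range(1000):
    z = [random.uniform(-1, 1) for _ in range(n)]
    quad = sum(z[i] * K[i][j] * z[j] for i in range(n) for j in range(n))
    worst = min(worst, quad)

print(worst >= -1e-12)   # True: no negative quadratic form found
```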
Extensions:
Real molecule datasets are often not linearly separable, either because the chosen descriptors do not provide enough information or because the data are corrupted by noise such as measurement errors. Enforcing linear separability in H by choosing a sufficiently complex kernel risks severe overfitting, as the training data are learned by rote, degrading generalization capability. As a solution, one allows misclassified samples and tries to minimize the resulting violations of the margin: for each training sample x_i, define a slack variable ξ_i = max(γ − y_i f(x_i), 0). It is zero for correctly classified samples and measures the violation of the margin of its class otherwise. Reformulating Eq. (1) by taking the slack variables into account yields Eq. (4).
max_{w,b,γ}  γ − C ∑_{i=1}^{l} ξ_i   subject to   ∀i : γ − ξ_i ≤ y_i f(x_i)  and  ‖w‖² = 1    (4)
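The slack variables can be sketched for a fixed decision function; the hyperplane f(x) = x − 0.5 and the margin γ = 1 below are hand-picked for illustration, not the result of an optimization:

```python
# Slack xi_i = max(gamma - y_i * f(x_i), 0) for a fixed toy 1-D decision
# function f(x) = x - 0.5 and margin gamma = 1 (both chosen by hand).
gamma = 1.0
f = lambda x: x - 0.5

samples = [(2.0, +1),    # well outside the margin      -> slack 0
           (1.25, +1),   # inside the margin            -> slack 0.25
           (0.0, +1),    # misclassified                -> slack 1.5
           (-1.5, -1)]   # correct, outside the margin  -> slack 0

slack = [max(gamma - y * f(x), 0.0) for x, y in samples]
print(slack)   # [0.0, 0.25, 1.5, 0.0]
```

A slack in (0, γ) marks a margin violation that is still correctly classified; a slack greater than γ marks an outright misclassification.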
The parameter C controls the trade-off between model complexity (for C → ∞ one recovers the hard margin SVM) and the frequency of training errors. The corresponding dual problem is given by Eq. (5).
min_{0 ≤ α_i ≤ C}  ½ ∑_{i,j=1}^{l} α_i α_j y_i y_j ⟨x_i, x_j⟩ − ∑_{i=1}^{l} α_i   subject to   ∑_{i=1}^{l} y_i α_i = 0    (5)
This is the most commonly used SVM formulation, the soft margin SVM. Finding the
optimal soft margin hyperplane corresponds to finding the two closest points on the reduced
convex hulls of the two classes (Fig. 2) [6]. The ideas underlying SVMs can be extended to other applications such as regression (estimation of real-valued instead of binary labels) and novelty detection (also called one-class SVM; estimating whether a new sample belongs to the same distribution as the training samples).
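The role of C can be illustrated by evaluating the equivalent soft margin objective ½w² + C·∑ξ_i for two fixed candidate hyperplanes on a made-up one-dimensional dataset containing one noisy point; the candidates are hand-picked for the comparison, not solver output:

```python
# Soft margin objective 0.5*w^2 + C*sum(hinge) in 1-D, with f(x) = w*x + b.
# Candidate A: wide margin, one violated point. Candidate B: separates
# everything, but with a much larger |w| (i.e. a narrow margin).
data = [(2, +1), (3, +1), (4, +1), (-2, -1), (-3, -1), (-4, -1),
        (-1, +1)]          # the last point is "noise"

def objective(w, b, C):
    hinge = sum(max(0.0, 1 - y * (w * x + b)) for x, y in data)
    return 0.5 * w * w + C * hinge

cand_A = (1.0, 0.0)   # ignores the noisy point (its slack is 2.0)
cand_B = (2.0, 3.0)   # puts the noisy point exactly on its margin

for C in (0.1, 10.0):
    obj_A = objective(*cand_A, C)
    obj_B = objective(*cand_B, C)
    print(C, "prefers", "A" if obj_A < obj_B else "B")
# C = 0.1 prefers A (simple model, tolerates one error);
# C = 10.0 prefers B (hard-margin-like, fits the noise)
```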
Summary:
SVMs represent a popular class of machine learning algorithms based on the ideas of linear
classification, duality and kernels. By solving a convex quadratic optimization problem, a
hyperplane is found which optimally separates the training samples in a (non-linearly)
transformed space. The solution depends only on a subset of the training samples, the support
vectors. SVMs have received considerable attention, mainly due to solid theoretical
foundations as well as good and robust performance in practice. The kernel trick underlying
SVMs has also been applied to many other algorithms, e.g. nonlinear principal component analysis [7].
[4] H. Kuhn, A. Tucker. Nonlinear programming. In: Proceedings of the 2nd Berkeley Symposium on Mathematical Statistics and Probability. University of California Press, 1951, pp. 481–492.
[5] B. Schölkopf, A. Smola. Learning with Kernels. MIT Press, Cambridge, 2002.
[6] K. Bennett, E. Bredensteiner. Duality and geometry in SVM classifiers. In: Proceedings of the 17th International Conference on Machine Learning (ICML 2000; P. Langley, Ed.), Stanford, California, U.S.A., 2000, pp. 57–64.
[7] B. Schölkopf, A. Smola, K.-R. Müller. Nonlinear component analysis as a kernel
eigenvalue problem. Neural Computation 1998, 10, 1299–1319.