Slides Lecture6

KERNELS
MOTIVATION: KERNELS
Idea
- The SVM uses the scalar product $\langle x, x_i \rangle$ as a measure of similarity between $x$ and $x_i$, and of distance to the hyperplane.
- Since the scalar product is linear, the SVM is a linear method.
- By using a nonlinear function instead, we can make the classifier nonlinear.
More precisely
- The scalar product can be regarded as a two-argument function $\langle \cdot, \cdot \rangle : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$.
- We will replace this function with a function $k : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ and substitute $k(x, x')$ for every occurrence of $\langle x, x' \rangle$ in the SVM formulae.
- Under certain conditions on $k$, all optimization/classification results for the SVM still hold. Functions that satisfy these conditions are called kernel functions.
Peter Orbanz Data Mining
THE MOST POPULAR KERNEL

RBF Kernel
$$k_{\mathrm{RBF}}(x, x') := \exp\left( -\frac{\|x - x'\|_2^2}{2\sigma^2} \right) \qquad \text{for some } \sigma \in \mathbb{R}_+$$
is called an RBF kernel (RBF = radial basis function). The parameter $\sigma$ is called bandwidth.
Other names for $k_{\mathrm{RBF}}$: Gaussian kernel, squared-exponential kernel.
If we fix $x'$, the function $k_{\mathrm{RBF}}(\cdot, x')$ is (up to scaling) a spherical Gaussian density on $\mathbb{R}^d$, with mean $x'$ and standard deviation $\sigma$.
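The RBF kernel definition can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function name `rbf_kernel` and the default `sigma=1.0` are assumptions made for the example, not part of the slides.

```python
import math

def rbf_kernel(x, x_prime, sigma=1.0):
    """Return exp(-||x - x'||^2 / (2 sigma^2)) for two equal-length sequences."""
    if sigma <= 0:
        raise ValueError("sigma (the bandwidth) must be positive")
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, x_prime))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

# Identical points have similarity 1; similarity decays with distance,
# and it decays faster for a smaller bandwidth.
print(rbf_kernel([0.0, 0.0], [0.0, 0.0]))
print(rbf_kernel([0.0, 0.0], [1.0, 1.0], sigma=0.5))
print(rbf_kernel([0.0, 0.0], [1.0, 1.0], sigma=2.0))
```

A small bandwidth makes the kernel treat only very close points as similar, while a large bandwidth spreads the similarity over a wider region.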
[Figure 4.3: Examples of two-dimensional circularly symmetric normal Parzen windows, shown for h = 0.2, h = 0.5 and h = 1 over axes x1 and x2. Panels: contours (d=2) and function surface (d=2).]
SVM WITH RBF KERNEL
$$f(x) = \operatorname{sign}\left( \sum_{i=1}^{n} y_i \alpha_i k_{\mathrm{RBF}}(x_i, x) \right)$$
- Circled points are support vectors. The two contour lines running through support vectors are the nonlinear counterparts of the convex hulls.
- The thick black line is the classifier.
- Think of a Gaussian-shaped function $k_{\mathrm{RBF}}(\cdot, x')$ centered at each support vector $x'$. These functions add up to a function surface over $\mathbb{R}^2$.
- The lines in the image are contour lines of this surface. The classifier runs along the bottom of the "valley" between the two classes.
- Smoothness of the contours is controlled by the bandwidth $\sigma$.
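The picture of Gaussian bumps adding up to a surface can be made concrete with a short sketch. The support vectors, labels and weights below are hypothetical, hand-picked values (not the result of training), and the function names are assumptions for illustration only.

```python
import math

def rbf_kernel(x, x_prime, sigma=1.0):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, x_prime))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def svm_decision_value(x, support_vectors, labels, alphas, sigma=1.0):
    """Height of the surface at x: label-weighted Gaussian bumps, one per support vector."""
    return sum(y * a * rbf_kernel(sv, x, sigma)
               for sv, y, a in zip(support_vectors, labels, alphas))

def svm_classify(x, support_vectors, labels, alphas, sigma=1.0):
    """Sign of the surface height, returned as +1 or -1."""
    value = svm_decision_value(x, support_vectors, labels, alphas, sigma)
    return 1 if value >= 0 else -1

# Hypothetical model: one support vector per class, equal weights.
support_vectors = [[-1.0, 0.0], [1.0, 0.0]]
labels = [-1, 1]
alphas = [1.0, 1.0]

print(svm_classify([0.8, 0.1], support_vectors, labels, alphas))    # close to the +1 support vector
print(svm_classify([-0.9, -0.2], support_vectors, labels, alphas))  # close to the -1 support vector
```

In this toy setup the surface is zero along the line x1 = 0, which plays the role of the "valley" between the two classes.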
CHOOSING A KERNEL
Theory
To define a kernel:
- We have to define a function of two arguments and prove that it is a kernel.
- This is done by checking a set of necessary and sufficient conditions known as Mercer's theorem.
Practice
The data analyst does not define a kernel, but tries some well-known standard kernels until one seems to work. Most common choices:
- The RBF kernel.
- The "linear kernel" $k_{\mathrm{SP}}(x, x') = \langle x, x' \rangle$, i.e. the standard, linear SVM.
Once a kernel is chosen
- The classifier can be trained by solving the optimization problem using standard software.
- SVM software packages include implementations of all common kernels.
WHICH FUNCTIONS WORK AS KERNELS?

Formal definition
A function $k : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ is called a kernel on $\mathbb{R}^d$ if there is some function $\phi : \mathbb{R}^d \to \mathcal{F}$ into some space $\mathcal{F}$ with scalar product $\langle \cdot, \cdot \rangle_{\mathcal{F}}$ such that
$$k(x, x') = \langle \phi(x), \phi(x') \rangle_{\mathcal{F}} \qquad \text{for all } x, x' \in \mathbb{R}^d .$$
In other words
- $k$ is a kernel if it can be interpreted as a scalar product on some other space.
- If we substitute $k(x, x')$ for $\langle x, x' \rangle$ in all SVM equations, we implicitly train a linear SVM on the space $\mathcal{F}$.
- That is why the SVM still works: It still uses scalar products, just on another space.
The mapping $\phi$
- To make this work, $\phi$ has to transform the data into data on which a linear SVM works.
- This is usually achieved by choosing $\mathcal{F}$ as a higher-dimensional space than $\mathbb{R}^d$.
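To see the formal definition at work, the sketch below checks numerically that one particular function is a scalar product in another space. The kernel $k(x, x') = \langle x, x' \rangle^2$ on $\mathbb{R}^2$ and its feature map $(x_1^2, \sqrt{2} x_1 x_2, x_2^2)$ are a standard textbook pair chosen here as an assumption for illustration; they are not taken from the slides.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def poly2_kernel(x, x_prime):
    """k(x, x') = <x, x'>^2, evaluated directly in R^2."""
    return dot(x, x_prime) ** 2

def poly2_feature_map(x):
    """Map a point of R^2 into R^3 so that the kernel becomes a plain scalar product."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

x, x_prime = (1.0, 2.0), (3.0, -1.0)
kernel_value = poly2_kernel(x, x_prime)
feature_space_value = dot(poly2_feature_map(x), poly2_feature_map(x_prime))

# Both numbers agree up to floating-point rounding.
print(kernel_value, feature_space_value)
```

The kernel is evaluated without ever forming the three-dimensional vectors, which is the point of substituting $k$ for the scalar product.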
MAPPING INTO HIGHER DIMENSIONS

Example
How can a map into higher dimensions make the class boundary (more) linear?
Consider
$$\phi : \mathbb{R}^2 \to \mathbb{R}^3 \qquad \text{where} \qquad \phi\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} := \begin{pmatrix} x_1^2 \\ 2 x_1 x_2 \\ x_2^2 \end{pmatrix}$$
[Figure: "Nonlinear Transformation in Kernel Space". Source: Machine Learning I, Joachim M. Buhmann.]
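A small sketch shows why the map in this example helps. It assumes, purely for illustration, that the two classes are separated by a circle of radius 1 around the origin; under the map with components $x_1^2$, $2 x_1 x_2$, $x_2^2$ that circle becomes the plane $z_1 + z_3 = 1$ in $\mathbb{R}^3$, so a linear classifier suffices there. The function names and the choice of weights are hypothetical.

```python
def feature_map(x):
    """The map from the example: (x1, x2) -> (x1^2, 2*x1*x2, x2^2)."""
    x1, x2 = x
    return (x1 * x1, 2.0 * x1 * x2, x2 * x2)

def linear_classifier_in_feature_space(z, radius=1.0):
    """sign(<w, z> - c) with w = (1, 0, 1) and c = radius^2, a plane in R^3."""
    w = (1.0, 0.0, 1.0)
    score = sum(wi * zi for wi, zi in zip(w, z)) - radius ** 2
    return 1 if score > 0 else -1

inside = (0.3, -0.4)   # distance 0.5 from the origin
outside = (1.2, 0.9)   # distance 1.5 from the origin

print(linear_classifier_in_feature_space(feature_map(inside)))   # inside the circle
print(linear_classifier_in_feature_space(feature_map(outside)))  # outside the circle
```

The classifier is linear in the mapped coordinates, yet its decision boundary in the original plane is the circle.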
MAPPING INTO HIGHER DIMENSIONS

Problem
In the previous example: We have to know what the data looks like to choose $\phi$!
Solution
- Choose high dimension $h$ for $\mathcal{F}$.
- Choose components $\phi_i$ of $\phi(x) = (\phi_1(x), \ldots, \phi_h(x))$ as different nonlinear mappings.
- If two points differ in $\mathbb{R}^d$, some of the nonlinear mappings will amplify differences.
The RBF kernel is an extreme case
- The function $k_{\mathrm{RBF}}$ can be shown to be a kernel, however:
- $\mathcal{F}$ is infinite-dimensional for this kernel.
DETERMINING WHETHER k IS A KERNEL

Mercer's theorem
A mathematical result called Mercer's theorem states that, if the function $k$ is positive, i.e.
$$\int_{\mathbb{R}^d \times \mathbb{R}^d} k(x, x') f(x) f(x') \, dx \, dx' \geq 0$$
for all functions $f$, then it can be written as
$$k(x, x') = \sum_{j=1}^{\infty} \lambda_j \phi_j(x) \phi_j(x') .$$
The $\phi_j$ are functions $\mathbb{R}^d \to \mathbb{R}$ and $\lambda_j \geq 0$. This means the (possibly infinite) vector $\phi(x) = (\sqrt{\lambda_1} \phi_1(x), \sqrt{\lambda_2} \phi_2(x), \ldots)$ is a feature map.
Kernel arithmetic
Various functions of kernels are again kernels: If $k_1$ and $k_2$ are kernels, then e.g.
$$k_1 + k_2 \qquad k_1 \cdot k_2 \qquad \text{const.} \cdot k_1 \;\; (\text{const.} \geq 0)$$
are again kernels.
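A finite-sample analogue of the positivity condition is that the Gram matrix $K_{ij} = k(x_i, x_j)$ satisfies $c^\top K c \geq 0$ for every coefficient vector $c$. The sketch below probes this with random vectors for the sum, product and positive multiple of two kernels. It is only a sanity check on a handful of random points, not a proof, and all names in it are assumptions for the example.

```python
import math
import random

def linear_kernel(x, x_prime):
    return sum(a * b for a, b in zip(x, x_prime))

def rbf_kernel(x, x_prime, sigma=1.0):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, x_prime))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def k_sum(x, x_prime):
    return linear_kernel(x, x_prime) + rbf_kernel(x, x_prime)

def k_prod(x, x_prime):
    return linear_kernel(x, x_prime) * rbf_kernel(x, x_prime)

def k_scaled(x, x_prime):
    return 3.0 * rbf_kernel(x, x_prime)

def not_a_kernel(x, x_prime):
    # Negating a kernel breaks positivity, so this should fail the check.
    return -linear_kernel(x, x_prime)

def min_quadratic_form(kernel, points, trials=200, seed=0):
    """Smallest value of c^T K c found over random coefficient vectors c."""
    rng = random.Random(seed)
    n = len(points)
    gram = [[kernel(p, q) for q in points] for p in points]
    smallest = float("inf")
    for _ in range(trials):
        c = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        value = sum(c[i] * c[j] * gram[i][j] for i in range(n) for j in range(n))
        smallest = min(smallest, value)
    return smallest

point_rng = random.Random(1)
points = [(point_rng.uniform(-2, 2), point_rng.uniform(-2, 2)) for _ in range(8)]

for name, kernel in [("sum", k_sum), ("product", k_prod),
                     ("scaled", k_scaled), ("negated", not_a_kernel)]:
    print(name, min_quadratic_form(kernel, points))
```

The first three minima stay nonnegative (up to rounding), while the negated linear kernel produces negative values and so cannot be a kernel.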
THE KERNEL TRICK
Kernels in general
- Many linear machine learning and statistics algorithms can be "kernelized".
- The only conditions are:
  1. The algorithm uses a scalar product.
  2. In all relevant equations, the data (and all other elements of $\mathbb{R}^d$) appear only inside a scalar product.
- This approach to making algorithms non-linear is known as the "kernel trick".
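As a small illustration of kernelizing something other than the SVM, consider the squared distance between two points. Expanding $\|x - x'\|^2 = \langle x, x \rangle - 2 \langle x, x' \rangle + \langle x', x' \rangle$ shows that the data appear only inside scalar products, so each one can be replaced by a kernel evaluation. The sketch below is an assumption-laden example (the function names are hypothetical) rather than anything prescribed by the slides.

```python
import math

def linear_kernel(x, x_prime):
    return sum(a * b for a, b in zip(x, x_prime))

def rbf_kernel(x, x_prime, sigma=1.0):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, x_prime))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def feature_space_sq_distance(kernel, x, x_prime):
    """Squared distance between the images of x and x' in the kernel's feature space,
    computed from kernel evaluations only."""
    return kernel(x, x) - 2.0 * kernel(x, x_prime) + kernel(x_prime, x_prime)

x, x_prime = (1.0, 2.0), (4.0, 6.0)

# With the linear kernel this is the ordinary squared Euclidean distance.
print(feature_space_sq_distance(linear_kernel, x, x_prime))  # 25.0

# With the RBF kernel the same formula measures distance in an
# infinite-dimensional space without ever constructing it.
print(feature_space_sq_distance(rbf_kernel, x, x_prime))
```

Any algorithm built on such distances (for example a nearest-neighbour rule) can be made nonlinear this way by swapping the kernel.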
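SUMMARY: KERNEL SVM

Optimization problem
$$\min_{v_H, c} \; \|v_H\|_{\mathcal{F}}^2 + \gamma \sum_{i=1}^{n} \xi_i^2 \qquad \text{s.t.} \quad y_i \left( k(v_H, x_i) - c \right) \geq 1 - \xi_i \quad \text{and} \quad \xi_i \geq 0$$
Note: $v_H$ now lives in $\mathcal{F}$.
Dual optimization problem
$$\max_{\alpha \in \mathbb{R}^n} \; W(\alpha) := \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j \left( k(x_i, x_j) + \frac{1}{\gamma} \mathbb{I}\{i = j\} \right) \qquad \text{s.t.} \quad \sum_{i=1}^{n} y_i \alpha_i = 0 \quad \text{and} \quad \alpha_i \geq 0$$
Classifier
$$f(x) = \operatorname{sgn}\left( \sum_{i=1}^{n} y_i \alpha_i k(x_i, x) \right)$$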
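The dual problem can be explored numerically by evaluating its objective for a candidate weight vector. The sketch below assumes the dual has the form $W(\alpha) = \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j (k(x_i, x_j) + \frac{1}{\gamma} \mathbb{I}\{i = j\})$, where `gamma` is a name assumed here for the slack-penalty parameter. It only evaluates the objective and checks the constraints; it does not solve the optimization problem, and the toy data are hypothetical.

```python
def linear_kernel(x, x_prime):
    return sum(a * b for a, b in zip(x, x_prime))

def dual_objective(alphas, labels, points, kernel, gamma):
    """Evaluate W(alpha) for given dual weights (gamma is an assumed parameter name)."""
    n = len(points)
    quad = 0.0
    for i in range(n):
        for j in range(n):
            k_ij = kernel(points[i], points[j])
            if i == j:
                k_ij += 1.0 / gamma  # diagonal term coming from the slack penalty
            quad += alphas[i] * alphas[j] * labels[i] * labels[j] * k_ij
    return sum(alphas) - 0.5 * quad

def is_feasible(alphas, labels, tol=1e-9):
    """Check the dual constraints: sum_i y_i alpha_i = 0 and alpha_i >= 0."""
    balanced = abs(sum(y * a for y, a in zip(labels, alphas))) <= tol
    return balanced and all(a >= 0 for a in alphas)

# Hypothetical two-point problem.
points = [(1.0, 0.0), (-1.0, 0.0)]
labels = [1, -1]
alphas = [0.5, 0.5]

print(is_feasible(alphas, labels))
print(dual_objective(alphas, labels, points, linear_kernel, gamma=1.0))
```

A real solver would search over feasible `alphas` to maximize this value; standard SVM packages do that internally.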
SUMMARY: SVMS
Basic SVM
- Linear classifier for linearly separable data.
- The position of the affine hyperplane is determined by maximizing the margin.
- Maximizing the margin is a convex optimization problem.
Full-fledged SVM
Ingredient          Purpose
Maximum margin      Good generalization properties
Slack variables     Overlapping classes; robustness against outliers
Kernel              Nonlinear decision boundary
Use in practice
- Software packages (e.g. libsvm, SVMlight)
- Choose a kernel function (e.g. RBF)
- Cross-validate margin parameter and kernel parameters (e.g. bandwidth)
HISTORY
Ca. 1957: Perceptron (Rosenblatt)
1970s: Vapnik and Chervonenkis develop learning theory
1986: Neural network renaissance (backpropagation algorithm by Rumelhart, Hinton, Williams)
1992: SVM (Boser, Guyon, Vapnik)
1997: Boosting (Freund and Schapire)
