
KERNELS

MOTIVATION: KERNELS

Idea
- The SVM uses the scalar product $\langle x, x_i \rangle$ as a measure of similarity between $x$ and $x_i$, and of distance to the hyperplane.
- Since the scalar product is linear, the SVM is a linear method.
- By using a nonlinear function instead, we can make the classifier nonlinear.

More precisely
- The scalar product can be regarded as a two-argument function $\langle\,\cdot\,,\,\cdot\,\rangle : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$.
- We will replace this function with a function $k : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ and substitute $k(x, x')$ for every occurrence of $\langle x, x' \rangle$ in the SVM formulae.
- Under certain conditions on $k$, all optimization/classification results for the SVM still hold. Functions that satisfy these conditions are called kernel functions.


THE MOST POPULAR KERNEL

RBF Kernel

$$k_{\mathrm{RBF}}(x, x') := \exp\left( -\frac{\|x - x'\|_2^2}{2\sigma^2} \right) \qquad \text{for some } \sigma \in \mathbb{R}_+$$

is called an RBF kernel (RBF = radial basis function). The parameter $\sigma$ is called the bandwidth.
Other names for $k_{\mathrm{RBF}}$: Gaussian kernel, squared-exponential kernel.
If we fix $x'$, the function $k_{\mathrm{RBF}}(\,\cdot\,, x')$ is (up to scaling) a spherical Gaussian density on $\mathbb{R}^d$, with mean $x'$ and standard deviation $\sigma$.
[Figure: two-dimensional circularly symmetric Gaussian functions, shown as function surfaces and contour plots over $(x_1, x_2)$ for bandwidths $h = 0.2$, $h = 0.5$ and $h = 1$.]
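To make the definition concrete, here is a minimal NumPy sketch of $k_{\mathrm{RBF}}$ (the function name and the example points below are illustrative choices):

    import numpy as np

    def rbf_kernel(x, x_prime, sigma=1.0):
        """k_RBF(x, x') = exp(-||x - x'||^2 / (2 sigma^2))."""
        diff = np.asarray(x, dtype=float) - np.asarray(x_prime, dtype=float)
        return np.exp(-np.dot(diff, diff) / (2.0 * sigma**2))

    # Similarity decays with distance; the bandwidth sigma controls how quickly.
    print(rbf_kernel([0.0, 0.0], [0.1, 0.1], sigma=0.5))  # close to 1: nearby points
    print(rbf_kernel([0.0, 0.0], [1.0, 1.0], sigma=0.5))  # close to 0: distant points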


SVM WITH RBF KERNEL

$$f(x) = \operatorname{sign}\left( \sum_{i=1}^{n} y_i \alpha_i^* \, k_{\mathrm{RBF}}(x_i, x) \right)$$

Circled points are support vectors. The two contour lines running through the support vectors are the nonlinear counterparts of the convex hulls.

The thick black line is the classifier.

Think of a Gaussian-shaped function $k_{\mathrm{RBF}}(\,\cdot\,, x')$ centered at each support vector $x'$. These functions add up to a function surface over $\mathbb{R}^2$.

The lines in the image are contour lines of this surface. The classifier runs along the bottom of the "valley" between the two classes.

Smoothness of the contours is controlled by $\sigma$.
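A minimal NumPy sketch of evaluating this decision function (the support vectors, labels and dual coefficients below are made-up illustrative values, not taken from the figure):

    import numpy as np

    def rbf_kernel(x, x_prime, sigma=1.0):
        diff = x - x_prime
        return np.exp(-np.dot(diff, diff) / (2.0 * sigma**2))

    def svm_rbf_classify(x, support_vectors, labels, alphas, sigma=1.0):
        # f(x) = sign( sum_i y_i * alpha_i * k_RBF(x_i, x) )
        s = sum(y_i * a_i * rbf_kernel(x_i, x, sigma)
                for x_i, y_i, a_i in zip(support_vectors, labels, alphas))
        return np.sign(s)

    # Made-up support vectors and dual coefficients, purely for illustration.
    sv = np.array([[0.0, 0.0], [2.0, 2.0]])
    y = np.array([-1.0, 1.0])
    alpha = np.array([0.7, 0.7])
    print(svm_rbf_classify(np.array([1.8, 1.9]), sv, y, alpha, sigma=0.5))  # +1, near the positive SV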


CHOOSING A KERNEL

Theory
To define a kernel:
- We have to define a function of two arguments and prove that it is a kernel.
- This is done by checking a set of necessary and sufficient conditions known as Mercer's theorem.

Practice
The data analyst does not define a kernel, but tries some well-known standard kernels until one seems to work. Most common choices:
- The RBF kernel.
- The "linear kernel" $k_{\mathrm{SP}}(x, x') = \langle x, x' \rangle$, i.e. the standard, linear SVM.

Once a kernel is chosen:
- The classifier can be trained by solving the optimization problem using standard software.
- SVM software packages include implementations of all common kernels (a minimal example follows below).
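For instance, a minimal sketch using scikit-learn's SVC (assuming scikit-learn as the software package; the toy data is made up):

    import numpy as np
    from sklearn.svm import SVC

    # Made-up toy data: class 1 outside the unit circle, class 0 inside (not linearly separable).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))
    y = (np.linalg.norm(X, axis=1) > 1.0).astype(int)

    # RBF-kernel SVM; scikit-learn's gamma corresponds to 1/(2 sigma^2) in the notation above.
    clf = SVC(kernel="rbf", C=1.0, gamma=0.5)
    clf.fit(X, y)
    print(clf.predict([[0.0, 0.0], [2.0, 2.0]]))  # typically [0 1]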


WHICH FUNCTIONS WORK AS KERNELS?

Formal definition
A function $k : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ is called a kernel on $\mathbb{R}^d$ if there is some function $\phi : \mathbb{R}^d \to F$ into some space $F$ with scalar product $\langle\,\cdot\,,\,\cdot\,\rangle_F$ such that

$$k(x, x') = \langle \phi(x), \phi(x') \rangle_F \qquad \text{for all } x, x' \in \mathbb{R}^d.$$

In other words
- $k$ is a kernel if it can be interpreted as a scalar product on some other space.
- If we substitute $k(x, x')$ for $\langle x, x' \rangle$ in all SVM equations, we implicitly train a linear SVM on the space $F$.
- That is why the SVM still works: it still uses scalar products, just on another space.

The mapping $\phi$
- To make this work, $\phi$ has to transform the data into data on which a linear SVM works.
- This is usually achieved by choosing $F$ as a higher-dimensional space than $\mathbb{R}^d$.


MAPPING INTO HIGHER DIMENSIONS

Example
How can a map into higher dimensions make the class boundary (more) linear? Consider the transformation into "kernel space"

$$\phi : \mathbb{R}^2 \to \mathbb{R}^3 \qquad \text{where} \qquad \phi\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} := \begin{pmatrix} x_1^2 \\ \sqrt{2}\, x_1 x_2 \\ x_2^2 \end{pmatrix}$$


[Figure: a class boundary that is nonlinear in the input space $\mathbb{R}^2$ becomes linear after the transformation $\phi$ into kernel space. Source: Machine Learning I, Joachim M. Buhmann.]
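A small NumPy sketch of this map (the $\sqrt{2}$ factor is the conventional choice that makes $\langle \phi(x), \phi(x') \rangle = \langle x, x' \rangle^2$; the example points are made up):

    import numpy as np

    def phi(x):
        """Map R^2 -> R^3: (x1, x2) |-> (x1^2, sqrt(2)*x1*x2, x2^2)."""
        x1, x2 = x
        return np.array([x1**2, np.sqrt(2.0) * x1 * x2, x2**2])

    x, x_prime = np.array([1.0, 2.0]), np.array([-0.5, 1.5])

    # The scalar product in R^3 equals the squared scalar product in R^2,
    # i.e. the map corresponds to the degree-2 polynomial kernel k(x, x') = <x, x'>^2.
    print(np.dot(phi(x), phi(x_prime)), np.dot(x, x_prime) ** 2)  # both 6.25

    # A circular boundary x1^2 + x2^2 = r^2 in R^2 becomes the linear
    # constraint z1 + z3 = r^2 in the new coordinates z = phi(x).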

MAPPING INTO HIGHER DIMENSIONS


Problem
In the previous example: We have to know what the data looks like to choose $\phi$!

Solution
- Choose a high dimension $h$ for $F$.
- Choose the components $\phi_i$ of $\phi(x) = (\phi_1(x), \ldots, \phi_h(x))$ as different nonlinear mappings.
- If two points differ in $\mathbb{R}^d$, some of the nonlinear mappings will amplify the differences.

The RBF kernel is an extreme case
- The function $k_{\mathrm{RBF}}$ can be shown to be a kernel; however, $F$ is infinite-dimensional for this kernel.


DETERMINING WHETHER k IS A KERNEL

Mercer's theorem
A mathematical result called Mercer's theorem states that, if the function $k$ is positive, i.e.

$$\int_{\mathbb{R}^d \times \mathbb{R}^d} k(x, x') f(x) f(x')\, dx\, dx' \;\geq\; 0$$

for all functions $f$, then it can be written as

$$k(x, x') = \sum_{j=1}^{\infty} \lambda_j \phi_j(x) \phi_j(x') \;.$$

The $\phi_j$ are functions $\mathbb{R}^d \to \mathbb{R}$ and $\lambda_j \geq 0$. This means the (possibly infinite) vector $\phi(x) = (\sqrt{\lambda_1}\,\phi_1(x), \sqrt{\lambda_2}\,\phi_2(x), \ldots)$ is a feature map.

Kernel arithmetic
Various functions of kernels are again kernels: If $k_1$ and $k_2$ are kernels, then e.g.

$$k_1 + k_2, \qquad k_1 \cdot k_2, \qquad \mathrm{const.} \cdot k_1$$

are again kernels.
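A numerical sanity check of the kernel arithmetic (a sketch, not a proof; the points and the two base kernels are arbitrary choices): the Gram matrices built from $k_1 + k_2$ and $k_1 \cdot k_2$ remain positive semidefinite.

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(5, 2))  # made-up points

    def gram(kernel, X):
        return np.array([[kernel(a, b) for b in X] for a in X])

    k_rbf = lambda a, b: np.exp(-np.sum((a - b) ** 2) / 2.0)  # RBF kernel, sigma = 1
    k_lin = lambda a, b: np.dot(a, b)                         # linear kernel

    K1, K2 = gram(k_rbf, X), gram(k_lin, X)

    # Sum and elementwise product of the Gram matrices: smallest eigenvalue is >= 0
    # (up to numerical precision), consistent with k1 + k2 and k1 * k2 being kernels.
    for K in (K1 + K2, K1 * K2):
        print(np.min(np.linalg.eigvalsh(K)) >= -1e-10)  # True, True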

THE KERNEL TRICK

Kernels in general
- Many linear machine learning and statistics algorithms can be "kernelized". The only conditions are:
  1. The algorithm uses a scalar product.
  2. In all relevant equations, the data (and all other elements of $\mathbb{R}^d$) appear only inside a scalar product.
- This approach to making algorithms non-linear is known as the "kernel trick" (a small illustration follows below).
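As an illustration of the kernel trick outside the SVM (a sketch of a kernelized perceptron; the data are made up): the perceptron accesses the data only through scalar products, so replacing $\langle x_i, x \rangle$ by $k(x_i, x)$ makes it nonlinear.

    import numpy as np

    def rbf(a, b, sigma=1.0):
        return np.exp(-np.sum((a - b) ** 2) / (2 * sigma**2))

    def kernel_perceptron(X, y, kernel, epochs=10):
        """Kernelized perceptron: the data enter only through kernel evaluations."""
        n = len(X)
        alpha = np.zeros(n)  # mistake counts, one per training point
        K = np.array([[kernel(a, b) for b in X] for a in X])
        for _ in range(epochs):
            for i in range(n):
                if np.sign(np.sum(alpha * y * K[:, i])) != y[i]:
                    alpha[i] += 1.0  # the only update; no weight vector in R^d is ever formed
        return alpha

    # Made-up data: label +1 inside the unit circle, -1 outside (not linearly separable).
    rng = np.random.default_rng(2)
    X = rng.uniform(-2, 2, size=(100, 2))
    y = np.where(np.linalg.norm(X, axis=1) < 1.0, 1.0, -1.0)
    alpha = kernel_perceptron(X, y, rbf)
    f = lambda x: np.sign(np.sum(alpha * y * np.array([rbf(xi, x) for xi in X])))
    print(f(np.array([0.0, 0.0])), f(np.array([1.8, 1.8])))  # typically 1.0 -1.0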


SUMMARY: KERNEL SVM

Optimization problem

$$\min_{v_H, c}\; \|v_H\|_F^2 + \gamma \sum_{i=1}^{n} \xi_i \qquad \text{s.t.}\quad y_i \left( k(v_H, x_i) - c \right) \geq 1 - \xi_i \;\text{ and }\; \xi_i \geq 0$$

Note: $v_H$ now lives in $F$.

Dual optimization problem

$$\max_{\alpha \in \mathbb{R}^n}\; W(\alpha) := \sum_{i=1}^{n} \alpha_i \;-\; \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j \left( k(x_i, x_j) + \frac{1}{\gamma}\, \mathbb{I}\{i = j\} \right)$$

$$\text{s.t.}\quad \sum_{i=1}^{n} y_i \alpha_i = 0 \quad\text{and}\quad \alpha_i \geq 0$$

Classifier

$$f(x) = \operatorname{sgn}\left( \sum_{i=1}^{n} y_i \alpha_i^* \, k(x_i, x) \right)$$
SUMMARY: SVMS

Basic SVM
- Linear classifier for linearly separable data.
- The position of the affine hyperplane is determined by maximizing the margin.
- Maximizing the margin is a convex optimization problem.

Full-fledged SVM

Ingredient        Purpose
Maximum margin    Good generalization properties
Slack variables   Overlapping classes; robustness against outliers
Kernel            Nonlinear decision boundary

Use in practice
- Software packages (e.g. libsvm, SVMlight).
- Choose a kernel function (e.g. RBF).
- Cross-validate the margin parameter and the kernel parameters (e.g. bandwidth); a sketch follows below.
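For example, a minimal cross-validation sketch with scikit-learn (one possible package; the parameter grid and toy data are arbitrary illustrative choices):

    import numpy as np
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    # Made-up toy data.
    rng = np.random.default_rng(4)
    X = rng.normal(size=(200, 2))
    y = (np.linalg.norm(X, axis=1) > 1.0).astype(int)

    # Cross-validate the margin parameter C and the RBF bandwidth (via gamma = 1/(2 sigma^2)).
    grid = GridSearchCV(SVC(kernel="rbf"),
                        param_grid={"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1, 10]},
                        cv=5)
    grid.fit(X, y)
    print(grid.best_params_, grid.best_score_)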


HISTORY

Ca. 1957: Perceptron (Rosenblatt)

1970s: Vapnik and Chervonenkis develop learning theory

1986: Neural network renaissance (backpropagation algorithm by Rumelhart, Hinton, Williams)

1992: SVM (Boser, Guyon, Vapnik)

1997: Boosting (Freund and Schapire)

