
Lecture Slides for

INTRODUCTION TO MACHINE LEARNING, 3RD EDITION
ETHEM ALPAYDIN
© The MIT Press, 2014

alpaydin@boun.edu.tr
http://www.cmpe.boun.edu.tr/~ethem/i2ml3e
CHAPTER 13: KERNEL MACHINES
Kernel Machines
• Discriminant-based: no need to estimate densities first
• Define the discriminant in terms of support vectors
• Use of kernel functions: application-specific measures of similarity
• No need to represent instances as vectors
• Convex optimization problems with a unique solution
Optimal Separating Hyperplane
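$$\mathcal{X} = \{x^t, r^t\}_t \quad\text{where}\quad r^t = \begin{cases} +1 & \text{if } x^t \in C_1 \\ -1 & \text{if } x^t \in C_2 \end{cases}$$
Find $w$ and $w_0$ such that
$$w^T x^t + w_0 \ge +1 \quad\text{for } r^t = +1$$
$$w^T x^t + w_0 \le -1 \quad\text{for } r^t = -1$$
which can be rewritten as
$$r^t (w^T x^t + w_0) \ge +1$$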

(Cortes and Vapnik, 1995; Vapnik, 1995)
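The combined constraint can be checked numerically. The following is a minimal sketch, assuming a hypothetical 2-D toy sample and hand-picked values of w and w0 (none of these numbers come from the slides); it only tests whether a given hyperplane satisfies r^t (w^T x^t + w0) >= +1 for all instances.

```python
def discriminant(w, w0, x):
    """Linear discriminant g(x) = w^T x + w0."""
    return sum(wi * xi for wi, xi in zip(w, x)) + w0


def satisfies_margin(w, w0, X, r):
    """True if r^t (w^T x^t + w0) >= 1 holds for every instance t."""
    return all(rt * discriminant(w, w0, xt) >= 1 for xt, rt in zip(X, r))


# Hypothetical toy sample: two instances per class, labels r^t in {+1, -1}.
X = [(2.0, 2.0), (3.0, 3.0), (0.0, 0.0), (-1.0, 0.0)]
r = [+1, +1, -1, -1]

# Both parameter settings describe the same hyperplane x1 + x2 = 2,
# but only the first is scaled so that the closest instances give |g(x)| = 1.
ok_scaled = satisfies_margin((0.5, 0.5), -1.0, X, r)
ok_unscaled = satisfies_margin((0.1, 0.1), -0.2, X, r)
print(ok_scaled, ok_unscaled)
```

The second setting separates the classes correctly yet fails the check, which illustrates that the constraint fixes the scale of w as well as the side of the hyperplane.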


Margin
• Distance from the discriminant to the closest instances on either side
• Distance of $x^t$ to the hyperplane is $\dfrac{|w^T x^t + w_0|}{\|w\|}$
• We require $\dfrac{r^t (w^T x^t + w_0)}{\|w\|} \ge \rho, \ \forall t$
• For a unique solution, fix $\rho\|w\| = 1$; then, to maximize the margin,
$$\min \frac{1}{2}\|w\|^2 \quad\text{subject to}\quad r^t (w^T x^t + w_0) \ge +1, \ \forall t$$
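As a small illustration of the distance formula and of the normalization ρ||w|| = 1, here is a sketch on made-up numbers. The hyperplane and the three instances are assumptions, chosen so that the closest instances satisfy |w^T x + w0| = 1.

```python
import math


def distance_to_hyperplane(w, w0, x):
    """|w^T x + w0| / ||w||."""
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    return abs(sum(wi * xi for wi, xi in zip(w, x)) + w0) / norm_w


# Hypothetical hyperplane with ||w|| = 5 and three instances.
w, w0 = (3.0, 4.0), -5.0
X = [(2.0, 0.0), (0.0, 1.0), (3.0, 4.0)]

distances = [distance_to_hyperplane(w, w0, x) for x in X]
rho = min(distances)                           # margin: distance to the closest instance
norm_w = math.sqrt(sum(wi * wi for wi in w))
print(distances, rho, rho * norm_w)            # rho * ||w|| should be 1 here
```

With this scaling the margin equals 1/||w||, so minimizing ||w|| maximizes the margin.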
Margin
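$$\min \frac{1}{2}\|w\|^2 \quad\text{subject to}\quad r^t (w^T x^t + w_0) \ge +1, \ \forall t$$
$$\begin{aligned}
L_p &= \frac{1}{2}\|w\|^2 - \sum_{t=1}^{N} \alpha^t \left[ r^t (w^T x^t + w_0) - 1 \right] \\
    &= \frac{1}{2}\|w\|^2 - \sum_{t=1}^{N} \alpha^t r^t (w^T x^t + w_0) + \sum_{t=1}^{N} \alpha^t
\end{aligned}$$
$$\frac{\partial L_p}{\partial w} = 0 \ \Rightarrow\ w = \sum_{t=1}^{N} \alpha^t r^t x^t$$
$$\frac{\partial L_p}{\partial w_0} = 0 \ \Rightarrow\ \sum_{t=1}^{N} \alpha^t r^t = 0$$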
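$$\begin{aligned}
L_d &= \frac{1}{2} w^T w - w^T \sum_t \alpha^t r^t x^t - w_0 \sum_t \alpha^t r^t + \sum_t \alpha^t \\
    &= -\frac{1}{2} w^T w + \sum_t \alpha^t \\
    &= -\frac{1}{2} \sum_t \sum_s \alpha^t \alpha^s r^t r^s (x^t)^T x^s + \sum_t \alpha^t
\end{aligned}$$
subject to $\sum_t \alpha^t r^t = 0$ and $\alpha^t \ge 0, \ \forall t$

Most $\alpha^t$ are 0 and only a small number have $\alpha^t > 0$; they are the support vectors.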
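To make the role of the α^t concrete, the sketch below recovers w and w0 from a given dual solution. The toy data and the α values are assumptions worked out by hand for this example (they satisfy the constraints above); no optimizer is run here.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


# Hypothetical toy problem; alpha was derived by hand for these three instances.
X = [(2.0, 2.0), (0.0, 0.0), (3.0, 3.0)]
r = [+1, -1, +1]
alpha = [0.25, 0.25, 0.0]          # the third instance is not a support vector

# Dual constraint: sum_t alpha^t r^t = 0
constraint = sum(a * rt for a, rt in zip(alpha, r))

# w = sum_t alpha^t r^t x^t
w = [sum(a * rt * xt[j] for a, rt, xt in zip(alpha, r, X)) for j in range(len(X[0]))]

# Support vectors are the instances with alpha^t > 0; any of them gives w0
# through r^t (w^T x^t + w0) = 1, that is, w0 = r^t - w^T x^t.
support = [t for t, a in enumerate(alpha) if a > 0]
sv = support[0]
w0 = r[sv] - dot(w, X[sv])

margins = [rt * (dot(w, xt) + w0) for xt, rt in zip(X, r)]
n = len(X)
dual = sum(alpha) - 0.5 * sum(
    alpha[t] * alpha[s] * r[t] * r[s] * dot(X[t], X[s]) for t in range(n) for s in range(n)
)
primal = 0.5 * dot(w, w)
print(w, w0, support, margins, dual, primal)
```

The two support vectors lie exactly on the margin (value 1), the third instance lies beyond it, and the dual and primal objectives coincide, as expected at the optimum of a convex problem.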
Soft Margin Hyperplane
• Not linearly separable:
$$r^t (w^T x^t + w_0) \ge 1 - \xi^t$$
• Soft error:
$$\sum_t \xi^t$$
• New primal is
$$L_p = \frac{1}{2}\|w\|^2 + C \sum_t \xi^t - \sum_t \alpha^t \left[ r^t (w^T x^t + w_0) - 1 + \xi^t \right] - \sum_t \mu^t \xi^t$$
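The step from this primal to its dual can be sketched as follows. This is a standard derivation written in the notation of the slides, where the μ^t are the Lagrange multipliers of the constraints ξ^t >= 0; it is a worked outline, not a quotation from the slides.

```latex
\begin{align*}
\frac{\partial L_p}{\partial w} = 0 \ &\Rightarrow\ w = \sum_t \alpha^t r^t x^t \\
\frac{\partial L_p}{\partial w_0} = 0 \ &\Rightarrow\ \sum_t \alpha^t r^t = 0 \\
\frac{\partial L_p}{\partial \xi^t} = 0 \ &\Rightarrow\ C - \alpha^t - \mu^t = 0
\end{align*}
% Because \mu^t \ge 0, the last condition bounds each multiplier: 0 \le \alpha^t \le C.
% Substituting back removes w, w_0, \xi^t, and \mu^t:
\begin{align*}
L_d = \sum_t \alpha^t - \frac{1}{2} \sum_t \sum_s \alpha^t \alpha^s r^t r^s (x^t)^T x^s
\quad \text{subject to} \quad \sum_t \alpha^t r^t = 0, \quad 0 \le \alpha^t \le C, \ \forall t
\end{align*}
```

The objective is the same as in the separable case; the only change is the upper bound C on each α^t.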
Hinge Loss
$$L_{hinge}(y^t, r^t) = \begin{cases} 0 & \text{if } y^t r^t \ge 1 \\ 1 - y^t r^t & \text{otherwise} \end{cases}$$
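The two cases of the hinge loss collapse into a single max expression. A minimal sketch, with made-up outputs y and labels r:

```python
def hinge_loss(y, r):
    """L_hinge(y, r) = max(0, 1 - y * r): zero if y * r >= 1, else 1 - y * r."""
    return max(0.0, 1.0 - y * r)


# Hypothetical (output, label) pairs.
examples = [(2.0, +1), (0.5, +1), (-0.5, +1), (-3.0, -1)]
losses = [hinge_loss(y, r) for y, r in examples]
print(losses)  # [0.0, 0.5, 1.5, 0.0]
```

The second example is on the correct side but inside the margin, so it is still penalized; the third is misclassified and its penalty exceeds 1.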
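ν-SVM
$$\min \frac{1}{2}\|w\|^2 - \nu\rho + \frac{1}{N} \sum_t \xi^t$$
subject to
$$r^t (w^T x^t + w_0) \ge \rho - \xi^t, \quad \xi^t \ge 0, \quad \rho \ge 0$$
$$L_d = -\frac{1}{2} \sum_{t=1}^{N} \sum_s \alpha^t \alpha^s r^t r^s (x^t)^T x^s$$
subject to
$$\sum_t \alpha^t r^t = 0, \quad 0 \le \alpha^t \le \frac{1}{N}, \quad \sum_t \alpha^t \ge \nu$$

ν controls the fraction of support vectors: it is a lower bound on the fraction of support vectors and an upper bound on the fraction of instances with margin errors.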


Kernel Trick
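• Preprocess input $x$ by basis functions:
$$z = \varphi(x), \qquad g(z) = w^T z, \qquad g(x) = w^T \varphi(x)$$
• The SVM solution:
$$w = \sum_t \alpha^t r^t z^t = \sum_t \alpha^t r^t \varphi(x^t)$$
$$g(x) = w^T \varphi(x) = \sum_t \alpha^t r^t \varphi(x^t)^T \varphi(x)$$
$$g(x) = \sum_t \alpha^t r^t K(x^t, x)$$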
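The kernelized discriminant needs only kernel evaluations against the support vectors. The sketch below assumes a hypothetical hand-derived dual solution and adds an optional bias term w0, which does not appear in the formula above (there it can be thought of as absorbed into φ). With a linear kernel the result must agree with the explicit w^T x + w0.

```python
def linear_kernel(a, b):
    """K(a, b) = a^T b, the kernel that corresponds to phi(x) = x."""
    return sum(ai * bi for ai, bi in zip(a, b))


def g(x, X, r, alpha, kernel, w0=0.0):
    """g(x) = sum_t alpha^t r^t K(x^t, x) + w0, summing over support vectors only."""
    return sum(a * rt * kernel(xt, x) for a, rt, xt in zip(alpha, r, X) if a > 0) + w0


# Hypothetical training set and dual solution (alpha worked out by hand).
X = [(2.0, 2.0), (0.0, 0.0), (3.0, 3.0)]
r = [+1, -1, +1]
alpha = [0.25, 0.25, 0.0]
w0 = -1.0

x_new = (3.0, 1.0)
g_kernel = g(x_new, X, r, alpha, linear_kernel, w0)

# Explicit form for comparison: w = sum_t alpha^t r^t x^t = (0.5, 0.5).
g_explicit = 0.5 * x_new[0] + 0.5 * x_new[1] + w0
print(g_kernel, g_explicit)
```

Swapping `linear_kernel` for any other valid kernel changes the feature space without changing this function.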
Vectorial Kernels
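• Polynomials of degree $q$:
$$K(x^t, x) = (x^T x^t + 1)^q$$
For example, with $q = 2$ and two-dimensional inputs:
$$\begin{aligned}
K(x, y) &= (x^T y + 1)^2 \\
        &= (x_1 y_1 + x_2 y_2 + 1)^2 \\
        &= 1 + 2 x_1 y_1 + 2 x_2 y_2 + 2 x_1 x_2 y_1 y_2 + x_1^2 y_1^2 + x_2^2 y_2^2
\end{aligned}$$
$$\varphi(x) = \left[ 1, \sqrt{2} x_1, \sqrt{2} x_2, \sqrt{2} x_1 x_2, x_1^2, x_2^2 \right]^T$$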
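The identity K(x, y) = φ(x)^T φ(y) for the degree-2 polynomial kernel can be verified numerically. This is a small sketch on two arbitrary, made-up input vectors:

```python
import math


def poly2_kernel(x, y):
    """K(x, y) = (x^T y + 1)^2 for 2-D inputs."""
    return (x[0] * y[0] + x[1] * y[1] + 1.0) ** 2


def phi(x):
    """Explicit feature map [1, sqrt(2) x1, sqrt(2) x2, sqrt(2) x1 x2, x1^2, x2^2]."""
    s = math.sqrt(2.0)
    x1, x2 = x
    return [1.0, s * x1, s * x2, s * x1 * x2, x1 * x1, x2 * x2]


x, y = (1.0, 2.0), (3.0, -1.0)   # hypothetical inputs
k_direct = poly2_kernel(x, y)                          # (3 - 2 + 1)^2
k_mapped = sum(a * b for a, b in zip(phi(x), phi(y)))  # dot product in the 6-D space
print(k_direct, k_mapped)
```

The kernel evaluates the six-dimensional dot product with one two-dimensional dot product, which is the computational point of the kernel trick.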
Vectorial Kernels
• Radial-basis functions:
$$K(x^t, x) = \exp\left[ -\frac{\|x^t - x\|^2}{2 s^2} \right]$$
Defining kernels
• Kernel "engineering"
• Defining good measures of similarity
• String kernels, graph kernels, image kernels, ...
• Empirical kernel map: Define a set of templates $m_i$ and a score function $s(x, m_i)$
$$\varphi(x^t) = [s(x^t, m_1), s(x^t, m_2), \ldots, s(x^t, m_M)]$$
and
$$K(x, x^t) = \varphi(x)^T \varphi(x^t)$$
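An empirical kernel map can be sketched in a few lines. The choice of a Gaussian similarity as the score function s(x, m) and the three templates are assumptions made for illustration; any application-specific score could be plugged in instead.

```python
import math


def gaussian_score(x, m, s=1.0):
    """Assumed score function: s(x, m) = exp(-||x - m||^2 / (2 s^2))."""
    sq = sum((xi - mi) ** 2 for xi, mi in zip(x, m))
    return math.exp(-sq / (2.0 * s * s))


def empirical_map(x, templates, score):
    """phi(x) = [s(x, m_1), ..., s(x, m_M)]."""
    return [score(x, m) for m in templates]


def empirical_kernel(x, y, templates, score):
    """K(x, y) = phi(x)^T phi(y)."""
    px = empirical_map(x, templates, score)
    py = empirical_map(y, templates, score)
    return sum(a * b for a, b in zip(px, py))


templates = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]   # hypothetical templates m_1..m_3
x, y = (0.0, 0.0), (1.0, 1.0)

phi_x = empirical_map(x, templates, gaussian_score)
k_xy = empirical_kernel(x, y, templates, gaussian_score)
k_yx = empirical_kernel(y, x, templates, gaussian_score)
print(phi_x, k_xy, k_yx)
```

Because the kernel is an explicit dot product of score vectors, it is symmetric and positive semidefinite by construction, whatever score function is used.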
Multiple Kernel Learning
• Fixed kernel combination:
$$K(x, y) = \begin{cases} c\,K_1(x, y) \\ K_1(x, y) + K_2(x, y) \\ K_1(x, y)\,K_2(x, y) \end{cases}$$
• Adaptive kernel combination:
$$K(x, y) = \sum_{i=1}^{m} \eta_i K_i(x, y)$$
$$L_d = \sum_t \alpha^t - \frac{1}{2} \sum_t \sum_s \alpha^t \alpha^s r^t r^s \sum_i \eta_i K_i(x^t, x^s)$$
$$g(x) = \sum_t \alpha^t r^t \sum_i \eta_i K_i(x^t, x)$$
• Localized kernel combination:
$$g(x) = \sum_t \alpha^t r^t \sum_i \eta_i(x|\theta) K_i(x^t, x)$$
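The combinations above can be expressed as higher-order functions that take kernels and return a new kernel. In this sketch the base kernels, inputs, and weights η_i are hypothetical fixed values; in multiple kernel learning the weights would be learned together with the α^t, which is not shown here.

```python
def linear(x, y):
    return sum(a * b for a, b in zip(x, y))


def poly2(x, y):
    return (linear(x, y) + 1.0) ** 2


def weighted_sum(kernels, eta):
    """K(x, y) = sum_i eta_i K_i(x, y); eta = [1, 1] gives the plain sum K1 + K2."""
    def K(x, y):
        return sum(e * k(x, y) for e, k in zip(eta, kernels))
    return K


def product(k1, k2):
    """K(x, y) = K1(x, y) K2(x, y)."""
    def K(x, y):
        return k1(x, y) * k2(x, y)
    return K


x, y = (1.0, 2.0), (3.0, -1.0)   # linear(x, y) = 1, poly2(x, y) = 4

k_sum = weighted_sum([linear, poly2], [1.0, 1.0])(x, y)
k_prod = product(linear, poly2)(x, y)
k_adapt = weighted_sum([linear, poly2], [0.25, 0.75])(x, y)   # assumed weights
print(k_sum, k_prod, k_adapt)
```

Since the combined object is again a function of two inputs, it can be passed to any routine that expects a single kernel.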
Multiclass Kernel Machines
• 1-vs-all
• Pairwise separation
• Error-Correcting Output Codes (section 17.5)
• Single multiclass optimization:
$$\min \frac{1}{2} \sum_{i=1}^{K} \|w_i\|^2 + C \sum_i \sum_t \xi_i^t$$
subject to
$$w_{z^t}^T x^t + w_{z^t 0} \ge w_i^T x^t + w_{i0} + 2 - \xi_i^t, \quad \forall i \ne z^t, \quad \xi_i^t \ge 0$$
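At prediction time, both the 1-vs-all scheme and the single multiclass formulation pick the class whose discriminant is largest. A minimal sketch of that decision rule, assuming three hypothetical, already-trained linear discriminants:

```python
def predict(x, W, w0):
    """Return (class index, scores) where scores[i] = w_i^T x + w_i0."""
    scores = [sum(wj * xj for wj, xj in zip(w, x)) + b for w, b in zip(W, w0)]
    best = max(range(len(scores)), key=lambda i: scores[i])
    return best, scores


# Hypothetical weights for K = 3 classes in 2-D (not learned here).
W = [(1.0, 0.0), (0.0, 1.0), (-1.0, -1.0)]
w0 = [0.0, 0.0, 0.0]

label_a, scores_a = predict((2.0, 0.5), W, w0)
label_b, scores_b = predict((-1.0, -2.0), W, w0)
print(label_a, scores_a)
print(label_b, scores_b)
```

The constraint in the single-optimization formulation asks that, for each training instance, the score of its true class z^t exceed every other class score by at least 2, up to the slack.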
SVM for Regression
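• Use a linear model (possibly kernelized):
$$f(x) = w^T x + w_0$$
• Use the ε-sensitive error function:
$$e_\varepsilon(r^t, f(x^t)) = \begin{cases} 0 & \text{if } |r^t - f(x^t)| < \varepsilon \\ |r^t - f(x^t)| - \varepsilon & \text{otherwise} \end{cases}$$
$$\min \frac{1}{2}\|w\|^2 + C \sum_t (\xi_+^t + \xi_-^t)$$
subject to
$$r^t - (w^T x^t + w_0) \le \varepsilon + \xi_+^t$$
$$(w^T x^t + w_0) - r^t \le \varepsilon + \xi_-^t$$
$$\xi_+^t, \xi_-^t \ge 0$$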
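The ε-sensitive error ignores deviations smaller than ε and grows linearly beyond that. A short sketch with made-up targets and predictions:

```python
def eps_insensitive(r, f, eps):
    """e_eps(r, f) = 0 if |r - f| < eps, else |r - f| - eps."""
    return max(0.0, abs(r - f) - eps)


eps = 0.5
pairs = [(1.0, 1.25), (1.0, 2.0), (3.0, 1.0)]   # hypothetical (target, prediction) pairs
errors = [eps_insensitive(r, f, eps) for r, f in pairs]
print(errors)  # [0.0, 0.5, 1.5]
```

Instances inside the ε-tube contribute no error, so, as with classification, only a subset of the training instances ends up defining the solution.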
Kernel Regression
[Figure panels: polynomial kernel; Gaussian kernel]
Kernel Machines for Ranking
• We require not only that the scores be in the correct order but also that they be separated by a margin of at least +1 unit.
• Linear case:
$$\min \frac{1}{2}\|w\|^2 + C \sum_t \xi^t$$
subject to
$$w^T x^u \ge w^T x^v + 1 - \xi^t, \quad \forall t : r^u \prec r^v, \quad \xi^t \ge 0$$
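Each constraint involves one ordered pair (u, v), and its slack measures by how much the required unit margin between the two scores is missed. The sketch below evaluates the slacks and the objective for a fixed, hypothetical w; the pairs and the value of C are made up, and each pair lists first the instance that should be ranked higher.

```python
def score(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def ranking_slacks(w, pairs):
    """Slack per pair (x_u, x_v): max(0, 1 - (w^T x_u - w^T x_v))."""
    return [max(0.0, 1.0 - (score(w, xu) - score(w, xv))) for xu, xv in pairs]


w = (1.0, 0.0)                       # hypothetical weight vector
pairs = [
    ((3.0, 0.0), (1.0, 0.0)),        # correct order, margin satisfied
    ((1.5, 0.0), (1.0, 0.0)),        # correct order, margin too small
    ((0.0, 0.0), (1.0, 0.0)),        # wrong order
]
C = 1.0

slacks = ranking_slacks(w, pairs)
objective = 0.5 * score(w, w) + C * sum(slacks)
print(slacks, objective)  # [0.0, 0.5, 2.0] 3.0
```

The slack is a hinge loss applied to the score difference, so the problem has the same structure as the soft-margin classifier trained on pairwise differences.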
One-Class Kernel Machines
• Consider a sphere with center $a$ and radius $R$
$$\min R^2 + C \sum_t \xi^t$$
subject to
$$\|x^t - a\|^2 \le R^2 + \xi^t, \quad \xi^t \ge 0$$
$$L_d = \sum_t \alpha^t (x^t)^T x^t - \sum_{t=1}^{N} \sum_s \alpha^t \alpha^s (x^t)^T x^s$$
subject to
$$0 \le \alpha^t \le C, \quad \sum_t \alpha^t = 1$$
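For a given sphere, the slack of an instance is the amount by which its squared distance to the center exceeds R^2. The following sketch evaluates the slacks and the primal objective for a fixed, hypothetical center, radius, and C; it does not solve the optimization problem.

```python
def sq_dist(x, a):
    return sum((xi - ai) ** 2 for xi, ai in zip(x, a))


def sphere_slacks(X, a, R):
    """xi^t = max(0, ||x^t - a||^2 - R^2); zero for instances inside the sphere."""
    return [max(0.0, sq_dist(x, a) - R * R) for x in X]


# Hypothetical sphere and sample.
a, R, C = (0.0, 0.0), 1.0, 0.1
X = [(0.5, 0.0), (0.0, 1.0), (2.0, 0.0)]

slacks = sphere_slacks(X, a, R)
outliers = [t for t, s in enumerate(slacks) if s > 0]
objective = R * R + C * sum(slacks)
print(slacks, outliers, objective)
```

C sets the trade-off: a larger C makes outliers more expensive and pushes the solution toward a bigger sphere that encloses more of the sample.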
Large Margin Nearest Neighbor
• Learns the matrix $M$ of the Mahalanobis metric:
$$D(x_i, x_j) = (x_i - x_j)^T M (x_i - x_j)$$
• For three instances $i$, $j$, and $l$, where $i$ and $j$ are of the same class and $l$ is of a different class, we require
$$D(x_i, x_l) > D(x_i, x_j) + 1$$
If this is not satisfied, we have a slack for the difference, and we learn $M$ to minimize the sum of such slacks over all $(i, j, l)$ triples ($j$ and $l$ being among the $k$ nearest neighbors of $i$, over all $i$).
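The slack of a single triple can be computed directly from the definition. This sketch uses made-up points and two hand-picked matrices M to show how changing the metric can remove a violation; it does not implement the learning of M.

```python
def mahalanobis(xi, xj, M):
    """D(xi, xj) = (xi - xj)^T M (xi - xj)."""
    d = [a - b for a, b in zip(xi, xj)]
    Md = [sum(M[p][q] * d[q] for q in range(len(d))) for p in range(len(d))]
    return sum(dp * mp for dp, mp in zip(d, Md))


def triplet_slack(xi, xj, xl, M):
    """max(0, 1 + D(xi, xj) - D(xi, xl)); zero when D(xi, xl) >= D(xi, xj) + 1."""
    return max(0.0, 1.0 + mahalanobis(xi, xj, M) - mahalanobis(xi, xl, M))


# Hypothetical triple: xj shares the class of xi, xl belongs to another class.
xi, xj, xl = (0.0, 0.0), (1.0, 0.0), (0.0, 1.2)

identity = [[1.0, 0.0], [0.0, 1.0]]      # plain squared Euclidean distance
stretched = [[0.25, 0.0], [0.0, 4.0]]    # shrinks dimension 1, stretches dimension 2

slack_euclid = triplet_slack(xi, xj, xl, identity)
slack_learned = triplet_slack(xi, xj, xl, stretched)
print(slack_euclid, slack_learned)
```

Under the Euclidean metric the impostor xl is too close and the triple incurs a slack of about 0.56; under the second metric the same triple satisfies the requirement.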
Learning a Distance Measure
• LMNN algorithm (Weinberger and Saul 2009)
• LMCA algorithm (Torresani and Lee 2007) uses a similar approach where $M = L^T L$ and learns $L$
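Writing M = L^T L means that the Mahalanobis distance is a squared Euclidean distance after projecting with L. The sketch below checks this identity numerically; the matrix L and the two points are made up, and L is taken to have fewer rows than columns as an assumption, to show that the projection may also reduce dimensionality.

```python
def matvec(A, v):
    return [sum(a * b for a, b in zip(row, v)) for row in A]


def gram(L):
    """M = L^T L for a matrix L given as a list of rows."""
    d = len(L[0])
    return [[sum(row[p] * row[q] for row in L) for q in range(d)] for p in range(d)]


L = [[1.0, 2.0]]                 # hypothetical 1 x 2 projection
M = gram(L)                      # [[1, 2], [2, 4]]

xi, xj = (1.0, 0.0), (0.0, 1.0)
diff = [a - b for a, b in zip(xi, xj)]

d_mahalanobis = sum(a * b for a, b in zip(diff, matvec(M, diff)))   # (xi-xj)^T M (xi-xj)
d_projected = sum(c * c for c in matvec(L, diff))                   # ||L xi - L xj||^2
print(M, d_mahalanobis, d_projected)
```

Learning L instead of M keeps M positive semidefinite automatically, since L^T L is positive semidefinite for any L.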
Kernel Dimensionality Reduction
• Kernel PCA does PCA on the kernel matrix (equal to canonical PCA with a linear kernel)
• Kernel LDA, CCA
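The equivalence with canonical PCA under a linear kernel can be checked on a tiny example: the largest eigenvalue of the centered kernel matrix equals the largest eigenvalue of the scatter matrix of the centered data. This is a sketch only. The data are made up, and power iteration is used as a simple stand-in for a full eigendecomposition (it assumes the start vector is not orthogonal to the leading eigenvector).

```python
import math


def linear_kernel(x, y):
    return sum(a * b for a, b in zip(x, y))


def centered_kernel_matrix(X, kernel):
    """Kc_ij = K_ij - mean_i - mean_j + overall mean (centering in feature space)."""
    n = len(X)
    K = [[kernel(a, b) for b in X] for a in X]
    row = [sum(K[i]) / n for i in range(n)]
    total = sum(row) / n
    return [[K[i][j] - row[i] - row[j] + total for j in range(n)] for i in range(n)]


def top_eigenvalue(A, iters=50):
    """Largest eigenvalue of a symmetric PSD matrix by power iteration."""
    v = [1.0] + [0.0] * (len(A) - 1)
    for _ in range(iters):
        Av = [sum(a * b for a, b in zip(row, v)) for row in A]
        norm = math.sqrt(sum(c * c for c in Av))
        v = [c / norm for c in Av]
    Av = [sum(a * b for a, b in zip(row, v)) for row in A]
    return sum(a * b for a, b in zip(v, Av))      # Rayleigh quotient


X = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]          # hypothetical data on a line

# Kernel PCA side: N x N centered kernel matrix.
lam_kernel = top_eigenvalue(centered_kernel_matrix(X, linear_kernel))

# Canonical PCA side: d x d scatter matrix of the centered data.
mean = [sum(col) / len(X) for col in zip(*X)]
Xc = [[xi - mi for xi, mi in zip(x, mean)] for x in X]
S = [[sum(row[p] * row[q] for row in Xc) for q in range(2)] for p in range(2)]
lam_pca = top_eigenvalue(S)
print(lam_kernel, lam_pca)
```

With a nonlinear kernel only the first computation is available, since the scatter matrix would live in the implicit feature space.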