I2ml3e-Chap13 - KERNEL MACHINES PDF

Lecture Slides for
INTRODUCTION TO MACHINE LEARNING, 3RD EDITION
ETHEM ALPAYDIN
© The MIT Press, 2014
alpaydin@boun.edu.tr
http://www.cmpe.boun.edu.tr/~ethem/i2ml3e
CHAPTER 13: KERNEL MACHINES
Kernel Machines
• Discriminant-based: no need to estimate densities first
• Define the discriminant in terms of support vectors
• The use of kernel functions, application-specific measures of similarity
• No need to represent instances as vectors
• Convex optimization problems with a unique solution
Optimal Separating Hyperplane
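$$\mathcal{X} = \{\mathbf{x}^t, r^t\}_t \quad \text{where} \quad r^t = \begin{cases} +1 & \text{if } \mathbf{x}^t \in C_1 \\ -1 & \text{if } \mathbf{x}^t \in C_2 \end{cases}$$
find $\mathbf{w}$ and $w_0$ such that
$$\mathbf{w}^T \mathbf{x}^t + w_0 \ge +1 \quad \text{for } r^t = +1$$
$$\mathbf{w}^T \mathbf{x}^t + w_0 \le -1 \quad \text{for } r^t = -1$$
which can be rewritten as
$$r^t(\mathbf{w}^T \mathbf{x}^t + w_0) \ge +1$$
(Cortes and Vapnik, 1995; Vapnik, 1995)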
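As a minimal sketch of the constraint above, the following check tests whether a candidate hyperplane satisfies $r^t(\mathbf{w}^T \mathbf{x}^t + w_0) \ge +1$ on every instance. The toy data and the helper names `dot` and `satisfies_margin` are illustrative assumptions, not part of the slides.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def satisfies_margin(w, w0, X, r):
    """True if r^t (w^T x^t + w0) >= 1 holds for every instance."""
    return all(rt * (dot(w, xt) + w0) >= 1 for xt, rt in zip(X, r))


# Hypothetical toy data: two instances per class, labels in {+1, -1}.
X = [(2.0, 1.0), (3.0, 0.0), (-2.0, 0.0), (-3.0, 1.0)]
r = [+1, +1, -1, -1]

# w = (1, 0), w0 = 0 separates the classes with room to spare.
print(satisfies_margin((1.0, 0.0), 0.0, X, r))
# w = (0.1, 0) gives the same decision boundary, but the scale of w
# is too small for the ">= 1" form of the constraint to hold.
print(satisfies_margin((0.1, 0.0), 0.0, X, r))
```

The second call illustrates that the constraint depends on the scale of $\mathbf{w}$, not only on which side of the hyperplane each instance falls.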

Margin
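• Distance from the discriminant to the closest instances on either side
• Distance of $\mathbf{x}^t$ to the hyperplane is
$$\frac{|\mathbf{w}^T \mathbf{x}^t + w_0|}{\|\mathbf{w}\|}$$
• We require
$$\frac{r^t(\mathbf{w}^T \mathbf{x}^t + w_0)}{\|\mathbf{w}\|} \ge \rho, \quad \forall t$$
• For a unique solution, fix $\rho\|\mathbf{w}\| = 1$; maximizing the margin is then
$$\min \frac{1}{2}\|\mathbf{w}\|^2 \quad \text{subject to} \quad r^t(\mathbf{w}^T \mathbf{x}^t + w_0) \ge +1, \ \forall t$$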
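The distance formula and the normalization $\rho\|\mathbf{w}\| = 1$ can be checked numerically. This is a small sketch on made-up data; `distance_to_hyperplane` and `margin` are hypothetical helper names, and the hyperplane is assumed to be scaled so that the closest instances give $|\mathbf{w}^T \mathbf{x}^t + w_0| = 1$.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def distance_to_hyperplane(w, w0, x):
    """|w^T x + w0| / ||w||."""
    return abs(dot(w, x) + w0) / math.sqrt(dot(w, w))


def margin(w, w0, X):
    """Distance from the hyperplane to the closest instance."""
    return min(distance_to_hyperplane(w, w0, x) for x in X)


# Assumed example: ||w|| = 5, and the two closest instances
# satisfy w^T x + w0 = +1 and -1 respectively.
w, w0 = (3.0, 4.0), -5.0
X = [(2.0, 0.0), (0.0, 1.0), (3.0, 4.0)]

rho = margin(w, w0, X)
print(rho, 1.0 / math.sqrt(dot(w, w)))  # the two values agree
```

Because the margin equals $1/\|\mathbf{w}\|$ under this scaling, maximizing it is the same as minimizing $\frac{1}{2}\|\mathbf{w}\|^2$.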
Margin
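$$\min \frac{1}{2}\|\mathbf{w}\|^2 \quad \text{subject to} \quad r^t(\mathbf{w}^T \mathbf{x}^t + w_0) \ge +1, \ \forall t$$
$$\begin{aligned}
L_p &= \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{t=1}^{N} \alpha^t \left[ r^t(\mathbf{w}^T \mathbf{x}^t + w_0) - 1 \right] \\
&= \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{t=1}^{N} \alpha^t r^t(\mathbf{w}^T \mathbf{x}^t + w_0) + \sum_{t=1}^{N} \alpha^t
\end{aligned}$$
$$\frac{\partial L_p}{\partial \mathbf{w}} = 0 \Rightarrow \mathbf{w} = \sum_{t=1}^{N} \alpha^t r^t \mathbf{x}^t$$
$$\frac{\partial L_p}{\partial w_0} = 0 \Rightarrow \sum_{t=1}^{N} \alpha^t r^t = 0$$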
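$$\begin{aligned}
L_d &= \frac{1}{2}\mathbf{w}^T\mathbf{w} - \mathbf{w}^T \sum_t \alpha^t r^t \mathbf{x}^t - w_0 \sum_t \alpha^t r^t + \sum_t \alpha^t \\
&= -\frac{1}{2}\mathbf{w}^T\mathbf{w} + \sum_t \alpha^t \\
&= -\frac{1}{2}\sum_t \sum_s \alpha^t \alpha^s r^t r^s (\mathbf{x}^t)^T \mathbf{x}^s + \sum_t \alpha^t
\end{aligned}$$
subject to
$$\sum_t \alpha^t r^t = 0 \quad \text{and} \quad \alpha^t \ge 0, \ \forall t$$
Most $\alpha^t$ are 0 and only a small number have $\alpha^t > 0$; they are the support vectors.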
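To make the dual concrete, the sketch below evaluates $\mathbf{w} = \sum_t \alpha^t r^t \mathbf{x}^t$ and $L_d$ on a three-instance toy problem. The data and the $\alpha^t$ values are assumptions chosen by hand so that the solution is easy to verify; no optimizer is run here.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def weights_from_alphas(alphas, r, X):
    """w = sum_t alpha^t r^t x^t."""
    d = len(X[0])
    return [sum(a * rt * xt[k] for a, rt, xt in zip(alphas, r, X))
            for k in range(d)]


def dual_objective(alphas, r, X):
    """L_d = -1/2 sum_t sum_s a^t a^s r^t r^s (x^t)^T x^s + sum_t a^t."""
    n = len(X)
    quad = sum(alphas[t] * alphas[s] * r[t] * r[s] * dot(X[t], X[s])
               for t in range(n) for s in range(n))
    return -0.5 * quad + sum(alphas)


# Assumed toy problem: the first two instances lie on the margin,
# the third lies well inside its own class region.
X = [(1.0, 0.0), (-1.0, 0.0), (3.0, 0.0)]
r = [+1, -1, +1]
alphas = [0.5, 0.5, 0.0]  # hand-derived solution for this data

w = weights_from_alphas(alphas, r, X)
support = [t for t, a in enumerate(alphas) if a > 0]
print(w, support)
print(dual_objective(alphas, r, X), 0.5 * dot(w, w))  # dual equals primal
```

Only the two instances with $\alpha^t > 0$ contribute to $\mathbf{w}$; the third could be removed without changing the solution.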
Soft Margin Hyperplane
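• Not linearly separable
$$r^t(\mathbf{w}^T \mathbf{x}^t + w_0) \ge 1 - \xi^t$$
• Soft error
$$\sum_t \xi^t$$
• New primal is
$$L_p = \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_t \xi^t - \sum_t \alpha^t \left[ r^t(\mathbf{w}^T \mathbf{x}^t + w_0) - 1 + \xi^t \right] - \sum_t \mu^t \xi^t$$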
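Setting the derivatives of this primal to zero follows the same steps as in the separable case. The worked equations below are a sketch of that step, where $\mu^t$ are the multipliers of the constraints $\xi^t \ge 0$:

```latex
\begin{align*}
\frac{\partial L_p}{\partial \mathbf{w}} = 0 &\Rightarrow \mathbf{w} = \sum_t \alpha^t r^t \mathbf{x}^t \\
\frac{\partial L_p}{\partial w_0} = 0 &\Rightarrow \sum_t \alpha^t r^t = 0 \\
\frac{\partial L_p}{\partial \xi^t} = 0 &\Rightarrow C - \alpha^t - \mu^t = 0
\end{align*}
% Since \mu^t \ge 0, the last condition implies 0 \le \alpha^t \le C.
```

The dual therefore keeps the same form as in the separable case, with the extra upper bound $C$ on each $\alpha^t$.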
Hinge Loss
$$L_{hinge}(y^t, r^t) = \begin{cases} 0 & \text{if } y^t r^t \ge 1 \\ 1 - y^t r^t & \text{otherwise} \end{cases}$$
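The hinge loss translates directly into code. In this short sketch `y` is the discriminant output and `r` is the label in {+1, -1}; the function name and the sample values are illustrative assumptions.

```python
def hinge_loss(y, r):
    """0 if y * r >= 1, else 1 - y * r."""
    return 0.0 if y * r >= 1 else 1.0 - y * r


# Correct side and outside the margin: no loss.
print(hinge_loss(2.0, +1))
# Correct side but inside the margin: loss between 0 and 1.
print(hinge_loss(0.5, +1))
# Wrong side: loss greater than 1.
print(hinge_loss(-0.5, +1))
```

Summing this loss over the training set gives the soft error $\sum_t \xi^t$ of the soft margin hyperplane when each slack takes its smallest feasible value.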
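ν-SVM
$$\min \frac{1}{2}\|\mathbf{w}\|^2 - \nu\rho + \frac{1}{N}\sum_t \xi^t$$
subject to
$$r^t(\mathbf{w}^T \mathbf{x}^t + w_0) \ge \rho - \xi^t, \quad \xi^t \ge 0, \quad \rho \ge 0$$
$$L_d = -\frac{1}{2}\sum_{t=1}^{N} \sum_s \alpha^t \alpha^s r^t r^s (\mathbf{x}^t)^T \mathbf{x}^s$$
subject to
$$\sum_t \alpha^t r^t = 0, \quad 0 \le \alpha^t \le \frac{1}{N}, \quad \sum_t \alpha^t \ge \nu$$
ν controls the fraction of support vectors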

Kernel Trick
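• Preprocess input $\mathbf{x}$ by basis functions
$$\mathbf{z} = \boldsymbol{\phi}(\mathbf{x}), \quad g(\mathbf{z}) = \mathbf{w}^T\mathbf{z}, \quad g(\mathbf{x}) = \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x})$$
• The SVM solution
$$\mathbf{w} = \sum_t \alpha^t r^t \mathbf{z}^t = \sum_t \alpha^t r^t \boldsymbol{\phi}(\mathbf{x}^t)$$
$$g(\mathbf{x}) = \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}) = \sum_t \alpha^t r^t \boldsymbol{\phi}(\mathbf{x}^t)^T \boldsymbol{\phi}(\mathbf{x})$$
$$g(\mathbf{x}) = \sum_t \alpha^t r^t K(\mathbf{x}^t, \mathbf{x})$$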
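The kernelized discriminant $g(\mathbf{x}) = \sum_t \alpha^t r^t K(\mathbf{x}^t, \mathbf{x})$ never needs $\boldsymbol{\phi}$ explicitly. The sketch below evaluates it with a Gaussian kernel of spread `s`; the $\alpha^t$ values and the two stored instances are made-up assumptions rather than the output of a trained SVM.

```python
import math


def gaussian_kernel(xt, x, s=1.0):
    """exp(-||xt - x||^2 / (2 s^2))."""
    sq = sum((a - b) ** 2 for a, b in zip(xt, x))
    return math.exp(-sq / (2.0 * s ** 2))


def discriminant(alphas, r, support_X, x, kernel):
    """g(x) = sum_t alpha^t r^t K(x^t, x)."""
    return sum(a * rt * kernel(xt, x)
               for a, rt, xt in zip(alphas, r, support_X))


# Hypothetical support vectors, one per class, with equal weights.
support_X = [(0.0, 0.0), (2.0, 0.0)]
r = [+1, -1]
alphas = [1.0, 1.0]

for x in [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]:
    print(x, discriminant(alphas, r, support_X, x, gaussian_kernel))
```

The sign of `g` flips at the midpoint between the two stored instances, and swapping in another kernel only requires passing a different function.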
Vectorial Kernels
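• Polynomials of degree $q$:
$$K(\mathbf{x}^t, \mathbf{x}) = (\mathbf{x}^T\mathbf{x}^t + 1)^q$$
$$\begin{aligned}
K(\mathbf{x}, \mathbf{y}) &= (\mathbf{x}^T\mathbf{y} + 1)^2 \\
&= (x_1 y_1 + x_2 y_2 + 1)^2 \\
&= 1 + 2x_1 y_1 + 2x_2 y_2 + 2x_1 x_2 y_1 y_2 + x_1^2 y_1^2 + x_2^2 y_2^2
\end{aligned}$$
$$\boldsymbol{\phi}(\mathbf{x}) = \left[ 1, \sqrt{2}x_1, \sqrt{2}x_2, \sqrt{2}x_1 x_2, x_1^2, x_2^2 \right]^T$$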
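The identity $K(\mathbf{x}, \mathbf{y}) = \boldsymbol{\phi}(\mathbf{x})^T\boldsymbol{\phi}(\mathbf{y})$ for $q = 2$ in two dimensions can be verified numerically. This is a small sketch; the function names and the test points are arbitrary assumptions.

```python
import math


def poly_kernel(x, y, q=2):
    """(x^T y + 1)^q."""
    return (sum(a * b for a, b in zip(x, y)) + 1) ** q


def phi(x):
    """Explicit basis functions for q = 2 and two inputs."""
    x1, x2 = x
    s = math.sqrt(2.0)
    return [1.0, s * x1, s * x2, s * x1 * x2, x1 ** 2, x2 ** 2]


x, y = (1.0, 2.0), (3.0, -1.0)
k_direct = poly_kernel(x, y)
k_mapped = sum(a * b for a, b in zip(phi(x), phi(y)))
print(k_direct, k_mapped)  # equal up to floating-point rounding
```

The kernel costs one inner product in the original 2-dimensional space, whereas the explicit map works in 6 dimensions; the gap widens quickly as the degree and the input dimension grow.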
Vectorial Kernels
• Radial-basis functions:
$$K(\mathbf{x}^t, \mathbf{x}) = \exp\left[ -\frac{\|\mathbf{x}^t - \mathbf{x}\|^2}{2s^2} \right]$$
Defining kernels
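• Kernel “engineering”
• Defining good measures of similarity
• String kernels, graph kernels, image kernels, ...
• Empirical kernel map: Define a set of templates $m_i$ and score function $s(\mathbf{x}, m_i)$
$$\boldsymbol{\phi}(\mathbf{x}^t) = [s(\mathbf{x}^t, m_1), s(\mathbf{x}^t, m_2), \ldots, s(\mathbf{x}^t, m_M)]$$
and
$$K(\mathbf{x}, \mathbf{x}^t) = \boldsymbol{\phi}(\mathbf{x})^T \boldsymbol{\phi}(\mathbf{x}^t)$$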
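An empirical kernel map can be sketched for inputs that are not vectors at all. In the example below the instances are strings, the templates are three arbitrary words, and the score $s(x, m_i)$ is the number of shared character bigrams; all three choices are assumptions made purely for illustration.

```python
def bigrams(s):
    """Set of adjacent character pairs in a string."""
    return {s[i:i + 2] for i in range(len(s) - 1)}


def score(x, m):
    """Hypothetical score s(x, m): number of shared bigrams."""
    return len(bigrams(x) & bigrams(m))


TEMPLATES = ["kernel", "margin", "vector"]  # assumed templates m_i


def empirical_map(x, templates=TEMPLATES):
    """phi(x) = [s(x, m_1), ..., s(x, m_M)]."""
    return [score(x, m) for m in templates]


def empirical_kernel(x, y, templates=TEMPLATES):
    """K(x, y) = phi(x)^T phi(y)."""
    px, py = empirical_map(x, templates), empirical_map(y, templates)
    return sum(a * b for a, b in zip(px, py))


print(empirical_map("kernels"))                # [5, 0, 0]
print(empirical_kernel("kernels", "colonel"))  # 10
print(empirical_kernel("kernels", "sector"))   # 0
```

Because the kernel is defined as an inner product of the mapped vectors, it is a valid kernel whatever score function is chosen.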
Multiple Kernel Learning
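• Fixed kernel combination
$$K(\mathbf{x}, \mathbf{y}) = \begin{cases} cK(\mathbf{x}, \mathbf{y}) \\ K_1(\mathbf{x}, \mathbf{y}) + K_2(\mathbf{x}, \mathbf{y}) \\ K_1(\mathbf{x}, \mathbf{y}) K_2(\mathbf{x}, \mathbf{y}) \end{cases}$$
• Adaptive kernel combination
$$K(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^{m} \eta_i K_i(\mathbf{x}, \mathbf{y})$$
$$L_d = \sum_t \alpha^t - \frac{1}{2}\sum_t \sum_s \alpha^t \alpha^s r^t r^s \sum_i \eta_i K_i(\mathbf{x}^t, \mathbf{x}^s)$$
$$g(\mathbf{x}) = \sum_t \alpha^t r^t \sum_i \eta_i K_i(\mathbf{x}^t, \mathbf{x})$$
• Localized kernel combination
$$g(\mathbf{x}) = \sum_t \alpha^t r^t \sum_i \eta_i(\mathbf{x}|\theta) K_i(\mathbf{x}^t, \mathbf{x})$$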
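The fixed and weighted combinations can be written as small higher-order functions that take kernels and return a new kernel. This is only a sketch: the weights $\eta_i$ are fixed by hand here, whereas in adaptive multiple kernel learning they would be learned together with $\alpha^t$, and all function names are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def linear_kernel(x, y):
    return dot(x, y)


def quadratic_kernel(x, y):
    return (dot(x, y) + 1) ** 2


def scaled(kernel, c):
    """c * K(x, y), with c > 0."""
    return lambda x, y: c * kernel(x, y)


def kernel_sum(k1, k2):
    return lambda x, y: k1(x, y) + k2(x, y)


def kernel_product(k1, k2):
    return lambda x, y: k1(x, y) * k2(x, y)


def weighted_sum(kernels, etas):
    """sum_i eta_i K_i(x, y); nonnegative etas keep it a valid kernel."""
    return lambda x, y: sum(eta * k(x, y) for eta, k in zip(etas, kernels))


x, y = (1.0, 2.0), (3.0, -1.0)
base = [linear_kernel, quadratic_kernel]
print(kernel_sum(*base)(x, y), kernel_product(*base)(x, y))
print(weighted_sum(base, [0.25, 0.75])(x, y))
```

Any of the returned functions can be used wherever a single kernel $K(\mathbf{x}^t, \mathbf{x})$ appears in the dual or in the discriminant.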
Multiclass Kernel Machines
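• 1-vs-all
• Pairwise separation
• Error-Correcting Output Codes (section 17.5)
• Single multiclass optimization
$$\min \frac{1}{2}\sum_{i=1}^{K} \|\mathbf{w}_i\|^2 + C\sum_i \sum_t \xi_i^t$$
subject to
$$\mathbf{w}_{z^t}^T \mathbf{x}^t + w_{z^t 0} \ge \mathbf{w}_i^T \mathbf{x}^t + w_{i0} + 2 - \xi_i^t, \quad \forall i \ne z^t, \quad \xi_i^t \ge 0$$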
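The constraint of the single multiclass formulation can be read as one slack per wrong class. The sketch below computes the smallest feasible $\xi_i^t$ for one instance, given per-class weights that are made up for the example; the `predict` helper assumes the usual rule of choosing the class with the largest discriminant.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def multiclass_slacks(W, W0, x, z):
    """Smallest xi_i with g_z(x) >= g_i(x) + 2 - xi_i, for all i != z."""
    target = dot(W[z], x) + W0[z]
    return {i: max(0.0, dot(W[i], x) + W0[i] + 2.0 - target)
            for i in range(len(W)) if i != z}


def predict(W, W0, x):
    """Index of the class with the largest discriminant value."""
    scores = [dot(w, x) + w0 for w, w0 in zip(W, W0)]
    return max(range(len(W)), key=lambda i: scores[i])


# Hypothetical weights for K = 3 classes in two dimensions.
W = [(1.0, 0.0), (-1.0, 0.0), (0.0, 1.0)]
W0 = [0.0, 0.0, 0.0]

print(multiclass_slacks(W, W0, (3.0, 0.5), z=0))  # margin of 2 is met
print(multiclass_slacks(W, W0, (1.0, 0.5), z=0))  # class 2 is too close
print(predict(W, W0, (1.0, 0.5)))
```

The second instance is still classified correctly, but it incurs a slack because the correct class does not win by the required margin of 2.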
SVM for Regression
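• Use a linear model (possibly kernelized)
$$f(\mathbf{x}) = \mathbf{w}^T\mathbf{x} + w_0$$
• Use the ε-sensitive error function
$$e_\epsilon(r^t, f(\mathbf{x}^t)) = \begin{cases} 0 & \text{if } |r^t - f(\mathbf{x}^t)| < \epsilon \\ |r^t - f(\mathbf{x}^t)| - \epsilon & \text{otherwise} \end{cases}$$
$$\min \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_t (\xi_+^t + \xi_-^t)$$
subject to
$$r^t - (\mathbf{w}^T\mathbf{x}^t + w_0) \le \epsilon + \xi_+^t$$
$$(\mathbf{w}^T\mathbf{x}^t + w_0) - r^t \le \epsilon + \xi_-^t$$
$$\xi_+^t, \xi_-^t \ge 0$$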
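The ε-sensitive error and the two slack variables can be related with a few lines of code. This is a sketch with arbitrary sample values; `slacks` returns the smallest $\xi_+^t$ and $\xi_-^t$ that satisfy the two constraints for a given prediction `f`.

```python
def eps_sensitive_error(r, f, eps):
    """0 inside the eps-tube, linear outside it."""
    diff = abs(r - f)
    return 0.0 if diff < eps else diff - eps


def slacks(r, f, eps):
    """Smallest (xi_plus, xi_minus) satisfying the two constraints."""
    xi_plus = max(0.0, (r - f) - eps)    # prediction too low
    xi_minus = max(0.0, (f - r) - eps)   # prediction too high
    return xi_plus, xi_minus


eps = 0.25
for r, f in [(1.0, 1.125), (2.0, 1.5), (1.0, 2.0)]:
    print(r, f, eps_sensitive_error(r, f, eps), slacks(r, f, eps))
```

At most one of the two slacks is nonzero for any instance, and their sum equals the ε-sensitive error.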
Kernel Regression
[Figure: kernel regression fits with a polynomial kernel and with a Gaussian kernel]

Kernel Machines for Ranking
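• We require not only that the scores be in the correct order but also that they be separated by a margin of at least +1 unit.
• Linear case:
$$\min \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_t \xi^t$$
subject to
$$\mathbf{w}^T\mathbf{x}^u \ge \mathbf{w}^T\mathbf{x}^v + 1 - \xi^t, \quad \forall t : r^u \prec r^v, \quad \xi^t \ge 0$$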
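For the linear ranking case, each preference pair contributes one slack. The sketch below assumes that a pair `(u, v)` means instance `u` should be scored above instance `v`; the weight vector and the instances are made up for illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def ranking_slacks(w, X, pairs):
    """Smallest xi^t with w^T x^u >= w^T x^v + 1 - xi^t for each pair."""
    return [max(0.0, 1.0 - (dot(w, X[u]) - dot(w, X[v])))
            for u, v in pairs]


def ranking_objective(w, X, pairs, C):
    """1/2 ||w||^2 + C * sum of slacks."""
    return 0.5 * dot(w, w) + C * sum(ranking_slacks(w, X, pairs))


w = (1.0, 0.0)
X = [(3.0, 0.0), (2.5, 0.0), (0.0, 0.0)]
pairs = [(0, 1), (0, 2), (1, 2)]  # u should rank above v

print(ranking_slacks(w, X, pairs))
print(ranking_objective(w, X, pairs, C=2.0))
```

The first pair is ordered correctly but its scores differ by only 0.5, so it still pays a slack of 0.5.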
One-Class Kernel Machines
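• Consider a sphere with center $\mathbf{a}$ and radius $R$
$$\min R^2 + C\sum_t \xi^t$$
subject to
$$\|\mathbf{x}^t - \mathbf{a}\|^2 \le R^2 + \xi^t, \quad \xi^t \ge 0$$
$$L_d = \sum_t \alpha^t (\mathbf{x}^t)^T\mathbf{x}^t - \sum_{t=1}^{N} \sum_s \alpha^t \alpha^s (\mathbf{x}^t)^T\mathbf{x}^s$$
subject to
$$0 \le \alpha^t \le C, \quad \sum_t \alpha^t = 1$$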
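The primal of the one-class formulation is easy to evaluate for a candidate sphere. In this sketch the center, radius, data, and `C` are arbitrary assumptions; no optimization is performed.

```python
def sq_dist(x, a):
    """Squared Euclidean distance ||x - a||^2."""
    return sum((xi - ai) ** 2 for xi, ai in zip(x, a))


def sphere_slacks(X, a, R):
    """Smallest xi^t with ||x^t - a||^2 <= R^2 + xi^t."""
    return [max(0.0, sq_dist(x, a) - R ** 2) for x in X]


def one_class_objective(X, a, R, C):
    """R^2 + C * sum of slacks."""
    return R ** 2 + C * sum(sphere_slacks(X, a, R))


X = [(1.0, 0.0), (0.0, 2.0), (3.0, 0.0)]
a, R = (0.0, 0.0), 2.0

print(sphere_slacks(X, a, R))            # only the last instance is outside
print(one_class_objective(X, a, R, C=0.5))
```

Instances with a positive slack fall outside the sphere; `C` trades off the volume of the sphere against the number and extent of such outliers.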
Large Margin Nearest Neighbor
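• Learns the matrix $\mathbf{M}$ of Mahalanobis metric
$$D(\mathbf{x}^i, \mathbf{x}^j) = (\mathbf{x}^i - \mathbf{x}^j)^T \mathbf{M} (\mathbf{x}^i - \mathbf{x}^j)$$
• For three instances $i$, $j$, and $l$, where $i$ and $j$ are of the same class and $l$ is of a different class, we require
$$D(\mathbf{x}^i, \mathbf{x}^l) > D(\mathbf{x}^i, \mathbf{x}^j) + 1$$
and if this is not satisfied, we have a slack for the difference. We learn $\mathbf{M}$ to minimize the sum of such slacks over all $(i, j, l)$ triples, with $j$ and $l$ each being one of the $k$ neighbors of $i$, over all $i$.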
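The role of $\mathbf{M}$ can be illustrated on a single triple. The sketch below computes the slack by which the requirement above is violated; the instances and both matrices are hand-picked assumptions, and learning $\mathbf{M}$ itself is not shown.

```python
def mahalanobis(x, y, M):
    """D(x, y) = (x - y)^T M (x - y)."""
    d = [a - b for a, b in zip(x, y)]
    Md = [sum(M[i][k] * d[k] for k in range(len(d))) for i in range(len(d))]
    return sum(d[i] * Md[i] for i in range(len(d)))


def triple_slack(xi, xj, xl, M):
    """Amount by which D(xi, xl) >= D(xi, xj) + 1 is violated."""
    return max(0.0, mahalanobis(xi, xj, M) + 1.0 - mahalanobis(xi, xl, M))


xi, xj = (0.0, 0.0), (0.0, 2.0)   # same class
xl = (1.0, 0.0)                   # different class

identity = [[1.0, 0.0], [0.0, 1.0]]
stretched = [[4.0, 0.0], [0.0, 0.25]]  # hypothetical learned metric

print(triple_slack(xi, xj, xl, identity))   # 4.0: impostor is closer
print(triple_slack(xi, xj, xl, stretched))  # 0.0: requirement satisfied
```

Stretching the first axis and shrinking the second pulls the same-class neighbor closer and pushes the differently labeled instance away, which is the effect the slack minimization aims for.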
Learning a Distance Measure
• LMNN algorithm (Weinberger and Saul 2009)
• LMCA algorithm (Torresani and Lee 2007) uses a similar approach where $\mathbf{M} = \mathbf{L}^T\mathbf{L}$ and learns $\mathbf{L}$
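With $\mathbf{M} = \mathbf{L}^T\mathbf{L}$, the Mahalanobis distance equals the squared Euclidean distance after projecting with $\mathbf{L}$. The sketch below checks this numerically for a square and a rectangular $\mathbf{L}$; both matrices are arbitrary assumptions.

```python
def matvec(L, v):
    """Matrix-vector product L v, with L given as a list of rows."""
    return [sum(row[k] * v[k] for k in range(len(v))) for row in L]


def gram(L):
    """M = L^T L."""
    d = len(L[0])
    return [[sum(row[i] * row[j] for row in L) for j in range(d)]
            for i in range(d)]


def mahalanobis(x, y, M):
    d = [a - b for a, b in zip(x, y)]
    return sum(di * mdi for di, mdi in zip(d, matvec(M, d)))


def projected_sq_dist(x, y, L):
    """||L x - L y||^2."""
    return sum((a - b) ** 2 for a, b in zip(matvec(L, x), matvec(L, y)))


x, y = (1.0, 2.0), (0.0, 0.0)
L_square = [[2.0, 0.0], [0.0, 0.5]]
L_rect = [[1.0, 1.0]]  # projects two dimensions down to one

for L in (L_square, L_rect):
    print(mahalanobis(x, y, gram(L)), projected_sq_dist(x, y, L))
```

A rectangular $\mathbf{L}$ with fewer rows than columns makes the learned metric double as a dimensionality reduction.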
Kernel Dimensionality Reduction
• Kernel PCA does PCA on the kernel matrix (equal to canonical PCA with a linear kernel)
• Kernel LDA, CCA
