
Machine Learning

with Kernel Methods

Julien Mairal & Jean-Philippe Vert

firstname.lastname@m4x.org

Last update: Jan 2017


1 / 635
Starting point: what we know is how to solve

2 / 635
Or

3 / 635
But real data are often more complicated...

4 / 635
Main goal of this course

Extend
well-understood, linear statistical learning techniques
to
real-world, complicated, structured, high-dimensional data
based on
a rigorous mathematical framework
leading to
practical modelling tools and algorithms
5 / 635
Organization of the course
Contents
1 Present the basic mathematical theory of kernel methods.
2 Introduce algorithms for supervised and unsupervised machine
learning with kernels.
3 Develop a working knowledge of kernel engineering for specific data
and applications (graphs, biological sequences, images).
4 Discuss open research topics related to kernels such as large-scale
learning with kernels and deep kernel learning.

Practical
Course homepage with slides, schedules, homework, etc.:
http://cbio.mines-paristech.fr/~jvert/svn/kernelcourse/course/2017mva
Evaluation: 60% homework (3 assignments) + 40% data challenge.

6 / 635
Outline
1 Kernels and RKHS
Positive Definite Kernels
Reproducing Kernel Hilbert Spaces (RKHS)
Examples
Smoothness functional

2 Kernel tricks
The kernel trick
The representer theorem

3 Kernel Methods: Supervised Learning


Kernel ridge regression
Kernel logistic regression
Large-margin classifiers
Interlude: convex optimization and duality
Support vector machines

7 / 635
Outline
4 Kernel Methods: Unsupervised Learning
Kernel K-means and spectral clustering
Kernel PCA
A quick note on kernel CCA

5 The Kernel Jungle


Green, Mercer, Herglotz, Bochner and friends
Kernels for probabilistic models
Kernels for biological sequences
Kernels for graphs
Kernels on graphs

6 Open Problems and Research Topics


Multiple Kernel Learning (MKL)
Large-scale learning with kernels
Deep learning with kernels

8 / 635
Part 1

Kernels and RKHS

9 / 635
Overview
Motivations
Develop versatile algorithms to process and analyze data...
...without making any assumptions regarding the type of data
(vectors, strings, graphs, images, ...)

The approach
Develop methods based on pairwise comparisons.
By imposing constraints on the pairwise comparison function
(positive definite kernels), we obtain a general framework for
learning from data (optimization in RKHS).

10 / 635
Outline

1 Kernels and RKHS


Positive Definite Kernels
Reproducing Kernel Hilbert Spaces (RKHS)
Examples
Smoothness functional

2 Kernel tricks

3 Kernel Methods: Supervised Learning

4 Kernel Methods: Unsupervised Learning

5 The Kernel Jungle

6 Open Problems and Research Topics


11 / 635
Representation by pairwise comparisons

[Figure: a data set S = (aatcgagtcac, atggacgtct, tgcactact) represented by its 3 × 3 comparison matrix
      ( 1    0.5  0.3 )
K =   ( 0.5  1    0.6 )
      ( 0.3  0.6  1   ) ]

Idea
Define a comparison function: K : X × X → ℝ.
Represent a set of n data points S = {x_1, x_2, . . . , x_n} by the n × n matrix:
[K]_ij := K(x_i, x_j).

12 / 635
Representation by pairwise comparisons
Remarks
K is always an n × n matrix, whatever the nature of the data: the same algorithm will work for any type of data (vectors, strings, ...).
Total modularity between the choice of the function K and the choice of the algorithm.
Poor scalability with respect to the dataset size (n² values to compute and store in K)... but wait until the end of the course to see how to deal with large-scale problems.
We will restrict ourselves to a particular class of pairwise comparison functions.

13 / 635
Positive Definite (p.d.) Kernels
Definition
A positive definite (p.d.) kernel on a set X is a function K : X × X → ℝ that is symmetric:

∀(x, x′) ∈ X², K(x, x′) = K(x′, x),

and which satisfies, for all N ∈ ℕ, (x_1, x_2, . . . , x_N) ∈ X^N and (a_1, a_2, . . . , a_N) ∈ ℝ^N:

∑_{i=1}^N ∑_{j=1}^N a_i a_j K(x_i, x_j) ≥ 0.

14 / 635
Similarity matrices of p.d. kernels
Remarks
Equivalently, a kernel K is p.d. if and only if, for any N ∈ ℕ and any set of points (x_1, x_2, . . . , x_N) ∈ X^N, the similarity matrix [K]_ij := K(x_i, x_j) is positive semidefinite.
Kernel methods are algorithms that take such matrices as input.

15 / 635
The simplest p.d. kernel, for real numbers
Lemma
Let X = ℝ. The function K : ℝ² → ℝ defined by:

∀(x, x′) ∈ ℝ², K(x, x′) = xx′

is p.d.
Proof:
xx′ = x′x (symmetry),
∑_{i=1}^N ∑_{j=1}^N a_i a_j x_i x_j = (∑_{i=1}^N a_i x_i)² ≥ 0.

16 / 635
The simplest p.d. kernel, for vectors
Lemma
Let X = ℝ^d. The function K : X² → ℝ defined by:

∀(x, x′) ∈ X², K(x, x′) = ⟨x, x′⟩_{ℝ^d}

is p.d. (it is often called the linear kernel).

Proof:
⟨x, x′⟩_{ℝ^d} = ⟨x′, x⟩_{ℝ^d},
∑_{i=1}^N ∑_{j=1}^N a_i a_j ⟨x_i, x_j⟩_{ℝ^d} = ‖∑_{i=1}^N a_i x_i‖²_{ℝ^d} ≥ 0.

17 / 635
A more ambitious p.d. kernel

[Figure: a map Φ from the input space X to a feature space F.]

Lemma
Let X be any set, and Φ : X → ℝ^d. Then, the function K : X² → ℝ defined as follows is p.d.:

∀(x, x′) ∈ X², K(x, x′) = ⟨Φ(x), Φ(x′)⟩_{ℝ^d}.

Proof:
⟨Φ(x), Φ(x′)⟩_{ℝ^d} = ⟨Φ(x′), Φ(x)⟩_{ℝ^d},
∑_{i=1}^N ∑_{j=1}^N a_i a_j ⟨Φ(x_i), Φ(x_j)⟩_{ℝ^d} = ‖∑_{i=1}^N a_i Φ(x_i)‖²_{ℝ^d} ≥ 0.
18 / 635
Example: polynomial kernel

[Figure: the map Φ sending (x_1, x_2) ∈ ℝ² to (x_1², √2 x_1x_2, x_2²) ∈ ℝ³.]

For x = (x_1, x_2)ᵀ ∈ ℝ², let Φ(x) = (x_1², √2 x_1x_2, x_2²) ∈ ℝ³:

K(x, x′) = x_1²x_1′² + 2x_1x_2x_1′x_2′ + x_2²x_2′²
         = (x_1x_1′ + x_2x_2′)²
         = ⟨x, x′⟩²_{ℝ²}.

Exercise: show that ⟨x, x′⟩^d_{ℝ^p} is p.d. on X = ℝ^p for any d ∈ ℕ.


19 / 635
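As a quick numerical sanity check of this identity (a sketch, not part of the original slides), one can compare the explicit feature map Φ(x) = (x_1², √2 x_1x_2, x_2²) with the kernel value ⟨x, x′⟩² on random points:

```python
# Sketch: verify <Phi(x), Phi(x')> = <x, x'>^2 for the degree-2 polynomial kernel on R^2.
import numpy as np

def phi(x):
    # Explicit feature map into R^3
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

rng = np.random.default_rng(0)
x, xp = rng.normal(size=2), rng.normal(size=2)
lhs = phi(x) @ phi(xp)      # inner product in the feature space
rhs = (x @ xp) ** 2         # kernel evaluation K(x, x') = <x, x'>^2
assert np.isclose(lhs, rhs)
```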
Conversely: Kernels as inner products
Theorem [Aronszajn, 1950]
K is a p.d. kernel on the set X if and only if there exists a Hilbert space H and a mapping

Φ : X → H

such that, for any x, x′ in X:

K(x, x′) = ⟨Φ(x), Φ(x′)⟩_H.

[Figure: the feature map Φ from X to the feature space F.]
20 / 635
In case of ...
Definitions
An inner product on an ℝ-vector space H is a mapping (f, g) ↦ ⟨f, g⟩_H from H² to ℝ that is bilinear, symmetric and such that ⟨f, f⟩_H > 0 for all f ∈ H \ {0}.
A vector space endowed with an inner product is called pre-Hilbert. It is endowed with a norm defined as ‖f‖_H = ⟨f, f⟩_H^{1/2}.
A Cauchy sequence (f_n)_{n≥0} is a sequence whose elements become progressively arbitrarily close to each other:

lim_{N→+∞} sup_{n,m≥N} ‖f_n − f_m‖_H = 0.

A Hilbert space is a pre-Hilbert space complete for the norm ‖·‖_H, that is, any Cauchy sequence in H converges in H.
Completeness is necessary to keep the good convergence properties of Euclidean spaces in an infinite-dimensional context.
21 / 635
Proof: finite case
Assume X = {x_1, x_2, . . . , x_N} is finite of size N.
Any p.d. kernel K : X × X → ℝ is entirely defined by the N × N symmetric positive semidefinite matrix [K]_ij := K(x_i, x_j).
It can therefore be diagonalized on an orthonormal basis of eigenvectors (u_1, u_2, . . . , u_N), with non-negative eigenvalues 0 ≤ λ_1 ≤ . . . ≤ λ_N, i.e.,

K(x_i, x_j) = [∑_{l=1}^N λ_l u_l u_lᵀ]_{ij} = ∑_{l=1}^N λ_l [u_l]_i [u_l]_j = ⟨Φ(x_i), Φ(x_j)⟩_{ℝ^N},

with

Φ(x_i) = (√λ_1 [u_1]_i, . . . , √λ_N [u_N]_i)ᵀ.
22 / 635
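For intuition, here is a small numerical sketch of this finite-case construction (not from the slides): diagonalize a p.s.d. Gram matrix and rebuild it as an inner product of the vectors Φ(x_i):

```python
# Sketch: Phi(x_i) = (sqrt(lambda_l) [u_l]_i)_l recovers K as a Gram matrix of inner products.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
K = X @ X.T                          # a positive semidefinite kernel matrix (linear kernel)

eigvals, U = np.linalg.eigh(K)       # K = U diag(eigvals) U^T, columns of U are eigenvectors
eigvals = np.clip(eigvals, 0, None)  # clip tiny negative values due to rounding
Phi = U * np.sqrt(eigvals)           # row i is Phi(x_i); entry (i, l) is sqrt(lambda_l) [u_l]_i

assert np.allclose(Phi @ Phi.T, K)   # K(x_i, x_j) = <Phi(x_i), Phi(x_j)>
```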
Proof: general case

Mercer (1909) for X = [a, b] ⊂ ℝ (more generally X compact) and K continuous.
Kolmogorov (1941) for X countable.
Aronszajn (1944, 1950) for the general case.
We will go through the proof of the general case by introducing the
concept of Reproducing Kernel Hilbert Spaces (RKHS).

23 / 635
Outline

1 Kernels and RKHS


Positive Definite Kernels
Reproducing Kernel Hilbert Spaces (RKHS)
Examples
Smoothness functional

2 Kernel tricks

3 Kernel Methods: Supervised Learning

4 Kernel Methods: Unsupervised Learning

5 The Kernel Jungle

6 Open Problems and Research Topics


24 / 635
RKHS Definition
Definition
Let X be a set and H ⊂ ℝ^X be a class of functions forming a (real) Hilbert space with inner product ⟨·, ·⟩_H. The function K : X² → ℝ is called a reproducing kernel (r.k.) of H if
1. H contains all functions of the form

   ∀x ∈ X, K_x : t ↦ K(x, t).

2. For every x ∈ X and f ∈ H the reproducing property holds:

   f(x) = ⟨f, K_x⟩_H.

If a r.k. exists, then H is called a reproducing kernel Hilbert space (RKHS).
25 / 635
An equivalent definition of RKHS
Theorem
The Hilbert space H ⊂ ℝ^X is a RKHS if and only if for any x ∈ X, the mapping:

F : H → ℝ
    f ↦ f(x)

is continuous.

Corollary
Convergence in a RKHS implies pointwise convergence, i.e., if (f_n)_{n∈ℕ} converges to f in H, then (f_n(x))_{n∈ℕ} converges to f(x) for any x ∈ X.

26 / 635
Proof
If H is a RKHS then f ↦ f(x) is continuous
If a r.k. K exists, then for any (x, f) ∈ X × H:

|f(x)| = |⟨f, K_x⟩_H|
       ≤ ‖f‖_H · ‖K_x‖_H   (Cauchy-Schwarz)
       ≤ ‖f‖_H · K(x, x)^{1/2},

because ‖K_x‖²_H = ⟨K_x, K_x⟩_H = K(x, x). Therefore f ∈ H ↦ f(x) ∈ ℝ is a continuous linear mapping. ∎
Since F is linear, it is indeed sufficient to show that f → 0 ⟹ f(x) → 0.
27 / 635
Proof (Converse)
If f ↦ f(x) is continuous then H is a RKHS
Conversely, let us assume that for any x ∈ X the linear form f ∈ H ↦ f(x) is continuous.
Then by the Riesz representation theorem (a general property of Hilbert spaces) there exists a unique g_x ∈ H such that:

f(x) = ⟨f, g_x⟩_H.

The function K(x, y) = g_x(y) is then a r.k. for H. ∎

28 / 635
Unicity of r.k. and RKHS
Theorem
If H is a RKHS, then it has a unique r.k.
Conversely, a function K can be the r.k. of at most one RKHS.

Consequence
This shows that we can talk of the kernel of a RKHS, or the RKHS
of a kernel.

29 / 635
Proof
If a r.k. exists then it is unique
Let K and K′ be two r.k. of a RKHS H. Then for any x ∈ X:

‖K_x − K′_x‖²_H = ⟨K_x − K′_x, K_x − K′_x⟩_H
                = ⟨K_x − K′_x, K_x⟩_H − ⟨K_x − K′_x, K′_x⟩_H
                = K_x(x) − K′_x(x) − K_x(x) + K′_x(x)
                = 0.

This shows that K_x = K′_x as functions, i.e., K_x(y) = K′_x(y) for any y ∈ X. In other words, K = K′. ∎

The RKHS of a r.k. K is unique


Left as exercise.

30 / 635
An important result
Theorem
A function K : X X R is p.d. if and only if it is a r.k.

31 / 635
Proof
A r.k. is p.d.
1. A r.k. is symmetric because, for any (x, y) ∈ X²:

   K(x, y) = ⟨K_x, K_y⟩_H = ⟨K_y, K_x⟩_H = K(y, x).

2. It is p.d. because for any N ∈ ℕ, (x_1, x_2, . . . , x_N) ∈ X^N, and (a_1, a_2, . . . , a_N) ∈ ℝ^N:

   ∑_{i,j=1}^N a_i a_j K(x_i, x_j) = ∑_{i,j=1}^N a_i a_j ⟨K_{x_i}, K_{x_j}⟩_H
                                   = ‖∑_{i=1}^N a_i K_{x_i}‖²_H
                                   ≥ 0. ∎

32 / 635
Proof
A p.d. kernel is a r.k. (1/4)
Let H_0 be the vector subspace of ℝ^X spanned by the functions {K_x}_{x∈X}.
For any f, g ∈ H_0, given by:

f = ∑_{i=1}^m a_i K_{x_i},   g = ∑_{j=1}^n b_j K_{y_j},

let:

⟨f, g⟩_{H_0} := ∑_{i,j} a_i b_j K(x_i, y_j).

33 / 635
Proof
A p.d. kernel is a r.k. (2/4)
⟨f, g⟩_{H_0} does not depend on the expansion of f and g because:

⟨f, g⟩_{H_0} = ∑_{i=1}^m a_i g(x_i) = ∑_{j=1}^n b_j f(y_j).

This also shows that ⟨·, ·⟩_{H_0} is a symmetric bilinear form.
This also shows that for any x ∈ X and f ∈ H_0:

⟨f, K_x⟩_{H_0} = f(x).

34 / 635
Proof
A p.d. kernel is a r.k. (3/4)
K is assumed to be p.d., therefore:

‖f‖²_{H_0} = ∑_{i,j=1}^m a_i a_j K(x_i, x_j) ≥ 0.

In particular Cauchy-Schwarz is valid with ⟨·, ·⟩_{H_0}.
By Cauchy-Schwarz, we deduce that ∀x ∈ X:

|f(x)| = |⟨f, K_x⟩_{H_0}| ≤ ‖f‖_{H_0} · K(x, x)^{1/2},

therefore ‖f‖_{H_0} = 0 ⟹ f = 0.
H_0 is therefore a pre-Hilbert space endowed with the inner product ⟨·, ·⟩_{H_0}.

35 / 635
Proof
A p.d. kernel is a r.k. (4/4)
For any Cauchy sequence (f_n)_{n≥0} in (H_0, ⟨·, ·⟩_{H_0}), we note that:

∀(x, m, n) ∈ X × ℕ², |f_m(x) − f_n(x)| ≤ ‖f_m − f_n‖_{H_0} · K(x, x)^{1/2}.

Therefore for any x the sequence (f_n(x))_{n≥0} is Cauchy in ℝ and therefore has a limit.
If we add to H_0 the functions defined as the pointwise limits of Cauchy sequences, then the space becomes complete and is therefore a Hilbert space, with K as r.k. (up to a few technicalities, left as exercise). ∎

36 / 635
Application: back to Aronszajn's theorem
Theorem (Aronszajn, 1950)
K is a p.d. kernel on the set X if and only if there exists a Hilbert space H and a mapping

Φ : X → H,

such that, for any x, x′ in X:

K(x, x′) = ⟨Φ(x), Φ(x′)⟩_H.

[Figure: the feature map Φ from X to the feature space F.]

37 / 635
Proof of Aronszajn's theorem
If K is p.d. over a set X then it is the r.k. of a Hilbert space H ⊂ ℝ^X.
Let the mapping Φ : X → H be defined by:

∀x ∈ X, Φ(x) = K_x.

By the reproducing property we have:

∀(x, y) ∈ X², ⟨Φ(x), Φ(y)⟩_H = ⟨K_x, K_y⟩_H = K(x, y). ∎

38 / 635
Outline

1 Kernels and RKHS


Positive Definite Kernels
Reproducing Kernel Hilbert Spaces (RKHS)
Examples
Smoothness functional

2 Kernel tricks

3 Kernel Methods: Supervised Learning

4 Kernel Methods: Unsupervised Learning

5 The Kernel Jungle

6 Open Problems and Research Topics


39 / 635
The linear kernel
Take X = ℝ^d and the linear kernel:

K(x, y) = ⟨x, y⟩_{ℝ^d}.

Theorem
The RKHS of the linear kernel is the set of linear functions of the form

f_w(x) = ⟨w, x⟩_{ℝ^d} for w ∈ ℝ^d,

endowed with the inner product

∀w, v ∈ ℝ^d, ⟨f_w, f_v⟩_H = ⟨w, v⟩_{ℝ^d},

and corresponding norm

∀w ∈ ℝ^d, ‖f_w‖_H = ‖w‖_2.

40 / 635
Proof
The RKHS of the linear kernel consists of functions:

x ∈ ℝ^d ↦ f(x) = ∑_i a_i ⟨x_i, x⟩_{ℝ^d} = ⟨w, x⟩_{ℝ^d},

with w = ∑_i a_i x_i.
The RKHS is therefore the set of linear forms endowed with the following inner product:

⟨f, g⟩_H = ⟨w, v⟩_{ℝ^d},

when f(x) = wᵀx and g(x) = vᵀx.

41 / 635
RKHS of the linear kernel (cont.)

K_lin(x, x′) = xᵀx′,
f(x) = wᵀx,
‖f‖_H = ‖w‖_2.

[Figure: level lines of f for ‖f‖ = 2, 1, 0.5.]

42 / 635
The polynomial kernel
Let's find the RKHS of the polynomial kernel:

∀x, y ∈ ℝ^d, K(x, y) = ⟨x, y⟩²_{ℝ^d} = (xᵀy)²

First step: look for an inner product.

K(x, y) = trace(xᵀy xᵀy)
        = trace(yᵀx xᵀy)
        = trace(xxᵀ yyᵀ)
        = ⟨xxᵀ, yyᵀ⟩_F,

where ⟨·, ·⟩_F is the Frobenius inner product for matrices in ℝ^{d×d}.

43 / 635
The polynomial kernel
Second step: propose a candidate RKHS.
We know that H contains all the functions

f(x) = ∑_i a_i K(x_i, x) = ∑_i a_i ⟨x_i x_iᵀ, xxᵀ⟩_F = ⟨∑_i a_i x_i x_iᵀ, xxᵀ⟩_F.

Any symmetric matrix in ℝ^{d×d} may be decomposed as ∑_i a_i x_i x_iᵀ. Our candidate RKHS H will be the set of quadratic functions

f_S(x) = ⟨S, xxᵀ⟩_F = xᵀSx for S ∈ S^{d×d},

where S^{d×d} is the set of symmetric¹ matrices in ℝ^{d×d}, endowed with the inner product ⟨f_{S_1}, f_{S_2}⟩_H = ⟨S_1, S_2⟩_F.

¹ Why is it important?
44 / 635
The polynomial kernel
Third step: check that the candidate is a Hilbert space.
This step is trivial in the present case since it is easy to see that H is a Euclidean space, isomorphic to S^{d×d} by Φ : S ↦ f_S. Sometimes, things are not so simple and we need to prove the completeness explicitly.

Fourth step: check that H is the RKHS.
1. H contains all the functions K_x : t ↦ K(x, t) = ⟨xxᵀ, ttᵀ⟩_F.
2. For all f_S in H and x in X,

   f_S(x) = ⟨S, xxᵀ⟩_F = ⟨f_S, f_{xxᵀ}⟩_H = ⟨f_S, K_x⟩_H.

Remark
All points x in X are mapped to a rank-one matrix xxᵀ, hence to a function K_x = f_{xxᵀ} in H. However, most points in H do not admit a pre-image (why?).
Exercise: what is the RKHS of the general polynomial kernel?
45 / 635
Combining kernels
Theorem
If K_1 and K_2 are p.d. kernels, then:
K_1 + K_2,
K_1 K_2, and
cK_1, for c ≥ 0,
are also p.d. kernels.
If (K_i)_{i≥1} is a sequence of p.d. kernels that converges pointwise to a function K:

∀(x, x′) ∈ X², K(x, x′) = lim_{i→∞} K_i(x, x′),

then K is also a p.d. kernel.

Proof: left as exercise

46 / 635
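These closure properties are easy to check numerically; here is an illustrative sketch (ours, not from the slides) that verifies on random points that the sum, the element-wise product, and a non-negative scaling of two Gram matrices remain positive semidefinite:

```python
# Sketch: check that K1 + K2, K1 * K2 (Hadamard product) and c*K1 stay p.s.d.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 2))

def gaussian_gram(X, sigma=1.0):
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma**2))

K1 = X @ X.T                      # linear kernel
K2 = gaussian_gram(X, sigma=1.0)  # Gaussian kernel

for K in (K1 + K2, K1 * K2, 3.0 * K1):
    # all eigenvalues should be (numerically) non-negative
    assert np.linalg.eigvalsh(K).min() >= -1e-10
```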
Examples
Theorem
If K is a kernel, then e^K is a kernel too.
Proof:

e^{K(x,x′)} = lim_{n→+∞} ∑_{i=0}^n K(x, x′)^i / i!
47 / 635
Quiz: which of the following are p.d. kernels?
X = (−1, 1), K(x, x′) = 1 / (1 − xx′)
X = ℕ, K(x, x′) = 2^{x+x′}
X = ℕ, K(x, x′) = 2^{xx′}
X = ℝ₊, K(x, x′) = log(1 + xx′)
X = ℝ, K(x, x′) = exp(−|x − x′|²)
X = ℝ, K(x, x′) = cos(x + x′)
X = ℝ, K(x, x′) = cos(x − x′)
X = ℝ₊, K(x, x′) = min(x, x′)
X = ℝ₊, K(x, x′) = max(x, x′)
X = ℝ₊, K(x, x′) = min(x, x′) / max(x, x′)
X = ℕ, K(x, x′) = GCD(x, x′)
X = ℕ, K(x, x′) = LCM(x, x′)
X = ℕ, K(x, x′) = GCD(x, x′) / LCM(x, x′)
48 / 635
Outline

1 Kernels and RKHS


Positive Definite Kernels
Reproducing Kernel Hilbert Spaces (RKHS)
Examples
Smoothness functional

2 Kernel tricks

3 Kernel Methods: Supervised Learning

4 Kernel Methods: Unsupervised Learning

5 The Kernel Jungle

6 Open Problems and Research Topics


49 / 635
Remember the RKHS of the linear kernel

K_lin(x, x′) = xᵀx′,
f(x) = wᵀx,
‖f‖_H = ‖w‖_2.

[Figure: level lines of f for ‖f‖ = 2, 1, 0.5.]

50 / 635
Smoothness functional
A simple inequality
By Cauchy-Schwarz we have, for any function f ∈ H and any two points x, x′ ∈ X:

|f(x) − f(x′)| = |⟨f, K_x − K_{x′}⟩_H|
              ≤ ‖f‖_H · ‖K_x − K_{x′}‖_H
              = ‖f‖_H · d_K(x, x′).

The norm of a function in the RKHS controls how fast the function varies over X with respect to the geometry defined by the kernel (Lipschitz with constant ‖f‖_H).

Important message
Small norm ⟹ slow variations.

51 / 635
Kernels and RKHS : Summary
P.d. kernels can be thought of as inner product after embedding the
data space X in some Hilbert space. As such a p.d. kernel defines a
metric on X .
A realization of this embedding is the RKHS, valid without
restriction on the space X nor on the kernel.
The RKHS is a space of functions over X . The norm of a function
in the RKHS is related to its degree of smoothness w.r.t. the metric
defined by the kernel on X .
We will now see some applications of kernels and RKHS in statistics, before coming back to the problem of choosing (and possibly designing) the kernel.

52 / 635
Part 2

Kernel tricks

53 / 635
Motivations
Two theoretical results underpin a family of powerful algorithms for data
analysis using p.d. kernels, collectively known as kernel methods:
The kernel trick, based on the representation of p.d. kernels as inner
products;
The representer theorem, based on some properties of the
regularization functional defined by the RKHS norm.

54 / 635
Outline

1 Kernels and RKHS

2 Kernel tricks
The kernel trick
The representer theorem

3 Kernel Methods: Supervised Learning

4 Kernel Methods: Unsupervised Learning

5 The Kernel Jungle

6 Open Problems and Research Topics

55 / 635
Motivations
Choosing a p.d. kernel K on a set X amounts to embedding the data in a Hilbert space: there exists a Hilbert space H and a mapping Φ : X → H such that, for all x, x′ ∈ X,

K(x, x′) = ⟨Φ(x), Φ(x′)⟩_H.

However this mapping might not be explicitly given, nor convenient to work with in practice (e.g., large or even infinite dimensions).
A solution is to work implicitly in the feature space!

[Figure: the feature map Φ from X to the feature space F.]

56 / 635
The kernel trick
Proposition
Any algorithm to process finite-dimensional vectors that can be expressed
only in terms of pairwise inner products can be applied to potentially
infinite-dimensional vectors in the feature space of a p.d. kernel by
replacing each inner product evaluation by a kernel evaluation.
Remarks:
The proof of this proposition is trivial, because the kernel is exactly
the inner product in the feature space.
This trick has huge practical applications.
Vectors in the feature space are only manipulated implicitly, through
pairwise inner products.

57 / 635
Example 1: computing distances in the feature space


[Figure: two points x_1, x_2 in X mapped to Φ(x_1), Φ(x_2) in F, at distance d_K(x_1, x_2).]

d_K(x_1, x_2)² = ‖Φ(x_1) − Φ(x_2)‖²_H
              = ⟨Φ(x_1) − Φ(x_2), Φ(x_1) − Φ(x_2)⟩_H
              = ⟨Φ(x_1), Φ(x_1)⟩_H + ⟨Φ(x_2), Φ(x_2)⟩_H − 2⟨Φ(x_1), Φ(x_2)⟩_H

d_K(x_1, x_2)² = K(x_1, x_1) + K(x_2, x_2) − 2K(x_1, x_2)

58 / 635
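A minimal sketch of this computation (illustrative; the Gaussian kernel used here is just an example choice):

```python
# Sketch: distance in the feature space computed only through kernel evaluations.
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma**2))

def feature_distance(x1, x2, K=gaussian_kernel):
    # d_K(x1, x2)^2 = K(x1, x1) + K(x2, x2) - 2 K(x1, x2)
    return np.sqrt(K(x1, x1) + K(x2, x2) - 2 * K(x1, x2))

x1, x2 = np.array([0.0, 0.0]), np.array([1.0, 2.0])
print(feature_distance(x1, x2))
```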
Distance for the Gaussian kernel

The Gaussian kernel with bandwidth σ on ℝ^d is:

K(x, y) = exp(−‖x − y‖² / (2σ²)),

K(x, x) = 1 = ‖Φ(x)‖²_H, so all points are on the unit sphere in the feature space.
The distance between the images of two points x and y in the feature space is given by:

d_K(x, y) = √( 2 (1 − exp(−‖x − y‖² / (2σ²))) )

[Figure: d_K(x, y) as a function of ‖x − y‖.]

59 / 635
Example 2: distance between a point and a set
Problem
Let S = (x_1, · · · , x_n) be a finite set of points in X.
How to define and compute the similarity between any point x in X and the set S?
A solution:
Map all points to the feature space.
Summarize S by the barycenter of the points:

μ := (1/n) ∑_{i=1}^n Φ(x_i).

Define the distance between x and S by:

d_K(x, S) := ‖Φ(x) − μ‖_H.

60 / 635
Computation

[Figure: the set S mapped to the feature space F, with barycenter μ.]

d_K(x, S) = ‖Φ(x) − (1/n) ∑_{i=1}^n Φ(x_i)‖_H
          = √( K(x, x) − (2/n) ∑_{i=1}^n K(x, x_i) + (1/n²) ∑_{i=1}^n ∑_{j=1}^n K(x_i, x_j) ).

Remark
The barycenter μ only exists in the feature space in general: it does not necessarily have a pre-image x_μ such that Φ(x_μ) = μ.
61 / 635
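Here is a sketch of the formula above (illustrative only; we pick a Gaussian kernel and reuse the 1D set S = {2, 3} of the next slide):

```python
# Sketch: d_K(x, S) = sqrt( K(x,x) - 2/n sum_i K(x, x_i) + 1/n^2 sum_{ij} K(x_i, x_j) )
import numpy as np

def gaussian_gram(A, B, sigma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma**2))

def dist_to_set(x, S, sigma=1.0):
    x = x[None, :]
    Kxx = gaussian_gram(x, x, sigma)[0, 0]
    KxS = gaussian_gram(x, S, sigma)[0]
    KSS = gaussian_gram(S, S, sigma)
    return np.sqrt(Kxx - 2 * KxS.mean() + KSS.mean())

S = np.array([[2.0], [3.0]])                  # the set S = {2, 3}
print(dist_to_set(np.array([2.5]), S, sigma=1.0))
```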
1D illustration
S = {2, 3}
Plot f(x) = d(x, S)

[Figures: d(x, S) for the linear kernel K(x, y) = xy, and for the Gaussian kernel K(x, y) = e^{−(x−y)²/(2σ²)} with σ = 1 and σ = 0.2.]

Remarks
for the linear kernel, H = ℝ, μ = 2.5 and d(x, S) = |x − μ|.
for the Gaussian kernel, d(x, S) = √( C − (2/n) ∑_{i=1}^n K(x_i, x) ).

62 / 635
2D illustration

S = {(1, 1)ᵀ, (1, 2)ᵀ, (2, 2)ᵀ}
Plot f(x) = d(x, S)

[Figures: d(x, S) for the linear kernel K(x, y) = xᵀy, and for the Gaussian kernel K(x, y) = e^{−‖x−y‖²/(2σ²)} with σ = 1 and σ = 0.2.]

Remark
as before, the barycenter μ in H (which is a single point in H) may carry a lot of information about the training data.
63 / 635
Application in discrimination

S_1 = {(1, 1)ᵀ, (1, 2)ᵀ} and S_2 = {(1, 3)ᵀ, (2, 2)ᵀ}
Plot f(x) = d(x, S_1)² − d(x, S_2)²

[Figures: the resulting decision function for the linear kernel K(x, y) = xᵀy, and for the Gaussian kernel with σ = 1 and σ = 0.2.]

64 / 635
Example 3: Centering data in the feature space
Problem
Let S = (x_1, · · · , x_n) be a finite set of points in X endowed with a p.d. kernel K. Let K be their n × n Gram matrix: [K]_ij = K(x_i, x_j).
Let μ = (1/n) ∑_{i=1}^n Φ(x_i) be their barycenter, and u_i = Φ(x_i) − μ for i = 1, . . . , n be the centered data in H.
How to compute the centered Gram matrix [K^c]_{i,j} = ⟨u_i, u_j⟩_H?

[Figure: the data mapped to the feature space F and recentered around their barycenter μ.]

65 / 635
Computation
A direct computation gives, for 1 ≤ i, j ≤ n:

K^c_{i,j} = ⟨Φ(x_i) − μ, Φ(x_j) − μ⟩_H
          = ⟨Φ(x_i), Φ(x_j)⟩_H − ⟨μ, Φ(x_i) + Φ(x_j)⟩_H + ⟨μ, μ⟩_H
          = K_{i,j} − (1/n) ∑_{k=1}^n (K_{i,k} + K_{j,k}) + (1/n²) ∑_{k,l=1}^n K_{k,l}.

This can be rewritten in matrix form:

K^c = K − UK − KU + UKU = (I − U) K (I − U),

where U_{i,j} = 1/n for 1 ≤ i, j ≤ n.

66 / 635
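The matrix identity K^c = (I − U) K (I − U) translates directly into code; a small sketch (ours), checked on the linear kernel where centering the Gram matrix must coincide with centering the data:

```python
# Sketch: center a Gram matrix in the feature space.
import numpy as np

def center_gram(K):
    n = K.shape[0]
    U = np.full((n, n), 1.0 / n)   # U_{ij} = 1/n
    I = np.eye(n)
    return (I - U) @ K @ (I - U)

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))
K = X @ X.T                        # linear kernel
Xc = X - X.mean(axis=0)            # explicitly centered data
assert np.allclose(center_gram(K), Xc @ Xc.T)
```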
Kernel trick Summary
The kernel trick is a trivial statement with important applications.
It can be used to obtain nonlinear versions of well-known linear
algorithms, e.g., by replacing the classical inner product by a
Gaussian kernel.
It can be used to apply classical algorithms to non vectorial data
(e.g., strings, graphs) by again replacing the classical inner product
by a valid kernel for the data.
It allows in some cases to embed the initial space to a larger feature
space and involve points in the feature space with no pre-image
(e.g., barycenter).

67 / 635
Outline

1 Kernels and RKHS

2 Kernel tricks
The kernel trick
The representer theorem

3 Kernel Methods: Supervised Learning

4 Kernel Methods: Unsupervised Learning

5 The Kernel Jungle

6 Open Problems and Research Topics

68 / 635
Motivation
An RKHS is a space of (potentially nonlinear) functions, and ‖f‖_H measures the smoothness of f.
Given a set of data (x_i ∈ X, y_i ∈ ℝ)_{i=1,...,n}, a natural way to estimate a regression function f : X → ℝ is to solve something like:

min_{f∈H} (1/n) ∑_{i=1}^n ℓ(y_i, f(x_i)) + λ‖f‖²_H,    (1)
          └ empirical risk, data fit ┘   └ regularization ┘

for a loss function ℓ such as ℓ(y, t) = (y − t)².
How to solve this problem in practice, potentially in infinite dimension?

69 / 635
The Theorem
Representer Theorem
Let X be a set endowed with a p.d. kernel K, H the corresponding RKHS, and S = {x_1, · · · , x_n} ⊆ X a finite set of points in X.
Let Ψ : ℝ^{n+1} → ℝ be a function of n + 1 variables, strictly increasing with respect to the last variable.
Then, any solution to the optimization problem:

min_{f∈H} Ψ(f(x_1), · · · , f(x_n), ‖f‖_H),

admits a representation of the form:

∀x ∈ X, f(x) = ∑_{i=1}^n α_i K(x_i, x).

In other words, the solution lives in a finite-dimensional subspace:

f ∈ Span(K_{x_1}, . . . , K_{x_n}).
70 / 635
Proof (1/2)
Let J(f) := Ψ(f(x_1), · · · , f(x_n), ‖f‖_H) be the functional that is minimized in the statement of the representer theorem, and H_S the linear span in H of the vectors K_{x_i}:

H_S = { f ∈ H : f(x) = ∑_{i=1}^n α_i K(x_i, x), (α_1, · · · , α_n) ∈ ℝ^n }.

H_S is a finite-dimensional subspace, therefore any function f ∈ H can be uniquely decomposed as:

f = f_S + f_⊥,

with f_S ∈ H_S and f_⊥ ⊥ H_S (by orthogonal projection).

71 / 635
Proof (2/2)
H being a RKHS, it holds that:

∀i = 1, · · · , n, f_⊥(x_i) = ⟨f_⊥, K(x_i, ·)⟩_H = 0,

because K(x_i, ·) ∈ H_S, therefore:

∀i = 1, · · · , n, f(x_i) = f_S(x_i).

Pythagoras' theorem in H then shows that:

‖f‖²_H = ‖f_S‖²_H + ‖f_⊥‖²_H.

As a consequence, J(f) ≥ J(f_S), with equality if and only if ‖f_⊥‖_H = 0. The minimum of J is therefore necessarily in H_S.


72 / 635
Remarks
Often the function Ψ has the form:

Ψ(f(x_1), · · · , f(x_n), ‖f‖_H) = c(f(x_1), · · · , f(x_n)) + Ω(‖f‖_H),

where c(·) measures the fit of f to a given problem (regression, classification, dimension reduction, ...) and Ω is strictly increasing. This formulation has two important consequences:
Theoretically, the minimization will enforce the norm ‖f‖_H to be small, which can be beneficial by ensuring a sufficient level of smoothness for the solution (regularization effect).
Practically, we know by the representer theorem that the solution lives in a subspace of dimension n, which can lead to efficient algorithms although the RKHS itself can be of infinite dimension.

73 / 635
Practical use of the representer theorem (1/2)
When the representer theorem holds, we know that we can look for a solution of the form

f(x) = ∑_{i=1}^n α_i K(x_i, x), for some α ∈ ℝ^n.

For any j = 1, . . . , n, we have

f(x_j) = ∑_{i=1}^n α_i K(x_i, x_j) = [Kα]_j.

Furthermore,

‖f‖²_H = ∑_{i=1}^n ∑_{j=1}^n α_i α_j K(x_i, x_j) = αᵀKα.

74 / 635
Practical use of the representer theorem (2/2)
Therefore, a problem of the form

min_{f∈H} Ψ(f(x_1), · · · , f(x_n), ‖f‖²_H)

is equivalent to the following n-dimensional optimization problem:

min_{α∈ℝ^n} Ψ([Kα]_1, · · · , [Kα]_n, αᵀKα).

This problem can usually be solved analytically or by numerical methods; we will see many examples in the next sections.

75 / 635
Remarks
Dual interpretations of kernel methods
Most kernel methods have two complementary interpretations:
A geometric interpretation in the feature space, thanks to the kernel
trick. Even when the feature space is large, most kernel methods
work in the linear span of the embeddings of the points available.
A functional interpretation, often as an optimization problem over
(subsets of) the RKHS associated to the kernel.
The representer theorem has important consequences, but it is in fact rather trivial. We are looking for a function f in H such that for all x in X, f(x) = ⟨K_x, f⟩_H. The part f_⊥ that is orthogonal to the K_{x_i}'s is thus useless to explain the training data.

76 / 635
Part 3

Kernel Methods
Supervised Learning

77 / 635
Supervised learning
Definition
Given:
X, a space of inputs,
Y, a space of outputs,
S_n = (x_i, y_i)_{i=1,...,n}, a training set of (input, output) pairs,
the supervised learning problem is to estimate a function h : X → Y to predict the output for any future input.
Depending on the nature of the output, this covers:
Regression when Y = ℝ;
Classification when Y = {−1, 1} or any set of two labels;
Structured output regression or classification when Y is more general.

78 / 635
Example: regression
Task: predict the capacity of a small molecule to inhibit a drug target
X = set of molecular structures (graphs?)
Y=R

79 / 635
Example: classification
Task: recognize if an image is a dog or a cat
X = set of images (Rd )
Y = {cat,dog}

80 / 635
Example: structured output
Task: translate from Japanese to French
X = finite-length strings of Japanese characters
Y = finite-length strings of French characters

81 / 635
Supervised learning with kernels: general principles
1. Express h : X → Y using a real-valued function f : Z → ℝ:
   regression Y = ℝ:
       h(x) = f(x) with f : X → ℝ (Z = X)
   classification Y = {−1, 1}:
       h(x) = sign(f(x)) with f : X → ℝ (Z = X)
   structured output:
       h(x) = arg max_{y∈Y} f(x, y) with f : X × Y → ℝ (Z = X × Y)

2. Define an empirical risk function R_n(f) to assess how good a candidate function f is on the training set S_n, typically the average of a loss:

   R_n(f) := (1/n) ∑_{i=1}^n ℓ(f(x_i), y_i)

3. Define a p.d. kernel on Z and solve

   min_{f∈H, ‖f‖_H ≤ B} R_n(f)   or   min_{f∈H} R_n(f) + λ‖f‖²_H
82 / 635
Remarks
min_{f∈H} (1/n) ∑_{i=1}^n ℓ(f(x_i), y_i) + λ‖f‖²_H.
          └ empirical risk, data fit ┘   └ regularization ┘

Regularization is important, particularly in high dimension, to prevent overfitting.
When Z = ℝ^d and K is the linear kernel, f = f_w is a linear model and the regularization is λ‖w‖².
Using more general spaces Z and kernels K allows to
  learn non-linear functions over a functional space endowed with a natural regularization (remember, small norm in RKHS = smooth)
  learn functions over non-vectorial data, such as strings and graphs

We will now see a few methods in more detail

83 / 635
Outline

1 Kernels and RKHS

2 Kernel tricks

3 Kernel Methods: Supervised Learning


Kernel ridge regression
Kernel logistic regression
Large-margin classifiers
Interlude: convex optimization and duality
Support vector machines

4 Kernel Methods: Unsupervised Learning

5 The Kernel Jungle

6 Open Problems and Research Topics 84 / 635


Regression
Setup
X set of inputs
Y = ℝ real-valued outputs
S_n = (x_i, y_i)_{i=1,...,n} ∈ (X × ℝ)^n a training set of n pairs
Goal = find a function f : X → ℝ to predict y by f(x)

[Figure: a 1D training set of points (x_i, y_i).]
85 / 635
Least-square regression over a general functional space
Let us quantify the error if f predicts f(x) instead of y by the squared error:

ℓ(f(x), y) = (y − f(x))²

Fix a set of functions H.
Least-square regression amounts to finding the function in H with the smallest empirical risk, called in this case the mean squared error (MSE):

f̂ ∈ arg min_{f∈H} (1/n) ∑_{i=1}^n (y_i − f(x_i))²

Issues: unstable (especially in large dimensions), overfitting if H is too large.

86 / 635
Kernel ridge regression (KRR)
Let us now consider a RKHS H, associated to a p.d. kernel K on X.
KRR is obtained by regularizing the MSE criterion by the RKHS norm:

f̂ = arg min_{f∈H} (1/n) ∑_{i=1}^n (y_i − f(x_i))² + λ‖f‖²_H    (2)

1st effect = prevent overfitting by penalizing non-smooth functions.

By the representer theorem, any solution of (2) can be expanded as

f̂(x) = ∑_{i=1}^n α_i K(x_i, x).

2nd effect = simplifying the solution.

87 / 635
Solving KRR
Let y = (y_1, . . . , y_n)ᵀ ∈ ℝ^n
Let α = (α_1, . . . , α_n)ᵀ ∈ ℝ^n
Let K be the n × n Gram matrix: K_ij = K(x_i, x_j)
We can then write:

(f̂(x_1), . . . , f̂(x_n))ᵀ = Kα

The following holds as usual:

‖f̂‖²_H = αᵀKα

The KRR problem (2) is therefore equivalent to:

arg min_{α∈ℝ^n} (1/n) (Kα − y)ᵀ(Kα − y) + λαᵀKα

88 / 635
Solving KRR

arg min_{α∈ℝ^n} (1/n) (Kα − y)ᵀ(Kα − y) + λαᵀKα

This is a convex and differentiable function of α. Its minimum can therefore be found by setting the gradient in α to zero:

0 = (2/n) K (Kα − y) + 2λKα
  = K [(K + λnI) α − y]

For λ > 0, K + λnI is invertible (because K is positive semidefinite), so one solution is to take:

α = (K + λnI)^{-1} y.

89 / 635
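A compact sketch of KRR with a Gaussian RBF kernel, using the closed form α = (K + λnI)⁻¹y (illustrative code, not the official course implementation):

```python
# Sketch: kernel ridge regression with a Gaussian RBF kernel.
import numpy as np

def rbf(A, B, sigma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma**2))

def krr_fit(X, y, lam, sigma=1.0):
    n = len(X)
    K = rbf(X, X, sigma)
    # Solve (K + lambda n I) alpha = y
    return np.linalg.solve(K + lam * n * np.eye(n), y)

def krr_predict(X_train, alpha, X_test, sigma=1.0):
    # f(x) = sum_i alpha_i K(x_i, x)
    return rbf(X_test, X_train, sigma) @ alpha

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 1))
y = np.sin(X[:, 0]) + 0.2 * rng.normal(size=50)
alpha = krr_fit(X, y, lam=0.1, sigma=1.0)
print(krr_predict(X, alpha, np.array([[5.0]])))
```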
Example (KRR with Gaussian RBF kernel)

[Figures: the 1D training data and the KRR fit for decreasing regularization, λ = 1000, 100, 10, 1, 0.1, 0.01, down to 10⁻⁷, illustrating the effect of the regularization parameter.]
90 / 635
Remark: uniqueness of the solution
Let us find all α's that solve

K [(K + λnI) α − y] = 0

K being a symmetric matrix, it can be diagonalized in an orthonormal basis of eigenvectors, and Ker(K) ⊥ Im(K).
In this basis we see that (K + λnI)^{-1} leaves Im(K) and Ker(K) invariant.
The problem is therefore equivalent to:

(K + λnI) α − y ∈ Ker(K)
⟺ α − (K + λnI)^{-1} y ∈ Ker(K)
⟺ α = (K + λnI)^{-1} y + ε, with Kε = 0.

However, if α′ = α + ε with Kε = 0, then:

‖f̂ − f̂′‖²_H = (α − α′)ᵀ K (α − α′) = εᵀKε = 0,

therefore f̂ = f̂′. KRR has a unique solution f̂ ∈ H, which can possibly be expressed by several α's if K is singular.
91 / 635
Remark: link with standard ridge regression
Take X = ℝ^d and the linear kernel K(x, x′) = xᵀx′
Let X = (x_1, . . . , x_n)ᵀ be the n × d data matrix
The kernel matrix is then K = XXᵀ
The function learned by KRR in that case is linear:

f̂_KRR(x) = ŵ_KRRᵀ x

with

ŵ_KRR = ∑_{i=1}^n α_i x_i = Xᵀα = Xᵀ (XXᵀ + λnI)^{-1} y

92 / 635
Remark: link with standard ridge regression
On the other hand, the RKHS is the set of linear functions f(x) = wᵀx and the RKHS norm is ‖f‖_H = ‖w‖
We can therefore directly rewrite the original KRR problem (2) as

arg min_{w∈ℝ^d} (1/n) ∑_{i=1}^n (y_i − wᵀx_i)² + λ‖w‖²
= arg min_{w∈ℝ^d} (1/n) (y − Xw)ᵀ(y − Xw) + λwᵀw

Setting the gradient to 0 gives the solution:

ŵ_RR = (XᵀX + λnI)^{-1} Xᵀ y

Oops, looks different from ŵ_KRR = Xᵀ (XXᵀ + λnI)^{-1} y ..?

93 / 635
Remark: link with standard ridge regression
Matrix inversion lemma
For any matrices B and C, and λ > 0, the following holds (when it makes sense):

B (CB + λI)^{-1} = (BC + λI)^{-1} B

We deduce that (of course...):

ŵ_RR = (XᵀX + λnI)^{-1} Xᵀ y = Xᵀ (XXᵀ + λnI)^{-1} y = ŵ_KRR
       └── d × d inverse ──┘   └── n × n inverse ──┘

Computationally, inverting the matrix is the expensive part, which suggests to implement:
KRR when d > n (high dimension)
RR when d < n (many points)

94 / 635
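The identity can be checked numerically in a few lines (a sketch, not from the slides):

```python
# Sketch: w_RR computed from a d x d system equals w_KRR computed from the n x n kernel matrix.
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 20, 5, 0.3
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

w_rr  = np.linalg.solve(X.T @ X + lam * n * np.eye(d), X.T @ y)   # ridge regression (d x d)
w_krr = X.T @ np.linalg.solve(X @ X.T + lam * n * np.eye(n), y)   # kernel ridge regression (n x n)
assert np.allclose(w_rr, w_krr)
```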
Robust regression
The squared error ℓ(t, y) = (t − y)² is arbitrary and sensitive to outliers.
Many other loss functions exist for regression, e.g.:

[Figure: examples of alternative regression losses.]

Any loss function leads to a valid kernel method, which is usually solved by numerical optimization, as there is usually no analytical solution beyond the squared error.
95 / 635
Weighted regression
Given weights W_1, . . . , W_n ∈ ℝ, a variant of ridge regression is to weight differently the error at different points:

arg min_{f∈H} (1/n) ∑_{i=1}^n W_i (y_i − f(x_i))² + λ‖f‖²_H

By the representer theorem the solution is f̂(x) = ∑_{i=1}^n α_i K(x_i, x), where α solves, with W = diag(W_1, . . . , W_n):

arg min_{α∈ℝ^n} (1/n) (Kα − y)ᵀ W (Kα − y) + λαᵀKα

96 / 635
Weighted regression
Setting the gradient to zero gives

0 = (2/n) (KWKα − KWy) + 2λKα
  = (2/n) KW^{1/2} [ (W^{1/2}KW^{1/2} + λnI) W^{-1/2}α − W^{1/2}y ]

A solution is therefore given by

(W^{1/2}KW^{1/2} + λnI) W^{-1/2}α − W^{1/2}y = 0

therefore

α = W^{1/2} (W^{1/2}KW^{1/2} + λnI)^{-1} W^{1/2} y

97 / 635
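A sketch of this closed form (illustrative; it will serve as the inner solver of the IRLS iterations later in this part):

```python
# Sketch: alpha = W^{1/2} (W^{1/2} K W^{1/2} + lambda n I)^{-1} W^{1/2} y
import numpy as np

def solve_wkrr(K, W, y, lam):
    # W is a vector of non-negative weights (the diagonal of the weight matrix)
    n = K.shape[0]
    Ws = np.sqrt(W)
    M = Ws[:, None] * K * Ws[None, :] + lam * n * np.eye(n)   # W^{1/2} K W^{1/2} + lambda n I
    return Ws * np.linalg.solve(M, Ws * y)

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2)); y = rng.normal(size=10)
alpha = solve_wkrr(X @ X.T, np.full(10, 0.5), y, lam=0.1)
```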
Outline

1 Kernels and RKHS

2 Kernel tricks

3 Kernel Methods: Supervised Learning


Kernel ridge regression
Kernel logistic regression
Large-margin classifiers
Interlude: convex optimization and duality
Support vector machines

4 Kernel Methods: Unsupervised Learning

5 The Kernel Jungle

6 Open Problems and Research Topics 98 / 635


Binary classification
Setup
X set of inputs
Y = {−1, 1} binary outputs
S_n = (x_i, y_i)_{i=1,...,n} ∈ (X × Y)^n a training set of n pairs
Goal = find a function f : X → ℝ to predict y by sign(f(x))

99 / 635
The 0/1 loss
The 0/1 loss measures if a prediction is correct or not:

ℓ_{0/1}(f(x), y) = 1(yf(x) < 0) = 0 if y = sign(f(x)), 1 otherwise.

It is then tempting to learn f by solving:

min_{f∈H} (1/n) ∑_{i=1}^n ℓ_{0/1}(f(x_i), y_i) + λ‖f‖²_H
          └ misclassification rate ┘   └ regularization ┘

However:
The problem is non-smooth, and typically NP-hard to solve
The regularization has no effect since the 0/1 loss is invariant by scaling of f
In fact, no function achieves the minimum when λ > 0 (why?)

100 / 635
The logistic loss
An alternative is to define a probabilistic model of y parametrized by f(x), e.g.:

∀y ∈ {−1, 1}, p(y | f(x)) = 1 / (1 + e^{−yf(x)}) = σ(yf(x))

[Figure: the sigmoid function σ(u).]

The logistic loss is the negative conditional log-likelihood:

ℓ_logistic(f(x), y) = −ln p(y | f(x)) = ln(1 + e^{−yf(x)})

101 / 635
Kernel logistic regression (KLR)

f̂ = arg min_{f∈H} (1/n) ∑_{i=1}^n ℓ_logistic(f(x_i), y_i) + (λ/2)‖f‖²_H
   = arg min_{f∈H} (1/n) ∑_{i=1}^n ln(1 + e^{−y_i f(x_i)}) + (λ/2)‖f‖²_H

Can be interpreted as a regularized conditional maximum likelihood estimator
No explicit solution, but a smooth convex optimization problem that can be solved numerically

102 / 635
Solving KLR
By the representer theorem, any solution of KLR can be expanded as

f̂(x) = ∑_{i=1}^n α_i K(x_i, x)

and as always we have:

(f̂(x_1), . . . , f̂(x_n))ᵀ = Kα   and   ‖f̂‖²_H = αᵀKα

To find α we therefore need to solve:

min_{α∈ℝ^n} (1/n) ∑_{i=1}^n ln(1 + e^{−y_i [Kα]_i}) + (λ/2) αᵀKα

103 / 635
Technical facts
[Figures: the sigmoid σ(u) and the logistic loss ℓ_logistic(u).]

Sigmoid:
σ(u) = 1 / (1 + e^{−u})
σ(−u) = 1 − σ(u)
σ′(u) = σ(u)σ(−u) ≥ 0

Logistic loss:
ℓ_logistic(u) = ln(1 + e^{−u})
ℓ′_logistic(u) = −σ(−u)
ℓ″_logistic(u) = σ(u)σ(−u) ≥ 0

104 / 635
Back to KLR
min_{α∈ℝ^n} J(α) = (1/n) ∑_{i=1}^n ℓ_logistic(y_i [Kα]_i) + (λ/2) αᵀKα

This is a smooth convex optimization problem that can be solved by many numerical methods. Let us make one of them explicit, Newton's method, which iteratively approximates J by a quadratic function and solves the quadratic problem.
The quadratic approximation near a point α₀ is the function:

J_q(α) = J(α₀) + (α − α₀)ᵀ ∇J(α₀) + (1/2) (α − α₀)ᵀ ∇²J(α₀) (α − α₀)

Let us compute the different terms...

105 / 635
Computing the quadratic approximation
∂J/∂α_j = (1/n) ∑_{i=1}^n ℓ′_logistic(y_i [Kα]_i) y_i K_ij + λ[Kα]_j,   with P_i(α) := ℓ′_logistic(y_i [Kα]_i),

therefore

∇J(α) = (1/n) K P(α) y + λKα,

where P(α) = diag(P_1(α), . . . , P_n(α)).

∂²J/∂α_j∂α_l = (1/n) ∑_{i=1}^n ℓ″_logistic(y_i [Kα]_i) y_i K_ij y_i K_il + λK_jl,   with W_i(α) := ℓ″_logistic(y_i [Kα]_i),

therefore

∇²J(α) = (1/n) K W(α) K + λK,

where W(α) = diag(W_1(α), . . . , W_n(α)).
106 / 635
Computing the quadratic approximation

J_q(α) = J(α₀) + (α − α₀)ᵀ ∇J(α₀) + (1/2) (α − α₀)ᵀ ∇²J(α₀) (α − α₀)

Terms that depend on α, with P = P(α₀) and W = W(α₀):
αᵀ ∇J(α₀) = (1/n) αᵀKPy + λαᵀKα₀
(1/2) αᵀ ∇²J(α₀) α = (1/2n) αᵀKWKα + (λ/2) αᵀKα
−αᵀ ∇²J(α₀) α₀ = −(1/n) αᵀKWKα₀ − λαᵀKα₀

Putting it all together:

2J_q(α) = −(2/n) αᵀKW (Kα₀ − W^{-1}Py) + (1/n) αᵀKWKα + λαᵀKα + C
        = (1/n) (Kα − z)ᵀ W (Kα − z) + λαᵀKα + C′,   with z := Kα₀ − W^{-1}Py.

This is a standard weighted kernel ridge regression (WKRR) problem!
107 / 635
Solving KLR by IRLS
In summary, one way to solve KLR is to iteratively solve a WKRR problem until convergence:

α^{t+1} ← solveWKRR(K, W^t, z^t)

where we update W^t and z^t from α^t as follows (for i = 1, . . . , n):

m_i ← [Kα^t]_i
P_i^t ← ℓ′_logistic(y_i m_i) = −σ(−y_i m_i)
W_i^t ← ℓ″_logistic(y_i m_i) = σ(m_i)σ(−m_i)
z_i^t ← m_i − P_i^t y_i / W_i^t = m_i + y_i / σ(y_i m_i)

This is the kernelized version of the famous iteratively reweighted least-squares (IRLS) method to solve the standard linear logistic regression.

108 / 635
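Putting the updates together, here is an illustrative sketch of KLR solved by IRLS (the helper names and the fixed number of iterations are ours, not the course's; it reuses the WKRR solver sketched earlier):

```python
# Sketch: kernel logistic regression by iteratively reweighted least squares (IRLS).
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def solve_wkrr(K, W, y, lam):
    n = K.shape[0]
    Ws = np.sqrt(W)
    M = Ws[:, None] * K * Ws[None, :] + lam * n * np.eye(n)
    return Ws * np.linalg.solve(M, Ws * y)

def klr_irls(K, y, lam, n_iter=20):
    n = K.shape[0]
    alpha = np.zeros(n)
    for _ in range(n_iter):
        m = K @ alpha                      # m_i = [K alpha^t]_i
        W = sigmoid(m) * sigmoid(-m)       # W_i = sigma(m_i) sigma(-m_i)
        z = m + y / sigmoid(y * m)         # z_i = m_i + y_i / sigma(y_i m_i)
        alpha = solve_wkrr(K, W, z, lam)   # Newton step = a weighted KRR problem
    return alpha
```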
Outline

1 Kernels and RKHS

2 Kernel tricks

3 Kernel Methods: Supervised Learning


Kernel ridge regression
Kernel logistic regression
Large-margin classifiers
Interlude: convex optimization and duality
Support vector machines

4 Kernel Methods: Unsupervised Learning

5 The Kernel Jungle

6 Open Problems and Research Topics 109 / 635


Loss functions for classification
We already saw two loss functions for binary classification problems:
The 0/1 loss ℓ_{0/1}(f(x), y) = 1(yf(x) < 0)
The logistic loss ℓ_logistic(f(x), y) = ln(1 + e^{−yf(x)})
In both cases, the loss is a function of the margin defined as follows

Definition
In binary classification (Y = {−1, 1}), the margin of the function f for a pair (x, y) is:

yf(x).

In both cases the loss is a decreasing function of the margin, i.e.,

ℓ(f(x), y) = φ(yf(x)), with φ non-increasing

What about other similar loss functions?

110 / 635
Loss function examples

[Figure: the functions φ(u) below plotted against the margin u.]

Method                             φ(u)
Kernel logistic regression         log(1 + e^{−u})
Support vector machine (1-SVM)     max(1 − u, 0)
Support vector machine (2-SVM)     max(1 − u, 0)²
Boosting                           e^{−u}
111 / 635
Large-margin classifiers
Definition
Given a non-increasing function φ : ℝ → ℝ₊, a (kernel) large-margin classifier is an algorithm that estimates a function f : X → ℝ by solving

min_{f∈H} (1/n) ∑_{i=1}^n φ(y_i f(x_i)) + λ‖f‖²_H

Hence, KLR is a large-margin classifier, corresponding to φ(u) = ln(1 + e^{−u}). Many more are possible.

Questions:
1. Can we solve the optimization problem for other φ's?
2. Is it a good idea to optimize this objective function, if at the end of the day we are interested in the ℓ_{0/1} loss, i.e., learning models that make few errors?

112 / 635
Solving large-margin classifiers
min_{f∈H} (1/n) ∑_{i=1}^n φ(y_i f(x_i)) + λ‖f‖²_H

By the representer theorem, the solution of the unconstrained problem can be expanded as:

f(x) = ∑_{i=1}^n α_i K(x_i, x).

Plugging into the original problem we obtain the following unconstrained and convex optimization problem in ℝ^n:

min_{α∈ℝ^n} (1/n) ∑_{i=1}^n φ(y_i [Kα]_i) + λαᵀKα.

When φ is convex, this can be solved using general tools for convex optimization, or specific algorithms (e.g., for SVM, see later).
113 / 635
A tiny bit of learning theory
Assumptions and notations
Let P be an (unknown) distribution on X × Y, and η(x) = P(Y = 1 | X = x) a measurable version of the conditional distribution of Y given X.
Assume the training set S_n = (X_i, Y_i)_{i=1,...,n} consists of i.i.d. random variables distributed according to P.
The risk of a classifier f : X → ℝ is R(f) = P(sign(f(X)) ≠ Y)
The Bayes risk is

R* = inf_{f measurable} R(f),

which is attained for f*(x) = η(x) − 1/2.
The empirical risk of a classifier f : X → ℝ is

R̂_n(f) = (1/n) ∑_{i=1}^n 1(sign(f(X_i)) ≠ Y_i)
114 / 635
The φ-risk
Let the empirical φ-risk be the empirical risk optimized by a large-margin classifier:

R̂_{n,φ}(f) = (1/n) ∑_{i=1}^n φ(Y_i f(X_i))

It is the empirical version of the φ-risk

R_φ(f) = E[φ(Yf(X))]

Can we hope to have a small risk R(f) if we focus instead on the φ-risk R_φ(f)?

115 / 635
A small φ-risk ensures a small 0/1 risk
Theorem [Bartlett et al., 2003]
Let φ : ℝ → ℝ₊ be convex, non-increasing, differentiable at 0 with φ′(0) < 0. Let f* : X → ℝ be measurable such that

R_φ(f*) = min_{g measurable} R_φ(g) = R_φ*.

Then

R(f*) = min_{g measurable} R(g) = R*.

Remarks:
This tells us that, if we know P, then minimizing the φ-risk is a good idea even if our focus is on the classification error.
The assumptions on φ can be relaxed; it works for the broader class of classification-calibrated loss functions [Bartlett et al., 2003].
More generally, we can show that if R_φ(f) − R_φ* is small, then R(f) − R* is small too [Bartlett et al., 2003].
116 / 635
A small φ-risk ensures a small 0/1 risk
Proof sketch:
Condition on X = x: for any g,

R_φ(g | X = x) = E[φ(Yg(X)) | X = x] = η(x)φ(g(x)) + (1 − η(x))φ(−g(x)).

Therefore:

R_φ(f* | X = x) − R_φ(−f* | X = x) = [2η(x) − 1][φ(f*(x)) − φ(−f*(x))]

This must be a.s. ≤ 0 because R_φ(f*) ≤ R_φ(−f*), which implies:
if η(x) > 1/2, φ(f*(x)) − φ(−f*(x)) ≤ 0 ⟹ f*(x) ≥ 0
if η(x) < 1/2, φ(f*(x)) − φ(−f*(x)) ≥ 0 ⟹ f*(x) ≤ 0
These inequalities are in fact strict thanks to the assumptions we made on φ (left as exercise). ∎

117 / 635
Empirical risk minimization (ERM)
To find a function with a small φ-risk, the following is a good candidate:

Definition
The ERM estimator on a functional class F is the solution (when it exists) of:

f̂_n = argmin_{f∈F} R̂_{n,φ}(f).

Questions
Is R̂_{n,φ}(f) a good estimate of the true risk R_φ(f)?
Is R_φ(f̂_n) small?

118 / 635
Class capacity
Motivations
The ERM principle gives a good solution if R_φ(f̂_n) is similar to the minimum achievable risk inf_{f∈F} R_φ(f).
This can be ensured if F is not too large.
We need a measure of the capacity of F.

Definition: Rademacher complexity
The Rademacher complexity of a class of functions F is:

Rad_n(F) = E_{X,σ} [ sup_{f∈F} (2/n) ∑_{i=1}^n σ_i f(X_i) ],

where the expectation is over (X_i)_{i=1,...,n} and the independent uniform {±1}-valued (Rademacher) random variables (σ_i)_{i=1,...,n}.

119 / 635
Basic learning bounds
Theorem
Suppose φ is Lipschitz with constant L_φ:

∀u, u′ ∈ ℝ, |φ(u) − φ(u′)| ≤ L_φ |u − u′|.

Then the φ-risk of the ERM estimator satisfies (on average over the sampling of the training set)

E_{S_n} R_φ(f̂_n) − R_φ*  ≤  4L_φ Rad_n(F)  +  inf_{f∈F} R_φ(f) − R_φ*
└── excess φ-risk ──┘     └ estimation error ┘  └── approximation error ──┘

This quantifies a trade-off between:
F large ⟹ overfitting (approximation error small, estimation error large)
F small ⟹ underfitting (estimation error small, approximation error large)
120 / 635
ERM in RKHS balls
Principle
Assume X is endowed with a p.d. kernel.
We consider the ball of radius B in the RKHS as function class for the ERM:

F_B = {f ∈ H : ‖f‖_H ≤ B}.

Theorem (capacity control of RKHS balls)

Rad_n(F_B) ≤ 2B √(E K(X, X)) / √n.

121 / 635
Proof (1/2)

Rad_n(F_B) = E_{X,σ} [ sup_{f∈F_B} (2/n) ∑_{i=1}^n σ_i f(X_i) ]
           = E_{X,σ} [ sup_{f∈F_B} ⟨f, (2/n) ∑_{i=1}^n σ_i K_{X_i}⟩_H ]   (RKHS)
           = E_{X,σ} [ (2B/n) ‖∑_{i=1}^n σ_i K_{X_i}‖_H ]   (Cauchy-Schwarz)
           = (2B/n) E_{X,σ} √( ‖∑_{i=1}^n σ_i K_{X_i}‖²_H )
           ≤ (2B/n) √( E_{X,σ} ∑_{i,j=1}^n σ_i σ_j K(X_i, X_j) )   (Jensen)
122 / 635
Proof (2/2)
But E[σ_i σ_j] is 1 if i = j, 0 otherwise. Therefore:

Rad_n(F_B) ≤ (2B/n) √( E_X ∑_{i,j=1}^n E[σ_i σ_j] K(X_i, X_j) )
           = (2B/n) √( E_X ∑_{i=1}^n K(X_i, X_i) )
           = 2B √(E_X K(X, X)) / √n. ∎

123 / 635
Basic learning bounds in RKHS balls
Corollary
Suppose K(X, X) ≤ κ² a.s. (e.g., Gaussian kernel and κ = 1). Then the ERM estimator in F_B satisfies

E R_φ(f̂_n) − R_φ* ≤ 8L_φ κ B / √n + inf_{f∈F_B} R_φ(f) − R_φ*.

Remarks
B controls the trade-off between approximation and estimation error
The bound on the estimation error is independent of P and decreases with n
The approximation error is harder to analyze in general
In practice, B (or λ, next slide) is tuned by cross-validation

124 / 635
ERM as penalized risk minimization
ERM over F_B solves the constrained minimization problem:

min_{f∈H} (1/n) ∑_{i=1}^n φ(y_i f(x_i))   subject to ‖f‖_H ≤ B.

To make this practical we assume that φ is convex.
The problem is then a convex problem in f for which strong duality holds. In particular, f solves the problem if and only if it solves, for some dual parameter λ, the unconstrained problem:

min_{f∈H} (1/n) ∑_{i=1}^n φ(y_i f(x_i)) + λ‖f‖²_H.

125 / 635
Summary: large margin classifiers

[Figure: the loss functions φ(u) of the 1-SVM, 2-SVM, logistic regression and boosting, plotted against the margin u.]

min_{f∈H} (1/n) ∑_{i=1}^n φ(y_i f(x_i)) + λ‖f‖²_H

φ calibrated (e.g., decreasing, φ′(0) < 0) ⟹ good proxy for classification error
φ convex + representer theorem ⟹ efficient algorithms
126 / 635
Outline

1 Kernels and RKHS

2 Kernel tricks

3 Kernel Methods: Supervised Learning


Kernel ridge regression
Kernel logistic regression
Large-margin classifiers
Interlude: convex optimization and duality
Support vector machines

4 Kernel Methods: Unsupervised Learning

5 The Kernel Jungle

6 Open Problems and Research Topics 127 / 635


A few slides on convex duality
Strong Duality

[Figure: a primal function f(x) and its dual q(λ), with matching optimal values.]

Strong duality means that max_λ q(λ) = min_x f(x)
Strong duality holds in most reasonable cases for convex optimization (to be detailed soon).

128 / 635
A few slides on convex duality
Strong Duality

[Figure: the same primal/dual picture.]

The relation between x* and λ* is not always known a priori.

128 / 635
A few slides on convex duality
Parenthesis on duality gaps

[Figure: a primal point x and a dual point λ, with duality gap δ(x, λ) = f(x) − q(λ).]

The duality gap guarantees us that 0 ≤ f(x) − f(x*) ≤ δ(x, λ).
Dual problems are often obtained by Lagrangian or Fenchel duality.

129 / 635
A few slides on Lagrangian duality
Setting
We consider an equality and inequality constrained optimization problem over a variable x ∈ X:

minimize f(x)
subject to h_i(x) = 0, i = 1, . . . , m,
           g_j(x) ≤ 0, j = 1, . . . , r,

making no assumption on f, g and h.
Let us denote by f* the optimal value of the decision function under the constraints, i.e., f* = f(x*) if the minimum is reached at a global minimum x*.

130 / 635
A few slides on Lagrangian duality
Lagrangian
The Lagrangian of this problem is the function L : X × ℝ^m × ℝ^r → ℝ defined by:

L(x, λ, μ) = f(x) + ∑_{i=1}^m λ_i h_i(x) + ∑_{j=1}^r μ_j g_j(x).

Lagrangian dual function
The Lagrange dual function q : ℝ^m × ℝ^r → ℝ is:

q(λ, μ) = inf_{x∈X} L(x, λ, μ)
        = inf_{x∈X} ( f(x) + ∑_{i=1}^m λ_i h_i(x) + ∑_{j=1}^r μ_j g_j(x) ).

131 / 635
A few slides on convex Lagrangian duality
For the (primal) problem:

minimize f(x)
subject to h(x) = 0, g(x) ≤ 0,

the Lagrange dual problem is:

maximize q(λ, μ)
subject to μ ≥ 0.

Proposition
q is concave in (λ, μ), even if the original problem is not convex.
The dual function yields lower bounds on the optimal value f* of the original problem when μ is nonnegative:

q(λ, μ) ≤ f*, ∀λ ∈ ℝ^m, ∀μ ∈ ℝ^r, μ ≥ 0.

132 / 635
Proofs
For each x, the function (λ, μ) ↦ L(x, λ, μ) is linear, and therefore both convex and concave in (λ, μ). The pointwise minimum of concave functions is concave, therefore q is concave.
Let x̄ be any feasible point, i.e., h(x̄) = 0 and g(x̄) ≤ 0. Then we have, for any λ and μ ≥ 0:

∑_{i=1}^m λ_i h_i(x̄) + ∑_{i=1}^r μ_i g_i(x̄) ≤ 0,

⟹ L(x̄, λ, μ) = f(x̄) + ∑_{i=1}^m λ_i h_i(x̄) + ∑_{i=1}^r μ_i g_i(x̄) ≤ f(x̄),

⟹ q(λ, μ) = inf_x L(x, λ, μ) ≤ L(x̄, λ, μ) ≤ f(x̄), for every feasible x̄. ∎

133 / 635
Weak duality
Let q* be the optimal value of the Lagrange dual problem. Each q(λ, μ) is a lower bound for f* and by definition q* is the best lower bound that is obtained. The following weak duality inequality therefore always holds:

q* ≤ f*.

This inequality holds even when q* or f* are infinite. The difference f* − q* is called the optimal duality gap of the original problem.

134 / 635
Strong duality
We say that strong duality holds if the optimal duality gap is zero, i.e.:

q* = f*.

If strong duality holds, then the best lower bound that can be obtained from the Lagrange dual function is tight.
Strong duality does not hold for general nonlinear problems.
It usually holds for convex problems.
Conditions that ensure strong duality for convex problems are called constraint qualifications.
In that case, we have for all feasible primal and dual points x, λ, μ:

q(λ, μ) ≤ q(λ*, μ*) = L(x*, λ*, μ*) = f(x*) ≤ f(x).

135 / 635
Slater's constraint qualification
Strong duality holds for a convex problem:

minimize f(x)
subject to g_j(x) ≤ 0, j = 1, . . . , r,
           Ax = b,

if it is strictly feasible, i.e., there exists at least one feasible point that satisfies:

g_j(x) < 0, j = 1, . . . , r, Ax = b.

136 / 635
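To make weak/strong duality concrete, here is a toy numerical illustration in Python (a sketch of my own, not part of the original slides), for the convex problem min x² subject to 1 − x ≤ 0, which is strictly feasible (take x = 2), so Slater's condition applies.

```python
import numpy as np

# Primal: minimize x^2 subject to g(x) = 1 - x <= 0; optimum x* = 1, f* = 1.
# Lagrangian: L(x, lam) = x^2 + lam * (1 - x); minimizing over x gives x = lam / 2,
# hence the dual function q(lam) = lam - lam^2 / 4 for lam >= 0.
lams = np.linspace(0.0, 4.0, 401)
q = lams - lams ** 2 / 4.0
print("best dual value  q* =", q.max())    # 1.0, attained at lam* = 2
print("primal optimum   f* =", 1.0 ** 2)   # 1.0
# q* = f*: strong duality holds, and lam* * g(x*) = 2 * 0 = 0 (complementary slackness).
```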
Remarks
Slater's conditions also ensure that the maximum q⋆ (if > −∞) is
attained, i.e., there exists a point (μ⋆, λ⋆) with

    q(μ⋆, λ⋆) = q⋆ = f⋆ .

They can be sharpened. For example, strict feasibility is not required
for affine constraints.
There exist many other types of constraint qualifications.
137 / 635
Dual optimal pairs
Suppose that strong duality holds, x⋆ is primal optimal, (μ⋆, λ⋆) is dual
optimal. Then we have:

    f(x⋆) = q(μ⋆, λ⋆)
          = inf_{x ∈ R^n} { f(x) + Σ_{i=1}^m μ⋆_i h_i(x) + Σ_{j=1}^r λ⋆_j g_j(x) }
          ≤ f(x⋆) + Σ_{i=1}^m μ⋆_i h_i(x⋆) + Σ_{j=1}^r λ⋆_j g_j(x⋆)
          ≤ f(x⋆)

Hence both inequalities are in fact equalities.

138 / 635
Complementary slackness
The first equality shows that:

    L(x⋆, μ⋆, λ⋆) = inf_{x ∈ R^n} L(x, μ⋆, λ⋆) ,

showing that x⋆ minimizes the Lagrangian at (μ⋆, λ⋆). The second
equality shows the following important property:

Complementary slackness
Each optimal Lagrange multiplier is zero unless the corresponding
constraint is active at the optimum:

    λ⋆_j g_j(x⋆) = 0 , j = 1, . . . , r .

139 / 635
Outline

1 Kernels and RKHS

2 Kernel tricks

3 Kernel Methods: Supervised Learning


Kernel ridge regression
Kernel logistic regression
Large-margin classifiers
Interlude: convex optimization and duality
Support vector machines

4 Kernel Methods: Unsupervised Learning

5 The Kernel Jungle

6 Open Problems and Research Topics 140 / 635


Support vector machines (SVM)
Historically the first kernel method for pattern recognition, still
the most popular.
Often state-of-the-art in performance.
One particular choice of loss function (hinge loss).
Leads to a sparse solution, i.e., not all points are involved in the
decomposition (compression).
Particular algorithm for fast optimization (decomposition by
chunking methods).

141 / 635
Support vector machines (SVM)
Definition
The hinge loss is the function ℓ_hinge : R → R+ defined by:

    ℓ_hinge(u) = max(1 − u, 0) = 0 if u ≥ 1,  1 − u otherwise.

SVM is the corresponding large-margin classifier, which solves:

    min_{f ∈ H} { (1/n) Σ_{i=1}^n ℓ_hinge(y_i f(x_i)) + λ ‖f‖²_H } .

[Figure: the hinge loss ℓ(f(x), y) as a function of y f(x), with a kink at 1]

                                                              142 / 635
Problem reformulation (1/3)
By the representer theorem, the solution satisfies

    f(x) = Σ_{i=1}^n α_i K(x_i, x) ,

where α solves

    min_{α ∈ R^n} { (1/n) Σ_{i=1}^n ℓ_hinge(y_i [Kα]_i) + λ α^⊤ K α } .

This is a convex optimization problem.
But the objective function is not smooth (because of the hinge loss).

143 / 635
Problem reformulation (2/3)
Let us introduce additional slack variables ξ_1, . . . , ξ_n ∈ R. The
problem is equivalent to:

    min_{α ∈ R^n, ξ ∈ R^n} { (1/n) Σ_{i=1}^n ξ_i + λ α^⊤ K α } ,

subject to:
    ξ_i ≥ ℓ_hinge(y_i [Kα]_i) .

The objective function is now smooth, but not the constraints.
However, it is easy to replace the non-smooth constraint by a
conjunction of two smooth constraints, because:

    u ≥ ℓ_hinge(v)   ⟺   u ≥ 1 − v  and  u ≥ 0 .

144 / 635
Problem reformulation (3/3)
In summary, the SVM solution is

    f(x) = Σ_{i=1}^n α_i K(x_i, x) ,

where α solves:

SVM (primal formulation)

    min_{α ∈ R^n, ξ ∈ R^n}  (1/n) Σ_{i=1}^n ξ_i + λ α^⊤ K α ,

subject to:
    y_i [Kα]_i + ξ_i − 1 ≥ 0 , for i = 1, . . . , n ,
    ξ_i ≥ 0 ,                  for i = 1, . . . , n .

145 / 635
Solving the SVM problem
This is a classical quadratic program (minimization of a convex
quadratic function with linear constraints) for which any
out-of-the-box optimization package can be used.
The dimension of the problem and the number of constraints,
however, are 2n where n is the number of points. General-purpose
QP solvers will have difficulties when n exceeds a few thousands.
Solving the dual of this problem (also a QP) will be more convenient
and lead to faster algorithms (due to the sparsity of the final
solution).

146 / 635
Lagrangian
Let us introduce the Lagrange multipliers μ ∈ R^n and ν ∈ R^n.
The Lagrangian of the problem is:

    L(α, ξ, μ, ν) = (1/n) Σ_{i=1}^n ξ_i + λ α^⊤ K α
                    − Σ_{i=1}^n μ_i [y_i [Kα]_i + ξ_i − 1] − Σ_{i=1}^n ν_i ξ_i

or, in matrix notations:

    L(α, ξ, μ, ν) = (1/n) 1^⊤ ξ + λ α^⊤ K α
                    − (diag(y) μ)^⊤ K α − (μ + ν)^⊤ ξ + μ^⊤ 1

147 / 635
Minimizing L(α, ξ, μ, ν) w.r.t. α
L(α, ξ, μ, ν) is a convex quadratic function in α. It is minimized
whenever its gradient is null:

    ∇_α L = 2λKα − K diag(y)μ = K (2λα − diag(y)μ) .

The following α solves ∇_α L = 0:

    α = diag(y)μ / (2λ) .

148 / 635
Minimizing L(α, ξ, μ, ν) w.r.t. ξ
L(α, ξ, μ, ν) is a linear function in ξ.
Its minimum is −∞ except when it is constant, i.e., when:

    ∇_ξ L = (1/n) 1 − μ − ν = 0 ,

or equivalently

    μ + ν = (1/n) 1 .

149 / 635
Dual function
We therefore obtain the Lagrange dual function:

    q(μ, ν) = inf_{α ∈ R^n, ξ ∈ R^n} L(α, ξ, μ, ν)

            = μ^⊤ 1 − (1/(4λ)) μ^⊤ diag(y) K diag(y) μ   if μ + ν = (1/n) 1 ,
              −∞                                          otherwise.

The dual problem is:

    maximize   q(μ, ν)
    subject to μ ≥ 0, ν ≥ 0 .

150 / 635
Dual problem
If μ_i > 1/n for some i, then there is no ν_i ≥ 0 such that
μ_i + ν_i = 1/n, hence q(μ, ν) = −∞.
If 0 ≤ μ_i ≤ 1/n for all i, then the dual function takes finite values
that depend only on μ, by taking ν_i = 1/n − μ_i.
The dual problem is therefore equivalent to:

    max_{0 ≤ μ ≤ 1/n}  μ^⊤ 1 − (1/(4λ)) μ^⊤ diag(y) K diag(y) μ

or with indices:

    max_{0 ≤ μ ≤ 1/n}  Σ_{i=1}^n μ_i − (1/(4λ)) Σ_{i,j=1}^n y_i y_j μ_i μ_j K(x_i, x_j) .

151 / 635
Back to the primal
Once the dual problem is solved in μ, we get a solution of the primal
problem by α = diag(y)μ/(2λ).
Because the link is so simple, we can therefore directly plug it into
the dual problem to obtain the QP that α must solve:

SVM (dual formulation)

    max_{α ∈ R^n}  2 Σ_{i=1}^n α_i y_i − Σ_{i,j=1}^n α_i α_j K(x_i, x_j) = 2 α^⊤ y − α^⊤ K α ,

subject to:

    0 ≤ y_i α_i ≤ 1/(2λn) , for i = 1, . . . , n .

152 / 635
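As an illustration of this dual QP, here is a minimal numpy sketch (my own, with hypothetical function names, not from the course) that solves it by projected gradient ascent on the variables β_i = y_i α_i, which live in the box [0, 1/(2λn)]:

```python
import numpy as np

def svm_dual_projected_gradient(K, y, lam, n_iter=2000):
    # Solve max_alpha 2 alpha^T y - alpha^T K alpha s.t. 0 <= y_i alpha_i <= 1/(2 lam n)
    # via the change of variable beta_i = y_i alpha_i (box-constrained concave QP).
    n = len(y)
    C = 1.0 / (2.0 * lam * n)                       # upper bound of the box
    Q = K * np.outer(y, y)                          # Q_ij = y_i y_j K(x_i, x_j)
    step = 1.0 / (2.0 * np.linalg.norm(Q, 2) + 1e-12)
    beta = np.zeros(n)
    for _ in range(n_iter):
        grad = 2.0 - 2.0 * Q @ beta                 # gradient of 2 * 1^T beta - beta^T Q beta
        beta = np.clip(beta + step * grad, 0.0, C)  # projection onto the box [0, C]^n
    return y * beta                                 # alpha_i = y_i * beta_i

# decision function on new points: f(x) = sum_i alpha_i K(x_i, x), i.e. K_test_train @ alpha
```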
Complementary slackness conditions
The complementary slackness conditions are, for i = 1, . . . , n:

    μ_i [y_i f(x_i) + ξ_i − 1] = 0 ,
    ν_i ξ_i = 0 .

In terms of α this can be rewritten as:

    α_i [y_i f(x_i) + ξ_i − 1] = 0 ,
    (y_i/(2λn) − α_i) ξ_i = 0 .

153 / 635
Analysis of KKT conditions
    α_i [y_i f(x_i) + ξ_i − 1] = 0 ,
    (y_i/(2λn) − α_i) ξ_i = 0 .

If α_i = 0, then the second constraint forces ξ_i = 0. This implies
y_i f(x_i) ≥ 1.
If 0 < y_i α_i < 1/(2λn), then both constraints are active: ξ_i = 0 and
y_i f(x_i) + ξ_i − 1 = 0. This implies y_i f(x_i) = 1.
If y_i α_i = 1/(2λn), then the second constraint is not active (ξ_i ≥ 0) while
the first one is active: y_i f(x_i) + ξ_i = 1. This implies y_i f(x_i) ≤ 1.

154 / 635
Geometric interpretation

155 / 635
Geometric interpretation

[Figure: separating surface f(x) = 0 with margin surfaces f(x) = +1 and f(x) = −1]

155 / 635
Geometric interpretation

[Figure: same plot with points labeled by their dual variable: αy = 1/(2λn), 0 < αy < 1/(2λn), or α = 0]

155 / 635
Support vectors
Consequence of KKT conditions
The training points with α_i ≠ 0 are called support vectors.
Only support vectors are important for the classification of new
points:

    ∀x ∈ X , f(x) = Σ_{i=1}^n α_i K(x_i, x) = Σ_{i ∈ SV} α_i K(x_i, x) ,

where SV is the set of support vectors.

Consequences
The solution is sparse in , leading to fast algorithms for training
(use of decomposition methods).
The classification of a new point only involves kernel evaluations
with support vectors (fast).
156 / 635
Remark: C-SVM
Often the SVM optimization problem is written in terms of a
regularization parameter C instead of λ, as follows:

    arg min_{f ∈ H}  (1/2) ‖f‖²_H + C Σ_{i=1}^n ℓ_hinge(f(x_i), y_i) .

This is equivalent to our formulation with C = 1/(2λn).
The SVM optimization problem is then:

    max_{α ∈ R^n}  2 Σ_{i=1}^n α_i y_i − Σ_{i,j=1}^n α_i α_j K(x_i, x_j) ,

subject to:
    0 ≤ y_i α_i ≤ C , for i = 1, . . . , n .

This formulation is often called C-SVM.
157 / 635
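In practice one rarely codes the QP by hand; for instance, scikit-learn's SVC accepts a precomputed kernel matrix and solves the C-SVM dual above. A short usage sketch (my own illustration on synthetic data; the C = 1/(2λn) conversion follows the remark above):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

# toy data: two Gaussian blobs with labels in {-1, +1}
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(1, 1, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

lam, n = 0.01, len(y)
K = rbf_kernel(X, X, gamma=0.5)                            # any p.d. kernel matrix works
clf = SVC(kernel="precomputed", C=1.0 / (2.0 * lam * n))   # C = 1/(2*lambda*n)
clf.fit(K, y)
print("number of support vectors:", len(clf.support_))
# to predict on new points x, pass the cross-kernel matrix K(x, X) to clf.predict
```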
Remark: 2-SVM
A variant of the SVM, sometimes called 2-SVM, is obtained by
replacing the hinge loss by the squared hinge loss:

    min_{f ∈ H} { (1/n) Σ_{i=1}^n ℓ_hinge(y_i f(x_i))² + λ ‖f‖²_H } .

After some computation (left as an exercise) we find that the dual
problem of the 2-SVM is:

    max_{α ∈ R^n}  2 α^⊤ y − α^⊤ (K + λnI) α ,

subject to:
    0 ≤ y_i α_i , for i = 1, . . . , n .

This is therefore equivalent to the previous SVM with the kernel
K + λnI and C = +∞.

158 / 635
Part 4

Kernel Methods
Unsupervised Learning

159 / 635
Outline

1 Kernels and RKHS

2 Kernel tricks

3 Kernel Methods: Supervised Learning

4 Kernel Methods: Unsupervised Learning


Kernel K-means and spectral clustering
Kernel PCA
A quick note on kernel CCA

5 The Kernel Jungle

6 Open Problems and Research Topics

160 / 635
The K-means algorithm
K-means is probably the most popular algorithm for clustering.
Optimization point of view
Given data points x1 , . . . , xn in Rp , it consists of performing alternate
minimization steps for optimizing the following cost function
    min_{μ_j ∈ R^p for j=1,...,k;  s_i ∈ {1,...,k} for i=1,...,n}  Σ_{i=1}^n ‖x_i − μ_{s_i}‖²₂ .

K-means alternates between two steps:

1 cluster assignment:
  Given fixed μ_1, . . . , μ_k, assign each x_i to its closest centroid

      ∀i, s_i ∈ argmin_{s ∈ {1,...,k}} ‖x_i − μ_s‖²₂ .

161 / 635
The K-means algorithm
K-means is probably the most popular algorithm for clustering.
Optimization point of view
Given data points x1 , . . . , xn in Rp , it consists of performing alternate
minimization steps for optimizing the following cost function
    min_{μ_j ∈ R^p for j=1,...,k;  s_i ∈ {1,...,k} for i=1,...,n}  Σ_{i=1}^n ‖x_i − μ_{s_i}‖²₂ .

K-means alternates between two steps:

2 centroids update:
  Given the previous assignments s_1, . . . , s_n, update the centroids

      ∀j, μ_j = argmin_{μ ∈ R^p} Σ_{i: s_i=j} ‖x_i − μ‖²₂ .

161 / 635
The K-means algorithm
K-means is probably the most popular algorithm for clustering.
Optimization point of view
Given data points x1 , . . . , xn in Rp , it consists of performing alternate
minimization steps for optimizing the following cost function
    min_{μ_j ∈ R^p for j=1,...,k;  s_i ∈ {1,...,k} for i=1,...,n}  Σ_{i=1}^n ‖x_i − μ_{s_i}‖²₂ .

K-means alternates between two steps:

2 centroids update:
  Given the previous assignments s_1, . . . , s_n, update the centroids

      ∀j, μ_j = (1/|C_j|) Σ_{i ∈ C_j} x_i .

161 / 635
The kernel K-means algorithm
We may now modify the objective to operate in a RKHS. Given data
points x1 , . . . , xn in X and a p.d. kernel K : X X R with H its
RKHS, the new objective becomes

162 / 635
The kernel K-means algorithm
We may now modify the objective to operate in a RKHS. Given data
points x_1, . . . , x_n in X and a p.d. kernel K : X × X → R with H its
RKHS, the new objective becomes

    min_{μ_j ∈ H for j=1,...,k;  s_i ∈ {1,...,k} for i=1,...,n}  Σ_{i=1}^n ‖φ(x_i) − μ_{s_i}‖²_H .

To optimize the cost function, we will first use the following proposition.

Proposition
The center of mass μ_n = (1/n) Σ_{i=1}^n φ(x_i) solves the following optimization
problem:

    min_{μ ∈ H}  Σ_{i=1}^n ‖φ(x_i) − μ‖²_H .

162 / 635
The kernel K-means algorithm
Proof

    (1/n) Σ_{i=1}^n ‖φ(x_i) − μ‖²_H
        = (1/n) Σ_{i=1}^n ‖φ(x_i)‖²_H − 2 ⟨ (1/n) Σ_{i=1}^n φ(x_i), μ ⟩_H + ‖μ‖²_H
        = (1/n) Σ_{i=1}^n ‖φ(x_i)‖²_H − 2 ⟨μ_n, μ⟩_H + ‖μ‖²_H
        = (1/n) Σ_{i=1}^n ‖φ(x_i)‖²_H − ‖μ_n‖²_H + ‖μ − μ_n‖²_H ,

which is minimum for μ = μ_n.

163 / 635
The kernel K-means algorithm
Given now the objective,

    min_{μ_j ∈ H for j=1,...,k;  s_i ∈ {1,...,k} for i=1,...,n}  Σ_{i=1}^n ‖φ(x_i) − μ_{s_i}‖²_H ,

we know that given assignments s_i, the optimal μ_j are the centers of
mass of the respective clusters and we obtain
Greedy approach: kernel K-means
We alternate between two steps:
1 centroids update:
  Given the previous assignments s_1, . . . , s_n, update the centroids

      ∀j, μ_j = argmin_{μ ∈ H} Σ_{i: s_i=j} ‖φ(x_i) − μ‖²_H .

164 / 635
The kernel K-means algorithm
Given now the objective,

    min_{μ_j ∈ H for j=1,...,k;  s_i ∈ {1,...,k} for i=1,...,n}  Σ_{i=1}^n ‖φ(x_i) − μ_{s_i}‖²_H ,

we know that given assignments s_i, the optimal μ_j are the centers of
mass of the respective clusters and we obtain
Greedy approach: kernel K-means
We alternate between two steps:
1 centroids update:
  Given the previous assignments s_1, . . . , s_n, update the centroids

      ∀j, μ_j = (1/|C_j|) Σ_{i ∈ C_j} φ(x_i) .

164 / 635
The kernel K-means algorithm
Given now the objective,

    min_{μ_j ∈ H for j=1,...,k;  s_i ∈ {1,...,k} for i=1,...,n}  Σ_{i=1}^n ‖φ(x_i) − μ_{s_i}‖²_H ,

we know that given assignments s_i, the optimal μ_j are the centers of
mass of the respective clusters and we obtain
Greedy approach: kernel K-means
We alternate between two steps:
2 cluster assignment:
  Given fixed μ_1, . . . , μ_k, assign each x_i to its closest centroid:

      ∀i, s_i ∈ argmin_{s ∈ {1,...,k}} ‖φ(x_i) − μ_s‖²_H .

164 / 635
The kernel K-means algorithm
Given now the objective,

    min_{μ_j ∈ H for j=1,...,k;  s_i ∈ {1,...,k} for i=1,...,n}  Σ_{i=1}^n ‖φ(x_i) − μ_{s_i}‖²_H ,

we know that given assignments s_i, the optimal μ_j are the centers of
mass of the respective clusters and we obtain
Greedy approach: kernel K-means
We alternate between two steps:
2 cluster assignment:
  Given fixed μ_1, . . . , μ_k, assign each x_i to its closest centroid:

      ∀i, s_i ∈ argmin_{s ∈ {1,...,k}} ‖φ(x_i) − (1/|C_s|) Σ_{j ∈ C_s} φ(x_j)‖²_H   (C_s is from step 1).
164 / 635
The kernel K-means algorithm
Given now the objective,

    min_{μ_j ∈ H for j=1,...,k;  s_i ∈ {1,...,k} for i=1,...,n}  Σ_{i=1}^n ‖φ(x_i) − μ_{s_i}‖²_H ,

we know that given assignments s_i, the optimal μ_j are the centers of
mass of the respective clusters and we obtain
Greedy approach: kernel K-means
We alternate between two steps:
2 cluster assignment:
  Given fixed μ_1, . . . , μ_k, assign each x_i to its closest centroid:

      ∀i, s_i ∈ argmin_{s ∈ {1,...,k}}  K(x_i, x_i) − (2/|C_s|) Σ_{j ∈ C_s} K(x_i, x_j) + (1/|C_s|²) Σ_{j,l ∈ C_s} K(x_j, x_l) .

164 / 635
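The greedy algorithm above only touches the data through the kernel matrix. A minimal numpy sketch of kernel K-means using the assignment rule of step 2 (my own implementation, with hypothetical function names):

```python
import numpy as np

def kernel_kmeans(K, k, n_iter=50, seed=0):
    """Greedy kernel K-means on n points given their n x n kernel matrix K."""
    n = K.shape[0]
    rng = np.random.default_rng(seed)
    s = rng.integers(0, k, size=n)               # random initial assignments
    for _ in range(n_iter):
        dist = np.zeros((n, k))
        for c in range(k):
            idx = np.where(s == c)[0]
            if len(idx) == 0:                    # keep empty clusters out of the way
                dist[:, c] = np.inf
                continue
            # ||phi(x_i) - mu_c||^2 = K_ii - 2/|C| sum_j K_ij + 1/|C|^2 sum_jl K_jl
            dist[:, c] = (np.diag(K)
                          - 2.0 * K[:, idx].mean(axis=1)
                          + K[np.ix_(idx, idx)].mean())
        new_s = dist.argmin(axis=1)
        if np.array_equal(new_s, s):
            break
        s = new_s
    return s
```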
The kernel K-means algorithm, equivalent objective
Note that all operations are performed by manipulating kernel values
K(x_i, x_j) only. Implicitly, we are in fact optimizing

    min_{s_i ∈ {1,...,k} for i=1,...,n}  Σ_{i=1}^n ‖φ(x_i) − (1/|C_{s_i}|) Σ_{j ∈ C_{s_i}} φ(x_j)‖²_H ,

or, equivalently,

    min_{s_i ∈ {1,...,k} for i=1,...,n}  Σ_{i=1}^n [ K(x_i, x_i) − (2/|C_{s_i}|) Σ_{j ∈ C_{s_i}} K(x_i, x_j)
                                                     + (1/|C_{s_i}|²) Σ_{j,l ∈ C_{s_i}} K(x_j, x_l) ] .

Then, notice that

    Σ_{i=1}^n (1/|C_{s_i}|²) Σ_{j,l ∈ C_{s_i}} K(x_j, x_l) = Σ_{l=1}^k (1/|C_l|) Σ_{i,j ∈ C_l} K(x_i, x_j).

165 / 635
The kernel K-means algorithm, equivalent objective
Note that all operations are performed by manipulating kernel values
K(x_i, x_j) only. Implicitly, we are in fact optimizing

    min_{s_i ∈ {1,...,k} for i=1,...,n}  Σ_{i=1}^n ‖φ(x_i) − (1/|C_{s_i}|) Σ_{j ∈ C_{s_i}} φ(x_j)‖²_H ,

or, equivalently,

    min_{s_i ∈ {1,...,k} for i=1,...,n}  Σ_{i=1}^n [ K(x_i, x_i) − (2/|C_{s_i}|) Σ_{j ∈ C_{s_i}} K(x_i, x_j)
                                                     + (1/|C_{s_i}|²) Σ_{j,l ∈ C_{s_i}} K(x_j, x_l) ] ,

and

    Σ_{i=1}^n (1/|C_{s_i}|) Σ_{j ∈ C_{s_i}} K(x_i, x_j) = Σ_{l=1}^k (1/|C_l|) Σ_{i,j ∈ C_l} K(x_i, x_j).

165 / 635
The kernel K-means algorithm, equivalent objective
Then, after removing the constant terms K (xi , xi ), we obtain:

Proposition
The kernel K-means objective is equivalent to the following one:
k
X 1 X
max K (xi , xj ).
si {1,...,k} |Cl |
for i=1,...,n l=1 i,jCl

This is a hard combinatorial optimization problem.


There are two types of algorithms to address it:
1 greedy algorithm: kernel K-means
2 spectral relaxation: spectral clustering

166 / 635
The spectral clustering algorithms
Instead of a greedy approach, we can relax the problem into a feasible
one, which yields a class of algorithms called spectral clustering.
First, consider the objective

    max_{s_i ∈ {1,...,k} for i=1,...,n}  Σ_{l=1}^k (1/|C_l|) Σ_{i,j ∈ C_l} K(x_i, x_j) ,

and we introduce
(⋆)  the binary assignment matrix A in {0,1}^{n×k} whose rows sum to one;
(⋆⋆) the diagonal rescaling matrix D in R^{k×k} with diagonal entries [D]_{jj}
     equal to (Σ_{i=1}^n [A]_{ij})^{−1}: the inverse of the cardinality of cluster j.

The objective can then be rewritten (proof is easy and left as an exercise)

    max_{A,D}  trace(D^{1/2} A^⊤ K A D^{1/2})   s.t. (⋆) and (⋆⋆).

167 / 635
The spectral clustering algorithms

    max_{A,D}  trace(D^{1/2} A^⊤ K A D^{1/2})   s.t. (⋆) and (⋆⋆).

The constraints on A, D are such that D^{1/2} A^⊤ A D^{1/2} = I (exercise). A
natural relaxation consists of dropping the constraints (⋆, ⋆⋆) on A
and D and instead optimize over Z = AD^{1/2}:

    max_{Z ∈ R^{n×k}}  trace(Z^⊤ K Z)   s.t. Z^⊤ Z = I.

168 / 635
The spectral clustering algorithms

    max_{A,D}  trace(D^{1/2} A^⊤ K A D^{1/2})   s.t. (⋆) and (⋆⋆).

The constraints on A, D are such that D^{1/2} A^⊤ A D^{1/2} = I (exercise). A
natural relaxation consists of dropping the constraints (⋆, ⋆⋆) on A
and D and instead optimize over Z = AD^{1/2}:

    max_{Z ∈ R^{n×k}}  trace(Z^⊤ K Z)   s.t. Z^⊤ Z = I.

A solution Z? to this problem may be obtained by computing the


eigenvectors of K associated to the k-largest eigenvalues. As we will see
in a few slides, this procedure is related to the kernel PCA algorithm.
Question
How do we obtain an approximate solution (A, D) of the original problem
from the exact solution of the relaxed one Z? ?

168 / 635
The spectral clustering algorithms

    max_{A,D}  trace(D^{1/2} A^⊤ K A D^{1/2})   s.t. (⋆) and (⋆⋆).

The constraints on A, D are such that D^{1/2} A^⊤ A D^{1/2} = I (exercise). A
natural relaxation consists of dropping the constraints (⋆, ⋆⋆) on A
and D and instead optimize over Z = AD^{1/2}:

    max_{Z ∈ R^{n×k}}  trace(Z^⊤ K Z)   s.t. Z^⊤ Z = I.

A solution Z? to this problem may be obtained by computing the


eigenvectors of K associated to the k-largest eigenvalues. As we will see
in a few slides, this procedure is related to the kernel PCA algorithm.
Answer 1
With the original constraints on A, every row of A has a single non-zero
entry ⟹ compute the maximum entry of every row of Z⋆.

168 / 635
The spectral clustering algorithms

    max_{A,D}  trace(D^{1/2} A^⊤ K A D^{1/2})   s.t. (⋆) and (⋆⋆).

The constraints on A, D are such that D^{1/2} A^⊤ A D^{1/2} = I (exercise). A
natural relaxation consists of dropping the constraints (⋆, ⋆⋆) on A
and D and instead optimize over Z = AD^{1/2}:

    max_{Z ∈ R^{n×k}}  trace(Z^⊤ K Z)   s.t. Z^⊤ Z = I.

A solution Z? to this problem may be obtained by computing the


eigenvectors of K associated to the k-largest eigenvalues. As we will see
in a few slides, this procedure is related to the kernel PCA algorithm.
Answer 2
Normalize the rows of Z⋆ to have unit ℓ2-norm, and apply the traditional
K-means algorithm on the rows. This is called spectral clustering.

168 / 635
The spectral clustering algorithms

    max_{A,D}  trace(D^{1/2} A^⊤ K A D^{1/2})   s.t. (⋆) and (⋆⋆).

The constraints on A, D are such that D^{1/2} A^⊤ A D^{1/2} = I (exercise). A
natural relaxation consists of dropping the constraints (⋆, ⋆⋆) on A
and D and instead optimize over Z = AD^{1/2}:

    max_{Z ∈ R^{n×k}}  trace(Z^⊤ K Z)   s.t. Z^⊤ Z = I.

A solution Z? to this problem may be obtained by computing the


eigenvectors of K associated to the k-largest eigenvalues. As we will see
in a few slides, this procedure is related to the kernel PCA algorithm.
Answer 3
Choose another variant of the previous procedures.

168 / 635
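A minimal numpy sketch of the relaxation-based procedure of Answer 2 (my own illustration; in practice one would rather use an existing implementation such as scikit-learn's):

```python
import numpy as np

def spectral_clustering(K, k, n_iter=100, seed=0):
    """Top-k eigenvectors of the kernel matrix K, row-normalized, then plain K-means."""
    eigvals, eigvecs = np.linalg.eigh(K)
    Z = eigvecs[:, -k:]                                       # k largest eigenvalues
    Z = Z / (np.linalg.norm(Z, axis=1, keepdims=True) + 1e-12)  # unit l2-norm rows
    # plain K-means (Lloyd's algorithm) on the rows of Z
    rng = np.random.default_rng(seed)
    centers = Z[rng.choice(len(Z), size=k, replace=False)]
    for _ in range(n_iter):
        d = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        s = d.argmin(axis=1)
        for c in range(k):
            if np.any(s == c):
                centers[c] = Z[s == c].mean(axis=0)
    return s
```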
Outline

1 Kernels and RKHS

2 Kernel tricks

3 Kernel Methods: Supervised Learning

4 Kernel Methods: Unsupervised Learning


Kernel K-means and spectral clustering
Kernel PCA
A quick note on kernel CCA

5 The Kernel Jungle

6 Open Problems and Research Topics

169 / 635
Principal Component Analysis (PCA)
Classical setting
Let S = {x1 , . . . , xn } be a set of vectors (xi Rd )
PCA is a classical algorithm in multivariate statistics to define a set
of orthogonal directions that capture the maximum variance
Applications: low-dimensional representation of high-dimensional
points, visualization

[Figure: a 2D point cloud with its first two principal directions PC1 and PC2]

170 / 635
Principal Component Analysis (PCA)
Formalization
Assume that the data are centered (otherwise center them as
preprocessing), i.e.:

    (1/n) Σ_{i=1}^n x_i = 0.

The orthogonal projection onto a direction w ∈ R^d is the function
h_w : R^d → R defined by:

    h_w(x) = x^⊤ w / ‖w‖ .

171 / 635
Principal Component Analysis (PCA)
Formalization
The empirical variance captured by h_w is:

    var(h_w) := (1/n) Σ_{i=1}^n h_w(x_i)² = (1/n) Σ_{i=1}^n (x_i^⊤ w)² / ‖w‖² .

The i-th principal direction w_i (i = 1, . . . , d) is defined by:

    w_i = arg max_{w ⊥ {w_1,...,w_{i−1}}} var(h_w)   s.t. ‖w‖ = 1.

172 / 635
Principal Component Analysis (PCA)
Solution
Let X be the n × d data matrix whose rows are the vectors
x_1, . . . , x_n. We can then write:

    var(h_w) = (1/n) Σ_{i=1}^n (x_i^⊤ w)² / ‖w‖² = (1/n) w^⊤ X^⊤ X w / (w^⊤ w) .

The solutions of:

    w_i = arg max_{w ⊥ {w_1,...,w_{i−1}}} w^⊤ X^⊤ X w   s.t. ‖w‖ = 1

are the successive eigenvectors of X^⊤X, ranked by decreasing
eigenvalues.

173 / 635
Kernel Principal Component Analysis (PCA)
Let x_1, . . . , x_n be a set of data points in X; let K : X × X → R be a
positive definite kernel and H be its RKHS.
Formalization
Assume that the data are centered (otherwise center by
manipulating the kernel matrix), i.e.:

    (1/n) Σ_{i=1}^n x_i = 0   becomes   (1/n) Σ_{i=1}^n φ(x_i) = 0.

The orthogonal projection onto a direction f ∈ H is the function
h_f : X → R defined by:

    h_w(x) = x^⊤ w / ‖w‖   becomes   h_f(x) = ⟨φ(x), f / ‖f‖_H⟩_H .

174 / 635
Kernel Principal Component Analysis (PCA)
Let x1 , . . . , xn be a set of data points in X ; let K : X X R be a
positive definite kernel and H be its RKHS.
Formalization
The empirical variance captured by h_f is:

    var(h_w) = (1/n) Σ_{i=1}^n (x_i^⊤ w)² / ‖w‖²   becomes   var(h_f) := (1/n) Σ_{i=1}^n ⟨φ(x_i), f⟩²_H / ‖f‖²_H .

The i-th principal direction f_i (i = 1, . . . , d) is defined by:

    f_i = arg max_{f ⊥ {f_1,...,f_{i−1}}} var(h_f)   s.t. ‖f‖_H = 1.

175 / 635
Kernel Principal Component Analysis (PCA)
Let x1 , . . . , xn be a set of data points in X ; let K : X X R be a
positive definite kernel and H be its RKHS.
Formalization
The empirical variance captured by h_f is:

    var(h_w) = (1/n) Σ_{i=1}^n (x_i^⊤ w)² / ‖w‖²   becomes   var(h_f) := (1/n) Σ_{i=1}^n f(x_i)² / ‖f‖²_H .

The i-th principal direction f_i (i = 1, . . . , d) is defined by:

    f_i = arg max_{f ⊥ {f_1,...,f_{i−1}}} Σ_{i=1}^n f(x_i)²   s.t. ‖f‖_H = 1.

175 / 635
Sanity check: kernel PCA with linear kernel = PCA

Let K(x, y) = x^⊤ y be the linear kernel.
The associated RKHS H is the set of linear functions:

    f_w(x) = w^⊤ x ,

endowed with the norm ‖f_w‖_H = ‖w‖_{R^d}.
Therefore we can write:

    var(h_w) = (1/n) Σ_{i=1}^n (x_i^⊤ w)² / ‖w‖² = (1/n) Σ_{i=1}^n f_w(x_i)² / ‖f_w‖² .

Moreover, w ⊥ w′ ⟺ f_w ⊥ f_{w′}.

176 / 635
Kernel Principal Component Analysis (PCA)
Solution
Kernel PCA solves, for i = 1, . . . , d:

    f_i = arg max_{f ⊥ {f_1,...,f_{i−1}}} Σ_{i=1}^n f(x_i)²   s.t. ‖f‖_H = 1.

We can apply the representer theorem (exercise: check that it is also
valid in this case): for i = 1, . . . , d, we have:

    ∀x ∈ X , f_i(x) = Σ_{j=1}^n α_{i,j} K(x_j, x) ,

with α_i = (α_{i,1}, . . . , α_{i,n})^⊤ ∈ R^n.

177 / 635
Kernel Principal Component Analysis (PCA)
Therefore we have:

    ‖f_i‖²_H = Σ_{k,l=1}^n α_{i,k} α_{i,l} K(x_k, x_l) = α_i^⊤ K α_i .

Similarly:

    Σ_{k=1}^n f_i(x_k)² = α_i^⊤ K² α_i ,

and

    ⟨f_i, f_j⟩_H = α_i^⊤ K α_j .

178 / 635
Kernel Principal Component Analysis (PCA)
Solution
Kernel PCA maximizes in α the function:

    α_i = arg max_{α ∈ R^n} α^⊤ K² α ,

under the constraints:

    α_i^⊤ K α_j = 0  for j = 1, . . . , i − 1 ,
    α_i^⊤ K α_i = 1 .

179 / 635
Kernel Principal Component Analysis (PCA)
Solution
Compute the eigenvalue decomposition of the kernel matrix
K = UΔU^⊤, with eigenvalues δ_1 ≥ . . . ≥ δ_n ≥ 0.
After the change of variable β = K^{1/2}α (with K^{1/2} = UΔ^{1/2}U^⊤),

    β_i = arg max_{β ∈ R^n} β^⊤ K β ,

under the constraints:

    β_i^⊤ β_j = 0  for j = 1, . . . , i − 1 ,
    β_i^⊤ β_i = 1 .

Thus, β_i = u_i (the i-th eigenvector) is a solution!
Finally, α_i = (1/√δ_i) u_i .

180 / 635
Kernel Principal Component Analysis (PCA)
Summary
1 Center the Gram matrix
2 Compute the first eigenvectors (u_i, δ_i)
3 Normalize the eigenvectors: α_i = u_i / √δ_i
4 The projections of the points onto the i-th eigenvector are given by
  Kα_i

181 / 635
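A compact numpy sketch of the summary above (my own illustration; the function name is hypothetical, and the centering step uses the standard formula Kc = (I − U)K(I − U) with U = 11^⊤/n):

```python
import numpy as np

def kernel_pca(K, n_components):
    """Kernel PCA: center the Gram matrix, diagonalize it, rescale the eigenvectors,
    and return the projections K @ alpha_i of the training points."""
    n = K.shape[0]
    U1 = np.full((n, n), 1.0 / n)
    Kc = K - U1 @ K - K @ U1 + U1 @ K @ U1              # 1. center the Gram matrix
    eigvals, eigvecs = np.linalg.eigh(Kc)               # 2. eigendecomposition (ascending)
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]  #    reorder: largest first
    alphas = eigvecs[:, :n_components] / np.sqrt(
        np.maximum(eigvals[:n_components], 1e-12))      # 3. alpha_i = u_i / sqrt(delta_i)
    return Kc @ alphas                                  # 4. projections onto each direction
```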
Kernel Principal Component Analysis (PCA)
Remarks
In this formulation, we must diagonalize the centered kernel Gram
matrix, instead of the covariance matrix in the classical setting
Exercise: check that X> X and XX> have the same spectrum (up to
0 eigenvalues) and that the eigenvectors are related by a simple
relationship.
This formulation remains valid for any p.d. kernel: this is kernel PCA
Applications: nonlinear PCA with nonlinear kernels for vectors, PCA
of non-vector objects (strings, graphs..) with specific kernels...

182 / 635
Example
[Figure: kernel PCA projection, PC1 vs. PC2] A set of 74 human tRNA
sequences is analyzed using a kernel for sequences (the second-order
marginalized kernel based on SCFG). This set of tRNAs contains three
classes, called Ala-AGC (white circles), Asn-GTT (black circles) and
Cys-GCA (plus symbols) (from Tsuda et al., 2003).

183 / 635
Outline

1 Kernels and RKHS

2 Kernel tricks

3 Kernel Methods: Supervised Learning

4 Kernel Methods: Unsupervised Learning


Kernel K-means and spectral clustering
Kernel PCA
A quick note on kernel CCA

5 The Kernel Jungle

6 Open Problems and Research Topics

184 / 635
Canonical Correlation Analysis (CCA)
Given two views X = [x_1, . . . , x_n] in R^{p×n} and Y = [y_1, . . . , y_n] in R^{d×n}
of the same dataset, the goal of canonical correlation analysis (CCA) is
to find pairs of directions in the two views that are maximally correlated.
Formulation
Assuming that the datasets are centered, we want to maximize

    max_{w_a ∈ R^p, w_b ∈ R^d}  [ (1/n) Σ_{i=1}^n w_a^⊤ x_i y_i^⊤ w_b ]
        / [ ((1/n) Σ_{i=1}^n w_a^⊤ x_i x_i^⊤ w_a)^{1/2} ((1/n) Σ_{i=1}^n w_b^⊤ y_i y_i^⊤ w_b)^{1/2} ] .

Assuming that the pairs (x_i, y_i) are i.i.d. samples from an unknown
distribution, CCA seeks to maximize

    max_{w_a ∈ R^p, w_b ∈ R^d}  cov(w_a^⊤ X, w_b^⊤ Y) / √( var(w_a^⊤ X) var(w_b^⊤ Y) ) .

185 / 635
Canonical Correlation Analysis (CCA)
Given two views X = [x_1, . . . , x_n] in R^{p×n} and Y = [y_1, . . . , y_n] in R^{d×n}
of the same dataset, the goal of canonical correlation analysis (CCA) is
to find pairs of directions in the two views that are maximally correlated.
Formulation
Assuming that the datasets are centered, we want to maximize

    max_{w_a ∈ R^p, w_b ∈ R^d}  [ (1/n) Σ_{i=1}^n w_a^⊤ x_i y_i^⊤ w_b ]
        / [ ((1/n) Σ_{i=1}^n w_a^⊤ x_i x_i^⊤ w_a)^{1/2} ((1/n) Σ_{i=1}^n w_b^⊤ y_i y_i^⊤ w_b)^{1/2} ] .

It is possible to show that this is a generalized eigenvalue problem (see


next slide or see Section 6.5 of Shawe-Taylor and Cristianini 2004b).
The above problem provides the first pair of canonical directions. Next
directions can be obtained by solving the same problem under the
constraint that they are orthogonal to the previous canonical directions.

185 / 635
Canonical Correlation Analysis (CCA)
Formulation
Assuming that the datasets are centered,

    max_{w_a ∈ R^p, w_b ∈ R^d}  w_a^⊤ X^⊤ Y w_b / [ (w_a^⊤ X^⊤ X w_a)^{1/2} (w_b^⊤ Y^⊤ Y w_b)^{1/2} ]

can be formulated, after removing the scaling ambiguity, as

    max_{w_a ∈ R^p, w_b ∈ R^d}  w_a^⊤ X^⊤ Y w_b   s.t.  w_a^⊤ X^⊤ X w_a = 1  and  w_b^⊤ Y^⊤ Y w_b = 1.

Then, there exist λ_a and λ_b such that the problem is equivalent to

    min_{w_a ∈ R^p, w_b ∈ R^d}  − w_a^⊤ X^⊤ Y w_b + (λ_a/2)(w_a^⊤ X^⊤ X w_a − 1) + (λ_b/2)(w_b^⊤ Y^⊤ Y w_b − 1).

186 / 635
Canonical Correlation Analysis (CCA)
Taking the derivatives and setting the gradient to zero, we obtain

    −X^⊤ Y w_b + λ_a X^⊤ X w_a = 0
    −Y^⊤ X w_a + λ_b Y^⊤ Y w_b = 0

Multiply the first equality by w_a^⊤ and the second equality by w_b^⊤; subtract the
two resulting equalities and we get

    λ_a w_a^⊤ X^⊤ X w_a = λ_b w_b^⊤ Y^⊤ Y w_b   ⟹   λ_a = λ_b = λ ,

and then we obtain the generalized eigenvalue problem:

    [ 0      X^⊤Y ] [w_a]        [ X^⊤X   0    ] [w_a]
    [ Y^⊤X   0    ] [w_b]  =  λ  [ 0      Y^⊤Y ] [w_b]

187 / 635
Canonical Correlation Analysis (CCA)
Let us define

    A = [ 0      X^⊤Y ] ,   B = [ X^⊤X   0    ]   and   w = [w_a]
        [ Y^⊤X   0    ]         [ 0      Y^⊤Y ]             [w_b]

Assuming the covariances are invertible, the generalized eigenvalue
problem is equivalent to

    B^{−1/2} A w = λ B^{1/2} w ,

which is also equivalent to the eigenvalue problem

    B^{−1/2} A B^{−1/2} (B^{1/2} w) = λ (B^{1/2} w).

188 / 635
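A numpy/scipy sketch of this generalized eigenvalue problem (my own illustration; here the data matrices are taken as n × p and n × d so that X^⊤Y, X^⊤X, Y^⊤Y are the empirical covariances, and a small ridge term is added to keep B invertible):

```python
import numpy as np
from scipy.linalg import eigh

def cca_directions(X, Y, n_pairs=1, reg=1e-6):
    """Linear CCA via A w = lambda B w, with A and B defined as on the slide above."""
    n, p = X.shape
    d = Y.shape[1]
    Cxy = X.T @ Y
    Cxx = X.T @ X + reg * np.eye(p)
    Cyy = Y.T @ Y + reg * np.eye(d)
    A = np.block([[np.zeros((p, p)), Cxy], [Cxy.T, np.zeros((d, d))]])
    B = np.block([[Cxx, np.zeros((p, d))], [np.zeros((d, p)), Cyy]])
    vals, vecs = eigh(A, B)                       # symmetric generalized eigenproblem
    order = np.argsort(vals)[::-1][:n_pairs]      # largest correlations first
    return vecs[:p, order], vecs[p:, order]       # (w_a's, w_b's)
```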
Kernel Canonical Correlation Analysis
Similar to kernel PCA, it is possible to operate in a RKHS. Given two
p.d. kernels K_a, K_b : X × X → R, we can obtain two views of a
dataset x_1, . . . , x_n in X^n:

    (φ_a(x_1), . . . , φ_a(x_n))   and   (φ_b(x_1), . . . , φ_b(x_n)),

where φ_a : X → H_a and φ_b : X → H_b are the embeddings in the
RKHSs H_a of K_a and H_b of K_b, respectively.
Formulation
Then, we may formulate kernel CCA as

    max_{f_a ∈ H_a, f_b ∈ H_b}  [ (1/n) Σ_{i=1}^n ⟨f_a, φ_a(x_i)⟩_{H_a} ⟨φ_b(x_i), f_b⟩_{H_b} ]
        / [ ((1/n) Σ_{i=1}^n ⟨f_a, φ_a(x_i)⟩²_{H_a})^{1/2} ((1/n) Σ_{i=1}^n ⟨f_b, φ_b(x_i)⟩²_{H_b})^{1/2} ] .

189 / 635
Kernel Canonical Correlation Analysis
Similar to kernel PCA, it is possible to operate in a RKHS. Given two
p.d. kernels K_a, K_b : X × X → R, we can obtain two views of a
dataset x_1, . . . , x_n in X^n:

    (φ_a(x_1), . . . , φ_a(x_n))   and   (φ_b(x_1), . . . , φ_b(x_n)),

where φ_a : X → H_a and φ_b : X → H_b are the embeddings in the
RKHSs H_a of K_a and H_b of K_b, respectively.
Formulation
Then, we may formulate kernel CCA as

    max_{f_a ∈ H_a, f_b ∈ H_b}  [ (1/n) Σ_{i=1}^n f_a(x_i) f_b(x_i) ]
        / [ ((1/n) Σ_{i=1}^n f_a(x_i)²)^{1/2} ((1/n) Σ_{i=1}^n f_b(x_i)²)^{1/2} ] .

189 / 635
Kernel Canonical Correlation Analysis
Up to a few technical details (exercise), we can apply the representer
theorem and look for solutions f_a(.) = Σ_{i=1}^n α_i K_a(x_i, .) and
f_b(.) = Σ_{i=1}^n β_i K_b(x_i, .). We finally obtain the formulation

    max_{α ∈ R^n, β ∈ R^n}  [ (1/n) Σ_{i=1}^n [K_a α]_i [K_b β]_i ]
        / [ ((1/n) Σ_{i=1}^n [K_a α]_i²)^{1/2} ((1/n) Σ_{i=1}^n [K_b β]_i²)^{1/2} ] ,

which is equivalent to

    max_{α ∈ R^n, β ∈ R^n}  α^⊤ K_a K_b β / [ (α^⊤ K_a² α)^{1/2} (β^⊤ K_b² β)^{1/2} ] ,

or, after removing the scaling ambiguity for α and β,

Equivalent formulation

    max_{α ∈ R^n, β ∈ R^n}  α^⊤ K_a K_b β   s.t.  α^⊤ K_a² α = 1  and  β^⊤ K_b² β = 1.

190 / 635
Kernel Canonical Correlation Analysis

    max_{α ∈ R^n, β ∈ R^n}  α^⊤ K_a K_b β   s.t.  α^⊤ K_a² α = 1  and  β^⊤ K_b² β = 1.

This also leads to a generalized eigenvalue problem.


The subsequent canonical directions are obtained by solving the
same problem with additional orthogonality constraints.

191 / 635
Kernel Canonical Correlation Analysis

    max_{α ∈ R^n, β ∈ R^n}  α^⊤ K_a K_b β   s.t.  α^⊤ K_a² α = 1  and  β^⊤ K_b² β = 1.

This also leads to a generalized eigenvalue problem.


The subsequent canonical directions are obtained by solving the
same problem with additional orthogonality constraints.

What is wrong here?

191 / 635
Kernel Canonical Correlation Analysis

    max_{α ∈ R^n, β ∈ R^n}  α^⊤ K_a K_b β   s.t.  α^⊤ K_a² α = 1  and  β^⊤ K_b² β = 1.

This also leads to a generalized eigenvalue problem.


The subsequent canonical directions are obtained by solving the
same problem with additional orthogonality constraints.

What is wrong here?


If K_a and K_b are invertible, make the change of variables α′ = K_a α and
β′ = K_b β, and we obtain the equivalent formulation

    max_{α′ ∈ R^n, β′ ∈ R^n}  α′^⊤ β′   s.t.  α′^⊤ α′ = 1  and  β′^⊤ β′ = 1.

The function is maximized for any α′ = β′ in R^n.

191 / 635
Kernel Canonical Correlation Analysis

    max_{α ∈ R^n, β ∈ R^n}  α^⊤ K_a K_b β   s.t.  α^⊤ K_a² α = 1  and  β^⊤ K_b² β = 1.

This also leads to a generalized eigenvalue problem.


The subsequent canonical directions are obtained by solving the
same problem with additional orthogonality constraints.

What is wrong here?


If K_a and K_b are invertible, make the change of variables α′ = K_a α and
β′ = K_b β, and we obtain the equivalent formulation

    max_{α′ ∈ R^n, β′ ∈ R^n}  α′^⊤ β′   s.t.  α′^⊤ α′ = 1  and  β′^⊤ β′ = 1.

The function is maximized for any α′ = β′ in R^n. In high (or infinite)


dimension, it is easy to find spurious correlations.
191 / 635
Spurious correlations
Spurious correlations are bad:

Figure: http://www.tylervigen.com/.

192 / 635
Spurious correlations
Spurious correlations are bad:

Figure: http://www.tylervigen.com/.

193 / 635
Kernel Canonical Correlation Analysis

    max_{α ∈ R^n, β ∈ R^n}  α^⊤ K_a K_b β   s.t.  α^⊤ K_a² α = 1  and  β^⊤ K_b² β = 1.

spurious correlation is a problem of overfitting;


it is also a problem of numerical instability, due to the need to invert
the kernel matrices;

194 / 635
Kernel Canonical Correlation Analysis

    max_{α ∈ R^n, β ∈ R^n}  α^⊤ K_a K_b β   s.t.  α^⊤ K_a² α = 1  and  β^⊤ K_b² β = 1.

spurious correlation is a problem of overfitting;


it is also a problem of numerical instability, due to the need to invert
the kernel matrices;

A solution to both problems: Regularize!


Find smooth directions (f_a, f_b) by penalizing ‖f_a‖_H and ‖f_b‖_H.
It consists of replacing the constraint α^⊤ K_a² α = 1 by

    (1 − λ) α^⊤ K_a² α + λ α^⊤ K_a α = 1 ,   where α^⊤ K_a α = ‖f_a‖²_H ,

and doing the same for β^⊤ K_b² β = 1.


194 / 635
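A sketch of this regularized kernel CCA problem written as a generalized eigenvalue problem (my own illustration under the formulation above; the function name is hypothetical and lam plays the role of the regularization parameter λ):

```python
import numpy as np
from scipy.linalg import eigh

def kernel_cca(Ka, Kb, lam=0.1, n_pairs=1):
    """Regularized kernel CCA: solve
    [0, Ka Kb; Kb Ka, 0] [a; b] = rho [Ra, 0; 0, Rb] [a; b],
    with Ra = (1 - lam) Ka^2 + lam Ka (and similarly Rb)."""
    n = Ka.shape[0]
    Ra = (1 - lam) * Ka @ Ka + lam * Ka + 1e-9 * np.eye(n)   # jitter for definiteness
    Rb = (1 - lam) * Kb @ Kb + lam * Kb + 1e-9 * np.eye(n)
    A = np.block([[np.zeros((n, n)), Ka @ Kb], [Kb @ Ka, np.zeros((n, n))]])
    B = np.block([[Ra, np.zeros((n, n))], [np.zeros((n, n)), Rb]])
    vals, vecs = eigh(A, B)
    order = np.argsort(vals)[::-1][:n_pairs]
    return vecs[:n, order], vecs[n:, order]      # (alpha's, beta's)
```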
Part 5

The Kernel Jungle

195 / 635
Outline

1 Kernels and RKHS

2 Kernel tricks

3 Kernel Methods: Supervised Learning

4 Kernel Methods: Unsupervised Learning

5 The Kernel Jungle


Green, Mercer, Herglotz, Bochner and friends
Kernels for probabilistic models
Kernels for biological sequences
Kernels for graphs
Kernels on graphs

6 Open Problems and Research Topics 196 / 635


Introduction
The kernel function plays a critical role in the performance of kernel
methods.
It is the place where prior knowledge about the problem can be
inserted, in particular by controlling the norm of functions in the
RKHS.
In this part we provide some intuition about the link between kernels
and smoothness functional through several examples.
Subsequent parts will focus on the design of kernels for particular
types of data.

197 / 635
Outline

5 The Kernel Jungle


Green, Mercer, Herglotz, Bochner and friends
Green kernels
Mercer kernels
Shift-invariant kernels
Generalization to semigroups
Kernels for probabilistic models
Kernels for biological sequences
Kernels for graphs
Kernels on graphs

198 / 635
Motivations
The RKHS norm is related to the smoothness of functions.
Smoothness of a function is naturally quantified by Sobolev norms
(in particular L2 norms of derivatives).
Example: spline regression
    min_f  Σ_{i=1}^n (y_i − f(x_i))² + λ ∫ f″(t)² dt

In this section we make a general link between RKHS and Green


functions defined by differential operators.
199 / 635
A simple example
Let

    H = { f : [0, 1] → R, absolutely continuous, f′ ∈ L²([0, 1]), f(0) = 0 } ,

endowed with the bilinear form:

    ∀(f, g) ∈ H², ⟨f, g⟩_H = ∫₀¹ f′(u) g′(u) du .

Note that ⟨f, f⟩_H measures the smoothness of f:

    ⟨f, f⟩_H = ∫₀¹ f′(u)² du = ‖f′‖²_{L²([0,1])} .

200 / 635
The RKHS point of view
Theorem
H is a RKHS with r.k. given by:

    ∀(x, y) ∈ [0, 1]², K(x, y) = min(x, y) .

Therefore, the RKHS norm is precisely the smoothness functional defined
in the simple example:

    ‖f‖_H = ‖f′‖_{L²} .

In particular, the following problem

    min_{f ∈ H}  Σ_{i=1}^n (y_i − f(x_i))² + λ ∫ f′(t)² dt

can be reformulated as a simple kernel ridge regression problem with
kernel K(x, y) = min(x, y):

    min_{f ∈ H_K}  Σ_{i=1}^n (y_i − f(x_i))² + λ ‖f‖²_{H_K}
201 / 635
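To illustrate the equivalence, here is a small numpy sketch of kernel ridge regression with K(x, y) = min(x, y) on data in [0, 1] (my own illustration; function names are hypothetical):

```python
import numpy as np

def min_kernel(x, y):
    """K(x, y) = min(x, y), the r.k. of the space above."""
    return np.minimum(x[:, None], y[None, :])

def kernel_ridge_fit(x_train, y_train, lam):
    """Solve min_f sum_i (y_i - f(x_i))^2 + lam ||f||^2: alpha = (K + lam I)^{-1} y."""
    K = min_kernel(x_train, x_train)
    return np.linalg.solve(K + lam * np.eye(len(x_train)), y_train)

def kernel_ridge_predict(x_train, alpha, x_test):
    return min_kernel(x_test, x_train) @ alpha

# usage sketch on noisy data in [0, 1]
x = np.sort(np.random.rand(50))
y = np.sin(2 * np.pi * x) + 0.1 * np.random.randn(50)
alpha = kernel_ridge_fit(x, y, lam=0.1)
y_hat = kernel_ridge_predict(x, alpha, np.linspace(0, 1, 200))
```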
Proof (1/5)
We need to show that
1 H is a Hilbert space of functions
2 x [0, 1], Kx H,
3 (x, f ) [0, 1] H, hf , Kx iH = f (x).

202 / 635
Proof (2/5)
H is a pre-Hilbert space of functions
H is a vector space of functions, and hf , g iH a bilinear form that
satisfies hf , f iH 0.
f absolutely continuous implies differentiable almost everywhere, and
Z x
x [0, 1], f (x) = f (0) + f 0 (u)du .
0

For any f H, f (0) = 0 implies by Cauchy-Schwarz:

x 1  12

Z Z
0 0 2 1/2
| f (x) | = f (u)du x f (u) du = x hf , f iH .
0 0

Therefore, hf , f iH = 0 = f = 0, showing that h., .iH is an inner


product. H is thus a pre-Hilbert space.

203 / 635
Proof (3/5)
H is a Hilbert space
To show that H is complete, let (fn )nN a Cauchy sequence in H
(fn0 )nN is a Cauchy sequence in L2 [0, 1], thus converges to
g L2 [0, 1]
By the previous inequality, (fn (x))nN is a Cauchy sequence and
thus converges to a real number f (x), for any x [0, 1]. Moreover:
Z x Z x
0
f (x) = lim fn (x) = lim fn (u)du = g (u)du ,
n n 0 0

showing that f is absolutely continuous and f 0 = g almost


everywhere; in particular, f 0 L2 [0, 1].
Finally, f (0) = limn fn (0) = 0, therefore f H and

lim k fn f kH = k f 0 gn kL2 [0,1] = 0 .


n

204 / 635
Proof (4/5)
x [0, 1], Kx H
Let Kx (y ) = K (x, y ) = min(x, y ) sur [0, 1]2 :

K(s,t)

t
s 1
Kx is differentiable except at s, has a square integrable derivative,
and Kx (0) = 0, therefore Kx H for all x [0, 1].

205 / 635
Proof (5/5)
For all x, f , hf , Kx iH = f (x)
For any x [0, 1] and f H we have:
Z 1 Z x
0
hf , Kx iH = f (u)Kx0 (u)du = f 0 (u)du = f (x),
0 0

This shows that H is a RKHS with K as r.k. 

206 / 635
Generalization
Theorem
Let X = Rd and D a differential operator on a class of functions H such
that, endowed with the inner product:

(f , g ) H2 , hf , g iH = hDf , Dg iL2 (X ) ,

it is a Hilbert space.
Then H is a RKHS that admits as r.k. the Green function of the
operator D D, where D denotes the adjoint operator of D.

207 / 635
Green function?
Definition
Let the differential equation on H:

f = Dg ,

where g is unknown. In order to solve it we can look for g of the form:


Z
g (x) = k (x, y ) f (y ) dy
X

for some function k : X 2 7 R. k must then satisfy, for all x X ,

f (x) = Dg (x) = hDkx , f iL2 (X ) .

If such a k exists, it is called the Green function of the operator D.

208 / 635
Proof
Let H be a Hilbert space endowed with the inner product:

hf , g iX = hDf , Dg iL2 (X ) ,

and K be the Green function of the operator D D.


For all x X , Kx H because:

hDKx , DKx iL2 (X ) = hD DKx , Kx iL2 (X ) = Kx (x) < .

(caveat: sometimes other conditions must be fulfilled to be in H, to


be checked on a case by case basis).
Moreover, for all f H and x X , we have:

f (x) = hD DKx , f iL2 (X ) = hDKx , Df iL2 (X ) = hKx , f iH .

This shows that H is a RKHS with K as r.k. 

209 / 635
Example
Back to our example, take X = [0, 1] and Df (u) = f 0 (u)
To find the r.k. of H we need to solve in k:
f (x) = hD Dkx , f iL2 ([0,1])
= hDkx , Df iL2 ([0,1])
Z 1
= kx0 (u)f 0 (u)du
0

The solution is
kx0 (u) = 1[0,x] (u)
which gives (
u if u x ,
kx (u) =
x otherwise.
and therefore
k(x, x 0 ) = min(x, x 0 )
210 / 635
Outline

5 The Kernel Jungle


Green, Mercer, Herglotz, Bochner and friends
Green kernels
Mercer kernels
Shift-invariant kernels
Generalization to semigroups
Kernels for probabilistic models
Kernels for biological sequences
Kernels for graphs
Kernels on graphs

211 / 635
Mercer kernels
Definition
A kernel K on a set X is called a Mercer kernel if:
1 X is a compact metric space (typically, a closed bounded subset of
Rd ).
2 K : X X R is a continuous p.d. kernel (w.r.t. the Borel
topology)

Motivations
We can exhibit an explicit and intuitive feature space for a large
class of p.d. kernels
Historically, provided the first proof that a p.d. kernel is an inner
product for non-finite sets X (Mercer, 1905).
Can be thought of as the natural generalization of the factorization
of positive semidefinite matrices over infinite spaces.
212 / 635
Sketch of the proof that a Mercer kernel is an inner
product
1 The kernel matrix when X is finite becomes a linear operator when
X is a metric space.
2 The matrix was positive semidefinite in the finite case, the linear
operator is self-adjoint and positive in the metric case.
3 The spectral theorem states that any compact linear operator
admits a complete orthonormal basis of eigenfunctions, with
non-negative eigenvalues (just like positive semidefinite matrices can
be diagonalized with nonnegative eigenvalues).
4 The kernel function can then be expanded over basis of
eigenfunctions as:

X
K (x, t) = k k (x) k (t) ,
k=1

where i 0 are the non-negative eigenvalues.


213 / 635
In case of...
Definition
Let H be a Hilbert space
A linear operator is a continuous linear mapping from H to itself.
A linear operator L is called compact if, for any bounded sequence
{fn }
n=1 , the sequence {Lfn }n=1 has a subsequence that converges.
L is called self-adjoint if, for any f , g H:

hf , Lg i = hLf , g i .

L is called positive if it is self-adjoint and, for any f H:

hf , Lf i 0 .

214 / 635
An important lemma
The linear operator
Let be any Borel measure on X , and L2 (X ) the Hilbert space of
square integrable functions on X .
For any function K : X 2 7 R, let the transform:
Z
f L2 (X ) , (LK f ) (x) = K (x, t) f (t) d (t) .

Lemma
If K is a Mercer kernel, then LK is a compact and bounded linear
operator over L2 (X ), self-adjoint and positive.

215 / 635
Proof (1/6)
LK is a mapping from L2 (X ) to L2 (X )
For any f L2 (X ) and (x1 , x1 ) X 2 :
Z

| LK f (x1 ) LK f (x2 ) | = (K (x1 , t) K (x2 , t)) f (t) d (t)

k K (x1 , ) K (x2 , ) kk f k
(Cauchy-Schwarz)
p
(X ) max | K (x1 , t) K (x2 , t) | k f k.
tX

K being continuous and X compact, K is uniformly continuous,


therefore LK f is continuous. In particular, LK f L2 (X ) (with the slight
abuse of notation C (X ) L2 (X )). 

216 / 635
Proof (2/6)
LK is linear and continuous
Linearity is obvious (by definition of LK and linearity of the integral).
For continuity, we observe that for all f L2 (X ) and x X :
Z

| (LK f ) (x) | = K (x, t) f (t) d (t)
p
(X ) max | K (x, t) | k f k
tX
p
(X )CK k f k.

with CK = maxx,tX | K (x, t) |. Therefore:


Z 1
2
2
k LK f k = LK f (t) d (t) (X ) CK k f k. 

217 / 635
Proof (3/6)
Criterion for compactness
In order to prove the compactness of LK we need the following criterion.
Let C (X ) denote the set of continuous functions on X endowed with
infinite norm k f k = maxxX | f (x) |.
A set of functions G C (X ) is called equicontinuous if:

 > 0, > 0, (x, y) X 2 ,


k x y k < = g G , | g (x) g (y) | < .

Ascoli Theorem
A part H C (X ) is relatively compact (i.e., its closure is compact) if
and only if it is uniformly bounded and equicontinuous.

218 / 635
Proof (4/6)
LK is compact
Let (fn )n0 be a bounded sequence of L2 (X ) (k fn k < M for all n).
The sequence (LK fn )n0 is a sequence of continuous functions, uniformly
bounded because:
p p
k LK fn k (X )CK k fn k (X )CK M .

It is equicontinuous because:
p
| LK fn (x1 ) LK fn (x2 ) | (X ) max | K (x1 , t) K (x2 , t) | M .
tX

By Ascoli theorem, we can extract a sequence uniformly convergent in


C (X ), and therefore in L2 (X ). 

219 / 635
Proof (5/6)
LK is self-adjoint
K being symmetric, we have for all f , g H:
Z
hf , Lg i = f (x) (Lg ) (x) (dx)
Z Z
= f (x) g (t) K (x, t) (dx) (dt) (Fubini)

= hLf , g i .

220 / 635
Proof (6/6)
LK is positive
We can approximate the integral by finite sums:
Z Z
hf , Lf i = f (x) f (t) K (x, t) (dx) (dt)
k
(X ) X
= lim K (xi , xj ) f (xi ) f (xj )
k k 2
i,j=1

0,

because K is positive definite. 

221 / 635
Diagonalization of the operator
We need the following general result:
Spectral theorem
Let L be a compact linear operator on a Hilbert space H. Then there
exists in H a complete orthonormal system (1 , 2 , . . .) of eigenvectors
of L. The eigenvalues (1 , 2 , . . .) are real if L is self-adjoint, and
non-negative if L is positive.

Remark
This theorem can be applied to LK . In that case the eigenfunctions k
associated to the eigenfunctions k 6= 0 can be considered as continuous
functions, because:
1
k = LK .
k

222 / 635
Main result
Mercer Theorem
Let X be a compact metric space, ν a Borel measure on X, and K a
continuous p.d. kernel. Let (λ_1, λ_2, . . .) denote the nonnegative
eigenvalues of L_K and (ψ_1, ψ_2, . . .) the corresponding eigenfunctions.
Then all functions ψ_k are continuous, and for any x, t ∈ X:

    K(x, t) = Σ_{k=1}^∞ λ_k ψ_k(x) ψ_k(t) ,

where the convergence is absolute for each x, t ∈ X, and uniform on
X × X.

223 / 635
Mercer kernels as inner products
Corollary
The mapping

    φ : X → ℓ²
        x ↦ ( √λ_k ψ_k(x) )_{k ∈ N}

is well defined, continuous, and satisfies

    K(x, t) = ⟨φ(x), φ(t)⟩_{ℓ²} .

224 / 635
Proof of the corollary
k k2 (x) converges
P
By Mercer theorem we see that for all x X ,
to K (x, x) < , therefore (x) l 2 .
The continuity of results from:

X
k (x) (t) k2l 2 = k (k (x) k (t))2
k=1
= K (x, x) + K (t, t) 2K (x, t)

225 / 635
Summary
This proof extends the proof valid when X is finite.
This is a constructive proof, developed by Mercer (1905).
The eigensystem (k and k ) depend on the choice of the measure
(dx): different s lead to different feature spaces for a given
kernel and a given space X
Compactness and continuity are required. For instance, for X = Rd ,
the eigenvalues of:
Z
K (x, t) (t) dt = (x)
X

are not necessarily countable, Mercer theorem does not hold. Other
tools are thus required such as the Fourier transform for
shift-invariant kernels.

226 / 635
Example (1/6)

Consider the unit sphere in Rd :


n o
X = S d1 = x Rd : k x k = 1

Let be the Lebesgue measure on S d1 . Note that:


d
d1 2 2
(S )=
d2


227 / 635
Example (2/6)
Let a p.d. kernel on S d1 of the form:
 
K (x, t) = x> t ,

where : [1, 1] R is continuous.


To write Mercers expansion we need to find the eigenfunctions by
solving Z  
x> t (t) d(t) = (x)
S d1
For that purpose study polynomials that solve the Laplace equation:

2f 2f
f = + . . . + =0
x12 xd2

where is the Laplacian operator on Rd .

228 / 635
Example (3/6)
Definition (Spherical harmonics)
A homogeneous polynomial of degree k 0 in Rd whose Laplacian
vanishes is called a homogeneous harmonic of order k.
A spherical harmonic of order k is a homogeneous harmonic of order
k on the unit sphere S d1
The set Yk (d) of spherical harmonics is a vector space of dimension
(2k + d 2)(k + d 3)!
N(n, k) = dim (Yk (d)) = .
k!(d 2)!

229 / 635
Example (4/6)
Spherical harmonics form the Mercers eigenfunctions, because:
Theorem (Funk-Hecke) [e.g., Muller, 1998, p.30]
For any x S d1 , Yk Yk (d) and C ([1, 1]),
Z  
>
x t Yk (t) d(t) = k Yk (x)
S d1

where
 Z 1 d3
k = S d2 (t)Pk (d; t)(1 t 2 ) 2 dt
1

and Pk (d; t) is the Legendre polynomial of degree k in dimension d.


When C k ([1, 1]) we have Rodrigues rule [Muller, 1998, p.23]:

d1
 Z 1

d2
 k+ d3
k = S 2
d1
 (k) (t) 1 t 2 2
dt
k
2 k+ 2 1

230 / 635
Example (5/6)
N(d;k)
For any k 0, let {Yk,j (d; x)}j=1 an orthonormal basis of Yk (d)
N(d;k)
n o
Spherical harmonics {Yk,j (d; x)}j=1 form an orthonormal
k=0
basis for L2 S d1


Therefore, for any kernel K (x, t) = x> t on S d1 the Mercer




eigenvalues are exactly the k s, with corresponding orthonormal


N(d;k)
eigenfunctions {Yk,j (d; x)}j=1 .
Note that eigenfunctions are the same for different s, only the
eigenvalues change

231 / 635
Example (6/6)
Take d = 2 and K(x, t) = (1 + x^⊤ t)² for x, t ∈ S¹.
Using Rodrigues' rule we get 3 nonzero eigenvalues:

    λ_0 = 3π ,  λ_1 = 2π ,  λ_2 = π/2 ,

with multiplicities 1, 2 and 2.
Corresponding (normalized) eigenfunctions:

    1/√(2π) ,  x_1/√π ,  x_2/√π ,  2x_1x_2/√π ,  (x_1² − x_2²)/√π .

The resulting Mercer feature map is

    φ(x) = ( √(3/2), √2 x_1, √2 x_2, √2 x_1x_2, (x_1² − x_2²)/√2 ) .

Obviously, φ(x)^⊤ φ(t) = K(x, t) for x, t ∈ S¹ (exercise).


232 / 635
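A quick numerical check of the exercise (my own sketch):

```python
import numpy as np

def phi(x):
    """Mercer feature map for K(x,t) = (1 + x^T t)^2 on the unit circle."""
    x1, x2 = x
    return np.array([np.sqrt(1.5),
                     np.sqrt(2) * x1,
                     np.sqrt(2) * x2,
                     np.sqrt(2) * x1 * x2,
                     (x1 ** 2 - x2 ** 2) / np.sqrt(2)])

# check phi(x)^T phi(t) = (1 + x^T t)^2 on random points of S^1
rng = np.random.default_rng(0)
for _ in range(5):
    a, b = rng.uniform(0, 2 * np.pi, size=2)
    x = np.array([np.cos(a), np.sin(a)])
    t = np.array([np.cos(b), np.sin(b)])
    assert np.isclose(phi(x) @ phi(t), (1.0 + x @ t) ** 2)
```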
RKHS of Mercer kernels
Let X be a compact metric space, and K a Mercer kernel on X
(symmetric, continuous and positive definite).
We have expressed a decomposition of the kernel in terms of the
eigenfunctions of the linear convolution operator.
In some cases this provides an intuitive feature space.
The kernel also has a RKHS, like any p.d. kernel.
Can we get an intuition of the RKHS norm in terms of the
eigenfunctions and eigenvalues of the convolution operator?

233 / 635
Reminder: expansion of Mercer kernel
Theorem
Denote by LK the linear operator of L2 (X ) defined by:
Z

f L2 (X ) , (LK f ) (x) = K (x, t) f (t) d (t) .

Let (1 , 2 , . . .) denote the eigenvalues of LK in decreasing order, and


(1 , 2 , . . .) the corresponding eigenfunctions. Then it holds that for any
x, y X :

X
K (x, y) = k k (x) k (y) = h (x) , (y)il 2 ,
k=1

with : X 7 l 2 defined par (x) =

k k (x) kN .

234 / 635
RKHS construction
Theorem
Assuming that all eigenvalues are positive, the RKHS is the Hilbert
space:

    H_K = { f ∈ L²(X) : f = Σ_{i=1}^∞ a_i ψ_i , with Σ_{k=1}^∞ a_k²/λ_k < ∞ }

endowed with the inner product:

    ⟨f, g⟩_K = Σ_{k=1}^∞ a_k b_k / λ_k ,  for f = Σ_k a_k ψ_k , g = Σ_k b_k ψ_k .

Remark
If some eigenvalues are equal to zero, then the result and the proof remain valid
on the subspace spanned by the eigenfunctions with positive eigenvalues.

235 / 635
Proof (1/6)
Sketch
In order to show that HK is the RKHS of the kernel K we need to show
that:
1 it is a Hilbert space of functions from X to R,
2 for any x X , Kx HK ,
3 for any x X and f HK , f (x) = hf , Kx iHK .

236 / 635
Proof (2/6)
HK is a Hilbert space
Indeed the function:
1
LK2 :L2 (X ) HK
X
X p
ai i 7 ai i i
i=1 i=1

is an isomorphism, therefore HK is a Hilbert space, like L2 (X ). 

237 / 635
Proof (3/6)
HK is a space of continuous functions
P
For any f = i=1 ai i HK , and x X , we have (if f (x) makes sense):

X X a p
i
| f (x) | = ai i (x) = i i (x)


i=1

i=1
i

!1
!1
X ai2 2 X 2
2

. i i (x)
i
i=1 i=1
1
= k f kHK K (x, x) 2
p
= k f kHK CK .

Therefore convergence in k . kHK implies uniform convergence for


functions.

238 / 635
Proof (4/6)
HK is a space of continuous functions (cont.)
Let now fn = ni=1 ai i HK . The functions i are continuous
P
functions, therefore fn is also continuous, for all n. The fn s are
convergent in HK , therefore also in the (complete) space of continuous
functions endowed with the uniform norm.
Let fc the continuous limit function. Then fc L2 (X ) and

k fn fc kL2 (X ) 0.
n

On the other hand,

k f fn kL2 (X ) 1 k f fn kHK 0,
n

therefore f = fc . 

239 / 635
Proof (5/6)
Kx HK
For any x X let, for all i, ai = i i (x). We have:

X a2 X
i
= i i (x)2 = K (x, x) < ,
i
i=1 i=1
P
therefore x := i=1 ai i HK . As seen earlier the convergence in HK
implies pointwise convergence, therefore for any t X :

X
X
x (t) = ai i (t) = i i (x) i (t) = K (x, t) ,
i=1 i=1

therefore x = Kx HK . 

240 / 635
Proof (6/6)
f (x) = hf , Kx iHK
P
Let f = i=1 ai i HK , et x X . We have seen that:

X
Kx = i i (x) i ,
i=1

therefore:

X i i (x) ai X
hf , Kx iHK = = ai i (x) = f (x) ,
i
i=1 i=1

which concludes the proof. 

241 / 635
Remarks
Although HK was built from the eigenfunctions of LK , which depend
on the choice of the measure (x), we know by uniqueness of the
RKHS that HK is independant of and LK .
Mercer theorem provides a concrete way to build the RKHS, by
taking linear combinations of the eigenfunctions of LK (with
adequately chosen weights).
The eigenfunctions (i )iN form an orthogonal basis of the RKHS:

1
hi , j iHK = 0 si i 6= j, k i kHK = .
i
The RKHS is a well-defined ellipsoid with axes given by the
eigenfunctions.

242 / 635
Outline

5 The Kernel Jungle


Green, Mercer, Herglotz, Bochner and friends
Green kernels
Mercer kernels
Shift-invariant kernels
Generalization to semigroups
Kernels for probabilistic models
Kernels for biological sequences
Kernels for graphs
Kernels on graphs

243 / 635
Motivation
Let us suppose that X is not compact, for example X = Rd .
In that case, the eigenvalues of:
Z
K (x, t) (t) d(t) = (t)
X

are not necessarily countable, Mercer theorem does not hold.


Fourier transforms provide a convenient extension for translation
invariant kernels, i.e., kernels of the form K (x, y) = (x y).
Harmonic analysis also bring kernels well beyond vector spaces, e.g.,
groups and semigroups

244 / 635
Fourier-Stieltjes transform on the torus
Let T the torus [0, 2] with 0 and 2 identified
C (T) the set of continuous functions on T
M(T) the finite complex Borel measures2 on T
M(T) can be identified as the dual space (C (T)) : for any
continuous/bounded linear functional : C (T) C there exists
1
R
M(T) such that (f ) = 2 T f (t)d(t) (Riesz theorem).

Definition (Fourier-Stieltjes coefficients)


For any M(T), the Fourier-Stieltjes coefficients of is the sequence:
Z
1
n Z , (n) = e int d(t)
2 T
This extends the standard Fourier transform for integrable functions by
taking d(t) = f (t)dt.
2
a measure defined on all open sets
245 / 635
Translation invariant kernels on Z
Definition
A kernel K : Z Z 7 R is called translation invariant (t.i.), or
shift-invariant, if it only depends on the difference between its argument,
i.e.:
x, y Z , K (x, y) = axy
for some sequence {an }nZ . Such a sequence is called positive definite if
the corresponding kernel K is p.d.

246 / 635
Translation invariant kernels on Z
Definition
A kernel K : Z Z 7 R is called translation invariant (t.i.), or
shift-invariant, if it only depends on the difference between its argument,
i.e.:
x, y Z , K (x, y) = axy
for some sequence {an }nZ . Such a sequence is called positive definite if
the corresponding kernel K is p.d.

Theorem (Herglotz)
A sequence {a_n}_{n∈Z} is p.d. if and only if it is the Fourier-Stieltjes
transform of a positive measure μ ∈ M(T).

246 / 635
Examples
Diagonal kernel:
(
1 if n = 0 ,
Z
1
= dt , an = (n) = e int dt =
2 T 0 otherwise.

The resulting kernel is K (x, t) = 1(x = t).


Constant kernel: for C 0,
Z
= 2C 0 , an = (n) = C e int 0 (t) = C ,
T

resulting in K (x, t) = C

247 / 635
Proof of Herglotzs theorem:
If an = (n) for M(T) positive, then for any n N, x1 , . . . , xn Z
and z1 , . . . , zn R (or C) :
n X
n n n Z
X 1 XX
zi zj axi xj = zi zj e i(xi xj )t d(t)
2 T
i=1 j=1 i=1 j=1
n n Z
1 XX
= zi zj e ixi t e ixj t d(t)
2 T
i=1 j=1
Z X n
1
= | zj e ixj t |2 d(t)
2 T
j=1

0.

248 / 635
Proof of Herglotzs theorem: (1/4)
Let {an }nZ a p.d. sequence
For a given t R and N N let {zn }nZ be
(
e int if | n | N ,
zn =
0 otherwise.
Since {an }nZ is p.d. we get:
N
X N
X N
X N
X
0 akl zk zl = akl e i(kl)t
k=N l=N k=N l=N
2N
X
= (2N + 1 |k|)ak e ikt
k=2N
|k|
 
1 X
= max 0, 1 ak e ikt
2N + 1 2N + 1
kZ
| {z }
2N (t)

249 / 635
Proof of Herglotzs theorem: (2/4)
dN = N (t)dt is a positive measure (for N even) and satisfies
N
|j| |n|
X  Z  
i(nj)t
N (n) = aj 1 e = an max 0, 1
N +1 T N +1
j=N

Moreover
Z
k N kM(T) = sup f (t)N (t)dt
k f k 1 T
Z
= N (t)dt (take f = 1 because N (t) 0)
T
N Z
|n|
X  
= an 1 e int dt
T N +1
n=N
= a0

250 / 635
Proof of Herglotzs theorem: (3/4)
For any P
trigonometric polynomial of the form
P(t) = K ikt
k=K bk e , with Fourier coefficient P(n) = bn , we have
Z
lim P(t)dN (t)
N+ T
K N Z
|n|
X X  
= lim an bk 1 e i(nk)t dt
N+ T N +1
k=K n=N
K
|n|
X  
= ak bk lim 1
N+ N +1
k=K
K
X
= ak b k
k=K
X
= ak P(k)
kZ

251 / 635
Proof of Herglotzs theorem: (4/4)
P
This shows that (P) = kZ ak P(k) is a linear functional over
trigonometric polynomials, with norm a0
It can be extended to all continuous functions because trigonometric
polynomials are dense in C (T)
By Riesz representation theorem, there exists a measure M(T)
such that k kM(T) a0
Z
f C (T) , (f ) = f (t)d(t)
T

Taking f (t) = e int gives


Z
(n) = e int d(t) = (e int ) = an
T
Furthermore is a positive measure because if f 0
Z
f (t)d(t) = (f ) = lim (Pn ) = lim k (Pn ) 0 
T n+ n,k+

252 / 635
Fourier transform on Rd
Definition
For any f L1 Rd , the Fourier transform of f is the function:


Z
>
Rd , f () = e ix f (x) dx .
Rd

253 / 635
Fourier transform on Rd
Properties
f is complex-valued, continuous, tends to 0 at infinity and
k f kL k f kL1 .
If f L1 Rd , then the inverse Fourier formula holds:


Z
1 >
x Rd , f (x) = d
e ix f () d.
(2) Rd

If f L1 Rd is square integrable, then Parsevals formula holds:




Z Z
1 2
| f (x) |2 dx =

f () d .

d
Rd (2) R d

254 / 635
Fourier-Stieltjes transform on Rd
C0 (Rd ) the set of continuous functions on Rd that vanish at infinity
M(Rd ) the finite complex Borel measures on Rd

M(Rd ) can be identified as the dual space C0 (Rd ) : for any
continuous/bounded linear functional : C0 (Rd ) C there exists
M(Rd ) such that (f ) = Rd f (t)d(t) (Riesz theorem).
R

Definition (Fourier-Stieltjes transform)


For any M(Rd ), the Fourier-Stieltjes transform of is the function:
Z
>
R , () =
d
e i x d(x)
Rd

255 / 635
Fourier-Stieltjes transform on Rd
This extends the standard Fourier transform for integrable functions
by taking d(x) = f (x)dx.
For M(Rd ), is still uniformly continuous, but () does not
necessarily go to 0 at infinity (e.g., take the Dirac = 0 , then
() = 1 for all )
Parsevals formula becomes: if M(Rd ), and both g , g are in
L1 (Rd ), then
Z Z
1
g (x)d(x) = g ()()d
Rd (2)d Rd

256 / 635
Translation invariant kernels on Rd
Definition
A kernel K : R^d × R^d → R is called translation invariant (t.i.), or
shift-invariant, if it only depends on the difference between its arguments,
i.e.:
    ∀x, y ∈ R^d , K(x, y) = φ(x − y)
for some function φ : R^d → R. Such a function φ is called positive
definite if the corresponding kernel K is p.d.

257 / 635
Translation invariant kernels on Rd
Definition
A kernel K : R^d × R^d → R is called translation invariant (t.i.), or
shift-invariant, if it only depends on the difference between its arguments,
i.e.:
    ∀x, y ∈ R^d , K(x, y) = φ(x − y)
for some function φ : R^d → R. Such a function φ is called positive
definite if the corresponding kernel K is p.d.

Theorem (Bochner)
A continuous function φ : R^d → R is p.d. if and only if it is the
Fourier-Stieltjes transform of a symmetric and positive finite Borel
measure μ ∈ M(R^d).

257 / 635
Proof of Bochners theorem:
If = for some M(T) positive, then for any n N,
x1 , . . . , xn Rd and z1 , . . . , zn R (or C) :
n X
n n X
n Z
>
X X
zi zj (xi xj ) = zi zj e i(xi xj ) t d(t)
i=1 j=1 i=1 j=1 Rd
n X n Z
X > >
= zi zj e ixi t e ixj t d(t)
i=1 j=1 Rd
Z n
>
zj e ixj t |2 d(t)
X
= |
Rd j=1

0.

If is symmetric then, in addition, is real-valued.

258 / 635
Proof of Bochners theorem: (1/5)
Lemma
Let : R R continuous. If there exists C 0 such that
Z
1
g ()()d C sup | g (x) |
2
R

xR

for every continuous function g L1 (R) such that g is continuous and


has compact support, then is the Fourier-Stieljes transform of a
measure M(R).
Proof: Let G C0 (R) be the set of functions g RL1 (R) such that g is
1
continuous and has compact support. : g 7 2 R g ()()d is
linear and continuous on G, and can be extended to C0 (R) by density of
G. By Riesz
R theorem, there exists M(R) such that
1
R
(g ) = R g (x)d(x) = 2 R g ()()d, using Parcevals formula
for the second equality. This must hold for all g , so = . 
Note: the converse is also true.
259 / 635
Proof of Bochners theorem: (2/5)
We consider d = 1. Generalization to d > 1 is trivial.
Let : R R continuous and p.d.
For any > 0, the sequence {(n)}nZ is p.d., so by Herglotzs
theorem there exists a positive measure M(T) such that
(n) = (n) ,
and k kM(T) = (0) = (0).
Let g L1 (R) continuous such that g is continuous and has
compact support.
For any  > 0 there exists > 0 such that

X
Z
1
g ()()d < g (n)(n) + ,

2 2
R nZ

by approximating the integral by its Riemann sums (where the width


of each rectangle is ).
260 / 635
Proof of Bochners theorem: (3/5)
For t T let:  
X t + 2m
G (t) = g

mZ
.
Given the regularity and decay of g , we can find a sufficiently small
to ensure
sup | G (t) | sup | g (x) | + 
tT xR

261 / 635
Proof of Bochners theorem: (3/5)
In addition, for any n Z:
Z
1
G (n) = e int G (t)dt
2 T
1 X 2 int
Z  
t + 2m
= e g dt
2 0
mZ
Z 2(m+1)
X
= e in(u+2m) g (u)du
2 2m
mZ
2(m+1)
XZ
= e inu g (u)du
2 2m
mZ


Z
= e inu g (u)du
2 R

= g (n)
2
262 / 635
Proof of Bochners theorem: (4/5)
This gives:

X X
g (n)(n) = G (n) (n)

2


nZ nZ
Z
1
= G (t)d (t) (Parceval)
2 T
k kM(T) sup | G (t) |
tT
C sup | G (t) |
tT
C sup | g (x) | + C 
xR

with C = (0).

263 / 635
Proof of Bochner's theorem: ⇒ (5/5)
Putting it all together gives:
\[
\Big| \frac{1}{2\pi} \int_{R} \hat{g}(\omega)\, \phi(\omega)\, d\omega \Big| < C \sup_{x \in R} | g(x) | + (C + 1)\eta .
\]
This is true for all η > 0, which implies
\[
\Big| \frac{1}{2\pi} \int_{R} \hat{g}(\omega)\, \phi(\omega)\, d\omega \Big| \leq C \sup_{x \in R} | g(x) | .
\]
We conclude from the lemma that φ = μ̂ for some μ ∈ M(R), which
satisfies
\[
\int_{R} g(x)\, d\mu(x) = \frac{1}{2\pi} \int_{R} \hat{g}(\omega)\, \phi(\omega)\, d\omega .
\]
When g ≥ 0, this quantity is approximated (up to a positive factor) by
∫_T G_ε(t) dν_ε(t) for small ε, which is ≥ 0 because ν_ε is a positive
measure and G_ε ≥ 0 like g.
Consequently, μ is a positive measure. □
264 / 635
RKHS of translation invariant kernels
Theorem
Let K(x, t) = φ(x − t) be a translation invariant p.d. kernel, such that φ
is integrable on R^d as well as its Fourier transform φ̂. The subset H of
L^2(R^d) that consists of integrable and continuous functions f such that:
\[
\| f \|_{K}^{2} := \frac{1}{(2\pi)^d} \int_{R^d} \frac{| \hat{f}(\omega) |^2}{\hat{\phi}(\omega)}\, d\omega < +\infty ,
\]
endowed with the inner product:
\[
\langle f, g \rangle := \frac{1}{(2\pi)^d} \int_{R^d} \frac{\hat{f}(\omega)\, \overline{\hat{g}(\omega)}}{\hat{\phi}(\omega)}\, d\omega
\]
is a RKHS with K as r.k.

265 / 635
Proof
H is a Hilbert space: exercise.
For x ∈ R^d, K_x(y) = K(x, y) = φ(x − y), therefore:
\[
\hat{K}_x(\omega) = \int e^{-i \omega^{\top} u}\, \phi(u - x)\, du = e^{-i \omega^{\top} x}\, \hat{\phi}(\omega) .
\]
This leads to K_x ∈ H, because:
\[
\int_{R^d} \frac{| \hat{K}_x(\omega) |^2}{\hat{\phi}(\omega)}\, d\omega = \int_{R^d} | \hat{\phi}(\omega) |\, d\omega < \infty .
\]
Moreover, if f ∈ H and x ∈ R^d, we have:
\[
\langle f, K_x \rangle_{H} = \frac{1}{(2\pi)^d} \int_{R^d} \frac{\hat{f}(\omega)\, \overline{\hat{K}_x(\omega)}}{\hat{\phi}(\omega)}\, d\omega
= \frac{1}{(2\pi)^d} \int_{R^d} \hat{f}(\omega)\, e^{i \omega^{\top} x}\, d\omega = f(x) .
\]

266 / 635
Example
Gaussian kernel
\[
K(x, y) = e^{-\frac{(x - y)^2}{2\sigma^2}}
\]
corresponds to:
\[
\hat{\phi}(\omega) = \sqrt{2\pi}\, \sigma\, e^{-\frac{\sigma^2 \omega^2}{2}}
\]
and
\[
H = \Big\{ f : \int | \hat{f}(\omega) |^2\, e^{\frac{\sigma^2 \omega^2}{2}}\, d\omega < \infty \Big\} .
\]
In particular, all functions in H are infinitely differentiable with all
derivatives in L^2.

267 / 635
Example
Laplace kernel
\[
K(x, y) = \frac{1}{2} e^{-\gamma | x - y |}
\]
corresponds to:
\[
\hat{\phi}(\omega) = \frac{\gamma}{\gamma^2 + \omega^2}
\]
and
\[
H = \Big\{ f : \int | \hat{f}(\omega) |^2\, (\omega^2 + \gamma^2)\, d\omega < \infty \Big\} ,
\]
the set of L^2 functions that are differentiable with derivatives in L^2 (Sobolev
norm).

268 / 635
Example
Low-frequency filter
\[
K(x, y) = \frac{\sin\big( \Omega (x - y) \big)}{\pi (x - y)}
\]
corresponds to:
\[
\hat{\phi}(\omega) = U(\omega + \Omega) - U(\omega - \Omega)
\]
and
\[
H = \Big\{ f : \int_{| \omega | > \Omega} | \hat{f}(\omega) |^2\, d\omega = 0 \Big\} ,
\]
the set of functions whose spectrum is included in [−Ω, Ω].

269 / 635
Outline

5 The Kernel Jungle


Green, Mercer, Herglotz, Bochner and friends
Green kernels
Mercer kernels
Shift-invariant kernels
Generalization to semigroups
Kernels for probabilistic models
Kernels for biological sequences
Kernels for graphs
Kernels on graphs

270 / 635
Generalization to semigroups (cf Berg et al., 1983)
Definition
A semigroup (S, ) is a nonempty set S equipped with an
associative composition and a neutral element e.
A semigroup with involution (S, , ) is a semigroup (S, ) together
with a mapping : S S called involution satisfying:

1 (s t) = t s , for s, t S.

2 (s ) = s for s S.

Examples
Any group (G , ) is a semigroup with involution when we define
s = s 1 .
Any abelian semigroup (S, +) is a semigroup with involution when
we define s = s, the identical involution.

271 / 635
Positive definite functions on semigroups
Definition
Let (S, , ) be a semigroup with involution. A function : S R is
called positive definite if the function:

s, t S, K (s, t) = (s t)

is a p.d. kernel on S.

Example: translation invariant kernels


Rd , +, is an abelian group with involution. A function : Rd R


is p.d. if the function


K (x, y) = (x y)
is p.d. on Rd (translation invariant kernels).

272 / 635
Semicharacters
Definition
A function : S C on an abelian semigroup with involution (S, +, )
is called a semicharacter if
1 (0) = 1,
2 (s + t) = (s)(t) for s, t S,
3 (s ) = (s) for s S.
The set of semicharacters on S is denoted by S .

Remarks
If is the identity, a semicharacter is automatically real-valued.
If (S, +) is an abelian group and s = s, a semicharacter has its
values in the circle group {z C | | z | = 1} and is a group character.

273 / 635
Semicharacters are p.d.
Lemma
Every semicharacter ρ is p.d., in the sense that:
K(s, t) = \overline{K(t, s)},
\sum_{i,j=1}^{n} a_i \bar{a}_j K(x_i, x_j) ≥ 0.

Proof
Direct from the definition, e.g.,
\[
\sum_{i,j=1}^{n} a_i \bar{a}_j\, \rho(x_i + x_j^{*}) = \sum_{i,j=1}^{n} a_i \bar{a}_j\, \rho(x_i)\, \overline{\rho(x_j)} = \Big| \sum_{i=1}^{n} a_i\, \rho(x_i) \Big|^2 \geq 0 .
\]

Examples
ρ(t) = e^{βt} on (R, +, Id).
ρ(t) = e^{iωt} on (R, +, −).
274 / 635
Integral representation of p.d. functions
Definition
An function : S R on a semigroup with involution is called an absolute
value if (i) (e) = 1, (ii)(s t) (s)(t), and (iii) (s ) = (s).
A function f : S R is called exponentially bounded if there exists an
absolute value and a constant C > 0 s.t. | f (s) | C (s) for s S.

Theorem
Let (S, +, ) an abelian semigroup with involution. A function : S R is p.d.
and exponentially bounded (resp. bounded) if and only if it has a representation
of the form: Z
(s) = (s)d() .
S

where is a Radon measure with compact support on S (resp. on S, the set


of bounded semicharacters).

275 / 635
Proof
Sketch (details in Berg et al., 1983, Theorem 4.2.5)
For an absolute value , the set P1 of -bounded p.d. functions
that satisfy (0) = 1 is a compact convex set whose extreme points
are precisely the -bounded semicharacters.
If is p.d. and exponentially bounded then there exists an absolute
value such that (0)1 P1 .
By the Krein-Milman theorem there exits a Radon probability
measure on P1 having (0)1 as barycentre.

Remarks
The result is not true without the assumption of exponentially
bounded semicharacters.
In the case of abelian groups with s = s this reduces to
Bochners theorem for discrete abelian groups, cf. Rudin (1962).

276 / 635
Example 1: (R_+, +, Id)
Semicharacters
S = (R_+, +, Id) is an abelian semigroup.
P.d. functions on S are nonnegative, since φ(x) = K(x/2, x/2) ≥ 0.
The set of bounded semicharacters is exactly the set of functions:
\[
s \in R_+ \mapsto \rho_a(s) = e^{-as} ,
\]
for a ∈ [0, +∞] (left as exercise).
Non-bounded semicharacters are more difficult to characterize; in
fact there exist nonmeasurable solutions of the equation
h(x + y) = h(x) h(y).

277 / 635
Example 1: (R_+, +, Id) (cont.)
P.d. functions
By the integral representation theorem for bounded semicharacters
we obtain that a function φ : R_+ → R is p.d. and bounded if and
only if it has the form:
\[
\phi(s) = \int_{0}^{+\infty} e^{-as}\, d\mu(a) + b\, \rho_{\infty}(s) ,
\]
where μ ∈ M_+^b(R_+) and b ≥ 0.
The first term is the Laplace transform of μ. φ is p.d., bounded and
continuous iff it is the Laplace transform of a measure in M_+^b(R_+).

278 / 635
Example 2: Semigroup kernels for finite measures (1/6)
Setting
We assume that data to be processed are bags-of-points, i.e., sets
of points (with repeats) of a space U.
Example : a finite-length string as a set of k-mers.
How to define a p.d. kernel between any two bags that only depends
on the union of the bags?
See details and proofs in Cuturi et al. (2005).

279 / 635
Example 2: Semigroup kernels for finite measures (2/6)
Semigroup of bounded measures
We can represent any bag-of-points x as a finite measure μ_x on U:
\[
\mu_x = \sum_{i} a_i\, \delta_{x_i} ,
\]
where a_i is the number of occurrences of x_i in the bag.
The measure that represents the union of two bags is the sum of the
measures that represent each individual bag.
This suggests to look at the semigroup (M_+^b(U), +, Id) of bounded
Radon measures on U and to search for p.d. functions on this
semigroup.

280 / 635
Example 2: Semigroup kernels for finite measures (3/6)
Semicharacters
For any Borel measurable function f : U → R the function
ρ_f : M_+^b(U) → R defined by:
\[
\rho_f(\mu) = e^{\mu[f]}
\]
is a semicharacter on (M_+^b(U), +).
Conversely, ρ is a continuous semicharacter (for the topology of weak
convergence) if and only if there exists a continuous function
f : U → R such that ρ = ρ_f.
No such characterization exists for non-continuous semicharacters, even
bounded ones.

281 / 635
Example 2: Semigroup kernels for finite measures (4/6)
Corollary
Let U be a Hausdorff space. For any Radon measure ν ∈ M_+^c(C(U))
with compact support on the Hausdorff space of continuous real-valued
functions on U endowed with the topology of pointwise convergence, the
following function K is a continuous p.d. kernel on M_+^b(U) (endowed
with the topology of weak convergence):
\[
K(\mu, \mu') = \int_{C(U)} e^{\mu[f] + \mu'[f]}\, d\nu(f) .
\]

Remarks
The converse is not true: there exist continuous p.d. kernels that do not have
this integral representation (it might include non-continuous semicharacters).

282 / 635
Example 2: Semigroup kernels for finite measures (5/6)
Example: entropy kernel
Let X be the set of probability densities (w.r.t. some reference
measure) on U with finite entropy:
\[
h(x) = - \int_{U} x \ln x .
\]
Then the following entropy kernel is a p.d. kernel on X for all
β > 0:
\[
K(x, x') = e^{-\beta\, h\left( \frac{x + x'}{2} \right)} .
\]
Remark: only valid for densities (e.g., for a kernel density estimator
from a bag-of-parts)

283 / 635
Example 2: Semigroup kernels for finite measures (6/6)
Example: inverse generalized variance kernel
Let U = R^d and M_+^V(U) be the set of finite measures μ with second
order moment and non-singular variance
\[
\Sigma(\mu) = \mu\big[ x x^{\top} \big] - \mu[x]\, \mu[x]^{\top} .
\]
Then the following function is a p.d. kernel on M_+^V(U), called the
inverse generalized variance kernel:
\[
K(\mu, \mu') = \frac{1}{\det \Sigma\left( \frac{\mu + \mu'}{2} \right)} .
\]
Generalization possible with regularization and kernel trick.

284 / 635
Application of semigroup kernel

Weighted linear PCA of two different measures, with the first PC shown.
Variances captured by the first and second PC are shown. The
generalized variance kernel is the inverse of the product of the two values.

285 / 635
Kernelization of the IGV kernel
Motivations
Gaussian distributions may be poor models.
The method fails in large dimension

Solution
1 Regularization:
\[
K(\mu, \mu') = \frac{1}{\det\left( \Sigma\left( \frac{\mu + \mu'}{2} \right) + \lambda I_d \right)} .
\]
2 Kernel trick: the non-zero eigenvalues of UU^⊤ and U^⊤U are the
same =⇒ replace the covariance matrix by the centered Gram
matrix (technical details in Cuturi et al., 2005).

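A minimal sketch of the (regularized) IGV kernel for empirical measures, treating each bag as
a normalized measure given by an (n_i, d) array of points with uniform weights; the function
name and the default λ are ours, not from Cuturi et al. (2005).

import numpy as np

def igv_kernel(X1, X2, lam=0.0):
    # Inverse generalized variance kernel between two empirical measures on R^d.
    # lam > 0 gives the regularized variant 1 / det(Sigma + lam * I) shown above.
    def moments(X):
        return X.mean(axis=0), (X.T @ X) / len(X)        # mu[x], mu[x x^T]
    m1, M1 = moments(np.asarray(X1, float))
    m2, M2 = moments(np.asarray(X2, float))
    m, M = 0.5 * (m1 + m2), 0.5 * (M1 + M2)              # moments of (mu + mu') / 2
    Sigma = M - np.outer(m, m)                           # Sigma((mu + mu') / 2)
    return 1.0 / np.linalg.det(Sigma + lam * np.eye(len(m)))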
286 / 635
Illustration of kernel IGV kernel

287 / 635
Semigroup kernel remarks
Motivations
A very general formalism to exploit an algebraic structure of the
data.
The kernelized IGV kernel has given good results for character recognition
from subsampled images.
The main motivation is more generally to develop kernels for
complex objects from which simple patches can be extracted.
The extension to nonabelian groups (e.g., permutation in the
symmetric group) might find natural applications.

288 / 635
Kernel examples: Summary
Many notions of smoothness can be translated as RKHS norms for
particular kernels (eigenvalues convolution operator, Sobolev norms
and Green operators, Fourier transforms...).
There is no uniformly best kernel, but rather a large toolbox of
methods and tricks to encode prior knowledge and exploit the
nature or structure of the data.
In the following sections we focus on particular data and
applications to illustrate the process of kernel design.

289 / 635
Outline

1 Kernels and RKHS

2 Kernel tricks

3 Kernel Methods: Supervised Learning

4 Kernel Methods: Unsupervised Learning

5 The Kernel Jungle


Green, Mercer, Herglotz, Bochner and friends
Kernels for probabilistic models
Kernels for biological sequences
Kernels for graphs
Kernels on graphs

6 Open Problems and Research Topics 290 / 635


Motivation
Kernel methods are sometimes criticized for their lack of flexibility: a
large effort is spent in designing by hand the kernel.
Question
How do we design a kernel adapted to the data?

Answer
A successful strategy is given by kernels for generative models, which
are/have been the state of the art in many fields, including
representation of image and sequence data representation.

Parametric model
A model is a family of distributions
\[
\{ P_{\theta},\ \theta \in \Theta \subset R^m \} \subset M_1^+(\mathcal{X}) .
\]

291 / 635
Outline

5 The Kernel Jungle


Green, Mercer, Herglotz, Bochner and friends
Kernels for probabilistic models
Fisher kernel
Mutual information kernels
Marginalized kernels
Kernels for biological sequences
Kernels for graphs
Kernels on graphs

292 / 635
Fisher kernel
Definition
Fix a parameter θ_0 ∈ Θ (obtained for instance by maximum
likelihood over a training set).
For each sequence x, compute the Fisher score vector:
\[
\Phi_{\theta_0}(x) = \nabla_{\theta} \log P_{\theta}(x) \big|_{\theta = \theta_0} ,
\]
which can be interpreted as the local contribution of each parameter.
Form the kernel (Jaakkola et al., 2000):
\[
K(x, x') = \Phi_{\theta_0}(x)^{\top}\, I(\theta_0)^{-1}\, \Phi_{\theta_0}(x') ,
\]
where I(θ_0) = E[ Φ_{θ_0}(x) Φ_{θ_0}(x)^⊤ ] is the Fisher information matrix.
Note: when θ_0 is the ML estimator, E[Φ_{θ_0}(x)] = 0 and I(θ_0) is a
covariance matrix.

293 / 635
Fisher kernel properties (1/2)
The Fisher score describes how each parameter contributes to the
process of generating a particular example
A kernel classifier employing the Fisher kernel derived from a model
that contains the label as a latent variable is, asymptotically, at least
as good as the MAP labelling based on the model (Jaakkola and
Haussler, 1999).
A variant of the Fisher kernel (called the Tangent of Posterior
kernel) can also improve over the direct posterior classification by
helping to correct the effect of estimation errors in the parameter
(Tsuda et al., 2002).

294 / 635
Fisher kernel properties (2/2)
Lemma
The Fisher kernel is invariant under change of parametrization.

Consider indeed a different parametrization given by some
diffeomorphism λ = f(θ), and denote by J the Jacobian matrix
[J]_{ij} = ∂θ_j / ∂λ_i relating the two parametrizations.
The gradient of the log-likelihood w.r.t. the new parameters is
\[
\Phi_{\lambda_0}(x) = \nabla_{\lambda} \log P_{\lambda_0}(x) = J\, \nabla_{\theta} \log P_{\theta_0}(x) = J\, \Phi_{\theta_0}(x) .
\]
The Fisher information matrix is
\[
I(\lambda_0) = E\big[ \Phi_{\lambda_0}(x)\, \Phi_{\lambda_0}(x)^{\top} \big] = J\, I(\theta_0)\, J^{\top} .
\]
We conclude by noticing that I(λ_0)^{-1} = J^{-⊤} I(θ_0)^{-1} J^{-1}:
\[
K(x, x') = \Phi_{\lambda_0}(x)^{\top} I(\lambda_0)^{-1} \Phi_{\lambda_0}(x') = \Phi_{\theta_0}(x)^{\top} I(\theta_0)^{-1} \Phi_{\theta_0}(x') .
\]




295 / 635
Fisher kernel in practice
Φ_{θ_0}(x) can be computed explicitly for many models (e.g., HMMs),
where the model is first estimated from data.
I(θ_0) is often replaced by the identity matrix for simplicity.
Several different models (i.e., different θ_0) can be trained and
combined.
The Fisher vectors are defined as Ψ_{θ_0}(x) = I(θ_0)^{-1/2} Φ_{θ_0}(x). They
are explicitly computed and correspond to an explicit embedding:
K(x, x') = Ψ_{θ_0}(x)^⊤ Ψ_{θ_0}(x').

296 / 635
Fisher kernels: example with Gaussian data model (1/2)
Consider a normal distribution N(μ, σ²) and denote by λ = 1/σ² the
inverse variance, i.e., the precision parameter. With θ = (μ, λ), we have
\[
\log P_{\theta}(x) = \frac{1}{2} \log \lambda - \frac{1}{2} \log(2\pi) - \frac{\lambda}{2} (x - \mu)^2 ,
\]
and thus
\[
\frac{\partial \log P_{\theta}(x)}{\partial \mu} = \lambda (x - \mu), \qquad
\frac{\partial \log P_{\theta}(x)}{\partial \lambda} = \frac{1}{2}\Big( \frac{1}{\lambda} - (x - \mu)^2 \Big) ,
\]
and (exercise)
\[
I(\theta) = \begin{pmatrix} \lambda & 0 \\ 0 & 1/(2\lambda^2) \end{pmatrix} .
\]
The Fisher vector is then
\[
\Psi_{\theta}(x) = \begin{pmatrix} (x - \mu)/\sigma \\ (1/\sqrt{2})\big( 1 - (x - \mu)^2/\sigma^2 \big) \end{pmatrix} .
\]

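A small sketch of these formulas in code (the function names are ours); it computes the
normalized Fisher vector Ψ_θ(x) of a single observation and the corresponding kernel value.

import numpy as np

def gaussian_fisher_vector(x, mu, sigma):
    # Psi_theta(x) = I(theta)^(-1/2) Phi_theta(x) for the N(mu, sigma^2) model
    # in the (mu, lambda = 1/sigma^2) parametrization, as derived above.
    return np.array([
        (x - mu) / sigma,
        (1.0 - (x - mu) ** 2 / sigma ** 2) / np.sqrt(2.0),
    ])

def gaussian_fisher_kernel(x, y, mu, sigma):
    # K(x, y) = Psi_theta(x)^T Psi_theta(y)
    return gaussian_fisher_vector(x, mu, sigma) @ gaussian_fisher_vector(y, mu, sigma)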
297 / 635
Fisher kernels: example with Gaussian data model (2/2)
Now consider an i.i.d. data model over a set of data points x_1, ..., x_n, all
distributed according to N(μ, σ²):
\[
P_{\theta}(x_1, \ldots, x_n) = \prod_{i=1}^{n} P_{\theta}(x_i) .
\]
Then, the Fisher vector is given by the sum of the Fisher vectors of the
points.
It encodes the discrepancy of the first and second order moments of
the data w.r.t. those of the model:
\[
\Psi_{\theta}(x_1, \ldots, x_n) = \sum_{i=1}^{n} \Psi_{\theta}(x_i)
= n \begin{pmatrix} (\hat{\mu} - \mu)/\sigma \\ (\sigma^2 - \hat{\sigma}^2)/(\sqrt{2}\, \sigma^2) \end{pmatrix} ,
\]
where
\[
\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} x_i
\quad \text{and} \quad
\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2 .
\]
298 / 635
Application: Aggregation of visual words (1/4)
Patch extraction and description stage:
In various contexts, images may be described as a set of
patches x1 , . . . , xn computed at interest points. For example, SIFT,
HOG, LBP, color histograms, convolutional features...
Coding stage: The set of patches is then encoded into a single
representation (xi ), typically in a high-dimensional space.
Pooling stage: For example, sum pooling
n
X
(x1 , . . . , xn ) = (xi ).
i=1

Fisher vectors with a Gaussian Mixture Model (GMM) is a simple


and effective aggregation technique [Perronnin and Dance, 2007].

299 / 635
Application: Aggregation of visual words (2/4)
Let θ = (π_j, μ_j, Σ_j)_{j=1,...,k} be the parameters of a GMM with k Gaussian
components. Then, the probabilistic model is given by
\[
P_{\theta}(x) = \sum_{j=1}^{k} \pi_j\, \mathcal{N}(x; \mu_j, \Sigma_j) .
\]
Remarks
Each mixture component corresponds to a visual word, with a mean,
variance, and mixing weight.
Diagonal covariances Σ_j = diag(σ²_{j1}, ..., σ²_{jp}) = diag(σ²_j) are often
used for simplicity.
This is a richer model than the traditional bag of words approach.
The probabilistic model is learned offline beforehand.

300 / 635
Application: Aggregation of visual words (3/4)
After cumbersome calculations (exercise), we obtain Ψ_θ(x_1, ..., x_n) =
\[
\big[ \Psi_{\alpha_1}(X), \ldots, \Psi_{\alpha_k}(X),\ \Psi_{\mu_1}(X)^{\top}, \ldots, \Psi_{\mu_k}(X)^{\top},\ \Psi_{\sigma_1}(X)^{\top}, \ldots, \Psi_{\sigma_k}(X)^{\top} \big]^{\top} ,
\]
with
\[
\Psi_{\mu_j}(X) = \frac{1}{n \sqrt{\pi_j}} \sum_{i=1}^{n} \gamma_{ij}\, (x_i - \mu_j)/\sigma_j ,
\qquad
\Psi_{\sigma_j}(X) = \frac{1}{n \sqrt{2 \pi_j}} \sum_{i=1}^{n} \gamma_{ij} \big[ (x_i - \mu_j)^2/\sigma_j^2 - 1 \big] ,
\]
where, with an abuse of notation, the division between two vectors is
meant elementwise, and the scalars γ_{ij} can be interpreted as the
soft-assignment of word i to component j:
\[
\gamma_{ij} = \frac{\pi_j\, \mathcal{N}(x_i; \mu_j, \Sigma_j)}{\sum_{l=1}^{k} \pi_l\, \mathcal{N}(x_i; \mu_l, \Sigma_l)} .
\]

301 / 635
Application: Aggregation of visual words (4/4)
Finally, we also have the following interpretation of encoding first and
second-order statistics:
\[
\Psi_{\mu_j}(X) = \frac{\hat{\pi}_j}{\sqrt{\pi_j}}\, (\hat{\mu}_j - \mu_j)/\sigma_j ,
\qquad
\Psi_{\sigma_j}(X) = \frac{\hat{\pi}_j}{\sqrt{2 \pi_j}}\, (\hat{\sigma}_j^2 - \sigma_j^2)/\sigma_j^2 ,
\]
with
\[
\hat{\pi}_j = \frac{1}{n} \sum_{i=1}^{n} \gamma_{ij} , \qquad
\hat{\mu}_j = \frac{1}{n \hat{\pi}_j} \sum_{i=1}^{n} \gamma_{ij}\, x_i , \qquad
\hat{\sigma}_j^2 = \frac{1}{n \hat{\pi}_j} \sum_{i=1}^{n} \gamma_{ij}\, (x_i - \mu_j)^2 .
\]
The component Ψ_α(X) is often dropped due to its negligible contribution
in practice, and the resulting representation is of dimension 2kp, where p
is the dimension of the x_i's.
is the dimension of the xi s.
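The following sketch implements these formulas directly, assuming diagonal covariances and
dropping the mixing-weight block as discussed above; function and variable names are ours.

import numpy as np

def gmm_fisher_vector(X, w, mu, var):
    # X: (n, p) local descriptors; w: (k,) mixture weights; mu: (k, p) means;
    # var: (k, p) diagonal variances. Returns a vector of dimension 2*k*p.
    n, p = X.shape
    k = len(w)
    # Soft assignments gamma[i, j] = pi_j N(x_i; mu_j, var_j) / sum_l pi_l N(x_i; mu_l, var_l)
    log_p = np.empty((n, k))
    for j in range(k):
        diff2 = (X - mu[j]) ** 2 / var[j]
        log_p[:, j] = np.log(w[j]) - 0.5 * (np.log(2 * np.pi * var[j]).sum() + diff2.sum(axis=1))
    log_p -= log_p.max(axis=1, keepdims=True)
    gamma = np.exp(log_p)
    gamma /= gamma.sum(axis=1, keepdims=True)
    # First- and second-order blocks Psi_mu_j and Psi_sigma_j
    psi_mu, psi_var = [], []
    for j in range(k):
        diff = (X - mu[j]) / np.sqrt(var[j])
        psi_mu.append((gamma[:, j:j + 1] * diff).sum(axis=0) / (n * np.sqrt(w[j])))
        psi_var.append((gamma[:, j:j + 1] * (diff ** 2 - 1)).sum(axis=0) / (n * np.sqrt(2 * w[j])))
    return np.concatenate(psi_mu + psi_var)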

302 / 635
Relation to classification with generative models (1/3)
Assume that we have a generative probabilistic model P_θ to model
random variables (X, Y), where Y is a label in {1, ..., p}.
Assume that the marginals P_θ(Y = k) = π_k are among the model
parameters θ, which we can also parametrize as
\[
P_{\theta}(Y = k) = \pi_k = \frac{e^{\theta_k}}{\sum_{k'=1}^{p} e^{\theta_{k'}}} .
\]
The classification of a new point x can be obtained via Bayes' rule:
\[
\hat{y}(x) = \operatorname*{argmax}_{k = 1, \ldots, p} P_{\theta}(Y = k \,|\, x) ,
\]
where P_θ(Y = k | x) is short for P_θ(Y = k | X = x) and
\[
P_{\theta}(Y = k \,|\, x) = P_{\theta}(x \,|\, Y = k)\, P_{\theta}(Y = k) / P_{\theta}(x)
= P_{\theta}(x \,|\, Y = k)\, \pi_k \Big/ \sum_{k'=1}^{p} P_{\theta}(x \,|\, Y = k')\, \pi_{k'} .
\]

303 / 635
Relation to classification with generative models (2/3)
Then, consider the Fisher score
\[
\nabla_{\theta} \log P_{\theta}(x) = \frac{1}{P_{\theta}(x)} \nabla_{\theta} P_{\theta}(x)
= \frac{1}{P_{\theta}(x)} \sum_{k=1}^{p} \nabla_{\theta} P_{\theta}(x, Y = k)
= \frac{1}{P_{\theta}(x)} \sum_{k=1}^{p} P_{\theta}(x, Y = k)\, \nabla_{\theta} \log P_{\theta}(x, Y = k)
= \sum_{k=1}^{p} P_{\theta}(Y = k \,|\, x)\, \big[ \nabla_{\theta} \log \pi_k + \nabla_{\theta} \log P_{\theta}(x \,|\, Y = k) \big] .
\]
In particular (exercise)
\[
\frac{\partial \log P_{\theta}(x)}{\partial \theta_k} = P_{\theta}(Y = k \,|\, x) - \pi_k .
\]
304 / 635
Relation to classification with generative models (3/3)
The first p elements of the Fisher score are given by the class posteriors
minus a constant:
\[
\Phi_{\theta}(x) = \big[ P_{\theta}(Y = 1 \,|\, x) - \pi_1, \ldots, P_{\theta}(Y = p \,|\, x) - \pi_p, \ldots \big] .
\]
Consider a multi-class linear classifier on Φ_θ(x) such that, for class k,
the weights w_k are zero except for a one in the k-th position;
the intercept b_k is π_k.
Then,
\[
\hat{y}(x) = \operatorname*{argmax}_{k = 1, \ldots, p} \Phi_{\theta}(x)^{\top} w_k + b_k
= \operatorname*{argmax}_{k = 1, \ldots, p} P_{\theta}(Y = k \,|\, x) .
\]
Bayes' rule is thus implemented by this simple linear classifier using the Fisher kernel.

305 / 635
Outline

5 The Kernel Jungle


Green, Mercer, Herglotz, Bochner and friends
Kernels for probabilistic models
Fisher kernel
Mutual information kernels
Marginalized kernels
Kernels for biological sequences
Kernels for graphs
Kernels on graphs

306 / 635
Mutual information kernels
Definition
Choose a prior w(dθ) on the measurable set Θ.
Form the kernel (Seeger, 2002):
\[
K(x, x') = \int_{\Theta} P_{\theta}(x)\, P_{\theta}(x')\, w(d\theta) .
\]
No explicit computation of a finite-dimensional feature vector.
K(x, x') = ⟨φ(x), φ(x')⟩_{L^2(w)} with
\[
\varphi(x) = \big( P_{\theta}(x) \big)_{\theta \in \Theta} .
\]

307 / 635
Example: coin toss
Let P_θ(X = 1) = θ and P_θ(X = 0) = 1 − θ be a model for a random coin
toss, with θ ∈ [0, 1].
Let dθ be the Lebesgue measure on [0, 1].
The mutual information kernel between x = 001 and x' = 1010 is:
\[
P_{\theta}(x) = \theta (1 - \theta)^2 , \qquad P_{\theta}(x') = \theta^2 (1 - \theta)^2 ,
\]
\[
K(x, x') = \int_{0}^{1} \theta^3 (1 - \theta)^4\, d\theta = \frac{3!\, 4!}{8!} = \frac{1}{280} .
\]

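A quick numerical cross-check of this example (a sketch, with our own function name), using
one-dimensional quadrature from scipy:

from scipy import integrate

def coin_mi_kernel(x, y):
    # Mutual information kernel for the coin-toss model P_theta(x) = theta^{#1s} (1-theta)^{#0s},
    # with the Lebesgue (uniform) prior on theta in [0, 1].
    def lik(s, t):
        return t ** s.count("1") * (1 - t) ** s.count("0")
    value, _ = integrate.quad(lambda t: lik(x, t) * lik(y, t), 0.0, 1.0)
    return value

print(coin_mi_kernel("001", "1010"))   # approximately 1/280 = 0.003571...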
308 / 635
Outline

5 The Kernel Jungle


Green, Mercer, Herglotz, Bochner and friends
Kernels for probabilistic models
Fisher kernel
Mutual information kernels
Marginalized kernels
Kernels for biological sequences
Kernels for graphs
Kernels on graphs

309 / 635
Marginalized kernels
Definition
For any observed data x ∈ X, let a latent variable y ∈ Y be
associated probabilistically through a conditional probability P_x(dy).
Let K_Z be a kernel for the complete data z = (x, y).
Then, the following kernel is a valid kernel on X, called a
marginalized kernel (Tsuda et al., 2002):
\[
K_X(x, x') := E_{P_x(dy) P_{x'}(dy')}\, K_Z(z, z')
= \int\!\!\int K_Z\big( (x, y), (x', y') \big)\, P_x(dy)\, P_{x'}(dy') .
\]

310 / 635
Marginalized kernels: proof of positive definiteness
K_Z is p.d. on Z. Therefore, there exists a Hilbert space H and
Φ_Z : Z → H such that:
\[
K_Z(z, z') = \langle \Phi_Z(z), \Phi_Z(z') \rangle_{H} .
\]
Marginalizing therefore gives:
\[
K_X(x, x') = E_{P_x(dy) P_{x'}(dy')}\, K_Z(z, z')
= E_{P_x(dy) P_{x'}(dy')} \langle \Phi_Z(z), \Phi_Z(z') \rangle_{H}
= \big\langle E_{P_x(dy)} \Phi_Z(z),\ E_{P_{x'}(dy')} \Phi_Z(z') \big\rangle_{H} ,
\]
therefore K_X is p.d. on X. □
Of course, we make the right assumptions such that each operation
above is valid, and all quantities are well defined.

311 / 635
Outline

1 Kernels and RKHS

2 Kernel tricks

3 Kernel Methods: Supervised Learning

4 Kernel Methods: Unsupervised Learning

5 The Kernel Jungle


Green, Mercer, Herglotz, Bochner and friends
Kernels for probabilistic models
Kernels for biological sequences
Kernels for graphs
Kernels on graphs

6 Open Problems and Research Topics 312 / 635


Outline

5 The Kernel Jungle


Green, Mercer, Herglotz, Bochner and friends
Kernels for probabilistic models
Kernels for biological sequences
Motivations and history of genomics
Kernels derived from large feature spaces
Kernels derived from generative models
Kernels derived from a similarity measure
Application to remote homology detection
Kernels for graphs
Kernels on graphs

313 / 635
Short history of genomics

1866 : Laws of heredity (Mendel)


1909 : Morgan and the drosophilists
1944 : DNA supports heredity (Avery)
1953 : Structure of DNA (Crick and Watson)
1966 : Genetic code (Nirenberg)
1960-70 : Genetic engineering
1977 : Method for sequencing (Sanger)
1982 : Creation of Genbank
1990 : Human genome project launched
2003 : Human genome project completed

314 / 635
A cell

315 / 635
Chromosomes

316 / 635
Chromosomes and DNA

317 / 635
Structure of DNA

We wish to suggest a
structure for the salt of
desoxyribose nucleic acid
(D.N.A.). This structure have
novel features which are of
considerable biological
interest (Watson and Crick,
1953)

318 / 635
The double helix

319 / 635
Central dogma

320 / 635
Proteins

321 / 635
Genetic code

322 / 635
Human genome project
Goal : sequence the 3,000,000,000 bases of the human genome
Consortium with 20 labs, 6 countries
Cost : between 0.5 and 1 billion USD

323 / 635
2003: End of genomics era

Findings
About 25,000 genes only (representing 1.2% of the genome).
Automatic gene finding with graphical models.
97% of the genome is considered junk DNA.
Superposition of a variety of signals (many to be discovered).

324 / 635
Cost of human genome sequencing

325 / 635
Protein sequence

A : Alanine V : Valine L : Leucine


F : Phenylalanine P : Proline M : Methionine
E : Glutamic acid K : Lysine R : Arginine
T : Threonine C : Cysteine N : Asparagine
H : Histidine Y : Tyrosine W : Tryptophane
I : Isoleucine S : Serine Q : Glutamine
D : Aspartic acid G : Glycine

326 / 635
Challenges with protein sequences
A protein sequences can be seen as a variable-length sequence over
the 20-letter alphabet of amino-acids, e.g., insuline:
FVNQHLCGSHLVEALYLVCGERGFFYTPKA
These sequences are produced at a fast rate (result of the
sequencing programs)
Need for algorithms to compare, classify, analyze these sequences
Applications: classification into functional or structural classes,
prediction of cellular localization and interactions, ...

327 / 635
Example: supervised sequence classification
Data (training)
Secreted proteins:
MASKATLLLAFTLLFATCIARHQQRQQQQNQCQLQNIEA...
MARSSLFTFLCLAVFINGCLSQIEQQSPWEFQGSEVW...
MALHTVLIMLSLLPMLEAQNPEHANITIGEPITNETLGWL...
...
Non-secreted proteins:
MAPPSVFAEVPQAQPVLVFKLIADFREDPDPRKVNLGVG...
MAHTLGLTQPNSTEPHKISFTAKEIDVIEWKGDILVVG...
MSISESYAKEIKTAFRQFTDFPIEGEQFEDFLPIIGNP..
...

Goal
Build a classifier to predict whether new proteins are secreted or not.

328 / 635
Supervised classification with vector embedding
The idea
Map each string x X to a vector (x) F.
Train a classifier for vectors on the images (x1 ), . . . , (xn ) of the
training set (nearest neighbor, linear perceptron, logistic regression,
support vector machine...)


X F
maskat...
msises
marssl...
malhtv...
mappsv...
mahtlg...

329 / 635
Kernels for protein sequences
Kernel methods have been widely investigated since Jaakkola et al.s
seminal paper (1998).
What is a good kernel?
it should be mathematically valid (symmetric, p.d. or c.p.d.)
fast to compute
adapted to the problem (gives good performances)

330 / 635
Kernel engineering for protein sequences

Define a (possibly high-dimensional) feature space of interest


Physico-chemical kernels
Spectrum, mismatch, substring kernels
Pairwise, motif kernels
Derive a kernel from a generative model
Fisher kernel
Mutual information kernel
Marginalized kernel
Derive a kernel from a similarity measure
Local alignment kernel

331 / 635
Outline

5 The Kernel Jungle


Green, Mercer, Herglotz, Bochner and friends
Kernels for probabilistic models
Kernels for biological sequences
Motivations and history of genomics
Kernels derived from large feature spaces
Kernels derived from generative models
Kernels derived from a similarity measure
Application to remote homology detection
Kernels for graphs
Kernels on graphs

332 / 635
Vector embedding for strings
The idea
Represent each sequence x by a fixed-length numerical vector
Φ(x) ∈ R^n. How to perform this embedding?

Physico-chemical kernel
Extract relevant features, such as:
length of the sequence
time series analysis of numerical physico-chemical properties of
amino-acids along the sequence (e.g., polarity, hydrophobicity),
using for example:
Fourier transforms (Wang et al., 2004)
Autocorrelation functions (Zhang et al., 2003)
\[
r_j = \frac{1}{n - j} \sum_{i=1}^{n - j} h_i\, h_{i+j}
\]

333 / 635
Substring indexation
The approach
Alternatively, index the feature space by fixed-length strings, i.e.,
\[
\Phi(x) = \big( \Phi_u(x) \big)_{u \in A^k}
\]
where Φ_u(x) can be:


the number of occurrences of u in x (without gaps) : spectrum
kernel (Leslie et al., 2002)
the number of occurrences of u in x up to m mismatches (without
gaps) : mismatch kernel (Leslie et al., 2004)
the number of occurrences of u in x allowing gaps, with a weight
decaying exponentially with the number of gaps : substring kernel
(Lohdi et al., 2002)

334 / 635
Example: Spectrum kernel (1/4)
Kernel definition
The 3-spectrum of
x = CGGSLIAMMWFGV
is:
(CGG,GGS,GSL,SLI,LIA,IAM,AMM,MMW,MWF,WFG,FGV) .
Let Φ_u(x) denote the number of occurrences of u in x. The
k-spectrum kernel is:
\[
K(x, x') := \sum_{u \in A^k} \Phi_u(x)\, \Phi_u(x') .
\]

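A minimal sketch of this kernel in Python (function name ours), exploiting the fact that only
k-mers actually present in the sequences contribute to the sum:

from collections import Counter

def spectrum_kernel(x, y, k=3):
    # k-spectrum kernel: dot product of the k-mer count vectors of x and y.
    phi_x = Counter(x[i:i + k] for i in range(len(x) - k + 1))
    phi_y = Counter(y[i:i + k] for i in range(len(y) - k + 1))
    return sum(c * phi_y[u] for u, c in phi_x.items() if u in phi_y)

print(spectrum_kernel("CGGSLIAMMWFGV", "CLIVMMNRLMWFGV", k=3))   # 3 shared 3-mers: MWF, WFG, FGV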
335 / 635
Example: Spectrum kernel (2/4)
Implementation
The computation of the kernel is formally a sum over |A|^k terms,
but at most |x| − k + 1 terms are non-zero in Φ(x) =⇒
computation in O(|x| + |x'|) with pre-indexation of the strings.
Fast classification of a new sequence x in O(|x|):
\[
f(x) = \langle w, \Phi(x) \rangle = \sum_{u} w_u\, \Phi_u(x) = \sum_{i=1}^{|x| - k + 1} w_{x_i \ldots x_{i+k-1}} .
\]

Remarks
Work with any string (natural language, time series...)
Fast and scalable, a good default method for string classification.
Variants allow matching of k-mers up to m mismatches.

336 / 635
Example: Spectrum kernel (3/4)
If pre-indexation is not possible: retrieval tree (trie)

The complexity for computing K (x, x0 ) becomes O(k(|x| + |x0 |)).

337 / 635
Example: Spectrum kernel (4/4)
If pre-indexation is not possible: use a prefix tree

The complexity for computing K (x, x0 ) becomes O(|x| + |x0 |), but with a
larger constant than with pre-indexation.
338 / 635
Example 2: Substring kernel (1/12)
Definition
For 1 ≤ k ≤ n ∈ N, we denote by I(k, n) the set of sequences of
indices i = (i_1, ..., i_k), with 1 ≤ i_1 < i_2 < ... < i_k ≤ n.
For a string x = x_1 ... x_n ∈ X of length n and a sequence of indices
i ∈ I(k, n), we define a substring as:
\[
x(i) := x_{i_1} x_{i_2} \ldots x_{i_k} .
\]
The length of the substring is:
\[
l(i) = i_k - i_1 + 1 .
\]

339 / 635
Example 2: Substring kernel (2/12)
Example

ABRACADABRA
i = (3, 4, 7, 8, 10)
x(i) = RADAR
l(i) = 10 − 3 + 1 = 8

340 / 635
Example 2: Substring kernel (3/12)
The kernel
Let k ∈ N and λ ∈ R_+ be fixed. For all u ∈ A^k, let Φ_u : X → R be
defined by:
\[
\forall x \in X, \quad \Phi_u(x) = \sum_{i \in I(k, |x|) \,:\, x(i) = u} \lambda^{l(i)} .
\]
The substring kernel is the p.d. kernel defined by:
\[
\forall (x, x') \in X^2, \quad K_{k,\lambda}(x, x') = \sum_{u \in A^k} \Phi_u(x)\, \Phi_u(x') .
\]

341 / 635
Example 2: Substring kernel (4/12)
Example

u:        ca   ct   at   ba   bt   cr   ar   br
Φ_u(cat): λ²   λ³   λ²   0    0    0    0    0
Φ_u(car): λ²   0    0    0    0    λ³   λ²   0
Φ_u(bat): 0    0    λ²   λ²   λ³   0    0    0
Φ_u(bar): 0    0    0    λ²   0    0    λ²   λ³

K(cat,cat) = K(car,car) = 2λ⁴ + λ⁶
K(cat,car) = λ⁴
K(cat,bar) = 0

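The example above can be checked with a brute-force sketch of the feature map (names ours);
it enumerates index tuples directly, which is only practical for short strings, whereas the
recursion on the following slides gives an O(k|x||x'|) algorithm.

from collections import defaultdict
from itertools import combinations

def substring_features(x, k, lam):
    # Phi_u(x) = sum over index tuples i with x(i) = u of lam^(i_k - i_1 + 1).
    phi = defaultdict(float)
    for idx in combinations(range(len(x)), k):
        u = "".join(x[i] for i in idx)
        phi[u] += lam ** (idx[-1] - idx[0] + 1)
    return phi

def substring_kernel(x, y, k, lam):
    phi_x, phi_y = substring_features(x, k, lam), substring_features(y, k, lam)
    return sum(v * phi_y[u] for u, v in phi_x.items() if u in phi_y)

lam = 0.5
print(substring_kernel("cat", "cat", 2, lam), 2 * lam**4 + lam**6)   # both 0.140625
print(substring_kernel("cat", "car", 2, lam), lam**4)                # both 0.0625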
342 / 635
Example 2: Substring kernel (5/12)
Kernel computation
We need to compute, for any pair x, x' ∈ X, the kernel:
\[
K_{k,\lambda}(x, x') = \sum_{u \in A^k} \Phi_u(x)\, \Phi_u(x')
= \sum_{u \in A^k} \sum_{i : x(i) = u} \sum_{i' : x'(i') = u} \lambda^{l(i) + l(i')} .
\]
Enumerating the substrings is too slow (of order |x|^k).

343 / 635
Example 2: Substring kernel (6/12)
Kernel computation (cont.)
For u ∈ A^k remember that:
\[
\Phi_u(x) = \sum_{i : x(i) = u} \lambda^{i_k - i_1 + 1} .
\]
Let now:
\[
\Psi_u(x) = \sum_{i : x(i) = u} \lambda^{|x| - i_1 + 1} .
\]

344 / 635
Example 2: Substring kernel (7/12)
Kernel computation (cont.)
Let us note x[1,j] = x1 . . . xj . A simple rewriting shows that, if we note
a A the last letter of u (u = va):
X 
va (x) = v x[1,j1] ,
j[1,| x |]:xj =a

and X
v x[1,j1] | x |j+1 .

va (x) =
j[1,| x |]:xj =a

345 / 635
Example 2: Substring kernel (8/12)
Kernel computation (cont.)
Moreover we observe that if the string is of the form xa (i.e., the last
letter is a A), then:
If the last letter of u is not a:
(
u (xa) = u (x) ,
u (xa) = u (x) .

If the last letter of u is a (i.e., u = va with v Ak1 ):


(
va (xa) = va (x) + v (x) ,
va (xa) = va (x) + v (x) .

346 / 635
Example 2: Substring kernel (9/12)
Kernel computation (cont.)
Let us now show how the function:
X
Bk x, x0 := u (x) u x0
 

uAk

and the kernel: X


Kk x, x0 := u (x) u x0
 

uAk

can be computed recursively. We note that:


(
B0 (x, x0 ) = K0 (x, x0 ) = 0 for all x, x0
Bk (x, x0 ) = Kk (x, x0 ) = 0 if min (| x | , | x0 |) < k

347 / 635
Example 2: Substring kernel (10/12)
Recursive computation of Bk

Bk xa, x0

X
u (xa) u x0

=
uAk
X X
u (x) u x0 + v (x) va x0
 
=
uAk vAk1
0

= Bk x, x +

X X   0
v (x) v x0[1,j1] | x |j+1

vAk1 j[1,| x0 |]:xj0 =a


X   0
= Bk x, x0 + Bk1 x, x0[1,j1] | x |j+2


j[1,| x0 |]:xj0 =a

348 / 635
Example 2: Substring kernel (11/12)
Recursive computation of Bk

Bk xa, x0 b

X   0
= Bk x, x0 b + Bk1 x, x0[1,j1] | x |j+2


j[1,| x0 |]:xj0 =a

+ a=b Bk1 (x, x0 )2


x, x0 b + (Bk (xa, x0 ) Bk (x, x0 )) + a=b Bk1 (x, x0 )2

= Bk
x, x0 b + Bk (xa, x0 ) 2 Bk (x, x0 ) + a=b Bk1 (x, x0 )2 .

= Bk

The dynamic programming table can be filled in O(k|x||x0 |) operations.

349 / 635
Example 2: Substring kernel (12/12)
Recursive computation of Kk

Kk xa, x0

X
u (xa) u x0

=
uAk
X X
u (x) u x0 + v (x) va x0
 
=
uAk vAk1
0

= Kk x, x +

X X  
v (x) v x0[1,j1]
vAk1 j[1,| x0 |]:xj0 =a
X  
= Kk x, x0 + 2 Bk1 x, x0[1,j1]


j[1,| x0 |]:xj0 =a

350 / 635
Summary: Substring indexation

Implementation in O(|x| + |x0 |) in memory and time for the


spectrum and mismatch kernels (with suffix trees)
Implementation in O(k(|x| + |x0 |)) in memory and time for the
spectrum and mismatch kernels (with tries)
Implementation in O(k|x| |x0 |) in memory and time for the
substring kernels
The feature space has high dimension (|A|k ), so learning requires
regularized methods (such as SVM)

351 / 635
Dictionary-based indexation
The approach
Choose a dictionary of sequences D = (x_1, x_2, ..., x_n)
Choose a measure of similarity s(x, x')
Define the mapping Φ_D(x) = (s(x, x_i))_{x_i ∈ D}

Examples
This includes:
Motif kernels (Logan et al., 2001): the dictionary is a library of
motifs, the similarity function is a matching function
Pairwise kernel (Liao & Noble, 2003): the dictionary is the training
set, the similarity is a classical measure of similarity between
sequences.

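In code, this embedding is a one-line sketch (names ours); any sequence similarity score, for
instance the spectrum kernel of the previous section, can be plugged in as s:

def dictionary_embedding(x, dictionary, sim):
    # Phi_D(x) = (s(x, x_i))_{x_i in D} for an arbitrary similarity function s.
    return [sim(x, xi) for xi in dictionary]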
352 / 635
Outline

5 The Kernel Jungle


Green, Mercer, Herglotz, Bochner and friends
Kernels for probabilistic models
Kernels for biological sequences
Motivations and history of genomics
Kernels derived from large feature spaces
Kernels derived from generative models
Kernels derived from a similarity measure
Application to remote homology detection
Kernels for graphs
Kernels on graphs

353 / 635
Probabilistic models for sequences
Probabilistic modeling of biological sequences is older than kernel
designs. Important models include HMM for protein sequences, SCFG for
RNA sequences.

Recall: parametric model


A model is a family of distributions
\[
\{ P_{\theta},\ \theta \in \Theta \subset R^m \} \subset M_1^+(\mathcal{X}) .
\]
354 / 635
Context-tree model
Definition
A context-tree model is a variable-memory Markov chain:
n
Y
PD, (x) = PD, (x1 . . . xD ) PD, (xi | xiD . . . xi1 )
i=D+1

D is a suffix tree
D is a set of conditional probabilities (multinomials)

355 / 635
Context-tree model: example

P(AABACBACC ) = P(AAB)AB (A)A (C )C (B)ACB (A)A (C )C (A) .

356 / 635
The context-tree kernel
Theorem (Cuturi et al., 2005)
For particular choices of priors, the context-tree kernel:
Z
0
 X
K x, x = PD, (x)PD, (x0 )w (d|D)(D)
D D

can be computed in O(|x| + |x0 |) with a variant of the Context-Tree


Weighting algorithm.
This is a valid mutual information kernel.
The similarity is related to information-theoretical measure of
mutual information between strings.

357 / 635
Marginalized kernels
Recall: Definition
For any observed data x X , let a latent variable y Y be
associated probabilistically through a conditional probability Px (dy).
Let KZ be a kernel for the complete data z = (x, y)
Then the following kernel is a valid kernel on X , called a
marginalized kernel (Tsuda et al., 2002):

KX x, x0 := EPx (dy)Px0 (dy0 ) KZ z, z0


 
Z Z
KZ (x, y) , x0 , y0 Px (dy) Px0 dy0 .
 
=

358 / 635
Example: HMM for normal/biased coin toss
0.85

N 0.05
0.5
0.1 E Normal (N) and biased (B)
S 0.1 coins (not observed)
B
0.5 0.05
0.85
Observed output are 0/1 with probabilities:
(
(0|N) = 1 (1|N) = 0.5,
(0|B) = 1 (1|B) = 0.2.

Example of realization (complete data):


NNNNNBBBBBBBBBNNNNNNNNNNNBBBBBB
1001011101111010010111001111011
359 / 635
1-spectrum kernel on complete data
If both x ∈ A^* and y ∈ S^* were observed, we might rather use the
1-spectrum kernel on the complete data z = (x, y):
\[
K_Z(z, z') = \sum_{(a, s) \in A \times S} n_{a,s}(z)\, n_{a,s}(z') ,
\]
where n_{a,s}(x, y) for a = 0, 1 and s = N, B is the number of
occurrences of s in y which emit a in x.
Example:
z  = 1001011101111010010111001111011,
z' = 0011010110011111011010111101100101,
\[
K_Z(z, z') = n_{1,N}(z)\, n_{1,N}(z') + n_{1,B}(z)\, n_{1,B}(z') + n_{0,N}(z)\, n_{0,N}(z') + n_{0,B}(z)\, n_{0,B}(z')
= 7 \times 15 + 9 \times 12 + 13 \times 6 + 2 \times 1 = 293 .
\]

360 / 635
1-spectrum marginalized kernel on observed data
The marginalized kernel for observed data is:
\[
K_X(x, x') = \sum_{y, y'} K_Z\big( (x, y), (x', y') \big)\, P(y | x)\, P(y' | x')
= \sum_{(a, s) \in A \times S} \Phi_{a,s}(x)\, \Phi_{a,s}(x') ,
\]
with
\[
\Phi_{a,s}(x) = \sum_{y} P(y | x)\, n_{a,s}(x, y) .
\]

361 / 635
Computation of the 1-spectrum marginalized kernel
\[
\Phi_{a,s}(x) = \sum_{y} P(y | x)\, n_{a,s}(x, y)
= \sum_{y} P(y | x) \Big\{ \sum_{i=1}^{n} \delta(x_i, a)\, \delta(y_i, s) \Big\}
= \sum_{i=1}^{n} \delta(x_i, a) \sum_{y} P(y | x)\, \delta(y_i, s)
= \sum_{i=1}^{n} \delta(x_i, a)\, P(y_i = s | x) ,
\]
and P(y_i = s | x) can be computed efficiently by the forward-backward
algorithm!

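A sketch of this computation (names ours), assuming the per-position posteriors
P(y_i = s | x) have already been obtained, e.g. with the HMM forward-backward algorithm:

import numpy as np

def marginalized_1spectrum_phi(x, posterior, alphabet):
    # Phi_{a,s}(x) = sum_i 1(x_i = a) * P(y_i = s | x).
    # posterior: (len(x), n_states) array of P(y_i = s | x), assumed given.
    phi = np.zeros((len(alphabet), posterior.shape[1]))
    for i, a in enumerate(x):
        phi[alphabet.index(a)] += posterior[i]
    return phi

def marginalized_1spectrum_kernel(x1, post1, x2, post2, alphabet):
    phi1 = marginalized_1spectrum_phi(x1, post1, alphabet)
    phi2 = marginalized_1spectrum_phi(x2, post2, alphabet)
    return float((phi1 * phi2).sum())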
362 / 635
HMM example (DNA)

363 / 635
HMM example (protein)

364 / 635
SCFG for RNA sequences

SFCG rules
S SS
S aSa
S aS
S a

Marginalized kernel (Kin et al., 2002)


Feature: number of occurrences of each (base,state) combination
Marginalization using classical inside/outside algorithm

365 / 635
Marginalized kernels in practice
Examples
Spectrum kernel on the hidden states of a HMM for protein
sequences (Tsuda et al., 2002)
Kernels for RNA sequences based on SCFG (Kin et al., 2002)
Kernels for graphs based on random walks on graphs (Kashima et
al., 2004)
Kernels for multiple alignments based on phylogenetic models (Vert
et al., 2006)

366 / 635
Marginalized kernels: example

PC2 A set of 74 human tRNA


sequences is analyzed using a
kernel for sequences (the
second-order marginalized
kernel based on SCFG). This
PC1
set of tRNAs contains three
classes, called Ala-AGC (white
circles), Asn-GTT (black
circles) and Cys-GCA (plus
symbols) (from Tsuda et al.,
2002).

367 / 635
Outline

5 The Kernel Jungle


Green, Mercer, Herglotz, Bochner and friends
Kernels for probabilistic models
Kernels for biological sequences
Motivations and history of genomics
Kernels derived from large feature spaces
Kernels derived from generative models
Kernels derived from a similarity measure
Application to remote homology detection
Kernels for graphs
Kernels on graphs

368 / 635
Sequence alignment
Motivation
How to compare 2 sequences?

x1 = CGGSLIAMMWFGV
x2 = CLIVMMNRLMWFGV

Find a good alignment:

CGGSLIAMM------WFGV
|...|||||....||||
C-----LIVMMNRLMWFGV

369 / 635
Alignment score
In order to quantify the relevance of an alignment , define:
a substitution matrix S RAA
a gap penalty function g : N R
Any alignment is then scored as follows

CGGSLIAMM------WFGV
|...|||||....||||
C----LIVMMNRLMWFGV

\[
s_{S,g}(\pi) = S(C,C) + S(L,L) + S(I,I) + S(A,V) + 2\, S(M,M)
+ S(W,W) + S(F,F) + S(G,G) + S(V,V) - g(3) - g(4)
\]

370 / 635
Local alignment kernel
Smith-Waterman score (Smith and Waterman, 1981)
The widely-used Smith-Waterman local alignment score is defined
by:
\[
SW_{S,g}(x, y) := \max_{\pi \in \Pi(x, y)} s_{S,g}(\pi) .
\]
It is symmetric, but not positive definite...

LA kernel (Saigo et al., 2004)
The local alignment kernel:
\[
K_{LA}^{(\beta)}(x, y) = \sum_{\pi \in \Pi(x, y)} \exp\big( \beta\, s_{S,g}(x, y, \pi) \big) ,
\]
is symmetric positive definite.

371 / 635
LA kernel is p.d.: proof (1/11)
Lemma
If K1 and K2 are p.d. kernels, then:

K1 + K2 ,
K1 K2 , and
cK1 , for c 0,

are also p.d. kernels


If (Ki )i1 is a sequence of p.d. kernels that converges pointwisely to
a function K :

x, x0 X 2 , K x, x0 = lim Ki x, x0 ,
  
n

then K is also a p.d. kernel.

372 / 635
LA kernel is p.d.: proof (2/11)
Proof of lemma
Let A and B be n × n positive semidefinite matrices. By diagonalization
of A:
\[
A_{i,j} = \sum_{p=1}^{n} f_p(i)\, f_p(j)
\]
for some vectors f_1, ..., f_n. Then, for any α ∈ R^n:
\[
\sum_{i,j=1}^{n} \alpha_i \alpha_j\, A_{i,j} B_{i,j} = \sum_{p=1}^{n} \sum_{i,j=1}^{n} \big( \alpha_i f_p(i) \big) \big( \alpha_j f_p(j) \big) B_{i,j} \geq 0 .
\]
The matrix C_{i,j} = A_{i,j} B_{i,j} is therefore p.d. Other properties are obvious
from the definition. □

373 / 635
LA kernel is p.d.: proof (3/11)
Lemma (direct sum and product of kernels)
Let X = X1 X2 . Let K1 be a p.d. kernel on X1 , and K2 be a p.d.
kernel on X2 . Then the following functions are p.d. kernels on X :
the direct sum,

K ((x1 , x2 ) , (y1 , y2 )) = K1 (x1 , y1 ) + K2 (x2 , y2 ) ,

The direct product:

K ((x1 , x2 ) , (y1 , y2 )) = K1 (x1 , y1 ) K2 (x2 , y2 ) .

374 / 635
LA kernel is p.d.: proof (4/11)
Proof of lemma
If K1 is a p.d. kernel, let 1 : X1 7 H be such that:

K1 (x1 , y1 ) = h1 (x1 ) , 1 (y1 )iH .

Let : X1 X2 H be defined by:

((x1 , x2 )) = 1 (x1 ) .

Then for x = (x1 , x2 ) and y = (y1 , y2 ) X , we get

h ((x1 , x2 )) , ((y1 , y2 ))iH = K1 (x1 , x2 ) ,

which shows that K (x, y) := K1 (x1 , y1 ) is p.d. on X1 X2 . The lemma


follows from the properties of sums and products of p.d. kernels. 

375 / 635
LA kernel is p.d.: proof (5/11)
Lemma: kernel for sets
Let K be a p.d. kernel on X , and let P (X ) be the set of finite subsets of
X . Then the function KP on P (X ) P (X ) defined by:
XX
A, B P (X ) , KP (A, B) := K (x, y)
xA yB

is a p.d. kernel on P (X ).

376 / 635
LA kernel is p.d.: proof (6/11)
Proof of lemma
Let : X 7 H be such that

K (x, y) = h (x) , (y)iH .

Then, for A, B P (X ), we get:


XX
KP (A, B) = h (x) , (y)iH
xA yB
* +
X X
= (x) , (y)
xA yB H
= hP (A), P (B)iH ,
P
with P (A) := xA (x). 

377 / 635
LA kernel is p.d.: proof (7/11)
Definition: Convolution kernel (Haussler, 1999)
Let K1 and K2 be two p.d. kernels for strings. The convolution of K1
and K2 , denoted K1 ? K2 , is defined for any x, x0 X by:
X
K1 ? K2 (x, y) := K1 (x1 , y1 )K2 (x2 , y2 ).
x1 x2 =x,y1 y2 =y

Lemma
If K1 and K2 are p.d. then K1 ? K2 is p.d..

378 / 635
LA kernel is p.d.: proof (8/11)
Proof of lemma
Let X be the set of finite-length strings. For x X , let

R (x) = {(x1 , x2 ) X X : x = x1 x2 } X X .

We can then write


X X
K1 ? K2 (x, y) = K1 (x1 , y1 )K2 (x2 , y2 )
(x1 ,x2 )R(x) (y1 ,y2 )R(y)

which is a p.d. kernel by the previous lemmas. 

379 / 635
LA kernel is p.d.: proof (9/11)
3 basic string kernels
The constant kernel:
K0 (x, y) := 1 .

A kernel for letters:



() 0 if | x | =
6 1 where | y | =
6 1,
Ka (x, y) :=
exp (S(x, y)) otherwise .

A kernel for gaps:


()
Kg (x, y) = exp [ (g (| x |) + g (| y |))] .

380 / 635
LA kernel is p.d.: proof (10/11)
Remark
S : A2 R is the similarity function between letters used in the
()
alignment score. Ka is only p.d. when the matrix:

(exp (s(a, b)))(a,b)A2

is positive semidefinite (this is true for all when s is conditionally


p.d..
g is the gap penalty function used in alignment score. The gap
kernel is always p.d. (with no restriction on g ) because it can be
written as:
()
Kg (x, y) = exp (g (| x |)) exp (g (| y |)) .

381 / 635
LA kernel is p.d.: proof (11/11)
Lemma
The local alignment kernel is a (limit of) convolution kernels:
\[
K_{LA}^{(\beta)} = \sum_{n=0}^{\infty} K_0 \star \left( K_a^{(\beta)} \star K_g^{(\beta)} \right)^{\star (n-1)} \star K_a^{(\beta)} \star K_0 .
\]
As such it is p.d.

Proof (sketch)
By induction on n (simple but long to write).
See details in Vert et al. (2004).

382 / 635
LA kernel computation
We assume an affine gap penalty:
\[
g(0) = 0, \qquad g(n) = d + e(n - 1) \ \text{ if } n \geq 1 .
\]
The LA kernel can then be computed by dynamic programming as:
\[
K_{LA}^{(\beta)}(x, y) = 1 + X_2(|x|, |y|) + Y_2(|x|, |y|) + M(|x|, |y|) ,
\]
where M(i, j), X(i, j), Y(i, j), X_2(i, j), and Y_2(i, j) for 0 ≤ i ≤ |x|
and 0 ≤ j ≤ |y| are defined recursively.

383 / 635
LA kernel computation (cont.)
Initialization
\[
\begin{cases}
M(i, 0) = M(0, j) = 0 , \\
X(i, 0) = X(0, j) = 0 , \\
Y(i, 0) = Y(0, j) = 0 , \\
X_2(i, 0) = X_2(0, j) = 0 , \\
Y_2(i, 0) = Y_2(0, j) = 0 .
\end{cases}
\]

384 / 635
LA kernel computation (cont.)
Recursion
For i = 1, ..., |x| and j = 1, ..., |y|:
\[
\begin{cases}
M(i, j) = \exp\big( \beta S(x_i, y_j) \big) \big[ 1 + X(i-1, j-1) + Y(i-1, j-1) + M(i-1, j-1) \big] , \\
X(i, j) = \exp(\beta d)\, M(i-1, j) + \exp(\beta e)\, X(i-1, j) , \\
Y(i, j) = \exp(\beta d) \big[ M(i, j-1) + X(i, j-1) \big] + \exp(\beta e)\, Y(i, j-1) , \\
X_2(i, j) = M(i-1, j) + X_2(i-1, j) , \\
Y_2(i, j) = M(i, j-1) + X_2(i, j-1) + Y_2(i, j-1) .
\end{cases}
\]

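A direct transcription of this recursion into Python (a sketch; names and default parameter
values are ours). We assume the affine gap penalty g(n) = d + e(n−1) is subtracted from the
alignment score, hence the exp(−βd) and exp(−βe) factors; S is assumed to be a dictionary of
substitution scores indexed by letter pairs.

import numpy as np

def la_kernel(x, y, S, beta=0.5, d=11.0, e=1.0):
    n, m = len(x), len(y)
    M = np.zeros((n + 1, m + 1)); X = np.zeros((n + 1, m + 1)); Y = np.zeros((n + 1, m + 1))
    X2 = np.zeros((n + 1, m + 1)); Y2 = np.zeros((n + 1, m + 1))
    ed, ee = np.exp(-beta * d), np.exp(-beta * e)   # gap opening / extension factors
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            M[i, j] = np.exp(beta * S[x[i-1], y[j-1]]) * (
                1 + X[i-1, j-1] + Y[i-1, j-1] + M[i-1, j-1])
            X[i, j] = ed * M[i-1, j] + ee * X[i-1, j]
            Y[i, j] = ed * (M[i, j-1] + X[i, j-1]) + ee * Y[i, j-1]
            X2[i, j] = M[i-1, j] + X2[i-1, j]
            Y2[i, j] = M[i, j-1] + X2[i, j-1] + Y2[i, j-1]
    return 1 + X2[n, m] + Y2[n, m] + M[n, m]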
385 / 635
LA kernel in practice

Implementation by a finite-state transducer in O(|x| · |x'|).
[Figure: finite-state transducer over states B, X1, X, X2, M, Y1, Y, Y2, E, with
transition weights built from m(a,b), D and E.]

In practice, values are too large (exponential scale) so taking its


logarithm is a safer choice (but not p.d. anymore!)

386 / 635
Outline

5 The Kernel Jungle


Green, Mercer, Herglotz, Bochner and friends
Kernels for probabilistic models
Kernels for biological sequences
Motivations and history of genomics
Kernels derived from large feature spaces
Kernels derived from generative models
Kernels derived from a similarity measure
Application to remote homology detection
Kernels for graphs
Kernels on graphs

387 / 635
Remote homology

gs
gs

o
on
o

ol
ol

m
tz
m

ho
gh
ho

se
ili
on

lo
Tw
N

C
Sequence similarity

Homologs have common ancestors


Structures and functions are more conserved than sequences
Remote homologs can not be detected by direct sequence
comparison

388 / 635
SCOP database

SCOP
Fold
Superfamily
Family
Remote homologs Close homologs

389 / 635
A benchmark experiment
Goal: recognize directly the superfamily
Training: for a sequence of interest, positive examples come from
the same superfamily, but different families. Negative from other
superfamilies.
Test: predict the superfamily.

390 / 635
Difference in performance
60
SVM-LA
SVM-pairwise
SVM-Mismatch
No. of families with given performance
50 SVM-Fisher

40

30

20

10

0
0 0.2 0.4 0.6 0.8 1
ROC50

Performance on the SCOP superfamily recognition benchmark (from


Saigo et al., 2004).

391 / 635
String kernels: Summary
A variety of principles for string kernel design have been proposed.
Good kernel design is important for each data and each task.
Performance is not the only criterion.
Still an art, although principled ways have started to emerge.
Fast implementation with string algorithms is often possible.
Their application goes well beyond computational biology.

392 / 635
Outline

1 Kernels and RKHS

2 Kernel tricks

3 Kernel Methods: Supervised Learning

4 Kernel Methods: Unsupervised Learning

5 The Kernel Jungle


Green, Mercer, Herglotz, Bochner and friends
Kernels for probabilistic models
Kernels for biological sequences
Kernels for graphs
Kernels on graphs

6 Open Problems and Research Topics 393 / 635


Outline

5 The Kernel Jungle


Green, Mercer, Herglotz, Bochner and friends
Kernels for probabilistic models
Kernels for biological sequences
Kernels for graphs
Motivation
Explicit enumeration of features
Challenges
Walk-based kernels
Applications
Kernels on graphs

394 / 635
Virtual screening for drug discovery

active

inactive

active
inactive
inactive

active

NCI AIDS screen results (from http://cactus.nci.nih.gov).


395 / 635
Image retrieval and classification

From Harchaoui and Bach (2007).

396 / 635
Our approach
1 Represent each graph x in X by a vector Φ(x) ∈ H, either explicitly
or implicitly through the kernel
\[
K(x, x') = \Phi(x)^{\top} \Phi(x') .
\]

2 Use a linear method for classification in H.



X H

397 / 635
Outline

5 The Kernel Jungle


Green, Mercer, Herglotz, Bochner and friends
Kernels for probabilistic models
Kernels for biological sequences
Kernels for graphs
Motivation
Explicit enumeration of features
Challenges
Walk-based kernels
Applications
Kernels on graphs

398 / 635
The approach
1 Represent explicitly each graph x by a vector of fixed dimension
Φ(x) ∈ R^p.
2 Use an algorithm for regression or pattern recognition in Rp .


X H

399 / 635
Example
2D structural keys in chemoinformatics
Index a molecule by a binary fingerprint defined by a limited set of
predefined structures
N N N
O O O O O

O
N

Use a machine learning algorithm such as SVM, kNN, PLS, decision


tree, etc.

400 / 635
Challenge: which descriptors (patterns)?

N N N
O O O O O

O
N

Expressiveness: they should retain as much information as possible


from the graph
Computation: they should be fast to compute
Large dimension of the vector representation: memory storage,
speed, statistical issues

401 / 635
Indexing by substructures

N N N
O O O O O

O
N

Often we believe that the presence or absence of particular


substructures may be important predictive patterns
Hence it makes sense to represent a graph by features that indicate
the presence (or the number of occurrences) of these substructures
However, detecting the presence of particular substructures may be
computationally challenging...

402 / 635
Subgraphs
Definition
A subgraph of a graph (V , E ) is a graph (V 0 , E 0 ) with V 0 V and
E0 E.

A graph and all its connected subgraphs.

403 / 635
Indexing by all subgraphs?

Theorem
Computing all subgraph occurrences is NP-hard.

Proof
The linear graph of size n is a subgraph of a graph X with n vertices
iff X has a Hamiltonian path;
The decision problem whether a graph has a Hamiltonian path is
NP-complete.

404 / 635
Paths
Definition
A path of a graph (V , E ) is a sequence of distinct vertices
v1 , . . . , vn V (i 6= j = vi 6= vj ) such that (vi , vi+1 ) E for
i = 1, . . . , n 1.
Equivalently the paths are the linear subgraphs.

405 / 635
Indexing by all paths?
[Figure: a labeled graph and its path-indicator feature vector (0,...,0,1,0,...,0,1,0,...).]
Theorem
Computing all path occurrences is NP-hard.

Proof
Same as for subgraphs.

406 / 635
Indexing by what?
Substructure selection
We can imagine more limited sets of substructures that lead to more
computationnally efficient indexing (non-exhaustive list)
substructures selected by domain knowledge (MDL fingerprint)
all paths up to length k (Openeye fingerprint, Nicholls 2005)
all shortest path lengths (Borgwardt and Kriegel, 2005)
all subgraphs up to k vertices (graphlet kernel, Shervashidze et al.,
2009)
all frequent subgraphs in the database (Helma et al., 2004)

407 / 635
Example: Indexing by all shortest path lengths and their
endpoint labels
[Figure: a labeled graph and its feature vector counting (endpoint labels, shortest-path length) triplets, e.g. (A, B, 3).]
Properties (Borgwardt and Kriegel, 2005)
There are O(n²) shortest paths.
The vector of counts can be computed in O(n³) with the
Floyd-Warshall algorithm.

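A sketch of this feature map and of the resulting kernel (names ours), using unit edge
lengths; features are indexed by the pair of endpoint labels and the shortest-path distance.

import numpy as np
from collections import Counter

def shortest_path_features(A, labels):
    # All-pairs shortest paths by Floyd-Warshall, then one count per vertex pair.
    A = np.asarray(A, dtype=float)
    n = len(A)
    D = np.where(A > 0, 1.0, np.inf)
    np.fill_diagonal(D, 0.0)
    for k in range(n):
        D = np.minimum(D, D[:, k:k + 1] + D[k:k + 1, :])
    feats = Counter()
    for i in range(n):
        for j in range(i + 1, n):
            if np.isfinite(D[i, j]):
                a, b = sorted((labels[i], labels[j]))
                feats[(a, b, int(D[i, j]))] += 1
    return feats

def shortest_path_kernel(A1, labels1, A2, labels2):
    f1, f2 = shortest_path_features(A1, labels1), shortest_path_features(A2, labels2)
    return sum(v * f2[key] for key, v in f1.items() if key in f2)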
408 / 635
Example: Indexing by all subgraphs up to k vertices

Properties (Shervashidze et al., 2009)
Naive enumeration scales as O(n^k).
Enumeration of connected graphlets in O(n d^{k−1}) for graphs with
maximum degree d and k ≤ 5.
Randomly sample subgraphs if enumeration is infeasible.

409 / 635
Summary
Explicit computation of substructure occurrences can be
computationnally prohibitive (subgraphs, paths);
Several ideas to reduce the set of substructures considered;
In practice, NP-hardness may not be so prohibitive (e.g., graphs
with small degrees), the strategy followed should depend on the
data considered.

410 / 635
Outline

5 The Kernel Jungle


Green, Mercer, Herglotz, Bochner and friends
Kernels for probabilistic models
Kernels for biological sequences
Kernels for graphs
Motivation
Explicit enumeration of features
Challenges
Walk-based kernels
Applications
Kernels on graphs

411 / 635
The idea
1 Represent implicitly each graph x in X by a vector Φ(x) ∈ H
through the kernel
\[
K(x, x') = \Phi(x)^{\top} \Phi(x') .
\]

2 Use a kernel method for classification in H.


X H

412 / 635
Expressiveness vs Complexity
Definition: Complete graph kernels
A graph kernel is complete if it distinguishes non-isomorphic graphs, i.e.:
\[
\forall G_1, G_2 \in X, \quad d_K(G_1, G_2) = 0 \implies G_1 \simeq G_2 .
\]
Equivalently, Φ(G_1) ≠ Φ(G_2) if G_1 and G_2 are not isomorphic.

Expressiveness vs Complexity trade-off


If a graph kernel is not complete, then there is no hope to learn all
possible functions over X : the kernel is not expressive enough.
On the other hand, kernel computation must be tractable, i.e., no
more than polynomial (with small degree) for practical applications.
Can we define tractable and expressive graph kernels?

413 / 635
Complexity of complete kernels
Proposition (Gartner et al., 2003)
Computing any complete graph kernel is at least as hard as the graph
isomorphism problem.

Proof
For any kernel K the complexity of computing d_K is the same as the
complexity of computing K, because:
\[
d_K(G_1, G_2)^2 = K(G_1, G_1) + K(G_2, G_2) - 2 K(G_1, G_2) .
\]
If K is a complete graph kernel, then computing d_K solves the graph
isomorphism problem (d_K(G_1, G_2) = 0 iff G_1 ≃ G_2). □

414 / 635
Subgraph kernel
Definition
Let (λ_H)_{H ∈ X} be a set of nonnegative real-valued weights.
For any graph G ∈ X and any connected graph H ∈ X, let
\[
\Phi_H(G) = \big| \big\{ G' \text{ is a subgraph of } G : G' \simeq H \big\} \big| .
\]
The subgraph kernel between any two graphs G_1 and G_2 ∈ X is
defined by:
\[
K_{\text{subgraph}}(G_1, G_2) = \sum_{\substack{H \in X \\ H \text{ connected}}} \lambda_H\, \Phi_H(G_1)\, \Phi_H(G_2) .
\]

415 / 635
Subgraph kernel complexity
Proposition (Gartner et al., 2003)
Computing the subgraph kernel is NP-hard.

Proof (1/2)
Let P_n be the path graph with n vertices.
Subgraphs of P_n are path graphs:
\[
\Phi(P_n) = n\, e_{P_1} + (n - 1)\, e_{P_2} + \ldots + e_{P_n} .
\]
The vectors Φ(P_1), ..., Φ(P_n) are linearly independent, therefore:
\[
e_{P_n} = \sum_{i=1}^{n} \alpha_i\, \Phi(P_i) ,
\]
where the coefficients α_i can be found in polynomial time (solving
an n × n triangular system).
416 / 635
Subgraph kernel complexity
Proposition (Gartner et al., 2003)
Computing the subgraph kernel is NP-hard.

Proof (2/2)
If G is a graph with n vertices, then it has a path that visits each
node exactly once (Hamiltonian path) if and only if Φ(G)^⊤ e_{P_n} > 0,
i.e.,
\[
\Phi(G)^{\top} \Big( \sum_{i=1}^{n} \alpha_i\, \Phi(P_i) \Big) = \sum_{i=1}^{n} \alpha_i\, K_{\text{subgraph}}(G, P_i) > 0 .
\]
The decision problem whether a graph has a Hamiltonian path is
NP-complete. □

417 / 635
Path kernel
[Figure: a labeled graph and its path-indicator feature vector (0,...,0,1,0,...,0,1,0,...).]
Definition
The path kernel is the subgraph kernel restricted to paths, i.e.,
\[
K_{\text{path}}(G_1, G_2) = \sum_{H \in P} \lambda_H\, \Phi_H(G_1)\, \Phi_H(G_2) ,
\]
where P ⊂ X is the set of path graphs.

Proposition (Gartner et al., 2003)
Computing the path kernel is NP-hard.

418 / 635
Summary
Expressiveness vs Complexity trade-off
It is intractable to compute complete graph kernels.
It is intractable to compute the subgraph kernels.
Restricting subgraphs to be linear does not help: it is also
intractable to compute the path kernel.
One approach to define polynomial time computable graph kernels is
to have the feature space be made up of graphs homomorphic to
subgraphs, e.g., to consider walks instead of paths.

419 / 635
Outline

5 The Kernel Jungle


Green, Mercer, Herglotz, Bochner and friends
Kernels for probabilistic models
Kernels for biological sequences
Kernels for graphs
Motivation
Explicit enumeration of features
Challenges
Walk-based kernels
Applications
Kernels on graphs

420 / 635
Walks
Definition
A walk of a graph (V, E) is a sequence v_1, ..., v_n ∈ V such that
(v_i, v_{i+1}) ∈ E for i = 1, ..., n − 1.
We note W_n(G) the set of walks with n vertices of the graph G,
and W(G) the set of all walks.

etc...

421 / 635
Walks 6= paths

422 / 635
Walk kernel
Definition
Let S_n denote the set of all possible label sequences of walks of
length n (including vertex and edge labels), and S = ∪_{n≥1} S_n.
For any graph G ∈ X let a weight λ_G(w) be associated to each walk
w ∈ W(G).
Let the feature vector Φ(G) = (Φ_s(G))_{s∈S} be defined by:
\[
\Phi_s(G) = \sum_{w \in W(G)} \lambda_G(w)\, \mathbf{1}\big( s \text{ is the label sequence of } w \big) .
\]
A walk kernel is a graph kernel defined by:
\[
K_{\text{walk}}(G_1, G_2) = \sum_{s \in S} \Phi_s(G_1)\, \Phi_s(G_2) .
\]

423 / 635
Walk kernel examples
Examples
The nth-order walk kernel is the walk kernel with λ_G(w) = 1 if the
length of w is n, 0 otherwise. It compares two graphs through their
common walks of length n.
The random walk kernel is obtained with λ_G(w) = P_G(w), where
P_G is a Markov random walk on G. In that case we have:
\[
K(G_1, G_2) = P\big( \text{label}(W_1) = \text{label}(W_2) \big) ,
\]
where W_1 and W_2 are two independent random walks on G_1 and
G_2, respectively (Kashima et al., 2003).
The geometric walk kernel is obtained (when it converges) with
λ_G(w) = β^{length(w)}, for β > 0. In that case the feature space is of
infinite dimension (Gartner et al., 2003).

424 / 635
Computation of walk kernels
Proposition
These three kernels (nth-order, random and geometric walk kernels) can
be computed efficiently in polynomial time.

425 / 635
Product graph
Definition
Let G1 = (V1 , E1 ) and G2 = (V2 , E2 ) be two graphs with labeled vertices.
The product graph G = G1 G2 is the graph G = (V , E ) with:
1 V = {(v1 , v2 ) V1 V2 : v1 and v2 have the same label} ,
2 E = {((v1 , v2 ), (v10 , v20 )) V V : (v1 , v10 ) E1 and (v2 , v20 ) E2 }.

1 a b 1b 2a 1d

c 3c 3e
2
1a 2b 2d

3 4 d e
4c 4e

G1 G2 G1 x G2

426 / 635
Walk kernel and product graph
Lemma
There is a bijection between:
1 The pairs of walks w1 Wn (G1 ) and w2 Wn (G2 ) with the same
label sequences,
2 The walks on the product graph w Wn (G1 G2 ).

427 / 635
Walk kernel and product graph
Lemma
There is a bijection between:
1 The pairs of walks w1 Wn (G1 ) and w2 Wn (G2 ) with the same
label sequences,
2 The walks on the product graph w Wn (G1 G2 ).

Corollary
X
Kwalk (G1 , G2 ) = s (G1 )s (G2 )
sS
X
= G1 (w1 )G2 (w2 )1(l(w1 ) = l(w2 ))
(w1 ,w2 )W(G1 )W(G1 )
X
= G1 G2 (w ) .
w W(G1 G2 )

427 / 635
Computation of the nth-order walk kernel

For the nth-order walk kernel we have G1 G2 (w ) = 1 if the length


of w is n, 0 otherwise.
Therefore: X
Knth-order (G1 , G2 ) = 1.
w Wn (G1 G2 )

Let A be the adjacency matrix of G1 G2 . Then we get:


X
Knth-order (G1 , G2 ) = [An ]i,j = 1> An 1 .
i,j

Computation in O(n|V1 ||V2 |d1 d2 ), where di is the maximum degree


of Gi .

428 / 635
Computation of random and geometric walk kernels

In both cases G (w ) for a walk w = v1 . . . vn can be decomposed as:


n
Y
G (v1 . . . vn ) = i (v1 ) t (vi1 , vi ) .
i=2

Let i be the vector of i (v ) and t be the matrix of t (v , v 0 ):



X X n
Y
Kwalk (G1 , G2 ) = i (v1 ) t (vi1 , vi )
n=1 w Wn (G1 G2 ) i=2

X
= i nt 1
n=0
= i (I t )1 1

Computation in O(|V1 |3 |V2 |3 ).

429 / 635
Extensions 1: Label enrichment
Atom relabeling with the Morgan index (Mahe et al., 2004)
1 2 4

1 1 2 2 4 5

1 O1 2 O1 4 O3
1 3 7
N1 N3 N5
1 2 5

No Morgan Indices O1 Order 1 indices O1 Order 2 indices O3

Compromise between fingerprints and structural keys.


Other relabeling schemes are possible.
Faster computation with more labels (less matches implies a smaller
product graph).

430 / 635
Extension 2: Non-tottering walk kernel
Tottering walks
A tottering walk is a walk w = v1 . . . vn with vi = vi+2 for some i.

Nontottering

Tottering

Tottering walks seem irrelevant for many applications.


Focusing on non-tottering walks is a way to get closer to the path
kernel (e.g., equivalent on trees).

431 / 635
Computation of the non-tottering walk kernel (Mahe et al.,
2005)
Second-order Markov random walk to prevent tottering walks
Written as a first-order Markov random walk on an augmented graph
Normal walk kernel on the augmented graph (which is always a
directed graph).

432 / 635
Extension 3: Subtree kernels

Remark: Here and in subsequent slides by subtree we mean a tree-like


pattern with potentially repeated nodes and edges.
433 / 635
Example: Tree-like fragments of molecules

.
.
. C C

C .
N

N O
.
.
C

O N C C N
. N O
.
.
N C
N N C C C
.
.
.

434 / 635
Computation of the subtree kernel (Ramon and Gartner,
2003; Mahe and Vert, 2009)

Like the walk kernel, amounts to computing the (weighted) number


of subtrees in the product graph.
Recursion: if T (v , n) denotes the weighted number of subtrees of
depth n rooted at the vertex v , then:
X Y
T (v , n + 1) = t (v , v 0 )T (v 0 , n) ,
RN (v ) v 0 R

where N (v ) is the set of neighbors of v .


Can be combined with the non-tottering graph transformation as
preprocessing to obtain the non-tottering subtree kernel.

435 / 635
Back to label enrichment
Link between the Morgan index and subtrees
Recall the Morgan index:
1 2 4

1 1 2 2 4 5

1 O1 2 O1 4 O3
1 3 7
N1 N3 N5
1 2 5

No Morgan Indices O1 Order 1 indices O1 Order 2 indices O3

The Morgan index of order k at a node v in fact corresponds to the


number of leaves in the k-th order full subtree pattern rooted at v .

2
1
1 3

2 3 6

6 4
1 3 1 2 4 5 1 5
5

A full subtree pattern of order 2 rooted at node 1.

436 / 635
Label enrichment via the Weisfeiler-Lehman algorithm
A slightly more involved label enrichment strategy (Weisfeiler and
Lehman, 1968) is exploited in the definition and computation of the
Weisfeiler-Lehman subtree kernel (Shervashidze and Borgwardt, 2009).
e b
1 Multiset-label determination
d c
and sorting
a a

e,bcd b,ce a,d f


2 Label compression b,ce g
d,aace c,bde c,bde h
d,aace i
a,d a,d e,bcd j

j g
3 Relabeling
i h

c d
f f
e
b

437 / 635
Label enrichment via the Weisfeiler-Lehman algorithm
A slightly more involved label enrichment strategy (Weisfeiler and
Lehman, 1968) is exploited in the definition and computation of the
Weisfeiler-Lehman subtree kernel (Shervashidze and Borgwardt, 2009).
e b
1 Multiset-label determination
d c
and sorting
a a

e,bcd b,ce a,d f


2 Label compression b,ce g
d,aace c,bde c,bde h
d,aace i
a,d a,d e,bcd j

j g
3 Relabeling
i h

c d
f f
e
b

Compressed labels represent full subtree patterns.


437 / 635
Weisfeiler-Lehman (WL) subtree kernel

e b b e

d c d c (1)
WLsubtree(G) = (2, 1, 1, 1, 1, 2, 0, 1, 0, 1, 1, 0, 1)
a b c d e f g h i j k l m
a a a b (1)
WLsubtree(G) = ( 1, 2, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1)
m h i m a b c d e f g h i j k l m

k j l j Counts of Counts of
original compressed
G G
f f f g node labels node labels
(1) (1) (1)
KWLsubtree(G,G)=<WLsubtree(G), WLsubtree(G)>=11.

Properties
The WL features up to the k-th order are computed in O(|E |k).
Similarly to the Morgan index, the WL relabeling can be exploited in
combination with any graph kernel (that takes into account
categorical node labels) to make it more expressive (Shervashidze et
al., 2011).
438 / 635
Outline

5 The Kernel Jungle


Green, Mercer, Herglotz, Bochner and friends
Kernels for probabilistic models
Kernels for biological sequences
Kernels for graphs
Motivation
Explicit enumeration of features
Challenges
Walk-based kernels
Applications
Kernels on graphs

439 / 635
Application in chemoinformatics (Mahe et al., 2005)
MUTAG dataset
aromatic/hetero-aromatic compounds
high mutagenic activity /no mutagenic activity, assayed in
Salmonella typhimurium.
188 compounds: 125 + / 63 -

Results
10-fold cross-validation accuracy
Method Accuracy
Progol1 81.4%
2D kernel 91.2%

440 / 635
AUC

70 72 74 76 78 80

CCRFCEM
HL60(TB)
K562
MOLT4
Walks

RPMI8226
Subtrees

SR
A549/ATCC
EKVX
HOP62
HOP92
NCIH226
NCIH23
NCIH322M
NCIH460
NCIH522
COLO_205
HCC2998
HCT116
HCT15
HT29
KM12
SW620
SF268
2D subtree vs walk kernels

SF295
SF539
SNB19
SNB75
U251
LOX_IMVI
MALME3M
M14
SKMEL2
SKMEL28
SKMEL5
UACC257
UACC62
IGROV1
OVCAR3
OVCAR4
OVCAR5
OVCAR8

Screening of inhibitors for 60 cancer cell lines.


SKOV3
7860
A498
ACHN
CAKI1
RXF_393
SN12C
TK10
UO31
PC3
DU145
MCF7
NCI/ADRRES
MDAMB231/ATCC
HS_578T
MDAMB435
MDAN
BT549
T47D
441 / 635
Comparison of several graph feature extraction
methods/kernels (Shervashidze et al., 2011)
10-fold cross-validation accuracy on garph classification problems in
chemo- and bioinformatics:
NCI1 and NCI109 - active/inactive compounds in an anti-cancer screen
ENZYMES - 6 types of enzymes from the BRENDA database

Method/Data Set NCI1 NCI109 ENZYMES


WL subtree 82.19 ( 0.18) 82.46 (0.24) 52.22 (1.26)
WL shortest path 84.55 (0.36) 83.53 (0.30) 59.05 (1.05)
Ramon & Gartner 61.86 (0.27) 61.67 (0.21) 13.35 (0.87)
Geometric p-walk 58.66 (0.28) 58.36 (0.94) 27.67 (0.95)
Geometric walk 64.34 (0.27) 63.51 ( 0.18) 21.68 (0.94)
Graphlet count 66.00 (0.07) 66.59 (0.08) 32.70 (1.20)
Shortest path 73.47 (0.11) 73.07 (0.11) 41.68 (1.79)

442 / 635
Image classification (Harchaoui and Bach, 2007)
COREL14 dataset
1400 natural images in 14 classes
Compare kernel between histograms (H), walk kernel (W), subtree
kernel (TW), weighted subtree kernel (wTW), and a combination
(M).

Performance comparison on Corel14

0.12

0.11

0.1

Test error
0.09

0.08

0.07

0.06

0.05
H W TW wTW M
Kernels

443 / 635
Summary: graph kernels
What we saw
Kernels do not allow to overcome the NP-hardness of subgraph
patterns.
They allow to work with approximate subgraphs (walks, subtrees) in
infinite dimension, thanks to the kernel trick.
However: using kernels makes it difficult to come back to patterns
after the learning stage.

444 / 635
Outline

1 Kernels and RKHS

2 Kernel tricks

3 Kernel Methods: Supervised Learning

4 Kernel Methods: Unsupervised Learning

5 The Kernel Jungle


Green, Mercer, Herglotz, Bochner and friends
Kernels for probabilistic models
Kernels for biological sequences
Kernels for graphs
Kernels on graphs

6 Open Problems and Research Topics 445 / 635


Outline

5 The Kernel Jungle


Green, Mercer, Herglotz, Bochner and friends
Kernels for probabilistic models
Kernels for biological sequences
Kernels for graphs
Kernels on graphs
Motivation
Graph distance and p.d. kernels
Construction by regularization
The diffusion kernel
Harmonic analysis on graphs
Applications

446 / 635
Graphs
Motivation
Data often come in the form of nodes in a graph for different reasons:
by definition (interaction network, internet...)
by discretization/sampling of a continuous domain
by convenience (e.g., if only a similarity function is available)

447 / 635
Example: web

448 / 635
Example: social network

449 / 635
Example: protein-protein interaction

450 / 635
Kernel on a graph

We need a kernel K (x, x0 ) between nodes of the graph.


Example: predict protein functions from high-throughput
protein-protein interaction data.

451 / 635
General remarks
Strategies to design a kernel on a graph
X being finite, any symmetric semi-definite matrix K defines a valid
p.d. kernel on X .

452 / 635
General remarks
Strategies to design a kernel on a graph
X being finite, any symmetric semi-definite matrix K defines a valid
p.d. kernel on X .
How to translate the graph topology into the kernel?
Direct geometric approach: Ki,j should be large when xi and xj are
close to each other on the graph?
Functional approach: k f kK should be small when f is smooth
on the graph?
Link discrete/continuous: is there an equivalent to the continuous
Gaussian kernel on the graph (e.g., limit by fine discretization)?

452 / 635
Outline

5 The Kernel Jungle


Green, Mercer, Herglotz, Bochner and friends
Kernels for probabilistic models
Kernels for biological sequences
Kernels for graphs
Kernels on graphs
Motivation
Graph distance and p.d. kernels
Construction by regularization
The diffusion kernel
Harmonic analysis on graphs
Applications

453 / 635
Conditionally p.d. kernels
Hilbert distance
Any p.d. kernel is an inner product in a Hilbert space

K x, x0 = (x) , x0 H .



It defines a Hilbert distance:


2
dK x, x0 = K (x, x) + K x0 , x0 2K x, x0 .
 

dK2 is conditionally positive definite (c.p.d.), i.e.:


 2 
t > 0 , exp tdK x, x0 is p.d.

454 / 635
Example
A direct approach
For X = Rn , the inner product is p.d.:

K (x, x0 ) = x> x0 .

The corresponding Hilbert distance is the Euclidean distance:


2
dK x, x0 = x> x + x0> x0 2x>x0 = ||x x0 ||2 .

dK2 is conditionally positive definite (c.p.d.), i.e.:

t > 0 , exp t||x x0 ||2 is p.d.




455 / 635
Graph distance
Graph embedding in a Hilbert space
Given a graph G = (V , E ), the graph distance dG (x, x 0 ) between
any two vertices is the length of the shortest path between x and x 0 .
We say that the graph G = (V , E ) can be embedded (exactly) in a
Hilbert space if dG is c.p.d., which implies in particular that
exp(tdG (x, x 0 )) is p.d. for all t > 0.

456 / 635
Graph distance
Graph embedding in a Hilbert space
Given a graph G = (V , E ), the graph distance dG (x, x 0 ) between
any two vertices is the length of the shortest path between x and x 0 .
We say that the graph G = (V , E ) can be embedded (exactly) in a
Hilbert space if dG is c.p.d., which implies in particular that
exp(tdG (x, x 0 )) is p.d. for all t > 0.

Lemma
In general graphs cannot be embedded exactly in Hilbert spaces.
In some cases exact embeddings exist, e.g.:
trees can be embedded exactly,
closed chains can be embedded exactly.

456 / 635
Example: non-c.p.d. graph distance

1 3 5
4

0 1 1 1 2

1 0 2 2 1

dG =
1 2 0 2 1

1 2 2 0 1
2 1 1 1 0
h i
min e (0.2dG (i,j)) = 0.028 < 0 .

457 / 635
Graph distances on trees are c.p.d.
Proof
Let G = (V , E ) be a tree;
Fix a root x0 V ;
Represent any vertex x V by a vector (x) R|E | , where
(x)i = 1 if the i-th edge is part of the (unique) path between x
and x0 , 0 otherwise.
Then
dG (x, x 0 ) = k (x) (x 0 ) k2 ,
and therefore dG is c.p.d., in particular exp(tdG (x, x 0 )) is p.d.
for all t > 0.

458 / 635
Example

1
3 5
4
2

1 0.14 0.37 0.14 0.05
h i 0.14 1 0.37 0.14 0.05
dG (i,j)

e =
0.37 0.37 1 0.37 0.14

0.14 0.14 0.37 1 0.37
0.05 0.05 0.14 0.37 1

459 / 635
Graph distances on closed chains are c.p.d.
Proof: case |V | = 2p
Let G = (V , E ) be a directed cycle with an even number of vertices
|V | = 2p.
Fix a root x0 V , number the 2p edges from x0 to x0 ;
Label the 2p edges with e1 , . . . , ep , e1 , . . . , ep (vectors in Rp );
For a vertex v , take (v ) to be the sum of the labels of the edges in
the shortest directed path between x0 and v .

460 / 635
Outline

5 The Kernel Jungle


Green, Mercer, Herglotz, Bochner and friends
Kernels for probabilistic models
Kernels for biological sequences
Kernels for graphs
Kernels on graphs
Motivation
Graph distance and p.d. kernels
Construction by regularization
The diffusion kernel
Harmonic analysis on graphs
Applications

461 / 635
Functional approach
Motivation
How to design a p.d. kernel on general graphs?
Designing a kernel is equivalent to defining an RKHS.
There are intuitive notions of smoothness on a graph.

Idea
Define a priori a smoothness functional on the functions f : X R;
Show that it defines an RKHS and identify the corresponding kernel.

462 / 635
Notations

X = (x1 , . . . , xm ) is finite.
For x, x0 X , we note x x0 to indicate the existence of an edge
between x and x0
We assume that there is no self-loop x x, and that there is a
single connected component.
The adjacency matrix is A Rmm :
(
1 if i j,
Ai,j =
0 otherwise.

D is theP
diagonal matrix where Di,i is the number of neighbors of xi
(Di,i = mi=1 Ai,j ).

463 / 635
Example

1
3 5
4
2

0 0 1 0 0 1 0 0 0 0

0 0 1 0 0


0 1 0 0 0

A= 1 1 0 1 0 , D= 0 0 3 0 0

0 0 1 0 1 0 0 0 2 0
0 0 0 1 0 0 0 0 0 1

464 / 635
Graph Laplacian
Definition
The Laplacian of the graph is the matrix L = D A.

1
3 5
4
2
0 1 0

1 0

0 1 1 0 0

L=D A=
1 1 3 1 0

0 0 1 2 1
0 0 0 1 1

465 / 635
Properties of the Laplacian
Lemma
Let L = D A be the Laplacian of a connected graph:
For any f : X R,
X
(f ) := (f (xi ) f (xj ))2 = f > Lf
ij

L is a symmetric positive semi-definite matrix


0 is an eigenvalue with multiplicity 1 associated to the constant
eigenvector 1 = (1, . . . , 1)
The image of L is
m
( )
X
Im(L) = f Rm : fi = 0
i=1

466 / 635
Proof: link between (f ) and L

X
(f ) = (f (xi ) f (xj ))2
ij
X 
= f (xi )2 + f (xj )2 2f (xi ) f (xj )
ij
Xm X
= Di,i f (xi )2 2 f (xi ) f (xj )
i=1 ij
> >
= f Df f Af
= f > Lf

467 / 635
Proof: eigenstructure of L
L is symmetric because A and D are symmetric.
For any f Rm , f > Lf = (f ) 0, therefore the (real-valued)
eigenvalues of L are 0 : L is therefore positive semi-definite.
f is an eigenvector associated to eigenvalue 0
iff fP> Lf = 0

iff ij (f (xi ) f (xj ))2 = 0 ,


iff f (xi ) = f (xj ) when i j,
iff f is constant (because the graph is connected).
L being symmetric, Im(L) is the orthogonal supplement of Ker (L),
that is, the set of functions orthogonal to 1. 

468 / 635
Our first graph kernel
Theorem
Pm
The set H = {f Rm : i=1 fi= 0} endowed with the norm
X
(f ) = (f (xi ) f (xj ))2
ij

is a RKHS whose reproducing kernel is L , the pseudo-inverse of the


graph Laplacian.

469 / 635
In case of...
Pseudo-inverse of L
Remember the pseudo-inverse L of L is the linear application that is
equal to:
0 on Ker (L)
L1 on Im(L), that is, if we write:
m
X
L= i ui ui>
i=1

the eigendecomposition of L:

(i )1 ui ui> .
X
L =
i 6=0

In particular it holds that L L = LL = H , the projection onto


Im(L) = H.
470 / 635
Proof (1/2)
Resticted to H, the symmetric bilinear form:

hf , g i = f > Lg

is positive definite (because L is positive semi-definite, and


H = Im(L)). It is therefore a scalar product, making of H a Hilbert
space (in fact Euclidean).
The norm in this Hilbert space H is:

k f k2 = hf , f i = f > Lf = (f ) .

471 / 635
Proof (2/2)
To check that H is a RKHS with reproducing kernel K = L , it suffices
to show that:
(
x X , Kx H ,
(x, f ) X H, hf , Kx i = f (x) .

Ker (K ) = Ker (L ) = Ker (L), implying K 1 = 0. Therefore, each


row/column of K is in H.
For any f H, if we note gi = hK (i, ), f i we get:

g = KLf = L Lf = H (f ) = f .

As a conclusion K = L is the reproducing kernel of H. 

472 / 635
Example

1
3 5
4
2
0.88 0.12 0.08 0.32 0.52

0.12 0.88 0.08 0.32 0.52


L =
0.08 0.08 0.28 0.12 0.32

0.32 0.32 0.12 0.48 0.28
0.52 0.52 0.32 0.28 1.08

473 / 635
Interpretation of the Laplacian

f
dx

i1 i i+1

f (x) = f 00 (x)
f 0 (x + dx/2) f 0 (x dx/2)

dx
f (x + dx) f (x) f (x) + f (x dx)

dx 2
fi1 + fi+1 2f (x)
=
dx 2
Lf (i)
= .
dx 2
474 / 635
Interpretation of regularization
For f = [0, 1] R and xi = i/m, we have:
m     2
X i +1 i
(f ) = f f
m m
i=1
m   2
X 1 i
f0
m m
i=1
m
1 X 0 i 2
 
1
= f
m m m
i=1
1 1 0 2
Z
f (t) dt.
m 0

475 / 635
Outline

5 The Kernel Jungle


Green, Mercer, Herglotz, Bochner and friends
Kernels for probabilistic models
Kernels for biological sequences
Kernels for graphs
Kernels on graphs
Motivation
Graph distance and p.d. kernels
Construction by regularization
The diffusion kernel
Harmonic analysis on graphs
Applications

476 / 635
Motivation

Consider the normalized Gaussian kernel on Rd :


k x x0 k2
 
1
Kt x, x0 =

d exp .
(4t) 2 4t

In order to transpose it to the graph, replacing the Euclidean distant


by the shortest-path distance does not work.
In this section we provide a characterization of the Gaussian kernel
as the solution of a partial differential equation involving the
Laplacian, which we can transpose to the graph: the diffusion
equation.
The solution of the discrete diffusion equation will be called the
diffusion kernel or heat kernel.

477 / 635
The diffusion equation
Lemma
For any x0 Rd , the function:

k x x0 k2
 
1
Kx0 (x, t) = Kt (x0 , x) = d exp
(4t) 2 4t

is solution of the diffusion equation:



Kx (x, t) = Kx0 (x, t)
t 0
with initial condition Kx0 (x, 0) = x0 (x)

(proof by direct computation).

478 / 635
Discrete diffusion equation
For finite-dimensional ft Rm , the diffusion equation becomes:

ft = Lft
t
which admits the following solution:

ft = f0 e tL

with
t2 2 t3 3
e tL = I tL + L L + ...
2! 3!

479 / 635
Diffusion kernel (Kondor and Lafferty, 2002)
This suggest to consider:
K = e tL
which is indeed symmetric positive semi-definite because if we write:
m
X
L= i ui ui> (i 0)
i=1

we obtain:
m
X
tL
K =e = e ti ui ui>
i=1

480 / 635
Example: complete graph

1+(m1)e tm
(
m for i = j,
Ki,j = 1e tm
m 6 j.
for i =

481 / 635
Example: closed chain

m1
2(i j)
  
1 X 2
Ki,j = exp 2t 1 cos cos .
m m m
=0

482 / 635
Outline

5 The Kernel Jungle


Green, Mercer, Herglotz, Bochner and friends
Kernels for probabilistic models
Kernels for biological sequences
Kernels for graphs
Kernels on graphs
Motivation
Graph distance and p.d. kernels
Construction by regularization
The diffusion kernel
Harmonic analysis on graphs
Applications

483 / 635
Motivation
In this section we show that the diffusion and Laplace kernels can be
interpreted in the frequency domain of functions
This shows that our strategy to design kernels on graphs was based
on (discrete) harmonic analysis on the graph
This follows the approach we developed for semigroup kernels!

484 / 635
Spectrum of the diffusion kernel
Let 0 = 1 < 2 . . . m be the eigenvalues of the Laplacian:
m
X
L= i ui ui> (i 0)
i=1

The diffusion kernel Kt is an invertible matrix because its


eigenvalues are strictly positive:
m
X
Kt = e ti ui ui>
i=1

485 / 635
Norm in the diffusion RKHS
Any function f Rm can be written as f = K K 1 f , therefore its


norm in the diffusion RKHS is:


 
k f k2Kt = f > K 1 K K 1 f = f > K 1 f .


For i = 1, . . . , m, let:
fi = ui> f
be the projection of f onto the eigenbasis of K .
We then have:
m
X
> 1
kf k2Kt =f K f = e ti fi 2 .
i=1

R 2
2 2
This looks similar to f() e d ...

486 / 635
Discrete Fourier transform
Definition
 >
The vector f = f1 , . . . , fm is called the discrete Fourier transform of
f R n

The eigenvectors of the Laplacian are the discrete equivalent to the


sine/cosine Fourier basis on Rn .
The eigenvalues i are the equivalent to the frequencies 2
Successive eigenvectors oscillate increasingly as eigenvalues get
more and more negative.

487 / 635
Example: eigenvectors of the Laplacian

488 / 635
Generalization
This observation suggests to define a whole family of kernels:
m
X
Kr = r (i )ui ui>
i=1

associated with the following RKHS norms:


m
X fi 2
kf k2Kr =
r (i )
i=1

where r : R+ R+
is a non-increasing function.

489 / 635
Example : regularized Laplacian

1
r () = , >0
+
m
X 1
K= ui u > = (L + I )1
i +  i
i=1

X m
X
k f k2K = f > K 1 f = (f (xi ) f (xj ))2 +  f (xi )2 .
ij i=1

490 / 635
Example

1
3 5
4
2

0.60 0.10 0.19 0.08 0.04
0.10 0.60 0.19 0.08 0.04
1

(L + I ) =
0.19 0.19 0.38 0.15 0.08

0.08 0.08 0.15 0.46 0.23
0.04 0.04 0.08 0.23 0.62

491 / 635
Outline

5 The Kernel Jungle


Green, Mercer, Herglotz, Bochner and friends
Kernels for probabilistic models
Kernels for biological sequences
Kernels for graphs
Kernels on graphs
Motivation
Graph distance and p.d. kernels
Construction by regularization
The diffusion kernel
Harmonic analysis on graphs
Applications

492 / 635
Applications 1: graph partitioning
A classical relaxation of graph partitioning is:
X X
min (fi fj )2 s.t. fi 2 = 1
f RX
ij i

This can be rewritten


X
max fi 2 s.t. k f kH 1
f
i

This is principal component analysis in the RKHS (kernel PCA)

PC2 PC1

493 / 635
Applications 2: search on a graph

Let x1 , . . . , xq be a set of q nodes (the query). How to find


similar nodes (and rank them)?
One solution:

min k f kH s.t. f (xi ) 1 for i = 1, . . . , q.


f

494 / 635
Application 3: Semi-supervised learning

495 / 635
Application 3: Semi-supervised learning

496 / 635
Application 4: Tumor classification from microarray data
(Rapaport et al., 2006)
Data available
Gene expression measures for more than 10k genes
Measured on less than 100 samples of two (or more) different
classes (e.g., different tumors)

497 / 635
Application 4: Tumor classification from microarray data
(Rapaport et al., 2006)
Data available
Gene expression measures for more than 10k genes
Measured on less than 100 samples of two (or more) different
classes (e.g., different tumors)

Goal
Design a classifier to automatically assign a class to future samples
from their expression profile
Interpret biologically the differences between the classes

497 / 635
Linear classifiers
The approach
Each sample is represented by a vector x = (x1 , . . . , xp ) where
p > 105 is the number of probes
Classification: given the set of labeled sample, learn a linear decision
function:
Xp
f (x) = i xi + 0 ,
i=1

that is positive for one class, negative for the other


Interpretation: the weight i quantifies the influence of gene i for
the classification

498 / 635
Linear classifiers
Pitfalls
No robust estimation procedure exist for 100 samples in 105
dimensions!
It is necessary to reduce the complexity of the problem with prior
knowledge.

499 / 635
Example : Norm Constraints
The approach
A common method in statistics to learn with few samples in high
dimension is to constrain the norm of , e.g.:
EuclideanPnorm (support vector machines, ridge regression):
k k2 = pi=1 i2
L1 -norm (lasso regression) : k k1 = pi=1 | i |
P

Cons
Pros Limited interpretation
Good performance in (small weights)
classification No prior biological
knowledge

500 / 635
Example 2: Feature Selection
The approach
Constrain most weights to be 0, i.e., select a few genes (< 20) whose
expression are enough for classification. Interpretation is then about the
selected genes.

Pros Cons
The gene selection
Good performance in
process is usually not
classification
robust
Useful for biomarker
Wrong interpretation is
selection
the rule (too much
Apparently easy correlation between
interpretation genes)

501 / 635
Pathway interpretation
Motivation
Basic biological functions are usually expressed in terms of pathways
and not of single genes (metabolic, signaling, regulatory)
Many pathways are already known
How to use this prior knowledge to constrain the weights to have an
interpretation at the level of pathways?

Solution (Rapaport et al., 2006)


Constrain the diffusion RKHS norm of
Relevant if the true decision function is indeed smooth w.r.t. the
biological network

502 / 635
Pathway interpretation
N Glycan
biosynthesis

Glycolysis / Bad example


Gluconeogenesis

The graph is the


complete known
Porphyrin Protein
and Sulfur
metabolism kinases metabolic network of the
chlorophyll
metabolism
- Nitrogen, budding yeast (from
asparagine
Riboflavin metabolism metabolism KEGG database)
Folate
biosynthesis
DNA
and
We project the classifier
RNA
polymerase
subunits
weight learned by a SVM
Biosynthesis of steroids,
ergosterol metabolism Good classification
Lysine
biosynthesis
Oxidative
phosphorylation,
accuracy, but no possible
TCA cycle
Phenylalanine, tyrosine and
tryptophan biosynthesis Purine
interpretation!
metabolism

503 / 635
Pathway interpretation

Good example
The graph is the complete
known metabolic network
of the budding yeast
(from KEGG database)
We project the classifier
weight learned by a
spectral SVM
Good classification
accuracy, and good
interpretation!

504 / 635
Part 6

Open Problems
and Research Topics

505 / 635
Outline

1 Kernels and RKHS

2 Kernel tricks

3 Kernel Methods: Supervised Learning

4 Kernel Methods: Unsupervised Learning

5 The Kernel Jungle

6 Open Problems and Research Topics


Multiple Kernel Learning (MKL)
Large-scale learning with kernels
Deep learning with kernels

506 / 635
Motivation

We have seen how to make learning algorithms given a kernel K on


some data space X
Often we may have several possible kernels:
by varying the kernel type or parameters on a given description of the
data (eg, linear, polynomial, Gaussian kernels with different
bandwidths...)
because we have different views of the same data, eg, a protein can
be characterized by its sequence, its structure, its mass spectrometry
profile...
How to choose or integrate different kernels in a learning task?

507 / 635
Setting: learning with one kernel
For any f : X R, let f n = (f (x1 ), . . . , f (xn )) Rn
Given a p.d. kernel K : X X R, we learn with K by solving:

min R(f n ) + k f k2H , (3)


f H

where > 0 and R : Rn R is an closed3 and convex empirical


risk:
1
Pn 2
R(u) = n
1
Pni=1 (ui yi ) for kernel ridge regression
R(u) = n Pi=1 max(1 yi ui , 0) for SVM
1 n
R(u) = n i=1 log (1 + exp (yi ui )) for kernel logistic regression

3
R is closed if, for each A R, the sublevel set {u Rn : R(u) A} is closed. For
example, if R is continuous then it is closed.
508 / 635
Sum kernel

Definition
Let K1 , . . . , KM be M kernels on X . The sum kernel KS is the kernel on
X defined as
M
X
x, x0 X , KS (x, x0 ) = Ki (x, x0 ) .
i=1

509 / 635
Sum kernel and vector concatenation
Theorem
For i = 1, . . . , M, let i : X Hi be a feature map such that

Ki (x, x0 ) = i (x) , i x0 H .


i

PM
Then KS = i=1 Ki can be written as:

KS (x, x0 ) = S (x) , S x0 H ,


S

where S : X HS = H1 . . . HM is the concatenation of the


feature maps i :

S (x) = (1 (x) , . . . , M (x))> .


Therefore, summing kernels amounts to concatenating their feature space
representations, which is a quite natural way to integrate different
features.
510 / 635
Proof
For S (x) = (1 (x) , . . . , M (x))> , we easily compute:
M
X
0
i (x) , i x0 H



S (x) , S x Hs =
i
i=1
XM
= Ki (x, x0 )
i=1
= KS (x, x0 ) .

511 / 635
Example: data integration with the sum kernel
Table 1. List of experiments of direct approach, spectral approach based on
kernel PCA, and supervised approach based on kernel CCA
Vol. 20 Suppl. 1 2004, pages i363i370
BIOINFORMATICS DOI: 10.1093/bioinformatics/bth910

Approach Kernel (Predictor) Kernel (Target)


Protein network inference from multiple
genomic data: a supervised approach
Direct Kexp (Expression)
Y. Yamanishi1, , J.-P. Vert2 and M. Kanehisa1
Kppi (Protein interaction)
1
Kloc (Localization) Bioinformatics Center, Institute for Chemical Research, Kyoto University, Gokasho,
Uji, Kyoto 611-0011, Japan and 2 Computational Biology group, Ecole des Mines de
Kphy (PhylogeneticParis, profile)
35 rue Saint-Honor, 77305 Fontainebleau cedex, France
Kexp + Kppi + KlocReceived + Konphy January 15, 2004; accepted on March 1, 2004

(Integration)
Spectral Kexp (Expression)
Kppi (Protein interaction)
ABSTRACT computational biology. By protein network we mean, in this
Motivation: An increasing number of observations support the paper, a graph with proteins as vertices and edges that corres-
K loc (Localization)
hypothesis that most biological functions involve the interac- pond to various binary relationships between proteins. More
Kbetween
tions phy (Phylogenetic
many proteins, and profile)
that the complexity of living precisely, we consider below the protein network with edges
systems arises as a result of such interactions. In this context, between two proteins if (i) the proteins interact physically,
theK exp +of K
problem ppi +a K
inferring global + Kphy
loc protein network for a given or (ii) the proteins are enzymes that catalyze two successive
organism,(Integration)
using all available genomic data about the organ- chemical reactions in a pathway or (iii) one of the proteins
ism, is quickly becoming one of the main challenges in current regulates the expression of the other. This definition of pro-
Supervised Kexp (Expression)
computational biology. Kgold (Protein
tein networknetwork)
involves various forms of interactions between
Results: This paper presents a new method to infer protein proteins, which should be taken into account for the study of
Kppi (Protein interaction)
networks from multiple types of genomic data. Based on a
Kgold (Protein network)
the behavior of biological systems.
Kloc
variant (Localization)
of kernel Kgold (Protein
canonical correlation analysis, its originality network)
Unfortunately, the experimental determination of this pro-
is in the formalization of the protein network inference problem tein network remains very challenging nowadays, even for
Kphy (Phylogenetic
as a supervised
profile) Kgold (Protein
learning problem, and in the integration of het-
network)
the most basic organisms. The lack of reliable informa-
Kexp +genomic
erogeneous Kppi data+K within +K
loc this framework.
phy Kgold (Protein
We present network)
tion contrasts with the wealth of genomic data generated by
promising results on the prediction of the protein network for high-throughput technologies such as gene expression data
(Integration)
the yeast Saccharomyces cerevisiae from four types of widely (Eisen et al., 1998), physical protein interactions (Ito et al.,
available data: gene expressions, protein interactions meas- 2001), protein localization (Huh et al., 2003), phylogen-
ured by yeast two-hybrid systems, protein localizations in the etic profiles (Pellegrini et al., 1999) or pathway knowledge Fig. 512
6. Effect
Fig. 5. ROC curves: supervised approach.
cell and protein phylogenetic profiles. The method is shown (Kanehisa et al., 2004). There is therefore an incentive / 635of n
The sum kernel: functional point of view
Theorem
PM
The solution f HKS when we learn with KS = i=1 Ki is equal to:
M
X
f = fi ,
i=1

where (f1 , . . . , fM ) HK1 . . . HKM is the solution of:

M M
!
X X
n
min R fi + k fi k2HK .
f1 ,...,fM i
i=1 i=1

513 / 635
Generalization: The weighted sum kernel
Theorem
PM
The solution f when we learn with K = i=1 i Ki , with
1 , . . . , M 0, is equal to:
M
X
f = fi ,
i=1

where (f1 , . . . , fM ) HK1 . . . HKM is the solution of:

M M k f k2
!
X X i HK
n i
min R fi + .
f1 ,...,fM i
i=1 i=1

514 / 635
Proof (1/4)
M M k f k2
!
X X i HK
n i
min R fi + .
f1 ,...,fM i
i=1 i=1

R being convex, the problem is strictly convex and has a unique


solution (f1 , . . . , fM ) HK1 . . . HKM .
By the representer theorem, there exists 1 , . . . , M Rn such that
n
X

fi (x) = ij Ki (xj , x) .
j=1

(1 , . . . , M ) is the solution of
M M
!
X X > Ki i i
min R Ki i + .
1 ,...,M R n i
i=1 i=1

515 / 635
Proof (2/4)
This is equivalent to
M M
X > Ki i
i
X
min R (u) + s.t. u= Ki i .
u,1 ,...,M Rn i
i=1 i=1

This is equivalent to the saddle point problem:


M M
X > Ki i X
min maxn R (u) + i
+ 2 > (u Ki i ) .
u,1 ,...,M Rn R i
i=1 i=1

By Slaters condition, strong duality holds, meaning we can invert


min and max:
M M
X > Ki i X
maxn min R (u) + i
+ 2 > (u Ki i ) .
R u,1 ,...,M R n i
i=1 i=1

516 / 635
Proof (3/4)
Minimization in u:
n o
min R(u) + 2 > u = max 2 > u R(u) = R (2) ,
u u

where R is the Fenchel dual of R:

v Rn R (v) = sup u> v R(u) .


uRn

Minimization in i for i = 1, . . . , M:
 > 
i Ki i
min 2 Ki i = i > Ki ,
>
i i

where the minimum in i is reached for i = i .

517 / 635
Proof (4/4)
The dual problem is therefore
M
( ! )
X
max R (2) > i Ki .
Rn
i=1
Note that if learn from a single kernel K , we get the same dual
problem n o
maxn R (2) > K .
R
If is a solution of the dual problem, then i = i leading to:
n
X n
X
x X , fi (x) = ij Ki (xj , x) = i j Ki (xj , x)
j=1 j=1
PM
Therefore, f = i=1 fi
satisfies
M X
X n n
X

f (x) = i j Ki (xj , x) = j K (xj , x) . 
i=1 j=1 j=1
518 / 635
Learning the kernel

Motivation
If we know how to weight each kernel, then we can learn with the
weighted kernel
XM
K = i Ki
i=1

However, usually we dont know...


Perhaps we can optimize the weights i during learning?

519 / 635
An objective function for K
Theorem
For any p.d. kernel K on X , let

J(K ) = min R(f n ) + k f k2H .



f H

The function K 7 J(K ) is convex.


This suggests a principled way to learn a kernel: define a convex set of
candidate kernels, and minimize J(K ) by convex optimization.

520 / 635
Proof
We have shown by strong duality that
n o
J(K ) = maxn R (2) > K .
R

For each fixed, this is an affine function of K , hence convex


A supremum of convex functions is convex. 

521 / 635
MKL (Lanckriet et al., 2004)
We consider the set of convex combinations
M M
( )
X X
K = i Ki with M = i 0 , i = 1
i=1 i=1

We optimize both and f by solving:


n o
min J (K ) = min min R(f n ) + k f k2HK
M M f HK

The problem is jointly convex in (, ) and can be solved efficiently.


The output is both a set of weights , and a predictor corresponding
to the kernel method trained with kernel K .
This method is usually called Multiple Kernel Learning (MKL).

522 / 635
Example: protein annotation
Vol. 20 no. 16 2004, pages 26262635
BIOINFORMATICS doi:10.1093/bioinformatics/bth294

A statistical framework for genomic data fusion


Gert R. G. Lanckriet1 , Tijl De Bie3 , Nello Cristianini4 ,
Michael I. Jordan2 and William Stafford Noble5,
1 Department of Electrical Engineering and Computer Science, 2 Division of Computer
Science, Department of Statistics, University of California, Berkeley 94720, USA,
3 Department of Electrical Engineering, ESAT-SCD,A statistical framework for genomic data fusion
Katholieke Universiteit Leuven 3001,
Belgium, 4 Department of Statistics, University of California, Davis 95618, USA and
5 Department of Genome Sciences, University of Washington, Seattle 98195, USA

1.00 1.0
Received on January 29, 2004; revised on April 7, 2004; accepted on April 23, 2004

ROC
0.95 Advance Access publication May 6, 2004 0.9
ROC

G.R.G.Lanckriet et al.
0.90
0.8
0.85
0.80 ABSTRACT 0.7views. In yeast, for example for a given gene we typ-
these
BfunctionsSW B its SW Pfam FFT h(p LI ) DR|pi | : Ea vectorall
Table 1. Kernel
Motivation: During Pfam
the past LIdecade,D the new E focus allon depend
ically knowupon hydropathy
the protein it encodes,profile
that proteins
i similarity to
100 genomics has highlighted a particular challenge: to integrate containing theitshydrophobicities
other40proteins, of the
hydrophobicity profile, theamino
mRNAacids along the
expres-

TP1FP
TP1FP

the different views of the genome that are provided by various sion30
proteinlevels associatedetwith
(Engleman al.,the given
1986; gene and
Black underMould,
hundreds of Hopp
1991;
Kernel Data Similarity measure
50 types of experimental data. experimental
and Woods, conditions,
20 1981). The theFFT
occurrences
kernel of known
uses or inferred profiles
hydropathy
Results: This paper describes a computational framework transcription
10 factor binding sites in the upstream region of
KSW proteinand
sequences Smith-Waterman generated from the KyteDoolittle index (Kyte and Doolittle,
0 for integrating drawing inferences from a collection of that gene
0 and the identities of many of the proteins that interact
KB B
genome-wide SW
protein Pfam Each
sequences
measurements. LI datasetD BLAST E
is represented all
via 1982).
with theThis B kernel
given genes compares
SWprotein
Pfam the Each
FFT
product. frequency Dcontent
LIof these E of the
distinct all
K 1 protein sequences Pfam HMMrelation- 1
hydropathy profiles of the two proteins. First, the hydropathy
Pfam a kernel function, which defines generalized similarity data types provides one view of the molecular machinery of
Weights
Weights

KFFT hydropathy
ships between profile
pairs of entities, such as genes FFTor proteins. The profiles
the cell. are pre-filtered
In the near future, with a low-pass
research filter to reduce
in bioinformatics will noise:
KLI protein interactions
0.5 kernel representation linear kernel
is both flexible and efficient, and can be 0.5more and more heavily on methods of data fusion.
focus
KD protein interactions diffusion kernel
applied to many different types of data. Furthermore, kernel Different data sources hf (p = f to
arei )likely contain
h(pi ), different and
KE gene expression radial basis kernel
0 functions derived from different types of data can be combined thus partly
0 independent information about the task at hand.
KRND random numbers linear kernel 1 complementary pieces of information can be
in a straightforward (A) fashion.
Ribosomal Recent advances in the theory
proteins where f =
Combining
4 (1 2 1) (B)
those is Membrane
the impulse response of the filter
proteins
of kernel methods have provided efficient algorithms to per- expected
and to enhance
denotes the total information
convolution with thatabout
filter.theAfter
problem at
pre-filtering
The table lists the seven kernels used to compare proteins, the data on which they are
form such combinations in a way that minimizes a statistical hand. One
theheight
hydropathy problem with
profiles this
(andapproach, however, is that gen-zeros to
defined,
Fig. 1.andCombining
the method for
loss
computing
datasets
function. These
similarities.
yields better
methods
The final kernel, KRND
classification
exploit semidefinite
, is included The
performance.
program- omic data ofcome
the bars
in a wide upperifof
in thevariety necessary
two plots
data areappending
formats: proportional
expression to the ROC
as a control. All kernel matrices, along with the data from which they were generated, make them equal invectors
lengtha commonly used technique notError
score
are (top)atming
available and the percentage
techniques of true the
to reduce positives
noble.gs.washington.edu/proj/sdp-svm. problem at one percentoptimiz-
of finding false positives (middle),
data are expressedfor as
the SDP/SVM or timemethod
series; using
proteinthe given kernel.
sequence 523 / 635
Example: Image classification (Harchaoui and Bach, 2007)
COREL14 dataset
1400 natural images in 14 classes
Compare kernel between histograms (H), walk kernel (W), subtree
kernel (TW), weighted subtree kernel (wTW), and a combination by
MKL (M).

Performance comparison on Corel14

0.12

0.11

0.1

Test error
0.09

0.08

0.07

0.06

0.05
H W TW wTW M
Kernels

524 / 635
MKL revisited (Bach et al., 2004)
M M
( )
X X
K = i Ki with M = i 0 , i = 1
i=1 i=1

Theorem
The solution f of
n o
min min R(f n ) + k f k2HK
M f HK

PM
is f = i=1 fi
, where (f1 , . . . , fM ) HK1 . . . HKM is the solution
of: !2
M M
!
X X
min R fi n + k fi kHKi .
f1 ,...,fM
i=1 i=1

525 / 635
Proof (1/2)

n o
min min R(f n ) + k f k2HK
M f HK
M M k f k2
( ! )
X X i HK
n i
= min min R fi +
M f1 ,...,fM i
i=1 i=1
M X k fi k2HK
( ! (M ))
X
= min R fi n + min i
f1 ,...,fM M i
i=1 i=1
!2
M M
!
X X
= min R fi n + k fi kHKi ,
f1 ,...,fM
i=1 i=1

526 / 635
Proof (2/2)
where the last equality results from:

M
!2 M
X X a2
a RM
+ , ai = inf i
,
M i
i=1 i=1

which is a direct consequence of the Cauchy-Schwarz inequality:

M M M
! 21 M
! 12
X X ai X a2 i
X
ai = i i .
i i
i=1 i=1 i=1 i=1

527 / 635
Algorithm: simpleMKL (Rakotomamonjy et al., 2008)
We want to minimize in M :
n o
min J (K ) = min maxn R (2) > K .
M M R

For a fixed M , we can compute f () = J (K ) by using a


standard solver for a single kernel to find :

J (K ) = R (2 ) > K .

From we can also compute the gradient of J (K ) with respect to


:
J (K )
= > Ki .
i
J (K ) can then be minimized on M by a projected gradient or
reduced gradient algorithm.

528 / 635
Sum kernel vs MKL
Learning with the sum kernel (uniform combination) solves
M M
( ! )
X X
n 2
min R fi + k fi kHK .
f1 ,...,fM i
i=1 i=1

Learning with MKL (best convex combination) solves


!2
M M
!
X X
min R fi n + k fi kHKi .
f1 ,...,fM
i=1 i=1

Although MKL can be thought of as optimizing a convex


combination of kernels, it is more correct to think of it as a
penalized risk minimization estimator with the group lasso penalty:
M
X
(f ) = min k fi kHKi .
f1 +...+fM =f
i=1

529 / 635
Example: ridge vs LASSO regression
Take X = Rd , and for x = (x1 , . . . , xd )> consider the rank-1 kernels:

i = 1, . . . , d , Ki x, x0 = xi xi0 .


A function fi HKi has the form fi (x) = i xi , with k fi kHKi = | i |


The sum kernel is KS (x, x0 ) = di=1 xi xi0 = x> x, a function HKS is
P
of the form f (x) = > x, with norm k f kHKS = k kRd .
Learning with the sum kernel solves a ridge regression problem:
d
( )
X
min R(X) + i2 .
Rd
i=1

Learning with MKL solves a LASSO regression problem:


!2
Xd
min R(X) + | i | .
Rd
i=1

530 / 635
Extensions (Micchelli et al., 2005)

M M
( )
X X
For r > 0 , K = i Ki with rM = i 0 , ir = 1
i=1 i=1

Theorem
The solution f of
n o
minr min R(f n ) + k f k2HK
M f HK

PM
is f = i=1 fi
, where (f1 , . . . , fM ) HK1 . . . HKM is the solution
of:
M
! M
! r +1
r

X X 2r
min R fi n + r +1
k fi kHK .
f1 ,...,fM i
i=1 i=1

531 / 635
Outline

1 Kernels and RKHS

2 Kernel tricks

3 Kernel Methods: Supervised Learning

4 Kernel Methods: Unsupervised Learning

5 The Kernel Jungle

6 Open Problems and Research Topics


Multiple Kernel Learning (MKL)
Large-scale learning with kernels
Deep learning with kernels

532 / 635
Outline

6 Open Problems and Research Topics


Multiple Kernel Learning (MKL)
Large-scale learning with kernels
Motivation
Interlude: Large-scale learning with linear models
Nystrom approximations
Random Fourier features
Deep learning with kernels

533 / 635
Motivation
Main problem
All methods we have seen require computing the n n Gram matrix,
which is infeasible when n is significantly greater than 100 000 both in
terms of memory and computation.

Solutions
low-rank approximation of the kernel;
random Fourier features.
The goal is to find an approximate embedding : X Rd such that

K (x, x0 ) h(x), (x0 )iRd .

and use large-scale optimization techniques dedicated to linear models!

534 / 635
Motivation
Then, functions f in H may be approximated by linear ones in Rd , e.g.,.
n
* n +
X X
f (x) = i K (xi , x) i (xi ), (x) = hw, (x)iRd .
i=1 i=1 Rd

Then, the ERM problem


n
1X
min L(yi , f (xi )) + kf k2H ,
f H n
i=1

becomes, approximately,
n
1X
min L(yi , w> (xi )) + kwk22 ,
wRd n
i=1

which we know how to solve when n is large.

535 / 635
Outline

6 Open Problems and Research Topics


Multiple Kernel Learning (MKL)
Large-scale learning with kernels
Motivation
Interlude: Large-scale learning with linear models
Nystrom approximations
Random Fourier features
Deep learning with kernels

536 / 635
Interlude: Large-scale learning with linear models
Let us study for a while optimization techniques for minimizing large
sums of functions
n
1X
min fi (w).
wRd n
i=1

Good candidates are


stochastic optimization techniques;
randomized incremental optimization techniques;
We will see a couple of such algorithms with their convergence rates and
start with the (batch) gradient descent method.

537 / 635
Introduction of a few optimization principles
Why do we care about convexity?

538 / 635
Introduction of a few optimization principles
Why do we care about convexity?
Local observations give information about the global optimum
f (w)

w
b

w
b

f (w) = 0 is a necessary and sufficient optimality condition for


differentiable convex functions;
it is often easy to upper-bound f (w) f ? .
538 / 635
Introduction of a few optimization principles
An important inequality for smooth convex functions

If f is convex
f (w)

w w0
b b

w
b

f (w) f (w0 ) + f (w0 )> (w w0 );


| {z }
linear approximation
this is an equivalent definition of convexity for smooth functions.
539 / 635
Introduction of a few optimization principles
An important inequality for smooth functions

If f is L-Lipschitz continuous (f does not need to be convex)


g(w) f (w)

w w1 w0
b b b

w
b

f (w) g (w) = f (w0 ) + f (w0 )> (w w0 ) + L2 kw w0 k22 ;


g (w) = Cw0 + L2 kw0 (1/L)f (w0 ) wk22 .
540 / 635
Introduction of a few optimization principles
An important inequality for smooth functions

If f is L-Lipschitz continuous (f does not need to be convex)


g(w) f (w)

w w1 w0
b b b

w
b

f (w) g (w) = f (w0 ) + f (w0 )> (w w0 ) + L2 kw w0 k22 ;


w1 = w0 L1 f (w0 ) (gradient descent step).
540 / 635
Introduction of a few optimization principles
Gradient Descent Algorithm

Assume that f is convex and differentiable, and that f is L-Lipschitz.


Theorem
Consider the algorithm

wt wt1 L1 f (wt1 ).

Then,
Lkw0 w? k22
f (wt ) f ? .
2t

Remarks
the convergence rate improves under additional assumptions on f
(strong convexity);
some variants have a O(1/t 2 ) convergence rate [Nesterov, 2004].

541 / 635
Proof (1/2)
Proof of the main inequality for smooth functions
We want to show that for all w and z,
L
f (w) f (z) + f (z)> (w z) + kw zk22 .
2

542 / 635
Proof (1/2)
Proof of the main inequality for smooth functions
We want to show that for all w and z,
L
f (w) f (z) + f (z)> (w z) + kw zk22 .
2
By using Taylors theorem with integral form,
Z 1
f (w) f (z) = f (tw + (1 t)z)> (w z)dt.
0

Then,
Z 1
f (w)f (z)f (z)> (wz) (f (tw+(1t)z)f (z))> (wz)dt
0
Z 1
|(f (tw+(1t)z)f (z))> (wz)|dt
0
Z 1
kf (tw+(1t)z)f (z)k2 kwzk2 dt (C.-S.)
0
Z 1
L
Ltkwzk22 dt = kwzk22 .
0 2

542 / 635
Proof (2/2)
Proof of the theorem
We have shown that for all w,
L
f (w) gt (w) = f (wt1 ) + f (wt1 )> (w wt1 ) + kw wt1 k22 .
2
gt is minimized by wt ; it can be rewritten gt (w) = gt (wt ) + L2 kw wt k22 . Then,

L ?
f (wt ) gt (wt ) = gt (w? ) kw wt k22
2
L ? L
= f (wt1 ) + f (wt1 )> (w? wt1 ) + kw wt1 k22 kw? wt k22
2 2
L ? L
f?+ kw wt1 k22 kw? wt k22 .
2 2
By summing from t = 1 to T , we have a telescopic sum
T
X L ? L
T (f (wT ) f ? ) f (wt ) f ? kw w0 k22 kw? wT k22 .
t=1
2 2

543 / 635
Introduction of a few optimization principles
An important inequality for smooth and -strongly convex functions

If f is L-Lipschitz continuous and f -strongly convex


f (w)

w w0
b b

w
b

f (w) f (w0 ) + f (w0 )> (w w0 ) + L2 kw w0 k22 ;


f (w) f (w0 ) + f (w0 )> (w w0 ) + 2 kw w0 k22 ;
544 / 635
Introduction of a few optimization principles
Proposition
When f is -strongly convex, differentiable and f is L-Lipschitz, the
gradient descent algorithm with step-size 1/L produces iterates such that
 t Lkw0 w? k22
f (wt ) f ? 1 .
L 2
We call that a linear convergence rate.

545 / 635
Proof
We start from an inequality from the previous proof
L ? L
f (wt ) f (wt1 ) + f (wt1 )> (w? wt1 ) + kw wt1 k22 kw? wt k22
2 2
L ? L
f?+ kw wt1 k22 kw? wt k22 .
2 2
In addition, we have that f (wt ) f ? + 2 kwt w? k22 , and thus

L ?
kw? wt k22 kw wt1 k22
L+
  ?
1 kw wt1 k22 .
L
Finally,
L t
f (wt ) f ? kw w? k22
2
 t Lkw? w0 k22
1
L 2

546 / 635
The stochastic (sub)gradient descent algorithm
Consider now the minimization of an expectation

min f (w) = Ex [`(x, w)],


wRp

To simplify, we assume that for all x, w 7 `(x, w) is differentiable, but


everything here is true for nonsmooth functions.
Algorithm
At iteration t,
Randomly draw one example xt from the training set;
Update the current iterate

wt wt1 t w `(xt , wt1 ).

Perform online averaging of the iterates (optional)

wt (1 t )wt1 + t wt .
547 / 635
The stochastic (sub)gradient descent algorithm
There are various learning rates strategies (constant, varying step-sizes),
and averaging strategies. Depending on the problem assumptions and
choice of t , t , classical convergence rates may be obtained:

f (wt ) f ? = O(1/ t) for convex problems;
f (wt ) f ? = O(1/t) for strongly-convex ones;

Remarks
The convergence rates are not that great, but the complexity
per-iteration is small (1 gradient evaluation for minimizing an
empirical risk versus n for the batch algorithm).
When the amount of data is infinite, the method minimizes the
expected risk.
Choosing a good learning rate automatically is an open problem.

548 / 635
Randomized incremental algorithms (1/2)
Consider now the minimization of a large finite sum of smooth convex
functions:
n
1X
minp fi (w),
wR n
i=1

A class of algorithms with low per-iteration complexity have been


recently introduced that enjoy exponential (aka, linear) convergence rates
for strongly-convex problems, e.g., SAG [Schmidt et al., 2016].

SAG algorithm

n
fi (wt1 ) if i = it

t t1 X t
w w yi with yit = .
Ln yit1 otherwise
i=1

See also SAGA [Defazio et al., 2014], SVRG [Xiao and Zhang, 2014],
SDCA [Shalev-Shwartz and Zhang, 2015], MISO [Mairal, 2015];

549 / 635
Randomized incremental algorithms (2/2)
Many of these techniques are in fact performing SGD-types of steps

wt wt1 t gt ,

where E[gt |wt1 ] = f (wt1 ), but where the estimator of the gradient
has lower variance than in SGD, see SVRG [Xiao and Zhang, 2014].
Typically, these methods have the convergence rate
  t 
? 1
f (wt ) f = O 1 C max ,
n L

Remarks
their complexity per-iteration is independent of n!
unlike SGD, they are often almost parameter-free.
besides, they can be accelerated [Lin et al., 2015].
550 / 635
Large-scale learning with linear models
Conclusion
we know how to deal with huge-scale linear problems;
this is also useful to learn with kernels!

551 / 635
Outline

6 Open Problems and Research Topics


Multiple Kernel Learning (MKL)
Large-scale learning with kernels
Motivation
Interlude: Large-scale learning with linear models
Nystrom approximations
Random Fourier features
Deep learning with kernels

552 / 635
Nystrom approximations: principle
Consider a p.d. kernel K : X X R and RKHS H, with the
mapping : X H such that

K (x, x0 ) = h(x), (x0 )iH .

The Nystrom method consists of replacing any point (x) in H, for x


in X by its orthogonal projection onto a finite-dimensional subspace

F := Span (f1 , . . . , fp ) with p  n,

where the fi s are anchor points in H (to be defined later).

Motivation
This principle allows us to work explicitly in a finite-dimensional
space; it was introduced several times in the kernel literature [Williams
and Seeger, 2002], [Smola and Scholkopf, 2000], [Fine and Scheinberg, 2001].

553 / 635
Nystrom approximations: principle

The orthogonal projection is defined as


F [x] := argmin k(x) f k2H ,
f F

(x) Hilbert space H

(x0 )

554 / 635
Nystrom approximations: principle

The projection is equivalent to


2
p
X Xp
j? fj ?

F [x] := with argmin (x)

,
j fj
j=1 Rp j=1
H

and ? is the solution of the problem


p
X p
X
minp 2 j hfj , (x)iH + j l hfj , fl iH ,
R
j=1 j,l=1

or also
p
X p
X
minp 2 j fj (x) + j l hfj , fl iH .
R
j=1 j,l=1

555 / 635
Nystrom approximations: principle
Then, call [Kf ]jl = hfj , fl iH and f(x) = [f1 (x), . . . , fp (x)] in Rp . The
problem may be rewritten as
min 2 > f(x) + > Kf ,
Rp

and, assuming Kf to be non-singular to simplify, the solution is


? (x) = K1
f f(x). Then,
p
X
(x) j? (x)fj ,
j=1

and
* p p
+
X X
0
h(x), (x )iH j? (x)fj , j? (x0 )fj
j=1 j=1 H
p
X
= j? (x)l? (x0 )hfj , fl iH = ? (x)> Kf ? (x0 ).
j,l=1

556 / 635
Nystrom approximations: principle
This allows us to define the mapping
1/2 1/2
(x) = Kf ? (x) = Kf f(x),

and we have the approximation K (x, x0 ) h(x), (x0 )iRp .

Remarks
the mapping provides low-rank approximations of the kernel matrix.
Given an n n Gram matrix K computed on a training set
S = {x1 , . . . , xn }, we have

K (S)> (S),

where (S) := [(x1 ), . . . , (xn )].


the approximation has a geometric interpretation.
We need to define a good strategy for choosing the fj s.

557 / 635
Nystrom approximation via kernel PCA
Let us now try to learn the fj s given training data x1 , . . . , xn in X :
2
Xn Xp

min (xi ) ij fj
.
f1 ,...,fp H
ij R i=1 j=1
H

Using similar calculation as before, the objective is equivalent to


n
X
min 2 > >
i f(xi ) + i Kf i ,
f1 ,...,fp H
i Rp i=1

and, by minimizing with respect to all i with f fixed, we have that


i = K1
f f(xi ) (assuming Kf to be invertible), which leads to
n
X
max f(xi )> K1
f f(xi ).
f1 ,...,fp H
i=1

558 / 635
Nystrom approximation via kernel PCA
Remember the objective:
n
X
max f(xi )> K1
f f(xi ).
f1 ,...,fp H
i=1
Consider an optimal solution ?
f and
compute the eigenvalue
decomposition of Kf ? = UU> . Then, define the functions
g? (x) := [g1? (x), . . . , gp? (x)] = 1/2 U> f ? (x).
The functions gj? are points in the RKHS H since they are linear
combinations of the functions fj? in H.

559 / 635
Nystrom approximation via kernel PCA
Remember the objective:
n
X
max f(xi )> K1
f f(xi ).
f1 ,...,fp H
i=1
Consider an optimal solution ?
f and
compute the eigenvalue
decomposition of Kf ? = UU> . Then, define the functions
g? (x) := [g1? (x), . . . , gp? (x)] = 1/2 U> f ? (x).
The functions gj? are points in the RKHS H since they are linear
combinations of the functions fj? in H.

Exercise: check that all we do here and in the next slides can be
extended to deal with singular Gram matrices Kf ? and Kf .
559 / 635
Nystrom approximation via kernel PCA
Besides, by construction

[Kg? ]jl := hgj? , gl? iH


p p
* +
1 X 1 X
= p [U]kj fk? , [U]kl fk?
jj k=1 ll k=1
H
p
1 1 X
=p [U]kj [U]k 0 l hfk? , fk?0 iH
jj ll k,k 0 =1
p
1 1 X
=p [U]kj [U]k 0 l [Kf ? ]kk 0
jj ll k,k 0 =1
1 1
=p u>
j Kf ? ul
jj ll
= j=l .

560 / 635
Nystrom approximation via kernel PCA
Then, Kg? = I and g? is also a solution of the problem
n
X
max f(xi )> K1
f f(xi ),
f1 ,...,fp H
i=1

since

f ? (xi )> K1 ? ? > 1 > ?


f ? f (xi ) = f (xi ) U U f (xi )
= g? (xi )> g? (xi ) = g? (xi )> K1 ?
g? g (xi ),

and also a solution of the problem


p X
X n
max gj (xi )2 s.t. gj gk for k 6= j and kgj kH = 1.
g1 ,...,gp H
j=1 i=1

561 / 635
Nystrom approximation via kernel PCA
Then, Kg? = I and g? is also a solution of the problem
n
X
max f(xi )> K1
f f(xi ),
f1 ,...,fp H
i=1

since

f ? (xi )> K1 ? ? > 1 > ?


f ? f (xi ) = f (xi ) U U f (xi )
= g? (xi )> g? (xi ) = g? (xi )> K1 ?
g? g (xi ),

and also a solution of the problem


p X
X n
max gj (xi )2 s.t. gj gk for k 6= j and kgj kH = 1.
g1 ,...,gp H
j=1 i=1

This is the kernel PCA formulation!

561 / 635
Nystrom approximation via kernel PCA
Our first recipe with kernel PCA
Given a dataset of n training points x1 , . . . , xn in X ,
randomly choose a subset Z = [xz1 , . . . , xzm ] of m n training
points;
compute the m m kernel matrix KZ .
perform kernel PCA to find the p m largest principal directions
(parametrized by p vectors j in Rm );
Then, every point x in X may be approximated by
1/2
(x) = Kg? g? (x) = g? (x) = [g1? (x), . . . , gp? (x)]>
" m m
#>
X X
= 1i K (xzi , x), . . . , pi K (xzi , x) .
i=1 i=1

562 / 635
Nystrom approximation via kernel PCA
Remarks
The vector (x) can be interpreted as coordinates of the projection
of (x) onto the (orthogonal) PCA basis.
The complexity of training is O(m3 ) (eig decomposition of KZ ) +
O(m2 ) kernel evaluations.
The complexity of encoding a new point x is O(mp) (matrix vector
multiplication) + O(m) kernel evaluations.

563 / 635
Nystrom approximation via kernel PCA
Remarks
The vector ψ(x) can be interpreted as coordinates of the projection
of φ(x) onto the (orthogonal) kernel PCA basis.
The complexity of training is O(m³) (eigendecomposition of K_Z) +
O(m²) kernel evaluations.
The complexity of encoding a new point x is O(mp) (matrix-vector
multiplication) + O(m) kernel evaluations.
The main issue is the encoding time, which depends linearly on m ≫ p.

563 / 635
Nystrom approximation via random sampling
A popular alternative is instead to select the anchor points among the
training data points x_1, . . . , x_n, that is,

    F := span(φ(x_{z_1}), . . . , φ(x_{z_p})).

In other words, choose f_1 = φ(x_{z_1}), . . . , f_p = φ(x_{z_p}).

Second recipe with random point sampling

Given a dataset of n training points x_1, . . . , x_n in X,
randomly choose a subset Z = [x_{z_1}, . . . , x_{z_p}] of p training points;
compute the p × p kernel matrix K_Z.
Then, a new point x is encoded as

    ψ(x) = K_Z^{-1/2} f_Z(x)
         = K_Z^{-1/2} [K(x_{z_1}, x), . . . , K(x_{z_p}, x)]^⊤.

564 / 635
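A minimal sketch of this second recipe (illustrative; the kernel function and the clipping of small eigenvalues, used to handle a numerically singular K_Z, are assumptions):

import numpy as np

def nystrom_random(X, kernel, p, rng=np.random.default_rng(0)):
    Z = X[rng.choice(X.shape[0], size=p, replace=False)]        # p anchor points among the data
    evals, U = np.linalg.eigh(kernel(Z, Z))                      # eigendecomposition of K_Z
    inv_sqrt = (U / np.sqrt(np.maximum(evals, 1e-12))) @ U.T     # K_Z^{-1/2}
    # psi(x) = K_Z^{-1/2} [K(x_{z_1}, x), ..., K(x_{z_p}, x)]
    return lambda Xnew: kernel(Xnew, Z) @ inv_sqrt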
Nystrom approximation via random sampling
The complexity of training is O(p³) (eigendecomposition) + O(p²)
kernel evaluations.
The complexity of encoding a point x is O(p²) (matrix-vector
multiplication) + O(p) kernel evaluations.

565 / 635
Nystrom approximation via random sampling
The complexity of training is O(p³) (eigendecomposition) + O(p²)
kernel evaluations.
The complexity of encoding a point x is O(p²) (matrix-vector
multiplication) + O(p) kernel evaluations.
The main issue: the complexity is better, but we lose the optimality of the
kernel PCA basis, and the random choice of anchor points is not clever.

565 / 635
Nystrom approximation via greedy approach
A better approximation can be obtained with a greedy algorithm that
iteratively selects one column at a time with largest residual (Bach and
Jordan, 2002; Smola and Scholkopf, 2000; Fine and Scheinberg, 2001).

At iteration k, assume that Z = {x_{z_1}, . . . , x_{z_k}}; then, the residual for a
data point x encoded with k anchor points f_1, . . . , f_k is

    min_{β ∈ R^k}  ‖ φ(x) − Σ_{j=1}^k β_j f_j ‖²_H ,

which is equal to

    ‖φ(x)‖²_H − f_Z(x)^⊤ K_Z^{-1} f_Z(x),

and since f_j = φ(x_{z_j}) for all j, the data point x_i with largest residual is
the one that maximizes

    K(x_i, x_i) − f_Z(x_i)^⊤ K_Z^{-1} f_Z(x_i)   with   f_Z(x_i) = [K(x_{z_1}, x_i), . . . , K(x_{z_k}, x_i)]^⊤.

566 / 635
Nystrom approximation via greedy approach
This brings us to the following algorithm
Third recipe with greedy anchor point selection
Initialize Z = ∅. For k = 1, . . . , p do
data point selection:

    z_k ∈ argmax_{i ∈ {1,…,n}}  K(x_i, x_i) − f_Z(x_i)^⊤ K_Z^{-1} f_Z(x_i);

update the set Z:

    Z ← Z ∪ {x_{z_k}}.

Remarks
A naive implementation costs O(k²n + k³) at every iteration.
To get a reasonable complexity, one has to use simple linear algebra
tricks (see next slide).

567 / 635
Nystrom approximation via greedy approach
If Z' = Z ∪ {z},

    K_{Z'} = [[ K_Z, f_Z(z) ], [ f_Z(z)^⊤, K(z,z) ]]   and
    K_{Z'}^{-1} = [[ K_Z^{-1} + (1/s) b b^⊤, −(1/s) b ], [ −(1/s) b^⊤, 1/s ]],

where s is the Schur complement s = K(z,z) − f_Z(z)^⊤ K_Z^{-1} f_Z(z), and
b = K_Z^{-1} f_Z(z).

Complexity analysis
K_{Z'}^{-1} can be obtained from K_Z^{-1} and f_Z(z) in O(k²) float operations;
for that we need to always keep in memory the n vectors f_Z(x_i).
Updating the f_{Z'}(x_i)'s from the f_Z(x_i)'s requires n kernel evaluations.
The total training complexity is O(p²n) float operations and O(pn)
kernel evaluations.

568 / 635
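A sketch of this greedy selection with the block-inverse update above (illustrative; for simplicity the diagonal K(x_i, x_i) is read off a full Gram matrix here, whereas a large-scale implementation would evaluate only the diagonal and keep the f_Z(x_i) vectors up to date):

import numpy as np

def greedy_nystrom(X, kernel, p):
    n = X.shape[0]
    diag = np.diag(kernel(X, X)).copy()              # K(x_i, x_i) for all i
    z = int(np.argmax(diag))                         # first anchor: largest residual (empty Z)
    Z = [z]
    F = kernel(X, X[z:z + 1])                        # F[i] = f_Z(x_i), shape (n, 1)
    KZ_inv = np.array([[1.0 / max(diag[z], 1e-12)]])
    for _ in range(1, p):
        resid = diag - np.einsum('ij,jk,ik->i', F, KZ_inv, F)   # residual of each data point
        z = int(np.argmax(resid))
        f_z = F[z]
        s = max(diag[z] - f_z @ KZ_inv @ f_z, 1e-12)             # Schur complement
        b = KZ_inv @ f_z
        # block inverse of [[K_Z, f_Z(z)], [f_Z(z)^T, K(z, z)]]
        KZ_inv = np.block([[KZ_inv + np.outer(b, b) / s, -b[:, None] / s],
                           [-b[None, :] / s,             np.array([[1.0 / s]])]])
        Z.append(z)
        F = np.hstack([F, kernel(X, X[z:z + 1])])                # n new kernel evaluations
    return np.array(Z), KZ_inv, F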
Nystrom approximation via K-means
When X = R^d, it is also possible to synthesize points z_1, . . . , z_p such
that they represent well the training data x_1, . . . , x_n, leading to the
Clustered Nystrom approximation (Zhang and Kwok, 2008).

Fourth recipe with K-means


1 Perform the regular K-means algorithm on the training data, to
obtain p centroids z_1, . . . , z_p in R^d.
2 Define the anchor points fj = (zj ) for j = 1, . . . , p, and perform
the classical Nystrom approximation.

Remarks
The complexity is the same as Nystrom with random selection
(except for the K-means step);
The method is data-dependent and can significantly outperform the
other variants in practice.

569 / 635
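A short sketch of this fourth recipe (illustrative; it assumes scikit-learn for the K-means step, and the kernel function is a generic assumption):

import numpy as np
from sklearn.cluster import KMeans

def nystrom_kmeans(X, kernel, p):
    # synthesize p anchor points as K-means centroids of the training data
    Z = KMeans(n_clusters=p, n_init=10, random_state=0).fit(X).cluster_centers_
    evals, U = np.linalg.eigh(kernel(Z, Z))
    inv_sqrt = (U / np.sqrt(np.maximum(evals, 1e-12))) @ U.T     # K_Z^{-1/2}
    return lambda Xnew: kernel(Xnew, Z) @ inv_sqrt               # classical Nystrom encoding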
Nystrom approximation: conclusion
Concluding remarks
The greedy selection rule is equivalent to computing an incomplete
Cholesky factorization of the kernel matrix (Bach and Jordan, 2002;
Smola and Scholkopf, 2000; Fine and Scheinberg, 2001);
The techniques we have seen produce low-rank approximations of
the kernel matrix K ≈ LL^⊤;
The method admits a geometric interpretation in terms of
orthogonal projection onto a finite-dimensional subspace.
The approximation provides points in the RKHS. As such, many
operations on the mapping are valid (translations, linear
combinations, projections), unlike the method that will come next.

570 / 635
Outline

6 Open Problems and Research Topics


Multiple Kernel Learning (MKL)
Large-scale learning with kernels
Motivation
Interlude: Large-scale learning with linear models
Nystrom approximations
Random Fourier features
Deep learning with kernels

571 / 635
Random Fourier features [Rahimi and Recht, 2007] (1/5)
A large class of approximations for shift-invariant kernels is based on
sampling techniques. Consider a real-valued, positive-definite, continuous
translation-invariant kernel K(x, y) = κ(x − y) with κ : R^d → R. Then,
if κ(0) = 1, Bochner's theorem tells us that κ is a valid characteristic
function for some probability measure:

    κ(z) = E_w[e^{i w^⊤ z}].

Remember indeed that, with the right assumptions on κ,

    κ(x − y) = (1/(2π)^d) ∫_{R^d} κ̂(w) e^{i w^⊤ x} e^{−i w^⊤ y} dw,

and the probability measure admits a density q(w) = κ̂(w)/(2π)^d
(non-negative, real-valued, sums to 1 since κ(0) = 1).

572 / 635
Random Fourier features (2/5)
Then,

    κ(x − y) = (1/(2π)^d) ∫_{R^d} κ̂(w) e^{i w^⊤ x} e^{−i w^⊤ y} dw
             = ∫_{R^d} q(w) cos(w^⊤ x − w^⊤ y) dw
             = ∫_{R^d} q(w) [ cos(w^⊤ x) cos(w^⊤ y) + sin(w^⊤ x) sin(w^⊤ y) ] dw
             = ∫_{R^d} ∫_{b=0}^{2π} (q(w)/(2π)) · 2 cos(w^⊤ x + b) cos(w^⊤ y + b) dw db   (exercise)
             = E_{w∼q(w), b∼U[0,2π]} [ √2 cos(w^⊤ x + b) · √2 cos(w^⊤ y + b) ].

573 / 635
Random Fourier features (3/5)
Random Fourier features recipe
Compute the Fourier transform of the kernel and define the
probability density q(w) = κ̂(w)/(2π)^d;
Draw p i.i.d. samples w_1, . . . , w_p from q and p i.i.d. samples
b_1, . . . , b_p from the uniform distribution on [0, 2π];
define the mapping

    x ↦ ψ(x) = √(2/p) [ cos(w_1^⊤ x + b_1), . . . , cos(w_p^⊤ x + b_p) ]^⊤.

Then, we have that

    κ(x − y) ≈ ⟨ψ(x), ψ(y)⟩_{R^p}.

The two quantities are equal in expectation.

574 / 635
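A minimal sketch of the recipe for the Gaussian kernel e^{−‖x−y‖²/(2σ²)}, whose Fourier density q is N(0, I/σ²) (the sanity check at the end is an illustration, not part of the recipe):

import numpy as np

def random_fourier_features(X, sigma, p, rng=np.random.default_rng(0)):
    d = X.shape[1]
    W = rng.normal(scale=1.0 / sigma, size=(d, p))   # w_1, ..., w_p ~ q = N(0, I / sigma^2)
    b = rng.uniform(0.0, 2 * np.pi, size=p)          # b_1, ..., b_p ~ U[0, 2*pi]
    return np.sqrt(2.0 / p) * np.cos(X @ W + b)      # psi(X), one row per data point

# sanity check: <psi(x), psi(y)> should be close to exp(-||x - y||^2 / (2 sigma^2))
X = np.random.randn(5, 3)
Psi = random_fourier_features(X, sigma=1.0, p=20000)
exact = np.exp(-0.5 * ((X[:, None] - X[None, :]) ** 2).sum(-1))
print(np.abs(Psi @ Psi.T - exact).max())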
Random Fourier features (4/5)
Theorem [Rahimi and Recht, 2007]
On any compact subset X of R^m, for all ε > 0,

    P[ sup_{x,y ∈ X} |κ(x − y) − ⟨ψ(x), ψ(y)⟩_{R^p}| ≥ ε ]  ≤  2^8 ( σ_q diam(X) / ε )² e^{ −p ε² / (4(m+2)) },

where σ_q² = E_{w∼q(w)}[w^⊤ w] is the second moment of the Fourier
transform of κ.

Remarks
The convergence is uniform, not data dependent;
Take the sequence ε_p = √(log(p)/p) · σ_q diam(X); then the term on the
right converges to zero when p grows to infinity;
Prediction functions built with random Fourier features are not in H.

575 / 635
Random Fourier features (5/5)
Ingredients of the proof
For a fixed pair of points x, y, Hoeffding's inequality says that

    P[ |κ(x − y) − ⟨ψ(x), ψ(y)⟩_{R^p}| ≥ ε ]  ≤  2 e^{−p ε² / 4}.

Consider an ε-net (set of balls of radius r) that covers
X_Δ = {x − y : x, y ∈ X} with at most T = (4 diam(X)/r)^m balls.
Apply Hoeffding's inequality to the centers x_i − y_i of the balls;
Use a basic union bound, with f(x, y) := |κ(x − y) − ⟨ψ(x), ψ(y)⟩_{R^p}|:

    P[ sup_i f(x_i, y_i) ≥ ε/2 ]  ≤  Σ_i P[ f(x_i, y_i) ≥ ε/2 ]  ≤  2 T e^{−p ε² / 8}.

Glue things together: control the probability for points (x, y) inside
each ball, and adjust the radius r (a bit technical).
576 / 635
Outline

1 Kernels and RKHS

2 Kernel tricks

3 Kernel Methods: Supervised Learning

4 Kernel Methods: Unsupervised Learning

5 The Kernel Jungle

6 Open Problems and Research Topics


Multiple Kernel Learning (MKL)
Large-scale learning with kernels
Deep learning with kernels

577 / 635
Adaline: a physical neural net for least square regression

Figure: Adaline, [Widrow and Hoff, 1960]: A physical device that performs least
square regression using stochastic gradient descent.

578 / 635
A quick zoom on multilayer neural networks
The goal is to learn a prediction function f : Rp R given labeled
training data (xi , yi )i=1,...,n with xi in Rp , and yi in R:
    min_{f ∈ F}  (1/n) Σ_{i=1}^n L(y_i, f(x_i))  +  λ Ω(f) .
                 [empirical risk, data fit]         [regularization]

579 / 635
A quick zoom on multilayer neural networks
The goal is to learn a prediction function f : Rp R given labeled
training data (xi , yi )i=1,...,n with xi in Rp , and yi in R:
    min_{f ∈ F}  (1/n) Σ_{i=1}^n L(y_i, f(x_i))  +  λ Ω(f) .
                 [empirical risk, data fit]         [regularization]

What is specific to multilayer neural networks?

The neural network space F is explicitly parametrized by:

    f(x) = σ_k(A_k σ_{k−1}(A_{k−1} . . . σ_2(A_2 σ_1(A_1 x)) . . .)).

Finding the optimal A1 , A2 , . . . , Ak yields a non-convex


optimization problem in huge dimension.

580 / 635
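For concreteness, a tiny NumPy sketch of the parametrization above (the layer sizes and the choice of σ as a ReLU are arbitrary assumptions):

import numpy as np

def neural_net(x, weights, sigma=lambda u: np.maximum(u, 0.0)):
    # f(x) = sigma_k(A_k sigma_{k-1}(... sigma_1(A_1 x) ...))
    for A in weights:
        x = sigma(A @ x)
    return x

rng = np.random.default_rng(0)
weights = [rng.normal(size=(16, 5)), rng.normal(size=(8, 16)), rng.normal(size=(1, 8))]
print(neural_net(rng.normal(size=5), weights))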
A quick zoom on convolutional neural networks

Figure: Picture from LeCun et al. [1998]

CNNs perform simple operations such as convolutions, pointwise


non-linearities and subsampling.
for most successful applications of CNNs, training is supervised.

581 / 635
A quick zoom on convolutional neural networks

Figure: Picture from Yann LeCun's tutorial, based on Zeiler and Fergus [2014].

582 / 635
A quick zoom on convolutional neural networks
What are the main features of CNNs?
they capture compositional and multiscale structures in images;
they provide some invariance;
they model local stationarity of images at several scales.

583 / 635
A quick zoom on convolutional neural networks
What are the main features of CNNs?
they capture compositional and multiscale structures in images;
they provide some invariance;
they model local stationarity of images at several scales.

What are the main open problems?


very little theoretical understanding;
they require large amounts of labeled data;
they require manual design and parameter tuning;

583 / 635
A quick zoom on convolutional neural networks
What are the main features of CNNs?
they capture compositional and multiscale structures in images;
they provide some invariance;
they model local stationarity of images at several scales.

What are the main open problems?


very little theoretical understanding;
they require large amounts of labeled data;
they require manual design and parameter tuning;

Nonetheless...
they are the focus of a huge academic and industrial effort;
there is efficient and well-documented open-source software.

583 / 635
Context of kernel methods
What are the main features of kernel methods?
decoupling of data representation and learning algorithm;
a huge number of unsupervised and supervised algorithms;
typically, convex optimization problems in a supervised context;
versatility: applies to vectors, sequences, graphs, sets,. . . ;
natural regularization function to control the learning capacity;
well studied theoretical framework.

584 / 635
Context of kernel methods
What are the main features of kernel methods?
decoupling of data representation and learning algorithm;
a huge number of unsupervised and supervised algorithms;
typically, convex optimization problems in a supervised context;
versatility: applies to vectors, sequences, graphs, sets,. . . ;
natural regularization function to control the learning capacity;
well studied theoretical framework.

But...
poor scalability in n, at least O(n2 );
decoupling of data representation and learning may not be a good
thing, according to recent supervised deep learning success.

584 / 635
Context of kernel methods
Challenges
Scaling-up kernel methods with approximate feature maps;

    K(x, x') ≈ ⟨ψ(x), ψ(x')⟩.

[Williams and Seeger, 2001, Rahimi and Recht, 2007, Vedaldi and
Zisserman, 2012, Le et al., 2013]...
Design data-adaptive and task-adaptive kernels;
Build kernel hierarchies to capture compositional structures.
Introduce supervision in the kernel design.

585 / 635
Context of kernel methods
Challenges
Scaling-up kernel methods with approximate feature maps;

    K(x, x') ≈ ⟨ψ(x), ψ(x')⟩.

[Williams and Seeger, 2001, Rahimi and Recht, 2007, Vedaldi and
Zisserman, 2012, Le et al., 2013]...
Design data-adaptive and task-adaptive kernels;
Build kernel hierarchies to capture compositional structures.
Introduce supervision in the kernel design.

We need deep kernel machines!

585 / 635
Context of kernel methods
Challenges
Scaling-up kernel methods with approximate feature maps;

    K(x, x') ≈ ⟨ψ(x), ψ(x')⟩.

[Williams and Seeger, 2001, Rahimi and Recht, 2007, Vedaldi and
Zisserman, 2012, Le et al., 2013]...
Design data-adaptive and task-adaptive kernels;
Build kernel hierarchies to capture compositional structures.
Introduce supervision in the kernel design.

Remark
there exists already successful data-adaptive kernels that rely on
probabilistic models, e.g., Fisher kernel.
[Jaakkola and Haussler, 1999, Perronnin and Dance, 2007].

585 / 635
Some more motivation
Longer term objectives
build a kernel for images (abstract object), for which we can
precisely quantify the invariance, stability to perturbations,
recovery, and complexity properties.
build deep networks which can be easily regularized.
build deep networks for structured objects (graph, sequences)...
add more geometric interpretation to deep networks.
...

586 / 635
Basic principles of deep kernel machines: composition
Composition of feature spaces
Consider a p.d. kernel K1 : X² → R and its RKHS H1 with mapping
φ1 : X → H1. Consider also a p.d. kernel K2 : H1² → R and its RKHS
H2 with mapping φ2 : H1 → H2. Then, K3 : X² → R below is also p.d.:

    K3(x, x') = K2(φ1(x), φ1(x')),

and its RKHS mapping is φ3 = φ2 ∘ φ1.

587 / 635
Basic principles of deep kernel machines: composition
Composition of feature spaces
Consider a p.d. kernel K1 : X² → R and its RKHS H1 with mapping
φ1 : X → H1. Consider also a p.d. kernel K2 : H1² → R and its RKHS
H2 with mapping φ2 : H1 → H2. Then, K3 : X² → R below is also p.d.:

    K3(x, x') = K2(φ1(x), φ1(x')),

and its RKHS mapping is φ3 = φ2 ∘ φ1.

Examples

    K3(x, x') = e^{ −(1/(2σ²)) ‖φ1(x) − φ1(x')‖²_{H1} }.

    K3(x, x') = ⟨φ1(x), φ1(x')⟩²_{H1} = K1(x, x')².

587 / 635
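The first example can be computed purely from K1 with the kernel trick, since ‖φ1(x) − φ1(x')‖²_{H1} = K1(x,x) + K1(x',x') − 2 K1(x,x'); a small sketch (the polynomial base kernel is just an illustrative assumption):

import numpy as np

def compose_rbf_on_rkhs(K1, sigma):
    # Gram matrix of K3(x, x') = exp(-||phi1(x) - phi1(x')||^2_{H1} / (2 sigma^2))
    d = np.diag(K1)
    sq_dists = d[:, None] + d[None, :] - 2 * K1
    return np.exp(-sq_dists / (2 * sigma ** 2))

X = np.random.randn(10, 4)
K1 = (X @ X.T + 1.0) ** 2                 # base kernel: polynomial of degree 2
K3 = compose_rbf_on_rkhs(K1, sigma=1.0)
print(np.linalg.eigvalsh(K3).min())       # K3 remains p.d. (up to numerical precision)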
Basic principles of deep kernel machines: composition
Remarks on the composition of feature spaces
we can iterate the process many times.
the idea appears early in the literature of kernel methods [see
Scholkopf et al., 1998, for a multilayer variant of kernel PCA].

Is this idea sufficient to make kernel methods more powerful?

588 / 635
Basic principles of deep kernel machines: composition
Remarks on the composition of feature spaces
we can iterate the process many times.
the idea appears early in the literature of kernel methods [see
Scholkopf et al., 1998, for a multilayer variant of kernel PCA].

Is this idea sufficient to make kernel methods more powerful?


Probably not:
K2 is doomed to be a simple kernel (dot-product or RBF kernel).
it does not address any of the challenges we have listed before.
K3 and K1 operate on the same type of object; it is not clear why
designing K3 is easier than designing K1 directly.

588 / 635
Basic principles of deep kernel machines: composition
Remarks on the composition of feature spaces
we can iterate the process many times.
the idea appears early in the literature of kernel methods [see
Scholkopf et al., 1998, for a multilayer variant of kernel PCA].

Is this idea sufficient to make kernel methods more powerful?


Probably not:
K2 is doomed to be a simple kernel (dot-product or RBF kernel).
it does not address any of the challenges we have listed before.
K3 and K1 operate on the same type of object; it is not clear why
designing K3 is easier than designing K1 directly.
Nonetheless, we will see later that this idea can be used to build a
hierarchy of kernels that operate on more and more complex objects.

588 / 635
Basic principles of deep kernel machines: infinite NN
A large class of kernels on R^p may be defined as an expectation

    K(x, y) = E_w[ s(w^⊤ x) s(w^⊤ y) ],

where s : R → R is a nonlinear function. The encoding can be seen as a
one-layer neural network with an infinite number of random weights.

589 / 635
Basic principles of deep kernel machines: infinite NN
A large class of kernels on R^p may be defined as an expectation

    K(x, y) = E_w[ s(w^⊤ x) s(w^⊤ y) ],

where s : R → R is a nonlinear function. The encoding can be seen as a
one-layer neural network with an infinite number of random weights.
Examples
random Fourier features

    κ(x − y) = E_{w∼q(w), b∼U[0,2π]} [ √2 cos(w^⊤ x + b) · √2 cos(w^⊤ y + b) ];

Gaussian kernel

    e^{ −(1/(2σ²)) ‖x−y‖²_2 } ∝ E_w[ e^{ (2/σ²) w^⊤ x } e^{ (2/σ²) w^⊤ y } ]   with   w ∼ N(0, (σ²/4) I).

589 / 635
Basic principles of deep kernel machines: infinite NN
Example: arc-cosine kernels

    K(x, y) ∝ E_w[ max(w^⊤ x, 0) max(w^⊤ y, 0) ]   with   w ∼ N(0, I),

for x, y on the hyper-sphere S^{m−1}. Interestingly, the non-linearities s are
typical ones from the neural network literature.
s(u) = max(0, u) (rectified linear units) leads to
K1(x, y) = sin(θ) + (π − θ) cos(θ) with θ = cos^{-1}(x^⊤ y);
s(u) = max(0, u)² (squared rectified linear units) leads to
K2(x, y) = 3 sin(θ) cos(θ) + (π − θ)(1 + 2 cos²(θ));

Remarks
infinite neural nets were discovered by Neal, 1994; then revisited
many times [Le Roux, 2007, Cho and Saul, 2009].
the concept does not lead to more powerful kernel methods...
590 / 635
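A quick Monte Carlo check of the first formula (illustrative; the normalization 1/(2π) below is the constant that makes the identity exact for unit-norm inputs, which the slide leaves implicit in the ∝ sign):

import numpy as np

rng = np.random.default_rng(0)
m = 4
x = rng.normal(size=m)
x /= np.linalg.norm(x)                             # x, y on the sphere S^{m-1}
y = rng.normal(size=m)
y /= np.linalg.norm(y)

theta = np.arccos(np.clip(x @ y, -1.0, 1.0))
closed_form = (np.sin(theta) + (np.pi - theta) * np.cos(theta)) / (2 * np.pi)

W = rng.normal(size=(1_000_000, m))                # w ~ N(0, I)
monte_carlo = np.mean(np.maximum(W @ x, 0.0) * np.maximum(W @ y, 0.0))
print(closed_form, monte_carlo)                    # the two values should be close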
Basic principles of DKM: dot-product kernels
Another basic link between kernels and neural networks can be obtained
by considering dot-product kernels.
Proposition
Let X = S^{d−1} be the unit sphere of R^d. The kernel K : X² → R,

    K(x, y) = κ(⟨x, y⟩_{R^d}),

is positive definite if and only if κ is smooth and its Taylor expansion
coefficients are non-negative.

Remarks
the proposition holds if X is the unit sphere of some Hilbert space
and ⟨x, y⟩_{R^d} is replaced by the corresponding inner-product.

591 / 635
Basic principles of DKM: dot-product kernels
The Nystrom method consists of replacing any point φ(x) in H, for x
in X, by its orthogonal projection onto a finite-dimensional subspace

    F = span(φ(z1), . . . , φ(zp)),

for some anchor points Z = [z1, . . . , zp] in R^{d×p}.

[Figure: orthogonal projection of φ(x) and φ(x') onto the subspace F in the Hilbert space H.]

[Williams and Seeger, 2001, Smola and Scholkopf, 2000, Fine and Scheinberg, 2001].
592 / 635
Basic principles of DKM: dot-product kernels
The projection is equivalent to

    Π_F[x] := Σ_{j=1}^p β_j* φ(zj)   with   β* ∈ argmin_{β ∈ R^p} ‖ φ(x) − Σ_{j=1}^p β_j φ(zj) ‖²_H .

Then, it is possible to show that, with K(x, y) = κ(⟨x, y⟩_{R^d}),

    K(x, y) ≈ ⟨Π_F[x], Π_F[y]⟩_H = ⟨ψ(x), ψ(y)⟩_{R^p},

with

    ψ(x) = κ(Z^⊤Z)^{−1/2} κ(Z^⊤x),

where the function κ is applied pointwise to its arguments. The resulting ψ
can be interpreted as a neural network performing (i) a linear operation,
(ii) a pointwise non-linearity, (iii) a linear operation.

593 / 635
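A sketch of this finite-dimensional map for the exponential dot-product kernel κ(u) = e^{α(u−1)} (the choice of κ, the random anchors and the use of scipy for the matrix inverse square root are assumptions):

import numpy as np
from scipy.linalg import sqrtm, pinv

def dot_product_nystrom(Z, kappa):
    # psi(x) = kappa(Z^T Z)^{-1/2} kappa(Z^T x): linear op, pointwise non-linearity, linear op
    A = np.real(pinv(sqrtm(kappa(Z.T @ Z))))        # p x p matrix kappa(Z^T Z)^{-1/2}
    return lambda X: kappa(X @ Z) @ A               # rows of X are (normalized) inputs

alpha = 1.0
kappa = lambda u: np.exp(alpha * (u - 1.0))
rng = np.random.default_rng(0)
Z = rng.normal(size=(8, 32))
Z /= np.linalg.norm(Z, axis=0)                      # anchors z_1, ..., z_p on the sphere
X = rng.normal(size=(5, 8))
X /= np.linalg.norm(X, axis=1, keepdims=True)
Psi = dot_product_nystrom(Z, kappa)(X)
print(np.abs(Psi @ Psi.T - kappa(X @ X.T)).max())   # error of the approximation of K(x, y) = kappa(<x, y>)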
Convolutional kernel networks [Mairal et al., 2014, Mairal, 2016]

The (happy?) marriage of kernel methods and CNNs


1 a multilayer convolutional kernel for images: A hierarchy of
kernels for local image neighborhoods (aka, receptive fields).
2 unsupervised scheme for large-scale learning: the kernel being
too computationally expensive, the Nystrom approximation at each
layer yields a new type of unsupervised deep neural network.
3 end-to-end learning: learning subspaces in the RKHSs can be
achieved with a supervised loss function.

The model can be seen as an interpolation between a neural net and


a classical kernel method.

594 / 635
Related work
proof of concept for combining kernels and deep learning [Cho and
Saul, 2009];
hierarchical kernel descriptors [Bo et al., 2011];
other multilayer models [Bouvrie et al., 2009, Montavon et al., 2011,
Anselmi et al., 2015];
deep Gaussian processes [Damianou and Lawrence, 2013].
multilayer PCA [Scholkopf et al., 1998].
old kernels for images [Scholkopf, 1997].
RBF networks [Broomhead and Lowe, 1988].

595 / 635
The multilayer convolutional kernel
Definition: image feature maps
An image feature map is a function I : Ω → H, where Ω is a 2D grid
representing coordinates in the image and H is a Hilbert space.

[Figure: the maps I0 : Ω0 → H0 (input), I0.5 : Ω0 → H1 (each patch of I0 is
mapped to H1 by the kernel trick), and I1 : Ω1 → H1 (obtained from I0.5 by
linear pooling).]

596 / 635
The multilayer convolutional kernel
Definition: image feature maps
An image feature map is a function I : Ω → H, where Ω is a 2D grid
representing coordinates in the image and H is a Hilbert space.

Motivation and examples


Each point I(ω) carries information about an image neighborhood,
which is motivated by the local stationarity of natural images.
We will construct a sequence of maps I0, . . . , Ik. Going up in the
hierarchy yields larger receptive fields with more invariance.
I0 may simply be the input image, where H0 = R^3 for RGB.

597 / 635
The multilayer convolutional kernel
Definition: image feature maps
An image feature map is a function I : Ω → H, where Ω is a 2D grid
representing coordinates in the image and H is a Hilbert space.

Motivation and examples


Each point I(ω) carries information about an image neighborhood,
which is motivated by the local stationarity of natural images.
We will construct a sequence of maps I0, . . . , Ik. Going up in the
hierarchy yields larger receptive fields with more invariance.
I0 may simply be the input image, where H0 = R^3 for RGB.

How do we go from I0 : Ω0 → H0 to I1 : Ω1 → H1?

597 / 635
The multilayer convolutional kernel
Definition: image feature maps
An image feature map is a function I : Ω → H, where Ω is a 2D grid
representing coordinates in the image and H is a Hilbert space.

Motivation and examples


Each point I(ω) carries information about an image neighborhood,
which is motivated by the local stationarity of natural images.
We will construct a sequence of maps I0, . . . , Ik. Going up in the
hierarchy yields larger receptive fields with more invariance.
I0 may simply be the input image, where H0 = R^3 for RGB.

How do we go from I0 : Ω0 → H0 to I1 : Ω1 → H1?

First, define a p.d. kernel on patches of I0 !

597 / 635
The multilayer convolutional kernel
Going from I0 to I0.5: kernel trick
Patches of size e0 × e0 can be defined as elements of the Cartesian
product P0 := H0^{e0×e0}, endowed with its natural inner-product.
Define a p.d. kernel on such patches: for all x, x' in P0,

    K1(x, x') = ‖x‖_{P0} ‖x'‖_{P0} κ1( ⟨x, x'⟩_{P0} / (‖x‖_{P0} ‖x'‖_{P0}) )   if x, x' ≠ 0, and 0 otherwise.

Note that for y, y' normalized, we may choose

    κ1(⟨y, y'⟩_{P0}) = e^{ α(⟨y, y'⟩_{P0} − 1) } = e^{ −(α/2) ‖y − y'‖²_{P0} } .

We call H1 the RKHS and define a mapping φ1 : P0 → H1.


Then, we may define the map I0.5 : Ω0 → H1 that carries the
representations in H1 of the patches from I0 at all locations in Ω0.

598 / 635
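A small sketch of K1 on two patches, with κ1(u) = e^{α(u−1)}; patches are flattened to vectors, which assumes H0 is finite-dimensional (e.g., RGB at layer 0):

import numpy as np

def patch_kernel(x, xp, alpha=1.0):
    # K1(x, x') = ||x|| ||x'|| kappa1(<x, x'> / (||x|| ||x'||)), and 0 if a patch is zero
    nx, nxp = np.linalg.norm(x), np.linalg.norm(xp)
    if nx == 0.0 or nxp == 0.0:
        return 0.0
    cos = float(np.dot(x, xp)) / (nx * nxp)
    # for normalized patches, kappa1(<y, y'>) = exp(alpha (<y, y'> - 1)) = exp(-alpha/2 ||y - y'||^2)
    return nx * nxp * np.exp(alpha * (cos - 1.0))

x = np.ones(3 * 3 * 3)                     # a 3x3 RGB patch, flattened
print(patch_kernel(x, 2 * x))              # equals ||x|| ||2x|| for two proportional patches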
The multilayer convolutional kernel

[Figure: the maps I0 : Ω0 → H0 (input), I0.5 : Ω0 → H1 (each patch of I0 is
mapped to H1 by the kernel trick), and I1 : Ω1 → H1 (obtained from I0.5 by
linear pooling).]

How do we go from I0.5 : Ω0 → H1 to I1 : Ω1 → H1?

599 / 635
The multilayer convolutional kernel

[Figure: the maps I0 : Ω0 → H0 (input), I0.5 : Ω0 → H1 (each patch of I0 is
mapped to H1 by the kernel trick), and I1 : Ω1 → H1 (obtained from I0.5 by
linear pooling).]

How do we go from I0.5 : Ω0 → H1 to I1 : Ω1 → H1?


Linear pooling!
599 / 635
The multilayer convolutional kernel
Going from I0.5 to I1: linear pooling
For all ω in Ω1:

    I1(ω) = Σ_{ω' ∈ Ω0} I0.5(ω') e^{ −β1 ‖ω' − ω‖²_2 }.

The Gaussian weight can be interpreted as an anti-aliasing filter for
downsampling the map I0.5 to a different resolution.
Linear pooling is compatible with the kernel interpretation: linear
combinations of points in the RKHS are still points in the RKHS.

Finally,
we may now repeat the process and build I0, I1, . . . , Ik,
and obtain the multilayer convolutional kernel

    K(Ik, Ik') = Σ_{ω ∈ Ωk} ⟨Ik(ω), Ik'(ω)⟩_{Hk}.
600 / 635
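In practice the pooling is applied to the finite-dimensional maps produced by the projections of the next slides; a sketch on an array M of shape (height, width, p), where β and the subsampling stride are assumptions:

import numpy as np

def gaussian_pooling(M, beta, stride):
    # I1(omega) = sum_{omega'} I0.5(omega') exp(-beta ||omega' - omega||^2), on a subsampled grid
    h, w, p = M.shape
    ys, xs = np.arange(h), np.arange(w)
    wy = np.exp(-beta * (ys[::stride][:, None] - ys[None, :]) ** 2)   # (h_out, h)
    wx = np.exp(-beta * (xs[::stride][:, None] - xs[None, :]) ** 2)   # (w_out, w)
    # the Gaussian weight is separable: pool along rows, then along columns
    return np.einsum('ai,bj,ijp->abp', wy, wx, M)

M = np.random.randn(16, 16, 32)
print(gaussian_pooling(M, beta=0.5, stride=2).shape)   # (8, 8, 32)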
The multilayer convolutional kernel
In summary
The multilayer convolutional kernel builds upon similar principles as
a convolutional neural net (multiscale, local stationarity).
It remains a conceptual object due to its high complexity.
Learning and modelling are still decoupled.
Let us first address the second point (scalability).

601 / 635
Unsupervised learning for convolutional kernel networks
Learn finite-dimensional linear subspaces onto which we project the data.

[Figure labels: a patch x of M0 is mapped by the kernel trick to φ1(x) in the Hilbert
space H1, projected onto the subspace F1 to give ψ1(x) (map M0.5), then linearly
pooled to form M1.]

Figure: The convolutional kernel network model between layers 0 and 1.

602 / 635
Unsupervised learning for convolutional kernel networks
Formally, this means using the Nystrom approximation:
We now manipulate finite-dimensional maps Mj : Ωj → R^{pj}.
Every linear subspace is parametrized by anchor points

    Fj := Span( φ(z_{j,1}), . . . , φ(z_{j,pj}) ),

where the z_{j,i}'s are in R^{p_{j−1} e_{j−1}²} for patches of size e_{j−1} × e_{j−1}.
The encoding function at layer j is

    ψ_j(x) := ‖x‖ κ_j(Z_j^⊤ Z_j)^{−1/2} κ_j( Z_j^⊤ x / ‖x‖ )   if x ≠ 0, and 0 otherwise,

where Z_j = [z_{j,1}, . . . , z_{j,pj}] and ‖·‖ is the Euclidean norm.


The interpretation is: convolution with the filters Z_j, pointwise
non-linearity, 1×1 convolution, contrast normalization.

603 / 635
Unsupervised learning for convolutional kernel networks
The pooling operation keeps points in the linear subspace Fj, and
pooling M0.5 : Ω0 → R^{p1} is equivalent to pooling I0.5 : Ω0 → H1.

[Figure labels: a patch x of M0 is mapped by the kernel trick to φ1(x) in the Hilbert
space H1, projected onto the subspace F1 to give ψ1(x) (map M0.5), then linearly
pooled to form M1.]

Figure: The convolutional kernel network model between layers 0 and 1.

604 / 635
Unsupervised learning for convolutional kernel networks
How do we learn the filters with no supervision?
we learn one layer at a time, starting from the bottom one.
we extract a large number (say 100 000) of patches from layer j − 1
computed on an image database and normalize them;
perform a spherical K-means algorithm to learn the filters Z_j;
compute the projection matrix κ_j(Z_j^⊤ Z_j)^{−1/2}.

Remarks
with kernels, we map patches in infinite dimension; with the
projection, we manipulate finite-dimensional objects.
we obtain an unsupervised convolutional net with a geometric
interpretation, where we perform projections in the RKHSs.

605 / 635
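A sketch of the spherical K-means step on a matrix of extracted patches, one patch per row (the number of iterations and the random initialization are assumptions):

import numpy as np

def spherical_kmeans(P, p, n_iter=20, rng=np.random.default_rng(0)):
    # P: (n_patches, dim) matrix of patches, normalized to unit norm
    P = P / np.maximum(np.linalg.norm(P, axis=1, keepdims=True), 1e-12)
    Z = P[rng.choice(P.shape[0], size=p, replace=False)].copy()
    for _ in range(n_iter):
        assign = np.argmax(P @ Z.T, axis=1)                  # closest centroid by cosine similarity
        for j in range(p):
            members = P[assign == j]
            if len(members) > 0:
                zj = members.sum(axis=0)
                Z[j] = zj / max(np.linalg.norm(zj), 1e-12)   # centroids stay on the unit sphere
    return Z                                                 # rows are the learned filters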
Unsupervised learning for convolutional kernel networks
Remark on input image pre-processing
Unsupervised CKNs are sensitive to pre-processing; we have tested
RAW RGB input;
local centering of every color channel;
local whitening of each color channel;
2D image gradients.

(a) RAW RGB (b) centering


606 / 635
Unsupervised learning for convolutional kernel networks
Remark on input image pre-processing
Unsupervised CKNs are sensitive to pre-processing; we have tested
RAW RGB input;
local centering of every color channel;
local whitening of each color channel;
2D image gradients.

(c) RAW RGB (d) whitening


606 / 635
Unsupervised learning for convolutional kernel networks
Remark on pre-processing with image gradients and 1×1 patches
Every pixel/patch can be represented as a two-dimensional vector

    x = ρ [cos(θ), sin(θ)],

where ρ = ‖x‖ is the gradient intensity and θ is the orientation.


A natural choice of filters Z would be

    z_j = [cos(θ_j), sin(θ_j)]   with   θ_j = 2jπ/p0.

Then, the vector ψ(x) = ‖x‖ κ1(Z^⊤Z)^{−1/2} κ1( Z^⊤ x / ‖x‖ ) can be
interpreted as a soft-binning of the gradient orientation.
After pooling, the representation of this first layer is very close
to SIFT/HOG descriptors [see Bo et al., 2011].

607 / 635
Convolutional kernel networks with supervised learning
How do we learn the filters with supervision?
Given a kernel K and RKHS H, the ERM objective is
    min_{f ∈ H}  (1/n) Σ_{i=1}^n L(y_i, f(x_i))  +  (λ/2) ‖f‖²_H .
                 [empirical risk, data fit]          [regularization]

here, we use the parametrized kernel

    K_Z(I0, I0') = Σ_{ω ∈ Ωk} ⟨Mk(ω), Mk'(ω)⟩ = ⟨Mk, Mk'⟩_F,

and we obtain the simple formulation

    min_{W ∈ R^{pk × |Ωk|}}  (1/n) Σ_{i=1}^n L(y_i, ⟨W, Mk^i⟩_F)  +  (λ/2) ‖W‖²_F .      (4)
608 / 635
Convolutional kernel networks with supervised learning
How do we learn the filters with supervision?
we jointly optimize w.r.t. Z (set of filters) and W.
we alternate between the optimization of Z and of W;
for W, the problem is strongly-convex and can be tackled with
recent algorithms that are much faster than SGD;
for Z, we derive backpropagation rules and use classical tricks for
learning CNNs (SGD+momentum);
The only tricky part is to differentiate κ_j(Z_j^⊤ Z_j)^{−1/2} w.r.t. Z_j, which is a
non-standard operation in classical CNNs.

609 / 635
Convolutional kernel networks
In summary
a multilayer kernel for images, which builds upon similar principles
as a convolutional neural net (multiscale, local stationarity).
A new type of convolutional neural network with a geometric
interpretation: orthogonal projections in RKHS.
Learning may be unsupervised: align subspaces with data.
Learning may be supervised: subspace learning in RKHSs.

610 / 635
Image classification
Experiments were conducted on classical deep learning datasets, on
CPUs with no model averaging and no data augmentation.
Dataset     #classes   im. size   n_train    n_test
CIFAR-10    10         32×32      50 000     10 000
SVHN        10         32×32      604 388    26 032

Figure: Error rates in percent, from the NIPS'16 paper.

Remarks on CIFAR-10
10% is the standard good result for single model with no data
augmentation.
the best unsupervised architecture has two layers, is wide
(1024-16384 filters), and achieves 14.2%;
611 / 635
Image super-resolution
The task is to predict a high-resolution image x from a low-resolution
one y. This may be formulated as a multivariate regression problem.

(a) Low-resolution y (b) High-resolution x

612 / 635
Image super-resolution
The task is to predict a high-resolution image x from a low-resolution
one y. This may be formulated as a multivariate regression problem.

(c) Low-resolution y (d) Bicubic interpolation

612 / 635
Image super-resolution
Fact.   Dataset   Bicubic   SC      CNN     CSCN    SCKN
        Set5      33.66     35.78   36.66   36.93   37.07
x2      Set14     30.23     31.80   32.45   32.56   32.76
        Kodim     30.84     32.19   32.80   32.94   33.21
        Set5      30.39     31.90   32.75   33.10   33.08
x3      Set14     27.54     28.67   29.29   29.41   29.50
        Kodim     28.43     29.21   29.64   29.76   29.88

Table: Reconstruction accuracy for super-resolution in PSNR (the higher, the


better). All CNN approaches are without data augmentation at test time.
Remarks
CNN is a vanilla CNN;
Very recent work does better with very deep CNNs and residual
learning [Kim et al., 2016];
CSCN combines ideas from sparse coding and CNNs;

[Zeyde et al., 2010, Dong et al., 2016, Wang et al., 2015, Kim et al., 2016].

613 / 635
Image super-resolution

Bicubic Sparse coding CNN SCKN (Ours)


Figure: Results for x3 upscaling.

614 / 635
Image super-resolution

Figure: Bicubic
615 / 635
Image super-resolution

Figure: SCKN
615 / 635
Image super-resolution

Bicubic Sparse coding CNN SCKN (Ours)


Figure: Results for x3 upscaling.

616 / 635
Image super-resolution

Figure: Bicubic
617 / 635
Image super-resolution

Figure: SCKN
617 / 635
Image super-resolution

Bicubic Sparse coding CNN SCKN (Ours)


Figure: Results for x3 upscaling.

618 / 635
Image super-resolution

Figure: Bicubic
619 / 635
Image super-resolution

Figure: SCKN
619 / 635
Image super-resolution

Bicubic CNN SCKN (Ours)


Figure: Results for x3 upscaling.

620 / 635
Image super-resolution

Figure: Bicubic

621 / 635
Image super-resolution

Figure: SCKN

621 / 635
Conclusion of the course

622 / 635
What we saw
Basic definitions of p.d. kernels and RKHS
How to use RKHS in machine learning
The importance of the choice of kernels, and how to include prior
knowledge there.
Several approaches for kernel design (there are many!)
Review of kernels for strings and on graphs
Recent research topics about kernel methods

623 / 635
What we did not see

How to automatize the process of kernel design (kernel selection?
kernel optimization?)
How to deal with non p.d. kernels.
The Bayesian view of kernel methods, called Gaussian processes.
How to do statistical testing with kernels with the kernel mean
embedding.

624 / 635
References I
F. Anselmi, L. Rosasco, C. Tan, and T. Poggio. Deep convolutional networks are hierarchical
kernel machines. arXiv preprint arXiv:1508.01084, 2015.
N. Aronszajn. Theory of reproducing kernels. Trans. Am. Math. Soc., 68:337 404, 1950.
URL http://www.jstor.org/stable/1990404.
F. R. Bach, G. R. G. Lanckriet, and M. I. Jordan. Multiple kernel learning, conic duality, and
the SMO algorithm. In Proceedings of the Twenty-First International Conference on
Machine Learning, page 6, New York, NY, USA, 2004. ACM. doi:
http://doi.acm.org/10.1145/1015330.1015424.
P. Bartlett, M. Jordan, and J. McAuliffe. Convexity, classification and risk bounds. Technical
Report 638, UC Berkeley Statistics, 2003.
C. Berg, J. P. R. Christensen, and P. Ressel. Harmonic analysis on semigroups.
Springer-Verlag, New-York, 1984.
L. Bo, K. Lai, X. Ren, and D. Fox. Object recognition with hierarchical kernel descriptors. In
Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), 2011.
K. M. Borgwardt and H.-P. Kriegel. Shortest-path kernels on graphs. In ICDM 05:
Proceedings of the Fifth IEEE International Conference on Data Mining, pages 7481,
Washington, DC, USA, 2005. IEEE Computer Society. ISBN 0-7695-2278-5. doi:
http://dx.doi.org/10.1109/ICDM.2005.132.
J. V. Bouvrie, L. Rosasco, and T. Poggio. On invariance in hierarchical models. In Adv. NIPS,
2009.

625 / 635
References II
D. S. Broomhead and D. Lowe. Radial basis functions, multi-variable functional interpolation
and adaptive networks. Technical report, DTIC Document, 1988.
Y. Cho and L. K. Saul. Kernel methods for deep learning. In Adv. NIPS, 2009.
M. Cuturi and J.-P. Vert. The context-tree kernel for strings. Neural Network., 18(4):
11111123, 2005. doi: 10.1016/j.neunet.2005.07.010. URL
http://dx.doi.org/10.1016/j.neunet.2005.07.010.
M. Cuturi, K. Fukumizu, and J.-P. Vert. Semigroup kernels on measures. J. Mach. Learn. Res.,
6:11691198, 2005. URL http://jmlr.csail.mit.edu/papers/v6/cuturi05a.html.
A. Damianou and N. Lawrence. Deep Gaussian processes. In Proc. AISTATS, 2013.
A. Defazio, F. Bach, and S. Lacoste-Julien. Saga: A fast incremental gradient method with
support for non-strongly convex composite objectives. In Advances in Neural Information
Processing Systems (NIPS), pages 16461654, 2014.
C. Dong, C. C. Loy, K. He, and X. Tang. Image super-resolution using deep convolutional
networks. IEEE T. Pattern Anal., 38(2):295307, 2016.
S. Fine and K. Scheinberg. Efficient SVM training using low-rank kernel representations. J.
Mach. Learn. Res., 2:243264, 2001.

626 / 635
References III
T. Gartner, P. Flach, and S. Wrobel. On graph kernels: hardness results and efficient
alternatives. In B. Scholkopf and M. Warmuth, editors, Proceedings of the Sixteenth
Annual Conference on Computational Learning Theory and the Seventh Annual Workshop
on Kernel Machines, volume 2777 of Lecture Notes in Computer Science, pages 129143,
Heidelberg, 2003. Springer. doi: 10.1007/b12006. URL
http://dx.doi.org/10.1007/b12006.
Z. Harchaoui and F. Bach. Image classification with segmentation graph kernels. In 2007
IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR
2007), pages 18. IEEE Computer Society, 2007. doi: 10.1109/CVPR.2007.383049. URL
http://dx.doi.org/10.1109/CVPR.2007.383049.
D. Haussler. Convolution kernels on discrete structures. Technical Report UCSC-CRL-99-10,
UC Santa Cruz, 1999.
C. Helma, T. Cramer, S. Kramer, and L. De Raedt. Data mining and machine learning
techniques for the identification of mutagenicity inducing substructures and structure
activity relationships of noncongeneric compounds. J. Chem. Inf. Comput. Sci., 44(4):
140211, 2004. doi: 10.1021/ci034254q. URL http://dx.doi.org/10.1021/ci034254q.
T. Jaakkola, M. Diekhans, and D. Haussler. A Discriminative Framework for Detecting
Remote Protein Homologies. J. Comput. Biol., 7(1,2):95114, 2000. URL
http://www.cse.ucsc.edu/research/compbio/discriminative/Jaakola2-1998.ps.

627 / 635
References IV
T. S. Jaakkola and D. Haussler. Exploiting generative models in discriminative classifiers. In
Proc. of Tenth Conference on Advances in Neural Information Processing Systems, 1999.
URL http://www.cse.ucsc.edu/research/ml/papers/Jaakola.ps.
H. Kashima, K. Tsuda, and A. Inokuchi. Marginalized kernels between labeled graphs. In
T. Faucett and N. Mishra, editors, Proceedings of the Twentieth International Conference
on Machine Learning, pages 321328, New York, NY, USA, 2003. AAAI Press.
J. Kim, J. K. Lee, and K. M. Lee. Accurate image super-resolution using very deep
convolutional networks. In Proc. CVPR, 2016.
T. Kin, K. Tsuda, and K. Asai. Marginalized kernels for RNA sequence data analysis. In
R. Lathtop, K. Nakai, S. Miyano, T. Takagi, and M. Kanehisa, editors, Genome Informatics
2002, pages 112122. Universal Academic Press, 2002. URL
http://www.jsbi.org/journal/GIW02/GIW02F012.html.
R. I. Kondor and J. Lafferty. Diffusion kernels on graphs and other discrete input. In
Proceedings of the Nineteenth International Conference on Machine Learning, pages
315322, San Francisco, CA, USA, 2002. Morgan Kaufmann Publishers Inc.
G. Lanckriet, N. Cristianini, P. Bartlett, L. El Ghaoui, and M. Jordan. Learning the kernel
matrix with semidefinite programming. J. Mach. Learn. Res., 5:2772, 2004a. URL
http://www.jmlr.org/papers/v5/lanckriet04a.html.

628 / 635
References V
G. R. G. Lanckriet, T. De Bie, N. Cristianini, M. I. Jordan, and W. S. Noble. A statistical
framework for genomic data fusion. Bioinformatics, 20(16):26262635, 2004b. doi:
10.1093/bioinformatics/bth294. URL
http://bioinformatics.oupjournals.org/cgi/content/abstract/20/16/2626.
Q. V. Le, T. Sarlos, and A. J. Smola. Fastfood - computing hilbert space expansions in
loglinear time. In Proceedings of the 30th International Conference on Machine Learning,
ICML 2013, Atlanta, GA, USA, 16-21 June 2013, volume 28 of JMLR Proceedings, pages
244252, 2013. URL http://jmlr.org/proceedings/papers/v28/le13.html.
Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document
recognition. Proceedings of the IEEE, 86(11):22782324, 1998.
C. Leslie and R. Kuang. Fast string kernels using inexact matching for protein sequences. J.
Mach. Learn. Res., 5:14351455, 2004.
C. Leslie, E. Eskin, and W. Noble. The spectrum kernel: a string kernel for SVM protein
classification. In R. B. Altman, A. K. Dunker, L. Hunter, K. Lauerdale, and T. E. Klein,
editors, Proceedings of the Pacific Symposium on Biocomputing 2002, pages 564575,
Singapore, 2002. World Scientific.
L. Liao and W. Noble. Combining Pairwise Sequence Similarity and Support Vector Machines
for Detecting Remote Protein Evolutionary and Structural Relationships. J. Comput. Biol.,
10(6):857868, 2003. URL
http://www.liebertonline.com/doi/abs/10.1089/106652703322756113.

629 / 635
References VI
H. Lin, J. Mairal, and Z. Harchaoui. A universal catalyst for first-order optimization. In
Advances in Neural Information Processing Systems (NIPS), 2015.
H. Lodhi, C. Saunders, J. Shawe-Taylor, N. Cristianini, and C. Watkins. Text classification
using string kernels. J. Mach. Learn. Res., 2:419444, 2002. URL
http://www.ai.mit.edu/projects/jmlr/papers/volume2/lodhi02a/abstract.html.
B. Logan, P. Moreno, B. Suzek, Z. Weng, and S. Kasif. A Study of Remote Homology
Detection. Technical Report CRL 2001/05, Compaq Cambridge Research laboratory, June
2001.
P. Mahe and J. P. Vert. Graph kernels based on tree patterns for molecules. Mach. Learn., 75
(1):335, 2009. doi: 10.1007/s10994-008-5086-2. URL
http://dx.doi.org/10.1007/s10994-008-5086-2.
P. Mahe, N. Ueda, T. Akutsu, J.-L. Perret, and J.-P. Vert. Extensions of marginalized graph
kernels. In R. Greiner and D. Schuurmans, editors, Proceedings of the Twenty-First
International Conference on Machine Learning (ICML 2004), pages 552559. ACM Press,
2004.
P. Mahe, N. Ueda, T. Akutsu, J.-L. Perret, and J.-P. Vert. Graph kernels for molecular
structure-activity relationship analysis with support vector machines. J. Chem. Inf. Model.,
45(4):93951, 2005. doi: 10.1021/ci050039t. URL
http://dx.doi.org/10.1021/ci050039t.
J. Mairal. Incremental majorization-minimization optimization with application to large-scale
machine learning. SIAM Journal on Optimization, 25(2):829855, 2015.

630 / 635
References VII
J. Mairal. End-to-end kernel learning with supervised convolutional kernel networks. In
Advances in Neural Information Processing Systems, pages 13991407, 2016.
J. Mairal, P. Koniusz, Z. Harchaoui, and C. Schmid. Convolutional kernel networks. In
Advances in Neural Information Processing Systems, 2014.
C. Micchelli and M. Pontil. Learning the kernel function via regularization. J. Mach. Learn.
Res., 6:10991125, 2005. URL http://jmlr.org/papers/v6/micchelli05a.html.
G. Montavon, M. L. Braun, and K.-R. Muller. Kernel analysis of deep networks. Journal of
Machine Learning Research, 12(Sep):25632581, 2011.
C. Muller. Analysis of spherical symmetries in Euclidean spaces, volume 129 of Applied
Mathematical Sciences. Springer, 1998.
Y. Nesterov. Introductory lectures on convex optimization: a basic course. Kluwer Academic
Publishers, 2004.
A. Nicholls. Oechem, version 1.3.4, openeye scientific software. website, 2005.
F. Perronnin and C. Dance. Fisher kernels on visual vocabularies for image categorization. In
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2007.
A. Rahimi and B. Recht. Random features for large-scale kernel machines. In Adv. NIPS, 2007.
A. Rakotomamonjy, F. Bach, S. Canu, and Y. Grandvalet. SimpleMKL. J. Mach. Learn. Res.,
9:24912521, 2008. URL http://jmlr.org/papers/v9/rakotomamonjy08a.html.

631 / 635
References VIII
J. Ramon and T. Gartner. Expressivity versus efficiency of graph kernels. In T. Washio and
L. De Raedt, editors, Proceedings of the First International Workshop on Mining Graphs,
Trees and Sequences, pages 6574, 2003.
F. Rapaport, A. Zynoviev, M. Dutreix, E. Barillot, and J.-P. Vert. Classification of microarray
data using gene networks. BMC Bioinformatics, 8:35, 2007. doi: 10.1186/1471-2105-8-35.
URL http://dx.doi.org/10.1186/1471-2105-8-35.
H. Saigo, J.-P. Vert, N. Ueda, and T. Akutsu. Protein homology detection using string
alignment kernels. Bioinformatics, 20(11):16821689, 2004. URL
http://bioinformatics.oupjournals.org/cgi/content/abstract/20/11/1682.
M. Schmidt, N. L. Roux, and F. Bach. Minimizing finite sums with the stochastic average
gradient. Mathematical Programming, 2016.
B. Scholkopf. Support Vector Learning. PhD thesis, Technischen Universitat Berlin, 1997.
B. Scholkopf and A. J. Smola. Learning with Kernels: Support Vector Machines,
Regularization, Optimization, and Beyond. MIT Press, Cambridge, MA, 2002. URL
http://www.learning-with-kernels.org.
B. Scholkopf, A. Smola, and K.-R. Muller. Nonlinear component analysis as a kernel
eigenvalue problem. Neural Computation, 10(5):12991319, 1998.
B. Scholkopf, K. Tsuda, and J.-P. Vert. Kernel Methods in Computational Biology. MIT
Press, The MIT Press, Cambridge, Massachussetts, 2004.

632 / 635
References IX
M. Seeger. Covariance Kernels from Bayesian Generative Models. In Adv. Neural Inform.
Process. Syst., volume 14, pages 905912, 2002.
S. Shalev-Shwartz and T. Zhang. Accelerated proximal stochastic dual coordinate ascent for
regularized loss minimization. Mathematical Programming, 2015.
J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge
University Press, New York, NY, USA, 2004a.
J. Shawe-Taylor and N. Cristianini. Kernel methods for pattern analysis. Cambridge University
Press, 2004b.
N. Shervashidze and K. M. Borgwardt. Fast subtree kernels on graphs. In Advances in Neural
Information Processing Systems, pages 16601668, 2009.
N. Shervashidze, S. Vishwanathan, T. Petri, K. Mehlhorn, and K. Borgwardt. Efficient
graphlet kernels for large graph comparison. In 12th International Conference on Artificial
Intelligence and Statistics (AISTATS), pages 488495, Clearwater Beach, Florida USA,
2009. Society for Artificial Intelligence and Statistics.
N. Shervashidze, P. Schweitzer, E. J. Van Leeuwen, K. Mehlhorn, and K. M. Borgwardt.
Weisfeiler-lehman graph kernels. The Journal of Machine Learning Research, 12:
25392561, 2011.
T. Smith and M. Waterman. Identification of common molecular subsequences. J. Mol. Biol.,
147:195197, 1981.
A. J. Smola and B. Scholkopf. Sparse greedy matrix approximation for machine learning. 2000.

633 / 635
References X
K. Tsuda, M. Kawanabe, G. Ratsch, S. Sonnenburg, and K.-R. Muller. A new discriminative
kernel from probabilistic models. Neural Computation, 14(10):23972414, 2002a. doi:
10.1162/08997660260293274. URL http://dx.doi.org/10.1162/08997660260293274.
K. Tsuda, T. Kin, and K. Asai. Marginalized Kernels for Biological Sequences. Bioinformatics,
18:S268S275, 2002b.
V. N. Vapnik. Statistical Learning Theory. Wiley, New-York, 1998.
A. Vedaldi and A. Zisserman. Efficient additive kernels via explicit feature maps. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 34(3):480492, 2012.
J.-P. Vert, H. Saigo, and T. Akutsu. Local alignment kernels for biological sequences. In
B. Scholkopf, K. Tsuda, and J. Vert, editors, Kernel Methods in Computational Biology,
pages 131154. MIT Press, The MIT Press, Cambridge, Massachussetts, 2004.
J.-P. Vert, R. Thurman, and W. S. Noble. Kernels for gene regulatory regions. In Y. Weiss,
B. Scholkopf, and J. Platt, editors, Adv. Neural. Inform. Process Syst., volume 18, pages
14011408, Cambridge, MA, 2006. MIT Press.
G. Wahba. Spline Models for Observational Data, volume 59 of CBMS-NSF Regional
Conference Series in Applied Mathematics. SIAM, Philadelphia, 1990.
Z. Wang, D. Liu, J. Yang, W. Han, and T. Huang. Deep networks for image super-resolution
with sparse prior. In Proc. ICCV, 2015.
B. Weisfeiler and A. A. Lehman. A reduction of a graph to a canonical form and an algebra
arising during this reduction. Nauchno-Technicheskaya Informatsia, Ser. 2, 9, 1968.

634 / 635
References XI
B. Widrow and M. E. Hoff. Adaptive switching circuits. In IRE WESCON convention record,
volume 4, pages 96104. New York, 1960.
C. Williams and M. Seeger. Using the Nystrom method to speed up kernel machines. In Adv.
NIPS, 2001.
L. Xiao and T. Zhang. A proximal stochastic gradient method with progressive variance
reduction. SIAM Journal on Optimization, 24(4):20572075, 2014.
Y. Yamanishi, J.-P. Vert, and M. Kanehisa. Protein network inference from multiple genomic
data: a supervised approach. Bioinformatics, 20:i363i370, 2004. URL
http://bioinformatics.oupjournals.org/cgi/reprint/19/suppl_1/i323.
M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In
European Conference on Computer Vision (ECCV), 2014.
R. Zeyde, M. Elad, and M. Protter. On single image scale-up using sparse-representations. In
Curves and Surfaces, pages 711730. 2010.

635 / 635
