Gatsby Unit
20 March 2012
So far...
◮ Introduction to RKHS
◮ RKHS based learning algorithms
◮ Kernel PCA
◮ Kernel regression
◮ SVMs for classification and regression
◮ Hypothesis testing (two-sample and independence tests)
◮ Feature selection, Clustering, ICA
◮ Representer theorem
This Lecture
Why RKHS?
How to choose an RKHS?
◮ Polynomial kernels
◮ Radial basis kernels
◮ Spline kernel
◮ Laplacian kernel
◮ A separating hyperplane $(w, b)$ satisfies $y_j (\langle w, x_j \rangle + b) \geq 0, \ \forall\, j = 1, \ldots, N$.
Maximum Margin Classifiers
◮ Popular Idea: Maximize the margin (distance from f to D):
$$\max_{w, b} \ \min_{j \in \{1, \ldots, N\}} \ \frac{|\langle w, x_j \rangle + b|}{\|w\|}$$
where $f(x) = \sum_{j=1}^{N} y_j \alpha_j \langle \Phi(x_j), \Phi(x) \rangle_{\mathcal{H}} + b$.
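◮ A minimal numerical sketch (not from the original slides) of this kernelized decision function, $f(x) = \sum_j y_j \alpha_j k(x_j, x) + b$. The dual weights alpha below are hypothetical placeholders; in practice they come from solving the SVM dual problem.

import numpy as np

def decision_function(x, X, y, alpha, b, kernel):
    # f(x) = sum_j y_j * alpha_j * k(x_j, x) + b
    return np.sum(y * alpha * kernel(X, x)) + b

gauss = lambda A, x: np.exp(-0.5 * (A - x) ** 2)  # Gaussian kernel on R
X = np.array([-2.0, -1.0, 1.0, 2.0])    # training inputs
y = np.array([-1.0, -1.0, 1.0, 1.0])    # labels in {-1, +1}
alpha = np.array([0.0, 0.5, 0.5, 0.0])  # hypothetical dual weights, illustrative only
b = 0.0
print(np.sign(decision_function(0.5, X, y, alpha, b, gauss)))  # -> 1.0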
Problem of Learning
◮ Given a loss $L$ and a distribution $P$ on $\mathcal{X} \times \mathcal{Y}$, the risk of $f$ is
$$\mathcal{R}_{L,P}(f) := \int L(y, f(x)) \, dP(x, y),$$
and the smallest achievable risk is
$$\mathcal{R}^*_{L,P} := \inf_f \mathcal{R}_{L,P}(f),$$
where the infimum is taken over the set of all measurable functions.
◮ $f^*$ is called the Bayes function and $\mathcal{R}_{L,P}(f^*)$ is called the Bayes risk.
◮ But $P$ is unknown.
Empirical Risk Minimization
◮ Since $P$ is unknown but is known through $D$, it is tempting to replace $\mathcal{R}_{L,P}(f)$ by
$$\mathcal{R}_{L,D}(f) := \frac{1}{n} \sum_{i=1}^{n} L(y_i, f(x_i)).$$
◮ Is it a good idea?
Overfitting!!
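◮ A minimal sketch of why unconstrained empirical risk minimization overfits (illustrative, not from the original slides): a high-degree polynomial drives $\mathcal{R}_{L,D}$ to (near) zero while its risk on fresh data from the same $P$ stays large.

import numpy as np

sq_loss = lambda y, fx: (y - fx) ** 2  # squared loss L(y, f(x))

def empirical_risk(f, X, y):
    # R_{L,D}(f) = (1/n) * sum_i L(y_i, f(x_i))
    return np.mean(sq_loss(y, f(X)))

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, 20)
y = X + 0.3 * rng.standard_normal(20)        # noisy linear data

coef = np.polyfit(X, y, deg=19)              # degree-19 polynomial (near-)interpolates all 20 points
interp = lambda x: np.polyval(coef, x)
print(empirical_risk(interp, X, y))          # (near) zero empirical risk

X_new = rng.uniform(-1.0, 1.0, 20)
y_new = X_new + 0.3 * rng.standard_normal(20)
print(empirical_risk(interp, X_new, y_new))  # typically far larger: overfitting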
Empirical Risk Minimization
◮ Do minimization over $\mathcal{F}$: let $f_D \in \arg\min_{f \in \mathcal{F}} \mathcal{R}_{L,D}(f)$ and $\mathcal{R}^*_{L,P,\mathcal{F}} := \inf_{f \in \mathcal{F}} \mathcal{R}_{L,P}(f)$. The excess risk then decomposes as
$$\mathcal{R}_{L,P}(f_D) - \mathcal{R}_{L,P}(f^*) = \underbrace{\mathcal{R}_{L,P}(f_D) - \mathcal{R}^*_{L,P,\mathcal{F}}}_{\text{estimation error}} + \underbrace{\mathcal{R}^*_{L,P,\mathcal{F}} - \mathcal{R}_{L,P}(f^*)}_{\text{approximation error}}$$
Approximation and Estimation Errors
[Figure: trade-off between approximation error and estimation error.]
Regularized Learning
◮ Let $\Omega$ be some functional on $\mathcal{F}$ such that for $c_1 \leq c_2$,
$$\{f \in \mathcal{F} : \Omega(f) \leq c_1\} \subset \{f \in \mathcal{F} : \Omega(f) \leq c_2\}.$$
◮ Define the regularized empirical risk minimizer (for $\lambda > 0$)
$$f_{D,\lambda} \in \arg\min_{f \in \mathcal{F}} \, \mathcal{R}_{L,D}(f) + \lambda \, \Omega(f).$$
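◮ One standard instance of this scheme (a sketch under the assumptions stated in the comments, not the slide's own example): squared loss with $\Omega(f) = \|f\|_{\mathcal{H}}^2$ in an RKHS, i.e. kernel ridge regression, whose solution has the finite form guaranteed by the representer theorem.

import numpy as np

def krr_fit(X, y, kernel, lam):
    # Regularized ERM with squared loss and Omega(f) = ||f||_H^2:
    # by the representer theorem, f = sum_i alpha_i k(x_i, .) with
    # alpha = (K + n*lam*I)^{-1} y  (kernel ridge regression).
    n = len(X)
    K = kernel(X, X)
    return np.linalg.solve(K + n * lam * np.eye(n), y)

gauss = lambda A, B: np.exp(-0.5 * (A[:, None] - B[None, :]) ** 2)
rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, 40)
y = np.sin(X) + 0.1 * rng.standard_normal(40)
alpha = krr_fit(X, y, gauss, lam=1e-2)
f = lambda x: gauss(np.atleast_1d(x), X) @ alpha  # fitted function
print(f(np.array([0.0, 1.0])))                    # predictions near sin(0), sin(1)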
Maximum Mean Discrepancy (MMD)
◮ For a class $\mathcal{F}$ of functions, define
$$\mathrm{MMD}(P, Q, \mathcal{F}) := \sup_{f \in \mathcal{F}} \left| \int f \, dP - \int f \, dQ \right|.$$
◮ For binary classification with class-conditional distributions $P$ and $Q$ (and a suitable loss),
$$\mathcal{R}^*_{L,P,\mathcal{F}} = -\mathrm{MMD}(P, Q, \mathcal{F})$$
◮ $\mathrm{MMD}(P, P, \mathcal{F}) = 0$
◮ Triangle inequality: $\mathrm{MMD}(P, R, \mathcal{F}) \leq \mathrm{MMD}(P, Q, \mathcal{F}) + \mathrm{MMD}(Q, R, \mathcal{F})$
◮ However, $\mathrm{MMD}(P, Q, \mathcal{F}) = 0 \nRightarrow P = Q$
Not all Kernels are Useful
◮ For a poorly chosen kernel $k$ (with $\mathcal{F}$ the unit ball of its RKHS $\mathcal{H}$), one can have
$$\mathrm{MMD}(P, Q, \mathcal{F}) = 0, \quad \forall\, P, Q.$$
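◮ A worked example (standard, though not spelled out on the slides): take the constant kernel $k(x, y) = 1$, whose RKHS contains only constant functions. Using the identity $\mathrm{MMD}^2(P, Q, \mathcal{F}) = \int\!\!\int k \, d(P - Q) \, d(P - Q)$ from a later slide,
$$\mathrm{MMD}^2(P, Q, \mathcal{F}) = \int_{\mathcal{X}} \int_{\mathcal{X}} 1 \, d(P - Q)(x) \, d(P - Q)(y) = \big[(P - Q)(\mathcal{X})\big]^2 = 0, \quad \forall\, P, Q,$$
since $P$ and $Q$ are both probability measures. This RKHS cannot distinguish any two distributions.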
How to choose H?
Computation: RKHS vs. Other F
◮ Suppose $\{X_1, \ldots, X_m\} \overset{\text{i.i.d.}}{\sim} P$ and $\{Y_1, \ldots, Y_n\} \overset{\text{i.i.d.}}{\sim} Q$.
◮ Define $P_m := \frac{1}{m} \sum_{i=1}^{m} \delta_{X_i}$ and $Q_n := \frac{1}{n} \sum_{i=1}^{n} \delta_{Y_i}$, where $\delta_x$ represents the Dirac measure at $x$.
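◮ When $\mathcal{F}$ is the unit ball of the RKHS of $k$ (the setting of Gretton et al., 2007), $\mathrm{MMD}(P_m, Q_n, \mathcal{F})$ has a closed form; a minimal sketch under that assumption:

import numpy as np

# MMD^2(P_m, Q_n, F) = (1/m^2) sum_{i,j} k(X_i, X_j)
#                    + (1/n^2) sum_{i,j} k(Y_i, Y_j)
#                    - (2/mn)  sum_{i,j} k(X_i, Y_j),
# i.e. only O((m+n)^2) kernel evaluations, in any dimension d.
def mmd2(X, Y, kernel):
    return kernel(X, X).mean() + kernel(Y, Y).mean() - 2.0 * kernel(X, Y).mean()

gauss = lambda A, B: np.exp(-0.5 * (A[:, None] - B[None, :]) ** 2)
rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, 200)   # sample from P
Y = rng.normal(0.5, 1.0, 300)   # sample from Q
print(mmd2(X, Y, gauss))        # small but positive: P != Q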
Error: RKHS vs. Other F
$$\left| \mathrm{MMD}(P_m, Q_n, \mathcal{F}) - \mathrm{MMD}(P, Q, \mathcal{F}) \right| = \,?$$
For general $\mathcal{F}$, the rate of convergence degrades with the dimension $d$: Curse of dimensionality!!
How to choose H?
Large RKHS: Universal Kernel/RKHS
◮ The kernel $k$ is universal if the embedding
$$\mu \mapsto \int_{\mathcal{X}} k(\cdot, x) \, d\mu(x), \quad \mu \in M_b(\mathcal{X}),$$
is injective, i.e.,
$$\int_{\mathcal{X}} k(\cdot, x) \, d\mu(x) = 0 \;\Rightarrow\; \mu = 0,$$
which is equivalent to
$$\int_{\mathcal{X}} \int_{\mathcal{X}} k(x, y) \, d\mu(x) \, d\mu(y) > 0, \quad \forall\, \mu \in M_b(\mathcal{X}) \setminus \{0\}.$$
◮ In general, though
$$\int_{\mathcal{X}} \int_{\mathcal{X}} k(x, y) \, d\mu(x) \, d\mu(y) > 0, \quad \forall\, \mu \in M_b(\mathcal{X}) \setminus \{0\}$$
is also not easy to check, for certain $\mathcal{X}$ and for certain families of $k$ the above condition is easy to check.
◮ Later: Gaussian and Spline kernels are universal; Sinc kernel is not
but is strictly positive definite.
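◮ A minimal numerical sketch (illustrative, not from the slides) of the condition above for a discrete signed measure $\mu = \sum_i a_i \delta_{x_i}$, for which the double integral reduces to a quadratic form:

import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-3.0, 3.0, 20)   # support points of a discrete signed measure mu
a = rng.standard_normal(20)      # signed weights: mu = sum_i a_i * delta_{x_i}
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)  # Gaussian kernel Gram matrix
# int int k(x, y) dmu(x) dmu(y) = a^T K a for this discrete mu
print(a @ K @ a)  # strictly positive for any nonzero a: Gaussian k is integrally s.p.d.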
MMD: What Kernels are Useful?
◮ Note that
$$\mathrm{MMD}^2(P, Q, \mathcal{F}) = \int_{\mathcal{X}} \int_{\mathcal{X}} k(x, y) \, d(P - Q)(x) \, d(P - Q)(y).$$
◮ So if the above double integral is strictly positive for every nonzero signed measure $\mu = P - Q$, then
$$\mathrm{MMD}(P, Q, \mathcal{F}) = 0 \;\Rightarrow\; P = Q \quad \text{(characteristic)}.$$
Translation Invariant Kernels on R^d
◮ $k(x, y) = \psi(x - y)$
◮ Examples: Gaussian, $e^{-\|x - y\|_2^2}$; Laplacian, $e^{-\|x - y\|_1}$
◮ Bochner's theorem: $\psi$ is positive definite if and only if
$$\psi(t) = \int_{\mathbb{R}^d} e^{-i \langle t, \omega \rangle} \, d\Lambda(\omega)$$
for some finite nonnegative measure $\Lambda$.
◮ If the support of $\Lambda$ is $\mathbb{R}^d$, then $\int\!\!\int_{\mathbb{R}^d} \psi(x - y) \, d\mu(x) \, d\mu(y) = 0$ implies $\hat{\mu} = 0$ and therefore $\mu = 0$ (since $\int\!\!\int \psi(x - y) \, d\mu(x) \, d\mu(y) = \int |\hat{\mu}(\omega)|^2 \, d\Lambda(\omega)$).
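◮ A minimal Monte Carlo check of Bochner's representation (not from the slides), assuming the Gaussian kernel $\psi(t) = e^{-t^2/2}$ on $\mathbb{R}$, whose spectral measure $\Lambda$ is the standard normal distribution:

import numpy as np

rng = np.random.default_rng(0)
omega = rng.standard_normal(200_000)              # draws from Lambda = N(0, 1)
t = np.linspace(-4.0, 4.0, 9)
approx = np.cos(np.outer(t, omega)).mean(axis=1)  # E[cos(omega*t)] = Re E[e^{-i t omega}]
exact = np.exp(-t ** 2 / 2.0)                     # psi(t) for the Gaussian kernel
print(np.max(np.abs(approx - exact)))             # small Monte Carlo error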
Translation Invariant Kernels on R^d
[Figure: densities P(X) and Q(X) on the real line, for X in [-10, 10].]
Translation Invariant Kernels on R^d
[Figure: P(X) and Q(X) together with the magnitudes of their characteristic functions, |φ_P| and |φ_Q|, against frequency ω.]
Translation Invariant Kernels on R^d
[Figure: as above, together with the characteristic function difference |φ_P − φ_Q|.]
Translation Invariant Kernels on R^d
Gaussian kernel
[Figure: |φ_P − φ_Q| against frequency ω.]
Translation Invariant Kernels on R^d
$$\mathrm{MMD}(P, Q, \mathcal{F}) = \|\phi_P - \phi_Q\|_{L^2(\mathbb{R}^d, \Lambda)}$$
Universal
[Figure: |φ_P − φ_Q| against frequency ω.]
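◮ The identity on this slide can be checked numerically for empirical measures; a minimal sketch (not from the slides), assuming the Gaussian kernel $e^{-(x-y)^2/2}$ (so $\Lambda = N(0, 1)$) and Monte Carlo integration over $\omega$:

import numpy as np

rng = np.random.default_rng(0)
gauss = lambda A, B: np.exp(-0.5 * (A[:, None] - B[None, :]) ** 2)
X = rng.normal(0.0, 1.0, 50)   # sample from P
Y = rng.normal(0.5, 1.0, 50)   # sample from Q

# Kernel side: closed-form MMD^2(P_m, Q_n, F)
mmd2_kernel = gauss(X, X).mean() + gauss(Y, Y).mean() - 2.0 * gauss(X, Y).mean()

# Spectral side: E_{omega ~ Lambda} |phi_{P_m}(omega) - phi_{Q_n}(omega)|^2,
# where phi are empirical characteristic functions and Lambda = N(0, 1).
omega = rng.standard_normal(50_000)
phi_P = np.exp(1j * np.outer(omega, X)).mean(axis=1)
phi_Q = np.exp(1j * np.outer(omega, Y)).mean(axis=1)
mmd2_spectral = np.mean(np.abs(phi_P - phi_Q) ** 2)

print(mmd2_kernel, mmd2_spectral)  # the two estimates agree up to Monte Carlo error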
Translation Invariant Kernels on R^d
B-Spline kernel
[Figure: |φ_P − φ_Q| against frequency ω.]
Translation Invariant Kernels on R^d
$$\mathrm{MMD}(P, Q, \mathcal{F}) = \|\phi_P - \phi_Q\|_{L^2(\mathbb{R}^d, \Lambda)}$$
???
[Figure: |φ_P − φ_Q| against frequency ω.]
Translation Invariant Kernels on R^d
$$\mathrm{MMD}(P, Q, \mathcal{F}) = \|\phi_P - \phi_Q\|_{L^2(\mathbb{R}^d, \Lambda)}$$
Universal
[Figure: |φ_P − φ_Q| against frequency ω.]
Proof Idea of the Converse
◮ If the support of $\Lambda$ is not all of $\mathbb{R}^d$, construct a nonzero finite signed measure $\mu$ with $\hat{\mu} = 0$ on $\mathrm{supp}(\Lambda)$, so that $\int\!\!\int \psi(x - y) \, d\mu(x) \, d\mu(y) = \int |\hat{\mu}|^2 \, d\Lambda = 0$.
◮ Also $\mu(\mathbb{R}^d) = 0$.
◮ Normalizing the Jordan decomposition $\mu = \mu_+ - \mu_-$ then gives distributions $P \neq Q$ with $\mathrm{MMD}(P, Q, \mathcal{F}) = 0$.
Sinc kernel
[Figure: |φ_P − φ_Q| against frequency ω.]
Translation Invariant Kernels on R^d
$$\mathrm{MMD}(P, Q, \mathcal{F}) = \|\phi_P - \phi_Q\|_{L^2(\mathbb{R}^d, \Lambda)}$$
NOT universal
[Figure: |φ_P − φ_Q| against frequency ω.]
Summary
◮ Why RKHS?
◮ Problem of learning
◮ Loss function, Risk functional
◮ Bayes risk and Bayes function
◮ Empirical risk minimization
◮ Approximation and estimation errors
◮ RKHS yields significant computational advantages
References
◮ Gretton, A., Borgwardt, K. M., Rasch, M., Schölkopf, B., and Smola, A. (2007). A kernel method for the two-sample problem. In Schölkopf, B., Platt, J., and Hoffman, T., editors, Advances in Neural Information Processing Systems 19, pages 513–520. MIT Press.
◮ Sriperumbudur, B. K., Fukumizu, K., Gretton, A., Schölkopf, B., and Lanckriet, G. R. G. (2010a). Non-parametric estimation of integral probability metrics. In Proc. IEEE International Symposium on Information Theory, pages 1428–1432.
◮ Sriperumbudur, B. K., Fukumizu, K., and Lanckriet, G. R. G. (2010b). On the relation between universality, characteristic kernels and RKHS embedding of measures. In Teh, Y. W. and Titterington, M., editors, Proc. 13th International Conference on Artificial Intelligence and Statistics, volume 9 of Workshop and Conference Proceedings. JMLR.
◮ Sriperumbudur, B. K., Gretton, A., Fukumizu, K., Schölkopf, B., and Lanckriet, G. R. G. (2010c). Hilbert space embeddings and metrics on probability measures. Journal of Machine Learning Research, 11:1517–1561.
◮ Steinwart, I. and Christmann, A. (2008). Support Vector Machines. Springer.