Gatsby Unit
20 March 2012
So far...
◮ Introduction to RKHS
◮ RKHS based learning algorithms
◮ Kernel PCA
◮ Kernel regression
◮ SVMs for classification and regression
◮ Hypothesis testing (two-sample and independence tests)
◮ Feature selection, Clustering, ICA
◮ Representer theorem
This Lecture
Why RKHS?
How to choose an RKHS?
◮ Polynomial kernels
◮ Radial basis kernels
◮ Spline kernel
◮ Laplacian kernel
◮ A separating hyperplane $(w, b)$ satisfies $y_j (\langle w, x_j \rangle + b) \geq 0, \ \forall\, j = 1, \ldots, N$.
Maximum Margin Classifiers
◮ Popular Idea: Maximize the margin (distance from f to D):
$$\max_{w, b} \ \min_{j \in \{1, \ldots, N\}} \ \frac{|\langle w, x_j \rangle + b|}{\|w\|}$$
where $f(x) = \sum_{j=1}^{N} y_j \alpha_j \langle \Phi(x_j), \Phi(x) \rangle_{\mathcal{H}} + b$.
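◮ A minimal numerical sketch (not from the original slides) of this kernelized decision function, $f(x) = \sum_j y_j \alpha_j k(x_j, x) + b$. The dual weights alpha below are hypothetical placeholders; in practice they come from solving the SVM dual problem.

import numpy as np

def decision_function(x, X, y, alpha, b, kernel):
    # f(x) = sum_j y_j * alpha_j * k(x_j, x) + b
    return np.sum(y * alpha * kernel(X, x)) + b

gauss = lambda A, x: np.exp(-0.5 * (A - x) ** 2)  # Gaussian kernel on R
X = np.array([-2.0, -1.0, 1.0, 2.0])    # training inputs
y = np.array([-1.0, -1.0, 1.0, 1.0])    # labels in {-1, +1}
alpha = np.array([0.0, 0.5, 0.5, 0.0])  # hypothetical dual weights, illustrative only
b = 0.0
print(np.sign(decision_function(0.5, X, y, alpha, b, gauss)))  # -> 1.0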
Problem of Learning
◮ Given a loss $L$ and a distribution $P$ on $\mathcal{X} \times \mathcal{Y}$, the risk of $f$ is
$$\mathcal{R}_{L,P}(f) := \int L(y, f(x)) \, dP(x, y),$$
and the smallest achievable risk is
$$\mathcal{R}^*_{L,P} := \inf_f \mathcal{R}_{L,P}(f),$$
where the infimum is taken over the set of all measurable functions.
◮ $f^*$ is called the Bayes function and $\mathcal{R}_{L,P}(f^*)$ is called the Bayes risk.
◮ But $P$ is unknown.
Empirical Risk Minimization
◮ Since $P$ is unknown but is known through $D$, it is tempting to replace $\mathcal{R}_{L,P}(f)$ by
$$\mathcal{R}_{L,D}(f) := \frac{1}{n} \sum_{i=1}^{n} L(y_i, f(x_i)).$$
◮ Is it a good idea?
Overfitting!!
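◮ A minimal sketch of why unconstrained empirical risk minimization overfits (illustrative, not from the original slides): a high-degree polynomial drives $\mathcal{R}_{L,D}$ to (near) zero while its risk on fresh data from the same $P$ stays large.

import numpy as np

sq_loss = lambda y, fx: (y - fx) ** 2  # squared loss L(y, f(x))

def empirical_risk(f, X, y):
    # R_{L,D}(f) = (1/n) * sum_i L(y_i, f(x_i))
    return np.mean(sq_loss(y, f(X)))

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, 20)
y = X + 0.3 * rng.standard_normal(20)        # noisy linear data

coef = np.polyfit(X, y, deg=19)              # degree-19 polynomial (near-)interpolates all 20 points
interp = lambda x: np.polyval(coef, x)
print(empirical_risk(interp, X, y))          # (near) zero empirical risk

X_new = rng.uniform(-1.0, 1.0, 20)
y_new = X_new + 0.3 * rng.standard_normal(20)
print(empirical_risk(interp, X_new, y_new))  # typically far larger: overfitting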
Empirical Risk Minimization
◮ Do minimization over $\mathcal{F}$: let $f_D \in \arg\min_{f \in \mathcal{F}} \mathcal{R}_{L,D}(f)$ and $\mathcal{R}^*_{L,P,\mathcal{F}} := \inf_{f \in \mathcal{F}} \mathcal{R}_{L,P}(f)$. The excess risk then decomposes as
$$\mathcal{R}_{L,P}(f_D) - \mathcal{R}_{L,P}(f^*) = \underbrace{\mathcal{R}_{L,P}(f_D) - \mathcal{R}^*_{L,P,\mathcal{F}}}_{\text{estimation error}} + \underbrace{\mathcal{R}^*_{L,P,\mathcal{F}} - \mathcal{R}_{L,P}(f^*)}_{\text{approximation error}}$$
Approximation and Estimation Errors
[Figure: trade-off between approximation error and estimation error.]
Regularized Learning
◮ Let $\Omega$ be some functional on $\mathcal{F}$ such that for $c_1 \leq c_2$,
$$\{f \in \mathcal{F} : \Omega(f) \leq c_1\} \subset \{f \in \mathcal{F} : \Omega(f) \leq c_2\}.$$
◮ Define the regularized empirical risk minimizer (for $\lambda > 0$)
$$f_{D,\lambda} \in \arg\min_{f \in \mathcal{F}} \, \mathcal{R}_{L,D}(f) + \lambda \, \Omega(f).$$
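◮ One standard instance of this scheme (a sketch under the assumptions stated in the comments, not the slide's own example): squared loss with $\Omega(f) = \|f\|_{\mathcal{H}}^2$ in an RKHS, i.e. kernel ridge regression, whose solution has the finite form guaranteed by the representer theorem.

import numpy as np

def krr_fit(X, y, kernel, lam):
    # Regularized ERM with squared loss and Omega(f) = ||f||_H^2:
    # by the representer theorem, f = sum_i alpha_i k(x_i, .) with
    # alpha = (K + n*lam*I)^{-1} y  (kernel ridge regression).
    n = len(X)
    K = kernel(X, X)
    return np.linalg.solve(K + n * lam * np.eye(n), y)

gauss = lambda A, B: np.exp(-0.5 * (A[:, None] - B[None, :]) ** 2)
rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, 40)
y = np.sin(X) + 0.1 * rng.standard_normal(40)
alpha = krr_fit(X, y, gauss, lam=1e-2)
f = lambda x: gauss(np.atleast_1d(x), X) @ alpha  # fitted function
print(f(np.array([0.0, 1.0])))                    # predictions near sin(0), sin(1)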
Maximum Mean Discrepancy (MMD)
◮ For a class $\mathcal{F}$ of functions, define
$$\mathrm{MMD}(P, Q, \mathcal{F}) := \sup_{f \in \mathcal{F}} \left| \int f \, dP - \int f \, dQ \right|.$$
◮ For binary classification with class-conditional distributions $P$ and $Q$ (and a suitable loss),
$$\mathcal{R}^*_{L,P,\mathcal{F}} = -\mathrm{MMD}(P, Q, \mathcal{F})$$
◮ $\mathrm{MMD}(P, P, \mathcal{F}) = 0$
◮ Triangle inequality: $\mathrm{MMD}(P, R, \mathcal{F}) \leq \mathrm{MMD}(P, Q, \mathcal{F}) + \mathrm{MMD}(Q, R, \mathcal{F})$
◮ However, $\mathrm{MMD}(P, Q, \mathcal{F}) = 0 \nRightarrow P = Q$
Not all Kernels are Useful
◮ For a poorly chosen kernel $k$ (with $\mathcal{F}$ the unit ball of its RKHS $\mathcal{H}$), one can have
$$\mathrm{MMD}(P, Q, \mathcal{F}) = 0, \quad \forall\, P, Q.$$
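◮ A worked example (standard, though not spelled out on the slides): take the constant kernel $k(x, y) = 1$, whose RKHS contains only constant functions. Using the identity $\mathrm{MMD}^2(P, Q, \mathcal{F}) = \int\!\!\int k \, d(P - Q) \, d(P - Q)$ from a later slide,
$$\mathrm{MMD}^2(P, Q, \mathcal{F}) = \int_{\mathcal{X}} \int_{\mathcal{X}} 1 \, d(P - Q)(x) \, d(P - Q)(y) = \big[(P - Q)(\mathcal{X})\big]^2 = 0, \quad \forall\, P, Q,$$
since $P$ and $Q$ are both probability measures. This RKHS cannot distinguish any two distributions.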
How to choose H?
Computation: RKHS vs. Other F
◮ Suppose $\{X_1, \ldots, X_m\} \overset{\text{i.i.d.}}{\sim} P$ and $\{Y_1, \ldots, Y_n\} \overset{\text{i.i.d.}}{\sim} Q$.
◮ Define $P_m := \frac{1}{m} \sum_{i=1}^{m} \delta_{X_i}$ and $Q_n := \frac{1}{n} \sum_{i=1}^{n} \delta_{Y_i}$, where $\delta_x$ represents the Dirac measure at $x$.
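◮ When $\mathcal{F}$ is the unit ball of the RKHS of $k$ (the setting of Gretton et al., 2007), $\mathrm{MMD}(P_m, Q_n, \mathcal{F})$ has a closed form; a minimal sketch under that assumption:

import numpy as np

# MMD^2(P_m, Q_n, F) = (1/m^2) sum_{i,j} k(X_i, X_j)
#                    + (1/n^2) sum_{i,j} k(Y_i, Y_j)
#                    - (2/mn)  sum_{i,j} k(X_i, Y_j),
# i.e. only O((m+n)^2) kernel evaluations, in any dimension d.
def mmd2(X, Y, kernel):
    return kernel(X, X).mean() + kernel(Y, Y).mean() - 2.0 * kernel(X, Y).mean()

gauss = lambda A, B: np.exp(-0.5 * (A[:, None] - B[None, :]) ** 2)
rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, 200)   # sample from P
Y = rng.normal(0.5, 1.0, 300)   # sample from Q
print(mmd2(X, Y, gauss))        # small but positive: P != Q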
Error: RKHS vs. Other F
$$\left| \mathrm{MMD}(P_m, Q_n, \mathcal{F}) - \mathrm{MMD}(P, Q, \mathcal{F}) \right| = \,?$$
For general $\mathcal{F}$, the rate of convergence degrades with the dimension $d$: Curse of dimensionality!!
How to choose H?
Large RKHS: Universal Kernel/RKHS
◮ The kernel $k$ is universal if the embedding
$$\mu \mapsto \int_{\mathcal{X}} k(\cdot, x) \, d\mu(x), \quad \mu \in M_b(\mathcal{X}),$$
is injective, i.e.,
$$\int_{\mathcal{X}} k(\cdot, x) \, d\mu(x) = 0 \;\Rightarrow\; \mu = 0,$$
which is equivalent to
$$\int_{\mathcal{X}} \int_{\mathcal{X}} k(x, y) \, d\mu(x) \, d\mu(y) > 0, \quad \forall\, \mu \in M_b(\mathcal{X}) \setminus \{0\}.$$
◮ In general, though
$$\int_{\mathcal{X}} \int_{\mathcal{X}} k(x, y) \, d\mu(x) \, d\mu(y) > 0, \quad \forall\, \mu \in M_b(\mathcal{X}) \setminus \{0\}$$
is also not easy to check, for certain $\mathcal{X}$ and for certain families of $k$ the above condition is easy to check.
◮ Later: Gaussian and Spline kernels are universal; Sinc kernel is not
but is strictly positive definite.
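◮ A minimal numerical sketch (illustrative, not from the slides) of the condition above for a discrete signed measure $\mu = \sum_i a_i \delta_{x_i}$, for which the double integral reduces to a quadratic form:

import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-3.0, 3.0, 20)   # support points of a discrete signed measure mu
a = rng.standard_normal(20)      # signed weights: mu = sum_i a_i * delta_{x_i}
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)  # Gaussian kernel Gram matrix
# int int k(x, y) dmu(x) dmu(y) = a^T K a for this discrete mu
print(a @ K @ a)  # strictly positive for any nonzero a: Gaussian k is integrally s.p.d.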
MMD: What Kernels are Useful?
◮ Note that
$$\mathrm{MMD}^2(P, Q, \mathcal{F}) = \int_{\mathcal{X}} \int_{\mathcal{X}} k(x, y) \, d(P - Q)(x) \, d(P - Q)(y).$$
◮ So if the above double integral is strictly positive for every nonzero signed measure $\mu = P - Q$, then
$$\mathrm{MMD}(P, Q, \mathcal{F}) = 0 \;\Rightarrow\; P = Q \quad \text{(characteristic)}.$$
Translation Invariant Kernels on R^d
◮ $k(x, y) = \psi(x - y)$
◮ Examples: Gaussian, $e^{-\|x - y\|_2^2}$; Laplacian, $e^{-\|x - y\|_1}$
◮ Bochner's theorem: $\psi$ is positive definite if and only if
$$\psi(t) = \int_{\mathbb{R}^d} e^{-i \langle t, \omega \rangle} \, d\Lambda(\omega)$$
for some finite nonnegative measure $\Lambda$.
◮ If the support of $\Lambda$ is $\mathbb{R}^d$, then $\int\!\!\int_{\mathbb{R}^d} \psi(x - y) \, d\mu(x) \, d\mu(y) = 0$ implies $\hat{\mu} = 0$ and therefore $\mu = 0$ (since $\int\!\!\int \psi(x - y) \, d\mu(x) \, d\mu(y) = \int |\hat{\mu}(\omega)|^2 \, d\Lambda(\omega)$).
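◮ A minimal Monte Carlo check of Bochner's representation (not from the slides), assuming the Gaussian kernel $\psi(t) = e^{-t^2/2}$ on $\mathbb{R}$, whose spectral measure $\Lambda$ is the standard normal distribution:

import numpy as np

rng = np.random.default_rng(0)
omega = rng.standard_normal(200_000)              # draws from Lambda = N(0, 1)
t = np.linspace(-4.0, 4.0, 9)
approx = np.cos(np.outer(t, omega)).mean(axis=1)  # E[cos(omega*t)] = Re E[e^{-i t omega}]
exact = np.exp(-t ** 2 / 2.0)                     # psi(t) for the Gaussian kernel
print(np.max(np.abs(approx - exact)))             # small Monte Carlo error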
Translation Invariant Kernels on R^d
[Figure: densities P(X) and Q(X) on the real line, for X in [-10, 10].]
Translation Invariant Kernels on R^d
[Figure: P(X) and Q(X) together with the magnitudes of their characteristic functions, |φ_P| and |φ_Q|, against frequency ω.]
Translation Invariant Kernels on R^d
[Figure: as above, together with the characteristic function difference |φ_P − φ_Q|.]
Translation Invariant Kernels on R^d
Gaussian kernel
[Figure: |φ_P − φ_Q| against frequency ω.]
Translation Invariant Kernels on R^d
$$\mathrm{MMD}(P, Q, \mathcal{F}) = \|\phi_P - \phi_Q\|_{L^2(\mathbb{R}^d, \Lambda)}$$
Universal
[Figure: |φ_P − φ_Q| against frequency ω.]
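◮ The identity on this slide can be checked numerically for empirical measures; a minimal sketch (not from the slides), assuming the Gaussian kernel $e^{-(x-y)^2/2}$ (so $\Lambda = N(0, 1)$) and Monte Carlo integration over $\omega$:

import numpy as np

rng = np.random.default_rng(0)
gauss = lambda A, B: np.exp(-0.5 * (A[:, None] - B[None, :]) ** 2)
X = rng.normal(0.0, 1.0, 50)   # sample from P
Y = rng.normal(0.5, 1.0, 50)   # sample from Q

# Kernel side: closed-form MMD^2(P_m, Q_n, F)
mmd2_kernel = gauss(X, X).mean() + gauss(Y, Y).mean() - 2.0 * gauss(X, Y).mean()

# Spectral side: E_{omega ~ Lambda} |phi_{P_m}(omega) - phi_{Q_n}(omega)|^2,
# where phi are empirical characteristic functions and Lambda = N(0, 1).
omega = rng.standard_normal(50_000)
phi_P = np.exp(1j * np.outer(omega, X)).mean(axis=1)
phi_Q = np.exp(1j * np.outer(omega, Y)).mean(axis=1)
mmd2_spectral = np.mean(np.abs(phi_P - phi_Q) ** 2)

print(mmd2_kernel, mmd2_spectral)  # the two estimates agree up to Monte Carlo error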
Translation Invariant Kernels on R^d
B-Spline kernel
[Figure: |φ_P − φ_Q| against frequency ω.]
Translation Invariant Kernels on R^d
$$\mathrm{MMD}(P, Q, \mathcal{F}) = \|\phi_P - \phi_Q\|_{L^2(\mathbb{R}^d, \Lambda)}$$
???
[Figure: |φ_P − φ_Q| against frequency ω.]
Translation Invariant Kernels on R^d
$$\mathrm{MMD}(P, Q, \mathcal{F}) = \|\phi_P - \phi_Q\|_{L^2(\mathbb{R}^d, \Lambda)}$$
Universal
[Figure: |φ_P − φ_Q| against frequency ω.]
Proof Idea of the Converse
◮ If the support of $\Lambda$ is not all of $\mathbb{R}^d$, construct a nonzero finite signed measure $\mu$ with $\hat{\mu} = 0$ on $\mathrm{supp}(\Lambda)$, so that $\int\!\!\int \psi(x - y) \, d\mu(x) \, d\mu(y) = \int |\hat{\mu}|^2 \, d\Lambda = 0$.
◮ Also $\mu(\mathbb{R}^d) = 0$.
◮ Normalizing the Jordan decomposition $\mu = \mu_+ - \mu_-$ then gives distributions $P \neq Q$ with $\mathrm{MMD}(P, Q, \mathcal{F}) = 0$.
Sinc kernel
[Figure: |φ_P − φ_Q| against frequency ω.]
Translation Invariant Kernels on R^d
$$\mathrm{MMD}(P, Q, \mathcal{F}) = \|\phi_P - \phi_Q\|_{L^2(\mathbb{R}^d, \Lambda)}$$
NOT universal
[Figure: |φ_P − φ_Q| against frequency ω.]
Summary
◮ Why RKHS?
◮ Problem of learning
◮ Loss function, Risk functional
◮ Bayes risk and Bayes function
◮ Empirical risk minimization
◮ Approximation and estimation errors
◮ RKHS yields significant computational advantages
References
◮ Gretton, A., Borgwardt, K. M., Rasch, M., Schölkopf, B., and Smola, A. (2007). A kernel method for the two-sample problem. In Schölkopf, B., Platt, J., and Hoffman, T., editors, Advances in Neural Information Processing Systems 19, pages 513–520. MIT Press.
◮ Sriperumbudur, B. K., Fukumizu, K., Gretton, A., Schölkopf, B., and Lanckriet, G. R. G. (2010a). Non-parametric estimation of integral probability metrics. In Proc. IEEE International Symposium on Information Theory, pages 1428–1432.
◮ Sriperumbudur, B. K., Fukumizu, K., and Lanckriet, G. R. G. (2010b). On the relation between universality, characteristic kernels and RKHS embedding of measures. In Teh, Y. W. and Titterington, M., editors, Proc. 13th International Conference on Artificial Intelligence and Statistics, volume 9 of Workshop and Conference Proceedings. JMLR.
◮ Sriperumbudur, B. K., Gretton, A., Fukumizu, K., Schölkopf, B., and Lanckriet, G. R. G. (2010c). Hilbert space embeddings and metrics on probability measures. Journal of Machine Learning Research, 11:1517–1561.
◮ Steinwart, I. and Christmann, A. (2008). Support Vector Machines. Springer.