The Learning Problem and Regularization
9.520 Class 02
February 2011
Tomaso Poggio

Computational Learning
Supervised
Semisupervised
Unsupervised
Online
Transductive
Active
Variable Selection
Reinforcement
.....
One can consider the data to be created in a deterministic,
stochastic or even adversarial way.
Where to Start?

Statistical and Supervised Learning
Statistical models essentially deal with noise, sampling, and other sources of uncertainty.
Supervised learning is by far the best understood class of problems.

Regularization
Regularization provides a fundamental framework to solve learning problems and design learning algorithms.
We present a set of ideas and tools which are at the core of several developments in supervised learning and beyond it.

This is a theory, with associated algorithms, that works in practice, e.g. in products such as vision systems for cars. Later in the semester we will learn about ongoing research combining neuroscience and learning. The latter research is at the frontier, on approaches that may or may not work in practice (similar to Bayesian techniques: it is still unclear how well they work beyond toy or special problems).
Plan
Problem at a Glance
Learning is Inference
Noise ...
At each input x, the output y is drawn from the conditional distribution p(y|x).
[Figure: outputs y scattered around the target function, illustrating noise.]

...and Sampling
Even in a noise-free case we have to deal with sampling: the marginal distribution p(x) determines the presence or absence of certain input instances.
[Figure: training inputs drawn according to p(x).]
Predictivity or Generalization
Given the data, the goal is to learn how to make decisions/predictions about future data, that is, data not belonging to the training set. Generalization is the key requirement emphasized in learning theory. This emphasis distinguishes it from traditional statistics (especially explanatory statistics) and from Bayesian freakonomics.
Loss functions
A loss function V(f(x), y) measures the price we pay when we predict f(x) in place of the true output y.
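Some standard choices of V, written in the deck's notation (the square loss is the one used later in the ERM example; the others are common alternatives, not necessarily the original slide's list):

$$
\begin{aligned}
\text{square loss (regression):} \quad & V(f(x), y) = (f(x) - y)^2 \\
\text{absolute loss:} \quad & V(f(x), y) = |f(x) - y| \\
\text{hinge loss (classification):} \quad & V(f(x), y) = \max\big(0,\; 1 - y\,f(x)\big) \\
\text{misclassification loss:} \quad & V(f(x), y) = \mathbb{1}\!\left[\operatorname{sign} f(x) \neq y\right]
\end{aligned}
$$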
Expected Risk
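A standard definition, consistent with the notation I[f] used in the rest of the deck (p is the distribution on the pairs z = (x, y)): the expected risk is the average loss over draws from p,

$$
I[f] \;=\; \mathbb{E}\,\big[V(f(x), y)\big] \;=\; \int V(f(x), y)\, dp(x, y).
$$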
Target Function
The target function is the minimizer of the expected risk over the space F of all (measurable) functions:
$$ f^{*} = \arg\min_{f \in F} I[f]. $$
Basic definitions
Reminder

Convergence in probability
Let {X_n} be a sequence of bounded random variables. Then
$$ \lim_{n \to \infty} X_n = X \ \text{in probability} $$
if
$$ \forall \varepsilon > 0 \quad \lim_{n \to \infty} \mathbb{P}\{\, |X_n - X| \ge \varepsilon \,\} = 0. $$

Convergence in expectation
Let {X_n} be a sequence of bounded random variables. Then
$$ \lim_{n \to \infty} X_n = X \ \text{in expectation} $$
if
$$ \lim_{n \to \infty} \mathbb{E}\big( |X_n - X| \big) = 0. $$
Universal Consistency
We say that an algorithm is universally consistent if, for every probability distribution p and every ε > 0,
$$ \lim_{n \to \infty} \mathbb{P}\big\{\, I[f_n] - \inf_{f \in F} I[f] > \varepsilon \,\big\} = 0. $$
Generalization Error
How well the empirical risk approximates the expected risk is captured by the generalization error:
$$ \mathbb{P}\big\{\, |I[f_n] - I_n[f_n]| \le \varepsilon \,\big\} \ge 1 - \delta, $$
for n = n(ε, δ).
Plan
Hypotheses Space
A learning algorithm searches a hypotheses space H of candidate functions. Given the training set S = (z_1, ..., z_n), the empirical risk of f ∈ H is
$$ I_n[f] = \frac{1}{n} \sum_{i=1}^{n} V(f, z_i). $$
Reminder: Generalization
A natural requirement for f_S is distribution-independent generalization:
$$ \lim_{n \to \infty} \big| I_S[f_S] - I[f_S] \big| = 0 \quad \text{in probability} \tag{1} $$
ERM
Empirical Risk Minimization chooses the hypothesis that minimizes the empirical risk:
$$ f_S = \arg\min_{f \in H} I_n[f]. $$
For example, linear regression is ERM when V(z) = (f(x) − y)^2 and H is the space of linear functions f = ax.
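To make this concrete, here is a minimal sketch of exactly this example (hypothetical data, numpy only; not from the course materials): ERM with square loss over H = {f(x) = ax} reduces to one-dimensional least squares.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D training set: y = 2x + noise (illustration only).
x = rng.uniform(-1.0, 1.0, size=50)
y = 2.0 * x + 0.1 * rng.standard_normal(50)

# ERM with square loss over H = {f(x) = a*x}: minimize
#   I_n(a) = (1/n) * sum_i (a*x_i - y_i)^2.
# Setting dI_n/da = 0 gives the closed-form minimizer a = <x,y>/<x,x>.
a_hat = (x @ y) / (x @ x)

empirical_risk = np.mean((a_hat * x - y) ** 2)
print(f"a_hat = {a_hat:.3f}, empirical risk I_n[f_S] = {empirical_risk:.4f}")
```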
Key Theorem(s)

Uniform Glivenko-Cantelli Classes
We say that H is a uniform Glivenko-Cantelli (uGC) class if, for all p,
$$ \forall \varepsilon > 0 \quad \lim_{n \to \infty} \mathbb{P}\Big\{ \sup_{f \in H} \big| I[f] - I_n[f] \big| > \varepsilon \Big\} = 0. $$
The classical result is that ERM on H generalizes for every distribution exactly when H is a uGC class.
Stability

Notation: z = (x, y), S = {z_1, ..., z_n}, and S^i = {z_1, ..., z_{i−1}, z_{i+1}, ..., z_n} is the training set with the i-th point removed.

CV Stability
A learning algorithm A is CV_loo stable if for each n there exist β_CV^{(n)} and δ_CV^{(n)} such that, for all p,
$$ \mathbb{P}\Big\{\, \big| V(f_{S^i}, z_i) - V(f_S, z_i) \big| \le \beta_{CV}^{(n)} \,\Big\} \ge 1 - \delta_{CV}^{(n)}, $$
with β_CV^{(n)} → 0 and δ_CV^{(n)} → 0 as n → ∞.
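A small numerical sketch of what CV_loo stability measures (hypothetical data; regularized least squares over f(x) = ax as the learning algorithm): the leave-one-out loss gap |V(f_{S^i}, z_i) − V(f_S, z_i)| shrinks as n grows.

```python
import numpy as np

def ridge_fit(x, y, lam):
    # Minimize (1/n) * sum_i (a*x_i - y_i)^2 + lam * a^2;
    # the closed-form minimizer is a = <x,y> / (<x,x> + n*lam).
    return (x @ y) / (x @ x + lam * len(x))

def max_loo_gap(n, lam=0.1, seed=0):
    """Max over i of |V(f_{S^i}, z_i) - V(f_S, z_i)| with square loss V."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-1.0, 1.0, n)
    y = 2.0 * x + 0.1 * rng.standard_normal(n)
    a_S = ridge_fit(x, y, lam)
    gaps = []
    for i in range(n):
        keep = np.arange(n) != i           # S^i: drop the i-th point
        a_Si = ridge_fit(x[keep], y[keep], lam)
        v_Si = (a_Si * x[i] - y[i]) ** 2   # V(f_{S^i}, z_i)
        v_S = (a_S * x[i] - y[i]) ** 2     # V(f_S, z_i)
        gaps.append(abs(v_Si - v_S))
    return max(gaps)

for n in (10, 100, 1000):
    print(n, max_loo_gap(n))
```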
Regularization
Instead of minimizing the empirical risk
$$ \frac{1}{n} \sum_{i=1}^{n} V(f(x_i), y_i) $$
alone, regularization minimizes the penalized empirical risk
$$ \frac{1}{n} \sum_{i=1}^{n} V(f(x_i), y_i) + \lambda R(f). \tag{2} $$
R(f) is the regularizer, a penalization on f. In this course we will mainly discuss the case $R(f) = \|f\|_K^2$, where $\|f\|_K^2$ is the squared norm in the Reproducing Kernel Hilbert Space (RKHS) H defined by the kernel K.
Tikhonov Regularization
Plan
Hypotheses Space
We are going to look at hypotheses spaces which are reproducing kernel Hilbert spaces (RKHS).
RKHS are Hilbert spaces of point-wise defined functions.
They can be defined via a reproducing kernel, which is a symmetric positive definite function: for all n, all points t_1, ..., t_n, and all coefficients c_1, ..., c_n,
$$ \sum_{i,j=1}^{n} c_i c_j K(t_i, t_j) \ge 0. $$
Examples of pd kernels
Very common examples of symmetric pd kernels are:

Linear kernel
$$ K(x, x') = x \cdot x' $$

Gaussian kernel
$$ K(x, x') = e^{-\frac{\|x - x'\|^2}{\sigma^2}}, \quad \sigma > 0 $$

Polynomial kernel
$$ K(x, x') = (x \cdot x' + 1)^d, \quad d \in \mathbb{N} $$
Kernels can also be built from a (finite) feature map φ_1, ..., φ_p:
$$ K(x, x') = \sum_{j=1}^{p} \phi_j(x)\, \phi_j(x'). $$
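As a sanity check of the positive-definiteness condition, here is a short sketch (numpy; hypothetical points, not course code) that builds the Gram matrix for each kernel above and verifies that Σ c_i c_j K(t_i, t_j) = cᵀKc ≥ 0 for all c, which for a symmetric Gram matrix is equivalent to all eigenvalues being nonnegative:

```python
import numpy as np

def linear_kernel(X, Y):
    return X @ Y.T

def gaussian_kernel(X, Y, sigma=1.0):
    # ||x - x'||^2 expanded as ||x||^2 + ||x'||^2 - 2 <x, x'>
    sq = (X**2).sum(1)[:, None] + (Y**2).sum(1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-sq / sigma**2)

def polynomial_kernel(X, Y, d=3):
    return (X @ Y.T + 1.0) ** d

rng = np.random.default_rng(0)
T = rng.standard_normal((20, 5))  # hypothetical points t_1, ..., t_20 in R^5

for name, kernel in (("linear", linear_kernel),
                     ("gaussian", gaussian_kernel),
                     ("polynomial", polynomial_kernel)):
    K = kernel(T, T)
    # Smallest eigenvalue of the symmetric Gram matrix; values a hair
    # below zero are floating-point error, not a failure of pd-ness.
    print(f"{name}: smallest Gram eigenvalue = {np.linalg.eigvalsh(K).min():.2e}")
```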
Ivanov regularization
$$ \min_{f \in H} \frac{1}{n} \sum_{i=1}^{n} V(f(x_i), y_i) \quad \text{subject to} \quad \|f\|_H^2 \le R. $$
The above algorithm corresponds to a constrained optimization problem.
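A standard connection worth noting (an aside, not stated on the slide): for convex loss, the constrained Ivanov problem and the penalized Tikhonov problem below are linked by Lagrangian duality, with λ playing the role of the multiplier for the constraint ‖f‖²_H ≤ R; for each R there is a λ = λ(R) giving the same solution:
$$
\min_{\|f\|_H^2 \le R} \; \frac{1}{n} \sum_{i=1}^{n} V(f(x_i), y_i)
\quad\longleftrightarrow\quad
\min_{f \in H} \; \frac{1}{n} \sum_{i=1}^{n} V(f(x_i), y_i) + \lambda \|f\|_H^2 .
$$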
Tikhonov regularization
$$ \arg\min_{f \in H} \frac{1}{n} \sum_{i=1}^{n} V(f(x_i), y_i) + \lambda \|f\|_H^2 . $$
The solution of Tikhonov regularization can be written as a finite kernel expansion on the training points:
$$ f_S(x) = \sum_{i=1}^{n} c_i K(x_i, x). $$
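Putting the last two slides together, a minimal sketch (hypothetical data and parameter choices, numpy only; not the course's code) of Tikhonov regularization with the square loss, i.e. kernel regularized least squares: substituting f(x) = Σ_i c_i K(x_i, x) into the Tikhonov functional gives the linear system (K + nλI)c = y.

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=0.5):
    sq = (X**2).sum(1)[:, None] + (Y**2).sum(1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-sq / sigma**2)

def fit(X, y, lam, sigma=0.5):
    # With f = sum_i c_i K(x_i, .), the Tikhonov functional
    #   (1/n) * ||K c - y||^2 + lam * c^T K c
    # is minimized by the solution of (K + n*lam*I) c = y.
    n = len(X)
    K = gaussian_kernel(X, X, sigma)
    return np.linalg.solve(K + n * lam * np.eye(n), y)

def predict(X_train, c, X_new, sigma=0.5):
    # f(x) = sum_i c_i K(x_i, x)
    return gaussian_kernel(X_new, X_train, sigma) @ c

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, (40, 1))  # hypothetical 1-D inputs
y = np.sin(3.0 * X[:, 0]) + 0.1 * rng.standard_normal(40)

c = fit(X, y, lam=1e-3)
print(predict(X, c, np.array([[0.0], [0.5]])))
```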
Bayes Interpretation
Tikhonov regularization has a Bayesian reading: with square loss, the data term corresponds to a Gaussian noise model and the regularizer ‖f‖²_K to a Gaussian prior over functions, so the minimizer is the MAP estimate.
Regularization approach
Summary