Foundations of Machine Learning
Lecture 2: PAC Model, Guarantees for Learning with Finite Hypothesis Sets
Mehryar Mohri
Courant Institute and Google Research
mohri@cims.nyu.edu
Motivation
Some computational learning questions
What can be learned efficiently?
What is inherently hard to learn?
A general model of learning?
Complexity
Computational complexity: time and space.
Sample complexity: amount of training data
needed to learn successfully.
Mistake bounds: number of mistakes before
learning successfully.
This lecture
PAC Model
Sample complexity, finite H, consistent case
Sample complexity, finite H, inconsistent case
[Example slides; hypothesis set example: linear classifiers.]
Errors
True error or generalization error of h with respect to the target concept c and distribution D:
R(h) = Pr_{x∼D}[h(x) ≠ c(x)] = E_{x∼D}[1_{h(x)≠c(x)}].
Empirical error of h on a sample S = (x1, ..., xm):
R_S(h) = (1/m) Σ_{i=1}^{m} 1_{h(xi)≠c(xi)}.
Note: R(h) = E_{S∼D^m}[R_S(h)].
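To make the two quantities concrete, here is a small numerical sketch (not from the slides): the target concept c, the hypothesis h, and the distribution D below are invented for the illustration; the empirical error is an average over a finite sample, and the generalization error is estimated by Monte Carlo.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical target concept and hypothesis over X = [0, 1] (invented for this example).
c = lambda x: x <= 0.5   # target concept
h = lambda x: x <= 0.4   # hypothesis

# Empirical error R_S(h): fraction of disagreements on a finite sample S ~ D^m.
S = rng.uniform(0.0, 1.0, size=20)
emp_error = np.mean(h(S) != c(S))

# Generalization error R(h) = Pr_{x~D}[h(x) != c(x)], estimated by Monte Carlo.
X = rng.uniform(0.0, 1.0, size=1_000_000)
gen_error = np.mean(h(X) != c(X))   # close to 0.1 for D = Uniform[0, 1]

print(f"empirical error on |S| = 20: {emp_error:.3f}, estimated R(h): {gen_error:.4f}")
```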
PAC Model
(Valiant, 1984)
Definition: a concept class C is PAC-learnable if there exists a learning algorithm L such that, for all c ∈ C, all distributions D, and all ε, δ > 0, for a sample size m polynomial in 1/ε and 1/δ,
Pr_{S∼D^m}[R(h_S) ≤ ε] ≥ 1 - δ.
Remarks
Concept class C is known to the algorithm.
Distribution-free model: no assumption on D.
Both training and test examples are drawn from D.
Probably: confidence 1 - δ.
Approximately correct: error at most ε.
Example: axis-aligned rectangles
Target concept: a rectangle R in the plane. Hypothesis: a rectangle R'. In general, there may be false positive and false negative points.
[Figure: target rectangle R and hypothesis rectangle R', with false positive and false negative points.]
Write the target rectangle as R = [l, r] × [b, t], and assume Pr[R] > ε (otherwise the rectangle returned by the algorithm, being contained in R, has error at most ε). Define four side regions r1, r2, r3, r4 of R, each of probability mass ε/4; for example,
r4 = [l, s4] × [b, t], with s4 = inf{s : Pr[[l, s] × [b, t]] ≥ ε/4}.
[Figure: target rectangle R with the four side regions r1, r2, r3, r4.]
Let R_S be the rectangle returned by the algorithm: the tightest rectangle containing the positively labeled training points, so that R_S ⊆ R. If R_S meets all four regions r_i, the error region R \ R_S is contained in r1 ∪ r2 ∪ r3 ∪ r4 and R(R_S) ≤ ε. Hence R(R_S) > ε implies that R_S misses at least one region r_i, and R_S misses r_i only if no training point falls in r_i. Therefore
Pr[R(R_S) > ε] ≤ Pr[∪_{i=1}^{4} {R_S misses r_i}]
             ≤ Σ_{i=1}^{4} Pr[{R_S misses r_i}]
             ≤ 4(1 - ε/4)^m ≤ 4 exp(-mε/4).
[Figure: sample points with the target rectangle R, the tightest consistent rectangle R_S, and the four side regions r1, ..., r4.]
Then, for m ≥ (4/ε) log(4/δ),
Pr[R(R_S) > ε] ≤ 4 exp(-mε/4) ≤ δ.
Thus, the concept class of axis-aligned rectangles is PAC-learnable.
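A minimal simulation sketch of this argument (my own, not part of the slides): the target rectangle, the uniform distribution on [0, 1]², and the sample size are invented for the illustration. The learner returns the tightest axis-aligned rectangle containing the positive points, and its error is estimated by sampling.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical target rectangle R = [l, r] x [b, t] and distribution D = Uniform([0, 1]^2).
l, r, b, t = 0.2, 0.8, 0.3, 0.9

def label(points):
    """c(x) = 1 iff x lies in the target rectangle R."""
    return (points[:, 0] >= l) & (points[:, 0] <= r) & (points[:, 1] >= b) & (points[:, 1] <= t)

def tightest_rectangle(points, y):
    """Return the tightest axis-aligned rectangle containing the positive points."""
    pos = points[y]
    if len(pos) == 0:                      # no positive point: return a degenerate rectangle
        return (0.0, 0.0, 0.0, 0.0)
    return (pos[:, 0].min(), pos[:, 0].max(), pos[:, 1].min(), pos[:, 1].max())

m = 500
S = rng.uniform(0.0, 1.0, size=(m, 2))
l_s, r_s, b_s, t_s = tightest_rectangle(S, label(S))

# Estimate R(R_S) = Pr[x in R \ R_S]; it should be small for moderate m.
X = rng.uniform(0.0, 1.0, size=(200_000, 2))
in_R = label(X)
in_RS = (X[:, 0] >= l_s) & (X[:, 0] <= r_s) & (X[:, 1] >= b_s) & (X[:, 1] <= t_s)
print("estimated generalization error:", np.mean(in_R != in_RS))
```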
Notes
Infinite hypothesis set, but simple proof.
Does this proof readily apply to other similar concept classes?
Geometric properties:
key in this proof;
in general non-trivial to extend to other classes, e.g., non-concentric circles (see HW2, 2006).
This lecture
PAC Model
Sample complexity, finite H, consistent case
Sample complexity, finite H, inconsistent case
Learning bound, finite H, consistent case
Theorem: let H be a finite hypothesis set, and assume the algorithm returns a hypothesis h_S ∈ H consistent with the training sample. Then, for any ε, δ > 0, Pr[R(h_S) > ε] ≤ δ holds for a sample size
m ≥ (1/ε)(log|H| + log(1/δ));
equivalently, with probability at least 1 - δ, R(h_S) ≤ (1/m)(log|H| + log(1/δ)).
Proof: by the union bound, the probability that some consistent hypothesis in H has error greater than ε satisfies
Pr[∃h ∈ H : h consistent ∧ R(h) > ε]
  = Pr[(h1 consistent ∧ R(h1) > ε) ∨ ... ∨ (h_|H| consistent ∧ R(h_|H|) > ε)]
  ≤ Σ_{h∈H} Pr[h consistent ∧ R(h) > ε]        (union bound)
  ≤ Σ_{h∈H} Pr[h consistent | R(h) > ε]
  ≤ Σ_{h∈H} (1 - ε)^m = |H|(1 - ε)^m ≤ |H| e^{-mε}.
Setting |H| e^{-mε} ≤ δ and solving for m yields the statement.
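A quick numeric illustration of the theorem above (my own sketch; the values of |H|, ε, and δ are arbitrary):

```python
import math

def sample_complexity_consistent(H_size, eps, delta):
    """Smallest integer m with m >= (1/eps) * (log|H| + log(1/delta)),
    which guarantees R(h_S) <= eps with probability >= 1 - delta
    for a consistent hypothesis h_S from a finite hypothesis set H."""
    return math.ceil((math.log(H_size) + math.log(1.0 / delta)) / eps)

# Example: |H| = 10^6 hypotheses, target error 0.1, confidence 98%.
print(sample_complexity_consistent(10 ** 6, eps=0.1, delta=0.02))   # -> 178
```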
Remarks
The algorithm can be ERM if the problem is realizable.
The error bound is linear in 1/m and only logarithmic in 1/δ.
log_2|H| is the number of bits used for the representation of H.
Example: conjunctions of boolean literals
Concept class C_n: conjunctions of boolean literals over x1, ..., xn (e.g., for n = 6, a conjunction over x1, x2, x5, x6).
Each variable can appear positively, negatively, or not at all, so |C_n| = 3^n, and the consistent-case bound gives the sample complexity
m ≥ (1/ε)((log 3) n + log(1/δ)).
The standard consistent learner for this class is sketched below.
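A minimal sketch of that learner (assumed, not taken verbatim from the lecture): start from the conjunction of all 2n literals and delete every literal contradicted by a positive example; negative examples are ignored. The tiny sample and target below are invented.

```python
def learn_conjunction(sample, n):
    """Learn a conjunction of boolean literals from labeled examples.

    sample: list of (x, y) with x a tuple of n bits and y in {0, 1}.
    Returns a set of literals, ('x', i) for x_i or ('not_x', i) for its negation.
    Only positive examples are used: any literal they falsify is removed.
    """
    literals = {('x', i) for i in range(n)} | {('not_x', i) for i in range(n)}
    for x, y in sample:
        if y == 1:
            for i in range(n):
                # remove the literal contradicted by this positive example
                literals.discard(('not_x', i) if x[i] == 1 else ('x', i))
    return literals

# Tiny hypothetical sample consistent with the target x1 AND (NOT x2), n = 3.
sample = [((1, 0, 0), 1), ((1, 0, 1), 1), ((0, 0, 1), 0)]
print(sorted(learn_conjunction(sample, 3)))   # -> [('not_x', 1), ('x', 0)]
```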
This lecture
PAC Model
Sample complexity, finite H, consistent case
Sample complexity, finite H, inconsistent case
Inconsistent Case
No h ∈ H is a consistent hypothesis.
The typical case in practice: difficult problems, complex concept classes.
But inconsistent hypotheses with a small number of errors on the training set can be useful.
Need a more powerful tool: Hoeffding's inequality.
Hoeffding's Inequality
Corollary: for any ε > 0 and any hypothesis h: X → {0, 1}, the following inequalities hold:
Pr[R_S(h) - R(h) ≥ ε] ≤ exp(-2mε²),
Pr[R_S(h) - R(h) ≤ -ε] ≤ exp(-2mε²).
Combining the two,
Pr[|R_S(h) - R(h)| ≥ ε] ≤ 2 exp(-2mε²).
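A small simulation (not part of the lecture) checking the two-sided bound for a single fixed hypothesis; the true error value R(h) = 0.3 and the i.i.d. 0/1 per-example losses are modeling assumptions made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

R_h = 0.3            # true error of a fixed hypothesis (assumed for the illustration)
m, eps = 100, 0.1
trials = 100_000

# Each trial draws a sample of m i.i.d. 0/1 losses and computes the empirical error.
losses = rng.random((trials, m)) < R_h
emp_errors = losses.mean(axis=1)

observed = np.mean(np.abs(emp_errors - R_h) >= eps)
bound = 2 * np.exp(-2 * m * eps ** 2)
print(f"observed Pr[|R_S(h) - R(h)| >= eps] = {observed:.4f} <= bound {bound:.4f}")
```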
Learning bound, finite H, inconsistent case
Theorem: let H be a finite hypothesis set. Then, for any δ > 0, with probability at least 1 - δ,
∀h ∈ H, R(h) ≤ R_S(h) + √((log|H| + log(2/δ)) / (2m)).
Proof: by the union bound and Hoeffding's inequality,
Pr[∃h ∈ H : |R(h) - R_S(h)| > ε]
  = Pr[(|R(h1) - R_S(h1)| > ε) ∨ ... ∨ (|R(h_|H|) - R_S(h_|H|)| > ε)]
  ≤ Σ_{h∈H} Pr[|R(h) - R_S(h)| > ε]
  ≤ 2|H| exp(-2mε²).
Setting the right-hand side to δ and solving for ε gives the bound.
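The slack term of this bound as a small helper (my own sketch; the values of |H|, m, and δ are arbitrary):

```python
import math

def generalization_gap(H_size, m, delta):
    """sqrt((log|H| + log(2/delta)) / (2m)): with probability >= 1 - delta,
    every h in a finite hypothesis set H satisfies R(h) <= R_S(h) + this slack."""
    return math.sqrt((math.log(H_size) + math.log(2.0 / delta)) / (2 * m))

print(generalization_gap(H_size=10 ** 6, m=10_000, delta=0.01))   # ~ 0.031
```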
Remarks
Thus, for a finite hypothesis set, with high probability,
∀h ∈ H, R(h) ≤ R_S(h) + O(√(log|H| / m)).
Error bound in O(1/√m), quadratically worse than the O(1/m) bound of the consistent case.
Occam's Razor
Principle formulated by the controversial theologian William of Occam: "plurality should not be posited without necessity," rephrased as "the simplest explanation is best";
invoked in a variety of contexts, e.g., syntax.
Kolmogorov complexity can be viewed as the corresponding framework in information theory.
Here, to minimize true error, choose the most parsimonious explanation (smallest |H|).
We will see other applications of this principle later.
Lecture Summary
C is PAC-learnable if there exists a learning algorithm L such that, for all c ∈ C and all ε, δ > 0, with sample size m = P(1/ε, 1/δ),
Pr_{S∼D^m}[R(h_S) ≤ ε] ≥ 1 - δ.
Consistent case: sample complexity m ≥ (1/ε)(log|H| + log(1/δ)).
Inconsistent case: with probability at least 1 - δ,
∀h ∈ H, R(h) ≤ R_S(h) + √((log|H| + log(2/δ)) / (2m)).
References
Anselm Blumer, A. Ehrenfeucht, David Haussler, and Manfred K. Warmuth. Learnability and the Vapnik-Chervonenkis dimension. Journal of the ACM (JACM), Volume 36, Issue 4, 1989.
Michael Kearns and Umesh Vazirani. An Introduction to Computational Learning Theory, MIT
Press, 1994.
Appendix
The bound m ≥ (1/ε)(log|H| + log(1/δ)) applied to a hypothesis set containing all boolean functions over n variables, so that |H| = 2^(2^n),
gives m ≥ (1/ε)((log 2) 2^n + log(1/δ)), a sample size exponential in n.
k-CNF Expressions
Definition: expressions T1 ∧ ... ∧ Tj of arbitrary length j, with each term Ti a disjunction of at most k boolean attributes.
Algorithm: reduce the problem to that of learning conjunctions of boolean literals by introducing (2n)^k new variables, one per possible disjunction of k literals:
(u1, ..., uk) → Y_{u1,...,uk}.
The transformation of an example is sketched below.
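A minimal sketch of that transformation (my own, not from the lecture): each example over x1, ..., xn is mapped to an assignment of the (2n)^k new variables, one per disjunction of k literals, so the reduced problem becomes conjunction learning over the new variables. The encoding of literals is an implementation choice.

```python
from itertools import product

def transform_example(x, k):
    """Map an assignment x (tuple of n bits) to the (2n)^k new variables Y_{u1,...,uk}.

    Each new variable corresponds to a disjunction (u1 or ... or uk) of k literals,
    a literal being (i, True) for x_i or (i, False) for its negation.
    Its value is the truth value of that disjunction under x.
    """
    n = len(x)
    literals = [(i, pos) for i in range(n) for pos in (True, False)]
    return {
        clause: any(x[i] == pos for i, pos in clause)
        for clause in product(literals, repeat=k)
    }

# Example: n = 3, k = 2 gives (2 * 3)^2 = 36 new boolean variables.
y = transform_example((1, 0, 1), k=2)
print(len(y), y[((0, True), (1, True))])   # 36, and Y_{x1 or x2} is True since x1 = 1
```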
A k-term DNF, i.e., a disjunction of k terms with each term a conjunction of boolean literals, is equivalent to a k-CNF by distributivity:
∨_{i=1}^{k} (u_{i,1} ∧ ... ∧ u_{i,n_i}) = ∧_{j1∈[1,n1], ..., jk∈[1,nk]} (u_{1,j1} ∨ ... ∨ u_{k,jk}).
Example:
(u1 ∧ u2 ∧ u3) ∨ (v1 ∧ v2 ∧ v3) = ∧_{i,j=1}^{3} (ui ∨ vj).
But, in general, converting a k-CNF (equivalent to a k-term DNF) to a k-term DNF is intractable.
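A quick brute-force check of the example identity (my own sketch), enumerating all 2^6 assignments of u1, u2, u3, v1, v2, v3:

```python
from itertools import product

# Check (u1 & u2 & u3) | (v1 & v2 & v3) == AND over i,j of (u_i | v_j) for all assignments.
for bits in product([False, True], repeat=6):
    u, v = bits[:3], bits[3:]
    lhs = all(u) or all(v)
    rhs = all(ui or vj for ui in u for vj in v)
    assert lhs == rhs
print("identity verified on all 64 assignments")
```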