
Algorithm-Independent Machine Learning

Anna Egorova-Förster
University of Lugano
Pattern Classification Reading Group, January 2007

All materials in these slides were taken from Pattern Classification (2nd ed.) by R. O. Duda, P. E. Hart and D. G. Stork, John Wiley & Sons, 2000, with the permission of the authors and the publisher.
Algorithm-Independent Machine Learning
So far: different classifiers and methods have been presented.
BUT:
Is some classifier better than all the others?
How can classifiers be compared?
Is such a comparison possible at all?
Is at least some classifier always better than random guessing?
AND
Do techniques exist that can improve any classifier?


No Free Lunch Theorem


For any two learning algorithms P1(h|D) and P2(h|D), the following hold, independent of the sampling distribution P(x) and of the number n of training points, where $\mathcal{E}_k(E \mid \cdot)$ denotes the expected off-training-set classification error of algorithm k:

1. Uniformly averaged over all target functions F,
   $\mathcal{E}_1(E \mid F, n) - \mathcal{E}_2(E \mid F, n) = 0$
2. For any fixed training set D, uniformly averaged over F,
   $\mathcal{E}_1(E \mid F, D) - \mathcal{E}_2(E \mid F, D) = 0$
3. Uniformly averaged over all priors P(F),
   $\mathcal{E}_1(E \mid n) - \mathcal{E}_2(E \mid n) = 0$
4. For any fixed training set D, uniformly averaged over P(F),
   $\mathcal{E}_1(E \mid D) - \mathcal{E}_2(E \mid D) = 0$


No Free Lunch Theorem

1. Uniformly averaged over all target functions F,
   $\mathcal{E}_1(E \mid F, n) - \mathcal{E}_2(E \mid F, n) = 0$
   Averaged over all possible target functions, the off-training-set error is the same for every learning algorithm.

2. For any fixed training set D, uniformly averaged over F,
   $\mathcal{E}_1(E \mid F, D) - \mathcal{E}_2(E \mid F, D) = 0$
   Even if we know the training set D, the off-training-set errors are the same when averaged over the target functions consistent with D.

Example with three binary features: the training set D fixes F on the first three patterns; the remaining five patterns form the off-training set, so there are 2^5 = 32 target functions consistent with D.

   x     F    h1   h2
   ----- training set D -----
   000   1    1    1
   001  -1   -1   -1
   010   1    1    1
   ----- off-training set -----
   011  -1    1   -1
   100   1    1   -1
   101  -1    1   -1
   110   1    1   -1
   111   1    1   -1
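As a concrete check of statement 1, here is a minimal Python sketch (not from the slides; the helper name is illustrative) that enumerates all 2^5 target functions consistent with D in the table above and shows that h1 and h2 have the same average off-training-set error.

```python
from itertools import product

# Off-training-set patterns from the table above; D fixes F on the first
# three patterns, so the 2**5 = 32 consistent target functions differ
# only in their labels on these five patterns.
off_training = ["011", "100", "101", "110", "111"]

# Off-training-set predictions of the two hypotheses in the table.
h1 = [+1, +1, +1, +1, +1]
h2 = [-1, -1, -1, -1, -1]

def avg_off_training_error(h):
    """Off-training-set error of h, uniformly averaged over all target
    functions F consistent with the training set D."""
    total, count = 0.0, 0
    for F in product([-1, +1], repeat=len(off_training)):
        total += sum(f != pred for f, pred in zip(F, h)) / len(off_training)
        count += 1
    return total / count

# Both print 0.5: averaged over all targets, neither hypothesis is better.
print(avg_off_training_error(h1), avg_off_training_error(h2))
```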

Consequences of the No Free Lunch Theorem

If no information about the target function F(x) is provided:
No classifier is better than any other in the general case.
No classifier is better than random guessing in the general case.

Ugly Duckling Theorem

Features Comparison
Binary features fi
Patterns xi are built from the features: f1 AND f2, f1 OR f2, etc.
Rank of a predicate: the number of simplest patterns it contains.
Rank 1:
  x1: f1 AND NOT f2
  x2: f1 AND f2
  x3: f2 AND NOT f1
Rank 2:
  x1 OR x2 : f1
Rank 3:
  x1 OR x2 OR x3 : f1 OR f2
[Figure: Venn diagram of the features f1 and f2, whose regions are the simplest patterns x1, x2, x3]

Features with prior information


Features Comparison
To compare two patterns: count the number of features they share?
Features: Blind_left = {0,1}, Blind_right = {0,1}
Is (0,1) more similar to (1,0) or to (1,1)?
Different representations are also possible, e.g.:
Features: Blind_right = {0,1}, Both_eyes_same = {0,1}
With no prior information about the features, it is impossible to prefer one representation over another (see the sketch below).
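A tiny sketch of the eye example above (the encodings of the three people are spelled out here for illustration): the same three people, described in the two feature representations from the slide, get different similarity rankings.

```python
# Similarity = number of binary features two patterns share.
def similarity(a, b):
    return sum(fa == fb for fa, fb in zip(a, b))

# Representation A: (Blind_left, Blind_right)
# p = blind right only, q = blind left only, r = blind in both eyes.
p_a, q_a, r_a = (0, 1), (1, 0), (1, 1)
# Representation B: (Blind_right, Both_eyes_same), same three people.
p_b, q_b, r_b = (1, 0), (0, 0), (1, 1)

# Under A, p is more similar to r than to q (1 vs. 0 shared features);
# under B, p is equally similar to both (1 vs. 1). The ranking depends
# on the representation, so without prior information none is preferable.
print(similarity(p_a, q_a), similarity(p_a, r_a))   # 0 1
print(similarity(p_b, q_b), similarity(p_b, r_b))   # 1 1
```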

Ugly Duckling Theorem


Given that we use a finite set of predicates that
enables us to distinguish any two patterns under
consideration, the number of predicates shared by
two such patterns is constant and independent of
the choice of those patterns. Furthermore, if
pattern similarity is based on the total number of
predicates shared by two patterns, then any two
patterns are equally similar.

An ugly duckling is as similar to beautiful swan 1 as beautiful swan 2 is to beautiful swan 1.


Ugly Duckling Theorem


Use for comparison of patterns the number of
predicates they share.
For two different patterns xi and xj:
No same predicates of rank 1
One of rank 2: xi OR xj
In the general case:
d d 2
r 2 (1 1) d 2
2 d 2

r2

Result is independent of choice of xi and xj!
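A small brute-force sketch of this count (illustrative, not from the slides), modelling a predicate of rank r as an OR of r of the d simplest patterns, as defined earlier:

```python
from itertools import combinations

def shared_predicates(d, i, j):
    """Count predicates (non-empty subsets of the d simplest patterns,
    viewed as ORs) that contain both pattern i and pattern j."""
    count = 0
    for r in range(1, d + 1):                 # rank r = subset size
        for subset in combinations(range(d), r):
            if i in subset and j in subset:
                count += 1
    return count

# For any d and any pair i != j the count is 2**(d-2),
# independent of which two patterns we pick.
d = 6
print(shared_predicates(d, 0, 1), shared_predicates(d, 2, 5), 2 ** (d - 2))
```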



Bias and Variance


$\mathcal{E}_D\big[(g(x;D) - F(x))^2\big] = \underbrace{\big(\mathcal{E}_D[g(x;D) - F(x)]\big)^2}_{\mathrm{bias}^2} + \underbrace{\mathcal{E}_D\big[(g(x;D) - \mathcal{E}_D[g(x;D)])^2\big]}_{\mathrm{variance}}$

Low bias: averaged over training sets D, the estimate g matches the true F accurately.
Low variance: the estimate changes little from one training set D to another.
Low bias usually comes with high variance.
High bias usually comes with low variance.
Best case: low bias and low variance.
This is only possible with as much prior information about F(x) as possible (a simulation sketch follows below).
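The trade-off can be seen in a small Monte Carlo sketch; the target function sin(2πx), the noise level, and the two polynomial degrees are assumptions chosen for illustration, not from the slides. Many training sets D are resampled, a rigid and a flexible model are fit to each, and bias² and variance are estimated at a grid of test points.

```python
import numpy as np

rng = np.random.default_rng(0)

def F(x):                       # true (unknown) target function
    return np.sin(2 * np.pi * x)

def fit_poly(deg, n=20, noise=0.3):
    """Fit a degree-`deg` polynomial to one noisy training set D."""
    x = rng.uniform(0, 1, n)
    y = F(x) + rng.normal(0, noise, n)
    return np.polynomial.Polynomial.fit(x, y, deg)

# Estimate bias^2 and variance at test points by resampling D many times.
x_test = np.linspace(0, 1, 50)
for deg in (1, 9):              # high-bias vs. high-variance model
    preds = np.array([fit_poly(deg)(x_test) for _ in range(200)])
    mean_pred = preds.mean(axis=0)
    bias2 = np.mean((mean_pred - F(x_test)) ** 2)       # squared bias, averaged over x
    variance = np.mean(preds.var(axis=0))                # variance across D, averaged over x
    print(f"degree {deg}: bias^2 = {bias2:.3f}, variance = {variance:.3f}")
```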


Bias and variance



Resampling for estimating statistics
Jackknife
Remove one point xi from the training set, giving D(i).
Calculate the statistic on the reduced training set, e.g. for the mean:
  $\hat{\mu}_{(i)} = \frac{1}{n-1}\sum_{j \neq i} x_j$
Repeat for all n points.
Calculate the jackknife estimate of the statistic:
  $\hat{\mu}_{(\cdot)} = \frac{1}{n}\sum_{i=1}^{n} \hat{\mu}_{(i)}$
(A sketch follows below.)
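A minimal sketch of the jackknife formulas above for the mean statistic; the sample values are made up for illustration.

```python
import numpy as np

def jackknife_mean(x):
    """Leave-one-out estimates and the jackknife estimate of the mean."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    # mu_(i): mean of the sample with point i removed
    loo = (x.sum() - x) / (n - 1)
    # mu_(.): average of the leave-one-out estimates
    return loo, loo.mean()

x = np.array([2.0, 4.0, 3.0, 7.0, 5.0])
loo, mu_jack = jackknife_mean(x)
print(loo)        # the n leave-one-out means
print(mu_jack)    # equals the ordinary sample mean for this statistic
```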


Bagging
Repeatedly draw n' < n training points and train a different component classifier on each draw.
Combine the component classifiers' votes into the final result (a sketch follows after this list).
The component classifiers are all of the same type: all neural networks, all decision trees, etc.
Instability: small changes in the training set lead to significantly different classifiers and/or results; bagging is most useful for such unstable classifiers.
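A minimal bagging sketch; the use of scikit-learn decision trees as the identically-typed component classifiers and the synthetic data set are assumptions for illustration, not part of the slides.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=300, random_state=0)

# Train B component classifiers, each on n' < n points drawn with replacement.
n_prime, B, trees = 200, 25, []
for _ in range(B):
    idx = rng.choice(len(X), size=n_prime, replace=True)
    trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

# Combine the component votes (majority vote) into the final decision.
votes = np.array([t.predict(X) for t in trees])
y_bag = (votes.mean(axis=0) > 0.5).astype(int)
print("training accuracy of the bagged ensemble:", (y_bag == y).mean())
```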


Boosting
Goal: improve the performance of any type of classifier.
Weak learner: a classifier whose accuracy is only slightly better than random guessing.
Example: three component classifiers for a two-class problem.
Draw three different training sets D1, D2 and D3 and train three different classifiers C1, C2 and C3 (weak learners).


Boosting
D1: randomly draw n1 < n training points from D.
Train C1 with D1.
D2: the most informative data set with respect to C1.
  Roughly half of its points are classified correctly by C1, half are not.
  Flip a coin: if heads, add the first pattern in D \ D1 that C1 misclassifies;
  if tails, add the first pattern in D \ D1 that C1 classifies correctly.
  Continue for as long as such patterns can be found.
Train C2 with D2.
D3: the most informative data set with respect to C1 and C2.
  Randomly select a pattern from D \ (D1 ∪ D2); if C1 and C2 disagree on it, add it to D3.
Train C3 with D3 (a code sketch of the procedure follows below).
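A rough sketch of this boosting-by-filtering procedure. The synthetic data, the choice of depth-1 scikit-learn trees as weak learners, and the sizes of D1 and D2 are assumptions; the half-and-half filtering below is a simplified stand-in for the coin-flip procedure on the slide. The final rule, use C3 only where C1 and C2 disagree, follows Duda, Hart and Stork.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X, y = make_classification(n_samples=600, random_state=1)

# D1: n1 < n points drawn at random; train the first weak learner C1.
idx1 = rng.choice(len(X), size=200, replace=False)
C1 = DecisionTreeClassifier(max_depth=1).fit(X[idx1], y[idx1])

# D2: most informative w.r.t. C1 -- roughly half the points are ones
# C1 gets right, half are ones it gets wrong.
rest = np.setdiff1d(np.arange(len(X)), idx1)
correct = rest[C1.predict(X[rest]) == y[rest]]
wrong = rest[C1.predict(X[rest]) != y[rest]]
k = min(len(correct), len(wrong), 100)
idx2 = np.concatenate([rng.choice(correct, k, replace=False),
                       rng.choice(wrong, k, replace=False)])
C2 = DecisionTreeClassifier(max_depth=1).fit(X[idx2], y[idx2])

# D3: remaining points on which C1 and C2 disagree; train C3 on them.
rest3 = np.setdiff1d(rest, idx2)
disagree = rest3[C1.predict(X[rest3]) != C2.predict(X[rest3])]
if len(disagree) == 0:          # degenerate case: C1 and C2 never disagree
    disagree = rest3
C3 = DecisionTreeClassifier(max_depth=1).fit(X[disagree], y[disagree])

# Final decision: if C1 and C2 agree, use their label; otherwise ask C3.
p1, p2, p3 = (C.predict(X) for C in (C1, C2, C3))
y_boost = np.where(p1 == p2, p1, p3)
print("training accuracy of the boosted ensemble:", (y_boost == y).mean())
```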


Boosting
