
A Black-Box approach to machine learning
Yoav Freund

Why do we need learning?


Computers need functions that map highly variable data:
  Speech recognition: Audio signal -> words
  Image analysis: Video signal -> objects
  Bio-Informatics: Micro-array images -> gene function
  Data Mining: Transaction logs -> customer classification

For accuracy, functions must be tuned to fit the data source.
For real-time processing, function computation has to be very fast.

The complexity/accuracy tradeoff

[Plot: error as a function of complexity; the "trivial performance" level is marked.]

The speed/flexibility tradeoff

[Plot: flexibility vs. speed. Matlab code, Java code, machine code, digital hardware, and analog hardware lie along the tradeoff curve, from most flexible/slowest to least flexible/fastest.]

Theory vs. Practice

Theoretician: I want a polynomial-time algorithm which is guaranteed to perform arbitrarily well in all situations.
- I prove theorems.
Practitioner: I want a real-time algorithm that performs well on my problem.
- I experiment.
My approach: I want combining algorithms whose performance and speed are guaranteed relative to the performance and speed of their components.
- I do both.

Plan of talk

The black-box approach


Boosting
Alternating decision trees
A commercial application
Boosting the margin
Confidence rated predictions
Online learning

The black-box approach


Statistical models are not generators, they are predictors.
A predictor is a function from observation X to action Z.
After the action is taken, outcome Y is observed, which implies a loss L (a real-valued number).

Goal: find a predictor with small loss
(in expectation, with high probability, cumulative)

Main software components


A predictor
A learner

Training examples

$(x_1,y_1), (x_2,y_2), \ldots, (x_m,y_m)$

We assume the predictor will be applied to examples similar to those on which it was trained.

Learning in a system

[Diagram: training examples feed a learning system, which outputs a predictor; inside the target system the predictor maps sensor data to actions, and feedback flows back as new training examples.]

Special case: Classification


Observation X - arbitrary (measurable) space
Outcome Y - finite set {1,...,K}; $y \in Y$
Prediction Z - {1,...,K}; $\hat{y} \in Z$
Usually K=2 (binary classification)

$$\mathrm{Loss}(\hat{y},y) = \begin{cases} 1 & \text{if } \hat{y} \neq y \\ 0 & \text{if } \hat{y} = y \end{cases}$$


Batch learning for binary classification

Data distribution: $(x,y) \sim D$; $y \in \{-1,+1\}$

Generalization error:
$$\epsilon(h) = P_{(x,y)\sim D}\big[h(x) \neq y\big]$$

Training set: $T = (x_1,y_1), (x_2,y_2), \ldots, (x_m,y_m)$; $T \sim D^m$

Training error:
$$\hat{\epsilon}(h) = \frac{1}{m} \sum_{(x,y)\in T} 1_{[h(x)\neq y]} = P_{(x,y)\sim T}\big[h(x)\neq y\big]$$
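Both error measures are easy to compute in code. A minimal sketch (the data, labels, and threshold predictor below are hypothetical, not from the talk): the training error is measured on T itself, and the generalization error is estimated on a fresh sample from D.

```python
import numpy as np

def error_rate(h, X, y):
    """Fraction of examples on which the predictor disagrees with the label (0/1 loss)."""
    return float(np.mean(h(X) != y))

# Hypothetical setup: labels in {-1,+1} determined by one feature, plus a simple threshold predictor.
rng = np.random.default_rng(0)
X_train, X_test = rng.normal(size=(1000, 5)), rng.normal(size=(10000, 5))
y_train, y_test = np.sign(X_train[:, 0] + 0.3), np.sign(X_test[:, 0] + 0.3)

h = lambda X: np.where(X[:, 0] > 0.0, 1, -1)

print("training error (epsilon-hat):", error_rate(h, X_train, y_train))
print("estimated generalization error (epsilon):", error_rate(h, X_test, y_test))
```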

Boosting

Combining weak learners


A weighted training set

$(x_1,y_1,w_1), (x_2,y_2,w_2), \ldots, (x_m,y_m,w_m)$



A weak learner

Weighted training set: $(x_1,y_1,w_1), (x_2,y_2,w_2), \ldots, (x_m,y_m,w_m)$

The weak learner receives the instances $x_1, x_2, \ldots, x_m$ and outputs a weak rule with predictions $\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_m$; $\hat{y}_i \in \{0,1\}$.

The weak requirement:
$$\frac{\sum_{i=1}^{m} y_i\, \hat{y}_i\, w_i}{\sum_{i=1}^{m} w_i} \;>\; \gamma \;>\; 0$$
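One possible weak learner for real-valued features is an exhaustive search over one-feature threshold rules. The sketch below is hypothetical (not the talk's implementation); it uses {0,1}-valued rules as on this slide and returns the rule with the largest weighted edge.

```python
import numpy as np

def stump_weak_learner(X, y, w):
    """Search one-feature rules h(x) = 1[x_j > thr] (predictions in {0,1}) and return
    the one with the largest weighted edge
        gamma = sum_i y_i * h(x_i) * w_i / sum_i w_i,
    i.e. the quantity in the weak requirement.  Labels y are in {-1,+1}."""
    w = w / w.sum()
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            pred = (X[:, j] > thr).astype(int)   # 0/1 weak rule
            edge = float(np.sum(y * pred * w))
            if best is None or edge > best[2]:
                best = (j, thr, edge)
    return best   # (feature index, threshold, edge)
```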

The boosting process

$(x_1,y_1,1/n), \ldots, (x_n,y_n,1/n)$  ->  weak learner  ->  $h_1$
$(x_1,y_1,w_1), \ldots, (x_n,y_n,w_n)$  ->  weak learner  ->  $h_2$
... each round reweights the examples and calls the weak learner again, producing $h_3, h_4, \ldots, h_T$.

Final rule:
$$F_T(x) = \alpha_1 h_1(x) + \alpha_2 h_2(x) + \cdots + \alpha_T h_T(x)$$
$$f_T(x) = \mathrm{sign}\big(F_T(x)\big)$$


Adaboost

$F_0(x) \equiv 0$
for $t = 1 \ldots T$:
    $w_i^t = \exp\big(-y_i F_{t-1}(x_i)\big)$
    Get $h_t$ from the weak learner
    $\alpha_t = \frac{1}{2} \ln \dfrac{\sum_{i:\, h_t(x_i)=1,\ y_i=+1} w_i^t}{\sum_{i:\, h_t(x_i)=1,\ y_i=-1} w_i^t}$
    $F_{t+1} = F_t + \alpha_t h_t$

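The update above translates almost line by line into code. A sketch under the same conventions (labels in {-1,+1}, {0,1}-valued weak rules); `weak_learner` can be the hypothetical stump searcher sketched earlier:

```python
import numpy as np

def adaboost(X, y, T, weak_learner):
    """Adaboost with {0,1}-valued weak rules, following the update on this slide.
    y is a +/-1 numpy array.  Returns the list of (feature, threshold, alpha) rules."""
    F = np.zeros(len(y))                    # F_t(x_i) on the training set
    rules, eps = [], 1e-12                  # eps guards against empty weight sums
    for _ in range(T):
        w = np.exp(-y * F)                  # w_i = exp(-y_i F_{t-1}(x_i))
        j, thr, _ = weak_learner(X, y, w)
        h = (X[:, j] > thr).astype(int)     # h_t on the training set
        W_pos = w[(h == 1) & (y == +1)].sum()
        W_neg = w[(h == 1) & (y == -1)].sum()
        alpha = 0.5 * np.log((W_pos + eps) / (W_neg + eps))
        F += alpha * h                      # F_{t+1} = F_t + alpha_t h_t
        rules.append((j, thr, alpha))
    return rules

def predict(rules, X):
    """f_T(x) = sign(F_T(x))."""
    F = np.zeros(len(X))
    for j, thr, alpha in rules:
        F += alpha * (X[:, j] > thr)
    return np.sign(F)
```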

Main property of Adaboost

If the advantages of the weak rules over random guessing are $\gamma_1, \gamma_2, \ldots, \gamma_T$, then the training error of the final rule is at most
$$\hat{\epsilon}(f_T) \le \exp\Big(-2 \sum_{t=1}^{T} \gamma_t^2\Big)$$

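For intuition, the bound falls exponentially as long as the advantages stay bounded away from zero; a quick numeric check with hypothetical advantage values:

```python
import math

def adaboost_training_error_bound(gammas):
    """exp(-2 * sum_t gamma_t^2): the bound on the training error of f_T."""
    return math.exp(-2.0 * sum(g * g for g in gammas))

# 100 rounds with a constant advantage of 0.1 already push the bound to exp(-2) ~ 0.135
print(adaboost_training_error_bound([0.1] * 100))
```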

Boosting block diagram

[Diagram: inside the strong learner, the booster maintains the example weights and repeatedly calls the weak learner, which returns a weak rule; the accumulated weak rules form the accurate combined rule.]

What is a good weak learner?

The set of weak rules (features) should be:
  Flexible enough to be (weakly) correlated with most conceivable relations between feature vector and label.
  Simple enough to allow efficient search for a rule with non-trivial weighted training error.
  Small enough to avoid over-fitting.

Calculation of the prediction from the observations should be very fast.

Alternating decision trees


Freund, Mason 1997


Decision Trees

[Figure: a small decision tree with tests X>3 and Y>5, and the corresponding axis-parallel partition of the (X,Y) plane into +1 and -1 regions.]

A decision tree as a sum of weak rules

[Figure: the same tree rewritten as the sign of a sum of weighted rules; each test (X>3, Y>5) contributes a real value (e.g. -0.2, +0.1, -0.3) and the prediction is the sign of the total.]

An alternating decision tree

[Figure: an alternating decision tree over the tests Y<1, X>3, and Y>5; each prediction node carries a real-valued contribution (e.g. +0.2, -0.1, +0.7), and the classification is the sign of the sum of the contributions along all paths consistent with the instance.]

Example: Medical Diagnostics

Cleve dataset from UC Irvine database.

Heart disease diagnostics (+1=healthy,-1=sick)

13 features from tests (real valued and discrete).

303 instances.


AD-tree for heart-disease diagnostics

>0 : Healthy
<0 : Sick


Commercial Deployment.


AT&T "buisosity" problem
Freund, Mason, Rogers, Pregibon, Cortes 2000

Distinguish business/residence customers from call detail information (time of day, length of call, ...).
  230M telephone numbers, label unknown for ~30%
  260M calls / day

Required computer resources:
  Huge: counting log entries to produce statistics -- use specialized I/O-efficient sorting algorithms (Hancock).
  Significant: calculating the classification for ~70M customers.
  Negligible: learning (2 hours on 10K training examples on an offline computer).


AD-tree for buisosity


AD-tree (Detail)


Quantifiable results

[Plot: accuracy and precision/recall as a function of the classifier score.]

For accuracy 94%, increased coverage from 44% to 56%.
Saved AT&T $15M in the year 2000 in operations costs and missed opportunities.

Adaboost's resistance to over-fitting
Why statisticians find Adaboost interesting.


A very curious phenomenon


Boosting decision trees

Using <10,000 training examples we fit >2,000,000 parameters



Large margins

$$\mathrm{margin}_{F_T}(x,y) \;=\; \frac{y \sum_{t=1}^{T} \alpha_t h_t(x)}{\sum_{t=1}^{T} \alpha_t} \;=\; \frac{y\, F_T(x)}{\sum_{t=1}^{T} \alpha_t}$$

$$\mathrm{margin}_{F_T}(x,y) > 0 \;\iff\; f_T(x) = y$$

Thesis: large margins => reliable predictions
Very similar to SVM.
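Margins are cheap to compute once the combined rule is available. A sketch using the (feature, threshold, alpha) representation from the hypothetical Adaboost code above; it normalizes by the sum of |alpha_t|, which equals the sum of alpha_t when all coefficients are positive:

```python
import numpy as np

def normalized_margins(rules, X, y):
    """margin(x,y) = y * sum_t alpha_t h_t(x) / sum_t |alpha_t|."""
    F, alpha_sum = np.zeros(len(X)), 0.0
    for j, thr, alpha in rules:
        F += alpha * (X[:, j] > thr)
        alpha_sum += abs(alpha)
    return y * F / alpha_sum

# A positive margin means the combined rule classifies the example correctly;
# margins near +1 mean the weighted weak rules agree almost unanimously.
```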

Experimental Evidence


Theorem
Schapire, Freund, Bartlett & Lee / Annals of Statistics 1998

H: set of binary functions with VC-dimension d

$$C = \Big\{ \sum_i \alpha_i h_i \;\Big|\; h_i \in H,\ \alpha_i \ge 0,\ \sum_i \alpha_i = 1 \Big\}$$

$$T = (x_1,y_1), (x_2,y_2), \ldots, (x_m,y_m); \qquad T \sim D^m$$

For all $c \in C$ and $\theta > 0$, with probability $1-\delta$ w.r.t. $T \sim D^m$:

$$P_{(x,y)\sim D}\big[\mathrm{sign}(c(x)) \neq y\big] \;\le\; P_{(x,y)\sim T}\big[\mathrm{margin}_c(x,y) \le \theta\big] \;+\; \tilde{O}\!\left(\frac{1}{\theta}\sqrt{\frac{d}{m}}\right)$$

No dependence on the number of combined functions!!!


Idea of Proof


Confidence-rated predictions
Agreement gives confidence


A motivating example

[Figure: a scatter of + and - training examples. A query point deep inside the + region gets a confident prediction; query points near the decision boundary or far from all training data are marked "unsure".]

The algorithm
Freund, Mansour, Schapire 2001

Parameters: $\eta > 0$, $\Delta > 0$

Hypothesis weight: $w(h) = e^{-\eta\, \hat{\epsilon}(h)}$

Empirical log ratio:
$$l(x) = \frac{1}{\eta} \ln \frac{\sum_{h:\, h(x)=+1} w(h)}{\sum_{h:\, h(x)=-1} w(h)}$$

Prediction rule:
$$p_{\eta,\Delta}(x) = \begin{cases} +1 & \text{if } l(x) > \Delta \\ \{-1,+1\}\ (\text{unsure}) & \text{if } |l(x)| \le \Delta \\ -1 & \text{if } l(x) < -\Delta \end{cases}$$
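A compact sketch of this rule over a finite hypothesis class (the array layout and names are hypothetical; a tiny constant guards against an empty side of the vote):

```python
import numpy as np

def log_ratio_predict(hyp_preds_train, y_train, hyp_preds_x, eta, delta):
    """Confidence-rated prediction by an exponentially weighted vote of all hypotheses.

    hyp_preds_train: (n_hyps, m) array of each hypothesis' +/-1 predictions on the training set
    hyp_preds_x:     (n_hyps,) array of each hypothesis' +/-1 prediction on the query point
    Returns +1, -1, or 0, where 0 means 'unsure' (abstain)."""
    emp_err = np.mean(hyp_preds_train != y_train, axis=1)      # epsilon-hat(h)
    w = np.exp(-eta * emp_err)                                 # w(h) = exp(-eta * epsilon-hat(h))
    w_plus = w[hyp_preds_x == +1].sum()
    w_minus = w[hyp_preds_x == -1].sum()
    l = np.log((w_plus + 1e-300) / (w_minus + 1e-300)) / eta   # empirical log ratio l(x)
    if l > delta:
        return +1
    if l < -delta:
        return -1
    return 0
```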

Suggested tuning

Suppose H is a finite set, and let $h^*$ be the hypothesis in H with the smallest generalization error $\epsilon(h^*)$. Tuning $\eta$ and $\Delta$ as functions of $m$, $|H|$ and $\delta$ yields:

1) $P(\text{mistake}) = P_{(x,y)\sim D}\big[y \neq p_{\eta,\Delta}(x)\big] \;\le\; 2\,\epsilon(h^*) + O\!\left(\frac{\ln m}{\sqrt{m}}\right)$

2) $P(\text{abstain}) = P_{(x,y)\sim D}\big[p_{\eta,\Delta}(x) = \{-1,+1\}\big] \;\le\; 5\,\epsilon(h^*) + O\!\left(\sqrt{\frac{\ln(1/\delta) + \ln|H|}{m}}\right)$


Confidence rating block diagram

Training examples: $(x_1,y_1), (x_2,y_2), \ldots, (x_m,y_m)$

[Diagram: a rater/combiner takes the training examples together with a set of candidate rules and outputs a single confidence-rated rule.]


Face Detection
Viola & Jones 1999
Paul Viola and Mike Jones developed a face detector that can work in real time (15 frames per second).



Using confidence to save time

The detector combines 6000 simple features using Adaboost.
In most boxes, only 8-9 features are calculated.

[Diagram: a cascade. All boxes are checked against Feature 1, then Feature 2, and so on; at each stage a box either continues as "might be a face" or is rejected as "definitely not a face".]
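The time saving comes from evaluating features in a fixed order and rejecting as soon as the running score falls below a stage threshold. A generic sketch of that idea (not Viola & Jones' actual code; stage scoring functions and thresholds are assumed given):

```python
def cascade_predict(stages, box):
    """Evaluate a cascade of boosted stages on one candidate image box.

    stages: list of (score_fn, threshold) pairs, cheapest stage first.  A box is
    rejected ("definitely not a face") as soon as one stage scores below its
    threshold, so most boxes only pay for the first stage or two."""
    for score_fn, threshold in stages:
        if score_fn(box) < threshold:
            return False        # definitely not a face
    return True                 # survived every stage: might be a face
```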

Using confidence to train car detectors


Original image vs. difference image



Co-training
Blum and Mitchell 98

[Diagram: highway images feed two partially trained classifiers, one based on the B/W image and one on the difference image; each classifier's confident predictions are fed back as labels to train the other.]


Co-training results
Levin, Freund, Viola 2002

[Figure: detection performance of the raw-image detector and the difference-image detector, before and after co-training.]

Selective sampling

[Diagram: unlabeled data flows through a partially trained classifier; only the examples it is unconfident about are sampled, labeled, and added to the labeled training set, which retrains the classifier.]

Query-by-committee: Seung, Opper & Sompolinsky; Freund, Seung, Shamir & Tishby

Online learning

Adapting to changes


Online learning

So far, the only statistical assumption was that the data is generated IID.
Can we get rid of that assumption?
Yes, if we consider prediction as a repeating game.

An expert is an algorithm that maps the past
$(x_1,y_1), (x_2,y_2), \ldots, (x_{t-1},y_{t-1}), x_t$
to a prediction $z_t$.

Suppose we have a set of experts; we believe one is good, but we don't know which one.


Online prediction game

For $t = 1, \ldots, T$:
  Experts generate predictions: $z_t^1, z_t^2, \ldots, z_t^N$
  Algorithm makes its own prediction: $\hat{z}_t$
  Nature generates outcome: $y_t$

Total loss of expert i: $L_T^i = \sum_{t=1}^{T} L(z_t^i, y_t)$
Total loss of algorithm: $L_T^A = \sum_{t=1}^{T} L(\hat{z}_t, y_t)$

Goal: for any sequence of events,
$$L_T^A \le \min_i L_T^i + o(T)$$

A very simple example

Binary classification
N experts
One expert is known to be perfect
Algorithm: predict like the majority of the experts that have made no mistake so far.
Bound: $L^A \le \log_2 N$
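A minimal sketch of this "halving" strategy (hypothetical expert interface; predictions and outcomes are 0/1):

```python
def halving_predict(expert_preds, alive):
    """Predict with the majority vote of the experts that are still consistent."""
    votes = [p for p, a in zip(expert_preds, alive) if a]
    return 1 if 2 * sum(votes) >= len(votes) else 0

def halving_update(expert_preds, alive, outcome):
    """Eliminate every expert that just made a mistake.  Whenever the algorithm errs,
    at least half of the surviving experts are eliminated, so with one perfect expert
    the algorithm makes at most log2(N) mistakes."""
    return [a and (p == outcome) for p, a in zip(expert_preds, alive)]
```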

History of online learning


Littlestone & Warmuth
Vovk
Vovk and Shafer's recent book:
"Probability and Finance: It's Only a Game!"
Innumerable contributions from many fields:
Hannan, Blackwell, Davison, Gallager, Cover, Barron,
Foster & Vohra, Fudenberg & Levine, Feder & Merhav,
Shtarkov, Rissanen, Cesa-Bianchi, Lugosi, Blum,
Freund, Schapire, Valiant, Auer, ...

Lossless compression

X - arbitrary input space
Y - {0,1}
Z - [0,1]

Log loss:
$$L(z,y) = y \log_2 \frac{1}{z} + (1-y)\log_2 \frac{1}{1-z}$$

Entropy, lossless compression, MDL.
Statistical likelihood, standard probability theory.

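The log loss in code (a trivial sketch); it is the number of bits an arithmetic coder spends on the outcome y under the predicted probability z:

```python
import math

def log_loss(z, y):
    """L(z, y) = y*log2(1/z) + (1-y)*log2(1/(1-z)), for z in (0,1) and y in {0,1}."""
    return -(y * math.log2(z) + (1 - y) * math.log2(1 - z))
```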

Bayesian averaging
Folk theorem in Information Theory

$$\hat{z}_t = \frac{\sum_{i=1}^{N} w_t^i\, z_t^i}{\sum_{i=1}^{N} w_t^i}\,; \qquad w_t^i = 2^{-L_{t-1}^i}$$

$$\forall T > 0: \quad L_T^A \;\le\; \log_2\!\Big(\sum_{i=1}^{N} w_1^i\Big) - \log_2\!\Big(\sum_{i=1}^{N} w_{T+1}^i\Big) \;\le\; \min_i L_T^i + \log_2 N$$

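A sketch of this averaging scheme (hypothetical interface: each expert supplies a probability that y_t = 1, and its weight is driven by its cumulative log loss):

```python
import numpy as np

def bayes_mixture_prediction(expert_probs, cumulative_losses):
    """z-hat_t: average of the experts' forecasts with weights w_t^i = 2^(-L_{t-1}^i)."""
    w = np.exp2(-np.asarray(cumulative_losses, dtype=float))
    return float(np.dot(w, expert_probs) / w.sum())

def update_cumulative_losses(expert_probs, cumulative_losses, y):
    """Add each expert's log loss on the observed outcome y (0 or 1)."""
    z = np.asarray(expert_probs, dtype=float)
    return np.asarray(cumulative_losses, dtype=float) - (
        y * np.log2(z) + (1 - y) * np.log2(1 - z))
```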

Game theoretical loss

X - arbitrary space
Y - a loss for each of N actions: $y \in [0,1]^N$
Z - a distribution over the N actions: $p \in [0,1]^N,\ \|p\|_1 = 1$

Loss: $L(p,y) = p \cdot y = E_{i\sim p}[y_i]$


Learning in games
Freund and Schapire 94

An algorithm which knows T in advance guarantees:
$$L_T^A \;\le\; \min_i L_T^i + \sqrt{2T \ln N} + \ln N$$

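An exponential-weights (Hedge-style) sketch that achieves a regret of this order when the horizon T is known; the learning rate below is a standard choice, not necessarily the one from the paper:

```python
import numpy as np

def hedge(loss_rounds, eta):
    """Play distributions p_t proportional to exp(-eta * cumulative loss) against a
    T x N matrix of action losses in [0,1]; return (algorithm loss, best action's loss)."""
    T, N = loss_rounds.shape
    cum, total = np.zeros(N), 0.0
    for y in loss_rounds:
        w = np.exp(-eta * (cum - cum.min()))   # shift for numerical stability
        p = w / w.sum()
        total += float(p @ y)                  # L(p_t, y_t) = p_t . y_t
        cum += y
    return total, float(cum.min())

rng = np.random.default_rng(0)
losses = rng.random((1000, 10))
T, N = losses.shape
alg_loss, best_loss = hedge(losses, eta=np.sqrt(2 * np.log(N) / T))
print("regret:", alg_loss - best_loss, " bound:", np.sqrt(2 * T * np.log(N)) + np.log(N))
```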

Multi-arm bandits
Auer, Cesa-Bianchi, Freund, Schapire 95

The algorithm cannot observe the full outcome $y_t$.
Instead, a single $i_t \in \{1,\ldots,N\}$ is chosen at random according to $p_t$, and only $y_t^{i_t}$ is observed.

We describe an algorithm that guarantees, with probability $\ge 1-\delta$:
$$L_T^A \;\le\; \min_i L_T^i + O\!\left(\sqrt{N T \ln\frac{NT}{\delta}}\right)$$
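A sketch in the spirit of the Exp3 family: the single observed loss is turned into an importance-weighted estimate of the full loss vector, and exponential weights are run on the estimates (constants and the exact high-probability variant are not taken from the paper):

```python
import numpy as np

def exp3(pull_arm, N, T, eta, gamma, seed=0):
    """Bandit exponential weights.  pull_arm(t, i) returns the loss y_t^i in [0,1]
    of the single arm i pulled at round t; the other arms' losses stay hidden."""
    rng = np.random.default_rng(seed)
    cum_est = np.zeros(N)                            # estimated cumulative losses
    for t in range(T):
        w = np.exp(-eta * (cum_est - cum_est.min()))
        p = (1 - gamma) * w / w.sum() + gamma / N    # mix in uniform exploration
        i = rng.choice(N, p=p)
        cum_est[i] += pull_arm(t, i) / p[i]          # unbiased estimate of y_t^i
    return cum_est
```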

Why isn't online learning practical?

Prescriptions too similar to the Bayesian approach.
Implementing low-level learning requires a large number of experts.
Computation increases linearly with the number of experts.
Potentially very powerful for combining a few high-level experts.

Online learning for detector deployment

[Diagram: a face-detector library (e.g. "MERL frontal 1.0": B/W frontal face detector; indoor, neutral background; direct front-right lighting) is downloaded into an adaptive real-time face detector driven by online learning; images flow in, face detections flow out, and feedback flows back.]

The deployed detector can be adaptive!

Summary

By combining predictors we can:
  Improve accuracy.
  Estimate prediction confidence.
  Adapt on-line.

To make machine learning practical:
  Speed up the predictors.
  Concentrate human feedback on hard cases.
  Fuse data from several sources.
  Share predictor libraries.
