
A Black-Box approach to machine learning
Yoav Freund

Why do we need learning?


Computers need functions that map highly variable data:
  Speech recognition: Audio signal -> words
  Image analysis: Video signal -> objects
  Bio-Informatics: Micro-array images -> gene function
  Data Mining: Transaction logs -> customer classification

For accuracy, functions must be tuned to fit the data source.
For real-time processing, function computation has to be very fast.

The complexity/accuracy tradeoff

[Plot: error as a function of complexity; the "trivial performance" level is marked.]

The speed/flexibility tradeoff

[Plot: flexibility vs. speed. Matlab code, Java code, machine code, digital hardware, and analog hardware lie along the tradeoff curve, from most flexible/slowest to least flexible/fastest.]

Theory vs. Practice

Theoretician: I want a polynomial-time algorithm which is guaranteed to perform arbitrarily well in all situations.
- I prove theorems.
Practitioner: I want a real-time algorithm that performs well on my problem.
- I experiment.
My approach: I want combining algorithms whose performance and speed are guaranteed relative to the performance and speed of their components.
- I do both.

Plan of talk

The black-box approach


Boosting
Alternating decision trees
A commercial application
Boosting the margin
Confidence rated predictions
Online learning

The black-box approach


Statistical models are not generators, they are predictors.
A predictor is a function from observation X to action Z.
After the action is taken, outcome Y is observed, which implies a loss L (a real-valued number).

Goal: find a predictor with small loss
(in expectation, with high probability, cumulative)

Main software components


A predictor
A learner

Training examples

$(x_1,y_1), (x_2,y_2), \ldots, (x_m,y_m)$

We assume the predictor will be applied to examples similar to those on which it was trained.

Learning in a system

[Diagram: training examples feed a learning system, which outputs a predictor; inside the target system the predictor maps sensor data to actions, and feedback flows back as new training examples.]

Special case: Classification


Observation X - arbitrary (measurable) space
Outcome Y - finite set {1,...,K}; $y \in Y$
Prediction Z - {1,...,K}; $\hat{y} \in Z$
Usually K=2 (binary classification)

$$\mathrm{Loss}(\hat{y},y) = \begin{cases} 1 & \text{if } \hat{y} \neq y \\ 0 & \text{if } \hat{y} = y \end{cases}$$


Batch learning for binary classification

Data distribution: $(x,y) \sim D$; $y \in \{-1,+1\}$

Generalization error:
$$\epsilon(h) = P_{(x,y)\sim D}\big[h(x) \neq y\big]$$

Training set: $T = (x_1,y_1), (x_2,y_2), \ldots, (x_m,y_m)$; $T \sim D^m$

Training error:
$$\hat{\epsilon}(h) = \frac{1}{m} \sum_{(x,y)\in T} 1_{[h(x)\neq y]} = P_{(x,y)\sim T}\big[h(x)\neq y\big]$$
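Both error measures are easy to compute in code. A minimal sketch (the data, labels, and threshold predictor below are hypothetical, not from the talk): the training error is measured on T itself, and the generalization error is estimated on a fresh sample from D.

```python
import numpy as np

def error_rate(h, X, y):
    """Fraction of examples on which the predictor disagrees with the label (0/1 loss)."""
    return float(np.mean(h(X) != y))

# Hypothetical setup: labels in {-1,+1} determined by one feature, plus a simple threshold predictor.
rng = np.random.default_rng(0)
X_train, X_test = rng.normal(size=(1000, 5)), rng.normal(size=(10000, 5))
y_train, y_test = np.sign(X_train[:, 0] + 0.3), np.sign(X_test[:, 0] + 0.3)

h = lambda X: np.where(X[:, 0] > 0.0, 1, -1)

print("training error (epsilon-hat):", error_rate(h, X_train, y_train))
print("estimated generalization error (epsilon):", error_rate(h, X_test, y_test))
```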

Boosting

Combining weak learners


A weighted training set

$(x_1,y_1,w_1), (x_2,y_2,w_2), \ldots, (x_m,y_m,w_m)$



A weak learner

Weighted training set: $(x_1,y_1,w_1), (x_2,y_2,w_2), \ldots, (x_m,y_m,w_m)$

The weak learner receives the instances $x_1, x_2, \ldots, x_m$ and outputs a weak rule with predictions $\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_m$; $\hat{y}_i \in \{0,1\}$.

The weak requirement:
$$\frac{\sum_{i=1}^{m} y_i\, \hat{y}_i\, w_i}{\sum_{i=1}^{m} w_i} \;>\; \gamma \;>\; 0$$
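One possible weak learner for real-valued features is an exhaustive search over one-feature threshold rules. The sketch below is hypothetical (not the talk's implementation); it uses {0,1}-valued rules as on this slide and returns the rule with the largest weighted edge.

```python
import numpy as np

def stump_weak_learner(X, y, w):
    """Search one-feature rules h(x) = 1[x_j > thr] (predictions in {0,1}) and return
    the one with the largest weighted edge
        gamma = sum_i y_i * h(x_i) * w_i / sum_i w_i,
    i.e. the quantity in the weak requirement.  Labels y are in {-1,+1}."""
    w = w / w.sum()
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            pred = (X[:, j] > thr).astype(int)   # 0/1 weak rule
            edge = float(np.sum(y * pred * w))
            if best is None or edge > best[2]:
                best = (j, thr, edge)
    return best   # (feature index, threshold, edge)
```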

The boosting process

$(x_1,y_1,1/n), \ldots, (x_n,y_n,1/n)$  ->  weak learner  ->  $h_1$
$(x_1,y_1,w_1), \ldots, (x_n,y_n,w_n)$  ->  weak learner  ->  $h_2$
... each round reweights the examples and calls the weak learner again, producing $h_3, h_4, \ldots, h_T$.

Final rule:
$$F_T(x) = \alpha_1 h_1(x) + \alpha_2 h_2(x) + \cdots + \alpha_T h_T(x)$$
$$f_T(x) = \mathrm{sign}\big(F_T(x)\big)$$


Adaboost

$F_0(x) \equiv 0$
for $t = 1 \ldots T$:
    $w_i^t = \exp\big(-y_i F_{t-1}(x_i)\big)$
    Get $h_t$ from the weak learner
    $\alpha_t = \frac{1}{2} \ln \dfrac{\sum_{i:\, h_t(x_i)=1,\ y_i=+1} w_i^t}{\sum_{i:\, h_t(x_i)=1,\ y_i=-1} w_i^t}$
    $F_{t+1} = F_t + \alpha_t h_t$

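The update above translates almost line by line into code. A sketch under the same conventions (labels in {-1,+1}, {0,1}-valued weak rules); `weak_learner` can be the hypothetical stump searcher sketched earlier:

```python
import numpy as np

def adaboost(X, y, T, weak_learner):
    """Adaboost with {0,1}-valued weak rules, following the update on this slide.
    y is a +/-1 numpy array.  Returns the list of (feature, threshold, alpha) rules."""
    F = np.zeros(len(y))                    # F_t(x_i) on the training set
    rules, eps = [], 1e-12                  # eps guards against empty weight sums
    for _ in range(T):
        w = np.exp(-y * F)                  # w_i = exp(-y_i F_{t-1}(x_i))
        j, thr, _ = weak_learner(X, y, w)
        h = (X[:, j] > thr).astype(int)     # h_t on the training set
        W_pos = w[(h == 1) & (y == +1)].sum()
        W_neg = w[(h == 1) & (y == -1)].sum()
        alpha = 0.5 * np.log((W_pos + eps) / (W_neg + eps))
        F += alpha * h                      # F_{t+1} = F_t + alpha_t h_t
        rules.append((j, thr, alpha))
    return rules

def predict(rules, X):
    """f_T(x) = sign(F_T(x))."""
    F = np.zeros(len(X))
    for j, thr, alpha in rules:
        F += alpha * (X[:, j] > thr)
    return np.sign(F)
```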

Main property of Adaboost

If the advantages of the weak rules over random guessing are $\gamma_1, \gamma_2, \ldots, \gamma_T$, then the training error of the final rule is at most
$$\hat{\epsilon}(f_T) \le \exp\Big(-2 \sum_{t=1}^{T} \gamma_t^2\Big)$$

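For intuition, the bound falls exponentially as long as the advantages stay bounded away from zero; a quick numeric check with hypothetical advantage values:

```python
import math

def adaboost_training_error_bound(gammas):
    """exp(-2 * sum_t gamma_t^2): the bound on the training error of f_T."""
    return math.exp(-2.0 * sum(g * g for g in gammas))

# 100 rounds with a constant advantage of 0.1 already push the bound to exp(-2) ~ 0.135
print(adaboost_training_error_bound([0.1] * 100))
```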

Boosting block diagram

[Diagram: inside the strong learner, the booster maintains the example weights and repeatedly calls the weak learner, which returns a weak rule; the accumulated weak rules form the accurate combined rule.]

What is a good weak learner?

The set of weak rules (features) should be:
  Flexible enough to be (weakly) correlated with most conceivable relations between feature vector and label.
  Simple enough to allow efficient search for a rule with non-trivial weighted training error.
  Small enough to avoid over-fitting.

Calculation of the prediction from the observations should be very fast.

Alternating decision trees


Freund, Mason 1997


Decision Trees

[Figure: a small decision tree with tests X>3 and Y>5, and the corresponding axis-parallel partition of the (X,Y) plane into +1 and -1 regions.]

A decision tree as a sum of weak rules

[Figure: the same tree rewritten as the sign of a sum of weighted rules; each test (X>3, Y>5) contributes a real value (e.g. -0.2, +0.1, -0.3) and the prediction is the sign of the total.]

An alternating decision tree

[Figure: an alternating decision tree over the tests Y<1, X>3, and Y>5; each prediction node carries a real-valued contribution (e.g. +0.2, -0.1, +0.7), and the classification is the sign of the sum of the contributions along all paths consistent with the instance.]

Example: Medical Diagnostics

Cleve dataset from UC Irvine database.

Heart disease diagnostics (+1=healthy,-1=sick)

13 features from tests (real valued and discrete).

303 instances.


AD-tree for heart-disease diagnostics

>0 : Healthy
<0 : Sick


Commercial Deployment.


AT&T "buisosity" problem
Freund, Mason, Rogers, Pregibon, Cortes 2000

Distinguish business/residence customers from call detail information (time of day, length of call, ...).
  230M telephone numbers, label unknown for ~30%
  260M calls / day

Required computer resources:
  Huge: counting log entries to produce statistics -- use specialized I/O-efficient sorting algorithms (Hancock).
  Significant: calculating the classification for ~70M customers.
  Negligible: learning (2 hours on 10K training examples on an offline computer).


AD-tree for buisosity


AD-tree (Detail)


Quantifiable results

[Plot: accuracy and precision/recall as a function of the classifier score.]

For accuracy 94%, increased coverage from 44% to 56%.
Saved AT&T $15M in the year 2000 in operations costs and missed opportunities.

Adaboost's resistance to over-fitting
Why statisticians find Adaboost interesting.


A very curious phenomenon


Boosting decision trees

Using <10,000 training examples we fit >2,000,000 parameters



Large margins

$$\mathrm{margin}_{F_T}(x,y) \;=\; \frac{y \sum_{t=1}^{T} \alpha_t h_t(x)}{\sum_{t=1}^{T} \alpha_t} \;=\; \frac{y\, F_T(x)}{\sum_{t=1}^{T} \alpha_t}$$

$$\mathrm{margin}_{F_T}(x,y) > 0 \;\iff\; f_T(x) = y$$

Thesis: large margins => reliable predictions
Very similar to SVM.
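Margins are cheap to compute once the combined rule is available. A sketch using the (feature, threshold, alpha) representation from the hypothetical Adaboost code above; it normalizes by the sum of |alpha_t|, which equals the sum of alpha_t when all coefficients are positive:

```python
import numpy as np

def normalized_margins(rules, X, y):
    """margin(x,y) = y * sum_t alpha_t h_t(x) / sum_t |alpha_t|."""
    F, alpha_sum = np.zeros(len(X)), 0.0
    for j, thr, alpha in rules:
        F += alpha * (X[:, j] > thr)
        alpha_sum += abs(alpha)
    return y * F / alpha_sum

# A positive margin means the combined rule classifies the example correctly;
# margins near +1 mean the weighted weak rules agree almost unanimously.
```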

Experimental Evidence


Theorem
Schapire, Freund, Bartlett & Lee / Annals of Statistics 1998

H: set of binary functions with VC-dimension d

$$C = \Big\{ \sum_i \alpha_i h_i \;\Big|\; h_i \in H,\ \alpha_i \ge 0,\ \sum_i \alpha_i = 1 \Big\}$$

$$T = (x_1,y_1), (x_2,y_2), \ldots, (x_m,y_m); \qquad T \sim D^m$$

For all $c \in C$ and $\theta > 0$, with probability $1-\delta$ w.r.t. $T \sim D^m$:

$$P_{(x,y)\sim D}\big[\mathrm{sign}(c(x)) \neq y\big] \;\le\; P_{(x,y)\sim T}\big[\mathrm{margin}_c(x,y) \le \theta\big] \;+\; \tilde{O}\!\left(\frac{1}{\theta}\sqrt{\frac{d}{m}}\right)$$

No dependence on the number of combined functions!!!


Idea of Proof


Confidence-rated predictions
Agreement gives confidence


A motivating example

[Figure: a scatter of + and - training examples. A query point deep inside the + region gets a confident prediction; query points near the decision boundary or far from all training data are marked "unsure".]

The algorithm
Freund, Mansour, Schapire 2001

Parameters: $\eta > 0$, $\Delta > 0$

Hypothesis weight: $w(h) = e^{-\eta\, \hat{\epsilon}(h)}$

Empirical log ratio:
$$l(x) = \frac{1}{\eta} \ln \frac{\sum_{h:\, h(x)=+1} w(h)}{\sum_{h:\, h(x)=-1} w(h)}$$

Prediction rule:
$$p_{\eta,\Delta}(x) = \begin{cases} +1 & \text{if } l(x) > \Delta \\ \{-1,+1\}\ (\text{unsure}) & \text{if } |l(x)| \le \Delta \\ -1 & \text{if } l(x) < -\Delta \end{cases}$$
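A compact sketch of this rule over a finite hypothesis class (the array layout and names are hypothetical; a tiny constant guards against an empty side of the vote):

```python
import numpy as np

def log_ratio_predict(hyp_preds_train, y_train, hyp_preds_x, eta, delta):
    """Confidence-rated prediction by an exponentially weighted vote of all hypotheses.

    hyp_preds_train: (n_hyps, m) array of each hypothesis' +/-1 predictions on the training set
    hyp_preds_x:     (n_hyps,) array of each hypothesis' +/-1 prediction on the query point
    Returns +1, -1, or 0, where 0 means 'unsure' (abstain)."""
    emp_err = np.mean(hyp_preds_train != y_train, axis=1)      # epsilon-hat(h)
    w = np.exp(-eta * emp_err)                                 # w(h) = exp(-eta * epsilon-hat(h))
    w_plus = w[hyp_preds_x == +1].sum()
    w_minus = w[hyp_preds_x == -1].sum()
    l = np.log((w_plus + 1e-300) / (w_minus + 1e-300)) / eta   # empirical log ratio l(x)
    if l > delta:
        return +1
    if l < -delta:
        return -1
    return 0
```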

Suggested tuning

Suppose H is a finite set, and let $h^*$ be the hypothesis in H with the smallest generalization error $\epsilon(h^*)$. Tuning $\eta$ and $\Delta$ as functions of $m$, $|H|$ and $\delta$ yields:

1) $P(\text{mistake}) = P_{(x,y)\sim D}\big[y \neq p_{\eta,\Delta}(x)\big] \;\le\; 2\,\epsilon(h^*) + O\!\left(\frac{\ln m}{\sqrt{m}}\right)$

2) $P(\text{abstain}) = P_{(x,y)\sim D}\big[p_{\eta,\Delta}(x) = \{-1,+1\}\big] \;\le\; 5\,\epsilon(h^*) + O\!\left(\sqrt{\frac{\ln(1/\delta) + \ln|H|}{m}}\right)$


Confidence rating block diagram

Training examples: $(x_1,y_1), (x_2,y_2), \ldots, (x_m,y_m)$

[Diagram: a rater/combiner takes the training examples together with a set of candidate rules and outputs a single confidence-rated rule.]


Face Detection
Viola & Jones 1999
Paul Viola and Mike Jones developed a face detector that can work in real time (15 frames per second).



Using confidence to save time

The detector combines 6000 simple features using Adaboost.
In most boxes, only 8-9 features are calculated.

[Diagram: a cascade. All boxes are checked against Feature 1, then Feature 2, and so on; at each stage a box either continues as "might be a face" or is rejected as "definitely not a face".]
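The time saving comes from evaluating features in a fixed order and rejecting as soon as the running score falls below a stage threshold. A generic sketch of that idea (not Viola & Jones' actual code; stage scoring functions and thresholds are assumed given):

```python
def cascade_predict(stages, box):
    """Evaluate a cascade of boosted stages on one candidate image box.

    stages: list of (score_fn, threshold) pairs, cheapest stage first.  A box is
    rejected ("definitely not a face") as soon as one stage scores below its
    threshold, so most boxes only pay for the first stage or two."""
    for score_fn, threshold in stages:
        if score_fn(box) < threshold:
            return False        # definitely not a face
    return True                 # survived every stage: might be a face
```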

Using confidence to train car detectors


Original image vs. difference image



Co-training
Blum and Mitchell 98

[Diagram: highway images feed two partially trained classifiers, one based on the B/W image and one on the difference image; each classifier's confident predictions are fed back as labels to train the other.]


Co-training results
Levin, Freund, Viola 2002

[Figure: detection performance of the raw-image detector and the difference-image detector, before and after co-training.]

Selective sampling

[Diagram: unlabeled data flows through a partially trained classifier; only the examples it is unconfident about are sampled, labeled, and added to the labeled training set, which retrains the classifier.]

Query-by-committee: Seung, Opper & Sompolinsky; Freund, Seung, Shamir & Tishby

Online learning

Adapting to changes


Online learning

So far, the only statistical assumption was that the data is generated IID.
Can we get rid of that assumption?
Yes, if we consider prediction as a repeating game.

An expert is an algorithm that maps the past
$(x_1,y_1), (x_2,y_2), \ldots, (x_{t-1},y_{t-1}), x_t$
to a prediction $z_t$.

Suppose we have a set of experts; we believe one is good, but we don't know which one.


Online prediction game

For $t = 1, \ldots, T$:
  Experts generate predictions: $z_t^1, z_t^2, \ldots, z_t^N$
  Algorithm makes its own prediction: $\hat{z}_t$
  Nature generates outcome: $y_t$

Total loss of expert i: $L_T^i = \sum_{t=1}^{T} L(z_t^i, y_t)$
Total loss of algorithm: $L_T^A = \sum_{t=1}^{T} L(\hat{z}_t, y_t)$

Goal: for any sequence of events,
$$L_T^A \le \min_i L_T^i + o(T)$$

A very simple example

Binary classification
N experts
One expert is known to be perfect
Algorithm: predict like the majority of the experts that have made no mistake so far.
Bound: $L^A \le \log_2 N$
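A minimal sketch of this "halving" strategy (hypothetical expert interface; predictions and outcomes are 0/1):

```python
def halving_predict(expert_preds, alive):
    """Predict with the majority vote of the experts that are still consistent."""
    votes = [p for p, a in zip(expert_preds, alive) if a]
    return 1 if 2 * sum(votes) >= len(votes) else 0

def halving_update(expert_preds, alive, outcome):
    """Eliminate every expert that just made a mistake.  Whenever the algorithm errs,
    at least half of the surviving experts are eliminated, so with one perfect expert
    the algorithm makes at most log2(N) mistakes."""
    return [a and (p == outcome) for p, a in zip(expert_preds, alive)]
```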

History of online learning


Littlestone & Warmuth
Vovk
Vovk and Shafer's recent book:
"Probability and Finance: It's Only a Game!"
Innumerable contributions from many fields:
Hannan, Blackwell, Davison, Gallager, Cover, Barron,
Foster & Vohra, Fudenberg & Levine, Feder & Merhav,
Shtarkov, Rissanen, Cesa-Bianchi, Lugosi, Blum,
Freund, Schapire, Valiant, Auer, ...

Lossless compression

X - arbitrary input space
Y - {0,1}
Z - [0,1]

Log loss:
$$L(z,y) = y \log_2 \frac{1}{z} + (1-y)\log_2 \frac{1}{1-z}$$

Entropy, lossless compression, MDL.
Statistical likelihood, standard probability theory.

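The log loss in code (a trivial sketch); it is the number of bits an arithmetic coder spends on the outcome y under the predicted probability z:

```python
import math

def log_loss(z, y):
    """L(z, y) = y*log2(1/z) + (1-y)*log2(1/(1-z)), for z in (0,1) and y in {0,1}."""
    return -(y * math.log2(z) + (1 - y) * math.log2(1 - z))
```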

Bayesian averaging
Folk theorem in Information Theory

$$\hat{z}_t = \frac{\sum_{i=1}^{N} w_t^i\, z_t^i}{\sum_{i=1}^{N} w_t^i}\,; \qquad w_t^i = 2^{-L_{t-1}^i}$$

$$\forall T > 0: \quad L_T^A \;\le\; \log_2\!\Big(\sum_{i=1}^{N} w_1^i\Big) - \log_2\!\Big(\sum_{i=1}^{N} w_{T+1}^i\Big) \;\le\; \min_i L_T^i + \log_2 N$$

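A sketch of this averaging scheme (hypothetical interface: each expert supplies a probability that y_t = 1, and its weight is driven by its cumulative log loss):

```python
import numpy as np

def bayes_mixture_prediction(expert_probs, cumulative_losses):
    """z-hat_t: average of the experts' forecasts with weights w_t^i = 2^(-L_{t-1}^i)."""
    w = np.exp2(-np.asarray(cumulative_losses, dtype=float))
    return float(np.dot(w, expert_probs) / w.sum())

def update_cumulative_losses(expert_probs, cumulative_losses, y):
    """Add each expert's log loss on the observed outcome y (0 or 1)."""
    z = np.asarray(expert_probs, dtype=float)
    return np.asarray(cumulative_losses, dtype=float) - (
        y * np.log2(z) + (1 - y) * np.log2(1 - z))
```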

Game theoretical loss

X - arbitrary space
Y - a loss for each of N actions: $y \in [0,1]^N$
Z - a distribution over the N actions: $p \in [0,1]^N,\ \|p\|_1 = 1$

Loss: $L(p,y) = p \cdot y = E_{i\sim p}[y_i]$


Learning in games
Freund and Schapire 94

An algorithm which knows T in advance guarantees:
$$L_T^A \;\le\; \min_i L_T^i + \sqrt{2T \ln N} + \ln N$$

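An exponential-weights (Hedge-style) sketch that achieves a regret of this order when the horizon T is known; the learning rate below is a standard choice, not necessarily the one from the paper:

```python
import numpy as np

def hedge(loss_rounds, eta):
    """Play distributions p_t proportional to exp(-eta * cumulative loss) against a
    T x N matrix of action losses in [0,1]; return (algorithm loss, best action's loss)."""
    T, N = loss_rounds.shape
    cum, total = np.zeros(N), 0.0
    for y in loss_rounds:
        w = np.exp(-eta * (cum - cum.min()))   # shift for numerical stability
        p = w / w.sum()
        total += float(p @ y)                  # L(p_t, y_t) = p_t . y_t
        cum += y
    return total, float(cum.min())

rng = np.random.default_rng(0)
losses = rng.random((1000, 10))
T, N = losses.shape
alg_loss, best_loss = hedge(losses, eta=np.sqrt(2 * np.log(N) / T))
print("regret:", alg_loss - best_loss, " bound:", np.sqrt(2 * T * np.log(N)) + np.log(N))
```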

Multi-arm bandits
Auer, Cesa-Bianchi, Freund, Schapire 95

The algorithm cannot observe the full outcome $y_t$.
Instead, a single $i_t \in \{1,\ldots,N\}$ is chosen at random according to $p_t$, and only $y_t^{i_t}$ is observed.

We describe an algorithm that guarantees, with probability $\ge 1-\delta$:
$$L_T^A \;\le\; \min_i L_T^i + O\!\left(\sqrt{N T \ln\frac{NT}{\delta}}\right)$$
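A sketch in the spirit of the Exp3 family: the single observed loss is turned into an importance-weighted estimate of the full loss vector, and exponential weights are run on the estimates (constants and the exact high-probability variant are not taken from the paper):

```python
import numpy as np

def exp3(pull_arm, N, T, eta, gamma, seed=0):
    """Bandit exponential weights.  pull_arm(t, i) returns the loss y_t^i in [0,1]
    of the single arm i pulled at round t; the other arms' losses stay hidden."""
    rng = np.random.default_rng(seed)
    cum_est = np.zeros(N)                            # estimated cumulative losses
    for t in range(T):
        w = np.exp(-eta * (cum_est - cum_est.min()))
        p = (1 - gamma) * w / w.sum() + gamma / N    # mix in uniform exploration
        i = rng.choice(N, p=p)
        cum_est[i] += pull_arm(t, i) / p[i]          # unbiased estimate of y_t^i
    return cum_est
```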

Why isn't online learning practical?

Prescriptions too similar to the Bayesian approach.
Implementing low-level learning requires a large number of experts.
Computation increases linearly with the number of experts.
Potentially very powerful for combining a few high-level experts.

Online learning for detector deployment

[Diagram: a face-detector library (e.g. "MERL frontal 1.0": B/W frontal face detector; indoor, neutral background; direct front-right lighting) is downloaded into an adaptive real-time face detector driven by online learning; images flow in, face detections flow out, and feedback flows back.]

The deployed detector can be adaptive!

Summary

By combining predictors we can:
  Improve accuracy.
  Estimate prediction confidence.
  Adapt on-line.

To make machine learning practical:
  Speed up the predictors.
  Concentrate human feedback on hard cases.
  Fuse data from several sources.
  Share predictor libraries.
