Machine Learning
Yoav Freund
Speech recognition: audio signal -> words
Image analysis: video signal -> objects
Bioinformatics: micro-array images -> gene function
Data mining: transaction logs -> customer classification
The complexity/accuracy tradeoff
[Figure: error as a function of model complexity, with the trivial-performance error level marked.]
[Figure: the flexibility/speed spectrum: Java code, machine code, digital hardware, analog hardware; flexibility is traded for speed.]
Plan of talk
Learning in a system
[Diagram: training examples feed a Learning System, which outputs a predictor; the predictor is embedded in a Target System that maps sensor data to actions, and feedback from the target system flows back as new training examples.]
The outcome is $y \in Y$; the prediction is $\hat{y} \in Z = \{1, \dots, K\}$.
$\mathrm{Loss}(\hat{y}, y) = 1$ if $\hat{y} \ne y$, and $0$ if $\hat{y} = y$.
Generalization error
Examples are drawn i.i.d.: $(x, y) \sim D$, with $y \in \{-1, +1\}$.
Generalization error: $\varepsilon(h) \doteq P_{(x,y) \sim D}[h(x) \ne y]$.
Training error: $\hat{\varepsilon}(h) \doteq P_{(x,y) \sim T}[h(x) \ne y]$, where $T$ is a training set of $m$ examples.
Boosting
A weak learner
Weighted training set: $(x_1, y_1, w_1), (x_2, y_2, w_2), \dots, (x_m, y_m, w_m)$.
The weak learner maps the instances $x_1, x_2, \dots, x_m$ to predictions $\hat{y}_1, \hat{y}_2, \dots, \hat{y}_m$, $\hat{y}_i \in \{0, 1\}$, defining a weak rule $h$.
The weak rule must beat random guessing by some margin $\gamma > 0$:
$$\frac{\sum_{i=1}^{m} w_i \,\mathbf{1}[\hat{y}_i = y_i]}{\sum_{i=1}^{m} w_i} \;>\; \frac{1}{2} + \gamma.$$
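A minimal sketch of such a weak learner in Python, using an exhaustive decision stump; the stump form, names, and interface are illustrative assumptions, not part of the talk:

```python
import numpy as np

def stump_weak_learner(X, y, w):
    """Exhaustive decision stump for labels y_i in {0, 1}: return the rule
    h(x) = [x_j > theta] (or its complement) with the highest weighted
    agreement; weak learning assumes this agreement exceeds 1/2 + gamma."""
    best_acc, best_rule = -1.0, None
    for j in range(X.shape[1]):
        for theta in np.unique(X[:, j]):
            for flip in (False, True):
                pred = (X[:, j] > theta).astype(int)
                if flip:
                    pred = 1 - pred
                acc = np.sum(w * (pred == y)) / np.sum(w)
                if acc > best_acc:
                    best_acc, best_rule = acc, (j, theta, flip)
    j, theta, flip = best_rule
    def h(X):
        pred = (X[:, j] > theta).astype(int)
        return 1 - pred if flip else pred
    return h
```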
The boosting process
[Diagram: start from uniform weights $(x_1, y_1, 1/n), \dots, (x_n, y_n, 1/n)$; the weak learner returns $h_1$; the examples are reweighted to $(x_1, y_1, w_1), \dots, (x_n, y_n, w_n)$ and handed back to the weak learner, producing $h_2, h_3, \dots, h_T$.]
Final rule: $F_T(x) = \alpha_1 h_1(x) + \alpha_2 h_2(x) + \cdots + \alpha_T h_T(x)$, and $f_T(x) = \mathrm{sign}(F_T(x))$.
AdaBoost
$F_0(x) \equiv 0$
for $t = 1 \dots T$:
    $w_i^t = \exp(-y_i F_{t-1}(x_i))$
    Get $h_t$ from the weak learner
    $\alpha_t = \frac{1}{2} \ln \frac{\sum_{i:\, h_t(x_i)=1,\, y_i=+1} w_i^t}{\sum_{i:\, h_t(x_i)=1,\, y_i=-1} w_i^t}$
    $F_{t+1} = F_t + \alpha_t h_t$
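A compact sketch of this loop in numpy, with labels $y_i \in \{-1,+1\}$ and base rules valued in $\{0,1\}$ as above; the weak-learner interface is an assumption (e.g. the stump sketched earlier):

```python
import numpy as np

def adaboost(X, y, T, weak_learner):
    """y_i in {-1, +1}; weak_learner(X, y01, w) -> callable rule in {0, 1}.
    Returns (rules, alphas) defining F_T(x) = sum_t alpha_t * h_t(x)."""
    F = np.zeros(len(y))                   # F_0(x_i) = 0 on the training set
    y01 = (y == 1).astype(int)             # labels in the weak learner's form
    rules, alphas = [], []
    for t in range(T):
        w = np.exp(-y * F)                 # w_i^t = exp(-y_i F_{t-1}(x_i))
        h = weak_learner(X, y01, w)        # get h_t from the weak learner
        r = h(X).astype(bool)              # region where h_t(x) = 1
        eps = 1e-12                        # guard against empty sums
        alpha = 0.5 * np.log((w[r & (y == +1)].sum() + eps)
                             / (w[r & (y == -1)].sum() + eps))
        F += alpha * r                     # F_{t+1} = F_t + alpha_t h_t
        rules.append(h)
        alphas.append(alpha)
    return rules, np.array(alphas)
```

Each round reweights toward the examples the current $F_t$ gets wrong, exactly the reweighting shown in the boosting-process diagram above.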
[Diagram: the Booster sends example weights to the Weak Learner, which returns a weak rule; looping and combining the weak rules yields an accurate rule.]
Decision Trees
[Figure: a decision tree with root test X>3; its X>3 branch tests Y>5. Leaves are labeled +1 and -1, and the matching partition of the (X, Y) plane into axis-parallel regions is drawn with boundaries at X=3 and Y=5.]
A decision tree as a sum of weak rules
[Figure: the same tree rewritten as a sum of real-valued scores: the X>3 test contributes -0.1 or +0.1, the Y>5 test contributes -0.3 or +0.2, and the prediction is the sign of the total score of each region (e.g. -0.2, +0.1, -0.3).]
An alternating decision tree
[Figure: an alternating decision tree that adds the test Y<1 (score +0.7 if true, 0.0 otherwise) beside X>3 and Y>5; every test that applies to an instance contributes its score, and the prediction is the sign of the sum.]
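The sum-of-scores reading can be made concrete; a small sketch with illustrative preconditions and scores loosely matching the figure (none of it is the talk's exact tree):

```python
import numpy as np

# Each rule: (precondition, condition, score_if_true, score_if_false).
# An instance collects the root score plus, for every rule whose
# precondition holds, one of the two branch scores. Predict the sign.
def adt_score(x, root_score, rules):
    total = root_score
    for pre, cond, s_true, s_false in rules:
        if pre(x):
            total += s_true if cond(x) else s_false
    return total

rules = [
    (lambda x: True,     lambda x: x[0] > 3, -0.1, +0.1),  # X > 3
    (lambda x: x[0] > 3, lambda x: x[1] > 5, +0.2, -0.3),  # Y > 5 under X > 3
    (lambda x: True,     lambda x: x[1] < 1, +0.7,  0.0),  # Y < 1
]
x = np.array([4.0, 6.0])
print(adt_score(x, 0.2, rules), "->", np.sign(adt_score(x, 0.2, rules)))
```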
Example: medical diagnostics
[Figure: an alternating decision tree learned from a medical dataset of 303 instances; a total score > 0 is read as Healthy, a score < 0 as Sick.]
Commercial Deployment.
AD-tree (Detail)
Quantifiable results: accuracy, precision/recall, and a real-valued score.
AdaBoost's resistance to overfitting.
Why statisticians find AdaBoost interesting.
Large margins
$$\mathrm{margin}_{F_T}(x, y) \;\doteq\; \frac{y\, F_T(x)}{\sum_{t=1}^{T} \alpha_t} \;=\; \frac{y \sum_{t=1}^{T} \alpha_t h_t(x)}{\sum_{t=1}^{T} \alpha_t} \;\in\; [-1, +1]$$
$\mathrm{margin}_{F_T}(x, y) > 0 \iff f_T(x) = y$
Thesis: large margins => reliable predictions.
Very similar to SVMs.
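A sketch of how one might compute the training-margin distribution behind such claims, assuming base rules valued in $\{-1,+1\}$ and nonnegative weights (both assumptions of this sketch):

```python
import numpy as np

def margin_distribution(rules, alphas, X, y):
    """margin(x_i, y_i) = y_i * sum_t alpha_t h_t(x_i) / sum_t alpha_t.
    With h_t(X) in {-1, +1} and alpha_t >= 0, every margin lies in
    [-1, +1]; sort the values to plot the cumulative margin curve."""
    H = np.stack([h(X) for h in rules])     # shape (T, m) base predictions
    margins = y * (alphas @ H) / alphas.sum()
    return np.sort(margins)
```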
Experimental Evidence
Theorem (Schapire, Freund, Bartlett & Lee, Annals of Statistics 1998)
[Slide: the margin bound. With high probability, for every $\theta > 0$, the generalization error is at most the fraction of training examples with margin at most $\theta$, plus a complexity term of order $\tilde{O}(\sqrt{d/m}\,/\,\theta)$ that does not depend on the number of rounds $T$.]
Idea of Proof
Confidence-rated predictions
Agreement gives confidence.
A motivating example
[Figure: a scatter of + and - training points. Query points deep inside the + or - region get confident predictions; query points near the boundary, or far from all of the data, are marked "?" (unsure).]
The algorithm
Freund, Mansour, Schapire 2001
Parameters: $\eta > 0$, $\Delta > 0$.
Hypothesis weight: $w(h) = e^{-\eta \hat{\varepsilon}(h)}$.
Empirical log ratio:
$$l(x) = \frac{1}{\eta} \ln \frac{\sum_{h:\, h(x)=+1} w(h)}{\sum_{h:\, h(x)=-1} w(h)}$$
Prediction rule:
$$\hat{p}(x) = \begin{cases} +1 & \text{if } l(x) > \Delta \\ \{-1,+1\} \ (\text{unsure}) & \text{if } |l(x)| \le \Delta \\ -1 & \text{if } l(x) < -\Delta \end{cases}$$
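A direct sketch of this rule; the list-of-hypotheses interface and the use of 0 to encode "unsure" are conventions of the sketch, not of the paper:

```python
import numpy as np

def log_ratio_predict(x, hypotheses, train_errors, eta, delta):
    """Weight each h by w(h) = exp(-eta * err(h)), form the empirical
    log ratio l(x), and abstain when |l(x)| <= delta.
    hypotheses: callables h(x) in {-1, +1}; train_errors: empirical errors."""
    w = np.exp(-eta * np.asarray(train_errors))
    votes = np.array([h(x) for h in hypotheses])
    w_plus = w[votes == +1].sum() + 1e-300   # guard against an empty side
    w_minus = w[votes == -1].sum() + 1e-300
    l = np.log(w_plus / w_minus) / eta       # empirical log ratio l(x)
    if l > delta:
        return +1
    if l < -delta:
        return -1
    return 0                                 # "unsure": abstain
```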
Suggested tuning
Suppose $H$ is a finite set. Taking $\eta = \Theta(\sqrt{m})$ and $\Delta = \Theta\!\big(\ln(8|H|/\delta)/\sqrt{m}\big)$ yields, with probability $1 - \delta$, a mistake probability of at most $2\varepsilon(h^*) + O\!\big(m^{-1/2}\ln m\big)$, once $m = \Omega\big(\ln(1/\delta)\,\ln|H|\big)$.
[Diagram: candidate rules flow into a rater/combiner, which outputs a single confidence-rated rule.]
Face Detection
Viola & Jones 1999
Paul Viola and Mike Jones developed a face detector that can work
in real time (15 frames per second).
[Embedded video clip.]
[Diagram: all boxes are tested against Feature 1; a box that passes might be a face and moves on to Feature 2, while a box that fails is rejected as definitely not a face.]
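In code, the cascade is just an early-exit loop; a toy sketch with made-up stage tests (the real detector uses boosted rectangle features):

```python
import numpy as np

def cascade_detect(window, stages):
    """Each stage is a cheap boolean test; any failure rejects the window
    at once ("definitely not a face"), so the expensive later features
    run only on the rare windows that survive the early ones."""
    for stage in stages:
        if not stage(window):
            return False              # definitely not a face
    return True                       # passed every feature: might be a face

# Hypothetical stages, coarse and fast first:
stages = [
    lambda w: w.mean() > 0.1,         # reject nearly uniform dark boxes
    lambda w: w.std() > 0.05,         # reject low-contrast boxes
]
print(cascade_detect(np.random.rand(24, 24), stages))
```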
Co-training
Blum and Mitchell 98
[Diagram: highway images feed two partially trained classifiers, one based on B/W features and one based on difference features; each passes its confident predictions to the other as additional training data.]
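A schematic co-training loop; the classifier interface (fit_extra) and the confidence test are assumptions of the sketch, not any real API:

```python
def co_train(clf_a, clf_b, unlabeled, rounds, confident):
    """Each classifier labels the unlabeled pool and hands its confident
    predictions to the other as extra training data.
    confident(clf, x) -> (is_confident, label); clf.fit_extra(pairs)
    retrains on additional (x, label) pairs."""
    for _ in range(rounds):
        for src, dst in ((clf_a, clf_b), (clf_b, clf_a)):
            new = []
            for x in unlabeled:
                ok, label = confident(src, x)
                if ok:
                    new.append((x, label))
            dst.fit_extra(new)      # the other view learns from these labels
    return clf_a, clf_b
```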
Co-Training Results
Levin, Freund, Viola 2002
[Figure: detection performance before co-training vs. after co-training.]
Selective sampling
[Diagram: unlabeled data passes through a partially trained classifier; a sample of the unconfident examples is sent out for labeling, and the newly labeled examples are added to the training set.]
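The filtering step reduces to one line; a toy sketch with an assumed real-valued score function:

```python
def select_for_labeling(score, pool, threshold):
    """Keep the examples on which the partially trained classifier is
    unconfident (|score| below the threshold); only these are labeled."""
    return [x for x in pool if abs(score(x)) < threshold]

# e.g. select_for_labeling(lambda x: x - 0.5, [0.1, 0.49, 0.9], 0.05)
# -> [0.49]: only the example near the decision boundary gets labeled.
```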
Online learning
Adapting to changes
Online learning
So far, the only statistical assumption was that the data is generated IID.
Can we get rid of that assumption?
Yes, if we consider prediction as a repeated game.
An expert is an algorithm that maps the past $(x_1, y_1), (x_2, y_2), \dots, (x_{t-1}, y_{t-1}), x_t$ to a prediction $z_t$.
For $t = 1, \dots, T$:
each expert $i \in \{1, \dots, N\}$ makes a prediction $z_t^i$; the algorithm predicts $z_t^A$; then the outcome $y_t$ is revealed.
Total losses after $T$ rounds:
$$L_T^i = \sum_{t=1}^{T} L(z_t^i, y_t), \qquad L_T^A = \sum_{t=1}^{T} L(z_t^A, y_t).$$
Goal: for any sequence of events, keep $L_T^A$ close to $\min_i L_T^i$.
Binary classification
$N$ experts; one expert is known to be perfect.
Algorithm: predict like the majority of the experts that have made no mistake so far.
Bound: $L_T^A \le \log_2 N$.
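A minimal sketch of this "halving" strategy; the array encoding is an assumption of the sketch:

```python
import numpy as np

def halving(expert_preds, outcomes):
    """Predict with the majority of the experts still consistent so far.
    If one expert is perfect, each algorithm mistake eliminates at least
    half of the surviving experts, so mistakes <= log2(N).
    expert_preds: (T, N) array in {0, 1}; outcomes: length T, in {0, 1}."""
    alive = np.ones(expert_preds.shape[1], dtype=bool)
    mistakes = 0
    for preds, y in zip(expert_preds, outcomes):
        vote = int(preds[alive].mean() >= 0.5)   # majority of survivors
        mistakes += int(vote != y)
        alive &= (preds == y)                    # drop every expert that erred
    return mistakes
```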
Lossless compression
$X$: an arbitrary input space; $Y = \{0, 1\}$; $Z = [0, 1]$.
Log loss: $L(z, y) = y \log_2 \frac{1}{z} + (1 - y) \log_2 \frac{1}{1 - z}$.
Bayesian averaging
Folk theorem in information theory: predicting with the weighted average
$$z_t^A = \frac{\sum_{i=1}^{N} w_t^i z_t^i}{\sum_{i=1}^{N} w_t^i}, \qquad w_t^i = 2^{-L_{t-1}^i},$$
guarantees $L_T^A \le \min_i L_T^i + \log_2 N$.
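A sketch of this averaging under the log loss above; the clipping of expert probabilities is my own numerical guard:

```python
import numpy as np

def bayes_average_log_loss(expert_probs, outcomes):
    """Predict z_t = sum_i w_t^i z_t^i / sum_i w_t^i with w_t^i = 2^(-L_{t-1}^i).
    Returns (algorithm loss, best expert loss); the folk theorem says the
    first exceeds the second by at most log2(N).
    expert_probs: (T, N) predicted P(y_t = 1); outcomes: {0, 1}^T."""
    L = np.zeros(expert_probs.shape[1])            # cumulative loss per expert
    total = 0.0
    for z_row, y in zip(expert_probs, outcomes):
        z_row = np.clip(z_row, 1e-12, 1 - 1e-12)   # keep log2 finite
        w = np.exp2(-L)                            # w_t^i = 2^(-L_{t-1}^i)
        z = w @ z_row / w.sum()                    # weighted-average prediction
        total += -(y * np.log2(z) + (1 - y) * np.log2(1 - z))
        L += -(y * np.log2(z_row) + (1 - y) * np.log2(1 - z_row))
    return total, L.min()
```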
$X$: an arbitrary space; $Y$: a loss for each of the $N$ actions, $y \in [0, 1]^N$.
The algorithm plays a distribution $p \in [0, 1]^N$ with $\sum_i p_i = 1$.
Loss: $L(p, y) = p \cdot y = E_{i \sim p}[y_i]$.
Learning in games
Freund and Schapire 94
$$L_T^A \;\le\; \min_i L_T^i + \sqrt{2 T \ln N} + \ln N$$
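A sketch of the exponential-weights (Hedge) strategy behind this bound; the learning-rate tuning, roughly $\eta = \sqrt{2 \ln N / T}$, is the standard choice rather than a detail taken from the slide:

```python
import numpy as np

def hedge(loss_matrix, eta):
    """Exponential weights over N actions: play p_t proportional to the
    weights, suffer p_t . y_t, then downweight each action by exp(-eta*y).
    loss_matrix: (T, N) with entries in [0, 1]. Returns the total loss."""
    w = np.ones(loss_matrix.shape[1])
    total = 0.0
    for y in loss_matrix:
        p = w / w.sum()              # action distribution p_t
        total += p @ y               # loss L(p_t, y_t) = E_{i~p}[y_i]
        w *= np.exp(-eta * y)        # exponential-weights update
    return total
```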
Multi-arm bandits
Auer, Cesa-Bianchi, Freund, Schapire 95
Only the loss of the chosen action is observed at each round $t$.
We describe an algorithm that guarantees, with probability $1 - \delta$:
$$L_T^A \;\le\; \min_i L_T^i + O\!\left(\sqrt{N T \ln \frac{N T}{\delta}}\right)$$
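An Exp3-style sketch of such an algorithm; the loss-based variant, the pull callback, and the weight rescaling are assumptions of the sketch:

```python
import numpy as np

def exp3(pull, N, T, gamma, rng=np.random.default_rng(0)):
    """Only the chosen arm's loss is seen, so it is importance-weighted
    by 1/p before the exponential update. pull(i, t) -> loss in [0, 1]."""
    w = np.ones(N)
    total = 0.0
    for t in range(T):
        p = (1 - gamma) * w / w.sum() + gamma / N   # mix in exploration
        i = rng.choice(N, p=p)
        loss = pull(i, t)
        total += loss
        est = np.zeros(N)
        est[i] = loss / p[i]                        # unbiased loss estimate
        w *= np.exp(-gamma * est / N)               # exponential update
        w /= w.max()                                # keep weights bounded
    return total
```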
The detector can be adaptive!
[Diagram: downloaded images feed an adaptive real-time face detector trained by online learning (OL); its face detections produce feedback, which flows back to update the detector.]
Summary
By combining predictors we can:
Improve accuracy.
Estimate prediction confidence.
Adapt on-line.