
6.891 Machine Learning and Neural
Networks

Lecture 1:
Introduction and Examples

6.891 Machine Learning

News
● First problem set is available (short)
» Due Sept. …
» All psets are due on Thursday
» Normally you will have two weeks
● Reading: DHS Ch. … for Friday

6.891 Machine Learning

1
Review & Overview
● Administrative information
● Course Goals
● Define Learning, Induction, Regression, Classification
● Give examples of learning applications
● Bayes Rule and classification
● Regression and Overfitting
● Ockham's Razor, Curse of Dimensionality
» Brief Mention of Probability

6.891 Machine Learning

Course Information
● http://www.ai.mit.edu/courses/
● Lecturer: Paul Viola
» Prof. in the AI Lab
» viola@ai.mit.edu
» Research: Learning and Computer Vision
– http://www.ai.mit.edu/projects/lv/
● TA: Kinh Tieu
» PhD student in the AI Lab
» tieu@ai.mit.edu
» Research: Image Database Retrieval, Vision, Learning
– http://www.ai.mit.edu/people/tieu/

6.891 Machine Learning

2
Grading Experiment!!!
● Problem sets will be self-graded (mostly)
● You will hand in the pset on Thursday. Kinh will record its presence (or absence) and glance to see if you attempted each problem
● We will distribute the psets (on Friday) at random to the class. You will each grade one pset with help from a solution key. You have … days
● Kinh will lead a …-hour pset review session to go over correct solutions. Probably Monday afternoon
● You will hand back the graded psets on Wednesday
● Kinh will then grade question … (usually the toughest)
● The graded psets will be returned to you on Friday, … days after you turned them in

6.891 Machine Learning

Course Goals
● Introduce, Motivate and Study concepts from machine learning. Focus both on fundamentals and applications
» Second Time: Watch Out!
● Fundamentals
» Follow text: Duda, Hart & Stork (from the Web page)
» Plus some supplemental handouts
● Applications
» Read papers from the literature
● Reinforce
» Six PSETs will require both thinking and hacking
» One final project
» Midterm
» Final exam (??)

6.891 Machine Learning

3
Course Goals: NOT

NIPS 1989

6.891 Machine Learning

Course Goals: BUT

Pitts and McCulloch, 1947


6.891 Machine Learning

4
Goals: Analysis and Computation

6.891 Machine Learning

What is Machine Learning?


● Induction of patterns, regularities and rules from data
» Leap to conclusions
● Not Deduction:
» {axioms, assumptions, rules} -> theorems
● Induction:
» {tons of data} -> rules, axioms, laws
● Classic Examples:
» Newton's Laws, Kepler's Laws
» Periodic Table
» Mendel's laws of inheritance

6.891 Machine Learning

5
Physical Laws

Newton’s Measurements
● Observe many experiments

[Figure: plot of Acceleration vs. Force for the measured experiments]

● Conjecture a simple Rule

6.891 Machine Learning

Physical Laws: Theorize

Newton’s Measurements
● Observe many experiments

[Figure: the same Acceleration vs. Force data with a fitted line]

● Conjecture a simple Rule
» F = ma        Ignore Errors & Inconsistencies

6.891 Machine Learning

6
Different Types of Learned Relations
● Regression:
» Continuous input, Continuous output
– F = ma, pv = nrt
– Interest Rates -> Stock Prices
– Inches of Rain -> Corn Production
● Classification:
» Discrete input, Discrete output
– {Red? Round? Small Seed?} -> Apple
– Alarm? -> Break-In;  Alarm? & Earthquake? -> No Break-In
» Continuous input, Discrete output
– Midterm -> Final Grade
– {Fever, Blood Pressure} -> Sick?
– {Income, Current Debt} -> Issue Loan?
– Sound -> Words;  Images -> People
6.891 Machine Learning

Some Notation
● In general a learning problem will have:
» Inputs:  x = (x_1, x_2, …, x_d)^T,  written x_j or x^j for the j-th example
» Outputs: C = {C_1, C_2, …}  (written C_j)  or  y = (y_1, y_2, …)^T  (written y_j)
» Target (correct label or value): t_j

6.891 Machine Learning

7
Additional Notation (Abusive!)
● Prediction Function:
  y_j = y(x_j)        C_j = C(x_j)
● Error Function:
  E = Σ_j l( t_j − y(x_j) ),   where  l(z) = 0 if z = 0, 1 otherwise   (0/1 loss)
  or  l(z) = z²   (squared loss)

6.891 Machine Learning

Example: Digit Recognition


● US Postal Service: … million letters a day
● Reading Zip Codes:
» First find the Address Block
» Then the zip code
» Normalize size and rotation
» Separate digits
» Digitize: …x… pixels -> … possible images

6.891 Machine Learning

8
Character Recognition

6.891 Machine Learning

Zip Code Recognition

6.891 Machine Learning

9
Tremendous Variety

6.891 Machine Learning

Hand Labeled Data

6.891 Machine Learning

10
Final Performance
● US Postal Service: … million letters a day
● Reading Zip Codes:
» First find the Address Block
» Then the zip code
» Normalize size and rotation
» Separate digits
» Digitize: …x… pixels -> … possible images

● Training: … example images
● Final Performance: > …%

6.891 Machine Learning

Differentiating Speech & Music

The Key Issue:


Features!

6.891 Machine Learning

11
Speech Recognition
● Speech recognition:
» Sound signals -> cepstral coefficients -> Word Sequence
» … samples/sec -> … frames/sec -> … words/sec

Sound: "Now is the time..."

● Key difficulties:
» Variations in pitch, pronunciation, speed

6.891 Machine Learning

Evaluation of Credit Risk


● Feature vector:
» Income level, time at current job, time at previous job, marital status, children?, paid previous bills?, own home?, location of home, etc.
● Many thousands or millions of examples
» Feature vector + outcome of the loan
● Difficulties:
» Noise
» Insufficient data
» Missing data
» Generalization

6.891 Machine Learning

12
Digit Recognition in Detail

● Classifying "1" vs. "2"
● Define a set of features:

Num Black Pixels    Perim    Width    Height

● Look for separation
6.891 Machine Learning

Rules for Classification


● Many schemes for classifying data

[One of the feature graphs here]

» Pick a threshold:
  C(x) = θ(ax + b),    θ(y) = 1 if y ≥ 0, 0 otherwise
» Divide into regions:
  – F: {…} -> {"1", "2"}

6.891 Machine Learning

13
Using Bayes’ Law

● Evaluate P(F = f | "1") and P(F = f | "2")
» By observing lots of data

● Use Bayes' Law:    P(B | A) = P(A | B) P(B) / P(A)

P("2" | F = f) = P(F = f | "2") P("2") / P(F = f)
P("1" | F = f) = P(F = f | "1") P("1") / P(F = f)

6.891 Machine Learning
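A minimal MATLAB sketch of this recipe (not from the slides: the class feature vectors f1 and f2, the bin centers, and the test value are all made-up assumptions):

% Estimate P(F | "1") and P(F | "2") by counting, then apply Bayes' rule.
% f1 and f2 are assumed column vectors of feature values for each class.
centers = linspace(0, 2, 20);                 % assumed feature range
PF1 = hist(f1, centers) / length(f1);         % estimate of P(F = f | "1")
PF2 = hist(f2, centers) / length(f2);         % estimate of P(F = f | "2")
P1  = length(f1) / (length(f1) + length(f2)); % prior P("1") from counts
P2  = 1 - P1;
fnew = 0.7;                                   % feature of a new example
[d, b] = min(abs(centers - fnew));            % bin containing the new value
post1 = PF1(b)*P1 / (PF1(b)*P1 + PF2(b)*P2 + eps);   % Bayes' rule
if post1 > 0.5, disp('classify as "1"'), else, disp('classify as "2"'), end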

Combining Features
● Add features to separate the classes

[Figure: the two classes plotted in the (x1, x2) feature plane]

6.891 Machine Learning

14
6.891 Machine Learning and Neural
Networks

Lecture 2:
The Probabilistic Approach

6.891 Machine Learning

1
News
● For those of you that missed the first class…
» The first problem set is on the web (Due …)

● The web page is getting updated regularly

● We will hand out grading guidelines when you are given the first problem set to grade
» These grades will not assume perfect accuracy…

6.891 Machine Learning

2
Review & Overview
● Lecture 1:
» Defined Learning, Induction, Regression, Classification
» Showed example applications: Digits, Sounds, Speech
» Brief Mention of Probability
● Finish the introduction to Learning
» Fitting functions to data…
» Overfitting
● The Probabilistic Approach
» Review some simple probability
» Apply it to classification tasks

6.891 Machine Learning

3
Fitting a Curve to Data

Data: {x_j, t_j}                y(x)

[Figure: ten example data points with a candidate curve y(x); question marks mark the regions where interpolation and extrapolation are needed]
6.891 Machine Learning

You are given 10 example data points. These are samples of a physical relationship, perhaps including noise.

Your challenge is to make predictions for this relationship:


- Interpolation: between the example points
- Extrapolation: beyond the data.

In principle there are an infinite number of functions that could be associated


with this data… our challenge is to pick one.

In the final analysis we may want to hedge our bets and return a probability
distribution of functions.

4
Polynomial Fitting

Data: {x_j, t_j}

y(x_j) = w_0 + w_1 (x_j)^1 + … + w_M (x_j)^M = Σ_l w_l (x_j)^l

written y(x_j; w)

Weight vector:  w = {w_0, w_1, …}

6.891 Machine Learning

Imagine that we are constraining ourselves to the class of polynomials.

Each M-th order polynomial is parameterized by M + 1 parameters.

The learning process becomes a process by which we select values of the w_l.

The dependency of the y(·) function on w can be highlighted by the notation y(x; w).

5
Graphical Representation
● The graph represents the function

y(x_j) = Σ_l w_l (x_j)^l

● Information flows from bottom to top
● Arrows/Links:
» transmit info
» multiplicative weight
● Nodes:
» sum incoming info
» possible non-linear transform

[Diagram: output y at top; weights w_0 … w_9 on the links; inputs x^0, x^1, …, x^9 at the bottom — a high-dimensional non-linear representation of the scalar x]

6.891 Machine Learning

- While the algebraic notation for y() is clear and specific, we will see that it is sometimes also useful to develop a graphical notation for both classifiers and regression functions.
- This idea was originally popularized in the neural network literature,
wherein neural networks were almost always drawn out in their graphical
form.
- The graphical notation points out that an intermediate representation for X
is formed (M+1 exponentiations). The resulting problem is then one of
learning the linear relationship between this high dimensional space and t.

6
Choose the Best Polynomial
● Which polynomial function is best?
» Best predictions on training data…
» Best predictions on future data… interpolation/extrapolation
– Best expected loss on future data
– Where do we get this data?

E = (1/2) Σ_j loss( y(x_j; w) − t_j )        Empirical Loss

ŵ = argmin_w E        Find "Optimal" weights

6.891 Machine Learning

- What defines the best polynomial function? Perhaps it is the one which is
most consistent with the training data?
- Actually we would rather return the function which makes the best
predictions on future data - unfortunately there may be no source for this
data.
- For the time being let’s assume that we want to find the function which best
agrees with training data… the function with the lowest loss.

7
Simple Loss Functions Simplify Learning

loss(δ) = δ²        E = (1/2) Σ_j ( y(x_j; w) − t_j )²

∂E/∂w_i = (1/2) Σ_j 2 ( y(x_j; w) − t_j ) (x_j)^i
        = Σ_j ( y(x_j; w) − t_j ) (x_j)^i = 0

6.891 Machine Learning

- Certain simple loss functions lead to learning algorithms which are easy to
derive and inexpensive to compute.
- For example, squared loss can be solved by differentiating and setting this
to zero.
- The result is a set of linear equations that can be solved by inverting a
matrix.
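A small MATLAB sketch of that linear-system view (the data below are made up, chosen to match the target function used later in these notes):

% Solve the squared-loss polynomial fit directly via the normal equations.
x = (1:10)'/10;
t = 0.5 + 0.4*sin(2*pi*x) + 0.1*randn(size(x));
M = 3;                              % polynomial order
X = zeros(length(x), M+1);          % design (Vandermonde) matrix
for l = 0:M
  X(:, l+1) = x.^l;                 % columns are x.^0, x.^1, ..., x.^M
end
w = (X' * X) \ (X' * t);            % weights: w(l+1) multiplies x^l

polyfit returns the same coefficients (in reversed order).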

8
First order fit…

[Figure: first-order (linear) fit to the ten data points]

E = (1/2) Σ_j ( y(x_j; w) − t_j )²

The optimal function minimizes the residual error

6.891 Machine Learning

- There is a pleasant physical analogy for the squared loss. The functions
are connected by springs to the data. The system is then allowed to relax
until the forces are balanced. The minimum energy solution is the one that
is “closest” to the training data.

9
Fitting Different Polynomials
[Figure: polynomial fits of order 1, 3 and 6 (top row) and 2, 4 and 9 (bottom row) to the ten data points]

6.891 Machine Learning

Each order of polynomial leads to a different fit.


Higher order polynomials come closer to the training data.
The 9th order polynomial can fit the 10 datapoints perfectly.

Which of these is the most likely to generalize?

10
Target Function

h(x) = 0.5 + 0.4 sin(2πx)
t_j = h(x_j)

[Figure: the target function h(x) plotted over [0, 1]]

6.891 Machine Learning

The function that generated the data was not a polynomial at all.

11
Fitting Different Polynomials
[Figure: the same polynomial fits of order 1, 3, 6, 2, 4 and 9, overlaid on the true target function]

6.891 Machine Learning

Probably the best approximation was 6th order (though 3rd is very good as
well).
Ninth provides a terrible fit to the function, though it fits the training data
perfectly. This is what is called overfitting…

12
Matlab Code
% Construct training data
train_in = [1:10]/10;
train_out = 0.5 + 0.4 * sin(2 * pi * train_in) + 0.1 * randn(size(train_in));

% fit a polynomial
order = 3
p = polyfit(train_in, train_out, order)

% construct a test set


test_in = [1:300]/300;
true_out = 0.5 + 0.4 * sin(2 * pi * test_in);

% compute the polynomial prediction


fit_out = polyval(p, test_in);

% plot the results


% first: training data
% second: test data
% third: predictions
plot(train_in, train_out, 'o', test_in, true_out, test_in, fit_out)
axis([0 1 -0.1 1.1])

6.891 Machine Learning

Above is all the code used to generate the previous graphs.

As we can see Matlab allows us to explore issues in machine learning


without much hacking.

13
First General Problem in Learning
● Control of complexity
» "Entities should not be multiplied without necessity"
– W. Occam, 14th Century
– Occam's razor
» "A physical theory should be as simple as possible, but no simpler" – A. Einstein
» "Good theories are falsifiable" – V. Vapnik

» "Complex theories are likely to be wrong" – P. Viola

6.891 Machine Learning

There are a few general (grand challenge) problems in machine learning.


Perhaps the most important is the problem of controlling complexity. We
have seen that a simpler approximator can fit an unknown function better
than a more complex approximator. Building a theory for this is one grand
challenge of learning.

Clearly this problem has been appreciated for a long time.

Vapnik’s statement is perhaps the most confusing. When he says theory, what he means is something like a learning algorithm. The learning algorithm which fits 9th order polynomials to 10 datapoints is not falsifiable:
- No set of 10 datapoints would fail to be fit perfectly by this model.

14
Overfitting in Classification

6.891 Machine Learning

This is not to say that such problems are unique to regression. Determining decision boundaries for classification is very similar.

We need to balance the complexity of the boundary against the accuracy on


training data.

15
Probabilistic Notation

X is a Random Variable
P(X = x), where P(·) is a Probability Distribution

Shorthand: P(X) is the distribution function, or P_X(·)

P(x) = P(X = x)        P(y) = P(Y = y)
P_X(x) = P(X = x)      P_Y(y) = P(Y = y)
6.891 Machine Learning

Introduction of probabilistic notation.

Note that there are several potentially confusing short hand notations.

16
Recall the probabilistic approach
● Given a classification problem
» Speech/Music, Bass/Salmon, Rotten/Ripe
● Choose a feature of your examples
» Fish: width, height, color
» Fruit: color, weight
» Sounds: spectrum, variance
● Record the distribution of Feature vs. Class
● Given an unclassified example:
» Compute P(F|C1) and P(F|C2)
» Classify using Bayes Rule

6.891 Machine Learning

17
Probabilistic Approach

P(C = C_k, X = x)

P(C = C_k | X = x) = P(X = x | C = C_k) P(C = C_k) / P(X = x)

P(C_k | x) = P(x | C_k) P(C_k) / P(x)

P(X | C1) & P(X | C2)                Thomas Bayes, 1702-1761
6.891 Machine Learning

18
Probability Densities
P(X ∈ [a, b]) = ∫_a^b p(X = x) dx        p(X = x) = (d/db) P(X ∈ [a, b]), evaluated at b = x

p(x) = p_X(x) = p(X = x)

P(C = C_k | X = x) = p(X = x | C = C_k) P(C = C_k) / p(X = x)

P(C_k | x) = p(x | C_k) P(C_k) / p(x)
6.891 Machine Learning

Probability Densities are necessary because for a continuous random


variable the probability of every event is zero.
The density measures the slope of the cumulative distribution function.
Alternatively it is the probability per unit area (or length, or volume)
measured over an infinitesimal area.

Somewhat surprisingly, the density is used in the same way that the distribution function is used. In other words, the probability distribution of the class given the feature value can be found using the densities of the features.

19
Bayes Law for Densities

C(x) = ω_1   if P(ω_1 | x) > P(ω_2 | x)
       ω_2   otherwise

Duda & Hart, 1973

Class 1 Class 2

6.891 Machine Learning

Just as before we can graph the conditional probability of class given


feature. The functions are now continuous…

Given the Bayes classification rule, a set of decision regions are defined.

20
Decisions

6.891 Machine Learning

These analyses are easily generalized to:


- Multiple classes
- Multiple dimensions

21
Analysis of Decision Rule

P(error) = P(x ∈ R_2, C_1) + P(x ∈ R_1, C_2)
         = P(x ∈ R_2 | C_1) P(C_1) + P(x ∈ R_1 | C_2) P(C_2)
         = ∫_{R_2} p(x | C_1) P(C_1) dx + ∫_{R_1} p(x | C_2) P(C_2) dx

6.891 Machine Learning

22
Minimize Expected Loss or Risk

L_kl = { Loss if C(x_j) = C_l and t_j = C_k }

Risk_k = Σ_l L_kl ∫_{R_l} p(x | C_k) dx        Risk for elements of C_k

Risk = Σ_k Risk_k P(C_k)                       Overall Risk

6.891 Machine Learning

23
Probabilistic Classification Review
● If we are given P(F|C) & P(C) -> P(F,C)
» How the feature is distributed for each class
● We can use this information to classify new examples using Bayes Rule
» Minimizes the probability of error…
» We may instead wish to minimize risk

● Where is the machine learning?

6.891 Machine Learning

24
Information Retrieval
● The Altavista Problem
● … documents on the web
» Takes a long time to browse
● Simple Keyword Search
» Find documents with "German" and "car"
» Might miss "Germany" and "cars"
– Stemming
» Misses "Mercedes" and "automobile"
● Machine Learning?
» Given … documents on German cars, build a classifier

6.891 Machine Learning

25
Keyword Search Works Well

6.891 Machine Learning

26
Naïve Bayes Classifier
● Assume each word is an independent feature

f_i(Doc_j) = 1 if Doc_j has word i

P(F_i | C_j)        Probability of word i appearing in a Doc from Class j

P({f_i} | C_j) = Π_i P(F_i = f_i | C_j)

P(C_j | {f_i}) = P(C_j) Π_i P(F_i = f_i | C_j)  /  Π_i P(F_i = f_i)
6.891 Machine Learning
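A hedged MATLAB sketch of this classifier (the data layout is an assumption, not from the slides): F is an ndocs x nwords 0/1 word-occurrence matrix and c a vector of class labels.

% Naive Bayes with binary word features and two classes.
p1 = mean(F(c == 1, :), 1);          % estimates of P(F_i = 1 | class 1)
p0 = mean(F(c == 0, :), 1);          % estimates of P(F_i = 1 | class 0)
P1 = mean(c == 1);  P0 = 1 - P1;     % class priors
fnew = F(1, :);                      % a document to classify
% log P(class) + sum_i log P(F_i = f_i | class); logs avoid underflow
L1 = log(P1) + sum(log(p1 .* fnew + (1 - p1) .* (1 - fnew) + eps));
L0 = log(P0) + sum(log(p0 .* fnew + (1 - p0) .* (1 - fnew) + eps));
if L1 > L0, disp('German cars'), else, disp('other'), end

The eps guards against log(0) when a word never occurs in a class — exactly the "Mercedes" problem discussed next.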

27
Estimating Probabilities
● Maximum Likelihood:

P(F_i) = #{Docs containing word i} / #{Docs}

P(F_i | C_j) = #{Training Docs with word i} / #{Training Docs}

Potential Bug:
None of our Training Docs contain "Mercedes"

6.891 Machine Learning

28
6.891 Machine Learning

29
Curse of Dimensionality

● It is not always better to measure more features
● New results seem to address this problem
» Support Vectors, Boosting, etc.

6.891 Machine Learning

30
Density Estimation is Ambiguous

6.891 Machine Learning

31
Impacts Classification

6.891 Machine Learning

32
6.891 Machine Learning and Neural
Networks

Lecture 3:
Density Estimation

Machine Learning

News
● Sorry about the recitation mix-up
» We will announce by email soon

● Problem Set 1 is due tomorrow
» See the web for the policy…

● Problem Set 2 will be available by tonight

● Kinh and I will be taking photos

Machine Learning

1
Review & Overview
● Lecture 2:
» Overfitting Polynomials
» Reviewed the Probabilistic Approach
» Information Retrieval Example

● Density/Distribution Estimation
» Information Retrieval
– estimating binary RV's
» Gaussians
» Multidimensional Gaussians
» Non-parametric Densities

Machine Learning

Keyword Search Works Well

Machine Learning

2
Bayesian Text Classification
{d_k}: A collection of documents

W_i(d_k) = 1 if d_k contains word i, 0 otherwise

P(F_i = 1 | C = c_j) = p_ij        Probability of word i appearing in a Doc from Class j

P(F_1 = f_1, F_2 = f_2, … | C = c_j) = P({f_1 … f_N} | C = c_j)
                                     ≡ Π_i P(F_i = f_i | C = c_j)        Assume Independence
Machine Learning

Bayes Nets Show Dependencies

P({f_i} | C_j) = Π_i P(F_i = f_i | C_j)

[Diagram: class node C with arrows P(F_i | C) to feature nodes F1, F2, F3, F4, …, FN]

Bayes Nets show the dependencies between RV's

Machine Learning

3
Classification Using Bayes Law

P(c_j | {f_i}) = P(c_j) Π_i P(f_i | c_j)  /  Π_i P(f_i)

c_1 = German Cars
c_0 = Other Documents

Machine Learning

Estimating Probability Distributions

{d_k}: A collection of documents

W_i(d_k) = f_ki = 1 if d_k contains word i, 0 otherwise

P(F_i = 1 | C = c_j) = p_ij

● How can we learn p_ij?
» Maximum Likelihood Principle
» Choose p_ij so that the training data is most probable

Machine Learning

4
Maximum Likelihood

P({d_k} | c_0) = Π_k P(d_k | c_0) = Π_k P({f_k1 … f_kN} | c_0)

             = Π_k Π_i P(f_ki | c_0)

             = Π_k Π_i (p_ij)^{f_ki} (1 − p_ij)^{(1 − f_ki)}

For a single word i:   = (p_ij)^{n_i} (1 − p_ij)^{(N − n_i)}

Machine Learning

Log Likelihood

L = log[ (p_ij)^{n_i} (1 − p_ij)^{(N − n_i)} ]
  = n_i log(p_ij) + (N − n_i) log(1 − p_ij)

∂L/∂p_ij = n_i ∂log(p_ij)/∂p_ij + (N − n_i) ∂log(1 − p_ij)/∂p_ij
         = n_i (1/p_ij) + (N − n_i) (−1/(1 − p_ij))
         = 0
Machine Learning

5
Maximum Likelihood
n_i (1/p_ij) + (N − n_i) (−1/(1 − p_ij)) = 0

n_i / p_ij = (N − n_i) / (1 − p_ij)

n_i / (N − n_i) = p_ij / (1 − p_ij)

(n_i / N) / (1 − n_i / N) = p_ij / (1 − p_ij)        =>        p_ij = n_i / N
Machine Learning

Estimating Probabilities
● Maximum Likelihood:

P(F_i = 1) = #{Docs containing word i} / #{Docs}

P(F_i = 1 | C_j) = #{Training Docs with word i} / #{Training Docs}

Potential Bug:
None of our Training Docs contain "Mercedes"

Machine Learning

6
Prior Expectations
● Given a small amount of data we can't be absolutely sure that "Mercedes" will never appear in documents from our class…
» We may have gotten unlucky
● Use prior expectations to improve our estimates

● Problem:
» Mercedes occurs in … out of … total documents
» But never in the "German cars" training set
» What is a good estimate for p(mercedes | GermanCars)?

Machine Learning

Bayesian Parameter Estimation


● Bayes to the rescue again

Maximum Likelihood:        max_{p_ij} P({d_k} | c_0, p_ij)

Maximum A Posteriori:      P(p_ij | {d_k}, c_0) = P({d_k} | c_0, p_ij) p(p_ij) / P({d_k} | c_0)

This turns out to be more useful for continuous parameters

Machine Learning

7
What is the right prior?
● The most agnostic prior is the uniform density

max_{p_ij} P(p_ij | {d_k}, c_0) = max_{p_ij} P({d_k} | c_0, p_ij) p(p_ij)
                                = max_{p_ij} P({d_k} | c_0, p_ij) ε        (same answer as Maximum Likelihood)

P({d_k} | c_0, p_ij) = (p_ij)^{n_i} (1 − p_ij)^{(N − n_i)}

Machine Learning

Probability of the parameters

P ({d k }| c0 , pij ) = (pij )Ci (1 − pij )(N − Ci )

Machine Learning

8
Bayesian Estimation
P(p_ij | {d_k}, c_0) = P({d_k} | c_0, p_ij) p(p_ij) / P({d_k} | c_0)        P(F_i = 1 | C = c_j, p_ij) = p_ij

P(F_i | c_j) = ∫ P(F_i | c_j, p_ij) p(p_ij | {d_k}, c_j) dp_ij

             = ∫ p_ij P({d_k} | c_0, p_ij) p(p_ij) / P({d_k} | c_0) dp_ij

             = [ ∫ p_ij P({d_k} | c_0, p_ij) p(p_ij) dp_ij ] / P({d_k} | c_0)
Machine Learning

… Continued

P(F_i | c_j) = [ ∫ p_ij P({d_k} | c_0, p_ij) p(p_ij) dp_ij ] / [ ∫ P({d_k} | c_0, p_ij) p(p_ij) dp_ij ]

             = [ ∫ p_ij (p_ij)^{n_i} (1 − p_ij)^{N − n_i} ε dp_ij ] / [ ∫ (p_ij)^{n_i} (1 − p_ij)^{N − n_i} ε dp_ij ]

Machine Learning
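With the uniform prior these Beta integrals have a closed form that the slide leaves implicit: the ratio works out to (n_i + 1)/(N + 2) — Laplace's rule of succession — which is never exactly zero. A quick numeric check in MATLAB (the counts are made up):

% Evaluate the integral ratio numerically for a word never seen in N docs.
ni = 0;  N = 50;
p   = linspace(0, 1, 10001);
num = trapz(p, p .* p.^ni .* (1 - p).^(N - ni));  % integral of p * likelihood
den = trapz(p, p.^ni .* (1 - p).^(N - ni));       % integral of the likelihood
estimate = num / den                              % approx (ni+1)/(N+2) = 1/52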

9
What if no Mercedes?

Machine Learning

10
6.891 Machine Learning and Neural
Networks

Lecture 4:
New Density Estimators

Machine Learning

News
● Problem set 2 will be handed out today
● Problem set 2 is on the web
» It is much harder than the first pset

● Problem Sets:
» Please show some work
» Make sure to get the psets to Kinh
– Especially if they are last minute

Machine Learning

1
Review & Overview
● Lecture 3:
» Talked about Information Retrieval
– Need priors over parameters
» Derived Maximum Likelihood for Bernoulli RV's
» Discussed the use of priors over parameters

● New Density Estimators (Continuous):
» Gaussian
» Non-parametric
» Mixture of Gaussians
● Quick tour of Expectation Maximization

Machine Learning

Why Gaussians ?

● Analytically Tractable
● Central Limit Theorem
» Sum of many variables is Gaussian
● Linear Transforms of Gaussians are Gaussian
● Gaussians have the highest Entropy (for a given variance)
Machine Learning
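A hedged MATLAB sketch of evaluating such a multi-dimensional Gaussian density (mu, Sigma and the query points below are illustrative assumptions, not from the slides):

% N(mu, Sigma) density evaluated at the rows of X.
mu = [0 0];  Sigma = [2 1; 1 2];  X = randn(5, 2);
d = length(mu);
D = X - repmat(mu, size(X, 1), 1);            % centered data
q = sum((D / Sigma) .* D, 2);                 % Mahalanobis terms x' inv(Sigma) x
p = exp(-0.5 * q) / sqrt((2*pi)^d * det(Sigma));

The eigenvectors of Sigma give the axes of the elliptical contours discussed under "Eigen Structure".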

2
Multi-Dimensional Gaussian

Machine Learning

Eigen Structure

Machine Learning

3
Recall: Bayes Decision Boundaries

Machine Learning

Discriminant Function

Machine Learning

4
Set Discriminants Equal

Machine Learning

Machine Learning

5
Bayesian Parameter Estimation
● What if you have little data…
● Or if you have strong expectations?

Machine Learning

Convergence of Probability

Machine Learning

6
Reminder: Why we are here
[Table and figure: about twenty training samples from each of two classes, plotted along the real line]

Machine Learning

Max Likelihood Gaussian

Class 1 — Mean: 0.16, StDev: 0.8        Class 2 — Mean: 2.2, StDev: 1.0

Machine Learning
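A hedged sketch of how these numbers are obtained (x1 and x2 stand for the two sample columns on the previous slide and are assumed given):

% Maximum likelihood Gaussian fit for each class.
mu1 = mean(x1);   s1 = std(x1, 1);   % std(.,1) normalizes by N (the ML estimate)
mu2 = mean(x2);   s2 = std(x2, 1);
% The slide reports roughly mu1 = 0.16, s1 = 0.8 and mu2 = 2.2, s2 = 1.0.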

7
Different Samples, Different Decisions

Concept: Variance
The variation you observe when training on different independent training sets.

Machine Learning

Variance depends on data set size...

20 points 2000 points


Machine Learning

8
But when data gets more complex...

Machine Learning

… Gaussians don’t work well

Concept: Training Error
The error of your classifier on the training set.

Machine Learning

9
Even if you had “infinite” data …

Related Concept: Bias
The error of your classifier in the limit as the size of the training data grows.

Machine Learning

Histogram

[Figure: histogram of the class sample values]

Divide by N to yield a probability
Machine Learning

10
6.891 Machine Learning and Neural
Networks

Lecture 5:
Density Estimation and Classification

© Paul Viola 1999 Machine Learning 1

News
● No Lecture on Wednesday
» Be sure to get Kinh your graded psets by Wednesday
– Recitation
– Drop it off
● Guest Lecture by Leslie Kaelbling on Friday
» Reinforcement Learning

© Paul Viola 1999 Machine Learning 2

1
Review & Overview
● Lecture 4:
» Gaussian Density Estimation
» Covariance
» Linear and Quadratic Discriminants

● New Density Estimators:
» Non-parametric
» Mixture of Gaussians
» Quick tour of Expectation Maximization
● Application: Face Detection
» Mixture of Gaussians

© Paul Viola 1999 Machine Learning 3

Histogram

[Figure: the same sample values and their histogram]

Divide by N to yield a probability
© Paul Viola 1999 Machine Learning 4

2
Simple Algorithm

function counts = myhist(data, centers)

% Initialize counts
counts = zeros(size(centers));

numdata = size(data,1);

% For each datapoint compute distance to every center


for i = 1:numdata
diffs = data(i,1) - centers;
dists = diffs.^2;
[minval mindist] = min(dists);
counts(mindist) = counts(mindist) + 1;
end

© Paul Viola 1999 Machine Learning 5

Histogram

© Paul Viola 1999 Machine Learning 6

3
Max Likelihood Gaussian

Mean: 0.16 Mean: 2.2


StDev: 0.8 StDev: 1.0

© Paul Viola 1999 Machine Learning 7

Histogram Flexibility is Adjustable

[Figure: histograms of the same data computed with six different bin widths]
© Paul Viola 1999 Machine Learning 8

4
Histograms have lower bias …

© Paul Viola 1999 Machine Learning 9

… but higher variance.

© Paul Viola 1999 Machine Learning 10

5
Parzen: One Bump per Data Point

[Figure: a Parzen density estimate — one Gaussian bump per data point, summed and normalized]

© Paul Viola 1999 Machine Learning 11

Parzen Algorithm

function [func, range] = parzen(data, sigma)
% Parzen window density estimate with a Gaussian kernel of width sigma.
range = linspace(min(data), max(data), 500);
numdata = size(data, 1);
func = zeros(size(range));

% For each point on the range, average the Gaussian kernels
% centered at every datapoint.
for i = 1:size(range, 2)
  d = range(i) - data;
  gaussvals = exp(-d.^2 / (2*sigma^2)) / (sqrt(2*pi)*sigma);
  func(i) = sum(gaussvals) / numdata;
end

plot(range, func)
© Paul Viola 1999 Machine Learning 12

6
Parzen and Histogram are Similar
● Both can model any type of distribution
» Given plenty of data
● Both are simple
● Parzen is differentiable; Histogram is not
● Parzen is smooth; Histogram is not
● Histogram density: evaluation of p(x) is cheap
● Parzen density: evaluation is linear in the data size

© Paul Viola 1999 Machine Learning 13

All Three at Once

[Figure: Gaussian, histogram, and Parzen estimates of the same data, overlaid]

© Paul Viola 1999 Machine Learning 14

7
Properties of Non-parametric Techniques
● Density is an analytical function of the data
● Bias and variance of the density estimator can be adjusted to the problem
● Many more parameters must be estimated
» Histogram: N^d bins
● Lose many of the simple properties of Gaussians

© Paul Viola 1999 Machine Learning 15

Semi-Parametric Models
● Have more flexibility than parametric models
» like Gaussians
● Have less variance than non-parametric models
● Evaluation of p(x) is cheap
● Determination of the parameters is expensive

© Paul Viola 1999 Machine Learning 16

8
[Diagram: a coin flip selects component k, with P(k = 1) + P(k = 2) = 1; component 1 draws x_j from a Gaussian p(x | µ_1, σ_1), component 2 from p(x | µ_2, σ_2).  What is the overall density p(x)?]

© Paul Viola 1999 Machine Learning 17

Events are Disjoint -> They add


p ( X = x ) = p ( X = x, J = 1) + p ( X = x, J = 2)
= p ( X = x | J = 1) P ( J = 1) + p ( X = x | J = 2) P ( J = 2)
= p ( x | µ1 , σ 1 ) P( J = 1) + p( x | µ2 , σ 2 ) P( J = 2)

© Paul Viola 1999 Machine Learning 18

9
Face Detection

© Paul Viola 1999 Machine Learning 19

Generating Training Data

Sung &
Poggio
© Paul Viola 1999 Machine Learning 20

10
Results

But, it takes minutes per image…


© Paul Viola 1999 Machine Learning 21

Face Detection
● Great application of probabilistic classification
» Works very well
» Requires many thousands of parameters
» Computation time is very long

● Is there an Alternative?  ->  Discriminants
» Also works well
» Requires fewer parameters
» Computation time is very short

© Paul Viola 1999 Machine Learning 22

11
Events are Disjoint -> They add
p ( X = x ) = p ( X = x, J = 1) + p ( X = x, J = 2)
= p ( X = x | J = 1) P ( J = 1) + p ( X = x | J = 2) P ( J = 2)
= p ( x | µ1 , σ 1 ) P( J = 1) + p( x | µ2 , σ 2 ) P( J = 2)

© Paul Viola 1999 Machine Learning 23

Expectation Maximization

P(k) ∝ Σ_j P(k | x_j)

µ_k = Σ_j P(k | x_j) x_j / Σ_j P(k | x_j)

σ_k² = Σ_j P(k | x_j) (x_j − µ_k)² / Σ_j P(k | x_j)

E = −log l({µ_k, σ_k, q_k})

Bounded Below?        Decreases?

© Paul Viola 1999 Machine Learning 24
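A hedged MATLAB sketch of these EM updates for a 1-D mixture of two Gaussians (the data and initial values below are made up):

% EM for a two-component 1-D Gaussian mixture.
x  = [randn(100,1); 2 + 0.5*randn(100,1)];
mu = [0 1];  sig = [1 1];  q = [0.5 0.5];          % initial guesses
R  = zeros(length(x), 2);
for iter = 1:50
  for k = 1:2                                      % E step: P(k | x_j)
    R(:,k) = q(k) * exp(-(x - mu(k)).^2/(2*sig(k)^2)) / (sqrt(2*pi)*sig(k));
  end
  R = R ./ repmat(sum(R, 2), 1, 2);
  for k = 1:2                                      % M step: the slide's updates
    Nk     = sum(R(:,k));
    mu(k)  = sum(R(:,k) .* x) / Nk;
    sig(k) = sqrt(sum(R(:,k) .* (x - mu(k)).^2) / Nk);
    q(k)   = Nk / length(x);
  end
end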

12
News
● Sorry about missing last week…
» A scheduling hiccup, which pushed Perceptrons out of the Pset
● The Pset will be out by tonight
» Please get started early
● The Pset is due tomorrow
● Cross-grading worked out well
» But we noticed that a few people were not grading carefully
» I would like you to take this task very seriously

© Paul Viola 1999 Machine Learning

Distribution of Grades: Pset 1

© Paul Viola 1999 Machine Learning

1
Review & Overview
● Lecture 5:
» Non-parametric Density Estimation
– Histograms and Parzen Densities
» Semi-parametric: Mixture of Gaussians
» Application: Face Detection (…very complex)

● Perceptrons
● Training Perceptrons
● Generalized Perceptrons
● Multi-Layer Perceptrons

© Paul Viola 1999 Machine Learning

Where are we?


● Introduced density estimation
» Discrete data
» Continuous data
» Parametric, Non-parametric and Semi-parametric
● Used Bayes' law to classify new examples
» Minimizing either error or Risk

● But this is not the only way…
● In fact this approach has come under sustained attack recently

© Paul Viola 1999 Machine Learning

2
Between density and classification

● Often the details of the density do not matter

© Paul Viola 1999 Machine Learning

Gaussians vs. Discriminants


● Example: Two Gaussian classes with equal covariance
» The density estimator has ~N² parameters
» The resulting linear discriminant has only ~N parameters
» Why estimate the extra parameters???

y(x) = w^T x + w_0        Two-Class Gaussian, same Covariance

● Alternatively, you may not know much about the density of your classes
● Construct a function that classifies directly…

© Paul Viola 1999 Machine Learning

3
Linear Discriminant

y(x) = w^T x + w_0

[Diagram: single-layer network with inputs x_0, x_1, …, x_d and weights w_0, w_1, …, w_d; w_0 is the bias.  Warning: the bias is folded in by fixing x_0 = 1]

y(x) = w^T x = Σ_{i=0}^{N} w_i x_i
© Paul Viola 1999 Machine Learning

Multiple Discriminants
y_1(x) = w_1^T x + w_10        y_2(x) = w_2^T x + w_20

y_1(x) = y_2(x)

w_1^T x + w_10 = w_2^T x + w_20

(w_1 − w_2)^T x + (w_10 − w_20) = 0

ŵ^T x + ŵ_0 = 0

© Paul Viola 1999 Machine Learning

4
… in a single network

y_k(x) = Σ_i w_ki x_i + w_k0

w_ki : weight matrix

C(x) = C_k   if   k = argmax_i y_i(x)

© Paul Viola 1999 Machine Learning

Multiple Discriminants

Intersection
of Half Planes

© Paul Viola 1999 Machine Learning

5
How do we learn linear discriminants?
● What are the principles?
» In density estimation we maximize likelihood
» In classification we minimize errors
● How do we search for the best classifier?
● Will the search have local minima?

© Paul Viola 1999 Machine Learning

Perhaps this is really Regression?

E(w) = Σ_j ( y(x_j) − t_j )²  =  Σ_j ( w^T x_j − t_j )²

[Figure: positive examples (x) given target +1, negative examples (o) given target −1]

Minimize the squared error.
© Paul Viola 1999 Machine Learning

6
Quadratic cost is very simple…

E(w) = Σ_j ( y(x_j) − t_j )²  =  Σ_j ( w^T x_j − t_j )²

In matrix form:
E(W) = (XW − T)^T (XW − T) = W^T X^T X W − 2 W^T X^T T + T^T T

dE(W)/dW = 2 X^T X W − 2 X^T T = 0
X^T X W = X^T T
W = (X^T X)^{-1} X^T T

● Direct linear expression for the weights given the training data
© Paul Viola 1999 Machine Learning
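A minimal MATLAB sketch of that closed form (X, with a leading column of ones for the bias, and the +1/−1 target vector T are assumed given):

% Least-squares linear discriminant.
W = (X' * X) \ (X' * T);     % W = inv(X'X) X'T
y = sign(X * W);             % predicted labels on the training set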

Is this a model for the brain?

Pitts and McCulloch, 1947


© Paul Viola 1999 Machine Learning

7
What about Gradient Descent?

E(w) = Σ_j ( y(x_j) − t_j )²  =  Σ_j ( w^T x_j − t_j )²

∂E(w)/∂w = 2 Σ_j ( w^T x_j − t_j ) x_j = 2 Σ_j δ_j x_j

w_t = w_{t−1} − η Σ_j δ_j x_j

© Paul Viola 1999 Machine Learning

Batch vs. On-line

E(w) = Σ_j E_j = Σ_j (δ_j)²        Error has many components

w_t = w_{t−1} − η ∂E_j/∂w = w_{t−1} − η δ_j x_j        Pick an example at Random

● Pick a Random Example
● Observe the Output/Error
● Adjust the Weights to Reduce the Error

[Diagram: the single-layer network y over inputs x_0 … x_d with weights w_0 … w_d]

© Paul Viola 1999 Machine Learning

8
Can’t Always Solve for the Weights…

y ( x ) = g ( wT x ) 1 if a ≥ 0
g (a) = 
0 otherwise

 1 if a ≥ 0
g (a) = 
− 1 otherwise

● Perceptrons: McCulloch and Pitts
» Originally a model for real neurons

© Paul Viola 1999 Machine Learning

Perceptron

© Paul Viola 1999 Machine Learning

9
Perceptron Cost Function
E(w) = Σ_j ( g(w^T x_j) − t_j )²

∂E(w)/∂w = 2 Σ_j ( g(w^T x_j) − t_j ) ∂g(w^T x_j)/∂w        Simple Gradient Descent does not work

Perceptron Criterion:   E(w) = −Σ_{errors} (w^T x_j) t_j

∂E(w)/∂w = −Σ_{errors} t_j x_j

w_t = w_{t−1} + η t_j x_j        (for each misclassified example)
© Paul Viola 1999 Machine Learning
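A hedged MATLAB sketch of the resulting training loop (X with a bias column and targets t in {−1, +1} are assumed given; eta and the epoch count are arbitrary):

% Perceptron training: nudge the weights on every misclassified example.
[N, d] = size(X);
w   = zeros(d, 1);
eta = 0.1;
for epoch = 1:100
  for j = 1:N
    if sign(X(j,:) * w) ~= t(j)        % example j is misclassified
      w = w + eta * t(j) * X(j,:)';    % move the boundary toward fixing it
    end
  end
end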

Different Error Measures

© Paul Viola 1999 Machine Learning

10
Perceptron Learning
y

w0 w1 w2 wd

X0 X1 X2 … Xd

© Paul Viola 1999 Machine Learning

Real Perceptrons

© Paul Viola 1999 Machine Learning

11
© Paul Viola 1999 Machine Learning

A classic problem...

x x oo
x
oo
o o
x x
o o o

oo
oo x x
o o
x
o o o x x

© Paul Viola 1999 Machine Learning

12
6.891 Machine Learning and Neural
Networks

Lecture 7:
Multi-Layer Perceptrons
Back Propagation

© Paul Viola 1999 Machine Learning

News
l Pset 3 is on the web
» Includes a classifier “shootout”
» The mystery dataset has 20 dimensions and two classes
» Winner gets $10 of Toscanini’s
l Pset 2 looks great …
» Many of you did a lot of work.

© Paul Viola 1999 Machine Learning

1
Review & Overview
l Lecture 6:
» Linear Discriminants
» Perceptrons
» Training Perceptrons

l Generalized Perceptrons
l Multi-layer Perceptrons
» Multi-Layer Derivatives
» Back Propagation
l Examples:
» NET Talk

© Paul Viola 1999 Machine Learning

On-line learning of Perceptrons

1: Error Function (Criterion)        E(w) = Σ_j E_j

2: Update Rule        w_t = w_{t−1} − η ∂E_j/∂w = w_{t−1} − η δ_j x_j

l Pick a Random Example
l Observe the Output/Error
l Adjust the Weights to Reduce the Error

[Diagram: the single-layer network y over inputs X0 … Xd with weights w0 … wd]

© Paul Viola 1999 Machine Learning

2
Different Criteria…

Linear Discriminant:
  E(w) = Σ_j ( w^T x_j − t_j )²
  ∂E(w)/∂w = 2 Σ_j ( w^T x_j − t_j ) x_j = 2 Σ_j δ_j x_j
  w_t = w_{t−1} − η δ_j x_j

Perceptron:
  E(w) = −Σ_{errors} (w^T x_j) t_j
  ∂E(w)/∂w = −Σ_{errors} t_j x_j
  w_t = w_{t−1} + η t_j x_j

© Paul Viola 1999 Machine Learning

Normalizing examples…

For errors
wt = wt −1 − η x j
only!
© Paul Viola 1999 Machine Learning

3
The update rule in action...

wt = wt−1 − x j

© Paul Viola 1999 Machine Learning

Real Perceptrons

© Paul Viola 1999 Machine Learning

4
A classic problem...

x x oo
x oo
o o
x x
o o o

oo
oo x x
o o
x
o o o x x

© Paul Viola 1999 Machine Learning

Generalized Perceptron

XOR ( x ) = x1 + x2 − 2 x1x2 xi ∈{0,1}

y ( x ) = g ( wT x ) Can’t do that!

 x1   1 
xˆ =  x2  ŵ =  1  Works
    Great
 x1 x2   − 2.1

© Paul Viola 1999 Machine Learning

5
Another Generalized Perceptron

© Paul Viola 1999 Machine Learning

Adding a single feature can yield


complex classifications…

© Paul Viola 1999 Machine Learning

6
Two Dilemmas
l How does one find/define the correct set of
features?
l How many will you need?

l 1950’s answers:
» Don’t know… we’ll just think them up.
» Don’t know… we’ll just keep adding wires.

© Paul Viola 1999 Machine Learning

1968: The Death of


Neural Networks

© Paul Viola 1999 Machine Learning

7
Multiple Layers

y

[Diagram: a two-layer network — inputs u1 = 1 (bias), u2 = X1, u3 = X2; hidden units u4 and u5 with first-layer weight rows such as (−1.5, 1, 1, 0) and (−2.5, 0, 1, 1); an output unit y fed through w54, …]

How can we learn this??

© Paul Viola 1999 Machine Learning

1986: The Rebirth of Neural Networks


l PDP Group had
Huge Impact

© Paul Viola 1999 Machine Learning

8
1980’s: Perhaps Gradient Descent?

y(x) = s(w^T x),        s(a) = 1 / (1 + e^{−a})

E(w) = Σ_j ( s(w^T x_j) − t_j )²

∂s(u)/∂w = s(u) (1 − s(u)) ∂u/∂w

∂E(w)/∂w = 2 Σ_j ( s(w^T x_j) − t_j ) ∂s(w^T x_j)/∂w
         = 2 Σ_j ( s(w^T x_j) − t_j ) s(w^T x_j) (1 − s(w^T x_j)) x_j
© Paul Viola 1999 Machine Learning

Sigmoid Multi-Layer Network

y(x) = s( w_64 u_4 + w_65 u_5 )
     = s( w_64 s( w_41 u_1 + w_42 u_2 + w_43 u_3 ) + w_65 s( w_51 u_1 + w_52 u_2 + w_53 u_3 ) )

E(w) = Σ_j ( s(w^T x_j) − t_j )²

∂E(w)/∂w = 2 Σ_j ( s(w^T x_j) − t_j ) ∂s(w^T x_j)/∂w

[Diagram: inputs u_1 = 1, u_2 = X1, u_3 = X2; sigmoid hidden units u_4 and u_5; sigmoid output u_6 = y with weights w_64 and w_65]

© Paul Viola 1999 Machine Learning
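A hedged MATLAB sketch of this gradient for one example of the little two-hidden-unit network (all weights and data below are illustrative):

% Forward pass and squared-error gradient for the network above.
s   = @(a) 1 ./ (1 + exp(-a));           % the sigmoid
u   = [1; 0; 1];  t = 1;                 % one example; u(1) is the bias input
W1  = 0.1 * randn(2, 3);                 % rows [w41 w42 w43; w51 w52 w53]
w2  = 0.1 * randn(2, 1);                 % [w64; w65]
h   = s(W1 * u);                         % hidden activations u4, u5
y   = s(w2' * h);                        % network output
dy  = 2 * (y - t) * y * (1 - y);         % dE/da at the output unit
gw2 = dy * h;                            % gradient w.r.t. w64, w65
gW1 = (dy * w2 .* h .* (1 - h)) * u';    % gradient w.r.t. the first-layer weights

Back propagation, covered next, is just this chain rule organized layer by layer.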

9
Multi-Layer Conventions

 
u_k = g( Σ_j w_kj u_j ),        a_k = Σ_j w_kj u_j

[Diagram: the same network, units u_1 … u_6]

l Networks must not have loops…
l Units are ordered:
» i > k --> u_i is not an input to u_k
l Compute the Units in order
© Paul Viola 1999 Machine Learning

More Conventions

l If Units are organized in Layers
» Layers can be computed in parallel:

u_2 = g( W_21 * u_1 )        u_3 = g( W_32 * u_2 )

© Paul Viola 1999 Machine Learning

10
© Paul Viola 1999 Machine Learning

© Paul Viola 1999 Machine Learning

11
Solving XOR (big deal?)

© Paul Viola 1999 Machine Learning

Vision Applications (sort of)

T’s vs. C’s

© Paul Viola 1999 Machine Learning

12
Very Simple Solution

© Paul Viola 1999 Machine Learning

NETtalk (1986) First Real Application


l Task: Pronounce English text
» Text -> Phonemes
l Example: “This is it.” /s/ vs. /z/
l 29 possible characters
l 26 phonemes
l 7 Character Window
l Structure
» 203 Inputs
» 26 Output
» 80 Hidden

95% Accurate
© Paul Viola 1999 Machine Learning

13
6.891 Machine Learning and Neural
Networks

Lecture 8:
Back Prop and Beyond

© Paul Viola 1999 Machine Learning

News
l Mid-term will be on 10/20
» Here in this room.
» It should take about 1 hour… but we will give you 1.5
– Show up on time, please.
» Coverage: Psets 1, 2 and 3.
– Density estimation (Parametric, Semi and Non-parametric)
– Bayesian Classification
– Discriminants (Linear, Perceptron, Multi-layer)

© Paul Viola 1999 Machine Learning

1
Review & Overview
l Lecture 7:
» Multi-Layer Derivatives
» Back Propagation
» Examples:
– NET Talk

l Why 6.891 is not Over!


» Bugs with Gradient Descent
» Local Min
» Bias and Variance
– How many units?
» Variants

© Paul Viola 1999 Machine Learning

Face Detection Network:


General Layout

Baluja, Rowley, and Kanade


© Paul Viola 1999 Machine Learning

2
Intensity Preprocessing

© Paul Viola 1999 Machine Learning

Training Data

Positives
Negatives

© Paul Viola 1999 Machine Learning

3
Performance

© Paul Viola 1999 Machine Learning

© Paul Viola 1999 Machine Learning

4
MLP: How Powerful?

© Paul Viola 1999 Machine Learning

Derivatives are Cheap


l Modeling Factories/Plants

Input Outputs
Control Plant
Products
Materials

MLP • Train with Back Prop


• Use Derivs to Modify
input to improve output
Derivatives

© Paul Viola 1999 Machine Learning

5
1990: The height of MLP’s and Back Prop
l Multi-layer perceptrons can solve any
approximation problem (in principle)
» Given 3 layers
» Given and infinite number of units and weights
l There is no direct technique for finding the
weights (unlike linear discriminants)
l Gradient descent (using Back Prop) comes to
dominate discussion in the Neural Net community
» Can you find a good set of weights quickly?
– How can you speed things up?
» Will you get stuck in local minima?
l A small group in the community also worries about
generalization.
© Paul Viola 1999 Machine Learning

How long ‘til we find the Min?


l Simplest Case
» 1 Weight, Quadratic Error Function, network y = w x with input x = 1

E(w) = (wx − y)² = (w − 0)² = w²        ∂E/∂w_{t−1} = 2w

w_t = w_{t−1} − η ∂E/∂w_{t−1},        η = 1/2 reaches the minimum in one step

© Paul Viola 1999 Machine Learning

6
Scale the Input
l Simplest Case
» 1 Weight, Quadratic Error Function, but now the input is x = 2

E(w) = (2w − 0)² = 4w²        ∂E/∂w_{t−1} = 8w        w_t = w_{t−1} − η ∂E/∂w_{t−1}

Hack 1: Start eta very small; increase it if the Error decreases.        More Hacks Coming!
© Paul Viola 1999 Machine Learning

Multiple Weights

Hack 2: Momentum

© Paul Viola 1999 Machine Learning

7
Momentum

∂E
∆wt = −η + α∆wt −1
∂wt−1

© Paul Viola 1999 Machine Learning

Second Order Techniques


l Gradient descent assumes a locally linear cost function:

∆w_t = −η ∂E/∂w_{t−1} = −η E′(w)
E(w + ∆) = E(w) + E′(w) ∆ + ε  =  E(w) − η (E′(w))²

l Second order techniques assume locally quadratic:

E(w) = a w² + b w + c        E′ = 2 a w + b        E″ = 2 a        E′/E″ = w + b/2a

∆w_t = −E′/E″        w_1 = w_0 + ∆w = w_0 − (w_0 + b/2a) = −b/2a        (the minimum, in one step)
© Paul Viola 1999 Machine Learning

8
More Principled Hacks...
l Second Order Techniques
» N weights --> N^2 Hessian entries
» Also Destabilizes learning
l Line Search
» Expensive but hard to beat

© Paul Viola 1999 Machine Learning

Local Minima
l Number of Papers
» 1000’s of local minima in simple problems (XOR)

l One More Trick


» Linear is good…
l Small Input Range
» Sigmoid is almost linear

l ** Start weights near zero...

© Paul Viola 1999 Machine Learning

9
Bias and Variance
l How many layers are right?
l How many units per layer?
l What about structural constraints?

l *** We don’t know the answers ***

© Paul Viola 1999 Machine Learning

ALVINN

Pomerleau
© Paul Viola 1999 Machine Learning

10
No Hands Across America

© Paul Viola 1999 Machine Learning

Zip Codes

Le Cun

© Paul Viola 1999 Machine Learning

11
6.891 Machine Learning and Neural
Networks

Lecture 9:
On to Support Vector Techniques

© Paul Viola 1999 Machine Learning

News
l Final will be 12/13 at 1:30PM
» If you have a conflicting final let us know.
l Remember that almost all the material appears in
the book…
» Right now we are jumping back and forth between
– Chapter 5
– Chapter 6

© Paul Viola 1999 Machine Learning

1
Review & Overview
l Lecture 8:
» Multi-layer Perceptrons
» Back propagation
» Hacks (… many)

l Why did we discard Perceptrons?


l Kernel function network
l Define the Support Vector framework

© Paul Viola 1999 Machine Learning

History Lesson
l 1950’s Perceptrons are cool
» Very simple learning rule, can learn “complex” concepts
» Generalized perceptrons are better -- too many weights
l 1960’s Perceptron’s stink (M+P)
» Some simple concepts require exponential # of features
– Can’t possibly learn that, right?
l 1980’s MLP’s are cool (R+M / PDP)
» Sort of simple learning rule, can learn anything (?)
» Create just the features you need
l 1990 MLP’s stink
» Hard to train : Slow / Local Minima
l 1996 Perceptron’s are cool
© Paul Viola 1999 Machine Learning

2
Why did we need multi-layer
perceptrons?

l Problems like this seem to require very complex


non-linearities.
l Minsky and Papert showed that an exponential
number of features is necessary to solve generic
problems.

© Paul Viola 1999 Machine Learning

Why an exponential number of features?

Φ(x) = ( x_1^5, x_1^4 x_2, x_1^3 x_2^2, x_1^2 x_2^3, x_1 x_2^4, x_2^5,
         x_1^4, x_1^3 x_2, x_1^2 x_2^2, x_1 x_2^3, x_2^4,
         … )

n : variables        k : order of the polynomial

C(n + k, k) = (n + k)! / (k! n!)  ∈  O( min(n^k, k^n) )

14th Order???  120 Features

N = 21, k = 5  -->  65,000 features
© Paul Viola 1999 Machine Learning

3
MLP’s vs. Perceptron
l MLP’s are incredibly hard to train…
» Takes a long time (unpredictably long)
» Can converge to poor minima
l MLP are hard to understand
» What are they really doing?

l Perceptrons are easy to train…


» Type of linear programming. Polynomial time.
» One minimum which is global.
l Generalized perceptrons are easier to understand.
» Polynomial functions.

© Paul Viola 1999 Machine Learning

Perceptron Training
is Linear Programming
• After Normalization   • After adding a bias   • Assumes no errors

Σ_i w_i x_i^l > 0   ∀l

Polynomial time in the number of variables and in the number of constraints.

What about linearly inseparable data?

Σ_i w_i x_i^l + s_l > 0   ∀l,        s_l > 0   ∀l,        min Σ_l s_l

© Paul Viola 1999 Machine Learning

4
Rebirth of Perceptrons
l How to train efficiently.
» Linear Programming (… later quadratic programming)
l How to get so many features inexpensively?!?
l How to generalize with so many features?
» Occam’s revenge.

Support Vector Machines

© Paul Viola 1999 Machine Learning

Lemma 1: Weight vectors are simple

w_0 = 0        ∆w_t = η x_t

w_t = Σ_{errors} η x_t = Σ_l b_l x_l

w_t = Σ_l b_l Φ(x_l)

l The weight vector lives in a sub-space spanned by the examples…
» Dimensionality is determined by the number of examples, not the complexity of the feature space.

© Paul Viola 1999 Machine Learning

5
Lemma 2: Only need to compare examples

© Paul Viola 1999 Machine Learning

Perceptron Rebirth: Generalization


l Too many features … Occam is unhappy
» Perhaps we should encourage smoothness?

Σ_j b_j K(x_l, x_j) + s_l > 0   ∀l,        s_l > 0   ∀l,        min Σ_l s_l

min Σ_j b_j²        Smoother

But this is unstable!!


© Paul Viola 1999 Machine Learning

6
Linear Program is not unique

The linear program could return any multiple of the correct weight vector:

Σ_i ŵ_i x_i^l > 0   ∀l        =>        Σ_i (λ ŵ_i) x_i^l > 0   ∀l

Slack variables & a weight prior force the solution toward zero:

Σ_i w_i x_i^l + s_l > 0   ∀l,        s_l > 0   ∀l,        min Σ_l s_l,        min Σ_i w_i²
© Paul Viola 1999 Machine Learning

Definition of the Margin

l Margin: Gap between negatives and positives


measured perpendicular to a hyperplane

© Paul Viola 1999 Machine Learning

7
Require non-zero margin

Σ_i w_i x_i^l + s_l > 0   ∀l        Allows solutions with zero margin

Σ_i w_i x_i^l + s_l > 1   ∀l        Enforces a non-zero margin between the examples and the decision boundary.

© Paul Viola 1999 Machine Learning

Constrained Optimization

Σ_j b_j K(x_l, x_j) + s_l > 1   ∀l,        s_l > 0   ∀l,        min Σ_l s_l,        min Σ_j b_j²

l Find the smoothest function that separates the data
» Quadratic Programming (similar to Linear Programming)
– Single minimum
– Polynomial-time algorithm

© Paul Viola 1999 Machine Learning
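A hedged MATLAB sketch of the hard-margin linear version of this quadratic program, using quadprog from the Optimization Toolbox (X and targets t in {−1, +1} are assumed given and separable; a kernelized version would replace X with the kernel expansion):

% Hard-margin linear SVM:  min ||w||^2  s.t.  t_j (w'x_j + b) >= 1.
[N, d] = size(X);
H = blkdiag(eye(d), 0);               % penalize ||w||^2 but not the bias
f = zeros(d + 1, 1);
A = -[X .* repmat(t, 1, d), t];       % the constraints as  A z <= b,  z = [w; b]
b = -ones(N, 1);
z = quadprog(H, f, A, b);
w = z(1:d);  bias = z(d + 1);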

8
Constrained Optimization 2

x 3 is inactive

© Paul Viola 1999 Machine Learning

Support Vectors

l Many of the B’s are zero -- inactive constraints


l Guaranteed to generalize well
» VC Dimension -- end of semester

© Paul Viola 1999 Machine Learning

9
SVM: examples

© Paul Viola 1999 Machine Learning

SVM: Key Ideas


l Augment inputs with a very large feature set Φ ( x)
» Polynomials, etc.
l Use Kernel Trick(TM) to do this efficiently
l Enforce/Encourage Smoothness with weight penalty

» Minimize ν wT w = ν ∑ wi2
i
l Introduce Margin so that: wi ≠ 0 ∀i
» Set of linear inequalities
l Find best solution using Quadratic Programming

© Paul Viola 1999 Machine Learning

10
SVM: Difficulties
l How do you pick the kernels?
» Intuition / Luck / Magic / …

l What if the data is not linearly separable?


» i.e. the constraints cannot be satisfied

 
∀ j ( 2t j − 1) ∑ wi K ( x j , ci )  ≥ 1 − ε j
 i 

(
min w T w + c ∑ j | ε j | ) Slack Variables

© Paul Viola 1999 Machine Learning

SVM: Simple Example

6 weights

l Data dimension: 2
l Feature Space: 2nd order polynomial
» 4 dimensional

© Paul Viola 1999 Machine Learning

11
SVM versus Perceptron
l Why not just use a perceptron?
» Use all training points as a centers

 
y( x ) = Θ  ∑ wi K ( x, c i ) 
T

 i 

» Update using perceptron rule:

wτi = wτi + η K ( x, c i )

l Perceptron is not necessarily


smooth...

© Paul Viola 1999 Machine Learning

Perceptrons are not smooth…

© Paul Viola 1999 Machine Learning

12
Zip Codes

Much Effort spent on


organizing the network

© Paul Viola 1999 Machine Learning

SVM: Zip Code recogntion

l Data dimension: 256


l Feature Space: 4 th order
» roughly 100,000,000 dims

© Paul Viola 1999 Machine Learning

13
SVM: Faces

© Paul Viola 1999 Machine Learning

Support Vectors

© Paul Viola 1999 Machine Learning

14
6.891 Machine Learning and Neural
Networks

Lecture 10:
Support Vector Machines
More Details and Derivations

© Paul Viola 1999 Machine Learning

News
l Quiz is 1 week from today.

l Problem set 4 will go out right after the quiz


» In one week… it’s a pain to do two things at once.

l Problems set are very good (once again).


» On an absolute scale many of you are getting A’s.

© Paul Viola 1999 Machine Learning

1
Pset 2

© Paul Viola 1999 Machine Learning

Review & Overview


l Lecture 9:
» Resurrecting Perceptrons
» Setting up Support Vector Machines

l SVM review
l Why is it called “Support Vectors”??
l Derivation of some simpler properties.

© Paul Viola 1999 Machine Learning

2
SVM: Key Ideas
l Augment inputs with a very large feature set Φ ( x)
» Polynomials, etc.
l Use Kernel Trick(TM) to do this efficiently
l Enforce/Encourage Smoothness with weight penalty

» Minimize b T b = ∑ bi2 bi ≠ 0 ∀i
i
Avoid!

l Find best solution using Quadratic Programming

© Paul Viola 1999 Machine Learning

Support Vectors
min( w T w ) subject to constraint
 
∀ j ∑ bi K ( x j , c i ) ≥ 1
 i 
 
y ( x ) = Θ  ∑ bi K ( x, ci ) 
 i 
l Many of the b’s are zero -- inactive constraints
» Only keep examples where bi ≠ 0
l Likely to generalize well
» VC Dimension -- later in the semester

© Paul Viola 1999 Machine Learning

3
An alternative motivation
l Like all good ideas, Support Vector Machines can
be motivated in several different ways.

© Paul Viola 1999 Machine Learning

The optimal dividing line…

© Paul Viola 1999 Machine Learning

4
The optimal dividing line…
l The optimal separator
maximizes the margin between
positive and negative examples

d− = max_{negatives} w^T x_i

d+ = min_{positives} w^T x_i

margin = (d+ − d−) / |w|

max_w (margin) = max_w (d+ − d−) / |w|
© Paul Viola 1999 Machine Learning

Definition of the Margin

l Margin: Gap between negatives and positives


measured perpendicular to a hyperplane

© Paul Viola 1999 Machine Learning

5
Optimal dividing line=Support Vectors

d− = max_{negatives} w^T x_i        ∀ negatives:  w^T x_i ≤ −1

d+ = min_{positives} w^T x_i        ∀ positives:  w^T x_i ≥ 1

max_w (d+ − d−) / |w|        <=>        min_w w^T w

© Paul Viola 1999 Machine Learning

Lemma 1: Weight vectors are simple

w = ∑ bl x l w = ∑ bl Φ (x l )
l l

l The weight vector lives in a sub-space spanned by


the examples…

l Proved this to you by analyzing the perceptron


weight update rule…
» But we no longer use that rule!!!
» Instead we use Quadratic Programming

© Paul Viola 1999 Machine Learning

6
Lemma 1: Kuhn-Tucker Conditions

w^T x_1 ≥ 1
w^T x_2 ≥ 1        min( w^T w )
w^T x_3 ≤ −1

At the solution:  w^T x_1 = 1,  w^T x_2 = 1,  and  w = b_1 x_1 + b_2 x_2

Some of the examples do not contribute… the inactive constraints.

© Paul Viola 1999 Machine Learning

SVM versus Perceptron


l Why not just use a perceptron?
» Use all training points as a centers

 
y( x ) = Θ  ∑ wi K ( x, c i ) 
T

 i 
» Update using perceptron rule:

wτi = wiτ + η (tτ − y ( xτ )) K ( x , ci )


l Perceptrons do not maximize the margin
» The estimated function is not terribly smooth…
l Perceptrons do not rely on very few support vectors
» Yields a much more efficient classifier.

© Paul Viola 1999 Machine Learning

7
Perceptrons are not smooth…

© Paul Viola 1999 Machine Learning

SVM: Faces

© Paul Viola 1999 Machine Learning

8
Support Vectors

© Paul Viola 1999 Machine Learning

SVM: Difficulties
l How do you pick the kernels?
» Intuition / Luck / Magic / …

l What if the data is not linearly separable?


» i.e. the constraints cannot be satisfied

 
∀ j  ∑ bi K ( x j , c i )  + s j ≥ 1
 i 

(
min b T b + c ∑ j | s j | ) Slack Variables

© Paul Viola 1999 Machine Learning

9
SVM: Generalization??
l Is there a formal proof that SVM’s will work better
than Perceptrons or MLPs??
» Perhaps…
l There is a tenuous relationship between maximizing
the margin and reducing the complexity of the
classifier.
» The complexity of the classifier is reduced to the number
of support vectors.
» Hard problems require more support vectors.
l The VC-Dimension of a support vector machine is
controlled by maximizing the margin.

© Paul Viola 1999 Machine Learning

Margin is the Key Concept


l As the margin is increased, so too does
generalization.

l We will see other types of algorithms which will


attempt to maximize the margin between positive
and negative examples…

© Paul Viola 1999 Machine Learning

10
Can we regain the simplicity of Perceptrons

© Paul Viola 1999 Machine Learning

How are the Margins effected??

© Paul Viola 1999 Machine Learning

11
6.891 Machine Learning and Neural
Networks

Lecture 11:
More Kernel Networks

© Paul Viola 1999 Machine Learning

News
l Matlab was down at the AI lab for a few hours.
» I am not terribly sympathetic… since it was after the
official deadline for the pset.
» Just hand it in as soon as you can.

l Cross-grading for next week.


» Please have it done by Thursday (earlier is better).

© Paul Viola 1999 Machine Learning

1
Review & Overview
l Lecture 10:
» The Support in Support Vectors
» The Margin is a key concept

l The SVM criteria (one last time… )


l Smooth Regression
» Another way of motivating Kernel networks

© Paul Viola 1999 Machine Learning

Optimal dividing line=Support Vectors

d − = max wT x i ∀ w T x i ≤ −1
negatives negatives

d + = min wT x i
positives ∀ wT x i ≥ 1
positives

d+ − d−
max min wT w
w | w|

© Paul Viola 1999 Machine Learning

2
Optimal dividing line=Support Vectors

d− = max_{negatives} w^T x_i        d+ = min_{positives} w^T x_i

With the constraints  w^T x_i ≤ −1 (negatives)  and  w^T x_i ≥ 1 (positives),
the boundary examples give  d− = −1  and  d+ = 1, so the distances are −1/|w| and 1/|w|:

margin = (d+ − d−) / |w| = 2 / |w|

max_w (margin)        <=>        max_w 1/(w^T w)        <=>        min_w w^T w

Kernel Networks are Good for Regression


 
y ( x ) = Θ  ∑ bi K ( x, ci )  y ( x) = ∑ bi K ( x, ci )
 i  i

l The form of the Kernel determines the form of


the final function
» Polynomial Kernels -> Polynomial Function
» Gaussian Kernels -> Sum of Gaussians
l The Common Error Criteria is squared error…
Error = Σ_j ( y(x_j) − t_j )²

This ends up being exactly like polynomial fitting… except that there is one weight per data point.

© Paul Viola 1999 Machine Learning
3
Radial Basis Function Networks

K ( x , c ) = K (| x − c |) y ( x ) = ∑ bi K (| x − c i |)
i

l When we restrict ourselves to Kernels which are


radially symmetric, the resulting network is called
a Radial Basis Function Network
» K only depends on the radial distance from some
datapoint c.
» Poggio & Girosi pioneered the use of these.

© Paul Viola 1999 Machine Learning

From Smoothness to Kernels:


Assumptions are Necessary

[Figure: two sparse data sets with question marks at the unobserved inputs]

Intuition

© Paul Viola 1999 Machine Learning

4
Setting up the problem

Cost = (WY − T)²

Y = (W^T W)^{-1} W^T T        Not Invertible!

[T is a length-10 target vector with observed values 1, 3 and 1 at positions 1, 5 and 10 (zeros elsewhere); W is the 10x10 selector matrix with ones on the diagonal only at those observed positions]

Conditioning the Problem

Cost = (WY − T)² + λ Y^T Y        Y^T Y = Σ_i y_i²

Y = (W^T W + λ I)^{-1} W^T T        Small solution vectors are best.

[With T = (1, 0, 0, 0, 3, 0, 0, 0, 0, 1)^T the solution Y simply reproduces the observed values and leaves the unobserved entries at zero]

5
… and the winner is?
[Figure: three different functions that all fit the same three observations]

This is not always true… remember to think like a Bayesian

Smooth is Good: Regularization

l Alternative way to motivate Kernel Networks.

Cost(y) = Error(y) + Smoothness(y)
        = Σ_j ( y(x_j) − t_j )²  +  ∫ ( ∂y(x̂)/∂x̂ )² dx̂

© Paul Viola 1999 Machine Learning

6
Derivative Measures Smoothness

Squared 1st Derivative

[Figure: three candidate fits, from wiggly to smooth, and their squared first derivatives]

Sum = 49.7        Sum = 20        Sum = 1.8

© Paul Viola 1999 Machine Learning

Setting up the Problem


Cost = (WY − T)² + λ (DY)²

Y = (W^T W + λ D^T D)^{-1} W^T T

W: the 10x10 selector matrix with ones at the observed positions (1, 5 and 10)
T = (1, 0, 0, 0, 3, 0, 0, 0, 0, 1)^T
D: the first-difference matrix, with rows (1, -1, 0, …), (0, 1, -1, 0, …), …
Resulting Y = (1.0, 1.5, 2.0, 2.5, 3.0, 2.6, 2.2, 1.8, 1.4, 1.0)^T
© Paul Viola 1999 Machine Learning
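A minimal sketch of the smoothness-regularized fit on this slide: D penalizes first differences (a discrete d/dx) and lambda trades data fit against smoothness. The lambda value follows the slide; the construction itself is just one way to set it up.

% Sketch: F = (W^T W + lambda D^T D)^{-1} W^T T with a first-difference penalty.
n = 10;
W = zeros(n); W(1,1) = 1; W(5,5) = 1; W(10,10) = 1;
T = zeros(n,1); T(1) = 1; T(5) = 3; T(10) = 1;

D = -diff(eye(n));                        % 9x10 first-difference matrix, rows [1 -1 0 …]
lambda = 0.001;
Y = (W'*W + lambda*(D'*D)) \ (W'*T);

plot(1:n, Y, 'o-');                       % piecewise-linear interpolation of the data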

7
Need to find lambda …
lambda = 0.001:    Y = [ 1.0  1.5  2.0  2.5  3.0  2.6  2.2  1.8  1.4  1.0 ]^T
                   (the data are interpolated almost exactly)

lambda = 10:       Y = [ 1.60  1.66  1.72  1.78  1.84  1.784  1.728  1.672  1.616  1.56 ]^T
                   (the fit is pulled strongly toward a flat, smooth function)

[Figure: the two fits plotted against the data.]

© Paul Viola 1999 Machine Learning

Derivative Order Controls Shape


[Figure: fits to the same data under a first-derivative penalty (piecewise linear) and a second-derivative penalty (piecewise cubic).]

© Paul Viola 1999 Machine Learning

8
A Closer Look
[Figure: close-up of the two fits.]

Linear + Kinks Cubics + Kinks

© Paul Viola 1999 Machine Learning

Fitting More Data

[Figure: first- and second-derivative regularized fits to a larger data set.]
© Paul Viola 1999 Machine Learning

9
Still Piecewise Cubic

[Figure: the second-derivative solution remains piecewise cubic as more data are added.]

© Paul Viola 1999 Machine Learning

Smoothness is easily controlled

[Figure: a family of fits as the smoothness penalty is varied -- from nearly interpolating the data to very smooth.]

© Paul Viola 1999 Machine Learning

10
Regularization to RBF’s

l Alternative way to motivate RBF’s

E ( y ) = Error ( y ) + Smoothness ( y )

(the solution is a sum of Gaussian kernels, one centered at every training point)

© Paul Viola 1999 Machine Learning

Problem: Too many centers


(Old solutions… )
l One per data point can be way too many…
l Choose a random subset of the points
– Hope you don’t get unlucky
l Distribute them based on the density of points
– Perhaps EM clustering…

© Paul Viola 1999 Machine Learning

11
Too Many Centers 2
l Put them where you need them…
» To best approximate your function

l Compute the derivative of E(y) w.r.t. the centers


» This gets very hairy and does not work well
– Too many local minima -- no small weight trick

© Paul Viola 1999 Machine Learning

Too Many Centers 3


l Support Vector Regression… Next time.

© Paul Viola 1999 Machine Learning

12
6.891 Machine Learning and Neural
Networks

Lecture 12:
Smooth Functions and Kernel Networks

© Paul Viola 1999 Machine Learning

News
l Quiz was too hard…
» I am trying to come up with a creative grading scheme.
– Best 5 out of 6 problems???
– First let us do the grading.
l Problem set will be out by tonight.

© Paul Viola 1999 Machine Learning

1
Review & Overview
l Lecture 11:
» Trying to find smooth functions.

l Smooth Regression
» Another way of motivating Kernel networks

© Paul Viola 1999 Machine Learning

… where we were last time.


Squared 1st Derivative

[Figure: three candidate fits to the same data and their squared first derivatives.]

Sum = 49.7        Sum = 20        Sum = 1.8

© Paul Viola 1999 Machine Learning

2
Regression Review
l Up until now we have been mostly analyzing
classification:
» X, inputs. Y, classes. Find the best c(x) .

l Today: Regression.
» X, inputs. Y, outputs. Find the best f(x) .
» Predict the stock’s value next week.
» “Picture of Road” -> “Car steering wheel”
» etc.

min_w  sum_j ( f( x_j, w ) - y_j )^2

© Paul Viola 1999 Machine Learning

Schemes for motivating regression…


l Prior assumptions
» Find the best polynomial which fits the data:

f( x, w ) = w_0 + w_1 x + w_2 x^2 + …

min_w  sum_j ( f( x_j, w ) - y_j )^2

» Or find the best neural network, or ???

l Bayesian Approach
» Find the most likely function:

max_f  p( f | {x_j, y_j} )  =  max_f  p( {x_j, y_j} | f ) p( f ) / p( {x_j, y_j} )

© Paul Viola 1999 Machine Learning

3
Bayesian framework captures
many approaches
max_f  p( f | {x_j, y_j} )  =  max_f  p( {x_j, y_j} | f ) p( f ) / p( {x_j, y_j} )

max_f  [ log p( {x_j, y_j} | f ) + log p( f ) - log p( {x_j, y_j} ) ]

log p( {x_j, y_j} | f ) = log prod_j p( x_j, y_j | f )
                        = sum_j log p( x_j, y_j | f )
                        = sum_j log G( f( x_j ) - y_j )
                        = - sum_j c ( f( x_j ) - y_j )^2

p( f ) = epsilon if f is a polynomial, 0 otherwise

The polynomial that fits the data best is the most likely function.
© Paul Viola 1999 Machine Learning

Bayesian framework captures


many approaches

log p( {x_j, y_j} | f ) = - sum_j c ( f( x_j ) - y_j )^2

log p( f ) = - integral ( df/dx )^2

The function which both fits the data and has a small derivative is the most likely.

Also popular:    log p( f ) = - integral ( d^2 f / dx^2 )^2

© Paul Viola 1999 Machine Learning

4
A closer look...

1 1 
 ∂f 
2

(
C( f ) = ∑ c f ( x ) − yj
)
j 2
+ λ∫   X =  5  Y = 3
 ∂x 
10 1
j

Data
l How do we minimize this function?
» The set of possible functions is infinite
» The space of functions is infinite dimensional

l Constrain f to be: polynomial, sum of exponentials, etc??


l What about unconstrained solutions...

© Paul Viola 1999 Machine Learning

A slightly simpler problem

C( f ) = sum_j c ( f( x_j ) - y_j )^2  +  lambda integral ( df/dx )^2            dC( f ) / df = 0

l But f is not a scalar!!


l In fact f is more like an infinite dimensional vector.
l If f were a finite vector:
for all j:    dC( f ) / df_j = 0        -- cannot reduce C() by adjusting any of the parameters of f.

© Paul Viola 1999 Machine Learning

5
We could approximate f .

[Figure: approximating f by its values on a grid of 10, 20, 100, and 1000 sample points -- as the grid gets finer, the discrete vector approaches the continuous function.]

© Paul Viola 1999 Machine Learning

Looking for smooth solutions...

[Figure: sparse data with "?" marking the unknown function between the observations. Intuition: prefer the smooth fit.]

© Paul Viola 1999 Machine Learning

6
Setting up the problem

Cost = ( WF - Y )^2

F = ( W^T W )^{-1} W^T Y        -- Not Invertible!

Y = [ 1 0 0 0 3 0 0 0 0 1 ]^T

W = a 10x10 selection matrix with ones at (1,1), (5,5), (10,10) and zeros elsewhere;
    since W^T W is not invertible, the unobserved entries of F are arbitrary.

© Paul Viola 1999 Machine Learning

Conditioning the Problem

Cost = ( WF - Y )^2 + lambda F^T F,        F^T F = sum_i f_i^2

F = ( W^T W + lambda I )^{-1} W^T Y        -- Small solution vectors are best.

Y = [ 1 0 0 0 3 0 0 0 0 1 ]^T    gives    F = [ 1 0 0 0 3 0 0 0 0 1 ]^T

[Figure: the solution reproduces the observed values and is zero elsewhere.]

© Paul Viola 1999 Machine Learning

7
… and the winner is?
[Figure: three candidate solutions plotted against the same three data points.]

This is not always true… remember to think like a Bayesian.
© Paul Viola 1999 Machine Learning

Smooth is Good: Regularization

l Alternative way to motivate Kernel Networks.

Cost( f ) = Error( f ) + Smoothness( f )

          = sum_j ( f( x_j ) - y_j )^2  +  integral ( d f(x)/dx )^2 dx

© Paul Viola 1999 Machine Learning

8
Setting up the Problem
Cost = ( WF - Y )^2 + lambda ( DF )^2

F = ( W^T W + lambda D^T D )^{-1} W^T Y

W = the same 10x10 selection matrix as before (ones at rows 1, 5, 10)

D = the 9x10 first-difference matrix with rows [ 1 -1 0 … ], [ 0 1 -1 0 … ], …

Y = [ 1 0 0 0 3 0 0 0 0 1 ]^T    gives    F = [ 1.0 1.5 2.0 2.5 3.0 2.6 2.2 1.8 1.4 1.0 ]^T
© Paul Viola 1999 Machine Learning

Need to find lambda …


lambda = 0.001:    F = [ 1.0  1.5  2.0  2.5  3.0  2.6  2.2  1.8  1.4  1.0 ]^T

lambda = 10:       F = [ 1.60  1.66  1.72  1.78  1.84  1.784  1.728  1.672  1.616  1.56 ]^T

[Figure: the nearly interpolating fit vs. the heavily smoothed fit.]

© Paul Viola 1999 Machine Learning

9
Derivative Order Controls Shape
[Figure: fit under a first-derivative penalty (left) vs. a second-derivative penalty (right).]

integral ( df/dx )^2                        integral ( d^2 f / dx^2 )^2

© Paul Viola 1999 Machine Learning

A Closer Look
[Figure: close-up of the two fits.]

Linear + Kinks Cubics + Kinks

© Paul Viola 1999 Machine Learning

10
Look at the regularizer...
Cost = ( WF - T )^2 + lambda ( DF )^2                F = ( W^T W + lambda D^T D )^{-1} W^T Y

D = the first-difference matrix (rows [ 1 -1 0 … ], [ 0 1 -1 0 … ], …)

D' * D = a tridiagonal matrix that acts like a discrete second derivative:

     1 -1  0  0 …
    -1  2 -1  0 …
     0 -1  2 -1 …
     …
     0  …  0 -1  1

© Paul Viola 1999 Machine Learning

Second Deriv -> Fourth Deriv

D (now the discrete second derivative):            D' * D (a discrete fourth derivative):
     1 -1  0  0 …                                       1 -2  1  0 …
    -1  2 -1  0 …                                      -2  5 -4  1 …
     0 -1  2 -1 …                                       1 -4  6 -4  1 …
     …                                                  …
     0  …  0 -1  1                                      0  …  0  1 -2  1

© Paul Viola 1999 Machine Learning

11
What about continuous functions??

C( f ) = lambda integral ( df/dx )^2            for all x:   dC( f ) / d f(x) = 0        (an infinite number of "derivatives")

dC( f ) / d f(x)  =  [ C( f + delta_x ) - C( f ) ] / | delta_x |  =  deltaC( x )

© Paul Viola 1999 Machine Learning

Fitting More Data

[Figure: regularized fits to a larger data set.]
© Paul Viola 1999 Machine Learning

12
Still Piecewise Cubic

[Figure: the fit remains piecewise cubic.]

© Paul Viola 1999 Machine Learning

Smoothness is easily controlled

[Figure: a family of fits as the smoothness penalty is varied.]

© Paul Viola 1999 Machine Learning

13
Regularization to RBF’s

l Alternative way to motivate RBF’s

E ( y ) = Error ( y ) + Smoothness ( y )

(the solution is a sum of Gaussian kernels, one centered at every training point)

© Paul Viola 1999 Machine Learning

Problem: Too many centers


(Old solutions… )
l One per data point can be way too many…
l Choose a random subset of the points
– Hope you don’t get unlucky
l Distribute them based on the density of points
– Perhaps EM clustering…

© Paul Viola 1999 Machine Learning

14
Too Many Centers 2
l Put them where you need them…
» To best approximate your function

l Compute the derivative of E(y) w.r.t. the centers


» This gets very hairy and does not work well
– Too many local minima -- no small weight trick

© Paul Viola 1999 Machine Learning

Too Many Centers 3


l Support Vector Regression…

Cost( f ) = Error( f ) + Smoothness( f )

          = sum_j | f( x_j ) - y_j |_epsilon  +  integral ( d f(x)/dx )^2 dx

f( x ) = sum_j w_j K( x, x_j )        -- many of the w_j's are zero!!!

© Paul Viola 1999 Machine Learning

15
6.891 Machine Learning and Neural
Networks

Lecture 13:
Kernel Networks
… on to Unsupervised Learning

© Paul Viola 1999 Machine Learning

News
l Quizes are graded…
» Each problem has been graded.
» ** The overall score for the quiz is being determined.
– We ran out of time last night.
l Course grading: (approximate)
» Psets: 35%
» Quiz: 20%
» Final: 30%
» Project: 10%
» Participation: 5%

© Paul Viola 1999 Machine Learning

1
Pset 3

You are doing spectacularly well…

© Paul Viola 1999 Machine Learning

Exams

© Paul Viola 1999 Machine Learning

2
Grading alternatives…

© Paul Viola 1999 Machine Learning

Review & Overview


l Lecture 12:
» Trying to find smooth functions.
» Requiring smoothness simplifies functions:
– 1st deriv -> piecewise linear; 2nd deriv -> cubic

l Finish off Regression


l Begin unsupervised learning.
» PCA, etc…

© Paul Viola 1999 Machine Learning

3
Calculus of Variations
deltaC( x ) = dC( f ) / d f(x) = [ C( f + delta_x ) - C( f ) ] / | delta_x |

For C( f ) = integral ( df/dx )^2 :        deltaC( x ) = f''( x ) = 0        =>        f( x ) = ax + b

For C( f ) = sum_j ( f( x_j ) - y_j )^2 + lambda integral ( df/dx )^2 :

    deltaC( x ) = lambda f''( x ) + sum_j 2 ( f( x_j ) - y_j ) delta( x - x_j ) = 0

    f''( x ) = - (1/lambda) sum_j 2 ( f( x_j ) - y_j ) delta( x - x_j )        =>        Piecewise Linear

© Paul Viola 1999 Machine Learning

A Closer Look
[Figure: close-up of the piecewise-linear and piecewise-cubic fits.]

Linear + Kinks Cubics + Kinks

© Paul Viola 1999 Machine Learning

4
Still Piecewise Cubic

[Figure: the second-derivative solution remains piecewise cubic.]

© Paul Viola 1999 Machine Learning

But where are the kernel functions??


l Recall that this was supposed to be another way to
motivate kernel functions!!!

C( f ) = sum_j ( f( x_j ) - y_j )^2 + lambda integral ( df/dx )^2        ?        f( x ) = sum_j b_j K( x, x_j )

f''( x ) = sum_j a_j delta( x - x_j )                f''( x ) = sum_j b_j d^2 K( x, x_j ) / dx^2

d^2 K( x, x_j ) / dx^2 = delta( x - x_j )        =>        K( x, x_j ) = | x - x_j |

© Paul Viola 1999 Machine Learning

5
Cubics are similar...

C( f ) = sum_j ( f( x_j ) - y_j )^2 + lambda integral ( d^2 f / dx^2 )^2        ?        f( x ) = sum_j b_j K( x, x_j )

f''''( x ) = sum_j a_j delta( x - x_j )                f''''( x ) = sum_j b_j d^4 K( x, x_j ) / dx^4

d^4 K( x, x_j ) / dx^4 = delta( x - x_j )        =>        K( x, x_j ) = | x - x_j | ( x - x_j )^2

© Paul Viola 1999 Machine Learning

Can also get gaussian kernels…

l … if you want them!!

E ( y ) = Error ( y ) + Smoothness ( y )

(choose the smoothness penalty appropriately and the solution is a sum of Gaussian kernels, one centered at every training point)

© Paul Viola 1999 Machine Learning

6
Smoothness is easily controlled

[Figure: a family of fits as the smoothness penalty is varied.]

© Paul Viola 1999 Machine Learning

Problem: Too many centers


(Old solutions… )
l One per data point can be way too many…
l Choose a random subset of the points
– Hope you don’t get unlucky
l Distribute them based on the density of points
– Perhaps EM clustering…

© Paul Viola 1999 Machine Learning

7
Too Many Centers 2
l Put them where you need them…
» To best approximate your function

l Compute the derivative of C(f) w.r.t. the centers


» This gets very hairy and does not work well
– Too many local minima -- no small weight trick

© Paul Viola 1999 Machine Learning

Support Vector Regression


l Support Vector Regression…

f( x ) = w^T x + b            min  w^T w    subject to:
                                  ( w^T x_j + b ) - y_j <= epsilon
                                  y_j - ( w^T x_j + b ) <= epsilon

© Paul Viola 1999 Machine Learning

8
SVM Regression

Cost( f ) = c sum_j | w^T x_j + b - y_j |_epsilon  +  w^T w

© Paul Viola 1999 Machine Learning
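A minimal sketch (all names and numbers are illustrative) of the epsilon-insensitive loss |r|_epsilon used in the cost above: zero inside a tube of width epsilon around the target, linear outside it.

% Sketch: epsilon-insensitive loss and the SVM-regression cost for a linear f.
eps_ins = @(r, epsilon) max(abs(r) - epsilon, 0);

X = randn(20, 3);  y = X * [1; -2; 0.5] + 0.05*randn(20,1);   % toy data (assumed)
w = [1; -2; 0.5];  b = 0;  c = 10;
cost = c * sum(eps_ins(X*w + b - y, 0.1)) + w'*w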

Works with smoothness as well...


l Support Vector Regression…

Cost( f ) = Error( f ) + Smoothness( f )

          = sum_j | f( x_j ) - y_j |_epsilon  +  integral ( d f(x)/dx )^2 dx

f( x ) = sum_j w_j K( x, x_j )        -- many of the w_j's are zero!!!

© Paul Viola 1999 Machine Learning

9
New Topic: Unsupervised Learning
l What can you do to “understand” data when you
have no labels?
» Find unusual structure in the data.
» Find simplifications of the data.
l Find the clusters in the data:
» Fit a mixture of gaussians…
– Been there, done that.
l Reduce the dimensionality of the data:
» Find a linear projection from high to low dimensions
l These all amount to density estimation
l There are many other approaches
» Build a tree which captures the data, etc.

© Paul Viola 1999 Machine Learning

8 bits per pixel .36 bits per pixel

© Paul Viola 1999 Machine Learning

10
© Paul Viola 1999 Machine Learning

© Paul Viola 1999 Machine Learning

11
© Paul Viola 1999 Machine Learning

© Paul Viola 1999 Machine Learning

12
6.891 Machine Learning and Neural
Networks

Lecture 14:
… on to Unsupervised Learning

© Paul Viola 1999 Machine Learning

News
l I will try to give you a feeling for where we are
headed:
» Next 4 lectures
– Bayes Nets / Graphical Models / Boltzmann Machines /
HMM’s
» After that a series of topics (… from papers).

© Paul Viola 1999 Machine Learning

1
Review & Overview
l Lecture 13:
» The end of regression…

l Begin unsupervised learning.


» PCA, etc…

© Paul Viola 1999 Machine Learning

New Topic: Unsupervised Learning


l What can you do to “understand” data when you
have no labels?
» Find unusual structure in the data.
» Find simplifications of the data.
l Find the clusters in the data:
» Fit a mixture of gaussians…
– Been there, done that.
l Reduce the dimensionality of the data:
» Find a linear projection from high to low dimensions
l These all amount to density estimation
l There are many other approaches
» Build a tree which captures the data, etc.

© Paul Viola 1999 Machine Learning

2
Exploratory Data Analysis
l Machine learning is simply not that smart…
l It is still very important to look at the data.
l But when there are millions of examples and
thousands of dimensions you cannot look at the
data.

l It is very important to summarize the data…


» Summarize the statistics
» Reduce dimensionality

© Paul Viola 1999 Machine Learning

Example: Mixture of Gaussians


l Start out with many training points (1000's)
» Distributed in a “clumpy” fashion.
l Replace with a few summary clusters (10’s)
» One for each clump.
l Why?
» Better understand/summarize the data…
– Demographics Data: growing number of stay at home dad’s
l Salary < 1000; Children > 0; Family Salary > 20,000, etc.
– Astronomical Data: Unusual distribution of stars near galaxy
– Extract “symbols”
l one cluster per word (in speech), one cluster per letter (in writing)
» Speed learning…
– Learn with the clusters rather than all the data… approximate

© Paul Viola 1999 Machine Learning

3
Example of Clustering
l Can I get some examples of clustering???

l PDP??

l Andrew Moore

© Paul Viola 1999 Machine Learning

Speed Learning???
l Regression & Kernel Networks:    f( x ) = sum_j b_j K( x, x_j )

» Learning time is cubic in the number of examples.


– Need to solve the linear system to find weights.
» Performance time is linear in the number of examples
» 10,000 examples -> 100 clusters??

l Problem: these clusters are not optimal…


» Other locations may be much more effective for
approximating the target function.
» Clusters congregate near data… not at the margin!

© Paul Viola 1999 Machine Learning

4
Support Vector Machines
f( x ) = sum_j b_j K( x, x_j )

l Claims to choose the optimal set of examples


» 10,000 -> 100
» Performance is optimal

l But, training time is still cubic in the number of


examples.
l Though some new algorithms are beginning to work
more quickly.

© Paul Viola 1999 Machine Learning

Example: Dimensionality Reduction


l It is difficult to visualize 10, 20, 1000 dimensions
l Worse, there is reason to believe that it can be
very difficult to learn in high dimensions.
l Curse of dimensionality:
» As the dimensionality of the data grows, the average
distance between points also grows.
» As a result it is difficult to get an accurate local
estimate for what is going on.

© Paul Viola 1999 Machine Learning

5
Curse of Dimensionality of Nearest
Neighbor
l How far is it to your nearest neighbor??
» Easier Question: How far do you have to look before
expecting 1 neighbor.

» Assume points are distributed in a sphere of unit radius

Vol( S_k( r ) ) = c_k r^k

» The point in the center of the sphere should have the


largest number of neighbors. How far do we have to go
to find 1 neighbor on average??

© Paul Viola 1999 Machine Learning
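A minimal sketch (assumptions: N points distributed uniformly in the unit ball) of how far you must look from the center before expecting one neighbor. Since Vol(S_k(r)) = c_k r^k, the expected number of points within radius r is N r^k; setting N r^k = 1 gives r = N^(-1/k), which rushes toward 1 as the dimension k grows.

% Sketch: radius needed to expect ~1 neighbor, as a function of dimension.
N = 1000;                                      % number of data points (illustrative)
for k = [1 2 5 10 20 50 100]
  r = N^(-1/k);
  fprintf('k = %3d   radius for ~1 neighbor = %.3f\n', k, r);
end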

Reducing Dimensionality Linearly

y = Wx
l Where W has fewer rows than columns…

l What is the right W??


» One that preserves as much information as possible.
» Ah, but what is information??

min_W  E( | x - W^{-*} W x |^2 )            W = [ e_1^T ; e_2^T ; … ]        (rows: the leading eigenvectors)
© Paul Viola 1999 Machine Learning
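A minimal PCA sketch (not the lecture's code): choose W so that projecting down to k dimensions and back loses as little as possible; the rows of W are the leading eigenvectors of the data covariance. The data, d, and k below are illustrative assumptions.

% Sketch: PCA projection and reconstruction error.
X = randn(500, 10) * randn(10, 10);      % 500 examples, 10 correlated dimensions (toy data)
Xc = X - mean(X);                        % center the data
[V, D] = eig(cov(Xc));                   % eigenvectors/eigenvalues of the covariance
[~, order] = sort(diag(D), 'descend');
k = 2;
W = V(:, order(1:k))';                   % k x d projection: rows e_1^T, e_2^T, ...

Y = Xc * W';                             % low-dimensional representation y = W x
Xhat = Y * W;                            % reconstruction (rows of W are orthonormal)
reconstruction_error = mean(sum((Xc - Xhat).^2, 2))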
 

6
First Eigenvector preserves more info...

© Paul Viola 1999 Machine Learning

256,000 4,000
Numbers Numbers

© Paul Viola 1999 Machine Learning

7
© Paul Viola 1999 Machine Learning

© Paul Viola 1999 Machine Learning

8
Dimensionality Reduction

© Paul Viola 1999 Machine Learning

Cottrell and Metcalfe

© Paul Viola 1999 Machine Learning

9
Information Theory for Signal
Separation

Sound
Sources
Microphones

l The Cocktail Party Problem


l Many Speakers -- the signals are hopelessly mixed

© Paul Viola 1999 Machine Learning

Unmixed

© Paul Viola 1999 Machine Learning

10
Let’s look at data

Figures from Christian

Unmixed Mixed

PCA
Unmixed

© Paul Viola 1999 Machine Learning

Mathematical Assumptions

S = [ s1(t); s2(t); s3(t) ]        M = [ m1(t); m2(t); m3(t) ]        A = [ a11 a12 a13; a21 a22 a23; a31 a32 a33 ]

M = AS

l Assumptions:
» Sounds Travels instantaneously
» Sound Mixes Linearly
» Signals are independent

© Paul Viola 1999 Machine Learning

11
The Unmixing Problem

S_hat = A^{-1} A S
l We would like to undo the mixing...

© Paul Viola 1999 Machine Learning

Reducing Dimensionality Non-Linearly

y = g (Wx )
l Where W has fewer rows than columns…

l What is the right W??


» One that preserves as much information as possible.
» Ah, but what is information??

max_W  MI( g( Wx ), x )            W = the "independent components"
© Paul Viola 1999 Machine Learning

12
ICA
Unmixed

© Paul Viola 1999 Machine Learning

Learning Rule

Delta W = ( W^{-T} + (1 - 2y) x^T ) W^T W
        = W + (1 - 2y) x^T W^T W
        = W + (1 - 2y) u^T W            (where u = W x and y = g(u))

© Paul Viola 1999 Machine Learning

13
6.891 Machine Learning and Neural
Networks

Lecture 15:
Reasoning and Learning on Discrete Data
Bayes Nets

© Paul Viola 1999 Machine Learning

News
l Final Problem Set will be ready tomorrow
» Mostly Bayes Nets
l Please begin to think about your final project

© Paul Viola 1999 Machine Learning

1
Review & Overview
l Lecture 14:
» Principal Components Analysis
– A low dimensional projection can summarize data
» Independent Components Analysis
– An alternative to PCA which can pick out the independent
sources of data.

l Bayes Nets
» Meeting of the minds
– Artificial Intelligence and Machine Learning
» Represents symbolic knowledge and reasoning
» Principled mechanism for inference and learning
– Bayes Rule

© Paul Viola 1999 Machine Learning

Artificial Intelligence
l Build systems that reason about the world:
» Diagnosis
– “Why won’t my car start?”
» Goal directed behavior
– “How can I get from here to the White House?”
– Space Probe: “How do I change orbit, take photos of Mars,
and communicate with Earth in the next 5 minutes?”
– “How can I symbolically integrate this function?”
» Game Playing
– “How can I beat Kasparov?”
l Biases:
» Symbolic data and symbolic problems (not continuous)
» No representation of uncertainty or probability.

© Paul Viola 1999 Machine Learning

2
Techniques in Artificial Intelligence
l Write down a set of rules that govern the world
» If I get on a plane to Wash. DC then I will end up in DC.
» If I take a taxi to Logan then I will end up at Logan.

» If my car is out of fuel then it won’t start.


» If my starter motor is broken then it won’t start.

» The integral of sin(x) is - cos(x).


» The integral of 2x is x^2
l Use these to either:
» Reason forward from the initial conditions
» Reason backwards from your goal (or problem).

© Paul Viola 1999 Machine Learning

Difficulty with Artificial Intelligence


l No explicit representation for uncertainty
» Some unknown or unmodeled aspect of the world may
interfere with your rules
– Taking a plane to DC gets you to DC unless there is engine
trouble, or the airport is fog bound, or the pilot gets ill, or
someone on board has a heart attack, etc.
» Causality is probabilistic
– The probability of vapor lock is higher in the summer.
l Early attempts to model uncertainty:
» Confidence factors
» Certainty factors
– Not consistent with Probability…

© Paul Viola 1999 Machine Learning

3
Probabilistic Reasoning is Optimal
l What we really want is to reason with the laws of
probability:
» The probability that I will get to the White House is:
– The probability of the conjunction of events
l Get packed
l Get to Logan
l Catch plane
l Arrive in DC
l Get Taxi to White house

l Provided you can estimate and represent these


probabilities!

© Paul Viola 1999 Machine Learning

Simple Example: Faculty Meetings


l Faculty meetings can be argumentative sometimes.
» Hard to say why, but it happens about 23%
» When Marvin Minsky comes to the meeting
– (Minsky only comes to 10% of the meetings)
– Arguments happen 30%
» When Chomsky comes
– (Chomsky only comes to 5% of the meetings)
– Arguments happen 30%
» When Minsky and Chomsky come
– Arguments happen 90%
l Very difficult for AI systems:                The World is Fuzzy;
  » If Minsky then argument (false!!!)          Logic is NOT!
  » If Chomsky then argument (false!!!)

© Paul Viola 1999 Machine Learning

4
A Probabilistic Approach
Probability distribution over our 3 events:
P( A, M, C )        -- Arguments, Minsky, and Chomsky

Probability of a joint event:
P( A = a, M = ¬m, C = c )        -- an argument happens, Minsky does not come, Chomsky does come

P( A ) = sum_{x,y} P( A, M = x, C = y ) = sum_{M,C} P( A, M, C )

P( A | M = m ) = P( A, M = m ) / P( M = m ) = P( A, m ) / P( m ) = sum_C P( A, m, C ) / sum_{A,C} P( A, m, C )
© Paul Viola 1999 Machine Learning
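A minimal sketch of the two computations above, using the 2x2x2 joint table P(A,M,C) given on the next slide. The indexing convention (dimension order M, C, A; index 1 = false, index 2 = true) is my own assumption for illustration.

% Sketch: marginalization and conditioning directly on the joint table.
P = zeros(2,2,2);
P(1,1,1) = 0.684;   P(1,1,2) = 0.171;     % M=0, C=0
P(1,2,1) = 0.0315;  P(1,2,2) = 0.036;     % M=0, C=1
P(2,1,1) = 0.0665;  P(2,1,2) = 0.0285;    % M=1, C=0
P(2,2,1) = 0.00005; P(2,2,2) = 0.00045;   % M=1, C=1

PA          = sum(sum(P(:,:,2)));         % P(A): sum over M and C
Pm          = sum(sum(P(2,:,:)));         % P(M = m)
PA_given_m  = sum(P(2,:,2)) / Pm;         % P(A | M = m) = sum_C P(A,m,C) / sum_{A,C} P(A,m,C)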

The Probabilistic Approach


l Everything is driven off P(A,M,C)
l Given a set of example meetings
Observed data (A M C):        Histogram -> Probability of events (M C A : Probability):
1 0 0                             0 0 0    0.684
0 0 0                             0 0 1    0.171
1 1 1                             0 1 0    0.0315
0 1 0                             0 1 1    0.036
0 0 1                             1 0 0    0.0665
...                               1 0 1    0.0285
                                  1 1 0    0.00005
                                  1 1 1    0.00045

Observe Data -> Probability of Events

Everything should work out perfectly, right?

© Paul Viola 1999 Machine Learning

5
Problems with Naïve Probability
M C A : Probability
0 0 0   0.684
0 0 1   0.171
0 1 0   0.0315
0 1 1   0.036
1 0 0   0.0665
1 0 1   0.0285
1 1 0   0.00005
1 1 1   0.00045

P(??) = 0.23595        P( m ) = 0.1        P( c ) = 0.068        P( a ) = 0.23

l Way too many variables:
  » 2^N variables (minus 1)
  » Occam wouldn't like this
l Lots of computation:
  » P(M) requires O(2^(N-1))
  » P(A|M) requires O(2^(N-1))
l Optimality comes with serious penalties.
l Furthermore the table is very hard to interpret…

© Paul Viola 1999 Machine Learning

Bayes Nets to the Rescue


l Assumptions can help a lot:
» We could assume that each property is independent
– Big mistake for most reasoning problems
» We could assume that certain variables are independent
– While others are dependent…

P( A, M, C ) ≡ P( M ) P( C | M ) P( A | M, C )        -- table sizes: 1 + 2 + 4 = 7
P( A, M, C ) ≈ P( M ) P( C )     P( A | M, C )        -- table sizes: 1 + 1 + 4 = 6
© Paul Viola 1999 Machine Learning

6
Removing Links

P ( A, M , C ) ≡ P( M ) P(C | M ) P ( A | M , C )
M C

P ( A, M , C ) ≈ P ( M ) P(C ) P ( A | M , C )
M C

© Paul Viola 1999 Machine Learning

An Efficient Representation

0.1
• Draw a directed acyclic graph M C 0.05

• Links imply causation


• Represent probabilities
• Prior for every node with no parent A
• Conditional probabilities for others
P(A|M,C) M NOT M
C 0.9 0.3
NOT C 0.3 0.2

© Paul Viola 1999 Machine Learning

7
Much more efficient representations

M C

A B

E F

2^6 = 64, minus 1 = 63 parameters
vs. (2 * 1) + (2 * 2) + (2 * 4) = 14

© Paul Viola 1999 Machine Learning

Additional Example 1
l You are waiting for an appointment
AI with Holmes (H) and Watson (W).
l Both are very poor drivers and are
likely to avoid driving if the
W H roads are icy (I).
l It is winter, so the probability of I
is high (0.5).
l H and W are dependent…
» Unless you know the road conditions

© Paul Viola 1999 Machine Learning

8
Additional Example 2
Wet Grass
l The lawn of home B is either wet or not (Wb).
  » This could have been caused by rain (R) or sprinkler (S).
l The lawn of home A is either wet or not (Wa).
  » Home A has no sprinklers, so rain is the only cause.

(graph: R, S -> Wb;   R -> Wa)

© Paul Viola 1999 Machine Learning

Additional Example 3
Earthquake or Burglar
l Home Alarm (A)
l Neighbor reports the Alarm (N)
A
B A
E l Burglary (B) 0.001
l Earthquake (E) 0.0000001
l Radio Report of Earthquake (R)
A R
l You receive a call from your
neighbor saying that your Alarm is
N going off.
l You drive home to confront the
burglar… on the drive you hear a
radio report of an Earthquake.

© Paul Viola 1999 Machine Learning

9
Reasoning
l In some cases computation time is not changed:
» We have re-written the joint distribution
» Some reasoning still requires large summations...

P( A ) = sum_{x,y} P( A, M = x, C = y ) = sum_{M,C} P( A, M, C )

P( A | M = m ) = P( A, M = m ) / P( M = m ) = P( A, m ) / P( m ) = sum_C P( A, m, C ) / sum_{A,C} P( A, m, C )

© Paul Viola 1999 Machine Learning

Sometimes reasoning is efficient

M C

P( e | a ) = sum P( a, B, C, D, e, F )  /  sum P( A, B, C, D, e, F )

P( A, B, C, D, E, F ) = P( M ) P( C ) P( A | M, C ) P( B | C ) P( E | A ) P( F | A, B )

© Paul Viola 1999 Machine Learning

10
Junction Tree Algorithm 1
l Table arithmetic:
X Y Z

P( X, Y, Z ) = P( X ) P( Y | X ) P( Z | Y )

for all a, b, c:
P( X = a, Y = b, Z = c ) = P( X = a ) P( Y = b | X = a ) P( Z = c | Y = b )

   P(X)              P(Y|X)    Y     NOT Y        P(Z|Y)    Z     NOT Z
   X      0.4   *    X         0.9   0.1     *    Y         0.2   0.8
   NOT X  0.6        NOT X     0.3   0.7          NOT Y     0     1

T_XYZ = T_X x T_YX x T_ZY


© Paul Viola 1999 Machine Learning
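A minimal sketch of the table arithmetic above: multiplying the small conditional tables produces the full joint table T_XYZ. The values come from the slide; the indexing convention (1 = true, 2 = false) is my own.

% Sketch: build T_XYZ from T_X, T_{Y|X}, T_{Z|Y} by elementwise multiplication.
TX  = [0.4 0.6];                 % P(X)
TYX = [0.9 0.1; 0.3 0.7];        % P(Y|X): rows X, columns Y
TZY = [0.2 0.8; 0.0 1.0];        % P(Z|Y): rows Y, columns Z

TXYZ = zeros(2,2,2);
for x = 1:2
  for y = 1:2
    for z = 1:2
      TXYZ(x,y,z) = TX(x) * TYX(x,y) * TZY(y,z);
    end
  end
end
% e.g. P(X=true, Y=true, Z=false), and a sanity check that the table sums to 1:
TXYZ(1,1,2), sum(TXYZ(:))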

Junction Tree Algorithm: Graph Hacking


M C M C M C

A B A B A B

E F E F E F

CM C AC CM C AC
A A A A
AE A ABF AE ABF

© Paul Viola 1999 Machine Learning

11
More Junction Trees

© Paul Viola 1999 Machine Learning

From Junction Trees to Probability


M C
CM C AC

A B A1 A2

AE ABF
E F

P( A, B, C, D, E, F ) = T_ABCDEF = ( T_CM x T_AC x T_AE x T_ABF ) / ( S_C x S_A1 x S_A2 )

    = P( M ) P( C | M )        … T_CM
    x P( A | C )               … T_AC
    x P( E | A )               … T_AE
    x P( B ) P( F | B )        … T_ABF

© Paul Viola 1999 Machine Learning

12
6.891 Machine Learning and Neural
Networks

Lecture 16:
More Bayes Nets

© Paul Viola 1999 Machine Learning

News
l Half of pset 5 is done
» and on the web.
» Other half will be done over the weekend.

© Paul Viola 1999 Machine Learning

1
Review & Overview
l Lecture 15:
» Bayes Nets
– Meeting of the minds
l Artificial Intelligence and Machine Learning
– Represents symbolic knowledge and reasoning
– Principled mechanism for inference and learning
l Bayes Rule

l Reasoning with Bayes Nets


l Efficient algorithms

© Paul Viola 1999 Machine Learning

Bayes Net Review


l Knowledge of the complete probability table
provides the opportunity for powerful deduction.
» Relate symptoms to diseases
– Even when you have not observed every symptom
» Reason with partial and conflicting knowledge
l But the joint probability is difficult to model
» 2^N -- N binary variables
» Reasoning is equally hard
l Given 2^N numbers you can model any dependency
» But sometimes variables are not really dependent.

© Paul Viola 1999 Machine Learning

2
Bayesian Text Classification
{ d_k } : A collection of documents

W_i( d_k ) = 1 if d_k contains word i, 0 otherwise

P( F_1 = f_1, F_2 = f_2, … | C = c_j )                        -- 2^N probs
    = P( { f_1 … f_N } | C = c_j )
    ≡ prod_i P( F_i = f_i | C = c_j )                         -- Assume Independence

P( F_i = 1 | C = c_j ) = p_ij        (probability of word i appearing in a doc from class j)

© Paul Viola 1999 Machine Learning

Bayes Nets Show Dependencies

P( { f_i } | C_j ) = prod_i P( F_i = f_i | C_j )

C : one of N classes
                                        P( F_1 | C )
F1   F2   F3   F4   ...   FN

P( c_j | { f_i } ) = P( c_j ) prod_i P( f_i | c_j )  /  prod_i P( f_i )

© Paul Viola 1999 Machine Learning
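A minimal naive-Bayes sketch of the classification rule above. The class priors, word probabilities p_ij, and the example document are invented for illustration.

% Sketch: pick the most probable class under the independence assumption.
priors = [0.7 0.3];                  % P(C = c_j) for two classes (assumed)
p      = [0.20 0.01 0.40;            % p_ij = P(F_i = 1 | C = c_j): rows = classes,
          0.02 0.30 0.35];           %        columns = words (assumed values)
f = [1 0 1];                         % document: words 1 and 3 present, word 2 absent

% log P(c_j | {f_i}) up to the constant log prod_i P(f_i)
logpost = log(priors)';
for j = 1:2
  for i = 1:3
    if f(i), logpost(j) = logpost(j) + log(p(j,i));
    else     logpost(j) = logpost(j) + log(1 - p(j,i));
    end
  end
end
[~, chat] = max(logpost)             % most probable class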

3
More Complex Models are “Easy”
What if documents could be BB
“about” two different topics at once:
- like Politics and Sports

P S
P( F1 | P, S )

F1 F2 F3 F4 F5... ... FN

N causes -> 2^N variables

© Paul Viola 1999 Machine Learning

An Efficient Representation

0.1
• Draw a directed acyclic graph M C 0.05

• Links imply causation


• Represent probabilities
• Prior for every node with no parent A
• Conditional probabilities for others
P(A|M,C) M NOT M
C 0.9 0.3
NOT C 0.3 0.2

© Paul Viola 1999 Machine Learning

4
Much more efficient representations

M C

A B

E F

2^6 = 64, minus 1 = 63 parameters
vs. (2 * 1) + (2 * 2) + (2 * 4) = 14

© Paul Viola 1999 Machine Learning

Additional Example 1
l You are waiting for an appointment
AI with Holmes (H) and Watson (W).
l Both are very poor drivers and are
likely to avoid driving if the
W H roads are icy (I).
l It is winter, so the probability of I
is high (0.5).
l H and W are dependent…
» Unless you know the road conditions

© Paul Viola 1999 Machine Learning

5
Additional Example 2
Wet Grass
l The lawn of home A is either wet or
not (Wb).
A
R A
S » This could have been caused by rain
(R) or sprinkler (S)
l The lawn of home B is either wet or
Wa Wb not (Wa).
» Home A has no sprinklers so rain is
the only cause.

© Paul Viola 1999 Machine Learning

Additional Example 3
Earthquake or Burglar
l Home Alarm (A)
l Neighbor reports the Alarm (N)
A
B A
E l Burglary (B) 0.001
l Earthquake (E) 0.0000001
l Radio Report of Earthquake (R)
A R
l You receive a call from your
neighbor saying that your Alarm is
N going off.
l You drive home to confront the
burglar… on the drive you hear a
radio report of an Earthquake.

© Paul Viola 1999 Machine Learning

6
Reasoning
l In some cases computation time is not changed:
» We have re-written the joint distribution
» Some reasoning still requires large summations...

P( A ) = sum_{x,y} P( A, M = x, C = y ) = sum_{M,C} P( A, M, C )

P( A | M = m ) = P( A, M = m ) / P( M = m ) = P( A, m ) / P( m ) = sum_C P( A, m, C ) / sum_{A,C} P( A, m, C )

© Paul Viola 1999 Machine Learning

Sometimes reasoning is efficient

D C

P( e | a ) = sum P( a, B, C, D, e, F )  /  sum P( A, B, C, D, e, F )

P( A, B, C, D, E, F ) = P( D ) P( C ) P( A | D, C ) P( B | C ) P( E | A ) P( F | A, B )

By the way: this is called adding the evidence (or observation)


that A=a.
© Paul Viola 1999 Machine Learning

7
Sometimes reasoning is more efficient

D C
P( e | c ) = sum P( A, B, c, D, e, F )  /  sum P( A, B, c, D, E, F )

Add the evidence that C = c.  Observe the marginal of E.

sum_{a,b,d,e,f} P( A, B, c, D, E, F )

    = sum_{a,b,d,e,f} P( D ) P( c ) P( A | D, c ) P( B | c ) P( E | A ) P( F | A, B )

    = P( c ) sum_{a,e} P( E | A ) sum_d P( D ) P( A | D, c ) sum_b P( B | c ) sum_f P( F | A, B )

© Paul Viola 1999 Machine Learning

Saving Work
sum_{a,b,d,e,f} P( A, B, c, D, E, F )

  = P( c ) sum_{a,e} P( E | A ) sum_d P( D ) P( A | D, c ) sum_b P( B | c ) sum_f P( F | A, B )

  = P( c ) sum_{a,e} P( E | A ) sum_d P( D ) P( A | D, c ) sum_b T^2_B T^1_AB        ( T^1_AB = sum_f P( F | A, B ),  T^2_B = P( B | c ) )

  = P( c ) sum_{a,e} P( E | A ) sum_d T^5_D T^4_AD T^3_A                             ( T^3_A = sum_b T^2_B T^1_AB,  T^4_AD = P( A | D, c ),  T^5_D = P( D ) )

  = P( c ) sum_{a,e} P( E | A ) T^3_A sum_d T^5_D T^4_AD

  = P( c ) sum_{a,e} T^7_AE T^3_A T^6_A                                              ( T^6_A = sum_d T^5_D T^4_AD,  T^7_AE = P( E | A ) )

  = P( c ) sum_e T^8_E                                                               ( T^8_E = sum_a T^7_AE T^3_A T^6_A )

and, keeping e fixed instead of summing it out,

sum_{a,b,d,f} P( A, B, c, D, e, F ) = P( c ) T^8_E( e )

© Paul Viola 1999 Machine Learning

8
Hidden Markov Model

States:         A -> B -> C -> D
Observations:   F    G    H    I
                                        Time ->

P( A, B , C , D , F , G, H , I )
= P( A ) P ( F | A ) P ( B | A) P (G | B ) P( C | B ) P ( H | C ) P ( D | C ) P ( I | D )

© Paul Viola 1999 Machine Learning

Junction Tree Algorithm 1


l Table arithmetic:
X Y Z

P ( X , Y , Z ) = P ( X ) P(Y | X ) P( Z | Y )
∀a, b, c
P( X = a , Y = b, Z = c ) = P ( X = a ) P(Y = b | X = a ) P( Z = c | Y = b )
Z
P(X) P(Y|X) Y NOT Y P(Z|Y) Z NOT Z
X = X 0.4 * X 0.9 0.1 * Y 0.2 0.8
NOT X 0.6 NOT X 0.3 0.7 NOT Y 0 1

TXYZ = TX × TYX × TZY


© Paul Viola 1999 Machine Learning

9
Junction Tree Algorithm: Graph Hacking
M C M C M C

A B A B A B

E F E F E F

CM C AC CM C AC
A A A A
AE A ABF AE ABF

© Paul Viola 1999 Machine Learning

More Junction Trees

© Paul Viola 1999 Machine Learning

10
From Junction Trees to Probability
M C
CM C AC

A B A1 A2

AE ABF
E F

TCM × TAC × TAE × TABF


P( A, B, C , D, E, F ) = TABCDEF =
SC × S A1 × S A2
= P( M ) P( C | M ) K TCM
× P( A | C ) K TAC
× P( E | A) K TAE
© Paul Viola 1999 Machine Learning× P( B) P ( F | B ) K TABF

11
6.891 Machine Learning and Neural
Networks

Lecture 17:
Hidden Markov Models
& Other Bayes Nets

© Paul Viola 1999 Machine Learning

News
l Problem Set 5 complete on Monday

l Remember to keep thinking about your final


projects!
» Please send us some email which describes your project
» 1-2 paragraphs
» We will render our “expert” opinion
– Not too hard!!

© Paul Viola 1999 Machine Learning

1
Review & Overview
l Lecture 16:
» Bayes Nets
– An Efficient way to represent joint probability distributions
– Allow reasoning about subtle and conflicting evidence
– Allow reasoning with partial information
» Structure implies Reasoning Efficiency
– Dependence structure allows for more efficient reasoning
– Dynamic programming

l Markov Processes
l Hidden Markov Models
» Speech

© Paul Viola 1999 Machine Learning

A brief overview of speech recognition

© Paul Viola 1999 Machine Learning

2
A brief overview of speech recognition

The signal is very high dimensional…


10000 samples / second

© Paul Viola 1999 Machine Learning

The production Process

© Paul Viola 1999 Machine Learning

3
Differing representations

© Paul Viola 1999 Machine Learning

Spectrogram provides better information


l The vocal tract produces sound by combining the
output of multiple oscillators…
» Each vowel has several formants -- pure tones
l The spectrum of speech helps to distinguish
vowels…
l The consonants are very different…
» Broad band and very brief
» Consonants are much harder for speech recognition
systems

© Paul Viola 1999 Machine Learning

4
© Paul Viola 1999 Machine Learning

The phonemes

© Paul Viola 1999 Machine Learning

5
The digits

© Paul Viola 1999 Machine Learning

Speech using Pattern Recognition

Speech (10,000 samples per second)  ->  Spectrogram (3000 coefficients per second)

Training data:   { x_j, y_j },    x_j in R^1000,    y_j in { cat, car, ball, hat, … }

Note: This does not work… for many reasons!!!

© Paul Viola 1999 Machine Learning

6
Speech Difficulties
l Rate of speech
» Words are spoken at different rates -- factor of 2 or 3.
l Continuous speech
» Where are the boundaries between words??

Is this your cat?                 When is your train?
0.2 - 0.3 - 0.6 - 0.2             0.2 - 0.2 - 0.6 - 0.3

• Can't build sentence recognizers

Is this your cat?
0.1 - 0.2 - 0.4 - 0.2

• Other difficulties: Pitch variation, Accent, Prosody

© Paul Viola 1999 Machine Learning

Decompose the construction of words


l Words are constructed from letters
» In written english
» There are 26 letters
l Words are constructed from phonemes
» In spoken english
» There are XX phonemes in English

Cat -> 'c' - 'ah' - 't'
       0.03 - 0.15 - 0.02

fat -> 'f' - 'ah' - 't'
       0.1 - 0.1 - 0.02

© Paul Viola 1999 Machine Learning

7
Implications of Decomposition
l The parts of words can be reused
» Words are built from XX phoneme models
» Perhaps we can train the phoneme recognizers
separately??
» (Sometimes… co-articulation can make this harder)

l But, even the parts vary in length

l It can be very hard to find the beginnings and


endings of phonemes

l Time is our enemy!!!


© Paul Viola 1999 Machine Learning

Probabilistic Models of Time


l We’ve always built probabilistic models in class...
» Salmon vs. Bass
» 2’s vs. 3’s
» etc.
self-loop probs:     0.5     0.2     0.9     0.15    1.0
states:              N  ->   c  ->   ah  ->  t  ->   N
transition probs:       0.5     0.8     0.1     0.85
emissions:           P(F)    P(F)    P(F)    P(F)    P(F)

Non-deterministic Finite State Automata


© Paul Viola 1999 Machine Learning

8
Phoneme Sequences
NFA Model

State sequence S_t:     NNCCAAAAAAAAATTTNN…

Spectrogram F_t

P( F, S | Model ) = P( F | S ) P( S | Model )

The NFA model for 'cat' assigns a probability to each spectrogram.

Non-deterministic FSM -> Bayes Net


self-loop probs:     0.5     0.2     0.9     0.15    1.0
states:              N  ->   c  ->   ah  ->  t  ->   N
transition probs:       0.5     0.8     0.1     0.85
emissions:           P(F)    P(F)    P(F)    P(F)    P(F)

S1 S2 S3
A S4 S5 … Sn

F1 F2 F3 F4 F5 Fn

© Paul Viola 1999 Machine Learning

9
The Details

S1 S2 S3
A S4 S5 … Sn

F1 F2 F3 F4 F5 Fn

S_i in { N_B, C, A, T, N_A }

P( F_i | S_i = k ) = G( F_i, mu_k, Sigma_k )

P( S_1 ) = { 0.5, 0.5, 0.0, 0.0, 0.0 }

P( S_{i+1} | S_i ):          N_B    C      A      T      N_A
                     N_B     0.5    0.5
                     C              0.2    0.8
                     A                     0.9    0.1
                     T                            0.15   0.85
                     N_A                                 1.0

© Paul Viola 1999 Machine Learning

Hidden Markov Model

Hidden state:    A -> B -> C -> D
Observations:    F    G    H    I
                                        Time ->
P( A, B , C , D , F , G, H , I )
= P( A ) P ( F | A ) P ( B | A) P (G | B ) P( C | B ) P ( H | C ) P ( D | C ) P ( I | D )

© Paul Viola 1999 Machine Learning

10
Using Dynamic Programming…
S1 S2 S3
A S4 S5 … Sn

F1 F2 F3 F4 F5 Fn

P( F | Model )
   = sum_S P( F | S ) P( S | Model )
   = sum_{S_hat = {s1, s2, s3, s4, …}} P( F | S = S_hat ) P( S = S_hat | Model )
   = sum_{S_hat} [ prod_j P( F_j | S_j = s_j ) ] P( S = S_hat | Model )
   = sum_{s1} P( F_1 = f_1 | S_1 = s1 ) P( S_1 = s1 ) sum_{s2} P( F_2 = f_2 | S_2 = s2 ) P( S_2 = s2 | S_1 = s1 ) sum_{s3} …

© Paul Viola 1999 Machine Learning

Some standard notation...


P( F | Model )
   = sum_{s1} P( F_1 = f_1 | S_1 = s1 ) P( S_1 = s1 ) sum_{s2} P( F_2 = f_2 | S_2 = s2 ) P( S_2 = s2 | S_1 = s1 ) sum_{s3} …

   = sum_{s1} T_S1 sum_{s2} T_S1S2 sum_{s3} T_S2S3 sum_{s4} T_S3S4 sum_{s5} T_S4S5
   = sum_{s1} T_S1 sum_{s2} T_S1S2 sum_{s3} T_S2S3 sum_{s4} T_S3S4 beta^4_S4
   = sum_{s1} T_S1 sum_{s2} T_S1S2 sum_{s3} T_S2S3 beta^3_S3
   = sum_{s1} T_S1 sum_{s2} T_S1S2 beta^2_S2
   = sum_{s1} T_S1 beta^1_S1

© Paul Viola 1999 Machine Learning

11
Stringing Words Together

Cat Eats
A Food

F F F

© Paul Viola 1999 Machine Learning

Limitations of HMM for Speech


l Observed spectrograms are independent given
phonemes
» Does not model pronunciation or accent

l Spend most of their effort on vowels…


» Fat vs. Far.

© Paul Viola 1999 Machine Learning

12
Markov Processes
l Markov Processes are in fact very general…
» Loosely, they are processes in which there is a great deal
of conditional independence.
– Like most Bayes Nets:    P( A | B, C, D, F, G, H, I ) = P( A | pi( A ) )

Note: up 'til now we have seen only directed models… the notion
of Markov for undirected models is a bit more complex...
© Paul Viola 1999 Machine Learning

Junction Tree Algorithm 1


l Table arithmetic:
X Y Z

P ( X , Y , Z ) = P ( X ) P(Y | X ) P( Z | Y )
∀a, b, c
P( X = a , Y = b, Z = c ) = P ( X = a ) P(Y = b | X = a ) P( Z = c | Y = b )
Z
P(X) P(Y|X) Y NOT Y P(Z|Y) Z NOT Z
X = X 0.4 * X 0.9 0.1 * Y 0.2 0.8
NOT X 0.6 NOT X 0.3 0.7 NOT Y 0 1

TXYZ = TX × TYX × TZY


© Paul Viola 1999 Machine Learning

13
Junction Tree Algorithm: Graph Hacking
M C M C M C

A B A B A B

E F E F E F

CM C AC CM C AC
A A A A
AE A ABF AE ABF

© Paul Viola 1999 Machine Learning

More Junction Trees

© Paul Viola 1999 Machine Learning

14
Junction Trees and Tables

CM C AC

A1 A2

AE ABF

T_ABCDEF = ( T_CM x T_AC x T_AE x T_ABF ) / ( S_C x S_A1 x S_A2 )    =?    P( A, B, C, D, E, F )

© Paul Viola 1999 Machine Learning

Rules for Junction Tree Initialization

© Paul Viola 1999 Machine Learning

15
From Junction Trees to Probability
M C
CM C AC
A B
A1 A2
E F
AE ABF
P( A, B, C , D , E , F )
TCM × TAC × TAE × TABF
= P( M ) P (C | M ) P( A | C ) TABCDEF =
S C × S A1 × S A2
× P( E | A) P ( B) P( F | B)
= P( M ) P(C | M ) K TCM
× P ( A | C) K TAC
× P ( E | A) K TAE
× P ( B) P( F | B) K TABF
© Paul Viola 1999 Machine Learning

16
6.891 Machine Learning and Neural
Networks

Lecture 18:
Finish Hidden Markov Models
& Finish Bayes Nets

© Paul Viola 1999 Machine Learning

News
l Remember to keep thinking about your final
projects!

l The reading for this class is now mostly the


supplemental material on the related-info page.
» You need to at least scan the materials there.

© Paul Viola 1999 Machine Learning

1
Review & Overview
l Lecture 17:
» Hidden Markov Models for Speech
– Speech is complex…
l Many words / Length of words varies
– Speech is best represented as a spectrogram
– Variable timing of speech can be modeled as a NFA.
– An HMM is a Bayes Net which is equivalent to an NFA
l We can build an HMM for each word out of phoneme models
– Can sum over the unknown states to recognize words

l More HMM examples


l Finding the most likely state sequence

© Paul Viola 1999 Machine Learning

Speech in a Nutshell

six

s i x
© Paul Viola 1999 Machine Learning

2
Closer Examination

100 Numbers 4 of silence


5 of silence 3 of ‘i’

5 of ‘s’ 6 of ‘x’

© Paul Viola 1999 Machine Learning

Build a Probabilistic Model


which generates the data…
self-loop probs:     0.8     0.7     0.8     0.8     1.0
states:              N  ->   s  ->   i  ->   x  ->   N
transition probs:       0.2     0.3     0.2     0.2
emissions:           P(F)    P(F)    P(F)    P(F)    P(F)

P( F , S | Model )
= P( F | S ) P ( S | Model )

© Paul Viola 1999 Machine Learning

3
Use Bayes Law

P( F = f | 'six' ) = sum_s P( F = f, S = s | 'six' )        vs.        P( F = f | 'five' ) = sum_s P( F = f, S = s | 'five' )

© Paul Viola 1999 Machine Learning

Non-deterministic FSM -> Bayes Net


self-loop probs:     0.5     0.2     0.9     0.15    1.0
states:              N  ->   c  ->   ah  ->  t  ->   N
transition probs:       0.5     0.8     0.1     0.85
emissions:           P(F)    P(F)    P(F)    P(F)    P(F)

S1 S2 S3
A S4 S5 … Sn

F1 F2 F3 F4 F5 Fn

© Paul Viola 1999 Machine Learning

4
A concrete example

S1 S2 S3
A S4 S5 … Sn

F1 F2 F3 F4 F5 Fn

S_i in { 1, 2 }                P( F_i = f | S_i = 1 ) = G( f, 1.0, 0.1 )
P( S_1 ) = { 0.5, 0.5 }        P( F_i = f | S_i = 2 ) = G( f, 2.0, 0.1 )

P( S_{i+1} | S_i ) =   [ 0.9  0.1 ]        (rows: current state 1, 2; columns: next state 1, 2)
                       [ 0.1  0.9 ]
                                           P( F, S | Model )

© Paul Viola 1999 Machine Learning
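A minimal, self-contained sketch of the forward recursion (the summation over hidden states) for this two-state Gaussian HMM. The model numbers are the slide's; the short observation sequence is made up for illustration, and this is not the lecture's own code.

% Sketch: forward algorithm giving P(F | Model) for the two-state Gaussian HMM above.
pi0   = [0.5 0.5];                     % P(S_1)
trans = [0.9 0.1; 0.1 0.9];            % P(S_{i+1} | S_i)
mu    = [1.0 2.0];  sigma = 0.1;       % P(F_i | S_i = k) = G(f, mu_k, sigma)

f = [1.1 0.9 1.0 2.1 1.9 2.0];         % a short observation sequence (made up)

% alpha(k) = P(f_1..f_i, S_i = k); summing the last alpha gives P(F | Model).
obs = @(x) exp(-(x - mu).^2 / (2*sigma^2)) / (sqrt(2*pi)*sigma);
alpha = pi0 .* obs(f(1));
for i = 2:length(f)
  alpha = (alpha * trans) .* obs(f(i));
end
logP = log(sum(alpha))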

Some Samples
[Figure: three sample observation sequences of length 100 drawn from the model; the output hops between levels near 1.0 and 2.0.]

© Paul Viola 1999 Machine Learning

5
Code is very simple...

function [states, obs] = hmm_draw(n, initial, transition, obs_models)

% function [states, obs] = hmm_draw(n, initial, transition, obs_models)


%
% Draw a sample of the HMM running for N steps

% Setup the initial space


states = zeros(n, 1);
obs = zeros(n, length(hmm_observe(2, obs_models)));

states(1) = hmm_draw_state(initial);

for i = 2:n
% transition(states(i-1), :)
states(i) = hmm_draw_state(transition(states(i-1), :));
end

obs = hmm_observe(states, obs_models);

© Paul Viola 1999 Machine Learning

Add more noise...

[Figure: two sample sequences with larger observation noise.]

P( F_i = f | S_i = 1 ) = G( f, 1.0, 0.4 )
P( F_i = f | S_i = 2 ) = G( f, 2.0, 0.4 )

© Paul Viola 1999 Machine Learning

6
But we have a detailed model

S1 -> S2 -> S3 -> S4 -> S5 -> … -> Sn        (hidden states)
F1    F2    F3    F4    F5         Fn        (observations)

argmax_s P( F = f, S = s | Model )

[Figure: the noisy observations with the most likely state sequence overlaid.]

© Paul Viola 1999 Machine Learning

Using Dynamic Programming…


S1 -> S2 -> S3 -> S4 -> S5 -> … -> Sn
F1    F2    F3    F4    F5         Fn

argmax_s P( F = f, S = s | Model )
    = argmax_s P( s_1 ) P( f_1 | s_1 ) P( s_2 | s_1 ) P( f_2 | s_2 ) … P( s_n | s_{n-1} ) P( f_n | s_n )

© Paul Viola 1999 Machine Learning

7
This code is also simple…

% Propagate the maximum state forward in time from the beginning
maxes = state_like;
maxes(1,:) = maxes(1,:) / sum(maxes(1,:));

for i = 2:ntimes
  for j = 1:nstates
    % For each new time, check each of the past states to determine
    % the best state given the transition costs.
    for k = 1:nstates
      vals(j,k) = maxes(i-1,k) * trans(k,j) * state_like(i,j);
    end
    maxes(i,j) = max(vals(j,:));
  end
  maxes(i,:) = maxes(i,:) / sum(maxes(i,:));   % renormalize to avoid underflow
end

% Backtrace: pick the best final state and work backwards in time.
vals = zeros(nstates, 1);
shat = zeros(size(maxes));

[v ind] = max(maxes(ntimes, :));
shat(ntimes, ind) = 1;

for i = ntimes-1:-1:1
  for j = 1:nstates
    vals(j) = trans(ind, j) * maxes(i, j);
  end
  [v ind] = max(vals);
  shat(i, ind) = 1;
end
© Paul Viola 1999 Machine Learning

What about distinguishing two models??

Model 1:    S_i in {1,2},   P( S_1 ) = {0.5, 0.5},   P( S_{i+1} | S_i ) = [ 0.9 0.1; 0.1 0.9 ]
Model 2:    S_i in {1,2},   P( S_1 ) = {0.5, 0.5},   P( S_{i+1} | S_i ) = [ 0.8 0.2; 0.2 0.8 ]

[Figure: a sample sequence drawn from each model.]

© Paul Viola 1999 Machine Learning

8
Code for model likelihood...
function like = hmm_model_likelihood(f, initial, trans, obs_models)

% function like = hmm_model_likelihood(f, initial, trans, obs_models)

% First compute the likelihood of every state given every observation
state_like = hmm_obs_likelihood(f, obs_models);

% initialize some variables
ntimes  = size(state_like, 1);
nstates = size(state_like, 2);
mfactor = 100;

beta = mfactor .* state_like(ntimes,:);

for i = ntimes-1:-1:1
  beta = mfactor .* (state_like(i,:) .* (trans * beta')');
end

like = log10(sum(beta)) - (log10(mfactor) * ntimes);
ntimes);
© Paul Viola 1999 Machine Learning

Limitations of HMM for Speech


l Observed spectrograms are independent given
phonemes
» Does not model pronunciation or accent
» Does not model inter word dependencies

l Spend most of their effort on vowels…


» mount vs. won’t

© Paul Viola 1999 Machine Learning

9
Markov Processes
l Markov Processes are in fact very general…
» Loosely, they are processes in which there is a great deal
of conditional independence.
– Like most Bayes Nets. P( A | B, C , D , F , G , H , I )
= P ( A | π ( A))
A A
Note:
Note:up
up‘til
‘tilnow
nowwewe
have seen only directed
have seen only directed
A A models…
models… thethenotion
notion
of
ofMarkov
Markovforfor
undirected
undirectedmodels
modelsisisaa
A A bit
bitmore
morecomplex...
complex...
© Paul Viola 1999 Machine Learning

Segue
l We have seen several applications of Bayesian
Networks…
» Expert Systems
» Diagnosis
» Speech Recognition

l Are there other algorithms for reasoning on Bayes


Nets…
» Junction Trees
» Propagation algorithms
» Makes it easy to measure marginals…

© Paul Viola 1999 Machine Learning

10
Junction Tree Algorithm 1
l Table arithmetic:
X Y Z

P ( X , Y , Z ) = P ( X ) P(Y | X ) P( Z | Y )
∀a, b, c
P( X = a , Y = b, Z = c ) = P ( X = a ) P(Y = b | X = a ) P( Z = c | Y = b )
Z
P(X) P(Y|X) Y NOT Y P(Z|Y) Z NOT Z
X = X 0.4 * X 0.9 0.1 * Y 0.2 0.8
NOT X 0.6 NOT X 0.3 0.7 NOT Y 0 1

TXYZ = TX × TYX × TZY


© Paul Viola 1999 Machine Learning

Junction Tree Algorithm: Graph Hacking


M C M C M C

A B A B A B

E F E F E F

CM C AC CM C AC
A A A A
AE A ABF AE ABF

© Paul Viola 1999 Machine Learning

11
More Junction Trees

© Paul Viola 1999 Machine Learning

Junction Trees and Tables

CM C AC

A1 A2

AE ABF

TCM × TAC × TAE × TABF ?


TABCDEF = = P( A, B , C , D, E, F )
SC × S A1 × S A2

© Paul Viola 1999 Machine Learning

12
Rules for Junction Tree Initialization
l For each conditional distribution in the Bayes Net
» Find a node in the Jtree which contains all those vars
» Multiply that nodes table by the conditional dist

© Paul Viola 1999 Machine Learning

From Junction Trees to Probability


M C
CM C AC
A B
A1 A2
E F
AE ABF
P( A, B, C , D , E , F )
TCM × TAC × TAE × TABF
= P( M ) P(C | M ) P ( A | C ) TABCDEF =
S C × S A1 × S A 2
× P ( E | A) P( B ) P( F | AB )
= P( M ) P(C | M ) K TCM
× P( A | C ) K TAC
× P( E | A) K TAE
× P( B) P( F | AB) K TABF
© Paul Viola 1999 Machine Learning

13
Image Markov Models

© Paul Viola 1999 Machine Learning

14
Multi-scale Statistical Models:
Images, People, Movement

Paul Viola
Collaborators: Jeremy De Bonet,
John Fisher, Andrew Kim
Tom Rikert, Mike Jones,

http://www.ai.mit.edu/projects/lv

Paul Viola MIT AI Lab

[Diagram: non-parametric multi-scale models -- example images -> distribution; sampling gives synthesis & computer graphics; likelihood/similarity gives detection, registration, and recognition; also a new hypothesis for human object recognition, segmentation, denoising, and super-resolution.]
Paul Viola MIT AI Lab

1
Visual Texture: a testing ground
• Texture
– Random Repeating Process
– No two patches are identical

Good statistical
model for images

Good model
for visual texture
Paul Viola MIT AI Lab

Generation a critical test

Input
Texture

Non-parametric
Gaussian Independent
Multi-scale
Paul Viola MIT AI Lab

2
Simple Statistical Model 1:
Independent pixels

• Statistical Model 1
– Each pixel is independent
and identically distributed

P( I ) = ∏ P ( I xy )
x, y

Paul Viola MIT AI Lab

Technical Point:
Texture is Ergodic/Stationary
• A texture image is assumed to be many samples of
a single process
– Each sample is almost certainly dependent on the other
samples
– But actual location of the samples does not matter
– (Space invariant process).

Paul Viola MIT AI Lab

3
Simple Statistical Models
Independent pixels

Histogram

P( I ) = ∏ P ( I xy )
x, y

Paul Viola MIT AI Lab

Statistical Model 2:
Gaussian Distribution
P( I ) = N( I; m, Sigma )  ∝  exp( -1/2 (I - m)^T Sigma^{-1} (I - m) )

Original

Generated

Paul Viola MIT AI Lab

4
What else are probabilistic image
models good for??
• Denoising:
– If we have a model for: P(I)
– And we observe an image plus noise: Iˆ = I + h
– Then:
P( I_hat ) = integral P( I = I_hat - h, h ) dh = integral P( I = I_hat - h ) P( h ) dh

P( I | I_hat ) = P( I_hat | I ) P( I ) / P( I_hat ) = P( h = I_hat - I ) P( I ) / P( I_hat )

E[ I | I_hat ] = integral I P( h = I_hat - I ) P( I ) / P( I_hat ) dI
Paul Viola MIT AI Lab

What if I were a scalar?


And the both signal and noise were Gaussian

E[ I | I_hat ] = integral I P( h = I_hat - I ) P( I ) / P( I_hat ) dI

               = integral I e^{-(I_hat - I)^2 / 2n^2} e^{-(I - m)^2 / 2s^2} / c dI
Same thing as estimating the mean of a gaussian from
one example and there is a prior…
the expected value is between the observation and prior

Paul Viola MIT AI Lab

5
Gaussian are not quite right...

Paul Viola MIT AI Lab

Gaussian model fails other tests also...

[Figure: histogram of image-derivative values vs. a Gaussian fit; axes: Value, P(Value).]

Every linear projection of a Gaussian must be Gaussian…
yet the derivatives in images are far from Gaussian.
Paul Viola MIT AI Lab

6
Statistical Model 3:
Independent Wavelet Models
• Donoho, Adelson, Simoncelli, etc.
• Very efficient (linear time)
– Estimation, Sampling, Inference

P( I ) ∝ prod_j P_j( [WI]_j )

• P(I) is defined implicitly


– As a distribution over the features present in an image
• W is a Wavelet or tight frame operator
– Invertibility is key...

Paul Viola MIT AI Lab

1D Wavelet
Transform

Wavelet
Transform

Simple Input
Texture

Paul Viola
Filters
MIT AI Lab

7
Sub-band
Pyramid

Fourier
Decomposition

FLq ( x, y )
WI
Paul Viola MIT AI Lab

Signals plus noise...

Paul Viola MIT AI Lab

8
Noise removal through shrinkage

Paul Viola MIT AI Lab

Removing noise from images

Paul Viola MIT AI Lab

9
Inside the guts...

Paul Viola MIT AI Lab

Noise + Signal: Two Gaussian Case

[Figure: histograms for the two-Gaussian (signal + noise) case.]

Paul Viola MIT AI Lab

10
Gaussian Denoising

E[ I | I_hat ] = integral I P( h = I_hat - I ) P( I ) / P( I_hat ) dI

               = integral I e^{-(I_hat - I)^2 / 2n^2} e^{-I^2 / 2s^2} / c dI

-(I_hat - I)^2 / 2n^2 - I^2 / 2s^2
    = - [ s^2 ( I^2 - 2 I I_hat + I_hat^2 ) + n^2 I^2 ] / ( 2 n^2 s^2 )
    = - [ (s^2 + n^2) I^2 - 2 s^2 I I_hat + s^2 I_hat^2 ] / ( 2 n^2 s^2 )
    = - [ I^2 - 2 s^2 I I_hat / (s^2 + n^2) + s^2 I_hat^2 / (s^2 + n^2) ] (s^2 + n^2) / ( 2 n^2 s^2 )

Completing the square in I shows the posterior is a Gaussian with mean
E[ I | I_hat ] = s^2 I_hat / ( s^2 + n^2 ): the observation is shrunk toward the (zero) prior mean.
Paul Viola  MIT AI Lab
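A minimal sketch of the Gaussian "shrinkage" implied by the derivation above: with zero-mean Gaussian signal (variance s^2) and noise (variance n^2), the posterior mean scales each noisy coefficient by s^2/(s^2 + n^2). The variances and sample size below are illustrative.

% Sketch: linear shrinkage of noisy coefficients toward zero.
s2 = 1.0;  n2 = 0.25;                    % assumed signal and noise variances
I    = sqrt(s2) * randn(1000,1);         % "clean" coefficients
Ihat = I + sqrt(n2) * randn(1000,1);     % observed noisy coefficients
Iest = (s2 / (s2 + n2)) * Ihat;          % posterior-mean estimate E[I | Ihat]
mse_raw = mean((Ihat - I).^2), mse_est = mean((Iest - I).^2)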

Noise vs. Signal: The details

[Figure: histograms of signal, noise, observed, and shrunken coefficients.]

Paul Viola MIT AI Lab

11
Independent Wavelet Synthesis Model

P( I ) ≈ prod_{l,q,x,y} P_{l,q,x,y}( F_l^q( x, y ) )

       ≈ prod_{l,q,x,y} P_{l,q}( F_l^q( x, y ) )

Given:    I, W
Observe:  O_{l,q} = { F_l^q( x, y ) }
Model:    P_{l,q}( . )

Paul Viola MIT AI Lab

Observe Coefficients

Paul Viola MIT AI Lab

12
Compute Histograms

Paul Viola MIT AI Lab

Multi-scale Multi-scale
Histograms Sampling
Sampling Procedure

texture patch
texture patch

synthesized
original

Paul Viola MIT AI Lab

13
Not quite right...

Paul Viola MIT AI Lab

Edges lead to aligned


coefficients

Wavelet
Transform

Simple Input
Texture

Paul Viola
Filters
MIT AI Lab

14
Heeger and Bergen:
Constrain the pixel histogram

Models of structured
images are weak.

Paul Viola MIT AI Lab

FRAME: a generalization of B&H
(Zhu, Wu and Mumford)

• Specify a set of filters
– Not necessarily orthogonal or even linear.
• Measure the histogram of these filters
– Type of statistic
• Construct a Boltzmann/Gibbs distribution which
generates these statistics
– Maximum Entropy
• Resulting algorithm is currently intractable
– Days to generate a single image

Paul Viola MIT AI Lab

15
Paul Viola MIT AI Lab

Edges lead to aligned coefficients

[Figure: wavelet transform (filter outputs) of a simple input texture; coefficients align across scales at edges.]

Paul Viola MIT AI Lab

16
Preserving Cross Scale Alignment

Wavelet
Transform

Paul Viola
Filters
MIT AI Lab

Statistical Distribution
of Multi-scale Features
The distribution of
multi-scale features
determines appearance

Paul Viola Wavelet Pyramid MIT AI Lab

17
Multi-scale
Wavelet Features

A multi-scale feature
associates many
values with each
pixel in the image

Paul Viola MIT AI Lab

Conjunctions of filters:
Multi-resolution Parent Vector

  \vec{V}(x, y) = \Big[ F_N^0(\tfrac{x}{2^N}, \tfrac{y}{2^N}),\, F_N^1(\tfrac{x}{2^N}, \tfrac{y}{2^N}), \ldots, F_N^M(\tfrac{x}{2^N}, \tfrac{y}{2^N}),
                        \;\vdots\;
                        F_1^0(\tfrac{x}{2}, \tfrac{y}{2}),\, F_1^1(\tfrac{x}{2}, \tfrac{y}{2}), \ldots, F_1^M(\tfrac{x}{2}, \tfrac{y}{2}),
                        F_0^0(x, y),\, F_0^1(x, y), \ldots, F_0^M(x, y) \Big]

(entries run from coarse, level N, to fine, level 0; V(x, y) is the "parent vector" at pixel (x, y).)
Paul Viola MIT AI Lab
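A minimal sketch of collecting a parent vector, assuming a plain 2x2-averaging Gaussian pyramid with one value per level; in the lecture each level would really contribute M+1 oriented filter responses.

```python
import numpy as np

# Minimal sketch of parent vectors from a 2x2-averaging ("Gaussian") pyramid.
def gaussian_pyramid(image, levels):
    pyr = [image]
    for _ in range(levels):
        im = pyr[-1]
        pyr.append(0.25 * (im[0::2, 0::2] + im[1::2, 0::2] +
                           im[0::2, 1::2] + im[1::2, 1::2]))
    return pyr                              # pyr[0] = finest, pyr[-1] = coarsest

def parent_vector(pyr, x, y):
    # V(x, y): the value at (x, y) plus the values of all its coarser ancestors
    return np.array([pyr[l][y >> l, x >> l] for l in range(len(pyr))])

rng = np.random.default_rng(2)
img = rng.normal(size=(64, 64))
pyr = gaussian_pyramid(img, levels=4)
print(parent_vector(pyr, x=10, y=20))       # one value per pyramid level
```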

18
Build a Model for Observed Distribution

  P(I) = P(\vec{V}(x, y))

Non-parametric
Distribution

Paul Viola    Related to the MAR models of Willsky et al.    MIT AI Lab

Original
Texture

Synthesis Results

Paul Viola MIT AI Lab

19
Multi-resolution Parent Vector

  \vec{V}(x, y) = \Big[ \underbrace{F_N^0(\tfrac{x}{2^N}, \tfrac{y}{2^N}), F_N^1(\tfrac{x}{2^N}, \tfrac{y}{2^N}), \ldots, F_N^M(\tfrac{x}{2^N}, \tfrac{y}{2^N})}_{\vec{V}_N(x, y)},
                        \;\vdots\;
                        \underbrace{F_1^0(\tfrac{x}{2}, \tfrac{y}{2}), F_1^1(\tfrac{x}{2}, \tfrac{y}{2}), \ldots, F_1^M(\tfrac{x}{2}, \tfrac{y}{2})}_{\vec{V}_1(x, y)},
                        \underbrace{F_0^0(x, y), F_0^1(x, y), \ldots, F_0^M(x, y)}_{\vec{V}_0(x, y)} \Big]

(the parent vector decomposes into per-level sub-vectors \vec{V}_N(x, y), \ldots, \vec{V}_1(x, y), \vec{V}_0(x, y).)
Paul Viola MIT AI Lab

Probabilistic Model  P(\vec{V}(x, y))

  Markov:
    P\big(V_l(x, y) \mid \{WI\} - V_l(x, y)\big) = P\big(V_l(x, y) \mid V_{l+1}(x, y), V_{l+2}(x, y), \ldots\big)

  Conditionally Independent:
    P(V_l) = \prod_{x, y} P\big(V_l(x, y) \mid V_{l+1}(x, y), V_{l+2}(x, y), \ldots\big)

  Successive Conditioning:
    P(I) = P(WI) = P(V_M) \times P(V_{M-1} \mid V_M) \times P(V_{M-2} \mid V_M, V_{M-1}) \times P(V_{M-3} \mid V_M, V_{M-1}, V_{M-2}) \cdots
Paul Viola MIT AI Lab

20
Estimating Conditional Distributions

• Non-parametrically:  P^*(x) = \sum_i R(x - x_i)

  P\big(V_l(x, y) \mid V_{l+1}(x, y), V_{l+2}(x, y), \ldots\big)
    = \frac{P\big(V_l(x, y), V_{l+1}(x, y), V_{l+2}(x, y), \ldots\big)}{P\big(V_{l+1}(x, y), V_{l+2}(x, y), \ldots\big)}
    \approx \frac{P^*\big(V_l(x, y), V_{l+1}(x, y), V_{l+2}(x, y), \ldots\big)}{P^*\big(V_{l+1}(x, y), V_{l+2}(x, y), \ldots\big)}

Paul Viola MIT AI Lab
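A minimal sketch of sampling from the non-parametric conditional estimate, assuming nearest-neighbor matching on the parent context stands in for the smoothing kernel R(.); the data here are synthetic placeholders.

```python
import numpy as np

# Minimal sketch: approximate P(child | parents) by restricting attention to
# training examples whose parent context is close to the query, then sampling
# one of their children. Nearest neighbors stand in for the kernel R(.).
rng = np.random.default_rng(3)

def sample_child(children, parents, query_parents, k=5):
    # children: (N,) values of V_l;  parents: (N, D) values of V_{l+1}, V_{l+2}, ...
    dists = np.linalg.norm(parents - query_parents, axis=1)
    nearest = np.argsort(dists)[:k]          # the k closest parent contexts
    return children[rng.choice(nearest)]     # sample a child from that neighborhood

# toy data: the child is roughly the parent plus noise
parents = rng.normal(size=(1000, 1))
children = parents[:, 0] + 0.1 * rng.normal(size=1000)
print(sample_child(children, parents, query_parents=np.array([1.5])))
```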

Shannon Resampling on a Tree


Step 1: Build analysis pyramid

64x64 2x2

Input
Image

Note: We are using only the Gaussian pyramid here!
Normally we use an oriented pyramid...

Paul Viola MIT AI Lab

21
Shannon Resampling
Step 2: Build synthesis pyramid

Paul Viola MIT AI Lab

Shannon Resampling
Step 2a: Fill in the top...

Pixels are generated by sampling


from the analysis pyramid.
Paul Viola MIT AI Lab

22
Shannon Resampling
Step 2b: Fill in subsequent levels

Pixels are generated by


conditional sampling
Paul Viola (dependent on the parent). MIT AI Lab

Shannon Resampling
Finish the pyramid

Decisions made at low resolutions


generate discrete features in the final image.
Paul Viola MIT AI Lab
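A minimal end-to-end sketch of the Shannon-resampling control flow, assuming a 2x2-averaging pyramid and conditioning on a single parent value rather than the full parent vector of oriented filter responses: the top of the synthesis pyramid is filled by plain resampling, each finer level by conditional sampling.

```python
import numpy as np

# Minimal sketch of Shannon resampling on a tree (coarse-to-fine synthesis),
# conditioning on one parent value only; this is just the control flow.
rng = np.random.default_rng(4)

def pyramid(img, levels):
    pyr = [img]
    for _ in range(levels):
        im = pyr[-1]
        pyr.append(0.25 * (im[0::2, 0::2] + im[1::2, 0::2] +
                           im[0::2, 1::2] + im[1::2, 1::2]))
    return pyr                                    # pyr[0] finest ... pyr[-1] coarsest

def shannon_resample(texture, levels=4, out_size=64):
    ana = pyramid(texture, levels)
    # Step 2a: fill in the top of the synthesis pyramid by plain resampling
    syn = rng.choice(ana[-1].ravel(), size=(out_size >> levels, out_size >> levels))
    # Step 2b: fill in each finer level by conditional sampling on the parent
    for l in range(levels - 1, -1, -1):
        parents_ana = np.repeat(np.repeat(ana[l + 1], 2, axis=0), 2, axis=1).ravel()
        children_ana = ana[l].ravel()
        syn_parents = np.repeat(np.repeat(syn, 2, axis=0), 2, axis=1)
        flat = syn_parents.ravel()
        out = np.empty_like(flat)
        for i, p in enumerate(flat):
            # sample an analysis child whose parent value is close to ours
            near = np.argsort(np.abs(parents_ana - p))[:8]
            out[i] = children_ana[rng.choice(near)]
        syn = out.reshape(syn_parents.shape)
    return syn

tex = rng.normal(size=(64, 64))
print(shannon_resample(tex).shape)                # (64, 64)
```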

23
Paul Viola MIT AI Lab

Heeger and Bergen,


SIGGraph95

B&H D&V
Paul Viola MIT AI Lab

24
Paul Viola MIT AI Lab

Paul Viola MIT AI Lab

25
Paul Viola MIT AI Lab

Paul Viola MIT AI Lab

26
FRAME: Challenge

Paul Viola MIT AI Lab

Non-Parametric Multi-scale Models

[Diagram: example images → distribution; sampling the distribution → Synthesis & Computer Graphics; likelihood under the distribution → Detection & Recognition.]

Paul Viola MIT AI Lab

27
Discrimination via Cross Entropy

[Diagram: model images I_MODEL → P(I_MODEL); test images I_TEST → P(I_TEST); the two densities are compared via cross entropy.]
Paul Viola MIT AI Lab
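A minimal sketch of the discrimination step, assuming each class is summarized by a histogram of one feature and a test image is assigned to the model with the lowest cross entropy; the feature distributions below are synthetic stand-ins.

```python
import numpy as np

# Minimal sketch of classification via cross entropy over feature histograms.
rng = np.random.default_rng(5)
bins = np.linspace(-8, 8, 65)

def hist(x):
    return np.histogram(x, bins=bins)[0].astype(float) + 1.0   # +1: simple smoothing

def cross_entropy(test_counts, model_counts):
    p = test_counts / test_counts.sum()
    q = model_counts / model_counts.sum()
    return -np.sum(p * np.log(q))                 # H(P_test, Q_model)

model_a = hist(rng.laplace(scale=1.0, size=20000))   # "class A" feature histogram
model_b = hist(rng.normal(scale=1.0, size=20000))    # "class B" feature histogram
test = hist(rng.laplace(scale=1.0, size=5000))       # test features (really class A)
scores = {"A": cross_entropy(test, model_a), "B": cross_entropy(test, model_b)}
print(min(scores, key=scores.get))                   # lowest cross entropy wins -> "A"
```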

Meastex: Texture Classification

Best:
GMRF’s 97%

Ours: 99%

Paul Viola MIT AI Lab

28
Paul Viola MIT AI Lab

Paul Viola MIT AI Lab

29
Where is the boundary
between texture and objects?
• Our model can synthesize and recognize complex
and structured textures.
– Far beyond older definitions of texture.
• Where is the boundary between these complex
textures and other patterns in images,
– like faces, human forms, automobiles, etc.?

• Only experiments can tell…

Paul Viola MIT AI Lab

The Jacket Hypothesis

Paul Viola MIT AI Lab

30
What about face detection?

• Synthesis is convincing
• Train a texture model to detect faces

Paul Viola
Tom Rikert & Mike Jones
MIT AI Lab

Detecting Objects

• Key Difficulties:
– Variation in Pose, Deformation, & variation across class
• Most Object Recognition approaches are either:
– Very dependent on precise shape and size
– Entirely dependent on simple features (… color, edge histograms)
• Hypothesis:
– Object recognition is closely related to texture perception

Paul Viola MIT AI Lab

31
Detection Results

Non-face test images

Paul Viola MIT AI Lab


Web face test images

Detection Results:

Paul Viola MIT AI Lab

32
Non-frontal
faces

But naïve detection
is expensive

Paul Viola MIT AI Lab

Car Images

Paul Viola MIT AI Lab

33
Texture recognition via Cross Entropy

[Diagram: model images I_MODEL → P(I_MODEL); test images I_TEST → P(I_TEST); the two densities are compared via cross entropy.]
Paul Viola MIT AI Lab

Pruning the density estimator

200
Parent Vectors
Reduce the number of bins
by clustering & by their value for discrimination

Result: Detection/Classification is faster than template correlation


Paul Viola MIT AI Lab

34
ROC using 200 vectors…

2000 Vector Model 200 Vector Model

Paul Viola MIT AI Lab

Scanning results:
Time: 9 secs

Paul Viola MIT AI Lab

35
Key facial features
- determined automatically
- located automatically

Multi-scale features which come
from the face model can be automatically
detected for many individuals.
Paul Viola MIT AI Lab

Paul Viola MIT AI Lab

36
Paul Viola MIT AI Lab

Paul Viola MIT AI Lab

37
Future Work:
New Face Recognition Algorithm

• Facial identity depends both on the types of


features and their location.

Paul Viola MIT AI Lab

38
Non-Parametric Multi-scale Models

[Diagram: example images → distribution; sampling → Synthesis & Computer Graphics; likelihood → Detection & Recognition; similarity → registration, a new hypothesis for human object recognition, segmentation, denoising, and super-resolution.]

Paul Viola MIT AI Lab

Visual Texture: a testing ground


• Texture
– Random Repeating Process
– No two patches are identical

Good statistical
model for images

Good model
for visual texture
Paul Viola MIT AI Lab

1
Generation: a critical test

Input
Texture

Non-parametric
Gaussian Independent
Multi-scale
Paul Viola MIT AI Lab

Simple Statistical Models


Independent pixels

Histogram

  P(I) = \prod_{x,y} P(I_{xy})

Paul Viola MIT AI Lab
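A minimal sketch of the independent-pixel model, assuming an example image is available: fit one histogram of pixel values and draw every pixel of the synthesized image i.i.d. from it (matching the marginal statistics but destroying all spatial structure).

```python
import numpy as np

# Minimal sketch of P(I) = prod_{x,y} P(I_xy): estimate a single pixel-value
# histogram from an example image and sample every output pixel i.i.d. from it.
rng = np.random.default_rng(6)
example = rng.integers(0, 256, size=(64, 64))        # stand-in for a real texture
values, counts = np.unique(example, return_counts=True)
probs = counts / counts.sum()
synth = rng.choice(values, size=example.shape, p=probs)
```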

2
Statistical Model 2:
Gaussian Distribution
  P(I) = N(I;\, \mu, \Sigma) \propto e^{-|\Sigma^{-1/2}(I - \mu)|^2}

Original

Generated

Paul Viola MIT AI Lab
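A minimal sketch of the Gaussian model, assuming it is fit to small patches so the covariance stays manageable; sampling from it reproduces second-order (covariance) structure only. The patch data below are placeholders.

```python
import numpy as np

# Minimal sketch of the Gaussian image model P(I) = N(I; mu, Sigma),
# fit to 8x8 patches and sampled once.
rng = np.random.default_rng(7)
patches = rng.normal(size=(5000, 8 * 8))             # stand-in for real 8x8 patches
mu = patches.mean(axis=0)
sigma = np.cov(patches, rowvar=False)
sample = rng.multivariate_normal(mu, sigma).reshape(8, 8)   # one generated patch
```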

Signals plus noise...

Paul Viola MIT AI Lab

3
Noise + Signal: Two Gaussian Case

[Figure: four histograms over the range -6 to 6 (counts up to ~800) illustrating the two-Gaussian signal-plus-noise case.]

Paul Viola MIT AI Lab

Gaussian Denoising

  E[I \mid \hat{I}] = \int I\, \frac{P(\eta = \hat{I} - I)\, P(I)}{P(\hat{I})}\, dI
                    = \int I\, \frac{e^{-(\hat{I} - I)^2 / 2n^2}\; e^{-I^2 / 2s^2}}{c}\, dI
                    = \frac{s^2}{n^2 + s^2}\, \hat{I}

  -\frac{(\hat{I} - I)^2}{2n^2} - \frac{I^2}{2s^2}
      = -\left[\frac{s^2 (I^2 - 2 I \hat{I} + \hat{I}^2) + n^2 I^2}{2 n^2 s^2}\right]
      = -\left[\frac{(s^2 + n^2) I^2 - 2 s^2 I \hat{I} + s^2 \hat{I}^2}{2 n^2 s^2}\right]
      = -\left[\frac{I^2 - \frac{2 s^2 I \hat{I}}{s^2 + n^2} + \frac{s^2 \hat{I}^2}{s^2 + n^2}}{2 n^2 s^2 / (s^2 + n^2)}\right]

Paul Viola  MIT AI Lab
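A quick numerical check of the closed form above, assuming a zero-mean Gaussian signal prior with variance s^2 and Gaussian noise with variance n^2; the particular values are arbitrary.

```python
import numpy as np

# Verify E[I | I_hat] = s^2 / (s^2 + n^2) * I_hat against the integral.
s, n, I_hat = 2.0, 1.0, 3.0
I = np.linspace(-30, 30, 200001)
w = np.exp(-(I_hat - I) ** 2 / (2 * n ** 2)) * np.exp(-I ** 2 / (2 * s ** 2))
print(np.trapz(I * w, I) / np.trapz(w, I))     # numerical posterior mean
print(s ** 2 / (s ** 2 + n ** 2) * I_hat)      # closed form: 2.4
```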

4
In pictures…

[Figure: two histograms over the range -6 to 6 (left counts up to ~800, right up to ~200). Observation = 3.0, Mean = 2.8]
Paul Viola MIT AI Lab

Statistical Model 3:
Independent Wavelet Models
• Donoho, Adelson, Simoncelli, etc.
• Very efficient (linear time)
– Estimation, Sampling, Inference

  P(I) \propto \prod_j P_j([WI]_j)

• P(I) is defined implicitly
– As a distribution over the features present in an image
• W is a Wavelet or tight frame operator
– Invertibility is key...
Paul Viola MIT AI Lab

5
Noise vs. Signal: The details

[Figure: four histograms over the range -6 to 6 (counts up to ~800) showing the signal and noise components in detail.]

Paul Viola MIT AI Lab

Non-Gaussian:
Integral is evaluated numerically

  E[I \mid \hat{I}] = \int I\, \frac{P(\eta = \hat{I} - I)\, P(I)}{P(\hat{I})}\, dI

Paul Viola MIT AI Lab

6
1D Wavelet Transform

[Figure: wavelet transform (filter outputs) of a simple 1D input texture.]

Paul Viola MIT AI Lab

Sub-band
Pyramid

Fourier
Decomposition

FLθ ( x, y )
WI
Paul Viola MIT AI Lab

7
Noise removal through shrinkage

Paul Viola MIT AI Lab

Removing noise from images

Paul Viola MIT AI Lab

8
Independent Wavelet Synthesis Model

  P(I) \approx \prod_{l,\theta,x,y} P_{l,\theta,x,y}\!\left(F_l^\theta(x, y)\right)
       \approx \prod_{l,\theta,x,y} P_{l,\theta}\!\left(F_l^\theta(x, y)\right)

  Given:   I, W
  Observe: O_{l,\theta} = \{ F_l^\theta(x, y) \}
  Model:   P_{l,\theta}(\cdot)

Paul Viola MIT AI Lab

Observe Coefficients

Paul Viola MIT AI Lab

9
Compute Histograms

Paul Viola MIT AI Lab

Sampling Procedure

[Figure: multi-scale histograms of the original texture patch and multi-scale sampling producing a synthesized texture patch.]

Paul Viola MIT AI Lab

10
Not quite right...

Paul Viola MIT AI Lab

Edges lead to aligned coefficients

[Figure: wavelet transform (filter outputs) of a simple input texture; coefficients align across scales at edges.]

Paul Viola MIT AI Lab

11
Heeger and Bergen:
Constrain the pixel histogram

Models of structured
images are weak.

Paul Viola MIT AI Lab

FRAME: a generalization of B&H
(Zhu, Wu and Mumford)

• Specify a set of filters
– Not necessarily orthogonal or even linear.
• Measure the histogram of these filters
– Type of statistic
• Construct a Boltzmann/Gibbs distribution which
generates these statistics
– Maximum Entropy
• Resulting algorithm is currently intractable
– Days to generate a single image

Paul Viola MIT AI Lab

12
Paul Viola MIT AI Lab

Edges lead to aligned coefficients

[Figure: wavelet transform (filter outputs) of a simple input texture; coefficients align across scales at edges.]

Paul Viola MIT AI Lab

13
Preserving Cross Scale Alignment

Wavelet
Transform

Paul Viola
Filters
MIT AI Lab

Statistical Distribution
of Multi-scale Features
The distribution of
multi-scale features
determines appearance

Paul Viola Wavelet Pyramid MIT AI Lab

14
Multi-scale
Wavelet Features

A multi-scale feature
associates many
values with each
pixel in the image

Paul Viola MIT AI Lab

Conjunctions of filters:
Multi-resolution Parent Vector

  \vec{V}(x, y) = \Big[ F_N^0(\tfrac{x}{2^N}, \tfrac{y}{2^N}),\, F_N^1(\tfrac{x}{2^N}, \tfrac{y}{2^N}), \ldots, F_N^M(\tfrac{x}{2^N}, \tfrac{y}{2^N}),
                        \;\vdots\;
                        F_1^0(\tfrac{x}{2}, \tfrac{y}{2}),\, F_1^1(\tfrac{x}{2}, \tfrac{y}{2}), \ldots, F_1^M(\tfrac{x}{2}, \tfrac{y}{2}),
                        F_0^0(x, y),\, F_0^1(x, y), \ldots, F_0^M(x, y) \Big]

(entries run from coarse, level N, to fine, level 0; V(x, y) is the "parent vector" at pixel (x, y).)
Paul Viola MIT AI Lab

15
Build a Model for Observed Distribution

  P(I) = P(\vec{V}(x, y))

Non-parametric
Distribution

Paul Viola    Related to the MAR models of Willsky et al.    MIT AI Lab

Original
Texture

Synthesis Results

Paul Viola MIT AI Lab

16
Multi-resolution Parent Vector

  \vec{V}(x, y) = \Big[ \underbrace{F_N^0(\tfrac{x}{2^N}, \tfrac{y}{2^N}), F_N^1(\tfrac{x}{2^N}, \tfrac{y}{2^N}), \ldots, F_N^M(\tfrac{x}{2^N}, \tfrac{y}{2^N})}_{\vec{V}_N(x, y)},
                        \;\vdots\;
                        \underbrace{F_1^0(\tfrac{x}{2}, \tfrac{y}{2}), F_1^1(\tfrac{x}{2}, \tfrac{y}{2}), \ldots, F_1^M(\tfrac{x}{2}, \tfrac{y}{2})}_{\vec{V}_1(x, y)},
                        \underbrace{F_0^0(x, y), F_0^1(x, y), \ldots, F_0^M(x, y)}_{\vec{V}_0(x, y)} \Big]

(the parent vector decomposes into per-level sub-vectors \vec{V}_N(x, y), \ldots, \vec{V}_1(x, y), \vec{V}_0(x, y).)
Paul Viola MIT AI Lab

Probabilistic Model  P(\vec{V}(x, y))

  Markov:
    P\big(V_l(x, y) \mid \{WI\} - V_l(x, y)\big) = P\big(V_l(x, y) \mid V_{l+1}(x, y), V_{l+2}(x, y), \ldots\big)

  Conditionally Independent:
    P(V_l) = \prod_{x, y} P\big(V_l(x, y) \mid V_{l+1}(x, y), V_{l+2}(x, y), \ldots\big)

  Successive Conditioning:
    P(I) = P(WI) = P(V_M) \times P(V_{M-1} \mid V_M) \times P(V_{M-2} \mid V_M, V_{M-1}) \times P(V_{M-3} \mid V_M, V_{M-1}, V_{M-2}) \cdots

Paul Viola MIT AI Lab

17
Estimating Conditional Distributions

• Non-parametrically:  P^*(x) = \sum_i R(x - x_i)

  P\big(V_l(x, y) \mid V_{l+1}(x, y), V_{l+2}(x, y), \ldots\big)
    = \frac{P\big(V_l(x, y), V_{l+1}(x, y), V_{l+2}(x, y), \ldots\big)}{P\big(V_{l+1}(x, y), V_{l+2}(x, y), \ldots\big)}
    \approx \frac{P^*\big(V_l(x, y), V_{l+1}(x, y), V_{l+2}(x, y), \ldots\big)}{P^*\big(V_{l+1}(x, y), V_{l+2}(x, y), \ldots\big)}

Paul Viola MIT AI Lab

Shannon Resampling on a Tree


Step 1: Build analysis pyramid

64x64 2x2

Input
Image

Note: We are using only the Gaussian pyramid here!
Normally we use an oriented pyramid...

Paul Viola MIT AI Lab

18
Shannon Resampling
Step 2: Build synthesis pyramid

Paul Viola MIT AI Lab

Shannon Resampling
Step 2a: Fill in the top...

Pixels are generated by sampling


from the analysis pyramid.
Paul Viola MIT AI Lab

19
Shannon Resampling
Step 2b: Fill in subsequent levels

Pixels are generated by


conditional sampling
Paul Viola (dependent on the parent). MIT AI Lab

Shannon Resampling
Finish the pyramid

Decisions made at low resolutions


generate discrete features in the final image.
Paul Viola MIT AI Lab

20
Paul Viola MIT AI Lab

Heeger and Bergen,


SIGGraph95

B&H D&V
Paul Viola MIT AI Lab

21
Paul Viola MIT AI Lab

Paul Viola MIT AI Lab

22
Paul Viola MIT AI Lab

Paul Viola MIT AI Lab

23
FRAME: Challenge

Paul Viola MIT AI Lab

Non-Parametric Multi-scale Models

[Diagram: example images → distribution; sampling the distribution → Synthesis & Computer Graphics; likelihood under the distribution → Detection & Recognition.]

Paul Viola MIT AI Lab

24
Discrimination via Cross Entropy

[Diagram: model images I_MODEL → P(I_MODEL); test images I_TEST → P(I_TEST); the two densities are compared via cross entropy.]
Paul Viola MIT AI Lab

Meastex: Texture Classification

Best:
GMRF’s 97%

Ours: 99%

Paul Viola MIT AI Lab

25
Paul Viola MIT AI Lab

Paul Viola MIT AI Lab

26
Where is the boundary
between texture and objects?
• Our model can synthesize and recognize complex
and structured textures.
– Far beyond older definitions of texture.
• Where is the boundary between these complex
textures and other patterns in images,
– like faces, human forms, automobiles, etc.?

• Only experiments can tell…

Paul Viola MIT AI Lab

The Jacket Hypothesis

Paul Viola MIT AI Lab

27
What about face detection?

• Synthesis is convincing
• Train a texture model to detect faces

Paul Viola
Tom Rikert & Mike Jones
MIT AI Lab

Detecting Objects

• Key Difficulties:
– Variation in Pose, Deformation, & variation across class
• Most Object Recognition approaches are either:
– Very dependent on precise shape and size
– Entirely dependent on simple features (… color, edge histograms)
• Hypothesis:
– Object recognition is closely related to texture perception

Paul Viola MIT AI Lab

28
Detection Results

Non-face test images

Paul Viola MIT AI Lab


Web face test images

Detection Results:

Paul Viola MIT AI Lab

29
Non-frontal
faces

But naïve detection
is expensive

Paul Viola MIT AI Lab

Car Images

Paul Viola MIT AI Lab

30
Texture recognition via Cross Entropy

[Diagram: model images I_MODEL → P(I_MODEL); test images I_TEST → P(I_TEST); the two densities are compared via cross entropy.]
Paul Viola MIT AI Lab

Pruning the density estimator

200
Parent Vectors
Reduce the number of bins
by clustering & by their value for discrimination

Result: Detection/Classification is faster than template correlation


Paul Viola MIT AI Lab

31
ROC using 200 vectors…

2000 Vector Model 200 Vector Model

Paul Viola MIT AI Lab

Scanning results:
Time: 9 secs

Paul Viola MIT AI Lab

32
Key facial features
- determined automatically
- located automatically

Multi-scale features which come
from the face model can be automatically
detected for many individuals.
Paul Viola MIT AI Lab

Paul Viola MIT AI Lab

33
Paul Viola MIT AI Lab

Paul Viola MIT AI Lab

34
Future Work:
New Face Recognition Algorithm

• Facial identity depends both on the types of


features and their location.

Paul Viola MIT AI Lab

35
6.891 Machine Learning and Neural
Networks

Lecture 24:
The End

© Paul Viola 1999 Machine Learning

News
l The Final is on Monday of finals week at 1:30
» In this room…
l The conflict exam will be in NE43 on Tuesday morning
at 9:30.
» Come to Kinh’s office at 9:15 so we can set people up.
l Last year’s final will be on the web by 1 PM.

© Paul Viola 1999 Machine Learning

1
Review & Overview
l Lectures 22 and 23:
» Statistical image processing
– Estimate statistical models from examples
– Applications
l Denoising

l Synthesis

l Recognition

l An overview of Machine Learning


l What I would have liked to cover…

© Paul Viola 1999 Machine Learning

6.891 at a Glance
l Probability
» Bayes Law
l Linear Algebra
» Eigenvectors and inverses
l Bayesian Classification
l Discriminant Functions
» Perceptrons, MLPs
l Support Vector Machines
l Regularization
» Radial Basis Functions
l Unsupervised Learning and PCA
l Bayes Nets and HMMs

© Paul Viola 1999 Machine Learning

2
In the beginning… Probability
l The key concepts of probability
» The basic algebra of probability
– Probabilities of mutually exclusive events add; independent events multiply
– Relationships between conditional and joint distributions
» Densities work like probabilities (mostly)
» Bayes Law allows us to make decisions
– Loss functions are critical
» Maximum likelihood allows us to learn distributions
– Bayesian estimation averages over parameters
» Exponential densities are easiest to work with
» Mixtures of Gaussians are powerful (but EM is slow)
» Non-parametric estimators are more powerful
– But are difficult to represent

© Paul Viola 1999 Machine Learning

Linear Algebra
l The inverse and pseudo-inverse are everywhere
» Solving least squares problems
l Covariance and co-occurrence are everywhere
» Estimating a Gaussian
» Fitting a line to data
» Principal components analysis
l Eigenvectors simplify most linear algebra
» Especially for symmetric positive semi-definite matrices
» Allow you to compute inverses & square roots
» Allow you to understand distributions and linear
dependence

© Paul Viola 1999 Machine Learning

3
Bayesian Classification
l Start out with strong assumptions about your data
» Number of classes, structure of the classes
l Use data to estimate the distribution of each class
l Use Bayes’ law to classify new examples
l Advantages:
» Can estimate the probability of classes (confidence)
» Can validate the model
» Harder to over-train or over-fit
l Disadvantages
» May not use data efficiently
» Sensitive to poor assumptions

© Paul Viola 1999 Machine Learning
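A minimal sketch of this recipe, assuming two classes each modeled as a 1D Gaussian estimated from labeled data and combined with class priors through Bayes' law; the data and parameters are illustrative.

```python
import numpy as np

# Minimal sketch of Bayesian classification: per-class Gaussian models plus
# class priors, combined via Bayes' law to give a label and a confidence.
rng = np.random.default_rng(8)
x0 = rng.normal(loc=0.0, scale=1.0, size=500)        # class 0 training data
x1 = rng.normal(loc=2.0, scale=1.5, size=300)        # class 1 training data
params = [(x0.mean(), x0.std()), (x1.mean(), x1.std())]
priors = np.array([len(x0), len(x1)], dtype=float)
priors /= priors.sum()

def classify(x):
    likes = [np.exp(-0.5 * ((x - m) / s) ** 2) / s for m, s in params]
    post = priors * np.array(likes)                  # unnormalized P(class | x)
    return int(np.argmax(post)), post / post.sum()   # label and confidence

print(classify(0.2))   # likely class 0, with a posterior probability attached
```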

Discriminant Functions
l Attempt to estimate the discriminant function
directly
» Linear
» Polynomial
» Multi-layer perceptron
l Specifically minimizes the number of errors
l Advantages:
» Don’t waste time on distributions (just the boundary)
l Disadvantages
» No natural measure of confidence
» Can over-train

© Paul Viola 1999 Machine Learning

4
Support Vector Machines
l A principled and direct way to simultaneously
minimize errors while yielding the simplest
possible classifier
» Occam’s razor
l Using the Kernel Trick ™
» Can find a very complex polynomial with little work
l Using the Margin Trick™
» Maximizes generalization in the face of complexity
l Simple learning criteria
l Well studied learning algorithm
» Quadratic programming

© Paul Viola 1999 Machine Learning

Regularization
l Sometimes you would like to find the smoothest
function which is close to the data
» Minimize the squared error
» Minimize the squared first derivative (or 2nd deriv.)
l The least squares solution:
» Is a sum of kernel functions centered on the data
» Kernel functions depend on the smoothness penalty
l Derivative penalties yield polynomial kernels
» First -> linear, Second -> cubic, Hairy -> Gaussian

© Paul Viola 1999 Machine Learning
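A minimal sketch of this idea, assuming a Gaussian (RBF) kernel and a hand-picked penalty weight: the regularized fit is a sum of kernels centered on the data points, obtained by solving one linear system.

```python
import numpy as np

# Minimal sketch of regularized (kernel) least squares with an RBF kernel.
rng = np.random.default_rng(9)
x = np.sort(rng.uniform(0, 1, 30))
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=30)

def k(a, b, width=0.1):
    # Gaussian kernel between every pair of points in a and b
    return np.exp(-((a[:, None] - b[None, :]) ** 2) / (2 * width ** 2))

lam = 1e-2                                                   # smoothness penalty weight
alpha = np.linalg.solve(k(x, x) + lam * np.eye(len(x)), y)   # kernel weights
x_new = np.linspace(0, 1, 5)
print(k(x_new, x) @ alpha)                                   # smoothed predictions
```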

5
Unsupervised Learning
l Transforming the input so that it is more manageable
» PCA: The data can be represented using fewer numbers
– Can compress data, make learning simpler
» ICA: The resulting data is now more independent
– Can separate signals that were mixed
» Informative Features (by John Fisher)
– Can represent just the critical information

© Paul Viola 1999 Machine Learning

Bayes Nets
l Models of the conditional dependencies between
variables
» Usually many variables
l A complete model would be intractable
» Exponential number of parameters
– Impossible to learn or reason
l By assuming that certain vars are independent
» Number of params goes down rapidly
» Efficient reasoning is possible
l Bayes Nets are very general and can be used in
many ways

© Paul Viola 1999 Machine Learning

6
Hidden Markov Models
l A type of Bayes Net that allows reasoning over time
l The true state of the world is unknown
» You have noisy observations
l HMM use temporal dependencies to differentiate
ambiguous states

© Paul Viola 1999 Machine Learning

The VC dimension
l Each class of learned functions has a VC dimension
» Perceptron: VCdim = number of weights
l VC dimension measures the capacity of the classifier
» VCdim is the max number of points which can be shattered
» Shattered = assigned any set of labels
l Intuition: larger capacity requires more data
» Like polynomials: Nth order requires N+1 points
l The bounds are actually probabilistic
» The probability that the error rate exceeds a
particular level is bounded by a function of VCdim and N.

© Paul Viola 1999 Machine Learning

7
Symbolic Learning
l Often the correct classification rule is symbolic
» If BP < 50 and HR < 50 then administer DRUG
l While Bayes Nets can reason in this way, they do
not offer much help in learning the relationships
from data
» If structure of net is given, then params can be estimated
l This is sometimes called rule learning
l Decision Trees – ID3, CART, etc.
» Pick a feature, split into ranges
» For each case, pick another feature and repeat
» Each leaf should have only one label

© Paul Viola 1999 Machine Learning

Combining Classifiers
l We have encountered many learning techniques
» Each has multiple variants
l Bagging
» Train the same classifier on different subsets of the data
» Related to cross-validation (or the Bootstrap)
l Stacking
» Perhaps the best approach is to train each type of
classifier and then have them vote.
– Combine 100 different types of neural networks
– Many types of generalized perceptrons
l Boosting
» Train a sequence of classifiers on re-weighted data sets

© Paul Viola 1999 Machine Learning
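A minimal sketch of bagging, assuming the weak learner is a 1D threshold "stump" and the classifiers vote by majority; the data and the number of bootstrap rounds are arbitrary.

```python
import numpy as np

# Minimal sketch of bagging: train a threshold "stump" on bootstrap resamples
# of the data and combine the stumps by majority vote.
rng = np.random.default_rng(10)
x = np.concatenate([rng.normal(-1, 1, 200), rng.normal(1, 1, 200)])
y = np.concatenate([np.zeros(200), np.ones(200)])

def train_stump(xs, ys):
    # pick the threshold that minimizes training error
    thrs = np.unique(xs)
    errs = [np.mean((xs > t).astype(int) != ys) for t in thrs]
    return thrs[int(np.argmin(errs))]

stumps = []
for _ in range(25):                                   # 25 bootstrap rounds
    idx = rng.integers(0, len(x), len(x))             # sample with replacement
    stumps.append(train_stump(x[idx], y[idx]))

def predict(x_new):
    votes = np.mean([x_new > t for t in stumps], axis=0)
    return (votes > 0.5).astype(int)                  # majority vote

print(np.mean(predict(x) == y))                       # bagged training accuracy
```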

8
Policy Learning
l You must act over time to maximize some reward
» Portfolios: Buy and sell stock to max return and min risk
» Two armed bandit: tradeoff exploration for exploitation
» Learn a sequence of actions which takes you from the
start to the goal – like in a video game
l Sometimes your feedback is delayed
» Rarely do you get detailed feedback on your actions
l Policy
» Mapping from state of the world to actions
l Reinforcement Learning (Leslie Kaelbling)
l Game Learning (Backgammon)

© Paul Viola 1999 Machine Learning

Language Learning
l How can you learn to pluralize? (phonetically)
» Wug
l How do you discover parts of speech?
l How do you learn the grammar of English?
» Stochastic Context Free Grammar
– Generalization of HMM
– S -> NP VP, VP -> V NP, etc.

© Paul Viola 1999 Machine Learning
