6.891 Machine Learning and Neural Networks
Lecture 1:
Introduction and Examples
News
● First problem set is available (short)
» Due Sept. …
» All psets are due on Thursday
» Normally you will have two weeks
● Reading: DHS Ch. … for Friday
Review & Overview
● Administrative information
● Course Goals
● Define Learning, Induction, Regression, Classification
● Give examples of learning applications
● Bayes Rule and classification
● Regression and Overfitting
● Ockham's Razor, Curse of Dimensionality
» Brief Mention of Probability
Course Information
● http://www.ai.mit.edu/courses/…
● Lecturer: Paul Viola
» Prof. in the AI Lab, NE43-…
» viola@ai.mit.edu
» Research: Learning and Computer Vision
– http://www.ai.mit.edu/projects/lv/
● TA: Kinh Tieu
» PhD student in the AI Lab, NE43-…
» tieu@ai.mit.edu
» Research: Image Database Retrieval, Vision, Learning
– http://www.ai.mit.edu/people/tieu/
Grading Experiment!!!
● Problem sets will be self-graded (mostly)
● You will hand in the pset on Thursday. Kinh will record its presence or absence and glance to see if you attempted each problem.
● We will distribute the psets on Friday at random to the class. You will each grade one pset with help from a solution key. You have … days.
● Kinh will lead a …-hour pset review session to go over correct solutions. Probably Monday afternoon.
● You will hand back the graded psets on Wednesday.
● Kinh will then grade question … (usually the toughest).
● The graded psets will be returned to you on Friday, … days after you turned them in.
Course Goals
● Introduce, Motivate, and Study concepts from machine learning. Focus both on fundamentals and applications.
» Second Time: Watch Out!
● Fundamentals
» Follow text: Duda, Hart & Stork (from the Web page)
» Plus some supplemental handouts
● Applications
» Read papers from the literature
● Reinforce
» Six PSETs will require both thinking and hacking
» One final project
» Midterm
» Final exam??
Course Goals: 127
NIPS 1989

Goals: Analysis and Computation
Physical Laws

Newton's Measurements
● Observe many experiments
[Plot: measured Acceleration vs. Force — roughly linear data points]
● Conjecture simple Rule
» F = ma
» Ignore Errors & Inconsistencies
Different Types of Learned Relations
● Regression
» Continuous input, Continuous output
– F = ma, PV = nRT
– Interest Rates -> Stock Prices
– Inches Rain -> Corn Production
● Classification
» Discrete input, Discrete output
– {Red? Round? Small Seed?} -> Apple
– {Alarm?} -> Break-In; {Alarm?, Earthquake?} -> No Break-In
» Continuous input, Discrete output
– Midterm -> Final Grade
– {Fever, Blood Pressure} -> Sick?
– {Income, Current Debt} -> Issue Loan?
– Sound -> Words; Images -> People
Some Notation
● In general a learning problem will have:
» Inputs: x = (x1, x2, …, xd)^T; the j-th example is written x_j (or x^j)
» Outputs: classes C = {C1, C2, …} or values y = (y1, y2, …)^T; C_j and y_j denote the output for example j
» Target: the correct label or value t_j
Additional Notation (Abusive!)
● Prediction Function: y_j = y(x_j), C_j = C(x_j)
● Error Function: E = Σ_j l( t_j − y(x_j) ), with loss
l(z) = 0 if z = 0, 1 otherwise
or l(z) = z^2
● Reading Zip Codes
» First find the Address Block
» Then the zip code
» Normalize size and rotation
» Separate digits
» Digitize: …x… -> 2^… possible images
Character Recognition

Tremendous Variety
Final Performance
● US Postal Service: … Million Letters a day
● Reading Zip Codes
» First find the Address Block
» Then the zip code
» Normalize size and rotation
» Separate digits
» Digitize: …x… -> 2^… possible images
● Training: … example images
● Final Performance: > …%
Speech Recognition
● Speech recognition
» Sound signals -> cepstral coefficients -> Word Sequence
» … sec -> … frames/sec -> … words/sec
● Key difficulties
» Variations in pitch, pronunciation, speed
Digit Recognition in Detail
● Classifying "1" vs "2"
● Define a set of features
[Figure: features measured on a digit image — Perimeter, Width, Height]
● Look for separation
C(x) = θ(ax + b), where θ(y) = 1 if y ≥ 0, 0 otherwise
» Divide into regions
– F ∈ {…} -> "1", F ∈ {…} -> "2"
Using Bayes' Law
● Use Bayes' Law: P(B|A) = P(A|B) P(B) / P(A)
P("2"|F = f) = P(F = f|"2") P("2") / P(F = f)
P("1"|F = f) = P(F = f|"1") P("1") / P(F = f)
Combining Features
● Add features to separate
[Plot: the two classes in the (x1, x2) feature plane]
6.891 Machine Learning and Neural Networks
Lecture 2:
The Probabilistic Approach
News
● For those of you that missed the first class…
» First problem set is on the web. Due …
● The web page is getting updated regularly
● We will hand out grading guidelines when you are given the first problem set to grade
» These grades will not assume perfect accuracy…
Review & Overview
● Lecture 1:
» Defined Learning, Induction, Regression, Classification
» Showed example applications: Digits, Sounds, Speech
» Brief Mention of Probability
● Finish the introduction to Learning
» Fitting functions to data…
» Overfitting
● The Probabilistic Approach
» Review some simple probability
» Apply it to classification tasks
Fitting a Curve to Data
[Plot: ten data points in the unit square and two candidate curves, each marked "??"]
Data: {x_j, t_j};  candidate function y(x)

You are given 10 example data points. These are samples of a physical relationship, perhaps including noise.
In the final analysis we may want to hedge our bets and return a probability distribution over functions.
Polynomial Fitting
Data: {x_j, t_j}
y(x_j) = w_0 + w_1 (x_j)^1 + … + w_M (x_j)^M = Σ_{l=0}^{M} w_l (x_j)^l ≡ y(x_j; w)
Graphical Representation
● Graph represents the function y(x_j) = Σ_l w_l (x_j)^l
● Information flows from bottom to top
● Arrows/Links:
» transmit info
» multiplicative weight
● Nodes:
» sum incoming info
» possible nonlinear transform
[Figure: network with inputs X^0, X^1, X^2, …, X^9 — a high dimensional, non-linear representation of the scalar x — weights w_0 … w_9, and output node y]

- While the algebraic notation for y() is clear and specific, we will see that sometimes it is also useful to develop a graphical notation for both classifiers and regression functions.
- This idea was originally popularized in the neural network literature, wherein neural networks were almost always drawn out in their graphical form.
- The graphical notation points out that an intermediate representation for x is formed (M+1 exponentiations). The resulting problem is then one of learning the linear relationship between this high dimensional space and t.
Choose the Best Polynomial
● Which polynomial function is best?
» Best predictions on training data…
» Best predictions on future data… interpolation/extrapolation
– Best expected loss on future data
– Where do we get this data?
E = (1/2) Σ_j loss( y(x_j; w) − t_j )   (Empirical Loss)
ŵ = argmin_w E   (Find "Optimal" weights)

- What defines the best polynomial function? Perhaps it is the one which is most consistent with the training data?
- Actually we would rather return the function which makes the best predictions on future data - unfortunately there may be no source for this data.
- For the time being let's assume that we want to find the function which best agrees with training data… the function with the lowest loss.
Simple Loss Functions Simplify Learning
loss(δ) = δ^2,   E = (1/2) Σ_j ( y(x_j; w) − t_j )^2
∂E/∂w_i = (1/2) Σ_j 2 ( y(x_j; w) − t_j ) (x_j)^i
        = Σ_j ( y(x_j; w) − t_j ) (x_j)^i = 0

- Certain simple loss functions lead to learning algorithms which are easy to derive and inexpensive to compute.
- For example, squared loss can be solved by differentiating and setting the derivative to zero.
- The result is a set of linear equations that can be solved by inverting a matrix.
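As a concrete illustration, here is a minimal Matlab sketch of that matrix solution for polynomial least squares; the variable names (xs, ts, M, A) and the example data are ours, not from the original slides.

% Example inputs and targets (our own toy data)
xs = (1:10)'/10;
ts = 0.5 + 0.4*sin(2*pi*xs);
M = 3;                           % polynomial order
% Build the design matrix: A(j, l+1) = xs(j)^l
A = zeros(length(xs), M+1);
for l = 0:M
    A(:, l+1) = xs .^ l;
end
% Solve the normal equations A'*A*w = A'*ts (backslash avoids an explicit inverse)
w = (A' * A) \ (A' * ts);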
First order fit…
[Plot: the ten data points with the best first-order (linear) fit]
E = (1/2) Σ_j ( y(x_j; w) − t_j )^2
The optimal function minimizes the residual error.

- There is a pleasant physical analogy for the squared loss. The function is connected by springs to the data points. The system is then allowed to relax until the forces are balanced. The minimum energy solution is the one that is "closest" to the training data.
Fitting Different Polynomials
[Plots: least-squares polynomial fits of increasing order (0, 2, 4, 9) to the same ten data points]
Target Function
t_j = h(x_j)
[Plot: the true target function h(x) on [0, 1] — a smooth, sinusoid-like curve]
The function that generated the data was not a polynomial at all.
Fitting Different Polynomials
[Plots: fits of order 0, 1, 3, 6 and 2, 4, 9 overlaid on the target function]

Probably the best approximation was 6th order (though 3rd is very good as well).
Ninth order provides a terrible fit to the function, though it fits the training data perfectly. This is what is called overfitting…
Matlab Code
% Construct training data: noisy samples of a sinusoid
train_in = [1:10]/10;
train_out = 0.5 + 0.4 * sin(2 * pi * train_in) + 0.1 * randn(size(train_in));
% Fit a polynomial of the given order by least squares
order = 3;
p = polyfit(train_in, train_out, order)
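A usage note we have added (not from the slides): the fitted coefficients p can be evaluated on a dense grid with polyval to visualize over- and under-fitting.

% Evaluate and plot the fit against the training points
test_in = 0:0.01:1;
test_out = polyval(p, test_in);
plot(train_in, train_out, 'o', test_in, test_out, '-')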
First General Problem in Learning
● Control of complexity
» "Entities should not be multiplied without necessity"
– W. Occam, 14th Century
– Occam's razor
» "A physical theory should be as simple as possible, but no simpler" - A. Einstein
» "Good theories are falsifiable" - V. Vapnik
» "Complex theories are likely to be wrong" - P. Viola
Overfitting in Classification
This is not to say that such problems are unique to regression. Determining decision boundaries for classification is very similar.
Probabilistic Notation
X is a Random Variable
P(X = x), where P(·) is a Probability Distribution
Shorthand: P(x) = P(X = x), P(y) = P(Y = y); also P_X(x) = P(X = x), P_Y(y) = P(Y = y)

Note that there are several potentially confusing shorthand notations.
Recall the probabilistic approach
● Given a classification problem
» Speech/Music, Bass/Salmon, Rotten/Ripe
● Choose a feature of your examples
» Fish: width, height, color
» Fruit: color, weight
» Sounds: Spectrum, Variance
● Record the distribution of Feature vs. Class
● Given an unclassified example
» Compute P(F|C1) and P(F|C2)
» Classify using Bayes Rule
Probabilistic Approach
P(C = C_k, X = x)
P(C = C_k | X = x) = P(X = x | C = C_k) P(C = C_k) / P(X = x)
or, in shorthand:  P(C_k|x) = P(x|C_k) P(C_k) / P(x)
Model the class-conditional distributions P(X|C1) & P(X|C2).
(Thomas Bayes, 1702-1761)
Probability Densities
P(X ∈ [a, b]) = ∫_a^b p(X = x) dx,   p(X = x) = (d/db) P(X ∈ [a, b]) evaluated at b = x
Shorthand: p(x) = p_X(x) = p(X = x)
P(C = C_k | X = x) = p(X = x | C = C_k) P(C = C_k) / p(X = x)
or:  P(C_k|x) = p(x|C_k) P(C_k) / p(x)

Somewhat surprisingly, the density is used in the same way that the distribution function is used. In other words, the probability distribution of the class given the feature value can be found using the densities of the features.
Bayes Law for Densities
C(x) = ω_1 if P(ω_1|x) > P(ω_2|x), ω_2 otherwise
[Plot: class-conditional densities for Class 1 and Class 2 along the feature axis]
Given the Bayes classification rule, a set of decision regions is defined.

Decisions
Analysis of Decision Rule
P(error) = P(x ∈ R2, C1) + P(x ∈ R1, C2)
         = P(x ∈ R2 | C1) P(C1) + P(x ∈ R1 | C2) P(C2)
         = ∫_{R2} p(x|C1) P(C1) dx + ∫_{R1} p(x|C2) P(C2) dx
Minimize Expected Loss or Risk
Risk_k = Σ_l L_kl ∫_{R_l} p(x|C_k) dx   (Risk for elements of C_k)
Probabilistic Classification Review
● If we are given P(F|C) & P(C) -> P(F, C)
» How the feature is distributed for each class
● We can use this information to classify new examples using Bayes Rule
» Minimizes the probability of error…
» We may instead wish to minimize risk
● Where is the machine learning?
Information Retrieval
● The Altavista Problem
● … documents on the web
» Takes a long time to browse
● Simple Keyword Search
» Find documents with "German" and "car"
» Might miss "Germany" and "cars"
– Stemming
» Misses "Mercedes" and "automobile"
● Machine Learning?
» Given … documents on German cars, build a classifier

Keyword Search Works Well
Naïve Bayes Classifier
● Assume each word is an independent feature
f_i(Doc_j) = 1 if Doc_j has word i
P({f_i}|C_j) = Π_i P(F_i = f_i | C_j)
P(C_j|{f_i}) = P(C_j) Π_i P(F_i = f_i | C_j) / Π_i P(F_i = f_i)
Estimating Probabilities
● Maximum Likelihood
Potential Bug:
None of our Training Docs contain "Mercedes"
Curse of Dimensionality
● It is not always better to measure more features
● New results seem to address this problem
» Support Vectors, Boosting, etc.

Density Estimation is Ambiguous

Impacts Classification
6.891 Machine Learning and Neural Networks
Lecture 3:
Density Estimation
News
● Sorry about the recitation mix-up
» We will announce by email soon
● Problem Set … is due tomorrow
» See web for policy…
● Problem Set … will be available by tonight
● Kinh and I will be taking photos
Review & Overview
● Lecture 2:
» Overfitting Polynomials
» Reviewed the Probabilistic Approach
» Information Retrieval Example
● Density/Distribution Estimation
» Information Retrieval
– estimating binary RV's
» Gaussians
» Multidimensional Gaussians
» Non-parametric Densities
Bayesian Text Classification
{d_k}: A collection of documents
W_i(d_k) = 1 if d_k contains word i, 0 otherwise
P(F_1 = f_1, F_2 = f_2, …| C = c_j) = P({f_1 … f_N}| C = c_j)
  ≡ Π_i P(F_i = f_i | C = c_j)   (Assume Independence)
P({f_i}|C_j) = Π_i P(F_i = f_i | C_j)
[Figure: graphical model with class node C and feature nodes F1, F2, F3, F4, …, FN; each edge carries P(F_i|C)]
Classification Using Bayes Law
P(c_j|{f_i}) = P(c_j) Π_i P(f_i|c_j) / Π_i P(f_i)
c_1 = German Cars
c_0 = Other Documents

{d_k}: A collection of documents
W_i(d_k) = f_ki = 1 if d_k contains word i, 0 otherwise
P(F_i = 1 | C = c_j) = p_ij
● How can we learn p_ij?
» Maximum Likelihood Principle
» Choose p_ij so that the training data is most probable
Maximum Likelihood
P({d_k}|c_0) = Π_k Π_i P(f_ki | c_0)
Log Likelihood: if word i occurs in n_i of the N training documents,
log P({d_k}|c_0) = n_i log p_ij + (N − n_i) log(1 − p_ij) + …
Setting the derivative with respect to p_ij to zero:
n_i (1/p_ij) − (N − n_i) / (1 − p_ij) = 0
n_i / p_ij = (N − n_i) / (1 − p_ij)
n_i / (N − n_i) = p_ij / (1 − p_ij)
p_ij = n_i / N
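A minimal Matlab sketch of this maximum-likelihood estimate and the resulting Naïve Bayes score, under our own assumed setup (not the slides'): docs is a binary N-by-V word-occurrence matrix for one class, newdoc a binary 1-by-V vector.

% ML estimate p_ij = n_i / N for every word in the class
N = size(docs, 1);
p = sum(docs, 1) / N;
% Log class-conditional score for a new document (add log P(c_j) for the posterior)
logp = sum( newdoc .* log(p) + (1 - newdoc) .* log(1 - p) );

Note that log(0) appears whenever a word never occurred in training — exactly the "Mercedes" bug raised on the next slide.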
Estimating Probabilities
● Maximum Likelihood: p_ij = n_i / N
Potential Bug:
None of our Training Docs contain "Mercedes"
Prior Expectations
● Given a small amount of data we can't be absolutely sure that "Mercedes" will never appear in documents from our class…
» We may have gotten unlucky
● Use prior expectations to improve our estimates
● Problem:
» Mercedes occurs in … out of … total documents
» But never in the "German cars" training set
» What is a good estimate for p(mercedes | GermanCars)?
Maximum A Posteriori (MAP) estimate of p_ij:
P(p_ij | {d_k}, c_0) = P({d_k}| c_0, p_ij) p(p_ij) / P({d_k}| c_0)
This turns out to be more useful for continuous parameters.
What is the right prior?
● The most agnostic prior is the uniform density
Bayesian Estimation
P(p_ij | {d_k}, c_0) = P({d_k}| c_0, p_ij) p(p_ij) / P({d_k}| c_0),  with P(F_i = 1 | C = c_j, p_ij) = p_ij
Averaging over the posterior:
P(F_i = 1 | {d_k}, c_0) = ∫ p_ij P({d_k}| c_0, p_ij) p(p_ij) dp_ij / P({d_k}| c_0)
… Continued
P(F_i | c_j) = ∫ p_ij P({d_k}| c_0, p_ij) p(p_ij) dp_ij / ∫ P({d_k}| c_0, p_ij) p(p_ij) dp_ij
            = ∫ p_ij [ (p_ij)^{n_i} (1 − p_ij)^{N − n_i} ] dp_ij / ∫ [ (p_ij)^{n_i} (1 − p_ij)^{N − n_i} ] dp_ij
What if no Mercedes?
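A worked completion that is not on the surviving slides but follows directly from the integrals above: with the uniform prior these are Beta integrals, ∫_0^1 p^a (1 − p)^b dp = a! b! / (a + b + 1)!, so the posterior-mean estimate is

P(F_i = 1 | {d_k}, c_0) = (n_i + 1) / (N + 2)

(Laplace's rule of succession). For "Mercedes" with n_i = 0 this gives 1/(N + 2): small, but not zero.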
6.891 Machine Learning and Neural Networks
Lecture 4:
New Density Estimators
News
● Problem set … will be handed out today
● Problem set … is on the web
» It is much harder than the first pset
● Problem Sets:
» Please show some work
» Make sure to get the psets to Kinh
– Especially if they are last minute
Review & Overview
● Lecture 3:
» Talked about Information Retrieval
– Need priors over parameters
» Derived Maximum Likelihood for Bernoulli RV's
» Discussed use of priors over parameters
● New Density Estimators (Continuous):
» Gaussian
» Non-parametric
» Mixture of Gaussians
● Quick tour of Expectation Maximization
Why Gaussians?
● Analytically Tractable
● Central Limit Theorem
» Sum of many variables is Gaussian
● Linear Transforms of Gaussians are Gaussian
● Gaussians have the highest Entropy (for a given variance)
Multi-Dimensional Gaussian

Eigen Structure

Recall: Bayes Decision Boundaries

Discriminant Function

Set Discriminants Equal
Bayesian Parameter Estimation
● What if you have little data…
● Or if you have strong expectations?

Convergence of Probability
Reminder: Why we are here
[Figure: a column of 20 sampled data values (ranging from about −1.27 to 1.40, and from 0.18 to 4.08) next to a scatter plot of the samples]
Different Samples, Different Decisions
Concept: Variance
The variation you observe when training on different independent training sets.
But when data gets more complex...
Concept: Training Error
Error in your classifier on the training set.
Even if you had "infinite" data …
Related Concept: Bias
Error in your classifier in the limit as the size of the training data grows.
Histogram
[Figure: the 20 sampled data values and their histogram over bins from −1.5 to 1.5]
Divide by N to yield a probability.
6.891 Machine Learning and Neural Networks
Lecture 5:
Density Estimation and Classification
News
● No Lecture on Wednesday
» Be sure to get Kinh your graded psets by Wednesday
– Recitation
– Drop it off
● Guest Lecture by Leslie Kaelbling on Friday
» Reinforcement Learning
Review & Overview
● Lecture 4:
» Gaussian Density Estimation
» Covariance
» Linear and Quadratic Discriminants
● New Density Estimators:
» Non-parametric
» Mixture of Gaussians
» Quick tour of Expectation Maximization
● Application: Face Detection
» Mixture of Gaussians
Histogram
[Figure: repeated from last lecture — the 20 data values and their histogram; divide by N to yield a probability]
Simple Algorithm
% Initialize one count per histogram bin center
counts = zeros(size(centers));
numdata = size(data,1);
[Figure: the resulting histogram]
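The slide only shows the initialization; a minimal completion of the counting loop, under our assumptions that centers holds the bin centers and data is a column vector (the nearest-center assignment is our choice, not necessarily the original's):

for i = 1:numdata
    % assign the sample to its nearest bin center
    [mindist, bin] = min(abs(centers - data(i)));
    counts(bin) = counts(bin) + 1;
end
prob = counts / numdata;   % divide by N to yield a probability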
Max Likelihood Gaussian
[Plots: six panels — maximum likelihood Gaussian fits to different samples drawn from the same source]

Histograms have lower bias …
Parzen: One Bump per Data Point
[Plot: a Parzen density built by summing one kernel bump per data point]
Parzen Algorithm
% Only fragments of the algorithm survive on the slide:
numdata = size(data, 1);
plot(range, func)
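A minimal sketch of the missing middle of the Parzen algorithm, under our assumptions (a Gaussian kernel of width sigma, an evaluation grid range; the variable names beyond those on the slide are ours):

numdata = size(data, 1);
sigma = 0.1;                        % kernel width: the bias/variance knob
func = zeros(size(range));
for i = 1:numdata
    % add one normalized Gaussian bump centered on each data point
    bump = exp(-(range - data(i)).^2 / (2*sigma^2)) / sqrt(2*pi*sigma^2);
    func = func + bump / numdata;
end
plot(range, func)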
Parzen and Histogram are Similar
● Both can model any type of distribution
» Given plenty of data
● Both are simple
● Parzen is differentiable; Histogram is not
● Parzen is smooth; Histogram is not
● Histogram density: evaluating p(x) is cheap
● Parzen density: evaluation is linear in the data size
[Plot: a Parzen density estimate]
Properties of Non-parametric Techniques
● Density is an analytical function of the data
● Bias and variance of the density estimator can be adjusted to the problem
● Many more parameters must be estimated
» Histogram: N^d bins
● Lose many of the simple properties of Gaussians

Semi-Parametric Models
● Have more flexibility than parametric models
» like Gaussians
● Have less variance than non-parametric models
● Evaluation of p(x) is cheap
● Determination of parameters is expensive
Mixture of Gaussians (generative view):
Flip a coin with P(k = 1) + P(k = 2) = 1; if k = 1, draw x_j from the Gaussian p(x | μ_1, σ_1); if k = 2, draw x_j from p(x | μ_2, σ_2). What is the resulting p(x)?
Face Detection
Sung & Poggio

Results

Face Detection
● Great application of probabilistic classification
» Works very well
» Requires many thousands of parameters
» Computation time is very long
● Is there an Alternative? -> Discriminants
» Also works well
» Requires fewer parameters
» Computation time is very short
Events are Disjoint -> They Add
p(X = x) = p(X = x, J = 1) + p(X = x, J = 2)
         = p(X = x | J = 1) P(J = 1) + p(X = x | J = 2) P(J = 2)
         = p(x | μ_1, σ_1) P(J = 1) + p(x | μ_2, σ_2) P(J = 2)

Expectation Maximization
P(k) ∝ Σ_j P(k | x_j)
μ_k = Σ_j P(k | x_j) x_j / Σ_j P(k | x_j)
σ_k^2 = Σ_j P(k | x_j) (x_j − μ_k)^2 / Σ_j P(k | x_j)
E = −log l({μ_k, σ_k, q_k})
Bounded Below? Decreases?
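A compact Matlab sketch of these EM updates for a two-component 1-D mixture; the initialization and iteration count are our assumptions, not the slides':

x = data(:);  n = length(x);
mu = [min(x); max(x)];  sg = [1; 1];  q = [0.5; 0.5];
for iter = 1:50
    % E-step: responsibilities P(k | x_j)
    for k = 1:2
        lik(:,k) = q(k) * exp(-(x - mu(k)).^2 ./ (2*sg(k)^2)) ./ sqrt(2*pi*sg(k)^2);
    end
    r = lik ./ (sum(lik, 2) * [1 1]);
    % M-step: responsibility-weighted means, variances, mixing weights
    for k = 1:2
        mu(k) = sum(r(:,k) .* x) / sum(r(:,k));
        sg(k) = sqrt(sum(r(:,k) .* (x - mu(k)).^2) / sum(r(:,k)));
    end
    q = sum(r, 1)' / n;
end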
News
● Sorry about missing last week…
» Scheduling hiccup, which pushed Perceptrons out of Pset …
● Pset … will be out by tonight
» Please get started early
● Pset … is due tomorrow
● Cross-grading worked out well
» But we noticed that a few people were not grading carefully
» I would like you to take this task very seriously
Review & Overview
● Lecture 5:
» Non-parametric Density Estimation
– Histograms and Parzen Densities
» Semi-parametric: Mixture of Gaussians
» Application: Face Detection… very complex
● Perceptrons
● Training Perceptrons
● Generalized Perceptrons
● Multi-Layer Perceptrons
● But this is not the only way…
● In fact this approach has come under sustained attack recently
Between density and classification
● Often the details of the density do not matter
Two-Class Gaussian, same Covariance:  y(x) = w^T x + w_0
● Alternatively, you may not know much about the density of your classes
● Construct a function that classifies directly…
Linear Discriminant
y(x) = w^T x + w_0   (w_0 is the bias — Warning!)
[Figure: single-layer network with inputs X_0, X_1, X_2, …, X_d and weights w_0, w_1, w_2, …, w_d]
Folding the bias into the weights:  y(x) = w^T x = Σ_{i=0}^{d} w_i x_i
Multiple Discriminants
y_1(x) = w_1^T x + w_{10},   y_2(x) = w_2^T x + w_{20}
The decision boundary is where y_1(x) = y_2(x):
w_1^T x + w_{10} = w_2^T x + w_{20}
(w_1 − w_2)^T x + (w_{10} − w_{20}) = 0
ŵ^T x + ŵ_0 = 0

… in a single network
y_k(x) = Σ_i w_ki x_i + w_k0
C(x) = C_k if k = argmax_i y_i(x)

Multiple Discriminants
[Figure: decision regions formed as intersections of half planes]
How do we learn linear discriminants?
● What are the principles?
» In density estimation we maximize likelihood
» In classification we minimize errors
● How do we search for the best classifier?
● Will the search have local minima?
E(w) = Σ_j ( y(x_j) − t_j )^2 = Σ_j ( w^T x_j − t_j )^2
[Figure: x's at target +1 and o's at target −1; minimize the squared error]
Quadratic cost is very simple…
E(w) = Σ_j ( y(x_j) − t_j )^2 = Σ_j ( w^T x_j − t_j )^2
In matrix form:
E(W) = (XW − T)^T (XW − T) = W^T X^T X W − 2 W^T X^T T + T^T T
dE(W)/dW = 2 X^T X W − 2 X^T T = 0
X^T X W = X^T T
W = (X^T X)^{-1} X^T T
● Direct linear expression for the weights given the training data
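A two-line Matlab sketch of this closed form; X (one example per row, with a constant column for the bias) and T (targets ±1) are our assumed setup:

% Least-squares discriminant weights via the normal equations
W = (X' * X) \ (X' * T);    % backslash avoids an explicit inverse
y = sign(X * W);            % classify the training examples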
7
What about Gradient Descent?
E(w) = Σ_j ( y(x_j) − t_j )^2 = Σ_j ( w^T x_j − t_j )^2
∂E(w)/∂w = 2 Σ_j ( w^T x_j − t_j ) x_j = 2 Σ_j δ_j x_j
Batch update:  w_t = w_{t−1} − η Σ_j δ_j x_j
The error has many components:  E(w) = Σ_j E_j = Σ_j (δ_j)^2
Pick an example at random and update on it alone (stochastic descent):
w_t = w_{t−1} − η ∂E_j/∂w = w_{t−1} − η δ_j x_j
● Pick Random Example
● Observe Output Error
● Adjust Weights to Reduce Error
[Figure: single-layer network with inputs X_0 … X_d and weights w_0 … w_d]
Can't Always Solve for the Weights…
y(x) = g(w^T x),   g(a) = 1 if a ≥ 0, 0 otherwise
(or g(a) = 1 if a ≥ 0, −1 otherwise)
● Perceptrons: McCulloch and Pitts
» Originally as a model for real neurons

Perceptron
Perceptron Cost Function
E(w) = Σ_j ( g(w^T x_j) − t_j )^2
∂E(w)/∂w = 2 Σ_j ( g(w^T x_j) − t_j ) ∂g(w^T x_j)/∂w
Simple gradient descent does not work: g is a step function, so its derivative is zero almost everywhere.
Perceptron Criterion:
E(w) = −Σ_{errors} (w^T x_j) t_j
∂E(w)/∂w = −Σ_{errors} t_j x_j
w_t = w_{t−1} + η t_j x_j   (descending the gradient on a misclassified example)
Perceptron Learning
[Figure: single-layer perceptron with inputs X_0 … X_d and weights w_0 … w_d]

Real Perceptrons

A classic problem...
[Figure: o's surrounded by x's — a two-class pattern that no single linear discriminant separates]
6.891 Machine Learning and Neural
Networks
Lecture 7:
Multi-Layer Perceptrons
Back Propagation
News
● Pset 3 is on the web
» Includes a classifier "shootout"
» The mystery dataset has 20 dimensions and two classes
» Winner gets $10 of Toscanini's
● Pset 2 looks great …
» Many of you did a lot of work.
Review & Overview
● Lecture 6:
» Linear Discriminants
» Perceptrons
» Training Perceptrons
● Generalized Perceptrons
● Multi-layer Perceptrons
» Multi-Layer Derivatives
» Back Propagation
● Examples:
» NET Talk
Update Rule:  w_t = w_{t−1} − η ∂E_j/∂w = w_{t−1} − η δ_j x_j
Different Criteria…
Squared error:
E(w) = Σ_j ( w^T x_j − t_j )^2
∂E(w)/∂w = 2 Σ_j ( w^T x_j − t_j ) x_j = 2 Σ_j δ_j x_j
w_t = w_{t−1} − η δ_j x_j
Perceptron criterion:
E(w) = −Σ_{errors} (w^T x_j) t_j
∂E(w)/∂w = −Σ_{errors} t_j x_j
w_t = w_{t−1} + η t_j x_j
Normalizing examples (replace x_j by t_j x_j):
w_t = w_{t−1} + η x_j   — for errors only!
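A short Matlab sketch of this normalized perceptron loop; the data setup (X with one example per row, labels t = ±1) and the pass cap are our own assumptions:

Xn = X .* (t * ones(1, size(X,2)));   % normalize: flip the negative examples
w = zeros(size(X,2), 1);
for pass = 1:100
    for j = 1:size(Xn,1)
        if Xn(j,:) * w <= 0           % misclassified (or on the boundary)
            w = w + Xn(j,:)';         % the update rule with eta = 1
        end
    end
end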
The update rule in action...
w_t = w_{t−1} + x_j   (on errors, with η = 1)

Real Perceptrons

A classic problem...
[Figure: the same o's-inside, x's-outside pattern — not linearly separable]
Generalized Perceptron
y(x) = g(w^T x) — can't do that (for this problem)!
Augment the features: x̂ = (x_1, x_2, x_1 x_2)^T with ŵ = (1, 1, −2.1)^T — works great.

Another Generalized Perceptron
Two Dilemmas
● How does one find/define the correct set of features?
● How many will you need?
● 1950's answers:
» Don't know… we'll just think them up.
» Don't know… we'll just keep adding wires.
Multiple Layers
[Figure: two-layer network computing XOR — input units u1 (constant 1), u2 = X1, u3 = X2; hidden units u4, u5 with bias weights −1.5 and −2.5; output y; the weight matrices W shown with 0/1 entries]
How can we learn this??
1980's: Perhaps Gradient Descent?
Replace the hard threshold with a sigmoid:
y(x) = s(w^T x),   s(a) = 1 / (1 + e^{−a})
E(w) = Σ_j ( s(w^T x_j) − t_j )^2
∂s(u)/∂w = s(u) (1 − s(u)) ∂u/∂w
∂E(w)/∂w = 2 Σ_j ( s(w^T x_j) − t_j ) s(w^T x_j) ( 1 − s(w^T x_j) ) x_j

For the two-layer network (inputs u1, u2, u3; hidden units u4, u5; output u6):
y(x) = s( w_64 u_4 + w_65 u_5 )
     = s( w_64 s(w_41 u_1 + w_42 u_2 + w_43 u_3)
        + w_65 s(w_51 u_1 + w_52 u_2 + w_53 u_3) )
E(w) = Σ_j ( y(x_j) − t_j )^2, and the chain rule gives ∂E/∂w for every weight.
Multi-Layer Conventions
u_k = g( Σ_j w_kj u_j ),   a_k = Σ_j w_kj u_j
More Conventions (layer vectors):
u^2 = g( W^21 u^1 ),   u^3 = g( W^32 u^2 )
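A minimal Matlab sketch of one gradient (back propagation) step for this two-layer network; the layer sizes, learning rate eta, and variable names are our assumptions:

s = inline('1 ./ (1 + exp(-a))');
% forward pass
a1 = W21 * u1;  u2 = s(a1);        % hidden layer
a2 = W32 * u2;  y  = s(a2);        % output layer
% backward pass: deltas from the squared error (y - t)^2
d2 = 2 * (y - t) .* y .* (1 - y);
d1 = (W32' * d2) .* u2 .* (1 - u2);
% gradient descent on both weight matrices
W32 = W32 - eta * d2 * u2';
W21 = W21 - eta * d1 * u1';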
Solving XOR (big deal?)

Very Simple Solution
95% Accurate
6.891 Machine Learning and Neural
Networks
Lecture 8:
Back Prop and Beyond
News
● Mid-term will be on 10/20
» Here in this room.
» It should take about 1 hour… but we will give you 1.5
– Show up on time, please.
» Coverage: Psets 1, 2 and 3.
– Density estimation (Parametric, Semi- and Non-parametric)
– Bayesian Classification
– Discriminants (Linear, Perceptron, Multi-layer)
Review & Overview
● Lecture 7:
» Multi-Layer Derivatives
» Back Propagation
» Examples:
– NET Talk
Intensity Preprocessing
Training Data
Positives
Negatives

Performance

MLP: How Powerful?
[Figure: block diagram — Inputs/Materials feed a Control box and a Plant, producing Outputs/Products]
1990: The height of MLP's and Back Prop
● Multi-layer perceptrons can solve any approximation problem (in principle)
» Given 3 layers
» Given an infinite number of units and weights
● There is no direct technique for finding the weights (unlike linear discriminants)
● Gradient descent (using Back Prop) comes to dominate discussion in the Neural Net community
» Can you find a good set of weights quickly?
– How can you speed things up?
» Will you get stuck in local minima?
● A small group in the community also worries about generalization.
Simplest case: E(w) = (wx − y)^2 with x = 1, y = 0, so
E(w) = (w − 0)^2 = w^2,   ∂E/∂w = 2w
Gradient step:  w_t = w_{t−1} − η ∂E/∂w_{t−1}; with η = 1/2 it converges in one step.
Scale the Input
● Simplest Case: 1 weight, quadratic error function
Doubling the input (x = 2) rescales the error surface:
E(w) = (2w − 0)^2 = 4w^2,   ∂E/∂w = 8w
so the same η now overshoots:  w_t = w_{t−1} − η ∂E/∂w_{t−1}

Multiple Weights
Hack 2: Momentum
[Figure: gradient descent zig-zagging down a narrow valley; step sizes 0.020, 0.047, 0.049, 0.050]
Momentum
Δw_t = −η ∂E/∂w_{t−1} + α Δw_{t−1},   w_1 = w_0 + Δw
For a quadratic E(w) = aw^2 + bw + c:   E′ = 2aw + b,   E″ = 2a
A Newton-style step uses the curvature:
Δw = −E′/E″ = −(w_0 + b/2a),  so  w_1 = w_0 − (w_0 + b/2a) = −b/2a
(the minimum, reached in one step).
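A small Matlab sketch of the momentum update on a 1-D quadratic; eta, alpha, and the loop length are our choices:

a = 2; b = -4;                       % E(w) = a*w^2 + b*w
w = 0; dw = 0;
eta = 0.1; alpha = 0.9;
for t = 1:100
    grad = 2*a*w + b;                % E'(w)
    dw = -eta * grad + alpha * dw;   % momentum update
    w = w + dw;
end
% w approaches -b/(2a) = 1, the minimum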
More Principled Hacks...
● Second Order Techniques
» N weights --> N^2 Hessian entries
» Also destabilizes learning
● Line Search
» Expensive but hard to beat

Local Minima
● Number of Papers
» 1000's of local minima in simple problems (XOR)
Bias and Variance
● How many layers are right?
● How many units per layer?
● What about structural constraints?

ALVINN
Pomerleau

No Hands Across America

Zip Codes
Le Cun
6.891 Machine Learning and Neural
Networks
Lecture 9:
On to Support Vector Techniques
News
● Final will be 12/13 at 1:30PM
» If you have a conflicting final let us know.
● Remember that almost all the material appears in the book…
» Right now we are jumping back and forth between
– Chapter 5
– Chapter 6
Review & Overview
● Lecture 8:
» Multi-layer Perceptrons
» Back propagation
» Hacks (… many)

History Lesson
● 1950's: Perceptrons are cool
» Very simple learning rule, can learn "complex" concepts
» Generalized perceptrons are better -- too many weights
● 1960's: Perceptrons stink (M+P)
» Some simple concepts require exponential # of features
– Can't possibly learn that, right?
● 1980's: MLP's are cool (R+M / PDP)
» Sort of simple learning rule, can learn anything (?)
» Create just the features you need
● 1990: MLP's stink
» Hard to train: Slow / Local Minima
● 1996: Perceptrons are cool
Why did we need multi-layer perceptrons?
n: variables,  k: order of the polynomial
C(n + k, k) = (n + k)! / (k! n!) ∈ O( min(n^k, k^n) ) features
14th Order??? 120 Features
N = 21, k = 5 --> 65,000 features
MLP's vs. Perceptron
● MLP's are incredibly hard to train…
» Takes a long time (unpredictably long)
» Can converge to poor minima
● MLP's are hard to understand
» What are they really doing?

Perceptron Training is Linear Programming
Σ_i w_i x_i^l > 0   ∀l
• After normalization
• After adding bias
• Assumes no errors
Polynomial time in the number of variables and in the number of constraints.
With slack variables for errors:
Σ_i w_i x_i^l + s_l > 0  ∀l,   min Σ_l s_l,   s_l > 0  ∀l
Rebirth of Perceptrons
● How to train efficiently.
» Linear Programming (… later quadratic programming)
● How to get so many features inexpensively?!?
● How to generalize with so many features?
» Occam's revenge.
Perceptron training only visits the errors: w_0 = 0, Δw_t = η x_t, so
w_t = Σ_{errors} η x_l = Σ_l b_l x_l
and with augmented features:  w_t = Σ_l b_l Φ(x_l)
Lemma 2: Only need to compare examples
Writing w = Σ_j b_j Φ(x_j) turns every constraint into kernel evaluations:
Σ_j b_j K(x_l, x_j) + s_l > 0  ∀l,   min Σ_l s_l,   s_l > 0  ∀l
min Σ_j b_j^2   (Smoother)
Linear Program is not unique
If Σ_i ŵ_i x_i^l > 0 ∀l, then Σ_i (λŵ_i) x_i^l > 0 ∀l for any λ > 0.
Σ_i w_i x_i^l + s_l > 0  ∀l,   min Σ_l s_l
Require non-zero margin
Σ_i w_i x_i^l + s_l > 0  ∀l   — allows solutions with zero margin.
Σ_i w_i x_i^l + s_l > 1  ∀l   — enforces a non-zero margin between examples and the decision boundary.

Constrained Optimization
Σ_j b_j K(x_l, x_j) + s_l > 1  ∀l,   min Σ_l s_l,   s_l > 0  ∀l
min Σ_j b_j^2
Constrained Optimization 2
x^3 is inactive
Support Vectors

SVM: examples
» Minimize ν w^T w = ν Σ_i w_i^2
● Introduce the margin so that the constraints become a set of linear inequalities
● Find the best solution using Quadratic Programming
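A hedged Matlab sketch of that quadratic program using the Optimization Toolbox's quadprog; the hard-margin form and the data setup (kernel matrix K, labels t = ±1) are our assumptions:

% minimize 0.5*b'*H*b  subject to  -diag(t)*K*b <= -1,
% i.e.  t_l * sum_j b_j K(x_l, x_j) >= 1 for every example l
H = eye(n);                 % penalizes sum of b_j^2
f = zeros(n, 1);
A = -diag(t) * K;           % K(l, j) = K(x_l, x_j)
c = -ones(n, 1);
b = quadprog(H, f, A, c);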
SVM: Difficulties
● How do you pick the kernels?
» Intuition / Luck / Magic / …
∀j:  (2t_j − 1) Σ_i w_i K(x_j, c_i) ≥ 1 − ε_j
min( w^T w + c Σ_j |ε_j| )   (Slack Variables)
Example: 6 weights
● Data dimension: 2
● Feature Space: 2nd order polynomial
» 4 dimensional
SVM versus Perceptron
● Why not just use a perceptron?
» Use all training points as centers
y(x) = Θ( Σ_i w_i K(x, c_i) )
Perceptron update on an error:  w_i <- w_i + η K(x, c_i)

Zip Codes

SVM: Faces
Support Vectors
6.891 Machine Learning and Neural
Networks
Lecture 10:
Support Vector Machines
More Details and Derivations
News
● Quiz is 1 week from today.
Pset 2
● SVM review
● Why is it called "Support Vectors"??
● Derivation of some simpler properties.
SVM: Key Ideas
● Augment inputs with a very large feature set Φ(x)
» Polynomials, etc.
● Use the Kernel Trick(TM) to do this efficiently
● Enforce/Encourage smoothness with a weight penalty
» Minimize b^T b = Σ_i b_i^2 — avoid b_i ≠ 0 for all i!
min( w^T w ) subject to the constraints  ∀j: Σ_i b_i K(x_j, c_i) ≥ 1
y(x) = Θ( Σ_i b_i K(x, c_i) )
● Many of the b's are zero -- inactive constraints
» Only keep the examples where b_i ≠ 0 (the Support Vectors)
● Likely to generalize well
» VC Dimension -- later in the semester
An alternative motivation
● Like all good ideas, Support Vector Machines can be motivated in several different ways.
The optimal dividing line…
● The optimal separator maximizes the margin between positive and negative examples
d_− = max_{negatives} w^T x_i
d_+ = min_{positives} w^T x_i
margin = (d_+ − d_−) / |w|
max_w (margin) = max_w (d_+ − d_−) / |w|
Optimal dividing line = Support Vectors
Fix the scale with the constraints
∀negatives: w^T x_i ≤ −1,   ∀positives: w^T x_i ≥ 1
Then maximizing the margin (d_+ − d_−)/|w| is equivalent to  min w^T w,
and the solution can be written  w = Σ_l b_l x_l  or  w = Σ_l b_l Φ(x_l).
Lemma 1: Kuhn-Tucker Conditions
min( w^T w )  subject to  w^T x^1 ≥ 1,  w^T x^2 ≥ 1,  w^T x^3 ≤ −1
At the solution only the active constraints matter:
w^T x^1 = 1,  w^T x^2 = 1,  and  w = b_1 x^1 + b_2 x^2
y(x) = Θ( Σ_i w_i K(x, c_i) )
» Update using the perceptron rule:
Perceptrons are not smooth…

SVM: Faces
Support Vectors

SVM: Difficulties
● How do you pick the kernels?
» Intuition / Luck / Magic / …
∀j:  Σ_i b_i K(x_j, c_i) + s_j ≥ 1
min( b^T b + c Σ_j |s_j| )   (Slack Variables)
SVM: Generalization??
● Is there a formal proof that SVM's will work better than Perceptrons or MLPs??
» Perhaps…
● There is a tenuous relationship between maximizing the margin and reducing the complexity of the classifier.
» The complexity of the classifier is reduced to the number of support vectors.
» Hard problems require more support vectors.
● The VC-Dimension of a support vector machine is controlled by maximizing the margin.

Can we regain the simplicity of Perceptrons?
6.891 Machine Learning and Neural
Networks
Lecture 11:
More Kernel Networks
News
● Matlab was down at the AI lab for a few hours.
» I am not terribly sympathetic… since it was after the official deadline for the pset.
» Just hand it in as soon as you can.
Review & Overview
● Lecture 10:
» The Support in Support Vectors
» The Margin is a key concept
d_− = max_{negatives} w^T x_i,   d_+ = min_{positives} w^T x_i
∀negatives: w^T x_i ≤ −1,   ∀positives: w^T x_i ≥ 1
max margin (d_+ − d_−)/|w|   <=>   min w^T w
Optimal dividing line = Support Vectors
With the constraints active at the support vectors:
d_− = −1/|w|,   d_+ = 1/|w|
d_+ − d_− = 1/|w| − (−1/|w|) = 2/|w|
so maximizing the margin (equivalently 1/(w^T w)) is the same as  min w^T w.

This ends up being exactly like polynomial fitting… except that there is one weight per data point.
Radial Basis Function Networks
K(x, c) = K(|x − c|),   y(x) = Σ_i b_i K(|x − c_i|)
[Plots: a handful of target values on [0, 10] — how should we interpolate between them?]
Intuition
Setting up the problem
Least squares for the coefficient vector Y:
Cost = (WY − T)^2,   Y = (W^T W)^{-1} W^T T   — Not Invertible!
Here T is the target vector (values 1, 3, 1 at the three sampled positions, 0 elsewhere) and W is the sparse 0/1 matrix that picks out the sampled positions.

Add a penalty on the size of the solution:
Cost = (WY − T)^2 + λ Y^T Y,   Y^T Y = Σ_i y_i^2
Y = (W^T W + λI)^{-1} W^T T
Small solution vectors are best.
[Plot: the resulting estimate — nonzero only at the sampled positions]
… and the winner is?
[Plots: three candidate reconstructions of the same data — a spiky least-squares solution, a smoother one, and a Bayesian estimate]

Derivative Measures Smoothness
Need to find lambda …
[Figure: with λ = 0.001 the solution Y interpolates the targets (values 1.0, 1.5, 2.0, 2.5, 3.0, 2.6, 2.2, 1.8, 1.4, 1.0); with λ = 10 it is much flatter (values between about 1.56 and 1.84)]
A Closer Look
[Plots: the regularized solutions and the individual kernel/basis functions that compose them]
Still Piecewise Cubic
[Plots: the fitted function, its kernels, and their pieces — all piecewise cubic]
Regularization to RBF's
E(y) = Error(y) + Smoothness(y)
[Figure: one Gaussian bump placed at every training point]

Too Many Centers 2
● Put them where you need them…
» To best approximate your function
6.891 Machine Learning and Neural
Networks
Lecture 12:
Smooth Functions and Kernel Networks
News
● Quiz was too hard…
» I am trying to come up with a creative grading scheme.
– Best 5 out of 6 problems???
– First let us do the grading.
● Problem set will be out by tonight.
Review & Overview
● Lecture 11:
» Trying to find smooth functions.
● Smooth Regression
» Another way of motivating Kernel networks
[Plots: a function, its derivative, and the smoothness penalty it incurs]
Regression Review
● Up until now we have been mostly analyzing classification:
» X, inputs. Y, classes. Find the best c(x).
● Today: Regression.
» X, inputs. Y, outputs. Find the best f(x).
» Predict the stock's value next week.
» "Picture of Road" -> "Car steering wheel"
» etc.
min_w Σ_j ( f(x_j, w) − y_j )^2,   e.g.  f(x, w) = w_0 + w_1 x + w_2 x^2 + …
● Bayesian Approach
» Find the most likely function:
max_f p(f | {x_j, y_j}) = max_f p({x_j, y_j} | f) p(f) / p({x_j, y_j})
Bayesian framework captures many approaches
max_f p(f | {x_j, y_j}) = max_f p({x_j, y_j} | f) p(f) / p({x_j, y_j})
max_f [ log p({x_j, y_j} | f) + log p(f) − log p({x_j, y_j}) ]
log p({x_j, y_j} | f) = log Π_j p(x_j, y_j | f) = Σ_j log p(x_j, y_j | f) = −Σ_j c ( f(x_j) − y_j )^2
Choices of prior:
p(f) = ε if f is a poly, 0 otherwise
log p(f) = −∫ |∂f/∂x|^2
Also popular…  log p(f) = −∫ |∂^2 f/∂x^2|^2
A closer look...
C(f) = Σ_j c ( f(x_j) − y_j )^2 + λ ∫ |∂f/∂x|^2
Data:  X = 1, 5, 10;  Y = 1, 3, 1
● How do we minimize this function?
» The set of possible functions is infinite
» The space of functions is infinite dimensional
Set  dC(f)/df = 0
We could approximate f.
[Plots: the same function discretized at 10, 20, 100, and 1000 sample points]
[Plots: a handful of target values on [0, 10] — how should we interpolate?]
Intuition
Setting up the problem
Cost = (WF − Y)^2,   F = (W^T W)^{-1} W^T Y   — Not Invertible!
Y is the target vector (1, 3, 1 at the three sampled positions), W picks out the sampled positions, and F is the discretized function we solve for.
Cost = (WF − Y)^2 + λ F^T F,   F^T F = Σ_i f_i^2
F = (W^T W + λI)^{-1} W^T Y
Small solution vectors are best.
[Plot: the minimizer is zero except at the sampled points]

… and the winner is?
[Plots: competing reconstructions, including a Bayesian one]

The smoothness version of the cost:
C(f) = Σ_j ( f(x_j) − y_j )^2 + ∫ |∂f(x̂)/∂x̂|^2 dx̂
Setting up the Problem
[Figure: with λ = 0.001 the discretized solution F interpolates the data (1.0 … 3.0 … 1.0); with λ = 10 it flattens toward the mean (values near 1.56-1.84)]
Derivative Order Controls Shape
[Plots: minimizing ∫|∂f/∂x|^2 gives a piecewise linear solution; minimizing ∫|∂^2 f/∂x^2|^2 gives a smoother, spline-like solution]

A Closer Look
[Plots: the two solutions and their basis functions]
Look at the regularizer...
Cost = (WF − T)^2 + λ (DF)^2,   F = (W^T W + λ D^T D)^{-1} W^T Y
D is the first-difference matrix:
D =
 1 -1  0  0  0  0  0  0  0  0
 0  1 -1  0  0  0  0  0  0  0
 0  0  1 -1  0  0  0  0  0  0
 0  0  0  1 -1  0  0  0  0  0
 0  0  0  0  1 -1  0  0  0  0
 0  0  0  0  0  1 -1  0  0  0
 0  0  0  0  0  0  1 -1  0  0
 0  0  0  0  0  0  0  1 -1  0
 0  0  0  0  0  0  0  0  1 -1
and D'*D is the discrete second derivative:
 1 -1  0  0  0  0  0  0  0  0
-1  2 -1  0  0  0  0  0  0  0
 0 -1  2 -1  0  0  0  0  0  0
 0  0 -1  2 -1  0  0  0  0  0
 0  0  0 -1  2 -1  0  0  0  0
 0  0  0  0 -1  2 -1  0  0  0
 0  0  0  0  0 -1  2 -1  0  0
 0  0  0  0  0  0 -1  2 -1  0
 0  0  0  0  0  0  0 -1  2 -1
 0  0  0  0  0  0  0  0 -1  1
Using the second difference as D instead, D'*D becomes the discrete fourth derivative (rows of the form … 1 -4 6 -4 1 …).
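A short Matlab sketch of this discrete regularized solve; the grid size, λ, and sample positions are our example choices, not the slides':

m = 50;  lambda = 1;
D = diff(eye(m));            % first-difference matrix, (m-1) x m
idx = [1 25 50];  Y = [1; 3; 1];
W = zeros(length(idx), m);
for j = 1:length(idx)
    W(j, idx(j)) = 1;        % W picks out the sampled positions
end
F = (W'*W + lambda * (D'*D)) \ (W'*Y);
plot(F)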
What about continuous functions??
C(f) = λ ∫ |∂f/∂x|^2,   ∀x: ∂C(f)/∂f(x) = 0   (an infinite number of derivatives, one per x)
δC(x) = ∂C(f)/∂f(x) = lim ( C(f + δ_x) − C(f) ) / |δ_x|
[Plots: perturbing f at a single point and watching the cost change]
Still Piecewise Cubic
[Plots: the regularized solutions and their kernels — piecewise cubic curves]
Regularization to RBF's
E(y) = Error(y) + Smoothness(y)
[Figure: one Gaussian bump at every training point]

Too Many Centers 2
● Put them where you need them…
» To best approximate your function
f(x) = Σ_j w_j K(x, x_j)   — many j's are zero!!!
6.891 Machine Learning and Neural
Networks
Lecture 13:
Kernel Networks
… on to Unsupervised Learning
News
● Quizzes are graded…
» Each problem has been graded.
» ** The overall score for the quiz is being determined.
– We ran out of time last night.
● Course grading: (approximate)
» Psets: 35%
» Quiz: 20%
» Final: 30%
» Project: 10%
» Participation: 5%

Pset 3
Exams
Grading alternatives…
Calculus of Variations
δC(x) = ∂C(f)/∂f(x) = ( C(f + δ_x) − C(f) ) / |δ_x|
For the pure smoothness cost C(f) = ∫ |∂f/∂x|^2:
δC(x) ∝ f″(x) = 0   =>   f(x) = ax + b
With the data term, C(f) = Σ_j ( f(x_j) − y_j )^2 + λ ∫ |∂f/∂x|^2:
δC(x) = λ f″(x) + Σ_j 2 ( f(x_j) − y_j ) δ(x − x_j) = 0
f″(x) = −(1/λ) Σ_j 2 ( f(x_j) − y_j ) δ(x − x_j)
A Closer Look
[Plots: the solution and its kernels]

Still Piecewise Cubic
[Plots: fitted curves and basis functions — piecewise cubic]
The kernel solves the variational problem:
C(f) = Σ_j ( f(x_j) − y_j )^2 + λ ∫ |∂f/∂x|^2
f″(x) = Σ_j a_j δ(x − x_j)   — so guess   f(x) = Σ_j b_j K(x, x_j)
Then f″(x) = Σ_j b_j ∂^2 K(x, x_j)/∂x^2, and we need
∂^2 K(x, x_j)/∂x^2 = δ(x − x_j)   =>   K(x, x_j) = |x − x_j|

Cubics are similar...
C(f) = Σ_j ( f(x_j) − y_j )^2 + λ ∫ |∂^2 f/∂x^2|^2
f⁗(x) = Σ_j a_j δ(x − x_j),   f⁗(x) = Σ_j b_j ∂^4 K(x, x_j)/∂x^4
∂^4 K(x, x_j)/∂x^4 = δ(x − x_j)   =>   K(x, x_j) = |x − x_j| (x − x_j)^2

E(y) = Error(y) + Smoothness(y)
[Figure: one Gaussian at every training point]
Smoothness is easily controlled
[Plots: kernel widths trading off fit against smoothness]

Too Many Centers 2
● Put them where you need them…
» To best approximate your function
SVM regression with an ε-insensitive constraint:
f(x) = w^T x + b,   min w^T w   subject to
(w^T x_j + b) − y_j ≤ ε   and   y_j − (w^T x_j + b) ≤ ε
SVM Regression
Cost(f) = c Σ_j | w^T x_j + b − y_j |_ε + w^T w
f(x) = Σ_j w_j K(x, x_j)   — many w_j's are zero!!!
New Topic: Unsupervised Learning
● What can you do to "understand" data when you have no labels?
» Find unusual structure in the data.
» Find simplifications of the data.
● Find the clusters in the data:
» Fit a mixture of gaussians…
– Been there, done that.
● Reduce the dimensionality of the data:
» Find a linear projection from high to low dimensions
● These all amount to density estimation
● There are many other approaches
» Build a tree which captures the data, etc.
6.891 Machine Learning and Neural
Networks
Lecture 14:
… on to Unsupervised Learning
News
● I will try to give you a feeling for where we are headed:
» Next 4 lectures
– Bayes Nets / Graphical Models / Boltzmann Machines / HMM's
» After that a series of topics (… from papers).
Review & Overview
● Lecture 13:
» The end of regression…

Exploratory Data Analysis
● Machine learning is simply not that smart…
● It is still very important to look at the data.
● But when there are millions of examples and thousands of dimensions you cannot look at the data.
Example of Clustering
● Can I get some examples of clustering???
● PDP??
● Andrew Moore

Speed Learning???
● Regression & Kernel Networks:  f(x) = Σ_j b_j K(x, x_j)

Support Vector Machines:  f(x) = Σ_j b_j K(x, x_j)
Curse of Dimensionality of Nearest Neighbor
● How far is it to your nearest neighbor??
» Easier Question: How far do you have to look before expecting 1 neighbor?
Vol(S_k(r)) = c_k r^k

Dimensionality reduction:  y = Wx
● Where W has fewer rows than columns…

First Eigenvector preserves more info...
256,000 Numbers -> 4,000 Numbers
Dimensionality Reduction

Information Theory for Signal Separation
[Figure: sound sources recorded by microphones, then unmixed]

Let's look at data
[Plots: unmixed vs. mixed signals, and the PCA "unmixing" attempt]
Mathematical Assumptions
M = AS
● Assumptions:
» Sound travels instantaneously
» Sound mixes linearly
» Signals are independent

The Unmixing Problem
● We would like to undo the mixing:  Ŝ = A^{-1} M = A^{-1} A S
y = g(Wx)
● Where W has fewer rows than columns…
Choose W to maximize information:  max_W MI( g(Wx), x )  =>  W = independent components
ICA
[Plot: the unmixed signals]
Learning Rule
ΔW = ( W^{-T} + (1 − 2y) x^T ) W^T W
   = W + (1 − 2y) x^T W^T W
   = W + (1 − 2y) u^T W,   where u = Wx and y = g(u)
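A hedged Matlab sketch of one step of this natural-gradient ICA rule (Bell-Sejnowski style); the per-sample update, learning rate eta, and logistic nonlinearity are our assumptions:

% x: one mixed sample (column), W: current unmixing matrix
u = W * x;
y = 1 ./ (1 + exp(-u));                     % logistic nonlinearity g(u)
dW = (eye(size(W)) + (1 - 2*y) * u') * W;   % the rule above
W = W + eta * dW;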
6.891 Machine Learning and Neural
Networks
Lecture 15:
Reasoning and Learning on Discrete Data
Bayes Nets
News
● Final Problem Set will be ready tomorrow
» Mostly Bayes Nets
● Please begin to think about your final project
Review & Overview
● Lecture 14:
» Principal Components Analysis
– A low dimensional projection can summarize data
» Independent Components Analysis
– An alternative to PCA which can pick out the independent sources of data.
● Bayes Nets
» Meeting of the minds
– Artificial Intelligence and Machine Learning
» Represents symbolic knowledge and reasoning
» Principled mechanism for inference and learning
– Bayes Rule
Artificial Intelligence
● Build systems that reason about the world:
» Diagnosis
– "Why won't my car start?"
» Goal directed behavior
– "How can I get from here to the White House?"
– Space Probe: "How do I change orbit, take photos of Mars, and communicate with Earth in the next 5 minutes?"
– "How can I symbolically integrate this function?"
» Game Playing
– "How can I beat Kasparov?"
● Biases:
» Symbolic data and symbolic problems (not continuous)
» No representation of uncertainty or probability.
Techniques in Artificial Intelligence
● Write down a set of rules that govern the world
» If I get on a plane to Wash. DC then I will end up in DC.
» If I take a taxi to Logan then I will end up at Logan.

Probabilistic Reasoning is Optimal
● What we really want is to reason with the laws of probability:
» The probability that I will get to the White House is:
– The probability of the conjunction of events
● Get packed
● Get to Logan
● Catch plane
● Arrive in DC
● Get Taxi to White house
A Probabilistic Approach
Probability distribution over our 3 Events: P(A, M, C) — Arguments, Minsky, and Chomsky
P(A) = Σ_{x,y} P(A, M = x, C = y) = Σ_{M,C} P(A, M, C)
P(A | M = m) = P(A, M = m) / P(M = m) = P(A, m) / P(m) = Σ_C P(A, m, C) / Σ_{A,C} P(A, m, C)

Observe Data -> Probability of Events
Problems with Naïve Probability
M C A  Probability
0 0 0  0.684
0 0 1  0.171
0 1 0  0.0315
0 1 1  0.036
1 0 0  0.0665
1 0 1  0.0285
1 1 0  0.00005
1 1 1  0.00045
● Way too many variables:
» 2^N variables (minus 1)
» Occam wouldn't like this
● Lots of computation:
» P(M) requires O(2^(N-1))
» P(A|M) requires O(2^(N-1))
P(??) = 0.23595
P(A, M, C) ≡ P(M) P(C|M) P(A|M, C)   — 7 table entries become 1 + 2 + 4
P(A, M, C) ≈ P(M) P(C) P(A|M, C)   — 7 table entries become 1 + 1 + 4
Removing Links
P(A, M, C) ≡ P(M) P(C|M) P(A|M, C)   — the full graph over M, C, A
P(A, M, C) ≈ P(M) P(C) P(A|M, C)   — drop the M -> C link

An Efficient Representation
• Draw a directed acyclic graph; label each node with its conditional probability table (e.g. P(M) = 0.1, P(C) = 0.05)

Much more efficient representations
[Figure: six-node network M, C -> A, B -> E, F]
2^6 − 1 = 63 parameters vs. (2·1) + (2·2) + (2·4) = 14
Additional Example 1
● You are waiting for an appointment with Holmes (H) and Watson (W).
● Both are very poor drivers and are likely to avoid driving if the roads are icy (I).
● It is winter, so the probability of I is high: 0.5.
● H and W are dependent…
» Unless you know the road conditions
[Figure: network I -> W, I -> H]
Additional Example 2
Wet Grass
● The lawn of home B is either wet or not (Wb).
» This could have been caused by rain (R) or the sprinkler (S)
● The lawn of home A is either wet or not (Wa).
» Home A has no sprinklers so rain is the only cause.
[Figure: network R -> Wa, R -> Wb, S -> Wb]
Additional Example 3
Earthquake or Burglar
● Home Alarm (A)
● Neighbor reports the Alarm (N)
● Burglary (B): 0.001
● Earthquake (E): 0.0000001
● Radio Report of Earthquake (R)
● You receive a call from your neighbor saying that your Alarm is going off.
● You drive home to confront the burglar… on the drive you hear a radio report of an Earthquake.
[Figure: network B -> A, E -> A, E -> R, A -> N]
Reasoning
● In some cases computation time is not changed:
» We have re-written the joint distribution
» Some reasoning still requires large summations...
P(A) = Σ_{x,y} P(A, M = x, C = y) = Σ_{M,C} P(A, M, C)
P(A | M = m) = P(A, m)/P(m) = Σ_C P(A, m, C) / Σ_{A,C} P(A, m, C)
For the six-node network:
P(e | a) = Σ_{B,C,D,F} P(a, B, C, D, e, F) / Σ_{B,C,D,E,F} P(a, B, C, D, E, F)
P(A, B, C, D, E, F) = P(M) P(C) P(A|M, C) P(B|C) P(E|A) P(F|A, B)
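A tiny Matlab sketch of this kind of brute-force marginalization on the 2x2x2 joint table above; the array layout is our assumption:

% P(A=1 | M=1): sum the joint over C, then normalize over A
P = zeros(2,2,2);           % P(m,c,a), index 1 = false, 2 = true
P(1,1,1) = 0.684;  P(1,1,2) = 0.171;  P(1,2,1) = 0.0315;  P(1,2,2) = 0.036;
P(2,1,1) = 0.0665; P(2,1,2) = 0.0285; P(2,2,1) = 0.00005; P(2,2,2) = 0.00045;
num = sum(P(2,:,2));        % sum over C with M = 1, A = 1
den = sum(sum(P(2,:,:)));   % sum over C and A with M = 1
PA_given_M = num / den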
Junction Tree Algorithm 1
● Table arithmetic:
P(X, Y, Z) = P(X) P(Y|X) P(Z|Y)
∀a, b, c:  P(X = a, Y = b, Z = c) = P(X = a) P(Y = b|X = a) P(Z = c|Y = b)
P(X):    X 0.4, NOT X 0.6
P(Y|X):  X: Y 0.9, NOT Y 0.1;  NOT X: Y 0.3, NOT Y 0.7
P(Z|Y):  Y: Z 0.2, NOT Z 0.8;  NOT Y: Z 0.0, NOT Z 1.0
[Figure: the six-node graph is moralized and triangulated, yielding a junction tree with cliques CM, AC, AE, ABF and separators C, A, A]

More Junction Trees
[Figure: another junction tree example]
6.891 Machine Learning and Neural
Networks
Lecture 16:
More Bayes Nets
News
● Half of pset 5 is done
» and on the web.
» Other half will be done over the weekend.
Review & Overview
● Lecture 15:
» Bayes Nets
– Meeting of the minds
● Artificial Intelligence and Machine Learning
– Represents symbolic knowledge and reasoning
– Principled mechanism for inference and learning
● Bayes Rule
Bayesian Text Classification
{d_k}: A collection of documents
W_i(d_k) = 1 if d_k contains word i, 0 otherwise
P(F_1 = f_1, F_2 = f_2, …| C = c_j)   — 2^N probs
= P({f_1 … f_N}| C = c_j)
≡ Π_i P(F_i = f_i | C = c_j)   (Assume Independence)
P({f_i}|C_j) = Π_i P(F_i = f_i | C_j)
[Figure: class node C (one of N classes) with children F1, F2, F3, F4, …, FN; each edge carries P(F_i|C)]
P(c_j|{f_i}) = P(c_j) Π_i P(f_i|c_j) / Π_i P(f_i)
More Complex Models are "Easy"
What if documents could be "about" two different topics at once:
- like Politics and Sports
[Figure: two topic nodes P and S both feeding features F1, F2, …, FN, with tables P(F_i|P, S)]

An Efficient Representation
• Draw a directed acyclic graph; label each node with its conditional probability table (e.g. P(M) = 0.1, P(C) = 0.05)

Much more efficient representations
[Figure: six-node network M, C -> A, B -> E, F]
2^6 − 1 = 63 parameters vs. (2·1) + (2·2) + (2·4) = 14
Additional Example 1
● You are waiting for an appointment with Holmes (H) and Watson (W).
● Both are very poor drivers and are likely to avoid driving if the roads are icy (I).
● It is winter, so the probability of I is high: 0.5.
● H and W are dependent…
» Unless you know the road conditions

Additional Example 2
Wet Grass
● The lawn of home B is either wet or not (Wb).
» This could have been caused by rain (R) or the sprinkler (S)
● The lawn of home A is either wet or not (Wa).
» Home A has no sprinklers so rain is the only cause.

Additional Example 3
Earthquake or Burglar
● Home Alarm (A)
● Neighbor reports the Alarm (N)
● Burglary (B): 0.001
● Earthquake (E): 0.0000001
● Radio Report of Earthquake (R)
● You receive a call from your neighbor saying that your Alarm is going off.
● You drive home to confront the burglar… on the drive you hear a radio report of an Earthquake.
Reasoning
● In some cases computation time is not changed:
» We have re-written the joint distribution
» Some reasoning still requires large summations...
P(A) = Σ_{x,y} P(A, M = x, C = y) = Σ_{M,C} P(A, M, C)
P(A | M = m) = P(A, m)/P(m) = Σ_C P(A, m, C) / Σ_{A,C} P(A, m, C)
For the six-node network (D, C -> A, B -> E, F):
P(e | a) = Σ_{B,C,D,F} P(a, B, C, D, e, F) / Σ_{B,C,D,E,F} P(a, B, C, D, E, F)
P(A, B, C, D, E, F) = P(D) P(C) P(A|D, C) P(B|C) P(E|A) P(F|A, B)
Sometimes reasoning is more efficient
P(e | c) = Σ_{A,B,D,F} P(A, B, c, D, e, F) / Σ_{A,B,D,E,F} P(A, B, c, D, E, F)
Add the evidence that C = c. Observe the marginal of E.
Push the sums inside the factorization:
Σ_{a,b,d,f} P(A, B, c, D, E, F)
= P(c) Σ_a P(E|A) Σ_d P(D) P(A|D, c) Σ_b P(B|c) Σ_f P(F|A, B)

Saving Work
Each inner sum is computed once and cached as a table:
= P(c) Σ_a P(E|A) Σ_d P(D) P(A|D, c) Σ_b P(B|c) T1_AB      (T1_AB = Σ_f P(F|A, B))
= P(c) Σ_a P(E|A) Σ_d P(D) P(A|D, c) T2_A
= P(c) Σ_a P(E|A) Σ_d T3_AD
= P(c) Σ_a P(E|A) T4_A
= P(c) Σ_a T5_AE
and collapsing to a final table over E:
Σ_{a,b,d,f} P(A, B, c, D, e, F) = P(c) T_E(e)
Hidden Markov Model
[Figure: hidden chain A -> B -> C -> D over time, with observations F, G, H, I hanging off each state]
P(A, B, C, D, F, G, H, I)
= P(A) P(F|A) P(B|A) P(G|B) P(C|B) P(H|C) P(D|C) P(I|D)
Table arithmetic, as before:
P(X, Y, Z) = P(X) P(Y|X) P(Z|Y)
∀a, b, c:  P(X = a, Y = b, Z = c) = P(X = a) P(Y = b|X = a) P(Z = c|Y = b)
Junction Tree Algorithm: Graph Hacking
[Figure: moralize and triangulate the six-node graph; cliques CM, AC, AE, ABF joined through separators C, A, A]

From Junction Trees to Probability
[Figure: the junction tree next to the original graph]
6.891 Machine Learning and Neural
Networks
Lecture 17:
Hidden Markov Models
& Other Bayes Nets
News
● Problem Set 5 complete on Monday
Review & Overview
● Lecture 16:
» Bayes Nets
– An efficient way to represent joint probability distributions
– Allow reasoning about subtle and conflicting evidence
– Allow reasoning with partial information
» Structure implies Reasoning Efficiency
– Dependence structure allows for more efficient reasoning
– Dynamic programming
● Markov Processes
● Hidden Markov Models
» Speech
A brief overview of speech recognition

Differing representations

The phonemes

The digits
Speech Spectrogram
{x_j, y_j} Training Data
Speech Difficulties
● Rate of speech
» Words are spoken at different rates -- factor of 2 or 3.
● Continuous speech
» Where are the boundaries between words??
"Is this your cat?"  0.2 - 0.3 - 0.6 - 0.2
"When is your train?"  0.2 - 0.2 - 0.6 - 0.3
Cat -> 'c' - 'ah' - 't'  0.03 - 0.15 - 0.02
fat -> 'f' - 'ah' - 't'  0.1 - 0.1 - 0.02
Implications of Decomposition
● The parts of words can be reused
» Words are built from XX phoneme models
» Perhaps we can train the phoneme recognizers separately??
» (Sometimes… co-articulation can make this harder)
[Figure: phoneme chain for "cat": N -> c -> ah -> t -> N, with self-loop probabilities 0.5, 0.8, 0.1, 0.85]
Phoneme Sequences
NFA Model
Sequence: NNCCAAAAAAAAATTTNN…
Spectrogram F_t, state S_t
P(F, S | Model) = P(F | S) P(S | Model)
The NFA model for 'cat' assigns a probability to each spectrogram.
[Figure: state chain S1, S2, S3, S4, S5, …, Sn with one observation F1 … Fn per state]

The Details
S_i ∈ {N_B, C, A, T, N_A}
P(F_i | S_i = k) = G(F_i; μ_k, Σ_k)
P(S_1) = {0.5, 0.5, 0.0, 0.0, 0.0}
Transition table P(S_{i+1} | S_i):
       N_B   C    A    T    N_A
N_B    0.5   0.5
C            0.2  0.8
A                 0.9  0.1
T                      0.15 0.85
N_A                         1.0
[Figure: the same chain drawn as a Bayes net — hidden states over time with per-step observations]
P(A, B, C, D, F, G, H, I)
= P(A) P(F|A) P(B|A) P(G|B) P(C|B) P(H|C) P(D|C) P(I|D)
Using Dynamic Programming…

[Figure: the state chain S1, S2, …, Sn over observations F1, F2, …, Fn.]

P(F | Model)
  = ∑_S P(F | S) P(S | Model)
  = ∑_{Ŝ={s1,s2,s3,s4,…}} P(F | S = Ŝ) P(S = Ŝ | Model)
  = ∑_{Ŝ={s1,s2,s3,s4,…}} ∏_j P(F_j | S_j = s_j) P(S = Ŝ | Model)
  = ∑_{s1} P(F_1 = f_1 | S_1 = s_1) P(S_1 = s_1) ∑_{s2} P(F_2 = f_2 | S_2 = s_2) P(S_2 = s_2 | S_1 = s_1) ∑_{s3} …
  = ∑_{s1} T_{s1} ∑_{s2} T_{s1 s2} ∑_{s3} T_{s2 s3} ∑_{s4} T_{s3 s4} ∑_{s5} T_{s4 s5}
  = ∑_{s1} T_{s1} ∑_{s2} T_{s1 s2} ∑_{s3} T_{s2 s3} ∑_{s4} T_{s3 s4} β^4_{s4}
  = ∑_{s1} T_{s1} ∑_{s2} T_{s1 s2} ∑_{s3} T_{s2 s3} β^3_{s3}
  = ∑_{s1} T_{s1} ∑_{s2} T_{s1 s2} β^2_{s2}
  = ∑_{s1} T_{s1} β^1_{s1}
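The last four lines are a backward (β) recursion, and it is easy to write down directly. A minimal MATLAB function-file sketch, assuming the observation likelihoods and transition table are given as arrays (all names are ours, not the course library's):

function p = chain_likelihood(initial, trans, obs_like)
% P(F | Model) by collapsing the sums from the inside out (beta recursion).
% obs_like(i,j) = P(F_i = f_i | S_i = j); initial(j) = P(S_1 = j).
[ntimes, nstates] = size(obs_like);
beta = ones(1, nstates);                       % beta^n is identically 1
for i = ntimes:-1:2
  beta = (trans * (obs_like(i,:) .* beta)')';  % beta^{i-1}
end
p = sum(initial .* obs_like(1,:) .* beta);     % final sum over s1
end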
11
Stringing Words Together

[Figure: word models for ‘Cat’, ‘Eats’, and ‘Food’ chained together over the observation stream F.]
12
Markov Processes
l Markov Processes are in fact very general…
» Loosely, they are processes in which there is a great deal of conditional independence.
– Like most Bayes Nets:  P(A | B, C, D, F, G, H, I) = P(A | π(A))

Note: up ‘til now we have seen only directed models… the notion of Markov for undirected models is a bit more complex...

P(X,Y,Z) = P(X) P(Y|X) P(Z|Y)
∀ a, b, c:  P(X=a, Y=b, Z=c) = P(X=a) P(Y=b | X=a) P(Z=c | Y=b)

P(X):             P(Y|X):                 P(Z|Y):
  X      0.4             Y     NOT Y             Z     NOT Z
  NOT X  0.6      X      0.9   0.1        Y      0.2   0.8
                  NOT X  0.3   0.7        NOT Y  0     1
13
Junction Tree Algorithm: Graph Hacking

[Figure: the net over M, C, A, B, E, F is moralized and triangulated; its cliques CM, AC, AE, ABF are linked through separators C and A to form a junction tree.]
14
Junction Trees and Tables

[Figure: the junction tree with cliques CM, AC, AE, ABF and separators C, A1, A2.]
15
From Junction Trees to Probability

[Figure: the net over M, C, A, B, E, F beside its junction tree: cliques CM, AC, AE, ABF with separators C, A1, A2.]

P(M, C, A, B, E, F) = P(M) P(C|M) P(A|C) P(E|A) P(B) P(F|B)

T_{MCABEF} = (T_CM × T_AC × T_AE × T_ABF) / (S_C × S_A1 × S_A2)

  P(M) P(C|M)  →  T_CM
  P(A|C)       →  T_AC
  P(E|A)       →  T_AE
  P(B) P(F|B)  →  T_ABF
16
6.891 Machine Learning and Neural
Networks
Lecture 18:
Finish Hidden Markov Models
& Finish Bayes Nets
News
l Remember to keep thinking about your final
projects!
1
Review & Overview
l Lecture 17:
» Hidden Markov Models for Speech
– Speech is complex…
l Many words / Length of words varies
– Speech is best represented as a spectrogram
– Variable timing of speech can be modeled as an NFA.
– An HMM is a Bayes Net which is equivalent to an NFA
l We can build an HMM for each word out of phoneme models
– Can sum over the unknown states to recognize words
Speech in a Nutshell

[Figure: spectrogram of the word "six", segmented into ‘s’, ‘i’, ‘x’.]

Closer Examination

[Figure: close-up of the spectrogram (5 frames of ‘s’, 6 frames of ‘x’) and the ‘six’ NFA: states N, s, i, x, N with arc probabilities 0.2, 0.3, 0.2, 0.2.]

P(F, S | Model) = P(F | S) P(S | Model)
3
Use Bayes Law

[Figure: the ‘cat’ NFA (arc probabilities N 0.5, c 0.8, ah 0.1, t 0.85) unrolled into the state chain S1, S2, …, Sn over observations F1, F2, …, Fn.]
4
A concrete example

[Figure: the state chain S1, S2, …, Sn over observations F1, F2, …, Fn.]

S_i ∈ {1, 2}
P(S_1) = {0.5, 0.5}
P(F_i = f | S_i = 1) = G(f, 1.0, 0.1)
P(F_i = f | S_i = 2) = G(f, 2.0, 0.1)

P(S_{i+1} | S_i):
         1     2
   1     0.9   0.1
   2     0.1   0.9

P(F, S | Model)
Some Samples

[Figure: three sampled observation sequences of length 100; the traces hop between levels near 1.0 and 2.0 as the hidden state switches.]
5
Code is very simple...

% Draw a state sequence from the Markov chain
states(1) = hmm_draw_state(initial);
for i = 2:n
  % transition(states(i-1), :) is the distribution over the next state
  states(i) = hmm_draw_state(transition(states(i-1), :));
end

[Figure: two sampled observation traces of length 100 from the noisier model below.]

P(F_i = f | S_i = 1) = G(f, 1.0, 0.4)
P(F_i = f | S_i = 2) = G(f, 2.0, 0.4)
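hmm_draw_state is a course helper; a self-contained sketch of the same sampler, including the Gaussian observations, could look like this (every name below is ours, not the course library's):

% Sample n steps from the 2-state HMM with Gaussian observations.
n = 100;
initial = [0.5 0.5];
transition = [0.9 0.1; 0.1 0.9];
mu = [1.0 2.0];  sigma = [0.4 0.4];
draw = @(p) find(rand < cumsum(p), 1);   % draw an index from distribution p
states = zeros(1, n);
states(1) = draw(initial);
for i = 2:n
  states(i) = draw(transition(states(i-1), :));
end
f = mu(states) + sigma(states) .* randn(1, n);   % observations F_i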
6
But we have a detailed model

[Figure: the state chain S1, S2, …, Sn over observations F1, F2, …, Fn, and a noisy length-100 observation trace from which we want to recover the states.]
7
This code is also simple…

% Propagate the maximum state forward in time from the beginning
maxes = state_like;
maxes(1,:) = maxes(1,:) / sum(maxes(1,:));
for i = 2:ntimes
  for j = 1:nstates
    % For each new time, check each of the past states to determine
    % the best state given the transition costs.
    for k = 1:nstates
      vals(j,k) = maxes(i-1,k) * trans(k,j) * state_like(i,j);
    end
    maxes(i,j) = max(vals(j,:));
  end
  maxes(i,:) = maxes(i,:) / sum(maxes(i,:));
end
% Trace the best state backward from the end.
% (The slide omits the initialization of shat and ind; this is one
% reasonable completion.)
shat = zeros(ntimes, nstates);
[v, ind] = max(maxes(ntimes,:));
shat(ntimes, ind) = 1;
for i = ntimes-1:-1:1
  for j = 1:nstates
    back_vals(j) = trans(ind,j) * maxes(i,j);
  end
  [v, ind] = max(back_vals);
  shat(i, ind) = 1;
end
Two models side by side, differing only in their transitions:

S_i ∈ {1, 2},  P(S_1) = {0.5, 0.5}  (both models)

Left:  P(S_{i+1} | S_i = 1) = (0.9, 0.1)
Right: P(S_{i+1} | S_i = 1) = (0.8, 0.2)

[Figure: a length-100 observation trace under each model.]
8
Code for model likelihood...

function like = hmm_model_likelihood(f, initial, trans, obs_models)
% like = hmm_model_likelihood(f, initial, trans, obs_models)
% First compute the likelihood of every state given every observation
state_like = hmm_obs_likelihood(f, obs_models);
% initialize some variables
ntimes = size(state_like, 1);
nstates = size(state_like, 2);
% mfactor rescales beta at every step to avoid numerical underflow;
% the rescaling is undone in log space at the end.
mfactor = 100;
beta = mfactor .* state_like(ntimes,:);
for i = ntimes-1:-1:1
  beta = mfactor .* (state_like(i,:) .* (trans * beta')');
end
like = log10(sum(beta)) - (log10(mfactor) * ntimes);
9
Markov Processes
l Markov Processes are in fact very general…
» Loosely, they are processes in which there is a great deal of conditional independence.
– Like most Bayes Nets:  P(A | B, C, D, F, G, H, I) = P(A | π(A))

Note: up ‘til now we have seen only directed models… the notion of Markov for undirected models is a bit more complex...
Segue
l We have seen several applications of Bayesian
Networks…
» Expert Systems
» Diagnosis
» Speech Recognition
10
Junction Tree Algorithm 1
l Table arithmetic (see the code sketch below):

X → Y → Z

P(X,Y,Z) = P(X) P(Y|X) P(Z|Y)
∀ a, b, c:  P(X=a, Y=b, Z=c) = P(X=a) P(Y=b | X=a) P(Z=c | Y=b)

P(X):             P(Y|X):                 P(Z|Y):
  X      0.4             Y     NOT Y             Z     NOT Z
  NOT X  0.6      X      0.9   0.1        Y      0.2   0.8
                  NOT X  0.3   0.7        NOT Y  0     1
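A minimal MATLAB sketch of that table product, using the numbers above (index 1 = true, 2 = false):

px  = [0.4 0.6];              % P(X):  [X, NOT X]
pyx = [0.9 0.1; 0.3 0.7];     % P(Y|X): rows X, columns Y
pzy = [0.2 0.8; 0.0 1.0];     % P(Z|Y): rows Y, columns Z
pxyz = zeros(2,2,2);          % joint table P(X=a, Y=b, Z=c)
for a = 1:2
  for b = 1:2
    for c = 1:2
      pxyz(a,b,c) = px(a) * pyx(a,b) * pzy(b,c);
    end
  end
end
% sanity check: the entries of a joint distribution sum to 1
assert(abs(sum(pxyz(:)) - 1) < 1e-12);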
[Figure: the net over M, C, A, B, E, F is moralized and triangulated, and its junction tree is extracted.]
11
More Junction Trees

[Figure: the junction tree with cliques CM, AC, AE, ABF and separators C, A1, A2.]
12
Rules for Junction Tree Initialization
l For each conditional distribution in the Bayes Net
» Find a node in the Jtree which contains all those vars
» Multiply that node's table by the conditional dist (a sketch follows below)
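A sketch of this initialization for two cliques of the example tree; binary variables, and all numbers are hypothetical (the assignments P(M)P(C|M) → T_CM and P(A|C) → T_AC follow the earlier slide):

% Initialize clique tables of the junction tree over CM and AC.
% Index 1 = true, 2 = false; the probabilities are made up.
pm  = [0.3 0.7];               % P(M)
pcm = [0.9 0.1; 0.2 0.8];      % P(C|M): rows M, cols C
pac = [0.6 0.4; 0.5 0.5];      % P(A|C): rows C, cols A
T_CM = ones(2,2);  T_AC = ones(2,2);
for m = 1:2
  for c = 1:2
    T_CM(c,m) = T_CM(c,m) * pm(m) * pcm(m,c);   % P(M)P(C|M) -> T_CM
  end
end
for a = 1:2
  for c = 1:2
    T_AC(a,c) = T_AC(a,c) * pac(c,a);           % P(A|C) -> T_AC
  end
end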
13
Image Markov Models
14
Multi-scale Statistical Models:
Images, People, Movement
Paul Viola
Collaborators: Jeremy De Bonet, John Fisher, Andrew Kim, Tom Rikert, Mike Jones
http://www.ai.mit.edu/projects/lv

[Diagram: example images feed Non-Parametric Multi-scale Models, which support sampling for synthesis & computer graphics; detection, registration & recognition; a new hypothesis for human object recognition; segmentation; and denoising & super-resolution.]
1
Visual Texture: a testing ground
• Texture
– Random Repeating Process
– No two patches are identical

A good statistical model for images should be a good model for visual texture.

[Figure: an input texture next to synthesis results from Gaussian, independent-pixel, and non-parametric multi-scale models.]
2
Simple Statistical Model 1:
Independent pixels
• Statistical Model 1
– Each pixel is independent and identically distributed

P(I) = ∏_{x,y} P(I_xy)
Technical Point:
Texture is Ergodic/Stationary
• A texture image is assumed to be many samples of
a single process
– Each sample is almost certainly dependent on the other
samples
– But actual location of the samples does not matter
– (Space invariant process).
3
Simple Statistical Models
Independent pixels
Histogram
P(I) = ∏_{x,y} P(I_xy)
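Under Model 1, synthesis is just resampling the pixel histogram. A minimal MATLAB sketch (the grayscale input array img is an assumption):

% Fit Model 1 to an input texture and sample a new image from it:
% each output pixel is drawn i.i.d. from the empirical pixel pool.
pixels = img(:);                      % img: input grayscale texture
n = numel(pixels);
idx = ceil(n * rand(size(img)));      % uniform random indices into the pool
sample = pixels(idx);                 % bootstrap-resampled texture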
Statistical Model 2:
Gaussian Distribution

P(I) = N(I; m, Σ) ∝ e^{−½ |Σ^{−1/2}(I − m)|²}

[Figure: original texture vs. a generated Gaussian sample.]
4
What else are probabilistic image models good for??
• Denoising:
– If we have a model for: P(I)
– And we observe an image plus noise: Î = I + η
– Then:

P(Î) = ∫ P(I = Î − η, η) dη = ∫ P(I = Î − η) P(η) dη

P(I | Î) = P(Î | I) P(I) / P(Î) = P(η = Î − I) P(I) / P(Î)

E[I | Î] = ∫ I P(η = Î − I) P(I) / P(Î) dI

For Gaussian noise and a Gaussian prior,

E[I | Î] = ∫ I e^{−(I−Î)²/2n²} e^{−(I−m)²/2s²} / c dI

This is the same thing as estimating the mean of a Gaussian from one example when there is a prior… the expected value is between the observation and the prior.
5
Gaussians are not quite right...

[Figure: histogram of derivative values vs. the best Gaussian fit, plotted as P(value); the data is more sharply peaked than the Gaussian.]
6
Statistical Model 3:
Independent Wavelet Models
• Donoho, Adelson, Simoncelli, etc.
• Very efficient (linear time)
– Estimation, Sampling, Inference
P(I) ∝ ∏_j P_j([WI]_j)
1D Wavelet Transform

[Figure: a 1-D wavelet transform built from simple filters.]

Wavelet Transform

[Figure: a simple input texture decomposed by oriented wavelet filters.]

Sub-band Pyramid / Fourier Decomposition

[Figure: the sub-band pyramid of the transform WI; each band holds coefficients F_L^θ(x, y).]
8
Noise removal through shrinkage

Inside the guts...

[Figure: four histograms of coefficient values underlying the shrinkage computation.]
10
Gaussian Denoising

E[I | Î] = ∫ I P(η = Î − I) P(I) / P(Î) dI
         = ∫ I e^{−(Î−I)²/2n²} e^{−I²/2s²} / c dI

Completing the square in the exponent:

−(Î−I)²/2n² − I²/2s²
  = −[ s²(I² − 2IÎ + Î²) + n²I² ] / 2n²s²
  = −[ (s² + n²)I² − 2s²IÎ + s²Î² ] / 2n²s²
  = −[ I² − 2s²IÎ/(s²+n²) + s²Î²/(s²+n²) ] / (2n²s² / (s²+n²))

so the posterior is a Gaussian in I whose mean is s²Î/(s²+n²).
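A sketch of the resulting estimator, assuming Gaussian noise of variance n² and a zero-mean Gaussian prior of variance s² on each value (a linear "shrink toward zero"; the variances below are made up):

% Posterior-mean denoising under a Gaussian noise + Gaussian prior model.
s2 = 1.0;  n2 = 0.25;                   % prior and noise variances (assumed)
ihat = randn(1, 1000) * sqrt(s2 + n2);  % simulated noisy observations
iest = (s2 / (s2 + n2)) * ihat;         % E[I | Ihat] = s^2/(s^2+n^2) * Ihat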
11
Independent Wavelet Synthesis Model

P(I) ≈ ∏_{l,θ,x,y} P_{l,θ,x,y}(F_l^θ(x, y))
     ≈ ∏_{l,θ,x,y} P_{l,θ}(F_l^θ(x, y))

Given: I, W
Observe: O_{l,θ} = {F_l^θ(x, y)}
Model: P_{l,θ}(·)

Observe Coefficients
12
Compute Histograms

[Figure: multi-scale histograms are computed from the original texture patch; the multi-scale sampling procedure then synthesizes a new patch.]
13
Not quite right...

[Figure: wavelet transform of a simple input texture; independent sub-band sampling misses cross-band structure.]

Heeger and Bergen: Constrain the pixel histogram

Models of structured images are weak.

[Figure: wavelet transform of a simple input texture, as above.]

Preserving Cross Scale Alignment

[Figure: wavelet filter outputs remain aligned across scales at image structures.]

Statistical Distribution of Multi-scale Features

The distribution of multi-scale features determines appearance.
17
Multi-scale Wavelet Features

A multi-scale feature associates many values with each pixel in the image.

Conjunctions of filters: the Multi-resolution Parent Vector (fine to coarse)

V(x, y) = ( F_N^0(x/2^N, y/2^N), F_N^1(x/2^N, y/2^N), …, F_N^M(x/2^N, y/2^N),
            ⋮
            F_1^0(x/2, y/2), F_1^1(x/2, y/2), …, F_1^M(x/2, y/2) )
18
Build a Model for Observed Distribution

P(I) = P(V(x, y))

Non-parametric Distribution

Related to the MAR models of Willsky et al.

[Figure: original texture vs. synthesis results.]
19
Multi-resolution Parent Vector

V_N(x, y) = ( F_N^0(x/2^N, y/2^N), F_N^1(x/2^N, y/2^N), …, F_N^M(x/2^N, y/2^N) )
  ⋮
V_1(x, y) = ( F_1^0(x/2, y/2), F_1^1(x/2, y/2), …, F_1^M(x/2, y/2) )

Probabilistic Model:  P(V(x, y))

Markov:
  P(V_l(x, y) | {WI} − V_l(x, y)) = P(V_l(x, y) | V_{l+1}(x, y), V_{l+2}(x, y), …)

Conditionally Independent:
  P(V_l) = ∏_{x,y} P(V_l(x, y) | V_{l+1}(x, y), V_{l+2}(x, y), …)

Successive Conditioning:
  P(I) = P(WI) = P(V_M) × P(V_{M−1} | V_M) × P(V_{M−2} | V_M, V_{M−1}) × P(V_{M−3} | V_M, V_{M−1}, V_{M−2}) × …
20
Estimating Conditional Distributions
• Non-parametrically:  P*(x) = ∑_i R(x − x_i)

P(V_l(x,y) | V_{l+1}(x,y), V_{l+2}(x,y), …)
  = P(V_l(x,y), V_{l+1}(x,y), V_{l+2}(x,y), …) / P(V_{l+1}(x,y), V_{l+2}(x,y), …)
  ≅ P*(V_l(x,y), V_{l+1}(x,y), V_{l+2}(x,y), …) / P*(V_{l+1}(x,y), V_{l+2}(x,y), …)

[Figure: a 64x64 input image and a 2x2 patch being synthesized.]
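A Parzen (kernel) estimate like P*(x) above is just a sum of kernels centered on the data. A minimal 1-D MATLAB sketch with a Gaussian kernel R (the data and the bandwidth h are made up):

% 1-D Parzen density estimate: Pstar(x) = (1/n) sum_i R(x - x_i)
xi = [1.1 1.4 2.0 2.2 2.3];             % hypothetical data samples
h = 0.2;                                 % kernel bandwidth (assumed)
R = @(u) exp(-u.^2/(2*h^2)) / (sqrt(2*pi)*h);
pstar = @(x) mean(R(x - xi));            % estimate at a scalar query x
pstar(2.0)                               % high density near the data cluster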
21
Shannon Resampling
Step 2: Build synthesis pyramid
Step 2a: Fill in the top...
Step 2b: Fill in subsequent levels
Finish the pyramid

[Figure: several pages of texture synthesis results comparing B&H with D&V.]
FRAME: Challenge

[Diagram: example images feed Non-Parametric Multi-scale Models, which support a sample path (synthesis & computer graphics) and a distribution/likelihood path (detection & recognition).]

Discrimination via Cross Entropy

[Figure: a model texture I_MODEL with P(I | Model) is compared to a test texture I_TEST with P(I | Test) via cross entropy.]

Best previous: GMRF’s 97%. Ours: 99%.
28
29
Where is the boundary between texture and objects?
• Our model can synthesize and recognize complex and structured textures.
– Far beyond older definitions of texture.
• Where is the boundary between these complex textures and other patterns in images
– like faces, human forms, automobiles, etc.?
30
What about face detection?
• Synthesis is convincing
• Train a texture model to detect faces
Tom Rikert & Mike Jones
Detecting Objects
• Key Difficulties:
– Variation in Pose, Deformation, & variation across class
• Most Object Recognition approaches are either:
– Very dependent on precise shape and size
– Entirely dependent on simple features (… color, edge histograms)
• Hypothesis:
– Object recognition is closely related to texture perception
31
Detection Results
Detection Results:
32
Non-frontal faces

But naïve detection is expensive

Car Images
33
Texture recognition via Cross Entropy

[Figure: a model texture I_MODEL with P(I | Model) is compared to a test texture I_TEST with P(I | Test) via cross entropy.]
200 Parent Vectors: reduce the number of bins by clustering & by value for discrimination.
34
ROC using 200 vectors…
Scanning results:
Time: 9 secs
35
Key facial features
- determined automatically
- located automatically
36
37
Future Work:
New Face Recognition Algorithm
38
[Diagram: example images feed Non-Parametric Multi-scale Models, which support sampling for synthesis & computer graphics; detection, registration & recognition; a new hypothesis for human object recognition; segmentation; and denoising & super-resolution.]

A good statistical model for images should be a good model for visual texture.
1
Generation: a critical test

[Figure: an input texture next to samples generated from Gaussian, independent-pixel, and non-parametric multi-scale models.]
Histogram
P(I) = ∏_{x,y} P(I_xy)
2
Statistical Model 2:
Gaussian Distribution

P(I) = N(I; µ, Σ) ∝ e^{−½ |Σ^{−1/2}(I − µ)|²}

[Figure: original texture vs. a generated Gaussian sample.]
3
Noise + Signal: Two Gaussian Case

[Figure: histograms of the signal, the noise, and their sum in the two-Gaussian case.]
Gaussian Denoising

E[I | Î] = ∫ I P(η = Î − I) P(I) / P(Î) dI
         = ∫ I e^{−(Î−I)²/2n²} e^{−I²/2s²} / c dI

−(Î−I)²/2n² − I²/2s²
  = −[ s²(I² − 2IÎ + Î²) + n²I² ] / 2n²s²
  = −[ (s² + n²)I² − 2s²IÎ + s²Î² ] / 2n²s²
  = −[ I² − 2s²IÎ/(s²+n²) + s²Î²/(s²+n²) ] / (2n²s² / (s²+n²))

E[I | Î] = s²/(n² + s²) · Î
4
In pictures…

[Figure: histograms illustrating the shrinkage of the observed value toward the prior mean.]
Statistical Model 3:
Independent Wavelet Models
• Donoho, Adelson, Simoncelli, etc.
• Very efficient (linear time)
– Estimation, Sampling, Inference
P(I) ∝ ∏_j P_j([WI]_j)
5
Noise vs. Signal: The details

[Figure: coefficient histograms for the signal and for the noise.]

Non-gaussian: the integral is evaluated numerically

E[I | Î] = ∫ I P(η = Î − I) P(I) / P(Î) dI
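A sketch of that numerical evaluation on a grid, for an arbitrary (possibly non-Gaussian) prior and Gaussian noise; both densities below are assumptions chosen for illustration:

% Numerically evaluate E[I | Ihat] = int I P(eta = Ihat - I) P(I) dI / P(Ihat)
ihat = 1.5;  n = 0.5;                          % observed value, noise std (assumed)
I = linspace(-6, 6, 2001);                     % integration grid
noise = @(e) exp(-e.^2/(2*n^2)) / (sqrt(2*pi)*n);
prior = @(v) exp(-abs(v));                     % e.g. a heavy-tailed, non-Gaussian prior
w = noise(ihat - I) .* prior(I);               % unnormalized posterior on the grid
iest = trapz(I, I .* w) / trapz(I, w);         % posterior mean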
6
1D Wavelet Transform

[Figure: a 1-D wavelet transform; a simple input texture decomposed by oriented filters.]

Sub-band Pyramid / Fourier Decomposition

[Figure: the sub-band pyramid of the transform WI; each band holds coefficients F_L^θ(x, y).]
7
Noise removal through shrinkage
8
Independent Wavelet Synthesis Model

P(I) ≈ ∏_{l,θ,x,y} P_{l,θ,x,y}(F_l^θ(x, y))
     ≈ ∏_{l,θ,x,y} P_{l,θ}(F_l^θ(x, y))

Given: I, W
Observe: O_{l,θ} = {F_l^θ(x, y)}
Model: P_{l,θ}(·)

Observe Coefficients
9
Compute Histograms

[Figure: multi-scale histograms are computed from the original texture patch; the multi-scale sampling procedure then synthesizes a new patch.]
10
Not quite right...

[Figure: wavelet transform of a simple input texture; independent sub-band sampling misses cross-band structure.]

Heeger and Bergen: Constrain the pixel histogram

Models of structured images are weak.

[Figure: wavelet transform of a simple input texture, as above.]

Preserving Cross Scale Alignment

[Figure: wavelet filter outputs remain aligned across scales at image structures.]

Statistical Distribution of Multi-scale Features

The distribution of multi-scale features determines appearance.
14
Multi-scale Wavelet Features

A multi-scale feature associates many values with each pixel in the image.

Conjunctions of filters: the Multi-resolution Parent Vector (fine to coarse)

V(x, y) = ( F_N^0(x/2^N, y/2^N), F_N^1(x/2^N, y/2^N), …, F_N^M(x/2^N, y/2^N),
            ⋮
            F_1^0(x/2, y/2), F_1^1(x/2, y/2), …, F_1^M(x/2, y/2) )
15
Build a Model for Observed Distribution

P(I) = P(V(x, y))

Non-parametric Distribution

Related to the MAR models of Willsky et al.

[Figure: original texture vs. synthesis results.]
16
Multi-resolution Parent Vector

V_N(x, y) = ( F_N^0(x/2^N, y/2^N), F_N^1(x/2^N, y/2^N), …, F_N^M(x/2^N, y/2^N) )
  ⋮
V_1(x, y) = ( F_1^0(x/2, y/2), F_1^1(x/2, y/2), …, F_1^M(x/2, y/2) )

Probabilistic Model:  P(V(x, y))

Markov:
  P(V_l(x, y) | {WI} − V_l(x, y)) = P(V_l(x, y) | V_{l+1}(x, y), V_{l+2}(x, y), …)

Conditionally Independent:
  P(V_l) = ∏_{x,y} P(V_l(x, y) | V_{l+1}(x, y), V_{l+2}(x, y), …)

Successive Conditioning:
  P(I) = P(WI) = P(V_M) × P(V_{M−1} | V_M) × P(V_{M−2} | V_M, V_{M−1}) × P(V_{M−3} | V_M, V_{M−1}, V_{M−2}) × …
17
Estimating Conditional Distributions
• Non-parametrically:  P*(x) = ∑_i R(x − x_i)

P(V_l(x,y) | V_{l+1}(x,y), V_{l+2}(x,y), …)
  = P(V_l(x,y), V_{l+1}(x,y), V_{l+2}(x,y), …) / P(V_{l+1}(x,y), V_{l+2}(x,y), …)
  ≅ P*(V_l(x,y), V_{l+1}(x,y), V_{l+2}(x,y), …) / P*(V_{l+1}(x,y), V_{l+2}(x,y), …)

[Figure: a 64x64 input image and a 2x2 patch being synthesized.]
18
Shannon Resampling
Step 2: Build synthesis pyramid
Step 2a: Fill in the top...
Step 2b: Fill in subsequent levels
Finish the pyramid

[Figure: several pages of texture synthesis results comparing B&H with D&V.]
23
FRAME: Challenge

[Diagram: example images feed Non-Parametric Multi-scale Models, which support a sample path (synthesis & computer graphics) and a distribution/likelihood path (detection & recognition).]

Discrimination via Cross Entropy

[Figure: a model texture I_MODEL with P(I | Model) is compared to a test texture I_TEST with P(I | Test) via cross entropy.]

Best previous: GMRF’s 97%. Ours: 99%.
25
26
Where is the boundary between texture and objects?
• Our model can synthesize and recognize complex and structured textures.
– Far beyond older definitions of texture.
• Where is the boundary between these complex textures and other patterns in images
– like faces, human forms, automobiles, etc.?
27
What about face detection?
• Synthesis is convincing
• Train a texture model to detect faces
Tom Rikert & Mike Jones
Detecting Objects
• Key Difficulties:
– Variation in Pose, Deformation, & variation across class
• Most Object Recognition approaches are either:
– Very dependent on precise shape and size
– Entirely dependent on simple features (… color, edge histograms)
• Hypothesis:
– Object recognition is closely related to texture perception
28
Detection Results
Detection Results:
29
Non-frontal faces

But naïve detection is expensive

Car Images
30
Texture recognition via Cross Entropy

[Figure: a model texture I_MODEL with P(I | Model) is compared to a test texture I_TEST with P(I | Test) via cross entropy.]
200 Parent Vectors: reduce the number of bins by clustering & by value for discrimination.
31
ROC using 200 vectors…
Scanning results:
Time: 9 secs
32
Key facial features
- determined automatically
- located automatically
33
34
Future Work:
New Face Recognition Algorithm
35
6.891 Machine Learning and Neural
Networks
Lecture 24:
The End
News
l The Final is on Monday of finals week at 1:30
» In this room…
l Conflict exam will be in NE43 on Tuesday Morning
at 9:30.
» Come to Kinh’s office at 9:15 so we can set people up.
l Last year’s final will be on the web by 1PM.
1
Review & Overview
l Lectures 22 and 23:
» Statistical image processing
– Estimate statistical models from examples
– Applications
l Denoising
l Synthesis
l Recognition
6.891 at a Glance
l Probability
» Bayes Law
l Linear Algebra
» Eigenvectors and inverses
l Bayesian Classification
l Discriminant Functions
» Perceptrons, MLPs
l Support Vector Machines
l Regularization
» Radial Basis Functions
l Unsupervised Learning and PCA
l Bayes Nets and HMMs
2
In the beginning… Probability
l The key concepts of probability
» The basic algebra of probability
– Probabilities of mutually exclusive events add (independent events multiply)
– Relationships between conditional and joint distributions
» Densities work like probabilities (mostly)
» Bayes Law allows us to make decisions
– Loss functions are critical
» Maximum likelihood allows us to learn distributions
– Bayesian estimation averages over parameters
» Exponential densities are easiest to work with
» Mixtures of Gaussians are powerful (but EM is slow)
» Non-parametric estimators are more powerful
– But are difficult to represent
Linear Algebra
l The inverse and pseudo-inverse are everywhere
» Solving least squares problems
l Covariance and co-occurrence are everywhere
» Estimating a Gaussian
» Fitting a line to data
» Principal components analysis
l Eigenvectors simplify most linear algebra
» Especially for symmetric positive semi-definite matrices
» Allow you to compute inverses & square roots
» Allow you to understand distributions and linear
dependence
3
Bayesian Classification
l Start out with strong assumptions about your data
» Number of classes, structure of the classes
l Use data to estimate the distribution of each class
l Use Bayes’ law to classify new examples
l Advantages:
» Can estimate the probability of classes (confidence)
» Can validate the model
» Harder to over-train or over-fit
l Disadvantages
» May not use data efficiently
» Sensitive to poor assumptions
Discriminant Functions
l Attempt to estimate the discriminant function
directly
» Linear
» Polynomial
» Multi-layer perceptron
l Specifically minimizes the number of errors
l Advantages:
» Don’t waste time on distributions (just the boundary)
l Disadvantages
» No natural measure of confidence
» Can over-train
4
Support Vector Machines
l A principled and direct way to simultaneously
minimize errors while yielding the simplest
possible classifier
» Occam’s razor
l Using the Kernel Trick ™
» Can find a very complex polynomial with little work
l Using the Margin Trick™
» Maximizes generalization in the face of complexity
l Simple learning criteria
l Well studied learning algorithm
» Quadratic programming
Regularization
l Sometimes you would like to find the smoothest
function which is close to the data
» Minimize the squared error
» Minimize the squared first derivative (or 2nd deriv.)
l The least squares solution:
» Is a sum of kernel functions centered on the data
» Kernel functions depend on the smoothness penalty
l Derivative penalties yield polynomial kernels
» First -> linear, Second -> cubic, Hairy -> Gaussian
5
Unsupervised Learning
l Transforming the input so that it is more manageable
» PCA: The data can be represented using fewer numbers
– Can compress data, make learning simpler
» ICA: The resulting data is now more independent
– Can separate signals that were mixed
» Informative Features (by John Fisher)
– Can represent just the critical information
Bayes Nets
l Models of the conditional dependencies between
variables
» Usually many variables
l A complete model would be intractable
» Exponential number of parameters
– Impossible to learn or reason
l By assuming that certain vars are independent
» Number of params goes down rapidly
» Efficient reasoning is possible
l Bayes Nets are very general and can be used in
many ways
6
Hidden Markov Models
l A type of Bayes Net that allows reasoning over time
l The true state of the world is unknown
» You have noisy observations
l HMM use temporal dependencies to differentiate
ambiguous states
The VC dimension
l Each class of learned functions has a VC dimension
» Perceptron: VCdim = number of weights
l VC dimension measures the capacity of the classifier
» VCdim is the max number of points which can be shattered
» Shattered = assigned any set of labels
l Intuition: larger capacity requires more data
» Like polynomials: Nth order requires N+1 points
l The bounds are actually probabilistic
» The probability that the error rate exceeds a particular level is bounded by a function of VCdim and N.
7
Symbolic Learning
l Often the correct classification rule is symbolic
» If BP < 50 and HR < 50 then administer DRUG
l While Bayes Nets can reason in this way, they do
not offer much help in learning the relationships
from data
» If structure of net is given, then params can be estimated
l This is sometimes called rule learning
l Decision Trees – ID3, CART, etc.
» Pick a feature, split into ranges
» For each case, pick another feature and repeat
» Each leaf should have only one label
Combining Classifiers
l We have encountered many learning techniques
» Each has multiple variants
l Bagging
» Train the same classifier on different subsets of the data (see the sketch after this list)
» Related to cross-validation (or the Bootstrap)
l Stacking
» Perhaps the best approach is to train each type of
classifier and then have them vote.
– Combine 100 different types of neural networks
– Many types of generalized perceptrons
l Boosting
» Train a sequence of classifiers on re-weighted data sets
8
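As a concrete sketch of bagging, here is a minimal MATLAB example with a deliberately simple base learner (nearest class mean); the data and every name below are made up for illustration:

% Bagging sketch: train 25 base classifiers on bootstrap resamples, then vote.
X = [randn(50,2) - 1; randn(50,2) + 1];   % made-up 2-class training data
y = [ones(50,1); 2*ones(50,1)];
Xtest = [-1 -1; 1 1];                     % two query points
nb = 25;  n = size(X,1);
votes = zeros(size(Xtest,1), nb);
for b = 1:nb
  idx = ceil(n * rand(n,1));              % bootstrap sample, with replacement
  Xb = X(idx,:);  yb = y(idx);
  m1 = mean(Xb(yb==1,:), 1);              % base learner: nearest class mean
  m2 = mean(Xb(yb==2,:), 1);
  d1 = sum((Xtest - m1).^2, 2);
  d2 = sum((Xtest - m2).^2, 2);
  votes(:,b) = 1 + (d2 < d1);             % predict class 2 when closer to m2
end
yhat = mode(votes, 2);                    % majority vote across the ensemble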
Policy Learning
l You must act over time to maximize some reward
» Portfolios: Buy and sell stock to max return and min risk
» Two armed bandit: tradeoff exploration for exploitation
» Learn a sequence of actions which takes you from the start to the goal – like in a video game
l Sometimes your feedback is delayed
» Rarely do you get detailed feedback on your actions
l Policy
» Mapping from state of the world to actions
l Reinforcement Learning (Leslie Kaelbling)
l Game Learning (Backgammon)
Language Learning
l How can you learn to pluralize? (phonetically)
» Wug
l How do you discover parts of speech?
l How do you learn the grammar of English?
» Stochastic Context Free Grammar
– Generalization of HMM
– S -> NP VP, VP -> V NP, etc.