6.891 Machine Learning and Neural Networks
Lecture 1:
Introduction and Examples
News
● First problem set is available (short)
» Due Sept. …
» All psets are due on Thursday
» Normally you will have two weeks
● Reading: DHS Ch. … for Friday
Review & Overview
● Administrative information
● Course Goals
● Define Learning, Induction, Regression, Classification
● Give examples of learning applications
● Bayes Rule and classification
● Regression and Overfitting
● Ockham's Razor, Curse of Dimensionality
» Brief Mention of Probability
Course Information
● http://www.ai.mit.edu/courses/…
● Lecturer: Paul Viola
» Prof. in the AI Lab, NE43-…
» viola@ai.mit.edu
» Research: Learning and Computer Vision
– http://www.ai.mit.edu/projects/lv/
● TA: Kinh Tieu
» PhD student in the AI Lab, NE43-…
» tieu@ai.mit.edu
» Research: Image Database Retrieval, Vision, Learning
– http://www.ai.mit.edu/people/tieu/
Grading Experiment!!!
● Problem sets will be self-graded (mostly)
● You will hand in the pset on Thursday. Kinh will record its presence or absence and glance to see if you attempted each problem.
● We will distribute the psets on Friday at random to the class. You will each grade one pset with help from a solution key. You have … days.
● Kinh will lead a …-hour pset review session to go over correct solutions. Probably Monday afternoon.
● You will hand back the graded psets on Wednesday.
● Kinh will then grade question … (usually the toughest).
● The graded psets will be returned to you on Friday, … days after you turned them in.
Course Goals
● Introduce, Motivate, and Study concepts from machine learning. Focus both on fundamentals and applications.
» Second Time: Watch Out!
● Fundamentals
» Follow text: Duda, Hart & Stork (from the Web page)
» Plus some supplemental handouts
● Applications
» Read papers from the literature
● Reinforce
» Six PSETs will require both thinking and hacking
» One final project
» Midterm
» Final exam??
Course Goals: 127
NIPS 1989

Goals: Analysis and Computation
Physical Laws

Newton's Measurements
● Observe many experiments
[Plot: measured Acceleration vs. Force — roughly linear data points]
● Conjecture simple Rule
» F = ma
» Ignore Errors & Inconsistencies
Different Types of Learned Relations
● Regression
» Continuous input, Continuous output
– F = ma, PV = nRT
– Interest Rates -> Stock Prices
– Inches Rain -> Corn Production
● Classification
» Discrete input, Discrete output
– {Red? Round? Small Seed?} -> Apple
– {Alarm?} -> Break-In; {Alarm?, Earthquake?} -> No Break-In
» Continuous input, Discrete output
– Midterm -> Final Grade
– {Fever, Blood Pressure} -> Sick?
– {Income, Current Debt} -> Issue Loan?
– Sound -> Words; Images -> People
Some Notation
● In general a learning problem will have:
» Inputs: x = (x1, x2, …, xd)^T; the j-th example is written x_j (or x^j)
» Outputs: classes C = {C1, C2, …} or values y = (y1, y2, …)^T; C_j and y_j denote the output for example j
» Target: the correct label or value t_j
Additional Notation (Abusive!)
● Prediction Function: y_j = y(x_j), C_j = C(x_j)
● Error Function: E = Σ_j l( t_j − y(x_j) ), with loss
l(z) = 0 if z = 0, 1 otherwise
or l(z) = z^2
● Reading Zip Codes
» First find the Address Block
» Then the zip code
» Normalize size and rotation
» Separate digits
» Digitize: …x… -> 2^… possible images
Character Recognition

Tremendous Variety
Final Performance
● US Postal Service: … Million Letters a day
● Reading Zip Codes
» First find the Address Block
» Then the zip code
» Normalize size and rotation
» Separate digits
» Digitize: …x… -> 2^… possible images
● Training: … example images
● Final Performance: > …%
Speech Recognition
● Speech recognition
» Sound signals -> cepstral coefficients -> Word Sequence
» … sec -> … frames/sec -> … words/sec
● Key difficulties
» Variations in pitch, pronunciation, speed
Digit Recognition in Detail
● Classifying "1" vs "2"
● Define a set of features
[Figure: features measured on a digit image — Perimeter, Width, Height]
● Look for separation
C(x) = θ(ax + b), where θ(y) = 1 if y ≥ 0, 0 otherwise
» Divide into regions
– F ∈ {…} -> "1", F ∈ {…} -> "2"
Using Bayes' Law
● Use Bayes' Law: P(B|A) = P(A|B) P(B) / P(A)
P("2"|F = f) = P(F = f|"2") P("2") / P(F = f)
P("1"|F = f) = P(F = f|"1") P("1") / P(F = f)
Combining Features
● Add features to separate
[Plot: the two classes in the (x1, x2) feature plane]
6.891 Machine Learning and Neural Networks
Lecture 2:
The Probabilistic Approach
News
● For those of you that missed the first class…
» First problem set is on the web. Due …
● The web page is getting updated regularly
● We will hand out grading guidelines when you are given the first problem set to grade
» These grades will not assume perfect accuracy…
Review & Overview
● Lecture 1:
» Defined Learning, Induction, Regression, Classification
» Showed example applications: Digits, Sounds, Speech
» Brief Mention of Probability
● Finish the introduction to Learning
» Fitting functions to data…
» Overfitting
● The Probabilistic Approach
» Review some simple probability
» Apply it to classification tasks
Fitting a Curve to Data
[Plot: ten data points in the unit square and two candidate curves, each marked "??"]
Data: {x_j, t_j};  candidate function y(x)

You are given 10 example data points. These are samples of a physical relationship, perhaps including noise.
In the final analysis we may want to hedge our bets and return a probability distribution over functions.
Polynomial Fitting
Data: {x_j, t_j}
y(x_j) = w_0 + w_1 (x_j)^1 + … + w_M (x_j)^M = Σ_{l=0}^{M} w_l (x_j)^l ≡ y(x_j; w)
Graphical Representation
● Graph represents the function y(x_j) = Σ_l w_l (x_j)^l
● Information flows from bottom to top
● Arrows/Links:
» transmit info
» multiplicative weight
● Nodes:
» sum incoming info
» possible nonlinear transform
[Figure: network with inputs X^0, X^1, X^2, …, X^9 — a high dimensional, non-linear representation of the scalar x — weights w_0 … w_9, and output node y]

- While the algebraic notation for y() is clear and specific, we will see that sometimes it is also useful to develop a graphical notation for both classifiers and regression functions.
- This idea was originally popularized in the neural network literature, wherein neural networks were almost always drawn out in their graphical form.
- The graphical notation points out that an intermediate representation for x is formed (M+1 exponentiations). The resulting problem is then one of learning the linear relationship between this high dimensional space and t.
Choose the Best Polynomial
● Which polynomial function is best?
» Best predictions on training data…
» Best predictions on future data… interpolation/extrapolation
– Best expected loss on future data
– Where do we get this data?
E = (1/2) Σ_j loss( y(x_j; w) − t_j )   (Empirical Loss)
ŵ = argmin_w E   (Find "Optimal" weights)

- What defines the best polynomial function? Perhaps it is the one which is most consistent with the training data?
- Actually we would rather return the function which makes the best predictions on future data - unfortunately there may be no source for this data.
- For the time being let's assume that we want to find the function which best agrees with training data… the function with the lowest loss.
Simple Loss Functions Simplify Learning
loss(δ) = δ^2,   E = (1/2) Σ_j ( y(x_j; w) − t_j )^2
∂E/∂w_i = (1/2) Σ_j 2 ( y(x_j; w) − t_j ) (x_j)^i
        = Σ_j ( y(x_j; w) − t_j ) (x_j)^i = 0

- Certain simple loss functions lead to learning algorithms which are easy to derive and inexpensive to compute.
- For example, squared loss can be solved by differentiating and setting the derivative to zero.
- The result is a set of linear equations that can be solved by inverting a matrix.
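As a concrete illustration, here is a minimal Matlab sketch of that matrix solution for polynomial least squares; the variable names (xs, ts, M, A) and the example data are ours, not from the original slides.

% Example inputs and targets (our own toy data)
xs = (1:10)'/10;
ts = 0.5 + 0.4*sin(2*pi*xs);
M = 3;                           % polynomial order
% Build the design matrix: A(j, l+1) = xs(j)^l
A = zeros(length(xs), M+1);
for l = 0:M
    A(:, l+1) = xs .^ l;
end
% Solve the normal equations A'*A*w = A'*ts (backslash avoids an explicit inverse)
w = (A' * A) \ (A' * ts);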
First order fit…
[Plot: the ten data points with the best first-order (linear) fit]
E = (1/2) Σ_j ( y(x_j; w) − t_j )^2
The optimal function minimizes the residual error.

- There is a pleasant physical analogy for the squared loss. The function is connected by springs to the data points. The system is then allowed to relax until the forces are balanced. The minimum energy solution is the one that is "closest" to the training data.
Fitting Different Polynomials
[Plots: least-squares polynomial fits of increasing order (0, 2, 4, 9) to the same ten data points]
Target Function
t_j = h(x_j)
[Plot: the true target function h(x) on [0, 1] — a smooth, sinusoid-like curve]
The function that generated the data was not a polynomial at all.
Fitting Different Polynomials
[Plots: fits of order 0, 1, 3, 6 and 2, 4, 9 overlaid on the target function]

Probably the best approximation was 6th order (though 3rd is very good as well).
Ninth order provides a terrible fit to the function, though it fits the training data perfectly. This is what is called overfitting…
Matlab Code
% Construct training data: noisy samples of a sinusoid
train_in = [1:10]/10;
train_out = 0.5 + 0.4 * sin(2 * pi * train_in) + 0.1 * randn(size(train_in));
% Fit a polynomial of the given order by least squares
order = 3;
p = polyfit(train_in, train_out, order)
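A usage note we have added (not from the slides): the fitted coefficients p can be evaluated on a dense grid with polyval to visualize over- and under-fitting.

% Evaluate and plot the fit against the training points
test_in = 0:0.01:1;
test_out = polyval(p, test_in);
plot(train_in, train_out, 'o', test_in, test_out, '-')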
First General Problem in Learning
● Control of complexity
» "Entities should not be multiplied without necessity"
– W. Occam, 14th Century
– Occam's razor
» "A physical theory should be as simple as possible, but no simpler" - A. Einstein
» "Good theories are falsifiable" - V. Vapnik
» "Complex theories are likely to be wrong" - P. Viola
Overfitting in Classification
This is not to say that such problems are unique to regression. Determining decision boundaries for classification is very similar.
Probabilistic Notation
X is a Random Variable
P(X = x), where P(·) is a Probability Distribution
Shorthand: P(x) = P(X = x), P(y) = P(Y = y); also P_X(x) = P(X = x), P_Y(y) = P(Y = y)

Note that there are several potentially confusing shorthand notations.
Recall the probabilistic approach
● Given a classification problem
» Speech/Music, Bass/Salmon, Rotten/Ripe
● Choose a feature of your examples
» Fish: width, height, color
» Fruit: color, weight
» Sounds: Spectrum, Variance
● Record the distribution of Feature vs. Class
● Given an unclassified example
» Compute P(F|C1) and P(F|C2)
» Classify using Bayes Rule
Probabilistic Approach
P(C = C_k, X = x)
P(C = C_k | X = x) = P(X = x | C = C_k) P(C = C_k) / P(X = x)
or, in shorthand:  P(C_k|x) = P(x|C_k) P(C_k) / P(x)
Model the class-conditional distributions P(X|C1) & P(X|C2).
(Thomas Bayes, 1702-1761)
Probability Densities
P(X ∈ [a, b]) = ∫_a^b p(X = x) dx,   p(X = x) = (d/db) P(X ∈ [a, b]) evaluated at b = x
Shorthand: p(x) = p_X(x) = p(X = x)
P(C = C_k | X = x) = p(X = x | C = C_k) P(C = C_k) / p(X = x)
or:  P(C_k|x) = p(x|C_k) P(C_k) / p(x)

Somewhat surprisingly, the density is used in the same way that the distribution function is used. In other words, the probability distribution of the class given the feature value can be found using the densities of the features.
Bayes Law for Densities
C(x) = ω_1 if P(ω_1|x) > P(ω_2|x), ω_2 otherwise
[Plot: class-conditional densities for Class 1 and Class 2 along the feature axis]
Given the Bayes classification rule, a set of decision regions is defined.

Decisions
Analysis of Decision Rule
P(error) = P(x ∈ R2, C1) + P(x ∈ R1, C2)
         = P(x ∈ R2 | C1) P(C1) + P(x ∈ R1 | C2) P(C2)
         = ∫_{R2} p(x|C1) P(C1) dx + ∫_{R1} p(x|C2) P(C2) dx
Minimize Expected Loss or Risk
Risk_k = Σ_l L_kl ∫_{R_l} p(x|C_k) dx   (Risk for elements of C_k)
Probabilistic Classification Review
● If we are given P(F|C) & P(C) -> P(F, C)
» How the feature is distributed for each class
● We can use this information to classify new examples using Bayes Rule
» Minimizes the probability of error…
» We may instead wish to minimize risk
● Where is the machine learning?
Information Retrieval
● The Altavista Problem
● … documents on the web
» Takes a long time to browse
● Simple Keyword Search
» Find documents with "German" and "car"
» Might miss "Germany" and "cars"
– Stemming
» Misses "Mercedes" and "automobile"
● Machine Learning?
» Given … documents on German cars, build a classifier

Keyword Search Works Well
Naïve Bayes Classifier
● Assume each word is an independent feature
f_i(Doc_j) = 1 if Doc_j has word i
P({f_i}|C_j) = Π_i P(F_i = f_i | C_j)
P(C_j|{f_i}) = P(C_j) Π_i P(F_i = f_i | C_j) / Π_i P(F_i = f_i)
Estimating Probabilities
● Maximum Likelihood
Potential Bug:
None of our Training Docs contain "Mercedes"
Curse of Dimensionality
● It is not always better to measure more features
● New results seem to address this problem
» Support Vectors, Boosting, etc.

Density Estimation is Ambiguous

Impacts Classification
6.891 Machine Learning and Neural Networks
Lecture 3:
Density Estimation
News
● Sorry about the recitation mix-up
» We will announce by email soon
● Problem Set … is due tomorrow
» See web for policy…
● Problem Set … will be available by tonight
● Kinh and I will be taking photos
Review & Overview
● Lecture 2:
» Overfitting Polynomials
» Reviewed the Probabilistic Approach
» Information Retrieval Example
● Density/Distribution Estimation
» Information Retrieval
– estimating binary RV's
» Gaussians
» Multidimensional Gaussians
» Non-parametric Densities
Bayesian Text Classification
{d_k}: A collection of documents
W_i(d_k) = 1 if d_k contains word i, 0 otherwise
P(F_1 = f_1, F_2 = f_2, …| C = c_j) = P({f_1 … f_N}| C = c_j)
  ≡ Π_i P(F_i = f_i | C = c_j)   (Assume Independence)
P({f_i}|C_j) = Π_i P(F_i = f_i | C_j)
[Figure: graphical model with class node C and feature nodes F1, F2, F3, F4, …, FN; each edge carries P(F_i|C)]
Classification Using Bayes Law
P(c_j|{f_i}) = P(c_j) Π_i P(f_i|c_j) / Π_i P(f_i)
c_1 = German Cars
c_0 = Other Documents

{d_k}: A collection of documents
W_i(d_k) = f_ki = 1 if d_k contains word i, 0 otherwise
P(F_i = 1 | C = c_j) = p_ij
● How can we learn p_ij?
» Maximum Likelihood Principle
» Choose p_ij so that the training data is most probable
Maximum Likelihood
P({d_k}|c_0) = Π_k Π_i P(f_ki | c_0)
Log Likelihood: if word i occurs in n_i of the N training documents,
log P({d_k}|c_0) = n_i log p_ij + (N − n_i) log(1 − p_ij) + …
Setting the derivative with respect to p_ij to zero:
n_i (1/p_ij) − (N − n_i) / (1 − p_ij) = 0
n_i / p_ij = (N − n_i) / (1 − p_ij)
n_i / (N − n_i) = p_ij / (1 − p_ij)
p_ij = n_i / N
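A minimal Matlab sketch of this maximum-likelihood estimate and the resulting Naïve Bayes score, under our own assumed setup (not the slides'): docs is a binary N-by-V word-occurrence matrix for one class, newdoc a binary 1-by-V vector.

% ML estimate p_ij = n_i / N for every word in the class
N = size(docs, 1);
p = sum(docs, 1) / N;
% Log class-conditional score for a new document (add log P(c_j) for the posterior)
logp = sum( newdoc .* log(p) + (1 - newdoc) .* log(1 - p) );

Note that log(0) appears whenever a word never occurred in training — exactly the "Mercedes" bug raised on the next slide.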
Estimating Probabilities
● Maximum Likelihood: p_ij = n_i / N
Potential Bug:
None of our Training Docs contain "Mercedes"
Prior Expectations
● Given a small amount of data we can't be absolutely sure that "Mercedes" will never appear in documents from our class…
» We may have gotten unlucky
● Use prior expectations to improve our estimates
● Problem:
» Mercedes occurs in … out of … total documents
» But never in the "German cars" training set
» What is a good estimate for p(mercedes | GermanCars)?
Maximum A Posteriori (MAP) estimate of p_ij:
P(p_ij | {d_k}, c_0) = P({d_k}| c_0, p_ij) p(p_ij) / P({d_k}| c_0)
This turns out to be more useful for continuous parameters.
What is the right prior?
● The most agnostic prior is the uniform density
Bayesian Estimation
P(p_ij | {d_k}, c_0) = P({d_k}| c_0, p_ij) p(p_ij) / P({d_k}| c_0),  with P(F_i = 1 | C = c_j, p_ij) = p_ij
Averaging over the posterior:
P(F_i = 1 | {d_k}, c_0) = ∫ p_ij P({d_k}| c_0, p_ij) p(p_ij) dp_ij / P({d_k}| c_0)
… Continued
P(F_i | c_j) = ∫ p_ij P({d_k}| c_0, p_ij) p(p_ij) dp_ij / ∫ P({d_k}| c_0, p_ij) p(p_ij) dp_ij
            = ∫ p_ij [ (p_ij)^{n_i} (1 − p_ij)^{N − n_i} ] dp_ij / ∫ [ (p_ij)^{n_i} (1 − p_ij)^{N − n_i} ] dp_ij
What if no Mercedes?
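A worked completion that is not on the surviving slides but follows directly from the integrals above: with the uniform prior these are Beta integrals, ∫_0^1 p^a (1 − p)^b dp = a! b! / (a + b + 1)!, so the posterior-mean estimate is

P(F_i = 1 | {d_k}, c_0) = (n_i + 1) / (N + 2)

(Laplace's rule of succession). For "Mercedes" with n_i = 0 this gives 1/(N + 2): small, but not zero.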
6.891 Machine Learning and Neural Networks
Lecture 4:
New Density Estimators
News
● Problem set … will be handed out today
● Problem set … is on the web
» It is much harder than the first pset
● Problem Sets:
» Please show some work
» Make sure to get the psets to Kinh
– Especially if they are last minute
Review & Overview
● Lecture 3:
» Talked about Information Retrieval
– Need priors over parameters
» Derived Maximum Likelihood for Bernoulli RV's
» Discussed use of priors over parameters
● New Density Estimators (Continuous):
» Gaussian
» Non-parametric
» Mixture of Gaussians
● Quick tour of Expectation Maximization
Why Gaussians?
● Analytically Tractable
● Central Limit Theorem
» Sum of many variables is Gaussian
● Linear Transforms of Gaussians are Gaussian
● Gaussians have the highest Entropy (for a given variance)
Multi-Dimensional Gaussian

Eigen Structure

Recall: Bayes Decision Boundaries

Discriminant Function

Set Discriminants Equal
Bayesian Parameter Estimation
● What if you have little data…
● Or if you have strong expectations?

Convergence of Probability
Reminder: Why we are here
[Figure: a column of 20 sampled data values (ranging from about −1.27 to 1.40, and from 0.18 to 4.08) next to a scatter plot of the samples]
Different Samples, Different Decisions
Concept: Variance
The variation you observe when training on different independent training sets.
But when data gets more complex...
Concept: Training Error
Error in your classifier on the training set.
Even if you had "infinite" data …
Related Concept: Bias
Error in your classifier in the limit as the size of the training data grows.
Histogram
[Figure: the 20 sampled data values and their histogram over bins from −1.5 to 1.5]
Divide by N to yield a probability.
6.891 Machine Learning and Neural Networks
Lecture 5:
Density Estimation and Classification
News
● No Lecture on Wednesday
» Be sure to get Kinh your graded psets by Wednesday
– Recitation
– Drop it off
● Guest Lecture by Leslie Kaelbling on Friday
» Reinforcement Learning
Review & Overview
● Lecture 4:
» Gaussian Density Estimation
» Covariance
» Linear and Quadratic Discriminants
● New Density Estimators:
» Non-parametric
» Mixture of Gaussians
» Quick tour of Expectation Maximization
● Application: Face Detection
» Mixture of Gaussians
Histogram
[Figure: repeated from last lecture — the 20 data values and their histogram; divide by N to yield a probability]
Simple Algorithm
% Initialize one count per histogram bin center
counts = zeros(size(centers));
numdata = size(data,1);
[Figure: the resulting histogram]
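The slide only shows the initialization; a minimal completion of the counting loop, under our assumptions that centers holds the bin centers and data is a column vector (the nearest-center assignment is our choice, not necessarily the original's):

for i = 1:numdata
    % assign the sample to its nearest bin center
    [mindist, bin] = min(abs(centers - data(i)));
    counts(bin) = counts(bin) + 1;
end
prob = counts / numdata;   % divide by N to yield a probability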
Max Likelihood Gaussian
[Plots: six panels — maximum likelihood Gaussian fits to different samples drawn from the same source]

Histograms have lower bias …
Parzen: One Bump per Data Point
[Plot: a Parzen density built by summing one kernel bump per data point]
Parzen Algorithm
% Only fragments of the algorithm survive on the slide:
numdata = size(data, 1);
plot(range, func)
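A minimal sketch of the missing middle of the Parzen algorithm, under our assumptions (a Gaussian kernel of width sigma, an evaluation grid range; the variable names beyond those on the slide are ours):

numdata = size(data, 1);
sigma = 0.1;                        % kernel width: the bias/variance knob
func = zeros(size(range));
for i = 1:numdata
    % add one normalized Gaussian bump centered on each data point
    bump = exp(-(range - data(i)).^2 / (2*sigma^2)) / sqrt(2*pi*sigma^2);
    func = func + bump / numdata;
end
plot(range, func)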
Parzen and Histogram are Similar
● Both can model any type of distribution
» Given plenty of data
● Both are simple
● Parzen is differentiable; Histogram is not
● Parzen is smooth; Histogram is not
● Histogram density: evaluating p(x) is cheap
● Parzen density: evaluation is linear in the data size
[Plot: a Parzen density estimate]
Properties of Non-parametric Techniques
● Density is an analytical function of the data
● Bias and variance of the density estimator can be adjusted to the problem
● Many more parameters must be estimated
» Histogram: N^d bins
● Lose many of the simple properties of Gaussians

Semi-Parametric Models
● Have more flexibility than parametric models
» like Gaussians
● Have less variance than non-parametric models
● Evaluation of p(x) is cheap
● Determination of parameters is expensive
Mixture of Gaussians (generative view):
Flip a coin with P(k = 1) + P(k = 2) = 1; if k = 1, draw x_j from the Gaussian p(x | μ_1, σ_1); if k = 2, draw x_j from p(x | μ_2, σ_2). What is the resulting p(x)?
Face Detection
Sung & Poggio

Results

Face Detection
● Great application of probabilistic classification
» Works very well
» Requires many thousands of parameters
» Computation time is very long
● Is there an Alternative? -> Discriminants
» Also works well
» Requires fewer parameters
» Computation time is very short
Events are Disjoint -> They Add
p(X = x) = p(X = x, J = 1) + p(X = x, J = 2)
         = p(X = x | J = 1) P(J = 1) + p(X = x | J = 2) P(J = 2)
         = p(x | μ_1, σ_1) P(J = 1) + p(x | μ_2, σ_2) P(J = 2)

Expectation Maximization
P(k) ∝ Σ_j P(k | x_j)
μ_k = Σ_j P(k | x_j) x_j / Σ_j P(k | x_j)
σ_k^2 = Σ_j P(k | x_j) (x_j − μ_k)^2 / Σ_j P(k | x_j)
E = −log l({μ_k, σ_k, q_k})
Bounded Below? Decreases?
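A compact Matlab sketch of these EM updates for a two-component 1-D mixture; the initialization and iteration count are our assumptions, not the slides':

x = data(:);  n = length(x);
mu = [min(x); max(x)];  sg = [1; 1];  q = [0.5; 0.5];
for iter = 1:50
    % E-step: responsibilities P(k | x_j)
    for k = 1:2
        lik(:,k) = q(k) * exp(-(x - mu(k)).^2 ./ (2*sg(k)^2)) ./ sqrt(2*pi*sg(k)^2);
    end
    r = lik ./ (sum(lik, 2) * [1 1]);
    % M-step: responsibility-weighted means, variances, mixing weights
    for k = 1:2
        mu(k) = sum(r(:,k) .* x) / sum(r(:,k));
        sg(k) = sqrt(sum(r(:,k) .* (x - mu(k)).^2) / sum(r(:,k)));
    end
    q = sum(r, 1)' / n;
end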
News
● Sorry about missing last week…
» Scheduling hiccup, which pushed Perceptrons out of Pset …
● Pset … will be out by tonight
» Please get started early
● Pset … is due tomorrow
● Cross-grading worked out well
» But we noticed that a few people were not grading carefully
» I would like you to take this task very seriously
Review & Overview
● Lecture 5:
» Non-parametric Density Estimation
– Histograms and Parzen Densities
» Semi-parametric: Mixture of Gaussians
» Application: Face Detection… very complex
● Perceptrons
● Training Perceptrons
● Generalized Perceptrons
● Multi-Layer Perceptrons
● But this is not the only way…
● In fact this approach has come under sustained attack recently
Between density and classification
● Often the details of the density do not matter
Two-Class Gaussian, same Covariance:  y(x) = w^T x + w_0
● Alternatively, you may not know much about the density of your classes
● Construct a function that classifies directly…
Linear Discriminant
y(x) = w^T x + w_0   (w_0 is the bias — Warning!)
[Figure: single-layer network with inputs X_0, X_1, X_2, …, X_d and weights w_0, w_1, w_2, …, w_d]
Folding the bias into the weights:  y(x) = w^T x = Σ_{i=0}^{d} w_i x_i
Multiple Discriminants
y_1(x) = w_1^T x + w_{10},   y_2(x) = w_2^T x + w_{20}
The decision boundary is where y_1(x) = y_2(x):
w_1^T x + w_{10} = w_2^T x + w_{20}
(w_1 − w_2)^T x + (w_{10} − w_{20}) = 0
ŵ^T x + ŵ_0 = 0

… in a single network
y_k(x) = Σ_i w_ki x_i + w_k0
C(x) = C_k if k = argmax_i y_i(x)

Multiple Discriminants
[Figure: decision regions formed as intersections of half planes]
How do we learn linear discriminants?
● What are the principles?
» In density estimation we maximize likelihood
» In classification we minimize errors
● How do we search for the best classifier?
● Will the search have local minima?
E(w) = Σ_j ( y(x_j) − t_j )^2 = Σ_j ( w^T x_j − t_j )^2
[Figure: x's at target +1 and o's at target −1; minimize the squared error]
Quadratic cost is very simple…
E(w) = Σ_j ( y(x_j) − t_j )^2 = Σ_j ( w^T x_j − t_j )^2
In matrix form:
E(W) = (XW − T)^T (XW − T) = W^T X^T X W − 2 W^T X^T T + T^T T
dE(W)/dW = 2 X^T X W − 2 X^T T = 0
X^T X W = X^T T
W = (X^T X)^{-1} X^T T
● Direct linear expression for the weights given the training data
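A two-line Matlab sketch of this closed form; X (one example per row, with a constant column for the bias) and T (targets ±1) are our assumed setup:

% Least-squares discriminant weights via the normal equations
W = (X' * X) \ (X' * T);    % backslash avoids an explicit inverse
y = sign(X * W);            % classify the training examples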
7
What about Gradient Descent?
E(w) = Σ_j ( y(x_j) − t_j )^2 = Σ_j ( w^T x_j − t_j )^2
∂E(w)/∂w = 2 Σ_j ( w^T x_j − t_j ) x_j = 2 Σ_j δ_j x_j
Batch update:  w_t = w_{t−1} − η Σ_j δ_j x_j
The error has many components:  E(w) = Σ_j E_j = Σ_j (δ_j)^2
Pick an example at random and update on it alone (stochastic descent):
w_t = w_{t−1} − η ∂E_j/∂w = w_{t−1} − η δ_j x_j
● Pick Random Example
● Observe Output Error
● Adjust Weights to Reduce Error
[Figure: single-layer network with inputs X_0 … X_d and weights w_0 … w_d]
Can't Always Solve for the Weights…
y(x) = g(w^T x),   g(a) = 1 if a ≥ 0, 0 otherwise
(or g(a) = 1 if a ≥ 0, −1 otherwise)
● Perceptrons: McCulloch and Pitts
» Originally as a model for real neurons

Perceptron
Perceptron Cost Function
E(w) = Σ_j ( g(w^T x_j) − t_j )^2
∂E(w)/∂w = 2 Σ_j ( g(w^T x_j) − t_j ) ∂g(w^T x_j)/∂w
Simple gradient descent does not work: g is a step function, so its derivative is zero almost everywhere.
Perceptron Criterion:
E(w) = −Σ_{errors} (w^T x_j) t_j
∂E(w)/∂w = −Σ_{errors} t_j x_j
w_t = w_{t−1} + η t_j x_j   (descending the gradient on a misclassified example)
Perceptron Learning
[Figure: single-layer perceptron with inputs X_0 … X_d and weights w_0 … w_d]

Real Perceptrons

A classic problem...
[Figure: o's surrounded by x's — a two-class pattern that no single linear discriminant separates]
6.891 Machine Learning and Neural
Networks
Lecture 7:
Multi-Layer Perceptrons
Back Propagation
News
● Pset 3 is on the web
» Includes a classifier "shootout"
» The mystery dataset has 20 dimensions and two classes
» Winner gets $10 of Toscanini's
● Pset 2 looks great …
» Many of you did a lot of work.
Review & Overview
● Lecture 6:
» Linear Discriminants
» Perceptrons
» Training Perceptrons
● Generalized Perceptrons
● Multi-layer Perceptrons
» Multi-Layer Derivatives
» Back Propagation
● Examples:
» NET Talk
Update Rule:  w_t = w_{t−1} − η ∂E_j/∂w = w_{t−1} − η δ_j x_j
Different Criteria…
Squared error:
E(w) = Σ_j ( w^T x_j − t_j )^2
∂E(w)/∂w = 2 Σ_j ( w^T x_j − t_j ) x_j = 2 Σ_j δ_j x_j
w_t = w_{t−1} − η δ_j x_j
Perceptron criterion:
E(w) = −Σ_{errors} (w^T x_j) t_j
∂E(w)/∂w = −Σ_{errors} t_j x_j
w_t = w_{t−1} + η t_j x_j
Normalizing examples (replace x_j by t_j x_j):
w_t = w_{t−1} + η x_j   — for errors only!
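A short Matlab sketch of this normalized perceptron loop; the data setup (X with one example per row, labels t = ±1) and the pass cap are our own assumptions:

Xn = X .* (t * ones(1, size(X,2)));   % normalize: flip the negative examples
w = zeros(size(X,2), 1);
for pass = 1:100
    for j = 1:size(Xn,1)
        if Xn(j,:) * w <= 0           % misclassified (or on the boundary)
            w = w + Xn(j,:)';         % the update rule with eta = 1
        end
    end
end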
The update rule in action...
w_t = w_{t−1} + x_j   (on errors, with η = 1)

Real Perceptrons

A classic problem...
[Figure: the same o's-inside, x's-outside pattern — not linearly separable]
Generalized Perceptron
y(x) = g(w^T x) — can't do that (for this problem)!
Augment the features: x̂ = (x_1, x_2, x_1 x_2)^T with ŵ = (1, 1, −2.1)^T — works great.

Another Generalized Perceptron
Two Dilemmas
● How does one find/define the correct set of features?
● How many will you need?
● 1950's answers:
» Don't know… we'll just think them up.
» Don't know… we'll just keep adding wires.
Multiple Layers
[Figure: two-layer network computing XOR — input units u1 (constant 1), u2 = X1, u3 = X2; hidden units u4, u5 with bias weights −1.5 and −2.5; output y; the weight matrices W shown with 0/1 entries]
How can we learn this??
1980's: Perhaps Gradient Descent?
Replace the hard threshold with a sigmoid:
y(x) = s(w^T x),   s(a) = 1 / (1 + e^{−a})
E(w) = Σ_j ( s(w^T x_j) − t_j )^2
∂s(u)/∂w = s(u) (1 − s(u)) ∂u/∂w
∂E(w)/∂w = 2 Σ_j ( s(w^T x_j) − t_j ) s(w^T x_j) ( 1 − s(w^T x_j) ) x_j

For the two-layer network (inputs u1, u2, u3; hidden units u4, u5; output u6):
y(x) = s( w_64 u_4 + w_65 u_5 )
     = s( w_64 s(w_41 u_1 + w_42 u_2 + w_43 u_3)
        + w_65 s(w_51 u_1 + w_52 u_2 + w_53 u_3) )
E(w) = Σ_j ( y(x_j) − t_j )^2, and the chain rule gives ∂E/∂w for every weight.
Multi-Layer Conventions
u_k = g( Σ_j w_kj u_j ),   a_k = Σ_j w_kj u_j
More Conventions (layer vectors):
u^2 = g( W^21 u^1 ),   u^3 = g( W^32 u^2 )
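A minimal Matlab sketch of one gradient (back propagation) step for this two-layer network; the layer sizes, learning rate eta, and variable names are our assumptions:

s = inline('1 ./ (1 + exp(-a))');
% forward pass
a1 = W21 * u1;  u2 = s(a1);        % hidden layer
a2 = W32 * u2;  y  = s(a2);        % output layer
% backward pass: deltas from the squared error (y - t)^2
d2 = 2 * (y - t) .* y .* (1 - y);
d1 = (W32' * d2) .* u2 .* (1 - u2);
% gradient descent on both weight matrices
W32 = W32 - eta * d2 * u2';
W21 = W21 - eta * d1 * u1';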
Solving XOR (big deal?)

Very Simple Solution
95% Accurate
6.891 Machine Learning and Neural
Networks
Lecture 8:
Back Prop and Beyond
News
● Mid-term will be on 10/20
» Here in this room.
» It should take about 1 hour… but we will give you 1.5
– Show up on time, please.
» Coverage: Psets 1, 2 and 3.
– Density estimation (Parametric, Semi- and Non-parametric)
– Bayesian Classification
– Discriminants (Linear, Perceptron, Multi-layer)
Review & Overview
● Lecture 7:
» Multi-Layer Derivatives
» Back Propagation
» Examples:
– NET Talk
Intensity Preprocessing
Training Data
Positives
Negatives

Performance

MLP: How Powerful?
[Figure: block diagram — Inputs/Materials feed a Control box and a Plant, producing Outputs/Products]
1990: The height of MLP's and Back Prop
● Multi-layer perceptrons can solve any approximation problem (in principle)
» Given 3 layers
» Given an infinite number of units and weights
● There is no direct technique for finding the weights (unlike linear discriminants)
● Gradient descent (using Back Prop) comes to dominate discussion in the Neural Net community
» Can you find a good set of weights quickly?
– How can you speed things up?
» Will you get stuck in local minima?
● A small group in the community also worries about generalization.
Simplest case: E(w) = (wx − y)^2 with x = 1, y = 0, so
E(w) = (w − 0)^2 = w^2,   ∂E/∂w = 2w
Gradient step:  w_t = w_{t−1} − η ∂E/∂w_{t−1}; with η = 1/2 it converges in one step.
Scale the Input
● Simplest Case: 1 weight, quadratic error function
Doubling the input (x = 2) rescales the error surface:
E(w) = (2w − 0)^2 = 4w^2,   ∂E/∂w = 8w
so the same η now overshoots:  w_t = w_{t−1} − η ∂E/∂w_{t−1}

Multiple Weights
Hack 2: Momentum
[Figure: gradient descent zig-zagging down a narrow valley; step sizes 0.020, 0.047, 0.049, 0.050]
Momentum
Δw_t = −η ∂E/∂w_{t−1} + α Δw_{t−1},   w_1 = w_0 + Δw
For a quadratic E(w) = aw^2 + bw + c:   E′ = 2aw + b,   E″ = 2a
A Newton-style step uses the curvature:
Δw = −E′/E″ = −(w_0 + b/2a),  so  w_1 = w_0 − (w_0 + b/2a) = −b/2a
(the minimum, reached in one step).
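A small Matlab sketch of the momentum update on a 1-D quadratic; eta, alpha, and the loop length are our choices:

a = 2; b = -4;                       % E(w) = a*w^2 + b*w
w = 0; dw = 0;
eta = 0.1; alpha = 0.9;
for t = 1:100
    grad = 2*a*w + b;                % E'(w)
    dw = -eta * grad + alpha * dw;   % momentum update
    w = w + dw;
end
% w approaches -b/(2a) = 1, the minimum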
More Principled Hacks...
● Second Order Techniques
» N weights --> N^2 Hessian entries
» Also destabilizes learning
● Line Search
» Expensive but hard to beat

Local Minima
● Number of Papers
» 1000's of local minima in simple problems (XOR)
Bias and Variance
● How many layers are right?
● How many units per layer?
● What about structural constraints?

ALVINN
Pomerleau

No Hands Across America

Zip Codes
Le Cun
6.891 Machine Learning and Neural
Networks
Lecture 9:
On to Support Vector Techniques
News
● Final will be 12/13 at 1:30PM
» If you have a conflicting final let us know.
● Remember that almost all the material appears in the book…
» Right now we are jumping back and forth between
– Chapter 5
– Chapter 6
Review & Overview
● Lecture 8:
» Multi-layer Perceptrons
» Back propagation
» Hacks (… many)

History Lesson
● 1950's: Perceptrons are cool
» Very simple learning rule, can learn "complex" concepts
» Generalized perceptrons are better -- too many weights
● 1960's: Perceptrons stink (M+P)
» Some simple concepts require exponential # of features
– Can't possibly learn that, right?
● 1980's: MLP's are cool (R+M / PDP)
» Sort of simple learning rule, can learn anything (?)
» Create just the features you need
● 1990: MLP's stink
» Hard to train: Slow / Local Minima
● 1996: Perceptrons are cool
Why did we need multi-layer perceptrons?
n: variables,  k: order of the polynomial
C(n + k, k) = (n + k)! / (k! n!) ∈ O( min(n^k, k^n) ) features
14th Order??? 120 Features
N = 21, k = 5 --> 65,000 features
MLP's vs. Perceptron
● MLP's are incredibly hard to train…
» Takes a long time (unpredictably long)
» Can converge to poor minima
● MLP's are hard to understand
» What are they really doing?

Perceptron Training is Linear Programming
Σ_i w_i x_i^l > 0   ∀l
• After normalization
• After adding bias
• Assumes no errors
Polynomial time in the number of variables and in the number of constraints.
With slack variables for errors:
Σ_i w_i x_i^l + s_l > 0  ∀l,   min Σ_l s_l,   s_l > 0  ∀l
Rebirth of Perceptrons
● How to train efficiently.
» Linear Programming (… later quadratic programming)
● How to get so many features inexpensively?!?
● How to generalize with so many features?
» Occam's revenge.
Perceptron training only visits the errors: w_0 = 0, Δw_t = η x_t, so
w_t = Σ_{errors} η x_l = Σ_l b_l x_l
and with augmented features:  w_t = Σ_l b_l Φ(x_l)
Lemma 2: Only need to compare examples
Writing w = Σ_j b_j Φ(x_j) turns every constraint into kernel evaluations:
Σ_j b_j K(x_l, x_j) + s_l > 0  ∀l,   min Σ_l s_l,   s_l > 0  ∀l
min Σ_j b_j^2   (Smoother)
Linear Program is not unique
If Σ_i ŵ_i x_i^l > 0 ∀l, then Σ_i (λŵ_i) x_i^l > 0 ∀l for any λ > 0.
Σ_i w_i x_i^l + s_l > 0  ∀l,   min Σ_l s_l
Require non-zero margin
Σ_i w_i x_i^l + s_l > 0  ∀l   — allows solutions with zero margin.
Σ_i w_i x_i^l + s_l > 1  ∀l   — enforces a non-zero margin between examples and the decision boundary.

Constrained Optimization
Σ_j b_j K(x_l, x_j) + s_l > 1  ∀l,   min Σ_l s_l,   s_l > 0  ∀l
min Σ_j b_j^2
Constrained Optimization 2
x^3 is inactive
Support Vectors

SVM: examples
» Minimize ν w^T w = ν Σ_i w_i^2
● Introduce the margin so that the constraints become a set of linear inequalities
● Find the best solution using Quadratic Programming
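A hedged Matlab sketch of that quadratic program using the Optimization Toolbox's quadprog; the hard-margin form and the data setup (kernel matrix K, labels t = ±1) are our assumptions:

% minimize 0.5*b'*H*b  subject to  -diag(t)*K*b <= -1,
% i.e.  t_l * sum_j b_j K(x_l, x_j) >= 1 for every example l
H = eye(n);                 % penalizes sum of b_j^2
f = zeros(n, 1);
A = -diag(t) * K;           % K(l, j) = K(x_l, x_j)
c = -ones(n, 1);
b = quadprog(H, f, A, c);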
SVM: Difficulties
● How do you pick the kernels?
» Intuition / Luck / Magic / …
∀j:  (2t_j − 1) Σ_i w_i K(x_j, c_i) ≥ 1 − ε_j
min( w^T w + c Σ_j |ε_j| )   (Slack Variables)
Example: 6 weights
● Data dimension: 2
● Feature Space: 2nd order polynomial
» 4 dimensional
SVM versus Perceptron
● Why not just use a perceptron?
» Use all training points as centers
y(x) = Θ( Σ_i w_i K(x, c_i) )
Perceptron update on an error:  w_i <- w_i + η K(x, c_i)

Zip Codes

SVM: Faces
Support Vectors
6.891 Machine Learning and Neural
Networks
Lecture 10:
Support Vector Machines
More Details and Derivations
News
● Quiz is 1 week from today.
Pset 2
● SVM review
● Why is it called "Support Vectors"??
● Derivation of some simpler properties.
SVM: Key Ideas
● Augment inputs with a very large feature set Φ(x)
» Polynomials, etc.
● Use the Kernel Trick(TM) to do this efficiently
● Enforce/Encourage smoothness with a weight penalty
» Minimize b^T b = Σ_i b_i^2 — avoid b_i ≠ 0 for all i!
min( w^T w ) subject to the constraints  ∀j: Σ_i b_i K(x_j, c_i) ≥ 1
y(x) = Θ( Σ_i b_i K(x, c_i) )
● Many of the b's are zero -- inactive constraints
» Only keep the examples where b_i ≠ 0 (the Support Vectors)
● Likely to generalize well
» VC Dimension -- later in the semester
An alternative motivation
● Like all good ideas, Support Vector Machines can be motivated in several different ways.
The optimal dividing line…
● The optimal separator maximizes the margin between positive and negative examples
d_− = max_{negatives} w^T x_i
d_+ = min_{positives} w^T x_i
margin = (d_+ − d_−) / |w|
max_w (margin) = max_w (d_+ − d_−) / |w|
Optimal dividing line = Support Vectors
Fix the scale with the constraints
∀negatives: w^T x_i ≤ −1,   ∀positives: w^T x_i ≥ 1
Then maximizing the margin (d_+ − d_−)/|w| is equivalent to  min w^T w,
and the solution can be written  w = Σ_l b_l x_l  or  w = Σ_l b_l Φ(x_l).
Lemma 1: Kuhn-Tucker Conditions
min( w^T w )  subject to  w^T x^1 ≥ 1,  w^T x^2 ≥ 1,  w^T x^3 ≤ −1
At the solution only the active constraints matter:
w^T x^1 = 1,  w^T x^2 = 1,  and  w = b_1 x^1 + b_2 x^2
y(x) = Θ( Σ_i w_i K(x, c_i) )
» Update using the perceptron rule:
Perceptrons are not smooth…

SVM: Faces
Support Vectors

SVM: Difficulties
● How do you pick the kernels?
» Intuition / Luck / Magic / …
∀j:  Σ_i b_i K(x_j, c_i) + s_j ≥ 1
min( b^T b + c Σ_j |s_j| )   (Slack Variables)
SVM: Generalization??
● Is there a formal proof that SVM's will work better than Perceptrons or MLPs??
» Perhaps…
● There is a tenuous relationship between maximizing the margin and reducing the complexity of the classifier.
» The complexity of the classifier is reduced to the number of support vectors.
» Hard problems require more support vectors.
● The VC-Dimension of a support vector machine is controlled by maximizing the margin.

Can we regain the simplicity of Perceptrons?
6.891 Machine Learning and Neural
Networks
Lecture 11:
More Kernel Networks
News
● Matlab was down at the AI lab for a few hours.
» I am not terribly sympathetic… since it was after the official deadline for the pset.
» Just hand it in as soon as you can.
Review & Overview
● Lecture 10:
» The Support in Support Vectors
» The Margin is a key concept
d_− = max_{negatives} w^T x_i,   d_+ = min_{positives} w^T x_i
∀negatives: w^T x_i ≤ −1,   ∀positives: w^T x_i ≥ 1
max margin (d_+ − d_−)/|w|   <=>   min w^T w
Optimal dividing line = Support Vectors
With the constraints active at the support vectors:
d_− = −1/|w|,   d_+ = 1/|w|
d_+ − d_− = 1/|w| − (−1/|w|) = 2/|w|
so maximizing the margin (equivalently 1/(w^T w)) is the same as  min w^T w.

This ends up being exactly like polynomial fitting… except that there is one weight per data point.
Radial Basis Function Networks
K(x, c) = K(|x − c|),   y(x) = Σ_i b_i K(|x − c_i|)
[Plots: a handful of target values on [0, 10] — how should we interpolate between them?]
Intuition
Setting up the problem
Least squares for the coefficient vector Y:
Cost = (WY − T)^2,   Y = (W^T W)^{-1} W^T T   — Not Invertible!
Here T is the target vector (values 1, 3, 1 at the three sampled positions, 0 elsewhere) and W is the sparse 0/1 matrix that picks out the sampled positions.

Add a penalty on the size of the solution:
Cost = (WY − T)^2 + λ Y^T Y,   Y^T Y = Σ_i y_i^2
Y = (W^T W + λI)^{-1} W^T T
Small solution vectors are best.
[Plot: the resulting estimate — nonzero only at the sampled positions]
… and the winner is?
[Plots: three candidate reconstructions of the same data — a spiky least-squares solution, a smoother one, and a Bayesian estimate]

Derivative Measures Smoothness
Need to find lambda …
[Figure: with λ = 0.001 the solution Y interpolates the targets (values 1.0, 1.5, 2.0, 2.5, 3.0, 2.6, 2.2, 1.8, 1.4, 1.0); with λ = 10 it is much flatter (values between about 1.56 and 1.84)]
A Closer Look
[Plots: the regularized solutions and the individual kernel/basis functions that compose them]
Still Piecewise Cubic
[Plots: the fitted function, its kernels, and their pieces — all piecewise cubic]
Regularization to RBF's
E(y) = Error(y) + Smoothness(y)
[Figure: one Gaussian bump placed at every training point]

Too Many Centers 2
● Put them where you need them…
» To best approximate your function
6.891 Machine Learning and Neural
Networks
Lecture 12:
Smooth Functions and Kernel Networks
News
● Quiz was too hard…
» I am trying to come up with a creative grading scheme.
– Best 5 out of 6 problems???
– First let us do the grading.
● Problem set will be out by tonight.
Review & Overview
● Lecture 11:
» Trying to find smooth functions.
● Smooth Regression
» Another way of motivating Kernel networks
[Plots: a function, its derivative, and the smoothness penalty it incurs]
Regression Review
● Up until now we have been mostly analyzing classification:
» X, inputs. Y, classes. Find the best c(x).
● Today: Regression.
» X, inputs. Y, outputs. Find the best f(x).
» Predict the stock's value next week.
» "Picture of Road" -> "Car steering wheel"
» etc.
min_w Σ_j ( f(x_j, w) − y_j )^2,   e.g.  f(x, w) = w_0 + w_1 x + w_2 x^2 + …
● Bayesian Approach
» Find the most likely function:
max_f p(f | {x_j, y_j}) = max_f p({x_j, y_j} | f) p(f) / p({x_j, y_j})
Bayesian framework captures many approaches
max_f p(f | {x_j, y_j}) = max_f p({x_j, y_j} | f) p(f) / p({x_j, y_j})
max_f [ log p({x_j, y_j} | f) + log p(f) − log p({x_j, y_j}) ]
log p({x_j, y_j} | f) = log Π_j p(x_j, y_j | f) = Σ_j log p(x_j, y_j | f) = −Σ_j c ( f(x_j) − y_j )^2
Choices of prior:
p(f) = ε if f is a poly, 0 otherwise
log p(f) = −∫ |∂f/∂x|^2
Also popular…  log p(f) = −∫ |∂^2 f/∂x^2|^2
A closer look...
C(f) = Σ_j c ( f(x_j) − y_j )^2 + λ ∫ |∂f/∂x|^2
Data:  X = 1, 5, 10;  Y = 1, 3, 1
● How do we minimize this function?
» The set of possible functions is infinite
» The space of functions is infinite dimensional
Set  dC(f)/df = 0
We could approximate f.
[Plots: the same function discretized at 10, 20, 100, and 1000 sample points]
[Plots: a handful of target values on [0, 10] — how should we interpolate?]
Intuition
Setting up the problem
Cost = (WF − Y)^2,   F = (W^T W)^{-1} W^T Y   — Not Invertible!
Y is the target vector (1, 3, 1 at the three sampled positions), W picks out the sampled positions, and F is the discretized function we solve for.
Cost = (WF − Y)^2 + λ F^T F,   F^T F = Σ_i f_i^2
F = (W^T W + λI)^{-1} W^T Y
Small solution vectors are best.
[Plot: the minimizer is zero except at the sampled points]

… and the winner is?
[Plots: competing reconstructions, including a Bayesian one]

The smoothness version of the cost:
C(f) = Σ_j ( f(x_j) − y_j )^2 + ∫ |∂f(x̂)/∂x̂|^2 dx̂
Setting up the Problem
[Figure: with λ = 0.001 the discretized solution F interpolates the data (1.0 … 3.0 … 1.0); with λ = 10 it flattens toward the mean (values near 1.56-1.84)]
Derivative Order Controls Shape
[Plots: minimizing ∫|∂f/∂x|^2 gives a piecewise linear solution; minimizing ∫|∂^2 f/∂x^2|^2 gives a smoother, spline-like solution]

A Closer Look
[Plots: the two solutions and their basis functions]
Look at the regularizer...
Cost = (WF − T)^2 + λ (DF)^2,   F = (W^T W + λ D^T D)^{-1} W^T Y
D is the first-difference matrix:
D =
 1 -1  0  0  0  0  0  0  0  0
 0  1 -1  0  0  0  0  0  0  0
 0  0  1 -1  0  0  0  0  0  0
 0  0  0  1 -1  0  0  0  0  0
 0  0  0  0  1 -1  0  0  0  0
 0  0  0  0  0  1 -1  0  0  0
 0  0  0  0  0  0  1 -1  0  0
 0  0  0  0  0  0  0  1 -1  0
 0  0  0  0  0  0  0  0  1 -1
and D'*D is the discrete second derivative:
 1 -1  0  0  0  0  0  0  0  0
-1  2 -1  0  0  0  0  0  0  0
 0 -1  2 -1  0  0  0  0  0  0
 0  0 -1  2 -1  0  0  0  0  0
 0  0  0 -1  2 -1  0  0  0  0
 0  0  0  0 -1  2 -1  0  0  0
 0  0  0  0  0 -1  2 -1  0  0
 0  0  0  0  0  0 -1  2 -1  0
 0  0  0  0  0  0  0 -1  2 -1
 0  0  0  0  0  0  0  0 -1  1
Using the second difference as D instead, D'*D becomes the discrete fourth derivative (rows of the form … 1 -4 6 -4 1 …).
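A short Matlab sketch of this discrete regularized solve; the grid size, λ, and sample positions are our example choices, not the slides':

m = 50;  lambda = 1;
D = diff(eye(m));            % first-difference matrix, (m-1) x m
idx = [1 25 50];  Y = [1; 3; 1];
W = zeros(length(idx), m);
for j = 1:length(idx)
    W(j, idx(j)) = 1;        % W picks out the sampled positions
end
F = (W'*W + lambda * (D'*D)) \ (W'*Y);
plot(F)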
What about continuous functions??
C(f) = λ ∫ |∂f/∂x|^2,   ∀x: ∂C(f)/∂f(x) = 0   (an infinite number of derivatives, one per x)
δC(x) = ∂C(f)/∂f(x) = lim ( C(f + δ_x) − C(f) ) / |δ_x|
[Plots: perturbing f at a single point and watching the cost change]
Still Piecewise Cubic
[Plots: the regularized solutions and their kernels — piecewise cubic curves]
Regularization to RBF's
E(y) = Error(y) + Smoothness(y)
[Figure: one Gaussian bump at every training point]

Too Many Centers 2
● Put them where you need them…
» To best approximate your function
f(x) = Σ_j w_j K(x, x_j)   — many j's are zero!!!
6.891 Machine Learning and Neural
Networks
Lecture 13:
Kernel Networks
… on to Unsupervised Learning
News
● Quizzes are graded…
» Each problem has been graded.
» ** The overall score for the quiz is being determined.
– We ran out of time last night.
● Course grading: (approximate)
» Psets: 35%
» Quiz: 20%
» Final: 30%
» Project: 10%
» Participation: 5%

Pset 3
Exams
Grading alternatives…
Calculus of Variations
δC(x) = ∂C(f)/∂f(x) = ( C(f + δ_x) − C(f) ) / |δ_x|
For the pure smoothness cost C(f) = ∫ |∂f/∂x|^2:
δC(x) ∝ f″(x) = 0   =>   f(x) = ax + b
With the data term, C(f) = Σ_j ( f(x_j) − y_j )^2 + λ ∫ |∂f/∂x|^2:
δC(x) = λ f″(x) + Σ_j 2 ( f(x_j) − y_j ) δ(x − x_j) = 0
f″(x) = −(1/λ) Σ_j 2 ( f(x_j) − y_j ) δ(x − x_j)
A Closer Look
[Plots: the solution and its kernels]

Still Piecewise Cubic
[Plots: fitted curves and basis functions — piecewise cubic]
The kernel solves the variational problem:
C(f) = Σ_j ( f(x_j) − y_j )^2 + λ ∫ |∂f/∂x|^2
f″(x) = Σ_j a_j δ(x − x_j)   — so guess   f(x) = Σ_j b_j K(x, x_j)
Then f″(x) = Σ_j b_j ∂^2 K(x, x_j)/∂x^2, and we need
∂^2 K(x, x_j)/∂x^2 = δ(x − x_j)   =>   K(x, x_j) = |x − x_j|

Cubics are similar...
C(f) = Σ_j ( f(x_j) − y_j )^2 + λ ∫ |∂^2 f/∂x^2|^2
f⁗(x) = Σ_j a_j δ(x − x_j),   f⁗(x) = Σ_j b_j ∂^4 K(x, x_j)/∂x^4
∂^4 K(x, x_j)/∂x^4 = δ(x − x_j)   =>   K(x, x_j) = |x − x_j| (x − x_j)^2

E(y) = Error(y) + Smoothness(y)
[Figure: one Gaussian at every training point]
Smoothness is easily controlled
[Plots: kernel widths trading off fit against smoothness]

Too Many Centers 2
● Put them where you need them…
» To best approximate your function
SVM regression with an ε-insensitive constraint:
f(x) = w^T x + b,   min w^T w   subject to
(w^T x_j + b) − y_j ≤ ε   and   y_j − (w^T x_j + b) ≤ ε
SVM Regression
Cost(f) = c Σ_j | w^T x_j + b − y_j |_ε + w^T w
f(x) = Σ_j w_j K(x, x_j)   — many w_j's are zero!!!
New Topic: Unsupervised Learning
● What can you do to "understand" data when you have no labels?
» Find unusual structure in the data.
» Find simplifications of the data.
● Find the clusters in the data:
» Fit a mixture of gaussians…
– Been there, done that.
● Reduce the dimensionality of the data:
» Find a linear projection from high to low dimensions
● These all amount to density estimation
● There are many other approaches
» Build a tree which captures the data, etc.
6.891 Machine Learning and Neural
Networks
Lecture 14:
… on to Unsupervised Learning
News
● I will try to give you a feeling for where we are headed:
» Next 4 lectures
– Bayes Nets / Graphical Models / Boltzmann Machines / HMM's
» After that a series of topics (… from papers).
Review & Overview
● Lecture 13:
» The end of regression…

Exploratory Data Analysis
● Machine learning is simply not that smart…
● It is still very important to look at the data.
● But when there are millions of examples and thousands of dimensions you cannot look at the data.
Example of Clustering
● Can I get some examples of clustering???
● PDP??
● Andrew Moore

Speed Learning???
● Regression & Kernel Networks:  f(x) = Σ_j b_j K(x, x_j)

Support Vector Machines:  f(x) = Σ_j b_j K(x, x_j)
Curse of Dimensionality of Nearest Neighbor
● How far is it to your nearest neighbor??
» Easier Question: How far do you have to look before expecting 1 neighbor?
Vol(S_k(r)) = c_k r^k

Dimensionality reduction:  y = Wx
● Where W has fewer rows than columns…

First Eigenvector preserves more info...
256,000 Numbers -> 4,000 Numbers
Dimensionality Reduction

Information Theory for Signal Separation
[Figure: sound sources recorded by microphones, then unmixed]

Let's look at data
[Plots: unmixed vs. mixed signals, and the PCA "unmixing" attempt]
Mathematical Assumptions
M = AS
● Assumptions:
» Sound travels instantaneously
» Sound mixes linearly
» Signals are independent

The Unmixing Problem
● We would like to undo the mixing:  Ŝ = A^{-1} M = A^{-1} A S
y = g(Wx)
● Where W has fewer rows than columns…
Choose W to maximize information:  max_W MI( g(Wx), x )  =>  W = independent components
ICA
[Plot: the unmixed signals]
Learning Rule
ΔW = ( W^{-T} + (1 − 2y) x^T ) W^T W
   = W + (1 − 2y) x^T W^T W
   = W + (1 − 2y) u^T W,   where u = Wx and y = g(u)
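A hedged Matlab sketch of one step of this natural-gradient ICA rule (Bell-Sejnowski style); the per-sample update, learning rate eta, and logistic nonlinearity are our assumptions:

% x: one mixed sample (column), W: current unmixing matrix
u = W * x;
y = 1 ./ (1 + exp(-u));                     % logistic nonlinearity g(u)
dW = (eye(size(W)) + (1 - 2*y) * u') * W;   % the rule above
W = W + eta * dW;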
6.891 Machine Learning and Neural
Networks
Lecture 15:
Reasoning and Learning on Discrete Data
Bayes Nets
News
● Final Problem Set will be ready tomorrow
» Mostly Bayes Nets
● Please begin to think about your final project
Review & Overview
● Lecture 14:
» Principal Components Analysis
– A low dimensional projection can summarize data
» Independent Components Analysis
– An alternative to PCA which can pick out the independent sources of data.
● Bayes Nets
» Meeting of the minds
– Artificial Intelligence and Machine Learning
» Represents symbolic knowledge and reasoning
» Principled mechanism for inference and learning
– Bayes Rule
Artificial Intelligence
● Build systems that reason about the world:
» Diagnosis
– "Why won't my car start?"
» Goal directed behavior
– "How can I get from here to the White House?"
– Space Probe: "How do I change orbit, take photos of Mars, and communicate with Earth in the next 5 minutes?"
– "How can I symbolically integrate this function?"
» Game Playing
– "How can I beat Kasparov?"
● Biases:
» Symbolic data and symbolic problems (not continuous)
» No representation of uncertainty or probability.
Techniques in Artificial Intelligence
● Write down a set of rules that govern the world
» If I get on a plane to Wash. DC then I will end up in DC.
» If I take a taxi to Logan then I will end up at Logan.

Probabilistic Reasoning is Optimal
● What we really want is to reason with the laws of probability:
» The probability that I will get to the White House is:
– The probability of the conjunction of events
● Get packed
● Get to Logan
● Catch plane
● Arrive in DC
● Get Taxi to White house
A Probabilistic Approach
Probability distribution over our 3 Events: P(A, M, C) — Arguments, Minsky, and Chomsky
P(A) = Σ_{x,y} P(A, M = x, C = y) = Σ_{M,C} P(A, M, C)
P(A | M = m) = P(A, M = m) / P(M = m) = P(A, m) / P(m) = Σ_C P(A, m, C) / Σ_{A,C} P(A, m, C)

Observe Data -> Probability of Events
Problems with Naïve Probability
M C A  Probability
0 0 0  0.684
0 0 1  0.171
0 1 0  0.0315
0 1 1  0.036
1 0 0  0.0665
1 0 1  0.0285
1 1 0  0.00005
1 1 1  0.00045
● Way too many variables:
» 2^N variables (minus 1)
» Occam wouldn't like this
● Lots of computation:
» P(M) requires O(2^(N-1))
» P(A|M) requires O(2^(N-1))
P(??) = 0.23595
P(A, M, C) ≡ P(M) P(C|M) P(A|M, C)   — 7 table entries become 1 + 2 + 4
P(A, M, C) ≈ P(M) P(C) P(A|M, C)   — 7 table entries become 1 + 1 + 4
Removing Links
P(A, M, C) ≡ P(M) P(C|M) P(A|M, C)   — the full graph over M, C, A
P(A, M, C) ≈ P(M) P(C) P(A|M, C)   — drop the M -> C link

An Efficient Representation
• Draw a directed acyclic graph; label each node with its conditional probability table (e.g. P(M) = 0.1, P(C) = 0.05)

Much more efficient representations
[Figure: six-node network M, C -> A, B -> E, F]
2^6 − 1 = 63 parameters vs. (2·1) + (2·2) + (2·4) = 14
Additional Example 1
● You are waiting for an appointment with Holmes (H) and Watson (W).
● Both are very poor drivers and are likely to avoid driving if the roads are icy (I).
● It is winter, so the probability of I is high: 0.5.
● H and W are dependent…
» Unless you know the road conditions
[Figure: network I -> W, I -> H]
Additional Example 2
Wet Grass
● The lawn of home B is either wet or not (Wb).
» This could have been caused by rain (R) or the sprinkler (S)
● The lawn of home A is either wet or not (Wa).
» Home A has no sprinklers so rain is the only cause.
[Figure: network R -> Wa, R -> Wb, S -> Wb]
Additional Example 3
Earthquake or Burglar
● Home Alarm (A)
● Neighbor reports the Alarm (N)
● Burglary (B): 0.001
● Earthquake (E): 0.0000001
● Radio Report of Earthquake (R)
● You receive a call from your neighbor saying that your Alarm is going off.
● You drive home to confront the burglar… on the drive you hear a radio report of an Earthquake.
[Figure: network B -> A, E -> A, E -> R, A -> N]
Reasoning
● In some cases computation time is not changed:
» We have re-written the joint distribution
» Some reasoning still requires large summations...
P(A) = Σ_{x,y} P(A, M = x, C = y) = Σ_{M,C} P(A, M, C)
P(A | M = m) = P(A, m)/P(m) = Σ_C P(A, m, C) / Σ_{A,C} P(A, m, C)
For the six-node network:
P(e | a) = Σ_{B,C,D,F} P(a, B, C, D, e, F) / Σ_{B,C,D,E,F} P(a, B, C, D, E, F)
P(A, B, C, D, E, F) = P(M) P(C) P(A|M, C) P(B|C) P(E|A) P(F|A, B)
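A tiny Matlab sketch of this kind of brute-force marginalization on the 2x2x2 joint table above; the array layout is our assumption:

% P(A=1 | M=1): sum the joint over C, then normalize over A
P = zeros(2,2,2);           % P(m,c,a), index 1 = false, 2 = true
P(1,1,1) = 0.684;  P(1,1,2) = 0.171;  P(1,2,1) = 0.0315;  P(1,2,2) = 0.036;
P(2,1,1) = 0.0665; P(2,1,2) = 0.0285; P(2,2,1) = 0.00005; P(2,2,2) = 0.00045;
num = sum(P(2,:,2));        % sum over C with M = 1, A = 1
den = sum(sum(P(2,:,:)));   % sum over C and A with M = 1
PA_given_M = num / den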
Junction Tree Algorithm 1
● Table arithmetic:
P(X, Y, Z) = P(X) P(Y|X) P(Z|Y)
∀a, b, c:  P(X = a, Y = b, Z = c) = P(X = a) P(Y = b|X = a) P(Z = c|Y = b)
P(X):    X 0.4, NOT X 0.6
P(Y|X):  X: Y 0.9, NOT Y 0.1;  NOT X: Y 0.3, NOT Y 0.7
P(Z|Y):  Y: Z 0.2, NOT Z 0.8;  NOT Y: Z 0.0, NOT Z 1.0
[Figure: the six-node graph is moralized and triangulated, yielding a junction tree with cliques CM, AC, AE, ABF and separators C, A, A]

More Junction Trees
[Figure: another junction tree example]
6.891 Machine Learning and Neural
Networks
Lecture 16:
More Bayes Nets
News
● Half of pset 5 is done
» and on the web.
» Other half will be done over the weekend.
Review & Overview
● Lecture 15:
» Bayes Nets
– Meeting of the minds
● Artificial Intelligence and Machine Learning
– Represents symbolic knowledge and reasoning
– Principled mechanism for inference and learning
● Bayes Rule
Bayesian Text Classification
{d_k}: A collection of documents
W_i(d_k) = 1 if d_k contains word i, 0 otherwise
P(F_1 = f_1, F_2 = f_2, …| C = c_j)   — 2^N probs
= P({f_1 … f_N}| C = c_j)
≡ Π_i P(F_i = f_i | C = c_j)   (Assume Independence)
P({f_i}|C_j) = Π_i P(F_i = f_i | C_j)
[Figure: class node C (one of N classes) with children F1, F2, F3, F4, …, FN; each edge carries P(F_i|C)]
P(c_j|{f_i}) = P(c_j) Π_i P(f_i|c_j) / Π_i P(f_i)
More Complex Models are "Easy"
What if documents could be "about" two different topics at once:
- like Politics and Sports
[Figure: two topic nodes P and S both feeding features F1, F2, …, FN, with tables P(F_i|P, S)]

An Efficient Representation
• Draw a directed acyclic graph; label each node with its conditional probability table (e.g. P(M) = 0.1, P(C) = 0.05)

Much more efficient representations
[Figure: six-node network M, C -> A, B -> E, F]
2^6 − 1 = 63 parameters vs. (2·1) + (2·2) + (2·4) = 14
Additional Example 1
● You are waiting for an appointment with Holmes (H) and Watson (W).
● Both are very poor drivers and are likely to avoid driving if the roads are icy (I).
● It is winter, so the probability of I is high: 0.5.
● H and W are dependent…
» Unless you know the road conditions

Additional Example 2
Wet Grass
● The lawn of home B is either wet or not (Wb).
» This could have been caused by rain (R) or the sprinkler (S)
● The lawn of home A is either wet or not (Wa).
» Home A has no sprinklers so rain is the only cause.

Additional Example 3
Earthquake or Burglar
● Home Alarm (A)
● Neighbor reports the Alarm (N)
● Burglary (B): 0.001
● Earthquake (E): 0.0000001
● Radio Report of Earthquake (R)
● You receive a call from your neighbor saying that your Alarm is going off.
● You drive home to confront the burglar… on the drive you hear a radio report of an Earthquake.
Reasoning
● In some cases computation time is not changed:
» We have re-written the joint distribution
» Some reasoning still requires large summations...
P(A) = Σ_{x,y} P(A, M = x, C = y) = Σ_{M,C} P(A, M, C)
P(A | M = m) = P(A, m)/P(m) = Σ_C P(A, m, C) / Σ_{A,C} P(A, m, C)
For the six-node network (D, C -> A, B -> E, F):
P(e | a) = Σ_{B,C,D,F} P(a, B, C, D, e, F) / Σ_{B,C,D,E,F} P(a, B, C, D, E, F)
P(A, B, C, D, E, F) = P(D) P(C) P(A|D, C) P(B|C) P(E|A) P(F|A, B)
Sometimes reasoning is more efficient
P(e | c) = Σ_{A,B,D,F} P(A, B, c, D, e, F) / Σ_{A,B,D,E,F} P(A, B, c, D, E, F)
Add the evidence that C = c. Observe the marginal of E.
Push the sums inside the factorization:
Σ_{a,b,d,f} P(A, B, c, D, E, F)
= P(c) Σ_a P(E|A) Σ_d P(D) P(A|D, c) Σ_b P(B|c) Σ_f P(F|A, B)

Saving Work
Each inner sum is computed once and cached as a table:
= P(c) Σ_a P(E|A) Σ_d P(D) P(A|D, c) Σ_b P(B|c) T1_AB      (T1_AB = Σ_f P(F|A, B))
= P(c) Σ_a P(E|A) Σ_d P(D) P(A|D, c) T2_A
= P(c) Σ_a P(E|A) Σ_d T3_AD
= P(c) Σ_a P(E|A) T4_A
= P(c) Σ_a T5_AE
and collapsing to a final table over E:
Σ_{a,b,d,f} P(A, B, c, D, e, F) = P(c) T_E(e)
Hidden Markov Model
[Figure: hidden chain A -> B -> C -> D over time, with observations F, G, H, I hanging off each state]
P(A, B, C, D, F, G, H, I)
= P(A) P(F|A) P(B|A) P(G|B) P(C|B) P(H|C) P(D|C) P(I|D)
Table arithmetic, as before:
P(X, Y, Z) = P(X) P(Y|X) P(Z|Y)
∀a, b, c:  P(X = a, Y = b, Z = c) = P(X = a) P(Y = b|X = a) P(Z = c|Y = b)
Junction Tree Algorithm: Graph Hacking
[Figure: moralize and triangulate the six-node graph; cliques CM, AC, AE, ABF joined through separators C, A, A]

From Junction Trees to Probability
[Figure: the junction tree next to the original graph]
6.891 Machine Learning and Neural
Networks
Lecture 17:
Hidden Markov Models
& Other Bayes Nets
News
● Problem Set 5 complete on Monday
Review & Overview
● Lecture 16:
» Bayes Nets
– An efficient way to represent joint probability distributions
– Allow reasoning about subtle and conflicting evidence
– Allow reasoning with partial information
» Structure implies Reasoning Efficiency
– Dependence structure allows for more efficient reasoning
– Dynamic programming
● Markov Processes
● Hidden Markov Models
» Speech
A brief overview of speech recognition

Differing representations

The phonemes

The digits
Speech Spectrogram
{x_j, y_j} Training Data
Speech Difficulties
● Rate of speech
» Words are spoken at different rates -- factor of 2 or 3.
● Continuous speech
» Where are the boundaries between words??
"Is this your cat?"  0.2 - 0.3 - 0.6 - 0.2
"When is your train?"  0.2 - 0.2 - 0.6 - 0.3
Cat -> 'c' - 'ah' - 't'  0.03 - 0.15 - 0.02
fat -> 'f' - 'ah' - 't'  0.1 - 0.1 - 0.02
Implications of Decomposition
● The parts of words can be reused
» Words are built from XX phoneme models
» Perhaps we can train the phoneme recognizers separately??
» (Sometimes… co-articulation can make this harder)
[Figure: phoneme chain for "cat": N -> c -> ah -> t -> N, with self-loop probabilities 0.5, 0.8, 0.1, 0.85]
Phoneme Sequences
NFA Model
Sequence: NNCCAAAAAAAAATTTNN…
Spectrogram F_t, state S_t
P(F, S | Model) = P(F | S) P(S | Model)
The NFA model for 'cat' assigns a probability to each spectrogram.
[Figure: state chain S1, S2, S3, S4, S5, …, Sn with one observation F1 … Fn per state]

The Details
S_i ∈ {N_B, C, A, T, N_A}
P(F_i | S_i = k) = G(F_i; μ_k, Σ_k)
P(S_1) = {0.5, 0.5, 0.0, 0.0, 0.0}
Transition table P(S_{i+1} | S_i):
       N_B   C    A    T    N_A
N_B    0.5   0.5
C            0.2  0.8
A                 0.9  0.1
T                      0.15 0.85
N_A                         1.0
[Figure: the same chain drawn as a Bayes net — hidden states over time with per-step observations]
P(A, B, C, D, F, G, H, I)
= P(A) P(F|A) P(B|A) P(G|B) P(C|B) P(H|C) P(D|C) P(I|D)
Using Dynamic Programming…

[Figure: the state chain S1, S2, …, Sn over observations F1, F2, …, Fn.]

P(F | Model)
  = ∑_S P(F | S) P(S | Model)
  = ∑_{Ŝ={s1,s2,s3,s4,…}} P(F | S = Ŝ) P(S = Ŝ | Model)
  = ∑_{Ŝ={s1,s2,s3,s4,…}} ∏_j P(F_j | S_j = s_j) P(S = Ŝ | Model)
  = ∑_{s1} P(F_1 = f_1 | S_1 = s_1) P(S_1 = s_1) ∑_{s2} P(F_2 = f_2 | S_2 = s_2) P(S_2 = s_2 | S_1 = s_1) ∑_{s3} …
  = ∑_{s1} T_{s1} ∑_{s2} T_{s1 s2} ∑_{s3} T_{s2 s3} ∑_{s4} T_{s3 s4} ∑_{s5} T_{s4 s5}
  = ∑_{s1} T_{s1} ∑_{s2} T_{s1 s2} ∑_{s3} T_{s2 s3} ∑_{s4} T_{s3 s4} β^4_{s4}
  = ∑_{s1} T_{s1} ∑_{s2} T_{s1 s2} ∑_{s3} T_{s2 s3} β^3_{s3}
  = ∑_{s1} T_{s1} ∑_{s2} T_{s1 s2} β^2_{s2}
  = ∑_{s1} T_{s1} β^1_{s1}
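The last four lines are a backward (β) recursion, and it is easy to write down directly. A minimal MATLAB function-file sketch, assuming the observation likelihoods and transition table are given as arrays (all names are ours, not the course library's):

function p = chain_likelihood(initial, trans, obs_like)
% P(F | Model) by collapsing the sums from the inside out (beta recursion).
% obs_like(i,j) = P(F_i = f_i | S_i = j); initial(j) = P(S_1 = j).
[ntimes, nstates] = size(obs_like);
beta = ones(1, nstates);                       % beta^n is identically 1
for i = ntimes:-1:2
  beta = (trans * (obs_like(i,:) .* beta)')';  % beta^{i-1}
end
p = sum(initial .* obs_like(1,:) .* beta);     % final sum over s1
end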
11
Stringing Words Together

[Figure: word models for ‘Cat’, ‘Eats’, and ‘Food’ chained together over the observation stream F.]
12
Markov Processes
l Markov Processes are in fact very general…
» Loosely, they are processes in which there is a great deal of conditional independence.
– Like most Bayes Nets:  P(A | B, C, D, F, G, H, I) = P(A | π(A))

Note: up ‘til now we have seen only directed models… the notion of Markov for undirected models is a bit more complex...

P(X,Y,Z) = P(X) P(Y|X) P(Z|Y)
∀ a, b, c:  P(X=a, Y=b, Z=c) = P(X=a) P(Y=b | X=a) P(Z=c | Y=b)

P(X):             P(Y|X):                 P(Z|Y):
  X      0.4             Y     NOT Y             Z     NOT Z
  NOT X  0.6      X      0.9   0.1        Y      0.2   0.8
                  NOT X  0.3   0.7        NOT Y  0     1
13
Junction Tree Algorithm: Graph Hacking

[Figure: the net over M, C, A, B, E, F is moralized and triangulated; its cliques CM, AC, AE, ABF are linked through separators C and A to form a junction tree.]
14
Junction Trees and Tables

[Figure: the junction tree with cliques CM, AC, AE, ABF and separators C, A1, A2.]
15
From Junction Trees to Probability

[Figure: the net over M, C, A, B, E, F beside its junction tree: cliques CM, AC, AE, ABF with separators C, A1, A2.]

P(M, C, A, B, E, F) = P(M) P(C|M) P(A|C) P(E|A) P(B) P(F|B)

T_{MCABEF} = (T_CM × T_AC × T_AE × T_ABF) / (S_C × S_A1 × S_A2)

  P(M) P(C|M)  →  T_CM
  P(A|C)       →  T_AC
  P(E|A)       →  T_AE
  P(B) P(F|B)  →  T_ABF
16
6.891 Machine Learning and Neural
Networks
Lecture 18:
Finish Hidden Markov Models
& Finish Bayes Nets
News
l Remember to keep thinking about your final
projects!
1
Review & Overview
l Lecture 17:
» Hidden Markov Models for Speech
– Speech is complex…
l Many words / Length of words varies
– Speech is best represented as a spectrogram
– Variable timing of speech can be modeled as an NFA.
– An HMM is a Bayes Net which is equivalent to an NFA
l We can build an HMM for each word out of phoneme models
– Can sum over the unknown states to recognize words
Speech in a Nutshell

[Figure: spectrogram of the word "six", segmented into ‘s’, ‘i’, ‘x’.]

Closer Examination

[Figure: close-up of the spectrogram (5 frames of ‘s’, 6 frames of ‘x’) and the ‘six’ NFA: states N, s, i, x, N with arc probabilities 0.2, 0.3, 0.2, 0.2.]

P(F, S | Model) = P(F | S) P(S | Model)
3
Use Bayes Law

[Figure: the ‘cat’ NFA (arc probabilities N 0.5, c 0.8, ah 0.1, t 0.85) unrolled into the state chain S1, S2, …, Sn over observations F1, F2, …, Fn.]
4
A concrete example

[Figure: the state chain S1, S2, …, Sn over observations F1, F2, …, Fn.]

S_i ∈ {1, 2}
P(S_1) = {0.5, 0.5}
P(F_i = f | S_i = 1) = G(f, 1.0, 0.1)
P(F_i = f | S_i = 2) = G(f, 2.0, 0.1)

P(S_{i+1} | S_i):
         1     2
   1     0.9   0.1
   2     0.1   0.9

P(F, S | Model)
Some Samples

[Figure: three sampled observation sequences of length 100; the traces hop between levels near 1.0 and 2.0 as the hidden state switches.]
5
Code is very simple...

% Draw a state sequence from the Markov chain
states(1) = hmm_draw_state(initial);
for i = 2:n
  % transition(states(i-1), :) is the distribution over the next state
  states(i) = hmm_draw_state(transition(states(i-1), :));
end

[Figure: two sampled observation traces of length 100 from the noisier model below.]

P(F_i = f | S_i = 1) = G(f, 1.0, 0.4)
P(F_i = f | S_i = 2) = G(f, 2.0, 0.4)
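hmm_draw_state is a course helper; a self-contained sketch of the same sampler, including the Gaussian observations, could look like this (every name below is ours, not the course library's):

% Sample n steps from the 2-state HMM with Gaussian observations.
n = 100;
initial = [0.5 0.5];
transition = [0.9 0.1; 0.1 0.9];
mu = [1.0 2.0];  sigma = [0.4 0.4];
draw = @(p) find(rand < cumsum(p), 1);   % draw an index from distribution p
states = zeros(1, n);
states(1) = draw(initial);
for i = 2:n
  states(i) = draw(transition(states(i-1), :));
end
f = mu(states) + sigma(states) .* randn(1, n);   % observations F_i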
6
But we have a detailed model

[Figure: the state chain S1, S2, …, Sn over observations F1, F2, …, Fn, and a noisy length-100 observation trace from which we want to recover the states.]
7
This code is also simple…

% Propagate the maximum state forward in time from the beginning
maxes = state_like;
maxes(1,:) = maxes(1,:) / sum(maxes(1,:));
for i = 2:ntimes
  for j = 1:nstates
    % For each new time, check each of the past states to determine
    % the best state given the transition costs.
    for k = 1:nstates
      vals(j,k) = maxes(i-1,k) * trans(k,j) * state_like(i,j);
    end
    maxes(i,j) = max(vals(j,:));
  end
  maxes(i,:) = maxes(i,:) / sum(maxes(i,:));
end
% Trace the best state backward from the end.
% (The slide omits the initialization of shat and ind; this is one
% reasonable completion.)
shat = zeros(ntimes, nstates);
[v, ind] = max(maxes(ntimes,:));
shat(ntimes, ind) = 1;
for i = ntimes-1:-1:1
  for j = 1:nstates
    back_vals(j) = trans(ind,j) * maxes(i,j);
  end
  [v, ind] = max(back_vals);
  shat(i, ind) = 1;
end
Two models side by side, differing only in their transitions:

S_i ∈ {1, 2},  P(S_1) = {0.5, 0.5}  (both models)

Left:  P(S_{i+1} | S_i = 1) = (0.9, 0.1)
Right: P(S_{i+1} | S_i = 1) = (0.8, 0.2)

[Figure: a length-100 observation trace under each model.]
8
Code for model likelihood...

function like = hmm_model_likelihood(f, initial, trans, obs_models)
% like = hmm_model_likelihood(f, initial, trans, obs_models)
% First compute the likelihood of every state given every observation
state_like = hmm_obs_likelihood(f, obs_models);
% initialize some variables
ntimes = size(state_like, 1);
nstates = size(state_like, 2);
% mfactor rescales beta at every step to avoid numerical underflow;
% the rescaling is undone in log space at the end.
mfactor = 100;
beta = mfactor .* state_like(ntimes,:);
for i = ntimes-1:-1:1
  beta = mfactor .* (state_like(i,:) .* (trans * beta')');
end
like = log10(sum(beta)) - (log10(mfactor) * ntimes);
9
Markov Processes
l Markov Processes are in fact very general…
» Loosely, they are processes in which there is a great deal of conditional independence.
– Like most Bayes Nets:  P(A | B, C, D, F, G, H, I) = P(A | π(A))

Note: up ‘til now we have seen only directed models… the notion of Markov for undirected models is a bit more complex...
Segue
l We have seen several applications of Bayesian
Networks…
» Expert Systems
» Diagnosis
» Speech Recognition
10
Junction Tree Algorithm 1
l Table arithmetic (see the code sketch below):

X → Y → Z

P(X,Y,Z) = P(X) P(Y|X) P(Z|Y)
∀ a, b, c:  P(X=a, Y=b, Z=c) = P(X=a) P(Y=b | X=a) P(Z=c | Y=b)

P(X):             P(Y|X):                 P(Z|Y):
  X      0.4             Y     NOT Y             Z     NOT Z
  NOT X  0.6      X      0.9   0.1        Y      0.2   0.8
                  NOT X  0.3   0.7        NOT Y  0     1
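A minimal MATLAB sketch of that table product, using the numbers above (index 1 = true, 2 = false):

px  = [0.4 0.6];              % P(X):  [X, NOT X]
pyx = [0.9 0.1; 0.3 0.7];     % P(Y|X): rows X, columns Y
pzy = [0.2 0.8; 0.0 1.0];     % P(Z|Y): rows Y, columns Z
pxyz = zeros(2,2,2);          % joint table P(X=a, Y=b, Z=c)
for a = 1:2
  for b = 1:2
    for c = 1:2
      pxyz(a,b,c) = px(a) * pyx(a,b) * pzy(b,c);
    end
  end
end
% sanity check: the entries of a joint distribution sum to 1
assert(abs(sum(pxyz(:)) - 1) < 1e-12);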
[Figure: the net over M, C, A, B, E, F is moralized and triangulated, and its junction tree is extracted.]
11
More Junction Trees

[Figure: the junction tree with cliques CM, AC, AE, ABF and separators C, A1, A2.]
12
Rules for Junction Tree Initialization
l For each conditional distribution in the Bayes Net
» Find a node in the Jtree which contains all those vars
» Multiply that node's table by the conditional dist (a sketch follows below)
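A sketch of this initialization for two cliques of the example tree; binary variables, and all numbers are hypothetical (the assignments P(M)P(C|M) → T_CM and P(A|C) → T_AC follow the earlier slide):

% Initialize clique tables of the junction tree over CM and AC.
% Index 1 = true, 2 = false; the probabilities are made up.
pm  = [0.3 0.7];               % P(M)
pcm = [0.9 0.1; 0.2 0.8];      % P(C|M): rows M, cols C
pac = [0.6 0.4; 0.5 0.5];      % P(A|C): rows C, cols A
T_CM = ones(2,2);  T_AC = ones(2,2);
for m = 1:2
  for c = 1:2
    T_CM(c,m) = T_CM(c,m) * pm(m) * pcm(m,c);   % P(M)P(C|M) -> T_CM
  end
end
for a = 1:2
  for c = 1:2
    T_AC(a,c) = T_AC(a,c) * pac(c,a);           % P(A|C) -> T_AC
  end
end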
13
Image Markov Models
14
Multi-scale Statistical Models:
Images, People, Movement
Paul Viola
Collaborators: Jeremy De Bonet, John Fisher, Andrew Kim, Tom Rikert, Mike Jones
http://www.ai.mit.edu/projects/lv

[Diagram: example images feed Non-Parametric Multi-scale Models, which support sampling for synthesis & computer graphics; detection, registration & recognition; a new hypothesis for human object recognition; segmentation; and denoising & super-resolution.]
1
Visual Texture: a testing ground
• Texture
– Random Repeating Process
– No two patches are identical

A good statistical model for images should be a good model for visual texture.

[Figure: an input texture next to synthesis results from Gaussian, independent-pixel, and non-parametric multi-scale models.]
2
Simple Statistical Model 1:
Independent pixels
• Statistical Model 1
– Each pixel is independent and identically distributed

P(I) = ∏_{x,y} P(I_xy)
Technical Point:
Texture is Ergodic/Stationary
• A texture image is assumed to be many samples of
a single process
– Each sample is almost certainly dependent on the other
samples
– But actual location of the samples does not matter
– (Space invariant process).
3
Simple Statistical Models
Independent pixels
Histogram
P(I) = ∏_{x,y} P(I_xy)
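Under Model 1, synthesis is just resampling the pixel histogram. A minimal MATLAB sketch (the grayscale input array img is an assumption):

% Fit Model 1 to an input texture and sample a new image from it:
% each output pixel is drawn i.i.d. from the empirical pixel pool.
pixels = img(:);                      % img: input grayscale texture
n = numel(pixels);
idx = ceil(n * rand(size(img)));      % uniform random indices into the pool
sample = pixels(idx);                 % bootstrap-resampled texture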
Statistical Model 2:
Gaussian Distribution

P(I) = N(I; m, Σ) ∝ e^{−½ |Σ^{−1/2}(I − m)|²}

[Figure: original texture vs. a generated Gaussian sample.]
4
What else are probabilistic image models good for??
• Denoising:
– If we have a model for: P(I)
– And we observe an image plus noise: Î = I + η
– Then:

P(Î) = ∫ P(I = Î − η, η) dη = ∫ P(I = Î − η) P(η) dη

P(I | Î) = P(Î | I) P(I) / P(Î) = P(η = Î − I) P(I) / P(Î)

E[I | Î] = ∫ I P(η = Î − I) P(I) / P(Î) dI

For Gaussian noise and a Gaussian prior,

E[I | Î] = ∫ I e^{−(I−Î)²/2n²} e^{−(I−m)²/2s²} / c dI

This is the same thing as estimating the mean of a Gaussian from one example when there is a prior… the expected value is between the observation and the prior.
5
Gaussians are not quite right...

[Figure: histogram of derivative values vs. the best Gaussian fit, plotted as P(value); the data is more sharply peaked than the Gaussian.]
6
Statistical Model 3:
Independent Wavelet Models
• Donoho, Adelson, Simoncelli, etc.
• Very efficient (linear time)
– Estimation, Sampling, Inference
P(I) ∝ ∏_j P_j([WI]_j)
1D Wavelet Transform

[Figure: a 1-D wavelet transform built from simple filters.]

Wavelet Transform

[Figure: a simple input texture decomposed by oriented wavelet filters.]

Sub-band Pyramid / Fourier Decomposition

[Figure: the sub-band pyramid of the transform WI; each band holds coefficients F_L^θ(x, y).]
8
Noise removal through shrinkage

Inside the guts...

[Figure: four histograms of coefficient values underlying the shrinkage computation.]
10
Gaussian Denoising

E[I | Î] = ∫ I P(η = Î − I) P(I) / P(Î) dI
         = ∫ I e^{−(Î−I)²/2n²} e^{−I²/2s²} / c dI

Completing the square in the exponent:

−(Î−I)²/2n² − I²/2s²
  = −[ s²(I² − 2IÎ + Î²) + n²I² ] / 2n²s²
  = −[ (s² + n²)I² − 2s²IÎ + s²Î² ] / 2n²s²
  = −[ I² − 2s²IÎ/(s²+n²) + s²Î²/(s²+n²) ] / (2n²s² / (s²+n²))

so the posterior is a Gaussian in I whose mean is s²Î/(s²+n²).
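A sketch of the resulting estimator, assuming Gaussian noise of variance n² and a zero-mean Gaussian prior of variance s² on each value (a linear "shrink toward zero"; the variances below are made up):

% Posterior-mean denoising under a Gaussian noise + Gaussian prior model.
s2 = 1.0;  n2 = 0.25;                   % prior and noise variances (assumed)
ihat = randn(1, 1000) * sqrt(s2 + n2);  % simulated noisy observations
iest = (s2 / (s2 + n2)) * ihat;         % E[I | Ihat] = s^2/(s^2+n^2) * Ihat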
11
Independent Wavelet Synthesis Model

P(I) ≈ ∏_{l,θ,x,y} P_{l,θ,x,y}(F_l^θ(x, y))
     ≈ ∏_{l,θ,x,y} P_{l,θ}(F_l^θ(x, y))

Given: I, W
Observe: O_{l,θ} = {F_l^θ(x, y)}
Model: P_{l,θ}(·)

Observe Coefficients
12
Compute Histograms

[Figure: multi-scale histograms are computed from the original texture patch; the multi-scale sampling procedure then synthesizes a new patch.]
13
Not quite right...

[Figure: wavelet transform of a simple input texture; independent sub-band sampling misses cross-band structure.]

Heeger and Bergen: Constrain the pixel histogram

Models of structured images are weak.

[Figure: wavelet transform of a simple input texture, as above.]

Preserving Cross Scale Alignment

[Figure: wavelet filter outputs remain aligned across scales at image structures.]

Statistical Distribution of Multi-scale Features

The distribution of multi-scale features determines appearance.
17
Multi-scale Wavelet Features

A multi-scale feature associates many values with each pixel in the image.

Conjunctions of filters: the Multi-resolution Parent Vector (fine to coarse)

V(x, y) = ( F_N^0(x/2^N, y/2^N), F_N^1(x/2^N, y/2^N), …, F_N^M(x/2^N, y/2^N),
            ⋮
            F_1^0(x/2, y/2), F_1^1(x/2, y/2), …, F_1^M(x/2, y/2) )
18
Build a Model for Observed Distribution

P(I) = P(V(x, y))

Non-parametric Distribution

Related to the MAR models of Willsky et al.

[Figure: original texture vs. synthesis results.]
19
Multi-resolution Parent Vector

V_N(x, y) = ( F_N^0(x/2^N, y/2^N), F_N^1(x/2^N, y/2^N), …, F_N^M(x/2^N, y/2^N) )
  ⋮
V_1(x, y) = ( F_1^0(x/2, y/2), F_1^1(x/2, y/2), …, F_1^M(x/2, y/2) )

Probabilistic Model:  P(V(x, y))

Markov:
  P(V_l(x, y) | {WI} − V_l(x, y)) = P(V_l(x, y) | V_{l+1}(x, y), V_{l+2}(x, y), …)

Conditionally Independent:
  P(V_l) = ∏_{x,y} P(V_l(x, y) | V_{l+1}(x, y), V_{l+2}(x, y), …)

Successive Conditioning:
  P(I) = P(WI) = P(V_M) × P(V_{M−1} | V_M) × P(V_{M−2} | V_M, V_{M−1}) × P(V_{M−3} | V_M, V_{M−1}, V_{M−2}) × …
20
Estimating Conditional Distributions
• Non-parametrically:  P*(x) = ∑_i R(x − x_i)

P(V_l(x,y) | V_{l+1}(x,y), V_{l+2}(x,y), …)
  = P(V_l(x,y), V_{l+1}(x,y), V_{l+2}(x,y), …) / P(V_{l+1}(x,y), V_{l+2}(x,y), …)
  ≅ P*(V_l(x,y), V_{l+1}(x,y), V_{l+2}(x,y), …) / P*(V_{l+1}(x,y), V_{l+2}(x,y), …)

[Figure: a 64x64 input image and a 2x2 patch being synthesized.]
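A Parzen (kernel) estimate like P*(x) above is just a sum of kernels centered on the data. A minimal 1-D MATLAB sketch with a Gaussian kernel R (the data and the bandwidth h are made up):

% 1-D Parzen density estimate: Pstar(x) = (1/n) sum_i R(x - x_i)
xi = [1.1 1.4 2.0 2.2 2.3];             % hypothetical data samples
h = 0.2;                                 % kernel bandwidth (assumed)
R = @(u) exp(-u.^2/(2*h^2)) / (sqrt(2*pi)*h);
pstar = @(x) mean(R(x - xi));            % estimate at a scalar query x
pstar(2.0)                               % high density near the data cluster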
21
Shannon Resampling
Step 2: Build synthesis pyramid
Step 2a: Fill in the top...
Step 2b: Fill in subsequent levels
Finish the pyramid

[Figure: several pages of texture synthesis results comparing B&H with D&V.]
FRAME: Challenge

[Diagram: example images feed Non-Parametric Multi-scale Models, which support a sample path (synthesis & computer graphics) and a distribution/likelihood path (detection & recognition).]

Discrimination via Cross Entropy

[Figure: a model texture I_MODEL with P(I | Model) is compared to a test texture I_TEST with P(I | Test) via cross entropy.]

Best previous: GMRF’s 97%. Ours: 99%.
28
29
Where is the boundary between texture and objects?
• Our model can synthesize and recognize complex and structured textures.
– Far beyond older definitions of texture.
• Where is the boundary between these complex textures and other patterns in images
– like faces, human forms, automobiles, etc.?
30
What about face detection?
• Synthesis is convincing
• Train a texture model to detect faces
Tom Rikert & Mike Jones
Detecting Objects
• Key Difficulties:
– Variation in Pose, Deformation, & variation across class
• Most Object Recognition approaches are either:
– Very dependent on precise shape and size
– Entirely dependent on simple features (… color, edge histograms)
• Hypothesis:
– Object recognition is closely related to texture perception
31
Detection Results
Detection Results:
32
Non-frontal faces

But naïve detection is expensive

Car Images
33
Texture recognition via Cross Entropy

[Figure: a model texture I_MODEL with P(I | Model) is compared to a test texture I_TEST with P(I | Test) via cross entropy.]
200 Parent Vectors: reduce the number of bins by clustering & by value for discrimination.
34
ROC using 200 vectors…
Scanning results:
Time: 9 secs
35
Key facial features
- determined automatically
- located automatically
36
37
Future Work:
New Face Recognition Algorithm
38
[Diagram: example images feed Non-Parametric Multi-scale Models, which support sampling for synthesis & computer graphics; detection, registration & recognition; a new hypothesis for human object recognition; segmentation; and denoising & super-resolution.]

A good statistical model for images should be a good model for visual texture.
1
Generation: a critical test

[Figure: an input texture next to samples generated from Gaussian, independent-pixel, and non-parametric multi-scale models.]
Histogram
P(I) = ∏_{x,y} P(I_xy)
2
Statistical Model 2:
Gaussian Distribution

P(I) = N(I; µ, Σ) ∝ e^{−½ |Σ^{−1/2}(I − µ)|²}

[Figure: original texture vs. a generated Gaussian sample.]
3
Noise + Signal: Two Gaussian Case

[Figure: histograms of the signal, the noise, and their sum in the two-Gaussian case.]
Gaussian Denoising

E[I | Î] = ∫ I P(η = Î − I) P(I) / P(Î) dI
         = ∫ I e^{−(Î−I)²/2n²} e^{−I²/2s²} / c dI

−(Î−I)²/2n² − I²/2s²
  = −[ s²(I² − 2IÎ + Î²) + n²I² ] / 2n²s²
  = −[ (s² + n²)I² − 2s²IÎ + s²Î² ] / 2n²s²
  = −[ I² − 2s²IÎ/(s²+n²) + s²Î²/(s²+n²) ] / (2n²s² / (s²+n²))

E[I | Î] = s²/(n² + s²) · Î
4
In pictures…

[Figure: histograms illustrating the shrinkage of the observed value toward the prior mean.]
Statistical Model 3:
Independent Wavelet Models
• Donoho, Adelson, Simoncelli, etc.
• Very efficient (linear time)
– Estimation, Sampling, Inference
P(I) ∝ ∏_j P_j([WI]_j)
5
Noise vs. Signal: The details

[Figure: coefficient histograms for the signal and for the noise.]

Non-gaussian: the integral is evaluated numerically

E[I | Î] = ∫ I P(η = Î − I) P(I) / P(Î) dI
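A sketch of that numerical evaluation on a grid, for an arbitrary (possibly non-Gaussian) prior and Gaussian noise; both densities below are assumptions chosen for illustration:

% Numerically evaluate E[I | Ihat] = int I P(eta = Ihat - I) P(I) dI / P(Ihat)
ihat = 1.5;  n = 0.5;                          % observed value, noise std (assumed)
I = linspace(-6, 6, 2001);                     % integration grid
noise = @(e) exp(-e.^2/(2*n^2)) / (sqrt(2*pi)*n);
prior = @(v) exp(-abs(v));                     % e.g. a heavy-tailed, non-Gaussian prior
w = noise(ihat - I) .* prior(I);               % unnormalized posterior on the grid
iest = trapz(I, I .* w) / trapz(I, w);         % posterior mean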
6
1D Wavelet Transform

[Figure: a 1-D wavelet transform; a simple input texture decomposed by oriented filters.]

Sub-band Pyramid / Fourier Decomposition

[Figure: the sub-band pyramid of the transform WI; each band holds coefficients F_L^θ(x, y).]
7
Noise removal through shrinkage
8
Independent Wavelet Synthesis Model

P(I) ≈ ∏_{l,θ,x,y} P_{l,θ,x,y}(F_l^θ(x, y))
     ≈ ∏_{l,θ,x,y} P_{l,θ}(F_l^θ(x, y))

Given: I, W
Observe: O_{l,θ} = {F_l^θ(x, y)}
Model: P_{l,θ}(·)

Observe Coefficients
9
Compute Histograms

[Figure: multi-scale histograms are computed from the original texture patch; the multi-scale sampling procedure then synthesizes a new patch.]
10
Not quite right...

[Figure: wavelet transform of a simple input texture; independent sub-band sampling misses cross-band structure.]

Heeger and Bergen: Constrain the pixel histogram

Models of structured images are weak.

[Figure: wavelet transform of a simple input texture, as above.]

Preserving Cross Scale Alignment

[Figure: wavelet filter outputs remain aligned across scales at image structures.]

Statistical Distribution of Multi-scale Features

The distribution of multi-scale features determines appearance.
14
Multi-scale Wavelet Features

A multi-scale feature associates many values with each pixel in the image.

Conjunctions of filters: the Multi-resolution Parent Vector (fine to coarse)

V(x, y) = ( F_N^0(x/2^N, y/2^N), F_N^1(x/2^N, y/2^N), …, F_N^M(x/2^N, y/2^N),
            ⋮
            F_1^0(x/2, y/2), F_1^1(x/2, y/2), …, F_1^M(x/2, y/2) )
15
Build a Model for Observed Distribution

P(I) = P(V(x, y))

Non-parametric Distribution

Related to the MAR models of Willsky et al.

[Figure: original texture vs. synthesis results.]
16
Multi-resolution Parent Vector

V_N(x, y) = ( F_N^0(x/2^N, y/2^N), F_N^1(x/2^N, y/2^N), …, F_N^M(x/2^N, y/2^N) )
  ⋮
V_1(x, y) = ( F_1^0(x/2, y/2), F_1^1(x/2, y/2), …, F_1^M(x/2, y/2) )

Probabilistic Model:  P(V(x, y))

Markov:
  P(V_l(x, y) | {WI} − V_l(x, y)) = P(V_l(x, y) | V_{l+1}(x, y), V_{l+2}(x, y), …)

Conditionally Independent:
  P(V_l) = ∏_{x,y} P(V_l(x, y) | V_{l+1}(x, y), V_{l+2}(x, y), …)

Successive Conditioning:
  P(I) = P(WI) = P(V_M) × P(V_{M−1} | V_M) × P(V_{M−2} | V_M, V_{M−1}) × P(V_{M−3} | V_M, V_{M−1}, V_{M−2}) × …
17
Estimating Conditional Distributions
• Non-parametrically:  P*(x) = ∑_i R(x − x_i)

P(V_l(x,y) | V_{l+1}(x,y), V_{l+2}(x,y), …)
  = P(V_l(x,y), V_{l+1}(x,y), V_{l+2}(x,y), …) / P(V_{l+1}(x,y), V_{l+2}(x,y), …)
  ≅ P*(V_l(x,y), V_{l+1}(x,y), V_{l+2}(x,y), …) / P*(V_{l+1}(x,y), V_{l+2}(x,y), …)

[Figure: a 64x64 input image and a 2x2 patch being synthesized.]
18
Shannon Resampling
Step 2: Build synthesis pyramid
Step 2a: Fill in the top...
Step 2b: Fill in subsequent levels
Finish the pyramid

[Figure: several pages of texture synthesis results comparing B&H with D&V.]
23
FRAME: Challenge

[Diagram: example images feed Non-Parametric Multi-scale Models, which support a sample path (synthesis & computer graphics) and a distribution/likelihood path (detection & recognition).]

Discrimination via Cross Entropy

[Figure: a model texture I_MODEL with P(I | Model) is compared to a test texture I_TEST with P(I | Test) via cross entropy.]

Best previous: GMRF’s 97%. Ours: 99%.
25
26
Where is the boundary between texture and objects?
• Our model can synthesize and recognize complex and structured textures.
– Far beyond older definitions of texture.
• Where is the boundary between these complex textures and other patterns in images
– like faces, human forms, automobiles, etc.?
27
What about face detection?
• Synthesis is convincing
• Train a texture model to detect faces
Tom Rikert & Mike Jones
Detecting Objects
• Key Difficulties:
– Variation in Pose, Deformation, & variation across class
• Most Object Recognition approaches are either:
– Very dependent on precise shape and size
– Entirely dependent on simple features (… color, edge histograms)
• Hypothesis:
– Object recognition is closely related to texture perception
28
Detection Results
Detection Results:
29
Non-frontal faces

But naïve detection is expensive

Car Images
30
Texture recognition via Cross Entropy

[Figure: a model texture I_MODEL with P(I | Model) is compared to a test texture I_TEST with P(I | Test) via cross entropy.]
200 Parent Vectors: reduce the number of bins by clustering & by value for discrimination.
31
ROC using 200 vectors…
Scanning results:
Time: 9 secs
32
Key facial features
- determined automatically
- located automatically
33
34
Future Work:
New Face Recognition Algorithm
35
6.891 Machine Learning and Neural
Networks
Lecture 24:
The End
News
l The Final is on Monday of finals week at 1:30
» In this room…
l Conflict exam will be in NE43 on Tuesday Morning
at 9:30.
» Come to Kinh’s office at 9:15 so we can set people up.
l Last year’s final will be on the web by 1PM.
1
Review & Overview
l Lectures 22 and 23:
» Statistical image processing
– Estimate statistical models from examples
– Applications
l Denoising
l Synthesis
l Recognition
6.891 at a Glance
l Probability
» Bayes Law
l Linear Algebra
» Eigenvectors and inverses
l Bayesian Classification
l Discriminant Functions
» Perceptrons, MLPs
l Support Vector Machines
l Regularization
» Radial Basis Functions
l Unsupervised Learning and PCA
l Bayes Nets and HMMs
2
In the beginning… Probability
l The key concepts of probability
» The basic algebra of probability
– Probabilities of mutually exclusive events add (independent events multiply)
– Relationships between conditional and joint distributions
» Densities work like probabilities (mostly)
» Bayes Law allows us to make decisions
– Loss functions are critical
» Maximum likelihood allows us to learn distributions
– Bayesian estimation averages over parameters
» Exponential densities are easiest to work with
» Mixtures of Gaussians are powerful (but EM is slow)
» Non-parametric estimators are more powerful
– But are difficult to represent
Linear Algebra
l The inverse and pseudo-inverse are everywhere
» Solving least squares problems
l Covariance and co-occurrence are everywhere
» Estimating a Gaussian
» Fitting a line to data
» Principal components analysis
l Eigenvectors simplify most linear algebra
» Especially for symmetric positive semi-definite matrices
» Allow you to compute inverses & square roots
» Allow you to understand distributions and linear
dependence
3
Bayesian Classification
l Start out with strong assumptions about your data
» Number of classes, structure of the classes
l Use data to estimate the distribution of each class
l Use Bayes’ law to classify new examples
l Advantages:
» Can estimate the probability of classes (confidence)
» Can validate the model
» Harder to over-train or over-fit
l Disadvantages
» May not use data efficiently
» Sensitive to poor assumptions
Discriminant Functions
l Attempt to estimate the discriminant function
directly
» Linear
» Polynomial
» Multi-layer perceptron
l Specifically minimizes the number of errors
l Advantages:
» Don’t waste time on distributions (just the boundary)
l Disadvantages
» No natural measure of confidence
» Can over-train
4
Support Vector Machines
l A principled and direct way to simultaneously
minimize errors while yielding the simplest
possible classifier
» Occam’s razor
l Using the Kernel Trick ™
» Can find a very complex polynomial with little work
l Using the Margin Trick™
» Maximizes generalization in the face of complexity
l Simple learning criteria
l Well studied learning algorithm
» Quadratic programming
Regularization
l Sometimes you would like to find the smoothest
function which is close to the data
» Minimize the squared error
» Minimize the squared first derivative (or 2nd deriv.)
l The least squares solution:
» Is a sum of kernel functions centered on the data
» Kernel functions depend on the smoothness penalty
l Derivative penalties yield polynomial kernels
» First -> linear, Second -> cubic, Hairy -> Gaussian
5
Unsupervised Learning
l Transforming the input so that it is more manageable
» PCA: The data can be represented using fewer numbers
– Can compress data, make learning simpler
» ICA: The resulting data is now more independent
– Can separate signals that were mixed
» Informative Features (by John Fisher)
– Can represent just the critical information
Bayes Nets
l Models of the conditional dependencies between
variables
» Usually many variables
l A complete model would be intractable
» Exponential number of parameters
– Impossible to learn or reason
l By assuming that certain vars are independent
» Number of params goes down rapidly
» Efficient reasoning is possible
l Bayes Nets are very general and can be used in
many ways
6
Hidden Markov Models
l A type of Bayes Net that allows reasoning over time
l The true state of the world is unknown
» You have noisy observations
l HMM use temporal dependencies to differentiate
ambiguous states
The VC dimension
l Each class of learned functions has a VC dimension
» Perceptron: VCdim = number of weights
l VC dimension measures the capacity of the classifier
» VCdim is the max number of points which can be shattered
» Shattered = assigned any set of labels
l Intuition: larger capacity requires more data
» Like polynomials: Nth order requires N+1 points
l The bounds are actually probabilistic
» The probability that the error rate exceeds a particular level is bounded by a function of VCdim and N.
7
Symbolic Learning
l Often the correct classification rule is symbolic
» If BP < 50 and HR < 50 then administer DRUG
l While Bayes Nets can reason in this way, they do
not offer much help in learning the relationships
from data
» If structure of net is given, then params can be estimated
l This is sometimes called rule learning
l Decision Trees – ID3, CART, etc.
» Pick a feature, split into ranges
» For each case, pick another feature and repeat
» Each leaf should have only one label
Combining Classifiers
l We have encountered many learning techniques
» Each has multiple variants
l Bagging
» Train the same classifier on different subsets of the data (see the sketch after this list)
» Related to cross-validation (or the Bootstrap)
l Stacking
» Perhaps the best approach is to train each type of
classifier and then have them vote.
– Combine 100 different types of neural networks
– Many types of generalized perceptrons
l Boosting
» Train a sequence of classifiers on re-weighted data sets
8
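As a concrete sketch of bagging, here is a minimal MATLAB example with a deliberately simple base learner (nearest class mean); the data and every name below are made up for illustration:

% Bagging sketch: train 25 base classifiers on bootstrap resamples, then vote.
X = [randn(50,2) - 1; randn(50,2) + 1];   % made-up 2-class training data
y = [ones(50,1); 2*ones(50,1)];
Xtest = [-1 -1; 1 1];                     % two query points
nb = 25;  n = size(X,1);
votes = zeros(size(Xtest,1), nb);
for b = 1:nb
  idx = ceil(n * rand(n,1));              % bootstrap sample, with replacement
  Xb = X(idx,:);  yb = y(idx);
  m1 = mean(Xb(yb==1,:), 1);              % base learner: nearest class mean
  m2 = mean(Xb(yb==2,:), 1);
  d1 = sum((Xtest - m1).^2, 2);
  d2 = sum((Xtest - m2).^2, 2);
  votes(:,b) = 1 + (d2 < d1);             % predict class 2 when closer to m2
end
yhat = mode(votes, 2);                    % majority vote across the ensemble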
Policy Learning
l You must act over time to maximize some reward
» Portfolios: Buy and sell stock to max return and min risk
» Two armed bandit: tradeoff exploration for exploitation
» Learn a sequence of actions which takes you from the start to the goal – like in a video game
l Sometimes your feedback is delayed
» Rarely do you get detailed feedback on your actions
l Policy
» Mapping from state of the world to actions
l Reinforcement Learning (Leslie Kaelbling)
l Game Learning (Backgammon)
Language Learning
l How can you learn to pluralize? (phonetically)
» Wug
l How do you discover parts of speech?
l How do you learn the grammar of English?
» Stochastic Context Free Grammar
– Generalization of HMM
– S -> NP VP, VP -> V NP, etc.