Minimize the total squared error over training examples:

    E = Σ_n (y_n − t_n)²

Differentiating with respect to the weights w_jk connecting hidden-unit outputs h_j to output y:

    ∂E/∂w_jk = ∂E/∂y · ∂y/∂w_jk = 2·(y − t) · F′[Σ_j w_jk h_j] · h_j

then update by gradient descent:

    w_jk ← w_jk − η · ∂E/∂w_jk
E6820 SAPR - Dan Ellis L03 - Pattern Classification 2006-02-02 - 13
Neural net example
2 input units (normalized F1, F2)
5 hidden units, 3 output units (U, O, A)
Sigmoid nonlinearity:

    F[x] = 1 / (1 + e^(−x))

    dF/dx = F (1 − F)

[Figure: 2:5:3 network (inputs F1, F2; hidden units; outputs U, O, A), with plots of sigm(x) and (d/dx) sigm(x)]
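A minimal sketch of the sigmoid and its derivative (not from the slides; the function names are ours). The derivative dF/dx = F(1 − F) is cheap because it reuses the forward value:

```python
import math

def sigm(x):
    # F[x] = 1 / (1 + e^-x)
    return 1.0 / (1.0 + math.exp(-x))

def dsigm(x):
    # dF/dx = F (1 - F): the derivative reuses the forward value
    f = sigm(x)
    return f * (1.0 - f)

print(sigm(0.0))   # 0.5 - the sigmoid crosses 1/2 at x = 0
print(dsigm(0.0))  # 0.25 - the slope is maximal there
```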
Neural net training
example...
[Figure: 2:5:3 net: mean squared error by training epoch (0-100); decision contours in the (F1, F2) plane after 10 and after 100 iterations]
Support Vector Machines (SVMs)
Principled nonlinear decision boundaries...
Key idea: Boundary can be linear
in a higher-dimensional space
e.g. Φ: (x1, x2) → (x1², √2·x1·x2, x2²)
- depends only on training points at edges
(support vectors)
- unique optimal solution for a given projection
- solution involves only inner products in projected space
e.g. via kernel: k(x, y) = Φ(x)·Φ(y) = (x·y)²

[Figure: maximum-margin boundary (w·x) + b = 0 with margins (w·x) + b = ±1 through the support vectors; points labeled y_i = +1 and y_i = −1; w normal to the boundary]
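The kernel identity can be checked numerically. A small sketch (the names `phi` and `kernel` are ours) comparing the explicit projection with the kernel shortcut:

```python
import math

def phi(x1, x2):
    # explicit projection (x1, x2) -> (x1^2, sqrt(2) x1 x2, x2^2)
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

def kernel(x, y):
    # kernel shortcut: k(x, y) = (x . y)^2, never forming phi explicitly
    return (x[0] * y[0] + x[1] * y[1]) ** 2

x, y = (1.0, 2.0), (3.0, -1.0)
px, py = phi(*x), phi(*y)
inner = sum(a * b for a, b in zip(px, py))
print(inner, kernel(x, y))  # equal: (1*3 + 2*(-1))^2 = 1
```

The point of the kernel form is that the inner product in the projected space never requires computing Φ at all.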
Statistical Interpretation
Observations are random variables
whose distribution depends on the class:
Source distributions p(x|ω_i)
- reflect variability in feature
- reflect noise in observation
- generally have to be estimated from data
(rather than known in advance)

[Figure: a discrete hidden class ω_i generates a continuous observation x; p(x|ω_i) runs forward, Pr(ω_i|x) infers back; example densities p(x|ω_1)…p(x|ω_4) plotted against x]
Random Variables review
Random vars have joint distributions (pdfs): p(x,y)
- marginal: p(x) = ∫ p(x,y) dy
- covariance: Σ = E[(x − μ)(x − μ)^T] = [ σ_x²  σ_xy ; σ_xy  σ_y² ]

[Figure: contours of a joint pdf p(x,y) with marginals p(x), p(y) and spreads σ_x, σ_y, σ_xy]
Conditional probability
Knowing one value in a joint distribution
constrains the remainder:

    p(x, y) = p(x|y) · p(y)

[Figure: joint p(x,y) with the slice at y = Y giving the conditional p(x | y = Y)]

Bayes rule:

    p(y|x) = p(x|y) · p(y) / p(x)

→ can reverse conditioning given priors/marginal
- either term can be discrete or continuous
Priors and posteriors
Bayesian inference can be interpreted as
updating prior beliefs with new information, x:

Bayes Rule:

    Pr(ω_i|x) = [ p(x|ω_i) · Pr(ω_i) ] / [ Σ_j p(x|ω_j) · Pr(ω_j) ]

    posterior probability = likelihood × prior probability / evidence, where evidence = p(x)

Posterior is prior scaled by likelihood
& normalized by evidence (so Σ(posteriors) = 1)

Objection: priors are often unknown
- ...but omitting them amounts to
assuming they are all equal
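A tiny numeric sketch of this update (the likelihood and prior values are made up for illustration):

```python
# Hypothetical two-class example: likelihoods p(x|w_i) and priors Pr(w_i)
likelihood = [0.3, 0.1]   # p(x|w_1), p(x|w_2) at the observed x
prior = [0.4, 0.6]        # Pr(w_1), Pr(w_2)

evidence = sum(l * p for l, p in zip(likelihood, prior))        # p(x)
posterior = [l * p / evidence for l, p in zip(likelihood, prior)]

print(posterior)       # prior scaled by likelihood, normalized by evidence
print(sum(posterior))  # 1.0: the posteriors sum to one
```

Note that even though ω_2 has the larger prior, the observation shifts the posterior toward ω_1.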
Bayesian (MAP) classifier

Minimize the probability of error by
choosing the maximum a posteriori (MAP) class:

    ω̂ = argmax_{ω_i} Pr(ω_i|x)

Intuitively right - choose most probable class
in light of the observation

Formally, expected probability of error is

    E[Pr(err)] = ∫ p(x)·Pr(err|x) dx = Σ_i ∫_{X_{ω_i}} p(x)·(1 − Pr(ω_i|x)) dx

where X_{ω_i} is the region of X where ω_i is chosen
.. but Pr(ω_i|x) is largest in that region,
so Pr(err) is minimized.
Practical implementation
Optimal classifier is

    ω̂ = argmax_{ω_i} Pr(ω_i|x)

but we don't know Pr(ω_i|x)

Can model Pr(ω_i|x) directly e.g. train a neural net / SVM
to map from inputs x to a set of outputs Pr(ω_i|x)
- a discriminative model

Often easier to model conditional distributions p(x|ω_i),
then use Bayes rule to find the MAP class

[Figure: labeled training examples {x_n, ω_xn} are sorted according to class, and a conditional pdf p(x|ω_1) is estimated for each class]
Likelihood models
Given models for each distribution p(x|ω_i),
the search for

    ω̂ = argmax_{ω_i} Pr(ω_i|x)

becomes

    ω̂ = argmax_{ω_i} [ p(x|ω_i) · Pr(ω_i) ] / [ Σ_j p(x|ω_j) · Pr(ω_j) ]

but denominator = p(x) is the same over all ω_i, hence

    ω̂ = argmax_{ω_i} p(x|ω_i) · Pr(ω_i)

or even

    ω̂ = argmax_{ω_i} [ log p(x|ω_i) + log Pr(ω_i) ]

Choose parameters to
maximize data likelihood p(x_train|ω_i)
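A minimal sketch of the log-domain decision rule, assuming hypothetical 1-D Gaussian likelihood models (the means, variances and priors are made up):

```python
import math

def map_class(x, models, priors):
    # argmax_i [ log p(x|w_i) + log Pr(w_i) ] -- the common p(x) is dropped
    scores = [math.log(m(x)) + math.log(p) for m, p in zip(models, priors)]
    return scores.index(max(scores))

def gauss(mu, sigma):
    # 1-D Gaussian likelihood model p(x|w_i)
    return lambda x: (math.exp(-0.5 * ((x - mu) / sigma) ** 2)
                      / (math.sqrt(2 * math.pi) * sigma))

models = [gauss(0.0, 1.0), gauss(4.0, 1.0)]
priors = [0.5, 0.5]
print(map_class(0.5, models, priors))  # 0: closer to the first mean
print(map_class(3.5, models, priors))  # 1: closer to the second
```

Working in the log domain also avoids underflow when the likelihoods are tiny.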
Sources of error
Suboptimal threshold / regions (bias error)
- use a Bayes classifier!

Incorrect distributions (model error)
- better distribution models/more training data

Misleading features (Bayes error)
- irreducible for a given feature set
regardless of classification scheme

[Figure: scaled likelihoods p(x|ω_1)Pr(ω_1) and p(x|ω_2)Pr(ω_2) over x, with estimated decision regions X̂_{ω_1}, X̂_{ω_2}; the mass of ω_1 falling in X̂_{ω_2} gives Pr(ω_1|X̂_{ω_2}) = Pr(err|X̂_{ω_2}), and similarly Pr(ω_2|X̂_{ω_1}) = Pr(err|X̂_{ω_1})]
Gaussian models
Easiest way to model distributions is via
parametric model
- assume known form, estimate a few parameters

Gaussian model is simple & useful:
in 1-D:

    p(x|ω_i) = [ 1 / (√(2π)·σ_i) ] · exp[ −(1/2)·((x − μ_i)/σ_i)² ]

(the 1/(√(2π)σ_i) term is the normalization to make it integrate to 1)

Parameters mean μ_i and variance σ_i² → fit

[Figure: 1-D Gaussian p(x|ω_i): peak P_M at μ_i; at μ_i ± σ_i the curve falls to P_M·e^(−1/2)]
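Fitting the two parameters is just computing sample statistics. A small sketch on assumed data (maximum-likelihood estimates, so the variance is the biased 1/N form):

```python
import math

def fit_gaussian(xs):
    # ML estimates: sample mean and (biased, 1/N) sample variance
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n
    return mu, var

def gauss_pdf(x, mu, var):
    # p(x|w_i) for the fitted parameters
    return (math.exp(-0.5 * (x - mu) ** 2 / var)
            / (math.sqrt(2 * math.pi * var)))

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
mu, var = fit_gaussian(xs)
print(mu, var)  # 3.0 2.0
```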
Gaussians in d dimensions:

    p(x|ω_i) = [ 1 / ((2π)^(d/2) |Σ_i|^(1/2)) ] · exp[ −(1/2)·(x − μ_i)^T Σ_i^(−1) (x − μ_i) ]

Described by d-dimensional mean vector μ_i
and d x d covariance matrix Σ_i

[Figure: surface and elliptical contours of a 2-D Gaussian]

Classify by maximizing log likelihood i.e.

    ω̂ = argmax_{ω_i} [ −(1/2)·(x − μ_i)^T Σ_i^(−1) (x − μ_i) − (1/2)·log|Σ_i| + log Pr(ω_i) ]
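A sketch of this decision rule with NumPy, for two hypothetical 2-D classes (the means, covariances and priors are made up):

```python
import numpy as np

def log_likelihood(x, mu, cov):
    # log p(x|w_i) for a d-dimensional Gaussian
    d = len(mu)
    diff = x - mu
    maha = diff @ np.linalg.inv(cov) @ diff
    return -0.5 * maha - 0.5 * np.log(np.linalg.det(cov)) - 0.5 * d * np.log(2 * np.pi)

def classify(x, params, priors):
    # argmax_i [ -1/2 (x-mu_i)^T Sigma_i^-1 (x-mu_i) - 1/2 log|Sigma_i| + log Pr(w_i) ]
    scores = [log_likelihood(x, mu, cov) + np.log(p)
              for (mu, cov), p in zip(params, priors)]
    return int(np.argmax(scores))

params = [(np.array([0.0, 0.0]), np.eye(2)),
          (np.array([3.0, 3.0]), np.eye(2))]
print(classify(np.array([0.5, 0.2]), params, [0.5, 0.5]))  # 0
print(classify(np.array([2.5, 3.1]), params, [0.5, 0.5]))  # 1
```

With equal covariances and priors, as here, the rule reduces to nearest-mean classification.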
Multivariate Gaussian covariance
Eigen analysis of inverse of covariance gives

    Σ_i^(−1) = (SV)^T (SV)

with V the eigenvectors, Λ the diagonal eigenvalue matrix, and S = Λ^(−1/2)

→ Gaussian ~ exp[ −(1/2)·(x − μ_i)^T (SV)^T (SV) (x − μ_i) ]

i.e. (SV) transforms x into
uncorrelated, normalized-variance space

    (x − μ_i)^T Σ_i^(−1) (x − μ_i) = Mahalanobis distance²

- Euclidean distance in normalized space

[Figure: ellipse v in x space maps to circle v′ = SVv in SVx space]

example...
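The identity can be checked with an eigendecomposition; a small NumPy sketch (the covariance and points are made up):

```python
import numpy as np

cov = np.array([[2.0, 0.8],
                [0.8, 1.0]])
mu = np.array([1.0, -1.0])
x = np.array([2.0, 0.5])

# eigen analysis: cov = V diag(evals) V^T, so Sigma^-1 = (SV)^T (SV)
evals, V = np.linalg.eigh(cov)   # columns of V are eigenvectors
S = np.diag(evals ** -0.5)       # S = Lambda^{-1/2}
SV = S @ V.T

maha2 = (x - mu) @ np.linalg.inv(cov) @ (x - mu)  # Mahalanobis distance^2
z = SV @ (x - mu)                                 # whitened coordinates
print(maha2, z @ z)  # equal: Euclidean distance in the normalized space
```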
Gaussian Mixture models (GMMs)
Single Gaussians cannot model
- distributions with multiple modes
- distributions with nonlinear correlation

What about a weighted sum?
i.e.

    p(x) = Σ_k c_k · p(x|m_k)

where {c_k} is a set of weights and
{p(x|m_k)} is a set of Gaussian components
- can fit anything given enough components

Interpretation:
Each observation is generated by one of the
Gaussians, chosen at random, with priors:

    c_k = Pr(m_k)
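A sketch of this generative interpretation for a hypothetical 1-D mixture (weights and components are made up): evaluate p(x) as the weighted sum, and sample by first picking a component with probability c_k.

```python
import math, random

# hypothetical 1-D mixture: weights c_k and components (mu_k, sigma_k)
c = [0.3, 0.7]
comps = [(-2.0, 0.5), (1.0, 1.0)]

def gmm_pdf(x):
    # p(x) = sum_k c_k p(x|m_k)
    total = 0.0
    for ck, (mu, sigma) in zip(c, comps):
        total += ck * math.exp(-0.5 * ((x - mu) / sigma) ** 2) \
                    / (math.sqrt(2 * math.pi) * sigma)
    return total

def gmm_sample(rng):
    # generative view: pick a component with prior c_k, then draw from it
    k = 0 if rng.random() < c[0] else 1
    mu, sigma = comps[k]
    return rng.gauss(mu, sigma)

rng = random.Random(0)
xs = [gmm_sample(rng) for _ in range(5)]
print(gmm_pdf(0.0), xs)
```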
Gaussian mixtures (2)
e.g. nonlinear correlation:

[Figure: original data with nonlinear correlation, the individual Gaussian components, and the resulting mixture surface]

Problem: finding the c_k and m_k parameters
- easy if we knew which m_k generated each x
Expectation-maximization (EM)
General procedure for estimating a model when
some parameters are unknown
- e.g. which component generated a value in a
Gaussian mixture model

Procedure:
Iteratively update fixed model parameters Θ to
maximize Q = expected value of log-probability
of known training data x_trn and unknown parameters u:

    Q(Θ, Θ_old) = Σ_u Pr(u|x_trn, Θ_old) · log p(u, x_trn|Θ)

- iterative because Pr(u|x_trn, Θ_old) changes
each time we change Θ
- can prove this increases data likelihood,
hence maximum-likelihood (ML) model
- local optimum - depends on initialization
Fitting GMMs with EM
Want to find
component Gaussians m_k = {μ_k, Σ_k}
plus weights c_k = Pr(m_k)
to maximize likelihood of training data x_trn

If we could assign each x to a particular m_k,
estimation would be direct

Hence, treat ks (mixture indices) as unknown,
form Q = E[ log p(x, k|Θ) ],
differentiate with respect to model parameters
→ equations for μ_k, Σ_k and c_k to maximize Q...
GMM EM update equations:
Solve for maximizing Q:

    μ_k = [ Σ_n p(k|x_n, Θ)·x_n ] / [ Σ_n p(k|x_n, Θ) ]

    Σ_k = [ Σ_n p(k|x_n, Θ)·(x_n − μ_k)(x_n − μ_k)^T ] / [ Σ_n p(k|x_n, Θ) ]

    c_k = (1/N) · Σ_n p(k|x_n, Θ)

Each involves p(k|x_n, Θ),
the fuzzy membership of x_n in Gaussian k

Each parameter is just a sample average, weighted by
fuzzy membership
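One E-step/M-step pass follows directly from these equations. A minimal 1-D NumPy sketch on synthetic two-cluster data (the data, initialization, and iteration count are made up):

```python
import numpy as np

def em_step(x, mu, var, c):
    # E step: fuzzy memberships p(k|x_n, Theta) for each point and component
    lik = np.stack([ck * np.exp(-0.5 * (x - m) ** 2 / v) / np.sqrt(2 * np.pi * v)
                    for m, v, ck in zip(mu, var, c)])   # shape (K, N)
    post = lik / lik.sum(axis=0, keepdims=True)         # p(k|x_n, Theta)

    # M step: sample averages weighted by fuzzy membership
    w = post.sum(axis=1)                                # sum_n p(k|x_n, Theta)
    mu_new = (post * x).sum(axis=1) / w
    var_new = (post * (x - mu_new[:, None]) ** 2).sum(axis=1) / w
    c_new = w / len(x)
    return mu_new, var_new, c_new

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-3, 1, 200), rng.normal(2, 1, 200)])
mu, var, c = np.array([-1.0, 1.0]), np.array([1.0, 1.0]), np.array([0.5, 0.5])
for _ in range(50):
    mu, var, c = em_step(x, mu, var, c)
print(mu, c)  # means near -3 and 2, weights near 0.5 each
```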
GMM examples
Vowel data fit with different mixture counts:
example...

[Figure: (F1, F2) vowel data fit with 1, 2, 3, 4 Gaussians; log p(x) = −1911, −1864, −1849, −1840 respectively]
Methodological Remarks:
1: Training and test data
A rich model can learn every training example
(overtraining)

But, goal is to classify new, unseen data
i.e. generalization
- sometimes use a cross-validation set to decide
when to stop training

For evaluation results to be meaningful:
- don't test with training data!
- don't train on test data (even indirectly...)

[Figure: error rate vs. amount of training (or parameters): training-data error keeps falling while test-data error turns back up - overfitting]
Model complexity
More model parameters → better fit
- more Gaussian mixture components
- more MLP hidden units

More training data → can use larger models

For fixed training set size, there will be some
optimal model size (to avoid overfitting):

[Figure: WER for PLP12N-8k nets vs. net size & training data - word error rate (32-44%) over hidden-layer size (500-4000 units) and training set (9.25-74 hours); more training data and more parameters each help, with an optimal parameter/data ratio along lines of constant training time]
Model combination
No single model is always best
But different models have different strengths

→ Benefits from combining several models e.g.

    Pr(ω) = Σ_k w_k · Pr(ω|m_k)

where each Pr(ω|m_k) comes from a different
model / feature set
and w_k = Pr(m_k) estimates the probability that
model k is correct

Should be able to incorporate into one model,
but may be easier to combine in practice

Trick is to have good estimates for w_k:
- static, from training data
- dynamic, from test data...
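A tiny sketch of static combination (all numbers are made up): each model's posterior is averaged with weight w_k, and the result is still a distribution.

```python
# hypothetical posteriors from two models over three classes (U, O, A)
post_m = [[0.6, 0.3, 0.1],   # model 1: Pr(w|m_1)
          [0.4, 0.5, 0.1]]   # model 2: Pr(w|m_2)
w = [0.7, 0.3]               # static weights w_k = Pr(m_k), summing to 1

# Pr(w) = sum_k w_k Pr(w|m_k)
combined = [sum(w[k] * post_m[k][i] for k in range(2)) for i in range(3)]
print(combined)  # still sums to 1 since the weights do
```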
Summary
Classification is making discrete (hard)
decisions

Basis is comparison with known examples
- explicitly, or via a model

Probabilistic interpretation for principled
decisions

Ways to build models:
- neural nets can learn posteriors Pr(ω|x)
- Gaussians can model pdfs p(x|ω)
- other approaches...

EM allows estimation of complex models

Parting thought:
How do you know what's going on?