
EE E6820: Speech & Audio Processing & Recognition



Lecture 3:
Pattern Classification

The problem of classification
Linear and nonlinear classifiers
Probabilistic classification
Gaussians, mixtures and EM
Methodological remarks

Dan Ellis <dpwe@ee.columbia.edu>
http://www.ee.columbia.edu/~dpwe/e6820/
Columbia University Dept. of Electrical Engineering
Spring 2006


Pattern classification

Classification means:
finding categorical (discrete) labels
for real-world (continuous) observations
Why?

- information extraction
- conditional processing
- detect exceptions
[Figure: spectrogram (f/Hz vs. time/s) and formant space (F2/Hz vs. F1/Hz) with tokens of "ay" and "ao"]


Building a classifier

Define classes/attributes

- could state explicit rules
- better to define through training examples

Define feature space
Define decision algorithm

- set parameters from examples

Measure performance

- calculate (weighted) error rate
[Figure: Pols vowel formants, F2/Hz vs. F1/Hz: "u" (x) vs. "o" (o)]


Classification system parts

Sensor → signal
Pre-processing/segmentation (e.g. STFT, locate vowels) → segment
Feature extraction (e.g. formant extraction) → feature vector
Classification → class
Post-processing (context constraints, costs/risk)


Feature extraction

Right features are critical

- waveform vs. formants vs. cepstra
- invariance under irrelevant modifications

Theoretically equivalent features
may act very differently in practice

- representations make important aspects explicit
- remove irrelevant information

Feature design incorporates domain knowledge

- more data → less need for cleverness?

Smaller feature space (fewer dimensions)
→ simpler models (fewer parameters)
→ less training data needed
→ faster training


Minimum Distance Classification

Find closest match (nearest neighbor)

- choice of distance metric $D(x, y)$ is important

Find representative match (class prototypes)

- class data $\{y_{\omega,i}\}$ → class model $z_\omega$
[Figure: Pols vowel formants, F2/Hz vs. F1/Hz: "u" (x), "o" (o), "a" (+); a new observation $x$ is labeled by the nearest neighbor, $\min_{\omega,i} D[x, y_{\omega,i}]$, or by the nearest class prototype, $\min_{\omega} D[x, z_\omega]$]
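A minimal sketch of both rules in NumPy (the formant values and helper names are invented for illustration): the prototype rule compares $x$ to each class mean, the nearest-neighbor rule compares it to every stored example.

```python
import numpy as np

# Hypothetical formant vectors [F1, F2] in Hz for two vowel classes
# (values invented for illustration).
train = {
    "u": np.array([[300.0, 800.0], [320.0, 850.0], [290.0, 780.0]]),
    "o": np.array([[450.0, 950.0], [470.0, 1000.0], [430.0, 920.0]]),
}

# Class prototypes z_w: the mean of each class's training examples.
prototypes = {w: pts.mean(axis=0) for w, pts in train.items()}

def nearest_prototype(x):
    """min over classes w of D[x, z_w] (squared Euclidean distance)."""
    return min(prototypes, key=lambda w: np.sum((x - prototypes[w]) ** 2))

def nearest_neighbor(x):
    """min over all training points y_{w,i} of D[x, y_{w,i}]."""
    return min(train, key=lambda w: np.min(np.sum((train[w] - x) ** 2, axis=1)))

x_new = np.array([350.0, 870.0])
print(nearest_prototype(x_new), nearest_neighbor(x_new))
```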


Linear and nonlinear classifiers

Minimum Euclidean distance is equivalent to a linear discriminant function:

- examples $\{y_{\omega,i}\}$ → template $z_\omega = \mathrm{mean}\{y_{\omega,i}\}$
- observed feature vector $x = [x_1\ x_2\ \ldots]^T$
- Euclidean distance metric: $D^2[x, y] = (x - y)^T (x - y)$
- minimum distance rule:
  class $i = \arg\min_i D[x, z_i] = \arg\min_i D^2[x, z_i]$
- i.e. choose class O over U if $D^2[x, z_O] < D^2[x, z_U]$


Linear classifier (cont'd)

Decision rule: choose class O over U if:

$D^2[x, z_O] < D^2[x, z_U]$

$(x - z_O)^T (x - z_O) < (x - z_U)^T (x - z_U)$

$x^T x + z_O^T z_O - 2 x^T z_O < x^T x + z_U^T z_U - 2 x^T z_U$

$2 x^T (z_O - z_U) > z_O^T z_O - z_U^T z_U$

$(x - z_U)^T (z_O - z_U) > \tfrac{1}{2} (z_O - z_U)^T (z_O - z_U)$

i.e. the projection of $(x - z_U)$ onto the direction $(z_O - z_U)$ (left side) must exceed a fixed threshold of half $|z_O - z_U|^2$ (right side): a linear discriminant.
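The last line of the derivation can be coded directly as a projection/threshold test. A small sketch with made-up prototype vectors; it gives exactly the same answer as comparing the two squared distances.

```python
import numpy as np

# Hypothetical class prototypes (means), e.g. [F1, F2] in Hz.
z_O = np.array([450.0, 950.0])
z_U = np.array([310.0, 810.0])

def choose_O_over_U(x):
    """Minimum-distance rule in linear-discriminant form:
    project (x - z_U) onto (z_O - z_U) and compare with half |z_O - z_U|^2."""
    w = z_O - z_U
    projection = (x - z_U) @ w       # left-hand side
    threshold = 0.5 * (w @ w)        # right-hand side
    return projection > threshold    # True -> class O, False -> class U

x = np.array([400.0, 900.0])
print("O" if choose_O_over_U(x) else "U")
# Sanity check: identical to comparing squared distances directly.
print(np.sum((x - z_O) ** 2) < np.sum((x - z_U) ** 2))
```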
Linear classification boundaries

Min-distance divides the space with a boundary normal to the line between the class centers:

Scaling the axes changes the boundary:
[Figure: the decision boundary bisects the segment from $z_U$ to $z_O$, perpendicular to $(z_O - z_U)$, where the projection $(x - z_U)^T (z_O - z_U) / |z_O - z_U|$ equals $\tfrac{1}{2}|z_O - z_U|$; squeezing the axes horizontally or vertically gives a new boundary]
Decision boundaries

Linear functions give linear boundaries

- planes and hyperplanes as dimensions increase

Linear boundaries can become curves under feature transformations

- e.g. a linear discriminant in $x' = [x_1^2\ x_2^2\ \ldots]^T$
- support vector machines...

What about more complex boundaries?

- i.e. if a linear discriminant is $\sum_i w_i x_i$, how about a nonlinear function of it?
- $F[\sum_i w_i x_i]$ changes only the threshold, but $\sum_j F[\sum_i w_{ij} x_i]$ offers more flexibility, and $F[\sum_j w_{jk} \cdot F[\sum_i w_{ij} x_i]]$ has even more...
Neural networks

Sums over nonlinear functions of sums
→ large range of decision surfaces

e.g. multi-layer perceptron (MLP) with 1 hidden layer:

$y_k = F\left[\sum_j w_{jk} \cdot F\left[\sum_i w_{ij} x_i\right]\right]$

Problem is finding the weights $w_{ij}$... (training)

[Figure: MLP with inputs $x_1, x_2, x_3$, hidden units $h_1, h_2$ (weights $w_{ij}$, nonlinearity $F[\cdot]$), and outputs $y_1, y_2$ (weights $w_{jk}$)]
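As a concrete sketch of the equation above, here is a forward pass for a one-hidden-layer MLP in NumPy (the weights are random placeholders, bias terms are omitted to stay close to the slide's notation, and the sigmoid F is the one defined two slides below):

```python
import numpy as np

def F(a):
    """Sigmoid nonlinearity."""
    return 1.0 / (1.0 + np.exp(-a))

def mlp_forward(x, W_ih, W_ho):
    """y_k = F[ sum_j w_jk * F[ sum_i w_ij * x_i ] ] for all outputs k at once.
    W_ih is (n_inputs x n_hidden), W_ho is (n_hidden x n_outputs)."""
    h = F(x @ W_ih)    # hidden-layer activations
    y = F(h @ W_ho)    # output-layer activations
    return h, y

# Hypothetical 2:5:3 network, as in the vowel example later in the lecture.
rng = np.random.default_rng(0)
W_ih = rng.normal(scale=0.5, size=(2, 5))
W_ho = rng.normal(scale=0.5, size=(5, 3))
_, y = mlp_forward(np.array([0.3, 0.8]), W_ih, W_ho)
print(y)   # three outputs in (0, 1)
```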
Back-propagation training

Find $w_{ij}$ to minimize e.g.

$E = \sum_n \left| y(x_n) - t_n \right|^2$

for a training set of patterns $\{x_n\}$ and desired (target) outputs $\{t_n\}$

Differentiate the error with respect to the weights; with $y_k = F\left[\sum_j w_{jk} \cdot h_j\right]$, the contribution from each pattern $x_n, t_n$ is:

$\frac{\partial E}{\partial w_{jk}} = \frac{\partial E}{\partial y} \cdot \frac{\partial y}{\partial w_{jk}} = 2(y - t) \cdot F'[\cdot] \cdot h_j$

$w_{jk} \leftarrow w_{jk} - \eta \frac{\partial E}{\partial w_{jk}}$

i.e. gradient descent with learning rate $\eta$
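A bare-bones gradient-descent training loop for the same one-hidden-layer sigmoid network, using the update rule above (squared-error cost, $F' = F(1-F)$, no biases, no momentum; the toy data is invented):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_mlp(X, T, n_hidden=5, eta=0.1, epochs=100, seed=0):
    """Back-propagation training of a one-hidden-layer sigmoid MLP,
    minimizing E = sum_n |y(x_n) - t_n|^2 by gradient descent."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(scale=0.5, size=(X.shape[1], n_hidden))   # w_ij
    W2 = rng.normal(scale=0.5, size=(n_hidden, T.shape[1]))   # w_jk
    for _ in range(epochs):
        h = sigmoid(X @ W1)                       # hidden activations
        y = sigmoid(h @ W2)                       # outputs
        d_out = 2.0 * (y - T) * y * (1.0 - y)     # 2(y - t) * F'(.), since F' = F(1 - F)
        d_hid = (d_out @ W2.T) * h * (1.0 - h)    # error back-propagated to hidden layer
        W2 -= eta * h.T @ d_out                   # w_jk <- w_jk - eta * dE/dw_jk
        W1 -= eta * X.T @ d_hid                   # w_ij <- w_ij - eta * dE/dw_ij
    return W1, W2

# Hypothetical toy data: 2 normalized formant inputs, 3 one-hot vowel targets.
X = np.array([[0.2, 0.8], [0.3, 0.7], [0.7, 0.3], [0.8, 0.2], [0.5, 0.9]])
T = np.array([[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 1, 0], [0, 0, 1]], dtype=float)
W1, W2 = train_mlp(X, T)
```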
Neural net example

2 input units (normalized F1, F2)
5 hidden units, 3 output units (U, O, A)

Sigmoid nonlinearity:

$F[x] = \frac{1}{1 + e^{-x}}$ ,  $\frac{dF}{dx} = F(1 - F)$

[Figure: 2:5:3 network with inputs F1, F2 and outputs U, O, A; plots of sigm(x) and its derivative]
Neural net training

example...

[Figure: 2:5:3 net, mean squared error vs. training epoch (0-100); decision contours in formant space after 10 and after 100 iterations]
Support Vector Machines (SVMs)

Principled nonlinear decision boundaries...

Key idea: the boundary can be linear in a higher-dimensional space, e.g.

$\Phi\!\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} x_1^2 \\ \sqrt{2}\, x_1 x_2 \\ x_2^2 \end{pmatrix}$

- depends only on training points at the edges (support vectors)
- unique optimal solution for a given projection
- solution involves only inner products in the projected space,
  e.g. via the kernel $k(x, y) = \Phi(x) \cdot \Phi(y) = (x \cdot y)^2$

[Figure: maximum-margin boundary $(w \cdot x) + b = 0$ with margins $(w \cdot x) + b = \pm 1$ separating points with $y_i = +1$ from $y_i = -1$]
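A quick numerical check of the kernel identity (not an SVM trainer; points are arbitrary): the explicit quadratic map Φ and the kernel $k(x, y) = (x \cdot y)^2$ give the same inner product, which is why the solution only ever needs dot products in the original space.

```python
import numpy as np

def phi(x):
    """Explicit quadratic feature map for 2-D inputs."""
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

def poly_kernel(x, y):
    """k(x, y) = (x . y)^2, computed without ever forming phi(x)."""
    return float(np.dot(x, y)) ** 2

x = np.array([0.5, -1.2])
y = np.array([1.5, 0.3])
print(np.dot(phi(x), phi(y)))   # explicit projection
print(poly_kernel(x, y))        # kernel trick: same value
```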
Statistical Interpretation

Observations are random variables whose distribution depends on the class:

[Figure: a hidden discrete class $\omega_i$ generates a continuous observation $x$ via $p(x|\omega_i)$; inference runs back from $x$ to $\Pr(\omega_i|x)$; overlapping source distributions $p(x|\omega_1) \ldots p(x|\omega_4)$]

Source distributions $p(x|\omega_i)$

- reflect variability in the feature
- reflect noise in the observation
- generally have to be estimated from data (rather than known in advance)
Random Variables review

Random vars have joint distributions (pdfs): $p(x, y)$

- marginal: $p(x) = \int p(x, y)\, dy$
- covariance: $\Sigma = E[(o - \mu)(o - \mu)^T] = \begin{bmatrix} \sigma_x^2 & \sigma_{xy} \\ \sigma_{xy} & \sigma_y^2 \end{bmatrix}$

[Figure: joint pdf $p(x,y)$ with its marginals $p(x)$ and $p(y)$]
Conditional probability

Knowing one value in a joint distribution constrains the remainder:

$p(x, y) = p(x|y) \cdot p(y)$

Bayes' rule:

$p(y|x) = \frac{p(x|y) \cdot p(y)}{p(x)}$

→ can reverse conditioning given priors/marginal

- either term can be discrete or continuous

[Figure: joint pdf $p(x,y)$ and the conditional slice $p(x|y = Y)$]
Priors and posteriors

Bayesian inference can be interpreted as updating prior beliefs with new information, x:

Bayes' rule:

$\Pr(\omega_i|x) = \frac{p(x|\omega_i)}{\sum_j p(x|\omega_j) \cdot \Pr(\omega_j)} \cdot \Pr(\omega_i)$

(posterior probability = likelihood × prior probability, divided by the evidence $p(x) = \sum_j p(x|\omega_j)\Pr(\omega_j)$)

Posterior is prior scaled by likelihood
& normalized by evidence (so Σ(posteriors) = 1)

Objection: priors are often unknown

- ...but omitting them amounts to assuming they are all equal
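A small numerical illustration of the rule (all numbers hypothetical): the posteriors are the likelihoods scaled by the priors and normalized by the evidence, so they sum to 1.

```python
import numpy as np

likelihoods = np.array([0.02, 0.10, 0.05])   # p(x|w_i) for three classes at one observed x
priors      = np.array([0.50, 0.30, 0.20])   # Pr(w_i), must sum to 1

evidence   = np.sum(likelihoods * priors)    # p(x)
posteriors = likelihoods * priors / evidence # Pr(w_i|x), sums to 1

print(posteriors, posteriors.sum(), posteriors.argmax())
```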
Bayesian (MAP) classifier

Minimize the probability of error by choosing the maximum a posteriori (MAP) class:

$\hat\omega = \arg\max_{\omega_i} \Pr(\omega_i|x)$

Intuitively right: choose the most probable class in light of the observation

Formally, the expected probability of error is

$E[\Pr(\text{err})] = \int p(x) \cdot \Pr(\text{err}|x)\, dx = \sum_i \int_{X_{\omega_i}} p(x) (1 - \Pr(\omega_i|x))\, dx$

where $X_{\omega_i}$ is the region of X where $\omega_i$ is chosen
... but $\Pr(\omega_i|x)$ is largest in that region, so Pr(err) is minimized.
Practical implementation

Optimal classifier is $\hat\omega = \arg\max_{\omega_i} \Pr(\omega_i|x)$, but we don't know $\Pr(\omega_i|x)$

Can model it directly, e.g. train a neural net / SVM to map from inputs x to a set of outputs $\Pr(\omega_i)$

- a discriminative model

Often easier to model the conditional distributions $p(x|\omega_i)$, then use Bayes' rule to find the MAP class

[Figure: labeled training examples $\{x_n, \omega_{x_n}\}$ → sort according to class → estimate the conditional pdf for class $\omega_1$, $p(x|\omega_1)$]
Likelihood models

Given models for each distribution $p(x|\omega_i)$, the search for

$\hat\omega = \arg\max_{\omega_i} \Pr(\omega_i|x)$

becomes

$\hat\omega = \arg\max_{\omega_i} \frac{p(x|\omega_i) \cdot \Pr(\omega_i)}{\sum_j p(x|\omega_j) \cdot \Pr(\omega_j)}$

but the denominator = p(x) is the same over all $\omega_i$, hence

$\hat\omega = \arg\max_{\omega_i}\; p(x|\omega_i) \cdot \Pr(\omega_i)$

or even

$\hat\omega = \arg\max_{\omega_i}\; [\log p(x|\omega_i) + \log \Pr(\omega_i)]$

Choose parameters to maximize the data likelihood $p(x_{\text{train}}|\omega_i)$
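A sketch of the log-domain decision using assumed 1-D Gaussian likelihood models (the class parameters and priors are invented; Gaussians are introduced formally on the next slides):

```python
import numpy as np

# Hypothetical 1-D Gaussian likelihood models p(x|w_i) as (mean, std), plus priors Pr(w_i).
classes = ["u", "o", "a"]
params  = {"u": (300.0, 40.0), "o": (450.0, 50.0), "a": (700.0, 80.0)}
priors  = {"u": 0.3, "o": 0.3, "a": 0.4}

def log_gaussian(x, mu, sigma):
    return -0.5 * np.log(2 * np.pi * sigma**2) - 0.5 * ((x - mu) / sigma) ** 2

def map_class(x):
    # argmax_i [ log p(x|w_i) + log Pr(w_i) ]; the denominator p(x) is ignored.
    scores = {w: log_gaussian(x, *params[w]) + np.log(priors[w]) for w in classes}
    return max(scores, key=scores.get)

print(map_class(420.0))
```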
Sources of error

Suboptimal threshold / regions (bias error)

- use a Bayes classifier!

Incorrect distributions (model error)

- better distribution models / more training data

Misleading features (Bayes error)

- irreducible for a given feature set, regardless of classification scheme

[Figure: overlapping $p(x|\omega_1)\Pr(\omega_1)$ and $p(x|\omega_2)\Pr(\omega_2)$ along x; in the region $\hat X_{\omega_2}$ the error is $\Pr(\omega_1|\hat X_{\omega_2}) = \Pr(\text{err}|\hat X_{\omega_2})$, and in $\hat X_{\omega_1}$ it is $\Pr(\omega_2|\hat X_{\omega_1}) = \Pr(\text{err}|\hat X_{\omega_1})$]
Gaussian models

Easiest way to model distributions is via a parametric model

- assume a known form, estimate a few parameters

Gaussian model is simple & useful; in 1-D:

$p(x|\omega_i) = \frac{1}{\sqrt{2\pi}\,\sigma_i} \exp\left[-\frac{1}{2}\left(\frac{x - \mu_i}{\sigma_i}\right)^2\right]$

(the leading term is the normalization to make it sum to 1)

Parameters: mean $\mu_i$ and variance $\sigma_i^2$ → fit

[Figure: 1-D Gaussian $p(x|\omega_i)$ centered at $\mu_i$; the peak height $P_M$ falls to $P_M\, e^{-1/2}$ at $\mu_i \pm \sigma_i$]
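Fitting the two parameters by maximum likelihood amounts to taking the sample mean and sample variance of the class's training data; a minimal sketch with invented values:

```python
import numpy as np

x_train = np.array([310.0, 295.0, 330.0, 305.0, 320.0])   # hypothetical F1 values for one class

mu = x_train.mean()
sigma2 = x_train.var()      # ML estimate divides by N (not N - 1)

def gaussian_pdf(x, mu, sigma2):
    return np.exp(-0.5 * (x - mu) ** 2 / sigma2) / np.sqrt(2 * np.pi * sigma2)

print(mu, sigma2, gaussian_pdf(315.0, mu, sigma2))
```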
Gaussians in d dimensions:

$p(x|\omega_i) = \frac{1}{\sqrt{(2\pi)^d\, |\Sigma_i|}} \exp\left[-\frac{1}{2}(x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i)\right]$

Described by a d-dimensional mean vector $\mu_i$ and a d × d covariance matrix $\Sigma_i$

Classify by maximizing the log likelihood, i.e.

$\arg\max_{\omega_i}\; -\frac{1}{2}(x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) - \frac{1}{2}\log|\Sigma_i| + \log \Pr(\omega_i)$

[Figure: surface and contour plots of a 2-D Gaussian]
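A sketch of that decision rule in NumPy (the per-class means, covariances and priors are invented placeholders standing in for values estimated from training data):

```python
import numpy as np

def gaussian_log_likelihood(x, mu, Sigma):
    """log p(x|w_i) for a d-dimensional Gaussian."""
    d = len(mu)
    diff = x - mu
    return (-0.5 * diff @ np.linalg.solve(Sigma, diff)
            - 0.5 * np.linalg.slogdet(Sigma)[1]
            - 0.5 * d * np.log(2 * np.pi))

# Hypothetical per-class parameters for formant vectors [F1, F2].
models = {
    "u": (np.array([300.0, 850.0]), np.array([[1600.0, 300.0], [300.0, 4000.0]])),
    "o": (np.array([450.0, 950.0]), np.array([[2500.0, 500.0], [500.0, 6000.0]])),
}
priors = {"u": 0.5, "o": 0.5}

x = np.array([380.0, 900.0])
scores = {w: gaussian_log_likelihood(x, mu, S) + np.log(priors[w])
          for w, (mu, S) in models.items()}
print(max(scores, key=scores.get))
```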
Multivariate Gaussian covariance

Eigen analysis of the inverse of the covariance gives

$\Sigma_i^{-1} = (SV)^T (SV)$  with diagonal $S = \Lambda^{-1/2}$

→ Gaussian $\sim \exp\left[-\frac{1}{2}(x - \mu_i)^T (SV)^T (SV) (x - \mu_i)\right]$

i.e. (SV) transforms x into an uncorrelated, normalized-variance space

$(x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i)$ = Mahalanobis distance²

- Euclidean distance in the normalized space

[Figure: a vector v in x space maps to v' = SVv in SVx space]

example...
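A quick check of the equivalence (covariance and points invented): the Mahalanobis distance computed directly from $\Sigma^{-1}$ matches the Euclidean distance after whitening with SV built from the eigendecomposition.

```python
import numpy as np

Sigma = np.array([[2500.0, 500.0], [500.0, 6000.0]])   # hypothetical covariance
mu = np.array([450.0, 950.0])
x = np.array([380.0, 900.0])

d2_direct = (x - mu) @ np.linalg.inv(Sigma) @ (x - mu)

lam, V = np.linalg.eigh(Sigma)          # Sigma = V diag(lam) V^T
SV = np.diag(lam ** -0.5) @ V.T         # whitening matrix: S = Lambda^{-1/2}
z = SV @ (x - mu)                       # point in the normalized space
d2_whitened = z @ z                     # Euclidean distance^2 there

print(d2_direct, d2_whitened)           # identical (up to rounding)
```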
Gaussian Mixture models (GMMs)

Single Gaussians cannot model

- distributions with multiple modes
- distributions with nonlinear correlation

What about a weighted sum? i.e.

$p(x) \approx \sum_k c_k\, p(x|m_k)$

where $\{c_k\}$ is a set of weights and $\{p(x|m_k)\}$ is a set of Gaussian components

- can fit anything given enough components

Interpretation:
Each observation is generated by one of the Gaussians, chosen at random, with priors $c_k = \Pr(m_k)$
Gaussian mixtures (2)

e.g. nonlinear correlation:

[Figure: original data with a curved (nonlinearly correlated) distribution, the individual Gaussian components, and the resulting mixture surface]

Problem: finding the $c_k$ and $m_k$ parameters

- easy if we knew which $m_k$ generated each x
Expectation-maximization (EM)

General procedure for estimating a model when some parameters are unknown

- e.g. which component generated a value in a Gaussian mixture model

Procedure:
Iteratively update the fixed model parameters $\Theta$ to maximize Q = expected value of the log-probability of the known training data $x_{\text{trn}}$ and the unknown parameters u:

$Q(\Theta, \Theta^{\text{old}}) = \sum_u \Pr(u|x_{\text{trn}}, \Theta^{\text{old}})\, \log p(u, x_{\text{trn}}|\Theta)$

- iterative because $\Pr(u|x_{\text{trn}}, \Theta^{\text{old}})$ changes each time we change $\Theta$
- can prove this increases the data likelihood, hence a maximum-likelihood (ML) model
- local optimum - depends on initialization
Fitting GMMs with EM

Want to find the component Gaussians $m_k = \{\mu_k, \Sigma_k\}$ plus weights $c_k = \Pr(m_k)$ to maximize the likelihood of the training data $x_{\text{trn}}$

If we could assign each x to a particular $m_k$, estimation would be direct

Hence, treat the k's (mixture indices) as unknown, form $Q = E[\log p(x, k|\Theta)]$, and differentiate with respect to the model parameters

→ equations for $\mu_k$, $\Sigma_k$ and $c_k$ to maximize Q...
GMM EM update equations:

Solve for the parameters that maximize Q:

$\mu_k = \frac{\sum_n p(k|x_n, \Theta) \cdot x_n}{\sum_n p(k|x_n, \Theta)}$

$\Sigma_k = \frac{\sum_n p(k|x_n, \Theta)\, (x_n - \mu_k)(x_n - \mu_k)^T}{\sum_n p(k|x_n, \Theta)}$

$c_k = \frac{1}{N} \sum_n p(k|x_n, \Theta)$

Each involves $p(k|x_n, \Theta)$, the "fuzzy membership" of $x_n$ to Gaussian k

Each parameter is just the sample average, weighted by the fuzzy membership
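Putting the E-step (fuzzy memberships) and the M-step (the weighted averages above) together gives a compact EM loop. A bare-bones sketch in NumPy (random initialization, a small variance floor, fixed iteration count rather than a convergence test, synthetic data):

```python
import numpy as np

def gaussian_pdf(X, mu, Sigma):
    """p(x|m_k) for each row of X under a d-dimensional Gaussian."""
    d = len(mu)
    diff = X - mu
    expo = -0.5 * np.einsum('ni,ij,nj->n', diff, np.linalg.inv(Sigma), diff)
    return np.exp(expo) / np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))

def fit_gmm_em(X, K, n_iter=50, seed=0):
    N, d = X.shape
    rng = np.random.default_rng(seed)
    mus = X[rng.choice(N, K, replace=False)]                  # init means from data points
    Sigmas = np.array([np.cov(X.T) + 1e-6 * np.eye(d)] * K)   # init covariances
    cs = np.full(K, 1.0 / K)                                  # init weights c_k
    for _ in range(n_iter):
        # E-step: fuzzy memberships p(k|x_n, theta)
        lik = np.stack([cs[k] * gaussian_pdf(X, mus[k], Sigmas[k]) for k in range(K)], axis=1)
        gamma = lik / lik.sum(axis=1, keepdims=True)          # N x K
        # M-step: membership-weighted averages (the update equations above)
        Nk = gamma.sum(axis=0)
        mus = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mus[k]
            Sigmas[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)
        cs = Nk / N
    return cs, mus, Sigmas

# Hypothetical 2-D data with two clusters.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([0, 0], 1.0, (100, 2)), rng.normal([4, 3], 0.8, (100, 2))])
cs, mus, Sigmas = fit_gmm_em(X, K=2)
print(cs, mus)
```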
GMM examples

Vowel data fit with different mixture counts:

[Figure: formant data (F2/Hz vs. F1/Hz) fit with 1 Gaussian (log p(x) = -1911), 2 Gaussians (-1864), 3 Gaussians (-1849), and 4 Gaussians (-1840)]

example...
Methodological Remarks:
1: Training and test data

A rich model can learn every training example (overtraining)

But the goal is to classify new, unseen data, i.e. generalization

- sometimes use a cross-validation set to decide when to stop training

For evaluation results to be meaningful:

- don't test with training data!
- don't train on test data (even indirectly...)

[Figure: error rate vs. training time or parameters; training-data error keeps falling while test-data error turns back up (overfitting)]
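As a sketch of the discipline (synthetic data; a minimum-distance "model" just to have something to train): fit only on the training split, then measure the error rate on held-out data the model never saw.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)        # synthetic two-class labels

idx = rng.permutation(len(X))                  # random 80/20 hold-out split
n_train = int(0.8 * len(X))
train_idx, test_idx = idx[:n_train], idx[n_train:]

# "Model": class means fit on the training portion only (minimum-distance rule).
means = {c: X[train_idx][y[train_idx] == c].mean(axis=0) for c in (0, 1)}
pred = np.array([min(means, key=lambda c: np.sum((x - means[c]) ** 2)) for x in X[test_idx]])
print("test error rate:", np.mean(pred != y[test_idx]))
```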
Model complexity

More model parameters → better fit

- more Gaussian mixture components
- more MLP hidden units

More training data → can use larger models

For a fixed training set size, there will be some optimal model size (to avoid overfitting):

[Figure: word error rate (WER %) for PLP12N-8k nets vs. hidden-layer size (500-4000 units) and training-set size (9.25-74 hours); more training data and more parameters both help, with an optimal parameter/data ratio along lines of constant training time]
Model combination

No single model is always best

But different models have different strengths

→ benefits from combining several models, e.g.

$\Pr(\omega) = \sum_k w_k \cdot \Pr(\omega|m_k)$

where each $\Pr(\omega|m_k)$ comes from a different model / feature set and $w_k = \Pr(m_k)$ estimates the probability that model k is correct

Should be able to incorporate into one model, but may be easier to combine in practice

Trick is to have good estimates for $w_k$:

- static, from training data
- dynamic, from test data...
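A tiny illustration of the combination rule (all numbers hypothetical): per-model posteriors weighted by $\Pr(m_k)$ and summed, then the MAP class of the combination.

```python
import numpy as np

post_model_a = np.array([0.7, 0.2, 0.1])   # Pr(w_i | m_a, x), e.g. from one feature set
post_model_b = np.array([0.4, 0.5, 0.1])   # Pr(w_i | m_b, x), e.g. from another
weights = np.array([0.6, 0.4])             # Pr(m_k), e.g. from held-out accuracy; sums to 1

combined = weights[0] * post_model_a + weights[1] * post_model_b
print(combined, combined.argmax())
```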
Summary

Classification is making discrete (hard) decisions

Basis is comparison with known examples

- explicitly, or via a model

Probabilistic interpretation ($\Pr(\omega|x)$, $p(x|\omega)$) for principled decisions

Ways to build models:

- neural nets can learn posteriors
- Gaussians can model pdfs
- other approaches...

EM allows estimation of complex models

Parting thought:
How do you know what's going on?