
Acoustic Vector Re-sampling

for GMMSVM-Based
Speaker Verification
Man-Wai MAK and Wei RAO
The Hong Kong Polytechnic University
enmwmak@polyu.edu.hk
http://www.eie.polyu.edu.hk/~mwmak/

Outline

GMM-UBM for Speaker Verification


GMM-SVM for Speaker Verification
Data-Imbalance Problem in GMM-SVM
Utterance Partitioning for GMM-SVM
Experiments on NIST SRE

Speaker Verification

To verify the identity of a claimant based on his/her own voice

[Figure] A claimant says "I am Mary"; the system must decide: is this Mary's voice?

Verification Process

[Diagram] The claimant ("I'm John") speaks; Feature Extraction converts the speech into features, which are scored against John's model (built from John's voiceprint) and the impostor model (built from impostors' voiceprints). The scores pass to Score Normalization and Decision Making, which compares them against a decision threshold and outputs Accept/Reject.

Acoustic Features

Speech is a continuous evolution of the vocal-tract shape
Need to extract a sequence of spectra or a sequence of spectral coefficients
Use a sliding window: 25 ms window, 10 ms shift

[Pipeline] speech frame → log|X(ω)| → DCT → MFCC
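The windowing → log-spectrum → DCT chain above can be sketched as follows. This is a minimal illustration, not a full MFCC front-end (the mel filterbank between the spectrum and the log is omitted here); the function name and default parameters are illustrative.

```python
import numpy as np

def mfcc_like(signal, fs=8000, win=0.025, shift=0.010, n_ceps=12):
    """Sketch of the slide's pipeline: sliding window -> log|X(w)| -> DCT.
    A real MFCC front-end would insert a mel filterbank before the log."""
    wlen, step = int(win * fs), int(shift * fs)      # 25 ms window, 10 ms shift
    frames = [signal[i:i + wlen] * np.hamming(wlen)
              for i in range(0, len(signal) - wlen + 1, step)]
    logspec = np.log(np.abs(np.fft.rfft(frames, axis=1)) + 1e-10)
    n = logspec.shape[1]
    # DCT-II basis: cepstral coefficients are a cosine transform of log|X(w)|
    basis = np.cos(np.pi / n * np.outer(np.arange(1, n_ceps + 1),
                                        np.arange(n) + 0.5))
    return logspec @ basis.T                         # (num_frames, n_ceps)
```

With an 8 kHz signal, each row of the result is one 12-dimensional cepstral vector per 10 ms frame shift.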

GMM-UBM for Speaker Verification

The acoustic vectors (MFCCs) of speaker s are modeled by a probability density function parameterized by \Lambda^{(s)} = \{\lambda_j^{(s)}, \mu_j^{(s)}, \Sigma_j^{(s)}\}_{j=1}^{M}, i.e., a Gaussian mixture model (GMM) for speaker s:

p(\mathbf{x} \mid \Lambda^{(s)}) = \sum_{j=1}^{M} \lambda_j^{(s)}\, p(\mathbf{x} \mid \mu_j^{(s)}, \Sigma_j^{(s)})
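The mixture density above can be evaluated directly. The sketch below assumes diagonal covariances (the common choice for speaker GMMs); names are illustrative.

```python
import numpy as np

def gmm_logprob(x, weights, means, covars):
    """log p(x | Lambda) for a diagonal-covariance GMM:
    p(x | Lambda) = sum_j lambda_j * N(x; mu_j, Sigma_j)."""
    x = np.atleast_2d(x)                       # (T, D) frames
    D = x.shape[1]
    log_comp = []
    for lam, mu, var in zip(weights, means, covars):
        quad = np.sum((x - mu) ** 2 / var, axis=1)
        log_norm = -0.5 * (D * np.log(2 * np.pi) + np.sum(np.log(var)))
        log_comp.append(np.log(lam) + log_norm - 0.5 * quad)
    # log-sum-exp over the M mixture components, per frame
    L = np.stack(log_comp)                     # (M, T)
    m = L.max(axis=0)
    return m + np.log(np.exp(L - m).sum(axis=0))   # (T,) per-frame log-density
```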

GMM-UBM for Speaker Verification

The acoustic vectors of a general population are modeled by another GMM called the universal background model (UBM):

p(\mathbf{x} \mid \Lambda^{(ubm)}) = \sum_{j=1}^{M} \lambda_j^{(ubm)}\, p(\mathbf{x} \mid \mu_j^{(ubm)}, \Sigma_j^{(ubm)})

Parameters of the UBM: \Lambda^{(ubm)} = \{\lambda_j^{(ubm)}, \mu_j^{(ubm)}, \Sigma_j^{(ubm)}\}_{j=1}^{M}

GMM-UBM for Speaker Verification

[Diagram] The enrollment utterance X^{(s)} of the client speaker adapts the universal background model \Lambda^{(ubm)} into the client speaker model \Lambda^{(s)} via MAP adaptation of the means:

\mu_j^{(s)} = \alpha_j E_j(X^{(s)}) + (1 - \alpha_j)\, \mu_j^{(ubm)}
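The mean-adaptation rule above can be sketched as follows, again assuming diagonal covariances. The posterior weighting and the adaptation coefficient \alpha_j = n_j / (n_j + r) follow the standard MAP recipe; function and variable names are illustrative.

```python
import numpy as np

def map_adapt_means(X, ubm_weights, ubm_means, ubm_covars, r=16.0):
    """MAP adaptation of UBM means to enrollment frames X (T x D):
    mu_j^(s) = alpha_j * E_j(X) + (1 - alpha_j) * mu_j^(ubm),
    alpha_j = n_j / (n_j + r), with relevance factor r (16 on the slides)."""
    M, D = ubm_means.shape
    # frame posteriors gamma_j(x_t) under the diagonal-covariance UBM
    log_g = np.empty((M, len(X)))
    for j in range(M):
        quad = np.sum((X - ubm_means[j]) ** 2 / ubm_covars[j], axis=1)
        log_g[j] = (np.log(ubm_weights[j])
                    - 0.5 * (D * np.log(2 * np.pi)
                             + np.sum(np.log(ubm_covars[j])) + quad))
    g = np.exp(log_g - log_g.max(axis=0))
    g /= g.sum(axis=0)                              # (M, T) posteriors
    n = g.sum(axis=1)                               # soft counts n_j
    Ex = (g @ X) / np.maximum(n, 1e-10)[:, None]    # E_j(X): posterior means
    alpha = n / (n + r)
    return alpha[:, None] * Ex + (1 - alpha[:, None]) * ubm_means
```

Components that see many frames (large n_j) move toward the data; rarely visited components stay close to the UBM means.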

GMM-UBM Scoring

Two-class hypothesis problem:
H0: MFCC sequence X^{(c)} comes from the true speaker
H1: MFCC sequence X^{(c)} comes from an impostor

The verification score is a log-likelihood ratio:

Score = \log \frac{p(X^{(c)} \mid H_0)}{p(X^{(c)} \mid H_1)} = \log p(X^{(c)} \mid \Lambda^{(s)}) - \log p(X^{(c)} \mid \Lambda^{(ubm)})

[Diagram] X^{(c)} → Feature extraction → scored against the speaker model \Lambda^{(s)} and the background model \Lambda^{(ubm)} → Score; Score ≥ threshold → accept, Score < threshold → reject.
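The log-likelihood ratio can be computed per frame and averaged. The sketch below is self-contained (it re-implements a diagonal-covariance GMM log-density inline); frame averaging is one common length-normalization choice, and all names are illustrative.

```python
import numpy as np

def llr_score(X, speaker_gmm, ubm):
    """GMM-UBM verification score, averaged over frames:
    Score = log p(X | Lambda_s) - log p(X | Lambda_ubm).
    Each model is a (weights, means, covars) tuple with diagonal covariances."""
    def logprob(X, gmm):
        w, mu, var = gmm
        D = X.shape[1]
        L = np.stack([
            np.log(w[j]) - 0.5 * (D * np.log(2 * np.pi)
                                  + np.sum(np.log(var[j]))
                                  + np.sum((X - mu[j]) ** 2 / var[j], axis=1))
            for j in range(len(w))])
        m = L.max(axis=0)
        return m + np.log(np.exp(L - m).sum(axis=0))   # per-frame log p(x)
    return np.mean(logprob(X, speaker_gmm) - logprob(X, ubm))
```

A score above the decision threshold accepts the claimant; below it, rejects.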

Outline

GMM-UBM for Speaker Verification


GMM-SVM for Speaker Verification
Data-Imbalance Problem in GMM-SVM
Acoustic Vector Resampling for GMM-SVM
Results on NIST SRE


GMM-SVM for Speaker Verification

[Diagram] Mapping an utterance to a GMM supervector: utt^{(s)} → Feature Extraction → X^{(s)} → MAP Adaptation (against the UBM) → Mean Stacking → GMM supervector

\vec{\mu}^{(s)} = \left[ \mu_1^{(s)\top}, \mu_2^{(s)\top}, \ldots, \mu_M^{(s)\top} \right]^{\top} \quad (MD \times 1)
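Mean stacking is a simple concatenation; if the UBM weights and (diagonal) covariances are supplied, the per-component normalization \sqrt{\lambda_j}\, \Sigma_j^{-1/2} \mu_j used by the supervector kernel can be folded in, so the kernel becomes a plain dot product. A minimal sketch, with illustrative names:

```python
import numpy as np

def supervector(means, weights=None, covars=None):
    """Mean stacking: concatenate the M adapted means (each D-dim) into one
    MD-dim GMM supervector.  With UBM weights/covars given, apply the kernel
    normalization sqrt(lambda_j) * Sigma_j^{-1/2} * mu_j per component."""
    M, D = means.shape
    if weights is None:
        return means.reshape(M * D)                  # plain mean stacking
    scaled = np.sqrt(weights)[:, None] * means / np.sqrt(covars)
    return scaled.reshape(M * D)
```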

GMM-SVM Scoring

[Diagram] SVM scoring. The target speaker's utterance utt^{(s)} and the background speakers' utterances utt^{(b_1)}, …, utt^{(b_B)} are each converted (Feature Extraction → MAP adaptation against the UBM → mean stacking) into GMM supervectors, which train the target speaker's SVM with Lagrange multipliers \alpha_0^{(s)}, \alpha_1^{(s)}, \ldots and bias d^{(s)}. At test time, the claimant's utterance utt^{(c)} is mapped to its supervector and scored:

S_{GMM\text{-}SVM}(X^{(c)}) = \alpha_0^{(s)} K(X^{(c)}, X^{(s)}) - \sum_{i \in SV\text{ from bkg}} \alpha_i^{(s)} K(X^{(c)}, X^{(b_i)}) + d^{(s)}

GMM-UBM Scoring vs. GMM-SVM Scoring

GMM-UBM:

S_{GMM\text{-}UBM}(X^{(c)}) = \log p(X^{(c)} \mid \Lambda^{(s)}) - \log p(X^{(c)} \mid \Lambda^{(ubm)})

GMM-SVM:

S_{GMM\text{-}SVM}(X^{(c)}) = \alpha_0^{(s)} K(X^{(c)}, X^{(s)}) - \sum_{i \in SV\text{ from bkg}} \alpha_i^{(s)} K(X^{(c)}, X^{(b_i)}) + d^{(s)}

where the kernel is the inner product between the normalized GMM-supervectors of the claimant's and target speaker's utterances:

K(X^{(c)}, X^{(s)}) = \sum_{j=1}^{M} \left( \sqrt{\lambda_j}\, \Sigma_j^{-1/2} \mu_j^{(c)} \right)^{\top} \left( \sqrt{\lambda_j}\, \Sigma_j^{-1/2} \mu_j^{(s)} \right)
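With normalized supervectors, the kernel reduces to a dot product and the SVM score above is a short computation. A minimal sketch, assuming the supervectors have already been normalized; names are illustrative:

```python
import numpy as np

def gmm_svm_score(sv_c, sv_target, sv_bkg_svs, alpha0, alphas, d):
    """GMM-SVM score with a linear supervector kernel K(a, b) = a . b:
    S = alpha0 * K(sv_c, sv_target) - sum_i alpha_i * K(sv_c, sv_bkg_i) + d.
    alphas are the multipliers of the impostor-class support vectors."""
    return (alpha0 * (sv_c @ sv_target)
            - sum(a * (sv_c @ b) for a, b in zip(alphas, sv_bkg_svs))
            + d)
```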

Outline

GMM-UBM for Speaker Verification


GMM-SVM for Speaker Verification
Data-Imbalance Problem in GMM-SVM
Utterance Partitioning for GMM-SVM
Results on NIST SRE


Data Imbalance in GMM-SVM

For each target speaker, we have only one utterance (GMM-supervector) from the target speaker but many utterances from the background speakers.
So we have a highly imbalanced learning problem.

[Figure] Linear SVM, C = 10.0, #SV = 3, slope = −1.00: only one training vector from the target speaker (speaker class) among many impostor-class vectors.

Data Imbalance in GMM-SVM

[Figure] Linear SVM, C = 10.0, #SV = 3, slope = −1.44: the orientation of the decision boundary depends mainly on the impostor-class data.

Data Imbalance in GMM-SVM

[Figure] A 3-dim two-class problem illustrating that the SVM decision plane is largely governed by the impostor-class supervectors. The shaded region shows where the target-speaker vector can be located without changing the orientation of the decision plane.

Outline

GMM-UBM for Speaker Verification


GMM-SVM for Speaker Verification
Data-Imbalance Problem in GMM-SVM
Utterance Partitioning for GMM-SVM
Results on NIST SRE


Utterance Partitioning

Partition an enrollment utterance of a target speaker into a number of sub-utterances, with each sub-utterance producing one GMM-supervector.
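Partitioning the frame sequence is a one-liner; each returned segment is then MAP-adapted into its own supervector. A minimal sketch (illustrative names; N = 4 matches the diagrams on the following slides):

```python
import numpy as np

def partition(X, N=4):
    """Split the frame sequence X (T x D) of an enrollment utterance into N
    nearly equal-length sub-utterances, each yielding one GMM-supervector."""
    return np.array_split(X, N)
```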

Utterance Partitioning

[Diagram] The target speaker's enrollment utterance utt^{(s)} is feature-extracted into X_0^{(s)} (the full utterance) and partitioned into X_1^{(s)}, …, X_4^{(s)}; MAP adaptation (against the UBM) and mean stacking turn these into supervectors m_0^{(s)}, …, m_4^{(s)}. The background speakers' utterances utt^{(b_1)}, …, utt^{(b_B)} are processed likewise into m_0^{(b_1)}, …, m_4^{(b_B)}. All supervectors feed SVM training, yielding the SVM of target speaker s.

Length-Representation Trade-off

When the number of partitions increases, the length of each sub-utterance decreases.
If the utterance length is too short, the supervectors of the sub-utterances will be almost the same as that of the UBM.

[Figure] Linear SVM, C = 10.0, #SV = 3, slope = −1.44: sub-utterance supervectors collapse toward the supervector corresponding to the UBM.

Utterance Partitioning with Acoustic Vector Resampling (UP-AVR)

Goal: increase the number of sub-utterances without compromising their representation power.

Procedure of UP-AVR:
1. Randomly rearrange the sequence of acoustic vectors in an utterance;
2. Partition the acoustic vectors of the utterance into N segments;
3. Repeat Steps 1 and 2 R times to obtain RN + 1 target-speaker supervectors (the extra one comes from the full utterance).

[Figure] MFCC sequence before randomization vs. MFCC sequence after randomization.
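The three steps above can be sketched directly. Randomizing the frame order is harmless here because MAP mean adaptation treats the frames as an unordered set; names and the seeded RNG are illustrative:

```python
import numpy as np

def up_avr(X, N=4, R=2, seed=0):
    """UP-AVR: R rounds of (1) randomly permuting the frame order of X and
    (2) splitting into N segments, plus the full utterance itself, giving
    R*N + 1 frame sets -- each is then MAP-adapted into one supervector."""
    rng = np.random.default_rng(seed)
    subsets = [X]                                  # the full utterance
    for _ in range(R):
        Xp = X[rng.permutation(len(X))]            # shuffle frame indices
        subsets.extend(np.array_split(Xp, N))      # N sub-utterances
    return subsets
```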

Utterance Partitioning with Acoustic Vector Resampling (UP-AVR)

[Diagram] Same pipeline as utterance partitioning, except that feature extraction is followed by index randomization before partitioning. The target speaker's utterance utt^{(s)} yields X_0^{(s)}, …, X_4^{(s)}, which MAP adaptation (against the UBM) and mean stacking turn into supervectors m_0^{(s)}, …, m_4^{(s)}; the background utterances utt^{(b_1)}, …, utt^{(b_B)} yield m_0^{(b_1)}, …, m_4^{(b_B)}. SVM training on all supervectors produces the SVM of target speaker s.

Utterance Partitioning with Acoustic Vector Resampling (UP-AVR)

Characteristics of supervectors created by UP-AVR:
The average pairwise distance between sub-utterance SVs is larger than the average pairwise distance between sub-utterance SVs and the full-utterance SV.
The average pairwise distance between speaker-class sub-utterance SVs and impostor-class SVs is smaller than the average pairwise distance between the speaker-class full-utterance SV and impostor-class SVs.

[Figure] Speaker-class and impostor-class regions, with sub-utterance supervectors clustered around each full-utterance supervector.

Nuisance Attribute Projection

Nuisance Attribute Projection (NAP) [Solomonoff et al., ICASSP 2005]
Goal: to reduce the effect of session variability.

Recall the GMM-supervector kernel:

K(X^{(c,h)}, X^{(s,h)}) = \left( \Sigma^{-1/2} \mu^{(c,h)} \right)^{\top} \left( \Sigma^{-1/2} \mu^{(s,h)} \right)

Define the session- and speaker-dependent supervector as m^{(s,h)} = \Sigma^{-1/2} \mu^{(s,h)}, where s stands for speaker and h stands for session.

Remove the session-dependent part (h) by removing the subspace that causes the session variability:

m^{(s)} = P m^{(s,h)} = (I - VV^{\top})\, m^{(s,h)}

where V defines the subspace representing session variability. The new kernel becomes

K(X^{(c)}, X^{(s)}) = m^{(c)\top} m^{(s)} = \left( P m^{(c,h)} \right)^{\top} \left( P m^{(s,h)} \right)

Nuisance Attribute Projection

Nuisance Attribute Projection (NAP) [Solomonoff et al., ICASSP 2005]: the projection minimizes the pairwise distances between same-speaker supervectors across sessions:

P^{*} = \arg\min_{P} \sum_{i,j} w_{ij} \left\| P m^{(i,h)} - P m^{(j,h)} \right\|^{2}

w_{ij} = 1 if i and j correspond to the same speaker, and 0 otherwise.

[Figure] m^{(s,h)} decomposes into VV^{\top} m^{(s,h)} (the subspace representing session variability, defined by V) and m^{(s)} = P m^{(s,h)}.
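The projection and one standard way of estimating V can be sketched as follows. The training sketch removes each speaker's mean and takes the top singular directions of the residuals, which solves the weighted criterion above for the 0/1 weights shown; names are illustrative, and V is assumed to have orthonormal columns.

```python
import numpy as np

def nap_project(m, V):
    """NAP: remove the session subspace spanned by the orthonormal columns
    of V from a supervector m:  m_clean = (I - V V^T) m."""
    return m - V @ (V.T @ m)

def train_nap(M_by_speaker, corank):
    """Estimate V from session-labelled supervectors: within each speaker,
    subtract the speaker mean, then take the top 'corank' right singular
    vectors of the stacked residuals (the session-variability directions)."""
    resid = np.vstack([S - S.mean(axis=0) for S in M_by_speaker])
    _, _, Vt = np.linalg.svd(resid, full_matrices=False)
    return Vt[:corank].T                       # (supervector_dim, corank)
```

The slides use a 64-dim session-variability subspace (corank 64) for the NIST experiments.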

Enrollment Process of GMM-SVM with UP-AVR

[Diagram] MFCCs of an utterance from target speaker s, X^{(s,h)} → Resampling/Partitioning → X_i^{(s,h)} → MAP adaptation and mean stacking (against the UBM) → session-dependent supervectors m_i^{(s,h)} → NAP → session-independent supervectors m_i^{(s)} → SVM training (together with the background supervectors m_i^{(b_j)}) → SVM of target speaker s.

Verification Process of GMM-SVM with UP-AVR

[Diagram] MFCCs of a test utterance from claimant c, X^{(c)} → MAP adaptation and mean stacking (against the UBM) → session-dependent supervector m^{(c,h)} → NAP → session-independent supervector m^{(c)} → SVM scoring against the SVM of target speaker s → score S(X^{(c)}) → T-norm (using the Tnorm models) → normalized score \tilde{S}(X^{(c)}).

T-Norm (Auckenthaler, 2000)

Goal: to shift and scale the verification scores so that a global decision threshold can be used for all speakers.

[Diagram] The supervector m^{(c)} from the test utterance is scored against T-norm SVMs 1, …, R; the mean \mu(X^{(c)}) and standard deviation \sigma(X^{(c)}) of these cohort scores normalize the target-speaker score S(X^{(c)}):

\tilde{S}(X^{(c)}) = \frac{S(X^{(c)}) - \mu(X^{(c)})}{\sigma(X^{(c)})}
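The normalization above is a few lines once the cohort scores are available. A minimal sketch, where each T-norm model is represented as a callable that scores the claimant's supervector; names are illustrative:

```python
import numpy as np

def tnorm(score, sv_c, tnorm_svms):
    """T-norm: score the claimant's supervector sv_c against R cohort
    (T-norm) SVMs, then shift/scale the target-speaker score by the
    cohort mean and standard deviation: S_norm = (S - mu) / sigma."""
    cohort = np.array([svm(sv_c) for svm in tnorm_svms])
    return (score - cohort.mean()) / cohort.std()
```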

Outline

GMM-UBM for Speaker Verification


GMM-SVM for Speaker Verification
Data-Imbalance Problem in GMM-SVM
Utterance Partitioning for GMM-SVM
Experiments on NIST SRE


Experiments: Speech Data

Evaluations on NIST SRE 2002 and 2004.

NIST SRE 2002:
Use NIST'01 for computing the UBMs, the impostor-class supervectors of the SVMs, the Tnorm models, and the NAP parameters
2983 true-speaker trials and 36287 impostor attempts
2-min utterances for training and about 1-min utterances for test

NIST SRE 2004:
Use the Fisher corpus for computing the UBMs, the impostor-class supervectors of the SVMs, and the Tnorm models
NIST'99 and NIST'00 for computing the NAP parameters
2386 true-speaker trials and 23838 impostor attempts
5-min utterances for training and testing

Experiments: Features and Models

12 MFCCs + 12 ΔMFCCs with feature warping
1024-mixture GMMs for GMM-UBM
256-mixture GMMs for GMM-SVM
MAP relevance factor = 16
300 impostor-class supervectors for GMM-SVM
200 T-norm models
64-dim session-variability subspace (NAP corank, i.e., the rank of V)

Results: No. of Mixtures in GMM-SVM (NIST02)

[Figure] Normalized feature variances, the threshold below which the variances of features are deemed too small, and the large number of features with small variance.

Results: Effects of NAP on Different NIST SREs

[Figure] Eigenvalue spectra of the session-variability subspace; large eigenvalues mean large session variation.

Results: Effect of NAP Corank on Performance

[Figure] Performance as a function of NAP corank; the "No NAP" point serves as the baseline.

Results: Comparing the Discriminative Power of GMM-SVM and GMM-SVM with UP-AVR

Results: EER and MinDCF vs. No. of Target-Speaker Supervectors (NIST02)

Results: Varying the Number of Resamplings (R) and the Number of Partitions (N) (NIST02)

Results (NIST02)

Experiments and Results

Performance on NIST02:

[Figure] DET curves with EERs of 9.39% (GMM-UBM), 9.05% (GMM-SVM), and 8.16% (GMM-SVM with UP-AVR).

Experiments and Results

Performance on NIST04:

[Figure] DET curves: GMM-UBM (EER = 16.05%), GMM-SVM (EER = 10.42%), GMM-SVM with UP-AVR (EER = 9.46%).

References
1. S.X. Zhang and M.W. Mak, "Optimized Discriminative Kernel for SVM Scoring and its Application to Speaker Verification", IEEE Trans. on Neural Networks, to appear.
2. M.W. Mak and W. Rao, "Utterance Partitioning with Acoustic Vector Resampling for GMM-SVM Speaker Verification", Speech Communication, vol. 53, no. 1, Jan. 2011, pp. 119-130.
3. M.W. Mak and W. Rao, "Acoustic Vector Resampling for GMMSVM-Based Speaker Verification", Interspeech 2010, Sept. 2010, Makuhari, Japan, pp. 1449-1452.
4. S.Y. Kung, M.W. Mak, and S.H. Lin, Biometric Authentication: A Machine Learning Approach, Prentice Hall, 2005.
5. W.M. Campbell, D.E. Sturim, and D.A. Reynolds, "Support vector machines using GMM supervectors for speaker verification", IEEE Signal Processing Letters, vol. 13, pp. 308-311, 2006.
6. D.A. Reynolds, T.F. Quatieri, and R.B. Dunn, "Speaker verification using adapted Gaussian mixture models", Digital Signal Processing, vol. 10, pp. 19-41, 2000.
