
Acoustic Vector Re-sampling

for GMMSVM-Based
Speaker Verification
Man-Wai MAK and Wei RAO
The Hong Kong Polytechnic University
enmwmak@polyu.edu.hk
http://www.eie.polyu.edu.hk/~mwmak/

Outline

GMM-UBM for Speaker Verification


GMM-SVM for Speaker Verification
Data-Imbalance Problem in GMM-SVM
Utterance Partitioning for GMM-SVM
Experiments on NIST SRE

Speaker Verification

To verify the identity of a claimant based on his/her own voice

[Figure] A claimant says "I am Mary"; the system must decide: is this Mary's voice?

Verification Process

[Diagram] The claimant ("I'm John") speaks; Feature Extraction converts the speech into features, which are scored against John's model (built from John's voiceprint) and the impostor model (built from impostors' voiceprints). The scores pass to Score Normalization and Decision Making, which compares them against a decision threshold and outputs Accept/Reject.

Acoustic Features

Speech is a continuous evolution of the vocal-tract shape
Need to extract a sequence of spectra or a sequence of spectral coefficients
Use a sliding window: 25 ms window, 10 ms shift

[Pipeline] speech frame → log|X(ω)| → DCT → MFCC
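The windowing → log-spectrum → DCT chain above can be sketched as follows. This is a minimal illustration, not a full MFCC front-end (the mel filterbank between the spectrum and the log is omitted here); the function name and default parameters are illustrative.

```python
import numpy as np

def mfcc_like(signal, fs=8000, win=0.025, shift=0.010, n_ceps=12):
    """Sketch of the slide's pipeline: sliding window -> log|X(w)| -> DCT.
    A real MFCC front-end would insert a mel filterbank before the log."""
    wlen, step = int(win * fs), int(shift * fs)      # 25 ms window, 10 ms shift
    frames = [signal[i:i + wlen] * np.hamming(wlen)
              for i in range(0, len(signal) - wlen + 1, step)]
    logspec = np.log(np.abs(np.fft.rfft(frames, axis=1)) + 1e-10)
    n = logspec.shape[1]
    # DCT-II basis: cepstral coefficients are a cosine transform of log|X(w)|
    basis = np.cos(np.pi / n * np.outer(np.arange(1, n_ceps + 1),
                                        np.arange(n) + 0.5))
    return logspec @ basis.T                         # (num_frames, n_ceps)
```

With an 8 kHz signal, each row of the result is one 12-dimensional cepstral vector per 10 ms frame shift.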

GMM-UBM for Speaker Verification

The acoustic vectors (MFCCs) of speaker s are modeled by a probability density function parameterized by \Lambda^{(s)} = \{\lambda_j^{(s)}, \mu_j^{(s)}, \Sigma_j^{(s)}\}_{j=1}^{M}, i.e., a Gaussian mixture model (GMM) for speaker s:

p(\mathbf{x} \mid \Lambda^{(s)}) = \sum_{j=1}^{M} \lambda_j^{(s)}\, p(\mathbf{x} \mid \mu_j^{(s)}, \Sigma_j^{(s)})
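The mixture density above can be evaluated directly. The sketch below assumes diagonal covariances (the common choice for speaker GMMs); names are illustrative.

```python
import numpy as np

def gmm_logprob(x, weights, means, covars):
    """log p(x | Lambda) for a diagonal-covariance GMM:
    p(x | Lambda) = sum_j lambda_j * N(x; mu_j, Sigma_j)."""
    x = np.atleast_2d(x)                       # (T, D) frames
    D = x.shape[1]
    log_comp = []
    for lam, mu, var in zip(weights, means, covars):
        quad = np.sum((x - mu) ** 2 / var, axis=1)
        log_norm = -0.5 * (D * np.log(2 * np.pi) + np.sum(np.log(var)))
        log_comp.append(np.log(lam) + log_norm - 0.5 * quad)
    # log-sum-exp over the M mixture components, per frame
    L = np.stack(log_comp)                     # (M, T)
    m = L.max(axis=0)
    return m + np.log(np.exp(L - m).sum(axis=0))   # (T,) per-frame log-density
```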

GMM-UBM for Speaker Verification

The acoustic vectors of a general population are modeled by another GMM called the universal background model (UBM):

p(\mathbf{x} \mid \Lambda^{(ubm)}) = \sum_{j=1}^{M} \lambda_j^{(ubm)}\, p(\mathbf{x} \mid \mu_j^{(ubm)}, \Sigma_j^{(ubm)})

Parameters of the UBM: \Lambda^{(ubm)} = \{\lambda_j^{(ubm)}, \mu_j^{(ubm)}, \Sigma_j^{(ubm)}\}_{j=1}^{M}

GMM-UBM for Speaker Verification

[Diagram] The enrollment utterance X^{(s)} of the client speaker adapts the universal background model \Lambda^{(ubm)} into the client speaker model \Lambda^{(s)} via MAP adaptation of the means:

\mu_j^{(s)} = \alpha_j E_j(X^{(s)}) + (1 - \alpha_j)\, \mu_j^{(ubm)}
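The mean-adaptation rule above can be sketched as follows, again assuming diagonal covariances. The posterior weighting and the adaptation coefficient \alpha_j = n_j / (n_j + r) follow the standard MAP recipe; function and variable names are illustrative.

```python
import numpy as np

def map_adapt_means(X, ubm_weights, ubm_means, ubm_covars, r=16.0):
    """MAP adaptation of UBM means to enrollment frames X (T x D):
    mu_j^(s) = alpha_j * E_j(X) + (1 - alpha_j) * mu_j^(ubm),
    alpha_j = n_j / (n_j + r), with relevance factor r (16 on the slides)."""
    M, D = ubm_means.shape
    # frame posteriors gamma_j(x_t) under the diagonal-covariance UBM
    log_g = np.empty((M, len(X)))
    for j in range(M):
        quad = np.sum((X - ubm_means[j]) ** 2 / ubm_covars[j], axis=1)
        log_g[j] = (np.log(ubm_weights[j])
                    - 0.5 * (D * np.log(2 * np.pi)
                             + np.sum(np.log(ubm_covars[j])) + quad))
    g = np.exp(log_g - log_g.max(axis=0))
    g /= g.sum(axis=0)                              # (M, T) posteriors
    n = g.sum(axis=1)                               # soft counts n_j
    Ex = (g @ X) / np.maximum(n, 1e-10)[:, None]    # E_j(X): posterior means
    alpha = n / (n + r)
    return alpha[:, None] * Ex + (1 - alpha[:, None]) * ubm_means
```

Components that see many frames (large n_j) move toward the data; rarely visited components stay close to the UBM means.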

GMM-UBM Scoring

Two-class hypothesis problem:
H0: MFCC sequence X^{(c)} comes from the true speaker
H1: MFCC sequence X^{(c)} comes from an impostor

The verification score is a log-likelihood ratio:

Score = \log \frac{p(X^{(c)} \mid H_0)}{p(X^{(c)} \mid H_1)} = \log p(X^{(c)} \mid \Lambda^{(s)}) - \log p(X^{(c)} \mid \Lambda^{(ubm)})

[Diagram] X^{(c)} → Feature extraction → scored against the speaker model \Lambda^{(s)} and the background model \Lambda^{(ubm)} → Score; Score ≥ threshold → accept, Score < threshold → reject.
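The log-likelihood ratio can be computed per frame and averaged. The sketch below is self-contained (it re-implements a diagonal-covariance GMM log-density inline); frame averaging is one common length-normalization choice, and all names are illustrative.

```python
import numpy as np

def llr_score(X, speaker_gmm, ubm):
    """GMM-UBM verification score, averaged over frames:
    Score = log p(X | Lambda_s) - log p(X | Lambda_ubm).
    Each model is a (weights, means, covars) tuple with diagonal covariances."""
    def logprob(X, gmm):
        w, mu, var = gmm
        D = X.shape[1]
        L = np.stack([
            np.log(w[j]) - 0.5 * (D * np.log(2 * np.pi)
                                  + np.sum(np.log(var[j]))
                                  + np.sum((X - mu[j]) ** 2 / var[j], axis=1))
            for j in range(len(w))])
        m = L.max(axis=0)
        return m + np.log(np.exp(L - m).sum(axis=0))   # per-frame log p(x)
    return np.mean(logprob(X, speaker_gmm) - logprob(X, ubm))
```

A score above the decision threshold accepts the claimant; below it, rejects.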

Outline

GMM-UBM for Speaker Verification


GMM-SVM for Speaker Verification
Data-Imbalance Problem in GMM-SVM
Acoustic Vector Resampling for GMM-SVM
Results on NIST SRE


GMM-SVM for Speaker Verification

[Diagram] Mapping an utterance to a GMM supervector: utt^{(s)} → Feature Extraction → X^{(s)} → MAP Adaptation (against the UBM) → Mean Stacking → GMM supervector

\vec{\mu}^{(s)} = \left[ \mu_1^{(s)\top}, \mu_2^{(s)\top}, \ldots, \mu_M^{(s)\top} \right]^{\top} \quad (MD \times 1)
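Mean stacking is a simple concatenation; if the UBM weights and (diagonal) covariances are supplied, the per-component normalization \sqrt{\lambda_j}\, \Sigma_j^{-1/2} \mu_j used by the supervector kernel can be folded in, so the kernel becomes a plain dot product. A minimal sketch, with illustrative names:

```python
import numpy as np

def supervector(means, weights=None, covars=None):
    """Mean stacking: concatenate the M adapted means (each D-dim) into one
    MD-dim GMM supervector.  With UBM weights/covars given, apply the kernel
    normalization sqrt(lambda_j) * Sigma_j^{-1/2} * mu_j per component."""
    M, D = means.shape
    if weights is None:
        return means.reshape(M * D)                  # plain mean stacking
    scaled = np.sqrt(weights)[:, None] * means / np.sqrt(covars)
    return scaled.reshape(M * D)
```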

GMM-SVM Scoring

[Diagram] SVM scoring. The target speaker's utterance utt^{(s)} and the background speakers' utterances utt^{(b_1)}, …, utt^{(b_B)} are each converted (Feature Extraction → MAP adaptation against the UBM → mean stacking) into GMM supervectors, which train the target speaker's SVM with Lagrange multipliers \alpha_0^{(s)}, \alpha_1^{(s)}, \ldots and bias d^{(s)}. At test time, the claimant's utterance utt^{(c)} is mapped to its supervector and scored:

S_{GMM\text{-}SVM}(X^{(c)}) = \alpha_0^{(s)} K(X^{(c)}, X^{(s)}) - \sum_{i \in SV\text{ from bkg}} \alpha_i^{(s)} K(X^{(c)}, X^{(b_i)}) + d^{(s)}

GMM-UBM Scoring vs. GMM-SVM Scoring

GMM-UBM:

S_{GMM\text{-}UBM}(X^{(c)}) = \log p(X^{(c)} \mid \Lambda^{(s)}) - \log p(X^{(c)} \mid \Lambda^{(ubm)})

GMM-SVM:

S_{GMM\text{-}SVM}(X^{(c)}) = \alpha_0^{(s)} K(X^{(c)}, X^{(s)}) - \sum_{i \in SV\text{ from bkg}} \alpha_i^{(s)} K(X^{(c)}, X^{(b_i)}) + d^{(s)}

where the kernel is the inner product between the normalized GMM-supervectors of the claimant's and target speaker's utterances:

K(X^{(c)}, X^{(s)}) = \sum_{j=1}^{M} \left( \sqrt{\lambda_j}\, \Sigma_j^{-1/2} \mu_j^{(c)} \right)^{\top} \left( \sqrt{\lambda_j}\, \Sigma_j^{-1/2} \mu_j^{(s)} \right)
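With normalized supervectors, the kernel reduces to a dot product and the SVM score above is a short computation. A minimal sketch, assuming the supervectors have already been normalized; names are illustrative:

```python
import numpy as np

def gmm_svm_score(sv_c, sv_target, sv_bkg_svs, alpha0, alphas, d):
    """GMM-SVM score with a linear supervector kernel K(a, b) = a . b:
    S = alpha0 * K(sv_c, sv_target) - sum_i alpha_i * K(sv_c, sv_bkg_i) + d.
    alphas are the multipliers of the impostor-class support vectors."""
    return (alpha0 * (sv_c @ sv_target)
            - sum(a * (sv_c @ b) for a, b in zip(alphas, sv_bkg_svs))
            + d)
```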

Outline

GMM-UBM for Speaker Verification


GMM-SVM for Speaker Verification
Data-Imbalance Problem in GMM-SVM
Utterance Partitioning for GMM-SVM
Results on NIST SRE


Data Imbalance in GMM-SVM

For each target speaker, we have only one utterance (GMM-supervector) from the target speaker but many utterances from the background speakers.
So we have a highly imbalanced learning problem.

[Figure] Linear SVM, C = 10.0, #SV = 3, slope = −1.00: only one training vector from the target speaker (speaker class) among many impostor-class vectors.

Data Imbalance in GMM-SVM

[Figure] Linear SVM, C = 10.0, #SV = 3, slope = −1.44: the orientation of the decision boundary depends mainly on the impostor-class data.

Data Imbalance in GMM-SVM

[Figure] A 3-dim two-class problem illustrating that the SVM decision plane is largely governed by the impostor-class supervectors. The shaded region shows where the target-speaker vector can be located without changing the orientation of the decision plane.

Outline

GMM-UBM for Speaker Verification


GMM-SVM for Speaker Verification
Data-Imbalance Problem in GMM-SVM
Utterance Partitioning for GMM-SVM
Results on NIST SRE


Utterance Partitioning

Partition an enrollment utterance of a target speaker into a number of sub-utterances, with each sub-utterance producing one GMM-supervector.
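Partitioning the frame sequence is a one-liner; each returned segment is then MAP-adapted into its own supervector. A minimal sketch (illustrative names; N = 4 matches the diagrams on the following slides):

```python
import numpy as np

def partition(X, N=4):
    """Split the frame sequence X (T x D) of an enrollment utterance into N
    nearly equal-length sub-utterances, each yielding one GMM-supervector."""
    return np.array_split(X, N)
```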

Utterance Partitioning

[Diagram] The target speaker's enrollment utterance utt^{(s)} is feature-extracted into X_0^{(s)} (the full utterance) and partitioned into X_1^{(s)}, …, X_4^{(s)}; MAP adaptation (against the UBM) and mean stacking turn these into supervectors m_0^{(s)}, …, m_4^{(s)}. The background speakers' utterances utt^{(b_1)}, …, utt^{(b_B)} are processed likewise into m_0^{(b_1)}, …, m_4^{(b_B)}. All supervectors feed SVM training, yielding the SVM of target speaker s.

Length-Representation Trade-off

When the number of partitions increases, the length of each sub-utterance decreases.
If the utterance length is too short, the supervectors of the sub-utterances will be almost the same as that of the UBM.

[Figure] Linear SVM, C = 10.0, #SV = 3, slope = −1.44: sub-utterance supervectors collapse toward the supervector corresponding to the UBM.

Utterance Partitioning with Acoustic Vector Resampling (UP-AVR)

Goal: increase the number of sub-utterances without compromising their representation power.

Procedure of UP-AVR:
1. Randomly rearrange the sequence of acoustic vectors in an utterance;
2. Partition the acoustic vectors of the utterance into N segments;
3. Repeat Steps 1 and 2 R times to obtain RN + 1 target-speaker supervectors (the extra one comes from the full utterance).

[Figure] MFCC sequence before randomization vs. MFCC sequence after randomization.
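The three steps above can be sketched directly. Randomizing the frame order is harmless here because MAP mean adaptation treats the frames as an unordered set; names and the seeded RNG are illustrative:

```python
import numpy as np

def up_avr(X, N=4, R=2, seed=0):
    """UP-AVR: R rounds of (1) randomly permuting the frame order of X and
    (2) splitting into N segments, plus the full utterance itself, giving
    R*N + 1 frame sets -- each is then MAP-adapted into one supervector."""
    rng = np.random.default_rng(seed)
    subsets = [X]                                  # the full utterance
    for _ in range(R):
        Xp = X[rng.permutation(len(X))]            # shuffle frame indices
        subsets.extend(np.array_split(Xp, N))      # N sub-utterances
    return subsets
```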

Utterance Partitioning with Acoustic Vector Resampling (UP-AVR)

[Diagram] Same pipeline as utterance partitioning, except that feature extraction is followed by index randomization before partitioning. The target speaker's utterance utt^{(s)} yields X_0^{(s)}, …, X_4^{(s)}, which MAP adaptation (against the UBM) and mean stacking turn into supervectors m_0^{(s)}, …, m_4^{(s)}; the background utterances utt^{(b_1)}, …, utt^{(b_B)} yield m_0^{(b_1)}, …, m_4^{(b_B)}. SVM training on all supervectors produces the SVM of target speaker s.

Utterance Partitioning with Acoustic Vector Resampling (UP-AVR)

Characteristics of supervectors created by UP-AVR:
The average pairwise distance between sub-utterance SVs is larger than the average pairwise distance between sub-utterance SVs and the full-utterance SV.
The average pairwise distance between speaker-class sub-utterance SVs and impostor-class SVs is smaller than the average pairwise distance between the speaker-class full-utterance SV and impostor-class SVs.

[Figure] Speaker-class and impostor-class regions, with sub-utterance supervectors clustered around each full-utterance supervector.

Nuisance Attribute Projection

Nuisance Attribute Projection (NAP) [Solomonoff et al., ICASSP 2005]
Goal: to reduce the effect of session variability.

Recall the GMM-supervector kernel:

K(X^{(c,h)}, X^{(s,h)}) = \left( \Sigma^{-1/2} \mu^{(c,h)} \right)^{\top} \left( \Sigma^{-1/2} \mu^{(s,h)} \right)

Define the session- and speaker-dependent supervector as m^{(s,h)} = \Sigma^{-1/2} \mu^{(s,h)}, where s stands for speaker and h stands for session.

Remove the session-dependent part (h) by removing the subspace that causes the session variability:

m^{(s)} = P m^{(s,h)} = (I - VV^{\top})\, m^{(s,h)}

where V defines the subspace representing session variability. The new kernel becomes

K(X^{(c)}, X^{(s)}) = m^{(c)\top} m^{(s)} = \left( P m^{(c,h)} \right)^{\top} \left( P m^{(s,h)} \right)

Nuisance Attribute Projection

Nuisance Attribute Projection (NAP) [Solomonoff et al., ICASSP 2005]: the projection minimizes the pairwise distances between same-speaker supervectors across sessions:

P^{*} = \arg\min_{P} \sum_{i,j} w_{ij} \left\| P m^{(i,h)} - P m^{(j,h)} \right\|^{2}

w_{ij} = 1 if i and j correspond to the same speaker, and 0 otherwise.

[Figure] m^{(s,h)} decomposes into VV^{\top} m^{(s,h)} (the subspace representing session variability, defined by V) and m^{(s)} = P m^{(s,h)}.
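The projection and one standard way of estimating V can be sketched as follows. The training sketch removes each speaker's mean and takes the top singular directions of the residuals, which solves the weighted criterion above for the 0/1 weights shown; names are illustrative, and V is assumed to have orthonormal columns.

```python
import numpy as np

def nap_project(m, V):
    """NAP: remove the session subspace spanned by the orthonormal columns
    of V from a supervector m:  m_clean = (I - V V^T) m."""
    return m - V @ (V.T @ m)

def train_nap(M_by_speaker, corank):
    """Estimate V from session-labelled supervectors: within each speaker,
    subtract the speaker mean, then take the top 'corank' right singular
    vectors of the stacked residuals (the session-variability directions)."""
    resid = np.vstack([S - S.mean(axis=0) for S in M_by_speaker])
    _, _, Vt = np.linalg.svd(resid, full_matrices=False)
    return Vt[:corank].T                       # (supervector_dim, corank)
```

The slides use a 64-dim session-variability subspace (corank 64) for the NIST experiments.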

Enrollment Process of GMM-SVM with UP-AVR

[Diagram] MFCCs of an utterance from target speaker s, X^{(s,h)} → Resampling/Partitioning → X_i^{(s,h)} → MAP adaptation and mean stacking (against the UBM) → session-dependent supervectors m_i^{(s,h)} → NAP → session-independent supervectors m_i^{(s)} → SVM training (together with the background supervectors m_i^{(b_j)}) → SVM of target speaker s.

Verification Process of GMM-SVM with UP-AVR

[Diagram] MFCCs of a test utterance from claimant c, X^{(c)} → MAP adaptation and mean stacking (against the UBM) → session-dependent supervector m^{(c,h)} → NAP → session-independent supervector m^{(c)} → SVM scoring against the SVM of target speaker s → score S(X^{(c)}) → T-norm (using the Tnorm models) → normalized score \tilde{S}(X^{(c)}).

T-Norm (Auckenthaler, 2000)

Goal: to shift and scale the verification scores so that a global decision threshold can be used for all speakers.

[Diagram] The supervector m^{(c)} from the test utterance is scored against T-norm SVMs 1, …, R; the mean \mu(X^{(c)}) and standard deviation \sigma(X^{(c)}) of these cohort scores normalize the target-speaker score S(X^{(c)}):

\tilde{S}(X^{(c)}) = \frac{S(X^{(c)}) - \mu(X^{(c)})}{\sigma(X^{(c)})}
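The normalization above is a few lines once the cohort scores are available. A minimal sketch, where each T-norm model is represented as a callable that scores the claimant's supervector; names are illustrative:

```python
import numpy as np

def tnorm(score, sv_c, tnorm_svms):
    """T-norm: score the claimant's supervector sv_c against R cohort
    (T-norm) SVMs, then shift/scale the target-speaker score by the
    cohort mean and standard deviation: S_norm = (S - mu) / sigma."""
    cohort = np.array([svm(sv_c) for svm in tnorm_svms])
    return (score - cohort.mean()) / cohort.std()
```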

Outline

GMM-UBM for Speaker Verification


GMM-SVM for Speaker Verification
Data-Imbalance Problem in GMM-SVM
Utterance Partitioning for GMM-SVM
Experiments on NIST SRE


Experiments: Speech Data

Evaluations on NIST SRE 2002 and 2004.

NIST SRE 2002:
Use NIST'01 for computing the UBMs, the impostor-class supervectors of the SVMs, the Tnorm models, and the NAP parameters
2983 true-speaker trials and 36287 impostor attempts
2-min utterances for training and about 1-min utterances for test

NIST SRE 2004:
Use the Fisher corpus for computing the UBMs, the impostor-class supervectors of the SVMs, and the Tnorm models
NIST'99 and NIST'00 for computing the NAP parameters
2386 true-speaker trials and 23838 impostor attempts
5-min utterances for training and testing

Experiments: Features and Models

12 MFCCs + 12 ΔMFCCs with feature warping
1024-mixture GMMs for GMM-UBM
256-mixture GMMs for GMM-SVM
MAP relevance factor = 16
300 impostor-class supervectors for GMM-SVM
200 T-norm models
64-dim session-variability subspace (NAP corank, i.e., the rank of V)

Results: No. of Mixtures in GMM-SVM (NIST02)

[Figure] Normalized feature variances, the threshold below which the variances of features are deemed too small, and the large number of features with small variance.

Results: Effects of NAP on Different NIST SREs

[Figure] Eigenvalue spectra of the session-variability subspace; large eigenvalues mean large session variation.

Results: Effect of NAP Corank on Performance

[Figure] Performance as a function of NAP corank; the "No NAP" point serves as the baseline.

Results: Comparing the Discriminative Power of GMM-SVM and GMM-SVM with UP-AVR

Results: EER and MinDCF vs. No. of Target-Speaker Supervectors (NIST02)

Results: Varying the Number of Resamplings (R) and the Number of Partitions (N) (NIST02)

Results (NIST02)

Experiments and Results

Performance on NIST02:

[Figure] DET curves with EERs of 9.39% (GMM-UBM), 9.05% (GMM-SVM), and 8.16% (GMM-SVM with UP-AVR).

Experiments and Results

Performance on NIST04:

[Figure] DET curves: GMM-UBM (EER = 16.05%), GMM-SVM (EER = 10.42%), GMM-SVM with UP-AVR (EER = 9.46%).

References
1. S.X. Zhang and M.W. Mak, "Optimized Discriminative Kernel for SVM Scoring and its Application to Speaker Verification", IEEE Trans. on Neural Networks, to appear.
2. M.W. Mak and W. Rao, "Utterance Partitioning with Acoustic Vector Resampling for GMM-SVM Speaker Verification", Speech Communication, vol. 53, no. 1, Jan. 2011, pp. 119-130.
3. M.W. Mak and W. Rao, "Acoustic Vector Resampling for GMMSVM-Based Speaker Verification", Interspeech 2010, Sept. 2010, Makuhari, Japan, pp. 1449-1452.
4. S.Y. Kung, M.W. Mak, and S.H. Lin, Biometric Authentication: A Machine Learning Approach, Prentice Hall, 2005.
5. W.M. Campbell, D.E. Sturim, and D.A. Reynolds, "Support vector machines using GMM supervectors for speaker verification", IEEE Signal Processing Letters, vol. 13, pp. 308-311, 2006.
6. D.A. Reynolds, T.F. Quatieri, and R.B. Dunn, "Speaker verification using adapted Gaussian mixture models", Digital Signal Processing, vol. 10, pp. 19-41, 2000.
