for GMM-SVM-Based
Speaker Verification
Man-Wai MAK and Wei RAO
The Hong Kong Polytechnic University
enmwmak@polyu.edu.hk
http://www.eie.polyu.edu.hk/~mwmak/
Outline
Speaker Verification
"I am Mary." Is this Mary's voice?
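Verification Process
[Figure: block diagram. The claimant says "I'm John". After feature extraction, the speech is scored against John's model (built from John's voiceprint) and against an impostor model (built from impostors' voiceprints). The scores go through score normalization and decision making, where they are compared with a decision threshold to give accept/reject.]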
Acoustic Features
[Figure: MFCC extraction. The log-magnitude spectrum log|X(ω)| is passed through a DCT to give the MFCCs.]
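Speaker model (GMM):
$$p(x \mid \Lambda^{(s)}) = \sum_{j=1}^{M} \lambda_j^{(s)}\, p(x \mid \mu_j^{(s)}, \Sigma_j^{(s)})$$
[Figure: the universal background model Λ^(ubm) is adapted by MAP to give the client speaker model Λ^(s).]
MAP adaptation of the means:
$$\mu_j^{(s)} = \alpha_j E_j(X^{(s)}) + (1 - \alpha_j)\,\mu_j^{(ubm)}$$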
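The mean update above can be illustrated with a minimal Python sketch. It assumes diagonal covariances and takes alpha_j = n_j / (n_j + relevance), a commonly used choice that the slides do not state; the function names and the relevance value of 16 are illustrative only.

```python
import math

def gaussian_diag(x, mean, var):
    """Density of a diagonal-covariance Gaussian evaluated at frame x."""
    log_p = sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
                for xi, m, v in zip(x, mean, var))
    return math.exp(log_p)

def map_adapt_means(frames, weights, means, variances, relevance=16.0):
    """Return alpha_j * E_j(X) + (1 - alpha_j) * mu_j(ubm) for every mixture j.

    alpha_j = n_j / (n_j + relevance) is an assumption, not taken from the slides.
    """
    num_mix, dim = len(means), len(means[0])
    counts = [0.0] * num_mix                       # soft counts n_j
    sums = [[0.0] * dim for _ in range(num_mix)]   # posterior-weighted frame sums
    for x in frames:
        lik = [w * gaussian_diag(x, m, v)
               for w, m, v in zip(weights, means, variances)]
        total = sum(lik)
        for j in range(num_mix):
            post = lik[j] / total                  # Pr(mixture j | x)
            counts[j] += post
            for d in range(dim):
                sums[j][d] += post * x[d]
    adapted = []
    for j in range(num_mix):
        alpha = counts[j] / (counts[j] + relevance)
        # E_j(X): posterior-weighted mean; fall back to the UBM mean if mixture j is unused.
        e_j = [s / counts[j] for s in sums[j]] if counts[j] > 0 else means[j]
        adapted.append([alpha * e + (1 - alpha) * m
                        for e, m in zip(e_j, means[j])])
    return adapted
```

With few enrollment frames, alpha_j stays small and the adapted means remain close to the UBM means, which is the behaviour the interpolation formula describes.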
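GMM-UBM Scoring
$$\text{Score} = \log p(X^{(c)} \mid \Lambda^{(s)}) - \log p(X^{(c)} \mid \Lambda^{(ubm)})$$
[Figure: the claimant's features X^(c) from feature extraction are scored against the speaker model Λ^(s) and the background model Λ^(ubm). The resulting score goes to the decision stage: accept if it exceeds the decision threshold, reject otherwise.]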
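The log-likelihood-ratio score can be sketched as follows. This is a minimal illustration assuming diagonal-covariance GMMs given as (weights, means, variances) tuples; averaging over frames and the default threshold of 0 are assumptions, not taken from the slides.

```python
import math

def log_gmm(x, weights, means, variances):
    """log p(x | GMM) for one frame, using log-sum-exp for numerical stability."""
    terms = []
    for w, mean, var in zip(weights, means, variances):
        log_n = sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
                    for xi, m, v in zip(x, mean, var))
        terms.append(math.log(w) + log_n)
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))

def gmm_ubm_score(frames, speaker, ubm):
    """Per-frame average of log p(x | speaker) - log p(x | UBM) (averaging is assumed)."""
    llr = sum(log_gmm(x, *speaker) - log_gmm(x, *ubm) for x in frames)
    return llr / len(frames)

def decide(score, threshold=0.0):
    """Accept the claim when the score exceeds the (hypothetical) threshold."""
    return "accept" if score > threshold else "reject"
```

For example, `decide(gmm_ubm_score(frames, speaker, ubm))` gives the accept/reject decision for one test utterance.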
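[Figure: GMM supervector. An utterance utt^(s) goes through feature extraction to give X^(s). MAP adaptation of the UBM gives the speaker model Λ^(s), and mean stacking places its M mean vectors μ_1^(s), μ_2^(s), ..., μ_M^(s) into a single MD × 1 GMM supervector. The whole chain is a mapping from utt^(s) to the supervector.]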
GMM-SVM Scoring
[Figure: SVM scoring. The test utterance utt^(c) of claimant c goes through feature extraction and UBM adaptation to give the GMM-supervector of the claimant. The target speaker's utterance utt^(s) and the background speakers' utterances utt^(b_1), ..., utt^(b_B) are mapped to GMM-supervectors in the same way. The kernel values K(X^(c), X^(s)) and K(X^(c), X^(b_1)), ..., K(X^(c), X^(b_B)) are weighted by the SVM coefficients α_0^(s), α_1^(s), ..., α_B^(s) and combined with the bias d^(s) to give the score.]
$$S_{\text{GMM-SVM}}(X^{(c)}) = \alpha_0^{(s)} K(X^{(c)}, X^{(s)}) - \sum_{i \in \text{SV from bkg}} \alpha_i^{(s)} K(X^{(c)}, X^{(b_i)}) + d^{(s)}$$
$$K(X^{(c)}, X^{(s)}) = \sum_{j=1}^{M} \left(\sqrt{\lambda_j}\,\Sigma_j^{-1/2}\mu_j^{(c)}\right)^{T} \left(\sqrt{\lambda_j}\,\Sigma_j^{-1/2}\mu_j^{(s)}\right)$$
The first factor is the normalized GMM-supervector of the claimant's utterance; the second is the normalized GMM-supervector of the target speaker's utterance.
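The kernel and the SVM score above can be sketched in a few lines of Python. This is an illustration only: it assumes diagonal covariances, and the function and argument names are hypothetical rather than part of any toolkit.

```python
import math

def normalized_supervector(weights, variances, means):
    """Stack sqrt(lambda_j) * Sigma_j^(-1/2) * mu_j over all mixtures (diagonal Sigma_j)."""
    sv = []
    for w, var, mean in zip(weights, variances, means):
        sv.extend(math.sqrt(w) * m / math.sqrt(v) for m, v in zip(mean, var))
    return sv

def kernel(a, b):
    """Linear kernel between two normalized supervectors."""
    return sum(x * y for x, y in zip(a, b))

def gmm_svm_score(claimant, target, alpha0, bkg_svs, bkg_alphas, bias):
    """alpha_0 * K(c, s) - sum_i alpha_i * K(c, b_i) + d over the background support vectors."""
    score = alpha0 * kernel(claimant, target)
    score -= sum(a * kernel(claimant, b) for a, b in zip(bkg_alphas, bkg_svs))
    return score + bias
```

Because the normalization is folded into the supervectors, the kernel reduces to a plain dot product, so scoring costs one inner product per support vector.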
For each target speaker, we have only one utterance (one GMM-supervector) from the target speaker but many utterances from the background speakers.
So, SVM training is a highly imbalanced learning problem.
[Figure: linear SVM (C = 10.0, #SV = 3, slope = -1.00) in the (x1, x2) plane with speaker-class and impostor-class points. Only one training vector comes from the target speaker.]
[Figure: speaker-class and impostor-class points in the (x1, x2) plane. The orientation of the decision boundary depends mainly on the impostor-class data.]
Utterance Partitioning
[Figure: the target speaker's enrollment utterance utt^(s) and the background speakers' utterances utt^(b_1), utt^(b_2), ..., utt^(b_B) each go through feature extraction. Each utterance yields the feature sequence X_0 together with four sub-utterances X_1, ..., X_4, giving X_0^(s), ..., X_4^(s) for the target speaker and X_0^(b_i), ..., X_4^(b_i) for each background speaker. MAP adaptation of the UBM and mean stacking turn these into supervectors m_0^(s), ..., m_4^(s) and m_0^(b_1), ..., m_4^(b_B), which are used for SVM training.]
Length-Representation Trade-off
When the number of partitions increases, the length of each sub-utterance decreases.
If a sub-utterance is too short, its supervector will be almost the same as that of the UBM.
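The partitioning and its trade-off can be made concrete with a short Python sketch. It assumes that "resampling" means randomly reordering the frame indices before each round of partitioning; the slides name the resampling step without spelling it out, so treat that step, the function name, and the parameters as assumptions.

```python
import random

def up_avr(frames, num_partitions=4, num_resamplings=1, seed=0):
    """Return the full frame sequence plus N sub-utterances for each of R reshuffles."""
    rng = random.Random(seed)
    segments = [list(frames)]                  # X_0: the full-length utterance
    size = len(frames) // num_partitions       # frames per sub-utterance (remainder dropped)
    for _ in range(num_resamplings):
        order = list(range(len(frames)))
        rng.shuffle(order)                     # assumed resampling step: reorder the frames
        for k in range(num_partitions):
            idx = order[k * size:(k + 1) * size]
            segments.append([frames[i] for i in idx])
    return segments                            # R * N + 1 feature sequences
```

Each returned sequence would then go through MAP adaptation and mean stacking to give one supervector. Raising `num_partitions` yields more supervectors per utterance but shortens every sub-utterance, which is the trade-off described above.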
[Figure: linear SVM (C = 10.0, #SV = 3, slope = -1.44) in the (x1, x2) plane with speaker-class and impostor-class points. One point is marked as the supervector corresponding to the UBM.]
[Figure: utterance partitioning of the target speaker's utterance utt^(s) and the background speakers' utterances utt^(b_1), utt^(b_2), ..., utt^(b_B) into X_0, ..., X_4, followed by MAP adaptation of the UBM and mean stacking to give supervectors m_0^(s), ..., m_4^(s) and m_0^(b_1), ..., m_4^(b_B) for SVM training.]
[Figure: supervector scatter plot. Legend: impostor class, speaker class, sub-utterance supervectors, full-utterance supervector.]
$$m^{(s,h)} = \Omega^{-1/2}\mu^{(s,h)}$$
where s stands for speaker and h stands for session, and Ω = diag(λ_1^{-1}Σ_1, ..., λ_M^{-1}Σ_M), so that m^(s,h) stacks the normalized means √λ_j Σ_j^{-1/2} μ_j^(s,h).
Remove the session-dependent part (h) by removing the sub-space that causes the session variability:
$$m^{(s)} = P m^{(s,h)} = (I - VV^{T})\, m^{(s,h)}$$
The new kernel becomes
$$K(X^{(c)}, X^{(s)}) = \left(P m^{(c,h)}\right)^{T} \left(P m^{(s,h)}\right) = m^{(c)T} m^{(s)}$$
[Figure: m^(s,h) is split into VV^T m^(s,h), its component in the sub-space representing session variability (defined by V), and the remainder m^(s) = P m^(s,h).]
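The projection and the resulting kernel can be sketched in Python as follows. The sketch assumes that the columns of V are orthonormal and are supplied as a list of vectors; how V is estimated is not covered here, and the function names are hypothetical.

```python
def nap_project(m, basis):
    """Return (I - V V^T) m, where the columns of V are the vectors in `basis`.

    Assumes the basis vectors are orthonormal.
    """
    projected = list(m)
    for v in basis:
        coeff = sum(vi * mi for vi, mi in zip(v, m))            # v^T m
        projected = [p - coeff * vi for p, vi in zip(projected, v)]
    return projected

def nap_kernel(m_c, m_s, basis):
    """Kernel (P m_c)^T (P m_s) between two session-dependent supervectors."""
    p_c = nap_project(m_c, basis)
    p_s = nap_project(m_s, basis)
    return sum(a * b for a, b in zip(p_c, p_s))
```

Subtracting the components one basis vector at a time avoids forming the full MD × MD matrix VV^T, which matters because supervectors are very high-dimensional.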
Enrollment Process of GMM-SVM with UP-AVR
[Figure: enrollment flow. MFCCs of an utterance from target-speaker s, X^(s,h) → resampling/partitioning → X_i^(s,h) → MAP adaptation and mean stacking (using the UBM) → session-dependent supervectors m_i^(s,h) → NAP → session-independent supervectors m_i^(s) → SVM training, together with the background speakers' supervectors m_i^(b_j) → SVM of target-speaker s.]
Verification Process of GMM-SVM with UP-AVR
[Figure: verification flow. MFCCs of a test utterance from claimant c, X^(c) → MAP adaptation and mean stacking (using the UBM) → session-dependent supervector m^(c,h) → NAP → session-independent supervector m^(c) → SVM scoring with the SVM of target-speaker s → score S(X^(c)) → T-norm (using the T-norm models) → normalized score S̃(X^(c)).]
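[Figure: T-norm. The supervector m^(c) from the test utterance is scored by R T-norm SVMs (T-Norm SVM 1, ..., T-Norm SVM R). The mean μ(X^(c)) and standard deviation σ(X^(c)) of these scores are computed and passed, with the target-speaker score S(X^(c)), to a block labelled Z-norm that outputs the normalized score.]
$$\tilde{S}(X^{(c)}) = \frac{S(X^{(c)}) - \mu(X^{(c)})}{\sigma(X^{(c)})}$$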
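The normalization can be written as a very small Python sketch. It assumes the scores of the test utterance against the R T-norm SVMs have already been computed; using the population standard deviation is a choice made here, not something the slides specify.

```python
import statistics

def t_norm(raw_score, tnorm_scores):
    """Normalize a target-speaker score by the mean and std of the T-norm SVM scores."""
    mu = statistics.mean(tnorm_scores)
    sigma = statistics.pstdev(tnorm_scores)  # population std; statistics.stdev is the sample alternative
    return (raw_score - mu) / sigma
```

Because μ and σ are computed from the same test utterance, the normalized score is less sensitive to utterance-specific shifts in the raw SVM scores, which makes a single decision threshold easier to share across test conditions.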
Experiments
Speech Data
Experiments
Features and Models
Results
No. of mixtures in GMM-SVM (NIST02)
[Figure annotations: "Normalized"; "Threshold below which the variances of features are deemed too small"; "Large number of features with small variance".]
Results
Effects of NAP on Different NIST SRE
Results
Effect of NAP Corank on Performance
[Figure: includes a "No NAP" case.]
Results
Comparing discriminative power of GMM-SVM and GMM-SVM with UP-AVR
Results
EER and MinDCF vs. No. of Target-Speaker Supervectors (NIST02)
Results
Varying the number of resamplings (R) and number of partitions (N) (NIST02)
Results
NIST02
Performance on NIST02
[Figure: EER values shown: 9.39%, 9.05%, 8.16%.]
Performance on NIST04
[Figure: systems compared: GMM-UBM, GMM-SVM, GMM-SVM w/ UP-AVR. EER values shown: 16.05%, 9.46%, 10.42%.]