Citation / Cómo citar este artículo: Gonzalez-Rodriguez, J. (2014). Evaluating Automatic Speaker Recognition systems: An overview of the NIST Speaker Recognition Evaluations (1996-2014). Loquens, 1(1), e007. doi: http://dx.doi.org/10.3989/loquens.2014.007
ABSTRACT: Automatic Speaker Recognition systems show interesting properties, such as speed of processing or repeatability of results, in contrast to speaker recognition by humans. But they will be usable only if they are reliable. Testability, or the ability to extensively evaluate the goodness of the speaker detector decisions, then becomes critical. In the last 20 years, the US National Institute of Standards and Technology (NIST) has organized, providing the proper speech data and evaluation protocols, a series of text-independent Speaker Recognition Evaluations (SRE). Those evaluations have become not just a periodical benchmark test, but also a meeting point of a collaborative community of scientists that have been deeply involved in the cycle of evaluations, allowing tremendous progress in an especially complex task where the speaker information is spread across different information levels (acoustic, prosodic, linguistic) and is strongly affected by speaker intrinsic and extrinsic variability factors. In this paper, we outline how the evaluations progressively challenged the technology, including new speaking conditions and sources of variability, and how the scientific community responded to those demands. Finally, NIST SREs will be shown to be not free of drawbacks, and future challenges to speaker recognition assessment will also be discussed.
KEYWORDS: automatic speaker recognition; discrimination and calibration; assessment; benchmark
RESUMEN: Evaluating automatic speaker recognition systems: An overview of the NIST Speaker Recognition Evaluations (1996-2014).- Automatic speaker recognition systems are critical for the organization, labeling, management and decision making over large databases of voices from different speakers. In order to process such amounts of speech information efficiently, we need systems that are very fast and, since they are not error free, sufficiently reliable. Current systems are orders of magnitude faster than real time, allowing instantaneous automatic decisions over huge amounts of conversations. But perhaps the most interesting feature of an automatic system is the possibility of analyzing it in detail, since its performance and reliability can be blindly evaluated on huge amounts of data under a great diversity of conditions. In the last 20 years, the US National Institute of Standards and Technology (NIST) has organized, providing the appropriate speech data and evaluation protocols, a series of text-independent speaker recognition evaluations. Those evaluations have become not only a periodical benchmark test, but also a meeting point of a collaborative community of scientists who have been deeply involved in the cycle of evaluations, which has allowed enormous progress in an especially complex task, in which the individualizing speaker information is spread across different information levels (acoustic, prosodic, linguistic...) and is strongly affected by variability factors both intrinsic and extrinsic to the speaker. This paper describes how the evaluations progressively challenged the existing technology, including new speaking conditions and sources of variability, and how the scientific community responded to those challenges. However, these NIST evaluations are not free of drawbacks, so future challenges for the assessment of speaker recognition technologies will also be discussed.
PALABRAS CLAVE: automatic speaker recognition; calibration; validation; benchmark
Copyright: © 2014 CSIC. This is an open-access article distributed under the terms of the Creative Commons Attribution-Non Commercial (by-nc) Spain 3.0 License.
1. INTRODUCTION
The massive presence and exponential growth of multimedia data, with audio sources varying from call centers,
mobile phones and broadcast data (radio, TV, podcasts)
to individuals producing voice or video messages with
speakers and conversations of all types in an inconceivable
range of situations and speaking conditions, makes speaker
recognition applications critical for the organization, labeling, management and decision making over this big audio
data. In order to efficiently process such amounts of speech
information, we need automatic speaker recognition systems that are extremely fast and, though not error free, reliable enough. Current systems are orders of magnitude faster than
real time, allowing instantaneous automatic decisions (or
informed opinions) on huge amounts of conversations.
Moreover, being built with well-known signal processing
and pattern recognition algorithms, they are transparent,
avoiding subjective components in the decision process
(in contrast to human speaker recognition) and allowing
public scrutiny of every module of the system, and testable,
as their performance and claimed reliability can be blindly
evaluated on massive amounts of known data where ground
truth (the speaker identity) is available in a great diversity
of conditions (Gonzalez-Rodriguez, Rose, Ramos,
Toledano, & Ortega-Garcia, 2007).
However, evaluating a speaker recognition system
is not an easy task. The speaker individualizing information is spread across different information levels, and
each of them is affected in different ways by speaker
intrinsic and extrinsic sources of variability, such as the
elapsed time between recordings under comparison,
differences in acquisition and transmission devices
(microphones, telephones, GSM/IP coding...), noise and
reverberation, emotional state of the speaker, type of
conversation, permanent and transient health conditions,
etc. When two given recordings are to be compared, all
those factors are empirically combined in a specific and difficult-to-emulate way, so controlling and disentangling
them will be critical for the proper evaluation of systems.
Fortunately, the US NIST (National Institute of Standards and Technology) series of Speaker Recognition Evaluations (SRE) has provided for almost two decades a challenging and collaborative environment that has allowed text-independent speaker recognition to make remarkable progress. For every new evaluation, and according to the results and expectations from the previous one, NIST balanced the suggestions from participants and the priorities of its sponsors to compose a new evaluation, where new speech corpora were designed and collected under new conditions, development
a new evaluation, where new speech corpora were designed and collected under new conditions, development
and test data was prepared and distributed, and final
submissions from participants were analyzed, to be finally compared and discussed in a public workshop open
to participants in the evaluation. This cycle of innovation adds tremendous value to participants, whose technology is challenged in every new evaluation, demanding enormous progress in order to cope with new demands, with final solutions publicly discussed and scrutinized.
NIST has provided participants with errorful (15-30% word error rate) word transcriptions of the conversations, but discontinued this policy for the 2012 evaluation.
3. ASSESSMENT OF SPEAKER RECOGNITION
SYSTEMS
One of the main advantages of automatic systems is
that they can be extensively and repeatedly tested to
assess their performance in a variety of evaluation conditions, allowing objective comparison of systems in
exactly the same task, or observing the performance
degradation of a given system in progressively challenging conditions. In order to assess a system, we need a
database of voice recordings from known speakers,
where the ground truth about the identity of the speaker
in every utterance is known in order to compare it with
the blind decisions of the system.
At this point, we need some definitions. A test trial consists of determining whether a given speaker (typically the speaker in a control recording) is actually speaking in the test recording. We will talk about target trials
when the target (known) speaker is actually speaking in
the test recording, and non-target (or impostor) trials in
the opposite condition (the speakers in the train and test
recordings are different). Automatic systems, given a test trial with unknown solution, provide a score; the higher the score, the greater the confidence that the two recordings come from the same speaker. In an ideal system, the distribution of target-trial scores should be clearly separated from, and valued higher than, that of non-target trials, allowing perfect discrimination by setting a threshold between the two distributions of scores. However, as shown in Figure 1, target and non-target score distributions usually overlap partially.
A detection threshold is needed if we want the system to provide hard acceptance (score higher than the threshold) or rejection (score lower than the threshold) decisions for every trial. Two types of errors can be committed by the system: false alarm (or false acceptance) errors, if the score is higher than the threshold in a non-target trial, and miss detections (also called false rejections), if the score is lower than the threshold in a target trial. Then, as the score distributions overlap, whatever detection threshold we set, a percentage of false acceptances and missed detections will occur: the higher the threshold, the lower the false alarms and the higher the miss detections. This means that, for any given system, many operating points are possible, each with a different trade-off between the two types of error.
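As a minimal illustration of this trade-off (our own sketch, not part of any NIST tooling; the function name and NumPy-based interface are assumptions), the following code computes the miss and false-alarm rates obtained when a hard decision threshold is applied to arrays of target and non-target scores:

```python
import numpy as np

def miss_fa_rates(target_scores, nontarget_scores, threshold):
    """Error rates for hard accept/reject decisions at a given threshold.

    target_scores    -- scores of target (same-speaker) trials
    nontarget_scores -- scores of non-target (impostor) trials
    """
    target_scores = np.asarray(target_scores, dtype=float)
    nontarget_scores = np.asarray(nontarget_scores, dtype=float)
    # Miss (false rejection): a target trial scored below the threshold.
    p_miss = float(np.mean(target_scores < threshold))
    # False alarm (false acceptance): a non-target trial scored at or above it.
    p_fa = float(np.mean(nontarget_scores >= threshold))
    return p_miss, p_fa
```

Sweeping the threshold from low to high values moves the operating point from many false alarms and few misses to the opposite extreme.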
3.1. ROC and DET curves
In Figure 2, both false acceptance and miss detection
errors are shown as a function of a sweeping threshold.
The error rate obtained where both false acceptance and
miss detection curves cross each other is known as the
Equal Error Rate (EER), and it is commonly used as a
single number characterizing system performance: the
lower the EER, the better the separation between the target and non-target score distributions.
Figure 2: Percentage (%) of false alarms and miss detections as a function of a sweeping threshold (x-axis), and Equal Error Rate (EER).
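A rough way to locate the EER, reusing the miss_fa_rates sketch above (again an illustrative approximation, not NIST's scoring software), is to sweep candidate thresholds over the observed scores and keep the point where the two error rates are closest:

```python
import numpy as np

def equal_error_rate(target_scores, nontarget_scores):
    """Approximate EER: the error rate at the threshold where miss and
    false-alarm rates are (nearly) equal."""
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        p_miss, p_fa = miss_fa_rates(target_scores, nontarget_scores, t)
        if abs(p_miss - p_fa) < best_gap:
            best_gap, eer = abs(p_miss - p_fa), (p_miss + p_fa) / 2.0
    return eer
```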
However, when different systems (or the same system in different evaluation conditions) with low EER
are represented, the graphical resolution is very low as
all the useful information is concentrated in the lower
corner close to the origin of coordinates. The Detection Error Trade-off (DET) plot is suggested to overcome this problem, modifying the axes so that Gaussian distributions of scores result in straight lines in the DET plot, allowing for easy comparison and detailed analysis of the different possible operating points of a system, as shown in Figure 4.
Figure 4: Sample DET curves from the development phase of ATVS-UAM systems for SRE 2008. The telephone-based SRE 2006 GMM-UBM system is progressively adapted to the cross-channel mic and tel condition in 2008.
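A minimal sketch of how such a DET plot can be produced (assuming the miss_fa_rates helper above, plus SciPy and Matplotlib; this illustrates the axis warping, not the official NIST plotting code): error probabilities are mapped through the inverse Gaussian CDF (probit), so Gaussian score distributions trace straight lines.

```python
import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt

def det_points(target_scores, nontarget_scores, num_points=200):
    """(false-alarm, miss) probability pairs over a sweep of thresholds."""
    lo = min(np.min(target_scores), np.min(nontarget_scores))
    hi = max(np.max(target_scores), np.max(nontarget_scores))
    thresholds = np.linspace(lo, hi, num_points)
    pairs = [miss_fa_rates(target_scores, nontarget_scores, t) for t in thresholds]
    p_miss, p_fa = np.array(pairs).T
    return p_fa, p_miss

def plot_det(p_fa, p_miss):
    """Plot on normal-deviate (probit) axes, as in a DET curve."""
    eps = 1e-6  # avoid infinities at exactly 0% or 100%
    plt.plot(norm.ppf(np.clip(p_fa, eps, 1 - eps)),
             norm.ppf(np.clip(p_miss, eps, 1 - eps)))
    plt.xlabel("False alarm probability (normal deviate scale)")
    plt.ylabel("Miss probability (normal deviate scale)")
    plt.show()
```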
if a given speaker is actually speaking in a given conversation side, has always been the main subject of evaluation. For every trial, participants are required to submit
both a score indicating the confidence that the speakers are the same, which allows NIST to compute DET plots, and an actual decision (true/false), allowing CDET and minCDET values to be computed per participant. Male and female data are included, but no cross-sex trials are performed, to avoid over-optimistic error rates.
Figure 5: Faunagram, or speaker-dependent performance plot,
showing non-target (red) and target (blue) scores per speaker.
Models are vertically sorted as a function of the speaker-dependent EER threshold.
Conversations were extracted from the English-language Switchboard corpus, a spontaneous conversational speech corpus consisting of thousands of telephone conversations from hundreds of speakers, each conversation typically running about 5 minutes. In 2000 and 2001, the Spanish-language, spontaneous but non-conversational Ahumada database (Ortega-Garcia, Gonzalez-Rodriguez, & Marrero-Aguiar, 2000) was explored with similar same-language (in train and test) protocols.
Different releases of Switchboard were produced
through the years (Cieri, Miller, & Walker, 2003), including extensive landline (local- and long-distance)
and cellular (GSM and CDMA) phone data. Different
telephone numbers and types of telephone handsets were
used by the speakers and explored in the evaluations,
showing the degradation in performance in mismatched conditions if not properly addressed. Systems were
tested with one- or two-session training, significantly
reducing the sensitivity to session mismatch by including
explicit session variability in the model with the use of
multiple session training. The effect of the length of the
test was also explored, showing very limited benefit
from same-session audio segments longer than 30-60
seconds, but strong degradation for shorter durations.
However, with just 3 seconds of speech, systems were
still providing very useful information (e.g., degradations
Ear-bud/lapel mike
Mini-boom mike
Courtroom mike
Conference room mike
Distant mike
Near-field mike
PC stand mike
Micro-cassette mike
interview scenario, resulting in the Mixer 5 Interview speech database. It is also conversational
speech but of a totally different nature, and it is
recorded via multiple simultaneous microphones
(interview-mic). In the required condition (known
as short2-short3, similar to a multichannel 1side-1side), 1,788 phone-call and 1,475 interview speaker models were involved, for a total of almost 100,000 trials (60,000 male and 40,000 female).
SRE 2010: four significant changes were introduced in 2010. First, all speech in the evaluation was English (both native and non-native). Second, some of the conversational telephone speech data had been collected in a manner intended to produce particularly high or particularly low vocal effort. Third, the interview segments in the required
condition were of varying duration, ranging from
three to fifteen minutes. And finally, the most radical change, a new performance measure
(new_CDET), intended for systems to obtain good
calibration at extremely low false alarm rates, was
defined from a new set of parameter values. While CDET had always used Cmiss = 10, CFA = 1 and Ptarget = 0.01, the new_CDET parameters are Cmiss = 1, CFA = 1 and Ptarget = 0.001, increasing by a factor of 100 the relative weight of false alarms with respect to missed detections (see the cost-function sketch after this list). In
order to have statistically significant error rates for
very low false alarm rates, the required (core)
condition included close to 6,000 speaker models,
25,000 test speech segments and 750,000 trials.
SRE 2012: in NIST's own wording, SRE12 task conditions "represent a significant departure from previous NIST SREs" (NIST, 2012, p. 1). In all
previous evaluations, participants were told what
speech segments should be used to build a given
speaker model. However, in 2012, all the speech
from previous evaluations with known identities
was made available to build the speaker models,
resulting in the availability of a large (and variable) number of segments from the phone-tel, phone-mic and int-mic conditions, with systems free to build their models using any combination of previous files from every speaker. Moreover, additive
and environmental noises were included, variable
test durations (300, 100 and 30 seconds) considered, knowledge of all targets was allowed in
computing each trial detection score, tests were
performed with both known and unknown impostors, and for the first time in a long time ASR word
transcripts were not provided. Finally, the cost
measure was changed again, averaging the 2010
operating point (optimized for very low false alarm
rates) with a new one with a greater target prior
(pushing the optimal threshold back, closer to its
classical position), intending for systems to show
greater stability of the cost measure and good score
calibration over a wider range of log-likelihoods.
For this evaluation, close to 2,250 target speakers
and 100,000 test segments were involved, for a total
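The detection cost behind these parameter choices has the well-known NIST form CDET = Cmiss x Pmiss x Ptarget + CFA x PFA x (1 - Ptarget). The sketch below is our own illustration (the Pmiss and PFA values are arbitrary, and the averaged 2012-style cost is only a sketch of the idea, not the exact evaluation plan); it contrasts the classical parameters, the 2010 new_CDET parameters, and a two-operating-point average in the spirit of 2012:

```python
def detection_cost(p_miss, p_fa, c_miss, c_fa, p_target):
    """NIST-style detection cost:
    CDET = Cmiss * Pmiss * Ptarget + CFA * PFA * (1 - Ptarget)."""
    return c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)

# Arbitrary example error rates for a fixed decision threshold.
p_miss, p_fa = 0.02, 0.01

# Classical parameters (used up to SRE 2008) vs. the SRE 2010 new_CDET ones.
cdet_old = detection_cost(p_miss, p_fa, c_miss=10, c_fa=1, p_target=0.01)
cdet_new = detection_cost(p_miss, p_fa, c_miss=1, c_fa=1, p_target=0.001)

# SRE 2012 averaged the cost at two operating points (a lower and a higher
# target prior) to reward good calibration over a wider range of thresholds.
cdet_2012_like = 0.5 * (detection_cost(p_miss, p_fa, 1, 1, 0.001) +
                        detection_cost(p_miss, p_fa, 1, 1, 0.01))
```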
two different variability subspaces (speaker and channel), in (Dehak, Kenny, Dehak, Dumouchel, & Ouellet, 2011) a single subspace is considered, called the Total (T) Variability subspace, which contains both speaker information and channel variability. In this case, the low-rank T matrix is obtained in a similar way to the eigenspeaker matrix, but all the utterances from each speaker are included (as if each of them came from a different speaker). Once the T matrix is available, a UBM is used to collect Baum-Welch first-order statistics from the utterance. The high-dimensional supervector (with dimension ranging from 20,000 to 150,000), which is built by stacking together those first-order statistics for each mixture component, is then projected into a low-dimensional fixed-length representation, known as the i-vector, in the subspace defined by T, estimated as a MAP point-estimate of a posterior distribution.
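A minimal sketch of that point estimate (our own illustration, assuming a diagonal-covariance UBM and per-utterance Baum-Welch statistics already accumulated; the variable names N, F, T, Sigma and m are ours, and T-matrix training is not shown): under the model supervector = m + T w, the i-vector is the posterior mean of w.

```python
import numpy as np

def extract_ivector(N, F, T, Sigma, m):
    """MAP point estimate of the i-vector for one utterance.

    N     -- (C,)     zero-order (occupancy) statistics per UBM mixture
    F     -- (C, D)   first-order statistics per mixture (sums of frames)
    T     -- (C*D, R) low-rank total variability matrix
    Sigma -- (C*D,)   diagonal UBM covariance supervector
    m     -- (C*D,)   UBM mean supervector
    Returns the R-dimensional i-vector (posterior mean of w).
    """
    C, D = F.shape
    # Center the first-order statistics around the UBM means.
    F_centered = F.reshape(-1) - np.repeat(N, D) * m
    T_sig = T / Sigma[:, None]                      # Sigma^-1 T
    # Posterior precision: I + T' Sigma^-1 N T.
    precision = np.eye(T.shape[1]) + T_sig.T @ (np.repeat(N, D)[:, None] * T)
    return np.linalg.solve(precision, T_sig.T @ F_centered)
```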
As both target (train) and test utterances are now represented by fixed-length i-vectors (typically 200 to 600 dimensions), the i-vector extractor can be seen as a new global feature extractor, which captures all relevant speaker and channel variability in a given utterance into a low-dimensional vector. Target and test i-vectors can be directly compared through cosine scoring, a measure of the angle between i-vectors, which produces recognition results similar to those obtained with JFA. But as they still contain
both speaker and channel information, factor analysis
can also be applied in the total variability subspace to
better separate low dimensional contributions of channel
and speaker information. Probabilistic Linear Discriminant Analysis (PLDA) (Prince & Elder, 2007) models
the underlying distribution of the speaker and channel
components of the i-vectors in a generative framework
where Factor Analysis is applied to describe the i-vector
generation process. In this PLDA framework, a likelihood ratio score of the same speaker hypothesis versus
the different speaker hypothesis can be efficiently
computed, as a closed form solution exists. Further, in
order to reduce the non-Gaussian behavior of speaker and channel effects in the i-vector representation, i-vector length normalization (Garcia-Romero & Espy-Wilson, 2011) was proposed, allowing the use of probabilistic models with Gaussian assumptions, such as PLDA, instead of more complex heavy-tailed representations.
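A minimal sketch of cosine scoring with length normalization (illustrative only; after normalizing the i-vectors to unit length, the cosine score reduces to a dot product):

```python
import numpy as np

def length_normalize(w):
    """Project an i-vector onto the unit sphere."""
    return w / np.linalg.norm(w)

def cosine_score(w_enroll, w_test):
    """Cosine similarity between enrollment and test i-vectors; the higher
    the score, the more same-speaker-like the trial."""
    return float(np.dot(length_normalize(w_enroll), length_normalize(w_test)))
```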
I-vector extraction, PLDA modeling and scoring, and i-vector length normalization have become, at the time of writing this paper, the state of the art in text-independent speaker recognition, and the basis of the successful submissions to NIST SRE 2012. Multiple generative and discriminative variants, optimizations and combinations of the above ideas exist, based on the same underlying principles. As a result, joint submissions to NIST SREs, fusing multiple systems from multiple sites into a single system, have become usual, and a must if an individual system wants to be in the horse-race photo finish of best submissions (Saeidi et al., 2013).
However, recent success of Deep Neural Networks
in different areas of speech processing (Hinton et al.,
Greenberg, C., Martin, A., Brandschain, L., Campbell, J., Cieri, C., Doddington, G., & Godfrey, J. (2011). Human assisted speaker recognition in NIST SRE10. Paper presented at the 2011 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '11), Prague, Czech Republic.
Hébert, M. (2008). Text-dependent speaker recognition. In J. Benesty, M. Sondhi, & Y. Huang (Eds.), Springer handbook of speech processing (pp. 743-762). Berlin-Heidelberg, Germany: Springer. http://dx.doi.org/10.1007/978-3-540-49127-9_37
Hermansky, H., & Morgan, N. (1994). RASTA processing of speech. IEEE Transactions on Speech and Audio Processing, 2(4), 578-589. http://dx.doi.org/10.1109/89.326616
Hernando, J., & Nadeu, C. (1998). Speaker verification on the polycost database using frequency filtered spectral energies. Proceedings of the 5th International Conference on Spoken Language Processing (ICSLP '98), 129-132.
Hinton, G., Deng, L., Yu, D., Dahl, G. E., Mohamed, A. R., Jaitly, N., ... Kingsbury, B. (2012). Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6), 82-97. http://dx.doi.org/10.1109/MSP.2012.2205597
Kajarekar, S. S., Ferrer, L., Shriberg, E., Sonmez, K., Stolcke, A., Venkataraman, A., & Zheng, J. (2005). SRI's 2004 NIST speaker recognition evaluation system. Proceedings of the 2005 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '05), 1, 173-176.
Kajarekar, S. S., Scheffer, N., Graciarena, M., Shriberg, E., Stolcke, A., Ferrer, L., & Bocklet, T. (2009). The SRI NIST 2008 speaker recognition evaluation system. Proceedings of the 2009 IEEE International Conference on Acoustics, Speech, and Signal Processing, 4205-4208. http://dx.doi.org/10.1109/ICASSP.2009.4960556
Kenny, P. (2005). Joint factor analysis of speaker and session
variability: Theory and algorithms (Technical Report No.
CRIM-06/08-13). Montreal, Canada: CRIM.
Khoury, E., Vesnicer, B., Franco-Pedroso, J., Violato, R., Boulkenafet, Z., Mazaira Fernandez, L. M., ... Marcel, S. (2013, June). The 2013 speaker recognition evaluation in mobile environment. 2013 International Conference on Biometrics (ICB), 1-8. http://dx.doi.org/10.1109/ICB.2013.6613025
Kinnunen, T., & Li, H. (2010). An overview of text-independent speaker recognition: From features to supervectors. Speech Communication, 52(1), 12-40. http://dx.doi.org/10.1016/j.specom.2009.08.009
Kockmann, M., Ferrer, L., Burget, L., Shriberg, E., & Černocký, J. (2011). Recent progress in prosodic speaker verification. Proceedings of the 2011 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '11), 4556-4559.
Larcher, A., Aronowitz, H., Lee, K. A., & Kenny, P. (Organizers)
(2014, September). Text-dependent speaker verification with
short utterances. Special session to be conducted at the 15th
INTERSPEECH Conference 2014, INTERSPEECH 2014
(Singapore). Retrieved from http://www.interspeech2014.org/
public.php?page=special_sessions.html
Larcher, A., Lee, K. A., Ma, B., & Li, H. (2014). Text-dependent speaker verification: Classifiers, databases and RSR2015. Speech Communication, 60, 56-77. http://dx.doi.org/10.1016/j.specom.2014.03.001
Lopez-Moreno, I., Gonzalez-Dominguez, J., Plchot, O., Martínez-González, D., Gonzalez-Rodriguez, J., & Moreno, P. J. (2014, May). Automatic language identification using deep neural networks. To be presented at the 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '14), Florence, Italy.
Martin, A., Doddington, G., Kamm, T., Ordowski, M., & Przybocki, M. (1997). The DET curve in assessment of detection task performance. Proceedings of the 5th European Conference on Speech Communication and Technology, EUROSPEECH 1997, 1895-1898.
National Institute of Standards and Technology (NIST) (2012). The NIST year 2012 speaker recognition evaluation plan, 1-7.
Acoustics, Speech and Signal Processing (ICASSP 14), Florence,
Italy.
Vasilakakis, V., Cumani, S., & Laface, P. (2013, October). Speaker
recognition by means of Deep Belief Networks. Technologies in
Forensic Science, Nijmegen, The Netherlands.