
McClellan, S., Gibson, J.D., Ephraim, Y., Fussell, J.W., Wilcox, L.D., Bush, M.A., Gao, Y., Ramabhadran, B., Picheny, M. "Speech Signal Processing"
The Electrical Engineering Handbook
Ed. Richard C. Dorf
Boca Raton: CRC Press LLC, 2000
© 2000 by CRC Press LLC
15
Speech Signal Processing

15.1 Coding, Transmission, and Storage
    General Approaches • Model Adaptation • Analysis-by-Synthesis •
    Particular Implementations • Speech Quality and Intelligibility •
    Standardization • Variable Rate Coding • Summary and Conclusions
15.2 Speech Enhancement and Noise Reduction
    Models and Performance Measures • Signal Estimation • Source
    Coding • Signal Classification • Comments
15.3 Analysis and Synthesis
    Analysis of Excitation • Fourier Analysis • Linear Predictive
    Analysis • Homomorphic (Cepstral) Analysis • Speech Synthesis
15.4 Speech Recognition
    Speech Recognition System Architecture • Signal Pre-Processing •
    Dynamic Time Warping • Hidden Markov Models • State-of-the-Art
    Recognition Systems
15.5 Large Vocabulary Continuous Speech Recognition
    Overview of a Speech Recognition System • Hidden Markov Models As
    Acoustic Models for Speech Recognition • Speaker Adaptation •
    Modeling Context in Continuous Speech • Language Modeling •
    Hypothesis Search • State-of-the-Art Systems • Challenges in
    Speech Recognition Applications
15.1 Coding, Transmission, and Storage

Stan McClellan and Jerry D. Gibson
Interest in speech coding is motivated by a wide range of applications, including commercial telephony, digital cellular mobile radio, military communications, voice mail, speech storage, and future personal communications networks. The goal of speech coding is to represent speech in digital form with as few bits as possible while maintaining the intelligibility and quality required for the particular application. At higher bit rates, such as 64 and 32 kbits/s, achieving good quality and intelligibility is not too difficult, but as the desired bit rate is lowered to 16 kbits/s and below, the problem becomes increasingly challenging. Depending on the application, many difficult constraints must be considered, including the issue of complexity.
For example, for the 32-kbits/s speech coding standard the ITU-T¹ not only required highly intelligible, high-quality speech, but the coder also had to have low delay, withstand independent bit error rates up to 10⁻², have acceptable performance degradation for several synchronous or asynchronous tandem connections, and pass some voiceband modem signals. Other applications may have different criteria. Digital cellular mobile radio in the U.S. has no low delay or voiceband modem signal requirements, but the speech data rates required are under 8 kbits/s and the transmission medium (or channel) can be very noisy and have relatively long fades. These considerations affect the speech coder chosen for a particular application.
As speech coder data rates drop to 16 kbits/s and below, perceptual criteria taking into account human auditory response begin to play a prominent role. For time domain coders, the perceptual effects are incorporated using a frequency-weighted error criterion. The frequency domain coders include perceptual effects by allocating

¹International Telecommunications Union, Telecommunications Standardization Sector, formerly the CCITT.
Stan McClellan
University of Alabama at Birmingham

Jerry D. Gibson
Texas A&M University

Yariv Ephraim
AT&T Bell Laboratories
George Mason University

Jesse W. Fussell
Department of Defense

Lynn D. Wilcox
FX Palo Alto Lab

Marcia A. Bush
Xerox Palo Alto Research Center

Yuqing Gao
IBM T.J. Watson Research Center

Bhuvana Ramabhadran
IBM T.J. Watson Research Center

Michael Picheny
IBM T.J. Watson Research Center
The focus of this article is the contrast among the three most important classes of speech coders that have representative implementations in several international standards: time-domain coders, frequency-domain coders, and hybrid coders. In the following, we define these classifications, look specifically at the important characteristics of representative, general implementations of each class, and briefly discuss the rapidly changing national and international standardization efforts related to speech coding.
General Approaches

Time Domain Coders and Linear Prediction

Linear Predictive Coding (LPC) is a modeling technique that has seen widespread application among time-domain speech coders, largely because it is computationally simple and applicable to the mechanisms involved in speech production. In LPC, general spectral characteristics are described by a parametric model based on estimates of autocorrelations or autocovariances. The model of choice for speech is the all-pole or autoregressive (AR) model. This model is particularly suited for voiced speech because the vocal tract can be well modeled by an all-pole transfer function. In this case, the estimated LPC model parameters correspond to an AR process which can produce waveforms very similar to the original speech segment. Differential Pulse Code Modulation (DPCM) coders (i.e., ITU-T G.721 ADPCM [CCITT, 1984]) and LPC vocoders (i.e., U.S. Federal Standard 1015 [National Communications System, 1984]) are examples of this class of time-domain predictive architecture. Code Excited Coders (i.e., ITU-T G.728 [Chen, 1990] and U.S. Federal Standard 1016 [National Communications System, 1991]) also utilize LPC spectral modeling techniques.¹
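As a concrete illustration of the AR modeling step, the sketch below estimates LPC coefficients for one frame by the autocorrelation method with the Levinson-Durbin recursion. This is a minimal sketch, not any standard's reference implementation; the function names and the AR(2) test signal are our own.

```python
import numpy as np

def levinson_durbin(r, order):
    """Solve the normal equations from autocorrelations r[0..order].
    Returns the A(z) polynomial [1, a1, ..., aN] and the residual
    energy; the predictor coefficients in the chapter's sense are
    the negated a's."""
    a = [0.0] * (order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i]
        for j in range(1, i):
            acc += a[j] * r[i - j]
        k = -acc / err                 # reflection coefficient
        a_prev = a[:]
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        err *= (1.0 - k * k)
    return a, err

def lpc(frame, order):
    """Autocorrelation-method LPC for one speech frame."""
    frame = np.asarray(frame, dtype=float)
    r = [float(np.dot(frame[:len(frame) - m], frame[m:]))
         for m in range(order + 1)]
    return levinson_durbin(r, order)
```

Driving white noise through an AR(2) filter and re-estimating the model recovers the generating coefficients, which is the sense in which the LPC parameters "correspond to an AR process."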
Based on the general spectral model, a predictive coder formulates an estimate of a future sample of speech based on a weighted combination of the immediately preceding samples. The error in this estimate (the prediction residual) typically comprises a significant portion of the data stream of the encoded speech. The residual contains information that is important in speech perception and cannot be modeled in a straightforward fashion. The most familiar form of predictive coder is the classical Differential Pulse Code Modulation (DPCM) system shown in Fig. 15.1. In DPCM, the predicted value at time instant k, ŝ(k | k − 1), is subtracted from the input signal at time k, s(k), to produce the prediction error signal e(k). The prediction error is then approximated (quantized) and the quantized prediction error, e_q(k), is coded (represented as a binary number) for transmission to the receiver. Simultaneously with the coding, e_q(k) is summed with ŝ(k | k − 1) to yield a reconstructed version of the input sample, ŝ(k). Assuming no channel errors, an identical reconstruction, distorted only by the effects of quantization, is accomplished at the receiver. At both the transmitter and receiver, the predicted value at time instant k + 1 is derived using reconstructed values up through time k, and the procedure is repeated.
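The DPCM loop described above can be sketched as follows; the first-order predictor and uniform quantizer are illustrative choices, and `dpcm_encode_decode` is a hypothetical name. Note that the predictor operates on reconstructed samples, so transmitter and receiver stay in step.

```python
import numpy as np

def dpcm_encode_decode(s, a, step):
    """First-order DPCM sketch with a uniform quantizer.
    a: predictor coefficient; step: quantizer step size.
    Returns the quantized-error sequence and the reconstruction."""
    s_hat = np.zeros(len(s))    # reconstruction (identical at tx and rx)
    e_q = np.zeros(len(s))      # quantized prediction errors
    pred = 0.0                  # s^(k | k-1)
    for k in range(len(s)):
        e = s[k] - pred                      # prediction error e(k)
        e_q[k] = step * np.round(e / step)   # quantize e(k)
        s_hat[k] = pred + e_q[k]             # reconstructed sample s^(k)
        pred = a * s_hat[k]                  # next prediction from the reconstruction
    return e_q, s_hat
```

Because the quantizer rounds e(k) to the nearest multiple of `step`, the reconstruction error never exceeds `step/2`, which is the "distorted only by the effects of quantization" property noted above.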
¹However, codebook excitation is generally described as a hybrid coding technique.

FIGURE 15.1 Differential encoder transmitter with a pole-zero predictor.

The first DPCM systems had B(z) = 0 and A(z) = Σ_{i=1}^{N} a_i z⁻ⁱ, where {a_i, i = 1, …, N} are the LPC coefficients and z⁻¹ represents unit delay, so that the predicted value was a weighted linear combination of previous reconstructed values, or
    ŝ(k | k − 1) = Σ_{i=1}^{N} a_i ŝ(k − i)                                (15.1)

Later work showed that letting B(z) = Σ_{j=1}^{M} b_j z⁻ʲ ≠ 0 improves the perceived quality of the reconstructed speech¹ by shaping the spectrum of the quantization noise to match the speech spectrum, as well as improving noisy-channel performance [Gibson, 1984]. To produce high-quality, highly intelligible speech, it is necessary that the quantizer and predictor parameters be adaptive to compensate for nonstationarities in the speech waveform.
Frequency Domain Coders

Coders that rely on spectral decomposition often use the usual set of sinusoidal basis functions from signal theory to represent the specific short-time spectral content of a segment of speech. In this case, the approximated signal consists of a linear combination of sinusoids with specified amplitudes and arguments (frequency, phase). For compactness, a countable subset of harmonically related sinusoids may be used. The two most prominent types of frequency domain coders are subband coders and multi-band coders.
Subband coders digitally filter the speech into nonoverlapping (as nearly as possible) frequency bands. After filtering, each band is decimated (effectively sampled at a lower rate) and coded separately using PCM, DPCM, or some other method. At the receiver, the bands are decoded, upsampled, and summed to reconstruct the speech. By allocating a different number of bits per sample to the subbands, the perceptually more important frequency bands can be coded with greater accuracy. The design and implementation of subband coders and the speech quality produced have been greatly improved by the development of digital filters called quadrature mirror filters (QMFs) [Johnston, 1980] and polyphase filters. These filters allow subband overlap at the encoder, which causes aliasing, but the reconstruction filters at the receiver can be chosen to eliminate the aliasing if quantization errors are small.
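The bit-allocation idea can be illustrated with a toy two-band coder. For simplicity this sketch splits the bands with an FFT brickwall rather than the QMF/polyphase filter banks used in practice, and the quantizer and band split are our own illustrative choices.

```python
import numpy as np

def quantize(x, bits):
    """Uniform quantizer over the signal's own range (toy model)."""
    if bits == 0:
        return np.zeros_like(x)
    levels = 2 ** bits
    lo, hi = x.min(), x.max()
    step = (hi - lo) / levels or 1.0
    idx = np.clip(np.floor((x - lo) / step), 0, levels - 1)
    return lo + (idx + 0.5) * step

def two_band_code(x, bits_low, bits_high):
    """Toy subband coder: brickwall FFT split into low/high halves,
    quantize each band with its own bit budget, then recombine."""
    X = np.fft.rfft(x)
    cut = len(X) // 2
    low = np.fft.irfft(np.concatenate([X[:cut], np.zeros(len(X) - cut)]), len(x))
    high = x - low                      # complementary band
    return quantize(low, bits_low) + quantize(high, bits_high)
```

For a signal whose energy sits mostly in the low band, spending the bit budget there yields a smaller overall error than the reverse allocation, which is the perceptual-importance argument made above in miniature.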
Multi-band coders perform a similar function by characterizing the contributions of individual sinusoidal components to the short-term speech spectrum. These parameters are then quantized, coded, transmitted, and used to configure a bank of tuned oscillators at the receiver. Outputs of the oscillators are mixed in proportion to the distribution of spectral energy present in the original waveform. An important requirement of multi-band coders is a capability to precisely determine perceptually significant spectral components and track the evolution of their energy and phase. Recent developments related to multi-band coding emphasize the use of harmonically related components with carefully intermixed spectral regions of bandlimited white noise. Sinusoidal Transform Coders (STC) and Multi-Band Excitation coders (MBE) are examples of this type of frequency domain coders.
Model Adaptation

Adaptation algorithms for coder predictor or quantizer parameters can be loosely grouped based on the signals that are used as the basis for adaptation. Generally, forward adaptive coder elements analyze the input speech (or a filtered version of it) to characterize predictor coefficients, spectral components, or quantizer parameters in a blockwise fashion. Backward adaptive coder elements analyze a reconstructed signal, which contains quantization noise, to adjust coder parameters in a sequential fashion. Forward adaptive coder elements can produce a more efficient model of speech signal characteristics, but introduce delay into the coder's operation due to buffering of the signal. Backward adaptive coder elements do not introduce delay, but produce signal models that have lower fidelity with respect to the original speech due to the dependence on the noisy reconstructed signal. Most low-rate coders rely on some form of forward adaptation. This requires moderate to high delay in processing for accuracy of parameter estimation (autocorrelations/autocovariances for LPC-based coders, sinusoidal resolution for frequency-domain coders). The allowance of significant delay for many coder architectures has enabled a spectrally matched pre- or post-processing step to reduce apparent quantization noise and provide significant perceptual improvements. Perceptual enhancements combined with analysis-by-synthesis optimization, and enabled by recent advances in high-power computing architectures such as digital signal processors, have tremendously improved speech coding results at medium and low rates.
¹In this case, the predicted value is ŝ(k | k − 1) = Σ_{i=1}^{N} a_i ŝ(k − i) + Σ_{j=1}^{M} b_j e_q(k − j).
Analysis-by-Synthesis

A significant drawback to traditional "instantaneous" coding approaches such as DPCM lies in the perceptual or subjective relevance of the distortion measure and the signals to which it is applied. Thus, the advent of analysis-by-synthesis coding techniques poses an important milestone in the evolution of medium- to low-rate speech coding. An analysis-by-synthesis coder chooses the coder excitation by minimizing distortion between the original signal and the set of synthetic signals produced by every possible codebook excitation sequence. In contrast, time-domain predictive coders must produce an estimated prediction residual (innovations sequence) to drive the spectral shaping filter(s) of the LPC model, and the classical DPCM approach is to quantize the residual sequence directly using scalar or vector quantizers. The incorporation of frequency-weighted distortion in the optimization of analysis-by-synthesis coders is significant in that it de-emphasizes (increases the tolerance for) quantization noise surrounding spectral peaks. This effect is perceptually transparent since the ear is less sensitive to error around frequencies having higher energy [Atal and Schroeder, 1979]. This approach has resulted in significant improvements in low-rate coder performance, and recent increases in processor speed and power are crucial enabling techniques for these applications. Analysis-by-synthesis coders based on linear prediction are generally described as hybrid coders since they fall between waveform coders and vocoders.
Particular Implementations

Currently, three coder architectures dominate the fields of medium and low-rate speech coding:
• Code-Excited Linear Prediction (CELP): an LPC-based technique which optimizes a vector of excitation samples (and/or pitch filter and lag parameters) using analysis-by-synthesis.
• Multi-Band Excitation (MBE): a direct spectral estimation technique which optimizes the spectral reconstruction error over a set of subbands using analysis-by-synthesis.
• Mixed-Excitation Linear Prediction (MELP): an optimized version of the traditional LPC vocoder which includes an explicit multiband model of the excitation signal.
Several realizations of these approaches have been adopted nationally and internationally as standard speech coding architectures at rates below 16 kbits/s (i.e., G.728, IMBE, U.S. Federal Standard 1016, etc.). The success of these implementations is due to LPC-based analysis-by-synthesis with a perceptual distortion criterion or short-time frequency-domain modeling of a speech waveform or LPC residual. Additionally, the coders that operate at lower rates all benefit from forward adaptation methods which produce efficient, accurate parameter estimates.
CELP

The general CELP architecture is described as a blockwise analysis-by-synthesis selection of an LPC excitation sequence. In low-rate CELP coders, a forward-adaptive linear predictive analysis is performed at 20 to 30 msec intervals. The gross spectral characterization is used to reconstruct, via linear prediction, candidate speech segments derived from a constrained set of plausible filter excitations (the "codebook"). The excitation vector that produces the synthetic speech segment with smallest perceptually weighted distortion (with respect to the original speech) is chosen for transmission. Typically, the excitation vector is optimized more often than the LPC spectral model. The use of vectors rather than scalars for the excitation is significant in bit-rate reduction. The use of perceptual weighting in the CELP reconstruction stage and analysis-by-synthesis optimization of the dominant low-frequency (pitch) component are key concepts in maintaining good quality encoded speech at lower rates. CELP-based speech coders are the predominant coding methodologies for rates between 4 kbits/s and 16 kbits/s due to their excellent subjective performance. Some of the most notable are detailed below.
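The blockwise codebook search can be sketched as follows. For clarity the distortion here is plain squared error; a real CELP coder would first apply the perceptual weighting filter discussed above, and the codebook, gain handling, and filter coefficients below are illustrative.

```python
import numpy as np

def synthesize(excitation, a):
    """All-pole LPC synthesis: s[n] = e[n] + sum_i a[i] * s[n - i]."""
    s = np.zeros(len(excitation))
    for n in range(len(s)):
        s[n] = excitation[n]
        for i, ai in enumerate(a, start=1):
            if n - i >= 0:
                s[n] += ai * s[n - i]
    return s

def celp_search(target, codebook, a):
    """Analysis-by-synthesis: pick the codebook excitation (with its
    optimal gain) whose synthesized segment is closest to the target
    in squared error."""
    best = (None, 0.0, np.inf)
    for idx, c in enumerate(codebook):
        y = synthesize(c, a)
        g = np.dot(target, y) / max(np.dot(y, y), 1e-12)  # optimal gain
        err = np.sum((target - g * y) ** 2)
        if err < best[2]:
            best = (idx, g, err)
    return best
```

Searching every candidate and keeping the minimum-distortion synthesis is exactly the "selection of an LPC excitation sequence" step; practical coders restructure this search to avoid synthesizing each vector explicitly.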
• ITU-T Recommendation G.728 (LD-CELP) [Chen, 1990] is a low delay, backward adaptive CELP coder. In G.728, a low algorithmic delay (less than 2.5 msec) is achieved by using 1024 candidate excitation sequences, each only 5 samples long. A 50th-order LPC spectral model is used, and the coefficients are backward-adapted based on the transmitted excitation.
• The speech coder standardized by the CTIA for use in U.S. TDMA (time-division multiple-access) 8 kbits/s digital cellular radio systems is called vector sum excited linear prediction (VSELP) [Gerson and Jasiuk, 1990]. VSELP is a forward-adaptive form of CELP where two excitation codebooks are used to reduce the complexity of encoding.
• Other approaches to complexity reduction in CELP coders are related to "sparse" codebook entries which have few nonzero samples per vector and "algebraic" codebooks which are based on integer lattices [Adoul and Lamblin, 1987]. In this case, excitation code vectors can be constructed on an as-needed basis instead of being stored in a table. ITU-T standardization of a CELP algorithm which uses lattice-based excitations has resulted in the 8 kbps G.729 (ACELP) coder.
• U.S. Federal Standard 1016 [National Communications System, 1991] is a 4.8 kbps CELP coder. It has both long- and short-term linear predictors which are forward adaptive, and so the coder has a relatively large delay (100 msec). This coder produces highly intelligible, good-quality speech in a variety of environments and is robust to independent bit errors.
Below about 4 kbps, the subjective quality of CELP coders is inferior to other architectures. Much research in variable-rate CELP implementations has resulted in alternative coder architectures which adjust their coding rates based on a number of channel conditions or sophisticated, speech-specific cues such as phonetic segmentation [Wang and Gersho, 1989; Paksoy et al., 1993]. Notably, most variable-rate CELP coders are implementations of finite-state CELP wherein a vector of speech cues controls the evolution of a state-machine to prescribe mode-dependent bit allocations for coder parameters. With these architectures, excellent speech quality at average rates below 2 kbps has been reported.
MBE

The MBE coder [Hardwick and Lim, 1991] is an efficient frequency-domain architecture partially based on the concepts of sinusoidal transform coding (STC) [McAulay and Quatieri, 1986]. In MBE, the instantaneous spectral envelope is represented explicitly by harmonic estimates in several subbands. The performance of MBE coders at rates below 4 kbps is generally "better" than that of CELP-based schemes.
An MBE coder decomposes the instantaneous speech spectrum into subbands centered at harmonics of the fundamental glottal excitation (pitch). The spectral envelope of the signal is approximated by samples taken at pitch harmonics, and these harmonic amplitudes are compared to adaptive thresholds (which may be determined via analysis-by-synthesis) to determine subbands of high spectral activity. Subbands that are determined to be "voiced" are labeled, and their energies and phases are encoded for transmission. Subbands having relatively low spectral activity are declared "unvoiced". These segments are approximated by an appropriately filtered segment of white noise, or a locally dense collection of sinusoids with random phase. Careful tracking of the evolution of individual spectral peaks and phases in successive frames is critical in the implementation of MBE-style coders.
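A toy version of the per-band voicing decision might look like the following. The energy-concentration test, the fixed threshold, and the one-harmonic-wide bands are our own illustrative stand-ins for the adaptive, analysis-by-synthesis thresholds described above.

```python
import numpy as np

def band_voicing(frame, fs, f0, threshold=0.6):
    """Toy MBE-style voicing decisions: for each harmonic of f0,
    compare the energy concentrated at the harmonic's DFT bins to the
    total energy in the surrounding band (one harmonic spacing wide).
    A concentrated band is declared 'voiced'."""
    N = len(frame)
    spec = np.abs(np.fft.rfft(frame * np.hanning(N))) ** 2
    bph = f0 * N / fs                      # bins per harmonic spacing
    decisions = []
    k = 1
    while (k + 0.5) * bph < len(spec):
        center = int(round(k * bph))
        lo = int(round((k - 0.5) * bph))
        hi = int(round((k + 0.5) * bph))
        band = spec[lo:hi + 1].sum()
        peak = spec[max(center - 2, 0):center + 3].sum()  # ~5 bins at the harmonic
        decisions.append(peak / max(band, 1e-12) > threshold)
        k += 1
    return decisions
```

A strongly harmonic frame concentrates its band energy at the pitch harmonics and is flagged voiced there, while a noise frame spreads energy across each band and is mostly flagged unvoiced.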
An efficient implementation of an MBE coder was adopted for the International Maritime Satellite (INMARSAT) voice processor, and is known as Improved-MBE, or IMBE [Hardwick and Lim, 1991]. This coder optimizes several components of the general MBE architecture, including grouping neighboring harmonics for subband voicing decisions, using non-integer pitch resolution for higher speaker fidelity, and differentially encoding the log-amplitudes of voiced harmonics using a DCT-based scheme. The IMBE coder requires high delay (about 80 msec), but produces very good quality encoded speech.
MELP

The MELP coder [McCree and Barnwell, 1995] is based on the traditional LPC vocoder model where an LPC synthesis filter is excited by an impulse train (voiced speech) or white noise (unvoiced speech). The MELP excitation, however, has characteristics that are more similar to natural human speech. In particular, the MELP excitation can be a mixture of (possibly aperiodic) pulses with bandlimited noise. In MELP, the excitation spectrum is explicitly modeled using Fourier series coefficients and bandpass voicing strengths, and the time-domain excitation sequence is produced from the spectral model via an inverse transform. The synthetic excitation sequence is then used to drive an LPC synthesizer which introduces formant spectral shaping.
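The mixed pulse/noise excitation can be sketched per band as follows; the band edges, FFT-based bandpass filters, and voicing-strength mixing rule are illustrative choices of ours, not the MELP standard's.

```python
import numpy as np

def mixed_excitation(n_samples, pitch_period, voicing, fs=8000):
    """Toy MELP-style excitation: in each frequency band, mix a pulse
    train (voiced part) with white noise (unvoiced part) according to
    that band's voicing strength in [0, 1]."""
    edges = [0, 500, 1000, 2000, 3000, 4000]      # Hz, five illustrative bands
    pulses = np.zeros(n_samples)
    pulses[::pitch_period] = 1.0                  # impulse train at the pitch rate
    noise = np.random.default_rng(0).standard_normal(n_samples)

    def bandpass(x, lo, hi):
        X = np.fft.rfft(x)
        f = np.fft.rfftfreq(len(x), 1.0 / fs)
        X[(f < lo) | (f >= hi)] = 0.0
        return np.fft.irfft(X, len(x))

    out = np.zeros(n_samples)
    for (lo, hi), v in zip(zip(edges[:-1], edges[1:]), voicing):
        out += v * bandpass(pulses, lo, hi) + (1.0 - v) * bandpass(noise, lo, hi)
    return out
```

With all voicing strengths at 1 the excitation is periodic at the pitch rate; with all at 0 it is bandlimited noise; intermediate strengths give the mixture the text describes.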
Common Threads

In addition to the use of analysis-by-synthesis techniques and/or LPC modeling, a common thread between low-rate, forward adaptive CELP, MBE, and MELP coders is the dependence on an estimate of the fundamental glottal frequency, or pitch period. CELP coders typically employ a pitch or long-term predictor to characterize
the glottal excitation. MBE coders estimate the fundamental frequency and use this estimate to focus subband decompositions on harmonics. MELP coders perform pitch-synchronous excitation modeling. Overall coder performance is enhanced in the CELP and MBE coders with the use of sub-integer lags [Kroon and Atal, 1991]. This is equivalent to performing pitch estimation using a signal sampled at a higher sampling rate to improve the precision of the spectral estimate. Highly precise glottal frequency estimation improves the "naturalness" of coded speech at the expense of increased computational complexity, and in some cases increased bit rate.
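Sub-integer lag estimation can be illustrated by refining an integer autocorrelation peak with parabolic interpolation, which plays the same role as the upsampling-based refinement described above; the function and its parameters are our own sketch.

```python
import numpy as np

def pitch_lag(x, lag_min, lag_max):
    """Pitch estimation with sub-integer resolution (sketch): find the
    integer lag maximizing the autocorrelation, then refine it by
    parabolic interpolation of the correlation peak."""
    r = np.array([np.dot(x[:-lag], x[lag:])
                  for lag in range(lag_min, lag_max + 1)])
    i = int(np.argmax(r))
    if 0 < i < len(r) - 1:
        ym, y0, yp = r[i - 1], r[i], r[i + 1]
        denom = ym - 2 * y0 + yp
        delta = 0.5 * (ym - yp) / denom if denom != 0 else 0.0
    else:
        delta = 0.0
    return lag_min + i + delta
```

For a sinusoid with a non-integer period, the refined estimate lands between integer lags, which is precisely the gain in precision the text attributes to sub-integer lags.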
Accurate characterization of pitch and LPC parameters can also be used to good advantage in postfiltering to reduce apparent quantization noise. These filters, usually derived from forward-adapted filter coefficients transmitted to the receiver as side-information, perform post-processing on the reconstructed speech which reduces perceptually annoying noise components [Chen and Gersho, 1995].
Speech Quality and Intelligibility

To compare the performance of two speech coders, it is necessary to have some indicator of the intelligibility and quality of the speech produced by each coder. The term intelligibility usually refers to whether the output speech is easily understandable, while the term quality is an indicator of how natural the speech sounds. It is possible for a coder to produce highly intelligible speech that is low quality in that the speech may sound very machine-like and the speaker is not identifiable. On the other hand, it is unlikely that unintelligible speech would be called high quality, but there are situations in which perceptually pleasing speech does not have high intelligibility. We briefly discuss here the most common measures of intelligibility and quality used in formal tests of speech coders.
DRT

The diagnostic rhyme test (DRT) was devised by Voiers [1977] to test the intelligibility of coders known to produce speech of lower quality. Rhyme tests are so named because the listener must determine which consonant was spoken when presented with a pair of rhyming words; that is, the listener is asked to distinguish between word pairs such as meat-beat, pool-tool, saw-thaw, and caught-taught. Each pair of words differs on only one of six phonemic attributes: voicing, nasality, sustention, sibilation, graveness, and compactness. Specifically, the listener is presented with one spoken word from a pair and asked to decide which word was spoken. The final DRT score is the percent responses computed according to P = (R − W)/T × 100, where R is the number correctly chosen, W is the number of incorrect choices, and T is the total number of word pairs tested. Usually, 75 ≤ DRT ≤ 95, with a "good" score being about 90 [Papamichalis, 1987].
MOS

The Mean Opinion Score (MOS) is an often-used performance measure [Jayant and Noll, 1984]. To establish a MOS for a coder, listeners are asked to classify the quality of the encoded speech in one of five categories: excellent (5), good (4), fair (3), poor (2), or bad (1). Alternatively, the listeners may be asked to classify the coded speech according to the amount of perceptible distortion present, i.e., imperceptible (5), barely perceptible but not annoying (4), perceptible and annoying (3), annoying but not objectionable (2), or very annoying and objectionable (1). The numbers in parentheses are used to assign a numerical value to the subjective evaluations, and the numerical ratings of all listeners are averaged to produce a MOS for the coder. A MOS between 4.0 and 4.5 usually indicates high quality.
It is important to compute the variance of MOS values. A large variance, which indicates an unreliable test, can occur because participants do not know what categories such as good and bad mean. It is sometimes useful to present examples of good and bad speech to the listeners before the test to calibrate the 5-point scale [Papamichalis, 1987]. The MOS values for a variety of speech coders and noise conditions are given in [Daumer, 1982].
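Computing a MOS and the variance that flags an unreliable test is straightforward; the helper below is only a small sketch.

```python
import numpy as np

def mos(ratings):
    """Mean Opinion Score and its (sample) variance from a list of
    listener ratings on the 1-5 scale; a large variance suggests the
    listeners disagreed on what the categories mean."""
    r = np.asarray(ratings, dtype=float)
    return r.mean(), r.var(ddof=1)
```

For example, ratings of [4, 5, 4, 3, 4] give a MOS of 4.0 with variance 0.5.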
DAM

The diagnostic acceptability measure (DAM) developed by Dynastat [Voiers, 1977] is an attempt to make the measurement of speech quality more systematic. For the DAM, it is critical that the listener crews be highly trained and repeatedly calibrated in order to get meaningful results. The listeners are each presented with encoded sentences taken from the Harvard 1965 list of phonetically balanced sentences, such as "Cats and dogs
each hate the other" and "The pipe began to rust while new". The listener is asked to assign a number between 0 and 100 to characteristics in three classifications: signal qualities, background qualities, and total effect. The ratings of each characteristic are weighted and used in a multiple nonlinear regression. Finally, adjustments are made to compensate for listener performance. A typical DAM score is 45 to 55%, with 50% corresponding to a good system [Papamichalis, 1987].
The perception of "good quality" speech is a highly individual and subjective area. As such, no single performance measure has gained wide acceptance as an indicator of the quality and intelligibility of speech produced by a coder. Further, there is no substitute for subjective listening tests under the actual environmental conditions expected in a particular application. As a rough guide to the performance of some of the coders discussed here, we present the DRT, DAM, and MOS values in Table 15.1, which is adapted from [Spanias, 1994; Jayant, 1990]. From the table, it is evident that at 8 kbits/s and above, performance is quite good and that the 4.8 kbits/s CELP has substantially better performance than LPC-10e.
Standardization

The presence of international, national, and regional speech coding standards ensures the interoperability of coders among various implementations. As noted previously, several standard algorithms exist among the classes of speech coders. The ITU-T (formerly CCITT) has historically been a dominant factor in international standardization of speech coders, such as G.711, G.721, G.728, G.729, etc. Additionally, the emergence of digital cellular telephony, personal communications networks, and multimedia communications has driven the formulation of various national or regional standard algorithms such as the GSM full and half-rate standards for European digital cellular, the CTIA full-rate TDMA and CDMA algorithms and their half-rate counterparts for U.S. digital cellular, full and half-rate Pitch-Synchronous CELP [Miki et al., 1993] for Japanese cellular, as well as speech coders for particular applications [ITU-TS, 1991].
The standardization efforts of the U.S. Federal Government for secure voice channels and military applications have had a historically significant impact on the evolution of speech coder technology. In particular, the recent re-standardization of the DoD 2400 bits/s vocoder algorithm has produced some competing algorithms worthy of mention here. Of the classes of speech coders represented among the algorithms competing to replace LPC-10, several implementations utilized STC or MBE architectures, some used CELP architectures, and others were novel combinations of multiband-excitation with LPC modeling [McCree and Barnwell, 1995] or pitch-synchronous prototype waveform interpolation techniques [Kleijn, 1991].
The final results of the U.S. DoD standard competition are summarized in Table 15.2 for "quiet" and "office" environments. In the table, the column labeled "FOM" is the overall Figure of Merit used by the DoD Digital Voice Processing Consortium in selecting the coder. The FOM is a unitless combination of complexity and performance components, and is measured with respect to FS-1016. The complexity of a coder is a weighted combination of memory and processing power required. The performance of a coder is a weighted combination of four factors: quality (Q, measured via MOS), intelligibility (I, measured via DRT), speaker recognition (R), and communicability (C). Recognizability and communicability for each coder were measured by tests
TABLE 15.1 Speech Coder Performance Comparisons

Algorithm    Standardization              Rate       Subjective
(acronym)    Body        Identifier       kbits/s    MOS      DRT      DAM
µ-law PCM    ITU-T       G.711            64         4.3      95       73
ADPCM        ITU-T       G.721            32         4.1      94       68
LD-CELP      ITU-T       G.728            16         4.0      94ᵃ      70ᵃ
RPE-LTP      GSM         GSM              13         3.5      —        —
VSELP        CTIA        IS-54            8          3.5      —        —
CELP         U.S. DoD    FS-1016          4.8        3.13ᵇ    90.7ᵇ    65.4ᵇ
IMBE         Inmarsat    IMBE             4.1        3.4      —        —
LPC-10e      U.S. DoD    FS-1015          2.4        2.24ᵇ    86.2ᵇ    50.3ᵇ

ᵃ Estimated.
ᵇ From results of 1996 U.S. DoD 2400 bits/s vocoder competition.
comparing processed vs. unprocessed data, and effectiveness of communication in application-specific cooperative tasks [Schmidt-Nielsen and Brock, 1996; Kreamer and Tardelli, 1996]. The MOS and DRT scores were measured in a variety of common DoD environments. Each of the four "finalist" coders ranked first in one of the four categories examined (Q, I, R, C), as noted in the table.
The results of the standardization process were announced in April, 1996. As indicated in Table 15.2, the new 2400 bits/s Federal Standard vocoder replacing LPC-10e is a version of the Mixed Excitation Linear Prediction (MELP) coder which uses several specific enhancements to the basic MELP architecture. These enhancements include multi-stage VQ of the formant parameters based on frequency-weighted bark-scale spectral distortion, direct VQ of the first 10 Fourier coefficients of the excitation using bark-weighted distortion, and a gain coding technique which is robust to channel errors [McCree et al., 1996].
Variable Rate Coding

Previous standardization efforts and discussion here have centered on fixed-rate coding of speech, where a fixed number of bits are used to represent speech in digital form per unit of time. However, with recent developments in transmission architectures (such as CDMA), the implementation of variable-rate speech coding algorithms has become feasible. In variable-rate coding, the average data rate for conversational speech can be reduced by a factor of at least 2.
A variable-rate speech coding algorithm has been standardized by the CTIA for wideband (CDMA) digital mobile cellular telephony under IS-95. The algorithm, QCELP [Gardner et al., 1993], is the first practical variable-rate speech coder to be incorporated in a digital cellular system. QCELP is a multi-mode, CELP-type analysis-by-synthesis coder which uses blockwise spectral energy measurements and a finite-state machine to switch between one of four configurations. Each configuration has a fixed rate of 1, 2, 4, or 8 kbits/s with a predetermined allocation of bits among coder parameters (coefficients, gains, excitation, etc.). The subjective performance of QCELP in the presence of low background noise is quite good since the bit allocations per mode and mode-switching logic are well-suited to high-quality speech. In fact, QCELP at an average rate of 4 kbits/s has been judged to be MOS-equivalent to VSELP, its 8 kbits/s, fixed-rate cellular counterpart. A time-averaged encoding rate of 4 to 5 kbits/s is not uncommon for QCELP; however, the average rate tends toward the 8 kbits/s maximum in the presence of moderate ambient noise. The topic of robust fixed-rate and variable-rate speech coding in the presence of significant background noise remains an open problem.
Much recent research in speech coding below 8 kbits/s has focused on multi-mode CELP architectures and
efficient approaches to source-controlled mode selection [Das et al., 1995]. Multimode coders are able to quickly
invoke a coding scheme and bit allocation specifically tailored to the local characteristics of the speech signal.
This capability has proven useful in optimizing perceptual quality at low coding rates. In fact, the majority of
algorithms currently proposed for half-rate European and U.S. digital cellular standards, as well as many
algorithms considered for rates below 2.4 kbits/s, are multimode coders. The direct coupling between variable-rate
(multimode) speech coding and the CDMA transmission architecture is an obvious benefit to both technologies.
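To illustrate the idea of source-controlled mode selection, the sketch below picks one of four fixed rates from a blockwise frame-energy measurement with a simple hold-over rule, in the spirit of (but far simpler than) the QCELP finite-state rate logic. The rates, thresholds, and hysteresis value are illustrative assumptions, not values from any standard.

```python
import math

# Illustrative sketch of source-controlled mode selection (NOT the
# actual QCELP logic): pick one of four fixed rates from blockwise
# frame energy, holding the previous (higher) rate near a boundary.
RATES = [1000, 2000, 4000, 8000]  # bits per second

def frame_energy_db(frame):
    e = sum(s * s for s in frame) / len(frame)
    return 10.0 * math.log10(e + 1e-12)  # floor avoids log(0) in silence

def select_rate(frame, prev_idx, thresholds=(-60.0, -45.0, -30.0), hysteresis=3.0):
    """Return an index into RATES; thresholds are assumed dB boundaries."""
    e = frame_energy_db(frame)
    idx = sum(e > t for t in thresholds)  # 0 (silence) .. 3 (full rate)
    # hold the previous, higher rate when the energy sits near a boundary
    if idx < prev_idx and e > thresholds[idx] - hysteresis:
        idx = prev_idx
    return idx

silence = [0.0] * 160
voiced = [0.5 * math.sin(2 * math.pi * 100 * n / 8000.0) for n in range(160)]
```

A silent frame selects the 1 kbit/s mode, while an active voiced frame selects the full 8 kbits/s mode; the hysteresis keeps the rate from chattering between adjacent modes.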
TABLE 15.2  Speech Coder Performance Comparisons Taken from Results of 1996 U.S.
DoD 2400 bits/s Vocoder Competition

  Algorithm                             Quiet                Office
  (acronym)   FOM     Rank   Best   MOS   DRT   DAM     MOS   DRT   DAM
  MELP        2.616   1      I      3.30  92.3  64.5    2.96  91.2  52.7
  PWI         2.347   2      Q      3.28  90.5  70.0    2.88  88.4  55.5
  STC         2.026   3      R      3.08  89.9  63.8    2.82  91.5  54.1
  IMBE*       2.991   --     C      2.89  91.4  62.3    2.71  91.1  52.4
  CELP        0.0     N/A    -      3.13  90.7  65.4    2.89  89.0  56.1
  LPC-10e     -9.19   N/A    -      2.24  86.2  50.3    2.09  85.2  48.4

  *Ineligible due to failure of the quality (MOS) criteria minimum requirements
  (better than CELP) in both quiet and office environments.
Summary and Conclusions

The availability of general-purpose and application-specific digital signal processing chips and the ever-widening
interest in digital communications have led to an increasing demand for speech coders. The worldwide desire
to establish standards in a host of applications is a primary driving force for speech coder research and
development. The speech coders that are available today for operation at 16 kbits/s and below are conceptually
quite exotic compared with products available less than 10 years ago. The re-standardization of U.S. Federal
Standard 1015 (LPC-10) at 2.4 kbits/s with performance constraints similar to those of FS-1016 at 4.8 kbits/s
is an indicator of the rapid evolution of speech coding paradigms and VLSI architectures.

Other standards to be established in the near term include the European (GSM) and U.S. (CTIA) half-rate
speech coders for digital cellular mobile radio. For the longer term, the specification of standards for forthcoming
mobile personal communications networks will be a primary focus in the next 5 to 10 years.

In the preface to their book, Jayant and Noll [1984] state that "our understanding of speech and image coding
has now reached a very mature point . . . ." As of 1997, this statement rings truer than ever. The field is a dynamic
one, however, and the wide range of commercial applications demands continual progress.
Defining Terms

Analysis-by-synthesis: Constructing several versions of a waveform and choosing the best match.
Predictive coding: Coding of time-domain waveforms based on a (usually) linear prediction model.
Frequency domain coding: Coding of frequency-domain characteristics based on a discrete time-frequency
transform.
Hybrid coders: Coders that fall between waveform coders and vocoders in how they select the excitation.
Standard: An encoding technique adopted by an industry to be used in a particular application.
Mean Opinion Score (MOS): A popular method for classifying the quality of encoded speech based on a five-point
scale.
Variable-rate coders: Coders that output different amounts of bits based on the time-varying characteristics
of the source.
Related Topics

17.1 Digital Image Processing • 21.4 Example 3: Multirate Signal Processing
References

A. S. Spanias, "Speech coding: A tutorial review," Proc. IEEE, 82, 1541-1575, October 1994.
A. Gersho, "Advances in speech and audio compression," Proc. IEEE, 82, June 1994.
W. B. Kleijn and K. K. Paliwal, Eds., Speech Coding and Synthesis, Amsterdam, Holland: Elsevier, 1995.
CCITT, "32-kbit/s adaptive differential pulse code modulation (ADPCM)," Red Book, III.3, 125-159, 1984.
National Communications System, Office of Technology and Standards, Federal Standard 1015: Analog to Digital
    Conversion of Voice by 2400 bit/second Linear Predictive Coding, 1984.
J.-H. Chen, "High-quality 16 kb/s speech coding with a one-way delay less than 2 ms," Proc. IEEE Int. Conf.
    Acoust., Speech, Signal Processing, Albuquerque, NM, pp. 453-456, April 1990.
National Communications System, Office of Technology and Standards, Federal Standard 1016: Telecommunications:
    Analog to Digital Conversion of Radio Voice by 4800 bit/second Code Excited Linear Prediction (CELP), 1991.
J. Gibson, "Adaptive prediction for speech encoding," IEEE ASSP Magazine, 1, 12-26, July 1984.
J. D. Johnston, "A filter family designed for use in quadrature mirror filter banks," Proc. IEEE Int. Conf. Acoust.,
    Speech, Signal Processing, Denver, CO, pp. 291-294, April 1980.
B. Atal and M. Schroeder, "Predictive coding of speech signals and subjective error criteria," IEEE Trans. Acoust.,
    Speech, Signal Processing, ASSP-27, 247-254, June 1979.
I. Gerson and M. Jasiuk, "Vector sum excited linear prediction (VSELP) speech coding at 8 kb/s," in Proc. IEEE
    Int. Conf. Acoust., Speech, Signal Processing, Albuquerque, NM, pp. 461-464, April 1990.
J.-P. Adoul and C. Lamblin, "A comparison of some algebraic structures for CELP coding of speech," Proc. IEEE
    Int. Conf. Acoust., Speech, Signal Processing, Dallas, TX, pp. 1953-1956, April 1987.
S. Wang and A. Gersho, "Phonetically-based vector excitation coding of speech at 3.6 kbps," Proc. IEEE Int.
    Conf. Acoust., Speech, Signal Processing, Glasgow, Scotland, pp. 49-52, May 1989.
E. Paksoy, K. Srinivasan, and A. Gersho, "Variable rate speech coding with phonetic segmentation," Proc. IEEE
    Int. Conf. Acoust., Speech, Signal Processing, Minneapolis, MN, pp. II.155-II.158, April 1993.
J. Hardwick and J. Lim, "The application of the IMBE speech coder to mobile communications," Proc. IEEE
    Int. Conf. Acoust., Speech, Signal Processing, pp. 249-252, May 1991.
R. McAulay and T. Quatieri, "Speech analysis/synthesis based on a sinusoidal representation," IEEE Trans.
    Acoust., Speech, Signal Processing, 34, 744-754, August 1986.
A. McCree and T. Barnwell, "A mixed excitation LPC vocoder model for low bit rate speech coding," IEEE Trans.
    Speech Audio Processing, 3, 242-250, July 1995.
P. Kroon and B. S. Atal, "On improving the performance of pitch predictors in speech coding systems," in
    Advances in Speech Coding, B. S. Atal, V. Cuperman, and A. Gersho, Eds., Boston, Mass.: Kluwer, 1991,
    pp. 321-327.
J.-H. Chen and A. Gersho, "Adaptive postfiltering for quality enhancement of coded speech," IEEE Trans. Speech
    and Audio Processing, 3, 59-71, January 1995.
W. Voiers, "Diagnostic evaluation of speech intelligibility," in Speech Intelligibility and Recognition, M. Hawley,
    Ed., Stroudsburg, Pa.: Dowden, Hutchinson, and Ross, 1977.
P. Papamichalis, Practical Approaches to Speech Coding, Englewood Cliffs, N.J.: Prentice-Hall, 1987.
N. S. Jayant and P. Noll, Digital Coding of Waveforms, Englewood Cliffs, N.J.: Prentice-Hall, 1984.
W. Daumer, "Subjective evaluation of several different speech coders," IEEE Trans. Commun., COM-30, 655-662,
    April 1982.
W. Voiers, "Diagnostic acceptability measure for speech communications systems," Proc. IEEE Int. Conf. Acoust.,
    Speech, Signal Processing, 204-207, 1977.
N. Jayant, "High-quality coding of telephone speech and wideband audio," IEEE Communications Magazine,
    28, 10-20, January 1990.
S. Miki, K. Mano, H. Ohmuro, and T. Moriya, "Pitch synchronous innovation CELP (PSI-CELP)," Proc.
    European Conf. Speech Comm. Technol., Berlin, Germany, pp. 261-264, September 1993.
ITU-TS Study Group XV, Draft Recommendation AV.25Y: Dual Rate Speech Coder for Multimedia Telecommunication
    Transmitting at 5.3 & 6.3 kbit/s, December 1995.
W. Kleijn, "Continuous representations in linear predictive coding," Proc. IEEE Int. Conf. Acoust., Speech, Signal
    Processing, pp. 201-204, 1991.
A. Schmidt-Nielsen and D. Brock, "Speaker recognizability testing for voice coders," Proc. IEEE Int. Conf. Acoust.,
    Speech, and Signal Processing, pp. 1149-1152, April 1996.
E. Kreamer and J. Tardelli, "Communicability testing for voice coders," Proc. IEEE Int. Conf. Acoust., Speech,
    Signal Processing, pp. 1153-1156, April 1996.
A. McCree, K. Truong, E. George, T. Barnwell, and V. Viswanathan, "A 2.4 kbit/s MELP coder candidate for the
    new U.S. Federal Standard," Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, pp. 200-203, April 1996.
W. Gardner, P. Jacobs, and C. Lee, "QCELP: A variable rate speech coder for CDMA digital cellular," in Speech
    and Audio Coding for Wireless Networks, B. S. Atal, V. Cuperman, and A. Gersho, Eds., Boston, Mass.:
    Kluwer, 1993, pp. 85-92.
A. Das, E. Paksoy, and A. Gersho, "Multimode and variable-rate coding of speech," in Speech Coding and
    Synthesis, W. B. Kleijn and K. K. Paliwal, Eds., Amsterdam: Elsevier, 1995, pp. 257-288.
Further Information

For further information on the state of the art in speech coding, see the articles by Spanias [1994] and Gersho
[1994], and the book Speech Coding and Synthesis by Kleijn and Paliwal [1995].
15.2 Speech Enhancement and Noise Reduction

Yariv Ephraim
Voice communication systems are susceptible to interfering signals normally referred to as noise. The interfering
signals may have harmful effects on the performance of any speech communication system. These effects depend
on the specific system being used, on the nature of the noise and the way it interacts with the clean signal, and
on the relative intensity of the noise compared to that of the signal. The latter is usually measured by the signal-to-noise
ratio (SNR), which is the ratio of the power of the signal to the power of the noise.
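The SNR definition above can be computed directly from sampled signal and noise records; a minimal sketch:

```python
import math

def snr_db(signal, noise):
    """Signal-to-noise ratio in decibels: 10 * log10(P_signal / P_noise),
    with each power taken as the time-averaged squared sample value."""
    p_signal = sum(s * s for s in signal) / len(signal)
    p_noise = sum(v * v for v in noise) / len(noise)
    return 10.0 * math.log10(p_signal / p_noise)

# A signal with 10 times the noise power has an SNR of 10 dB.
sig = [math.sqrt(10.0)] * 100  # constant power 10
noi = [1.0] * 100              # constant power 1
```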
The speech communication system may simply be a recording which was performed in a noisy environment,
a standard digital or analog communication system, or a speech recognition system for human-machine
communication. The noise may be present at the input of the communication system, in the channel, or at the
receiving end. The noise may be correlated or uncorrelated with the signal. It may accompany the clean signal
in an additive, multiplicative, or any other more general manner. Examples of noise sources include competitive
speech; background sounds like music, a fan, machines, door slamming, wind, and traffic; room reverberation;
and white Gaussian channel noise.
The ultimate goal of speech enhancement is to minimize the effects of the noise on the performance of
speech communication systems. The performance measure is system dependent. For systems which comprise
recordings of noisy speech, or standard analog communication systems, the goal of speech enhancement is to
improve perceptual aspects of the noisy signal. For example, improving the quality and intelligibility of the
noisy signal are common goals. Quality is a subjective measure which reflects on the pleasantness of the speech
or on the amount of effort needed to understand the speech material. Intelligibility, on the other hand, is an
objective measure which signifies the amount of speech material correctly understood. For standard digital
communication systems, the goal of speech enhancement is to improve perceptual aspects of the encoded speech
signal. For human-machine speech communication systems, the goal of speech enhancement is to reduce the
error rate in recognizing the noisy speech signals.
To demonstrate the above ideas, consider a "hands-free" cellular radio telephone communication system. In
this system, the transmitted signal is composed of the original speech and the background noise in the car. The
background noise is generated by the engine, fan, traffic, wind, etc. The transmitted signal is also affected by
the radio channel noise. As a result, noisy speech with low quality and intelligibility is delivered by such systems.
The background noise may have additional devastating effects on the performance of this system. Specifically,
if the system encodes the signal prior to its transmission, then the performance of the speech coder may
significantly deteriorate in the presence of the noise. The reason is that speech coders rely on some statistical
model for the clean signal, and this model becomes invalid when the signal is noisy. For a similar reason, if the
cellular radio system is equipped with a speech recognizer for automatic dialing, then the error rate of such a
recognizer will be elevated in the presence of the background noise. The goals of speech enhancement in this
example are to improve perceptual aspects of the transmitted noisy speech signals as well as to reduce the
speech recognizer error rate.
Other important applications of speech enhancement include improving the performance of:
1. Pay phones located in noisy environments (e.g., airports)
2. Air-ground communication systems in which the cockpit noise corrupts the pilot's speech
3. Teleconferencing systems where noise sources in one location may be broadcast to all other locations
4. Long distance communication over noisy radio channels
The problem of speech enhancement has been a challenge for many researchers for almost three decades.
Different solutions with various degrees of success have been proposed over the years. An excellent introduction
to the problem, and review of the systems developed up until 1979, can be found in the landmark paper by Lim
and Oppenheim [1979]. A panel of the National Academy of Sciences discussed in 1988 the problem and various
ways to evaluate speech enhancement systems. The panel's findings were summarized in Makhoul et al. [1989].
Modern statistical approaches for speech enhancement were recently reviewed in Boll [1992] and Ephraim [1992].
In this section the principles and performance of the major speech enhancement approaches are reviewed,
and the advantages and disadvantages of each approach are discussed. The signal is assumed to be corrupted
by additive statistically independent noise. Only a single noisy version of the clean signal is assumed available
for enhancement. Furthermore, it is assumed that the clean signal cannot be preprocessed to increase its
robustness prior to being affected by the noise. Speech enhancement systems which can either preprocess the
clean speech signal or which have access to multiple versions of the noisy signal obtained from a number of
microphones are discussed in Lim [1983].
This presentation is organized as follows. In the second section the speech enhancement problem is formulated
and commonly used models and performance measures are presented. In the next section signal estimation
for improving perceptual aspects of the noisy signal is discussed. In the fourth section source coding techniques
for noisy signals are summarized, and the last section deals with recognition of noisy speech signals. Due to
the limited number of references (10) allowed in this publication, tutorial papers are mainly referenced.
Appropriate credit will be given by pointing to the tutorial papers which reference the original papers.
Models and Performance Measures

The goals of speech enhancement as stated in the first section are to improve perceptual aspects of the noisy
signal, whether the signal is transmitted through analog or digital channels, and to reduce the error rate in
recognizing noisy speech signals. Improving perceptual aspects of the noisy signal can be accomplished by
estimating the clean signal from the noisy signal using perceptually meaningful estimation performance measures.
If the signal has to be encoded for transmission over digital channels, then source coding techniques can
be applied to the given noisy signal. In this case, a perceptually meaningful fidelity measure between the clean
signal and the encoded noisy signal must be used. Reducing error rate in speech communication systems can
be accomplished by applying optimal signal classification approaches to the given noisy signals. Thus the speech
enhancement problem is essentially a set of signal estimation, source coding, and signal classification problems.

The probabilistic approach for solving these problems requires explicit knowledge of the performance
measure as well as the probability laws of the clean signal and noise process. Such knowledge, however, is not
explicitly available. Hence, mathematically tractable performance measures and statistical models which are
believed to be meaningful are used. In this section we briefly review the most commonly used statistical models
and performance measures.
The most fundamental model for speech signals is the Gaussian autoregressive (AR) model. This model
assumes that each 20- to 40-msec segment of the signal is generated from an excitation signal which is applied
to a linear time-invariant all-pole filter. The excitation signal comprises a mixture of white Gaussian noise and
a periodic sequence of impulses. The period of that sequence is determined by the pitch period of the speech
signal. This model is described in Fig. 15.2. Generally, the excitation signal represents the flow of air through
the vocal cords and the all-pole filter represents the vocal tract. The model for a given sample function of speech

FIGURE 15.2 Gaussian autoregressive speech model.
signals, which is composed of several consecutive 20- to 40-msec segments of that signal, is obtained from the
sequence of AR models for the individual segments. Thus, a linear time-varying AR model is assumed for each
sample function of the speech signal. This model, however, is slowly varying in accordance with the slow
temporal variation of the articulatory system. It was found that a set of approximately 2048 prototype AR
models can reliably represent all segments of speech signals. The AR models are useful in representing the short
time spectrum of the signal, since the spectrum of the excitation signal is white. Thus, the set of AR models
represents a set of 2048 spectral prototypes for the speech signal.
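The AR model just described can be sketched in a few lines: a mixed impulse-train and white-Gaussian excitation driving an all-pole filter. The filter coefficients, mixing weights, and noise level below are illustrative choices, not trained prototype values.

```python
import random

def synthesize_segment(ar_coeffs, pitch_period, voicing, n=320, seed=0):
    """One segment of the Gaussian AR speech model: an excitation mixing
    a periodic impulse train (weight `voicing`) with white Gaussian
    noise, passed through an all-pole (AR) filter.  All parameter
    values here are illustrative, not trained prototype values."""
    rng = random.Random(seed)
    out = []
    for t in range(n):
        pulse = 1.0 if pitch_period and t % pitch_period == 0 else 0.0
        excitation = voicing * pulse + (1.0 - voicing) * rng.gauss(0.0, 0.1)
        s = excitation
        for k, a in enumerate(ar_coeffs, start=1):  # all-pole recursion
            if t - k >= 0:
                s += a * out[t - k]
        out.append(s)
    return out

# A voiced segment: pitch period of 80 samples, stable AR(2) filter.
seg = synthesize_segment([1.2, -0.8], pitch_period=80, voicing=1.0)
```

Setting `voicing` near 0 instead yields unvoiced, noise-like output from the same filter.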
The time-varying AR model for speech signals lacks the "memory" which assigns preference to one AR model
to follow another AR model. This memory could be incorporated, for example, by assuming that the individual
AR models are chosen in a Markovian manner. That is, given an AR model for the current segment of speech,
certain AR models for the following segment of speech will be more likely than others. This results in the so-called
composite source model (CSM) for the speech signal.
A block diagram of a CSM is shown in Fig. 15.3. In general, this model is composed of a set of M vector
subsources which are controlled by a switch. The position of the switch at each time instant is chosen randomly,
and the output of one subsource is provided. The position of the switch defines the state of the source at each
time instant. CSMs for speech signals assume that the subsources are Gaussian AR sources, and the switch is
controlled by a Markov chain. Furthermore, the subsources are usually assumed statistically independent and
the vectors generated from each subsource are also assumed statistically independent. The resulting model is
known as a hidden Markov model (HMM) [Rabiner, 1989] since the output of the model does not contain
the states of the Markovian switch.
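The CSM structure above can be sketched as a Markov-switched set of Gaussian subsources. The two-state transition matrix and subsource parameters below are toy values, and i.i.d. Gaussian subsources stand in for the Gaussian AR subsources a real CSM would use.

```python
import random

def sample_csm(n_frames, transition, subsources, seed=1):
    """Sample a toy composite source model: a hidden Markov 'switch'
    selects which Gaussian subsource emits each frame.  Real CSMs for
    speech use Gaussian AR subsources; i.i.d. Gaussians keep the
    sketch short.  All parameter values here are illustrative."""
    rng = random.Random(seed)
    state = 0
    states, frames = [], []
    for _ in range(n_frames):
        # Markovian switch: the next state depends only on the current one
        state = 0 if rng.random() < transition[state][0] else 1
        mean, std = subsources[state]
        frames.append([rng.gauss(mean, std) for _ in range(4)])
        states.append(state)
    return states, frames

transition = [[0.9, 0.1], [0.2, 0.8]]  # row-stochastic transition matrix
subsources = [(0.0, 0.1), (0.0, 1.0)]  # (mean, std) of each subsource
states, frames = sample_csm(50, transition, subsources)
```

An observer who sees only `frames` and not `states` faces exactly the hidden-state situation that makes this an HMM.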
The performance measure for speech enhancement is task dependent. For signal estimation and coding, this
measure is given in terms of a distortion measure between the clean signal and the estimated or the encoded
signals, respectively. For signal classification applications the performance measure is normally the probability
of misclassification. Commonly used distortion measures are the mean-squared error (MSE) and the Itakura-Saito
distortion measures. The Itakura-Saito distortion measure is a measure between two power spectral
densities, of which one is usually that of the clean signal and the other of a model for that signal [Gersho and
Gray, 1991]. This distortion measure is normally used in designing speech coding systems and it is believed to
be perceptually meaningful. Both measures are mathematically tractable and lead to intuitive estimation and
coding schemes. Systems designed using these two measures need not be optimal only in the MSE and the
Itakura-Saito sense, but they may as well be optimal in other more meaningful senses (see a discussion in
Ephraim [1992]).
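For sampled power spectral densities the Itakura-Saito distortion is commonly written as the average over frequency bins of p_k/q_k - ln(p_k/q_k) - 1; a small sketch (the two spectra below are made-up examples):

```python
import math

def itakura_saito(p, q):
    """Itakura-Saito distortion between two sampled power spectral
    densities p and q (q typically the model spectrum): the average of
    p_k/q_k - ln(p_k/q_k) - 1.  Non-negative, zero iff p == q, and
    notably asymmetric in its two arguments."""
    d = 0.0
    for pk, qk in zip(p, q):
        r = pk / qk
        d += r - math.log(r) - 1.0
    return d / len(p)

flat = [1.0] * 8                                   # made-up spectra
tilted = [2.0, 1.5, 1.2, 1.0, 0.9, 0.8, 0.7, 0.6]
```

The asymmetry is deliberate: over- and under-estimating the spectrum are penalized differently, which is part of why the measure is considered perceptually meaningful for speech.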
Signal Estimation

In this section we review the major approaches for speech signal estimation given noisy signals.

FIGURE 15.3 Composite source model.
Spectral Subtraction

The spectral subtraction approach [Weiss, 1974] is the simplest and most intuitive and popular speech enhancement
approach. This approach provides estimates of the clean signal as well as of the short time spectrum of
that signal. Estimation is performed on a frame-by-frame basis, where each frame consists of 20-40 msec of
speech samples. In the spectral subtraction approach the signal is Fourier transformed, and spectral components
whose variance is smaller than that of the noise are nulled. The surviving spectral components are modified
by an appropriately chosen gain function. The resulting set of nulled and modified spectral components
constitutes the spectral components of the enhanced signal. The signal estimate is obtained from the inverse Fourier
transform of the enhanced spectral components. The short time spectrum estimate of the signal is obtained
from squaring the enhanced spectral components. A block diagram of the spectral subtraction approach is
shown in Fig. 15.4.
Gain functions motivated by different perceptual aspects have been used. One of the most popular functions
results from linear minimum MSE (MMSE) estimation of each spectral component of the clean signal given
the corresponding spectral component of the noisy signal. In this case, the value of the gain function for a
given spectral component constitutes the ratio of the variances of the clean and noisy spectral components.
The variance of the clean spectral component is obtained by subtracting an assumed known variance of the
noise spectral component from the variance of the noisy spectral component. The resulting variance is guaranteed
to be positive by the nulling process mentioned above. The variances of the spectral components of the
noise process are normally estimated from silence portions of the noisy signal.
FIGURE 15.4 Spectral subtraction signal estimator.

A family of spectral gain functions proposed in Lim and Oppenheim [1979] is given by

    g_n = ( ( |Z_n|^a - b E|V_n|^a ) / |Z_n|^a )^c ,    n = 1, . . ., N     (15.2)

where Z_n and V_n denote the nth spectral components of the noisy signal and the noise process, respectively,
and a > 0, b > 0, c > 0. The MMSE gain function is obtained when a = 2, b = 1, and c = 1. Another commonly
used gain function in the spectral subtraction approach is obtained from using a = 2, b = 1, and c = 1/2. This
gain function results from estimating the spectral magnitude of the signal and combining the resulting estimate
with the phase of the noisy signal. This choice of gain function is motivated by the relative importance of the
spectral magnitude of the signal compared to its phase. Since both cannot be simultaneously optimally estimated
[Ephraim, 1992], only the spectral magnitude is optimally estimated, and combined with an estimate of the
complex exponential of the phase which does not affect the spectral magnitude estimate. The resulting estimate
of the phase can be shown to be the phase of the noisy signal within the HMM statistical framework. Normally,
the spectral subtraction approach is used with b = 2, which corresponds to an artificially elevated noise level.
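A frame-level sketch of the gain family of Eq. (15.2), with the magnitude-estimation choice a = 2, b = 1, c = 1/2 as the default. The input spectral components and noise-power values are illustrative, and a real implementation would operate on windowed FFT frames with overlap-add.

```python
def spectral_subtraction(noisy_fft, noise_power, a=2.0, b=1.0, c=0.5):
    """Apply the gain family of Eq. (15.2) to one frame's spectral
    components.  noise_power[n] plays the role of E|V_n|^a, estimated
    in practice from silence portions; components falling below the
    scaled noise estimate are nulled.  Input values are illustrative."""
    enhanced = []
    for z_n, p_v in zip(noisy_fft, noise_power):
        num = abs(z_n) ** a - b * p_v
        if num <= 0.0:
            enhanced.append(0.0 + 0.0j)       # nulling step
        else:
            g = (num / abs(z_n) ** a) ** c    # gain for this component
            enhanced.append(g * z_n)          # keeps the noisy phase
    return enhanced

noisy = [4.0 + 0.0j, 1.0 + 1.0j, 0.1 + 0.0j]  # toy spectral components
noise = [1.0, 1.0, 1.0]                       # assumed E|V_n|^2 per bin
out = spectral_subtraction(noisy, noise)
```

Strong components are attenuated only slightly, weak ones more heavily, and the last component falls below the noise estimate and is nulled; multiplying by the complex component `z_n` is what retains the noisy phase.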
The spectral subtraction approach has been very popular since it is relatively easy to implement; it makes
minimal assumptions about the signal and noise; and, when carefully implemented, it results in reasonably clear
enhanced signals. A major drawback of the spectral subtraction enhancement approach, however, is that the
residual noise has annoying tonal characteristics referred to as "musical noise." This noise consists of narrowband
signals with time-varying frequencies and amplitudes. Another major drawback of the spectral subtraction
approach is that its optimality in any given sense has never been proven. Thus, no systematic methodology for
improving the performance of this approach has been developed, and all attempts to achieve this goal have
been based on purely heuristic arguments. As a result, a family of spectral subtraction speech enhancement
approaches has been developed and experimentally optimized.
In a recent work [Ephraim et al., 1995] a version of the spectral subtraction was shown to be a signal subspace
estimation approach which is asymptotically optimal (as the frame length approaches infinity) in the linear
MMSE sense.
Empirical Averages

This approach attempts to estimate the clean signal from the noisy signal in the MMSE sense. The conditional
mean estimator is implemented using the conditional sample average of the clean signal given the noisy signal.
The sample average is obtained from appropriate training sequences of the clean and noisy signals. This is
equivalent to using the sample distribution or the histogram estimate of the probability density function (pdf)
of the clean signal given the noisy signal. The sample average approach is applicable for estimating the signal
as well as functionals of that signal, e.g., the spectrum, the logarithm of the spectrum, and the spectral
magnitude.
Let {Y_i, i = 0, . . ., T} be a training data from the clean signal, where Y_i is a K-dimensional vector in the
Euclidean space R^K. Let {Z_i, i = 0, . . ., T} be a training data from the noisy signal, where Z_i ∈ R^K. The sequence
{Z_i} can be obtained by adding a noise training sequence {V_i, i = 0, . . ., T} to the sequence of clean signals
{Y_i}. Let z ∈ R^K be a vector of the noisy signal from which the vector y of the clean signal is estimated. Let
Y(z) = {Y_i : Z_i = z, i = 0, . . ., T} be the set of all clean vectors from the training data of the clean signal which
could have resulted in the given noisy observation z. The cardinality of this set is denoted by |Y(z)|. Then, the
sample average estimate of the conditional mean of the clean signal y given the noisy signal z is given by

    ŷ = E{y | z} = ∫ y p(y, z) dy / ∫ p(y, z) dy ≈ (1 / |Y(z)|) Σ_{Y_i ∈ Y(z)} Y_i     (15.3)

Obviously, this approach is only applicable for signals with finite alphabet since otherwise the set Y(z) is empty
with probability one. For signals with continuous pdf's, the approach can be applied only if those signals are
appropriately quantized.
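For quantized (finite-alphabet) signals, the sample average of Eq. (15.3) reduces to averaging the matching clean training vectors; a toy sketch with hypothetical scalar-quantized training pairs:

```python
def sample_average_estimate(z, clean_train, noisy_train):
    """Nonparametric conditional-mean estimate of Eq. (15.3): average
    every clean training vector Y_i whose noisy counterpart Z_i equals
    the observed z.  Requires a finite alphabet, so continuous-valued
    data must be quantized first."""
    matches = [y for y, z_i in zip(clean_train, noisy_train) if z_i == z]
    if not matches:
        raise ValueError("observation never seen in the training data")
    dim = len(matches[0])
    return [sum(m[k] for m in matches) / len(matches) for k in range(dim)]

# Hypothetical scalar-quantized training pairs (1-dimensional vectors).
clean = [(0.0,), (1.0,), (2.0,), (1.0,)]
noisy = [(0.0,), (2.0,), (2.0,), (2.0,)]
est = sample_average_estimate((2.0,), clean, noisy)  # averages 1, 2, 1
```

Note that the whole training set must be scanned for every observed vector, which previews the memory and computation drawbacks discussed below.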
The sample average approach was first applied for enhancing speech signals by Porter and Boll in 1984 [Boll,
1992]. They, however, considered a simpler situation in which the true pdf of the noise was assumed known. In this
case, enhanced signals with residual noise characterized as being a blend of wideband noise and musical noise
were obtained. The balance between the two types of residual noise depended on the functional of the clean
signal which was estimated.

The advantages of the sample average approach are that it is conceptually simple and it does not require
a priori assumptions about the form of the pdf's of the signal and noise. Hence, it is a nonparametric estimation
approach. This approach, however, has three major disadvantages. First, the estimator does not utilize any
speech specific information such as the periodicity of the signal and the signal's AR model. Second, the training
sequences from the signal and noise must be available at the speech enhancement unit. Furthermore, these
training sequences must be applied for each newly observed vector of the noisy signal. Since the training
sequences are normally very long, the speech enhancement unit must have extensive memory and computational
resources. These problems are addressed in the model-based approach described next.
Model-Based Approach

The model-based approach [Ephraim, 1992] is a Bayesian approach for estimating the clean signal or any
functional of that signal from the observed noisy signal. This approach assumes CSMs for the clean signal and
noise process. The models are estimated from training sequences of those processes using the maximum
likelihood (ML) estimation approach. Under ideal conditions the ML model estimate is consistent and asymptotically
efficient. The ML model estimation is performed using the expectation-maximization (EM) or the
Baum iterative algorithm [Rabiner, 1989; Ephraim, 1992]. Given the CSMs for the signal and noise, the clean
signal is estimated by minimizing the expected value of the chosen distortion measure. The model-based
approach uses significantly more statistical knowledge about the signal and noise compared to either the spectral
subtraction or the sample average approaches.
The MMSE signal estimator is obtained from the conditional mean of the clean signal given the noisy signal.
If y_i ∈ R^K denotes the vector of the speech signal at time i, and z_0^i denotes the sequence of K-dimensional
vectors of noisy signals {z_0, . . ., z_i} from time t = 0 to t = i, then the MMSE estimator of y_i is given by

    ŷ_i = E{y_i | z_0^i} = Σ_{x̄_i} P(x̄_i | z_0^i) E{y_i | z_i, x̄_i}     (15.4)

where x̄_i denotes the composite state of the noisy signal at time i. This state is given by x̄_i = (x_i, x̃_i), where
x_i is the Markov state of the clean signal at time i and x̃_i denotes the Markov state of the noise process at the
same time instant i. The MMSE estimator, Eq. (15.4), comprises a weighted sum of conditional mean estimators
for the composite states of the noisy signal, where the weights are the probabilities of those states given the
noisy observed signal. A block diagram of this estimator is shown in Fig. 15.5.

FIGURE 15.5 HMM-based MMSE signal estimator.

The probability P(x̄_i | z_0^i) can be efficiently calculated using the forward recursion associated with HMMs.
For CSMs with Gaussian subsources, the conditional mean E{y_i | z_i, x̄_i} is a linear function of the noisy vector
z_i, given by

    E(y_i | z_i, x̄_i) = S_{x_i} (S_{x_i} + S_{x̃_i})^{-1} z_i = H_{x̄_i} z_i     (15.5)

where S_{x_i} and S_{x̃_i} denote the covariance matrices of the Gaussian subsources associated with the Markov states
x_i and x̃_i, respectively. Since, however, P(x̄_i | z_0^i) is a nonlinear function of the noisy signal z_0^i, the MMSE signal
estimator ŷ_i is a nonlinear function of the noisy signal z_0^i.
The MMSE estimator, Eq. (15.4), is intuitively appealing. It uses a predesigned set of filters {H_{x̄_i}} obtained
from training data of speech and noise. Each filter is optimal for a pair of subsources of the CSMs for the clean
signal and the noise process. Since each subsource represents a subset of signals from the corresponding source,
each filter is optimal for a pair of signal subsets from the speech and noise. The set of predesigned filters covers
all possible pairs of speech and noise signal subsets. Hence, for each noisy vector of speech there must exist an
optimal filter in the set of predesigned filters. Since, however, a vector of the noisy signal could possibly be
generated from any pair of subsources of the clean signal and noise, the most appropriate filter for a given
noisy vector is not known. Consequently, in estimating the signal vector at each time instant, all filters are tried
and their outputs are weighted by the probabilities of the filters being correct for the given noisy signal. Other
strategies for utilizing the predesigned set of filters are possible. For example, at each time instant only the most
likely filter can be applied to the noisy signal. This approach is more intuitive than that of the MMSE estimation.
It was first proposed in [Drucker, 1968] for a five-state model which comprises subsources for fricatives, stops,
vowels, glides, and nasals. This approach was shown by Ephraim and Merhav [Ephraim, 1992] to be optimal
only in an asymptotic MMSE sense.
The model-based MMSE appioach piovides ieasonably good enhanced speech quality with signifcantly less
stiuctuied iesidual noise than the spectial subtiaction appioach. This peifoimance was achieved foi white
Gaussian input noise at 10 dB input SNR using 512-2048 flteis. An impiovement of 5-6 dB in SNR was
achieved by this appioach. The model-based appioach, howevei, is moie elaboiate than the spectial subtiaction
appioach, since it involves two steps of tiaining and estimation, and tiaining must be peifoimed on suffciently
long data. The MMSE estimation appioach is usually supeiioi to the asymptotic MMSE enhancement appioach.
The ieason is that the MMSE appioach applies a soft decision" iathei than a haid decision" in choosing the
most appiopiiate fltei foi a given vectoi of the noisy signal.
A two-state veision of the MMSE estimatoi was fist applied to speech enhancement by McAulay and Malpass
in 1980 Ephiaim, 1992]. The two states coiiesponded to speech piesence and speech absence (silence) in the
noisy obseivations. The estimatoi foi the signal given that it is piesent in the noisy obseivations was imple-
mented by the spectial subtiaction appioach. The estimatoi foi the signal in the silence state" is obviously
equal to zeio. This appioach signifcantly impioved the peifoimance of the spectial subtiaction appioach.
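The soft-decision weighting of Eq. (15.4) can be sketched numerically. The following is a minimal illustration, not Ephraim's implementation: it assumes scalar Gaussian subsources, so each filter reduces to a Wiener gain (the scalar analogue of Eq. (15.5)), and it takes the pair posteriors as given rather than computing them with the forward recursion; all numbers are invented.

```python
def mmse_soft_filter(z, speech_vars, noise_vars, posteriors):
    """Soft-decision MMSE estimate: weight each pair's Wiener gain by the
    posterior probability of that (speech subsource, noise subsource) pair."""
    pairs = [(sv, nv) for sv in speech_vars for nv in noise_vars]
    y_hat = 0.0
    for (sv, nv), p in zip(pairs, posteriors):
        h = sv / (sv + nv)        # scalar analogue of H = S_x (S_x + S_xbar)^-1
        y_hat += p * h * z        # posterior-weighted filter output
    return y_hat

z = 1.0                           # one noisy observation
speech_vars = [4.0, 0.25]         # variances of two clean-speech subsources (invented)
noise_vars = [1.0]                # single noise subsource (invented)
posteriors = [0.9, 0.1]           # assumed P(pair | noisy data); must sum to 1
print(mmse_soft_filter(z, speech_vars, noise_vars, posteriors))   # 0.74
```

Replacing the weighted sum with the single filter of the most probable pair gives the hard-decision variant that the text contrasts with the MMSE estimator.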
Source Coding
An encoder for the clean signal maps vectors of that signal onto a finite set of representative signal vectors referred to as codewords. The mapping is performed by assigning each signal vector to its nearest neighbor codeword. The index of the chosen codeword is transmitted to the receiver in a signal communication system, and the signal is reconstructed using a copy of the chosen codeword. The codewords are designed to minimize the average distortion resulting from the nearest neighbor mapping. The codewords may simply represent waveform vectors of the signal. In another important application of low bit-rate speech coding, the codewords represent a set of parameter vectors of the AR model for the speech signal. Such coding systems synthesize the signal using the speech model in Fig. 15.2. The synthesis is performed using the encoded vector of AR coefficients as well as the parameters of the excitation signal. Reasonably good speech quality can be obtained using this coding approach at rates as low as 2400-4800 bits/s [Gersho and Gray, 1991].
When only noisy signals are available for coding, the encoder operates on the noisy signal while representing the clean signal. In this case, the encoder is designed by minimizing the average distortion between the clean signal and the encoded signal. Specifically, let y denote the vector of the clean signal to be encoded. Let z denote the corresponding given vector of the noisy signal. Let q denote the encoder. Let d denote a distortion measure. Then, the optimal encoder is designed by

min_q E{d(y, q(z))}    (15.6)
When the clean signal is available for encoding, the design problem is similarly defined, and it is obtained from Eq. (15.6) using z = y. The design problem in Eq. (15.6) is not standard, since the encoder operates on and represents different sources. The problem can be transformed into a standard coding problem by appropriately modifying the distortion measure. This was shown by Berger in 1971 and Ephraim and Gray in 1988 [Ephraim, 1992]. Specifically, define the modified distortion measure by

d'(z, q(z)) ≜ E{d(y, q(z)) | z}    (15.7)

Then, by using iterated expectation in Eq. (15.6), the design problem becomes

min_q E{d'(z, q(z))}    (15.8)
A useful class of encoders for speech signals are those obtained from vector quantization. Vector quantizers are designed using the Lloyd algorithm [Gersho and Gray, 1991]. This is an iterative algorithm in which the codewords and the nearest neighbor regions are alternately optimized. This algorithm can be applied to design vector quantizers for clean and noisy signals using the modified distortion measure.
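The Lloyd iteration can be sketched in a few lines. This is a hedged, one-dimensional illustration of the clean-signal case with squared-error distortion; in the noisy case the modified distortion measure of Eq. (15.7) would replace the plain squared error.

```python
import numpy as np

def lloyd(samples, codebook, iters=20):
    """One-dimensional Lloyd algorithm: alternately re-partition the training
    samples by nearest codeword and re-center each codeword at its cell mean."""
    samples = np.asarray(samples, dtype=float)
    codebook = np.array(codebook, dtype=float)
    for _ in range(iters):
        # nearest-neighbor assignment of every sample to a codeword
        idx = np.argmin(np.abs(samples[:, None] - codebook[None, :]), axis=1)
        for k in range(len(codebook)):
            if np.any(idx == k):               # leave empty cells unchanged
                codebook[k] = samples[idx == k].mean()
    return np.sort(codebook)

data = np.concatenate([np.full(50, -2.0), np.full(50, 2.0)])
print(lloyd(data, [-0.1, 0.1]))                # converges to the two cluster means
```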
The problem of designing vector quantizers for noisy signals is related to the problem of estimating the clean signals from the noisy signals, as was shown by Wolf and Ziv in 1970 and Ephraim and Gray in 1988 [Ephraim, 1992]. Specifically, optimal waveform vector quantizers in the MMSE sense can be designed by first estimating the clean signal and then quantizing the estimated signal. Both estimation and quantization are performed in the MMSE sense. Similarly, optimal quantization of the vector of parameters of the AR model for the speech signal in the Itakura-Saito sense can be performed in two steps of estimation and quantization. Specifically, the autocorrelation function of the clean signal, which approximately constitutes the sufficient statistics of that signal for estimating the AR model, is first estimated in the MMSE sense. Then, optimal vector quantization in the Itakura-Saito sense is applied to the estimated autocorrelation.
The estimation-quantization approach has been most popular in designing encoders for speech signals given noisy signals. Since such design requires explicit knowledge of the statistics of the clean signal and the noise process, and this knowledge is not available as argued in the second section, a variety of suboptimal encoders were proposed. Most of the research in this area focused on designing encoders for the AR model of the signal due to the importance of such encoders in low bit-rate speech coding. The proposed encoders mainly differ in the estimators they used and the functionals of the speech signal these estimators have been applied to. Important examples of functionals which have commonly been estimated include the signal waveform, autocorrelation, and the spectral magnitude. The primary set of estimators used for this application were obtained from the spectral subtraction approach and its derivatives. A version of the sample average estimator was also developed and applied to AR modeling by Juang and Rabiner in 1987 [Ephraim, 1992]. Recently, the HMM-based estimator of the autocorrelation function of the clean signal was used in AR model vector quantization [Ephraim, 1992].
The design of AR model-based encoders from noisy signals has been a very successful application of speech enhancement. In this case both the quality and intelligibility of the encoded signal can be improved compared to the case where the encoder is designed for the clean signal and the input noise is simply ignored. The reason is that the input noise has devastating effects on the performance of AR model-based speech coders, and any "reasonable" estimation approach can significantly improve the performance of those coders in noisy environments.
Signal Classification
In recognition of clean speech signals, a sample function of the signal is associated with one of the words in the vocabulary. The association or decision rule is designed to minimize the probability of classification error. When only noisy speech signals are available for recognition, a very similar problem results. Specifically, a sample
function of the noisy signal is now associated with one of the words in the vocabulary in a way which minimizes the probability of classification error. The only difference between the two problems is that the sample functions of the clean and noisy signals from a given word have different statistics. The problem in both cases is that of partitioning the sample space of the given acoustic signals from all words in the vocabulary into L partition cells, where L is the number of words in the vocabulary.
Let {W_l, l = 1, . . ., L} denote a set of words in a given vocabulary. Let z denote the acoustic noisy signal from some word in the vocabulary. Let Ω ≜ {Ω_1, . . ., Ω_L} be a partition of the sample space of the noisy signals. The probability of error associated with this partition is given by

P_e(Ω) = Σ_{l=1}^{L} P(W_l) ∫_{z ∉ Ω_l} p(z | W_l) dz    (15.9)
where P(W_l) is the a priori probability of occurrence of the lth word, and p(z | W_l) is the pdf of the noisy signal from the lth word. The minimization of P_e(Ω) is achieved by the well-known maximum a posteriori (MAP) decision rule. Specifically, z is associated with the word W_l for which p(z | W_l)P(W_l) > p(z | W_j)P(W_j) for all j = 1, . . ., L and j ≠ l. Ties are arbitrarily broken. In the absence of noise, the noisy signal z becomes a clean signal y, and the optimal recognizer is obtained by using the same decision rule with z = y. Hence, the only difference between recognition of clean signals and recognition of noisy signals is that in the first case the pdf's {p(y | W_l)} are used in the decision rule, while in the second case the pdf's {p(z | W_l)} are used in the same decision rule.
Note that optimal recognition of noisy signals requires explicit knowledge of the statistics of the clean signal and noise. Neither the clean signal nor any function of that signal needs to be estimated. Since, however, the statistics of the signal and noise are not explicitly available as argued in the second section, parametric models are usually assumed for these pdf's and their parameters are estimated from appropriate training data. Normally, HMMs with mixtures of Gaussian pdf's at each state are attributed to both the clean signal and noise process. It can be shown (similarly to the case of classification of clean signals dealt with by Merhav and Ephraim in 1991 [Ephraim, 1992]) that if the pdf's of the signal and noise are precisely HMMs and the training sequences are significantly longer than the test data, then the MAP decision rule which uses estimates of the pdf's of the signal and noise is asymptotically optimal.
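The MAP rule itself is compact enough to sketch. The illustration below replaces the HMM pdf's described in the text with assumed scalar Gaussian class pdf's; the means, variances, and priors are invented for the example.

```python
import math

def map_decide(z, priors, means, variances):
    """MAP word decision: choose l maximizing p(z | W_l) P(W_l),
    with each p(z | W_l) an assumed scalar Gaussian."""
    def log_score(l):
        m, v = means[l], variances[l]
        log_pdf = -0.5 * math.log(2.0 * math.pi * v) - (z - m) ** 2 / (2.0 * v)
        return log_pdf + math.log(priors[l])
    return max(range(len(priors)), key=log_score)

# two "words": W_0 modeled around 0.0 and W_1 around 3.0 (invented numbers)
print(map_decide(2.5, priors=[0.5, 0.5], means=[0.0, 3.0], variances=[1.0, 1.0]))  # 1
```

Working in log probabilities avoids underflow; with equal priors and equal variances the rule reduces to picking the nearest class mean.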
A key issue in applying hidden Markov modeling for recognition of speech signals is the matching of the energy contour of the signal to the energy contour of the model for that signal. Energy matching is required for two main reasons. First, speech signals are not strictly stationary and hence their energy contours cannot be reliably estimated from training data. Second, recording conditions during training and testing vary. An approach for gain adaptation was developed [Ephraim, 1992]. In this approach, HMMs for gain-normalized clean signals are designed and used together with gain contour estimates obtained from the noisy signals. The gain adaptation approach is implemented using the EM algorithm. This approach provides robust speech recognition at input SNRs greater than or equal to 10 dB.
The relation between signal classification and estimation was established in [Kailath, 1969] for continuous-time signals contaminated by additive statistically independent Gaussian white noise. It was shown that minimum probability of error classification can be achieved by applying the MAP decision rule to the causal MMSE estimator of the clean signal. This interesting theoretical result provides the intuitive basis for a popular approach for recognition of noisy speech signals. In this approach, the clean signal or some feature vector of the signal is first estimated and then recognition is applied. In the statistical framework of hidden Markov modeling, however, the direct recognition approach presented earlier is significantly simpler, since both the clean signal and the noisy signal are HMMs [Ephraim, 1992]. Hence, the complexity of recognizing the estimated signal is the same as that of recognizing the noisy signal directly.
Other commonly used approaches for recognition of noisy speech signals were developed for systems which are based on pattern recognition. When clean signals are available for recognition, these systems match the input signal to the nearest neighbor acoustic template which represents some word in the vocabulary. The templates mainly comprise spectral prototypes of the clean signals. The matching is performed using a distance measure between the clean input signal and the template. When only noisy signals are available for recognition,
several modifications of the pattern matching approach were proposed. Specifically, adapting the templates of the clean signal to reflect the presence of the noise was proposed by Roe in 1987 [Ephraim, 1992]; choosing templates for the noisy signal which are more robust than those obtained from adaptation of the templates for the clean signal was often proposed; and using distance measures which are robust to noise, such as the projection measure proposed by Mansour and Juang in 1989 [Ephraim, 1992]. These approaches, along with the prefiltering approach in the sampled signal case, are fairly intuitive and are relatively easy to implement. It is difficult, however, to establish their optimality in any well-defined sense. Another interesting approach based on robust statistics was developed by Merhav and Lee [Ephraim, 1992]. This approach was shown asymptotically optimal in the minimum probability of error sense within the hidden Markov modeling framework.
The speech recognition problem in noisy environments has also been a successful application of speech enhancement. Significant reduction in the error rate due to the noise presence was achieved by the various approaches mentioned above.
Comments
Three major aspects of speech enhancement were reviewed. These comprise improving the perception of speech signals in noisy environments and increasing the robustness of speech coders and recognition systems in noisy environments. The inherent difficulties associated with these problems were discussed, and the main solutions along with their strengths and weaknesses were presented. This section is an introductory presentation of the speech enhancement problem. A comprehensive treatment of the subject can be found in Lim [1979], Makhoul et al. [1989], Boll [1992], and Ephraim [1992]. Significant progress in understanding the problem and in developing new speech enhancement systems was made during the 1980s with the introduction of statistical model-based approaches. The speech enhancement problem, however, is far from being solved, and major progress is still needed. In particular, no speech enhancement system which is capable of simultaneously improving both the quality and intelligibility of the noisy signal is currently known. Progress in this direction can be made if more reliable statistical models for the speech signal and noise process, as well as meaningful distortion measures, can be found.
Defining Terms
Autoregressive model: Statistical model for resonant signals.
Classifier: Maps signal utterances into a finite set of word units, e.g., syllables.
Encoder: Maps signal vectors into a finite set of codewords. A vector quantizer is a particular type of encoder.
Hidden Markov model: Statistical model comprising several subsources controlled by a Markovian process.
Intelligibility: Objective quantitative measure of speech perception.
Noise: Any interfering signal adversely affecting the communication of the clean signal.
Quality: Subjective descriptive measure of speech perception.
Signal: Clean speech sample to be communicated with human or machine.
Signal-to-noise ratio: Ratio of the signal power to the noise power measured in decibels.
Speech enhancement: Improvement of perceptual aspects of speech signals.
Related Topics
48.1 Introduction • 73.2 Noise
References
S. F. Boll, "Speech enhancement in the 1980's: noise suppression with pattern matching," in Advances in Speech Signal Processing, S. Furui and M. M. Sondhi, Eds., New York: Marcel Dekker, 1992.
H. Drucker, "Speech processing in a high ambient noise environment," IEEE Trans. Audio Electroacoust., vol. 16, 1968.
Y. Ephraim, "Statistical model based speech enhancement systems," Proc. IEEE, vol. 80, 1992.
Y. Ephraim and H. L. Van Trees, "A signal subspace approach for speech enhancement," IEEE Trans. on Speech and Audio Processing, vol. 3, pp. 251-266, 1995.
A. Gersho and R. M. Gray, Vector Quantization and Signal Compression, Boston: Kluwer Academic Publishers, 1991.
T. Kailath, "A general likelihood-ratio formula for random signals in Gaussian noise," IEEE Trans. on Inform. Theory, vol. 15, 1969.
J. S. Lim, Ed., Speech Enhancement, Englewood Cliffs, N.J.: Prentice-Hall, 1983.
J. S. Lim and A. V. Oppenheim, "Enhancement and bandwidth compression of noisy speech," Proc. IEEE, vol. 67, 1979.
J. Makhoul, T. H. Crystal, D. M. Green, D. Hogan, R. J. McAulay, D. B. Pisoni, R. D. Sorkin, and T. G. Stockham, Removal of Noise From Noise-Degraded Speech Signals, Washington, D.C.: National Academy Press, 1989.
L. R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proc. IEEE, vol. 77, 1989.
M. R. Weiss, E. Aschkenasy, and T. W. Parsons, "Processing speech signals to attenuate interference," in IEEE Symp. on Speech Recognition, Pittsburgh, 1974.
Further Information
A comprehensive treatment of the speech enhancement problem can be found in the four tutorial papers and book listed below.
J. S. Lim and A. V. Oppenheim, "Enhancement and bandwidth compression of noisy speech," Proc. IEEE, vol. 67, 1979.
J. Makhoul, T. H. Crystal, D. M. Green, D. Hogan, R. J. McAulay, D. B. Pisoni, R. D. Sorkin, and T. G. Stockham, Removal of Noise From Noise-Degraded Speech Signals, Washington, D.C.: National Academy Press, 1989.
S. F. Boll, "Speech enhancement in the 1980's: noise suppression with pattern matching," in Advances in Speech Signal Processing, S. Furui and M. M. Sondhi, Eds., New York: Marcel Dekker, 1992.
Y. Ephraim, "Statistical model based speech enhancement systems," Proc. IEEE, vol. 80, 1992.
J. S. Lim, Ed., Speech Enhancement, Englewood Cliffs, N.J.: Prentice-Hall, 1983.
15.3 Analysis and Synthesis
Jesse W. Fussell
After an acoustic speech signal is converted to an electrical signal by a microphone, it may be desirable to analyze the electrical signal to estimate some time-varying parameters which provide information about a model of the speech production mechanism. Speech analysis is the process of estimating such parameters. Similarly, given some parametric model of speech production and a sequence of parameters for that model, speech synthesis is the process of creating an electrical signal which approximates speech. While analysis and synthesis techniques may be done either on the continuous signal or on a sampled version of the signal, most modern analysis and synthesis methods are based on digital signal processing.
A typical speech production model is shown in Fig. 15.6. In this model the output of the excitation function is scaled by the gain parameter and then filtered to produce speech. All of these functions are time-varying.
FIGURE 15.6 A general speech production model.
For many models, the parameters are varied at a periodic rate, typically 50 to 100 times per second. Most speech information is contained in the portion of the signal below about 4 kHz.
The excitation is usually modeled as either a mixture or a choice of random noise and periodic waveform. For human speech, voiced excitation occurs when the vocal folds in the larynx vibrate; unvoiced excitation occurs at constrictions in the vocal tract which create turbulent air flow [Flanagan, 1965]. The relative mix of these two types of excitation is termed "voicing." In addition, the periodic excitation is characterized by a fundamental frequency, termed pitch or F0. The excitation is scaled by a factor designed to produce the proper amplitude or level of the speech signal. The scaled excitation function is then filtered to produce the proper spectral characteristics. While the filter may be nonlinear, it is usually modeled as a linear function.
Analysis of Excitation
In a simplified form, the excitation function may be considered to be purely periodic, for voiced speech, or purely random, for unvoiced. These two states correspond to voiced phonetic classes such as vowels and nasals and unvoiced sounds such as unvoiced fricatives. This binary voicing model is an oversimplification for sounds such as voiced fricatives, which consist of a mixture of periodic and random components. Figure 15.7 is an example of a time waveform of a spoken /i/ phoneme, which is well modeled by only periodic excitation.
Both time domain and frequency domain analysis techniques have been used to estimate the degree of voicing for a short segment or frame of speech. One time domain feature, termed the zero crossing rate, is the number of times the signal changes sign in a short interval. As shown in Fig. 15.7, the zero crossing rate for voiced sounds is relatively low. Since unvoiced speech typically has a larger proportion of high-frequency energy than voiced speech, the ratio of high-frequency to low-frequency energy is a frequency domain technique that provides information on voicing.
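The two voicing cues just described can be sketched as follows; the synthetic frame, sampling rate, and 2-kHz split frequency are illustrative choices, not values from the text.

```python
import numpy as np

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    signs = np.sign(frame)
    signs[signs == 0] = 1                        # treat exact zeros as positive
    return float(np.mean(signs[1:] != signs[:-1]))

def hf_lf_ratio(frame, fs, split_hz=2000.0):
    """Ratio of spectral energy above/below split_hz (frequency-domain voicing cue)."""
    spec = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    return float(spec[freqs >= split_hz].sum() / spec[freqs < split_hz].sum())

fs = 8000
t = np.arange(256) / fs
voiced_like = np.sin(2 * np.pi * 141 * t)        # low-frequency "voiced" frame
print(zero_crossing_rate(voiced_like), hf_lf_ratio(voiced_like, fs))
```

For this voiced-like frame both values come out small; a noise-like frame would drive both upward.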
Another measure used to estimate the degree of voicing is the autocorrelation function, which is defined for a sampled speech segment, S, as

ACF(τ) = (1/N) Σ_{n=0}^{N-1} s(n) s(n + τ)    (15.10)
where s(n) is the value of the nth sample within the segment of length N. Since the autocorrelation function of a periodic function is itself periodic, voicing can be estimated from the degree of periodicity of the autocorrelation function. Figure 15.8 is a graph of the nonnegative terms of the autocorrelation function for a 64-ms frame of the waveform of Fig. 15.7. Except for the decrease in amplitude with increasing lag, which results from the rectangular window function which delimits the segment, the autocorrelation function is seen to be quite periodic for this voiced utterance.
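Equation (15.10) leads directly to a simple pitch estimator: compute the ACF and take the lag of its largest peak within a plausible pitch range. The frame below is a synthetic 141-Hz sinusoid standing in for the /i/ waveform, and the 60-400 Hz search range is an assumed choice.

```python
import numpy as np

def autocorr(s, max_lag):
    """ACF(tau) = (1/N) sum_n s(n) s(n + tau), as in Eq. (15.10)."""
    N = len(s)
    return np.array([np.dot(s[:N - tau], s[tau:]) / N for tau in range(max_lag + 1)])

def pitch_from_acf(s, fs, f_lo=60.0, f_hi=400.0):
    """Estimate F0 as fs over the lag of the ACF maximum in [fs/f_hi, fs/f_lo]."""
    lo, hi = int(fs / f_hi), int(fs / f_lo)
    acf = autocorr(s, hi)
    tau = lo + int(np.argmax(acf[lo:hi + 1]))
    return fs / tau

fs = 8000
t = np.arange(512) / fs
frame = np.sin(2 * np.pi * 141 * t)              # synthetic voiced frame
print(pitch_from_acf(frame, fs))                 # close to 141 Hz
```

Restricting the search range is one practical guard against the doubling errors described above, since the first formant lag then falls outside the candidate set.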
FIGURE 15.7 Waveform of a spoken phoneme /i/ as in beet.
If an analysis of the voicing of the speech signal indicates a voiced or periodic component is present, another step in the analysis process may be to estimate the frequency (or period) of the voiced component. There are a number of ways in which this may be done. One is to measure the time lapse between peaks in the time domain signal. For example, in Fig. 15.7 the major peaks are separated by about 0.0071 s, for a fundamental frequency of about 141 Hz. Note, it would be quite possible to err in the estimate of fundamental frequency by mistaking the smaller peaks that occur between the major peaks for the major peaks. These smaller peaks are produced by resonance in the vocal tract which, in this example, happen to be at about twice the excitation frequency. This type of error would result in an estimate of pitch approximately twice the correct frequency.
The distance between major peaks of the autocorrelation function is a closely related feature that is frequently used to estimate the pitch period. In Fig. 15.8, the distance between the major peaks in the autocorrelation function is about 0.0071 s. Estimates of pitch from the autocorrelation function are also susceptible to mistaking the first vocal tract resonance for the glottal excitation frequency.
The absolute magnitude difference function (AMDF), defined as

AMDF(τ) = (1/N) Σ_{n=0}^{N-1} |s(n) - s(n + τ)|    (15.11)
is another function which is often used in estimating the pitch of voiced speech. An example of the AMDF is shown in Fig. 15.9 for the same 64-ms frame of the /i/ phoneme. For the AMDF, however, the minima are used as indicators of the pitch period. The AMDF has been shown to be a good pitch period indicator [Ross et al., 1974] and does not require multiplications.
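Equation (15.11) can be used the same way, except that the pitch period is taken at the AMDF minimum; note the inner computation uses only subtractions and absolute values. The search range and synthetic test signal are again illustrative choices.

```python
import numpy as np

def amdf(s, max_lag):
    """AMDF(tau) = (1/N) sum_n |s(n) - s(n + tau)|, as in Eq. (15.11)."""
    N = len(s)
    return np.array([np.abs(s[:N - tau] - s[tau:]).sum() / N
                     for tau in range(max_lag + 1)])

def pitch_period_from_amdf(s, fs, f_lo=60.0, f_hi=400.0):
    """Pitch period in samples: the AMDF minimum within the search range."""
    lo, hi = int(fs / f_hi), int(fs / f_lo)
    d = amdf(s, hi)
    return lo + int(np.argmin(d[lo:hi + 1]))

fs = 8000
t = np.arange(512) / fs
frame = np.sin(2 * np.pi * 141 * t)
tau = pitch_period_from_amdf(frame, fs)          # in samples; fs/tau estimates F0
print(tau, fs / tau)
```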
Fourier Analysis
One of the more common processes for estimating the spectrum of a segment of speech is the Fourier transform [Oppenheim and Schafer, 1975]. The Fourier transform of a sequence is mathematically defined as

S(e^{jω}) = Σ_{n=-∞}^{∞} s(n) e^{-jωn}    (15.12)

where s(n) represents the terms of the sequence. The short-time Fourier transform of a sequence is a time-dependent function, defined as
FIGURE 15.8 Autocorrelation function of one frame of /i/.
S_n(e^{jω}) = Σ_{m=-∞}^{∞} w(n - m) s(m) e^{-jωm}    (15.13)

where the window function w(m) is usually zero except for some finite range, and the variable n is used to select the section of the sequence for analysis. The discrete Fourier transform (DFT) is obtained by uniformly sampling the short-time Fourier transform in the frequency dimension. Thus an N-point DFT is computed using Eq. (15.14),

S(k) = Σ_{n=0}^{N-1} s(n) e^{-j2πnk/N}    (15.14)
where the set of N samples, s(n), may have first been multiplied by a window function. An example of the magnitude of a 512-point DFT of the waveform of the /i/ from Fig. 15.7 is shown in Fig. 15.10. Note for this figure, the 512 points in the sequence have been multiplied by a Hamming window defined by
FIGURE 15.9 Absolute magnitude difference function of one frame of /i/.
FIGURE 15.10 Magnitude of 512-point FFT of Hamming windowed /i/.
w(n) = 0.54 - 0.46 cos(2πn/(N - 1)) for 0 ≤ n ≤ N - 1, and w(n) = 0 otherwise    (15.15)

Since the spectral characteristics of speech may change dramatically in a few milliseconds, the length, type, and location of the window function are important considerations. If the window is too long, changing spectral characteristics may cause a blurred result; if the window is too short, spectral inaccuracies result. A Hamming window of 16 to 32 ms duration is commonly used for speech analysis.
Several characteristics of a speech utterance may be determined by examination of the DFT magnitude. In Fig. 15.10, the DFT of a voiced utterance contains a series of sharp peaks in the frequency domain. These peaks, caused by the periodic sampling action of the glottal excitation, are separated by the fundamental frequency, which is about 141 Hz in this example. In addition, broader peaks can be seen, for example at about 300 Hz and at about 2300 Hz. These broad peaks, called formants, result from resonances in the vocal tract.
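The Hamming-windowed DFT of Eqs. (15.14) and (15.15) can be sketched directly; the synthetic 141-Hz frame below stands in for the /i/ waveform of Fig. 15.7.

```python
import numpy as np

fs = 8000
N = 512
n = np.arange(N)
frame = np.sin(2 * np.pi * 141 * n / fs)         # stand-in for the /i/ frame

hamming = 0.54 - 0.46 * np.cos(2 * np.pi * n / (N - 1))   # Eq. (15.15)
spectrum = np.abs(np.fft.fft(frame * hamming))   # N-point DFT magnitude, Eq. (15.14)

bin_hz = fs / N                                  # frequency resolution per DFT bin
peak = int(np.argmax(spectrum[:N // 2]))         # strongest bin below fs/2
print(peak * bin_hz)                             # within one bin of 141 Hz
```

With N = 512 at 8 kHz the bin spacing is 15.625 Hz, so the sharp harmonic peaks the text describes can only be located to within one bin.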
Linear Predictive Analysis
Given a sampled (discrete-time) signal s(n), a powerful and general parametric model for time series analysis is

s(n) = Σ_{k=1}^{p} a(k) s(n - k) + G Σ_{l=0}^{q} b(l) u(n - l)    (15.16)

where s(n) is the output and u(n) is the input (perhaps unknown). The model parameters are a(k) for k = 1, . . ., p; b(l) for l = 1, . . ., q; and G. b(0) is assumed to be unity. This model, described as an autoregressive moving average (ARMA) or pole-zero model, forms the foundation for the analysis method termed linear prediction. An autoregressive (AR) or all-pole model, for which all of the "b" coefficients except b(0) are zero, is frequently used for speech analysis [Markel and Gray, 1976].
In the standard AR formulation of linear prediction, the model parameters are selected to minimize the mean-squared error between the model and the speech data. In one of the variants of linear prediction, the autocorrelation method, the minimization is carried out for a windowed segment of data. In the autocorrelation method, minimizing the mean-square error of the time domain samples is equivalent to minimizing the integrated ratio of the signal spectrum to the spectrum of the all-pole model. Thus, linear predictive analysis is a good method for spectral analysis whenever the signal is produced by an all-pole system. Most speech sounds fit this model well.
One key consideration for linear predictive analysis is the order of the model, p. For speech, if the order is too small, the formant structure is not well represented. If the order is too large, pitch pulses as well as formants begin to be represented. Tenth- or twelfth-order analysis is typical for speech. Figures 15.11 and 15.12 provide examples of the spectrum produced by eighth-order and sixteenth-order linear predictive analysis of the /i/ waveform of Fig. 15.7. Figure 15.11 shows there to be three formants at frequencies of about 300, 2300, and 3200 Hz, which are typical for an /i/.
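A minimal sketch of the autocorrelation method follows: compute the autocorrelation of the frame and solve for the a(k) of Eq. (15.16) with the Levinson-Durbin recursion. The second-order test signal is synthetic, chosen so the recovered coefficients can be checked against known values.

```python
import numpy as np

def lpc_autocorrelation(frame, order):
    """Autocorrelation-method LPC: Levinson-Durbin solve for A(z) = 1 + a1 z^-1 + ..."""
    N = len(frame)
    r = np.array([np.dot(frame[:N - k], frame[k:]) for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]                                   # prediction error energy
    for i in range(1, order + 1):
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err   # reflection coefficient
        a[1:i] = a[1:i] + k * a[i - 1:0:-1]      # update lower-order coefficients
        a[i] = k
        err *= (1.0 - k * k)
    return a, err

# synthetic AR(2) test signal: s(n) = 0.9 s(n-1) - 0.2 s(n-2) + u(n)
rng = np.random.default_rng(0)
u = rng.standard_normal(4096)
s = np.zeros(4096)
for m in range(2, 4096):
    s[m] = 0.9 * s[m - 1] - 0.2 * s[m - 2] + u[m]
a, err = lpc_autocorrelation(s, 2)
print(a)   # A(z) coefficients, approximately [1, -0.9, 0.2]
```

For speech one would use a windowed 16-32 ms frame and an order of 10 or 12, as discussed above; the smooth all-pole spectrum is then G / |A(e^{jω})|².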
Homomorphic (Cepstral) Analysis
For the speech model of Fig. 15.6, the excitation and filter impulse response are convolved to produce the speech. One of the problems of speech analysis is to separate or deconvolve the speech into these two components. One such technique is called homomorphic filtering [Oppenheim and Schafer, 1968]. The characteristic system for homomorphic deconvolution converts a convolution operation to an addition operation. The output of such a characteristic system is called the complex cepstrum. The complex cepstrum is defined as the inverse Fourier transform of the complex logarithm of the Fourier transform of the input. If the input sequence is minimum phase (i.e., the z-transform of the input sequence has no poles or zeros outside the unit circle), the sequence can be represented by the real portion of the transforms. Thus, the real cepstrum can be computed by calculating the inverse Fourier transform of the log-spectrum of the input.
Figure 15.13 shows an example of the cepstrum for the voiced /i/ utterance from Fig. 15.7. The cepstrum of such a voiced utterance is characterized by relatively large values in the first one or two milliseconds as well as by pulses of decaying amplitudes at multiples of the pitch period. The first two of these pulses can clearly be seen in Fig. 15.13 at time lags of 7.1 and 14.2 ms. The location and amplitudes of these pulses may be used to estimate pitch and voicing [Rabiner and Schafer, 1978].
In addition to pitch and voicing estimation, a smooth log magnitude function may be obtained by windowing or "liftering" the cepstrum to eliminate the terms which contain the pitch information. Figure 15.14 is one such smoothed spectrum. It was obtained from the DFT of the cepstrum of Fig. 15.13 after first setting all terms of the cepstrum to zero except for the first 16.
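The real cepstrum computation described above can be sketched directly: take the inverse FFT of the log magnitude spectrum, read pitch from the cepstral peak beyond the low-time region, and lifter by keeping only the first 16 terms as in Fig. 15.14. The harmonic test frame is synthetic, and zeroing the upper cepstral terms (rather than liftering symmetrically) follows the text's description.

```python
import numpy as np

fs = 8000
N = 512
n = np.arange(N)
# synthetic voiced frame: 141-Hz fundamental plus decaying harmonics
frame = sum(np.sin(2 * np.pi * 141 * h * n / fs) / h for h in range(1, 10))
frame = frame * (0.54 - 0.46 * np.cos(2 * np.pi * n / (N - 1)))  # Hamming window

log_mag = np.log(np.abs(np.fft.fft(frame)) + 1e-12)  # log magnitude spectrum
cep = np.fft.ifft(log_mag).real                      # real cepstrum

# pitch: largest cepstral peak beyond the low-time (~2 ms) region
lo = int(0.002 * fs)
tau = lo + int(np.argmax(cep[lo:N // 2]))
print(tau / fs)                                      # near 1/141 s = 7.1 ms

# smoothed log spectrum: zero all but the first 16 cepstral terms, then DFT
lift = cep.copy()
lift[16:] = 0.0
smooth = np.fft.fft(lift).real
```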
Speech Synthesis
Speech synthesis is the creation of speech-like waveforms from textual words or symbols. In general, the speech synthesis process may be divided into three levels of processing [Klatt, 1982]. The first level transforms the text into a series of acoustic phonetic symbols, the second transforms those symbols to smoothed synthesis parameters, and the third level generates the speech waveform from the parameters. While speech synthesizers have
FIGURE 15.11 Eighth-order linear predictive analysis of an "i".
FIGURE 15.12 Sixteenth-order linear predictive analysis of an "i".
been designed for a variety of languages and the processes described here apply to several languages, the examples given are for English text-to-speech.
In the first level of processing, abbreviations such as "Dr." (which could mean "doctor" or "drive"), numbers ("1492" could be a year or a quantity), special symbols such as "$", upper case acronyms (e.g., NASA), and nonspoken symbols such as "'" (apostrophe) are converted to a standard form. Next prefixes and perhaps suffixes are removed from the body of words prior to searching for the root word in a lexicon, which defines the phonetic representation for the word. The lexicon includes words which do not obey the normal rules of pronunciation, such as "of". If the word is not contained in the lexicon, it is processed by an algorithm which contains a large set of rules of pronunciation.
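The first-level processing can be caricatured as table lookup. The abbreviation table and phoneme strings below are invented for the sketch, and a real system would need surrounding context to resolve ambiguous cases such as "Dr.".

```python
import re

# hypothetical tables for the first processing level (invented entries)
ABBREVIATIONS = {"Dr.": "doctor", "St.": "street"}
LEXICON = {"doctor": "D AA K T ER", "smith": "S M IH TH"}   # made-up phoneme strings

def normalize(text):
    """Expand abbreviations and special symbols to a standard spoken form."""
    for abbr, full in ABBREVIATIONS.items():
        text = text.replace(abbr, full)
    text = text.replace("$", " dollars ")
    return re.sub(r"\s+", " ", text).strip().lower()

def to_phonemes(word):
    """Lexicon lookup; unknown words would fall through to letter-to-sound rules."""
    return LEXICON.get(word, "<letter-to-sound rules>")

print(normalize("Dr. Smith"))      # -> "doctor smith"
print(to_phonemes("doctor"))       # -> "D AA K T ER"
```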
In the second level, the sequences of words consisting of phrases or sentences are analyzed for grammar and syntax. This analysis provides information to another set of rules which determine the stress, duration, and pitch to be added to the phonemic representation. This level of processing may also alter the phonemic representation of individual words to account for coarticulation effects. Finally, the sequences of parameters which specify the pronunciation are smoothed in an attempt to mimic the smooth movements of the human articulators (lips, jaw, velum, and tongue).
The last processing level converts the smoothed parameters into a time waveform. Many varieties of waveform synthesizers have been used, including formant, linear predictive, and filter-bank versions. These waveform
FIGURE 15.13 Real cepstrum of /i/.
FIGURE 15.14 Smoothed spectrum of /i/ from 16 points of cepstrum.
synthesizers generally correspond to the synthesizers used in speech coding systems which are described in the first section of this chapter.
Defining Terms
Cepstrum: Inverse Fourier transform of the logarithm of the Fourier power spectrum of a signal. The complex cepstrum is the inverse Fourier transform of the complex logarithm of the Fourier transform of the signal.
Pitch: Frequency of glottal vibration of a voiced utterance.
Spectrum or power density spectrum: Amplitude of a signal as a function of frequency, frequently defined as the Fourier transform of the autocovariance of the signal.
Speech analysis: Process of extracting time-varying parameters from the speech signal which represent a model for speech production.
Speech synthesis: Production of a speech signal from a model for speech production and a set of time-varying parameters of that model.
Voicing: Classification of a speech segment as being voiced (i.e., produced by glottal excitation), unvoiced (i.e., produced by turbulent air flow at a constriction), or some mix of those two.
Related Topic
14.1 Fourier Transforms
References
J. Allen, "Synthesis of speech from unrestricted text," Proc. IEEE, vol. 64, no. 4, pp. 433–442, 1976.
J. L. Flanagan, Speech Analysis, Synthesis and Perception, Berlin: Springer-Verlag, 1965.
D. H. Klatt, "The Klattalk text-to-speech system," IEEE Int. Conf. on Acoustics, Speech and Signal Proc., pp. 1589–1592, Paris, 1982.
J. D. Markel and A. H. Gray, Jr., Linear Prediction of Speech, Berlin: Springer-Verlag, 1976.
A. V. Oppenheim and R. W. Schafer, "Homomorphic analysis of speech," IEEE Trans. Audio Electroacoust., pp. 221–226, 1968.
A. V. Oppenheim and R. W. Schafer, Digital Signal Processing, Englewood Cliffs, N.J.: Prentice-Hall, 1975.
D. O'Shaughnessy, Speech Communication, Reading, Mass.: Addison-Wesley, 1987.
L. R. Rabiner and R. W. Schafer, Digital Processing of Speech Signals, Englewood Cliffs, N.J.: Prentice-Hall, 1978.
M. J. Ross, H. L. Shaffer, A. Cohen, R. Freudberg, and H. J. Manley, "Average magnitude difference function pitch extractor," IEEE Trans. Acoustics, Speech and Signal Proc., vol. ASSP-22, pp. 353–362, 1974.
R. W. Schafer and J. D. Markel, Speech Analysis, New York: IEEE Press, 1979.
Further Information
The monthly magazine IEEE Transactions on Signal Processing, formerly IEEE Transactions on Acoustics, Speech and Signal Processing, frequently contains articles on speech analysis and synthesis. In addition, the annual conference of the IEEE Signal Processing Society, the International Conference on Acoustics, Speech, and Signal Processing, is a rich source of papers on the subject.
15.4 Speech Recognition
Lynn D. Wilcox and Marcia A. Bush
Speech recognition is the process of translating an acoustic signal into a linguistic message. In certain applications, the desired form of the message is a verbatim transcription of a sequence of spoken words. For example, in using speech recognition technology to automate dictation or data entry to a computer, transcription accuracy is of prime importance. In other cases, such as when speech recognition is used as an interface to a database query system or to index by keyword into audio recordings, word-for-word transcription is less critical. Rather, the message must contain only enough information to reliably communicate the speaker's goal. The use of speech recognition technology to facilitate a dialog between person and computer is often referred to as "spoken language processing."
Speech recognition by machine has proven an extremely difficult task. One complicating factor is that, unlike written text, no clear spacing exists between spoken words; speakers typically utter full phrases or sentences without pause. Furthermore, acoustic variability in the speech signal typically precludes an unambiguous mapping to a sequence of words or subword units, such as phones.¹ One major source of variability in speech is coarticulation, or the tendency for the acoustic characteristics of a given speech sound or phone to differ depending upon the phonetic context in which it is produced. Other sources of acoustic variability include differences in vocal-tract size, dialect, speaking rate, speaking style, and communication channel.
Speech recognition systems can be constrained along a number of dimensions in order to make the recognition problem more tractable. Training the parameters of a recognizer to the speech of the user is one way of reducing variability and, thus, increasing recognition accuracy. Recognizers are categorized as speaker-dependent or speaker-independent, depending upon whether or not full training is required by each new user. Speaker-adaptive systems adjust automatically to the voice of a new talker, either on the basis of a relatively small amount of training data or on a continuing basis while the system is in use.
Recognizers can also be categorized by the speaking styles, vocabularies, and language models they accommodate. Isolated word recognizers require speakers to insert brief pauses between individual words. Continuous speech recognizers operate on fluent speech, but typically employ strict language models, or grammars, to limit the number of allowable word sequences. Wordspotters also accept fluent speech as input. However, rather than providing full transcription, wordspotters selectively locate relevant words or phrases in an utterance. Wordspotting is useful both in information-retrieval tasks based on keyword indexing and as an alternative to isolated word recognition in voice command applications.
Speech Recognition System Architecture
Figure 15.15 shows a block diagram of a speech recognition system. Speech is typically input to the system using an analog transducer, such as a microphone, and converted to digital form. Signal pre-processing consists of computing a sequence of acoustic feature vectors by processing the speech samples in successive time intervals. In some systems, a clustering technique known as vector quantization is used to convert these continuous-valued features to a sequence of discrete codewords drawn from a codebook of acoustic prototypes. Recognition of an unknown utterance involves transforming the sequence of feature vectors, or codewords, into an appropriate message. The recognition process is typically constrained by a set of acoustic models which correspond to the basic units of speech employed in the recognizer, a lexicon which defines the vocabulary of the recognizer
¹Phones correspond roughly to pronunciations of consonants and vowels.
FIGURE 15.15 Architecture for a speech recognition system.
in terms of these basic units, and a language model which specifies allowable sequences of vocabulary items. The acoustic models, and in some cases the language model and lexicon, are learned from a set of representative training data. These components are discussed in greater detail in the remainder of this chapter, as are the two recognition paradigms most frequently employed in speech recognition: dynamic time warping and hidden Markov models.
Signal Pre-Processing
An amplitude waveform and speech spectrogram of the sentence "Two plus seven is less than ten" are shown in Fig. 15.16. The spectrogram represents the time evolution (horizontal axis) of the frequency spectrum (vertical axis) of the speech signal, with darkness corresponding to high energy. In this example, the speech has been digitized at a sampling rate of 16 kHz, or roughly twice the highest frequency of relevant energy in a high-quality speech signal. In general, the appropriate sampling rate is a function of the communication channel. In telecommunications, for example, a bandwidth of 4 kHz, and, thus, a Nyquist sampling rate of 8 kHz, is standard.
The speech spectrum can be viewed as the product of a source spectrum and the transfer function of a linear, time-varying filter which represents the changing configuration of the vocal tract. The transfer function determines the shape, or envelope, of the spectrum, which carries phonetic information in speech. When excited by a voicing source, the formants, or natural resonant frequencies of the vocal tract, appear as black bands running horizontally through regions of the speech spectrogram. These regions represent voiced segments of speech and correspond primarily to vowels. Regions characterized by broadband high-frequency energy, and by extremely low energy, result from noise excitation and vocal-tract closures, respectively, and are associated with the articulation of consonantal sounds.
Feature extraction for speech recognition involves computing sequences of numeric measurements, or feature vectors, which typically approximate the envelope of the speech spectrum. Spectral features can be extracted directly from the discrete Fourier transform (DFT) or computed using linear predictive coding (LPC) techniques. Cepstral analysis can also be used to deconvolve the spectral envelope and the periodic voicing source. Each feature vector is computed from a frame of speech data defined by windowing N samples of the signal. While a better spectral estimate can be obtained using more samples, the interval must be short enough so that the windowed signal is roughly stationary. For speech data, N is chosen such that the length of the interval covered by the window is approximately 25 to 30 msec. The feature vectors are typically computed at a frame rate of 10 to 20 msec by shifting the window forward in time. Tapered windowing functions, such as the Hamming window, are used to reduce dependence of the spectral estimate on the exact temporal position of
FIGURE 15.16 Speech spectrogram of the utterance "Two plus seven is less than ten." (Source: V.W. Zue, "The use of speech knowledge in automatic speech recognition," Proc. IEEE, vol. 73, no. 11, pp. 1602–1615, © 1985 IEEE. With permission.)
the window. Spectral features are often augmented with a measure of the short-time energy of the signal, as well as with measures of energy and spectral change over time [Lee, 1988].
For recognition systems which use discrete features, vector quantization can be used to quantize continuous-valued feature vectors into a set or codebook of K discrete symbols, or codewords [Gray, 1984]. The K codewords are characterized by prototypes y_1 . . . y_K. A feature vector x is quantized to the kth codeword if the distance from x to y_k, or d(x, y_k), is less than the distance from x to any other codeword. The distance d(x, y) depends on the type of features being quantized. For features derived from the short-time spectrum and cepstrum, this distance is typically Euclidean or weighted Euclidean. For LPC-based features, the Itakura metric, which is based on spectral distortion, is typically used [Furui, 1989].
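A minimal sketch of this quantization rule, using the Euclidean distance and an invented three-codeword codebook (a real system would train the codebook, e.g., by clustering, and might use a weighted Euclidean or Itakura distance instead):

```python
import numpy as np

def quantize(x, codebook):
    """Return the index of the codeword nearest to feature vector x.

    Squared Euclidean distance, appropriate for spectral or cepstral
    features as noted in the text.
    """
    dists = np.sum((codebook - x) ** 2, axis=1)
    return int(np.argmin(dists))

# Toy codebook of K = 3 two-dimensional prototypes (illustrative values).
codebook = np.array([[0.0, 0.0],
                     [1.0, 1.0],
                     [4.0, 0.0]])
print(quantize(np.array([0.9, 1.2]), codebook))  # 1
```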
Dynamic Time Warping
Dynamic time warping (DTW) is a technique for nonlinear time alignment of pairs of spoken utterances. DTW-based speech recognition, often referred to as "template matching," involves aligning feature vectors extracted from an unknown utterance with those from a set of exemplars or templates obtained from training data. Nonlinear feature alignment is necessitated by nonlinear time-scale warping associated with variations in speaking rate.
Figure 15.17 illustrates the time correspondence between two utterances, A and B, represented as feature-vector sequences of unequal length. The time warping function consists of a sequence of points F = c_1, . . . , c_K in the plane spanned by A and B, where c_k = (i_k, j_k). The local distance between the feature vectors a_i and b_j on the warping path at point c = (i, j) is given as

    d(c) = d(a_i, b_j)        (15.17)
The distance between A and B aligned with warping function F is a weighted sum of the local distances along the path,
FIGURE 15.17 Dynamic time warping of utterances A and B. (Source: S. Furui, Digital Speech Processing, Synthesis and Recognition, New York: Marcel Dekker, 1989. With permission.)
    D(F) = (1/N) Σ_{k=1}^{K} d(c_k) w_k        (15.18)
where w_k is a nonnegative weighting function and N is the sum of the weights. Path constraints and weighting functions are chosen to control whether or not the distance D(F) is symmetric and the allowable degree of warping in each direction. Dynamic programming is used to efficiently determine the optimal time alignment between two feature-vector sequences [Sakoe and Chiba, 1978].
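The dynamic-programming computation of Eqs. (15.17) and (15.18) can be sketched as follows. The symmetric step pattern used here (diagonal weight 2, horizontal and vertical weight 1, so the weights sum to N = I + J) is one common choice, not the only one; the toy sequences are invented:

```python
import numpy as np

def dtw_distance(A, B):
    """Symmetric DTW distance D(F) between feature-vector sequences A and B."""
    I, J = len(A), len(B)
    d = np.full((I, J), np.inf)
    d[0, 0] = 2 * np.linalg.norm(A[0] - B[0])
    for i in range(I):
        for j in range(J):
            if i == 0 and j == 0:
                continue
            local = np.linalg.norm(A[i] - B[j])  # d(c) = d(a_i, b_j)
            cands = []
            if i > 0:
                cands.append(d[i - 1, j] + local)          # vertical step
            if j > 0:
                cands.append(d[i, j - 1] + local)          # horizontal step
            if i > 0 and j > 0:
                cands.append(d[i - 1, j - 1] + 2 * local)  # diagonal step
            d[i, j] = min(cands)
    return d[I - 1, J - 1] / (I + J)  # normalize by N, the sum of weights

# Toy one-dimensional "feature" sequences of unequal length.
A = np.array([[0.0], [1.0], [2.0], [1.0]])
B = np.array([[0.0], [1.0], [1.0], [2.0], [1.0]])
print(dtw_distance(A, B))  # 0.0: B is A with one frame repeated
```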
In DTW-based recognition, one or more templates are generated for each word in the recognition vocabulary. For speaker-dependent recognition tasks, templates are typically created by aligning and averaging the feature vectors corresponding to several repetitions of a word. For speaker-independent tasks, clustering techniques can be used to generate templates which better model pronunciation variability across talkers. In isolated word recognition, the distance D(F) is computed between the feature-vector sequence for the unknown word and the templates corresponding to each vocabulary item. The unknown is recognized as that word for which D(F) is a minimum. DTW can be extended to connected word recognition by aligning the input utterance to all possible concatenations of reference templates. Efficient algorithms for computing such alignments have been developed [Furui, 1989]; however, in general, DTW has proved most applicable to isolated word recognition tasks.
Hidden Markov Models¹
Hidden Markov modeling is a probabilistic pattern matching technique which is more robust than DTW at modeling acoustic variability in speech and more readily extensible to continuous speech recognition. As shown in Fig. 15.18, hidden Markov models (HMMs) represent speech as a sequence of states, which are assumed to model intervals of speech with roughly stationary acoustic features. Each state is characterized by an output probability distribution which models variability in the spectral features or observations associated with that state. Transition probabilities between states model durational variability in the speech signal. The probabilities, or parameters, of an HMM are trained using observations (VQ codewords) extracted from a representative sample of speech data. Recognition of an unknown utterance is based on the probability that the speech was generated by the HMM.
More precisely, an HMM is defined by:
1. A set of N states {S_1 . . . S_N}, where q_t is the state at time t.
2. A set of M observation symbols {v_1 . . . v_M}, where O_t is the observation at time t.
3. A state transition probability matrix A = {a_ij}, where the probability of transitioning from state S_i at time t to state S_j at time t + 1 is a_ij = P(q_{t+1} = S_j | q_t = S_i).
4. A set of output probability distributions B = {b_j(k)}, where for each state j, b_j(k) = P(O_t = v_k | q_t = S_j).
5. An initial state distribution π = {π_i}, where π_i = P(q_1 = S_i).
At each time t a transition to a new state is made, and an observation is generated. State transitions have the Markov property, in that the probability of transitioning to a state at time t depends only on the state at time
¹Although the discussion here is limited to HMMs with discrete observations, output distributions such as Gaussians can be defined for continuous-valued features.
FIGURE 15.18 A typical HMM topology.
t − 1. The observations are conditionally independent given the state, and the transition probabilities are not dependent on time. The model is called hidden because the identity of the state at time t is unknown; only the output of the state is observed. It is common to specify an HMM by its parameters λ = (A, B, π).
The basic acoustic unit modeled by the HMM can be either a word or a subword unit. For small recognition vocabularies, the lexicon typically consists of whole-word models similar to the model shown in Fig. 15.18. The number of states in such a model can either be fixed or be made to depend on word duration. For larger vocabularies, words are more often defined in the lexicon as concatenations of phone or triphone models. Triphones are phone models with left and right context specified [Lee, 1988]; they are used to model acoustic variability which results from the coarticulation of adjacent speech sounds.
In isolated word recognition tasks, an HMM is created for each word in the recognition vocabulary. In continuous speech recognition, on the other hand, a single HMM network is generated by expressing allowable word strings or sentences as concatenations of word models, as shown in Fig. 15.19. In wordspotting, the HMM network consists of a parallel connection of keyword models and a background model which represents the speech within which the keywords are embedded. Background models, in turn, typically consist of parallel connections of subword acoustic units such as phones [Wilcox and Bush, 1992].
The language model or grammar of a recognition system defines the sequences of vocabulary items which are allowed. For simple tasks, deterministic finite-state grammars can be used to define all allowable word sequences. Typically, however, recognizers make use of stochastic grammars based on n-gram statistics [Jelinek, 1985]. A bigram language model, for example, specifies the probability of a vocabulary item given the item which precedes it.
Isolated word recognition using HMMs involves computing, for each word in the recognition vocabulary, the probability P(O | λ) of the observation sequence O = O_1 . . . O_T. The unknown utterance is recognized as the word which maximizes this probability. The probability P(O | λ) is the sum over all possible state sequences q_1 . . . q_T of the probability of O and Q given λ, or
FIGURE 15.19 Language model, lexicon, and HMM phone models for a continuous speech recognition system. (Source: K.F. Lee, "Large-Vocabulary Speaker-Independent Continuous Speech Recognition: The SPHINX System," Ph.D. Dissertation, Computer Science Dept., Carnegie Mellon University, April 1988. With permission.)
    P(O | λ) = Σ_Q P(O, Q | λ) = Σ_{q_1 . . . q_T} π_{q_1} b_{q_1}(O_1) a_{q_1 q_2} b_{q_2}(O_2) . . . a_{q_{T−1} q_T} b_{q_T}(O_T)        (15.19)
Direct computation of this sum is computationally infeasible for even a moderate number of states and observations. However, an iterative algorithm known as the forward-backward procedure [Rabiner, 1989] makes this computation possible. Defining the forward variable α as

    α_t(i) = P(O_1 . . . O_t, q_t = S_i | λ)        (15.20)
and initializing α_1(i) = π_i b_i(O_1), subsequent α_t(i) are computed inductively as

    α_{t+1}(j) = [ Σ_{i=1}^{N} α_t(i) a_ij ] b_j(O_{t+1})        (15.21)
By definition, the desired probability of the observation sequence given the model λ is

    P(O | λ) = Σ_{i=1}^{N} α_T(i)        (15.22)
Similarly, the backward variable β can be defined as

    β_t(i) = P(O_{t+1} . . . O_T | q_t = S_i, λ)        (15.23)

The βs are computed inductively backward in time by first initializing β_T(j) = 1 and computing

    β_t(i) = Σ_{j=1}^{N} a_ij b_j(O_{t+1}) β_{t+1}(j)        (15.24)
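The forward recursion of Eqs. (15.20) through (15.22) is only a few lines of code. The two-state model below is invented for illustration; summing the result over every possible length-3 observation sequence gives 1, a useful sanity check:

```python
import numpy as np

def forward(A, B, pi, obs):
    """Forward procedure: returns P(O | lambda) per Eqs. (15.20)-(15.22)."""
    alpha = pi * B[:, obs[0]]          # alpha_1(i) = pi_i b_i(O_1)
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]  # Eq. (15.21): induction step
    return float(alpha.sum())          # Eq. (15.22): sum over final states

# Toy 2-state, 2-symbol model (illustrative values).
A  = np.array([[0.7, 0.3], [0.4, 0.6]])
B  = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([0.5, 0.5])
print(round(forward(A, B, pi, [0, 1, 0]), 6))  # 0.099375
```

Note that the recursion costs O(N²T), versus the O(N^T) of direct enumeration over state sequences.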
HMM-based continuous speech recognition involves determining an optimal word sequence using the Viterbi algorithm. This algorithm uses dynamic programming to find the optimal state sequence through an HMM network representing the recognizer vocabulary and grammar. The optimal state sequence Q* = (q*_1 . . . q*_T) is defined as the sequence which maximizes P(Q | O, λ), or equivalently P(Q, O | λ). Let δ_t(i) be the joint probability of the optimal state sequence and the observations up to time t, ending in state S_i at time t. Then

    δ_t(i) = max P(q_1 . . . q_{t−1}, q_t = S_i, O_1 . . . O_t | λ)        (15.25)
where the maximum is over all state sequences q_1 . . . q_{t−1}. This probability can be updated recursively by extending each partial optimal path using

    δ_{t+1}(j) = [ max_i δ_t(i) a_ij ] b_j(O_{t+1})        (15.26)
At each time t, it is necessary to keep track of the optimal precursor to state j, that is, the state which maximized the above probability. Then, at the end of the utterance, the optimal state sequence can be retrieved by backtracking through the precursor list.
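The Viterbi recursion of Eqs. (15.25) and (15.26), including the precursor list and backtracking step just described, can be sketched as follows (the model values are invented):

```python
import numpy as np

def viterbi(A, B, pi, obs):
    """Viterbi algorithm: Eqs. (15.25)-(15.26) with backtracking."""
    N, T = len(pi), len(obs)
    delta = pi * B[:, obs[0]]             # delta_1(i) = pi_i b_i(O_1)
    psi = np.zeros((T, N), dtype=int)     # optimal-precursor list
    for t in range(1, T):
        scores = delta[:, None] * A       # delta_t(i) * a_ij for all i, j
        psi[t] = scores.argmax(axis=0)    # best precursor for each state j
        delta = scores.max(axis=0) * B[:, obs[t]]  # Eq. (15.26)
    # Backtrack from the best final state through the precursor list.
    q = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        q.append(int(psi[t][q[-1]]))
    return q[::-1]

A  = np.array([[0.7, 0.3], [0.4, 0.6]])
B  = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([0.5, 0.5])
print(viterbi(A, B, pi, [0, 0, 1, 1]))  # [0, 0, 1, 1]
```

In practice the recursion is carried out with log probabilities so that products become sums and underflow is avoided.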
Training HMM-based recognizers involves estimating the parameters for the word or phone models used in the system. As with DTW, several repetitions of each word in the recognition vocabulary are used to train HMM-based isolated word recognizers. For continuous speech recognition, word or phone exemplars are typically extracted from word strings or sentences [Lee, 1988]. Parameters for the models are chosen based on a maximum likelihood criterion; that is, the parameters λ maximize the likelihood of the training data O, P(O | λ). This maximization is performed using the Baum-Welch algorithm [Rabiner, 1989], a re-estimation
technique based on first aligning the training data O with the current models, and then updating the parameters of the models based on this alignment.
Let ξ_t(i, j) be the probability of being in state S_i at time t and state S_j at time t + 1, given the observation sequence O. Using the forward and backward variables α_t(i) and β_{t+1}(j), ξ_t(i, j) can be written as

    ξ_t(i, j) = P(q_t = S_i, q_{t+1} = S_j | O, λ)
              = α_t(i) a_ij b_j(O_{t+1}) β_{t+1}(j) / Σ_{i=1}^{N} Σ_{j=1}^{N} α_t(i) a_ij b_j(O_{t+1}) β_{t+1}(j)        (15.27)
An estimate of a_ij is given by the expected number of transitions from state S_i to state S_j divided by the expected number of transitions from state S_i. Define γ_t(i) as the probability of being in state S_i at time t, given the observation sequence O:

    γ_t(i) = P(q_t = S_i | O, λ) = Σ_{j=1}^{N} ξ_t(i, j)        (15.28)
Summing γ_t(i) over t yields a quantity which can be interpreted as the expected number of transitions from state S_i. Summing ξ_t(i, j) over t gives the expected number of transitions from state i to state j. An estimate of a_ij can then be computed as the ratio of these two sums. Similarly, an estimate of b_j(k) is obtained as the expected number of times being in state j and observing symbol v_k divided by the expected number of times in state j:

    â_ij = Σ_{t=1}^{T−1} ξ_t(i, j) / Σ_{t=1}^{T−1} γ_t(i),    b̂_j(k) = Σ_{t:O_t=v_k} γ_t(j) / Σ_{t=1}^{T} γ_t(j)        (15.29)
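One re-estimation pass of Eqs. (15.27) through (15.29) can be sketched directly from the forward and backward variables. This is a single-sequence, unsmoothed update with invented model values; a production trainer would accumulate these statistics over many utterances and iterate to convergence:

```python
import numpy as np

def reestimate(A, B, pi, obs):
    """One Baum-Welch update of (A, B) from one observation sequence."""
    N, T = len(pi), len(obs)
    # Forward and backward variables, Eqs. (15.20)-(15.24).
    alpha = np.zeros((T, N))
    beta = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    beta[T - 1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    # xi_t(i, j), Eq. (15.27): pairwise state occupancy at t and t+1.
    xi = np.zeros((T - 1, N, N))
    for t in range(T - 1):
        num = alpha[t][:, None] * A * (B[:, obs[t + 1]] * beta[t + 1])
        xi[t] = num / num.sum()
    # gamma_t(i), Eq. (15.28): state occupancy at time t.
    gamma = alpha * beta
    gamma = gamma / gamma.sum(axis=1, keepdims=True)
    # Re-estimates, Eq. (15.29).
    A_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    B_new = np.zeros_like(B)
    for k in range(B.shape[1]):
        hits = [t for t in range(T) if obs[t] == k]
        B_new[:, k] = gamma[hits].sum(axis=0) / gamma.sum(axis=0)
    return A_new, B_new

A  = np.array([[0.7, 0.3], [0.4, 0.6]])
B  = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([0.5, 0.5])
A1, B1 = reestimate(A, B, pi, [0, 0, 1, 1, 0])
print(np.round(A1.sum(axis=1), 6))  # rows of the updated A still sum to 1
```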
State-of-the-Art Recognition Systems
Dictation-oriented recognizers which accommodate isolated word vocabularies of many thousands of words in a speaker-adaptive manner are currently available commercially. So too are speaker-independent, continuous speech recognizers for small vocabularies, such as the digits; similar products for larger (1000-word) vocabularies with constrained grammars are imminent. Speech recognition research is aimed, in part, at the development of more robust pattern classification techniques, including some based on neural networks [Lippmann, 1989], and at the development of systems which accommodate more natural spoken language dialogs between human and machine.
Defining Terms
Baum-Welch: A re-estimation technique for computing optimal values for HMM state transition and output probabilities.
Continuous speech recognition: Recognition of fluently spoken utterances.
Dynamic time warping (DTW): A recognition technique based on nonlinear time alignment of unknown utterances with reference templates.
Forward-backward: An efficient algorithm for computing the probability of an observation sequence from an HMM.
Hidden Markov model (HMM): A stochastic model which uses state transition and output probabilities to generate observation sequences.
Isolated word recognition: Recognition of words or short phrases preceded and followed by silence.
Signal pre-processing: Conversion of an analog speech signal into a sequence of numeric feature vectors or observations.
Viterbi: An algorithm for finding the optimal state sequence through an HMM given a particular observation sequence.
Wordspotting: Detection or location of keywords in the context of fluent speech.
References
S. Furui, Digital Speech Processing, Synthesis, and Recognition, New York: Marcel Dekker, 1989.
R. M. Gray, "Vector quantization," IEEE ASSP Magazine, vol. 1, no. 2, pp. 4–29, April 1984.
F. Jelinek, "The development of an experimental discrete dictation recognizer," Proc. IEEE, vol. 73, no. 11, pp. 1616–1624, Nov. 1985.
K. F. Lee, "Large-Vocabulary Speaker-Independent Continuous Speech Recognition: The SPHINX System," Ph.D. Dissertation, Computer Science Department, Carnegie Mellon University, April 1988.
R. P. Lippmann, "Review of neural networks for speech recognition," Neural Computation, vol. 1, pp. 1–38, 1989.
L. R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proc. IEEE, vol. 77, no. 2, pp. 257–285, Feb. 1989.
H. Sakoe and S. Chiba, "Dynamic programming algorithm optimization for spoken word recognition," IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 26, no. 1, pp. 43–49, Feb. 1978.
L. D. Wilcox and M. A. Bush, "Training and search algorithms for an interactive wordspotting system," in Proceedings, International Conference on Acoustics, Speech and Signal Processing, San Francisco, March 1992, pp. II-97–II-100.
V. W. Zue, "The use of speech knowledge in automatic speech recognition," Proc. IEEE, vol. 73, no. 11, pp. 1602–1615, Nov. 1985.
Further Information
Papers on speech recognition are regularly published in the IEEE Speech and Audio Transactions (formerly part of the IEEE Transactions on Acoustics, Speech and Signal Processing) and in the journal Computer Speech and Language. Speech recognition research and technical exhibits are presented at the annual IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), the biennial European Conference on Speech Communication and Technology (Eurospeech), and the biennial International Conference on Spoken Language Processing (ICSLP), all of which publish proceedings. Commercial applications of speech recognition technology are featured at annual American Voice Input/Output Society (AVIOS) and Speech Systems Worldwide meetings. A variety of standardized databases for speech recognition system development are available from the National Institute of Standards and Technology in Gaithersburg, MD.
15.5 Large Vocabulary Continuous Speech Recognition
Yuqing Gao, Bhuvana Ramabhadran, and Michael Picheny
Speech recognition is the process of converting an acoustic signal to a textual message. High recognition accuracy is of prime importance in order for a speech interface to be of any practical use in a dictation task, or any kind of intelligent human-machine interaction. Speech recognition is made extremely difficult by co-articulation, variations in speaking styles, rates, vocal-tract size across speakers, and communication channels. Speech research has been underway for over four decades, and many problems have been addressed and solved fully or partially. High performance can be achieved on tasks such as isolated word recognition, small and middle vocabulary recognition, and recognition of speech in nonadverse conditions. Large vocabulary (over 30K words), speaker-independent, continuous speech recognition has been one of the major research targets for years. Although high recognition accuracies have been achieved for some large vocabulary tasks [7], significant challenges emerge as more and more applications make themselves viable for speech input.
Continuous Speech Recognition
Continuous speech recognition is significantly more difficult than isolated word recognition. Its complexity stems from the following three properties of continuous speech.
1. Word boundaries are unclear in continuous speech, whereas in isolated word recognition they are well-known and can be used to improve the accuracy and limit the search. For example, in the phrase "this ship," the /s/ of "this" is often omitted. Similarly, in "we were away a year," the whole sentence is one long vocalic segment, and the word boundaries are difficult to locate.
2. Co-articulatory effects are much stronger than in isolated speech. Although we try to pronounce words as concatenated sequences of individual speech sounds (phones), our articulators possess inertia which retards their motion from one position to another. As a result, a phone is strongly influenced by the previous and the following phones. This effect occurs both within single words and between words and is aggravated as the speaking rate increases.
3. Function words (articles, prepositions, pronouns, short verbs, etc.) tend to be poorly articulated. In particular, the phones are often shortened, skipped, or deleted.
As a result, speech recognition error rates increase drastically from isolated word to continuous speech. Moreover, the processing power needed to recognize continuous speech increases as well.
The primary advantages of continuous speech are two-fold. First, typical speaking rates for continuous speech are 140 to 170 words per minute, while isolated word mode speakers seldom exceed 70 words per minute. Second, continuous speech is a natural mode of human communication. Forcing pauses between words introduces artificiality and reduces user friendliness. The unnaturalness of isolated word speech breaks the speaker's train of thought.
Large Vocabulary
In the 1990s, the term "large vocabulary" has come to mean 30K words or more. Although the vocabulary size is certainly not the best measure of a task's difficulty, it does affect the severity of many problems such as the acoustic confusability of words, the degradation in performance due to using sub-word unit models, and the computational complexity of the hypothesis search.
Clearly, the number of confusable words grows substantially with the vocabulary size. As the vocabulary size increases, it becomes impractical to model each word individually, because neither the necessary training data nor the requisite storage is available. Instead, models must be based on sub-word units. These sub-word models usually lead to degraded performance because they fail to capture co-articulation effects as well as whole-word models. Additionally, the computational complexity of the search requires the introduction of efficient search methods such as "fast match" [26] which reject all but the most plausible word hypotheses to limit the computation effort. Those word hypotheses which survive the "fast match" are then subjected to a full detailed analysis. Naturally, this process may introduce search errors, reducing the accuracy.
Some of the key engineering challenges in building speech recognition systems are selecting a proper set of sub-word units (e.g., phones), assembling units into words (baseforms), modeling co-articulation effects, accommodating the different stress patterns of different languages, and modeling pitch contours for tone-based languages such as Mandarin.
Overview of a Speech Recognition System
The general architecture of a typical speech recognition system is given in Fig. 15.20. The speech signal is typically input to the system via a microphone or a telephone. Signal preprocessing consists of computing a series of acoustic vectors by processing the speech signal at regular time intervals (frames), which are typically 10 ms long. These acoustic vectors are usually a set of parameters, such as LPC cepstra [23] or filter bank outputs (PLP [30], RASTA [28], etc.). In order to capture the change in these vectors over time, they are often augmented with their time derivatives or with discriminant projection techniques (e.g., see LDA [10, 29]).
The iecognizei consists of thiee paits: the acoustic model, the language model, and the hypothesis seaich.
The iecognition piocess involves the use of acoustic models ovei these featuie vectois to label them with theii
2000 by CRC Press LLC
phonetic class. The acoustic models usually used are Hidden Markov Models. Artificial Neural Network [16]
or Dynamic Time Warping [17] based models have also been used, but will not be covered in this chapter section.
Context-dependent acoustic models [9, 10] are obtained by querying the phonetic context using the concept of
tri-phones or decision trees (networks) [2] that are constructed from a large amount of training data. A
multidimensional Gaussian mixture model is used to model the feature vectors of the training data that have
similar phonetic contexts. These models are then used as a set of observation densities in continuous Hidden
Markov Models (HMMs). Each feature vector is labeled as the context-dependent phonetic class that is the closest
acoustic class to the feature vector. A sequence of labels thus obtained is used to obtain a set of candidate words
that are then pruned with the help of a language model. A language model bases its prediction of the next word
on the history of the words preceding it. Finally, a hypothesis search is conducted through all possible sequences
of hypothesized words to determine the optimal word sequence given the acoustic observations.
Several adaptation techniques have been proposed to derive speaker-dependent systems from the speaker-
independent system described above. These techniques modify or tune the parameters of the acoustic models to
the specific speaker.
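The time-derivative augmentation of the acoustic vectors mentioned above can be sketched as follows. This is a minimal illustration of the common regression formula for delta coefficients, assuming a (T, d) matrix of cepstral vectors; it is not the front end of any particular system described here.

```python
import numpy as np

def add_deltas(feats, window=2):
    """Augment a (T, d) feature matrix with first-order time derivatives.

    Deltas follow the standard regression formula
    delta_t = sum_k k * (x[t+k] - x[t-k]) / (2 * sum_k k^2),
    with edge frames replicated as padding.
    """
    T, d = feats.shape
    padded = np.concatenate([np.repeat(feats[:1], window, axis=0),
                             feats,
                             np.repeat(feats[-1:], window, axis=0)])
    denom = 2 * sum(k * k for k in range(1, window + 1))
    deltas = np.zeros_like(feats)
    for k in range(1, window + 1):
        deltas += k * (padded[window + k: window + k + T] -
                       padded[window - k: window - k + T])
    deltas /= denom
    return np.hstack([feats, deltas])
```

A discriminant projection such as LDA would then be applied to the augmented vectors in the same per-frame fashion.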
Hidden Markov Models As Acoustic Models for Speech Recognition
There are many ways to characterize the temporal sequence of speech sounds as represented by a sequence of
spectral observations. The most common way is to model the temporal sequence of spectra in terms of a Markov
chain that describes the way one sound changes to another by imposing an explicitly probabilistic structure on
the representation of the evolutionary sequence. If we denote the spectral vector at time t by O_t, the observed
spectral sequence, lasting from t = 1 to t = T, is then represented by

    O_1^T = (O_1, O_2, ..., O_T)

Consider a first-order N-state Markov chain as illustrated for N = 3 in Fig. 15.21. Such a random process has
the simplest memory: the value at time t depends only on the value at the preceding time and on nothing that
went on before. However, it has a very useful property that leads to its application to the speech recognition problem:
the states of the chain generate observation sequences while the state sequence itself is hidden from the observer.
The system can be described as being in one of N distinct states, S_1, S_2, ..., S_N, at any discrete time instant t.
We use the state variable q_t to denote the state of the system at time t. Assume that the Markov chain is time invariant
(homogeneous), so the transition probabilities do not depend on time. The Markov chain is then described by
a state transition probability matrix A = [a_ij], where

    a_ij = P(q_t = S_j | q_{t-1} = S_i),  1 ≤ i, j ≤ N    (15.30)

The transition probabilities satisfy the following constraints:

    a_ij ≥ 0    (15.31)

FIGURE 15.20 General architecture of a speech recognition system.
    Σ_{j=1}^{N} a_ij = 1, for all i    (15.32)

Assume that at the initial time, t = 0, the state of the system q_0 is specified by an initial state probability
vector π^T = [π_1, π_2, ..., π_N]. Then for any state sequence q = (q_0, q_1, q_2, ..., q_T), where q_t ∈ {S_1, S_2, ..., S_N}, the
probability of q being generated by the Markov chain is

    P(q | A, π) = π_{q_0} a_{q_0 q_1} a_{q_1 q_2} ... a_{q_{T-1} q_T}    (15.33)

Suppose now that the state sequence q is a sequence of speech sounds and cannot be observed directly.
Instead, observation O_t is produced with the system in some unobserved state q_t (where q_t ∈ {S_1, S_2, ..., S_N}).
Assume that the production of O_t in each possible state S_i, i = 1, 2, ..., N, is stochastic and is characterized by a set
of observation probability measures B = {b_i(O_t)}, i = 1, ..., N, where

    b_i(O_t) = P(O_t | q_t = S_i)    (15.34)

If the state sequence q that led to the observation sequence O = (O_1, O_2, ..., O_T) is known, the probability
of O being generated by the system is assumed to be

    P(O | q, B) = b_{q_1}(O_1) b_{q_2}(O_2) ... b_{q_T}(O_T)    (15.35)

Therefore, the joint probability of O and q being produced by the system is

    P(O, q | A, B, π) = π_{q_0} ∏_{t=1}^{T} a_{q_{t-1} q_t} b_{q_t}(O_t)    (15.36)

The probability of producing the observation sequence O by the random process without assuming knowledge
of the state sequence is

    P(O | A, B, π) = Σ_q P(O, q | A, B, π) = Σ_q π_{q_0} ∏_{t=1}^{T} a_{q_{t-1} q_t} b_{q_t}(O_t)    (15.37)

FIGURE 15.21 A first-order three-state hidden Markov model.
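Equation (15.33) can be illustrated directly. The three-state transition matrix and initial distribution below are hypothetical numbers chosen only for the example:

```python
import numpy as np

# A hypothetical 3-state chain: rows of A sum to 1, as required by Eq. (15.32).
A = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.5, 0.3],
              [0.1, 0.2, 0.7]])
pi = np.array([0.5, 0.3, 0.2])

def state_sequence_prob(q, A, pi):
    """P(q | A, pi) = pi_{q0} * prod_t a_{q_{t-1} q_t}, per Eq. (15.33)."""
    p = pi[q[0]]
    for prev, cur in zip(q[:-1], q[1:]):
        p *= A[prev, cur]
    return p
```

Summing this probability over all possible sequences of a fixed length gives 1, which is a quick consistency check on A and π.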
Continuous Parameter Hidden Markov Models
The triple (π, A, B) defines a Hidden Markov Model (HMM). More specifically, a hidden Markov model is
characterized by the following:
1. A state space {S_1, S_2, ..., S_N}. Although the states are not explicitly observed, for many applications there
is often some physical significance attached to the states. In the case of speech recognition, this is often
a phone or a portion - initial, middle, final - of a phone. We denote the state at time t as q_t.
2. A set of observations O = (O_1, O_2, ..., O_T). The observations can be a set of discrete symbols chosen
from a finite set, or continuous signals (or vectors). In the speech recognition application, although it is
possible to convert continuous speech representations into a sequence of discrete symbols via vector
quantization codebooks and other methods, serious degradation tends to result from such discretization
of the signal. In this article, we focus on HMMs with continuous observation output densities to model
continuous signals.
3. The initial state distribution π = {π_i}, in which
    π_i = P(q_0 = S_i),  1 ≤ i ≤ N
4. The state transition probability distribution A = {a_ij}, defined in Eq. (15.30).
5. The observation probability distribution B = {b_j(O_t)}, defined in Eq. (15.34).
Given this form of HMM, the following three basic problems of interest must be solved for the model to be
useful in applications.
Task 1 (Evaluation): Given the observation sequence O = (O_1, O_2, ..., O_T) and a model λ = (π, A, B), how
does one efficiently compute P(O | λ)?
Task 2 (Estimation): Given the observation sequence O = (O_1, O_2, ..., O_T), how does one solve the inverse
problem of estimating the parameters of λ?
Task 3 (Decoding): Given the observation sequence O and a model λ, how do we deduce the most likely
state sequence q that is optimal in some sense or best explains the observations?
The Evaluation Problem
With unbounded computational power, Eq. (15.37) can be used to compute P(O | λ). However, it involves on
the order of 2T N^T calculations, because the summation in Eq. (15.37) has N^{T+1} possible state sequences. This is
computationally infeasible even for small values of N and T.
An iterative algorithm known as the forward-backward procedure makes this computation feasible. Defining
the forward variable α as

    α_t(i) = P(O_1, ..., O_t, q_t = S_i | λ)    (15.38)
and initializing α_1(i) = π_i b_i(O_1), subsequent α_t(i) are computed inductively as

    α_{t+1}(j) = [ Σ_{i=1}^{N} α_t(i) a_ij ] b_j(O_{t+1})    (15.39)

By definition, the desired probability of the observation sequence given the model λ is

    P(O | λ) = Σ_{i=1}^{N} α_T(i)    (15.40)
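The forward recursion of Eqs. (15.38) to (15.40) can be sketched for a discrete-output HMM as follows; the matrix layout is one reasonable choice, not a prescribed one.

```python
import numpy as np

def forward(pi, A, B, obs):
    """Forward procedure for a discrete-output HMM.

    pi  : (N,) initial state probabilities
    A   : (N, N) transition matrix, A[i, j] = a_ij
    B   : (N, K) observation matrix, B[i, k] = b_i(symbol k)
    obs : sequence of observed symbol indices
    Returns the (T, N) matrix of forward variables and P(O | model).
    """
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                   # initialization
    for t in range(1, T):                          # induction, Eq. (15.39)
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    return alpha, alpha[-1].sum()                  # termination, Eq. (15.40)
```

The cost is O(T N^2) operations instead of the O(2T N^T) of the direct summation.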
Another alternative is to use the backward procedure by defining the backward variable β:

    β_t(i) = P(O_{t+1}, O_{t+2}, ..., O_T | q_t = S_i, λ)    (15.41)

β_t(i) is the probability of the partial observation sequence from t + 1 to the end, T, given state S_i and
model λ. The initial values are β_T(i) = 1 for all i. The values at times T - 1, T - 2, ..., 1 can be
computed inductively:

    β_t(i) = Σ_{j=1}^{N} a_ij b_j(O_{t+1}) β_{t+1}(j)    (15.42)

The probability of the observation sequence given the model λ can be expressed in terms of the forward
and backward probabilities:

    P(O | λ) = Σ_{i=1}^{N} α_t(i) β_t(i) = Σ_{i=1}^{N} α_T(i)    (15.43)

The forward and backward variables are very useful and will be used in the next section.
The Estimation Problem
Given an observation sequence or a set of sequences (multiple utterances), the estimation problem is to find
the "right" model parameter values that specify a model most likely to produce the given sequence. In speech
recognition, this is called training. There is no known closed-form analytic solution for the maximum likelihood
model parameters. Nevertheless, we can choose λ = (π, A, B) such that its likelihood, P(O | λ), is locally maximized
using an iterative procedure such as the Baum-Welch re-estimation method (a form of the EM [expectation-
maximization] method [4]). The method introduces an auxiliary function Q(λ̂, λ) and maximizes it:

    Q(λ̂, λ) = Σ_q P(O, q | λ) log P(O, q | λ̂)    (15.44)

The re-estimation technique consists of first aligning the training data O with the current models, and then
updating the parameters of the models based on the alignment to obtain a new estimate λ̂.
Let ξ_t(i, j) be the probability of being in state S_i at time t and state S_j at time t + 1, given the model λ and
the observation sequence O:

    ξ_t(i, j) = P(q_t = S_i, q_{t+1} = S_j | O, λ)    (15.45)

Using the forward and backward variables α_t(i) and β_t(j), ξ_t(i, j) can be written as
    ξ_t(i, j) = α_t(i) a_ij b_j(O_{t+1}) β_{t+1}(j) / [ Σ_{i=1}^{N} Σ_{j=1}^{N} α_t(i) a_ij b_j(O_{t+1}) β_{t+1}(j) ]    (15.46)

An estimate of a_ij is given by the expected number of transitions from state S_i to state S_j, divided by the
expected number of transitions from state S_i. Define γ_t(i) as the probability of being in state S_i at time t, given
the observation sequence O:

    γ_t(i) = P(q_t = S_i | O, λ) = Σ_{j=1}^{N} ξ_t(i, j)    (15.47)

Summing γ_t(i) over t yields a quantity that can be interpreted as the expected number of transitions from state
S_i. Summing ξ_t(i, j) over t gives the expected number of transitions from state S_i to S_j. An estimate of a_ij can
be computed as the ratio of these two sums:

    â_ij = Σ_{t=1}^{T-1} ξ_t(i, j) / Σ_{t=1}^{T-1} γ_t(i)    (15.48)

For the discrete observation case, an estimate of b_j(k) is obtained as the expected number of times of being in
state S_j and observing symbol v_k, divided by the expected number of times in state S_j:

    b̂_j(k) = Σ_{t: O_t = v_k} γ_t(j) / Σ_{t=1}^{T} γ_t(j)    (15.49)

The most general representation of a continuous observation density for HMMs is a finite mixture of the form

    b_j(O) = Σ_{k=1}^{M} c_jk N(O, μ_jk, U_jk),  1 ≤ j ≤ N

where O is the observation vector being modeled, c_jk is the mixture coefficient for the kth mixture in state j,
and N is any log-concave or elliptically symmetric density. Typically, we assume that N is Gaussian with mean
vector μ_jk and covariance matrix U_jk for the kth mixture component in state j. The mixture weights c_jk satisfy
the constraints:

    Σ_{k=1}^{M} c_jk = 1,  1 ≤ j ≤ N    (15.50)

    c_jk ≥ 0,  1 ≤ j ≤ N,  1 ≤ k ≤ M    (15.51)

Let γ_t(j, k) be the probability of being in state S_j at time t with the kth mixture component accounting for O_t:

    γ_t(j, k) = P(q_t = S_j, m_t = k | O, λ)    (15.52)
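Equations (15.45) to (15.48) can be sketched for a discrete-output HMM. This minimal implementation recomputes the forward and backward variables internally and performs a single update of the transition matrix:

```python
import numpy as np

def reestimate_transitions(pi, A, B, obs):
    """One Baum-Welch update of the transition matrix.

    Computes forward/backward variables, then xi_t(i, j) and gamma_t(i),
    and returns a_hat[i, j] = sum_t xi_t(i, j) / sum_t gamma_t(i).
    """
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    beta = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):                          # forward, Eq. (15.39)
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):                 # backward, Eq. (15.42)
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    p_obs = alpha[-1].sum()
    # xi[t, i, j] per Eq. (15.46); gamma[t, i] per Eq. (15.47)
    xi = np.array([alpha[t, :, None] * A *
                   (B[:, obs[t + 1]] * beta[t + 1])[None, :]
                   for t in range(T - 1)]) / p_obs
    gamma = xi.sum(axis=2)
    return xi.sum(axis=0) / gamma.sum(axis=0)[:, None]
```

Because each row of the update is a ratio of expected transition counts, the re-estimated rows sum to 1 automatically.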
The re-estimation formulas for the coefficients of the mixture density are:

    ĉ_jk = Σ_{t=1}^{T} γ_t(j, k) / Σ_{t=1}^{T} Σ_{k=1}^{M} γ_t(j, k)    (15.53)

    μ̂_jk = Σ_{t=1}^{T} γ_t(j, k) o_t / Σ_{t=1}^{T} γ_t(j, k)    (15.54)

    Û_jk = Σ_{t=1}^{T} γ_t(j, k) (o_t - μ_jk)(o_t - μ_jk)^T / Σ_{t=1}^{T} γ_t(j, k)    (15.55)

Details on how the re-estimation formulas are derived from the auxiliary function Q(λ̂, λ) can be found in
[25] and [23].
Viterbi Algorithm: One Solution for the Decoding Problem
The optimal state sequence q = (q_1, ..., q_T) is defined as the sequence that maximizes P(q | O, λ),
or equivalently P(q, O | λ). Let δ_t(i) be the joint probability of the optimal state sequence and the observations
up to time t, ending in state S_i at time t. Then,

    δ_t(i) = max P(q_1, ..., q_{t-1}, q_t = S_i, O_1, ..., O_t | λ)    (15.56)
where the maximum is over all state sequences q_1, ..., q_{t-1}. This probability can be updated recursively by
extending each partial optimal path using

    δ_{t+1}(j) = max_i [ δ_t(i) a_ij ] b_j(O_{t+1})    (15.57)

At each time t, it is necessary to keep track of the optimal precursor i of state j, that is, the state that maximized
the above probability. Then, at the end of the utterance, the optimal state sequence can be retrieved by
backtracking through the precursor list [11].
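The recursion of Eqs. (15.56) and (15.57), together with the precursor bookkeeping and backtracking just described, can be sketched as:

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """Viterbi decoding for a discrete-output HMM.

    Returns the most likely state sequence and its joint probability.
    """
    T, N = len(obs), len(pi)
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)        # optimal precursor of each state
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A   # delta_{t-1}(i) * a_ij
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * B[:, obs[t]]
    # backtrack through the precursor list
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))
    path.reverse()
    return path, delta[-1].max()
```

In practice the products are computed in the log domain to avoid underflow on long utterances.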
Speaker Adaptation
In spite of recent progress in the design of speaker-independent (SI) systems, error rates are still typically two
or three times higher than those of equivalent speaker-dependent (SD) systems. Variability in both anatomical and
personal characteristics contributes to this effect. Anatomical differences include the length of the vocal tract,
the size of the nasal cavity, etc. Similarly, there are variable speaking habits, such as accent, speed, and loudness.
The straightforward approach, which blindly mixes the statistics of all speakers, discards useful information.
The large amount of speaker-specific data required to train SD systems renders them impractical for many
applications. However, it is possible to use a small amount of the new speaker's speech (adaptation data) to
"tune" the SI models to the new speaker. Ideally, we would like to retain the robustness of well-trained SI
models, yet improve the appropriateness of the models for the new speaker. Such methods are called speaker
adaptation techniques. The adaptation is said to be supervised if the true text transcript of the adaptation data
is known; otherwise, the adaptation is said to be unsupervised.
Maximum a Posteriori Estimation
A widely used speaker adaptation method maximizes the a posteriori estimate of the HMMs [3]. The conventional
maximum likelihood (ML) based algorithms assume the HMM parameters to be unknown but fixed, and the
parameter estimators are derived entirely from the training observation sequence using the Baum-Welch algo-
rithm. Sometimes, prior information about the HMM parameters is available, whether from subject matter
considerations or from previous experience. Designers may wish to use this prior information, in addition to the
sample observations, to infer the HMM parameters.
The maximum a posteriori (MAP) framework naturally incorporates prior information into the estimation
process, which is particularly useful for dealing with problems posed by sparse training data, where ML estimates
become inaccurate. MAP parameter estimates approach the ML estimates when data is plentiful, but are governed
by the prior information in the absence of data. If λ is the parameter vector to be estimated from the observation
O with probability density function (pdf) P(O | λ) and g is the prior pdf of λ, then the MAP estimate is defined
as the maximum of the posterior pdf of λ, g(λ | O).
Rather than maximizing the auxiliary function Q(λ̂, λ) as in Eq. (15.44), we instead maximize an auxiliary
function that includes a contribution from the prior distribution of the model parameters:

    Q̄(λ̂, λ) = Q(λ̂, λ) + log g(λ̂)    (15.58)

The appropriate prior distributions are Gaussian distributions for the means, gamma distributions for the
inverse variances, and Dirichlet distributions for the mixture weights [3].
The problem with MAP is that it adapts only parameters for which explicit training data is available, and it
converges slowly for tasks where there is limited adaptation data and many parameters to be estimated. Many
adaptation algorithms [15], [14] have been developed which attempt to generalize from "nearby" training data
points to overcome this difficulty.
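For the special case of adapting a single Gaussian mean, the MAP estimate takes a simple interpolation form. The sketch below assumes a Gaussian prior centered on the speaker-independent mean; the prior weight tau is a hypothetical hyperparameter, not part of the formulation above.

```python
import numpy as np

def map_adapt_mean(mu_prior, gammas, obs, tau=10.0):
    """MAP re-estimate of a Gaussian mean from adaptation data.

    mu_prior : (d,) speaker-independent mean (the prior mode)
    gammas   : (T,) occupation probabilities gamma_t for this Gaussian
    obs      : (T, d) adaptation feature vectors
    tau      : prior weight balancing the SI mean against new data

    With a Gaussian prior on the mean, the MAP estimate is the interpolation
    (tau * mu_prior + sum_t gamma_t * o_t) / (tau + sum_t gamma_t): it
    approaches the ML estimate as data accumulates and falls back to the
    prior when adaptation data is scarce.
    """
    gammas = np.asarray(gammas, dtype=float)
    num = tau * np.asarray(mu_prior, dtype=float) + gammas @ np.asarray(obs, dtype=float)
    return num / (tau + gammas.sum())
```

This makes the slow-convergence issue concrete: a Gaussian that receives no adaptation frames keeps its speaker-independent mean unchanged.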
Transform-Based Adaptation
Another category of adaptation techniques uses a set of regression-based transforms to tune the means and
variances of a hidden Markov model to the new speaker. Each of the transformations is applied to a number of
HMMs and estimated from the corresponding data. Through this sharing of transformations and data, the method
can produce improvements even when only a small amount of adaptation data is available for the new speaker, by using
a global transform for all HMMs in the system. If more data is available, the number of transforms is increased.
Maximum Likelihood Linear Regression (MLLR)
The MLLR framework was first introduced in [27]. Consider the case of a continuous density HMM system
with Gaussian output distributions. A particular Gaussian distribution, g, is characterized by a mean vector,
μ_g, and a covariance matrix, U_g. Given a speech vector o_t, the probability of that vector being generated by
Gaussian distribution g is b_g(o_t):

    b_g(o_t) = (2π)^{-d/2} |U_g|^{-1/2} exp( -(1/2)(o_t - μ_g)^T U_g^{-1} (o_t - μ_g) )

The adaptation of the mean vector is obtained by applying a transformation matrix W_g to the extended mean
vector ξ_g to obtain an adapted mean vector μ̂_g:

    μ̂_g = W_g ξ_g

where W_g is a d × (d + 1) matrix which maximizes the likelihood of the adaptation data, and ξ_g is defined as

    ξ_g = [Ω, μ_1, ..., μ_d]^T

where Ω is the offset term for the regression.
The probability for the adapted system becomes

    b_g(o_t) = (2π)^{-d/2} |U_g|^{-1/2} exp( -(1/2)(o_t - W_g ξ_g)^T U_g^{-1} (o_t - W_g ξ_g) )
The auxiliary function in Eq. (15.44) can be used here to estimate W_g. It can be shown that W_g can be estimated
using the equation below:

    Σ_{t=1}^{T} Σ_g γ_g(t) U_g^{-1} o_t ξ_g^T = Σ_{t=1}^{T} Σ_g γ_g(t) U_g^{-1} W_g ξ_g ξ_g^T    (15.59)

where γ_g(t) is the posterior probability of occupying Gaussian g at time t, given that the observation sequence O is
generated.
The MLLR algorithm can also be extended to transform the covariance matrices [38].
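A much-simplified sketch of estimating a single global mean transform from Eq. (15.59): under the simplifying assumption that every Gaussian has an identity covariance, the equation reduces to a linear system that can be solved directly. A real MLLR implementation solves Eq. (15.59) with the true covariances, typically row by row.

```python
import numpy as np

def estimate_mllr_transform(obs, means, posteriors):
    """Estimate one global MLLR mean transform W of size d x (d+1).

    obs        : (T, d) adaptation vectors o_t
    means      : (G, d) Gaussian means mu_g
    posteriors : (T, G) occupation probabilities gamma_g(t)

    With U_g = I for all g, Eq. (15.59) becomes
        sum_{t,g} gamma_g(t) o_t xi_g^T = W sum_{t,g} gamma_g(t) xi_g xi_g^T
    where xi_g = [1, mu_g]^T is the extended mean vector (offset term 1).
    """
    xi = np.hstack([np.ones((means.shape[0], 1)), means])   # (G, d+1)
    occ = posteriors.sum(axis=0)                            # occupancy per Gaussian
    lhs = (obs.T @ posteriors) @ xi                         # (d, d+1)
    rhs = (xi * occ[:, None]).T @ xi                        # (d+1, d+1)
    return lhs @ np.linalg.pinv(rhs)

def adapt_means(W, means):
    """Apply mu_hat_g = W xi_g to every mean."""
    xi = np.hstack([np.ones((means.shape[0], 1)), means])
    return xi @ W.T
```

Because the transform is shared by all Gaussians, even means with no adaptation frames are moved, which is the key advantage over MAP when data is scarce.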
Cluster-Based Speaker Adaptation
Yet another category of speaker adaptation methodology is based on the fact that a speech training corpus
contains a number of training speakers, some of whom are closer, acoustically, to the test speaker than others.
Therefore, given a test speaker, if the acoustic models are re-estimated from a subset of the training speakers
who are acoustically close to the test speaker, the system should be a better match to the test data of that speaker.
A further improvement is obtained if the acoustic space of each of these selected training speakers is transformed,
by using a transform-based adaptation method, to come closer to the test speaker.
This scheme was shown to produce better speaker adaptation performance than other algorithms, for example
MLLR [27] or MAP adaptation [3], when only a small amount of adaptation data was available.
However, the implementation of this method required the entire training corpus to be available online for
the adaptation process, and this is not practical in many situations. This problem can be circumvented if a
model is stored for each of the training speakers, and the transformation is applied to the model. The trans-
formed models are then combined to produce the speaker-adapted model. However, due to the large number
of training speakers, storing the models of each training speaker would require a prohibitively large amount
of storage. Also, we may not have sufficient data from each training speaker to robustly estimate the parameters
of the speaker-dependent model.
To solve this problem and retain the advantage of the method, a new algorithm is presented in [21]. It
preclusters the training speakers acoustically into clusters. For each cluster, an HMM system (called a cluster-
dependent system) is trained using speech data from the speakers who belong to that cluster. When a test speaker's
data is available, we rank these cluster-dependent systems according to the distances between the test speaker
and each cluster, and a subset of these clusters, acoustically closest to the test speaker, is chosen. Then the model
for each of the selected clusters is transformed further to bring the model closer to the test speaker's acoustic
space. Finally, these adapted cluster models are combined to form a speaker-adapted system. Hence, compared
to [22], we now choose clusters that are acoustically close to the test speaker, rather than individual training
speakers.
This method solves the problem of excessive storage for the training speaker models because the number of
clusters is far smaller than the number of training speakers, and it is relatively inexpensive to store a model for
each cluster. Also, as each cluster contains a number of speakers, we have enough data to robustly estimate the
parameters of the model for each cluster.
Vocal Tract Length Normalization (VTL)
Several attempts have been made to model variations in vocal tract length across speakers. The idea was
originally introduced by Bamberg [42] and revived through a parametric approach in [39]. Assume a uniform
tube of length L as the model of the vocal tract. Then each formant frequency will be proportional to 1/L.
The first-order effect of a difference in vocal tract length is the scaling of the frequency axis. The idea behind
VTL is to rescale or warp the frequency axis during the signal processing step in a speech recognition system,
to make speech from all speakers appear as if it were produced by a vocal tract of a single standard length. Such
normalizations have led to significant gains in accuracy by reducing variability amongst speakers and allowing
the pooling of training data for the construction of sharper models. Three VTL methods have recently been
proposed. In [39], a parametric method of normalization which counteracts the effect of varied vocal tract
length is presented. This method is particularly useful when only a small amount of training data is available,
and it requires the determination of the formant frequencies. In [40], an automated method is presented that
uses a simple generic voiced speech model to rapidly select appropriate frequency scales. This generic model
is a mixture of 256 multivariate Gaussians with diagonal covariances, trained on the unwarped data. Different
warp scales are selected to linearly transform the frequency axis of the speaker's data. The resulting warped
features are scored against the generic model, and the warp scale that scores best is selected as the warp scale
for that speaker. An iterative process updates the generic model with the new features obtained after warping
each speaker's data with its best warp scale. Once the best warp scales for each speaker have been determined, SI
models are built with the appropriately warped feature vectors. This warp selection method allows data from
all speakers to be merged into one set of canonical models. In [41], a class of transforms is proposed which
achieves a remapping of the frequency axis much like the conventional VTL methods. These mappings, known
as all-pass transforms, are linear in the cepstral domain, which makes speaker normalization simple to imple-
ment. The parameters of these transforms are computed using conjugate gradient methods.
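The warp-scale selection idea of [40] can be sketched in toy form. Here a single diagonal Gaussian stands in for the 256-component generic model, and the raw magnitude spectrum is warped directly rather than the filter bank; both are illustrative simplifications.

```python
import numpy as np

def warp_spectrum(spec, alpha):
    """Linearly warp the frequency axis of a magnitude spectrum by factor alpha."""
    n = len(spec)
    src = np.clip(np.arange(n) * alpha, 0, n - 1)   # warped sampling positions
    return np.interp(src, np.arange(n), spec)

def gaussian_loglik(x, mean, var):
    """Log-likelihood of x under a diagonal Gaussian (the 'generic model')."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def select_warp(frames, mean, var, alphas=np.linspace(0.88, 1.12, 13)):
    """Pick the warp factor whose warped frames score best against the model."""
    scores = [sum(gaussian_loglik(warp_spectrum(f, a), mean, var) for f in frames)
              for a in alphas]
    return alphas[int(np.argmax(scores))]
```

A real system would iterate: re-train the generic model on the warped data, then re-select warp scales, as described above.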
Modeling Context in Continuous Speech
Speech cannot be accurately modeled by a concatenation of elementary HMMs corresponding to
the individual phones of a word baseform. A phone is a sub-word acoustic unit of speech. The realizations of the
phones depend on their context. This is especially true for continuous speech, where the phenomenon called
co-articulation is observed. Co-articulation is when the pronunciation of a phoneme is affected by the phones
preceding and following it, such as the r in ro and or. This section discusses several methods that yield
HMM building blocks that take phonetic context into account. A word is specified by its phonetic baseform,
the phones are transformed into their appropriate allophones according to the context in which they appear,
and a concatenation of the HMMs of these allophones results in the word HMM. Two standard approaches
are used to make use of contextual information:
1. Tri-phones as building blocks
2. Decision trees that lead to general contextual building blocks
Tri-Phones
In order to take into account the influence of context on pronunciation, many speech recognizers base their
modeling on the tri-phone concept. The tri-phone concept was first introduced by Chow et al. [5, 24] and Lee
et al. [7, 8] in the 1980s. In this concept, the pronunciation of a phone is influenced by the preceding and
following phone (i.e., the triplet is used to model the realization of the phone). The phone p embedded in the
context p_1 and p_2 is specified by the tri-phone p_1-p-p_2, where p_1 and p_2 are the previous and the following phones.
Different such realizations of p are called allophones. This amounts to saying that the contextual influence of
the preceding and following phone is most important. If this solution were carried out literally, the resulting
allophone alphabet would be too large to be useful. For example, a phonetic alphabet of size M would produce
M^3 allophones. Even though, in practice, not all M^3 allophones occur, the number of possible allophones is still
large.
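The tri-phone expansion of a phonetic baseform can be illustrated as follows; the phone labels and the 'sil' boundary marker are hypothetical examples, not a prescribed inventory.

```python
# Each phone in a pronunciation is replaced by the triplet
# (left context)-(phone)+(right context).
def triphones(phones, boundary="sil"):
    """Return the tri-phone (allophone) labels for a phone sequence."""
    padded = [boundary] + list(phones) + [boundary]
    return ["%s-%s+%s" % (padded[i - 1], padded[i], padded[i + 1])
            for i in range(1, len(padded) - 1)]
```

Counting the possible labels makes the size problem concrete: an alphabet of M phones admits M^3 distinct tri-phone types before any equivalence classification.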
So the tri-phone method relies on an equivalence classification of the contexts of phones. One simple scheme
involves the clustering of tri-phones into distinct categories to characterize them. The criterion used for
clustering can be an information theoretic measure such as likelihood or entropy. This is the concept behind
generalized triphones [8]. Decision trees (described in the next section) can also be used to determine the
distinct categories. Another scheme is to tie the HMM distributions of these tri-phones. More recently, methods
that cluster individual states of the HMM [1] have been introduced. A drawback in using tri-phones is that
wider contexts (i.e., three, four, or five phones to the left) may be important.
Decision Trees
The purpose of the decision tree is to map a large number of conditions (i.e., phonetic contexts) to a small,
manageable number of equivalence classes. Each terminal node of the tree represents a set of phonetic contexts.
The aim in decision tree construction is to find the best possible definition of the equivalence classes [2].
During decoding, the acoustic models to be used for a given observation string are chosen based on the
current acoustic context - that is, by pouring the data down the decision tree until a terminal node is reached
and using the models at that terminal node to compute the likelihood of the data.
Decision Tree Construction
Maximum likelihood (ML) estimation is one common technique used for constructing a decision tree; i.e.,
the aim is to find the different classes that maximize the likelihood of the given training data. A binary decision
tree is grown in the following fashion:
1. From among a set of questions (at the phonetic or lexeme level), the best question for partitioning the
data into two classes (i.e., the question that maximizes the likelihood) is found.
2. The above step is repeated recursively on each of the two classes until either there is insufficient data to
continue or the best question is not sufficiently helpful.
The easiest way to construct a decision tree is to create, in advance, a list of possible questions for each
variable that may be tested. Finding the best question at any given node consists of subjecting all the relevant
variables to each of the questions on the corresponding list and picking the best combination of the variable
and the question.
In building an acoustic decision tree using phonetic context, at least 10 variables may be interrogated: 5
preceding phones and 5 following phones. Since all of these variables belong to the same phonetic alphabet,
only one set of questions needs to be prepared, where each question is a subset of the phonetic alphabet.
Typically, trees are less than 20 layers deep, as beyond that the algorithm runs out of data.
Let X_1, ..., X_n denote the n discrete random variables whose values may be tested. Let Q_ij denote the jth
predetermined question for X_i.
1. Starting at the root node, try splitting each node into two subnodes.
2. For each variable X_i, evaluate the questions Q_i1, Q_i2, etc. Let Q_ik denote the best question, estimated using
any one of the criteria described earlier. The best question at a node is the question that maximizes the
likelihood of the training data at that node after applying the question.
3. Find the best pair (X_i, Q_ik), denoted (X', Q').
4. If the selected question is not sufficiently helpful (the gain in likelihood due to the split is not significant)
or the node does not have sufficient data points, make the current node a leaf.
5. Otherwise, split the current node into two new subnodes according to the answer of question Q' on
variable X'.
The algorithm stops when all nodes are either too small to split further or have been marked as leaves. Over-
training is prevented by limiting the number of questions asked.
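The best-question selection in step 2 can be sketched for scalar acoustic data under single-Gaussian node models; the question set and the likelihood criterion here are illustrative simplifications of what a real acoustic decision tree would use.

```python
import numpy as np

def gaussian_ll(x):
    """Log-likelihood of samples under a single ML-fitted Gaussian."""
    n = len(x)
    var = max(np.var(x), 1e-6)        # floor to avoid degenerate splits
    return -0.5 * n * (np.log(2 * np.pi * var) + 1.0)

def best_question(data, contexts, questions):
    """Pick the question whose yes/no split maximizes the likelihood gain.

    data      : (n,) acoustic values at this node
    contexts  : list of n phonetic contexts (one per data point)
    questions : dict name -> predicate over a context
    Returns (question name, gain in log-likelihood).
    """
    base = gaussian_ll(data)
    best, best_gain = None, -np.inf
    for name, pred in questions.items():
        mask = np.array([pred(c) for c in contexts])
        if mask.all() or not mask.any():
            continue                   # question does not split the data
        gain = gaussian_ll(data[mask]) + gaussian_ll(data[~mask]) - base
        if gain > best_gain:
            best, best_gain = name, gain
    return best, best_gain
```

Growing the tree then amounts to applying this selection recursively and stopping when the gain or the node size falls below a threshold, as in steps 4 and 5.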
Questions
A decision tree has a question associated with every non-terminal node. These can be grouped into continuous
and discrete questions.
Discrete questions. If X is a discrete random variable that takes values in some finite alphabet R, then a
question about X has the form: Is X an element of S, where S is a subset of R? Typically, questions are of the
form "Is the preceding phone a vowel?" or "Is the following phone an unvoiced stop?"
Continuous questions. If X is a continuous random variable that takes real values, a question about X has the
form: Is X ≤ θ, where θ is some real value? Instead of limiting the questions to a predefined set, we could search
for the best subset of values taken by the random variable at any node and use the best question found. This
implies that we generate questions on the fly during tree construction. The disadvantages of this approach
are that too much CPU time is needed to search for the best subset and, because there are so many subsets, there is too
much freedom in the tree-growing algorithm, resulting in over-training or spurious questions that do not
generalize very well.
All of these questions can be constructed in advance by experts. For example, phonetic questions can be
generated by linguists or by algorithms [9].
Language Modeling
Humans recognize words based not only on what they hear, but also on what they have heard in the past, as
well as what they anticipate hearing in the future. It is this capability that makes humans the best speech
recognition systems. Modern speech recognition systems attempt to achieve this human capability through
language modeling. Language modeling is the art and science of anticipating or predicting words or word
sequences from nonacoustic sources of information, such as the context, structure, and grammar of the particular
language, and previously heard word sequences.
In large vocabulary speech recognition, in which word sequences W are uttered to convey some message,
the language model P(W) is of critical importance to the recognition accuracy. In most cases, the language
model must be estimated from a large text corpus.
For practical reasons, the word sequence probability P(W) is approximated by

    P(W) ≈ ∏_{i=1}^{L} P(w_i | w_{i-1}, w_{i-2}, ..., w_{i-N+1})    (15.60)

This is called an N-gram language model, where N is the number of words from the history that are used in
the computation. Typically, N = 3, and these are referred to as trigram language models; L is the number of
words in the sequence being decoded.
The conditional probabilities in Eq. (15.60) are estimated by the simple relative frequency approach described
in [23].
The maximum entropy approach is a method of estimating the conditional probability distributions (de-
scribed below). In cases when the text corpus is not large enough to reliably estimate the probabilities, smoothing
techniques such as linear smoothing (deleted interpolation) are applied (described below).
Perplexity is a measure of performance of language models. Perplexity, defined in Eq. (15.61), is the average
word branching factor of the language model:

    Perplexity = 2^H = P(w_1, w_2, ..., w_L)^{-1/L}    (15.61)
Perplexity is an important parameter in specifying the degree of sophistication of a recognition task, from the
source uncertainty to the quality of the language model.
Other techniques that have been used in language modeling include decision tree models [43] and automat-
ically inferred linked grammars to model long-range correlations [44].
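The relative-frequency trigram estimate and the perplexity of Eq. (15.61) can be sketched as follows. Real language models add the smoothing discussed next; this unsmoothed sketch assumes every trigram scored was seen in training, since unseen trigrams would otherwise receive zero probability.

```python
import math
from collections import Counter

def train_trigram(tokens):
    """Relative-frequency estimates P(w3 | w1, w2) = C(w1 w2 w3) / C(w1 w2).

    Histories are counted only where a third word follows, so that the
    conditional estimates normalize properly.
    """
    tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
    hist = Counter(zip(tokens[:-2], tokens[1:-1]))
    return {t: c / hist[t[:2]] for t, c in tri.items()}

def perplexity(model, tokens):
    """2**H with H = -(1/L) log2 P(w1 ... wL), per Eq. (15.61)."""
    trigrams = list(zip(tokens, tokens[1:], tokens[2:]))
    log_p = sum(math.log2(model[t]) for t in trigrams)
    return 2 ** (-log_p / len(trigrams))
```

A perplexity of 1 means the model predicts every next word with certainty; larger values correspond to a larger effective branching factor.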
Smoothing
In computing the language model probabilities, we desire the following: few parameters to estimate; that the
available data be sufficient for the estimation of those parameters; and that the probability can be constructed at
recognition time from the parameter values while occupying limited storage. Several smoothing techniques
have been proposed to handle the scarcity of data [25]. These are essential in the construction of N-gram
language models. They include linear smoothing, also known as deleted interpolation, backing-off, bucketing,
and equivalence classification techniques. An extensive empirical study of these techniques for language mod-
eling is given in [35]. A brief description of two smoothing techniques is covered in this section. Linear
smoothing is due to Jelinek and Mercer [37], where a class of smoothing models that involve linear interpolation
is presented. The maximum-likelihood estimate is interpolated with the smoothed lower-order distribution,
defined analogously, i.e.,

P_int(w_i | w_{i-n+1}, …, w_{i-1}) = λ P_ML(w_i | w_{i-n+1}, …, w_{i-1}) + (1 - λ) P_int(w_i | w_{i-n+2}, …, w_{i-1})    (15.62)

To yield meaningful results, the training data used to estimate the interpolation weights λ need to be distinct
from the data used to estimate P_ML. In held-out interpolation, a section of training data is reserved for this
purpose. In [37], a technique called deleted interpolation is described, where different parts of the training data
rotate in training either P_ML or the weights λ, and the results are then averaged. The other widely used
smoothing technique in speech recognition is the backing-off technique described by Katz [36]. Here, the
Good-Turing estimate [36] is extended by adding the interpolation of higher-order models with lower-order
models. This technique performs best for bigram models estimated from small training sets. The smoothed
trigram language model probability is defined by

P̃(w_3 | w_1, w_2) = λ_3 P(w_3 | w_1, w_2) + λ_2 P(w_3 | w_2) + λ_1 P(w_3)    (15.63)
The backing-off technique suggests that if the count C(w_1, w_2) is sufficiently large, then P(w_3 | w_1, w_2) by itself
is a better estimate of P̃(w_3 | w_1, w_2). Hence, for different values of the counts, a different estimate of
P̃(w_3 | w_1, w_2) is used. Several variations on the choice of this threshold and the Good-Turing type function
are described in [25].
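The interpolated estimate of Eq. (15.63) is straightforward to sketch in code. The component probabilities and weights below are invented values; in practice, the λ's would be trained on held-out data by deleted interpolation rather than fixed by hand:

```python
def interpolated_trigram(p3, p2, p1, lambdas=(0.6, 0.3, 0.1)):
    """Linear interpolation of trigram, bigram, and unigram estimates
    in the style of Eq. (15.63): p3 = P(w3|w1,w2), p2 = P(w3|w2), p1 = P(w3)."""
    l3, l2, l1 = lambdas
    assert abs(l3 + l2 + l1 - 1.0) < 1e-9  # weights must sum to one
    return l3 * p3 + l2 * p2 + l1 * p1

# Even when the trigram estimate is zero (unseen history), the smoothed
# probability stays positive because the lower-order terms contribute:
print(interpolated_trigram(0.0, 0.2, 0.05))  # about 0.065
```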
Maximum Entropy Based Language Models
A maximum-likelihood approach for automatically constructing maximum entropy models is presented in
[34]. The maximum entropy method finds a model that simultaneously satisfies a set of constraints. Such a
model is a method of estimating the conditional probability distributions. The principle is simple: model all
that is known and assume nothing about that which is not known. Let x and y be a set of random variables
such that P(y | x) is the probability that the model assigns an output y given x. Let f(x, y) be the indicator
function (the expected value of this function is the feature function) that takes a binary value of 1 or 0 to fit
the training data. If P satisfies Σ_{x,y} P(x) P(y | x) f(x, y) = d(f), where the d(f) are the constraints, then there must
be a probability P that satisfies all the constraints uniformly. A mathematical measure of the uniformity of
conditional distributions is the conditional entropy, H. The solution to this problem is obtained by selecting
the model with maximum entropy from the set of possible models C, i.e.,

p* = argmax_{p ∈ C} H(p)    (15.64)
It can be shown that p* is well-defined and that there is always a model with maximum entropy in any
constrained set C. For simple cases, the above equation can be solved mathematically, but for the more general
case, the use of Lagrange multipliers from constrained optimization theory is employed. This approach leads
to the following statement: the maximum entropy model subject to a set of constraints has the parametric
form given by Eq. (15.65),

P(y | x) = (1 / Z_λ(x)) exp(Σ_i λ_i f_i(x, y))    (15.65)

where the Lagrange multipliers λ_i can be determined by maximizing the Lagrangian Λ = H + Σ_i λ_i (p(f_i) - p̃(f_i)).
Z_λ(x) is the normalization constant, and p̃(f_i) and p(f_i) are the empirical and expected distributions.
Since the Lagrangian is the log-likelihood for the exponential model P, the solution states that the model with
the maximum entropy is the one that maximizes the likelihood of the training data.
Details on the construction of maximum entropy models, techniques for the selection of features to be
included in models, and the computation of the parameters of these models are addressed in [34]. Techniques
for computing the parameters of such models, such as hill climbing, iterative projection, and iterative scaling
algorithms, are described in [25].
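The parametric form of Eq. (15.65) is easy to state in code. The binary features, weights, and vocabulary below are invented for illustration; in practice the λ's would be fit by one of the iterative algorithms cited above rather than chosen by hand:

```python
import math

def maxent_prob(x, y, ys, features, lambdas):
    """P(y|x) = exp(sum_i lambda_i * f_i(x, y)) / Z(x), as in Eq. (15.65)."""
    score = lambda yy: math.exp(sum(l * f(x, yy) for f, l in zip(features, lambdas)))
    Z = sum(score(yy) for yy in ys)  # normalization constant Z(x)
    return score(y) / Z

# Two toy binary indicator features over a three-word output vocabulary:
features = [lambda x, y: 1.0 if x == "cat" and y == "sat" else 0.0,
            lambda x, y: 1.0 if y == "ran" else 0.0]
ys = ["sat", "ran", "ate"]
p = maxent_prob("cat", "sat", ys, features, lambdas=[1.0, 0.5])
print(round(p, 3))  # prints 0.506
```

Because of the exponential form, any choice of λ's yields a properly normalized conditional distribution; training only adjusts the weights so the expected feature values match the empirical ones.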
Hypothesis Search
It is the aim of the speech recognizer to determine the best possible word sequence given the acoustic obser-
vation; that is, the word string Ŵ that satisfies

Ŵ = argmax_W P(W) P(A | W)    (15.66)

Since the word string is made up of a sequence of words, the search for Ŵ is done over several possible
hypotheses. Viterbi search, a time-synchronous search strategy, and tree search, a time-asynchronous search
strategy, are presented here.
Viterbi Search
The Viterbi algorithm [11] introduced previously finds the most likely path through an HMM. Equation (15.66)
demands that we find, for each candidate word string W, the probability of the set of paths that corresponds to
the word string W, and then identify the word string whose set of paths has the highest probability.
Section 2 described the hidden Markov model concept. The Viterbi algorithm finds the maximizing state
sequence over successive levels i (there may be several), deciding at the final level from among the competing
sequences. At each level i, the paths whose probabilities fall below a threshold are purged. A traceback from the
final state for which the probability is maximized to the start state in the purged trellis yields the most likely
state sequence.
Each word is represented as a concatenation of several HMMs, one corresponding to each of the phones
that make up the word. If we are using a bigram language model, then P(W) = P(w_1) ∏_{i=2..n} P(w_i | w_{i-1}), and
Ŵ is the most likely path through these HMMs. The number of states is proportional to the vocabulary size, V.
If we are using the trigram language model, then P(W) = P(w_1) P(w_2 | w_1) ∏_{i=3..n} P(w_i | w_{i-2}, w_{i-1}), and the graph
becomes more complicated, with the number of states being proportional to V². No practical algorithms exist
for finding the exact solution, but the Viterbi algorithm will find the most likely path through these HMMs,
whose identity can then determine the recognized word string.
One drawback of the Viterbi algorithm is the number of states that have to be evaluated for a bigram language
model, even for a practical vocabulary size of 60,000 words. A shortcut that is commonly used is the beam
search. Here, the maximal probability of the states at stage i - 1, i.e., max_{s_1, …, s_{i-1}} P(s_1, s_2, …, s_{i-1}, y_1, …,
y_{i-1} | s_0), is computed and used as the basis for computing a dynamic threshold to prune out all states in the trellis
whose path probabilities fall below this threshold. Multi-pass search strategies have been proposed over the
thresholding used in the beam search to handle more complex language models [6].
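The beam pruning just described can be sketched as follows. The two-state HMM parameters are toy values invented for the example, log probabilities are used to avoid underflow, and the traceback step is omitted for brevity (only the best final state and its score are returned):

```python
import math

def viterbi_beam(obs, states, log_init, log_trans, log_emit, beam=5.0):
    """Time-synchronous Viterbi search; at each stage, states whose best
    path log-probability falls more than `beam` below the stage maximum
    are purged, as in the dynamic thresholding described above."""
    scores = {s: log_init[s] + log_emit[s][obs[0]] for s in states}
    for o in obs[1:]:
        best = max(scores.values())
        alive = {s: v for s, v in scores.items() if v >= best - beam}  # prune
        scores = {s: max(v + log_trans[p][s] for p, v in alive.items())
                     + log_emit[s][o] for s in states}
    best_state = max(scores, key=scores.get)
    return best_state, scores[best_state]

# Toy two-state HMM (all parameters are invented for illustration):
log = math.log
states = ["A", "B"]
log_init = {"A": log(0.6), "B": log(0.4)}
log_trans = {"A": {"A": log(0.7), "B": log(0.3)},
             "B": {"A": log(0.4), "B": log(0.6)}}
log_emit = {"A": {"x": log(0.9), "y": log(0.1)},
            "B": {"x": log(0.2), "y": log(0.8)}}
best_state, logp = viterbi_beam("xxy", states, log_init, log_trans, log_emit)
print(best_state)  # the final observation "y" strongly favors state B
```

A wider beam evaluates more states and is safer; a narrower beam is faster but risks pruning the path that would ultimately have won.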
Tree Search
The search for the most likely word sequence can be thought of as searching for a path in a tree whose branches
are labeled with the various words of the vocabulary V, such that there are V branches leaving each node, one
for each word (i.e., V is the size of the vocabulary). Typically, in large vocabulary continuous speech recognition,
this search over a tree of possible hypotheses turns out to be a very large computational effort. Hence, the
search is limited by a fast match approach [26] that rejects from consideration several branches of the tree
without subjecting them to a detailed analysis. The Viterbi algorithm achieves the same kind of pruning using
the beam search approach and multi-pass strategies.
Stack Search
Stack search algorithms for speech recognition have been used at IBM [19] and MIT Lincoln Labs [20]. This
heuristic search algorithm helps to reduce computational and storage needs without sacrificing accuracy. Any
tree search must be based on some evaluation criterion related to the search's purpose. The algorithm below
is a popular algorithm used for the heuristic determination of minimum-cost paths:
1. Insert into a stack all the single-branch paths corresponding to the words of the vocabulary.
2. Sort these entries in descending order of a function F(w_i), where w_i ∈ vocabulary V.
3. If the top entry in the stack is the end of the utterance, the search ends; otherwise, each entry in the
stack is extended using F(·) for all possible words in the vocabulary and inserted into the stack while
maintaining the stack order.
F(·) is

F(w_1^k) = max_{w^r} P(a_1^n, w^r)    (15.67)

where w^r denotes a word string of length r and a_1^n denotes the acoustic data to be recognized.
The methods described in [14] incorporate the definition of an envelope that is used to mark partial paths
in the stack as alive or dead; these considerably speed up the search. In [13], a tree search strategy called the
envelope search is presented. This is a time-asynchronous search that combines aspects of the A* search with the
time-synchronous Viterbi search. Several bi-directional search strategies have been proposed by Li et al. [18].
Kenny et al. discuss several aspects of A* algorithms in [12]. A different approach, involving majority decisions
on observed discrete acoustic output strings leading to a polling fast match, is introduced by Bahl et al. in [15].
The use of multiple stacks is yet another way to control the search procedure and is presented in [13].
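The three numbered steps above amount to a best-first search over a priority queue. In the sketch below, the toy scoring function stands in for F(·), which in a real decoder would combine acoustic and language model likelihoods; the vocabulary, target, and completion test are all invented for the example:

```python
import heapq

def stack_search(vocab, score, is_complete, max_steps=10000):
    """Best-first search over word strings: repeatedly extend the
    highest-scoring partial hypothesis until a complete one reaches the top."""
    heap = [(-score((w,)), (w,)) for w in vocab]  # step 1: single-word paths
    heapq.heapify(heap)                           # step 2: keep entries ordered
    for _ in range(max_steps):
        neg, path = heapq.heappop(heap)           # step 3: inspect top entry
        if is_complete(path):
            return path
        for w in vocab:                           # extend and re-insert
            ext = path + (w,)
            heapq.heappush(heap, (-score(ext), ext))
    return None

# Toy task: recover the three-word string that a made-up scorer likes best.
target = ("speech", "is", "hard")
vocab = ["speech", "is", "hard", "easy"]
score = lambda p: sum(1.0 for a, b in zip(p, target) if a == b) - 0.1 * len(p)
res = stack_search(vocab, score, is_complete=lambda p: len(p) == 3)
print(res)  # -> ('speech', 'is', 'hard')
```

Like the stack decoders cited above, this search is only as good as its evaluation function: a poorly chosen F(·) can let an incomplete but overrated hypothesis crowd out the correct path.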
Tree Search vs. Viterbi Search
A Viterbi search of a trellis finds the most likely succession of transitions through a composite HMM composed
of word HMMs. The number of states in a trellis stage (determined by the end states of the word HMMs) must
be limited to keep the search's storage and computational requirements feasible. The tree search imposes no
such constraint on the number of end states, as long as the search does not prune out the correct path. Both
algorithms are suboptimal in the sense that they are not guaranteed to find the most probable word string.
State-of-the-Art Systems
In the 1998 DARPA Hub-4E English Broadcast News Benchmark Test, an overall recognition error rate of 13.5%
was achieved. This test includes recognition of baseline broadcast speech, spontaneous broadcast speech, speech
over telephone channels, speech in the presence of background music, speech under degraded acoustic conditions,
speech from non-native speakers, and all other kinds of speech. Details can be obtained from the NIST Web site.
Another benchmark test is the Switchboard Task, which is the transcription of conversations between two people
over the telephone. Error rates for this task are approximately 37%. In the Airline Travel Information System (ATIS)
speech recognition evaluation conducted by DARPA, error rates close to 2% have been obtained. High recognition
accuracies have been obtained for digit recognition, with error rates under 1% (TIMIT database [31]).
Challenges in Speech Recognition
Some of the issues that still arise in speech recognition, and make interesting research problems for the present
and the future, include:
1. Accurate transcription of spontaneous speech, as compared to read speech, is still a major challenge because
of its inherently casual and incoherent nature, embedded disfluencies, and incomplete voicing of several
phones or words.
2. Recognizing speech between individuals and/or multiple speakers in a conference.
3. Robustness of recognition to different kinds of channels, background noise in the form of music or speech
over speech, variation in distances between the speaker and the microphone, etc.
4. Recognition across age groups, speaking rates, and accents.
5. Building a language model for an unknown domain and the addition of out-of-vocabulary (OOV)
words.
6. In order to dynamically adapt to speakers (i.e., make use of their speech when no transcription is
available), unsupervised adaptation is necessary. To do this accurately, we need a confidence measure
on the decoded script.
7. Speech recognition systems do not have any understanding of the decoded speech. To move toward under-
standing/machine translation, we need some post-processing of the transcription that could lead to
intelligent conversational systems.
Applications
The use of speech as a means of input, in a fashion similar to the use of the keyboard and the mouse, has
resulted in the application of speech recognition to a wide set of fields. These can be broadly divided into three
segments: desktop, telephony, and embedded systems.
1. In the desktop area, continuous speech recognition has been used for dictation of text documents, for
commands to navigate the desktop environment, and for Internet surfing. Dictation accuracies on the order
of 96% and greater have been achieved. The main players in this field include IBM with their ViaVoice
series of products, Dragon Systems, L&H, Philips, and Microsoft.¹ Software tailored to dictation in
specialized fields, such as radiology, general medicine, and the legal domain, has also been put out by
some of these companies. Recorded speech using hand-held digital recorders can also be transcribed
subsequently by the same software.
2. Telephony is an emerging field for applications of speech recognition. These include repertory dialing,
automated call type recognition, credit card validation, directory listing retrieval, speaker identifica-
tion, financial applications such as trading of stocks and mutual funds, banking, and voice mail transcription.
Many companies have their own specialized products for telephony.
3. The use of speech input for embedded systems is a relatively new field, because only recently have handheld
systems had adequate CPU and memory for accurate speech recognition.
Defining Terms
Acoustic Model: Any statistical or syntactic model that represents a speech signal.
Baseforms: Representation of a word as a sequence of phones.
Baum-Welch algorithm: A form of EM (expectation-maximization); an iterative procedure to estimate the
parameters of a stochastic model by maximizing the likelihood of the data.
Cepstra: The Fourier transform of the logarithm of the power spectrum sampled at regular intervals.
Co-articulation: Pronunciation of phones being influenced by the previous and following phones.
Decision trees: A technique used to group several conditions into classes.
Forward-backward: A recursive algorithm for computing the posterior probability of an HMM using forward
and backward variables.
Gaussian mixtures: Convex combination of Gaussian (a kind of probability distribution function) functions.
Hidden Markov Model (HMM): A stochastic model that uses state transition and output probabilities to
generate observation sequences.
Hypothesis search: Search through a large set of hypotheses of word sequences to find the optimal word
sequence.
Language Model: Language models predict words or word sequences from nonacoustic sources of informa-
tion, such as the context, structure, and grammar of the particular language.
Linear Prediction Coefficients (LPC): A representation of an analog signal using an autoregressive model.
MAP: Maximum a posteriori. Technique for speaker adaptation.
MLLR: Maximum Likelihood Linear Regression. Technique for speaker adaptation.
Phones: Sub-word acoustic units.
¹While this is not an exhaustive list of companies with continuous speech recognition software products, they are the
leaders in the field to date.
Signal pre-processing: Conversion of an analog speech signal into a sequence of numeric feature vectors or
observations.
Speaker adaptation: The process of using a small amount of data from a new speaker to tune a set of speaker-
independent acoustic models to the new speaker.
Supervised and Unsupervised Adaptation: In speaker adaptation, the procedure is said to be supervised if
the true transcription of the adaptation data is known, and is unsupervised otherwise.
Tri-phone: Context-dependent model of a phone as a function of its previous and succeeding phones.
Viterbi: An algorithm for finding the optimal state sequence through an HMM, given a particular observation
sequence.
References
1. Young, S. J. and Woodland, P. C., State clustering in HMM-based continuous speech recognition, Computer Speech and Language, 8, 369, 1994.
2. Bahl, L. R. et al., Decision trees for phonological rules in continuous speech, ICASSP, 1991.
3. Gauvain, J.-L. and Lee, C.-H., Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains, IEEE Transactions on Speech and Audio Processing, 2, 1994.
4. Baum, L. E., An inequality and associated maximization technique in statistical estimation of probabilistic functions of Markov processes, Inequalities, 3, 1, 1972.
5. Chow, Y. et al., BYBLOS: The BBN continuous speech recognition system, IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 89, 1987.
6. Lee, C. H., Soong, F. K., and Paliwal, K. K., Automatic Speech and Speaker Recognition, Kluwer Academic Publishers, 1996.
7. Lee, K. and Hon, H., Large vocabulary speaker independent continuous speech recognition, IEEE International Conference on Acoustics, Speech and Signal Processing, 1988.
8. Lee, K., Hon, H., Hwang, M., Mahajan, S., and Reddy, R., The Sphinx speech recognition system, IEEE International Conference on Acoustics, Speech and Signal Processing, 1989.
9. Bahl, L., de Souza, P., Gopalakrishnan, P. S., and Picheny, M., Context-dependent vector quantization for continuous speech recognition, ICASSP, 1993.
10. Bahl, L., de Souza, P., Gopalakrishnan, P. S., Nahamoo, D., and Picheny, M., Robust methods for using context-dependent features and models in a continuous speech recognizer, ICASSP, I, 533, 1994.
11. Viterbi, A. J., Error bounds for convolutional codes and an asymptotically optimum decoding algorithm, IEEE Transactions on Information Theory, IT-13, 260, 1967.
12. Kenny, P. et al., A new fast match for very large vocabulary continuous speech recognition, ICASSP, II, 656, 1993.
13. Bahl, L. R., Gopalakrishnan, P. S., and Mercer, R. L., Search issues in large vocabulary speech recognition, Proceedings of the 1993 Workshop on Automatic Speech Recognition, 1993.
14. Gopalakrishnan, P. S., Bahl, L. R., and Mercer, R. L., A tree search strategy for large-vocabulary continuous speech recognition, ICASSP, I, 572, 1995.
15. Bahl, L. R., Bakis, R., de Souza, P. V., and Mercer, R. L., Obtaining candidate words by polling in a large vocabulary speech recognition system, ICASSP, I, 489, 1988.
16. Lippmann, R. P., Review of neural networks for speech recognition, in Readings in Speech Recognition, Waibel, A. and Lee, K. F., Eds., Morgan Kaufmann, San Mateo, CA, 1990.
17. Rabiner, L. R. and Levinson, S. E., Isolated and connected word recognition - Theory and selected applications, IEEE Transactions on Communications, COM-29, 621, 1981.
18. Li, Z., Boulianne, G., Laboute, P., Barszcz, M., Garudadri, H., and Kenny, P., Bi-directional graph search strategies for speech recognition, Computer Speech and Language, 10, 295, 1996.
19. Bahl, L. R., Jelinek, F., and Mercer, R. L., A maximum likelihood approach to continuous speech recognition, IEEE Trans. Pattern Anal. and Mach. Int., PAMI-5, 179, 1983.
20. Paul, D., An efficient A* stack decoder algorithm for continuous speech recognition with a stochastic language model, Proc. DARPA Workshop on Speech and Natural Language, pp. 405, 1992.
21. Gao et al., Speaker adaptation based on pre-clustering training speakers, Eurospeech'97, pp. 2091, 1997.
22. Padmanabhan et al., Speaker clustering transformation for speaker adaptation in large vocabulary speech recognition systems, ICASSP, 1996.
23. Rabiner, L. and Juang, B.-H., Fundamentals of Speech Recognition, Prentice-Hall Signal Processing Series, 1993.
24. Schwartz, R., Chow, Y., Kimball, O., Roucos, S., Krasner, M., and Makhoul, J., Context-dependent modeling for acoustic-phonetic recognition of continuous speech, ICASSP, 1985.
25. Jelinek, F., Statistical Methods for Speech Recognition, MIT Press, 1997.
26. Bahl, L. R. et al., A fast approximate match for large vocabulary speech recognition, IEEE Transactions on Speech and Audio Processing, 1, 59, 1993.
27. Leggetter, C. J. and Woodland, P. C., Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models, Computer Speech and Language, 9, 171, 1995.
28. Hermansky, H. and Morgan, N., RASTA processing of speech, IEEE Transactions on Speech and Audio Processing, 2, 587, 1994.
29. Hunt, M. J., A statistical approach to metrics for word and syllable recognition, Journal of the Acoustical Society of America, 66(S1), S35(A), 1979.
30. Hermansky, H., Perceptual linear predictive (PLP) analysis of speech, Journal of the Acoustical Society of America, 87(4), 1748, 1990.
31. Lamel, L., Speech database development: Design and analysis of the acoustic-phonetic corpus, Proceedings of the DARPA Speech Recognition Workshop, pp. 100, 1986.
32. Lee, K., Automatic Speech Recognition, Kluwer Academic, 1989.
33. Pallett, D. S., Fiscus, J. G., Fisher, J. S., Garofolo, W. M., Lund, B. A., Martin, A., and Przybocki, M. A., 1994 benchmark tests for the ARPA Spoken Language Program, Proceedings of the Spoken Language Systems Technology Workshop, Austin, TX, Jan. 22-25, 1995.
34. Berger, A. L., Della Pietra, S. A., and Della Pietra, V. J., A maximum entropy approach to natural language processing, Computational Linguistics, 22(1), 39-73, Mar. 1996.
35. Chen, S. F. and Goodman, J., An empirical study of smoothing techniques for language modeling, Technical Report TR-10-98, Harvard University, August 1998.
36. Katz, S. M., Estimation of probabilities from sparse data for the language model component of a speech recognizer, IEEE Transactions on Acoustics, Speech and Signal Processing, ASSP-35(3), 400-401, Mar. 1987.
37. Jelinek, F. and Mercer, R. L., Interpolated estimation of Markov source parameters from sparse data, Proceedings of the Workshop on Pattern Recognition in Practice, May 1980.
38. Gales, M., Maximum likelihood linear transformations for HMM-based speech recognition, Computer Speech and Language, 12, 75-98, 1998.
39. Eide, E. et al., A parametric approach to vocal tract normalization, Proceedings of the 13th Annual Speech Research Symposium, CLSP, Baltimore, pp. 161-167, June 1995.
40. Wegmann, S. et al., Speaker normalization on conversational telephone speech, ICASSP'96, Vol. 1, pp. 339-341, May 1996.
41. McDonough, J. et al., Speaker adaptation with all-pass transforms, ICASSP'99, Vol. II, pp. 757-760, Phoenix, May 1999.
42. Bamberg, P., Vocal tract normalization, Verbex Internal Technical Report, 1981.
43. Bahl, L. R. et al., A tree-based statistical language model for natural language speech, IEEE Transactions on Acoustics, Speech and Signal Processing, 37(7), 1989.
44. Berger, A. et al., The Candide system for machine translation, Proceedings of the ARPA Conference on Human Language Technology, New Jersey, 1994.
45. Chen, S., Adaptation by correlation, Proceedings of the DARPA Speech Recognition Workshop, Virginia, 1997.
46. Shinoda, K. and Lee, C.-H., Structural MAP speaker adaptation using hierarchical priors, IEEE Workshop on Automatic Speech Recognition and Understanding Proceedings, pp. 381-388, 1997.
For Further Information
There are three major speech-related conferences each year, namely, the International Conference on Acoustics, Speech,
and Signal Processing (ICASSP), the International Conference on Spoken Language Processing (ICSLP), and the European
Conference on Speech Communication and Technology (EUROSPEECH). Besides these, the Defense Advanced Research
Projects Agency (DARPA) conducts workshops on the Broadcast News Transcription (transcription of live television
broadcasts) and Switchboard (conversations between individuals over the telephone) tasks. Also, there are
several conferences addressing specific issues such as phonetic sciences, robust methods for speech recognition
in adverse conditions, etc. Journals related to speech include IEEE Transactions on Speech and Audio Processing,
IEEE Transactions on Signal Processing, Computer Speech and Language, Speech Communication, and IEEE
Transactions on Information Theory. Additional details on the statistical techniques used in speech recognition
can be found in several books [23, 25, 32]. A good review of current techniques can also be found in [6].
Acknowledgements
The authors wish to thank Dr. Harry Printz and Dr. R. T. Ward of IBM T. J. Watson Research Center for their
careful review of this manuscript and many useful suggestions.