
Practical Applications of Speech Signal Processing

Vishu R Viswanathan
TI Fellow, Director, Speech Technologies Lab
DSP Solutions R&D Center
Texas Instruments, Dallas, Texas
v-viswanathan@ti.com

March 2004 Vishu Viswanathan 1


Lecture Outline

Goals of the Lecture


Speech Coding
Speech Synthesis
Speech Recognition & Understanding
Speaker Recognition
Speech Enhancement
Speech Modification

Goals of the Lecture

Introduce and discuss each of a number of speech signal processing areas
List examples of practical applications
Discuss some selected topics in each area
High level presentation only

Speech Coding
Goal
Reduce speech signal data rate
Maintain high speech quality
General Principle: Take advantage of
Redundancies in the speech signal
Properties of speech production and perception
Applications
Digital cellular telephony, voice over IP, IP phone,
audio/video conferencing, PSTN trunking, secure voice
communication, digital answering machines, voice mail, voice
response systems, talking products

Components of a Speech Coding System

Sampled speech s(n) -> Analyzer -> x(n) -> Encoder -> y(n) -> Channel or Medium -> y(n) -> Decoder -> x(n) -> Synthesizer -> s(n)

Goal: Minimize data rate of y(n) while maximizing speech quality of s(n)



Types of Speech Coders
Waveform Coders
Goal: Reproduce speech on a sample-by-sample basis
High data rates, high speech quality
Examples: 64 kb/s PCM (G.711), 32 kb/s ADPCM (G.726)
Parametric Coders
Speech production characterized by parametric models
Low data rates, good speech intelligibility, communications/synthetic speech
quality
Examples: 2.4 kb/s LPC (FS 1015), 2.4 kb/s MELP (recent NATO standard)
Analysis-by-Synthesis Coders
Hybrid between waveform and parametric coders, with medium data rates
Parametric models used, with excitation signal computed by minimizing
error between synthesized speech and input speech
Examples: 16 kb/s G.728, 8 kb/s G.729
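
The parametric model mentioned above is commonly a linear predictor (LPC). As a minimal sketch, and not the procedure of any particular standard, the following Python estimates predictor coefficients for one frame with the autocorrelation method and the Levinson-Durbin recursion. The function name `lpc` and the use of NumPy are assumptions made for illustration.

```python
import numpy as np

def lpc(frame, order):
    """Estimate LPC coefficients for one frame (autocorrelation method).

    Returns (a, err) where a[0] = 1 and the prediction error is
    e(n) = s(n) + a[1]*s(n-1) + ... + a[order]*s(n-order),
    and err is the residual energy after prediction.
    """
    frame = np.asarray(frame, dtype=float)
    # Autocorrelation lags r[0] .. r[order]
    r = np.array([np.dot(frame[:len(frame) - k], frame[k:])
                  for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        # Reflection coefficient for stage i
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err
        # Update coefficients 1 .. i-1 from the previous stage, then set a[i]
        prev = a.copy()
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        # Residual energy shrinks by (1 - k^2) at every stage
        err *= (1.0 - k * k)
    return a, err
```

A coder in this family would quantize and transmit the coefficients (or an equivalent representation) instead of the waveform samples, which is where the data-rate saving comes from.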

Speech Quality
Terms Used
Toll quality: High-grade wireline telephone
High quality
Good quality
Communications quality
Transparent quality
Formal Subjective Testing Methods
Expensive, time consuming
Mean opinion score (MOS): Used in all industry standards bodies
Diagnostic acceptability measure (DAM): Used by US Dept of Defense
Informal and Semi-Formal Subjective Tests
Pairwise or A/B comparisons
Rating tests
Objective Methods
Signal-to-Noise Ratio, ITU-T P.862 (PESQ)
Automatic, repeatable, useful in coder development and optimization
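
Of the objective methods listed, signal-to-noise ratio is simple enough to sketch in a few lines. The Python below is an illustrative example (the function name `snr_db` is an assumption) and presumes the original and coded signals are time-aligned and of equal length. PESQ is far more involved and is not attempted here.

```python
import numpy as np

def snr_db(reference, processed):
    """Signal-to-noise ratio in dB between a reference and a processed signal."""
    reference = np.asarray(reference, dtype=float)
    processed = np.asarray(processed, dtype=float)
    # Everything that differs from the reference is treated as noise
    noise = reference - processed
    signal_energy = np.sum(reference ** 2)
    noise_energy = np.sum(noise ** 2)
    if noise_energy == 0.0:
        return float("inf")  # identical signals
    return 10.0 * np.log10(signal_energy / noise_energy)
```

SNR is meaningful mainly for waveform coders, which try to reproduce the signal sample by sample. Parametric coders can sound acceptable while scoring poorly on it.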



Speech Coder Attributes
Bit rate (bits/second): low to high; 1200, 2400, 4800, 8000, 16000, 32000, 64000
Quality (Mean Opinion Score): low to high; 2.5, 3.0, 3.5, 4.0
Input conditions: clean speech, handheld, hands-free, noisy speech
Delay (milliseconds): low to high; 10, 50, 100, 200
Complexity (MIPS, memory): low to high
Input signal type: human speech, music, sound effects



Speech Coding Standards

ITU Standards
coder      rate (kb/s)   approach
G.711      64            Mu/A-law
G.726      16-40         ADPCM
G.728      16            LD-CELP
G.729      8             CS-ACELP
G.723.1    5.3/6.3       ACELP/MP-MLQ

ITU standards are targeted for telephone network applications


Also used in Voice over IP applications
All produce toll quality speech
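
To illustrate the Mu-law approach listed for G.711, here is a minimal Python sketch of logarithmic companding followed by 8-bit uniform quantization. It uses the continuous mu-law formula with mu = 255, which is an assumption made for clarity. The bit-exact G.711 encoder uses a segmented approximation of this curve, and the function names here are hypothetical.

```python
import numpy as np

MU = 255.0

def mulaw_compress(x, mu=MU):
    """Map samples in [-1, 1] through the continuous mu-law curve."""
    x = np.clip(np.asarray(x, dtype=float), -1.0, 1.0)
    return np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)

def mulaw_expand(y, mu=MU):
    """Inverse of mulaw_compress."""
    y = np.asarray(y, dtype=float)
    return np.sign(y) * np.expm1(np.abs(y) * np.log1p(mu)) / mu

def quantize_8bit(y):
    """Uniformly quantize values in [-1, 1] to integer codes 0..255."""
    return np.round((np.asarray(y, dtype=float) + 1.0) * 127.5).astype(int)

def dequantize_8bit(codes):
    """Map integer codes 0..255 back to values in [-1, 1]."""
    return np.asarray(codes, dtype=float) / 127.5 - 1.0

def mulaw_codec(x):
    """Compress, quantize to 8 bits, and reconstruct."""
    return mulaw_expand(dequantize_8bit(quantize_8bit(mulaw_compress(x))))
```

At 8000 samples per second, 8 bits per sample gives the 64 kb/s rate in the table. The logarithmic curve spends more quantization levels on low-amplitude samples, where quantization error is more audible.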

Speech Coding Standards
Digital Cellular Standards
region          coder       rate (kb/s)   chan rate (kb/s)   approach    date
Europe          GSM FR      13            22.8               RPE-LTP     1987
Europe          GSM HR      5.6           11.4               VSELP       1994
Europe          GSM EFR     12.2          22.8               ACELP       1995
Europe          GSM AMR     4.75-12.2     11.4-22.8          ACELP       1998
North America   TIA IS54    7.95          13                 VSELP       1989
North America   TIA IS95    0.8-8.55      -                  QCELP       1993
North America   TIA Q13     0.8-13.3      -                  QCELP       1995
North America   TIA IS641   7.4           13                 ACELP       1996
North America   TIA EVRC    0.8-8.55      -                  R-ACELP     1996
North America   TIA SMV     0.8-8.5       -                  R-ACELP     2001
Japan           PDC FR      6.7           11.2               VSELP       1990
Japan           PDC HR      3.45          5.6                PSI-CELP    1993
Japan           PDC EFR     8             11.2               ACELP       1999
Japan           PDC EFR     6.7           11.2               ACELP       2000
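
The gap between the coder rate and the channel rate in the table is the capacity left for channel coding (error protection) on the radio link. As a small illustrative calculation, using only figures from the table and hypothetical helper names, the Python below computes that overhead and the share of the channel carrying speech bits.

```python
# (coder rate, channel rate) in kb/s, copied from the table above
CODERS = {
    "GSM FR": (13.0, 22.8),
    "GSM EFR": (12.2, 22.8),
    "TIA IS54": (7.95, 13.0),
    "PDC FR": (6.7, 11.2),
}

def overhead_kbps(name):
    """Channel capacity not used by speech bits, in kb/s."""
    speech, channel = CODERS[name]
    return channel - speech

def speech_share(name):
    """Fraction of the channel rate that carries speech bits."""
    speech, channel = CODERS[name]
    return speech / channel

for name in CODERS:
    print(f"{name}: overhead {overhead_kbps(name):.2f} kb/s, "
          f"speech share {speech_share(name):.0%}")
```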



Speech Coding Standards

Wideband Standards
coder      rate (kb/s)    approach
G.722      48, 56, 64     SB-ADPCM
G.722.1    24, 32         Transform
ITU WB     16, 24         ACELP
AMR WB     6.60-23.85     ACELP
VMR WB     1.0-13.3       ACELP

Wideband: 50 Hz - 7 kHz (versus narrowband telephone, 300-3200 Hz)

Speech Synthesis
Human Speech Based Systems
Suitable for known material
Speech coding based
Talking toys, talking books, voice prompts, voice response systems
Concatenation of pre-recorded voice data
Information retrieval (stock quotes, airline schedules, banking)
Text-to-Speech Systems
Suitable for unknown or arbitrary text
Applications: e-mail/fax reading, phone access to web based
services, spoken telephone directory, car navigation, location-
based services, customer service, help desk, reading machines
for the blind



Components of a TTS System

Text -> Text Analysis -> Letter-to-Sound -> Speech Synthesizer -> Speech
Resources: Dictionary and Rules

Text Analysis
Numerical expansion (dates, times, money)
Abbreviations, acronyms
Proper name identification
Example: "Dr. Smith lives at 23 Lakeshore Dr."

Letter-to-Sound
Phonemes
Pitch
Duration
Pauses
Loudness/amplitude

Speech Synthesizer
Choice of units: words, phones, diphones, dyads, syllables
Choice of parameters: LPC, formants, waveform templates, articulatory parameters, sinusoidal parameters
Method of computation: rules, concatenation

Courtesy of Larry Rabiner
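
The example sentence on this slide shows why text analysis is needed: "Dr." must be read as "Doctor" in one place and "Drive" in another, and "23" must be expanded into words. The Python below is a toy sketch of that step. The capitalization heuristic and the function names are assumptions for illustration, and a real system would rely on the dictionary and rules mentioned above.

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

def number_to_words(n):
    """Spell out an integer in the range 0..99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]} {ONES[ones]}"

def normalize(text):
    """Expand 'Dr.' and small numbers into speakable words."""
    tokens = text.split()
    out = []
    for i, tok in enumerate(tokens):
        if tok == "Dr.":
            # Assumed heuristic: before a capitalized word it is a title,
            # otherwise (for example at the end of an address) a street suffix.
            nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
            out.append("Doctor" if nxt[:1].isupper() else "Drive")
        elif tok.isdigit() and int(tok) < 100:
            out.append(number_to_words(int(tok)))
        else:
            out.append(tok)
    return " ".join(out)
```

Calling `normalize("Dr. Smith lives at 23 Lakeshore Dr.")` exercises both readings of the abbreviation in a single sentence.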

Speech Recognition & Understanding
Problem
Recognition: Automatic recognition of human speech by machine
Understanding: Interpret the meaning of recognized speech and map it to
actions to be taken
Applications
Voice dialing (name or number dialing) in telephone, cellphone, PDA,
smartphone (Safety laws against handheld cellphone use while driving)
Voice command & control in telematics, cellphone, PDA, smartphone, PC, toys
Voice-enabled web browsing, information retrieval (stock quotes, weather
forecast, airline flight information, banking), navigation, e-mail, SMS, dictation
Automated customer service and help desks
Benefits: hands-free, eyes-free use; not using keypad; faster task completion;
ease of use; part of multi-modal interface; cost savings

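Components of a Speech Recognizer

speech signal -> Feature Extraction -> Acoustic Scoring -> Decoding -> word string
Acoustic Models are used by Acoustic Scoring; Language Models are used by Decoding
Front end: Feature Extraction
Back end: Acoustic Scoring, Decoding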



Speech Recognizer Attributes
Speaker mode: speaker dependent, speaker adaptive, speaker independent
Vocabulary size (words): small to large; 10, 100, 1000, 10000
Speaking style: isolated words, continuous speech, conversational speech
Task: recognition (syntax) to understanding (semantics)
Input conditions: clean speech, handheld, hands-free, noisy speech
Complexity (MIPS, memory): low to high
Architecture: server based, distributed, client based

Performance & Robustness
Performance
Recognition Accuracy: Word error rate (WER) or task completion rate
High enough performance required for user acceptance
Robustness Issues
Training versus operational condition differences
Background noise: extent of noise, its variability (Usually additive)
Channel variability: different microphones, different telephone circuits,
handheld, handsfree, handheld-handsfree (Usually convolutive)
Recognizer must have means to compensate for noise and channel variabilities
Out-of-vocabulary rejection capability
Speaker dialect and accent variability (handled by speaker adaptation)
User Interface: Very important for the success of an application
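
Word error rate, named above as the accuracy measure, is usually computed as the word-level edit distance (substitutions, deletions, insertions) between the reference transcript and the recognizer output, divided by the number of reference words. A minimal Python sketch, with a hypothetical function name and simple whitespace tokenization as an assumption:

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the first i-1 reference words
    # and the first j hypothesis words
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # match or substitution
        prev = curr
    return prev[-1] / len(ref)
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.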



Recognition in Multiple Languages
Speaker-Dependent Recognition
Language independent (User can enroll names for voice dialing in multiple
languages!)

Some Observations for Speaker-Independent Recognition


Same recognition engine but different data (models, dictionary) needed
Recognition grammar to handle language-specific usage differences (e.g.,
French speak telephone numbers in pairs; natural number dialing needed)
Training requires speech databases and dictionary in the new language
Automatic training tools to minimize time to develop recognition in a new
language

Speaker Recognition
Speaker Verification / Authentication
Problem: Use voice input to verify the user's claimed identity
Applications: Secure access to premises, information (banking), services (voice
dialing), etc.
Issues
True user acceptance traded off with impostor acceptance
Total voice verification
Fixed text versus free text
Speaker Identification
Problem: Use voice to identify speaker from a closed or open set of speakers
Applications: Legal and forensic use, intelligence, security
Issues: Uncooperative user, often relatively short-duration speech, noisy
and/or distorted speech.
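
The verification trade-off between true user acceptance and impostor acceptance comes down to a decision threshold on a match score. The Python below is an illustrative sketch with made-up scores, where a higher score means a better match (an assumption). It shows how moving the threshold trades false acceptances against false rejections.

```python
def error_rates(true_scores, impostor_scores, threshold):
    """Return (false_accept_rate, false_reject_rate) at a given threshold.

    A claim is accepted when its score is at or above the threshold.
    """
    false_reject = sum(s < threshold for s in true_scores) / len(true_scores)
    false_accept = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
    return false_accept, false_reject

# Hypothetical scores, for illustration only
true_user = [0.9, 0.8, 0.7, 0.4]
impostor = [0.1, 0.3, 0.5, 0.6]

for threshold in (0.45, 0.65, 0.75):
    fa, fr = error_rates(true_user, impostor, threshold)
    print(f"threshold {threshold}: false accept {fa:.2f}, false reject {fr:.2f}")
```

A secure-access application would typically favor a higher threshold, accepting more rejections of the true user in exchange for fewer impostor acceptances.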

Speech Enhancement

Noise Suppression
Playback Enhancement
Acoustic Echo Cancellation



Noise Suppression
Problem
Remove acoustic noise from noisy speech signal for better listenability or for
improved performance of speech processing devices
Requirements: No speech signal distortion, no loss of speech intelligibility,
no artifacts like musical noises, natural sounding residual noise
Methods
Single microphone approach: spectral subtraction family of methods
Multi-microphone approach: adaptive noise cancellation, microphone array
based fixed or adaptive beamforming, blind signal separation
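
As a rough illustration of the single-microphone approach, the following Python sketch applies basic magnitude spectral subtraction frame by frame. The noise magnitude estimate is assumed to come from a noise-only stretch of signal, and the frame length, the spectral floor, and the function names are all illustrative choices, not values from the lecture.

```python
import numpy as np

def periodic_hann(frame_len):
    """Hann window whose 50%-overlapped copies sum to exactly one."""
    n = np.arange(frame_len)
    return 0.5 - 0.5 * np.cos(2.0 * np.pi * n / frame_len)

def estimate_noise_mag(noise_only, frame_len=256):
    """Average magnitude spectrum over frames of a noise-only signal."""
    hop = frame_len // 2
    window = periodic_hann(frame_len)
    mags = [np.abs(np.fft.rfft(noise_only[s:s + frame_len] * window))
            for s in range(0, len(noise_only) - frame_len + 1, hop)]
    return np.mean(mags, axis=0)

def spectral_subtract(noisy, noise_mag, frame_len=256, floor=0.05):
    """Subtract the noise magnitude per frequency bin, keeping the noisy phase."""
    noisy = np.asarray(noisy, dtype=float)
    hop = frame_len // 2
    window = periodic_hann(frame_len)
    out = np.zeros(len(noisy))
    for start in range(0, len(noisy) - frame_len + 1, hop):
        spec = np.fft.rfft(noisy[start:start + frame_len] * window)
        mag = np.abs(spec)
        # The floor keeps a fraction of the original magnitude so that bins
        # are never driven to zero or negative values
        clean_mag = np.maximum(mag - noise_mag, floor * mag)
        frame = np.fft.irfft(clean_mag * np.exp(1j * np.angle(spec)), n=frame_len)
        out[start:start + frame_len] += frame  # overlap-add
    return out
```

The spectral floor is one common way to soften the musical-noise artifacts named in the requirements. It trades a little residual noise for a more natural-sounding result.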



Playback Enhancement
Problem
Enhanced playback of speech to the listener
Methods
Spectrally shape the speech signal prior to playback, for improved
intelligibility when the listener is in a noisy environment (PA system in
aircraft, airports, sports arenas)
Active noise cancellation to cancel noise acoustically in listener's ears (ANC
headsets)
Narrowband to wideband speech extension to provide wideband speech
perception



Acoustic Echo Cancellation
[Figure: acoustic echo canceller (AEC) block diagram]
Downlink signal r(n); far end signal s(n), played through the loudspeaker
Acoustic channel H(z) between loudspeaker and microphone; the AEC holds an adaptive estimate of H(z)
Near end (microphone) signal: v(n) = u(n) + y(n) + n0(n)
Error signal e(n); uplink signal x(n)

Goal: Cancel feedback from loudspeaker into microphone using adaptive linear filter
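
One widely used adaptive linear filter for this task is normalized LMS (NLMS). The lecture does not say which algorithm is meant, so the Python below is only a sketch of one plausible choice. It assumes the microphone signal contains an echo of the far-end signal, and the function and parameter names are hypothetical.

```python
import numpy as np

def nlms_echo_canceller(far_end, mic, num_taps=32, step=0.5, eps=1e-8):
    """Cancel the far-end echo in the microphone signal with an NLMS filter.

    Returns (error, weights): the echo-reduced signal and the final
    estimate of the loudspeaker-to-microphone impulse response.
    """
    far_end = np.asarray(far_end, dtype=float)
    mic = np.asarray(mic, dtype=float)
    weights = np.zeros(num_taps)
    history = np.zeros(num_taps)  # most recent far-end samples, newest first
    error = np.zeros(len(mic))
    for n in range(len(mic)):
        history = np.roll(history, 1)
        history[0] = far_end[n]
        echo_estimate = np.dot(weights, history)
        error[n] = mic[n] - echo_estimate
        # Step size is normalized by the far-end energy in the filter window
        weights += step * error[n] * history / (np.dot(history, history) + eps)
    return error, weights
```

In a real handset the update is usually frozen while the near-end talker is active (double talk), since near-end speech would otherwise disturb the filter estimate. That control logic is omitted here.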

Speech Modification
Voice Conversion
Convert one voice to sound like another
A female voice converted to sound like a low-pitched male voice (security)

Time-Scale or Rate Modification


Speed up or slow down speech, while preserving naturalness
Applications: talking books, pre-recorded lectures, language learning
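
The simplest way to see how rate modification works is plain overlap-add (OLA): frames are read from the input at one hop size and written to the output at another. The Python below is a sketch of that idea only. It makes no attempt to align frames by waveform similarity or pitch, as practical methods do to preserve naturalness, and the function name and frame size are assumptions.

```python
import numpy as np

def ola_time_stretch(x, rate, frame_len=512):
    """Change speaking rate by overlap-add; rate > 1 speeds up, rate < 1 slows down.

    Requires len(x) >= frame_len.
    """
    x = np.asarray(x, dtype=float)
    synthesis_hop = frame_len // 4
    analysis_hop = max(1, int(round(synthesis_hop * rate)))
    window = np.hanning(frame_len)
    num_frames = 1 + (len(x) - frame_len) // analysis_hop
    out_len = (num_frames - 1) * synthesis_hop + frame_len
    out = np.zeros(out_len)
    norm = np.zeros(out_len)
    for i in range(num_frames):
        a = i * analysis_hop   # read position in the input
        s = i * synthesis_hop  # write position in the output
        out[s:s + frame_len] += x[a:a + frame_len] * window
        norm[s:s + frame_len] += window
    # Divide by the summed window so overlapping frames do not change the level
    covered = norm > 1e-8
    out[covered] /= norm[covered]
    return out
```

With `rate=2.0` the output is about half as long as the input, and with `rate=0.5` about twice as long. Because frames are not aligned before being added, plain OLA can introduce audible artifacts on voiced speech.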
