Practical Applications of
Speech Signal Processing
Vishu R Viswanathan
TI Fellow, Director, Speech Technologies Lab
DSP Solutions R&D Center
Texas Instruments, Dallas, Texas
v-viswanathan@ti.com
March 2004 Vishu Viswanathan 1

Lecture Outline
Goals of the Lecture

Speech Coding
Speech Synthesis
Speech Recognition & Understanding
Speaker Recognition
Speech Enhancement
Speech Modification

Goals of the Lecture
Introduce and discuss each of a number of speech signal processing areas
List examples of practical applications
Discuss some selected topics in each area
High-level presentation only

Speech Coding
Goal
Reduce speech signal data rate
Maintain high speech quality
General Principle: Take advantage of
Redundancies in the speech signal
Properties of speech production and perception
Applications
Digital cellular telephony, voice over IP, IP phone,
audio/video conferencing, PSTN trunking, secure voice
communication, digital answering machines, voice mail, voice
response systems, talking products
Components of a Speech Coding System
[Figure: block diagram. Sampled speech s(n) -> Analyzer -> x(n) -> Encoder -> y(n) -> Channel or Medium -> y(n) -> Decoder -> x(n) -> Synthesizer -> s(n)]
Goal: Minimize data rate of y(n) while maximizing speech quality of s(n)

Types of Speech Coders
Waveform Coders
Goal: Reproduce speech on a sample-by-sample basis
High data rates, high speech quality
Examples: 64 kb/s PCM (G.711), 32 kb/s ADPCM (G.726)
Parametric Coders
Speech production characterized by parametric models
Low data rates, good speech intelligibility, communications/synthetic speech quality
Examples: 2.4 kb/s LPC (FS 1015), 2.4 kb/s MELP (recent NATO standard)
Analysis-by-Synthesis Coders
Hybrid between waveform and parametric coders, with medium data rates
Parametric models used, with excitation signal computed by minimizing
error between synthesized speech and input speech
Examples: 16 kb/s G.728, 8 kb/s G.729
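As an illustration of the waveform-coder entry above, the following is a minimal sketch of mu-law companding followed by 8-bit uniform quantization. It uses the continuous mu-law formula with mu = 255; the actual G.711 recommendation specifies a piecewise-linear approximation with its own bit layout, so this is not a bit-exact G.711 codec. The function names are illustrative only.

```python
import math

MU = 255.0  # mu-law constant; the continuous formula below is an approximation of G.711

def mulaw_compress(x: float) -> float:
    """Map a sample in [-1, 1] to a companded value in [-1, 1]."""
    x = max(-1.0, min(1.0, x))
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

def mulaw_expand(y: float) -> float:
    """Inverse of mulaw_compress."""
    y = max(-1.0, min(1.0, y))
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)

def encode(x: float) -> int:
    """Compand, then quantize uniformly to one of 256 levels (8 bits)."""
    return int(round((mulaw_compress(x) + 1.0) / 2.0 * 255))

def decode(code: int) -> float:
    """Map an 8-bit code back to a sample in [-1, 1]."""
    return mulaw_expand(code / 255.0 * 2.0 - 1.0)

# Data rate of 8-bit samples at the 8 kHz telephone sampling rate.
SAMPLE_RATE_HZ = 8000
BITS_PER_SAMPLE = 8
bit_rate = SAMPLE_RATE_HZ * BITS_PER_SAMPLE  # bits per second
```

Companding spends more quantization levels on small amplitudes, so quiet samples keep a small relative error even with only 8 bits per sample.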
Speech Quality
Terms Used
Toll quality: High-grade wireline telephone
High quality
Good quality
Communications quality
Transparent quality
Formal Subjective Testing Methods
Expensive, time consuming
Mean opinion score (MOS): Used in all industry standards bodies
Diagnostic acceptability measure (DAM): Used by US Dept of Defense
Informal and Semi-Formal Subjective Tests
Pairwise or A/B comparisons
Rating tests
Objective Methods
Signal-to-Noise Ratio, ITU-T P.862 (PESQ)
Automatic, repeatable, useful in coder development and optimization
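The simplest of the measures listed above can be sketched in a few lines. The code below computes a mean opinion score as the average of listener ratings (assumed here to be on the usual 1 to 5 scale) and the signal-to-noise ratio between a reference signal and a coded version. The frame-averaged (segmental) variant is an addition for illustration, and PESQ itself is far more elaborate and is not attempted here.

```python
import math

def mean_opinion_score(ratings):
    """Average of listener ratings, assumed to be on a 1 (bad) to 5 (excellent) scale."""
    return sum(ratings) / len(ratings)

def snr_db(reference, degraded):
    """Overall SNR in dB, treating (reference - degraded) as the noise."""
    signal = sum(r * r for r in reference)
    noise = sum((r - d) ** 2 for r, d in zip(reference, degraded))
    if noise == 0:
        return float("inf")
    if signal == 0:
        return float("-inf")
    return 10.0 * math.log10(signal / noise)

def segmental_snr_db(reference, degraded, frame_len=160):
    """Average of per-frame SNRs; frames with zero signal or zero noise are skipped."""
    values = []
    for start in range(0, len(reference) - frame_len + 1, frame_len):
        ref = reference[start:start + frame_len]
        deg = degraded[start:start + frame_len]
        signal = sum(r * r for r in ref)
        noise = sum((r - d) ** 2 for r, d in zip(ref, deg))
        if signal > 0 and noise > 0:
            values.append(10.0 * math.log10(signal / noise))
    return sum(values) / len(values) if values else float("nan")

# Toy check: scaling a tone by 0.9 leaves an error of 10% of the amplitude.
tone = [math.sin(2 * math.pi * 5 * t / 320) for t in range(320)]
scaled = [0.9 * s for s in tone]
```

SNR only measures waveform fidelity, which is why it suits waveform coders better than parametric coders.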

Speech Coder Attributes
Bit rate (bits/second): low to high; scale marks 1200, 2400, 4800, 8000, 16000, 32000, 64000
Quality (mean opinion score): low to high; scale marks 2.5, 3.0, 3.5, 4.0
Input condition: clean speech (handheld) to noisy speech (hands-free)
Delay (milliseconds): low to high; scale marks 10, 50, 100, 200
Complexity (MIPS, memory): low to high
Signal type: human speech, music, sound effects

Speech Coding Standards
ITU Standards
Coder     Rate (kb/s)   Approach
G.711     64            Mu/A-law
G.726     16-40         ADPCM
G.728     16            LD-CELP
G.729     8             CS-ACELP
G.723.1   5.3/6.3       MP/ACELP
ITU standards are targeted for telephone network applications
Also used in Voice over IP applications
All produce toll quality speech
Digital Cellular Standards
Region          Coder       Rate (kb/s)   Channel rate (kb/s)   Approach   Date
Europe          GSM FR      13            22.8                  RPE-LTP    1987
Europe          GSM HR      5.6           11.4                  VSELP      1994
Europe          GSM EFR     12.2          22.8                  ACELP      1995
Europe          GSM AMR     4.75-12.2     11.4-22.8             ACELP      1998
North America   TIA IS54    7.95          13                    VSELP      1989
North America   TIA IS95    0.8-8.55                            QCELP      1993
North America   TIA Q13     0.8-13.3                            QCELP      1995
North America   TIA IS641   7.4           13                    ACELP      1996
North America   TIA EVRC    0.8-8.55                            R-ACELP    1996
North America   TIA SMV     0.8-8.5                             R-ACELP    2001
Japan           PDC FR      6.7           11.2                  VSELP      1990
Japan           PDC HR      3.45          5.6                   PSI-CELP   1993
Japan           PDC EFR     8             11.2                  ACELP      1999
Japan           PDC EFR     6.7           11.2                  ACELP      2000

Wideband Standards
Coder     Rate (kb/s)   Approach
G.722     48, 56, 64    SB-ADPCM
G.722.1   24, 32        Transform
ITU WB    16, 24        ACELP
AMR WB    6.60-23.85    ACELP
VMR WB    1.0-13.3      ACELP
Wideband: 50 Hz to 7 kHz (versus narrowband telephone, 300-3200 Hz)

Speech Synthesis
Human Speech Based Systems
Suitable for known material
Speech coding based
Talking toys, talking books, voice prompts, voice response systems
Concatenation of pre-recorded voice data
Information retrieval (stock quotes, airline schedules, banking)
Text-to-Speech Systems
Suitable for unknown or arbitrary text
Applications: e-mail/fax reading, phone access to web-based services, spoken telephone directory, car navigation, location-based services, customer service, help desk, reading machines for the blind

Components of a TTS System
[Figure: block diagram. Text -> Text Analysis -> Letter-to-Sound -> Speech Synthesizer, supported by Dictionary and Rules. Courtesy of Larry Rabiner]
Text Analysis
- Numerical expansion (dates, times, money)
- Abbreviations, acronyms
- Proper name identification
Example: "Dr. Smith lives at 23 Lakeshore Dr."
Letter-to-Sound
- Phonemes
- Pitch
- Duration
- Pauses
- Loudness/amplitude
Speech Synthesizer
- Choice of units: words, phones, diphones, dyad, syllables
- Choice of parameters: LPC, formants, waveform templates, articulatory parameters, sinusoidal parameters
- Method of computation: rules, concatenation
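The text analysis stage can be illustrated with the slide's own example sentence, "Dr. Smith lives at 23 Lakeshore Dr.", in which the same abbreviation must be read two different ways. The sketch below is a toy normalizer built on one assumed heuristic (an abbreviation directly before a capitalized word is a title, otherwise a street suffix) and a number expander limited to 0 through 99. Real systems rely on dictionaries and much richer rules, and all names here are illustrative.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

def number_to_words(n: int) -> str:
    """Spell out an integer from 0 to 99."""
    if not 0 <= n <= 99:
        raise ValueError("this sketch only handles 0..99")
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]} {ONES[ones]}"

def normalize(text: str) -> str:
    """Expand 'Dr.' and one- or two-digit numbers into speakable words."""
    # Assumed heuristic: "Dr." directly before a capitalized word is a title.
    text = re.sub(r"\bDr\.(?=\s+[A-Z])", "Doctor", text)
    # Any remaining "Dr." is treated as a street suffix.
    text = re.sub(r"\bDr\.", "Drive", text)
    # Numerical expansion for standalone one- or two-digit numbers.
    text = re.sub(r"\b\d{1,2}\b", lambda m: number_to_words(int(m.group())), text)
    return text

spoken = normalize("Dr. Smith lives at 23 Lakeshore Dr.")
```

The heuristic fails when a street suffix is followed by a new sentence starting with a capital letter, which is one reason practical systems combine dictionaries with context rules.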

Speech Recognition & Understanding
Problem
Recognition: Automatic recognition of human speech by machine
Understanding: Interpret the meaning of recognized speech and map it to actions to be taken
Applications
Voice dialing (name or number dialing) in telephone, cellphone, PDA,
smartphone (Safety laws against handheld cellphone use while driving)
Voice command & control in telematics, cellphone, PDA, smartphone, PC, toys
Voice-enabled web browsing, information retrieval (stock quotes, weather
forecast, airline flight information, banking), navigation, e-mail, SMS, dictation
Automated customer service and help desks
Benefits: hands-free, eyes-free use; not using keypad; faster task completion;
ease of use; part of multi-modal interface; cost savings

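Components of a Speech Recognizer
[Figure: block diagram. Speech signal -> Feature Extraction -> Acoustic Scoring -> Decoding -> word string. Acoustic Models and Language Models feed the scoring and decoding stages. Stages are labeled front end and back end.]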

Speech Recognizer Attributes
Speaker dependence: speaker dependent, speaker adaptive, speaker independent
Vocabulary size (words): small to large; scale marks 10, 100, 1000, 10000
Speaking style: isolated words, continuous speech, conversational speech
Scope: recognition (syntax) to understanding (semantics)
Input condition: clean speech (handheld) to noisy speech (hands-free)
Complexity (MIPS, memory): low to high
Architecture: server based, distributed, client based
Performance & Robustness
Performance
Recognition Accuracy: Word error rate (WER) or task completion rate
High enough performance required for user acceptance
Robustness Issues
Training versus operational condition differences
Background noise: extent of noise, its variability (Usually additive)
Channel variability: different microphones, different telephone circuits,
handheld, handsfree, handheld-handsfree (Usually convolutive)
Recognizer must have means to compensate for noise and channel variabilities
Out-of-vocabulary rejection capability
Speaker dialect and accent variability (handled by speaker adaptation)
User Interface: Very important for the success of an application
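Word error rate, named above as the accuracy measure, is commonly computed as the word-level edit distance (substitutions, deletions and insertions) between the reference transcript and the recognizer output, divided by the number of reference words. The sketch below assumes whitespace-separated words and exact string matching; the example sentences are made up.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, 1):
            cur[j] = min(prev[j] + 1,             # deletion
                         cur[j - 1] + 1,          # insertion
                         prev[j - 1] + (r != h))  # substitution or match
        prev = cur
    return prev[-1] / len(ref)

# Hypothetical voice-dialing utterance: one substitution and one deletion.
wer = word_error_rate("call john smith at home", "call jon smith home")
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.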

Recognition in Multiple Languages
Speaker-Dependent Recognition
Language independent (User can enroll names for voice dialing in multiple
languages!)
Some Observations for Speaker-Independent Recognition

Same recognition engine but different data (models, dictionary) needed
Recognition grammar to handle language-specific usage differences (e.g.,
French speak telephone numbers in pairs; natural number dialing needed)
Training requires speech databases and dictionary in the new language
Automatic training tools to minimize time to develop recognition in a new
language

Speaker Recognition
Speaker Verification / Authentication
Problem: Use voice input to verify the user's claimed identity
Applications: Secure access to premises, information (banking), services (voice
dialing), etc.
Issues
True user acceptance traded off with impostor acceptance
Total voice verification
Fixed text versus free text
Speaker Identification
Problem: Use voice to identify speaker from a closed or open set of speakers
Applications: Legal and forensic use, intelligence, security
Issues: Uncooperative user, often relatively short-duration speech, noisy
and/or distorted speech.
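The verification trade-off noted above (true user acceptance versus impostor acceptance) comes down to choosing a decision threshold on a match score. The sketch below assumes higher scores mean a better match and uses made-up scores; it reports the false rejection rate (true users turned away) and false acceptance rate (impostors let in) at each candidate threshold.

```python
def error_rates(genuine_scores, impostor_scores, threshold):
    """Return (false rejection rate, false acceptance rate).

    A claim is accepted when its score is >= threshold.
    """
    frr = sum(s < threshold for s in genuine_scores) / len(genuine_scores)
    far = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
    return frr, far

def sweep(genuine_scores, impostor_scores):
    """Evaluate both error rates at every observed score value."""
    thresholds = sorted(set(genuine_scores) | set(impostor_scores))
    return [(t, *error_rates(genuine_scores, impostor_scores, t))
            for t in thresholds]

# Hypothetical match scores, for illustration only.
genuine = [0.9, 0.8, 0.75, 0.6]
impostor = [0.7, 0.5, 0.4, 0.2]
curve = sweep(genuine, impostor)
```

Raising the threshold lowers impostor acceptance at the cost of rejecting more true users, so the operating point depends on the application (banking versus voice dialing, for instance).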

Speech Enhancement
Noise Suppression
Playback Enhancement
Acoustic Echo Cancellation

Noise Suppression
Problem
Remove acoustic noise from noisy speech signal for better listenability or for
improved performance of speech processing devices
Requirements: No speech signal distortion, no loss of speech intelligibility,
no artifacts like musical noises, natural sounding residual noise
Methods
Single microphone approach: spectral subtraction family of methods
Multi-microphone approach: adaptive noise cancellation, microphone array
based fixed or adaptive beamforming, blind signal separation
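A basic member of the spectral subtraction family mentioned above can be sketched for a single frame: estimate the noise magnitude spectrum from speech-free frames, subtract it from the noisy magnitude in each frequency bin, and resynthesize with the noisy phase. The sketch uses a naive DFT to stay dependency-free, and the spectral floor value is an assumed tuning parameter, commonly used to limit musical noise. Practical systems add windowing, overlap-add and smoothed noise tracking.

```python
import cmath
import math

def dft(x):
    """Naive O(N^2) discrete Fourier transform."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spectrum):
    """Inverse DFT, returning the real part of each sample."""
    n = len(spectrum)
    return [(sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                 for k in range(n)) / n).real for t in range(n)]

def average_noise_magnitude(noise_frames):
    """Per-bin magnitude averaged over frames assumed to contain noise only."""
    spectra = [[abs(v) for v in dft(frame)] for frame in noise_frames]
    return [sum(s[k] for s in spectra) / len(spectra)
            for k in range(len(spectra[0]))]

def spectral_subtract(noisy_frame, noise_magnitude, floor=0.02):
    """Subtract the noise magnitude per bin, keeping the noisy phase."""
    cleaned = []
    for value, noise_mag in zip(dft(noisy_frame), noise_magnitude):
        mag = abs(value)
        new_mag = max(mag - noise_mag, floor * mag)  # spectral floor
        cleaned.append(cmath.rect(new_mag, cmath.phase(value)))
    return idft(cleaned)

def error_energy(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Toy example: a "speech" tone plus a steady interfering tone.
N = 64
clean = [math.sin(2 * math.pi * 4 * t / N) for t in range(N)]
noise = [0.3 * math.sin(2 * math.pi * 10 * t / N) for t in range(N)]
noisy = [c + n for c, n in zip(clean, noise)]
noise_mag = average_noise_magnitude([noise])
enhanced = spectral_subtract(noisy, noise_mag)
```

The toy noise is perfectly stationary, so subtraction works almost ideally here; with real, varying noise the residual is what produces the artifacts the requirements above warn about.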

Playback Enhancement
Problem
Enhanced playback of speech to the listener
Methods
Spectrally shape the speech signal prior to playback, for improved
intelligibility when the listener is in a noisy environment (PA system in
aircraft, airports, sports arenas)
Active noise cancellation to cancel noise acoustically in the listener's ears (ANC headsets)
Narrowband to wideband speech extension to provide wideband speech
perception

Acoustic Echo Cancellation
[Figure: acoustic echo canceller (AEC) block diagram. Labels: downlink signal r(n); far end signal s(n) at the loudspeaker; echo channel H(z) producing y(n); microphone (near end) signal v(n) = u(n) + y(n) + n0(n); AEC filter estimating H(z); error signal e(n); uplink signal x(n)]
Goal: Cancel feedback from loudspeaker into microphone using adaptive linear filter
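One common choice of adaptive linear filter for this task is the normalized LMS (NLMS) algorithm; the slide does not say which algorithm is intended, so the sketch below is an assumption. The filter predicts the echo from recent loudspeaker samples, subtracts the prediction from the microphone signal, and uses the remaining error e(n) to update its weights. The tap count, step size and the three-tap echo path are illustrative values, and there is no near-end speech or noise in the toy data.

```python
import random

def nlms_echo_canceller(loudspeaker, mic, num_taps=8, step=0.5, eps=1e-8):
    """Return (error signal, final weights) of an NLMS echo canceller."""
    weights = [0.0] * num_taps
    history = [0.0] * num_taps  # recent loudspeaker samples, newest first
    errors = []
    for sample, v in zip(loudspeaker, mic):
        history = [sample] + history[:-1]
        echo_estimate = sum(w * h for w, h in zip(weights, history))
        e = v - echo_estimate                      # error signal e(n)
        norm = sum(h * h for h in history) + eps   # input power normalization
        weights = [w + step * e * h / norm for w, h in zip(weights, history)]
        errors.append(e)
    return errors, weights

# Toy setup: white far-end signal through a hypothetical 3-tap echo path.
rng = random.Random(0)
far_end = [rng.uniform(-1.0, 1.0) for _ in range(2000)]
echo_path = [0.5, -0.3, 0.1]
mic = [sum(echo_path[k] * far_end[n - k]
           for k in range(len(echo_path)) if n - k >= 0)
       for n in range(len(far_end))]
errors, weights = nlms_echo_canceller(far_end, mic)
```

In a real device, adaptation is usually frozen or slowed during double talk, since near-end speech in the microphone signal would otherwise disturb the weight update.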

Speech Modification
Voice Conversion
Convert one voice to sound like another
A female voice converted to sound like a low-pitched male voice (security)
Time-Scale or Rate Modification

Speed up or slow down speech, while preserving naturalness
Applications: talking books, pre-recorded lectures, language learning
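Rate modification can be sketched with plain overlap-add (OLA): windowed frames are read from the input at one hop size and written to the output at another, which changes duration without resampling and therefore without shifting pitch. This is the simplest variant and is named here only as an illustration; it introduces phase discontinuities, and methods such as WSOLA or PSOLA are the usual way to preserve naturalness. Frame and hop sizes are assumed values.

```python
import math

def time_scale_ola(samples, rate, frame_len=256, synth_hop=128):
    """Change speaking rate by plain overlap-add.

    rate > 1 speeds up (shorter output); rate < 1 slows down (longer output).
    """
    if rate <= 0:
        raise ValueError("rate must be positive")
    analysis_hop = synth_hop * rate
    # Periodic Hann window; overlapping copies are normalized below.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * i / frame_len)
              for i in range(frame_len)]
    num_frames = max(1, int((len(samples) - frame_len) / analysis_hop) + 1)
    out_len = (num_frames - 1) * synth_hop + frame_len
    out = [0.0] * out_len
    norm = [0.0] * out_len
    for m in range(num_frames):
        a = int(round(m * analysis_hop))  # read position in the input
        s = m * synth_hop                 # write position in the output
        for i in range(frame_len):
            if a + i < len(samples):
                out[s + i] += window[i] * samples[a + i]
                norm[s + i] += window[i]
    # Divide by the summed window so overlapping frames do not change the gain.
    return [o / n if n > 1e-9 else 0.0 for o, n in zip(out, norm)]

# Toy input: one second of a 200 Hz tone at 8 kHz sampling.
tone = [math.sin(2 * math.pi * 200 * t / 8000) for t in range(8000)]
faster = time_scale_ola(tone, rate=2.0)   # roughly half the duration
slower = time_scale_ola(tone, rate=0.5)   # roughly twice the duration
```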
