
DIGITAL VOICE ANALYSIS

Prepared by
GAURAV MISHRA
BHUMIKA DWIVEDI
AKASH RAJAN RAI
KARTIC KUMAR
Table of Contents

INTRODUCTION

SPEAKER VERIFICATION

FEATURE EXTRACTION

FUTURE WORK

CONCLUSION

DIGITAL VOICE ANALYSIS - INTRODUCTION

Voice analysis is the study of speech sounds for purposes other than their
linguistic content, such as in speech recognition.

Applications include mostly medical analysis of the voice (phoniatrics), but
also speaker identification.

Speaker recognition
 the process of identifying a person from a spoken phrase
 allows for a secure method of authenticating speakers
 Applications include:
voice dialing, banking over a telephone network, security control for
confidential information, etc.

Challenges
 A voice can be imitated to a certain degree
 Need to capture discriminating features
 Emotional and physical states affect voice quality
SPEECH PROCESSING TAXONOMY

Recognition
  Speech Recognition
  Speaker Recognition
      Speaker Identification
          Text-dependent (closed-set)
          Text-independent (closed-set)
      Speaker Verification
          Text-dependent (closed-set)
          Text-independent (open-set)
  Language Recognition
SPEAKER VERIFICATION

• Determine whether a person is who they claim to be

• The user makes an identity claim: a one-to-one mapping

• The unknown voice could come from a large set of unknown speakers -
  referred to as open-set verification

Is this Kartic’s voice?

GENERAL THEORY OF SPEAKER VERIFICATION SYSTEM

[Diagram: Input speech (“My name is Mishra”) and an identity claim enter the
system. Feature extraction converts the speech into feature vectors, which are
scored against the claimed speaker’s model (Mishra’s “voiceprint”) and against
an impostor model built from other speakers’ “voiceprints”. The two scores are
combined (Σ) and a decision block outputs ACCEPT or REJECT.]
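The decision stage can be sketched as a log-likelihood-ratio test. This is a
minimal Python sketch, not the full system: the score arguments and the
threshold stand in for outputs of the speaker and impostor models.

```python
# Sketch of the accept/reject decision in a speaker verification system.
# `score_speaker` and `score_impostor` stand in for log-likelihoods from the
# claimed speaker's model and the impostor model; the names and the default
# threshold are illustrative assumptions, not a fixed API.

def verify(score_speaker: float, score_impostor: float, threshold: float = 0.0) -> str:
    """Accept the identity claim if the log-likelihood ratio exceeds a threshold."""
    llr = score_speaker - score_impostor  # the combination / decision block above
    return "ACCEPT" if llr >= threshold else "REJECT"

print(verify(-41.2, -45.7))  # speaker model fits better -> ACCEPT
print(verify(-50.3, -44.1))  # impostor model fits better -> REJECT
```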

Two distinct phases to any speaker verification system

Enrollment Phase
  Enrollment speech from each speaker (e.g. Akash, Bhumika) passes through
  feature extraction and model training, producing a voiceprint (model) for
  each speaker.

Verification Phase
  The test utterance passes through feature extraction; the verification
  decision compares the features against the model of the claimed identity
  (e.g. Bhumika) and the claim is accepted or rejected.
TRAINING PHASE

 The 1st phase of an SIS (speaker identification system) is the enrollment
session, also known as the training phase.

 During the training phase, the SIS generates a speaker model based on each
speaker’s characteristics.

[Diagram: Speech from Speaker 1, Speaker 2, Speaker 3 passes through front-end
processing to produce feature vectors; speaker modeling turns these into
speaker models, which are stored in a speaker database.]
COMPONENTS OF SPEAKER IDENTIFICATION SYSTEM
There are three main components of an SI system:

Front-end Processing

Speaker Modeling

Pattern Matching and Classification

FRONT-END PROCESSING

Front-end processing generally consists of the following sub-processes:

Preprocessing
  Removal of noise / silence from speech
  Frame blocking
  Windowing

Feature Extraction
  ‘The curse of dimensionality’: the number of training/test vectors needed
  for a classification problem grows exponentially with the dimension of the
  input vector, so feature extraction is needed.
  Transforms the speech signal into a compact, effective representation that
  is more stable and discriminative than the original signal.

PRE-PROCESSING

 The speech signal is a slowly time-varying signal, called quasi-stationary:
when the signal is examined over a sufficiently short period of time
(5-100 ms), it is fairly stationary.

 Speech signals are therefore analyzed in short time segments, referred to as
short-time spectral analysis, typically 20-30 ms frames that overlap each
other by 30-50%. The overlap ensures that no information is lost at the frame
boundaries due to windowing.

 For a sampling frequency of 11025 Hz, each frame lasts 23 ms and a new frame
contains the last 11.5 ms of the previous frame’s data. For a sampling
frequency of 8000 Hz, each frame lasts 16 ms and a new frame contains the
last 8 ms of the previous frame’s data.
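The frame blocking described above can be sketched in Python (the project
itself targets MATLAB). The frame length and 50% overlap follow the figures
given for 11025 Hz; the function name and defaults are illustrative.

```python
import numpy as np

def frame_signal(x, fs, frame_ms=23.0, overlap=0.5):
    """Split a 1-D signal into overlapping frames (rows of the result)."""
    frame_len = int(round(fs * frame_ms / 1000.0))   # 254 samples at 11025 Hz
    hop = int(round(frame_len * (1.0 - overlap)))    # 50% overlap -> 127 samples
    n_frames = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n_frames)])

fs = 11025
x = np.random.randn(fs)          # one second of dummy "speech"
frames = frame_signal(x, fs)
print(frames.shape)              # (85, 254)
```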

WINDOWING

 After the signal has been framed, each individual frame is windowed so as to
minimize the signal discontinuities at the beginning and end of the frame.

 Each frame is multiplied with a window function w(n) of length N, where N is
the length of the frame.

 Typically the Hamming window is used: it preserves higher-order harmonics
and avoids problems due to truncation of the signal.
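A minimal Python sketch of the windowing step, assuming a 256-sample frame
(the frame length is illustrative); `np.hamming` computes the same
0.54 − 0.46·cos(2πn/(N−1)) window written out here.

```python
import numpy as np

N = 256  # frame length in samples (assumed)
n = np.arange(N)
w = 0.54 - 0.46 * np.cos(2 * np.pi * n / (N - 1))  # Hamming window
# equivalently: w = np.hamming(N)

frame = np.random.randn(N)   # one speech frame (dummy data)
windowed = frame * w         # tapers the frame edges toward ~0.08
print(w[0], w[N // 2])       # edges = 0.08, near the centre ~1.0
```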

MFCC

[Block diagram: Continuous speech → Frame Blocking → Windowing → FFT →
Spectrum → Mel Filter Bank → Log Compression → Mel-Weighted Spectrum → DCT →
Mel Cepstral Coefficients (the extracted feature coefficients).]


Fast Fourier Transform (FFT)

 converts each frame of N samples from the time domain into the frequency
domain.
 it computes the DFT, defined on the set of N samples {x_n} as follows:

          N-1
    X_k =  ∑  x_n · e^(-j2πkn/N),   k = 0, 1, 2, ..., N-1
          n=0
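The definition can be checked numerically against a library FFT; a small
Python sketch (the frame size here is arbitrary):

```python
import numpy as np

# Direct evaluation of X_k = sum_{n=0}^{N-1} x_n e^{-j 2 pi k n / N},
# compared against numpy's FFT of the same frame.
N = 8
x = np.random.randn(N)
k = np.arange(N).reshape(-1, 1)
n = np.arange(N)
X_direct = (x * np.exp(-2j * np.pi * k * n / N)).sum(axis=1)
X_fft = np.fft.fft(x)
print(np.allclose(X_direct, X_fft))  # True
```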

Mel-frequency Warping

[Figure: Mel-spaced filterbank - a bank of triangular filters with amplitudes
between 0 and 2, plotted over frequencies from 0 to 7000 Hz.]
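A mel-spaced triangular filterbank like the one in the figure can be built as
follows. This is one common construction (filters spaced uniformly on the mel
scale); the filter count, FFT size, and sampling rate are illustrative
assumptions.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters=20, n_fft=512, fs=11025):
    """Triangular filters spaced uniformly on the mel scale."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for b in range(l, c):                 # rising edge of the triangle
            fb[i - 1, b] = (b - l) / max(c - l, 1)
        for b in range(c, r):                 # falling edge of the triangle
            fb[i - 1, b] = (r - b) / max(r - c, 1)
    return fb

fb = mel_filterbank()
print(fb.shape)  # (20, 257): one row of weights per filter
```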

Cepstrum

 converts the log mel spectrum back to the time domain
 the result is called the mel-frequency cepstrum coefficients (MFCC)
 denoting the mel power spectrum coefficients that result from the last step
by S_k, k = 1, 2, ..., K, we can calculate the MFCCs c_n with the discrete
cosine transform:

    c_n = ∑_{k=1}^{K} (log S_k) · cos[ n (k - 1/2) π / K ],   n = 1, 2, ..., K
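A minimal Python sketch of this final DCT step, assuming K = 20 mel filters
and keeping the first 12 coefficients (both choices, and the dummy spectrum,
are illustrative):

```python
import numpy as np

K = 20  # number of mel filters (assumed)
log_S = np.log(np.random.rand(K) + 1e-8)  # log mel power spectrum (dummy values)

n = np.arange(1, 13).reshape(-1, 1)       # keep the first 12 coefficients
k = np.arange(1, K + 1)
# c_n = sum_k (log S_k) * cos[n (k - 1/2) pi / K]
mfcc = (log_S * np.cos(n * (k - 0.5) * np.pi / K)).sum(axis=1)
print(mfcc.shape)  # (12,)
```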

PATTERN MATCHING AND CLASSIFICATION

The classifiers used for speaker identification can be grouped into two major
types: template-based and stochastic-model-based classifiers.

Template-based classifiers are considered the simplest:
  Dynamic Time Warping (useful for text-dependent speaker recognition)
  Vector Quantization (useful for text-independent speaker recognition)

Stochastic models provide more flexibility and better results:
  the Gaussian Mixture Model (useful for text-independent speaker recognition),
  the Hidden Markov Model (useful for text-dependent speaker recognition), and
  Neural Networks that model a speaker’s acoustic space.
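For the VQ case, matching can be sketched as picking the speaker whose
codebook yields the lowest average distortion over the test vectors. A Python
sketch; the speaker names, codebook sizes, and random data are stand-ins for
real MFCC features.

```python
import numpy as np

def avg_distortion(features, codebook):
    """Mean distance from each test vector to its nearest codeword."""
    d = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=2)
    return d.min(axis=1).mean()

def identify(features, codebooks):
    """Return the speaker whose codebook gives the smallest average distortion."""
    return min(codebooks, key=lambda spk: avg_distortion(features, codebooks[spk]))

rng = np.random.default_rng(0)
codebooks = {"akash": rng.normal(0, 1, (16, 12)),     # dummy 16-word codebooks
             "bhumika": rng.normal(5, 1, (16, 12))}   # of 12-dim vectors
test = rng.normal(5, 1, (50, 12))   # test vectors near bhumika's region
print(identify(test, codebooks))    # bhumika
```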

SPEAKER MODELING - VQ

Vector Quantization
It is not possible to use all the feature vectors of a given speaker occurring
in the training data to form the speaker’s model, because there are too many
feature vectors per speaker.

 A method of reducing/compressing the number of training vectors is required:
it forms a codebook consisting of a small number of highly representative
vectors that efficiently capture the speaker-specific characteristics.

 VQ is the process of mapping feature vectors in a vector space into a finite
number of regions in that space. Each region is called a cluster, and each
cluster is represented by its centroid. The collection of all centroids is
called the codebook.
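Codebook construction can be sketched with plain k-means clustering (the LBG
algorithm is the classic alternative for VQ); the sizes and the random
training data here are illustrative.

```python
import numpy as np

def train_codebook(features, n_codewords=16, n_iter=20, seed=0):
    """Build a VQ codebook with plain k-means; cluster centroids become codewords."""
    rng = np.random.default_rng(seed)
    # initialize codewords from randomly chosen training vectors
    codebook = features[rng.choice(len(features), n_codewords, replace=False)]
    for _ in range(n_iter):
        # assign each training vector to its nearest codeword
        d = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=2)
        nearest = d.argmin(axis=1)
        # move each codeword to the centroid of its cluster
        for c in range(n_codewords):
            members = features[nearest == c]
            if len(members):
                codebook[c] = members.mean(axis=0)
    return codebook

rng = np.random.default_rng(1)
train = rng.normal(0, 1, (500, 12))   # dummy MFCC vectors for one speaker
cb = train_codebook(train)
print(cb.shape)  # (16, 12)
```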

FUTURE WORK

 We will implement selected speaker-identification algorithms in MATLAB. The
implementation will be modular and designed with real-time operation in mind,
though a complete real-time implementation is not in the scope of the project.

 We will test and verify the performance of all the algorithms. For this
purpose the collected data will be divided into training and testing sets
(70% training, 30% testing).

 For a hardware implementation we would use either interfacing or a digital
signal processor.

CONCLUSION

 Speaker verification is one of the few recognition areas where machines can
outperform humans

 Speaker verification technology is a viable technique currently available
for applications

 Speaker verification can be augmented with other authentication techniques
to add further security
