INTRODUCTION

1.1 Overview
Recently there has been growing interest in improving human-computer interaction (HCI), that is, in enabling computers to interact with humans in day-to-day life. In this context, recognizing a person's emotional state and giving suitable feedback can play a crucial role. As a consequence, emotion recognition is an active research area in both industry and academia. Emotion recognition is usually based on facial or voice features. This work proposes a solution, designed to be employed in a smartphone environment, that captures the emotional state of a person starting from recordings of the speech signals in the surroundings obtained by mobile devices such as smartphones.

This work presents the implementation of a voice-based emotion detection system that recognizes four emotions (anger, sadness, joy and neutral) widely used in emotion recognition. The classification of the speech signals is performed with the Support Vector Machine (SVM) approach. The main contribution is a system able to recognize people's emotions that is composed of two sub-systems, Gender Recognition and Emotion Recognition: a gender recognition algorithm, based on pitch extraction, aimed at providing a priori information about the gender of the speaker, and an SVM-based emotion classifier, which employs the gender information as an additional input.

In order to train and test the SVM-based emotion classifier, a widely used emotional database, the Polish emotional database, has been employed. The overall system reliability depends on the database adopted for the training and testing phases: the use of a simulated database (i.e., a collection of emotional vocal expressions played by actors) allows a higher level of correctly identified emotions to be obtained.


1.2 Proposed method

The proposed method, based on the employment of audio signals, consists of four principal parts, which are elaborated below:

Feature Extraction: the speech signal is processed in order to obtain a certain number of variables, called features, which are useful for speech emotion recognition.

Feature Selection: the most appropriate features are selected in order to reduce the computational load and the time required to recognize an emotion.

Database: it is the memory of the classifier; it contains sentences divided according to the emotions to be recognized.

Classification: a label representing the recognized emotion is assigned by using the features selected by the Feature Selection block and the sentences in the Database.

1.3 Objectives
The objectives of the project are listed below.

1. Perform framing and windowing of the speech signal.

2. Extract the pitch values from the speech signal using the autocorrelation method.

3. Take the average of the pitch values for different samples of male and female speakers.

4. Apply a threshold on the pitch values to separate male and female speakers in the gender recognition process.

5. Extract the principal emotional features of the speech signal: the formants, the MFCCs and the centre of gravity.

6. Train an SVM classifier with different speech samples in different emotions.

7. Finally, let the SVM classify the emotion of the speech signal in the testing phase.


1.4 Block diagram

Figure 1.1: Block diagram

1.5 Procedure:
Initially, the speech signal is passed through the front-end block, which converts the continuous-time speech signal into a discrete-time signal sampled at 16 kHz. It is then given to the feature extraction block, where the pitch is found for every frame using the autocorrelation method. A threshold on the pitch values across the frames of the speech sample is then applied in order to perform gender recognition. Next, the formants are estimated from the LPC coefficients, and the MFCC coefficients and the centre of gravity of the speech spectrum are computed. All these features, together with the gender recognition output, are given to the SVM, which acts as the classifier that recognizes the emotion of the speech sample. The SVM requires the Polish emotional database in order to be trained on sentences spoken in the different emotions; in the testing phase the SVM classifies the emotion by means of its optimization function.
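In code, the chain just described can be outlined as follows. This is only a minimal MATLAB sketch of the data flow: the helper functions estimatePitch, extractFormants, extractMFCC and spectralCOG are hypothetical place-holders for the routines developed in the following chapters, the file name and the gender encoding are illustrative, and resample requires the Signal Processing Toolbox.

% Top-level flow of the proposed system (hypothetical helper functions;
% only the structure follows the procedure described above).
[s, fs] = audioread('speech.wav');            % front end: discrete-time speech signal
s = resample(s(:,1), 16000, fs); fs = 16000;  % bring the signal to the 16 kHz rate used here

pitchPerFrame = estimatePitch(s, fs);         % autocorrelation-based pitch, frame by frame (Ch. 2)
if mean(pitchPerFrame) > 250                  % pitch threshold used for gender recognition (Ch. 7)
    gender = 1;                               % illustrative encoding: female
else
    gender = -1;                              % illustrative encoding: male
end

F    = extractFormants(s, fs);                % formants from the LPC coefficients (Ch. 3)
mfcc = extractMFCC(s, fs);                    % MFCC features (Ch. 4)
cog  = spectralCOG(s, fs);                    % centre of gravity of the spectrum (Ch. 5)

featureVector = [gender; F(:); mfcc(:); cog(:)];
% featureVector is passed to the gender-specific SVM classifier (Ch. 6).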


Pitch estimation
2.1 Introduction
Pitch is an important feature of audio signals, especially for quasi-periodic signals such as voiced sounds from human speech/singing and monophonic music from most musical instruments. Intuitively speaking, pitch represents the vibration frequency of the sound source of an audio signal. In other words, pitch is the fundamental frequency of the audio signal, which is equal to the reciprocal of the fundamental period. Thus the speech signal exhibits relative periodicity, and its fundamental frequency is called the pitch.
Conceptually, the most obvious sample point within a fundamental period is often referred to
as the pitch mark. Usually pitch marks are selected as the local maxima or minima of the
audio waveform.
Pitch detection algorithms can be divided into methods which operate in the time domain,
frequency domain, or both.
One group of pitch detection methods uses the detection and timing of some time-domain feature. Other time-domain methods use autocorrelation functions or difference norms to detect the similarity between the waveform and a time-lagged version of itself.
Another family of methods operates in the frequency domain, locating sinusoidal peaks in the
frequency transform of the input signal. Other methods use combinations of time and
frequency domain techniques to detect pitch.
Frequency domain methods call for the signal to be frequency transformed; the frequency domain representation is then inspected for the first harmonic, the greatest common divisor of all harmonics, or other such indications of the period.
Windowing of the signal is recommended to avoid spectral smearing, and depending on the
type of window, a minimum number of periods of the signal must be analyzed to enable
accurate location of harmonic peaks .
Various linear pre-processing steps can be used to make the process of locating frequency
domain features easier, such as performing linear prediction on the signal and using the
residual signal for pitch detection. Performing nonlinear operations such as peak limiting also
simplifies the location of harmonics.
The pitch estimation method used in this project is the autocorrelation method, which operates in the time domain.


2.2 Autocorrelation method of Pitch estimation:


The correlation between two waveforms is a measure of their similarity. The waveforms are
compared at different time intervals, and their sameness is calculated at each interval. The
result of a correlation is a measure of similarity as a function of time lag between the
beginnings of the two waveforms. The autocorrelation function is the correlation of a
waveform with itself. One would expect exact similarity at a time lag of zero, with increasing
dissimilarity as the time lag increases.

The mathematical definition of the autocorrelation function is shown in Figure 2.1.

Figure 2.1: Autocorrelation function used for pitch estimation: acf(τ) = Σ_n s(n) s(n + τ)

where τ is the time lag in terms of sample points. The value of τ that maximizes acf(τ) over a specified range is selected as the pitch period in sample points.

Periodic waveforms exhibit an interesting autocorrelation characteristic: the autocorrelation function itself is periodic. As the time lag increases to half of the period of the waveform, the correlation decreases to a minimum. This is because the waveform is out of phase with its time-delayed copy. As the time lag increases again to the length of one period, the autocorrelation again increases back to a maximum, because the waveform and its time-delayed copy are in phase. The first peak in the autocorrelation indicates the period of the waveform.
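As a concrete illustration, the sketch below estimates the pitch of a single voiced frame by locating the strongest autocorrelation peak inside a plausible pitch range; the 50-400 Hz search range is an illustrative assumption, and xcorr belongs to the Signal Processing Toolbox. Averaging the value returned for every frame of an utterance and comparing the mean with the 250 Hz threshold reproduces the gender decision reported in the results chapter.

% Pitch of one voiced frame via the autocorrelation method (illustrative values).
function f0 = framePitchAutocorr(frame, fs)
    frame = frame(:) - mean(frame);                 % remove DC before correlating
    r = xcorr(frame);                               % full autocorrelation, zero lag in the middle
    r = r(numel(frame):end);                        % r(k) is now the value at lag k-1
    minLag = floor(fs/400);                         % assume pitch below 400 Hz
    maxLag = min(ceil(fs/50), numel(frame)-1);      % assume pitch above 50 Hz
    [~, idx] = max(r(minLag+1:maxLag+1));           % strongest peak in the allowed lag range
    f0 = fs / (minLag + idx - 1);                   % convert the best lag (samples) to Hz
end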

Formant estimation
3.1 Introduction
Estimation of formant frequencies is generally more difficult than estimation of
fundamental frequency. The problem is that formant frequencies are properties of the vocal
tract system and need to be inferred from the speech signal rather than just measured. The
spectral shape of the vocal tract excitation strongly influences the observed spectral envelope,
such that we cannot guarantee that all vocal tract resonances will cause peaks in the observed
spectral envelope, nor that all peaks in the spectral envelope are caused by vocal tract
resonances.
The dominant method of formant frequency estimation is based on modelling the speech
signal as if it were generated by a particular kind of source and filter.

This type of analysis is called source-filter separation, and in the case of formant frequency
estimation, we are interested only in the modelled system and the frequencies of its
resonances. To find the best matching system we use a method of analysis called Linear
Prediction. Linear prediction models the signal as if it were generated by a signal of
minimum energy being passed through a purely-recursive IIR filter.
We will demonstrate the idea by using LPC to find the best IIR filter from a section of speech
signal and then plotting the filter's frequency response.
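A minimal MATLAB sketch of that demonstration follows (the file name, the 30 ms frame and the LPC order of 12 are illustrative choices; lpc, hamming and freqz are Signal Processing Toolbox functions).

% Fit an all-pole (purely recursive IIR) model to one frame of speech and
% plot the frequency response of the resulting vocal-tract filter H(z) = 1/A(z).
[s, fs] = audioread('speech.wav');
s = s(:, 1);
frameLen = round(0.03 * fs);                     % 30 ms analysis frame
frame = s(1:frameLen) .* hamming(frameLen);      % windowed section of speech
p = 12;                                          % LPC order (illustrative)
a = lpc(frame, p);                               % coefficients of A(z)
[H, f] = freqz(1, a, 512, fs);                   % frequency response of 1/A(z)
plot(f, 20*log10(abs(H)));
xlabel('Frequency (Hz)'); ylabel('Magnitude (dB)');
title('LP spectrum of the analysed frame');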

3.2 LPC method for formant estimation:


The speech signal is produced by the convolution of the excitation source and the time-varying vocal tract system components. These excitation and vocal tract components have to be separated from the available speech signal in order to study them independently.

For deconvolving the given speech into excitation and vocal tract system components, methods based on homomorphic analysis, such as cepstral analysis, have been developed. As cepstral analysis performs the deconvolution of speech into source and system components by traversing through the frequency domain, the deconvolution task becomes a computationally intensive process. To reduce this computational complexity and to find the source and system components in the time domain itself, Linear Prediction analysis was developed.

3.2.1 LPC analysis:

The redundancy in the speech signal is exploited in LP analysis. The prediction of the current sample as a linear combination of the past p samples forms the basis of linear prediction analysis, where p is the order of prediction. The predicted sample ŝ(n) can be represented as

ŝ(n) = Σ_{k=1}^{p} a_k s(n − k)

where the a_k are the linear prediction coefficients and s(n) is the windowed speech sequence, obtained by multiplying the short-time speech frame with a Hamming or similar type of window:

s(n) = x(n) w(n)

where w(n) is the windowing sequence. The prediction error e(n) is computed as the difference between the actual sample s(n) and the predicted sample ŝ(n):

e(n) = s(n) − ŝ(n) = s(n) − Σ_{k=1}^{p} a_k s(n − k)


The primary objective of LP analysis is to compute the LP coefficients which minimize the prediction error e(n). The popular method for computing the LP coefficients is the least-squares autocorrelation method. This is achieved by minimizing the total prediction error, which can be represented as

E = Σ_n e²(n)

Substituting the expression for e(n) given above, this expands to

E = Σ_n [ s(n) − Σ_{k=1}^{p} a_k s(n − k) ]²

The values of a_k which minimize the total prediction error E are computed by setting

∂E/∂a_k = 0,   k = 1, 2, ..., p

which gives p linear equations in the p unknowns a_k; their solution yields the LP coefficients. Carrying out the differentiation gives


Σ_{k=1}^{p} a_k Σ_n s(n − k) s(n − i) = Σ_n s(n) s(n − i),   i = 1, 2, 3, ..., p

This can be written in terms of the autocorrelation sequence R(i) as

Σ_{k=1}^{p} a_k R(|i − k|) = R(i),   i = 1, 2, 3, ..., p

where the autocorrelation sequence is

R(i) = Σ_{n=0}^{N−1−i} s(n) s(n + i),   i = 1, 2, 3, ..., p

and N is the length of the sequence. This can be represented in matrix form as

R A = r

where R is the p×p symmetric matrix of elements R(i, k) = R(|i − k|), (1 ≤ i, k ≤ p), r is a column vector with elements (R(1), R(2), ..., R(p)) and A is the column vector of LPC coefficients (a(1), a(2), ..., a(p)). It can be shown that R is a Toeplitz matrix:

R = [ R(0)    R(1)    ...  R(p−1)
      R(1)    R(0)    ...  R(p−2)
      ...
      R(p−1)  R(p−2)  ...  R(0)   ]


The LP coefficients can then be computed as

A = R⁻¹ r

where R⁻¹ is the inverse of the matrix R.
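As a cross-check of the derivation above, the sketch below forms the autocorrelation sequence R(i) of a windowed frame sw, builds the Toeplitz system and solves it directly for the LP coefficients; in practice MATLAB's levinson (or lpc, Signal Processing Toolbox) gives the same result more efficiently.

% Solve the LP normal equations R*A = r directly from a windowed frame sw.
function a = lpcAutocorrelation(sw, p)
    sw = sw(:);  N = numel(sw);
    R = zeros(p+1, 1);
    for i = 0:p                                   % autocorrelation sequence R(0) ... R(p)
        R(i+1) = sum(sw(1:N-i) .* sw(1+i:N));
    end
    Rmat = toeplitz(R(1:p));                      % p x p symmetric Toeplitz matrix
    r    = R(2:p+1);                              % right-hand side (R(1) ... R(p))
    a    = Rmat \ r;                              % LP coefficients a(1) ... a(p)
    % With this sign convention, levinson(R, p) returns [1, -a(1), ..., -a(p)].
end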

3.2.2 Computation of the LP residual:

The LP residual is the prediction error e(n), obtained as the difference between the current sample s(n) and the predicted sample ŝ(n), as defined above. In the frequency domain, this relation can be represented as

E(z) = S(z) A(z),   where A(z) = 1 − Σ_{k=1}^{p} a_k z^(−k)


i.e., A(z) acts as an FIR filter applied to the speech signal. So the LP residual can be obtained by filtering the speech signal with A(z), as indicated in Figure 3.2.2. Similarly, it can be shown that the LP spectrum is

H(z) = 1 / A(z) = S(z) / E(z)

As A(z) is the reciprocal of H(z), the LP residual is obtained by inverse filtering of the speech.

Figure 3.2.2: Computing the LP residual by inverse filtering
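In code, the inverse filtering of Figure 3.2.2 is a single FIR filtering operation with A(z); a short sketch, assuming a windowed frame sw and the Signal Processing Toolbox lpc function:

% LP residual obtained by inverse filtering the speech frame with A(z).
a = lpc(sw, p);              % A(z) for the windowed frame sw (order p)
e = filter(a, 1, sw);        % residual e(n): output of the FIR inverse filter A(z)
% Filtering e back through 1/A(z) reconstructs the frame: sRec = filter(1, a, e);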


3.2.3 Determination of formant frequencies:


LP analysis separates the given short-term sequence of speech into a slowly varying vocal tract component, represented by the LP filter H(z), and a fast varying excitation component, given by the LP residual e(n). The LP filter H(z) imposes the desired spectral shape on the flat spectrum E(z) of the noise-like excitation sequence, according to

S(z) = H(z) E(z)

where S(z) is the spectrum of the given short-time speech signal. As the LP spectrum provides the vocal tract characteristics, the vocal tract resonances (formants) can be estimated from the LP spectrum. The formant locations are obtained by picking the peaks of the LP magnitude spectrum |H(z)|. Figure 3.2.3 shows the first (F1), second (F2) and third (F3) formant frequencies estimated from the peaks of the LP magnitude spectrum.


Figure 3.2.3: Formant locations corresponding to peaks in the LP magnitude spectrum.
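A sketch of that peak-picking step is given below (sw is a windowed voiced frame and fs its sampling rate; the LPC order and the number of formants kept are illustrative, and lpc, freqz and findpeaks belong to the Signal Processing Toolbox).

% Estimate formants as the peaks of the LP magnitude spectrum |H(f)|.
p = 12;                                       % LPC order (illustrative)
a = lpc(sw, p);                               % A(z) of the windowed frame
[H, f] = freqz(1, a, 1024, fs);               % LP spectrum on a fine frequency grid
[~, locs] = findpeaks(20*log10(abs(H)));      % local maxima of the LP spectrum
formants = f(locs(1:min(3, numel(locs))));    % keep the lowest three peaks: F1, F2, F3

Equivalently, the formant frequencies can be read from the angles of the complex roots of A(z), for example from angle(roots(a))*fs/(2*pi) restricted to positive angles.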

MFCC (mel frequency cepstral coefficients) estimation


4.1 Introduction:
MFCCs are based on the known variation of the human ear's critical bandwidths with frequency: filters spaced linearly at low frequencies and logarithmically at high frequencies have been used to capture the phonetically important characteristics of speech. This is expressed in the mel-frequency scale, which has a linear frequency spacing below 1000 Hz and a logarithmic spacing above 1000 Hz. The MFCCs thus represent the short-term power spectrum of the speech signal.

4.2 MFCC implementation:


A block diagram of the structure of an MFCC processor is given in Figure 4.2. The speech input is typically recorded at a sampling rate above 10000 Hz. This sampling frequency is chosen to minimize the effects of aliasing in the analog-to-digital conversion. Such sampled signals can capture all frequencies up to 5 kHz, which covers most of the energy of sounds generated by humans. As discussed previously, the main purpose of the MFCC processor is to mimic the behaviour of the human ear. In addition, MFCCs are shown to be less susceptible to the mentioned variations than the speech waveforms themselves.

Figure 4.2: Block diagram of the MFCC processor (continuous speech → Frame Blocking → Windowing → FFT → Mel-frequency Wrapping → Cepstrum → mel cepstrum)

4.2.1 Frame Blocking:
In this step the continuous speech signal is blocked into frames of N samples, with adjacent frames separated by M samples (M < N). The first frame consists of the first N samples. The second frame begins M samples after the first frame, and overlaps it by N − M samples, and so on. This process continues until all the speech is accounted for within one or more frames. Typical values for N and M are N = 256 (which is equivalent to roughly 30 msec of windowing and facilitates a fast radix-2 FFT) and M = 100.
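The frame blocking step can be written with simple index arithmetic, as in the sketch below (frames that do not fit entirely at the end of the signal are simply discarded here; with the typical values above the call is frames = frameBlocking(s, 256, 100)).

% Split the speech signal into overlapping frames of N samples with a hop of M samples.
function frames = frameBlocking(s, N, M)
    s = s(:);
    numFrames = 1 + floor((numel(s) - N) / M);   % only frames that fit entirely
    frames = zeros(N, numFrames);
    for k = 1:numFrames
        first = (k-1)*M + 1;                     % the k-th frame starts M samples later
        frames(:, k) = s(first:first+N-1);
    end
end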

4.2.2 Windowing:
The next step in the processing is to window each individual frame so as to minimize the signal discontinuities at the beginning and end of each frame. The concept here is to minimize the spectral distortion by using the window to taper the signal to zero at the beginning and end of each frame. If we define the window as w(n), 0 ≤ n ≤ N − 1, where N is the number of samples in each frame, then the result of windowing is the signal

y_l(n) = x_l(n) w(n),   0 ≤ n ≤ N − 1

Typically the Hamming window is used, which has the form:

w(n) = 0.54 − 0.46 cos(2πn / (N − 1)),   0 ≤ n ≤ N − 1

4.2.3 Fast Fourier Transform (FFT):

The next processing step is the Fast Fourier Transform, which converts each frame of N samples from the time domain into the frequency domain. The FFT is a fast algorithm for implementing the Discrete Fourier Transform (DFT), which is defined on the set of N samples {x_n} as follows:

X_k = Σ_{n=0}^{N−1} x_n e^(−j2πkn/N),   k = 0, 1, 2, ..., N − 1

In general the X_k are complex numbers and we only consider their absolute values (frequency magnitudes). The resulting sequence {X_k} is interpreted as follows: positive frequencies 0 ≤ f < Fs/2 correspond to values 0 ≤ n ≤ N/2 − 1, while negative frequencies −Fs/2 < f < 0 correspond to N/2 + 1 ≤ n ≤ N − 1. Here, Fs denotes the sampling frequency. The result after this step is often referred to as the spectrum or periodogram.

4.2.4 Mel-frequency wrapping:
As mentioned above, psychophysical studies have shown that human perception of the
frequency contents of sounds for speech signals does not follow a linear scale. Thus for each
tone with an actual frequency, f, measured in Hz, a subjective pitch is measured on a scale
called the mel scale. The mel-frequency scale is a linear frequency spacing below 1000 Hz
and a logarithmic spacing above 1000 Hz.


Figure 4.2.4: An example of a mel-spaced filter bank (gain versus frequency in Hz).


One approach to simulating the subjective spectrum is to use a filter bank, spaced uniformly on the mel scale (see Figure 4.2.4). That filter bank has a triangular bandpass frequency response, and the spacing as well as the bandwidth is determined by a constant mel-frequency interval. The number of mel spectrum coefficients, K, is typically chosen as 20. Note that this filter bank is applied in the frequency domain, so it simply amounts to applying the triangle-shaped windows of Figure 4.2.4 to the spectrum. A useful way of thinking about this mel-wrapping filter bank is to view each filter as a histogram bin (where bins have overlap) in the frequency domain.
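A sketch of such a mel-spaced triangular filter bank follows; it is meant to be applied to the first NFFT/2 + 1 points of the magnitude spectrum. The mapping mel(f) = 2595*log10(1 + f/700) is the common convention and is an assumption here, as is the exact placement of the bin edges.

% Triangular filter bank with K filters spaced uniformly on the mel scale.
function fb = melFilterbank(K, NFFT, fs)
    mel    = @(f) 2595 * log10(1 + f/700);           % Hz -> mel (common convention)
    melinv = @(m) 700 * (10.^(m/2595) - 1);          % mel -> Hz
    edges  = melinv(linspace(0, mel(fs/2), K+2));    % K+2 equally spaced edge frequencies
    bins   = floor((NFFT+1) * edges / fs) + 1;       % edge frequencies -> FFT bin indices
    fb = zeros(K, floor(NFFT/2) + 1);
    for k = 1:K
        lo = bins(k); c = bins(k+1); hi = bins(k+2);
        fb(k, lo:c) = linspace(0, 1, c - lo + 1);    % rising edge of the k-th triangle
        fb(k, c:hi) = linspace(1, 0, hi - c + 1);    % falling edge of the k-th triangle
    end
end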

4.2.5 Cepstrum :
In this final step, we convert the log mel spectrum back to time. The result is called the
mel frequency cepstrum coefficients (MFCC). The cepstral representation of the speech
spectrum provides a good representation of the local spectral properties of the signal for the
given frame analysis.


Because the mel spectrum coefficients (and so their logarithm) are real numbers, we can convert them to the time domain using the Discrete Cosine Transform (DCT). If we denote the mel power spectrum coefficients that are the result of the last step by S̃_k, k = 1, 2, ..., K, we can calculate the MFCCs, c̃_n, as

c̃_n = Σ_{k=1}^{K} (log S̃_k) cos[ n (k − 1/2) π / K ],   n = 0, 1, ..., K − 1

Note that we exclude the first component, c̃_0, from the DCT since it represents the mean value of the input signal, which carries little speaker-specific information.

By applying the procedure described above, for each speech frame of around 30 msec with overlap, a set of mel-frequency cepstrum coefficients is computed. These are the result of a cosine transform of the logarithm of the short-term power spectrum expressed on a mel-frequency scale. This set of coefficients is called an acoustic vector. Therefore each input utterance is transformed into a sequence of acoustic vectors.
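Putting the steps of this chapter together, the per-frame MFCC computation can be sketched as follows (the filter bank fb comes from the sketch in the previous section; dct and hamming are Signal Processing Toolbox functions, and the eps floor is a guard added here against empty filters). Stacking the vector c obtained for every frame of the utterance gives the sequence of acoustic vectors described above.

% MFCCs of one frame: window -> |FFT| -> mel filter bank -> log -> DCT, drop c0.
function c = frameMFCC(frame, fb, NFFT)
    w = hamming(numel(frame));                 % taper the frame edges to zero
    X = abs(fft(frame(:) .* w, NFFT));         % magnitude spectrum
    X = X(1:floor(NFFT/2) + 1);                % keep non-negative frequencies only
    S = fb * X;                                % K mel spectrum coefficients
    c = dct(log(max(S, eps)));                 % cepstrum via the discrete cosine transform
    c = c(2:end);                              % drop c0, the mean of the log spectrum
end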


Centre of gravity (COG)
5.1 Introduction:
The spectral Centre Of Gravity (COG) is a measure of how high the frequencies in a spectrum are. For this reason the COG gives an average indication of the spectral distribution of the speech signal under observation. Given the considered discrete signal s(n) and its DFT S(k), the COG is computed as the spectrum-weighted average frequency

COG = Σ_{k=0}^{N−1} f_k |S(k)|² / Σ_{k=0}^{N−1} |S(k)|²

where f_k = k/N, k = 0, ..., N − 1, represents the k-th frequency composing the DFT. These spectral shaping features are required in order to characterize the power distribution in the speech signal.

5.2 Spectral central moments:

The m-th central spectral moment of the considered sequence s(n) is computed as

μ_m = Σ_{k=0}^{N−1} (f_k − COG)^m |S(k)|² / Σ_{k=0}^{N−1} |S(k)|²

5.2.1 Standard Deviation (SD):

The standard deviation of a spectrum is a measure of how much the frequencies in the spectrum deviate from the centre of gravity. The SD corresponds to the square root of the second central moment μ₂:

SD = sqrt(μ₂)

5.2.2 Skewness:

The skewness of a spectrum is a measure of its asymmetry, and is defined as the third central moment of the considered sequence s(n) divided by the 1.5 power of the second central moment:

skewness = μ₃ / μ₂^1.5


These spectral moments describe how the frequencies are distributed in the spectrum of the speech signal, and from them the power distribution of the required speech signal can be characterized. Higher moments could be computed for greater accuracy, but normally moments up to the third order are sufficient to describe the power levels in the spectrum; in this project the moments are therefore computed up to the skewness, which is the third central moment.
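The three spectral shaping features of this chapter can be computed directly from the spectrum of a frame, as in the sketch below. The power spectrum is used here as the weighting of the frequency axis; using the magnitude spectrum instead is a minor variant.

% Spectral centre of gravity, standard deviation and skewness of one frame.
function [cog, sd, skew] = spectralMoments(s, fs, NFFT)
    S = abs(fft(s(:), NFFT)).^2;            % power spectrum of the frame
    S = S(1:floor(NFFT/2) + 1);
    f = (0:numel(S)-1)' * fs / NFFT;        % frequency of each DFT bin in Hz
    w = S / sum(S);                         % spectrum used as a weighting of frequency
    cog  = sum(w .* f);                     % centre of gravity (first moment)
    mu2  = sum(w .* (f - cog).^2);          % second central moment
    mu3  = sum(w .* (f - cog).^3);          % third central moment
    sd   = sqrt(mu2);                       % standard deviation
    skew = mu3 / mu2^1.5;                   % skewness
end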

Emotions Classifier
6.1 Introduction:
Usually, in the literature of the field, a Support Vector Machine (SVM) is used to classify sentences. The SVM is a relatively recent machine learning algorithm, introduced by Vapnik and derived from statistical learning theory in the 1990s. The main idea is to transform the original input set into a high-dimensional feature space by using a kernel function and then to achieve optimum classification in this new feature space, where a clear separation among classes is obtained by the optimal placement of a separating hyperplane under the precondition of linear separability.

6.2 SVM classifier:


Differently from previously proposed approaches, two different classifiers, both kernel-based Support Vector Machines (SVMs), have been employed in this project. The first one (called Male-SVM) is used if a male speaker is recognized by the Gender Recognition block. The other SVM (Female-SVM) is employed in the case of a female speaker. The Male-SVM and Female-SVM classifiers have been trained using the speech signals of the employed reference Database (DB) generated, respectively, by male and female speakers. With g ∈ {−1, 1} denoting the label of the gender as defined above, the two SVMs have been trained by traditional Quadratic Programming (QP). In more detail, the following problem has been solved for each gender g:

Figure 6.2: A linear Support Vector Machine (two classes separated by a hyperplane defined by the support vectors)


max over α:   Σ_{u=1}^{ℓ_g} α_u − (1/2) Σ_{u=1}^{ℓ_g} Σ_{v=1}^{ℓ_g} α_u α_v y_u y_v K(x_u, x_v)

subject to   0 ≤ α_u ≤ C   and   Σ_{u=1}^{ℓ_g} α_u y_u = 0

where α = (α_1, ..., α_{ℓ_g}) represents the well-known vector of Lagrangian multipliers of the QP problem written in dual form. The vectors x_u are the feature vectors, while the scalars y_u are the related labels (i.e., the emotions in this work); together they represent the training set for the g-th gender. The pair (x_u, y_u) is the related association, also called an observation, between the u-th input feature vector x_u and its label y_u, and the quantity ℓ_g is the total number of observations composing the training set.

The quantity C (C > 0) is the complexity constant, which determines the trade-off between the flatness (i.e., the sensitivity of the prediction to perturbations in the features) and the tolerance for misclassified samples. A higher value of C means that minimizing the degree of misclassification is more important; C = 1 is used in this project, and K(x_u, x_v) is the kernel function that makes the SVM non-linear. The QP problems (one for each gender) are solved by the Sequential Minimal Optimization (SMO) approach, which provides an optimal point, not necessarily unique and isolated, if and only if the Karush-Kuhn-Tucker (KKT) conditions are verified and the kernel matrices are positive semi-definite. Details about the KKT conditions and the SMO approach employed can be found in the referenced literature.

6.2.1 Polish Emotional database:



The database consists of recordings of 4 actors (2 male and 2 female) in 4 different emotions. The recordings for every speaker were made during a single session, and each speaker utters four different sentences.

The utterances are:

1 - Oni kupili dzisiaj nowy samochód.
2 - Jego dziewczyna przylatuje dzisiaj samolotem.
3 - Janek był dzisiaj u fryzjera.
4 - Ta lampa dzisiaj jest na biurku.

The first two utterances are used in the training phase of the SVM, and the other two are used in the testing phase. Finally, all the extracted features are assembled into a feature vector, which is given as input to the SVM in order to perform the emotion recognition; thus, with the help of the gender recognition output, the SVM performs a better classification.
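As an illustration of this gender-dependent classification stage, the sketch below trains one multi-class SVM per gender and uses the gender recognition output to select between them. It is only a sketch: the feature matrices Xmale/Xfemale and label vectors ymale/yfemale are assumed to have been built from the database, a Gaussian (RBF) kernel is assumed since the kernel actually adopted is not detailed here, and fitcecoc/templateSVM require the Statistics and Machine Learning Toolbox (fitcecoc is available from release R2014b onwards).

% Gender-dependent emotion classifiers: one multi-class SVM per gender.
t = templateSVM('KernelFunction', 'rbf', 'BoxConstraint', 1);   % C = 1; RBF kernel assumed
maleSVM   = fitcecoc(Xmale,   ymale,   'Learners', t);   % Male-SVM: trained on male utterances
femaleSVM = fitcecoc(Xfemale, yfemale, 'Learners', t);   % Female-SVM: trained on female utterances

% At test time the gender recognition block selects which classifier to use.
if gender == 1                               % illustrative encoding: 1 = female, -1 = male
    emotion = predict(femaleSVM, xTest);     % xTest: feature vector of the test utterance
else
    emotion = predict(maleSVM, xTest);
end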

RESULTS
7.1 Introduction:
In this project MATLAB coding has been used. MATLAB (Matrix Laboratory) is a tool for numerical computation and visualization. The basic data element is a matrix, so a program that manipulates array-based data is generally fast to write and run in MATLAB.

MATLAB is widely used in all areas of applied mathematics, in education and research at universities, and in industry. MATLAB has powerful graphic tools and can produce nice pictures in both 2D and 3D. It is also a programming language, and is one of the easiest programming languages for tasks such as signal processing, image processing and optimization.

Typical uses include:

i. Math and computation.
ii. Algorithm development.
iii. Modeling, simulation, and prototyping.
iv. Data analysis, exploration, and visualization.
v. Scientific and engineering graphics.
vi. Application development, including Graphical User Interface building.

MATLAB is a high-level matrix/array language with control flow statements, functions, data structures, input/output, and object-oriented programming features. It provides a vast collection of computational algorithms, ranging from elementary functions like sum, sine, cosine and complex arithmetic to more sophisticated functions like matrix inverse, matrix eigenvalues, Bessel functions and fast Fourier transforms. In this project MATLAB version 8.3 has been used for programming.


Figure 7.1: Gender recognition as female, with mean pitch values higher than the threshold.

Analysis: A speech sample is taken and, using the autocorrelation function, the pitch values are found for every frame; finally a threshold level of 250 Hz is applied in order to separate male and female samples.


Figure 7.2: Gender recognition as male, with mean pitch values less than the threshold.

Analysis: A speech sample is taken and, using the autocorrelation function, the pitch values are found for every frame; finally a threshold level of 250 Hz is applied in order to separate male and female samples.

Figure 7.3: Formants for the speech signal in the neutral state.


Figure 7.4: Formants for the speech signal in the anger state.


Figure 7.5: Formants for the speech signal in the joy state.

Figure 7.6: Formants for the speech signal in the sad state.


Analysis: The formant frequencies in the above figures give the resonant frequencies of the vocal tract in different emotions, and are used to characterize the vocal tract system of a particular speaker.

Figure 7.7: MFCC for the speech signal in the anger state, male and female.

Figure 7.8: Formants for the speech signal in the sad state.


Figure 7.9: MFCC for the speech signal in the joy state, male and female.


Analysis: The MFCCs in the different emotions give the short-term power levels of the speech signal, which are useful in the recognition of speech.

Figure 7.10: Power spectrum for the speech signal in the anger state, male.


Analysis: The power spectrum represents the power of the particular male speaker in a particular emotion, and is used in the recognition of a particular word.

Figure 7.11: Power spectrum for the speech signal in the anger state, female.


Analysis: The power spectrum represents the power of the particular female speaker in a particular emotion, and is used in the recognition of a particular word.


Figure 7.12: Confusion matrix for the four speakers on emotion recognition.

Analysis: By running the system with 4 speakers in four different emotions, this matrix represents the recognition of one particular emotion versus the other emotions.


Conclusion:

The system is able to detect the 4 emotions; using the Polish emotional database, in which the actors speak 4 utterances in different emotions, improves the accuracy in designing the emotion recognition system.

The gender recognition used in this system is useful in reducing the time delay of the classification stage at the classifier. By using it, the accuracy of the emotion recognition system can also be improved.


References

Reference A:
[1] Igor Bisio, Alessandro Delfino, Fabio Lavagetto, Mario Marchese, and Andrea Sciarrone, "Gender-driven Emotion Recognition Through Speech Signals for Ambient Intelligence Applications," IEEE, 2013.

Reference B:
[1] F. Burkhardt, M. van Ballegooy, R. Englert, and R. Huber, "An emotion-aware voice portal," Proc. Electronic Speech Signal Processing ESSP, pp. 123-131, 2005.
[2] J. Luo, Affective Computing and Intelligent Interaction. Springer, 2012, vol. 137.


Figure 6.1: Input and synthesized output from LPC and LSP for JUSTICE (LPC order = 8)

Figure 6.2: power spectral density for input

Figure 6.3: power spectral density for output


****** input ******

Sampling rate (Hz): 48000

Input length (samples): 42788

Input length (seconds): 0.891417

****** Compression using Only LPC ******

Compression ratio: 26.64

psnr with LPC: 28.721563

Mahalanobis Distance with LPC: 0.179952

****** compression using LSP as well ******

Compression ratio: 35.40

psnr with LSP: 28.653103

Mahalanobis Distance with LSP: 0.227074

2. Speech spelt=JUSTICE

Frame size=20msec
Order of LPC=12

Figure 6.4: Input and synthesized output from LPC and LSP for the input JUSTICE (LPC order = 12)
****** input ******

Sampling rate (Hz): 48000

Input length (samples): 42788

Input length (seconds): 0.891417

****** Compression using Only LPC ******



Compression ratio: 19.54

psnr with LPC: 28.595457

Mahalanobis Distance with LPC: 0.185220


****** compression using LSP as well ******

Compression ratio: 23.17

psnr with LSP :28.616134

Mahalanobis Distance with LSP :0.185969

3. Speech spelt=JUSTICE
Frame size=20msec
Order of LPC=45


Figure 6.5: Input and synthesized output from LPC and LSP for the input JUSTICE (LPC order = 45)

****** input ******

Sampling rate (Hz): 48000

Input length (samples): 42788

Input length (seconds): 0.891417

****** Compression using Only LPC ******

Compression ratio: 6.11

psnr without LSP 28.494923

Mahalanobis Distance without LSP 0.173408

****** compression using LSP as well ******

Compression ratio: 6.29

psnr with LSP 28.647524

Mahalanobis Distance with LSP 0.203702


Figure 6.6: PSNR for the input "justice" for various orders of LPC (LPC vs. LSP).

Figure 6.7: Mahalanobis distance for the input "justice" for various orders of LPC (input&LPC vs. input&LSP).
4. Speech spelt=JUSTICE
Frame size=30msec
Order of LPC=12

Figure 6.8: Input and synthesized output from LPC and LSP for the input JUSTICE (30 msec)
****** input ******

Sampling rate (Hz): 48000

Input length (samples): 42788

Input length (seconds): 0.891417


****** Compression using Only LPC ******

Compression ratio: 19.81

psnr with LPC: 29.779574

Mahalanobis Distance with LPC: 0.362347

****** compression using LSP as well ******

Compression ratio: 23.56

psnr with LSP :29.769989

Mahalanobis Distance with LSP: 0.285492

5. Speech spelt=JUSTICE
Frame size=40msec
Order of LPC=12


Figure 6.9: Input and synthesized output from LPC and LSP for the input JUSTICE (40 msec)
****** input ******

Sampling rate (Hz): 48000

Input length (samples): 42788

Input length (seconds): 0.891417

****** Compression using Only LPC ******

Compression ratio: 20.09


psnr without LSP :31.15005
Mahalanobis Distance without LSP 0.472319

****** compression using LSP as well ******

Compression ratio: 23.91

psnr with LSP: 31.144920

Mahalanobis Distance with LSP: 0.473978

6. Speech spelt=JUSTICE
Frame size=50msec
Order of LPC=12


Figure 6.10: Input and synthesized output from LPC and LSP for the input JUSTICE (50 msec)
****** input ******

Sampling rate (Hz): 48000

Input length (samples): 42788

Input length (seconds): 0.891417


****** Compression using Only LPC ******

Compression ratio: 20.23

psnr without LSP 32.257851

Mahalanobis Distance without LSP 0.560999


****** compression using LSP as well ******

Compression ratio: 24.13

psnr with LSP 32.156920

Mahalanobis Distance with LSP 0.504794


Figure 6.11: PSNR for the input "justice" for various frame sizes (LPC vs. LSP).

Figure 6.12: Mahalanobis distance for the input "justice" for various frame sizes (input&LPC vs. input&LSP).

7. Speech spelt=It is simple to be happy (male)


Frame size=30msec
Order of LPC=8


Figure 6.13: Input and synthesized output from LPC and LSP (LPC order = 8)
****** input ******

Sampling rate (Hz): 48000

Input length (samples): 116550

Input length (seconds): 2.428125

****** Compression using Only LPC ******

Compression ratio: 26.49

psnr without LSP 30.744181

Mahalanobis Distance without LSP 0.353594

****** compression using LSP as well ******

Compression ratio: 36.90


psnr with LSP 31.065172

Mahalanobis Distance with LSP 0.399777

8. Speech spelt=It is simple to be happy (male)


Frame size=20msec
Order of LPC=12

Figure 6.14: Input and synthesized output from LPC and LSP (LPC order = 12)
Power spectral density:


Figure 6.15: Power spectral density of input and output signals

****** input ******

Sampling rate (Hz): 48000

Input length (samples): 116550

Input length (seconds): 2.428125

****** Compression using Only LPC ******

Compression ratio: 19.33

psnr with LPC: 29.559239

Mahalanobis Distance with LPC: 0.166563

****** compression using LSP as well ******

Compression ratio: 23.17

psnr with LSP :29.763026

Mahalanobis Distance with LSP :0.216062


Figure 6.16: PSNR for various orders of LPC (input&LPC vs. input&LSP).

Figure 6.17: Mahalanobis distance for various orders of LPC (input&LPC vs. input&LSP).

9. Speech spelt=It is simple to be happy (male)


Frame size=30msec
Order of LPC=12


Figure 6.18: Input and synthesized output from LPC and LSP (30 msec)
****** input ******

Sampling rate (Hz): 48000

Input length (samples): 116550

Input length (seconds): 2.428125

****** Compression using Only LPC ******

Compression ratio: 19.43

psnr without LSP: 30.838957

Mahalanobis Distance without LSP 0.296230

****** compression using LSP as well ******

Compression ratio: 23.65


psnr with LSP: 31.006035

Mahalanobis Distance with LSP 0.336836

10. Speech spelt=It is simple to be happy (male)


Frame size=40msec
Order of LPC=12

Figure 6.19: Input and synthesized output from LPC and LSP (40 msec)
****** input ******

Sampling rate (Hz): 48000

Input length (samples): 116550

Input length (seconds): 2.428125


****** Compression using Only LPC ******

Compression ratio: 19.47

psnr with LPC: 32.050553

Mahalanobis Distance with LPC: 0.415163

****** compression using LSP as well ******

Compression ratio: 23.83

psnr with LSP: 32.442751

Mahalanobis Distance with LSP :0.459108


Figure 6.20: PSNR for various frame sizes (LPC vs. LSP).

Figure 6.21: Mahalanobis distance for various frame sizes (LPC vs. LSP).

11. Speech spelt=Time is precious dont waste it


Frame size=20msec
Order of LPC=12

Figure 6.22: Input and synthesized output from LPC and LSP (20 msec)
****** input ******

Sampling rate (Hz): 48000

Input length (samples): 105952


Input length (seconds): 2.207333

****** Compression using Only LPC ******

Compression ratio: 19.35

psnr with LPC: 31.887008

Mahalanobis Distance with LPC: 0.280322

****** compression using LSP as well ******

Compression ratio: 23.13

psnr with LSP :31.823073

Mahalanobis Distance with LSP: 0.295197

12. Speech spelt=Time is precious dont waste it


Frame size=30msec
Order of LPC=12


Figure 6.23: Input and synthesized output from LPC and LSP (30 msec)

****** input ******

Sampling rate (Hz): 48000

Input length (samples): 105952

Input length (seconds): 2.207333

****** Compression using Only LPC ******

Compression ratio: 19.46

psnr with LPC: 33.872126

Mahalanobis Distance with LPC: 0.400250

****** compression using LSP as well ******

Compression ratio: 23.38

psnr with LSP: 33.939187

Mahalanobis Distance with LSP :0.426949

13. Speech spelt=Time is precious dont waste it


Frame size=40msec
Order of LPC=12

Figure 6.24: Input and synthesized output from LPC and LSP (40 msec)
****** input ******

Sampling rate (Hz): 48000

Input length (samples): 105952

Input length (seconds): 2.207333


****** Compression using Only LPC ******

Compression ratio: 19.51

psnr with LPC: 35.538876

Mahalanobis Distance with LPC: 0.499091

****** compression using LSP as well ******

Compression ratio: 23.46

psnr with LSP :35.582306

Mahalanobis Distance with LSP: 0.514292


34

33.95
33.9
33.85
33.8

LPC
LSP

33.75
33.7
33.65
33.6
33.55
order=8

order=10

order=12

order=14

order=16

Figure 6.25: Calculation of PSNR for various order of LPC

M.Tech thesis submitted to Jawaharlal Nehru Technological University,


Hyderabad-2015
Page 62

0.44
0.43
0.42
0.41
Distance

LPC
LSP

0.4
0.39
0.38
0.37
0.36

order=8 order=10 order=12 order=14 order=16

Figure 6.26: Calculation of mehalanobis distance for various order of LPC


37
36.5
36
35.5
35
34.5
34
33.5
33
32.5
32

Figure 6.27: Calculation of PSNR for various frame size

M.Tech thesis submitted to Jawaharlal Nehru Technological University,


Hyderabad-2015
Page 63

LPC
LSP

0.7
0.6
0.5
0.4
0.3
0.2
0.1
0

Figure 6.28: Calculation of mehalanobis distance for various frame size


14. Speech spelt = make in India
Frame size = 20 msec
Order of LPC = 8

Figure 6.29: Input and synthesized output from LPC and LSP (20 msec)
****** input ******

Sampling rate (Hz): 8000

Input length (samples): 15557

Input length (seconds): 1.944625

****** Compression using Only LPC ******

Compression ratio: 3.23

psnr with LPC: 25.913105

Mahalanobis Distance with LPC: 0.183672

****** compression using LSP as well ******

Compression ratio: 3.48

psnr with LSP: 25.861716

Mahalanobis Distance with LSP: 0.178418


Figure 6.30: PSNR for various orders of LPC (LPC vs. LSP).

Figure 6.31: Mahalanobis distance for various orders of LPC (LPC vs. LSP).


Figure 6.32: PSNR for various frame sizes (LPC vs. LSP).

Figure 6.33: Mahalanobis distance for various frame sizes (LPC vs. LSP).

MBSD (modified bark spectral distortion)

The average MBSD = 0.0151 for input&LPC, for the signal JUSTICE.

Figure 6.34: MBSD for various input signals (input&LPC vs. input&LSP).

Compression ratio

Speech signal                    Order of LPC    LPC      LSP
Justice                          8               26.64    35.40
                                 10              22.54    27.95
                                 12              19.54    23.17
                                 14              17.24    19.83
It is simple to be happy         8               26.36    35.69
                                 10              22.30    28.03
                                 12              19.33    23.17
                                 14              17.05    19.78
Time is precious dont waste it   8               26.39    35.57
                                 10              22.33    27.97
                                 12              19.35    23.13
                                 14              17.08    19.77

Table 6.1: Compression ratio for various inputs at different LPC orders


Compression ratio

Speech signal                    Frame size    LPC      LSP
Justice                          20 msec       19.54    23.17
                                 30 msec       19.81    23.56
                                 40 msec       20.09    23.91
                                 50 msec       20.23    24.13
It is simple to be happy         20 msec       19.33    23.17
                                 30 msec       19.43    23.65
                                 40 msec       19.47    23.83
                                 50 msec       19.57    24.02
Time is precious dont waste it   20 msec       19.35    23.13
                                 30 msec       19.46    23.38
                                 40 msec       19.51    23.46
                                 50 msec       19.62    23.64

Table 6.2: Compression ratio for various inputs at different frame sizes


Conclusion
The synthesis with LPC and LSP depends on two parameters:
1. Order of LPC
2. Frame size

For a given order of LPC and frame size, the compression ratio is better for LSP than for LPC.

After calculating the PSNR and the Mahalanobis distance, the LPC mostly dominates the LSP, but the difference between them is very small.

So, for better compression it is better to opt for LSP rather than LPC, although implementing LSP is more expensive.
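The quality figures quoted in this chapter can be reproduced with a short routine. The sketch below computes the PSNR between the original and the synthesized speech; the reference level (here the peak of the input) and the alignment of the two signals are assumptions, since those implementation details are not given above. The LSP (line spectral pair) representation itself can be obtained from the LPC coefficients with poly2lsf and converted back with lsf2poly, both in the Signal Processing Toolbox.

% PSNR between the original speech x and the LPC/LSP synthesized speech y.
function p = speechPSNR(x, y)
    x = x(:); y = y(:);
    n = min(numel(x), numel(y));             % compare the overlapping part only
    err = x(1:n) - y(1:n);
    mse = mean(err.^2);                      % mean squared reconstruction error
    p = 10 * log10(max(abs(x))^2 / mse);     % peak of the input taken as the reference level
end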


References

Reference A:
[1] Sara Grassi, "Optimized Implementation of Speech Processing Algorithms," Imprimatur pour la these.

Reference B:
[1] F. Itakura, "Line Spectrum Representation of Linear Predictive Coefficients of Speech Signals," J. of the Acoustical Society of America, Vol. 57, p. S35, 1975.
[2] K. Paliwal and B. Atal, "Efficient Vector Quantization of LPC Parameters at 24 bits/frame," IEEE Trans. on Speech and Audio Processing, Vol. 1, No. 1, pp. 3-14, 1993.
[3] P. Kabal and P. Ramachandran, "The Computation of Line Spectral Frequencies Using Chebyshev Polynomials," IEEE Trans. on Acoustics, Speech and Signal Processing, Vol. 34, No. 6, pp. 1419-1426, 1986.
