1.1 Overview
Recently there has been growing interest in improving human-computer interaction (HCI), that is, in how computers should interact with humans in day-to-day life. In this context, recognizing a person's emotional state and giving suitable feedback may play a crucial role. As a consequence, emotion recognition represents a hot research area in both industry and academia. Usually, emotion recognition is based on facial or voice features. This work proposes a solution, designed to be employed in a smartphone environment, able to capture the emotional state of a person starting from recordings of the speech signals in the surroundings obtained by mobile devices such as smartphones.
This thesis presents the implementation of a voice-based emotion detection system able to recognize four emotions (anger, sadness, joy and neutral), as widely used for emotion recognition. The classification of the speech signals is done using the Support Vector Machine (SVM) approach. The main contribution is a system able to recognize people's emotions, composed of two sub-systems: Gender Recognition and Emotion Recognition. The gender recognition algorithm, based on pitch extraction, is aimed at providing a priori information about the gender of the speaker; the SVM-based emotion classifier then employs the gender information as an input.
In order to train and test the mentioned SVM-based emotion classifier, a widely used emotional database, the Polish Emotional Database (ED), has been employed. The overall system reliability depends on the database adopted for the training and testing phases: the use of a simulated database (i.e., a collection of emotional vocal expressions played by actors) allows obtaining a higher level of correctly identified emotions.
Feature Extraction: the elaboration of the speech signal in order to obtain a certain number of variables, called features, useful for speech emotion recognition.
Feature Selection: the selection of the most appropriate features in order to reduce the computational load and the time required to recognize an emotion.
1.3 Objectives
The objectives of the project are illustrated below.
1.5 Procedure:
Initially, the speech signal is passed through the front-end block, which converts the continuous-time speech signal into a discrete-time signal sampled at 16 kHz. It is then given to the feature extraction block, in which the pitch is found using the autocorrelation method. After finding the pitch values, a threshold on the pitch values across the frames of the speech sample is applied in order to perform gender recognition. Then the formants are estimated from the LPC coefficients, together with the MFCC coefficients and the centre of gravity of the speech spectrum. All these features, along with the gender recognition output, are given to the SVM, which acts as the classifier that recognizes the emotion of the speech sample. The SVM requires the Polish emotional database in order to be trained on sentences uttered in different emotions; in the testing phase the SVM classifies the emotion by using an optimization function.
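The sketch below outlines this pipeline in MATLAB, the language used for the implementation (Chapter 7). The file name and the helper functions estimatePitch, extractFormants, extractMFCC and spectralCOG are hypothetical placeholders for the blocks developed in the following chapters, and svmModel stands for the trained classifier of Chapter 6.

[x, fs] = audioread('speech_sample.wav');     % hypothetical input file
x = resample(x(:,1), 16000, fs); fs = 16000;  % front end: 16 kHz discrete-time signal
pitch  = estimatePitch(x, fs);                % autocorrelation method (Chapter 2)
gender = mean(pitch) > 250;                   % threshold of 250 used in the Results chapter
feats  = [extractFormants(x, fs), ...         % formants from LPC (Chapter 3)
          extractMFCC(x, fs), ...             % MFCC (Chapter 4)
          spectralCOG(x, fs), gender];        % COG (Chapter 5) plus gender flag
emotion = predict(svmModel, feats);           % SVM classification (Chapter 6)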
Pitch estimation
2.1 Introduction
Pitch is an important feature of audio signals, especially for quasi-periodic signals such as
voiced sounds from human speech/singing and monophonic music from most music
instruments. Intuitively speaking, pitch represents the vibration frequency of the sound source of the audio signal. In other words, pitch is the fundamental frequency of the audio signal, which is equal to the reciprocal of the fundamental period. Thus a voiced speech signal exhibits a relative periodicity, and its fundamental frequency is called the pitch.
Conceptually, the most obvious sample point within a fundamental period is often referred to
as the pitch mark. Usually pitch marks are selected as the local maxima or minima of the
audio waveform.
Pitch detection algorithms can be divided into methods which operate in the time domain,
frequency domain, or both.
One group of pitch detection methods uses the detection and timing of some time-domain feature. Other time-domain methods use autocorrelation functions or difference norms to detect the similarity between the waveform and a time-lagged version of itself.
Another family of methods operates in the frequency domain, locating sinusoidal peaks in the
frequency transform of the input signal. Other methods use combinations of time and
frequency domain techniques to detect pitch.
Frequency domain methods call for the signal to be frequency transformed; the frequency domain representation is then inspected for the first harmonic, the greatest common divisor of all harmonics, or other such indications of the period. Windowing of the signal is recommended to avoid spectral smearing and, depending on the type of window, a minimum number of periods of the signal must be analyzed to enable accurate location of harmonic peaks.
Various linear pre-processing steps can be used to make the process of locating frequency
domain features easier, such as performing linear prediction on the signal and using the
residual signal for pitch detection. Performing nonlinear operations such as peak limiting also
simplifies the location of harmonics.
The method chosen for pitch estimation in this project is the autocorrelation method, which operates in the time domain. For a frame s(n), the autocorrelation function is

$\mathrm{acf}(\tau) = \sum_{n} s(n)\, s(n+\tau)$

where $\tau$ is the time lag in terms of sample points. The value of $\tau$ that maximizes acf($\tau$) over a specified range is selected as the pitch period in sample points. The autocorrelation is largest when the waveform and its delayed copy are in phase, so the first peak in the autocorrelation indicates the period of the waveform.
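As an illustration, a minimal MATLAB sketch of this estimator is given below; the 80-400 Hz search range is an assumption, and xcorr requires the Signal Processing Toolbox.

function f0 = pitch_acf(frame, fs)
% Pitch of one voiced frame by the autocorrelation method (sketch).
fmin = 80;  fmax = 400;                 % assumed plausible pitch range in Hz
lags = round(fs/fmax):round(fs/fmin);   % candidate pitch periods in samples
frame = frame(:) - mean(frame);         % remove the DC offset
r = xcorr(frame);                       % autocorrelation over all lags
r = r(numel(frame):end);                % keep lags tau = 0, 1, 2, ...
[~, k] = max(r(lags + 1));              % tau that maximizes acf(tau) in the range
f0 = fs / lags(k);                      % pitch = sampling rate / period
end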
Formants estimation
3.1 Introduction
Estimation of formant frequencies is generally more difficult than estimation of
fundamental frequency. The problem is that formant frequencies are properties of the vocal
tract system and need to be inferred from the speech signal rather than just measured. The
spectral shape of the vocal tract excitation strongly influences the observed spectral envelope,
such that we cannot guarantee that all vocal tract resonances will cause peaks in the observed
spectral envelope, nor that all peaks in the spectral envelope are caused by vocal tract
resonances.
The dominant method of formant frequency estimation is based on modelling the speech signal as if it were generated by a particular kind of source and filter.
This type of analysis is called source-filter separation, and in the case of formant frequency
estimation, we are interested only in the modelled system and the frequencies of its
resonances. To find the best matching system we use a method of analysis called Linear
Prediction. Linear prediction models the signal as if it were generated by a signal of
minimum energy being passed through a purely-recursive IIR filter.
We will demonstrate the idea by using LPC to find the best IIR filter from a section of speech
signal and then plotting the filter's frequency response.
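A minimal sketch of this demonstration is shown below; the file name and the order p = 12 are assumptions, and lpc, freqz and hamming come from the Signal Processing Toolbox.

[x, fs] = audioread('speech_sample.wav');   % hypothetical file name
x = x(:, 1);
n = round(0.03 * fs);                       % one 30 ms section of voiced speech
seg = x(1:n) .* hamming(n);                 % windowed section
a = lpc(seg, 12);                           % prediction-error filter A(z)
[H, f] = freqz(1, a, 512, fs);              % H(z) = 1/A(z), the LP spectrum
plot(f, 20*log10(abs(H))), grid on          % peaks approximate the formants
xlabel('Frequency (Hz)'), ylabel('Gain (dB)')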
For deconvolving the given speech into excitation and vocal tract system components, methods based on homomorphic processing, like cepstral analysis, have been developed. As the cepstral
analysis performs the deconvolution of speech into source and system components by traversing through the frequency domain, the deconvolution task becomes a computationally intensive process. To reduce this computational complexity, and to find the source and system components in the time domain itself, Linear Prediction analysis was developed.
The redundancy in the speech signal is exploited in LP analysis. The prediction of the current sample as a linear combination of the past p samples forms the basis of linear prediction analysis, where p is the order of prediction. The predicted sample $\hat{s}(n)$ can be represented as

$\hat{s}(n) = \sum_{k=1}^{p} a_k\, s(n-k)$
where the $a_k$ are the linear prediction coefficients and $s(n)$ is the windowed speech sequence, obtained by multiplying a short-time speech frame with a Hamming or similar window:

$s(n) = x(n)\, w(n)$

where $w(n)$ is the windowing sequence. The prediction error $e(n)$ is computed as the difference between the actual sample $s(n)$ and the predicted sample $\hat{s}(n)$:

$e(n) = s(n) - \hat{s}(n) = s(n) - \sum_{k=1}^{p} a_k\, s(n-k) \qquad (4)$
The values of $a_k$ which minimize the total prediction error $E = \sum_n e^2(n)$ can be computed by setting

$\frac{\partial E}{\partial a_k} = 0, \qquad k = 1, 2, \ldots, p,$

which for each $a_k$ gives p linear equations in p unknowns; their solution gives the LP coefficients. This can be represented as

$\sum_{k=1}^{p} a_k \sum_n s(n-k)\, s(n-i) = \sum_n s(n)\, s(n-i) \qquad (9)$

where $i = 1, 2, \ldots, p$. Equation (9) can be written in terms of the autocorrelation sequence $R(i)$ as

$\sum_{k=1}^{p} a_k\, R(|i-k|) = R(i), \qquad i = 1, 2, \ldots, p \qquad (10)$

where the autocorrelation sequence used in equation (10) can be written as

$R(i) = \sum_{n} s(n)\, s(n+i).$
In matrix form, $\mathbf{R}\mathbf{A} = \mathbf{r}$, where $\mathbf{R}$ is the $p \times p$ symmetric matrix of elements $R(i,k) = R(|i-k|)$ ($1 \le i, k \le p$), $\mathbf{r}$ is a column vector with elements $(R(1), R(2), \ldots, R(p))$, and $\mathbf{A}$ is the column vector of LPC coefficients $(a(1), a(2), \ldots, a(p))$. It can be shown that $\mathbf{R}$ is a Toeplitz matrix, which can be represented as

$\mathbf{R} = \begin{bmatrix} R(0) & R(1) & \cdots & R(p-1) \\ R(1) & R(0) & \cdots & R(p-2) \\ \vdots & \vdots & \ddots & \vdots \\ R(p-1) & R(p-2) & \cdots & R(0) \end{bmatrix}$
The LP residual is the prediction error $e(n)$, obtained as the difference between the current sample $s(n)$ and the predicted sample $\hat{s}(n)$, as shown in equation (4). In the z-domain,

$E(z) = S(z)\, A(z), \qquad A(z) = 1 - \sum_{k=1}^{p} a_k z^{-k}$

where $S(z)$ is the spectrum of the given short-time speech signal, so the LP residual is obtained by filtering the speech signal with $A(z)$, as indicated in figure 1. Similarly, it can be shown that the LP spectrum $H(z)$ is

$H(z) = \frac{1}{A(z)} = \frac{1}{1 - \sum_{k=1}^{p} a_k z^{-k}} \qquad (20)$

As $A(z)$ is the reciprocal of $H(z)$, the LP residual is obtained by inverse filtering of the speech.
LP analysis separates the given short-term speech sequence into a slowly varying vocal tract component, represented by the LP filter H(z), and a fast varying excitation component, given by the LP residual e(n). The LP filter H(z) imposes the desired spectral shape on the flat spectrum E(z) of the noise-like excitation sequence, as given in equation (20). As the LP spectrum provides the vocal tract characteristics, the vocal tract resonances (formants) can be estimated from it: the formant locations are obtained by picking the peaks of the LP magnitude spectrum |H(z)|. Figure 3.2.3 shows the first (F1), second (F2) and third (F3) formant frequencies estimated from the peaks in the LP magnitude spectrum.
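One common realisation of this peak picking, sketched below under the same assumptions as the previous snippet, uses the angles of the complex roots of A(z) instead of scanning |H(z)| itself; a (the LP coefficients) and fs are as before.

r  = roots(a);                        % poles of H(z) = 1/A(z)
r  = r(imag(r) > 0.01);               % keep one root per conjugate pair
fr = sort(angle(r) * fs / (2*pi));    % map the root angles to Hz
fr = fr(fr > 90);                     % discard near-DC roots
F1 = fr(1); F2 = fr(2); F3 = fr(3);   % first three formant candidates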
Figure: Block diagram of MFCC computation — continuous speech → frame blocking (frames) → windowing → FFT (spectrum) → mel-frequency wrapping (mel spectrum) → cepstrum (mel cepstrum).
4.2.1 Frame Blocking:
In this step the continuous speech signal is blocked into frames of N samples, with
adjacent frames being separated by M (M < N). The first frame consists of the first N
samples. The second frame begins M samples after the first frame, and overlaps it by N - M
samples and so on. This process continues until all the speech is accounted for within one or
more frames. Typical values for N and M are N = 256 (which is equivalent to ~30 msec of windowing and facilitates the fast radix-2 FFT) and M = 100.
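A minimal sketch of this blocking is given below, with x denoting the sampled speech signal.

N = 256;  M = 100;                          % frame length and frame shift
numFrames = floor((length(x) - N)/M) + 1;   % frames that fit entirely in x
frames = zeros(N, numFrames);
for k = 1:numFrames
    frames(:, k) = x((k-1)*M + (1:N));      % k-th frame starts M samples later
end

With the Signal Processing Toolbox, essentially the same blocking is obtained with buffer(x, N, N - M).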
4.2.2 Windowing:
The next step in the processing is to window each individual frame so as to minimize the signal discontinuities at the beginning and end of each frame. The concept here is to minimize the spectral distortion by using the window to taper the signal to zero at the beginning and end of each frame. If we define the window as $w(n)$, $0 \le n \le N-1$, where N is the number of samples in each frame, then the result of windowing is the signal

$y(n) = x(n)\, w(n), \qquad 0 \le n \le N-1.$

Typically the Hamming window is used, which has the form

$w(n) = 0.54 - 0.46 \cos\!\left(\frac{2\pi n}{N-1}\right), \qquad 0 \le n \le N-1.$

4.2.3 Fast Fourier Transform (FFT):
The next processing step converts each frame of N samples from the time domain into the frequency domain via the Discrete Fourier Transform (DFT), defined on the set of N samples $\{x_n\}$ as

$X_k = \sum_{n=0}^{N-1} x_n\, e^{-j 2\pi k n / N}, \qquad k = 0, 1, 2, \ldots, N-1.$
In general the $X_k$ are complex numbers and we only consider their absolute values (frequency magnitudes). The resulting sequence $\{X_k\}$ is interpreted as follows: the positive frequencies $0 \le f < F_s/2$ correspond to values $0 \le n \le N/2 - 1$, while the negative frequencies $-F_s/2 < f < 0$ correspond to $N/2 + 1 \le n \le N - 1$. Here, $F_s$ denotes the sampling frequency. The result after this step is often referred to as the spectrum or periodogram.
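Continuing the frame-blocking sketch above, windowing and the DFT for every frame can be written as follows (hamming again assumes the Signal Processing Toolbox; frames is the N x numFrames matrix built earlier):

w = hamming(N);                                    % w(n), 0 <= n <= N-1
Y = fft(frames .* repmat(w, 1, size(frames, 2)));  % X_k, one column per frame
P = abs(Y(1:N/2 + 1, :)).^2;                       % one-sided power spectrum (periodogram)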
4.2.4 Mel-frequency wrapping:
As mentioned above, psychophysical studies have shown that human perception of the
frequency contents of sounds for speech signals does not follow a linear scale. Thus for each
tone with an actual frequency, f, measured in Hz, a subjective pitch is measured on a scale
called the mel scale. The mel-frequency scale is a linear frequency spacing below 1000 Hz
and a logarithmic spacing above 1000 Hz.
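A commonly used analytic form of this mapping, given below as an assumption since the thesis does not state which formula was employed, is linear below roughly 1 kHz and logarithmic above it:

hz2mel = @(f) 2595 * log10(1 + f/700);   % Hz to mel
mel2hz = @(m) 700 * (10.^(m/2595) - 1);  % mel back to Hz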
Figure: Mel-spaced filterbank (filter gain versus frequency in Hz, 0-7000 Hz).
4.2.5 Cepstrum:
In this final step, we convert the log mel spectrum back to time. The result is called the
mel frequency cepstrum coefficients (MFCC). The cepstral representation of the speech
spectrum provides a good representation of the local spectral properties of the signal for the
given frame analysis.
Because the mel spectrum coefficients (and so their logarithm) are real numbers, we can convert them to the time domain using the Discrete Cosine Transform (DCT). Therefore, if we denote the mel power spectrum coefficients that are the result of the last step as $\tilde{S}_k$, $k = 1, 2, \ldots, K$, we can calculate the MFCCs, $\tilde{c}_n$, as

$\tilde{c}_n = \sum_{k=1}^{K} (\log \tilde{S}_k)\cos\!\left[n\left(k - \frac{1}{2}\right)\frac{\pi}{K}\right], \qquad n = 0, 1, \ldots, K-1.$

Note that we exclude the first component, $\tilde{c}_0$, from the DCT since it represents the mean value of the input signal, which carries little speaker-specific information.
By applying the procedure described above, for each speech frame of around 30 msec with overlap, a set of mel-frequency cepstrum coefficients is computed. These are the result of a cosine transform of the logarithm of the short-term power spectrum expressed on a mel-frequency scale. This set of coefficients is called an acoustic vector; therefore each input utterance is transformed into a sequence of acoustic vectors.
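The final cepstrum step can be sketched as below for one frame; melS stands for the K mel power spectrum values produced by the filterbank of the previous step (an assumed variable), and MATLAB's dct, from the Signal Processing Toolbox, equals the formula above up to a scale factor.

c = dct(log(melS(:)));   % DCT-II of the log mel spectrum
mfcc = c(2:13);          % exclude c~0 (mean level), keep 12 coefficients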
Centre of Gravity (COG)
5.1 Introduction:
The spectral Centre Of Gravity (COG) is a measure of how high the frequencies in a spectrum are. For this reason the COG gives an average indication of the spectral distribution of the speech signal under observation. Given the considered discrete signal s(n) and its DFT S(k), the COG has been computed as the power-weighted mean frequency,

$\mathrm{COG} = \frac{\sum_k f_k\, |S(k)|^2}{\sum_k |S(k)|^2}$

where $f_k$ is the frequency of the k-th DFT bin.
5.2.2 Skewness:
The skewness of a spectrum is a measure of its asymmetry, and it is defined as the third central moment of the considered sequence s(n), divided by the 1.5 power of the second central moment:

$\text{skewness} = \frac{\mu_3}{\mu_2^{1.5}}$

These spectral moments determine how the frequencies are distributed in the spectrum of the speech signal, and from them the power distribution in the spectrum of the signal under observation can be characterised. Higher moments could be computed for more accuracy, but moments up to the third central moment are normally sufficient to describe the power levels in the spectrum; in this project the moments are therefore computed up to the skewness, which is the third central moment.
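A minimal sketch of both measures for one frame is given below; weighting each DFT bin by its power |S(k)|^2 is an assumed reading of the definitions above.

frame = frame(:);                        % one windowed speech frame
S = abs(fft(frame)).^2;                  % power spectrum
S = S(1:floor(end/2));                   % one-sided part
f = (0:numel(S)-1)' * fs/numel(frame);   % bin frequencies in Hz
p = S / sum(S);                          % normalised spectral weights
cog  = sum(f .* p);                      % centre of gravity (first moment)
mu2  = sum((f - cog).^2 .* p);           % second central moment
mu3  = sum((f - cog).^3 .* p);           % third central moment
skew = mu3 / mu2^1.5;                    % skewness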
Emotions Classifier
6.1 Introduction:
Usually, in the literature of the field, a Support Vector Machine (SVM) is used to classify sentences. The SVM is a relatively new machine learning algorithm, introduced by Vapnik and derived from statistical learning theory in the 90s. The main idea is to transform the original input set into a high-dimensional feature space by using a kernel function and, then, to achieve optimum classification in this new feature space, where a clear separation among the features is obtained by the optimal placement of a separating hyperplane, under the precondition of linear separability. The optimal hyperplane is obtained by solving a Quadratic Programming (QP) problem. In more detail, the following problem has been solved for each gender g:
Figure: SVM separating hyperplane and support vectors between Class A and Class B.
where the optimization variable is the well-known vector of Lagrangian multipliers of the QP problem written in dual form, the vectors are the feature vectors of the training set for the g-th gender, and the scalars represent the corresponding class labels.
The quantity C (C > 0) is the complexity constant, which determines the trade-off between the flatness (i.e., the sensitivity of the prediction to perturbations in the features) and the tolerance for misclassified samples. A higher value of C means that minimising the degree of misclassification is more important; C = 1 is used in this project, together with a non-linear kernel function, which yields a non-linear SVM. The QP problem is solved with the Sequential Minimal Optimization (SMO) approach, which provides an optimal point, not necessarily unique and isolated, if and only if the Karush-Kuhn-Tucker (KKT) conditions are verified and the involved matrices are positive semi-definite.
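As a hedged illustration only, the training step can be reproduced with current toolbox functions instead of a hand-written SMO solver; fitcecoc and templateSVM are from the Statistics and Machine Learning Toolbox, and Xg and yg denote the feature matrix and the emotion labels for gender g.

tmpl  = templateSVM('KernelFunction', 'rbf', 'BoxConstraint', 1); % C = 1
model = fitcecoc(Xg, yg, 'Learners', tmpl);  % multi-class classifier from binary SVMs
label = predict(model, testFeatures);        % e.g. 'anger', 'sadness', 'joy' or 'neutral'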
The database consists of 4 actors (2 male and 2 female) with 4 different emotions. Recordings for every speaker were made during a single session. Each speaker utters four different sentences. The utterance codes are:
RESULTS
7.1 Introduction:
In this project we have used MATLAB for the implementation. MATLAB (Matrix Laboratory) is a tool for numerical computation and visualization. The basic data element is a matrix, so a program that manipulates array-based data is generally fast to write and run in MATLAB. MATLAB is widely used in all areas of applied mathematics, in education and research at universities, and in industry. It has powerful graphic tools and can produce nice pictures in both 2D and 3D. It is also a programming language, and one of the easiest, with applications in signal processing, image processing, optimization, etc.
MATLAB is a high-level matrix/array language with control flow statements, functions, data structures, input/output, and object-oriented programming features. It provides a vast collection of computational algorithms, ranging from elementary functions like sum, sine, cosine, and complex arithmetic to more sophisticated functions like matrix inverse, matrix eigenvalues, Bessel functions, and fast Fourier transforms. In this project MATLAB version 8.3 (R2014a) has been used for the programming.
Figure 7.1: Gender recognition as female, with mean pitch values higher than the threshold.
Analysis: A speech sample is taken and, using the autocorrelation function, pitch values are found for every frame; finally, a threshold level of 250 is applied in order to separate male and female samples. Since the mean pitch is above the threshold, the speaker is recognized as female.
Figure 7.2: Gender recognition as male, with mean pitch values less than the threshold.
Analysis: The same procedure is applied; here the mean pitch falls below the threshold of 250, so the speaker is recognized as male.
Analysis: The formant frequencies in the above figures give the resonant frequencies of the vocal tract in different emotions, and are used to characterise the vocal tract system of a particular speaker.
Figure 7.7: MFCC for the speech signal in the anger state (male and female).
Figure 7.9: MFCC for the speech signal in the joy state (male and female).
Analysis: The MFCCs in different emotions give the short-term power levels of the speech signal that are useful in the recognition of speech.
Figure 7.10: Power spectrum of the speech signal in the anger state (male).
Analysis: The power spectrum represents the power of the particular male speaker in a particular emotion, and is used in the recognition of a particular word.
Figure 7.11: Power spectrum of the speech signal in the anger state (female).
Analysis: The power spectrum represents the power of the particular female speaker in a particular emotion, and is used in the recognition of a particular word.
Analysis: By running the system with 4 speakers in four different emotions, this matrix shows the recognition of each particular emotion versus the other emotions.
Conclusion:
The system is able to detect the 4 emotions; using the Polish emotional database, in which actors speak 4 utterances in different emotions, improves the accuracy in designing the emotion recognition system. The gender recognition used in this system is useful in reducing the time delay in the classifying stage of the classifier; by using it, the accuracy of the emotion recognition system can also be improved.
References
Reference A:
[1] Igor Bisio, Alessandro Delfino, Fabio Lavagetto, Mario Marchese, and Andrea Sciarrone, "Gender-driven Emotion Recognition Through Speech Signals for Ambient Intelligence Applications", IEEE, 2013.
Reference B:
Figure 6.1: input and synthesized output from LPC and LSP for input JUSTICE (LPC order = 8)
2. Speech spelt = JUSTICE
Frame size = 20 msec
Order of LPC = 12
Figure 6.4: input and synthesized output from LPC and LSP for input JUSTICE (LPC order = 12)
3. Speech spelt = JUSTICE
Frame size = 20 msec
Order of LPC = 45
Figure 6.5: input and synthesized output from LPC and LSP for input JUSTICE (LPC order = 45)
Figure 6.6: PSNR (dB) of the LPC and LSP syntheses for input JUSTICE at various LPC orders.
Figure 6.7: Mahalanobis distance between the input and the LPC/LSP syntheses for input JUSTICE at various LPC orders.
4. Speech spelt = JUSTICE
Frame size = 30 msec
Order of LPC = 12
Figure 6.8: input and synthesized output from LPC and LSP for input JUSTICE (30 msec)
5. Speech spelt = JUSTICE
Frame size = 40 msec
Order of LPC = 12
Figure 6.9: input and synthesized output from LPC and LSP for input JUSTICE (40 msec)
6. Speech spelt = JUSTICE
Frame size = 50 msec
Order of LPC = 12
Figure 6.10: input and synthesized output from LPC and LSP for input JUSTICE (50 msec)
Figure 6.11: PSNR (dB) of the LPC and LSP syntheses for input JUSTICE at various frame sizes.
Figure 6.12: Mahalanobis distance between the input and the LPC/LSP syntheses for input JUSTICE at various frame sizes.
Figure 6.13: input and synthesized output from LPC and LSP (LPC order = 8)
Figure 6.14: input and synthesized output from LPC and LSP (LPC order = 12)
Power spectral density:
(Plots: comparison of Input&LPC and Input&LSP over LPC orders 8 to 16.)
Figure 6.18: input and synthesized output from LPC and LSP (30 msec)
Figure 6.19: input and synthesized output from LPC and LSP (40 msec)
Figure 6.22: input and synthesized output from LPC and LSP (20 msec)
Figure 6.23: input and synthesized output from LPC and LSP (30 msec)
Figure 6.24: input and synthesized output from LPC and LSP (40 msec)
(Plots: PSNR in dB and Mahalanobis distance of the LPC and LSP syntheses at orders 8 to 16.)
Figure 6.29: input and synthesized output from LPC and LSP (20 msec)
(Plots: PSNR in dB and average MBSD for Input&LPC and Input&LSP.)
Speech signal              Order of LPC   Compression ratio (LPC)   Compression ratio (LSP)
Justice                    8              26.64                     35.40
                           10             22.54                     27.95
                           12             19.54                     23.17
                           14             17.24                     19.83
—                          8              26.36                     35.69
                           10             22.30                     28.03
                           12             19.33                     23.17
                           14             17.05                     19.78
It is simple to be happy   8              26.39                     35.57
                           10             22.33                     27.97
                           12             19.35                     23.13
                           14             17.08                     19.77

Table 6.1: compression ratio for various inputs at different LPC orders
Speech signal              Frame size   Compression ratio (LPC)   Compression ratio (LSP)
Justice                    20 msec      19.54                     23.17
                           30 msec      19.81                     23.56
                           40 msec      20.09                     23.91
                           50 msec      20.23                     24.13
—                          20 msec      19.33                     23.17
                           30 msec      19.43                     23.65
                           40 msec      19.47                     23.83
                           50 msec      19.57                     24.02
It is simple to be happy   20 msec      19.35                     23.13
                           30 msec      19.46                     23.38
                           40 msec      19.51                     23.46
                           50 msec      19.62                     23.64

Table 6.2: compression ratio for various inputs at different frame sizes
Conclusion
The quality of the LPC and LSP synthesis depends on two parameters:
1. Order of LPC
2. Frame size
For a given order of LPC and frame size, the compression ratio is better for LSP than for LPC. After calculating the PSNR and the Mahalanobis distance, LPC mostly dominates LSP, but the difference between them is very small. So, for better compression it is preferable to opt for LSP rather than LPC, although implementing LSP is expensive.
References
Reference A:
[1] Sara Grassi, "Optimized Implementation of Speech Processing Algorithms", Imprimatur pour la thèse.
Reference B:
[1] F. Itakura, "Line Spectrum Representation of Linear Predictive Coefficients of Speech Signals", J. of the Acoustical Society of America, Vol. 57, p. S35, 1975.
[2] K. Paliwal and B. Atal, "Efficient Vector Quantization of LPC Parameters at 24 Bits/Frame", IEEE Trans. on Speech and Audio Processing, Vol. 1, No. 1, pp. 3-14, 1993.
[3] P. Kabal and P. Ramachandran, "The Computation of Line Spectral Frequencies Using Chebyshev Polynomials", IEEE Trans. on Acoustics, Speech and Signal Processing, Vol. 34, No. 6, pp. 1419-1426, 1986.