1.1 Overview
Recently there has been growing interest in improving human-computer interaction (HCI), that is, in how computers should interact with humans in day-to-day life. In this context, recognizing a person's emotional state and giving suitable feedback may play a crucial role. As a consequence, emotion recognition represents a hot research area in both industry and academia. Usually, emotion recognition is based on facial or voice features. This work proposes a solution, designed to be employed in a smartphone environment, able to capture the emotional state of a person starting from recordings of the speech signals in the surroundings obtained by mobile devices such as smartphones.
This thesis presents the implementation of a voice-based emotion detection system able to recognize four emotions (anger, sadness, joy and neutral), as widely used for emotion recognition. The classification of the speech signals is done using the Support Vector Machine (SVM) approach. The main contribution is a system able to recognize people's emotions, composed of two sub-systems: Gender Recognition and Emotion Recognition. The gender recognition algorithm, based on pitch extraction, is aimed at providing a priori information about the gender of the speaker; the SVM-based emotion classifier then employs the gender information as an input.
In order to train and test the mentioned SVM-based emotion classifier, a widely used emotional database, the Polish Emotional Database (ED), has been employed. The overall system reliability depends on the database adopted for the training and testing phases: the use of a simulated database (i.e., a collection of emotional vocal expressions played by actors) allows obtaining a higher level of correctly identified emotions.
Feature Extraction: the elaboration of the speech signal in order to obtain a certain number of variables, called features, useful for speech emotion recognition.
Feature Selection: the selection of the most appropriate features in order to reduce the computational load and the time required to recognize an emotion.
1.3 Objectives
The objectives of the project are illustrated below.
1.5 Procedure:
Initially, the speech signal is passed through the front-end block, which converts the continuous-time speech signal into a discrete-time signal sampled at 16 kHz. It is then given to the feature extraction block, in which the pitch is found using the autocorrelation method. After finding the pitch values, a threshold on the pitch values across the frames of the speech sample is applied in order to perform gender recognition. Then the formants are estimated from the LPC coefficients, together with the MFCC coefficients and the centre of gravity of the speech spectrum. All these features, along with the gender recognition output, are given to the SVM, which acts as the classifier that recognizes the emotion of the speech sample. The SVM requires the Polish emotional database in order to be trained on sentences uttered in different emotions; in the testing phase the SVM classifies the emotion by using an optimization function.
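The sketch below outlines this pipeline in MATLAB, the language used for the implementation (Chapter 7). The file name and the helper functions estimatePitch, extractFormants, extractMFCC and spectralCOG are hypothetical placeholders for the blocks developed in the following chapters, and svmModel stands for the trained classifier of Chapter 6.

[x, fs] = audioread('speech_sample.wav');     % hypothetical input file
x = resample(x(:,1), 16000, fs); fs = 16000;  % front end: 16 kHz discrete-time signal
pitch  = estimatePitch(x, fs);                % autocorrelation method (Chapter 2)
gender = mean(pitch) > 250;                   % threshold of 250 used in the Results chapter
feats  = [extractFormants(x, fs), ...         % formants from LPC (Chapter 3)
          extractMFCC(x, fs), ...             % MFCC (Chapter 4)
          spectralCOG(x, fs), gender];        % COG (Chapter 5) plus gender flag
emotion = predict(svmModel, feats);           % SVM classification (Chapter 6)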
Pitch estimation
2.1 Introduction
Pitch is an important feature of audio signals, especially for quasi-periodic signals such as
voiced sounds from human speech/singing and monophonic music from most music
instruments. Intuitively speaking, pitch represents the vibration frequency of the sound source of the audio signal. In other words, pitch is the fundamental frequency of the audio signal, which is equal to the reciprocal of the fundamental period. Thus a voiced speech signal exhibits a relative periodicity, and its fundamental frequency is called the pitch.
Conceptually, the most obvious sample point within a fundamental period is often referred to
as the pitch mark. Usually pitch marks are selected as the local maxima or minima of the
audio waveform.
Pitch detection algorithms can be divided into methods which operate in the time domain,
frequency domain, or both.
One group of pitch detection methods uses the detection and timing of some time-domain feature. Other time-domain methods use autocorrelation functions or difference norms to detect the similarity between the waveform and a time-lagged version of itself.
Another family of methods operates in the frequency domain, locating sinusoidal peaks in the
frequency transform of the input signal. Other methods use combinations of time and
frequency domain techniques to detect pitch.
Frequency domain methods call for the signal to be frequency transformed; the frequency domain representation is then inspected for the first harmonic, the greatest common divisor of all harmonics, or other such indications of the period. Windowing of the signal is recommended to avoid spectral smearing and, depending on the type of window, a minimum number of periods of the signal must be analyzed to enable accurate location of harmonic peaks.
Various linear pre-processing steps can be used to make the process of locating frequency
domain features easier, such as performing linear prediction on the signal and using the
residual signal for pitch detection. Performing nonlinear operations such as peak limiting also
simplifies the location of harmonics.
The method chosen for pitch estimation in this project is the autocorrelation method, which operates in the time domain. For a frame s(n), the autocorrelation function is

$\mathrm{acf}(\tau) = \sum_{n} s(n)\, s(n+\tau)$

where $\tau$ is the time lag in terms of sample points. The value of $\tau$ that maximizes acf($\tau$) over a specified range is selected as the pitch period in sample points. The autocorrelation is largest when the waveform and its delayed copy are in phase, so the first peak in the autocorrelation indicates the period of the waveform.
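As an illustration, a minimal MATLAB sketch of this estimator is given below; the 80-400 Hz search range is an assumption, and xcorr requires the Signal Processing Toolbox.

function f0 = pitch_acf(frame, fs)
% Pitch of one voiced frame by the autocorrelation method (sketch).
fmin = 80;  fmax = 400;                 % assumed plausible pitch range in Hz
lags = round(fs/fmax):round(fs/fmin);   % candidate pitch periods in samples
frame = frame(:) - mean(frame);         % remove the DC offset
r = xcorr(frame);                       % autocorrelation over all lags
r = r(numel(frame):end);                % keep lags tau = 0, 1, 2, ...
[~, k] = max(r(lags + 1));              % tau that maximizes acf(tau) in the range
f0 = fs / lags(k);                      % pitch = sampling rate / period
end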
Formants estimation
3.1 Introduction
Estimation of formant frequencies is generally more difficult than estimation of
fundamental frequency. The problem is that formant frequencies are properties of the vocal
tract system and need to be inferred from the speech signal rather than just measured. The
spectral shape of the vocal tract excitation strongly influences the observed spectral envelope,
such that we cannot guarantee that all vocal tract resonances will cause peaks in the observed
spectral envelope, nor that all peaks in the spectral envelope are caused by vocal tract
resonances.
The dominant method of formant frequency estimation is based on modelling the speech signal as if it were generated by a particular kind of source and filter.
This type of analysis is called source-filter separation, and in the case of formant frequency
estimation, we are interested only in the modelled system and the frequencies of its
resonances. To find the best matching system we use a method of analysis called Linear
Prediction. Linear prediction models the signal as if it were generated by a signal of
minimum energy being passed through a purely-recursive IIR filter.
We will demonstrate the idea by using LPC to find the best IIR filter from a section of speech
signal and then plotting the filter's frequency response.
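A minimal sketch of this demonstration is shown below; the file name and the order p = 12 are assumptions, and lpc, freqz and hamming come from the Signal Processing Toolbox.

[x, fs] = audioread('speech_sample.wav');   % hypothetical file name
x = x(:, 1);
n = round(0.03 * fs);                       % one 30 ms section of voiced speech
seg = x(1:n) .* hamming(n);                 % windowed section
a = lpc(seg, 12);                           % prediction-error filter A(z)
[H, f] = freqz(1, a, 512, fs);              % H(z) = 1/A(z), the LP spectrum
plot(f, 20*log10(abs(H))), grid on          % peaks approximate the formants
xlabel('Frequency (Hz)'), ylabel('Gain (dB)')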
For deconvolving the given speech into excitation and vocal tract system components, methods based on homomorphic processing, like cepstral analysis, have been developed. As the cepstral
analysis performs the deconvolution of speech into source and system components by traversing through the frequency domain, the deconvolution task becomes a computationally intensive process. To reduce this computational complexity, and to find the source and system components in the time domain itself, Linear Prediction analysis was developed.
The redundancy in the speech signal is exploited in LP analysis. The prediction of the current sample as a linear combination of the past p samples forms the basis of linear prediction analysis, where p is the order of prediction. The predicted sample $\hat{s}(n)$ can be represented as

$\hat{s}(n) = \sum_{k=1}^{p} a_k\, s(n-k)$
where the $a_k$ are the linear prediction coefficients and $s(n)$ is the windowed speech sequence, obtained by multiplying a short-time speech frame with a Hamming or similar window:

$s(n) = x(n)\, w(n)$

where $w(n)$ is the windowing sequence. The prediction error $e(n)$ is computed as the difference between the actual sample $s(n)$ and the predicted sample $\hat{s}(n)$:

$e(n) = s(n) - \hat{s}(n) = s(n) - \sum_{k=1}^{p} a_k\, s(n-k) \qquad (4)$
The values of $a_k$ which minimize the total prediction error $E = \sum_n e^2(n)$ can be computed by setting

$\frac{\partial E}{\partial a_k} = 0, \qquad k = 1, 2, \ldots, p,$

which for each $a_k$ gives p linear equations in p unknowns; their solution gives the LP coefficients. This can be represented as

$\sum_{k=1}^{p} a_k \sum_n s(n-k)\, s(n-i) = \sum_n s(n)\, s(n-i) \qquad (9)$

where $i = 1, 2, \ldots, p$. Equation (9) can be written in terms of the autocorrelation sequence $R(i)$ as

$\sum_{k=1}^{p} a_k\, R(|i-k|) = R(i), \qquad i = 1, 2, \ldots, p \qquad (10)$

where the autocorrelation sequence used in equation (10) can be written as

$R(i) = \sum_{n} s(n)\, s(n+i).$
In matrix form, $\mathbf{R}\mathbf{A} = \mathbf{r}$, where $\mathbf{R}$ is the $p \times p$ symmetric matrix of elements $R(i,k) = R(|i-k|)$ ($1 \le i, k \le p$), $\mathbf{r}$ is a column vector with elements $(R(1), R(2), \ldots, R(p))$, and $\mathbf{A}$ is the column vector of LPC coefficients $(a(1), a(2), \ldots, a(p))$. It can be shown that $\mathbf{R}$ is a Toeplitz matrix, which can be represented as

$\mathbf{R} = \begin{bmatrix} R(0) & R(1) & \cdots & R(p-1) \\ R(1) & R(0) & \cdots & R(p-2) \\ \vdots & \vdots & \ddots & \vdots \\ R(p-1) & R(p-2) & \cdots & R(0) \end{bmatrix}$
The LP residual is the prediction error $e(n)$, obtained as the difference between the current sample $s(n)$ and the predicted sample $\hat{s}(n)$, as shown in equation (4). In the z-domain,

$E(z) = S(z)\, A(z), \qquad A(z) = 1 - \sum_{k=1}^{p} a_k z^{-k}$

where $S(z)$ is the spectrum of the given short-time speech signal, so the LP residual is obtained by filtering the speech signal with $A(z)$, as indicated in figure 1. Similarly, it can be shown that the LP spectrum $H(z)$ is

$H(z) = \frac{1}{A(z)} = \frac{1}{1 - \sum_{k=1}^{p} a_k z^{-k}} \qquad (20)$

As $A(z)$ is the reciprocal of $H(z)$, the LP residual is obtained by inverse filtering of the speech.
LP analysis separates the given short-term speech sequence into a slowly varying vocal tract component, represented by the LP filter H(z), and a fast varying excitation component, given by the LP residual e(n). The LP filter H(z) imposes the desired spectral shape on the flat spectrum E(z) of the noise-like excitation sequence, as given in equation (20). As the LP spectrum provides the vocal tract characteristics, the vocal tract resonances (formants) can be estimated from it: the formant locations are obtained by picking the peaks of the LP magnitude spectrum |H(z)|. Figure 3.2.3 shows the first (F1), second (F2) and third (F3) formant frequencies estimated from the peaks in the LP magnitude spectrum.
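One common realisation of this peak picking, sketched below under the same assumptions as the previous snippet, uses the angles of the complex roots of A(z) instead of scanning |H(z)| itself; a (the LP coefficients) and fs are as before.

r  = roots(a);                        % poles of H(z) = 1/A(z)
r  = r(imag(r) > 0.01);               % keep one root per conjugate pair
fr = sort(angle(r) * fs / (2*pi));    % map the root angles to Hz
fr = fr(fr > 90);                     % discard near-DC roots
F1 = fr(1); F2 = fr(2); F3 = fr(3);   % first three formant candidates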
Figure: Block diagram of MFCC computation — continuous speech → frame blocking (frames) → windowing → FFT (spectrum) → mel-frequency wrapping (mel spectrum) → cepstrum (mel cepstrum).
4.2.1 Frame Blocking:
In this step the continuous speech signal is blocked into frames of N samples, with
adjacent frames being separated by M (M < N). The first frame consists of the first N
samples. The second frame begins M samples after the first frame, and overlaps it by N - M
samples and so on. This process continues until all the speech is accounted for within one or
more frames. Typical values for N and M are N = 256 (which is equivalent to ~30 msec of windowing and facilitates the fast radix-2 FFT) and M = 100.
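A minimal sketch of this blocking is given below, with x denoting the sampled speech signal.

N = 256;  M = 100;                          % frame length and frame shift
numFrames = floor((length(x) - N)/M) + 1;   % frames that fit entirely in x
frames = zeros(N, numFrames);
for k = 1:numFrames
    frames(:, k) = x((k-1)*M + (1:N));      % k-th frame starts M samples later
end

With the Signal Processing Toolbox, essentially the same blocking is obtained with buffer(x, N, N - M).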
4.2.2 Windowing:
The next step in the processing is to window each individual frame so as to minimize the signal discontinuities at the beginning and end of each frame. The concept here is to minimize the spectral distortion by using the window to taper the signal to zero at the beginning and end of each frame. If we define the window as $w(n)$, $0 \le n \le N-1$, where N is the number of samples in each frame, then the result of windowing is the signal

$y(n) = x(n)\, w(n), \qquad 0 \le n \le N-1.$

Typically the Hamming window is used, which has the form

$w(n) = 0.54 - 0.46 \cos\!\left(\frac{2\pi n}{N-1}\right), \qquad 0 \le n \le N-1.$

4.2.3 Fast Fourier Transform (FFT):
The next processing step converts each frame of N samples from the time domain into the frequency domain via the Discrete Fourier Transform (DFT), defined on the set of N samples $\{x_n\}$ as

$X_k = \sum_{n=0}^{N-1} x_n\, e^{-j 2\pi k n / N}, \qquad k = 0, 1, 2, \ldots, N-1.$
In general the $X_k$ are complex numbers and we only consider their absolute values (frequency magnitudes). The resulting sequence $\{X_k\}$ is interpreted as follows: the positive frequencies $0 \le f < F_s/2$ correspond to values $0 \le n \le N/2 - 1$, while the negative frequencies $-F_s/2 < f < 0$ correspond to $N/2 + 1 \le n \le N - 1$. Here, $F_s$ denotes the sampling frequency. The result after this step is often referred to as the spectrum or periodogram.
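Continuing the frame-blocking sketch above, windowing and the DFT for every frame can be written as follows (hamming again assumes the Signal Processing Toolbox; frames is the N x numFrames matrix built earlier):

w = hamming(N);                                    % w(n), 0 <= n <= N-1
Y = fft(frames .* repmat(w, 1, size(frames, 2)));  % X_k, one column per frame
P = abs(Y(1:N/2 + 1, :)).^2;                       % one-sided power spectrum (periodogram)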
4.2.4 Mel-frequency wrapping:
As mentioned above, psychophysical studies have shown that human perception of the
frequency contents of sounds for speech signals does not follow a linear scale. Thus for each
tone with an actual frequency, f, measured in Hz, a subjective pitch is measured on a scale
called the mel scale. The mel-frequency scale is a linear frequency spacing below 1000 Hz
and a logarithmic spacing above 1000 Hz.
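A commonly used analytic form of this mapping, given below as an assumption since the thesis does not state which formula was employed, is linear below roughly 1 kHz and logarithmic above it:

hz2mel = @(f) 2595 * log10(1 + f/700);   % Hz to mel
mel2hz = @(m) 700 * (10.^(m/2595) - 1);  % mel back to Hz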
Figure: Mel-spaced filterbank (filter gain versus frequency in Hz, 0-7000 Hz).
4.2.5 Cepstrum:
In this final step, we convert the log mel spectrum back to time. The result is called the
mel frequency cepstrum coefficients (MFCC). The cepstral representation of the speech
spectrum provides a good representation of the local spectral properties of the signal for the
given frame analysis.
Because the mel spectrum coefficients (and so their logarithm) are real numbers, we can convert them to the time domain using the Discrete Cosine Transform (DCT). Therefore, if we denote the mel power spectrum coefficients that are the result of the last step as $\tilde{S}_k$, $k = 1, 2, \ldots, K$, we can calculate the MFCCs, $\tilde{c}_n$, as

$\tilde{c}_n = \sum_{k=1}^{K} (\log \tilde{S}_k)\cos\!\left[n\left(k - \frac{1}{2}\right)\frac{\pi}{K}\right], \qquad n = 0, 1, \ldots, K-1.$

Note that we exclude the first component, $\tilde{c}_0$, from the DCT since it represents the mean value of the input signal, which carries little speaker-specific information.
By applying the procedure described above, for each speech frame of around 30 msec with overlap, a set of mel-frequency cepstrum coefficients is computed. These are the result of a cosine transform of the logarithm of the short-term power spectrum expressed on a mel-frequency scale. This set of coefficients is called an acoustic vector; therefore each input utterance is transformed into a sequence of acoustic vectors.
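The final cepstrum step can be sketched as below for one frame; melS stands for the K mel power spectrum values produced by the filterbank of the previous step (an assumed variable), and MATLAB's dct, from the Signal Processing Toolbox, equals the formula above up to a scale factor.

c = dct(log(melS(:)));   % DCT-II of the log mel spectrum
mfcc = c(2:13);          % exclude c~0 (mean level), keep 12 coefficients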
Centre of Gravity (COG)
5.1 Introduction:
The spectral Centre Of Gravity (COG) is a measure of how high the frequencies in a spectrum are. For this reason the COG gives an average indication of the spectral distribution of the speech signal under observation. Given the considered discrete signal s(n) and its DFT S(k), the COG has been computed as the power-weighted mean frequency,

$\mathrm{COG} = \frac{\sum_k f_k\, |S(k)|^2}{\sum_k |S(k)|^2}$

where $f_k$ is the frequency of the k-th DFT bin.
5.2.2 Skewness:
The skewness of a spectrum is a measure of its asymmetry, and it is defined as the third central moment of the considered sequence s(n), divided by the 1.5 power of the second central moment:

$\text{skewness} = \frac{\mu_3}{\mu_2^{1.5}}$

These spectral moments determine how the frequencies are distributed in the spectrum of the speech signal, and from them the power distribution in the spectrum of the signal under observation can be characterised. Higher moments could be computed for more accuracy, but moments up to the third central moment are normally sufficient to describe the power levels in the spectrum; in this project the moments are therefore computed up to the skewness, which is the third central moment.
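A minimal sketch of both measures for one frame is given below; weighting each DFT bin by its power |S(k)|^2 is an assumed reading of the definitions above.

frame = frame(:);                        % one windowed speech frame
S = abs(fft(frame)).^2;                  % power spectrum
S = S(1:floor(end/2));                   % one-sided part
f = (0:numel(S)-1)' * fs/numel(frame);   % bin frequencies in Hz
p = S / sum(S);                          % normalised spectral weights
cog  = sum(f .* p);                      % centre of gravity (first moment)
mu2  = sum((f - cog).^2 .* p);           % second central moment
mu3  = sum((f - cog).^3 .* p);           % third central moment
skew = mu3 / mu2^1.5;                    % skewness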
Emotions Classifier
6.1 Introduction:
Usually, in the literature of the field, a Support Vector Machine (SVM) is used to classify sentences. The SVM is a relatively new machine learning algorithm, introduced by Vapnik and derived from statistical learning theory in the 90s. The main idea is to transform the original input set into a high-dimensional feature space by using a kernel function and, then, to achieve optimum classification in this new feature space, where a clear separation among the features is obtained by the optimal placement of a separating hyperplane, under the precondition of linear separability. The optimal hyperplane is obtained by solving a Quadratic Programming (QP) problem. In more detail, the following problem has been solved for each gender g:
Figure: SVM separating hyperplane and support vectors between Class A and Class B.
where the optimization variable is the well-known vector of Lagrangian multipliers of the QP problem written in dual form, the vectors are the feature vectors of the training set for the g-th gender, and the scalars represent the corresponding class labels.
The quantity C (C > 0) is the complexity constant, which determines the trade-off between the flatness (i.e., the sensitivity of the prediction to perturbations in the features) and the tolerance for misclassified samples. A higher value of C means that minimising the degree of misclassification is more important; C = 1 is used in this project, together with a non-linear kernel function, which yields a non-linear SVM. The QP problem is solved with the Sequential Minimal Optimization (SMO) approach, which provides an optimal point, not necessarily unique and isolated, if and only if the Karush-Kuhn-Tucker (KKT) conditions are verified and the involved matrices are positive semi-definite.
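As a hedged illustration only, the training step can be reproduced with current toolbox functions instead of a hand-written SMO solver; fitcecoc and templateSVM are from the Statistics and Machine Learning Toolbox, and Xg and yg denote the feature matrix and the emotion labels for gender g.

tmpl  = templateSVM('KernelFunction', 'rbf', 'BoxConstraint', 1); % C = 1
model = fitcecoc(Xg, yg, 'Learners', tmpl);  % multi-class classifier from binary SVMs
label = predict(model, testFeatures);        % e.g. 'anger', 'sadness', 'joy' or 'neutral'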
The database consists of 4 actors (2 male and 2 female) with 4 different emotions. Recordings for every speaker were made during a single session. Each speaker utters four different sentences. The utterance codes are:
RESULTS
7.1 Introduction:
In this project we have used MATLAB for the implementation. MATLAB (Matrix Laboratory) is a tool for numerical computation and visualization. The basic data element is a matrix, so a program that manipulates array-based data is generally fast to write and run in MATLAB. MATLAB is widely used in all areas of applied mathematics, in education and research at universities, and in industry. It has powerful graphic tools and can produce nice pictures in both 2D and 3D. It is also a programming language, and one of the easiest, with applications in signal processing, image processing, optimization, etc.
MATLAB is a high-level matrix/array language with control flow statements, functions, data structures, input/output, and object-oriented programming features. It provides a vast collection of computational algorithms, ranging from elementary functions like sum, sine, cosine, and complex arithmetic to more sophisticated functions like matrix inverse, matrix eigenvalues, Bessel functions, and fast Fourier transforms. In this project MATLAB version 8.3 (R2014a) has been used for the programming.
Figure 7.1: Gender recognition as female, with mean pitch values higher than the threshold.
Analysis: A speech sample is taken and, using the autocorrelation function, pitch values are found for every frame; finally, a threshold level of 250 is applied in order to separate male and female samples. Since the mean pitch is above the threshold, the speaker is recognized as female.
Figure 7.2: Gender recognition as male, with mean pitch values less than the threshold.
Analysis: The same procedure is applied; here the mean pitch falls below the threshold of 250, so the speaker is recognized as male.
Analysis: The formant frequencies in the above figures give the resonant frequencies of the vocal tract in different emotions, and are used to characterise the vocal tract system of a particular speaker.
Figure 7.7: MFCC for the speech signal in the anger state (male and female).
Figure 7.9: MFCC for the speech signal in the joy state (male and female).
Analysis: The MFCCs in different emotions give the short-term power levels of the speech signal that are useful in the recognition of speech.
Figure 7.10: Power spectrum of the speech signal in the anger state (male).
Analysis: The power spectrum represents the power of the particular male speaker in a particular emotion, and is used in the recognition of a particular word.
Figure 7.11: Power spectrum of the speech signal in the anger state (female).
Analysis: The power spectrum represents the power of the particular female speaker in a particular emotion, and is used in the recognition of a particular word.
Analysis: By running the system with 4 speakers in four different emotions, this matrix shows the recognition of each particular emotion versus the other emotions.
Conclusion:
The system is able to detect the 4 emotions; using the Polish emotional database, in which actors speak 4 utterances in different emotions, improves the accuracy in designing the emotion recognition system. The gender recognition used in this system is useful in reducing the time delay in the classifying stage of the classifier; by using it, the accuracy of the emotion recognition system can also be improved.
References
Reference A:
[1] Igor Bisio, Alessandro Delfino, Fabio Lavagetto, Mario Marchese, and Andrea Sciarrone, "Gender-driven Emotion Recognition Through Speech Signals for Ambient Intelligence Applications", IEEE, 2013.
Reference B:
Figure 6.1: input and synthesized output from LPC and LSP for input JUSTICE (LPC order = 8)
2. Speech spelt = JUSTICE
Frame size = 20 msec
Order of LPC = 12
Figure 6.4: input and synthesized output from LPC and LSP for input JUSTICE (LPC order = 12)
3. Speech spelt = JUSTICE
Frame size = 20 msec
Order of LPC = 45
Figure 6.5: input and synthesized output from LPC and LSP for input JUSTICE (LPC order = 45)
Figure 6.6: PSNR (dB) of the LPC and LSP syntheses for input JUSTICE at various LPC orders.
Figure 6.7: Mahalanobis distance between the input and the LPC/LSP syntheses for input JUSTICE at various LPC orders.
4. Speech spelt = JUSTICE
Frame size = 30 msec
Order of LPC = 12
Figure 6.8: input and synthesized output from LPC and LSP for input JUSTICE (30 msec)
5. Speech spelt = JUSTICE
Frame size = 40 msec
Order of LPC = 12
Figure 6.9: input and synthesized output from LPC and LSP for input JUSTICE (40 msec)
6. Speech spelt = JUSTICE
Frame size = 50 msec
Order of LPC = 12
Figure 6.10: input and synthesized output from LPC and LSP for input JUSTICE (50 msec)
Figure 6.11: PSNR (dB) of the LPC and LSP syntheses for input JUSTICE at various frame sizes.
Figure 6.12: Mahalanobis distance between the input and the LPC/LSP syntheses for input JUSTICE at various frame sizes.
Figure 6.13: input and synthesized output from LPC and LSP (LPC order = 8)
Figure 6.14: input and synthesized output from LPC and LSP (LPC order = 12)
Power spectral density:
(Plots: comparison of Input&LPC and Input&LSP over LPC orders 8 to 16.)
Figure 6.18: input and synthesized output from LPC and LSP (30 msec)
Figure 6.19: input and synthesized output from LPC and LSP (40 msec)
Figure 6.22: input and synthesized output from LPC and LSP (20 msec)
Figure 6.23: input and synthesized output from LPC and LSP (30 msec)
Figure 6.24: input and synthesized output from LPC and LSP (40 msec)
(Plots: PSNR in dB and Mahalanobis distance of the LPC and LSP syntheses at orders 8 to 16.)
Figure 6.29: input and synthesized output from LPC and LSP (20 msec)
(Plots: PSNR in dB and average MBSD for Input&LPC and Input&LSP.)
Speech signal              Order of LPC   Compression ratio (LPC)   Compression ratio (LSP)
Justice                    8              26.64                     35.40
                           10             22.54                     27.95
                           12             19.54                     23.17
                           14             17.24                     19.83
—                          8              26.36                     35.69
                           10             22.30                     28.03
                           12             19.33                     23.17
                           14             17.05                     19.78
It is simple to be happy   8              26.39                     35.57
                           10             22.33                     27.97
                           12             19.35                     23.13
                           14             17.08                     19.77

Table 6.1: compression ratio for various inputs at different LPC orders
Speech signal              Frame size   Compression ratio (LPC)   Compression ratio (LSP)
Justice                    20 msec      19.54                     23.17
                           30 msec      19.81                     23.56
                           40 msec      20.09                     23.91
                           50 msec      20.23                     24.13
—                          20 msec      19.33                     23.17
                           30 msec      19.43                     23.65
                           40 msec      19.47                     23.83
                           50 msec      19.57                     24.02
It is simple to be happy   20 msec      19.35                     23.13
                           30 msec      19.46                     23.38
                           40 msec      19.51                     23.46
                           50 msec      19.62                     23.64

Table 6.2: compression ratio for various inputs at different frame sizes
Conclusion
The quality of the LPC and LSP synthesis depends on two parameters:
1. Order of LPC
2. Frame size
For a given order of LPC and frame size, the compression ratio is better for LSP than for LPC. After calculating the PSNR and the Mahalanobis distance, LPC mostly dominates LSP, but the difference between them is very small. So, for better compression it is preferable to opt for LSP rather than LPC, although implementing LSP is expensive.
References
Reference A:
[1] Sara Grassi, "Optimized Implementation of Speech Processing Algorithms", Imprimatur pour la thèse.
Reference B:
[1] F. Itakura, "Line Spectrum Representation of Linear Predictive Coefficients of Speech Signals", J. of the Acoustical Society of America, Vol. 57, p. S35, 1975.
[2] K. Paliwal and B. Atal, "Efficient Vector Quantization of LPC Parameters at 24 Bits/Frame", IEEE Trans. on Speech and Audio Processing, Vol. 1, No. 1, pp. 3-14, 1993.
[3] P. Kabal and P. Ramachandran, "The Computation of Line Spectral Frequencies Using Chebyshev Polynomials", IEEE Trans. on Acoustics, Speech and Signal Processing, Vol. 34, No. 6, pp. 1419-1426, 1986.