by
Sangeet Sagar
(Dept. of ECE, The LNMIIT Jaipur)
Department of ECE
Birla Institute of Technology
Mesra, Ranchi
July-2017
BONAFIDE CERTIFICATE
record of the work done by “Sangeet Sagar” under my supervision from “11th of May to 8th
of July”
Place-
Date -
Declaration by Author
This is to declare that this report has been written by me. No part of the report is plagiarized
from other sources. All information included from other sources has been duly
acknowledged. I aver that if any part of the report is found to be plagiarized, I will take full
Sangeet Sagar
Roll- 15uec053
(Dept. of ECE, The LNMIIT Jaipur)
Table of Contents
Introduction
Pre-Processing
Conclusion
Introduction
Speech processing is an important research field that covers speaker recognition, speech
synthesis, speech coding and speech noise reduction, among other topics. Speech
recognition is one of the fastest-growing engineering technologies, with applications
in many different areas. It usually involves extracting features from the speech signal and
representing them with an appropriate data model.
During this internship I completed a mini project on “Isolated word recognition” using Mel
frequency cepstral coefficients (MFCC) and Linear Predictive Coding (LPC) as feature
extraction techniques and an Artificial Neural Network (ANN) as the classifier. The final
project was “Accent Recognition for Hindi and Bengali speech signals”, using MFCC,
delta-MFCC and double-delta-MFCC as feature extraction techniques and an ANN as the
classifier. The performance of an automatic speech recognition (ASR) system can be
improved if the speaker’s accent or dialect is detected before recognition, so that
suitable ASR acoustic and/or language models can be selected.
1. Pre-Processing
Preprocessing of speech signals is considered a crucial step in the development of a
robust and efficient speech or speaker recognition system. The general preprocessing
pipeline is depicted in the following figure.
1.1 Sampling
For a computer to process the speech signal, the signal first has to be
digitized. The continuous-time speech signal is therefore sampled and
quantized. The result is a signal that is discrete in both time and value.
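As a minimal sketch of the two operations, the Python snippet below samples a continuous-time function and quantizes it to signed integers. The 16 kHz sampling rate, the 16-bit depth and the 440 Hz test tone are assumptions chosen for illustration; the report does not state the values used for its recordings.

```python
import math

def sample_and_quantize(signal, duration_s, sample_rate_hz, n_bits=16):
    """Sample a continuous-time function and quantize it to signed integers.

    `signal` is any function of time (in seconds) returning values in [-1, 1].
    The sample rate and bit depth are illustrative assumptions.
    """
    max_level = 2 ** (n_bits - 1) - 1           # e.g. 32767 for 16 bits
    n_samples = round(duration_s * sample_rate_hz)
    samples = []
    for n in range(n_samples):
        t = n / sample_rate_hz                  # sampling: discrete time
        x = max(-1.0, min(1.0, signal(t)))      # clip to the valid range
        samples.append(round(x * max_level))    # quantization: discrete value
    return samples

def tone(t):
    """Hypothetical test input: a 440 Hz sine at half amplitude."""
    return 0.5 * math.sin(2 * math.pi * 440.0 * t)

pcm = sample_and_quantize(tone, duration_s=0.01, sample_rate_hz=16000)
```

Ten milliseconds at 16 kHz gives 160 integer samples, each of which fits in a 16-bit word.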
This operation was performed in MATLAB. Noise removal was then applied to
a recorded speech sample of the word ‘shunya’, with the following result.
Fig 2: Noise Removal Technique
2. Feature Extraction
For feature extraction, Mel frequency cepstral coefficients (MFCC) and the combined
features of both MFCC and LPC are used. Both techniques are described below:
Step 1: Pre-emphasis - The signal is passed through a filter that emphasizes high
frequencies. This increases the energy of the signal in the high-frequency range,
which also carries useful information. The equation used to denote the pre-emphasis
is shown below:
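The equation itself is not reproduced in the text above. A commonly used first-order pre-emphasis filter is y[n] = x[n] - a * x[n-1], and the sketch below assumes that form with a = 0.97. Both the filter form and the coefficient are assumptions, not values confirmed by this report.

```python
def pre_emphasis(signal, alpha=0.97):
    """Apply y[n] = x[n] - alpha * x[n-1] to a list of samples.

    alpha = 0.97 is an assumed, commonly quoted value; the first sample
    is passed through unchanged because it has no predecessor.
    """
    if not signal:
        return []
    out = [signal[0]]
    for n in range(1, len(signal)):
        out.append(signal[n] - alpha * signal[n - 1])
    return out

# A constant (zero-frequency) input is strongly attenuated after the first sample.
flat = pre_emphasis([1.0, 1.0, 1.0])
```

Slowly varying components are suppressed while rapid sample-to-sample changes pass almost unchanged, which is what boosts the high-frequency part of the spectrum.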
Step 2: Framing and Overlapping - The speech signal is split into several frames so that
each frame can be examined over a short time span instead of the entire signal. The frame
size is in the range 20-40 ms. Consecutive frames overlap, and a Hamming
window is applied to each frame. The equation of the Hamming window is as follows:
$$W(n) = 0.54 - 0.46 \cos\left(\frac{2\pi n}{N-1}\right), \quad 0 \le n \le N-1$$
Step 3: Framing - The input speech signal is partitioned into frames with a duration
that is shorter than the window duration.
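The framing and windowing steps can be sketched as follows. The window uses the Hamming formula W(n) = 0.54 - 0.46 cos(2*pi*n / (N - 1)). The 25 ms frame length, the 10 ms hop and the 16 kHz sampling rate are assumptions for illustration; the report only gives the 20-40 ms range.

```python
import math

def hamming(N):
    """Hamming window of length N: 0.54 - 0.46 * cos(2*pi*n / (N - 1))."""
    if N == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]

def frame_signal(signal, frame_len, hop_len):
    """Split a signal into overlapping frames and window each frame.

    A hop shorter than the frame length produces the overlap; trailing
    samples that do not fill a whole frame are dropped in this sketch.
    """
    window = hamming(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop_len):
        frame = signal[start:start + frame_len]
        frames.append([s * w for s, w in zip(frame, window)])
    return frames

# Assumed settings: 16 kHz audio, 25 ms frames (400 samples), 10 ms hop (160 samples).
one_second = [1.0] * 16000
frames = frame_signal(one_second, frame_len=400, hop_len=160)
```

With these settings one second of audio yields 98 frames of 400 samples each, and every frame is tapered towards 0.08 at both edges.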
Step 4: Fast Fourier Transform - The Fast Fourier Transform (FFT) converts each frame
from the time domain to the frequency domain, where the spectral content that
distinguishes speech sounds is easier to analyze. The FFT is
executed to obtain the magnitude frequency response of each frame and to
prepare the signal for the next stage.
$$S(\omega) = \mathrm{fft}(X(n))$$
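To illustrate what this step computes, the sketch below evaluates the magnitude spectrum of one frame with a direct discrete Fourier transform. This is a swapped-in O(N^2) version for readability that produces the same values as an FFT; it is not the MATLAB routine used in the project.

```python
import cmath
import math

def magnitude_spectrum(frame):
    """Magnitude of the DFT of a real frame, for bins 0 .. N/2.

    Direct O(N^2) evaluation; an FFT gives the same result faster.
    """
    N = len(frame)
    spectrum = []
    for k in range(N // 2 + 1):
        acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / N)
                  for n in range(N))
        spectrum.append(abs(acc))
    return spectrum

# Hypothetical test frame: a cosine with exactly 4 cycles in 32 samples.
frame = [math.cos(2 * math.pi * 4 * n / 32) for n in range(32)]
spec = magnitude_spectrum(frame)
peak_bin = spec.index(max(spec))
```

For this test frame the energy concentrates in bin 4, which matches the four cycles contained in the frame.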
Step 5: Mel Warping - Human perception of the frequency content of speech sounds
does not follow a linear scale. Therefore, for each tone with an actual frequency
f, measured in Hz, a subjective pitch is measured on a scale called the “Mel scale”. The
Mel frequency scale has linear frequency spacing below 1000 Hz and logarithmic
spacing above 1000 Hz. To compute the Mel value for a given frequency f in Hz, the
following formula is used.
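The formula is not reproduced in the text above. A widely used form is mel(f) = 2595 * log10(1 + f / 700), and the sketch below assumes it. Whether this exact variant was used in the project is not confirmed by the report.

```python
import math

def hz_to_mel(f_hz):
    """Convert a frequency in Hz to Mel (assumed 2595 * log10(1 + f/700) form)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

mel_1k = hz_to_mel(1000.0)   # close to 1000 Mel by construction of the scale
mel_4k = hz_to_mel(4000.0)   # grows much more slowly above 1000 Hz
```

Under this formula 1000 Hz maps to approximately 1000 Mel, while 4000 Hz maps to well under 4000 Mel, which reflects the logarithmic spacing above 1000 Hz.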
Linear Predictive Coding (LPC) analysis assumes that a speech sample can be
approximated as a linear combination of past speech samples. LPC is based on the
source-filter model of speech production.
$$\tilde{S}[n] = \sum_{k=1}^{p} a_k \, s[n-k]$$
The unknown $a_k$, $k = 1, 2, \ldots, p$, are called the LPC coefficients and can be solved by
the least square method.
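One standard way to obtain a least-squares solution for the predictor coefficients is the autocorrelation method with the Levinson-Durbin recursion, sketched below. This is an assumed choice for illustration; the report does not say which solver was used. The function returns only a_1 ... a_p, so the leading coefficient of 1 that belongs to the prediction-error filter is not part of the output.

```python
def autocorrelation(x, max_lag):
    """Autocorrelation r[0..max_lag] of a finite sequence."""
    n = len(x)
    return [sum(x[i] * x[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]

def lpc(x, order):
    """Predictor coefficients a_1..a_p via Levinson-Durbin.

    The coefficients satisfy s~[n] = sum_k a_k * s[n - k].
    Returns (coefficients, final prediction error energy).
    """
    r = autocorrelation(x, order)
    a = [0.0] * (order + 1)          # a[0] is unused
    err = r[0]
    for i in range(1, order + 1):
        if err == 0.0:
            break                    # silent input: nothing to predict
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err                # reflection coefficient
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= (1.0 - k * k)
    return a[1:], err

# Hypothetical test signal: s[n] = 0.9 ** n, so that s[n] = 0.9 * s[n - 1].
signal = [0.9 ** n for n in range(200)]
coeffs, residual = lpc(signal, order=2)
```

For this test signal the first coefficient comes out very close to 0.9 and the second close to 0, as expected for a signal in which each sample depends only on the previous one.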
We generally keep only 11 or 12 coefficients when calculating the LPC coefficients, and the first
coefficient is always 1. Using more than 11-12 coefficients generally captures undesirable
information about the speech signal, such as background noise. To increase the accuracy of the
feature extraction process, these LPC coefficients are vertically concatenated (using the vertcat
function in MATLAB) with the MFCC matrix to obtain 25 coefficients for each sample.
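The concatenation can be sketched in Python as below, with one feature vector per frame. The split into 13 MFCC and 12 LPC values is an assumption that happens to give 25; the report states only the total. The function name and the placeholder values are hypothetical.

```python
def stack_features(mfcc_frames, lpc_frames):
    """Join the MFCC and LPC vectors of each frame into one feature vector.

    Plays the role of MATLAB's vertcat when features are stored as rows
    and frames as columns; here each frame is a Python list instead.
    """
    if len(mfcc_frames) != len(lpc_frames):
        raise ValueError("both feature sets must cover the same frames")
    return [list(m) + list(l) for m, l in zip(mfcc_frames, lpc_frames)]

# Placeholder values only: 3 frames, 13 MFCC + 12 LPC per frame (assumed split).
mfcc = [[0.1] * 13 for _ in range(3)]
lpcs = [[0.2] * 12 for _ in range(3)]
combined = stack_features(mfcc, lpcs)
```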
3. Speech Classification
I performed isolated word recognition for 14 spoken words: “zero, one, two, three,
four, five, six, seven, eight, nine, add, minus, into, divide” (these speech samples were
pronounced and recorded as written, in a controlled laboratory environment), with
20 samples for each word. Both MFCC and LPC were used for feature extraction
and an ANN as the classifier, and the following confusion matrix was obtained.
Fig 5: Confusion matrix for the above speech signals (20 samples for each speech signal) with both MFCC and LPC as extraction features and ANN as classifier.
Conclusion: The experimental results show that combining the MFCC and LPC feature
extraction techniques gives higher recognition accuracy than the MFCC
feature extraction technique alone. The recognition accuracy is in the former case. The recognition
accuracy may differ when using only MFCC, only LPC, or the combination of both,
as well as with other classification techniques.
5. Database Preparation
In this project we collected a database from four speakers. We chose two native Hindi
speakers and two native Bengali speakers. The database included five speech signals (pronounced and
recorded as written): “shunya”, “ek”, “do”, “teen”, “chaar”, with 50 samples of each speech
signal per speaker. In total we had 1000 speech signals (500 speech signals from Hindi speakers and
500 speech signals from native Bengali speakers). Each sample was recorded in ‘.wav’ format
in a noise-controlled environment. We then performed the accent recognition task with MFCC
[2.1], delta-MFCC [5.1] and double-delta-MFCC [5.2] as our feature extraction process and
artificial neural networks (ANN) [3.1] as our classifier.
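Delta and double-delta features describe how the MFCCs change over time. A common regression formula is d_t = sum_{n=1..N} n * (c_{t+n} - c_{t-n}) / (2 * sum_{n=1..N} n^2), and the sketch below assumes it with N = 2 and edge frames repeated at the boundaries. The report does not give the formula or the window size, so both are assumptions.

```python
def delta(features, N=2):
    """Delta coefficients of a sequence of per-frame feature vectors.

    Uses the regression formula over +/- N frames (N = 2 is assumed);
    frames beyond either end are replaced by the nearest edge frame.
    """
    T = len(features)
    denom = 2 * sum(n * n for n in range(1, N + 1))
    out = []
    for t in range(T):
        row = []
        for d in range(len(features[t])):
            num = 0.0
            for n in range(1, N + 1):
                nxt = features[min(T - 1, t + n)][d]
                prv = features[max(0, t - n)][d]
                num += n * (nxt - prv)
            row.append(num / denom)
        out.append(row)
    return out

# Hypothetical one-dimensional "MFCC" track that rises by 1 per frame.
mfcc = [[float(t)] for t in range(6)]
d1 = delta(mfcc)         # delta-MFCC
d2 = delta(d1)           # double-delta-MFCC
combined = [c + a + b for c, a, b in zip(mfcc, d1, d2)]
```

Away from the edges the delta of a linear ramp equals its slope, and the double-delta is the same operator applied to the delta track.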
The confusion matrices obtained in each case, with their different recognition
accuracies, are as follows:
Accuracy Plot:-
We then manually collected 50 samples from native Hindi speakers and 50 samples
from native Bengali speakers as testing data and extracted their features. The resulting
confusion matrices are shown below:
Language    Hindi    Bengali
Hindi       50       2
Bengali     0        48

Language    Hindi    Bengali
Hindi       50       2
Bengali     0        48

Language    Hindi    Bengali
Hindi       50       2
Bengali     0        48
Accuracy - We observe that when an SVM is used as the classifier, the accuracy is the same in all three
cases. This might be due to the limited database and the limited number of speakers.
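The overall accuracy implied by such a confusion matrix is the sum of the diagonal divided by the total number of test samples. The short sketch below works this out for the values in the tables above. The result does not depend on whether rows hold the actual or the predicted language, which the tables do not state.

```python
def overall_accuracy(confusion):
    """Fraction of correctly classified samples in a square confusion matrix."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total

# Values taken from the tables above (order: Hindi, Bengali).
svm_confusion = [[50, 2],
                 [0, 48]]
accuracy = overall_accuracy(svm_confusion)
```

Here 98 of the 100 test samples lie on the diagonal, so the accuracy is 0.98 in each of the three cases.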
Conclusion
Different feature extraction and recognition techniques are discussed and used in
this report, and it can be concluded that the combination of MFCC and LPC features
performs better than MFCC features alone. This report attempts to provide a comprehensive
survey of speech recognition. We also observed that advanced feature extraction techniques such as
delta-MFCC and double-delta-MFCC increase the speech recognition accuracy to an appreciable
extent, although we did not observe much difference in accuracy when an SVM was used as the classifier.
Speech recognition has attracted scientists as an important discipline and has had a
technological influence on society. It is hoped that this report brings understanding and
inspiration to the research community working on automatic speech recognition (ASR) systems.