
INTRODUCTION

Speech processing is one of the most exciting areas of signal processing.

Speech is the primary mode of communication among human beings and also the most natural and efficient form of exchanging information. It is therefore only logical that the next technological development should be natural-language speech processing. Speech recognition can be defined as the process of converting a speech signal into a sequence of words. The goal of a speech recognition system is to identify the exact words spoken by the speaker. Since the 1960s, scientists have been researching ways to make computers able to record, interpret and understand human speech. Throughout the decades this has been a daunting task; even a problem as rudimentary as digitizing (sampling) voice was a huge challenge in the early years, and was not solved satisfactorily until the 1980s. The early systems were very limited in scope and power.

Communication among human beings is dominated by spoken language, so it is natural for people to expect speech interfaces with computers: computers that can speak and recognize speech in the native language. Machine recognition of speech involves generating the sequence of words that best matches the given speech signal. Well-known applications include virtual reality, multimedia search, travel information and reservation, translation, natural language understanding and many more.

Speech recognition systems commonly carry out some kind of classification or recognition based on speech features, which are usually obtained via Fourier Transforms (FTs), Short-Time Fourier Transforms (STFTs), or Linear Predictive Coding (LPC) techniques. However, these methods have some disadvantages: they assume the signal is stationary within a given time frame and may therefore lack the ability to analyze localized events correctly. The wavelet transform copes with some of these problems. Other factors favoring Wavelet Transforms (WTs) over conventional methods include their ability to capture localized features. In this work the Discrete Wavelet Transform (DWT) is used for speech processing.

Statement of the problem

The primary goal is to develop a speech recognition system to recognize Telugu words. This requires a detailed, in-depth and comprehensive knowledge of speech processing. The second goal is the implementation of the proposed system, i.e. to develop a fast and efficient system with an innovative approach that uses the wavelet transform together with LPC and MFCC techniques for feature extraction, and HMM and DTW techniques to compare against the data stored in the database.

Motivation

Nowadays, Information Technology based applications are growing rapidly, and with the increase in these applications there is a need for secure transactions, such as net banking, mobile banking, biometric identification, forensics, etc. Speech recognition is a very cheap and robust technology compared to other biometric techniques for securing access and many other applications. Traditional systems have widely used LPC and MFCC techniques to extract features.


Figure: Block diagram

Linear Prediction of Speech


Feature Extraction Using Linear Predictive Coding

One of the most powerful signal analysis techniques is the method of linear prediction. LPC of speech has become the predominant technique for estimating the basic parameters of speech. It provides an accurate estimate of the speech parameters and is also an efficient computational model of speech. The basic idea behind LPC is that a speech sample can be approximated as a linear combination of past speech samples. By minimizing the sum of squared differences (over a finite interval) between the actual speech samples and the predicted values, a unique set of parameters, the predictor coefficients, can be determined. These coefficients form the basis for LPC of speech. The analysis provides the capability of computing the linear prediction model of speech over time. The predictor coefficients may further be transformed to a more robust set of parameters known as cepstral coefficients.

The voice signal, sampled directly from a microphone, is processed to extract the features. The method used for the feature extraction process is Linear Predictive Coding using an LPC processor. The basic steps of the LPC processor are the following:

1. Preemphasis: The digitized speech signal, s(n), is put through a low-order digital system to spectrally flatten the signal and to make it less susceptible to finite-precision effects later in the signal processing. The output of the preemphasizer network is related to its input, s(n), by the difference equation:

s̃(n) = s(n) − a·s(n−1),   where a typically lies between 0.9 and 1.0
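As a minimal sketch of this step, assuming NumPy and a typical coefficient value of a = 0.95 (the report does not state the value used):

```python
import numpy as np

def preemphasize(s, a=0.95):
    """First-order pre-emphasis: s~(n) = s(n) - a*s(n-1).
    The coefficient a = 0.95 is an assumed, typical value."""
    # keep the first sample as-is; difference the rest
    return np.append(s[0], s[1:] - a * s[:-1])
```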


2. Frame Blocking: The output of the preemphasis step, s̃(n), is blocked into frames of N samples, with adjacent frames separated by M samples. If x_l(n) is the l-th frame of speech and there are L frames within the entire speech signal, then

x_l(n) = s̃(M·l + n)

where n = 0, 1, …, N−1 and l = 0, 1, …, L−1.

3. Windowing: After frame blocking, the next step is to window each individual frame so as to minimize the signal discontinuities at the beginning and end of each frame. If we define the window as w(n), 0 ≤ n ≤ N−1, then the result of windowing is the signal:

x̃_l(n) = x_l(n)·w(n)

where 0 ≤ n ≤ N−1.

A typical window is the Hamming window, which has the form

w(n) = 0.54 − 0.46·cos(2πn/(N−1)),   0 ≤ n ≤ N−1
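The frame blocking and windowing steps can be sketched together as follows; the frame length N = 256 and shift M = 100 are borrowed from the MFCC framing values quoted later and are assumptions for the LPC front end:

```python
import numpy as np

def frame_and_window(s, N=256, M=100):
    """Block the pre-emphasized signal into L overlapping frames of N
    samples, M samples apart, and apply a Hamming window to each."""
    L = 1 + (len(s) - N) // M                        # number of complete frames
    frames = np.stack([s[l * M : l * M + N] for l in range(L)])
    w = 0.54 - 0.46 * np.cos(2 * np.pi * np.arange(N) / (N - 1))  # Hamming
    return frames * w                                # shape (L, N)
```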

4. Autocorrelation Analysis: The next step is to autocorrelate each frame of the windowed signal in order to give

r_l(m) = Σ_{n=0}^{N−1−m} x̃_l(n)·x̃_l(n+m),   m = 0, 1, …, p

where the highest autocorrelation index, p, is the order of the LPC analysis.

5. LPC Analysis: The next processing step is the LPC analysis, which converts each frame of p + 1 autocorrelations into an LPC parameter set using Durbin's method. This can formally be given as the following algorithm:

E^(0) = r(0)
k_i = [ r(i) − Σ_{j=1}^{i−1} α_j^(i−1)·r(i−j) ] / E^(i−1),   1 ≤ i ≤ p
α_i^(i) = k_i
α_j^(i) = α_j^(i−1) − k_i·α_{i−j}^(i−1),   1 ≤ j ≤ i−1
E^(i) = (1 − k_i²)·E^(i−1)

Solving the above recursively for i = 1, 2, …, p, the LPC coefficients a_m are given by

a_m = α_m^(p),   1 ≤ m ≤ p
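A compact sketch of steps 4 and 5 together, assuming NumPy and an assumed analysis order p = 10:

```python
import numpy as np

def lpc_durbin(frame, p=10):
    """Autocorrelation analysis followed by Durbin's recursion for one
    windowed frame; p = 10 is an assumed LPC order."""
    N = len(frame)
    # autocorrelations r(0)..r(p)
    r = np.array([np.dot(frame[:N - m], frame[m:]) for m in range(p + 1)])
    a = np.zeros(p + 1)                  # a[1..i] hold alpha_j^(i)
    E = r[0]                             # E^(0) = r(0)
    for i in range(1, p + 1):
        # reflection coefficient k_i
        k = (r[i] - np.dot(a[1:i], r[i - 1:0:-1])) / E
        prev = a.copy()
        a[i] = k
        a[1:i] = prev[1:i] - k * prev[i - 1:0:-1]
        E *= 1.0 - k * k                 # E^(i) = (1 - k_i^2) E^(i-1)
    return a[1:], E                      # a_m = alpha_m^(p), residual energy
```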


LPC advantages

- LPC provides a good model of the speech signal.
- It is a production-based method.
- LPC represents the spectral envelope with low-dimension feature vectors.
- It provides linear characteristics.
- LPC leads to a reasonable source/vocal-tract separation.
- LPC is an analytically tractable model.
- The method of LPC is mathematically precise and straightforward to implement in either software or hardware.

LPC disadvantages

- LP models the input signal with constant weighting over the whole frequency range, whereas human perception does not have constant frequency resolution over the whole range.
- A serious problem with LPC coefficients is that they are highly correlated, while less correlated features are desirable for acoustic modeling.
- An inherent drawback of conventional LP is its inability to include speech-specific a priori information in the modeling process.

MFCC (Mel Frequency Cepstral Coefficients)

Extracting the best parametric representation of acoustic signals is an important task for producing better recognition performance. The efficiency of this phase is important for the next phase, since it affects its behavior. MFCC is based on human auditory perception, whose frequency resolution is approximately linear below 1 kHz and logarithmic above it; in other words, MFCC is based on the known variation of the human ear's critical bandwidth with frequency. MFCC uses filters that are spaced linearly at low frequencies, below 1000 Hz, and logarithmically above 1000 Hz. A subjective pitch scale, the Mel frequency scale, is used to capture the phonetically important characteristics of speech. The overall process of MFCC computation is shown in the figure.

Pre-emphasis: Pre-emphasis refers to a system process designed to increase, within a band of frequencies, the magnitude of some (usually higher) frequencies with respect to the magnitude of the other (usually lower) frequencies in order to improve the overall SNR. In this step the signal is passed through a filter which emphasizes higher frequencies, increasing the energy of the signal at high frequency.

Framing: This is the process of segmenting the speech samples obtained from an ADC into small frames with lengths in the range of 20 to 40 ms. The voice signal is divided into frames of N samples, with adjacent frames separated by M samples (M < N). Typical values used are M = 100 and N = 256.


Hamming windowing: A Hamming window is used as the window shape, considering the next block in the feature extraction processing chain and integrating all the closest frequency lines. If the window is defined as W(n), 0 ≤ n ≤ N−1, where N is the number of samples in each frame, Y(n) is the output signal, X(n) is the input signal and W(n) is the Hamming window, then the result of windowing is the signal shown in Eq. (1):

Y(n) = X(n)·W(n)   (1)

Fast Fourier transform: The FFT is used to convert each frame of N samples from the time domain into the frequency domain. The Fourier transform converts the convolution of the glottal pulse U[n] and the vocal tract impulse response H[n] in the time domain into a multiplication in the frequency domain, as shown in Eq. (2):

Y(w) = FFT[H(t) ∗ X(t)] = H(w)·X(w)   (2)

where X(w), H(w) and Y(w) are the Fourier transforms of X(t), H(t) and Y(t) respectively.

Mel filter bank processing: The range of frequencies in the FFT spectrum is very wide, and the voice signal does not follow a linear scale. Each filter's magnitude frequency response is triangular in shape, equal to unity at the centre frequency and decreasing linearly to zero at the centre frequencies of the two adjacent filters. Each filter output is then the sum of its filtered spectral components. The following equation, Eq. (3), is used to compute the Mels for a given frequency f in Hz:

F(Mel) = 2595·log10(1 + f/700)   (3)

Discrete cosine transform: This is the process of converting the log Mel spectrum back into the time domain using the DCT. The result of the conversion is called the Mel Frequency Cepstral Coefficients. The set of coefficients is called an acoustic vector; each input utterance is therefore transformed into a sequence of acoustic vectors.
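A minimal sketch of the filter-bank and DCT stages for one windowed frame, assuming NumPy/SciPy; the sampling rate, number of filters, and number of kept coefficients are assumed values not fixed by the report:

```python
import numpy as np
from scipy.fft import dct

def mel(f):                      # Eq. (3): Hz -> Mel
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_inv(m):                  # inverse mapping: Mel -> Hz
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_frame(frame, fs=8000, n_filters=26, n_ceps=12):
    """MFCC for one pre-emphasized, windowed frame (assumed parameters)."""
    N = len(frame)
    spec = np.abs(np.fft.rfft(frame)) ** 2                 # power spectrum
    # triangular filters, equally spaced on the Mel scale
    edges = mel_inv(np.linspace(mel(0.0), mel(fs / 2.0), n_filters + 2))
    bins = np.floor((N + 1) * edges / fs).astype(int)
    fbank = np.zeros((n_filters, len(spec)))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fbank[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)  # rising edge
        fbank[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)  # falling edge
    log_energy = np.log(fbank @ spec + 1e-10)              # log Mel spectrum
    return dct(log_energy, norm='ortho')[:n_ceps]          # DCT -> cepstra
```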

Delta energy and delta spectrum

The voice signal and the frames change over time, for example in the slope of a formant at its transitions. Therefore there is a need to add features relating to the change in cepstral features over time: 13 delta (velocity) features (for the 12 cepstral features plus energy) and 13 double-delta (acceleration) features are added, giving 39 features in total; a sketch of the delta computation is given after the lists below. The energy in a frame for a signal x in a window from time sample t1 to time sample t2 is represented as shown in Eq. (4):

Energy = Σ_{t=t1}^{t2} x²[t]   (4)

Each of the 13 delta features represents the change between frames of the corresponding cepstral or energy feature, while each of the 13 double-delta features represents the change between frames of the corresponding delta features.

MFCC advantages

- It is a perception-based method.
- The MFCC feature extraction approach gives good discrimination and a small correlation between components.
- The characteristics of the slowly varying part of the spectrum are concentrated in the low cepstral coefficients.
- Individual MFCC features are only weakly correlated, which is an advantage for building statistical acoustic models.
- It does not have linear characteristics (because the human perception of the frequency content of sound is not linear).
- MFCCs are one of the more popular parameterization methods used by researchers in the speech technology field.
- MFCC is capable of capturing the phonetically important characteristics of speech, and band-limiting can easily be employed to make it suitable for telephone applications.
- The coefficients are largely independent, allowing probability densities to be modeled with diagonal covariance matrices.
- Mel scaling has been shown to offer better discrimination between phones, which is an obvious help in recognition.
- MFCCs are derived from the power spectrum of the speech signal, while the phase spectrum is ignored.
- MFCC features mimic some of the human processing of the signal and have good discriminating properties.

MFCC disadvantages

- A small drawback is that MFCCs are more computationally expensive than LPCCs, due to the Fast Fourier Transform (FFT) at the early stages that converts speech from the time domain to the frequency domain.
- MFCCs do not lie in the frequency domain.
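The report does not give the exact delta formula, so the common ±2-frame regression below is an assumption; double deltas are obtained by applying the same function to the deltas:

```python
import numpy as np

def delta(feat, K=2):
    """Velocity features: d_t = sum_k k*(c_{t+k} - c_{t-k}) / (2*sum_k k^2),
    computed over +/-K neighbouring frames with edge padding.
    feat has shape (T, 13): 12 cepstral coefficients plus energy per frame."""
    T = len(feat)
    padded = np.pad(feat, ((K, K), (0, 0)), mode='edge')
    denom = 2 * sum(k * k for k in range(1, K + 1))
    return np.stack([
        sum(k * (padded[t + K + k] - padded[t + K - k])
            for k in range(1, K + 1)) / denom
        for t in range(T)
    ])

# feats = ...                                   # (T, 13) base features
# full = np.hstack([feats, delta(feats), delta(delta(feats))])  # (T, 39)
```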


However, it is well known that MFCC is not robust in noisy environments, which suggests that MFCC still has insufficient sound representation capability, especially at low SNR. Though Mel frequency cepstral coefficients (MFCCs) have been very successful in speech recognition, they have the following two problems: (1) they do not have any physical interpretation, and (2) liftering of cepstral coefficients, found to be highly useful in the earlier dynamic-warping-based speech recognition systems, has no effect in the recognition process when used with continuous-density models. Features derived from either the power spectrum or the phase spectrum alone are limited in how well they can represent the signal.

WAVELETS

Wavelet analysis is the breaking up of a signal into a set of scaled and translated versions of an original (or mother) wavelet. Taking the wavelet transform of a signal decomposes the original signal into wavelet coefficients at different scales and positions. These coefficients represent the signal in the wavelet domain, and all data operations can be performed using just the corresponding wavelet coefficients.
DAUBECHIES WAVELET BASIS

It turns out that there exist basis functions that fit the bill, namely wavelets. Wavelets are a cross between the impulse and the sinusoid: the wavelet dies off at negative and positive infinity, giving localization in time, while the wavelet's wiggle gives the frequency content. For our project we chose the 32-point Daubechies wavelet, generated by the MATLAB command daubcqf.m from the Rice Wavelet Toolbox, for two reasons: 1. Apparently it is the default wavelet for time-frequency analysis. 2. It looks a lot like the transient parts of speech. The 32-point Daubechies wavelet is shown in the figure below, along with a few other wavelets.

Figure: 32-point Daubechies wavelet
Figure: Wavelet families

With Fourier analysis we compared our signals to a basis consisting of sinusoids that differ in frequency. With wavelet analysis we compare our signals to a basis consisting of wiggles that differ in both frequency and temporal location. Surprisingly, such a set is generated by one wavelet prototype or mother wavelet. The wavelet W may be represented as a function of two parameters, frequency and time, and thus may be expressed as:
W = g(f*t + t')

where t is time, f is frequency and t' is the time delay. Varying the two parameters of the wavelet has physical consequences. We use the mother wavelet X shown in Figure 4.4 to demonstrate these changes.

Figure 4.4

By varying f we can compress or dilate the prototype wavelet to obtain wavelets of higher or lower frequency respectively, much like varying the frequency w in a sine function sin(wt). Figure 4.5 shows the result of multiplying the f of X by a factor of 0.5.

Figure 4.5

By varying t' we can translate the wavelet in time. Figure 4.6 shows the result of subtracting some delay in the argument of X.

Figure 4.6

By varying both parameters we can generate a wide domain of wavelets, each representing different frequency content within different time intervals. Once a set of wavelets is generated from the prototype wavelet, the signal is projected onto the set via the dot product, or in more formal terminology the wavelet transform. If the two parameters f and t' are stepped through continuously we have the continuous wavelet transform; if they are stepped through discretely we have the discrete wavelet transform. For our project we chose the discrete wavelet transform (DWT). The DWT steps through frequency and time by factors of two; hence it projects the signal onto a set of octave wavelets that differ in frequency by factors of two. The majority of our working recognition algorithm relied on differentiating digits by their octaves.
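A sketch of the seven-level decomposition using PyWavelets in place of the Rice Wavelet Toolbox; 'db16' has 32 filter taps, which we take to correspond to the 32-point Daubechies wavelet (an assumption on our part):

```python
import numpy as np
import pywt  # PyWavelets, substituting for daubcqf.m

signal = np.random.randn(2 ** 14)        # stand-in for one recorded digit
coeffs = pywt.wavedec(signal, 'db16', level=7)
# coeffs[0] is the coarsest approximation; coeffs[1]..coeffs[7] are the
# detail bands (octaves), each differing in frequency by a factor of two.
for i, c in enumerate(coeffs[1:], start=1):
    print(f"octave {i}: {len(c)} coefficients")
```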


The first step is to make templates of the digits against which input signals are compared. For each digit we recorded 21 samples from seven different sources, all male, and wavelet-transformed each one of them; the average of the coefficients was then taken as the template. A Daubechies wavelet of length 32 is used in this project, and the level of the transform is seven (the level is just the number of octaves the signal is projected onto). These numbers were obtained by trial and error.

We first tried to compare the entire input signal to the templates. The first approach we took was mean-square-difference comparison, where we subtract the template from the input signal, square the remainders, and sum up all the coefficients, hoping that the digit the input signal corresponds to will give the minimum value. This approach works very well with the signals the templates were made from; however, it is a complete failure with signals outside of the templates. We then tried to make comparisons with other methods: comparing the absolute values of the coefficients, normalizing the signal before comparing, and taking the dot product of the input signals with the templates. Among these methods the dot product gives the best result. We dot the input with each of the templates, and due to the nature of the dot product, the digit that the signal corresponds to results in the largest value; a minimal sketch of this is given below.

As a different approach we analyzed the octaves, and found that we can differentiate 2 and 3 from 1, 4, and 5 by looking at the amplitude of the third octave: if the amplitude is small, the number is either 2 or 3; otherwise it is 1, 4, or 5.
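A minimal sketch of the template building and dot-product classification, assuming all recordings are already trimmed or padded to a common length (the function names are our own):

```python
import numpy as np
import pywt

def wavelet_vector(signal, wavelet='db16', level=7):
    """Flatten the DWT coefficients of a signal into one feature vector."""
    return np.concatenate(pywt.wavedec(signal, wavelet, level=level))

def make_template(recordings):
    """Average the wavelet coefficients of all recordings of one digit."""
    return np.mean([wavelet_vector(s) for s in recordings], axis=0)

def classify(signal, templates):
    """Return the digit whose template gives the largest dot product."""
    v = wavelet_vector(signal)
    return max(templates, key=lambda digit: np.dot(v, templates[digit]))

# templates = {'1': make_template(recs_1), '2': make_template(recs_2), ...}
```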

We then analyze other octaves to differentiate between 2 and 3, and between 1, 4, and 5. For 2 and 3 we look at the second octave: we threshold the region and count the number of samples above the threshold. If the number of samples above the threshold is large we probably have a 2, and if the number is small the signal is likely to be a 3. Of course, there is always the chance that the count falls within the region between a 2 and a 3; in this case we use the dot-product comparison to identify the signal, as in the sketch below.
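A sketch of this second-octave test; the amplitude threshold and count boundaries are placeholders, since the report's trial-and-error values are not given:

```python
import numpy as np

def two_or_three(octave2, amp_thresh=0.1, low=5, high=15):
    """Count second-octave coefficients whose magnitude exceeds amp_thresh.
    Many such samples suggest a 2, few suggest a 3; the in-between case is
    left to the dot-product comparison. All numeric values are placeholders."""
    count = int(np.sum(np.abs(octave2) > amp_thresh))
    if count >= high:
        return '2'
    if count <= low:
        return '3'
    return None   # ambiguous: fall back to dot-product template matching
```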

We used the same approach to identify 1, 4, and 5, except that these three numbers differ mostly in the fourth octave instead of the second. The three numbers have a similar mean value in the fourth octave; however, they differ in how much they fluctuate in that region. The coefficients of 4 fluctuate with large amplitude in the fourth octave, those of 1 also fluctuate but with less amplitude, while the coefficients of 5 remain roughly constant in this region. Therefore, to distinguish between them, we threshold the first part of the octave and count the number of samples above the threshold. The value of the threshold is picked so that 5 will have only a few coefficients above it.

Advantages of wavelets

- Due to their efficient time-frequency localization and multi-resolution characteristics, wavelet transforms are quite suitable for processing non-stationary signals such as speech.
- In wavelet analysis one can look at signals at different scales or resolutions: a rough approximation of the signal might look stationary, while at a detailed level discontinuities become apparent.
- One major advantage afforded by wavelets is the ability to perform efficient localization in both time and frequency.
- The multi-resolution property of wavelets can decompose the signal in terms of the resolution of detail. This analysis is capable of revealing aspects of the data that other signal analysis techniques miss: transients, breakdown points, discontinuities in higher derivatives, and self-similarity.
- Wavelet analysis can often compress or de-noise a signal without appreciable degradation, which has proved an indispensable property in signal analysis.
- Wavelets can zoom in on time discontinuities, and orthogonal bases localized in both time and frequency can be constructed.
- Wavelet transforms have advantages over traditional Fourier transforms for representing functions that have discontinuities and sharp peaks, and for accurately deconstructing and reconstructing finite, non-periodic and/or non-stationary signals. Hence the wavelet transform is well suited to transient signals whose frequency characteristics vary with time, such as speech.

Disadvantage

This method is non-adaptive, because the same basis wavelets have to be used for all data.

Figure: Major differences between the Fourier and wavelet transforms

WORK DONE SO FAR

LPC: The number of features extracted from a speech signal decides the accuracy of the speech recognition system. To examine the results of the LPC analysis, the vocal tract response is plotted from the extracted LPC coefficients, as shown in the figure below.
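A sketch of how such a plot can be produced from the predictor coefficients, assuming SciPy/Matplotlib and treating the LPC model as the all-pole filter H(z) = G / (1 − Σ a_m z^(−m)); the gain G and sampling rate are assumed values:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.signal import freqz

def plot_vocal_tract(a, G=1.0, fs=8000):
    """Magnitude response of the LPC all-pole model; `a` are the predictor
    coefficients from Durbin's recursion, G an assumed unity gain."""
    w, h = freqz([G], np.concatenate(([1.0], -np.asarray(a))), worN=512, fs=fs)
    plt.plot(w, 20 * np.log10(np.abs(h) + 1e-10))
    plt.xlabel('Frequency (Hz)')
    plt.ylabel('Magnitude (dB)')
    plt.title('Vocal tract response from LPC coefficients')
    plt.show()
```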


WAVELET: Wavelets prove to be an effective method for analyzing speech signals that contain both steady-state characteristics (vowels) and transient characteristics (consonants), since different combinations of vowels and consonants have distinct characteristics in different octaves. Using a Daubechies wavelet of length 32, the speech signal is decomposed into its coefficients and converted into a template.
By trial and error, the octave-level values for the English digits (one to nine) were decided.

Conclusions
A very difficult problem in speech recognition systems is feature extraction, which decides the accuracy of the recognition system. LPC and MFCC give information about the signal in the frequency domain only, whereas the wavelet transform gives information about the signal in both the time and the frequency domain. Hence, more detailed information about the signal can be extracted for analysis by using wavelet transforms.

Future Work

DTW and HMM algorithms will be implemented for comparing against the speech templates formed while training the recognition system.

