
Speech Recognition Using Digital Signal Processor

Mohd Abdul Muqeet
M.Tech. (Instrumentation)
Shri Guru Govind Singhji Institute of Engineering and Technology, Nanded
abmuqeet@rediffmail.com
Abstract-This paper describes a PC-independent speech recognition system. It combines aspects of both hardware and software design to implement a speaker-dependent, isolated-word, small-vocabulary speech recognition system. Feature extraction is based on estimation of Linear Prediction Cepstral Coefficients (LPCC), and template matching employs Dynamic Time Warping (DTW). For the hardware design, the TMS320C6X DSP is proposed. Code Composer Studio, an integrated development environment (IDE), is used to build and debug the programs related to the proposed system. Both the hardware and the software have to be designed with a view to achieving high-speed recognition.

Keywords-LPCC, Dynamic Time Warping (DTW), Preemphasis, Interrupt Service Routine (ISR)

1. INTRODUCTION
Speech recognition has been an active area of research for many years. With advances in VLSI technology and high-performance compilers, it has become possible to implement speech algorithms in hardware. In the last few years, various systems have been developed to serve a variety of applications. High-end Digital Signal Processors (DSPs) from companies such as TI and Analog Devices provide an ideal platform for developing and testing algorithms in hardware. Advanced software tools such as the C compiler and the Code Composer Studio simulator and debugger provide an easy approach to optimizing the algorithms.

Speech recognition is either speaker-independent or speaker-dependent. The speaker-independent mode involves extracting those features of speech which are inherent in the spoken word; this class of algorithms is rather complex and makes use of statistical models and language modeling. The speaker-dependent mode, on the other hand, involves extracting the user-specific features of the speech: a template of extracted coefficients has to be created for every user, and matching against these templates determines the spoken word. Furthermore, using isolated rather than continuous words helps to increase recognition accuracy. This paper presents the development of a speaker-dependent, isolated-word speech recognition system. The system should recognize a spoken word from a template of 10-15 words, with high recognition accuracy and a modest rejection ratio.

2. SOFTWARE
This section presents the software aspects of the speech recognition system. The work can be divided into three main subprojects: preprocessing, recognition parameter calculation, and the distance measure. Preprocessing deals with the various ways in which the signal must be processed before the recognition parameters can be calculated. The recognition parameters are either the autocorrelation values or the LPCs; they are the values compared to decide which word was most likely spoken. The distance measure is the way in which the recognition parameters of two frames are compared.

2.1. Feature Extraction Using Linear Prediction Cepstral Coefficients (LPCC)
Feature extraction involves identifying the formants in the speech, which represent the changes in the speaker's vocal tract. There are many approaches in use, viz. Linear Predictive Coding (LPC), Mel-scaled Frequency Cepstral Coefficients (MFCC), Linear Prediction Cepstral Coefficients (LPCC), and Reflection Coefficients (RCs). Among these, LPCCs have been found to be the most popular for isolated-word speech recognition. The basic steps of LPC analysis are discussed below, together with how they can be implemented.

2.2. Implementation of LPCC
A software implementation of feature extraction using LPC coefficient computation is explained in the following text.

2.2.1. Sampling. A sampling frequency of 8 kHz is sufficient for human speech. This frequency gives a window of 125 µs between two consecutive samples. Thus a sizable part
of the processing can be done in real time. For a word of duration 0.5 s, the number of samples at 8 kHz is 4000.

2.2.2. Preprocessing. Preprocessing consists of: deciding when to start and when to stop processing data, through the use of starting and ending thresholds; pre-emphasizing the signal data; blocking that data into frames; and finally, windowing those frames with a Hamming window.

Start Detection. It is vital to know when the incoming samples are likely to be a speech signal and when they are likely to be noise. The simplest way to decide is to set energy thresholds. The speech samples are continuously fetched, and a sliding window of the past N samples is maintained at every point in time, together with a running average of the energy content of the window. If this average rises above the threshold value, the start of an utterance can be recognized. It is also found that noise has very few zero crossings compared to normal speech; similarly to the energy, the number of zero crossings of the input signal in the sliding window exceeding a threshold value can therefore also be used in the detection process.

2.2.3. Preemphasis. Much of the important data contained in the speech signal is at higher frequencies. It is helpful to pre-emphasize the signal to enhance the spectral characteristics at these higher frequencies. The preemphasis filter is just a simple first-order filter, given by

H(z) = 1 - a z^(-1), with a typically between 0.9 and 1.0 (e.g., a = 0.95).
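As a concrete illustration of the start-detection and preemphasis steps, the following Python sketch implements both. The preemphasis constant a = 0.95 and the energy and zero-crossing thresholds are illustrative assumptions that would need tuning against real recordings:

```python
import math

def preemphasize(samples, a=0.95):
    """First-order preemphasis y[n] = x[n] - a*x[n-1]; a = 0.95 is an assumed value."""
    return [samples[0]] + [samples[n] - a * samples[n - 1]
                           for n in range(1, len(samples))]

def is_speech(window, energy_threshold=0.01, zc_threshold=10):
    """Flag a sliding window as speech when both the average energy and the
    zero-crossing count exceed their (illustrative) thresholds."""
    energy = sum(s * s for s in window) / len(window)
    zero_crossings = sum(1 for n in range(1, len(window))
                         if (window[n - 1] < 0) != (window[n] < 0))
    return energy > energy_threshold and zero_crossings > zc_threshold
```

With these example thresholds, a 200-sample window of a 440 Hz tone sampled at 8 kHz trips both tests, while a window of silence trips neither.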

The prediction coefficients are obtained from the autocorrelation sequence by an efficient procedure known as the Levinson-Durbin recursion. It takes the autocorrelation sequence as its input and produces the coefficients a[k]. The set of prediction coefficients is then usually converted to the so-called Linear Predictive Cepstral Coefficients (LPCC). The relationship between the cepstrum coefficients c_n and the prediction coefficients a[k] is given by

c_1 = a_1
c_n = a_n + sum_{k=1..n-1} (k/n) c_k a_(n-k), for 1 < n <= p

where p is the prediction order.
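A pure-Python sketch of this step follows (the on-chip version would be written in C; the conventions assumed here are the predictor polynomial A(z) = 1 - sum a_k z^(-k) and the cepstral recursion above):

```python
def levinson_durbin(r, order):
    """Levinson-Durbin recursion: solve the LPC normal equations from the
    autocorrelation values r[0..order]; returns a[1..order] as a list."""
    a = [0.0] * (order + 1)
    e = r[0]  # prediction error, initialized to R(0)
    for i in range(1, order + 1):
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / e
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]  # update previous coefficients
        a = new_a
        e *= 1.0 - k * k  # shrink the prediction error at each order
    return a[1:]

def lpc_to_cepstrum(a, n_ceps):
    """Convert prediction coefficients a[1..p] into cepstral coefficients
    c[1..n_ceps] using c_n = a_n + sum_{k=1}^{n-1} (k/n) c_k a_{n-k}."""
    p = len(a)
    c = []
    for n in range(1, n_ceps + 1):
        c_n = a[n - 1] if n <= p else 0.0
        c_n += sum((k / n) * c[k - 1] * (a[n - k - 1] if 1 <= n - k <= p else 0.0)
                   for k in range(1, n))
        c.append(c_n)
    return c
```

As a sanity check, for r = [1, 0.5, 0.25] (a first-order process with coefficient 0.5) the recursion returns a = [0.5, 0.0], and the cepstrum of a single coefficient a_1 = 0.5 is c_n = 0.5^n / n.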

2.3. Template Matching
After feature extraction of a spoken word, we get a frame-wise sequence of feature vectors. The next step is to compare it with the set of stored templates for the current user. A popular technique called Dynamic Time Warping (DTW) is used here; it warps the time axis to detect the best match between two given sequences. For the spoken word S, let S_ik denote the kth coefficient of the ith frame. The DTW comparison of S and a template word T starts with the calculation of a 20x20 Local Distance Matrix, where each entry LD_ij is given by

LD_ij = sum_k (S_ik - T_jk)^2
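A minimal DTW sketch in Python: it fills the full dynamic-programming matrix with the usual three-way recurrence. The squared-Euclidean local distance and the absence of path constraints are simplifying assumptions, and the matrix size follows the input lengths rather than being fixed at 20x20:

```python
def dtw_distance(s, t):
    """DTW distance between two sequences of equal-dimension feature vectors,
    using the recurrence D[i][j] = LD(i,j) + min(D[i-1][j], D[i][j-1], D[i-1][j-1])."""
    n, m = len(s), len(t)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # squared-Euclidean local distance between frame i of s and frame j of t
            local = sum((a - b) ** 2 for a, b in zip(s[i - 1], t[j - 1]))
            D[i][j] = local + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]
```

Two identical sequences have distance zero, and so does a sequence that merely repeats one of its frames: that is exactly the time-warping behavior being exploited.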

2.2.4. Framing and Windowing. The input speech is divided into N overlapping frames. Each frame is then processed through a window to minimize the discontinuities at its beginning and end. The Hamming window function, commonly used for this purpose, is given by

w(n) = 0.54 - 0.46 cos(2*pi*n/(M-1)), 0 <= n <= M-1

where M is the frame length in samples.
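The framing and windowing step can be sketched as follows; a 240-sample frame with an 80-sample hop (30 ms frames every 10 ms at 8 kHz) is an illustrative choice, not one fixed by the paper:

```python
import math

def hamming(M):
    """Hamming window w(n) = 0.54 - 0.46*cos(2*pi*n/(M-1)) for n = 0..M-1."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (M - 1)) for n in range(M)]

def frame_signal(samples, frame_len=240, hop=80):
    """Split the signal into overlapping frames and apply the Hamming window."""
    w = hamming(frame_len)
    return [[x * wn for x, wn in zip(samples[start:start + frame_len], w)]
            for start in range(0, len(samples) - frame_len + 1, hop)]
```

The window tapers from 0.08 at both edges to 1.0 at the center, which is what suppresses the frame-boundary discontinuities.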

The local distance LD_ij is thus the vector distance between the corresponding coefficients of the ith frame of the spoken word and the jth frame of the template word. A DTW distance is calculated between the spoken word and each of the template words, and the template word with the least distance is taken as the correct match.

3. HARDWARE

2.2.5. Recognition Parameter Calculation. There are two different sets of recognition parameters: the autocorrelation values and the Linear Predictive Coefficients. Autocorrelation is, just as it sounds, a measure of how strongly correlated a signal is with a copy of itself shifted by k samples, and is given by

R(k) = sum_{n=0..N-1-k} s(n) s(n+k)
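As a direct sketch of the autocorrelation computation over one frame (for a pth-order LPC analysis, the values R(0)..R(p) are needed):

```python
def autocorrelation(frame, max_lag):
    """Short-time autocorrelation R(k) = sum_n s[n]*s[n+k] for k = 0..max_lag."""
    N = len(frame)
    return [sum(frame[n] * frame[n + k] for n in range(N - k))
            for k in range(max_lag + 1)]
```

For example, a frame of three identical unit samples gives R = [3, 2, 1]: the overlap shrinks by one sample per lag.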

The software is proposed to be implemented on a DSK/EVM board built around the TI TMS320C6X DSP, which has an on-board 16-bit stereo codec with A/D conversion capability. The software tool used to generate executable program files for the TMS320C6X processor is called Code Composer Studio (CCS).

The DSK board can be connected to a PC through its parallel port.

4. SOFTWARE OPTIMIZATIONS
The code is developed using CCS, which includes an integrated editor for editing both C and assembly files. The techniques which can be used for generating the code are explained in the following sections.

4.1. Hardware API
The DSP support software contains C functions for accessing and setting hardware features such as data acquisition, interrupt management, and other peripherals. The codec library contains an API (application programmer interface) that can be used to handle the operations listed above.

4.2. Initialization of EVM/DSK and Codec
In writing the program for sampling the speech signal, several initializations have to be performed, such as initialization of the EVM/DSK and adjustment of the codec sampling rate. The API functions can be used to make all of these adjustments.

4.3. Interrupt Service Routine
On the DSP, the processing of samples can be done within an ISR (interrupt service routine). Once the initializations have been done, an interrupt needs to be assigned to halt the processor and jump to the defined interrupt service routine. The interrupt occurs when a new data (speech) sample arrives at the serial port; the resulting branch to the ISR is used to process the speech sample and send it back out. Since the CPU otherwise does nothing but wait for new data, an infinite loop is set in the main program to keep it running.

4.4. Frame Processing
In our case, frames of speech data are handled by a mechanism called triple buffering. While samples of the current frame are being collected by the CPU into an input array via the ISR, samples of the previous frame in an intermediate array can be processed during the time left between samples. At the same time, DMA (direct memory access) can be used to move processed data samples out of
the output array. At the end of each frame, the roles of these three arrays are interchanged. The programs for the different algorithms discussed previously are executed here. The ability of CCS to display data graphically provides useful feedback about the behavior of a program.

5. RESULT
To test the performance of the system, a group of users with different sets of words can be employed. Both male and female subjects were used, and words were chosen which are well separated in the frequency domain.

6. CONCLUSION AND APPLICATIONS
This paper discussed the implementation of a PC-independent speech recognition system. The algorithm may provide sufficient recognition accuracy. The DSK/EVM board can be custom developed, which may give a flexible design that can also be used as a general-purpose prototype board. The start detection technique and the optimized speech recognition algorithm can be easily ported to a variety of embedded platforms. Some of the typical applications where the system may find its place are as follows: a voice-enabled interaction device for physically handicapped persons; low-cost voice-enabled switches for households; and units fitted into automobiles or aeroplanes that could serve as a voice-activated interface.
