
Artificial Neural Network for Digit Recognition

Manoj Kumar, Hitesh Kumar, Shweta Sinha
Kamrah Institute of Information Technology, Gurgaon
Manoj.delhi24@gmail.com, hiteshkumar111@gmail.com, Meshweta_7@rediffmail.com

ABSTRACT This paper discusses the recognition of Hindi digits based on an emotion-rich small vocabulary. A feed-forward multilayer neural network is trained by the back-propagation method for speaker-independent isolated word recognition. Mel Frequency Cepstral Coefficients (MFCC) are extracted as speech features and used to train a Multi Layer Feed Forward Network (MLFFN). The same routine is applied to signals during the recognition stage, and unknown test patterns are classified to the nearest pattern. An analysis based on a varying number of hidden neurons in the network is presented. The network is trained with input waves captured in an office environment and is tested against a database created in a similar environment. It has been observed that the MLFFN works as a good classifier for the test data, and that the number of speech features extracted plays a very important role in machine recognition of isolated Hindi digits.

1 Introduction

Automatic Speech Recognition plays a very important role in the area of Human-Machine interaction. The transfer of information through speech from one person to another consists of variations in a pressure wave coming from the mouth of a speaker, which propagates through the air and reaches the ears of listeners, who decipher the wave into a received message. In computer technology, Speech Recognition refers to the recognition of human speech by computers for the performance of speaker-initiated, computer-generated functions. Speech recognition systems are usually built upon three common approaches: the acoustic-phonetic approach, the pattern recognition approach and the artificial intelligence approach [1]. The acoustic-phonetic approach attempts to decode the speech signal in a sequential manner based on knowledge of the acoustic features and the relations between acoustic features and phonetic symbols. The pattern recognition approach, on the other hand, classifies speech patterns without the explicit feature determination and segmentation of the former approach. The artificial intelligence (AI) approach forms a hybrid of the acoustic-phonetic and pattern recognition approaches. After the great success of the AI approach [2, 3, 4], it became a field of interest for many more researchers. There are many recognition systems based on different languages, often used in applications for military systems, aircraft, deaf telephony, etc. Efforts are also being made to develop such systems for the Hindi language [5]; Hindi digit recognition is one effort towards achieving this.

In this paper the application of neural networks in the pattern recognition approach is discussed. We propose the use of a multilayer feed-forward neural network, trained using the back-propagation technique, for Hindi digit recognition. The input to the training module of the system is the speech features of the digits recorded in neutral emotion. The trained network is tested against digits recorded in the same environment and emotion. The speech features extracted from the recorded digits during the training and testing phases are the Mel Frequency Cepstral Coefficients. Several networks with different structures (different numbers of hidden neurons) were trained and their performance in recognizing unknown input patterns was compared.

2 Word Recognition Methodology

2.1 Speech Signal


All signals available to us are analog, speech being one of them. Many applications are being developed that take speech as input, and they all require fast and reliable processing. Digital systems are reliable, and logic speeds are fast enough that a tremendous number of operations can be performed in very little time. To process a speech signal in a digital system we must first convert it into discrete form. Speech is a communication signal, so during processing the actual information it carries must be preserved. The other major concern is representing the signal in a convenient form, so that modifications may be made to it without destroying its contents. Representation of the speech signal in digital form is therefore the fundamental concern, and it is guided by the fundamental concept of sampling. The approach used in this paper for recognition of Hindi digits is represented as follows.
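Digitization, as described above, amounts to sampling and quantizing the pressure wave. A minimal sketch follows; the 16 kHz sampling rate and 16-bit depth are assumptions for illustration, since the paper does not state its recording parameters.

```python
import numpy as np

fs = 16000                          # samples per second (assumed rate)
t = np.arange(0, 0.025, 1 / fs)    # one 25 ms analysis window -> 400 samples
wave = 0.5 * np.sin(2 * np.pi * 440 * t)   # synthetic stand-in for a speech segment

# Quantize to 16-bit integers, as in typical WAV recordings
quantized = np.round(wave * 32767).astype(np.int16)
```

At 16 kHz, a 25 ms window therefore holds 400 discrete samples, which is the unit the later feature-extraction stages operate on.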

Speech Signal → Capturing of Speech Signal in Different Emotions → Database Creation → Feature Extraction → Training of Network for Words and Emotions → Recognition of Input Test Word in Different Emotions

Figure 1: Block Diagram for Digit Recognition Using Neural Networks

3 Speech Database
The speech recognition process requires corpora to train the system. Research [2] shows that the size of the corpora plays a very important role in the success of any such system. For proper training of the system we collected a speech database from speakers in the age group of 22 to 35 years. All speakers are female and are from the Hindi-speaking region.

Total number of speakers (speaking at different rates): 30
Vocabulary size: 10 digits (Shunya to Nau)
Repetitions: each digit spoken by every speaker 5 times in neutral emotion
Emotion under consideration for training the network: Neutral
Emotion under consideration for testing the network: Neutral
Total number of utterances in the training database: 30 * 10 * 5 = 1500
Test database: recordings of 10 speakers for every word in each of sad, surprise and neutral emotion
Number of speech features taken into consideration: 12 MFCC coefficients along with energy

4 Spectral Analysis

Speech is a non-stationary signal, so to extract spectral features of sub-phones we analyse the spectrum in successive narrow time windows of about 20-25 ms width. For reliable frequency analysis, human speech is considered to be fairly stationary over 20-25 ms time windows [8]. The analysis of each window is carried out using the Fast Fourier Transform (FFT) algorithm, which gives the intensity of several bands on the frequency scale. After digitization and quantization of the waveform, the goal is to transform the input waveform into a sequence of acoustic feature vectors, each representing the information in a small time window of the signal. Mel Frequency Cepstral Coefficients (MFCC) are the most widely used features, extracted by cepstrum analysis of the signal. MFCC are features based on human listening perception [8, 9]: since the human ear is not equally sensitive to all frequency bands, MFCC extraction attenuates the high-frequency components using the mel scale. The overall extraction process can be represented as the sequence of steps below.
Continuous Speech → Frame Blocking → Windowing → FFT → Mel-Frequency Wrapping → Cepstrum → Mel Cepstrum coefficients

Figure 2: Block diagram of an MFCC processor

To obtain MFCC features for the samples in the database, the words were divided into windows of 25 ms with a frame rate of 10 ms. The window used during the extraction process is the Hamming window. The Fast Fourier Transform was applied to the windowed data, followed by a bank of filters spaced logarithmically above 1000 Hz, to obtain the cepstrum coefficients. For training of the system, one set of input was prepared with 12 coefficients from every frame of each word, and the other set with the 12 MFCC coefficients from every frame along with one energy value.
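The extraction steps above can be sketched in Python. The 25 ms window and 10 ms frame rate follow the text; the sampling rate, FFT size and filter count are assumptions, not values stated in the paper.

```python
import numpy as np
from scipy.fftpack import dct

def mfcc_frames(signal, fs=16000, win_ms=25, hop_ms=10, n_filters=26, n_coeffs=12):
    """Sketch of MFCC extraction: 25 ms Hamming windows at a 10 ms
    frame rate, FFT power spectrum, mel filterbank, log, DCT."""
    win = int(fs * win_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    nfft = 512  # assumed FFT size
    # Frame blocking and Hamming windowing
    frames = np.array([signal[s:s + win] * np.hamming(win)
                       for s in range(0, len(signal) - win + 1, hop)])
    # Power spectrum of each windowed frame
    power = np.abs(np.fft.rfft(frames, nfft)) ** 2
    # Triangular filters: linear spacing below ~1 kHz, logarithmic above (mel scale)
    mel = lambda f: 2595 * np.log10(1 + f / 700)
    inv_mel = lambda m: 700 * (10 ** (m / 2595) - 1)
    edges = inv_mel(np.linspace(mel(0), mel(fs / 2), n_filters + 2))
    bins = np.floor((nfft + 1) * edges / fs).astype(int)
    fbank = np.zeros((n_filters, nfft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    log_energy = np.log(power @ fbank.T + 1e-10)
    # Keep coefficients 1..12, matching the paper's 12-MFCC feature set
    return dct(log_energy, type=2, axis=1, norm='ortho')[:, 1:n_coeffs + 1]
```

For one second of 16 kHz audio this yields 98 frames of 12 coefficients each, one feature vector per 10 ms step.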

Figure 3: 12 Mel Frequency Cepstral Coefficients of the word EK

5 Neural Networks in Speech Recognition

Multi-Layer Feed Forward Networks (MLFFN) are one of many different types of neural networks. They consist of a number of neurons connected together to form a network; the functionality of the network resides in the strengths, or weights, of the links between the neurons. Neural networks are useful for modelling the behaviour of real-world phenomena [6, 7]. Being able to model the behaviour of a phenomenon, a neural network can subsequently classify the different aspects of that behaviour, recognize what is going on at the moment, diagnose whether it is correct or faulty, predict what it will do next, and if necessary respond to it. This paper uses a multilayer feed-forward neural network with one hidden layer. The activation function at the hidden and output layers is the sigmoid, and the network is trained with scaled conjugate gradient back-propagation with momentum. The model can be extended to include more MFCC features for analysis. The extracted features are fed as input to the network, processed by the hidden layer, and passed to the output layer. Each neuron at the output layer corresponds to one input digit, and only one neuron is activated at a time. The overall training of the network is done in multiple epochs. The input for each frame is kept in the input file in the required format. The training targets are shown in the table, where each digit activates a different output neuron. The network can be represented as follows.
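A toy version of such a network can be sketched as below. This is not the paper's implementation: plain batch gradient descent with momentum stands in here for scaled conjugate gradient, the data are synthetic, and the hidden-layer size is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Layer sizes mirror the paper's setup: 13 inputs (12 MFCC + energy),
# one hidden layer, 10 outputs (one neuron per digit).
n_in, n_hid, n_out = 13, 20, 10
W1 = rng.normal(0, 0.5, (n_in, n_hid)); b1 = np.zeros(n_hid)
W2 = rng.normal(0, 0.5, (n_hid, n_out)); b2 = np.zeros(n_out)
V1 = np.zeros_like(W1); V2 = np.zeros_like(W2)   # momentum buffers

X = rng.normal(size=(200, n_in))                  # synthetic feature frames
T = np.eye(n_out)[rng.integers(0, n_out, 200)]    # one-hot targets

lr, mom = 0.1, 0.9
losses = []
for epoch in range(200):
    H = sigmoid(X @ W1 + b1)          # forward pass: hidden layer
    O = sigmoid(H @ W2 + b2)          # forward pass: output layer
    losses.append(np.mean((O - T) ** 2))
    dO = (O - T) * O * (1 - O)        # backprop through output sigmoid
    dH = (dO @ W2.T) * H * (1 - H)    # backprop through hidden sigmoid
    V2 = mom * V2 - lr * (H.T @ dO) / len(X)
    V1 = mom * V1 - lr * (X.T @ dH) / len(X)
    W2 += V2; b2 -= lr * dO.mean(axis=0)
    W1 += V1; b1 -= lr * dH.mean(axis=0)
```

At recognition time the predicted digit is simply the index of the most strongly activated output neuron, i.e. `np.argmax(O, axis=1)`.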
Input Layer (speech feature vector) → Hidden Layer → Output Layer (one neuron per digit)

Figure 4: Neural Network Structure for Digit Recognition

Training Target
Digit 0: 1 0 0 0 0 0 0 0 0 0
Digit 1: 0 1 0 0 0 0 0 0 0 0
Digit 2: 0 0 1 0 0 0 0 0 0 0
Digit 3: 0 0 0 1 0 0 0 0 0 0
Digit 4: 0 0 0 0 1 0 0 0 0 0
Digit 5: 0 0 0 0 0 1 0 0 0 0
Digit 6: 0 0 0 0 0 0 1 0 0 0
Digit 7: 0 0 0 0 0 0 0 1 0 0
Digit 8: 0 0 0 0 0 0 0 0 1 0
Digit 9: 0 0 0 0 0 0 0 0 0 1

Artificial Neural Network Details

Network Property                        Information
Training method                         Scaled conjugate gradient descent with momentum
Input layer neurons                     12 MFCC and energy features from short-time-duration frames
Transfer function for hidden layer      Log-sigmoid transfer function
Transfer function for output layer      Log-sigmoid transfer function
Epochs                                  1500
Learning rate                           0.01
Momentum constant                       0.9
Maximum performance parameter           0.01

Table 1: Neural Network Configuration

6 Result and Analysis


The analysis has been carried out while varying both the number of neurons in the hidden layer and the number of speech features given as input to the network during training. In the first phase, the input to the network is the 12 MFCC features of each frame of the word. In the second phase, the energy of each frame is included along with these as the 13th input. The network is trained with 75% of the training database, containing samples of digits recorded in neutral emotion; 10% of the data is used for validation to check the generalization of the network, while the remaining 15% is used to test it. Analysis is also done while varying the number of neurons in the hidden layer. The classification results are represented in terms of a confusion matrix.
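The 75/10/15 split over the 1500-utterance training database works out as follows; this is a simple check of the counts stated above.

```python
# 30 speakers x 10 digits x 5 repetitions, split 75/10/15
total = 30 * 10 * 5         # 1500 utterances in all
train = int(total * 0.75)   # used to fit the network weights
val = int(total * 0.10)     # used to check generalization during training
test = total - train - val  # held out to evaluate the trained network
print(train, val, test)     # 1125 150 225
```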

Figure 5: Training Performance with 12 MFCC vector

Figure 6: Training Performance with 12 MFCC and energy

[Confusion matrices giving per-digit recognition percentages (digits 0-9) for networks with 20 and 30 hidden neurons; the diagonal entries are the correct-classification rates.]

Table 2: Confusion Matrix for Network with 12 MFCC features as Input for training

Testing the same data on a network trained with the 12 MFCC coefficients along with energy gave better performance, with an average of up to 93%.

References

[1] Lawrence Rabiner, Biing-Hwang Juang, Fundamentals of Speech Recognition, Pearson Publications, Second Edition.
[2] H. S. Li, J. Liu, R. S. Liu, "High Performance Mandarin Digit Speech Recognition", Journal of Tsinghua University (Science and Technology), 2000.
[3] H. S. Li, M. J. Yang, R. S. Liu, "Mandarin Digital Speech Recognition Adaptive Algorithm", Journal of Circuits and Systems, Vol. 4, No. 2, 1999.
[4] Bin Lu, Jing-Jing Su, "Research on Isolated Word Speech Recognition Based on Biomimetic Pattern Recognition", International Conference on Artificial Intelligence and Computational Intelligence, IEEE Computer Society, pp. 436-439, 2009.
[5] J. Chen, K. K. Paliwal, S. Nakamura, "Cepstrum derived from differentiated power spectrum for robust speech recognition", Speech Communication, Vol. 41, pp. 469-484, 2003.
[6] Mike Schuster and Kuldip K. Paliwal, "Bidirectional Recurrent Neural Networks", IEEE Transactions on Signal Processing, Vol. 45, No. 11, pp. 2673-2681, November 1997.
[7] K. B. Khanchandani and Moiz A. Hussain, "Emotion Recognition using Multilayer Perceptron and Generalized Feed Forward Neural Network", Journal of Scientific & Industrial Research, Vol. 68, pp. 367-371, May 2009.
[8] Eric H. C. Choi, "On Compensating the Mel-Frequency Cepstral Coefficients for Noisy Speech Recognition", 29th Australasian Computer Science Conference, 2006.
[9] Lecture Notes of Summer School on ASR-10 (2010), 5th-9th Sep 2010, Osmania University, organized by IIIT Hyderabad.