
www.tjprc.org editor@tjprc.org
International Journal of Computer Science Engineering
and Information Technology Research (IJCSEITR)
ISSN(P): 2249-6831; ISSN(E): 2249-7943
Vol. 4, Issue 2, Apr 2014, 81-86
TJPRC Pvt. Ltd.

ISOLATED WORD SPEECH RECOGNITION SYSTEM USING HTK
SHANTHI THERESE S¹ & CHELPA LINGAM²

¹Department of Information Technology, Thadomal Shahani Engineering College, Bandra, Mumbai, Maharashtra, India
²Pillai HOC College of Engineering & Technology, Rasayani, Maharashtra, India

ABSTRACT
This paper proposes an isolated word speech recognition system for the Tamil language based on the Hidden Markov
Model (HMM) approach. Mel Frequency Cepstral Coefficients (MFCC), a widely used feature extraction technique, are
used to extract the acoustic features of the speech database for training. A triphone-based acoustic model is chosen to
recognize the given Tamil digits. The system is a small-vocabulary system trained to identify the Tamil digits from one to ten.
Result analysis with HTK (HMM Tool Kit) shows 90% recognition accuracy at the word level.
KEYWORDS: Hidden Markov Models, Isolated Word Speech Recognition, MFCC, Tamil Language
1. INTRODUCTION
Tamil script is a phonemic script. A phoneme is a fundamental unit of a language. Tamil grammar includes 18
consonants, 12 vowels and 1 Aytham. An allophone is a variant sound of a standard phoneme. Including all allophones,
the Tamil language has almost 100 sounds. Tamil has the greatest geographical spread and the richest and most ancient
literature of all the Dravidian languages. Tamil is one of the longest surviving classical languages in the world [1].
It has been described as the only language of contemporary India that is recognizably continuous with a classical
past, and as having one of the richest literatures in the world [2]. In the proposed system, the digits from one to ten in the
Tamil language are used to train the system for isolated word recognition. The proposed system can be used to dictate to
a machine in the Tamil language.
The words for the Tamil digits from one to ten are recorded. The speech recognition engine is built using HTK
(Hidden Markov Model Tool Kit). HTK is a speech processing toolkit that is mainly used for building speech engines.
The area of speech recognition is very complex because of the diversity of languages and the variability in the speech
characteristics of different users [3].
This paper is organized as follows. Section 2 describes the basic Automatic Speech Recognition system and its
classification. Section 3 elaborates the steps HTK follows in building any speech engine. Section 4 explains how the
proposed Isolated Word Recognition system is constructed. Section 5 discusses the analysis of the results, and Section 6
indicates possible further advancement of the proposed work.
2. AUTOMATIC SPEECH RECOGNITION (ASR) AND ITS CLASSIFICATION
ASR is the process of building a system that maps the acoustic features of a speech signal to the equivalent string of
words. ASR can be classified as Isolated, Connected, Continuous or Spontaneous based on the type of speech utterance.
Based on the speaker model, ASR can be classified as Speaker Dependent or Speaker Independent. Based on the size of the
vocabulary used, ASR can be classified as Small, Medium, Large, Very Large or Out-of-Vocabulary systems [4].

Impact Factor (JCC): 6.8785 Index Copernicus Value (ICV): 3.0
The front end of a speech recognizer is a feature extraction step. It analyzes the given speech signal and
extracts the acoustic features. Various feature extraction techniques exist, such as MFCC, Linear Predictive
Coding (LPC) and Perceptual Linear Predictive (PLP) analysis.

Figure 1: Overview of ASR
3. HTK AND AUTOMATIC SPEECH RECOGNITION
HTK is a software toolkit for building and manipulating systems that use continuous density Hidden Markov
Models. It was developed by the Speech Group of the Cambridge University Engineering Department. HTK includes a
software library as well as a number of tools (programs) that perform tasks such as coding data, various styles of HMM
training including embedded Baum-Welch re-estimation, Viterbi decoding, results analysis and editing of HMM
definitions [5]. HTK is primarily used for speech recognition research, but it has also been used for numerous other
applications, including research into speech synthesis, character recognition and DNA sequencing.
HTK includes four important processing stages: 1. Data Preparation, 2. Training, 3. Testing/Recognition and
4. Analysis.
3.1 Data Preparation
A set of speech data files is recorded in the .wav file format, together with the associated transcriptions. The speech
data files are converted into an appropriate acoustic feature format (MFCC, LPC). The HCopy tool in HTK is used to
parameterize the speech waveforms into a variety of acoustic feature formats by setting the appropriate configuration
variables [6].

The various feature formats are listed below.
LPC - Linear prediction filter coefficients
LPCEPSTRA - LPC cepstral coefficients
MFCC - Mel frequency cepstral coefficients
DISCRETE - Vector quantized data

Figure 2: MFCC-Based Front End Processor
The most common features used for ASR are MFCCs, because they approximate the response of the human ear [7].

The computational steps of MFCC include:
Framing: The signal is segmented into short frames, each 20 to 40 ms long.
Windowing: Each frame is multiplied by a window function.
Extracting: A vector of 12 coefficients is extracted from each windowed frame.
The feature vector comprises 39 coefficients: 12 MFCC, 12 ΔMFCC, 12 ΔΔMFCC, 1 energy,
1 Δenergy and 1 ΔΔenergy.
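The framing and windowing steps, and the way delta coefficients extend a static vector to 39 dimensions, can be sketched in plain Python. This is a minimal illustration under stated assumptions, not HTK's implementation: the function names, the toy sine signal, the 25 ms frame with a 10 ms shift, the Hamming window and the regression window of two frames are all choices made for the example, and the MFCC computation itself (filterbank and cepstral transform) is replaced by placeholder vectors.

```python
import math

def frame_signal(samples, frame_len, frame_shift):
    """Split samples into overlapping frames; a trailing partial frame is dropped."""
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += frame_shift
    return frames

def hamming(n):
    """Hamming window of length n."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def deltas(features, theta=2):
    """Regression-based deltas over +/- theta frames; edges reuse the first/last frame."""
    denom = 2 * sum(t * t for t in range(1, theta + 1))
    last = len(features) - 1
    out = []
    for t in range(len(features)):
        vec = []
        for k in range(len(features[t])):
            num = 0.0
            for th in range(1, theta + 1):
                nxt = features[min(t + th, last)][k]
                prv = features[max(t - th, 0)][k]
                num += th * (nxt - prv)
            vec.append(num / denom)
        out.append(vec)
    return out

# Toy input (assumption): one second of a 440 Hz sine sampled at 16 kHz.
sample_rate = 16000
signal = [math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(sample_rate)]

frame_len = sample_rate * 25 // 1000    # 25 ms -> 400 samples
frame_shift = sample_rate * 10 // 1000  # 10 ms -> 160 samples
window = hamming(frame_len)
frames = [[s * w for s, w in zip(f, window)]
          for f in frame_signal(signal, frame_len, frame_shift)]

# Placeholder static features (assumption): 12 MFCC + 1 energy per frame.
# A real front end would compute these from each windowed frame.
static = [[float(t)] * 13 for t in range(len(frames))]
delta = deltas(static)
accel = deltas(delta)
full = [s + d + a for s, d, a in zip(static, delta, accel)]  # 13 + 13 + 13 values
```

Appending first and second differences to the 13 static values is what takes the vector from 13 to 39 coefficients per frame.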
3.2 Training Phase
In this phase, the topology required for each HMM is defined. HTK allows HMMs to be built using any desired
topology.
In isolated word training, HInit reads in all of the training data and cuts out all of the examples of a specific
phone. It then iteratively computes an initial set of parameter values using the segmental k-means training procedure.
In this training phase, the Baum-Welch re-estimation procedure is used.
Flat start training can be used if the subword boundary information is not available. In this case, all of the phone
models are initialized to be identical, with state means and variances equal to the global speech mean and variance.
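The idea behind flat start initialization can be illustrated with a short Python sketch. It computes the per-dimension global mean and variance over all training vectors and copies them into every state of every phone model. The phone labels, the three-state layout, the dictionary representation and the two-dimensional toy data are assumptions made for the example; in HTK itself this role is played by the HCompV tool.

```python
def global_mean_var(vectors):
    """Per-dimension mean and (population) variance over all feature vectors."""
    n = len(vectors)
    dim = len(vectors[0])
    mean = [sum(v[k] for v in vectors) / n for k in range(dim)]
    var = [sum((v[k] - mean[k]) ** 2 for v in vectors) / n for k in range(dim)]
    return mean, var

def flat_start(phones, vectors, num_states=3):
    """Give every state of every phone model the same global mean and variance."""
    mean, var = global_mean_var(vectors)
    return {
        phone: [{"mean": list(mean), "var": list(var)} for _ in range(num_states)]
        for phone in phones
    }

# Toy 2-dimensional training vectors (assumption); real vectors would be 39-dimensional.
training_vectors = [[1.0, 2.0], [3.0, 6.0], [5.0, 10.0]]

# Hypothetical phone labels, chosen only for illustration.
models = flat_start(["o", "n", "r", "u"], training_vectors)
```

Because every model starts out identical, it is the subsequent re-estimation passes that make the phone models diverge from one another.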

Figure 3: Training Phase Using HTK
Embedded training uses HERest, which performs a single Baum-Welch re-estimation of the whole set of HMMs
simultaneously.
3.3 Recognition Phase
In the recognition phase, HVite performs Viterbi-based speech recognition.

Figure 4: Recognition Phase Using HTK
HVite takes as inputs a network describing the allowable word sequences, a dictionary defining how each word is
pronounced, and a set of HMMs. The pronunciation dictionary, also called the lexicon, contains the details of how the
words are pronounced [8]. The word network needed to drive HVite is usually a simple word loop in which any
word can follow any other word.
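For isolated words, Viterbi-based recognition amounts to scoring the observation sequence against each word model and returning the best-scoring word. The Python sketch below shows this under strong simplifying assumptions: it uses two-state HMMs with discrete observation symbols rather than the continuous density models HTK uses, and the two word models and their probabilities are invented for the example.

```python
import math

def safe_log(p):
    """Natural log that maps probability 0 to -infinity instead of raising."""
    return math.log(p) if p > 0 else float("-inf")

def viterbi_log_score(obs, start, trans, emit):
    """Log-probability of the single best state path through a discrete HMM."""
    n = len(start)
    score = [safe_log(start[s]) + safe_log(emit[s][obs[0]]) for s in range(n)]
    for o in obs[1:]:
        # Best predecessor for each state, then emit the current symbol.
        score = [
            max(score[p] + safe_log(trans[p][s]) for p in range(n)) + safe_log(emit[s][o])
            for s in range(n)
        ]
    return max(score)

def recognize(obs, word_models):
    """Return the word whose model gives the highest Viterbi score."""
    return max(word_models, key=lambda w: viterbi_log_score(obs, *word_models[w]))

# Toy left-to-right models (assumption): each is (start, transitions, emissions).
start = [1.0, 0.0]
trans = [[0.5, 0.5], [0.0, 1.0]]
word_models = {
    "onru":   (start, trans, [[0.9, 0.1], [0.1, 0.9]]),  # symbol 0 then symbol 1
    "erandu": (start, trans, [[0.1, 0.9], [0.9, 0.1]]),  # symbol 1 then symbol 0
}

recognized = recognize([0, 0, 1, 1], word_models)
```

Working in log probabilities avoids numerical underflow when the observation sequence is long.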
3.4 Analysis Phase
The final stage of the HTK toolkit is the analysis phase. Once the HMM-based recognizer has been built, it is
necessary to evaluate its performance by comparing the recognition results with the correct reference transcriptions.
The HResults tool of HTK is used for this purpose.
4. ISOLATED WORD RECOGNITION OF TAMIL LANGUAGE WORDS
In the proposed system, the speech recognition system is trained to identify the given Tamil digits from one to
ten. Each phone is represented by a continuous density HMM with transition probability parameters and output observation
distributions. A flat start monophone set is created in which each phone is a single-Gaussian monophone HMM with mean
and covariance equal to the mean and covariance of the training data. The Baum-Welch algorithm is used to refine the
models and perform embedded re-estimation of all parameters. The speech engine recognizes the sound given by the
user by comparing it against the acoustic models of the corresponding sounds. When a test phone matches a reference
phone, the decoder determines the phonetic equivalent of the sound. This process is repeated for all the phones in the
given word. The language model is then searched for the equivalent series of phonemes. Finally, when a match occurs,
the corresponding word is returned as the recognized word.
In the proposed system, we have used a small vocabulary comprising the Tamil numbers from one to ten, as
shown in Table 1. The following configuration parameters are used to convert the speech waveform into MFCC acoustic
feature vectors.
SOURCEFORMAT = WAV
TARGETKIND=MFCC_0_D_A
TARGETRATE=100000.0
SAVECOMPRESSED=T
SAVEWITHCRC=T
WINDOWSIZE=250000.0
USEHAMMING=T
PREEMCOEF=0.97
NUMCHANS=26
CEPLIFTER=22
NUMCEPS=12
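HTK expresses TARGETRATE and WINDOWSIZE in units of 100 ns, so the values above are easier to read once converted. The short Python sketch below does the arithmetic, assuming the 16 kHz sampling rate used for the recordings; the helper names are hypothetical.

```python
HTK_UNITS_PER_MS = 10000.0      # 1 ms = 10,000 units of 100 ns
HTK_UNITS_PER_SECOND = 1e7      # 1 s = 10,000,000 units of 100 ns

def htk_to_ms(htk_units):
    """Convert an HTK duration (100 ns units) to milliseconds."""
    return htk_units / HTK_UNITS_PER_MS

def htk_to_samples(htk_units, sample_rate_hz):
    """Convert an HTK duration (100 ns units) to a whole number of samples."""
    return round(htk_units * sample_rate_hz / HTK_UNITS_PER_SECOND)

config = {"TARGETRATE": 100000.0, "WINDOWSIZE": 250000.0}
sample_rate_hz = 16000  # assumption: matches the 16 kHz recordings

frame_shift_ms = htk_to_ms(config["TARGETRATE"])   # time between feature vectors
window_ms = htk_to_ms(config["WINDOWSIZE"])        # analysis window length
frame_shift_samples = htk_to_samples(config["TARGETRATE"], sample_rate_hz)
window_samples = htk_to_samples(config["WINDOWSIZE"], sample_rate_hz)
```

With these settings a 25 ms analysis window advances in 10 ms steps, so consecutive frames overlap by 15 ms. TARGETKIND=MFCC_0_D_A requests the static coefficients plus the zeroth cepstral coefficient, with delta and acceleration coefficients appended.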



Table 1: Tamil Digits from One to Ten
Digit  1     2       3       4        5
Word   Onru  Erandu  Moonru  Nangu    Ainthu
Digit  6     7       8       9        10
Word   Aaru  Aezhu   Ettu    Onpathu  Pathu

The data for this experiment is recorded using the speech analysis tool PRAAT [9]. The words are recorded under
noise-free conditions at a 16 kHz sampling rate on a mono channel. The word-level phone transcriptions are constructed
manually. Initial monophone models are created for all phones from the dictionary. The HMM definitions are then built
using triphones. The grammar used in the construction of the word network is shown in Figure 5.

Figure 5: Grammar Used in Isolated Word Recognition
5. EXPERIMENTAL RESULTS
The system built with HTK achieves 90% recognition accuracy at the word level. The result analysis statistics
produced by HTK are given below.
======HTK Results Analysis ================
Date: Wed Dec 26 10:39:52 2013
Ref: testref.mlf
Rec: recout.mlf
-------------- Overall Results -------------------------------
WORD: %Corr=90.00, Acc=90.00 [H=9, D=0, S=1, I=0, N=10]
The performance of the proposed system is measured in terms of Word Error Rate (WER).
WER is given by
$$\mathrm{WER} = \frac{S + D + I}{N} \qquad (1)$$
where S is the number of substitutions, D is the number of deletions, I is the number of insertions and N is the
number of words in the reference.
Substitution (S) errors are obtained by counting the words that were recognized as other words.
Deletion (D) errors are obtained by counting the words that were omitted in recognition.
Insertion (I) errors are obtained by counting the words that were wrongly added to the recognized words.
Word Accuracy (WA) is the ratio of correctly recognized words to the total number of words.
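As a worked check, the counts reported in the HTK output (H=9, D=0, S=1, I=0, N=10) can be plugged into these definitions. The Python sketch below also computes the two percentages on the WORD line in the way HResults defines them, %Corr = 100·H/N and Acc = 100·(H − I)/N; the function name is hypothetical.

```python
def word_scores(h, d, s, i, n):
    """Return (WER, %Corr, Acc) from hit, deletion, substitution and insertion counts."""
    wer = (s + d + i) / n           # Equation (1)
    corr = 100.0 * h / n            # percentage of reference words recognized correctly
    acc = 100.0 * (h - i) / n       # like %Corr, but also penalizes insertions
    return wer, corr, acc

# Counts taken from the WORD line of the results above.
wer, corr, acc = word_scores(h=9, d=0, s=1, i=0, n=10)
```

With no insertions the two percentages coincide at 90%, and the single substitution gives a WER of 0.1, that is, 10%.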
6. FUTURE WORK AND CONCLUSIONS
The proposed isolated word recognition system can be further extended to speaker-independent recognition.
The vocabulary can be extended from small to medium or large size. Connected or continuous speech systems can be
developed for Tamil language applications. The robustness of the system can be enhanced by applying noise removal
techniques. Speaker identification in the Tamil language is another promising area, and could encourage users who
communicate only in Tamil.
REFERENCES
1. V. Radha, C. Vimala and M. Krishnaveni, Continuous Speech Recognition System for Tamil Language Using
Monophone-based Hidden Markov Model, CCSEIT-12, October 26-28, 2012, Coimbatore, Tamil Nadu, India.
2. R. Thangarajan, A.M. Natarajan and M. Selvam, Phoneme Based Approach in Medium Vocabulary Continuous
Speech Recognition in Tamil, International Journal of Computer Processing of Oriental Languages, Chinese
Language Computer Society & World Scientific Publishing Company.
3. M. Gales and S. Young, 2008, The Application of Hidden Markov Models in Speech Recognition, Foundations
and Trends in Signal Processing, Vol. 1, No. 3 (2007), 195-304.
4. C. Vimala and V. Radha, 2012, A Review on Speech Recognition Challenges and Approaches,
World of Computer Science and Information Technology Journal (WCSIT), ISSN: 2221-0741, Vol. 2, No. 1, 1-7.
5. P.C. Woodland, J.J. Odell, V. Valtchev and S.J. Young, Large Vocabulary Continuous Speech Recognition Using
HTK, Acoustics, Speech and Signal Processing, ICASSP-1994.
6. S.J. Young, P.C. Woodland and W.J. Byrne, The HTK Book (for HTK Version 3.4), Entropic Ltd., Jan. 1999.
7. Amarin Deemagarn and Asanee Kawtrakul, 2004, Thai Connected Digit Speech Recognition Using Hidden Markov
Models, SPECOM 2004: 9th Conference Speech and Computer, St. Petersburg, Russia, September 20-22.
8. C. Vimala and V. Radha, 2011, Speaker Independent Isolated Speech Recognition System for Tamil Language
using HMM, International Conference on Communication Technology and System Design 2011, published by
Elsevier Ltd.
9. http://www.fon.hum.uva.nl/praat/ (PRAAT, used for speech recording)
