
SPEECH RECOGNITION

By

Mital A. Gandhi Brian T. Romanowski

ECE 345, SENIOR DESIGN PROJECT FALL 2002

TA: Paul Leisher

December 10, 2002

Project No. 22

ABSTRACT

As portable devices grow in popularity, designers may increasingly turn toward speech recognition as a solution to user interface requirements. This project implements isolated word recognition in a flexible and extensible system. It is oriented towards portable applications and optimized mainly for execution speed. Hidden Markov models (HMMs) are used with a Viterbi recognition algorithm on a floating-point TI 67x series digital signal processor (DSP) platform. The Hidden Markov Model Toolkit (HTK) is used to train models and, optionally, to perform recognition. A Matlab framework allows inspection and manipulation of the various algorithms and data streams. In addition, a hardware component of the project, the volume box, displays a bar-graph readout of the system's input volume.


TABLE OF CONTENTS

ABSTRACT
1. INTRODUCTION
   1.1 Purpose
   1.2 Functionality and Specifications
   1.3 Subprojects
2. DESIGN PROCEDURE
   2.1 Design Decisions
       2.1.1 Design Alternatives
       2.1.2 System Layout
   2.2 Tools Used
   2.3 Theory
       2.3.1 Hidden Markov Models
       2.3.2 Mel-Frequency Cepstral Coefficients
       2.3.3 Viterbi Algorithm
3. DESIGN DETAILS
   3.1 Components
       3.1.1 Volume Box
       3.1.2 DSP
       3.1.3 Hidden Markov Modeling Toolkit (HTK)
4. DESIGN VERIFICATION
   4.1 Testing
       4.1.1 Volume Box
       4.1.2 DSP
       4.1.3 HTK
   4.2 Conclusions
5. COST
   5.1 Parts
   5.2 Labor
6. CONCLUSIONS
   6.1 Accomplishments
   6.2 Uncertainties
   6.3 Future Possibilities
APPENDIX A: Sample External Files Used in HTK
APPENDIX B: Volume Box Schematics
REFERENCES
1. INTRODUCTION

1.1 Purpose

The field of speech recognition has been growing in popularity for various applications. Embedding recognition in a product allows a unique level of hands-free and intuitive user interaction. Popular applications include automated dictation and command interfaces. Our main goal was to implement a system that can perform relatively accurate transcription of speech, and in particular, isolated word recognition. The system was developed so that its components could be utilized, or at least easily adapted, for diverse applications. Our interests in the areas of speech recognition and human-computer intelligent interaction systems encouraged us to collaborate on this project. The various phases of the project led to an in-depth understanding of the theory and implementation issues of speech recognition, while bringing us into closer contact with the speech recognition community at UIUC.

1.2 Functionality and Specifications

The primary function of the system is to recognize isolated words spoken into a microphone. In general, such recognition may be performed in many forms, varying with factors such as the platform used to recognize the speech or the mode of the input data. Our system realizes several of these forms:

- Recognize pre-recorded speech data on a computer
- Recognize live audio input on a computer
- Recognize (on a computer) speech data processed by a DSP
- Recognize live speech input on a DSP

Another feature the system implements is a visual display of the voltage generated by the microphone (due to sound energy). This volume box is helpful for judging the input intensity: whether it crosses the necessary voltage threshold, or whether it is too high (into the distortion levels). The following were the expected specifications for this project, as stated in our proposal:

Qualitative / Visual
- The VI Box is easily used, visually pleasing, and informative

Quantitative / Technical
- Target recognition rate for the DSP is at least 50% for a four-word vocabulary.
- Target recognition rate for HTK is at least 60% for a ten- to twelve-word vocabulary.
- Time from speech to DSP recognition is under 10 s.

1.3 Subprojects

The entire system (to be presented in detail below) was divided into three semi-independent components: the volume box circuit, the Hidden Markov Modeling ToolKit (HTK) on the computer, and the DSP. Of these, the DSP and HTK portions handled the various recognitions performed by the system. For HTK, the tasks involved were to learn and set up a procedure to train models for speech data, to be used for recognition both on the computer and on the DSP. The procedures for recognition on the computer also had to be developed in HTK. On the DSP, the major task was to implement a recognition algorithm that would recognize speech input from its onboard microphones. Recognition on the DSP allowed us to gain experience in the implementation of a recognizer, while moving the project towards portability (at the potential cost of speed and accuracy). The volume box involved a fair amount of hardware circuit design, requiring attention to analog and digital components such as an amplifier, comparators, and inverting TTL chips driving an LED array.

2. DESIGN PROCEDURE

2.1 Design Decisions

2.1.1 Design Alternatives

The categories in which alternative design decisions exist are hardware, theory, and development.

The hardware for this project, the Texas Instruments TMS320C6711 Digital Signal Processor (DSP) Development Board (DSK), was chosen because of its availability, low cost, development environment, and familiarity. Other options included an embedded microprocessor (a stand-alone 386 board running embedded Linux), an embedded microcontroller (the Motorola HC12), or a fixed-point DSP (the TI 5x series). The fixed-point DSP is attractive from a price standpoint, while the stand-alone microprocessor would provide the best development environment and future extendibility. The embedded microcontroller is possibly the worst choice; its main benefit would be cost.

The volume box was designed to fulfill the hardware requirements of this course. Alternatives include using a packaged analog-to-digital converter to replace the voltage reference and analog compare circuitry, or simply outputting the bar graph display directly from the DSP's digital outputs.

The theory that dictated the use of hidden Markov models (HMMs) was a natural choice. The majority of existing, functional speech recognition is performed by this method; it is well researched, and it works for the kind of recognition required. Other possibilities include Gaussian multi-mixtures, whose disadvantage is that they do not take into account the non-stationary aspects of speech. Neural networks are a promising technology, but they require large amounts of training data. A method that uses pitch, energy, and/or linear prediction coefficients in conjunction with dynamic time warping would not be robust enough for our project goals.

Development of this system can be split into the areas of data and code management. Data management, dealing with sound and HMM information, was accomplished with the Hidden Markov Modeling Toolkit (HTK) and Matlab. Another option researched was Carnegie Mellon University's SPHINX speech recognizer, but HTK was familiar and seen as a more general tool. Code management was accomplished through Texas Instruments' Code Composer software, which provided an integrated development environment with no recognized rival. The Matlab portion of this project was not strictly necessary. However, this project was implemented purely from theory, and Matlab eased the transition into the embedded environment by providing data visualizations and algorithmic-level debugging options not available on the DSP. Alternatives to Matlab could be Mathematica or Octave, but the ease, familiarity, and code-level similarity to C of Matlab suggested its use.

2.1.2 System Layout

As seen in Figure 2.1, the speech recognition system can be thought of as two subsystems: a training system and a recognition system. The training system accomplishes the recording of data, the training of a word-level HMM, and the encoding of this HMM into a C-style #include file. The recognition system accomplishes the recording of data, feature extraction, recognition, and display of results, as well as the display of the input speech volume.

2.2 Tools Used

The important tools used in this project include MathWorks' Matlab 6.1, Texas Instruments' Code Composer 2.10, and Orcad's Capture 9.2.1. The first two products were discussed in Section 2.1.1. Capture is a circuit schematic drafting program in which we designed the volume box circuitry.

2.3 Theory

2.3.1 Hidden Markov Models

Figure 2.2: A hidden Markov model [1]

Hidden Markov models (HMMs) are the central idea in this project, and a thorough understanding of the theory is imperative. The following describes the general construction of HMM theory.

An HMM is a powerful tool used to model semi-stationary random processes. In speech production, the changing voicing status and vocal tract shape are used to communicate words. Because the voicing function can be approximated by an impulse and the vocal tract by a filter, the information needed to characterize a fixed vocal tract position during one pitch period exists (using, for example, a linear prediction algorithm). The vocal tract changes shape slowly enough that it can usually be considered stationary during any particular frame.

The HMM mirrors this physical process. A family of similar fixed positions may correspond to a state in our model and to a phoneme (basic unit of speech) in reality. At each time step, there is a probability that the speaker has started voicing the next phoneme in the expressed word. This probability is termed the transition probability, and is usually represented as a square matrix covering every possible state-to-state transition. A general Markov process is one where the current state depends only on a certain number of past states, so for a left-to-right speech model the transition matrix should be zero below the diagonal (for example, there may be a transition from state two to state three, but not from three back to two).

The states can be imagined to correspond to the vocal tract configuration, yet the only available measurement is the speech audio waveform. This is the "hidden" aspect of the modeling. Each state has a probability distribution function (PDF) that gives the probability of that state producing a particular input observation (Equation 2-1). Diagonal-covariance multi-mixture Gaussians of the form of Equation 2-2 are usually used for a particular state's PDF.

$b_j(\mathbf{o}_t) = \sum_{m=1}^{M} c_{jm}\,\mathcal{N}(\mathbf{o}_t;\,\boldsymbol{\mu}_{jm},\boldsymbol{\Sigma}_{jm})$, state output PDF, Equation (2-1), [1]

$\mathcal{N}(\mathbf{o};\,\boldsymbol{\mu},\boldsymbol{\Sigma}) = \frac{1}{\sqrt{(2\pi)^n\,\lvert\boldsymbol{\Sigma}\rvert}}\, e^{-\frac{1}{2}(\mathbf{o}-\boldsymbol{\mu})^{\mathsf{T}}\boldsymbol{\Sigma}^{-1}(\mathbf{o}-\boldsymbol{\mu})}$, multivariate Gaussian, Equation (2-2), [1]

The HMM is thus defined by the transition matrix, the means and variances of each Gaussian in each mixture of each state, and the weights on the multiple Gaussians in each state. Training of the HMM can be difficult; this project does not concern itself with training theory. Further information may be found in [1] and [3].
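To make Equations 2-1 and 2-2 concrete, the following is a minimal C sketch of evaluating one state's log output probability for a diagonal-covariance Gaussian mixture. This is an illustration of the theory, not the project's actual DSP code; the names (StatePdf, N_MIX, N_DIM, gconst) are our assumptions, with gconst standing for the precomputed constant log((2*pi)^n |Sigma|).

#include <math.h>

#define N_MIX 1    /* mixtures per state (this project used one)   */
#define N_DIM 19   /* observation vector size, as in the prototype */

typedef struct {
    float weight[N_MIX];        /* mixture weights c_m             */
    float mean[N_MIX][N_DIM];   /* Gaussian means                  */
    float var[N_MIX][N_DIM];    /* diagonal covariance entries     */
    float gconst[N_MIX];        /* precomputed log((2*pi)^n |S|)   */
} StatePdf;

/* log b_j(o) = log sum_m c_m * N(o; mu_m, Sigma_m)  (Eq. 2-1, 2-2) */
float log_state_prob(const StatePdf *s, const float *obs)
{
    float total = 0.0f;
    for (int m = 0; m < N_MIX; m++) {
        float exponent = 0.0f;
        for (int d = 0; d < N_DIM; d++) {
            float diff = obs[d] - s->mean[m][d];
            exponent += diff * diff / s->var[m][d];
        }
        /* log N(o; mu, Sigma) for a diagonal-covariance Gaussian */
        float log_gauss = -0.5f * (s->gconst[m] + exponent);
        total += s->weight[m] * expf(log_gauss);
    }
    return logf(total);
}

With a single mixture per state, as in the prototype model of Appendix A, the sum collapses to one term and the whole computation can stay in the log domain without the exp/log round trip.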

2.3.2 Mel-Frequency Cepstral Coefficients

Mel-frequency cepstral coefficients (MFCCs) are used in speech recognition because they provide a decorrelated, perceptually oriented observation vector in the cepstral domain. In order to perform filtering in the time domain, one may take the discrete Fourier transform (DFT) of a time sequence and of a filter's impulse response, and multiply them in the spectral domain. Similarly, if one wishes to filter a spectrum, one can take the DFT of the spectrum and of the filter, and perform a multiplication in the cepstral domain. This kind of filtering (termed liftering) can provide channel and speaker independence.
$c'_n = \left(1 + \frac{L}{2}\sin\frac{\pi n}{L}\right) c_n$, the lifter, Equation (2-3), [1]

When a discrete cosine transform (DCT) (Equation 2-4) is used in place of the DFT, the resulting cepstral coefficients are decorrelated.

$c_i = \sqrt{\frac{2}{N}} \sum_{j=1}^{N} m_j \cos\!\left(\frac{\pi i}{N}\,(j - 0.5)\right)$, the discrete cosine transform, Equation (2-4), [1]

A perceptual filter bank (a Mel-scale filter bank; see Equation 2-5 and Figure 2.3) has been used to approximate the human ear's response to speech (in an attempt to retain only relevant information). Due to the overlapping filters, data in each band is highly correlated. The diagonal covariance matrices used in the HMM output PDFs cannot represent this correlation well, resulting in the need for the DCT.

$\mathrm{Mel}(f) = 2595 \log_{10}\!\left(1 + \frac{f}{700}\right)$, relation between the Mel scale and frequency (Hz) scale, Equation (2-5), [1]

Figure 2.3: Filter bank, with filters spaced equidistantly in the Mel-frequency domain, [1]
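A hedged C sketch of the two transforms above follows, using the NUMCHANS = 32 and NUMCEPS = 15 settings from the configuration file in Appendix A. On the DSP, the cosine terms were precomputed in Matlab as tables; the direct computation shown here is only for clarity.

#include <math.h>

#define N_CHANS 32   /* filter bank channels (NUMCHANS, Appendix A) */
#define N_CEPS  15   /* cepstral coefficients kept (NUMCEPS)        */
#define PI 3.14159265358979f

/* Equation 2-5: Mel(f) = 2595 * log10(1 + f/700) */
float hz_to_mel(float hz)
{
    return 2595.0f * log10f(1.0f + hz / 700.0f);
}

/* Equation 2-4: DCT of the log filter bank energies -> cepstrum,
 * c_i = sqrt(2/N) * sum_{j=1..N} m_j * cos(pi*i*(j - 0.5)/N)     */
void dct_cepstrum(const float *log_fbank, float *ceps)
{
    const float norm = sqrtf(2.0f / (float)N_CHANS);
    for (int i = 1; i <= N_CEPS; i++) {
        float acc = 0.0f;
        for (int j = 1; j <= N_CHANS; j++)
            acc += log_fbank[j - 1] *
                   cosf(PI * (float)i * ((float)j - 0.5f) / (float)N_CHANS);
        ceps[i - 1] = norm * acc;
    }
}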

2.3.3 Viterbi Algorithm

Recognition is accomplished with HMMs by choosing the model that most likely produced the sequence of input observations (Equation 2-6).

$M^* = \arg\max_i P(M_i \mid O) = \arg\max_i P(O \mid M_i)\,P(M_i)$, maximum a posteriori model choice (maximum likelihood when the model priors are equal), Equation (2-6), [3]

To efficiently calculate this maximum likelihood, the Viterbi algorithm concerns itself with only a few of the possible ways the model could produce the sequence of observations. The algorithm can be carried out on sequentially input frames, a benefit for a real-time system. In particular, Equation (2-7) selects, for each state $j$, the predecessor state $i$ that maximizes the probability of being in state $i$ and transitioning to state $j$, weighted by the probability that state $j$ produced the input observation.

$\phi_j(t) = \max_i \left\{ \phi_i(t-1)\, a_{ij} \right\} b_j(\mathbf{o}_t)$, Viterbi algorithm, Equation (2-7), [3]
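The recursion of Equation 2-7 is usually run in the log domain, where the products become sums. The following C sketch (illustrative names, not the project's code) advances one time step; recognition then picks the model whose final phi value is largest. Omitting the backpointer bookkeeping needed to recover the best state sequence yields the "half-Viterbi" used on the DSP (Section 3.1.2).

#include <float.h>

#define N_STATES 5

/* One time step of Eq. 2-7 in the log domain:
 * phi_j(t) = max_i { phi_i(t-1) + log a_ij } + log b_j(o_t)       */
void viterbi_step(const float phi_prev[N_STATES],
                  float phi_next[N_STATES],
                  const float log_a[N_STATES][N_STATES],
                  const float log_b[N_STATES]) /* log b_j(o_t) */
{
    for (int j = 0; j < N_STATES; j++) {
        float best = -FLT_MAX;
        for (int i = 0; i < N_STATES; i++) {
            float cand = phi_prev[i] + log_a[i][j];
            if (cand > best)
                best = cand;   /* best predecessor for state j */
        }
        phi_next[j] = best + log_b[j];
    }
}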

3. DESIGN DETAILS

3.1 Components

3.1.1 Volume Box

This component is the user interface to the transcription system. The bar graph display is designed to provide visual feedback on input signal strength and quality; it visualizes an amplified version of the input on an LED array (please see Appendix B for the full schematic).

Design of the volume box involved numerous challenges and design decisions. First, the microphone input voltage was not clearly detectable using the standard measuring techniques (oscilloscope, multimeter, LEDs). Using a better microphone led to slight visible changes in the waveform on the oscilloscope. A peak detector circuit was then employed to approximately measure the maximum voltage reached (these measurements gave about 30 mV). Hence, the input needed amplification by a factor of approximately 60 in order to be compared to the reference voltages. The following amplification circuit, learned in ECE 210, was implemented.
[Schematic: non-inverting amplifier built around an LM348 op-amp, with a 20 kOhm feedback resistor and a 330 Ohm resistor to ground, giving approximately 61x amplification.]

Figure 3.1: Non-inverting Amplifier

The following equation relates the input and output voltages for this non-inverting amplifier:

Vout = Vin * (1 + R2 / R1), with R2 = 20 kOhm and R1 = 330 Ohm,

which gives a gain of 1 + 20000/330, or approximately 61. After the input was amplified, it was necessary to generate a set of reference voltages. This task was accomplished with a series resistor network, as seen in the full circuit schematic. However, connecting these networks to the comparators caused a drop in voltages due to load changes on the lines. Therefore, voltage buffers were introduced between the resistors and the comparators to maintain the voltage levels. The following diagram depicts a simple voltage buffer:

Figure 3.2: Voltage Buffer

Once the voltage buffers were introduced, analog comparators (LM393) were used to compare the amplified input voltage with the reference voltages, and to output the appropriate signals on each of the four lines connected to the LEDs. The effect was to display the proper number of LEDs, resulting in a bar graph that varies with the input voltage.

3.1.2 DSP

The DSP portion of the project deals with the DSP/HTK data interface, user interface, DSP support code, input sampling, feature extraction, recognition, and testing. Due to speed concerns, the system is oriented towards block processing of time-varying data. Every effort was made to write well-commented, maintainable code, but there are a few places where this fails: a few portions of the code have constants that are defined in multiple places, and there are undocumented features and unimplemented functionality.

DSP/HTK Interface
The DSP/HTK data interface was required for the offline HTK model training. Matlab was used as an intermediary, providing useful auditory and visual options for data verification. DSP data was written out in HTK format [1], and HTK-generated data was encapsulated in a C-style #include file.

User Interaction
The user is informed of the recognition results via the DSK's three onboard light emitting diodes (LEDs). Currently, the words "one", "two", and "three" are displayed in binary on the LEDs, and "hello" turns on every LED. It is also possible to run a Matlab script that downloads the first few recognition results from the DSP and displays them in plain text in Matlab.

Support Code
The recognition system was wrapped around existing code written for the Control Systems Lab, primarily by Texas Instruments (TI) and Dan Block. It used DSP/BIOS to handle all the DSP initialization and to create the Matlab-enabled environment. Code from a TI educational CD [2] was used for sound input from the ADC daughterboard. Public domain code from Takuya Ooura was used to compute the FFT in the feature extraction section.

Sampling
TI code [2] was used to poll the ADC daughterboard for incoming samples. Only one of the two available stereo channels was used; this stream was downsampled and tested against a threshold. If the threshold (an experimentally determined number) was passed, the current data, as well as a quantity of past data, was stored in memory. When the input data fell below the threshold for a specified amount of time, recording ended and control passed to the feature extraction routine. A sketch of this scheme appears below.

Feature Extraction
Features in this system were the commonly used Mel-frequency cepstral coefficients (MFCCs). The algorithm was outlined very clearly in [1], and the only challenge was to write a working and efficient implementation on the DSP. A difficult bug was worked out in the Matlab pre-implementation; Matlab was also used to generate pre-calculated values (a cosine table and other constant tables) that sped up the computation on the DSP. The implementation took as input a block of speech samples and output a block of MFCCs. Control was then passed to the recognition algorithm.
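The following is a hedged C sketch of the threshold-triggered recording described under Sampling above. The constants (THRESHOLD, PRE_ROLL, HANGOVER, MAX_REC) and the get_sample() wrapper around the TI ADC polling code are illustrative assumptions, not the project's measured values.

#include <stdlib.h>

#define PRE_ROLL   512    /* past samples kept before the trigger   */
#define HANGOVER   4000   /* sub-threshold samples before stopping  */
#define MAX_REC    32000  /* recording buffer size                  */
#define THRESHOLD  1500   /* experimentally determined amplitude    */

static short rec_buf[MAX_REC];

/* Assumed wrapper around the TI ADC polling code. */
extern short get_sample(void);

/* Returns the number of samples recorded into rec_buf. */
int record_utterance(void)
{
    short pre[PRE_ROLL] = {0};
    int head = 0, n = 0, quiet = 0;

    /* keep a circular buffer of recent samples until triggered */
    for (;;) {
        short s = get_sample();
        pre[head] = s;
        head = (head + 1) % PRE_ROLL;
        if (abs(s) > THRESHOLD)
            break;
    }
    /* copy the pre-trigger history into the recording buffer */
    for (int i = 0; i < PRE_ROLL; i++)
        rec_buf[n++] = pre[(head + i) % PRE_ROLL];

    /* record until the signal stays quiet for HANGOVER samples */
    while (n < MAX_REC && quiet < HANGOVER) {
        short s = get_sample();
        rec_buf[n++] = s;
        quiet = (abs(s) > THRESHOLD) ? 0 : quiet + 1;
    }
    return n;   /* rec_buf is then handed to feature extraction */
}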

Recognition
When computing the log-probability, a half-Viterbi algorithm is used (the recovery of the most-likely state sequence is omitted). This decreases the complexity significantly when compared to a forward-algorithm computation. The implementation was straightforward, with the only difficulty coming from indexing and derivation issues.

Testing

The main goal was to have significantly accurate recognition. Since HTK was used as a control/reference implementation, testing between the Matlab and HTK input/output sets occurred in the feature extraction and recognition phases. Code was examined until the data matched within acceptable precision or error limits.
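As an illustration of this cross-checking (an assumed helper, not the project's test harness), a comparison routine might flag the first element where the DSP or Matlab output diverges from the HTK reference by more than a relative tolerance:

#include <math.h>

/* Returns the index of the first out-of-tolerance element, or -1
 * if the two arrays match everywhere within the tolerance.       */
int first_mismatch(const float *dsp, const float *ref, int n, float tol)
{
    for (int i = 0; i < n; i++) {
        float err = fabsf(dsp[i] - ref[i]);
        float mag = fabsf(ref[i]) > 1.0f ? fabsf(ref[i]) : 1.0f;
        if (err / mag > tol)
            return i;
    }
    return -1;
}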

3.1.3 Hidden Markov Modeling Toolkit (HTK)

The HTK package was used extensively in accomplishing various components of our system. The entire process involving HTK is also visible in the flowcharts below (Figures 3.4 and 3.5). First of all, HTK allowed inputs to be recorded directly from a microphone using the embedded tool HSLab. Second, the tool HCopy was useful in converting the recorded speech files to parameterized form (Mel-frequency cepstral coefficients, or MFCCs). Third, training was performed solely with the HCompV and HERest tools of HTK. HCompV allowed for initialization of prototype models using average statistics computed from the entire set of training data. HERest then further refines the model parameters by performing re-estimations based on the speech files corresponding to each particular word in the supplied dictionary (please see the appendix for samples of each of the external files used in the process). Finally, HVite was the tool used in HTK to apply the Viterbi recognition algorithm to a new set of speech data files. HResults would then be used to compare the output transcriptions from HVite with a pre-established transcription to determine recognition accuracy and correctness. The following equations were used to calculate the merit percentages:

N = total number of reference labels
D = number of deletions
S = number of substitutions
I = number of insertions

%Correct = (N - D - S) / N * 100
%Accuracy = (N - D - S - I) / N * 100

Figure 3.3: Calculating Recognition Rates
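As a hypothetical worked example (not from our data): with N = 100 reference labels, D = 2 deletions, S = 3 substitutions, and I = 1 insertion, %Correct = (100 - 2 - 3)/100 = 95% and %Accuracy = (100 - 2 - 3 - 1)/100 = 94%.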


[Flowchart: the HTK word-level recognition process, summarized as the following steps.]

1. Create all necessary files manually (configuration, data file location list, prototype HMM definition, master label file, word model list, word-level dictionary, and others as necessary).
2. Record training and test data.
3. Convert the data into parametric format.
4. Initialize the HMM models (using the prototype definition plus MFCC data).
5. Create generic definitions (for each word model); also create the base variance and macros file (for the states).
6. Re-estimate the parameters for the word models using the data files (multiple iterations).
7. Perform recognition on the test data.
8. Analyze the recognition results for accuracy.

Figure 3.4: Flowchart/System Process for HTK word-level recognition


Figure 3.5: Flowchart with external file details for HTK word-level recognition phases

The commands with options that were specified while executing the various tools are summarized below (commands listed are executable in a bash shell window):

1) Create word net from the specified grammar format:
   HParse grammar.txt wdnet

2) Convert files into MFCC format:
   HCopy -T 1 -S hcopy_list.scp -C config

3) Initialize the prototype model wordmodel_proto based on all speech data:
   HCompV -C config -f 0.01 -m -S hcompv_list.scp -M hmm0 wordmodel_proto

4) Re-estimate parameters (six iterations):
   for num in {1,2,3,4,5,6}; do
       let num2=$num-1;
       HERest -C config -I word3.mlf -S hcompv_list.scp -H hmm$num2/word_macros -H hmm$num2/word_hmmdefs -M hmm$num model_list.txt;
   done;

5) Recognize new speech data (in MFCC format) listed in testdata.scp:
   HVite -H hmm6/word_macros -H hmm6/word_hmmdefs -S testdata.scp -l '*' -i recout.mlf -w wdnet -p 0.0 -s 5.0 wordlevel_dict model_list.txt

6) Analyze the output transcription file recout.mlf from the previous command:
   HResults -I word3.mlf model_list.txt recout.mlf

Figure 3.6: HTK Procedures & Commands

Additional documentation for HTK, such as the external files needed to support the various tools and commands above, has been included in Appendix A.


4. DESIGN VERIFICATION

4.1 Testing

4.1.1 Volume Box

Testing the features of the volume box was quite standard. First, the microphone input level was measured using an oscilloscope. Second, the amplification circuit was tested using sample input voltages produced by the power supply; the output of the amplifier was measured using the multimeter available in the lab kit. The voltage follower circuit was also easily tested by checking the voltage levels at its input and output with the multimeter. While the comparator caused some unexpected results at the start (such as LEDs lighting up regardless of the inputs to the comparator), similar tracing techniques and a proper understanding of the theory allowed us to debug the circuit easily. Two different configurations helped us understand the comparator's operation properly: the first was to input voltages into the comparator's (+) and (-) pins such that the net voltage was positive; the second differed so that the net voltage would be negative. After understanding that the comparator amplifies the voltage difference by up to a factor of 200,000, limited by the supply voltage, these combinations yielded the proper results (LEDs lit up at the correct inputs). The LED array and inverter IC combination circuit was tested by simply connecting ground and high to the inputs and observing for the correct outputs (on the oscilloscope and on the LED array itself). All in all, proper connections of the components (and the wires interconnecting them) led to the expected circuit outputs.

4.1.2 DSP

The main test applied to the DSP was recognition accuracy. With a four-word vocabulary and a familiar speaker, the system was able to achieve 95 percent recognition accuracy (as measured over 90 trials). In addition, the calculated recognition speed was approximately 0.8 seconds.

4.1.3 HTK

The testing procedure for the CPU/HTK components consisted of checking each tool's successful operation before using the next one. In general, each HTK tool would output error codes for the various errors possible in the process. Based on these error codes, the appropriate external file or the command form would be modified to resolve the issue. The usability or merit of the trained model files (after HERest) would generally be determined from the statistical results output at the end of the entire procedure (the accuracy and correctness percentages).

4.2 Conclusions

In general, we were quite successful in obtaining favorable results from the system. First, the volume box worked as expected (and was visually pleasing), as seen in the demonstration. The other components of our system also worked well, as can be observed from the following table summarizing the results:


Mode of Recognition       | Training data source | Test data source                  | Results (%)
--------------------------|----------------------|-----------------------------------|------------
Pre-recorded files        | Mital's voice        | Mital's voice (same set of data)  | 100%
Pre-recorded files        | Mital's voice        | Mital's voice (new set of data)   | 97%
Pre-recorded files        | Mital's voice        | Brian's voice (new set of data)   | 100%
Pre-recorded files        | Mital's voice        | Brian's voice (DSP MFCCs)         | 65%
Live audio input on HTK   | Mital's voice        | Mital's voice (live audio input)  | 83%
Live audio input on DSP   | Brian's voice        | Brian's voice (live audio input)  | 95%

Figure 4.1: Summary of recognition results


5. COST

5.1 Parts

TABLE 5.1. PART COST

Quantity | Description                        | Price per unit
---------|------------------------------------|---------------
1        | Audio Daughter Card (TMDX326040A)  | $50
1        | DSK (TMDS320006711)                | $295
1        | Volume/Interface (VI) Box          | $10
1        | Microphone                         | $20
         | Total                              | $375
         | Original Estimate                  | $715
         | Balance                            | +$340

5.2 Labor

TABLE 5.2. LABOR COSTS (adjusted cost applies a 2.5x overhead multiplier to base pay)

Employee          | Pay-per-Hour | Hours | Adjusted Cost | Original Hour Estimate | Adjusted Cost
------------------|--------------|-------|---------------|------------------------|--------------
Mital Gandhi      | $30          | 130   | $9750         | 150                    | $11250
Brian Romanowski  | $30          | 130   | $9750         | 150                    | $11250
Total             |              | 260   | $19500        | 300                    | $22500
Balance           |              |       | +$3000        |                        |

GRAND TOTAL: $19875

FINAL SURPLUS BALANCE: $3340


6. CONCLUSIONS

6.1 Accomplishments

- Accomplished speech recognition
- Created a very extensible and adaptable system for embedded speech recognition

6.2 Uncertainties

- Are there any minor bugs that degrade recognition accuracy?
- How many words can be realistically recognized?
- How does the recognition accuracy drop off as the vocabulary increases?
- What is/are the optimal feature size/components?

6.3 Future Possibilities (in order of increasing difficulty)

- Fix code so it is properly written and documented (cosmetic)
- Better user interface
- Training with multiple speakers for speaker independence
- Larger vocabulary
- Test multi-mixture Gaussians in DSP/HTK
- Real-time transmission of features to a PC
- Channel independence
- Continuous recognition
- Online/adaptive training


APPENDIX A: SAMPLE EXTERNAL FILES USED IN HTK

1) Config


# Coding parameters
TARGETKIND = MFCC_E_D_A
TARGETRATE = 100000.0
SOURCERATE = 625.0
SAVECOMPRESSED = T
SAVEWITHCRC = T
WINDOWSIZE = 250000.0
USEHAMMING = T
PREEMCOEF = 0.97
NUMCHANS = 32
CEPLIFTER = 27
NUMCEPS = 15
ENORMALISE = F

2) Prototype HMM definition file


~o <VecSize> 19 <MFCC_E_D_A>
~h "wordmodel_proto"
<BeginHMM>
<NumStates> 5
<State> 2 <NumMixes> 1
<Mixture> 1 1.0
<Mean> 19
 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
<Variance> 19
 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
<State> 3 <NumMixes> 1
<Mixture> 1 1.0
<Mean> 19
 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
<Variance> 19
 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
<State> 4 <NumMixes> 1
<Mixture> 1 1.0
<Mean> 19
 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
<Variance> 19
 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
<TransP> 5
 0.00 0.95 0.05 0.00 0.00
 0.00 0.85 0.10 0.05 0.00
 0.00 0.00 0.85 0.15 0.00
 0.00 0.00 0.00 0.90 0.10
 0.00 0.00 0.00 0.00 0.00
<EndHMM>

3) Master Label File

#!MLF!#
"*/bye0.lab"
sil
bye
sil
.
"*/bye1.lab"
sil
bye
sil
.
"*/bye2.lab"
sil
bye
sil
.
"*/bye3.lab"
sil
bye
sil
.
"*/cat0_0.lab"
sil
cat
sil
.
"*/cat1_0.lab"
sil
cat
sil
.
"*/cat2_0.lab"
sil
cat
sil
.
"*/cat3_0.lab"
sil
cat
sil
.
"*/cat4_0.lab"
sil
cat
sil
.


4) Grammar file (the recognizer will contain only the words "bye" or "cat", in the form silence-word-silence)

$word = bye | cat ;
( sil $word sil )

NOTE: The HTK Book contains other such samples and explanations of file usage and tools as well. These samples are not copied from the HTK Book, but the files are similar.


APPENDIX B: VOLUME BOX SCHEMATIC


[Full schematic: the microphone input feeds a non-inverting LM348 amplifier stage (61x, with a 20 kOhm feedback resistor and a 330 Ohm resistor to ground). A resistor ladder from the 5 V supply generates reference voltages of approximately 0.78 V, 0.5 V, 0.25 V, and 0.0 V, each buffered by an LM348 voltage follower. Four LM393 comparators compare the amplified input against these references, and TTL logic drives the LED array. Schematic drawn in Orcad Capture; sheet dated December 10, 2002.]

NOTES: 1) A single LM393 IC contains two of the above comparators.


REFERENCES

[1] HTK (Hidden Markov Model Toolkit), The HTK Book (for HTK Version 3.1), Dec. 2001.

[2] Texas Instruments, TI Educational CD.

[3] M. Hasegawa-Johnson, Lecture Notes in Speech Production, Speech Coding, and Speech Recognition, class notes, University of Illinois at Urbana-Champaign, Fall 2000.

