

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS, VOL. 53, NO. 1, JANUARY 2006

Scalable Architecture for Word HMM-Based Speech Recognition and VLSI Implementation in Complete System
Shingo Yoshizawa, Student Member, IEEE, Naoya Wada, Student Member, IEEE,
Noboru Hayasaka, Student Member, IEEE, and Yoshikazu Miyanaga, Senior Member, IEEE

Abstract—This paper describes a scalable architecture for real-time speech recognizers based on word hidden Markov models (HMMs), which provide high recognition accuracy for word recognition tasks. However, the recognition vocabulary of such systems is small because their extremely high computational costs cause long processing times. To achieve high-speed operation, we developed a VLSI system that has a scalable architecture. The architecture effectively uses parallel computations on the word HMM structure. It can reduce processing time and/or extend the word vocabulary. To explore the practicality of our architecture, we designed and evaluated a complete system recognizer, including speech analysis and noise robustness parts, on a 0.18-µm CMOS standard cell library and a field-programmable gate array. In the CMOS standard-cell implementation, the total processing time is 56.9 µs/word at an operating frequency of 80 MHz in a single system. The recognizer gives a real-time response using an 800-word vocabulary.

Index Terms—Hidden Markov model (HMM), scalable architecture, speech recognition, VLSI implementation.

I. INTRODUCTION

HIDDEN Markov model (HMM)-based speech recognition technologies have developed considerably and can now obtain a high recognition performance. Voice dictation systems, spoken dialogue systems, and speech input interfaces are representative speech applications that use these sophisticated technologies. These developments lead us to expect speech input interfaces to be embedded in practical applications.
The development of speech input interfaces embedded in
mobile terminals requires recognition accuracy, miniaturization,
and low-power consumption. Hardware-based speech recognition systems meet these requirements. Previous research on
custom hardware described the implementation of the HMM
algorithm using application-specific integrated circuits (ASICs)
[1], [2] and field-programmable gate arrays (FPGAs) [3], [4].
Word speech recognition employs a word HMM or a phoneme
HMM in acoustic models. In particular, the word HMM is
adopted in [2], [4], and the phoneme HMM is adopted in [1], [3].
We adopted a word HMM for our hardware recognition system
that performs isolated word recognition tasks. This word HMM
accurately expresses coarticulation effects and maintains high

Manuscript received June 2, 2004; revised February 7, 2005 and June 7, 2005. This work was supported in part by the Semiconductor Technology Academic Research Center (STARC), Program 112, and by the Ministry of Education, Science, Sports and Culture under Grant B215300010, 2003. This paper was recommended by Associate Editor P. Nilsson.
The authors are with the Graduate School of Information Science and Technology, Hokkaido University, Sapporo 060-0814, Japan.
Digital Object Identifier 10.1109/TCSI.2005.854408

recognition accuracy in variable environments. For isolated word recognition, dynamic time warping (DTW) is an effective technique, particularly for speaker-dependent tasks. It is difficult to decide whether the word HMM has better recognition performance than DTW, because recognition results vary considerably according to experimental conditions. Rabiner et al. reported experimental results for isolated digit recognition [5]. They reported that DTW performs better in speaker-dependent tasks, whereas the word HMM outperforms DTW in speaker-independent tasks. In general, the word HMM is disadvantageous in terms of computation costs compared with the phoneme-level HMM algorithm. The word HMM-based recognition system has extremely high computation costs and requires a long processing time because it has to calculate the likelihood scores for all reference models. For example, 335 million arithmetic operations are required for the output probability calculation in the unpruned 800-word vocabulary task described in Section V. Word HMM applications have been developed for numerical and alphabetical recognition tasks [5], [6]. The recognition vocabulary, however, is small (50 words or fewer) because of the high computation costs. However, since the latest circuit technologies have reached an operating performance of about 10 GIPS, we believe that a word HMM-based system can deal with a middle-sized vocabulary of up to 1000 words using a dedicated hardware architecture that decreases the processing time.
In this paper, we focus on the word HMM structure and present effective parallel computations to achieve high-speed operations. We propose a new architecture based on these computations. It achieves high throughput and low-power operation. Furthermore, the proposed architecture provides scalability. To implement speech recognition systems on hardware, variable conditions, such as vocabularies, recognition rates, and types of recognition words, should be considered. Because the required computational costs vary with these conditions, the optimum number of parallel computations must be changed. Fixed circuit structures require redundant circuit resources for excessive parallel operations, or degrade system performance, such as response time and recognition accuracy, resulting in an insufficient operating performance. A scalable technique always provides optimum hardware resources that can cope with the variable conditions by making small modifications to the hardware architecture. Namely, the proposed architecture reduces the processing time and/or easily extends the word vocabulary.
Our proposed scalable architecture covers not only the HMM computations in speech recognition, but also the complete recognition system, including robust processing and



Fig. 2. Left-right HMM.

Fig. 1. Flowchart of a speech recognition system.

speech analysis processing. In related works, such as [1]–[4], the authors realized hardware architecture that used only a part of the speech recognition algorithm, e.g., the Viterbi algorithm or the output probability calculation. Our work provides unified parallel computations and scalable techniques in a word HMM-based speech recognition system, and evaluates its effectiveness by implementing the complete recognition system using CMOS technologies. We verified the complete system on an FPGA board in actual environments, such as computer rooms, offices, and exhibition halls.
II. OUTLINE OF SPEECH RECOGNITION SYSTEM

Fig. 1 shows a flowchart of a speech recognition system. This flowchart is based on our developed complete recognition system. In the speech analysis part, speech feature vectors are extracted from a time series of short-duration speech signals. Traditional speech recognition systems handed these feature vectors directly from speech analysis to speech recognition. Currently, many systems employ robust processing that removes noise interference, because the raw data are very sensitive to noise. In the robust processing part, the feature vectors are regenerated by shaping, e.g., subtracting the noise components. In the speech recognition part, the recognizer computes the likelihood scores of test utterances and finds the best match. Reference models are generated by HMM training in advance. Because the training is assumed to have been executed in software, the system does not include a training function. The complete system unifies speech analysis, robust processing, and speech recognition.
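The flow above can be sketched as a minimal three-stage pipeline. The function bodies below are illustrative stand-ins, not the paper's implementation: a real front end computes MFCCs, and the dummy scorer stands in for the HMM likelihood computation of Section III.

```python
import numpy as np

def speech_analysis(signal, frame_len=256, shift=128):
    """Stand-in front end: frame the signal and emit one feature per frame."""
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, shift)]
    # A real front end would apply a Hanning window and compute MFCCs here.
    return np.array([[np.mean(np.abs(f))] for f in frames])

def robust_processing(features):
    """Stand-in noise reduction: subtract a crude noise-floor estimate."""
    return features - features.min(axis=0)

def recognize(features, score_models):
    """Score all reference models and return the best-matching word index."""
    return int(np.argmax(score_models(features)))

# Usage with a dummy scorer standing in for the HMM likelihood computation.
dummy_scores = lambda feats: np.array([-1.0, -0.2, -3.0])  # 3 word models
feats = robust_processing(speech_analysis(np.random.randn(4000)))
best_word = recognize(feats, dummy_scores)
```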

III. WORD-LEVEL HMM ALGORITHM

The HMM is a statistical modeling approach that is robust to temporal variations in speech and speaker differences [7], [8]. It is defined by a state transition probability matrix $A$, a symbol output probability matrix $B$, and an initial state probability $\pi$. The probability $P(O \mid \lambda)$ of the observation sequence $O$ is given by the multidimensional observation sequences $o_t$, known as feature vectors, and an HMM expression $\lambda = (A, B, \pi)$, which is the compact notation of the three sets $A$, $B$, and $\pi$. For the word-level HMM, the recognizer computes and compares all the $P(O \mid \lambda_v)$ ($1 \le v \le V$), where $V$ is the number of word models. For left-to-right HMMs, $P(O \mid \lambda)$ is computed using the Log-Viterbi algorithm as follows.

1. Initialization:

$\delta_1(1) = \log b_1(o_1), \qquad \delta_1(j) = -\infty$ for $2 \le j \le N$  (1)

2. Recursion:

$\delta_t(j) = \max_i \left[ \delta_{t-1}(i) + \log a_{ij} \right] + \log b_j(o_t)$, for $2 \le t \le T$, $1 \le j \le N$  (2)

3. Termination:

$P(O \mid \lambda) = \delta_T(N)$  (3)

Here, $N$ is the number of states, $T$ is the number of frames of the feature vectors $O = (o_1, \ldots, o_T)$, $a_{ij}$ is the state transition probability between states $i$ and $j$, $A$ denotes their $N$-by-$N$ matrix, $B$ is an $N$-by-$T$ matrix of log output probabilities, and $\delta_t(j)$ is the likelihood value at time index $t$ and state $j$. We restrict the HMM structure to the strict left-to-right connection topology shown in Fig. 2 for use in hardware architecture. Hence, a sparse matrix gives the state transition probability

$a_{ij} \ne 0$ if $j = i$ or $j = i + 1$; $a_{ij} = 0$ otherwise, for $1 \le i, j \le N$.  (4)

Discrete HMM (DHMM), semi-continuous HMM (SCHMM) [9], [12], and continuous HMM (CHMM) are utilized to compute the output probabilities. The DHMM and SCHMM can reduce the output probability computation costs; however, their recognition rates are lower than that of the CHMM. Our system employs the CHMM to give priority to recognition accuracy. In the CHMM, the output probability is typically based on a Gaussian distribution. For an uncorrelated single Gaussian distribution, the output probability is expressed as follows:

$b_j(o_t) = \prod_{k=1}^{P} \frac{1}{\sqrt{2\pi\sigma_{jk}^2}} \exp\!\left( -\frac{(o_{tk} - \mu_{jk})^2}{2\sigma_{jk}^2} \right)$  (5)

where $\mu_{jk}$ and $\sigma_{jk}^2$ are the mean vectors and diagonal covariance matrices, respectively, for state index $j$ and dimension index $k$. The feature vector of frame $t$ is expressed as $o_t = (o_{t1}, \ldots, o_{tP})$, where $o_{tk}$ is the $k$th element of the feature vector and $P$ is the number of dimensions. The log output probability is simplified as follows:

$\log b_j(o_t) = \omega_j - \sum_{k=1}^{P} (o_{tk} - \mu_{jk})^2\, \tilde{\sigma}_{jk}$  (6)
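The Log-Viterbi recursion (1)-(3) for a strict left-to-right HMM can be sketched as follows. The array layout and helper names are our own illustrative choices, not the paper's hardware datapath.

```python
import numpy as np

def log_viterbi_left_right(log_b, log_a_self, log_a_next):
    """Log-Viterbi score for a strict left-to-right HMM.

    log_b[t, j]   : log output probability of state j at frame t (T x N)
    log_a_self[j] : log self-transition probability of state j
    log_a_next[j] : log transition probability from state j to j + 1
    Returns P(O | lambda) in the log domain.
    """
    T, N = log_b.shape
    delta = np.full(N, -np.inf)
    delta[0] = log_b[0, 0]                      # (1) initialization
    for t in range(1, T):                       # (2) recursion
        prev = delta
        stay = prev + log_a_self                # remain in the same state
        move = np.full(N, -np.inf)
        move[1:] = prev[:-1] + log_a_next[:-1]  # advance one state
        delta = np.maximum(stay, move) + log_b[t]
    return delta[-1]                            # (3) termination: delta_T(N)

# Usage with a tiny 3-state model and 4 frames of dummy log-probabilities.
log_b = np.log(np.full((4, 3), 0.5))
log_a = np.log(np.full(3, 0.5))
score = log_viterbi_left_right(log_b, log_a, log_a)
```

Every path through this toy model accumulates seven log(0.5) terms (four emissions, three transitions), which the function reproduces.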


In (6), the precomputed terms are

$\tilde{\sigma}_{jk} = \frac{1}{2\sigma_{jk}^2}$  (7)

$\omega_j = -\sum_{k=1}^{P} \log \sqrt{2\pi\sigma_{jk}^2}.$  (8)

In (7) and (8), $\tilde{\sigma}_{jk}$ and $\omega_j$ can be computed beforehand, i.e., during HMM training. The matrix/vectors $\mu$, $\tilde{\sigma}$, and $\omega$ are called HMM model parameters in this paper. These parameters are stored in the hardware recognition system memory.

We explained the output probability calculation using a single Gaussian distribution. However, a mixture Gaussian distribution is also available for our hardware recognition system. The log output probability of the mixture distribution with $N$ states and $M$ mixtures is given by

$\log b_j(o_t) = \operatorname{addlog}_{m=1}^{M} \left[ \log c_{jm} + \omega_{jm} - \sum_{k=1}^{P} (o_{tk} - \mu_{jmk})^2\, \tilde{\sigma}_{jmk} \right]$  (9)

where

$\operatorname{addlog}(x, y) = \log(e^x + e^y)$  (10)

$\tilde{\sigma}_{jmk} = \frac{1}{2\sigma_{jmk}^2}, \qquad \omega_{jm} = -\sum_{k=1}^{P} \log \sqrt{2\pi\sigma_{jmk}^2}.$  (11)

Vector $c_{jm}$ gives the mixture weights for mixture index $m$ and state index $j$. The vectors $\mu_{jm}$ and $\tilde{\sigma}_{jm}$ are computed in the same way as in the single Gaussian distribution. The mixture Gaussian distribution with $N$ states and $M$ mixtures has the same computational processing as the single Gaussian distribution with $NM$ states, when the addlog operation (10) is ignored and $NM$ replaces $N$. Approximations of the addlog operation using the maximum function [11] and a log table function, e.g., $\operatorname{addlog}(x, y) \approx \max(x, y) + \log(1 + e^{-|x - y|})$ [12], have been proposed. Our recognition system can employ the maximum function.

Fig. 3. Flowchart of HMM computation.

Fig. 3 shows a flowchart of the whole computation. The output probability is the most computationally expensive part of the procedure. For each output probability, the number of arithmetic operations for (6) can be represented by about $4P$, which indicates one addition, one subtraction, and two multiplications over $P$ repetitions. Because it repeats over Loop A, Loop C, and Loop D, it requires $N \times T \times V$ repetitions. The total computation cost, excluding the other calculation parts, is represented by $4PNTV$. As a measure of the processing time, we use the number of clock cycles and assume that one arithmetic operation requires one clock cycle. The number of clock cycles for the above computation thus also comes to $4PNTV$. The processing time is proportional to the number of frames, the feature vector dimensions, the HMM states, and the word models. Large numbers of HMM states and feature vector dimensions are required when long words are expected in word recognition tasks.

IV. SCALABLE ARCHITECTURE

Our scalable architecture for high-speed computation is based on parallel and concurrent processing. We applied two methods to the scalable architecture. In the first, multiple process elements (PEs) are implemented inside the HMM computation module. The HMM computation module executes all arithmetic operations in the word HMM algorithm. In the second, we employ a master-slave operation in the recognition systems. The system consists of speech recognition, speech analysis, and noise robust processing, and controls data transfer. The master-slave operation is done using instruction sets designed for the recognition system. This master-slave operation reduces processing time, or can extend the word vocabulary, by simply arraying two or more systems.
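The computation cost stated in Section III makes the need for parallel operation concrete. With the Section V task parameters, the unpruned 4PNTV estimate reproduces the 335 million operations quoted in the Introduction (the check below is our own arithmetic):

```python
# Unpruned operation count for the output probability calculation (6).
N, P, T, V = 32, 38, 86, 800        # states, dimensions, frames, word models
ops_per_state_frame = 4 * P         # 1 add, 1 subtract, 2 multiplies per dimension
total_ops = ops_per_state_frame * N * T * V
seconds_on_single_alu = total_ops / 80e6   # one operation per cycle at 80 MHz
```

At one operation per clock cycle, a single 80-MHz arithmetic unit would need over 4 s for this part alone, far from a real-time response.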
A. Multiple Process Elements

In the word HMM structure, some computation parts are executed concurrently by partitioning the HMM states and word models. We considered the following three points in the HMM computation structure.

a) During Loop C, the same HMM model parameters $\mu$, $\tilde{\sigma}$, and $\omega$ are used repeatedly, because the parameters are independent of frame number $t$.
b) Loops A and B are divided into parallel computations in each HMM state $j$.
c) Loops A and B are computed simultaneously if Loop A precedes Loop B by one frame.

Case (a) indicates block processing of the HMM model parameter data. When all the data of a word model are transferred to an internal memory, data fetches from an external memory are unnecessary during Loop C. Note that the HMM model parameters of all word models are stored in the external memory in advance because of their large data size. Block processing reduces


Fig. 4. Modified flowchart for parallel computing.

Fig. 5. Structure of HMM circuit.

data transfer by a factor of $T$. Case (b) enables the maximum parallel computation. However, the parallel computations require many data ports in the arithmetic units. The values $N$ and $T$ correspond to the number of HMM states and the number of feature vector frames, respectively. The number of parallel computations $\eta$ is no more than $N$, that is, $\eta \le N$. To obtain the maximum performance using parallel computation, we consider the number of parallel computations as $\eta = N$. The flowchart shown in Fig. 3 can be modified to the new flowchart shown in Fig. 4, which is suitable for parallel computing. It is difficult to directly connect the arithmetic units and the external memory, which exists outside the chip. We effectively utilize the internal memory to solve this issue. The internal memory structure can be modified inside a circuit module or a chip. When the internal memory has multiple output ports, the model parameter data can be supplied to all the arithmetic units.
Fig. 5 shows the HMM circuit structure for the single Gaussian distribution. PE1 and PE2 are the process elements of the output probability calculation (6) and the Viterbi algorithm (1)-(3), respectively. The model parameters are partitioned by the HMM states. The data $\mu$ and $\tilde{\sigma}$ are transferred to all the PE1s. The data $\omega$ and $a$ are transferred to all the PE2s. The addition of $\omega$ is executed in the Viterbi algorithm in this circuit structure. The data port of the feature vectors is shared by all the PE1s. The PE1 operates a 4-stage pipeline process, consisting of add, square, multiply, and accumulate operations using fixed-point arithmetic. The PE1s generate absolute values of the log-likelihoods and treat the value of zero as the maximum value in their fixed-point format.¹ Due to this use of absolute values, the maximum functions in (2) and (3) change to the minimum function in actual hardware processing. Case (c) realizes pipeline chaining between Loops A and B. Because

¹We assumed that all the log-likelihoods in (1)-(3) were negative. The values of $\omega$ are adjusted by subtracting a constant value so that the log-likelihoods do not become positive. The constant values can be pre-computed. The hardware architecture uses the absolute values of the likelihoods to cut a sign bit.

Fig. 6. Implementation of mixture Gaussian distribution.

the operation cycles of Loop A in one frame surpass those of Loop B, it satisfies the requirement of pipeline chaining. Consequently, the scalable architecture obtains an approximately $\eta$-times faster computation in the total computation, excluding the data transfer of the model parameters. The processing time barely increases when arraying the PEs, because the amount of transferred model parameter data does not depend on $\eta$.
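A back-of-envelope cycle count (our own arithmetic, not a quoted figure) shows how the η = N assignment and the 4-stage PE1 pipeline described above compress the processing time; the result is consistent with the 32.7 ms reported in Section V:

```python
# Cycle estimates for the output probability part under parallel PEs.
N, P, T, V = 32, 38, 86, 800
f_clk = 80e6                                 # 80-MHz clock

serial_cycles = 4 * P * N * T * V            # single arithmetic unit
state_parallel_cycles = serial_cycles // N   # eta = N PE1s, one per HMM state

# The 4-stage PE1 pipeline (add, square, multiply, accumulate) overlaps the
# 4 operations of each dimension, leaving about one cycle per dimension:
pipelined_cycles = P * T * V
pipelined_time_ms = pipelined_cycles / f_clk * 1e3   # about 32.7 ms
```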
A conventional architecture that arrays multiple arithmetic units is limited by the need to expand the memory bandwidth or to increase the number of pins connected to the external memory LSIs. The proposed architecture solves this problem by connecting internal memory units to the arithmetic units inside a chip. The block processing that transfers the HMM model parameters thus becomes important when implementing the internal memory.

We explained the hardware architecture for the single Gaussian distribution. The mixture Gaussian distribution can also be applied by inserting the addlog approximation unit. Fig. 6 shows a simple example of a Gaussian distribution with two mixtures and two states. The input ports of the addlog unit are connected to PE1. The output port is connected to PE2.
B. Complete Recognition System
This system executes the whole speech processing required
for a speech recognition system. It includes not only the
speech recognition part but also the parts for speech analysis,
noise robustness, and system control. Fig. 7 shows a block
diagram of the complete recognition system. The speech
analysis algorithm consists of Hanning windowing and Mel
frequency cepstral coefficients (MFCC) analysis [10]. For the
noise robustness algorithm, we adopted the running spectrum
filtering/dynamic range adjustment (RSF/DRA) [13]. This
method has an improved robust performance compared with


Fig. 7. Complete recognition system.


TABLE I
INSTRUCTION SETS

conventional methods, such as spectral subtraction [14] and RASTA [15]. The RSF/DRA improved recognition accuracy by an average of 5% in car noise environments at a signal-to-noise ratio (SNR) of 0 dB, and by an average of 10% in white noise environments at an SNR of -10 dB. The right interfaces shown in the figure are connected to an external memory. The memory stores the HMM model parameters and feature vectors. The top and bottom interfaces are used for the master-slave operation. When the system is set as the master, the top interface is connected to the microprocessor. The microprocessor controls the recognition system using instruction sets. These instructions execute speech analysis, noise robustness, and speech recognition processing. They are also used to control communication between the microprocessor and the complete systems, to call the recognition results, and to execute the master-slave operation. Table I shows the instruction sets for the complete system.
C. Master-Slave Operation

The master-slave operation can realize speed-ups of up to $S$ times, where $S$ is the number of systems. Fig. 8 shows the master-slave operation. Fig. 8(a) illustrates the arrangement of the master and slave systems. The master system is directly connected to the microprocessor. The master and slave systems each have an external memory that stores the HMM model parameters. The master/slave assignment is done by giving constant values to the left ports of the recognition system in Fig. 7. Zero is assigned to the master system, and the other values are assigned to the slave systems. Data and instructions from the microprocessor are transferred to individual master/slave systems by switching the Chip Select value. If Chip Select is set to the maximum value, the data and instructions are transferred to all the slave systems.

Fig. 8. Procedure of master-slave operation.

Before the master-slave operation, the HMM reference models are transferred to each memory, as shown in Fig. 8(b). The data size per memory is equally partitioned depending on $S$. In the master-slave operation, speech analysis and robust processing are first executed in the master system while a speaker is uttering. The feature vectors are then stored in the external memory of the master system, as shown in Fig. 8(c). Note that the feature vectors are constant for all the word models during Loop D. Because the feature vectors are commonly utilized in all the systems, executing the speech analysis and robust processing only in the master system reduces power consumption. Second, the feature vectors are transferred from the master system to the slave systems using the BROADCAST instruction, as shown in Fig. 8(d). The broadcast transmission does not require a handshake between the microprocessor and the slave systems, thus reducing the data transfer time, even if the number of slave systems increases. Third, speech recognition processing is simultaneously executed in all the systems by calling the feature vectors and the reference HMM models from the memory, as shown in Fig. 8(e). Finally, the microprocessor gathers the recognition scores from all the systems and searches for the best recognition result, as in Fig. 8(f).
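The procedure of Fig. 8 can be mimicked in software. The function names and the toy scorer below are illustrative assumptions; in the actual hardware, each system scores its own partition of the word models concurrently:

```python
# Sketch of the master-slave recognition procedure of Fig. 8.

def partition_words(n_words, n_systems):
    """Split V word models evenly across S systems (Fig. 8(b))."""
    base, extra = divmod(n_words, n_systems)
    bounds, start = [], 0
    for i in range(n_systems):
        size = base + (1 if i < extra else 0)
        bounds.append(range(start, start + size))
        start += size
    return bounds

def recognize_master_slave(feature_vectors, score_word, n_words, n_systems):
    partitions = partition_words(n_words, n_systems)
    # Fig. 8(d): the master broadcasts the feature vectors to every system;
    # Fig. 8(e): all systems score their partitions (concurrently in hardware).
    all_scores = {w: score_word(feature_vectors, w)
                  for part in partitions for w in part}
    # Fig. 8(f): the microprocessor gathers scores and picks the best word.
    return max(all_scores, key=all_scores.get)

# Usage with a toy scorer: word 7 is the best match by construction.
best = recognize_master_slave(None, lambda fv, w: -abs(w - 7), 20, 5)
```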
V. EVALUATIONS
A. System Implementation
The HMM computation circuit and the complete recognition system were designed on a CMOS 0.18-µm standard cell library using a Verilog-HDL RTL description. The numbers of gates in the HMM computation circuit and the complete system were 340 k and 400 k, respectively. The circuit executed 32 parallel operations. The number of parallel operations was equal to the number of HMM states. The maximum operating frequency of the circuit and the recognition system was 128 MHz.

TABLE II
ARITHMETIC OPERATION IN HMM COMPUTATION

TABLE III
INSTRUCTION TYPES AND NUMBER OF INSTRUCTION CYCLES

TABLE IV
EVALUATIONS IN THE HARDWARE AND SOFTWARE IMPLEMENTATIONS

B. System Performance

The hardware-based recognition system based on the proposed architecture was evaluated on processing time and power dissipation. The processing time of the proposed HMM circuit was much smaller than that of a single arithmetic logic unit (ALU). However, this comparison should also include power dissipation. We estimated the power dissipation of the arithmetic units in both the hardware recognition system and a fixed-point DSP running a software solution. Most software implementations use pruning to reduce the computational load. Two popular forms of pruning are Gaussian selection [16] and Gaussian pruning [17], [18]. These techniques reduce the computational loads to 20-40% in HMM-based recognition systems. For example, Gaussian pruning can reduce the computation load in the output probability calculation (6). The summation in (6) denotes additions, incrementing the dimensional index $k$. During this summation, if the calculated likelihood value falls below a certain threshold, the computation can stop halfway and be replaced by an approximate value, because the likelihood value is assumed to be far from the center of the Gaussian distribution. This indicates that the computation loads in the summation are reduced using threshold pruning.
In the evaluations, we assumed an 800-word vocabulary task that could be handled by a single system. The parameters of the recognition task were $N = 32$, $P = 38$, $T = 86$, and $V = 800$ (i.e., 32 HMM states, 38-dimensional feature vectors, 86 speech frames corresponding to a speech length of 1.0 s, and an 800-word HMM). Table II shows the number of arithmetic operations in the output probability calculation (6) and the Viterbi search algorithm (1)-(3). The computational cost of the Viterbi algorithm was a small percentage of the total. To simplify the comparison, we evaluated the system performance only for the output probability calculation, which is common to the hardware and software implementations. The hardware implementation requires 335 million arithmetic operations. In the software implementation, we estimated the required computational costs using Gaussian pruning for the vector threshold with heuristic estimation [18]. In this case, the arithmetic cost was reduced to 117 million operations, or 34.9% of the full computation.
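The threshold pruning described above can be sketched as an early exit from the summation in (6). The threshold, floor value, and data below are arbitrary illustrations, not values from [17], [18]:

```python
# Early termination (Gaussian pruning) in the summation of (6).
# The weighted squared distance is accumulated; once it exceeds a
# threshold, the remaining dimensions cannot recover the score,
# so the loop stops and substitutes a floor value.

def pruned_log_output(o, mu, sigma_tilde, omega, threshold, floor):
    acc = 0.0
    for ot, m, s in zip(o, mu, sigma_tilde):
        acc += (ot - m) ** 2 * s      # 1 subtract, 2 multiplies, 1 add
        if acc > threshold:           # likelihood already too low
            return floor              # approximate value, skip the rest
    return omega - acc

# Usage: a far-away vector triggers pruning; a close one does not.
mu = [0.0] * 8
st = [0.5] * 8
far = pruned_log_output([9.0] * 8, mu, st, 0.0, threshold=30.0, floor=-1e9)
close = pruned_log_output([0.1] * 8, mu, st, 0.0, threshold=30.0, floor=-1e9)
```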
In the hardware implementation, the proposed architecture was measured using only single-system operation, not the master-slave operation. The clock frequency was set to 80 MHz. The processing time of the output probability calculation was 32.7 ms, excluding the HMM training data transfer from external to internal memory. The total processing time came to 45.5 ms, including the Viterbi algorithm and data transfer. Consequently, this recognizer took 56.9 µs/word for the single word HMM at an 80-MHz clock frequency. The processing time and recognition time were measured from the RTL-level circuit simulation. The power dissipation value was measured using a power estimation CAD tool that uses switching activities in gate-level simulations, including data loading/storing in an internal memory. The hardware system consumed 421.5 mW at an 80-MHz clock frequency and a 1.8-V power supply. This measurement includes the Viterbi algorithm unit, but its power dissipation percentage was low. The software implementation utilizes a Texas Instruments TMS320VC5416 fixed-point digital signal processor (DSP). It had a 160-MIPS operating performance at a 160-MHz clock frequency and took one cycle per instruction. The number of instruction cycles was measured with a DSP compiler tool. Table III shows an example of the required instructions for the one addition, one subtraction, and two multiplications inside the summation shown in (6). The total output probability calculation with the above Gaussian pruning required 241 million instruction cycles. We used the power dissipation value from the TMS320VC5416 data sheet [19]. This DSP consumed 96 mW at a 160-MHz clock frequency using a 1.6-V power supply.²
Table IV gives the evaluation results of the software and hardware implementations. The processing time represents the time length of the output probability calculation, excluding the HMM training data transfer. Note that most software implementations use a time-synchronous search [20] to reduce the latency between the end of the utterance and obtaining the recognition result. The time-synchronous search executes recognition processing during an utterance. In contrast, the hardware implementation cannot use a time-synchronous search. The processing time directly results in the system response delay because the HMM computation starts after the end of the utterance. If very long utterances are recognized, this causes an unacceptably long delay. The dissipated energy is given by

$\text{Energy [mJ]} = \text{Power [mW]} \times \text{Processing time [s]}.$  (12)
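Plugging the measured values into (12) illustrates the trade-off discussed below: higher peak power for the hardware, but far lower energy per recognition. The DSP processing time is our own estimate from the cycle count at one instruction per cycle:

```python
# Dissipated energy per 800-word recognition, from (12).
hw_power_mw = 421.5
hw_time_s = 0.0327                  # output probability calculation, 80 MHz

dsp_power_mw = 96.0
dsp_cycles = 241e6                  # with Gaussian pruning
dsp_time_s = dsp_cycles / 160e6     # one instruction per cycle at 160 MHz

hw_energy_mj = hw_power_mw * hw_time_s     # about 13.8 mJ
dsp_energy_mj = dsp_power_mw * dsp_time_s  # about 144.6 mJ
```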

The proposed hardware system requires a higher peak power compared with the DSP-based system, even though it considerably outperforms the DSP system in terms of total dissipated

²The value from the data sheet was obtained using 50% MAC and 50% NOP instructions. Otherwise, the actual value would have been higher.


Fig. 9. Measurement results.

energy. A software implementation could reduce power dissipation and processing time by applying additional techniques to reduce computational costs. For example, using beam pruning can decrease the cost by as much as a factor of two, but the proposed hardware system is still better than the DSP system in terms of total dissipated energy. Although the hardware system does not always outperform the DSP system in system response time, even though its processing time is much shorter, its response time is fully acceptable for real-time applications. The proposed architecture realized both low-power consumption and real-time system response in the 800-word vocabulary task.
C. Measurement Results for Scalability

Fig. 9 shows the measurement results for scalability in HMM states and the master-slave operation. For the speech recognition performance evaluation, we used 100 Japanese city names from the Japanese Electronic Industry Development Association (JEIDA) database [21] for the speaker-independent recognition experiments. Speech was sampled at 11.025 kHz with 16-bit quantization. For the speech analysis, the MFCC features were extracted after pre-emphasis and Hanning windowing. They were converted to 38-dimensional feature vectors. The frame length and shift were 23.2 ms and 11.6 ms, respectively. The feature vectors consisted of 12 MFCCs, 12 delta MFCCs, 12 delta-delta MFCCs, delta log energy, and delta-delta log energy. Two hundred gender-dependent models were trained on a speech corpus of 24 000 words, collected from 40 males and 40 females. Speech from the training speakers was not

Fig. 10. FPGA board system.

included in the test data. The word models were set at from 4- to 32-state HMMs with a single Gaussian distribution. Speech data from 10 males and 10 females was tested for recognition, and the noise of a running car was added to the original speech data under a 10-dB SNR condition. The experimental results indicate that a large number of HMM states improves recognition performance in noisy environments. Because HMMs with more than 32 states barely increased the recognition accuracy, the 32-state HMMs provided the best recognition performance in this test set.
With regard to circuit performance, the processing time was measured according to the conditions in Section V-B. The clock frequency was set to 25 MHz. The evaluated processing included both the output probability and the other calculations. The circuit area was proportional to the number of recognition words. However, the recognition time slightly increased for large numbers of states, because data transfer from external to internal memory is required and the data size was proportional to the number of HMM states. In master-slave operations, the total processing time is inversely proportional to the number of systems. When the number of systems is more than two, the feature vectors incur a data transfer time between the microprocessor and the recognition systems. This transfer time is no more than 20 ms in a five-system operation. The master-slave operation can thus considerably reduce the total processing time.
VI. FPGA IMPLEMENTATION

We implemented the complete recognition system on an FPGA to verify various system operations and to evaluate the entire system in a realistic environment. Fig. 10 shows the FPGA board recognition system using an Altera APEX20KE running at 10 MHz. A sampling clock generator, an A/D converter, a serial port interface, and external SRAM were connected to the FPGA board. The sampling rate was 11.025 kHz with 12-bit quantization. A sequential control circuit substitutes for a microprocessor. Speech detection starts when a switch on the board is pushed and ends automatically after 1.5 s. A more standard push-to-talk interface (e.g., the user starts an utterance by pushing down a button and stops by releasing it) or automatic voice activity detection (VAD) [22] should be used in practical applications in future developments. The HMM model parameters were transferred from a PC to the FPGA board via the serial port before speech recognition testing. The FPGA board system enabled users to utter speech using


a microphone and to observe the recognition results as word numbers displayed on an LED.
VII. CONCLUSION

In this paper, we described a new scalable architecture for the word HMM computation in speech recognition. The high computation costs of word HMMs cause excessively long processing times and restrict their applications to small vocabularies. To solve this problem, we applied new parallel computation methods to the hardware architecture. The proposed architecture provides scalability that can reduce processing time and/or extend the word vocabulary. This scalability is realized by employing multiple process elements inside the HMM computation circuit and the master-slave operation between the complete recognition systems. To evaluate the proposed architecture, we designed the complete system using CMOS standard cell library and demonstrated from the evaluations that the system is adequate for operating larger vocabularies.
ACKNOWLEDGMENT
The authors would like to thank the Research and Development Headquarters, Yamatake Corporation, and the VLSI Design and Education Center (VDEC), the University of Tokyo, for their cooperation in this work.
REFERENCES
[1] J. Pihl, T. Svendsen, and M. H. Johnsen, "A VLSI implementation of pdf computations in HMM based speech recognition," in Proc. IEEE TENCON'96, 1996, pp. 241–246.
[2] W. Han, K. Hon, and C. Chan, "An HMM-based speech recognition IC," in Proc. IEEE ISCAS'03, vol. 2, 2003, pp. 744–747.
[3] S. J. Melnikoff, S. Quigley, and M. J. Russell, "Implementing a simple continuous speech recognition system on an FPGA," in Proc. IEEE Symp. FPGAs for Custom Computing Machines (FCCM'02), 2002, pp. 275–276.
[4] F. Vargas, R. Fagundes, and D. Barros, "A FPGA-based Viterbi algorithm implementation for speech recognition systems," in Proc. IEEE ICASSP'01, vol. 2, May 2001, pp. 1217–1220.
[5] L. R. Rabiner, "Recognition of isolated digits using hidden Markov models with continuous mixture densities," AT&T Tech. J., vol. 64, no. 6, pp. 1211–1234, 1985.
[6] M. Karnjanadecha and S. A. Zahorian, "Signal modeling for isolated word recognition," in Proc. IEEE ICASSP'99, vol. 1, Mar. 1999, pp. 293–296.
[7] X. Huang, Spoken Language Processing. Englewood Cliffs, NJ: Prentice-Hall, 2001.
[8] L. R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proc. IEEE, vol. 77, pp. 257–285, Feb. 1989.
[9] T. Watanabe, K. Shinoda, K. Takagi, and E. Yamada, "Speech recognition using tree-structured probability density function," in Proc. ICSLP'94, 1994, pp. 223–226.
[10] L. R. Rabiner and B. H. Juang, Fundamentals of Speech Recognition. Englewood Cliffs, NJ: Prentice-Hall, 1993.
[11] P. Beyerlein, "Fast log-likelihood computation for mixture densities in a high-dimensional feature space," in Proc. ICSLP'94, vol. S0722, 1994, pp. 53–54.
[12] S. Sagayama and S. Takahashi, "On the use of scalar quantization for fast HMM computation," in Proc. IEEE ICASSP'95, vol. 1, 1995, pp. 213–216.
[13] S. Yoshizawa, N. Wada, N. Hayasaka, and Y. Miyanaga, "Noise robust speech recognition focusing on time variation and dynamic range of speech feature parameters," in Proc. ISPACS'03, 2003, pp. 484–487.
[14] S. Boll, "Suppression of acoustic noise in speech using spectral subtraction," IEEE Trans. Acoust., Speech, Signal Process., vol. 27, no. 1, pp. 113–120, Feb. 1979.
[15] H. Hermansky and N. Morgan, "RASTA processing of speech," IEEE Trans. Speech Audio Process., vol. 2, no. 5, pp. 578–589, Oct. 1994.
[16] K. M. Knill, M. J. F. Gales, and S. J. Young, "Use of Gaussian selection in large vocabulary continuous speech recognition using HMMs," in Proc. ICSLP'96, 1996, pp. 470–473.
[17] A. Lee, T. Kawahara, and K. Shikano, "Gaussian mixture selection using context-independent HMM," in Proc. IEEE ICASSP'01, May 2001, pp. 69–72.
[18] A. Lee, T. Kawahara, K. Takeda, and K. Shikano, "A new phonetic tied-mixture model for efficient decoding," in Proc. IEEE ICASSP'00, 2000, pp. 1269–1272.
[19] TMS320VC5416 Fixed-Point Digital Signal Processor Data Manual, Texas Instruments, Literature Number SPRS095O, 1999.
[20] H. Ney, D. Mergel, A. Noll, and A. Paeseler, "Data driven search organization for continuous speech recognition," IEEE Trans. Signal Process., vol. 40, no. 1, pp. 272–281, Feb. 1992.
[21] S. Itahashi, "A Japanese language speech database," in Proc. IEEE ICASSP'86, 1986, pp. 321–324.
[22] J. Sohn, N. S. Kim, and W. Sung, "A statistical model based voice activity detection," IEEE Signal Process. Lett., vol. 6, no. 1, pp. 1–3, Jan. 1999.

Shingo Yoshizawa (S'00) received the B.E. and M.E. degrees in electrical engineering from Hokkaido University, Sapporo, Japan, in 2001 and 2003, respectively. He is currently working toward the Ph.D. degree at the Graduate School of Information Science and Technology, Hokkaido University.
His research interests are speech processing, wireless communication systems, and VLSI architecture.
Naoya Wada (S'00) received the B.E. and M.E. degrees in electrical engineering from Hokkaido University, Sapporo, Japan, in 2001 and 2003, respectively. He is currently working toward the Ph.D. degree at the Graduate School of Information Science and Technology, Hokkaido University.
His research interests are digital signal processing, speech analysis, and speech recognition.

Noboru Hayasaka (S'00) received the B.E. and M.E. degrees in electrical engineering from Hokkaido University, Sapporo, Japan, in 2002 and 2004, respectively. He is currently working toward the Ph.D. degree at the Graduate School of Information Science and Technology, Hokkaido University.
His research interests are digital signal processing, speech analysis, and speech recognition.

Yoshikazu Miyanaga (S'80–M'83–SM'03) received the B.S., M.S., and Dr.Eng. degrees from Hokkaido University, Sapporo, Japan, in 1979, 1981, and 1986, respectively.
Since 1983, he has been with Hokkaido University, where he is a Professor in the Graduate School of Information Science and Technology. His research interests are adaptive signal processing, nonlinear signal processing, and parallel-pipelined VLSI systems.
Prof. Miyanaga is a member of IEICE, the Information Processing Society of Japan, and the Acoustical Society of Japan.
