IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS-I: REGULAR PAPERS, VOL. 53, NO. 1, JANUARY 2006
Abstract: This paper describes a scalable architecture for real-time speech recognizers based on word hidden Markov models (HMMs), which provide high recognition accuracy for word recognition tasks. However, the size of their recognition vocabulary is usually small because their extremely high computational costs cause long processing times. To achieve high-speed operation, we developed a VLSI system that has a scalable architecture. The architecture effectively uses parallel computations on the word HMM structure. It can reduce processing time and/or extend the word vocabulary. To explore the practicality of our architecture, we designed and evaluated a complete recognition system, including speech analysis and noise robustness parts, on a 0.18-μm CMOS standard cell library and a field-programmable gate array. In the CMOS standard-cell implementation, the total processing time is 56.9 μs/word at an operating frequency of 80 MHz in a single system. The recognizer gives a real-time response using an 800-word vocabulary.
Index Terms: Hidden Markov model (HMM), scalable architecture, speech recognition, VLSI implementation.
I. INTRODUCTION
Fig. 1. Flowchart of a speech recognition system.

III. WORD-LEVEL HMM ALGORITHM

HMM is a statistical modeling approach that is robust to temporal variations in speech and speaker differences [7], [8], and is defined by a state transition probability matrix A, a symbol output probability matrix B, and an initial state probability vector π. The probability P(O|λ) of the observation sequence O = {o_1, o_2, ..., o_T} is given by the multidimensional observation sequences o_t, known as feature vectors, and an HMM expression λ = (A, B, π), which is the compact notation of the three sets A, B, and π. For the word-level HMM, the recognizer computes and compares all the P(O|λ_w) (1 ≤ w ≤ W), where W is the number of word models. For left-to-right HMMs with N states, P(O|λ) is computed using the Log-Viterbi algorithm as follows.

1. Initialization

    δ_1(i) = log π_i + log b_i(o_1),  for 1 ≤ i ≤ N  (1)

2. Recursion

    δ_t(j) = max_{1≤i≤N} [δ_{t-1}(i) + log a_ij] + log b_j(o_t),  for 2 ≤ t ≤ T, 1 ≤ j ≤ N  (2)

3. Termination

    log P(O|λ) = max_{1≤i≤N} δ_T(i)  (3)

The recognized word is the one whose model maximizes the likelihood:

    w* = argmax_{1≤w≤W} log P(O|λ_w)  (4)

Discrete HMM (DHMM), semi-continuous HMM (SCHMM) [9], [12], and continuous HMM (CHMM) are utilized to compute the output probabilities. The DHMM and SCHMM can reduce the output probability computation costs. However, their recognition rates are lower than that of the CHMM. Our system employs the CHMM to give priority to recognition accuracy. In the CHMM, the output probability is typically based on a Gaussian distribution. For an uncorrelated single Gaussian distribution, the output probability is expressed as follows:

    b_i(o_t) = ∏_{k=1}^{K} (1 / (√(2π) σ_ik)) exp(−(o_tk − μ_ik)² / (2σ_ik²))  (5)

where μ_ik and σ_ik² are the mean vectors and diagonal covariance matrices, respectively, for the state index i and the dimension index k. The feature vectors are expressed as o_t = (o_t1, ..., o_tK), where o_tk is the kth dimension of the tth feature vector and K is the number of dimensions. The log output probability is simplified as follows:

    log b_i(o_t) = ω_i − ∑_{k=1}^{K} c_ik (o_tk − μ_ik)²  (6)
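The Log-Viterbi computation described above can be sketched in a few lines; this is an illustrative NumPy version (the function names are ours, and the per-frame log output probabilities log b_i(o_t) are assumed to be precomputed):

```python
import numpy as np

def log_viterbi(log_b, log_a, log_pi):
    """Log-Viterbi score for one word HMM, steps (1)-(3).
    log_b:  (T, N) log output probabilities log b_i(o_t)
    log_a:  (N, N) log transition probabilities log a_ij
    log_pi: (N,)   log initial state probabilities log pi_i
    Returns log P(O | lambda) for the best state path."""
    T, N = log_b.shape
    delta = log_pi + log_b[0]                                # (1) initialization
    for t in range(1, T):                                    # (2) recursion over frames
        delta = (delta[:, None] + log_a).max(axis=0) + log_b[t]
    return delta.max()                                       # (3) termination

def recognize(log_b_per_word, log_a_per_word, log_pi_per_word):
    """Word-level decision (4): pick the word model with the maximum score."""
    scores = [log_viterbi(b, a, pi)
              for b, a, pi in zip(log_b_per_word, log_a_per_word, log_pi_per_word)]
    return int(np.argmax(scores))
```

For left-to-right word HMMs the transition matrix is banded, so the inner maximization reduces to comparing a self-loop and a forward transition per state, which is what the hardware exploits.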
For a mixture Gaussian distribution, the log output probability becomes

    log b_i(o_t) = addlog_{m=1}^{M} [ log w_im + log N_im(o_t) ]  (9)

where

    addlog(x, y) = log(e^x + e^y)  (10)

    log N_im(o_t) = ω_im − ∑_{k=1}^{K} c_imk (o_tk − μ_imk)²  (11)

The vector w_im gives the mixture weights for mixture index m and state index i. The vectors μ_im and c_im are computed in the same way as in the single Gaussian distribution. The mixture Gaussian distribution with N states and M mixtures has the same computational processing as the single Gaussian distribution with N × M states when the addlog operation (10) is ignored and the maximum function replaces it. Approximations of the addlog operation using the maximum function [11] and the log table function, e.g., addlog(x, y) = max(x, y) + log(1 + e^(−|x−y|)) [12], have been proposed. Our recognition system can employ the maximum function.
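The addlog operation (10) and the maximum-function approximation can be sketched as follows (an illustrative fragment; the function names are ours):

```python
import math

def addlog(x, y):
    """Exact log-domain addition, log(e**x + e**y), computed stably
    by factoring out the larger argument."""
    hi, lo = max(x, y), min(x, y)
    return hi + math.log1p(math.exp(lo - hi))

def addlog_max(x, y):
    """Maximum-function approximation of addlog, as employed by the
    recognition system; it drops the log(1 + e^(-|x-y|)) correction."""
    return max(x, y)
```

The approximation error is largest when the two mixture log-likelihoods are equal, where it is bounded by log 2; when one component dominates, the error vanishes.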
In (6), the constant term ω_i and the coefficients c_ik are given by

    ω_i = −∑_{k=1}^{K} log(√(2π) σ_ik)  (7)

    c_ik = 1 / (2σ_ik²)  (8)

In (7) and (8), ω_i and c_ik can be computed beforehand, i.e., during HMM training. The matrix/vectors μ, c, and ω are called the HMM model parameters in this paper. These parameters are stored in the hardware recognition system memory.

IV. SCALABLE ARCHITECTURE

Fig. 3. Flowchart of HMM computation.
Fig. 3 shows a flowchart of the whole computation. The output probability is the most computationally expensive part of the procedure. For each output probability, the number of arithmetic operations for (6) can be represented by about 4K, which indicates one addition, one subtraction, and two multiplications over K repetitions. Because it repeats as Loop A, Loop C, and Loop D, it requires T × N × W repetitions. The total computation cost, excluding the other calculation parts, is represented by 4KTNW. As a measure of the processing time, we use the number of clock cycles. We assume that one arithmetic operation requires one clock cycle, so the number of clock cycles comes to 4KTNW for the above computation. For clarity, we state the computation cost as 4KTNW arithmetic operations and the processing time as 4KTNW clock cycles. The processing time is proportional to the number of frames T, the feature vector dimensions K, the HMM states N, and the word models W. Large numbers of HMM states and feature vector dimensions are required when long words are expected in word recognition tasks.
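With the task parameters used later in the evaluations (K = 38 dimensions, T = 86 frames, N = 32 states, W = 800 words), the 4KTNW cost model can be checked directly. The N-way pipelined-PE assumption in the second half is ours, though it is consistent with the processing time quoted in Section V:

```python
# Cost model 4KTNW for the output probability calculation (6).
K, T, N, W = 38, 86, 32, 800      # dimensions, frames, states, words

serial_ops = 4 * K * T * N * W    # ~4 arithmetic operations per dimension,
print(serial_ops)                 # per state, frame, and word: 334643200,
                                  # i.e., the roughly 335 million operations quoted

# Assumption: N parallel pipelined PE1s each finish one dimension per clock,
# so the cycle count drops to K*T*W; at 80 MHz this is close to the
# ~32.7 ms reported for the output probability calculation.
cycles = K * T * W
print(cycles / 80e6 * 1e3)        # processing time in ms
```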
data transfer by times. Case (b) enables the maximum parallel computations. However, the parallel computations require a lot of data ports in the arithmetic units. The values N and T correspond to the number of HMM states and the number of feature vector frames, respectively. The number of parallel computations P is no more than N, that is, P ≤ N. To obtain the maximum performance using parallel computation, we consider the number of parallel computations as P = N. The flowchart shown in Fig. 3 can be modified to the new flowchart shown in Fig. 4, which is suitable for parallel computing. It is difficult to directly connect the arithmetic units and the external memory, which exists outside the chip. We effectively utilize the internal memory to solve this issue. The internal memory structure can be modified inside a circuit module or a chip. When the internal memory has multiple output ports, the model parameter data can be supplied to all the arithmetic units.
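The P = N organization, in which all N state likelihoods of one frame are produced simultaneously from a multi-port parameter memory, can be sketched as follows (an illustrative model, not the hardware itself; NumPy broadcasting stands in for the N parallel units):

```python
import numpy as np

def log_output_probs_all_states(o_t, mu, c, omega):
    """Log output probabilities (6) for all N states of one frame at once.
    o_t:   (K,)   feature vector of frame t (shared by all units)
    mu, c: (N, K) per-state means and precomputed coefficients 1/(2*sigma^2)
    omega: (N,)   per-state precomputed constant terms
    Each row corresponds to one parallel arithmetic unit reading its own
    partition of the model parameter memory."""
    return omega - np.sum(c * (o_t - mu) ** 2, axis=1)
```

Because the feature vector o_t is the only shared operand, a single broadcast port suffices for it, while the per-state parameters come from the partitioned internal memory.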
Fig. 5 shows the HMM circuit structure for the single Gaussian distribution. PE1 and PE2 are the process elements of the output probability calculation (6) and the Viterbi algorithm (1)-(3), respectively. The model parameters are partitioned by the HMM states. The data μ and c are transferred to all the PE1s. The data a and ω are transferred to all the PE2s. The addition of ω is executed in the Viterbi algorithm in this circuit structure. The data port of the feature vectors is shared by all the PE1s. Each PE1 operates a 4-stage pipeline process, consisting of add, square, multiply, and accumulate operations using fixed-point arithmetic. The PE1s generate the absolute values of the log-likelihoods, limited to the maximum value of their fixed-point format.1 Because of this use of absolute values, the maximum functions in (2) and (3) change to the minimum function in the actual hardware processing. Case (c) realizes pipeline chaining between Loops A and B.
1We assumed that all the log-likelihoods in (1)-(3) were negative. The values of ω are adjusted by subtracting a constant value so that the log-likelihoods do not become positive. The constant values can be pre-computed. The hardware architecture uses the absolute values of the likelihoods to cut a sign bit.
TABLE II
ARITHMETIC OPERATION IN HMM COMPUTATION
TABLE III
INSTRUCTION TYPES AND NUMBER OF INSTRUCTION CYCLES
was 340 k and 400 k, respectively. The circuit executed 32 parallel operations. The number of parallel operations was equal to the number of HMM states. The maximum operating frequency of the circuit and the recognition system was 128 MHz.
TABLE IV
EVALUATIONS IN THE HARDWARE AND SOFTWARE IMPLEMENTATIONS
B. System Performance
The hardware-based recognition system based on the proposed architecture was evaluated in terms of processing time and power dissipation. The processing time of the proposed HMM circuit was much smaller than that of a single arithmetic logic unit (ALU). This should be further evaluated, however, including power dissipation. We estimated the power dissipation of the arithmetic units in both the hardware recognition system and a fixed-point DSP using a software solution. Most software implementations use pruning to reduce the computational load. Two popular forms of pruning are Gaussian selection [16] and Gaussian pruning [17], [18]. These techniques reduce the computational loads to 20-40% in HMM-based recognition systems. For example, Gaussian pruning can reduce the computation loads in the output probability calculation in (6). The summation over k in (6) denotes K additions, incrementing the dimensional index k. During this summation, if the calculated value falls below a certain threshold, the computation can stop halfway, replacing the result with an approximate value, because the likelihood value is assumed to be far from the center of the Gaussian distribution. This indicates that the computation loads shown in the summation are reduced using threshold pruning.
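The threshold-pruned summation described above can be sketched as follows; the function name, argument layout, and the choice of returning the threshold itself as the approximate value are our assumptions for illustration:

```python
def log_output_prob_pruned(o_t, mu_i, c_i, omega_i, threshold):
    """Log output probability (6) for one state, with partial-sum pruning.
    o_t, mu_i, c_i: length-K sequences (feature vector, mean, 1/(2*sigma^2))
    omega_i:        precomputed constant term for state i
    threshold:      pruning threshold on the running log-likelihood
    The running sum is checked after each dimension; once it falls below
    the threshold, the computation stops halfway and an approximate
    (floor) value is returned."""
    acc = omega_i
    for k in range(len(o_t)):
        acc -= c_i[k] * (o_t[k] - mu_i[k]) ** 2
        if acc < threshold:       # likelihood already far from the Gaussian center
            return threshold      # stop halfway with an approximate value
    return acc
```

Since each term of the sum is nonnegative, the running value only decreases, which is what makes the early-exit test safe for ranking candidates.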
In the evaluations, we assumed an 800-word vocabulary task that could be handled with a single system. The parameters in the recognition task were N = 32, K = 38, T = 86, and W = 800 (i.e., 32 HMM states, 38-dimensional feature vectors, 86 speech frames in which the speech length was 1.0 s, and an 800-word HMM). Table II shows the number of arithmetic
operations in the output probability calculation (6) and the Viterbi search algorithm (1)-(3). The computational cost of the Viterbi
algorithm was a small percentage of the total. To simplify the
comparison, we evaluated the system performance only for the
output probability calculation that is common to the hardware
and software implementations. The hardware implementation
requires 335 million arithmetic operations. In the software
implementation, we estimated the required computational costs
using Gaussian pruning for the vector threshold with heuristic
estimation [18]. In this case, the arithmetic cost was reduced
to 117 million, or 34.9% of the full computation.
In the hardware implementation, the proposed architecture was measured using only a single-system operation and not the master-slave operation. The clock frequency was set to 80 MHz.
The processing time in the output probability calculation was
32.7 ms without the HMM training data transfer from an external to an internal memory. The total processing time came
to 45.5 ms, including the Viterbi algorithm and data transfer.
Consequently, this recognizer took 56.9 μs/word for the single word HMM at an 80-MHz clock frequency.
energy. A software implementation could further reduce power dissipation and processing time by applying additional techniques to reduce computational costs. For example, using beam pruning can decrease the cost by as much as a factor of two, but the proposed hardware system is still better than the DSP system in terms of the total dissipation energy. Although the hardware system does not always outperform the DSP system in system response time, its response time is fully acceptable for real-time applications. The proposed architecture realized both low power consumption and real-time system response in the 800-word vocabulary task.
C. Measurement Results for Scalability
Fig. 9 shows the measurement results for scalability in HMM states and master-slave operation. For the speech recognition
performance evaluation, we used 100 Japanese city names from
the Japanese Electronic Industry Development Association
(JEIDA) database [21] for the speaker-independent recognition
experiments. Speech was sampled at 11.025 kHz with 16-bit quantization. For the speech analysis, the MFCC features were
extracted after pre-emphasis and Hanning windowing. They
were converted to 38-dimensional feature vectors. The frame
length and shift were 23.2 ms and 11.6 ms, respectively. The
feature vectors consisted of 12 MFCCs, 12 delta MFCCs, 12 delta-delta MFCCs, delta log energy, and delta-delta log
energy. Two hundred gender-dependent models were trained
on a speech corpus of 24 000 words, collected from 40 males
and 40 females. Speech from the training speakers was not
included in the test data. The word models were set from 4- to 32-state HMMs with a single Gaussian distribution. The speech data from 10 males and 10 females was tested for recognition, and the noise of a running car was added to the original speech data under a 10-dB SNR condition. The experimental results indicate that a large number of HMM states improves recognition performance in noisy environments. Because HMMs with more than 32 states barely increased the recognition accuracy, the 32-state HMMs provided the best recognition performance in this test set.
With regards to circuit performance, processing time was
measured according to the conditions in Section V-B. The
clock frequency was set to 25 MHz. The evaluated processing
included both the output probability and other calculations.
The circuit area was proportional to the number of recognition
words. However, the recognition time increased slightly for large numbers of states because of the required data transfer from external to internal memory; the data size is proportional to the number of HMM states. In master-slave operations, the total processing time is inversely proportional to the number of systems. When the number of systems is more than two, the feature vectors incur data transfer time between the microprocessor and the recognition systems. The transfer time is no more than 20 ms in a five-system operation. The master-slave operation can thus considerably reduce the total processing time.
VI. FPGA IMPLEMENTATION
We implemented the complete recognition system on an FPGA to verify various system operations and evaluate the entire system in a realistic environment. Fig. 10 shows the FPGA
board recognition system using an Altera APEX20KE running
at 10 MHz. The sampling clock generator, A/D converter,
serial port interface, and external SRAM were connected to the
FPGA board. The sampling rate was 11.025 kHz with 12-bit
quantization. The sequential control circuit substitutes for a microprocessor. Speech detection starts when a switch on the board is pushed and ends automatically after 1.5 seconds. A more standard push-to-talk interface (e.g., the user starts an utterance by pushing down a button and halts it by releasing the button) or automatic voice activity detection (VAD) [22] should be used in practical applications in future developments. The HMM
model parameters were transferred from a PC to the FPGA
board via the serial port before speech recognition testing.
The FPGA board system enabled users to utter speech using
REFERENCES
[1] J. Pihl, T. Svendsen, and M. H. Johnsen, "A VLSI implementation of pdf computations in HMM based speech recognition," in Proc. IEEE TENCON '96, 1996, pp. 241-246.
[2] W. Han, K. Hon, and C. Chan, "An HMM-based speech recognition IC," in Proc. IEEE ISCAS '03, vol. 2, 2003, pp. 744-747.
[3] S. J. Melnikoff, S. Quigley, and M. J. Russell, "Implementing a simple continuous speech recognition system on an FPGA," in Proc. IEEE Symp. FPGAs for Custom Computing Machines (FCCM '02), 2002, pp. 275-276.
[4] F. Vargas, R. Fagundes, and D. Barros, "A FPGA-based Viterbi algorithm implementation for speech recognition systems," in Proc. IEEE ICASSP '01, vol. 2, May 2001, pp. 1217-1220.
[5] L. R. Rabiner, "Recognition of isolated digits using hidden Markov models with continuous mixture densities," AT&T Tech. J., vol. 64, no. 6, pp. 1211-1234, 1985.
[6] M. Karnjanadecha and S. A. Zahorian, "Signal modeling for isolated word recognition," in Proc. IEEE ICASSP '99, vol. 1, Mar. 1999, pp. 293-296.
[7] X. Huang, Spoken Language Processing. Englewood Cliffs, NJ: Prentice-Hall, 2001.
[8] L. R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proc. IEEE, vol. 77, pp. 257-285, Feb. 1989.
[9] T. Watanabe, K. Shinoda, K. Takagi, and E. Yamada, "Speech recognition using tree-structured probability density function," in Proc. ICSLP '94, 1994, pp. 223-226.
[10] L. R. Rabiner and B. H. Juang, Fundamentals of Speech Recognition. Englewood Cliffs, NJ: Prentice-Hall, 1993.
[11] P. Beyerlein, "Fast log-likelihood computation for mixture densities in a high-dimensional feature space," in Proc. ICSLP '94, vol. S0722, 1994, pp. 53-54.
[12] S. Sagayama and S. Takahashi, "On the use of scalar quantization for fast HMM computation," in Proc. IEEE ICASSP '95, vol. 1, 1995, pp. 213-216.
[13] S. Yoshizawa, N. Wada, N. Hayasaka, and Y. Miyanaga, "Noise robust speech recognition focusing on time variation and dynamic range of speech feature parameters," in Proc. ISPACS '03, 2003, pp. 484-487.
[14] S. Boll, "Suppression of acoustic noise in speech using spectral subtraction," IEEE Trans. Acoust., Speech, Signal Process., vol. 27, no. 1, pp. 113-120, Feb. 1979.
Yoshikazu Miyanaga (S'80-M'83-SM'03) received the B.S., M.S., and Dr.Eng. degrees from Hokkaido University, Sapporo, Japan, in 1979, 1981, and 1986, respectively.
Since 1983, he has been with Hokkaido University, where he is a Professor in the Graduate
School of Information Science and Technology. His
research interests are adaptive signal processing,
nonlinear signal processing, and parallel-pipelined
VLSI systems.
Prof. Miyanaga is a member of IEICE, Information
Processing Society of Japan, and Acoustical Society of Japan.