
Isolated Word Recognition Project Using HMMs

Lawrence Rabiner
Center for Advanced Information Processing (CAIP)
Rutgers University
Piscataway, NJ 08854
lrr@caip.rutgers.edu
File Location: (c:\data\LaTex files\speech recognition course\digit recognition project\word recognition project HMMs.tex)

Word HMMs

Figure 1: HMM whole word model with number of states, $N_S = 5$.


Figure 1 shows an HMM for a whole word with the number of states, $N_S$, equal to 5. (Typical whole word models use a number of states between 5 and 7.) It is assumed that the self transition loop has a probability $a_{ii}$ in the range $[0.9, 0.95]$, with the transition to the next state having a value of $1 - a_{ii}$, for all states $i = 1, 2, \ldots, N_S - 1$ in the model. (State $N_S$ has a self transition probability of 1.0.)
If we assume that the feature vector (observation vector) of the speech signal (used for both training and testing models) corresponding to the whole word is of the form:

\[
O = [O_1, O_2, \ldots, O_T] \tag{1}
\]

where each $O_i$ is a $p$-dimensional cepstral vector (containing a set of $p = 12$ to $16$ cepstral (or mel-frequency cepstral) coefficients), optionally along with $p$ first cepstral differences, and optionally along with $p$ second cepstral differences, then $O_i$ is of the form:

\[
O_i = \begin{cases}
(c_1^{(i)}, c_2^{(i)}, \ldots, c_p^{(i)}) & \text{cepstral (or mel-frequency cepstral) coefficients only} \\[6pt]
(c_1^{(i)}, \ldots, c_p^{(i)}, \Delta c_1^{(i)}, \ldots, \Delta c_p^{(i)}) & \text{cepstral (or mel-frequency cepstral) coefficients along with first differences} \\[6pt]
(c_1^{(i)}, \ldots, c_p^{(i)}, \Delta c_1^{(i)}, \ldots, \Delta c_p^{(i)}, \Delta^2 c_1^{(i)}, \ldots, \Delta^2 c_p^{(i)}) & \text{cepstral (or mel-frequency cepstral) coefficients along with first and second differences}
\end{cases} \tag{2}
\]

The cepstral coefficients can be computed in several ways, including:
- deriving them from LPC processing of each frame, with conversion from LPC coefficients to cepstral coefficients
- deriving them by computing the FFT of the speech frame, taking the log magnitude of the spectrum, and then computing the inverse FFT of the log magnitude spectrum
- deriving the set of mfcc coefficients from simple processing of the log spectrum using the Slaney MATLAB code in the Auditory Toolbox.

The first and second cepstral differences can be derived from either the cepstral coefficients or the mel-frequency cepstral coefficients.
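The FFT-based method in the second bullet can be sketched as follows (a minimal sketch in Python rather than MATLAB; the function name, frame length, and the small floor added before taking the log are assumptions made for illustration):

```python
import numpy as np

def real_cepstrum(frame, p=12):
    """Compute the first p real cepstral coefficients of one windowed frame."""
    # FFT of the windowed speech frame
    spectrum = np.fft.fft(frame)
    # log magnitude of the spectrum (tiny floor avoids log(0))
    log_mag = np.log(np.abs(spectrum) + 1e-12)
    # inverse FFT of the log magnitude spectrum gives the real cepstrum
    cepstrum = np.fft.ifft(log_mag).real
    # keep c[1]..c[p]; c[0] (overall log energy/gain) is typically excluded
    return cepstrum[1:p + 1]
```

Note that the inverse FFT of a log magnitude spectrum is purely real (the log magnitude is an even function), so taking `.real` only discards numerical round-off.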
We assume throughout this discussion that the sampling rate of the signal is $F_S = 10{,}000$ Hz. Trivial modifications are made if the sampling rate is different from that specified above.

Signal Processing for LPC-Based Cepstral Coefficients

Figure 2: Signal processing for computing LPC-based cepstral coefficients from the speech signal.
Figure 2 shows the signal processing required to compute one set of cepstral
coefficients from simple operations on the speech signal. The speech signal, s[n],
is first pre-emphasized using a simple first order FIR filter, and then blocked into
40 msec frames (N = 400 speech samples) and weighted by a Hamming window.
Adjacent frames are spaced 10 msec apart (300 sample overlap, or M = 100 sample shift between frames). The autocorrelation of the windowed frame is
computed and then LPC analysis is performed using the Durbin method. The
resulting set of LPC coefficients is converted directly to the LPC cepstrum using
the appropriate recursion formula.
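The pre-emphasis, frame blocking, and windowing stages of Figure 2 can be sketched as follows (a minimal sketch; the function name and the pre-emphasis coefficient value of 0.95 are assumptions, and the signal is assumed to be at least one frame long):

```python
import numpy as np

FS = 10000   # sampling rate (Hz)
N = 400      # frame length: 40 msec at 10 kHz
M = 100      # frame shift: 10 msec at 10 kHz (300 sample overlap)

def block_into_frames(s, alpha=0.95):
    """Pre-emphasize s[n], block it into overlapping frames, apply a Hamming window."""
    # first order FIR pre-emphasis: s'[n] = s[n] - alpha * s[n-1]
    s = np.append(s[0], s[1:] - alpha * s[:-1])
    num_frames = 1 + (len(s) - N) // M
    window = np.hamming(N)
    return np.stack([s[l * M : l * M + N] * window for l in range(num_frames)])
```

Each row of the returned array is one windowed frame, ready for autocorrelation and Durbin LPC analysis.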
The cepstral coefficients derived from LPC processing are often windowed to ensure that the middle coefficients provide the most weight in all computations.
A typical cepstral window is of the form:

\[
w_c[n] = 1 + \frac{Q}{2}\sin\left(\frac{\pi n}{Q}\right), \quad n = 1, 2, \ldots, Q \tag{3}
\]

where $Q = 12$ to $16$ is the duration of the cepstral window weighting. The weighted cepstral coefficients are denoted as $\hat{c}_l[m] = c_l[m]\,w_c[m]$, $m = 1, 2, \ldots, Q$, where we have changed notation slightly to represent the cepstral coefficients of the $l$th frame of speech as $\{c_l[1], c_l[2], \ldots, c_l[Q]\}$. In a similar manner, the first cepstral differences are computed as:

\[
\Delta c_l[m] = \left[\sum_{k=-K}^{K} k\, c_{l+k}[m]\right] \cdot G, \quad 1 \le m \le Q, \; G = 0.375 \tag{4}
\]

where K is typically on the order of 3-5. In a similar manner we can compute


the second difference of the cepstral feature vectors as:
\[
\Delta^2 c_l[m] = G_1\left[\Delta c_{l+1}[m] - \Delta c_{l-1}[m]\right], \quad m = 1, 2, \ldots, Q, \; G_1 = 0.375 \tag{5}
\]
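Equations (3) and (4) can be sketched in code as follows (a minimal sketch; the function names and the edge-replication padding used at utterance boundaries are assumptions):

```python
import numpy as np

def cepstral_window(Q=12):
    """Cepstral liftering window of Eq. (3): w_c[n] = 1 + (Q/2) sin(pi*n/Q)."""
    n = np.arange(1, Q + 1)
    return 1.0 + (Q / 2.0) * np.sin(np.pi * n / Q)

def first_differences(c, K=3, G=0.375):
    """First cepstral differences of Eq. (4) for a (num_frames, Q) cepstral matrix.

    delta_c_l[m] = G * sum_{k=-K}^{K} k * c_{l+k}[m]
    Frames beyond the utterance boundaries are edge-replicated (an assumption;
    Eq. (4) does not specify the boundary handling)."""
    L, Q = c.shape
    padded = np.pad(c, ((K, K), (0, 0)), mode='edge')
    d = np.zeros_like(c)
    for k in range(-K, K + 1):
        d += k * padded[K + k : K + k + L]
    return G * d
```

For a linearly growing cepstral track the interior frames of the first difference reduce to $G \sum_k k^2$ times the per-frame slope, which is a quick sanity check on the implementation.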

HMM Segmental K-Means Training

For the HMM model of each digit, we assume the state transition coefficients are of the form:

\[
a = \begin{cases}
a_{ii} = 0.95, \; a_{i,i+1} = 0.05 & \text{states } 1 \text{ to } N_S - 1 \\
a_{N_S N_S} = 1.0 & \text{state } N_S
\end{cases} \tag{6}
\]
If we assume that the feature vector (observation) probability density is a mixture of Gaussians, we can write the probability of observing feature vector $O$ in state $j$ of the HMM model (for either training or testing) as:

\[
b_j(O) = \sum_{m=1}^{M} g_{mj}\, \mathcal{N}(O, \mu_{mj}, U_{mj}) \tag{7}
\]

where:
- $j$ is the state of the model, $j = 1, 2, \ldots, N_S$
- $m$ is the mixture number, $m = 1, 2, \ldots, M$
- $O$ is the observation vector of cepstral coefficients and possibly delta and delta-delta cepstral coefficients
- $g_{mj}$ is the mixture gain for the $m$th mixture in the $j$th state
- $\mathcal{N}$ is a Gaussian density function
- $\mu_{mj}$ is the mean vector for the $m$th mixture in state $j$
- $U_{mj}$ is the covariance of the $m$th mixture in state $j$ (assumed to be a diagonal covariance matrix)
Combining terms, the probability of observing feature vector $O$ in state $j$ of the HMM model is:

\[
b_j(O) = \sum_{m=1}^{M} g_{mj}\,
\frac{\displaystyle\prod_{q=1}^{Q} \exp\left[-\frac{(O[q] - \mu_{mjq})^2}{2\sigma^2_{mjq}}\right]}
{(2\pi)^{Q/2}\left(\displaystyle\prod_{q=1}^{Q}\sigma^2_{mjq}\right)^{1/2}} \tag{8}
\]
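For decoding it is the logarithm of Eq. (8) that is needed, and summing the mixture terms in the log domain requires the log-sum-exp trick for numerical stability. A minimal sketch (the function name and argument layout are assumptions):

```python
import numpy as np

def log_observation_prob(O, g, mu, var):
    """log b_j(O) for one state j, per Eq. (8).

    O:   (Q,)  observation vector
    g:   (M,)  mixture gains g_mj
    mu:  (M,Q) mixture means mu_mjq
    var: (M,Q) diagonal covariances sigma^2_mjq
    """
    # log of each diagonal-covariance Gaussian, all Q dimensions at once:
    # -0.5 * sum_q [ (O_q - mu_q)^2 / var_q + log(2*pi*var_q) ]
    log_gauss = -0.5 * np.sum((O - mu) ** 2 / var + np.log(2 * np.pi * var), axis=1)
    # mix in the log domain with log-sum-exp for stability
    a = np.log(g) + log_gauss
    amax = a.max()
    return amax + np.log(np.exp(a - amax).sum())
```

With a single mixture ($M = 1$, $g = 1$) this reduces to the log of one diagonal Gaussian, which is an easy case to verify by hand.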

Training Procedure

Figure 3: Use of the Viterbi algorithm to resegment words into HMM states in
an optimal manner
For a vocabulary of $V$ words, we independently form a whole word HMM for each individual word, $v_i$, $i = 1, 2, \ldots, V$, using the 3-step training procedure outlined below:

Step 1 - Initialization
- assume an initial uniform segmentation of each training token of each word, $v_i$, into states
- determine the mean vector, $\mu_{mj}$, and the diagonal covariance matrix, $\sigma^2_{mj}$, for all mixtures and states (assume that the number of mixtures is 1 for the time being)
- this process step gives initial word models for the $V$ words in the vocabulary, namely $\lambda_1, \lambda_2, \ldots, \lambda_V$

Step 2 - Viterbi alignment and segmentation
- resegment each training utterance into states using the Viterbi algorithm, as shown in Figure 3

Step 3 - Iteration
- re-estimate the model parameters ($\mu_{mj}$, $\sigma^2_{mj}$) from the re-segmented utterances
- iterate Steps 2 and 3 until convergence
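The uniform segmentation of Step 1 and the re-estimation of Step 3 can be sketched as follows (a minimal sketch, assuming single-mixture models and one feature matrix per training token; the function names are illustrative):

```python
import numpy as np

def uniform_segment(T, NS):
    """Step 1: assign T frames uniformly to NS states; returns a state index per frame."""
    return np.minimum((np.arange(T) * NS) // T, NS - 1)

def reestimate(features, states, NS):
    """Step 3: single-mixture mean and diagonal variance per state.

    features: (T, Q) feature matrix; states: (T,) state index per frame
    (from uniform segmentation or Viterbi resegmentation)."""
    mu = np.array([features[states == j].mean(axis=0) for j in range(NS)])
    var = np.array([features[states == j].var(axis=0) for j in range(NS)])
    return mu, var
```

In the full procedure, `reestimate` would pool the frames of all training tokens of a word, and the `states` array would come from Viterbi alignment (Step 2) on every iteration after the first.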

Log Viterbi Decoding

Viterbi decoding finds the best alignment path between the feature vector of the
whole word input signal and the HMM model states using a 5-step procedure.
If we assume that the whole word model, $\lambda_v$, for the $v$th word in the vocabulary is of the form:

\[
\lambda_v = \{\pi_i^v, a_{ij}^v, b_i^v\}, \quad 1 \le i, j \le N_S, \quad 1 \le v \le V \tag{9}
\]

and we assume that there are $T$ frames in the feature vector of the whole word model, then the implementation of log Viterbi decoding (we omit the superscript $v$ in the steps below for ease of notation) is the following:
1. Preprocessing

\[
\tilde{\pi}_i = \log(\pi_i), \quad 1 \le i \le N_S \tag{10}
\]
\[
\tilde{b}_i(O_t) = \log[b_i(O_t)], \quad 1 \le i \le N_S, \; 1 \le t \le T \tag{11}
\]
\[
\tilde{a}_{ij} = \log(a_{ij}), \quad 1 \le i, j \le N_S \tag{12}
\]

2. Initialization

\[
\tilde{\delta}_1(i) = \log(\delta_1(i)) = \tilde{\pi}_i + \tilde{b}_i(O_1), \quad 1 \le i \le N_S \tag{13}
\]
\[
\psi_1(i) = 0, \quad 1 \le i \le N_S \tag{14}
\]

3. Recursion

\[
\tilde{\delta}_t(j) = \log(\delta_t(j)) = \max_{1 \le i \le N_S}\left[\tilde{\delta}_{t-1}(i) + \tilde{a}_{ij}\right] + \tilde{b}_j(O_t) \tag{15}
\]
\[
2 \le t \le T, \quad 1 \le j \le N_S \tag{16}
\]
\[
\psi_t(j) = \operatorname*{arg\,max}_{1 \le i \le N_S}\left[\tilde{\delta}_{t-1}(i) + \tilde{a}_{ij}\right] \tag{17}
\]
\[
2 \le t \le T, \quad 1 \le j \le N_S \tag{18}
\]

4. Termination

\[
P^* = \max_{1 \le i \le N_S}\left[\tilde{\delta}_T(i)\right] \tag{19}
\]
\[
q_T^* = \operatorname*{arg\,max}_{1 \le i \le N_S}\left[\tilde{\delta}_T(i)\right] \tag{20}
\]

5. Backtracking

\[
q_t^* = \psi_{t+1}(q_{t+1}^*), \quad t = T-1, T-2, \ldots, 1 \tag{21}
\]

Figure 4: Use of log Viterbi decoding to find optimal decoding path for aligning
frames of a word token with a given word model.
Figure 4 shows the computation of the best path using the log Viterbi decoding
algorithm given above.
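The five steps above map directly into code. A minimal sketch (the function name and the array layout of the inputs are assumptions; the log-domain quantities of Eqs. (10)-(12) are assumed to be precomputed and passed in):

```python
import numpy as np

def log_viterbi(log_pi, log_a, log_b):
    """Log Viterbi decoding per Eqs. (13)-(21).

    log_pi: (NS,)    log initial state probabilities, pi~_i
    log_a:  (NS,NS)  log transition probabilities, a~_ij
    log_b:  (T,NS)   log observation likelihoods, b~_j(O_t)
    Returns (P*, q*) -- the best log score and best state path of length T."""
    T, NS = log_b.shape
    delta = np.zeros((T, NS))
    psi = np.zeros((T, NS), dtype=int)
    # Initialization (Eqs. 13-14)
    delta[0] = log_pi + log_b[0]
    # Recursion (Eqs. 15-18)
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_a   # scores[i, j] = delta~_{t-1}(i) + a~_ij
        psi[t] = np.argmax(scores, axis=0)       # best predecessor state for each j
        delta[t] = np.max(scores, axis=0) + log_b[t]
    # Termination (Eqs. 19-20)
    q = np.zeros(T, dtype=int)
    q[-1] = int(np.argmax(delta[-1]))
    P_star = float(delta[-1, q[-1]])
    # Backtracking (Eq. 21)
    for t in range(T - 2, -1, -1):
        q[t] = psi[t + 1][q[t + 1]]
    return P_star, q
```

Zero probabilities should be floored at a tiny positive value before taking logs, since $\log 0$ would otherwise produce non-finite entries.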
