
Hybrid Recurrent Neural Networks:

An Application to Phoneme Classification


Rohitash Chandra
School of Science and Technology
The University of Fiji
Saweni, Lautoka, Fiji.

Christian W. Omlin
Department of Computer Science
The University of Western Cape
Cape Town, South Africa.

Abstract - We present a hybrid recurrent neural network architecture inspired by hidden Markov models which has been shown to learn and represent dynamical systems. We use genetic algorithms for training the hybrid architecture and show their contribution to speech phoneme classification. We use Mel frequency cepstral coefficient feature extraction methods on three pairs of phonemes obtained from the TIMIT speech database. We use hybrid recurrent neural networks for modelling each pair of phonemes. Our results demonstrate that the hybrid architecture can successfully model speech sequences.

Keywords: Genetic algorithms, Hidden Markov models, Phoneme classification and Recurrent neural networks.

1 Introduction

Recurrent neural networks have been an important focus of research as they can be applied to difficult problems involving time-varying patterns. Their applications range from speech recognition and financial prediction to gesture recognition [1]-[3]. Hidden Markov models, on the other hand, have been very popular in the field of speech recognition [4]. They have also been applied to other problems such as gesture recognition [5].

Recurrent neural networks are capable of modelling complicated recognition tasks. They have shown higher accuracy in speech recognition on low-quality, noisy data compared to hidden Markov models. However, hidden Markov models have been shown to perform better when it comes to large vocabulary speech recognition. One limitation of hidden Markov models in the application to speech recognition is the assumption that the probability of being in a state at time t depends only on the previous state, i.e. the state at time t-1. This assumption is inappropriate for speech signals, where dependencies often extend through several states; nevertheless, hidden Markov models have performed extremely well for certain types of speech recognition. Recurrent neural networks are dynamical systems, and it has been shown that they can represent deterministic finite automata in their internal weight representations [6].

The structural similarity between hidden Markov models and recurrent neural networks is the basis for constructing the hybrid recurrent neural network architecture. The recurrence equation in the recurrent neural network resembles the equation of the forward algorithm in hidden Markov models. The combination of the two paradigms into a hybrid system may provide better generalization and training performance, which would be a useful contribution to the field of machine learning and pattern recognition. We have previously introduced a slight variation of this architecture and shown that it can represent dynamical systems [7]. In this paper, we show that the hybrid recurrent neural network architecture can be applied to model speech sequences. We use Mel frequency cepstral coefficients (MFCC) for feature extraction from the TIMIT speech database. We extract three different pairs of phonemes and use hybrid recurrent neural networks for modelling them.

Evolutionary optimization techniques such as genetic algorithms have been popular for training neural networks as an alternative to gradient descent learning [8]. It has been observed that genetic algorithms overcome the problem of local minima, whereas in gradient descent search for the optimal solution it may be difficult to drive the network out of a local minimum, which in turn proves costly in terms of training time. In this paper, we show how genetic algorithms can be used for training the hybrid recurrent neural network architecture inspired by hidden Markov models. We use genetic algorithms to train hybrid recurrent neural networks for the classification of phonemes extracted from the TIMIT speech database. We end this paper with conclusions from our work and possible directions for future research.
2 Definitions and Methods

2.1 Recurrent Neural Networks

Recurrent neural networks maintain information about their past states for the computation of future states and outputs by using feedback connections. They are composed of an input layer, a context layer which provides state information, a hidden layer and an output layer, as shown in Fig. 1. Each layer contains one or more processing units called neurons, which propagate information from one layer to the next by computing a non-linear function of their weighted sum of inputs.

Fig. 1. The architecture of a first-order recurrent neural network. The recurrence from the hidden to the context layer is shown. Dashed lines indicate that more neurons can be used in each layer depending on the application.

Popular architectures of recurrent neural networks include first-order recurrent networks [8], second-order recurrent networks [9], NARX networks [10] and LSTM recurrent networks [11]. A detailed study of the vast variety of recurrent neural networks is beyond the scope of this paper; however, we will discuss the dynamics of first-order recurrent neural networks, given in (1):

S_i(t) = g\left( \sum_{k=1}^{K} V_{ik} S_k(t-1) + \sum_{j=1}^{J} W_{ij} I_j(t-1) \right)    (1)

where S_k(t) and I_j(t) represent the output of the state neurons and input neurons respectively, V_{ik} and W_{ij} represent their corresponding weights, and g(.) is a sigmoidal discriminant function. We will use this architecture to construct the hybrid recurrent neural network architecture inspired by hidden Markov models and show that it can learn and model speech sequences.
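To make the dynamics in (1) concrete, the following is a minimal Python sketch (not part of the original paper); the layer sizes and the logistic sigmoid as g(.) are assumptions made for illustration only.

import numpy as np

def sigmoid(x):
    # sigmoidal discriminant function g(.)
    return 1.0 / (1.0 + np.exp(-x))

def rnn_step(S_prev, I_prev, V, W):
    # One application of equation (1): the new state S(t) depends on the
    # previous state S(t-1) through V and on the previous input I(t-1) through W.
    return sigmoid(V @ S_prev + W @ I_prev)

# Example with assumed sizes: K = 4 state neurons, J = 12 inputs (one MFCC frame).
K, J = 4, 12
rng = np.random.default_rng(0)
V = rng.uniform(-1, 1, (K, K))          # state-to-state weights V_ik
W = rng.uniform(-1, 1, (K, J))          # input-to-state weights W_ij
S = np.zeros(K)                         # initial state
for I in rng.uniform(-1, 1, (5, J)):    # a short sequence of 5 input frames
    S = rnn_step(S, I, V, W)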
2.2 Hidden Markov Models

A hidden Markov model (HMM) describes a process which goes through a finite number of non-observable states whilst generating a signal of either discrete or continuous nature. In a first-order Markov model, the state at time t+1 depends only on the state at time t, regardless of the states at previous times [12]. Fig. 2 shows an example of a Markov model containing three states in a stochastic automaton.

Fig. 2. A first-order Markov model. π_i is the probability that the system will start in state S_i and a_ij is the probability that the system will move from state S_i to state S_j.

The model probabilistically links the observed signal to the state transitions in the system. The theory provides a means by which:

1. The probability P(O|λ) of an HMM with parameter set λ generating a particular observation sequence O can be calculated, through what is called the forward algorithm.

2. The most likely state sequence the system went through in generating the observed signal can be found, through the Viterbi algorithm.

3. A set of re-estimation formulas can be applied for iteratively updating the HMM parameters given an observation sequence as training data. These formulas strive to maximize the probability of the sequence being generated by the model. The algorithm is known as the Baum-Welch or forward-backward procedure.

The term "hidden" hints at the process state transition sequence, which is hidden from the observer. The process reveals itself to the observer only through the generated observable signal. An HMM is parameterized through a matrix of transition probabilities between states and output probability distributions for observed signal frames given the internal process state. These probabilities are used in the algorithms mentioned above to achieve the desired results.
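A minimal sketch of the forward algorithm from item 1 above is given below (Python; not part of the original paper). A discrete observation model is assumed for brevity, whereas the paper uses Gaussian observation densities; the 3-state example parameters are purely illustrative.

import numpy as np

def forward(pi, A, B, obs):
    # pi[i]  : probability of starting in state i
    # A[i,j] : transition probability a_ij from state i to state j
    # B[i,o] : probability of emitting symbol o in state i (discrete case)
    # Returns P(O | lambda) via alpha_j(t) = (sum_i alpha_i(t-1) a_ij) b_j(O_t).
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
    return alpha.sum()

# Example with an assumed 3-state model and a 2-symbol alphabet.
pi = np.array([0.6, 0.3, 0.1])
A = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.2, 0.2, 0.6]])
B = np.array([[0.9, 0.1],
              [0.5, 0.5],
              [0.1, 0.9]])
print(forward(pi, A, B, obs=[0, 1, 1, 0]))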

2.3 Evolutionary Training of Recurrent Neural Networks

Genetic algorithms provide a learning method motivated by biological evolution. They are search techniques that can be used both for solving problems and for modelling evolutionary systems [13]. The problem faced by genetic algorithms is to search a space of candidate hypotheses and find the best hypothesis. The fitness of a hypothesis is a numerical measure of how well it optimizes the problem. The algorithm operates by iteratively updating a pool of hypotheses, called the population. The population consists of many individuals called chromosomes. All members of the population are evaluated by the fitness function in each iteration. A new population is then generated by probabilistically selecting the fittest chromosomes from the current population. Some of the selected chromosomes are added to the new generation while others are selected as parent chromosomes. Parent chromosomes are used for creating new offspring by applying genetic operators such as crossover and mutation. Traditionally, chromosomes are represented as bit strings; however, real-number representations are also possible.

In order to use genetic algorithms for training neural networks, we need to represent the problem as chromosomes. Real-valued weights must be encoded in the chromosome rather than binary values. This is done by altering the crossover and mutation operators. The crossover operator takes two parent chromosomes and creates a single child chromosome by randomly selecting corresponding genetic material from both parents. The mutation operator adds a small random real number between -1 and 1 to a randomly selected gene in the chromosome.

In evolutionary neural learning, the task of genetic algorithms is to find the optimal set of weights in a network which minimizes the error function. The fitness function must reflect the performance of the neural network; thus, the fitness function is the reciprocal of the sum of squared errors of the neural network. To evaluate the fitness function, each weight encoded in the chromosome is assigned to the respective weight link of the network. The training set of examples is then presented to the network, which propagates the information forward, and the sum of squared errors is calculated. In this way, genetic algorithms attempt to find a set of weights which minimizes the error function of the network. Recurrent neural networks have been trained by evolutionary computation methods such as genetic algorithms, which optimize the weights in the network architecture for a particular problem. Compared to gradient descent learning, genetic algorithms can help the network escape from local minima.
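A minimal sketch of this evolutionary training loop is given below (Python; not part of the original paper). The selection scheme, elitism and the user-supplied predict function are simplifications and assumptions; only the real-valued crossover and mutation operators and the fitness 1/SSE follow the description above.

import numpy as np

rng = np.random.default_rng(1)

def fitness(weights, predict, X, Y):
    # Fitness is the reciprocal of the sum of squared errors of the network.
    sse = np.sum((predict(weights, X) - Y) ** 2)
    return 1.0 / (sse + 1e-12)

def crossover(p1, p2):
    # Child takes each gene (weight) at random from one of the two parents.
    mask = rng.random(p1.shape) < 0.5
    return np.where(mask, p1, p2)

def mutate(child, p_mut=0.1):
    # With probability p_mut, add a small random real number in [-1, 1] to one gene.
    if rng.random() < p_mut:
        child[rng.integers(len(child))] += rng.uniform(-1.0, 1.0)
    return child

def evolve(predict, X, Y, n_weights, pop_size=40, p_cross=0.7, generations=100):
    pop = rng.uniform(-7, 7, (pop_size, n_weights))        # weights initialized in [-7, 7]
    for _ in range(generations):
        fit = np.array([fitness(c, predict, X, Y) for c in pop])
        probs = fit / fit.sum()
        new_pop = [pop[np.argmax(fit)].copy()]              # keep the best chromosome (assumed elitism)
        while len(new_pop) < pop_size:
            i, j = rng.choice(pop_size, size=2, p=probs)    # fitness-proportional selection (assumed)
            child = crossover(pop[i], pop[j]) if rng.random() < p_cross else pop[i].copy()
            new_pop.append(mutate(child))
        pop = np.array(new_pop)
    return pop[np.argmax([fitness(c, predict, X, Y) for c in pop])]

# Usage (assumed linear toy model for illustration):
# best = evolve(lambda w, X: X @ w, X_train, Y_train, n_weights=X_train.shape[1])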
2.4 Speech Phoneme Classification

A speech sequence contains a huge amount of irrelevant information. In order to model speech sequences, feature extraction is necessary. In feature extraction, useful information is extracted from the speech sequence and then used for modelling. Recurrent neural networks and hidden Markov models have been successfully applied to modelling speech sequences [1, 4]. They have been applied to recognize words and phonemes. The performance of a speech recognition system can be measured in terms of accuracy and speed. Recurrent neural networks are capable of modelling complicated recognition tasks. They have shown higher accuracy in recognition on low-quality, noisy data compared to hidden Markov models. However, hidden Markov models have been shown to perform better when it comes to large vocabularies. Extensive research on speech recognition has been carried out for more than forty years; however, scientists have been unable to implement systems which show excellent performance in environments with background noise.

Mel frequency cepstral coefficients (MFCC) are a useful feature extraction technique, as the Mel filter has characteristics similar to the human auditory system [14]. The human ear performs similar processing before presenting information to the brain. We will apply MFCC feature extraction to phonemes obtained from the TIMIT speech database. In MFCC feature extraction, a frame of the speech signal is passed through the discrete Fourier transform to change the signal from the time domain to the frequency domain. The resulting spectrum is then mapped onto the Mel scale using triangular overlapping windows. Finally, we compute the log energy at the output of each filter and apply a discrete cosine transform to the Mel log amplitudes, which yields the vector of MFCC features.
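A minimal sketch of this MFCC pipeline for a single frame is given below (Python; not part of the original paper). The sampling rate, the number of Mel filters and the filter-bank construction are assumptions made for illustration; only the chain DFT, Mel filter bank, log energy and discrete cosine transform follows the description above.

import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    # Triangular overlapping windows spaced evenly on the Mel scale.
    mel_points = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            fb[i - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fb[i - 1, k] = (r - k) / max(r - c, 1)
    return fb

def mfcc_frame(frame, sr=16000, n_filters=26, n_coeffs=12):
    # DFT of the frame -> power spectrum -> Mel filter bank -> log -> DCT.
    n_fft = len(frame)                                   # e.g. 512 samples = 32 ms at 16 kHz
    spectrum = np.abs(np.fft.rfft(frame)) ** 2
    energies = mel_filterbank(n_filters, n_fft, sr) @ spectrum
    return dct(np.log(energies + 1e-10), norm='ortho')[:n_coeffs]

# Example: 12 MFCC features from one 512-sample frame of noise.
print(mfcc_frame(np.random.default_rng(0).standard_normal(512)))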
3 Hybrid Recurrent Neural Networks Inspired by Hidden Markov Models

3.1 Motivation

We have stated earlier that the structural similarities between hidden Markov models and recurrent neural networks form the basis for combining the two paradigms into the hybrid architecture. Why is this a good idea? Most often, first-order hidden Markov models are used in practice, in which the state transition probabilities depend only on the previous state. This assumption is unrealistic for many real-world applications of hidden Markov models. Furthermore, the number of states in the hidden Markov model needs to be fixed beforehand for a particular application; therefore, the number of states varies from application to application. The theory on recurrent neural networks and hidden Markov models suggests that the combination of the two paradigms may provide better generalization and training performance. Our proposed hybrid recurrent neural network architecture may also have the capability of learning higher-order dependencies, and one does not need to fix the number of states as in the case of hidden Markov models.

3.2 Derivation

Consider the equation of the forward procedure for the calculation of the probability of an observation sequence O given the model λ, i.e. P(O|λ), in hidden Markov models, given in (2):

\alpha_j(t) = \left( \sum_{i=1}^{N} \alpha_i(t-1) a_{ij} \right) b_j(O_t), \quad 1 \le j \le N    (2)

where N is the number of hidden states in the HMM, a_{ij} is the probability of making a transition from state i to state j, and b_j(O_t) is the Gaussian distribution for the observation at time t. The calculation in (2) is inherently recurrent and bears resemblance to the recursion of recurrent neural networks, given in (3):

x_j(t) = f\left( \sum_{i=1}^{N} x_i(t-1) w_{ij} \right), \quad 1 \le j \le N    (3)

where f(.) is a non-linearity such as the sigmoid, N is the number of hidden neurons and w_{ij} are the weights connecting the neurons with each other and with the input nodes. Equation (1) describes the dynamics of the first-order recurrent neural network, which is combined with the Gaussian distribution feature of (2) to form the hybrid architecture. We replace the subscript j in b_j(O_t), which in hidden Markov models denotes the state, by the time t in order to incorporate this feature into recurrent neural networks. Hence the dynamics of the hybrid recurrent network architecture is given by:

S_i(t) = f\left( \sum_{k=1}^{K} V_{ik} S_k(t-1) + \left( \sum_{j=1}^{J} W_{ij} I_j(t-1) \right) \cdot b_{t-1}(O) \right)    (4)

where b_{t-1}(O) is the Gaussian distribution. Note that the time subscript of b_{t-1}(O) in (4) differs from the subscript of the Gaussian distribution in (2). The dynamics of hidden Markov models and recurrent networks differ in this respect; however, we can adjust the time parameter as shown in (4) in order to map hidden Markov models into recurrent neural networks. For a single input, the univariate Gaussian distribution is used, given in (5):

b_t(O) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{1}{2} \frac{(O-\mu)^2}{\sigma^2} \right)    (5)

where O_t is the observation at time t, μ is the mean and σ² is the variance. For multiple inputs to the hybrid recurrent network, the multivariate Gaussian for d dimensions is used, given in (6):

b_t(O) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (O-\mu)^T \Sigma^{-1} (O-\mu) \right)    (6)

where O is a d-component column vector, μ is a d-component mean vector, Σ is a d-by-d covariance matrix, and |Σ| and Σ^{-1} are its determinant and inverse, respectively.

Fig. 3 shows how the Gaussian distribution of the hidden Markov model is used to build the hybrid recurrent neural network. The output of the multivariate Gaussian function depends solely on the mean, which is a vector of the same size as the input vector. This parameter is also represented in the chromosomes, together with the weights and biases, and is trained by genetic algorithms in order to model speech sequences.

Fig. 3. The architecture of hybrid recurrent neural networks. The dashed lines indicate that the architecture can represent more neurons in each layer if required. The multivariate Gaussian function is shown by the shaded node.
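A minimal sketch of the hybrid state update of (4) with the multivariate Gaussian of (6) is given below (Python; not part of the original paper). The identity covariance, the sigmoid non-linearity and the layer sizes are assumptions made for illustration; in the paper, the mean vector is evolved along with the weights by the genetic algorithm.

import numpy as np

def multivariate_gaussian(o, mu, cov):
    # Equation (6): density of observation vector o under mean mu and covariance cov.
    d = len(o)
    diff = o - mu
    norm = 1.0 / (((2 * np.pi) ** (d / 2)) * np.sqrt(np.linalg.det(cov)))
    return norm * np.exp(-0.5 * diff @ np.linalg.solve(cov, diff))

def hybrid_step(S_prev, I_prev, V, W, mu, cov):
    # Equation (4): the input term is scaled by the Gaussian response b_{t-1}(O)
    # before being combined with the recurrent term and squashed.
    b = multivariate_gaussian(I_prev, mu, cov)
    return 1.0 / (1.0 + np.exp(-(V @ S_prev + (W @ I_prev) * b)))

# Example with assumed sizes: 12 MFCC inputs, 4 hidden/state neurons.
rng = np.random.default_rng(0)
K, J = 4, 12
V, W = rng.uniform(-1, 1, (K, K)), rng.uniform(-1, 1, (K, J))
mu, cov = np.zeros(J), np.eye(J)     # mean is a trainable parameter; identity covariance assumed
S = np.zeros(K)
for O in rng.standard_normal((5, J)):
    S = hybrid_step(S, O, V, W, mu, cov)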
4 Empirical Results and Discussion

4.1 MFCC Feature Extraction

We used Mel frequency cepstral coefficients for feature extraction from three pairs of phonemes read from the TIMIT speech database. For each phoneme read, we used a frame size of 512 samples with a new frame every 256 sample points. We applied the MFCC feature extraction technique discussed previously and obtained 12 MFCC features for each frame of the phoneme. The feature extraction results are shown in Table 1, which gives information about each extracted phoneme pair. The frame size of 512 samples corresponds to 32 ms.

Table 1
MFCC Feature Extraction

Phoneme Pair    No. of Training Samples    No. of Testing Samples    Frame Size
'b' and 'd'               645                       238                 512
'k' and 'dx'             4199                      1412                 512
'q' and 'jh'             4196                      1420                 512

The three pairs of phonemes extracted from the TIMIT speech dataset. The frame size of 512 samples is equal to 32 ms.
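The framing described above can be sketched as follows (Python; not part of the original paper). The 16 kHz sampling rate is inferred from the stated equivalence of 512 samples to 32 ms; the frame size of 512 and hop of 256 sample points follow the text.

import numpy as np

def frame_signal(signal, frame_size=512, hop=256):
    # Split a phoneme waveform into overlapping frames: 512 samples (32 ms at 16 kHz),
    # advancing by 256 sample points, as described in Section 4.1.
    n_frames = 1 + max(0, (len(signal) - frame_size) // hop)
    return np.stack([signal[i * hop : i * hop + frame_size] for i in range(n_frames)])

# Example: a 0.5 s phoneme at the assumed 16 kHz rate gives (8000 - 512)//256 + 1 = 30 frames,
# each of which would then be reduced to 12 MFCC features.
frames = frame_signal(np.random.default_rng(0).standard_normal(8000))
print(frames.shape)   # (30, 512)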
4.2 Training Hybrid Recurrent Neural Networks for Phoneme Classification

In the hybrid recurrent neural network architecture, the neurons in the hidden layer compute the weighted sum of their inputs and multiply it by the output of the corresponding Gaussian function, which receives its inputs from the input layer. The product of the neuron activation and the output of the Gaussian function is then propagated from the hidden layer to the output layer, as shown in Fig. 3.

We obtained the training and testing data sets from the TIMIT database as shown in Table 1. We trained all the parameters of the hybrid architecture, i.e. the weights connecting the input to the hidden layer, the weights connecting the hidden to the output layer and the weights connecting the context to the hidden layer. We also trained the bias weights and the mean vector as a parameter of the multivariate Gaussian distribution function, which receives its inputs from the input layer. We used the following hybrid recurrent neural network topology: 12 neurons in the input layer, representing the speech feature input, and 2 neurons in the output layer, representing a pair of phonemes. We experimented with different numbers of neurons in the hidden layer. We ran sample experiments and found that a population size of 40, a crossover probability of 0.7 and a mutation probability of 0.1 gave good genetic training performance; therefore, we used these values for training. We initialized the hybrid architecture with random weights in the range of -7 to 7 and used the genetic algorithm to train these weights. In effect, the genetic algorithm shuffles these weights and finds the optimal set of weights which best learns the training samples.
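For reference, the training configuration reported in this section could be collected as follows (Python; not part of the original paper). The dictionary and its key names are hypothetical; the values are the ones stated in the text and in Table 2.

# Hypothetical configuration dictionary gathering the settings reported in Section 4.2.
config = {
    "input_neurons": 12,         # one neuron per MFCC feature
    "output_neurons": 2,         # one neuron per phoneme in the pair
    "hidden_neurons": [12, 15],  # hidden layer sizes experimented with
    "population_size": 40,
    "crossover_probability": 0.7,
    "mutation_probability": 0.1,
    "weight_init_range": (-7, 7),
    "max_generations": 100,
}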
The maximum training time was 100 generations. We recorded the performance of the network being trained by the genetic algorithm at every generation until the maximum training time was reached. We used the best performance of the network as a stopping criterion for training. Using the information obtained for the stopping criterion, we started training. Once the network performed well up to the stopping criterion, we halted training and presented the network with the testing data set, which was not included during training. The trained hybrid recurrent neural network thus uses the knowledge gained in the training process to make a generalization. The results of all experiments are shown in Table 2. The training and generalization performance is given by the percentage of samples correctly classified in the training and testing sets, respectively.

Table 2
Hybrid Recurrent Neural Networks for Phoneme Classification

Phoneme Pair    Hidden Neurons    Training Time    Training Performance    Generalization Performance
'b', 'd'              12                2                 87.75%                    82.6%
'b', 'd'              15                5                 87.75%                    82.6%
'k', 'dx'             12                2                 87.21%                    80.92%
'k', 'dx'             15                2                 87.21%                    80.92%
'q', 'jh'             12                2                 79%                       75.81%
'q', 'jh'             15                3                 79%                       75.81%

The training time is given by the number of generations it takes the hybrid architecture to learn the training examples. The maximum training time is 100 generations. The training and generalization performance is given by the percentage of samples correctly classified by the network.

5 Conclusions

We have successfully combined the strengths of hidden Markov models and recurrent neural networks to construct a hybrid recurrent neural network architecture. The structural similarities between hidden Markov models and recurrent neural networks have been the basis for the successful mapping in the hybrid architecture. We have used genetic algorithms to train the hybrid system of hidden Markov models and recurrent neural networks. Our results show that speech sequences can be learned and modelled by hybrid recurrent neural networks. An open question remains regarding the suitability of the hybrid architecture for modelling other real-world application problems involving temporal sequences.
6 References

[1] A. J. Robinson, "An application of recurrent nets to phone probability estimation", IEEE Transactions on Neural Networks, vol. 5, no. 2, 298-305, 1994.

[2] C. L. Giles, S. Lawrence and A. C. Tsoi, "Rule inference for financial prediction using recurrent neural networks", Proc. of the IEEE/IAFE Computational Intelligence for Financial Engineering, New York City, USA, 253-259, 1997.

[3] K. Murakami and H. Taguchi, "Gesture recognition using recurrent neural networks", Proc. of the SIGCHI Conference on Human Factors in Computing Systems: Reaching Through Technology, Louisiana, USA, 237-242, 1991.

[4] M. J. F. Gales, "Maximum likelihood linear transformations for HMM-based speech recognition", Computer Speech and Language, vol. 12, 75-98, 1998.

[5] T. Kobayashi and S. Haruyama, "Partly-hidden Markov model and its application to gesture recognition", Proc. of the IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 4, p. 3081, 1997.

[6] C. Lee Giles, C. W. Omlin and K. Thornber, "Equivalence in knowledge representation: automata, recurrent neural networks, and dynamical systems", Proc. of the IEEE, vol. 87, no. 9, 1623-1640, 1999.

[7] R. Chandra and C. W. Omlin, "Evolutionary training of hybrid systems of recurrent neural networks and hidden Markov models", Transactions on Engineering, Computing and Technology: Proc. of the International Conference on Neural Networks, vol. 15, Barcelona, Spain, 58-63, October 2006.

[8] P. Manolios and R. Fanelli, "First order recurrent neural networks and deterministic finite state automata", Neural Computation, vol. 6, no. 6, 1154-1172, 1994.

[9] R. L. Watrous and G. M. Kuhn, "Induction of finite-state languages using second-order recurrent networks", Proc. of Advances in Neural Information Processing Systems, California, USA, 309-316, 1992.

[10] T. Lin, B. G. Horne, P. Tino and C. L. Giles, "Learning long-term dependencies in NARX recurrent neural networks", IEEE Transactions on Neural Networks, vol. 7, no. 6, 1329-1338, 1996.

[11] S. Hochreiter and J. Schmidhuber, "Long short-term memory", Neural Computation, vol. 9, no. 8, 1735-1780, 1997.

[12] E. Alpaydin, Introduction to Machine Learning, The MIT Press, London, 306-311, 2004.

[13] T. M. Mitchell, Machine Learning, McGraw Hill, 1997.
