
End-to-End Automatic Speech Recognition
KUNAL DHAWAN
KUMAR PRIYADARSHI

Towards End-to-End
Speech Recognition with
Recurrent Neural
Networks
ALEX GRAVES, DEEPMIND, UK
NAVDEEP JAITLY, UNIVERSITY OF TORONTO, CANADA
PROCEEDINGS OF THE 31ST INTERNATIONAL CONFERENCE ON MACHINE LEARNING, BEIJING, CHINA, 2014
Highlights

• Directly transcribes audio data with text, without requiring an intermediate phonetic representation
• Based on a combination of the deep bidirectional LSTM recurrent neural network architecture and the Connectionist Temporal Classification (CTC) objective function
• A modification to the objective function is introduced that trains the network to minimise the expectation of an arbitrary transcription loss function. This allows a direct optimisation of the word error rate, even in the absence of a lexicon or language model.
Model

• The spectrograms are processed by a deep bidirectional LSTM network (Graves et al., 2013) with a Connectionist Temporal Classification (CTC) output layer.
• The network is trained directly on the text transcripts: no phonetic representation (and hence no pronunciation dictionary) is used.
Expected Transcription Loss

• Given a target transcription y*, the network can then be trained to minimise the CTC objective function:
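In the paper this objective is simply the negative log-probability of the target transcription:

CTC(x) = −log Pr(y*|x)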

• This paper proposes a method that allows an RNN to be trained to optimise the expected value of an arbitrary loss function defined over output transcriptions (such as Word Error Rate).
• Given an input sequence x, the distribution Pr(y|x) over transcription sequences y defined by CTC, and a real-valued transcription loss function L(x, y), the expected transcription loss L(x) is defined as
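L(x) = Σ_y Pr(y|x) L(x, y)

i.e. the expectation of the transcription loss over transcriptions drawn from the network's output distribution.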

• In general, it is not easy to calculate this expectation exactly; instead, Monte-Carlo sampling is used to approximate both L and its gradient.
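A minimal sketch of such a Monte-Carlo estimate is given below, using a REINFORCE-style gradient with a mean baseline. Here sample_transcription, transcription_loss and log_prob_grad are hypothetical stand-ins for the CTC sampler, the task loss (e.g. word error rate) and the gradient of log Pr(y|x); this is an illustrative sketch, not the authors' implementation.

import numpy as np

def expected_loss_and_grad(x, sample_transcription, transcription_loss,
                           log_prob_grad, num_samples=16):
    # Draw transcriptions y ~ Pr(y|x) from the CTC network.
    samples = [sample_transcription(x) for _ in range(num_samples)]
    losses = np.array([transcription_loss(x, y) for y in samples])
    baseline = losses.mean()  # variance-reduction baseline
    # REINFORCE-style estimate: grad E[L] ~= mean[(L - baseline) * grad log Pr(y|x)]
    grad = sum((l - baseline) * log_prob_grad(x, y)
               for l, y in zip(losses, samples)) / num_samples
    return losses.mean(), grad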
Decoding

• Decoding a CTC network (that is, finding the most probable output transcription y for a given input sequence x) can be done to a first approximation by picking the single most probable output at every timestep and returning the corresponding transcription (see the sketch after these bullets).

• More accurate decoding can be performed with a beam search algorithm, which also makes it possible to integrate a language model.

• The experiments were carried out on the Wall Street Journal (WSJ) corpus and compared to a DNN-HMM model (Povey et al., 2011).
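A minimal sketch of the first-approximation (best-path) decoder: pick the argmax label per frame, collapse repeats, then drop blanks. It assumes an array probs of per-timestep character probabilities with the CTC blank at index 0; this is illustrative, not the paper's decoder.

import numpy as np

def greedy_ctc_decode(probs, alphabet, blank=0):
    best_path = probs.argmax(axis=1)          # most probable label per frame
    decoded, prev = [], None
    for label in best_path:
        if label != blank and label != prev:  # collapse repeats, skip blanks
            decoded.append(alphabet[label])
        prev = label
    return "".join(decoded)

# Toy example: 3 labels (blank, 'a', 'b') over 5 frames
probs = np.array([[0.1, 0.8, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.8, 0.1, 0.1],
                  [0.1, 0.1, 0.8],
                  [0.1, 0.1, 0.8]])
print(greedy_ctc_decode(probs, alphabet=["-", "a", "b"]))  # -> "ab"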
Results and Conclusions

• 1. The character-level RNN outperforms the baseline model when no language model is present.
• 2. The RNN retrained to minimise word error rate performed particularly well.
• The baseline system overtook the RNN as the LM was strengthened!
Key Takeaways

• The difference between the baseline model and this model is very small, considering that so much more prior information (audio pre-processing, pronunciation dictionary, state tying, forced alignment) was encoded into the baseline system.
• This paper has demonstrated that character-level speech transcription can be performed by a recurrent neural network with minimal preprocessing and no explicit phonetic representation.
LISTEN, ATTEND AND
SPELL: A NEURAL
NETWORK FOR LVCSR
WILLIAM CHAN, CARNEGIE MELLON UNIVERSITY
NAVDEEP JAITLY, QUOC LE, ORIOL VINYALS, GOOGLE BRAIN
ICASSP 2016
A truly End-to-End system

• Transcribes speech utterances directly to characters without pronunciation models, HMMs or other components of traditional speech recognizers
• The neural network architecture subsumes the acoustic, pronunciation and language models, making it not only an end-to-end trained system but an end-to-end model!
• In contrast to DNN-HMM, CTC and most other models, LAS makes no independence assumptions about the probability distribution of the output character sequences given the input acoustic sequence. For example, Connectionist Temporal Classification (CTC) and DNN-HMM systems assume that the neural network makes independent predictions at different times and use HMMs or language models (which make their own independence assumptions) to introduce dependencies between these predictions over time.
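To make the contrast concrete, CTC factorises the per-frame outputs independently given the acoustics (over a frame-level alignment π that is later collapsed to y), whereas LAS factorises the character sequence with the chain rule:

CTC: Pr(π|x) = Π_t Pr(π_t|x)        LAS: Pr(y|x) = Π_i Pr(y_i | x, y_<i)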
Model

• The system has two components: a listener and a speller.
• The listener is a pyramidal recurrent network encoder that accepts filter bank spectra as inputs.
• The speller is an attention-based recurrent network decoder that emits each character conditioned on all previous characters and the entire acoustic sequence.
Model - II

• LAS models each character output y_i as a conditional distribution over the previous characters y_<i and the input signal x using the chain rule for probabilities:
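Pr(y|x) = Π_i Pr(y_i | x, y_<i)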

• This objective makes the model a discriminative, end-to-end model, because it directly predicts the conditional probability of character sequences given the acoustic signal.
• The Listen (encoder) and AttendAndSpell (decoder) modules work as per the following equations:
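h = Listen(x)
Pr(y|x) = AttendAndSpell(h, y)

where h is the high-level representation of the input x produced by the listener.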
Listen

• A pyramidal bidirectional LSTM network is used to learn high-level features
• Why pyramidal? (see the time-reduction sketch after this list)
  ◦ A simple BiLSTM produced inferior results
  ◦ The attention model was ineffective due to the large number of time-steps
  ◦ Allows a deep architecture for learning non-linear feature representations
  ◦ Reduces the computational complexity
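A sketch of the time reduction between pyramid layers: outputs of adjacent time steps are concatenated before being fed to the next layer, halving the sequence length each time (three pyramid layers give an 8x reduction). Pure NumPy illustration of the reshaping only, with the LSTM layers omitted; not the paper's implementation.

import numpy as np

def pyramid_reduce(h):
    """h: (T, D) sequence of encoder outputs -> (T//2, 2*D)."""
    T, D = h.shape
    if T % 2:                    # drop an odd trailing frame for simplicity
        h = h[:-1]
    return h.reshape(-1, 2 * D)  # concatenate consecutive frame pairs

frames = np.random.randn(40, 256)   # e.g. 40 filter-bank frames
reduced = pyramid_reduce(pyramid_reduce(pyramid_reduce(frames)))
print(reduced.shape)                 # (5, 2048): 8x shorter sequence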
AttendAndSpell

• The AttendAndSpell function is computed using an attention-based LSTM transducer.

• The distribution for y_i is a function of the decoder state s_i and context c_i. The decoder state s_i is a function of the previous state s_i−1, the previously emitted character y_i−1 and the context c_i−1. The context vector c_i is produced by an attention mechanism. Specifically,
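s_i = RNN(s_i−1, y_i−1, c_i−1)
c_i = AttentionContext(s_i, h)
Pr(y_i | x, y_<i) = CharacterDistribution(s_i, c_i)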

• where CharacterDistribution is an MLP with softmax outputs over characters, and RNN is a 2-layer LSTM.
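A minimal NumPy sketch of the content-based AttentionContext step: score each listener timestep against the current decoder state, softmax the scores, and take the weighted sum of listener features as the context c_i. The plain weight matrices below stand in for the paper's MLP projections of the state and features; names and shapes are illustrative, not the paper's exact code.

import numpy as np

def attention_context(s_i, h, W_s, W_h):
    """s_i: (d_s,) decoder state; h: (U, d_h) listener features."""
    scores = (h @ W_h) @ (W_s @ s_i)   # (U,) energies e_{i,u}
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()               # attention weights alpha_{i,u}
    return alpha @ h                   # context vector c_i: (d_h,)

# Toy usage with random shapes
U, d_h, d_s, d_a = 12, 64, 32, 16
h = np.random.randn(U, d_h)
s_i = np.random.randn(d_s)
c_i = attention_context(s_i, h, W_s=np.random.randn(d_a, d_s),
                        W_h=np.random.randn(d_h, d_a))
print(c_i.shape)                       # (64,)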
Learning

• We train the parameters of our model to maximize the log probability of the correct sequences. Specifically,
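max_θ Σ_i log Pr(y_i | x, y*_<i; θ)

where y*_<i are the ground-truth previous characters (i.e. teacher forcing during training).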

• Experiment: clean and noisy Google voice search task


Results
Thank you!
