
End-to-End Automatic Speech Recognition
KUNAL DHAWAN
KUMAR PRIYADARSHI

Towards End-to-End
Speech Recognition with
Recurrent Neural
Networks
ALEX GRAVES, DEEPMIND, UK
NAVDEEP JAITLY, UNIVERSITY OF TORONTO, CANADA
PROCEEDINGS OF THE 31ST INTERNATIONAL CONFERENCE ON MACHINE LEARNING, BEIJING, CHINA, 2014
Highlights

• Directly transcribes audio data with text, without requiring an intermediate phonetic representation
• Based on a combination of the deep bidirectional LSTM recurrent neural network architecture and the Connectionist Temporal Classification (CTC) objective function
• A modification to the objective function is introduced that trains the network to minimise the expectation of an arbitrary transcription loss function. This allows a direct optimisation of the word error rate, even in the absence of a lexicon or language model.
Model

• The spectrograms are processed by a deep bidirectional LSTM network (Graves et al., 2013) with a Connectionist Temporal Classification (CTC) output layer.
• The network is trained directly on the text transcripts: no phonetic representation (and hence no pronunciation dictionary) is used.
Expected Transcription Loss

• Given a target transcription y*, the network can then be trained to minimise the CTC objective function:
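In the paper this objective is simply the negative log-probability of the target transcription:

CTC(x) = −log Pr(y*|x)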

• This paper proposes a method that allows an RNN to be trained to optimise the expected value of an arbitrary loss function defined over output transcriptions (such as Word Error Rate).
• Given an input sequence x, the distribution Pr(y|x) over transcription sequences y defined by CTC, and a real-valued transcription loss function L(x, y), the expected transcription loss L(x) is defined as
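L(x) = Σ_y Pr(y|x) L(x, y)

i.e. the expectation of the transcription loss over transcriptions drawn from the network's output distribution.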

• In general, it is not easy to calculate this expectation exactly; instead, Monte-Carlo sampling is used to approximate both L and its gradient.
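A minimal sketch of such a Monte-Carlo estimate is given below, using a REINFORCE-style gradient with a mean baseline. Here sample_transcription, transcription_loss and log_prob_grad are hypothetical stand-ins for the CTC sampler, the task loss (e.g. word error rate) and the gradient of log Pr(y|x); this is an illustrative sketch, not the authors' implementation.

import numpy as np

def expected_loss_and_grad(x, sample_transcription, transcription_loss,
                           log_prob_grad, num_samples=16):
    # Draw transcriptions y ~ Pr(y|x) from the CTC network.
    samples = [sample_transcription(x) for _ in range(num_samples)]
    losses = np.array([transcription_loss(x, y) for y in samples])
    baseline = losses.mean()  # variance-reduction baseline
    # REINFORCE-style estimate: grad E[L] ~= mean[(L - baseline) * grad log Pr(y|x)]
    grad = sum((l - baseline) * log_prob_grad(x, y)
               for l, y in zip(losses, samples)) / num_samples
    return losses.mean(), grad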
Decoding

• Decoding a CTC network (that is, finding the most probable output transcription y for a given input sequence x) can be done to a first approximation by picking the single most probable output at every timestep and returning the corresponding transcription (see the sketch after these bullets).

• More accurate decoding can be performed with a beam search algorithm, which also makes it possible to integrate a language model.

• The experiments were carried out on the Wall Street Journal (WSJ) corpus and compared to a DNN-HMM model (Povey et al., 2011).
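A minimal sketch of the first-approximation (best-path) decoder: pick the argmax label per frame, collapse repeats, then drop blanks. It assumes an array probs of per-timestep character probabilities with the CTC blank at index 0; this is illustrative, not the paper's decoder.

import numpy as np

def greedy_ctc_decode(probs, alphabet, blank=0):
    best_path = probs.argmax(axis=1)          # most probable label per frame
    decoded, prev = [], None
    for label in best_path:
        if label != blank and label != prev:  # collapse repeats, skip blanks
            decoded.append(alphabet[label])
        prev = label
    return "".join(decoded)

# Toy example: 3 labels (blank, 'a', 'b') over 5 frames
probs = np.array([[0.1, 0.8, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.8, 0.1, 0.1],
                  [0.1, 0.1, 0.8],
                  [0.1, 0.1, 0.8]])
print(greedy_ctc_decode(probs, alphabet=["-", "a", "b"]))  # -> "ab"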
Results and Conclusions

• 1. The character-level RNN outperforms the baseline model when no language model is present.
• 2. The RNN retrained to minimise word error rate performed particularly well.
• The baseline system overtook the RNN as the LM was strengthened!
Key Takeaways

• The difference between the baseline model and this model is very small, considering that so much more prior information (audio pre-processing, pronunciation dictionary, state tying, forced alignment) was encoded into the baseline system.
• This paper has demonstrated that character-level speech transcription can be performed by a recurrent neural network with minimal preprocessing and no explicit phonetic representation.
LISTEN, ATTEND AND
SPELL: A NEURAL
NETWORK FOR LVCSR
WILLIAM CHAN, CARNEGIE MELLON UNIVERSITY
NAVDEEP JAITLY, QUOC LE, ORIOL VINYALS, GOOGLE BRAIN
ICASSP 2016
A truly End-to-End system

• Transcribes speech utterances directly to characters without pronunciation models, HMMs or other components of traditional speech recognizers
• The neural network architecture subsumes the acoustic, pronunciation and language models, making it not only an end-to-end trained system but an end-to-end model!
• In contrast to DNN-HMM, CTC and most other models, LAS makes no independence assumptions about the probability distribution of the output character sequences given the input acoustic sequence. For example, Connectionist Temporal Classification (CTC) and DNN-HMM systems assume that the neural network makes independent predictions at different times and use HMMs or language models (which make their own independence assumptions) to introduce dependencies between these predictions over time.
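To make the contrast concrete, CTC factorises the per-frame outputs independently given the acoustics (over a frame-level alignment π that is later collapsed to y), whereas LAS factorises the character sequence with the chain rule:

CTC: Pr(π|x) = Π_t Pr(π_t|x)        LAS: Pr(y|x) = Π_i Pr(y_i | x, y_<i)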
Model

• The system has two components: a listener and a speller.
• The listener is a pyramidal recurrent network encoder that accepts filter bank spectra as inputs.
• The speller is an attention-based recurrent network decoder that emits each character conditioned on all previous characters and the entire acoustic sequence.
Model - II

• LAS models each character output y_i as a conditional distribution over the previous characters y_<i and the input signal x using the chain rule for probabilities:
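Pr(y|x) = Π_i Pr(y_i | x, y_<i)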

• This objective makes the model a discriminative, end-to-end model, because it directly predicts the conditional probability of character sequences given the acoustic signal.
• The Listen (encoder) and AttendAndSpell (decoder) modules work as per the following equations:
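h = Listen(x)
Pr(y|x) = AttendAndSpell(h, y)

where h is the high-level representation of the input x produced by the listener.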
Listen

• A pyramidal bidirectional LSTM network is used to learn high-level features
• Why pyramidal? (see the time-reduction sketch after this list)
  ◦ A simple BiLSTM produced inferior results
  ◦ The attention model was ineffective due to the large number of time-steps
  ◦ Allows a deep architecture for learning non-linear feature representations
  ◦ Reduces the computational complexity
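A sketch of the time reduction between pyramid layers: outputs of adjacent time steps are concatenated before being fed to the next layer, halving the sequence length each time (three pyramid layers give an 8x reduction). Pure NumPy illustration of the reshaping only, with the LSTM layers omitted; not the paper's implementation.

import numpy as np

def pyramid_reduce(h):
    """h: (T, D) sequence of encoder outputs -> (T//2, 2*D)."""
    T, D = h.shape
    if T % 2:                    # drop an odd trailing frame for simplicity
        h = h[:-1]
    return h.reshape(-1, 2 * D)  # concatenate consecutive frame pairs

frames = np.random.randn(40, 256)   # e.g. 40 filter-bank frames
reduced = pyramid_reduce(pyramid_reduce(pyramid_reduce(frames)))
print(reduced.shape)                 # (5, 2048): 8x shorter sequence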
AttendAndSpell

• The AttendAndSpell function is computed using an attention-based LSTM transducer.

• The distribution for y_i is a function of the decoder state s_i and context c_i. The decoder state s_i is a function of the previous state s_i−1, the previously emitted character y_i−1 and the context c_i−1. The context vector c_i is produced by an attention mechanism. Specifically,
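s_i = RNN(s_i−1, y_i−1, c_i−1)
c_i = AttentionContext(s_i, h)
Pr(y_i | x, y_<i) = CharacterDistribution(s_i, c_i)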

• where CharacterDistribution is an MLP with softmax outputs over characters, and RNN is a 2-layer LSTM.
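A minimal NumPy sketch of the content-based AttentionContext step: score each listener timestep against the current decoder state, softmax the scores, and take the weighted sum of listener features as the context c_i. The plain weight matrices below stand in for the paper's MLP projections of the state and features; names and shapes are illustrative, not the paper's exact code.

import numpy as np

def attention_context(s_i, h, W_s, W_h):
    """s_i: (d_s,) decoder state; h: (U, d_h) listener features."""
    scores = (h @ W_h) @ (W_s @ s_i)   # (U,) energies e_{i,u}
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()               # attention weights alpha_{i,u}
    return alpha @ h                   # context vector c_i: (d_h,)

# Toy usage with random shapes
U, d_h, d_s, d_a = 12, 64, 32, 16
h = np.random.randn(U, d_h)
s_i = np.random.randn(d_s)
c_i = attention_context(s_i, h, W_s=np.random.randn(d_a, d_s),
                        W_h=np.random.randn(d_h, d_a))
print(c_i.shape)                       # (64,)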
Learning

• We train the parameters of our model to maximize the log probability of the correct sequences. Specifically,
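max_θ Σ_i log Pr(y_i | x, y*_<i; θ)

where y*_<i are the ground-truth previous characters (i.e. teacher forcing during training).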

• Experiment: clean and noisy Google voice search task


Results
Thank you!
