Decoding a CTC network (that is, finding the most probable output
transcription y for a given input sequence x) can be done to a first
approximation by picking the single most probable output at every timestep
and returning the corresponding transcription.
The experiments were carried out on the Wall Street Journal (WSJ) corpus and compared to a DNN-HMM baseline (Povey et al., 2011).
Results and Conclusions
1. The character-level RNN outperforms the baseline model when no language model is present.
This paper has demonstrated that character-level speech transcription can be performed by a recurrent neural network with minimal preprocessing and no explicit phonetic representation.
LISTEN, ATTEND AND
SPELL: A NEURAL
NETWORK FOR LVCSR
WILLIAM CHAN, CARNEGIE MELLON UNIVERSITY
NAVDEEP JAITLY, QUOC LE, ORIOL VINYALS, GOOGLE BRAIN
ICASSP 2016
A truly End-to-End system
The speller is an attention-based recurrent network decoder that emits each character conditioned on all previous characters and the entire acoustic sequence.
Model - II
LAS models each character output y_i as a conditional distribution over the previous characters y_{<i} and the input signal x using the chain rule for probabilities:

  P(y | x) = ∏_i P(y_i | x, y_{<i})

The distribution for y_i is a function of the decoder state s_i and context c_i. The decoder state s_i is a function of the previous state s_{i−1}, the previously emitted character y_{i−1} and context c_{i−1}. The context vector c_i is produced by an attention mechanism. Specifically,

  s_i = RNN(s_{i−1}, y_{i−1}, c_{i−1})
  c_i = AttentionContext(s_i, h)
  P(y_i | x, y_{<i}) = CharacterDistribution(s_i, c_i)

where h is the listener's encoding of the acoustic sequence x.
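The attention step that produces the context vector c_i can be sketched as follows. This is a simplification: dot-product scores stand in for the learned MLP energies the paper uses, and the encoder states h are random placeholders rather than real listener outputs.

```python
import numpy as np

# Minimal sketch of an attention mechanism producing the context c_i.
# Simplified assumption: dot-product scores instead of the paper's
# learned MLP energies; h is a stand-in for the listener's output.

rng = np.random.default_rng(0)

def attention_context(s_i, h):
    """Compute c_i as an attention-weighted sum of encoder states h."""
    scores = h @ s_i                       # e_{i,u} = <s_i, h_u>
    alpha = np.exp(scores - scores.max())  # softmax over encoder timesteps u
    alpha /= alpha.sum()
    return alpha @ h                       # c_i = sum_u alpha_{i,u} * h_u

h = rng.standard_normal((50, 128))   # 50 encoder timesteps, 128-dim states
s_i = rng.standard_normal(128)       # current decoder state
c_i = attention_context(s_i, h)
print(c_i.shape)  # -> (128,)
```

Because the weights alpha sum to one, c_i lies in the convex hull of the encoder states, summarizing the acoustic frames most relevant to the current decoding step.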
• We train the parameters of our model to maximize the log probability of the correct sequences. Specifically,

  max_θ ∑_i log P(y_i | x, y*_{<i}; θ)

where y*_{<i} is the ground-truth history of previous characters (teacher forcing).
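This objective can be illustrated with a toy computation: sum the log probability assigned to each correct character under teacher forcing. The probability table below is a hypothetical stand-in for the model's per-step output distributions.

```python
import numpy as np

# Sketch of the training objective: summed log probability of the correct
# characters, with the ground-truth history fed to the decoder (teacher
# forcing). `char_probs` is a hypothetical stand-in for model outputs.

def sequence_log_prob(char_probs, target_ids):
    """Sum of log P(y_i | x, y*_{<i}) over the target sequence."""
    return sum(np.log(p[t]) for p, t in zip(char_probs, target_ids))

# Toy example: 3 decoding steps over a 4-character vocabulary.
char_probs = np.array([
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.8, 0.05, 0.05],
    [0.2, 0.2, 0.5, 0.1],
])
targets = [0, 1, 2]                             # correct character indices
loss = -sequence_log_prob(char_probs, targets)  # minimized by gradient descent
print(loss)
```

Maximizing the log probability is equivalent to minimizing this per-character cross-entropy, which is how the objective is typically implemented in practice.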