Statistical Language Models Based on Neural Networks
Tomáš Mikolov
Overview
Motivation
Neural Network Based Language Models
Training Algorithm
Recurrent Neural Network
Classes
Maximum Entropy Language Model
Empirical Results:
Penn Treebank Corpus
Wall Street Journal Speech Recognition
NIST RT04 Broadcast News Speech Recognition
Generating Text with RNNs
Additional Experiments
Conclusion: Finally Beyond N-grams?
Motivation
Motivation - Turing Test
Example:
P(Monday | What day of week is today?) = ?
P(red | What is the color of roses?) = ?

or more as a language modeling problem:

P(red | The color of roses is) = ?
Motivation - N-grams
Motivation - Limitations of N-grams
Neural Network Based Language Models
Model Description - Feedforward NNLM
Model Description - Recurrent NNLM
Figure: Simple recurrent neural network: input layer w(t), hidden layer s(t), output layer y(t); U and V are the input and output weight matrices, and the previous hidden state s(t-1) is fed back into s(t).
Input layer w and output layer y have the same dimensionality as the vocabulary (10K - 200K)
Hidden layer s is orders of magnitude smaller (50 - 1000 neurons)
U is the matrix of weights between input and hidden layer, V is the matrix of weights between hidden and output layer
Without the recurrent weights W, this model would be a bigram neural network language model
Model Description - Recurrent NNLM
The output values from neurons in the hidden and output layers are computed as:

$s(t) = f(Uw(t) + Ws(t-1))$ (1)
$y(t) = g(Vs(t))$ (2)

where $f(z)$ is the sigmoid activation function and $g(z_m)$ is the softmax function:

$f(z) = \frac{1}{1+e^{-z}}$ (3)
$g(z_m) = \frac{e^{z_m}}{\sum_k e^{z_k}}$ (4)
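As a concrete illustration, the forward pass fits in a few lines of NumPy. This is a sketch, not the original implementation: the sizes, initialization and variable names are illustrative assumptions, while the shapes follow the figure above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))          # shift for numerical stability
    return e / e.sum()

V_SIZE, H_SIZE = 10000, 100            # illustrative vocabulary / hidden sizes
rng = np.random.default_rng(0)
U = rng.normal(0, 0.1, (H_SIZE, V_SIZE))   # input  -> hidden weights
W = rng.normal(0, 0.1, (H_SIZE, H_SIZE))   # hidden -> hidden (recurrent) weights
V = rng.normal(0, 0.1, (V_SIZE, H_SIZE))   # hidden -> output weights

def forward(word_idx, s_prev):
    """One time step: s(t) = f(U w(t) + W s(t-1)), y(t) = g(V s(t)).
    w(t) is one-hot, so U w(t) is just column word_idx of U."""
    s = sigmoid(U[:, word_idx] + W @ s_prev)
    y = softmax(V @ s)                 # distribution over the next word
    return s, y

s = np.zeros(H_SIZE)                   # s(0): empty history
s, y = forward(42, s)                  # y sums to 1 over the vocabulary
```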
Training of RNNLM
Weights V between the hidden layer s(t) and the output layer y(t) are updated as

$V(t+1) = V(t) + s(t)\,e_o(t)^T \alpha$ (5)

where $\alpha$ is the learning rate and $e_o(t) = d(t) - y(t)$ is the error vector at the output layer, with $d(t)$ the one-hot encoding of the correct next word.
Training of RNNLM
Gradients of errors are then propagated from the output layer to the hidden layer:

$e_h(t) = d_h(e_o(t)^T V, t)$ (6)

where the function $d_h(\cdot)$ applies the derivative of the sigmoid element-wise:

$d_{hj}(x, t) = x\,s_j(t)(1 - s_j(t))$ (7)
Training of RNNLM
Weights U between the input layer w(t) and the hidden layer s(t) are then updated as

$U(t+1) = U(t) + w(t)\,e_h(t)^T \alpha$ (8)

Note that only one neuron is active at a given time in the input vector w(t). As can be seen from equation (8), the weight change for neurons with zero activation is zero, so the computation can be sped up by updating only the weights that correspond to the active input neuron.
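Continuing the NumPy sketch above, one plain SGD step (without BPTT) can be written as follows. This is an illustrative reading of equations (5) to (8): the matrices are stored transposed relative to the slide notation so that the forward pass reads y = softmax(V @ s), and the learning rate and word indices are made up.

```python
alpha = 0.1                            # learning rate (assumed value)
word_idx, target_idx = 42, 7           # current word w(t) and correct next word
s_prev = np.zeros(H_SIZE)

s, y = forward(word_idx, s_prev)
e_o = -y
e_o[target_idx] += 1.0                 # e_o(t) = d(t) - y(t), d(t) one-hot target

e_h = (V.T @ e_o) * s * (1.0 - s)      # eqs. (6)-(7): backprop through sigmoid
V += alpha * np.outer(e_o, s)          # eq. (5)
W += alpha * np.outer(e_h, s_prev)     # recurrent weights, one-step truncation
U[:, word_idx] += alpha * e_h          # eq. (8): only the active input column moves
```

Note how the update of U touches a single column, which is exactly the speed-up described above.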
Training of RNNLM - Backpropagation Through Time
Figure: The recurrent network unfolded in time: the input words w(t), w(t-1), w(t-2) enter through U, the hidden states s(t-1), s(t-2), s(t-3) are chained through the recurrent weights W, and the output y(t) is produced through V.
Training of RNNLM - Backpropagation Through Time
Figure: Example of batch mode training. Red arrows indicate how the gradients are propagated through the unfolded recurrent neural network (through V from the outputs y(t), y(t-1), y(t-2), y(t-3), and through U and W across the unfolded time steps).
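A minimal sketch of truncated backpropagation through time, continuing the NumPy example above; the unfolding depth TAU and the choice to apply gradients immediately (rather than in batch mode as in the figure) are assumptions.

```python
TAU = 3                                # unfolding depth (assumed hyperparameter)

def bptt_update(word_seq, target_idx, s_init, alpha=0.1):
    """word_seq holds the last TAU words [w(t-TAU+1), ..., w(t)];
    the error is injected at time t only."""
    global U, V, W                     # update the matrices defined above in place
    states = [s_init]                  # forward: remember every hidden state
    for w in word_seq:
        s, y = forward(w, states[-1])
        states.append(s)
    e_o = -y
    e_o[target_idx] += 1.0             # e_o(t) = d(t) - y(t) at the last step
    e_h = (V.T @ e_o) * states[-1] * (1.0 - states[-1])
    V += alpha * np.outer(e_o, states[-1])
    for step in range(len(word_seq) - 1, -1, -1):   # follow the red arrows back
        s_prev = states[step]
        e_back = (W.T @ e_h) * s_prev * (1.0 - s_prev)  # error one step earlier
        U[:, word_seq[step]] += alpha * e_h
        W += alpha * np.outer(e_h, s_prev)
        e_h = e_back
```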
Extensions: Classes
Figure: RNN extended with a class layer: besides U, V and W, a matrix X connects the hidden layer s(t) to the class outputs c(t), so the word probability factorizes into a class part and a within-class part.
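The class extension can be sketched as two small softmaxes reusing the matrices from the NumPy example above. The matrix name X and the class layer c(t) follow the figure; the number of classes and the class assignment are illustrative assumptions.

```python
N_CLASSES = 100                        # assumed number of classes
word2class = np.arange(V_SIZE) % N_CLASSES        # illustrative class assignment
members = [np.flatnonzero(word2class == c) for c in range(N_CLASSES)]
X = rng.normal(0, 0.1, (N_CLASSES, H_SIZE))       # hidden -> class layer c(t)

def word_probability(word_idx, s):
    """P(w | history) = P(class(w) | s(t)) * P(w | class(w), s(t))."""
    c = word2class[word_idx]
    p_class = softmax(X @ s)           # c(t): distribution over classes
    in_class = softmax(V[members[c]] @ s)          # softmax over class members only
    pos = np.flatnonzero(members[c] == word_idx)[0]
    return p_class[c] * in_class[pos]
```

Evaluating one word now costs a softmax over N_CLASSES classes plus one over the members of a single class, instead of a softmax over the whole vocabulary.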
Joint Training With Maximum Entropy Model
Hash Based Maximum Entropy Model
Assume a vocabulary V with three words, V = (ONE, TWO, THREE); a maximum entropy model with unigram features:
Figure: Maximum entropy model with unigram features: a constant input 1 is connected through weights a1, a2, a3 to the outputs ONE, TWO, THREE, modeling P(w(t)|*).
Hash Based Maximum Entropy Model
Figure: Maximum entropy model with trigram features: each of the nine histories (ONE, ONE) through (THREE, THREE) is connected through the weight matrix C to the outputs ONE, TWO, THREE, modeling P(w(t)|w(t-2), w(t-1)).
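A sketch of the hashing idea: instead of one explicit weight per (history, word) pair as in the figures above, every n-gram feature is hashed into a fixed-size parameter array, so memory stays bounded at the price of collisions. The hash function and sizes here are illustrative assumptions, not the scheme of the actual model.

```python
HASH_SIZE = 2 ** 20                    # assumed size of the parameter array
me_weights = np.zeros(HASH_SIZE)

def feature_slot(history, word):
    """Map an n-gram feature (history tuple, predicted word) to one slot;
    distinct features may collide, which training simply absorbs."""
    return hash((history, word)) % HASH_SIZE

def me_scores(w_prev2, w_prev1):
    """Unnormalized ME scores with unigram, bigram and trigram features."""
    histories = [(), (w_prev1,), (w_prev2, w_prev1)]
    scores = np.zeros(V_SIZE)
    for h in histories:
        for w in range(V_SIZE):
            scores[w] += me_weights[feature_slot(h, w)]
    return scores                      # softmax(scores) gives P(w(t)|w(t-2), w(t-1))
```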
Empirical Results
Penn Treebank
Comparison of advanced language modeling techniques
Combination
Wall Street Journal
JHU setup
Kaldi setup
NIST RT04 Broadcast News speech recognition
Additional experiments: machine translation, text compression
Penn Treebank
Penn Treebank - Comparison
Penn Treebank - Combination
Combination of Techniques (Joshua Goodman, 2001)
Improvements with Increasing Amount of Data
Figure: KN5 vs. KN5+RNN as the amount of training data grows from 10^5 to 10^8 tokens.
The improvement obtained from a single RNN model over the best backoff model increases with more data!
However, the size of the hidden layer also needs to grow with more training data.
Comparison of Techniques - WSJ, JHU Setup
Empirical evaluation - Kaldi WSJ setup description
Empirical evaluation - Kaldi WSJ setup
Evaluation - Broadcast News Speech Recognition
Model            WER [%], single   WER [%], interpolated
KN4 (baseline)   13.11             13.11
model M          13.10             12.49
RNN-40           13.36             12.90
RNN-80           12.98             12.70
RNN-160          12.69             12.58
RNN-320          12.38             12.31
RNN-480          12.21             12.04
RNN-640          12.05             12.00
RNNME-0          13.21             12.99
RNNME-40         12.42             12.37
RNNME-80         12.35             12.22
RNNME-160        12.17             12.16
RNNME-320        11.91             11.90
3xRNN            -                 11.70
Figure: WER on eval [%] as a function of hidden layer size (10^1 to 10^3), for KN4, RNN, RNN+KN4, RNNME and RNNME+KN4.
Empirical Evaluation - Entropy
Figure: Entropy reduction over KN5 as a function of training tokens (10^5 to 10^9), for RNN-20 and RNNME-20 (left) and for RNN-20, RNNME-20, RNN-80 and RNNME-80 (right).
Additional Experiments: Machine Translation
Model                          BLEU
baseline (n-gram)              48.7
300-best rescoring with RNNs   51.2
Additional Experiments: Machine Translation
RNNs were trained on a subset of the training data (about 17.5M training tokens), with a limited vocabulary
Additional Experiments: Text Compression
Task: compression of the normalized text data that was used in the NIST RT04 experiments
The achieved entropy of English text, 1.21 bits per character, is already lower than the 1.3 bpc upper bound estimated by Shannon
Several tricks were used to obtain the results: multiple models with different learning rates, and skip-gram features
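For reference, bits per character is the model's total cross-entropy on the text divided by the number of characters, i.e. the code length an ideal arithmetic coder driven by the model would produce. A minimal sketch (the numbers in the example are made up):

```python
import numpy as np

def bits_per_character(word_log2_probs, n_characters):
    """Cross-entropy of a text in bits per character: the total code length
    an ideal arithmetic coder driven by the model would produce, divided
    by the number of characters in the text."""
    total_bits = -float(np.sum(word_log2_probs))   # sum of log2 P(w_i | history)
    return total_bits / n_characters

# e.g. 1000 words, each predicted with probability 2**-8, over a
# 5000-character text: 1000 * 8 / 5000 = 1.6 bpc
print(bits_per_character(np.full(1000, -8.0), 5000))   # -> 1.6
```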
Conclusion: Finally Beyond N-grams?
Data sampled from 4-gram backoff model
Data sampled from RNN model
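Sampling from the RNN model amounts to running the forward pass, drawing the next word from the distribution y(t), and feeding it back as the next input. A minimal sketch reusing the NumPy example above (word indices stand in for actual words):

```python
def sample_text(first_word_idx, length=20):
    """Generate word indices by repeatedly sampling from y(t)."""
    s = np.zeros(H_SIZE)
    word, out = first_word_idx, [first_word_idx]
    for _ in range(length):
        s, y = forward(word, s)
        word = rng.choice(V_SIZE, p=y)             # draw the next word from y(t)
        out.append(word)
    return out
```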
WSJ-Kaldi rescoring
5-gram: IN TOKYO FOREIGN EXCHANGE TRADING YESTERDAY THE UNIT INCREASED AGAINST THE
DOLLAR
RNNLM: IN TOKYO FOREIGN EXCHANGE TRADING YESTERDAY THE YEN INCREASED AGAINST THE
DOLLAR
5-gram: SOME CURRENCY TRADERS SAID THE UPWARD REVALUATION OF THE GERMAN MARK
WASN’T BIG ENOUGH AND THAT THE MARKET MAY CONTINUE TO RISE
RNNLM: SOME CURRENCY TRADERS SAID THE UPWARD REVALUATION OF THE GERMAN MARKET
WASN’T BIG ENOUGH AND THAT THE MARKET MAY CONTINUE TO RISE
5-gram: MR. PARNES FOLEY ALSO FOR THE FIRST TIME THE WIND WITH SUEZ’S PLANS FOR
GENERALE DE BELGIQUE’S WAR
RNNLM: MR. PARNES SO LATE ALSO FOR THE FIRST TIME ALIGNED WITH SUEZ’S PLANS FOR
GENERALE DE BELGIQUE’S WAR
5-gram: HE SAID THE GROUP WAS MARKET IN ITS STRUCTURE AND NO ONE HAD LEADERSHIP
RNNLM: HE SAID THE GROUP WAS ARCANE IN ITS STRUCTURE AND NO ONE HAD LEADERSHIP