Hung-yi Lee
Deep learning attracts lots of attention.
I believe you have already seen many exciting results before.
Playing Go: f( board position ) = "5-5" (next move)
Dialogue System: f( "Hi" ) = "Hello" (what the user said -> system response)
Image Recognition Framework: f( image ) = "cat"
A set of functions is a model, e.g. f1, f2, ...
f1( cat image ) = "cat", f1( dog image ) = "dog"
f2( cat image ) = "monkey", f2( dog image ) = "snake"
Image Recognition Framework: f( image ) = "cat"
Goodness of a function f: a better function gives outputs closer to the correct answers (f1 is better than f2).
Supervised Learning Framework: f( image ) = "cat"
Step 1: define a set of functions (a model), e.g. f1, f2, ...
Step 2: evaluate the goodness of the functions using training data, e.g. labelled images (monkey, cat, dog).
Step 3: pick the best function, then use it for testing, e.g. f( new image ) = "cat".
Three Steps for Deep Learning
Neural Network - Neuron
A neuron is a simple function: z = a1 w1 + ... + ak wk + ... + aK wK + b, output a = sigma(z),
where w1 ... wK are the weights, b is the bias, and sigma is the activation function.
Sigmoid activation function: sigma(z) = 1 / (1 + e^(-z)).
Example (as in the network on the next slides): inputs 1 and -1, weights 1 and -2, bias 1 give z = 1*1 + (-1)*(-2) + 1 = 4 and output sigma(4) = 0.98.
Neural Network
Different connections lead to different network structures.
Each neuron can have different values of weights and biases; the weights and biases are the network parameters.
Fully Connected Feedforward Network
Example with input (1, -1): the first neuron has weights (1, -2) and bias 1, so z = 1*1 + (-1)*(-2) + 1 = 4 and its output is sigma(4) = 0.98; the second neuron has weights (-1, 1) and bias 0, so z = -2 and its output is sigma(-2) = 0.12.
Sigmoid function: sigma(z) = 1 / (1 + e^(-z)).
Fully Connected Feedforward Network
Passing the input (1, -1) through the whole network gives hidden activations (0.98, 0.12), then (0.86, 0.11), and output (0.62, 0.83). The same network maps the input (0, 0) to (0.73, 0.50), (0.72, 0.12), and output (0.51, 0.85).
A network is therefore a function from an input vector to an output vector: f( [1, -1] ) = [0.62, 0.83], f( [0, 0] ) = [0.51, 0.85].
Given parameters, the network defines a function; given a network structure, it defines a function set.
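A minimal NumPy sketch of this forward pass. The first-layer weights and bias are the ones shown above; the second-layer values are illustrative placeholders, since the slide does not give them all.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Each W has shape (out_dim, in_dim); the first layer is taken from the slide.
W1, b1 = np.array([[1., -2.], [-1., 1.]]), np.array([1., 0.])
W2, b2 = np.array([[0.5, -1.], [2., 0.3]]), np.array([0., 0.])   # illustrative values

def forward(x, layers):
    a = x
    for W, b in layers:
        a = sigmoid(W @ a + b)   # z = W a + b, then the activation function
    return a

print(forward(np.array([1., -1.]), [(W1, b1), (W2, b2)]))
# First hidden layer gives z = (4, -2) -> (0.98, 0.12), matching the example above.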
Fully Connected Feedforward Network
Input layer (x1 ... xN) -> hidden layers (Layer 1, Layer 2, ..., Layer L) -> output layer (y1 ... yM).
"Deep" means many hidden layers.
Output Layer (Option)
Softmax layer as the output layer.
Ordinary layer: y_i = sigma(z_i); in general the output of the network can be any value.
Softmax layer: y_i = e^(z_i) / sum_j e^(z_j), summing over the outputs (3 in this example), so the outputs are positive and sum to 1.
Example: z = (3, 1, -3) -> e^z = (20, 2.7, 0.05) -> y = (0.88, 0.12, 0.00).
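A small sketch of the softmax computation above, assuming the example z = (3, 1, -3):

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return e / e.sum()

print(softmax(np.array([3.0, 1.0, -3.0])))
# -> approximately [0.88, 0.12, 0.00], as in the example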
Example Application
Input: a 16 x 16 image of a handwritten digit, flattened into x1 ... x256 (ink -> 1, no ink -> 0).
Output: y1 ... y10, where each dimension represents the confidence of a digit, e.g. y1 = 0.1 ("is 1"), y2 = 0.7 ("is 2"), ..., y10 = 0.2 ("is 0") -> the image is a "2".
Example Application: Handwriting Digit Recognition
The neural network is a machine that maps an image of a digit to its label (e.g. an image of "2" -> "2").
What is needed is a function with a 256-dim vector as input and a 10-dim vector as output.
Example Application
The network structure (input layer x1 ... xN, hidden layers, output layer y1 ... y10) defines a function set containing the candidate functions for handwriting digit recognition.
Training data: images of handwritten digits with their labels (e.g. 5, 0, 4, 1, 9, 2, 1, 3).
Given a set of parameters and an input image of "1", the learning target is the one-hot vector (1, 0, ..., 0): y1 should have the maximum value and be as close to 1 as possible, the other outputs as close to 0 as possible.
The loss can be the distance between the network output and the target.
Total Loss
For all training data, the total loss is L = sum over r of C_r, the sum of the loss of every training example. The goal is to find a function in the function set, i.e. network parameters theta = {w1, w2, w3, ..., b1, b2, b3, ...}, that minimizes the total loss L, which should be as small as possible.
A network can easily have millions of parameters (e.g. 10^6 weights).
Gradient Descent
How do we find the parameters? Pick an initial value for a weight w, compute the derivative dL/dw, and move w against the slope: if the slope is negative, increase w; if positive, decrease w.
Update rule: w <- w - eta * dL/dw, where eta is called the learning rate.
(Image: http://chico386.pixnet.net/album/photo/171572850)
Gradient Descent
Compute dL/dw for every parameter and update them all together; the vector of all these partial derivatives is called the gradient. For example: w1: 0.2 -> 0.15 -> 0.09; w2: -0.1 -> 0.05 -> 0.15; b1: 0.3 -> 0.2 -> 0.10. Repeat the compute-and-update step again and again.
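A generic sketch of the update rule w <- w - eta * dL/dw. The toy loss and starting point below are illustrative, not the ones behind the numbers above.

import numpy as np

def gradient(L, theta, eps=1e-6):
    # numerical gradient of a scalar loss L at the parameter vector theta
    g = np.zeros_like(theta)
    for i in range(len(theta)):
        d = np.zeros_like(theta); d[i] = eps
        g[i] = (L(theta + d) - L(theta - d)) / (2 * eps)
    return g

L = lambda th: (th[0] - 1) ** 2 + (th[1] + 2) ** 2   # toy loss
theta, eta = np.array([0.2, -0.1]), 0.1              # initial parameters, learning rate
for _ in range(100):
    theta = theta - eta * gradient(L, theta)          # one gradient-descent step
print(theta)   # approaches the minimum at (1, -2)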
Gradient Descent
(Contour plot: the color is the value of the total loss L over two parameters theta1, theta2.) Starting from a random point, repeatedly compute the gradient (dL/dtheta1, dL/dtheta2) and move in the opposite direction; hopefully we reach a minimum.
Gradient Descent - Difficulty
Gradient descent never guarantees reaching the global minimum: different starting points (theta1, theta2) can lead to different local minima.
Gradient Descent
This is how machines learn in deep learning; even AlphaGo uses this approach. What people imagine machine learning to be and what it actually is (just gradient descent) are quite different.
Why Deep?
Any continuous function f : R^N -> R^M can be realized by a network with one hidden layer (given enough hidden neurons). Reference for the reason: http://neuralnetworksanddeeplearning.com/chap4.html
So why go deep? Shallow (fat + short) v.s. deep (thin + tall): compare a wide one-hidden-layer network with a deep, narrow one of comparable size.
Results on a speech transcription task (word error rate, lower is better):

Deep (layers x size)   WER (%)     Shallow (layers x size)   WER (%)
1 x 2k                 24.2
2 x 2k                 20.4
3 x 2k                 18.4
4 x 2k                 17.8
5 x 2k                 17.2        1 x 3772                  22.5
7 x 2k                 17.1        1 x 4634                  22.6
                                   1 x 16k                   22.1

Why is deep better than a shallow network of comparable size?
Seide, Frank, Gang Li, and Dong Yu. "Conversational Speech Transcription Using Context-Dependent Deep Neural Networks." Interspeech, 2011.
Analogy
Logic circuits consist of gates; a neural network consists of neurons.
Two layers of logic gates can represent any Boolean function; a network with one hidden layer can represent any continuous function.
Using multiple layers of logic gates to build some functions is much simpler (fewer gates needed); likewise, using multiple layers of neurons to represent some functions is much simpler (fewer parameters, and perhaps less data needed).
Deep -> Modularization
Example: classifying images into girls with long hair, boys with long hair, girls with short hair, and boys with short hair. Training the four classifiers directly is hard because some classes (e.g. boys with long hair) have little data.
Instead, first train basic classifiers for the attributes "boy or girl?" and "long or short hair?"; each of these can be trained with enough data. The four fine-grained classifiers then share the basic classifiers as modules, so each of them can be trained with little data.
Deep -> Modularization -> less training data?
In a deep network the modularization is automatically learned from data: the first layer learns the most basic classifiers over x1 ... xN, the second layer uses the first layer as modules to build classifiers, the third layer uses the second layer as modules, and so on.
Reference: Zeiler, M. D., & Fergus, R.
Outline of Lecture I
Why Deep?
Implementation: lower-level toolkits (e.g. Theano) are very flexible but need some effort to learn; Keras is an easier interface to get started with.
Example Application: Handwriting Digit Recognition
Input: a 28 x 28 image. Network: two hidden layers with 500 neurons each, then a softmax output layer y1 ... y10. The machine should output the digit, e.g. "1".
Keras
Step 3.2: find the optimal network parameters (training). The input is a 28 x 28 = 784-dim vector, the target is a 10-dim vector, and the learning rate in this example is 0.1.
After training there are two use cases: case 1, evaluate the model on a test set with labels; case 2, predict outputs for new data without labels.
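A hedged Keras sketch of the three steps for this network, written against the Keras 1.x API in use when these slides were made (argument names such as nb_epoch changed in later versions). x_train, y_train, x_test, y_test are assumed to be already-prepared arrays of 784-dim inputs and one-hot 10-dim labels.

from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import SGD

model = Sequential()                                        # Step 1: define a set of functions
model.add(Dense(500, input_dim=784, activation='sigmoid'))
model.add(Dense(500, activation='sigmoid'))
model.add(Dense(10, activation='softmax'))

model.compile(loss='categorical_crossentropy',              # Step 2: goodness of function
              optimizer=SGD(lr=0.1),
              metrics=['accuracy'])

model.fit(x_train, y_train, batch_size=100, nb_epoch=20)    # Step 3: find the best parameters

score = model.evaluate(x_test, y_test)                      # case 1: test data with labels
result = model.predict(x_test)                              # case 2: predict on new data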
Keras
Using a GPU to speed up training:
Way 1: THEANO_FLAGS=device=gpu0 python YourCode.py
Way 2 (in your code, before importing Theano/Keras):
import os
os.environ["THEANO_FLAGS"] = "device=gpu0"
Live Demo
Lecture II: Tips for Training DNN
Recipe of Deep Learning
Step 1: define a set of functions; Step 2: goodness of function; Step 3: pick the best function -> a trained neural network.
First check: good results on the training data? If NO, go back and modify the three steps. If YES, check: good results on the testing data? If NO, it is overfitting; if YES, you are done.
Do not always blame overfitting: bad performance on testing data can also come from a network that is simply not well trained on the training data. Different problems (poor training results vs. poor testing results) call for different approaches.
Recipe of Deep Learning - good results on training data?
Choosing the Proper Loss
For a "1" image the target is the one-hot vector (1, 0, ..., 0) and the softmax output is y = (y1, ..., y10). Which loss is better?
Square error: sum over i = 1..10 of (y_i - target_i)^2
Cross entropy: - sum over i = 1..10 of target_i * ln(y_i)
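The two losses for a single example, in NumPy. The output values are illustrative; y is the softmax output and y_hat the one-hot target.

import numpy as np

y     = np.array([0.7, 0.1, 0.2])     # network output (softmax), illustrative
y_hat = np.array([1.0, 0.0, 0.0])     # one-hot target

square_error  = np.sum((y - y_hat) ** 2)
cross_entropy = -np.sum(y_hat * np.log(y))
print(square_error, cross_entropy)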
Let's try it
Testing accuracy: square error 0.11, cross entropy 0.84.
(Training curves: with cross entropy the training loss drops steadily; with square error it barely moves.)
Choosing the Proper Loss
When using a softmax output layer, choose cross entropy.
(Total-loss surface over two weights w1, w2: far from the target the square-error surface is very flat, so the gradient is tiny, while the cross-entropy surface still has a clear slope.)
http://jmlr.org/proceedings/papers/v9/glorot10a/glorot10a.pdf
Recipe of Deep Learning - Mini-batch
We do not really minimize the total loss!
Randomly initialize the network parameters. Pick the 1st mini-batch and compute its loss L' = C1 + C31 + ...; update the parameters once. Pick the 2nd mini-batch, compute L'' = C2 + C16 + ..., and update the parameters once again. Continue until all mini-batches have been picked: that is one epoch. With e.g. 100 examples per mini-batch, repeat the whole process for e.g. 20 epochs.
Because each update uses a different mini-batch, the loss L being minimized is different at every update, so we do not really minimize the total loss. Compared with original gradient descent (one update per pass over all the data), mini-batch updates are noisier ("unstable!"), but one epoch of mini-batch training makes many parameter updates where original gradient descent makes only one.
Shuffle the training examples for each epoch
Epoch 1 Epoch 2
x1 NN y1 1 x1 NN y1 1
Mini-batch
Mini-batch
1 1
x31 NN y31 31 x31 NN y31 31
31 17
2 2
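A sketch of mini-batch training with per-epoch shuffling. update_step is passed in as a function that performs one gradient-descent update on the batch loss; it stands for whatever update rule is used and is not code from the slides.

import numpy as np

def train(x, y, theta, update_step, batch_size=100, epochs=20):
    """update_step(theta, xb, yb) performs one update on the loss of one mini-batch."""
    n = len(x)
    for epoch in range(epochs):
        order = np.random.permutation(n)              # shuffle the training examples each epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]     # pick the next mini-batch
            theta = update_step(theta, x[idx], y[idx])   # update parameters once per mini-batch
    return theta                                       # one pass over all mini-batches = one epoch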
Recipe of Deep Learning - new activation function
Hard to get the power of Deep: with sigmoid activations, a 9-layer network actually trains worse than a 3-layer one (this is a training problem, not overfitting). The cause is the vanishing gradient problem.
Vanishing Gradient Problem
In a deep sigmoid network, the layers near the input x1 ... xN have much smaller gradients than the layers near the output y1 ... yM, so they learn very slowly and are still almost random when the later layers have already converged.
Intuitive way to compute the derivatives: dl/dw is roughly the change in the loss caused by a small change in w. A large change in an early weight is squashed by every sigmoid it passes through, so it produces only a small change in the output, and hence a small derivative.
ReLU
Rectified Linear Unit: a = z if z > 0, and a = 0 otherwise.
Neurons whose output is 0 do not contribute and can be removed from the network, leaving a thinner, effectively linear network; on the remaining paths the gradient is not attenuated, so the earlier layers do not have smaller gradients.
Let's try it (9 layers)
Testing accuracy: sigmoid 0.11, ReLU 0.96.
(Training curves: with ReLU the 9-layer network trains successfully; with sigmoid it does not.)
ReLU - variants
Leaky ReLU: a = z if z > 0, a = 0.01 z otherwise.
Parametric ReLU: a = z if z > 0, a = alpha * z otherwise, where alpha is also learned by gradient descent.
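The three activation functions as short NumPy functions (in the parametric version, alpha would itself be a parameter learned by gradient descent):

import numpy as np

def relu(z):                     # a = z if z > 0, else 0
    return np.maximum(0.0, z)

def leaky_relu(z):               # a = z if z > 0, else 0.01 z
    return np.where(z > 0, z, 0.01 * z)

def parametric_relu(z, alpha):   # alpha is a learnable parameter
    return np.where(z > 0, z, alpha * z)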
Maxout (ReLU is a special case of Maxout)
A learnable activation function: the pre-activations in a layer are grouped, and each maxout unit outputs the maximum within its group. Example with inputs x1, x2: the first layer produces the groups (5, 7) -> max 7 and (1, 1) -> max 1; the next layer produces (1, 2) -> max 2 and (4, 3) -> max 4.
Learning Rates
Set the learning rate eta carefully: if eta is too large, the total loss may not decrease and can even blow up; if eta is too small, training is very slow.
Learning Rates
Popular & simple idea: reduce the learning rate by some factor every few epochs. At the beginning we are far from the destination, so we use a larger learning rate; after several epochs we are close to the destination, so we reduce the learning rate. E.g. 1/t decay: eta_t = eta / (t + 1).
However, the learning rate cannot be one-size-fits-all: we would like to give different parameters different learning rates.
Adagrad
Original gradient descent: w <- w - eta * g_t, where g_t = dL/dw obtained at the t-th update.
Adagrad: w <- w - ( eta / sqrt( sum over i = 0..t of (g_i)^2 ) ) * g_t, a parameter-dependent learning rate, where eta is a constant and the denominator is the summation of the squares of all previous derivatives of that parameter.
Example: parameter w1 has derivatives g_0 = 0.1, g_1 = 0.2, so its learning rates are eta / sqrt(0.1^2) and then eta / sqrt(0.1^2 + 0.2^2); parameter w2 has g_0 = 20, g_1 = 10, so its learning rates are eta / sqrt(20^2) and then eta / sqrt(20^2 + 10^2).
Observations: 1. the learning rate becomes smaller and smaller for every parameter as training proceeds; 2. parameters with smaller derivatives get a larger learning rate, and vice versa. Why? The accumulated squared derivatives compensate: a parameter with consistently large derivatives gets a smaller learning rate, while one with small derivatives gets a larger one.
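A sketch of the Adagrad update for a single parameter, following the formula above (eta is the constant learning rate and grads is the sequence of derivatives seen at each update):

import numpy as np

def adagrad_updates(grads, w0, eta=0.1):
    """Apply w <- w - eta / sqrt(sum of squared past derivatives) * g for each g in grads."""
    w, sum_sq = w0, 0.0
    for g in grads:
        sum_sq += g ** 2
        w -= eta / np.sqrt(sum_sq) * g
    return w

# Example from the slide: derivatives 0.1 then 0.2 give effective learning rates
# eta / sqrt(0.1^2) and eta / sqrt(0.1^2 + 0.2^2).
print(adagrad_updates([0.1, 0.2], w0=0.0))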
Momentum
It is hard to find the optimal network parameters: the total loss decreases very slowly on a plateau (dL/dw close to 0), and the update stops entirely at a saddle point or a local minimum (dL/dw = 0). Momentum, which carries over part of the previous movement, helps push the parameters through such regions.
Adam = RMSProp (an advanced Adagrad) + Momentum.
Let's try it (ReLU, 3 layers)
Testing accuracy: original 0.96, Adam 0.97.
(Training converges faster with Adam.)
Recipe of Deep Learning - good results on testing data?
If the results on training data are good but the results on testing data are bad: early stopping, weight decay (regularization), dropout, or a different network structure.
Why Overfitting?
Training data and testing data can be different. Handwriting recognition example: original training data vs. created training data in which the digits are shifted (shift 15).
Why Overfitting?
For the following experiments, we added some noise to the testing data.
Testing accuracy: clean 0.97, noisy 0.50.
Early Stopping
As the number of epochs grows, the loss on the training set keeps decreasing, but the loss on the testing set (approximated by a validation set) eventually starts to rise; stop training at the point where the validation loss is lowest.
Keras: http://keras.io/getting-started/faq/#how-can-i-interrupt-training-when-the-validation-loss-isnt-decreasing-anymore
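A hedged Keras sketch of early stopping via the callback behind the FAQ entry above (Keras 1.x style; model, x_train, y_train are assumed to come from the earlier sketch, and patience is how many epochs to wait after the validation loss stops improving):

from keras.callbacks import EarlyStopping

early_stop = EarlyStopping(monitor='val_loss', patience=2)
model.fit(x_train, y_train,
          validation_split=0.1,        # hold out part of the training data as a validation set
          nb_epoch=100, batch_size=100,
          callbacks=[early_stop])      # stop when the validation loss stops decreasing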
Weight Decay
Our brain prunes the useless links between neurons; weight decay does something similar, pushing weights that are not needed toward zero.
Keras: http://keras.io/regularizers/
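A hedged sketch of L2 weight decay on one layer in Keras (the keyword was W_regularizer in Keras 1.x and kernel_regularizer in later versions; the 0.01 penalty strength is illustrative, and model is assumed to exist):

from keras.layers import Dense
from keras.regularizers import l2

# Penalize large weights: the loss becomes  L + 0.01 * sum(w^2)  for this layer.
model.add(Dense(500, activation='sigmoid', W_regularizer=l2(0.01)))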
Dropout
Training: before each update, each neuron has a p% chance of being dropped out, so the network used for that update is thinner.
Testing: no dropout. If the dropout rate at training is p%, all the weights are multiplied by (1 - p)%. For example, if the dropout rate is 50% and a weight w = 1 after training, set w = 0.5 for testing.
Dropout - Intuitive Reason
When working in a team, if everyone expects their partner to do the job, nothing gets done; if you know your partner may drop out, you work harder. At testing time nobody actually drops out, so the results are better.
Dropout is a kind of ensemble: a network with M neurons has 2^M possible thinned networks, and each mini-batch trains one of them (with shared parameters). Ideally, testing would average the outputs y of all these networks; multiplying all the weights by (1 - p)% approximates that average.
More about dropout
More references for dropout: [Nitish Srivastava, JMLR'14], [Pierre Baldi, NIPS'13], [Geoffrey E. Hinton, arXiv'12].
Dropout works better with Maxout [Ian J. Goodfellow, ICML'13].
Dropconnect [Li Wan, ICML'13]: dropout deletes neurons, dropconnect deletes the connections between neurons.
Annealed dropout [S.J. Rennie, SLT'14]: the dropout rate decreases over epochs.
Standout [J. Ba, NIPS'13]: each neuron has a different dropout rate.
Let's try it
Add dropout after each of the two 500-unit hidden layers, before the softmax output y1 ... y10:
model.add( Dropout(0.8) )
...
model.add( Dropout(0.8) )
Let's try it (noisy testing data)
Testing accuracy: 0.50 without dropout, 0.63 with dropout.
(Dropout lowers the training accuracy but improves the testing accuracy.)
Network Structure
CNN is a very good example! (next lecture)
Concluding Remarks of Lecture II
Recipe of Deep Learning: define a set of functions, measure the goodness of functions, pick the best function; check the results on training data first, and only then the results on testing data.
Let's try another task: Document Classification
Each document (data from http://top-breaking-news.com/) is represented by which words appear in it (e.g. "stock" in document, "president" in document), and the machine classifies the document.
(Demo: applying the tips above, such as the choice of loss (MSE vs. cross entropy) and ReLU, changes the accuracy.)
Convolutional Neural Network (CNN)
Widely used in image processing. A fully connected network on a 100 x 100 x 3 image needs a 30,000-dim input vector; with 1000 neurons in the first hidden layer, that single layer already has 3 x 10^7 parameters.
Can the fully connected network be simplified by considering the properties of image recognition?
Why CNN for Image
Property 1: some patterns are much smaller than the whole image. A neuron does not have to see the whole image to discover the pattern (e.g. a beak detector); connecting to a small region needs fewer parameters.
Property 2: the same patterns appear in different regions. An upper-left beak detector and a middle beak detector do the same job, so they can share parameters.
Property 3: subsampling the pixels does not change the object. A subsampled image of a bird is still a bird, so the image can be made smaller.
The whole CNN
Image -> Convolution -> Max Pooling -> Convolution -> Max Pooling -> ... (can repeat many times) -> Flatten -> Fully connected feedforward network -> output (e.g. "cat" or "dog").
Convolution exploits Property 1 (patterns are smaller than the whole image) and Property 2 (the same patterns appear in different regions); Max Pooling exploits Property 3 (subsampling does not change the object).
CNN - Convolution
The filters are the network parameters to be learned. Example 6 x 6 binary image:
1 0 0 0 0 1
0 1 0 0 1 0
0 0 1 1 0 0
1 0 0 0 1 0
0 1 0 0 1 0
0 0 1 0 1 0
Filter 1 (3 x 3 matrix):    Filter 2 (3 x 3 matrix):
 1 -1 -1                    -1  1 -1
-1  1 -1                    -1  1 -1
-1 -1  1                    -1  1 -1
Each filter detects a small pattern (3 x 3). (Property 1)
CNN - Convolution
With stride = 1, slide Filter 1 over the image one pixel at a time and take the inner product with each 3 x 3 patch: the top-left patch gives 3, the next one gives -1, and so on.
If stride = 2, the filter moves two pixels at a time and the first row of results is 3, -3. We set stride = 1 below.
With stride = 1, Filter 1 produces a 4 x 4 map:
 3 -1 -3 -1
-3  1  0 -3
-3 -3  0  1
 3 -2 -2 -1
The value 3 appears wherever the diagonal pattern is found, in two different regions. (Property 2)
Doing the same process for every filter, Filter 2 gives another 4 x 4 map:
-1 -1 -1 -1
-1 -1 -2  1
-1 -1 -2  1
-1  0 -4  3
Together these feature maps form a 4 x 4 image with two channels.
CNN - Zero Padding
Pad the border of the 6 x 6 image with zeros before convolving; then the output for each filter is another 6 x 6 image, the same size as the input.
CNN - Colorful image
A colorful image is a 6 x 6 x 3 tensor (three channels), and each filter is correspondingly a 3 x 3 x 3 cube of weights; the convolution is otherwise the same.
CNN - Max Pooling
Take the two 4 x 4 feature maps produced by Filter 1 and Filter 2, partition each into 2 x 2 groups, and keep only the maximum value of each group.
After one round of convolution + max pooling, the 6 x 6 image becomes a new, smaller 2 x 2 image. Filter 1's channel is
3 0
3 1
and Filter 2's channel is
-1 1
 0 3
Each filter produces one channel of the new image.
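A NumPy sketch of this convolution (stride 1, no padding) followed by 2 x 2 max pooling; it reproduces the 4 x 4 feature map and the 2 x 2 pooled values for Filter 1 shown above.

import numpy as np

image = np.array([[1,0,0,0,0,1],
                  [0,1,0,0,1,0],
                  [0,0,1,1,0,0],
                  [1,0,0,0,1,0],
                  [0,1,0,0,1,0],
                  [0,0,1,0,1,0]], dtype=float)
filter1 = np.array([[ 1,-1,-1],
                    [-1, 1,-1],
                    [-1,-1, 1]], dtype=float)

def convolve(img, filt, stride=1):
    k = filt.shape[0]
    out = img.shape[0] - k + 1
    return np.array([[np.sum(img[i:i+k, j:j+k] * filt)   # inner product with each patch
                      for j in range(0, out, stride)]
                     for i in range(0, out, stride)])

def max_pool(fmap, size=2):
    n = fmap.shape[0] // size
    return np.array([[fmap[i*size:(i+1)*size, j*size:(j+1)*size].max()
                      for j in range(n)]
                     for i in range(n)])

fmap = convolve(image, filter1)   # 4 x 4 feature map, first row: 3, -1, -3, -1
print(max_pool(fmap))             # 2 x 2 pooled map: [[3, 0], [3, 1]]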
The whole CNN
Convolution + max pooling produce a new image that is smaller than the original; the number of channels of the new image is the number of filters. This convolution + max pooling block can be repeated many times.
The whole CNN
After the last max pooling, the result is again a small multi-channel image (here 2 x 2 x 2, with values 3, 0, 3, 1, -1, 1, 0, 3). Flatten straightens it into a vector, which is fed into a fully connected feedforward network that produces the final output (e.g. "cat" or "dog").
Convolution v.s. fully connected
Convolution can be viewed as a fully connected layer with most of the weights removed (ignoring the non-linear activation function after the convolution): flatten the 6 x 6 image into a 36-dim vector; each value in the feature map is then a neuron over that vector.
Filter 1 applied at the top-left corner is a neuron whose output is 3. Its input is the flattened 36-dim image (pixel 1: 1, 2: 0, 3: 0, 4: 0, ..., 13: 0, 14: 0, 15: 1, 16: 1, ...), but it connects to only 9 of the 36 inputs rather than being fully connected. Fewer parameters!
Moving the filter by one pixel gives the next neuron, whose output is -1. It connects to a different set of 9 inputs, but it uses exactly the same 9 weights as the first neuron (shared weights). Even fewer parameters!
Parameter comparison
The input has dimension 6 x 6 = 36 and the convolution output has dimension 4 x 4 x 2 = 32. A fully connected layer between them would need 36 x 32 = 1152 parameters; the convolution layer needs only the two 3 x 3 filters, i.e. 9 x 2 = 18 parameters (followed by max pooling).
Convolutional Neural Network
The whole CNN (convolution, max pooling, flatten, fully connected layers) maps an image to the target vector, e.g. a cat image to (monkey 0, cat 1, dog 0).
Learning: nothing special, just gradient descent on the whole network.
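A hedged Keras sketch of such a CNN. The layer sizes are illustrative, and the spelling below is the newer Conv2D/MaxPooling2D API; the Keras 1.x API used when these slides were written spelled the convolution layer Convolution2D(25, 3, 3).

from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

model = Sequential()
model.add(Conv2D(25, (3, 3), input_shape=(28, 28, 1), activation='relu'))  # convolution
model.add(MaxPooling2D((2, 2)))                                            # max pooling
model.add(Conv2D(50, (3, 3), activation='relu'))                           # can repeat many times
model.add(MaxPooling2D((2, 2)))
model.add(Flatten())                                                       # flatten
model.add(Dense(100, activation='relu'))                                   # fully connected
model.add(Dense(10, activation='softmax'))                                 # e.g. 10 classes

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# Learning is nothing special, just gradient descent on the whole network, e.g.
# model.fit(x_train, y_train, batch_size=100, epochs=20)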
Playing Go
Network input: the 19 x 19 board, treated as an image (a 19 x 19 matrix with black = 1, white = -1, none = 0). Network output: a 19 x 19 vector over the possible next-move positions.
A fully connected feedforward network can be used, but a CNN performs much better.
Playing Go
Training: records of previous plays. For each recorded board position, the target output is 1 at the position of the move that was actually played (e.g. "5-5") and 0 everywhere else; the next position-move pair from the game gives the next training example.
Why CNN for playing Go?
Some patterns (local board shapes) are much smaller than the whole board, and the same patterns appear in different regions, the same properties that make CNNs work for images.
Convolutional Neural
Network (CNN)
Example Application: Slot Filling
A booking system fills slots such as Destination: Taipei and time of arrival: November 2nd.
Can slot filling be solved by a feedforward network? Input: a word (each word is represented as a vector), e.g. "Taipei" -> x1 x2; output y1 y2: which slot the word belongs to.
1-of-N encoding
Each word is a vector with one dimension per word in the lexicon, e.g. apple = (1, 0, 0, 0, 0, ...), bag = (0, 1, 0, 0, 0, ...), plus an "other" dimension for words not in the lexicon (w = Gandalf, w = Sauron -> other = 1).
An alternative is word hashing with character tri-grams (26 x 26 x 26 dimensions): for w = apple, the dimensions a-p-p, p-p-l, p-l-e are 1 and the rest (a-a-a, a-a-b, ...) are 0.
Example Application
Solving slot filling with a feedforward network? Input: a word (represented as a vector), e.g. "Taipei" -> x1 x2. Output y1 y2: the probability distribution that the input word belongs to each slot (e.g. dest, time of departure).
Example Application
In "arrive Taipei on November 2nd", Taipei is the destination; in "leave Taipei", Taipei is the place of departure. The same input word must get different outputs depending on the context, so the network needs memory.
Recurrent Neural Network (RNN): the hidden-layer outputs a1, a2 are stored and fed back together with the next input, so the words x_t, x_{t+1}, x_{t+2} are processed in sequence with context.
Bidirectional RNN: run one RNN forward and one backward over the sequence and combine their hidden states to produce y_t, y_{t+1}, y_{t+2}.
Long Short-term Memory (LSTM)
A special "neuron" with 4 inputs and 1 output: a memory cell plus an input gate, a forget gate, and an output gate. The signals that control the three gates all come from other parts of the network.
The activation function f applied to the gate signals is usually a sigmoid, so its value lies between 0 and 1, mimicking an open (1) or closed (0) gate.
Memory cell update: c' = g(z) * f(z_i) + c * f(z_f); output: a = h(c') * f(z_o), where z is the cell input, z_i, z_f, z_o are the gate control signals, c is the stored value, and the products are element-wise ("multiply").
(The slides walk through a numeric example in which the gate control signals are +10 or -10, so each sigmoid gate is effectively fully open or fully closed and the cell stores, keeps, or releases the example values accordingly.)
LSTM
In vector form, the input x_t is multiplied by four different weight matrices to obtain four vectors z_f, z_i, z, z_o, each with one dimension per memory cell; they drive the forget gates, input gates, cell inputs, and output gates of all cells at once. Together with the previous cell values c_{t-1}, this produces the new cell values c_t and the output y_t.
Extension: "peephole" connections also feed c_{t-1} into the gate computations. Unrolling the cell over time (c_{t-1} -> c_t -> c_{t+1}) gives the full LSTM network; this is quite standard now.
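A NumPy sketch of one LSTM step following the equations above, with g and h taken to be tanh (a common choice, assumed here). The four z vectors are assumed to have already been produced from x_t and the previous output by the four weight matrices, which are omitted.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(c, z, z_i, z_f, z_o):
    """One step of the memory cell: z is the cell input, z_i/z_f/z_o control the gates."""
    c_new = sigmoid(z_i) * np.tanh(z) + sigmoid(z_f) * c   # input gate and forget gate
    a     = sigmoid(z_o) * np.tanh(c_new)                  # output gate
    return c_new, a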
Three Steps for Deep Learning (with RNN)
The same network is applied at every time step: the hidden state a1, a2, a3 is copied forward as the words x1 x2 x3 are read, producing y1 y2 y3.
Training sentences are labelled word by word, e.g. "arrive Taipei on November 2nd" -> other, dest, other, time, time.
Three Steps for Deep Learning
The RNN is also trained by gradient descent; the gradients through the unrolled network are computed with Backpropagation Through Time (BPTT).
Unfortunately, RNN-based networks are not always easy to learn. In real experiments on language modeling, the total loss can jump around wildly from epoch to epoch (and sometimes you are simply lucky). The error surface is rough: it is either very flat or very steep, so a step that lands on a steep cliff makes the gradient explode. A common remedy is clipping: when the norm of the gradient exceeds a threshold, scale it down before updating.
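A hedged Keras sketch of gradient clipping. Keras optimizers accept a clipnorm argument; the threshold and learning rate are illustrative, and model is assumed to exist.

from keras.optimizers import SGD

# Rescale the gradient whenever its L2 norm exceeds 1, so a single steep
# cliff on the error surface cannot blow up the update.
clipped_sgd = SGD(lr=0.01, clipnorm=1.0)
model.compile(loss='categorical_crossentropy', optimizer=clipped_sgd)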
Sentiment Analysis
The RNN reads a sequence of words (e.g. a review) and outputs a single label: Positive, Negative, Positive, ...
Many to Many (output is shorter)
Both the input and the output are sequences, but the output is shorter, e.g. speech recognition (acoustic frames in, characters out).
Many to Many (no limitation)
Both the input and the output are sequences with different lengths: sequence-to-sequence learning, e.g. machine translation ("machine learning" -> the translated sentence). The final hidden state of the encoder contains all the information about the input sequence.
Many to Many (no limitation)
If the decoder just keeps generating words, it never knows when to stop (like the never-ending reply game; ref: http://zh.pttpedia.wikia.com/wiki/%E6%8E%A5%E9%BE%8D%E6%8E%A8%E6%96%87). Add a special end-of-sequence symbol "===" so the decoder can decide when to stop.
Image and Video Caption Generation
A CNN encodes the input image into a vector, and an RNN decoder generates the caption. Application: video caption generation, e.g. the machine watches a video and outputs "A girl is running."
Unsupervised Learning
Image: Realizing what the World Looks Like
Text: Understanding the Meaning of Words
Audio: Learning human language without supervision
Ultra Deep Network
(Skyscraper analogy image: https://zh.wikipedia.org/wiki/%E9%9B%99%E5%B3%B0%E5%A1%94#/media/File:BurjDubaiHeight.svg)
Image classification networks keep getting deeper (slide source: http://cs231n.stanford.edu/slides/winter1516_lecture8.pdf): AlexNet (2012), 8 layers, 16.4% error; VGG (2014), 19 layers, 7.3%; GoogleNet (2014), 22 layers, 6.7%; Residual Net (2015), 152 layers, 3.57%, taller than Taipei 101.
Ultra Deep Network
Worried about overfitting with 152 layers? Worry about training first: an ultra deep network needs special structure simply to be trainable.
Ultra Deep Network
An ultra deep network behaves like the ensemble of many networks with different depths (e.g. 6 layers, 4 layers, 2 layers).
Ultra Deep Network
Related structures: FractalNet, ResNet in ResNet. Is it just good initialization? There are also gated designs in which a controller decides, layer by layer, whether to copy the input forward or transform it, so the output layer can effectively sit at different depths.
Unsupervised Learning
Image: Realizing what the World Looks Like
Text: Understanding the Meaning of Words
Audio: Learning human language without supervision
Attention-based Model
Like human memory: what you learned in these lectures, what you had for lunch today, and your summer vacation 10 years ago are all stored; when asked "What is deep learning?", the brain organizes and attends to the relevant memories to answer.
(Image: http://henrylo1605.blogspot.tw/2015/05/blog-post_56.html)
Attention-based Model
A DNN/RNN maps the input to the output, while a reading-head controller decides where in the machine's memory the reading head should look.
Ref: http://speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture/Attain%20(v3).ecm.mp4/index.html
Attention-based Model v2
The machine can also modify its memory through a writing head (as in the Neural Turing Machine).
Applications: Reading Comprehension, where the machine performs semantic analysis over the sentences stored in memory and attends to the relevant ones to answer a query (proposed by the FB AI group); Visual Question Answering, where the reading-head controller attends over features of image regions (source: http://visualqa.org/).
Unsupervised Learning
Image: Realizing what the World Looks Like
Text: Understanding the Meaning of Words
Audio: Learning human language without supervision
Scenario of Reinforcement Learning
The agent observes the environment, takes an action that changes the environment, and receives a reward (e.g. "Don't do that" as negative feedback). The agent learns to take the actions that maximize expected reward.
(Image: http://www.sznews.com/news/content/2013-11/26/content_8800180.htm)
Supervised v.s. Reinforcement
Supervised: learning from a teacher, e.g. "Hello" -> say "Hi"; "Bye bye" -> say "Good bye".
Reinforcement: learning from a critic, e.g. after a whole dialogue starting with "Hello", the agent is only told that it was bad.
Scenario of Reinforcement Learning (playing Go)
The agent learns to take actions that maximize expected reward: it observes the board and takes an action (the next move). If it wins, reward = +1; if it loses, reward = -1; otherwise, reward = 0.
Supervised v.s. Reinforcement
Supervised: learn from a teacher's recorded moves (given this board, the next move is ...).
Reinforcement learning: the agent (function from input to output) interacts with the environment over many games and only learns from winning or losing at the end.
Application: Interactive Retrieval
Interactive retrieval is helpful [Wu & Lee, INTERSPEECH 16]: the user issues a query and the system can ask the user back before returning results. Deep (reinforcement) learning trades off better retrieval performance against less user labor (fewer interactions); the task cannot be addressed by a linear model.
More applications
Alpha Go, playing video games, dialogue.
Flying helicopter: https://www.youtube.com/watch?v=0JL04JJjocc
Driving: https://www.youtube.com/watch?v=0xo1Ldx3L5Q
"Google Cuts Its Giant Electricity Bill With DeepMind-Powered AI": http://www.bloomberg.com/news/articles/2016-07-19/google-cuts-its-giant-electricity-bill-with-deepmind-powered-ai
To learn deep reinforcement learning
Lectures of David Silver: http://www0.cs.ucl.ac.uk/staff/D.Silver/web/Teaching.html (10 lectures, 1:30 each)
Deep Reinforcement Learning: http://videolectures.net/rldm2015_silver_reinforcement_learning/
Outline
Supervised Learning
Ultra Deep Network
New network structure
Attention Model
Reinforcement Learning
Unsupervised Learning
Image: Realizing what the World Looks Like
Text: Understanding the Meaning of Words
Audio: Learning human language without supervision
Does the machine know what the world looks like?
Ref: https://openai.com/blog/generative-models/
Draw something!
Deep Dream
Given a photo, machine adds what it sees
http://deepdreamgenerator.com/
Deep Style
Given a photo, make its style like famous paintings
https://dreamscopeapp.com/
Deep Style
One CNN captures the content of the photo and another CNN captures the style of the painting; a new image is then generated that matches the content of the first and the style of the second.
Generating Images by RNN
The image is generated piece by piece by an RNN, and the samples are compared with real-world images.
Generating Images
Training a decoder to generate images from a code is unsupervised: we only have images, not code-image pairs. Auto-encoder: learn an NN encoder and an NN decoder together so that the reconstructed output is as close as possible to the input; the narrow "bottleneck" layer in the middle is the code, and the decoder alone can then map codes to images. (The plain auto-encoder is not a state-of-the-art approach for generation.)
Generating Images
Training a decoder (code -> NN decoder -> image) is unsupervised. Practical approaches:
Variational Auto-encoder (VAE). Ref: Auto-Encoding Variational Bayes, https://arxiv.org/abs/1312.6114
Generative Adversarial Network (GAN). Ref: Generative Adversarial Networks, http://arxiv.org/abs/1406.2661
Which one is machine-generated? Ref: https://openai.com/blog/generative-models/
See also: https://github.com/mattya/chainer-DCGAN
Outline
Supervised Learning
Ultra Deep Network
New network structure
Attention Model
Reinforcement Learning
Unsupervised Learning
Image: Realizing what the World Looks Like
Text: Understanding the Meaning of Words
Audio: Learning human language without supervision
Machine Reading
The machine learns the meaning of words by reading a lot of documents without supervision (e.g. news from http://top-breaking-news.com/).
(Image: https://garavato.files.wordpress.com/2011/11/stacksdocuments.jpg?w=490)
Machine Reading
A word can be understood by its context: "You shall know a word by the company it keeps." If two different words appear in very similar contexts (the slide's example uses two names occurring in otherwise identical "520" news sentences), they are probably something very similar.
Word Vector
(Source: http://www.slideshare.net/hustwj/cikm-keynotenov2014)
Word vectors have useful characteristics: the differences between related word vectors are consistent, which can be used for solving analogies.
Outline
Supervised Learning
Ultra Deep Network
New network structure
Attention Model
Reinforcement Learning
Unsupervised Learning
Image: Realizing what the World Looks Like
Text: Understanding the Meaning of Words
Audio: Learning human language without supervision
Learning from Audio Books
Like an infant, the machine learns from raw audio without transcriptions; audio segments of words such as "dogs", "never", "ever" are turned into vectors.
Sequence-to-sequence Auto-encoder
An audio segment is a sequence of acoustic features x1 x2 x3 x4. An RNN encoder reads the sequence and compresses it into a vector; an RNN decoder then tries to reconstruct the acoustic features y1 y2 y3 y4 from that vector. The RNN encoder and decoder are jointly trained, so the vector becomes a representation of the audio segment.
Audio Word to Vector - Results
Visualizing the embedding vectors of the words fear, fame, name, near: words that differ by one sound end up with vector differences pointing in a consistent direction (fear - near is close to fame - name).
WaveNet (DeepMind)
https://deepmind.com/blog/wavenet-generative-model-raw-audio/
Concluding Remarks
(News image: http://www.express.co.uk/news/science/651202/First-step-towards-The-Terminator-becoming-reality-AI-beats-champ-of-world-s-oldest-game)
The closing slides (originally in Chinese) relate today's AI back to the three steps: in step 1 people still design the function set (the network), while in step 3 the machine finds the best function, e.g. by deep learning; so current AI still relies heavily on human effort.
(Closing reference: http://www.gvm.com.tw/webonly_content_10787.html)