Talk Outline
Part I: Introduction to
Machine Learning & Deep Learning
Hand-crafted rules (program.py):
"I love this product!" → + (positive); "It claims too much." → − (negative); "It's a little expensive." → ?
if input contains "love", "like", etc. → output = positive
if input contains "too much", "bad", etc. → output = negative
The same idea on PTT comments: "First batch launched in Taiwan!" (台灣第一波上市!) → 推 (upvote); "The specs are pretty useless…" (規格好雞肋…) → 噓 (downvote); "I'll consider it only after my downstairs neighbor buys one" (樓下買了我才考慮) → ?
Some tasks are complex, and we don't know how to write a program to solve them.
Learning ≈ Looking for a Function
"I love this product!" → f → + ; "It claims too much." → f → − ; "It's a little expensive." → f → ?
Given a large amount of data, the machine learns what the function f should be.

Learning ≈ Looking for a Function
Speech recognition: f(speech signal) = "你好"
Image recognition: f(image) = "cat"
Go playing: f(board position) = "5-5" (next move)
Dialogue system: f("How do I get to TSMC?" 台積電怎麼去) = "The address is …; taking a taxi is recommended now" (地址為… 現在建議搭乘計程車)
Image Recognition: Framework
Goal: a function f with f(image) = "cat".
Step 1 — a set of functions, the Model: f₁, f₂, …. E.g., f₁(image) = "cat" while f₂(image) = "monkey"; on another image f₁ gives "dog" while f₂ gives "snake".
Step 2 — goodness of function: use supervised training data (images labeled "monkey", "cat", "dog") to measure how good each candidate function f is and pick a better one.
Supervised learning has two phases:
Training is to pick the best function given the observed data.
Testing is to predict the label using the learned function.
Why Learn Machine Learning?
In the AI age, AI can take over much routine labor, and a new job market emerges for AI trainers (AI 訓練師): machine learning experts (機器學習專家) and data scientists (資料科學家).
AI Trainer (AI 訓練師)
Doesn't the machine learn by itself — why do we need AI trainers?
In battles it is the Pokémon that fight, so why do we need Pokémon trainers?
Pokémon trainer vs. AI trainer:
A trainer picks suitable Pokémon for battle (Pokémon have different types) — in step 1, the AI trainer picks a suitable model (different models suit different tasks).
A summoned Pokémon does not always obey — training does not always find the best function.
Behind every powerful AI there is an AI trainer. Let's set out on the road to becoming AI trainers.
Machine Learning Map
- Supervised Learning: Regression, Classification (Linear Model, Non-Linear Model)
- Semi-Supervised Learning
- Transfer Learning
- Unsupervised Learning
- Reinforcement Learning
Regression
Self-driving car: f(sensor input) = steering-wheel angle (方向盤角度)
Recommendation: f(user A, item B) = likelihood that A purchases B

Example Application
Estimate a Pokémon's CP after evolution: y = f(x), where the input x is described by features such as x_cp (CP before evolution), x_s, x_hp, x_w, x_h, and the output y is the CP after evolution.
Step 1: Model
Linear model: y = b + Σᵢ wᵢ xᵢ
xᵢ: an attribute (feature) of the input x
wᵢ: weight; b: bias
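As a sketch, the linear model above can be written directly in Python; the feature values, weight, and bias below are made-up numbers for illustration, not the Pokémon data.

```python
def linear_model(x, w, b):
    """Linear model y = b + sum_i w_i * x_i over features x with weights w."""
    return b + sum(wi * xi for wi, xi in zip(w, x))

# Hypothetical single-feature example: x_cp = 10, w = 2.0, b = 5.0.
y = linear_model([10.0], [2.0], 5.0)  # 5 + 2*10 = 25.0
```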
Step 2: Goodness of Function
Model: y = b + w ∙ x_cp — a set of functions f₁, f₂, ….
Training data: pairs of an input and the desired output (a scalar), e.g., input x¹ with output ŷ¹, input x² with output ŷ².
Step 2: Goodness of Function
Training data: 10 Pokémon — (x¹, ŷ¹), (x², ŷ²), …, (x¹⁰, ŷ¹⁰) — where x_cpⁿ is the CP before evolution and ŷⁿ is the true CP after evolution.
This is real data.
Source: https://www.openintro.org/stat/data/?data=pokemon
Step 2: Goodness of Function
Model: y = b + w ∙ x_cp.
Loss function L: its input is a function, its output is how bad that function is.
L(f) = L(w, b) = Σₙ₌₁¹⁰ ( ŷⁿ − (b + w ∙ x_cpⁿ) )²
Here b + w ∙ x_cpⁿ is the estimated y based on the input function, ŷⁿ − (b + w ∙ x_cpⁿ) is the estimation error, and the sum runs over the training examples.
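The squared-error loss just defined can be sketched in a few lines; the (x_cp, ŷ) training pairs below are hypothetical stand-ins for the real data.

```python
def loss(w, b, data):
    """L(w, b) = sum over examples of (y_hat - (b + w * x))^2."""
    return sum((y_hat - (b + w * x)) ** 2 for x, y_hat in data)

data = [(1.0, 3.0), (2.0, 5.0)]  # hypothetical (x_cp, y_hat) pairs
perfect = loss(2.0, 1.0, data)   # the line y = 1 + 2x fits both points exactly
```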
Step 2: Goodness of Function
Pick the best function:
w*, b* = arg min_{w,b} L(w, b)
       = arg min_{w,b} Σₙ₌₁¹⁰ ( ŷⁿ − (b + w ∙ x_cpⁿ) )²
Step 3: Gradient Descent
Pick an initial value w⁰ and compute dL/dw |_{w=w⁰} on the loss curve L(w).
Update: w¹ ← w⁰ − η (dL/dw)|_{w=w⁰} — the step −η (dL/dw) moves w downhill.
Then compute dL/dw |_{w=w¹} and update w² ← w¹ − η (dL/dw)|_{w=w¹}.
…… Many iterations: w⁰ → w¹ → w² → … → wᵀ.
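The update rule above can be sketched for a one-dimensional loss. The quadratic L(w) = (w − 3)², with derivative 2(w − 3), is a stand-in for the real loss, chosen so the minimum is known.

```python
def gradient_descent(dL_dw, w0, eta, steps):
    """Repeat w <- w - eta * dL/dw, starting from w0."""
    w = w0
    for _ in range(steps):
        w = w - eta * dL_dw(w)
    return w

# Stand-in loss L(w) = (w - 3)^2; its derivative is 2(w - 3), minimum at w = 3.
w_star = gradient_descent(lambda w: 2 * (w - 3), w0=0.0, eta=0.1, steps=100)
```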
Step 3: Gradient Descent
With two parameters w and b, compute both partial derivatives and update both at each step.
Worried about local optima? The loss function of linear regression is convex, so there is no local optimum to get stuck in.
Step 3: Gradient Descent
L(w, b) = Σₙ₌₁¹⁰ ( ŷⁿ − (b + w ∙ x_cpⁿ) )²
∂L/∂w = Σₙ₌₁¹⁰ 2 ( ŷⁿ − (b + w ∙ x_cpⁿ) ) ( −x_cpⁿ )
∂L/∂b = Σₙ₌₁¹⁰ 2 ( ŷⁿ − (b + w ∙ x_cpⁿ) ) ( −1 )
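The two partial derivatives translate directly to code; the training pairs below are hypothetical.

```python
def gradients(w, b, data):
    """dL/dw and dL/db for L(w, b) = sum (y_hat - (b + w*x))^2."""
    dw = sum(2 * (y_hat - (b + w * x)) * (-x) for x, y_hat in data)
    db = sum(2 * (y_hat - (b + w * x)) * (-1) for x, y_hat in data)
    return dw, db

data = [(1.0, 3.0), (2.0, 5.0)]      # hypothetical (x_cp, y_hat) pairs
dw0, db0 = gradients(0.0, 0.0, data)  # steep downhill far from the fit
dw1, db1 = gradients(2.0, 1.0, data)  # zero gradient at the exact fit
```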
Learned Model
y = b + w ∙ x_cp with b = −188.4, w = 2.7.
Average error on training data: Σₙ₌₁¹⁰ eⁿ = 31.9, where eⁿ is the error of the n-th example.

Model Generalization
What we really care about is the error on new data (testing data). With the same learned model (w = 2.7), the average error on testing data is Σₙ₌₁¹⁰ eⁿ = 35.0 > the average error on training data (31.9).
Model Generalization
A second-order model: best function b = −10.3, w₁ = 1.0, w₂ = 2.7 × 10⁻³; average training error = 15.4, average testing error = 18.4.
A third-order model: b = 6.4, w₁ = 0.66, w₂ = 4.3 × 10⁻³, w₃ = 1.8 × 10⁻⁶; average training error = 15.3, average testing error = 18.1. Slightly better. How about a more complex model?
Fourth-order: average training error = 14.9, but testing error = 28.8.
Fifth-order: average training error = 12.8, but testing error = 232.1.
The candidate models:
1. y = b + w ∙ x_cp
2. y = b + w₁ ∙ x_cp + w₂ ∙ x_cp²
3. y = b + w₁ ∙ x_cp + w₂ ∙ x_cp² + w₃ ∙ x_cp³
4. y = b + w₁ ∙ x_cp + w₂ ∙ x_cp² + w₃ ∙ x_cp³ + w₄ ∙ x_cp⁴
5. y = b + w₁ ∙ x_cp + w₂ ∙ x_cp² + w₃ ∙ x_cp³ + w₄ ∙ x_cp⁴ + w₅ ∙ x_cp⁵
A more complex model yields lower error on training data — if we can truly find the best function — but not necessarily lower error on testing data.
Machine Learning Map: Classification
Binary classification: the function f maps an input to one of two classes. Multi-class classification: f maps an input to one of several classes.
Binary Classification – Spam Filtering
Features such as whether "Talk" or "free" appears in the e-mail feed a model that outputs 1 (Yes, spam) or 0 (No, not spam).
(http://spam-filter-review.toptenreviews.com/)
Multi-Class – Image Recognition
The model scores each class — "monkey", "cat", "dog" — and outputs the most likely one.

Multi-Class – Topic Classification
Features such as whether "stock" or "president" appears in a document; output classes such as politics (政治), economy (經濟), sports (體育), finance (財經).
http://top-breaking-news.com/
Part I: Introduction to ML & DL
Neuron
z = a₁w₁ + ⋯ + a_k w_k + ⋯ + a_K w_K + b
A neuron is a simple function: the inputs a₁ … a_K are weighted by w₁ … w_K, the bias b is added, and the activation function turns z into the output a = σ(z).
Neural Network
A neuron with the sigmoid activation function σ(z) = 1 / (1 + e^(−z)): with inputs 1 and −1, weights 1 and −2, and bias 1, z = 1·1 + (−1)·(−2) + 1 = 4, and the output is σ(4) ≈ 0.98.
Neural Network
Different neurons have different values of weights and biases. The weights and biases together are the network parameters θ.
Fully Connected Feedforward Network
With input (1, −1), the first-layer neurons (weights 1, −2 with bias 1, and weights −1, 1 with bias 0) output σ(4) = 0.98 and σ(−2) = 0.12, using the sigmoid function σ(z) = 1 / (1 + e^(−z)).
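The 0.98 and 0.12 above can be checked in a few lines, using the same weights and biases as the slide's example.

```python
import math

def neuron(a, w, b):
    """z = sum_k a_k * w_k + b, then the sigmoid activation sigma(z)."""
    z = sum(ak * wk for ak, wk in zip(a, w)) + b
    return 1 / (1 + math.exp(-z))

out1 = neuron([1, -1], [1, -2], 1)   # z = 4  -> sigma(4)  ~ 0.98
out2 = neuron([1, -1], [-1, 1], 0)   # z = -2 -> sigma(-2) ~ 0.12
```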
Fully Connected Feedforward Network
Input layer (x₁ … x_N) → hidden layers → output layer (y₁ … y_M).
"Deep" means many hidden layers.
Why Deep? Universality Theorem
Any continuous function can be realized by a network with a single hidden layer, given enough neurons — so why make networks deep (many thin layers over x₁ x₂ … x_N) rather than fat (one wide layer)?
Why Deep
(http://cs231n.stanford.edu/slides/winter1516_lecture8.pdf)
ImageNet error keeps dropping as networks get deeper: AlexNet (2012) 16.4%, VGG (2014) 7.3%, GoogleNet (2014) 6.7%, and Residual Net (2015) 3.57% with a special structure — a network as tall as Taipei 101.
Output Layer
In general, the outputs of a network can be any values. To interpret them as probabilities (1 > yᵢ > 0 and Σᵢ yᵢ = 1), use a softmax layer as the output layer.

Softmax Layer
yᵢ = e^(zᵢ) / Σⱼ₌₁³ e^(zⱼ)
Example: z = (3, 1, −3) → (e³, e¹, e⁻³) ≈ (20, 2.7, 0.05) → y ≈ (0.88, 0.12, ≈0).
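The softmax formula can be sketched directly, reproducing the slide's example with scores (3, 1, −3):

```python
import math

def softmax(z):
    """Normalize scores z into probabilities: y_i = e^{z_i} / sum_j e^{z_j}."""
    exps = [math.exp(zi) for zi in z]
    total = sum(exps)
    return [e / total for e in exps]

y = softmax([3, 1, -3])  # ~ [0.88, 0.12, ~0]
```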
Example Application
Input: a 16 × 16 = 256-pixel image of a handwritten digit, x₁ … x₂₅₆ (ink → 1, no ink → 0).
Output: y₁ … y₁₀, where each dimension represents the confidence of a digit — e.g., y₁ = 0.1 ("is 1"), y₂ = 0.7 ("is 2"), …, y₁₀ = 0.2 ("is 0") → the image is "2".
Example Application
Handwriting digit recognition: what is needed is a function with a 256-dim vector as input and a 10-dim vector as output — the neural network is such a function (e.g., mapping the image to "2").
Example Application: Handwriting Digit Recognition
Input layer x₁ … x₂₅₆ (16 × 16 = 256 pixels, ink → 1, no ink → 0) → hidden layers → softmax output layer y₁ ("is 1"), y₂ ("is 2"), …, y₁₀ ("is 0").
The learning target: for an image of "1", y₁ should have the maximum value — as close to 1 as possible — while the other outputs (y₂, …, y₁₀) should be close to 0. Given a set of parameters, the loss l can be the square error or the cross entropy between the network output and the target. A good function should make the loss of all examples as small as possible.
Total Loss
For all R training examples x¹, x², x³, …, the total loss is L = Σᵣ₌₁ᴿ lʳ, where lʳ is the loss between the network output yʳ and the target ŷʳ. Make L as small as possible: find the function in the function set — i.e., the network parameters θ = {w₁, w₂, w₃, ⋯, b₁, b₂, b₃, ⋯} — that minimizes the total loss L.
Gradient Descent
A network easily has millions of parameters (~10⁶ weights). For each weight w: if ∂L/∂w is positive, decrease w; if negative, increase w. Update w ← w − η ∂L/∂w, where η is called the "learning rate". The same update applies to every parameter in θ = {w₁, w₂, ⋯, b₁, b₂, ⋯}.
Gradient Descent
In the (w₁, w₂) plane, each update follows the negative gradient of the total loss L (color = value of L). Hopefully, we reach a minimum.
Local Minima
Gradient descent can be very slow at a plateau (∂L/∂w ≈ 0), stuck at a saddle point (∂L/∂w = 0), or stuck at a local minimum (∂L/∂w = 0).
(Figure courtesy of 沈昇勳.)
Example Application
A machine recognizing a 28 × 28 handwritten image as "1".

Learning Recipe
Step 1: define a set of functions. Step 2: goodness of function. Step 3: pick the best function.
After training, check: good results on training data? Good results on testing data? Different approaches suit different problems:
If training results are bad, revisit the three steps — choose a proper loss, mini-batch training, a new activation function, an adaptive learning rate, momentum.
If training results are good but testing results are not, use more training samples or tips such as dropout.
Learning Recipe
On training data we immediately know the performance of the learned "best" function f*; on testing data we do not know the performance in advance.
If the results on the training set are bad, modify the training process. Possible reasons:
- no good function exists (bad hypothesis function set) → reconstruct the model architecture
- cannot find a good function (e.g., stuck at local optima) → change the training strategy
Overall flow: get good results on the training set? If no, modify the training process. If yes, get good results on the dev/validation set? If no, prevent overfitting. If yes, done.
PART II: Variants of Neural Networks
Why CNN for Image
An image, represented as pixels x₁ … x_N, is processed hierarchically: the most basic classifiers detect simple patterns, the 1st layer's outputs serve as modules to build classifiers, the 2nd layer's outputs as modules for the next level, and so on — e.g., a "beak" detector.
The same pattern appears in different regions, so an "upper-left beak" detector and a "middle beak" detector can do almost the same thing.
Subsampling the pixels does not change the object: a subsampled bird is still a bird.
http://www.cs.toronto.edu/~fritz/absps/imagenet.pdf
The Whole CNN
Image → Convolution → Max Pooling → (can repeat many times) → Flatten → Fully connected feedforward network → output ("cat", "dog", …).
The architecture exploits three properties of images:
Property 1: some patterns are much smaller than the whole image → convolution.
Property 2: the same patterns appear in different regions → convolution with shared weights.
Property 3: subsampling the pixels will not change the object → max pooling.
Image Recognition: Local Connectivity
In a convolutional layer, each output unit connects only to a local region of the input (its receptive field) rather than to every input unit, and the connection extends through the input's full depth (depth = 1 for a grayscale image, depth = 2 or more for stacked feature maps).
Convolutional Layers
Stacking convolutional layers: inputs a₁ … a₃ produce feature maps b₁ … b₃ (depth = 2), which in turn produce outputs c and d (depth = 2). Each output depends only on a local window of the previous layer, and the same filter weights (A, B, C, D) are shared across positions.
Hyper-parameters of CNN
Stride: how far the filter moves between applications (e.g., stride = 1 or stride = 2).
Padding: zeros added around the border (e.g., padding = 0 or padding = 1).

Example
Input volume 7 × 7 × 3, filter 3 × 3 × 3, stride = 2, padding = 1 → output volume 3 × 3 × 2.
http://cs231n.github.io/convolutional-networks/
Convolutional Layers
130
http://cs231n.github.io/convolutional-networks/
Convolutional Layers
131
http://cs231n.github.io/convolutional-networks/
Convolutional Layers
132
http://cs231n.github.io/convolutional-networks/
Convolutional Layers
133
http://cs231n.github.io/convolutional-networks/
Pooling Layer
Pooling has no weights, uses non-overlapping windows, and keeps the depth (here depth = 1). For the 4 × 4 input
1 3 2 4
5 7 6 8
0 0 3 3
5 5 0 0
2 × 2 maximum pooling gives 7 8 / 5 3 (e.g., Max(1,3,5,7) = 7; Max(0,0,5,5) = 5), and 2 × 2 average pooling gives 4 5 / 2.5 1.5 (e.g., Avg(1,3,5,7) = 4).
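The 2 × 2 pooling just shown can be sketched generically, taking the pooling operation as a parameter:

```python
def pool_2x2(grid, op):
    """Apply a 2x2 pooling operation (non-overlapping windows) to a 2D grid."""
    out = []
    for i in range(0, len(grid), 2):
        row = []
        for j in range(0, len(grid[0]), 2):
            window = [grid[i][j], grid[i][j + 1],
                      grid[i + 1][j], grid[i + 1][j + 1]]
            row.append(op(window))
        out.append(row)
    return out

grid = [[1, 3, 2, 4],
        [5, 7, 6, 8],
        [0, 0, 3, 3],
        [5, 5, 0, 0]]
max_pooled = pool_2x2(grid, max)                      # [[7, 8], [5, 3]]
avg_pooled = pool_2x2(grid, lambda w: sum(w) / 4)     # [[4.0, 5.0], [2.5, 1.5]]
```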
Why "Deep" Learning?

Visual Perception of Human
(Figure: the hierarchy of the human visual cortex; http://www.nature.com/neuro/journal/v8/n8/images/nn0805-975-F1.jpg)

Visual Perception of Computer
Input layer → convolutional layer → pooling layer → convolutional layer → pooling layer; receptive fields grow with depth. A convolutional layer with receptive fields of width 3 and height 3, followed by a max-pooling layer, turns the input layer into filter responses and, eventually, a class label.
Convolutional Neural Network
Train the whole network (convolution, max pooling, fully connected) on labeled images: for a cat image, the target is ("monkey", "cat", "dog") = (0, 1, 0).

What CNN Learned
Filter visualizations of AlexNet:
http://www.cs.toronto.edu/~fritz/absps/imagenet.pdf
http://vision03.csail.mit.edu/cnn_art/data/single_layer.png
DNNs Are Easily Fooled
Nguyen et al., “Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images,” arXiv:1412.1897.
Visualizing CNN
Feed an image (e.g., a flower) into the CNN and observe a filter's response; random noise produces a different, weaker response.

Gradient Ascent
To see what a filter detects, treat the input image as the variable: compute the filter's score and its gradient with respect to the input, then repeatedly update the input in the gradient direction (scaled by a learning rate) to maximize the score.
Different Layers of Visualization
Visualizing filters at different CNN layers reveals patterns of increasing complexity.

Multiscale Image Generation
Visualize, resize, visualize again, resize — repeating across scales.

Deep Dream
http://deepdreamgenerator.com/
The CNN modifies the image to exaggerate the patterns it detects.

Neural Art
Given a content image and a style image, a CNN extracts the content of one and the style of the other, and the generated image matches both — an "artist brain".
More Application: Playing Go
The network takes the 19 × 19 board as input — a 19 × 19 vector with black = 1, white = −1, none = 0 — and outputs the next move (19 × 19 positions). A fully-connected feedforward network can be used, but a CNN performs much better.
Training: records of previous plays, e.g., 黑: 5之五, 白: 天元, 黑: 五之5, … — train the CNN so that, given the board, the recorded move is predicted: target "天元" = 1 (else = 0), then target "五之5" = 1 (else = 0), and so on.
Why CNN for playing Go? The board shares the image properties: some patterns are much smaller than the whole board, and the same patterns appear in different regions.
Slot Filling
"I would like to arrive Taipei on November 2nd." → Slots: Destination = Taipei; time of arrival = November 2nd.

Example Application
Solving slot filling by feedforward network? Input: a word (each word is represented as a vector), e.g., "Taipei".

1-of-N Encoding
Each word is a vector with a 1 at its own index and 0 elsewhere. Output: the probability distribution that the input word belongs to each slot.
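The 1-of-N encoding just described can be sketched in one line; the five-word vocabulary below is hypothetical.

```python
def one_hot(word, vocab):
    """1-of-N encoding: a vector with 1 at the word's index, 0 elsewhere."""
    return [1 if w == word else 0 for w in vocab]

vocab = ["apple", "bag", "cat", "dog", "elephant"]  # hypothetical lexicon
vec = one_hot("bag", vocab)  # [0, 1, 0, 0, 0]
```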
Example Application
The same word can fill different slots: in "arrive Taipei" the word "Taipei" is the destination, but in "leave Taipei" it is the place of departure. A feedforward network produces the same output for the same input, so it cannot tell them apart.
Three Steps for Deep Learning
A recurrent network stores the hidden-layer outputs a₁, a₂ in memory and feeds them back as input at the next time step. The memory makes outputs context-dependent: for "leave Taipei" and "arrive Taipei", the probability of "Taipei" in each slot differs because the values stored after "leave" and after "arrive" differ. Unrolled over time: x_t, x_{t+1}, x_{t+2} → y_t, y_{t+1}, y_{t+2}.
Bidirectional RNN
Two RNNs read the sequence x_t, x_{t+1}, x_{t+2} in opposite directions; each output y_t depends on both the forward and the backward pass.

RNN
The same network is used again and again: the hidden values a₁, a₂, a₃ are copied forward step by step, so x₁, x₂, x₃ → y₁, y₂, y₃ with shared weights.
Training
Sentences are labeled word by word, e.g., "arrive Taipei on November 2nd" → other, dest, other, time, time.

Rough Error Surface
The error surface of an RNN is rough: the total loss is either very flat or very steep in w₁, w₂, which makes training hard. [Razvan Pascanu, ICML'13]

Many-to-One
Input is a word sequence, output is one label — e.g., sentiment analysis: "我 覺 得 …… 太 糟 了" ("I think … it's terrible") → negative.
Many-to-Many (Output is shorter)
Input and output are both sequences, but the output is shorter — e.g., speech recognition: the input is a sequence of acoustic vectors, the output a sequence of characters. Simply trimming repeated characters is a problem: why can't the output be "好棒棒" (which repeats 棒 on purpose) instead of "好棒"?

Connectionist Temporal Classification (CTC) adds a blank symbol φ: the frame sequence "好 φ φ 棒 φ φ φ φ" decodes to "好棒", while "好 φ φ 棒 φ 棒 φ φ" decodes to "好棒棒".
[Alex Graves, ICML’06][Alex Graves, ICML’14][Haşim Sak, Interspeech’15][Jie Li, Interspeech’15][Andrew Senior, ASRU’15]
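The CTC decoding rule just shown — merge repeated symbols, then drop the blanks φ — can be sketched as:

```python
def ctc_collapse(tokens, blank="φ"):
    """Collapse a CTC frame sequence: merge repeats, then drop blanks."""
    out = []
    prev = None
    for t in tokens:
        if t != prev and t != blank:
            out.append(t)
        prev = t
    return "".join(out)

# The slide's two frame sequences decode differently:
short = ctc_collapse(list("好φφ棒φφφφ"))   # "好棒"
long_ = ctc_collapse(list("好φφ棒φ棒φφ"))  # "好棒棒"
```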
Many-to-Many (Output has no limitation)
Both input and output are sequences of arbitrary lengths — sequence-to-sequence learning, e.g., machine translation: "machine learning" → "機 器 學 習". The encoder's final state contains all information about the input sequence, and the decoder generates one token at a time. Without a stop symbol the decoder can ramble on ("機 器 學 習 慣 性 …", like the PTT 接龍推文 game: http://zh.pttpedia.wikia.com/wiki/%E6%8E%A5%E9%BE%8D%E6%8E%A8%E6%96%87, 鄉民百科); adding a stop symbol "===" lets it end after "機 器 學 習".
Image Caption Generation
A CNN encodes the input image into a vector, and an RNN decoder generates the caption. [Kelvin Xu, arXiv'15][Li Yao, ICCV'15]

Video Caption Generation
E.g., generating "A girl is running." for a video.
https://www.youtube.com/watch?v=LY7x2Ihqj
Attention and Memory
Asked "What is deep learning?", a person attends to the relevant memories (the course material, not the summer vacation 10 years ago) and organizes them into an answer — attention lets a network do the same.
http://henrylo1605.blogspot.tw/2015/05/blog-post_56.html
Attention on Sensory Info
Machine translation without attention: an RNN encoder reads "深 度 學 習", its final state carries the information of the whole sentence, and an RNN decoder outputs "deep", "learning", <END>.
Machine Translation with Attention
Each encoder state h₁ … h₄ (for "深 度 學 習") is scored against the decoder state z₀ by a match function: α₀ⁱ = match(z₀, hⁱ). A softmax turns the scores into weights α̂₀ⁱ (e.g., 0.5, 0.5, 0.0, 0.0), giving the context vector c⁰ = Σᵢ α̂₀ⁱ hⁱ, which is fed to the decoder as RNN input to produce "deep" and the next state z₁. How to learn the parameters? The match function is trained jointly with the rest of the network.
At the next step, z₁ is matched against h₁ … h₄ again (e.g., α̂₁ⁱ = 0.0, 0.0, 0.5, 0.5), giving c¹; the decoder outputs "learning" and moves to z₂, and so on.
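One attention step — softmax over match scores, then the context vector c = Σᵢ α̂ᵢ hᵢ — can be sketched with hypothetical 2-dim encoder states and scores:

```python
import math

def attend(h, scores):
    """Softmax the match scores into weights, then take the weighted sum of h."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    c = [sum(a * hi[k] for a, hi in zip(alphas, h))
         for k in range(len(h[0]))]
    return alphas, c

# Hypothetical encoder states h1..h4 and match scores against z0.
h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
alphas, c = attend(h, [2.0, 2.0, -10.0, -10.0])
# Nearly all weight falls on the first two states, so c ~ [0.5, 0.5].
```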
Image Captioning with Attention
Input: an image; output: a word sequence (e.g., "a woman is … <END>"). A CNN encodes the image either as one vector for the whole image or as a vector for each region. With region vectors, the decoder state z₀ takes a weighted sum over the regions (e.g., weights 0.0, 0.8, 0.2, 0.0, 0.0, 0.0 over the filter outputs) to generate Word 1, then z₁ attends again for Word 2, and so on.

Image Captioning
Good examples and bad examples.

Video Captioning
(Examples of generated video captions.)
Reading Comprehension
Each sentence of the document is encoded into a vector x¹ … x^N (the sentence-to-vector encoding can be jointly trained). Attention weights α₁ … α_N match each sentence against the question q; the extracted information Σₙ₌₁ᴺ αₙ xⁿ feeds a DNN that outputs the answer.
With jointly learned representations h¹ … h^N, the extracted information is Σₙ₌₁ᴺ αₙ hⁿ, and the match–extract cycle can repeat (hopping).
Memory Network
Given the question q, compute attention over the memory, extract information, optionally repeat (hopping), and let a DNN produce the answer a.
https://www.facebook.com/Engineering/videos/10153098860532200/

Attention on Memory
Zhang et al., "Structured Memory for Neural Turing Machines," arXiv, 2015.
Neural Turing Machine
A controller (states h₀, h₁, h₂ with inputs x₁, x₂ and outputs y₁, y₂) reads from memory cells m₀¹ … m₀⁴ with attention weights α̂₀¹ … α̂₀⁴.
Retrieval process: r₀ = Σᵢ α̂₀ⁱ m₀ⁱ, and r₀ is fed to the controller together with x₁.
The controller emits a key k¹, an erase vector e¹, and an add vector a¹. The attention is updated (simplified) as α₁ⁱ = (1 − λ) α₀ⁱ + λ cos(m₀ⁱ, k¹), followed by a softmax to obtain α̂₁ⁱ.
Encode process (element-wise): m₁ⁱ = m₀ⁱ ∗ (1 − α̂₁ⁱ e¹) + α̂₁ⁱ a¹.
Reading and writing then continue: r₁ enters the controller at the next step, and the memory evolves m₀ⁱ → m₁ⁱ → m₂ⁱ.
Zhang et al., "Structured Memory for Neural Turing Machines," arXiv, 2015.
Concluding Remarks

Talk Outline
Part I: Introduction to Machine Learning & Deep Learning
Outline
- Semi-Supervised Learning
- Transfer Learning
- Unsupervised Learning: 化繁為簡 ("simplifying the complex") Representation Learning; 無中生有 ("creating from nothing") Generative Model
- Reinforcement Learning
Semi-Supervised Learning
A small amount of labeled data (e.g., images labeled cat and dog) plus a large amount of unlabeled data.
Transfer Learning
Labeled data for the task of interest (cat vs. dog) plus labeled data from different tasks (elephant, tiger); knowledge learned on one task — e.g., the lower network layers — is transferred to another.
Transfer Learning Example
A graduate student (研究生) and a manga artist (漫畫家) lead analogous lives: advisor (指導教授) ↔ editor (責編), running experiments (跑實驗) ↔ drawing storyboards (畫分鏡), submitting to journals (投稿期刊) ↔ submitting to Jump (投稿 jump) — the graduate-school survival guide (研究生生存守則) transfers to Bakuman (爆漫王).
Unsupervised Learning
Only unlabeled data (no cat/dog labels). Two flavors:
- 化繁為簡 ("simplifying the complex"), representation learning: only the function's input is available; learn a function that maps objects to compact codes.
- 無中生有 ("creating from nothing"), generative models: only the function's output is available; learn a function that generates objects from codes.
Unsupervised Learning
E.g., images of the digit "3" rotated by −20°, −10°, 0°, 10°, 20° are all explained by a single latent factor: the rotation angle.

Latent Factors for Documents
(Figure: topics as latent factors; http://deliveryimages.acm.org/10.1145/2140000/2133826/figs/f1.jpg)

Latent Factors for Recommendation
(Figure: users and items embedded in a shared latent space.)
237
Handwritten digits
…….
No. 1 No. 2 No. 3 No. 4 No. 5
Latent Factor Exploitation
238
…….
No. 1 No. 2 No. 3 No. 4 No. 5
28 No. 1 No. 3 No. 5
28 = + +
Represented by [1 0 1 0 1 0 …….]
28 X 28 = 784 pixels (simpler representation)
Autoencoder
NN Encoder: the input object (e.g., a 28 × 28 = 784-pixel image) → a code (usually < 784 dims), a compact representation of the input. NN Decoder: the code → a reconstruction of the original object. Encoder and decoder are learned together.
Autoencoder
Minimize ‖x − y‖²: input layer x → hidden (bottleneck) layer a = σ(Wx + b) → output layer y = σ(W′a + b′), with y as close as possible to x. The output of the hidden layer is the code.
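A forward pass through this one-hidden-layer autoencoder can be sketched with hypothetical tiny weight matrices in place of learned ones:

```python
import math

def sigmoid(v):
    """Element-wise logistic function."""
    return [1 / (1 + math.exp(-x)) for x in v]

def matvec(W, x):
    """Matrix-vector product over plain lists."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]

def autoencode(x, W, b, Wp, bp):
    """Code a = sigma(W x + b); reconstruction y = sigma(W' a + b')."""
    a = sigmoid([z + bi for z, bi in zip(matvec(W, x), b)])
    y = sigmoid([z + bi for z, bi in zip(matvec(Wp, a), bp)])
    return a, y

# Hypothetical weights: a 3-pixel "image" squeezed into a 2-dim code.
W  = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]      # 2x3 encoder
Wp = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # 3x2 decoder
a, y = autoencode([1.0, 2.0, 3.0], W, [0.0, 0.0], Wp, [0.0, 0.0, 0.0])
```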
Autoencoder
De-noising autoencoder: add noise to the input x to get x′, encode a, decode y, and make y as close as possible to the original (clean) x.
Rifai, et al. "Contractive auto-encoders: Explicit invariance during feature extraction,“ in ICML, 2011.
Deep Autoencoder
Stack many layers between the input layer (x, 784 dims) and the output layer (784 dims) with a bottleneck code layer in the middle, making the output as close as possible to the input; the weights can be initialized layer-by-layer by RBMs.
Hinton and Salakhutdinov. “Reducing the dimensionality of data with neural networks,” Science, 2006.
Deep Autoencoder
Original image (784) → 1000 → 500 → 250 → 30-dim code → 250 → 500 → 1000 → 784: the deep autoencoder reconstructs the image better than 30-dim PCA.
Shrinking the code to 2 dims (784 → 1000 → 500 → 250 → 2) gives a 2-D feature representation in which classes separate visibly.
Autoencoder – Text Retrieval
Plain bag-of-words matching does not consider semantics. Encode the 2000-dim bag-of-words vector of a document (or query) through 500 → 250 → 125 → 2 dims: documents talking about the same thing will have close codes, so a query can be matched in code space.
Autoencoder – Similar Image Retrieval
Krizhevsky et al. "Using very deep autoencoders for content-based image retrieval," in ESANN, 2011.
A 32 × 32 image is encoded through layers of 8192 → 4096 → 2048 → 1024 → 512 down to a 256-bit code; similar images are retrieved by comparing codes.
Autoencoder for DNN Pre-Training
Greedy layer-wise pre-training of a target network (input 784 → 1000 → 1000 → 500 → output 10):
1. Train an autoencoder 784 → 1000 → 784 (weights W₁, W₁′) with target x; keep W₁.
2. Fix W₁; train an autoencoder 1000 → 1000 → 1000 on the hidden representation a¹ (weights W₂, W₂′); keep W₂.
3. Fix W₁ and W₂; train 1000 → 500 → 1000 on a² (weights W₃, W₃′); keep W₃.
4. Initialize the network with W₁, W₂, W₃, attach the output layer, and fine-tune the whole network.
Word Vector/Embedding
Words are mapped to vectors so that related words are close: tree/flower, dog/rabbit/cat, run/jump.

Word Embedding
A word can be understood from its context: in "馬英九 520宣誓就職" and "蔡英文 520宣誓就職" (Ma Ying-jeou / Tsai Ing-wen was sworn in on May 20), the task is to predict the word wᵢ from its context …… wᵢ₋₂ wᵢ₋₁ ___.
Prediction-Based
The previous word wᵢ₋₁, in 1-of-N encoding, feeds a neural network whose outputs z give the probability of each word being the next word wᵢ — e.g., "潮水" → "退了", "不爽" → "不要".
Training text: "…… 蔡英文 宣誓就職 ……" and "…… 馬英九 宣誓就職 ……". With either 蔡英文 or 馬英九 as wᵢ₋₁, "宣誓就職" should have a large probability, so the network is pushed to map the two names to nearby internal representations (z₁, z₂).
Various Architectures
- CBOW: predict the word wᵢ from its neighbors wᵢ₋₁ and wᵢ₊₁.
- Skip-gram: predict the neighbors wᵢ₋₁ and wᵢ₊₁ from wᵢ.
Interactive demos of Word2Vec CBOW and Skip-Gram: https://ronxin.github.io/wevi/
Word Embedding
(http://www.slideshare.net/hustwj/cikm-keynotenov2014)
Characteristics:
V(hotter) − V(hot) ≈ V(bigger) − V(big)
V(Rome) − V(Italy) ≈ V(Berlin) − V(Germany)
V(king) − V(queen) ≈ V(uncle) − V(aunt)
Solving analogies — Rome : Italy = Berlin : ? — compute V(Berlin) − V(Rome) + V(Italy) and find the word w with the closest V(w):
V(Germany) ≈ V(Berlin) − V(Rome) + V(Italy)
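The analogy recipe above can be sketched with toy 2-D vectors; the embeddings below are purely illustrative, not learned ones.

```python
def closest_word(vectors, target):
    """Return the word whose vector is nearest to `target` (squared Euclidean)."""
    def dist(v):
        return sum((a - b) ** 2 for a, b in zip(v, target))
    return min(vectors, key=lambda w: dist(vectors[w]))

# Hypothetical toy embeddings illustrating V(Berlin) - V(Rome) + V(Italy).
V = {"Italy": [1.0, 0.0], "Rome": [1.0, 1.0],
     "Germany": [0.0, 0.0], "Berlin": [0.0, 1.0]}
query = [b - r + i for b, r, i in zip(V["Berlin"], V["Rome"], V["Italy"])]
answer = closest_word(V, query)  # "Germany"
```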
Creation
"What I cannot create, I do not understand." (Richard Feynman) — a machine that understands should be able to draw. Generative models:
https://openai.com/blog/generative-models/
https://www.quora.com/What-did-Richard-Feynman-mean-when-he-said-What-I-cannot-create-I-do-not-understand
PixelRNN
Generate an image pixel by pixel: an NN predicts each next pixel from the pixels generated so far. It can be trained just with a large collection of images without any annotation; given the top half of a real-world image, the model draws the rest.
Aaron van den Oord, Nal Kalchbrenner, Koray Kavukcuoglu, Pixel Recurrent Neural Networks, arXiv preprint, 2016.

PixelRNN – beyond Image
Audio: Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, Koray Kavukcuoglu, WaveNet: A Generative Model for Raw Audio, arXiv preprint, 2016.
Video: Nal Kalchbrenner, Aaron van den Oord, Karen Simonyan, Ivo Danihelka, Oriol Vinyals, Alex Graves, Koray Kavukcuoglu, Video Pixel Networks, arXiv preprint, 2016.
Generative Adversarial Network (GAN)
Discriminative v.s. Generative Models
- Discriminative: learns a function that maps the input data x to some desired output class label y, i.e., directly learns the conditional distribution P(y|x).
- Generative: tries to learn the joint probability of the input data and labels simultaneously, i.e., P(x, y); this can be converted to P(y|x) for classification via Bayes' rule.
In an autoencoder (input layer x → encode → hidden layer a → decode → output layer x′), the decoder half already generates data from a code: the generator is to generate the data from the code.
Generative Adversarial Networks (GAN)
Two networks are trained jointly: the generator learns how to adapt its parameters to produce output data that can fool the discriminator, while the discriminator learns to tell generated data from real data.
Goodfellow, et al., “Generative adversarial networks,” in NIPS, 2014.
http://blog.aylien.com/introduction-generative-adversarial-networks-code-tensorflow/
Generator Evolution
NN Generator v1 → v2 → v3: successive versions produce increasingly realistic images (compare with real CIFAR-10 images).
https://openai.com/blog/generative-models/
Generated Bedrooms
Radford et al., “Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks,” arXiv:1511.06434.
Comics Drawing
https://github.com/mattya/chainer-DCGAN
http://qiita.com/mattya/items/e5bfe5e04b9d2f0bbd47
Pokémon Creation
Original images are 40 × 40, downscaled to 20 × 20. Some real Pokémon are held out — never seen by the machine — but it is difficult to evaluate generation. Tasks: complete an image with 50% covered, with 75% covered, or draw from scratch, which needs some randomness.
Pokémon Creation
The NN Encoder outputs means m₁, m₂, m₃, … and spreads σ₁, σ₂, σ₃, …; with random noise e₁, e₂, e₃, …, the 10-dim code is cᵢ = mᵢ + exp(σᵢ) × eᵢ, and the NN Decoder produces the output from the code.
To explore the code space, pick 2 of the 10 code dimensions, fix the remaining 8, and feed codes (c₁, c₂, c₃, …) to the NN Decoder to see how the output changes.
Pokémon Creation - Data
Each pixel is represented by a discrete class (0, 1, 2, …) in a color palette rather than by raw channel values.
Reinforcement Learning
An agent observes the environment, takes actions, and receives rewards (e.g., "Don't do that"). The agent learns to take actions to maximize expected reward.
Supervised v.s. Reinforcement
Supervised: learning from a teacher — for the input "Hello", the teacher gives the target "Say Hi".
Reinforcement: learning from critics — after a whole dialogue ("Hello" ……), the only feedback is "Bad".
E.g., in Go: if win, reward = 1; if loss, reward = −1; otherwise, reward = 0.
The agent learns to take actions to maximize expected reward.
Supervised v.s. Reinforcement
The agent's policy can be a DNN: the observation is the function input and the action is the function output.
Major Components in an RL Agent

Simulated User
Generate interactions based on a predefined fake goal, and automatically learn the strategy by training on the simulated data.
User goal: Two tickets for "the witch" tomorrow 9:30 PM at regal meridian 16, Seattle.
RULE BASED SYSTEM
User: Can I get 2 tickets for the witch?
Agent: Which theater do you want?
User: Regal meridian 16
Agent: What time would you like to see it?
User: 9:30 pm
Agent: What date would you like to see it?
User: Tomorrow!
Agent: How many people are you?
User: Just two adults.
Agent: What is your location?
User: Seattle!
Agent: What kind of video format do you like?
User: Oh, any kind is okay.
Agent: Do you have any price constraint?
User: No, I don't care.
Agent: Great - I was able to purchase 2 tickets for you to see the witch tomorrow at regal meridian 16 theater in Seattle at 9:30 pm.
User: Thank you.
Agent: Thank you!

REINFORCEMENT LEARNING SYSTEM
User: Can I get 2 tickets for the witch?
Agent: What date would you like to see it?
User: Tomorrow
Agent: Which theater do you prefer?
User: Regal meridian 16
Agent: What is your location?
User: Seattle
Agent: What time would you like to see it?
User: 9:30 pm.
Agent: Great - I was able to purchase 2 tickets for you to see the witch tomorrow at regal meridian 16 theater in Seattle at 9:30 pm.
User: Thanks.
Agent: Thanks!
More Applications
The agent–environment–reward loop applies to many other tasks.
How to Become a Martial-Arts Master (如何成為武林高手)
Train both inside and out (內外兼修): with abundant internal strength, the strong overcome the weak (內功充沛,恃強克弱); with exquisite techniques, the fast beat the slow (招數精妙,以快打慢). The techniques (招數) are the various tricks; with abundant internal strength, even ordinary moves can unleash enormous power (內力充沛,平常的招式也有可能發揮巨大的威力).