
Slide credit from Hung-Yi Lee and Mark Chang

Talk Outline

Part I: Introduction to Machine Learning & Deep Learning

Part II: Variants of Neural Nets

Part III: Beyond Supervised Learning & Recent Trends
PART I
Introduction to Machine Learning & Deep Learning
Part I: Introduction to ML & DL
- Basic Machine Learning
- Basic Deep Learning
- Toolkits and Learning Recipe
Machine Learning
- Machine learning has been rising rapidly in recent years.
What Can Computers Do?
- Programs can only do the things you ask them to do.
Program for Solving Tasks
- Task: predicting positive or negative given a product review

“I love this product!” → program.py → +
“It claims too much.” → program.py → −
“It’s a little expensive.” → program.py → ?

if input contains “love”, “like”, etc. → output = positive
if input contains “too much”, “bad”, etc. → output = negative

“台灣第一波上市!” (“First release in Taiwan!”) → program.py → 推 (upvote)
“規格好雞肋…” (“The specs are underwhelming…”) → program.py → 噓 (downvote)
“樓下買了我才考慮” (“I’ll only consider it after someone else buys it first”) → program.py → ?

Some tasks are complex, and we don’t know how to write a program to solve them.
Learning ≈ Looking for a Function
- Task: predicting positive or negative given a product review

“I love this product!” → f → +
“It claims too much.” → f → −
“It’s a little expensive.” → f → ?

Given a large amount of data, the machine learns what the function f should be.
Learning ≈ Looking for a Function
- Speech Recognition: f(audio) = “你好” (“hello”)
- Image Recognition: f(image) = “cat”
- Go Playing: f(board) = “5-5” (next move)
- Dialogue System: f(“台積電怎麼去” / “How do I get to TSMC?”) = “地址為… 現在建議搭乘計程車” (“The address is …; taking a taxi is currently recommended”)
Image Recognition: Framework
f(image) = “cat”
- A set of functions f1, f2, … (the model), e.g.
  f1(cat image) = “cat”, f1(dog image) = “dog”
  f2(cat image) = “monkey”, f2(dog image) = “snake” — f1 is better!
- Goodness of function f: supervised learning — the training data pairs each function input (image) with the desired function output (“monkey”, “cat”, “dog”)
- Training: Step 1 define the set of functions (model), Step 2 measure the goodness of a function, Step 3 pick the “best” function f*
- Testing: using f*, e.g. f*(image) = “cat”
Training is to pick the best function given the observed data.
Testing is to predict the label using the learned function.
Why Learn Machine Learning?
- The AI age: AI may take over much of today’s labor.
- A new job market: AI trainers (machine learning experts, data scientists).
AI Trainers
- Don’t machines learn by themselves? Why do we need AI trainers?
- Battles are fought by the Pokémon, so why do we need Pokémon trainers?

Pokémon trainers:
- pick suitable Pokémon for a battle — Pokémon have different attributes
- a summoned Pokémon does not always obey (e.g. Ash’s Charizard), so enough experience is needed

AI trainers:
- in Step 1, pick a suitable model — different models suit different problems
- in Step 3, the best function may not be found (e.g. in deep learning), so enough experience is needed

- Behind every powerful AI there is a capable AI trainer.
- Let’s walk the road toward becoming AI trainers together.
Machine Learning Map
- Scenario: Supervised Learning, Semi-Supervised Learning, Transfer Learning, Unsupervised Learning, Reinforcement Learning
- Task: Regression, Classification
- Method: Linear Model; Non-Linear Model (Deep Learning; SVM, Decision Tree, KNN, etc.)
Machine Learning Map
- Task: Regression — the output of the target function 𝑓 is a “scalar” (a single numeric value).
Regression
- Stock Market Forecast: 𝑓(past market data) = Dow Jones Industrial Average at tomorrow
- Self-driving Car: 𝑓(sensor data) = steering wheel angle
- Recommendation: 𝑓(user A, product B) = purchase likelihood
Example Application
- Estimating the Combat Power (CP) of a pokémon after evolution:
  𝑓(𝑥) = 𝑦, the CP after evolution
  where the input 𝑥 is a pokémon with attributes 𝑥_cp (current CP), 𝑥_s (species), 𝑥_hp, 𝑥_w (weight), 𝑥_h (height)
Step 1: Model
𝑦 = 𝑏 + 𝑤 · 𝑥_cp, where w and b are parameters (can be any value)
A set of functions (the model):
  f1: y = 10.0 + 9.0 · x_cp
  f2: y = 9.8 + 9.2 · x_cp
  f3: y = −0.8 − 1.2 · x_cp
  …… (infinitely many)
Linear model: 𝑦 = 𝑏 + Σᵢ 𝑤ᵢ 𝑥ᵢ
  𝑥ᵢ: an attribute (feature) of input x; 𝑤ᵢ: weight; b: bias
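A minimal sketch of Step 1 in Python (the helper name `linear_model` is an illustrative assumption, not from the slides): the model is the set of functions y = b + w · x_cp, one function per choice of (w, b).

```python
def linear_model(w, b):
    """Return one member of the function set y = b + w * x_cp."""
    def f(x_cp):
        return b + w * x_cp
    return f

# Three members of the (infinite) function set from the slide:
f1 = linear_model(9.0, 10.0)   # y = 10.0 + 9.0 * x_cp
f2 = linear_model(9.2, 9.8)    # y = 9.8 + 9.2 * x_cp
f3 = linear_model(-1.2, -0.8)  # y = -0.8 - 1.2 * x_cp
```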
Step 2: Goodness of Function
𝑦 = 𝑏 + 𝑤 · 𝑥_cp
Training data: pairs of function input and function output (a scalar):
  1st pokémon: (𝑥¹, ŷ¹)
  2nd pokémon: (𝑥², ŷ²)
  ……
  10th pokémon: (𝑥¹⁰, ŷ¹⁰)
Each example gives a point (𝑥_cpⁿ, ŷⁿ); this is real data.
Source: https://www.openintro.org/stat/data/?data=pokemon
Step 2: Goodness of Function
𝑦 = 𝑏 + 𝑤 · 𝑥_cp
Loss function 𝐿: input is a function, output is how bad it is
  L(𝑓) = L(𝑤, 𝑏) = Σₙ₌₁¹⁰ (ŷⁿ − (𝑏 + 𝑤 · 𝑥_cpⁿ))²
where 𝑏 + 𝑤 · 𝑥_cpⁿ is the estimated y based on the input function, the squared difference is the estimation error, and the sum runs over the examples.
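The loss above can be sketched directly in Python (the `data` values below are made up for illustration; the slide's real pokémon data is not reproduced here):

```python
def loss(w, b, data):
    """L(w, b) = sum_n (y_hat^n - (b + w * x_cp^n))^2 over (x_cp, y_hat) pairs."""
    return sum((y_hat - (b + w * x_cp)) ** 2 for x_cp, y_hat in data)

# Toy (x_cp, y_hat) pairs -- illustrative only:
data = [(10.0, 20.0), (20.0, 45.0), (30.0, 62.0)]
```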
Step 2: Goodness of Function
- Loss function: L(𝑤, 𝑏) = Σₙ₌₁¹⁰ (ŷⁿ − (𝑏 + 𝑤 · 𝑥_cpⁿ))²
- Each point in the (w, b) plane is a function (e.g. y = −180 − 2 · x_cp), and the color represents 𝐿(𝑤, 𝑏).
Step 3: Best Function
L(𝑤, 𝑏) = Σₙ₌₁¹⁰ (ŷⁿ − (𝑏 + 𝑤 · 𝑥_cpⁿ))²
Pick the “best” function:
  𝑓* = arg min_𝑓 𝐿(𝑓)
  𝑤*, 𝑏* = arg min_{𝑤,𝑏} 𝐿(𝑤, 𝑏) = arg min_{𝑤,𝑏} Σₙ₌₁¹⁰ (ŷⁿ − (𝑏 + 𝑤 · 𝑥_cpⁿ))²
Step 3: Gradient Descent
- Consider a loss function 𝐿(𝑤) with one parameter w: 𝑤* = arg min_𝑤 𝐿(𝑤)
- (Randomly) pick an initial value 𝑤⁰
- Compute (d𝐿/d𝑤)|_{𝑤=𝑤⁰} and update: 𝑤¹ ← 𝑤⁰ − 𝜂 (d𝐿/d𝑤)|_{𝑤=𝑤⁰}
- Compute (d𝐿/d𝑤)|_{𝑤=𝑤¹} and update: 𝑤² ← 𝑤¹ − 𝜂 (d𝐿/d𝑤)|_{𝑤=𝑤¹}
- …… after many iterations: 𝑤⁰ → 𝑤¹ → 𝑤² → … → 𝑤ᵀ
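The one-parameter update rule can be sketched in a few lines; the toy loss L(w) = (w − 3)² (with derivative 2(w − 3)) is an illustrative assumption, not the slide's pokémon loss:

```python
def gradient_descent(w0, eta, steps):
    """Repeat w <- w - eta * dL/dw on the toy loss L(w) = (w - 3)^2."""
    w = w0
    for _ in range(steps):
        grad = 2.0 * (w - 3.0)  # dL/dw at the current w
        w = w - eta * grad
    return w
```

With a suitable learning rate η, the iterates converge to the minimizer w* = 3.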
Step 3: Gradient Descent
- How about two parameters? 𝑤*, 𝑏* = arg min_{𝑤,𝑏} 𝐿(𝑤, 𝑏)
- (Randomly) pick initial values 𝑤⁰, 𝑏⁰
- Compute ∂𝐿/∂𝑤 and ∂𝐿/∂𝑏 at (𝑤⁰, 𝑏⁰), then update:
  𝑤¹ ← 𝑤⁰ − 𝜂 (∂𝐿/∂𝑤)|_{𝑤=𝑤⁰,𝑏=𝑏⁰},  𝑏¹ ← 𝑏⁰ − 𝜂 (∂𝐿/∂𝑏)|_{𝑤=𝑤⁰,𝑏=𝑏⁰}
- Compute ∂𝐿/∂𝑤 and ∂𝐿/∂𝑏 at (𝑤¹, 𝑏¹), then update:
  𝑤² ← 𝑤¹ − 𝜂 (∂𝐿/∂𝑤)|_{𝑤=𝑤¹,𝑏=𝑏¹},  𝑏² ← 𝑏¹ − 𝜂 (∂𝐿/∂𝑏)|_{𝑤=𝑤¹,𝑏=𝑏¹}
- The gradient collects both partial derivatives: ∇𝐿 = [∂𝐿/∂𝑤, ∂𝐿/∂𝑏]ᵀ
Step 3: Gradient Descent
(figure: the descent trajectory in the (w, b) plane; the color is the value of the loss 𝐿(𝑤, 𝑏))
Step 3: Gradient Descent
- Local optima? The loss function is convex in linear regression:
  linear regression → no local optima
Step 3: Gradient Descent
- Formulation of ∂𝐿/∂𝑤 and ∂𝐿/∂𝑏:
  𝐿(𝑤, 𝑏) = Σₙ₌₁¹⁰ (ŷⁿ − (𝑏 + 𝑤 · 𝑥_cpⁿ))²
  ∂𝐿/∂𝑤 = Σₙ₌₁¹⁰ 2 (ŷⁿ − (𝑏 + 𝑤 · 𝑥_cpⁿ)) (−𝑥_cpⁿ)
  ∂𝐿/∂𝑏 = Σₙ₌₁¹⁰ 2 (ŷⁿ − (𝑏 + 𝑤 · 𝑥_cpⁿ)) (−1)
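The two partial derivatives can be written out directly; `gradients` is an illustrative name, and any data passed in is assumed to be (x_cp, ŷ) pairs:

```python
def gradients(w, b, data):
    """Return (dL/dw, dL/db) for L(w, b) = sum_n (y_hat - (b + w * x))^2."""
    dw = sum(2.0 * (y - (b + w * x)) * (-x) for x, y in data)
    db = sum(2.0 * (y - (b + w * x)) * (-1.0) for x, y in data)
    return dw, db
```

At a perfect fit the residuals vanish, so both partial derivatives are zero.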
Learned Model
𝑦 = 𝑏 + 𝑤 · 𝑥_cp with b = −188.4, w = 2.7
Average error on training data (over the 10 examples’ errors 𝑒ⁿ): 31.9
Model Generalization
What we really care about is the error on new data (testing data).
- Another 10 pokémons as testing data, same model 𝑦 = 𝑏 + 𝑤 · 𝑥_cp with b = −188.4, w = 2.7
- Average error on testing data: 35.0 > average error on training data (31.9)
How can we do better?
Model Generalization
- Select another model: 𝑦 = 𝑏 + 𝑤₁ · 𝑥_cp + 𝑤₂ · 𝑥_cp²
  Best function: 𝑏 = −10.3, 𝑤₁ = 1.0, 𝑤₂ = 2.7 × 10⁻³ → average training error 15.4, testing error 18.4. Better! Could it be even better?
- Add 𝑤₃ · 𝑥_cp³:
  Best function: 𝑏 = 6.4, 𝑤₁ = 0.66, 𝑤₂ = 4.3 × 10⁻³, 𝑤₃ = 1.8 × 10⁻⁶ → training 15.3, testing 18.1. Slightly better. How about a more complex model?
- Add 𝑤₄ · 𝑥_cp⁴: training 14.9, but testing 28.8 — the results become worse.
- Add 𝑤₅ · 𝑥_cp⁵: training 12.8, but testing 232.1 — the testing results are very bad.
Model Selection
1. 𝑦 = 𝑏 + 𝑤 · 𝑥_cp
2. 𝑦 = 𝑏 + 𝑤₁ · 𝑥_cp + 𝑤₂ · 𝑥_cp²
3. 𝑦 = 𝑏 + 𝑤₁ · 𝑥_cp + 𝑤₂ · 𝑥_cp² + 𝑤₃ · 𝑥_cp³
4. 𝑦 = 𝑏 + 𝑤₁ · 𝑥_cp + 𝑤₂ · 𝑥_cp² + 𝑤₃ · 𝑥_cp³ + 𝑤₄ · 𝑥_cp⁴
5. 𝑦 = 𝑏 + 𝑤₁ · 𝑥_cp + 𝑤₂ · 𝑥_cp² + 𝑤₃ · 𝑥_cp³ + 𝑤₄ · 𝑥_cp⁴ + 𝑤₅ · 𝑥_cp⁵
A more complex model yields lower error on training data — if we can truly find the best function.
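The claim above can be reproduced on synthetic data (not the slide's pokémon data; the random points and the `train_error` helper are illustrative assumptions): as the polynomial degree grows, the training error can only go down or stay the same, because each model set contains the previous one.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 10)
y = 2.7 * x - 5.0 + rng.normal(0, 3.0, 10)  # noisy linear "ground truth"

def train_error(degree):
    """Mean squared training error of the best polynomial of this degree."""
    coeffs = np.polyfit(x, y, degree)
    pred = np.polyval(coeffs, x)
    return float(np.mean((y - pred) ** 2))

errors = [train_error(d) for d in range(1, 6)]  # degrees 1..5
```

Testing error, as the slides show, behaves very differently: it eventually rises as the model overfits.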
Machine Learning Map
- Task: Classification, under the supervised learning scenario (alongside semi-supervised, transfer, unsupervised, and reinforcement learning).
Classification
- Binary Classification: function f maps the input to Yes / No
- Multi-Class Classification: function f maps the input to Class 1, Class 2, …, Class N
Binary Classification – Spam Filtering
- Features such as whether “Talk” or “free” appears in the e-mail are fed to the model, which outputs 1 (Yes, spam) or 0 (No).
(http://spam-filter-review.toptenreviews.com/)
Multi-Class – Image Recognition
- The model maps an input image to one of several classes: “monkey”, “cat”, “dog”, …
Multi-Class – Topic Classification
- Features such as whether “stock” or “president” appears in the document are fed to the model, which outputs a topic: 政治 (politics), 經濟 (economy), 體育 (sports), 財經 (finance), …
http://top-breaking-news.com/
Machine Learning Map
- Method: Deep Learning is a non-linear model, alongside linear models and other non-linear methods (SVM, decision tree, KNN, etc.), and applies across scenarios and tasks.
Part I: Introduction to ML & DL
- Basic Machine Learning
- Basic Deep Learning
- Toolkits and Learning Recipe
Stacked Functions Learned by Machine
- Production line (生產線): a deep learning model stacks simple functions, e.g. “台灣第一波上市!” (“First release in Taiwan!”) → f1 → f2 → f3 → 推 (upvote)
- f: a very complex function built from simple ones
- End-to-end training: what each simple function should do is learned automatically
- Deep learning usually refers to neural-network-based models
Three Steps for Deep Learning
- Step 1: define a set of functions (a Neural Network)
- Step 2: goodness of function
- Step 3: pick the best function
Neural Network
Neuron: z = a₁w₁ + … + aₖwₖ + … + a_K w_K + b,  a = σ(z)
A neuron is a simple function of its inputs a₁ … a_K, with weights w₁ … w_K, bias b, and activation function σ.
Neural Network
Sigmoid activation function: σ(z) = 1 / (1 + e⁻ᶻ)
Example neuron: inputs (2, −1), weights (1, −1), bias 1 → z = 2 · 1 + (−1) · (−1) + 1 = 4 → σ(4) ≈ 0.98
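A single neuron with a sigmoid activation, matching the example on this slide (the helper names `sigmoid` and `neuron` are illustrative):

```python
import math

def sigmoid(z):
    """sigma(z) = 1 / (1 + e^-z)"""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias):
    """z = sum_k a_k * w_k + b, then a = sigma(z)."""
    z = sum(a * w for a, w in zip(inputs, weights)) + bias
    return sigmoid(z)
```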
Neural Network
- Different connections lead to different network structures.
- The neurons have different values of weights and biases.
- Weights and biases are the network parameters 𝜃.
Fully Connected Feedforward Network
With input (1, −1) and sigmoid activation σ(z) = 1 / (1 + e⁻ᶻ):
- first neuron: weights (1, −2), bias 1 → z = 4 → output 0.98
- second neuron: weights (−1, 1), bias 0 → z = −2 → output 0.12
Fully Connected Feedforward Network
Passing (0.98, 0.12) through two more layers gives (0.86, 0.11) and finally (0.62, 0.83); the input (0, 0) gives (0.73, 0.5) → (0.72, 0.12) → (0.51, 0.85).
This is a function with vector input and vector output:
  𝑓([1, −1]) = [0.62, 0.83],  𝑓([0, 0]) = [0.51, 0.85]
Given parameters 𝜃, the structure defines a function; given only the network structure, it defines a function set.
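The full forward pass can be sketched as below. The weights and biases are reconstructed from the numbers on the slide and may not match the original deck exactly, but they reproduce both example outputs:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(inputs, weights, biases):
    """One fully connected layer: weights[i] are the input weights of neuron i."""
    return [sigmoid(sum(a * w for a, w in zip(inputs, ws)) + b)
            for ws, b in zip(weights, biases)]

def network(x):
    h1 = layer(x,  [[1, -2], [-1, 1]],  [1, 0])   # -> (0.98, 0.12) for (1, -1)
    h2 = layer(h1, [[2, -1], [-2, -1]], [0, 0])   # -> (0.86, 0.11)
    return layer(h2, [[3, -1], [-1, 4]], [-2, 2]) # -> (0.62, 0.83)
```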
Fully Connected Feedforward Network
Input layer (x₁ … x_N) → hidden layers (Layer 1 … Layer L of neurons) → output layer (y₁ … y_M)
“Deep” means many hidden layers.
Why Deep? Universality Theorem
- Any continuous function f : Rᴺ → Rᴹ can be realized by a network with only one hidden layer (given enough hidden neurons).
- So why “deep” and not “fat”?
Fat + Shallow v.s. Thin + Deep
- Compare two networks with the same number of parameters: one shallow and wide, one deep and narrow.
Why Deep
Logic circuits:
- consist of gates
- two layers of logic gates can represent any Boolean function
- some functions are much simpler to build with multiple layers of gates
Neural networks:
- consist of neurons
- a network with one hidden layer can represent any continuous function
- some functions are much simpler to represent with multiple layers of neurons
Deep = Many Hidden Layers
ImageNet top-5 error: AlexNet (2012) 16.4%, VGG (2014) 7.3%, GoogleNet (2014) 6.7%
http://cs231n.stanford.edu/slides/winter1516_lecture8.pdf
Deep = Many Hidden Layers
Residual Net (2015), with a special structure, reaches 3.57% — compared with AlexNet (2012) 16.4%, VGG (2014) 7.3%, GoogleNet (2014) 6.7%; it has more layers than Taipei 101 has floors.
Output Layer
- Ordinary layer as output: yᵢ = σ(zᵢ); in general the output of the network can be any value, which may not be easy to interpret.
- Softmax layer as the output layer yields a probability: 1 > 𝑦ᵢ > 0 and Σᵢ 𝑦ᵢ = 1
  𝑦ᵢ = e^{zᵢ} / Σⱼ₌₁³ e^{zⱼ}
  Example: z = (3, 1, −3) → (e³, e¹, e⁻³) ≈ (20, 2.7, 0.05) → y ≈ (0.88, 0.12, ≈0)
Example Application
- Handwriting digit recognition. Input: a 16 × 16 image flattened to a 256-dim vector x₁ … x₂₅₆ (ink → 1, no ink → 0). Output: a 10-dim vector y₁ … y₁₀, each dimension the confidence of a digit, e.g. y₁ = 0.1 (is “1”), y₂ = 0.7 (is “2”), …, y₁₀ = 0.2 (is “0”) → the image is “2”.
- What is needed is a function with a 256-dim input vector and a 10-dim output vector; the network (input layer → hidden layers → output layer) is a function set containing the candidates for recognizing digits.
- You need to decide the network structure to let a good function be in your function set.
FAQ
- Q: How many layers? How many neurons for each layer? — Trial and error + intuition.
- Q: Can we design the network structure? — Variants of neural networks (next lecture).
- Q: Can the structure be automatically determined? — Yes, but not widely studied yet.
Three Steps for Deep Learning
- Step 1: define a set of functions; Step 2: goodness of function; Step 3: pick the best function.
Training Data
- Preparing training data: images and their labels, e.g. “5” “0” “4” “1” “9” “2” “1” “3”
- The learning target is defined on the training data.

Learning Target
- Input x₁ … x₂₅₆ (16 × 16, ink → 1, no ink → 0) → softmax outputs y₁ … y₁₀
- The learning target: for an input image of “1”, y₁ has the maximum value; for an input image of “2”, y₂ has the maximum value; ……

Loss
- Given a set of parameters, for an image of “1” the output (y₁ … y₁₀) should be as close as possible to the target (1, 0, …, 0); the gap is the loss 𝑙.
- Loss can be the square error or cross entropy between the network output and the target.
- A good function should make the loss of all examples as small as possible.
Total Loss
- For all training data: 𝐿 = Σᵣ₌₁ᴿ 𝑙ʳ, to be made as small as possible.
- x¹ → NN → y¹ compared with ŷ¹ gives 𝑙¹; x² gives 𝑙²; …; xᴿ gives 𝑙ᴿ.
- Find a function in the function set that minimizes the total loss L — i.e. find the network parameters 𝜽* that minimize total loss L.
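A sketch of the total loss as a sum of per-example losses (square error here; the stand-in network `identity` and the data are illustrative, not the slide's digit recognizer):

```python
def example_loss(y, y_hat):
    """Square error between a network output vector and its target vector."""
    return sum((a - b) ** 2 for a, b in zip(y, y_hat))

def total_loss(f, data):
    """L = sum_r l^r over all R training examples (x^r, y_hat^r)."""
    return sum(example_loss(f(x), y_hat) for x, y_hat in data)

identity = lambda x: x  # trivial stand-in "network"
```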
Three Steps for Deep Learning
- Step 1: define a set of functions; Step 2: goodness of function; Step 3: pick the best function.
How to pick the best function
- Find network parameters 𝜽* that minimize total loss L.
- Enumerating all possible values is hopeless: 𝜃 = {𝑤₁, 𝑤₂, 𝑤₃, ⋯, 𝑏₁, 𝑏₂, 𝑏₃, ⋯} contains millions of parameters — e.g. in speech recognition, 8 layers with 1000 neurons each give 10⁶ weights between two adjacent layers.
Gradient Descent (network parameters 𝜃 = {𝑤₁, 𝑤₂, ⋯, 𝑏₁, 𝑏₂, ⋯})
Find network parameters 𝜽* that minimize total loss L:
- Pick an initial value for w — random or RBM pre-training; random is usually good enough.
- Compute ∂𝐿/∂𝑤: negative → increase w; positive → decrease w.
- Update 𝑤 ← 𝑤 − 𝜂 ∂𝐿/∂𝑤, where η is called the “learning rate”.
- Repeat until the update is little.
Gradient Descent
- Assume θ has two variables {θ₁, θ₂} (drawn as the 𝑤₁-𝑤₂ plane; color is the value of total loss L).
- Randomly pick a starting point, compute (∂𝐿/∂𝑤₁, ∂𝐿/∂𝑤₂), and move by (−𝜂 ∂𝐿/∂𝑤₁, −𝜂 ∂𝐿/∂𝑤₂).
- Hopefully, we would reach a minimum …..
Local Minima
- Depending on the value of a network parameter w, gradient descent can be very slow at a plateau (∂𝐿/∂𝑤 ≈ 0), stuck at a saddle point (∂𝐿/∂𝑤 = 0), or stuck at a local minimum (∂𝐿/∂𝑤 = 0).
Local Minima
- Gradient descent never guarantees global minima: different initial points reach different minima, so different results.
Gradient Descent
- This is the “learning” of machines in deep learning …… even AlphaGo uses this approach.
- What people imagine vs. what actually happens — I hope you are not too disappointed :p
Part I: Introduction to ML & DL
- Basic Machine Learning
- Basic Deep Learning
- Toolkits and Learning Recipe
Deep Learning Toolkit
- Backpropagation: an efficient way to compute ∂𝐿/∂𝑤 in a neural network.
Three Steps for Deep Learning
- Step 1: define a set of functions; Step 2: goodness of function; Step 3: pick the best function.
- Deep learning is so simple ……
- If you want to find a function, and you have lots of function input/output pairs as training data, you can use deep learning.
Keras
- TensorFlow and Theano are very flexible but need some effort to learn.
- Keras is an interface of TensorFlow or Theano: easy to learn and use, while still having some flexibility — you can modify it if you can write TensorFlow or Theano.
Keras
- François Chollet is the author of Keras. He currently works for Google as a deep learning engineer and researcher.
- Keras means horn in Greek.
- Documentation: http://keras.io/
- Examples: https://github.com/fchollet/keras/tree/master/examples
- Step-by-step lecture by Prof. Hung-Yi Lee:
  Slides: http://speech.ee.ntu.edu.tw/~tlkagk/courses/ML_2016/Lecture/Keras.pdf
  Lecture recording: https://www.youtube.com/watch?v=qetE6uUoLQA
Experience Using Keras
(Figure courtesy of classmate 沈昇勳.)
Example Application
- Handwriting digit recognition: the machine reads a 28 × 28 image and outputs “1”.
- MNIST data: http://yann.lecun.com/exdb/mnist/ — the “hello world” of deep learning.
- Keras provides a data set loading function: http://keras.io/datasets/
Three Steps for Deep Learning
- Step 1: define a set of functions; Step 2: goodness of function; Step 3: pick the best function.
Learning Recipe
- Run the three steps, then check: good results on training data? If NO, go back and modify the three steps. If YES, check: good results on testing data? If NO, this is overfitting; if YES, done.
Overfitting
- Possible solutions: more training samples; some tips: dropout, etc.
Learning Recipe
- Different approaches for different problems — e.g. dropout for good results on testing data.
Learning Recipe
- For good results on training data: choosing a proper loss, mini-batch training, new activation functions, adaptive learning rates, momentum.
- For good results on testing data: e.g. dropout.
Learning Recipe
- Training data (x, ŷ) → “best” function f*: we immediately know the performance.
- Testing data = validation (x, y) + real testing (x, y): on real testing we do not know the performance.
Learning Recipe
- If we do not get good results on the training set, modify the training process. Possible reasons:
  - no good function exists (bad hypothesis function set) → reconstruct the model architecture
  - cannot find a good function (local optima) → change the training strategy
Learning Recipe
- Get good results on the training set? If no, modify the training process. If yes: get good results on the dev/validation set? If no, prevent overfitting; if yes, done.
- Better performance on training but worse performance on dev → overfitting.
Concluding Remarks
- Basic Machine Learning:
  1. Define a set of functions
  2. Measure goodness of functions
  3. Pick the best function
- Basic Deep Learning: stacked functions
Talk Outline
- Part I: Introduction to Machine Learning & Deep Learning
- Part II: Variants of Neural Nets
- Part III: Beyond Supervised Learning & Recent Trends

PART II
Variants of Neural Networks
PART II: Variants of Neural Networks
- Convolutional Neural Network (CNN)
- Recurrent Neural Network (RNN)
PART II: Variants of Neural Networks
- Convolutional Neural Network (CNN) — widely used in image processing
- Recurrent Neural Network (RNN)
Why CNN for Image (Zeiler, M. D., ECCV 2014)
- An image is represented as pixels x₁ … x_N; the first layer learns the most basic classifiers, the second layer uses the first-layer outputs as modules to build classifiers, and so on ……
- Can the network be simplified by considering the properties of images?
Why CNN for Image
- Some patterns are much smaller than the whole image: a neuron does not have to see the whole image to discover the pattern (e.g. a “beak” detector).
- Connecting to a small region needs fewer parameters.
Why CNN for Image
- The same patterns appear in different regions: an “upper-left beak” detector and a “middle beak” detector do almost the same thing, so they can use the same set of parameters.
Why CNN for Image
- Subsampling the pixels will not change the object: a subsampled bird is still a bird.
- We can subsample the pixels to make the image smaller — fewer parameters for the network to process the image.
Three Steps for Deep Learning
- Step 1: define a set of functions (a Convolutional Neural Network); Step 2: goodness of function; Step 3: pick the best function.
Image Recognition
http://www.cs.toronto.edu/~fritz/absps/imagenet.pdf
The Whole CNN
image → [Convolution → Max Pooling] (can repeat many times) → Flatten → Fully Connected Feedforward Network → “cat”, “dog”, ……
The Whole CNN
- Property 1: some patterns are much smaller than the whole image → Convolution
- Property 2: the same patterns appear in different regions → Convolution
- Property 3: subsampling the pixels will not change the object → Max Pooling
- Finally Flatten and feed into the fully connected network.
Local Connectivity
- Neurons connect to a small region of the input.
Parameter Sharing
- The same feature in different positions: neurons share the same weights.
- Different features in the same position: neurons have different weights.
Convolutional Layers
(figures: each filter is a set of shared weights with a width, height, and depth matching the input depth; sliding the same filters A, B, C, D across positions of a depth-1 or depth-2 input produces one output value per filter per position, so the output depth equals the number of filters — here depth = 2)
Hyper-parameters of CNN
- Stride: the step size when sliding the filter (e.g. stride = 1 or stride = 2)
- Padding: zeros added around the border (e.g. padding = 0 or padding = 1)
129

Output Stride = 2
Volume (3x3x2)
Filter
(3x3x3)

Input
Volume (7x7x3) Padding = 1

http://cs231n.github.io/convolutional-networks/
Convolutional Layers
http://cs231n.github.io/convolutional-networks/
Pooling Layer
- 2 × 2 non-overlapping regions, no weights, depth unchanged (= 1 here). For the input
  1 3 2 4
  5 7 6 8
  0 0 3 3
  5 5 0 0
  maximum pooling gives [7 8; 5 3] (e.g. Max(1,3,5,7) = 7, Max(0,0,5,5) = 5) and average pooling gives [4 5; 2.5 1.5] (e.g. Avg(1,3,5,7) = 4).
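The slide's max pooling can be sketched directly (the helper name `max_pool_2x2` is illustrative; it assumes a square input with even side length):

```python
def max_pool_2x2(image):
    """2x2 non-overlapping max pooling over a list-of-lists image."""
    n = len(image)
    return [[max(image[i][j], image[i][j + 1],
                 image[i + 1][j], image[i + 1][j + 1])
             for j in range(0, n, 2)]
            for i in range(0, n, 2)]

image = [[1, 3, 2, 4],
         [5, 7, 6, 8],
         [0, 0, 3, 3],
         [5, 5, 0, 0]]
```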
Why “Deep” Learning?
Visual Perception of Human
http://www.nature.com/neuro/journal/v8/n8/images/nn0805-975-F1.jpg
Visual Perception of Computer
- Input layer → convolutional layer → pooling layer → convolutional layer → pooling layer; the receptive fields grow with depth.
- A convolutional layer with receptive fields of width 3 and height 3, followed by a max-pooling layer, turns the input image into filter responses.
Fully-Connected Layer
- Fully-connected layers: global feature extraction
- Softmax layer: classifier
- Pipeline: input image → input layer → convolutional layer → pooling layer → convolutional layer → pooling layer → fully-connected layer → softmax layer → class label (e.g. “5” vs. “7”)
140

Step 1: Step 2: Step 3: pick


define a set
Convolutional goodness of the best
of function
Neural Network function function

“monkey” 0
“cat” 1
CNN

……
“dog” 0
Convolution, Max target
Pooling, fully connected
What CNN Learned
141

 Alexnet
 http://www.cs.toronto.edu/~fritz/absps/imagenet.pdf

http://vision03.csail.mit.edu/cnn_art/data/single_layer.png
DNNs Are Easily Fooled
Nguyen et al., “Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images,” arXiv:1412.1897.
Visualizing CNN
- Feed a flower image into the CNN and observe a filter’s response; feed random noise and observe the same filter’s response.
Gradient Ascent
- Magnify the filter response: start from random noise, compute the score (the filter response) and its gradient with respect to the input, then update the input with the gradient times a learning rate, moving from lower to higher score.
(figure: images generated by repeating this update)
Different Layers of Visualization
(figure: images generated by maximizing responses at different layers of the CNN)
Multiscale Image Generation
- visualize → resize → visualize → resize → visualize: the CNN modifies the image at increasing scales.

Deep Dream
- Given a photo, the machine adds what it sees ……: the CNN exaggerates what it sees (e.g. a filter response such as (3.9, −1.5, 2.3) is pushed further).
http://deepdreamgenerator.com/
Deep Style (http://deepdreamgenerator.com/)
- Given a photo, make its style like famous paintings.
Deep Style — A Neural Algorithm of Artistic Style (https://arxiv.org/abs/1508.06576)
- One CNN captures the content of the photo, another captures the style of the painting; a new image is found that matches the former in content and the latter in style.

Neural Art Mechanism
- Artist: brain + scene + style → artwork. Computer: neural networks play the same roles.
Go Playing
- Network input: the board as a 19 × 19 matrix (image; black: 1, white: −1, none: 0). Output: the next move among the 19 × 19 positions (a 19 × 19 vector).
- A fully-connected feedforward network can be used, but a CNN performs much better.
More Application: Playing Go
- Training: records of previous plays, e.g. 黑 (black): 5之五, 白 (white): 天元, 黑: 五之5, …
- Target: given the first board, the CNN should output “天元” = 1 (else = 0); given the next board, “五之 5” = 1 (else = 0).
Why CNN for playing Go?
- Some patterns are much smaller than the whole image: AlphaGo uses 5 × 5 filters for the first layer.
- The same patterns appear in different regions.
- But subsampling the pixels will not change the object — how can this property be explained for Go???
PART II: Variants of Neural Networks
- Convolutional Neural Network (CNN)
- Recurrent Neural Network (RNN) — a neural network with memory
Example Application
- Slot Filling, e.g. in a ticket booking system:
  “I would like to arrive Taipei on November 2nd.”
  → Destination: Taipei; time of arrival: November 2nd
Example Application
- Solving slot filling by a feedforward network? Input: a word (each word is represented as a vector), e.g. “Taipei”.
1-of-N Encoding
How to represent each word as a vector? 1-of-N encoding, with lexicon = {apple, bag, cat, dog, elephant}:
- the vector is of lexicon size; each dimension corresponds to a word in the lexicon; the dimension for the word is 1, and the others are 0
  apple = [1 0 0 0 0], bag = [0 1 0 0 0], cat = [0 0 1 0 0], dog = [0 0 0 1 0], elephant = [0 0 0 0 1]
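The encoding above in a few lines (the helper name `one_of_n` is illustrative):

```python
def one_of_n(word, lexicon):
    """1-of-N encoding: a vector of lexicon size, 1 at the word's dimension."""
    return [1 if w == word else 0 for w in lexicon]

lexicon = ["apple", "bag", "cat", "dog", "elephant"]
```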
Example Application
- Solving slot filling by a feedforward network? Input: a word vector. Output: the probability distribution that the input word belongs to each slot (dest, time of departure, …).
Example Application
- “arrive Taipei on November 2nd” → other / dest / other / time / time
- “leave Taipei on November 2nd” → here “Taipei” is the place of departure
- The same word must be labeled differently depending on the context: the neural network needs memory!
Three Steps for Deep Learning
- Step 1: define a set of functions (a Recurrent Neural Network); Step 2: goodness of function; Step 3: pick the best function.
Recurrent Neural Network (RNN)
- The outputs of the hidden layer are stored in the memory.
- The memory can be considered as another input.
168

Probability of Probability of Probability of


“arrive” in each slot “Taipei” in each slot “on” in each slot
y1 y2 y3
store store
a1 a2 a3
a1 a2

x1 x2 x3

arrive Taipei on November 2nd


RNN
169

Different
Prob of “leave” Prob of “Taipei” Prob of “arrive” Prob of “Taipei”
in each slot in each slot in each slot in each slot
y1 y2 …… y1 y2 ……
store store
a1 a2 a1 a2
a1 …… a1 ……

x1 x2 …… x1 x2 ……
leave Taipei arrive Taipei

The values stored in the memory is different.


Deep RNN
- Multiple recurrent hidden layers are stacked between the inputs x_t, x_{t+1}, x_{t+2} and the outputs y_t, y_{t+1}, y_{t+2}.
Bidirectional RNN
- One RNN reads the input forward and another reads it backward; each output y_t is produced from both hidden states.
RNN
- Step 1: define a set of functions (a Recurrent Neural Network); Step 2: goodness of function; Step 3: pick the best function.
Learning Target
- Training sentence: “arrive Taipei on November 2nd”, labeled other / dest / other / time / time
- Each output yᵢ should match the 1-hot slot label (0 … 1 … 0); the memory is copied forward at each step.
Rough Error Surface
- The total loss as a function of the RNN parameters (w₁, w₂) is very rough: flat plateaus next to steep cliffs. [Razvan Pascanu, ICML’13]
Rough Error Surface
- Toy example: a 1000-step RNN with weight w, input 1 at the first step and 0 afterwards, so 𝑦¹⁰⁰⁰ = 𝑤⁹⁹⁹:
  𝑤 = 1 → 𝑦¹⁰⁰⁰ = 1;  𝑤 = 1.01 → 𝑦¹⁰⁰⁰ ≈ 20000 (large ∂𝐿/∂𝑤 — small learning rate?)
  𝑤 = 0.99 → 𝑦¹⁰⁰⁰ ≈ 0;  𝑤 = 0.01 → 𝑦¹⁰⁰⁰ ≈ 0 (small ∂𝐿/∂𝑤 — large learning rate?)
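The toy RNN above in code (the helper name `y_1000` is illustrative): tiny changes in w explode or vanish after 1000 steps, which is why the error surface is so rough.

```python
def y_1000(w):
    """y_t = w * y_{t-1}, with y_1 = 1 from the single input; returns y_1000 = w**999."""
    y = 1.0
    for _ in range(999):
        y = w * y
    return y
```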
RNN Applications
- In slot filling, input and output are both sequences of the same length — but RNN can do more than that!
Many-to-One
- Input is a vector sequence, but output is only one vector, e.g. sentiment analysis of movie reviews:
  “看了這部電影覺得很高興 …” (“watching this movie made me very happy …”) → Positive (正雷)
  “這部電影太糟了 …” (“this movie is terrible …”) → Negative (負雷)
  “這部電影很棒 …” (“this movie is great …”) → Positive (正雷)
  with finer classes 超好雷 / 好雷 / 普雷 / 負雷 / 超負雷 (from “super positive” to “super negative”).
Many-to-Many (Output is shorter)
- Both input and output are sequences, but the output is shorter.
- E.g. speech recognition: the input is a vector sequence of acoustic frames, the output a character sequence such as “好棒” (“great”).
- Problem? Outputting one character per frame gives “好好好棒棒棒棒棒”; simply trimming the repeats can never produce “好棒棒” (a sarcastic “great”).
Many-to-Many (Output is shorter)
- Connectionist Temporal Classification (CTC): add an extra symbol “φ” representing “null”.
  “好 φ φ 棒 φ φ φ φ” collapses to “好棒”, while “好 φ φ 棒 φ 棒 φ φ” collapses to “好棒棒”.
[Alex Graves, ICML’06][Alex Graves, ICML’14][Haşim Sak, Interspeech’15][Jie Li, Interspeech’15][Andrew Senior, ASRU’15]
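The CTC collapse rule on this slide — merge repeated symbols, then drop the null symbol — can be sketched as follows (the helper name `ctc_collapse` is illustrative; “-” stands in for the slide's φ):

```python
def ctc_collapse(path, blank="-"):
    """Merge consecutive repeats, then remove the blank (null) symbol."""
    out = []
    prev = None
    for s in path:
        if s != prev and s != blank:
            out.append(s)
        prev = s
    return "".join(out)
```

Because a blank separates the two 棒 symbols, “好--棒-棒--” keeps both of them, while “好--棒----” collapses to just “好棒”.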
Many-to-Many (Output has no limitation)
- Both input and output are sequences with different lengths → sequence-to-sequence learning.
- E.g. machine translation (machine learning → 機器學習): the encoder reads “machine learning”, and its final state contains all the information about the input sequence.
Many-to-Many (Output has no limitation)
- Decoding from that state can keep generating characters: 機 器 學 習 慣 性 … — the model doesn’t know when to stop.
Many-to-Many (Output has no limitation)
http://zh.pttpedia.wikia.com/wiki/%E6%8E%A5%E9%BE%8D%E6%8E%A8%E6%96%87 (鄉民百科, a PTT wiki page on the never-ending word-chain game)
Many-to-Many (Output has no limitation)
- Add a stop symbol “===” (斷): machine learning → 機 器 學 習 ===
[Ilya Sutskever, NIPS’14][Dzmitry Bahdanau, arXiv’15]
Image Caption Generation
- Input an image, output a sequence of words: a CNN encodes the whole image into a vector, and caption generation emits words one by one (“a woman is ……”) until “===”.
[Kelvin Xu, arXiv’15][Li Yao, ICCV’15]
Video Caption Generation
186

Example captions generated from input videos:
 A girl is running.
 A group of people is knocked by a tree.
 A group of people is walking in the forest.
Chit-Chat Bot
187

Training data: TV series transcripts (電視影集, ~40,000 sentences) and U.S. presidential election debates (美國總統大選辯論)


Sci-Fi Short Film - SUNSPRING
188

https://www.youtube.com/watch?v=LY7x2Ihqj
Attention and Memory
189

Human memory organizes experiences — breakfast today, what you learned in these lectures, summer vacation 10 years ago — and retrieves the relevant pieces to answer a question such as "What is deep learning?"

http://henrylo1605.blogspot.tw/2015/05/blog-post_56.html
Attention on Sensory Info
190

Information from the sensors (e.g. eyes, ears) enters sensory memory. When the input is a very long sequence or an image, attention selects part of the input object each time and passes it to working memory, which encodes into and retrieves from long-term memory.
Machine Translation
191

 Sequence-to-sequence learning: both the input and output are sequences, with different lengths.
 E.g. 深度學習 → deep learning

(diagram: an RNN encoder reads 深 度 學 習 into a vector containing the information of the whole sentence; an RNN decoder then generates "deep", "learning", <END>)
Machine Translation with Attention
192

The decoder state z₀ is matched against each encoder state h₁ … h₄ of 深 度 學 習, producing a score α₀¹ = match(z₀, h₁), etc. What is match?
 Cosine similarity of z and h
 A small NN whose input is z and h, and whose output is a scalar
 α = hᵀWz
How to learn the parameters? Jointly with the rest of the network, by backpropagation.
Machine Translation with Attention
193

The scores are normalized with a softmax: α̂₀¹ = 0.5, α̂₀² = 0.5, α̂₀³ = 0.0, α̂₀⁴ = 0.0.
The context vector c₀ = Σᵢ α̂₀ⁱ hᵢ = 0.5h₁ + 0.5h₂ is fed to the decoder RNN as input, which outputs "deep" and the next state z₁.


Machine Translation with Attention
194

The new decoder state z₁ is matched against h₁ … h₄ again, giving scores α₁¹ … α₁⁴.
Machine Translation with Attention
195

After the softmax: α̂₁¹ = 0.0, α̂₁² = 0.0, α̂₁³ = 0.5, α̂₁⁴ = 0.5.
The context vector c₁ = Σᵢ α̂₁ⁱ hᵢ = 0.5h₃ + 0.5h₄ is fed to the decoder, which outputs "learning" and state z₂.

Machine Translation with Attention
196

The same process repeats — match, softmax, context vector, decode — until the decoder generates <END>.
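One decoding step of the attention mechanism above — match scores, softmax, weighted sum — can be sketched with NumPy (a minimal sketch using the bilinear match α = hᵀWz; the 2-D shapes are illustrative):

```python
import numpy as np

def attend(z, H, W):
    """One attention step: scores α_i = h_iᵀWz, softmax → α̂, context c = Σ α̂_i h_i."""
    scores = H @ (W @ z)                  # one match score per encoder state h_i
    w = np.exp(scores - scores.max())     # numerically stable softmax
    w /= w.sum()
    return w, w @ H                       # (α̂, context vector c)

H = np.array([[1., 0.], [1., 0.], [0., 1.], [0., 1.]])  # h1..h4 for 深 度 學 習
weights, c = attend(np.array([1., 0.]), H, np.eye(2))
# h1 and h2 match z best, so the context vector is dominated by h1 and h2
```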
Speech Recognition with Attention
197

Chan et al., "Listen, Attend and Spell," arXiv, 2015.


Image Captioning
198

 Input: image
 Output: word sequence

A CNN encodes the input image into a vector for the whole image; the decoder then generates "a woman is … <END>".
Image Captioning with Attention
199

A CNN (its filters applied across the image) produces a vector for each image region; the decoder state z₀ is matched against every region vector, e.g. scoring 0.7 for the most relevant region.
Image Captioning with Attention
200

The attention weights over regions (e.g. 0.7, 0.1, 0.1, 0.1, 0.0, 0.0) give a weighted sum of the region vectors, which the decoder uses to generate Word 1 and the next state z₁.
Image Captioning with Attention
201

Attention then shifts (e.g. weights 0.0, 0.8, 0.2, 0.0, 0.0, 0.0): a new weighted sum is computed and the decoder generates Word 2, and so on.
Image Captioning
202

 Good examples
Image Captioning
203

 Bad examples
Video Captioning
204
Video Captioning
205
Reading Comprehension
206
Answer = DNN(extracted information), where the extracted information is Σₙ αₙxⁿ over the N sentence vectors x¹ … xᴺ of the document. The attention weights αₙ come from matching each sentence against the question q; the sentence-to-vector conversion can be jointly trained.
Reading Comprehension
207
With jointly learned sentence representations h¹ … hᴺ, the answer is DNN(Σₙ αₙhⁿ). The attention can be applied several times ("hopping"), re-reading the document before producing the answer.
Memory Network
208

(diagram: the query q is matched against memory to compute attention, information is extracted, attention is recomputed over several hops, and a DNN finally produces the answer a)
Memory Network
209

 Muti-hop performance analysis

https://www.facebook.com/Engineering/videos/10153098860532200/
Attention on Memory
210

Information from the sensors (e.g. eyes, ears) passes through attention into working memory, then is encoded into and retrieved from long-term memory. In an RNN/LSTM, a larger memory implies more parameters; with an external long-term memory, increasing the memory size does not increase the number of parameters.
Neural Turing Machine
211

 Von Neumann architecture


 Neural Turing Machine is an advanced RNN/LSTM.

Zhang et al., “Structured Memory for Neural Turing Machines,” arXiv, 2015.
Neural Turing Machine
212

(diagram: at each step the controller state reads from long-term memory m₀¹ … m₀⁴ with attention weights α̂₀¹ … α̂₀⁴)
Retrieval process: r₀ = Σᵢ α̂₀ⁱ m₀ⁱ; r₀ and the input x₁ produce the next state h₁ and the output y₁.
Zhang et al., “Structured Memory for Neural Turing Machines,” arXiv, 2015.
Neural Turing Machine
213

The controller also emits a key k₁, an erase vector e₁, and an add vector a₁. The attention is updated (simplified) as α₁ⁱ = (1 − λ)α₀ⁱ + λ cos(m₀ⁱ, k₁), followed by a softmax to obtain α̂₁ⁱ.

Zhang et al., “Structured Memory for Neural Turing Machines,” arXiv, 2015.
Neural Turing Machine
214

Encode (write) process, element-wise: m₁ⁱ = m₀ⁱ ∘ (1 − α̂₁ⁱe₁) + α̂₁ⁱa₁.

Zhang et al., “Structured Memory for Neural Turing Machines,” arXiv, 2015.
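The read, addressing, and write equations on these slides can be sketched in NumPy (a simplified sketch of the slide's equations, not the full Neural Turing Machine of Graves et al.):

```python
import numpy as np

def read(alpha, M):
    """Retrieval: r = Σ_i α̂_i m_i (M has one memory slot per row)."""
    return alpha @ M

def address(alpha_prev, M, k, lam=0.5):
    """Simplified addressing: α_i = (1−λ)·α_prev_i + λ·cos(m_i, k), then softmax."""
    cos = M @ k / (np.linalg.norm(M, axis=1) * np.linalg.norm(k) + 1e-9)
    s = (1 - lam) * alpha_prev + lam * cos
    e = np.exp(s - s.max())
    return e / e.sum()

def write(alpha, M, erase, add):
    """Encode: m_i ← m_i ∘ (1 − α̂_i e) + α̂_i a (element-wise)."""
    return M * (1 - np.outer(alpha, erase)) + np.outer(alpha, add)
```

With attention focused entirely on one slot (α̂ = [1, 0, 0, 0]) and a full erase vector, `write` replaces that slot with the add vector and leaves the other slots untouched.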
Neural Turing Machine
215

(diagram: the read–write cycle repeats over time steps, producing r₁, updated memory m₂ⁱ, and outputs y₁, y₂, …)
Zhang et al., “Structured Memory for Neural Turing Machines,” arXiv, 2015.
Concluding Remarks
216

 Convolutional Neural Network (CNN): input → convolution → pooling → convolution → pooling → …
 Recurrent Neural Network (RNN): hidden activations a1, a2, a3 are stored and fed back while producing y1, y2, y3 from x1, x2, x3
Talk Outline
217

Part I: Introduction to
Machine Learning & Deep Learning

Part II: Variants of Neural Nets

Part III: Beyond Supervised Learning


& Recent Trends
218 PART III
Beyond Supervised Learning & Recent Trend
Introduction
219

 Big data ≠ Big annotated data


 Machine learning techniques include:
 Supervised learning (if we have labelled data)
 Reinforcement learning (if we have an environment for reward)
 Unsupervised learning (if we do not have labelled data)

What can we do when there is not sufficient labelled training data?


Machine Learning Map
220

Scenario Task Method

Regression
Semi-Supervised Learning

Linear Model
Transfer Learning

Deep Learning SVM, Decision


Tree, KNN, etc Unsupervised Learning

Non-Linear Model
Classification Reinforcement Learning
Supervised Learning
Outline
221

 Semi-Supervised Learning
 Transfer Learning
 Unsupervised Learning
 化繁為簡 Representation Learning
 無中生有 Generative Model

 Reinforcement Learning
Outline
222

 Semi-Supervised Learning
 Transfer Learning
 Unsupervised Learning
 化繁為簡 Representation Learning
 無中生有 Generative Model

 Reinforcement Learning
Semi-Supervised Learning
223

Labelled
data
cat dog

Unlabeled
data

(Image of cats and dogs without labeling)


Semi-Supervised Learning
224

 Why does semi-supervised learning help?
The distribution of the unlabeled data provides cues about where the decision boundary should lie.


Outline
225

 Semi-Supervised Learning
 Transfer Learning
 Unsupervised Learning
 化繁為簡 Representation Learning
 無中生有 Generative Model

 Reinforcement Learning
Transfer Learning
226

Labelled
data
cat dog

Labeled
data
elephant elephant tiger tiger

Not related to the task considered


Transfer Learning
227

 Widely used in image processing
 Use sufficient labeled data to learn a CNN
 Use this CNN as a feature extractor

Pixels Layer 1 Layer 2 Layer L


x1 …… ……
x2 …… elephant

……
……
……
……

xN …… ……
Transfer Learning Example
228

A grad student's life (研究生 online) maps onto a manga artist's life (漫畫家 online): grad student ↔ manga artist, advisor (指導教授) ↔ editor (責編), running experiments (跑實驗) ↔ drawing storyboards (畫分鏡), submitting to journals (投稿期刊) ↔ submitting to JUMP (投稿 jump) — so the grad-student survival rules transfer. (Example from the manga 爆漫王 / Bakuman.)
Outline
229

 Semi-Supervised Learning
 Transfer Learning
 Unsupervised Learning
 化繁為簡 Representation Learning
 無中生有 Generative Model

 Reinforcement Learning
Outline
230

 Semi-Supervised Learning
 Transfer Learning
 Unsupervised Learning
 化繁為簡 Representation Learning
 無中生有 Generative Model

 Reinforcement Learning
Unsupervised Learning
231

 The unlabeled data is sometimes not related to the task

Labelled
data
cat dog

Unlabeled
data

(Just crawl millions of images from the Internet)


Unsupervised Learning
232

 化繁為簡 (simplifying the complex): representation learning — we only have the function input, and learn a compact code for it
 無中生有 (creating from nothing): generative models — we only have the function output, and learn a function that generates it from a code
Unsupervised Learning
233

 How does self-taught learning work?
 Why do unlabeled and even unrelated data help the task?
Because they let us find the latent factors that control the observations.


Latent Factors for Handwritten Digits
234

(example: images of the digit 3 rotated by −20°, −10°, 0°, 10°, 20° — the rotation angle is a latent factor)
Latent Factors for Documents
235

http://deliveryimages.acm.org/10.1145/2140000/2133826/figs/f1.jpg
Latent Factors for Recommendation
236

Latent Factor Exploitation
237

 Handwritten digits

The handwritten images are


composed of strokes

Strokes (Latent Factors)

…….
No. 1 No. 2 No. 3 No. 4 No. 5
Latent Factor Exploitation
238

Strokes (Latent Factors)

…….
No. 1 No. 2 No. 3 No. 4 No. 5
A 28 × 28 digit (784 pixels) = stroke No. 1 + stroke No. 3 + stroke No. 5,
so it can be represented by [1 0 1 0 1 0 …] — a much simpler representation.
Outline
239

 Semi-Supervised Learning
 Transfer Learning
 Unsupervised Learning
 化繁為簡 Representation Learning
 無中生有 Generative Model

 Reinforcement Learning
Autoencoder
240

 Represent a digit using 28 X 28 dimensions


 Not all 28 X 28 images are digits

Idea: represent the images of digits in a more compact way

NN Compact
code representation of
Encoder
Usually <784 the input object
28 X 28 = 784
Learn together

NN Can reconstruct
code
Decoder the original object
Autoencoder
241

Minimize ‖x − y‖²: the output y should be as close as possible to the input x.
Encode: a = σ(Wx + b) (input layer → bottleneck hidden layer)
Decode: y = σ(W′a + b′) (bottleneck layer → output layer)
Output of the hidden layer is the code
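The encode/decode pass above can be written out directly (a minimal NumPy sketch with random, untrained weights; the 784 → 30 sizes follow the slide):

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

W,  b  = rng.normal(scale=0.01, size=(30, 784)), np.zeros(30)    # encoder
Wp, bp = rng.normal(scale=0.01, size=(784, 30)), np.zeros(784)   # decoder

x = rng.random(784)               # a flattened 28×28 "image"
a = sigmoid(W @ x + b)            # code: a = σ(Wx + b), only 30 numbers
y = sigmoid(Wp @ a + bp)          # reconstruction: y = σ(W′a + b′)
loss = np.sum((x - y) ** 2)       # training would minimize ‖x − y‖²
```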
Autoencoder
242

 De-noising auto-encoder

Add noise to the input x, encode the noisy x′ into a = σ(Wx′ + b), decode to y = σ(W′a + b′), and train y to be as close as possible to the clean x.

Rifai, et al. "Contractive auto-encoders: Explicit invariance during feature extraction,“ in ICML, 2011.
Deep Autoencoder
243

(diagram: a deep autoencoder stacks layers down to a bottleneck code layer and back up, with the output x̃ trained to be as close as possible to the input x; the weights are initialized layer-by-layer by RBMs)
Hinton and Salakhutdinov. “Reducing the dimensionality of data with neural networks,” Science, 2006.
Deep Autoencoder
244

(comparison: an original 784-pixel image reconstructed via PCA with a 30-dim code, vs. a deep autoencoder 784 → 1000 → 500 → 250 → 30 → 250 → 500 → 1000 → 784; for 2-D feature visualization the code layer is shrunk to 2: 784 → 1000 → 500 → 250 → 2)
245
Auto-encoder – Text Retrieval
246

Vector Space Model Bag-of-word


this 1
is 1
word string:
query “This is an apple” a 0
an 1
apple 1
pen 0
document


Semantics are not considered
Autoencoder – Text Retrieval
247

(architecture: a 2000-dim bag-of-words for a document or query → 500 → 250 → 125 → 2-dim code; documents talking about the same thing will have close codes)
Autoencoder – Similar Image Retrieval
248

 Retrieved using Euclidean distance in pixel intensity space

Krizhevsky et al. "Using very deep autoencoders for content-based image retrieval," in ESANN, 2011.
Autoencoder – Similar Image Retrieval
249

(architecture: a 32×32 image → 8192 → 4096 → 2048 → 1024 → 512 → 256-dim code; trained by crawling millions of images from the Internet)


Autoencoder – Similar Image Retrieval
250

 Images retrieved using Euclidean distance in pixel intensity


space

 Images retrieved using 256 codes

Learning the useful latent factors


Autoencoder for DNN Pre-Training
251

 Greedy layer-wise pre-training again

output 10

500
Target

1000 784 𝑥෤
W1’
1000 1000
W1
Input 784 Input 784 𝑥
Autoencoder for DNN Pre-Training
252

 Greedy layer-wise pre-training again

output 10

500 1000 𝑎෤ 1
W2’
Target

1000 1000
W2
1000 1000 𝑎1
fix W1
Input 784 Input 784 𝑥
Autoencoder for DNN Pre-Training
253

 Greedy layer-wise pre-training again

output 10 1000 𝑎෤ 2
W3’
500 500
W3
Target

1000 1000 𝑎2
fix W2
1000 1000 𝑎1
fix W1
Input 784 Input 784 𝑥
Autoencoder for DNN Pre-Training
254

 Greedy layer-wise pre-training again


Fine-tune via backprop
output output 10 Random
10
W4 init
500 500
W3
Target

1000 1000
W2
1000 1000
W1
Input 784 Input 784 𝑥
Word Vector/Embedding
255

 Machines learn the meaning of words from reading
a lot of documents, without supervision

tree
flower

dog rabbit
run
jump cat
Word Embedding
256

 Machines learn the meaning of words from reading
a lot of documents, without supervision
 A word can be understood by its context
"You shall know a word by the company it keeps." 蔡英文 and 馬英九 appear in very similar contexts — 馬英九 520宣誓就職, 蔡英文 520宣誓就職 — so, when predicting wᵢ from the context …… wᵢ₋₂ wᵢ₋₁ ___, the two names should be treated as something very similar.
Prediction-Based
257

Input: the 1-of-N encoding of the word wᵢ₋₁. Output: the probability of each word being the next word wᵢ.
 Take out the input of the neurons in the first layer (z₁, z₂, …)
 Use it to represent the word w
 This is the word vector / word embedding feature V(w); nearby words in this space are related (tree/flower, dog/rabbit/cat, run/jump)
Prediction-Based
258

Collect data — 潮水 退了 就 知道 …, 不爽 不要 買 …, 公道價 八萬 一 …, … — and train the neural network to predict the next word from the previous one (潮水 → 退了, 退了 → 就, 就 → 知道, 不爽 → 不要, …), minimizing cross entropy.
Prediction-Based — "You shall know a word by the company it keeps"
259

Training text: …… 蔡英文 宣誓就職 …… and …… 馬英九 宣誓就職 ……
With either 蔡英文 or 馬英九 as wᵢ₋₁, the network should give "宣誓就職" a large probability as wᵢ. To achieve this, the two words must be mapped to similar first-layer activations (z₁, z₂) — so they end up with similar word vectors.
Various Architectures
260

 Continuous bag-of-words (CBOW) model
…… wᵢ₋₁ ____ wᵢ₊₁ …… : the network predicts the word wᵢ given its context (wᵢ₋₁, wᵢ₊₁)
 Skip-gram
…… ____ wᵢ ____ …… : the network predicts the context (wᵢ₋₁, wᵢ₊₁) given the word wᵢ

Word2Vec LM
261

 Goal: predicting the next word given the preceding context

https://ronxin.github.io/wevi/
Word2Vec CBOW
262

 Goal: predicting the target word given the surrounding words

https://ronxin.github.io/wevi/
Word2Vec Skip-Gram
263

 Skip-gram training data:


apple|drink^juice,orange|eat^apple,rice|drink^juice,juice|drink^milk,mil
k|drink^rice,water|drink^milk,juice|orange^apple,juice|apple^drink,milk
|rice^drink,drink|milk^water,drink|water^juice,drink|juice^water

https://ronxin.github.io/wevi/
Word Embedding
264

http://www.slideshare.net/hustwj/cikm-keynotenov2014
Word Embedding
265

 Characteristics
V(hotter) − V(hot) ≈ V(bigger) − V(big)
V(Rome) − V(Italy) ≈ V(Berlin) − V(Germany)
V(king) − V(queen) ≈ V(uncle) − V(aunt)

 Solving analogies
Rome : Italy = Berlin : ?
Compute V(Berlin) − V(Rome) + V(Italy) and find the word w with the closest V(w):
V(Germany) ≈ V(Berlin) − V(Rome) + V(Italy)
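The analogy computation can be sketched with toy vectors (the 2-D values here are made up for illustration; real embeddings have hundreds of dimensions):

```python
import numpy as np

V = {  # hypothetical toy embeddings
    "Rome":    np.array([1.0, 0.0]),
    "Italy":   np.array([1.0, 1.0]),
    "Berlin":  np.array([2.0, 0.1]),
    "Germany": np.array([2.0, 1.1]),
    "Paris":   np.array([3.0, 0.2]),
    "France":  np.array([3.0, 1.2]),
}

def analogy(a, b, c):
    """a : c = b : ?  — find w with V(w) closest to V(b) − V(a) + V(c)."""
    target = V[b] - V[a] + V[c]
    return min((w for w in V if w not in (a, b, c)),
               key=lambda w: np.linalg.norm(V[w] - target))

# Rome : Italy = Berlin : ?  →  analogy("Rome", "Berlin", "Italy") == "Germany"
```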
Outline
266

 Semi-Supervised Learning
 Transfer Learning
 Unsupervised Learning
 化繁為簡 Representation Learning
 無中生有 Generative Model

 Reinforcement Learning
Creation
267

Draw something!
Creation
268

 Generative Models
 https://openai.com/blog/generative-models/

What I cannot create,


I do not understand.
Richard Feynman

https://www.quora.com/What-did-Richard-Feynman-mean-when-he-said-What-I-cannot-create-I-do-not-understand
PixelRNN
269

 To create an image, generate one pixel at a time: each new pixel is predicted by a NN from the pixels generated so far (e.g. for 3 × 3 images)
Can be trained with just a large collection of images, without any annotation.
Aaron van den Oord, Nal Kalchbrenner, Koray Kavukcuoglu, Pixel Recurrent Neural Networks, arXiv preprint, 2016
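The pixel-by-pixel generation loop is a simple autoregressive sketch (here `predict_next` is a hypothetical stand-in for the trained network):

```python
def generate_image(predict_next, n_pixels):
    """Generate one pixel at a time, each conditioned on all pixels so far."""
    pixels = []
    for _ in range(n_pixels):
        pixels.append(predict_next(pixels))   # NN(previous pixels) → next pixel
    return pixels

# Toy "network" that alternates two colors for a 3×3 image:
checker = lambda prev: len(prev) % 2
print(generate_image(checker, 9))   # → [0, 1, 0, 1, 0, 1, 0, 1, 0]
```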
PixelRNN
270

Real
World

Aaron van den Oord, Nal Kalchbrenner, Koray Kavukcuoglu, Pixel Recurrent Neural Networks, arXiv preprint, 2016
PixelRNN – beyond Image
271

Audio: Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior,
Koray Kavukcuoglu, WaveNet: A Generative Model for Raw Audio, arXiv preprint, 2016
Video: Nal Kalchbrenner, Aaron van den Oord, Karen Simonyan, Ivo Danihelka, Oriol Vinyals, Alex Graves, Koray Kavukcuoglu, Video
Pixel Networks , arXiv preprint, 2016
Generative Adversarial Network (GAN)
272
Discriminative v.s. Generative Models
273

 Discriminative: learns a function that maps the input data x to some desired output class label y, i.e. directly learns the conditional distribution P(y|x)
 Generative: tries to learn the joint probability of the input data and labels simultaneously, i.e. P(x, y), which can be converted to P(y|x) for classification via Bayes rule
Advantage: generative models have the potential to understand and explain
the underlying structure of the input data even when there are no labels
擬態的演化
274

The fake butterfly starts out brown (棕色) → the predator learns "butterflies are not brown" (蝴蝶不是棕色); it then evolves leaf veins (葉脈) → "butterflies don't have leaf veins" (蝴蝶沒有葉脈); and so on — generator and discriminator co-evolve.

Generator
275

 Decoder from autoencoder as generator

encode decode
𝑥 𝑎 𝑥′
𝑊 𝑊′
hidden layer
Input layer output layer
 code
The generator's job is to generate the data from the code
Generative Adversarial Networks (GAN)
276

 Two competing neural networks: generator & discriminator

forger trying to produce


some counterfeit material

the police trying to detect


the forged items

Training the two networks jointly → the generator learns how to adapt its
parameters in order to produce output data that can fool the discriminator
Goodfellow, et al., “Generative adversarial networks,” in NIPS, 2014.
http://blog.aylien.com/introduction-generative-adversarial-networks-code-tensorflow/
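The forger/police competition corresponds to two opposed objectives (a minimal sketch of the original GAN losses; `d` stands for any discriminator returning P(real)):

```python
import numpy as np

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

def d_loss(d, x_real, x_fake):
    """Police: score real items high and counterfeits low."""
    return -np.log(d(x_real)) - np.log(1.0 - d(x_fake))

def g_loss(d, x_fake):
    """Forger: produce counterfeits the discriminator scores high."""
    return -np.log(d(x_fake))

# With a toy discriminator d(x) = σ(x), a counterfeit far from the real data
# gives the police a low loss and the forger a high loss — so the generator
# updates its parameters to push d(x_fake) up, i.e. to fool the discriminator.
d = lambda x: sigmoid(x)
```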
Generator Evolution
277

NN NN NN
Generator Generator Generator
v1 v2 v3

Discri- Discri- Discri-


minator minator minator
v1 v2 v3

Real images:
Cifar-10
278

 Which one is machine-generated?

https://openai.com/blog/generative-models/
Generated Bedrooms
279

Radford et al., “Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks,” arXiv:1511.06434.
Comics Drawing
280

https://github.com/mattya/chainer-DCGAN
Comics Drawing
281

http://qiita.com/mattya/items/e5bfe5e04b9d2f0bbd47
Pokémon Creation
282

 Small images of 792 Pokémon
 Can the machine learn to create new Pokémon?
Don't catch them — create them!
 Source of images:
http://bulbapedia.bulbagarden.net/wiki/List_of_Pok%C3%A9
mon_by_base_stats_(Generation_VI)
Original images are 40 × 40, shrunk to 20 × 20.
Pokémon Creation
283

 Each pixel is represented by 3 numbers (corresponding


to RGB)
R=50, G=150, B=100

 Each pixel is instead represented by a 1-of-N encoding feature: 0 0 1 0 0 …… — clustering similar colors gives 167 colors in total
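The RGB → 1-of-N conversion can be sketched as a nearest-palette-color lookup (a toy 3-color palette instead of the real 167 clustered colors):

```python
import numpy as np

palette = np.array([[50, 150, 100],   # toy palette; the slides cluster to 167 colors
                    [255, 0, 0],
                    [0, 0, 255]], dtype=float)

def encode_pixel(rgb):
    """Map an RGB pixel to the nearest palette color, as a 1-of-N vector."""
    idx = int(np.argmin(np.linalg.norm(palette - np.asarray(rgb, float), axis=1)))
    onehot = np.zeros(len(palette))
    onehot[idx] = 1.0
    return onehot

# encode_pixel([50, 150, 100]) → [1, 0, 0]
```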


Pokémon Creation
284

(figure: real Pokémon never seen by the machine, and generations where the model completes an image given 50% or 75% of it, or draws from scratch; it is difficult to evaluate generation)
Drawing from scratch needs some randomness.
285
Pokémon Creation
286

(diagram: a variational autoencoder — the NN encoder outputs means m₁…m₃ and σ₁…σ₃; the 10-dim code is cᵢ = exp(σᵢ)·eᵢ + mᵢ with noise eᵢ, and the NN decoder reconstructs the input. To generate, pick 2 code dimensions, fix the remaining 8, and feed codes to the decoder.)
287
288
Pokémon Creation - Data
289

 Original image (40 x 40):


http://speech.ee.ntu.edu.tw/~tlkagk/courses/ML_2016/Pokemon_creation/image.rar
 Pixels (20 x 20):
http://speech.ee.ntu.edu.tw/~tlkagk/courses/ML_2016/Pokemon_creation/pixel_color.txt
 Each line corresponds to an image, and each number corresponds to a pixel
 http://speech.ee.ntu.edu.tw/~tlkagk/courses/ML_2016/Pokemon_creation/colormap.txt

(each number in colormap.txt indexes one of the clustered colors: 0, 1, 2, ……)

Outline
290

 Semi-Supervised Learning
 Transfer Learning
 Unsupervised Learning
 化繁為簡 Representation Learning
 無中生有 Generative Model

 Reinforcement Learning
Reinforcement Learning
291

Observation Action

Agent

Don’t do Reward
that

Environment
Reinforcement Learning
292

Observation Action

Agent

Thank you. Reward

Environment
Agent learns to take actions to maximize expected reward.
Supervised vs. Reinforcement
293

 Supervised: learning from a teacher — "Hello" → say "Hi"; "Bye bye" → say "Good bye"
 Reinforcement: learning from critics — after a whole dialogue ("Hello" → …… → ……), the agent only hears "Bad"
Scenario of Reinforcement Learning
294

Observation Action

Reward Next Move

If win, reward = 1
If loss, reward = -1
Otherwise, reward = 0
Environment
Agent learns to take actions to maximize expected reward.
Supervised vs. Reinforcement
295

 Supervised learning: training based on supervisor/labels/annotation; feedback is instantaneous; time does not matter
 Reinforcement learning: training based only on a reward signal; feedback is delayed; time matters; the agent's actions affect the subsequent data

AlphaGo uses supervised learning + reinforcement learning


Reinforcement Learning
296

 RL is a general purpose framework for decision making


 RL is for an agent with the capacity to act
 Each action influences the agent’s future state
 Success is measured by a scalar reward signal
 Goal: select actions to maximize future reward
RL Difficulty
297

 Goal: select actions to maximize total future reward


 Actions may have long-term consequences
 Reward may be delayed
 It may be better to sacrifice immediate reward to gain more
long-term reward
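The trade-off between immediate and delayed reward is captured by the discounted return G = Σₜ γᵗrₜ (a minimal sketch; the reward sequences are made up):

```python
def discounted_return(rewards, gamma=0.9):
    """G = r_0 + γ·r_1 + γ²·r_2 + … , computed back-to-front."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

greedy  = discounted_return([1, 0, 0, 0])    # grab +1 immediately → G = 1.0
patient = discounted_return([0, 0, 0, 10])   # wait for +10 → G = 10·0.9³ = 7.29
# Sacrificing the immediate reward yields more long-term reward.
```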
Deep Reinforcement Learning
298

DNN
Observation Action
……


Function Function
Input Output

Used to pick the


best function Reward

Environment
Major Components in an RL Agent
299

 An RL agent may include one or more of these components


 Policy: agent’s behavior function
 Value function: how good is each state and/or action
 Model: agent’s representation of the environment
Maze Example
300

 Rewards: -1 per time-step


 Actions: N, E, S, W
 States: agent’s location
Maze Example: Policy
301

 Rewards: -1 per time-step


 Actions: N, E, S, W
 States: agent’s location

Arrows represent policy π(s) for each state s


Maze Example: Value Function
302

 Rewards: -1 per time-step


 Actions: N, E, S, W
 States: agent’s location

Numbers represent the value vπ(s) of each state s


E2E RL Task-Completion Bot
303

 Simulated User
 Generate interactions based on a predefined fake goal
 Automatically learn strategy by training on the simulated data
User goal: Two tickets for “the witch” tomorrow 9:30 PM at regal meridian 16, Seattle.
RULE-BASED SYSTEM
User: Can I get 2 tickets for the witch?
Agent: Which theater do you want?
User: Regal meridian 16
Agent: What time would you like to see it?
User: 9:30 pm
Agent: What date would you like to see it?
User: Tomorrow!
Agent: How many people are you?
User: Just two adults.
Agent: What is your location?
User: Seattle!
Agent: What kind of video format do you like?
User: Oh, any kind is okay.
Agent: Do you have any price constraint?
User: No, I don’t care.
Agent: Great - I was able to purchase 2 tickets for you to see the witch tomorrow at regal meridian 16 theater in Seattle at 9:30 pm.
User: Thank you.
Agent: Thank you!

REINFORCEMENT LEARNING SYSTEM
User: Can I get 2 tickets for the witch?
Agent: What date would you like to see it?
User: Tomorrow
Agent: Which theater do you prefer?
User: Regal meridian 16
Agent: What is your location?
User: Seattle
Agent: What time would you like to see it?
User: 9:30 pm.
Agent: Great - I was able to purchase 2 tickets for you to see the witch tomorrow at regal meridian 16 theater in Seattle at 9:30 pm.
User: Thanks.
Agent: Thanks!
More applications
304

 AlphaGo, Playing Video Games, Dialogue


 Flying Helicopter
 https://www.youtube.com/watch?v=0JL04JJjocc
 Driving
 https://www.youtube.com/watch?v=0xo1Ldx3L5Q
 Google Cuts Its Giant Electricity Bill With
DeepMind-Powered AI
 http://www.bloomberg.com/news/articles/2016-07-19/google-cuts-
its-giant-electricity-bill-with-deepmind-powered-ai
Concluding Remarks
305

 Semi-Supervised Learning
 Transfer Learning
 Unsupervised Learning
 化繁為簡 Representation Learning
 無中生有 Generative Model
 Reinforcement Learning (observation → agent → action; the environment returns a reward)
如何成為武林高手 (How to Become a Martial-Arts Master)
306

 Cultivate both internal strength (內功) and technique (招數)
 With abundant internal power, the strong overcome the weak
 With refined moves, the fast beat the slow

 Machine learning & deep learning also require both
 Internal power: computing resources
 Moves: the various techniques
 With abundant internal power, even ordinary moves can unleash enormous force
