
Slide credit from Hung-Yi Lee and Mark Chang

Talk Outline

Part I: Introduction to Machine Learning & Deep Learning

Part II: Variants of Neural Nets

Part III: Beyond Supervised Learning & Recent Trends
PART I
Introduction to Machine Learning & Deep Learning
Part I: Introduction to ML & DL
- Basic Machine Learning
- Basic Deep Learning
- Toolkits and Learning Recipe
Machine Learning
- Machine learning has been rising rapidly in recent years.
What Can Computers Do?
- Programs can only do the things you ask them to do.
Program for Solving Tasks
- Task: predicting positive or negative given a product review

“I love this product!” → program.py → +
“It claims too much.” → program.py → −
“It’s a little expensive.” → program.py → ?

if input contains “love”, “like”, etc. → output = positive
if input contains “too much”, “bad”, etc. → output = negative

“台灣第一波上市!” (“First release in Taiwan!”) → program.py → 推 (upvote)
“規格好雞肋…” (“The specs are underwhelming…”) → program.py → 噓 (downvote)
“樓下買了我才考慮” (“I’ll only consider it after someone else buys it first”) → program.py → ?

Some tasks are complex, and we don’t know how to write a program to solve them.
Learning ≈ Looking for a Function
- Task: predicting positive or negative given a product review

“I love this product!” → f → +
“It claims too much.” → f → −
“It’s a little expensive.” → f → ?

Given a large amount of data, the machine learns what the function f should be.
Learning ≈ Looking for a Function
- Speech Recognition: f(audio) = “你好” (“hello”)
- Image Recognition: f(image) = “cat”
- Go Playing: f(board) = “5-5” (next move)
- Dialogue System: f(“台積電怎麼去” / “How do I get to TSMC?”) = “地址為… 現在建議搭乘計程車” (“The address is …; taking a taxi is currently recommended”)
Image Recognition: Framework
f(image) = “cat”
- A set of functions f1, f2, … (the model), e.g.
  f1(cat image) = “cat”, f1(dog image) = “dog”
  f2(cat image) = “monkey”, f2(dog image) = “snake” — f1 is better!
- Goodness of function f: supervised learning — the training data pairs each function input (image) with the desired function output (“monkey”, “cat”, “dog”)
- Training: Step 1 define the set of functions (model), Step 2 measure the goodness of a function, Step 3 pick the “best” function f*
- Testing: using f*, e.g. f*(image) = “cat”
Training is to pick the best function given the observed data.
Testing is to predict the label using the learned function.
Why Learn Machine Learning?
- The AI age: AI may take over much of today’s labor.
- A new job market: AI trainers (machine learning experts, data scientists).
AI Trainers
- Don’t machines learn by themselves? Why do we need AI trainers?
- Battles are fought by the Pokémon, so why do we need Pokémon trainers?

Pokémon trainers:
- pick suitable Pokémon for a battle — Pokémon have different attributes
- a summoned Pokémon does not always obey (e.g. Ash’s Charizard), so enough experience is needed

AI trainers:
- in Step 1, pick a suitable model — different models suit different problems
- in Step 3, the best function may not be found (e.g. in deep learning), so enough experience is needed

- Behind every powerful AI there is a capable AI trainer.
- Let’s walk the road toward becoming AI trainers together.
Machine Learning Map
- Scenario: Supervised Learning, Semi-Supervised Learning, Transfer Learning, Unsupervised Learning, Reinforcement Learning
- Task: Regression, Classification
- Method: Linear Model; Non-Linear Model (Deep Learning; SVM, Decision Tree, KNN, etc.)
Machine Learning Map
- Task: Regression — the output of the target function 𝑓 is a “scalar” (a single numeric value).
Regression
- Stock Market Forecast: 𝑓(past market data) = Dow Jones Industrial Average at tomorrow
- Self-driving Car: 𝑓(sensor data) = steering wheel angle
- Recommendation: 𝑓(user A, product B) = purchase likelihood
Example Application
- Estimating the Combat Power (CP) of a pokémon after evolution:
  𝑓(𝑥) = 𝑦, the CP after evolution
  where the input 𝑥 is a pokémon with attributes 𝑥_cp (current CP), 𝑥_s (species), 𝑥_hp, 𝑥_w (weight), 𝑥_h (height)
Step 1: Model
𝑦 = 𝑏 + 𝑤 · 𝑥_cp, where w and b are parameters (can be any value)
A set of functions (the model):
  f1: y = 10.0 + 9.0 · x_cp
  f2: y = 9.8 + 9.2 · x_cp
  f3: y = −0.8 − 1.2 · x_cp
  …… (infinitely many)
Linear model: 𝑦 = 𝑏 + Σᵢ 𝑤ᵢ 𝑥ᵢ
  𝑥ᵢ: an attribute (feature) of input x; 𝑤ᵢ: weight; b: bias
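A minimal sketch of Step 1 in Python (the helper name `linear_model` is an illustrative assumption, not from the slides): the model is the set of functions y = b + w · x_cp, one function per choice of (w, b).

```python
def linear_model(w, b):
    """Return one member of the function set y = b + w * x_cp."""
    def f(x_cp):
        return b + w * x_cp
    return f

# Three members of the (infinite) function set from the slide:
f1 = linear_model(9.0, 10.0)   # y = 10.0 + 9.0 * x_cp
f2 = linear_model(9.2, 9.8)    # y = 9.8 + 9.2 * x_cp
f3 = linear_model(-1.2, -0.8)  # y = -0.8 - 1.2 * x_cp
```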
Step 2: Goodness of Function
𝑦 = 𝑏 + 𝑤 · 𝑥_cp
Training data: pairs of function input and function output (a scalar):
  1st pokémon: (𝑥¹, ŷ¹)
  2nd pokémon: (𝑥², ŷ²)
  ……
  10th pokémon: (𝑥¹⁰, ŷ¹⁰)
Each example gives a point (𝑥_cpⁿ, ŷⁿ); this is real data.
Source: https://www.openintro.org/stat/data/?data=pokemon
Step 2: Goodness of Function
𝑦 = 𝑏 + 𝑤 · 𝑥_cp
Loss function 𝐿: input is a function, output is how bad it is
  L(𝑓) = L(𝑤, 𝑏) = Σₙ₌₁¹⁰ (ŷⁿ − (𝑏 + 𝑤 · 𝑥_cpⁿ))²
where 𝑏 + 𝑤 · 𝑥_cpⁿ is the estimated y based on the input function, the squared difference is the estimation error, and the sum runs over the examples.
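The loss above can be sketched directly in Python (the `data` values below are made up for illustration; the slide's real pokémon data is not reproduced here):

```python
def loss(w, b, data):
    """L(w, b) = sum_n (y_hat^n - (b + w * x_cp^n))^2 over (x_cp, y_hat) pairs."""
    return sum((y_hat - (b + w * x_cp)) ** 2 for x_cp, y_hat in data)

# Toy (x_cp, y_hat) pairs -- illustrative only:
data = [(10.0, 20.0), (20.0, 45.0), (30.0, 62.0)]
```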
Step 2: Goodness of Function
- Loss function: L(𝑤, 𝑏) = Σₙ₌₁¹⁰ (ŷⁿ − (𝑏 + 𝑤 · 𝑥_cpⁿ))²
- Each point in the (w, b) plane is a function (e.g. y = −180 − 2 · x_cp), and the color represents 𝐿(𝑤, 𝑏).
Step 3: Best Function
L(𝑤, 𝑏) = Σₙ₌₁¹⁰ (ŷⁿ − (𝑏 + 𝑤 · 𝑥_cpⁿ))²
Pick the “best” function:
  𝑓* = arg min_𝑓 𝐿(𝑓)
  𝑤*, 𝑏* = arg min_{𝑤,𝑏} 𝐿(𝑤, 𝑏) = arg min_{𝑤,𝑏} Σₙ₌₁¹⁰ (ŷⁿ − (𝑏 + 𝑤 · 𝑥_cpⁿ))²
Step 3: Gradient Descent
- Consider a loss function 𝐿(𝑤) with one parameter w: 𝑤* = arg min_𝑤 𝐿(𝑤)
- (Randomly) pick an initial value 𝑤⁰
- Compute (d𝐿/d𝑤)|_{𝑤=𝑤⁰} and update: 𝑤¹ ← 𝑤⁰ − 𝜂 (d𝐿/d𝑤)|_{𝑤=𝑤⁰}
- Compute (d𝐿/d𝑤)|_{𝑤=𝑤¹} and update: 𝑤² ← 𝑤¹ − 𝜂 (d𝐿/d𝑤)|_{𝑤=𝑤¹}
- …… after many iterations: 𝑤⁰ → 𝑤¹ → 𝑤² → … → 𝑤ᵀ
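The one-parameter update rule can be sketched in a few lines; the toy loss L(w) = (w − 3)² (with derivative 2(w − 3)) is an illustrative assumption, not the slide's pokémon loss:

```python
def gradient_descent(w0, eta, steps):
    """Repeat w <- w - eta * dL/dw on the toy loss L(w) = (w - 3)^2."""
    w = w0
    for _ in range(steps):
        grad = 2.0 * (w - 3.0)  # dL/dw at the current w
        w = w - eta * grad
    return w
```

With a suitable learning rate η, the iterates converge to the minimizer w* = 3.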
Step 3: Gradient Descent
- How about two parameters? 𝑤*, 𝑏* = arg min_{𝑤,𝑏} 𝐿(𝑤, 𝑏)
- (Randomly) pick initial values 𝑤⁰, 𝑏⁰
- Compute ∂𝐿/∂𝑤 and ∂𝐿/∂𝑏 at (𝑤⁰, 𝑏⁰), then update:
  𝑤¹ ← 𝑤⁰ − 𝜂 (∂𝐿/∂𝑤)|_{𝑤=𝑤⁰,𝑏=𝑏⁰},  𝑏¹ ← 𝑏⁰ − 𝜂 (∂𝐿/∂𝑏)|_{𝑤=𝑤⁰,𝑏=𝑏⁰}
- Compute ∂𝐿/∂𝑤 and ∂𝐿/∂𝑏 at (𝑤¹, 𝑏¹), then update:
  𝑤² ← 𝑤¹ − 𝜂 (∂𝐿/∂𝑤)|_{𝑤=𝑤¹,𝑏=𝑏¹},  𝑏² ← 𝑏¹ − 𝜂 (∂𝐿/∂𝑏)|_{𝑤=𝑤¹,𝑏=𝑏¹}
- The gradient collects both partial derivatives: ∇𝐿 = [∂𝐿/∂𝑤, ∂𝐿/∂𝑏]ᵀ
Step 3: Gradient Descent
(figure: the descent trajectory in the (w, b) plane; the color is the value of the loss 𝐿(𝑤, 𝑏))
Step 3: Gradient Descent
- Local optima? The loss function is convex in linear regression:
  linear regression → no local optima
Step 3: Gradient Descent
- Formulation of ∂𝐿/∂𝑤 and ∂𝐿/∂𝑏:
  𝐿(𝑤, 𝑏) = Σₙ₌₁¹⁰ (ŷⁿ − (𝑏 + 𝑤 · 𝑥_cpⁿ))²
  ∂𝐿/∂𝑤 = Σₙ₌₁¹⁰ 2 (ŷⁿ − (𝑏 + 𝑤 · 𝑥_cpⁿ)) (−𝑥_cpⁿ)
  ∂𝐿/∂𝑏 = Σₙ₌₁¹⁰ 2 (ŷⁿ − (𝑏 + 𝑤 · 𝑥_cpⁿ)) (−1)
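The two partial derivatives can be written out directly; `gradients` is an illustrative name, and any data passed in is assumed to be (x_cp, ŷ) pairs:

```python
def gradients(w, b, data):
    """Return (dL/dw, dL/db) for L(w, b) = sum_n (y_hat - (b + w * x))^2."""
    dw = sum(2.0 * (y - (b + w * x)) * (-x) for x, y in data)
    db = sum(2.0 * (y - (b + w * x)) * (-1.0) for x, y in data)
    return dw, db
```

At a perfect fit the residuals vanish, so both partial derivatives are zero.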
Learned Model
𝑦 = 𝑏 + 𝑤 · 𝑥_cp with b = −188.4, w = 2.7
Average error on training data (over the 10 examples’ errors 𝑒ⁿ): 31.9
Model Generalization
What we really care about is the error on new data (testing data).
- Another 10 pokémons as testing data, same model 𝑦 = 𝑏 + 𝑤 · 𝑥_cp with b = −188.4, w = 2.7
- Average error on testing data: 35.0 > average error on training data (31.9)
How can we do better?
Model Generalization
- Select another model: 𝑦 = 𝑏 + 𝑤₁ · 𝑥_cp + 𝑤₂ · 𝑥_cp²
  Best function: 𝑏 = −10.3, 𝑤₁ = 1.0, 𝑤₂ = 2.7 × 10⁻³ → average training error 15.4, testing error 18.4. Better! Could it be even better?
- Add 𝑤₃ · 𝑥_cp³:
  Best function: 𝑏 = 6.4, 𝑤₁ = 0.66, 𝑤₂ = 4.3 × 10⁻³, 𝑤₃ = 1.8 × 10⁻⁶ → training 15.3, testing 18.1. Slightly better. How about a more complex model?
- Add 𝑤₄ · 𝑥_cp⁴: training 14.9, but testing 28.8 — the results become worse.
- Add 𝑤₅ · 𝑥_cp⁵: training 12.8, but testing 232.1 — the testing results are very bad.
Model Selection
1. 𝑦 = 𝑏 + 𝑤 · 𝑥_cp
2. 𝑦 = 𝑏 + 𝑤₁ · 𝑥_cp + 𝑤₂ · 𝑥_cp²
3. 𝑦 = 𝑏 + 𝑤₁ · 𝑥_cp + 𝑤₂ · 𝑥_cp² + 𝑤₃ · 𝑥_cp³
4. 𝑦 = 𝑏 + 𝑤₁ · 𝑥_cp + 𝑤₂ · 𝑥_cp² + 𝑤₃ · 𝑥_cp³ + 𝑤₄ · 𝑥_cp⁴
5. 𝑦 = 𝑏 + 𝑤₁ · 𝑥_cp + 𝑤₂ · 𝑥_cp² + 𝑤₃ · 𝑥_cp³ + 𝑤₄ · 𝑥_cp⁴ + 𝑤₅ · 𝑥_cp⁵
A more complex model yields lower error on training data — if we can truly find the best function.
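The claim above can be reproduced on synthetic data (not the slide's pokémon data; the random points and the `train_error` helper are illustrative assumptions): as the polynomial degree grows, the training error can only go down or stay the same, because each model set contains the previous one.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 10)
y = 2.7 * x - 5.0 + rng.normal(0, 3.0, 10)  # noisy linear "ground truth"

def train_error(degree):
    """Mean squared training error of the best polynomial of this degree."""
    coeffs = np.polyfit(x, y, degree)
    pred = np.polyval(coeffs, x)
    return float(np.mean((y - pred) ** 2))

errors = [train_error(d) for d in range(1, 6)]  # degrees 1..5
```

Testing error, as the slides show, behaves very differently: it eventually rises as the model overfits.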
Machine Learning Map
- Task: Classification, under the supervised learning scenario (alongside semi-supervised, transfer, unsupervised, and reinforcement learning).
Classification
- Binary Classification: function f maps the input to Yes / No
- Multi-Class Classification: function f maps the input to Class 1, Class 2, …, Class N
Binary Classification – Spam Filtering
- Features such as whether “Talk” or “free” appears in the e-mail are fed to the model, which outputs 1 (Yes, spam) or 0 (No).
(http://spam-filter-review.toptenreviews.com/)
Multi-Class – Image Recognition
- The model maps an input image to one of several classes: “monkey”, “cat”, “dog”, …
Multi-Class – Topic Classification
- Features such as whether “stock” or “president” appears in the document are fed to the model, which outputs a topic: 政治 (politics), 經濟 (economy), 體育 (sports), 財經 (finance), …
http://top-breaking-news.com/
Machine Learning Map
- Method: Deep Learning is a non-linear model, alongside linear models and other non-linear methods (SVM, decision tree, KNN, etc.), and applies across scenarios and tasks.
Part I: Introduction to ML & DL
- Basic Machine Learning
- Basic Deep Learning
- Toolkits and Learning Recipe
Stacked Functions Learned by Machine
- Production line (生產線): a deep learning model stacks simple functions, e.g. “台灣第一波上市!” (“First release in Taiwan!”) → f1 → f2 → f3 → 推 (upvote)
- f: a very complex function built from simple ones
- End-to-end training: what each simple function should do is learned automatically
- Deep learning usually refers to neural-network-based models
Three Steps for Deep Learning
- Step 1: define a set of functions (a Neural Network)
- Step 2: goodness of function
- Step 3: pick the best function
Neural Network
Neuron: z = a₁w₁ + … + aₖwₖ + … + a_K w_K + b,  a = σ(z)
A neuron is a simple function of its inputs a₁ … a_K, with weights w₁ … w_K, bias b, and activation function σ.
Neural Network
Sigmoid activation function: σ(z) = 1 / (1 + e⁻ᶻ)
Example neuron: inputs (2, −1), weights (1, −1), bias 1 → z = 2 · 1 + (−1) · (−1) + 1 = 4 → σ(4) ≈ 0.98
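A single neuron with a sigmoid activation, matching the example on this slide (the helper names `sigmoid` and `neuron` are illustrative):

```python
import math

def sigmoid(z):
    """sigma(z) = 1 / (1 + e^-z)"""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias):
    """z = sum_k a_k * w_k + b, then a = sigma(z)."""
    z = sum(a * w for a, w in zip(inputs, weights)) + bias
    return sigmoid(z)
```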
Neural Network
- Different connections lead to different network structures.
- The neurons have different values of weights and biases.
- Weights and biases are the network parameters 𝜃.
Fully Connected Feedforward Network
With input (1, −1) and sigmoid activation σ(z) = 1 / (1 + e⁻ᶻ):
- first neuron: weights (1, −2), bias 1 → z = 4 → output 0.98
- second neuron: weights (−1, 1), bias 0 → z = −2 → output 0.12
Fully Connected Feedforward Network
Passing (0.98, 0.12) through two more layers gives (0.86, 0.11) and finally (0.62, 0.83); the input (0, 0) gives (0.73, 0.5) → (0.72, 0.12) → (0.51, 0.85).
This is a function with vector input and vector output:
  𝑓([1, −1]) = [0.62, 0.83],  𝑓([0, 0]) = [0.51, 0.85]
Given parameters 𝜃, the structure defines a function; given only the network structure, it defines a function set.
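The full forward pass can be sketched as below. The weights and biases are reconstructed from the numbers on the slide and may not match the original deck exactly, but they reproduce both example outputs:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(inputs, weights, biases):
    """One fully connected layer: weights[i] are the input weights of neuron i."""
    return [sigmoid(sum(a * w for a, w in zip(inputs, ws)) + b)
            for ws, b in zip(weights, biases)]

def network(x):
    h1 = layer(x,  [[1, -2], [-1, 1]],  [1, 0])   # -> (0.98, 0.12) for (1, -1)
    h2 = layer(h1, [[2, -1], [-2, -1]], [0, 0])   # -> (0.86, 0.11)
    return layer(h2, [[3, -1], [-1, 4]], [-2, 2]) # -> (0.62, 0.83)
```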
Fully Connected Feedforward Network
Input layer (x₁ … x_N) → hidden layers (Layer 1 … Layer L of neurons) → output layer (y₁ … y_M)
“Deep” means many hidden layers.
Why Deep? Universality Theorem
- Any continuous function f : Rᴺ → Rᴹ can be realized by a network with only one hidden layer (given enough hidden neurons).
- So why “deep” and not “fat”?
Fat + Shallow v.s. Thin + Deep
- Compare two networks with the same number of parameters: one shallow and wide, one deep and narrow.
Why Deep
Logic circuits:
- consist of gates
- two layers of logic gates can represent any Boolean function
- some functions are much simpler to build with multiple layers of gates
Neural networks:
- consist of neurons
- a network with one hidden layer can represent any continuous function
- some functions are much simpler to represent with multiple layers of neurons
Deep = Many Hidden Layers
ImageNet top-5 error: AlexNet (2012) 16.4%, VGG (2014) 7.3%, GoogleNet (2014) 6.7%
http://cs231n.stanford.edu/slides/winter1516_lecture8.pdf
Deep = Many Hidden Layers
Residual Net (2015), with a special structure, reaches 3.57% — compared with AlexNet (2012) 16.4%, VGG (2014) 7.3%, GoogleNet (2014) 6.7%; it has more layers than Taipei 101 has floors.
Output Layer
- Ordinary layer as output: yᵢ = σ(zᵢ); in general the output of the network can be any value, which may not be easy to interpret.
- Softmax layer as the output layer yields a probability: 1 > 𝑦ᵢ > 0 and Σᵢ 𝑦ᵢ = 1
  𝑦ᵢ = e^{zᵢ} / Σⱼ₌₁³ e^{zⱼ}
  Example: z = (3, 1, −3) → (e³, e¹, e⁻³) ≈ (20, 2.7, 0.05) → y ≈ (0.88, 0.12, ≈0)
Example Application
- Handwriting digit recognition. Input: a 16 × 16 image flattened to a 256-dim vector x₁ … x₂₅₆ (ink → 1, no ink → 0). Output: a 10-dim vector y₁ … y₁₀, each dimension the confidence of a digit, e.g. y₁ = 0.1 (is “1”), y₂ = 0.7 (is “2”), …, y₁₀ = 0.2 (is “0”) → the image is “2”.
- What is needed is a function with a 256-dim input vector and a 10-dim output vector; the network (input layer → hidden layers → output layer) is a function set containing the candidates for recognizing digits.
- You need to decide the network structure to let a good function be in your function set.
FAQ
- Q: How many layers? How many neurons for each layer? — Trial and error + intuition.
- Q: Can we design the network structure? — Variants of neural networks (next lecture).
- Q: Can the structure be automatically determined? — Yes, but not widely studied yet.
Three Steps for Deep Learning
- Step 1: define a set of functions; Step 2: goodness of function; Step 3: pick the best function.
Training Data
- Preparing training data: images and their labels, e.g. “5” “0” “4” “1” “9” “2” “1” “3”
- The learning target is defined on the training data.

Learning Target
- Input x₁ … x₂₅₆ (16 × 16, ink → 1, no ink → 0) → softmax outputs y₁ … y₁₀
- The learning target: for an input image of “1”, y₁ has the maximum value; for an input image of “2”, y₂ has the maximum value; ……

Loss
- Given a set of parameters, for an image of “1” the output (y₁ … y₁₀) should be as close as possible to the target (1, 0, …, 0); the gap is the loss 𝑙.
- Loss can be the square error or cross entropy between the network output and the target.
- A good function should make the loss of all examples as small as possible.
Total Loss
- For all training data: 𝐿 = Σᵣ₌₁ᴿ 𝑙ʳ, to be made as small as possible.
- x¹ → NN → y¹ compared with ŷ¹ gives 𝑙¹; x² gives 𝑙²; …; xᴿ gives 𝑙ᴿ.
- Find a function in the function set that minimizes the total loss L — i.e. find the network parameters 𝜽* that minimize total loss L.
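A sketch of the total loss as a sum of per-example losses (square error here; the stand-in network `identity` and the data are illustrative, not the slide's digit recognizer):

```python
def example_loss(y, y_hat):
    """Square error between a network output vector and its target vector."""
    return sum((a - b) ** 2 for a, b in zip(y, y_hat))

def total_loss(f, data):
    """L = sum_r l^r over all R training examples (x^r, y_hat^r)."""
    return sum(example_loss(f(x), y_hat) for x, y_hat in data)

identity = lambda x: x  # trivial stand-in "network"
```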
Three Steps for Deep Learning
- Step 1: define a set of functions; Step 2: goodness of function; Step 3: pick the best function.
How to pick the best function
- Find network parameters 𝜽* that minimize total loss L.
- Enumerating all possible values is hopeless: 𝜃 = {𝑤₁, 𝑤₂, 𝑤₃, ⋯, 𝑏₁, 𝑏₂, 𝑏₃, ⋯} contains millions of parameters — e.g. in speech recognition, 8 layers with 1000 neurons each give 10⁶ weights between two adjacent layers.
Gradient Descent (network parameters 𝜃 = {𝑤₁, 𝑤₂, ⋯, 𝑏₁, 𝑏₂, ⋯})
Find network parameters 𝜽* that minimize total loss L:
- Pick an initial value for w — random or RBM pre-training; random is usually good enough.
- Compute ∂𝐿/∂𝑤: negative → increase w; positive → decrease w.
- Update 𝑤 ← 𝑤 − 𝜂 ∂𝐿/∂𝑤, where η is called the “learning rate”.
- Repeat until the update is little.
Gradient Descent
- Assume θ has two variables {θ₁, θ₂} (drawn as the 𝑤₁-𝑤₂ plane; color is the value of total loss L).
- Randomly pick a starting point, compute (∂𝐿/∂𝑤₁, ∂𝐿/∂𝑤₂), and move by (−𝜂 ∂𝐿/∂𝑤₁, −𝜂 ∂𝐿/∂𝑤₂).
- Hopefully, we would reach a minimum …..
Local Minima
- Depending on the value of a network parameter w, gradient descent can be very slow at a plateau (∂𝐿/∂𝑤 ≈ 0), stuck at a saddle point (∂𝐿/∂𝑤 = 0), or stuck at a local minimum (∂𝐿/∂𝑤 = 0).
Local Minima
- Gradient descent never guarantees global minima: different initial points reach different minima, so different results.
Gradient Descent
- This is the “learning” of machines in deep learning …… even AlphaGo uses this approach.
- What people imagine vs. what actually happens — I hope you are not too disappointed :p
Part I: Introduction to ML & DL
- Basic Machine Learning
- Basic Deep Learning
- Toolkits and Learning Recipe
Deep Learning Toolkit
- Backpropagation: an efficient way to compute ∂𝐿/∂𝑤 in a neural network.
Three Steps for Deep Learning
- Step 1: define a set of functions; Step 2: goodness of function; Step 3: pick the best function.
- Deep learning is so simple ……
- If you want to find a function, and you have lots of function input/output pairs as training data, you can use deep learning.
Keras
- TensorFlow and Theano are very flexible but need some effort to learn.
- Keras is an interface of TensorFlow or Theano: easy to learn and use, while still having some flexibility — you can modify it if you can write TensorFlow or Theano.
Keras
- François Chollet is the author of Keras. He currently works for Google as a deep learning engineer and researcher.
- Keras means horn in Greek.
- Documentation: http://keras.io/
- Examples: https://github.com/fchollet/keras/tree/master/examples
- Step-by-step lecture by Prof. Hung-Yi Lee:
  Slides: http://speech.ee.ntu.edu.tw/~tlkagk/courses/ML_2016/Lecture/Keras.pdf
  Lecture recording: https://www.youtube.com/watch?v=qetE6uUoLQA
Experience Using Keras
(Figure courtesy of classmate 沈昇勳.)
Example Application
- Handwriting digit recognition: the machine reads a 28 × 28 image and outputs “1”.
- MNIST data: http://yann.lecun.com/exdb/mnist/ — the “hello world” of deep learning.
- Keras provides a data set loading function: http://keras.io/datasets/
Three Steps for Deep Learning
- Step 1: define a set of functions; Step 2: goodness of function; Step 3: pick the best function.
Learning Recipe
- Run the three steps, then check: good results on training data? If NO, go back and modify the three steps. If YES, check: good results on testing data? If NO, this is overfitting; if YES, done.
Overfitting
- Possible solutions: more training samples; some tips: dropout, etc.
Learning Recipe
- Different approaches for different problems — e.g. dropout for good results on testing data.
Learning Recipe
- For good results on training data: choosing a proper loss, mini-batch training, new activation functions, adaptive learning rates, momentum.
- For good results on testing data: e.g. dropout.
Learning Recipe
- Training data (x, ŷ) → “best” function f*: we immediately know the performance.
- Testing data = validation (x, y) + real testing (x, y): on real testing we do not know the performance.
Learning Recipe
- If we do not get good results on the training set, modify the training process. Possible reasons:
  - no good function exists (bad hypothesis function set) → reconstruct the model architecture
  - cannot find a good function (local optima) → change the training strategy
Learning Recipe
- Get good results on the training set? If no, modify the training process. If yes: get good results on the dev/validation set? If no, prevent overfitting; if yes, done.
- Better performance on training but worse performance on dev → overfitting.
Concluding Remarks
- Basic Machine Learning:
  1. Define a set of functions
  2. Measure goodness of functions
  3. Pick the best function
- Basic Deep Learning: stacked functions
Talk Outline
- Part I: Introduction to Machine Learning & Deep Learning
- Part II: Variants of Neural Nets
- Part III: Beyond Supervised Learning & Recent Trends

PART II
Variants of Neural Networks
PART II: Variants of Neural Networks
- Convolutional Neural Network (CNN)
- Recurrent Neural Network (RNN)
PART II: Variants of Neural Networks
- Convolutional Neural Network (CNN) — widely used in image processing
- Recurrent Neural Network (RNN)
Why CNN for Image (Zeiler, M. D., ECCV 2014)
- An image is represented as pixels x₁ … x_N; the first layer learns the most basic classifiers, the second layer uses the first-layer outputs as modules to build classifiers, and so on ……
- Can the network be simplified by considering the properties of images?
Why CNN for Image
- Some patterns are much smaller than the whole image: a neuron does not have to see the whole image to discover the pattern (e.g. a “beak” detector).
- Connecting to a small region needs fewer parameters.
Why CNN for Image
- The same patterns appear in different regions: an “upper-left beak” detector and a “middle beak” detector do almost the same thing, so they can use the same set of parameters.
Why CNN for Image
- Subsampling the pixels will not change the object: a subsampled bird is still a bird.
- We can subsample the pixels to make the image smaller — fewer parameters for the network to process the image.
Three Steps for Deep Learning
- Step 1: define a set of functions (a Convolutional Neural Network); Step 2: goodness of function; Step 3: pick the best function.
Image Recognition
http://www.cs.toronto.edu/~fritz/absps/imagenet.pdf
The Whole CNN
image → [Convolution → Max Pooling] (can repeat many times) → Flatten → Fully Connected Feedforward Network → “cat”, “dog”, ……
The Whole CNN
- Property 1: some patterns are much smaller than the whole image → Convolution
- Property 2: the same patterns appear in different regions → Convolution
- Property 3: subsampling the pixels will not change the object → Max Pooling
- Finally Flatten and feed into the fully connected network.
Local Connectivity
- Neurons connect to a small region of the input.
Parameter Sharing
- The same feature in different positions: neurons share the same weights.
- Different features in the same position: neurons have different weights.
Convolutional Layers
(figures: each filter is a set of shared weights with a width, height, and depth matching the input depth; sliding the same filters A, B, C, D across positions of a depth-1 or depth-2 input produces one output value per filter per position, so the output depth equals the number of filters — here depth = 2)
Hyper-parameters of CNN
- Stride: the step size when sliding the filter (e.g. stride = 1 or stride = 2)
- Padding: zeros added around the border (e.g. padding = 0 or padding = 1)
129

Output Stride = 2
Volume (3x3x2)
Filter
(3x3x3)

Input
Volume (7x7x3) Padding = 1

http://cs231n.github.io/convolutional-networks/
Convolutional Layers
http://cs231n.github.io/convolutional-networks/
Pooling Layer
- 2 × 2 non-overlapping regions, no weights, depth unchanged (= 1 here). For the input
  1 3 2 4
  5 7 6 8
  0 0 3 3
  5 5 0 0
  maximum pooling gives [7 8; 5 3] (e.g. Max(1,3,5,7) = 7, Max(0,0,5,5) = 5) and average pooling gives [4 5; 2.5 1.5] (e.g. Avg(1,3,5,7) = 4).
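The slide's max pooling can be sketched directly (the helper name `max_pool_2x2` is illustrative; it assumes a square input with even side length):

```python
def max_pool_2x2(image):
    """2x2 non-overlapping max pooling over a list-of-lists image."""
    n = len(image)
    return [[max(image[i][j], image[i][j + 1],
                 image[i + 1][j], image[i + 1][j + 1])
             for j in range(0, n, 2)]
            for i in range(0, n, 2)]

image = [[1, 3, 2, 4],
         [5, 7, 6, 8],
         [0, 0, 3, 3],
         [5, 5, 0, 0]]
```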
Why “Deep” Learning?
Visual Perception of Human
http://www.nature.com/neuro/journal/v8/n8/images/nn0805-975-F1.jpg
Visual Perception of Computer
- Input layer → convolutional layer → pooling layer → convolutional layer → pooling layer; the receptive fields grow with depth.
- A convolutional layer with receptive fields of width 3 and height 3, followed by a max-pooling layer, turns the input image into filter responses.
Fully-Connected Layer
- Fully-connected layers: global feature extraction
- Softmax layer: classifier
- Pipeline: input image → input layer → convolutional layer → pooling layer → convolutional layer → pooling layer → fully-connected layer → softmax layer → class label (e.g. “5” vs. “7”)
140

Step 1: Step 2: Step 3: pick


define a set
Convolutional goodness of the best
of function
Neural Network function function

“monkey” 0
“cat” 1
CNN

……
“dog” 0
Convolution, Max target
Pooling, fully connected
What CNN Learned
141

 Alexnet
 http://www.cs.toronto.edu/~fritz/absps/imagenet.pdf

http://vision03.csail.mit.edu/cnn_art/data/single_layer.png
DNNs Are Easily Fooled
Nguyen et al., “Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images,” arXiv:1412.1897.
Visualizing CNN
- Feed a flower image into the CNN and observe a filter’s response; feed random noise and observe the same filter’s response.
Gradient Ascent
- Magnify the filter response: start from random noise, compute the score (the filter response) and its gradient with respect to the input, then update the input with the gradient times a learning rate, moving from lower to higher score.
(figure: images generated by repeating this update)
Different Layers of Visualization
(figure: images generated by maximizing responses at different layers of the CNN)
Multiscale Image Generation
- visualize → resize → visualize → resize → visualize: the CNN modifies the image at increasing scales.

Deep Dream
- Given a photo, the machine adds what it sees ……: the CNN exaggerates what it sees (e.g. a filter response such as (3.9, −1.5, 2.3) is pushed further).
http://deepdreamgenerator.com/
Deep Style (http://deepdreamgenerator.com/)
- Given a photo, make its style like famous paintings.
Deep Style — A Neural Algorithm of Artistic Style (https://arxiv.org/abs/1508.06576)
- One CNN captures the content of the photo, another captures the style of the painting; a new image is found that matches the former in content and the latter in style.

Neural Art Mechanism
- Artist: brain + scene + style → artwork. Computer: neural networks play the same roles.
Go Playing
- Network input: the board as a 19 × 19 matrix (image; black: 1, white: −1, none: 0). Output: the next move among the 19 × 19 positions (a 19 × 19 vector).
- A fully-connected feedforward network can be used, but a CNN performs much better.
More Application: Playing Go
- Training: records of previous plays, e.g. 黑 (black): 5之五, 白 (white): 天元, 黑: 五之5, …
- Target: given the first board, the CNN should output “天元” = 1 (else = 0); given the next board, “五之 5” = 1 (else = 0).
Why CNN for playing Go?
- Some patterns are much smaller than the whole image: AlphaGo uses 5 × 5 filters for the first layer.
- The same patterns appear in different regions.
- But subsampling the pixels will not change the object — how can this property be explained for Go???
PART II: Variants of Neural Networks
- Convolutional Neural Network (CNN)
- Recurrent Neural Network (RNN) — a neural network with memory
Example Application
- Slot Filling, e.g. in a ticket booking system:
  “I would like to arrive Taipei on November 2nd.”
  → Destination: Taipei; time of arrival: November 2nd
Example Application
- Solving slot filling by a feedforward network? Input: a word (each word is represented as a vector), e.g. “Taipei”.
1-of-N Encoding
How to represent each word as a vector? 1-of-N encoding, with lexicon = {apple, bag, cat, dog, elephant}:
- the vector is of lexicon size; each dimension corresponds to a word in the lexicon; the dimension for the word is 1, and the others are 0
  apple = [1 0 0 0 0], bag = [0 1 0 0 0], cat = [0 0 1 0 0], dog = [0 0 0 1 0], elephant = [0 0 0 0 1]
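The encoding above in a few lines (the helper name `one_of_n` is illustrative):

```python
def one_of_n(word, lexicon):
    """1-of-N encoding: a vector of lexicon size, 1 at the word's dimension."""
    return [1 if w == word else 0 for w in lexicon]

lexicon = ["apple", "bag", "cat", "dog", "elephant"]
```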
Example Application
- Solving slot filling by a feedforward network? Input: a word vector. Output: the probability distribution that the input word belongs to each slot (dest, time of departure, …).
Example Application
- “arrive Taipei on November 2nd” → other / dest / other / time / time
- “leave Taipei on November 2nd” → here “Taipei” is the place of departure
- The same word must be labeled differently depending on the context: the neural network needs memory!
Three Steps for Deep Learning
- Step 1: define a set of functions (a Recurrent Neural Network); Step 2: goodness of function; Step 3: pick the best function.
Recurrent Neural Network (RNN)
- The outputs of the hidden layer are stored in the memory.
- The memory can be considered as another input.
168

Probability of Probability of Probability of


“arrive” in each slot “Taipei” in each slot “on” in each slot
y1 y2 y3
store store
a1 a2 a3
a1 a2

x1 x2 x3

arrive Taipei on November 2nd


RNN
169

Different
Prob of “leave” Prob of “Taipei” Prob of “arrive” Prob of “Taipei”
in each slot in each slot in each slot in each slot
y1 y2 …… y1 y2 ……
store store
a1 a2 a1 a2
a1 …… a1 ……

x1 x2 …… x1 x2 ……
leave Taipei arrive Taipei

The values stored in the memory is different.


Deep RNN
- Multiple recurrent hidden layers are stacked between the inputs x_t, x_{t+1}, x_{t+2} and the outputs y_t, y_{t+1}, y_{t+2}.
Bidirectional RNN
- One RNN reads the input forward and another reads it backward; each output y_t is produced from both hidden states.
RNN
- Step 1: define a set of functions (a Recurrent Neural Network); Step 2: goodness of function; Step 3: pick the best function.
Learning Target
- Training sentence: “arrive Taipei on November 2nd”, labeled other / dest / other / time / time
- Each output yᵢ should match the 1-hot slot label (0 … 1 … 0); the memory is copied forward at each step.
Rough Error Surface
- The total loss as a function of the RNN parameters (w₁, w₂) is very rough: flat plateaus next to steep cliffs. [Razvan Pascanu, ICML’13]
Rough Error Surface
- Toy example: a 1000-step RNN with weight w, input 1 at the first step and 0 afterwards, so 𝑦¹⁰⁰⁰ = 𝑤⁹⁹⁹:
  𝑤 = 1 → 𝑦¹⁰⁰⁰ = 1;  𝑤 = 1.01 → 𝑦¹⁰⁰⁰ ≈ 20000 (large ∂𝐿/∂𝑤 — small learning rate?)
  𝑤 = 0.99 → 𝑦¹⁰⁰⁰ ≈ 0;  𝑤 = 0.01 → 𝑦¹⁰⁰⁰ ≈ 0 (small ∂𝐿/∂𝑤 — large learning rate?)
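The toy RNN above in code (the helper name `y_1000` is illustrative): tiny changes in w explode or vanish after 1000 steps, which is why the error surface is so rough.

```python
def y_1000(w):
    """y_t = w * y_{t-1}, with y_1 = 1 from the single input; returns y_1000 = w**999."""
    y = 1.0
    for _ in range(999):
        y = w * y
    return y
```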
RNN Applications
- In slot filling, input and output are both sequences of the same length — but RNN can do more than that!
Many-to-One
- Input is a vector sequence, but output is only one vector, e.g. sentiment analysis of movie reviews:
  “看了這部電影覺得很高興 …” (“watching this movie made me very happy …”) → Positive (正雷)
  “這部電影太糟了 …” (“this movie is terrible …”) → Negative (負雷)
  “這部電影很棒 …” (“this movie is great …”) → Positive (正雷)
  with finer classes 超好雷 / 好雷 / 普雷 / 負雷 / 超負雷 (from “super positive” to “super negative”).
Many-to-Many (Output is shorter)
- Both input and output are sequences, but the output is shorter.
- E.g. speech recognition: the input is a vector sequence of acoustic frames, the output a character sequence such as “好棒” (“great”).
- Problem? Outputting one character per frame gives “好好好棒棒棒棒棒”; simply trimming the repeats can never produce “好棒棒” (a sarcastic “great”).
Many-to-Many (Output is shorter)
- Connectionist Temporal Classification (CTC): add an extra symbol “φ” representing “null”.
  “好 φ φ 棒 φ φ φ φ” collapses to “好棒”, while “好 φ φ 棒 φ 棒 φ φ” collapses to “好棒棒”.
[Alex Graves, ICML’06][Alex Graves, ICML’14][Haşim Sak, Interspeech’15][Jie Li, Interspeech’15][Andrew Senior, ASRU’15]
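The CTC collapse rule on this slide — merge repeated symbols, then drop the null symbol — can be sketched as follows (the helper name `ctc_collapse` is illustrative; “-” stands in for the slide's φ):

```python
def ctc_collapse(path, blank="-"):
    """Merge consecutive repeats, then remove the blank (null) symbol."""
    out = []
    prev = None
    for s in path:
        if s != prev and s != blank:
            out.append(s)
        prev = s
    return "".join(out)
```

Because a blank separates the two 棒 symbols, “好--棒-棒--” keeps both of them, while “好--棒----” collapses to just “好棒”.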
Many-to-Many (Output has no limitation)
- Both input and output are sequences with different lengths → sequence-to-sequence learning.
- E.g. machine translation (machine learning → 機器學習): the encoder reads “machine learning”, and its final state contains all the information about the input sequence.
Many-to-Many (Output has no limitation)
- Decoding from that state can keep generating characters: 機 器 學 習 慣 性 … — the model doesn’t know when to stop.
Many-to-Many (Output has no limitation)
http://zh.pttpedia.wikia.com/wiki/%E6%8E%A5%E9%BE%8D%E6%8E%A8%E6%96%87 (鄉民百科, a PTT wiki page on the never-ending word-chain game)
Many-to-Many (Output has no limitation)
- Add a stop symbol “===” (斷): machine learning → 機 器 學 習 ===
[Ilya Sutskever, NIPS’14][Dzmitry Bahdanau, arXiv’15]
Image Caption Generation
- Input an image, output a sequence of words: a CNN encodes the whole image into a vector, and caption generation emits words one by one (“a woman is ……”) until “===”.
[Kelvin Xu, arXiv’15][Li Yao, ICCV’15]
Video Caption Generation
186

Example captions generated from input videos:
 A girl is running.
 A group of people is knocked by a tree.
 A group of people is walking in the forest.
Chit-Chat Bot
187

Training data: TV series transcripts (電視影集, ~40,000 sentences) and U.S. presidential election debates (美國總統大選辯論)


Sci-Fi Short Film - SUNSPRING
188

https://www.youtube.com/watch?v=LY7x2Ihqj
Attention and Memory
189

Human memory organizes experiences — breakfast today, what you learned in these lectures, summer vacation 10 years ago — and retrieves the relevant pieces to answer a question such as "What is deep learning?"

http://henrylo1605.blogspot.tw/2015/05/blog-post_56.html
Attention on Sensory Info
190

Information from the sensors (e.g. eyes, ears) enters sensory memory. When the input is a very long sequence or an image, attention selects part of the input object each time and passes it to working memory, which encodes into and retrieves from long-term memory.
Machine Translation
191

 Sequence-to-sequence learning: both the input and output are sequences, with different lengths.
 E.g. 深度學習 → deep learning

(diagram: an RNN encoder reads 深 度 學 習 into a vector containing the information of the whole sentence; an RNN decoder then generates "deep", "learning", <END>)
Machine Translation with Attention
192

The decoder state z₀ is matched against each encoder state h₁ … h₄ of 深 度 學 習, producing a score α₀¹ = match(z₀, h₁), etc. What is match?
 Cosine similarity of z and h
 A small NN whose input is z and h, and whose output is a scalar
 α = hᵀWz
How to learn the parameters? Jointly with the rest of the network, by backpropagation.
Machine Translation with Attention
193

The scores are normalized with a softmax: α̂₀¹ = 0.5, α̂₀² = 0.5, α̂₀³ = 0.0, α̂₀⁴ = 0.0.
The context vector c₀ = Σᵢ α̂₀ⁱ hᵢ = 0.5h₁ + 0.5h₂ is fed to the decoder RNN as input, which outputs "deep" and the next state z₁.


Machine Translation with Attention
194

The new decoder state z₁ is matched against h₁ … h₄ again, giving scores α₁¹ … α₁⁴.
Machine Translation with Attention
195

After the softmax: α̂₁¹ = 0.0, α̂₁² = 0.0, α̂₁³ = 0.5, α̂₁⁴ = 0.5.
The context vector c₁ = Σᵢ α̂₁ⁱ hᵢ = 0.5h₃ + 0.5h₄ is fed to the decoder, which outputs "learning" and state z₂.

Machine Translation with Attention
196

The same process repeats — match, softmax, context vector, decode — until the decoder generates <END>.
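One decoding step of the attention mechanism above — match scores, softmax, weighted sum — can be sketched with NumPy (a minimal sketch using the bilinear match α = hᵀWz; the 2-D shapes are illustrative):

```python
import numpy as np

def attend(z, H, W):
    """One attention step: scores α_i = h_iᵀWz, softmax → α̂, context c = Σ α̂_i h_i."""
    scores = H @ (W @ z)                  # one match score per encoder state h_i
    w = np.exp(scores - scores.max())     # numerically stable softmax
    w /= w.sum()
    return w, w @ H                       # (α̂, context vector c)

H = np.array([[1., 0.], [1., 0.], [0., 1.], [0., 1.]])  # h1..h4 for 深 度 學 習
weights, c = attend(np.array([1., 0.]), H, np.eye(2))
# h1 and h2 match z best, so the context vector is dominated by h1 and h2
```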
Speech Recognition with Attention
197

Chan et al., "Listen, Attend and Spell," arXiv, 2015.


Image Captioning
198

 Input: image
 Output: word sequence

A CNN encodes the input image into a vector for the whole image; the decoder then generates "a woman is … <END>".
Image Captioning with Attention
199

A CNN (its filters applied across the image) produces a vector for each image region; the decoder state z₀ is matched against every region vector, e.g. scoring 0.7 for the most relevant region.
Image Captioning with Attention
200

The attention weights over regions (e.g. 0.7, 0.1, 0.1, 0.1, 0.0, 0.0) give a weighted sum of the region vectors, which the decoder uses to generate Word 1 and the next state z₁.
Image Captioning with Attention
201

Attention then shifts (e.g. weights 0.0, 0.8, 0.2, 0.0, 0.0, 0.0): a new weighted sum is computed and the decoder generates Word 2, and so on.
Image Captioning
202

 Good examples
Image Captioning
203

 Bad examples
Video Captioning
204
Video Captioning
205
Reading Comprehension
206
Answer = DNN(extracted information), where the extracted information is Σₙ αₙxⁿ over the N sentence vectors x¹ … xᴺ of the document. The attention weights αₙ come from matching each sentence against the question q; the sentence-to-vector conversion can be jointly trained.
Reading Comprehension
207
With jointly learned sentence representations h¹ … hᴺ, the answer is DNN(Σₙ αₙhⁿ). The attention can be applied several times ("hopping"), re-reading the document before producing the answer.
Memory Network
208

(diagram: the query q is matched against memory to compute attention, information is extracted, attention is recomputed over several hops, and a DNN finally produces the answer a)
Memory Network
209

 Muti-hop performance analysis

https://www.facebook.com/Engineering/videos/10153098860532200/
Attention on Memory
210

Information from the sensors (e.g. eyes, ears) passes through attention into working memory, then is encoded into and retrieved from long-term memory. In an RNN/LSTM, a larger memory implies more parameters; with an external long-term memory, increasing the memory size does not increase the number of parameters.
Neural Turing Machine
211

 Von Neumann architecture


 Neural Turing Machine is an advanced RNN/LSTM.

Zhang et al., “Structured Memory for Neural Turing Machines,” arXiv, 2015.
Neural Turing Machine
212

(diagram: at each step the controller state reads from long-term memory m₀¹ … m₀⁴ with attention weights α̂₀¹ … α̂₀⁴)
Retrieval process: r₀ = Σᵢ α̂₀ⁱ m₀ⁱ; r₀ and the input x₁ produce the next state h₁ and the output y₁.
Zhang et al., “Structured Memory for Neural Turing Machines,” arXiv, 2015.
Neural Turing Machine
213

The controller also emits a key k₁, an erase vector e₁, and an add vector a₁. The attention is updated (simplified) as α₁ⁱ = (1 − λ)α₀ⁱ + λ cos(m₀ⁱ, k₁), followed by a softmax to obtain α̂₁ⁱ.

Zhang et al., “Structured Memory for Neural Turing Machines,” arXiv, 2015.
Neural Turing Machine
214

Encode (write) process, element-wise: m₁ⁱ = m₀ⁱ ∘ (1 − α̂₁ⁱe₁) + α̂₁ⁱa₁.

Zhang et al., “Structured Memory for Neural Turing Machines,” arXiv, 2015.
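The read, addressing, and write equations on these slides can be sketched in NumPy (a simplified sketch of the slide's equations, not the full Neural Turing Machine of Graves et al.):

```python
import numpy as np

def read(alpha, M):
    """Retrieval: r = Σ_i α̂_i m_i (M has one memory slot per row)."""
    return alpha @ M

def address(alpha_prev, M, k, lam=0.5):
    """Simplified addressing: α_i = (1−λ)·α_prev_i + λ·cos(m_i, k), then softmax."""
    cos = M @ k / (np.linalg.norm(M, axis=1) * np.linalg.norm(k) + 1e-9)
    s = (1 - lam) * alpha_prev + lam * cos
    e = np.exp(s - s.max())
    return e / e.sum()

def write(alpha, M, erase, add):
    """Encode: m_i ← m_i ∘ (1 − α̂_i e) + α̂_i a (element-wise)."""
    return M * (1 - np.outer(alpha, erase)) + np.outer(alpha, add)
```

With attention focused entirely on one slot (α̂ = [1, 0, 0, 0]) and a full erase vector, `write` replaces that slot with the add vector and leaves the other slots untouched.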
Neural Turing Machine
215

(diagram: the read–write cycle repeats over time steps, producing r₁, updated memory m₂ⁱ, and outputs y₁, y₂, …)
Zhang et al., “Structured Memory for Neural Turing Machines,” arXiv, 2015.
Concluding Remarks
216

 Convolutional Neural Network (CNN): input → convolution → pooling → convolution → pooling → …
 Recurrent Neural Network (RNN): hidden activations a1, a2, a3 are stored and fed back while producing y1, y2, y3 from x1, x2, x3
Talk Outline
217

Part I: Introduction to
Machine Learning & Deep Learning

Part II: Variants of Neural Nets

Part III: Beyond Supervised Learning


& Recent Trends
218 PART III
Beyond Supervised Learning & Recent Trend
Introduction
219

 Big data ≠ Big annotated data


 Machine learning techniques include:
 Supervised learning (if we have labelled data)
 Reinforcement learning (if we have an environment for reward)
 Unsupervised learning (if we do not have labelled data)

What can we do when there is not sufficient labelled training data?


Machine Learning Map
220

Scenario Task Method

Regression
Semi-Supervised Learning

Linear Model
Transfer Learning

Deep Learning SVM, Decision


Tree, KNN, etc Unsupervised Learning

Non-Linear Model
Classification Reinforcement Learning
Supervised Learning
Outline
221

 Semi-Supervised Learning
 Transfer Learning
 Unsupervised Learning
 化繁為簡 Representation Learning
 無中生有 Generative Model

 Reinforcement Learning
Outline
222

 Semi-Supervised Learning
 Transfer Learning
 Unsupervised Learning
 化繁為簡 Representation Learning
 無中生有 Generative Model

 Reinforcement Learning
Semi-Supervised Learning
223

Labelled
data
cat dog

Unlabeled
data

(Image of cats and dogs without labeling)


Semi-Supervised Learning
224

 Why does semi-supervised learning help?
The distribution of the unlabeled data provides cues about where the decision boundary should lie.


Outline
225

 Semi-Supervised Learning
 Transfer Learning
 Unsupervised Learning
 化繁為簡 Representation Learning
 無中生有 Generative Model

 Reinforcement Learning
Transfer Learning
226

Labelled
data
cat dog

Labeled
data
elephant elephant tiger tiger

Not related to the task considered


Transfer Learning
227

 Widely used in image processing
 Use sufficient labeled data to learn a CNN
 Use this CNN as a feature extractor

Pixels Layer 1 Layer 2 Layer L


x1 …… ……
x2 …… elephant

……
……
……
……

xN …… ……
Transfer Learning Example
228

A grad student's life (研究生 online) maps onto a manga artist's life (漫畫家 online): grad student ↔ manga artist, advisor (指導教授) ↔ editor (責編), running experiments (跑實驗) ↔ drawing storyboards (畫分鏡), submitting to journals (投稿期刊) ↔ submitting to JUMP (投稿 jump) — so the grad-student survival rules transfer. (Example from the manga 爆漫王 / Bakuman.)
Outline
229

 Semi-Supervised Learning
 Transfer Learning
 Unsupervised Learning
 化繁為簡 Representation Learning
 無中生有 Generative Model

 Reinforcement Learning
Outline
230

 Semi-Supervised Learning
 Transfer Learning
 Unsupervised Learning
 化繁為簡 Representation Learning
 無中生有 Generative Model

 Reinforcement Learning
Unsupervised Learning
231

 The unlabeled data is sometimes not related to the task

Labelled
data
cat dog

Unlabeled
data

(Just crawl millions of images from the Internet)


Unsupervised Learning
232

 化繁為簡 (simplifying the complex): representation learning — we only have the function input, and learn a compact code for it
 無中生有 (creating from nothing): generative models — we only have the function output, and learn a function that generates it from a code
Unsupervised Learning
233

 How does self-taught learning work?
 Why do unlabeled and even unrelated data help the task?
Because they let us find the latent factors that control the observations.


Latent Factors for Handwritten Digits
234

(example: images of the digit 3 rotated by −20°, −10°, 0°, 10°, 20° — the rotation angle is a latent factor)
Latent Factors for Documents
235

http://deliveryimages.acm.org/10.1145/2140000/2133826/figs/f1.jpg
Latent Factors for Recommendation
236

Latent Factor Exploitation
237

 Handwritten digits

The handwritten images are


composed of strokes

Strokes (Latent Factors)

…….
No. 1 No. 2 No. 3 No. 4 No. 5
Latent Factor Exploitation
238

Strokes (Latent Factors)

…….
No. 1 No. 2 No. 3 No. 4 No. 5
A 28 × 28 digit (784 pixels) = stroke No. 1 + stroke No. 3 + stroke No. 5,
so it can be represented by [1 0 1 0 1 0 …] — a much simpler representation.
Outline
239

 Semi-Supervised Learning
 Transfer Learning
 Unsupervised Learning
 化繁為簡 Representation Learning
 無中生有 Generative Model

 Reinforcement Learning
Autoencoder
240

 Represent a digit using 28 X 28 dimensions


 Not all 28 X 28 images are digits

Idea: represent the images of digits in a more compact way

NN Compact
code representation of
Encoder
Usually <784 the input object
28 X 28 = 784
Learn together

NN Can reconstruct
code
Decoder the original object
Autoencoder
241

Minimize ‖x − y‖²: the output y should be as close as possible to the input x.
Encode: a = σ(Wx + b) (input layer → bottleneck hidden layer)
Decode: y = σ(W′a + b′) (bottleneck layer → output layer)
Output of the hidden layer is the code
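The encode/decode pass above can be written out directly (a minimal NumPy sketch with random, untrained weights; the 784 → 30 sizes follow the slide):

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

W,  b  = rng.normal(scale=0.01, size=(30, 784)), np.zeros(30)    # encoder
Wp, bp = rng.normal(scale=0.01, size=(784, 30)), np.zeros(784)   # decoder

x = rng.random(784)               # a flattened 28×28 "image"
a = sigmoid(W @ x + b)            # code: a = σ(Wx + b), only 30 numbers
y = sigmoid(Wp @ a + bp)          # reconstruction: y = σ(W′a + b′)
loss = np.sum((x - y) ** 2)       # training would minimize ‖x − y‖²
```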
Autoencoder
242

 De-noising auto-encoder

Add noise to the input x, encode the noisy x′ into a = σ(Wx′ + b), decode to y = σ(W′a + b′), and train y to be as close as possible to the clean x.

Rifai, et al. "Contractive auto-encoders: Explicit invariance during feature extraction,“ in ICML, 2011.
Deep Autoencoder
243

(diagram: a deep autoencoder stacks layers down to a bottleneck code layer and back up, with the output x̃ trained to be as close as possible to the input x; the weights are initialized layer-by-layer by RBMs)
Hinton and Salakhutdinov. “Reducing the dimensionality of data with neural networks,” Science, 2006.
Deep Autoencoder
244

(comparison: an original 784-pixel image reconstructed via PCA with a 30-dim code, vs. a deep autoencoder 784 → 1000 → 500 → 250 → 30 → 250 → 500 → 1000 → 784; for 2-D feature visualization the code layer is shrunk to 2: 784 → 1000 → 500 → 250 → 2)
245
Auto-encoder – Text Retrieval
246

Vector Space Model Bag-of-word


this 1
is 1
word string:
query “This is an apple” a 0
an 1
apple 1
pen 0
document


Semantics are not considered
Autoencoder – Text Retrieval
247

(architecture: a 2000-dim bag-of-words for a document or query → 500 → 250 → 125 → 2-dim code; documents talking about the same thing will have close codes)
Autoencoder – Similar Image Retrieval
248

 Retrieved using Euclidean distance in pixel intensity space

Krizhevsky et al. "Using very deep autoencoders for content-based image retrieval," in ESANN, 2011.
Autoencoder – Similar Image Retrieval
249

(architecture: a 32×32 image → 8192 → 4096 → 2048 → 1024 → 512 → 256-dim code; trained by crawling millions of images from the Internet)


Autoencoder – Similar Image Retrieval
250

 Images retrieved using Euclidean distance in pixel intensity


space

 Images retrieved using 256 codes

Learning the useful latent factors


Autoencoder for DNN Pre-Training
251

 Greedy layer-wise pre-training again

output 10

500
Target

1000 784 𝑥෤
W1’
1000 1000
W1
Input 784 Input 784 𝑥
Autoencoder for DNN Pre-Training
252

 Greedy layer-wise pre-training again

output 10

500 1000 𝑎෤ 1
W2’
Target

1000 1000
W2
1000 1000 𝑎1
fix W1
Input 784 Input 784 𝑥
Autoencoder for DNN Pre-Training
253

 Greedy layer-wise pre-training again

output 10 1000 𝑎෤ 2
W3’
500 500
W3
Target

1000 1000 𝑎2
fix W2
1000 1000 𝑎1
fix W1
Input 784 Input 784 𝑥
Autoencoder for DNN Pre-Training
254

 Greedy layer-wise pre-training again


Fine-tune via backprop
output output 10 Random
10
W4 init
500 500
W3
Target

1000 1000
W2
1000 1000
W1
Input 784 Input 784 𝑥
Word Vector/Embedding
255

 Machines learn the meaning of words from reading
a lot of documents, without supervision

tree
flower

dog rabbit
run
jump cat
Word Embedding
256

 Machines learn the meaning of words from reading
a lot of documents, without supervision
 A word can be understood by its context
"You shall know a word by the company it keeps." 蔡英文 and 馬英九 appear in very similar contexts — 馬英九 520宣誓就職, 蔡英文 520宣誓就職 — so, when predicting wᵢ from the context …… wᵢ₋₂ wᵢ₋₁ ___, the two names should be treated as something very similar.
Prediction-Based
257

Input: the 1-of-N encoding of the word wᵢ₋₁. Output: the probability of each word being the next word wᵢ.
 Take out the input of the neurons in the first layer (z₁, z₂, …)
 Use it to represent the word w
 This is the word vector / word embedding feature V(w); nearby words in this space are related (tree/flower, dog/rabbit/cat, run/jump)
Prediction-Based
258

Collect data — 潮水 退了 就 知道 …, 不爽 不要 買 …, 公道價 八萬 一 …, … — and train the neural network to predict the next word from the previous one (潮水 → 退了, 退了 → 就, 就 → 知道, 不爽 → 不要, …), minimizing cross entropy.
Prediction-Based — "You shall know a word by the company it keeps"
259

Training text: …… 蔡英文 宣誓就職 …… and …… 馬英九 宣誓就職 ……
With either 蔡英文 or 馬英九 as wᵢ₋₁, the network should give "宣誓就職" a large probability as wᵢ. To achieve this, the two words must be mapped to similar first-layer activations (z₁, z₂) — so they end up with similar word vectors.
Various Architectures
260

 Continuous bag-of-words (CBOW) model
…… wᵢ₋₁ ____ wᵢ₊₁ …… : the network predicts the word wᵢ given its context (wᵢ₋₁, wᵢ₊₁)
 Skip-gram
…… ____ wᵢ ____ …… : the network predicts the context (wᵢ₋₁, wᵢ₊₁) given the word wᵢ

Word2Vec LM
261

 Goal: predicting the next word given the preceding context

https://ronxin.github.io/wevi/
Word2Vec CBOW
262

 Goal: predicting the target word given the surrounding words

https://ronxin.github.io/wevi/
Word2Vec Skip-Gram
263

 Skip-gram training data:


apple|drink^juice,orange|eat^apple,rice|drink^juice,juice|drink^milk,mil
k|drink^rice,water|drink^milk,juice|orange^apple,juice|apple^drink,milk
|rice^drink,drink|milk^water,drink|water^juice,drink|juice^water

https://ronxin.github.io/wevi/
Word Embedding
264

http://www.slideshare.net/hustwj/cikm-keynotenov2014
Word Embedding
265

 Characteristics
V(hotter) − V(hot) ≈ V(bigger) − V(big)
V(Rome) − V(Italy) ≈ V(Berlin) − V(Germany)
V(king) − V(queen) ≈ V(uncle) − V(aunt)

 Solving analogies
Rome : Italy = Berlin : ?
Compute V(Berlin) − V(Rome) + V(Italy) and find the word w with the closest V(w):
V(Germany) ≈ V(Berlin) − V(Rome) + V(Italy)
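The analogy computation can be sketched with toy vectors (the 2-D values here are made up for illustration; real embeddings have hundreds of dimensions):

```python
import numpy as np

V = {  # hypothetical toy embeddings
    "Rome":    np.array([1.0, 0.0]),
    "Italy":   np.array([1.0, 1.0]),
    "Berlin":  np.array([2.0, 0.1]),
    "Germany": np.array([2.0, 1.1]),
    "Paris":   np.array([3.0, 0.2]),
    "France":  np.array([3.0, 1.2]),
}

def analogy(a, b, c):
    """a : c = b : ?  — find w with V(w) closest to V(b) − V(a) + V(c)."""
    target = V[b] - V[a] + V[c]
    return min((w for w in V if w not in (a, b, c)),
               key=lambda w: np.linalg.norm(V[w] - target))

# Rome : Italy = Berlin : ?  →  analogy("Rome", "Berlin", "Italy") == "Germany"
```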
Outline
266

 Semi-Supervised Learning
 Transfer Learning
 Unsupervised Learning
 化繁為簡 Representation Learning
 無中生有 Generative Model

 Reinforcement Learning
Creation
267

Draw something!
Creation
268

 Generative Models
 https://openai.com/blog/generative-models/

What I cannot create,


I do not understand.
Richard Feynman

https://www.quora.com/What-did-Richard-Feynman-mean-when-he-said-What-I-cannot-create-I-do-not-understand
PixelRNN
269

 To create an image, generate one pixel at a time: each new pixel is predicted by a NN from the pixels generated so far (e.g. for 3 × 3 images)
Can be trained with just a large collection of images, without any annotation.
Aaron van den Oord, Nal Kalchbrenner, Koray Kavukcuoglu, Pixel Recurrent Neural Networks, arXiv preprint, 2016
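The pixel-by-pixel generation loop is a simple autoregressive sketch (here `predict_next` is a hypothetical stand-in for the trained network):

```python
def generate_image(predict_next, n_pixels):
    """Generate one pixel at a time, each conditioned on all pixels so far."""
    pixels = []
    for _ in range(n_pixels):
        pixels.append(predict_next(pixels))   # NN(previous pixels) → next pixel
    return pixels

# Toy "network" that alternates two colors for a 3×3 image:
checker = lambda prev: len(prev) % 2
print(generate_image(checker, 9))   # → [0, 1, 0, 1, 0, 1, 0, 1, 0]
```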
PixelRNN
270

Real
World

Aaron van den Oord, Nal Kalchbrenner, Koray Kavukcuoglu, Pixel Recurrent Neural Networks, arXiv preprint, 2016
PixelRNN – beyond Image
271

Audio: Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior,
Koray Kavukcuoglu, WaveNet: A Generative Model for Raw Audio, arXiv preprint, 2016
Video: Nal Kalchbrenner, Aaron van den Oord, Karen Simonyan, Ivo Danihelka, Oriol Vinyals, Alex Graves, Koray Kavukcuoglu, Video
Pixel Networks , arXiv preprint, 2016
Generative Adversarial Network (GAN)
272
Discriminative v.s. Generative Models
273

 Discriminative: learns a function that maps the input data x to some desired output class label y, i.e. directly learns the conditional distribution P(y|x)
 Generative: tries to learn the joint probability of the input data and labels simultaneously, i.e. P(x, y), which can be converted to P(y|x) for classification via Bayes rule
Advantage: generative models have the potential to understand and explain
the underlying structure of the input data even when there are no labels
擬態的演化
274

The fake butterfly starts out brown (棕色) → the predator learns "butterflies are not brown" (蝴蝶不是棕色); it then evolves leaf veins (葉脈) → "butterflies don't have leaf veins" (蝴蝶沒有葉脈); and so on — generator and discriminator co-evolve.

Generator
275

 Decoder from autoencoder as generator

encode decode
𝑥 𝑎 𝑥′
𝑊 𝑊′
hidden layer
Input layer output layer
 code
The generator's job is to generate the data from the code
Generative Adversarial Networks (GAN)
276

 Two competing neural networks: generator & discriminator

forger trying to produce


some counterfeit material

the police trying to detect


the forged items

Training the two networks jointly → the generator learns how to adapt its
parameters in order to produce output data that can fool the discriminator
Goodfellow, et al., “Generative adversarial networks,” in NIPS, 2014.
http://blog.aylien.com/introduction-generative-adversarial-networks-code-tensorflow/
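The forger/police competition corresponds to two opposed objectives (a minimal sketch of the original GAN losses; `d` stands for any discriminator returning P(real)):

```python
import numpy as np

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

def d_loss(d, x_real, x_fake):
    """Police: score real items high and counterfeits low."""
    return -np.log(d(x_real)) - np.log(1.0 - d(x_fake))

def g_loss(d, x_fake):
    """Forger: produce counterfeits the discriminator scores high."""
    return -np.log(d(x_fake))

# With a toy discriminator d(x) = σ(x), a counterfeit far from the real data
# gives the police a low loss and the forger a high loss — so the generator
# updates its parameters to push d(x_fake) up, i.e. to fool the discriminator.
d = lambda x: sigmoid(x)
```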
Generator Evolution
277

NN NN NN
Generator Generator Generator
v1 v2 v3

Discri- Discri- Discri-


minator minator minator
v1 v2 v3

Real images:
Cifar-10
278

 Which one is machine-generated?

https://openai.com/blog/generative-models/
Generated Bedrooms
279

Radford et al., “Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks,” arXiv:1511.06434.
Comics Drawing
280

https://github.com/mattya/chainer-DCGAN
Comics Drawing
281

http://qiita.com/mattya/items/e5bfe5e04b9d2f0bbd47
Pokémon Creation
282

 Small images of 792 Pokémon
 Can the machine learn to create new Pokémon?
Don't catch them — create them!
 Source of images:
http://bulbapedia.bulbagarden.net/wiki/List_of_Pok%C3%A9
mon_by_base_stats_(Generation_VI)
Original images are 40 × 40, shrunk to 20 × 20.
Pokémon Creation
283

 Each pixel is represented by 3 numbers (corresponding


to RGB)
R=50, G=150, B=100

 Each pixel is instead represented by a 1-of-N encoding feature: 0 0 1 0 0 …… — clustering similar colors gives 167 colors in total
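The RGB → 1-of-N conversion can be sketched as a nearest-palette-color lookup (a toy 3-color palette instead of the real 167 clustered colors):

```python
import numpy as np

palette = np.array([[50, 150, 100],   # toy palette; the slides cluster to 167 colors
                    [255, 0, 0],
                    [0, 0, 255]], dtype=float)

def encode_pixel(rgb):
    """Map an RGB pixel to the nearest palette color, as a 1-of-N vector."""
    idx = int(np.argmin(np.linalg.norm(palette - np.asarray(rgb, float), axis=1)))
    onehot = np.zeros(len(palette))
    onehot[idx] = 1.0
    return onehot

# encode_pixel([50, 150, 100]) → [1, 0, 0]
```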


Pokémon Creation
284

(figure: real Pokémon never seen by the machine, and generations where the model completes an image given 50% or 75% of it, or draws from scratch; it is difficult to evaluate generation)
Drawing from scratch needs some randomness.
285
Pokémon Creation
286

(diagram: a variational autoencoder — the NN encoder outputs means m₁…m₃ and σ₁…σ₃; the 10-dim code is cᵢ = exp(σᵢ)·eᵢ + mᵢ with noise eᵢ, and the NN decoder reconstructs the input. To generate, pick 2 code dimensions, fix the remaining 8, and feed codes to the decoder.)
287
288
Pokémon Creation - Data
289

 Original image (40 x 40):


http://speech.ee.ntu.edu.tw/~tlkagk/courses/ML_2016/Pokemon_creation/image.rar
 Pixels (20 x 20):
http://speech.ee.ntu.edu.tw/~tlkagk/courses/ML_2016/Pokemon_creation/pixel_color.txt
 Each line corresponds to an image, and each number corresponds to a pixel
 http://speech.ee.ntu.edu.tw/~tlkagk/courses/ML_2016/Pokemon_creation/colormap.txt

(each number in colormap.txt indexes one of the clustered colors: 0, 1, 2, ……)

Outline
290

 Semi-Supervised Learning
 Transfer Learning
 Unsupervised Learning
 化繁為簡 Representation Learning
 無中生有 Generative Model

 Reinforcement Learning
Reinforcement Learning
291

Observation Action

Agent

Don’t do Reward
that

Environment
Reinforcement Learning
292

Observation Action

Agent

Thank you. Reward

Environment
Agent learns to take actions to maximize expected reward.
Supervised vs. Reinforcement
293

 Supervised: learning from a teacher — "Hello" → say "Hi"; "Bye bye" → say "Good bye"
 Reinforcement: learning from critics — after a whole dialogue ("Hello" → …… → ……), the agent only hears "Bad"
Scenario of Reinforcement Learning
294

Observation Action

Reward Next Move

If win, reward = 1
If loss, reward = -1
Otherwise, reward = 0
Environment
Agent learns to take actions to maximize expected reward.
Supervised vs. Reinforcement
295

 Supervised learning: training based on supervisor/labels/annotation; feedback is instantaneous; time does not matter
 Reinforcement learning: training based only on a reward signal; feedback is delayed; time matters; the agent's actions affect the subsequent data

AlphaGo uses supervised learning + reinforcement learning


Reinforcement Learning
296

 RL is a general purpose framework for decision making


 RL is for an agent with the capacity to act
 Each action influences the agent’s future state
 Success is measured by a scalar reward signal
 Goal: select actions to maximize future reward
RL Difficulty
297

 Goal: select actions to maximize total future reward


 Actions may have long-term consequences
 Reward may be delayed
 It may be better to sacrifice immediate reward to gain more
long-term reward
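The trade-off between immediate and delayed reward is captured by the discounted return G = Σₜ γᵗrₜ (a minimal sketch; the reward sequences are made up):

```python
def discounted_return(rewards, gamma=0.9):
    """G = r_0 + γ·r_1 + γ²·r_2 + … , computed back-to-front."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

greedy  = discounted_return([1, 0, 0, 0])    # grab +1 immediately → G = 1.0
patient = discounted_return([0, 0, 0, 10])   # wait for +10 → G = 10·0.9³ = 7.29
# Sacrificing the immediate reward yields more long-term reward.
```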
Deep Reinforcement Learning
298

DNN
Observation Action
……


Function Function
Input Output

Used to pick the


best function Reward

Environment
Major Components in an RL Agent
299

 An RL agent may include one or more of these components


 Policy: agent’s behavior function
 Value function: how good is each state and/or action
 Model: agent’s representation of the environment
Maze Example
300

 Rewards: -1 per time-step


 Actions: N, E, S, W
 States: agent’s location
Maze Example: Policy
301

 Rewards: -1 per time-step


 Actions: N, E, S, W
 States: agent’s location

Arrows represent policy π(s) for each state s


Maze Example: Value Function
302

 Rewards: -1 per time-step


 Actions: N, E, S, W
 States: agent’s location

Numbers represent the value vπ(s) of each state s


E2E RL Task-Completion Bot
303

 Simulated User
 Generate interactions based on a predefined fake goal
 Automatically learn strategy by training on the simulated data
User goal: Two tickets for “the witch” tomorrow 9:30 PM at regal meridian 16, Seattle.
RULE-BASED SYSTEM
User: Can I get 2 tickets for the witch?
Agent: Which theater do you want?
User: Regal meridian 16
Agent: What time would you like to see it?
User: 9:30 pm
Agent: What date would you like to see it?
User: Tomorrow!
Agent: How many people are you?
User: Just two adults.
Agent: What is your location?
User: Seattle!
Agent: What kind of video format do you like?
User: Oh, any kind is okay.
Agent: Do you have any price constraint?
User: No, I don’t care.
Agent: Great - I was able to purchase 2 tickets for you to see the witch tomorrow at regal meridian 16 theater in Seattle at 9:30 pm.
User: Thank you.
Agent: Thank you!

REINFORCEMENT LEARNING SYSTEM
User: Can I get 2 tickets for the witch?
Agent: What date would you like to see it?
User: Tomorrow
Agent: Which theater do you prefer?
User: Regal meridian 16
Agent: What is your location?
User: Seattle
Agent: What time would you like to see it?
User: 9:30 pm.
Agent: Great - I was able to purchase 2 tickets for you to see the witch tomorrow at regal meridian 16 theater in Seattle at 9:30 pm.
User: Thanks.
Agent: Thanks!
More applications
304

 AlphaGo, Playing Video Games, Dialogue


 Flying Helicopter
 https://www.youtube.com/watch?v=0JL04JJjocc
 Driving
 https://www.youtube.com/watch?v=0xo1Ldx3L5Q
 Google Cuts Its Giant Electricity Bill With
DeepMind-Powered AI
 http://www.bloomberg.com/news/articles/2016-07-19/google-cuts-
its-giant-electricity-bill-with-deepmind-powered-ai
Concluding Remarks
305

 Semi-Supervised Learning
 Transfer Learning
 Unsupervised Learning
 化繁為簡 Representation Learning
 無中生有 Generative Model
 Reinforcement Learning (observation → agent → action; the environment returns a reward)
如何成為武林高手 (How to Become a Martial-Arts Master)
306

 Cultivate both internal strength (內功) and technique (招數)
 With abundant internal power, the strong overcome the weak
 With refined moves, the fast beat the slow

 Machine learning & deep learning also require both
 Internal power: computing resources
 Moves: the various techniques
 With abundant internal power, even ordinary moves can unleash enormous force
