Reinforcement Learning
Fei-Fei Li, Justin Johnson, Serena Yeung. Lecture 14, May 23, 2017
Administrative
Grades:
- Midterm grades released last night; see Piazza for more information and statistics
- A2 and milestone grades scheduled for later this week
Administrative
Projects:
- All teams must register their project; see Piazza for the registration form
- Tiny ImageNet evaluation server is online
Administrative
Survey:
- Please fill out the course survey!
- Link on Piazza or https://goo.gl/forms/eQpVW7IPjqapsDkB2
So far: Supervised Learning

Data: (x, y)
x is data, y is label

Goal: Learn a function to map x -> y

Examples: classification, regression, object detection,
semantic segmentation, image captioning, etc.
So far: Unsupervised Learning

Data: x
Just data, no labels!

Goal: Learn some underlying hidden structure of the data

Examples: clustering, dimensionality reduction, feature
learning, density estimation, etc.

[Figures: 1-d and 2-d density estimation; the 2-d density images left and
right are CC0 public domain]
Today: Reinforcement Learning
Overview
- What is reinforcement learning?
- Markov Decision Processes
- Q-learning
- Policy gradients
Reinforcement Learning

Problems involving an agent interacting with an environment, which provides
numeric reward signals.

The loop: the environment gives the agent a state s_t; the agent chooses an
action a_t; the environment returns a reward r_t and the next state s_{t+1};
repeat until the environment returns a terminal state, which ends the episode.

Goal: Learn how to take actions in order to maximize reward.
Cart-Pole Problem

Objective: balance a pole on top of a movable cart

State: angle, angular speed, position, horizontal velocity
Action: horizontal force applied on the cart
Reward: 1 at each time step if the pole is upright
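As a concrete sketch (ours, not from the slides), this setting maps directly
onto the OpenAI Gym interface of that era. A minimal random-action rollout,
assuming the classic (pre-0.26) Gym API where step() returns
(observation, reward, done, info):

    import gym

    # Minimal sketch of the agent-environment loop on Cart-Pole.
    env = gym.make('CartPole-v0')

    for episode in range(3):
        state = env.reset()        # [position, velocity, angle, angular velocity]
        total_reward, done = 0.0, False
        while not done:
            action = env.action_space.sample()      # random policy: push left/right
            state, reward, done, info = env.step(action)
            total_reward += reward                  # +1 per step the pole stays up
        print('episode %d: return %.0f' % (episode, total_reward))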
Robot Locomotion

Objective: make the robot move forward

State: angle and position of the joints
Action: torques applied on the joints
Reward: 1 at each time step upright + forward movement
Atari Games

Objective: complete the game with the highest score

State: raw pixel inputs of the game screen
Action: game controls, e.g. left, right, up, down
Reward: score increase/decrease at each time step
Go

Objective: win the game!

State: position of all pieces
Action: where to put the next piece down
Reward: 1 if win at the end of the game, 0 otherwise
How can we mathematically formalize the RL problem?
Markov Decision Process

- Mathematical formulation of the RL problem
- Markov property: the current state completely characterizes the state of
the world

Defined by: (S, A, R, P, γ)

S: set of possible states
A: set of possible actions
R: distribution of reward given (state, action) pair
P: transition probability, i.e. distribution over next state given (state, action) pair
γ: discount factor
Markov Decision Process

- At time step t=0, environment samples initial state s_0 ~ p(s_0)
- Then, for t=0 until done:
  - Agent selects action a_t
  - Environment samples reward r_t ~ R( . | s_t, a_t)
  - Environment samples next state s_{t+1} ~ P( . | s_t, a_t)
  - Agent receives reward r_t and next state s_{t+1}

- A policy π is a function from S to A that specifies what action to take in
each state
- Objective: find the policy π* that maximizes the cumulative discounted
reward: Σ_{t≥0} γ^t r_t

(see the interaction-loop sketch below)
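To make the loop concrete, here is a small self-contained sketch (our
illustration; the states, reward table, and transition rule are hypothetical)
of sampling one episode from a toy MDP:

    import random

    # Toy MDP: states and actions are small integers; the reward and
    # transition distributions below are made up for illustration.
    states, actions, gamma = [0, 1, 2], [0, 1], 0.9   # state 2 is terminal

    def sample_reward(s, a):
        # r_t ~ R( . | s_t, a_t): a noisy reward around a table value
        base = {(0, 0): -1.0, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 1.0}
        return base.get((s, a), 0.0) + random.gauss(0.0, 0.1)

    def sample_next_state(s, a):
        # s_{t+1} ~ P( . | s_t, a_t): move right with prob 0.8, else stay
        return min(s + 1, 2) if random.random() < 0.8 else s

    def run_episode(policy):
        s, t, ret = 0, 0, 0.0
        while s != 2:                    # "for t=0 until done"
            a = policy(s)                # agent selects a_t
            r = sample_reward(s, a)      # environment samples r_t
            ret += (gamma ** t) * r      # accumulate discounted reward
            s = sample_next_state(s, a)  # environment samples s_{t+1}
            t += 1
        return ret

    random_policy = lambda s: random.choice(actions)
    print(run_episode(random_policy))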
A simple MDP: Grid World

States: cells of a grid; terminal cells are greyed out
Actions = { right, left, up, down }
Reward: set a negative reward for each transition (e.g. r = -1)

Objective: reach one of the terminal states in the least number of actions
[Figure: a random policy vs. the optimal policy on the grid; the optimal
policy always moves toward the nearest terminal state]
The optimal policy π*

We want to find the optimal policy π* that maximizes the sum of rewards.

How do we handle the randomness (initial state, transition probabilities, ...)?
Maximize the expected sum of rewards!

Formally:  π* = argmax_π E[ Σ_{t≥0} γ^t r_t | π ]
with  s_0 ~ p(s_0), a_t ~ π( . | s_t), s_{t+1} ~ P( . | s_t, a_t)
Definitions: Value function and Q-value function

Following a policy produces sample trajectories (or paths) s_0, a_0, r_0, s_1, a_1, r_1, ...

How good is a state? The value function at state s is the expected cumulative
reward from following the policy from state s:
    V^π(s) = E[ Σ_{t≥0} γ^t r_t | s_0 = s, π ]

How good is a state-action pair? The Q-value function at state s and action a
is the expected cumulative reward from taking action a in state s and then
following the policy:
    Q^π(s,a) = E[ Σ_{t≥0} γ^t r_t | s_0 = s, a_0 = a, π ]
Bellman equation

The optimal Q-value function Q* is the maximum expected cumulative reward
achievable from a given (state, action) pair:
    Q*(s,a) = max_π E[ Σ_{t≥0} γ^t r_t | s_0 = s, a_0 = a, π ]

Q* satisfies the following Bellman equation:
    Q*(s,a) = E_{s'}[ r + γ max_{a'} Q*(s',a') | s, a ]

Intuition: if the optimal state-action values for the next time step Q*(s',a')
are known, then the optimal strategy is to take the action that maximizes the
expected value of r + γ Q*(s',a').

The optimal policy π* corresponds to taking the best action in any state as
specified by Q*.
Solving for the optimal policy

Value iteration algorithm: use the Bellman equation as an iterative update:
    Q_{i+1}(s,a) = E[ r + γ max_{a'} Q_i(s',a') | s, a ]
Q_i will converge to Q* as i -> infinity.

What's the problem with this? Not scalable. We must compute Q(s,a) for every
state-action pair; if the state is e.g. the current game state pixels, this is
computationally infeasible for the entire state space!

Solution: use a function approximator to estimate Q(s,a), e.g. a neural
network! (A tabular sketch on a tiny grid world follows below.)
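Before moving to function approximators, a self-contained tabular sketch (our
illustration; the 1x4 corridor and the r = -1 per move echo the grid-world
example above, and deterministic transitions make the Bellman expectation a
single next state):

    import numpy as np

    # Tabular value iteration on a tiny deterministic grid world:
    # a 1x4 corridor [T, ., ., .] where cell 0 is terminal.
    n_states, gamma = 4, 1.0
    actions = [-1, +1]                      # left, right

    def step(s, a):
        return min(max(s + a, 0), n_states - 1)

    Q = np.zeros((n_states, len(actions)))
    for i in range(100):
        Q_new = np.zeros_like(Q)
        for s in range(1, n_states):        # state 0 is terminal: Q stays 0
            for ai, a in enumerate(actions):
                s_next = step(s, a)
                # Bellman backup: Q_{i+1}(s,a) = r + gamma * max_a' Q_i(s',a')
                Q_new[s, ai] = -1.0 + gamma * Q[s_next].max()
        if np.allclose(Q_new, Q):
            break
        Q = Q_new

    print(Q)                    # values decrease with distance from the goal
    print(Q.argmax(axis=1))     # greedy policy: move left, toward the terminal cell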
Solving for the optimal policy: Q-learning

Q-learning: use a function approximator to estimate the action-value function:
    Q(s,a; θ) ≈ Q*(s,a)
where θ are the function parameters (weights).

If the function approximator is a deep neural network => deep q-learning!
Solving for the optimal policy: Q-learning

Remember: we want to find a Q-function that satisfies the Bellman equation:
    Q*(s,a) = E_{s'}[ r + γ max_{a'} Q*(s',a') | s, a ]

Forward Pass
Loss function:
    L_i(θ_i) = E_{s,a}[ (y_i - Q(s,a; θ_i))^2 ]
where
    y_i = E_{s'}[ r + γ max_{a'} Q(s',a'; θ_{i-1}) | s, a ]
Iteratively try to make the Q-value close to the target value (y_i) it should
have, if the Q-function corresponds to the optimal Q* (and the optimal policy π*).

Backward Pass
Gradient update (with respect to Q-function parameters θ):
    ∇_{θ_i} L_i(θ_i) = E_{s,a,s'}[ (r + γ max_{a'} Q(s',a'; θ_{i-1}) - Q(s,a; θ_i)) ∇_{θ_i} Q(s,a; θ_i) ]
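A minimal sketch of this forward/backward pass in PyTorch (our illustration;
the small network and the random batch tensors are hypothetical stand-ins for
a real Q-network and real transitions):

    import torch
    import torch.nn as nn

    # q_net maps a batch of states to one Q-value per action.
    q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
    optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
    gamma = 0.99

    # Hypothetical minibatch of transitions (s, a, r, s', done)
    s      = torch.randn(32, 4)
    a      = torch.randint(0, 2, (32, 1))
    r      = torch.randn(32)
    s_next = torch.randn(32, 4)
    done   = torch.zeros(32)

    # Forward pass: target y_i = r + gamma * max_a' Q(s',a'); no gradient
    # flows through the target (the slide's "old" parameters theta_{i-1}).
    with torch.no_grad():
        y = r + gamma * (1 - done) * q_net(s_next).max(dim=1).values

    q_sa = q_net(s).gather(1, a).squeeze(1)    # Q(s, a; theta_i)
    loss = nn.functional.mse_loss(q_sa, y)     # L_i = E[(y_i - Q(s,a))^2]

    # Backward pass: gradient w.r.t. the Q-function parameters theta
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()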
Case study: playing Atari games [Mnih et al. NIPS Workshop 2013; Nature 2015]

Q-network Architecture

Q(s,a; θ): neural network with weights θ

    FC-4 (Q-values)
    FC-256
    32 4x4 conv, stride 2
    16 8x8 conv, stride 4
    Input: state s_t

Input: state s_t, a stack of the last 4 frames of raw game pixels (after
grayscale conversion, downsampling, and cropping).

Last FC layer has 4-d output (if 4 actions), corresponding to Q(s_t, a_1),
Q(s_t, a_2), Q(s_t, a_3), Q(s_t, a_4). The number of actions is between 4 and
18, depending on the Atari game.

A single feedforward pass computes the Q-values for all actions from the
current state => efficient! (See the PyTorch sketch below.)
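A sketch of this architecture in PyTorch (our illustration; the layer sizes
come from the slide, while padding, initialization, and other details are
assumptions):

    import torch
    import torch.nn as nn

    class QNetwork(nn.Module):
        def __init__(self, num_actions=4):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv2d(4, 16, kernel_size=8, stride=4),   # 16 8x8 conv, stride 4
                nn.ReLU(),
                nn.Conv2d(16, 32, kernel_size=4, stride=2),  # 32 4x4 conv, stride 2
                nn.ReLU(),
            )
            self.fc = nn.Sequential(
                nn.Flatten(),
                nn.Linear(32 * 9 * 9, 256),                  # FC-256
                nn.ReLU(),
                nn.Linear(256, num_actions),                 # FC-4 (Q-values)
            )

        def forward(self, x):
            # One feedforward pass returns Q(s_t, a) for every action at once.
            return self.fc(self.conv(x))

    q_net = QNetwork(num_actions=4)
    frames = torch.zeros(1, 4, 84, 84)   # stack of the last 4 grayscale frames
    print(q_net(frames).shape)           # torch.Size([1, 4])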
Training the Q-network: the loss function and gradient update are exactly as
before. [Mnih et al. NIPS Workshop 2013; Nature 2015]
Training the Q-network: Experience Replay [Mnih et al. NIPS Workshop 2013; Nature 2015]

Learning from batches of consecutive samples is problematic:
- Samples are correlated => inefficient learning
- Current Q-network parameters determine the next training samples => can
lead to bad feedback loops

Address these problems using experience replay:
- Continually update a replay memory table of transitions (s_t, a_t, r_t, s_{t+1})
as game (experience) episodes are played
- Train the Q-network on random minibatches of transitions from the replay
memory, instead of consecutive samples

Putting it together: deep Q-learning with experience replay
- Initialize the replay memory and the Q-network
- Play M episodes (full games); initialize the state (starting game screen
pixels) at the beginning of each episode
- For each time step: with small probability select a random action (explore),
otherwise select the greedy action from the current policy
- Take the action and observe the reward and next state
- Store the transition in replay memory
- Experience replay: sample a random minibatch of transitions from replay
memory and perform a gradient descent step
(A minimal replay-memory sketch follows below.)
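A minimal replay-memory sketch (our illustration): a bounded FIFO buffer of
transitions plus uniform minibatch sampling.

    import random
    from collections import deque

    class ReplayMemory:
        def __init__(self, capacity=100000):
            self.buffer = deque(maxlen=capacity)   # oldest transitions fall off

        def store(self, state, action, reward, next_state, done):
            self.buffer.append((state, action, reward, next_state, done))

        def sample(self, batch_size=32):
            # A random minibatch breaks the correlation between consecutive samples
            batch = random.sample(self.buffer, batch_size)
            states, actions, rewards, next_states, dones = zip(*batch)
            return states, actions, rewards, next_states, dones

    memory = ReplayMemory()
    for t in range(1000):                  # fake interaction data
        memory.store(t, t % 4, 1.0, t + 1, False)
    print(len(memory.sample()[0]))         # 32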
https://www.youtube.com/watch?v=V1eYniJ0Rnk
Video by Károly Zsolnai-Fehér. Reproduced with permission.
Policy Gradients

What is a problem with Q-learning?
The Q-function can be very complicated!

Example: a robot grasping an object has a very high-dimensional state => hard
to learn the exact value of every (state, action) pair.

But the policy can be much simpler: just close your hand.
Can we learn a policy directly, e.g. find the best policy from a collection of
policies?
Policy Gradients

Formally, let's define a class of parametrized policies:
    Π = { π_θ, θ ∈ R^m }

For each policy, define its value:
    J(θ) = E[ Σ_{t≥0} γ^t r_t | π_θ ]

We want to find the optimal policy θ* = argmax_θ J(θ).
How can we do this? Gradient ascent on the policy parameters!
REINFORCE algorithm

Mathematically, we can write:
    J(θ) = E_{τ ~ p(τ;θ)}[ r(τ) ] = ∫ r(τ) p(τ;θ) dτ
where r(τ) is the reward of a trajectory τ = (s_0, a_0, r_0, s_1, ...).

Expected reward: we want to differentiate this, but
    ∇_θ J(θ) = ∫ r(τ) ∇_θ p(τ;θ) dτ
is intractable: taking the gradient of an expectation is problematic when p
depends on θ.

However, we can use a nice trick:
    ∇_θ p(τ;θ) = p(τ;θ) ∇_θ log p(τ;θ)
If we inject this back, the gradient becomes an expectation again:
    ∇_θ J(θ) = ∫ ( r(τ) ∇_θ log p(τ;θ) ) p(τ;θ) dτ = E_{τ ~ p(τ;θ)}[ r(τ) ∇_θ log p(τ;θ) ]
and we can estimate it with Monte Carlo sampling.

Can we compute those quantities without knowing the transition probabilities?
We have:
    p(τ;θ) = Π_{t≥0} p(s_{t+1} | s_t, a_t) π_θ(a_t | s_t)
Thus:
    log p(τ;θ) = Σ_{t≥0} [ log p(s_{t+1} | s_t, a_t) + log π_θ(a_t | s_t) ]
And when differentiating, the transition term vanishes:
    ∇_θ log p(τ;θ) = Σ_{t≥0} ∇_θ log π_θ(a_t | s_t)
Doesn't depend on the transition probabilities!

Therefore, when sampling a trajectory τ, we can estimate J(θ) with
    ∇_θ J(θ) ≈ Σ_{t≥0} r(τ) ∇_θ log π_θ(a_t | s_t)
(A code sketch of this estimator follows below.)
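A sketch of the REINFORCE estimator for one sampled trajectory in PyTorch
(our illustration; the 4-d states, 2 actions, and random rewards are
hypothetical stand-ins for a real environment):

    import torch
    import torch.nn as nn

    policy = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
    optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

    log_probs, rewards = [], []
    state = torch.randn(4)
    for t in range(20):                          # roll out one trajectory
        dist = torch.distributions.Categorical(logits=policy(state))
        action = dist.sample()                   # a_t ~ pi_theta( . | s_t)
        log_probs.append(dist.log_prob(action))
        rewards.append(torch.randn(()).item())   # r_t from the (fake) environment
        state = torch.randn(4)                   # s_{t+1} from the (fake) environment

    # grad J ~ sum_t r(tau) * grad log pi(a_t | s_t); minimizing the negative
    # surrogate below performs gradient ascent on J(theta).
    r_tau = sum(rewards)
    loss = -r_tau * torch.stack(log_probs).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()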
Intuition

Gradient estimator:
    ∇_θ J(θ) ≈ Σ_{t≥0} r(τ) ∇_θ log π_θ(a_t | s_t)

Interpretation:
- If r(τ) is high, push up the probabilities of the actions seen
- If r(τ) is low, push down the probabilities of the actions seen

Might seem simplistic to say that if a trajectory is good then all its actions
were good. But in expectation, it averages out!

However, this also suffers from high variance because credit assignment is
really hard. Can we help the estimator?
Variance reduction

Gradient estimator:
    ∇_θ J(θ) ≈ Σ_{t≥0} r(τ) ∇_θ log π_θ(a_t | s_t)

First idea: push up the probability of an action seen only by the cumulative
future reward from that state:
    ∇_θ J(θ) ≈ Σ_{t≥0} ( Σ_{t'≥t} r_{t'} ) ∇_θ log π_θ(a_t | s_t)

Second idea: use a discount factor γ to reduce the effect of delayed rewards:
    ∇_θ J(θ) ≈ Σ_{t≥0} ( Σ_{t'≥t} γ^{t'-t} r_{t'} ) ∇_θ log π_θ(a_t | s_t)
Variance reduction: Baseline

Problem: the raw value of a trajectory isn't necessarily meaningful. For
example, if rewards are all positive, you keep pushing up the probabilities of
actions.

What is important then? Whether a reward is better or worse than what you
expect to get.

Idea: introduce a baseline function dependent on the state:
    ∇_θ J(θ) ≈ Σ_{t≥0} ( Σ_{t'≥t} γ^{t'-t} r_{t'} - b(s_t) ) ∇_θ log π_θ(a_t | s_t)

How to choose the baseline?

A simple baseline: a constant moving average of rewards experienced so far
from all trajectories.

A better baseline: want to push up the probability of an action from a state
if this action was better than the expected value of what we should get from
that state.

Q: What does this remind you of?
A: The Q-function and the value function! Intuitively, we are happy with an
action a_t in a state s_t if Q^π(s_t, a_t) - V^π(s_t) is large; on the
contrary, we are unhappy with an action if it is small.

Using this, we get the estimator:
    ∇_θ J(θ) ≈ Σ_{t≥0} ( Q^π(s_t, a_t) - V^π(s_t) ) ∇_θ log π_θ(a_t | s_t)
(A small baseline-subtracted sketch follows below.)
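A small sketch of the baseline idea (our illustration; the reward sequence is
made up): each step's weight becomes "discounted reward-to-go minus baseline",
which is what would multiply ∇_θ log π_θ(a_t | s_t) in the REINFORCE loss above.

    gamma = 0.99
    rewards = [1.0, 1.0, 1.0, 0.0, 1.0]        # hypothetical r_t
    baseline = sum(rewards) / len(rewards)      # simple constant baseline

    weights = []
    for t in range(len(rewards)):
        # discounted reward-to-go from step t, minus the baseline
        G_t = sum(gamma ** (k - t) * rewards[k] for k in range(t, len(rewards)))
        weights.append(G_t - baseline)

    print(weights)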
Actor-Critic Algorithm

Problem: we don't know Q and V. Can we learn them?

Yes, using Q-learning! We can combine policy gradients and Q-learning by
training both an actor (the policy) and a critic (the Q-function):
- The actor decides which action to take, and the critic tells the actor
how good its action was and how it should adjust
- This also alleviates the task of the critic, as it only has to learn the
values of (state, action) pairs generated by the policy
- Can also incorporate Q-learning tricks, e.g. experience replay
- Remark: we can define the advantage function, how much an action was better
than expected:  A^π(s,a) = Q^π(s,a) - V^π(s)
Actor-Critic Algorithm

Initialize policy parameters θ and critic parameters φ
For iteration = 1, 2, ... do
    Sample m trajectories under the current policy
    For i = 1, ..., m do
        For t = 1, ..., T do
            Accumulate the policy gradient, weighting each
            ∇_θ log π_θ(a_t | s_t) by the critic's advantage estimate
            Accumulate the critic's gradient from its value-estimation error
        End for
    End for
    Update θ and φ with the accumulated gradients
End for
(A minimal code sketch follows below.)
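A minimal one-step actor-critic sketch in PyTorch (our illustration: the actor
is a softmax policy, the critic here estimates V(s) rather than Q, and the
advantage is approximated by the TD error r + γV(s') - V(s); the transition is
fake):

    import torch
    import torch.nn as nn

    actor  = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
    critic = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 1))
    opt = torch.optim.Adam(list(actor.parameters()) + list(critic.parameters()), lr=1e-3)
    gamma = 0.99

    state, next_state = torch.randn(4), torch.randn(4)   # fake transition
    reward = 1.0

    dist = torch.distributions.Categorical(logits=actor(state))
    action = dist.sample()

    v, v_next = critic(state), critic(next_state).detach()
    advantage = reward + gamma * v_next - v              # critic's TD error

    actor_loss  = -advantage.detach() * dist.log_prob(action)  # adjust the actor
    critic_loss = advantage.pow(2)                             # fit V to its target
    opt.zero_grad()
    (actor_loss + critic_loss).sum().backward()
    opt.step()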
REINFORCE in action: Recurrent Attention Model (RAM) [Mnih et al. 2014]

Objective: image classification

Take a sequence of "glimpses" selectively focusing on regions of the image, to
predict the class.

State: glimpses seen so far
Action: (x, y) coordinates (center of glimpse) of where to look next in the image
Reward: 1 at the final time step if the image is correctly classified, 0 otherwise

Glimpsing is a non-differentiable operation => learn the policy for how to take
glimpse actions using REINFORCE.
Given the state of glimpses seen so far, use an RNN to model the state and
output the next action.
REINFORCE in action: Recurrent Attention Model (RAM)

[Figure: the RNN unrolled over time. At each step the network takes a glimpse
of the input image at location (x_t, y_t), updates its hidden state, and
outputs the next glimpse location; after the final glimpse, a softmax predicts
the class, e.g. y = 2.]
REINFORCE in action: Recurrent Attention Model (RAM)

Has also been used in many other tasks, including fine-grained image
recognition, image captioning, and visual question answering!
More policy gradients: AlphaGo
Overview:
- Mix of supervised learning and reinforcement learning
- Mix of old methods (Monte Carlo Tree Search) and
recent ones (deep RL)
Summary

- Policy gradients: very general, but suffer from high variance, so they
require a lot of samples. Challenge: sample-efficiency
- Q-learning: does not always work, but when it works it is usually more
sample-efficient. Challenge: exploration

- Guarantees:
  - Policy gradients: converge to a local maximum of J(θ), often good enough!
  - Q-learning: zero guarantees, since you are approximating the Bellman
    equation with a complicated function approximator
Next Time
Guest Lecture: Song Han
- Energy-efficient deep learning
- Deep learning hardware
- Model compression
- Embedded systems
- And more...