
APPROXIMATE DYNAMIC PROGRAMMING

A SERIES OF LECTURES GIVEN AT

TSINGHUA UNIVERSITY

JUNE 2014

DIMITRI P. BERTSEKAS

Based on the books:


(1) “Neuro-Dynamic Programming,” by DPB
and J. N. Tsitsiklis, Athena Scientific,
1996
(2) “Dynamic Programming and Optimal
Control, Vol. II: Approximate Dynamic
Programming,” by DPB, Athena Sci-
entific, 2012
(3) “Abstract Dynamic Programming,” by
DPB, Athena Scientific, 2013
http://www.athenasc.com
For a fuller set of slides, see
http://web.mit.edu/dimitrib/www/publ.html
APPROXIMATE DYNAMIC PROGRAMMING

BRIEF OUTLINE I

• Our subject:
− Large-scale DP based on approximations and
in part on simulation.
− This has been a research area of great inter-
est for the last 25 years known under various
names (e.g., reinforcement learning, neuro-
dynamic programming)
− Emerged through an enormously fruitful cross-
fertilization of ideas from artificial intelligence
and optimization/control theory
− Deals with control of dynamic systems under
uncertainty, but applies more broadly (e.g.,
discrete deterministic optimization)
− A vast range of applications in control the-
ory, operations research, artificial intelligence,
and beyond ...
− The subject is broad with rich variety of
theory/math, algorithms, and applications.
Our focus will be mostly on algorithms ...
less on theory and modeling
APPROXIMATE DYNAMIC PROGRAMMING

BRIEF OUTLINE II
• Our aim:
− A state-of-the-art account of some of the ma-
jor topics at a graduate level
− Show how to use approximation and simula-
tion to address the dual curses of DP: di-
mensionality and modeling
• Our 6-lecture plan:
− Two lectures on exact DP with emphasis on
infinite horizon problems and issues of large-
scale computational methods
− One lecture on general issues of approxima-
tion and simulation for large-scale problems
− One lecture on approximate policy iteration
based on temporal differences (TD)/projected
equations/Galerkin approximation
− One lecture on aggregation methods
− One lecture on Q-learning, and other meth-
ods, such as approximation in policy space
APPROXIMATE DYNAMIC PROGRAMMING

LECTURE 1

LECTURE OUTLINE

• Introduction to DP and approximate DP


• Finite horizon problems
• The DP algorithm for finite horizon problems
• Infinite horizon problems
• Basic theory of discounted infinite horizon prob-
lems
DP AS AN OPTIMIZATION METHODOLOGY

• Generic optimization problem:

min g(u)
u∈U

where u is the optimization/decision variable, g(u)


is the cost function, and U is the constraint set
• Categories of problems:
− Discrete (U is finite) or continuous
− Linear (g is linear and U is polyhedral) or
nonlinear
− Stochastic or deterministic: In stochastic prob-
lems the cost involves a stochastic parameter
w, which is averaged, i.e., it has the form

g(u) = Ew G(u, w)

where w is a random parameter.


• DP deals with multistage stochastic problems
− Information about w is revealed in stages
− Decisions are also made in stages and make
use of the available information
− Its methodology is “different”
BASIC STRUCTURE OF STOCHASTIC DP

• Discrete-time system

xk+1 = fk (xk , uk , wk ), k = 0, 1, . . . , N − 1

− k: Discrete time
− xk : State; summarizes past information that
is relevant for future optimization
− uk : Control; decision to be selected at time
k from a given set
− wk : Random parameter (also called “distur-
bance” or “noise” depending on the context)
− N : Horizon or number of times control is
applied

• Cost function that is additive over time


E{ gN(xN) + Σ_{k=0}^{N−1} gk(xk, uk, wk) }

• Alternative system description: P (xk+1 | xk , uk )

xk+1 = wk with P (wk | xk , uk ) = P (xk+1 | xk , uk )


INVENTORY CONTROL EXAMPLE

• Discrete-time system

xk+1 = fk (xk , uk , wk ) = xk + uk − wk

• Cost function that is additive over time


E{ gN(xN) + Σ_{k=0}^{N−1} gk(xk, uk, wk) }

  = E{ Σ_{k=0}^{N−1} ( c uk + r(xk + uk − wk) ) }
ADDITIONAL ASSUMPTIONS

• Probability distribution of wk does not depend


on past values wk−1 , . . . , w0 , but may depend on
xk and uk
− Otherwise past values of w, x, or u would be
useful for future optimization
• The constraint set from which uk is chosen at
time k depends at most on xk , not on prior x or
u
• Optimization over policies (also called feedback
control laws): These are rules/functions

uk = µk (xk ), k = 0, . . . , N − 1

that map state/inventory to control/order (closed-


loop optimization, use of feedback)
• MAJOR DISTINCTION: We minimize over se-
quences of functions (mapping inventory to order)

{µ0 , µ1 , . . . , µN −1 }

NOT over sequences of controls/orders

{u0 , u1 , . . . , uN −1 }
GENERIC FINITE-HORIZON PROBLEM

• System xk+1 = fk (xk , uk , wk ), k = 0, . . . , N −1


• Control constraints uk ∈ Uk (xk )
• Probability distribution Pk (· | xk , uk ) of wk
• Policies π = {µ0 , . . . , µN −1 }, where µk maps
states xk into controls uk = µk (xk ) and is such
that µk (xk ) ∈ Uk (xk ) for all xk
• Expected cost of π starting at x0 is
Jπ(x0) = E{ gN(xN) + Σ_{k=0}^{N−1} gk(xk, µk(xk), wk) }

• Optimal cost function

J*(x0) = min_π Jπ(x0)

• Optimal policy π ∗ satisfies

Jπ∗ (x0 ) = J ∗ (x0 )

When produced by DP, π ∗ is independent of x0 .


PRINCIPLE OF OPTIMALITY

• Let π ∗ = {µ∗0 , µ∗1 , . . . , µ∗N −1 } be optimal policy


• Consider the “tail subproblem” whereby we are
at xk at time k and wish to minimize the “cost-
to-go” from time k to time N
E{ gN(xN) + Σ_{ℓ=k}^{N−1} gℓ(xℓ, µℓ(xℓ), wℓ) }

and the “tail policy” {µ∗k , µ∗k+1 , . . . , µ∗N −1 }

[Figure: time line 0 ... k ... N, with the tail subproblem starting at state xk at time k]

• Principle of optimality: The tail policy is opti-


mal for the tail subproblem (optimization of the
future does not depend on what we did in the past)
• DP solves ALL the tail subproblems
• At the generic step, it solves ALL tail subprob-
lems of a given time length, using the solution of
the tail subproblems of shorter time length
DP ALGORITHM

• Computes for all k and states xk :


Jk (xk ): opt. cost of tail problem starting at xk
• Initial condition:

JN (xN ) = gN (xN )

Go backwards, k = N − 1, . . . , 0, using

Jk(xk) = min_{uk∈Uk(xk)} E_{wk}{ gk(xk, uk, wk) + Jk+1( fk(xk, uk, wk) ) }

• To solve tail subproblem at time k minimize

kth-stage cost + Opt. cost of next tail problem


starting from next state at time k + 1
• Then J0 (x0 ), generated at the last step, is equal
to the optimal cost J ∗ (x0 ). Also, the policy
π ∗ = {µ∗0 , . . . , µ∗N −1 }
where µ∗k (xk ) minimizes in the right side above for
each xk and k, is optimal
• Proof by induction
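• A minimal Python sketch of the backward DP recursion above, on a small finite-state problem (the inventory-style system, costs, and horizon used here are illustrative assumptions, not from the lecture):

```python
import numpy as np

def finite_horizon_dp(N, states, controls, f, g, g_N, w_values, w_probs):
    """Backward recursion J_k(x) = min_u E[ g(x,u,w) + J_{k+1}(f(x,u,w)) ],
    with J_N = g_N. Stage-independent f and g are used for simplicity."""
    J = {N: {x: g_N(x) for x in states}}
    policy = {}
    for k in range(N - 1, -1, -1):               # go backwards, k = N-1, ..., 0
        J[k], policy[k] = {}, {}
        for x in states:
            best_u, best_cost = None, np.inf
            for u in controls(x):
                # expected stage cost + optimal cost of the next tail problem
                cost = sum(p * (g(x, u, w) + J[k + 1][f(x, u, w)])
                           for w, p in zip(w_values, w_probs))
                if cost < best_cost:
                    best_u, best_cost = u, cost
            J[k][x], policy[k][x] = best_cost, best_u
    return J, policy

# Small inventory-style example: x_{k+1} = x_k + u_k - w_k, capped to {0, ..., 10}
states = range(11)
J, mu = finite_horizon_dp(
    N=5, states=states,
    controls=lambda x: range(11 - x),
    f=lambda x, u, w: min(max(x + u - w, 0), 10),
    g=lambda x, u, w: 1.0 * u + 2.0 * max(w - x - u, 0) + 0.5 * max(x + u - w, 0),
    g_N=lambda x: 0.0,
    w_values=[0, 1, 2], w_probs=[0.3, 0.4, 0.3],
)
print(J[0][0], mu[0][0])
```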
PRACTICAL DIFFICULTIES OF DP

• The curse of dimensionality


− Exponential growth of the computational and
storage requirements as the number of state
variables and control variables increases
− Quick explosion of the number of states in
combinatorial problems
• The curse of modeling
− Sometimes a simulator of the system is easier
to construct than a model
• There may be real-time solution constraints
− A family of problems may be addressed. The
data of the problem to be solved is given with
little advance notice
− The problem data may change as the system
is controlled – need for on-line replanning
• All of the above are motivations for approxi-
mation and simulation
A MAJOR IDEA: COST APPROXIMATION

• Use a policy computed from the DP equation


where the optimal cost-to-go function Jk+1 is re-
placed by an approximation J˜k+1 .
• Apply µk (xk ), which attains the minimum in
min_{uk∈Uk(xk)} E{ gk(xk, uk, wk) + J̃k+1( fk(xk, uk, wk) ) }

• Some approaches:
(a) Problem Approximation: Use J˜k derived from
a related but simpler problem
(b) Parametric Cost-to-Go Approximation: Use
as J˜k a function of a suitable parametric
form, whose parameters are tuned by some
heuristic or systematic scheme (we will mostly
focus on this)
− This is a major portion of Reinforcement
Learning/Neuro-Dynamic Programming
(c) Rollout Approach: Use as J˜k the cost of
some suboptimal policy, which is calculated
either analytically or by simulation
ROLLOUT ALGORITHMS

• At each k and state xk , use the control µk (xk )


that minimizes in

min_{uk∈Uk(xk)} E{ gk(xk, uk, wk) + J̃k+1( fk(xk, uk, wk) ) },

where J˜k+1 is the cost-to-go of some heuristic pol-


icy (called the base policy).
• Cost improvement property: The rollout algo-
rithm achieves no worse (and usually much better)
cost than the base policy starting from the same
state.
• Main difficulty: Calculating J˜k+1 (x) may be
computationally intensive if the cost-to-go of the
base policy cannot be analytically calculated.
− May involve Monte Carlo simulation if the
problem is stochastic.
− Things improve in the deterministic case (an
important application is discrete optimiza-
tion).
− Connection w/ Model Predictive Control (MPC).
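• A minimal Python sketch of a rollout algorithm for the deterministic case, using a user-supplied base heuristic as the source of J̃k+1 (all function names and signatures here are illustrative assumptions):

```python
def rollout_policy(x0, N, f, g, g_N, controls, base_heuristic):
    """One-step lookahead with the base policy's cost-to-go as J~_{k+1},
    for a deterministic problem. base_heuristic(x, k) returns a control."""
    def heuristic_cost(x, k):
        # cost of running the base policy from state x, time k, to the end
        total = 0.0
        while k < N:
            u = base_heuristic(x, k)
            total += g(x, u)
            x, k = f(x, u), k + 1
        return total + g_N(x)

    x, trajectory, total = x0, [], 0.0
    for k in range(N):
        # minimize: current stage cost + heuristic cost-to-go from the next state
        u = min(controls(x), key=lambda v: g(x, v) + heuristic_cost(f(x, v), k + 1))
        trajectory.append((x, u))
        total += g(x, u)
        x = f(x, u)
    return trajectory, total + g_N(x)
```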
INFINITE HORIZON PROBLEMS

• Same as the basic problem, but:


− The number of stages is infinite.
− The system is stationary.
• Total cost problems: Minimize
Jπ(x0) = lim_{N→∞} E_{wk, k=0,1,...}{ Σ_{k=0}^{N−1} α^k g(xk, µk(xk), wk) }

− Discounted problems (α < 1, bounded g)


− Stochastic shortest path problems (α = 1,
finite-state system with a termination state)
- we will discuss sparingly
− Discounted and undiscounted problems with
unbounded cost per stage - we will not cover
• Average cost problems - we will not cover
• Infinite horizon characteristics:
− Challenging analysis, elegance of solutions
and algorithms
− Stationary policies π = {µ, µ, . . .} and sta-
tionary forms of DP play a special role
DISCOUNTED PROBLEMS/BOUNDED COST

• Stationary system

xk+1 = f (xk , uk , wk ), k = 0, 1, . . .

• Cost of a policy π = {µ0 , µ1 , . . .}


Jπ(x0) = lim_{N→∞} E_{wk, k=0,1,...}{ Σ_{k=0}^{N−1} α^k g(xk, µk(xk), wk) }

with α < 1, and g is bounded [for some M , we


have |g(x, u, w)| ≤ M for all (x, u, w)]
• Optimal cost function: J ∗ (x) = minπ Jπ (x)
• Boundedness of g guarantees that all costs are
well-defined and bounded: |Jπ(x)| ≤ M/(1 − α)
• All spaces are arbitrary - only boundedness of
g is important (there are math fine points, e.g.
measurability, but they don’t matter in practice)
• Important special case: All underlying spaces
finite; a (finite spaces) Markovian Decision Prob-
lem or MDP
• All algorithms ultimately work with a finite
spaces MDP approximating the original problem
SHORTHAND NOTATION FOR DP MAPPINGS

• For any function J of x, denote


 
(TJ)(x) = min_{u∈U(x)} E_w{ g(x, u, w) + αJ( f(x, u, w) ) },   ∀x

• T J is the optimal cost function for the one-


stage problem with stage cost g and terminal cost
function αJ.
• T operates on bounded functions of x to pro-
duce other bounded functions of x
• For any stationary policy µ, denote
  
(TµJ)(x) = E_w{ g(x, µ(x), w) + αJ( f(x, µ(x), w) ) },   ∀x

• The critical structure of the problem is cap-


tured in T and Tµ
• The entire theory of discounted problems can
be developed in shorthand using T and Tµ
• True for many other DP problems.
• T and Tµ provide a powerful unifying framework
for DP. This is the essence of the book “Abstract
Dynamic Programming”
FINITE-HORIZON COST EXPRESSIONS

• Consider an N -stage policy π0N = {µ0 , µ1 , . . . , µN −1 }


with a terminal cost J:
Jπ0N(x0) = E{ α^N J(xN) + Σ_{ℓ=0}^{N−1} α^ℓ g(xℓ, µℓ(xℓ), wℓ) }

         = E{ g(x0, µ0(x0), w0) + αJπ1N(x1) }

         = (Tµ0 Jπ1N)(x0)

where π1N = {µ1 , µ2 , . . . , µN −1 }


• By induction we have

Jπ0N (x) = (Tµ0 Tµ1 · · · TµN −1 J)(x), ∀x


• For a stationary policy µ the N -stage cost func-
tion (with terminal cost J) is
Jπ0N = TµN J

where TµN is the N -fold composition of Tµ


• Similarly the optimal N -stage cost function
(with terminal cost J) is T N J
• T N J = T (T N −1 J) is just the DP algorithm
“SHORTHAND” THEORY – A SUMMARY

• Infinite horizon cost function expressions [with


J0 (x) ≡ 0]

Jπ(x) = lim_{N→∞} (Tµ0 Tµ1 · · · TµN−1 J0)(x),   Jµ(x) = lim_{N→∞} (Tµ^N J0)(x)

• Bellman’s equation: J ∗ = T J ∗ , Jµ = Tµ Jµ
• Optimality condition:

µ: optimal <==> Tµ J ∗ = T J ∗

• Value iteration: For any (bounded) J

J*(x) = lim_{k→∞} (T^k J)(x),   ∀x

• Policy iteration: Given µk ,


− Policy evaluation: Find Jµk by solving

Jµk = Tµk Jµk

− Policy improvement: Find µk+1 such that

Tµk+1 Jµk = T Jµk


TWO KEY PROPERTIES

• Monotonicity property: For any J and J ′ such


that J(x) ≤ J ′ (x) for all x, and any µ

(T J)(x) ≤ (T J ′ )(x), ∀ x,

(Tµ J)(x) ≤ (Tµ J ′ )(x), ∀ x.

• Constant Shift property: For any J, any scalar


r, and any µ

T (J + re) (x) = (T J)(x) + αr, ∀ x,

Tµ (J + re) (x) = (Tµ J)(x) + αr, ∀ x,
where e is the unit function [e(x) ≡ 1].
• Monotonicity is present in all DP models (undis-
counted, etc)
• Constant shift is special to discounted models
• Discounted problems have another property
of major importance: T and Tµ are contraction
mappings (we will show this later)
CONVERGENCE OF VALUE ITERATION

• For all bounded J,

J*(x) = lim_{k→∞} (T^k J)(x),   for all x

Proof: For simplicity we give the proof for J ≡ 0.


For any initial state x0 , and policy π = {µ0 , µ1 , . . .},
Jπ(x0) = E{ Σ_{ℓ=0}^{∞} α^ℓ g(xℓ, µℓ(xℓ), wℓ) }

       = E{ Σ_{ℓ=0}^{k−1} α^ℓ g(xℓ, µℓ(xℓ), wℓ) } + E{ Σ_{ℓ=k}^{∞} α^ℓ g(xℓ, µℓ(xℓ), wℓ) }

The tail portion satisfies

| E{ Σ_{ℓ=k}^{∞} α^ℓ g(xℓ, µℓ(xℓ), wℓ) } | ≤ α^k M / (1 − α),

where M ≥ |g(x, u, w)|. Take min over π of both


sides, then lim as k → ∞. Q.E.D.
BELLMAN’S EQUATION

• The optimal cost function J ∗ is a solution of


Bellman’s equation, J ∗ = T J ∗ , i.e., for all x,
 
J*(x) = min_{u∈U(x)} E_w{ g(x, u, w) + αJ*( f(x, u, w) ) }

Proof: For all x and k,

J*(x) − α^k M/(1 − α) ≤ (T^k J0)(x) ≤ J*(x) + α^k M/(1 − α),

where J0 (x) ≡ 0 and M ≥ |g(x, u, w)|. Applying


T to this relation, and using Monotonicity and
Constant Shift,

(TJ*)(x) − α^{k+1} M/(1 − α) ≤ (T^{k+1} J0)(x) ≤ (TJ*)(x) + α^{k+1} M/(1 − α)

Taking the limit as k → ∞ and using the fact

lim_{k→∞} (T^{k+1} J0)(x) = J*(x)

we obtain J ∗ = T J ∗ . Q.E.D.
THE CONTRACTION PROPERTY

• Contraction property: For any bounded func-


tions J and J ′ , and any µ,

max_x |(TJ)(x) − (TJ′)(x)| ≤ α max_x |J(x) − J′(x)|,

max_x |(TµJ)(x) − (TµJ′)(x)| ≤ α max_x |J(x) − J′(x)|.

Proof: Denote c = max_x |J(x) − J′(x)|. Then

J(x) − c ≤ J′(x) ≤ J(x) + c,   ∀x

Apply T to both sides, and use the Monotonicity


and Constant Shift properties:

(T J)(x) − αc ≤ (T J ′ )(x) ≤ (T J)(x) + αc, ∀x

Hence

(T J)(x) − (T J ′ )(x) ≤ αc, ∀ x.

Q.E.D.
• Note: This implies that J ∗ is the unique solu-
tion of J ∗ = T J ∗ , and Jµ is the unique solution
of Jµ = Tµ Jµ
NEC. AND SUFFICIENT OPT. CONDITION

• A stationary policy µ is optimal if and only if


µ(x) attains the minimum in Bellman’s equation
for each x; i.e.,

T J ∗ = Tµ J ∗ ,
or, equivalently, for all x,
 
µ(x) ∈ arg min_{u∈U(x)} E_w{ g(x, u, w) + αJ*( f(x, u, w) ) }

Proof: If T J ∗ = Tµ J ∗ , then using Bellman’s equa-


tion (J ∗ = T J ∗ ), we have

J ∗ = Tµ J ∗ ,

so by uniqueness of the fixed point of Tµ , we obtain


J ∗ = Jµ ; i.e., µ is optimal.
• Conversely, if the stationary policy µ is optimal,
we have J ∗ = Jµ , so

J ∗ = Tµ J ∗ .

Combining this with Bellman’s Eq. (J ∗ = T J ∗ ),


we obtain T J ∗ = Tµ J ∗ . Q.E.D.
APPROXIMATE DYNAMIC PROGRAMMING

LECTURE 2

LECTURE OUTLINE

• Review of discounted problem theory


• Review of shorthand notation
• Algorithms for discounted DP
• Value iteration
• Various forms of policy iteration
• Optimistic policy iteration
• Q-factors and Q-learning
• Other DP models - Continuous space and time
• A more abstract view of DP
• Asynchronous algorithms
DISCOUNTED PROBLEMS/BOUNDED COST

• Stationary system with arbitrary state space

xk+1 = f (xk , uk , wk ), k = 0, 1, . . .

• Cost of a policy π = {µ0 , µ1 , . . .}


Jπ(x0) = lim_{N→∞} E_{wk, k=0,1,...}{ Σ_{k=0}^{N−1} α^k g(xk, µk(xk), wk) }

with α < 1, and for some M , we have |g(x, u, w)| ≤


M for all (x, u, w)
• Shorthand notation for DP mappings (operate
on functions of state to produce other functions)
 
(TJ)(x) = min_{u∈U(x)} E_w{ g(x, u, w) + αJ( f(x, u, w) ) },   ∀x

T J is the optimal cost function for the one-stage


problem with stage cost g and terminal cost αJ.
• For any stationary policy µ
  
(TµJ)(x) = E_w{ g(x, µ(x), w) + αJ( f(x, µ(x), w) ) },   ∀x
“SHORTHAND” THEORY – A SUMMARY

• Bellman’s equation: J ∗ = T J ∗ , Jµ = Tµ Jµ or
 
J*(x) = min_{u∈U(x)} E_w{ g(x, u, w) + αJ*( f(x, u, w) ) },   ∀x

Jµ(x) = E_w{ g(x, µ(x), w) + αJµ( f(x, µ(x), w) ) },   ∀x

• Optimality condition:

µ: optimal <==> Tµ J ∗ = T J ∗

i.e.,
 
µ(x) ∈ arg min_{u∈U(x)} E_w{ g(x, u, w) + αJ*( f(x, u, w) ) },   ∀x

• Value iteration: For any (bounded) J

J*(x) = lim_{k→∞} (T^k J)(x),   ∀x

• Policy iteration: Given µk ,


− Find Jµk from Jµk = Tµk Jµk (policy evalua-
tion); then
− Find µk+1 such that Tµk+1 Jµk = T Jµk (pol-
icy improvement)
MAJOR PROPERTIES

• Monotonicity property: For any functions J and


J ′ on the state space X such that J(x) ≤ J ′ (x)
for all x ∈ X, and any µ

(T J)(x) ≤ (T J ′ )(x), (Tµ J)(x) ≤ (Tµ J ′ )(x), ∀ x ∈ X

• Contraction property: For any bounded func-


tions J and J ′ , and any µ,

max_x |(TJ)(x) − (TJ′)(x)| ≤ α max_x |J(x) − J′(x)|,

max_x |(TµJ)(x) − (TµJ′)(x)| ≤ α max_x |J(x) − J′(x)|

• Compact Contraction Notation:

‖TJ − TJ′‖ ≤ α‖J − J′‖,   ‖TµJ − TµJ′‖ ≤ α‖J − J′‖,

where for any bounded function J, we denote by ‖J‖ the sup-norm

‖J‖ = max_x |J(x)|
THE TWO MAIN ALGORITHMS: VI AND PI

• Value iteration: For any (bounded) J

J*(x) = lim_{k→∞} (T^k J)(x),   ∀x

• Policy iteration: Given µk


− Policy evaluation: Find Jµk by solving

Jµk(x) = E_w{ g(x, µk(x), w) + αJµk( f(x, µk(x), w) ) },   ∀x

or Jµk = Tµk Jµk


− Policy improvement: Let µk+1 be such that
µk+1(x) ∈ arg min_{u∈U(x)} E_w{ g(x, u, w) + αJµk( f(x, u, w) ) },   ∀x

or Tµk+1 Jµk = T Jµk

• For the case of n states, policy evaluation is


equivalent to solving an n × n linear system of
equations: Jµ = gµ + αPµ Jµ
• For large n, exact PI is out of the question (even
though it terminates finitely as we will show)
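• A minimal Python sketch of exact VI and exact PI for a small n-state MDP (the model is represented by per-control transition and stage-cost matrices; an illustrative setup, not a specific problem from the lecture):

```python
import numpy as np

def bellman_T(J, P, g, alpha):
    """(TJ)(i) = min_u sum_j p_ij(u) (g(i,u,j) + alpha J(j)).
    P[u] and g[u] are n x n transition and stage-cost arrays for control u."""
    return np.min([np.sum(P[u] * (g[u] + alpha * J), axis=1) for u in range(len(P))],
                  axis=0)

def value_iteration(P, g, alpha, tol=1e-8):
    J = np.zeros(P[0].shape[0])
    while True:
        J_new = bellman_T(J, P, g, alpha)
        if np.max(np.abs(J_new - J)) < tol:
            return J_new
        J = J_new

def policy_iteration(P, g, alpha):
    n, m = P[0].shape[0], len(P)
    mu = np.zeros(n, dtype=int)
    while True:
        # policy evaluation: solve the n x n system J = g_mu + alpha P_mu J
        P_mu = np.array([P[mu[i]][i] for i in range(n)])
        g_mu = np.array([P[mu[i]][i] @ g[mu[i]][i] for i in range(n)])
        J = np.linalg.solve(np.eye(n) - alpha * P_mu, g_mu)
        # policy improvement
        Q = np.array([np.sum(P[u] * (g[u] + alpha * J), axis=1) for u in range(m)])
        mu_new = np.argmin(Q, axis=0)
        if np.array_equal(mu_new, mu):
            return J, mu
        mu = mu_new
```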
JUSTIFICATION OF POLICY ITERATION

• We can show that Jµk ≥ Jµk+1 for all k


• Proof: For given k, we have

Jµk = Tµk Jµk ≥ T Jµk = Tµk+1 Jµk

Using the monotonicity property of DP,

Jµk ≥ Tµk+1 Jµk ≥ (Tµk+1)² Jµk ≥ · · · ≥ lim_{N→∞} (Tµk+1)^N Jµk

• Since lim_{N→∞} (Tµk+1)^N Jµk = Jµk+1

we have Jµk ≥ Jµk+1 .


• If Jµk = Jµk+1 , all above inequalities hold
as equations, so Jµk solves Bellman’s equation.
Hence Jµk = J ∗
• Thus at iteration k either the algorithm gen-
erates a strictly improved policy or it finds an op-
timal policy
− For a finite spaces MDP, the algorithm ter-
minates with an optimal policy
− For infinite spaces MDP, convergence (in an
infinite number of iterations) can be shown
OPTIMISTIC POLICY ITERATION

• Optimistic PI: This is PI, where policy evalu-


ation is done approximately, with a finite number
of VI
• So we approximate the policy evaluation

Jµ ≈ Tµm J

for some number m ∈ [1, ∞) and initial J


• Shorthand definition: For some integers mk

Tµk Jk = TJk,   Jk+1 = (Tµk)^{mk} Jk,   k = 0, 1, . . .

• If mk ≡ 1 it becomes VI
• If mk = ∞ it becomes PI
• Converges for both finite and infinite spaces
discounted problems (in an infinite number of it-
erations)
• Typically works faster than VI and PI (for
large problems)
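• A minimal Python sketch of optimistic PI for the same small-MDP setup as in the VI/PI sketch above (the constant mk and iteration count are illustrative assumptions):

```python
import numpy as np

def optimistic_policy_iteration(P, g, alpha, m_k=5, num_iters=100):
    """Optimistic PI: mu_k attains the minimum in T J_k, then J_{k+1} is
    obtained by m_k value-iteration sweeps with T_mu_k (m_k = 1 gives VI)."""
    n, m = P[0].shape[0], len(P)
    J = np.zeros(n)
    for _ in range(num_iters):
        # policy improvement: T_mu_k J_k = T J_k
        Q = np.array([np.sum(P[u] * (g[u] + alpha * J), axis=1) for u in range(m)])
        mu = np.argmin(Q, axis=0)
        P_mu = np.array([P[mu[i]][i] for i in range(n)])
        g_mu = np.array([P[mu[i]][i] @ g[mu[i]][i] for i in range(n)])
        # approximate policy evaluation: J_{k+1} = (T_mu_k)^{m_k} J_k
        for _ in range(m_k):
            J = g_mu + alpha * (P_mu @ J)
    return J
```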
APPROXIMATE PI

• Suppose that the policy evaluation is approxi-


mate,

‖Jk − Jµk‖ ≤ δ,   k = 0, 1, . . .

and policy improvement is approximate,

‖Tµk+1 Jk − TJk‖ ≤ ε,   k = 0, 1, . . .

where δ and ε are some positive scalars.


• Error Bound I: The sequence {µk } generated
by approximate policy iteration satisfies

lim sup_{k→∞} ‖Jµk − J*‖ ≤ (ε + 2αδ)/(1 − α)²

• Typical practical behavior: The method makes


steady progress up to a point and then the iterates
Jµk oscillate within a neighborhood of J ∗ .
• Error Bound II: If in addition the sequence {µk }
“terminates” at µ (i.e., keeps generating µ)

‖Jµ − J*‖ ≤ (ε + 2αδ)/(1 − α)
Q-FACTORS I

• Optimal Q-factor of (x, u):

Q*(x, u) = E{ g(x, u, w) + αJ*(x̄) }

with x̄ = f(x, u, w). It is the cost of starting at x,
applying u in the 1st stage, and using an optimal policy
after the 1st stage
• We can write Bellman’s equation as

J*(x) = min_{u∈U(x)} Q*(x, u),   ∀ x,

• We can equivalently write the VI method as

Jk+1(x) = min_{u∈U(x)} Qk+1(x, u),   ∀ x,

where Qk+1 is generated by


 
Qk+1(x, u) = E{ g(x, u, w) + α min_{v∈U(x̄)} Qk(x̄, v) }

with x̄ = f(x, u, w)
Q-FACTORS II

• Q-factors are costs in an “augmented” problem


where states are (x, u)
• They satisfy a Bellman equation Q∗ = F Q∗
where
 
(FQ)(x, u) = E{ g(x, u, w) + α min_{v∈U(x̄)} Q(x̄, v) }

where x̄ = f(x, u, w)
• VI and PI for Q-factors are mathematically
equivalent to VI and PI for costs
• They require equal amount of computation ...
they just need more storage
• Having optimal Q-factors is convenient when
implementing an optimal policy on-line by

µ*(x) ∈ arg min_{u∈U(x)} Q*(x, u)

• Once Q∗ (x, u) are known, the model [g and


E{·}] is not needed. Model-free operation
• Q-Learning (to be discussed later) is a sampling
method that calculates Q∗ (x, u) using a simulator
of the system (no model needed)
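• A minimal Python sketch of VI in Q-factor form for a small n-state MDP (the model representation is illustrative; this is not the sampling-based Q-learning method discussed later):

```python
import numpy as np

def q_value_iteration(P, g, alpha, tol=1e-8):
    """Q_{k+1}(i,u) = sum_j p_ij(u) (g(i,u,j) + alpha min_v Q_k(j,v)).
    P[u] and g[u] are n x n arrays; returns the n x m array of optimal Q-factors."""
    n, m = P[0].shape[0], len(P)
    Q = np.zeros((n, m))
    while True:
        J = Q.min(axis=1)                               # J(j) = min_v Q(j, v)
        Q_new = np.array([np.sum(P[u] * (g[u] + alpha * J), axis=1)
                          for u in range(m)]).T
        if np.max(np.abs(Q_new - Q)) < tol:
            return Q_new
        Q = Q_new

# The greedy policy mu*(i) = argmin_u Q*(i, u) can then be applied model-free
```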
OTHER DP MODELS

• We have looked so far at the (discrete or con-


tinuous spaces) discounted models for which the
analysis is simplest and results are most powerful
• Other DP models include:
− Undiscounted problems (α = 1): They may
include a special termination state (stochas-
tic shortest path problems)
− Continuous-time finite-state MDP: The time
between transitions is random and state-and-
control-dependent (typical in queueing sys-
tems, called Semi-Markov MDP). These can
be viewed as discounted problems with state-
and-control-dependent discount factors
• Continuous-time, continuous-space models: Clas-
sical automatic control, process control, robotics
− Substantial differences from discrete-time
− Mathematically more complex theory (par-
ticularly for stochastic problems)
− Deterministic versions can be analyzed using
classical optimal control theory
− Admit treatment by DP, based on time dis-
cretization
CONTINUOUS-TIME MODELS

• System equation: dx(t)/dt = f( x(t), u(t) )
• Cost function: ∫_0^∞ g( x(t), u(t) ) dt
• Optimal cost starting from x: J ∗ (x)
• δ-Discretization of time: xk+1 = xk +δ·f (xk , uk )
• Bellman equation for the δ-discretized problem:
 
Jδ*(x) = min_u { δ·g(x, u) + Jδ*( x + δ·f(x, u) ) }

• Take δ → 0, to obtain the Hamilton-Jacobi-


Bellman equation [assuming limδ→0 Jδ∗ (x) = J ∗ (x)]

0 = min_u { g(x, u) + ∇J*(x)′ f(x, u) },   ∀x

• Policy Iteration (informally):


− Policy evaluation: Given current µ, solve
 
0 = g( x, µ(x) ) + ∇Jµ(x)′ f( x, µ(x) ),   ∀x

− Policy improvement: Find

µ̄(x) ∈ arg min_u { g(x, u) + ∇Jµ(x)′ f(x, u) },   ∀x

• Note: Need to learn ∇Jµ (x) NOT Jµ (x)


A MORE GENERAL/ABSTRACT VIEW OF DP

• Let Y be a real vector space with a norm ‖·‖

• A function F : Y → Y is said to be a contrac-
tion mapping if for some ρ ∈ (0, 1), we have

‖Fy − Fz‖ ≤ ρ‖y − z‖,   for all y, z ∈ Y.

ρ is called the modulus of contraction of F .


• Important example: Let X be a set (e.g., state
space in DP), v : X → ℜ be a positive-valued
function. Let B(X) be the set of all functions
J : X → ℜ such that J(x)/v(x) is bounded over x.
• We define a norm on B(X), called the weighted
sup-norm, by

‖J‖ = max_{x∈X} |J(x)| / v(x).

• Important special case: The discounted prob-


lem mappings T and Tµ [for v(x) ≡ 1, ρ = α].
CONTRACTION MAPPINGS: AN EXAMPLE

• Consider extension from finite to countable state


space, X = {1, 2, . . .}, and a weighted sup norm
with respect to which the one stage costs are bounded
• Suppose that Tµ has the form

(Tµ J)(i) = bi + α Σ_{j∈X} aij J(j),   ∀ i = 1, 2, . . .

where bi and aij are some scalars. Then Tµ is a
contraction with modulus ρ if and only if

Σ_{j∈X} |aij| v(j) / v(i) ≤ ρ,   ∀ i = 1, 2, . . .
• Consider T ,

(TJ)(i) = min_µ (Tµ J)(i),   ∀ i = 1, 2, . . .

where for each µ ∈ M , Tµ is a contraction map-


ping with modulus ρ. Then T is a contraction
mapping with modulus ρ
• Allows extensions of main DP results from
bounded one-stage cost to interesting unbounded
one-stage cost cases.
CONTRACTION MAPPING FIXED-POINT TH.

• Contraction Mapping Fixed-Point Theorem: If


F : B(X) → B(X) is a contraction with modulus
ρ ∈ (0, 1), then there exists a unique J* ∈ B(X)
such that
J* = FJ*.
Furthermore, if J is any function in B(X), then
{F^k J} converges to J* and we have

‖F^k J − J*‖ ≤ ρ^k ‖J − J*‖,   k = 1, 2, . . . .

• This is a special case of a general result for


contraction mappings F : Y 7→ Y over normed
vector spaces Y that are complete: every sequence
{yk } that is Cauchy (satisfies kym − yn k → 0 as
m, n → ∞) converges.
• The space B(X) is complete (see the text for a
proof).
ABSTRACT FORMS OF DP

• We consider an abstract form of DP based on


monotonicity and contraction
• Abstract Mapping: Denote R(X): set of real-
valued functions J : X → ℜ, and let H : X × U ×
R(X) → ℜ be a given mapping. We consider the
mapping

(TJ)(x) = min_{u∈U(x)} H(x, u, J),   ∀ x ∈ X.

• We assume that (T J)(x) > −∞ for all x ∈ X,


so T maps R(X) into R(X).
• Abstract Policies: Let M be the set of “poli-
cies”, i.e., functions µ such that µ(x) ∈ U (x) for
all x ∈ X.
• For each µ ∈ M, we consider the mapping
Tµ : R(X) → R(X) defined by

(Tµ J)(x) = H( x, µ(x), J ),   ∀ x ∈ X.

• Find a function J ∗ ∈ R(X) such that

J*(x) = min_{u∈U(x)} H(x, u, J*),   ∀ x ∈ X
EXAMPLES

• Discounted problems
 
H(x, u, J) = E{ g(x, u, w) + αJ( f(x, u, w) ) }

• Discounted “discrete-state continuous-time”


Semi-Markov Problems (e.g., queueing)
H(x, u, J) = G(x, u) + Σ_{y=1}^{n} mxy(u) J(y)
where mxy are “discounted” transition probabili-
ties, defined by the distribution of transition times
• Minimax Problems/Games
 
H(x, u, J) = max_{w∈W(x,u)} { g(x, u, w) + αJ( f(x, u, w) ) }

• Shortest Path Problems



H(x, u, J) = axu + J(u)   if u ≠ d,
H(x, u, J) = axd          if u = d

where d is the destination. There are stochastic


and minimax versions of this problem
ASSUMPTIONS

• Monotonicity: If J, J ′ ∈ R(X) and J ≤ J ′ ,

H(x, u, J) ≤ H(x, u, J ′ ), ∀ x ∈ X, u ∈ U (x)

• We can show all the standard analytical and


computational results of discounted DP if mono-
tonicity and the following assumption holds:
• Contraction:
− For every J ∈ B(X), the functions Tµ J and
T J belong to B(X)
− For some α ∈ (0, 1), and all µ and J, J ′ ∈
B(X), we have

kTµ J − Tµ J ′ k ≤ αkJ − J ′ k

• With just monotonicity assumption (as in undis-


counted problems) we can still show various forms
of the basic results under appropriate conditions
• A weaker substitute for contraction assumption
is semicontractiveness: (roughly) for some µ, Tµ
is a contraction and for others it is not; also the
“noncontractive” µ are not optimal
RESULTS USING CONTRACTION

• Proposition 1: The mappings Tµ and T are


weighted sup-norm contraction mappings with mod-
ulus α over B(X), and have unique fixed points
in B(X), denoted Jµ and J ∗ , respectively (cf.
Bellman’s equation).

Proof: From the contraction property of H.


• Proposition 2: For any J ∈ B(X) and µ ∈ M,

lim_{k→∞} Tµ^k J = Jµ,   lim_{k→∞} T^k J = J*

(cf. convergence of value iteration).

Proof: From the contraction property of Tµ and


T.
• Proposition 3: We have Tµ J ∗ = T J ∗ if and
only if Jµ = J ∗ (cf. optimality condition).

Proof: If Tµ J* = T J*, then Tµ J* = J*, implying


J ∗ = Jµ . Conversely, if Jµ = J ∗ , then Tµ J ∗ =
Tµ Jµ = Jµ = J ∗ = T J ∗ .
RESULTS USING MON. AND CONTRACTION

• Optimality of fixed point:

J*(x) = min_{µ∈M} Jµ(x),   ∀ x ∈ X

• Existence of a nearly optimal policy: For every


ǫ > 0, there exists µǫ ∈ M such that

J ∗ (x) ≤ Jµǫ (x) ≤ J ∗ (x) + ǫ, ∀x∈X

• Nonstationary policies: Consider the set Π of


all sequences π = {µ0 , µ1 , . . .} with µk ∈ M for
all k, and define

Jπ(x) = lim inf_{k→∞} (Tµ0 Tµ1 · · · Tµk J)(x),   ∀ x ∈ X,

with J being any function (the choice of J does


not matter)
• We have

J*(x) = min_{π∈Π} Jπ(x),   ∀ x ∈ X
THE TWO MAIN ALGORITHMS: VI AND PI

• Value iteration: For any (bounded) J

J*(x) = lim_{k→∞} (T^k J)(x),   ∀x

• Policy iteration: Given µk


− Policy evaluation: Find Jµk by solving

Jµk = Tµk Jµk

− Policy improvement: Find µk+1 such that

Tµk+1 Jµk = T Jµk

• Optimistic PI: This is PI, where policy evalu-


ation is carried out by a finite number of VI
− Shorthand definition: For some integers mk

Tµk Jk = TJk,   Jk+1 = (Tµk)^{mk} Jk,   k = 0, 1, . . .

− If mk ≡ 1 it becomes VI
− If mk = ∞ it becomes PI
− For intermediate values of mk , it is generally
more efficient than either VI or PI
ASYNCHRONOUS ALGORITHMS

• Motivation for asynchronous algorithms


− Faster convergence
− Parallel and distributed computation
− Simulation-based implementations
• General framework: Partition X into disjoint
nonempty subsets X1 , . . . , Xm , and use separate
processor ℓ updating J(x) for x ∈ Xℓ
• Let J be partitioned as

J = (J1 , . . . , Jm ),
where Jℓ is the restriction of J on the set Xℓ .
• Synchronous VI algorithm:

Jℓ^{t+1}(x) = T(J1^t, . . . , Jm^t)(x),   x ∈ Xℓ, ℓ = 1, . . . , m

• Asynchronous VI algorithm: For some subsets of times Rℓ,

Jℓ^{t+1}(x) = T(J1^{τℓ1(t)}, . . . , Jm^{τℓm(t)})(x)   if t ∈ Rℓ,
Jℓ^{t+1}(x) = Jℓ^t(x)   if t ∉ Rℓ

where t − τℓj(t) are communication “delays”


ONE-STATE-AT-A-TIME ITERATIONS

• Important special case: Assume n “states”, a


separate processor for each state, and no delays
• Generate a sequence of states {x0 , x1 , . . .}, gen-
erated in some way, possibly by simulation (each
state is generated infinitely often)
• Asynchronous VI:

Jℓ^{t+1} = T(J1^t, . . . , Jn^t)(ℓ)   if ℓ = xt,
Jℓ^{t+1} = Jℓ^t   if ℓ ≠ xt,

where T(J1^t, . . . , Jn^t)(ℓ) denotes the ℓ-th component of the vector

T(J1^t, . . . , Jn^t) = TJ^t

• The special case where

{x0 , x1 , . . .} = {1, . . . , n, 1, . . . , n, 1, . . .}

is the Gauss-Seidel method
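• A minimal Python sketch of one-state-at-a-time VI for a small MDP, with the state sequence supplied externally (an illustrative setup; no communication delays are modeled):

```python
import numpy as np

def asynchronous_vi(P, g, alpha, state_sequence):
    """One-state-at-a-time VI: at each step only component J(x_t) is replaced
    by (TJ)(x_t), using the latest values of the other components.
    P[u] and g[u] are n x n transition and stage-cost arrays for control u."""
    n, num_controls = P[0].shape[0], len(P)
    J = np.zeros(n)
    for i in state_sequence:
        J[i] = min(P[u][i] @ (g[u][i] + alpha * J) for u in range(num_controls))
    return J

# Cyclic order 1, ..., n, 1, ..., n, ... gives the Gauss-Seidel method, e.g.:
# J = asynchronous_vi(P, g, 0.9, list(range(n)) * 100)
```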


ASYNCHRONOUS CONV. THEOREM I

• KEY FACT: VI and also PI (with some modifi-


cations) still work when implemented asynchronously
• Assume that for all ℓ, j = 1, . . . , m, Rℓ is infinite
and limt→∞ τℓj (t) = ∞
• Proposition: Let T have a unique fixed point J ∗ ,
and assume
 that
there is a sequence of nonempty
subsets S(k) ⊂ R(X) with S(k + 1) ⊂ S(k) for
all k, and with the following properties:
(1) Synchronous Convergence Condition: Every
sequence {J k } with J k ∈ S(k) for each k,
converges pointwise to J ∗ . Moreover,

T J ∈ S(k+1), ∀ J ∈ S(k), k = 0, 1, . . . .
(2) Box Condition: For all k, S(k) is a Cartesian
product of the form

S(k) = S1 (k) × · · · × Sm (k),

where Sℓ (k) is a set of real-valued functions


on Xℓ , ℓ = 1, . . . , m.
Then for every J ∈ S(0), the sequence {J t } gen-
erated by the asynchronous algorithm converges
pointwise to J ∗ .
ASYNCHRONOUS CONV. THEOREM II

• Interpretation of assumptions:

[Figure: nested sets S(0) ⊃ S(k) ⊃ S(k + 1), each containing J* = (J1, J2), with TJ moving inward]

A synchronous iteration from any J in S(k) moves


into S(k + 1) (component-by-component)

• Convergence mechanism:

[Figure: component-wise iterations along J1 and J2, moving from S(0) through S(k), S(k + 1) toward J* = (J1, J2)]

Key: “Independent” component-wise improve-


ment. An asynchronous component iteration from
any J in S(k) moves into the corresponding com-
ponent portion of S(k + 1)
APPROXIMATE DYNAMIC PROGRAMMING

LECTURE 3

LECTURE OUTLINE

• Review of discounted DP
• Introduction to approximate DP
• Approximation architectures
• Simulation-based approximate policy iteration
• Approximate policy evaluation
• Some general issues about approximation and
simulation
REVIEW
DISCOUNTED PROBLEMS/BOUNDED COST

• Stationary system with arbitrary state space

xk+1 = f (xk , uk , wk ), k = 0, 1, . . .

• Cost of a policy π = {µ0 , µ1 , . . .}


Jπ(x0) = lim_{N→∞} E_{wk, k=0,1,...}{ Σ_{k=0}^{N−1} α^k g(xk, µk(xk), wk) }

with α < 1, and for some M , we have |g(x, u, w)| ≤


M for all (x, u, w)
• Shorthand notation for DP mappings (operate
on functions of state to produce other functions)
 
(TJ)(x) = min_{u∈U(x)} E_w{ g(x, u, w) + αJ( f(x, u, w) ) },   ∀x

T J is the optimal cost function for the one-stage


problem with stage cost g and terminal cost αJ
• For any stationary policy µ
  
(TµJ)(x) = E_w{ g(x, µ(x), w) + αJ( f(x, µ(x), w) ) },   ∀x
MDP - TRANSITION PROBABILITY NOTATION

• We will mostly assume the system is an n-state


(controlled) Markov chain
• We will often switch to Markov chain notation
− States i = 1, . . . , n (instead of x)
− Transition probabilities pik ik+1 (uk ) [instead
of xk+1 = f (xk , uk , wk )]
− Stage cost g(ik , uk , ik+1 ) [instead of g(xk , uk , wk )]

− Cost functions J = J(1), . . . , J(n) (vec-
tors in ℜn )
• Cost of a policy π = {µ0 , µ1 , . . .}
Jπ(i) = lim_{N→∞} E_{ik, k=1,2,...}{ Σ_{k=0}^{N−1} α^k g(ik, µk(ik), ik+1) | i0 = i }

• Shorthand notation for DP mappings


(TJ)(i) = min_{u∈U(i)} Σ_{j=1}^{n} pij(u)( g(i, u, j) + αJ(j) ),   i = 1, . . . , n,

(TµJ)(i) = Σ_{j=1}^{n} pij( µ(i) )( g(i, µ(i), j) + αJ(j) ),   i = 1, . . . , n
“SHORTHAND” THEORY – A SUMMARY

• Bellman’s equation: J ∗ = T J ∗ , Jµ = Tµ Jµ or
J*(i) = min_{u∈U(i)} Σ_{j=1}^{n} pij(u)( g(i, u, j) + αJ*(j) ),   ∀i

Jµ(i) = Σ_{j=1}^{n} pij( µ(i) )( g(i, µ(i), j) + αJµ(j) ),   ∀i

• Optimality condition:

µ: optimal <==> Tµ J ∗ = T J ∗

i.e.,
µ(i) ∈ arg min_{u∈U(i)} Σ_{j=1}^{n} pij(u)( g(i, u, j) + αJ*(j) ),   ∀i
THE TWO MAIN ALGORITHMS: VI AND PI

• Value iteration: For any J ∈ ℜn

J*(i) = lim_{k→∞} (T^k J)(i),   ∀ i = 1, . . . , n

• Policy iteration: Given µk


− Policy evaluation: Find Jµk by solving
Jµk(i) = Σ_{j=1}^{n} pij( µk(i) )( g(i, µk(i), j) + αJµk(j) ),   i = 1, . . . , n

or Jµk = Tµk Jµk


− Policy improvement: Let µk+1 be such that
µk+1(i) ∈ arg min_{u∈U(i)} Σ_{j=1}^{n} pij(u)( g(i, u, j) + αJµk(j) ),   ∀ i

or Tµk+1 Jµk = T Jµk


• Policy evaluation is equivalent to solving an
n × n linear system of equations
• For large n, exact PI is out of the question. We
use instead optimistic PI (policy evaluation with
a few VIs)
APPROXIMATE DP
GENERAL ORIENTATION TO ADP

• ADP (late 80s - present) is a breakthrough


methodology that allows the application of DP to
problems with many or infinite number of states.
• Other names for ADP are:
− “reinforcement learning” (RL).
− “neuro-dynamic programming” (NDP).
− “adaptive dynamic programming” (ADP).
• We will mainly adopt an n-state discounted
model (the easiest case - but think of HUGE n).
• Extensions to other DP models (continuous
space, continuous-time, not discounted) are possi-
ble (but more quirky). We will set aside for later.
• There are many approaches:
− Problem approximation
− Simulation-based approaches (we will focus
on these)
• Simulation-based methods are of three types:
− Rollout (we will not discuss further)
− Approximation in value space
− Approximation in policy space
WHY DO WE USE SIMULATION?

• One reason: Computational complexity advan-


tage in computing sums/expectations involving a
very large number of terms
− Any sum Σ_{i=1}^{n} ai can be written as an expected value:

Σ_{i=1}^{n} ai = Σ_{i=1}^{n} ξi (ai / ξi) = E_ξ{ ai / ξi },

where ξ is any prob. distribution over {1, . . . , n}

− It can be approximated by generating many
samples {i1 , . . . , ik } from {1, . . . , n}, accord-
ing to distribution ξ, and Monte Carlo aver-
aging:

Σ_{i=1}^{n} ai = E_ξ{ ai / ξi } ≈ (1/k) Σ_{t=1}^{k} a_{it} / ξ_{it}

• Simulation is also convenient when an analytical


model of the system is unavailable, but a simula-
tion/computer model is possible.
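• A minimal Python sketch of the Monte Carlo sum estimate above (the sizes, the uniform sampling distribution ξ, and the random data are illustrative assumptions):

```python
import numpy as np

def mc_sum_estimate(a, xi, num_samples, seed=0):
    """Estimate sum_i a_i = E_xi[a_I / xi_I] by sampling I ~ xi and averaging."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(a), size=num_samples, p=xi)
    return np.mean(a[idx] / xi[idx])

n = 100000
a = np.random.default_rng(1).uniform(size=n)
xi = np.full(n, 1.0 / n)                  # any distribution with xi_i > 0 works
print(a.sum(), mc_sum_estimate(a, xi, num_samples=5000))
```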
APPROXIMATION IN VALUE AND

POLICY SPACE
APPROXIMATION IN VALUE SPACE

• Approximate J* or Jµ from a parametric class
J̃(i; r), where i is the current state and r = (r1 , . . . , rm )
is a vector of “tunable” scalar weights
• Use J˜ in place of J ∗ or Jµ in various algorithms
and computations
• Role of r: By adjusting r we can change the
“shape” of J˜ so that it is “close” to J ∗ or Jµ
• Two key issues:
− The choice of the parametric class J̃(i; r) (the
approximation architecture)
− Method for tuning the weights (“training”
the architecture)
• Success depends strongly on how these issues
are handled ... also on insight about the problem
• A simulator may be used, particularly when
there is no mathematical model of the system (but
there is a computer model)
• We will focus on simulation, but this is not the
only possibility
• We may also use parametric approximation for
Q-factors or cost function differences
APPROXIMATION ARCHITECTURES

• Divided into linear and nonlinear [i.e., linear or
nonlinear dependence of J̃(i; r) on r]
• Linear architectures are easier to train, but non-
linear ones (e.g., neural networks) are richer
• Computer chess example:
− Think of board position as state and move
as control
− Uses a feature-based position evaluator that
assigns a score (or approximate Q-factor) to
each position/move

[Figure: position evaluator = feature extraction (material balance, mobility, safety, etc.) followed by feature weighting, producing a score]

• Relatively few special features and weights, and


multistep lookahead
LINEAR APPROXIMATION ARCHITECTURES

• Often, the features encode much of the nonlin-


earity inherent in the cost function approximated
• Then the approximation may be quite accurate
without a complicated architecture (as an extreme
example, the ideal feature is the true cost func-
tion)
• With well-chosen features, we can use a linear
architecture: J̃(i; r) = φ(i)′r, i = 1, . . . , n, or

J̃(r) = Φr = Σ_{j=1}^{s} Φj rj

Φ: the matrix whose rows are φ(i)′, i = 1, . . . , n;
Φj is the jth column of Φ

[Figure: state i → feature extraction mapping → feature vector φ(i) → linear cost approximator φ(i)′r]

• This is approximation on the subspace


S = {Φr | r ∈ ℜs }
spanned by the columns of Φ (basis functions)
• Many examples of feature types: Polynomial
approximation, radial basis functions, etc
ILLUSTRATIONS: POLYNOMIAL TYPE

• Polynomial Approximation, e.g., a quadratic


approximating function. Let the state be i =
(i1 , . . . , iq ) (i.e., have q “dimensions”) and define

φ0 (i) = 1, φk (i) = ik , φkm (i) = ik im , k, m = 1, . . . , q

Linear approximation architecture:

J̃(i; r) = r0 + Σ_{k=1}^{q} rk ik + Σ_{k=1}^{q} Σ_{m=k}^{q} rkm ik im,

where r has components r0 , rk , and rkm .


• Interpolation: A subset I of special/representative
states is selected, and the parameter vector r has
one component ri per state i ∈ I. The approxi-
mating function is

J̃(i; r) = ri,   i ∈ I,

J̃(i; r) = interpolation using the values at i ∈ I,   i ∉ I
For example, piecewise constant, piecewise linear,
more general polynomial interpolations.
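• A minimal Python sketch of the quadratic (polynomial) linear architecture above, mapping a q-dimensional state to features φ(i) and evaluating J̃(i; r) = φ(i)′r (the example state and weights are illustrative):

```python
import numpy as np

def quadratic_features(i):
    """phi(i) for a state i = (i_1, ..., i_q): a constant term, the linear
    terms i_k, and the quadratic terms i_k * i_m for k <= m."""
    i = np.asarray(i, dtype=float)
    quad = [i[k] * i[m] for k in range(len(i)) for m in range(k, len(i))]
    return np.concatenate(([1.0], i, quad))

def J_tilde(i, r):
    """Linear architecture J~(i; r) = phi(i)' r."""
    return quadratic_features(i) @ r

# For a 2-dimensional state, phi(i) has 1 + 2 + 3 = 6 components
print(J_tilde((3.0, 4.0), np.ones(6)))
```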
A DOMAIN SPECIFIC EXAMPLE

• Tetris game (used as testbed in competitions)


• J ∗ (i): optimal score starting from position i


• Number of states > 2^200 (for a 10 × 20 board)
• Success with just 22 features, readily recognized
by tetris players as capturing important aspects of
the board position (heights of columns, etc)
APPROX. PI - OPTION TO APPROX. Jµ OR Qµ

• Use simulation to approximate the cost Jµ of


the current policy µ
• Generate “improved” policy µ by minimizing in
(approx.) Bellman equation

[Figure: approximate policy iteration loop - approximate policy evaluation of the controlled system produces J̃µ(i, r); policy improvement generates the “improved” policy µ̄]

• Alternatively approximate the Q-factors of µ

[Figure: approximate policy iteration with Q-factors - approximate policy evaluation produces Q̃µ(i, u, r); policy improvement generates µ̄(i) = arg min_{u∈U(i)} Q̃µ(i, u, r)]
APPROXIMATING J ∗ OR Q∗

• Approximation of the optimal cost function J ∗


− Q-Learning: Use a simulation algorithm to
approximate the Q-factors
Q*(i, u) = g(i, u) + α Σ_{j=1}^{n} pij(u) J*(j);

and the optimal costs

J*(i) = min_{u∈U(i)} Q*(i, u)
− Bellman Error approach: Find r to
min_r E_i{ ( J̃(i; r) − (T J̃)(i; r) )² }

where Ei {·} is taken with respect to some


distribution over the states
− Approximate Linear Programming (we will
not discuss here)
• Q-learning can also be used with approxima-
tions
• Q-learning and Bellman error approach can also
be used for policy evaluation
APPROXIMATION IN POLICY SPACE

• A brief discussion; we will return to it later.


• Use parametrization µ(i; r) of policies with a
vector r = (r1 , . . . , rs ). Examples:
− Polynomial, e.g., µ(i; r) = r1 + r2 · i + r3 · i2
− Linear feature-based
µ(i; r) = φ1 (i) · r1 + φ2 (i) · r2
• Optimize the cost over r. For example:
− Each value of r defines a stationary policy,
˜ r).
with cost starting at state i denoted by J(i;
− Let (p1 , . . . , pn ) be some probability distri-
bution over the states, and minimize over r
Σ_{i=1}^{n} pi J̃(i; r)

− Use a random search, gradient, or other method


• A special case: The parameterization of the
policies is indirect, through a cost approximation
architecture Ĵ, i.e.,

µ(i; r) ∈ arg min_{u∈U(i)} Σ_{j=1}^{n} pij(u)( g(i, u, j) + αĴ(j; r) )
APPROXIMATE POLICY EVALUATION

METHODS
DIRECT POLICY EVALUATION

• Approximate the cost of the current policy by


using least squares and simulation-generated cost
samples
• Amounts to projection of Jµ onto the approxi-
mation subspace

[Figure: direct method - projection ΠJµ of the cost vector Jµ onto the subspace S = {Φr | r ∈ ℜ^s}]

• Solution by least squares methods


• Regular and optimistic policy iteration
• Nonlinear approximation architectures may also
be used
DIRECT EVALUATION BY SIMULATION

• Projection by Monte Carlo Simulation: Com-


pute the projection ΠJµ of Jµ on subspace S =
{Φr | r ∈ ℜs }, with respect to a weighted Eu-
clidean norm ‖·‖ξ
• Equivalently, find Φr*, where

r* = arg min_{r∈ℜ^s} ‖Φr − Jµ‖²ξ = arg min_{r∈ℜ^s} Σ_{i=1}^{n} ξi ( φ(i)′r − Jµ(i) )²

• Setting to 0 the gradient at r*,

r* = ( Σ_{i=1}^{n} ξi φ(i)φ(i)′ )^{-1} Σ_{i=1}^{n} ξi φ(i) Jµ(i)

• Generate samples (i1 , Jµ (i1 )), . . . , (ik , Jµ (ik ))
using distribution ξ
• Approximate by Monte Carlo the two “expected
values” with low-dimensional calculations

r̂k = ( Σ_{t=1}^{k} φ(it)φ(it)′ )^{-1} Σ_{t=1}^{k} φ(it) Jµ(it)

• Equivalent least squares alternative calculation:

r̂k = arg min_{r∈ℜ^s} Σ_{t=1}^{k} ( φ(it)′r − Jµ(it) )²
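• A minimal Python sketch of the least squares calculation above from simulation samples (the sample format and feature map φ are illustrative assumptions; in practice the values Jµ(it) would themselves be simulated cost samples):

```python
import numpy as np

def direct_evaluation(samples, phi):
    """Least squares fit r_hat = (sum_t phi phi')^{-1} sum_t phi * cost_sample,
    where samples is a list of pairs (i_t, cost sample approximating J_mu(i_t))."""
    s = phi(samples[0][0]).size
    A, b = np.zeros((s, s)), np.zeros(s)
    for i_t, cost_sample in samples:
        f = phi(i_t)
        A += np.outer(f, f)
        b += f * cost_sample
    return np.linalg.solve(A, b)      # assumes A is invertible (enough samples)
```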
INDIRECT POLICY EVALUATION

• An example: Galerkin approximation


• Solve the projected equation Φr = ΠTµ (Φr)
where Π is projection w/ respect to a suitable
weighted Euclidean norm

[Figure: direct method - projection ΠJµ of Jµ onto S = {Φr | r ∈ ℜ^s}; indirect method - solving the projected form of Bellman's equation, Φr = ΠTµ(Φr)]

• Solution methods that use simulation (to man-


age the calculation of Π)
− TD(λ): Stochastic iterative algorithm for solv-
ing Φr = ΠTµ (Φr)
− LSTD(λ): Solves a simulation-based approx-
imation w/ a standard solver
− LSPE(λ): A simulation-based form of pro-
jected value iteration; essentially
Φrk+1 = ΠTµ (Φrk ) + simulation noise
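• A minimal Python sketch in the spirit of LSTD with λ = 0: form sample-based estimates of the projected-equation quantities from transitions generated under µ, and solve the resulting low-dimensional linear system (the sample format and feature map are illustrative assumptions):

```python
import numpy as np

def lstd_0(transitions, phi, alpha):
    """LSTD(0) sketch: build a simulation-based approximation of the projected
    equation Phi r = Pi T_mu(Phi r) from transitions (i_t, j_t, g_t) generated
    under the policy mu, and solve the resulting s x s linear system."""
    s = phi(transitions[0][0]).size
    C, d = np.zeros((s, s)), np.zeros(s)
    for i_t, j_t, g_t in transitions:
        C += np.outer(phi(i_t), phi(i_t) - alpha * phi(j_t))
        d += phi(i_t) * g_t
    return np.linalg.solve(C, d)      # assumes C is invertible (enough samples)

# J~_mu(i) = phi(i)' r with r = lstd_0(transitions, phi, alpha)
```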
BELLMAN EQUATION ERROR METHODS

• Another example of indirect approximate policy


evaluation:
min_r ‖Φr − Tµ(Φr)‖²ξ     (*)

where ‖·‖ξ is Euclidean norm, weighted with re-
spect to some distribution ξ
• It is closely related to the projected equation/Galerkin
approach (with a special choice of projection norm)
• Several ways to implement projected equation
and Bellman error methods by simulation. They
involve:
− Generating many random samples of states
ik using the distribution ξ
− Generating many samples of transitions (ik , jk )
using the policy µ
− Form a simulation-based approximation of
the optimality condition for projection prob-
lem or problem (*) (use sample averages in
place of inner products)
− Solve the Monte-Carlo approximation of the
optimality condition
• Issues for indirect methods: How to generate
the samples? How to calculate r∗ efficiently?
ANOTHER INDIRECT METHOD: AGGREGATION

• A first idea: Group similar states together into


“aggregate states” x1 , . . . , xs ; assign a common
cost value ri to each group xi .
• Solve an “aggregate” DP problem, involving the
aggregate states, to obtain r = (r1 , . . . , rs ). This
is called hard aggregation

[Figure: hard aggregation example - the states 1, . . . , 9 are grouped into aggregate states x1, x2, x3, x4; Φ is the corresponding 9 × 4 matrix of 0/1 membership indicators, with Φ(i, m) = 1 if state i belongs to aggregate state xm]

• More general/mathematical view: Solve

Φr = ΦDTµ (Φr)

where the rows of D and Φ are prob. distributions


(e.g., D and Φ “aggregate” rows and columns of
the linear system J = Tµ J)
• Compare with projected equation Φr = ΠTµ (Φr).
Note: ΦD is a projection in some interesting cases
AGGREGATION AS PROBLEM APPROXIMATION
[Figure: aggregation framework - original system states i, j with transition probabilities pij(u) and cost g(i, u, j); disaggregation probabilities dxi and aggregation probabilities φjy connect the aggregate states x, y with the original system states]

p̂xy(u) = Σ_{i=1}^{n} Σ_{j=1}^{n} dxi pij(u) φjy,

ĝ(x, u) = Σ_{i=1}^{n} Σ_{j=1}^{n} dxi pij(u) g(i, u, j)

• Aggregation can be viewed as a systematic


approach for problem approximation. Main ele-
ments:
− Solve (exactly or approximately) the “ag-
gregate” problem by any kind of VI or PI
method (including simulation-based methods)
− Use the optimal cost of the aggregate prob-
lem to approximate the optimal cost of the
original problem
• Because an exact PI algorithm is used to solve
the approximate/aggregate problem the method
behaves more regularly than the projected equa-
tion approach
APPROXIMATE POLICY ITERATION

ISSUES
THEORETICAL BASIS OF APPROXIMATE PI

• If policies are approximately evaluated using an


approximation architecture such that

max_i |J̃(i, rk) − Jµk(i)| ≤ δ,   k = 0, 1, . . .

• If policy improvement is also approximate,

max_i |(Tµk+1 J̃)(i, rk) − (T J̃)(i, rk)| ≤ ε,   k = 0, 1, . . .

• Error bound: The sequence {µk } generated by


approximate policy iteration satisfies

lim sup_{k→∞} max_i |Jµk(i) − J*(i)| ≤ (ε + 2αδ)/(1 − α)²

• Typical practical behavior: The method makes


steady progress up to a point and then the iterates
Jµk oscillate within a neighborhood of J ∗ .
• Oscillations are quite unpredictable.
− Some bad examples of oscillations have been
constructed.
− In practice oscillations between policies are
probably not the major concern.
THE ISSUE OF EXPLORATION

• To evaluate a policy µ, we need to generate cost


samples using that policy - this biases the simula-
tion by underrepresenting states that are unlikely
to occur under µ
• Cost-to-go estimates of underrepresented states
may be highly inaccurate
• This seriously impacts the improved policy µ
• This is known as inadequate exploration - a
particularly acute difficulty when the randomness
embodied in the transition probabilities is “rela-
tively small” (e.g., a deterministic system)
• Some remedies:
− Frequently restart the simulation and ensure
that the initial states employed form a rich
and representative subset
− Occasionally generate transitions that use a
randomly selected control rather than the
one dictated by the policy µ
− Other methods: Use two Markov chains (one
is the chain of the policy and is used to gen-
erate the transition sequence, the other is
used to generate the state sequence).
APPROXIMATING Q-FACTORS

• Given J̃(i; r), policy improvement requires a
model [knowledge of pij(u) for all controls u ∈
U(i)]
• Model-free alternative: Approximate Q-factors

Q̃(i, u; r) ≈ Σ_{j=1}^{n} pij(u)( g(i, u, j) + αJµ(j) )

and use for policy improvement the minimization


µ(i) ∈ arg min Q̃(i, u; r)
u∈U (i)

• r is an adjustable parameter vector and Q̃(i, u; r)


is a parametric architecture, such as

Q̃(i, u; r) = Σ_{m=1}^{s} rm φm(i, u)

• We can adapt any of the cost approximation


approaches, e.g., projected equations, aggregation
• Use the Markov chain with states (i, u), so
pij (µ(i)) is the transition prob. to (j, µ(i)), 0 to
other (j, u′ )
• Major concern: Acutely diminished exploration
SOME GENERAL ISSUES
STOCHASTIC ALGORITHMS: GENERALITIES

• Consider solution of a linear equation x = b +


Ax by using m simulation samples b + wk and
A + Wk , k = 1, . . . , m, where wk , Wk are random,
e.g., “simulation noise”
• Think of x = b + Ax as approximate policy
evaluation (projected or aggregation equations)
• Stoch. approx. (SA) approach: For k = 1, . . . , m

xk+1 = (1 − γk)xk + γk( (b + wk) + (A + Wk)xk )

• Monte Carlo estimation (MCE) approach: Form


Monte Carlo estimates of b and A
bm = (1/m) Σ_{k=1}^{m} (b + wk),   Am = (1/m) Σ_{k=1}^{m} (A + Wk)

Then solve x = bm + Am x by matrix inversion

xm = (I − Am)^{-1} bm

or iteratively
• TD(λ) and Q-learning are SA methods
• LSTD(λ) and LSPE(λ) are MCE methods
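• A minimal Python sketch contrasting the SA and MCE approaches on a small instance of x = b + Ax (the problem data, noise model, and step size γk = 1/k are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, noise = 4, 20000, 0.1
A = 0.5 * rng.uniform(size=(n, n)) / n              # row sums <= 0.5, so I - A is invertible
b = rng.uniform(size=n)
x_exact = np.linalg.solve(np.eye(n) - A, b)

# Stochastic approximation (SA): x_{k+1} = (1 - g_k) x_k + g_k ((b + w_k) + (A + W_k) x_k)
x_sa = np.zeros(n)
for k in range(1, m + 1):
    w_k = noise * rng.standard_normal(n)
    W_k = noise * rng.standard_normal((n, n))
    gamma = 1.0 / k
    x_sa = (1 - gamma) * x_sa + gamma * ((b + w_k) + (A + W_k) @ x_sa)

# Monte Carlo estimation (MCE): average the noisy samples of b and A, then solve
b_m = b + noise * rng.standard_normal((m, n)).mean(axis=0)
A_m = A + noise * rng.standard_normal((m, n, n)).mean(axis=0)
x_mce = np.linalg.solve(np.eye(n) - A_m, b_m)

print(x_exact, x_sa, x_mce, sep="\n")
```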
COSTS OR COST DIFFERENCES?

• Consider the exact policy improvement process.


To compare two controls u and u′ at x, we need

E{ g(x, u, w) − g(x, u′, w) + α( Jµ(x̄) − Jµ(x̄′) ) }

where x̄ = f(x, u, w) and x̄′ = f(x, u′, w)

• Approximate Jµ(x) or

Dµ(x, x′) = Jµ(x) − Jµ(x′)?

• Approximating Dµ (x, x′ ) avoids “noise differ-


encing”. This can make a big difference
• Important point: Dµ satisfies a Bellman equa-
tion for a system with “state” (x, x′)

Dµ(x, x′) = E{ Gµ(x, x′, w) + αDµ(x̄, x̄′) }

where x̄ = f( x, µ(x), w ), x̄′ = f( x′, µ(x′), w ), and

Gµ(x, x′, w) = g( x, µ(x), w ) − g( x′, µ(x′), w )

• Dµ can be “learned” by the standard methods


(TD, LSTD, LSPE, Bellman error, aggregation,
etc). This is known as differential training.
AN EXAMPLE (FROM THE NDP TEXT)

• System and cost per stage:

xk+1 = xk + δuk , g(x, u) = δ(x2 + u2 )

δ > 0 is very small; think of discretization of


continuous-time problem involving dx(t)/dt = u(t)
• Consider policy µ(x) = −2x. Its cost function is

Jµ(x) = (5x²/4)(1 + δ) + O(δ²)

and its Q-factor is

Qµ(x, u) = 5x²/4 + δ( 9x²/4 + u² + (5/2)xu ) + O(δ²)

• The important part for policy improvement is


 
δ( u² + (5/2)xu )

When Jµ(x) [or Qµ(x, u)] is approximated by
J̃µ(x; r) [or by Q̃µ(x, u; r)], it will be dominated
by the 5x²/4 term and will be “lost”
6.231 DYNAMIC PROGRAMMING

LECTURE 4

LECTURE OUTLINE

• Review of approximation in value space


• Approximate VI and PI
• Projected Bellman equations
• Matrix form of the projected equation
• Simulation-based implementation
• LSTD and LSPE methods
• Optimistic versions
• Multistep projected Bellman equations
• Bias-variance tradeoff
REVIEW
DISCOUNTED MDP

• System: Controlled Markov chain with states


i = 1, . . . , n, and finite control set U (i) at state i
• Transition probabilities: pij (u)

[Figure: two states i and j with transition probabilities pii(u), pij(u), pji(u), pjj(u)]

• Cost of a policy π = {µ0 , µ1 , . . .} starting at


state i:
Jπ(i) = lim_{N→∞} E{ Σ_{k=0}^{N−1} α^k g(ik, µk(ik), ik+1) | i0 = i }

with α ∈ [0, 1)
• Shorthand notation for DP mappings
(TJ)(i) = min_{u∈U(i)} Σ_{j=1}^{n} pij(u)( g(i, u, j) + αJ(j) ),   i = 1, . . . , n,

(TµJ)(i) = Σ_{j=1}^{n} pij( µ(i) )( g(i, µ(i), j) + αJ(j) ),   i = 1, . . . , n
“SHORTHAND” THEORY – A SUMMARY

• Bellman’s equation: J ∗ = T J ∗ , Jµ = Tµ Jµ or
J*(i) = min_{u∈U(i)} Σ_{j=1}^{n} pij(u)( g(i, u, j) + αJ*(j) ),   ∀i

Jµ(i) = Σ_{j=1}^{n} pij( µ(i) )( g(i, µ(i), j) + αJµ(j) ),   ∀i

• Optimality condition:

µ: optimal <==> Tµ J ∗ = T J ∗

i.e.,
µ(i) ∈ arg min_{u∈U(i)} Σ_{j=1}^{n} pij(u)( g(i, u, j) + αJ*(j) ),   ∀i
THE TWO MAIN ALGORITHMS: VI AND PI

• Value iteration: For any J ∈ ℜn

J*(i) = lim_{k→∞} (T^k J)(i),   ∀ i = 1, . . . , n

• Policy iteration: Given µk


− Policy evaluation: Find Jµk by solving
Jµk(i) = Σ_{j=1}^{n} pij( µk(i) )( g(i, µk(i), j) + αJµk(j) ),   i = 1, . . . , n

or Jµk = Tµk Jµk


− Policy improvement: Let µk+1 be such that
µk+1(i) ∈ arg min_{u∈U(i)} Σ_{j=1}^{n} pij(u)( g(i, u, j) + αJµk(j) ),   ∀ i

or Tµk+1 Jµk = T Jµk


• Policy evaluation is equivalent to solving an
n × n linear system of equations
• For large n, exact PI is out of the question
(even though it terminates finitely)
APPROXIMATION IN VALUE SPACE

• Approximate J* or Jµ from a parametric class
J̃(i; r), where i is the current state and r = (r1 , . . . , rs )
is a vector of “tunable” scalar weights
• Think n: HUGE, s: (Relatively) SMALL
• Many types of approximation architectures [i.e.,
parametric classes J̃(i; r)] to select from
• Any r ∈ ℜs defines a (suboptimal) one-step
lookahead policy
µ̃(i) = arg min_{u∈U(i)} Σ_{j=1}^{n} pij(u)( g(i, u, j) + αJ̃(j; r) ),   ∀ i

• We want to find a “good” r


• We will focus mostly on linear architectures

˜ = Φr
J(r)

where Φ is an n × s matrix whose columns are


viewed as basis functions
LINEAR APPROXIMATION ARCHITECTURES

• We have

J̃(i; r) = φ(i)′r,   i = 1, . . . , n

where φ(i)′, i = 1, . . . , n, is the ith row of Φ, or

J̃(r) = Φr = Σ_{j=1}^{s} Φj rj

where Φj is the jth column of Φ

[Figure: state i → feature extraction mapping → feature vector φ(i) → linear cost approximator φ(i)′r]

• This is approximation on the subspace


S = {Φr | r ∈ ℜs }
spanned by the columns of Φ (basis functions)
• Many examples of feature types: Polynomial
approximation, radial basis functions, etc
• Instead of computing Jµ or J ∗ , which is huge-
dimensional, we compute the low-dimensional r =
(r1 , . . . , rs ) using low-dimensional calculations
APPROXIMATE VALUE ITERATION
APPROXIMATE (FITTED) VI

• Approximates sequentially Jk (i) = (T k J0 )(i),


k = 1, 2, . . ., with J˜k (i; rk )
• The starting function J0 is given (e.g., J0 ≡ 0)
• Approximate (Fitted) Value Iteration: A se-
quential “fit” to produce J˜k+1 from J˜k , i.e., J˜k+1 ≈
T J˜k or (for a single policy µ) J˜k+1 ≈ Tµ J˜k

[Figure: fitted value iteration - the iterates J̃0, T J̃0, J̃1, T J̃1, J̃2, T J̃2, J̃3 relative to the subspace S = {Φr | r ∈ ℜ^s}]

• After a large enough number N of steps, J̃N(i; rN)
is used as the approximation J̃(i; r) to J*(i)
• Possibly use (approximate) projection Π with
respect to some projection norm,

J̃k+1 ≈ ΠT J̃k
WEIGHTED EUCLIDEAN PROJECTIONS

• Consider a weighted Euclidean norm


‖J‖ξ = ( Σ_{i=1}^{n} ξi ( J(i) )² )^{1/2},

where ξ = (ξ1 , . . . , ξn ) is a positive distribution


(ξi > 0 for all i).
• Let Π denote the projection operation onto

S = {Φr | r ∈ ℜs }

with respect to this norm, i.e., for any J ∈ ℜn ,

ΠJ = Φr∗

where
r* = arg min_{r∈ℜ^s} ‖Φr − J‖²ξ

• Recall that weighted Euclidean projection can


be implemented by simulation and least squares,
i.e., sampling J(i) according to ξ and solving
min_{r∈ℜ^s} Σ_{t=1}^{k} ( φ(it)′r − J(it) )²
FITTED VI - NAIVE IMPLEMENTATION

• Select/sample a “small” subset Ik of represen-


tative states
• For each i ∈ Ik , given J˜k , compute
(T J̃k)(i) = min_{u∈U(i)} Σ_{j=1}^{n} pij(u)( g(i, u, j) + αJ̃k(j; r) )

• “Fit” the function J˜k+1 (i; rk+1 ) to the “small”


set of values (T J˜k )(i), i ∈ Ik (for example use
some form of approximate projection)
• Simulation can be used for “model-free” imple-
mentation
• Error Bound: If the fit is uniformly accurate
within δ > 0, i.e.,

max_i |J̃k+1(i) − (T J̃k)(i)| ≤ δ,

then

lim sup_{k→∞} max_{i=1,...,n} |J̃k(i, rk) − J*(i)| ≤ 2αδ/(1 − α)²

• But there is a potential problem!


AN EXAMPLE OF FAILURE

• Consider two-state discounted MDP with states


1 and 2, and a single policy.
− Deterministic transitions: 1 → 2 and 2 → 2
− Transition costs ≡ 0, so J ∗ (1) = J ∗ (2) = 0.
• Consider (exact) fitted VI scheme that approx-
imates cost functions within S = { (r, 2r) | r ∈ ℜ }
with a weighted least squares fit; here Φ = (1, 2)′
• Given J̃k = (rk, 2rk), we find J̃k+1 = (rk+1, 2rk+1),
where J̃k+1 = Πξ(T J̃k), with weights ξ = (ξ1, ξ2):

rk+1 = arg min_r [ ξ1( r − (T J̃k)(1) )² + ξ2( 2r − (T J̃k)(2) )² ]

• With straightforward calculation

rk+1 = αβrk , where β = 2(ξ1 +2ξ2 )/(ξ1 +4ξ2 ) > 1

• So if α > 1/β (e.g., ξ1 = ξ2 = 1), the sequence


{rk } diverges and so does {J˜k }.
• Difficulty is that T is a contraction, but Πξ T
(= least squares fit composed with T ) is not.
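• A minimal Python sketch reproducing the divergence in this example numerically (with α = 0.9 and ξ1 = ξ2 = 1, so αβ = 1.08 and rk grows geometrically):

```python
import numpy as np

alpha, xi = 0.9, np.array([1.0, 1.0])      # discount factor and fit weights (xi1 = xi2 = 1)
Phi = np.array([[1.0], [2.0]])             # subspace S = {(r, 2r) | r real}
P = np.array([[0.0, 1.0], [0.0, 1.0]])     # deterministic transitions 1 -> 2 and 2 -> 2
g = np.zeros(2)                            # all transition costs are 0, so J* = (0, 0)

r = 1.0
for k in range(10):
    J = (Phi * r).ravel()                  # J~_k = (r_k, 2 r_k)
    TJ = g + alpha * (P @ J)               # T J~_k = (2 alpha r_k, 2 alpha r_k)
    # weighted least squares fit of T J~_k within S:
    # minimize xi1 (r - TJ(1))^2 + xi2 (2r - TJ(2))^2
    W = np.diag(xi)
    r = np.linalg.solve(Phi.T @ W @ Phi, Phi.T @ W @ TJ).item()
    print(k, r)                            # r grows by the factor alpha*beta = 1.08 per step
```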
NORM MISMATCH PROBLEM

• For the method to converge, we need Πξ T to


be a contraction; the contraction property of T is
not enough

[Figure: fitted value iteration with projection - J̃1 = Πξ(T J̃0), J̃2 = Πξ(T J̃1), J̃3 = Πξ(T J̃2) relative to the subspace S = {Φr | r ∈ ℜ^s}]

• We need a vector of weights ξ such that T is


a contraction with respect to the weighted Eu-
clidean norm k · kξ
• Then we can show that Πξ T is a contraction
with respect to k · kξ
• We will come back to this issue
APPROXIMATE POLICY ITERATION
APPROXIMATE PI

[Figure: approximate policy iteration loop - approximate policy evaluation of the controlled system produces J̃µ(i, r); policy improvement generates the “improved” policy µ̄]

• Evaluation of typical policy µ: Linear cost func-


tion approximation J˜µ (r) = Φr, where Φ is full
rank n × s matrix with columns the basis func-
tions, and ith row denoted φ(i)′ .
• Policy “improvement” to generate µ:
      µ(i) = arg min_{u∈U (i)} Σ_{j=1}^{n} pij (u)( g(i, u, j) + αφ(j)′ r )

• Error Bound (same as approximate VI): If

      max_i |J˜µk (i, rk ) − Jµk (i)| ≤ δ,   k = 0, 1, . . .

the sequence {µk } satisfies

      lim sup_{k→∞}  max_i ( Jµk (i) − J ∗ (i) ) ≤ 2αδ / (1 − α)^2
POLICY EVALUATION

• Let’s consider approximate evaluation of the


cost of the current policy by using simulation.
− Direct policy evaluation - Cost samples gen-
erated by simulation, and optimization by
least squares
− Indirect policy evaluation - solving the pro-
jected equation Φr = ΠTµ (Φr) where Π is
projection w/ respect to a suitable weighted
Euclidean norm

[Figure: Direct method — projection ΠJµ of the cost vector Jµ onto S; Indirect method — solving a projected form of Bellman’s equation, Φr = ΠTµ (Φr), in the subspace S = {Φr | r ∈ ℜs }]

• Recall that projection can be implemented by


simulation and least squares
PI WITH INDIRECT POLICY EVALUATION

[Figure: Approximate Policy Iteration loop — approximate policy evaluation produces J˜µ (i, r); policy improvement generates the “improved” policy µ]

• Given the current policy µ:


− We solve the projected Bellman’s equation

Φr = ΠTµ (Φr)

− We approximate the solution Jµ of Bellman’s


equation
J = Tµ J
with the projected equation solution J˜µ (r)
KEY QUESTIONS AND RESULTS

• Does the projected equation have a solution?


• Under what conditions is the mapping ΠTµ a
contraction, so ΠTµ has unique fixed point?
• Assumption: The Markov chain corresponding
to µ has a single recurrent class and no transient
states, i.e., it has steady-state probabilities that
are positive
      ξj = lim_{N→∞} (1/N) Σ_{k=1}^{N} P (ik = j | i0 = i) > 0

Note that ξj is the long-term frequency of state j.


• Proposition: (Norm Matching Property) As-
sume that the projection Π is with respect to k·kξ ,
where ξ = (ξ1 , . . . , ξn ) is the steady-state proba-
bility vector. Then:
(a) ΠTµ is a contraction of modulus α with respect to k · kξ .
(b) The unique fixed point Φr∗ of ΠTµ satisfies

      kJµ − Φr∗ kξ ≤ (1/√(1 − α^2)) kJµ − ΠJµ kξ
PRELIMINARIES: PROJECTION PROPERTIES

• Important property of the projection Π on S


with weighted Euclidean norm k · kξ . For all J ∈
ℜn , Φr ∈ S, the Pythagorean Theorem holds:

kJ − Φrk2ξ = kJ − ΠJk2ξ + kΠJ − Φrk2ξ

[Figure: the vector J, its projection ΠJ, and a generic Φr in the subspace S = {Φr | r ∈ ℜs }]

• The Pythagorean Theorem implies that the pro-


jection is nonexpansive, i.e.,

      kΠJ − ΠJ¯kξ ≤ kJ − J¯kξ ,   for all J, J¯ ∈ ℜn .

To see this, note that

      kΠ(J − J¯)k2ξ ≤ kΠ(J − J¯)k2ξ + k(I − Π)(J − J¯)k2ξ = kJ − J¯k2ξ
PROOF OF CONTRACTION PROPERTY

• Lemma: If P is the transition matrix of µ,


kP zkξ ≤ kzkξ , z ∈ ℜn
Proof: Let pij be the components of P . For all
z ∈ ℜn , we have
      kP zk2ξ = Σ_{i=1}^{n} ξi ( Σ_{j=1}^{n} pij zj )^2 ≤ Σ_{i=1}^{n} ξi Σ_{j=1}^{n} pij zj^2
              = Σ_{j=1}^{n} ( Σ_{i=1}^{n} ξi pij ) zj^2 = Σ_{j=1}^{n} ξj zj^2 = kzk2ξ ,

where the inequality follows from the convexity of the quadratic function, and the next-to-last equality follows from the defining property Σ_{i=1}^{n} ξi pij = ξj of the steady-state probabilities.
• Using the lemma, the nonexpansiveness of Π,
and the definition Tµ J = g + αP J, we have

kΠTµ J−ΠTµ J̄kξ ≤ kTµ J−Tµ J̄kξ = αkP (J−J̄ )kξ ≤ αkJ−J̄ kξ

for all J, J¯ ∈ ℜn . Hence ΠTµ is a contraction of


modulus α.
PROOF OF ERROR BOUND

• Let Φr∗ be the fixed point of ΠT . We have

      kJµ − Φr∗ kξ ≤ (1/√(1 − α^2)) kJµ − ΠJµ kξ .

Proof: We have
      kJµ − Φr∗ k2ξ = kJµ − ΠJµ k2ξ + kΠJµ − Φr∗ k2ξ
                    = kJµ − ΠJµ k2ξ + kΠT Jµ − ΠT (Φr∗ )k2ξ
                    ≤ kJµ − ΠJµ k2ξ + α2 kJµ − Φr∗ k2ξ ,

where
− The first equality uses the Pythagorean The-
orem
− The second equality holds because Jµ is the
fixed point of T and Φr∗ is the fixed point
of ΠT
− The inequality uses the contraction property
of ΠT .
Q.E.D.
SIMULATION-BASED SOLUTION OF

PROJECTED EQUATION
MATRIX FORM OF PROJECTED EQUATION

[Figure: the mapping Tµ (Φr) = g + αP Φr and the projected fixed point Φr∗ = Πξ Tµ (Φr∗ ) in the subspace S = {Φr | r ∈ ℜs }]

• The solution Φr∗ satisfies the orthogonality con-


dition: The error
Φr∗ − (g + αP Φr∗ )
is “orthogonal” to the subspace spanned by the
columns of Φ.
• This is written as

      Φ′ Ξ( Φr∗ − (g + αP Φr∗ ) ) = 0,
where Ξ is the diagonal matrix with the steady-
state probabilities ξ1 , . . . , ξn along the diagonal.
• Equivalently, Cr∗ = d, where

C = Φ′ Ξ(I − αP )Φ, d = Φ′ Ξg
but computing C and d is HARD (high-dimensional
inner products).
SOLUTION OF PROJECTED EQUATION

• Solve Cr∗ = d by matrix inversion: r∗ = C −1 d


• Projected Value Iteration (PVI) method:
Φrk+1 = ΠT (Φrk ) = Π(g + αP Φrk )
Converges to r∗ because ΠT is a contraction.
[Figure: Projected Value Iteration — the value iterate T (Φrk ) = g + αP Φrk is projected onto S (the subspace spanned by the basis functions) to obtain Φrk+1 ]

• PVI can be written as:


      rk+1 = arg min_{r∈ℜs} kΦr − (g + αP Φrk )k2ξ

By setting to 0 the gradient with respect to r,

      Φ′ Ξ( Φrk+1 − (g + αP Φrk ) ) = 0,

which yields
rk+1 = rk − (Φ′ ΞΦ)−1 (Crk − d)
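
For a small synthetic example where P , g, Φ, and ξ are available explicitly, both the matrix-inversion solution and the PVI recursion can be written in a few lines (an illustrative sketch under these assumptions, not the lectures' code):

import numpy as np

# Model-based solution of the projected equation C r = d for a single policy,
# plus the PVI recursion.  All problem data are synthetic placeholders.
rng = np.random.default_rng(2)
n, s, alpha = 60, 4, 0.95
P = rng.random((n, n)); P /= P.sum(axis=1, keepdims=True)   # transition matrix of mu
g = rng.random(n)                                           # expected stage costs
Phi = rng.standard_normal((n, s))

evals, evecs = np.linalg.eig(P.T)                           # steady-state distribution xi
xi = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
xi = np.abs(xi) / np.abs(xi).sum()
Xi = np.diag(xi)

C = Phi.T @ Xi @ (np.eye(n) - alpha * P) @ Phi
d = Phi.T @ Xi @ g
r_star = np.linalg.solve(C, d)                 # matrix inversion solution r* = C^{-1} d

G = np.linalg.inv(Phi.T @ Xi @ Phi)
r = np.zeros(s)
for k in range(200):                           # PVI: r_{k+1} = r_k - G (C r_k - d)
    r = r - G @ (C @ r - d)
print(np.linalg.norm(r - r_star))              # -> 0, since Pi T is a contraction
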
SIMULATION-BASED IMPLEMENTATIONS

• Key idea: Calculate simulation-based approxi-


mations based on k samples

Ck ≈ C, dk ≈ d

• Matrix inversion r∗ = C −1 d is approximated


by
r̂k = Ck−1 dk
This is the LSTD (Least Squares Temporal Dif-
ferences) Method.
• PVI method rk+1 = rk − (Φ′ ΞΦ)−1 (Crk − d) is
approximated by

rk+1 = rk − Gk (Ck rk − dk )

where
Gk ≈ (Φ′ ΞΦ)−1
This is the LSPE (Least Squares Policy Evalua-
tion) Method.
• Key fact: Ck , dk , and Gk can be computed
with low-dimensional linear algebra (of order s;
the number of basis functions).
SIMULATION MECHANICS

• We generate an infinitely long trajectory (i0 , i1 , . . .)


of the Markov chain, so states i and transitions
(i, j) appear with long-term frequencies ξi and pij .
• After generating each transition (it , it+1 ), we
compute the row φ(it )′ of Φ and the cost compo-
nent g(it , it+1 ).
• We form

      dk = (1/(k + 1)) Σ_{t=0}^{k} φ(it )g(it , it+1 ) ≈ Σ_{i,j} ξi pij φ(i)g(i, j) = Φ′ Ξg = d

      Ck = (1/(k + 1)) Σ_{t=0}^{k} φ(it )( φ(it ) − αφ(it+1 ) )′ ≈ Φ′ Ξ(I − αP )Φ = C

Also, in the case of LSPE,

      Gk = (1/(k + 1)) Σ_{t=0}^{k} φ(it )φ(it )′ ≈ Φ′ ΞΦ
• Convergence based on law of large numbers.
• Ck , dk , and Gk can be formed incrementally.
Also can be written using the formalism of tem-
poral differences (this is just a matter of style)
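
A hedged numpy sketch of these simulation mechanics for a single policy (P , g, Φ, the trajectory length, and the seed are synthetic/illustrative):

import numpy as np

# Simulate one long trajectory, accumulate C_k, d_k, G_k with low-dimensional
# (s x s) linear algebra, then compute the LSTD solution and run LSPE.
rng = np.random.default_rng(3)
n, s, alpha, K = 60, 4, 0.95, 50000
P = rng.random((n, n)); P /= P.sum(axis=1, keepdims=True)
g = rng.random((n, n))                      # transition costs g(i, j)
Phi = rng.standard_normal((n, s))

C = np.zeros((s, s)); d = np.zeros(s); G = np.zeros((s, s))
i = 0
for t in range(K):
    j = rng.choice(n, p=P[i])               # simulate transition (i_t, i_{t+1})
    phi_i, phi_j = Phi[i], Phi[j]
    d += phi_i * g[i, j]
    C += np.outer(phi_i, phi_i - alpha * phi_j)
    G += np.outer(phi_i, phi_i)
    i = j
C /= K; d /= K; G /= K

r_lstd = np.linalg.solve(C, d)              # LSTD: r_hat = C_k^{-1} d_k

r = np.zeros(s)                             # LSPE: r_{k+1} = r_k - G_k^{-1}(C_k r_k - d_k)
Ginv = np.linalg.inv(G)
for _ in range(200):
    r = r - Ginv @ (C @ r - d)
print(np.linalg.norm(r - r_lstd))           # both approximate the same r*
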
OPTIMISTIC VERSIONS

• Instead of calculating nearly exact approxima-


tions Ck ≈ C and dk ≈ d, we do a less accurate
approximation, based on few simulation samples
• Evaluate (coarsely) current policy µ, then do a
policy improvement
• This often leads to faster computation (as op-
timistic methods often do)
• Very complex behavior (see the subsequent dis-
cussion on oscillations)
• The matrix inversion/LSTD method has serious
problems due to large simulation noise (because of
limited sampling) - particularly if the C matrix is
ill-conditioned
• LSPE tends to cope better because of its itera-
tive nature (this is true of other iterative methods
as well)
• A stepsize γ ∈ (0, 1] in LSPE may be useful to
damp the effect of simulation noise

rk+1 = rk − γGk (Ck rk − dk )


MULTISTEP PROJECTED EQUATIONS
MULTISTEP METHODS

• Introduce a multistep version of Bellman’s equa-


tion J = T (λ) J, where for λ ∈ [0, 1),
      T (λ) = (1 − λ) Σ_{ℓ=0}^{∞} λℓ T ℓ+1
Geometrically weighted sum of powers of T .
• Note that T ℓ is a contraction with modulus
αℓ , with respect to the weighted Euclidean norm
k·kξ , where ξ is the steady-state probability vector
of the Markov chain.
• Hence T (λ) is a contraction with modulus

      αλ = (1 − λ) Σ_{ℓ=0}^{∞} αℓ+1 λℓ = α(1 − λ)/(1 − αλ)

Note that αλ → 0 as λ → 1
• T ℓ and T (λ) have the same fixed point Jµ and

      kJµ − Φrλ∗ kξ ≤ (1/√(1 − αλ^2)) kJµ − ΠJµ kξ

where Φrλ∗ is the fixed point of ΠT (λ) .


• The fixed point Φrλ∗ depends on λ.
BIAS-VARIANCE TRADEOFF

[Figure: bias-variance tradeoff — the solution Φrλ∗ of the projected equation Φr = ΠT (λ) (Φr) moves from the λ = 0 solution toward ΠJµ as λ ↑ 1 (smaller bias), while the simulation error increases; subspace S = {Φr | r ∈ ℜs }]

• Error bound: kJµ − Φrλ∗ kξ ≤ (1/√(1 − αλ^2)) kJµ − ΠJµ kξ

• As λ ↑ 1, we have αλ ↓ 0, so error bound (and


the quality of approximation) improves as λ ↑ 1.
In fact
      lim_{λ↑1} Φrλ∗ = ΠJµ

• But the simulation noise in approximating



      T (λ) = (1 − λ) Σ_{ℓ=0}^{∞} λℓ T ℓ+1
increases
• Choice of λ is usually based on trial and error
MULTISTEP PROJECTED EQ. METHODS

• The projected Bellman equation is


Φr = ΠT (λ) (Φr)
• In matrix form: C (λ) r = d(λ) , where

      C (λ) = Φ′ Ξ( I − αP (λ) )Φ,   d(λ) = Φ′ Ξg (λ) ,

with

      P (λ) = (1 − λ) Σ_{ℓ=0}^{∞} αℓ λℓ P ℓ+1 ,   g (λ) = Σ_{ℓ=0}^{∞} αℓ λℓ P ℓ g

• The LSTD(λ) method is


      r̂k = ( Ck(λ) )−1 dk(λ) ,

where Ck(λ) and dk(λ) are simulation-based approximations of C (λ) and d(λ) .
• The LSPE(λ) method is
      rk+1 = rk − γGk ( Ck(λ) rk − dk(λ) )

where Gk is a simulation-based approx. to (Φ′ ΞΦ)−1


• TD(λ): An important simpler/slower iteration
[similar to LSPE(λ) with Gk = I - see the text].
MORE ON MULTISTEP METHODS

• The simulation process to obtain Ck(λ) and dk(λ) is similar to the case λ = 0 (single simulation trajectory i0 , i1 , . . ., more complex formulas)

      Ck(λ) = (1/(k + 1)) Σ_{t=0}^{k} φ(it ) Σ_{m=t}^{k} αm−t λm−t ( φ(im ) − αφ(im+1 ) )′

      dk(λ) = (1/(k + 1)) Σ_{t=0}^{k} φ(it ) Σ_{m=t}^{k} αm−t λm−t g_{im}
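
As a sketch (with the per-transition cost g(im , im+1 ) standing in for g_{im} ), the double sums above can be accumulated incrementally with an "eligibility trace" zm = αλ zm−1 + φ(im ), obtained by exchanging the order of summation; all data below are synthetic assumptions:

import numpy as np

# Incremental LSTD(lambda) sketch for a single policy using an eligibility trace.
rng = np.random.default_rng(4)
n, s, alpha, lam, K = 40, 4, 0.95, 0.7, 50000
P = rng.random((n, n)); P /= P.sum(axis=1, keepdims=True)
g = rng.random((n, n))
Phi = rng.standard_normal((n, s))

C = np.zeros((s, s)); d = np.zeros(s); z = np.zeros(s)
i = 0
for t in range(K):
    j = rng.choice(n, p=P[i])
    z = alpha * lam * z + Phi[i]                 # eligibility trace z_m
    C += np.outer(z, Phi[i] - alpha * Phi[j])
    d += z * g[i, j]
    i = j
C /= K; d /= K
r_lam = np.linalg.solve(C, d)                    # LSTD(lambda) estimate
print(r_lam)
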

• In the context of approximate policy iteration,


we can use optimistic versions (few samples be-
tween policy updates).
• Many different versions (see the text).
• Note the λ-tradeoffs:
− As λ ↑ 1, Ck(λ) and dk(λ) contain more “simulation noise”, so more samples are needed for a close approximation of rλ (the solution of the projected equation)
− The error bound kJµ −Φrλ kξ becomes smaller
− As λ ↑ 1, ΠT (λ) becomes a contraction for
arbitrary projection norm
6.231 DYNAMIC PROGRAMMING

LECTURE 5

LECTURE OUTLINE

• Review of approximate PI based on projected


Bellman equations
• Issues of policy improvement
− Exploration enhancement in policy evalua-
tion
− Oscillations in approximate PI
• Aggregation – An alternative to the projected
equation/Galerkin approach
• Examples of aggregation
• Simulation-based aggregation
• Relation between aggregation and projected
equations
REVIEW
DISCOUNTED MDP

• System: Controlled Markov chain with states


i = 1, . . . , n and finite set of controls u ∈ U (i)
• Transition probabilities: pij (u)

[Figure: two states i and j with transition probabilities pii (u), pij (u), pji (u), pjj (u)]

• Cost of a policy π = {µ0 , µ1 , . . .} starting at


state i:
      Jπ (i) = lim_{N→∞} E{ Σ_{k=0}^{N} αk g( ik , µk (ik ), ik+1 ) | i0 = i }

with α ∈ [0, 1)
• Shorthand notation for DP mappings
      (T J)(i) = min_{u∈U (i)} Σ_{j=1}^{n} pij (u)( g(i, u, j) + αJ(j) ),   i = 1, . . . , n,

      (Tµ J)(i) = Σ_{j=1}^{n} pij (µ(i))( g(i, µ(i), j) + αJ(j) ),   i = 1, . . . , n
APPROXIMATE PI

[Figure: Approximate Policy Iteration loop — approximate policy evaluation produces J˜µ (i, r); policy improvement generates the “improved” policy µ]

• Evaluation of typical policy µ: Linear cost func-


tion approximation

J˜µ (r) = Φr

where Φ is full rank n × s matrix with columns


the basis functions, and ith row denoted φ(i)′ .
• Policy “improvement” to generate µ:
      µ(i) = arg min_{u∈U (i)} Σ_{j=1}^{n} pij (u)( g(i, u, j) + αφ(j)′ r )
EVALUATION BY PROJECTED EQUATIONS

• Approximate policy evaluation by solving


Φr = ΠTµ (Φr)
Π: weighted Euclidean projection; special nature
of the steady-state distribution weighting.
• Implementation by simulation (single long tra-
jectory using current policy - important to make
ΠTµ a contraction). LSTD, LSPE methods.
• Multistep option: Solve Φr = ΠTµ(λ) (Φr) with

      Tµ(λ) = (1 − λ) Σ_{ℓ=0}^{∞} λℓ Tµℓ+1 ,   0 ≤ λ < 1
− As λ ↑ 1, ΠTµ(λ) becomes a contraction for any projection norm (allows changes in Π)
− Bias-variance tradeoff
[Figure: bias-variance tradeoff — the solution of the projected equation Φr = ΠT (λ) (Φr) moves from the λ = 0 solution toward ΠJµ as λ ↑ 1, while the simulation error increases; subspace S = {Φr | r ∈ ℜs }]
ISSUES OF POLICY IMPROVEMENT
EXPLORATION

• 1st major issue: exploration. To evaluate µ,


we need to generate cost samples using µ
• This biases the simulation by underrepresenting
states that are unlikely to occur under µ.
• As a result, the cost-to-go estimates of these
underrepresented states may be highly inaccurate,
and seriously impact the “improved policy” µ.
• This is known as inadequate exploration - a
particularly acute difficulty when the randomness
embodied in the transition probabilities is “rela-
tively small” (e.g., a deterministic system).
• To deal with this we must change the sampling
mechanism and modify the simulation formulas.
• Solve
Φr = ΠTµ (Φr)
where Π is projection with respect to an exploration-
enhanced norm [uses a weight distribution ζ =
(ζ1 , . . . , ζn )].
• ζ is more “balanced” than ξ, the steady-state distribution of the Markov chain of µ.
• This also addresses any lack of ergodicity of µ.
EXPLORATION MECHANISMS

• One possibility: Use multiple short simulation


trajectories instead of single long trajectory start-
ing from a rich mixture of states. This is known
as geometric sampling, or free-form sampling.
− By properly choosing the starting states, we
enhance exploration
− The simulation formulas for LSTD(λ) and LSPE(λ) have to be modified to yield the solution of Φr = ΠTµ(λ) (Φr) (see the DP text)
• Another possibility: Use a modified policy to
generate a single long trajectory. This is called an
off-policy approach.
− Modify the transition probabilities of µ to
enhance exploration
− Again the simulation formulas for LSTD(λ) and LSPE(λ) have to be modified to yield the solution of Φr = ΠTµ(λ) (Φr) (use of importance sampling; see the DP text)
• With larger values of λ > 0 the contraction property of ΠTµ(λ) is maintained.
• LSTD may be used without ΠTµ(λ) being a contraction ... LSPE and TD require a contraction.
POLICY ITERATION ISSUES: OSCILLATIONS

• 2nd major issue: oscillation of policies


• Analysis using the greedy partition of the space
of weights r: Rµ is the set of parameter vectors r
for which µ is greedy with respect to J˜(·; r) = Φr:

      Rµ = { r | Tµ (Φr) = T (Φr) },   ∀ µ
If we use r in Rµ the next “improved” policy is µ
[Figure: the greedy partition — regions Rµk , Rµk+1 , Rµk+2 , Rµk+3 and the corresponding weight vectors rµk , rµk+1 , rµk+2 , rµk+3 ]

• If policy evaluation is exact, there is a finite


number of possible vectors rµ (one per µ)
• The algorithm ends up repeating some cycle of
policies µk , µk+1 , . . . , µk+m with

rµk ∈ Rµk+1 , rµk+1 ∈ Rµk+2 , . . . , rµk+m ∈ Rµk

• Many different cycles are possible


MORE ON OSCILLATIONS/CHATTERING

• In the case of optimistic policy iteration a dif-


ferent picture holds (policy evaluation does not
produce exactly rµ )

[Figure: optimistic PI — the weight vector oscillates near rµ1 , rµ2 , rµ3 across the regions Rµ1 , Rµ2 , Rµ3 ]

• Oscillations of weight vector r are less violent,


but the “limit” point is meaningless!
• Fundamentally, oscillations are due to the lack
of monotonicity of the projection operator, i.e.,
J ≤ J ′ does not imply ΠJ ≤ ΠJ ′ .
• If approximate PI uses an evaluation of the form

Φr = (W Tµ )(Φr)

with W : monotone and W Tµ : contraction, the


policies converge (to a possibly nonoptimal limit).
• These conditions hold when aggregation is used
AGGREGATION
PROBLEM APPROXIMATION - AGGREGATION

• Another major idea in ADP is to approximate


J ∗ or Jµ with the cost-to-go functions of a simpler
problem.
• Aggregation is a systematic approach for prob-
lem approximation. Main elements:
− Introduce a few “aggregate” states, viewed
as the states of an “aggregate” system
− Define transition probabilities and costs of
the aggregate system, by relating original
system states with aggregate states
− Solve (exactly or approximately) the “ag-
gregate” problem by any kind of VI or PI
method (including simulation-based methods)
• If R̂(y) is the optimal cost of aggregate state y,
we use the approximation
      J ∗ (j) ≈ Σ_y φjy R̂(y),   ∀ j
where φjy are the aggregation probabilities, en-
coding the “degree of membership of j in the ag-
gregate state y”
• This is a linear architecture: φjy are the features
of state j
HARD AGGREGATION EXAMPLE

• Group the original system states into subsets,


and view each subset as an aggregate state
• Aggregation probs.: φjy = 1 if j belongs to
aggregate state y (piecewise constant approx).

[Figure: a 3 × 3 grid of states 1, . . . , 9 grouped into aggregate states x1 = {1, 2, 4, 5}, x2 = {3, 6}, x3 = {7, 8}, x4 = {9}]

            1 0 0 0
            1 0 0 0
            0 1 0 0
            1 0 0 0
      Φ =   1 0 0 0
            0 1 0 0
            0 0 1 0
            0 0 1 0
            0 0 0 1

• What should be the “aggregate” transition probs.


out of x?
• Select i ∈ x and use the transition probs. of i.
But which i should I use?
• The simplest possibility is to assume that all
states i in x are equally likely.
• A generalization is to randomize, i.e., use “dis-
aggregation probabilities” dxi : Roughly, the “de-
gree to which i is representative of x.”
AGGREGATION/DISAGGREGATION PROBS
[Figure: original system states i, j and aggregate states x, y — transitions i → j occur according to pij (u); the disaggregation probabilities dxi form the matrix D and the aggregation probabilities φjy form the matrix Φ]

• Define the aggregate system transition proba-


bilities via two (somewhat arbitrary) choices.
• For each original system state j and aggregate
state y, the aggregation probability φjy
− Roughly, the “degree of membership of j in
the aggregate state y.”
− In hard aggregation, φjy = 1 if state j be-
longs to aggregate state/subset y.
• For each aggregate state x and original system
state i, the disaggregation probability dxi
− Roughly, the “degree to which i is represen-
tative of x.”
• Aggregation scheme is defined by the two ma-
trices D and Φ. The rows of D and Φ must be
probability distributions.
AGGREGATE SYSTEM DESCRIPTION
[Figure: the aggregate system — aggregate state x disaggregates to original states i (probabilities dxi ), transitions i → j occur according to pij (u) with cost g(i, u, j), and state j aggregates to aggregate state y (probabilities φjy ), yielding p̂xy (u) and ĝ(x, u) as given below]

• The transition probability from aggregate state


x to aggregate state y under control u
      p̂xy (u) = Σ_{i=1}^{n} Σ_{j=1}^{n} dxi pij (u)φjy ,   or P̂ (u) = DP (u)Φ

where the rows of D and Φ are the disaggregation


and aggregation probs.
• The expected transition cost is
      ĝ(x, u) = Σ_{i=1}^{n} Σ_{j=1}^{n} dxi pij (u)g(i, u, j),   or ĝ = DP (u)g
AGGREGATE BELLMAN’S EQUATION
[Figure: the aggregate system — disaggregation probabilities dxi , original system transitions pij (u) with cost g(i, u, j), aggregation probabilities φjy , and the resulting p̂xy (u) and ĝ(x, u)]

• The optimal cost function of the aggregate prob-


lem, denoted R̂, is
" #
X
R̂(x) = min ĝ(x, u) + α p̂xy (u)R̂(y) , ∀x
u∈U
y

Bellman’s equation for the aggregate problem.


• The optimal cost function J ∗ of the original
problem is approximated by J˜ given by

      J˜(j) = Σ_y φjy R̂(y),   ∀ j
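
A minimal numpy sketch of this pipeline under hard aggregation (the MDP data and the grouping of states are synthetic assumptions): build P̂ (u) = DP (u)Φ and ĝ, solve the aggregate Bellman equation by VI, and expand with Φ.

import numpy as np

# Hedged hard-aggregation sketch: aggregate problem construction, VI, expansion.
rng = np.random.default_rng(5)
n, m, alpha = 12, 2, 0.9                         # states, controls, discount
P = rng.random((m, n, n)); P /= P.sum(axis=2, keepdims=True)
g = rng.random((m, n, n))

groups = [range(0, 4), range(4, 8), range(8, 12)]    # aggregate states (subsets)
A = len(groups)
Phi = np.zeros((n, A)); D = np.zeros((A, n))
for x, grp in enumerate(groups):
    Phi[list(grp), x] = 1.0                      # phi_jy = 1 if j belongs to y
    D[x, list(grp)] = 1.0 / len(list(grp))       # uniform disaggregation probs

P_hat = np.stack([D @ P[u] @ Phi for u in range(m)])           # p_hat_xy(u)
g_hat = np.stack([(D @ (P[u] * g[u])).sum(axis=1) for u in range(m)])   # g_hat(x,u)

R = np.zeros(A)                                  # VI on the aggregate problem
for _ in range(500):
    R = np.min(g_hat + alpha * P_hat @ R, axis=0)
J_tilde = Phi @ R                                # J*(j) ~= sum_y phi_jy R_hat(y)
print(J_tilde)
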
EXAMPLE I: HARD AGGREGATION

• Group the original system states into subsets,


and view each subset as an aggregate state
• Aggregation probs.: φjy = 1 if j belongs to
aggregate state y.

[Figure: a 3 × 3 grid of states 1, . . . , 9 grouped into aggregate states x1 = {1, 2, 4, 5}, x2 = {3, 6}, x3 = {7, 8}, x4 = {9}]

            1 0 0 0
            1 0 0 0
            0 1 0 0
            1 0 0 0
      Φ =   1 0 0 0
            0 1 0 0
            0 0 1 0
            0 0 1 0
            0 0 0 1

• Disaggregation probs.: There are many possi-


bilities, e.g., all states i within aggregate state x
have equal prob. dxi .
• If optimal cost vector J ∗ is piecewise constant
over the aggregate states/subsets, hard aggrega-
tion is exact. Suggests grouping states with “roughly
equal” cost into aggregates.
• A variant: Soft aggregation (provides “soft
boundaries” between aggregate states).
EXAMPLE II: FEATURE-BASED AGGREGATION

• Important question: How do we group states


together?
• If we know good features, it makes sense to
group together states that have “similar features”

[Figure: feature extraction mapping — states are mapped to feature vectors, and the feature vectors define the aggregate states]

• A general approach for passing from a feature-


based state representation to a hard aggregation-
based architecture
• Essentially discretize the features and generate
a corresponding piecewise constant approximation
to the optimal cost function
• Aggregation-based architecture is more power-
ful (it is nonlinear in the features)
• ... but may require many more aggregate states
to reach the same level of performance as the cor-
responding linear feature-based architecture
EXAMPLE III: REP. STATES/COARSE GRID

• Choose a collection of “representative” original


system states, and associate each one of them with
an aggregate state
[Figure: original state space with representative/aggregate states y1 , y2 , y3 and original system states j1 , j2 , j3 associated with them through the aggregation probabilities]

• Disaggregation probabilities are dxi = 1 if i is


equal to representative state x.
• Aggregation probabilities associate original system states with convex combinations of representative states:

      j ∼ Σ_{y∈A} φjy y

• Well-suited for Euclidean space discretization


• Extends nicely to continuous state space, in-
cluding belief space of POMDP
EXAMPLE IV: REPRESENTATIVE FEATURES

• Here the aggregate states are nonempty subsets


of original system states. Common case: Each Sx
is a group of states with “similar features”
[Figure: original state space partitioned into aggregate states/subsets Sx1 , Sx2 , Sx3 (groups of states with “similar features”), with aggregation probabilities φjx1 , φjx2 , φjx3 and transition probabilities pij ]

• Restrictions:
− The aggregate states/subsets are disjoint.
− The disaggregation probabilities satisfy dxi >
0 if and only if i ∈ x.
− The aggregation probabilities satisfy φjy = 1
for all j ∈ y.
• Hard aggregation is a special case: ∪x Sx =
{1, . . . , n}
• Aggregation with representative states is a spe-
cial case: Sx consists of just one state
APPROXIMATE PI BY AGGREGATION
[Figure: the aggregate system — disaggregation probabilities dxi , original system transitions pij (u) with cost g(i, u, j), aggregation probabilities φjy , and the resulting p̂xy (u) and ĝ(x, u)]

• Consider approximate PI for the original prob-


lem, with policy evaluation done by aggregation.
• Evaluation of policy µ: J˜ = ΦR, where R =
DTµ (ΦR) (R is the vector of costs of aggregate
states for µ). Can be done by simulation.
• Looks like projected equation ΦR = ΠTµ (ΦR)
(but with ΦD in place of Π).
• Advantage: It has no problem with oscillations.
• Disadvantage: The rows of D and Φ must be
probability distributions.
ADDITIONAL ISSUES OF AGGREGATION
ALTERNATIVE POLICY ITERATION

• The preceding PI method uses policies that as-


sign a control to each aggregate state.
• An alternative is to use PI for the combined
system, involving the Bellman equations:
      R∗ (x) = Σ_{i=1}^{n} dxi J˜0 (i),   ∀ x,

      J˜0 (i) = min_{u∈U (i)} Σ_{j=1}^{n} pij (u)( g(i, u, j) + αJ˜1 (j) ),   i = 1, . . . , n,

      J˜1 (j) = Σ_{y∈A} φjy R∗ (y),   j = 1, . . . , n.
[Figure: original system states and aggregate states linked by the disaggregation probabilities dxi (matrix D) and the aggregation probabilities φjy (matrix Φ)]

• Simulation-based PI and VI are still possible.


RELATION OF AGGREGATION/PROJECTION

• Compare aggregation and projected equations

ΦR = ΦDT (ΦR), Φr = ΠT (Φr)


• If ΦD is a projection (with respect to some
weighted Euclidean norm), then the methodology
of projected equations applies to aggregation
• Hard aggregation case: ΦD can be verified to be
projection with respect to weights ξi proportional
to the disaggregation probabilities dxi
• Aggregation with representative features case:
ΦD can be verified to be a semi-norm projection
with respect to weights ξi proportional to dxi
• A (weighted) Euclidean semi-norm is defined by kJkξ = ( Σ_{i=1}^{n} ξi J(i)^2 )^{1/2} , where ξ = (ξ1 , . . . , ξn ), with ξi ≥ 0.
• If Φ′ ΞΦ is invertible, the entire theory and
algorithms of projected equations generalizes to
semi-norm projected equations [including multi-
step methods such as LSTD/LSPE/TD(λ)].
• Reference: Yu and Bertsekas, “Weighted Bell-
man Equations and their Applications in Approxi-
mate Dynamic Programming,” MIT Report, 2012.
DISTRIBUTED AGGREGATION I

• We consider decomposition/distributed solu-


tion of large-scale discounted DP problems by hard
aggregation.
• Partition the original system states into subsets
S1 , . . . , Sm .
• Distributed VI Scheme: Each subset Sℓ
− Maintains detailed/exact local costs

J(i) for every original system state i ∈ Sℓ

using aggregate costs of other subsets


− Maintains an aggregate cost R(ℓ) = Σ_{i∈Sℓ} dℓi J(i)
− Sends R(ℓ) to other aggregate states
• J(i) and R(ℓ) are updated by VI according to

      Jk+1 (i) = min_{u∈U (i)} Hℓ (i, u, Jk , Rk ),   ∀ i ∈ Sℓ

with Rk being the vector of R(ℓ) at time k, and

      Hℓ (i, u, J, R) = Σ_{j=1}^{n} pij (u)g(i, u, j) + α Σ_{j∈Sℓ} pij (u)J(j) + α Σ_{j∈Sℓ′ , ℓ′ ≠ ℓ} pij (u)R(ℓ′ )
DISTRIBUTED AGGREGATION II

• Can show that this iteration involves a sup-


norm contraction mapping of modulus α, so it
converges to the unique solution of the system of
equations in (J, R)

      J(i) = min_{u∈U (i)} Hℓ (i, u, J, R),   R(ℓ) = Σ_{i∈Sℓ} dℓi J(i),   ∀ i ∈ Sℓ , ℓ = 1, . . . , m.

• This follows from the fact that {dℓi | i =


1, . . . , n} is a probability distribution.
• View these equations as a set of Bellman equa-
tions for an “aggregate” DP problem. The differ-
ence is that the mapping H involves J(j) rather than R(x(j)) for j ∈ Sℓ .
• In an asynchronous version of the method, the
aggregate costs R(ℓ) may be outdated to account
for communication “delays” between aggregate states.
• Convergence can be shown using the general
theory of asynchronous distributed computation,
briefly described in the 2nd lecture (see the text).
6.231 DYNAMIC PROGRAMMING

LECTURE 6

LECTURE OUTLINE

• Review of Q-factors and Bellman equations for


Q-factors
• VI and PI for Q-factors
• Q-learning - Combination of VI and sampling
• Q-learning and cost function approximation
• Adaptive dynamic programming
• Approximation in policy space
• Additional topics
REVIEW
DISCOUNTED MDP

• System: Controlled Markov chain with states


i = 1, . . . , n and finite set of controls u ∈ U (i)
• Transition probabilities: pij (u)

[Figure: two states i and j with transition probabilities pii (u), pij (u), pji (u), pjj (u)]

• Cost of a policy π = {µ0 , µ1 , . . .} starting at


state i:
      Jπ (i) = lim_{N→∞} E{ Σ_{k=0}^{N} αk g( ik , µk (ik ), ik+1 ) | i0 = i }

with α ∈ [0, 1)
• Shorthand notation for DP mappings
      (T J)(i) = min_{u∈U (i)} Σ_{j=1}^{n} pij (u)( g(i, u, j) + αJ(j) ),   i = 1, . . . , n,

      (Tµ J)(i) = Σ_{j=1}^{n} pij (µ(i))( g(i, µ(i), j) + αJ(j) ),   i = 1, . . . , n
BELLMAN EQUATIONS FOR Q-FACTORS

• The optimal Q-factors are defined by


      Q∗ (i, u) = Σ_{j=1}^{n} pij (u)( g(i, u, j) + αJ ∗ (j) ),   ∀ (i, u)

• Since J ∗ = T J ∗ , we have J ∗ (i) = minu∈U (i) Q∗ (i, u)


so the optimal Q-factors solve the equation
      Q∗ (i, u) = Σ_{j=1}^{n} pij (u)( g(i, u, j) + α min_{u′∈U (j)} Q∗ (j, u′ ) )

• Equivalently Q∗ = F Q∗ , where
      (F Q)(i, u) = Σ_{j=1}^{n} pij (u)( g(i, u, j) + α min_{u′∈U (j)} Q(j, u′ ) )

• This is Bellman’s Eq. for a system whose states


are the pairs (i, u)
• Similar mapping Fµ and Bellman equation for
a policy µ: Qµ = Fµ Qµ
BELLMAN EQ FOR Q-FACTORS OF A POLICY

[Figure: fixed policy µ case — a state-control pair (i, u) leads to state j with probability pij (u) and cost g(i, u, j), and then to the pair (j, µ(j))]

• Q-factors of a policy µ: For all (i, u)


      Qµ (i, u) = Σ_{j=1}^{n} pij (u)( g(i, u, j) + αQµ (j, µ(j)) )

Equivalently Qµ = Fµ Qµ , where

      (Fµ Q)(i, u) = Σ_{j=1}^{n} pij (u)( g(i, u, j) + αQ(j, µ(j)) )

• This is a linear equation. It can be used for


policy evaluation.
• Generally VI and PI can be carried out in terms
of Q-factors.
• When done exactly they produce results that
are mathematically equivalent to cost-based VI
and PI.
WHAT IS GOOD AND BAD ABOUT Q-FACTORS

• All the exact theory and algorithms for costs


applies to Q-factors
− Bellman’s equations, contractions, optimal-
ity conditions, convergence of VI and PI
• All the approximate theory and algorithms for
costs applies to Q-factors
− Projected equations, sampling and exploration
issues, oscillations, aggregation
• A MODEL-FREE (on-line) controller imple-
mentation
− Once we calculate Q∗ (i, u) for all (i, u),

      µ∗ (i) = arg min_{u∈U (i)} Q∗ (i, u),   ∀ i
− Similarly, once we calculate a parametric ap-
proximation Q̃(i, u; r) for all (i, u),

      µ̃(i) = arg min_{u∈U (i)} Q̃(i, u; r),   ∀ i

• The main bad thing: Greater dimension and


more storage! (It can be used for large-scale prob-
lems only through aggregation, or other approxi-
mation.)
Q-LEARNING
Q-LEARNING

• In addition to the approximate PI methods


adapted for Q-factors, there is an important addi-
tional algorithm:
− Q-learning, a sampled form of VI (a stochas-
tic iterative algorithm).
• Q-learning algorithm (in its classical form):
− Sampling: Select sequence of pairs (ik , uk )
[use any probabilistic mechanism for this,
but all (i, u) are chosen infinitely often].
− Iteration: For each k, select jk according to
pik j (uk ). Update just Q(ik , uk ):

      Qk+1 (ik , uk ) = (1 − γk )Qk (ik , uk ) + γk ( g(ik , uk , jk ) + α min_{u′∈U (jk )} Qk (jk , u′ ) )

Leave unchanged all other Q-factors.


− Stepsize conditions: γk ↓ 0
• We move Q(i, u) in the direction of a sample of
      (F Q)(i, u) = Σ_{j=1}^{n} pij (u)( g(i, u, j) + α min_{u′∈U (j)} Q(j, u′ ) )
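
A hedged sketch of the classical Q-learning iteration on a small synthetic MDP (uniform sampling of the pairs (i, u) and a 1/visit-count stepsize are illustrative choices, not prescriptions from the text):

import numpy as np

# Tabular Q-learning: sample (i_k, u_k), simulate j_k ~ p_{i_k j}(u_k),
# update a single Q-factor, leave all others unchanged.
rng = np.random.default_rng(6)
n, m, alpha = 20, 3, 0.9
P = rng.random((m, n, n)); P /= P.sum(axis=2, keepdims=True)
g = rng.random((m, n, n))

Q = np.zeros((n, m))
visits = np.zeros((n, m))
for k in range(200000):
    i, u = rng.integers(n), rng.integers(m)      # all (i, u) chosen infinitely often
    j = rng.choice(n, p=P[u, i])                 # simulate next state j_k
    visits[i, u] += 1
    gamma = 1.0 / visits[i, u]                   # diminishing stepsize
    Q[i, u] = (1 - gamma) * Q[i, u] + gamma * (g[u, i, j] + alpha * Q[j].min())

Q_star = np.zeros((n, m))                        # exact Q* from Q-factor VI, for comparison
for _ in range(500):
    Q_star = np.einsum('uij,uij->iu', P, g + alpha * Q_star.min(axis=1)[None, None, :])
print(np.abs(Q - Q_star).max())                  # small for enough samples
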
NOTES AND QUESTIONS ABOUT Q-LEARNING

      Qk+1 (ik , uk ) = (1 − γk )Qk (ik , uk ) + γk ( g(ik , uk , jk ) + α min_{u′∈U (jk )} Qk (jk , u′ ) )

• Model free implementation. We just need a


simulator that given (i, u) produces next state j
and cost g(i, u, j)
• Operates on only one state-control pair at a
time. Convenient for simulation, no restrictions on
sampling method. (Connection with asynchronous
algorithms.)
• Aims to find the (exactly) optimal Q-factors.
• Why does it converge to Q∗ ?
• Why can’t I use a similar algorithm for optimal
costs (a sampled version of VI)?
• Important mathematical (fine) point: In the Q-
factor version of Bellman’s equation the order of
expectation and minimization is reversed relative
to the cost version of Bellman’s equation:
      J ∗ (i) = min_{u∈U (i)} Σ_{j=1}^{n} pij (u)( g(i, u, j) + αJ ∗ (j) )
CONVERGENCE ASPECTS OF Q-LEARNING

• Q-learning can be shown to converge to true/exact


Q-factors (under mild assumptions).
• The proof is sophisticated, based on theories of
stochastic approximation and asynchronous algo-
rithms.
• Uses the fact that the Q-learning map F :
      (F Q)(i, u) = E_j { g(i, u, j) + α min_{u′} Q(j, u′ ) }

is a sup-norm contraction.
• Generic stochastic approximation algorithm:
− Consider generic fixed point problem involv-
ing an expectation:

x = Ew f (x, w)

− Assume Ew f (x, w) is a contraction with
respect to some norm, so the iteration

xk+1 = Ew f (xk , w)

converges to the unique fixed point



− Approximate Ew f (x, w) by sampling
STOCH. APPROX. CONVERGENCE IDEAS

• Generate a sequence of samples {w1 , w2 , . . .}, and approximate the convergent fixed point iteration xk+1 = Ew f (xk , w)
• At each iteration k use the approximation

      xk+1 = (1/k) Σ_{t=1}^{k} f (xk , wt ) ≈ Ew f (xk , w)
• A major flaw: it requires, for each k, the compu-
tation of f (xk , wt ) for all values wt , t = 1, . . . , k.
• This motivates the more convenient iteration
      xk+1 = (1/k) Σ_{t=1}^{k} f (xt , wt ),   k = 1, 2, . . . ,
that is similar, but requires much less computa-
tion; it needs only one value of f per sample wt .
• By denoting γk = 1/k, it can also be written as

xk+1 = (1 − γk )xk + γk f (xk , wk ), k = 1, 2, . . .


• Compare with Q-learning, where the fixed point
problem is Q = F Q

      (F Q)(i, u) = E_j { g(i, u, j) + α min_{u′} Q(j, u′ ) }
Q-LEARNING COMBINED WITH OPTIMISTIC PI

• Each Q-learning iteration requires minimization


over all controls u′ ∈ U (jk ):

      Qk+1 (ik , uk ) = (1 − γk )Qk (ik , uk ) + γk ( g(ik , uk , jk ) + α min_{u′∈U (jk )} Qk (jk , u′ ) )

• To reduce this overhead we may consider re-


placing the minimization by a simpler operation
using just the “current policy” µk
• This suggests an asynchronous sampled version
of the optimistic PI algorithm which policy eval-
uates by
      Qk+1 = (Fµk )mk Qk ,

and policy improves by µk+1 (i) ∈ arg min_{u∈U (i)} Qk+1 (i, u)
• This turns out not to work (counterexamples
by Williams and Baird, which date to 1993), but
a simple modification of the algorithm is valid
• See a series of papers starting with
D. Bertsekas and H. Yu, “Q-Learning and En-
hanced Policy Iteration in Discounted Dynamic
Programming,” Math. of OR, Vol. 37, 2012, pp.
66-94
Q-FACTOR APPROXIMATIONS

• We introduce basis function approximation:


Q̃(i, u; r) = φ(i, u)′ r
• We can use approximate policy iteration and
LSTD/LSPE for policy evaluation
• Optimistic policy iteration methods are fre-
quently used on a heuristic basis
• An extreme example: Generate trajectory {(ik , uk ) |
k = 0, 1, . . .} as follows.
• At iteration k, given rk and state/control (ik , uk ):
(1) Simulate next transition (ik , ik+1 ) using the
transition probabilities pik j (uk ).
(2) Generate control uk+1 from

      uk+1 = arg min_{u∈U (ik+1 )} Q̃(ik+1 , u, rk )

(3) Update the parameter vector via

rk+1 = rk − (LSPE or TD-like correction)


• Complex behavior, unclear validity (oscilla-
tions, etc). There is solid basis for an important
special case: optimal stopping (see text)
BELLMAN EQUATION ERROR APPROACH

• Another model-free approach for approximate


evaluation of policy µ: Approximate Qµ (i, u) with
Q̃µ (i, u; rµ ) = φ(i, u)′ rµ , obtained from
      rµ ∈ arg min_r kΦr − Fµ (Φr)k2ξ
where k · kξ is Euclidean norm, weighted with re-
spect to some distribution ξ.
• Implementation for deterministic problems:
(1) Generate a large set of sample pairs (ik , uk ),
and corresponding deterministic costs g(ik , uk ) and transitions (jk , µ(jk )) (a simulator may be used for this).
(2) Solve the linear least squares problem:
      min_r Σ_{(ik ,uk )} ( φ(ik , uk )′ r − ( g(ik , uk ) + αφ(jk , µ(jk ))′ r ) )^2

• For stochastic problems a similar (more com-


plex) least squares approach works. It is closely
related to LSTD (but less attractive; see the text).
• Because this approach is model-free, it is often
used as the basis for adaptive control of systems
with unknown dynamics.
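
A minimal sketch of steps (1)-(2) for a synthetic deterministic problem with linear Q-factor features (all names and data below are illustrative assumptions, not the text's):

import numpy as np

# Bellman equation error evaluation of a fixed policy mu on a deterministic MDP:
# solve min_r sum ( (phi(i,u) - alpha*phi(j, mu(j)))'r - g(i,u) )^2.
rng = np.random.default_rng(9)
n, m, s, alpha = 30, 3, 6, 0.9
f = rng.integers(n, size=(n, m))            # deterministic successor f(i, u)
g = rng.random((n, m))                      # deterministic cost g(i, u)
phi = rng.standard_normal((n, m, s))        # features phi(i, u)
mu = rng.integers(m, size=n)                # fixed policy to be evaluated

rows, rhs = [], []
for _ in range(2000):                       # sample pairs (i_k, u_k)
    i, u = rng.integers(n), rng.integers(m)
    j = f[i, u]                             # next state, then policy control mu(j)
    rows.append(phi[i, u] - alpha * phi[j, mu[j]])
    rhs.append(g[i, u])
r_mu, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
Q_tilde = phi @ r_mu                        # approximate Q-factors of mu, shape (n, m)
print(Q_tilde[:3])
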
ADAPTIVE CONTROL BASED ON ADP
LINEAR-QUADRATIC PROBLEM

• System: xk+1 = Axk +Buk , xk ∈ ℜn , uk ∈ ℜm


• Cost: Σ_{k=0}^{∞} ( xk′ Qxk + uk′ Ruk ),   Q ≥ 0, R > 0

• Optimal policy is linear: µ∗ (x) = Lx


• The Q-factor of each linear policy µ is quadratic:
 
      Qµ (x, u) = ( x′ u′ ) Kµ ( x′ u′ )′      (∗)

• We will consider A and B unknown


• We represent Q-factors using as basis func-
tions all the quadratic functions involving state
and control components

xi xj , ui uj , xi uj , ∀ i, j

These are the “rows” φ(x, u)′ of Φ


• The Q-factor Qµ of a linear policy µ can be ex-
actly represented within the approximation sub-
space:
Qµ (x, u) = φ(x, u)′ rµ
where rµ consists of the components of Kµ in (*)
PI FOR LINEAR-QUADRATIC PROBLEM

• Policy evaluation: rµ is found by the Bellman


error approach
      min_r Σ_{(xk ,uk )} ( φ(xk , uk )′ r − ( xk′ Qxk + uk′ Ruk + φ(xk+1 , µ(xk+1 ))′ r ) )^2

where (xk , uk , xk+1 ) are many samples generated


by the system or a simulator of the system.
• Policy improvement:

      µ(x) ∈ arg min_u φ(x, u)′ rµ

• Knowledge of A and B is not required


• If the policy evaluation is done exactly, this
becomes exact PI, and convergence to an optimal
policy can be shown
• The basic idea of this example has been gener-
alized and forms the starting point of the field of
adaptive dynamic programming
• This field deals with adaptive control of continuous-
space (possibly nonlinear) dynamic systems, in
both discrete and continuous time
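
A hedged sketch of this PI scheme for a scalar linear-quadratic problem, with quadratic basis functions φ(x, u) = (x2 , u2 , xu) and the Bellman error evaluation above; the system coefficients a, b are used only to simulate transitions and are treated as unknown by the algorithm (all numbers are illustrative assumptions):

import numpy as np

# PI for a scalar LQ problem via Q-factor approximation and Bellman error fits.
rng = np.random.default_rng(7)
a, b, q, rho = 0.9, 0.5, 1.0, 0.5        # system x_{k+1} = a x + b u, cost q x^2 + rho u^2

def phi(x, u):
    return np.array([x * x, u * u, x * u])

L = -0.5                                  # initial stabilizing linear policy u = L x
for it in range(10):
    # policy evaluation: Bellman error least squares over simulated samples
    rows, rhs = [], []
    for _ in range(200):
        x, u = rng.normal(), rng.normal()         # exploratory samples (x_k, u_k)
        x_next = a * x + b * u                    # simulated transition
        rows.append(phi(x, u) - phi(x_next, L * x_next))
        rhs.append(q * x * x + rho * u * u)
    r_mu, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
    k_xx, k_uu, k_xu = r_mu[0], r_mu[1], r_mu[2] / 2.0   # Q_mu = k_xx x^2 + k_uu u^2 + 2 k_xu x u
    # policy improvement: minimize the quadratic Q_mu(x, u) over u
    L = -k_xu / k_uu

K = 1.0                                   # compare with the scalar Riccati solution
for _ in range(1000):
    K = q + a * a * K - (a * b * K) ** 2 / (rho + b * b * K)
L_opt = -a * b * K / (rho + b * b * K)
print(L, L_opt)                           # the two gains should be close

With a stabilizing initial gain and exact (zero-residual) evaluation, the computed gain approaches the optimal gain obtained from the Riccati equation.
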
APPROXIMATION IN POLICY SPACE
APPROXIMATION IN POLICY SPACE

• We parametrize policies by a vector r = (r1 , . . . , rs )


(an approximation architecture for policies).

• Each policy µ̃(r) = { µ̃(i; r) | i = 1, . . . , n } defines a cost vector Jµ̃(r) (a function of r).
• We optimize some measure of Jµ̃(r) over r.
• For example, use a random search, gradient, or
other method to minimize over r
      Σ_{i=1}^{n} ξi Jµ̃(r) (i),
where ξ1 , . . . , ξn are some state-dependent weights.
• An important special case: Introduce cost ap-
proximation architecture V (i; r) that defines indi-
rectly the parametrization of the policies
      µ̃(i; r) = arg min_{u∈U (i)} Σ_{j=1}^{n} pij (u)( g(i, u, j) + αV (j; r) ),   ∀ i

• This introduces state features into approxima-


tion in policy space.
• A policy approximator is called an actor, while a
cost approximator is also called a critic. An actor
and a critic may coexist.
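
A small numpy sketch of this special case (an indirect policy parametrization through V (j; r) = φ(j)′ r combined with crude random search over r; the MDP, weights, and search parameters are synthetic assumptions):

import numpy as np

# Approximation in policy space: each r defines a policy mu(.; r) through a cost
# architecture V(j; r) = phi(j)'r; r is improved by random local search on the
# weighted cost sum_i xi_i J_{mu(r)}(i), with exact policy evaluation.
rng = np.random.default_rng(8)
n, m, s, alpha = 15, 3, 3, 0.9
P = rng.random((m, n, n)); P /= P.sum(axis=2, keepdims=True)
g = rng.random((m, n, n))
Phi = rng.standard_normal((n, s))
xi = np.ones(n) / n                                  # state weights

def policy_from(r):
    # mu(i; r) = argmin_u sum_j p_ij(u) (g(i,u,j) + alpha * phi(j)'r)
    V = Phi @ r
    return np.argmin(np.einsum('uij,uij->iu', P, g + alpha * V[None, None, :]), axis=1)

def weighted_cost(r):
    mu = policy_from(r)
    P_mu = P[mu, np.arange(n), :]                    # rows p_{i.}(mu(i))
    g_mu = (P[mu, np.arange(n), :] * g[mu, np.arange(n), :]).sum(axis=1)
    J_mu = np.linalg.solve(np.eye(n) - alpha * P_mu, g_mu)   # exact evaluation of mu
    return xi @ J_mu

r = np.zeros(s)
best = weighted_cost(r)
for k in range(300):                                 # crude random neighborhood search
    cand = r + 0.3 * rng.standard_normal(s)
    c = weighted_cost(cand)
    if c < best:
        r, best = cand, c
print(best)
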
APPROXIMATION IN POLICY SPACE METHODS

• Random search methods are straightforward


and have scored some impressive successes with
challenging problems (e.g., tetris).
− At a given point/r they generate a random
collection of neighboring r. They search within
the neighborhood for better points.
− Many variations (the cross entropy method
is one).
− They are very broadly applicable (to discrete
and continuous search spaces).
− They are idiosyncratic.
• Gradient-type methods (known as policy gra-
dient methods) also have been used extensively.
− They move along the gradient with respect
to r of
      Σ_{i=1}^{n} ξi Jµ̃(r) (i)
− There are explicit gradient formulas which
can be approximated by simulation.
− Policy gradient methods generally suffer from slow convergence, local minima, and excessive simulation noise.
COMBINATION WITH APPROXIMATE PI

• Another possibility is to try to implement PI


within the class of parametrized policies.
• Given a policy/actor µ(i; rk ), we evaluate it
(perhaps approximately) with a critic that pro-
duces J˜µ , using some policy evaluation method.
• We then consider the policy improvement phase
      µ(i) ∈ arg min_u Σ_{j=1}^{n} pij (u)( g(i, u, j) + αJ˜µ (j) ),   ∀ i

and do it approximately via parametric optimiza-


tion
      min_r Σ_{i=1}^{n} ξi Σ_{j=1}^{n} pij (µ(i; r))( g(i, µ(i; r), j) + αJ˜µ (j) )

where ξi are some weights.


• This can be attempted by a gradient-type method
in the space of the parameter vector r.
• Schemes like this have been extensively applied
to continuous-space deterministic problems.
• Many unresolved theoretical issues, particularly
for stochastic problems.
FINAL WORDS
TOPICS THAT WE HAVE NOT COVERED

• Extensions to discounted semi-Markov, stochas-


tic shortest path problems, average cost problems,
sequential games ...
• Extensions to continuous-space problems
• Extensions to continuous-time problems
• Adaptive DP - Continuous-time deterministic
optimal control. Approximation of cost function
derivatives or cost function differences
• Random search methods for approximate policy
evaluation or approximation in policy space
• Basis function adaptation (automatic genera-
tion of basis functions, optimal selection of basis
functions within a parametric class)
• Simulation-based methods for general linear
problems, i.e., solution of linear equations, linear
least squares, etc - Monte-Carlo linear algebra
CONCLUDING REMARKS

• There is no clear winner among ADP methods


• There is interesting theory in all types of meth-
ods (which, however, does not provide ironclad
performance guarantees)
• There are major flaws in all methods:
− Oscillations and exploration issues in approx-
imate PI with projected equations
− Restrictions on the approximation architec-
ture in approximate PI with aggregation
− Flakiness of optimization in policy space ap-
proximation
• Yet these methods have impressive successes
to show with enormously complex problems, for
which there is often no alternative methodology
• There are also other competing ADP methods
(rollout is simple, often successful, and generally
reliable; approximate LP is worth considering)
• Theoretical understanding is important and
nontrivial
• Practice is an art and a challenge to our cre-
ativity!
THANK YOU
