APPROXIMATE DYNAMIC PROGRAMMING
TSINGHUA UNIVERSITY
JUNE 2014
DIMITRI P. BERTSEKAS
BRIEF OUTLINE I
• Our subject:
− Large-scale DP based on approximations and in part on simulation
− This has been a research area of great interest for the last 25 years, known under various names (e.g., reinforcement learning, neuro-dynamic programming)
− Emerged through an enormously fruitful cross-fertilization of ideas from artificial intelligence and optimization/control theory
− Deals with control of dynamic systems under uncertainty, but applies more broadly (e.g., discrete deterministic optimization)
− A vast range of applications in control theory, operations research, artificial intelligence, and beyond ...
− The subject is broad, with a rich variety of theory/math, algorithms, and applications. Our focus will be mostly on algorithms ... less on theory and modeling
BRIEF OUTLINE II
• Our aim:
− A state-of-the-art account of some of the major topics at a graduate level
− Show how to use approximation and simulation to address the dual curses of DP: dimensionality and modeling
• Our 6-lecture plan:
− Two lectures on exact DP with emphasis on infinite horizon problems and issues of large-scale computational methods
− One lecture on general issues of approximation and simulation for large-scale problems
− One lecture on approximate policy iteration based on temporal differences (TD)/projected equations/Galerkin approximation
− One lecture on aggregation methods
− One lecture on Q-learning and other methods, such as approximation in policy space
LECTURE 1
LECTURE OUTLINE
min_{u∈U} g(u)
• Discrete-time system
x_{k+1} = f_k(x_k, u_k, w_k), k = 0, 1, ..., N − 1
− k: Discrete time
− x_k: State; summarizes past information that is relevant for future optimization
− u_k: Control; decision to be selected at time k from a given set
− w_k: Random parameter (also called "disturbance" or "noise" depending on the context)
− N: Horizon or number of times control is applied
• Example (inventory control) of a discrete-time system:
x_{k+1} = f_k(x_k, u_k, w_k) = x_k + u_k − w_k
• Closed-loop control selects
u_k = µ_k(x_k), k = 0, ..., N − 1,
i.e., a policy {µ_0, µ_1, ..., µ_{N−1}} rather than a fixed (open-loop) control sequence {u_0, u_1, ..., u_{N−1}}
GENERIC FINITE-HORIZON PROBLEM
[Figure: the tail subproblem starts at state x_k at time k and runs to the end of the horizon N]
• DP algorithm: start with
J_N(x_N) = g_N(x_N)
Go backwards, k = N − 1, ..., 0, using
J_k(x_k) = min_{u_k∈U_k(x_k)} E_{w_k}[ g_k(x_k, u_k, w_k) + J_{k+1}(f_k(x_k, u_k, w_k)) ]
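• To make the backward recursion concrete, here is a minimal Python sketch for a toy inventory-style problem; the horizon, state/control ranges, demand distribution, and cost functions below are illustrative assumptions, not part of the lectures.

```python
# Backward DP recursion J_k(x) = min_u E_w[ g_k(x,u,w) + J_{k+1}(f_k(x,u,w)) ]
# for an assumed toy inventory model.

N = 10                           # horizon
STATES = range(0, 11)            # x_k: stock level, capped at 10
CONTROLS = range(0, 5)           # u_k: order quantity
DEMAND = [(0, 0.3), (1, 0.5), (2, 0.2)]    # (w_k, probability)

def f(x, u, w):                  # system: x_{k+1} = x_k + u_k - w_k, clipped
    return max(0, min(10, x + u - w))

def g(x, u, w):                  # stage cost: ordering plus holding/shortage
    return u + abs(x + u - w)

J = [dict() for _ in range(N + 1)]
J[N] = {x: 0.0 for x in STATES}  # terminal cost g_N(x_N) = 0 here
policy = [dict() for _ in range(N)]

for k in reversed(range(N)):     # go backwards, k = N-1, ..., 0
    for x in STATES:
        best_u, best_val = None, float("inf")
        for u in CONTROLS:
            # E_w[ g_k(x,u,w) + J_{k+1}(f_k(x,u,w)) ]
            val = sum(p * (g(x, u, w) + J[k + 1][f(x, u, w)])
                      for w, p in DEMAND)
            if val < best_val:
                best_u, best_val = u, val
        J[k][x] = best_val
        policy[k][x] = best_u

print(J[0][5], policy[0][5])     # optimal cost and decision from x_0 = 5
```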
• Some approaches:
(a) Problem Approximation: Use J̃_k derived from a related but simpler problem
(b) Parametric Cost-to-Go Approximation: Use as J̃_k a function of a suitable parametric form, whose parameters are tuned by some heuristic or systematic scheme (we will mostly focus on this)
− This is a major portion of Reinforcement Learning/Neuro-Dynamic Programming
(c) Rollout Approach: Use as J̃_k the cost of some suboptimal policy, which is calculated either analytically or by simulation
ROLLOUT ALGORITHMS
• One-step lookahead: at each k and state x_k, apply the control that attains
min_{u_k∈U_k(x_k)} E_{w_k}[ g_k(x_k, u_k, w_k) + J̃_{k+1}(f_k(x_k, u_k, w_k)) ]
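• A minimal Python sketch of the rollout idea: J̃_{k+1} is the simulated cost of a fixed, easily computed base policy. The model functions and the base policy below repeat the toy inventory assumptions from the previous sketch and are not from the lectures.

```python
import random

# Rollout: one-step lookahead with J~ = Monte Carlo cost of a base policy.

DEMAND = [(0, 0.3), (1, 0.5), (2, 0.2)]    # assumed demand model
CONTROLS = range(0, 5)

def f(x, u, w):                            # toy inventory dynamics
    return max(0, min(10, x + u - w))

def g(x, u, w):                            # toy stage cost
    return u + abs(x + u - w)

def base_policy(x):                        # heuristic: reorder when stock is low
    return 1 if x < 5 else 0

def rollout_cost(x, k, N, n_sims=200):
    """Monte Carlo estimate of the base policy's cost from state x at time k."""
    total = 0.0
    for _ in range(n_sims):
        xt, c = x, 0.0
        for _ in range(k, N):
            u = base_policy(xt)
            w = random.choices([d for d, _ in DEMAND],
                               [p for _, p in DEMAND])[0]
            c += g(xt, u, w)
            xt = f(xt, u, w)
        total += c
    return total / n_sims

def rollout_control(x, k, N=10):
    """Pick u minimizing E[ g(x,u,w) + J~_{k+1}(f(x,u,w)) ]."""
    def q(u):
        return sum(p * (g(x, u, w) + rollout_cost(f(x, u, w), k + 1, N))
                   for w, p in DEMAND)
    return min(CONTROLS, key=q)

print(rollout_control(x=3, k=0))
```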
• Stationary system
xk+1 = f (xk , uk , wk ), k = 0, 1, . . .
• Cost of a policy π = {µ_0, µ_1, ...} and of a stationary policy µ:
J_π(x) = lim_{N→∞} (T_{µ_0} T_{µ_1} ··· T_{µ_N} J_0)(x), J_µ(x) = lim_{N→∞} (T_µ^N J_0)(x)
• Bellman’s equation: J ∗ = T J ∗ , Jµ = Tµ Jµ
• Optimality condition:
µ: optimal <==> Tµ J ∗ = T J ∗
• Monotonicity: if J(x) ≤ J′(x) for all x, then
(T J)(x) ≤ (T J′)(x), ∀ x
• Proof sketch of Bellman's equation: with the stage cost bounded by M, for all k and x,
J∗(x) − α^k M/(1−α) ≤ (T^k J_0)(x) ≤ J∗(x) + α^k M/(1−α)
Applying T to this relation, and using monotonicity and the constant-shift property,
(T J∗)(x) − α^{k+1} M/(1−α) ≤ (T^{k+1} J_0)(x) ≤ (T J∗)(x) + α^{k+1} M/(1−α)
Taking the limit as k → ∞, we obtain J∗ = T J∗. Q.E.D.
THE CONTRACTION PROPERTY
max_x |(T_µ J)(x) − (T_µ J′)(x)| ≤ α max_x |J(x) − J′(x)|
Proof: Denote c = max_{x∈S} |J(x) − J′(x)|. Then
J(x) − c ≤ J′(x) ≤ J(x) + c, ∀ x
Applying T, and using monotonicity and the constant-shift property,
(T J)(x) − αc ≤ (T J′)(x) ≤ (T J)(x) + αc, ∀ x
Hence
|(T J)(x) − (T J′)(x)| ≤ αc, ∀ x.
Q.E.D.
• Note: This implies that J∗ is the unique solution of J∗ = T J∗, and Jµ is the unique solution of Jµ = Tµ Jµ
NEC. AND SUFFICIENT OPT. CONDITION
• µ is optimal if and only if µ(x) attains the minimum in Bellman's equation for every x:
T J∗ = Tµ J∗,
or, equivalently, for all x,
µ(x) ∈ arg min_{u∈U(x)} E_w[ g(x, u, w) + αJ∗(f(x, u, w)) ]
• Proof sketch: If T J∗ = Tµ J∗, then Bellman's equation J∗ = T J∗ gives
J∗ = Tµ J∗,
so by uniqueness of the fixed point of Tµ, J∗ = Jµ and µ is optimal. Conversely, if µ is optimal, then J∗ = Jµ, so
J∗ = Tµ J∗.
LECTURE 2
LECTURE OUTLINE
• Stationary system: x_{k+1} = f(x_k, u_k, w_k), k = 0, 1, ...
• Bellman's equation: J∗ = T J∗, Jµ = Tµ Jµ, or
J∗(x) = min_{u∈U(x)} E_w[ g(x, u, w) + αJ∗(f(x, u, w)) ], ∀ x
Jµ(x) = E_w[ g(x, µ(x), w) + αJµ(f(x, µ(x), w)) ], ∀ x
• Optimality condition:
µ: optimal <==> Tµ J∗ = T J∗
i.e.,
µ(x) ∈ arg min_{u∈U(x)} E_w[ g(x, u, w) + αJ∗(f(x, u, w)) ], ∀ x
• Contraction property:
max_x |(Tµ J)(x) − (Tµ J′)(x)| ≤ α max_x |J(x) − J′(x)|
• Policy iteration (PI): given µ^k, the policy evaluation step computes J_{µ^k} from
J_{µ^k}(x) = E_w[ g(x, µ^k(x), w) + αJ_{µ^k}(f(x, µ^k(x), w)) ], ∀ x
• Optimistic PI: evaluate the current policy only approximately, using m_k applications of Tµ, i.e., Jµ ≈ Tµ^{m_k} J
• If m_k ≡ 1 it becomes VI
• If m_k = ∞ it becomes PI
• Converges for discounted problems with both finite and infinite state spaces (in an infinite number of iterations)
• Typically works faster than VI and PI (for large problems)
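• A compact Python sketch of optimistic PI for a small discounted problem; the MDP data (P[u]: transition matrix, G[u]: expected stage cost under control u) are randomly generated placeholders, not from the lectures. With m = 1 the inner loop is exactly VI; with large m it approximates PI.

```python
import numpy as np

# Optimistic PI: alternate policy improvement with m applications of T_mu.

n, n_u, alpha = 4, 2, 0.9
rng = np.random.default_rng(0)
P, G = {}, {}
for u in range(n_u):                       # assumed random MDP data
    M = rng.random((n, n))
    P[u] = M / M.sum(axis=1, keepdims=True)
    G[u] = rng.random(n)

def T_mu(J, mu):    # (Tmu J)(i) = g(i, mu(i)) + alpha * sum_j p_ij(mu(i)) J(j)
    return np.array([G[mu[i]][i] + alpha * P[mu[i]][i] @ J for i in range(n)])

def greedy(J):      # policy attaining the min in Bellman's equation
    Q = np.stack([G[u] + alpha * P[u] @ J for u in range(n_u)])
    return Q.argmin(axis=0)

def optimistic_pi(m=5, iters=200):
    J = np.zeros(n)
    for _ in range(iters):
        mu = greedy(J)                     # policy improvement
        for _ in range(m):                 # m_k applications of T_mu
            J = T_mu(J, mu)
    return J, greedy(J)

J, mu = optimistic_pi()
print(J, mu)
```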
APPROXIMATE PI
• Suppose the policy evaluation and policy improvement steps satisfy
‖J_k − J_{µ^k}‖ ≤ δ, k = 0, 1, ...
‖T_{µ^{k+1}} J_k − T J_k‖ ≤ ε, k = 0, 1, ...
• Error bound:
lim sup_{k→∞} ‖J_{µ^k} − J∗‖ ≤ (ε + 2αδ)/(1−α)²
• If, in addition, the generated policy sequence converges to some µ, the bound improves to
‖Jµ − J∗‖ ≤ (ε + 2αδ)/(1−α)
Q-FACTORS I
• Optimal Q-factors:
Q∗(x, u) = E_w[ g(x, u, w) + αJ∗(x̄) ], with x̄ = f(x, u, w)
• They define the optimal costs via J∗(x) = min_{u∈U(x)} Q∗(x, u)
Q-FACTORS II
• Q-factors of a policy µ:
Qµ(x, u) = E_w[ g(x, u, w) + αQµ(x̄, µ(x̄)) ], where x̄ = f(x, u, w)
• VI and PI for Q-factors are mathematically equivalent to VI and PI for costs
• They require an equal amount of computation ... they just need more storage
• Having the optimal Q-factors is convenient when implementing an optimal policy on-line by µ∗(x) ∈ arg min_{u∈U(x)} Q∗(x, u)
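• As an illustration of the equivalence, a small Python sketch of VI on Q-factors; the MDP data are the same kind of randomly generated placeholders as in the earlier sketch.

```python
import numpy as np

# Value iteration on Q-factors: Q <- FQ, where F is a sup-norm contraction.

n, n_u, alpha = 4, 2, 0.9
rng = np.random.default_rng(0)
P, G = {}, {}
for u in range(n_u):                       # assumed random MDP data
    M = rng.random((n, n))
    P[u] = M / M.sum(axis=1, keepdims=True)
    G[u] = rng.random(n)

def F(Q):
    """(FQ)(i,u) = sum_j p_ij(u) ( g(i,u,j) + alpha * min_u' Q(j,u') )."""
    Jmin = Q.min(axis=1)                   # min over u' of Q(j, u')
    return np.stack([G[u] + alpha * P[u] @ Jmin for u in range(n_u)], axis=1)

Q = np.zeros((n, n_u))
for _ in range(500):
    Q = F(Q)                               # converges since F is a contraction

print(Q.argmin(axis=1))                    # on-line policy: argmin_u Q*(i, u)
```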
• General contraction framework: if F is a contraction with modulus ρ and fixed point J∗, VI converges geometrically:
‖F^k J − J∗‖ ≤ ρ^k ‖J − J∗‖, k = 1, 2, ...
• Discounted problems are the special case
H(x, u, J) = E_w[ g(x, u, w) + αJ(f(x, u, w)) ]
for which
‖Tµ J − Tµ J′‖ ≤ α‖J − J′‖
• Optimistic PI (policy evaluation with m_k applications of T_{µ^k}):
− If m_k ≡ 1 it becomes VI
− If m_k = ∞ it becomes PI
− For intermediate values of m_k, it is generally more efficient than either VI or PI
ASYNCHRONOUS ALGORITHMS
• Partition the states into subsets X_1, ..., X_m, and decompose J accordingly:
J = (J_1, ..., J_m),
where J_ℓ is the restriction of J on the set X_ℓ.
• Synchronous VI algorithm: all components are updated at once,
T(J_1^t, ..., J_m^t) = T J^t
• Asynchronous VI algorithm: with R_ℓ the set of times at which component ℓ is updated, and "delayed" component versions τ_{ℓj}(t) ≤ t,
J_ℓ^{t+1}(x) = T(J_1^{τ_{ℓ1}(t)}, ..., J_m^{τ_{ℓm}(t)})(x) if t ∈ R_ℓ, J_ℓ^{t+1}(x) = J_ℓ^t(x) if t ∉ R_ℓ
• Gauss-Seidel special case: update one state at a time, in cyclic order
{x_0, x_1, ...} = {1, ..., n, 1, ..., n, 1, ...}
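• A minimal Python sketch of the Gauss-Seidel special case, always using the latest component values (placeholder MDP data as in the earlier sketches):

```python
import numpy as np

# Gauss-Seidel (asynchronous) VI: update one state at a time, cyclically.

n, n_u, alpha = 4, 2, 0.9
rng = np.random.default_rng(0)
P, G = {}, {}
for u in range(n_u):                       # assumed random MDP data
    M = rng.random((n, n))
    P[u] = M / M.sum(axis=1, keepdims=True)
    G[u] = rng.random(n)

J = np.zeros(n)
for sweep in range(200):
    for i in range(n):                     # cyclic order {1, ..., n, 1, ...}
        J[i] = min(G[u][i] + alpha * P[u][i] @ J for u in range(n_u))

print(J)
```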
• Assumptions: there is a sequence of nested sets {S(k)} containing J∗, such that:
(1) Synchronous Convergence Condition:
T J ∈ S(k+1), ∀ J ∈ S(k), k = 0, 1, ...
(2) Box Condition: For all k, S(k) is a Cartesian product of the form
S(k) = S_1(k) × ··· × S_m(k)
• Interpretation of assumptions:
[Figure: for J = (J_1, J_2), T maps the box S(k) = S_1(k) × S_2(k) into the smaller box S(k+1); the boxes S(0) ⊃ S(k) ⊃ S(k+1) ⊃ ... shrink toward J∗]
• Convergence mechanism:
[Figure: component-by-component iterations on J_1 and J_2 remain within the current box S(k) and eventually enter S(k+1), so the asynchronous iterates converge to J∗]
LECTURE 3
LECTURE OUTLINE
• Review of discounted DP
• Introduction to approximate DP
• Approximation architectures
• Simulation-based approximate policy iteration
• Approximate policy evaluation
• Some general issues about approximation and
simulation
REVIEW
DISCOUNTED PROBLEMS/BOUNDED COST
xk+1 = f (xk , uk , wk ), k = 0, 1, . . .
• Shorthand notation for DP mappings:
(T J)(i) = min_{u∈U(i)} Σ_{j=1}^n p_ij(u) ( g(i, u, j) + αJ(j) ), i = 1, ..., n
(Tµ J)(i) = Σ_{j=1}^n p_ij(µ(i)) ( g(i, µ(i), j) + αJ(j) ), i = 1, ..., n
“SHORTHAND” THEORY – A SUMMARY
• Bellman's equation: J∗ = T J∗, Jµ = Tµ Jµ, or
J∗(i) = min_{u∈U(i)} Σ_{j=1}^n p_ij(u) ( g(i, u, j) + αJ∗(j) ), ∀ i
Jµ(i) = Σ_{j=1}^n p_ij(µ(i)) ( g(i, µ(i), j) + αJµ(j) ), ∀ i
• Optimality condition:
µ: optimal <==> Tµ J∗ = T J∗
i.e.,
µ(i) ∈ arg min_{u∈U(i)} Σ_{j=1}^n p_ij(u) ( g(i, u, j) + αJ∗(j) ), ∀ i
THE TWO MAIN ALGORITHMS: VI AND PI
APPROXIMATION IN POLICY SPACE
APPROXIMATION IN VALUE SPACE
[Figure: chess position evaluator; features such as material balance, mobility, and safety are extracted from the position and combined through feature weighting into a score]
[Figure: feature-based architecture; state i → feature extraction mapping → feature vector φ(i) → linear cost approximator φ(i)′r]
• A lookup table representation:
J̃(i; r) = r_i, i ∈ I
DIRECT POLICY EVALUATION
[Figure: subspace S = {Φr | r ∈ ℜˢ}. Direct method: projection ΠJµ of the cost vector Jµ onto S. Indirect method: solving a projected form of Bellman's equation on S]
[Hard aggregation example: states 1, ..., 9 grouped into aggregate states x1 = {1, 2, 4, 5}, x2 = {3, 6}, x3 = {7, 8}, x4 = {9}; Φ is the corresponding 9×4 matrix of 0/1 membership indicators (see Example I in Lecture 5)]
Φr = ΦD Tµ(Φr)
THEORETICAL BASIS OF APPROXIMATE PI
• If the policy evaluation and policy improvement errors satisfy
max_i |J̃(i, r_k) − J_{µ^k}(i)| ≤ δ, k = 0, 1, ...
max_i |(T_{µ^{k+1}} J̃)(i, r_k) − (T J̃)(i, r_k)| ≤ ε, k = 0, 1, ...
then
lim sup_{k→∞} max_i |J_{µ^k}(i) − J∗(i)| ≤ (ε + 2αδ)/(1−α)²
• Monte Carlo estimation (MCE) approach to x = Ax + b: form simulation-based estimates A_m, b_m, and set
x_m = (I − A_m)^{-1} b_m
or solve the estimated equation iteratively
• TD(λ) and Q-learning are SA (stochastic approximation) methods
• LSTD(λ) and LSPE(λ) are MCE methods
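• A toy Python sketch of the MCE approach: average noisy samples of A and b, then solve the sampled equation. The true data and the sampling noise model below are assumptions for illustration.

```python
import numpy as np

# MCE for x = Ax + b: average noisy samples A_m, b_m, solve (I - A_m) x = b_m.

rng = np.random.default_rng(1)
A = np.array([[0.2, 0.1], [0.0, 0.3]])        # true (in practice, unknown) data
b = np.array([1.0, 2.0])

m = 10_000
A_m = np.mean([A + 0.1 * rng.standard_normal(A.shape) for _ in range(m)], axis=0)
b_m = np.mean([b + 0.1 * rng.standard_normal(b.shape) for _ in range(m)], axis=0)

x_m = np.linalg.solve(np.eye(2) - A_m, b_m)   # sampled-equation solution
print(x_m, np.linalg.solve(np.eye(2) - A, b)) # compare with the exact solution
```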
COSTS OR COST DIFFERENCES?
• Policy improvement compares Q-factors, which involves cost differences of the form
E_w[ g(x, u, w) − g(x, u′, w) + α( Jµ(x̄) − Jµ(x̄′) ) ]
• The cost difference Dµ(x, x′) = Jµ(x) − Jµ(x′) satisfies
Dµ(x, x′) = E_w[ Gµ(x, x′, w) + αDµ(x̄, x̄′) ]
where x̄ = f(x, µ(x), w), x̄′ = f(x′, µ(x′), w), and
Gµ(x, x′, w) = g(x, µ(x), w) − g(x′, µ(x′), w)
• Example:
Qµ(x, u) = 5x²/4 + δ( 9x²/4 + u² + (5/2)xu ) + O(δ²)
LECTURE 4
LECTURE OUTLINE
[Figure: transition diagram of a finite-state Markov chain, with transition probabilities p_ii(u), p_ij(u), p_ji(u), p_jj(u)]
• Discounted problems with α ∈ [0, 1)
• Shorthand notation for DP mappings
(T J)(i) = min_{u∈U(i)} Σ_{j=1}^n p_ij(u) ( g(i, u, j) + αJ(j) ), i = 1, ..., n
(Tµ J)(i) = Σ_{j=1}^n p_ij(µ(i)) ( g(i, µ(i), j) + αJ(j) ), i = 1, ..., n
“SHORTHAND” THEORY – A SUMMARY
• Bellman's equation: J∗ = T J∗, Jµ = Tµ Jµ, or
J∗(i) = min_{u∈U(i)} Σ_{j=1}^n p_ij(u) ( g(i, u, j) + αJ∗(j) ), ∀ i
Jµ(i) = Σ_{j=1}^n p_ij(µ(i)) ( g(i, µ(i), j) + αJµ(j) ), ∀ i
• Optimality condition:
µ: optimal <==> Tµ J∗ = T J∗
i.e.,
µ(i) ∈ arg min_{u∈U(i)} Σ_{j=1}^n p_ij(u) ( g(i, u, j) + αJ∗(j) ), ∀ i
THE TWO MAIN ALGORITHMS: VI AND PI
• Linear approximation architecture: J̃(r) = Φr
• We have
J̃(i; r) = φ(i)′r, i = 1, ..., n
where φ(i)′ is the i-th row of Φ
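• A small Python sketch of such a linear architecture; the feature choice (1, i, i²) and the target cost vector are arbitrary illustrative assumptions.

```python
import numpy as np

# Linear cost approximation J~(i; r) = phi(i)' r, with rows of Phi the
# feature vectors phi(i)'.

n = 9
Phi = np.array([[1.0, i, i**2] for i in range(1, n + 1)])   # assumed features

def J_tilde(r):
    return Phi @ r                        # J~(r) = Phi r

# Direct least-squares fit of r to a known target cost vector:
J_target = np.log(np.arange(1, n + 1))    # assumed target
r_star, *_ = np.linalg.lstsq(Phi, J_target, rcond=None)
print(J_tilde(r_star))
```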
[Figure: feature-based architecture; state i → feature extraction mapping → feature vector φ(i) → linear cost approximator φ(i)′r]
[Figure: fitted value iteration; J̃_0 → T J̃_0 → J̃_1 → T J̃_1 → J̃_2 → ...]
• Subspace S = {Φr | r ∈ ℜˢ}; the projection of J with respect to the weighted norm ‖·‖_ξ is
ΠJ = Φr∗
where
r∗ = arg min_{r∈ℜˢ} ‖Φr − J‖²_ξ
• Error bound: if each fitted VI step incurs an approximation error of at most δ in sup-norm, then
lim sup_{k→∞} max_{i=1,...,n} |J̃_k(i, r_k) − J∗(i)| ≤ 2αδ/(1−α)²
[Figure: fitted value iteration with projection; J̃_1 = Π_ξ(T J̃_0), J̃_2 = Π_ξ(T J̃_1), J̃_3 = Π_ξ(T J̃_2), ...]
[Figure: subspace S = {Φr | r ∈ ℜˢ}. Direct method: projection of the cost vector Jµ. Indirect method: solving a projected form of Bellman's equation]
• Indirect method: solve the projected equation
Φr = ΠTµ(Φr)
• Error bound:
‖Jµ − Φr∗‖_ξ ≤ (1/√(1−α²)) ‖Jµ − ΠJµ‖_ξ
PRELIMINARIES: PROJECTION PROPERTIES
[Figure: projection ΠJ of a vector J onto the subspace S = {Φr | r ∈ ℜˢ}]
• The projection Π is nonexpansive with respect to ‖·‖_ξ:
‖ΠJ − ΠJ̄‖_ξ ≤ ‖J − J̄‖_ξ, for all J, J̄ ∈ ℜⁿ
(by the Pythagorean theorem, ‖J − J̄‖²_ξ = ‖Π(J − J̄)‖²_ξ + ‖(I − Π)(J − J̄)‖²_ξ ≥ ‖ΠJ − ΠJ̄‖²_ξ)
PROOF OF CONTRACTION PROPERTY
‖ΠTµ J − ΠTµ J̄‖_ξ ≤ ‖Tµ J − Tµ J̄‖_ξ = α‖P(J − J̄)‖_ξ ≤ α‖J − J̄‖_ξ
• Proposition: the solution Φr∗ of the projected equation satisfies
‖Jµ − Φr∗‖_ξ ≤ (1/√(1−α²)) ‖Jµ − ΠJµ‖_ξ
Proof: We have
‖Jµ − Φr∗‖²_ξ = ‖Jµ − ΠJµ‖²_ξ + ‖ΠJµ − Φr∗‖²_ξ
= ‖Jµ − ΠJµ‖²_ξ + ‖ΠT Jµ − ΠT(Φr∗)‖²_ξ
≤ ‖Jµ − ΠJµ‖²_ξ + α²‖Jµ − Φr∗‖²_ξ,
where
− The first equality uses the Pythagorean Theorem
− The second equality holds because Jµ is the fixed point of T and Φr∗ is the fixed point of ΠT
− The inequality uses the contraction property of ΠT
Rearranging gives (1 − α²)‖Jµ − Φr∗‖²_ξ ≤ ‖Jµ − ΠJµ‖²_ξ, which yields the bound. Q.E.D.
SIMULATION-BASED SOLUTION OF
PROJECTED EQUATION
MATRIX FORM OF PROJECTED EQUATION
Tµ (Φr) = g + αP Φr
• Projected equation: Φr = Π_ξ Tµ(Φr), i.e., Φr is the projection of Tµ(Φr) onto the subspace S = {Φr | r ∈ ℜˢ}
• In matrix form, the solution r∗ satisfies Cr∗ = d, where
C = Φ′Ξ(I − αP)Φ, d = Φ′Ξg
but computing C and d is HARD (high-dimensional inner products).
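• For small n the matrix form can of course be computed exactly; a Python sketch, with a placeholder chain, costs, and features, and a uniform weight vector ξ standing in for the steady-state distribution used in practice:

```python
import numpy as np

# Exact matrix form of the projected equation: C r* = d.

n, s, alpha = 9, 3, 0.9
rng = np.random.default_rng(2)
P = rng.random((n, n)); P /= P.sum(axis=1, keepdims=True)   # assumed chain
g = rng.random(n)                                           # assumed costs
Phi = np.array([[1.0, i, i**2] for i in range(1, n + 1)])   # assumed features

xi = np.ones(n) / n        # weights; in practice ξ is the steady-state dist.
Xi = np.diag(xi)

C = Phi.T @ Xi @ (np.eye(n) - alpha * P) @ Phi
d = Phi.T @ Xi @ g
r_star = np.linalg.solve(C, d)       # Phi r* solves the projected equation
print(Phi @ r_star)
```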
SOLUTION OF PROJECTED EQUATION
• Projected value iteration (PVI): Φr_{k+1} = Π_ξ Tµ(Φr_k)
[Figure: Tµ(Φr_k) is projected back onto S, the subspace spanned by the basis functions, to obtain Φr_{k+1}]
which yields
r_{k+1} = r_k − (Φ′ΞΦ)^{-1}(Cr_k − d)
SIMULATION-BASED IMPLEMENTATIONS
• Simulation-based approximations
C_k ≈ C, d_k ≈ d
and iteration
r_{k+1} = r_k − G_k(C_k r_k − d_k)
where
G_k ≈ (Φ′ΞΦ)^{-1}
This is the LSPE (Least Squares Policy Evaluation) method.
• Key fact: C_k, d_k, and G_k can be computed with low-dimensional linear algebra (of order s, the number of basis functions).
SIMULATION MECHANICS
• Generate a single long trajectory (i_0, i_1, ...) of the Markov chain and form
C_k = (1/(k+1)) Σ_{t=0}^k φ(i_t) ( φ(i_t) − αφ(i_{t+1}) )′ ≈ Φ′Ξ(I − αP)Φ = C
d_k = (1/(k+1)) Σ_{t=0}^k φ(i_t) g(i_t, i_{t+1}) ≈ Φ′Ξg = d
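• A Python sketch of these simulation mechanics for λ = 0 (LSTD): estimate C_k, d_k from one trajectory and solve C_k r = d_k. The chain, costs, and features are placeholders, as in the earlier sketch.

```python
import numpy as np

# LSTD (lambda = 0): trajectory-based estimates of C and d, then solve.

n, s, alpha = 9, 3, 0.9
rng = np.random.default_rng(3)
P = rng.random((n, n)); P /= P.sum(axis=1, keepdims=True)   # assumed chain
g = rng.random(n)                                           # assumed costs
Phi = np.array([[1.0, i, i**2] for i in range(1, n + 1)])   # assumed features

def lstd(k=100_000):
    i = 0
    C_k = np.zeros((s, s)); d_k = np.zeros(s)
    for _ in range(k):
        j = rng.choice(n, p=P[i])                 # simulate i_{t+1}
        C_k += np.outer(Phi[i], Phi[i] - alpha * Phi[j])
        d_k += Phi[i] * g[i]
        i = j
    # Trajectory averages weight states by the chain's steady-state
    # distribution, so C_k -> C and d_k -> d.
    return C_k / k, d_k / k

C_k, d_k = lstd()
r = np.linalg.solve(C_k, d_k)
print(Phi @ r)                                    # approximate J_mu
```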
• T^ℓ and T^{(λ)} have the same fixed point Jµ; ΠT^{(λ)} is a contraction with modulus α_λ = α(1−λ)/(1−αλ), and note that α_λ → 0 as λ → 1
• Error bound:
‖Jµ − Φr∗_λ‖_ξ ≤ (1/√(1−α_λ²)) ‖Jµ − ΠJµ‖_ξ
[Figure: as λ increases from 0 toward 1, the projected-equation solution Φr∗_λ moves closer to ΠJµ]
• Here T^{(λ)} J = g^{(λ)} + αP^{(λ)} J, with
P^{(λ)} = (1 − λ) Σ_{ℓ=0}^∞ α^ℓ λ^ℓ P^{ℓ+1}, g^{(λ)} = Σ_{ℓ=0}^∞ α^ℓ λ^ℓ P^ℓ g
• The simulation process to obtain C_k^{(λ)} and d_k^{(λ)} is similar to the case λ = 0 (single simulation trajectory i_0, i_1, ..., but more complex formulas):
C_k^{(λ)} = (1/(k+1)) Σ_{t=0}^k φ(i_t) Σ_{m=t}^k α^{m−t} λ^{m−t} ( φ(i_m) − αφ(i_{m+1}) )′
d_k^{(λ)} = (1/(k+1)) Σ_{t=0}^k φ(i_t) Σ_{m=t}^k α^{m−t} λ^{m−t} g_{i_m}
LECTURE 5
LECTURE OUTLINE
[Figure: transition diagram of a finite-state Markov chain, with transition probabilities p_ii(u), p_ij(u), p_ji(u), p_jj(u)]
• Discounted problems with α ∈ [0, 1)
• Shorthand notation for DP mappings
(T J)(i) = min_{u∈U(i)} Σ_{j=1}^n p_ij(u) ( g(i, u, j) + αJ(j) ), i = 1, ..., n
(Tµ J)(i) = Σ_{j=1}^n p_ij(µ(i)) ( g(i, µ(i), j) + αJ(j) ), i = 1, ..., n
APPROXIMATE PI
J̃µ(r) = Φr
[Figure: policy oscillation; the parameter iterates rµ1, rµ2, rµ3, ... move between the greedy partition regions Rµ1, Rµ2, Rµ3, ...]
• General approximate policy evaluation: solve
Φr = (W Tµ)(Φr)
where W is an n×n matrix (W = Π gives the projected equation; W = ΦD gives aggregation)
[Figure: example Φ matrix for aggregation of states 1, ..., 9 into four aggregate states x1, x2, x3, x4 (see Example I below)]
• Approximate cost:
J̃(j) = Σ_y φ_jy R̂(y), ∀ j
EXAMPLE I: HARD AGGREGATION
• Group the states {1, ..., 9} into four aggregate states: x1 = {1, 2, 4, 5}, x2 = {3, 6}, x3 = {7, 8}, x4 = {9}. Then
Φ =
1 0 0 0
1 0 0 0
0 1 0 0
1 0 0 0
1 0 0 0
0 1 0 0
0 0 1 0
0 0 1 0
0 0 0 1
[Figures: aggregation frameworks; (i) aggregate states defined by special/representative states, (ii) aggregate states as subsets of original states, (iii) aggregate states defined via features; coarse-grid example with representative states among 0, 1, 2, ..., 49]
• Restrictions:
− The aggregate states/subsets are disjoint.
− The disaggregation probabilities satisfy d_xi > 0 if and only if i ∈ x.
− The aggregation probabilities satisfy φ_jy = 1 for all j ∈ y.
• Hard aggregation is a special case: ∪_x S_x = {1, ..., n}
• Aggregation with representative states is a special case: S_x consists of just one state
APPROXIMATE PI BY AGGREGATION
[Figure: the aggregate problem; original system states i, j evolve according to p_ij(u) with cost g(i, u, j), and are linked to aggregate states x, y via the disaggregation probabilities d_xi (matrix D) and the aggregation probabilities φ_jy (matrix Φ)]
• Once the aggregate problem is solved for R∗, the costs are approximated by
J̃1(j) = Σ_{y∈A} φ_jy R∗(y), j = 1, ..., n.
J(i) = min_{u∈U(i)} H_ℓ(i, u, J, R), R(ℓ) = Σ_{i∈S_ℓ} d_ℓi J(i),
∀ i ∈ S_ℓ, ℓ = 1, ..., m.
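• A Python sketch of aggregation-based policy evaluation in the hard aggregation case: D and Φ are built from an assumed partition, and R is obtained by iterating the contraction R ← D(g + αPΦR). The chain, costs, and partition are placeholders.

```python
import numpy as np

# Hard aggregation: solve R = D T_mu(Phi R) = D (g + alpha * P Phi R).

n, alpha = 9, 0.9
groups = [[0, 1, 2], [3, 4, 5], [6, 7], [8]]       # assumed partition
m = len(groups)

Phi = np.zeros((n, m))
D = np.zeros((m, n))
for x, S_x in enumerate(groups):
    for i in S_x:
        Phi[i, x] = 1.0                            # phi_{jy} = 1 iff j in y
        D[x, i] = 1.0 / len(S_x)                   # d_{xi} > 0 iff i in x

rng = np.random.default_rng(4)
P = rng.random((n, n)); P /= P.sum(axis=1, keepdims=True)   # assumed chain
g = rng.random(n)                                           # assumed costs

R = np.zeros(m)
for _ in range(1000):
    R = D @ (g + alpha * P @ (Phi @ R))            # contraction iteration

J_tilde = Phi @ R                                  # J~(j) = sum_y phi_{jy} R(y)
print(J_tilde)
```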
LECTURE 6
LECTURE OUTLINE
[Figure: transition diagram of a finite-state Markov chain, with transition probabilities p_ii(u), p_ij(u), p_ji(u), p_jj(u)]
• Discounted problems with α ∈ [0, 1)
• Shorthand notation for DP mappings
(T J)(i) = min_{u∈U(i)} Σ_{j=1}^n p_ij(u) ( g(i, u, j) + αJ(j) ), i = 1, ..., n
(Tµ J)(i) = Σ_{j=1}^n p_ij(µ(i)) ( g(i, µ(i), j) + αJ(j) ), i = 1, ..., n
BELLMAN EQUATIONS FOR Q-FACTORS
• Equivalently Q∗ = F Q∗, where
(F Q)(i, u) = Σ_{j=1}^n p_ij(u) ( g(i, u, j) + α min_{u′∈U(j)} Q(j, u′) )
• Equivalently Qµ = Fµ Qµ, where
(Fµ Q)(i, u) = Σ_{j=1}^n p_ij(u) ( g(i, u, j) + αQ(j, µ(j)) )
• Fµ is a sup-norm contraction.
• Generic stochastic approximation algorithm:
− Consider a generic fixed point problem involving an expectation:
x = E_w[ f(x, w) ]
− Assume E_w[ f(x, w) ] is a contraction with respect to some norm, so the iteration
x_{k+1} = E_w[ f(x_k, w) ]
converges to the unique fixed point
− The SA version replaces the expectation with a single sample and a diminishing stepsize γ_k:
x_{k+1} = (1 − γ_k) x_k + γ_k f(x_k, w_k)
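• A Python sketch of the resulting Q-learning iteration, i.e., the SA method applied to Q∗ = F Q∗ using simulated transitions; the MDP data below are randomly generated placeholders.

```python
import numpy as np

# Q-learning: Q(i,u) <- Q(i,u) + gamma * ( sample of (FQ)(i,u) - Q(i,u) ).

n, n_u, alpha = 4, 2, 0.9
rng = np.random.default_rng(5)
P, G = {}, {}
for u in range(n_u):                               # assumed random MDP data
    M = rng.random((n, n))
    P[u] = M / M.sum(axis=1, keepdims=True)
    G[u] = rng.random(n)

Q = np.zeros((n, n_u))
for k in range(200_000):
    i, u = rng.integers(n), rng.integers(n_u)      # sample a state-control pair
    j = rng.choice(n, p=P[u][i])                   # simulate the transition
    gamma = 1.0 / (1 + k / 1000)                   # diminishing stepsize
    target = G[u][i] + alpha * Q[j].min()          # sample of (FQ)(i, u)
    Q[i, u] += gamma * (target - Q[i, u])          # SA: move toward the sample

print(Q.argmin(axis=1))                            # greedy policy from Q
```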
• Optimistic PI with Q-factors: evaluate the current policy µ^k with just a few (e.g., one) Q-factor updates, and improve the policy by µ^{k+1}(i) ∈ arg min_{u∈U(i)} Q_{k+1}(i, u)
• This turns out not to work (counterexamples by Williams and Baird, dating to 1993), but a simple modification of the algorithm is valid
• See a series of papers starting with D. Bertsekas and H. Yu, "Q-Learning and Enhanced Policy Iteration in Discounted Dynamic Programming," Math. of OR, Vol. 37, 2012, pp. 66-94
Q-FACTOR APPROXIMATIONS
• Example: for linear-quadratic problems, Q-factors are quadratic in (x, u), suggesting basis functions of the form
x_i x_j, u_i u_j, x_i u_j, ∀ i, j