6.231 DYNAMIC PROGRAMMING
LECTURE SLIDES
CAMBRIDGE, MASS
FALL 2015
DIMITRI P. BERTSEKAS
LECTURE 1
LECTURE OUTLINE
Problem Formulation
Examples
The Basic Problem
Significance of Feedback
DP AS AN OPTIMIZATION METHODOLOGY
\min_{u \in U} g(u)
Discrete-time system
x_{k+1} = f_k(x_k, u_k, w_k), \quad k = 0, 1, \ldots, N-1
k: Discrete time
x_k: State; summarizes past information that is relevant for future optimization
u_k: Control; decision to be selected at time k from a given set
w_k: Random parameter (also called disturbance or noise, depending on the context)
N: Horizon, or number of times control is applied
INVENTORY CONTROL EXAMPLE
[Figure: inventory system. Stock x_k, stock ordered u_k at period k, demand w_k; cost of period k is r(x_k) + c u_k]
Discrete-time system:
x_{k+1} = f_k(x_k, u_k, w_k) = x_k + u_k - w_k
Cost function that is additive over time:
E\Big\{ g_N(x_N) + \sum_{k=0}^{N-1} g_k(x_k, u_k, w_k) \Big\} = E\Big\{ \sum_{k=0}^{N-1} \big( c u_k + r(x_k + u_k - w_k) \big) \Big\}
Control constraint: u_k \in U_k(x_k)
Probability distribution of w_k: P_{w_k}(\cdot \mid x_k, u_k)
DETERMINISTIC FINITE-STATE PROBLEMS
[Figure: deterministic scheduling example. Four operations A, B, C, D are sequenced subject to precedence constraints; nodes are partial schedules (A, C, AB, AC, CA, CC, CD, ABC, ACB, ACD, CAB, CAD, CDA, CCA, CCB, CCD, CBC, CBD, CDB, ...) branching from the initial state, and arcs are operation choices with transition costs]
STOCHASTIC FINITE-STATE PROBLEMS
[Figure: a two-game chess match example. States are score pairs (0-0, 1-0, 0-1, 1.5-0.5, 1-1, 0.5-0.5, 0.5-1.5, 2-0, 0-2); timid play leads to a draw with probability p_d and a loss with probability 1-p_d, while bold play leads to a win with probability p_w and a loss with probability 1-p_w]
BASIC PROBLEM
Cost of a policy \pi = \{\mu_0, \ldots, \mu_{N-1}\}:
J_\pi(x_0) = E\Big\{ g_N(x_N) + \sum_{k=0}^{N-1} g_k\big(x_k, \mu_k(x_k), w_k\big) \Big\}
Optimal cost function: J^*(x_0) = \min_\pi J_\pi(x_0); a policy \pi^* is optimal if J_{\pi^*}(x_0) = J^*(x_0)
SIGNIFICANCE OF FEEDBACK
[Figure: closed-loop feedback control. The control u_k = \mu_k(x_k) is applied to the system x_{k+1} = f_k(x_k, u_k, w_k), and the state x_k is fed back to the controller \mu_k]
[Figure: the chess match example revisited. An open-loop strategy fixes timid/bold play in advance; the optimal closed-loop strategy adapts play to the score (states 0-0, 1-0, 0-1, 1-1, 0-2, with probabilities p_d, 1-p_d, p_w, 1-p_w): timid play after a win, bold play otherwise]
VARIANTS OF DP PROBLEMS
Continuous-time problems
Imperfect state information problems
Infinite horizon problems
Suboptimal control
LECTURE BREAKDOWN
Infinite Horizon Problems - Advanced (Vol. 2)
Chs. 1, 2: Discounted problems - Computational methods (3 lectures)
Ch. 3: Stochastic shortest path problems (2 lectures)
Chs. 6, 7: Approximate DP (6 lectures)
COURSE ADMINISTRATION
A NOTE ON THESE SLIDES
6.231 DYNAMIC PROGRAMMING
LECTURE 2
LECTURE OUTLINE
BASIC PROBLEM
J_\pi(x_0) = E\Big\{ g_N(x_N) + \sum_{k=0}^{N-1} g_k\big(x_k, \mu_k(x_k), w_k\big) \Big\}
J^*(x_0) = \min_\pi J_\pi(x_0)
PRINCIPLE OF OPTIMALITY
Tail subproblem: minimize the cost-to-go from time i onward,
E\Big\{ g_N(x_N) + \sum_{k=i}^{N-1} g_k\big(x_k, \mu_k(x_k), w_k\big) \Big\}, \quad 0 \le i \le N-1
[Figure: the deterministic scheduling example with numerical arc costs (1 through 9). Starting from the last stage, the optimal cost-to-go of each tail subproblem is computed at every node (AB, AC, CA, CD, ABC, ACB, ACD, CAB, CAD, CDA) and combined backwards via the principle of optimality]
STOCHASTIC INVENTORY EXAMPLE
[Figure: inventory system. Stock x_k, stock ordered u_k at period k, demand w_k; cost of period k is c u_k + r(x_k + u_k - w_k)]
DP algorithm: start with
J_N(x_N) = g_N(x_N),
and go backwards. Induction step (assume J_{k+1} = J^*_{k+1}, and write the tail policy as \pi^k = (\mu_k, \pi^{k+1})):
J^*_k(x_k) = \min_{(\mu_k, \pi^{k+1})} E_{w_k, \ldots, w_{N-1}} \Big\{ g_k\big(x_k, \mu_k(x_k), w_k\big) + g_N(x_N) + \sum_{i=k+1}^{N-1} g_i\big(x_i, \mu_i(x_i), w_i\big) \Big\}
= \min_{\mu_k} E_{w_k} \Big\{ g_k\big(x_k, \mu_k(x_k), w_k\big) + \min_{\pi^{k+1}} E_{w_{k+1}, \ldots, w_{N-1}} \Big[ g_N(x_N) + \sum_{i=k+1}^{N-1} g_i\big(x_i, \mu_i(x_i), w_i\big) \Big] \Big\}
= \min_{\mu_k} E_{w_k} \Big\{ g_k\big(x_k, \mu_k(x_k), w_k\big) + J^*_{k+1}\big(f_k(x_k, \mu_k(x_k), w_k)\big) \Big\}
= \min_{\mu_k} E_{w_k} \Big\{ g_k\big(x_k, \mu_k(x_k), w_k\big) + J_{k+1}\big(f_k(x_k, \mu_k(x_k), w_k)\big) \Big\}
= \min_{u_k \in U_k(x_k)} E_{w_k} \Big\{ g_k(x_k, u_k, w_k) + J_{k+1}\big(f_k(x_k, u_k, w_k)\big) \Big\}
= J_k(x_k)
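To make the recursion concrete, here is a minimal code sketch of the backward DP algorithm applied to the inventory example above. The horizon, capacity, costs, and demand distribution are illustrative assumptions, not values from the lecture.

import math

# Backward DP for x_{k+1} = x_k + u_k - w_k with stage cost c*u_k + r(x_k+u_k-w_k).
N, CAP, c = 10, 10, 1.0
demand = {0: 0.2, 1: 0.5, 2: 0.3}            # assumed P(w_k = w)

def r(y):
    # shortage/holding cost on post-demand stock y (y may be negative)
    return 3.0 * max(0, -y) + 0.5 * max(0, y)

J = {x: 0.0 for x in range(CAP + 1)}         # J_N = 0 (no terminal cost here)
policy = []
for k in reversed(range(N)):
    Jk, mu = {}, {}
    for x in range(CAP + 1):
        # min over u of E_w { c*u + r(x+u-w) + J_{k+1}(max(0, x+u-w)) }
        cost, u = min(
            (sum(p * (c * u + r(x + u - w) + J[max(0, x + u - w)])
                 for w, p in demand.items()), u)
            for u in range(CAP + 1 - x))     # keep x + u <= CAP
        Jk[x], mu[x] = cost, u
    J, policy = Jk, [mu] + policy

print("J_0(0) =", round(J[0], 3), " optimal first order at x=0:", policy[0][0])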
LINEAR-QUADRATIC ANALYTICAL EXAMPLE
[Figure: a material with initial temperature x_0 passes through Oven 1 (temperature u_0) producing x_1, then Oven 2 (temperature u_1) producing final temperature x_2]
System: x_{k+1} = (1-a)x_k + a u_k, \quad k = 0, 1
DP algorithm:
J_2(x_2) = r(x_2 - T)^2
J_1(x_1) = \min_{u_1} \Big[ u_1^2 + r\big((1-a)x_1 + a u_1 - T\big)^2 \Big]
J_0(x_0) = \min_{u_0} \Big[ u_0^2 + J_1\big((1-a)x_0 + a u_0\big) \Big]
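A quick symbolic check of the last-stage minimization, as a sketch (the closed forms printed in the comments follow from the first-order condition; only the model above is used):

import sympy as sp

u1, x1, a, r, T = sp.symbols('u1 x1 a r T', real=True)
cost = u1**2 + r*((1 - a)*x1 + a*u1 - T)**2
u1_star = sp.solve(sp.diff(cost, u1), u1)[0]   # first-order condition
J1 = sp.simplify(cost.subs(u1, u1_star))
print(u1_star)   # r*a*(T - (1-a)*x1)/(1 + r*a**2)
print(J1)        # r*((1-a)*x1 - T)**2/(1 + r*a**2)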
STATE AUGMENTATION
LECTURE 3
LECTURE OUTLINE
DETERMINISTIC FINITE-STATE PROBLEM
[Figure: a deterministic finite-state problem viewed as a shortest path problem. Stage-by-stage layers of state nodes run from the initial state s to an artificial terminal node t, with terminal arcs whose cost equals the terminal cost]
BACKWARD AND FORWARD DP ALGORITHMS
DP algorithm:
J_N(i) = a^N_{it}, \quad i \in S_N,
J_k(i) = \min_{j \in S_{k+1}} \big[ a^k_{ij} + J_{k+1}(j) \big], \quad i \in S_k, \ k = 0, \ldots, N-1
There is also a forward DP algorithm that proceeds from s toward t; its optimal cost is \tilde{J}_0(t) = \min_{i \in S_N} \big[ a^N_{it} + \tilde{J}_1(i) \big]
View \tilde{J}_k(j) as optimal cost-to-arrive to state j from initial state s
A NOTE ON FORWARD DP ALGORITHMS
GENERIC SHORTEST PATH PROBLEMS
DP algorithm:
J_k(i) = \min_{j=1,\ldots,N} \big[ a_{ij} + J_{k+1}(j) \big], \quad k = 0, 1, \ldots, N-2
EXAMPLE
[Figure: (a) a shortest path example with states 1-4, a destination node 5, and arc costs (0.5, 1, 2, 3, 4, 5, 7, ...); (b) the cost-to-go iterates J_k(i) for each state i over stages k = 0, ..., 4 (e.g., 3, 3, 3, 3; 4, 4, 4, 5; 4.5, 4.5, 5.5, 7; 2, 2, 2, 2)]
J_{N-1}(i) = a_{it}, \quad i = 1, 2, \ldots, N,
J_k(i) = \min_{j=1,\ldots,N} \big[ a_{ij} + J_{k+1}(j) \big], \quad k = 0, 1, \ldots, N-2.
ESTIMATION / HIDDEN MARKOV MODELS
[Figure: trellis diagram for HMM state estimation; layered nodes for x_0, x_1, x_2, \ldots, x_{N-1}, x_N between an initial node s and a terminal node t]
VITERBI ALGORITHM
We have
p(X_N \mid Z_N) = \frac{p(X_N, Z_N)}{p(Z_N)}
where p(X_N, Z_N) and p(Z_N) are the unconditional probabilities of occurrence of (X_N, Z_N) and Z_N
Maximizing p(X_N \mid Z_N) is equivalent to maximizing \ln p(X_N, Z_N)
We have (using the multiplication rule for conditional probabilities)
p(X_N, Z_N) = \pi_{x_0} \prod_{k=1}^{N} p_{x_{k-1} x_k}\, r(z_k; x_{k-1}, x_k)
so the problem is equivalent to
\text{minimize} \ -\ln \pi_{x_0} - \sum_{k=1}^{N} \ln\big( p_{x_{k-1} x_k}\, r(z_k; x_{k-1}, x_k) \big)
over all possible sequences \{x_0, x_1, \ldots, x_N\}.
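A minimal sketch of the Viterbi recursion in this shortest-path form, with cost-to-arrive D and backtracking. The toy model (2 states, 2 observation values) is an illustrative assumption:

import math

pi = [0.6, 0.4]                        # initial distribution pi_{x0}
p = [[0.7, 0.3], [0.4, 0.6]]           # transition probabilities p_{ij}
r = [[0.9, 0.1], [0.2, 0.8]]           # simplified r(z; i, j) depending on j only
Z = [0, 0, 1, 1, 0]                    # observation sequence z_1..z_N

n = len(pi)
D = [-math.log(pi[i]) for i in range(n)]   # cost-to-arrive D_0(i)
back = []
for z in Z:
    D_new, arg = [], []
    for j in range(n):
        cands = [D[i] - math.log(p[i][j] * r[j][z]) for i in range(n)]
        best = min(range(n), key=lambda i: cands[i])
        D_new.append(cands[best]); arg.append(best)
    D, back = D_new, back + [arg]

# Backtrack the minimizing state sequence x_0, ..., x_N
x = min(range(n), key=lambda i: D[i])
seq = [x]
for arg in reversed(back):
    seq.append(arg[seq[-1]])
seq.reverse()
print("most likely state sequence:", seq)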
[Figure: a shortest path tree with origin node branching to nodes AB, AC, AD and arc costs 5, 1, 15, 20, 4, 3, ...; used below to illustrate label correcting methods]
LABEL CORRECTING METHODS
[Figure: label correcting bookkeeping. When a node i is removed from OPEN, each arc (i, j) is tested: is d_i + a_{ij} < d_j? (Is the path s --> i --> j better than the current path s --> j?) If so, d_j is updated and j is placed in OPEN]
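Here is a minimal sketch of a generic label correcting method for the shortest path from origin s to destination t; the graph, node names, and arc costs are illustrative assumptions:

INF = float("inf")
arcs = {'s': {'1': 5, '2': 1},
        '1': {'t': 4, '2': 2},
        '2': {'1': 1, 't': 7},
        't': {}}
d = {i: INF for i in arcs}; d['s'] = 0.0
parent, OPEN, UPPER = {'s': None}, ['s'], INF

while OPEN:
    i = OPEN.pop()                        # removal policy is flexible (LIFO here)
    for j, a in arcs[i].items():
        # Is the path s -> i -> j better than the current path s -> j,
        # and can it still beat the best complete path found so far?
        if d[i] + a < min(d[j], UPPER):
            d[j], parent[j] = d[i] + a, i
            if j == 't':
                UPPER = d[j]
            else:
                OPEN.append(j)

print("shortest cost s -> t:", UPPER)     # 6, via s -> 2 -> 1 -> t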
EXAMPLE
[Figure: origin node A expanding to nodes AB, AC, AD with arc costs 5, 1, 15, 20, 4, 3, ...; nodes are numbered 1-10 in the order generated]
Iter. No. | Node exiting OPEN | OPEN after iteration | UPPER
0  | -  | 1            | \infty
1  | 1  | 2, 7, 10     | \infty
2  | 2  | 3, 5, 7, 10  | \infty
3  | 3  | 4, 5, 7, 10  | \infty
4  | 4  | 5, 7, 10     | 43
5  | 5  | 6, 7, 10     | 43
6  | 6  | 7, 10        | 13
7  | 7  | 8, 10        | 13
8  | 8  | 9, 10        | 13
9  | 9  | 10           | 13
10 | 10 | Empty        | 13
6.231 DYNAMIC PROGRAMMING
LECTURE 4
LECTURE OUTLINE
LINEAR-QUADRATIC PROBLEMS
System: x_{k+1} = A_k x_k + B_k u_k + w_k
Quadratic cost
E_{w_k,\ k=0,1,\ldots,N-1} \Big\{ x_N' Q_N x_N + \sum_{k=0}^{N-1} \big( x_k' Q_k x_k + u_k' R_k u_k \big) \Big\}
Optimal policy is linear: \mu_k^*(x_k) = L_k x_k
Similar treatment of a number of variants
DERIVATION
By induction, J_k(x_k) = x_k' K_k x_k + \text{const}, with K_N = Q_N and K_k generated backwards by the Riccati equation. The optimal gain is
L = -(B' K B + R)^{-1} B' K A,
and the closed-loop system is x_{k+1} = (A + BL)x_k + w_k
GRAPHICAL PROOF FOR SCALAR SYSTEMS
[Figure: graphical construction for the scalar Riccati iteration. F(P) is plotted against the 45-degree line; the iterates P_k, P_{k+1} converge to the fixed point P^*, and F has a vertical asymptote at P = -R/B^2 with horizontal asymptote A^2 R / B^2 + Q]
P_{k+1} = A^2 \Big( P_k - \frac{B^2 P_k^2}{B^2 P_k + R} \Big) + Q,
F(P) = A^2 \Big( P - \frac{B^2 P^2}{B^2 P + R} \Big) + Q = \frac{A^2 R P}{B^2 P + R} + Q
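As a numeric sketch, iterating the scalar map F above quickly converges to the fixed point P^*; the parameter values here are illustrative assumptions:

A, B, Q, R = 1.5, 1.0, 1.0, 1.0

def F(P):
    return (A**2 * R * P) / (B**2 * P + R) + Q

P = 0.0
for _ in range(30):
    P = F(P)                              # Riccati iteration P_{k+1} = F(P_k)
print("P* ~=", P)
L = -(B * P * A) / (B**2 * P + R)         # steady-state gain L = -(B'KB+R)^{-1}B'KA
print("steady-state gain L ~=", L)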
B P +R B P +R
With random matrices A_k, B_k, the DP algorithm is
J_N(x_N) = x_N' Q_N x_N,
J_k(x_k) = \min_{u_k} E_{w_k, A_k, B_k} \big\{ x_k' Q_k x_k + u_k' R_k u_k + J_{k+1}(A_k x_k + B_k u_k + w_k) \big\}
The optimal policy is again linear, with gain
L_k = -\big( R_k + E\{B_k' K_{k+1} B_k\} \big)^{-1} E\{B_k' K_{k+1} A_k\},
PROPERTIES
[Figure: graphical analysis of the Riccati iteration with random matrices; F(P) against the 45-degree line, with vertical asymptote at P = -R/E\{B^2\}]
INVENTORY CONTROL
x_{k+1} = x_k + u_k - w_k, \quad k = 0, 1, \ldots, N-1
Minimize
E\Big\{ \sum_{k=0}^{N-1} \big( c u_k + H(x_k + u_k) \big) \Big\}
where H(y) = E\big\{ r(y - w) \big\} is the expected shortage/holding cost
DP algorithm:
J_N(x_N) = 0,
J_k(x_k) = \min_{u_k \ge 0} \Big[ c u_k + H(x_k + u_k) + E\big\{ J_{k+1}(x_k + u_k - w_k) \big\} \Big]
OPTIMAL POLICY
Optimal policy (basestock form): \mu_k^*(x_k) = S_k - x_k if x_k < S_k, and 0 otherwise,
where
G_k(y) = cy + H(y) + E\big\{ J_{k+1}(y - w) \big\}
and S_k is a minimizing point of G_k. Key facts: J_k is convex and
\lim_{|x| \to \infty} J_k(x) = \infty
JUSTIFICATION
[Figure: graphical justification at stage N-1. Plots of H(y) and cy + H(y), with minimizing point S_{N-1} and value cS_{N-1}; the resulting J_{N-1}(x_{N-1}) is convex and linear with slope -c for x_{N-1} < S_{N-1}]
6.231 DYNAMIC PROGRAMMING
LECTURE 5
LECTURE OUTLINE
Stopping problems
Scheduling problems
Minimax Control
PURE STOPPING PROBLEMS
[Figure: the state space divided into a CONTINUE region and a STOP region, with an absorbing stop state]
EXAMPLE: ASSET SELLING
J_N(x_N) = \begin{cases} x_N & \text{if } x_N \ne T, \\ 0 & \text{if } x_N = T, \end{cases}
J_k(x_k) = \begin{cases} \max\big[ (1+r)^{N-k} x_k, \ E\{ J_{k+1}(w_k) \} \big] & \text{if } x_k \ne T, \\ 0 & \text{if } x_k = T. \end{cases}
Optimal policy: accept the offer x_k if x_k > \alpha_k and reject it if x_k < \alpha_k, where
\alpha_k = \frac{E\{ J_{k+1}(w_k) \}}{(1+r)^{N-k}}.
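A minimal numeric sketch of this recursion, computing the thresholds \alpha_k; the offer distribution, horizon, and interest rate are illustrative assumptions:

N, r = 10, 0.05
offers = [i / 10 for i in range(11)]       # assumed offer values 0.0 .. 1.0
prob = 1.0 / len(offers)                   # uniform distribution

# With V_k(x) = (1+r)^{-(N-k)} J_k(x):  V_N(x) = x,
# V_k(x) = max(x, E{V_{k+1}(w)} / (1+r)), and alpha_k = E{V_{k+1}(w)}/(1+r).
EV = sum(prob * w for w in offers)         # E{V_N(w)} = E{w}
alphas = []
for k in reversed(range(N)):
    alpha = EV / (1 + r)                   # threshold alpha_k
    alphas.append(alpha)
    EV = sum(prob * max(w, alpha) for w in offers)   # E{V_k(w)}
alphas.reverse()
print("alpha_0 .. alpha_{N-1}:", [round(a, 3) for a in alphas])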
FURTHER ANALYSIS
[Figure: the thresholds \alpha_1, \alpha_2, \ldots, \alpha_{N-1} plotted against k = 0, 1, \ldots, N; the ACCEPT region lies above the threshold curve and the REJECT region below]
We have \alpha_k = E_w\big\{ V_{k+1}(w) \big\}/(1 + r), so it is enough to show that V_k(x) \ge V_{k+1}(x) for all x and k. Start with V_{N-1}(x) \ge V_N(x) and use the monotonicity property of DP. Q.E.D.
GENERAL STOPPING PROBLEMS
T_0 \subset \cdots \subset T_k \subset T_{k+1} \subset \cdots \subset T_{N-1}.
Interesting case is when all the T_k are equal (to T_{N-1}, the set where it is better to stop than to go one step and stop). Can be shown to be true if T_{N-1} is absorbing (once the state enters T_{N-1}, it stays there).
SCHEDULING PROBLEMS
EXAMPLE: THE QUIZ PROBLEM
Optimal ordering (index rule): answer question i before question j if
p_i R_i / (1 - p_i) \ \ge \ p_j R_j / (1 - p_j).
A code sketch of this rule follows.
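A minimal sketch of the index rule and the resulting expected reward; the (p_i, R_i) values are illustrative assumptions:

questions = [("q1", 0.9, 1.0), ("q2", 0.5, 5.0), ("q3", 0.8, 2.0)]

# Sort in decreasing order of the index p*R/(1-p)
ranked = sorted(questions, key=lambda q: q[1] * q[2] / (1.0 - q[1]), reverse=True)
for name, p, R in ranked:
    print(name, "index =", round(p * R / (1 - p), 2))

# Expected total reward of the optimal order, computed forward:
E, p_all = 0.0, 1.0
for _, p, R in ranked:
    E += p_all * p * R      # reward R collected only if all previous answered
    p_all *= p
print("expected total reward:", round(E, 3))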
MINIMAX CONTROL
J_N(x_N) = g_N(x_N),
J_k(x_k) = \min_{u_k \in U(x_k)} \max_{w_k \in W_k(x_k, u_k)} \Big[ g_k(x_k, u_k, w_k) + J_{k+1}\big(f_k(x_k, u_k, w_k)\big) \Big]
DERIVATION OF MINIMAX DP ALGORITHM
UNKNOWN-BUT-BOUNDED CONTROL
g_k(x_k) = \begin{cases} 0 & \text{if } x_k \in \overline{X}_k, \\ 1 & \text{if } x_k \notin \overline{X}_k. \end{cases}
We must reach at time k the set
\overline{X}_k = \big\{ x_k \mid J_k(x_k) = 0 \big\}
6.231 DYNAMIC PROGRAMMING
LECTURE 6
LECTURE OUTLINE
BASIC PROBL. W/ IMPERFECT STATE INFO
INFORMATION VECTOR AND POLICIES
I_k = (z_0, z_1, \ldots, z_k, u_0, u_1, \ldots, u_{k-1}), \quad k \ge 1,
I_0 = z_0
REFORMULATION AS PERFECT INFO PROBL.
System: we have
I_{k+1} = (I_k, z_{k+1}, u_k), \quad k = 0, 1, \ldots, N-2, \quad I_0 = z_0
P(z_{k+1} \mid I_k, u_k) = P(z_{k+1} \mid I_k, u_k, z_0, z_1, \ldots, z_k)
DP ALGORITHM
J^* = E_{z_0}\big\{ J_0(z_0) \big\}
LINEAR-QUADRATIC PROBLEMS
System: x_{k+1} = A_k x_k + B_k u_k + w_k
Quadratic cost
E_{w_k,\ k=0,1,\ldots,N-1} \Big\{ x_N' Q_N x_N + \sum_{k=0}^{N-1} \big( x_k' Q_k x_k + u_k' R_k u_k \big) \Big\}
Observations: z_k = C_k x_k + v_k, \quad k = 0, 1, \ldots, N-1
Optimal policy: \mu_k^*(I_k) = L_k E\{x_k \mid I_k\}
DP ALGORITHM I
J_{N-1}(I_{N-1}) = \min_{u_{N-1}} E\Big[ x_{N-1}' Q x_{N-1} + u_{N-1}' R u_{N-1} + (A x_{N-1} + B u_{N-1} + w_{N-1})' Q (A x_{N-1} + B u_{N-1} + w_{N-1}) \,\Big|\, I_{N-1}, u_{N-1} \Big]
The cross term is 2 E\{x_{N-1} \mid I_{N-1}\}' A' Q B u_{N-1}, so the minimization yields
u^*_{N-1} = \mu^*_{N-1}(I_{N-1}) = L_{N-1} E\{x_{N-1} \mid I_{N-1}\}
where
L_{N-1} = -(B' Q B + R)^{-1} B' Q A
DP ALGORITHM II
J_{N-1}(I_{N-1}) = E_{x_{N-1}}\big\{ x_{N-1}' K_{N-1} x_{N-1} \mid I_{N-1} \big\}
+ E_{x_{N-1}}\Big\{ \big( x_{N-1} - E\{x_{N-1} \mid I_{N-1}\} \big)' P_{N-1} \big( x_{N-1} - E\{x_{N-1} \mid I_{N-1}\} \big) \,\Big|\, I_{N-1} \Big\}
+ E_{w_{N-1}}\big\{ w_{N-1}' Q_N w_{N-1} \big\},
where
P_{N-1} = A_{N-1}' Q_N B_{N-1} (R_{N-1} + B_{N-1}' Q_N B_{N-1})^{-1} B_{N-1}' Q_N A_{N-1},
K_{N-1} = A_{N-1}' Q_N A_{N-1} - P_{N-1} + Q_{N-1}
Note the quadratic term in the estimation error x_{N-1} - E\{x_{N-1} \mid I_{N-1}\}
DP ALGORITHM III
QUALITY OF ESTIMATION LEMMA
The estimation error e_{N-1} = x_{N-1} - E\{x_{N-1} \mid I_{N-1}\} satisfies
E\{ e_{N-1}' P_{N-1} e_{N-1} \mid I_{N-2}, u_{N-2} \} = E\{ e_{N-1}' P_{N-1} e_{N-1} \mid I_{N-2} \}
and is independent of u_{N-2}.
So minimization in the DP algorithm yields
u^*_{N-2} = \mu^*_{N-2}(I_{N-2}) = L_{N-2} E\{x_{N-2} \mid I_{N-2}\}
FINAL RESULT
\mu_k^*(I_k) = L_k E\{x_k \mid I_k\},
K_N = Q_N, \quad K_k = A_k' K_{k+1} A_k - P_k + Q_k,
[Figure: separation architecture. The estimator produces E\{x_k \mid I_k\} from the measurement z_k = C_k x_k + v_k and the delayed control u_{k-1}; the gain L_k multiplies the estimate to produce u_k, which drives the system x_{k+1} = A_k x_k + B_k u_k + w_k]
SEPARATION INTERPRETATION
STEADY STATE/IMPLEMENTATION ASPECTS
6.231 DYNAMIC PROGRAMMING
LECTURE 7
LECTURE OUTLINE
REVIEW: IMPERFECT STATE INFO PROBLEM
I_0 = z_0, \quad I_k = (z_0, z_1, \ldots, z_k, u_0, u_1, \ldots, u_{k-1}), \quad k \ge 1
DP ALGORITHM
DP algorithm:
J_k(I_k) = \min_{u_k \in U_k} E_{x_k, w_k, z_{k+1}} \Big[ g_k(x_k, u_k, w_k) + J_{k+1}(I_k, z_{k+1}, u_k) \,\Big|\, I_k, u_k \Big]
J^* = E_{z_0}\big\{ J_0(z_0) \big\}.
SUFFICIENT STATISTICS
\mu_k^*(I_k) = \overline{\mu}_k\big( S_k(I_k) \big),
DP ALGORITHM IN TERMS OF PXK |IK
[Figure: actuator/estimator architecture. The system x_{k+1} = f_k(x_k, u_k, w_k) produces measurements z_k = h_k(x_k, u_{k-1}, v_k); the estimator computes the conditional distribution P_{x_k \mid I_k}, and the actuator applies u_k = \overline{\mu}_k(P_{x_k \mid I_k})]
EXAMPLE: A SEARCH PROBLEM
with p_0 given, p_k evolves at time k according to the equation
p_{k+1} = \begin{cases} p_k & \text{if not search,} \\ 0 & \text{if search and find treasure,} \\ \dfrac{p_k(1-\beta)}{p_k(1-\beta) + 1 - p_k} & \text{if search and no treasure.} \end{cases}
SEARCH PROBLEM (CONTINUED)
DP algorithm:
\overline{J}_k(p_k) = \max\Big[ 0, \ -C + p_k \beta V + (1 - p_k \beta)\, \overline{J}_{k+1}\Big( \frac{p_k(1-\beta)}{p_k(1-\beta) + 1 - p_k} \Big) \Big],
with \overline{J}_N(p_N) = 0.
Can be shown by induction that the functions \overline{J}_k satisfy
\overline{J}_k(p_k) \begin{cases} = 0 & \text{if } p_k \le \dfrac{C}{\beta V}, \\ > 0 & \text{if } p_k > \dfrac{C}{\beta V}. \end{cases}
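A minimal numeric sketch of the belief recursion and the resulting threshold rule (search while p_k > C/(\beta V)); the values of \beta, V, C, and p_0 are illustrative assumptions:

beta, V, C = 0.7, 10.0, 1.0     # detection prob., treasure value, search cost
p, k = 0.5, 0
while p > C / (beta * V):       # search while expected reward p*beta*V exceeds C
    # Bayes update after an unsuccessful search
    p = p * (1 - beta) / (p * (1 - beta) + 1 - p)
    k += 1
    print(f"after {k} empty searches: p = {p:.4f}")
print("stop searching; threshold C/(beta*V) =", C / (beta * V))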
FINITE-STATE SYSTEMS - POMDP
INSTRUCTION EXAMPLE I
[Figure: two-state instruction model with states L (learned, absorbing) and \overline{L} (not learned); a lesson moves \overline{L} to L with probability t, and test outcomes are correct with probability r for learned students (transition labels t, 1-t, r, 1-r)]
INSTRUCTION EXAMPLE II
p_k = P(x_k = L \mid z_0, z_1, \ldots, z_k).
DP algorithm:
\overline{J}_k(p_k) = \min\Big[ (1 - p_k)C, \ I + E_{z_{k+1}}\big\{ \overline{J}_{k+1}(p_k, z_{k+1}) \big\} \Big]
starting with
\overline{J}_{N-1}(p_{N-1}) = \min\big[ (1 - p_{N-1})C, \ I + (1-t)(1 - p_{N-1})C \big].
INSTRUCTION EXAMPLE III
[Figure: the functions I + A_{N-1}(p), I + A_{N-2}(p), I + A_{N-3}(p) plotted against (1-p)C; the termination threshold shifts with the horizon]
LECTURE 8
LECTURE OUTLINE
Suboptimal control
Cost approximation methods: Classification
Certainty equivalent control: An example
Limited lookahead policies
Performance bounds
Problem approximation approach
Parametric cost-to-go approximation
PRACTICAL DIFFICULTIES OF DP
COST-TO-GO FUNCTION APPROXIMATION
CERTAINTY EQUIVALENT CONTROL (CEC)
EQUIVALENT OFF-LINE IMPLEMENTATION
Let \big\{ \mu_0^d(x_0), \ldots, \mu_{N-1}^d(x_{N-1}) \big\} be an optimal controller obtained from the DP algorithm for the deterministic problem
\text{minimize} \ g_N(x_N) + \sum_{k=0}^{N-1} g_k\big(x_k, \mu_k(x_k), \overline{w}_k\big)
subject to x_{k+1} = f_k\big(x_k, \mu_k(x_k), \overline{w}_k\big), \quad \mu_k(x_k) \in U_k
PARTIALLY STOCHASTIC CEC
GENERAL COST-TO-GO APPROXIMATION
One-step lookahead (1SL) policy: at each k and state x_k, use the control \overline{\mu}_k(x_k) that attains the minimum in
\min_{u_k \in U_k(x_k)} E\big\{ g_k(x_k, u_k, w_k) + \tilde{J}_{k+1}\big(f_k(x_k, u_k, w_k)\big) \big\},
where \tilde{J}_N = g_N and \tilde{J}_{k+1}: approximation to the true cost-to-go J_{k+1}
Two-step lookahead policy: at each k and x_k, use the control \tilde{\mu}_k(x_k) attaining the minimum above, where the function \tilde{J}_{k+1} is obtained using a 1SL approximation (solve a 2-step DP problem).
If \tilde{J}_{k+1} is readily available and the minimization above is not too hard, the 1SL policy is implementable on-line.
Sometimes one also replaces U_k(x_k) above with a subset of most promising controls \overline{U}_k(x_k).
As the length of lookahead increases, the required computation quickly explodes.
PERFORMANCE BOUNDS FOR 1SL
\hat{J}_k(x_k) = \min_{u_k \in U_k(x_k)} E\big\{ g_k(x_k, u_k, w_k) + \tilde{J}_{k+1}\big(f_k(x_k, u_k, w_k)\big) \big\},
COMPUTATIONAL ASPECTS
PROBLEM APPROXIMATION
PARAMETRIC COST-TO-GO APPROXIMATION
APPROXIMATION ARCHITECTURES
Linear feature-based architecture:
\tilde{J}(x, r) = \phi(x)' r = \sum_{j=1}^{m} \phi_j(x)\, r_j
[Figure: state x \to feature extraction mapping \to feature vector \phi(x) \to linear cost approximator \phi(x)'r]
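A minimal sketch of such an architecture; the particular features (polynomials of a scalar state) and weights are illustrative assumptions:

import numpy as np

def phi(x):
    return np.array([1.0, x, x**2])   # m = 3 features of the state

r = np.array([0.5, -1.0, 2.0])        # weight vector (e.g., obtained by training)

def J_tilde(x, r):
    return phi(x) @ r                 # J~(x, r) = phi(x)' r

print(J_tilde(1.5, r))                # 0.5 - 1.5 + 4.5 = 3.5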
AN EXAMPLE - COMPUTER CHESS
[Figure: position evaluator. Extraction of features (material balance, mobility, safety, etc.) followed by feature weighting produces a score]
ANOTHER EXAMPLE - AGGREGATION
AN EXAMPLE: REPRESENTATIVE SUBSETS
[Figure: the state space partitioned into aggregate states/subsets S_1, \ldots, S_8, each with a representative point x_1, \ldots, x_8]
Common case: each S_j is a group of states with similar characteristics
Compute a cost r_j for each aggregate state S_j (using some method)
Approximate the optimal cost of each original system state x with \sum_{j=1}^{m} \phi_{xj}\, r_j
6.231 DYNAMIC PROGRAMMING
LECTURE 9
LECTURE OUTLINE
Rollout algorithms
Policy improvement property
Discrete deterministic problems
Approximations of rollout algorithms
Model Predictive Control (MPC)
Discretization of continuous time
Discretization of continuous space
Other suboptimal approaches
ROLLOUT ALGORITHMS
One-step lookahead with \tilde{J}_N = g_N and \tilde{J}_{k+1}: approximation to the true cost-to-go J_{k+1}
Rollout algorithm: \tilde{J}_k is the cost-to-go of some heuristic policy (called the base policy)
Policy improvement property (to be shown): the rollout algorithm achieves no worse (and usually much better) cost than the base heuristic starting from the same state.
Main difficulty: calculating \tilde{J}_k(x_k) may be computationally intensive if the cost-to-go of the base policy cannot be analytically calculated.
May involve Monte Carlo simulation if the problem is stochastic.
Things improve in the deterministic case; a sketch for that case follows.
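A minimal rollout sketch for a deterministic problem: the base policy is a greedy heuristic, and rollout chooses, at each step, the control minimizing one-step cost plus the heuristic's completion cost. The dynamics, costs, and horizon are illustrative assumptions:

N = 5
def controls(x):
    return (-1, 0, 1)             # U_k(x)
def f(x, u):
    return x + u                  # deterministic dynamics
def g(x, u):
    return x * x + 0.1 * abs(u)   # stage cost

def base_heuristic_cost(x, k):
    # Greedy base policy: myopically pick the cheapest one-step control.
    total = 0.0
    for _ in range(k, N):
        u = min(controls(x), key=lambda u: g(x, u))
        total += g(x, u); x = f(x, u)
    return total

def rollout_control(x, k):
    # min over u of g(x,u) + H_{k+1}(f(x,u)), H = base heuristic cost-to-go
    return min(controls(x),
               key=lambda u: g(x, u) + base_heuristic_cost(f(x, u), k + 1))

x, cost = 3, 0.0
for k in range(N):
    u = rollout_control(x, k)
    cost += g(x, u); x = f(x, u)
print("rollout cost:", round(cost, 2),
      " base heuristic cost:", round(base_heuristic_cost(3, 0), 2))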
EXAMPLE: THE QUIZ PROBLEM
COST IMPROVEMENT PROPERTY
Let
\overline{J}_k(x_k): cost-to-go of the rollout policy, and H_k(x_k): cost-to-go of the base policy.
Claim: \overline{J}_k(x_k) \le H_k(x_k) for all x_k and k.
DISCRETE DETERMINISTIC PROBLEMS
[Figure: a tree of partial solutions with root and children AB, AC, AD, viewing a discrete deterministic optimization problem as a DP problem]
EXAMPLE: THE BREAKTHROUGH PROBLEM
[Figure: a binary tree with given root; the breakthrough problem is to find a path from the root to a leaf along intact arcs]
DET. EXAMPLE: ONE-DIMENSIONAL WALK
[Figure: one-dimensional walk. After N steps the final position i ranges over (N, -N) to (N, N), with final cost g(i); g has a local minimum at \overline{i} and its global minimum near N-2]
A ROLLOUT ISSUE FOR DISCRETE PROBLEMS
ROLLING HORIZON WITH ROLLOUT
MODEL PREDICTIVE CONTROL (MPC)
SPACE DISCRETIZATION
SPACE DISCRETIZATION/AGGREGATION
OTHER SUBOPTIMAL APPROACHES
LECTURE 10
LECTURE OUTLINE
TYPES OF INFINITE HORIZON PROBLEMS
J^*(x) = \min_{u \in U(x)} E_w\big\{ g(x, u, w) + \alpha J^*\big(f(x, u, w)\big) \big\}
STOCHASTIC SHORTEST PATH PROBLEMS
FINITENESS OF POLICY COST FUNCTIONS
View the cost in blocks of m stages, and let
\rho = \max_{i, \pi} P\{x_m \ne t \mid x_0 = i, \pi\} < 1
Then
P\{x_{2m} \ne t \mid x_0 = i, \pi\} = P\{x_{2m} \ne t \mid x_m \ne t, x_0 = i, \pi\}\, P\{x_m \ne t \mid x_0 = i, \pi\} \le \rho^2
and similarly
P\{x_{km} \ne t \mid x_0 = i, \pi\} \le \rho^k, \quad i = 1, \ldots, n
So the expected cost between times km and (k+1)m - 1 is at most
m \rho^k \max_{i=1,\ldots,n,\ u \in U(i)} \big| g(i, u) \big|
and
\big| J_\pi(i) \big| \le \sum_{k=0}^{\infty} m \rho^k \max_{i,u} \big| g(i, u) \big| = \frac{m}{1 - \rho} \max_{i,u} \big| g(i, u) \big|
MAIN RESULT
J^*(t) = 0
A stationary policy \mu is optimal if and only if for every state i, \mu(i) attains the minimum in Bellman's equation.
Key proof idea: the tail of the cost series,
\sum_{k=mK}^{\infty} E\big\{ g\big(x_k, \mu_k(x_k)\big) \big\}
vanishes as K increases to \infty.
OUTLINE OF PROOF THAT JN J
J_\pi(x_0) = E\Big\{ \sum_{k=0}^{mK-1} g\big(x_k, \mu_k(x_k)\big) \Big\} + E\Big\{ \sum_{k=mK}^{\infty} g\big(x_k, \mu_k(x_k)\big) \Big\}
\le E\Big\{ \sum_{k=0}^{mK-1} g\big(x_k, \mu_k(x_k)\big) \Big\} + \sum_{k=K}^{\infty} \rho^k\, m \max_{i,u} |g(i, u)|
Taking the minimum over \pi,
J^*(x_0) \le J_{mK}(x_0) + \frac{\rho^K}{1 - \rho}\, m \max_{i,u} |g(i, u)|.
Similarly, we have
J_{mK}(x_0) - \frac{\rho^K}{1 - \rho}\, m \max_{i,u} |g(i, u)| \le J^*(x_0).
EXAMPLE
g(i, u) = 1, \quad i = 1, \ldots, n, \ u \in U(i)
6.231 DYNAMIC PROGRAMMING
LECTURE 11
LECTURE OUTLINE
STOCHASTIC SHORTEST PATH PROBLEMS
J_\pi(i) = \lim_{N \to \infty} E\Big\{ \sum_{k=0}^{N-1} g\big(x_k, \mu_k(x_k)\big) \ \Big|\ x_0 = i \Big\}
MAIN RESULT
BELLMAN'S EQ. FOR A SINGLE POLICY
POLICY ITERATION
JUSTIFICATION OF POLICY ITERATION
LINEAR PROGRAMMING
LINEAR PROGRAMMING (CONTINUED)
Obtain J^* by \max \sum_{i=1}^{n} J(i) subject to
J(i) \le g(i, u) + \sum_{j=1}^{n} p_{ij}(u) J(j), \quad i = 1, \ldots, n, \ u \in U(i)
[Figure: the LP constraint set in the (J(1), J(2)) plane; J^* is the maximizing corner point]
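A minimal sketch of this LP for a tiny problem, using scipy's linprog (which minimizes, so we maximize \sum_i J(i) by minimizing its negative). The two-state transition data are illustrative assumptions; probability mass not assigned to states 1..n goes to the termination state:

from scipy.optimize import linprog

n = 2
# For each (state, control): stage cost g and transition probs to states 0..n-1
actions = {
    (0, 'a'): (1.0, [0.5, 0.2]),
    (0, 'b'): (2.0, [0.0, 0.1]),
    (1, 'a'): (1.5, [0.3, 0.4]),
}

A_ub, b_ub = [], []
for (i, _u), (g, p) in actions.items():
    # J(i) - sum_j p_ij J(j) <= g(i,u)
    A_ub.append([(1.0 if j == i else 0.0) - p[j] for j in range(n)])
    b_ub.append(g)

res = linprog(c=[-1.0] * n, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * n)
print("J* =", res.x)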
DISCOUNTED PROBLEMS
DISCOUNTED PROBLEM EXAMPLE
LECTURE 12
LECTURE OUTLINE
AVERAGE COST PER STAGE PROBLEM
CONNECTION WITH SSP (CONTINUED)
C_{nn}(\lambda^*) - \lambda^* N_{nn}(\lambda^*) = 0, \quad h^*(n) = 0
If \mu(i) attains the min for each i, \mu is optimal.
There is also a Bellman Eq. for a single policy \mu.
MORE ON THE CONNECTION WITH SSP
134
EXAMPLE (CONTINUED)
Also h^*(0) = 0.
Optimal policy: process the i unfilled orders if i exceeds an appropriate threshold.
VALUE ITERATION
J_k(i) \approx k \lambda^* + h^*(i), \quad \text{for all } i \text{ and large } k.
RELATIVE VALUE ITERATION
h_{k+1}(i) = \min_{u \in U(i)} \Big[ g(i, u) + \sum_{j=1}^{n} p_{ij}(u) h_k(j) \Big] - \min_{u \in U(n)} \Big[ g(n, u) + \sum_{j=1}^{n} p_{nj}(u) h_k(j) \Big]
6.231 DYNAMIC PROGRAMMING
LECTURE 13
LECTURE OUTLINE
CONTINUOUS-TIME MARKOV CHAINS
PROBLEM FORMULATION
P\{t_{k+1} - t_k \le \tau \mid x_k = i, x_{k+1} = j, u_k = u\} = \frac{Q_{ij}(\tau, u)}{p_{ij}(u)}
Thus, Q_{ij}(\tau, u) can be viewed as a scaled CDF
EXPONENTIAL TRANSITION DISTRIBUTIONS
COST STRUCTURES
DISCOUNTED CASE - COST CALCULATION
MANUFACTURER'S EXAMPLE REVISITED
where
\sum_{j=1}^{n} \int_0^{\infty} \frac{1 - e^{-\beta \tau}}{\beta}\, dQ_{ij}(\tau, u) = \int_0^{\tau_{\max}} \frac{1 - e^{-\beta \tau}}{\beta}\, \frac{d\tau}{\tau_{\max}}
and
\alpha = \int_0^{\tau_{\max}} e^{-\beta \tau}\, dQ_{ij}(\tau, u) = \int_0^{\tau_{\max}} e^{-\beta \tau}\, \frac{d\tau}{\tau_{\max}} = \frac{1 - e^{-\beta \tau_{\max}}}{\beta\, \tau_{\max}}
AVERAGE COST
Minimize \lim_{N \to \infty} \frac{1}{E\{t_N\}}\, E\Big\{ \int_0^{t_N} g\big(x(t), u(t)\big)\, dt \Big\}
assuming there is a special state that is recurrent under all policies
Total expected cost of a transition:
G(i, u) = g(i, u)\, \overline{\tau}_i(u),
where \overline{\tau}_i(u): expected transition time.
We apply the SSP argument used for the discrete-time case:
Divide trajectory into cycles marked by successive visits to n.
The cost at (i, u) is G(i, u) - \lambda^* \overline{\tau}_i(u), where \lambda^* is the optimal expected cost per unit time.
Each cycle is viewed as a state trajectory of a corresponding SSP problem with the termination state being essentially n.
So Bellman's Eq. for the average cost problem:
h^*(i) = \min_{u \in U(i)} \Big[ G(i, u) - \lambda^* \overline{\tau}_i(u) + \sum_{j=1}^{n} p_{ij}(u) h^*(j) \Big]
MANUFACTURER EXAMPLE/AVERAGE COST
G(i, \text{Fill}) = 0, \quad G(i, \text{Not Fill}) = \frac{c\, i\, \tau_{\max}}{2}
and there is also the instantaneous cost K of filling the orders
Bellman's equation:
h^*(i) = \min\Big[ K + h^*(1), \ \frac{c\, i\, \tau_{\max}}{2} + h^*(i+1) \Big] - \lambda^* \frac{\tau_{\max}}{2}
6.231 DYNAMIC PROGRAMMING
LECTURE 14
LECTURE OUTLINE
x_{k+1} = f(x_k, u_k, w_k), \quad k = 0, 1, \ldots
Cost of a policy \pi = \{\mu_0, \mu_1, \ldots\}:
J_\pi(x_0) = \lim_{N \to \infty} E_{w_k,\ k=0,1,\ldots}\Big\{ \sum_{k=0}^{N-1} \alpha^k g\big(x_k, \mu_k(x_k), w_k\big) \Big\}
With |g| \le M, we have
|J_\pi(x_0)| \le M + \alpha M + \alpha^2 M + \cdots = \frac{M}{1 - \alpha}, \quad \forall x_0
WE ADOPT SHORTHAND NOTATION
T_\mu J = g_\mu + \alpha P_\mu J, \quad T J = \min_\mu T_\mu J
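To make the shorthand concrete, here is a minimal sketch of the operators T_\mu and T for a finite-state discounted problem, together with value iteration; the two-state, two-action data are illustrative assumptions:

import numpy as np

alpha = 0.9
g = {'a': np.array([1.0, 2.0]), 'b': np.array([0.5, 3.0])}   # g_mu per action
P = {'a': np.array([[0.9, 0.1], [0.2, 0.8]]),
     'b': np.array([[0.5, 0.5], [0.7, 0.3]])}

def T_mu(J, u):
    return g[u] + alpha * P[u] @ J           # T_mu J = g_mu + alpha P_mu J

def T(J):
    return np.minimum.reduce([T_mu(J, u) for u in g])   # componentwise min

# Value iteration: J_{k+1} = T J_k converges to J* (T is a contraction)
J = np.zeros(2)
for _ in range(200):
    J = T(J)
print("J* ~=", J)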
SHORTHAND COMPOSITION NOTATION
Bellman's equation: J^* = T J^*, \quad J_\mu = T_\mu J_\mu
Optimality condition:
\mu: optimal <==> T_\mu J^* = T J^*
Policy iteration: J_{\mu^k} = T_{\mu^k} J_{\mu^k}, \quad T_{\mu^{k+1}} J_{\mu^k} = T J_{\mu^k}
SOME KEY PROPERTIES
Monotonicity property: for any J, J' with J \le J', and any \mu,
(T J)(x) \le (T J')(x), \quad \forall x,
(T_\mu J)(x) \le (T_\mu J')(x), \quad \forall x.
Also
J \le T J \ \Rightarrow \ T^k J \le T^{k+1} J, \quad \forall k
Convergence of value iteration: if J_0 \equiv 0, then for any \pi = \{\mu_0, \mu_1, \ldots\},
J_\pi(x_0) = E\Big\{ \sum_{k=0}^{\infty} \alpha^k g\big(x_k, \mu_k(x_k), w_k\big) \Big\}
= E\Big\{ \sum_{k=0}^{N-1} \alpha^k g\big(x_k, \mu_k(x_k), w_k\big) \Big\} + E\Big\{ \sum_{k=N}^{\infty} \alpha^k g\big(x_k, \mu_k(x_k), w_k\big) \Big\}
from which
J_\pi(x_0) - \frac{\alpha^N M}{1 - \alpha} \le (T_{\mu_0} \cdots T_{\mu_{N-1}} J_0)(x_0) \le J_\pi(x_0) + \frac{\alpha^N M}{1 - \alpha},
and, minimizing over \pi,
J^*(x) - \frac{\alpha^N M}{1 - \alpha} \le (T^N J_0)(x) \le J^*(x) + \frac{\alpha^N M}{1 - \alpha}.
Applying T to this relation,
(T J^*)(x) - \frac{\alpha^{N+1} M}{1 - \alpha} \le (T^{N+1} J_0)(x) \le (T J^*)(x) + \frac{\alpha^{N+1} M}{1 - \alpha}
and taking the limit as N \to \infty, we obtain J^* = T J^*. Q.E.D.
THE CONTRACTION PROPERTY
\max_x \big| (T J)(x) - (T J')(x) \big| \le \alpha \max_x \big| J(x) - J'(x) \big|.
Proof: Denote c = \max_{x \in S} \big| J(x) - J'(x) \big|, so that J(x) - c \le J'(x) \le J(x) + c for all x. Apply T to both sides, and use monotonicity and the constant-shift property. Hence
\big| (T J)(x) - (T J')(x) \big| \le \alpha c, \quad \forall x.
IMPLICATIONS OF CONTRACTION PROPERTY
Proof: Use
\max_x \big| (T^k J)(x) - J^*(x) \big| = \max_x \big| (T^k J)(x) - (T^k J^*)(x) \big| \le \alpha^k \max_x \big| J(x) - J^*(x) \big|
NEC. AND SUFFICIENT OPT. CONDITION
\mu is optimal if and only if T_\mu J^* = T J^*.
Proof idea: if T_\mu J^* = T J^* = J^*, then J^* is the unique fixed point of T_\mu, so J_\mu = J^*; conversely, if J_\mu = J^*, then T_\mu J^* = T_\mu J_\mu = J_\mu = J^* = T J^*.
COMPUTATIONAL METHODS - AN OVERVIEW
Q-factors: where
Q^*(i, u) = \sum_{j=1}^{n} p_{ij}(u) \big( g(i, u, j) + \alpha J^*(j) \big)
or, in shorthand, J_{k+1} = T J_k, \quad Q_{k+1} = F Q_k.
Equal amount of computation ... just more storage.
Having optimal Q-factors is convenient when implementing an optimal policy on-line: \mu^*(i) \in \arg\min_{u \in U(i)} Q^*(i, u)
LECTURE 15
LECTURE OUTLINE
DISCOUNTED PROBLEMS/BOUNDED COST
x_{k+1} = f(x_k, u_k, w_k), \quad k = 0, 1, \ldots
Cost of a policy \pi = \{\mu_0, \mu_1, \ldots\}:
J_\pi(x_0) = \lim_{N \to \infty} E_{w_k,\ k=0,1,\ldots}\Big\{ \sum_{k=0}^{N-1} \alpha^k g\big(x_k, \mu_k(x_k), w_k\big) \Big\}
SHORTHAND THEORY A SUMMARY
Bellman's equation: J^* = T J^*, \quad J_\mu = T_\mu J_\mu
Optimality condition:
\mu: optimal <==> T_\mu J^* = T J^*
Policy iteration: J_{\mu^k} = T_{\mu^k} J_{\mu^k}, \quad T_{\mu^{k+1}} J_{\mu^k} = T J_{\mu^k}
MAJOR PROPERTIES
Monotonicity: if J \le J', then
(T J)(x) \le (T J')(x), \quad \forall x \in X,
(T_\mu J)(x) \le (T_\mu J')(x), \quad \forall x \in X.
Contraction property: for any bounded functions J and J', and any \mu,
\max_x \big| (T J)(x) - (T J')(x) \big| \le \alpha \max_x \big| J(x) - J'(x) \big|,
\max_x \big| (T_\mu J)(x) - (T_\mu J')(x) \big| \le \alpha \max_x \big| J(x) - J'(x) \big|.
In shorthand: \|T J - T J'\| \le \alpha \|J - J'\|, \quad \|T_\mu J - T_\mu J'\| \le \alpha \|J - J'\|,
A DP-LIKE CONTRACTION MAPPING
CONTRACTION MAPPING FIXED-POINT TH.
\|F^k J - J^*\| \le \rho^k \|J - J^*\|, \quad k = 1, 2, \ldots
GENERAL FORMS OF DISCOUNTED DP
Contraction assumption:
For every J \in B(X), the functions T_\mu J and T J belong to B(X).
For some \alpha \in (0, 1) and all J, J' \in B(X), H satisfies
\big| H(x, u, J) - H(x, u, J') \big| \le \alpha \max_{y \in X} \big| J(y) - J'(y) \big|
EXAMPLES
Discounted problems
H(x, u, J) = E\big\{ g(x, u, w) + \alpha J\big(f(x, u, w)\big) \big\}
RESULTS USING CONTRACTION
\lim_{k \to \infty} T_\mu^k J = J_\mu, \quad \lim_{k \to \infty} T^k J = J^*
RESULTS USING MON. AND CONTRACTION I
Also
T^k J \le T_{\mu_0} \cdots T_{\mu_{k-1}} J
Take the limit as k \to \infty to obtain J^* \le J_\pi for all \pi.
6.231 DYNAMIC PROGRAMMING
LECTURE 16
LECTURE OUTLINE
DISCOUNTED PROBLEMS
x_{k+1} = f(x_k, u_k, w_k), \quad k = 0, 1, \ldots
SHORTHAND THEORY A SUMMARY
J_\pi = \lim_{k \to \infty} T_{\mu_0} T_{\mu_1} \cdots T_{\mu_k} J_0, \quad J_\mu = \lim_{k \to \infty} T_\mu^k J_0
Bellman's equation: J^* = T J^*, \quad J_\mu = T_\mu J_\mu
Optimality condition:
\mu: optimal <==> T_\mu J^* = T J^*
Contraction: \|T J_1 - T J_2\| \le \alpha \|J_1 - J_2\|
Value iteration: for any (bounded) J,
J^* = \lim_{k \to \infty} T^k J
Policy iteration: J_{\mu^k} = T_{\mu^k} J_{\mu^k}, \quad T_{\mu^{k+1}} J_{\mu^k} = T J_{\mu^k}
INTERPRETATION OF VI AND PI
[Figure: one-dimensional geometric interpretation of VI and PI. T J and T_\mu J are plotted against the 45-degree line. Value iteration generates J_0, T J_0, T^2 J_0, \ldots, converging to the fixed point J^* = T J^*. Policy iteration alternates exact policy evaluation (solving J_{\mu^1} = T_{\mu^1} J_{\mu^1}) with policy improvement (choosing \mu with T_\mu J = T J)]
VI AND PI METHODS FOR Q-LEARNING
APPROXIMATE PI
\limsup_{k \to \infty} \max_{x \in S} \big( J_{\mu^k}(x) - J^*(x) \big) \le \frac{\delta + 2\epsilon}{(1 - \alpha)^2}
OPTIMISTIC PI
If m_k \equiv 1 it becomes VI
If m_k = \infty it becomes PI
For intermediate values of m_k, it is generally more efficient than either VI or PI
[Figure: geometric interpretation of optimistic PI. A few applications of T_{\mu^k} (approximate policy evaluation, e.g., J_1 = T_{\mu^0}^2 J_0) alternate with policy improvement T_{\mu^{k+1}} J = T J]
EXTENSIONS TO GENERALIZED DISC. DP
(T_\mu J)(x) = H\big(x, \mu(x), J\big), \quad \forall x \in X.
ASSUMPTIONS AND RESULTS
Asynchronous value iteration: at each time t, update only a subset of components, using possibly outdated component values J_1^{\ell_1(t)}, \ldots, J_m^{\ell_m(t)}:
J_{t+1}(x) = \begin{cases} T\big(J_1^{\ell_1(t)}, \ldots, J_m^{\ell_m(t)}\big)(x) & \text{if } t \in R_x, \\ J_t(x) & \text{if } t \notin R_x \end{cases}
Example of a cyclic update order: \{x_0, x_1, \ldots\} = \{1, \ldots, n, 1, \ldots, n, 1, \ldots\}
Box condition: T J \in S(k+1) for all J \in S(k), \quad k = 0, 1, \ldots
[Figure: interpretation of the assumptions; nested sets S(0) \supset S(1) \supset \cdots with S(k) = S_1(k) \times S_2(k), and T mapping S(k) into S(k+1)]
[Figure: convergence mechanism; separate J_1 iterations and J_2 iterations move J = (J_1, J_2) from S(0) through S(k), S(k+1), \ldots]
Key: independent component-wise improvement. An asynchronous component iteration from any J in S(k) moves into the corresponding component portion of S(k+1) permanently!
PRINCIPAL DP APPLICATIONS
6.231 DYNAMIC PROGRAMMING
LECTURE 17
LECTURE OUTLINE
Undiscounted problems
Stochastic shortest path problems (SSP)
Proper and improper policies
Analysis and computational methods for SSP
Pathologies of SSP
SSP under weak conditions
UNDISCOUNTED PROBLEMS
SSP ANALYSIS I
\overline{J} = T^k \overline{J} \le T_\mu^k \overline{J} \to J_\mu \ \Rightarrow \ \overline{J} \le J_\mu \text{ for all } \mu
Similarly, \overline{J} \ge J^*, so \overline{J} = J^*.
SSP ANALYSIS II
J_{\mu^0} = T_{\mu^0} J_{\mu^0} \ge T J_{\mu^0} = T_{\mu^1} J_{\mu^0} \ge T_{\mu^1}^k J_{\mu^0} \ge \cdots \ge J_{\mu^1},
T_{\mu_0} \cdots T_{\mu_{k-1}} J_0 \ge T^k J_0,
For v_i equal to the (maximal over policies) expected number of steps to reach t from i, we have v_i \ge 1, and for all \mu,
\sum_{j=1}^{n} p_{ij}\big(\mu(i)\big) v_j \le v_i - 1 \le \rho v_i, \quad i = 1, \ldots, n,
where
\rho = \max_{i=1,\ldots,n} \frac{v_i - 1}{v_i} < 1.
This implies T_\mu and T are contractions of modulus \rho for the norm \|J\| = \max_{i=1,\ldots,n} |J(i)|/v_i (by the results of earlier lectures).
SSP ALGORITHMS
PATHOLOGIES I: DETERM. SHORTEST PATHS
[Figure: a two-node deterministic shortest path example with node 1 and destination t; arc (1, t) with cost b, and a self-transition at node 1 with cost 0]
PATHOLOGIES II: BLACKMAILERS DILEMMA
Bellman equation at state 1: J(1) = \min_{0 < u \le 1} \big[ -u + (1 - u^2) J(1) \big]
\hat{J}(i) = \min_{\mu:\ \text{proper}} J_\mu(i), \quad i = 1, \ldots, n
[Figure: an SSP example with states 1, \ldots, 5 and destination t; transitions with probabilities p and 1-p, and arc costs 0, 1, 2, \ldots, used to illustrate failure of Bellman's equation under weak conditions]
For p = 1/2, the Bellman Eq. at state 1,
\hat{J}(1) = \frac{1}{2}\big( \hat{J}(2) + \hat{J}(5) \big),
is violated.
References: Bertsekas, D. P., and Yu, H., 2015. "Stochastic Shortest Path Problems Under Weak Conditions," Report LIDS-2909; Math. of OR, to appear. Also the on-line updated Ch. 4 of the text.
6.231 DYNAMIC PROGRAMMING
LECTURE 18
LECTURE OUTLINE
Reference:
Updated Chapter 4 of Vol. II of the text:
Noncontractive Total Cost Problems
On-line at:
http://web.mit.edu/dimitrib/www/dpchapter.html
Check for most recent version
CONTRACTIVE/SEMICONTRACTIVE PROBLEMS
UNDISCOUNTED TOTAL COST PROBLEMS
SUMMARY OF ALGORITHMIC RESULTS
[Figure: solution sets of Bellman's equation for the two-node example in Case P and Case N. In Case P, J^* = (b, 0) and VI fails starting from \hat{J}(1) = 0, \hat{J}(t) = 0; in Case N, J^* = (0, 0), VI fails starting from J(1) < J^*(1), J(t) = 0, and PI stops at a suboptimal policy or oscillates between two policies]
Bellman Equation:
J(1) = \min\big[ J(1), \ b + J(t) \big], \quad J(t) = J(t)
DETERM. OPT. CONTROL - FORMULATION
210
DETERM. OPT. CONTROL - ANALYSIS
If J_0 \ge J^* and J_0 \ge T J_0, we have J_k \to J^*.
Rollout with terminating heuristic (e.g., MPC).
LINEAR-QUADRATIC ADAPTIVE CONTROL
FINITE-STATE AFFINE MONOTONIC PROBLEMS
Cost function of \pi = \{\mu_0, \mu_1, \ldots\}:
J_\pi(i) = \limsup_{N \to \infty} (T_{\mu_0} \cdots T_{\mu_{N-1}} \overline{J})(i), \quad i = 1, \ldots, n
Interpretation:
AFFINE MONOTONIC PROBLEMS: ANALYSIS
LECTURE 19
LECTURE OUTLINE
APPROXIMATION IN VALUE SPACE
[Figure: position evaluator. Extraction of features (material balance, mobility, safety, etc.) followed by feature weighting produces a score]
[Figure: state i \to feature extraction mapping \to feature vector \phi(i) \to linear cost approximator \phi(i)'r]
Linear feature-based architecture: \tilde{J}(i; r) = \phi(i)' r, \quad i \in I
TERMINATION
DIRECTLY APPROXIMATING J OR Q
[Figure: projection of J (or the Q-factors) onto the subspace S = \{\Phi r \mid r \in \Re^s\} spanned by the basis functions]
INDIRECT POLICY EVALUATION
\Phi r = \Pi T_\mu(\Phi r)
[Figure: two approaches on the subspace S = \{\Phi r \mid r \in \Re^s\}. Direct method: projection \Pi J_\mu of the cost vector J_\mu. Indirect method: solving a projected form of Bellman's equation, \Phi r = \Pi T_\mu(\Phi r)]
[Figure: hard aggregation example. States 1, \ldots, 9 grouped into subsets x_1 = \{1, 2, 3\}, x_2, x_3, x_4; \Phi is the 0-1 membership matrix with rows of the form (1, 0, 0, 0), (0, 1, 0, 0), \ldots]
AGGREGATION AS PROBLEM APPROXIMATION
[Figure: aggregation framework. Original system states i, j transition according to p_{ij}(u) with cost g(i, u, j); aggregate states are connected to the original system through disaggregation probabilities d_{xi} and aggregation probabilities \phi_{jy}]
If the approximation errors satisfy
\max_i \big| \tilde{J}(i, r_k) - J_{\mu^k}(i) \big| \le \delta, \quad \max_i \big| (T_{\mu^{k+1}} \tilde{J})(i, r_k) - (T \tilde{J})(i, r_k) \big| \le \epsilon, \quad k = 0, 1, \ldots
then
\limsup_{k \to \infty} \max_i \big( J_{\mu^k}(i) - J^*(i) \big) \le \frac{\delta + 2\epsilon}{(1 - \alpha)^2}
THE ISSUE OF EXPLORATION
r is an adjustable parameter vector and \tilde{Q}(i, u; r) is a parametric architecture, such as
\tilde{Q}(i, u; r) = \sum_{m=1}^{s} r_m \phi_m(i, u)
STOCHASTIC ALGORITHMS: GENERALITIES
x_m = (1 - A_m)^{-1} b_m
or iteratively.
TD(\lambda) and Q-learning are SA methods
LSTD(\lambda) and LSPE(\lambda) are MCE methods
6.231 DYNAMIC PROGRAMMING
LECTURE 20
LECTURE OUTLINE
REVIEW: APPROXIMATION IN VALUE SPACE
then
\limsup_{k \to \infty} \max_{i=1,\ldots,n} \big( \tilde{J}_k(i, r_k) - J^*(i) \big) \le \frac{2 \alpha \delta}{1 - \alpha}
NORM MISMATCH PROBLEM
APPROXIMATE PI
APPROXIMATE POLICY EVALUATION
[Figure: direct vs. indirect approximate policy evaluation on the subspace S = \{\Phi r \mid r \in \Re^s\}. Direct method: projection \Pi J_\mu of the cost vector J_\mu. Indirect method: solving the projected form of Bellman's equation \Phi r = \Pi T_\mu(\Phi r)]
PI WITH INDIRECT POLICY EVALUATION
\Phi r = \Pi T_\mu(\Phi r)
KEY QUESTIONS AND RESULTS
\|J_\mu - \Phi r^*\|_\xi \le \frac{1}{\sqrt{1 - \alpha^2}} \|J_\mu - \Pi J_\mu\|_\xi
PRELIMINARIES: PROJECTION PROPERTIES
Nonexpansiveness: \|\Pi J - \Pi J'\|_\xi \le \|J - J'\|_\xi, for all J, J' \in \Re^n.
Pythagorean decomposition:
\big\| \Pi(J - J') \big\|_\xi^2 + \big\| (I - \Pi)(J - J') \big\|_\xi^2 = \|J - J'\|_\xi^2
PROOF OF CONTRACTION PROPERTY
\|\Pi T J - \Pi T J'\|_\xi \le \|T J - T J'\|_\xi = \alpha \|P(J - J')\|_\xi \le \alpha \|J - J'\|_\xi
Error bound:
\|J_\mu - \Phi r^*\|_\xi \le \frac{1}{\sqrt{1 - \alpha^2}} \|J_\mu - \Pi J_\mu\|_\xi.
Proof: We have
\|J_\mu - \Phi r^*\|_\xi^2 = \|J_\mu - \Pi J_\mu\|_\xi^2 + \|\Pi J_\mu - \Phi r^*\|_\xi^2
= \|J_\mu - \Pi J_\mu\|_\xi^2 + \big\| \Pi T_\mu J_\mu - \Pi T_\mu(\Phi r^*) \big\|_\xi^2
\le \|J_\mu - \Pi J_\mu\|_\xi^2 + \alpha^2 \|J_\mu - \Phi r^*\|_\xi^2,
where
The first equality uses the Pythagorean Theorem
The second equality holds because J_\mu is the fixed point of T_\mu and \Phi r^* is the fixed point of \Pi T_\mu
The inequality uses the contraction property of \Pi T_\mu.
Q.E.D.
MATRIX FORM OF PROJECTED EQUATION
C = \Phi' \Xi (I - \alpha P) \Phi, \quad d = \Phi' \Xi g
but computing C and d is HARD (high-dimensional inner products).
SOLUTION OF PROJECTED EQUATION
[Figure: projected value iteration; T(\Phi r_k) is projected onto the subspace S spanned by the basis functions to obtain \Phi r_{k+1}]
which yields
r_{k+1} = r_k - (\Phi' \Xi \Phi)^{-1} (C r_k - d)
SIMULATION-BASED IMPLEMENTATIONS
\hat{r}_k = C_k^{-1} d_k
SIMULATION MECHANICS
C_k = \frac{1}{k+1} \sum_{t=0}^{k} \phi(i_t)\big( \phi(i_t) - \alpha \phi(i_{t+1}) \big)' \to \Phi' \Xi (I - \alpha P)\Phi = C
LSPE iteration: r_{k+1} = r_k - G_k (C_k r_k - d_k)
6.231 DYNAMIC PROGRAMMING
LECTURE 21
LECTURE OUTLINE
REVIEW: PROJECTED BELLMAN EQUATION
or more compactly, T J = g + \alpha P J
Approximate Bellman's equation J = T J by
\Phi r = \Pi T(\Phi r), or the matrix form/orthogonality condition C r^* = d, where
C = \Phi' \Xi (I - \alpha P)\Phi, \quad d = \Phi' \Xi g.
[Figure: T(\Phi r) projected on the subspace S spanned by the basis functions; the fixed point satisfies \Phi r^* = \Pi T(\Phi r^*)]
PROJECTED EQUATION METHODS
Matrix inversion: r^* = C^{-1} d
Iterative Projected Value Iteration (PVI) method:
\Phi r_{k+1} = \Pi T(\Phi r_k) = \Pi(g + \alpha P \Phi r_k)
Converges to r^* if \Pi T is a contraction. True if \Pi is projection w.r.t. the steady-state distribution norm.
Simulation-Based Implementations: generate a simulated transition sequence \{i_0, i_1, \ldots, i_k\} and approximations C_k \approx C and d_k \approx d:
C_k = \frac{1}{k+1} \sum_{t=0}^{k} \phi(i_t)\big( \phi(i_t) - \alpha \phi(i_{t+1}) \big)' \approx \Phi' \Xi (I - \alpha P)\Phi
d_k = \frac{1}{k+1} \sum_{t=0}^{k} \phi(i_t)\, g(i_t, i_{t+1}) \approx \Phi' \Xi g
LSTD: \hat{r}_k = C_k^{-1} d_k
LSPE: r_{k+1} = r_k - G_k (C_k r_k - d_k), where G_k \to G = (\Phi' \Xi \Phi)^{-1}
Converges to r^* if \Pi T is a contraction.
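A minimal simulation-based LSTD sketch for evaluating a fixed policy on a small chain; the chain, costs, and features are illustrative assumptions (with a full-rank \Phi the estimate recovers the exact J_\mu):

import numpy as np

rng = np.random.default_rng(0)
alpha = 0.9
P = np.array([[0.8, 0.2], [0.3, 0.7]])     # transition matrix of the policy
g = np.array([1.0, 2.0])                   # expected stage costs g(i)
phi = lambda i: np.array([1.0, float(i)])  # s = 2 features

# Generate a trajectory i_0, i_1, ...
K = 20000
traj = [0]
for _ in range(K):
    traj.append(rng.choice(2, p=P[traj[-1]]))

s = 2
C, d = np.zeros((s, s)), np.zeros(s)
for t in range(K):
    i, j = traj[t], traj[t + 1]
    C += np.outer(phi(i), phi(i) - alpha * phi(j))   # builds C_k
    d += phi(i) * g[i]                               # builds d_k
C /= K; d /= K

r = np.linalg.solve(C, d)                  # LSTD: r_k = C_k^{-1} d_k
print("Phi r =", [phi(i) @ r for i in (0, 1)])
print("exact J =", np.linalg.solve(np.eye(2) - alpha * P, g))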
ISSUES FOR PROJECTED EQUATIONS
Error bound: \|J_\mu - \Phi r^*_\lambda\|_\xi \le \frac{1}{\sqrt{1 - \alpha_\lambda^2}} \|J_\mu - \Pi J_\mu\|_\xi
[Figure: the bias-variance tradeoff; the solution \Phi r^*_\lambda moves toward \Pi J_\mu as \lambda \to 1]
As \lambda \to 1, we have \alpha_\lambda \to 0, so the error bound (and the quality of approximation) improves:
\lim_{\lambda \to 1} \Phi r^*_\lambda = \Pi J_\mu
with
P^{(\lambda)} = (1 - \lambda) \sum_{\ell=0}^{\infty} \alpha^\ell \lambda^\ell P^{\ell+1}, \quad g^{(\lambda)} = \sum_{\ell=0}^{\infty} \alpha^\ell \lambda^\ell P^\ell g
The LSTD(\lambda) method is \hat{r}_k = \big(C_k^{(\lambda)}\big)^{-1} d_k^{(\lambda)}, where C_k^{(\lambda)} and d_k^{(\lambda)} are simulation-based approximations of C^{(\lambda)} and d^{(\lambda)}.
The LSPE(\lambda) method is
r_{k+1} = r_k - G_k \big( C_k^{(\lambda)} r_k - d_k^{(\lambda)} \big)
The simulation process to obtain C_k^{(\lambda)} and d_k^{(\lambda)} is similar to the case \lambda = 0 (single simulation trajectory i_0, i_1, \ldots, more complex formulas):
C_k^{(\lambda)} = \frac{1}{k+1} \sum_{t=0}^{k} \phi(i_t) \sum_{m=t}^{k} \alpha^{m-t} \lambda^{m-t} \big( \phi(i_m) - \alpha \phi(i_{m+1}) \big)'
d_k^{(\lambda)} = \frac{1}{k+1} \sum_{t=0}^{k} \phi(i_t) \sum_{m=t}^{k} \alpha^{m-t} \lambda^{m-t} g_{i_m}
In the context of approximate policy iteration, we can use optimistic versions (few samples between policy updates).
Many different versions (see the text).
Note the \lambda-tradeoffs:
As \lambda \to 1, C_k^{(\lambda)} and d_k^{(\lambda)} contain more simulation noise, so more samples are needed for a close approximation of r_\lambda
The error bound \|J_\mu - \Phi r^*_\lambda\|_\xi becomes smaller
As \lambda \to 1, \Pi T^{(\lambda)} becomes a contraction for an arbitrary projection norm
APPROXIMATE PI ISSUES - EXPLORATION
[Figure: the greedy partition; parameter iterates r_{k+1}, r_{k+2}, \ldots move between regions R_{\mu^{k+1}}, R_{\mu^{k+2}}, R_{\mu^{k+3}}, illustrating oscillation of approximate PI]
MORE ON OSCILLATIONS/CHATTERING
[Figure: chattering among policies \mu^1, \mu^2, \mu^3 with parameters r_1, r_2, r_3 cycling through the greedy-partition regions R_1, R_2, R_3]
In contrast, when policy evaluation is of the form
\Phi r = (W T_\mu)(\Phi r)
with W a monotone operator, the generated policies converge (to an approximately optimal limit).
The operator W used in the aggregation approach has this monotonicity property.
6.231 DYNAMIC PROGRAMMING
LECTURE 22
LECTURE OUTLINE
PROBLEM APPROXIMATION - AGGREGATION
R = \min_u \big[ \hat{g} + \alpha \hat{P} R \big] - Bellman's equation for the aggregate problem.
The optimal cost J^* of the original problem is approximated using interpolation, J^* \approx \tilde{J} = \Phi R:
\tilde{J}(j) = \sum_{y} \phi_{jy}\, R(y), \quad \forall j
EXAMPLE I: HARD AGGREGATION
[Figure: hard aggregation of states 1, \ldots, 9 into four aggregate states x_1 = \{1, 2, 3\}, x_2, x_3, x_4; \Phi is the 0-1 membership matrix with rows (1,0,0,0), (0,1,0,0), \ldots. Special states, aggregate states, features]
EXAMPLE III: REP. STATES/COARSE GRID
[Figure: representative states / coarse grid. Aggregate states y_1, y_2, y_3 on a coarse grid; each original state j_1, j_2, j_3 is associated with nearby representative states via aggregation probabilities]
Q-LEARNING II
LECTURE 23
LECTURE OUTLINE
REVIEW: PROJECTED BELLMAN EQUATION
\Phi r = \Pi T_\mu(\Phi r)
[Figure: T_\mu(\Phi r) projected on the subspace S spanned by the basis functions; the fixed point satisfies \Phi r^* = \Pi T_\mu(\Phi r^*)]
where
q_t(i) = P(i_t = i), \quad i = 1, \ldots, n, \ t = 0, 1, \ldots
We use the projection norm
\|J\|_q = \sqrt{ \sum_{i=1}^{n} q(i)\, J(i)^2 }
where
\beta = 1 - \min_j \frac{q_0(j)}{q(j)}
AVERAGE COST PROBLEMS
F J = g - \lambda e + P J
\Phi r = \Pi F(\Phi r), \quad \Phi r = \Pi F^{(\lambda)}(\Phi r)
A problem here is that F is not a contraction with respect to any norm (since e = P e).
F^{(\lambda)} is a contraction w.r.t. \|\cdot\|_\xi assuming that e does not belong to S and \lambda > 0 (the case \lambda = 0 is exceptional, but can be handled); see the text. LSTD(\lambda), LSPE(\lambda), and TD(\lambda) are possible.
GENERALIZATION/UNIFICATION
T(x) = Ax + b, \quad A \text{ is } n \times n, \ b \in \Re^n
Simulation-based solution: solve
\sum_{k=0}^{t} \phi(i_k) \Big( \phi(i_k) - \frac{a_{i_k j_k}}{p_{i_k j_k}} \phi(j_k) \Big)' r_t = \sum_{k=0}^{t} \phi(i_k)\, b_{i_k}
We have r_t \to r^*, regardless of A being a contraction (by law of large numbers; see next slide).
Issues of singularity or near-singularity of I - A may be important; see the text.
An LSPE-like method is also possible, but requires that A is a contraction.
Under the assumption \sum_{j=1}^{n} |a_{ij}| \le 1 for all i, there are conditions that guarantee contraction of A; see the text.
JUSTIFICATION W/ LAW OF LARGE NUMBERS
\frac{1}{t+1} \sum_{k=0}^{t} \phi(i_k)\, b_{i_k} = \sum_{i=1}^{n} \hat{\xi}_{i,t}\, \phi(i)\, b_i \to \sum_{i=1}^{n} \xi_i\, \phi(i)\, b_i
BASIS FUNCTION ADAPTATION I
F\big(\tilde{J}_\theta\big) = \sum_{i \in I} \big| J(i) - \tilde{J}_\theta(i) \big|^2,
where I is a subset of states, and J(i), i \in I, are the costs of the policy at these states calculated directly by simulation.
Another example is
F\big(\tilde{J}_\theta\big) = \big\| \tilde{J}_\theta - T \tilde{J}_\theta \big\|^2,
BASIS FUNCTION ADAPTATION II
APPROXIMATION IN POLICY SPACE I
APPROXIMATION IN POLICY SPACE II
By left-multiplying with \xi',
\xi' \nabla \lambda(r) e + \xi' \nabla h(r) = \xi' \big( \nabla g(r) + \nabla P(r)\, h(r) \big) + \xi' P(r) \nabla h(r),
and since \xi' P(r) = \xi' and \xi' e = 1,
\nabla \lambda(r) = \xi' \big( \nabla g + \nabla P\, h \big)
Since we don't know \xi, we cannot implement a gradient-like method for minimizing \lambda(r). An alternative is to use sampled gradients, i.e., generate a simulation trajectory (i_0, i_1, \ldots), and change r once in a while, in the direction of a simulation-based estimate of \xi'(\nabla g + \nabla P\, h).
Important Fact: \nabla \lambda can be viewed as an expected value!
Much research on this subject, see the text.
6.231 DYNAMIC PROGRAMMING
OVERVIEW-EPILOGUE
FINITE HORIZON PROBLEMS - ANALYSIS
FINITE HORIZON PROBS - EXACT COMP. SOL.
FINITE HORIZON PROBS - APPROX. SOL.
INFINITE HORIZON PROBLEMS - ANALYSIS
INF. HORIZON PROBS - EXACT COMP. SOL.
Value iteration
Variations (Gauss-Seidel, asynchronous, etc.)
Policy iteration
Variations (asynchronous, based on value iteration, optimistic, etc.)
Linear programming
Elegant algorithmic analysis
Curse of dimensionality is major bottleneck
INFINITE HORIZON PROBS - ADP
For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.