
ADP linprog

March 31, 2016


In [2]: using QuantEcon
        using MathProgBase
        using PlotlyJS

Linear programming approach to dynamic programming

Author: Spencer Lyon


Date: 3-30-16
In this notebook I will explore how to represent dynamic programming problems as linear programs.
I show how to convert an instance of DiscreteDP from QuantEcon.jl into a linear program.
I then solve a simple example model and run some benchmarks comparing the linear programming approach with the policy function iteration, value function iteration, and modified policy function iteration routines
already in QuantEcon.jl.

1.1 Theory

I develop the theoretical ideas within the context of models with discrete state and action spaces. Models
that are continuous in either dimension can be approximated by a discretized version in order to apply the
techniques presented here.
For a more in-depth overview, see the quant-econ.net lecture on discrete dynamic programming.
1.1.1 Discrete dynamic programs

I first repeat the formal definition of a discrete dynamic program that can be found in the quant-econ.net
lecture.
A discrete dynamic program consists of:

A finite, discrete set $S = \{1, 2, \ldots, n\}$ of states.
A finite set of feasible actions $A(s)$ for each state $s \in S$. Denote by $SA := \{(s, a) \,|\, s \in S,\ a \in A(s)\}$ the set of feasible state-action pairs. We call the set $A := \cup_{s \in S} A(s)$ the action space.
A reward function $r : SA \to \mathbb{R}$.
A transition probability function $Q : SA \to \Delta(S)$, where $\Delta(S)$ is the set of probability distributions over $S$.
A discount factor $\beta \in (0, 1)$.

A policy is a function $\sigma : S \to A$. A policy is feasible if it satisfies $\sigma(s) \in A(s) \; \forall s \in S$.
Let $\Sigma$ denote the set of feasible policies.
I will represent the function $Q$ as $Q(s, a, s')$ where $s \in S$, $a \in A(s)$. $Q$ satisfies $\sum_{s' \in S} Q(s, a, s') = 1$ and
$Q(s, a, s') \in [0, 1] \; \forall s' \in S$.

For each $\sigma \in \Sigma$ define

$Q_\sigma$ as the matrix with $(s, s')$ element $Q(s, \sigma(s), s')$. Note that $Q_\sigma$ is a stochastic matrix.

$r_\sigma$ as the vector with $s$-th element $r(s, \sigma(s))$.
The value of following a policy $\sigma$ is given by the infinite summation:

$$v_\sigma := \sum_{t=0}^{\infty} \beta^t Q_\sigma^t r_\sigma.$$
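Because $Q_\sigma$ is a stochastic matrix and $\beta \in (0, 1)$, this geometric series converges and $v_\sigma$ solves the linear system below (a standard policy-evaluation identity, stated here for reference):

$$v_\sigma = r_\sigma + \beta Q_\sigma v_\sigma \quad \Longleftrightarrow \quad v_\sigma = (I - \beta Q_\sigma)^{-1} r_\sigma.$$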

Define a Bellman operator $T : \mathbb{R}^{|S|} \to \mathbb{R}^{|S|}$ as

$$(Tv)(s) = \max_{a \in A(s)} \left\{ r(s, a) + \beta \sum_{s' \in S} Q(s, a, s') v(s') \right\}.$$
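To see the operator concretely, here is a minimal sketch of $T$ for the array representation used later in this notebook: an $n \times m$ reward matrix R with -Inf marking infeasible actions, and an $n \times m \times n$ transition array Q (the function name and layout are illustrative assumptions, not part of QuantEcon.jl):

# A minimal sketch of the Bellman operator T.
# R[s, a] is the reward (-Inf if action a is infeasible at state s),
# Q[s, a, sp] is the probability of moving from s to sp under action a.
function bellman_operator(R::Matrix{Float64}, Q::Array{Float64,3},
                          beta::Float64, v::Vector{Float64})
    n, m = size(R)
    Tv = fill(-Inf, n)
    for s in 1:n, a in 1:m
        # expected continuation value of choosing action a in state s
        ev = 0.0
        for sp in 1:n
            ev += Q[s, a, sp] * v[sp]
        end
        # infeasible actions have R[s, a] = -Inf and never attain the max
        Tv[s] = max(Tv[s], R[s, a] + beta * ev)
    end
    Tv
end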

I make assumptions about $r$ and $Q$ that cause $T$ to be a contraction mapping.

With $T$ being a contraction mapping, I can appeal to the Banach fixed point theorem and know that $T$
has a unique fixed point $Tv^* = v^*$. Let $\sigma^*$ be the corresponding policy (meaning $v_{\sigma^*} = v^*$).


Theorem For a vector $v \in \mathbb{R}^{|S|}$:
1. if $v \geq Tv$ then $v \geq v^*$
2. if $v \leq Tv$ then $v \leq v^*$
3. if $v = Tv$ then $v = v^*$
Proof

This proof follows a proof in Powell, W. (2010). Approximate Dynamic Programming, 1-647. http://doi.org/10.1002/9780470182963

I will sketch the proof of part (1). Part (2) follows in an analogous way, and part (3) can be obtained
by combining parts (1) and (2).
Let $\sigma_0, \sigma_1, \ldots$ be feasible policies (not necessarily optimal). Then

$$
\begin{aligned}
v &\geq Tv \\
  &= \max_{a \in A(s)} \left\{ r(s, a) + \beta \sum_{s' \in S} Q(s, a, s') v(s') \right\} \\
  &\geq r_{\sigma_0} + \beta Q_{\sigma_0} v \\
  &\geq r_{\sigma_0} + \beta Q_{\sigma_0} \left( r_{\sigma_1} + \beta Q_{\sigma_1} v \right) \\
  &= r_{\sigma_0} + \beta Q_{\sigma_0} r_{\sigma_1} + \beta^2 Q_{\sigma_0} Q_{\sigma_1} v
\end{aligned}
$$

Define $Q_{\sigma,(t)} := Q_{\sigma_0} Q_{\sigma_1} \cdots Q_{\sigma_t}$.
We can continue the string of inequalities from above and by induction write

$$v \geq r_{\sigma_0} + \beta Q_{\sigma_0} r_{\sigma_1} + \cdots + \beta^{t-1} \left( \prod_{j=0}^{t-2} Q_{\sigma_j} \right) r_{\sigma_{t-1}} + \beta^t Q_{\sigma,(t)} v$$

Consider following some policy $\sigma$. Breaking the infinite summation for $v_\sigma$ (see above) into two parts,
we can re-write the previous inequality as

$$v \geq v_\sigma + \beta^t Q_{\sigma,(t)} v - \sum_{j=t+1}^{\infty} \beta^j Q_{\sigma,(j)} r_{\sigma_{j+1}}$$

Taking the limit of both sides of this inequality as $t \to \infty$ gives

$$v \geq \lim_{t \to \infty} \left[ v_\sigma + \beta^t Q_{\sigma,(t)} v - \sum_{j=t+1}^{\infty} \beta^j Q_{\sigma,(j)} r_{\sigma_{j+1}} \right] = v_\sigma$$

The limit above exists when $r$ is bounded. The previous inequality is true for any policy, therefore it
must also be true for the optimal policy. Thus we have that

$$v \geq v_{\sigma^*} = v^*$$

1.1.2 Mathematical Programs

Let $b \in \mathbb{R}^{|S|}_{++}$ be any vector with strictly positive elements.


Consider the mathematical program defined by

$$\min_v \; b'v \quad \text{s.t.} \quad v \geq Tv$$

By the theorem above, every vector $v \in \mathbb{R}^{|S|}$ in the constraint set satisfies $v \geq v^*$, and $v^*$ itself is feasible (since $v^* = Tv^*$).
Because the objective minimizes the inner product of $v$ with a strictly positive vector, $b'v \geq b'v^*$ for every feasible $v$, with equality only at $v = v^*$. The unique solution to this
mathematical program is therefore the unique fixed point of the Bellman operator: $v^*$.

The objective function of this program is linear in $v$, but even in our discrete setting the constraint is
non-linear because the operator $T$ includes a max.
Notice that $v \geq Tv$ is a set of $|S|$ constraints. If we look at the constraint for a particular $s \in S$ we notice
the following implication:

$$v(s) \geq r(s, a) + \beta \sum_{s' \in S} Q(s, a, s') v(s') \;\; \forall a \in A(s) \implies v(s) \geq (Tv)(s).$$

This means that for each $s \in S$ the single non-linear constraint $v(s) \geq (Tv)(s)$ can be replaced with a
system of linear inequalities, one for each feasible action at that state.
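For example, at a state $s$ with two feasible actions $A(s) = \{a_1, a_2\}$, the single constraint $v(s) \geq (Tv)(s)$ becomes the pair of linear constraints

$$v(s) \geq r(s, a_1) + \beta \sum_{s' \in S} Q(s, a_1, s') v(s'), \qquad v(s) \geq r(s, a_2) + \beta \sum_{s' \in S} Q(s, a_2, s') v(s').$$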
This insight motivates the following linear program:

$$\min_v \; b'v \quad \text{s.t.} \quad v(s) \geq r(s, a) + \beta \sum_{s' \in S} Q(s, a, s') v(s') \;\; \forall s \in S,\ a \in A(s)$$

1.2 Numerical Implementation

The Julia library MathProgBase provides a unified interface to many industrial-quality linear programming
solvers. The interface is documented here and is accessible via the linprog function.
The signature of the linprog function is: linprog(b, A, ">", lb).
The linear program solved by this method is

$$\min_v \; b'v \quad \text{s.t.} \quad Av \geq lb$$
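As a quick illustration of this interface, consider the toy LP $\min\; 2x + y$ subject to $x + y \geq 1$ and $x \geq 1/4$ (a small example of my own; it assumes MathProgBase can find an installed LP solver such as Clp):

# toy problem: minimize 2x + y  subject to  x + y >= 1  and  x >= 0.25
c  = [2.0, 1.0]
A  = [1.0 1.0;
      1.0 0.0]
lb = [1.0, 0.25]
toy = linprog(c, A, '>', lb)
toy.sol    # optimal solution should be approximately [0.25, 0.75]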

I can recast the constraint of my linear program into this framework by following these steps:

$$v(s) \geq r(s, a) + \beta \sum_{s' \in S} Q(s, a, s') v(s') \;\; \forall s \in S,\ a \in A(s)$$

$$v \geq r(:, a) + \beta \sum_{s' \in S} Q(:, a, s') v(s') \;\; \forall a \in A$$

$$(I - \beta Q(:, a, :))\, v \geq r(:, a) \;\; \forall a \in A,$$

where a : in any position is Julia-esque shorthand for all elements in that slice of an array.

Thus I can construct the A matrix expected by linprog in pieces. Specifically, for each $a \in A$ I have
$A_a = I - \beta Q(:, a, :)$.
I then stack each $A_a$ on top of each other, forming a matrix $A$ that is $|S||A| \times |S|$.
The vector lb is obtained in a similar fashion: $r_a = r(:, a)$ and lb is constructed by stacking
all the $r_a$ vectors.
Let's write a function that will construct A and lb from an instance of DiscreteDP. Notice that we return
a sparse version of A so that the linear programming solver can exploit the sparsity of our constraint matrix.
In [4]: function linprog_constraints(ddp::QuantEcon.DDP)
            n, m = size(ddp.R)

            # build A
            A = Array(Float64, n*m, n)
            for j in 1:m
                A[1+(j-1)*n:j*n, :] = I - ddp.beta*slice(ddp.Q, :, j, :)
            end

            # build lb
            lb = vec(ddp.R)

            sparse(A), lb
        end
Out[4]: linprog_constraints (generic function with 1 method)
While we're at it, let's add methods to MathProgBase.linprog for our DiscreteDP so that if we have
an instance of DiscreteDP named ddp we can simply call linprog(ddp).
In [5]: function MathProgBase.linprog(ddp::QuantEcon.DDP, A, lb)
            b = ones(size(ddp.R, 1))
            linprog(b, A, '>', lb)
        end

        function MathProgBase.linprog(ddp::QuantEcon.DDP)
            A, lb = linprog_constraints(ddp)
            linprog(ddp, A, lb)
        end
Out[5]: linprog (generic function with 6 methods)

1.3 Example

I will now consider a very simple example of a discrete dynamic program. This example is discussed in
detail in this section of the quant-econ.net lecture.
I write a function to create an instance of DiscreteDP for the example growth model from the lecture.

In [6]: function growth_ddp(;B::Int=10, M::Int=5, α::Float64=0.5, β::Float64=0.9)
            n = B + M + 1
            m = M + 1

            u(c) = c.^α
            R = Float64[(a <= s ? u(s-a) : -Inf) for s in 0:(n-1), a in 0:(m-1)]

            # Q[s, a, s'] = (a <= s' <= a + B) ? 1/(1+B) : 0.0
            Q = zeros(Float64, n, m, n)
            p = 1.0/(B+1)
            for a in 1:m
                Q[:, a, a:(a+B)] = p
            end

            DiscreteDP(R, Q, β)
        end
Out[6]: growth_ddp (generic function with 1 method)
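Before solving anything, here is a quick sanity check of the shapes produced by linprog_constraints (an illustrative check: with the defaults B = 10 and M = 5 we have n = 16 states and m = 6 actions, so A should be 96 x 16 and lb should have 96 elements):

ddp_check = growth_ddp()                  # defaults: B = 10, M = 5
A_check, lb_check = linprog_constraints(ddp_check)
size(A_check), length(lb_check)           # expect ((96, 16), 96)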
Now the fun part... let's try it out!
In [7]: # construct a DiscreteDP
        ddp1 = growth_ddp();

        # solve it via linear programming and policy function iteration
        sol_linprog = linprog(ddp1)
        sol_pfi = solve(ddp1, PFI);

        # compare the computed value functions
        sol_linprog.sol - sol_pfi.v
Out[7]: 16-element Array{Float64,1}:
0.0
-3.55271e-15
0.0
3.55271e-15
1.06581e-14
1.06581e-14
7.10543e-15
7.10543e-15
1.42109e-14
0.0
0.0
3.55271e-15
7.10543e-15
0.0
0.0
0.0
Nice! The linear programming formulation produced the same solution vector as the well-tested PFI
routine. This is an indication that we haven't made any mistakes.
Now that we have routines to do this, let's run some benchmarks and see how the linprog routine scales
compared to policy function iteration, value function iteration, and modified policy function iteration.

In [8]: # define helper methods that wrap the PFI, VFI, and MPFI
        # methods from QuantEcon.jl so we isolate the computation time
        # needed to compute the solution from the time needed to do
        # post-solution packaging.
        for (nm, typ) in [(:pfi, PFI), (:vfi, VFI), (:mpfi, MPFI)]
            @eval function $(nm)(ddp::QuantEcon.DDP)
                ddpr = QuantEcon.DPSolveResult{$(typ),Float64}(ddp)
                QuantEcon._solve!(ddp, ddpr, 1000, 1e-7, 200)
                ddpr
            end
        end

        function horse_race(other_solvers,
                            Bs=[10, 20, 50, 100, 250, 500],
                            Ms=[5, 10, 20, 50])
            nB = length(Bs)
            nM = length(Ms)
            times = zeros(Float64, length(other_solvers)+1, nB, nM)
            build_constraints_time = zeros(nB, nM)

            # run once to compile each routine
            ddp = growth_ddp(B=minimum(Bs), M=minimum(Ms))
            linprog(ddp)
            [f(ddp) for f in other_solvers]

            for (i_M, M) in enumerate(Ms), (i_B, B) in enumerate(Bs)
                ddp = growth_ddp(B=B, M=M)

                # handle linprog separately so we can separate the time to
                # build A from the time to run linprog, just as we separated
                # the time to solve via PFI and VFI from the time needed to
                # construct the controlled MarkovChain
                A, lb = linprog_constraints(ddp)
                tic()
                linprog(ddp, A, lb)
                t = toq()
                times[1, i_B, i_M] = t

                for (i_solver, solver) in enumerate(other_solvers)
                    tic()
                    solver(ddp)
                    t = toq()
                    times[i_solver+1, i_B, i_M] = t
                end
            end

            Bs, Ms, times
        end
Out[8]: horse_race (generic function with 3 methods)
In [9]: other_solvers = [pfi, vfi, mpfi]
        Bs, Ms, times = horse_race(other_solvers)
Out[9]: ([10,20,50,100,250,500],[5,10,20,50],
        4x6x4 Array{Float64,3}: [solve times in seconds for (linprog, pfi, vfi, mpfi) across each (B, M) pair; numeric output omitted])

Let's visualize these results.


In [11]: colors = ["#E24A33", "#348ABD", "#988ED5", "#777777"]
         names = ["linprog", "pfi", "vfi", "mpfi"]

         function scatter_mi(i, showlegend=false, times=times, names=names, colors=colors)
             plot(GenericTrace[scatter(x=Bs, y=slice(times, j, :, i), name=nm,
                                       showlegend=showlegend, marker_color=colors[j])
                               for (j, nm) in enumerate(names)],
                  Layout(yaxis_type="log", yaxis_title="time", title="M = $(Ms[i])",
                         xaxis_title="B"))
         end

         [scatter_mi(1, true) scatter_mi(2); scatter_mi(3) scatter_mi(4)]

WARNING: imported binding for names overwritten in module Main

Notice that policy function iteration consistently outperforms all other methods and the linear programming
approach is almost always the slowest.
This means that without altering the linprog algorithm (e.g. by using approximate linear programming),
the dynamic-programming-via-linear-programs result is nice theoretically, but impractical.
