
arXiv:math/0604452v1 [math.PR] 20 Apr 2006
A SHORT NOTE ON STATIONARY DISTRIBUTIONS OF
UNICHAIN MARKOV DECISION PROCESSES
RONALD ORTNER
Abstract. Dealing with unichain MDPs, we consider stationary distributions of policies
that coincide in all but n states. In these states each policy chooses one of two possi-
ble actions. We show that the stationary distributions of n + 1 such policies uniquely
determine the stationary distributions of all other such policies. An explicit formula for
their calculation is given.
1. Introduction
Definition 1.1. A Markov decision process (MDP) M on a (finite) set of states S with
a (finite) set of actions A available in each state consists of
(i) an initial distribution $\mu_0$ that specifies the probability of starting in some state in S, and
(ii) the transition probabilities $p_a(i, j)$ that specify the probability of reaching state j
when choosing action a in state i.
A (stationary) policy on M is a mapping $\pi: S \to A$.
Note that each policy $\pi$ induces a Markov chain on M. We are interested in MDPs
where in each of the induced Markov chains any state is reachable from any other state.
Definition 1.2. An MDP M is called unichain, if for each policy $\pi$ the Markov chain
induced by $\pi$ is ergodic, i.e. if the matrix $P_\pi = (p_{\pi(i)}(i, j))_{i,j \in S}$ is irreducible.
It is a well-known fact (cf. e.g. [1], p. 130) that for an ergodic Markov chain with
transition matrix P there exists a unique invariant and strictly positive distribution $\mu$,
such that independent of the initial distribution $\mu_0$ one has $\mu = \lim_{n\to\infty} \mu_0 \bar{P}_n$, where

$$\bar{P}_n = \frac{1}{n} \sum_{j=1}^{n} P^j.$$¹
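As a quick numerical illustration of this fact (my addition, not part of the original note): for the periodic two-state chain with transition matrix P = [[0, 1], [1, 0]], the iterates $\mu_0 P^n$ oscillate forever, yet the Cesàro averages $\mu_0 \bar{P}_n$ converge to the unique stationary distribution (1/2, 1/2).

```python
import numpy as np

# Periodic two-state chain: state 0 <-> state 1 deterministically.
# mu0 P^n oscillates between (1,0) and (0,1), but the Cesaro average
# (1/n) sum_{j=1}^n P^j converges to the stationary distribution (1/2, 1/2).
P = np.array([[0.0, 1.0],
              [1.0, 0.0]])
mu0 = np.array([1.0, 0.0])   # start deterministically in state 0

n = 1000
acc = np.eye(2)              # running power P^j
cesaro = np.zeros_like(P)
for _ in range(n):
    acc = acc @ P
    cesaro += acc
cesaro /= n                  # bar{P}_n = (1/n) sum_{j=1}^n P^j

mu = mu0 @ cesaro
print(mu)                    # -> [0.5 0.5]
```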
2. Main Theorem and Proof
Given n policies $\pi_1, \pi_2, \ldots, \pi_n$ we say that another policy $\pi$ is a combination of $\pi_1, \pi_2, \ldots, \pi_n$,
if for each state s one has $\pi(s) = \pi_i(s)$ for some i.
This work was supported in part by the Austrian Science Fund FWF (S9104-N04 SP4) and the IST
Programme of the European Community, under the PASCAL Network of Excellence, IST-2002-506778.
This publication only reflects the author's views.
¹ Actually, for aperiodic Markov chains one has even $\mu_0 P^n \to \mu$, while the convergence behavior of
periodic Markov chains can be described more precisely. However, for our purposes the stated fact is
sufficient.
Theorem 2.1. Let M be a unichain MDP and $\pi_1, \pi_2, \ldots, \pi_{n+1}$ pairwise distinct policies
on M that coincide on all but n states $s_1, s_2, \ldots, s_n$. In these states each policy applies
one of two possible actions, i.e. we assume that for each i and each j either $\pi_i(s_j) = 0$ or
$\pi_i(s_j) = 1$. Then the stationary distributions of all combinations of $\pi_1, \pi_2, \ldots, \pi_{n+1}$ are
uniquely determined by the stationary distributions $\mu_i$ of the policies $\pi_i$.
More precisely, if we represent each combined policy $\pi$ by the word $\pi(s_1)\pi(s_2)\ldots\pi(s_n)$, we
may assume without loss of generality (by swapping the names of the actions correspond-
ingly) that the policy we want to determine is $11\ldots1$. Let $S_{n+1}$ be the set of permutations
of the elements $\{1, \ldots, n+1\}$. Then setting

$$\Lambda_k := \{\sigma \in S_{n+1} \mid \sigma(k) = n+1 \text{ and } \pi_j(s_{\sigma(j)}) = 0 \text{ for all } j \neq k\}$$

one has for the stationary distribution $\mu$ of $\pi$

$$\mu(s) = \frac{\displaystyle\sum_{k=1}^{n+1} \sum_{\sigma \in \Lambda_k} \operatorname{sgn}(\sigma)\, \mu_k(s) \prod_{\substack{j=1 \\ j \neq k}}^{n+1} \mu_j(s_{\sigma(j)})}{\displaystyle\sum_{k=1}^{n+1} \sum_{\sigma \in \Lambda_k} \operatorname{sgn}(\sigma) \prod_{\substack{j=1 \\ j \neq k}}^{n+1} \mu_j(s_{\sigma(j)})}.$$
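The formula can be checked numerically. The sketch below is my own illustration (all helper names are made up, not from the note): it builds a random MDP with strictly positive transition kernels, so that every policy induces an ergodic chain, takes n = 2 disagreement states with the three policies 00, 01, 10, and compares the prediction of Theorem 2.1 for the policy 11 against the directly computed stationary distribution.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

def stationary(P):
    """Stationary distribution of an irreducible stochastic matrix:
    solve mu P = mu together with sum(mu) = 1."""
    N = P.shape[0]
    A = np.vstack([P.T - np.eye(N), np.ones(N)])
    b = np.zeros(N + 1)
    b[-1] = 1.0
    return np.linalg.lstsq(A, b, rcond=None)[0]

def sgn(sigma):
    """Sign of a permutation given in one-line notation."""
    inv = sum(sigma[a] > sigma[b]
              for a, b in itertools.combinations(range(len(sigma)), 2))
    return -1 if inv % 2 else 1

def mu_all_ones(words, mus, diff, s):
    """Theorem 2.1 (0-based): stationary probability of state s under the
    combined policy 11...1, given the action words and stationary
    distributions of the n+1 policies that differ on the states in diff."""
    n1 = len(words)                       # n + 1
    num = den = 0.0
    for k in range(n1):
        for sigma in itertools.permutations(range(n1)):
            if sigma[k] != n1 - 1:        # Lambda_k requires sigma(k) = n + 1
                continue
            if any(words[j][sigma[j]] != 0 for j in range(n1) if j != k):
                continue                  # and pi_j(s_sigma(j)) = 0 for j != k
            prod = 1.0
            for j in range(n1):
                if j != k:
                    prod *= mus[j][diff[sigma[j]]]
            num += sgn(sigma) * mus[k][s] * prod
            den += sgn(sigma) * prod
    return num / den

# Strictly positive kernels -> every induced chain is ergodic (unichain MDP).
N = 5
P0 = rng.random((N, N)) + 0.1
P0 /= P0.sum(axis=1, keepdims=True)       # transitions under action 0
P1 = rng.random((N, N)) + 0.1
P1 /= P1.sum(axis=1, keepdims=True)       # transitions under action 1

diff = [0, 1]                             # the n = 2 states s_1, s_2
def policy_matrix(word):                  # play `word` on diff, action 0 elsewhere
    P = P0.copy()
    for s, a in zip(diff, word):
        if a == 1:
            P[s] = P1[s]
    return P

words = [(0, 0), (0, 1), (1, 0)]          # n + 1 pairwise distinct policies
mus = [stationary(policy_matrix(w)) for w in words]

pred = np.array([mu_all_ones(words, mus, diff, s) for s in range(N)])
target = stationary(policy_matrix((1, 1)))
print(np.allclose(pred, target, atol=1e-6))
```

The strictly positive kernels are just a convenient way to guarantee the unichain assumption; any ergodic construction would do.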
For clarication of Theorem 2.1, we proceed with an example.
Example 2.2. Let M be a unichain MDP and $\pi_{000}, \pi_{010}, \pi_{101}, \pi_{110}$ policies on M whose
actions differ only in three states $s_1$, $s_2$ and $s_3$. The subindices of a policy $\pi$ correspond
to the word $\pi(s_1)\pi(s_2)\pi(s_3)$, so that e.g. $\pi_{010}(s_1) = \pi_{010}(s_3) = 0$ and $\pi_{010}(s_2) = 1$. Now
let $\mu_{000}, \mu_{010}, \mu_{101}$, and $\mu_{110}$ be the stationary distributions of the respective policies.
Theorem 2.1 tells us that we may calculate the distributions of all other policies that play
in states $s_1, s_2, s_3$ action 0 or 1 and coincide with the above mentioned policies in all
other states. In order to calculate e.g. the stationary distribution $\mu_{111}$ of policy $\pi_{111}$ in
an arbitrary state s, we have to calculate the sets $\Lambda_{000}, \Lambda_{010}, \Lambda_{101}$, and $\Lambda_{110}$. This can be
done by interpreting the subindices of our policies as rows of a matrix. In order to obtain
$\Lambda_k$ one cancels row k and looks for all possibilities in the remaining matrix to choose three
0s that neither share a row nor a column:
(chosen 0s in brackets, the cancelled row replaced by dashes)

     -  -  -    [0] 0  0     0 [0] 0     0  0 [0]   [0] 0  0
    [0] 1  0     -  -  -    [0] 1  0    [0] 1  0     0  1 [0]
     1 [0] 1     1 [0] 1     -  -  -     1 [0] 1     1 [0] 1
     1  1 [0]    1  1 [0]    1  1 [0]    -  -  -     -  -  -
Each of the matrices now corresponds to a permutation in $\Lambda_k$, where k corresponds to
the cancelled row. Thus $\Lambda_{000}$, $\Lambda_{010}$ and $\Lambda_{101}$ contain only a single permutation, while
$\Lambda_{110}$ contains two. The respective permutation can be read off each matrix as follows: note
for each row one after another the position of the chosen 0, and choose n + 1 for the
cancelled row. Thus the permutation for the third matrix is (2, 1, 4, 3). Now for each of
the matrices one has a term that consists of four factors (one for each row). The factor
for a row j is $\mu_j(s')$, where $s' = s$ if row j was cancelled (i.e. j = k), while otherwise $s'$
equals the state that corresponds to the column of row j in which the 0 was chosen. Thus
for the third matrix above one gets $\mu_{000}(s_2)\,\mu_{010}(s_1)\,\mu_{101}(s)\,\mu_{110}(s_3)$. Finally, one has
to consider the sign for each of the terms, which is the sign of the corresponding permutation. Putting
all together, normalizing the output vector and abbreviating $a_i := \mu_{000}(s_i)$, $b_i := \mu_{010}(s_i)$,
$c_i := \mu_{101}(s_i)$, and $d_i := \mu_{110}(s_i)$ one obtains

$$\mu_{111}(s) = \frac{\mu_{000}(s)\, b_1 c_2 d_3 - a_1\, \mu_{010}(s)\, c_2 d_3 - a_2 b_1\, \mu_{101}(s)\, d_3 + a_1 b_3 c_2\, \mu_{110}(s) - a_3 b_1 c_2\, \mu_{110}(s)}{b_1 c_2 d_3 - a_1 c_2 d_3 - a_2 b_1 d_3 + a_1 b_3 c_2 - a_3 b_1 c_2}.$$
Theorem 2.1 can be obtained from the following more general result, where the stationary
distribution of a randomized policy is considered.
Theorem 2.3. Under the assumptions of Theorem 2.1, the stationary distribution $\mu$ of
the policy that plays in state $s_i$ (i = 1, ..., n) action 0 with probability $\alpha_i \in [0, 1]$ and
action 1 with probability $(1 - \alpha_i)$ is given by

$$\mu(s) = \frac{\displaystyle\sum_{k=1}^{n+1} \sum_{\sigma \in \Gamma_k} \operatorname{sgn}(\sigma)\, \mu_k(s) \prod_{\substack{j=1 \\ j \neq k}}^{n+1} f(\sigma(j), j)}{\displaystyle\sum_{k=1}^{n+1} \sum_{\sigma \in \Gamma_k} \operatorname{sgn}(\sigma) \prod_{\substack{j=1 \\ j \neq k}}^{n+1} f(\sigma(j), j)},$$

where

$$\Gamma_k := \{\sigma \in S_{n+1} \mid \sigma(k) = n+1\}$$

and

$$f(i, j) := \begin{cases} \alpha_i\, \mu_j(s_i), & \text{if } \pi_j(s_i) = 1, \\ (\alpha_i - 1)\, \mu_j(s_i), & \text{if } \pi_j(s_i) = 0. \end{cases}$$

Theorem 2.1 follows from Theorem 2.3 by simply setting $\alpha_i = 0$ for i = 1, ..., n.
Proof of Theorem 2.3. Let S = {1, 2, ..., N} and assume that $s_i = i$ for i = 1, 2, ..., n.
We denote the probabilities associated with action 0 by $p_{ij} := p_0(i, j)$ and those of
action 1 by $q_{ij} := p_1(i, j)$. Furthermore, the probabilities in the states i = n+1, ..., N,
where the policies $\pi_1, \ldots, \pi_{n+1}$ coincide, are written as $p_{ij} := p_{\pi_k(i)}(i, j)$ as well. Now
setting

$$\nu_s := \sum_{k=1}^{n+1} \sum_{\sigma \in \Gamma_k} \operatorname{sgn}(\sigma)\, \mu_k(s) \prod_{\substack{j=1 \\ j \neq k}}^{n+1} f(\sigma(j), j)$$

and $\nu := (\nu_s)_{s \in S}$, we are going to show that $\nu P_\alpha = \nu$, where $P_\alpha$ is the probability matrix
of the randomized policy. Since the stationary distribution is unique, normalization of
the vector $\nu$ proves the theorem. Now
$$(\nu P_\alpha)_s = \sum_{i=1}^{n} \nu_i \big( \alpha_i p_{is} + (1 - \alpha_i) q_{is} \big) + \sum_{i=n+1}^{N} \nu_i\, p_{is}$$

$$= \sum_{i=1}^{n} \sum_{k=1}^{n+1} \sum_{\sigma \in \Gamma_k} \operatorname{sgn}(\sigma)\, \mu_k(i) \prod_{\substack{j=1 \\ j \neq k}}^{n+1} f(\sigma(j), j)\, \big( \alpha_i p_{is} + (1 - \alpha_i) q_{is} \big) + \sum_{i=n+1}^{N} \sum_{k=1}^{n+1} \sum_{\sigma \in \Gamma_k} \operatorname{sgn}(\sigma)\, \mu_k(i) \prod_{\substack{j=1 \\ j \neq k}}^{n+1} f(\sigma(j), j)\, p_{is}.$$
Since

$$\sum_{i=n+1}^{N} \mu_k(i)\, p_{is} = \mu_k(s) - \sum_{i:\, \pi_k(i) = 0} \mu_k(i)\, p_{is} - \sum_{i:\, \pi_k(i) = 1} \mu_k(i)\, q_{is},$$
this gives

$$(\nu P_\alpha)_s = \sum_{k=1}^{n+1} \sum_{\sigma \in \Gamma_k} \operatorname{sgn}(\sigma) \prod_{\substack{j=1 \\ j \neq k}}^{n+1} f(\sigma(j), j) \Big( \sum_{i=1}^{n} \mu_k(i) \big( \alpha_i p_{is} + (1 - \alpha_i) q_{is} \big) + \mu_k(s) - \sum_{i:\, \pi_k(i) = 0} \mu_k(i)\, p_{is} - \sum_{i:\, \pi_k(i) = 1} \mu_k(i)\, q_{is} \Big)$$

$$= \nu_s + \sum_{k=1}^{n+1} \sum_{\sigma \in \Gamma_k} \operatorname{sgn}(\sigma) \prod_{\substack{j=1 \\ j \neq k}}^{n+1} f(\sigma(j), j) \Big( \sum_{i:\, \pi_k(i) = 0} \mu_k(i)\, (\alpha_i - 1)(p_{is} - q_{is}) + \sum_{i:\, \pi_k(i) = 1} \mu_k(i)\, \alpha_i (p_{is} - q_{is}) \Big)$$

$$= \nu_s + \sum_{k=1}^{n+1} \sum_{\sigma \in \Gamma_k} \operatorname{sgn}(\sigma) \prod_{\substack{j=1 \\ j \neq k}}^{n+1} f(\sigma(j), j) \sum_{i=1}^{n} (p_{is} - q_{is})\, f(i, k)$$

$$= \nu_s + \sum_{i=1}^{n} (p_{is} - q_{is}) \sum_{k=1}^{n+1} \sum_{\sigma \in \Gamma_k} \operatorname{sgn}(\sigma)\, f(i, k) \prod_{\substack{j=1 \\ j \neq k}}^{n+1} f(\sigma(j), j).$$
Now it is easy to see that $\sum_{k=1}^{n+1} \sum_{\sigma \in \Gamma_k} \operatorname{sgn}(\sigma)\, f(i, k) \prod_{\substack{j=1 \\ j \neq k}}^{n+1} f(\sigma(j), j) = 0$: fix k and some
permutation $\sigma \in \Gamma_k$, and let $l := \sigma^{-1}(i)$. Then there is exactly one permutation $\sigma' \in \Gamma_l$,
such that $\sigma'(j) = \sigma(j)$ for $j \neq k, l$ and $\sigma'(k) = i$. The pairs $(k, \sigma)$ and $(l, \sigma')$ correspond
to the same summands

$$f(i, k) \prod_{\substack{j=1 \\ j \neq k}}^{n+1} f(\sigma(j), j) = f(i, l) \prod_{\substack{j=1 \\ j \neq l}}^{n+1} f(\sigma'(j), j),$$

yet, since $\operatorname{sgn}(\sigma) = -\operatorname{sgn}(\sigma')$, they have different sign and cancel each other out. □
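The cancellation argument is purely combinatorial, so it holds for arbitrary values in place of f(i, j). A small numerical check of this identity (my addition, with random values standing in for f):

```python
import itertools
import numpy as np

# Check, with arbitrary random values in place of f, that
#   sum_k sum_{sigma in Gamma_k} sgn(sigma) f(i, k) prod_{j != k} f(sigma(j), j) = 0
# for every i, where Gamma_k = {sigma in S_{n+1} : sigma(k) = n + 1}.
rng = np.random.default_rng(42)
n = 3
F = rng.standard_normal((n, n + 1))   # F[i, j] stands in for f(i+1, j+1)

def sgn(sigma):
    inv = sum(sigma[a] > sigma[b]
              for a, b in itertools.combinations(range(len(sigma)), 2))
    return -1 if inv % 2 else 1

totals = []
for i in range(n):
    total = 0.0
    for k in range(n + 1):
        for sigma in itertools.permutations(range(n + 1)):
            if sigma[k] != n:          # 0-based: sigma(k) = n + 1
                continue
            prod = 1.0
            for j in range(n + 1):
                if j != k:
                    prod *= F[sigma[j], j]
            total += sgn(sigma) * F[i, k] * prod
    totals.append(total)

print(max(abs(t) for t in totals))     # zero up to floating-point rounding
```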
References
[1] J.G. Kemeny, J.L. Snell, and A.W. Knapp. Denumerable Markov Chains. Springer, 1976.
[2] M.L. Puterman. Markov Decision Processes. Wiley Interscience, 1994.
E-mail address: rortner@unileoben.ac.at
Department Mathematik und Informationstechnologie
Montanuniversität Leoben
Franz-Josef-Strasse 18
8700 Leoben, Austria
