
Chapter 4

FINITE-STATE MARKOV CHAINS


4.1 Introduction

The counting processes {N(t), t ≥ 0} of Chapters 2 and 3 have the property that N(t) changes at discrete instants of time, but is defined for all real t ≥ 0. Such stochastic processes are generally called continuous time processes. The Markov chains to be discussed in this and the next chapter are stochastic processes defined only at integer values of time, n = 0, 1, . . . . At each integer time n ≥ 0, there is an integer-valued random variable (rv) Xn, called the state at time n, and the process is the family of rvs {Xn, n ≥ 0}. These processes are often called discrete time processes, but we prefer the more specific term integer time processes. An integer time process {Xn; n ≥ 0} can also be viewed as a continuous time process {X(t); t ≥ 0} by taking X(t) = Xn for n ≤ t < n + 1, but since changes only occur at integer times, it is usually simpler to view the process only at integer times.

In general, for Markov chains, the set of possible values for each rv Xn is a countable set usually taken to be {0, 1, 2, . . . }. In this chapter (except for Theorems 4.2 and 4.3), we restrict attention to a finite set of possible values, say {1, . . . , M}. Thus we are looking at processes whose sample functions are sequences of integers, each between 1 and M. There is no special significance to using integer labels for states, and no compelling reason to include 0 as a state for the countably infinite case and not to include 0 for the finite case. For the countably infinite case, the most common applications come from queueing theory, and the state often represents the number of waiting customers, which can be zero. For the finite case, we often use vectors and matrices, and it is more conventional to use positive integer labels. In some examples, it will be more convenient to use more illustrative labels for states.


Definition 4.1. A Markov chain is an integer time process, {Xn, n ≥ 0}, for which each rv Xn, n ≥ 1, is integer valued and depends on the past only through the most recent rv Xn−1; i.e., for all integer n ≥ 1 and all integer i, j, k, . . . , m,

Pr{Xn = j | Xn−1 = i, Xn−2 = k, . . . , X0 = m} = Pr{Xn = j | Xn−1 = i} .     (4.1)

Pr{Xn = j | Xn−1 = i} depends only on i and j (not n) and is denoted by

Pr{Xn = j | Xn−1 = i} = Pij .     (4.2)

The initial state X0 has an arbitrary probability distribution, which is required for a full probabilistic description of the process, but is not needed for most of the results. A Markov chain in which each Xn has a finite set of possible sample values is a finite-state Markov chain. The rv Xn is called the state of the chain at time n. The possible values for the state at time n, namely {1, . . . , M} or {0, 1, . . . }, are also generally called states, usually without too much confusion. Thus Pij is the probability of going to state j given that the previous state is i; the new state, given the previous state, is independent of all earlier states. The use of the word state here conforms to the usual idea of the state of a system: the state at a given time summarizes everything about the past that is relevant to the future. Note that the transition probabilities, Pij, do not depend on n. Occasionally, a more general model is required where the transition probabilities do depend on n. In such situations, (4.1) and (4.2) are replaced by

Pr{Xn = j | Xn−1 = i, Xn−2 = k, . . . , X0 = m} = Pr{Xn = j | Xn−1 = i} = Pij(n) .     (4.3)

A process that obeys (4.3), with a dependence on n, is called a non-homogeneous Markov chain. Some people refer to a Markov chain (as defined in (4.1) and (4.2)) as a homogeneous Markov chain. We will discuss only the homogeneous case (since not much of general interest can be said about the non-homogeneous case) and thus omit the word homogeneous as a qualifier. An initial probability distribution for X0, combined with the transition probabilities {Pij} (or {Pij(n)} for the non-homogeneous case), defines the probabilities for all events.

Markov chains can be used to model an enormous variety of physical phenomena and can be used to approximate most other kinds of stochastic processes. To see this, consider sampling a given process at a high rate in time, and then quantizing it, thus converting it into a discrete time process, {Zn; −∞ < n < ∞}, where each Zn takes on a finite set of possible values. In this new process, each variable Zn will typically have a statistical dependence on past values that gradually dies out in time, so we can approximate the process by allowing Zn to depend on only a finite number of past variables, say Zn−1, . . . , Zn−k. Finally, we can define a Markov process where the state at time n is Xn = (Zn, Zn−1, . . . , Zn−k+1). The state Xn = (Zn, Zn−1, . . . , Zn−k+1) then depends only on Xn−1 = (Zn−1, . . . , Zn−k+1, Zn−k), since the new part of Xn, i.e., Zn, is independent of Zn−k−1, Zn−k−2, . . . , and the other variables comprising Xn are specified by Xn−1. Thus {Xn} forms a Markov chain approximating the original process. This is not always an insightful or desirable model, but at least provides one possibility for modeling relatively general stochastic processes.


Markov chains are often described by a directed graph (see Figure 4.1). In the graphical representation, there is one node for each state and a directed arc for each non-zero transition probability. If Pij = 0, then the arc from node i to node j is omitted; thus the difference between zero and non-zero transition probabilities stands out clearly in the graph. Several of the most important characteristics of a Markov chain depend only on which transition probabilities are zero, so the graphical representation is well suited for understanding these characteristics. A finite-state Markov chain is also often described by a matrix [P] (see Figure 4.1). If the chain has M states, then [P] is an M by M matrix with elements Pij. The matrix representation is ideally suited for studying algebraic and computational issues.
[Figure 4.1: (a) a directed graph on six nodes with arcs labeled by their transition probabilities; (b) the corresponding 6 by 6 matrix [P] with elements P11, P12, . . . , P66.]

Figure 4.1: Graphical and matrix representation of a 6 state Markov chain; a directed arc from i to j is included in the graph if and only if Pij > 0.
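The matrix description also lends itself to direct simulation: a sample path is generated by repeatedly drawing the next state from the row of [P] indexed by the current state. The sketch below assumes numpy; the numerical values are arbitrary, with only the zero pattern chosen to be consistent with the chain described for Figure 4.1, and states are numbered 0 to 5 in the code rather than 1 to 6.

    import numpy as np

    def simulate_chain(P, x0, n_steps, rng=None):
        """Generate a sample path X_0, ..., X_n for a chain with transition matrix P."""
        rng = np.random.default_rng() if rng is None else rng
        path = [x0]
        for _ in range(n_steps):
            # Draw the next state from the row of P for the current state.
            path.append(rng.choice(len(P), p=P[path[-1]]))
        return path

    # Arbitrary probabilities; the zero pattern mimics Figure 4.1 (state k here is state k+1 there).
    P = np.array([[0.5, 0.5, 0.0, 0.0, 0.0, 0.0],    # state 1: arcs to 1 and 2
                  [0.0, 0.0, 1.0, 0.0, 0.0, 0.0],    # state 2: arc to 3
                  [0.0, 0.3, 0.0, 0.3, 0.4, 0.0],    # state 3: arcs to 2, 4, 5
                  [0.6, 0.0, 0.0, 0.0, 0.4, 0.0],    # state 4: arcs to 1 and 5
                  [0.0, 0.0, 0.0, 0.0, 1.0, 0.0],    # state 5: arc to 5 only
                  [0.0, 0.0, 0.5, 0.0, 0.5, 0.0]])   # state 6: arcs to 3 and 5
    print(simulate_chain(P, x0=0, n_steps=20))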

4.2 Classification of states

This section, except where indicated otherwise, applies to Markov chains with both finite and countable state spaces. We start with several definitions.

Definition 4.2. An (n-step) walk¹ is an ordered string of nodes {i0, i1, . . . , in}, n ≥ 1, in which there is a directed arc from im−1 to im for each m, 1 ≤ m ≤ n. A path is a walk in which the nodes are distinct. A cycle is a walk in which the first and last nodes are the same and the other nodes are distinct.

Note that a walk can start and end on the same node, whereas a path cannot. Also the number of steps in a walk can be arbitrarily large, whereas a path can have at most M − 1 steps and a cycle at most M steps.

Definition 4.3. A state j is accessible from i (abbreviated as i → j) if there is a walk in the graph from i to j.

For example, in Figure 4.1(a), there is a walk from node 1 to node 3 (passing through node 2), so state 3 is accessible from 1. There is no walk from node 5 to 3, so state 3 is not accessible from 5. State 2, for example, is accessible from itself, but state 6 is not accessible from itself.
¹ We are interested here only in directed graphs, and thus undirected walks and paths do not arise.


To see the probabilistic meaning of accessibility, suppose that a walk i0, i1, . . . , in exists from node i0 to in. Then, conditional on X0 = i0, there is a positive probability, Pi0i1, that X1 = i1, and consequently (since Pi1i2 > 0), there is a positive probability that X2 = i2. Continuing this argument, there is a positive probability that Xn = in, so that Pr{Xn = in | X0 = i0} > 0. Similarly, if Pr{Xn = in | X0 = i0} > 0, then there is an n-step walk from i0 to in. Summarizing, i → j if and only if Pr{Xn = j | X0 = i} > 0 for some n ≥ 1. We denote Pr{Xn = j | X0 = i} by P^n_ij. Thus, for n ≥ 1, P^n_ij > 0 iff the graph has an n-step walk from i to j (perhaps visiting the same node more than once). For the example in Figure 4.1(a), P^2_13 = P12 P23 > 0. On the other hand, P^n_53 = 0 for all n ≥ 1. An important relation that we use often in what follows is that if there is an n-step walk from state i to j and an m-step walk from state j to k, then there is a walk of m + n steps from i to k. Thus
P^n_ij > 0 and P^m_jk > 0  imply  P^{n+m}_ik > 0 .     (4.4)

This also shows that

i → j and j → k  imply  i → k .     (4.5)

Definition 4.4. Two distinct states i and j communicate (abbreviated i ↔ j) if i is accessible from j and j is accessible from i.

An important fact about communicating states is that if i ↔ j and m ↔ j then i ↔ m. To see this, note that i ↔ j and m ↔ j imply that i → j and j → m, so that i → m. Similarly, m → i, so i ↔ m.

Definition 4.5. A class T of states is a non-empty set of states such that for each state i ∈ T, i communicates with each j ∈ T (except perhaps itself) and does not communicate with any j ∉ T.

For the example of Fig. 4.1(a), {1, 2, 3, 4} is one class of states, {5} is another, and {6} is another. Note that state 6 does not communicate with itself, but {6} is still considered to be a class. The entire set of states in a given Markov chain is partitioned into one or more disjoint classes in this way.

Definition 4.6. For finite-state Markov chains, a recurrent state is a state i that is accessible from all states that are accessible from i (i is recurrent if i → j implies that j → i). A transient state is a state that is not recurrent.

Recurrent and transient states for Markov chains with a countably infinite set of states will be defined in the next chapter. According to the definition, a state i in a finite-state Markov chain is recurrent if there is no possibility of going to a state j from which there can be no return. As we shall see later, if a Markov chain ever enters a recurrent state, it returns to that state eventually with probability 1, and thus keeps returning infinitely often (in fact, this property serves as the definition of recurrence for Markov chains without the finite-state restriction). A state i is transient if there is some j that is accessible from i but from which there is no possible return. Each time the system returns to i, there is a possibility of going to j; eventually this possibility will occur, and then no more returns to i can occur (this can be thought of as a mathematical form of Murphy's law).
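Accessibility, communication, and the recurrent/transient distinction depend only on which entries of [P] are non-zero, so they can be computed mechanically from the matrix. The sketch below, assuming numpy, is one way to do it; the function name is ours, and the example matrix is the same arbitrary stand-in for Figure 4.1 used earlier (states 0 to 5 in place of 1 to 6).

    import numpy as np

    def classify_states(P):
        """Return a list of (class, is_recurrent) pairs for a finite-state chain."""
        M = len(P)
        A = (np.asarray(P) > 0).astype(int)      # arc i -> j iff Pij > 0
        reach = A.copy()                         # walks of length 1
        for _ in range(M):                       # walks of length up to M suffice
            reach = np.minimum(1, reach + reach @ A)
        comm = reach * reach.T                   # i <-> j iff there are walks both ways
        classes, seen = [], set()
        for i in range(M):
            if i in seen:
                continue
            # {i} is a class even when i does not communicate with itself (Definition 4.5).
            cls = sorted(set(np.flatnonzero(comm[i])) | {i})
            seen.update(cls)
            # Recurrent iff every state accessible from i can get back to i (Definition 4.6).
            recurrent = all(reach[j, i] or not reach[i, j] for j in range(M))
            classes.append((cls, bool(recurrent)))
        return classes

    P = np.array([[0.5, 0.5, 0.0, 0.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0, 0.0, 0.0, 0.0],
                  [0.0, 0.3, 0.0, 0.3, 0.4, 0.0],
                  [0.6, 0.0, 0.0, 0.0, 0.4, 0.0],
                  [0.0, 0.0, 0.0, 0.0, 1.0, 0.0],
                  [0.0, 0.0, 0.5, 0.0, 0.5, 0.0]])
    print(classify_states(P))   # one transient 4-state class, one recurrent class, one transient singleton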


Theorem 4.1. For finite-state Markov chains, either all states in a class are transient or all are recurrent.²

Proof: Assume that state i is transient (i.e., for some j, i → j but not j → i) and suppose that i and m are in the same class (i.e., i ↔ m). Then m → i and i → j, so m → j. Now if j → m, then the walk from j to m could be extended to i; this is a contradiction, and therefore there is no walk from j to m, and m is transient. Since we have just shown that all nodes in a class are transient if any are, it follows that the states in a class are either all recurrent or all transient.

For the example of Fig. 4.1(a), {1, 2, 3, 4} is a transient class and {5} is a recurrent class. In terms of the graph of a Markov chain, a class is transient if there are any directed arcs going from a node in the class to a node outside the class. Every finite-state Markov chain must have at least one recurrent class of states (see Exercise 4.1), and can have arbitrarily many additional classes of recurrent states and transient states.

States can also be classified according to their periods (see Figure 4.2). In Fig. 4.2(a), given that X0 = 2, we see that X1 must be either 1 or 3, X2 must then be either 2 or 4, and in general, Xn must be 2 or 4 for n even and 1 or 3 for n odd. On the other hand, if X0 is 1 or 3, then Xn is 2 or 4 for n odd and 1 or 3 for n even. Thus the effect of the starting state never dies out. Fig. 4.2(b) illustrates another example in which the state alternates from odd to even and the memory of the starting state never dies out. The states in both these Markov chains are said to be periodic with period 2.
Figure 4.2: Periodic Markov chains.

Definition 4.7. The period of a state i, denoted d(i), is the greatest common divisor (gcd) of those values of n for which P^n_ii > 0. If the period is 1, the state is aperiodic, and if the period is 2 or more, the state is periodic.³
For example, in Figure 4.2(a), P^n_11 > 0 for n = 2, 4, 6, . . . . Thus d(1), the period of state 1, is two. Similarly, d(i) = 2 for the other states in Figure 4.2(a).
² This theorem is also true for Markov chains with a countably infinite state space, but the proof here is inadequate. Also recurrent classes with a countably infinite state space are further classified into either positive-recurrent or null-recurrent, a distinction that does not appear in the finite-state case.
³ For completeness, we say that the period is infinite if P^n_ii = 0 for all n ≥ 1. Such states do not have the intuitive characteristics of either periodic or aperiodic states. Such a state cannot communicate with any other state, and cannot return to itself, so it corresponds to a singleton class of transient states. The notion of periodicity is of primary interest for recurrent states.


For Fig. 4.2(b), we have P^n_11 > 0 for n = 4, 8, 10, 12, . . . ; thus d(1) = 2, and it can be seen that d(i) = 2 for all the states. These examples suggest the following theorem.

Theorem 4.2. For any Markov chain (with either a finite or countably infinite number of states), all states in the same class have the same period.

Proof: Let i and j be any distinct pair of states in a class. Then i ↔ j and there is some r such that P^r_ij > 0 and some s such that P^s_ji > 0. Since there is a walk of length r + s going from i to j and back to i, r + s must be divisible by d(i). Let t be any integer such that P^t_jj > 0. Since there is a walk of length r + t + s that goes first from i to j, then to j again, and then back to i, r + t + s is divisible by d(i), and thus t is divisible by d(i). Since this is true for any t such that P^t_jj > 0, d(j) is divisible by d(i). Reversing the roles of i and j, d(i) is divisible by d(j), so d(i) = d(j).

Since the states in a class all have the same period and are either all recurrent or all transient, we refer to the class itself as having the period of its states and as being recurrent or transient. Similarly if a Markov chain has a single class of states, we refer to the chain as having the corresponding period and being recurrent or transient.

Theorem 4.3. If a recurrent class in a finite-state Markov chain has period d, then the states in the class can be partitioned into d subsets, S1, S2, . . . , Sd, such that all transitions out of subset Sm go to subset Sm+1 for m < d and to subset S1 for m = d. That is, if j ∈ Sm and Pjk > 0, then k ∈ Sm+1 for m < d and k ∈ S1 for m = d.

Proof: See Figure 4.3 for an illustration of the theorem. For a given state in the class, say state 1, define the sets S1, . . . , Sd by
Sm = {j : P^{nd+m}_1j > 0 for some n ≥ 0} ;     1 ≤ m ≤ d.     (4.6)

For each given j in the class, we first show that there is one and only one value of m such that j ∈ Sm. Since 1 ↔ j, there is some r for which P^r_1j > 0 and some s for which P^s_j1 > 0. Since there is a walk from 1 to 1 (through j) of length r + s, r + s is divisible by d. Define m, 1 ≤ m ≤ d, by r = m + nd, where n is an integer. From (4.6), j ∈ Sm. Now let r′ be any other integer such that P^{r′}_1j > 0. Then r′ + s is also divisible by d, so that r′ − r is divisible by d. Thus r′ = m + n′d for some integer n′ and that same m. Since r′ is any integer such that P^{r′}_1j > 0, j is in Sm for only that one value of m. Since j is arbitrary, this shows that the sets Sm are disjoint and partition the class.

Finally, suppose j ∈ Sm and Pjk > 0. Given a walk of length r = nd + m from state 1 to j, there is a walk of length nd + m + 1 from state 1 to k. It follows that if m < d, then k ∈ Sm+1 and if m = d, then k ∈ S1, completing the proof.

We have seen that each class of states (for a finite-state chain) can be classified both in terms of its period and in terms of whether or not it is recurrent. The most important case is that in which a class is both recurrent and aperiodic.

Definition 4.8. For a finite-state Markov chain, an ergodic class of states is a class that



Figure 4.3: Structure of a Periodic Markov Chain with d = 3. Note that transitions
only go from one subset Sm to the next subset Sm+1 (or from Sd to S1 ).

is both recurrent and aperiodic.⁴ A Markov chain consisting entirely of one ergodic class is called an ergodic chain.
We shall see later that these chains have the desirable property that P^n_ij becomes independent of the starting state i as n → ∞. The next theorem establishes the first part of this by showing that P^n_ij > 0 for all i and j when n is sufficiently large. The Markov chain in Figure 4.4 illustrates the theorem by showing how large n must be in the worst case.

Figure 4.4: An ergodic chain with M = 6 states in which P^m_ij > 0 for all m > (M − 1)² and all i, j, but P^{(M−1)²}_11 = 0.

The figure also illustrates that an M state Markov chain must have a cycle with M − 1 or fewer nodes. To see this, note that an ergodic chain must have cycles, since each node must have a walk to itself, and any subcycle of repeated nodes can be omitted from that walk, converting it into a cycle. Such a cycle might have M nodes, but a chain with only an M node cycle would be periodic. Thus some nodes must be on smaller cycles, such as the cycle of length 5 in the figure.

Theorem 4.4. For an ergodic M state Markov chain, P^m_ij > 0 for all i, j, and all m ≥ (M − 1)² + 1.

⁴ For Markov chains with a countably infinite state space, ergodic means that the states are positive-recurrent and aperiodic (see Chapter 5, Section 5.1).


Proof*:⁵ As shown in Figure 4.4, the chain must contain a cycle with fewer than M nodes. Let τ ≤ M − 1 be the number of nodes on a smallest cycle in the chain and let i be any given state on such a cycle. Define T(m), m ≥ 1, as the set of states accessible from the fixed state i in m steps. Thus T(1) = {j : Pij > 0}, and for arbitrary m ≥ 1,

T(m) = {j : P^m_ij > 0} .     (4.7)

Since i is on a cycle of length τ, P^τ_ii > 0. For any m ≥ 1 and any j ∈ T(m), we can then construct an (m + τ)-step walk from i to j by going from i to i in τ steps and then to j in another m steps. This is true for all j ∈ T(m), so

T(m) ⊆ T(m + τ) .     (4.8)

By defining T(0) to be the singleton set {i}, (4.8) also holds for m = 0, since i ∈ T(τ). By starting with m = 0 and iterating on (4.8),

T(0) ⊆ T(τ) ⊆ T(2τ) ⊆ · · · ⊆ T(nτ) ⊆ · · · .     (4.9)

We now show that if one of the inclusion relations in (4.9) is satisfied with equality, then all the subsequent relations are satisfied with equality. More generally, assume that T(m) = T(m + s) for some m ≥ 0 and s ≥ 1. Note that T(m + 1) is the set of states that can be reached in one step from states in T(m), and similarly T(m + s + 1) is the set reachable in one step from T(m + s) = T(m). Thus T(m + 1) = T(m + 1 + s). Iterating this result,

T(m) = T(m + s) implies T(n) = T(n + s) for all n ≥ m.     (4.10)

Thus, (4.9) starts with a string of strict inclusions and then continues with equalities. Since the entire set has M members, there can be at most M − 1 strict inclusions in (4.9). Thus

T((M − 1)τ) = T(nτ) for all integers n ≥ M − 1.     (4.11)

Define k as (M − 1)τ. We can then rewrite (4.11) as

T(k) = T(k + jτ) for all j ≥ 1.     (4.12)

We next show that T(k) consists of all M nodes in the chain. The central part of this is to show that T(k) = T(k + 1). Let t be any positive integer other than τ such that P^t_ii > 0. Letting m = k in (4.8) and using t in place of τ,

T(k) ⊆ T(k + t) ⊆ T(k + 2t) ⊆ · · · ⊆ T(k + τt) .     (4.13)

Since T(k + τt) = T(k), this shows that

T(k) = T(k + t) .     (4.14)

Now let s be the smallest positive integer such that

T(k) = T(k + s) .     (4.15)

⁵ Proofs marked with an asterisk can be omitted without loss of continuity.


From (4.11), we see that (4.15) holds when s takes the value τ. Thus, the minimizing s must lie in the range 1 ≤ s ≤ τ. We will show that s = 1 by assuming s > 1 and establishing a contradiction. Since the chain is aperiodic, there is some t not divisible by s for which P^t_ii > 0. This t can be represented by t = js + ℓ where 1 ≤ ℓ < s and j ≥ 0. Iterating (4.15), we get T(k) = T(k + js), and applying (4.10) to this,

T(k + ℓ) = T(k + js + ℓ) = T(k + t) = T(k) ,

where we have used t = js + ℓ followed by (4.14). This is the desired contradiction, since ℓ < s. Thus s = 1 and T(k) = T(k + 1). Iterating this,

T(k) = T(k + n) for all n ≥ 0.     (4.16)

Since the chain is ergodic, each state j continues to be accessible after k steps. Therefore j must be in T(k + n) for some n ≥ 0, which, from (4.16), implies that j ∈ T(k). Since j is arbitrary, T(k) must be the entire set of states. Thus P^n_ij > 0 for all n ≥ k and all j. This same argument can be applied to any state i on the given cycle with τ nodes. Any state m not on this cycle has a path to the cycle using at most M − τ steps. Using this path to reach a node i on the cycle, and following this with all the walks from i of length k = (M − 1)τ, we see that

P^{(M−τ)+(M−1)τ}_mj > 0     for all j, m.
The proof is complete, since (M − τ) + (M − 1)τ ≤ (M − 1)² + 1 for all τ, 1 ≤ τ ≤ M − 1, with equality when τ = M − 1. Figure 4.4 illustrates a situation where the bound (M − 1)² + 1 is met with equality. Note that there is one cycle of length M − 1 and the single node not on this cycle, node 1, is the unique starting node at which the bound is met with equality.
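The bound of Theorem 4.4 can be checked numerically. The chain built below is our guess at the structure described for Figure 4.4 (one cycle through all M = 6 states plus an extra arc from state 6 to state 2, so that states 2, . . . , 6 form a cycle of length M − 1 = 5 and node 1 is the single node off that smaller cycle); the probability values are arbitrary. The sketch, assuming numpy, finds the smallest m for which [P]^m is strictly positive and checks that P^{(M−1)²}_11 = 0.

    import numpy as np

    M = 6
    P = np.zeros((M, M))
    for i in range(M - 1):          # the cycle 1 -> 2 -> ... -> 6 (0-indexed: 0 -> 1 -> ... -> 5)
        P[i, i + 1] = 1.0
    P[M - 1, 0] = 0.5               # arc 6 -> 1, closing the cycle of length M
    P[M - 1, 1] = 0.5               # arc 6 -> 2, creating the smaller cycle of length M - 1

    Pm = np.eye(M)
    for m in range(1, 50):
        Pm = Pm @ P
        if np.all(Pm > 0):
            print("smallest m with [P]^m > 0:", m, "; (M-1)^2 + 1 =", (M - 1)**2 + 1)
            break

    print("P^25_11 =", np.linalg.matrix_power(P, (M - 1)**2)[0, 0])   # 0.0, as claimed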

4.3 The matrix representation

The matrix [P] of transition probabilities of a Markov chain is called a stochastic matrix; that is, a stochastic matrix is a square matrix of non-negative terms in which the elements in each row sum to 1. We first consider the n-step transition probabilities P^n_ij in terms of [P]. The probability of going from state i to state j in two steps is the sum over h of all possible two step walks, from i to h and from h to j. Using the Markov condition in (4.1),
P^2_ij = Σ_{h=1}^{M} Pih Phj .

It can be seen that this is just the i, j term of the product of matrix [P] with itself; denoting [P][P] as [P]^2, this means that P^2_ij is the (i, j) element of the matrix [P]^2. Similarly, P^n_ij is


the i, j element of the nth power of the matrix [P]. Since [P]^{m+n} = [P]^m [P]^n, this means that

P^{m+n}_ij = Σ_{h=1}^{M} P^m_ih P^n_hj .     (4.17)

This is known as the Chapman-Kolmogorov equation. An efficient approach to compute [P]^n (and thus P^n_ij) for large n is to multiply [P]^2 by [P]^2, then [P]^4 by [P]^4, and so forth, and then multiply these binary powers together as needed.

The matrix [P]^n (i.e., the matrix of transition probabilities raised to the nth power) is very important for a number of reasons. The i, j element of this matrix is P^n_ij, which is the probability of being in state j at time n given state i at time 0. If memory of the past dies out with increasing n, then we would expect the dependence of P^n_ij on both n and i to disappear. This means, first, that [P]^n should converge to a limit as n → ∞, and, second, that each row of [P]^n should tend to the same set of probabilities. If this convergence occurs (and we later determine the circumstances under which it occurs), [P]^n and [P]^{n+1} will be the same in the limit n → ∞, which means lim [P]^n = (lim [P]^n)[P]. If all the rows of lim [P]^n are the same, equal to some row vector π = (π1, π2, . . . , πM), this simplifies to π = π[P]. Since π is a probability vector (i.e., its components are the probabilities of being in the various states in the limit n → ∞), its components must be non-negative and sum to 1.

Definition 4.9. A steady-state probability vector (or a steady-state distribution) for a Markov chain with transition matrix [P] is a vector π that satisfies

π = π[P] ;     where Σ_i πi = 1 and πi ≥ 0 , 1 ≤ i ≤ M.     (4.18)

The steady-state probability vector π is also often called a stationary distribution. If a probability vector π satisfying (4.18) is taken as the initial probability assignment of the chain at time 0, then that assignment is maintained forever. That is, if Pr{X0 = i} = πi for all i, then Pr{X1 = j} = Σ_i πi Pij = πj for all j, and, by induction, Pr{Xn = j} = πj for all j and all n > 0. If [P]^n converges as above, then, for each starting state, the steady-state distribution is reached asymptotically.

There are a number of questions that must be answered for a steady-state distribution as defined above:

1. Does π = π[P] always have a probability vector solution?

2. Does π = π[P] have a unique probability vector solution?

3. Do the rows of [P]^n converge to a probability vector solution of π = π[P]?

We first give the answers to these questions for finite-state Markov chains and then derive them. First, (4.18) always has a solution (although this is not necessarily true for infinite-state chains). The answer to the second and third questions is simpler with the following definition:


Definition 4.10. A unichain is a finite-state Markov chain that contains a single recurrent class plus, perhaps, some transient states. An ergodic unichain is a unichain for which the recurrent class is ergodic.

A unichain, as we shall see, is the natural generalization of a recurrent chain to allow for some initial transient behavior without disturbing the long term asymptotic behavior of the underlying recurrent chain. The answer to the second question above is that the solution to (4.18) is unique iff [P] is the transition matrix of a unichain. If there are r recurrent classes, then π = π[P] has r linearly independent solutions. For the third question, each row of [P]^n converges to the unique solution of (4.18) if [P] is the transition matrix of an ergodic unichain. If there are multiple recurrent classes, but all of them are aperiodic, then [P]^n still converges, but to a matrix with non-identical rows. If the Markov chain has one or more periodic recurrent classes, then [P]^n does not converge.

We first look at these answers from the standpoint of matrix theory and then proceed in Chapter 5 to look at the more general problem of Markov chains with a countably infinite number of states. There we use renewal theory to answer these same questions (and to discover the differences that occur for infinite-state Markov chains). The matrix theory approach is useful computationally and also has the advantage of telling us something about rates of convergence. The approach using renewal theory is very simple (given an understanding of renewal processes), but is more abstract.
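Computationally, these questions can be explored directly: binary powering of [P] (as suggested after (4.17)) shows the rows converging, and the steady-state vector of a unichain can be found by solving π = π[P] together with the normalization Σ_i πi = 1. The sketch below assumes numpy; the 3-state matrix is an arbitrary ergodic example, not one from the text.

    import numpy as np

    def matrix_power_by_squaring(P, n):
        """Compute [P]^n by repeated squaring of binary powers."""
        result = np.eye(len(P))
        square = np.asarray(P, dtype=float)
        while n > 0:
            if n % 2 == 1:
                result = result @ square
            square = square @ square
            n //= 2
        return result

    def steady_state(P):
        """Solve pi = pi [P] with sum_i pi_i = 1 as a linear system (unichain assumed)."""
        M = len(P)
        A = np.eye(M) - np.asarray(P, dtype=float).T   # (I - P^T) pi^T = 0 ...
        A[-1, :] = 1.0                                 # ... with one equation replaced by the normalization
        b = np.zeros(M)
        b[-1] = 1.0
        return np.linalg.solve(A, b)

    P = np.array([[0.9, 0.1, 0.0],
                  [0.2, 0.7, 0.1],
                  [0.0, 0.3, 0.7]])
    print(steady_state(P))                        # the unique probability vector with pi = pi [P]
    print(matrix_power_by_squaring(P, 1 << 10))   # every row approaches that same vector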

4.3.1 The eigenvalues and eigenvectors of [P]

A convenient way of dealing with the nth power of a matrix is to find the eigenvalues and eigenvectors of the matrix.

Definition 4.11. The row vector π is a left eigenvector of [P] of eigenvalue λ if π ≠ 0 and π[P] = λπ. The column vector ν is a right eigenvector of eigenvalue λ if ν ≠ 0 and [P]ν = λν.

We first treat the special case of a Markov chain with two states. Here the eigenvalues and eigenvectors can be found by elementary (but slightly tedious) algebra. The eigenvector equations can be written out as

π1 P11 + π2 P21 = λπ1          P11 ν1 + P12 ν2 = λν1
π1 P12 + π2 P22 = λπ2          P21 ν1 + P22 ν2 = λν2 .     (4.19)

These equations have a non-zero solution iff the matrix [P − λI], where [I] is the identity matrix, is singular (i.e., there must be a non-zero ν for which [P − λI]ν = 0). Thus λ must be such that the determinant of [P − λI], namely (P11 − λ)(P22 − λ) − P12 P21, is equal to 0. Solving this quadratic equation in λ, we find that λ has two solutions, λ1 = 1 and λ2 = 1 − P12 − P21. Assume initially that P12 and P21 are not both 0. Then the solution for the left and right eigenvectors, π^(1) and ν^(1), of λ1 and π^(2) and ν^(2) of λ2, are given by

π^(1)_1 = P21/(P12+P21)     π^(1)_2 = P12/(P12+P21)     ν^(1)_1 = 1                    ν^(1)_2 = 1
π^(2)_1 = 1                 π^(2)_2 = −1                ν^(2)_1 = P12/(P12+P21)        ν^(2)_2 = −P21/(P12+P21)


These solutions contain an arbitrary normalization factor. Now let

[Λ] = [ λ1   0
        0   λ2 ]

and let [U] be a matrix with columns ν^(1) and ν^(2). Then the two right eigenvector equations in (4.19) can be combined compactly as [P][U] = [U][Λ]. It turns out (given the way we have normalized the eigenvectors) that the inverse of [U] is just the matrix whose rows are the left eigenvectors of [P] (this can be verified by direct calculation, and we show later that any right eigenvector of one eigenvalue must be orthogonal to any left eigenvector of another eigenvalue). We then see that [P] = [U][Λ][U]^−1 and consequently [P]^n = [U][Λ]^n[U]^−1. Multiplying this out, we get

[P]^n = [ π1 + π2 λ2^n     π2 − π2 λ2^n
          π1 − π1 λ2^n     π2 + π1 λ2^n ]     where  π1 = P21/(P12 + P21),  π2 = 1 − π1.

Recalling that λ2 = 1 − P12 − P21, we see that |λ2| ≤ 1. If P12 = P21 = 0, then λ2 = 1, so that [P] and [P]^n are simply identity matrices. If P12 = P21 = 1, then λ2 = −1, so that [P]^n alternates between the identity matrix for n even and [P] for n odd. In all other cases, |λ2| < 1 and [P]^n approaches the matrix whose rows are both equal to π.

Parts of this special case generalize to an arbitrary finite number of states. In particular, λ = 1 is always an eigenvalue and the vector e whose components are all equal to 1 is always a right eigenvector of λ = 1 (this follows immediately from the fact that each row of a stochastic matrix sums to 1). Unfortunately, not all stochastic matrices can be represented in the form [P] = [U][Λ][U]^−1 (since M independent right eigenvectors need not exist; see Exercise 4.9). In general, the diagonal matrix of eigenvalues in [P] = [U][Λ][U]^−1 must be replaced by something called a Jordan form, which does not easily lead us to the desired results. In what follows, we develop the powerful Perron and Frobenius theorems, which are useful in their own right and also provide the necessary results about [P]^n in general.
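Before developing the general theory, the two-state closed form for [P]^n is easy to check numerically. The sketch below, assuming numpy, compares the formula with a directly computed matrix power; the values of P12 and P21 are arbitrary.

    import numpy as np

    p12, p21 = 0.3, 0.1                       # arbitrary transition probabilities
    P = np.array([[1 - p12, p12],
                  [p21, 1 - p21]])

    pi1 = p21 / (p12 + p21)                   # steady-state probabilities
    pi2 = 1 - pi1
    lam2 = 1 - p12 - p21                      # the second eigenvalue

    n = 7
    closed_form = np.array([[pi1 + pi2 * lam2**n, pi2 - pi2 * lam2**n],
                            [pi1 - pi1 * lam2**n, pi2 + pi1 * lam2**n]])
    print(np.allclose(closed_form, np.linalg.matrix_power(P, n)))   # True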

4.4 Perron-Frobenius theory

A real vector x (i.e., a vector with real components) is defined to be positive, denoted x > 0, if xi > 0 for each component i. A real matrix [A] is positive, denoted [A] > 0, if Aij > 0 for each i, j. Similarly, x is non-negative, denoted x ≥ 0, if xi ≥ 0 for all i. [A] is non-negative, denoted [A] ≥ 0, if Aij ≥ 0 for all i, j. Note that it is possible to have x ≥ 0 and x ≠ 0 without having x > 0, since x > 0 means that all components of x are positive, whereas x ≥ 0, x ≠ 0 means that at least one component of x is positive and all are non-negative. Next, x > y and y < x both mean x − y > 0. Similarly, x ≥ y and y ≤ x mean x − y ≥ 0. The corresponding matrix inequalities have corresponding meanings.

We start by looking at the eigenvalues and eigenvectors of positive square matrices. In what follows, when we assert that a matrix, vector, or number is positive or non-negative, we implicitly mean that it is real also. We will prove Perron's theorem, which is the critical result for dealing with positive matrices. We then generalize Perron's theorem to the Frobenius theorem, which treats a class of non-negative matrices called irreducible matrices. We finally specialize the results to stochastic matrices.


Perron's theorem shows that a square positive matrix [A] always has a positive eigenvalue λ that exceeds the magnitude of all other eigenvalues. It also shows that this λ has a right eigenvector ν that is positive and unique within a scale factor. It establishes these results by relating λ to the following frequently useful optimization problem. For a given square matrix [A] > 0, and for any non-zero vector⁶ x ≥ 0, let g(x) be the largest real number a for which ax ≤ [A]x. Let λ be defined by

λ = sup_{x ≠ 0, x ≥ 0} g(x) .     (4.20)

⁶ Note that the set of nonzero vectors x for which x ≥ 0 is different from the set {x > 0} in that the former allows some xi to be zero, whereas the latter requires all xi to be positive.
We can express g(x) explicitly by rewriting ax ≤ [A]x as axi ≤ Σ_j Aij xj for all i. Thus, the largest a for which this is satisfied is

g(x) = min_i gi(x)     where     gi(x) = (Σ_j Aij xj) / xi .     (4.21)

Since [A] > 0, x ≥ 0 and x ≠ 0, it follows that the numerator Σ_j Aij xj is positive for all i. Thus gi(x) is positive for xi > 0 and infinite for xi = 0, so g(x) > 0. It is shown in Exercise 4.10 that g(x) is a continuous function of x over x ≠ 0, x ≥ 0, and that the supremum in (4.20) is actually achieved as a maximum.

Theorem 4.5 (Perron). Let [A] > 0 be an M by M matrix, let λ > 0 be given by (4.20) and (4.21), and let ν be a vector x that maximizes (4.20). Then

1. λν = [A]ν and ν > 0.

2. For any other eigenvalue μ of [A], |μ| < λ.

3. If x satisfies λx = [A]x, then x = βν for some (possibly complex) number β.

Discussion: Property (1) asserts not only that the solution λ of the optimization problem is an eigenvalue of [A], but also that the optimizing vector ν is an eigenvector and is strictly positive. Property (2) says that λ is strictly greater than the magnitude of any other eigenvalue, and thus we refer to it in what follows as the largest eigenvalue of [A]. Property (3) asserts that the eigenvector ν is unique (within a scale factor), not only among positive vectors but among all (possibly complex) vectors.

Proof* Property 1: We are given that

λ = g(ν) ≥ g(x)     for each x ≥ 0, x ≠ 0.     (4.22)

We must show that λν = [A]ν, i.e., that λνi = Σ_j Aij νj for each i, or equivalently that

λ = g(ν) = gi(ν) = (Σ_j Aij νj) / νi     for each i.     (4.23)

Thus we want to show that the minimum in (4.21) is achieved by each i, 1 ≤ i ≤ M. To show this, we assume the contrary and demonstrate a contradiction. Thus, suppose that


g(ν) < gk(ν) for some k. Let e_k be the kth unit vector and let ε be a small positive number. The contradiction will be to show that g(ν + εe_k) > g(ν) for small enough ε, thus violating (4.22). For i ≠ k,

gi(ν + εe_k) = (Σ_j Aij νj + εAik) / νi > (Σ_j Aij νj) / νi = gi(ν) ≥ g(ν) .     (4.24)

gk(ν + εe_k), on the other hand, is continuous in ε as ε increases from 0 and thus remains greater than g(ν) for small enough ε. This shows that g(ν + εe_k) > g(ν), completing the contradiction. This also shows that νk must be greater than 0 for each k.

Property 2: Let μ be any eigenvalue of [A]. Let x ≠ 0 be a right eigenvector (perhaps complex) for μ. Taking the magnitude of each side of μx = [A]x, we get the following for each component i

|μ||xi| = |Σ_j Aij xj| ≤ Σ_j Aij |xj| .     (4.25)

Let u = (|x1|, |x2|, . . . , |xM|), so (4.25) becomes |μ|u ≤ [A]u. Since u ≥ 0, u ≠ 0, it follows from the definition of g(x) that |μ| ≤ g(u). From (4.20), g(u) ≤ λ, so |μ| ≤ λ. Next assume that |μ| = λ. From (4.25), then, λu ≤ [A]u, so u achieves the maximization in (4.20) and part 1 of the theorem asserts that λu = [A]u. This means that (4.25) is satisfied with equality, and it follows from this (see Exercise 4.11) that x = βu for some (perhaps complex) scalar β. Thus x is an eigenvector of λ, and μ = λ. Thus |μ| = λ is impossible for μ ≠ λ, so λ > |μ| for all eigenvalues μ ≠ λ.

Property 3: Let x be any eigenvector of λ. Property 2 showed that x = βu where ui = |xi| for each i and u is a non-negative eigenvector of eigenvalue λ. Since ν > 0, we can choose α > 0 so that ν − αu ≥ 0 and νi − αui = 0 for some i. Now ν − αu is either identically 0 or else an eigenvector of eigenvalue λ, and thus strictly positive. Since νi − αui = 0 for some i, ν − αu = 0. Thus u, and thus x, is a scalar multiple of ν, completing the proof.

Next we apply the results above to a more general type of non-negative matrix called an irreducible matrix. Recall that we analyzed the classes of a finite-state Markov chain in terms of a directed graph where the nodes represent the states of the chain and a directed arc goes from i to j if Pij > 0. We can draw the same type of directed graph for an arbitrary non-negative matrix [A]; i.e., a directed arc goes from i to j if Aij > 0.

Definition 4.12. An irreducible matrix is a non-negative matrix such that for every pair of nodes i, j in its graph, there is a walk from i to j.

For stochastic matrices, an irreducible matrix is thus the matrix of a recurrent Markov chain. If we denote the i, j element of [A]^n by A^n_ij, then we see that A^n_ij > 0 iff there is a walk of length n from i to j in the graph. If [A] is irreducible, a walk exists from any i to any j (including j = i) with length at most M, since the walk need visit each other node at most once. Thus A^n_ij > 0 for some n, 1 ≤ n ≤ M, and Σ_{n=1}^M A^n_ij > 0. The key to analyzing irreducible matrices is then the fact that the matrix [B] = Σ_{n=1}^M [A]^n is strictly positive.
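As an aside, the optimization characterization (4.20) and (4.21) behind Theorem 4.5 is easy to probe numerically: g(x) ≤ λ for every non-negative non-zero x, with equality at the maximizing vector ν. A sketch assuming numpy, with an arbitrary positive matrix:

    import numpy as np

    def g(A, x):
        """g(x) of (4.21): min_i (A x)_i / x_i, taken as +infinity where x_i = 0."""
        Ax = A @ x
        ratios = np.where(x > 0, Ax / np.where(x > 0, x, 1.0), np.inf)
        return ratios.min()

    A = np.array([[2.0, 1.0, 0.5],
                  [0.3, 1.0, 2.0],
                  [1.0, 0.2, 0.7]])

    eigvals, eigvecs = np.linalg.eig(A)
    k = np.argmax(eigvals.real)              # the Perron eigenvalue is real and largest
    lam = eigvals[k].real
    nu = np.abs(eigvecs[:, k].real)          # the corresponding eigenvector, taken positive

    print(lam, g(A, nu))                     # equal, up to roundoff
    rng = np.random.default_rng(0)
    print(all(g(A, rng.random(3)) <= lam + 1e-9 for _ in range(1000)))   # True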


Theorem 4.6 (Frobenius). Let [A] ≥ 0 be an M by M irreducible matrix and let λ be the supremum in (4.20) and (4.21). Then the supremum is achieved as a maximum at some vector ν, and the pair λ, ν have the following properties:

1. λν = [A]ν and ν > 0.

2. For any other eigenvalue μ of [A], |μ| ≤ λ.

3. If x satisfies λx = [A]x, then x = βν for some (possibly complex) number β.

Discussion: Note that this is almost the same as the Perron theorem, except that [A] is irreducible (but not necessarily positive), and the magnitudes of the other eigenvalues need not be strictly less than λ. When we look at recurrent matrices of period d, we shall find that there are d − 1 other eigenvalues of magnitude equal to λ. Because of the possibility of other eigenvalues with the same magnitude as λ, we refer to λ as the largest real eigenvalue of [A].

Proof* Property 1: We first establish property 1 for a particular choice of λ and ν and then show that this choice satisfies the optimization problem in (4.20) and (4.21). Let [B] = Σ_{n=1}^M [A]^n > 0. Using Theorem 4.5, we let λ_B be the largest eigenvalue of [B] and let ν > 0 be the corresponding right eigenvector. Then [B]ν = λ_B ν. Also, since [B][A] = [A][B], we have [B]{[A]ν} = [A][B]ν = λ_B [A]ν. Thus [A]ν is a right eigenvector for eigenvalue λ_B of [B] and thus equal to ν multiplied by some positive scale factor. Define this scale factor to be λ, so that [A]ν = λν and ν > 0. We can relate λ to λ_B by [B]ν = Σ_{n=1}^M [A]^n ν = (λ + λ^2 + · · · + λ^M)ν. Thus λ_B = λ + λ^2 + · · · + λ^M.

Next, for any non-zero x ≥ 0, let g > 0 be the largest number such that [A]x ≥ gx. Multiplying both sides of this by [A], we see that [A]^2 x ≥ g[A]x ≥ g^2 x. Similarly, [A]^i x ≥ g^i x for each i ≥ 1, so it follows that [B]x ≥ (g + g^2 + · · · + g^M)x. From the optimization property of λ_B in Theorem 4.5, this shows that λ_B ≥ g + g^2 + · · · + g^M. Since λ_B = λ + λ^2 + · · · + λ^M, we conclude that λ ≥ g, showing that λ, ν solve the optimization problem for [A] in (4.20) and (4.21).

Properties 2 and 3: The first half of the proof of property 2 in Theorem 4.5 applies here also to show that |μ| ≤ λ for all eigenvalues μ of [A]. Finally, let x be an arbitrary vector satisfying [A]x = λx. Then, from the argument above, x is also a right eigenvector of [B] with eigenvalue λ_B, so from Theorem 4.5, x must be a scalar multiple of ν, completing the proof.

Corollary 4.1. The largest real eigenvalue λ of an irreducible matrix [A] ≥ 0 has a positive left eigenvector π. π is the unique left eigenvector of λ (within a scale factor) and is the only non-negative non-zero vector (within a scale factor) that satisfies λπ = π[A].

Proof: A left eigenvector π of [A] is a right eigenvector (transposed) of [A]^T. The graph corresponding to [A]^T is the same as that for [A] with all the arc directions reversed, so that all pairs of nodes still communicate and [A]^T is irreducible. Since [A] and [A]^T have the same eigenvalues, the corollary is just a restatement of the theorem.


Corollary 4.2. Let λ be the largest real eigenvalue of an irreducible matrix and let the right and left eigenvectors of λ be ν > 0 and π > 0. Then, within a scale factor, ν is the only non-negative right eigenvector of [A] (i.e., no other eigenvalues have non-negative eigenvectors). Similarly, within a scale factor, π is the only non-negative left eigenvector of [A].

Proof: Theorem 4.6 asserts that ν is the unique right eigenvector (within a scale factor) of the largest real eigenvalue λ, so suppose that u is a right eigenvector of some other eigenvalue μ. Letting π be the left eigenvector of λ, we have π[A]u = μπu and also π[A]u = λπu. Thus (λ − μ)πu = 0, so πu = 0. Since π > 0, u cannot be non-negative and non-zero. The same argument shows the uniqueness of π.

Corollary 4.3. Let [P] be a stochastic irreducible matrix (i.e., the matrix of a recurrent Markov chain). Then λ = 1 is the largest real eigenvalue of [P], e = (1, 1, . . . , 1)^T is the right eigenvector of λ = 1, unique within a scale factor, and there is a unique probability vector π > 0 that is a left eigenvector of λ = 1.

Proof: Since each row of [P] adds up to 1, [P]e = e. Corollary 4.2 asserts the uniqueness of e and the fact that λ = 1 is the largest real eigenvalue, and Corollary 4.1 asserts the uniqueness of π.

The proof above shows that every stochastic matrix, whether irreducible or not, has an eigenvalue λ = 1 with e = (1, . . . , 1)^T as a right eigenvector. In general, a stochastic matrix with r recurrent classes has r independent non-negative right eigenvectors and r independent non-negative left eigenvectors; the left eigenvectors can be taken as the steady-state probability vectors within the r recurrent classes (see Exercise 4.14). The following corollary, proved in Exercise 4.13, extends Corollary 4.3 to unichains.

Corollary 4.4. Let [P] be the transition matrix of a unichain. Then λ = 1 is the largest real eigenvalue of [P], e = (1, 1, . . . , 1)^T is the right eigenvector of λ = 1, unique within a scale factor, and there is a unique probability vector π ≥ 0 that is a left eigenvector of λ = 1; πi > 0 for each recurrent state i and πi = 0 for each transient state i.

Corollary 4.5. The largest real eigenvalue λ of an irreducible matrix [A] ≥ 0 is a strictly increasing function of each component of [A].

Proof: For a given irreducible [A], let [B] satisfy [B] ≥ [A], [B] ≠ [A]. Let λ be the largest real eigenvalue of [A] and ν > 0 be the corresponding right eigenvector. Then λν = [A]ν ≤ [B]ν, but λν ≠ [B]ν. Let λ_B be the largest real eigenvalue of [B], which is also irreducible. If λ_B ≤ λ, then λ_B ν ≤ [B]ν and λ_B ν ≠ [B]ν, which is a contradiction of property 1 in Theorem 4.6. Thus, λ_B > λ.

We are now ready to study the asymptotic behavior of [A]^n. The simplest and cleanest result holds for [A] > 0. We establish this in the following corollary and then look at the case of greatest importance, that of a stochastic matrix for an ergodic Markov chain. More general cases are treated in Exercises 4.13 and 4.14.
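Corollary 4.5 can also be checked numerically: increasing any single entry of an irreducible non-negative matrix strictly increases its largest real eigenvalue. A sketch assuming numpy, with an arbitrary irreducible matrix (a weighted 3-cycle):

    import numpy as np

    def largest_real_eigenvalue(A):
        """The largest real eigenvalue of an irreducible non-negative matrix (Theorem 4.6)."""
        return max(np.linalg.eigvals(A).real)

    A = np.array([[0.0, 1.0, 0.0],
                  [0.0, 0.0, 2.0],
                  [0.5, 0.0, 0.0]])     # irreducible: 1 -> 2 -> 3 -> 1

    B = A.copy()
    B[0, 0] += 0.1                      # increase one component
    print(largest_real_eigenvalue(A) < largest_real_eigenvalue(B))   # True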


Corollary 4.6. Let λ be the largest eigenvalue of [A] > 0 and let π and ν be the positive left and right eigenvectors of λ, normalized so that πν = 1. Then

lim_{n→∞} [A]^n / λ^n = ν π .     (4.26)

Proof*: Since ν > 0 is a column vector and π > 0 is a row vector, νπ is a positive matrix of the same dimension as [A]. Since [A] > 0, we can define a matrix [B] = [A] − ανπ which is positive for small enough α > 0. Note that π and ν are left and right eigenvectors of [B] with eigenvalue λ − α. We then have [B]^n ν = (λ − α)^n ν, which when pre-multiplied by π yields

(λ − α)^n = π[B]^n ν = Σ_i Σ_j πi B^n_ij νj ,

where B^n_ij is the i, j element of [B]^n. Since each term in the above summation is positive, we have (λ − α)^n ≥ πi B^n_ij νj, and therefore B^n_ij ≤ (λ − α)^n / (πi νj). Thus, for each i, j, lim_{n→∞} B^n_ij / λ^n = 0, and therefore lim_{n→∞} [B]^n / λ^n = 0. Next we use a convenient matrix identity: for any eigenvalue λ of a matrix [A], and any corresponding right and left eigenvectors ν and π, normalized so that πν = 1, we have {[A] − λνπ}^n = [A]^n − λ^n νπ (see Exercise 4.12). Applying the same identity to [B], we have {[B] − (λ − α)νπ}^n = [B]^n − (λ − α)^n νπ. Finally, since [B] = [A] − ανπ, we have [B] − (λ − α)νπ = [A] − λνπ, so that

[A]^n − λ^n νπ = [B]^n − (λ − α)^n νπ .     (4.27)

Dividing both sides of (4.27) by λ^n and taking the limit as n → ∞, the right hand side goes to 0, completing the proof.

Note that for a stochastic matrix [P] > 0, this corollary simplifies to lim_{n→∞} [P]^n = eπ. This means that lim_{n→∞} P^n_ij = πj, which means that the probability of being in state j after a long time is πj, independent of the starting state.

Theorem 4.7. Let [P] be the transition matrix of an ergodic finite-state Markov chain. Then λ = 1 is the largest real eigenvalue of [P], and 1 > |μ| for every other eigenvalue μ. Furthermore, lim_{n→∞} [P]^n = eπ, where π > 0 is the unique probability vector satisfying π[P] = π and e = (1, 1, . . . , 1)^T is the unique vector (within a scale factor) satisfying [P]e = e.

Proof: From Corollary 4.3, λ = 1 is the largest real eigenvalue of [P], e is the unique (within a scale factor) right eigenvector of λ = 1, and there is a unique probability vector π such that π[P] = π. From Theorem 4.4, [P]^m is positive for sufficiently large m. Since [P]^m is also stochastic, λ = 1 is strictly larger than the magnitude of any other eigenvalue of [P]^m. Let μ be any other eigenvalue of [P] and let x be a right eigenvector of μ. Note that x is also a right eigenvector of [P]^m with eigenvalue μ^m. Since λ = 1 is the only eigenvalue of [P]^m of magnitude 1 or more, we either have |μ| < 1 or μ^m = 1. If μ^m = 1, then x must be a scalar times e. This is impossible, since x cannot be an eigenvector of [P] with both eigenvalue μ and 1. Thus |μ| < 1. Similarly, π > 0 is the unique left eigenvector of [P]^m with eigenvalue λ = 1, and πe = 1. Corollary 4.6 then asserts that


lim_{n→∞} [P]^{mn} = eπ. Multiplying by [P]^i for any i, 1 ≤ i < m, we get lim_{n→∞} [P]^{mn+i} = eπ, so lim_{n→∞} [P]^n = eπ.

Theorem 4.7 generalizes easily to an ergodic unichain (see Exercise 4.15). In this case, as one might suspect, πi = 0 for each transient state i and πi > 0 within the ergodic class. Theorem 4.7 becomes:

Theorem 4.8. Let [P] be the transition matrix of an ergodic unichain. Then λ = 1 is the largest real eigenvalue of [P], and 1 > |μ| for every other eigenvalue μ. Furthermore,
lim_{m→∞} [P]^m = eπ ,     (4.28)

where π ≥ 0 is the unique probability vector satisfying π[P] = π and e = (1, 1, . . . , 1)^T is the unique vector (within a scale factor) satisfying [P]e = e.

If a chain has a periodic recurrent class, [P]^m never converges. The existence of a unique probability vector solution to π[P] = π for a periodic recurrent chain is somewhat mystifying at first. If the period is d, then the steady-state vector π assigns probability 1/d to each of the d subsets of Theorem 4.3. If the initial probabilities for the chain are chosen as Pr{X0 = i} = πi for each i, then for each subsequent time n, Pr{Xn = i} = πi. What is happening is that this initial probability assignment starts the chain in each of the d subsets with probability 1/d, and subsequent transitions maintain this randomness over subsets. On the other hand, [P]^n cannot converge because P^n_ii, for each i, is zero except when n is a multiple of d. Thus the memory of the starting state never dies out. An ergodic Markov chain does not have this peculiar property, and the memory of the starting state dies out (from Theorem 4.7).

The intuition to be associated with the word ergodic is that of a process in which time-averages are equal to ensemble-averages. Using the general definition of ergodicity (which is beyond our scope here), a periodic recurrent Markov chain in steady-state (i.e., with Pr{Xn = i} = πi for all n and i) is ergodic. Thus the notion of ergodicity for Markov chains is slightly different than that in the general theory. The difference is that we think of a Markov chain as being specified without specifying the initial state distribution, and thus different initial state distributions really correspond to different stochastic processes. If a periodic Markov chain starts in steady state, then the corresponding stochastic process is stationary, and otherwise not.
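These convergence statements are easy to see numerically. The sketch below, assuming numpy, raises the matrix of an ergodic unichain to a high power, where every row approaches π (with πi = 0 for the transient state), and does the same for a periodic chain, where [P]^n depends on the parity of n and never converges. Both matrices are arbitrary examples.

    import numpy as np

    # An ergodic unichain: state 0 is transient; states 1 and 2 form an ergodic class.
    P_uni = np.array([[0.5, 0.5, 0.0],
                      [0.0, 0.2, 0.8],
                      [0.0, 0.6, 0.4]])
    print(np.linalg.matrix_power(P_uni, 100))   # every row is close to (0, 3/7, 4/7)

    # A recurrent chain with period 2: [P]^n alternates and never converges.
    P_per = np.array([[0.0, 1.0],
                      [1.0, 0.0]])
    print(np.linalg.matrix_power(P_per, 100))   # the identity (n even)
    print(np.linalg.matrix_power(P_per, 101))   # [P] itself (n odd)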

4.5 Markov chains with rewards

Suppose that each state i in a Markov chain is associated with some reward, ri . As the Markov chain proceeds from state to state, there is an associated sequence of rewards that are not independent, but are related by the statistics of the Markov chain. The situation is similar to, but simpler than, that of renewal-reward processes. As with renewal-reward processes, the reward ri could equally well be a cost or an arbitrary real valued function of the state. In this section, the expected value of the aggregate reward over time is analyzed.


The model of Markov chains with rewards is surprisingly broad. We have already seen that almost any stochastic process can be approximated by a Markov chain. Also, as we saw in studying renewal theory, the concept of rewards is quite graphic not only in modeling such things as corporate profits or portfolio performance, but also for studying residual life, queueing delay, and many other phenomena.

In Section 4.6, we shall study Markov decision theory, or dynamic programming. This can be viewed as a generalization of Markov chains with rewards in the sense that there is a decision maker or policy maker who in each state can choose between several different policies; for each policy, there is a given set of transition probabilities to the next state and a given expected reward for the current state. Thus the decision maker must make a compromise between the expected reward of a given policy in the current state (i.e., the immediate reward) and the long term benefit from the next state to be entered. This is a much more challenging problem than the current study of Markov chains with rewards, but a thorough understanding of the current problem provides the machinery to understand Markov decision theory also.

Frequently it is more natural to associate rewards with transitions rather than states. If rij denotes the reward associated with a transition from i to j and Pij denotes the corresponding transition probability, then ri = Σ_j Pij rij is the expected reward associated with a transition from state i. Since we analyze only expected rewards here, and since the effect of transition rewards rij is summarized into the state rewards ri = Σ_j Pij rij, we henceforth ignore transition rewards and consider only state rewards. The steady-state expected reward per unit time, assuming a single recurrent class of states, is easily seen to be g = Σ_i πi ri where πi is the steady-state probability of being in state i. The following examples demonstrate that it is also important to understand the transient behavior of rewards. This transient behavior will turn out to be even more important when we study Markov decision theory and dynamic programming.

Example 4.5.1 (Expected first-passage time). A common problem when dealing with Markov chains is that of finding the expected number of steps, starting in some initial state, before some given final state is entered. Since the answer to this problem does not depend on what happens after the given final state is entered, we can modify the chain to convert the given final state, say state 1, into a trapping state (a trapping state i is a state from which there is no exit, i.e., for which Pii = 1). That is, we set P11 = 1, P1j = 0 for all j ≠ 1, and leave Pij unchanged for all i ≠ 1 and all j (see Figure 4.5).

Figure 4.5: The conversion of a four state Markov chain into a chain for which state 1
is a trapping state. Note that the outgoing arcs from node 1 have been removed.

Let vi be the expected number of steps to reach state 1 starting in state i ≠ 1. This number


of steps includes the first step plus the expected number of steps from whatever state is entered next (which is 0 if state 1 is entered next). Thus, for the chain in Figure 4.5, we have the equations

v2 = 1 + P23 v3 + P24 v4
v3 = 1 + P32 v2 + P33 v3 + P34 v4
v4 = 1 + P42 v2 + P43 v3 .

For an arbitrary chain of M states where 1 is a trapping state and all other states are transient, this set of equations becomes

vi = 1 + Σ_{j≠1} Pij vj ;     i ≠ 1.     (4.29)

If we define ri = 1 for i ≠ 1 and ri = 0 for i = 1, then ri is a unit reward for not yet entering the trapping state, and vi is the expected aggregate reward before entering the trapping state. Thus by taking r1 = 0, the reward ceases upon entering the trapping state, and vi is the expected transient reward, i.e., the expected first passage time from state i to state 1. Note that in this example, rewards occur only in transient states. Since transient states have zero steady-state probabilities, the steady-state gain per unit time, g = Σ_i πi ri, is 0. If we define v1 = 0, then (4.29), along with v1 = 0, has the vector form

v = r + [P]v ;     v1 = 0.     (4.30)

For a Markov chain with M states, (4.29) is a set of M − 1 equations in the M − 1 variables v2 to vM. The equation v = r + [P]v is a set of M linear equations, of which the first is the vacuous equation v1 = 0 + v1, and, with v1 = 0, the last M − 1 correspond to (4.29). It is not hard to show that (4.30) has a unique solution for v under the condition that states 2 to M are all transient states and 1 is a trapping state, but we prove this later, in Lemma 4.1, under more general circumstances.

Example 4.5.2. Assume that a Markov chain has M states, {0, 1, . . . , M − 1}, and that the state represents the number of customers in an integer time queueing system. Suppose we wish to find the expected sum of the times all customers spend in the system, starting at an integer time where i customers are in the system, and ending at the first instant when the system becomes idle. From our discussion of Little's theorem in Section 3.6, we know that this sum of times is equal to the sum of the number of customers in the system, summed over each integer time from the initial time with i customers to the final time when the system becomes empty.

As in the previous example, we modify the Markov chain to make state 0 a trapping state. We take ri = i as the reward in state i, and vi as the expected aggregate reward until the trapping state is entered. Using the same reasoning as in the previous example, vi is equal to the immediate reward ri = i plus the expected reward from whatever state is entered next. Thus vi = ri + Σ_{j≥1} Pij vj. With v0 = 0, this is v = r + [P]v. This has a unique solution for v, as will be shown later in Lemma 4.1. This same analysis is valid for any choice of reward ri for each transient state i; the reward in the trapping state must be 0 so as to keep the expected aggregate reward finite.
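Both examples come down to solving v = r + [P]v with the trapping-state component of v fixed at 0, which is a linear system over the remaining states. A sketch assuming numpy; the 4-state matrix is an arbitrary example in which state 1 (index 0 in the code) has already been converted into a trapping state, in the spirit of Figure 4.5.

    import numpy as np

    def expected_aggregate_reward(P, r, trap):
        """Solve v = r + [P] v with v[trap] = 0 (the reward stops at the trapping state)."""
        M = len(P)
        others = [i for i in range(M) if i != trap]
        # Restricted to the non-trapping states, the system reads (I - P_tt) v_t = r_t.
        P_tt = np.asarray(P)[np.ix_(others, others)]
        v = np.zeros(M)
        v[others] = np.linalg.solve(np.eye(M - 1) - P_tt, np.asarray(r)[others])
        return v

    P = np.array([[1.0, 0.0, 0.0, 0.0],     # trapping state
                  [0.3, 0.0, 0.4, 0.3],
                  [0.1, 0.2, 0.3, 0.4],
                  [0.2, 0.5, 0.3, 0.0]])

    r = np.array([0.0, 1.0, 1.0, 1.0])      # unit reward outside the trap, as in Example 4.5.1
    print(expected_aggregate_reward(P, r, trap=0))   # expected first-passage times to the trap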


In the above examples, the Markov chain has a trapping state with zero gain, so the expected gain is essentially a transient phenomenon until entering the trapping state. We now look at the more general case of a unichain, i.e., a chain with a single recurrent class, possibly along with some transient states. In this more general case, there can be some average gain per unit time, along with some transient gain depending on the initial state. We first look at the aggregate gain over a finite number of time units, thus providing a clean way of going to the limit.

Example 4.5.3. The example in Figure 4.6 provides some intuitive appreciation for the general problem. Note that the chain tends to persist in whatever state it is in for a relatively long time. Thus if the chain starts in state 2, not only is an immediate reward of 1 achieved, but there is a high probability of an additional gain of 1 on many successive transitions. Thus the aggregate value of starting in state 2 is considerably more than the immediate reward of 1. On the other hand, we see from symmetry that the expected gain per unit time, over a long time period, must be one half.
[Figure 4.6: two states with P11 = P22 = 0.99, P12 = P21 = 0.01, and rewards r1 = 0, r2 = 1.]

Figure 4.6: Markov chain with rewards.

Returning to the general case, it is convenient to work backward from a final time rather than forward from the initial time. This will be quite helpful later when we consider dynamic programming and Markov decision theory. For any final time m, define stage n as n time units before the final time, i.e., as time m − n in Figure 4.7. Equivalently, we often view the final time as time 0, and then stage n corresponds to time −n.


Figure 4.7: Alternate views of Stages.

As a final generalization of the problem (which will be helpful in the solution), we allow the reward at the final time (i.e., in stage 0) to be different from that at other times. The final reward in state i is denoted ui, and u = (u1, . . . , uM)^T. We denote the expected aggregate reward from stage n up to and including the final stage (stage zero), given state i at stage n, as vi(n, u). Note that the notation here is taking advantage of the Markov property. That is, given that the chain is in state i at time −n (i.e., stage n), the expected aggregate reward up to and including time 0 is independent of the states before time −n and is independent of when the Markov chain started prior to time −n.


The expected aggregate reward can be found by starting at stage 1. Given that the chain is in state i at time −1, the immediate reward is ri. The chain then makes a transition (with probability Pij) to some state j at time 0 with a final reward of uj. Thus

    vi(1, u) = ri + Σ_j Pij uj.        (4.31)

For the example of Figure 4.6 (assuming the final reward is the same as that at the other stages, i.e., ui = ri for i = 1, 2), we have v1(1, u) = 0.01 and v2(1, u) = 1.99. The expected aggregate reward for stage 2 can be calculated in the same way. Given state i at time −2 (i.e., stage 2), there is an immediate reward of ri and, with probability Pij, the chain goes to state j at time −1 (i.e., stage 1) with an expected additional gain of vj(1, u). Thus

    vi(2, u) = ri + Σ_j Pij vj(1, u).        (4.32)

Note that vj(1, u), as calculated in (4.31), includes the gain in stages 1 and 0, and does not depend on how state j was entered. Iterating the above argument to stages 3, 4, . . . , n,

    vi(n, u) = ri + Σ_j Pij vj(n−1, u).        (4.33)

This can be written in vector form as

    v(n, u) = r + [P]v(n−1, u);    n ≥ 1,        (4.34)

where r is a column vector with components r1, r2, . . . , rM and v(n, u) is a column vector with components v1(n, u), . . . , vM(n, u). By substituting (4.34), with n replaced by n−1, into the last term of (4.34),

    v(n, u) = r + [P]r + [P]^2 v(n−2, u);    n ≥ 2.        (4.35)

Applying the same substitution recursively, we eventually get an explicit expression for v(n, u),

    v(n, u) = r + [P]r + [P]^2 r + ··· + [P]^{n−1} r + [P]^n u.        (4.36)

Eq. (4.34), applied iteratively, is more convenient for calculating v(n, u) than (4.36), but neither gives us much insight into the behavior of the expected aggregate reward, especially for large n. We can get a little insight by averaging the components of (4.36) over the steady-state probability vector π. Since π[P]^m = π for all m and πr is, by definition, the steady-state gain per stage g, this gives us

    π v(n, u) = ng + π u.        (4.37)

This result is not surprising, since when the chain starts in steady state at stage n, it remains in steady state, yielding a gain per stage of g until the final reward at stage 0. For the example of Figure 4.6 (again assuming u = r), Figure 4.8 tabulates this steady-state


expected aggregate gain and compares it with the expected aggregate gain vi(n, u) for initial states 1 and 2. Note that v1(n, u) is always less than the steady-state average by an amount approaching 25 with increasing n. Similarly, v2(n, u) is greater than the average by the corresponding amount. In other words, for this example, vi(n, u) − π v(n, u), for each state i, approaches a limit as n → ∞. This limit is called the asymptotic relative gain for starting in state i, relative to starting in steady state. In what follows, we shall see that this type of asymptotic behavior is quite general.

    n           1      2       4      10      40      100      400       1000
    π v(n, r)   1      1.5     2.5    5.5     20.5    50.5     200.5     500.5
    v1(n, r)    0.01   0.0298  0.098  0.518   6.420   28.749   175.507   475.500
    v2(n, r)    1.99   2.9702  4.902  10.482  34.580  72.250   225.492   525.500

Figure 4.8: The expected aggregate reward, as a function of starting state and stage, for the example of Figure 4.6.
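The entries of Figure 4.8 can be reproduced by iterating (4.34) directly; the following minimal sketch does this for the chain of Figure 4.6 with u = r (here pi denotes the steady-state vector, which is (1/2, 1/2) by symmetry).

```python
import numpy as np

# Backward recursion v(n,u) = r + [P] v(n-1,u) of (4.34) for Figure 4.6, u = r.
P = np.array([[0.99, 0.01],
              [0.01, 0.99]])
r = np.array([0.0, 1.0])
pi = np.array([0.5, 0.5])

v = r.copy()                      # v(0,u) = u = r
for n in range(1, 1001):
    v = r + P @ v                 # (4.34)
    if n in (1, 2, 4, 10, 40, 100, 400, 1000):
        print(n, pi @ v, v[0], v[1])   # rows of Figure 4.8
```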

Initially we consider only ergodic Markov chains and first try to understand the asymptotic behavior above at an intuitive level. For large n, the probability of being in state j at time 0, conditional on starting in state i at time −n, is P^n_ij ≈ πj. Thus, the expected final reward at time 0 is approximately π u for each possible starting state at time −n. For (4.36), this says that the final term [P]^n u is approximately (π u)e for large n. Similarly, in (4.36), [P]^{n−m} r ≈ g e if n − m is large. This means that for very large n, each unit increase or decrease in n simply adds or subtracts g e to the vector gain. Thus, we might conjecture that, for large n, v(n, u) is the sum of an initial transient term w, an intermediate term n g e, and a final term (π u)e, i.e.,

    v(n, u) ≈ w + n g e + (π u)e,        (4.38)

where we also conjecture that the approximation becomes exact as n → ∞. Substituting (4.37) into (4.38), the conjecture (which we shall soon validate) is

    v(n, u) ≈ w + (π v(n, u))e.        (4.39)

That is, the component wi of w tells us how profitable it is, in the long term, to start in a particular state i rather than start in steady state. Thus w is called the asymptotic relative gain vector or, for brevity, the relative gain vector. In the example of the table above, w = (−25, +25). There are two reasonable approaches to validate the conjecture above and to evaluate the relative gain vector w. The first is explored in Exercise 4.22 and expands on the intuitive argument leading to (4.38) to show that w is given by

    w = Σ_{n=0}^{∞} ([P]^n − e π) r.        (4.40)


This expression is not a very useful way to calculate w, and thus we follow the second approach here, which provides both a convenient expression for w and a proof that the approximation in (4.38) becomes exact in the limit. Rearranging (4.38) and going to the limit,

    w = lim_{n→∞} {v(n, u) − n g e − (π u)e}.        (4.41)

The conjecture, which is still to be proven, is that the limit in (4.41) actually exists. We now show that if this limit exists, w must have a particular form. In particular, substituting (4.34) into (4.41),

    w = lim_{n→∞} {r + [P]v(n−1, u) − n g e − (π u)e}
      = r − g e + [P] lim_{n→∞} {v(n−1, u) − (n−1)g e − (π u)e}
      = r − g e + [P]w.

Thus, if the limit in (4.41) exists, that limiting vector w must satisfy

    w + g e = r + [P]w.        (4.42)

The following lemma shows that this equation has a solution. The lemma does not depend on the conjecture in (4.41); we are simply using this conjecture to motivate why the equation (4.42) is important.

Lemma 4.1. Let [P] be the transition matrix of an M-state unichain. Let r = (r1, . . . , rM)^T be a reward vector, let π = (π1, . . . , πM) be the steady-state probabilities of the chain, and let g = Σ_i πi ri. Then the equation w + g e = r + [P]w has a solution for w. With the additional condition π w = 0, that solution is unique.

Discussion: Note that v = r + [P]v in Example 4.5.1 is a special case of (4.42) in which π = (1, 0, . . . , 0), r = (0, 1, . . . , 1)^T, and thus g = 0. With the added condition v1 = π v = 0, the solution is unique. Example 4.5.2 is the same, except that r is different, and thus also has a unique solution.

Proof: Rewrite (4.42) as

    {[P] − [I]}w = g e − r.        (4.43)

Let w̃ be a particular solution to (4.43) (if one exists). Then any solution to (4.43) can be expressed as w̃ + x for some x that satisfies the homogeneous equation {[P] − [I]}x = 0. For x to satisfy {[P] − [I]}x = 0, however, x must be a right eigenvector of [P] with eigenvalue 1. From Theorem 4.8, x must have the form αe for some number α. This means that if a particular solution w̃ to (4.43) exists, then all solutions have the form w = w̃ + αe. For a particular solution to (4.43) to exist, g e − r must lie in the column space of the matrix [P] − [I]. This column space is the space orthogonal to the left null space of [P] − [I]. This left null space, however, is simply the set of left eigenvectors of [P] of eigenvalue 1, i.e., the scalar multiples of π. Thus, a particular solution exists iff π(g e − r) = 0. Since π g e = g and π r = g, this equality is satisfied and a particular solution exists. Since all solutions have the form w = w̃ + αe, setting π w = 0 determines the value of α to be −π w̃, thus yielding a unique solution with π w = 0 and completing the proof.


It is not necessary to assume that g = π r in the lemma. If g is treated as a variable in (4.42), then, by pre-multiplying any solution w, g of (4.42) by π, we find that g = π r must be satisfied. This means that (4.42) can be viewed as M linear equations in the M + 1 variables w, g, and the set of solutions can be found without first calculating π. Naturally, π must be found to find the particular solution with π w = 0.

If the final reward vector is chosen to be any solution w of (4.42) (not necessarily the one with π w = 0), then

    v(1, w) = r + [P]w = w + g e
    v(2, w) = r + [P]{w + g e} = w + 2g e
      ···
    v(n, w) = r + [P]{w + (n−1)g e} = w + n g e.        (4.44)

This is a simple explicit expression for the expected aggregate gain for this special final reward vector. We now show how to use this to get a simple expression for v(n, u) for arbitrary u. From (4.36),

    v(n, u) − v(n, w) = [P]^n {u − w}.        (4.45)

Note that this is valid for any Markov unichain and any reward vector. Substituting (4.44) into (4.45),

    v(n, u) = n g e + w + [P]^n {u − w}.        (4.46)

It should now be clear why we wanted to allow the final reward vector to differ from the reward vector at other stages. The result is summarized in the following theorem:

Theorem 4.9. Let [P] be the transition matrix of a unichain. Let r be a reward vector and w a solution to (4.42). Then the expected aggregate reward vector over n stages is given by (4.46). If the unichain is ergodic and w satisfies π w = 0, then

    lim_{n→∞} {v(n, u) − n g e} = w + (π u)e.        (4.47)

Proof: The argument above established (4.46). If the recurrent class is ergodic, then [P]^n approaches a matrix whose rows each equal π, and (4.47) follows.

The set of solutions to (4.42) has the form w + αe, where w satisfies π w = 0 and α is any real number. The factor αe cancels out in (4.46), so any solution can be used. In (4.47), however, the restriction to π w = 0 is necessary. We have defined the (asymptotic) relative gain vector w to satisfy π w = 0 so that, in the ergodic case, the expected aggregate gain v(n, u) can be cleanly split into an initial transient w, the intermediate gain n g e, and the final gain π u, as in (4.47). We shall call other solutions to (4.42) shifted relative gain vectors.
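As an illustration of Lemma 4.1 and Theorem 4.9, the sketch below computes g and the relative gain vector w for the chain of Figure 4.6 by solving (4.42) together with π w = 0, and then checks the limit (4.47) numerically; the particular final reward vector u used is just an arbitrary example.

```python
import numpy as np

# Solve w + g*e = r + [P]w with pi.w = 0 for the chain of Figure 4.6,
# then check that v(n,u) - n*g*e approaches w + (pi.u)e as in (4.47).
P = np.array([[0.99, 0.01], [0.01, 0.99]])
r = np.array([0.0, 1.0])
M = len(r)
e = np.ones(M)

# Steady-state vector pi: solve pi[P] = pi together with sum(pi) = 1.
A = np.vstack([P.T - np.eye(M), e])
pi = np.linalg.lstsq(A, np.append(np.zeros(M), 1.0), rcond=None)[0]
g = pi @ r

# Relative gain vector: ([P]-[I])w = g*e - r with the extra condition pi.w = 0.
B = np.vstack([P - np.eye(M), pi])
w = np.linalg.lstsq(B, np.append(g * e - r, 0.0), rcond=None)[0]
print(g, w)                       # 0.5 and approximately (-25, 25)

u = np.array([3.0, -7.0])         # arbitrary final reward, for illustration
v = u.copy()
for n in range(1, 2001):
    v = r + P @ v                 # recursion (4.34)
print(v - n * g * e, w + (pi @ u) * e)   # the two vectors nearly agree
```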


Recall that Examples 4.5.1 and 4.5.2 showed that the aggregate reward vi from state i until entering a trapping state, state 1, is given by the solution to v = r + [P]v, v1 = 0. This aggregate reward, in the general setup of Theorem 4.9, is lim_{n→∞} v(n, u). Since g = 0 and u = 0 in these examples, (4.47) simplifies to lim_{n→∞} v(n, u) = w, where w = r + [P]w and π w = w1 = 0. Thus, we see that (4.47) gives the same answer as we got in these examples.

For the example in Figure 4.6, we have seen that w = (−25, 25) (see Exercise 4.21 also). The large relative gain for state 2 accounts for both the immediate reward and the high probability of multiple additional rewards through remaining in state 2. Note that w2 cannot be interpreted as the expected reward up to the first transition from state 2 to 1. The reason for this is that the gain starting from state 1 cannot be ignored; this can be seen from Figure 4.9, which modifies Figure 4.6 by changing P12 to 1. In this case (see Exercise 4.21), w2 − w1 = 1/1.01 ≈ 0.99, reflecting the fact that state 1 is always left immediately, thus reducing the advantage of starting in state 2.
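A quick numerical check of this value for the Figure 4.9 variant (P12 changed to 1):

```python
import numpy as np

# Gain and relative gain vector for the Figure 4.9 chain; verify w2 - w1 = 1/1.01.
P = np.array([[0.0, 1.0], [0.01, 0.99]])
r = np.array([0.0, 1.0])
e = np.ones(2)

A = np.vstack([P.T - np.eye(2), e])
pi = np.linalg.lstsq(A, np.array([0.0, 0.0, 1.0]), rcond=None)[0]
g = pi @ r                                    # 1/1.01, about 0.99

B = np.vstack([P - np.eye(2), pi])
w = np.linalg.lstsq(B, np.append(g * e - r, 0.0), rcond=None)[0]
print(g, w[1] - w[0])                         # both about 0.99
```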
Figure 4.9: A variation of Figure 4.6. (State 1 has P12 = 1 and r1 = 0; state 2 has P21 = 0.01, P22 = 0.99, and r2 = 1.)

We can now interpret the general solution in (4.46) by viewing g e as the steady-state gain per stage, viewing w as the dependence on the initial state, and viewing [P]^n {u − w} as the dependence on the final reward vector u. If the recurrent class is ergodic, then, as seen in (4.47), this final term is asymptotically independent of the starting state and w, but depends on u.

Example 4.5.4. In order to understand better why (4.47) can be false without the assumption of an ergodic unichain, consider a two-state periodic chain with P12 = P21 = 1, r1 = r2 = 0, and arbitrary final reward with u1 ≠ u2. Then it is easy to see that for n even, v1(n) = u1 and v2(n) = u2, while for n odd, v1(n) = u2 and v2(n) = u1. Thus, the effect of the final reward on the initial state never dies out.

For a unichain with a periodic recurrent class of period d, as in the example above, it is a little hard to interpret w as an asymptotic relative gain vector, since the last term of (4.46) involves w also (i.e., the relative gain of starting in different states depends on both n and u). The trouble is that the final reward happens at a particular phase of the periodic variation, and the starting state determines the set of states at which the final reward is assigned. If we view the final reward as being randomized over a period, with equal probability of occurring at each phase, then, from (4.46),
    (1/d) Σ_{m=0}^{d−1} [v(n+m, u) − (n+m)g e] = w + (1/d)[P]^n {[I] + [P] + ··· + [P]^{d−1}}{u − w}.

Going to the limit n → ∞, and using the result of Exercise 4.18, this becomes almost the


same as the result for an ergodic unichain, i.e.,

    lim_{n→∞} (1/d) Σ_{m=0}^{d−1} [v(n+m, u) − (n+m)g e] = w + (e π)u.        (4.48)

There is an interesting analogy between the steady-state vector π and the relative gain vector w. If the recurrent class of states is ergodic, then any initial distribution on the states approaches the steady state with increasing time, and similarly the effect of any final gain vector becomes negligible (except for the choice of π u) with an increasing number of stages. On the other hand, if the recurrent class is periodic, then starting the Markov chain in steady state maintains the steady state, and similarly, choosing the final gain to be the relative gain vector maintains the same relative gain at each stage.

Theorem 4.9 treated only unichains, and it is sometimes useful to look at asymptotic expressions for chains with m > 1 recurrent classes. In this case, the analogous quantity to a relative gain vector can be expressed as a solution to

    w + Σ_{i=1}^{m} g^(i) ν^(i) = r + [P]w,        (4.49)

where g^(i) is the gain of the ith recurrent class and ν^(i) is the corresponding right eigenvector of [P] (see Exercise 4.14). Using a solution to (4.49) as a final gain vector, we can repeat the argument in (4.44) to get

    v(n, w) = w + n Σ_{i=1}^{m} g^(i) ν^(i)    for all n ≥ 1.        (4.50)

As expected, the average reward per stage depends on the recurrent class of the initial state. If the initial state, j, is transient, the average reward per stage is averaged over the recurrent classes, using the probability νj^(i) that state j eventually reaches class i. For an arbitrary final reward vector u, (4.50) can be combined with (4.45) to get

    v(n, u) = w + n Σ_{i=1}^{m} g^(i) ν^(i) + [P]^n {u − w}    for all n ≥ 1.        (4.51)

Eqn. (4.49) always has a solution (see Exercise 4.27), and in fact has an m-dimensional set of solutions given by w = w̃ + Σ_i αi ν^(i), where α1, . . . , αm can be chosen arbitrarily and w̃ is any given solution.

4.6 Markov decision theory and dynamic programming

4.6.1 Introduction

In the previous section, we analyzed the behavior of a Markov chain with rewards. In this section, we consider a much more elaborate structure in which a decision maker can select


between various possible decisions for rewards and transition probabilities. In place of the reward ri and the transition probabilities {Pij; 1 ≤ j ≤ M} associated with a given state i, there is a choice between some number Ki of different rewards, say ri^(1), ri^(2), . . . , ri^(Ki), and a corresponding choice between Ki different sets of transition probabilities, say {Pij^(1); 1 ≤ j ≤ M}, {Pij^(2); 1 ≤ j ≤ M}, . . . , {Pij^(Ki); 1 ≤ j ≤ M}. A decision maker then decides between these Ki possible decisions each time the chain is in state i. Note that if the decision maker chooses decision k for state i, then the reward is ri^(k) and the transition probabilities from state i are {Pij^(k); 1 ≤ j ≤ M}; it is not possible to choose ri^(k) for one k and {Pij^(k); 1 ≤ j ≤ M} for another k. We assume that, given Xn = i, and given decision k at time n, the probability of entering state j at time n + 1 is Pij^(k), independent of earlier states and decisions.

Figure 4.10 shows an example of this situation in which the decision maker can choose between two possible decisions in state 2 (K2 = 2) and has no freedom of choice in state 1 (K1 = 1). This figure illustrates the familiar tradeoff between instant gratification (alternative 2) and long-term gratification (alternative 1).

Figure 4.10: A Markov decision problem with two alternatives in state 2. (Decision 1: r2^(1) = 1, P21^(1) = 0.01, P22^(1) = 0.99. Decision 2: r2^(2) = 50, P21^(2) = 1. In state 1 there is no choice: r1 = 0, P11 = 0.99, P12 = 0.01.)

It is also possible to consider the situation in which the rewards for each decision are associated with transitions; that is, for decision k in state i, the reward rij^(k) is associated with a transition from i to j. This means that the expected reward for a transition from i with decision k is given by ri^(k) = Σ_j Pij^(k) rij^(k). Thus, as in the previous section, there is no essential loss in generality in restricting attention to the case in which rewards are associated with the states.

The set of rules used by the decision maker in selecting different alternatives at each stage of the chain is called a policy. We want to consider the expected aggregate reward over n trials of the Markov chain, as a function of the policy used by the decision maker. If the policy uses the same decision, say ki, at each occurrence of state i, for each i, then that policy corresponds to a homogeneous Markov chain with transition probabilities Pij^(ki). We denote the matrix of these transition probabilities as [P^k], where k = (k1, . . . , kM). Such a policy, i.e., making the decision for each state i independent of time, is called a stationary policy. The aggregate reward for any such stationary policy was found in the previous section. Since both rewards and transition probabilities depend only on the state and the corresponding decision, and not on time, one feels intuitively that stationary policies make a certain amount of sense over a long period of time. On the other hand, assuming some final reward ui for being in state i at the end of the nth trial, one might expect the best policy to depend on time, at least close to the end of the n trials. In what follows, we first derive the optimal policy for maximizing expected aggregate reward over an arbitrary number n of trials. We shall see that the decision at time m, 0 ≤ m < n, for


the optimal policy does in fact depend both on m and on the final rewards {ui; 1 ≤ i ≤ M}. We call this optimal policy the optimal dynamic policy. This policy is found from the dynamic programming algorithm, which, as we shall see, is conceptually very simple. We then go on to find the relationship between the optimal dynamic policy and the optimal stationary policy and show that each has the same long-term gain per trial.

4.6.2 Dynamic programming algorithm

As in our development of Markov chains with rewards, we consider expected aggregate reward over n time periods and we use stages, counting backward from the final trial. First consider the optimum decision with just one trial (i.e., with just one stage). We start in a given state i at stage 1, make a decision k, obtain the reward ri^(k), then go to some state j with probability Pij^(k) and obtain the final reward uj. This expected aggregate reward is maximized over the choice of k, i.e.,

    vi∗(1, u) = max_k {ri^(k) + Σ_j Pij^(k) uj}.        (4.52)
We use the notation vi∗(n, u) to represent the maximum expected aggregate reward for n stages starting in state i. Note that vi∗(1, u) depends on the final reward vector u = (u1, u2, . . . , uM)^T. Next consider the maximum expected aggregate reward starting in state i at stage 2. For each state j, 1 ≤ j ≤ M, let vj(1, u) be the expected aggregate reward, over stages 1 and 0, for some arbitrary policy, conditional on the chain being in state j at stage 1. Then if decision k is made in state i at stage 2, the expected aggregate reward for stage 2 is ri^(k) + Σ_j Pij^(k) vj(1, u). Note that no matter what policy is chosen at stage 2, this expression is maximized at stage 1 by choosing the stage 1 policy that maximizes vj(1, u). Thus, independent of what we choose at stage 2 (or at earlier times), we must use vj∗(1, u) for the aggregate gain from stage 1 onward in order to maximize the overall aggregate gain from stage 2. Thus, at stage 2, we achieve the maximum expected aggregate gain, vi∗(2, u), by choosing the k that achieves the following maximum:

    vi∗(2, u) = max_k {ri^(k) + Σ_j Pij^(k) vj∗(1, u)}.        (4.53)

Repeating this argument for successively larger n, we obtain the general expression

    vi∗(n, u) = max_k {ri^(k) + Σ_j Pij^(k) vj∗(n−1, u)}.        (4.54)

Note that this is almost the same as (4.33), differing only by the maximization over k. We can also write this in vector form, for n ≥ 1, as

    v∗(n, u) = max_k {r^k + [P^k]v∗(n−1, u)},        (4.55)

where for n = 1, we take v∗(0, u) = u. Here k is a set (or vector) of decisions, k = (k1, k2, . . . , kM), where ki is the decision to be used in state i, [P^k] denotes a matrix whose (i, j) element is Pij^(ki), and r^k denotes a vector whose ith element is ri^(ki). The maximization over k in (4.55) is really M separate and independent maximizations, one for each state, i.e., (4.55) is simply a vector form of (4.54). Another frequently useful way to rewrite (4.54) or (4.55) is as follows:

    v∗(n, u) = r^{k′} + [P^{k′}]v∗(n−1, u)    for k′ such that r^{k′} + [P^{k′}]v∗(n−1, u) = max_k {r^k + [P^k]v∗(n−1, u)}.        (4.56)
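As an illustration of how (4.54)-(4.56) translate into an algorithm, here is a minimal sketch of the backward recursion; the function name and the data layout (one list of transition rows and one list of rewards per state) are our own choices and not notation from the text.

```python
import numpy as np

def dynamic_program(P, r, u, n_stages):
    """Backward recursion (4.54)/(4.55).  P[i][k] is the transition row and
    r[i][k] the reward for decision k in state i; u is the final reward vector.
    Returns v*(n,u) and the optimal decision vector chosen at each stage."""
    v = np.array(u, dtype=float)
    decisions = []
    for _ in range(n_stages):
        new_v = np.empty_like(v)
        stage_k = []
        for i in range(len(v)):
            # value of each available decision k in state i
            vals = [r[i][k] + np.dot(P[i][k], v) for k in range(len(r[i]))]
            stage_k.append(int(np.argmax(vals)))
            new_v[i] = max(vals)
        v = new_v
        decisions.append(stage_k)      # decisions[m] is the choice at stage m+1
    return v, decisions
```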

If k′ satisfies (4.56), it is called an optimal decision for stage n. Note that (4.54), (4.55), and (4.56) are valid with no restrictions (such as recurrent or aperiodic states) on the possible transition probabilities [P^k]. The dynamic programming algorithm is just the calculation of (4.54), (4.55), or (4.56), performed successively for n = 1, 2, 3, . . . . The development of this algorithm, as a systematic tool for solving this class of problems, is due to Bellman [Bel57]. This algorithm yields the optimal dynamic policy for any given final reward vector, u. Along with the calculation of v∗(n, u) for each n, the algorithm also yields the optimal decision at each stage. The surprising simplicity of the algorithm is due to the Markov property. That is, vi∗(n, u) is the aggregate present and future reward conditional on the present state. Since it is conditioned on the present state, it is independent of the past (i.e., how the process arrived at state i from previous transitions and choices).

Although dynamic programming is computationally straightforward and convenient⁷, the asymptotic behavior of v∗(n, u) as n → ∞ is not evident from the algorithm. After working out some simple examples, we look at the general question of asymptotic behavior.

Example 4.6.1. Consider Fig. 4.10, repeated below, with the final rewards u2 = u1 = 0.
(Figure 4.10, repeated: decision 1 gives r2^(1) = 1 with P21^(1) = 0.01, P22^(1) = 0.99; decision 2 gives r2^(2) = 50 with P21^(2) = 1; state 1 has r1 = 0, P11 = 0.99, P12 = 0.01.)

Since there is no reward in stage 0, uj = 0. Also r1 = 0, so, from (4.52), the aggregate gain in state 1 at stage 1 is

    v1∗(1, u) = r1 + Σ_j P1j uj = 0.

Similarly, since policy 1 has an immediate reward r2^(1) = 1 in state 2, and policy 2 has an immediate reward r2^(2) = 50,

    v2∗(1, u) = max[ r2^(1) + Σ_j P2j^(1) uj ,  r2^(2) + Σ_j P2j^(2) uj ] = max{1, 50} = 50.

⁷Unfortunately, many dynamic programming problems of interest have enormous numbers of states and possible choices of decision (the so-called curse of dimensionality), and thus, even though the equations are simple, the computational requirements might be beyond the range of practical feasibility.

We can now go on to stage 2, using the results above for vj∗(1, u). From (4.53),

    v1∗(2, u) = r1 + P11 v1∗(1, u) + P12 v2∗(1, u) = P12 v2∗(1, u) = 0.5
    v2∗(2, u) = max[ r2^(1) + Σ_j P2j^(1) vj∗(1, u) ,  r2^(2) + P21^(2) v1∗(1, u) ]
              = max{ [1 + P22^(1) v2∗(1, u)], 50 } = max{50.5, 50} = 50.5.
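The same computation can be carried to larger n by iterating (4.54) directly; a short sketch for the decision problem of Figure 4.10 (with u = 0):

```python
import numpy as np

# Dynamic programming recursion (4.54) for the decision problem of Figure 4.10.
# Decision 1 in state 2 uses (r=1, P=[0.01, 0.99]); decision 2 uses (r=50, P=[1, 0]).
P1 = np.array([[0.99, 0.01],      # state 1 (no choice)
               [0.01, 0.99]])     # state 2, decision 1
P2_row = np.array([1.0, 0.0])     # state 2, decision 2
r1, r2_d1, r2_d2 = 0.0, 1.0, 50.0

v = np.zeros(2)                   # v*(0,u) = u = 0
for n in range(1, 11):
    val_d1 = r2_d1 + P1[1] @ v
    val_d2 = r2_d2 + P2_row @ v
    v = np.array([r1 + P1[0] @ v, max(val_d1, val_d2)])
    print(n, v, "decision", 1 if val_d1 >= val_d2 else 2)
```

The printout shows decision 2 chosen only at stage 1; thereafter both components grow by 1/2 per stage, with v2∗ − v1∗ = 50.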
Thus, we have seen that, in state 2, decision 1 is preferable at stage 2, while decision 2 is preferable at stage 1. What is happening is that the choice of decision 2 at stage 1 has made it very profitable to be in state 2 at stage 1. Thus if the chain is in state 2 at stage 2, it is preferable to choose decision 1 (i.e., the small unit gain) at stage 2, with the corresponding high probability of remaining in state 2 at stage 1. Continuing this computation for larger n, one finds that v1∗(n, u) = (n−1)/2 and v2∗(n, u) = 50 + (n−1)/2. The optimum dynamic policy is decision 2 for stage 1 and decision 1 for all stages n > 1.

This example also illustrates that the maximization of expected gain is not necessarily what is most desirable in all applications. For example, people who want to avoid risk might well prefer decision 2 at stage 2. This guarantees a reward of 50, rather than taking a small chance of losing that reward.

Example 4.6.2 (Shortest Path Problems). The problem of finding the shortest paths between nodes in a directed graph arises in many situations, from routing in communication networks to calculating the time to complete complex tasks. The problem is quite similar to the expected first-passage time of Example 4.5.1. In that problem, arcs in a directed graph were selected according to a probability distribution, whereas here we must make a decision about which arc to take. Although there are no probabilities here, the problem can be posed as dynamic programming.

We suppose that we want to find the shortest path from each node in a directed graph to some particular node, say node 1 (see Figure 4.11). The link lengths are arbitrary numbers that might reflect physical distance, or might reflect an arbitrary type of cost. The length of a path is the sum of the lengths of the arcs on that path. In terms of dynamic programming, a policy is a choice of arc out of each node. Here we want to minimize cost (i.e., path length) rather than maximize reward, so we simply replace the maximum in the dynamic programming algorithm with a minimum (or, if one wishes, all costs can be replaced with negative rewards).

Figure 4.11: A shortest path problem. The arcs are marked with their lengths. Any unmarked link has length 1.

We start the dynamic programming algorithm with a final cost vector that is 0 for node 1 and infinite for all other nodes. In stage 1, we choose the arc from node 2 to 1 and that


from 4 to 1; the choice at node 3 is immaterial. The stage 1 costs are then

    v1∗(1, u) = 0,    v2∗(1, u) = 4,    v3∗(1, u) = ∞,    v4∗(1, u) = 1.

In stage 2, the cost v3∗(2, u), for example, is

    v3∗(2, u) = min[ 2 + v2∗(1, u),  4 + v4∗(1, u) ] = 5.

The set of costs at stage 2 are

    v1∗(2, u) = 0,    v2∗(2, u) = 2,    v3∗(2, u) = 5,    v4∗(2, u) = 1,

and the policy is for node 2 to go to 4, node 3 to 4, and 4 to 1. At stage 3, node 3 switches to node 2, reducing its path length to 4, and nodes 2 and 4 are unchanged. Further iterations yield no change, and the resulting policy is also the optimal stationary policy. It can be seen without too much difficulty, for the example of Figure 4.11, that these final aggregate costs and shortest paths also result no matter what final cost vector u (with u1 = 0) is used. We shall see later that this always happens so long as all the cycles in the directed graph (other than the self-loop from node 1 to node 1) have positive cost.
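The computation above is easy to automate. The sketch below runs the minimization form of the recursion on a small graph; since the exact arc lengths of Figure 4.11 are not reproduced here, the lengths used are an assumption chosen only to be consistent with the stage costs quoted above (2→1 of length 4, 2→4 of length 1, 3→2 of length 2, 3→4 of length 4, 4→1 of length 1).

```python
import math

# Shortest paths to node 1 via the dynamic programming recursion with min in
# place of max; the arc lengths are assumed, matching the costs in the text.
arcs = {2: {1: 4, 4: 1}, 3: {2: 2, 4: 4}, 4: {1: 1}}

v = {1: 0.0, 2: math.inf, 3: math.inf, 4: math.inf}   # final cost vector u
for stage in range(1, 6):
    v = {1: 0.0, **{i: min(length + v[j] for j, length in arcs[i].items())
                    for i in arcs}}
    print(stage, v)
# Stage 1: v2=4, v3=inf, v4=1; stage 2: v2=2, v3=5, v4=1; stage 3: v3=4; then fixed.
```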

4.6.3 Optimal stationary policies

In Example 4.6.1, we saw that there was a final transient (for stage 1) in which decision 2 was taken, and in all other stages, decision 1 was taken. Thus, the optimal dynamic policy used a stationary policy (using decision 1) except for a final transient. It seems reasonable to expect this same type of behavior for typical but more complex Markov decision problems. We can get a clue about how to demonstrate this by first looking at a situation in which the aggregate expected gain of a stationary policy is equal to that of the optimal dynamic policy. Denote some given stationary policy by the vector k′ = (k′1, . . . , k′M) of decisions in each state. Assume that the Markov chain with transition matrix [P^{k′}] is a unichain, i.e., recurrent with perhaps additional transient states. The expected aggregate reward for this stationary policy is then given by (4.46), using the Markov chain with transition matrix [P^{k′}] and reward vector r^{k′}. Let w′ be the relative gain vector for the stationary policy k′. Recall from (4.44) that if w′ is used as the final reward vector, then the expected aggregate gain simplifies to

    v^{k′}(n, w′) − n g′ e = w′,        (4.57)

where g′ = Σ_i πi^{k′} ri^{k′} is the steady-state gain, π^{k′} is the steady-state probability vector, and the relative gain vector w′ satisfies

    w′ + g′ e = r^{k′} + [P^{k′}]w′;    π^{k′} w′ = 0.        (4.58)

The fact that the right-hand side of (4.57) is independent of the stage, n, leads us to hypothesize that if the stationary policy k′ is the same as the dynamic policy except for a final transient, then that final transient might disappear if we use w′ as a final reward


vector. To pursue this hypothesis, assume a final reward equal to w′. Then, if k′ maximizes r^k + [P^k]w′ over k, we have

    v∗(1, w′) = r^{k′} + [P^{k′}]w′ = max_k {r^k + [P^k]w′}.        (4.59)

Substituting (4.58) into (4.59), we see that the vector decision k′ is optimal at stage 1 if

    w′ + g′ e = r^{k′} + [P^{k′}]w′ = max_k {r^k + [P^k]w′}.        (4.60)

If (4.60) is also satisfied, then the optimal gain is given by

    v∗(1, w′) = w′ + g′ e.        (4.61)

The following theorem now shows that if (4.60) is satisfied, then not only is the decision k′ that maximizes r^k + [P^k]w′ an optimal dynamic policy for stage 1, but it is also optimal at all stages (i.e., the stationary policy k′ is also an optimal dynamic policy).

Theorem 4.10. Assume that (4.60) is satisfied for some w′, g′, and k′. Then, if the final reward vector is equal to w′, the stationary policy k′ is an optimal dynamic policy and the optimal expected aggregate gain satisfies

    v∗(n, w′) = w′ + n g′ e.        (4.62)

Proof: Since k′ maximizes r^k + [P^k]w′, it is an optimal decision at stage 1 for the final vector w′. From (4.60), w′ + g′e = r^{k′} + [P^{k′}]w′, so v∗(1, w′) = w′ + g′e. Thus (4.62) is satisfied for n = 1, and we use induction on n, with n = 1 as a basis, to verify (4.62) in general. Thus, assume that (4.62) is satisfied for n. Then, from (4.55),

    v∗(n+1, w′) = max_k {r^k + [P^k]v∗(n, w′)}        (4.63)
                = max_k {r^k + [P^k]{w′ + n g′ e}}        (4.64)
                = n g′ e + max_k {r^k + [P^k]w′}        (4.65)
                = (n+1) g′ e + w′.        (4.66)

Eqn. (4.64) follows from the inductive hypothesis of (4.62), (4.65) follows because [P^k]e = e for all k, and (4.66) follows from (4.60). This verifies (4.62) for n + 1. Also, since k′ maximizes (4.65), it also maximizes (4.63), showing that k′ is the optimal decision at stage n + 1. This completes the inductive step and thus the proof.

Since our major interest in stationary policies is to help understand the relationship between the optimal dynamic policy and stationary policies, we define an optimal stationary policy as follows:

Definition 4.13. A stationary policy k′ is optimal if there is some final reward vector w′ for which k′ is the optimal dynamic policy.
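As a concrete check of (4.60), consider the decision problem of Figure 4.10 with the stationary policy that uses decision 1 in state 2, so that [P^{k′}] is the chain of Figure 4.6 with w′ = (−25, 25) and g′ = 1/2:

```python
import numpy as np

# Verify (4.60) for the Figure 4.10 problem with the policy k' that uses
# decision 1 in state 2 (so [P^k'] is the Figure 4.6 chain).
w = np.array([-25.0, 25.0])       # relative gain vector for k'
g = 0.5                           # gain per stage for k'
lhs = w + g                       # w' + g'e, componentwise

# state 1 has a single decision; state 2 has decisions 1 and 2
rhs_state1 = 0.0 + np.array([0.99, 0.01]) @ w
rhs_state2 = max(1.0 + np.array([0.01, 0.99]) @ w,     # decision 1: 25.5
                 50.0 + np.array([1.0, 0.0]) @ w)      # decision 2: 25.0
print(lhs, [rhs_state1, rhs_state2])   # both sides equal (-24.5, 25.5)
```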


From Theorem 4.10, we see that if there is a solution to (4.60), then the stationary policy k′ that maximizes r^k + [P^k]w′ is an optimal stationary policy. Eqn. (4.60) is known as Bellman's equation, and we now explore the situations in which it has a solution (since these solutions give rise to optimal stationary policies). Theorem 4.10 made no assumptions beyond Bellman's equation about w′, g′, or the stationary policy k′ that maximizes r^k + [P^k]w′. However, if k′ corresponds to a unichain, then, from Lemma 4.1 and its following discussion, w′ and g′ are uniquely determined (aside from an additive factor of αe in w′) as the relative gain vector and gain per stage for k′. If Bellman's equation has a solution, w′, g′, then, for every decision k, we have

    w′ + g′e ≥ r^k + [P^k]w′    with equality for some k′.        (4.67)

The Markov chains with transition matrices [P^k] might have multiple recurrent classes, so we let π^{k,R} denote the steady-state probability vector for a given recurrent class R of k. Premultiplying both sides of (4.67) by π^{k,R},

    π^{k,R} w′ + g′ π^{k,R} e ≥ π^{k,R} r^k + π^{k,R} [P^k]w′    with equality for some k′.        (4.68)

Recognizing that π^{k,R} e = 1 and π^{k,R} [P^k] = π^{k,R}, this simplifies to

    g′ ≥ π^{k,R} r^k    with equality for some k′.        (4.69)

This says that if Bellman's equation has a solution w′, g′, then the gain per stage g′ in that solution is greater than or equal to the gain per stage in each recurrent class of each stationary policy, and is equal to the gain per stage in each recurrent class of the maximizing stationary policy, k′. Thus, the maximizing stationary policy is either a unichain or consists of several recurrent classes all with the same gain per stage.

We have been discussing the properties that any solution of Bellman's equation must have, but still have no guarantee that any such solution must exist. The following subsection describes a fairly general algorithm (policy iteration) to find a solution of Bellman's equation, and also shows why, in some cases, no solution exists. Before doing this, however, we look briefly at the overall relations between the states in a Markov decision problem.

For any Markov decision problem, consider a directed graph for which the nodes of the graph are the states in the Markov decision problem, and, for each pair of states (i, j), there is a directed arc from i to j if Pij^(ki) > 0 for some decision ki.

Definition 4.14. A state i in a Markov decision problem is reachable from state j if there is a path from j to i in the above directed graph.

Note that if i is reachable from j, then there is a stationary policy in which i is accessible from j (i.e., for each arc (m, l) on the path, a decision km in state m is used for which Pml^(km) > 0).

Definition 4.15. A state i in a Markov decision problem is inherently transient if it is not reachable from some state j that is reachable from i. A state i is inherently recurrent
4.6. MARKOV DECISION THEORY AND DYNAMIC PROGRAMMING

173

if it is not inherently transient. A class I of states is inherently recurrent if each i ∈ I is inherently recurrent, each is reachable from each other, and no state j ∉ I is reachable from any i ∈ I. A Markov decision problem is inherently recurrent if all states form an inherently recurrent class.

An inherently recurrent class of states is a class that, once entered, can never be left, but which has no subclass with that property. An inherently transient state is transient in at least one stationary policy, but might be recurrent in other policies (but all the states in any such recurrent class must be inherently transient). In the following subsection, we analyze inherently recurrent Markov decision problems. Multiple inherently recurrent classes can be analyzed one by one using the same approach, and we later give a short discussion of inherently transient states.
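The reachability relation behind Definitions 4.14 and 4.15 is easy to compute. The following sketch (the function name and input layout are our own) forms the directed graph with an arc i → j whenever Pij^(k) > 0 for some decision k, takes its transitive closure, and flags a state as inherently transient when some state reachable from it cannot reach it back.

```python
def inherently_transient(P_choices):
    """P_choices[i] is a list of transition-probability rows (one per decision)
    available in state i.  Returns a list: True if state i is inherently transient."""
    M = len(P_choices)
    # reach[i][j]: is j reachable from i using arcs that some decision makes possible?
    reach = [[i == j or any(row[j] > 0 for row in P_choices[i]) for j in range(M)]
             for i in range(M)]
    for k in range(M):                      # Warshall's transitive closure
        for i in range(M):
            for j in range(M):
                reach[i][j] = reach[i][j] or (reach[i][k] and reach[k][j])
    # i is inherently transient if some j reachable from i cannot reach i back
    return [any(reach[i][j] and not reach[j][i] for j in range(M)) for i in range(M)]
```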

4.6.4

Policy iteration and the solution of Bellmans equation

The general idea of policy iteration is to start with an arbitrary unichain stationary policy k 0 and to nd its gain per stage g 0 and its relative gain vector w 0 . We then check whether Bellmans equation, (4.60), is satised, and if not, we nd another stationary policy k that is better than k 0 in a sense to be described later. Unfortunately, the better policy that we nd might not be a unichain, so the following lemma shows that any such policy can be converted into an equally good unichain policy. The algorithm then iteratively nds better and better unichain stationary policies, until eventually one of them satises Bellmans equation and is thus optimal. Lemma 4.2. Let k = (k1 , . . . , kM ) be an arbitrary stationary policy in an inherently recurrent Markov decision problem. Let R be a recurrent class of states in k. Then a unichain = (k 1 , . . . , k M ) exists with the recurrent class R and with k j = kj for stationary policy k j R. Proof: Let j be any state in R. By the inherently recurrent assumption, there is a decision vector, say k 0 under which j is accessible from all other states (see Exercise 4.38). Choosing i = ki for i R and k i = k0 for i k / R completes the proof. i Now that we are assured that unichain stationary policies exist and can be found, we can state the policy improvement algorithm for inherently recurrent Markov decision problems. This algorithm is a generalization of Howards policy iteration algorithm, [How60]. Policy Improvement Algorithm 1. Choose an arbitrary unichain policy k 0 2. For policy k 0 , calculate w 0 and g 0 from w 0 + g 0 e = r k + [P k ]w 0 . 3. If w 0 + g 0 e = maxk {r k + [P k ]w 0 }, then stop; k 0 is optimal. P (ki ) 0 0 + g 0 < r (ki ) + 0 4. Otherwise, choose i and ki so that wi j Pij wj . For j 6= i, let kj = kj . i
0 0

5. If the policy k = (k1 , . . . kM ) is not a unichain, then let R be the recurrent class in be the unichain policy of Lemma 4.2. Update policy k that contains state i, and let k k to the value of k .

174

CHAPTER 4. FINITE-STATE MARKOV CHAINS

6. Update k 0 to the value of k and return to step 2.
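A compact sketch of the iteration just described, under the simplifying assumption that every policy encountered is a unichain (so the repair of step 5 and Lemma 4.2 is never invoked); it also updates every improvable state at once rather than a single state i, which is a common variant.

```python
import numpy as np

def gain_and_relative_gain(P, r):
    """Solve pi[P]=pi, pi.e=1 and w + g e = r + [P]w with pi.w = 0 (Lemma 4.1)."""
    M = len(r)
    A = np.vstack([P.T - np.eye(M), np.ones(M)])
    pi = np.linalg.lstsq(A, np.append(np.zeros(M), 1.0), rcond=None)[0]
    g = pi @ r
    B = np.vstack([P - np.eye(M), pi])
    w = np.linalg.lstsq(B, np.append(g * np.ones(M) - r, 0.0), rcond=None)[0]
    return g, w

def policy_iteration(P_choices, r_choices, k0):
    """Policy-improvement sketch assuming every policy that arises is a unichain.
    P_choices[i][k] is a transition row, r_choices[i][k] a reward."""
    k = list(k0)
    while True:
        P = np.array([P_choices[i][k[i]] for i in range(len(k))])
        r = np.array([r_choices[i][k[i]] for i in range(len(k))])
        g, w = gain_and_relative_gain(P, r)
        # Bellman test (step 3): is w + g e = max_k { r^k + [P^k] w } ?
        improved = False
        for i in range(len(k)):
            vals = [r_choices[i][a] + np.dot(P_choices[i][a], w)
                    for a in range(len(r_choices[i]))]
            best = int(np.argmax(vals))
            if vals[best] > w[i] + g + 1e-9:      # step 4: strict improvement
                k[i] = best
                improved = True
        if not improved:
            return k, g, w                        # step 3: k satisfies (4.60)

# e.g., for the Figure 4.10 problem (two states, decisions as above) this
# converges to decision 1 in state 2 with g = 1/2 and w = (-25, 25).
```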


0 + g 0 < max {r i + If the stopping test in step 3 fails, then there is some i for which wi ki i P (ki ) 0 j Pij wj }, so step 4 can always be executed if the algorithm does not stop in step 3. The resulting policy k then satises (k )

w 0 + g0 e

6 =

r k + [P k ]w 0 ,

(4.70)

where 6= means that the inequality is strict for at least one component (namely i) of the vectors. Note that at the end of step 4, [P k ] diers from [P k ] only in the transitions out of state i. Thus the set of states from which i is accessible is the same in k 0 as k . If i is recurrent in the unichain k 0 , then it is accessible from all states in k 0 and thus also accessible from all states in k . It follows that i is also recurrent in k and that k is a unichain (see Exercise 4.2. On the other hand, if i is transient in k 0 , and if R0 is the recurrent class of k 0 , then R0 must also be a recurrent class of k , since the transitions from states in R0 are unchanged. There (k ) are then two possibilities when i is transient in k 0 . First, if the changes in Pij i eliminate all the paths from i to R0 , then a new recurrent class R will be formed with i a member. This is the case in which step 5 is used to change k back to a unichain. Alternatively, if a path still exists to R0 , then i is transient in k and k is a unichain with the same recurrent class R0 as k 0 . These results are summarized in the following lemma: Lemma 4.3. There are only three possibilities for k at the end of step 4 of the policy improvement algorithm for inherently recurrent Markov decision problems. First, k is a unichain and i is recurrent in both k0 and k. Second, k is not a unichain and i is transient in k0 and recurrent in k. Third, k is a unichain with the same recurrent class as k0 and i is transient in both k0 and k. The following lemma now asserts that the new policy on returning to step 2 of the algorithm is an improvement over the previous policy k 0 . Lemma 4.4. Let k0 be the unichain policy of step 2 in an iteration of the policy improvement algorithm for an inherently recurrent Markov decision problem. Let g 0 , w0 , R0 be the gain per stage, relative gain vector, and recurrent class respectively of k0 . Assume the algorithm doesnt stop at step 3 and let k be the unichain policy of step 6. Then either the gain per stage g of k satises g > g 0 or else the recurrent class of k is R0 , the gain per stage satises g = g 0 , and there is a shifted relative gain vector, w, of k satisfying w0
6 =
0

0 = w for each j R0 . and wj j

(4.71)

Proof*: The policy k of step 4 satises (4.70) with strict inequality for the component i in which k 0 and k dier. Let R be any recurrent class of k and let be the steady-state probability vector for R. Premultiplying both sides of (4.70) by , we get w 0 + g 0 r k + [P k ]w 0 . (4.72)

4.6. MARKOV DECISION THEORY AND DYNAMIC PROGRAMMING

175

Recognizing that [P k ] = and cancelling terms, this shows that g 0 r k . Now (4.70) is satised with strict inequality for component i, and thus, if i > 0, (4.72) is satised with strict inequality. Thus, g 0 r k with equality i i = 0. (4.73)

For the rst possibility of Lemma 4.3, k is a unichain and i R. Thus g 0 < r k = g . Similarly, for the second possibility in Lemma 4.3, i R for the new recurrent class that is is a unichain with the recurrent class R, we have formed in k , so again g 0 < r k . Since k g 0 < g again. For the third possibility in Lemma 4.3, i is transient in R0 = R. Thus i = 0, so 0 = , and g 0 = g . Thus, to complete the proof, we must demonstrate the validity of (4.71) for this case. We rst show that, for each n 1, v k (n, w 0 ) ng 0 e v k (n+1, w 0 ) (n+1)g 0 e . For n = 1, v k (1, w 0 ) = r k + [P k ]w 0 . Using this, (4.70) can be rewritten as w0 Using (4.75) and then (4.76), v k (1, w 0 ) g 0 e = r k + [P k ]w 0 g 0 e
k k k 0 6 =

(4.74)

(4.75)

v k (1, w 0 ) g 0 e .

(4.76)

r k + [P k ]{v k (1, w 0 ) g 0 e } g 0 e = r + [P ]v (1, w ) 2g e = v k (2, w 0 ) 2g 0 e .


0

(4.77)

We now use induction on n, using n = 1 as the basis, to demonstrate (4.74) in general. For any n > 1, assume (4.74) for n 1 as the inductive hypothesis. v k (n, w 0 ) ng 0 e = r k + [P k ]v k (n 1, w 0 ) ng 0 e = r k + [P k ]{v k (n 1, w 0 ) (n 1)g 0 e } g 0 e r k + [P k ]{v k (n, w 0 ) ng 0 e } g 0 e = v k (n+1, w 0 ) (n+1)g 0 e .

This completes the induction, verifying (4.74) and showing that v k (n, w 0 ) ng 0 e is nondecreasing in n. Since k is a unichain, Lemma 4.1 asserts that k has a shifted relative gain vector w , i.e., a solution to (4.42). From (4.46), v k (n, w 0 ) = w + ng 0 e + [P k ]n {w 0 w }. (4.78)

Since [P k ]n is a stochastic matrix, its elements are each between 0 and 1, so the sequence of vectors v k ng 0 e must be bounded independent of n. Since this sequence is also non, decreasing, it must have a limit, say w
n1

. lim v k (n, w 0 ) ng 0 e = w

(4.79)

176

CHAPTER 4. FINITE-STATE MARKOV CHAINS

satises (4.42) for k . We next show that w w = =


n1 n1 k

lim {v k (n+1, w 0 ) (n + 1)g 0 e } lim {r k + [P k ]v k (n, w 0 ) (n + 1)g 0 e }


n1

(4.80) (4.81)

. = r g 0 e + [P k ] lim {v k (n, w 0 ) ng 0 e } = r k g 0 e + [P k ]w

is a shifted relative gain vector for k . Finally we must show that w satises the Thus w conditions on w in (4.71). Using (4.76) and iterating with (4.74), w0
6 =

v k (n, w 0 ) ng 0 e w

for all n 1.

(4.82)

Premultiplying each term in (4.82) by the steady-state probability vector for k , . w 0 v k (n, w 0 ) ng 0 w (4.83)

Now, k is the same as k 0 over the recurrent class, and = 0 since is non-zero only over the recurrent class. This means that the rst inequality above is actually an equality. Also, 0 w . Since i 0 and wi going to the limit, we see that w 0 = w i , this implies that 0 wi = w i for all recurrent i, completing the proof. We now see that each iteration of the algorithm either increases the gain per stage or holds the gain constant and increases the shifted relative gain vector w . Thus the sequence of policies found by the algorithm can never repeat. Since there are a nite number of stationary policies, the algorithm must eventually terminate at step 3. Thus we have proved the following important theorem. Theorem 4.11. For any inherently recurrent Markov decision problem, there is a solution to Bellmans equation and a maximizing stationary policy that is a unichain. There are also many interesting Markov decision problems, such as shortest path problems, that contain not only an inherently recurrent class but also some inherently transient states. The following theorem then applies. Theorem 4.12. Consider a Markov decision problem with a single inherently recurrent class of states and one or more inherently transient states. Let g be the maximum gain per stage over all recurrent classes of all stationary policies and assume that each recurrent class with gain per stage equal to g is contained in the inherently recurrent class. Then there is a solution to Bellmans equation and a maximizing stationary policy that is a unichain. Proof*: Let k be a stationary policy which has a recurrent class, R, with gain per stage g . Let j be any state in R. Since j is inherently recurrent, there is a decision vector k 0 0 under which j is accessible from all other states. Choose k such that ki = ki for all i R 0 =k i for all i and ki / R. Then k 0 is a unichain policy with gain per stage g . Suppose the policy improvement algorithm is started with this unichain policy. If the algorithm stops at step 3, then k 0 satises Bellmans equation and we are done. Otherwise, from Lemma 4.4, the unichain policy in step 6 of the algorithm either has a larger gain per stage (which is impossible) or has the same recurrent class R and has a relative gain vector w satisfying

4.6. MARKOV DECISION THEORY AND DYNAMIC PROGRAMMING

177

(4.74). Iterating the algorithm, we nd successively larger relative gain vectors. Since the policies cannot repeat, the algorithm must eventually stop with a solution to Bellmans equation. The above theorems give us a good idea of the situations under which optimal stationary policies and solutions to Bellmans equation exist. However, we call a stationary policy optimal if it is the optimal dynamic policy for one special nal reward vector. In the next subsection, we will show that if an optimal stationary policy is unique and is an ergodic unichain, then that policy is optimal except for a nal transient no matter what the nal reward vector is.

4.6.5

Stationary policies with arbitrary nal rewards

We start out this subsection with the main theorem, then build up some notation and preliminary ideas for the proof, then prove a couple of lemmas, and nally prove the theorem. Theorem 4.13. Assume that k0 is a unique optimal stationary policy and is an ergodic unichain with the ergodic class R = {1, 2, . . . , m}. Let w0 and g 0 be the relative gain vector and gain per stage for k0 . Then, for any nal gain vector u, the following limit exists and is independent of i lim v (n, u) n1 i where ( 0 u)(u) satises ( 0 u)(u) = lim 0 [v (n, u) ng 0 e w0 ]
n1 0 ng 0 wi = ( 0 u)(u),

(4.84)

(4.85)

and 0 is the steady-state vector for k0 Discussion: The theorem says that, asymptotically, the relative advantage of starting in one state rather than another is independent of the nal gain vector, i.e., that for any states i, j , limn1 [u i (n, u ) uj (n, u )] is independent of u . For the shortest path problem, for example, this says that v (n, u ) converges to the shortest path vector for any choice of u for which ui = 0. This means that if the arc lengths change, we can start the algorithm at the shortest paths for the previous arc lengths, and the algorithm is guaranteed to converge to the correct new shortest paths. To see why the theorem can be false without the ergodic assumption, consider Example 4.5.4 where, even without any choice of decisions, (4.84) is false. Exercise 4.34 shows why the theorem can be false without the uniqueness assumption. It can also be shown (see Exercise 4.35) that for any Markov decision problem satisfying the hypotheses of Theorem 4.13, there is some n0 such that the optimal dynamic policy uses the optimal stationary policy for all stages n n0 . Thus, the dynamic part of the optimal dynamic policy is strictly a transient. The proof of the theorem is quite lengthy. Under the restricted conditions that k 0 is an ergodic Markov chain, the proof is simpler and involves only Lemma 4.5.

178

CHAPTER 4. FINITE-STATE MARKOV CHAINS

We now develop some notation required for the proof of the theorem. Given a nal reward (k) P (k) vector u , dene ki (n) for each i and n as the k that maximizes ri + j Pij vj (n, u ). Then
vi (n + 1, u ) = ri (ki (n))

X
j

Pij i P

(k (n)) vj (n, u )

ri

0) (ki

X
j

Pij i vj (n, u ).

(k0 )

(4.86)

0 maximizes r Similarly, since ki i vi (n + 1, w 0 ) = ri


0) (ki

(k)

(n, w 0 ), Pij vj (ki (n))

(k)

X
j

Pij i vj (n, w 0 ) ri

(k0 )

X
j

Pij i

(k (n)) vj (n, w 0 ).

(4.87)

Subtracting (4.87) from (4.86), we get the following two inequalities, X (k0 ) vi (n + 1, u ) vi (n + 1, w 0 ) Pij i [vj (n, u ) vj (n, w 0 )].
j vi (n + 1, u ) vi (n + 1, w 0 )

(4.88)

X
j

Pij i

(k (n))

[vj (n, u ) vj (n, w 0 )].

(4.89)

Dene
i (n) = vi (n, u ) vi (n, w 0 ).

Then (4.88) and (4.89) become i (n + 1) X


j

Pij i j (n).

(k0 )

(4.90)

i (n + 1)
(n, w 0 ) = ng 0 + w 0 for all i, n, Since vi i

X
j

Pij i

(k (n))

j (n).

(4.91)

0 i (n) = vi (n, u ) ng 0 wi .

Thus the theorem can be restated as asserting that limn1 i (n) = (u ) for each state i. Dene max (n) = max i (n);
i

min (n) = min i (n).


i

Then, from (4.90), i (n + 1)

Pij i min (n) = min (n). Since this is true for all i, (4.92)

(k0 )

min (n + 1) min (n). max (n + 1) max (n).

In the same way, from (4.91), (4.93)

The following lemma shows that (4.84) is valid for each of the recurrent states.

4.6. MARKOV DECISION THEORY AND DYNAMIC PROGRAMMING

179

Lemma 4.5. Under the hypotheses of Theorem 4.12, the limiting expression for (u) in (4.85) exists and
n1

lim i (n) = (u) for 1 i m.

(4.94)

0 and summing over i, Proof* of lemma 4.5: Multiplying each side of (4.90) by i

0 (n + 1) 0 [P k ] (n) = 0 (n). Thus 0 (n) is non-decreasing in n. Also, from (4.93), 0 (n) max (n) max (1). Since 0 (n) is non-decreasing and bounded, it has a limit (u ) as dened by (4.85) and 0 (n) (u ) Next, iterating (4.90) m times, we get (n + m) [P k ]m (n). Since the recurrent class of k 0 is ergodic, (4.28) shows that limm1 [P k ]m = e 0 . Thus, [P k ]m = e 0 + [(m)]. where [(m)] is a sequence of matrices for which limm1 [(m)] = 0. (n + m) e 0 (n) + [(m)] (n). For any > 0, (4.95) shows that for all suciently large n, 0 (n) (u ) /2. Also, since min (1) i (n) max (1) for all i and n, and since [(m)] 0, we see that [(m)] (n) (/2)e for all large enough m. Thus, for all large enough n and m, i (n + m) (u ) . Thus, for any > 0, there is an n0 such that for all n n0 , i (n) (u ) . Also, from (4.95), we have 0 [ (n) (u )e ] 0, so 0 [ (n) (u )e + e ] . (4.97) (4.96)
0 0 0

n1

lim 0 (n) = (u ).

(4.95)

0 [ (n) (u ) + ], on the left side of (4.97) is non-negative, so From (4.96), each term, i i 0 > 0, it follows that each must also be smaller than . For i 0 i (n) (u ) + /i for all i and all n n0 .

(4.98)

0 > 0 show that, lim Since > 0 is arbitrary, (4.96) and (4.98) together with i n1 i (n) = (u ), completing the proof of Lemma 4.5.

Since k 0 is a unique optimal stationary policy, we have X (k0 ) X (k ) (k0 ) (k ) 0 0 ri i + Pij i wj > ri i + Pij i wj
j j

180

CHAPTER 4. FINITE-STATE MARKOV CHAINS

0 . Snce this is a nite set of strict inequalities, there is an > 0 such for all i and all ki 6= ki 0, that for all i > m, ki 6= ki X (k0 ) X (k ) (k0 ) (k ) 0 0 ri i + Pij i wj ri i + Pij i wj + . (4.99) j j (n, w 0 ) = ng 0 + w 0 , Since vi i vi (n + 1, w 0 ) = ri
0) (ki

ri

(ki (n))

X
j

Pij i vj (n, w 0 )

(k0 )

(4.100) + . (4.101)

X
j

Pij i

(k (n)) vj (n, w 0 )

0 . Subtracting (4.101) from (4.86), for each i and ki (n) 6= ki X (k0 ) 0 i (n + 1) Pij i j (n) for ki (n) 6= ki . j

(4.102)

Since i (n) max (n), (4.102) can be further bounded by i (n + 1) max (n) for 0) P (ki 0 . Combining this with (n + 1) = 0 ki (n) 6= ki i j Pij j (n) for ki (n) = ki , h i X (k0 ) i (n + 1) max max , Pij i j (n) . (4.103)
j

Next, since k 0 is a unichain, we can renumber the transient states, m < i M so that 0) P (ki > 0 for each i, m < i M. Since this is a nite set of strict inequalities, there j<i Pij is some > 0 such that X (k0 ) Pij i for m < i M. (4.104)
j<i

The quantity i (n) for each transient state i is somewhat dicult to work with directly, so i (n), which will be shown in the following lemma to upper we dene the new quantity, i (n) is given iteratively for n 1, m < i M as bound i (n). The denition for h i i (n + 1) = max M (n) , i1 (n) + (1 ) M (n) . (4.105) The boundary conditions for this are dened to be i (1) = max (1); m < i M m (n) = sup max i (n0 ).
n0 n im

(4.106) (4.107)

Lemma 4.6. Under the hypotheses of Theorem 4.13, with dened by (4.99) and dened by (4.104), the following three inequalities hold, i (n) i (n 1); i (n) i+1 (n); i (n); j (n) for n 2, m i M for n 1, m i < M for n 1, j i, m i M. (4.108) (4.109) (4.110)

4.6. MARKOV DECISION THEORY AND DYNAMIC PROGRAMMING

181

Proof* of (4.108): Since the supremum in (4.107) is over a set decreasing in n, m (n) m (n 1); for n 1. (4.111)

i (1) = max (1) This establishes (4.108) for i = m. To establish (4.108) for n = 2, note that for i > m and m (1) = sup max i (n0 ) sup max (n0 ) max (1).
n0 1 im n0 1

(4.112)

Thus

h i (2) = max M (1) , i (1) max (1) =

i i1 (1) + (1 ) M (1) for i > m.

Finally, we use induction for n 2, i > m, using n = 2 as the basis. Assuming (4.108) for a given n 2, i (n+1) = max[ M (n), i1 (n) + (1 ) M (n)] M (n1), i1 (n1) + (1 ) M (n1)] = i (n). max[

i (1) = max (1) for i > m, (4.109) is Proof* of (4.109): Using (4.112) and the fact that valid for n = 1. Using induction on n with n = 1 as the basis, we assume (4.109) for a given n 1. Then for m i M, i (n + 1) i (n) i (n) + (1 ) M (n) M (n) , i (n) + (1 ) M (n)] = i+1 (n + 1). max[

m (n) for all j m and n 1 by the denition Proof* of (4.110): Note that j (n) in (4.107). From (4.109), j (n) i (n) for j m i. Also, for all i > m and j i, i (1). Thus (4.110) holds for n = 1. We complete the proof by using j (1) max (1) = induction on n for m < j i, using n = 1 as the basis. Assume (4.110) for a given M (n) for all j , and it then follows that max (n) M (n). Similarly, n 1. Then, j (n) i1 (n) for j i 1. For i > m, we then have j (n) h i X k0 i (n+1) max max (n), Piji j (n) h M (n), max h M (n), max X
j<i j

i1 (n) + Piji

k0

where the nal inequality follows from the denition of . Finally, using (4.109) again, we j (n + 1) i (n + 1) for m < j i, completing the proof of Lemma 4.6. have j (n + 1)

i1 (n) + (1 ) M (n) = i (n+1),

X
j i

i k0 Piji M (n)

i (n) is non-increasing in n for i m. Also, Proof* of Theorem 4.13: From (4.110), i (n) exists for each i m. from (4.109) and (4.97), i (n) m (n) (u ). Thus, limn1 We then have h i M (n) = max lim M (n), lim M1 (n) + (1 ) lim M (n) . lim
n1 n1 n1 n1

182

CHAPTER 4. FINITE-STATE MARKOV CHAINS

Since > 0, the second term in the maximum above must achieve the maximum in the limit. Thus,
n1

M (n) = lim M1 (n). lim


n1

(4.113)

In the same way,


n1

M1 (n) = max lim

Again, the second term must achieve the maximum, and using (4.113),
n1

n1

M (n), lim

i M2 (n) + (1 ) lim M1 (n) . lim


n1 n1

M1 (n) = lim M2 (n). lim


n1

Repeating this argument,


n1

i (n) = lim i1 (n) for each i, m < i M. lim


n1

(4.114)

m (n) = Now, from (4.94), limn1 i = (u ) for i m. From (4.107), then, we see that limn1 (u ). Combining this with (4.114),
n1

i (n) = (u ) for each i such that m i M. lim

(4.115)

Combining this with (4.110), we see that for any > 0, and any i, i (n) (u ) + for large enough n. Combining this with (4.96) completes the proof.

4.7

Summary

This chapter has developed the basic results about nite-state Markov chains from a primarily algebraic standpoint. It was shown that the states of any nite-state chain can be partitioned into classes, where each class is either transient or recurrent, and each class is periodic or aperiodic. If the entire chain is one recurrent class, then the Frobenius theorem, with all its corollaries, shows that = 1 is an eigenvalue of largest magnitude and has positive right and left eigenvectors, unique within a scale factor. The left eigenvector (scaled to be a probability vector) is the steady-state probability vector. If the chain is also aperiodic, then the eigenvalue = 1 is the only eigenvalue of magnitude 1, and all rows of [P ]n converge geometrically in n to the steady-state vector. This same analysis can be applied to each aperiodic recurrent class of a general Markov chain, given that the chain ever enters that class. For a periodic recurrent chain of period d, there are d 1 other eigenvalues of magnitude 1, with all d eigenvalues uniformly placed around the unit circle in the complex plane. Exercise 4.17 shows how to interpret these eigenvectors, and shows that [P ]nd converges geometrically as n 1. For an arbitrary nite-state Markov chain, if the initial state is transient, then the Markov chain will eventually enter a recurrent state, and the probability that this takes more than

4.8. EXERCISES

183

n steps approaches zero geometrically in n; Exercise 4.14 shows how to nd the probability that each recurrent class is entered. Given an entry into a particular recurrent class, then the results above can be used to analyze the behavior within that class. The results about Markov chains were extended to Markov chains with rewards. As with renewal processes, the use of reward functions provides a systematic way to approach a large class of problems ranging from rst passage times to dynamic programming. The key result here is Theorem 4.9, which provides both an exact expression and an asymptotic expression for the expected aggregate reward over n stages. Finally, the results on Markov chains with rewards were used to understand Markov decision theory. We developed the Bellman dynamic programming algorithm, and also investigated the optimal stationary policy. Theorem 4.13 demonstrated the relationship between the optimal dynamic policy and the optimal stationary policy. This section provided only an introduction to dynamic programming and omitted all discussion of discounting (in which future gain is considered worth less than present gain because of interest rates). We also omitted innite state spaces. For an introduction to vectors, matrices, and linear algebra, see any introductory text on linear algebra such as Strang [20]. Gantmacher [11] has a particularly complete treatment of non-negative matrices and Perron-Frobenius theory. For further reading on Markov decision theory and dynamic programming, see Bertsekas, [3]. Bellman [1] is of historic interest and quite readable.

4.8 Exercises

Exercise 4.1. a) Prove that, for a finite-state Markov chain, if P_{ii} > 0 for some i in a recurrent class A, then class A is aperiodic.
b) Show that every finite-state Markov chain contains at least one recurrent set of states. Hint: Construct a directed graph in which the states are nodes and an edge goes from i to j if i -> j but i is not accessible from j. Show that this graph contains no cycles, and thus contains one or more nodes with no outgoing edges. Show that each such node is in a recurrent class. Note: this result is not true for Markov chains with countably infinite state spaces.

Exercise 4.2. Consider a finite-state Markov chain in which some given state, say state 1, is accessible from every other state. Show that the chain has at most one recurrent class of states. (Note that, combined with Exercise 4.1, there is exactly one recurrent class and the chain is then a unichain.)

Exercise 4.3. Show how to generalize the graph in Figure 4.4 to an arbitrary number of states M ≥ 3 with one cycle of M nodes and one of M - 1 nodes. For M = 4, let node 1 be the node not in the cycle of M - 1 nodes. List the set of states accessible from node 1 in n steps for each n ≤ 12 and show that the bound in Theorem 4.5 is met with equality. Explain why the same result holds for all larger M.
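For readers who want to experiment, here is a small computational sketch of the class structure these exercises deal with; it classifies states as recurrent or transient from the accessibility relation (the function and the example matrix are my own, not part of the exercises):

```python
import numpy as np

def recurrent_states(P, tol=1e-12):
    """Return the recurrent states of a finite chain with transition matrix P.

    A state i is recurrent iff every state accessible from i can access i in return.
    Accessibility is computed as the transitive closure of the one-step graph.
    """
    M = P.shape[0]
    reach = (P > tol) | np.eye(M, dtype=bool)          # paths of length 0 or 1
    for _ in range(int(np.ceil(np.log2(max(M, 2)))) + 1):
        reach = reach | ((reach.astype(int) @ reach.astype(int)) > 0)
    return [i for i in range(M)
            if all(reach[j, i] for j in range(M) if reach[i, j])]

# Example: states 0 and 1 form a recurrent class; state 2 is transient.
P = np.array([[0.5, 0.5, 0.0],
              [1.0, 0.0, 0.0],
              [0.3, 0.3, 0.4]])
print(recurrent_states(P))   # -> [0, 1]
```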


Exercise 4.4. Consider a Markov chain with one ergodic class of m states, say {1, 2, . . . , m}, and M - m other states that are all transient. Show that P^n_{ij} > 0 for all j ≤ m and all n ≥ (m - 1)^2 + 1 + M - m.

Exercise 4.5. a) Let ℓ be the number of states in the smallest cycle of an arbitrary ergodic Markov chain of M ≥ 3 states. Show that P^n_{ij} > 0 for all n ≥ (M - 2)ℓ + M. Hint: Look at the last part of the proof of Theorem 4.4.
b) For ℓ = 1, draw the graph of an ergodic Markov chain (generalized for arbitrary M ≥ 3) for which there is an i, j for which P^n_{ij} = 0 for n = 2M - 3. Hint: Look at Figure 4.4.
c) For arbitrary ℓ < M - 1, draw the graph of an ergodic Markov chain (generalized for arbitrary M) for which there is an i, j for which P^n_{ij} = 0 for n = (M - 2)ℓ + M - 1.

Exercise 4.6. A transition probability matrix P is said to be doubly stochastic if

    Σ_j P_{ij} = 1 for all i;        Σ_i P_{ij} = 1 for all j.

That is, the row sum and the column sum each equal 1. If a doubly stochastic chain has M states and is ergodic (i.e., has a single class of states and is aperiodic), calculate its steady-state probabilities.

Exercise 4.7. a) Find the steady-state probabilities π_0, . . . , π_{k-1} for the Markov chain below. Express your answer in terms of the ratio ρ = p/q. Pay particular attention to the special case ρ = 1.
b) Sketch π_0, . . . , π_{k-1}. Give one sketch for ρ = 1/2, one for ρ = 1, and one for ρ = 2.
c) Find the limit of π_0 as k approaches infinity; give separate answers for ρ < 1, ρ = 1, and ρ > 1. Find limiting values of π_{k-1} for the same cases.

[Figure: a birth-death Markov chain on the states 0, 1, . . . , k - 2, k - 1, with transition probabilities p and 1 - p between neighboring states.]
Exercise 4.8. a) Find the steady-state probabilities for each of the Markov chains in Figure 4.2 of Section 4.1. Assume that all clockwise probabilities in the first graph are the same, say p, and assume that P_{4,5} = P_{4,1} in the second graph.
b) Find the matrices [P]^2 for the same chains. Draw the graphs for the Markov chains represented by [P]^2, i.e., the graphs of two-step transitions for the original chains. Find the steady-state probabilities for these two-step chains. Explain why your steady-state probabilities are not unique.
c) Find lim_{n->inf} [P]^{2n} for each of the chains.


Exercise 4.9. Answer each of the following questions for each of the following non-negative matrices [A]:

    i)  [ 1  0 ]        ii)  [  1    0    0  ]
        [ 1  1 ]             [ 1/2  1/2   0  ]
                             [  0   1/2  1/2 ]

a) Find [A]^n in closed form for arbitrary n > 1.
b) Find all eigenvalues and all right eigenvectors of [A].
c) Use (b) to show that there is no diagonal matrix [Λ] and no invertible matrix [Q] for which [A][Q] = [Q][Λ].
d) Rederive the result of part (c) using the result of (a) rather than (b).

Exercise 4.10. a) Show that g(x), as given in (4.21), is a continuous function of x for x ≥ 0, x ≠ 0.
b) Show that g(x) = g(αx) for all α > 0. Show that this implies that the supremum of g(x) over x ≥ 0, x ≠ 0 is the same as the supremum over x ≥ 0, Σ_i x_i = 1. Note that this shows that the supremum must be achieved, since it is a supremum of a continuous function over a closed and bounded space.

Exercise 4.11. a) Show that if x_1 and x_2 are real or complex numbers, then |x_1 + x_2| = |x_1| + |x_2| implies that, for some θ, e^{-iθ}x_1 and e^{-iθ}x_2 are both real and non-negative.
b) Show from this that if the inequality in (4.25) is satisfied with equality, then there is some θ for which x_i = |x_i| e^{iθ} for all i.

Exercise 4.12. a) Let λ be an eigenvalue of a matrix [A], and let ν and π be right and left eigenvectors respectively of λ, normalized so that π ν = 1. Show that

    [[A] - λ ν π]^2 = [A]^2 - λ^2 ν π.

b) Show that [[A]^n - λ^n ν π][[A] - λ ν π] = [A]^{n+1} - λ^{n+1} ν π.
c) Use induction to show that [[A] - λ ν π]^n = [A]^n - λ^n ν π.

Exercise 4.13. Let [P] be the transition matrix for a Markov unichain with M recurrent states, numbered 1 to M, and K transient states, M + 1 to M + K. Thus [P] can be partitioned as

    [P] = [ [P_r]     [0]
            [P_{tr}]  [P_{tt}] ].

a) Show that [P]^n can be partitioned as

    [P]^n = [ [P_r]^n     [0]
              [P^n_{tr}]  [P_{tt}]^n ].

That is, the blocks on the diagonal are simply products of the corresponding blocks of [P], and the lower left block is whatever it turns out to be.


b) Let Q_i be the probability that the chain will be in a recurrent state after K transitions, starting from state i, i.e., Q_i = Σ_{j≤M} P^K_{ij}. Show that Q_i > 0 for all transient i.
c) Let Q be the minimum Q_i over all transient i and show that P^{nK}_{ij} ≤ (1 - Q)^n for all transient i, j (i.e., show that [P_{tt}]^n approaches the all-zero matrix [0] with increasing n).
d) Let π = (π_r, π_t) be a left eigenvector of [P] of eigenvalue 1 (if one exists). Show that π_t = 0 and show that π_r must be positive and be a left eigenvector of [P_r]. Thus show that π exists and is unique (within a scale factor).
e) Show that e is the unique right eigenvector of [P] of eigenvalue 1 (within a scale factor).

Exercise 4.14. Generalize Exercise 4.13 to the case of a Markov chain [P] with r recurrent classes and one or more transient classes. In particular,
a) Show that [P] has exactly r linearly independent left eigenvectors, π(1), π(2), . . . , π(r), of eigenvalue 1, and that the ith can be taken as a probability vector that is positive on the ith recurrent class and zero elsewhere.
b) Show that [P] has exactly r linearly independent right eigenvectors, ν(1), ν(2), . . . , ν(r), of eigenvalue 1, and that the ith can be taken as a vector with ν(i)_j equal to the probability that recurrent class i will ever be entered starting from state j.

Exercise 4.15. Prove Theorem 4.8. Hint: Use Theorem 4.7 and the results of Exercise 4.13.

Exercise 4.16. Generalize Exercise 4.15 to the case of a Markov chain [P] with r aperiodic recurrent classes and one or more transient classes. In particular, using the left and right eigenvectors π(1), π(2), . . . , π(r) and ν(1), . . . , ν(r) of Exercise 4.14, show that

    lim_{n->inf} [P]^n = Σ_i ν(i) π(i).
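The limit asserted in Exercise 4.16 can be checked numerically. The following sketch uses an example of my own (two recurrent classes, one of them a single absorbing state, plus one transient state) and compares lim [P]^n with Σ_i ν(i) π(i):

```python
import numpy as np

# States 0,1: one recurrent class; state 2: a second (absorbing) class; state 3: transient.
P = np.array([[0.7, 0.3, 0.0, 0.0],
              [0.4, 0.6, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.0],
              [0.3, 0.3, 0.4, 0.0]])

Pinf = np.linalg.matrix_power(P, 200)        # effectively the limit for this chain

# pi(1), pi(2): steady-state probability vectors of the two classes, padded with zeros.
evals, evecs = np.linalg.eig(P[:2, :2].T)
v = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
pi1 = np.concatenate([v / v.sum(), [0.0, 0.0]])
pi2 = np.array([0.0, 0.0, 1.0, 0.0])

# nu(1), nu(2): probabilities of ever entering each class, from each state.
nu1 = np.array([1.0, 1.0, 0.0, 0.6])         # state 3 enters {0,1} w.p. 0.3 + 0.3
nu2 = np.array([0.0, 0.0, 1.0, 0.4])

print(np.max(np.abs(Pinf - (np.outer(nu1, pi1) + np.outer(nu2, pi2)))))   # ~ 0
```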

Exercise 4.17. Suppose a Markov chain with an irreducible matrix [P] is periodic with period d and let T_i, 1 ≤ i ≤ d, be the ith subset in the sense of Theorem 4.3. Assume the states are numbered so that the first M_1 states are in T_1, the next M_2 are in T_2, and so forth. Thus [P] has the block form given by

    [P] = [   0    [P_1]    0    ...     0
              0      0    [P_2]  ...     0
              .      .      .     .      .
              0      0      0    ...  [P_{d-1}]
            [P_d]    0      0    ...     0     ],

where [P_i] has dimension M_i by M_{i+1} for i < d and M_d by M_1 for i = d.

a) Show that [P]^d has the form

    [P]^d = [ [Q_1]    0    ...    0
                0    [Q_2]  ...    0
                .      .     .     .
                0      0    ...  [Q_d] ],

where [Q_i] = [P_i][P_{i+1}] ... [P_d][P_1] ... [P_{i-1}].

b) Show that [Q_i] is the matrix of an ergodic Markov chain, so that, with the eigenvectors π(i), ν(i) defined in Exercises 4.14 and 4.16, lim_{n->inf} [P]^{nd} = Σ_i ν(i) π(i).

c) Show that π(i), the left eigenvector of [Q_i] of eigenvalue 1, satisfies π(i)[P_i] = π(i+1) for i < d and π(d)[P_d] = π(1).

d) Let θ = 2π/d and, for 1 ≤ k < d, let π(k) = (π(1), π(2) e^{ikθ}, π(3) e^{2ikθ}, . . . , π(d) e^{(d-1)ikθ}). Show that π(k) is a left eigenvector of [P] of eigenvalue e^{-ikθ}.

Exercise 4.18. (Continuation of Exercise 4.17.) a) Show that, with the eigenvectors defined in Exercises 4.14 and 4.16,

    lim_{n->inf} [P]^{nd}[P] = Σ_{i=1}^{d} ν(i) π(i+1),

where π(d+1) is taken to be π(1).

b) Show that, for 1 ≤ j < d,

    lim_{n->inf} [P]^{nd}[P]^j = Σ_{i=1}^{d} ν(i) π(i+j),

where π(d+m) is taken to be π(m) for 1 ≤ m < d.

c) Show that

    lim_{n->inf} [P]^{nd} { I + [P] + ... + [P]^{d-1} } = ( Σ_{i=1}^{d} ν(i) ) ( Σ_{i=1}^{d} π(i) ).

d) Show that

    lim_{n->inf} (1/d) ( [P]^n + [P]^{n+1} + ... + [P]^{n+d-1} ) = e π,

where π is the steady-state probability vector for [P]. Hint: Show that e = Σ_i ν(i) and π = (1/d) Σ_i π(i).

e) Show that the above result is also valid for periodic unichains.
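A short numerical illustration of the periodic behavior in Exercises 4.17 and 4.18, using a period-2 chain of my own choosing: [P]^n itself oscillates between two limits, while [P]^{nd} converges and the average of d consecutive powers converges to e π.

```python
import numpy as np

# A period-2 irreducible chain: T1 = {0,1}, T2 = {2,3}.
P = np.array([[0.0, 0.0, 0.3, 0.7],
              [0.0, 0.0, 0.6, 0.4],
              [0.5, 0.5, 0.0, 0.0],
              [0.2, 0.8, 0.0, 0.0]])
d = 2

P_even = np.linalg.matrix_power(P, 200)   # the limit of [P]^{nd}
P_odd = np.linalg.matrix_power(P, 201)    # the limit of [P]^{nd}[P]
print(np.max(np.abs(P_even - P_odd)))     # stays away from 0: [P]^n does not converge

# The average of d consecutive powers converges to e pi.
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
pi = pi / pi.sum()
print(np.max(np.abs((P_even + P_odd) / d - pi)))   # essentially zero
```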


Exercise 4.19. Assume a friend has developed an excellent program for finding the steady-state probabilities for finite-state Markov chains. More precisely, given the transition matrix [P], the program returns lim_{n->inf} P^n_{ii} for each i. Assume all chains are aperiodic.
a) You want to find the expected time to first reach a given state k starting from a different state m for a Markov chain with transition matrix [P]. You modify the matrix to [P'], where P'_{km} = 1, P'_{kj} = 0 for j ≠ m, and P'_{ij} = P_{ij} otherwise. How do you find the desired first-passage time from the program output given [P'] as an input? (Hint: The times at which a Markov chain enters any given state can be considered as renewals in a (perhaps delayed) renewal process.)
b) Using the same [P'] as the program input, how can you find the expected number of returns to state m before the first passage to state k?
c) Suppose, for the same Markov chain [P] and the same starting state m, you want to find the probability of reaching some given state n before the first passage to k. Modify [P] to some [P''] so that the above program with [P''] as an input allows you to easily find the desired probability.
d) Let Pr{X(0) = i} = Q_i, 1 ≤ i ≤ M, be an arbitrary set of initial probabilities for the same Markov chain [P] as above. Show how to modify [P] to some [P'''] for which the steady-state probabilities allow you to easily find the expected time of the first passage to state k.

Exercise 4.20. Suppose A and B are each ergodic Markov chains with transition probabilities {P_{Ai,Aj}} and {P_{Bi,Bj}} respectively. Denote the steady-state probabilities of A and B by {π_{Ai}} and {π_{Bi}} respectively. The chains are now connected and modified as shown below. In particular, states A1 and B1 are now connected and the new transition probabilities P' for the combined chain are given by

    P'_{A1,B1} = ε,        P'_{A1,Aj} = (1 - ε) P_{A1,Aj}   for all Aj;
    P'_{B1,A1} = δ,        P'_{B1,Bj} = (1 - δ) P_{B1,Bj}   for all Bj.

All other transition probabilities remain the same. Think intuitively of ε and δ as being small, but do not make any approximations in what follows. Give your answers to the following questions as functions of ε, δ, {π_{Ai}} and {π_{Bi}}.

[Figure: chain A, containing state A1, and chain B, containing state B1, joined by the transitions A1 -> B1 (probability ε) and B1 -> A1 (probability δ).]

a) Assume that ε > 0, δ = 0 (i.e., that A is a set of transient states in the combined chain). Starting in state A1, find the conditional expected time to return to A1 given that the first transition is to some state in chain A.
b) Assume that ε > 0, δ = 0. Find T_{A,B}, the expected time to first reach state B1 starting from state A1. Your answer should be a function of ε and the original steady-state probabilities {π_{Ai}} in chain A.
c) Assume ε > 0, δ > 0. Find T_{B,A}, the expected time to first reach state A1, starting in state B1. Your answer should depend only on δ and {π_{Bi}}.
d) Assume ε > 0 and δ > 0. Find P'(A), the steady-state probability that the combined chain is in one of the states {Aj} of the original chain A.
e) Assume ε > 0, δ = 0. For each state Aj ≠ A1 in A, find v_{Aj}, the expected number of visits to state Aj, starting in state A1, before reaching state B1. Your answer should depend only on ε and {π_{Ai}}.
f) Assume ε > 0, δ > 0. For each state Aj in A, find π'_{Aj}, the steady-state probability of being in state Aj in the combined chain. Hint: Be careful in your treatment of state A1.
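The renewal argument behind Exercise 4.19(a) can be checked numerically. The following sketch (an arbitrary four-state chain, NumPy) computes the expected first-passage time from m to k both by solving the usual linear equations and from the steady-state vector of the modified matrix [P']:

```python
import numpy as np

def steady_state(P):
    evals, evecs = np.linalg.eig(P.T)
    v = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
    return v / v.sum()

P = np.array([[0.5, 0.2, 0.2, 0.1],
              [0.1, 0.5, 0.2, 0.2],
              [0.3, 0.3, 0.2, 0.2],
              [0.2, 0.2, 0.3, 0.3]])
m, k = 0, 3          # expected time to first reach state k, starting in state m

# Direct method: T_i = 1 + sum_{j != k} P_ij T_j for i != k.
others = [i for i in range(P.shape[0]) if i != k]
T = np.linalg.solve(np.eye(len(others)) - P[np.ix_(others, others)], np.ones(len(others)))
print(T[others.index(m)])

# Exercise 4.19 trick: in [P'] state k returns to m with probability 1, so visits to k
# are renewals and the mean recurrence time of k is 1/pi'_k = 1 + E[first passage m->k].
Pp = P.copy()
Pp[k, :] = 0.0
Pp[k, m] = 1.0
print(1.0 / steady_state(Pp)[k] - 1.0)   # agrees with the direct computation
```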

Exercise 4.21. For the Markov chain with rewards in Figure 4.6,
a) Find the general solution to (4.42) and then find the particular solution (the relative gain vector) with π w = 0.
b) Modify Figure 4.6 by letting P_{12} be an arbitrary probability. Find g and w again and give an intuitive explanation of why P_{12} affects w_2.

Exercise 4.22. a) Show that, for any i, [P]^i r = ([P]^i - e π) r + g e.
b) Show that (4.36) can be rewritten as

    v(n, u) = Σ_{i=0}^{n-1} ([P]^i - e π) r + n g e + (π u) e.

c) Show that if [P] is a positive stochastic matrix, then Σ_{i=0}^{n-1} ([P]^i - e π) converges in the limit n -> inf. Hint: You can use the same argument as in the proof of Corollary 4.6. Note: this sum also converges for an arbitrary ergodic Markov chain.

Exercise 4.23. Consider the Markov chain below:
a) Suppose the chain is started in state i and goes through n transitions; let v_i(n, u) be the expected number of transitions (out of the total of n) until the chain enters the trapping state, state 1. Find an expression for v(n, u) = (v_1(n, u), v_2(n, u), v_3(n, u)) in terms of v(n - 1, u) (take v_1(n, u) = 0 for all n). (Hint: view the system as a Markov reward system; what is the value of r?)
b) Solve numerically for lim_{n->inf} v(n, u). Interpret the meaning of the elements v_i in the solution of (4.30).
c) Give a direct argument why (4.30) provides the solution directly to the expected time from each state to enter the trapping state.


Exercise 4.24. Consider a sequence of IID binary rvs X_1, X_2, . . . . Assume that Pr{X_i = 1} = p_1, Pr{X_i = 0} = p_0 = 1 - p_1. A binary string (a_1, a_2, . . . , a_k) occurs at time n if X_n = a_k, X_{n-1} = a_{k-1}, . . . , X_{n-k+1} = a_1. For a given string (a_1, a_2, . . . , a_k), consider a Markov chain with k + 1 states {0, 1, . . . , k}. State 0 is the initial state, state k is a final trapping state where (a_1, a_2, . . . , a_k) has already occurred, and each intervening state i, 0 < i < k, has the property that if the subsequent k - i variables take on the values a_{i+1}, a_{i+2}, . . . , a_k, the Markov chain will move successively from state i to i + 1 to i + 2 and so forth to k. For example, if k = 2 and (a_1, a_2) = (0, 1), the corresponding chain has states 0, 1, 2, with state 0 moving to state 1 on a 0 and staying at 0 on a 1, state 1 moving to state 2 on a 1 and staying at 1 on a 0, and state 2 a trapping state.
a) For the chain above, find the mean first-passage time from state 0 to state 2.
b) For parts b) to d), let (a_1, a_2, a_3, . . . , a_k) = (0, 1, 1, . . . , 1), i.e., zero followed by k - 1 ones. Draw the corresponding Markov chain for k = 4.
c) Let v_i, 1 ≤ i ≤ k, be the expected first-passage time from state i to state k. Note that v_k = 0. Show that v_0 = 1/p_0 + v_1.
d) For each i, 1 ≤ i < k, show that v_i = α_i + v_{i+1} and v_0 = β_i + v_{i+1}, where α_i and β_i are each a product of powers of p_0 and p_1. Hint: use induction, or iteration, starting with i = 1, and establish both equalities together.
e) Let k = 3 and let (a_1, a_2, a_3) = (1, 0, 1). Draw the corresponding Markov chain for this string. Evaluate v_0, the expected first-passage time for the string 1, 0, 1 to occur.
f) Use renewal theory to explain why the answer in part e) is different from that in part d) with k = 3.

[Figure (for Exercise 4.23): a three-state Markov chain in which state 1 is a trapping state (P_{11} = 1) and the remaining transitions among states 1, 2, 3 have the probabilities 1/2, 1/4, 1/4, and 1/2 shown in the original graph.]
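For the string-occurrence chain of Exercise 4.24 with k = 2 and (a_1, a_2) = (0, 1), the mean first-passage time of part a) can be computed directly from the first-passage equations. A sketch, with p_1 chosen arbitrarily:

```python
import numpy as np

p1 = 0.3                 # Pr{X = 1};  p0 = 1 - p1
p0 = 1.0 - p1

# States 0, 1, 2 track how much of the pattern (0, 1) has been matched.
# From state 0: a 0 moves to state 1, a 1 stays at 0.
# From state 1: a 1 moves to state 2 (pattern complete), a 0 stays at 1.
P = np.array([[p1, p0, 0.0],
              [0.0, p0, p1],
              [0.0, 0.0, 1.0]])    # state 2 is the trapping state

# First-passage times to state 2:  v_i = 1 + sum_{j != 2} P_ij v_j  for i = 0, 1.
v = np.linalg.solve(np.eye(2) - P[:2, :2], np.ones(2))
print(v[0], 1.0 / (p0 * p1))       # the two numbers agree for this pattern
```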

Exercise 4.25. a) Find lim_{n->inf} [P]^n for the Markov chain below. Hint: Think in terms of the long-term transition probabilities. Recall that the edges in the graph for a Markov chain correspond to the positive transition probabilities.
b) Let π(1) and π(2) denote the first two rows of lim_{n->inf} [P]^n and let ν(1) and ν(2) denote the first two columns of lim_{n->inf} [P]^n. Show that π(1) and π(2) are independent left eigenvectors of [P], and that ν(1) and ν(2) are independent right eigenvectors of [P]. Find the eigenvalue for each eigenvector.


[Figure (for Exercise 4.25): states 1 and 2 each have a self-loop of probability 1; state 3 has transition probabilities P_{31}, P_{32}, and P_{33}.]

c) Let r be an arbitrary reward vector and consider the equation

    w + g(1) ν(1) + g(2) ν(2) = r + [P]w.        (4.116)

Determine what values g(1) and g(2) must have in order for (4.116) to have a solution. Argue that with the additional constraints w_1 = w_2 = 0, (4.116) has a unique solution for w, and find that w.
d) Show that, with the w above, w' = w + α ν(1) + β ν(2) satisfies (4.116) for all choices of scalars α and β.
e) Assume that the reward at stage 0 is u = w. Show that v(n, w) = n(g(1) ν(1) + g(2) ν(2)) + w.
f) For an arbitrary reward u at stage 0, show that v(n, u) = n(g(1) ν(1) + g(2) ν(2)) + w + [P]^n(u - w). Note that this verifies (4.49-4.51) for this special case.

Exercise 4.26. Generalize Exercise 4.25 to the general case of two recurrent classes and an arbitrary set of transient states. In part (f), you will have to assume that the recurrent classes are ergodic. Hint: generalize the proof of Lemma 4.1 and Theorem 4.9.

Exercise 4.27. Generalize Exercise 4.26 to an arbitrary number of recurrent classes and an arbitrary number of transient states. This verifies (4.49-4.51) in general.

Exercise 4.28. Let u and u' be arbitrary final reward vectors with u ≤ u'.
a) Let k be an arbitrary stationary policy and prove that v^k(n, u) ≤ v^k(n, u') for each n ≥ 1.
b) Prove that v*(n, u) ≤ v*(n, u') for each n ≥ 1. This is known as the monotonicity theorem.

Exercise 4.29. George drives his car to the theater, which is at the end of a one-way street. There are parking places along the side of the street and a parking garage that costs $5 at the theater. Each parking place is independently occupied or unoccupied with probability 1/2. If George parks n parking places away from the theater, it costs him n cents (in time and shoe leather) to walk the rest of the way. George is myopic and can only see the parking place he is currently passing. If George has not already parked by the time he reaches the nth place, he first decides whether or not he will park if the place is unoccupied, and then observes the place and acts according to his decision. George can never go back and must park in the parking garage if he has not parked before.
a) Model the above problem as a 2-state Markov decision problem. In the driving state, state 2, there are two possible decisions: park if the current place is unoccupied, or drive on whether or not the current place is unoccupied.


b) Find v*_i(n, u), the minimum expected aggregate cost for n stages (i.e., immediately before observation of the nth parking place), starting in state i = 1 or 2; it is sufficient to express v*_i(n, u) in terms of v*_i(n - 1). The final costs, in cents, at stage 0 should be v_2(0) = 500, v_1(0) = 0.
c) For what values of n is the optimal decision the decision to drive on?
d) What is the probability that George will park in the garage, assuming that he follows the optimal policy?

Exercise 4.30. Consider the dynamic programming problem below with two states and two possible policies, denoted k and k'. The policies differ only in state 2.

[Figure: in state 1 (both policies), r_1 = 0 and P_{11} = P_{12} = 1/2. Under policy k, r_2 = 5, P_{21} = 1/8, P_{22} = 7/8; under policy k', r_2 = 6, P_{21} = 1/4, P_{22} = 3/4.]

a) Find the steady-state gain per stage, g and g', for stationary policies k and k'. Show that g = g'.
b) Find the relative gain vectors, w and w', for stationary policies k and k'.
c) Suppose the final reward, at stage 0, is u_1 = 0, u_2 = u. For what range of u does the dynamic programming algorithm use decision k in state 2 at stage 1?
d) For what range of u does the dynamic programming algorithm use decision k in state 2 at stage 2? at stage n? You should find that (for this example) the dynamic programming algorithm uses the same decision at each stage n as it uses in stage 1.
e) Find the optimal gain v*_2(n, u) and v*_1(n, u) as a function of stage n, assuming u = 10.
f) Find lim_{n->inf} v*(n, u) and show how it depends on u.

Exercise 4.31. Consider a Markov decision problem in which the stationary policies k and k' each satisfy Bellman's equation, (4.60), and each correspond to ergodic Markov chains.
a) Show that if r^{k'} + [P^{k'}]w' ≥ r^{k} + [P^{k}]w' is not satisfied with equality, then g' > g.
b) Show that r^{k'} + [P^{k'}]w' = r^{k} + [P^{k}]w'. (Hint: use part a.)
c) Find the relationship between the relative gain vector w^k for policy k and the relative gain vector w' for policy k'. (Hint: Show that r^{k} + [P^{k}]w' = g e + w'; what does this say about w and w'?)
e) Suppose that policy k uses decision 1 in state 1 and policy k' uses decision 2 in state 1 (i.e., k_1 = 1 for policy k and k_1 = 2 for policy k'). What is the relationship between r_1^{(k)}, P_{11}^{(k)}, P_{12}^{(k)}, . . . , P_{1M}^{(k)} for k equal to 1 and 2?
f) Now suppose that policy k uses decision 1 in each state and policy k' uses decision 2 in each state. Is it possible that r_i^{(1)} > r_i^{(2)} for all i? Explain carefully.
g) Now assume that r_i^{(1)} is the same for all i. Does this change your answer to part f)? Explain.
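The dynamic-programming exercises above all revolve around the same backward (Bellman) recursion. The following generic sketch shows the mechanics for a two-state problem; the rewards and transition rows are arbitrary placeholders, not the data of Exercise 4.30.

```python
import numpy as np

# decisions[i] lists the (reward, transition-row) pairs available in state i.
decisions = [
    [(0.0, np.array([0.9, 0.1]))],                                  # state 1: one decision
    [(5.0, np.array([0.5, 0.5])), (6.0, np.array([0.2, 0.8]))],     # state 2: two decisions
]

def bellman_step(v_prev):
    """One stage of the dynamic programming algorithm: maximize over decisions."""
    v_new = np.zeros(len(decisions))
    choices = []
    for i, opts in enumerate(decisions):
        values = [r + row @ v_prev for r, row in opts]
        k = int(np.argmax(values))
        choices.append(k)
        v_new[i] = values[k]
    return v_new, choices

v = np.array([0.0, 10.0])          # final reward u at stage 0
for n in range(1, 6):
    v, policy = bellman_step(v)
    print(n, v, policy)            # v*(n, u) and the maximizing decisions at stage n
```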


Exercise 4.32. Consider a Markov decision problem with three states. Assume that each stationary policy corresponds to an ergodic Markov chain. It is known that a particular policy k' = (k_1, k_2, k_3) = (2, 4, 1) is the unique optimal stationary policy (i.e., the gain per stage in steady-state is maximized by always using decision 2 in state 1, decision 4 in state 2, and decision 1 in state 3). As usual, r_i^{(k)} denotes the reward in state i under decision k, and P_{ij}^{(k)} denotes the probability of a transition to state j given state i and given the use of decision k in state i. Consider the effect of changing the Markov decision problem in each of the following ways (the changes in each part are to be considered in the absence of the changes in the other parts):
a) r_1^{(1)} is replaced by r_1^{(1)} - 1.
b) r_1^{(2)} is replaced by r_1^{(2)} + 1.
c) r_1^{(k)} is replaced by r_1^{(k)} + 1 for all state 1 decisions k.
d) for all i, r_i^{(k_i)} is replaced by r_i^{(k_i)} + 1 for the decision k_i of policy k'.
For each of the above changes, answer the following questions; give explanations:
1) Is the gain per stage, g', increased, decreased, or unchanged by the given change?
2) Is it possible that another policy, k ≠ k', is optimal after the given change?

Exercise 4.33. (The Odoni Bound) Let k' be the optimal stationary policy for a Markov decision problem and let g' and π' be the corresponding gain and steady-state probability respectively. Let v*_i(n, u) be the optimal dynamic expected reward for starting in state i at stage n.
a) Show that min_i [v*_i(n, u) - v*_i(n - 1)] ≤ g' ≤ max_i [v*_i(n, u) - v*_i(n - 1)]; n ≥ 1. Hint: Consider premultiplying v*(n, u) - v*(n - 1) by π' or π^k, where k is the optimal dynamic policy at stage n.

b) Show that the lower bound is non-decreasing in n and the upper bound is non-increasing in n and both converge to g' with increasing n.

Exercise 4.34. Consider a Markov decision problem with three states, {1, 2, 3}. For state 3, there are two decisions, with r_3^{(1)} = r_3^{(2)} = 0 and P_{3,1}^{(1)} = P_{3,2}^{(2)} = 1. For state 1, there are two decisions, with r_1^{(1)} = 0, r_1^{(2)} = 100 and P_{1,1}^{(1)} = P_{1,3}^{(2)} = 1. For state 2, there are two decisions, with r_2^{(1)} = 0, r_2^{(2)} = 100 and P_{2,1}^{(1)} = P_{2,3}^{(2)} = 1.
a) Show that there are two ergodic unichain optimal stationary policies, one using decision 1 in states 1 and 3 and decision 2 in state 2. The other uses the opposite decision in each state.
b) Find the relative gain vector for each of the above stationary policies.
c) Let u be the final reward vector. Show that the first stationary policy above is the optimal dynamic policy in all stages if u_1 ≥ u_2 + 100 and u_3 ≥ u_2 + 100. Show that a non-unichain stationary policy is the optimal dynamic policy if u_1 = u_2 = u_3.


d) Theorem 4.13 implies that, under the conditions of the theorem, lim_{n->inf} [v*_i(n, u) - v*_j(n, u)] is independent of u. Show that this is not true under the conditions of this exercise.
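Returning to the Odoni bound of Exercise 4.33, the bracketing of g' by the increments v*_i(n, u) - v*_i(n - 1, u) is easy to watch numerically. A sketch with arbitrary two-state decision data (every stationary policy here is ergodic):

```python
import numpy as np

decisions = [
    [(1.0, np.array([0.5, 0.5])), (0.0, np.array([0.9, 0.1]))],   # state 1
    [(5.0, np.array([0.5, 0.5])), (6.0, np.array([0.2, 0.8]))],   # state 2
]

v = np.zeros(2)                            # final reward u = 0
for n in range(1, 16):
    new = np.array([max(r + row @ v for r, row in opts) for opts in decisions])
    inc = new - v                          # v*(n, u) - v*(n-1, u)
    print(n, inc.min(), inc.max())         # lower and upper bounds on g'; they close in
    v = new
```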

Exercise 4.35. Assume that k' is a unique optimal stationary policy and corresponds to an ergodic unichain (as in Theorem 4.13). Let w' and g' be the relative gain and gain per stage for k' and let u be an arbitrary final reward vector.
a) Let k' = (k'_1, k'_2, . . . , k'_M). Show that there is some ε > 0 such that, for each i and each k ≠ k'_i,

    r_i^{(k'_i)} + Σ_j P_{ij}^{(k'_i)} w'_j ≥ r_i^{(k)} + Σ_j P_{ij}^{(k)} w'_j + ε.

Hint: Look at the proof of Lemma 4.5.
b) Show that there is some n_0 such that for all j and all n ≥ n_0,

    |v*_j(n - 1) - (n - 1)g' - w'_j - β(u)| < ε/2,

where β(u) is given in Theorem 4.13.
c) Use part b) to show that for all i and all n ≥ n_0,

    r_i^{(k'_i)} + Σ_j P_{ij}^{(k'_i)} v*_j(n - 1) > r_i^{(k'_i)} + Σ_j P_{ij}^{(k'_i)} w'_j + (n - 1)g' + β(u) - ε/2.

d) Use parts a) and b) to show that for all i, all n ≥ n_0, and all k ≠ k'_i,

    r_i^{(k)} + Σ_j P_{ij}^{(k)} v*_j(n - 1) < r_i^{(k'_i)} + Σ_j P_{ij}^{(k'_i)} w'_j + (n - 1)g' + β(u) - ε/2.

e) Combine parts c) and d) to conclude that the optimal dynamic policy uses policy k' for all n ≥ n_0.

Exercise 4.36. Consider an integer-time queueing system with a finite buffer of size 2. At the beginning of the nth time interval, the queue contains at most two customers. There is a cost of one unit for each customer in queue (i.e., the cost of delaying that customer). If there is one customer in queue, that customer is served. If there are two customers, an extra server is hired at a cost of 3 units and both customers are served. Thus the total immediate cost for two customers in queue is 5, the cost for one customer is 1, and the cost for 0 customers is 0. At the end of the nth time interval, either 0, 1, or 2 new customers arrive (each with probability 1/3).
a) Assume that the system starts with 0 ≤ i ≤ 2 customers in queue at time -1 (i.e., in stage 1) and terminates at time 0 (stage 0) with a final cost u of 5 units for each customer in queue (at the beginning of interval 0). Find the expected aggregate cost v_i(1, u) for 0 ≤ i ≤ 2.


b) Assume now that the system starts with i customers in queue at time -2 with the same final cost at time 0. Find the expected aggregate cost v_i(2, u) for 0 ≤ i ≤ 2.
c) For an arbitrary starting time -n, find the expected aggregate cost v_i(n, u) for 0 ≤ i ≤ 2.
d) Find the cost per stage and find the relative cost (gain) vector.
e) Now assume that there is a decision maker who can choose whether or not to hire the extra server when there are two customers in queue. If the extra server is not hired, the 3 unit fee is saved, but only one of the customers is served. If there are two arrivals in this case, assume that one is turned away at a cost of 5 units. Find the minimum dynamic aggregate expected cost v*_i(1), 0 ≤ i ≤ 2, for stage 1 with the same final cost as before.
f) Find the minimum dynamic aggregate expected cost v*_i(n, u) for stage n, 0 ≤ i ≤ 2.
g) Now assume a final cost u of one unit per customer rather than 5, and find the new minimum dynamic aggregate expected cost v*_i(n, u), 0 ≤ i ≤ 2.

Exercise 4.37. Consider a finite-state ergodic Markov chain {X_n; n ≥ 0} with an integer-valued set of states {-K, -K+1, . . . , -1, 0, 1, . . . , +K}, a set of transition probabilities P_{ij}, -K ≤ i, j ≤ K, and initial state X_0 = 0. One example of such a chain is given by:
[Figure: a three-state example on {-1, 0, 1} with the transition probabilities 0.9 and 0.1 shown on the arcs.]

Let {S_n; n ≥ 0} be a stochastic process with S_n = Σ_{i=0}^{n} X_i. Parts (a), (b), and (c) are independent of parts (d) and (e). Parts (a), (b), and (c) should be solved both for the special case in the above graph and for the general case.
a) Find lim_{n->inf} E[X_n] for the example and express lim_{n->inf} E[X_n] in terms of the steady-state probabilities of {X_n; n ≥ 0} for the general case.
b) Show that lim_{n->inf} S_n/n exists with probability one and find the value of the limit. Hint: apply renewal-reward theory to {X_n; n ≥ 0}.
c) Assume that lim_{n->inf} E[X_n] = 0. Find lim_{n->inf} E[S_n].
d) Show that
    Pr{S_n = s_n | S_{n-1} = s_{n-1}, S_{n-2} = s_{n-2}, S_{n-3} = s_{n-3}, . . . , S_0 = 0} = Pr{S_n = s_n | S_{n-1} = s_{n-1}, S_{n-2} = s_{n-2}}.
e) Let Y_n = (S_n, S_{n-1}) (i.e., Y_n is a random two-dimensional integer-valued vector). Show that {Y_n; n ≥ 0} (where Y_0 = (0, 0)) is a Markov chain. Describe the transition probabilities of {Y_n; n ≥ 0} in terms of {P_{ij}}.
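Parts a) and b) of Exercise 4.37 can be sanity-checked by simulation. The sketch below uses an arbitrary chain on {-1, 0, 1} (not necessarily the one in the figure) and compares the steady-state mean of X_n with a long sample average of S_n/n:

```python
import numpy as np

states = np.array([-1, 0, 1])
P = np.array([[0.90, 0.10, 0.00],
              [0.05, 0.90, 0.05],
              [0.00, 0.10, 0.90]])

# Steady-state mean of X_n:  sum_i pi_i * i.
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
pi = pi / pi.sum()
print(pi @ states)

# Sample-path check that S_n / n converges to the same value (w.p. 1).
rng = np.random.default_rng(0)
x, s, N = 1, 0, 200_000                    # index 1 corresponds to X_0 = 0
for _ in range(N):
    x = rng.choice(3, p=P[x])
    s += states[x]
print(s / N)
```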


Exercise 4.38. Consider a Markov decision problem with M states in which some state, say state 1, is inherently reachable from each other state.
a) Show that there must be some other state, say state 2, and some decision, k_2, such that P_{21}^{(k_2)} > 0.
b) Show that there must be some other state, say state 3, and some decision, k_3, such that either P_{31}^{(k_3)} > 0 or P_{32}^{(k_3)} > 0.
c) Assume, for some i and some set of decisions k_2, . . . , k_i, that, for each j, 2 ≤ j ≤ i, P_{jl}^{(k_j)} > 0 for some l < j (i.e., that each state from 2 to i has a non-zero transition to a lower-numbered state). Show that there is some state (other than 1 to i), say i + 1, and some decision k_{i+1} such that P_{i+1,l}^{(k_{i+1})} > 0 for some l ≤ i.
d) Use parts a), b), and c) to observe that there is a stationary policy k = (k_1, . . . , k_M) for which state 1 is accessible from each other state.
