
Applied Stochastic Processes

Course lecture notes


Stochastic processes for economic and business applications
Academic Year 2007-2008
Contents
1. Introduction
2. Stochastic Processes and Some Probability Review
3. Some Expectation Examples
4. An Expectation Problem
5. Conditional expectation
6. Quicksort Algorithm
7. The List Model
8. The Matching Problem Revisited
9. Markov Chains: Introduction
10. Classification of states
11. Generating Functions and the Random Walk
12. More on classification
13. Introduction to stationary distributions
14. Existence and uniqueness - part I
15. Existence and uniqueness - part II
16. Example of Prob. Gen. Funct. for Pi
17. Limiting Probabilities
18. Balance and reversibility
19. (omitted)
20. (omitted)
21. The exponential distribution
22. The Poisson Process: Introduction
23. Properties of the Poisson Process
24. Further Properties of the Poisson Process
25. Continuous-Time Markov Chains Part I
26. Continuous-Time Markov Chains Part II
27. Key properties of Continuous-Time Markov Chains
28. Limiting Properties
29. Local Balance Equations
30. Time reversibility
1. Introduction
Purpose of the Course
The purpose of this course is to study mathematically the behaviour
of stochastic systems.
Examples
1. A CPU with jobs arriving to it in a random fashion.
2. A communications network multiplexor with packets arriving to
it randomly.
3. A store with random demands on the inventory.
Often part of the reason for studying the system is to be able to predict
how it will behave depending on how we design or control it.
Examples
1. A CPU could process jobs
   - one at a time as they arrive (FIFO);
   - one at a time but according to some predefined priority scheme;
   - all at the same time (processor sharing).
2. A network multiplexor could transmit packets from different connections
   - as they arrive (statistical multiplexing);
   - in a round-robin fashion (time-division multiplexing).
3. We could replenish a store's inventory
   - when it drops below a predefined level (dynamic);
   - once a month (static).
   Also, how much should we stock each time we replenish the inventory?
Over the last several decades stochastic process models have become important tools in many disciplines and for many applications (e.g. the spread of cancer cells, growth of epidemics, naturally occurring genetic mutation, social and economic mobility, population growth, fluctuation of financial instruments).
Our goal is to study models of real systems that evolve in a random
fashion and for which we would like to quantify the likelihood of various
outcomes.
2. Stochastic Processes and Some Probability Review
Stochastic Processes
Stochastic means the same thing as random, and probability models that describe a quantity that evolves randomly in time are called stochastic processes.
A stochastic process is a sequence of random variables {X_u, u ∈ I}, which we will sometimes denote by X for shorthand (a bold X).

- u is called the index, and most commonly (and always in this course) it denotes time. Thus we say X_u is the state of the system at time u.
- I is the index set. This is the set of all times we wish to define for the particular process under consideration.

The index set I will be either a discrete or a continuous set. If it is discrete (e.g. I = {0, 1, 2, ...}) then we say that X is a discrete-time stochastic process. If it is continuous (e.g. I = [0, ∞)) then we say X is a continuous-time stochastic process.
Whether the index set I is discrete or continuous is important in determining how we mathematically study the process. Chapter 4 of the text deals exclusively with a class of discrete-time processes (Markov Chains) while chapters 5 and 6 deal with the analogous class of processes in continuous time (Poisson Processes in chapter 5 and Continuous-Time Markov Processes in chapter 6).
Notation: We won't always use X or {X_u, u ∈ I} to denote a stochastic process. The default letter to use for a stochastic process will be (a capital) X, but we'll use other letters too (like Y, Z, W, S, N, M, sometimes lower-case letters like x, y, w, etc., and sometimes even Greek letters), for example when we want notation for two or more different processes, but we'll try to use the X notation whenever possible. Also, the index won't always be u (actually it will usually be something different, like n or t), but we'll (almost) always use a lower-case Roman letter for the index (unless we're referring to a specific time, in which case we use the value of the index, like 1, 2, etc.).
By convention we use certain letters for discrete time indexes and other letters for continuous time indexes. For discrete time indexes we'll usually use the letter n, as in X_n, where n usually will represent a nonnegative integer, and we'll also use the letters i, j, k, l, m. For continuous time indexes we'll usually use the letter t, as in X_t, where t usually will represent a nonnegative real number, and we'll also use the letters s, r, u, and h. The letter h by convention will be reserved for small numbers, as in X_{t+h}, where we mean the process at a time point just after time t.
We'll never use the same letter for the process and the index, as in s_s. That's bad and confusing notation, and really an incorrect use of notation.
State Space
The other fundamental component (besides the index set I) in defining the structure of a stochastic process is the state space. This is the set of all values that the process can take on, much like the concept of the sample space of a random variable (in fact, it is the sample space of each of the random variables X_u in the sequence making up the process), and we usually denote the state space by S.
Examples
1. If the system under study is a CPU with randomly arriving jobs, we might let I = [0, ∞) and S = {0, 1, 2, ...}, where the state represents the number of jobs at the CPU either waiting for processing or being processed. That is, X_t is the number of jobs at the CPU waiting for processing or being processed at time t.

2. When studying the levels of inventory at a store we might let I = {0, 1, 2, ...} and S = {0, 1, ..., B}, where the state represents the number of units of the inventory item currently in the inventory, up to a maximum of B units. That is, X_n is the number of units in the inventory at time n.
The units of the time index are completely up to us to specify. So in the inventory example the time index n could mean week n, month n, or just time period n if we want to leave it more unspecified. But we can only choose the unit to represent one thing; the time unit can't represent, for example, both days and weeks.
Like the index set I, the state space S can be either discrete or continuous. However, dealing mathematically with a continuous state space involves some technical details that are beyond the scope of this course, as they say, and are not particularly instructive. Moreover, most real-world systems can be adequately described using a discrete state space. In this course we'll always assume the state space is discrete.

By discrete we mean that the state space is either finite or countable. It's pretty obvious what we mean by finite (e.g. S = {0, 1, 2, ..., B}). In case you don't know, countable means that there is a one-to-one correspondence between the elements of S and the natural numbers {1, 2, 3, ...}. So we could count all the elements (if we had an infinite amount of time) and not miss any elements. Examples are the set of all integers, the set of all multiples of 0.25, or the set of all rational numbers. This is in contrast to uncountable, or continuous, sets, like the interval [0, 1]. We could never devise a counting scheme to count all the elements in this set even if we had an infinite amount of time.
The index set I and the state space S are enough to define the basic structure of the stochastic process. But we also need to specify how the process evolves randomly in time. We'll come back to this when we start studying chapter 4. Before that, we'll review some probability theory and study the concept of conditioning as a useful technique to evaluate complex probabilities and expectations.
Some Probability Review
Until chapter 5 in the text we'll be dealing almost exclusively with discrete probabilities and random variables. You are expected to know about continuous random variables and density functions and using integration to calculate things like probabilities and expectations, but it may help you to organize any time you spend reviewing basic probability concepts to know that we won't be using continuous random variables regularly until chapter 5.

Fundamental concepts in probability are things like sample spaces, events, the axioms of probability, random variables, distribution functions, and expectation.

Another fundamental concept is conditioning. We'll devote most of the first two weeks of the course to this valuable idea.
Let's start with a simple example in which we calculate a probability. This example is meant to cover some basic concepts of probability modeling.
Example: In an election candidate A receives n votes while candidate B receives just 1 vote. What is the probability that A was always ahead in the vote count, assuming that every ordering in the vote count is equally likely?
Solution: First we define an appropriate sample space. Since vote count orderings are equally likely we can set this up as a counting problem if we make our sample space the set of all possible vote count orderings of n + 1 votes that have n votes for candidate A and 1 vote for candidate B. If we do this we need to be able to count the total number of such orderings and also the number of such orderings in which A is always ahead in the vote count. We think we can do the counting so we proceed. In fact there are n + 1 possible vote count orderings (in the ith ordering B received the ith vote). Moreover, A had to receive the first two votes to be always ahead in the vote count. There are n − 1 orderings in which A received the first two votes. So our desired probability is (n − 1)/(n + 1).
In this example we set up a sample space and defined an event of interest in terms of outcomes in the sample space. We then computed the probability of the event by counting the number of outcomes in the event. This can be done if the outcomes in the sample space are all equally likely.
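As a quick sanity check of this counting argument, here is a small Monte Carlo sketch in Python (the function name and the choice n = 10 are mine, purely for illustration); the estimated probability should land near (n − 1)/(n + 1).

    import random

    def always_ahead(n, trials=100_000):
        """Estimate P(A always ahead) when A gets n votes and B gets 1 vote."""
        count = 0
        for _ in range(trials):
            votes = ['A'] * n + ['B']      # n votes for A, 1 vote for B
            random.shuffle(votes)          # every ordering equally likely
            a = b = 0
            ahead = True
            for v in votes:
                if v == 'A':
                    a += 1
                else:
                    b += 1
                if a <= b:                 # A not strictly ahead at this point
                    ahead = False
                    break
            count += ahead
        return count / trials

    n = 10
    print(always_ahead(n), (n - 1) / (n + 1))   # both should be near 0.818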
Was that simple? Try this one.
Example: A fair coin is tossed repeatedly. Show that a heads is eventually tossed with probability 1.

Solution: The event {heads eventually} is the same as the union of all events of the form {heads flipped for the first time on toss n}, for n ≥ 1. That is,

{Heads eventually} = {H} ∪ {TH} ∪ {TTH} ∪ {TTTH} ∪ ...

The events in the union above are all mutually exclusive and, because we are implicitly assuming that the outcomes of different tosses of the coin are independent, the probability that a heads is flipped for the first time on the nth toss is 1/2^n, so

P(\text{Heads eventually}) = P(H) + P(TH) + P(TTH) + \cdots = \sum_{n=1}^{\infty} \frac{1}{2^n} = \frac{1}{2}\sum_{n=0}^{\infty} \frac{1}{2^n} = \frac{1}{2}\cdot 2 = 1,

as desired.
Concepts to review here are mutually exclusive events, independent
events, and the geometric series.
Another solution to the previous problem, one which utilizes the very useful basic probability rule that P(A^c) = 1 − P(A) for any event A, is to determine the probability of the complement of the event. The logical opposite of the event that a heads eventually occurs is the event that it never occurs. As is sometimes the case, the complement event is easier to work with. Here the event that a heads never occurs is just the event that a tails is flipped forever:

{heads eventually}^c = {TTTTTT...}.

How do we show this event has probability 0?

If you were to write the following

P(TTTTTT...) = (1/2)^∞ = 0
I wouldn't mark it wrong, but one must always be careful when working with ∞. This is because ∞ is technically not a number. The symbol ∞ is really the mathematician's shorthand way of saying "the limit as n goes to ∞". That is,

(1/2)^∞ really means \lim_{n\to\infty} (1/2)^n,

and also the validity of the statement that P(TTTTT...) = (1/2)^∞ relies on the fact that

\lim_{n\to\infty} P(TTT \ldots T) = P(\lim_{n\to\infty} TTT \ldots T),

where the number of Ts in each of the above events is n.
Another important point to note is that the event {TTTT...} of flipping tails forever is not a logically impossible event (in terms of sets this means it's not the empty set). However, it has probability 0. There's a difference between impossible events and events of probability 0.

Here's a more extreme example.
Example: Monkey Typing Shakespeare. A monkey hits keys on
a typewriter randomly and forever. Show that he eventually types the
complete works of Shakespeare with probability 1.
Solution: Let N be the number of characters in the complete works of Shakespeare and let T be the number of different keys on the keypad of the typewriter. Let A be the event that the monkey never types the complete works of Shakespeare; we'll show that P(A) = 0. To do this we'll use an important technique in mathematics, that of bounding. Specifically, divide up the sequence of typed characters into blocks of N characters, starting with the first typed character. Let B be the event that the monkey never types the complete works of Shakespeare in one of the blocks. We'll show that B has probability 0 and that A is contained in B. This will show that A also has probability 0. We work with the event B because it's actually rather trivial to show that it has probability 0. Let B_n be the event that the monkey doesn't type the complete works of Shakespeare in the nth block. Because the blocks are disjoint, the outcomes in different blocks are independent, and also

B = B_1 ∩ B_2 ∩ B_3 ∩ ...

so that

P(B) = P(B_1) P(B_2) P(B_3) ...

But P(B_n) is in fact the same for all n because all blocks are
identically distributed, so that

P(B) = P(B_1)^∞.

So to show P(B) = 0 all we need is that P(B_1) < 1, but this is clearly so since, even though it's small, the probability that the monkey does type the complete works of Shakespeare in the first N keystrokes is nonetheless positive (as an exercise, calculate P(B_1) exactly). Therefore, P(B) = 0. Finally, it can be seen that event A logically implies event B but event B does not imply event A, because B could occur even though our monkey did type the complete works of Shakespeare (just not in one of the blocks). Therefore, A ⊆ B, which implies P(A) ≤ P(B), or P(A) = 0.
Note that both events A and B have infinitely many outcomes, for there are infinitely many infinitely long sequences of characters that do not contain any subsequence of length N that types out the complete works of Shakespeare, so that both A and B are clearly not logically impossible events. Yet both are events of zero probability.
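A tiny numerical sketch of the block bound (with made-up toy values of N and T, since the real ones are astronomically large): P(B_1) = 1 − (1/T)^N is strictly less than 1, so P(B_1)^n drops to 0 as the number of blocks n grows.

    from fractions import Fraction

    # Toy values only; the real N and T are of course much larger.
    N, T = 3, 5                      # "works" of length 3, 5-key typewriter

    p_block = Fraction(1, T) ** N    # P(the works are typed in one given block)
    p_B1 = 1 - p_block               # P(B_1): block fails to contain the works
    print(float(p_B1))               # strictly less than 1

    # P(every one of the first n blocks fails) = P(B_1)^n -> 0 as n grows
    for n in (10, 100, 1000):
        print(n, float(p_B1 ** n))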
One final point I would like to make in this example is again concerning the notion of infinity. Our monkey may indeed eventually write the complete works of Shakespeare given enough time, but probably not before our galaxy has been sucked into a black hole. So the knowledge that it will eventually happen has no practical use here, because it would take too long for it to be of any use.

That's not the point I'm trying to make, though. The point is that statisticians regularly let things go to infinity, because it often is the case that infinity happens really fast, at least in a practical sense.
The best example is perhaps the Central Limit Theorem, which says that

\frac{\sum_{i=1}^{n}(X_i - \mu)}{\sigma\sqrt{n}} \Rightarrow N(0, 1),

where the X_i are independent random variables each with mean μ and variance σ², N(0, 1) denotes the standard normal distribution, and ⇒ denotes convergence in distribution. You may have learned a rough rule of thumb that if n ≥ 30 then the limiting N(0, 1) distribution provides a good approximation (i.e. 30 is effectively equal to ∞). This is what makes the Central Limit Theorem so useful.
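As an illustration (a sketch, with Exponential(1) random variables standing in for the X_i and the rule-of-thumb n = 30), the standardized sum already behaves very much like a standard normal:

    import math
    import random
    import statistics

    def standardized_sum(n, mu, sigma):
        """One draw of (sum of X_i - n*mu) / (sigma*sqrt(n)) with Exponential(1) X_i."""
        s = sum(random.expovariate(1.0) for _ in range(n))
        return (s - n * mu) / (sigma * math.sqrt(n))

    n, mu, sigma = 30, 1.0, 1.0          # Exponential(1) has mean 1 and variance 1
    draws = [standardized_sum(n, mu, sigma) for _ in range(50_000)]
    print(statistics.mean(draws), statistics.pstdev(draws))   # roughly 0 and 1
    print(sum(d <= 1.96 for d in draws) / len(draws))         # roughly Phi(1.96) ~ 0.975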
Similarly, when we study a stochastic process in discrete time, say {X_n, n = 0, 1, 2, ...}, one of the important things that we'll be trying to do is get the limiting distribution of X_n as n goes to infinity. We do this because this limiting distribution is often a good approximation to the distribution of X_n for small n. What this means is that if we let a system operate for a little while, then we expect that the probability of it being in a given state should follow the probability of that state given by the limiting distribution.
3. Some Expectation Examples
Expectation
Let X be a discrete random variable defined on a sample space S with probability mass function f_X(·). The expected value of X, also called the mean of X and denoted E[X], is

E[X] := \sum_{x \in S} x f_X(x) = \sum_{x \in S} x P(X = x),

if the sum is absolutely convergent. Note that the sample space of a random variable is always a subset of R, the real line.
Law of the Unconscious Statistician
Let g(x) be an arbitrary function from S to R. Then the expected value of the random variable g(X) is

E[g(X)] = \sum_{x \in S} g(x) f_X(x) = \sum_{x \in S} g(x) P(X = x).

If g(X) = X², then its mean is called the second moment of X. In general, E[X^k] is called the kth moment of X. The first moment is the same as the mean of X. The first and second moments are the most important moments. If g(X) = (X − E[X])², then its mean is called the second central moment. E[(X − E[X])²] is also commonly called the variance of X.
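A small illustration of these definitions, using a made-up probability mass function on S = {1, 2, 3}: the snippet computes the mean, the second moment via the Law of the Unconscious Statistician, and the variance (directly and as E[X²] − (E[X])²).

    # A made-up probability mass function on S = {1, 2, 3}
    pmf = {1: 0.2, 2: 0.5, 3: 0.3}

    mean = sum(x * p for x, p in pmf.items())                  # E[X]
    second_moment = sum(x**2 * p for x, p in pmf.items())      # E[X^2] via LOTUS, g(x) = x^2
    variance = sum((x - mean)**2 * p for x, p in pmf.items())  # E[(X - E[X])^2]

    print(mean, second_moment, variance)     # 2.1, 4.9, 0.49
    print(second_moment - mean**2)           # also 0.49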
Example: Find the mean of the Geometric(p) distribution.
Solution: The Geometric(p) distribution has probability mass function

f(k) = p(1 − p)^{k−1} for k = 1, 2, 3, ...

so if X is a random variable with the Geometric(p) distribution,

E[X] = \sum_{k=1}^{\infty} k p (1-p)^{k-1} = p \sum_{k=1}^{\infty} k (1-p)^{k-1}.

There is a standard way to evaluate this infinite sum. Let

g(p) = \sum_{k=0}^{\infty} (1-p)^{k}.

This is just a geometric series, so we know that

g(p) = \frac{1}{1-(1-p)} = \frac{1}{p}.

The derivative is g'(p) = −1/p², which has the following form based on its infinite sum representation:

g'(p) = -\sum_{k=1}^{\infty} k (1-p)^{k-1}.

In fact we've evaluated the negative of the infinite sum in E[X]:

E[X] = p \cdot \frac{1}{p^{2}} = \frac{1}{p}.
Next week we'll see how we can evaluate E[X] much more simply and naturally by using a conditioning argument.
We can also find the second moment, E[X²], of the Geometric(p) distribution in a similar fashion by using the Law of the Unconscious Statistician, which allows us to write

E[X^{2}] = p \sum_{k=1}^{\infty} k^{2} (1-p)^{k-1}.

One might consider trying to take the second derivative of g(p). When this is done, one gets

g''(p) = \sum_{k=2}^{\infty} k(k-1)(1-p)^{k-2}.

This is not quite what we want, but it is close. Actually,

p(1-p) g''(p) = p \sum_{k=2}^{\infty} k(k-1)(1-p)^{k-1} = p \sum_{k=1}^{\infty} k(k-1)(1-p)^{k-1} = E[X(X-1)].

Since we know g'(p) = −1/p², we have that g''(p) = 2/p³, and so

E[X(X-1)] = \frac{2p(1-p)}{p^{3}} = \frac{2(1-p)}{p^{2}}.

To finish it off we can write E[X(X − 1)] = E[X² − X] = E[X²] − E[X], so that

E[X^{2}] = E[X(X-1)] + E[X] = \frac{2(1-p)}{p^{2}} + \frac{1}{p} = \frac{2}{p^{2}} - \frac{1}{p}.
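A quick numerical check of both results (a sketch: the infinite sums are truncated at a large K, and p = 0.3 is an arbitrary choice):

    p = 0.3
    K = 10_000   # truncation point; the tail beyond K is negligible for this p

    pmf = [p * (1 - p) ** (k - 1) for k in range(1, K + 1)]
    EX  = sum(k * pk for k, pk in zip(range(1, K + 1), pmf))       # E[X]
    EX2 = sum(k * k * pk for k, pk in zip(range(1, K + 1), pmf))   # E[X^2]

    print(EX,  1 / p)               # both ~ 3.3333
    print(EX2, 2 / p**2 - 1 / p)    # both ~ 18.8889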
Expectation is a very important quantity when evaluating a stochastic
system.
- In financial markets, expected return is often used to determine a fair price for financial derivatives and other equities (based on the notion of a fair game, for which a fair price to enter the game is the expected return from the game, so that your expected net return is zero).

- When designing or controlling a system which provides a service (such as a CPU or a communications multiplexor) and which experiences random demands on its resources, it is often the average system behaviour that one is trying to optimize (such as expected delay, average rate of denial of service, etc.).

- When devising strategies for investment, expected profit is often used as a guide for developing optimal strategies (e.g. a financial portfolio).

- In the inventory example, we might use the expected number of unfilled orders to determine how best to schedule the replenishment of the inventory. Of course, you can quickly see that this doesn't work, because according to this criterion we should just stock the inventory with an infinite number of units. We're ignoring a crucial element: cost. This would lead us to develop a cost function, which would reflect components such as lost orders, cost of storage, cost of the stock, and possibly other factors, and then try to develop a schedule that minimizes the expected cost.
Example: Suppose you enter into a game in which you roll a die repeatedly, and when you stop you receive k dollars if your last roll showed a k, except that you must stop if you roll a 1. A reasonable type of strategy would be to stop as soon as the die shows m or greater. What's the best m?
Solution: We will use the expected prize as our criterion for deciding which m is best. Firstly, if m = 1 or m = 2 then this corresponds to the strategy in which you stop after the first roll of the die. The expected prize for this strategy is just the expected value of one roll of the die. Assuming the die is fair (each outcome is equally likely), we have

For m = 1 or m = 2:

Expected prize = \frac{1}{6}(1 + 2 + 3 + 4 + 5 + 6) = \frac{21}{6} = \frac{7}{2}.
Let's try m = 6. In this strategy we stop as soon as we roll a 1 or a 6. Let X denote the number on the die when we stop. Then X is a random variable that takes on the values 1 or 6, and

Expected prize = (1)P(X = 1) + (6)P(X = 6).

Now we need to determine P(X = 1) and P(X = 6). Note that we are writing the expectation above in terms of the distribution of X. We could have invoked the Law of the Unconscious Statistician and written the expectation as a sum over all possible outcomes in the underlying experiment of the value of X corresponding to that outcome times the probability of that outcome (outcomes in the underlying experiment are sequences of die rolls of finite length that end in a 1 or a 6). However, there are an infinite number of possible outcomes in the underlying experiment and trying to evaluate the sum over all of the outcomes would be more complex than necessary (it actually wouldn't
be that hard in this case and you might try it as an exercise). It's unnecessary in this case because we will be able to determine the distribution of X without too much trouble, and in this case it's better to compute our desired expectation directly from the distribution of X. So what is P(X = 6)? An intuitive argument would be as follows. When we stop we roll either a 1 or a 6, but on that last roll we are equally likely to roll either a 1 or a 6, so P(X = 6) = 1/2, which also gives P(X = 1) = 1/2. This informal argument turns out to be correct, but one should usually be careful when using intuitive arguments and try to check the correctness of the answer more rigorously. We'll do a more rigorous argument here. Let T denote the roll number of the last roll. Then T is a random variable that takes on the values 1, 2, 3, .... Let A_n be the event {T = n}, for n ≥ 1, and let A = A_1 ∪ A_2 ∪ A_3 ∪ .... Then the set of events {A_1, A_2, A_3, ...} is what we call a partition of the sample space, because it is a collection of mutually exclusive events and their union is the whole sample space (every outcome in our underlying experiment corresponds to exactly one of the A_i). In particular, the event A is the whole sample space, so that {X = 6} ∩ A = {X = 6}, and
P({X = 6}) = P({X = 6} ∩ A)
           = P({X = 6} ∩ (A_1 ∪ A_2 ∪ A_3 ∪ ...))
           = P(({X = 6} ∩ A_1) ∪ ({X = 6} ∩ A_2) ∪ ...)
           = P({X = 6} ∩ A_1) + P({X = 6} ∩ A_2) + ...,

because the events {X = 6} ∩ A_n in the union in the third equality above are mutually disjoint (because the A_n are mutually disjoint).
We've gone to some pain to go through all the formal steps to show
that

P(X = 6) = P(X = 6, T = 1) + P(X = 6, T = 2) + ...,

partly because intersecting an event with the union of the events of a partition is a fairly important and useful technique for computing the probability of the event, as long as we choose a useful partition, where a partition is useful if it provides information such that calculating the probability of the intersection of the event and any member of the partition is easier to do than calculating the probability of the event itself. Here the event {X = 6} ∩ {T = k} can only happen if the first k − 1 rolls of the die were not a 1 or a 6 and the kth roll was a 6. Since rolls of the die are independent, it's easy to see that the probability of this is

P(X = 6, T = k) = \left(\frac{4}{6}\right)^{k-1}\frac{1}{6}.
Thus,

P(X = 6) = \sum_{k=1}^{\infty}\left(\frac{4}{6}\right)^{k-1}\frac{1}{6} = \frac{1}{6}\cdot\frac{1}{1-4/6} = \frac{1}{6}\cdot\frac{6}{2} = \frac{1}{2},
confirming our earlier intuitive argument. Next week we'll look at how we would calculate P(X = 6) using a conditioning argument.

Going back to our original calculation, we have

For m = 6:

Expected prize = \frac{1}{2}(1 + 6) = \frac{7}{2},

which is the same as the expected prize for the m = 1 and m = 2 strategies. Moving on (a little more quickly this time), let's calculate
the expected prize for the m = 5 strategy. Again let X denote the number on the die when we finish rolling. This time the possible values of X are 1, 5, or 6. We'll just appeal to the informal argument for the distribution of X because it's faster and happens to be correct. When we stop we roll either a 1, 5, or 6 and we are equally likely to roll any of these numbers, so P(X = 1) = P(X = 5) = P(X = 6) = 1/3, giving

For m = 5:

Expected prize = \frac{1}{3}(1 + 5 + 6) = \frac{12}{3} = 4.

Similar arguments yield

For m = 4:

Expected prize = \frac{1}{4}(1 + 4 + 5 + 6) = \frac{16}{4} = 4.

For m = 3:

Expected prize = \frac{1}{5}(1 + 3 + 4 + 5 + 6) = \frac{19}{5}.

So we see that the strategies corresponding to m = 4 or m = 5 yield the highest expected prize, and these strategies are optimal in this sense.
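A short simulation sketch of the game (the helper name and number of trials are mine) that estimates the expected prize for each threshold m; the estimates for m = 4 and m = 5 should come out near 4, and the others near 3.5 and 3.8, matching the calculations above.

    import random

    def expected_prize(m, trials=200_000):
        """Estimate the expected prize for the strategy 'stop on a 1 or on m or greater'."""
        total = 0
        for _ in range(trials):
            while True:
                roll = random.randint(1, 6)
                if roll == 1 or roll >= m:
                    total += roll
                    break
        return total / trials

    for m in range(1, 7):
        print(m, round(expected_prize(m), 3))
    # m = 1, 2 give about 3.5; m = 3 about 3.8; m = 4, 5 about 4; m = 6 about 3.5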
Digression on Examples:
You may have been noticing in the examples we've looked at so far in the course that they are not all straightforward (e.g. the monkey example). If this is supposed to be probability review this week, you might be asking yourself "Am I expected to already know how to do all these examples?" The answer is no, at least not all of them. All of the examples involve only basic probability concepts, but you're probably starting to realize that a problem can be difficult not because you don't know the concepts but because there's a certain level of sophistication in the solution method. The solutions are not always simple or direct applications of the concepts.

Much of mathematics, including much of the theory of probability, was developed in response to people posing, usually simply stated, problems that they were genuinely interested in. For example, the early development of probability theory was motivated by games of chance. The methods of applied probability, in particular, continue to be vigorously challenged by both old and new problems that people pose in our uncertain world. It's fair to say that the bulk of the work that goes on in probability is not in developing new general theory or new concepts. The language of probability has more or less already been sufficiently developed. Most of the work goes on in trying to solve particular problems originating in a wide variety of applications. So it's in the problems, and the solutions to those problems, that much of the learning and studying is to be done, and this fact is certainly reflected in our textbook.

So don't feel that there is something wrong if a solution to a problem is not obvious to you or if it takes some time and thought to follow
even when it's given. That's supposed to happen. As you read the text you'll notice that the examples are not like typical examples in an introductory probability book. Many of the examples (and problems) required a lot of thought on the part of the author and other people before a solution was obtained. You get the benefit of all that work that went into the examples, but keep in mind that the examples and the problems are lessons in themselves over and above the main exposition of the text.
4. An Expectation Example
This week we'll start studying conditional expectation arguments in Section 3.4 of the text. Before doing so, we'll do one more example calculating an expectation. The problem and solution in this example are of the type alluded to in the previous lecture, meriting careful thought and study on their own. The solution also utilizes a useful quantity known as an indicator function.
Indicator Functions:
Indicator functions are very useful. For any set A, the indicator function of the set A is

I_A(x) = \begin{cases} 1 & \text{if } x \in A \\ 0 & \text{otherwise.} \end{cases}

One important property of the indicator function of A is that if A is an event on a sample space, I_A(x) is a function on that sample space, and there is a probability distribution P defined on that sample space, then I_A is a random variable with expectation

E[I_A] = (1)P(A) + (0)P(A^c) = P(A).
Example: Matching Problem. If n people throw their hats in
a pile and then randomly pick up a hat from the pile, what is the
probability that exactly r people retrieve their own hat?
Solution: First consider r = n, so that everyone retrieves their own hat. This case is relatively easy. The problem is equivalent to saying that we take a random permutation of the integers 1, ..., n and asking what is the probability that we choose one particular permutation (the one corresponding to all persons retrieving their own hat). There are a total of n! possible permutations and only one corresponding to all persons retrieving their own hat, so the probability that everyone retrieves their own hat is 1/n!.

Secondly, consider r = n − 1. This case is also easy, because it's logically impossible for exactly n − 1 persons to retrieve their own hat, so the probability of this is 0.
For 0 ≤ r ≤ n − 2 the solution is not as trivial. There is more than one way to approach the problem, and we'll consider a solution that uses conditioning later in the week. Here we'll consider a more direct approach.

Let A_i be the event that person i retrieves his/her own hat. Note that P(A_i) = 1/n for all i. We can see this because asking that the event A_i occur is logically the same thing as asking for a permutation of the integers {1, 2, ..., n} that leaves the integer i in the ith position (i.e. i doesn't move). But we're allowed to permute the other n − 1 integers any way we want, so we see that there are (n − 1)! permutations that leave integer i alone. So if we assume that all permutations are equally likely we see that P(A_i) = (n − 1)!/n! = 1/n. In fact, using exactly the same type of argument we can get the probability that any particular set of persons retrieves their own hats. Let {i_1, ..., i_s} be a particular set of s persons (e.g. s = 4 and {2, 4, 5, 7} are the persons).
Then, leaving positions i_1, ..., i_s alone, we can permute the remaining n − s positions any way we want, for a total of (n − s)! permutations that leave positions i_1, ..., i_s alone. Therefore, the probability that persons i_1, ..., i_s all retrieve their own hats is (n − s)!/n!. That is,

P(A_{i_1} \cap \ldots \cap A_{i_s}) = \frac{(n-s)!}{n!}.

However, these are not quite the probabilities that we're asking about (though we'll want to use them eventually, so remember them), because if we take s = r and consider the above event that persons i_1, ..., i_r all retrieved their own hat, this event doesn't preclude the possibility that other persons (or even everyone) also retrieved their own hat.
What we want are the probabilities of events like the following:

E_{(i_1,\ldots,i_n)} = A_{i_1} \cap \ldots \cap A_{i_r} \cap A^{c}_{i_{r+1}} \cap \ldots \cap A^{c}_{i_n},

where (i_1, ..., i_n) is some permutation of (1, ..., n). Event E_{(i_1,...,i_n)} says that persons i_1, ..., i_r retrieved their own hat but that persons i_{r+1}, ..., i_n did not. For a particular (i_1, ..., i_n), that would be one way for the event of r persons retrieving their own hat to occur. These events E_{(i_1,...,i_n)} are the right events to be considering, because as we let (i_1, ..., i_n) vary over all possible permutations, we get all the possible ways for exactly r persons to retrieve their own hat. However, here we need to be careful in our counting, because as we vary (i_1, ..., i_n) over all possible permutations we are doing some multiple counting of the same event. For example, suppose n = 5 and r = 3. Then, if you go examine the way we've defined the event E_{(i_1,i_2,i_3,i_4,i_5)} in general, you'll see that the event E_{(1,2,3,4,5)} is the same as the event E_{(3,2,1,4,5)} or the event E_{(3,2,1,5,4)}.
In general, if we have a particular permutation (i_1, ..., i_n) and consider the event E_{(i_1,...,i_n)}, we can permute the first r positions any way we want and also permute the last n − r positions any way we want, and we'll still end up with the same event. Since there are r! ways to permute the first r positions and for each of these ways there are (n − r)! ways to permute the last n − r positions, in total there are r!(n − r)! permutations of (i_1, ..., i_n) that lead to the same event E_{(i_1,...,i_n)}. So that means if we sum up P(E_{(i_1,...,i_n)}) over all n! possible permutations of all n positions, then we should divide that sum by r!(n − r)! and we should end up with the right answer.
The next step is to realize that the events E_{(i_1,...,i_n)} all have the same probability no matter what the permutation (i_1, ..., i_n) is. This is so because all permutations are equally likely. This sort of symmetry reasoning is a very valuable method for simplifying calculations in problems of this sort, and it's important to get a feel for when you can apply this kind of reasoning by getting practice applying it in problems. This symmetry immediately simplifies our calculation, because when we sum P(E_{(i_1,...,i_n)}) over all possible permutations, the answer can be given in terms of any particular permutation, and in particular

\sum_{(i_1,\ldots,i_n)} P(E_{(i_1,\ldots,i_n)}) = n! \, P(E_{(1,\ldots,n)}).

So now if we divide this by r!(n − r)! we should get the right answer:

P(\text{exactly } r \text{ persons retrieve their own hat}) = \frac{n!}{r!(n-r)!} P(E_{(1,\ldots,n)}),

and now the problem is to figure out what the probability of E_{(1,...,n)} is, which is the event that persons 1, ..., r retrieve their own hat and persons r + 1, ..., n do not.
Now (you were probably wondering when we would get to them) we'll introduce the use of indicator functions. In the interest of a more compact notation, let I_j denote I_{A_j}, the indicator of the event that
person j retrieves his/her own hat. Two useful properties of indicator functions are the following. If I_A and I_B are the indicator functions of the events A and B, respectively, then

I_A I_B = I_{A \cap B} = the indicator of the intersection of A and B, and

1 - I_A = I_{A^c} = the indicator of the complement of A.
Using these two properties repeatedly, we get that the indicator of the event E_{(1,...,n)} (go back and look at the definition of this event) is

I_{E_{(1,\ldots,n)}} = I_1 \cdots I_r (1 - I_{r+1}) \cdots (1 - I_n).
Now you may be wondering how this helps. Well, it helps because we've converted a set expression A_1 ∩ ... ∩ A_r ∩ A^c_{r+1} ∩ ... ∩ A^c_n into an arithmetic expression I_1 ... I_r (1 − I_{r+1}) ... (1 − I_n) (containing random variables), related by the fact that the probability of the set expression is equal to the expected value of the arithmetic expression. But this is useful because now we can apply ordinary arithmetic operations to the arithmetic expression. In particular, we will expand out (1 − I_{r+1}) ... (1 − I_n). Can you see why we might want to do this?
So how do we do this? I think we'll just have to do some multiplying and see what we get. Let's simplify notation a little bit first. Suppose a_1, ..., a_k are any k numbers and let's ask how we expand out the product (1 − a_1) ... (1 − a_k). Starting out easy, suppose k = 2. Then we get

(1 - a_1)(1 - a_2) = 1 - a_1 - a_2 + a_1 a_2.

What if we now multiply that by (1 − a_3)? Well, we get

1 - a_1 - a_2 + a_1 a_2 - a_3 + a_1 a_3 + a_2 a_3 - a_1 a_2 a_3.

What if we now multiply this by (1 − a_4)? No, I don't want to write it out either. It's time we looked for a pattern. This is the kind of thing
that turns up on IQ tests. I claim that the pattern is

\prod_{i=1}^{k}(1 - a_i) = \sum_{s=0}^{k} (-1)^{s} \sum_{1 \le i_1 < \cdots < i_s \le k} a_{i_1} \cdots a_{i_s},

where the first term (when s = 0) is meant to be a 1 and in the remaining terms (for s > 0) the s indices i_1, ..., i_s are to run from 1 to k, with the constraint that i_1 < ... < i_s. But is the above equality true? In fact it is. One way to prove it would be to use induction. You can check that it's true for the cases k = 2 and k = 3. Then we assume the expression is correct for any fixed k, multiply the expression by (1 − a_{k+1}), and see that it's true for the k + 1 case. Then logically it must be true for all k. I'll leave that for you to do as an exercise. Right now it's probably better to go back to our original problem, which is to multiply out (1 − I_{r+1}) ... (1 − I_n). If we get all the indices straight, we get that

(1 - I_{r+1}) \cdots (1 - I_n) = \sum_{s=0}^{n-r} (-1)^{s} \sum_{r+1 \le i_1 < \cdots < i_s \le n} I_{i_1} \cdots I_{i_s}.
Now you can see why we wanted to expand out the above product. It's because each of the terms in the sum corresponds to an intersection of some of the events A_i directly, with no complements in the intersection. And this will still be true when we multiply it all by I_1 ... I_r. We want this because when we take the expectation, what we'll want to know is the probability that a given set of persons retrieved their own hats. But we know how to do this. Remember?

So we'll take the above expression, multiply it by I_1 ... I_r, and take the expectation, and what we end up with is P(E_{(1,...,n)}), and from there our final answer is just one multiplication away.
P(E_{(1,\ldots,n)}) = E[I_1 \cdots I_r (1 - I_{r+1}) \cdots (1 - I_n)]
= E\left[ I_1 \cdots I_r \sum_{s=0}^{n-r} (-1)^{s} \sum_{r+1 \le i_1 < \cdots < i_s \le n} I_{i_1} \cdots I_{i_s} \right]
= \sum_{s=0}^{n-r} (-1)^{s} \sum_{r+1 \le i_1 < \cdots < i_s \le n} E[I_1 \cdots I_r I_{i_1} \cdots I_{i_s}].
Let's pause here for a moment. The final line above comes about by taking the expectation inside both summations. We can always take expectations inside summations because of the basic linearity property of expectation (but don't make the mistake of thinking that you can always take expectation inside of products). Now we also know the value of each of the expectations above, from way back near the beginning of this solution. For a given s, there are r + s indicator functions inside the expectation. The expectation in a given term in the sum is just the probability that a particular set of r + s persons (persons 1, ..., r, i_1, ..., i_s) retrieved their own hat, and we've already calculated that this probability is given by (n − (r + s))!/n!. So we know that

P(E_{(1,\ldots,n)}) = \sum_{s=0}^{n-r} (-1)^{s} \sum_{r+1 \le i_1 < \cdots < i_s \le n} \frac{(n-r-s)!}{n!}.
We can certainly simplify this, because we can take the term (n − r − s)!/n! outside of the inner sum, since this term depends only on s, not on the particular indices i_1, ..., i_s. So now the question is: for a given s, how many terms are there in the inner sum? This is a counting question again. The question is how many ways there are to pick s integers from the integers r + 1, ..., n. There are n − r integers in
the set {r + 1, ..., n}, so there are \binom{n-r}{s} ways to pick s integers from these. So let's use this and simplify further:

P(E_{(1,\ldots,n)}) = \sum_{s=0}^{n-r} (-1)^{s} \frac{(n-r-s)!}{n!} \binom{n-r}{s}
= \sum_{s=0}^{n-r} (-1)^{s} \frac{(n-r-s)!}{n!} \cdot \frac{(n-r)!}{(n-r-s)!\,s!}
= \sum_{s=0}^{n-r} (-1)^{s} \frac{(n-r)!}{n!\,s!}.
Now we are basically done except to write the final answer. Recall (from a few pages ago) that

P(\text{exactly } r \text{ persons retrieve their own hat}) = \frac{n!}{r!(n-r)!} P(E_{(1,\ldots,n)}),

so that now we can write

P(\text{exactly } r \text{ persons retrieve their own hat}) = \frac{n!}{r!(n-r)!} \sum_{s=0}^{n-r} (-1)^{s} \frac{(n-r)!}{n!\,s!} = \frac{1}{r!} \sum_{s=0}^{n-r} (-1)^{s} \frac{1}{s!}.
Finally, we can tweak the answer just a little bit more, because the first two terms in the sum above are 1 and −1, so they cancel, and so

P(\text{exactly } r \text{ persons retrieve their own hat}) = \frac{1}{r!}\left( \frac{1}{2!} - \frac{1}{3!} + \cdots + \frac{(-1)^{n-r}}{(n-r)!} \right).

This answer is valid for r ≤ n − 2. This answer has an interesting form when r = 0. When n is large, the probability that nobody retrieves their own hat is approximately e^{-1}. Intuitive? Surprising?
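Here is a sketch (the helper names are mine) that checks the formula against brute-force enumeration over all n! equally likely permutations for a small n, and shows the r = 0 value approaching e^{-1}:

    import math
    from itertools import permutations

    def match_prob_formula(n, r):
        """P(exactly r people get their own hat), from the formula derived above."""
        return (1 / math.factorial(r)) * sum((-1) ** s / math.factorial(s)
                                             for s in range(n - r + 1))

    def match_prob_bruteforce(n, r):
        """Same probability by enumerating all n! equally likely permutations."""
        count = sum(1 for perm in permutations(range(n))
                    if sum(perm[i] == i for i in range(n)) == r)
        return count / math.factorial(n)

    n = 6
    for r in range(n + 1):
        print(r, round(match_prob_formula(n, r), 6), round(match_prob_bruteforce(n, r), 6))

    print(match_prob_formula(10, 0), math.exp(-1))   # both close to 0.3679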
5. Conditional Expectation
Conditional Expectation
Recall that given two events A and B with P(B) > 0, the conditional probability of A given B, written P(A|B), is defined as

P(A|B) = \frac{P(A \cap B)}{P(B)}.

Given two (discrete) random variables X and Y, the conditional expectation of X given Y = y, where P(Y = y) > 0, is defined to be

E[X|Y = y] = \sum_{x} x P(X = x \mid Y = y).

Note that this is just the mean of the conditional distribution of X given Y = y.
Conditioning on an event has the interpretation of information, in that knowing that the event {Y = y} occurred gives us information about the likelihood of the outcomes of X. There is a certain art or intuitiveness to knowing when conditioning on an event will give us useful information or not, one that can be developed with practice. Sometimes, though, it's fairly obvious when conditioning will be helpful.
Example: Suppose that in 1 week a chicken will lay N eggs, where N is a random variable with a Poisson(λ) distribution. Also, each egg has probability p of hatching a chick. Let X be the number of chicks hatched in a week. What is E[X]?

The point being illustrated here is that the distribution of X is not quite obvious, but if we condition on the event {N = n} for some fixed n, then the distribution of X is clearly Binomial(n, p), so that while E[X] may not be obvious, E[X|N = n] is easily seen to be np.
In E[X|Y = y], the quantity y is used to denote any particular value that the random variable Y takes on, but it's in lowercase because it is meant to be a fixed value, just some fixed number. Similarly, E[X|Y = y] is also just some number, and as we change the value of y the conditional expectation E[X|Y = y] changes also. A very useful thing happens when we take the weighted average of all these numbers, weighted by the probability distribution of Y. That is, when we consider

\sum_{y} E[X|Y = y] P(Y = y).

When we plug in the definition of E[X|Y = y] and work out the sum we get

\sum_{y} E[X|Y = y] P(Y = y) = \sum_{y} \sum_{x} x P(X = x \mid Y = y) P(Y = y)
= \sum_{y} \sum_{x} x \frac{P(X = x, Y = y)}{P(Y = y)} P(Y = y)
= \sum_{x} x \sum_{y} P(X = x, Y = y)
= \sum_{x} x P(X = x) = E[X].

In words, when we average the conditional expectation of X given Y = y over the possible values of y, weighted by their marginal probabilities, we get the marginal expectation of X. This result is so useful let's write it again:

E[X] = \sum_{y} E[X|Y = y] P(Y = y).

It's also important enough to have a name: the Law of Total Expectation.
Note that the quantity \sum_{y} E[X|Y = y] P(Y = y) is also an expectation (with respect to the distribution of Y). It's the expected value of a function of Y, which we'll just call g(Y) for now, whose value when Y = y is g(y) = E[X|Y = y]. What we usually do, even though it's sometimes confusing for students, is to use E[X|Y] to denote this function of Y, rather than using something like g(Y). The notation E[X|Y] is actually quite natural, but it can be confusing because it relies rather explicitly on the usual convention that upper case letters denote random variables while lower case letters denote fixed values that the random variable might assume. In particular, E[X|Y] is not a number; it's a function of Y as we said, and as such is itself a random variable. It's also potentially confusing because we refer to the quantity E[X|Y] as "the conditional expectation of X given Y", and this is almost the same way we refer to the quantity E[X|Y = y]. Also, it may take some getting used to thinking of a conditional expectation as a random variable, and you may be alarmed when we first write something like E[E[X|Y]]. The more compact and efficient way to write the Law of Total Expectation is

E[X] = E[E[X|Y]],

where the inner expectation is taken with respect to the conditional distribution of X given Y = y and the outer expectation is taken with respect to the marginal distribution of Y.

Example: Chicken and Eggs. Since we know E[X|N = n] = np, we have that E[X|N] = Np and so

E[X] = E[E[X|N]] = E[Np] = pE[N] = λp.
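A quick simulation sketch of the chicken-and-eggs example (arbitrary choices λ = 4 and p = 0.25, with a small hand-rolled Poisson sampler so the snippet is self-contained); the sample mean of the number of chicks should be close to λp = 1.

    import math
    import random

    def poisson(lam):
        """Knuth's method: multiply uniforms until the product drops below e^{-lam}."""
        L, k, prod = math.exp(-lam), 0, 1.0
        while True:
            prod *= random.random()
            if prod <= L:
                return k
            k += 1

    lam, p = 4.0, 0.25
    trials = 100_000
    total_chicks = 0
    for _ in range(trials):
        n = poisson(lam)                                              # eggs this week
        total_chicks += sum(random.random() < p for _ in range(n))    # each hatches w.p. p

    print(total_chicks / trials, lam * p)   # both ~ 1.0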
Now let's continue with looking at more examples of computing expectations using conditioning arguments via the Law of Total Expectation. Of course, to use a conditioning argument you have to have two random variables: the one whose expectation you want and the one you are conditioning on. Note, though, that the quantity you are conditioning on can be more general than a random variable. It can for example be a random vector, or it could be the outcome of a general sample space, not necessarily even numeric. In addition, conditioning arguments are often used in the context of a whole sequence of random variables (such as a stochastic process, hint hint), but the use of the Law of Total Expectation is only part of the argument. The conditioning doesn't give us the answer directly, but it does give us an equation or set of equations involving the expectation(s) of interest that we can solve. Here's an example of this kind of argument, which takes us back to one of our earlier expectation examples.
Example: Use a conditional expectation argument to find the mean of the Geometric(p) distribution.
Solution: Let X be a random variable with the Geometric(p) distribution. To use conditioning, recall that the distribution of X has a description in terms of a sequence of independent Bernoulli trials. Namely, each trial has probability p of success and X is the time (the trial number) of the first success. Now condition on the outcome of the first trial. The Law of Total Expectation then gives us that

E[X] = E[E[X | outcome of first trial]]
     = E[X | S on 1st trial] P(S on 1st trial) + E[X | F on 1st trial] P(F on 1st trial)
     = E[X | S on 1st trial] p + E[X | F on 1st trial] (1 − p).

Now the argument proceeds as follows. Given that there is a success
on the first trial, X is identically equal to 1. Therefore,

E[X | S on 1st trial] = 1.

The crucial part of the argument is recognizing what happens when we condition on there being a failure on the first trial. If this happens, then X is equal to 1 (for the first trial) plus the number of additional trials needed for the first success. But the number of additional trials required for the first success has the same distribution as X, namely a Geometric(p) distribution. This is so because all trials are identically distributed and independent. To write out this argument more formally and mathematically we might write the following. Let Y be defined as the first trial index, starting from the second trial, at which we have a success, minus 1. Then the distribution of Y is the same as the distribution of X and, in fact, Y is independent of the outcome of the first trial. However, given that the first trial is a failure, the conditional distribution of X is the same as the distribution of 1 + Y. Therefore,

E[X | F on 1st trial] = E[1 + Y] = 1 + E[Y] = 1 + E[X],

where E[Y] = E[X] because X and Y have the same distribution. We've just connected a circle, relating the conditional expectation E[X | F on 1st trial] back to the original unconditional expectation of interest, E[X]. Putting this back into our original equation for E[X], we have

E[X] = (1)p + (1 + E[X])(1 − p).

Now it's easy to solve for E[X], giving E[X] = 1/p.
I claim that the preceding method for evaluating E[X] is more elegant than the way we calculated it last week, using the definition of expectation directly and evaluating the resulting infinite sum, because the conditioning argument is more natural and intuitive and is a purely probabilistic argument. We can do a similar calculation to calculate E[X²] (without any words this time to clutter the elegance):

E[X^{2}] = E[X^{2} | S on 1st trial] P(S on 1st trial) + E[X^{2} | F on 1st trial] P(F on 1st trial)
= E[X^{2} | S on 1st trial] p + E[X^{2} | F on 1st trial] (1-p)
= (1)^{2} p + E[(1 + Y)^{2}](1-p)
= p + E[(1 + X)^{2}](1-p)
= p + E[1 + 2X + X^{2}](1-p)
= p + (1 + 2E[X] + E[X^{2}])(1-p)
= 1 + \frac{2(1-p)}{p} + E[X^{2}](1-p).

Solving for E[X²] gives

E[X^{2}] = \frac{1}{p} + \frac{2}{p^{2}} - \frac{2}{p} = \frac{2}{p^{2}} - \frac{1}{p}.

(Check that this is the same answer we obtained by direct calculation last week.)
Finding a Conditional Expectation by Conditioning
Note that the Law of Total Expectation can be used to find, at least in principle, the mean of any distribution. This includes the conditional distribution of X given Y = y. That is, E[X|Y = y] is the mean of a distribution, just like E[X] is, so it too can be computed by conditioning on another random variable, say Z. However, we must use the conditional distribution of Z given Y = y, and not the marginal distribution of Z, when we do the weighted averaging:

E[X|Y = y] = \sum_{z} E[X \mid Y = y, Z = z] P(Z = z \mid Y = y).
Example: Suppose we roll a fair die and then flip a fair coin the number of times showing on our die roll. If any heads are flipped when we've finished flipping the coin, we stop. Otherwise we keep repeating the above experiment until we've flipped at least one heads. What's the expected number of flips we make before we stop?

Solution: Let X be the number of flips before we stop and let Y be the outcome of the first roll of the die. Then

E[X] = \sum_{k=1}^{6} E[X|Y = k] \frac{1}{6}.

Now we compute E[X|Y = k] by conditioning on whether or not there are any heads when we flip the coin k times. Let Z = 1 if there is at least one heads in our first set of coin flips and Z = 0 if there are no heads in our first set of coin flips. Then

E[X|Y = k] = E[X|Y = k, Z = 1] P(Z = 1|Y = k) + E[X|Y = k, Z = 0] P(Z = 0|Y = k).
Now E[X|Y = k, Z = 1] = k because we will stop the experiment after the k flips, since we've had at least one heads. But

E[X|Y = k, Z = 0] = k + E[X]

because we flip the coin k times and then probabilistically restart the experiment over again. So we have

E[X|Y = k] = k P(Z = 1|Y = k) + (k + E[X]) P(Z = 0|Y = k)
           = k + E[X] P(Z = 0|Y = k)
           = k + E[X] \left(\frac{1}{2}\right)^{k},

since P(Z = 0|Y = k) is the probability of k consecutive tails. Plugging this back into our original expression for E[X], we have

E[X] = \frac{1}{6} \sum_{k=1}^{6} \left( k + E[X]\left(\frac{1}{2}\right)^{k} \right) = \frac{21}{6} + \frac{E[X]}{6} \cdot \frac{63}{64}.

Collecting the E[X] terms on the left and solving then gives

E[X]\left(1 - \frac{63}{384}\right) = \frac{21}{6}, \quad\text{so}\quad E[X] = \frac{21}{6} \cdot \frac{384}{321} = \frac{448}{107} \approx 4.19.
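A small Monte Carlo sketch of this experiment (the helper name and number of trials are mine) that estimates the expected number of flips and can be compared with the value just derived:

    import random

    def flips_until_stop():
        """One run: roll a die, flip that many coins, repeat until a set of flips has a heads."""
        flips = 0
        while True:
            k = random.randint(1, 6)          # the die roll
            got_heads = False
            for _ in range(k):                # flip the coin k times
                flips += 1
                if random.random() < 0.5:
                    got_heads = True
            if got_heads:
                return flips

    trials = 200_000
    estimate = sum(flips_until_stop() for _ in range(trials)) / trials
    print(estimate, 448 / 107)                # both should be about 4.19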
6. Quicksort Algorithm Example
In the previous examples we saw that conditioning allowed us to derive an equation for the particular expectation of interest. In more elaborate situations, we may want to know about more than one unknown expectation, possibly an infinite number of unknown expectations. In such situations the unknown expectations are usually of the same kind, and sometimes conditioning allows us to derive not just one, but multiple equations that allow us to solve for all the unknown expectations. The following is a nice example from the text which analyzes the computational complexity of what is probably the most common sorting algorithm, the Quicksort Algorithm.
Example: The Quicksort Algorithm. Given n values, a classical programming problem is to efficiently sort these values in increasing order. One of the most efficient and widely used sorting algorithms for doing this is called the Quicksort Algorithm, which is described as follows.
Procedure QUICKSORT. Inputs: n and {x_1, ..., x_n}.

    If n = 0 or n = 1: Stop. Return a "no sorting needed" flag.
    Else {
        Choose one of the values at random.
        Compare each of the remaining values to the value chosen and divide up the
        remaining values into two sets, L and H, where L is the set of values less
        than the chosen value and H is the set of values higher than the chosen value
        (if one of the remaining values is equal to the chosen value then assign it
        to either L or H arbitrarily).
        Apply procedure QUICKSORT to L.
        Apply procedure QUICKSORT to H.
        Return with sorted list.
    }
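Here is a short runnable sketch of the procedure just described (in Python; the comparison counting is my addition, included so that the random variable studied below can be simulated, and is not part of the algorithm itself):

    import random

    def quicksort(values):
        """Randomized quicksort as described above; returns (sorted list, #comparisons)."""
        if len(values) <= 1:
            return list(values), 0                 # no sorting needed
        pivot = random.choice(values)              # choose one of the values at random
        rest = list(values)
        rest.remove(pivot)                         # compare the remaining values to it
        L = [x for x in rest if x < pivot]
        H = [x for x in rest if x >= pivot]        # ties assigned to H (arbitrary choice)
        comparisons = len(rest)                    # n - 1 comparisons to form L and H
        sorted_L, cL = quicksort(L)
        sorted_H, cH = quicksort(H)
        return sorted_L + [pivot] + sorted_H, comparisons + cL + cH

    print(quicksort([6, 2, 3, 9, 1, 3, 5]))

    # Averaging the comparison count over many runs estimates E[X_n], e.g. for n = 100:
    runs = [quicksort(list(range(100)))[1] for _ in range(2000)]
    print(sum(runs) / len(runs))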
Note that the procedure QUICKSORT is applied recursively to the sets L and H. As an example, suppose we are given the 7 values {6, 2, 3, 9, 1, 3, 5} and we wish to sort them in increasing order. First we choose one of the values at random. Suppose we choose the first 3. When we compare every other value to 3 we get the sets L = {2, 1, 3} and H = {6, 9, 5} (where we arbitrarily have assigned the second 3 to the L set). So far the numbers have been sorted to the following extent:

{2, 1, 3}, 3, {6, 9, 5}.

Let's also keep track of the number of comparisons we make as we go. So far we've made 6 comparisons. Now we apply QUICKSORT to the set L = {2, 1, 3}. Suppose we (randomly) pick the value 1. Then we divide up L into the two sets {} (the empty set) and {2, 3}. This required 2 more comparisons (for a total of 8 so far), and the set L has so far been sorted into

{}, 1, {2, 3}.

Now the empty set {} is passed to QUICKSORT and it immediately returns with no comparisons made. Then the set {2, 3} is passed to QUICKSORT. One can see that in this call to QUICKSORT 1 more comparison will be made (for a total of 9 so far), then from this call QUICKSORT will be called two more times and immediately return both times, and the call to QUICKSORT with the set L will have finished with L sorted. Now control gets passed to the top level and QUICKSORT is called with the set H = {6, 9, 5}. Now suppose the value 5 is picked. Without going into the details again, QUICKSORT will be called 4 more times and 3 more comparisons will be made (for a total now of 12 comparisons). At this point H will be sorted and the entire original list will be sorted.
A natural way to measure the efficiency of the algorithm is to count how many comparisons it must make to sort n values. However, for the QUICKSORT algorithm, note that the number of comparisons is a random variable, because of the randomness involved in selecting the value which will separate the low and high sets. The worst case occurs if we always select the lowest value in the set (or the highest value). For example, if our original set has 7 values then initially we make 6 comparisons. But if we picked the lowest value to compare to, then the set H will have 6 values, and when we call QUICKSORT again, we'll need to make 5 more comparisons. If we again choose the lowest value in H then QUICKSORT will be called with a set containing 5 values and 4 more comparisons will need to be made. If we always (by bad luck) choose the lowest value, then in total we'll end up making 6 + 5 + 4 + 3 + 2 + 1 = 21 comparisons. The best case occurs when we (by good luck) always choose the "middle" value in whatever set is passed to QUICKSORT.

So the number of comparisons QUICKSORT makes to sort a list of n values is a random variable. Let X_n denote the number of comparisons required to sort a list of n values and let M_n = E[X_n]. As we noted last week, we may use the expected value, M_n, as a measure of the performance or efficiency of the QUICKSORT algorithm.
In this example we pose the problem of determining M_n = E[X_n] for n = 0, 1, 2, .... To simplify things a little bit, we'll assume that the list of numbers to sort has no tied values, so there is no ambiguity about how the algorithm proceeds. We have an infinite number of unknown expectations to determine, but they are all of the same kind. Furthermore, the recursive nature of the QUICKSORT algorithm suggests that a conditioning argument may be useful. But what should we condition on?
Solution: First we note that no comparisons are required to sort a set of 0 or 1 elements, because the QUICKSORT procedure will return immediately in these cases. Thus, X_0 and X_1 are both equal to 0, and so M_0 = E[X_0] and M_1 = E[X_1] are both equal to 0 as well. We may also note that X_2 = 1, so that M_2 = E[X_2] = 1 (though it will turn out that we won't need to know this). For n ≥ 3, X_n is indeed random. If we are going to use a conditioning argument to compute M_n = E[X_n] in general, it is natural to consider how the QUICKSORT algorithm proceeds. The first thing it does is randomly pick one of the n numbers that are to be sorted. Let Y denote the rank of the number that is picked. Thus Y = 1 if we pick the smallest value, Y = 2 if we pick the second smallest value, and so on. Since we select the number at random, all ranks are equally likely. That is, P(Y = j) = 1/n for j = 1, ..., n.

Why might we want to condition on the rank of the chosen value? What information do we get by doing so? If we know the rank of the chosen value then we'll know the size of each of the sets L and H, and so we'll be able to express the expected total number of comparisons in terms of the expected number of comparisons it takes to sort the elements in L and in H. Let's proceed.
First we condition on Y and use the Law of Total Expectation to
write

    M_n = E[X_n] = E[E[X_n | Y]] = \sum_{j=1}^{n} E[X_n | Y = j] \frac{1}{n}.
Now consider E[X_n | Y = j]. Firstly, no matter what the value of Y
is, we will make n - 1 comparisons in order to form our sets L and
H. But if Y = j, then L will have j - 1 elements and H will have
n - j elements. Then we apply the QUICKSORT procedure to L, then
to H. Since L has j - 1 elements, the number of comparisons to
sort the elements in L is X_{j-1}, and since the number of elements in
H is n - j, the number of comparisons to sort the elements in H is
X_{n-j}. Thus given Y = j, the distribution of X_n is the same as the
distribution of n - 1 + X_{j-1} + X_{n-j}, and so

    E[X_n | Y = j] = n - 1 + E[X_{j-1}] + E[X_{n-j}]
                   = n - 1 + M_{j-1} + M_{n-j}.
Plugging this back into our expression for M_n we have

    M_n = \frac{1}{n} \sum_{j=1}^{n} (n - 1 + M_{j-1} + M_{n-j}).
At this point we can see we are getting somewhere, and we probably did
the right thing in conditioning on Y, because we have derived a set of
(linear) equations for the quantities M_n. They don't look too bad and
we have some hope of solving them for the unknown quantities M_n.
In fact we can solve them, and the rest of the solution is now solving
a linear algebra problem. All the probability arguments in the solution
are now over. While the probability argument is the main thing I
want you to learn from this example, we still have a linear system to
solve. Indeed, throughout this course, we'll see that some conditioning
argument will lead to a system of linear equations that needs to be
solved, so now is a good time to get some practice simplifying linear
systems.
The first thing to do is take the sum through in the expression for
M_n, which gives

    M_n = \frac{n-1}{n} \sum_{j=1}^{n} 1 + \frac{1}{n} \sum_{j=1}^{n} M_{j-1} + \frac{1}{n} \sum_{j=1}^{n} M_{n-j}.
This simplifies somewhat. In the first sum, when we sum the value
one n times we get n, and this cancels with the n in the denominator,
so the first term above is n - 1. Next, in the second sum, the values
of j - 1 range from 0 to n - 1, and in the third sum the values of
n - j also range from 0 to n - 1. So in fact the second and third
sums are the same quantity, and we have

    M_n = n - 1 + \frac{2}{n} \sum_{j=0}^{n-1} M_j.
We see that M_n can be obtained recursively in terms of M_j's with
smaller indices j. Moreover, the M_j's appear in the recursion by simply
summing them up. The usual way to simplify such a recursive equation
is to first get rid of the sum by taking two successive equations (for n
and n + 1) and then subtracting one from the other. But first we need
to isolate the sum so that it is not multiplied by any term containing
n. So we multiply through by n to obtain

    n M_n = n(n - 1) + 2 \sum_{j=0}^{n-1} M_j.
Now we write the above equation with n replaced by n + 1:

    (n + 1) M_{n+1} = (n + 1)n + 2 \sum_{j=0}^{n} M_j

and then take the difference of the two equations above:

    (n + 1) M_{n+1} - n M_n = (n + 1)n - n(n - 1) + 2 M_n = 2n + 2 M_n,

or

    (n + 1) M_{n+1} = 2n + (n + 2) M_n.
So we have simplified the equations quite a bit, and now we can think
about actually solving them. If you stare at the above for a little
while you may see that we can make it simpler still. If we divide by
(n + 1)(n + 2) we get

    \frac{M_{n+1}}{n + 2} = \frac{2n}{(n + 1)(n + 2)} + \frac{M_n}{n + 1}.
Why is this simpler? It's because the quantities involving M_n and
M_{n+1} are now of the same form. If we define

    R_n = \frac{M_n}{n + 1}

then the equation becomes

    R_{n+1} = \frac{2n}{(n + 1)(n + 2)} + R_n.
Now we can successively replace R_n with a similar expression involving
R_{n-1}, then replace R_{n-1} with a similar expression involving R_{n-2}, and
so on, as follows:

    R_{n+1} = \frac{2n}{(n+1)(n+2)} + R_n
            = \frac{2n}{(n+1)(n+2)} + \frac{2(n-1)}{n(n+1)} + R_{n-1}
            ...
            = \sum_{j=0}^{n-1} \frac{2(n-j)}{(n+1-j)(n+2-j)} + R_1
            = \sum_{j=0}^{n-1} \frac{2(n-j)}{(n+1-j)(n+2-j)},

because R_1 = M_1/2 = 0 as we discussed earlier. We're basically
done, but when you have a final answer it's good practice to present
the final answer in as readable a form as possible. Here, we can make
the expression more readable if we make the substitution i = n - j.
As j ranges between 0 and n - 1, i will range between 1 and n, and
the expression now becomes

    R_{n+1} = \sum_{i=1}^{n} \frac{2i}{(i+1)(i+2)}.
Finally, we should replace R_{n+1} by its definition in terms of M_{n+1} to
obtain

    \frac{M_{n+1}}{n+2} = \sum_{i=1}^{n} \frac{2i}{(i+1)(i+2)},

or

    M_{n+1} = (n+2) \sum_{i=1}^{n} \frac{2i}{(i+1)(i+2)}.

So after a bit of tweaking we have our final answer.
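As a quick numerical check (not part of the original notes), here is a minimal Python sketch that computes M_n two ways: directly from the recursion M_n = n - 1 + (2/n) \sum_{j=0}^{n-1} M_j, and from the closed form just derived. The function names are our own, purely illustrative.

    # Sketch: compare the recursion for M_n with the closed-form expression.
    def M_recursive(N):
        """M_n = n - 1 + (2/n) * sum_{j=0}^{n-1} M_j, with M_0 = M_1 = 0."""
        M = [0.0] * (N + 1)
        partial = 0.0                      # running sum M_0 + ... + M_{n-1}
        for n in range(1, N + 1):
            partial += M[n - 1]
            M[n] = n - 1 + 2.0 * partial / n
        return M

    def M_closed_form(n):
        """M_{n} via M_{m+1} = (m+2) * sum_{i=1}^{m} 2i/((i+1)(i+2)), writing n = m+1."""
        if n <= 1:
            return 0.0
        m = n - 1
        return (m + 2) * sum(2.0 * i / ((i + 1) * (i + 2)) for i in range(1, m + 1))

    if __name__ == "__main__":
        M = M_recursive(20)
        for n in (2, 5, 10, 20):
            print(n, M[n], M_closed_form(n))   # the two columns should agree

Running it, the two columns coincide, which is a useful sanity check on the algebra above.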
There is another important thing we haven't done in this example.
The quantity M_{n+1} is the expected number of comparisons that the
QUICKSORT algorithm makes to sort a list of n + 1 (distinct) numbers,
and so is a measure of how long the algorithm will take to complete its
task. In computer science, we often talk about the complexity of an
algorithm. But rather than a specific equation like the one for M_{n+1}
above, what we are usually interested in is the order of complexity of
the algorithm. By this we mean that we want to know how fast
the complexity grows with n. For example, if it grows exponentially
with n (like exp(an)) then that's usually bad news. Complexity that
grows like n (the complexity is like an + b) is usually very good, and
complexity that grows like n^2 is usually tolerable.
So what does the complexity of the QUICKSORT algorithm grow like?
Let's write it out again for easy reference:

    M_{n+1} = (n+2) \sum_{i=1}^{n} \frac{2i}{(i+1)(i+2)}.

It certainly seems to grow faster than n, because the expression is
like n times a quantity (the sum) which has n terms. So perhaps
it grows like n^2? Not quite. In fact it grows like n log n, which is
somewhere in between linear and quadratic complexity. This is quite
good, which makes QUICKSORT a very commonly used
sorting algorithm. To see that it grows like n log n, please refer to the
text on p.116. Consider this a reading assignment. The important
point for now is to introduce you to the concept of the order of an
expression. We'll define what we mean by that more precisely in the
coming weeks.
7
The List Model
We'll start today with one more example of calculating an expectation
using a conditional expectation argument. Then we'll look more closely
at the Law of Total Probability, which is a special case of the Law of
Total Expectation.

Example: The List Model (Section 3.6.1).

In information retrieval systems, items are often stored in a list. Suppose
a list has n items e_1, ..., e_n and that we know that item i will
be requested with probability P_i, for i = 1, ..., n. The time it takes
to retrieve an item from the list will be proportional to the position of
the item in the list. It will take longer to retrieve items that are further
down in the list. Ideally, we would like any requested item to be at the
top of the list, but since we don't know in advance what item will be
requested, it's not possible to ensure this ideal case. However, we may
devise heuristic schemes to dynamically update the ordering of the list
items depending on what items have been requested in the past.

Suppose, for example, that we use the "move to the front" rule. In this
scheme, when an item is requested it is moved to the front of the list,
while the remaining items maintain their same relative ordering.
Other schemes might include the "move one forward" rule, where the
position of a requested item is swapped with the position of the item
immediately in front of it, or a static ordering, in which the ordering
of the items in the list never changes once it is initially set. This
static ordering scheme might be good if we knew what the request
probabilities P_i were. For example, we could put the item most likely
to be requested first in the list, the item second most likely to be
requested second in the list, and so on. The advantage of the move
to the front or move one forward rules is that we don't need to know
the request probabilities P_i ahead of time in order to implement these
schemes. In this example we'll analyze the move to the front scheme.
Later, when we study Markov Chains in Chapter 4, we'll come back
to this example and consider the move one forward scheme.
Initially, the items in the list are in some order. We'll allow it to be
any arbitrary ordering. Let X_n be the position of the nth requested
item. It's clear that if we know what the initial ordering is, then we
can easily determine the distribution of X_1. What we would like to
imagine, though, is that requests have been coming for a long time, in
fact forever. We'll see in Chapter 4 that the distribution of X_n will
approach some limiting distribution. If we say that X is a random
variable with this limiting distribution, then the way we write this is

    X_n -> X in distribution.

What this means is that

    lim_{n->infinity} P(X_n = j) = P(X = j) for j = 1, ..., n.

What we are interested in is E[X].
Though we'll consider questions involving such limiting distributions
more rigorously in Chapter 4, for now we'll approach the question more
intuitively. Intuitively, the process has been running for a long time.
Now some request for an item comes. The current position in the list
of the item requested will be a random variable. We would like to
know the expected position of the requested item.
Solution: First we'll condition on which item was requested. Since the
item requested will be item i with probability P_i, we have

    E[position] = \sum_{i=1}^{n} E[position | e_i is requested] P_i
                = \sum_{i=1}^{n} E[position of e_i] P_i.
So we would like to know the expected position of item e_i
at some time point far in the future. Conditioning has allowed us to
focus on a specific item e_i, but have we really made the problem any
simpler? We have, but we need to decompose the problem still further
into quantities we might know how to compute. Further conditioning
doesn't really help us much here. Conditioning is a good tool, but part
of the effective use of conditioning is to know when not to use it.
Here we find the use of indicator functions to be helpful once again.
Let I_j be the indicator of the event that item e_j is ahead of item e_i in
the list. Since the position of e_i is just 1 plus the number of items that
are ahead of it in the list, we can decompose the quantity "position
of e_i" into

    position of e_i = 1 + \sum_{j != i} I_j.
Therefore, when we take expectations, we have that

    E[position of e_i] = 1 + \sum_{j != i} P(item e_j is ahead of item e_i).
So we've decomposed our calculation somewhat. Now, for any particular
i and j, we need to compute the probability that item e_j is
ahead of item e_i. It may not be immediately apparent, but this we
can do. This is because the only times items e_j and e_i ever change
their relative ordering are the times when either e_j or e_i is requested.
Whenever any other item is requested, items e_j and e_i do not change
their relative ordering. So imagine that out of the sequence of all requests
up to our current time, we only look at the requests that were
for item e_j or for item e_i. Item e_j will currently be ahead of item e_i
if and only if the last time item e_j or e_i was requested, it was item
e_j that was requested. So the question is: what is the probability that
item e_j was requested the last time either item e_j or e_i was requested?
In other words, given that item e_j or item e_i is requested, what is the
probability that item e_j is requested? That is,

    P(item e_j is ahead of item e_i)
        = P(item e_j is requested | item e_j or e_i is requested)
        = P({e_j requested} \cap {e_j or e_i requested}) / P({e_j or e_i requested})
        = P({e_j requested}) / P({e_j or e_i requested})
        = P_j / (P_j + P_i).
So, plugging everything back in, we get our final answer:

    E[position] = \sum_{i=1}^{n} E[position of e_i] P_i
                = \sum_{i=1}^{n} ( 1 + \sum_{j != i} P(item e_j is ahead of item e_i) ) P_i
                = \sum_{i=1}^{n} ( 1 + \sum_{j != i} \frac{P_j}{P_i + P_j} ) P_i
                = 1 + \sum_{i=1}^{n} P_i \sum_{j != i} \frac{P_j}{P_i + P_j}.
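To see the formula in action, here is a small Python sketch (our own illustration, not from the notes) that evaluates the limiting expected position and compares it with a long move-to-front simulation. The function names and the example request probabilities are purely illustrative.

    import random

    def expected_position_formula(P):
        """Limiting expected position under move-to-front:
        1 + sum_i P_i * sum_{j != i} P_j / (P_i + P_j)."""
        n = len(P)
        return 1 + sum(P[i] * sum(P[j] / (P[i] + P[j]) for j in range(n) if j != i)
                       for i in range(n))

    def expected_position_simulated(P, warmup=10_000, samples=100_000, seed=1):
        """Run move-to-front for a long warm-up, then average the (1-based)
        position of each requested item."""
        rng = random.Random(seed)
        items = list(range(len(P)))
        total = 0
        for t in range(warmup + samples):
            i = rng.choices(range(len(P)), weights=P)[0]
            pos = items.index(i) + 1
            if t >= warmup:
                total += pos
            items.remove(i)
            items.insert(0, i)            # move the requested item to the front
        return total / samples

    if __name__ == "__main__":
        P = [0.5, 0.3, 0.2]
        print(expected_position_formula(P))     # exact limiting value
        print(expected_position_simulated(P))   # should be close to it

The long warm-up is there precisely because the formula describes the limiting (steady-state) behaviour discussed above.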
Note that in developing this solution, we are assuming that we are
really far out into the future. One (intuitive) consequence of this is that
when we consider the last time before our current time that either item
e_i or e_j was requested, we are assuming that there was a last time.
"Far out into the future" means far enough out (basically infinitely far
out) that we are sure that items e_i and e_j had been requested many
times (indeed infinitely many times) prior to the current time.

One of the homework problems asks you to consider some fixed time t
which is not necessarily far into the future. You are asked to compute
the expected position of an item that is requested at time t. The
solution proceeds much along the same lines as our current solution.
However, for a specific time t you must allow for the possibility that
neither item e_i nor e_j was ever requested before time t. If you allow
for this possibility, then in order to proceed with a solution, you
must know what the initial ordering of the items is, or at least know
the probabilities of the different possible orderings. In the homework
problem it is assumed that all initial orderings are equally likely.
Calculating Probabilities by Conditioning (Section 3.5):

As noted last week, a special case of the Law of Total Expectation

    E[X] = E[E[X|Y]] = \sum_{y} E[X|Y = y] P(Y = y)

is when the random variable X is the indicator of some event A. This
special case is called the Law of Total Probability.
Since E[I_A] = P(A) and E[I_A | Y = y] = P(A|Y = y), we have

    P(A) = \sum_{y} P(A|Y = y) P(Y = y).
Digression: Even though I promised not to look at continuous random
variables for a while (till Chapter 5), I'd like to cheat a bit here
and ask what the Law of Total Expectation looks like when Y is a
continuous random variable. As you might expect, it looks like

    E[X] = \int_{y} E[X|Y = y] f_Y(y) dy,

where f_Y(y) is the probability density function of Y. Similarly, the
Law of Total Probability looks like

    P(A) = \int_{y} P(A|Y = y) f_Y(y) dy.

But wait! If Y is a continuous random variable, doesn't the event
{Y = y} have probability 0? So isn't the conditional probability
P(A|Y = y) undefined? Actually, it is defined, and textbooks (including
this one, see Section 1.4) that tell you otherwise are not being
quite truthful. Of course the definition is not

    P(A|Y = y) = \frac{P(A, Y = y)}{P(Y = y)}.
Let's illustrate the Law of Total Probability (for discrete Y) with a
fairly straightforward example.

Example: Let X_1 and X_2 be independent Geometric random variables
with respective parameters p_1 and p_2. Find P(|X_1 - X_2| <= 1).

Solution: We condition on either X_1 or X_2 (it doesn't matter which).
Say we condition on X_2. Then note that

    P(|X_1 - X_2| <= 1 | X_2 = j) = P(X_1 = j - 1, j, or j + 1).
1
X
2
| 1) =

j=1
P(|X
1
X
2
| 1 | X
2
= j)P(X
2
= j)
=

j=1
P(X
1
= j 1, j, or j + 1)P(X
2
= j)
=

j=1
_
P(X
1
= j 1) + P(X
1
= j) + P(X
1
= j + 1)

P(X
2
= j)
=

j=2
p
1
(1 p
1
)
j2
p
2
(1 p
2
)
j1
+

j=1
p
1
(1 p
1
)
j1
p
2
(1 p
2
)
j1
+

j=1
p
1
(1 p
1
)
j
p
2
(1 p
2
)
j1
=
p
1
p
2
(1 p
2
) + p
1
p
2
+ p
1
p
2
(1 p
1
)
1 (1 p
1
)(1 p
2
)
.
How might you do this problem without using conditioning?
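One way to build confidence in the closed form is to check it numerically. The following Python sketch (our own, not from the notes; the function names are illustrative) compares the formula with a crude Monte Carlo estimate.

    import random

    def prob_close_formula(p1, p2):
        """P(|X1 - X2| <= 1) for independent Geometric(p1), Geometric(p2)
        on {1, 2, ...}, from the closed form derived above."""
        num = p1 * p2 * (1 - p2) + p1 * p2 + p1 * p2 * (1 - p1)
        den = 1 - (1 - p1) * (1 - p2)
        return num / den

    def prob_close_montecarlo(p1, p2, trials=200_000, seed=0):
        rng = random.Random(seed)
        def geom(p):                      # number of trials until first success
            n = 1
            while rng.random() >= p:
                n += 1
            return n
        hits = sum(abs(geom(p1) - geom(p2)) <= 1 for _ in range(trials))
        return hits / trials

    if __name__ == "__main__":
        print(prob_close_formula(0.3, 0.5))
        print(prob_close_montecarlo(0.3, 0.5))   # should be close to the formula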
8
Matching Problem Revisited
We'll do one more example of calculating a probability using conditioning
by redoing the matching problem.

Example: Matching Problem Revisited (Example 3.23). Recall that
we want to calculate the probability that exactly r persons retrieve
their own hats when n persons throw their hats into the middle of a
room and randomly retrieve them.

Solution: We start out by considering the case r = 0. That is, what
is the probability that no one retrieves their own hat when there are
n persons? As you search for a way to proceed with a conditioning
argument, you start out by wondering what information would decompose
the problem into conditional probabilities that might be simpler
to compute. One of the first things that might dawn on you is that if
you knew person 1 retrieved his own hat, then the event that no one
retrieved their own hat could not have happened. So there is some
useful information there.
Next you would need to ask if you can determine the probability that
person 1 retrieved his own hat. Yes, you can. Clearly, person 1 is
equally likely to retrieve any of the n hats, so the probability that
he retrieves his own hat is 1/n. Next you need to wonder if you can
calculate the probability that no one retrieved their own hat given that
person 1 did not retrieve his own hat. This one is unclear, perhaps.
But in fact we can determine it, and here is how we proceed.

First we'll set up some notation. Let

    E_n = {no one retrieves their own hat when there are n persons}

and

    P_n = P(E_n).

Also, let

    Y = the number of the hat picked up by person 1.

Then

    P(Y = j) = 1/n for j = 1, ..., n.

Conditioning on Y, we have

    P_n = P(E_n) = \sum_{j=1}^{n} P(E_n | Y = j) \frac{1}{n}.

Now, P(E_n | Y = 1) = 0 as noted earlier, so

    P_n = \sum_{j=2}^{n} P(E_n | Y = j) \frac{1}{n}.
At this point it's important to be able to see that P(E_n | Y = 2) is
the same as P(E_n | Y = 3), and in fact P(E_n | Y = j) is the same
for all j = 2, ..., n. This is an example of spotting symmetry in
an experiment and using it to simplify an expression. Symmetries in
experiments are extremely useful because the human mind is somehow
quite good at spotting them. The ability to recognize patterns is
in fact one thing that humans are good at that even the most powerful
supercomputers are perhaps only now getting the hang of. In
probability, the recognition of symmetries in experiments allows us to
deduce that two or more expressions must be equal even if we have
no clue as to what the actual value of the expressions is. We can spot
the symmetry here because, as far as the event E_n is concerned, the
indices 2, ..., n are just interchangeable labels. The only thing that
matters as far as the event E_n is concerned is that person 1 picked up
a different hat than his own.

With this symmetry, we can write, for example,

    P_n = \frac{n-1}{n} P(E_n | Y = 2).
Now let's consider P(E_n | Y = 2). To determine this we might consider
that it would help to know what hat person 2 picked up. We know
person 2 didn't pick up his own hat because we know person 1 picked
it up. Suppose we knew that person 2 picked up person 1's hat. Then
we can see that persons 1 and 2 have formed a pair (they have picked
up one another's hats) and the event E_n will occur now if persons
3 to n do not pick up their own hats, where these hats actually do
belong to persons 3 to n. We have reduced the problem to one that
is exactly of the same form as our original problem, but now involving
only n - 2 persons (persons 3 through n).
So let's define Z to be the number of the hat picked up by person 2
and do one more level of conditioning to write

    P(E_n | Y = 2) = \sum_{k=1, k != 2}^{n} P(E_n | Y = 2, Z = k) P(Z = k | Y = 2).

Now given Y = 2 (person 1 picked up person 2's hat), person 2 is
equally likely to have picked up any of the remaining n - 1 hats, so

    P(Z = k | Y = 2) = \frac{1}{n-1}

and so

    P(E_n | Y = 2) = \sum_{k=1, k != 2}^{n} P(E_n | Y = 2, Z = k) \frac{1}{n-1}.
Furthermore, as we discussed on the previous page, if Z = 1 (and
Y = 2), then we have reduced the problem to one involving n - 2
persons, and so

    P(E_n | Y = 2, Z = 1) = P(E_{n-2}) = P_{n-2}.

Plugging this back in we have

    P(E_n | Y = 2) = \frac{1}{n-1} P_{n-2} + \sum_{k=3}^{n} P(E_n | Y = 2, Z = k) \frac{1}{n-1}.

Now we can argue once again by symmetry that P(E_n | Y = 2, Z = k)
is the same for all k = 3, ..., n, because for these k the index is just
an arbitrary label as far as the event E_n is concerned. So we have, for
example,

    P(E_n | Y = 2) = \frac{1}{n-1} P_{n-2} + \frac{n-2}{n-1} P(E_n | Y = 2, Z = 3).
Plugging this back into our expression for P_n we get

    P_n = \frac{n-1}{n} P(E_n | Y = 2)
        = \frac{1}{n} P_{n-2} + \frac{n-2}{n} P(E_n | Y = 2, Z = 3).
Now you might see that we can follow a similar line of argument to
decompose P(E_n | Y = 2, Z = 3) by conditioning on what hat person
3 picked up, given that person 1 picked up person 2's hat and person 2
picked up person 3's hat. You can see it only matters whether person
3 picked up person 1's hat, in which case the problem is reduced to
one involving n - 3 people, or person 3 picked up person l's hat, for
l = 4, ..., n. Indeed, this line of argument would lead to a correct
answer, and is in fact equivalent to the argument involving the
notion of cycles in the Remark on p.120 of the text (also reproduced
in the statement of Problem 3 on Homework #2).
You would proceed to find that

    P(E_n | Y = 2, Z = 3) = \frac{1}{n-2} P_{n-3} + \frac{n-3}{n-2} P(E_n | Y = 2, Z = 3, Z' = 4),

where Z' is the hat picked by person 3, so that

    P_n = \frac{1}{n} P_{n-2} + \frac{1}{n} P_{n-3} + \frac{n-3}{n} P(E_n | Y = 2, Z = 3, Z' = 4).
Continuing in this way you would end up with

    P_n = \frac{1}{n} P_{n-2} + \frac{1}{n} P_{n-3} + ... + \frac{1}{n} P_1 + \frac{1}{n} P_0 = \frac{1}{n} \sum_{k=0}^{n-2} P_k,

where the final term comes from the case in which persons 1 through n
form one single cycle, and P_0 = 1, P_1 = 0.
However, I claim we could have stopped conditioning after our initial
conditioning on Y, because the probability P(E_n | Y = 2) can be expressed
in terms of the events E_{n-1} and E_{n-2} more directly. Here's
why. Person 1 picked up person 2's hat. Suppose we relabel person
1's hat and pretend that it belongs to person 2 (but with the understanding
that if person 2 picks up his "own" hat it's really person 1's
hat). Then the event E_n will occur if either of the events

    {persons 2 to n do not pick up their own hat}

or

    {person 2 picks up his own hat and persons 3 to n do not}

occurs, and these two events are mutually disjoint. The probability of
the first event (given Y = 2) is

    P(persons 2 to n do not pick up their own hat | Y = 2) = P(E_{n-1}) = P_{n-1},

while the probability of the second event given Y = 2 we can write as

    P(person 2 picks up his own hat and persons 3 to n do not | Y = 2)
        = P(persons 3 to n do not | person 2 does, Y = 2) P(person 2 does | Y = 2)
        = P(E_{n-2}) \frac{1}{n-1}.

So we see that

    P_n = \frac{n-1}{n} P(E_n | Y = 2) = \frac{n-1}{n} P_{n-1} + \frac{1}{n} P_{n-2}.

Let's finish this off now by solving these equations.
We have

    P_n = \frac{n-1}{n} P_{n-1} + \frac{1}{n} P_{n-2},

which is equivalent to

    P_n - P_{n-1} = -\frac{1}{n} (P_{n-1} - P_{n-2}).

This gives us a direct recursion for P_n - P_{n-1}. If we keep following it
down we get

    P_n - P_{n-1} = (-1)^{n-2} \frac{1}{n} \cdot \frac{1}{n-1} \cdots \frac{1}{3} (P_2 - P_1),

but since P_2 = 1/2 and P_1 = 0, P_2 - P_1 = 1/2 and

    P_n - P_{n-1} = (-1)^{n-2} \frac{1}{n} \cdot \frac{1}{n-1} \cdots \frac{1}{3} \cdot \frac{1}{2} = (-1)^{n-2} \frac{1}{n!}.
So starting with P_2 = 1/2, this gives

    P_3 = P_2 - \frac{1}{3!} = \frac{1}{2} - \frac{1}{3!}

    P_4 = P_3 + \frac{1}{4!} = \frac{1}{2} - \frac{1}{3!} + \frac{1}{4!},

and so on. In general we would have

    P_n = \frac{1}{2} - \frac{1}{3!} + ... + (-1)^n \frac{1}{n!}.
For r = 0, please check that this is the same answer we got last week.
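For a quick numerical check of the algebra (not part of the original notes), the following Python sketch computes P_n both from the two-term recursion and from the alternating-sum closed form; the function names are our own.

    from math import factorial

    def P_recursive(N):
        """P_n from P_n = ((n-1)/n) P_{n-1} + (1/n) P_{n-2}, with P_1 = 0, P_2 = 1/2."""
        P = [1.0, 0.0, 0.5]            # P_0 = 1 by convention, P_1 = 0, P_2 = 1/2
        for n in range(3, N + 1):
            P.append((n - 1) / n * P[n - 1] + P[n - 2] / n)
        return P

    def P_closed_form(n):
        """P_n = 1/2 - 1/3! + ... + (-1)^n / n!, i.e. sum_{k=0}^{n} (-1)^k / k!."""
        return sum((-1) ** k / factorial(k) for k in range(n + 1))

    if __name__ == "__main__":
        P = P_recursive(10)
        for n in (2, 3, 5, 10):
            print(n, P[n], P_closed_form(n))   # columns agree; both tend to 1/e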
The case r = 0 was really the hard part of the calculation. For
1 <= r <= n - 2 (recall that the probability that all n persons pick up
their own hat is 1/n! and the probability that exactly n - 1 persons
pick up their own hat is 0), we can use a straightforward counting
argument to express the answer in terms of P_{n-r}, the probability that
none of a group of n - r persons picks up their own hat.

In fact, recall that

    P(exactly r persons pick up their own hat)
        = \binom{n}{r} P(persons 1, ..., r do and persons r+1, ..., n don't),

since we can select the particular set of r people who pick up their own
hat in exactly \binom{n}{r} ways and, for each set of persons, the probability
that they pick up their own hats while the other persons do not is the
same no matter what subset of people we pick. However,

    P(persons 1, ..., r do and persons r+1, ..., n don't)
        = P(persons r+1, ..., n don't | persons 1, ..., r do) P(persons 1, ..., r do)
        = P_{n-r} \frac{(n-r)!}{n!},

and so

    P(exactly r persons pick up their own hat)
        = \binom{n}{r} \frac{(n-r)!}{n!} P_{n-r}
        = \frac{1}{r!} P_{n-r}
        = \frac{1}{r!} [ \frac{1}{2} - \frac{1}{3!} + ... + (-1)^{n-r} \frac{1}{(n-r)!} ].
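As an optional illustration (again, not from the notes), this Python sketch compares the formula (1/r!) P_{n-r} with a straightforward simulation of random hat assignments; all names here are our own.

    import random
    from math import factorial

    def prob_exactly_r(n, r):
        """(1/r!) * P_{n-r}, with P_m = sum_{k=0}^{m} (-1)^k / k!."""
        P = sum((-1) ** k / factorial(k) for k in range(n - r + 1))
        return P / factorial(r)

    def prob_exactly_r_simulated(n, r, trials=200_000, seed=0):
        rng = random.Random(seed)
        hits = 0
        for _ in range(trials):
            hats = list(range(n))
            rng.shuffle(hats)                  # hats[i] = hat picked by person i
            matches = sum(hats[i] == i for i in range(n))
            hits += (matches == r)
        return hits / trials

    if __name__ == "__main__":
        n = 6
        for r in range(4):
            print(r, prob_exactly_r(n, r), prob_exactly_r_simulated(n, r))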
Summary of Chapter 3:

In Chapter 3 we've seen several useful techniques for solving problems.
Conditioning is a very important tool that we'll be using throughout the
course. Using conditioning arguments can decompose a probability or
an expectation into simpler conditional expectations or probabilities,
but the problem can still be difficult, and we still often need to be
able to simplify and evaluate fairly complex events in a direct way, for
example using counting or symmetry arguments.

Getting good at solving problems using conditioning takes practice.
It's not a matter of just knowing the Law of Total Expectation. It's
like saying that because I read a book on Visual Programming I can now
say I'm a programmer. The best way to get the process of solving a
problem into your mind is to do problems. I strongly recommend looking
at problems in Chapter 3 in addition to the homework problems. At
least look at some of them and try to work out how you would approach
the problem in your head. I'll be glad to answer any questions you may
have that arise out of this process.

Having stressed the importance of reading examples and doing problems,
we should note that this course is also about learning some
general theory for stochastic processes. Up to now we've been mostly
looking at examples of applying theory that we either already knew or
took a very short time to state (such as the Law of Total Expectation).
We'll continue to look at plenty of examples, but it's time to consider
some general theory now as we start looking at Markov Chains.
9
Markov Chains: Introduction
We now start looking at the material in Chapter 4 of the text. As we
go through Chapter 4 we'll be more rigorous with some of the theory
that is presented either in an intuitive fashion or simply without proof
in the text.

Our focus is on a class of discrete-time stochastic processes. Recall
that a discrete-time stochastic process is a sequence of random variables
X = {X_n : n \in I}, where I is a discrete index set. Unless
otherwise stated, we'll take the index set I to be the set of nonnegative
integers I = {0, 1, 2, ...}, so

    X = {X_n : n = 0, 1, 2, ...}.

We'll denote the state space of X by S (recall that the state space
is the set of all possible values of any of the X_i's). The state space
is also assumed to be discrete (but otherwise general), and we let |S|
denote the number of elements in S (called the cardinality of S). So
|S| could be infinite or some finite positive integer.

We'll start by looking at some of the basic structure of Markov chains.
Markov Chains:

A discrete-time stochastic process X is said to be a Markov Chain if
it has the Markov Property:

Markov Property (version 1):

For any s, i_0, ..., i_{n-1} \in S and any n >= 1,

    P(X_n = s | X_0 = i_0, ..., X_{n-1} = i_{n-1}) = P(X_n = s | X_{n-1} = i_{n-1}).
In words, we say that the distribution of X_n given the entire past of
the process depends only on the immediate past. Note that we are not
saying that, for example, X_10 and X_1 are independent. They are not.
However, given X_9, for example, X_10 is conditionally independent of
X_1. Graphically, we may imagine a particle jumping around in
the state space as time goes on to form a (random) sample path. The
Markov property says that the distribution of where I go next depends
only on where I am now, not on where I've been. This property is a
reasonable assumption for many (though certainly not all) real-world
processes. Here are a couple of examples.
Example: Suppose we check our inventory once a week and replenish
it if necessary according to some schedule which depends on the level
of the stock when we check it. Let X_n be the level of the stock at the
beginning of week n. The Markov property for the sequence {X_n}
should be reasonable if it is reasonable that the distribution of the
(random) stock level at the beginning of the following week depends
on the stock level now (and the replenishing rule, which we assume
depends only on the current stock level), but not on the stock levels
from previous weeks. This assumption would be true if demands
for the stock in the coming week do not depend on past stock levels.
Example: In the list model example, suppose we let X_n denote the
list ordering after the nth request for an item. If we assume (as we did
in that example) that all requests are independent of one another, then
the list ordering after the next request should not depend on previous
list orderings if I know the current list ordering. Thus it is natural here
to assume that {X_n} has the Markov property.

Note that, as with the notion of independence, in applied modeling
the Markov property is not something we usually try to prove mathematically.
It usually comes into the model as an assumption, and
its validity is verified either empirically by some statistical analysis or
by underlying a priori knowledge about the system being modeled.
The bottom line for an applied model is how well it ends up predicting
an aspect of the real system, and in applied modeling the validity
of the Markov assumption is best judged by this criterion.
A useful alternative formulation of the Markov property is:

Markov Property (version 2):

For any s, i_0, i_1, ..., i_{n-1} \in S and any n >= 1 and m >= 0,

    P(X_{n+m} = s | X_0 = i_0, ..., X_{n-1} = i_{n-1}) = P(X_{n+m} = s | X_{n-1} = i_{n-1}).

In words, this says that the distribution of the process at any time
point in the future given the most recent past is independent of the
earlier past. We should prove that the two versions of the Markov property
are equivalent, because version 2 appears on the surface to be more
general. We do this by showing that each implies the other. It's clear
that version 2 implies version 1, just by setting m = 0. We can use
conditioning and an induction argument to prove that version 1 implies
version 2, as follows.
Proof that version 1 implies version 2: Version 2 is certainly true for
m = 0 (it is exactly version 1 in this case). The induction hypothesis
is to assume that version 2 holds for some arbitrary fixed m, and
the induction argument is to show that this implies it must also hold
for m + 1. If we condition on X_{n+m} then

    P(X_{n+m+1} = s | X_0 = i_0, ..., X_{n-1} = i_{n-1})
        = \sum_{s' \in S} P(X_{n+m+1} = s | X_{n+m} = s', X_0 = i_0, ..., X_{n-1} = i_{n-1})
                          P(X_{n+m} = s' | X_0 = i_0, ..., X_{n-1} = i_{n-1}).

For each term in the sum, for the first probability we can invoke version
1 of the Markov property and for the second probability we can invoke
the induction hypothesis, to get

    P(X_{n+m+1} = s | X_0 = i_0, ..., X_{n-1} = i_{n-1})
        = \sum_{s' \in S} P(X_{n+m+1} = s | X_{n+m} = s', X_{n-1} = i_{n-1})
                          P(X_{n+m} = s' | X_{n-1} = i_{n-1}).

Note that in the sum, in the first probability we left the variable X_{n-1}
in the conditioning. We can do that because it doesn't affect the
distribution of X_{n+m+1} conditioned on X_{n+m}. The reason we leave
X_{n-1} in the conditioning is so we can use the basic property that

    P(A \cap B | C) = P(A | B \cap C) P(B | C)

for any events A, B and C (you should prove this for yourself if you
don't quite believe it). With A = {X_{n+m+1} = s}, B = {X_{n+m} = s'}
and C = {X_{n-1} = i_{n-1}}, we have

    P(X_{n+m+1} = s | X_0 = i_0, ..., X_{n-1} = i_{n-1})
        = \sum_{s' \in S} P(X_{n+m+1} = s, X_{n+m} = s' | X_{n-1} = i_{n-1})
        = P(X_{n+m+1} = s | X_{n-1} = i_{n-1}).

So version 2 holds for m + 1 and by induction it holds for all m.
Time Homogeneity: There is one further assumption we will make
about the process X, in addition to the Markov property. Note that
we have said that the Markov property says that the distribution of X_n
given X_{n-1} is independent of X_0, ..., X_{n-2}. However, that doesn't
rule out the possibility that this (conditional) distribution could depend
on the time index n. In general, it could be, for example, that
the distribution of where I go next given that I'm in state s at time 1
is different from the distribution of where I go next given that I'm in
(the same) state s at time 10. Processes where this could happen are
called time-inhomogeneous. We will assume that the process is time-homogeneous.
Homogeneous means "same" and time-homogeneous
means "the same over time". That is, every time I'm in state s, the
distribution of where I go next is the same. You should check to see
if you think that time homogeneity is a reasonable assumption in the
last two examples.

Mathematically, what the time homogeneity property says is that

    P(X_{n+m} = j | X_m = i) = P(X_{n+m+k} = j | X_{m+k} = i)

for any i, j \in S and any m, n, k such that the above indices are
nonnegative. The above conditional probabilities are called the n-step
transition probabilities of the chain because they are the conditional
probabilities of where you will be after n time units from where you
are now. Of basic importance are the 1-step transition probabilities:

    P(X_n = j | X_{n-1} = i) = P(X_1 = j | X_0 = i).

We denote these 1-step transition probabilities by p_{ij}, and they do not
depend on the time n because of the time-homogeneity assumption.
The 1-Step Transition Matrix: We think of putting the 1-step transition
probabilities p_{ij} into a matrix called the 1-step transition matrix,
also called the transition probability matrix of the Markov chain. We'll
usually denote this matrix by P. The (i, j)th entry of P (ith row and
jth column) is p_{ij}. Note that P is a square (|S| x |S|) matrix (so it
could be infinitely big). Furthermore, since

    \sum_{j \in S} p_{ij} = \sum_{j \in S} P(X_1 = j | X_0 = i) = 1,

each row of P has entries that sum to 1. In other words, each row of
P is a probability distribution over S (indeed, the ith row of P is the
conditional distribution of X_n given that X_{n-1} = i). For this reason
we say that P is a stochastic matrix.

It turns out that the transition matrix P gives an almost complete
mathematical specification of the Markov chain. This is actually saying
quite a lot. In general, we would say that a stochastic process is
specified mathematically once we specify the state space and the joint
distribution of any subset of random variables in the sequence making
up the stochastic process. These are called the finite-dimensional
distributions of the stochastic process. So for a Markov chain that's
quite a lot of information we can determine from the transition matrix
P.
One thing that is relatively easy to see is that the 1-step transition
probabilities determine the n-step transition probabilities, for any n.
This fact is contained in what are known as the Chapman-Kolmogorov
Equations.
The Chapman-Kolmogorov Equations:

Let p_{ij}(n) denote the n-step transition probabilities:

    p_{ij}(n) = P(X_n = j | X_0 = i),

and let P(n) denote the n-step transition probability matrix whose
(i, j)th entry is p_{ij}(n). Then

    P(m + n) = P(m) P(n),

which is the same thing as

    p_{ij}(m + n) = \sum_{k \in S} p_{ik}(m) p_{kj}(n).

In words: the probability of going from i to j in m + n steps is the
sum over all k of the probability of going from i to k in m steps, then
from k to j in n steps.
Proof. By conditioning on X_m we have

    p_{ij}(m + n) = P(X_{m+n} = j | X_0 = i)
        = \sum_{k \in S} P(X_{m+n} = j | X_m = k, X_0 = i) P(X_m = k | X_0 = i)
        = \sum_{k \in S} P(X_{m+n} = j | X_m = k) P(X_m = k | X_0 = i)
        = \sum_{k \in S} P(X_n = j | X_0 = k) P(X_m = k | X_0 = i)
        = \sum_{k \in S} p_{kj}(n) p_{ik}(m),

as desired.
As we mentioned, contained in the Chapman-Kolmogorov Equations
is the fact that the 1-step transition probabilities determine the n-step
transition probabilities. This is easy to see, since by repeated
application of the Chapman-Kolmogorov Equations, we have

    P(n) = P(n-1) P(1) = P(n-2) P(1)^2 = ... = P(1) P(1)^{n-1} = P(1)^n.

But P(1) is just P, the 1-step transition probability matrix. So we
see that P(n) can be determined by raising the transition matrix P
to the nth power. One of the most important quantities that we are
interested in from a Markov chain model is the limiting probabilities

    lim_{n->infinity} p_{ij}(n) = lim_{n->infinity} P(X_n = j | X_0 = i).

As an aside, we mention here that based on the Chapman-Kolmogorov
Equations, we see that one way to consider the limiting probabilities is
to examine the limit of P^n. Thus, for finite-dimensional P, we could
approach the problem of finding the limiting probabilities purely as a
linear algebra problem. In fact, for any (finite) stochastic matrix P, it
can be shown that it has exactly one eigenvalue equal to 1 and all the
other eigenvalues are strictly less than one in magnitude. What does
this mean? It means that P^n converges to a matrix where each row
is the eigenvector of P corresponding to the eigenvalue 1.

However, we won't pursue this argument further here, because it
doesn't work when P is infinite-dimensional. We'll look at a more
probabilistic argument that works for any P later. But first we need
to look more closely at the structure of Markov chains.
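To make the matrix-power viewpoint concrete, here is a small numpy sketch (our own example matrix, not one from the notes) that checks the stochastic-matrix property, computes P^n for growing n, and extracts the left eigenvector for eigenvalue 1.

    import numpy as np

    # A small illustrative transition matrix on three states.
    P = np.array([[0.5, 0.3, 0.2],
                  [0.1, 0.6, 0.3],
                  [0.2, 0.4, 0.4]])

    assert np.allclose(P.sum(axis=1), 1.0)       # each row is a distribution over S

    # n-step transition matrix via Chapman-Kolmogorov: P(n) = P^n.
    for n in (1, 2, 5, 50):
        Pn = np.linalg.matrix_power(P, n)
        print(n, Pn.round(4))                    # rows become (nearly) identical as n grows

    # The common limiting row is the left eigenvector of P for eigenvalue 1,
    # normalized to sum to 1.
    vals, vecs = np.linalg.eig(P.T)
    pi = np.real(vecs[:, np.argmin(np.abs(vals - 1))])
    print((pi / pi.sum()).round(4))

For this particular matrix the rows of P^50 already agree with the normalized eigenvector to several decimal places, which illustrates the convergence claim made above for a finite chain.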
10
Classification of States
Let us consider a fairly general version of a Markov chain called the
random walk.

Example: Random Walk. Consider a Markov chain X with transition
probabilities

    p_{i,i+1} = p_i,
    p_{i,i} = r_i,
    p_{i,i-1} = q_i, and
    p_{ij} = 0 for all j != i - 1, i, i + 1,

for any integer i, where p_i + r_i + q_i = 1. A Markov chain like this,
which can only stay where it is or move either one step up or one
step down from its current state, is called a random walk. Random
walks have applications in several areas, including modeling of stock
prices and modeling of occupancy in service systems. The random
walk is also a great process to study because it is relatively simple to
think about yet it can display all sorts of interesting behaviour that is
illustrative of the kinds of things that can happen in Markov chains
in general.
To illustrate some of this behaviour, we'll look now at two special cases
of the random walk. The first is the following.

Example: Random Walk with Absorbing Barriers. Let N be some
fixed positive integer. In the random walk, let

    p_{00} = p_{NN} = 1,
    p_{i,i+1} = p for i = 1, ..., N - 1,
    p_{i,i-1} = q = 1 - p for i = 1, ..., N - 1, and
    p_{ij} = 0 otherwise.

We can take the state space to be S = {0, 1, ..., N}. The states 0
and N are called absorbing barriers, because if the process ever enters
either of these states, the process will stay there forever after.
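As a concrete illustration (not part of the original notes), here is a short numpy sketch that builds the transition matrix of this chain; the function name and parameter values are our own.

    import numpy as np

    def absorbing_random_walk(N, p):
        """Transition matrix for the random walk on {0, 1, ..., N} with absorbing
        barriers at 0 and N: from i in 1..N-1 move up with prob p, down with 1-p."""
        q = 1.0 - p
        P = np.zeros((N + 1, N + 1))
        P[0, 0] = P[N, N] = 1.0              # absorbing states
        for i in range(1, N):
            P[i, i + 1] = p
            P[i, i - 1] = q
        return P

    P = absorbing_random_walk(N=5, p=0.6)
    assert np.allclose(P.sum(axis=1), 1.0)   # stochastic matrix: rows sum to 1
    # Raising P to a large power shows all mass ending up in the absorbing states.
    print(np.linalg.matrix_power(P, 200).round(3))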
This example illustrates the four possible relationships that can exist
between any pair of states i and j in any Markov chain.

1. State j is accessible from state i but state i is not accessible from
state j. This means that, starting in state i, there is a positive
probability that we will eventually go to state j, but starting in
state j the probability that we will ever go to state i is 0. In the
random walk with absorbing barriers, state N is accessible from
state 1, but state 1 is not accessible from state N.
2. State j is not accessible from state i but i is accessible from j.
3. Neither state i nor state j is accessible from the other. In the random
walk with absorbing barriers, states 0 and N are like this.
4. States i and j are accessible from each other. In the random walk
with absorbing barriers, i and j are accessible from each other
for i, j = 1, ..., N - 1. In this case we say that states i and j
communicate.
Accessibility and Communication: The more mathematically precise
way to say that state j is accessible from state i is that the n-step
transition probabilities p_{ij}(n) are strictly positive for at least one n.
Conversely, if p_{ij}(n) = 0 for all n, then state j is not accessible from
state i.

The relationship of communication is our first step in classifying the
states of an arbitrary Markov chain, because this relationship is what
mathematicians call an equivalence relationship. What this means is
that

State i communicates with itself, for every state i. This is true by
convention, as P(X_n = i | X_n = i) = 1 > 0 for any state i.

If state i communicates with state j, then state j communicates
with state i. This is true directly from the definition of communication.

If state i communicates with state j and state j communicates
with state k, then state i communicates with state k. This is
easily shown. Please see the text for a short proof.

The fact that communication is an equivalence relationship allows
us to divide up the state space of any Markov chain in a natural way.
We say that the equivalence class of state i is the set of states that
state i communicates with. By the first property of an equivalence
relationship, we see that the equivalence class of any state i is not
empty: it contains at least the state i. Moreover, by the third property
of an equivalence relationship, we see that all states in the equivalence
class of state i must communicate with one another.
More than that, if j is any state in the equivalence class of state i,
then states i and j have the same equivalence class. You cannot, for
example, have a state k that is in the equivalence class of j but not
in the equivalence class of i, because that would mean k communicates with
j, and j communicates with i, but k does not communicate with i,
and this contradicts the third property of an equivalence relationship.
So that means we can talk about the equivalence class of states that
communicate with one another, as opposed to the equivalence class of
any particular state i. Now the state space of any Markov chain could
have just one equivalence class (if all states communicate with one
another) or it could have more than one equivalence class (you may
check that in the random walk with absorbing barriers example there
are three equivalence classes). But two different equivalence classes
must be disjoint, because if, say, state i belonged to both equivalence
class 1 and to equivalence class 2, then again by the third property of
an equivalence relationship, every state in class 1 must communicate
with every state in class 2 (through communication with state i), and
so every state in class 2 belongs to class 1 (by definition), and vice
versa. In other words, class 1 and class 2 cannot be different unless
they are disjoint.

Thus we see that the equivalence relationship of communication divides
up the state space of a Markov chain into disjoint equivalence
classes. You may wish to go over Examples 4.11 and 4.12 in the
text at this point. In the case when all the states of a Markov chain
communicate with one another, we say that the state space is irreducible.
We also say in this case that the Markov chain is irreducible.
More generally, we say that the equivalence relation of communication
partitions the state space of a Markov chain into disjoint, irreducible,
equivalence classes.
Classification of States

In addition to classifying the relationships between pairs of states,
we can also classify each individual state into one of three mutually
disjoint categories. Let us first look at another special case of the
random walk before defining these categories.

Example: Simple Random Walk. In the random walk model, if we
let

    p_i = p,
    r_i = 0, and
    q_i = q = 1 - p

for all i, then we say that the random walk is a simple random walk.
In this case the state space is the set of all integers,

    S = {..., -1, 0, 1, ...}.

That is, every time we make a jump we move up with probability p
and move down with probability 1 - p, regardless of where we are in
the state space when we jump. Suppose the process starts off in state
0. If p > 1/2, your intuition might tell you that the process will, with
probability 1, eventually venture off to +infinity, never to return to the
0 state. In this case we would say that state 0 is transient, because
there is a positive probability that if we start in state 0 then we will
never return to it.

Suppose now that p = 1/2. Now would you say that state 0 is
transient? If it's not, that is, if we start out in state 0 and we will
eventually return to state 0 with probability 1, then we say that state
0 is recurrent.
Transient and Recurrent States: In any Markov chain, define

    f_i = P(eventually return to state i | X_0 = i)
        = P(X_n = i for some n >= 1 | X_0 = i).

If f_i = 1, then we say that state i is recurrent. Otherwise, if f_i < 1,
then we say that state i is transient. Note that every state in a Markov
chain must be either transient or recurrent.

There is a pretty important applied reason why we care whether a state
is transient or recurrent. Whether a state is transient or recurrent determines
the kinds of questions we ask about the process. I mentioned
once that the limiting probabilities lim_{n->infinity} p_{ij}(n) are very important
because these will tell us about the long-run or steady-state behaviour
of the process, and so give us a way to predict how the system
our process is modeling will tend to behave on average. However, if
state j is transient then this limiting probability is the wrong thing to
ask about because, as we'll see, this limit will be 0.

For example, in the random walk with absorbing barriers, it is clear
that every state is transient except for states 0 and N (why?). What
we should be asking about is not whether we'll be in state i, say,
where 1 <= i <= N - 1, at some time point far into the future, but
other things, like, given that we start in state i, where 1 <= i <= N - 1,
a) will we eventually end up in state 0 or in state N? or b) what's the
expected time before we end up in state 0 or state N?

As another example, in the simple random walk with p < 1/2,
suppose we start in state B, where B is some large integer. If B is
a transient state, then the question to ask is not about the likelihood
of being in state B in the long run, but, for example, how long before
we can expect to be in state 0.
Our main result for today is about how to check whether a given
state is transient or recurrent. We will follow the argument in the
text which precedes Proposition 4.1 in Section 4.3. It is based on the
mean number of visits to a state. In particular, if state i is recurrent
then, starting in state i, state i should be visited infinitely many times.
This is because, if state i is recurrent then, starting in state i, we will
return to state i with probability 1. But once we are back in state i it
is as if we were back at time 0 again, because of the Markov property
and time-homogeneity, so we are certain to come back to state i for
a second time with probability 1. This process never changes. In fact,
a formal induction argument, in which statement n is the proposition
that state i will be visited at least n times with probability 1 given that
the process starts in state i, will show that statement n is true for all
n. In other words, state i will be visited infinitely many times with
probability 1, and so the mean number of visits to state i is infinite.

On the other hand, if state i is transient, then, starting in state i,
the process will return to state i with probability f_i < 1 and will never return
with probability 1 - f_i, and this probability of never returning to state
i will be faced every time the process returns to state i. In other
words, the sequence of returns to state i is equivalent to a sequence
of independent Bernoulli trials, in which a "success" corresponds to
never returning to state i, and the probability of success is 1 - f_i. So
the number of returns to state i is a version of the Geometric random
variable that is interpreted as the number of failures until the first
success in a sequence of independent Bernoulli trials. It is an easy
exercise to see that when the probability of a success is 1 - f_i, the
mean number of failures until the first success is f_i/(1 - f_i). This is
the mean number of returns to state i if state i is transient. All we
care about at this point is that it is finite.
On the other hand, the mean number of visits to state i can be computed
in terms of the n-step transition probabilities. If we define I_n to
be the indicator of the event {X_n = i} and let N_i denote the number
of visits to state i, then

    \sum_{n=0}^{\infty} I_n = the number of visits to state i = N_i.

Then

    E[N_i | X_0 = i] = E[ \sum_{n=0}^{\infty} I_n | X_0 = i ]
                     = \sum_{n=0}^{\infty} E[I_n | X_0 = i]
                     = \sum_{n=0}^{\infty} P(X_n = i | X_0 = i)
                     = \sum_{n=0}^{\infty} p_{ii}(n).

By the argument on the preceding page, state i is recurrent if and only
if the mean number of visits to state i, starting in state i, is infinite.
Thus the following (Proposition 4.1 in the text) has been shown:

Proposition 4.1 in text. State i is recurrent if and only if

    \sum_{n=0}^{\infty} p_{ii}(n) = \infty.

This also implies that state i is transient if and only if

    \sum_{n=0}^{\infty} p_{ii}(n) < \infty.
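As a numerical illustration of this criterion (not in the original notes, and only suggestive since we can only compute finite partial sums), the following Python sketch evaluates partial sums of p_00(n) for the simple random walk, where p_00(2m) = C(2m, m) p^m (1-p)^m and odd-step returns are impossible.

    from math import comb

    def p00(n, p):
        """n-step return probability to 0 for the simple random walk."""
        if n % 2:
            return 0.0
        m = n // 2
        return comb(2 * m, m) * (p ** m) * ((1 - p) ** m)

    def partial_sum(p, N):
        return sum(p00(n, p) for n in range(N + 1))

    if __name__ == "__main__":
        for p in (0.5, 0.6):
            print(p, [round(partial_sum(p, N), 3) for N in (10, 100, 1000, 10000)])
        # For p = 0.6 the partial sums level off (suggesting transience);
        # for p = 0.5 they keep growing slowly (consistent with recurrence).

This anticipates the question about the simple random walk taken up below and in the next lecture.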
A very useful consequence of Proposition 4.1 is that we can show that
recurrence is a class property; that is, if state i is recurrent, then so is
every state in its equivalence class.

Corollary to Proposition 4.1: If states i and j communicate and state
i is recurrent, then state j is also recurrent.

Proof: Since i and j communicate there exist some integers n and m
such that p_{ij}(n) > 0 and p_{ji}(m) > 0. By the Chapman-Kolmogorov
equations (see last lecture),

    P(m + r + n) = P(m) P(r) P(n)

for any positive integer r. If we expand out the (j, j)th entry we get

    p_{jj}(m + r + n) = \sum_{k \in S} p_{jk}(m) \sum_{l \in S} p_{kl}(r) p_{lj}(n)
                      = \sum_{k \in S} \sum_{l \in S} p_{jk}(m) p_{kl}(r) p_{lj}(n)
                      >= p_{ji}(m) p_{ii}(r) p_{ij}(n),

by taking just the term where k = l = i, and where the inequality
follows because every term in the sum is nonnegative. If we sum over
r we get

    \sum_{r=1}^{\infty} p_{jj}(m + r + n) >= p_{ji}(m) p_{ij}(n) \sum_{r=1}^{\infty} p_{ii}(r).

But the right hand side equals infinity because p_{ji}(m) > 0 and
p_{ij}(n) > 0 by choice of m and n, and state i is recurrent. Therefore,
the left hand side equals infinity, and so state j is recurrent by
Proposition 4.1.
Note that since every state must be either recurrent or transient, transience
is also a class property. That is, if states i and j communicate
and state i is transient, then state j cannot be recurrent, because that
would imply state i is recurrent. If state j is not recurrent it must be
transient.

Example: In the simple random walk it is easy to see that every
state communicates with every other state. Thus, there is only one
class and this Markov chain is irreducible. We can classify every state
as either transient or recurrent by just checking whether state 0, for
example, is transient or recurrent. Especially for p = 1/2, it's still
not clear whether state 0 is transient or recurrent. We'll leave this
question till next lecture, where we will see that we need to refine our
categories of transient and recurrent a little bit.

A very similar argument to the one in Proposition 4.1 will show that
if state i is a transient state and if j is a state such that state i is
accessible from state j, then the mean number of visits to state i given
that the process starts in state j is finite and can be expressed as

    E[N_i | X_0 = j] = \sum_{n=0}^{\infty} p_{ji}(n).

For the above sum to be convergent (that is, finite) it is necessary that
p_{ji}(n) -> 0 as n -> infinity. On the other hand, if state i is not accessible
from state j, then p_{ji}(n) = 0 for all n. Thus we have our first limit
theorem, which we state as another corollary:

Corollary: If i is a transient state in a Markov chain, then

    p_{ji}(n) -> 0 as n -> infinity

for all j \in S.
Since we've just had a number of new definitions and results, it's
prudent at this time to recap what we have done. To summarize,
we categorized the relationship between any pair of states by defining
the (asymmetric) notion of accessibility and the (symmetric) notion
of communication. We saw that communication is an equivalence
relationship and so divides up the state space of any Markov chain
into equivalence classes. Next, we categorized the individual states
by defining the concepts of transient and recurrent. These are fundamental
definitions in the language of Markov chain theory, and should
be memorized. A recurrent state is one that the chain will return
to with probability 1, and a transient state is one that the chain may
never return to, with positive probability. Next, we proved the useful
Proposition 4.1 which says that state i is recurrent if and only if
\sum_{n=0}^{\infty} p_{ii}(n) is equal to infinity. Otherwise, if the sum is finite, then
state i is transient. Finally, we looked at two consequences of this
result. The first is that recurrence and transience are class properties,
and the second is that lim_{n->infinity} p_{ji}(n) = 0 if i is any transient state.

We also considered two special cases of the random walk, the random
walk with absorbing barriers and the simple random walk. The random
walk with absorbing barriers has three equivalence classes: {0}, {N}
and {1, ..., N-1}. States 0 and N are recurrent states, while states
1, ..., N-1 are transient.

The simple random walk has just one equivalence class, so all states
are either all transient or all recurrent. However, we have not yet
determined which. Intuitively, if p < 1/2 or p > 1/2, one would guess
that all states are transient. If p = 1/2, it's not as intuitive. What do
you think? In the next lecture we will consider this problem, and also
introduce the use of generating functions.
11
Generating Functions and the Random Walk
Today we will detour a bit from the text to look more closely at the
simple random walk discussed in the last lecture. In the process I
wish to introduce the notion and use of generating functions, which
are similar to moment generating functions, which you may have
encountered in a previous course.

For a sequence of numbers a_0, a_1, a_2, ..., we define the generating
function of this sequence to be

    G(s) = \sum_{n=0}^{\infty} s^n a_n.

This brings the sequence into another domain (where s is the argument)
that is analogous to the spectral frequency domain that may
be familiar to engineers. If the sequence {a_n} is the probability mass
function of a random variable X on the nonnegative integers (i.e.
P(X = n) = a_n), then we call the generating function the probability
generating function of X, and we can write it as

    G(s) = E[s^X].
Generating functions have many useful properties. Here we will briefly
go over two properties that will be used when we look at the random
walk. The first key property is how the generating function decomposes
when applied to a convolution. If {a_0, a_1, ...} and {b_0, b_1, ...}
are two sequences then we define their convolution to be the sequence
{c_0, c_1, ...}, where

    c_n = a_0 b_n + a_1 b_{n-1} + ... + a_n b_0 = \sum_{i=0}^{n} a_i b_{n-i}.
The generating function of the convolution is

    G_c(s) = \sum_{n=0}^{\infty} c_n s^n
           = \sum_{n=0}^{\infty} \sum_{i=0}^{n} a_i b_{n-i} s^n
           = \sum_{i=0}^{\infty} \sum_{n=i}^{\infty} a_i b_{n-i} s^n
           = \sum_{i=0}^{\infty} a_i s^i \sum_{n=i}^{\infty} b_{n-i} s^{n-i}
           = G_a(s) G_b(s),

where G_a(s) is the generating function of {a_n} and G_b(s) is the generating
function of {b_n}.
In the case when {a_n} and {b_n} are probability distributions and X and
Y are two independent random variables on the nonnegative integers
with P(X = n) = a_n and P(Y = n) = b_n, we see that the convolution
{c_n} is just the distribution of X + Y (i.e. P(X + Y = n) = c_n).
In this case the decomposition above can be written as

    G_{X+Y}(s) = E[s^{X+Y}] = E[s^X s^Y] = E[s^X] E[s^Y] = G_X(s) G_Y(s),

where the factoring of the expectation follows from the independence
of X and Y. In words, the generating function of the sum of independent
random variables is the product of the individual generating
functions of the random variables.
The second important property that we need for now is that if G_X(s)
is the generating function of a random variable X, then

    G'_X(1) = E[X].

This is straightforward to see, because

    G'(s) = \frac{d}{ds}(a_0 + a_1 s + a_2 s^2 + ...) = a_1 + 2 a_2 s + 3 a_3 s^2 + ...,

so that

    G'(1) = a_1 + 2 a_2 + 3 a_3 + ... = E[X],

assuming a_n = P(X = n).

In addition, one can also easily verify that if G(s) is the generating
function of an arbitrary sequence {a_n}, then the nth derivative of
G(s) evaluated at s = 0 is equal to n! a_n. That is,

    G^{(n)}(0) = n! a_n, and so a_n = \frac{1}{n!} G^{(n)}(0).

In other words, the generating function of a sequence determines the
sequence. As a special case, the probability generating function of
a random variable X determines the distribution of X. This can
sometimes be used in conjunction with the property on the previous
page to determine the distribution of a sum of independent random
variables. For example, we can use this method to show that the
sum of two independent Poisson random variables again has a Poisson
distribution. Now let's consider the simple random walk.
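Before moving on, here is a small Python sketch (our own illustration, not from the notes) that treats generating functions as coefficient lists: multiplying the lists is exactly convolution, and a numerical derivative at s = 1 recovers the mean. The two-dice example is purely illustrative.

    def convolve(a, b):
        """Coefficients of G_a(s) * G_b(s): c_n = sum_i a_i b_{n-i}."""
        c = [0.0] * (len(a) + len(b) - 1)
        for i, ai in enumerate(a):
            for j, bj in enumerate(b):
                c[i + j] += ai * bj
        return c

    def G(coeffs, s):
        """Evaluate the generating function sum_n coeffs[n] * s^n."""
        return sum(c * s ** n for n, c in enumerate(coeffs))

    # Two independent fair dice (values 1..6); the sum's distribution is the
    # convolution of the two mass functions, and G_{X+Y}(s) = G_X(s) G_Y(s).
    die = [0.0] + [1.0 / 6] * 6
    total = convolve(die, die)
    print(total[7])                          # P(X + Y = 7) = 6/36
    print(G(total, 0.9), G(die, 0.9) ** 2)   # the two numbers agree

    # G'(1) = E[X]: estimate the derivative of G_die at s = 1 numerically.
    h = 1e-6
    print((G(die, 1 + h) - G(die, 1 - h)) / (2 * h))  # close to 3.5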
94 11. GENERATING FUNCTIONS AND THE RANDOM WALK
Firstly, we discuss an important property of the simple random walk
process that is not shared by most Markov chains. In addition to be-
ing time homogeneous, the simple random walk has a property called
spatial homogeneity. In words, this means that if I take a portion of
a sample path (imagine a graph of the sample path with the state on
the vertical axis and time on the horizontal axis) and then displace it
vertically by any integer amount, that displaced sample path, condi-
tioned on my starting value, has the same probability as the original
sample path, conditioned on its starting value. Another way to say
this is analogous to one way that time homogeneity was explained.
That is, a Markov chain is time homogeneous because, for any state
i, every time we go into state i, the probabilities of where we go next
depend only on the state i and not on the time that we went into
state i. For a time homogeneous process that is also spatially homogeneous, not only do the probabilities of where we go next not depend on the time we entered state i, they also do not depend on the state i itself. For the simple random walk, no matter where we are and no matter what the time is, we always move up with probability p and down with probability q = 1 − p. Mathematically, we say that a (discrete-time) process is spatially homogeneous if for any times n, m ≥ 0 and any displacement k,

  P(X_{n+m} = b | X_n = a) = P(X_{n+m} = b + k | X_n = a + k).
For example, in the simple random walk the probability that we are in
state 5 at time 10 given that we are in state 0 at time 0 is the same
as the probability that we are in state 10 at time 10 given that we are
in state 5 at time 0. Together with time homogeneity, we can assert,
as another example, that the probability that we ever reach state 1
given we start in state 0 at time 0 is the same as the probability that
we ever reach state 2 given that we start in state 1 at time 1.
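A crude Monte Carlo check of spatial homogeneity (an added sketch; the choices p = 0.6, 10 steps and a displacement of 4 are arbitrary, the displacement being even so that it is reachable in exactly 10 steps) estimates two such conditional probabilities and shows that they agree up to simulation error:

    import random

    def estimate(start, target, p=0.6, steps=10, trials=200_000, seed=1):
        # Monte Carlo estimate of P(X_steps = target | X_0 = start) for the simple random walk
        rng = random.Random(seed)
        hits = 0
        for _ in range(trials):
            x = start
            for _ in range(steps):
                x += 1 if rng.random() < p else -1
            hits += (x == target)
        return hits / trials

    # spatial homogeneity: shifting both the start and the target by 5 should not matter
    print(estimate(0, 4), estimate(5, 9))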
Our goal is to determine whether state 0 in a simple random walk is transient or recurrent. Since all states communicate in a simple random walk, determining whether state 0 is transient or recurrent tells us whether all states are transient or recurrent. Let us suppose that the random walk starts in state 0 at time 0. Define

  T_r = time that the walk first reaches state r, for r ≥ 1,
  T_0 = time that the walk first returns to state 0.

Also define

  f_r(n) = P(T_r = n | X_0 = 0)

for r ≥ 0 and n ≥ 0 (noting that f_r(0) = 0 for all r), and let

  G_r(s) = Σ_{n=0}^{∞} s^n f_r(n)

be the generating function of the sequence {f_r(n)}, for r ≥ 0. Note that G_0(1) = Σ_{n=0}^{∞} f_0(n) is the probability that the walk ever returns to state 0, so this will tell us directly whether state 0 is transient or recurrent. State 0 is transient if G_0(1) < 1 and state 0 is recurrent if G_0(1) = 1. We will proceed now by first considering G_r(s), for r > 1, then considering G_1(s). We will consider G_0(s) later.
To approach the evaluation of G_r(s), for r > 1, we consider the probability f_r(n) = P(T_r = n | X_0 = 0) and condition on T_1, the time the walk first reaches state 1, to obtain via the Markov property that

  f_r(n) = Σ_{k=0}^{∞} P(T_r = n | T_1 = k) f_1(k) = Σ_{k=0}^{n} P(T_r = n | T_1 = k) f_1(k),

where we can truncate the sum at n because P(T_r = n | T_1 = k) = 0 for k > n (this should be clear from the definitions of T_r and T_1). Now we may apply the time and spatial homogeneity of the random walk to consider the conditional probability P(T_r = n | T_1 = k). By temporal and spatial homogeneity, this is the same as the probability that the first time we reach state r − 1 is at time n − k given that we start in state 0 at time 0. That is,

  P(T_r = n | T_1 = k) = f_{r-1}(n − k),

and so

  f_r(n) = Σ_{k=0}^{n} f_{r-1}(n − k) f_1(k).
So we see that the sequence {f_r(n)} is the convolution of the two sequences {f_{r-1}(n)} and {f_1(n)}. Therefore, by the first property of generating functions that we considered, we have

  G_r(s) = G_{r-1}(s) G_1(s).

But by applying this decomposition to G_{r-1}(s), and so on, we arrive at the conclusion that

  G_r(s) = G_1(s)^r,

for r > 1. Now we will use this result to approach G_1(s). This time we condition on X_1, the first step of the random walk, to write, for n > 1,

  f_1(n) = P(T_1 = n | X_1 = 1) p + P(T_1 = n | X_1 = −1) q.
Now if n > 1, then P(T_1 = n | X_1 = 1) = 0, because if X_1 = 1 then that implies T_1 = 1 also, so that T_1 = n is impossible. Also, again by the time and spatial homogeneity of the random walk, we may assert that P(T_1 = n | X_1 = −1) is the same as the probability that the first time the walk reaches state 2 is at time n − 1 given that the walk starts in state 0 at time 0. That is,

  P(T_1 = n | X_1 = −1) = f_2(n − 1).

Therefore,

  f_1(n) = q f_2(n − 1),

for n > 1. For n = 1, f_1(1) is just the probability that the first thing the random walk does is move up to state 1, so f_1(1) = p. We now have enough to write out an equation for the generating function G_1(s). Keeping in mind that f_1(0) = 0, we have

  G_1(s) = Σ_{n=0}^{∞} s^n f_1(n)
         = s f_1(1) + Σ_{n=2}^{∞} s^n f_1(n)
         = ps + Σ_{n=2}^{∞} s^n q f_2(n − 1)
         = ps + qs Σ_{n=2}^{∞} s^{n-1} f_2(n − 1)
         = ps + qs Σ_{n=1}^{∞} s^n f_2(n)
         = ps + qs G_2(s),

since f_2(0) = 0 as well.
But by our previous result we know that G_2(s) = G_1(s)^2, so that

  G_1(s) = ps + qs G_1(s)^2,

and so we have a quadratic equation for G_1(s). We may write it in the more usual form

  qs G_1(s)^2 − G_1(s) + ps = 0.

Using the quadratic formula, the two roots of this equation are

  G_1(s) = (1 ± √(1 − 4pqs^2)) / (2qs).

Only one of these two roots is the correct answer. A boundary condition for G_1(s) is that G_1(0) = f_1(0) = 0. You can check (using L'Hospital's rule) that only the solution

  G_1(s) = (1 − √(1 − 4pqs^2)) / (2qs)

satisfies this boundary condition.
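The following short numerical check (added here; the value p = 0.4 is an arbitrary choice) confirms that this root really does satisfy the quadratic equation G_1(s) = ps + qs G_1(s)^2 and that it vanishes as s approaches 0:

    import math

    def G1(s, p):
        # the root (1 - sqrt(1 - 4pq s^2)) / (2qs) selected above
        q = 1.0 - p
        return (1.0 - math.sqrt(1.0 - 4.0 * p * q * s * s)) / (2.0 * q * s)

    p, q = 0.4, 0.6
    for s in (1e-6, 0.25, 0.5, 0.75, 1.0):
        lhs = G1(s, p)
        rhs = p * s + q * s * G1(s, p) ** 2
        print(f"s={s}: G1(s)={lhs:.6f}, ps + qs*G1(s)^2={rhs:.6f}")
    # the two columns agree, and G1(s) -> 0 as s -> 0, as the boundary condition requires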
Let us pause for a moment to consider what we have done. We have seen a fairly typical way in which generating functions are used with discrete time processes, especially Markov chains. A generating function can be defined for any sequence indexed by the nonnegative integers. This sequence is often a set of probabilities defined on the nonnegative integers, and obtaining the generating function of this sequence can tell us useful information about this set of probabilities. In fact, theoretically at least, the generating function tells us everything about this set of probabilities, since the generating function determines these probabilities. In our current work with the random walk, we have defined probabilities f_r(n) over the times n = 0, 1, 2, ... at which an event of interest first occurs. We may also work with generating functions when the state space of our Markov chain is the set of nonnegative integers and we define probabilities over the set of states. Many practical systems of interest are modeled with a state space that is the set of nonnegative integers, including service systems in which the state is the number of customers/jobs/tasks in the system. We will consider the use of generating functions again when we look at queueing systems in Chapter 8 of the text. Note also that when the sequence is a sequence of probabilities, a typical way that we try to determine the generating function of the sequence is to use a conditioning argument to express the probabilities in the sequence in terms of related probabilities. As we have seen before, when we do this right we set up a system of equations for the probabilities, and as we have just seen now, this can be turned into an equation for the generating function of the sequence. One of the advantages of using generating functions is that we can often compress many (usually infinitely many) equations for the probabilities into just a single equation for the generating function, as we did for G_1(s).
So G_1(s) can tell us something about the probabilities f_1(n), for n ≥ 1. In particular

  G_1(1) = Σ_{n=1}^{∞} P(T_1 = n | X_0 = 0) = P(the walk ever reaches 1 | X_0 = 0).

Setting s = 1, we see that

  G_1(1) = (1 − √(1 − 4pq)) / (2q).

We can simplify this by replacing p with 1 − q in 1 − 4pq to get 1 − 4pq = 1 − 4(1 − q)q = 1 − 4q + 4q^2 = (1 − 2q)^2. Since √((1 − 2q)^2) = |1 − 2q|, we have that

  G_1(1) = (1 − |1 − 2q|) / (2q).

If q ≤ 1/2 then |1 − 2q| = 1 − 2q and G_1(1) = 2q/2q = 1. On the other hand, if q > 1/2 then |1 − 2q| = 2q − 1 and G_1(1) = (2 − 2q)/2q = 2p/2q = p/q, which is less than 1 if q > 1/2. So we see that

  G_1(1) = 1 if q ≤ 1/2,  and  G_1(1) = p/q < 1 if q > 1/2.

In other words, the probability that the random walk ever reaches 1 is 1 if p ≥ 1/2 and is less than 1 if p < 1/2.
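A simulation sketch of this conclusion (an added illustration; p = 0.4 is arbitrary, and each walk is truncated at 2000 steps, which slightly underestimates the true probability):

    import random

    def prob_reach_one(p, trials=20_000, max_steps=2_000, seed=2):
        # Monte Carlo estimate of P(walk ever reaches 1 | X_0 = 0), truncated at max_steps
        rng = random.Random(seed)
        hits = 0
        for _ in range(trials):
            x = 0
            for _ in range(max_steps):
                x += 1 if rng.random() < p else -1
                if x == 1:
                    hits += 1
                    break
        return hits / trials

    p = 0.4
    print(prob_reach_one(p), p / (1 - p))   # estimate vs the value p/q = 2/3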
Next we will finish off our look at generating functions and the random walk by evaluating G_0(s), which will tell us whether state 0 is transient or recurrent.
12
More on Classification
Today's lecture starts off with finishing up our analysis of the simple random walk using generating functions that was started last lecture. We wish to consider G_0(s), the generating function of the sequence f_0(n), for n ≥ 0, where

  f_0(n) = P(T_0 = n | X_0 = 0) = P(walk first returns to 0 at time n | X_0 = 0).

We condition on X_1, the first step of the random walk, to obtain via the Markov property that

  f_0(n) = P(T_0 = n | X_1 = 1) p + P(T_0 = n | X_1 = −1) q.

We follow similar arguments as those in the last lecture to evaluate the conditional probabilities. For P(T_0 = n | X_1 = −1), we may say that by the time and spatial homogeneity of the random walk, this is the same as the probability that we first reach state 1 at time n − 1 given that we start in state 0 at time 0. That is,

  P(T_0 = n | X_1 = −1) = P(T_1 = n − 1 | X_0 = 0) = f_1(n − 1).
By a similar reasoning, if we define T_{−1} to be the first time the walk reaches state −1 and define f_{−1}(n) = P(T_{−1} = n | X_0 = 0), then

  P(T_0 = n | X_1 = 1) = P(T_{−1} = n − 1 | X_0 = 0) = f_{−1}(n − 1),

so we have so far that

  f_0(n) = f_{−1}(n − 1) p + f_1(n − 1) q.

Now if we had p = q = 1/2, so that on each step the walk was equally likely to move up or down, reasoning by symmetry tells us that P(T_{−1} = n − 1 | X_0 = 0) = P(T_1 = n − 1 | X_0 = 0). For general p, we can also see by symmetry that, given that we start in state 0 at time 0, the distribution of T_{−1} is the same as the distribution of T_1 in a reflected random walk in which we interchange p and q. That is, if we let f*_1(n) denote the probability that the first time we reach state 1 is at time n given that we start in state 0 at time 0, but in a random walk with p and q interchanged, then what we are saying is that

  f*_1(n) = f_{−1}(n).

So, keeping in mind what the * means, we have

  f_0(n) = f*_1(n − 1) p + f_1(n − 1) q.
Therefore (since f_0(0) = 0),

  G_0(s) = Σ_{n=1}^{∞} s^n f_0(n)
         = Σ_{n=1}^{∞} s^n f*_1(n − 1) p + Σ_{n=1}^{∞} s^n f_1(n − 1) q
         = ps Σ_{n=1}^{∞} s^{n-1} f*_1(n − 1) + qs Σ_{n=1}^{∞} s^{n-1} f_1(n − 1)
         = ps G*_1(s) + qs G_1(s),

where G*_1(s) is the same function as G_1(s) except with p and q interchanged.
Now, recalling from the last lecture that

  G_1(s) = (1 − √(1 − 4pqs^2)) / (2qs),

we see that

  G*_1(s) = (1 − √(1 − 4pqs^2)) / (2ps),

and so

  G_0(s) = ps G*_1(s) + qs G_1(s)
         = ps (1 − √(1 − 4pqs^2)) / (2ps) + qs (1 − √(1 − 4pqs^2)) / (2qs)
         = 1 − √(1 − 4pqs^2).
Well, that took a bit of maneuvering, but we've ended up with quite a simple form for G_0(s). Now using this, it is easy to look at

  G_0(1) = Σ_{n=1}^{∞} f_0(n) = P(walk ever returns to 0 | X_0 = 0).

Plugging in s = 1, we get

  G_0(1) = 1 − √(1 − 4pq).

As we did before, we can simplify this by writing 1 − 4pq = 1 − 4(1 − q)q = 1 − 4q + 4q^2 = (1 − 2q)^2, so that

  G_0(1) = 1 − |1 − 2q|.

Now we can see that if q ≤ 1/2 then G_0(1) = 1 − (1 − 2q) = 2q. This equals one if q = 1/2 and is less than one if q < 1/2. On the other hand, if q > 1/2 then G_0(1) = 1 − (2q − 1) = 2 − 2q = 2p, and this is less than one when q > 1/2.
So we see that G_0(1) is less than one if q ≠ 1/2 and is equal to one if q = 1/2. In other words, state 0 is transient if q ≠ 1/2 and is recurrent if q = 1/2. Furthermore, since all states in the simple random walk communicate with one another, there is only one equivalence class, so that if state 0 is transient then all states are transient and if state 0 is recurrent then all states are recurrent. This should also be intuitively clear by the spatial homogeneity of the random walk. In any case, it makes sense to say that the random walk is transient if q ≠ 1/2 and is recurrent if q = 1/2.

Now we remark here that we have just seen that the random variables T_r in some cases had distributions that did not sum up to one. For example, as we have just seen,

  Σ_{n=1}^{∞} P(T_0 = n | X_0 = 0) < 1

if q ≠ 1/2. This is because, for q ≠ 1/2, it is the case that P(T_0 = ∞ | X_0 = 0) > 0. In words, T_0 equals infinity with positive probability. According to what you should have learned in a previous probability course, this disqualifies T_0 from being called a random variable when q ≠ 1/2. In fact, we say in this case that T_0 is a defective random variable. Defective random variables can arise sometimes in the study of stochastic processes, especially when the random quantity, let's call it T, denotes a time until the process first does something, like reach a certain state or set of states, because it may be that the process will never reach that set of states with some positive probability. In fact, knowing that a process may never reach a state or set of states with some positive probability is of central interest for some systems, such as population models where we want to know if the population will ever die out.
In the case where q = 1/2, of course, T_0 is a proper random variable. Here let us go back to the second property of generating functions that we covered last lecture, which is that for a random variable X with probability generating function G_X(s), the expected value of X is given by

  E[X] = G'_X(1).

When q = 1/2, the random variable T_0 has probability generating function

  G_0(s) = 1 − √(1 − s^2),

so that taking the derivative we get

  G'_0(s) = s / √(1 − s^2).

Setting s = 1 we see that G'_0(1) = +∞. That is, E[T_0] = +∞ when q = 1/2. So even though state 0 is recurrent when q = 1/2, we have here an illustration of a very special kind of recurrent state. Starting in state 0, we will return to state 0 with probability one, and so state 0 is recurrent. But the expected time to return to state 0 is +∞. In such a case we call the recurrent state a null recurrent state. This is another important classification of a state which we discuss next.

Remark: The text uses a more direct approach, using Proposition 4.1, to determine the conditions under which a simple random walk is transient or recurrent. Please read Example 4.13 to see this solution. Our approach using generating functions was chosen in part to introduce generating functions, which have uses much beyond analysing the simple random walk. We were also able to obtain more information about the random walk using generating functions.
Null Recurrent and Positive Recurrent States. Recurrent states in a Markov chain can be further categorized as null recurrent and positive recurrent, depending on the expected time to return to the state. If μ_i denotes the expected time to return to state i given that the chain starts in state i, then we say that state i is null recurrent if it is recurrent (i.e. P(eventually return to state i | X_0 = i) = 1) but the expected time to return is infinite:

  State i is null recurrent if μ_i = ∞.

Otherwise, we say that state i is positive recurrent if it is recurrent and the mean time to return is finite:

  State i is positive recurrent if μ_i < ∞.

A null recurrent state i is, like any recurrent state, returned to infinitely many times with probability 1. Even so, the probability of being in a null recurrent state is 0 in the limit:

  lim_{n→∞} p_{ji}(n) = 0,

for any starting state j. This interesting fact is something we'll prove later. A very useful fact, that we'll also prove later, is that null recurrence is a class property. We've already seen that recurrence is a class property. This is stronger. Not only do all states in an equivalence class have to be recurrent if one of them is, they all have to be null recurrent if one of them is. This also implies that they all have to be positive recurrent if one of them is.

So we see that the equivalence relationship of communication divides up any state space into disjoint equivalence classes, and each equivalence class has similar members, in that they are either all transient, all null recurrent, or all positive recurrent.
A finite Markov chain has at least one positive recurrent state: The text (on p.192) argues that a Markov chain with finitely many states must have at least one recurrent state. The argument is basically that not all states can be transient because if this were so then eventually we would run out of states to never return to if there are only finitely many states. We can also show that a Markov chain with finitely many states must have at least one positive recurrent state, which is slightly stronger. In particular, this implies that any finite Markov chain that has only one equivalence class has all states being positive recurrent.

Since we must be somewhere at time n, we must have Σ_{j∈S} p_{ij}(n) = 1 for any starting state i. That is, each n-step transition matrix P(n) is a stochastic matrix (the ith row is the distribution of where the process will be n steps later, starting in state i). This is true for every n, so we can take the limit as n → ∞ and get

  lim_{n→∞} Σ_{j∈S} p_{ij}(n) = 1.

But if S is finite, we can take the limit inside to get

  1 = Σ_{j∈S} lim_{n→∞} p_{ij}(n).

But if every state were transient or null recurrent we would have lim_{n→∞} p_{ij}(n) = 0 for every i and j. Thus we would get the contradiction that 1 = 0. Thus, there must be at least one positive recurrent state. Keep in mind that limits cannot in general be taken inside summations if it is an infinite sum. For example, lim_{n→∞} Σ_{j=1}^{∞} 1/n ≠ Σ_{j=1}^{∞} lim_{n→∞} 1/n. The LHS is +∞ and the RHS is 0.
Recurrent Classes are Closed. Now, in general, the state space of a Markov chain could have 1 or more transient classes, 1 or more null recurrent classes, and 1 or more positive recurrent classes. While state spaces with both transient and recurrent states are of great practical use (and we'll look at these in Section 4.6), many practical systems of interest are modeled by Markov chains for which all states are recurrent, and usually all positive recurrent. Even in this case, one can imagine that there could be 2 or more disjoint classes of recurrent states. The following Lemma says that when all states are recurrent we might as well assume that there is only one class because if we are in a recurrent class we will never leave it. In other words, we say that a recurrent class is closed.

Lemma: Any recurrent class is closed. That is, if state i is recurrent, and state j is not in the equivalence class containing state i, then p_{ij} = 0.

Proof. Suppose we start the chain in state i. If p_{ij} were positive, then with positive probability we could go to state j. But once in state j we could never go back to state i because if we could then i and j would communicate. This is impossible because j is not in the equivalence class containing i, by assumption. But never going back to state i is also impossible because state i is recurrent by assumption. Therefore, it must be that p_{ij} = 0.

Therefore, if all states are recurrent (null or positive recurrent) and there is more than one equivalence class, we may as well assume that the state space consists of just the equivalence class that the chain starts in.
Period of a State:
There is one more property of a state that we need to define, and that is the period of a state. In words, the period of a state i is the largest integer that evenly divides all the possible times that we could return to state i given that we start in state i. If we let d_i denote the period of state i, then mathematically we write this as

  d_i = period of state i = gcd{n : p_{ii}(n) > 0},

where gcd stands for greatest common divisor. Another way to say this is that if we start in state i, then we cannot return to state i at any time that is not a multiple of d_i.

Example: For the simple random walk, if we start in state 0, then we can only return to state 0 with a positive probability at times 2, 4, 6, ...; that is, only at even times. The greatest common divisor of this set of times is 2, so the period of state 0 is 2.
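The period can also be read off numerically from matrix powers. The sketch below (an added illustration) truncates the walk to the states −K, ..., K with reflection at the two ends, purely so that powers of a finite matrix can be computed; every step still changes the parity of the state, so the set of possible return times to 0, and hence the period, is unaffected by the truncation.

    from math import gcd
    import numpy as np

    K, p = 10, 0.5
    states = list(range(-K, K + 1))
    idx = {s: j for j, s in enumerate(states)}

    P = np.zeros((len(states), len(states)))
    for s in states:
        if s == K:
            P[idx[s], idx[s - 1]] = 1.0          # reflect at the top
        elif s == -K:
            P[idx[s], idx[s + 1]] = 1.0          # reflect at the bottom
        else:
            P[idx[s], idx[s + 1]] = p
            P[idx[s], idx[s - 1]] = 1.0 - p

    # collect the times n <= 20 at which a return to state 0 has positive probability
    Pn = np.eye(len(states))
    return_times = []
    for n in range(1, 21):
        Pn = Pn @ P
        if Pn[idx[0], idx[0]] > 1e-12:
            return_times.append(n)

    d = 0
    for n in return_times:
        d = gcd(d, n)
    print(return_times, d)      # only even times appear, and their gcd is 2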
If a state i is positive recurrent, then we will be interested in the limiting probability lim_{n→∞} p_{ii}(n). But before we do so we need to know the period of state i. This is because if d_i ≥ 2 then p_{ii}(n) = 0 for any n that is not a multiple of d_i. But then this limit will not exist unless it is 0, because the sequence {p_{ii}(n)} has infinitely many 0's in it. It turns out that the subsequence {p_{ii}(d_i n)} will converge to a nonzero limiting value.

If the period of state i is 1, then we are fine. The sequence {p_{ii}(n)} will have a proper limiting value. We call a state that has period 1 aperiodic.
Period is a Class Property. As we might have come to expect and certainly hope for, our last result for today is to show that all states in an equivalence class have the same period.

Lemma: If states i and j communicate they have the same period.

Proof. Let d_i be the period of state i and d_j be the period of state j. Since i and j communicate, there is some m and n such that p_{ij}(m) > 0 and p_{ji}(n) > 0. Using the Chapman-Kolmogorov equations again as we did on Monday, we have that

  p_{ii}(m + r + n) ≥ p_{ij}(m) p_{jj}(r) p_{ji}(n),

for any r ≥ 0, and the right hand side is strictly positive for any r such that p_{jj}(r) > 0 (because p_{ij}(m) and p_{ji}(n) are both strictly positive). Setting r = 0, we have p_{jj}(0) = 1 > 0, so p_{ii}(m + n) > 0. Therefore, d_i divides m + n. But since p_{ii}(m + r + n) > 0 for any r such that p_{jj}(r) > 0, we have that d_i divides m + r + n for any such r. But since d_i divides m + n it must also divide r. That is, d_i divides any r such that p_{jj}(r) > 0. In other words, d_i is a common divisor of the set {r : p_{jj}(r) > 0}. Since d_j is, by definition, the greatest common divisor of the above set of r values, we must have d_i ≤ d_j. Now repeat the same argument but interchange i and j, to get d_j ≤ d_i. Therefore, d_i = d_j.
By the above lemma, we can speak of the period of a class, and if
a Markov chain has just one class, then we can speak of the period
of the Markov chain. In particular, we will be interested for the next
little while in Markov chains that are irreducible (one class), aperiodic
(period 1) and positive recurrent. It is these Markov chains for which
the limiting probabilities both exist and are not zero. We call such
Markov chains ergodic.
13
Introduction to Stationary Distributions
We first briefly review the classification of states in a Markov chain with a quick example and then begin the discussion of the important notion of stationary distributions.
First, let's review a little bit with the following
Example: Suppose we have a Markov chain on the state space {1, 2, ..., 10} whose 10 × 10 transition matrix P has the following nonzero entries, listed row by row (all other entries are 0):

  row 1:  1
  row 2:  .3, .3, .1, .3
  row 3:  .6, .4
  row 4:  1
  row 5:  .4, .3, .3
  row 6:  .9, .1
  row 7:  1
  row 8:  .8, .2
  row 9:  1
  row 10: 1

Determine the equivalence classes, the period of each equivalence class, and whether each equivalence class is transient or recurrent. (The destinations of the transitions that matter for the classification are identified in the solution below.)
Solution: The state space is small enough (10 elements) that one effective way to determine classes is to just start following possible paths. When you see 1's in the matrix a good place to start is in a state with a 1 in the corresponding row. If we start in state 1, we see that the path 1 → 7 → 10 → 1 must be followed with probability 1. This immediately tells us that the set {1, 7, 10} is a recurrent class with period 3. Next, we see that if we start in state 9, then we just stay there forever. Therefore, {9} is a recurrent class with period 1. Similarly, we can see that {4} is a recurrent class with period 1. Next suppose we start in state 2. From state 2 we can go directly to states 2, 3, 4 or 5. We also see that from state 3, we can get to state 2 (by the path 3 → 8 → 2) and from state 5 we can get to state 2 (directly). Therefore, state 2 communicates with states 3 and 5. We don't need to check if state 2 communicates with states 1, 4, 7, 9, or 10 (why?). From state 2 we can get to state 6 (by the path 2 → 5 → 6) but from state 6 we must go to either state 4 or state 7, therefore from state 6 we cannot get to state 2. Therefore, states 2 and 6 do not communicate. Finally, we can see that states 2 and 8 do communicate. Therefore, {2, 3, 5, 8} is an equivalence class. It is transient because from this class we can get to state 4 (and never come back). Finally, its period is 1 because the period of state 2 is clearly 1 (we can start in state 2 and come back to state 2 in 1 step). The only state left that is still unclassified is state 6, which is in a class by itself {6} and is clearly transient. Note that p_{66}(n) = 0 for all n > 0, so the set of times at which we could possibly return to state 6 is the empty set. By convention, we will say that the greatest common divisor of the empty set is infinity, so the period of state 6 is infinity.
Sometimes a useful technique for determining the equivalence classes in a Markov chain is to draw what is called a state transition diagram, which is a graph with one node for each state and with a (directed) edge between nodes i and j if p_{ij} > 0. We also usually write the transition probability p_{ij} beside the directed edge between nodes i and j if p_{ij} > 0. For example, here is the state transition diagram for the previous example.

[State transition diagram: one node for each of the states 1, ..., 10, with a directed edge from i to j labelled p_{ij} for each pair with p_{ij} > 0.]
Figure 13.1: State Transition Diagram for Preceding Example
Since the diagram displays all one-step transitions pictorially, it is
usually easier to see the equivalence classes with the diagram than
just by looking at the transition matrix. It helps if the diagram can be
drawn neatly, with, for example, no edges crossing each other.
Usually when we construct a Markov model for some system the equivalence classes, if there are more than one, are apparent or obvious because we designed the model so that certain states go together and we designed them to be transient or recurrent.

Other times we may be trying to verify, modify, improve, or just understand someone else's (complicated) model, and one of the first things we may want to know is how to classify the states, and it may not be obvious or even easy to determine the equivalence classes if the state space is large and there are many transitions that don't follow a regular pattern. For S finite, the following algorithm determines T(i), the set of states accessible from i, F(i), the set of states from which i is accessible, and C(i) = F(i) ∩ T(i), the equivalence class of state i, for each state i:

1. For each state i ∈ S, let T(i) = {i} and F(i) = {}, the empty set.

2. For each state i ∈ S, do the following: For each state k ∈ T(i), add to T(i) all states j such that p_{kj} > 0 (if j is not already in T(i)). Repeat this step until no further addition is possible.

3. For each state i ∈ S, do the following: For each state j ∈ S, add state j to F(i) if state i is in T(j).

4. For each state i ∈ S, let C(i) = F(i) ∩ T(i).

Note that if C(i) = T(i) (the equivalence class containing i equals the set of states that are accessible from i), then C(i) is closed (hence recurrent since we are assuming S is finite for this algorithm). This algorithm is taken from An Introduction to Stochastic Processes, by Edward P. C. Kao, Duxbury Press, 1997. Also in this reference is the listing of a MATLAB implementation of this algorithm.
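A rough Python version of the same reachability algorithm is sketched below (an added illustration, not the MATLAB listing referred to above; the 3-state matrix at the end is a made-up example used only to exercise the code).

    def classify(P):
        # For each state i: T(i) = states accessible from i, F(i) = states from
        # which i is accessible, and C(i) = F(i) intersect T(i), the class of i.
        S = range(len(P))
        T = {}
        for i in S:                      # steps 1-2: grow T(i) until nothing new appears
            reach = {i}
            frontier = [i]
            while frontier:
                k = frontier.pop()
                for j in S:
                    if P[k][j] > 0 and j not in reach:
                        reach.add(j)
                        frontier.append(j)
            T[i] = reach
        F = {i: {j for j in S if i in T[j]} for i in S}   # step 3
        C = {i: T[i] & F[i] for i in S}                   # step 4
        return T, F, C

    # a small made-up chain: states 0 and 1 form a closed class, state 2 is transient
    P = [[0.5, 0.5, 0.0],
         [0.3, 0.7, 0.0],
         [0.2, 0.3, 0.5]]
    T, F, C = classify(P)
    print(C)              # {0: {0, 1}, 1: {0, 1}, 2: {2}}
    print(C[0] == T[0])   # True, so the class containing state 0 is closed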
Stationary Markov Chains

Now that we know the general architecture of a Markov chain, it's time to look at how we might analyse a Markov chain to make predictions about system behaviour. For this we'll first consider the concept of a stationary distribution. This is distinct from the notion of limiting probabilities, which we'll consider a bit later. First, let's define what we mean when we say that a process is stationary.

Definition: A (discrete-time) stochastic process {X_n : n ≥ 0} is stationary if for any time points i_1, ..., i_n and any m ≥ 0, the joint distribution of (X_{i_1}, ..., X_{i_n}) is the same as the joint distribution of (X_{i_1+m}, ..., X_{i_n+m}).

So stationary refers to stationary in time. In particular, for a stationary process, the distribution of X_n is the same for all n.

So why do we care if our Markov chain is stationary? Well, if it were stationary and we knew what the distribution of each X_n was then we would know a lot, because we would know the long run proportion of time that the Markov chain was in any state. For example, suppose that the process was stationary and we knew that P(X_n = 2) = 1/10 for every n. Then over 1000 time periods we should expect that roughly 100 of those time periods was spent in state 2, and over N time periods roughly N/10 of those time periods was spent in state 2. As N went to infinity, the proportion of time spent in state 2 will converge to 1/10 (this can be proved rigorously by some form of the Strong Law of Large Numbers). One of the attractive features of Markov chains is that we can often make them stationary and there is a nice and neat characterization of the distribution of X_n when it is stationary. We discuss this next.
Stationary Distributions

So how do we make a Markov chain stationary? If it can be made stationary (and not all of them can; for example, the simple random walk cannot be made stationary and, more generally, a Markov chain where all states were transient or null recurrent cannot be made stationary), then making it stationary is simply a matter of choosing the right initial distribution for X_0. If the Markov chain is stationary, then we call the common distribution of all the X_n the stationary distribution of the Markov chain.

Here's how we find a stationary distribution for a Markov chain.

Proposition: Suppose X is a Markov chain with state space S and transition probability matrix P. If π = (π_j, j ∈ S) is a distribution over S (that is, π is a (row) vector with |S| components such that Σ_j π_j = 1 and π_j ≥ 0 for all j ∈ S), then setting the initial distribution of X_0 equal to π will make the Markov chain stationary with stationary distribution π if

  π = πP.

That is,

  π_j = Σ_{i∈S} π_i p_{ij}  for all j ∈ S.

In words, π_j is the dot product between π and the jth column of P.
Proof: Suppose π satisfies the above equations and we set the distribution of X_0 to be π. Let's set π(n) to be the distribution of X_n (that is, π_j(n) = P(X_n = j)). Then

  π_j(n) = P(X_n = j) = Σ_{i∈S} P(X_n = j | X_0 = i) P(X_0 = i) = Σ_{i∈S} p_{ij}(n) π_i,

or, in matrix notation,

  π(n) = πP(n).

But, by the Chapman-Kolmogorov equations, we get

  π(n) = πP^n = (πP)P^{n-1} = πP^{n-1} = ... = πP = π.

We'll stop the proof here.
Note we haven't fully shown that the Markov chain X is stationary with this choice of initial distribution (though it is and not too difficult to show). But we have shown that by setting the distribution of X_0 to be π, the distribution of X_n is also π for all n ≥ 0, and this is enough to say that π_j can be interpreted as the long run proportion of time the Markov chain spends in state j (if such a π exists). We also haven't answered any questions about the existence or uniqueness of a stationary distribution. But let's finish off today with some examples.
Example: Consider just the recurrent class {1, 7, 10} in our first example today. The transition matrix for this class is

          1   7   10
  P =  1 [ 0   1   0 ]
       7 [ 0   0   1 ]
      10 [ 1   0   0 ].

Intuitively, the chain spends one third of its time in state 1, one third of its time in state 7, and one third of its time in state 10. One can easily verify that the distribution π = (1/3, 1/3, 1/3) satisfies π = πP, and so (1/3, 1/3, 1/3) is a stationary distribution.

Remark: Note that in the above example, p_{ii}(n) = 0 if n is not a multiple of 3 and p_{ii}(n) = 1 if n is a multiple of 3, for all i. Thus, clearly lim_{n→∞} p_{ii}(n) does not exist because these numbers keep jumping back and forth between 0 and 1. This illustrates that limiting probabilities are not exactly the same thing as stationary probabilities. We want them to be! Later we'll give just the right conditions for these two quantities to be equal.
Example: (Ross, p.257 #30). Three out of every four trucks on the road are followed by a car, while only one out of every five cars is followed by a truck. What fraction of vehicles on the road are trucks?

Solution: Imagine sitting on the side of the road watching vehicles go by. If a truck goes by the next vehicle will be a car with probability 3/4 and will be a truck with probability 1/4. If a car goes by the next vehicle will be a car with probability 4/5 and will be a truck with probability 1/5. We may set this up as a Markov chain with two states, 0 = truck and 1 = car, and transition probability matrix

          0     1
  P =  0 [ 1/4   3/4 ]
       1 [ 1/5   4/5 ].

The equations π = πP are

  π_0 = (1/4)π_0 + (1/5)π_1  and  π_1 = (3/4)π_0 + (4/5)π_1.

Solving, we have from the first equation that (3/4)π_0 = (1/5)π_1, or π_0 = (4/15)π_1. Plugging this into the constraint that π_0 + π_1 = 1 gives us that (4/15)π_1 + π_1 = 1, or (19/15)π_1 = 1, or π_1 = 15/19. Therefore, π_0 = 4/19. That is, as we sit by the side of the road, the long run proportion of vehicles that will be trucks is 4/19.
Remark: Note that we need the constraint that π_0 + π_1 = 1 in order to determine a solution. In general, we need the constraint that Σ_{j∈S} π_j = 1 in order to determine a solution. This is because the system of equations π = πP has just in itself infinitely many solutions (if π is a solution then so is cπ for any constant c). We need the normalization constraint basically to determine c to make π a proper distribution over S.
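As a cross-check of the hand calculation (an added sketch; it just solves the same two equations plus the normalization constraint numerically), one can stack π(P − I) = 0 together with Σ_j π_j = 1 as an overdetermined linear system and solve it by least squares:

    import numpy as np

    P = np.array([[1/4, 3/4],
                  [1/5, 4/5]])       # 0 = truck, 1 = car

    n = P.shape[0]
    A = np.vstack([P.T - np.eye(n), np.ones((1, n))])   # rows: (P^T - I), then the constraint
    b = np.zeros(n + 1)
    b[-1] = 1.0
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    print(pi)                        # approximately [4/19, 15/19] = [0.2105..., 0.7894...]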
14
Existence and Uniqueness
We now begin to answer some of the main theoretical questions con-
cerning Markov chains. The first, and perhaps most important, ques-
tion is under what conditions does a stationary distribution exist, and
if it exists is it unique? In general a Markov chain can have more
than one equivalence class. There are really only 3 combinations of
equivalence classes that we need to consider. These are 1) when there
is only one equivalence class, 2) when there are two or more classes,
all transient, and 3) when there are two or more classes with some
transient and some recurrent. As we have mentioned previously when
there are two or more classes and they are all recurrent, we can assume
that the whole state space is the class that we start the process in,
because such classes are closed. We will consider case (3) when we
get to Section 4.6 in the text and we will not really consider case (2),
as this does not arise very much in practice. Our main focus will be on
case (1). When there is only one equivalence class we say the Markov
chain is irreducible.
We will show that for an irreducible Markov chain, a stationary distri-
bution exists if and only if all states are positive recurrent, and in this
case the stationary distribution is unique.
We will start off by showing that if there is at least one recurrent state in our Markov chain, then there exists a solution to the equations π = πP, and we will demonstrate that solution by constructing it. First we'll try to get an intuitive sense of the construction. The basic property of Markov chains can be described as a starting over property. If we fix a state k and start out the chain in state k, then every time the chain returns to state k it starts over in a probabilistic sense. We say that the chain regenerates itself. Let us call the time that the chain spends moving about the state space from the initial time 0, where it starts in state k, to the time when it first returns to state k, a sojourn from state k back to state k. Successive sojourns all look the same and so what the chain does during one sojourn should, on average at least, be the same as what it does on every other sojourn. In particular, for any state i ≠ k, the number of times the chain visits state i during a sojourn should, again on average, be the same as in every other sojourn. If we accept this, then we should accept that the proportion of time during a sojourn that the chain spends in state i should be the same, again on average, for all sojourns. But this reasoning then leads us to expect that the proportion of time that the chain spends in state i over the long run should be the same as the proportion of time that the chain spends in state i during any sojourn, in particular the first sojourn from state k back to state k. But this is also how we interpret π_i, the stationary probability of state i, as the long run proportion of time the chain spends in state i. So this is how we will construct a vector to satisfy the equations π = πP. We will let the ith component of our solution be the expected number of visits to state i during the first sojourn. This should be proportional to a stationary distribution, if such a distribution exists.
Let us first set our notation. Define

  T_k = first time the chain visits state k, starting at time 1,
  N_i = the number of visits to state i during the first sojourn,
  ν_i(k) = E[N_i | X_0 = k].

Thus, ν_i(k) is the expected number of visits to state i during the first sojourn from state k back to state k. We define the (row) vector ν(k) = (ν_i(k))_{i∈S}, whose ith component is ν_i(k). Based on our previous discussion, our goal now is to show that the vector ν(k) satisfies ν(k) = ν(k)P. We should mention here that the sojourn from state k back to state k may never even happen if state k is transient, because the chain may never return to state k. Therefore, we assume that state k is recurrent, and it is exactly at this point that we need to assume it. Assuming state k is recurrent, the chain will return to state k with probability 1. Also, the sojourn includes the last step back to state k; that is, during this sojourn, state k is, by definition, visited exactly once. In other words, ν_k(k) = 1 (assuming state k is recurrent).

One other important thing to observe about ν_i(k) is that if we sum ν_i(k) over all i ∈ S, then that is the expected length of the whole sojourn. But the expected length of the sojourn is the mean time to return to state k, given that we start in state k. That is, if μ_k denotes the mean recurrence time to state k, then

  μ_k = Σ_{i∈S} ν_i(k).

If state k is positive recurrent then this sum will be finite and it will be infinite if state k is null recurrent.
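The quantities ν_i(k) are easy to estimate by simulation, which gives an informal preview of where this construction is heading. The sketch below (an added illustration) uses the two-state truck/car chain from the last lecture with k = 0: it estimates ν_0(0), ν_1(0) and the mean sojourn length, and one can check that the estimated μ_0 equals ν_0(0) + ν_1(0) and that ν(0) divided by μ_0 is close to the stationary distribution (4/19, 15/19) found earlier.

    import random

    # one-step transitions of the truck/car chain (0 = truck, 1 = car)
    def step(x, rng):
        p_to_0 = 1/4 if x == 0 else 1/5
        return 0 if rng.random() < p_to_0 else 1

    def one_sojourn(k, rng):
        # visit counts over one sojourn from k back to k
        # (the final return counts as the single visit to k)
        counts = {0: 0, 1: 0}
        x = step(k, rng)
        length = 1
        while x != k:
            counts[x] += 1
            x = step(x, rng)
            length += 1
        counts[k] += 1
        return counts, length

    rng = random.Random(3)
    trials = 200_000
    tot = {0: 0, 1: 0}
    tot_len = 0
    for _ in range(trials):
        c, n = one_sojourn(0, rng)
        tot[0] += c[0]; tot[1] += c[1]; tot_len += n

    nu0, nu1, mu0 = tot[0] / trials, tot[1] / trials, tot_len / trials
    print(nu0, nu1, mu0)            # about 1, 3.75 and 4.75 = nu0 + nu1
    print(nu0 / mu0, nu1 / mu0)     # about 4/19 and 15/19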
As we have done in previous examples, we will use indicator functions to represent the number of visits to state i during the first sojourn. If we define I_{X_n = i, T_k ≥ n} as the indicator of the event that the chain is in state i at time n and we have not yet revisited state k by time n (i.e. we are still in the first sojourn), then we may represent the total expected number of visits to state i during the first sojourn as

  ν_i(k) = Σ_{n=1}^{∞} E[ I_{X_n = i, T_k ≥ n} | X_0 = k ] = Σ_{n=1}^{∞} P(X_n = i, T_k ≥ n | X_0 = k).

(We are assuming here that i ≠ k.) Purely for the sake of shorter notation we will let ρ_{ki}(n) denote the conditional probability above:

  ρ_{ki}(n) = P(X_n = i, T_k ≥ n | X_0 = k),

so that now we will write

  ν_i(k) = Σ_{n=1}^{∞} ρ_{ki}(n).
We proceed by deriving an equation for ρ_{ki}(n), which will then give an equation for ν_i(k), and we will see that this equation is exactly the ith equation in ν(k) = ν(k)P. To derive the equation, we intersect the event {X_n = i, T_k ≥ n} with all possible values of X_{n−1}. Doing this is a special case of the following calculation in basic probability. If {B_j} is a partition such that P(∪_j B_j) = 1 and B_j ∩ B_{j'} = ∅, the empty set, for j ≠ j', then for any event A,

  P(A) = P(A ∩ (∪_j B_j)) = P(∪_j (A ∩ B_j)) = Σ_j P(A ∩ B_j),

because the B_j, and so the A ∩ B_j, are all disjoint.
For n = 1, we have ρ_{ki}(1) = P(X_1 = i, T_k ≥ 1 | X_0 = k) = p_{ki}, the 1-step transition probability from state k to state i. For n ≥ 2, we let B_j = {X_{n−1} = j} and A = {X_n = i, T_k ≥ n} in the previous paragraph, to get

  ρ_{ki}(n) = P(X_n = i, T_k ≥ n | X_0 = k) = Σ_{j∈S} P(X_n = i, X_{n−1} = j, T_k ≥ n | X_0 = k).

First we note that when j = k the above probability is 0, because the event {X_{n−1} = k} implies that the sojourn is over by time n − 1 while the event {T_k ≥ n} says that the sojourn is not over at time n − 1. Therefore, their intersection is the empty set. Thus,

  ρ_{ki}(n) = Σ_{j≠k} P(X_n = i, X_{n−1} = j, T_k ≥ n | X_0 = k).

Next, we note that the event above says that, given we start in state k, we go to state j at time n − 1 without revisiting state k in the meantime, and then go to state i in the next step. But this is just ρ_{kj}(n − 1) p_{ji}, and so

  ρ_{ki}(n) = Σ_{j≠k} ρ_{kj}(n − 1) p_{ji}.
This is our basic equation for ρ_{ki}(n), for n ≥ 2. Now, if we sum this over n ≥ 2 and use the fact that ρ_{ki}(1) = p_{ki}, we have

  ν_i(k) = Σ_{n=1}^{∞} ρ_{ki}(n)
         = p_{ki} + Σ_{n=2}^{∞} Σ_{j≠k} ρ_{kj}(n − 1) p_{ji}
         = p_{ki} + Σ_{j≠k} ( Σ_{n=2}^{∞} ρ_{kj}(n − 1) ) p_{ji}.
But Σ_{n=2}^{∞} ρ_{kj}(n − 1) = Σ_{n=1}^{∞} ρ_{kj}(n) is equal to ν_j(k), so we get the equation

  ν_i(k) = p_{ki} + Σ_{j≠k} ν_j(k) p_{ji}.

Now we use the fact that ν_k(k) = 1 to write

  ν_i(k) = ν_k(k) p_{ki} + Σ_{j≠k} ν_j(k) p_{ji} = Σ_{j∈S} ν_j(k) p_{ji}.

But now we are done, because this is exactly the ith equation in ν(k) = ν(k)P. So we have finished our construction. The vector ν(k), as we have defined it, has been shown to satisfy the matrix equation ν(k) = ν(k)P. Moreover, as was noted earlier, if state k is a positive recurrent state, then the components of ν(k) have a finite sum, so that

  π = ν(k) / Σ_{i∈S} ν_i(k)

is a stationary distribution. We have shown that if our Markov chain has at least one positive recurrent state, then there exists a stationary distribution π.
Now that we have shown that a stationary distribution exists if there
is at least one positive recurrent state, the next thing we want to show
is that if a stationary distribution does exist, then all states must be
positive recurrent and the stationary distribution is unique.
First, we can show that if a stationary distribution exists, then the Markov chain cannot be transient. If π is a stationary distribution, then π = πP. Multiplying both sides by P^{n-1} we get πP^{n-1} = πP^n. But we can reduce the left hand side down to π by successively applying the relationship π = πP. Therefore, we have the relationship π = πP^n for any n ≥ 1, which in more detailed form is

  π_j = Σ_{i∈S} π_i p_{ij}(n),

for every j ∈ S and all n ≥ 1, where p_{ij}(n) is the n-step transition probability from state i to state j.
Now consider what happens when we take the limit as n → ∞ in the above equality. When we look at

  lim_{n→∞} Σ_{i∈S} π_i p_{ij}(n),

if we can take the limit inside the summation, then we could use the fact that lim_{n→∞} p_{ij}(n) = 0 for all i, j ∈ S if all states are transient (recall the Corollary we showed at the end of Lecture 10), to conclude that π_j must equal zero for all j ∈ S. It turns out we can take the limit inside the summation, but we should be careful because the summation is in general an infinite sum, and limits cannot be taken inside infinite sums in general (recall the example that +∞ = lim_{n→∞} Σ_{i=1}^{∞} 1/n ≠ Σ_{i=1}^{∞} lim_{n→∞} 1/n = 0). The fact that we can take the limit inside the summation here is a consequence of the fact that we can uniformly bound the vector (π_i p_{ij}(n))_{i∈S} by a summable vector (uniformly means we can find a bound that works for all n). In particular, since p_{ij}(n) ≤ 1 for all n, we have that π_i p_{ij}(n) ≤ π_i for all i ∈ S. The fact that this allows us to take the limit inside the summation is an instance of a more general result known as the
bounded convergence theorem. This is a well-known and useful result in probability, but we won't invoke its use here, as we can show directly that we can take the limit inside the summation, as follows. Let F be any finite subset of the state space S. Then we can write

  lim_{n→∞} Σ_{i∈S} π_i p_{ij}(n) = lim_{n→∞} Σ_{i∈F} π_i p_{ij}(n) + lim_{n→∞} Σ_{i∈F^c} π_i p_{ij}(n)
                                  ≤ lim_{n→∞} Σ_{i∈F} π_i p_{ij}(n) + Σ_{i∈F^c} π_i,

from the inequality p_{ij}(n) ≤ 1. But for the first finite summation, we can take the limit inside, so we get that the limit of the first sum (over F) is 0. Therefore,

  lim_{n→∞} Σ_{i∈S} π_i p_{ij}(n) ≤ Σ_{i∈F^c} π_i,

for any finite subset F of S. But since Σ_{i∈S} π_i = 1 is a convergent sum, for any ε > 0, we can take the set F so large (but still finite) to make Σ_{i∈F^c} π_i < ε. This implies that

  lim_{n→∞} Σ_{i∈S} π_i p_{ij}(n) ≤ ε

for every ε > 0. But the only way this can be true is if the above limit is 0. Therefore, going back to our original argument, we see that if all states are transient, this implies that π_j = 0 for all j ∈ S. This is clearly impossible since the components of π must sum to 1. Therefore, if a stationary distribution exists for an irreducible Markov chain, all states must be recurrent.
We end here with another attempt at some intuitive understanding, this time of why the stationary distribution π, if it did exist, might be unique. In particular, let us try to see why we might expect that π_i = 1/μ_i, where μ_i is the mean recurrence time to state i. Suppose we start the chain in state i and then observe the chain over N time periods, where N is large. Over those N time periods, let n_i be the number of times that the chain revisits state i. If N is large, we expect that n_i/N is approximately equal to π_i, and indeed should converge to π_i as N went to infinity. On the other hand, if the times that the chain returned to state i were uniformly spread over the times from 0 to N, then each time state i was visited the chain would return to state i after N/n_i steps. For example, if the chain visited state i 10 times in 100 steps and the times it returned to state i were uniformly spread, then the chain would have returned to state i every 100/10 = 10 steps. In reality, the return times to state i vary, perhaps a lot, over the different returns to state i. But if we average all these return times (meaning the arithmetic average), then this average behaves very much like the return time when all the return times are the same. So we should expect that the average return time to state i should be close to N/n_i, when N is very large (note that as N grows, so does n_i), and as N went to infinity, the ratio N/n_i should actually converge to μ_i, the mean return time to state i. Given these two things, that π_i should be close to n_i/N and μ_i should be close to N/n_i, we should expect their product to be 1; that is, π_i μ_i = 1, or π_i = 1/μ_i. Note that if this relationship holds, then this directly relates the stationary distribution to the null or positive recurrence of the chain, through the mean recurrence times μ_i. If π_i is positive, then μ_i must be finite, and hence state i must be positive recurrent. Also, the stationary distribution must be unique, because the mean recurrence times are unique. Next we will prove more rigorously that the relationship π_i μ_i = 1 does indeed hold and we will furthermore show that if the stationary distribution exists then all states must be positive recurrent.
15
Existence and Uniqueness (contd)
Previously we saw how to construct a vector ν(k) that satisfies the equations ν(k) = ν(k)P, when P is the transition matrix of an irreducible, recurrent Markov chain. Note that we didn't need the chain to be positive recurrent, just recurrent. As an example, consider the simple random walk with p = 1/2. We have seen that this Markov chain is irreducible and null recurrent. The transition matrix is the doubly infinite matrix

        [ ⋱   ⋱    ⋱              ]
  P =   [    1/2   0   1/2        ]
        [         1/2   0   1/2   ]
        [              ⋱   ⋱    ⋱ ]

(each row has 1/2 just to the left and just to the right of the diagonal, and 0 everywhere else), and one can easily verify that the vector ν = (..., 1, 1, 1, ...) satisfies ν = νP (any constant multiple of ν will also work). However, ν cannot be a stationary distribution because its components sum to infinity. Today we will show that if a stationary distribution exists for an irreducible Markov chain, then it must be a positive recurrent Markov chain. Moreover, the stationary distribution is unique.
Last time we gave a (hopefully) intuitive argument as to why, if a stationary distribution did exist, we might expect that π_i μ_i = 1, where μ_i is the mean time to return to state i, given that we start in state i. We'll prove this rigorously now. So assume that a stationary distribution π exists, and let the initial distribution of X_0 be π, so that we make our process stationary. Let T_i be the first time we enter state i, starting from time 1 (this is the same definition of T_i as in the last lecture). So we have that

  μ_i = E[T_i | X_0 = i]

and also

  π_i μ_i = E[T_i | X_0 = i] P(X_0 = i).

We wish to show that this equals one, and the first thing we do is write out the expectation, but in a somewhat nonstandard form. The random variable T_i is defined on the nonnegative integers, and there is a useful way to represent the mean of such a random variable, as follows:

  E[T_i | X_0 = i] = Σ_{k=1}^{∞} k P(T_i = k | X_0 = i)
                   = Σ_{k=1}^{∞} ( Σ_{n=1}^{k} 1 ) P(T_i = k | X_0 = i)
                   = Σ_{n=1}^{∞} Σ_{k=n}^{∞} P(T_i = k | X_0 = i)
                   = Σ_{n=1}^{∞} P(T_i ≥ n | X_0 = i),

by interchanging the order of summation in the third equality.
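This tail-sum representation of the mean is easy to check numerically on a small example (added here; the pmf is arbitrary):

    # E[T] = sum_{n >= 1} P(T >= n) for a variable T on the positive integers
    pmf = {1: 0.2, 2: 0.5, 3: 0.1, 4: 0.2}

    mean_direct = sum(k * p for k, p in pmf.items())
    mean_by_tails = sum(sum(p for k, p in pmf.items() if k >= n) for n in range(1, 5))
    print(mean_direct, mean_by_tails)    # both equal 2.3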
So we have that

  π_i μ_i = Σ_{n=1}^{∞} P(T_i ≥ n | X_0 = i) P(X_0 = i) = Σ_{n=1}^{∞} P(T_i ≥ n, X_0 = i).

Now for n = 1, we have P(T_i ≥ 1, X_0 = i) = P(X_0 = i), while for n ≥ 2, we write

  P(T_i ≥ n, X_0 = i) = P(X_{n−1} ≠ i, X_{n−2} ≠ i, ..., X_1 ≠ i, X_0 = i).

Now for any events A and B, we have that

  P(A ∩ B) = P(A) − P(A ∩ B^c),

which follows directly from P(A) = P(A ∩ B) + P(A ∩ B^c). With A = {X_{n−1} ≠ i, ..., X_1 ≠ i} and B = {X_0 = i} we get

  π_i μ_i = P(X_0 = i) + Σ_{n=2}^{∞} [ P(X_{n−1} ≠ i, ..., X_1 ≠ i) − P(X_{n−1} ≠ i, ..., X_1 ≠ i, X_0 ≠ i) ]
          = P(X_0 = i) + Σ_{n=2}^{∞} [ P(X_{n−2} ≠ i, ..., X_0 ≠ i) − P(X_{n−1} ≠ i, ..., X_1 ≠ i, X_0 ≠ i) ],

where we did a shift in index to get the last expression. This shift is allowed because we are assuming the process is stationary.
We are almost done now. To make notation a bit less clunky, let's define

  a_n ≡ P(X_n ≠ i, ..., X_0 ≠ i).

Our expression for π_i μ_i can now be written as

  π_i μ_i = P(X_0 = i) + Σ_{n=2}^{∞} (a_{n−2} − a_{n−1})
          = P(X_0 = i) + a_0 − a_1 + a_1 − a_2 + a_2 − a_3 + ...

The above sum is what is called a telescoping sum because of the way the partial sums collapse. Indeed, the nth partial sum is

  P(X_0 = i) + a_0 − a_n,

so that the infinite sum (by definition the limit of the partial sums) is

  π_i μ_i = P(X_0 = i) + a_0 − lim_{n→∞} a_n.

Two facts give us our desired result that π_i μ_i = 1. The first is the simple fact that a_0 = P(X_0 ≠ i), so that

  P(X_0 = i) + a_0 = P(X_0 = i) + P(X_0 ≠ i) = 1.

The second fact is that

  lim_{n→∞} a_n = 0.

This fact is not completely obvious. To see this, note that this limit is the probability that the chain never visits state i. Suppose the chain starts in some arbitrary state j. Because j is recurrent, by the Markov property it will be revisited infinitely often with probability 1. Since the chain is irreducible there is some n such that p_{ji}(n) > 0. Thus on each visit to j there is some positive probability that i will be visited after a finite number of steps. So the situation is like flipping a coin with a positive probability of heads. It is not hard to see that a heads will eventually be flipped with probability one.
Thus, we're done. We've shown that π_i μ_i = 1 for any state i. Note that the only thing we've assumed is that the chain is irreducible and that a stationary distribution exists. The fact that π_i μ_i = 1 has several important implications. One, obviously, is that

  μ_i = 1/π_i.

That is, the mean time to return to state i can be computed by determining the stationary probability π_i, if possible. Another implication is that if a stationary distribution exists, then it must be unique, because the mean recurrence times μ_i are obviously unique. The third important implication is that

  π_i = 1/μ_i.

This immediately implies that if state i is positive recurrent (which means by definition that μ_i < ∞), then π_i > 0. In fact, we're now in a position to prove that positive recurrence is a class property (recall that when we stated this fact, we delayed the proof of it till later. That later is now). We are still assuming that a stationary distribution exists. As we have seen before, this implies that

  π_j = Σ_{i∈S} π_i p_{ij}(n),

for every n ≥ 1 and every j ∈ S. Suppose that π_j = 0 for some state j. Then, that implies that

  0 = Σ_{i∈S} π_i p_{ij}(n),

for that particular j, and for every n ≥ 1.
But since the state space is irreducible (all states communicate with one another), for every i there is some n such that p_{ij}(n) > 0. This implies that π_i must be 0 for every i ∈ S. But this is impossible because the π_i must sum to one. So we have shown that if a stationary distribution exists, then π_i must be strictly positive for every i. This implies that all states must be positive recurrent. So, putting this together with our previous result that we can construct a stationary distribution if at least one state is positive recurrent, we see that if one state is positive recurrent, then we can construct a stationary distribution, and this in turn implies that all states must be positive recurrent. In other words, positive recurrence is a class property. Of course, this then implies that null recurrence is also a class property. Let's summarize the main results that we've proved over the last two lectures in a theorem:

Theorem. For an irreducible Markov chain, a stationary distribution exists if and only if all states are positive recurrent. In this case, the stationary distribution is unique and π_i = 1/μ_i, where μ_i is the mean recurrence time to state i.

So we can't make a transient or a null recurrent Markov chain stationary. Also, if the Markov chain has two or more equivalence classes (we say the Markov chain is reducible), then in general there will be many stationary distributions. One of the Stat855 problems is to give an example of this. In these cases, there are different questions to ask about the process, as we shall see. Also note that there are no conditions on the period of the Markov chain for the existence and uniqueness of the stationary distribution. This is not true when we consider limiting probabilities, as we shall also see.
Example: (Ross, p.229 #26, extended). Three out of every four trucks on the road are followed by a car, while only one out of every five cars is followed by a truck. If I see a truck pass me by on the road, on average how many vehicles pass before I see another truck?

Solution: Recall that we set this up as a Markov chain in which we imagine sitting on the side of the road watching vehicles go by. If a truck goes by the next vehicle will be a car with probability 3/4 and will be a truck with probability 1/4. If a car goes by the next vehicle will be a car with probability 4/5 and will be a truck with probability 1/5. If we let X_n denote the type of the nth vehicle that passes by (0 for truck and 1 for car), then {X_n : n ≥ 1} is a Markov chain with two states (0 and 1) and transition probability matrix

          0     1
  P =  0 [ 1/4   3/4 ]
       1 [ 1/5   4/5 ].

The equations π = πP are

  π_0 = (1/4)π_0 + (1/5)π_1  and  π_1 = (3/4)π_0 + (4/5)π_1,

which, together with the constraint π_0 + π_1 = 1, we had solved previously to yield π_0 = 4/19 and π_1 = 15/19. If I see a truck pass by then the average number of vehicles that pass by before I see another truck corresponds to the mean recurrence time to state 0, given that I am currently in state 0. By our theorem, the mean recurrence time to state 0 is μ_0 = 1/π_0 = 19/4, which is roughly 5 vehicles.
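For this small chain the answer can also be double-checked by first-step analysis (a different route from the theorem used above, added here as a sketch): if m_1 denotes the expected number of vehicles until the next truck starting from a car, then m_1 = 1 + (4/5)m_1 and μ_0 = 1 + (3/4)m_1.

    from fractions import Fraction

    m1 = Fraction(1) / (1 - Fraction(4, 5))     # solves m1 = 1 + (4/5) m1
    mu0 = 1 + Fraction(3, 4) * m1               # mu0 = 1 + (3/4) m1
    print(m1, mu0)                              # 5 and 19/4, matching 1/pi_0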
16
Example of PGF for π / Some Number Theory
Today we'll start with another example illustrating the calculation of the mean time to return to a state in a Markov chain by calculating the stationary probability of that state, but this time through the use of the probability generating function (pgf) of the stationary distribution.

Example: I'm taking a lot of courses this term. Every Monday I get 2 new assignments with probability 2/3 and 3 new assignments with probability 1/3. Every week, between Monday morning and Friday afternoon, I finish 2 assignments (they might be new ones or ones unfinished from previous weeks). If I have any unfinished assignments on Friday afternoon, then I find that over the weekend, independently of anything else, I finish one assignment by Monday morning with probability c and don't finish any of them with probability 1 − c. If the term goes on forever, how many weeks is it before I can expect a weekend with no homework to do?
Solution: Let X_n be the number of unfinished homeworks at the end of the nth Friday after term starts, where X_0 = 0 is the number of unfinished homeworks on the Friday before term starts. Then {X_n : n ≥ 0} is a Markov chain with state space S = {0, 1, 2, ...}. Some transition probabilities are, for example,

  0 → 0 with probability 2/3 (2 new ones on Monday)
  0 → 1 with probability 1/3 (3 new ones on Monday)
  1 → 0 with probability 2c/3
  1 → 1 with probability c/3 + 2(1 − c)/3 = (2 − c)/3
  1 → 2 with probability (1 − c)/3,
and, in general, if I have i ≥ 1 unfinished homeworks on a Friday afternoon, then the transition probabilities are given by

  i → i − 1 with probability 2c/3,
  i → i     with probability c/3 + 2(1 − c)/3 = (2 − c)/3,
  i → i + 1 with probability (1 − c)/3.

The transition probability matrix for this Markov chain is given by

            0    1    2    3    4   ...
       0 [ 2/3  1/3   0               ]
       1 [  q    r    p    0          ]
  P =  2 [  0    q    r    p    0     ]
       3 [  0    0    q    r    p   0 ]
       4 [  0    0    0    ⋱    ⋱   ⋱ ]
       ⋮

where

  q = 2c/3,
  r = (2 − c)/3,
  p = (1 − c)/3,

and q + r + p = 1. In the parlance of Markov chains, this process is an example of a random walk with a reflecting barrier at 0.
We should remark here that it's not at all clear that this Markov chain always has a stationary distribution for every c ∈ [0, 1]. On the one hand, if c = 1, so that I always do a homework over the weekend if there is one to do, then I will never have more than one unfinished homework on a Friday afternoon. This case corresponds to p = 0, and we can see from the transition matrix that states {0, 1} will be a closed, positive recurrent class, while the states {2, 3, ...} will be a transient class of states. On the other extreme, if c = 0, so that I never do a homework on the weekend, then every time I get 3 new homeworks on a Monday, my backlog of unfinished homeworks will increase by one permanently. In this case q = 0 and one can see from the transition matrix that I never reduce my number of unfinished homeworks, and eventually my backlog of unfinished homeworks will go off to infinity. We call such a system unstable. Stability can often be a major design issue for complex systems that service jobs/tasks/processes (generically, customers). A stochastic model can be invaluable for providing insight into the parameters affecting the stability of a system. For our example here, there should be some threshold value c_0 such that the system is stable for c > c_0 and unstable for c < c_0. One valuable use of stationary distributions comes from the mere fact of their existence. If we can find those values of c for which a stationary distribution exists, then it is for those values of c that the system is stable.
So we look for a stationary distribution. Note that if we find one, then the answer to our question of how many weeks we have to wait on average for a homework-free weekend is µ_0 = 1/π_0, the mean recurrence time to state 0, our starting state. A stationary distribution π = (π_0, π_1, . . .) must satisfy π = πP, which we write out as

π_0 = (2/3)π_0 + qπ_1
π_1 = (1/3)π_0 + rπ_1 + qπ_2
π_2 = pπ_1 + rπ_2 + qπ_3
  ...
π_i = pπ_{i-1} + rπ_i + qπ_{i+1}
  ...

A direct attack on this system of linear equations is possible, by expressing π_i in terms of π_0, and then summing π_i over all i to get π_0 using the constraint that Σ_{i=0}^∞ π_i = 1. However, this approach is somewhat cumbersome. A more elegant approach is to use the method of generating functions. This method can often be applied to solve a linear system of equations, especially when there are an infinite number of equations, in situations where each equation only involves variables close to one another (for example, each of the equations above involves only two or three consecutive variables) and all, or almost all, of the equations have a regular form (as in π_i = pπ_{i-1} + rπ_i + qπ_{i+1}). By multiplying the ith equation above by s^i and then summing over i we collapse the above infinite set of equations into just a single equation for the generating function.
Let G(s) = Σ_{i=0}^∞ s^i π_i denote the generating function of the stationary distribution π. If we multiply the ith equation in π = πP by s^i and sum over i, we obtain

Σ_{i=0}^∞ s^i π_i = (2/3)π_0 + (1/3)π_0 s + p Σ_{i=2}^∞ s^i π_{i-1} + r Σ_{i=1}^∞ s^i π_i + q Σ_{i=0}^∞ s^i π_{i+1}.

The left hand side is just G(s) while the sums on the right hand side are not difficult to express in terms of G(s) with a little bit of manipulation. In particular,

p Σ_{i=2}^∞ s^i π_{i-1} = ps Σ_{i=2}^∞ s^{i-1} π_{i-1} = ps Σ_{i=1}^∞ s^i π_i = ps Σ_{i=0}^∞ s^i π_i - ps π_0 = psG(s) - ps π_0.

Similarly,

r Σ_{i=1}^∞ s^i π_i = r Σ_{i=0}^∞ s^i π_i - r π_0 = rG(s) - r π_0

and

q Σ_{i=0}^∞ s^i π_{i+1} = (q/s) Σ_{i=0}^∞ s^{i+1} π_{i+1} = (q/s) Σ_{i=1}^∞ s^i π_i = (q/s) Σ_{i=0}^∞ s^i π_i - (q/s) π_0 = (q/s)G(s) - (q/s)π_0.

Therefore, the equation we obtain for G(s) is

G(s) = (2/3)π_0 + (s/3)π_0 + psG(s) - ps π_0 + rG(s) - r π_0 + (q/s)G(s) - (q/s)π_0.
Collecting like terms, we have

G(s)[1 - ps - r - q/s] = π_0[2/3 + s/3 - ps - r - q/s].

To get rid of the fractions, we'll multiply both sides by 3s, giving

G(s)[3s - 3ps^2 - 3rs - 3q] = π_0[2s + s^2 - 3ps^2 - 3rs - 3q]

G(s) = π_0(2s + s^2 - 3ps^2 - 3rs - 3q) / (3s - 3ps^2 - 3rs - 3q).

In order to determine the unknown π_0 we use the boundary condition G(1) = 1, which must be satisfied if π is to be a stationary distribution. This boundary condition also gives us a way to check for the values of c for which the stationary distribution exists. If a stationary distribution does not exist, then we will not be able to satisfy the condition G(1) = 1. Plugging in s = 1, we obtain

G(1) = π_0(2 + 1 - 3p - 3r - 3q) / (3 - 3p - 3r - 3q).

However, we run into a problem here due to the fact that p + r + q = 1, which means that G(1) is the indeterminate form 0/0. Therefore, we use L'Hospital's rule to determine the limiting value of G(s) as s → 1. This gives

lim_{s→1} G(s) = π_0 [lim_{s→1}(2 + 2s - 6ps - 3r)] / [lim_{s→1}(3 - 6ps - 3r)] = π_0 (4 - 6p - 3r)/(3 - 6p - 3r).
We had previously defined our quantities p, r and q in terms of c to make it easier to write down the transition matrix P, but now we would like to re-express these back in terms of c to make it simpler to see when lim_{s→1} G(s) = 1 is possible. Recall that p = (1 - c)/3, r = (2 - c)/3 and q = 2c/3, so that 4 - 6p - 3r = 4 - 2(1 - c) - (2 - c) = 3c and 3 - 6p - 3r = 3c - 1. So in terms of c, we have

lim_{s→1} G(s) = π_0 · 3c/(3c - 1).

In order to have a proper stationary distribution, we must have the left hand side equal to 1 and we must have 0 < π_0 < 1. Together these imply that we must have 3c/(3c - 1) > 1, which will only be true if 3c - 1 > 0, or c > 1/3. Thus, we have found our threshold value of c_0 = 1/3 such that the system is stable (since it has a stationary distribution) for c > c_0 and is unstable for c ≤ c_0. Assuming c > 1/3 so that the system is stable, we may now solve for π_0 through the relationship

1 = π_0 · 3c/(3c - 1)   ⟹   π_0 = (3c - 1)/(3c).

The answer to our original question of what is the mean number of weeks until a return to state 0 is

µ_0 = 1/π_0 = 3c/(3c - 1).

Observe that we have found a mean return time of interest, µ_0, in terms of a system parameter, c. More generally, a typical thing we try to do in stochastic modeling is find out how some performance measure of interest depends, explicitly or even just qualitatively, on one or more system parameters. In particular, if we have some control
over one or more of those system parameters, then we have a useful tool to help us design our system. For example, if I wanted to design my homework habits so that I could expect to have a homework-free weekend in six weeks, I can solve for c to make µ_0 ≤ 6. This gives

µ_0 = 3c/(3c - 1) ≤ 6   ⟺   3c ≤ 18c - 6   ⟺   c ≥ 2/5.
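As a numerical sanity check of µ_0 = 3c/(3c - 1), the sketch below (assuming NumPy; the truncation level N is an arbitrary choice introduced only for this check, not part of the notes) builds a truncated version of the transition matrix, solves for its stationary distribution, and compares 1/π_0 with the formula.

    import numpy as np

    def truncated_P(c, N=200):
        """Transition matrix of the homework chain, truncated at state N."""
        q, r, p = 2*c/3, (2 - c)/3, (1 - c)/3
        P = np.zeros((N + 1, N + 1))
        P[0, 0], P[0, 1] = 2/3, 1/3
        for i in range(1, N):
            P[i, i - 1], P[i, i], P[i, i + 1] = q, r, p
        P[N, N - 1], P[N, N] = q, r + p       # lump the tail back into state N
        return P

    def stationary(P):
        """Solve pi = pi P with sum(pi) = 1 by replacing one equation with the constraint."""
        n = P.shape[0]
        A = np.vstack([(P.T - np.eye(n))[:-1], np.ones(n)])
        b = np.zeros(n); b[-1] = 1.0
        return np.linalg.lstsq(A, b, rcond=None)[0]

    for c in [0.5, 0.7, 0.9]:
        pi = stationary(truncated_P(c))
        print(c, 1/pi[0], 3*c/(3*c - 1))      # numerical mu_0 vs. the formula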
Let us now return to some general theory. We've already proved one of the main general theorems concerning Markov chains, which we emphasized by writing it in a framed box near the end of the previous lecture. This was the theorem concerning the conditions for the existence and uniqueness of a stationary distribution for a Markov chain. We reiterate here that there were no conditions on the period of the Markov chain for that result. The other main theoretical result concerning Markov chains has to do with the limiting probabilities lim_{n→∞} p_ij(n). For this result the period does matter. Let's state what that result is now: when the stationary distribution π exists and the chain is aperiodic (so the chain is irreducible, positive recurrent, and aperiodic), p_ij(n) converges to the stationary probability π_j as n → ∞. Note that the limit does not depend on the starting state i. This is quite important. In words, for an irreducible, positive recurrent, aperiodic Markov chain, no matter where we start from and no matter what our initial distribution is, if we let the chain run for a long time then the distribution of X_n will be very much like the stationary distribution π.

An important first step in proving the above limiting result is to show that for an irreducible, positive recurrent, aperiodic Markov chain the n-step transition probability p_ij(n) is strictly positive for all n big enough. That is, there exists some integer M such that p_ij(n) > 0 for all n ≥ M. To show this we will need some results from basic number theory. We'll state and prove these results now.
Some Number Theory:
If we have an irreducible, positive recurrent, aperiodic Markov chain then we know that for any state j, the greatest common divisor (gcd) of the set of times n for which p_jj(n) > 0 is 1. If A_j ≡ {n_1, n_2, . . .} is this set of times, then this is an infinite set because, for example, there must be some finite n_0 such that p_jj(n_0) > 0. But that implies p_jj(2n_0) > 0 and in general p_jj(kn_0) > 0 for any positive integer k. For reasons which will become clearer in the next lecture, what we would like to be able to do is take some finite subset of A_j that also has gcd 1 and then show that every n large enough can be written as a linear combination of the elements of this finite subset, where the coefficients of the linear combination are all nonnegative integers. This is what we will show now, through a series of three results.
Result 1: Let n_1, n_2, . . . be a sequence of positive integers with gcd 1. Then there exists a finite subset b_1, . . . , b_r that has gcd 1.

Proof: Let b_1 = n_1 and b_2 = n_2 and let g = gcd(b_1, b_2). If g = 1 then we are done. If g > 1 let p_1, . . . , p_d be the distinct prime factors of g that are larger than 1 (if g > 1 it must have at least one prime factor larger than 1). For each p_k, k = 1, . . . , d, there must be at least one integer from {n_3, n_4, . . .} that p_k does not divide, because if p_k divided every integer in this set then, since it also divides both n_1 and n_2, it would be a common divisor of all the n's. But this contradicts our assumption that the gcd is 1. Therefore,

choose b'_3 from {n_3, n_4, . . .} such that p_1 does not divide b'_3
choose b'_4 from {n_3, n_4, . . .} such that p_2 does not divide b'_4
  ...
choose b'_{d+2} from {n_3, n_4, . . .} such that p_d does not divide b'_{d+2}.
Note that b'_3, . . . , b'_{d+2} do not need to be distinct. Let b_3, . . . , b_r be the distinct integers among b'_3, . . . , b'_{d+2}. Then b_1, b_2, . . . , b_r have gcd 1 because each p_k does not divide at least one of {b_3, . . . , b_r}, so that none of the p_k is a common divisor. On the other hand, the p_k's are the only integers greater than 1 that divide both b_1 and b_2. Therefore, there are no integers greater than 1 that divide all of b_1, . . . , b_r. So the gcd of b_1, . . . , b_r is 1.
Result 2: Let b_1, . . . , b_r be a finite set of positive integers with gcd 1. Then there exist integers a_1, . . . , a_r (not necessarily nonnegative) such that a_1b_1 + . . . + a_rb_r = 1.

Proof: Consider the set of all integers of the form c_1b_1 + . . . + c_rb_r as the c_i range over the integers. This set of integers has some least positive element ℓ. Let a_1, . . . , a_r be such that ℓ = a_1b_1 + . . . + a_rb_r. We are done if we show that ℓ = 1. To do this we will show that ℓ is a common divisor of b_1, . . . , b_r. Since b_1, . . . , b_r has gcd 1 by assumption, this shows that ℓ = 1. We will show that ℓ divides b_i by contradiction. Suppose that ℓ did not divide b_i. Then we can write b_i = qℓ + R, where q ≥ 0 is an integer and the remainder R satisfies 0 < R < ℓ. But then

R = b_i - qℓ = b_i - q Σ_{k=1}^r a_k b_k = (1 - qa_i)b_i + Σ_{k≠i} (-qa_k)b_k

is also of the form c_1b_1 + . . . + c_rb_r. But R < ℓ contradicts the minimality of ℓ. Therefore, ℓ must divide b_i.

Our final result for today, the one we are really after, uses Result 2 to show that every integer large enough can be written as a linear combination of b_1, . . . , b_r with nonnegative integer coefficients.
Result 3: Let b_1, . . . , b_r be a finite set of positive integers with gcd 1. Then there exists a positive integer M such that for every n > M there exist nonnegative integers d_1, . . . , d_r such that n = d_1b_1 + . . . + d_rb_r.

Proof: From Result 2, there exist integers a_1, . . . , a_r (which may be positive or negative) such that a_1b_1 + . . . + a_rb_r = 1. Now choose M = (|a_1|b_1 + . . . + |a_r|b_r)b_1, where | · | denotes absolute value. If n > M then we can write n as n = M + qb_1 + R, where q ≥ 0 is an integer and the remainder R satisfies 0 ≤ R < b_1. If R = 0 then we are done, as we can choose d_k = |a_k|b_1 for k ≠ 1 and d_1 = |a_1|b_1 + q. If 0 < R < b_1, then

n = M + qb_1 + R(1)
  = M + qb_1 + R(a_1b_1 + . . . + a_rb_r)
  = (|a_1|b_1 + q + Ra_1)b_1 + Σ_{k=2}^r (|a_k|b_1 + Ra_k)b_k
  = d_1b_1 + . . . + d_rb_r,

where d_1 = q + b_1|a_1| + Ra_1 ≥ q + (b_1 - R)|a_1| ≥ 0 since R < b_1, and d_k = b_1|a_k| + Ra_k ≥ (b_1 - R)|a_k| ≥ 0 also.
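To make Result 3 concrete, here is a small self-contained check (a sketch only; the particular set b = (6, 10, 15) is an illustrative choice, not from the notes). It finds coefficients as in Result 2 by brute force, forms the bound M used in the proof, and confirms that every n > M has a nonnegative representation.

    from math import gcd
    from itertools import product

    b = [6, 10, 15]                              # gcd(6, 10, 15) = 1 although no pair is coprime
    assert gcd(gcd(b[0], b[1]), b[2]) == 1

    # Coefficients a with a_1 b_1 + a_2 b_2 + a_3 b_3 = 1 (Result 2), found by brute force.
    a = next(a for a in product(range(-6, 7), repeat=len(b))
             if sum(ai * bi for ai, bi in zip(a, b)) == 1)

    M = sum(abs(ai) * bi for ai, bi in zip(a, b)) * b[0]   # the M used in the proof of Result 3

    limit = M + 200
    reachable = [False] * (limit + 1)            # reachable[x]: is x a nonnegative combination?
    reachable[0] = True
    for x in range(1, limit + 1):
        reachable[x] = any(x >= bk and reachable[x - bk] for bk in b)

    print("M =", M)
    print(all(reachable[n] for n in range(M + 1, limit + 1)))   # every n > M representable: True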
Result 3 is what we need to show that p_jj(n) > 0 for all n big enough in an irreducible, positive recurrent, aperiodic Markov chain. We will show this next and continue on to prove our main limit result p_ij(n) → π_j as n → ∞.
17
Limiting Probabilities
Last time we ended with some results from basic number theory that will allow us to show that for an irreducible, positive recurrent, aperiodic Markov chain, the n-step transition probability p_ij(n) > 0 for all n large enough. First, fix any state j. Next, choose a finite set of times b_1, . . . , b_r such that the gcd of b_1, . . . , b_r is 1 and p_jj(b_k) > 0 for all k = 1, . . . , r (we showed we can do this from our Result 1 from last time). Next, Result 2 tells us we can find integers a_1, . . . , a_r such that a_1b_1 + . . . + a_rb_r = 1. Now let n be any integer larger than M = (|a_1|b_1 + . . . + |a_r|b_r)b_1. Then Result 3 tells us there are nonnegative integers d_1, . . . , d_r such that n = d_1b_1 + . . . + d_rb_r. But now we have that

p_jj(n) ≥ p_jj(b_1) · · · p_jj(b_1) [d_1 times] × p_jj(b_2) · · · p_jj(b_2) [d_2 times] × · · · × p_jj(b_r) · · · p_jj(b_r) [d_r times]
        = p_jj(b_1)^{d_1} p_jj(b_2)^{d_2} · · · p_jj(b_r)^{d_r} > 0,

where the first inequality above follows because the right hand side is the probability of just a subset of the possible paths that go from state j to state j in n steps, and this probability is positive because b_1, . . . , b_r were chosen to have p_jj(b_k) > 0 for k = 1, . . . , r.
More generally, fix any two states i and j with i ≠ j. Since the chain is irreducible, there exists some m such that p_ij(m) > 0. But then, by the same bounding argument we may write

p_ij(m + n) ≥ p_ij(m) p_jj(n) > 0

for all n large enough.

Let me remind you again that if the period of the Markov chain is d, where d is larger than 1, then we cannot have p_jj(n) > 0 for all n big enough, because p_jj(n) = 0 for all n that are not a multiple of d. This is why the limiting probability will not exist. We can define a different limiting probability in this case, which we'll discuss later, but for now we are assuming that the Markov chain has period 1 (as well as being irreducible and positive recurrent).
Now we are ready to start thinking about the limit of p_ij(n) as n → ∞. We stated in the previous lecture that this limit should be π_j, the stationary probability of state j (where we know that the stationary distribution exists and is unique because we are working now under the assumption that the Markov chain is irreducible and positive recurrent). Equivalently, we may show that the difference π_j - p_ij(n) converges to 0. We can start off our calculations using the fact that π_j satisfies π_j = Σ_{k∈S} π_k p_kj(n) for every n ≥ 1 and that Σ_{k∈S} π_k = 1, to write

π_j - p_ij(n) = Σ_{k∈S} π_k p_kj(n) - p_ij(n)
             = Σ_{k∈S} π_k p_kj(n) - Σ_{k∈S} π_k p_ij(n)
             = Σ_{k∈S} π_k (p_kj(n) - p_ij(n)).

So now

lim_{n→∞} (π_j - p_ij(n)) = lim_{n→∞} [Σ_{k∈S} π_k (p_kj(n) - p_ij(n))]
                          = Σ_{k∈S} π_k lim_{n→∞} (p_kj(n) - p_ij(n)),

where taking the limit inside the (in general, infinite) sum above is justified because the vector (π_k |p_kj(n) - p_ij(n)|)_{k∈S} is uniformly bounded (meaning for every n) by the summable vector (π_k)_{k∈S}.
Coupling: Our goal now is to show that for any i, j, k ∈ S, we have

lim_{n→∞} (p_kj(n) - p_ij(n)) = 0.

This is probably the deepest theoretical result we will prove in this course. The proof uses a technique in probability called coupling. This technique has proven useful in a wide variety of probability problems in recent years, and can legitimately be called a modern technique. The exact definition of coupling is not important to us right now, but let's see how a coupling argument works for us in our present problem.

Suppose that X = {X_n : n ≥ 0} denotes our irreducible, positive recurrent, aperiodic Markov chain. Let Y = {Y_n : n ≥ 0} be another Markov chain that is independent of X but with the same transition matrix and the same state space as the X chain. We say that Y is an independent copy of X. We will start off our X chain in state i and start off our Y chain in state k. Then, as the argument goes, with probability 1 the X chain and the Y chain will come to a time when they are in the same state, say s. When this happens, we say that the two chains have coupled because, due to the Markov property, for any time n that is after this coupling time, the distribution of X_n and Y_n will be the same. In particular, their limiting distributions will be the same. This is a real and nontrivial result we are trying to prove here. It is not obvious that the limiting distributions of X_n and Y_n should be the same when the two chains started out in different states, and you should be skeptical of its validity without a proof.

We now give a more rigorous version of the above coupling argument to show that

lim_{n→∞} (p_kj(n) - p_ij(n)) = 0.
We start out by defining the bivariate process Z = {Z_n = (X_n, Y_n) : n ≥ 0} (bivariate in the sense that the dimension of Z_n is twice that of X_n), where the processes X and Y are independent (irreducible, positive recurrent, and aperiodic) Markov chains with the same transition matrix P and the same state space S as described on the previous page. Fix any state s ∈ S. According to the coupling argument, if the process Z starts in state (i, k), it should eventually reach the state (s, s) with probability 1. The first thing we need to do is prove that this is true. We do so by showing that Z is an irreducible, recurrent Markov chain. First we show that Z is a Markov chain. This should actually be intuitively clear, since the chains X and Y are independent. If (i_k, j_k), k = 0, . . . , n, are any n + 1 states in the state space S × S of Z, then we can work out in detail

P(Z_n = (i_n, j_n) | Z_{n-1} = (i_{n-1}, j_{n-1}), . . . , Z_0 = (i_0, j_0))
 = P(X_n = i_n, Y_n = j_n | X_{n-1} = i_{n-1}, Y_{n-1} = j_{n-1}, . . . , X_0 = i_0, Y_0 = j_0)
 = P(X_n = i_n | X_{n-1} = i_{n-1}, Y_{n-1} = j_{n-1}, . . . , X_0 = i_0, Y_0 = j_0)
   × P(Y_n = j_n | X_{n-1} = i_{n-1}, Y_{n-1} = j_{n-1}, . . . , X_0 = i_0, Y_0 = j_0)   (by independence)
 = P(X_n = i_n | X_{n-1} = i_{n-1}, . . . , X_0 = i_0) P(Y_n = j_n | Y_{n-1} = j_{n-1}, . . . , Y_0 = j_0)   (by independence)
 = P(X_n = i_n | X_{n-1} = i_{n-1}) P(Y_n = j_n | Y_{n-1} = j_{n-1})   (by the Markov property for X and Y)
 = P(X_n = i_n | X_{n-1} = i_{n-1}, Y_{n-1} = j_{n-1}) P(Y_n = j_n | X_{n-1} = i_{n-1}, Y_{n-1} = j_{n-1})   (by independence)
 = P(X_n = i_n, Y_n = j_n | X_{n-1} = i_{n-1}, Y_{n-1} = j_{n-1})   (by independence)
 = P(Z_n = (i_n, j_n) | Z_{n-1} = (i_{n-1}, j_{n-1})).
Thus, Z has the Markov property. Next, we show that the Z chain is irreducible. Let (i, k) and (j, ℓ) be any two states in the state space of Z. Then the n-step transition probability from state (i, k) to state (j, ℓ) is given by

P(Z_n = (j, ℓ) | Z_0 = (i, k))
 = P(X_n = j, Y_n = ℓ | X_0 = i, Y_0 = k)
 = P(X_n = j | X_0 = i, Y_0 = k) P(Y_n = ℓ | X_0 = i, Y_0 = k)   (by independence)
 = P(X_n = j | X_0 = i) P(Y_n = ℓ | Y_0 = k)   (by independence)
 = p_ij(n) p_kℓ(n).

Now we may use our result that there exists some integer M_1 such that p_ij(n) > 0 for every n > M_1, and there exists some integer M_2 such that p_kℓ(n) > 0 for every n > M_2. Letting M = max(M_1, M_2), we see that p_ij(n) p_kℓ(n) > 0 for every n > M. Thus the n-step transition probability in the Z chain, p_{(i,k),(j,ℓ)}(n), is positive for every n > M. Thus, state (j, ℓ) is accessible from state (i, k) in the Z chain. But since states (i, k) and (j, ℓ) were arbitrary, we see that all states must actually communicate with one another, so that the Z chain is irreducible, as desired.
It is worth remarking at this point that this is the only place in our proof that we require the X chain to be aperiodic. It is also worth mentioning that if the X chain were not aperiodic, then the Z chain would in general not be irreducible. Consider, for example, the following.

Example: As a simple example, suppose that the X chain has state space S = {0, 1} and transition probability matrix

P_X =
        0  1
    0 | 0  1
    1 | 1  0

so that the chain just moves back and forth between states 0 and 1 with probability 1 and so has period 2. Then the chain Z will have state space S × S = {(0, 0), (1, 1), (0, 1), (1, 0)} and transition matrix

P_Z =
           (0,0) (1,1) (0,1) (1,0)
    (0,0) |  0     1     0     0
    (1,1) |  1     0     0     0
    (0,1) |  0     0     0     1
    (1,0) |  0     0     1     0

From the above matrix it should be clear that the states {(0, 0), (1, 1)} form an equivalence class and the states {(0, 1), (1, 0)} form another equivalence class, so the chain has two equivalence classes and is not irreducible.
Finally, we show that the Z chain must be recurrent. We do so by demonstrating a stationary distribution for the Z chain. In fact, since Z is irreducible, demonstrating a stationary distribution leads to the stronger conclusion that Z is positive recurrent, even though we will only need that Z is recurrent. Let π be the stationary distribution of the X (and Y) chain. Then we will show that π_(i,k) = π_i π_k is the stationary probability of state (i, k) in the Z chain. First, summing over all states (i, k) ∈ S × S in the state space of Z, we obtain

Σ_{(i,k)∈S×S} π_(i,k) = Σ_{i∈S} Σ_{k∈S} π_i π_k = (Σ_{i∈S} π_i)(Σ_{k∈S} π_k) = (1)(1) = 1.

Next, we verify that the equations

π_(j,ℓ) = Σ_{(i,k)∈S×S} π_(i,k) p_{(i,k),(j,ℓ)}

are satisfied for every (j, ℓ) ∈ S × S. We have

π_(j,ℓ) = π_j π_ℓ = (Σ_{i∈S} π_i p_ij)(Σ_{k∈S} π_k p_kℓ) = Σ_{i∈S} Σ_{k∈S} π_i π_k p_ij p_kℓ = Σ_{(i,k)∈S×S} π_(i,k) p_{(i,k),(j,ℓ)},

as required. Thus the irreducible chain Z has a stationary distribution, which implies that it is positive recurrent. Recall that our goal was to show that if the Z chain starts out in state (i, k), where (i, k) is any arbitrary state, then it will eventually reach state (s, s) with probability 1. Now that we have shown that Z is irreducible and recurrent, this statement is immediately true by the argument on the bottom of p.134 of these notes.
Thus, if we let T denote the time that the Z chain first reaches state (s, s), then P(T < ∞ | Z_0 = (i, k)) = 1. Now we are ready to finish off our proof that p_ij(n) - p_kj(n) → 0 as n → ∞. The following calculations use the following basic properties of events: 1) for any events A, B and C with P(C) > 0, we have P(A ∩ B | C) ≤ P(A | C), and 2) for any events A and C with P(C) > 0 and any partition B_1, . . . , B_n, we have P(A | C) = Σ_{m=1}^n P(A ∩ B_m | C). For the partition B_1, . . . , B_n we will use B_m = {T = m} for m = 1, . . . , n - 1 and B_n = {T ≥ n}. Here's our main calculation:

p_ij(n) = P(X_n = j | X_0 = i)
 = P(X_n = j | X_0 = i, Y_0 = k)   (by independence)
 = Σ_{m=1}^{n-1} P(X_n = j, T = m | X_0 = i, Y_0 = k) + P(X_n = j, T ≥ n | X_0 = i, Y_0 = k)
 = Σ_{m=1}^{n-1} P(Y_n = j, T = m | X_0 = i, Y_0 = k) + P(X_n = j, T ≥ n | X_0 = i, Y_0 = k)
 = P(Y_n = j, T < n | X_0 = i, Y_0 = k) + P(X_n = j, T ≥ n | X_0 = i, Y_0 = k)
 ≤ P(Y_n = j | X_0 = i, Y_0 = k) + P(T ≥ n | X_0 = i, Y_0 = k)
 = P(Y_n = j | Y_0 = k) + P(T ≥ n | X_0 = i, Y_0 = k)
 = p_kj(n) + P(T ≥ n | X_0 = i, Y_0 = k).

I hope the only potentially slippery move we made in the above calculation is where we replaced X_n with Y_n in the 4th equality. If you see how that is done, that's good. I'll come back to that later in any case. For now, let's accept it and carry on because we're almost done.
Reiterating the result of that last set of calculations, we have

p_ij(n) ≤ p_kj(n) + P(T ≥ n | X_0 = i, Y_0 = k),

which we will write as

p_ij(n) - p_kj(n) ≤ P(T ≥ n | X_0 = i, Y_0 = k).

Now if we interchange the roles of i and k and interchange the roles of X and Y in the previous calculations, then we get

p_kj(n) - p_ij(n) ≤ P(T ≥ n | X_0 = i, Y_0 = k).

Taken together, the last two inequalities imply that

|p_ij(n) - p_kj(n)| ≤ P(T ≥ n | X_0 = i, Y_0 = k).

Now we are basically done because P(T < ∞ | X_0 = i, Y_0 = k) = 1 implies that

lim_{n→∞} P(T ≥ n | X_0 = i, Y_0 = k) = 0,

and we have our desired result that p_ij(n) - p_kj(n) → 0 as n → ∞, and then going way back to near the beginning of the argument we see that this gives us that p_ij(n) → π_j as n → ∞.

Note that the limit result lim_{n→∞} p_ij(n) = π_j is mostly a theoretical result rather than a computational result. But it's a very important theoretical result. It gives a rigorous justification to using the stationary distribution to analyse the performance of a real system. In practice systems do not start out stationary. What we can say, based on the limit result, is that we can analyse the system based on the stationary distribution when the system has been running for a while. We say that such systems have reached steady state or equilibrium.
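The convergence p_ij(n) → π_j is easy to watch numerically. The sketch below (assuming NumPy; the 3-state matrix is an arbitrary illustrative example, not one from the notes) raises a transition matrix to increasing powers and shows every row approaching the same stationary vector.

    import numpy as np

    # An arbitrary irreducible, aperiodic transition matrix on 3 states.
    P = np.array([[0.5, 0.3, 0.2],
                  [0.1, 0.6, 0.3],
                  [0.2, 0.4, 0.4]])

    # Stationary distribution: left eigenvector of P for eigenvalue 1, normalized.
    w, v = np.linalg.eig(P.T)
    pi = np.real(v[:, np.argmin(np.abs(w - 1))])
    pi /= pi.sum()

    for n in [1, 2, 5, 10, 20, 50]:
        Pn = np.linalg.matrix_power(P, n)
        print(n, np.max(np.abs(Pn - pi)))   # max_ij |p_ij(n) - pi_j| tends to 0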
OK, let's go back now and take a more detailed look at that 4th equality in our calculations a couple of pages back. If you were comfortable with that when you read it, then you may skip over this page of notes. The equality in question was the following:

P(X_n = j, T = m | X_0 = i, Y_0 = k) = P(Y_n = j, T = m | X_0 = i, Y_0 = k).

So why can we replace X_n with Y_n? The answer in words is that at time m, where m < n, both the X and Y processes are in state s. Once we know that, the probability that Y_n = j is the same as the probability that X_n = j, because of the Markov property and because X and Y have the same state space and the same transition matrix. We'll do some calculations in more detail now, and we'll use the fact that since the event {T = m} implies (i.e. is a subset of) all three events {X_m = s}, {Y_m = s}, and {X_m = s} ∩ {Y_m = s}, we have that {T = m} = {T = m} ∩ {X_m = s} = {T = m} ∩ {Y_m = s} = {T = m} ∩ {X_m = s} ∩ {Y_m = s}. We may write

P(X_n = j, T = m | X_0 = i, Y_0 = k)
 = P(X_n = j, T = m, X_m = s, Y_m = s | X_0 = i, Y_0 = k)
 = P(X_n = j | T = m, X_m = s, Y_m = s, X_0 = i, Y_0 = k) P(T = m, X_m = s, Y_m = s | X_0 = i, Y_0 = k)
 = P(X_n = j | X_m = s, T = m) P(T = m, Y_m = s | X_0 = i, Y_0 = k)
 = P(Y_n = j | Y_m = s, T = m) P(T = m, Y_m = s | X_0 = i, Y_0 = k)
 = P(Y_n = j | Y_m = s, T = m, X_0 = i, Y_0 = k) P(T = m, Y_m = s | X_0 = i, Y_0 = k)
 = P(Y_n = j, Y_m = s, T = m | X_0 = i, Y_0 = k)
 = P(Y_n = j, T = m | X_0 = i, Y_0 = k).
We did an interchange of X_n and Y_n, again in the 4th equality, where we wrote

P(X_n = j | X_m = s, T = m) = P(Y_n = j | Y_m = s, T = m),

but hopefully in this form the validity of the interchange is more obvious. It should be crystal clear that

P(X_n = j | X_m = s) = P(Y_n = j | Y_m = s)

holds, since the X and Y chains have the same transition matrix. The extra conditioning on the event {T = m} doesn't change either of the above conditional probabilities. It is not dropped from the conditioning only because we want to bring it in front of the conditioning bar later on.
18
Balance and Reversibility
We have said that the stationary probability π_i, if it exists, gives the long run proportion of time in state i. Since every time period spent in state i corresponds to a transition into (or out of) state i, we can also interpret π_i as the long run proportion of transitions that go into (or out of) state i. Also, since p_ij is the probability of going to state j given that we are in state i, the product π_i p_ij is the long run proportion of transitions that go from state i to state j. If we think of a transition from state i to state j as a unit of flow from state i to state j, then π_i p_ij would be the rate of flow from state i to state j. Similarly, with this flow interpretation, we have

π_j = rate of flow out of state j

and

Σ_{i∈S} π_i p_ij = rate of flow into state j.

So the equations π = πP have the interpretation

rate of flow into state j = rate of flow out of state j

for every j ∈ S. That is, the stationary distribution is that vector which achieves balance of flow. For this reason the equations π = πP are called the Balance Equations or the Global Balance Equations.
Local Balance:
All stationary distributions must create global balance, in the sense just described. If the stationary probabilities also satisfy

π_i p_ij = π_j p_ji

for every i, j ∈ S, then we say that π also creates local balance. The above equations are called the Local Balance Equations (sometimes called the Detailed Balance Equations) because they specify balance of flow between every pair of states:

rate of flow from i to j = rate of flow from j to i,

for every i, j ∈ S. If one can find a vector π that satisfies local balance, then π also satisfies the global balance equations, for

π_i p_ij = π_j p_ji  ⟹  Σ_{i∈S} π_i p_ij = Σ_{i∈S} π_j p_ji  ⟹  Σ_{i∈S} π_i p_ij = π_j Σ_{i∈S} p_ji  ⟹  Σ_{i∈S} π_i p_ij = π_j,

for every j ∈ S.

Processes that achieve local balance when they are made (or become) stationary are typically easier to deal with computationally than those that don't. This is because the local balance equations are typically much simpler to solve than the global balance equations, because each local balance equation always involves just two unknowns.
Example: In the example from p.139 of the notes, in which we used the method of generating functions to obtain information about a stationary distribution, the transition matrix was given by

P =
        0    1    2    3    4   ...
    0 | 2/3  1/3   0    0    0  ...
    1 |  q    r    p    0    0  ...
    2 |  0    q    r    p    0  ...
    3 |  0    0    q    r    p  ...
    ...

where

q = 2c/3
r = (2 - c)/3
p = (1 - c)/3,

and c is the probability that I do a homework over the weekend if there is at least one to be done. From the transition matrix P we can write down the local balance equations as

π_0 (1/3) = π_1 q
π_1 p = π_2 q
  ...
π_i p = π_{i+1} q
  ...

Notice that each equation involves only adjacent pairs of states, because the process only ever increases or decreases by one in any one step, and the diagonal elements of P do not enter into the equations because those give the transition probabilities from i back to i.
Directly obtaining a recursion from these equations is now simple. We have

π_1 = (1/(3q)) π_0,
π_2 = (p/q) π_1 = (p/q)(1/(3q)) π_0,

and, in general,

π_{i+1} = (p/q) π_i = (p/q)^2 π_{i-1} = . . . = (p/q)^i π_1 = (p/q)^i (1/(3q)) π_0.

To obtain π_0, we can now use the constraint Σ_{i=0}^∞ π_i = 1 to write

π_0 [1 + 1/(3q) + (p/q)(1/(3q)) + (p/q)^2 (1/(3q)) + . . .] = 1
π_0 [1 + (1/(3q)) Σ_{i=0}^∞ (p/q)^i] = 1.

At this point we can see that for a stationary distribution to exist, the infinite sum above must converge, and this is true if and only if p/q < 1. In terms of c, this condition is

((1 - c)/3)/(2c/3) < 1   ⟺   1 - c < 2c   ⟺   c > 1/3,

verifying our condition for stability.
Assuming now that c > 1/3, we can evaluate the infinite sum as

Σ_{i=0}^∞ (p/q)^i = 1/(1 - p/q),

which gives

π_0 [1 + (1/(3q)) · 1/(1 - p/q)] = 1
π_0 [1 + 1/(3(q - p))] = 1
π_0 [(1 + 3(q - p))/(3(q - p))] = 1,

or

π_0 = 3(q - p)/(1 + 3(q - p)).

Since 3(q - p) = 3(2c/3 - (1 - c)/3) = 3c - 1, we have

π_0 = (3c - 1)/(1 + 3c - 1) = (3c - 1)/(3c).

Moreover, we also have π_i as

π_i = (p/q)^{i-1} (1/(3q)) π_0 = ((1 - c)/(2c))^{i-1} (1/(2c)) (3c - 1)/(3c) = ((1 - c)/(2c))^{i-1} (3c - 1)/(6c^2),

a result we didn't obtain explicitly using generating functions.
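A quick numerical check of these formulas (a minimal sketch in plain Python; the value c = 0.6 and the truncation point are arbitrary choices for illustration):

    c = 0.6                          # any c > 1/3 so the chain is stable
    q, r, p = 2*c/3, (2 - c)/3, (1 - c)/3

    pi0 = (3*c - 1) / (3*c)          # closed form for pi_0
    pi = [pi0, pi0 / (3*q)]          # pi_1 = pi_0 / (3q) from local balance
    for i in range(1, 500):
        pi.append((p / q) * pi[i])   # pi_{i+1} = (p/q) pi_i

    print(sum(pi))                                           # should be very close to 1
    i = 3
    closed_form = ((1 - c)/(2*c))**(i - 1) * (3*c - 1) / (6*c**2)
    print(pi[i], closed_form)                                # recursion vs. closed form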
As this last example shows, it can be very useful to recognize when local balance might hold. In the example we didn't actually try to guess that it might hold; we just blindly tried to solve the local balance equations and got lucky. But there are a couple of things we can do to see if a Markov chain will satisfy the local balance equations without actually writing down the equations and trying to solve them:

If there are two states i and j such that p_ij > 0 but p_ji = 0, then we can right away conclude that the stationary distribution will not satisfy the local balance equations. This is because the equation

π_i p_ij = π_j p_ji

will have 0 on the right hand side and, since p_ij > 0, will only be satisfied if π_i = 0. But, as we have seen, no stationary distribution can have this.

If the process X only ever increases or decreases by one (or stays where it is) at each step, then the local balance equations will be satisfied. We have seen this in today's example. To see this more generally, we may refer to the flow interpretation of the local balance equations. Consider any state i. During any fixed interval of time, the number of transitions from i to i + 1 must be within one of the number of transitions from i + 1 to i, because for each transition from i to i + 1, in order to get back to state i we must make the transition from i + 1 to i. Therefore, in the long run, the proportion of transitions from i to i + 1 must equal the proportion of transitions from i + 1 to i. In other words,

π_i p_{i,i+1} = π_{i+1} p_{i+1,i}

should be satisfied. But these are exactly the local balance equations in this case.
Reversibility: (Section 4.8)

There is a deep connection between local balance and a property of Markov chains (and stochastic processes in general) called reversibility, or time reversibility. Just as not all Markov chains satisfy local balance, not all Markov chains are reversible.

Keep in mind that we are only talking about stationary Markov chains. Local balance and reversibility (and global balance as well) are properties of only stationary Markov chains. To imagine the notion of reversibility, we start out with a stationary Markov chain and then extend the time index back to -∞, so that now our Markov chain is

X = {X_n : n ∈ {. . . , -2, -1, 0, 1, 2, . . .}}.

Imagine running the chain backwards in time to obtain a new process

Y = {Y_n = X_{-n} : n ∈ {. . . , -1, 0, 1, . . .}}.

The process Y is called the reversed chain. Indeed, Y is also a Markov chain. To see this, note that the Markov property for the X chain can be stated in the following way: given the current state of the process, all future states are independent of the entire past up to just before the current time. That is, given X_n, if k > n, then X_k is independent of X_m for every m < n. But this goes both ways since independence is a symmetric property: if W is independent of Z then Z is independent of W, for any random variables W and Z. So we can say: given X_n, if m < n, then X_m is independent of X_k for every k > n.
Therefore, we can see the Markov property of Y, as

P(Y_{n+1} = j | Y_n = i, Y_k = i_k for k < n)
 = P(X_{-(n+1)} = j | X_{-n} = i, X_{-k} = i_k for k < n)
 = P(X_{-(n+1)} = j | X_{-n} = i)
 = P(Y_{n+1} = j | Y_n = i).

So the reversed process Y is a Markov chain. Indeed, it is also stationary and has the same stationary distribution, say π, as the X chain (since, for example, the long run proportion of time the Y chain spends in state i is obviously the same as the long run proportion of time that the X chain spends in state i, for any state i). However, the reversed chain Y does not in general have the same transition matrix as X. In fact, we can explicitly compute the transition matrix of the Y chain, using the fact that both the X chain and the Y chain are stationary with common stationary distribution π. If we let Q denote the transition matrix of the Y chain (with entries q_ij), we have

q_ij = P(Y_n = j | Y_{n-1} = i)
     = P(X_{-n} = j | X_{-(n-1)} = i)
     = P(X_{-n} = j, X_{-(n-1)} = i) / P(X_{-(n-1)} = i)
     = P(X_{-(n-1)} = i | X_{-n} = j) P(X_{-n} = j) / P(X_{-(n-1)} = i)
     = p_ji π_j / π_i,

where p_ji is the one-step transition probability from state j to state i in the X chain.
We say of a stationary Markov chain X that it is reversible, or time-reversible, if the transition matrix of the reversed chain Y is the same as the transition matrix of X; that is, Q = P. Note that the terminology is a little confusing. The reversed chain Y always exists but not every Markov chain X is reversible. Since we have computed q_ij, we can see exactly the conditions that will make X reversible:

X is reversible if and only if q_ij = p_ij, if and only if p_ij = p_ji π_j / π_i, if and only if π_i p_ij = π_j p_ji.

So here we see the connection between reversibility and local balance. A Markov chain X is reversible if and only if local balance is satisfied in equilibrium.

So, for example, to prove that a Markov chain X is reversible one can check whether a stationary distribution can be found that satisfies the local balance equations. You are asked to do this on one of the homework problems. In our example today, in finding the stationary distribution through the local balance equations, we have also shown that the process there is reversible.
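The criterion Q = P (equivalently π_i p_ij = π_j p_ji) is easy to test numerically for a finite chain. Below is a small sketch (assuming NumPy; the two matrices are arbitrary illustrative examples, not from the notes) that computes q_ij = π_j p_ji / π_i and compares it to p_ij.

    import numpy as np

    def stationary(P):
        w, v = np.linalg.eig(P.T)
        pi = np.real(v[:, np.argmin(np.abs(w - 1))])
        return pi / pi.sum()

    def is_reversible(P, tol=1e-10):
        pi = stationary(P)
        Q = (pi[None, :] * P.T) / pi[:, None]    # q_ij = pi_j p_ji / pi_i
        return np.allclose(Q, P, atol=tol)

    # A chain that only moves by at most one state (birth-death type): reversible.
    P1 = np.array([[0.7, 0.3, 0.0],
                   [0.2, 0.5, 0.3],
                   [0.0, 0.4, 0.6]])
    # A chain with a strong one-way cycle 0 -> 1 -> 2 -> 0: not reversible.
    P2 = np.array([[0.1, 0.8, 0.1],
                   [0.1, 0.1, 0.8],
                   [0.8, 0.1, 0.1]])
    print(is_reversible(P1), is_reversible(P2))   # expect True False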
21
The Exponential Distribution
From Discrete-Time to Continuous-Time:
In Chapter 6 of the text we will be considering Markov processes in con-
tinuous time. In a sense, we already have a very good understanding of
continuous-time Markov chains based on our theory for discrete-time
Markov chains. For example, one way to describe a continuous-time
Markov chain is to say that it is a discrete-time Markov chain, except
that we explicitly model the times between transitions with contin-
uous, positive-valued random variables and we explicitly consider the process at any time t, not just at transition times.
The single most important continuous distribution for building and
understanding continuous-time Markov chains is the exponential dis-
tribution, for reasons which we shall explore in this lecture.
The Exponential Distribution:
A continuous random variable X is said to have an Exponential(λ) distribution if it has probability density function

f_X(x|λ) = λe^{-λx} for x > 0,   and   0 for x ≤ 0,

where λ > 0 is called the rate of the distribution.

In the study of continuous-time stochastic processes, the exponential distribution is usually used to model the time until something happens in the process. The mean of the Exponential(λ) distribution is calculated using integration by parts as

E[X] = ∫_0^∞ x λe^{-λx} dx
     = [-xe^{-λx}]_0^∞ + ∫_0^∞ e^{-λx} dx
     = 0 + [-(1/λ)e^{-λx}]_0^∞
     = 1/λ.

So one can see that as λ gets larger, the thing in the process we're waiting for to happen tends to happen more quickly, hence we think of λ as a rate.

As an exercise, you may wish to verify that by applying integration by parts twice, the second moment of the Exponential(λ) distribution is given by

E[X^2] = ∫_0^∞ x^2 λe^{-λx} dx = . . . = 2/λ^2.
From the first and second moments we can compute the variance as

Var(X) = E[X^2] - E[X]^2 = 2/λ^2 - 1/λ^2 = 1/λ^2.
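These moment formulas are easy to confirm by simulation (a minimal sketch assuming NumPy; the rate 2.5 and the sample size are arbitrary choices):

    import numpy as np

    rng = np.random.default_rng(0)
    lam = 2.5
    x = rng.exponential(scale=1/lam, size=1_000_000)   # NumPy parametrizes by the mean 1/lambda

    print(x.mean(), 1/lam)       # sample mean vs. 1/lambda
    print(x.var(), 1/lam**2)     # sample variance vs. 1/lambda^2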
The Memoryless Property:
The following plot illustrates a key property of the exponential distri-
bution. The graph after the point s is an exact copy of the original
function. The important consequence of this is that the distribution
of X conditioned on {X > s} is again exponential.
[Figure 21.1: The exponential function e^{-λx} plotted for x in [0, 2], with a point s marked on the horizontal axis.]
To see how this works, imagine that at time 0 we start an alarm clock which will ring after a time X that is exponentially distributed with rate λ. Let us call X the lifetime of the clock. For any t > 0, we have that

P(X > t) = ∫_t^∞ λe^{-λx} dx = [-e^{-λx}]_t^∞ = e^{-λt}.

Now we go away and come back at time s to discover that the alarm has not yet gone off. That is, we have observed the event {X > s}. If we let Y denote the remaining lifetime of the clock given that {X > s}, then

P(Y > t | X > s) = P(X > s + t | X > s)
 = P(X > s + t, X > s) / P(X > s)
 = P(X > s + t) / P(X > s)
 = e^{-λ(s+t)} / e^{-λs}
 = e^{-λt}.

But this implies that the remaining lifetime after we observe the alarm has not yet gone off at time s has the same distribution as the original lifetime X. The really important thing to note, though, is that this implies that the distribution of the remaining lifetime does not depend on s. In fact, if you try setting X to have any other continuous distribution, then ask what would be the distribution of the remaining lifetime after you observe {X > s}, the distribution will depend on s.
This property is called the memoryless property of the exponential distribution because I don't need to remember when I started the clock. If the distribution of the lifetime X is Exponential(λ), then if I come back to the clock at any time and observe that the clock has not yet gone off, regardless of when the clock started I can assert that the distribution of the time till it goes off, starting at the time I start observing it again, is Exponential(λ). Put another way, given that the clock has currently not yet gone off, I can forget the past and still know the distribution of the time from my current time to the time the alarm will go off. The resemblance of this property to the Markov property should not be lost on you.

It is a rather amazing, and perhaps unfortunate, fact that the exponential distribution is the only one for which this works. The memoryless property is like enabling technology for the construction of continuous-time Markov chains. We will see this more clearly in Chapter 6. But the exponential distribution is even more special than just the memoryless property because it has a second enabling type of property.
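A short simulation makes the memoryless property tangible (a sketch assuming NumPy; λ, s and t are arbitrary illustrative values): among exponential lifetimes that survive past s, the residual lifetime beyond s still looks Exponential(λ).

    import numpy as np

    rng = np.random.default_rng(1)
    lam, s, t = 1.5, 0.8, 0.5
    x = rng.exponential(scale=1/lam, size=2_000_000)

    survivors = x[x > s]                  # condition on {X > s}
    residual = survivors - s              # remaining lifetime Y

    print((residual > t).mean())          # P(Y > t | X > s), empirical
    print(np.exp(-lam * t))               # e^{-lambda t}, the memoryless prediction
    print(residual.mean(), 1/lam)         # mean residual life vs. 1/lambda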
Another Important Property of the Exponential:
Let X_1, . . . , X_n be independent random variables, with X_i having an Exponential(λ_i) distribution. Then the distribution of min(X_1, . . . , X_n) is Exponential(λ_1 + . . . + λ_n), and the probability that the minimum is X_i is λ_i/(λ_1 + . . . + λ_n).

Proof:

P(min(X_1, . . . , X_n) > t) = P(X_1 > t, . . . , X_n > t)
 = P(X_1 > t) · · · P(X_n > t)
 = e^{-λ_1 t} · · · e^{-λ_n t}
 = e^{-(λ_1 + . . . + λ_n)t}.
The preceding shows that the CDF of min(X_1, . . . , X_n) is that of an Exponential(λ_1 + . . . + λ_n) distribution. The probability that X_i is the minimum can be obtained by conditioning:

P(X_i is the minimum)
 = P(X_i < X_j for j ≠ i)
 = ∫_0^∞ P(X_i < X_j for j ≠ i | X_i = t) λ_i e^{-λ_i t} dt
 = ∫_0^∞ P(t < X_j for j ≠ i) λ_i e^{-λ_i t} dt
 = ∫_0^∞ λ_i e^{-λ_i t} Π_{j≠i} P(X_j > t) dt
 = ∫_0^∞ λ_i e^{-λ_i t} Π_{j≠i} e^{-λ_j t} dt
 = λ_i ∫_0^∞ e^{-(λ_1 + . . . + λ_n)t} dt
 = λ_i [-e^{-(λ_1 + . . . + λ_n)t}/(λ_1 + . . . + λ_n)]_0^∞
 = λ_i/(λ_1 + . . . + λ_n),

as required.
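The following sketch checks both claims by simulation (assuming NumPy; the particular rates are arbitrary): the minimum of independent exponentials behaves like an Exponential(λ_1 + . . . + λ_n), and X_i achieves the minimum with probability λ_i divided by the total rate.

    import numpy as np

    rng = np.random.default_rng(2)
    rates = np.array([0.5, 1.0, 2.5])
    n_samples = 1_000_000

    # Each row is one realization of (X_1, X_2, X_3).
    x = rng.exponential(scale=1/rates, size=(n_samples, len(rates)))

    m = x.min(axis=1)
    print(m.mean(), 1/rates.sum())            # mean of the minimum vs. 1/(sum of rates)

    which = x.argmin(axis=1)
    print(np.bincount(which) / n_samples)     # empirical P(X_i is the minimum)
    print(rates / rates.sum())                # lambda_i / (lambda_1 + ... + lambda_n)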
To see how this works together with the memoryless property, consider the following examples.
Example: (Ross, p.332 #20). Consider a two-server system in which a customer is served first by server 1, then by server 2, and then departs. The service times at server i are exponential random variables with rates µ_i, i = 1, 2. When you arrive, you find server 1 free and two customers at server 2: customer A in service and customer B waiting in line.

(a) Find P_A, the probability that A is still in service when you move over to server 2.

(b) Find P_B, the probability that B is still in the system when you move over to server 2.

(c) Find E[T], where T is the time that you spend in the system.

Solution:

(a) A will still be in service when you move to server 2 if your service at server 1 ends before A's remaining service at server 2 ends. Now A is currently in service at server 2 when you arrive, but because of memorylessness, A's remaining service is Exponential(µ_2), and you start service at server 1 that is Exponential(µ_1). Therefore, P_A is the probability that an Exponential(µ_1) random variable is less than an Exponential(µ_2) random variable, which is

P_A = µ_1/(µ_1 + µ_2).

(b) B will still be in the system when you move over to server 2 if your service time is less than the sum of A's remaining service time and B's service time. Let us condition on the first thing to happen, either A finishes service or you finish service:
P(B in system) = P(B in system | A finishes before you) · µ_2/(µ_1 + µ_2) + P(B in system | you finish before A) · µ_1/(µ_1 + µ_2).

Now P(B in system | you finish before A) = 1, since B will still be waiting in line when you move to server 2. On the other hand, if the first thing to happen is that A finishes service, then at that point, by memorylessness, your remaining service at server 1 is Exponential(µ_1), and B will still be in the system if your remaining service at server 1 is less than B's service at server 2, and the probability of this is µ_1/(µ_1 + µ_2). That is,

P(B in system | A finishes before you) = µ_1/(µ_1 + µ_2).

Therefore,

P(B in system) = µ_1µ_2/(µ_1 + µ_2)^2 + µ_1/(µ_1 + µ_2).

(c) To compute the expected time you are in the system, we first divide up your time in the system into

T = T_1 + R,

where T_1 is the time until the first thing that happens, and R is the rest of the time. The time until the first thing happens is Exponential(µ_1 + µ_2), so that

E[T_1] = 1/(µ_1 + µ_2).

To compute E[R], we condition on what was the first thing to happen, either A finished service at server 2 or you finished service
at server 1. If the first thing to happen was that you finished service at server 1, which occurs with probability µ_1/(µ_1 + µ_2), then at that point you moved to server 2, and your remaining time in the system is the remaining time of A at server 2, the service time of B at server 2, and your service time at server 2. A's remaining time at server 2 is again Exponential(µ_2) by memorylessness, and so your expected remaining time in the system will be 3/µ_2. That is,

E[R | first thing to happen is you finish service at server 1] = 3/µ_2,

and so

E[R] = (3/µ_2) · µ_1/(µ_1 + µ_2) + E[R | first thing is A finishes] · µ_2/(µ_1 + µ_2).

Now if the first thing to happen is that A finishes service at server 2, we can again compute your expected remaining time in the system as the expected time until the next thing to happen (either you or B finishes service) plus the expected remaining time after that. To compute the latter we can again condition on what was that next thing to happen. We will obtain

E[R | first thing is A finishes] = 1/(µ_1 + µ_2) + (2/µ_2) · µ_1/(µ_1 + µ_2) + (1/µ_1 + 1/µ_2) · µ_2/(µ_1 + µ_2).

Plugging everything back in gives E[T].

As an exercise you should consider how you might do the preceding problem assuming a different service time distribution, such as a Uniform distribution on [0, 1] or a deterministic service time such as 1 time unit.
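Since parts (a) through (c) combine memorylessness with the minimum-of-exponentials property, a direct simulation is a good way to check the formulas. The sketch below (assuming NumPy; the rates µ_1 = 1, µ_2 = 2 are arbitrary) simulates the system event by event and estimates P_A, P_B and E[T].

    import numpy as np

    rng = np.random.default_rng(3)
    mu1, mu2 = 1.0, 2.0
    n = 1_000_000

    S1    = rng.exponential(1/mu1, n)   # your service at server 1
    A_rem = rng.exponential(1/mu2, n)   # A's remaining service at server 2 (memoryless)
    B     = rng.exponential(1/mu2, n)   # B's service at server 2
    Y2    = rng.exponential(1/mu2, n)   # your service at server 2

    P_A_sim = (S1 < A_rem).mean()                        # A still in service when you move over
    P_B_sim = (S1 < A_rem + B).mean()                    # B still in the system when you move over
    T_sim   = (np.maximum(S1, A_rem + B) + Y2).mean()    # you start at server 2 only after A and B are done

    s = mu1 + mu2
    P_A = mu1/s
    P_B = mu1*mu2/s**2 + mu1/s
    E_R = (3/mu2)*mu1/s + (1/s + (2/mu2)*mu1/s + (1/mu1 + 1/mu2)*mu2/s)*mu2/s
    E_T = 1/s + E_R

    print(P_A_sim, P_A)
    print(P_B_sim, P_B)
    print(T_sim, E_T)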
22
The Poisson Process: Introduction
We now begin studying our first continuous-time process: the Poisson Process. Its relative simplicity and significant practical usefulness make it a good introduction to more general continuous-time processes. Today we will look at several equivalent definitions of the Poisson Process that, each in their own way, give some insight into the structure and properties of the Poisson process.
Stationary and Independent Increments:
We first define the notions of stationary increments and independent increments. For a continuous-time stochastic process {X(t) : t ≥ 0}, an increment is the difference in the process at two times, say s and t. For s < t, the increment from time s to time t is the difference X(t) - X(s).

A process is said to have stationary increments if the distribution of the increment X(t) - X(s) depends on s and t only through the difference t - s, for all s < t. So the distribution of X(t_1) - X(s_1) is the same as the distribution of X(t_2) - X(s_2) if t_1 - s_1 = t_2 - s_2. Note that the intervals [s_1, t_1] and [s_2, t_2] may overlap.

A process is said to have independent increments if any two increments involving disjoint intervals are independent. That is, if s_1 < t_1 < s_2 < t_2, then the two increments X(t_1) - X(s_1) and X(t_2) - X(s_2) are independent.

Not many processes we will encounter will have both stationary and independent increments. In general they will have neither stationary increments nor independent increments. An exception to this we have already seen is the simple random walk. If ξ_1, ξ_2, . . . is a sequence of independent and identically distributed random variables with P(ξ_i = 1) = p and P(ξ_i = -1) = q = 1 - p, then the simple random walk {X_n : n ≥ 0} starting at 0 can be defined as X_0 = 0 and

X_n = Σ_{i=1}^n ξ_i.

From this representation it is not difficult to see that the simple random walk has stationary and independent increments.
Definition 1 of a Poisson Process:

A continuous-time stochastic process {N(t) : t ≥ 0} is a Poisson process with rate λ > 0 if

(i) N(0) = 0.
(ii) It has stationary and independent increments.
(iii) The distribution of N(t) is Poisson with mean λt, i.e.,
      P(N(t) = k) = (λt)^k e^{-λt}/k!   for k = 0, 1, 2, . . ..

This definition tells us some of the structure of a Poisson process immediately:

By stationary increments the distribution of N(t) - N(s), for s < t, is the same as the distribution of N(t - s) - N(0) = N(t - s), which is a Poisson distribution with mean λ(t - s).

The process is nondecreasing, for N(t) - N(s) ≥ 0 with probability 1 for any s < t since N(t) - N(s) has a Poisson distribution.

The state space of the process is clearly S = {0, 1, 2, . . .}.

We can think of the Poisson process as counting events as it progresses: N(t) is the number of events that have occurred up to time t, and at time t + s, N(t + s) - N(t) more events will have been counted, with N(t + s) - N(t) being Poisson distributed with mean λs.

For this reason the Poisson process is called a counting process. Counting processes are a more general class of processes of which the Poisson process is a special case. One common modeling use of the Poisson process is to interpret N(t) as the number of arrivals of tasks/jobs/customers to a system by time t.
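Definition 1 is easy to exercise numerically. The sketch below (assuming NumPy; λ and t are arbitrary values for illustration) simulates many realizations of N(t) by accumulating exponential interarrival times, using the fact, shown in the lecture on properties of the Poisson process, that the times between events are i.i.d. Exponential(λ). It then checks the empirical mean, variance and one point probability against the Poisson(λt) distribution of part (iii).

    import math
    import numpy as np

    rng = np.random.default_rng(4)
    lam, t = 3.0, 2.0
    n_paths = 200_000

    def sample_Nt(lam, t, size):
        """N(t) on many paths: count how many Exponential(lam) interarrival times fit before t."""
        counts = np.zeros(size, dtype=int)
        clock = rng.exponential(1/lam, size)        # time of the first event on each path
        alive = clock <= t
        while alive.any():
            counts[alive] += 1
            clock[alive] += rng.exponential(1/lam, alive.sum())
            alive = clock <= t
        return counts

    N = sample_Nt(lam, t, n_paths)
    print(N.mean(), N.var(), lam*t)                 # Poisson(lam*t): mean = variance = lam*t
    k = 5
    print((N == k).mean(), (lam*t)**k * math.exp(-lam*t) / math.factorial(k))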
Note that N(t) → ∞ as t → ∞, so that N(t) itself is by no means stationary, even though it has stationary increments. Also note that, in the customer arrival interpretation, as λ increases customers will tend to arrive faster, giving one justification for calling λ the rate of the process.

We can see where this definition comes from, and in the process try to see some more low-level structure in a Poisson process, by considering a discrete-time analogue of the Poisson process, called a Bernoulli process, described as follows.

The Bernoulli Process: A Discrete-Time Poisson Process:

Suppose we divide up the positive half-line [0, ∞) into disjoint intervals, each of length h, where h is small. Thus we have the intervals [0, h), [h, 2h), [2h, 3h), and so on. Suppose further that each interval corresponds to an independent Bernoulli trial, such that in each interval, independently of every other interval, there is a successful event (such as an arrival) with probability λh. Define the Bernoulli process to be {B(t) : t = 0, h, 2h, 3h, . . .}, where B(t) is the number of successful trials up to time t.

The above definition of the Bernoulli process clearly corresponds to the notion of a process in which events occur randomly in time, with an intensity, or rate, that increases as λ increases, so we can think of the Poisson process in this way too, assuming the Bernoulli process is a close approximation to the Poisson process. The way we have defined it, the Bernoulli process {B(t)} clearly has stationary and independent increments. As well, B(0) = 0. Thus the Bernoulli process is a discrete-time approximation to the Poisson process with rate λ if the distribution of B(t) is approximately Poisson(λt).
For a given t of the form nh, we know the exact distribution of B(t). Up to time t there are n independent trials, each with probability λh of success, so B(t) has a Binomial distribution with parameters n and λh. Therefore, the mean number of successes up to time t is nλh = λt. So E[B(t)] is correct. The fact that the distribution of B(t) is approximately Poisson(λt) follows from the Poisson approximation to the Binomial distribution (p.32 of the text), which we can re-derive here. We have, for k a nonnegative integer and t > 0 (and keeping in mind that t = nh for some positive integer n),

P(B(t) = k) = (n choose k)(λh)^k (1 - λh)^{n-k}
 = [n!/((n - k)! k!)] (λt/n)^k (1 - λt/n)^{n-k}
 = [n!/((n - k)! n^k)] (1 - λt/n)^{-k} [(λt)^k/k!] (1 - λt/n)^n
 ≈ [n!/((n - k)! n^k)] (1 - λt/n)^{-k} [(λt)^k/k!] e^{-λt},

for n very large (or h very small). But also, for n large,

(1 - λt/n)^{-k} ≈ 1

and

n!/((n - k)! n^k) = n(n - 1) · · · (n - k + 1)/n^k ≈ 1.

Therefore, P(B(t) = k) ≈ (λt)^k e^{-λt}/k! (this approximation becomes exact as h → 0).
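The quality of this approximation is easy to see numerically. A minimal sketch (plain Python with the math module; λ, t, k and the choices of h are arbitrary illustrative values):

    from math import comb, exp, factorial

    lam, t, k = 2.0, 1.5, 3

    def binomial_pmf(k, n, p):
        return comb(n, k) * p**k * (1 - p)**(n - k)

    poisson = (lam*t)**k * exp(-lam*t) / factorial(k)    # Poisson(lam*t) probability of k events

    for h in [0.1, 0.01, 0.001]:
        n = round(t / h)                                 # number of Bernoulli trials up to time t
        print(h, binomial_pmf(k, n, lam*h), poisson)     # Binomial(n, lam*h) pmf approaches the Poisson pmf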
Thinking intuitively about how the Poisson process can be expected to behave can be done by thinking about the conceptually simpler Bernoulli process. For example, given that there are n events in the interval [0, t) (i.e. N(t) = n), the times of those n events should be uniformly distributed in the interval [0, t), because that is what we would expect in the Bernoulli process. This intuition is true, and we'll prove it more carefully later.

Thinking in terms of the Bernoulli process also leads to a more low-level (in some sense better) way to define the Poisson process. This way of thinking about the Poisson process will also be useful later when we consider continuous-time Markov chains. In the Bernoulli process the probability of a success in any given interval is λh and the probability of two or more successes is 0 (that is, P(B(h) = 1) = λh and P(B(h) ≥ 2) = 0). Therefore, in the Poisson process we have the approximation that P(N(h) = 1) ≈ λh and P(N(h) ≥ 2) ≈ 0. We write this approximation in a more precise way by saying that P(N(h) = 1) = λh + o(h) and P(N(h) ≥ 2) = o(h).

The notation o(h) is called Landau's o(h) notation, read "little o of h", and it means any function of h that is of smaller order than h. This means that if f(h) is o(h) then f(h)/h → 0 as h → 0 (f(h) goes to 0 faster than h goes to 0). Notationally, o(h) is a very clever and useful quantity because it lets us avoid writing out long, complicated, or simply unknown expressions when the only crucial property of the expression that we care about is how fast it goes to 0. We will make extensive use of this notation in this and the next chapter, so it is worthwhile to pause and make sure you understand the properties of o(h).
Landau's Little o of h Notation:

Note that o(h) doesn't refer to any specific function. It denotes any quantity that goes to 0 at a faster rate than h, as h → 0:

o(h)/h → 0 as h → 0.

Since the sum of two such quantities retains this rate property, we get the potentially disconcerting property that

o(h) + o(h) = o(h)

as well as

o(h)o(h) = o(h)
c · o(h) = o(h),

where c is any constant (note that c can be a function of other variables as long as it remains constant as h varies).

Example: The function h^k is o(h) for any k > 1 since

h^k/h = h^{k-1} → 0 as h → 0.

h itself, however, is not o(h). The infinite series Σ_{k=2}^∞ c_k h^k, where |c_k| < 1, is o(h) since

lim_{h→0} [Σ_{k=2}^∞ c_k h^k]/h = lim_{h→0} Σ_{k=2}^∞ c_k h^{k-1} = Σ_{k=2}^∞ c_k lim_{h→0} h^{k-1} = 0,

where taking the limit inside the summation is justified because the sum is bounded by 1/(1 - h) for h < 1.
Definition 2 of a Poisson Process:

A continuous-time stochastic process {N(t) : t ≥ 0} is a Poisson process with rate λ > 0 if

(i) N(0) = 0.
(ii) It has stationary and independent increments.
(iii) P(N(h) = 1) = λh + o(h),
      P(N(h) ≥ 2) = o(h), and
      P(N(h) = 0) = 1 - λh + o(h).

This definition can be more useful than Definition 1 because its conditions are more primitive and correspond more directly with the Bernoulli process, which is more intuitive to imagine as a process evolving in time.

We need to check that Definitions 1 and 2 are equivalent (that is, they define the same process). We will show that Definition 1 implies Definition 2. The proof that Definition 2 implies Definition 1 is shown in the text in Theorem 5.1 on p.292 (p.260 in the 7th Edition), which you are required to read.
Proof that Definition 1 ⇒ Definition 2: (Problem #35, p.335)

We just need to show part (iii) of Definition 2. By Definition 1, N(h) has a Poisson distribution with mean λh. Therefore,

P(N(h) = 0) = e^{-λh}.

If we expand out the exponential in a Taylor series, we have that

P(N(h) = 0) = 1 - λh + (λh)^2/2! - (λh)^3/3! + . . .
            = 1 - λh + o(h).
Similarly,

P(N(h) = 1) = λh e^{-λh}
            = λh [1 - λh + (λh)^2/2! - (λh)^3/3! + . . .]
            = λh - (λh)^2 + (λh)^3/2! - (λh)^4/3! + . . .
            = λh + o(h).

Finally,

P(N(h) ≥ 2) = 1 - P(N(h) = 1) - P(N(h) = 0)
            = 1 - (λh + o(h)) - (1 - λh + o(h))
            = -o(h) - o(h) = o(h).

Thus Definition 1 implies Definition 2.
A third way to define the Poisson process is to define the distribution of the time between events. We will see in the next lecture that the times between events are independent and identically distributed Exponential(λ) random variables. For now we can gain some insight into this fact by once again considering the Bernoulli process.

Imagine that you start observing the Bernoulli process at some arbitrary trial, such that you don't know how many trials have gone before and you don't know when the last successful trial was. Still you would know that the distribution of the time until the next successful trial is h times a Geometric random variable with parameter λh. In other words, you don't need to know anything about the past of the process to know the distribution of the time to the next success, and in fact this is the same as the distribution until the first success. That is, the distribution of the time between successes in the Bernoulli process is memoryless.
When you pass to the limit as h → 0 you get the Poisson process with rate λ, and you should expect that you will retain this memoryless property in the limit. Indeed you do, and since the only continuous distribution on [0, ∞) with the memoryless property is the Exponential distribution, you may deduce that this is the distribution of the time between events in a Poisson process. Moreover, you should also inherit from the Bernoulli process that the times between successive events are independent and identically distributed.

As a final aside, we remark that this discussion also suggests that the Exponential distribution is a limiting form of the Geometric distribution, as the probability of success λh in each trial goes to 0. This is indeed the case. As we mentioned above, the time between successful trials in the Bernoulli process is distributed as Y = hX, where X is a Geometric random variable with parameter λh. One can verify that for any t > 0, we have P(Y > t) → e^{−λt} as h → 0:
    P(Y > t) = P(hX > t)
             = P(X > t/h)
             = (1 − λh)^{⌈t/h⌉}
             = (1 − λh)^{t/h} (1 − λh)^{⌈t/h⌉ − t/h}
             = [1 − λt/(t/h)]^{t/h} (1 − λh)^{⌈t/h⌉ − t/h}
             → e^{−λt}   as h → 0,

where ⌈t/h⌉ is the smallest integer greater than or equal to t/h. In other words, the distribution of Y converges to the Exponential(λ) distribution as h → 0.
Note that the above discussion also illustrates that the Geometric
distribution is a discrete distribution with the memoryless property.
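This convergence is easy to watch numerically as well; the short check below (an illustration of mine, with arbitrary values of λ and t) compares the scaled Geometric tail probability with the Exponential tail as h shrinks.

import math

lam, t = 1.5, 2.0
for h in [0.1, 0.01, 0.001, 0.0001]:
    tail = (1.0 - lam * h) ** math.ceil(t / h)   # P(Y > t) for Y = h * Geometric(lam*h)
    print(h, tail, math.exp(-lam * t))           # the two tail values converge as h -> 0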
23
Properties of the Poisson Process
Today we will consider the distribution of the times between events in a Poisson process, called the interarrival times of the process. We will see that the interarrival times are independent and identically distributed Exponential(λ) random variables, where λ is the rate of the Poisson process. This leads to our third definition of the Poisson process. Using this definition, as well as our previous definitions, we can deduce some further properties of the Poisson process. Today we will see that the time until the nth event occurs has a Gamma(n, λ) distribution. Later we will consider the sum, called the superposition, of two independent Poisson processes, as well as the thinned Poisson process obtained by independently marking, with some fixed probability p, each event in a Poisson process, thereby identifying the events in the thinned process.
Interarrival Times of the Poisson Process:
We can think of the Poisson process as a counting process with a given interarrival distribution. That is, N(t) is the number of events that have occurred up to time t, where the times between events, called the interarrival times, are independent and identically distributed random variables.

Comment: We will see that the interarrival distribution for a Poisson process with rate λ is Exponential(λ), which is expected based on the discussion at the end of the last lecture. In general, we can replace the Exponential interarrival time distribution with any distribution on [0, ∞) to obtain a large class of counting processes. Such processes (when the interarrival time distribution is general) are called Renewal Processes, and the area of their study is called Renewal Theory. We will not study this topic in this course, but for those interested this topic is covered in Chapter 7 of the text. However, we make the comment here that if the interarrival time is not Exponential, then the process will not have stationary and independent increments. That is, the Poisson process is the only Renewal process with stationary and independent increments.
Proof that the Interarrival Distribution is Exponential(λ):

We can prove that the interarrival time distribution in the Poisson process is Exponential directly from Definition 1. First, consider the time until the first event, say T_1. Then for any t > 0, the event {T_1 > t} is equivalent to the event {N(t) = 0}. Therefore,

    P(T_1 > t) = P(N(t) = 0) = e^{−λt}.

This shows immediately that T_1 has an Exponential distribution with rate λ.
In general, let T_i denote the time between the (i−1)st and the ith event. We can use an induction argument in which the nth proposition is that T_1, ..., T_n are independent and identically distributed Exponential(λ) random variables:

    Proposition n : T_1, ..., T_n are i.i.d. Exponential(λ).

We have shown that Proposition 1 is true. Now assume that Proposition n is true (the induction hypothesis). Then we show this implies Proposition n+1 is true. To do this fix t, t_1, ..., t_n > 0. Proposition n+1 will be true if we show that the distribution of T_{n+1} conditioned on T_1 = t_1, ..., T_n = t_n does not depend on t_1, ..., t_n (which shows that T_{n+1} is independent of T_1, ..., T_n), and that P(T_{n+1} > t) = e^{−λt}. So we wish to consider the conditional probability

    P(T_{n+1} > t | T_n = t_n, ..., T_1 = t_1).
First, we will re-express the event {T_n = t_n, ..., T_1 = t_1}, which involves the first n interarrival times, as an equivalent event which involves the first n arrival times. Let S_k = T_1 + ... + T_k be the kth arrival time (the time of the kth event) and let s_k = t_1 + ... + t_k, for k = 1, ..., n. Then

    {T_n = t_n, ..., T_1 = t_1} = {S_n = s_n, ..., S_1 = s_1},

and we can rewrite our conditional probability as

    P(T_{n+1} > t | T_n = t_n, ..., T_1 = t_1) = P(T_{n+1} > t | S_n = s_n, ..., S_1 = s_1).
The fact that the event {T_{n+1} > t} is independent of the event {S_n = s_n, ..., S_1 = s_1} is because of independent increments, though it may not be immediately obvious. We'll try to see this in some detail. If the event {S_n = s_n, ..., S_1 = s_1} occurs, then the event {T_{n+1} > t} occurs if and only if there are no arrivals in the interval (s_n, s_n + t], so we can write

    P(T_{n+1} > t | S_n = s_n, ..., S_1 = s_1)
        = P(N(s_n + t) − N(s_n) = 0 | S_n = s_n, ..., S_1 = s_1).
Therefore, we wish to express the event {S_n = s_n, ..., S_1 = s_1} in terms of increments disjoint from the increment N(s_n + t) − N(s_n). At the cost of some messy notation we'll do this, just to see how it might be done at least once. Define the increments

    I_1^{(k)} = N(s_1 − 1/k) − N(0)
    I_i^{(k)} = N(s_i − 1/k) − N(s_{i−1} + 1/k)   for i = 2, ..., n,

for k > M, where M is chosen so that 1/k is smaller than the smallest interarrival time, and also define the increments

    B_i^{(k)} = N(s_i + 1/k) − N(s_i − 1/k)   for i = 1, ..., n − 1
    B_n^{(k)} = N(s_n) − N(s_n − 1/k),
for k > M. The increments I_1^{(k)}, B_1^{(k)}, ..., I_n^{(k)}, B_n^{(k)} are all disjoint and account for the entire interval [0, s_n]. Now define the event

    A_k = {I_1^{(k)} = 0} ∩ ... ∩ {I_n^{(k)} = 0} ∩ {B_1^{(k)} = 1} ∩ ... ∩ {B_n^{(k)} = 1}.

Then A_k implies A_{k−1} (that is, A_k is contained in A_{k−1}), so that the sequence {A_k}_{k=M}^∞ is a decreasing sequence of sets, and in fact

    {S_n = s_n, ..., S_1 = s_1} = ∩_{k=M}^∞ A_k,

because one can check that each event implies the other.
However (and this is why we constructed the events A_k), for any k the event A_k is independent of the event {N(s_n + t) − N(s_n) = 0}, because the increment N(s_n + t) − N(s_n) is independent of all the increments I_1^{(k)}, ..., I_n^{(k)}, B_1^{(k)}, ..., B_n^{(k)}, as they are all disjoint increments. But if the event {N(s_n + t) − N(s_n) = 0} is independent of A_k for every k, it is independent of the intersection of the A_k. Thus, we have
    P(T_{n+1} > t | S_n = s_n, ..., S_1 = s_1)
        = P(N(s_n + t) − N(s_n) = 0 | S_n = s_n, ..., S_1 = s_1)
        = P(N(s_n + t) − N(s_n) = 0 | ∩_{k=M}^∞ A_k)
        = P(N(s_n + t) − N(s_n) = 0)
        = P(N(t) = 0) = e^{−λt},

and we have shown that T_{n+1} has an Exponential(λ) distribution and is independent of T_1, ..., T_n. We conclude from the induction argument that the sequence of interarrival times T_1, T_2, ... are all independent and identically distributed Exponential(λ) random variables.
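This characterization also gives a direct way to simulate a Poisson process. The following sketch (my own illustration, with arbitrary λ and t) builds the process from i.i.d. Exponential(λ) interarrival times and checks that the resulting count N(t) has mean and variance close to λt, as a Poisson(λt) count should.

import numpy as np

rng = np.random.default_rng(0)
lam, t, reps = 3.0, 5.0, 10000

counts = np.empty(reps, dtype=int)
for r in range(reps):
    arrivals, s = 0, rng.exponential(1.0 / lam)   # first arrival time T_1
    while s <= t:
        arrivals += 1
        s += rng.exponential(1.0 / lam)           # add the next interarrival time
    counts[r] = arrivals

print(counts.mean(), lam * t)   # sample mean of N(t) vs. theoretical mean lam*t
print(counts.var(), lam * t)    # a Poisson(lam*t) count also has variance lam*t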
Definition 3 of a Poisson Process:

A continuous-time stochastic process {N(t) : t ≥ 0} is a Poisson process with rate λ > 0 if

(i) N(0) = 0.
(ii) N(t) counts the number of events that have occurred up to time t (i.e. it is a counting process).
(iii) The times between events are independent and identically distributed with an Exponential(λ) distribution.
We have seen how Definition 1 implies (i), (ii) and (iii) in Definition 3. One can show that Exponential(λ) interarrival times imply part (iii) of Definition 2 by expanding out the exponential function as a Taylor series, much as we did in showing that Definition 1 implies Definition 2. One can also show that Exponential interarrival times imply stationary and independent increments by using the memoryless property. As an exercise, you may wish to prove this. However, we will not do so here. That Definition 3 actually is a definition of the Poisson process is nice, but not necessary. It suffices to take either Definition 1 or Definition 2 as the definition of the Poisson process, and to see that either definition implies that the times between events are i.i.d. Exponential(λ) random variables.
Distribution of the Time to the nth Arrival:

If we let S_n denote the time of the nth arrival in a Poisson process, then S_n = T_1 + ... + T_n, the sum of the first n interarrival times. The distribution of S_n is Gamma with parameters n and λ. Before showing this, let us briefly review the Gamma distribution.
The Gamma(α, λ) Distribution:

A random variable X on [0, ∞) is said to have a Gamma distribution with parameters α > 0 and λ > 0 if its probability density function is given by

    f_X(x | α, λ) = [λ^α / Γ(α)] x^{α−1} e^{−λx}   for x ≥ 0
                  = 0                               for x < 0,

where Γ(α), called the Gamma function, is defined by

    Γ(α) = ∫_0^∞ y^{α−1} e^{−y} dy.

We can verify that the density of the Gamma(α, λ) distribution integrates to 1 by writing down the integral

    ∫_0^∞ [λ^α / Γ(α)] x^{α−1} e^{−λx} dx

and making the substitution y = λx. This gives dy = λ dx, or dx = (1/λ) dy, and x = y/λ, and so

    ∫_0^∞ [λ^α / Γ(α)] x^{α−1} e^{−λx} dx = ∫_0^∞ [λ^α / Γ(α)] (y/λ)^{α−1} e^{−y} (1/λ) dy
                                          = [1 / Γ(α)] ∫_0^∞ y^{α−1} e^{−y} dy = 1,

by looking again at the definition of Γ(α).
The Γ(α) function has a useful recursive property. For α > 1 we can start to evaluate the integral defining the Gamma function using integration by parts:

    ∫_a^b u dv = uv|_a^b − ∫_a^b v du.

We let

    u = y^{α−1}   and   dv = e^{−y} dy,

giving

    du = (α − 1) y^{α−2} dy   and   v = −e^{−y},

so that

    Γ(α) = ∫_0^∞ y^{α−1} e^{−y} dy
         = −y^{α−1} e^{−y} |_0^∞ + (α − 1) ∫_0^∞ y^{α−2} e^{−y} dy
         = 0 + (α − 1)Γ(α − 1).

That is, Γ(α) = (α − 1)Γ(α − 1). In particular, if α = n, a positive integer greater than or equal to 2, then we recursively get

    Γ(n) = (n − 1)Γ(n − 1) = ... = (n − 1)(n − 2)...(2)(1)Γ(1) = (n − 1)! Γ(1).

However,

    Γ(1) = ∫_0^∞ e^{−y} dy = −e^{−y}|_0^∞ = 1,

is just the area under the curve of the Exponential(1) density. Therefore, Γ(n) = (n − 1)! for n ≥ 2. However, since Γ(1) = 1 = 0! we have in fact that Γ(n) = (n − 1)! for any positive integer n.
So for n a positive integer, the Gamma(n, λ) density can be written as

    f_X(x | n, λ) = [λ^n / (n−1)!] x^{n−1} e^{−λx}   for x ≥ 0
                  = 0                                 for x < 0.

An important special case of the Gamma(α, λ) distribution is the Exponential(λ) distribution, which is obtained by setting α = 1. Getting back to the Poisson process, we are trying to show that the sum of n independent Exponential(λ) random variables has a Gamma(n, λ) distribution, and for n = 1 the result is immediate. For n > 1, the simplest way to get our result is to observe that the time of the nth arrival is less than or equal to t if and only if the number of arrivals in the interval [0, t] is greater than or equal to n. That is, the two events

    {S_n ≤ t}   and   {N(t) ≥ n}

are equivalent. However, the probability of the first event {S_n ≤ t} gives the CDF of S_n, and so we have a means to calculate the CDF:

    F_{S_n}(t) ≡ P(S_n ≤ t) = P(N(t) ≥ n) = Σ_{j=n}^∞ [(λt)^j / j!] e^{−λt}.
To get the density of S_n, we differentiate the above with respect to t, giving

    f_{S_n}(t) = −Σ_{j=n}^∞ λ [(λt)^j / j!] e^{−λt} + Σ_{j=n}^∞ λ [(λt)^{j−1} / (j−1)!] e^{−λt}
               = λ [(λt)^{n−1} / (n−1)!] e^{−λt}
               = [λ^n / (n−1)!] t^{n−1} e^{−λt}.

Comparing with the Gamma(n, λ) density above we have our result.
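A quick simulation makes the result concrete. The sketch below (my own illustration, with arbitrary parameter values) compares the empirical probability that a sum of n Exponential(λ) variables is at most t with the Poisson tail expression for the Gamma(n, λ) CDF derived above.

import math
import numpy as np

rng = np.random.default_rng(1)
lam, n, t, reps = 2.0, 4, 3.0, 100000

samples = rng.exponential(1.0 / lam, size=(reps, n)).sum(axis=1)   # draws of S_n = T_1 + ... + T_n
empirical = (samples <= t).mean()

# Gamma(n, lam) CDF written as the Poisson tail: sum_{j>=n} (lam*t)^j / j! * e^{-lam*t}
poisson_tail = sum((lam * t) ** j / math.factorial(j) for j in range(n, 200)) * math.exp(-lam * t)

print(empirical, poisson_tail)   # the two numbers should be close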
24
Further Properties of the Poisson Process
Today we will consider two further properties of the Poisson process that both have to do with deriving new processes from a given Poisson process. Specifically, we will see that

(1) The sum of two independent Poisson processes (called the superposition of the processes) is again a Poisson process, but with rate λ_1 + λ_2, where λ_1 and λ_2 are the rates of the constituent Poisson processes.

(2) If each event in a Poisson process is marked with probability p, independently from event to event, then the marked process {N_1(t) : t ≥ 0}, where N_1(t) is the number of marked events up to time t, is a Poisson process with rate pλ, where λ is the rate of the original Poisson process. This is called thinning a Poisson process.

The operations of taking the sum of two or more independent Poisson processes and of thinning a Poisson process can be of great practical use in modeling many systems where the Poisson process(es) represent arrival streams to the system and we wish to classify different types of arrivals because the system will treat each arrival differently based on its type.
Superposition of Poisson Processes:

Suppose that {N_1(t) : t ≥ 0} and {N_2(t) : t ≥ 0} are two independent Poisson processes with rates λ_1 and λ_2, respectively. The sum of N_1(t) and N_2(t),

    {N(t) = N_1(t) + N_2(t) : t ≥ 0},

is called the superposition of the two processes N_1(t) and N_2(t). Since N_1(t) and N_2(t) are independent and N_1(t) is Poisson(λ_1 t) and N_2(t) is Poisson(λ_2 t), their sum has a Poisson distribution with mean (λ_1 + λ_2)t. Also, it is clear that N(0) = N_1(0) + N_2(0) = 0. That is, properties (i) and (iii) of Definition 1 of a Poisson process are satisfied by the process N(t) if we take the rate to be λ_1 + λ_2. Thus, to show that N(t) is indeed a Poisson process with rate λ_1 + λ_2 it just remains to show that N(t) has stationary and independent increments.
First, consider any increment I(t_1, t_2) = N(t_2) − N(t_1), with t_1 < t_2. Then

    I(t_1, t_2) = N(t_2) − N(t_1)
                = N_1(t_2) + N_2(t_2) − (N_1(t_1) + N_2(t_1))
                = (N_1(t_2) − N_1(t_1)) + (N_2(t_2) − N_2(t_1))
                ≡ I_1(t_1, t_2) + I_2(t_1, t_2),

where I_1(t_1, t_2) and I_2(t_1, t_2) are the corresponding increments in the N_1(t) and N_2(t) processes, respectively. But the increment I_1(t_1, t_2) has a Poisson(λ_1(t_2 − t_1)) distribution and the increment I_2(t_1, t_2) has a Poisson(λ_2(t_2 − t_1)) distribution. Furthermore, I_1(t_1, t_2) and I_2(t_1, t_2) are independent. Therefore, as before, their sum has a Poisson distribution with mean (λ_1 + λ_2)(t_2 − t_1). That is, the distribution of the increment I(t_1, t_2) depends on t_1 and t_2 only through the difference t_2 − t_1, which says that N(t) has stationary increments.
Second, for t_1 < t_2 and t_3 < t_4, let I(t_1, t_2) = N(t_2) − N(t_1) and I(t_3, t_4) = N(t_4) − N(t_3) be any two disjoint increments (i.e. the intervals (t_1, t_2] and (t_3, t_4] are disjoint). Then

    I(t_1, t_2) = I_1(t_1, t_2) + I_2(t_1, t_2)

and

    I(t_3, t_4) = I_1(t_3, t_4) + I_2(t_3, t_4).

But I_1(t_1, t_2) is independent of I_1(t_3, t_4) because the N_1(t) process has independent increments, and I_1(t_1, t_2) is independent of I_2(t_3, t_4) because the processes N_1(t) and N_2(t) are independent. Similarly, we can see that I_2(t_1, t_2) is independent of both I_1(t_3, t_4) and I_2(t_3, t_4). From this it is clear that the increment I(t_1, t_2) is independent of the increment I(t_3, t_4). Therefore, the process N(t) also has independent increments.

Thus, we have shown that the process {N(t) : t ≥ 0} satisfies the conditions in Definition 1 for it to be a Poisson process with rate λ_1 + λ_2.
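To see the superposition result in action, here is a small simulation sketch (my own illustration; the rates and time horizon are arbitrary) in which the merged count over [0, t] has mean and variance close to (λ_1 + λ_2)t, as a Poisson count with that rate should.

import numpy as np

rng = np.random.default_rng(2)
lam1, lam2, t, reps = 1.0, 2.5, 4.0, 20000

def poisson_count(rate):
    """Number of events in [0, t] built from Exponential(rate) interarrival times."""
    s, k = rng.exponential(1.0 / rate), 0
    while s <= t:
        k += 1
        s += rng.exponential(1.0 / rate)
    return k

totals = np.array([poisson_count(lam1) + poisson_count(lam2) for _ in range(reps)])
print(totals.mean(), (lam1 + lam2) * t)   # superposed count has mean (lam1 + lam2) * t
print(totals.var(), (lam1 + lam2) * t)    # and Poisson-like variance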
Remark 1: By repeated application of the above arguments we can see that the superposition of k independent Poisson processes with rates λ_1, ..., λ_k is again a Poisson process with rate λ_1 + ... + λ_k.
Remark 2: There is a useful result in probability theory which says that if we take N independent counting processes and sum them up, then the resulting superposition process is approximately a Poisson process. Here N must be large enough and the rates of the individual processes must be small relative to N (this can be made mathematically precise, but here in this remark our interest is in just the practical implications of the result), but the individual processes that go into the superposition can otherwise be arbitrary.
This can sometimes be used as a justification for using a Poisson pro-
cess model. For example, in the classical voice telephone system, each
individual produces a stream of connection requests to a given tele-
phone exchange, perhaps in a way that does not look at all like a
Poisson process. But the stream of requests coming from any given
individual typically makes up a very small part of the total aggregate
stream of connection requests to the exchange. It is also reasonable
that individuals make telephone calls largely independently of one an-
other. Such arguments provide a theoretical justification for modeling
the aggregate stream of connection requests to a telephone exchange
as a Poisson process. Indeed, empirical observation also supports such
a model.
In contrast to this, researchers in recent years have found that arrivals
of packets to gateway computers in the internet can exhibit some
behaviour that is not very well modeled by a Poisson process. The packet traffic exhibits large spikes, called bursts, that do not suggest that they are arriving uniformly in time. Even though many users may make up the aggregate packet traffic to a gateway or router, the number of such users is likely still not as many as the number of users that will make requests to a telephone exchange. More importantly, the aggregate traffic at an internet gateway tends to be dominated by
just a few individual users at any given time. The connection to our
remark here is that as the bandwidth in the internet increases and the
number of users grows, a Poisson process model should theoretically
become more and more reasonable.
Thinning a Poisson Process:

Let {N(t) : t ≥ 0} be a Poisson process with rate λ. Suppose we mark each event with probability p, independently from event to event, and let {N_1(t) : t ≥ 0} be the process which counts the marked events. We can use Definition 2 of a Poisson process to show that the thinned process N_1(t) is a Poisson process with rate pλ. To see this, first note that N_1(0) = N(0) = 0. Next, the probability that there is one marked event in the interval [0, h] is

    P(N_1(h) = 1) = P(N(h) = 1) p + Σ_{k=2}^∞ P(N(h) = k) (k choose 1) p (1 − p)^{k−1}
                  = (λh + o(h)) p + Σ_{k=2}^∞ o(h) k p (1 − p)^{k−1}
                  = pλh + o(h).

Similarly,

    P(N_1(h) = 0) = P(N(h) = 0) + P(N(h) = 1)(1 − p) + Σ_{k=2}^∞ P(N(h) = k)(1 − p)^k
                  = 1 − λh + o(h) + (λh + o(h))(1 − p) + Σ_{k=2}^∞ o(h)(1 − p)^k
                  = 1 − pλh + o(h).

Finally, P(N_1(h) ≥ 2) can be obtained by subtraction:

    P(N_1(h) ≥ 2) = 1 − P(N_1(h) = 0) − P(N_1(h) = 1)
                  = 1 − (1 − pλh + o(h)) − (pλh + o(h)) = o(h).
We can show that the increments in the thinned process are stationary by computing P(I_1(t_1, t_2) = k), where I_1(t_1, t_2) ≡ N_1(t_2) − N_1(t_1) is the increment from t_1 to t_2 in the thinned process, by conditioning on the increment I(t_1, t_2) ≡ N(t_2) − N(t_1) in the original process:

    P(I_1(t_1, t_2) = k)
      = Σ_{n=0}^∞ P(I_1(t_1, t_2) = k | I(t_1, t_2) = n) P(I(t_1, t_2) = n)
      = Σ_{n=k}^∞ P(I_1(t_1, t_2) = k | I(t_1, t_2) = n) P(I(t_1, t_2) = n)
      = Σ_{n=k}^∞ (n choose k) p^k (1 − p)^{n−k} {[λ(t_2 − t_1)]^n / n!} e^{−λ(t_2 − t_1)}
      = {[pλ(t_2 − t_1)]^k / k!} e^{−pλ(t_2 − t_1)} Σ_{n=k}^∞ {[(1 − p)λ(t_2 − t_1)]^{n−k} / (n − k)!} e^{−(1−p)λ(t_2 − t_1)}
      = {[pλ(t_2 − t_1)]^k / k!} e^{−pλ(t_2 − t_1)}.

This shows that the distribution of the increment I_1(t_1, t_2) depends on t_1 and t_2 only through the difference t_2 − t_1, and so the increments are stationary. Finally, the fact that the increments in the thinned process are independent is directly inherited from the independence of the increments in the original Poisson process N(t).
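The following simulation sketch (my own illustration; λ, p and t are arbitrary values) marks each event of a simulated Poisson process with probability p and checks that the marked count over [0, t] has mean and variance close to pλt.

import numpy as np

rng = np.random.default_rng(3)
lam, p, t, reps = 4.0, 0.3, 5.0, 20000

marked_counts = np.empty(reps, dtype=int)
for r in range(reps):
    s, marked = rng.exponential(1.0 / lam), 0
    while s <= t:
        if rng.random() < p:               # independently mark this event with probability p
            marked += 1
        s += rng.exponential(1.0 / lam)    # time of the next event in the original process
    marked_counts[r] = marked

print(marked_counts.mean(), p * lam * t)   # mean of the thinned count vs. p*lam*t
print(marked_counts.var(), p * lam * t)    # variance should also be close to p*lam*t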
Remark: The process consisting of the unmarked events, call it N_2(t), is also a Poisson process, this time with rate (1 − p)λ. The text shows that the two processes N_1(t) and N_2(t) are independent. Please read this section of the text (Sec. 5.3.4).
The main practical advantage that the Poisson process model has over other counting process models is the fact that many of its properties are explicitly known. For example, it is in general difficult or impossible to obtain explicitly the distribution of N(t) for any t if N(t) were a counting process other than a Poisson process. The memoryless property of the exponential interarrival times is also extremely convenient when doing calculations that involve the Poisson process.

(Please read the class notes for material on the filtered Poisson process and Proposition 5.3 of the text).
25
Continuous-Time Markov Chains - Introduction
Prior to introducing continuous-time Markov chains today, let us start off with an example involving the Poisson process. Our particular focus in this example is on the way the properties of the exponential distribution allow us to proceed with the calculations. This will give us a good starting point for considering how these properties can be used to build up more general processes, namely continuous-time Markov chains.

Example: (Ross, p.338 #48(a)). Consider an n-server parallel queueing system where customers arrive according to a Poisson process with rate λ, where the service times are exponential random variables with rate μ, and where any arrival finding all servers busy immediately departs without receiving any service. If an arrival finds all servers busy, find

(a) the expected number of busy servers found by the next arrival.
Solution: Let T_k denote the expected number of busy servers found by the next arrival for a k-server system when there are currently k servers busy. Equivalently, let it denote the expected number of busy servers found by the next arrival if there are currently k servers busy. The two descriptions of T_k are equivalent because of the memoryless property of the exponential service and interarrival times, and because between the current time and the time of the next arrival we can ignore the n − k idle servers when considering the expected number of busy servers found by the next arrival.

First, T_0 is clearly 0, because if there are currently 0 busy servers the next arrival will find 0 busy servers for sure. Next, consider T_1. If there is currently 1 busy server, the next arrival finds 1 busy server if the time to the next arrival is less than the remaining service time of the busy server. By memorylessness, the time to the next arrival is Exponential(λ) and the remaining service time is Exponential(μ). Therefore, the probability that the next arrival finds 1 server busy is λ/(λ + μ), and

    T_1 = (1) λ/(λ + μ) + (0) μ/(λ + μ) = λ/(λ + μ).
In general, consider the situation where we currently have k servers busy. We can obtain an expression for T_k by conditioning on what happens first. Let us see how the properties of the exponential distribution allow us to proceed with this argument. When there are currently k servers busy, we have k + 1 independent exponential alarm clocks going: k Exponential(μ) clocks, one for each remaining service time, and 1 Exponential(λ) clock for the time till the next arrival. For our purposes we wish to condition on whether a service completion happens first or the next arrival happens first. The time till the next service completion is the minimum of the k Exponential(μ) clocks, and this has an Exponential(kμ) distribution. Thus, the probability that the next thing to happen is a service completion is the probability that an Exponential(kμ) random variable is less than an Exponential(λ) random variable, and this probability is kμ/(kμ + λ). Similarly, the probability that the next thing to happen is the next arrival is λ/(kμ + λ).
Now, if the first thing to happen is the next customer arrival, then the expected number of busy servers found by the next arrival is k. On the other hand, if the first thing to happen is a service completion, then the expected number of busy servers found by the next arrival is T_{k−1}.

The reason this latter conditional expectation is given by T_{k−1}, and really the main thing I wish you to understand in this example, is that the memorylessness of the exponential interarrival time and all the exponential service times allows us to say that once we have conditioned on the first thing to happen being a service completion, then we can essentially restart the exponential clock on the interarrival time and the exponential clocks on the k − 1 service times still going. Thus, probabilistically we are in exactly the conditions defining T_{k−1}.
We have

    T_k = T_{k−1} [kμ/(kμ + λ)] + k [λ/(kμ + λ)].

Solving for T_n is now a matter of solving the recursion given by the above expression.
Starting with T_2, we have

    T_2 = T_1 [2μ/(2μ + λ)] + 2 [λ/(2μ + λ)]
        = [λ/(λ + μ)][2μ/(2μ + λ)] + 2λ/(2μ + λ).

Continuing (since the pattern isn't so obvious yet),

    T_3 = T_2 [3μ/(3μ + λ)] + 3 [λ/(3μ + λ)]
        = [λ/(λ + μ)][2μ/(2μ + λ)][3μ/(3μ + λ)] + [2λ/(2μ + λ)][3μ/(3μ + λ)] + 3λ/(3μ + λ).
In general, we can observe the following patterns for T_n:

- T_n will be a sum of n terms.
- The ith term will be a product of n + 1 − i factors.
- The ith term will have a factor iλ/(iμ + λ), for i = 1, ..., n.
- The ith term will have n − i remaining factors, given by (i+1)μ/((i+1)μ + λ), ..., nμ/(nμ + λ) for i = 1, ..., n − 1, while the nth term has no remaining factors.

Based on these observations, we can write

    T_n = nλ/(nμ + λ) + Σ_{i=1}^{n−1} [iλ/(iμ + λ)] Π_{j=i+1}^{n} [jμ/(jμ + λ)]

as our final expression.
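As a sanity check on the pattern-spotting (my own sketch, with arbitrary values of λ, μ and n), the recursion and the closed-form expression can be compared numerically:

lam, mu, n = 2.0, 1.0, 5

# Recursion: T_k = T_{k-1} * k*mu/(k*mu + lam) + k * lam/(k*mu + lam), starting from T_0 = 0
T = 0.0
for k in range(1, n + 1):
    T = T * (k * mu) / (k * mu + lam) + k * lam / (k * mu + lam)

# Closed form: T_n = n*lam/(n*mu + lam) + sum_{i=1}^{n-1} [i*lam/(i*mu + lam)] * prod_{j=i+1}^{n} [j*mu/(j*mu + lam)]
closed = n * lam / (n * mu + lam)
for i in range(1, n):
    term = i * lam / (i * mu + lam)
    for j in range(i + 1, n + 1):
        term *= j * mu / (j * mu + lam)
    closed += term

print(T, closed)   # the two values agree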
We saw in the last example one way to think about how the process which keeps track of the number of busy servers evolves, based on the exponential service and interarrival times. We make the following observations.

(i) When there are i busy servers (at any time), for i < n, there are i + 1 independent exponential alarm clocks running, with i of them having rate μ and 1 of them having rate λ. The time until the process makes a jump is exponential, with rate equal to the sum of all the competing rates: iμ + λ. If there are n busy servers then only the n exponential clocks corresponding to the service times can trigger a jump, and the time until the process makes a jump is exponential with rate nμ.

(ii) When the process jumps from state i, for i < n, it jumps to state i + 1 with probability λ/(iμ + λ) and jumps to state i − 1 with probability iμ/(iμ + λ). If there are n busy servers the process jumps to state n − 1 with probability nμ/nμ = 1.

(iii) When the process makes a jump from state i we can start up a whole new set of clocks corresponding to the state we jumped to. This is because even though some of the old clocks that had been running before we made our jump but did not actually trigger the current jump might still trigger the next jump, we can either reset these clocks or, equivalently, replace them with new clocks.

Note that every time we jump to state i, regardless of when the time is, the distribution of how long we stay in state i and the probabilities of where we jump to next when we leave state i are the same. In other words, the process is time-homogeneous.
We may generalize the preceding process, which tracks the number of busy servers in our opening example, in a fairly straightforward manner. First, we can generalize the state space {0, 1, ..., n} in that example to any arbitrary countable state space S. In addition, we can generalize (i), (ii) and (iii) above to the following:

(I) Every time the process is in state i there are n_i independent exponential clocks running, such that the first one to go off determines the state the process jumps to next. Let the rates of these n_i exponential clocks be q_{i,j_1}, ..., q_{i,j_{n_i}}, where j_1, ..., j_{n_i} are the n_i states that the process can possibly jump to next. The time until the process makes a jump is exponential with rate v_i ≡ q_{i,j_1} + ... + q_{i,j_{n_i}}.

(II) When the process jumps from state i, it jumps to state j_ℓ with probability q_{i,j_ℓ}/(q_{i,j_1} + ... + q_{i,j_{n_i}}) = q_{i,j_ℓ}/v_i, for ℓ = 1, ..., n_i.

(III) When the process makes a jump from state i we can start up a whole new set of clocks corresponding to the state we jumped to.

The above description of a continuous-time stochastic process corresponds to a continuous-time Markov chain. This is not how a continuous-time Markov chain is defined in the text (which we will also look at), but the above description is equivalent to saying the process is a time-homogeneous, continuous-time Markov chain, and it is a more revealing and useful way to think about such a process than the formal definition given in the text.
Example: The Poisson Process. The Poisson process is a continuous-time Markov chain with state space S = {0, 1, 2, ...}. If at any time we are in state i we can only possibly jump to state i + 1 when we leave state i, and there is a single exponential clock running that has rate q_{i,i+1} = λ. The time until we leave state i is exponential with rate v_i = q_{i,i+1} = λ. When the process leaves state i, it jumps to state i + 1 with probability q_{i,i+1}/v_i = v_i/v_i = 1.
Example: Pure Birth Processes. We can generalize the Poisson process by replacing q_{i,i+1} = λ with q_{i,i+1} = λ_i. Such a process is called a pure birth process, or just birth process. The state space is the same as that of a Poisson process, S = {0, 1, 2, ...}. If at any time the birth process is in state i there is a single exponential clock running with rate λ_i, and so v_i = λ_i. We see that the only difference between a Poisson process and a pure birth process is that in the pure birth process the rate of leaving a state can depend on the state.
Example: Birth/Death Processes. A birth/death process generalizes the pure birth process by allowing jumps from state i to state i − 1 in addition to jumps from state i to state i + 1. The state space is typically the set of all integers or a subset of the integers, but varies depending on the particular modeling scenario. We can make the state space a proper subset of the integers by making the rates of any jumps that go out of the subset equal to 0. Whenever a birth/death process is in state i there are two independent exponential clocks running, one that will take us to state i + 1 if it goes off first and the other which will take us to state i − 1 if it goes off first. Following the text, we denote the rates of these clocks by q_{i,i+1} = λ_i (the birth rates) and q_{i,i−1} = μ_i (the death rates), and v_i = λ_i + μ_i. This important class of processes is the subject of all of Section 6.3 of the text.
Example: The n-Server Parallel Queueing System. We can see that the n-server parallel queueing system described in our opening example is a birth/death process. The state space is S = {0, 1, ..., n}. When in state i, for i = 0, ..., n − 1, the birth rate is λ, and when in state n the birth rate is 0. When in state i, for i = 0, ..., n, the death rate is iμ. That is, this process is a birth/death process with

    λ_i = λ    for i = 0, 1, ..., n − 1,
    λ_n = 0,
    μ_i = iμ   for i = 0, 1, ..., n.

The main thing I would like you to focus on in this lecture is the description of a continuous-time stochastic process with countable state space S given in (I), (II) and (III). Imagine a particle jumping around the state space as time moves forward according to the mechanisms described there.
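A few lines of code make that picture concrete. The sketch below (my own illustration) simulates the mechanism in (I)-(III) for the n-server queue above: in each state we start one exponential clock per possible destination, and the first clock to go off decides the jump. The values of λ, μ, n and the time horizon are arbitrary.

import numpy as np

rng = np.random.default_rng(4)
lam, mu, n = 2.0, 1.0, 3

def clocks(i):
    """Return the dictionary {j: q_ij} of exponential clock rates out of state i."""
    rates = {}
    if i < n:
        rates[i + 1] = lam      # an arrival pushes the process up by one
    if i > 0:
        rates[i - 1] = i * mu   # any of the i service completions pushes it down by one
    return rates

state, time_now, horizon = 0, 0.0, 50.0
while True:
    samples = {j: rng.exponential(1.0 / q) for j, q in clocks(state).items()}  # one clock per destination
    j_next = min(samples, key=samples.get)   # the first clock to go off determines the jump
    time_now += samples[j_next]
    if time_now > horizon:
        break
    state = j_next
print("state at the horizon:", state)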
Next we will formally define a continuous-time Markov chain in terms
of the Markov property for continuous-time processes and see how this
corresponds to the description given in (I), (II) and (III).
26
Continuous-Time Markov Chains - Introduction II
Our starting point for today is the description of a continuous-time stochastic process discussed previously. Specifically, the process can have any countable state space S. With each state i ∈ S there is associated a set of n_i independent, exponential alarm clocks with rates q_{i,j_1}, ..., q_{i,j_{n_i}}, where j_1, ..., j_{n_i} is the set of possible states the process may jump to when it leaves state i. We have seen that when the process enters state i, the amount of time it spends in state i is Exponentially distributed with rate v_i = q_{i,j_1} + ... + q_{i,j_{n_i}}, and when it leaves state i it will go to state j_ℓ with probability q_{i,j_ℓ}/v_i, for ℓ = 1, ..., n_i.

We also stated previously that any process described by the above probabilistic mechanisms corresponds to a continuous-time Markov chain. We will now elaborate on this statement in more detail. We start by defining the Markov property for a continuous-time process, which leads to the formal definition of what it means for a stochastic process to be a continuous-time Markov chain.
The Markov Property for Continuous-Time Processes:

You should be familiar and comfortable with what the Markov property means for discrete-time stochastic processes. The natural extension of this property to continuous-time processes can be stated as follows. For a continuous-time stochastic process {X(t) : t ≥ 0} with state space S, we say it has the Markov property if

    P(X(t) = j | X(s) = i, X(t_{n−1}) = i_{n−1}, ..., X(t_1) = i_1) = P(X(t) = j | X(s) = i),

where 0 ≤ t_1 ≤ t_2 ≤ ... ≤ t_{n−1} ≤ s ≤ t is any nondecreasing sequence of n + 1 times and i_1, i_2, ..., i_{n−1}, i, j ∈ S are any n + 1 states in the state space, for any integer n ≥ 1. That is, given the state of the process at any set of times prior to time t, the distribution of the process at time t depends only on the process at the most recent time prior to time t. An equivalent way to say this is to say that given the state of the process at time s, the distribution of the process at any time after s is independent of the entire past of the process before time s. This notion is exactly analogous to the Markov property for a discrete-time process.

Definition: A continuous-time stochastic process {X(t) : t ≥ 0} is called a continuous-time Markov chain if it has the Markov property.

The Markov property is a "forgetting" property, suggesting memorylessness in the distribution of the time a continuous-time Markov chain spends in any state. This is indeed the case if the process is also time homogeneous.
Time Homogeneity: We say that a continuous-time Markov chain is time homogeneous if for any s ≤ t and any states i, j ∈ S,

    P(X(t) = j | X(s) = i) = P(X(t − s) = j | X(0) = i).

As with discrete-time Markov chains, a continuous-time Markov chain need not be time homogeneous, but in this course we will consider only time homogeneous Markov chains.

By time homogeneity, whenever the process enters state i, the way it evolves probabilistically from that point is the same as if the process started in state i at time 0. When the process enters state i, the time it spends there before it leaves state i is called the holding time in state i. By time homogeneity, we can speak of the holding time distribution because it is the same every time the process enters state i. Let T_i denote the holding time in state i. Then we have the following Proposition.

Proposition: T_i is exponentially distributed.

Proof. By time homogeneity, we assume that the process starts out in state i. For s ≥ 0 the event {T_i > s} is equivalent to the event {X(u) = i for 0 ≤ u ≤ s}. Similarly, for s, t ≥ 0 the event {T_i > s + t} is equivalent to the event {X(u) = i for 0 ≤ u ≤ s + t}. Therefore,

    P(T_i > s + t | T_i > s)
      = P(X(u) = i for 0 ≤ u ≤ s + t | X(u) = i for 0 ≤ u ≤ s)
      = P(X(u) = i for s < u ≤ s + t | X(u) = i for 0 ≤ u ≤ s)
      = P(X(u) = i for s < u ≤ s + t | X(s) = i)
      = P(X(u) = i for 0 < u ≤ t | X(0) = i)
      = P(T_i > t),
where

- the second equality follows from the simple fact that P(A ∩ B | A) = P(B | A), where we let A = {X(u) = i for 0 ≤ u ≤ s} and B = {X(u) = i for s < u ≤ s + t};
- the third equality follows from the Markov property;
- the fourth equality follows from time homogeneity.

Therefore, the distribution of T_i has the memoryless property, which implies that it is exponential.

By time homogeneity, every time our continuous-time Markov chain leaves state i,

- the number of states it could possibly jump to must stay the same, and we can let n_i denote this number;
- the set of states it could possibly jump to must stay the same, and we can let {j_1, ..., j_{n_i}} denote this set of states;
- the probability of going to state j_ℓ must stay the same, and we can let p_{i,j_ℓ} denote this probability, for ℓ = 1, ..., n_i.
Essentially, starting with the Markov property and time homogeneity, we have rebuilt our original description of a continuous-time Markov chain that was in terms of exponential alarm clocks. It may not be immediately obvious that we have done so, because our current description uses the probabilities p_{i,j_ℓ} while our original description used the rates q_{i,j_ℓ}. But the two descriptions are the same, with the following correspondence between the p_{i,j_ℓ} and the q_{i,j_ℓ}:

    p_{i,j_ℓ} = q_{i,j_ℓ}/v_i   or   q_{i,j_ℓ} = v_i p_{i,j_ℓ}.
Let us stop using the notation j_ℓ to denote a state that we can get to from state i, and just use the simpler notation j (or something similar like k), with the understanding that j is just a label. In this simpler notation, we have

    p_{ij} = q_{ij}/v_i   or   q_{ij} = v_i p_{ij}.

We make the following remarks regarding p_{ij} and q_{ij}.
Remark Concerning p_{ij} (Embedded Markov Chains): The probability p_{ij} is the probability of going to state j at the next jump given that we are currently in state i. The matrix P whose (i, j)th entry is p_{ij} is a stochastic matrix and so is the one-step transition probability matrix of a (discrete-time) Markov chain. We call this discrete-time chain the embedded Markov chain. Every continuous-time Markov chain has an associated embedded discrete-time Markov chain. While the transition matrix P completely determines the probabilistic behaviour of the embedded discrete-time Markov chain, it does not fully capture the behaviour of the continuous-time process, because it does not specify the rates at which transitions occur.
Remark Concerning q_{ij}: Recall that q_{ij} is the rate of the exponential alarm clock corresponding to state j that starts up whenever we enter state i. We say that q_{ij} is the rate of going from state i to state j. Note that q_{ii} = 0 for any i. The rates q_{ij} taken all together contain more information about the process than the probabilities p_{ij} taken all together. This is because if we know all the q_{ij} we can calculate all the v_i and then all the p_{ij}. But if we know all the p_{ij} we can't recover the q_{ij}. In many ways the q_{ij} are to continuous-time Markov chains what the p_{ij} are to discrete-time Markov chains.
However, there is an important difference between the q_{ij} in a continuous-time Markov chain and the p_{ij} in a discrete-time Markov chain to keep in mind. Namely, the q_{ij} are rates, not probabilities and, as such, while they must be nonnegative, they are not bounded by 1.
The Transition Probability Function

Just as the rates q_{ij} in a continuous-time Markov chain are the counterpart of the transition probabilities p_{ij} in a discrete-time Markov chain, there is a counterpart to the n-step transition probabilities p_{ij}(n) of a discrete-time Markov chain. The transition probability function, P_{ij}(t), for a time homogeneous, continuous-time Markov chain is defined as

    P_{ij}(t) = P(X(t) = j | X(0) = i).

Note that there is no time step in a continuous-time Markov chain. For each pair of states i, j ∈ S, the transition probability function P_{ij}(t) is in fact a continuous function of t. In the next lecture we will explore the relationship, which is fundamental, between the transition probability functions P_{ij}(t) and the exponential rates q_{ij}. In general, one cannot determine the transition probability function P_{ij}(t) in a nice closed form. In simple cases we can. For example, in the Poisson process we have seen that for i ≤ j,

    P_{ij}(t) = P(there are j − i events in an interval of length t) = [(λt)^{j−i} / (j − i)!] e^{−λt}.

In Proposition 6.1, the text shows how one can explicitly compute P_{ij}(t) for a pure birth process, which was described last time, in which the birth rates λ_i are all different (that is, λ_i ≠ λ_j for i ≠ j). Please read this example in the text.
We can say some important general things about P_{ij}(t), however. Since these functions are the counterpart of the n-step transition probabilities, one might guess that there is a counterpart to the Chapman-Kolmogorov equations for these functions. There is, and we will end today's lecture with this result, whose proof is essentially identical to the proof in the discrete case.

Lemma. (Lemma 6.3 in text, Chapman-Kolmogorov Equations). Let {X(t) : t ≥ 0} be a continuous-time Markov chain with state space S, rates (q_{ij})_{i,j∈S} and transition probability functions (P_{ij}(t))_{i,j∈S}. Then for any s, t ≥ 0,

    P_{ij}(t + s) = Σ_{k∈S} P_{ik}(t) P_{kj}(s).

Proof. By conditioning on X(t), we have

    P_{ij}(t + s) = P(X(t + s) = j | X(0) = i)
      = Σ_{k∈S} P(X(t + s) = j | X(t) = k, X(0) = i) P(X(t) = k | X(0) = i)
      = Σ_{k∈S} P(X(t + s) = j | X(t) = k) P(X(t) = k | X(0) = i)
      = Σ_{k∈S} P(X(s) = j | X(0) = k) P(X(t) = k | X(0) = i)
      = Σ_{k∈S} P_{kj}(s) P_{ik}(t),

as desired.
For a given t, if we form the probabilities P_{ij}(t) into an |S| × |S| matrix P(t) whose (i, j)th entry is P_{ij}(t), then the Chapman-Kolmogorov equation

    P_{ij}(t + s) = Σ_{k∈S} P_{ik}(t) P_{kj}(s)

says that the (i, j)th entry of P(t + s) is the dot product of the ith row of P(t) and the jth column of P(s). But that is the same thing as the (i, j)th entry in the matrix product of P(t) and P(s). That is,

    P(t + s) = P(t)P(s).

This is the direct analogue of the discrete-time result. Just a note on terminology: in the discrete-time case we called the matrix P(n) the n-step transition probability matrix. Because there is no notion of a time step in continuous time, we just simply call P(t) the matrix transition probability function. Note that it is a matrix-valued function of the continuous variable t.
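As a numerical aside (my own illustration, not from the text), the semigroup property P(t + s) = P(t)P(s) is easy to check for a small chain once P(t) is computed as the matrix exponential e^{tG} of the generator introduced in the next lecture; the two-state generator below is an arbitrary example.

import numpy as np
from scipy.linalg import expm

G = np.array([[-1.0,  1.0],
              [ 2.0, -2.0]])    # rows sum to 0; the off-diagonal entries are the rates q_ij
t, s = 0.7, 1.3

lhs = expm((t + s) * G)          # P(t + s)
rhs = expm(t * G) @ expm(s * G)  # P(t) P(s)
print(np.allclose(lhs, rhs))     # True, as the Chapman-Kolmogorov equations require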
27
Key Properties of Continuous-Time Markov
Chains
The key quantities that specify a discrete-time Markov chain are the transition probabilities p_{ij}. In continuous time, the corresponding key quantities are the transition rates q_{ij}. Recall that we may think of q_{ij} as the rate of an exponentially distributed alarm clock that starts as soon as we enter state i, where there is one alarm clock that starts for each state that we could possibly go to when we leave state i. We leave state i as soon as an alarm clock goes off, and we go to state j if it was the clock corresponding to state j that went off first. The time until the first alarm clock goes off is exponentially distributed with rate v_i = Σ_{j∈S} q_{ij}, where we let q_{ij} = 0 if we cannot go to state j from state i. When we leave state i we go to state j with probability q_{ij}/v_i, which we also denote by p_{ij}. The p_{ij} are the transition probabilities of the embedded discrete-time Markov chain, also called the jump chain.

To summarize, the quantities q_{ij}, v_i and p_{ij} are related by the equalities

    v_i = Σ_{j∈S} q_{ij},    q_{ij} = v_i p_{ij},    p_{ij} = q_{ij}/v_i.
To avoid technicalities, we will assume that v_i < ∞ for all i in this course. It is possible for v_i to equal +∞, since the rates q_{ij} need not form a convergent sum when we sum over j. If v_i = ∞ then the process will leave state i immediately after it enters state i. This behaviour is not typical of the models we will consider in this course (though it can be typical for some kinds of systems, such as configurations on an infinite lattice, for example). We also will assume that v_i > 0 for all i. If v_i = 0 then when we enter state i we will stay there forever, so v_i = 0 would correspond to state i being an absorbing state. This does not present any real technical difficulties (we have already considered this possibility in the discrete time setting). However, we will not consider any absorbing states in the continuous time models we will look at.
Since the time spent in state i is exponentially distributed with rate v_i (where 0 < v_i < ∞), we may expect from what we know about the Poisson process that the probability of 2 or more transitions in a time interval of length h should be o(h). This is indeed the case. If T_i denotes the holding time in state i, then T_i is Exponential(v_i), and

    P(T_i > h) = e^{−v_i h}.

Expanding the exponential function in a Taylor series, we have

    P(T_i > h) = e^{−v_i h} = 1 − v_i h + (v_i h)²/2! − (v_i h)³/3! + ... = 1 − v_i h + o(h).

This also implies that

    P(T_i ≤ h) = v_i h + o(h).
Furthermore, if T_j denotes the holding time in state j, for j ≠ i, then T_j is Exponentially distributed with rate v_j and T_j is independent of T_i. Since the event {T_i + T_j ≤ h} implies the event {T_i ≤ h, T_j ≤ h}, we have that

    {T_i + T_j ≤ h} ⊂ {T_i ≤ h, T_j ≤ h},

so that

    P(T_i + T_j ≤ h) ≤ P(T_i ≤ h, T_j ≤ h)
                     = P(T_i ≤ h) P(T_j ≤ h)
                     = (v_i h + o(h))(v_j h + o(h))
                     = v_i v_j h² + o(h) = o(h),

which implies that P(T_i + T_j ≤ h) = o(h). Thus, starting in state i, if we compute the probability of 2 or more transitions by time h by conditioning on the first transition, we obtain

    P(2 or more transitions by time h | X(0) = i)
      = Σ_{j≠i} P(2 or more transitions by h | X(0) = i, 1st transition to j) p_{ij}
      = Σ_{j≠i} P(T_i + T_j ≤ h) p_{ij}
      = Σ_{j≠i} o(h) p_{ij} = o(h).

Since

    P(0 transitions by time h | X(0) = i) = P(T_i > h) = 1 − v_i h + o(h),

we also have

    P(exactly 1 transition by time h | X(0) = i)
      = 1 − (1 − v_i h + o(h)) − o(h)
      = v_i h + o(h).
To summarize, we have

    P(0 transitions by time h | X(0) = i) = 1 − v_i h + o(h)
    P(exactly 1 transition by time h | X(0) = i) = v_i h + o(h)
    P(2 or more transitions by time h | X(0) = i) = o(h).

Now, for j ≠ i, consider the conditional probability P(X(h) = j | X(0) = i). Given that X(0) = i, one way for the event {X(h) = j} to occur is for there to be exactly one transition in the interval [0, h] and for that transition to be to state j. The probability of this is (v_i h + o(h)) p_{ij} = v_i p_{ij} h + o(h). Moreover, the event consisting of the union of every other way to be in state j at time h starting in state i implies the event that there were 2 or more transitions in the interval [0, h]. So the probability of this second event is o(h). Summarizing, we have

    P(X(h) = j | X(0) = i) = v_i p_{ij} h + o(h) + o(h) = v_i p_{ij} h + o(h).

Similarly, if we consider the conditional probability P(X(h) = i | X(0) = i), the only way for the event {X(h) = i} to occur given that X(0) = i that does not involve at least 2 transitions in the interval [0, h] is for there to be 0 transitions in the interval [0, h]. Thus,

    P(X(h) = i | X(0) = i) = P(0 transitions in [0, h] | X(0) = i) + o(h)
                           = 1 − v_i h + o(h) + o(h)
                           = 1 − v_i h + o(h).
Now we are in a position to derive a set of differential equations, called Kolmogorov's Equations, for the probability functions p_{ij}(t). We proceed in a familiar way, by deriving a system of equations by conditioning. There are actually 2 sets of equations we can derive for the p_{ij}(t): Kolmogorov's Backward Equations and Kolmogorov's Forward Equations. We will now derive the Backward Equations. To do so we will evaluate p_{ij}(t + h) by conditioning on X(h) (here h is some small positive amount). We obtain

    p_{ij}(t + h) = P(X(t + h) = j | X(0) = i)
      = Σ_{k∈S} P(X(t + h) = j | X(h) = k, X(0) = i) P(X(h) = k | X(0) = i)
      = Σ_{k∈S} P(X(t + h) = j | X(h) = k) P(X(h) = k | X(0) = i)
      = Σ_{k∈S} P(X(t) = j | X(0) = k) P(X(h) = k | X(0) = i)
      = Σ_{k∈S} p_{kj}(t) P(X(h) = k | X(0) = i),

where the third equality follows from the Markov property and the fourth equality follows from time-homogeneity. Now we separate out the term with k = i and use our results from above to obtain

    p_{ij}(t + h) = p_{ij}(t)(1 − v_i h + o(h)) + Σ_{k≠i} p_{kj}(t)(v_i p_{ik} h + o(h)),

which is equivalent to

    p_{ij}(t + h) − p_{ij}(t) = −v_i p_{ij}(t) h + Σ_{k≠i} p_{kj}(t) v_i p_{ik} h + o(h).
Upon dividing by h, and using the fact that v_i p_{ik} = q_{ik}, we get

    [p_{ij}(t + h) − p_{ij}(t)] / h = Σ_{k≠i} q_{ik} p_{kj}(t) − v_i p_{ij}(t) + o(h)/h.

As we let h → 0, the left hand side above approaches p′_{ij}(t), which shows that p_{ij}(t) is differentiable, and given by

    p′_{ij}(t) = Σ_{k≠i} q_{ik} p_{kj}(t) − v_i p_{ij}(t).

The above differential equations, for i, j ∈ S, are called Kolmogorov's Backward Equations. We may write down the entire set of equations more succinctly in matrix form. Let P(t) be the |S| × |S| matrix with (i, j)th entry p_{ij}(t) and P′(t) the |S| × |S| matrix with (i, j)th entry p′_{ij}(t). We call P(t) the matrix transition probability function, which is a (matrix-valued) differentiable function of t. If we form a matrix, which we will call G, whose ith row has −v_i in the ith column and q_{ik} in the kth column, then we see that the right hand side of Kolmogorov's Backward Equation for p_{ij}(t) is just the dot product of the ith row of G with the jth column of P(t). That is, the differential equation above is the same as

    [P′(t)]_{ij} = [GP(t)]_{ij},

so that in matrix form, Kolmogorov's Backward Equations can be written as

    P′(t) = GP(t).
The Infinitesimal Generator: The matrix G is a fundamental quantity associated with the continuous-time Markov chain {X(t) : t ≥ 0}. It is called the infinitesimal generator, or simply generator, of the chain. If we let g_{ij} denote the (i, j)th entry of G, then

    g_{ij} = q_{ij}   for i ≠ j, and
    g_{ii} = −v_i.

The generator matrix G contains all the rate information for the chain and, even though its entries are not probabilities, it is the counterpart of the one-step transition probability matrix P for discrete-time Markov chains. In deriving Kolmogorov's Backward Equations, if we had conditioned on X(t) instead of X(h) we would have derived another set of differential equations called Kolmogorov's Forward Equations, which in matrix form are given by

    P′(t) = P(t)G.

For both the backward and the forward equations, we have the boundary condition

    P(0) = I,

where I is the |S| × |S| identity matrix. The boundary condition follows since

    p_{ii}(0) = P(X(0) = i | X(0) = i) = 1

and, for i ≠ j,

    p_{ij}(0) = P(X(0) = j | X(0) = i) = 0.
Though the backward and forward equations are two different sets of differential equations, with the above boundary condition they have the same solution, given by

    P(t) = e^{tG} ≡ Σ_{n=0}^∞ (tG)^n / n! = I + tG + (tG)²/2! + (tG)³/3! + ...

Keep in mind that the notation e^{tG} is meaningless except as shorthand notation for the infinite sum above. To see that the above satisfies the backward equations we may simply plug it into the differential equations and check that it solves them. Differentiating with respect to t, we get

    P′(t) = G + tG² + (t²/2!)G³ + (t³/3!)G⁴ + ...
          = G [I + tG + (tG)²/2! + (tG)³/3! + ...]
          = GP(t).

Also, P(0) = I is clearly satisfied. Moreover, we could also have written

    P′(t) = [I + tG + (tG)²/2! + (tG)³/3! + ...] G = P(t)G,

showing that P(t) = e^{tG} satisfies the forward equations as well. Thus, even though we cannot normally obtain P(t) in a simple and explicit closed form, the infinite sum representation e^{tG} is general, and can be used to obtain numerical approximations to P(t) if |S| is finite, by truncating the infinite sum to a finite sum (see Section 6.8).
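Here is a minimal sketch of that numerical idea (my own illustration; the 3-state generator is an arbitrary example): truncate the series for e^{tG} and compare the result with a library matrix exponential.

import numpy as np
from scipy.linalg import expm

G = np.array([[-3.0,  2.0,  1.0],
              [ 1.0, -1.5,  0.5],
              [ 2.0,  2.0, -4.0]])   # off-diagonal entries q_ij >= 0, diagonal g_ii = -v_i
t = 0.8

# Truncated sum  I + tG + (tG)^2/2! + ... + (tG)^K/K!
K = 25
P_approx = np.eye(3)
term = np.eye(3)
for k in range(1, K + 1):
    term = term @ (t * G) / k        # term is now (tG)^k / k!
    P_approx = P_approx + term

print(np.allclose(P_approx, expm(t * G)))   # True for a modest truncation level
print(P_approx.sum(axis=1))                 # each row sums to 1, as a transition matrix must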
Remark: The text uses the notation R for the generator matrix, presumably to stand for the Rate matrix. The notation G is more common and will be adopted here, and the terminology generator matrix or infinitesimal generator matrix is standard.

The solution P(t) = e^{tG} shows how basic the generator matrix G is to the properties of a continuous-time Markov chain. We will now show that the generator G is also the key quantity for determining the stationary distribution of the chain. First, we define what we mean by a stationary distribution for a continuous-time Markov chain.
Stationary Distributions:

Definition: Let {X(t) : t ≥ 0} be a continuous-time Markov chain with state space S, generator G, and matrix transition probability function P(t). An |S|-dimensional (row) vector π = (π_i)_{i∈S}, with π_i ≥ 0 for all i and Σ_{i∈S} π_i = 1, is said to be a stationary distribution if π = πP(t) for all t ≥ 0.

A vector π which satisfies π = πP(t) for all t ≥ 0 is called a stationary distribution for exactly the same reason that the stationary distribution of a discrete-time Markov chain is called the stationary distribution. It makes the process stationary. That is, if we set the initial distribution of X(0) to be such a π, then the distribution of X(t) will also be π for all t > 0 (i.e. P(X(t) = j) = π_j for all j ∈ S and all t > 0). To see this, set the initial distribution of X(0) to be π and compute P(X(t) = j) by conditioning on X(0). This gives
    P(X(t) = j) = Σ_{i∈S} P(X(t) = j | X(0) = i) P(X(0) = i)
                = Σ_{i∈S} p_{ij}(t) π_i
                = [πP(t)]_j = π_j,

as claimed.
To see how the generator G relates to the definition of a stationary distribution, we can replace P(t) in the definition of π with e^{tG}. Doing so, we obtain the following equivalences:

    π is a stationary distribution ⇔ π = πP(t) for all t ≥ 0
                                   ⇔ π = π Σ_{n=0}^∞ (tG)^n / n! for all t ≥ 0
                                   ⇔ 0 = π Σ_{n=1}^∞ (t^n / n!) G^n for all t ≥ 0
                                   ⇔ 0 = πG^n for all n ≥ 1
                                   ⇔ 0 = πG.

You should convince yourself that the implications are true in both directions in each of the lines above.

Thus, we see that the condition π = πP(t) for all t ≥ 0, which would be quite difficult to check, reduces to the much simpler condition 0 = πG, in terms of the generator matrix G. The equations 0 = πG are a set of |S| linear equations which, together with the normalization constraint Σ_{i∈S} π_i = 1, determine the stationary distribution π if one exists.
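For a finite chain this is a small linear-algebra problem. The sketch below (my own illustration, reusing the arbitrary 3-state generator from the earlier numerical example) solves 0 = πG together with the normalization constraint.

import numpy as np

G = np.array([[-3.0,  2.0,  1.0],
              [ 1.0, -1.5,  0.5],
              [ 2.0,  2.0, -4.0]])

# pi G = 0 is G^T pi^T = 0; append the normalization equation sum(pi) = 1 and solve by least squares.
A = np.vstack([G.T, np.ones(3)])
b = np.array([0.0, 0.0, 0.0, 1.0])
pi, *_ = np.linalg.lstsq(A, b, rcond=None)

print(pi)        # the stationary distribution
print(pi @ G)    # approximately the zero vector, i.e. the global balance equations hold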
The jth equation in 0 = πG is given by

    0 = −v_j π_j + Σ_{i≠j} q_{ij} π_i,

which is equivalent to

    π_j v_j = Σ_{i≠j} π_i q_{ij}.

This equation has the following interpretation. On the left hand side, π_j is the long run proportion of time that the process is in state j, while v_j is the rate of leaving state j when the process is in state j. Thus, the product π_j v_j is interpreted as the long run rate of leaving state j. On the right hand side, q_{ij} is the rate of going to state j when the process is in state i, so the product π_i q_{ij} is interpreted as the long run rate of going from state i to state j. Summing over all i ≠ j then gives the long run rate of going to state j. That is, the equation

    π_j v_j = Σ_{i≠j} π_i q_{ij}

is interpreted as

    the long run rate out of state j = the long run rate into state j,

and for this reason the equations 0 = πG are called the Global Balance Equations, or just Balance Equations, because they express the fact that when the process is made stationary, there must be equality, or balance, in the long run rates into and out of any state.
28
Limiting Probabilities
We now consider the limiting probabilities

    lim_{t→∞} p_{ij}(t),

for a continuous-time Markov chain {X(t) : t ≥ 0}, where

    p_{ij}(t) = P(X(t) = j | X(0) = i)

is the transition probability function for the states i and j.

Last time we considered the stationary distribution π of a continuous-time Markov chain, and saw that π is the distribution of X(t) for all t when the process is stationary. We also interpret π_j as the long run proportion of time that the process is in state j. Based on what we know about discrete-time Markov chains, we may expect that the limiting probability lim_{t→∞} p_{ij}(t) is equal to the stationary probability π_j for all i ∈ S. That is, no matter what state i we start in at time 0, the probability that we are in state j at time t approaches π_j as t gets larger and larger. This is indeed the case, assuming the stationary distribution π exists, although it may not exist. However, in this course the only continuous-time Markov chains we will consider will be those for which the stationary distribution exists.
Actually, the fact that the limiting probabilities and the stationary probabilities are the same is even more true in continuous time than in discrete time. In discrete time, we saw that even though the stationary distribution may exist, the limiting probabilities still may not exist if the discrete-time chain is not aperiodic. However, in continuous time we don't run into such difficulties because continuous-time Markov chains don't have a period! There is no step in continuous time, so there is no definition of period for a continuous-time Markov chain. In fact, it can be shown (though we won't prove it) that for any two states i and j in a continuous-time Markov chain, exactly one of the following two statements must be true:

1. p_{ij}(t) = 0 for all t > 0, or
2. p_{ij}(t) > 0 for all t > 0.

This is called the Levy Dichotomy, and it shows that if a continuous-time Markov chain is irreducible, in the sense that the embedded jump chain is irreducible, then starting in state i we could possibly be in state j at any positive time, for any state j, including the starting state i.

We may state the following theorem which summarizes the basic result we would like to have concerning the limiting probabilities.

Theorem: In a continuous-time Markov chain, if a stationary distribution π exists, then it is unique and

    lim_{t→∞} p_{ij}(t) = π_j,

for all i.
We will not prove the preceding theorem completely. We will say something about the uniqueness of π (if it exists) in tomorrow's lecture. For now let us focus on the second statement in the theorem concerning the limiting probabilities. Using the Kolmogorov Forward Equations, we can easily prove something slightly weaker, which is that assuming the limiting probabilities exist and are independent of the starting state, then lim_{t→∞} p_ij(t) = π_j. This is the extent of what is shown in the text in Section 6.5, and we'll content ourselves with that. However, you should be aware that we are not completely proving the statement in the theorem (not because it is too difficult to prove, but just in the interest of time).
Thus, assuming lim_{t→∞} p_ij(t) exists and is independent of i, let ν_j = lim_{t→∞} p_ij(t) and let ν = (ν_j)_{j∈S} be the |S|-dimensional row vector whose jth component is ν_j. In matrix form the assumption that lim_{t→∞} p_ij(t) = ν_j for all i and j is

lim_{t→∞} P(t) = V,

where P(t) is the matrix transition probability function with (i, j)th entry p_ij(t) introduced last time, and V is an |S| × |S| matrix in which each row is equal to ν.
Now, if p_ij(t) → ν_j as t → ∞, then we must have p′_ij(t) → 0 as t → ∞ because p_ij(t) is becoming more and more a constant as t gets larger and larger. In matrix form we may write this as

lim_{t→∞} P′(t) = 0,

where P′(t) is the |S| × |S| matrix with (i, j)th entry p′_ij(t) and 0 is just the |S| × |S| matrix of zeros.
Now, recall that Kolmogorov's Forward Equations state that

P′(t) = P(t)G,

where G is the infinitesimal generator of the chain. Thus, letting t → ∞, we obtain

0 = VG.

But since each row of V is equal to the limiting probability vector ν, this implies that

0 = νG,

where now (slightly abusing notation) 0 denotes the |S|-dimensional row vector of zeros.
Thus, we see that ν satisfies the global balance equations, which are the equations which determine the stationary distribution π. Assuming the stationary distribution is unique, this implies that ν = π. So the global balance equations yield both the stationary distribution and the limiting probability vector. As in the discrete-time setting, this is an important and useful result because if it were not true then we would need to do different calculations depending on what questions we were asking about our system being modeled, and it is not clear that finding limiting probabilities would be very easy or even possible.
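To make the convergence P(t) → V concrete, here is a small added sketch (with a made-up three-state generator) that computes P(t) = exp(tG) and checks how far each of its rows is from the stationary distribution π for increasing t.

```python
import numpy as np
from scipy.linalg import expm

# Illustrative generator for a 3-state continuous-time Markov chain.
Q = np.array([[0.0, 1.0, 2.0],
              [2.0, 0.0, 1.0],
              [1.0, 3.0, 0.0]])
G = Q - np.diag(Q.sum(axis=1))

# Stationary distribution from the global balance equations 0 = pi G, sum(pi) = 1.
A = np.vstack([G.T[:-1], np.ones(3)])
b = np.array([0.0, 0.0, 1.0])
pi = np.linalg.solve(A, b)

# Transition probability function P(t) = exp(tG); every row approaches pi as t grows.
for t in (0.1, 1.0, 10.0):
    P_t = expm(t * G)
    print(t, np.max(np.abs(P_t - pi)))   # maximum deviation of any row from pi
```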
29
Local Balance Equations
We have seen that for a continuous-time Markov chain X = {X(t) : t ≥ 0}, the stationary distribution π, if it exists, must satisfy the global balance equations 0 = πG, where G is the infinitesimal generator of the chain. As for discrete-time Markov chains, there is also a set of equations called the local balance equations that the stationary distribution of X may or may not satisfy. Today we will discuss the local balance equations for a continuous-time Markov chain, and give some examples.
First, however, we will sidetrack from this discussion to note the important relationship between the stationary distribution of X and the stationary distribution of the embedded discrete-time jump chain of X. These two distributions are in fact not the same. We will also discuss some consequences of this relationship.
Relationship Between the Stationary Distribution of a Continuous-Time Markov Chain and the Stationary Distribution of its Corresponding Embedded Discrete-Time Jump Chain:
Let X = {X(t) : t ≥ 0} be a continuous-time Markov chain with state space S and transition rates q_ij and, as usual, we let

v_i = Σ_{j≠i} q_ij

be the rate out of state i, and

p_ij = q_ij / v_i

be the one-step transition probabilities of the embedded discrete-time jump chain. Note also that q_ii = 0 for all i ∈ S so that we may also write

v_i = Σ_{j∈S} q_ij.

We let G denote the infinitesimal generator of the continuous-time Markov chain (with entries g_ij = q_ij for i ≠ j and g_ii = −v_i) and let P denote the one-step transition matrix for the embedded jump chain (with entries p_ij). If π is the stationary distribution of the continuous-time chain and ψ is the stationary distribution of the embedded jump chain, then π and ψ must satisfy, respectively, the two sets of global balance equations 0 = πG and ψ = ψP. Writing out the jth equation in each of these two sets of equations, we have

π_j v_j = Σ_{i≠j} π_i q_ij,

and

ψ_j = Σ_{i∈S} ψ_i p_ij.
These two sets of global balance equations give us a relationship between π and ψ, as follows. Since q_ij = v_i p_ij and q_jj = 0, we may rewrite the jth equation in 0 = πG as

π_j v_j = Σ_{i∈S} π_i v_i p_ij.

Assuming π satisfies 0 = πG, we see that the vector (π_j v_j)_{j∈S}, with jth entry π_j v_j, satisfies the global balance equations for the stationary distribution of the embedded jump chain. Furthermore, since we know that the stationary distribution of the embedded jump chain is unique from our theory for discrete-time Markov chains, we may conclude that

ψ_j = C π_j v_j,

where C is an appropriate normalizing constant. This also gives that

π_j = (1/C) ψ_j / v_j.
Indeed, we have that

ψ_j = π_j v_j / Σ_{i∈S} π_i v_i

and

π_j = (ψ_j / v_j) / Σ_{i∈S} (ψ_i / v_i).

The above relationship between ψ_j and π_j is intuitively correct. We may interpret ψ_j as the long run proportion of transitions that the continuous-time chain makes into state j. Also, over all the times that we make a transition into state j, we stay in state j for an average of 1/v_j time units. Therefore, the product ψ_j / v_j should be proportional to the long run proportion of time that the continuous-time chain spends in state j, and this is how we interpret π_j.
We make several remarks associated with the relationship between π and ψ.
Remark 1: If π and ψ both exist and the embedded jump chain is irreducible, then the uniqueness of ψ, which we proved in our theory for discrete-time Markov chains, and the fact that π is determined from ψ through the above relationship, implies that π is also unique.
Remark 2: We have assumed that π and ψ both exist. However, it is possible for ψ to exist but for π not to exist. From any discrete-time Markov chain with transition probabilities p_ij, we may construct a continuous-time Markov chain by specifying the rates v_i. However, if in addition the discrete-time chain has a unique stationary distribution ψ, it is not always true that the corresponding continuous-time chain will have a stationary distribution π. This is because there is nothing forcing the normalizing constant Σ_{i∈S} ψ_i / v_i to be finite if the state space S is infinite and we are free to choose our rates v_i as we please. In particular, one can show that this sum is not finite if the state space S is countably infinite and we choose v_i = ψ_i for all i ∈ S.
Remark 3: The fact that π can be obtained from ψ, assuming both exist, has practical value, especially when the state space is large but finite, and the transition matrix P of the embedded jump chain is sparse in the sense of having many zero entries (this occurs if, even though the state space may be large, the number of possible states that can be reached from state i, for any i, in one step remains small). Such models turn out to be quite common for many practical systems. The global balance equation ψ = ψP is what is called a fixed point equation, which means that the stationary vector ψ is a fixed point of the mapping which takes a row vector x to the row vector xP. A common, and simple, numerical procedure for solving a fixed point equation is something called successive substitution. In this procedure, we simply start with a convenient initial probability vector ψ^(0) (such as ψ^(0)_j = 1/|S| for all j ∈ S) and then obtain ψ^(1) = ψ^(0)P. We continue iterating, obtaining ψ^(n+1) = ψ^(n)P from ψ^(n), for n ≥ 1. Then, under certain conditions, the sequence of vectors ψ^(0), ψ^(1), ψ^(2), . . . will converge to a fixed point ψ. Note that

ψ^(n) = ψ^(n−1)P = ψ^(n−2)P^2 = · · · = ψ^(0)P^n,

for all n ≥ 1. If the embedded jump chain is irreducible (which it must be for a unique stationary distribution ψ to exist) and aperiodic, then from our theory on the limiting probabilities of a discrete-time Markov chain, we know that P^n converges, as n → ∞, to an |S| × |S| matrix in which each row is equal to the stationary distribution ψ (since we know p_ij(n) → ψ_j as n → ∞, for all i, j ∈ S). This implies that ψ^(0)P^n, and so ψ^(n), converges to ψ. The numerical efficiency one can obtain by computing ψ, and then π through the relationship between π and ψ, in this way can be an order of magnitude. A direct, numerical solution of the dense system of linear equations ψ = ψP has numerical complexity O(|S|^3). On the other hand, each iteration in the successive substitution procedure requires a vector-matrix multiplication, which is in general an O(|S|^2) operation. However, assuming each column of P has only a very small number of positive entries relative to |S|, one may cleverly compute ψ^(n)P with only K|S| multiplications and additions, where K is much smaller than |S|. In other words, the complexity of this operation can be reduced to O(|S|). Moreover, in practice it takes only a few iterations for the sequence {ψ^(n)} to converge to ψ to within a reasonable tolerance (say 10^−8).
Local Balance Equations: As for discrete-time Markov chains, the stationary distribution of a continuous-time Markov chain must satisfy the global balance equations, but may also satisfy the local balance equations. For a continuous-time Markov chain the local balance equations are given by

π_i q_ij = π_j q_ji,

for all i, j ∈ S such that i ≠ j. The local balance equations express a balance of flow between any pair of states. We interpret π_i q_ij as the rate from state i to state j and π_j q_ji as the rate from state j to state i. There are actually (|S| choose 2) equations in the set of local balance equations (the same as in the local balance equations for a discrete-time Markov chain), but typically most of the equations are trivially satisfied because q_ij = q_ji = 0. Note that one way to quickly check that the local balance equations cannot be satisfied by the stationary distribution is to check if there are any rates q_ij and q_ji such that q_ij > 0 and q_ji = 0, or q_ij = 0 and q_ji > 0.
Not every continuous-time Markov chain that has a stationary distribution has a stationary distribution that satisfies the local balance equations. On the other hand, if we can find a probability vector that does satisfy the local balance equations, then this probability vector will be the stationary distribution of the Markov chain. We have seen this with discrete-time Markov chains, and we can easily show it again here.
Suppose that π is a probability vector that satisfies the local balance equations. That is,

π_i q_ij = π_j q_ji,

for all i, j ∈ S such that i ≠ j. Then, since q_jj = 0 for any j ∈ S, we may sum both sides of the above equality over all i ∈ S to obtain

Σ_{i∈S} π_i q_ij = π_j Σ_{i∈S} q_ji = π_j v_j,

for all j ∈ S. But these are just the global balance equations. That is, the probability vector π also satisfies the global balance equations, and this implies that π is the stationary distribution.
If there is a probability vector that satisfies the local balance equations, then using the local balance equations to find π is typically much easier than using the global balance equations, because each equation in the local balance equations involves only two unknowns, while at least some of the equations in the global balance equations will usually involve more than two unknowns.
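A small added sketch of this fact, under the simplifying assumption of symmetric rates q_ij = q_ji (so the uniform vector solves the local balance equations by inspection): the uniform vector then also satisfies the global balance equations.

```python
import numpy as np

# Symmetric rates q_ij = q_ji, so the uniform vector satisfies local balance.
Q = np.array([[0.0, 2.0, 0.5],
              [2.0, 0.0, 1.0],
              [0.5, 1.0, 0.0]])
G = Q - np.diag(Q.sum(axis=1))

pi = np.full(3, 1.0 / 3.0)        # solves pi_i q_ij = pi_j q_ji by inspection

print(np.allclose(pi @ G, 0.0))   # True: local balance implies global balance
```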
We will now give two examples of continuous-time Markov chains whose stationary distributions do satisfy the local balance equations, in part to illustrate the utility of using the local balance equations to find the stationary distributions.
Example: Birth/Death Processes: We introduced birth/death processes previously. The state space of a birth/death process is a subset (possibly infinite) of the integers, and from any state i the process can only jump up to state i + 1 or down to state i − 1. The transition rates q_{i,i+1}, usually denoted by λ_i, are called the birth rates of the process and the transition rates q_{i,i−1}, usually denoted by μ_i, are called the death rates of the process. In this example we will consider a birth/death process on S = {0, 1, 2, . . .}, the nonnegative integers, but with general birth rates λ_i, for i ≥ 0, and general death rates μ_i, for i ≥ 1. Since whenever the process goes from state i to state i + 1 it must make the transition from state i + 1 to i before it can make the transition from state i to state i + 1 again, we may expect that for any state i, the rate of flow from state i to state i + 1 is equal to the rate of flow from state i + 1 to i, when the process is stationary.
The local balance equations are given by

π_i λ_i = π_{i+1} μ_{i+1},

for i ≥ 0 (all the other local balance equations are trivially satisfied since q_ij = q_ji = 0 if j ≠ i − 1, i + 1). Thus, we have that π_{i+1} = (λ_i / μ_{i+1}) π_i. Solving recursively, we obtain

π_{i+1} = (λ_i / μ_{i+1}) π_i
        = (λ_i λ_{i−1} / μ_{i+1} μ_i) π_{i−1}
        ⋮
        = (λ_i · · · λ_0 / μ_{i+1} · · · μ_1) π_0.

The stationary distribution will exist if and only if we can normalize this solution to the local balance equations, which will be possible if and only if

1 + Σ_{i=1}^∞ (λ_{i−1} · · · λ_0) / (μ_i · · · μ_1) < ∞.
Assuming the above sum is finite, then the normalization constraint Σ_{i=0}^∞ π_i = 1 is equivalent to

π_0 [ 1 + Σ_{i=1}^∞ (λ_{i−1} · · · λ_0) / (μ_i · · · μ_1) ] = 1,

which implies that

π_0 = [ 1 + Σ_{i=1}^∞ (λ_{i−1} · · · λ_0) / (μ_i · · · μ_1) ]^{−1}.

Then we obtain

π_i = (λ_{i−1} · · · λ_0) / (μ_i · · · μ_1) × [ 1 + Σ_{k=1}^∞ (λ_{k−1} · · · λ_0) / (μ_k · · · μ_1) ]^{−1},

for i ≥ 1.
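The sketch below (an added illustration with arbitrary rates) computes this product-form solution numerically for a birth/death chain truncated at a large state n_max, which is the usual practical approach when the rates do not admit a closed form.

```python
import numpy as np

def birth_death_stationary(lam, mu, n_max):
    """Stationary distribution of a birth/death chain truncated at state n_max.

    lam(i) is the birth rate from state i, mu(i) the death rate from state i.
    """
    weights = np.empty(n_max + 1)
    weights[0] = 1.0
    for i in range(1, n_max + 1):
        # pi_i / pi_0 = (lam_{i-1} ... lam_0) / (mu_i ... mu_1)
        weights[i] = weights[i - 1] * lam(i - 1) / mu(i)
    return weights / weights.sum()

# Example: constant birth rate 1 and death rate 2 (illustrative M/M/1-type rates).
pi = birth_death_stationary(lambda i: 1.0, lambda i: 2.0, n_max=200)
print(pi[:5])    # close to the geometric values (1/2)^i * (1/2)
```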
Example: M/M/1 Queue: As our final example for today, we will consider the M/M/1 queue, which is one of the most basic queueing models in queueing theory (which we will cover in more detail next week). This is a model for a single server system to which customers arrive, are served in a first-come first-served fashion by the server, and then depart the system upon finishing service. Customers that arrive to a nonempty system will wait in a queue for service. The canonical example is a single teller bank queue. The notation M/M/1 is an example of something called Kendall's notation, which is a shorthand for describing most queueing models. The first entry (the first M) is a letter which denotes the arrival process to the queue. The M stands for Markov and it denotes a Poisson arrival process to the system. That is, customers arrive to the system according to a Poisson process with some rate λ > 0. The second entry (the second M) is a letter which denotes the service time distribution. The M here, which also stands for Markov, denotes exponentially distributed service times. As well, unless explicitly stated, the implicit assumption is that service times are independent and identically distributed. Thus, the second M signifies that all service times are independent and identically distributed Exponential random variables with some rate μ > 0. It is also implicitly assumed that the service times are independent of the arrival process. Finally, the third entry (the 1) is a number which denotes the number of servers in the system.
If X(t) denotes the number of customers in the system at time t, then since the customer interarrival times and the service times are all independent, exponentially distributed random variables, the process {X(t) : t ≥ 0} is a continuous-time Markov chain. The state space is S = {0, 1, 2, . . .}. Indeed, it is not hard to see that {X(t) : t ≥ 0} is a birth/death process with birth rates λ_i = λ, for i ≥ 0, and death rates μ_i = μ, for i ≥ 1. Thus, we may simply plug these birth and death rates into our previous example. The condition for the stationary distribution to exist becomes

1 + Σ_{i=1}^∞ (λ/μ)^i < ∞,

or

Σ_{i=0}^∞ (λ/μ)^i < ∞.

The sum on the left hand side is just a Geometric series, and so it converges if and only if λ/μ < 1. This condition for the stationary distribution to exist is equivalent to λ < μ, and has the intuitive interpretation that the arrival rate to the system, λ, must be less than the service rate of the server, μ. If λ > μ then customers are arriving to the system at a faster rate than the server can serve them, and the number in the system eventually blows up to ∞. In the language of queueing theory, a queueing system in which the number of customers in the system blows up to ∞ is called unstable, and the condition λ < μ is called a stability condition. Note that when λ = μ the system is also unstable, in the sense that no stationary distribution exists.
If the condition λ < μ is satisfied, then we obtain from the general solution in the previous example that

π_0 = [ Σ_{i=0}^∞ (λ/μ)^i ]^{−1} = [ 1 / (1 − λ/μ) ]^{−1} = 1 − λ/μ,

and

π_i = (λ/μ)^i (1 − λ/μ),

for i ≥ 1. So we see that the stationary distribution of the number in the system for an M/M/1 queue is a Geometric distribution with parameter 1 − λ/μ (though it has the version of the Geometric distribution which is usually interpreted as the number of failures before the first success rather than the number of trials until the first success, so that it is a distribution on {0, 1, 2, . . .} rather than on {1, 2, . . .}).
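As an added illustration (not part of the notes), the sketch below simulates an M/M/1 queue with assumed rates λ = 1 and μ = 2 and compares the long run fraction of time spent in each state with the geometric stationary distribution (λ/μ)^i (1 − λ/μ).

```python
import numpy as np

rng = np.random.default_rng(1)
lam, mu = 1.0, 2.0                       # illustrative arrival and service rates
T, t, n = 200_000.0, 0.0, 0              # horizon, current time, customers in system
time_in_state = {}

while t < T:
    rate = lam + (mu if n > 0 else 0.0)  # total rate out of the current state
    dt = rng.exponential(1.0 / rate)     # exponential holding time in the current state
    time_in_state[n] = time_in_state.get(n, 0.0) + dt
    t += dt
    # The next event is an arrival with probability lam / rate, otherwise a departure.
    n += 1 if rng.random() < lam / rate else -1

for i in range(5):
    empirical = time_in_state.get(i, 0.0) / t
    theory = (lam / mu) ** i * (1 - lam / mu)
    print(i, round(empirical, 4), round(theory, 4))
```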
Next we will reconsider the notion of time reversibility, which we first encountered with discrete-time Markov chains, and see that there is a strong connection between time reversibility and the local balance equations, much as there was for discrete-time Markov chains.
30
Time Reversibility
The notion of time reversibility for continuous-time Markov chains is essentially the same as for discrete-time Markov chains. The concept of running a stochastic process backwards in time applies equally well in continuous time as in discrete time. However, even though we can imagine running any stochastic process backwards in time, the notion of time reversibility applies only to stationary processes. Therefore, we begin by assuming that {X(t) : −∞ < t < ∞} is a stationary, continuous-time Markov chain, where we extend the time index back to −∞ to accommodate running the process backwards in time. The reversed process is Y(t) = X(−t). The first thing we need to see is that the reversed process Y is also a continuous-time Markov chain. But this fact follows almost directly from the fact that a reversed discrete-time Markov chain is also a Markov chain. In the continuous-time chain, it is clear that whether we run the chain forwards in time or backwards in time, the amount of time we spend in any state i when we enter it has the same distribution, namely Exponential(v_i). The times we enter state i in the forward chain are the times we leave state i in the reversed chain, and vice-versa, but the times spent in state i are still distributed the same.
Therefore, the reversed chain Y will be a continuous-time Markov chain if the embedded jump process of the reversed process is a discrete-time Markov chain. But this embedded discrete-time process is just the reversed process of the forward embedded jump chain, and since the forward embedded jump chain is a discrete-time Markov chain, the embedded jump process in the reversed process is also a discrete-time Markov chain. We know this from our discussions of time reversibility in the discrete-time setting from Chapter 4 (see Lecture 18). So we may conclude that the reversed continuous-time process Y is indeed a continuous-time Markov chain.
Definition: A continuous-time Markov chain X = {X(t) : t ≥ 0} is time reversible if X has a stationary distribution π (and so can be made stationary by setting the initial distribution of X(0) to be π), and when the stationary process is extended to the whole real line to obtain the stationary process {X(t) : −∞ < t < ∞}, the reversed process Y = {Y(t) = X(−t) : −∞ < t < ∞} is probabilistically the same continuous-time Markov chain as X (i.e. has the same transition rates). Equivalently, from our discussion above, X is time reversible if the embedded discrete-time jump chain of X is time reversible (i.e. the embedded jump chain of the reversed process Y has the same one-step transition probability matrix as that of the embedded jump chain of the forward process X).
Let P = (p_ij)_{i,j∈S} denote the transition probability matrix of the embedded jump chain of the forward process X, and let ψ = (ψ_i)_{i∈S} denote the stationary distribution of this embedded jump chain (so that ψ satisfies the global balance equations ψ = ψP).
We saw previously that the relationship between ψ and the stationary distribution of X, denoted by π, is given by

ψ_i = C π_i v_i

for all i ∈ S, where C is an appropriate normalizing constant and v_i is the rate out of state i. We also have our basic relationship that

p_ij = q_ij / v_i

for all i, j ∈ S, where q_ij is the transition rate from state i to state j. From our discussions on time reversibility for discrete-time Markov chains (see Lecture 18), we know that the embedded jump chain of the forward chain X will be time reversible if and only if the stationary distribution ψ of this jump chain satisfies the local balance equations

ψ_i p_ij = ψ_j p_ji,

for all i, j ∈ S. But from the relationships above between ψ_i and π_i and between p_ij and q_ij, these local balance equations for ψ are equivalent to

(C π_i v_i)(q_ij / v_i) = (C π_j v_j)(q_ji / v_j),

for all i, j ∈ S. Now canceling out C, v_i and v_j gives the equivalent equations

π_i q_ij = π_j q_ji,

for all i, j ∈ S. But note that these are exactly the local balance equations for the stationary distribution π.
In other words, we conclude from the preceding discussion that a continuous-time Markov chain X is time reversible if and only if it has a stationary distribution π which satisfies the local balance equations

π_i q_ij = π_j q_ji,

discussed in the last lecture. Note that this gives us a way to show that a given continuous-time Markov chain X is time reversible. If we can solve the local balance equations to find the stationary distribution π, then this not only gives us a more convenient way to determine π; it also shows that X is time reversible, almost as a side-effect. One part of Problem 8 on Assignment #6 (Ross, p.359 #36 in the 7th Edition and p.348 #36 in the 6th Edition) asks you to show that a given continuous-time Markov chain is time reversible, and this is how you may show it. The continuous-time Markov chain in this problem is also multi-dimensional. As you might imagine, if this process were not time reversible, so that one had to solve the global balance equations to find the stationary distribution π, finding the stationary distribution might prove quite daunting. Luckily for us, the multi-dimensional process in this problem is time reversible, and so π may be obtained by solving the local balance equations, which are typically simpler than the global balance equations, as discussed in the previous lecture. Despite this, even the local balance equations may not look all that trivial to solve, especially when the states i and j represent the vector-valued states of a multi-dimensional process. In practice (at least for this course and, to a significant extent, for a great variety of modeling situations in the real world), the local balance equations when the underlying process is multi-dimensional can often be solved by inspection, by determining that the stationary probabilities must have a certain form, or by trial and error. The only way to develop your sense of what the form of the stationary distribution should be in these situations is to do problems and see examples (i.e. experience), so let's do one such example now.
Example: (Ross, #31 in Chapter 6): Consider a system with r servers, where the service times at the ith server are independent and identically distributed Exponential(μ_i) random variables, for i = 1, . . . , r (and the service times at different servers are also independent). A total of N customers move about among these servers as follows. Whenever a customer finishes service at a server, it moves to a different server at random. That is, if a customer has just finished service at server i then it will next move to server j, where j ≠ i, with probability 1/(r − 1). Each server also has a queue for waiting customers and the service discipline is first-come, first-served. Let X(t) = (n_1(t), . . . , n_r(t)) denote the number of customers at each server at time t (that is, n_i(t) is the number of customers at server i at time t). Then {X(t) : t ≥ 0} is a continuous-time Markov chain. Show that this chain is time reversible and find the stationary distribution.
Solution: Basically, we need to set up the local balance equations and solve them, which does two things: i) it shows that the local balance equations have a solution and thus that the process is time reversible and ii) it gives us the stationary distribution. Firstly, let us note that the state space of the process is

S = {(n_1, . . . , n_r) : the n_i are nonnegative integers and Σ_{i=1}^r n_i = N}.
Now let us consider the transition rates for {X(t) : t ≥ 0}. To ease the writing we first define some convenient notation. Let n denote an arbitrary vector (n_1, . . . , n_r) ∈ S and let e_i denote the r-dimensional vector which has a 1 for the ith component and a 0 for every other component. Now suppose that we are currently in state n. We will jump to a new state as soon as a customer finishes service at some server and moves to a different server. Thus, from state n we will next jump to a state which is of the form n − e_i + e_j, where i ≠ j and n_i > 0. As n ranges over S this accounts for all the possible transitions that can occur. The transition from state n to state n − e_i + e_j occurs when the customer at server i finishes service and then moves to server j, and this occurs at rate μ_i/(r − 1). That is, the transition rate q_{n, n−e_i+e_j} from state n to state n − e_i + e_j is given by

q_{n, n−e_i+e_j} = μ_i / (r − 1),

for i ≠ j and for all n such that n_i > 0. Similarly, the transition from state n − e_i + e_j to state n occurs only when the customer at server j finishes service and then moves to server i, and this occurs at rate μ_j/(r − 1). That is,

q_{n−e_i+e_j, n} = μ_j / (r − 1).

Thus, our local balance equations

π_n q_{n, n−e_i+e_j} = π_{n−e_i+e_j} q_{n−e_i+e_j, n},

where π_n is the stationary probability of state n, are given by

π_n μ_i / (r − 1) = π_{n−e_i+e_j} μ_j / (r − 1),

for i ≠ j and all n ∈ S such that n_i > 0. We may cancel out the r − 1 from both sides of the above equations to obtain

π_n μ_i = π_{n−e_i+e_j} μ_j,

for i ≠ j and all n ∈ S such that n_i > 0.
As mentioned, it may not seem obvious what form π_n should have in order to satisfy these equations. On the other hand, the equations certainly look simple enough that one should believe that π_n might have some simple, regular form. In words, π_n should be some function of n such that when we multiply it by μ_i, that is the same thing as taking this function evaluated at n − e_i + e_j and multiplying that by μ_j. A little inspection, and perhaps some trial and error, will lead to π_n of the form

π_n = C / (μ_1^{n_1} · · · μ_r^{n_r}),

for all n ∈ S, where C is the appropriate normalizing constant. We can verify that this claimed form for π_n is correct by plugging it into the local balance equations. On the left hand side we obtain

LHS = C μ_i / (μ_1^{n_1} · · · μ_r^{n_r}),

while on the right hand side we obtain

RHS = [ C μ_j / (μ_1^{n_1} · · · μ_r^{n_r}) ] × (μ_i / μ_j),

where the factor μ_i / μ_j is needed to account for the fact that there is one less customer at server i and one more customer at server j in the state n − e_i + e_j relative to state n. Clearly the LHS and the RHS are equal for all i ≠ j and all n ∈ S such that n_i > 0. Since the state space S is finite the normalizing constant C is strictly positive and given by

C = [ Σ_{m∈S} 1 / (μ_1^{m_1} · · · μ_r^{m_r}) ]^{−1},

where m = (m_1, . . . , m_r) ranges over all states in S.
Thus, the stationary distribution π = (π_n)_{n∈S} is given by

π_n = [ 1 / (μ_1^{n_1} · · · μ_r^{n_r}) ] × [ Σ_{m∈S} 1 / (μ_1^{m_1} · · · μ_r^{m_r}) ]^{−1},

for all n ∈ S. So we have found the stationary distribution π and, since we have also shown that π satisfies the local balance equations (since that is how we found π), we have also shown that the continuous-time Markov chain {X(t) : t ≥ 0} is time reversible.
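To close the example, here is an added sketch (with made-up service rates) that enumerates the finite state space for small r and N, encodes the transition rates of this closed network, and checks that the product-form probabilities π_n ∝ 1/(μ_1^{n_1} · · · μ_r^{n_r}) satisfy every local balance equation.

```python
import itertools
import numpy as np

r, N = 3, 4
mu = np.array([1.0, 2.0, 3.0])            # illustrative service rates mu_i

# Enumerate the state space: all (n_1, ..., n_r) of nonnegative integers summing to N.
states = [n for n in itertools.product(range(N + 1), repeat=r) if sum(n) == N]

def q(n, m):
    """Transition rate from state n to state m (one customer moves from i to j)."""
    d = np.subtract(m, n)
    if sorted(d) != [-1] + [0] * (r - 2) + [1]:
        return 0.0
    i = int(np.where(d == -1)[0][0])      # server that loses a customer
    return mu[i] / (r - 1)

# Claimed product-form stationary probabilities, normalized over the state space.
w = np.array([1.0 / np.prod(mu ** np.array(n)) for n in states])
pi = dict(zip(states, w / w.sum()))

# Check every local balance equation pi_n q(n, m) = pi_m q(m, n).
ok = all(np.isclose(pi[n] * q(n, m), pi[m] * q(m, n))
         for n in states for m in states if n != m)
print(ok)   # True
```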
1
Introduction to Probability Theory
1.1. Introduction
Any realistic model of a real-world phenomenon must take into account the possi-
bility of randomness. That is, more often than not, the quantities we are interested
in will not be predictable in advance but, rather, will exhibit an inherent varia-
tion that should be taken into account by the model. This is usually accomplished
by allowing the model to be probabilistic in nature. Such a model is, naturally
enough, referred to as a probability model.
The majority of the chapters of this book will be concerned with different prob-
ability models of natural phenomena. Clearly, in order to master both the model
building and the subsequent analysis of these models, we must have a certain
knowledge of basic probability theory. The remainder of this chapter, as well as
the next two chapters, will be concerned with a study of this subject.
1.2. Sample Space and Events
Suppose that we are about to perform an experiment whose outcome is not pre-
dictable in advance. However, while the outcome of the experiment will not be
known in advance, let us suppose that the set of all possible outcomes is known.
This set of all possible outcomes of an experiment is known as the sample space
of the experiment and is denoted by S.
Some examples are the following.
1. If the experiment consists of the flipping of a coin, then
S = {H, T}
where H means that the outcome of the toss is a head and T that it is a tail.
2. If the experiment consists of rolling a die, then the sample space is
S ={1, 2, 3, 4, 5, 6}
where the outcome i means that i appeared on the die, i =1, 2, 3, 4, 5, 6.
3. If the experiment consists of flipping two coins, then the sample space consists of the following four points:
S = {(H, H), (H, T), (T, H), (T, T)}
The outcome will be (H, H) if both coins come up heads; it will be (H, T )
if the rst coin comes up heads and the second comes up tails; it will be
(T, H) if the rst comes up tails and the second heads; and it will be (T, T )
if both coins come up tails.
4. If the experiment consists of rolling two dice, then the sample space consists
of the following 36 points:
S =

(1, 1), (1, 2), (1, 3), (1, 4), (1, 5), (1, 6)
(2, 1), (2, 2), (2, 3), (2, 4), (2, 5), (2, 6)
(3, 1), (3, 2), (3, 3), (3, 4), (3, 5), (3, 6)
(4, 1), (4, 2), (4, 3), (4, 4), (4, 5), (4, 6)
(5, 1), (5, 2), (5, 3), (5, 4), (5, 5), (5, 6)
(6, 1), (6, 2), (6, 3), (6, 4), (6, 5), (6, 6)

where the outcome (i, j) is said to occur if i appears on the rst die and j
on the second die.
5. If the experiment consists of measuring the lifetime of a car, then the sample
space consists of all nonnegative real numbers. That is,
S =[0, )


Any subset E of the sample space S is known as an event. Some examples of
events are the following.
1. In Example (1), if E = {H}, then E is the event that a head appears on the flip of the coin. Similarly, if E = {T}, then E would be the event that a tail appears.
2. In Example (2), if E = {1}, then E is the event that one appears on the roll of the die. If E = {2, 4, 6}, then E would be the event that an even number appears on the roll.
(The set (a, b) is defined to consist of all points x such that a < x < b. The set [a, b] is defined to consist of all points x such that a ≤ x ≤ b. The sets (a, b] and [a, b) are defined, respectively, to consist of all points x such that a < x ≤ b and all points x such that a ≤ x < b.)
3. In Example (3), if E = {(H, H), (H, T)}, then E is the event that a head appears on the first coin.
4. In Example (4), if E = {(1, 6), (2, 5), (3, 4), (4, 3), (5, 2), (6, 1)}, then E is the event that the sum of the dice equals seven.
5. In Example (5), if E = (2, 6), then E is the event that the car lasts between two and six years.
For any two events E and F of a sample space S we define the new event E ∪ F to consist of all outcomes that are either in E or in F or in both E and F. That is, the event E ∪ F will occur if either E or F occurs. For example, in (1) if E = {H} and F = {T}, then
E ∪ F = {H, T}
That is, E ∪ F would be the whole sample space S. In (2) if E = {1, 3, 5} and F = {1, 2, 3}, then
E ∪ F = {1, 2, 3, 5}
and thus E ∪ F would occur if the outcome of the die is 1 or 2 or 3 or 5. The event E ∪ F is often referred to as the union of the event E and the event F.
For any two events E and F, we may also define the new event EF, sometimes written E ∩ F, and referred to as the intersection of E and F, as follows. EF consists of all outcomes which are both in E and in F. That is, the event EF will occur only if both E and F occur. For example, in (2) if E = {1, 3, 5} and F = {1, 2, 3}, then
EF = {1, 3}
and thus EF would occur if the outcome of the die is either 1 or 3. In Example (1) if E = {H} and F = {T}, then the event EF would not consist of any outcomes and hence could not occur. To give such an event a name, we shall refer to it as the null event and denote it by ∅. (That is, ∅ refers to the event consisting of no outcomes.) If EF = ∅, then E and F are said to be mutually exclusive.
We also define unions and intersections of more than two events in a similar manner. If E_1, E_2, . . . are events, then the union of these events, denoted by ∪_{n=1}^∞ E_n, is defined to be that event which consists of all outcomes that are in E_n for at least one value of n = 1, 2, . . . . Similarly, the intersection of the events E_n, denoted by ∩_{n=1}^∞ E_n, is defined to be the event consisting of those outcomes that are in all of the events E_n, n = 1, 2, . . . .
Finally, for any event E we define the new event E^c, referred to as the complement of E, to consist of all outcomes in the sample space S that are not in E. That is, E^c will occur if and only if E does not occur. In Example (4) if E = {(1, 6), (2, 5), (3, 4), (4, 3), (5, 2), (6, 1)}, then E^c will occur if the sum of the dice does not equal seven. Also note that since the experiment must result in some outcome, it follows that S^c = ∅.
1.3. Probabilities Defined on Events
Consider an experiment whose sample space is S. For each event E of the sample space S, we assume that a number P(E) is defined and satisfies the following three conditions:
(i) 0 ≤ P(E) ≤ 1.
(ii) P(S) = 1.
(iii) For any sequence of events E_1, E_2, . . . that are mutually exclusive, that is, events for which E_n E_m = ∅ when n ≠ m, then
P(∪_{n=1}^∞ E_n) = Σ_{n=1}^∞ P(E_n)
We refer to P(E) as the probability of the event E.
Example 1.1 In the coin tossing example, if we assume that a head is equally likely to appear as a tail, then we would have
P({H}) = P({T}) = 1/2
On the other hand, if we had a biased coin and felt that a head was twice as likely to appear as a tail, then we would have
P({H}) = 2/3,  P({T}) = 1/3
Example 1.2 In the die tossing example, if we supposed that all six numbers were equally likely to appear, then we would have
P({1}) = P({2}) = P({3}) = P({4}) = P({5}) = P({6}) = 1/6
From (iii) it would follow that the probability of getting an even number would equal
P({2, 4, 6}) = P({2}) + P({4}) + P({6}) = 1/2
Remark We have chosen to give a rather formal definition of probabilities as being functions defined on the events of a sample space. However, it turns out that these probabilities have a nice intuitive property. Namely, if our experiment is repeated over and over again then (with probability 1) the proportion of time that event E occurs will just be P(E).
Since the events E and E^c are always mutually exclusive and since E ∪ E^c = S we have by (ii) and (iii) that
1 = P(S) = P(E ∪ E^c) = P(E) + P(E^c)
or
P(E^c) = 1 − P(E)   (1.1)
In words, Equation (1.1) states that the probability that an event does not occur is one minus the probability that it does occur.
We shall now derive a formula for P(E ∪ F), the probability of all outcomes either in E or in F. To do so, consider P(E) + P(F), which is the probability of all outcomes in E plus the probability of all points in F. Since any outcome that is in both E and F will be counted twice in P(E) + P(F) and only once in P(E ∪ F), we must have
P(E) + P(F) = P(E ∪ F) + P(EF)
or equivalently
P(E ∪ F) = P(E) + P(F) − P(EF)   (1.2)
Note that when E and F are mutually exclusive (that is, when EF = ∅), then Equation (1.2) states that
P(E ∪ F) = P(E) + P(F) − P(∅) = P(E) + P(F)
a result which also follows from condition (iii). [Why is P(∅) = 0?]
Example 1.3 Suppose that we toss two coins, and suppose that we assume that each of the four outcomes in the sample space
S = {(H, H), (H, T), (T, H), (T, T)}
is equally likely and hence has probability 1/4. Let
E = {(H, H), (H, T)} and F = {(H, H), (T, H)}
That is, E is the event that the first coin falls heads, and F is the event that the second coin falls heads.
By Equation (1.2) we have that P(E ∪ F), the probability that either the first or the second coin falls heads, is given by
P(E ∪ F) = P(E) + P(F) − P(EF)
         = 1/2 + 1/2 − P({(H, H)})
         = 1 − 1/4 = 3/4
This probability could, of course, have been computed directly since
P(E ∪ F) = P({(H, H), (H, T), (T, H)}) = 3/4
We may also calculate the probability that any one of the three events E or F or G occurs. This is done as follows:
P(E ∪ F ∪ G) = P((E ∪ F) ∪ G)
which by Equation (1.2) equals
P(E ∪ F) + P(G) − P((E ∪ F)G)
Now we leave it for you to show that the events (E ∪ F)G and EG ∪ FG are equivalent, and hence the preceding equals
P(E ∪ F ∪ G)
  = P(E) + P(F) − P(EF) + P(G) − P(EG ∪ FG)
  = P(E) + P(F) − P(EF) + P(G) − P(EG) − P(FG) + P(EGFG)
  = P(E) + P(F) + P(G) − P(EF) − P(EG) − P(FG) + P(EFG)   (1.3)
In fact, it can be shown by induction that, for any n events E_1, E_2, E_3, . . . , E_n,
P(E_1 ∪ E_2 ∪ · · · ∪ E_n) = Σ_i P(E_i) − Σ_{i<j} P(E_i E_j) + Σ_{i<j<k} P(E_i E_j E_k)
  − Σ_{i<j<k<l} P(E_i E_j E_k E_l) + · · · + (−1)^{n+1} P(E_1 E_2 · · · E_n)   (1.4)
In words, Equation (1.4) states that the probability of the union of n events equals the sum of the probabilities of these events taken one at a time minus the sum of the probabilities of these events taken two at a time plus the sum of the probabilities of these events taken three at a time, and so on.
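As an added illustration of Equation (1.4), the sketch below computes the probability of a union two ways for some made-up events in a small equally likely sample space: directly, and via the inclusion–exclusion sum.

```python
import itertools
from fractions import Fraction

sample_space = range(12)                              # a small equally likely sample space
events = [{0, 1, 2, 3}, {2, 3, 4, 5, 6}, {5, 6, 7}]   # illustrative events

def prob(event):
    return Fraction(len(event), len(sample_space))

# Direct computation of P(E_1 u E_2 u E_3).
direct = prob(set().union(*events))

# Inclusion-exclusion: alternating sum over all nonempty subcollections of events.
incl_excl = Fraction(0)
for k in range(1, len(events) + 1):
    for combo in itertools.combinations(events, k):
        incl_excl += (-1) ** (k + 1) * prob(set.intersection(*combo))

print(direct, incl_excl, direct == incl_excl)   # identical values
```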
1.4. Conditional Probabilities
Suppose that we toss two dice and that each of the 36 possible outcomes is equally likely to occur and hence has probability 1/36. Suppose that we observe that the first die is a four. Then, given this information, what is the probability that the sum of the two dice equals six? To calculate this probability we reason as follows: Given that the initial die is a four, it follows that there can be at most six possible outcomes of our experiment, namely, (4, 1), (4, 2), (4, 3), (4, 4), (4, 5), and (4, 6). Since each of these outcomes originally had the same probability of occurring, they should still have equal probabilities. That is, given that the first die is a four, then the (conditional) probability of each of the outcomes (4, 1), (4, 2), (4, 3), (4, 4), (4, 5), (4, 6) is 1/6 while the (conditional) probability of the other 30 points in the sample space is 0. Hence, the desired probability will be 1/6.
If we let E and F denote, respectively, the event that the sum of the dice is six and the event that the first die is a four, then the probability just obtained is called the conditional probability that E occurs given that F has occurred and is denoted by
P(E|F)
A general formula for P(E|F) which is valid for all events E and F is derived in the same manner as the preceding. Namely, if the event F occurs, then in order for E to occur it is necessary for the actual occurrence to be a point in both E and in F, that is, it must be in EF. Now, because we know that F has occurred, it follows that F becomes our new sample space and hence the probability that the event EF occurs will equal the probability of EF relative to the probability of F. That is,
P(E|F) = P(EF)/P(F)   (1.5)
Note that Equation (1.5) is only well defined when P(F) > 0 and hence P(E|F) is only defined when P(F) > 0.
Example 1.4 Suppose cards numbered one through ten are placed in a hat, mixed up, and then one of the cards is drawn. If we are told that the number on the drawn card is at least five, then what is the conditional probability that it is ten?
Solution: Let E denote the event that the number of the drawn card is ten, and let F be the event that it is at least five. The desired probability is P(E|F). Now, from Equation (1.5)
P(E|F) = P(EF)/P(F)
However, EF = E since the number of the card will be both ten and at least five if and only if it is number ten. Hence,
P(E|F) = (1/10)/(6/10) = 1/6
Example 1.5 A family has two children. What is the conditional probability that both are boys given that at least one of them is a boy? Assume that the sample space S is given by S = {(b, b), (b, g), (g, b), (g, g)}, and all outcomes are equally likely. [(b, g) means, for instance, that the older child is a boy and the younger child a girl.]
Solution: Letting B denote the event that both children are boys, and A the event that at least one of them is a boy, then the desired probability is given by
P(B|A) = P(BA)/P(A) = P({(b, b)})/P({(b, b), (b, g), (g, b)}) = (1/4)/(3/4) = 1/3
Example 1.6 Bev can either take a course in computers or in chemistry. If Bev takes the computer course, then she will receive an A grade with probability 1/2; if she takes the chemistry course then she will receive an A grade with probability 1/3. Bev decides to base her decision on the flip of a fair coin. What is the probability that Bev will get an A in chemistry?
Solution: If we let C be the event that Bev takes chemistry and A denote the event that she receives an A in whatever course she takes, then the desired probability is P(AC). This is calculated by using Equation (1.5) as follows:
P(AC) = P(C)P(A|C) = (1/2)(1/3) = 1/6
Example 1.7 Suppose an urn contains seven black balls and five white balls. We draw two balls from the urn without replacement. Assuming that each ball in the urn is equally likely to be drawn, what is the probability that both drawn balls are black?
Solution: Let F and E denote, respectively, the events that the first and second balls drawn are black. Now, given that the first ball selected is black, there are six remaining black balls and five white balls, and so P(E|F) = 6/11. As P(F) is clearly 7/12, our desired probability is
P(EF) = P(F)P(E|F) = (7/12)(6/11) = 42/132
Example 1.8 Suppose that each of three men at a party throws his hat into the center of the room. The hats are first mixed up and then each man randomly selects a hat. What is the probability that none of the three men selects his own hat?
Solution: We shall solve this by first calculating the complementary probability that at least one man selects his own hat. Let us denote by E_i, i = 1, 2, 3, the event that the ith man selects his own hat. To calculate the probability P(E_1 ∪ E_2 ∪ E_3), we first note that
P(E_i) = 1/3, i = 1, 2, 3
P(E_i E_j) = 1/6, i ≠ j   (1.6)
P(E_1 E_2 E_3) = 1/6
To see why Equation (1.6) is correct, consider first
P(E_i E_j) = P(E_i)P(E_j|E_i)
Now P(E_i), the probability that the ith man selects his own hat, is clearly 1/3 since he is equally likely to select any of the three hats. On the other hand, given that the ith man has selected his own hat, then there remain two hats that the jth man may select, and as one of these two is his own hat, it follows that with probability 1/2 he will select it. That is, P(E_j|E_i) = 1/2 and so
P(E_i E_j) = P(E_i)P(E_j|E_i) = (1/3)(1/2) = 1/6
To calculate P(E_1 E_2 E_3) we write
P(E_1 E_2 E_3) = P(E_1 E_2)P(E_3|E_1 E_2) = (1/6)P(E_3|E_1 E_2)
However, given that the first two men get their own hats it follows that the third man must also get his own hat (since there are no other hats left). That is, P(E_3|E_1 E_2) = 1 and so
P(E_1 E_2 E_3) = 1/6
Now, from Equation (1.4) we have that
P(E_1 ∪ E_2 ∪ E_3) = P(E_1) + P(E_2) + P(E_3) − P(E_1 E_2) − P(E_1 E_3) − P(E_2 E_3) + P(E_1 E_2 E_3)
  = 1 − 1/2 + 1/6 = 2/3
Hence, the probability that none of the men selects his own hat is 1 − 2/3 = 1/3.
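Since the matching problem also appears earlier in these notes, here is a small added simulation (illustrative only) that estimates the probability that none of n men gets his own hat and compares it with the exact value 1/3 for n = 3.

```python
import random
from math import factorial

def prob_no_match(n, trials=200_000, seed=0):
    """Monte Carlo estimate of P(no man selects his own hat) for n men."""
    rng = random.Random(seed)
    hats = list(range(n))
    count = 0
    for _ in range(trials):
        rng.shuffle(hats)
        if all(hats[i] != i for i in range(n)):
            count += 1
    return count / trials

# Exact derangement probability for n = 3: 1/2! - 1/3! = 1/3.
exact = sum((-1) ** k / factorial(k) for k in range(2, 4))
print(prob_no_match(3), exact)
```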
1.5. Independent Events
Two events E and F are said to be independent if
P(EF) =P(E)P(F)
By Equation (1.5) this implies that E and F are independent if
P(E|F) =P(E)
[which also implies that P(F|E) = P(F)]. That is, E and F are independent if
knowledge that F has occurred does not affect the probability that E occurs. That
is, the occurrence of E is independent of whether or not F occurs.
Two events E and F that are not independent are said to be dependent.
Example 1.9 Suppose we toss two fair dice. Let E_1 denote the event that the sum of the dice is six and F denote the event that the first die equals four. Then
P(E_1 F) = P({(4, 2)}) = 1/36
while
P(E_1)P(F) = (5/36)(1/6) = 5/216
and hence E_1 and F are not independent. Intuitively, the reason for this is clear for if we are interested in the possibility of throwing a six (with two dice), then we will be quite happy if the first die lands four (or any of the numbers 1, 2, 3, 4, 5) because then we still have a possibility of getting a total of six. On the other hand, if the first die landed six, then we would be unhappy as we would no longer have a chance of getting a total of six. In other words, our chance of getting a total of six depends on the outcome of the first die and hence E_1 and F cannot be independent.
Let E_2 be the event that the sum of the dice equals seven. Is E_2 independent of F? The answer is yes since
P(E_2 F) = P({(4, 3)}) = 1/36
while
P(E_2)P(F) = (1/6)(1/6) = 1/36
We leave it for you to present the intuitive argument why the event that the sum of the dice equals seven is independent of the outcome on the first die.
The definition of independence can be extended to more than two events. The events E_1, E_2, . . . , E_n are said to be independent if for every subset E_{1′}, E_{2′}, . . . , E_{r′}, r ≤ n, of these events
P(E_{1′} E_{2′} · · · E_{r′}) = P(E_{1′})P(E_{2′}) · · · P(E_{r′})
Intuitively, the events E_1, E_2, . . . , E_n are independent if knowledge of the occurrence of any of these events has no effect on the probability of any other event.
Example 1.10 (Pairwise Independent Events That Are Not Independent) Let a ball be drawn from an urn containing four balls, numbered 1, 2, 3, 4. Let E = {1, 2}, F = {1, 3}, G = {1, 4}. If all four outcomes are assumed equally likely, then
P(EF) = P(E)P(F) = 1/4,
P(EG) = P(E)P(G) = 1/4,
P(FG) = P(F)P(G) = 1/4
However,
1/4 = P(EFG) ≠ P(E)P(F)P(G)
Hence, even though the events E, F, G are pairwise independent, they are not jointly independent.
Suppose that a sequence of experiments, each of which results in either a success or a failure, is to be performed. Let E_i, i ≥ 1, denote the event that the ith experiment results in a success. If, for all i_1, i_2, . . . , i_n,
P(E_{i_1} E_{i_2} · · · E_{i_n}) = Π_{j=1}^n P(E_{i_j})
we say that the sequence of experiments consists of independent trials.
Example 1.11 The successive flips of a coin consist of independent trials if we assume (as is usually done) that the outcome on any flip is not influenced by the outcomes on earlier flips. A success might consist of the outcome heads and a failure tails, or possibly the reverse.
1.6. Bayes' Formula
Let E and F be events. We may express E as
E = EF ∪ EF^c
because in order for a point to be in E, it must either be in both E and F, or it must be in E and not in F. Since EF and EF^c are mutually exclusive, we have that
P(E) = P(EF) + P(EF^c)
     = P(E|F)P(F) + P(E|F^c)P(F^c)
     = P(E|F)P(F) + P(E|F^c)(1 − P(F))   (1.7)
Equation (1.7) states that the probability of the event E is a weighted average of
the conditional probability of E given that F has occurred and the conditional
probability of E given that F has not occurred, each conditional probability being
given as much weight as the event on which it is conditioned has of occurring.
Example 1.12 Consider two urns. The first contains two white and seven black balls, and the second contains five white and six black balls. We flip a fair coin and then draw a ball from the first urn or the second urn depending on whether the outcome was heads or tails. What is the conditional probability that the outcome of the toss was heads given that a white ball was selected?
Solution: Let W be the event that a white ball is drawn, and let H be the event that the coin comes up heads. The desired probability P(H|W) may be calculated as follows:
P(H|W) = P(HW)/P(W) = P(W|H)P(H)/P(W)
       = P(W|H)P(H) / [P(W|H)P(H) + P(W|H^c)P(H^c)]
       = (2/9)(1/2) / [(2/9)(1/2) + (5/11)(1/2)]
       = 22/67
Example 1.13 In answering a question on a multiple-choice test a student either knows the answer or guesses. Let p be the probability that she knows the answer and 1 − p the probability that she guesses. Assume that a student who guesses at the answer will be correct with probability 1/m, where m is the number of multiple-choice alternatives. What is the conditional probability that a student knew the answer to a question given that she answered it correctly?
Solution: Let C and K denote respectively the event that the student answers the question correctly and the event that she actually knows the answer. Now
P(K|C) = P(KC)/P(C) = P(C|K)P(K) / [P(C|K)P(K) + P(C|K^c)P(K^c)]
       = p / [p + (1/m)(1 − p)]
       = mp / [1 + (m − 1)p]
Thus, for example, if m = 5, p = 1/2, then the probability that a student knew the answer to a question she correctly answered is 5/6.
Example 1.14 A laboratory blood test is 95 percent effective in detecting a certain disease when it is, in fact, present. However, the test also yields a false positive result for 1 percent of the healthy persons tested. (That is, if a healthy person is tested, then, with probability 0.01, the test result will imply he has the disease.) If 0.5 percent of the population actually has the disease, what is the probability a person has the disease given that his test result is positive?
Solution: Let D be the event that the tested person has the disease, and E the event that his test result is positive. The desired probability P(D|E) is obtained by
P(D|E) = P(DE)/P(E) = P(E|D)P(D) / [P(E|D)P(D) + P(E|D^c)P(D^c)]
       = (0.95)(0.005) / [(0.95)(0.005) + (0.01)(0.995)]
       = 95/294 ≈ 0.323
Thus, only 32 percent of those persons whose test results are positive actually have the disease.
Equation (1.7) may be generalized in the following manner. Suppose that F_1, F_2, . . . , F_n are mutually exclusive events such that ∪_{i=1}^n F_i = S. In other words, exactly one of the events F_1, F_2, . . . , F_n will occur. By writing
E = ∪_{i=1}^n EF_i
and using the fact that the events EF_i, i = 1, . . . , n, are mutually exclusive, we obtain that
P(E) = Σ_{i=1}^n P(EF_i) = Σ_{i=1}^n P(E|F_i)P(F_i)   (1.8)
Thus, Equation (1.8) shows how, for given events F_1, F_2, . . . , F_n of which one and only one must occur, we can compute P(E) by first conditioning upon which one of the F_i occurs. That is, it states that P(E) is equal to a weighted average of P(E|F_i), each term being weighted by the probability of the event on which it is conditioned.
Suppose now that E has occurred and we are interested in determining which one of the F_j also occurred. By Equation (1.8) we have that
P(F_j|E) = P(EF_j)/P(E) = P(E|F_j)P(F_j) / Σ_{i=1}^n P(E|F_i)P(F_i)   (1.9)
Equation (1.9) is known as Bayes' formula.
Example 1.15 You know that a certain letter is equally likely to be in any one of three different folders. Let α_i be the probability that you will find your letter upon making a quick examination of folder i if the letter is, in fact, in folder i, i = 1, 2, 3. (We may have α_i < 1.) Suppose you look in folder 1 and do not find the letter. What is the probability that the letter is in folder 1?
Solution: Let F_i, i = 1, 2, 3 be the event that the letter is in folder i; and let E be the event that a search of folder 1 does not come up with the letter. We desire P(F_1|E). From Bayes' formula we obtain
P(F_1|E) = P(E|F_1)P(F_1) / Σ_{i=1}^3 P(E|F_i)P(F_i)
         = (1 − α_1)(1/3) / [(1 − α_1)(1/3) + 1/3 + 1/3]
         = (1 − α_1) / (3 − α_1)
Exercises
1. A box contains three marbles: one red, one green, and one blue. Consider an
experiment that consists of taking one marble from the box then replacing it in
the box and drawing a second marble from the box. What is the sample space?
If, at all times, each marble in the box is equally likely to be selected, what is the
probability of each point in the sample space?
*2. Repeat Exercise 1 when the second marble is drawn without replacing the
rst marble.
3. A coin is to be tossed until a head appears twice in a row. What is the sample
space for this experiment? If the coin is fair, what is the probability that it will be
tossed exactly four times?
4. Let E, F, G be three events. Find expressions for the events that of E, F, G
(a) only F occurs,
(b) both E and F but not G occur,
(c) at least one event occurs,
(d) at least two events occur,
(e) all three events occur,
(f) none occurs,
(g) at most one occurs,
(h) at most two occur.
*5. An individual uses the following gambling system at Las Vegas. He bets $1
that the roulette wheel will come up red. If he wins, he quits. If he loses then he
makes the same bet a second time only this time he bets $2; and then regardless
of the outcome, quits. Assuming that he has a probability of 1/2 of winning each bet, what is the probability that he goes home a winner? Why is this system not used by everyone?
6. Show that E(F ∪ G) = EF ∪ EG.
7. Show that (E ∪ F)^c = E^c F^c.
8. If P(E) = 0.9 and P(F) = 0.8, show that P(EF) ≥ 0.7. In general, show that
P(EF) ≥ P(E) + P(F) − 1
This is known as Bonferroni's inequality.
*9. We say that E ⊂ F if every point in E is also in F. Show that if E ⊂ F, then
P(F) = P(E) + P(FE^c) ≥ P(E)
10. Show that
P(∪_{i=1}^∞ E_i) ≤ Σ_{i=1}^∞ P(E_i)
This is known as Boole's inequality.
Hint: Either use Equation (1.2) and mathematical induction, or else show that ∪_{i=1}^n E_i = ∪_{i=1}^n F_i, where F_1 = E_1, F_i = E_i ∩_{j=1}^{i−1} E_j^c, and use property (iii) of a probability.
11. If two fair dice are tossed, what is the probability that the sum is i, i =
2, 3, . . . , 12?
12. Let E and F be mutually exclusive events in the sample space of an
experiment. Suppose that the experiment is repeated until either event E or
event F occurs. What does the sample space of this new super experiment look
like? Show that the probability that event E occurs before event F is P(E)/
[P(E) +P(F)].
Hint: Argue that the probability that the original experiment is performed n times and E appears on the nth time is P(E)(1 − p)^{n−1}, n = 1, 2, . . . , where p = P(E) + P(F). Add these probabilities to get the desired answer.
13. The dice game craps is played as follows. The player throws two dice, and
if the sum is seven or eleven, then she wins. If the sum is two, three, or twelve,
then she loses. If the sum is anything else, then she continues throwing until she
either throws that number again (in which case she wins) or she throws a seven
(in which case she loses). Calculate the probability that the player wins.
14. The probability of winning on a single toss of the dice is p. A starts, and
if he fails, he passes the dice to B, who then attempts to win on her toss. They
continue tossing the dice back and forth until one of them wins. What are their
respective probabilities of winning?
15. Argue that E = EF ∪ EF^c, E ∪ F = E ∪ FE^c.
16. Use Exercise 15 to show that P(E ∪ F) = P(E) + P(F) − P(EF).
*17. Suppose each of three persons tosses a coin. If the outcome of one of the tosses differs from the other outcomes, then the game ends. If not, then the persons start over and retoss their coins. Assuming fair coins, what is the probability that the game will end with the first round of tosses? If all three coins are biased and have probability 1/4 of landing heads, what is the probability that the game will end at the first round?
18. Assume that each child who is born is equally likely to be a boy or a girl.
If a family has two children, what is the probability that both are girls given that
(a) the eldest is a girl, (b) at least one is a girl?
*19. Two dice are rolled. What is the probability that at least one is a six? If the
two faces are different, what is the probability that at least one is a six?
20. Three dice are thrown. What is the probability the same number appears on
exactly two of the three dice?
21. Suppose that 5 percent of men and 0.25 percent of women are color-blind.
A color-blind person is chosen at random. What is the probability of this person
being male? Assume that there are an equal number of males and females.
22. A and B play until one has 2 more points than the other. Assuming that each
point is independently won by A with probability p, what is the probability they
will play a total of 2n points? What is the probability that A will win?
23. For events E_1, E_2, . . . , E_n show that
P(E_1 E_2 · · · E_n) = P(E_1)P(E_2|E_1)P(E_3|E_1 E_2) · · · P(E_n|E_1 · · · E_{n−1})
24. In an election, candidate A receives n votes and candidate B receives m
votes, where n > m. Assume that in the count of the votes all possible orderings
of the n + m votes are equally likely. Let P_{n,m} denote the probability that from
the first vote on A is always in the lead. Find
(a) P_{2,1} (b) P_{3,1} (c) P_{n,1} (d) P_{3,2} (e) P_{4,2} (f) P_{n,2}
(g) P_{4,3} (h) P_{5,3} (i) P_{5,4} (j) Make a conjecture as to the value of P_{n,m}.
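(One way to build intuition for part (j), offered here only as an aside: brute-force enumeration. The Python sketch below lists every ordering of the n + m votes, counts those in which A stays strictly ahead throughout, and prints the classical ballot-problem value (n − m)/(n + m) for comparison.)

from itertools import combinations
from fractions import Fraction
from math import comb

def P(n, m):
    # Enumerate which of the n + m counting positions hold A's votes and
    # keep the orderings in which A is strictly ahead after every vote.
    good = 0
    for a_positions in combinations(range(n + m), n):
        a_set = set(a_positions)
        lead = 0
        ok = True
        for i in range(n + m):
            lead += 1 if i in a_set else -1
            if lead <= 0:
                ok = False
                break
        good += ok
    return Fraction(good, comb(n + m, n))

for n, m in [(2, 1), (3, 1), (3, 2), (4, 2), (4, 3), (5, 3), (5, 4)]:
    print((n, m), P(n, m), Fraction(n - m, n + m))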
*25. Two cards are randomly selected from a deck of 52 playing cards.
(a) What is the probability they constitute a pair (that is, that they are of the
same denomination)?
(b) What is the conditional probability they constitute a pair given that they are
of different suits?
26. A deck of 52 playing cards, containing all 4 aces, is randomly divided into
4 piles of 13 cards each. Define events E_1, E_2, E_3, and E_4 as follows:
E_1 = {the first pile has exactly 1 ace},
E_2 = {the second pile has exactly 1 ace},
E_3 = {the third pile has exactly 1 ace},
E_4 = {the fourth pile has exactly 1 ace}
Use Exercise 23 to find P(E_1 E_2 E_3 E_4), the probability that each pile has an ace.
*27. Suppose in Exercise 26 we had defined the events E_i, i = 1, 2, 3, 4, by
E_1 = {one of the piles contains the ace of spades},
E_2 = {the ace of spades and the ace of hearts are in different piles},
E_3 = {the ace of spades, the ace of hearts, and the
ace of diamonds are in different piles},
E_4 = {all 4 aces are in different piles}
Now use Exercise 23 to find P(E_1 E_2 E_3 E_4), the probability that each pile has an
ace. Compare your answer with the one you obtained in Exercise 26.
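(As a worked check, not part of the text: one standard way to apply Exercise 23 here gives the factors 39/51, 26/50, and 13/49, the chance that each successive ace lands outside the piles already holding aces. The Python sketch below computes the product and confirms it with a small Monte Carlo deal of the deck.)

import random
from fractions import Fraction

# Chain-rule product for the events of Exercise 27:
# P(E1) = 1, P(E2|E1) = 39/51, P(E3|E1E2) = 26/50, P(E4|E1E2E3) = 13/49.
exact = Fraction(39, 51) * Fraction(26, 50) * Fraction(13, 49)
print(exact, float(exact))

# Monte Carlo check: deal a shuffled 52-card deck into four piles of 13
# and record how often every pile contains exactly one ace.
deck = ["A"] * 4 + ["x"] * 48
trials, hits = 100_000, 0
for _ in range(trials):
    random.shuffle(deck)
    piles = [deck[13 * i:13 * (i + 1)] for i in range(4)]
    hits += all(pile.count("A") == 1 for pile in piles)
print(hits / trials)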
28. If the occurrence of B makes A more likely, does the occurrence of A make
B more likely?
29. Suppose that P(E) = 0.6. What can you say about P(E|F) when
(a) E and F are mutually exclusive?
(b) E ⊂ F?
(c) F ⊂ E?
*30. Bill and George go target shooting together. Both shoot at a target at the
same time. Suppose Bill hits the target with probability 0.7, whereas George, in-
dependently, hits the target with probability 0.4.
(a) Given that exactly one shot hit the target, what is the probability that it was
George's shot?
(b) Given that the target is hit, what is the probability that George hit it?
31. What is the conditional probability that the first die is six given that the sum
of the dice is seven?
*32. Suppose all n men at a party throw their hats in the center of the room.
Each man then randomly selects a hat. Show that the probability that none of the
n men selects his own hat is
1/2! − 1/3! + 1/4! − · · · + (−1)^n/n!
Note that as n → ∞ this converges to e^{−1}. Is this surprising?
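(As a numerical aside, not part of the exercise: the partial sums of this series settle toward e^{−1} very quickly, as the short Python sketch below illustrates.)

from math import exp, factorial

# Partial sums of 1/2! - 1/3! + 1/4! - ... + (-1)^n / n!
for n in range(2, 11):
    s = sum((-1) ** k / factorial(k) for k in range(2, n + 1))
    print(n, round(s, 6))
print("1/e =", round(exp(-1), 6))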
33. In a class there are four freshman boys, six freshman girls, and six sopho-
more boys. How many sophomore girls must be present if sex and class are to be
independent when a student is selected at random?
34. Mr. Jones has devised a gambling system for winning at roulette. When he
bets, he bets on red, and places a bet only when the ten previous spins of the
roulette have landed on a black number. He reasons that his chance of winning is
quite large since the probability of eleven consecutive spins resulting in black is
quite small. What do you think of this system?
35. A fair coin is continually flipped. What is the probability that the first four
flips are
(a) H, H, H, H?
(b) T , H, H, H?
(c) What is the probability that the pattern T , H, H, H occurs before the
pattern H, H, H, H?
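(Part (c) lends itself to a simulation check; the Python sketch below, an illustration only, flips a virtual fair coin until one of the two patterns appears and estimates how often T, H, H, H comes first.)

import random

def thhh_before_hhhh():
    # Flip a fair coin until the last four flips read THHH or HHHH.
    last = ""
    while True:
        last = (last + random.choice("HT"))[-4:]
        if last == "THHH":
            return True
        if last == "HHHH":
            return False

trials = 100_000
print(sum(thhh_before_hhhh() for _ in range(trials)) / trials)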
36. Consider two boxes, one containing one black and one white marble, the
other, two black and one white marble. A box is selected at random and a marble
is drawn at random from the selected box. What is the probability that the marble
is black?
37. In Exercise 36, what is the probability that the first box was the one selected
given that the marble is white?
38. Urn 1 contains two white balls and one black ball, while urn 2 contains one
white ball and five black balls. One ball is drawn at random from urn 1 and placed
in urn 2. A ball is then drawn from urn 2. It happens to be white. What is the
probability that the transferred ball was white?
39. Stores A, B, and C have 50, 75, and 100 employees, and, respectively, 50,
60, and 70 percent of these are women. Resignations are equally likely among all
employees, regardless of sex. One employee resigns and this is a woman. What is
the probability that she works in store C?
*40. (a) A gambler has in his pocket a fair coin and a two-headed coin. He
selects one of the coins at random, and when he flips it, it shows heads. What is
the probability that it is the fair coin? (b) Suppose that he flips the same coin a
second time and again it shows heads. Now what is the probability that it is the
fair coin? (c) Suppose that he flips the same coin a third time and it shows tails.
Now what is the probability that it is the fair coin?
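(All three parts can be checked by sequential Bayes updates; in the Python sketch below, offered as an illustration rather than the requested derivation, the posterior probability of the fair coin is recomputed after each flip.)

from fractions import Fraction

def update(prior_fair, outcome):
    # Likelihood of the outcome under each coin: the fair coin shows heads
    # or tails with probability 1/2, the two-headed coin shows heads with
    # probability 1 and tails with probability 0.
    like_fair = Fraction(1, 2)
    like_two_headed = Fraction(1) if outcome == "H" else Fraction(0)
    num = prior_fair * like_fair
    return num / (num + (1 - prior_fair) * like_two_headed)

p = Fraction(1, 2)            # prior probability that the fair coin was picked
for flip in "HHT":
    p = update(p, flip)
    print(flip, p)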
41. In a certain species of rats, black dominates over brown. Suppose that a
black rat with two black parents has a brown sibling.
(a) What is the probability that this rat is a pure black rat (as opposed to being
a hybrid with one black and one brown gene)?
(b) Suppose that when the black rat is mated with a brown rat, all five of their
offspring are black. Now, what is the probability that the rat is a pure black rat?
42. There are three coins in a box. One is a two-headed coin, another is a fair
coin, and the third is a biased coin that comes up heads 75 percent of the time.
When one of the three coins is selected at random and flipped, it shows heads.
What is the probability that it was the two-headed coin?
43. Suppose we have ten coins which are such that if the ith one is flipped then
heads will appear with probability i/10, i = 1, 2, . . . , 10. When one of the coins is
randomly selected and flipped, it shows heads. What is the conditional probability
that it was the fifth coin?
44. Urn 1 has ve white and seven black balls. Urn 2 has three white and twelve
black balls. We flip a fair coin. If the outcome is heads, then a ball from urn 1 is
selected, while if the outcome is tails, then a ball from urn 2 is selected. Suppose
that a white ball is selected. What is the probability that the coin landed tails?
*45. An urn contains b black balls and r red balls. One of the balls is drawn at
random, but when it is put back in the urn c additional balls of the same color are
put in with it. Now suppose that we draw another ball. Show that the probability
that the first ball drawn was black given that the second ball drawn was red is
b/(b +r +c).
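(A Monte Carlo check of the stated answer, for one arbitrary choice of b, r, c; the values 3, 4, 2 below are illustrative only.)

import random

# Illustrative values only; the exercise is stated for general b, r, c.
b, r, c = 3, 4, 2
joint = cond = 0
for _ in range(200_000):
    first_black = random.random() < b / (b + r)
    # After replacement, c extra balls of the drawn color are added.
    blacks = b + c if first_black else b
    reds = r if first_black else r + c
    if random.random() < reds / (blacks + reds):   # second ball is red
        cond += 1
        joint += first_black
print(joint / cond, b / (b + r + c))   # both close to 1/3 for these values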
46. Three prisoners are informed by their jailer that one of them has been cho-
sen at random to be executed, and the other two are to be freed. Prisoner A asks
the jailer to tell him privately which of his fellow prisoners will be set free, claim-
ing that there would be no harm in divulging this information, since he already
knows that at least one will go free. The jailer refuses to answer this question,
pointing out that if A knew which of his fellows were to be set free, then his own
probability of being executed would rise from 1/3 to 1/2, since he would then be one
of two prisoners. What do you think of the jailer's reasoning?
47. For a fixed event B, show that the collection P(A|B), defined for all events
A, satisfies the three conditions for a probability. Conclude from this that
P(A|B) = P(A|BC)P(C|B) + P(A|BC^c)P(C^c|B)
Then directly verify the preceding equation.
*48. Sixty percent of the families in a certain community own their own car,
thirty percent own their own home, and twenty percent own both their own car
and their own home. If a family is randomly chosen, what is the probability that
this family owns a car or a house but not both?
References
Reference [2] provides a colorful introduction to some of the earliest developments in probability
theory. References [3], [4], and [7] are all excellent introductory texts in modern probability theory.
Reference [5] is the definitive work which established the axiomatic foundation of modern mathemat-
ical probability theory. Reference [6] is a nonmathematical introduction to probability theory and its
applications, written by one of the greatest mathematicians of the eighteenth century.
1. L. Breiman, Probability, Addison-Wesley, Reading, Massachusetts, 1968.
2. F. N. David, Games, Gods, and Gambling, Hafner, New York, 1962.
3. W. Feller, An Introduction to Probability Theory and Its Applications, Vol. I, John
Wiley, New York, 1957.
4. B. V. Gnedenko, Theory of Probability, Chelsea, New York, 1962.
5. A. N. Kolmogorov, Foundations of the Theory of Probability, Chelsea, New York,
1956.
6. Marquis de Laplace, A Philosophical Essay on Probabilities, 1825 (English Transla-
tion), Dover, New York, 1951.
7. S. Ross, A First Course in Probability, Sixth Edition, Prentice Hall, New Jersey, 2002.