
Probability Theory

S. R. S. Varadhan
Courant Institute of Mathematical Sciences
New York University
August 31, 2000
Contents

1 Measure Theory
  1.1 Introduction
  1.2 Construction of Measures
  1.3 Integration
  1.4 Transformations
  1.5 Product Spaces
  1.6 Distributions and Expectations

2 Weak Convergence
  2.1 Characteristic Functions
  2.2 Moment Generating Functions
  2.3 Weak Convergence

3 Independent Sums
  3.1 Independence and Convolution
  3.2 Weak Law of Large Numbers
  3.3 Strong Limit Theorems
  3.4 Series of Independent Random Variables
  3.5 Strong Law of Large Numbers
  3.6 Central Limit Theorem
  3.7 Accompanying Laws
  3.8 Infinitely Divisible Distributions
  3.9 Laws of the Iterated Logarithm

4 Dependent Random Variables
  4.1 Conditioning
  4.2 Conditional Expectation
  4.3 Conditional Probability
  4.4 Markov Chains
  4.5 Stopping Times and Renewal Times
  4.6 Countable State Space

5 Martingales
  5.1 Definitions and Properties
  5.2 Martingale Convergence Theorems
  5.3 Doob Decomposition Theorem
  5.4 Stopping Times
  5.5 Upcrossing Inequality
  5.6 Martingale Transforms, Option Pricing
  5.7 Martingales and Markov Chains

6 Stationary Stochastic Processes
  6.1 Ergodic Theorems
  6.2 Structure of Stationary Measures
  6.3 Stationary Markov Processes
  6.4 Mixing Properties of Markov Processes
  6.5 Central Limit Theorem for Martingales
  6.6 Stationary Gaussian Processes

7 Dynamic Programming and Filtering
  7.1 Optimal Control
  7.2 Optimal Stopping
  7.3 Filtering
Preface
These notes are based on a first year graduate course on Probability and Limit Theorems given at the Courant Institute of Mathematical Sciences. Originally written during 1997-98, they have been revised during the academic year 1998-99 as well as in the Fall of 1999. I want to express my appreciation to those who pointed out to me several typos as well as suggestions for improvement. I want to mention in particular the detailed comments from Professor Charles Newman and Mr. Enrique Loubet. Chuck used the notes while teaching the course in 98-99, and Enrique helped me as TA when I taught out of these notes again in the Fall of 99. These notes cover about three fourths of the course, essentially the discrete time processes. Hopefully there will appear a companion volume some time in the near future that will cover continuous time processes. A small amount of measure theory is included. While it is not meant to be complete, it is my hope that it will be useful.
Chapter 1
Measure Theory
1.1 Introduction.
The evolution of probability theory was based more on intuition than on mathematical axioms during its early development. In 1933, A. N. Kolmogorov [4] provided an axiomatic basis for probability theory, and it is now the universally accepted model. There are certain noncommutative versions that have their origins in quantum mechanics, see for instance K. R. Parthasarathy [5], that are generalizations of the Kolmogorov model. We shall however use exclusively Kolmogorov's framework.

The basic intuition in probability theory is the notion of randomness. There are experiments whose results are not predictable and can be determined only after performing the experiment and then observing the outcome. The simplest familiar examples are the tossing of a fair coin and the throwing of a balanced die. In the first experiment the result could be either a head or a tail, and the throwing of a die could result in a score of any integer from 1 through 6. These are experiments with only a finite number of alternative outcomes. It is not difficult to imagine experiments that have countably or even uncountably many alternatives as possible outcomes.

Abstractly then, there is a space $\Omega$ of all possible outcomes, and each individual outcome is represented as a point $\omega$ in that space $\Omega$. Subsets of $\Omega$ are called events, and each of them corresponds to a collection of outcomes. If the outcome $\omega$ is in the subset $A$, then the event $A$ is said to have occurred. For example, in the case of a die the set $A = \{1, 3, 5\}$ corresponds to the event "an odd number shows up". With this terminology it is clear that union of sets corresponds to "or", intersection to "and", and complementation to negation.
One would expect that probabilities should be associated with each outcome, and that there should be a probability function $f(\omega)$ which is the probability that $\omega$ occurs. In the case of coin tossing we may expect $\Omega = \{H, T\}$ and
$$f(T) = f(H) = \frac{1}{2},$$
or, in the case of a die,
$$f(1) = f(2) = \cdots = f(6) = \frac{1}{6}.$$
Since probability is normalized so that certainty corresponds to a probability of 1, one expects
$$\sum_{\omega} f(\omega) = 1. \tag{1.1}$$
If $\Omega$ is uncountable this is a mess. There is no reasonable way of adding up an uncountable set of numbers each of which is 0. This suggests that it may not be possible to start with probabilities associated with individual outcomes and build a meaningful theory. The next best thing is to start with the notion that probabilities are already defined for events. In such a case, $P(A)$ is defined for a class $\mathcal{B}$ of subsets $A \subset \Omega$. The question that arises naturally is what should $\mathcal{B}$ be, and what properties should $P(\cdot)$, defined on $\mathcal{B}$, have? It is natural to demand that the class $\mathcal{B}$ of sets for which probabilities are to be defined satisfy the following properties: the whole space $\Omega$ and the empty set $\emptyset$ are in $\mathcal{B}$; for any two sets $A$ and $B$ in $\mathcal{B}$, the sets $A \cup B$ and $A \cap B$ are again in $\mathcal{B}$; if $A \in \mathcal{B}$, then the complement $A^c$ is again in $\mathcal{B}$. Any class of sets satisfying these properties is called a field.
Definition 1.1. A probability, or more precisely a finitely additive probability measure, is a nonnegative set function $P(\cdot)$ defined for sets $A \in \mathcal{B}$ that satisfies the following properties:
$$P(A) \ge 0 \quad \text{for all } A \in \mathcal{B}, \tag{1.2}$$
$$P(\Omega) = 1 \quad \text{and} \quad P(\emptyset) = 0. \tag{1.3}$$
If $A \in \mathcal{B}$ and $B \in \mathcal{B}$ are disjoint then
$$P(A \cup B) = P(A) + P(B). \tag{1.4}$$
In particular
$$P(A^c) = 1 - P(A) \tag{1.5}$$
for all $A \in \mathcal{B}$.
A condition which is somewhat more technical, but important from a mathematical viewpoint, is that of countable additivity. The class $\mathcal{B}$, in addition to being a field, is assumed to be closed under countable unions (or equivalently, countable intersections); i.e. if $A_n \in \mathcal{B}$ for every $n$, then $A = \cup_n A_n \in \mathcal{B}$. Such a class is called a $\sigma$-field. The probability itself is presumed to be defined on a $\sigma$-field $\mathcal{B}$.

Definition 1.2. A set function $P$ defined on a $\sigma$-field is called a countably additive probability measure if, in addition to satisfying equations (1.2), (1.3) and (1.4), it satisfies the following countable additivity property: for any sequence of pairwise disjoint sets $A_n$ with $A = \cup_n A_n$,
$$P(A) = \sum_n P(A_n). \tag{1.6}$$
Exercise 1.1. The limit of an increasing (or decreasing) sequence $A_n$ of sets is defined as its union $\cup_n A_n$ (or the intersection $\cap_n A_n$). A monotone class is defined as a class that is closed under monotone limits of increasing or decreasing sequences of sets. Show that a field $\mathcal{B}$ is a $\sigma$-field if and only if it is a monotone class.
Exercise 1.2. Show that a finitely additive probability measure $P(\cdot)$ defined on a $\sigma$-field $\mathcal{B}$ is countably additive, i.e. satisfies equation (1.6), if and only if it satisfies either of the following two equivalent conditions.

If $A_n$ is any nonincreasing sequence of sets in $\mathcal{B}$ and $A = \lim_n A_n = \cap_n A_n$, then
$$P(A) = \lim_{n \to \infty} P(A_n).$$

If $A_n$ is any nondecreasing sequence of sets in $\mathcal{B}$ and $A = \lim_n A_n = \cup_n A_n$, then
$$P(A) = \lim_{n \to \infty} P(A_n).$$
Exercise 1.3. If $A, B \in \mathcal{B}$ and $P$ is a finitely additive probability measure, show that $P(A \cup B) = P(A) + P(B) - P(A \cap B)$. How does this generalize to $P(\cup_{j=1}^n A_j)$?

Exercise 1.4. If $P$ is a finitely additive measure on a field $\mathcal{F}$ and $A, B \in \mathcal{F}$, then $|P(A) - P(B)| \le P(A \Delta B)$, where $A \Delta B$ is the symmetric difference $(A \cap B^c) \cup (A^c \cap B)$. In particular, if $B \subset A$,
$$0 \le P(A) - P(B) = P(A \cap B^c) \le P(B^c).$$

Exercise 1.5. If $P$ is a countably additive probability measure, show that for any sequence $A_n \in \mathcal{B}$,
$$P\Big(\bigcup_{n=1}^{\infty} A_n\Big) \le \sum_{n=1}^{\infty} P(A_n).$$
Although we would like our probability to be a countably additive probability measure on a $\sigma$-field $\mathcal{B}$ of subsets of a space $\Omega$, it is not clear that there are plenty of such things. As a first small step show the following.

Exercise 1.6. If $\{\omega_n : n \ge 1\}$ are distinct points in $\Omega$ and $p_n \ge 0$ are numbers with $\sum_n p_n = 1$, then
$$P(A) = \sum_{n : \omega_n \in A} p_n$$
defines a countably additive probability measure on the $\sigma$-field of all subsets of $\Omega$. (This is still cheating because the measure $P$ lives on a countable set.)
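Exercise 1.6 is concrete enough to check on a computer. The sketch below is our own illustration (the choice $p_n = 2^{-n}$ and the truncation level are assumptions of the example, not part of the text): it builds the discrete measure and tests additivity on a pair of disjoint events.

```python
# Illustration of Exercise 1.6: a countably additive measure supported on
# countably many points omega_n, with weights p_n = 2^{-n} summing to 1.
def p(n):
    return 0.5 ** n              # p_1 + p_2 + ... = 1

def P(A, N=60):
    # P(A) = sum of p_n over n in A; truncating at N drops a tail < 2^{-N}
    return sum(p(n) for n in range(1, N + 1) if n in A)

A = {n for n in range(1, 61) if n % 2 == 0}    # "even outcome"
B = {n for n in range(1, 61) if n % 2 == 1}    # "odd outcome", disjoint from A
assert abs(P(A | B) - (P(A) + P(B))) < 1e-12   # additivity on disjoint sets
print(P(A), P(B), P(A | B))                    # 1/3, 2/3, 1 (up to truncation)
```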
Definition 1.3. A probability measure $P$ on a field $\mathcal{F}$ is said to be countably additive on $\mathcal{F}$ if for any sequence $A_n \in \mathcal{F}$ with $A_n \downarrow \emptyset$ we have $P(A_n) \to 0$.

Exercise 1.7. Given any class $\mathcal{F}$ of subsets of $\Omega$, show that there is a unique $\sigma$-field $\mathcal{B}$ that is the smallest $\sigma$-field containing $\mathcal{F}$.

Definition 1.4. The $\sigma$-field in the above exercise is called the $\sigma$-field generated by $\mathcal{F}$.
1.2 Construction of Measures
The following theorem is important for the construction of countably additive probability measures. A detailed proof of this theorem, as well as other results on measure and integration, can be found in [7], [3] or in any one of the many texts on real variables. In an effort to be complete we will sketch the standard proof.

Theorem 1.1. (Caratheodory Extension Theorem). Any countably additive probability measure $P$ on a field $\mathcal{F}$ extends uniquely as a countably additive probability measure to the $\sigma$-field $\mathcal{B}$ generated by $\mathcal{F}$.
Proof. The proof proceeds along the following steps:

Step 1. Define an object $P^*$, called the outer measure, for all sets $A$:
$$P^*(A) = \inf_{\cup_j A_j \supset A} \ \sum_j P(A_j) \tag{1.7}$$
where the infimum is taken over all countable collections $\{A_j\}$ of sets from $\mathcal{F}$ that cover $A$. Without loss of generality we can assume that the $\{A_j\}$ are disjoint. (Replace $A_j$ by $(\cap_{i=1}^{j-1} A_i^c) \cap A_j$.)
Step 2. Show that $P^*$ has the following properties:

1. $P^*$ is countably subadditive, i.e.
$$P^*\Big(\bigcup_j A_j\Big) \le \sum_j P^*(A_j).$$

2. For $A \in \mathcal{F}$, $P^*(A) \le P(A)$. (Trivial.)

3. For $A \in \mathcal{F}$, $P^*(A) \ge P(A)$. (Need to use the countable additivity of $P$ on $\mathcal{F}$.)
Step 3. Define a set $E$ to be measurable if
$$P^*(A) \ge P^*(A \cap E) + P^*(A \cap E^c)$$
holds for all sets $A$, and establish the following properties for the class $\mathcal{M}$ of measurable sets: the class $\mathcal{M}$ of measurable sets is a $\sigma$-field, and $P^*$ is a countably additive measure on it.

Step 4. Finally show that $\mathcal{M} \supset \mathcal{F}$. This implies that $\mathcal{M} \supset \mathcal{B}$ and $P^*$ is an extension of $P$ from $\mathcal{F}$ to $\mathcal{B}$.
Uniqueness is quite simple. Let $P_1$ and $P_2$ be two countably additive probability measures on a $\sigma$-field $\mathcal{B}$ that agree on a field $\mathcal{F} \subset \mathcal{B}$. Let us define $\mathcal{A} = \{A : P_1(A) = P_2(A)\}$. Then $\mathcal{A}$ is a monotone class, i.e., if $A_n \in \mathcal{A}$ is increasing (decreasing), then $\cup_n A_n$ (respectively $\cap_n A_n$) is in $\mathcal{A}$. Uniqueness will follow from the following fact, left as an exercise.

Exercise 1.8. The smallest monotone class containing a field is the same as the $\sigma$-field generated by the field.

It now follows that $\mathcal{A}$ must contain the $\sigma$-field generated by $\mathcal{F}$, and that proves uniqueness.
The extension theorem does not quite solve the problem of constructing countably additive probability measures; it reduces it to constructing them on fields. The following theorem is important in the theory of Lebesgue integrals and is very useful for the construction of countably additive probability measures on the real line. The proof will again be only sketched. The natural $\sigma$-field on which to define a probability measure on the line is the Borel $\sigma$-field. This is defined as the smallest $\sigma$-field containing all intervals, and it includes in particular all open sets.

Let us consider the class of subsets of the real numbers $\mathcal{I} = \{I_{a,b} : -\infty \le a < b \le \infty\}$, where $I_{a,b} = \{x : a < x \le b\}$ if $b < \infty$, and $I_{a,\infty} = \{x : a < x < \infty\}$. In other words, $\mathcal{I}$ is the collection of intervals that are left-open and right-closed. The class of sets that are finite disjoint unions of members of $\mathcal{I}$ is a field $\mathcal{F}$, if the empty set is added to the class. If we are given a function $F(x)$ on the real line which is nondecreasing, continuous from the right and satisfies
$$\lim_{x \to -\infty} F(x) = 0 \quad \text{and} \quad \lim_{x \to +\infty} F(x) = 1,$$
we can define a finitely additive probability measure $P$ by first defining
$$P(I_{a,b}) = F(b) - F(a)$$
for intervals and then extending it to $\mathcal{F}$ by defining it as the sum for disjoint unions from $\mathcal{I}$. Let us note that the Borel $\sigma$-field $\mathcal{B}$ on the real line is the $\sigma$-field generated by $\mathcal{F}$.
Theorem 1.2. (Lebesgue). $P$ is countably additive on $\mathcal{F}$ if and only if $F(x)$ is a right continuous function of $x$. Therefore for each right continuous nondecreasing function $F(x)$ with $F(-\infty) = 0$ and $F(+\infty) = 1$ there is a unique probability measure $P$ on the Borel subsets of the line such that $F(x) = P(I_{-\infty, x})$. Conversely, every countably additive probability measure $P$ on the Borel subsets of the line comes from some $F$. The correspondence between $P$ and $F$ is one-to-one.
Proof. The only difficult part is to establish the countable additivity of $P$ on $\mathcal{F}$ from the right continuity of $F(\cdot)$. Let $A_j \in \mathcal{F}$ and $A_j \downarrow \emptyset$, the empty set. Let us assume that $P(A_j) \ge \delta > 0$ for all $j$ and then establish a contradiction.

Step 1. We take a large interval $[-\ell, \ell]$ and replace $A_j$ by $B_j = A_j \cap [-\ell, \ell]$. Since $|P(A_j) - P(B_j)| \le 1 - F(\ell) + F(-\ell)$, we can make the choice of $\ell$ large enough that $P(B_j) \ge \frac{\delta}{2}$. In other words, we can assume without loss of generality that $P(A_j) \ge \frac{\delta}{2}$ and $A_j \subset [-\ell, \ell]$ for some fixed $\ell < \infty$.

Step 2. If
$$A_j = \bigcup_{i=1}^{k_j} I_{a_{j,i},\, b_{j,i}},$$
use the right continuity of $F$ to replace $A_j$ by $B_j$, which is again a union of left-open right-closed intervals with the same right end points, but with left end points moved ever so slightly to the right. Achieve this in such a way that
$$P(A_j \setminus B_j) \le \frac{\delta}{10 \cdot 2^j}$$
for all $j$.

Step 3. Define $C_j$ to be the closure of $B_j$, obtained by adding to it the left end points of the intervals making up $B_j$. Let $E_j = \cap_{i=1}^j B_i$ and $D_j = \cap_{i=1}^j C_i$. Then, (i) the sequence $D_j$ of sets is decreasing, (ii) each $D_j$ is a closed bounded set, and (iii) since $A_j \supset D_j$ and $A_j \downarrow \emptyset$, it follows that $D_j \downarrow \emptyset$. Because $D_j \supset E_j$ and $P(E_j) \ge \frac{\delta}{2} - \sum_j P(A_j \setminus B_j) \ge \frac{4\delta}{10}$, each $D_j$ is nonempty, and this violates the finite intersection property that every decreasing sequence of bounded nonempty closed sets on the real line has a nonempty intersection, i.e. has at least one common point.

The rest of the proof is left as an exercise.

The function $F$ is called the distribution function corresponding to the probability measure $P$.
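As a small illustration of how a right continuous $F$ generates interval probabilities (our own example, not from the text: an $F$ with an atom of size $\frac{1}{2}$ at 0 and a uniform part on $[0,1]$), the sketch below checks that $P(I_{a,b}) = F(b) - F(a)$ adds up correctly over a partition.

```python
# A sketch (our own example): a distribution function with a jump of 1/2 at 0
# and a uniform part on [0,1], so F(x) = 1/2 + x/2 on [0,1).
def F(x):
    if x < 0:
        return 0.0
    if x >= 1:
        return 1.0
    return 0.5 + 0.5 * x       # right continuous: F(0) = 1/2 captures the atom

def P(a, b):                   # measure of the left-open right-closed (a, b]
    return F(b) - F(a)

# Finite additivity over a partition of (-1, 1] into disjoint pieces
cuts = [-1.0, -0.5, 0.0, 0.25, 1.0]
total = sum(P(cuts[i], cuts[i + 1]) for i in range(len(cuts) - 1))
print(total, P(-1.0, 1.0))     # both equal 1.0
```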
Example 1.1. Suppose $x_1, x_2, \ldots, x_n, \ldots$ is a sequence of points and we have probabilities $p_n$ at these points; then for the discrete measure
$$P(A) = \sum_{n : x_n \in A} p_n$$
we have the distribution function
$$F(x) = \sum_{n : x_n \le x} p_n$$
that increases only by jumps, the jump at $x_n$ being $p_n$. The points $\{x_n\}$ themselves can be discrete like the integers or dense like the rationals.

Example 1.2. If $f(x)$ is a nonnegative integrable function with total integral 1, i.e. $\int_{-\infty}^{\infty} f(y)\, dy = 1$, then $F(x) = \int_{-\infty}^{x} f(y)\, dy$ is a distribution function which is continuous. In this case $f$ is the density of the measure $P$ and can be calculated as $f(x) = F'(x)$.

There are (messy) examples of $F$ that are continuous but do not come from any density. More on this later.
Exercise 1.9. Let us try to construct the Lebesgue measure on the rationals $Q \cap [0,1]$. We would like to have
$$P[I_{a,b}] = b - a$$
for all rational $0 \le a \le b \le 1$. Show that this is impossible, by showing that $P[\{q\}] = 0$ for the set $\{q\}$ containing the single rational $q$, while $P[Q] = P[\cup_{q \in Q}\{q\}] = 1$. Where does the earlier proof break down?

Once we have a countably additive probability measure $P$ on a space $(\Omega, \mathcal{B})$, we will call the triple $(\Omega, \mathcal{B}, P)$ a probability space.
1.3 Integration
An important notion is that of a random variable or a measurable function.

Definition 1.5. A random variable or measurable function is a map $f : \Omega \to R$, i.e. a real valued function $f(\omega)$ on $\Omega$, such that for every Borel set $B \subset R$, $f^{-1}(B) = \{\omega : f(\omega) \in B\}$ is a measurable subset of $\Omega$, i.e. $f^{-1}(B) \in \mathcal{B}$.

Exercise 1.10. It is enough to check the requirement for sets $B \subset R$ that are intervals, or even just sets of the form $(-\infty, x]$ for $-\infty < x < \infty$.

A function that is measurable and satisfies $|f(\omega)| \le M$ for all $\omega$, for some finite $M$, is called a bounded measurable function.

The following statements are the essential steps in developing an integration theory. Details can be found in any book on real variables.
1. If $A \in \mathcal{B}$, the indicator function of $A$, defined as
$$\mathbf{1}_A(\omega) = \begin{cases} 1 & \text{if } \omega \in A \\ 0 & \text{if } \omega \notin A, \end{cases}$$
is bounded and measurable.

2. Sums, products, limits, compositions and reasonable elementary operations like min and max performed on measurable functions lead to measurable functions.

3. If $\{A_j : 1 \le j \le n\}$ is a finite disjoint partition of $\Omega$ into measurable sets, the function $f(\omega) = \sum_j c_j \mathbf{1}_{A_j}(\omega)$ is a measurable function and is referred to as a simple function.

4. Any bounded measurable function $f$ is a uniform limit of simple functions. To see this, if $f$ is bounded by $M$, divide $[-M, M]$ into $n$ subintervals $I_j$ of length $\frac{2M}{n}$ with midpoints $c_j$. Let
$$A_j = f^{-1}(I_j) = \{\omega : f(\omega) \in I_j\}$$
and
$$f_n(\omega) = \sum_{j=1}^{n} c_j \mathbf{1}_{A_j}(\omega).$$
Clearly $f_n$ is simple, $\sup_\omega |f_n(\omega) - f(\omega)| \le \frac{M}{n}$, and we are done.
5. For simple functions $f = \sum_j c_j \mathbf{1}_{A_j}$, the integral $\int f\, dP$ is defined to be $\sum_j c_j P(A_j)$. It enjoys the following properties:

(a) If $f$ and $g$ are simple, so is any linear combination $af + bg$ for real constants $a$ and $b$, and
$$\int (af + bg)\, dP = a \int f\, dP + b \int g\, dP.$$

(b) If $f$ is simple so is $|f|$, and $|\int f\, dP| \le \int |f|\, dP \le \sup_\omega |f(\omega)|$.

6. If $f_n$ is a sequence of simple functions converging to $f$ uniformly, then $a_n = \int f_n\, dP$ is a Cauchy sequence of real numbers and therefore has a limit $a$ as $n \to \infty$. The integral $\int f\, dP$ of $f$ is defined to be this limit $a$. One can verify that $a$ depends only on $f$ and not on the sequence $f_n$ chosen to approximate $f$.

7. Now the integral is defined for all bounded measurable functions and enjoys the following properties:

(a) If $f$ and $g$ are bounded measurable functions and $a, b$ are real constants, then the linear combination $af + bg$ is again a bounded measurable function, and
$$\int (af + bg)\, dP = a \int f\, dP + b \int g\, dP.$$

(b) If $f$ is a bounded measurable function so is $|f|$, and $|\int f\, dP| \le \int |f|\, dP \le \sup_\omega |f(\omega)|$.

(c) In fact a slightly stronger inequality is true. For any bounded measurable $f$,
$$\int |f|\, dP \le P(\{\omega : |f(\omega)| > 0\}) \cdot \sup_\omega |f(\omega)|.$$

(d) If $f$ is a bounded measurable function and $A$ is a measurable set, one defines
$$\int_A f(\omega)\, dP = \int \mathbf{1}_A(\omega) f(\omega)\, dP,$$
and we can write, for any measurable set $A$,
$$\int f\, dP = \int_A f\, dP + \int_{A^c} f\, dP.$$
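Items 4-6 can be watched numerically. The sketch below is a rough illustration under our own choices ($\Omega = [0,1]$ with the uniform measure represented by a fine grid, $f(\omega) = \sin 2\pi\omega$, so $M = 1$): it builds the simple functions $f_n$ of item 4 and checks the $M/n$ uniform error bound and the convergence of the integrals.

```python
import numpy as np

# Illustration (our own sketch): Omega = [0,1] with the uniform measure,
# represented by a fine grid; f is bounded by M = 1.
grid = np.linspace(0.0, 1.0, 200001)
f = np.sin(2 * np.pi * grid)
M = 1.0

for n in [4, 16, 64, 256]:
    edges = np.linspace(-M, M, n + 1)        # divide [-M, M] into n intervals I_j
    mids = 0.5 * (edges[:-1] + edges[1:])    # midpoints c_j
    j = np.clip(np.searchsorted(edges, f, side='right') - 1, 0, n - 1)
    f_n = mids[j]                            # simple function: c_j on A_j = f^{-1}(I_j)
    print(n, np.max(np.abs(f_n - f)),        # <= M/n, as in item 4
          np.mean(f_n))                      # integral of f_n -> integral of f = 0
```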
In addition to uniform convergence there are other, weaker notions of convergence.

Definition 1.6. A sequence $f_n$ of functions is said to converge to a function $f$ everywhere or pointwise if
$$\lim_{n \to \infty} f_n(\omega) = f(\omega)$$
for every $\omega \in \Omega$.
In dealing with sequences of functions on a space that has a measure defined on it, often it does not matter if the sequence fails to converge on a set of points that is insignificant. For example, if we are dealing with the Lebesgue measure on the interval $[0,1]$ and $f_n(x) = x^n$, then $f_n(x) \to 0$ for all $x$ except $x = 1$. A single point, being an interval of length 0, should be insignificant for the Lebesgue measure.

Definition 1.7. A sequence $f_n$ of measurable functions is said to converge to a measurable function $f$ almost everywhere or almost surely (usually abbreviated as a.e.) if there exists a measurable set $N$ with $P(N) = 0$ such that
$$\lim_{n \to \infty} f_n(\omega) = f(\omega)$$
for every $\omega \in N^c$.
Note that almost everywhere convergence is always relative to a probability measure. Another notion of convergence is the following:

Definition 1.8. A sequence $f_n$ of measurable functions is said to converge to a measurable function $f$ in measure or in probability if
$$\lim_{n \to \infty} P[\omega : |f_n(\omega) - f(\omega)| \ge \epsilon] = 0$$
for every $\epsilon > 0$.
Let us examine these notions in the context of indicator functions of sets, $f_n(\omega) = \mathbf{1}_{A_n}(\omega)$. As soon as $A \ne B$, $\sup_\omega |\mathbf{1}_A(\omega) - \mathbf{1}_B(\omega)| = 1$, so that uniform convergence never really takes place. On the other hand, one can verify that $\mathbf{1}_{A_n}(\omega) \to \mathbf{1}_A(\omega)$ for every $\omega$ if and only if the two sets
$$\limsup_{n \to \infty} A_n = \bigcap_n \bigcup_{m \ge n} A_m$$
and
$$\liminf_{n \to \infty} A_n = \bigcup_n \bigcap_{m \ge n} A_m$$
both coincide with $A$. Finally, $\mathbf{1}_{A_n}(\omega) \to \mathbf{1}_A(\omega)$ in measure if and only if
$$\lim_{n \to \infty} P(A_n \,\Delta\, A) = 0,$$
where for any two sets $A$ and $B$ the symmetric difference $A \Delta B$ is defined as $A \Delta B = (A \cap B^c) \cup (A^c \cap B) = (A \cup B) \cap (A \cap B)^c$. It is the set of points that belong to either set but not to both. For instance, $\mathbf{1}_{A_n} \to 0$ in measure if and only if $P(A_n) \to 0$.
Exercise 1.11. There is a difference between almost everywhere convergence and convergence in measure. The first is really stronger. Consider the interval $[0,1]$ and divide it successively into $2, 3, 4, \ldots$ parts, and enumerate the intervals in succession. That is, $I_1 = [0, \frac12]$, $I_2 = [\frac12, 1]$, $I_3 = [0, \frac13]$, $I_4 = [\frac13, \frac23]$, $I_5 = [\frac23, 1]$, and so on. If $f_n(x) = \mathbf{1}_{I_n}(x)$, it is easy to check that $f_n$ tends to 0 in measure but not almost everywhere.
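A quick computational sketch of this construction (our own illustration; the point $x = 0.3$ is an arbitrary choice):

```python
# Sketch of Exercise 1.11: the "moving intervals" I_1=[0,1/2], I_2=[1/2,1],
# I_3=[0,1/3], ... shrink in length, so f_n = 1_{I_n} -> 0 in measure, yet
# every x in [0,1] lies in infinitely many I_n, so f_n(x) = 1 infinitely
# often and f_n(x) does not converge at any point.
def intervals(n_max):
    out, k = [], 2
    while len(out) < n_max:
        out += [(i / k, (i + 1) / k) for i in range(k)]
        k += 1
    return out[:n_max]

ivs = intervals(500)
x = 0.3
print(ivs[-1][1] - ivs[-1][0])                 # interval length -> 0
print(sum(1 for a, b in ivs if a <= x <= b))   # many hits already by n = 500
```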
Exercise 1.12. But the following statement is true. If $f_n \to f$ as $n \to \infty$ in measure, then there is a subsequence $f_{n_j}$ such that $f_{n_j} \to f$ almost everywhere as $j \to \infty$.

Exercise 1.13. If $\{A_n\}$ is a sequence of measurable sets, then in order that $P(\limsup_{n \to \infty} A_n) = 0$, it is necessary and sufficient that
$$\lim_{n \to \infty} P\Big[\bigcup_{m=n}^{\infty} A_m\Big] = 0.$$
In particular it is sufficient that $\sum_n P[A_n] < \infty$. Is it necessary?
Lemma 1.3. If $f_n \to f$ almost everywhere, then $f_n \to f$ in measure.

Proof. $f_n \to f$ outside $N$ is equivalent to: for every $\epsilon > 0$,
$$\bigcap_n \bigcup_{m \ge n} [\omega : |f_m(\omega) - f(\omega)| \ge \epsilon] \subset N.$$
In particular, by countable additivity,
$$P[\omega : |f_n(\omega) - f(\omega)| \ge \epsilon] \le P\Big[\bigcup_{m \ge n} [\omega : |f_m(\omega) - f(\omega)| \ge \epsilon]\Big] \to 0$$
as $n \to \infty$, and we are done.
Exercise 1.14. Countable additivity is important for this result. On a finitely additive probability space it could be that $f_n \to f$ everywhere and still $f_n$ does not converge to $f$ in measure. In fact, show that if every sequence $f_n \to 0$ that converges everywhere also converges in probability, then the measure is countably additive.
Theorem 1.4. (Bounded Convergence Theorem). If the sequence $\{f_n\}$ of measurable functions is uniformly bounded and if $f_n \to f$ in measure as $n \to \infty$, then $\lim_n \int f_n\, dP = \int f\, dP$.

Proof. Since
$$\Big|\int f_n\, dP - \int f\, dP\Big| = \Big|\int (f_n - f)\, dP\Big| \le \int |f_n - f|\, dP,$$
we need only prove that if $f_n \to 0$ in measure and $|f_n| \le M$ then $\int |f_n|\, dP \to 0$. To see this,
$$\int |f_n|\, dP = \int_{|f_n| \le \epsilon} |f_n|\, dP + \int_{|f_n| > \epsilon} |f_n|\, dP \le \epsilon + M\, P[\omega : |f_n(\omega)| > \epsilon],$$
and taking limits,
$$\limsup_{n \to \infty} \int |f_n|\, dP \le \epsilon;$$
since $\epsilon > 0$ is arbitrary, we are done.
The bounded convergence theorem is the essence of countable additivity. Let us look at the example of $f_n(x) = x^n$ on $0 \le x \le 1$ with Lebesgue measure. Clearly $f_n(x) \to 0$ a.e. and therefore in measure. While the convergence is not uniform, $0 \le x^n \le 1$ for all $n$ and $x$, and so the bounded convergence theorem applies. In fact
$$\int_0^1 x^n\, dx = \frac{1}{n+1} \to 0.$$
However, if we replace $x^n$ by $n x^n$, then $f_n(x)$ still goes to 0 a.e., but the sequence is no longer uniformly bounded and the integral does not go to 0.
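A numeric check of the two examples (our own sketch; the grid size is an arbitrary choice):

```python
import numpy as np

# On [0,1]: x^n is uniformly bounded by 1 and its integral 1/(n+1) -> 0, so
# bounded convergence applies; n*x^n also -> 0 a.e. but is not uniformly
# bounded, and its integral n/(n+1) -> 1, not 0.
x = np.linspace(0.0, 1.0, 1_000_001)
for n in [1, 10, 100, 1000]:
    print(n, np.mean(x ** n),        # ~ 1/(n+1)
          np.mean(n * x ** n))       # ~ n/(n+1)
```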
We now proceed to define integrals of nonnegative measurable functions.

Definition 1.9. If $f$ is a nonnegative measurable function, we define
$$\int f\, dP = \sup\Big\{ \int g\, dP : g \text{ bounded measurable},\ 0 \le g \le f \Big\}.$$
An important result is

Theorem 1.5. (Fatou's Lemma). If for each $n \ge 1$, $f_n \ge 0$ is measurable and $f_n \to f$ in measure as $n \to \infty$, then
$$\int f\, dP \le \liminf_{n \to \infty} \int f_n\, dP.$$

Proof. Suppose $g$ is bounded and satisfies $0 \le g \le f$. Then the sequence $h_n = f_n \wedge g$ is uniformly bounded, and in measure
$$h_n \to h = f \wedge g = g.$$
Therefore, by the bounded convergence theorem,
$$\int g\, dP = \lim_{n \to \infty} \int h_n\, dP.$$
Since $\int h_n\, dP \le \int f_n\, dP$ for every $n$, it follows that
$$\int g\, dP \le \liminf_{n \to \infty} \int f_n\, dP.$$
As $g$ satisfying $0 \le g \le f$ is arbitrary, we are done.
Corollary 1.6. (Monotone Convergence Theorem). If for a sequence $\{f_n\}$ of nonnegative functions we have $f_n \uparrow f$ monotonically, then
$$\int f_n\, dP \to \int f\, dP \quad \text{as } n \to \infty.$$

Proof. Obviously $\int f_n\, dP \le \int f\, dP$, and the other half follows from Fatou's lemma.
Now we try to define integrals of arbitrary measurable functions. A nonnegative measurable function is said to be integrable if $\int f\, dP < \infty$. A measurable function $f$ is said to be integrable if $|f|$ is integrable, and we define
$$\int f\, dP = \int f^+\, dP - \int f^-\, dP,$$
where $f^+ = \max(f, 0)$ and $f^- = \max(-f, 0)$ are the positive and negative parts of $f$. The integral has the following properties:

1. It is linear. If $f$ and $g$ are integrable, so is $af + bg$ for any two real constants $a$ and $b$, and
$$\int (af + bg)\, dP = a \int f\, dP + b \int g\, dP.$$

2. $|\int f\, dP| \le \int |f|\, dP$ for every integrable $f$.

3. If $f = 0$ except on a set $N$ of measure 0, then $f$ is integrable and $\int f\, dP = 0$. In particular, if $f = g$ almost everywhere, then $\int f\, dP = \int g\, dP$.
Theorem 1.7. (Jensen's Inequality). If $\phi(x)$ is a convex function of $x$, and $f(\omega)$ and $\phi(f(\omega))$ are integrable, then
$$\int \phi(f(\omega))\, dP \ge \phi\Big( \int f(\omega)\, dP \Big). \tag{1.8}$$

Proof. We have seen the inequality already for $\phi(x) = |x|$. The proof is quite simple. We note that any convex function can be represented as the supremum of a collection of affine linear functions:
$$\phi(x) = \sup_{(a,b) \in E} \{ ax + b \}. \tag{1.9}$$
It is clear that if $(a,b) \in E$, then $a f(\omega) + b \le \phi(f(\omega))$, and on integration this yields $am + b \le E[\phi(f(\omega))]$, where $m = E[f(\omega)]$. Since this is true for every $(a,b) \in E$, in view of the representation (1.9) our theorem follows.
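As a quick illustration of (1.8), with examples we add here: taking $\phi(x) = x^2$ gives
$$\int f^2\, dP \ge \Big( \int f\, dP \Big)^2,$$
i.e. the variance is nonnegative; and taking $\phi(x) = e^x$ with $f$ uniform on the $n$ values $\log a_1, \ldots, \log a_n$ ($a_i > 0$) gives the arithmetic-geometric mean inequality
$$\frac{1}{n} \sum_{i=1}^n a_i \ \ge\ \Big( \prod_{i=1}^n a_i \Big)^{1/n}.$$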
Another important theorem is

Theorem 1.8. (The Dominated Convergence Theorem). If for some sequence $\{f_n\}$ of measurable functions we have $f_n \to f$ in measure and $|f_n(\omega)| \le g(\omega)$ for all $n$, for some integrable function $g$, then $\int f_n\, dP \to \int f\, dP$ as $n \to \infty$.

Proof. $g + f_n$ and $g - f_n$ are nonnegative and converge in measure to $g + f$ and $g - f$ respectively. By Fatou's lemma,
$$\liminf_{n \to \infty} \int (g + f_n)\, dP \ge \int (g + f)\, dP.$$
Since $\int g\, dP$ is finite, we can subtract it from both sides and get
$$\liminf_{n \to \infty} \int f_n\, dP \ge \int f\, dP.$$
Working the same way with $g - f_n$ yields
$$\limsup_{n \to \infty} \int f_n\, dP \le \int f\, dP,$$
and we are done.
Exercise 1.15. Take the unit interval with the Lebesgue measure and define $f_n(x) = n^{\alpha} \mathbf{1}_{[0, \frac{1}{n}]}(x)$. Clearly $f_n(x) \to 0$ for $x \ne 0$. On the other hand, $\int f_n(x)\, dx = n^{\alpha - 1} \to 0$ if and only if $\alpha < 1$. What is $g(x) = \sup_n f_n(x)$, and when is $g$ integrable?
If $h(\omega) = f(\omega) + i g(\omega)$ is a complex valued measurable function with real and imaginary parts $f(\omega)$ and $g(\omega)$ that are integrable, we define
$$\int h(\omega)\, dP = \int f(\omega)\, dP + i \int g(\omega)\, dP.$$

Exercise 1.16. Show that for any complex function $h(\omega) = f(\omega) + i g(\omega)$ with measurable $f$ and $g$, $|h(\omega)|$ is integrable if and only if $|f|$ and $|g|$ are integrable, and that we then have
$$\Big| \int h(\omega)\, dP \Big| \le \int |h(\omega)|\, dP.$$
1.4 Transformations
A measurable space $(\Omega, \mathcal{B})$ is a set $\Omega$ together with a $\sigma$-field $\mathcal{B}$ of subsets of $\Omega$.

Definition 1.10. Given two measurable spaces $(\Omega_1, \mathcal{B}_1)$ and $(\Omega_2, \mathcal{B}_2)$, a mapping or transformation $T : \Omega_1 \to \Omega_2$, i.e. a function $\omega_2 = T(\omega_1)$ that assigns to each point $\omega_1 \in \Omega_1$ a point $\omega_2 = T(\omega_1) \in \Omega_2$, is said to be measurable if for every measurable set $A \in \mathcal{B}_2$ the inverse image
$$T^{-1}(A) = \{\omega_1 : T(\omega_1) \in A\} \in \mathcal{B}_1.$$

Exercise 1.17. Show that, in the above definition, it is enough to verify the property for $A \in \mathcal{A}$, where $\mathcal{A}$ is any class of sets that generates the $\sigma$-field $\mathcal{B}_2$.
If $T$ is a measurable map from $(\Omega_1, \mathcal{B}_1)$ into $(\Omega_2, \mathcal{B}_2)$ and $P$ is a probability measure on $(\Omega_1, \mathcal{B}_1)$, the induced probability measure $Q$ on $(\Omega_2, \mathcal{B}_2)$ is defined by
$$Q(A) = P(T^{-1}(A)) \quad \text{for } A \in \mathcal{B}_2. \tag{1.10}$$

Exercise 1.18. Verify that $Q$ indeed does define a probability measure on $(\Omega_2, \mathcal{B}_2)$.

$Q$ is called the induced measure and is denoted by $Q = P T^{-1}$.
Theorem 1.9. If $f : \Omega_2 \to R$ is a real valued measurable function on $\Omega_2$, then $g(\omega_1) = f(T(\omega_1))$ is a measurable real valued function on $(\Omega_1, \mathcal{B}_1)$. Moreover, $g$ is integrable with respect to $P$ if and only if $f$ is integrable with respect to $Q$, and
$$\int_{\Omega_2} f(\omega_2)\, dQ = \int_{\Omega_1} g(\omega_1)\, dP. \tag{1.11}$$

Proof. If $f(\omega_2) = \mathbf{1}_A(\omega_2)$ is the indicator function of a set $A \in \mathcal{B}_2$, the claim in equation (1.11) is in fact the definition of measurability and the induced measure. We see, by linearity, that the claim extends easily from indicator functions to simple functions. By uniform limits, the claim can now be extended to bounded measurable functions. Monotone limits then extend it to nonnegative functions. By considering the positive and negative parts separately, we are done.
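Equation (1.11) can be tested by simulation. The sketch below is our own illustration, with assumed choices: $\Omega_1 = [0,1)$ under the uniform measure $P$, $T(\omega) = -\log(1 - \omega)$ (so that the induced measure $Q$ is the exponential distribution with density $e^{-x}$), and the test function $f = \cos$.

```python
import numpy as np

# Sketch of Theorem 1.9 / equation (1.11): with P uniform on [0,1) and
# T(w) = -log(1-w), the induced measure Q = P T^{-1} is exponential, and
# E_P[f(T(w))] should equal E_Q[f] = int_0^inf cos(x) e^{-x} dx = 1/2.
rng = np.random.default_rng(0)
w = rng.uniform(0.0, 1.0, size=2_000_000)     # samples from P
lhs = np.mean(np.cos(-np.log(1.0 - w)))       # E_P[f o T], Monte Carlo
rhs = 0.5                                     # exact value of E_Q[f]
print(lhs, rhs)                               # agree up to ~1e-3 sampling error
```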
A measurable transformation is just a generalization of the concept of a random variable introduced in Section 1.3. We can either think of a random variable as a special case of a measurable transformation, where the target space is the real line, or think of a measurable transformation as a random variable with values in an arbitrary target space. The induced measure $Q = P T^{-1}$ is called the distribution of the random variable $T$ under $P$. In particular, if $T$ takes real values, $Q$ is a probability distribution on $R$.

Exercise 1.19. When $T$ is real valued, show that
$$\int T(\omega)\, dP = \int x\, dQ.$$

When $F = (f_1, f_2, \ldots, f_n)$ takes values in $R^n$, the induced distribution $Q$ on $R^n$ is called the joint distribution of the $n$ random variables $f_1, f_2, \ldots, f_n$.
Exercise 1.20. If $T_1$ is a measurable map from $(\Omega_1, \mathcal{B}_1)$ into $(\Omega_2, \mathcal{B}_2)$ and $T_2$ is a measurable map from $(\Omega_2, \mathcal{B}_2)$ into $(\Omega_3, \mathcal{B}_3)$, then show that $T = T_2 \circ T_1$ is a measurable map from $(\Omega_1, \mathcal{B}_1)$ into $(\Omega_3, \mathcal{B}_3)$. If $P$ is a probability measure on $(\Omega_1, \mathcal{B}_1)$, then on $(\Omega_3, \mathcal{B}_3)$ the two measures $P T^{-1}$ and $(P T_1^{-1}) T_2^{-1}$ are identical.
1.5 Product Spaces
Given two sets $\Omega_1$ and $\Omega_2$, the Cartesian product $\Omega = \Omega_1 \times \Omega_2$ is the set of pairs $(\omega_1, \omega_2)$ with $\omega_1 \in \Omega_1$ and $\omega_2 \in \Omega_2$. If $\Omega_1$ and $\Omega_2$ come with $\sigma$-fields $\mathcal{B}_1$ and $\mathcal{B}_2$ respectively, we can define a natural $\sigma$-field $\mathcal{B}$ on $\Omega$ as the $\sigma$-field generated by sets (measurable rectangles) of the form $A_1 \times A_2$ with $A_1 \in \mathcal{B}_1$ and $A_2 \in \mathcal{B}_2$. This $\sigma$-field will be called the product $\sigma$-field.

Exercise 1.21. Show that the sets that are finite disjoint unions of measurable rectangles constitute a field $\mathcal{F}$.

Definition 1.11. The product $\sigma$-field $\mathcal{B}$ is the $\sigma$-field generated by the field $\mathcal{F}$.
Given two probability measures $P_1$ and $P_2$ on $(\Omega_1, \mathcal{B}_1)$ and $(\Omega_2, \mathcal{B}_2)$ respectively, we try to define on the product space $(\Omega, \mathcal{B})$ a probability measure $P$ by defining, for a measurable rectangle $A = A_1 \times A_2$,
$$P(A_1 \times A_2) = P_1(A_1) \cdot P_2(A_2)$$
and extending it to the field $\mathcal{F}$ of finite disjoint unions of measurable rectangles as the obvious sum.

Exercise 1.22. If $E \in \mathcal{F}$ has two representations as disjoint finite unions of measurable rectangles,
$$E = \bigcup_i (A_1^i \times A_2^i) = \bigcup_j (B_1^j \times B_2^j),$$
then
$$\sum_i P_1(A_1^i) \cdot P_2(A_2^i) = \sum_j P_1(B_1^j) \cdot P_2(B_2^j),$$
so that $P(E)$ is well defined. $P$ is a finitely additive probability measure on $\mathcal{F}$.
Lemma 1.10. The measure $P$ is countably additive on the field $\mathcal{F}$.

Proof. For any set $E \in \mathcal{F}$ let us define the section $E_{\omega_2}$ as
$$E_{\omega_2} = \{\omega_1 : (\omega_1, \omega_2) \in E\}. \tag{1.12}$$
Then $P_1(E_{\omega_2})$ is a measurable function of $\omega_2$ (it is in fact a simple function), and
$$P(E) = \int_{\Omega_2} P_1(E_{\omega_2})\, dP_2. \tag{1.13}$$
Now let $E_n \in \mathcal{F}$ decrease to $\emptyset$, the empty set. Then it is easy to verify that $E_{n, \omega_2}$, defined by
$$E_{n, \omega_2} = \{\omega_1 : (\omega_1, \omega_2) \in E_n\},$$
satisfies $E_{n, \omega_2} \downarrow \emptyset$ for each $\omega_2 \in \Omega_2$. From the countable additivity of $P_1$ we conclude that $P_1(E_{n, \omega_2}) \to 0$ for each $\omega_2 \in \Omega_2$, and since $0 \le P_1(E_{n, \omega_2}) \le 1$ for all $n \ge 1$, it follows from equation (1.13) and the bounded convergence theorem that
$$P(E_n) = \int_{\Omega_2} P_1(E_{n, \omega_2})\, dP_2 \to 0,$$
establishing the countable additivity of $P$ on $\mathcal{F}$.
By an application of the Caratheodory extension theorem we conclude that $P$ extends uniquely as a countably additive measure to the $\sigma$-field $\mathcal{B}$ (the product $\sigma$-field) generated by $\mathcal{F}$. We will call this the product measure $P$.
Corollary 1.11. For any $A \in \mathcal{B}$, if we denote by $A_{\omega_1}$ and $A_{\omega_2}$ the respective sections
$$A_{\omega_1} = \{\omega_2 : (\omega_1, \omega_2) \in A\} \quad \text{and} \quad A_{\omega_2} = \{\omega_1 : (\omega_1, \omega_2) \in A\},$$
then the functions $P_1(A_{\omega_2})$ and $P_2(A_{\omega_1})$ are measurable, and
$$P(A) = \int P_1(A_{\omega_2})\, dP_2 = \int P_2(A_{\omega_1})\, dP_1.$$
In particular, for a measurable set $A$, $P(A) = 0$ if and only if for almost all $\omega_1$ with respect to $P_1$ the sections $A_{\omega_1}$ have measure 0, or equivalently for almost all $\omega_2$ with respect to $P_2$ the sections $A_{\omega_2}$ have measure 0.

Proof. The assertion is clearly valid if $A$ is a rectangle of the form $A_1 \times A_2$ with $A_1 \in \mathcal{B}_1$ and $A_2 \in \mathcal{B}_2$. If $A \in \mathcal{F}$, then it is a finite disjoint union of such rectangles, and the assertion is extended to such a set by simple addition. Clearly, by the monotone convergence theorem, the class of sets for which the assertion is valid is a monotone class, and since it contains the field $\mathcal{F}$ it also contains the $\sigma$-field $\mathcal{B}$ generated by the field $\mathcal{F}$.
Warning. It is possible that a set $A$ fails to be measurable with respect to the product $\sigma$-field while nevertheless the sections $A_{\omega_1}$ and $A_{\omega_2}$ are all measurable and $P_2(A_{\omega_1})$ and $P_1(A_{\omega_2})$ are measurable functions, but
$$\int P_1(A_{\omega_2})\, dP_2 \ne \int P_2(A_{\omega_1})\, dP_1.$$
In fact there is a rather nasty example where $P_1(A_{\omega_2})$ is identically 1 whereas $P_2(A_{\omega_1})$ is identically 0.

The next result concerns the equality of the double integral (i.e. the integral with respect to the product measure) and the repeated integrals, in either order.
Theorem 1.12. (Fubini's Theorem). Let $f(\omega) = f(\omega_1, \omega_2)$ be a measurable function of $\omega$ on $(\Omega, \mathcal{B})$. Then $f$ can be considered as a function of $\omega_2$ for each fixed $\omega_1$, or the other way around. The functions $g_{\omega_1}(\cdot)$ and $h_{\omega_2}(\cdot)$, defined respectively on $\Omega_2$ and $\Omega_1$ by
$$g_{\omega_1}(\omega_2) = h_{\omega_2}(\omega_1) = f(\omega_1, \omega_2),$$
are measurable for each $\omega_1$ and $\omega_2$. If $f$ is integrable, then the functions $g_{\omega_1}(\omega_2)$ and $h_{\omega_2}(\omega_1)$ are integrable for almost all $\omega_1$ and $\omega_2$ respectively. Their integrals
$$G(\omega_1) = \int_{\Omega_2} g_{\omega_1}(\omega_2)\, dP_2 \quad \text{and} \quad H(\omega_2) = \int_{\Omega_1} h_{\omega_2}(\omega_1)\, dP_1$$
are measurable, finite almost everywhere, and integrable with respect to $P_1$ and $P_2$ respectively. Finally,
$$\int f(\omega_1, \omega_2)\, dP = \int G(\omega_1)\, dP_1 = \int H(\omega_2)\, dP_2.$$
Conversely, for a nonnegative measurable function $f$, if either $G$ or $H$, which are always measurable, has a finite integral, so does the other, and $f$ is integrable, with its integral being equal to either of the repeated integrals, namely the integrals of $G$ and $H$.

Proof. The proof follows the standard pattern. It is a restatement of the earlier corollary if $f$ is the indicator function of a measurable set $A$. By linearity it is true for simple functions, and by passing to uniform limits it is true for bounded measurable functions $f$. By monotone limits it is true for nonnegative functions, and finally, by taking the positive and negative parts separately, it is true for any arbitrary integrable function $f$.
Warning. The following could happen: $f$ is a measurable function, taking both positive and negative values, that is not integrable, while both repeated integrals exist and are unequal. The example is not hard.

Exercise 1.23. Construct a measurable function $f(x,y)$, not integrable on the product $[0,1] \times [0,1]$ of two copies of the unit interval with Lebesgue measure, such that the repeated integrals make sense and are unequal, i.e.
$$\int_0^1 dx \int_0^1 f(x, y)\, dy \ne \int_0^1 dy \int_0^1 f(x, y)\, dx.$$
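A standard choice, which we supply here (the text leaves the construction to the reader), is
$$f(x,y) = \frac{x^2 - y^2}{(x^2 + y^2)^2}, \qquad (x,y) \ne (0,0).$$
Since $\frac{\partial}{\partial y} \frac{y}{x^2 + y^2} = \frac{x^2 - y^2}{(x^2 + y^2)^2}$, the inner integral is $\int_0^1 f(x,y)\, dy = \frac{1}{1 + x^2}$, so
$$\int_0^1 dx \int_0^1 f(x,y)\, dy = \frac{\pi}{4}, \qquad \int_0^1 dy \int_0^1 f(x,y)\, dx = -\frac{\pi}{4}$$
by the antisymmetry $f(x,y) = -f(y,x)$. Of course $\int \int |f|\, dx\, dy = \infty$ near the origin, so Fubini's theorem does not apply.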
1.6 Distributions and Expectations
Let us recall that a triple $(\Omega, \mathcal{B}, P)$ is a probability space if $\Omega$ is a set, $\mathcal{B}$ is a $\sigma$-field of subsets of $\Omega$, and $P$ is a (countably additive) probability measure on $\mathcal{B}$. A random variable $X$ is a real valued measurable function on $(\Omega, \mathcal{B})$. Such a function $X$ induces a probability distribution $\alpha = P X^{-1}$ on the Borel subsets of the line. The distribution function $F(x)$ corresponding to $\alpha$ is obviously
$$F(x) = \alpha((-\infty, x]) = P[\omega : X(\omega) \le x].$$
The measure $\alpha$ is called the distribution of $X$, and $F(x)$ is called the distribution function of $X$. If $g$ is a measurable function of the real variable $x$, then $Y(\omega) = g(X(\omega))$ is again a random variable, and its distribution $\beta = P Y^{-1}$ can be obtained as $\beta = \alpha g^{-1}$ from $\alpha$. The expectation or mean of a random variable is defined if it is integrable, and
$$E[X] = E^P[X] = \int X(\omega)\, dP.$$
By the change of variables formula (Exercise 1.19) it can be obtained directly from $\alpha$ as
$$E[X] = \int x\, d\alpha.$$
Here we are taking advantage of the fact that on the real line $x$ is a very special real valued function. The value of the integral in this context is referred to as the expectation or mean of $\alpha$. Of course it exists if and only if
$$\int |x|\, d\alpha < \infty,$$
and
$$\Big| \int x\, d\alpha \Big| \le \int |x|\, d\alpha.$$
Similarly,
$$E[g(X)] = \int g(X(\omega))\, dP = \int g(x)\, d\alpha,$$
and anything concerning $X$ can be calculated from $\alpha$. The statement "$X$ is a random variable with distribution $\alpha$" has to be interpreted in the sense that somewhere in the background there is a probability space and a random variable $X$ on it which has $\alpha$ for its distribution. Usually only $\alpha$ matters, and the underlying $(\Omega, \mathcal{B}, P)$ never emerges from the background; in a pinch we can always say that $\Omega$ is the real line, $\mathcal{B}$ are the Borel sets, $P$ is nothing but $\alpha$, and the random variable is $X(x) = x$.

Some other related quantities are
$$\mathrm{Var}(X) = \sigma^2(X) = E[X^2] - [E[X]]^2. \tag{1.14}$$
$\mathrm{Var}(X)$ is called the variance of $X$.

Exercise 1.24. Show that, when it is defined, $\mathrm{Var}(X)$ is always nonnegative, and $\mathrm{Var}(X) = 0$ if and only if for some value $a$, which is necessarily equal to $E[X]$, $P[X = a] = 1$.
Somewhat more generally, we can consider a measurable mapping $X = (X_1, \ldots, X_n)$ of a probability space $(\Omega, \mathcal{B}, P)$ into $R^n$ as a vector of $n$ random variables $X_1(\omega), X_2(\omega), \ldots, X_n(\omega)$. These are called random vectors or vector valued random variables, and the induced distribution $\alpha = P X^{-1}$ on $R^n$ is called the distribution of $X$, or the joint distribution of $(X_1, \ldots, X_n)$. If we denote by $\pi_i$ the coordinate maps $(x_1, \ldots, x_n) \to x_i$ from $R^n \to R$, then $\alpha_i = \alpha \pi_i^{-1} = P X_i^{-1}$ are called the marginals of $\alpha$.

The covariance between two random variables $X$ and $Y$ is defined as
$$\mathrm{Cov}(X, Y) = E[(X - E[X])(Y - E[Y])] = E[XY] - E[X] E[Y]. \tag{1.15}$$

Exercise 1.25. If $X_1, \ldots, X_n$ are $n$ random variables, the matrix
$$C_{i,j} = \mathrm{Cov}(X_i, X_j)$$
is called the covariance matrix. Show that it is a symmetric positive semi-definite matrix. Is every positive semi-definite matrix the covariance matrix of some random vector?
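A small empirical illustration of the symmetry and positive semi-definiteness asserted in Exercise 1.25 (our own sketch; the particular linear combinations of Gaussians are arbitrary choices):

```python
import numpy as np

# The covariance matrix of any random vector is symmetric positive
# semi-definite, since for real c,  c' C c = Var(sum_i c_i X_i) >= 0.
rng = np.random.default_rng(1)
Z = rng.standard_normal((3, 100_000))
X = np.array([Z[0], Z[0] + 0.5 * Z[1], Z[1] - Z[2]])   # correlated components
C = np.cov(X)                                          # sample covariance matrix
print(np.allclose(C, C.T))                             # symmetric
print(np.linalg.eigvalsh(C))                           # eigenvalues all >= 0
```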
Exercise 1.26. The Riemann-Stieltjes integral uses the distribution function directly to define $\int g(x)\, dF(x)$, where $g$ is a bounded continuous function and $F$ is a distribution function. It is defined as the limit, as $N \to \infty$, of sums
$$\sum_{j=0}^{N} g(x_j) \big[ F(a_{j+1}^N) - F(a_j^N) \big],$$
where $-\infty < a_0^N < a_1^N < \cdots < a_N^N < a_{N+1}^N < \infty$ is a partition of the finite interval $[a_0^N, a_{N+1}^N]$, $x_j$ is a point of $[a_j^N, a_{j+1}^N]$, and the limit is taken in such a way that $a_0^N \to -\infty$, $a_{N+1}^N \to +\infty$, and the oscillation of $g$ in any $[a_j^N, a_{j+1}^N]$ goes to 0. Show that if $P$ is the measure corresponding to $F$, then
$$\int g(x)\, dF(x) = \int_R g(x)\, dP.$$
Chapter 2
Weak Convergence
2.1 Characteristic Functions
If $\alpha$ is a probability distribution on the line, its characteristic function is defined by
$$\phi(t) = \int \exp[\, i t x \,]\, d\alpha. \tag{2.1}$$
The above definition makes sense: we write the integrand $e^{itx}$ as $\cos tx + i \sin tx$ and integrate each part to see that
$$|\phi(t)| \le 1$$
for all real $t$.
Exercise 2.1. Calculate the characteristic functions for the following distributions:

1. $\alpha$ is the degenerate distribution $\delta_a$ with probability one at the point $a$.

2. $\alpha$ is the binomial distribution with probabilities
$$p_k = \mathrm{Prob}[X = k] = \binom{n}{k} p^k (1-p)^{n-k} \quad \text{for } 0 \le k \le n.$$
Theorem 2.1. The characteristic function $\phi(t)$ of any probability distribution is a uniformly continuous function of $t$ that is positive definite, i.e. for any $n$ complex numbers $\xi_1, \ldots, \xi_n$ and real numbers $t_1, \ldots, t_n$,
$$\sum_{i,j=1}^{n} \phi(t_i - t_j)\, \xi_i \bar{\xi}_j \ge 0.$$

Proof. Let us note that
$$\sum_{i,j=1}^{n} \phi(t_i - t_j)\, \xi_i \bar{\xi}_j = \sum_{i,j=1}^{n} \xi_i \bar{\xi}_j \int \exp[\, i(t_i - t_j) x \,]\, d\alpha = \int \Big| \sum_{j=1}^{n} \xi_j \exp[\, i t_j x \,] \Big|^2 d\alpha \ge 0.$$
To prove uniform continuity we see that
$$|\phi(t) - \phi(s)| \le \int |\exp[\, i(t-s)x \,] - 1|\, d\alpha,$$
which tends to 0 by the bounded convergence theorem as $|t - s| \to 0$.
The characteristic function of course carries some information about the distribution $\alpha$. In particular, if $\int |x|\, d\alpha < \infty$, then $\phi(\cdot)$ is continuously differentiable and
$$\phi'(0) = i \int x\, d\alpha.$$

Exercise 2.2. Prove it!

Warning: The converse need not be true. $\phi(\cdot)$ can be continuously differentiable but $\int |x|\, d\alpha$ could be $\infty$.

Exercise 2.3. Construct a counterexample along the following lines. Take a discrete distribution, symmetric around 0, with
$$\alpha\{n\} = \alpha\{-n\} = p(n) \propto \frac{1}{n^2 \log n}.$$
Then show that $\sum_n \frac{(1 - \cos nt)}{n^2 \log n}$ is a continuously differentiable function of $t$.
Exercise 2.4. The story with higher moments $m_r = \int x^r\, d\alpha$ is similar. If any of them, say $m_r$, exists, then $\phi(\cdot)$ is $r$ times continuously differentiable and $\phi^{(r)}(0) = i^r m_r$. The converse is false for odd $r$, but true for even $r$ by an application of Fatou's lemma.
The next question is how to recover the distribution function $F(x)$ from $\phi(t)$. If we go back to the Fourier inversion formula, see for instance [2], we can guess, using the fundamental theorem of calculus and Fubini's theorem, that
$$F'(x) = \frac{1}{2\pi} \int_{-\infty}^{\infty} \exp[-itx]\, \phi(t)\, dt,$$
and therefore
$$F(b) - F(a) = \frac{1}{2\pi} \int_a^b dx \int_{-\infty}^{\infty} \exp[-itx]\, \phi(t)\, dt = \frac{1}{2\pi} \int_{-\infty}^{\infty} \phi(t)\, dt \int_a^b \exp[-itx]\, dx$$
$$= \frac{1}{2\pi} \int_{-\infty}^{\infty} \phi(t)\, \frac{\exp[-itb] - \exp[-ita]}{-it}\, dt = \lim_{T \to \infty} \frac{1}{2\pi} \int_{-T}^{T} \phi(t)\, \frac{\exp[-itb] - \exp[-ita]}{-it}\, dt.$$
We will in fact prove the final relation, which is a principal value integral, provided $a$ and $b$ are points of continuity of $F$. We compute the right hand side as
$$\lim_{T \to \infty} \frac{1}{2\pi} \int_{-T}^{T} \frac{\exp[-itb] - \exp[-ita]}{-it}\, dt \int \exp[\, itx \,]\, d\alpha$$
$$= \lim_{T \to \infty} \frac{1}{2\pi} \int d\alpha \int_{-T}^{T} \frac{\exp[it(x-b)] - \exp[it(x-a)]}{-it}\, dt$$
$$= \lim_{T \to \infty} \frac{1}{2\pi} \int d\alpha \int_{-T}^{T} \frac{\sin t(x-a) - \sin t(x-b)}{t}\, dt$$
$$= \frac{1}{2} \int \big[ \mathrm{sign}(x-a) - \mathrm{sign}(x-b) \big]\, d\alpha = F(b) - F(a),$$
provided $a$ and $b$ are continuity points. We have applied Fubini's theorem and the bounded convergence theorem to take the limit as $T \to \infty$. Note that the Dirichlet integral
$$u(T, z) = \int_0^T \frac{\sin tz}{t}\, dt$$
satisfies $\sup_{T, z} |u(T, z)| \le C$ and
$$\lim_{T \to \infty} u(T, z) = \begin{cases} \ \ \frac{\pi}{2} & \text{if } z > 0 \\ -\frac{\pi}{2} & \text{if } z < 0 \\ \ \ 0 & \text{if } z = 0. \end{cases}$$
As a consequence we conclude that the distribution function, and hence $\alpha$, is determined uniquely by the characteristic function.

Exercise 2.5. Prove that if two distribution functions agree on the set of points at which they are both continuous, they agree everywhere.
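The inversion formula can be checked numerically. The sketch below is our own illustration for the standard normal (whose characteristic function, listed below, is $e^{-t^2/2}$); the truncation $T$ and the grid are arbitrary choices.

```python
import numpy as np
from math import erf, sqrt, pi

# Numeric check of the inversion formula: the truncated principal value integral
#   (1/2 pi) int_{-T}^{T} phi(t) (e^{-itb} - e^{-ita}) / (-it) dt
# approaches F(b) - F(a) at continuity points a < b.
a, b, T = -1.0, 1.0, 40.0
t = np.linspace(-T, T, 400001)
t[t == 0.0] = 1e-12                     # the integrand extends to b - a at t = 0
phi = np.exp(-t ** 2 / 2.0)
integrand = phi * (np.exp(-1j * t * b) - np.exp(-1j * t * a)) / (-1j * t)
approx = np.real(np.sum(integrand)) * (t[1] - t[0]) / (2.0 * pi)
exact = 0.5 * (erf(b / sqrt(2.0)) - erf(a / sqrt(2.0)))
print(approx, exact)                    # both ~ 0.68269
```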
Besides those in Exercise 2.1, some additional examples of probability distributions and the corresponding characteristic functions are given below.

1. The Poisson distribution of rare events, with rate $\lambda$, has probabilities $P[X = r] = e^{-\lambda} \frac{\lambda^r}{r!}$ for $r \ge 0$. Its characteristic function is
$$\phi(t) = \exp[\lambda(e^{it} - 1)].$$

2. The geometric distribution, the distribution of the number of unsuccessful attempts preceding a success, has $P[X = r] = p q^r$ for $r \ge 0$. Its characteristic function is
$$\phi(t) = p\, (1 - q e^{it})^{-1}.$$

3. The negative binomial distribution, the probability distribution of the number of accumulated failures before $k$ successes, with $P[X = r] = \binom{k+r-1}{r} p^k q^r$, has the characteristic function
$$\phi(t) = p^k (1 - q e^{it})^{-k}.$$
We now turn to some common continuous distributions, in fact given by densities $f(x)$, i.e. the distribution functions are given by $F(x) = \int_{-\infty}^{x} f(y)\, dy$.

1. The uniform distribution with density $f(x) = \frac{1}{b-a}$, $a \le x \le b$, has characteristic function
$$\phi(t) = \frac{e^{itb} - e^{ita}}{it(b-a)}.$$
In particular, for the case of a symmetric interval $[-a, a]$,
$$\phi(t) = \frac{\sin at}{at}.$$
2. The gamma distribution with density $f(x) = \frac{c^p}{\Gamma(p)} e^{-cx} x^{p-1}$, $x \ge 0$, where $c > 0$ is any constant, has the characteristic function
$$\phi(t) = \Big(1 - \frac{it}{c}\Big)^{-p}.$$
A special case of the gamma distribution is the exponential distribution, which corresponds to $c = p = 1$, with density $f(x) = e^{-x}$ for $x \ge 0$. Its characteristic function is given by $\phi(t) = [1 - it]^{-1}$.

3. The two-sided exponential with density $f(x) = \frac{1}{2} e^{-|x|}$ has characteristic function
$$\phi(t) = \frac{1}{1 + t^2}.$$

4. The Cauchy distribution with density $f(x) = \frac{1}{\pi} \frac{1}{1 + x^2}$ has the characteristic function
$$\phi(t) = e^{-|t|}.$$

5. The normal or Gaussian distribution with mean $\mu$ and variance $\sigma^2$, which has density $\frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$, has the characteristic function
$$\phi(t) = e^{i \mu t - \frac{\sigma^2 t^2}{2}}.$$

In general, if $X$ is a random variable which has distribution $\alpha$ and characteristic function $\phi(t)$, the distribution $\beta$ of $aX + b$ can be written as $\beta(A) = \alpha[x : ax + b \in A]$, and its characteristic function $\psi(t)$ can be expressed as $\psi(t) = e^{itb} \phi(at)$. In particular the characteristic function of $-X$ is $\phi(-t) = \overline{\phi(t)}$. Therefore the distribution of $X$ is symmetric around $x = 0$ if and only if $\phi(t)$ is real for all $t$.
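A Monte Carlo sanity check of these closed forms (our own sketch; the sample size and seed are arbitrary): the empirical characteristic function of a large sample should approach the formula, here tested for the standard normal.

```python
import numpy as np

# Empirical characteristic function (1/n) sum_k exp(i t X_k) of a large normal
# sample, compared against the closed form phi(t) = exp(-t^2/2).
rng = np.random.default_rng(7)
X = rng.standard_normal(1_000_000)
for t in [0.0, 0.5, 1.0, 2.0]:
    emp = np.mean(np.exp(1j * t * X))
    print(t, emp, np.exp(-t ** 2 / 2.0))   # real parts agree to ~1e-3
```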
2.2 Moment Generating Functions
If $\alpha$ is a probability distribution on $R$, for any integer $k \ge 1$ the $k$-th moment $m_k$ of $\alpha$ is defined as
$$m_k = \int x^k\, d\alpha. \tag{2.2}$$
Or, equivalently, the $k$-th moment of a random variable $X$ is
$$m_k = E[X^k]. \tag{2.3}$$
By convention one takes $m_0 = 1$, even if $P[X = 0] > 0$. We should note that if $k$ is odd, in order for $m_k$ to be defined we must have $E[|X|^k] = \int |x|^k\, d\alpha < \infty$. Given a distribution $\alpha$, either all the moments exist, or they exist only for $0 \le k \le k_0$ for some $k_0$. It could happen that $k_0 = 0$, as is the case with the Cauchy distribution.

If we know all the moments of a distribution $\alpha$, we know the expectations $\int p(x)\, d\alpha$ for every polynomial $p(\cdot)$. Since polynomials $p(\cdot)$ can be used to approximate (by the Stone-Weierstrass theorem) any continuous function, one might hope that, from the moments, one can recover the distribution $\alpha$. This is not as straightforward as one would hope. If we take a bounded continuous function like $\sin x$, we can find a sequence of polynomials $p_n(x)$ that converges to $\sin x$. But to conclude that
$$\int \sin x\, d\alpha = \lim_{n \to \infty} \int p_n(x)\, d\alpha$$
we need to control the contribution to the integral from large values of $x$, which is the role of the dominated convergence theorem. If we define $p^*(x) = \sup_n |p_n(x)|$, it would be a big help if $\int p^*(x)\, d\alpha$ were finite. But the degrees of the polynomials $p_n$ have to increase indefinitely with $n$, because $\sin x$ is a transcendental function. Therefore $p^*(\cdot)$ must grow faster than any polynomial at $\infty$, and the condition $\int p^*(x)\, d\alpha < \infty$ may not hold.

In general, it is not true that moments determine the distribution. If we look at it through characteristic functions, it is the problem of trying to recover the function $\phi(t)$ from a knowledge of all of its derivatives at $t = 0$. The Taylor series at $t = 0$ may not yield the function. Of course we have more information in our hands, like positive definiteness etc. But still it is likely that moments do not in general determine $\alpha$. In fact here is how to construct an example.
We need nonnegative numbers $\{a_n\}, \{b_n\} : n \ge 0$ such that
$$\sum_n a_n e^{kn} = \sum_n b_n e^{kn} = m_k$$
for every $k \ge 0$. We can then replace them by $\{\frac{a_n}{m_0}\}, \{\frac{b_n}{m_0}\} : n \ge 0$, so that $\sum_k a_k = \sum_k b_k = 1$, and the two probability distributions
$$P[X = e^n] = a_n, \qquad P[X = e^n] = b_n$$
will have all their moments equal. Once we can find $\{c_n\}$ such that
$$\sum_n c_n e^{nz} = 0 \quad \text{for } z = 0, 1, 2, \ldots,$$
we can take $a_n = \max(c_n, 0)$ and $b_n = \max(-c_n, 0)$ and we will have our example. The goal then is to construct $\{c_n\}$ such that $\sum_n c_n z^n = 0$ for $z = 1, e, e^2, \ldots$. Borrowing from ideas in the theory of functions of a complex variable (see the Weierstrass factorization theorem, [1]), we define
$$C(z) = \prod_{n=1}^{\infty} \Big(1 - \frac{z}{e^n}\Big)$$
and expand $C(z) = \sum c_n z^n$. Since $C(z)$ is an entire function, the coefficients $c_n$ satisfy
$$\sum_n |c_n| e^{kn} < \infty \quad \text{for every } k.$$
There is in fact a positive result as well. If $\alpha$ is such that the moments $m_k = \int x^k\, d\alpha$ do not grow too fast, then $\alpha$ is determined by $\{m_k\}$.

Theorem 2.2. Let $m_k$ be such that $\sum_k \frac{m_{2k}\, a^{2k}}{(2k)!} < \infty$ for some $a > 0$. Then there is at most one distribution $\alpha$ such that $\int x^k\, d\alpha = m_k$.

Proof. We want to determine the characteristic function $\phi(t)$ of $\alpha$. First we note that if $\alpha$ has moments $m_k$ satisfying our assumption, then
$$\int \cosh(ax)\, d\alpha = \sum_k \frac{a^{2k}}{(2k)!}\, m_{2k} < \infty$$
by the monotone convergence theorem. In particular
$$\Phi(u + it) = \int e^{(u+it)x}\, d\alpha$$
is well defined as an analytic function of $z = u + it$ in the strip $|u| < a$. From the theory of functions of a complex variable we know that the function $\Phi(\cdot)$ is uniquely determined in the strip by its derivatives at 0, i.e. by $\{m_k\}$. In particular $\phi(t) = \Phi(0 + it)$ is determined as well.
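For instance (an example we add here), the standard normal has $m_{2k} = \frac{(2k)!}{2^k k!}$, so
$$\sum_k \frac{m_{2k}\, a^{2k}}{(2k)!} = \sum_k \frac{a^{2k}}{2^k k!} = e^{a^2/2} < \infty$$
for every $a > 0$; by Theorem 2.2 the normal distribution is therefore uniquely determined by its moments.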
2.3 Weak Convergence
One of the basic ideas in establishing limit theorems is the notion of weak convergence of a sequence of probability distributions on the line $R$. Since the role of a probability measure is to assign probabilities to sets, we should expect that if two probability measures are to be close, then they should assign, for a given set, probabilities that are nearly equal. This suggests the definition
$$d(P_1, P_2) = \sup_{A \in \mathcal{B}} |P_1(A) - P_2(A)|$$
as the distance between two probability measures $P_1$ and $P_2$ on a measurable space $(\Omega, \mathcal{B})$. This is too strong. If we take $P_1$ and $P_2$ to be degenerate distributions with probability 1 concentrated at two points $x_1$ and $x_2$ on the line, one can see that, as soon as $x_1 \ne x_2$, $d(P_1, P_2) = 1$, and the above metric is not sensitive to how close the two points $x_1$ and $x_2$ are. It only cares that they are unequal. The problem is not because of the supremum. We can take $A$ to be an interval $[a, b]$ that includes $x_1$ but omits $x_2$, and $|P_1(A) - P_2(A)| = 1$. On the other hand, if the end points of the interval are kept away from $x_1$ or $x_2$, the situation is not that bad. This leads to the following definition.
Definition 2.1. A sequence $\alpha_n$ of probability distributions on $R$ is said to converge weakly to a probability distribution $\alpha$ if
$$\lim_{n \to \infty} \alpha_n[I] = \alpha[I]$$
for any interval $I = [a, b]$ such that the single point sets $\{a\}$ and $\{b\}$ have probability 0 under $\alpha$.

One can state this equivalently in terms of the distribution functions $F_n(x)$ and $F(x)$ corresponding to the measures $\alpha_n$ and $\alpha$ respectively.

Definition 2.2. A sequence $\alpha_n$ of probability measures on the real line $R$ with distribution functions $F_n(x)$ is said to converge weakly to a limiting probability measure $\alpha$ with distribution function $F(x)$ (in symbols $\alpha_n \Rightarrow \alpha$ or $F_n \Rightarrow F$) if
$$\lim_{n \to \infty} F_n(x) = F(x)$$
for every $x$ that is a continuity point of $F$.
Exercise 2.6. Prove the equivalence of the two definitions.

Remark 2.1. One says that a sequence $X_n$ of random variables converges in law or in distribution to $X$ if the distributions $\alpha_n$ of $X_n$ converge weakly to the distribution $\alpha$ of $X$.
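A tiny sketch (our own illustration) of why Definition 2.2 excludes discontinuity points: the point masses $\delta_{1/n}$ converge weakly to $\delta_0$, and indeed $F_n(x) \to F(x)$ everywhere except at the single discontinuity $x = 0$.

```python
# F_n is the distribution function of the point mass at 1/n; the weak limit
# is the point mass at 0, with F(x) = 1_{x >= 0}.  F_n(0) = 0 for all n while
# F(0) = 1, so convergence fails exactly at the discontinuity point of F.
def F_n(x, n):
    return 1.0 if x >= 1.0 / n else 0.0

for x in [-0.1, 0.0, 0.05, 0.2]:
    print(x, [F_n(x, n) for n in [1, 10, 100, 1000]])
```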
There are equivalent formulations in terms of expectations and characteristic functions.

Theorem 2.3. (Lévy-Cramér continuity theorem). The following are equivalent:

(a) $\alpha_n \Rightarrow \alpha$, or $F_n \Rightarrow F$.

(b) For every bounded continuous function $f(x)$ on $R$,
$$\lim_{n \to \infty} \int_R f(x)\, d\alpha_n = \int_R f(x)\, d\alpha.$$

(c) If $\phi_n(t)$ and $\phi(t)$ are respectively the characteristic functions of $\alpha_n$ and $\alpha$, then for every real $t$,
$$\lim_{n \to \infty} \phi_n(t) = \phi(t).$$
Proof. We first prove (a $\Rightarrow$ b). Let $\epsilon > 0$ be arbitrary. Find continuity points $a$ and $b$ of $F$ such that $a < b$, $F(a) \le \epsilon$ and $1 - F(b) \le \epsilon$. Since $F_n(a)$ and $F_n(b)$ converge to $F(a)$ and $F(b)$, for $n$ large enough $F_n(a) \le 2\epsilon$ and $1 - F_n(b) \le 2\epsilon$. Divide the interval $[a, b]$ into a finite number $N = N_\delta$ of small subintervals $I_j = (a_j, a_{j+1}]$, $1 \le j \le N$, with $a = a_1 < a_2 < \cdots < a_{N+1} = b$, such that all the end points $\{a_j\}$ are points of continuity of $F$ and the oscillation of the continuous function $f$ in each $I_j$ is less than a preassigned number $\delta$. Since any continuous function $f$ is uniformly continuous on the closed bounded (compact) interval $[a, b]$, this is always possible for any given $\delta > 0$. Let $h(x) = \sum_{j=1}^{N} \mathbf{1}_{I_j}(x) f(a_j)$ be the simple function equal to $f(a_j)$ on $I_j$ and 0 outside $\cup_j I_j = (a, b]$. We have $|f(x) - h(x)| \le \delta$ on $(a, b]$. If $f(x)$ is bounded by $M$, then
$$\Big| \int f(x)\, d\alpha_n - \sum_{j=1}^{N} f(a_j)\big[F_n(a_{j+1}) - F_n(a_j)\big] \Big| \le \delta + 4M\epsilon \tag{2.4}$$
and
$$\Big| \int f(x)\, d\alpha - \sum_{j=1}^{N} f(a_j)\big[F(a_{j+1}) - F(a_j)\big] \Big| \le \delta + 2M\epsilon. \tag{2.5}$$
Since $\lim_{n} F_n(a_j) = F(a_j)$ for every $1 \le j \le N+1$, we conclude from equations (2.4), (2.5) and the triangle inequality that
$$\limsup_{n \to \infty} \Big| \int f(x)\, d\alpha_n - \int f(x)\, d\alpha \Big| \le 2\delta + 6M\epsilon.$$
Since $\delta$ and $\epsilon$ are arbitrary small numbers, we are done.

Because we can make the choice of $f(x) = \exp[\, itx \,] = \cos tx + i \sin tx$, which for every $t$ is a bounded and continuous function, (b $\Rightarrow$ c) is trivial.

(c $\Rightarrow$ a) is the hardest. It is carried out in several steps. Actually we will prove a stronger version as a separate theorem.
Theorem 2.4. For each $n \ge 1$, let $\phi_n(t)$ be the characteristic function of a probability distribution $\alpha_n$. Assume that $\lim_{n \to \infty} \phi_n(t) = \phi(t)$ exists for each $t$, and that $\phi(t)$ is continuous at $t = 0$. Then $\phi(t)$ is the characteristic function of some probability distribution $\alpha$, and $\alpha_n \Rightarrow \alpha$.
Proof.

Step 1. Let $r_1, r_2, \ldots$ be an enumeration of the rational numbers. For each $j$ consider the sequence $\{F_n(r_j) : n \ge 1\}$, where $F_n$ is the distribution function corresponding to $\phi_n(\cdot)$. It is a sequence bounded by 1, and we can extract a subsequence that converges. By the diagonalization process we can choose a subsequence $G_k = F_{n_k}$ such that
$$\lim_{k \to \infty} G_k(r) = b_r$$
exists for every rational number $r$. From the monotonicity of $F_n$ in $x$ we conclude that if $r_1 < r_2$, then $b_{r_1} \le b_{r_2}$.
Step 2. From the skeleton $\{b_r\}$ we reconstruct a right continuous monotone function $G(x)$. We define
$$G(x) = \inf_{r > x} b_r.$$
Clearly if $x_1 < x_2$, then $G(x_1) \le G(x_2)$, and therefore $G$ is nondecreasing. If $x_n \downarrow x$, any $r > x$ satisfies $r > x_n$ for sufficiently large $n$. This allows us to conclude that $G(x) = \inf_n G(x_n)$ for any sequence $x_n \downarrow x$, proving that $G(x)$ is right continuous.
Step 3. Next we show that at any continuity point $x$ of $G$,
$$\lim_{n \to \infty} G_n(x) = G(x).$$
Let $r > x$ be a rational number. Then $G_n(x) \le G_n(r)$ and $G_n(r) \to b_r$ as $n \to \infty$. Hence
$$\limsup_{n \to \infty} G_n(x) \le b_r.$$
This is true for every rational $r > x$, and therefore, taking the infimum over $r > x$,
$$\limsup_{n \to \infty} G_n(x) \le G(x).$$
Suppose now that we have $y < x$. Find a rational $r$ such that $y < r < x$. Then
$$\liminf_{n \to \infty} G_n(x) \ge \liminf_{n \to \infty} G_n(r) = b_r \ge G(y).$$
As this is true for every $y < x$,
$$\liminf_{n \to \infty} G_n(x) \ge \sup_{y < x} G(y) = G(x - 0) = G(x),$$
the last step being a consequence of the assumption that $x$ is a point of continuity of $G$, i.e. $G(x - 0) = G(x)$.
Warning. This does not mean that $G$ is necessarily a distribution function. Consider $F_n(x) = 0$ for $x < n$ and $1$ for $x \ge n$, which corresponds to the distribution with the entire probability concentrated at $n$. In this case $\lim_{n \to \infty} F_n(x) = G(x)$ exists and $G(x) \equiv 0$, which is not a distribution function.
Step 4. We will use the continuity of $\phi(t)$ at $t = 0$ to show that $G$ is indeed a distribution function. If $\phi(t)$ is the characteristic function of $\alpha$, then
$$\frac{1}{2T} \int_{-T}^{T} \phi(t)\, dt = \int \Big[ \frac{1}{2T} \int_{-T}^{T} \exp[\, itx \,]\, dt \Big]\, d\alpha = \int \frac{\sin Tx}{Tx}\, d\alpha \le \int \Big| \frac{\sin Tx}{Tx} \Big|\, d\alpha$$
$$= \int_{|x| < \ell} \Big| \frac{\sin Tx}{Tx} \Big|\, d\alpha + \int_{|x| \ge \ell} \Big| \frac{\sin Tx}{Tx} \Big|\, d\alpha \le \alpha[|x| < \ell] + \frac{1}{T\ell}\, \alpha[|x| \ge \ell].$$
We have used Fubini's theorem in the first line and the bounds $|\sin x| \le |x|$ and $|\sin x| \le 1$ in the last line. We can rewrite this as
$$1 - \frac{1}{2T} \int_{-T}^{T} \phi(t)\, dt \ge 1 - \alpha[|x| < \ell] - \frac{1}{T\ell}\, \alpha[|x| \ge \ell] = \Big(1 - \frac{1}{T\ell}\Big)\, \alpha[|x| \ge \ell] \ge \Big(1 - \frac{1}{T\ell}\Big) \big[1 - F(\ell) + F(-\ell)\big].$$
Finally, if we pick $\ell = \frac{2}{T}$,
$$\big[1 - F(\tfrac{2}{T}) + F(-\tfrac{2}{T})\big] \le 2 \Big[ 1 - \frac{1}{2T} \int_{-T}^{T} \phi(t)\, dt \Big].$$
Since this inequality is valid for any distribution function $F$ and its characteristic function $\phi$, we conclude that, for every $k \ge 1$,
$$\big[1 - F_{n_k}(\tfrac{2}{T}) + F_{n_k}(-\tfrac{2}{T})\big] \le 2 \Big[ 1 - \frac{1}{2T} \int_{-T}^{T} \phi_{n_k}(t)\, dt \Big]. \tag{2.6}$$
We can pick $T$ such that $\pm\frac{2}{T}$ are continuity points of $G$. If we now pass to the limit and use the bounded convergence theorem on the right hand side of equation (2.6), we obtain
$$\big[1 - G(\tfrac{2}{T}) + G(-\tfrac{2}{T})\big] \le 2 \Big[ 1 - \frac{1}{2T} \int_{-T}^{T} \phi(t)\, dt \Big].$$
Since $\phi(0) = 1$ and $\phi$ is continuous at $t = 0$, by letting $T \to 0$ in such a way that $\pm\frac{2}{T}$ are continuity points of $G$, we conclude that
$$1 - G(+\infty) + G(-\infty) = 0,$$
so $G$ is indeed a distribution function.
Step 5. We now complete the rest of the proof, i.e. show that $\alpha_n \Rightarrow \alpha$. We have $G_k = F_{n_k} \Rightarrow G$ as well as $\phi_k = \phi_{n_k} \to \phi$. Therefore $G$ must equal $F$, which has $\phi$ for its characteristic function. Since the argument works for any subsequence of $F_n$, every subsequence of $F_n$ has a further subsequence that converges weakly to the same limit $F$, uniquely determined as the distribution function whose characteristic function is $\phi(\cdot)$. Consequently $F_n \Rightarrow F$, or $\alpha_n \Rightarrow \alpha$.
Exercise 2.7. How do you actually prove that if every subsequence of a sequence $\{F_n\}$ has a further subsequence that converges to a common $F$, then $F_n \Rightarrow F$?
Definition 2.3. A subset $\mathcal{A}$ of probability distributions on $R$ is said to be totally bounded if, given any sequence $\alpha_n$ from $\mathcal{A}$, there is a subsequence that converges weakly to some limiting probability distribution $\alpha$.

Theorem 2.5. In order that a family $\mathcal{A}$ of probability distributions be totally bounded, it is necessary and sufficient that either of the following equivalent conditions hold:
$$\lim_{\ell \to \infty} \ \sup_{\alpha \in \mathcal{A}} \ \alpha[x : |x| \ge \ell] = 0, \tag{2.7}$$
$$\lim_{h \to 0} \ \sup_{\alpha \in \mathcal{A}} \ \sup_{|t| \le h} |1 - \phi_\alpha(t)| = 0. \tag{2.8}$$
Here $\phi_\alpha(t)$ is the characteristic function of $\alpha$.

The condition in equation (2.7) is often called the uniform tightness property.
Proof. The proof is already contained in the details of the proof of the earlier theorem. We can always choose a subsequence such that the distribution functions converge at rationals and try to reconstruct the limiting distribution function from the limits at rationals. The crucial step is to prove that the limit is a distribution function. Either of the two conditions (2.7) or (2.8) will guarantee this. If condition (2.7) is violated it is straightforward to pick a sequence from $\mathcal{A}$ for which the distribution functions have a limit which is not a distribution function. Then $\mathcal{A}$ cannot be totally bounded. Condition (2.7) is therefore necessary. That (2.7) implies (2.8) is a consequence of the estimate
\[
|1 - \phi(t)| \le \int |\exp[\,i\,t\,x\,] - 1|\, d\mu
= \int_{|x|\le\ell} |\exp[\,i\,t\,x\,] - 1|\, d\mu + \int_{|x|>\ell} |\exp[\,i\,t\,x\,] - 1|\, d\mu
\le |t|\,\ell + 2\,\mu[\,x : |x| > \ell\,].
\]
It is a well known principle in Fourier analysis that the regularity of $\phi(t)$ at t = 0 is related to the decay rate of the tail probabilities.
Exercise 2.8. Compute $\int |x|^p\, d\mu$ in terms of the characteristic function $\phi(t)$ for p in the range $0 < p < 2$.
Hint: Look at the formula
\[
\int_{-\infty}^{\infty} \frac{1 - \cos tx}{|t|^{p+1}}\, dt = C_p\, |x|^p
\]
and use Fubini's theorem.
We have the following result on the behavior of $\mu_n(A)$ for certain sets whenever $\mu_n \Rightarrow \mu$.

Theorem 2.6. Let $\mu_n \Rightarrow \mu$ on R. If $C \subset R$ is a closed set then
\[
\limsup_{n\to\infty} \mu_n(C) \le \mu(C),
\]
while for open sets $G \subset R$
\[
\liminf_{n\to\infty} \mu_n(G) \ge \mu(G).
\]
If $A \subset R$ is a continuity set of $\mu$, i.e. $\mu(\partial A) = \mu(\bar{A} \setminus A^o) = 0$, then
\[
\lim_{n\to\infty} \mu_n(A) = \mu(A).
\]
Proof. The function $d(x, C) = \inf_{y\in C} |x - y|$ is continuous and equals 0 precisely on C. Then
\[
f(x) = \frac{1}{1 + d(x, C)}
\]
is a continuous function bounded by 1, that is equal to 1 precisely on C, and
\[
f_k(x) = [f(x)]^k \downarrow \mathbf{1}_C(x)
\]
as $k \to \infty$. For every $k \ge 1$, we have
\[
\lim_{n\to\infty} \int f_k(x)\, d\mu_n = \int f_k(x)\, d\mu
\]
and therefore
\[
\limsup_{n\to\infty} \mu_n(C) \le \lim_{n\to\infty} \int f_k(x)\, d\mu_n = \int f_k(x)\, d\mu.
\]
Letting $k \to \infty$ we get
\[
\limsup_{n\to\infty} \mu_n(C) \le \mu(C).
\]
Taking complements we conclude that for any open set $G \subset R$
\[
\liminf_{n\to\infty} \mu_n(G) \ge \mu(G).
\]
Combining the two parts, if $A \subset R$ is a continuity set of $\mu$, i.e. $\mu(\partial A) = \mu(\bar{A} \setminus A^o) = 0$, then
\[
\lim_{n\to\infty} \mu_n(A) = \mu(A).
\]
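A toy check makes plain that the two inequalities can be strict when A is not a continuity set. The following sketch (an added illustration, not part of the original argument) uses $\mu_n = \delta_{1/n}$, which converges weakly to $\mu = \delta_0$:

```python
# mu_n = point mass at 1/n converges weakly to mu = point mass at 0.
# Closed set C = {0}: mu_n(C) = 0 for every n, while mu(C) = 1, so
# limsup mu_n(C) <= mu(C) holds with strict inequality.
# Open set G = (-1,1) \ {0}: mu_n(G) = 1 while mu(G) = 0, so
# liminf mu_n(G) >= mu(G) is also strict. Neither set is a continuity
# set of mu, which is why the limits fail to match.
points = [1.0 / n for n in range(1, 6)]
in_C = [x == 0.0 for x in points]            # all False: mu_n(C) = 0
in_G = [0.0 < abs(x) < 1.0 for x in points]  # all True:  mu_n(G) = 1
print(in_C, in_G)
```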
We are now ready to prove the converse of Theorem 2.1, which is the hard part of a theorem of Bochner that characterizes the characteristic functions of probability distributions as continuous positive definite functions on R normalized to be 1 at 0.

Theorem 2.7. (Bochner's Theorem). If $\phi(t)$ is a positive definite function which is continuous at t = 0 and is normalized so that $\phi(0) = 1$, then $\phi$ is the characteristic function of some probability distribution on R.

Proof. The proof depends on constructing approximations $\phi_n(t)$ which are in fact characteristic functions and satisfy $\phi_n(t) \to \phi(t)$ as $n \to \infty$. Then we can apply the preceding theorem and the probability measures corresponding to $\phi_n$ will have a weak limit which will have $\phi$ for its characteristic function.
Step 1. Let us establish a few elementary properties of positive definite functions.

1) If $\phi(t)$ is a positive definite function so is $\phi(t)\exp[\,i\,t\,a\,]$ for any real a. The proof is elementary and requires just direct verification.

2) If $\phi_j(t)$ are positive definite for each j then so is any linear combination $\phi(t) = \sum_j w_j \phi_j(t)$ with nonnegative weights $w_j$. If each $\phi_j(t)$ is normalized with $\phi_j(0) = 1$ and $\sum_j w_j = 1$, then of course $\phi(0) = 1$ as well.

3) If $\phi$ is positive definite then $\phi$ satisfies $\phi(0) \ge 0$, $\phi(-t) = \overline{\phi(t)}$ and $|\phi(t)| \le \phi(0)$ for all t.

We use the fact that the matrix $\{\phi(t_i - t_j) : 1 \le i, j \le n\}$ is Hermitian positive definite for any n real numbers $t_1, \dots, t_n$. The first assertion follows from the positivity of $\phi(0)|z|^2$, the second is a consequence of the Hermitian property, and if we take n = 2 with $t_1 = t$ and $t_2 = 0$, as a consequence of the positive definiteness of the $2\times 2$ matrix we get $|\phi(t)|^2 \le |\phi(0)|^2$.

4) For any s, t we have $|\phi(t) - \phi(s)|^2 \le 4\,\phi(0)\,|\phi(0) - \phi(t - s)|$.

We use the positive definiteness of the $3\times 3$ matrix
\[
\begin{pmatrix}
1 & \phi(t-s) & \phi(t)\\
\overline{\phi(t-s)} & 1 & \phi(s)\\
\overline{\phi(t)} & \overline{\phi(s)} & 1
\end{pmatrix}
\]
which is $\{\phi(t_i - t_j)\}$ with $t_1 = t$, $t_2 = s$ and $t_3 = 0$. In particular the determinant has to be nonnegative:
\[
0 \le 1 + \overline{\phi(s)}\,\overline{\phi(t-s)}\,\phi(t) + \phi(s)\,\phi(t-s)\,\overline{\phi(t)}
- |\phi(s)|^2 - |\phi(t)|^2 - |\phi(t-s)|^2
\]
\[
= 1 - |\phi(s) - \phi(t)|^2 - |\phi(t-s)|^2
- \phi(t)\,\overline{\phi(s)}\,\big(1 - \overline{\phi(t-s)}\big)
- \overline{\phi(t)}\,\phi(s)\,\big(1 - \phi(t-s)\big)
\]
\[
\le 1 - |\phi(s) - \phi(t)|^2 - |\phi(t-s)|^2 + 2\,|1 - \phi(t-s)|.
\]
Or
\[
|\phi(s) - \phi(t)|^2 \le 1 - |\phi(s-t)|^2 + 2\,|1 - \phi(t-s)| \le 4\,|1 - \phi(s-t)|.
\]

5) It now follows from 4) that if a positive definite function is continuous at t = 0, it is continuous everywhere (in fact uniformly continuous).
Step 2. First we show that if $\phi(t)$ is a positive definite function which is continuous on R and is absolutely integrable, then
\[
f(x) = \frac{1}{2\pi} \int_{-\infty}^{\infty} \exp[-\,i\,t\,x\,]\,\phi(t)\, dt \ge 0
\]
is a continuous function and
\[
\int_{-\infty}^{\infty} f(x)\, dx = 1.
\]
Moreover the function
\[
F(x) = \int_{-\infty}^{x} f(y)\, dy
\]
defines a distribution function with characteristic function
\[
\phi(t) = \int_{-\infty}^{\infty} \exp[\,i\,t\,x\,]\, f(x)\, dx. \tag{2.9}
\]

If $\phi$ is integrable on $(-\infty, \infty)$, then f(x) is clearly bounded and continuous. To see that it is nonnegative we write
\[
f(x) = \lim_{T\to\infty} \frac{1}{2\pi} \int_{-T}^{T} \Big(1 - \frac{|t|}{T}\Big)\, e^{-\,i\,t\,x}\,\phi(t)\, dt \tag{2.10}
\]
\[
= \lim_{T\to\infty} \frac{1}{2\pi T} \int_0^T\!\!\int_0^T e^{-\,i\,(t-s)\,x}\,\phi(t-s)\, dt\, ds \tag{2.11}
\]
\[
= \lim_{T\to\infty} \frac{1}{2\pi T} \int_0^T\!\!\int_0^T e^{-\,i\,t\,x}\,\overline{e^{-\,i\,s\,x}}\,\phi(t-s)\, dt\, ds \;\ge\; 0. \tag{2.12}
\]
We can use the dominated convergence theorem to prove equation (2.10), a change of variables to show equation (2.11), and finally a Riemann sum approximation to the integral and the positive definiteness of $\phi$ to show that the quantity in (2.12) is nonnegative. It remains to show the relation (2.9). Let us define
\[
f_\sigma(x) = f(x) \exp\Big[-\frac{\sigma^2 x^2}{2}\Big]
\]
and calculate for $t \in R$, using Fubini's theorem,
\[
\int_{-\infty}^{\infty} e^{\,i\,t\,x} f_\sigma(x)\, dx
= \int_{-\infty}^{\infty} e^{\,i\,t\,x} f(x) \exp\Big[-\frac{\sigma^2 x^2}{2}\Big]\, dx
= \frac{1}{2\pi} \int\!\!\int e^{\,i\,t\,x}\,\phi(s)\, e^{-\,i\,s\,x} \exp\Big[-\frac{\sigma^2 x^2}{2}\Big]\, ds\, dx
\]
\[
= \int_{-\infty}^{\infty} \phi(s)\, \frac{1}{\sigma\sqrt{2\pi}} \exp\Big[-\frac{(t-s)^2}{2\sigma^2}\Big]\, ds. \tag{2.13}
\]
If we take t = 0 in equation (2.13), we get
\[
\int_{-\infty}^{\infty} f_\sigma(x)\, dx
= \int_{-\infty}^{\infty} \phi(s)\, \frac{1}{\sigma\sqrt{2\pi}} \exp\Big[-\frac{s^2}{2\sigma^2}\Big]\, ds \le 1. \tag{2.14}
\]
Now we let $\sigma \to 0$. Since $f_\sigma \ge 0$ and tends to f as $\sigma \to 0$, from Fatou's lemma and equation (2.14) it follows that f is integrable and in fact $\int_{-\infty}^{\infty} f(x)\, dx \le 1$. Now we let $\sigma \to 0$ in equation (2.13). Since $f_\sigma(x)\, e^{\,i\,t\,x}$ is dominated by the integrable function f, there is no problem with the left hand side. On the other hand the limit as $\sigma \to 0$ is easily calculated on the right hand side of equation (2.13):
\[
\int_{-\infty}^{\infty} e^{\,i\,t\,x} f(x)\, dx
= \lim_{\sigma\to 0} \int_{-\infty}^{\infty} \phi(s)\, \frac{1}{\sigma\sqrt{2\pi}} \exp\Big[-\frac{(s-t)^2}{2\sigma^2}\Big]\, ds
= \lim_{\sigma\to 0} \int_{-\infty}^{\infty} \phi(t + \sigma s)\, \frac{1}{\sqrt{2\pi}} \exp\Big[-\frac{s^2}{2}\Big]\, ds
= \phi(t),
\]
proving equation (2.9).
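The inversion formula (2.9) is easy to verify numerically. The sketch below is an added illustration (the standard normal is an assumed example, not part of the text): it discretizes the integral defining f for the integrable positive definite function $\phi(t) = e^{-t^2/2}$ and recovers the normal density.

```python
import numpy as np

# f(x) = (1/2pi) * integral of exp(-itx) * phi(t) dt, discretized on a grid;
# for phi(t) = exp(-t^2/2) this should reproduce the standard normal density.
t = np.linspace(-40.0, 40.0, 200001)
dt = t[1] - t[0]
phi = np.exp(-t**2 / 2)

for x in [0.0, 0.5, 1.0, 2.0]:
    f_x = ((np.exp(-1j * t * x) * phi).sum() * dt).real / (2 * np.pi)
    target = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
    print(x, f_x, target)   # the two columns agree to many decimal places
```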
Step 3. If $\phi(t)$ is a positive definite function which is continuous, so is $\phi(t)\exp[\,i\,t\,y\,]$ for every y, and for $\sigma > 0$, so is the convex combination
\[
\phi_\sigma(t) = \int_{-\infty}^{\infty} \phi(t) \exp[\,i\,t\,y\,]\, \frac{1}{\sigma\sqrt{2\pi}} \exp\Big[-\frac{y^2}{2\sigma^2}\Big]\, dy
= \phi(t) \exp\Big[-\frac{\sigma^2 t^2}{2}\Big].
\]
The previous step is applicable to $\phi_\sigma(t)$, which is clearly integrable on R, and by letting $\sigma \to 0$ we conclude by Theorem 2.3 that $\phi$ is a characteristic function as well.
2.3. WEAK CONVERGENCE 49
Remark 2.2. There is a Fourier Series analog involving distributions on a
nite interval, say S = [0, 2). The right end point is omitted on purpose,
because the distribution should be thought of as being on [0, 2] with 0 and
2 identied. If is a distribution on S the characteristic function is dened
as
(n) =
_
e
i nx
d
for integral values n Z. There is a uniqueness theorem, and a Bochner
type theorem involving an analogous denition of positive deniteness. The
proof is nearly the same.
Exercise 2.9. If $\mu_n \Rightarrow \mu$ it is not always true that $\int x\, d\mu_n \to \int x\, d\mu$, because while x is a continuous function it is not bounded. Construct a simple counterexample. On the positive side, let f(x) be a continuous function that is not necessarily bounded. Assume that there exists a positive continuous function g(x) satisfying
\[
\lim_{|x|\to\infty} \frac{|f(x)|}{g(x)} = 0
\]
and
\[
\sup_n \int g(x)\, d\mu_n \le C < \infty.
\]
Then show that
\[
\lim_{n\to\infty} \int f(x)\, d\mu_n = \int f(x)\, d\mu.
\]
In particular if $\int |x|^k\, d\mu_n$ remains bounded, then $\int x^j\, d\mu_n \to \int x^j\, d\mu$ for $1 \le j \le k - 1$.
Exercise 2.10. On the other hand if $\mu_n \Rightarrow \mu$ and $g : R \to R$ is a continuous function, then the distribution $\nu_n$ of g under $\mu_n$, defined as
\[
\nu_n[A] = \mu_n[\,x : g(x) \in A\,],
\]
converges weakly to the corresponding distribution $\nu$ of g under $\mu$.
Exercise 2.11. If $g_n(x)$ is a sequence of continuous functions such that
\[
\sup_{n,x} |g_n(x)| \le C < \infty \quad\text{and}\quad \lim_{n\to\infty} g_n(x) = g(x)
\]
uniformly on every bounded interval, then whenever $\mu_n \Rightarrow \mu$ it follows that
\[
\lim_{n\to\infty} \int g_n(x)\, d\mu_n = \int g(x)\, d\mu.
\]
Can you construct an example to show that even if $g_n$, g are continuous, just the pointwise convergence $\lim_{n\to\infty} g_n(x) = g(x)$ is not enough?
Exercise 2.12. If a sequence $\{f_n(\omega)\}$ of random variables on a measure space are such that $f_n \to f$ in measure, then show that the sequence of distributions $\alpha_n$ of $f_n$ on R converges weakly to the distribution $\alpha$ of f. Give an example to show that the converse is not true in general. However, if f is equal to a constant c with probability 1, or equivalently $\alpha$ is degenerate at some point c, then $\alpha_n \Rightarrow \alpha = \delta_c$ implies the convergence in probability of $f_n$ to the constant function c.
Chapter 3

Independent Sums

3.1 Independence and Convolution

One of the central ideas in probability is the notion of independence. In intuitive terms two events are independent if they have no influence on each other. The formal definition is

Definition 3.1. Two events A and B are said to be independent if
\[
P[A \cap B] = P[A]\,P[B].
\]

Exercise 3.1. If A and B are independent prove that so are $A^c$ and B.

Definition 3.2. Two random variables X and Y are independent if the events $X \in A$ and $Y \in B$ are independent for any two Borel sets A and B on the line, i.e.
\[
P[X \in A,\, Y \in B] = P[X \in A]\,P[Y \in B]
\]
for all Borel sets A and B.

There is a natural extension to a finite or even an infinite collection of random variables.
Definition 3.3. A finite collection $\{X_j : 1 \le j \le n\}$ of random variables is said to be independent if for any n Borel sets $A_1, \dots, A_n$ on the line
\[
P\Big[ \bigcap_{1\le j\le n} [\,X_j \in A_j\,] \Big] = \prod_{1\le j\le n} P[\,X_j \in A_j\,].
\]

Definition 3.4. An infinite collection of random variables is said to be independent if every finite subcollection is independent.

Lemma 3.1. Two random variables X, Y defined on $(\Omega, \mathcal{B}, P)$ are independent if and only if the measure induced on $R^2$ by (X, Y) is the product measure $\mu \times \nu$, where $\mu$ and $\nu$ are the distributions on R induced by X and Y respectively.

Proof. Left as an exercise.
The important thing to note is that if X and Y are independent and one knows their distributions $\mu$ and $\nu$, then their joint distribution is automatically determined as the product measure.

If X and Y are independent random variables having $\mu$ and $\nu$ for their distributions, the distribution of the sum $Z = X + Y$ is determined as follows. First we construct the product measure $\mu \times \nu$ on $R \times R$ and then consider the induced distribution of the function $f(x, y) = x + y$. This distribution, called the convolution of $\mu$ and $\nu$, is denoted by $\mu * \nu$. An elementary calculation using Fubini's theorem provides the following identities:
\[
(\mu * \nu)(A) = \int \nu(A - x)\, d\mu = \int \mu(A - x)\, d\nu. \tag{3.1}
\]
In terms of characteristic functions, we can express the characteristic function of the convolution as
\[
\int \exp[\,i\,t\,x\,]\, d(\mu * \nu)
= \int\!\!\int \exp[\,i\,t\,(x + y)\,]\, d\mu\, d\nu
= \int \exp[\,i\,t\,x\,]\, d\mu \int \exp[\,i\,t\,y\,]\, d\nu
\]
or equivalently
\[
\phi_{\mu*\nu}(t) = \phi_\mu(t)\,\phi_\nu(t), \tag{3.2}
\]
which provides a direct way of calculating the distributions of sums of independent random variables by the use of characteristic functions.
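A small numerical sketch (an added illustration; the two distributions are arbitrary choices) shows both identities at once for measures supported on a few integers: the law of X + Y is the convolution of the probability vectors, and the characteristic functions multiply.

```python
import numpy as np

p = np.array([0.2, 0.5, 0.3])   # law of X on {0, 1, 2}
q = np.array([0.6, 0.4])        # law of Y on {0, 1}
conv = np.convolve(p, q)        # law of X + Y on {0, 1, 2, 3}, i.e. mu * nu

t = 1.3                         # an arbitrary test point
phi = lambda w, t: np.sum(w * np.exp(1j * t * np.arange(len(w))))
# characteristic functions multiply under convolution, equation (3.2)
assert np.isclose(phi(conv, t), phi(p, t) * phi(q, t))
print(conv)                     # [0.12 0.38 0.38 0.12]
```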
Exercise 3.2. If X and Y are independent show that for any two measurable functions f and g, f(X) and g(Y) are independent.

Exercise 3.3. Use Fubini's theorem to show that if X and Y are independent and if f and g are measurable functions with both $E[|f(X)|]$ and $E[|g(Y)|]$ finite, then
\[
E[f(X)g(Y)] = E[f(X)]\,E[g(Y)].
\]

Exercise 3.4. Show that if X and Y are any two random variables then $E(X + Y) = E(X) + E(Y)$. If X and Y are two independent random variables then show that
\[
\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y)
\]
where
\[
\mathrm{Var}(X) = E\big[\,[X - E[X]]^2\,\big] = E[X^2] - [E[X]]^2.
\]
If $X_1, X_2, \dots, X_n$ are n independent random variables, then the distribution of their sum $S_n = X_1 + X_2 + \cdots + X_n$ can be computed in terms of the distributions of the summands. If $\mu_j$ is the distribution of $X_j$, then the distribution $\lambda_n$ of $S_n$ is given by the convolution
\[
\lambda_n = \mu_1 * \mu_2 * \cdots * \mu_n
\]
that can be calculated inductively by $\lambda_{j+1} = \lambda_j * \mu_{j+1}$. In terms of their characteristic functions, $\phi_{\lambda_n}(t) = \phi_{\mu_1}(t)\,\phi_{\mu_2}(t) \cdots \phi_{\mu_n}(t)$. The first two moments of $S_n$ are computed easily:
\[
E(S_n) = E(X_1) + E(X_2) + \cdots + E(X_n)
\]
and
\[
\mathrm{Var}(S_n) = E[S_n - E(S_n)]^2
= \sum_j E[X_j - E(X_j)]^2
+ 2 \sum_{1\le i<j\le n} E\big[\,[X_i - E(X_i)][X_j - E(X_j)]\,\big].
\]
For $i \ne j$, because of independence,
\[
E\big[\,[X_i - E(X_i)][X_j - E(X_j)]\,\big] = E[X_i - E(X_i)]\, E[X_j - E(X_j)] = 0
\]
and we get the formula
\[
\mathrm{Var}(S_n) = \mathrm{Var}(X_1) + \mathrm{Var}(X_2) + \cdots + \mathrm{Var}(X_n). \tag{3.3}
\]
3.2 Weak Law of Large Numbers

Let us look at the distribution of the number of successes in n independent trials, with the probability of success in a single trial being equal to p. Then
\[
P\{S_n = r\} = \binom{n}{r} p^r (1-p)^{n-r}
\]
and
\[
P\{|S_n - np| \ge n\delta\}
= \sum_{|r - np|\ge n\delta} \binom{n}{r} p^r (1-p)^{n-r}
\le \frac{1}{n^2\delta^2} \sum_{|r - np|\ge n\delta} (r - np)^2 \binom{n}{r} p^r (1-p)^{n-r} \tag{3.4}
\]
\[
\le \frac{1}{n^2\delta^2} \sum_{0\le r\le n} (r - np)^2 \binom{n}{r} p^r (1-p)^{n-r}
= \frac{1}{n^2\delta^2}\, E[S_n - np]^2
= \frac{1}{n^2\delta^2}\, \mathrm{Var}(S_n) \tag{3.5}
\]
\[
= \frac{1}{n^2\delta^2}\, np(1-p) = \frac{p(1-p)}{n\delta^2}. \tag{3.6}
\]
In the step (3.4) we have used a discrete version of the simple inequality
\[
\int_{\{x : g(x)\ge a\}} g(x)\, d\mu \;\ge\; a\, \mu[\,x : g(x) \ge a\,]
\]
with $g(x) = (x - np)^2$ and $a = n^2\delta^2$, and in (3.6) we have used the fact that $S_n = X_1 + X_2 + \cdots + X_n$ where the $X_i$ are independent and have the simple distribution $P\{X_i = 1\} = p$ and $P\{X_i = 0\} = 1 - p$. Therefore $E(S_n) = np$ and
\[
\mathrm{Var}(S_n) = n\,\mathrm{Var}(X_1) = np(1-p).
\]
It follows now that
\[
\lim_{n\to\infty} P\{|S_n - np| \ge n\delta\}
= \lim_{n\to\infty} P\Big\{\Big|\frac{S_n}{n} - p\Big| \ge \delta\Big\} = 0,
\]
or the average $\frac{S_n}{n}$ converges to p in probability. This is seen easily to be equivalent to the statement that the distribution of $\frac{S_n}{n}$ converges to the distribution degenerate at p. See Exercise (2.12).
The above argument works for any sequence of independent and identically distributed random variables. If we assume that $E(X_i) = m$ and $\mathrm{Var}(X_i) = \sigma^2 < \infty$, then $E(\frac{S_n}{n}) = m$ and $\mathrm{Var}(\frac{S_n}{n}) = \frac{\sigma^2}{n}$. Chebychev's inequality states that for any random variable X,
\[
P\{|X - E[X]| \ge \epsilon\}
= \int_{|X - E[X]|\ge\epsilon} dP
\le \epsilon^{-2} \int_{|X - E[X]|\ge\epsilon} [X - E[X]]^2\, dP
\le \frac{1}{\epsilon^2} \int [X - E[X]]^2\, dP
= \frac{1}{\epsilon^2}\, \mathrm{Var}(X). \tag{3.7}
\]
This can be used to prove the weak law of large numbers for the general case of independent identically distributed random variables with finite second moments.
Theorem 3.2. If $X_1, X_2, \dots, X_n, \dots$ is a sequence of independent identically distributed random variables with $E[X_j] \equiv m$ and $\mathrm{Var}(X_j) \equiv \sigma^2$, then for $S_n = X_1 + X_2 + \cdots + X_n$ we have
\[
\lim_{n\to\infty} P\Big[\,\Big|\frac{S_n}{n} - m\Big| \ge \epsilon\,\Big] = 0
\]
for any $\epsilon > 0$.

Proof. Use Chebychev's inequality to estimate
\[
P\Big[\,\Big|\frac{S_n}{n} - m\Big| \ge \epsilon\,\Big]
\le \epsilon^{-2}\, \mathrm{Var}\Big(\frac{S_n}{n}\Big) = \frac{\sigma^2}{n\epsilon^2}.
\]
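A quick simulation (an added sketch; the exponential distribution is an arbitrary choice with m = 1 and $\sigma^2$ = 1) compares the actual probability $P[|\frac{S_n}{n} - m| \ge \epsilon]$ with the Chebychev bound $\frac{\sigma^2}{n\epsilon^2}$ from the proof; the bound is crude, but both sides go to 0.

```python
import numpy as np

rng = np.random.default_rng(0)
eps, trials = 0.1, 100000
for n in [100, 1000, 10000]:
    # the mean of n i.i.d. exponential(1) variables is Gamma(n, 1/n),
    # so S_n/n can be sampled directly without storing n summands
    means = rng.gamma(n, 1.0 / n, size=trials)
    empirical = np.mean(np.abs(means - 1.0) >= eps)
    print(n, empirical, 1.0 / (n * eps**2))  # observed frequency vs bound
```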
Actually it is enough to assume that $E|X_i| < \infty$; the existence of the second moment is not needed. We will provide two proofs of the statement.

Theorem 3.3. If $X_1, X_2, \dots, X_n, \dots$ are independent and identically distributed with a finite first moment and $E(X_i) = m$, then $\frac{X_1 + X_2 + \cdots + X_n}{n}$ converges to m in probability as $n \to \infty$.
Proof 1. Let C be a large constant and let us define $X_i^C$ as the truncated random variable $X_i^C = X_i$ if $|X_i| \le C$ and $X_i^C = 0$ otherwise. Let $Y_i^C = X_i - X_i^C$, so that $X_i = X_i^C + Y_i^C$. Then
\[
\frac{1}{n} \sum_{1\le i\le n} X_i
= \frac{1}{n} \sum_{1\le i\le n} X_i^C + \frac{1}{n} \sum_{1\le i\le n} Y_i^C
= \xi_n^C + \eta_n^C.
\]
If we denote by $a_C = E(X_i^C)$ and $b_C = E(Y_i^C)$, we always have $m = a_C + b_C$. Consider the quantity
\[
\Delta_n = E\Big[\,\Big|\frac{1}{n} \sum_{1\le i\le n} X_i - m\Big|\,\Big]
= E\big[\,|\xi_n^C + \eta_n^C - m|\,\big]
\le E\big[\,|\xi_n^C - a_C|\,\big] + E\big[\,|\eta_n^C - b_C|\,\big]
\le \Big( E\big[\,|\xi_n^C - a_C|^2\,\big] \Big)^{\frac{1}{2}} + 2\,E\big[\,|Y_i^C|\,\big]. \tag{3.8}
\]
As $n \to \infty$, the truncated random variables $X_i^C$ are bounded and independent. Theorem 3.2 is applicable and the first of the two terms in (3.8) tends to 0. Therefore taking the limsup as $n \to \infty$, for any $0 < C < \infty$,
\[
\limsup_{n\to\infty} \Delta_n \le 2\,E\big[\,|Y_i^C|\,\big].
\]
If we now let the cutoff level C go to infinity, by the integrability of $X_i$, $E\big[\,|Y_i^C|\,\big] \to 0$ as $C \to \infty$ and we are done. The final step of establishing that, for any sequence $Y_n$ of random variables, $E[|Y_n|] \to 0$ implies that $Y_n \to 0$ in probability, is left as an exercise and is not very different from Chebychev's inequality.
Proof 2. We can use characteristic functions. If we denote the characteristic function of $X_i$ by $\phi(t)$, then the characteristic function of $\frac{1}{n} \sum_{1\le i\le n} X_i$ is given by $\phi_n(t) = [\phi(\frac{t}{n})]^n$. The existence of the first moment assures us that $\phi(t)$ is differentiable at t = 0 with a derivative equal to $i\,m$ where $m = E(X_i)$. Therefore by Taylor expansion
\[
\phi\Big(\frac{t}{n}\Big) = 1 + \frac{i\,m\,t}{n} + o\Big(\frac{1}{n}\Big).
\]
Whenever $n\,a_n \to z$ it follows that $(1 + a_n)^n \to e^z$. Therefore,
\[
\lim_{n\to\infty} \phi_n(t) = \exp[\,i\,m\,t\,],
\]
which is the characteristic function of the distribution degenerate at m. Hence the distribution of $\frac{S_n}{n}$ tends to the degenerate distribution at the point m. The weak law of large numbers is thereby established.
Exercise 3.5. If the underlying distribution is a Cauchy distribution with density $\frac{1}{\pi(1+x^2)}$ and characteristic function $\phi(t) = e^{-|t|}$, prove that the weak law does not hold.

Exercise 3.6. The weak law may hold sometimes even if the mean does not exist. If we dampen the tails of the Cauchy ever so slightly with a density
\[
f(x) = \frac{c}{(1+x^2)\log(1+x^2)},
\]
show that the weak law of large numbers holds.

Exercise 3.7. In the case of the Binomial distribution with $p = \frac{1}{2}$, use Stirling's formula
\[
n! \sim \sqrt{2\pi}\; e^{-n}\, n^{\,n+\frac{1}{2}}
\]
to estimate the probability
\[
\sum_{r \ge nx} \binom{n}{r} \frac{1}{2^n}
\]
and show that it decays geometrically in n. Can you calculate the geometric ratio
\[
\rho(x) = \lim_{n\to\infty} \Big[ \sum_{r \ge nx} \binom{n}{r} \frac{1}{2^n} \Big]^{\frac{1}{n}}
\]
explicitly as a function of x for $x > \frac{1}{2}$?
58 CHAPTER 3. INDEPENDENT SUMS
3.3 Strong Limit Theorems
The weak law of large numbers is really a result concerning the behavior of
S
n
n
=
X
1
+X
2
+ +X
n
n
where X
1
, X
2
, , X
n
, . . . is a sequence of independent and identically dis-
tributed random variables on some probability space (, B, P). Under the
assumption that X
i
are integrable with an integral equal to m, the weak law
asserts that as n ,
Sn
n
m in Probability. Since almost everywhere
convergence is generally stronger than convergence in Probability one may
ask if
P
_
: lim
n
S
n
()
n
= m
_
= 1
This is called the Strong Law of Large Numbers. Strong laws are statements
that hold for almost all .
Let us look at functions of the form $f_n = \mathbf{1}_{A_n}$. It is easy to verify that $f_n \to 0$ in probability if and only if $P(A_n) \to 0$. On the other hand

Lemma 3.4. (Borel-Cantelli Lemma). If
\[
\sum_n P(A_n) < \infty
\]
then
\[
P\Big[\,\omega : \lim_{n\to\infty} \mathbf{1}_{A_n}(\omega) = 0\,\Big] = 1.
\]
If the events $A_n$ are mutually independent the converse is also true.

Remark 3.1. Note that the complementary event $\big\{\omega : \limsup_{n\to\infty} \mathbf{1}_{A_n}(\omega) = 1\big\}$ is the same as $\bigcap_{n=1}^{\infty} \bigcup_{j=n}^{\infty} A_j$, or the event that infinitely many of the events $\{A_j\}$ occur.

The conclusion of the next exercise will be used in the proof.
Exercise 3.8. Prove the following variant of the monotone convergence theorem. If $f_n(\omega) \ge 0$ are measurable functions, the set $E = \{\omega : S(\omega) = \sum_n f_n(\omega) < \infty\}$ is measurable and $S(\omega)$ is a measurable function on E. If each $f_n$ is integrable and $\sum_n E[f_n] < \infty$, then $P[E] = 1$, $S(\omega)$ is integrable and $E[S(\omega)] = \sum_n E[f_n(\omega)]$.
Proof. By the previous exercise, if $\sum_n P(A_n) < \infty$, then $\sum_n \mathbf{1}_{A_n}(\omega) = S(\omega)$ is finite almost everywhere and
\[
E(S(\omega)) = \sum_n P(A_n) < \infty.
\]
If an infinite series has a finite sum then the n-th term must go to 0, thereby proving the direct part. To prove the converse we need to show that if $\sum_n P(A_n) = \infty$, then $\lim_{m\to\infty} P(\bigcup_{n=m}^{\infty} A_n) > 0$. We can use independence and the continuity of probability under monotone limits to calculate, for every m,
\[
P\Big(\bigcup_{n=m}^{\infty} A_n\Big)
= 1 - P\Big(\bigcap_{n=m}^{\infty} A_n^c\Big)
= 1 - \prod_{n=m}^{\infty} \big(1 - P(A_n)\big) \quad\text{(by independence)}
\ge 1 - e^{-\sum_{n\ge m} P(A_n)} = 1,
\]
and we are done. We have used the inequality $1 - x \le e^{-x}$, familiar in the study of infinite products.
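Both halves of the lemma are easy to watch numerically. The sketch below (an added illustration with independent events and arbitrary rates) uses $P(A_n) = 1/n^2$, a summable series, against $P(A_n) = 1/n$, a divergent one.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 200000
n = np.arange(1, N + 1)
u = rng.random(N)   # A_n occurs iff u_n < P(A_n); the A_n are independent

for p, label in [(1.0 / n**2, "sum finite"), (1.0 / n, "sum infinite")]:
    hits = np.nonzero(u < p)[0] + 1   # the indices n at which A_n occurs
    # summable case: a handful of early occurrences, then none;
    # divergent case: occurrences keep appearing along the whole range
    print(label, len(hits), hits[-3:])
```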
Another digression that we want to make into measure theory at this point is to discuss Kolmogorov's consistency theorem. How do we know that there are probability spaces that admit a sequence of independent identically distributed random variables with specified distributions? By the construction of product measures that we outlined earlier we can construct a measure on $R^n$ for every n which is the joint distribution of the first n random variables. Let us denote by $P_n$ this probability measure on $R^n$. They are consistent in the sense that if we project in the natural way from $R^{n+1} \to R^n$, $P_{n+1}$ projects to $P_n$. Such a family is called a consistent family of finite dimensional distributions. We look at the space $\Omega = R^{\infty}$ consisting of all real sequences $\omega = \{x_n : n \ge 1\}$ with a natural $\sigma$-field $\mathcal{B}$ generated by the field $\mathcal{F}$ of finite dimensional cylinder sets of the form $B = \{\omega : (x_1, \dots, x_n) \in A\}$, where A varies over Borel sets in $R^n$ and n varies over positive integers.
60 CHAPTER 3. INDEPENDENT SUMS
Theorem 3.5. (Kolmogorovs Consistency Theorem). Given a con-
sistent family of nite dimensional distributions P
n
, there exists a unique
P on (, ) such that for every n, under the natural projection
n
() =
(x
1
, , x
n
), the induced measure P
1
n
= P
n
on R
n
.
Proof. The consistency is just what is required to be able to dene P on F
by
P(B) = P
n
(A).
Once we have P dened on the eld F, we have to prove the countable
additivity of P on F. The rest is then routine. Let B
n
F and B
n
,
the empty set. If possible let P(B
n
) for all n and for some > 0. Then
B
n
=
1
kn
A
kn
for some k
n
and without loss of generality we assume that
k
n
= n, so that B
n
=
1
n
A
n
for some Borel set A
n
R
n
. According to
Exercise 3.8 below, we can nd a closed bounded subset K
n
A
n
such that
P
n
(A
n
K
n
)

2
n+1
and dene C
n
=
1
n
K
n
and D
n
=
n
j=1
C
j
=
1
n
F
n
for some closed bounded
set F
n
K
n
R
n
. Then
P(D
n
)
n

j=1

2
j+1


2
.
D
n
B
n
, D
n
and each D
n
is nonempty. If we take
(n)
= {x
n
j
: j 1}
to be an arbitrary point from D
n
, by our construction (x
n
1
, x
n
m
) F
m
for
n m. We can denitely choose a subsequence (diagonlization) such that
x
n
k
j
converges for each j producing a limit = (x
1
, , x
m
, ) and, for
every m, we will have (x
1
, , x
m
) F
m
. This implies that D
m
for
every m, contradicting D
n
. We are done.
Exercise 3.9. We have used the fact that given any Borel set $A \subset R^n$ and a probability measure $\mu$ on $R^n$, for any $\delta > 0$, there exists a closed bounded subset $K_\delta \subset A$ such that $\mu(A - K_\delta) \le \delta$. Prove it by showing that the class of sets A with the above property is a monotone class that contains finite disjoint unions of measurable rectangles and therefore contains the Borel $\sigma$-field. To prove the last fact, establish it first for n = 1. To handle n = 1, repeat the same argument starting from finite disjoint unions of right-closed left-open intervals. Use the countable additivity to verify this directly.
3.4. SERIES OF INDEPENDENT RANDOM VARIABLES 61
Remark 3.2. Kolmogorovs consistency theorem remains valid if we replace
R by an arbitrary complete separable metric space X, with its Borel -eld.
However it is not valid in complete generality. See [8]. See Remark 4.7 in
this context.
The following is a strong version of the Law of Large Numbers.

Theorem 3.6. If $X_1, \dots, X_n, \dots$ is a sequence of independent identically distributed random variables with $E|X_i|^4 = C < \infty$, then
\[
\lim_{n\to\infty} \frac{S_n}{n} = \lim_{n\to\infty} \frac{X_1 + \cdots + X_n}{n} = E(X_1)
\]
with probability 1.

Proof. We can assume without loss of generality that $E[X_i] = 0$; just take $Y_i = X_i - E[X_i]$. A simple calculation shows
\[
E[(S_n)^4] = n\,E[(X_1)^4] + 3n(n-1)\,E[(X_1)^2]^2 \le nC + 3n^2\sigma^4
\]
and by applying a Chebychev type inequality using fourth moments,
\[
P\Big[\,\Big|\frac{S_n}{n}\Big| \ge \epsilon\,\Big] = P[\,|S_n| \ge n\epsilon\,]
\le \frac{nC + 3n^2\sigma^4}{n^4\epsilon^4}.
\]
We see that
\[
\sum_{n=1}^{\infty} P\Big[\,\Big|\frac{S_n}{n}\Big| \ge \epsilon\,\Big] < \infty
\]
and we can now apply the Borel-Cantelli Lemma.
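The statement is about individual trajectories, which a single simulated path makes vivid. The following sketch (an added illustration with uniform(-1,1) summands, so the mean is 0 and all moments are finite) tracks the running averages along one realization.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-1.0, 1.0, size=10**6)          # one sequence X_1, X_2, ...
running = np.cumsum(x) / np.arange(1, len(x) + 1)  # S_n / n along the path
for n in [10**2, 10**3, 10**4, 10**5, 10**6]:
    print(n, running[n - 1])   # |S_n/n| shrinks, roughly like n**(-1/2)
```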
3.4 Series of Independent Random Variables

We wish to investigate conditions under which an infinite series with independent summands
\[
S = \sum_{j=1}^{\infty} X_j
\]
converges with probability 1. The basic steps are the following inequalities, due to Kolmogorov and Levy, that control the behaviour of sums of independent random variables. They both deal with the problem of estimating
\[
T_n(\omega) = \sup_{1\le k\le n} |S_k(\omega)| = \sup_{1\le k\le n} \Big| \sum_{j=1}^{k} X_j(\omega) \Big|
\]
where $X_1, \dots, X_n$ are n independent random variables.
Lemma 3.7. (Kolmogorov's Inequality). Assume that $E X_i = 0$ and $\mathrm{Var}(X_i) = \sigma_i^2 < \infty$, and let $s_n^2 = \sum_{j=1}^{n} \sigma_j^2$. Then
\[
P\{T_n(\omega) \ge \ell\} \le \frac{s_n^2}{\ell^2}. \tag{3.9}
\]

Proof. The important point here is that the estimate depends only on $s_n^2$ and not on the number of summands. In fact the Chebychev bound on $S_n$ is
\[
P\{|S_n| \ge \ell\} \le \frac{s_n^2}{\ell^2}
\]
and the supremum does not cost anything.

Let us define the events $E_k = \{|S_1| < \ell, \dots, |S_{k-1}| < \ell, |S_k| \ge \ell\}$; then $\{T_n \ge \ell\} = \bigcup_{k=1}^{n} E_k$ is a disjoint union of the $E_k$. If we use the independence of $S_n - S_k$ and $S_k\,\mathbf{1}_{E_k}$, which depends only on $X_1, \dots, X_k$,
\[
P\{E_k\} \le \frac{1}{\ell^2} \int_{E_k} S_k^2\, dP
\le \frac{1}{\ell^2} \int_{E_k} \big[ S_k^2 + (S_n - S_k)^2 \big]\, dP
= \frac{1}{\ell^2} \int_{E_k} \big[ S_k^2 + 2 S_k (S_n - S_k) + (S_n - S_k)^2 \big]\, dP
= \frac{1}{\ell^2} \int_{E_k} S_n^2\, dP,
\]
the cross term integrating to 0 by independence and the mean zero property. Summing over k from 1 to n,
\[
P\{T_n \ge \ell\} \le \frac{1}{\ell^2} \int_{T_n \ge \ell} S_n^2\, dP \le \frac{s_n^2}{\ell^2},
\]
establishing (3.9).
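An empirical check of (3.9) is straightforward. The sketch below (an added illustration with Rademacher steps, so $s_n^2 = n$) compares the frequency of $\{T_n \ge \ell\}$ with the bound $s_n^2/\ell^2$, which indeed does not depend on taking the maximum rather than the final value.

```python
import numpy as np

rng = np.random.default_rng(3)
n, ell, trials = 100, 25, 50000
steps = rng.choice([-1.0, 1.0], size=(trials, n))  # X_j = +/- 1, s_n^2 = n
paths = np.cumsum(steps, axis=1)                   # S_1, ..., S_n per trial
T_n = np.abs(paths).max(axis=1)                    # T_n = max_k |S_k|
print(np.mean(T_n >= ell), n / ell**2)   # empirical probability vs bound
```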
Lemma 3.8. (Levy's Inequality). Assume that
\[
P\Big\{\, |X_i + \cdots + X_n| \ge \frac{\ell}{2} \,\Big\} \le \delta
\]
for all $1 \le i \le n$. Then
\[
P\{T_n \ge \ell\} \le \frac{\delta}{1 - \delta}. \tag{3.10}
\]

Proof. Let $E_k$ be as in the previous lemma. Then
\[
P\Big[ (T_n \ge \ell) \cap \Big( |S_n| \le \frac{\ell}{2} \Big) \Big]
= \sum_{k=1}^{n} P\Big[ E_k \cap \Big( |S_n| \le \frac{\ell}{2} \Big) \Big]
\le \sum_{k=1}^{n} P\Big[ E_k \cap \Big( |S_n - S_k| \ge \frac{\ell}{2} \Big) \Big]
\]
\[
= \sum_{k=1}^{n} P\Big[ |S_n - S_k| \ge \frac{\ell}{2} \Big]\, P(E_k)
\le \delta \sum_{k=1}^{n} P(E_k)
= \delta\, P\{T_n \ge \ell\}.
\]
On the other hand,
\[
P\Big[ (T_n \ge \ell) \cap \Big( |S_n| > \frac{\ell}{2} \Big) \Big]
\le P\Big[ |S_n| > \frac{\ell}{2} \Big] \le \delta.
\]
Adding the two,
\[
P\big[\, T_n \ge \ell \,\big] \le \delta\, P\big[\, T_n \ge \ell \,\big] + \delta
\]
or
\[
P\big[\, T_n \ge \ell \,\big] \le \frac{\delta}{1 - \delta},
\]
proving (3.10).
We are now ready to prove

Theorem 3.9. (Levy's Theorem). If $X_1, X_2, \dots, X_n, \dots$ is a sequence of independent random variables, then the following are equivalent.

(i) The distribution $\mu_n$ of $S_n = X_1 + \cdots + X_n$ converges weakly to a probability distribution $\mu$ on R.

(ii) The random variable $S_n = X_1 + \cdots + X_n$ converges in probability to a limit $S(\omega)$.

(iii) The random variable $S_n = X_1 + \cdots + X_n$ converges with probability 1 to a limit $S(\omega)$.

Proof. Clearly (iii) $\Rightarrow$ (ii) $\Rightarrow$ (i) are trivial. We will establish (i) $\Rightarrow$ (ii) $\Rightarrow$ (iii).

(i) $\Rightarrow$ (ii). The characteristic functions $\phi_j(t)$ of $X_j$ are such that
\[
\phi(t) = \prod_{j=1}^{\infty} \phi_j(t)
\]
is a convergent infinite product. Since the limit $\phi(t)$ is continuous at t = 0 and $\phi(0) = 1$, it is nonzero in some interval $|t| \le T$ around 0. Therefore for $|t| \le T$,
\[
\lim_{n,m\to\infty} \prod_{j=m+1}^{n} \phi_j(t) = 1.
\]
By Exercise 3.10 below, this implies that for all t,
\[
\lim_{n,m\to\infty} \prod_{j=m+1}^{n} \phi_j(t) = 1
\]
and consequently, the distribution of $S_n - S_m$ converges to the distribution degenerate at 0. This implies the convergence in probability to 0 of $S_n - S_m$ as $m, n \to \infty$. Therefore for each $\epsilon > 0$,
\[
\lim_{n,m\to\infty} P\{|S_n - S_m| \ge \epsilon\} = 0,
\]
establishing (ii).

(ii) $\Rightarrow$ (iii). To establish (iii), because of Exercise 3.12 below, we need only show that for every $\epsilon > 0$,
\[
\lim_{n,m\to\infty} P\Big\{ \sup_{m<k\le n} |S_k - S_m| \ge \epsilon \Big\} = 0,
\]
and this follows from (ii) and Levy's inequality.
Exercise 3.10. Prove the inequality $1 - \cos 2t \le 4(1 - \cos t)$ for all real t. Deduce the inequality $1 - \mathrm{Re}\,\phi(2t) \le 4\,[1 - \mathrm{Re}\,\phi(t)]$, valid for any characteristic function. Conclude that if a sequence of characteristic functions converges to 1 in an interval around 0, then it converges to 1 for all real t.

Exercise 3.11. Prove that if a sequence $S_n$ of random variables is a Cauchy sequence in probability, i.e. for each $\epsilon > 0$,
\[
\lim_{n,m\to\infty} P\{|S_n - S_m| \ge \epsilon\} = 0,
\]
then there is a random variable S such that $S_n \to S$ in probability, i.e. for each $\epsilon > 0$,
\[
\lim_{n\to\infty} P\{|S_n - S| \ge \epsilon\} = 0.
\]

Exercise 3.12. Prove that if a sequence $S_n$ of random variables satisfies
\[
\lim_{n,m\to\infty} P\Big\{ \sup_{m<k\le n} |S_k - S_m| \ge \epsilon \Big\} = 0
\]
for every $\epsilon > 0$, then there is a limiting random variable $S(\omega)$ such that
\[
P\Big[ \lim_{n\to\infty} S_n(\omega) = S(\omega) \Big] = 1.
\]

Exercise 3.13. Prove that whenever $X_n \to X$ in probability, the distribution $\alpha_n$ of $X_n$ converges weakly to the distribution $\alpha$ of X.
Now it is straightforward to find sufficient conditions for the convergence of an infinite series of independent random variables.

Theorem 3.10. (Kolmogorov's One Series Theorem). Let a sequence $\{X_i\}$ of independent random variables, each of which has finite mean and variance, satisfy $E(X_i) = 0$ and $\sum_{i=1}^{\infty} \mathrm{Var}(X_i) < \infty$. Then
\[
S(\omega) = \sum_{i=1}^{\infty} X_i(\omega)
\]
converges with probability 1.

Proof. By a direct application of Kolmogorov's inequality,
\[
\lim_{n,m\to\infty} P\Big\{ \sup_{m<k\le n} |S_k - S_m| \ge \epsilon \Big\}
\le \lim_{n,m\to\infty} \frac{1}{\epsilon^2} \sum_{j=m+1}^{n} E(X_j^2)
= \lim_{n,m\to\infty} \frac{1}{\epsilon^2} \sum_{j=m+1}^{n} \mathrm{Var}(X_j) = 0.
\]
Therefore
\[
\lim_{n,m\to\infty} P\Big\{ \sup_{m<k\le n} |S_k - S_m| \ge \epsilon \Big\} = 0.
\]
We can also prove convergence in probability,
\[
\lim_{n,m\to\infty} P\{|S_n - S_m| \ge \epsilon\} = 0,
\]
by a simple application of Chebychev's inequality, and then apply Levy's Theorem to get almost sure convergence.
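The classical example is the random signs series; a sketch follows (an added illustration). With $X_n = \pm\frac{1}{n}$, each term has mean 0 and $\sum \mathrm{Var}(X_n) = \sum \frac{1}{n^2} < \infty$, so the theorem guarantees almost sure convergence, even though the harmonic series of absolute values diverges.

```python
import numpy as np

rng = np.random.default_rng(4)
M = 10**6
signs = rng.choice([-1.0, 1.0], size=M)            # independent fair signs
partial = np.cumsum(signs / np.arange(1, M + 1))   # partial sums of sum eps_n/n
print(partial[[999, 9999, 99999, 999999]])         # partial sums stabilizing
```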
Theorem 3.11. (Kolmogorov's Two Series Theorem). Let $a_i = E[X_i]$ be the means and $\sigma_i^2 = \mathrm{Var}(X_i)$ the variances of a sequence of independent random variables $\{X_i\}$. Assume that $\sum_i a_i$ and $\sum_i \sigma_i^2$ converge. Then the series $\sum_i X_i$ converges with probability 1.

Proof. Define $Y_i = X_i - a_i$ and apply the previous (one series) theorem to $Y_i$.
Of course in general random variables need not have finite expectations or variances. If $\{X_i\}$ is any sequence of random variables we can take a cut off value C and define $Y_i = X_i$ if $|X_i| \le C$, and $Y_i = 0$ otherwise. The $Y_i$ are then independent and bounded in absolute value by C. The theorem can be applied to $Y_i$, and if we impose the additional condition that
\[
\sum_i P\{X_i \ne Y_i\} = \sum_i P\{|X_i| > C\} < \infty,
\]
then by an application of the Borel-Cantelli Lemma, with probability 1, $X_i = Y_i$ for all sufficiently large i. The convergence of $\sum_i X_i$ and $\sum_i Y_i$ are therefore equivalent. We get then the sufficiency part of
Theorem 3.12. (Kolmogorov's Three Series Theorem). For the convergence of an infinite series $\sum_i X_i$ of independent random variables it is necessary and sufficient that all three of the following infinite series converge.

(i) For some cut off value $C > 0$, $\sum_i P\{|X_i| > C\}$ converges.

(ii) If $Y_i$ is defined to equal $X_i$ if $|X_i| \le C$, and 0 otherwise, $\sum_i E(Y_i)$ converges.

(iii) With $Y_i$ as in (ii), $\sum_i \mathrm{Var}(Y_i)$ converges.
Proof. Let us now prove the converse. If $\sum_i X_i$ converges for a sequence of independent random variables, we must necessarily have $|X_n| \le C$ eventually with probability 1. By the Borel-Cantelli Lemma the first series must converge. This means that in order to prove the necessity we can assume without loss of generality that the $|X_i|$ are all bounded, say by 1. We may also assume that $E(X_i) = 0$ for each i. Otherwise let us take independent random variables $X_i'$ that have the same distribution as $X_i$. Then $\sum_i X_i$ as well as $\sum_i X_i'$ converge with probability 1, and therefore so does $\sum_i (X_i - X_i')$. The random variables $Z_i = X_i - X_i'$ are independent and bounded by 2. They have mean 0. If we can show that $\sum_i \mathrm{Var}(Z_i)$ is convergent, since $\mathrm{Var}(Z_i) = 2\,\mathrm{Var}(X_i)$ we would have proved the convergence of the third series. Now it is elementary to conclude that since both $\sum_i X_i$ as well as $\sum_i (X_i - E(X_i))$ converge, the series $\sum_i E(X_i)$ must be convergent as well. So all we need is the following lemma to complete the proof of necessity.
Lemma 3.13. If $\sum_i X_i$ is convergent for a series of independent random variables with mean 0 that are individually bounded by C, then $\sum_i \mathrm{Var}(X_i)$ is convergent.

Proof. Let $F_n = \{\omega : |S_1| \le \ell, |S_2| \le \ell, \dots, |S_n| \le \ell\}$ where $S_k = X_1 + \cdots + X_k$. If the series converges with probability 1, we must have, for some $\ell$ and $\delta > 0$, $P(F_n) \ge \delta$ for all n. We have
\[
\int_{F_{n-1}} S_n^2\, dP
= \int_{F_{n-1}} [S_{n-1} + X_n]^2\, dP
= \int_{F_{n-1}} [S_{n-1}^2 + 2 S_{n-1} X_n + X_n^2]\, dP
\]
\[
= \int_{F_{n-1}} S_{n-1}^2\, dP + \sigma_n^2\, P(F_{n-1})
\ge \int_{F_{n-1}} S_{n-1}^2\, dP + \sigma_n^2\, \delta
\]
and on the other hand,
\[
\int_{F_{n-1}} S_n^2\, dP
= \int_{F_n} S_n^2\, dP + \int_{F_{n-1}\cap F_n^c} S_n^2\, dP
\le \int_{F_n} S_n^2\, dP + P(F_{n-1}\cap F_n^c)\,(\ell + C)^2,
\]
providing us with the estimate
\[
\sigma_n^2\, \delta \le \int_{F_n} S_n^2\, dP - \int_{F_{n-1}} S_{n-1}^2\, dP + P(F_{n-1}\cap F_n^c)\,(\ell + C)^2.
\]
Since the sets $F_{n-1}\cap F_n^c$ are disjoint and $|S_n| \le \ell$ on $F_n$,
\[
\sum_{j=1}^{\infty} \sigma_j^2 \le \frac{1}{\delta}\,\big[\,\ell^2 + (\ell + C)^2\,\big].
\]
This concludes the proof.
3.5 Strong Law of Large Numbers

We saw earlier in Theorem 3.6 that if $\{X_i\}$ is a sequence of i.i.d. (independent identically distributed) random variables with zero mean and a finite fourth moment then $\frac{X_1 + \cdots + X_n}{n} \to 0$ with probability 1. We will now prove the same result assuming only that $E|X_i| < \infty$ and $E(X_i) = 0$.

Theorem 3.14. If $\{X_i\}$ is a sequence of i.i.d. random variables with mean 0, then
\[
\lim_{n\to\infty} \frac{X_1 + \cdots + X_n}{n} = 0
\]
with probability 1.
Proof. We define
\[
Y_n = \begin{cases} X_n & \text{if } |X_n| \le n\\ 0 & \text{if } |X_n| > n \end{cases}
\]
and set $a_n = P[X_n \ne Y_n]$, $b_n = E[Y_n]$ and $c_n = \mathrm{Var}(Y_n)$. First we note that (see Exercise 3.14 below)
\[
\sum_n a_n = \sum_n P[\,|X_1| > n\,] \le E|X_1| < \infty,
\qquad
\lim_{n\to\infty} b_n = 0,
\]
and
\[
\sum_n \frac{c_n}{n^2}
\le \sum_n \frac{E[Y_n^2]}{n^2}
= \sum_n \int_{|x|\le n} \frac{x^2}{n^2}\, d\alpha
= \int x^2 \Big( \sum_{n\ge |x|} \frac{1}{n^2} \Big)\, d\alpha
\le C \int |x|\, d\alpha < \infty,
\]
where $\alpha$ is the common distribution of the $X_i$. From the three series theorem and the Borel-Cantelli Lemma, we conclude that $\sum_n \frac{Y_n - b_n}{n}$ as well as $\sum_n \frac{X_n - b_n}{n}$ converge almost surely. It is elementary to verify that for any series $\sum_n \frac{x_n}{n}$ that converges, $\frac{x_1 + \cdots + x_n}{n} \to 0$ as $n \to \infty$. We therefore conclude that
\[
P\Big[ \lim_{n\to\infty} \Big( \frac{X_1 + \cdots + X_n}{n} - \frac{b_1 + \cdots + b_n}{n} \Big) = 0 \Big] = 1.
\]
Since $b_n \to 0$ as $n \to \infty$, the theorem is proved.
Exercise 3.14. Let X be a nonnegative random variable. Then
\[
E[X] - 1 \le \sum_{n=1}^{\infty} P[\,X \ge n\,] \le E[X].
\]
In particular $E[X] < \infty$ if and only if $\sum_n P[\,X \ge n\,] < \infty$.
Exercise 3.15. If for a sequence of i.i.d. random variables $X_1, \dots, X_n, \dots$ the strong law of large numbers holds with some limit, i.e.
\[
P\Big[ \lim_{n\to\infty} \frac{S_n}{n} = \xi \Big] = 1
\]
for some random variable $\xi$, which may or may not be a constant with probability 1, then show that necessarily $E|X_i| < \infty$. Consequently $\xi = E(X_i)$ with probability 1.
One may ask why the limit cannot be a proper random variable. There is a general theorem that forbids it, called Kolmogorov's Zero-One Law. Let us look at the space $\Omega$ of real sequences $\{x_n : n \ge 1\}$. We have the $\sigma$-field $\mathcal{B}$, the product $\sigma$-field on $\Omega$. In addition we have the sub $\sigma$-fields $\mathcal{B}_n$ generated by $\{x_j : j \ge n\}$. The $\mathcal{B}_n$ decrease with n, and $\mathcal{B}_\infty = \bigcap_n \mathcal{B}_n$, which is also a $\sigma$-field, is called the tail $\sigma$-field. The typical set in $\mathcal{B}_\infty$ is a set depending only on the tail behavior of the sequence. For example the sets $\{\omega : x_n \text{ is bounded}\,\}$ and $\{\omega : \limsup_{n\to\infty} x_n = 1\}$ are in $\mathcal{B}_\infty$, whereas $\{\omega : \sup_n |x_n| = 1\}$ is not.
Theorem 3.15. (Kolmogorov's Zero-One Law). If $A \in \mathcal{B}_\infty$ and P is any product measure (not necessarily with identical components), then $P(A) = 0$ or 1.

Proof. The proof depends on showing that A is independent of itself, so that $P(A) = P(A \cap A) = P(A)P(A) = [P(A)]^2$ and therefore equals 0 or 1. The proof is elementary. Since $A \in \mathcal{B}_\infty \subset \mathcal{B}_{n+1}$ and P is a product measure, A is independent of the $\sigma$-field $\Sigma_n$ generated by $\{x_j : 1 \le j \le n\}$ (written $\Sigma_n$ here to keep it distinct from the tail fields $\mathcal{B}_n$). It is therefore independent of sets in the field $\mathcal{F} = \bigcup_n \Sigma_n$. The class of sets that are independent of A is a monotone class. Since it contains the field $\mathcal{F}$ it contains the $\sigma$-field $\mathcal{B}$ generated by $\mathcal{F}$. In particular, since $A \in \mathcal{B}$, A is independent of itself.
Corollary 3.16. Any random variable measurable with respect to the tail $\sigma$-field $\mathcal{B}_\infty$ is equal with probability 1 to a constant, relative to any given product measure.

Proof. Left as an exercise.

Warning. For different product measures the constants can be different.

Exercise 3.16. How can that happen?
3.6 Central Limit Theorem.

We saw before that for any sequence of independent identically distributed random variables $X_1, \dots, X_n, \dots$ the sum $S_n = X_1 + \cdots + X_n$ has the property that
\[
\lim_{n\to\infty} \frac{S_n}{n} = 0
\]
in probability, provided the expectation exists and equals 0. If we assume that the variance of the random variables is finite and equals $\sigma^2 > 0$, then we have

Theorem 3.17. The distribution of $\frac{S_n}{\sqrt{n}}$ converges as $n \to \infty$ to the normal distribution with density
\[
p(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\Big[-\frac{x^2}{2\sigma^2}\Big]. \tag{3.11}
\]
Proof. If we denote by $\phi(t)$ the characteristic function of any $X_i$, then the characteristic function of $\frac{S_n}{\sqrt{n}}$ is given by
\[
\phi_n(t) = \Big[ \phi\Big(\frac{t}{\sqrt{n}}\Big) \Big]^n.
\]
We can use the expansion
\[
\phi(t) = 1 - \frac{\sigma^2 t^2}{2} + o(t^2)
\]
to conclude that
\[
\phi\Big(\frac{t}{\sqrt{n}}\Big) = 1 - \frac{\sigma^2 t^2}{2n} + o\Big(\frac{1}{n}\Big),
\]
and it then follows that
\[
\lim_{n\to\infty} \phi_n(t) = \psi(t) = \exp\Big[-\frac{\sigma^2 t^2}{2}\Big].
\]
Since $\psi(t)$ is the characteristic function of the normal distribution with density p(x) given by equation (3.11), we are done.
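A simulation sketch (added here as an illustration; centered exponential(1) summands, so $\sigma = 1$, are an assumed choice) compares the empirical distribution function of $\frac{S_n}{\sqrt{n}}$ with the limiting normal at a few points.

```python
import math
import numpy as np

rng = np.random.default_rng(5)
n, trials = 2000, 100000
s = rng.gamma(n, 1.0, size=trials)  # S_n for n i.i.d. exponential(1) variables
z = (s - n) / math.sqrt(n)          # centered and scaled: S_n/sqrt(n), sigma = 1

for a in [-1.0, 0.0, 1.0]:
    Phi = 0.5 * (1 + math.erf(a / math.sqrt(2)))  # standard normal CDF at a
    print(a, np.mean(z <= a), Phi)                # empirical frequency vs CDF
```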
Exercise 3.17. A more direct proof is possible in some special cases. For instance if each $X_i = \pm 1$ with probability $\frac{1}{2}$, $S_n$ can take the values $n - 2k$ with $0 \le k \le n$,
\[
P[\,S_n = 2k - n\,] = \frac{1}{2^n} \binom{n}{k}
\]
and
\[
P\Big[\, a \le \frac{S_n}{\sqrt{n}} \le b \,\Big]
= \frac{1}{2^n} \sum_{k :\, a\sqrt{n} \le 2k - n \le b\sqrt{n}} \binom{n}{k}.
\]
Use Stirling's formula to prove directly that
\[
\lim_{n\to\infty} P\Big[\, a \le \frac{S_n}{\sqrt{n}} \le b \,\Big]
= \int_a^b \frac{1}{\sqrt{2\pi}} \exp\Big[-\frac{x^2}{2}\Big]\, dx.
\]
Actually for the proof of the central limit theorem we do not need the random variables $\{X_j\}$ to have identical distributions. Let us suppose that they all have zero means and that the variance of $X_j$ is $\sigma_j^2$. Define $s_n^2 = \sigma_1^2 + \cdots + \sigma_n^2$ and assume that $s_n^2 \to \infty$ as $n \to \infty$. Then $Y_n = \frac{S_n}{s_n}$ has zero mean and unit variance. It is not unreasonable to expect that
\[
\lim_{n\to\infty} P[\,Y_n \le a\,] = \int_{-\infty}^{a} \frac{1}{\sqrt{2\pi}} \exp\Big[-\frac{x^2}{2}\Big]\, dx
\]
under certain mild conditions.
Theorem 3.18. (Lindeberg's Theorem). If we denote by $\alpha_i$ the distribution of $X_i$, the condition (known as Lindeberg's condition)
\[
\lim_{n\to\infty} \frac{1}{s_n^2} \sum_{i=1}^{n} \int_{|x|\ge\epsilon s_n} x^2\, d\alpha_i = 0
\]
for each $\epsilon > 0$ is sufficient for the central limit theorem to hold.
Proof. The first step in proving this limit theorem, as well as other limit theorems that we will prove, is to rewrite
\[
Y_n = X_{n,1} + X_{n,2} + \cdots + X_{n,k_n} + A_n
\]
where the $X_{n,j}$ are $k_n$ mutually independent random variables and $A_n$ is a constant. In our case $k_n = n$, $A_n = 0$, and $X_{n,j} = \frac{X_j}{s_n}$ for $1 \le j \le n$. We denote by
\[
\phi_{n,j}(t) = E[\,e^{\,i\,t\,X_{n,j}}\,]
= \int e^{\,i\,t\,x}\, d\alpha_{n,j}
= \int e^{\,i\,t\,\frac{x}{s_n}}\, d\alpha_j
= \phi_j\Big(\frac{t}{s_n}\Big),
\]
where $\alpha_{n,j}$ is the distribution of $X_{n,j}$. The functions $\phi_j$ and $\phi_{n,j}$ are the characteristic functions of $\alpha_j$ and $\alpha_{n,j}$ respectively. If we denote by $\mu_n$ the distribution of $Y_n$, its characteristic function $\hat{\mu}_n(t)$ is given by
\[
\hat{\mu}_n(t) = \prod_{j=1}^{n} \phi_{n,j}(t)
\]
and our goal is to show that
\[
\lim_{n\to\infty} \hat{\mu}_n(t) = \exp\Big[-\frac{t^2}{2}\Big].
\]
This will be carried out in several steps. First, we define
\[
\psi_{n,j}(t) = \exp[\,\phi_{n,j}(t) - 1\,]
\]
and
\[
\psi_n(t) = \prod_{j=1}^{n} \psi_{n,j}(t).
\]
We show that for each finite T,
\[
\lim_{n\to\infty} \sup_{|t|\le T} \sup_{1\le j\le n} |\phi_{n,j}(t) - 1| = 0
\]
and
\[
\sup_n \sup_{|t|\le T} \sum_{j=1}^{n} |\phi_{n,j}(t) - 1| < \infty.
\]
This would imply that
\[
\lim_{n\to\infty} \sup_{|t|\le T} \big|\log\hat{\mu}_n(t) - \log\psi_n(t)\big|
\le \lim_{n\to\infty} \sup_{|t|\le T} \sum_{j=1}^{n} \big|\log\phi_{n,j}(t) - [\,\phi_{n,j}(t) - 1\,]\big|
\]
\[
\le \lim_{n\to\infty} \sup_{|t|\le T} C \sum_{j=1}^{n} |\phi_{n,j}(t) - 1|^2
\le C \lim_{n\to\infty} \Big( \sup_{|t|\le T} \sup_{1\le j\le n} |\phi_{n,j}(t) - 1| \Big)
\Big( \sup_{|t|\le T} \sum_{j=1}^{n} |\phi_{n,j}(t) - 1| \Big) = 0
\]
by the expansion
\[
\log r = \log\big(1 + (r - 1)\big) = (r - 1) + O(|r - 1|^2).
\]
The proof can then be completed by showing
\[
\lim_{n\to\infty} \sup_{|t|\le T} \Big|\log\psi_n(t) + \frac{t^2}{2}\Big|
= \lim_{n\to\infty} \sup_{|t|\le T} \Big|\Big[\sum_{j=1}^{n} (\phi_{n,j}(t) - 1)\Big] + \frac{t^2}{2}\Big| = 0.
\]
We see that
\[
\sup_{|t|\le T} \big|\phi_{n,j}(t) - 1\big|
= \sup_{|t|\le T} \Big| \int \big( \exp[\,i\,t\,x\,] - 1 \big)\, d\alpha_{n,j} \Big|
= \sup_{|t|\le T} \Big| \int \Big( \exp\Big[\,i\,t\,\frac{x}{s_n}\,\Big] - 1 \Big)\, d\alpha_j \Big|
\]
\[
= \sup_{|t|\le T} \Big| \int \Big( \exp\Big[\,i\,t\,\frac{x}{s_n}\,\Big] - 1 - i\,t\,\frac{x}{s_n} \Big)\, d\alpha_j \Big| \tag{3.12}
\]
\[
\le C_T \int \frac{x^2}{s_n^2}\, d\alpha_j \tag{3.13}
\]
\[
= C_T \int_{|x|<\epsilon s_n} \frac{x^2}{s_n^2}\, d\alpha_j
+ C_T \int_{|x|\ge\epsilon s_n} \frac{x^2}{s_n^2}\, d\alpha_j
\le C_T\, \epsilon^2 + C_T\, \frac{1}{s_n^2} \int_{|x|\ge\epsilon s_n} x^2\, d\alpha_j. \tag{3.14}
\]
We have used the mean zero condition in deriving equation (3.12) and the estimate $|e^{\,i\,x} - 1 - i\,x| \le c\,x^2$ to get to equation (3.13). If we let $n \to \infty$, by Lindeberg's condition the second term of equation (3.14) goes to 0. Therefore
\[
\limsup_{n\to\infty} \sup_{1\le j\le k_n} \sup_{|t|\le T} \big|\phi_{n,j}(t) - 1\big| \le \epsilon^2\, C_T.
\]
Since $\epsilon > 0$ is arbitrary, we have
\[
\lim_{n\to\infty} \sup_{1\le j\le k_n} \sup_{|t|\le T} \big|\phi_{n,j}(t) - 1\big| = 0.
\]
Next we observe that there is a bound,
\[
\sup_{|t|\le T} \sum_{j=1}^{n} \big|\phi_{n,j}(t) - 1\big|
\le C_T \sum_{j=1}^{n} \int \frac{x^2}{s_n^2}\, d\alpha_j
= C_T\, \frac{1}{s_n^2} \sum_{j=1}^{n} \sigma_j^2 = C_T,
\]
uniformly in n. Finally, for each $\epsilon > 0$,
\[
\lim_{n\to\infty} \sup_{|t|\le T} \Big| \Big[ \sum_{j=1}^{n} (\phi_{n,j}(t) - 1) \Big] + \frac{t^2}{2} \Big|
\le \lim_{n\to\infty} \sup_{|t|\le T} \sum_{j=1}^{n} \Big| \phi_{n,j}(t) - 1 + \frac{\sigma_j^2\, t^2}{2 s_n^2} \Big|
\]
\[
= \lim_{n\to\infty} \sup_{|t|\le T} \sum_{j=1}^{n}
\Big| \int \Big( \exp\Big[\,i\,t\,\frac{x}{s_n}\,\Big] - 1 - i\,t\,\frac{x}{s_n} + \frac{t^2 x^2}{2 s_n^2} \Big)\, d\alpha_j \Big|
\]
\[
\le \lim_{n\to\infty} \sup_{|t|\le T} \sum_{j=1}^{n}
\Big| \int_{|x|<\epsilon s_n} \Big( \exp\Big[\,i\,t\,\frac{x}{s_n}\,\Big] - 1 - i\,t\,\frac{x}{s_n} + \frac{t^2 x^2}{2 s_n^2} \Big)\, d\alpha_j \Big|
+ \lim_{n\to\infty} \sup_{|t|\le T} \sum_{j=1}^{n}
\Big| \int_{|x|\ge\epsilon s_n} \Big( \exp\Big[\,i\,t\,\frac{x}{s_n}\,\Big] - 1 - i\,t\,\frac{x}{s_n} + \frac{t^2 x^2}{2 s_n^2} \Big)\, d\alpha_j \Big|
\]
\[
\le \lim_{n\to\infty} C_T \sum_{j=1}^{n} \int_{|x|<\epsilon s_n} \frac{|x|^3}{s_n^3}\, d\alpha_j
+ \lim_{n\to\infty} C_T \sum_{j=1}^{n} \int_{|x|\ge\epsilon s_n} \frac{x^2}{s_n^2}\, d\alpha_j
\]
\[
\le \epsilon\, C_T \limsup_{n\to\infty} \sum_{j=1}^{n} \int \frac{x^2}{s_n^2}\, d\alpha_j
+ \lim_{n\to\infty} C_T \sum_{j=1}^{n} \int_{|x|\ge\epsilon s_n} \frac{x^2}{s_n^2}\, d\alpha_j
= \epsilon\, C_T
\]
by Lindeberg's condition. Since $\epsilon > 0$ is arbitrary the result is proven.
Remark 3.3. The key step in the proof of the central limit theorem under Lindeberg's condition, as well as in other limit theorems for sums of independent random variables, is the analysis of products $\hat{\mu}_n(t) = \prod_{j=1}^{k_n} \phi_{n,j}(t)$. The idea is to replace each $\phi_{n,j}(t)$ by $\exp[\,\phi_{n,j}(t) - 1\,]$, changing the product to the exponential of a sum. Although each $\phi_{n,j}(t)$ is close to 1, making the idea reasonable, in order for the idea to work one has to show that the sum $\sum_{j=1}^{k_n} |\phi_{n,j}(t) - 1|^2$ is negligible. This requires the boundedness of $\sum_{j=1}^{k_n} |\phi_{n,j}(t) - 1|$. One has to use the mean 0 condition or some suitable centering condition to cancel the first term in the expansion of $\phi_{n,j}(t) - 1$ and control the rest from sums of the variances.
Exercise 3.18. Lyapunov's condition is the following: for some $\delta > 0$,
\[
\lim_{n\to\infty} \frac{1}{s_n^{2+\delta}} \sum_{j=1}^{n} \int |x|^{2+\delta}\, d\alpha_j = 0.
\]
Prove that Lyapunov's condition implies Lindeberg's condition.
Exercise 3.19. Consider the case of mutually independent random variables $\{X_j\}$, where $X_j = \pm a_j$ with probability $\frac{1}{2}$ each. What do Lyapunov's and Lindeberg's conditions demand of $\{a_j\}$? Can you find a sequence $\{a_j\}$ that does not satisfy Lyapunov's condition for any $\delta > 0$ but satisfies Lindeberg's condition? Try to find a sequence $\{a_j\}$ such that the central limit theorem is not valid. A numerical exploration of the Lindeberg sums for two such arrays is sketched below.
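For this array $\sigma_j^2 = a_j^2$, so the Lindeberg sum reduces to $\frac{1}{s_n^2}\sum_{j : a_j \ge \epsilon s_n} a_j^2$, which is a one-line computation. The following sketch (an added numerical exploration; the two test sequences are arbitrary choices and no answers to the exercise are implied) evaluates it for $a_j \equiv 1$ and for $a_j = 2^j$.

```python
import numpy as np

def lindeberg(a, eps):
    # (1/s_n^2) * sum of a_j^2 over {j <= n : a_j >= eps * s_n}, for each n
    s2 = np.cumsum(a**2)
    return np.array([(a[:k][a[:k] >= eps * np.sqrt(s2[k-1])]**2).sum() / s2[k-1]
                     for k in range(1, len(a) + 1)])

eps = 0.5
print(lindeberg(np.ones(20), eps)[-3:])           # identically 0 eventually
print(lindeberg(2.0 ** np.arange(20), eps)[-3:])  # stays near 3/4: last term
                                                  # dominates, CLT fails
```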
3.7 Accompanying Laws.

As we stated in the previous section, we want to study the behavior of the sum of a large number of independent random variables. We have $k_n$ independent random variables $\{X_{n,j} : 1 \le j \le k_n\}$ with respective distributions $\{\alpha_{n,j}\}$. We are interested in the distribution $\mu_n$ of $Z_n = \sum_{j=1}^{k_n} X_{n,j}$. One important assumption that we will make on the random variables $\{X_{n,j}\}$ is that no single one is significant. More precisely, for every $\epsilon > 0$,
\[
\lim_{n\to\infty} \sup_{1\le j\le k_n} P[\,|X_{n,j}| \ge \epsilon\,]
= \lim_{n\to\infty} \sup_{1\le j\le k_n} \alpha_{n,j}[\,|x| \ge \epsilon\,] = 0. \tag{3.15}
\]
The condition is referred to as uniform infinitesimality. The following construction will play a major role. If $\alpha$ is a probability distribution on the line and $\hat{\alpha}(t)$ is its characteristic function, then for any nonnegative real number $a \ge 0$, $\psi_a(t) = \exp[\,a\,(\hat{\alpha}(t) - 1)\,]$ is again a characteristic function.
In fact, if we denote by $\alpha^{*k}$ the k-fold convolution of $\alpha$ with itself, $\psi_a$ is seen to be the characteristic function of the probability distribution
\[
e^{-a} \sum_{j=0}^{\infty} \frac{a^j}{j!}\, \alpha^{*j},
\]
which is a convex combination of the $\alpha^{*j}$ with weights $e^{-a}\frac{a^j}{j!}$. We use the construction mostly with a = 1. If we denote the probability distribution with characteristic function $\psi_a(t)$ by $e_a(\alpha)$, one checks easily that $e_{a+b}(\alpha) = e_a(\alpha) * e_b(\alpha)$. In particular $e_a(\alpha) = [\,e_{\frac{a}{n}}(\alpha)\,]^{*n}$. Probability distributions that can be written for each $n \ge 1$ as the n-fold convolution $\mu_n^{*n}$ of some probability distribution $\mu_n$ are called infinitely divisible. In particular, for every $a \ge 0$ and $\alpha$, $e_a(\alpha)$ is an infinitely divisible probability distribution. These are called compound Poisson distributions. In the special case when $\alpha = \delta_1$, the degenerate distribution at 1, we get for $e_a(\delta_1)$ the usual Poisson distribution with parameter a. We can interpret $e_a(\alpha)$ as the distribution of the sum of a random number of independent random variables with common distribution $\alpha$; the random number has a Poisson distribution with parameter a and is independent of the random variables involved in the sum.
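Sampling from $e_a(\alpha)$ follows this interpretation directly: draw a Poisson(a) count N and add N independent copies from $\alpha$. The sketch below (an added illustration; taking $\alpha$ to be the standard normal is an assumed convenience) verifies the characteristic function formula $\exp[a(\hat{\alpha}(t)-1)]$ empirically.

```python
import numpy as np

rng = np.random.default_rng(6)
a, t, trials = 2.0, 0.7, 200000
N = rng.poisson(a, size=trials)
# with alpha standard normal, the sum of N i.i.d. copies is N(0, N) given N,
# so multiplying one standard normal by sqrt(N) samples the same law
samples = rng.standard_normal(trials) * np.sqrt(N)

empirical = np.mean(np.exp(1j * t * samples))
alpha_hat = np.exp(-t**2 / 2)                  # characteristic function of alpha
print(empirical, np.exp(a * (alpha_hat - 1)))  # should nearly coincide
```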
In order to study the distribution $\mu_n$ of $Z_n$ it will be more convenient to replace $\alpha_{n,j}$ by an infinitely divisible distribution $\lambda_{n,j}$. This is done as follows. We define
\[
a_{n,j} = \int_{|x|\le 1} x\, d\alpha_{n,j},
\]
$\bar{\alpha}_{n,j}$ as the translate of $\alpha_{n,j}$ by $-a_{n,j}$, i.e. $\bar{\alpha}_{n,j} = \alpha_{n,j} * \delta_{-a_{n,j}}$,
\[
\bar{\lambda}_{n,j} = e_1(\bar{\alpha}_{n,j}),
\qquad
\lambda_{n,j} = \bar{\lambda}_{n,j} * \delta_{a_{n,j}},
\]
and finally
\[
\lambda_n = \mathop{*}_{j=1}^{k_n} \lambda_{n,j}.
\]
A main tool in this subject is the following theorem. We assume always that the uniform infinitesimality condition (3.15) holds. In terms of notation, we will find it more convenient to denote by $\hat{\mu}$ the characteristic function of the probability distribution $\mu$.
Theorem 3.19. (Accompanying Laws.) In order that, for some constants $A_n$, the distribution $\mu_n * \delta_{A_n}$ of $Z_n + A_n$ may converge to the limit $\mu$, it is necessary and sufficient that, for the same constants $A_n$, the distribution $\lambda_n * \delta_{A_n}$ converges to the same limit $\mu$.
Proof. First we note that, for any $\epsilon > 0$,
\[
\limsup_{n\to\infty} \sup_{1\le j\le k_n} |a_{n,j}|
= \limsup_{n\to\infty} \sup_{1\le j\le k_n} \Big| \int_{|x|\le 1} x\, d\alpha_{n,j} \Big|
\]
\[
\le \limsup_{n\to\infty} \sup_{1\le j\le k_n} \Big| \int_{|x|\le\epsilon} x\, d\alpha_{n,j} \Big|
+ \limsup_{n\to\infty} \sup_{1\le j\le k_n} \Big| \int_{\epsilon<|x|\le 1} x\, d\alpha_{n,j} \Big|
\le \epsilon + \limsup_{n\to\infty} \sup_{1\le j\le k_n} \alpha_{n,j}[\,|x| \ge \epsilon\,]
= \epsilon.
\]
Therefore
\[
\lim_{n\to\infty} \sup_{1\le j\le k_n} |a_{n,j}| = 0.
\]
This means that the $\bar{\alpha}_{n,j}$ are uniformly infinitesimal just as the $\alpha_{n,j}$ were. Let us suppose that n is so large that $\sup_{1\le j\le k_n} |a_{n,j}| \le \frac{1}{4}$. The advantage in going from $\alpha_{n,j}$ to $\bar{\alpha}_{n,j}$ is that the latter are better centered, and we can calculate
\[
\bar{a}_{n,j} = \int_{|x|\le 1} x\, d\bar{\alpha}_{n,j}
= \int_{|x-a_{n,j}|\le 1} (x - a_{n,j})\, d\alpha_{n,j}
= \int_{|x-a_{n,j}|\le 1} x\, d\alpha_{n,j} - a_{n,j}\, \alpha_{n,j}[\,|x - a_{n,j}| \le 1\,]
\]
\[
= \int_{|x-a_{n,j}|\le 1} x\, d\alpha_{n,j} - a_{n,j} + a_{n,j}\, \alpha_{n,j}[\,|x - a_{n,j}| > 1\,]
\]
and estimate $|\bar{a}_{n,j}|$ by
\[
|\bar{a}_{n,j}| \le C\, \alpha_{n,j}\big[\,|x| \ge \tfrac{3}{4}\,\big]
\le C\, \bar{\alpha}_{n,j}\big[\,|x| \ge \tfrac{1}{2}\,\big].
\]
In other words we may assume without loss of generality that the $\alpha_{n,j}$ satisfy the bound
\[
|a_{n,j}| \le C\, \alpha_{n,j}\big[\,|x| \ge \tfrac{1}{2}\,\big] \tag{3.16}
\]
and forget all about the change from $\alpha_{n,j}$ to $\bar{\alpha}_{n,j}$. We will drop the bars and stay with just $\alpha_{n,j}$. Then, just as in the proof of the Lindeberg theorem, we proceed to estimate
\[
\lim_{n\to\infty} \sup_{|t|\le T} \big|\log\hat{\lambda}_n(t) - \log\hat{\mu}_n(t)\big|
\le \lim_{n\to\infty} \sup_{|t|\le T} \Big| \sum_{j=1}^{k_n} \big[ \log\hat{\alpha}_{n,j}(t) - (\hat{\alpha}_{n,j}(t) - 1) \big] \Big|
\]
\[
\le \lim_{n\to\infty} \sup_{|t|\le T} \sum_{j=1}^{k_n} \big| \log\hat{\alpha}_{n,j}(t) - (\hat{\alpha}_{n,j}(t) - 1) \big|
\le \lim_{n\to\infty} \sup_{|t|\le T} C \sum_{j=1}^{k_n} |\hat{\alpha}_{n,j}(t) - 1|^2 = 0,
\]
provided we prove that if either $\mu_n$ or $\lambda_n$ has a limit after translation by some constants $A_n$, then
\[
\sup_n \sup_{|t|\le T} \sum_{j=1}^{k_n} \big|\hat{\alpha}_{n,j}(t) - 1\big| \le C < \infty. \tag{3.17}
\]
Let us first suppose that $\lambda_n$ has a weak limit as $n \to \infty$ after translation by $A_n$. The characteristic functions
\[
\exp\Big[ \sum_{j=1}^{k_n} (\hat{\alpha}_{n,j}(t) - 1) + i\,t\,A_n \Big] = \exp[\,f_n(t)\,]
\]
have a limit, which is again a characteristic function. Since the limiting characteristic function is continuous and equals 1 at t = 0, and the convergence is uniform near 0, on some small interval $|t| \le T_0$ we have the bound
\[
\sup_n \sup_{|t|\le T_0} \big[ -\mathrm{Re}\, f_n(t) \big] \le C
\]
or equivalently
\[
\sup_n \sup_{|t|\le T_0} \sum_{j=1}^{k_n} \int (1 - \cos t x)\, d\alpha_{n,j} \le C,
\]
and from the subadditivity property $(1 - \cos 2tx) \le 4(1 - \cos tx)$ this bound extends to an arbitrary interval $|t| \le T$:
\[
\sup_n \sup_{|t|\le T} \sum_{j=1}^{k_n} \int (1 - \cos t x)\, d\alpha_{n,j} \le C_T.
\]
If we integrate the inequality with respect to t over the interval $[-T, T]$ and divide by 2T, we get
\[
\sup_n \sum_{j=1}^{k_n} \int \Big( 1 - \frac{\sin T x}{T x} \Big)\, d\alpha_{n,j} \le C_T,
\]
from which we can conclude that
\[
\sup_n \sum_{j=1}^{k_n} \alpha_{n,j}[\,|x| \ge \ell\,] \le C_\ell < \infty
\]
for every $\ell > 0$, by choosing $T = \frac{2}{\ell}$. Moreover, using the inequality $(1 - \cos x) \ge c\,x^2$ valid near 0 for a suitable choice of c, we get the estimate
\[
\sup_n \sum_{j=1}^{k_n} \int_{|x|\le 1} x^2\, d\alpha_{n,j} \le C < \infty.
\]
Now it is straightforward to estimate, for $t \in [-T, T]$,
\[
|\hat{\alpha}_{n,j}(t) - 1|
= \Big| \int [\exp(\,i\,t\,x\,) - 1]\, d\alpha_{n,j} \Big|
\le \Big| \int_{|x|\le 1} [\exp(\,i\,t\,x\,) - 1]\, d\alpha_{n,j} \Big|
+ \Big| \int_{|x|>1} [\exp(\,i\,t\,x\,) - 1]\, d\alpha_{n,j} \Big|
\]
\[
\le \Big| \int_{|x|\le 1} [\exp(\,i\,t\,x\,) - 1 - i\,t\,x]\, d\alpha_{n,j} \Big|
+ \Big| \int_{|x|>1} [\exp(\,i\,t\,x\,) - 1]\, d\alpha_{n,j} \Big|
+ T\,|a_{n,j}|
\]
\[
\le C_1 \int_{|x|\le 1} x^2\, d\alpha_{n,j} + C_2\, \alpha_{n,j}\big[\,x : |x| \ge \tfrac{1}{2}\,\big],
\]
which proves the bound of equation (3.17).
Now we need to establish the same bound under the assumption that $\mu_n$ has a limit after suitable translations. For any probability measure $\mu$ we define $\tilde{\mu}$ by $\tilde{\mu}(A) = \mu(-A)$ for all Borel sets. The distribution $\mu * \tilde{\mu}$ is denoted by $|\mu|^2$. The characteristic functions of $\tilde{\mu}$ and $|\mu|^2$ are respectively $\overline{\hat{\mu}(t)}$ and $|\hat{\mu}(t)|^2$, where $\hat{\mu}(t)$ is the characteristic function of $\mu$. An elementary but important fact is $|\mu * \delta_A|^2 = |\mu|^2$ for any translate $\mu * \delta_A$. If $\mu_n$ has a limit so does $|\mu_n|^2$. We conclude that the limit
\[
\lim_{n\to\infty} |\hat{\mu}_n(t)|^2 = \lim_{n\to\infty} \prod_{j=1}^{k_n} |\hat{\alpha}_{n,j}(t)|^2
\]
exists and defines a characteristic function which is continuous at 0 with a value of 1. Moreover, because of uniform infinitesimality,
\[
\lim_{n\to\infty} \inf_{|t|\le T} \inf_{1\le j\le k_n} |\hat{\alpha}_{n,j}(t)| = 1.
\]
It is easy to conclude that there is a $T_0 > 0$ such that
\[
\sup_n \sup_{|t|\le T_0} \sum_{j=1}^{k_n} \big[ 1 - |\hat{\alpha}_{n,j}(t)|^2 \big] \le C_0 < \infty,
\]
and by subadditivity, for any finite T,
\[
\sup_n \sup_{|t|\le T} \sum_{j=1}^{k_n} \big[ 1 - |\hat{\alpha}_{n,j}(t)|^2 \big] \le C_T < \infty,
\]
providing us with the estimates
\[
\sup_n \sum_{j=1}^{k_n} |\alpha_{n,j}|^2[\,|x| \ge \ell\,] \le C_\ell < \infty \tag{3.18}
\]
for any $\ell > 0$, and
\[
\sup_n \sum_{j=1}^{k_n} \int\!\!\int_{|x-y|\le 2} (x - y)^2\, d\alpha_{n,j}(x)\, d\alpha_{n,j}(y) \le C < \infty. \tag{3.19}
\]
We now show that estimates (3.18) and (3.19) imply (3.17). First,
\[
|\alpha_{n,j}|^2\big[\,x : |x| \ge \tfrac{\ell}{2}\,\big]
\ge \int_{|y|\le\frac{\ell}{2}} \alpha_{n,j}\big[\,x : |x - y| \ge \tfrac{\ell}{2}\,\big]\, d\alpha_{n,j}(y)
\ge \alpha_{n,j}[\,x : |x| \ge \ell\,]\;\alpha_{n,j}\big[\,x : |x| \le \tfrac{\ell}{2}\,\big]
\ge \frac{1}{2}\, \alpha_{n,j}[\,x : |x| \ge \ell\,]
\]
by uniform infinitesimality. Therefore (3.18) implies that for every $\ell > 0$,
\[
\sup_n \sum_{j=1}^{k_n} \alpha_{n,j}[\,x : |x| \ge \ell\,] \le C_\ell < \infty. \tag{3.20}
\]
We now turn to exploiting (3.19). We start with the inequality
\[
\int\!\!\int_{|x-y|\le 2} (x - y)^2\, d\alpha_{n,j}(x)\, d\alpha_{n,j}(y)
\ge \alpha_{n,j}[\,y : |y| \le 1\,]
\Big( \inf_{|y|\le 1} \int_{|x|\le 1} (x - y)^2\, d\alpha_{n,j}(x) \Big).
\]
The first term on the right can be assumed to be at least $\frac{1}{2}$ by uniform infinitesimality. The second term satisfies
\[
\int_{|x|\le 1} (x - y)^2\, d\alpha_{n,j}(x)
\ge \int_{|x|\le 1} x^2\, d\alpha_{n,j}(x) - 2 y \int_{|x|\le 1} x\, d\alpha_{n,j}(x)
\ge \int_{|x|\le 1} x^2\, d\alpha_{n,j}(x) - 2 \Big| \int_{|x|\le 1} x\, d\alpha_{n,j}(x) \Big|
\]
\[
\ge \int_{|x|\le 1} x^2\, d\alpha_{n,j}(x) - C\, \alpha_{n,j}\big[\,x : |x| \ge \tfrac{1}{2}\,\big].
\]
The last step is a consequence of estimate (3.16), which we showed we could always assume:
\[
\Big| \int_{|x|\le 1} x\, d\alpha_{n,j}(x) \Big| \le C\, \alpha_{n,j}\big[\,x : |x| \ge \tfrac{1}{2}\,\big].
\]
Because of estimate (3.20) we can now assert
\[
\sup_n \sum_{j=1}^{k_n} \int_{|x|\le 1} x^2\, d\alpha_{n,j} \le C < \infty. \tag{3.21}
\]
One can now derive (3.17) from (3.20) and (3.21) as in the earlier part.
Exercise 3.20. Let $k_n = n^2$ and $\alpha_{n,j} = \delta_{\frac{1}{n}}$ for $1 \le j \le n^2$. Show that $\mu_n = \delta_n$, and that without the centering by $a_{n,j}$ in the construction the accompanying laws $\lambda_n$, translated by $-n$, converge to a different limit than $\mu_n * \delta_{-n} = \delta_0$.
3.8 Infinitely Divisible Distributions.

In the study of limit theorems for sums of independent random variables infinitely divisible distributions play a very important role.

Definition 3.5. A distribution $\mu$ is said to be infinitely divisible if for every positive integer n, $\mu$ can be written as the n-fold convolution $(\mu_n)^{*n}$ of some other probability distribution $\mu_n$.

Exercise 3.21. Show that the normal distribution with density
\[
p(x) = \frac{1}{\sqrt{2\pi}} \exp\Big[-\frac{x^2}{2}\Big]
\]
is infinitely divisible.

Exercise 3.22. Show that for any $\lambda \ge 0$, the Poisson distribution with parameter $\lambda$,
\[
p_\lambda(n) = \frac{e^{-\lambda}\,\lambda^n}{n!} \quad\text{for } n \ge 0,
\]
is infinitely divisible.

Exercise 3.23. Show that any probability distribution $\mu$ supported on a finite set $\{x_1, \dots, x_k\}$ with
\[
\mu[\{x_j\}] = p_j, \qquad p_j \ge 0, \qquad \sum_{j=1}^{k} p_j = 1,
\]
is infinitely divisible if and only if it is degenerate, i.e. $\mu[\{x_j\}] = 1$ for some j.

Exercise 3.24. Show that for any nonnegative finite measure F with total mass a, the distribution
\[
e(F) = e^{-a} \sum_{j=0}^{\infty} \frac{F^{*j}}{j!}
\]
with characteristic function
\[
\widehat{e(F)}(t) = \exp\Big[ \int (e^{\,i\,t\,x} - 1)\, dF \Big]
\]
is an infinitely divisible distribution.

Exercise 3.25. Show that the convolution of any two infinitely divisible distributions is again infinitely divisible. In particular if $\mu$ is infinitely divisible so is any translate $\mu * \delta_a$ for any real a.
We saw in the last section that the asymptotic behavior of $\mu_n * \delta_{A_n}$ can be investigated by means of the asymptotic behavior of $\lambda_n * \delta_{A_n}$, and the characteristic function $\hat{\lambda}_n$ of $\lambda_n$ has a very special form:
\[
\hat{\lambda}_n
= \prod_{j=1}^{k_n} \exp\big[\,\hat{\bar{\alpha}}_{n,j}(t) - 1 + i\,t\,a_{n,j}\,\big]
= \exp\Big[ \sum_{j=1}^{k_n} \int [\,e^{\,i\,t\,x} - 1\,]\, d\bar{\alpha}_{n,j}
+ i\,t \sum_{j=1}^{k_n} a_{n,j} \Big]
\]
\[
= \exp\Big[ \int [\,e^{\,i\,t\,x} - 1\,]\, dM_n + i\,t\,a_n \Big]
= \exp\Big[ \int [\,e^{\,i\,t\,x} - 1 - i\,t\,\theta(x)\,]\, dM_n
+ i\,t\Big( \int \theta(x)\, dM_n + a_n \Big) \Big]
\]
\[
= \exp\Big[ \int [\,e^{\,i\,t\,x} - 1 - i\,t\,\theta(x)\,]\, dM_n + i\,t\,b_n \Big], \tag{3.22}
\]
where $M_n = \sum_{j=1}^{k_n} \bar{\alpha}_{n,j}$ and $a_n = \sum_{j=1}^{k_n} a_{n,j}$.
We can make any reasonable choice for $\theta(\cdot)$; we will need it to be a bounded continuous function with
\[
|\theta(x) - x| \le C\,|x|^3
\]
near 0. Possible choices are $\theta(x) = \frac{x}{1+x^2}$, or $\theta(x) = x$ for $|x| \le 1$ and $\mathrm{sign}(x)$ for $|x| \ge 1$. We now investigate when such things will have a weak limit. Convoluting with $\delta_{A_n}$ only changes $b_n$ to $b_n + A_n$.

First we note that
\[
\psi(t) = \exp\Big[ \int [\,e^{\,i\,t\,x} - 1 - i\,t\,\theta(x)\,]\, dM + i\,t\,a \Big]
\]
is a characteristic function for any measure M with finite total mass. In fact it is the characteristic function of an infinitely divisible probability distribution. It is not necessary that M be a finite measure for $\psi$ to make sense. M could be infinite, but in such a way that it is finite on $\{x : |x| \ge \epsilon\}$ for every $\epsilon > 0$, and near 0 it integrates $x^2$, i.e.,
\[
M[\,x : |x| \ge \epsilon\,] < \infty \quad\text{for all } \epsilon > 0, \tag{3.23}
\]
\[
\int_{|x|\le 1} x^2\, dM < \infty. \tag{3.24}
\]
To see this we remark that
\[
\psi_\epsilon(t) = \exp\Big[ \int_{|x|\ge\epsilon} [\,e^{\,i\,t\,x} - 1 - i\,t\,\theta(x)\,]\, dM + i\,t\,a \Big]
\]
is a characteristic function for each $\epsilon > 0$, and because
\[
|\,e^{\,i\,t\,x} - 1 - i\,t\,x\,| \le C_T\, x^2 \quad\text{for } |t| \le T,
\]
$\psi_\epsilon(t) \to \psi(t)$ uniformly on bounded intervals, where $\psi(t)$ is given by the integral
\[
\psi(t) = \exp\Big[ \int [\,e^{\,i\,t\,x} - 1 - i\,t\,\theta(x)\,]\, dM + i\,t\,a \Big],
\]
which converges absolutely and defines a characteristic function. Let us call measures that satisfy (3.23) and (3.24), which can be expressed in the form
\[
\int \frac{x^2}{1 + x^2}\, dM < \infty, \tag{3.25}
\]
admissible Levy measures. Since the same argument applies to $M_n$ and $a_n$ instead of M and a, for any admissible Levy measure M and real number a, $\psi(t)$ is in fact an infinitely divisible characteristic function. As the normal distribution is also an infinitely divisible probability distribution, we arrive at the following

Theorem 3.20. For every admissible Levy measure M, $\sigma^2 \ge 0$ and real a,
\[
\psi(t) = \exp\Big[ \int [\,e^{\,i\,t\,x} - 1 - i\,t\,\theta(x)\,]\, dM + i\,t\,a - \frac{\sigma^2 t^2}{2} \Big]
\]
is the characteristic function of an infinitely divisible distribution $\mu$.

We will denote this distribution by $\mu = e(M, \sigma^2, a)$. The main theorem of this section is
Theorem 3.21. In order that $\mu_n = e(M_n, \sigma_n^2, a_n)$ may converge to a limit $\mu$ it is necessary and sufficient that $\mu = e(M, \sigma^2, a)$ and the following three conditions (3.26), (3.27) and (3.28) are satisfied.

For every bounded continuous function f that vanishes in some neighborhood of 0,
\[
\lim_{n\to\infty} \int f(x)\, dM_n = \int f(x)\, dM. \tag{3.26}
\]
For some (and therefore for every) $\ell > 0$ such that $\pm\ell$ are continuity points for M, i.e. $M\{\pm\ell\} = 0$,
\[
\lim_{n\to\infty} \Big( \sigma_n^2 + \int_{-\ell}^{\ell} x^2\, dM_n \Big)
= \sigma^2 + \int_{-\ell}^{\ell} x^2\, dM. \tag{3.27}
\]
\[
a_n \to a \quad\text{as } n \to \infty. \tag{3.28}
\]
Proof. Let us prove the sufficiency first. Condition (3.26) implies that for every $\ell$ such that $\pm\ell$ are continuity points of M,
\[
\lim_{n\to\infty} \int_{|x|\ge\ell} [\,e^{\,i\,t\,x} - 1 - i\,t\,\theta(x)\,]\, dM_n
= \int_{|x|\ge\ell} [\,e^{\,i\,t\,x} - 1 - i\,t\,\theta(x)\,]\, dM,
\]
and because of condition (3.27), it is enough to show that
\[
\lim_{\ell\to 0}\, \limsup_{n\to\infty}
\Big| \int_{|x|\le\ell} \Big[\,e^{\,i\,t\,x} - 1 - i\,t\,\theta(x) + \frac{t^2 x^2}{2}\,\Big]\, dM_n
- \int_{|x|\le\ell} \Big[\,e^{\,i\,t\,x} - 1 - i\,t\,\theta(x) + \frac{t^2 x^2}{2}\,\Big]\, dM \Big| = 0
\]
in order to conclude that
\[
\lim_{n\to\infty} \Big( -\frac{\sigma_n^2 t^2}{2}
+ \int [\,e^{\,i\,t\,x} - 1 - i\,t\,\theta(x)\,]\, dM_n \Big)
= -\frac{\sigma^2 t^2}{2} + \int [\,e^{\,i\,t\,x} - 1 - i\,t\,\theta(x)\,]\, dM.
\]
This follows from the estimates
\[
\Big|\,e^{\,i\,t\,x} - 1 - i\,t\,\theta(x) + \frac{t^2 x^2}{2}\,\Big| \le C_T\,|x|^3
\]
and
\[
\int_{|x|\le\ell} |x|^3\, dM_n \le \ell \int_{|x|\le\ell} x^2\, dM_n.
\]
Condition (3.28) takes care of the terms involving $a_n$.

We now turn to proving the necessity. If $\mu_n$ has a weak limit then the absolute values of the characteristic functions $|\hat{\mu}_n(t)|$ are all uniformly close to 1 near 0. Since
\[
|\hat{\mu}_n(t)| = \exp\Big[ -\int (1 - \cos t x)\, dM_n - \frac{\sigma_n^2 t^2}{2} \Big],
\]
taking logarithms we conclude that
\[
\lim_{t\to 0}\, \sup_n \Big( \frac{\sigma_n^2 t^2}{2} + \int (1 - \cos t x)\, dM_n \Big) = 0.
\]
This implies (3.29), (3.30) and (3.31) below.
For each $\ell > 0$,
\[
\sup_n M_n\{x : |x| \ge \ell\} < \infty, \tag{3.29}
\]
\[
\lim_{A\to\infty}\, \sup_n M_n\{x : |x| \ge A\} = 0, \tag{3.30}
\]
and for every $0 < \ell < \infty$,
\[
\sup_n \Big( \sigma_n^2 + \int_{|x|\le\ell} |x|^2\, dM_n \Big) < \infty. \tag{3.31}
\]
We can choose a subsequence of $M_n$ (which we will denote by $M_n$ as well) that converges in the sense that it satisfies conditions (3.26) and (3.27) of the theorem. Then $e(M_n, \sigma_n^2, 0)$ converges weakly to $e(M, \sigma^2, 0)$. It is not hard to see that for any sequence of probability distributions $\lambda_n$, if both $\lambda_n$ and $\lambda_n * \delta_{a_n}$ converge to limits $\lambda$ and $\mu$ respectively, then necessarily $\mu = \lambda * \delta_a$ for some a and $a_n \to a$ as $n \to \infty$. In order to complete the proof of necessity we need only establish the uniqueness of the representation, which is done in the next lemma.
Lemma 3.22. (Uniqueness). Suppose $\mu = e(M_1, \sigma_1^2, a_1) = e(M_2, \sigma_2^2, a_2)$; then $M_1 = M_2$, $\sigma_1^2 = \sigma_2^2$ and $a_1 = a_2$.

Proof. Since $\hat{\mu}(t)$ never vanishes, by taking logarithms we have
\[
\log\hat{\mu}(t)
= -\frac{\sigma_1^2 t^2}{2} + \int [\,e^{\,i\,t\,x} - 1 - i\,t\,\theta(x)\,]\, dM_1 + i\,t\,a_1
= -\frac{\sigma_2^2 t^2}{2} + \int [\,e^{\,i\,t\,x} - 1 - i\,t\,\theta(x)\,]\, dM_2 + i\,t\,a_2. \tag{3.32}
\]
We can verify that for any admissible Levy measure M,
\[
\lim_{t\to\infty} \frac{1}{t^2} \int [\,e^{\,i\,t\,x} - 1 - i\,t\,\theta(x)\,]\, dM = 0.
\]
Consequently
\[
\lim_{t\to\infty} \frac{\log\hat{\mu}(t)}{t^2} = -\frac{\sigma_1^2}{2} = -\frac{\sigma_2^2}{2},
\]
leaving us with
\[
\psi(t) = \int [\,e^{\,i\,t\,x} - 1 - i\,t\,\theta(x)\,]\, dM_1 + i\,t\,a_1
= \int [\,e^{\,i\,t\,x} - 1 - i\,t\,\theta(x)\,]\, dM_2 + i\,t\,a_2.
\]
If we calculate
\[
H(s, t) = -\frac{\psi(t + s) + \psi(t - s)}{2} + \psi(t)
\]
we get
\[
\int e^{\,i\,t\,x}\, (1 - \cos s x)\, dM_1 = \int e^{\,i\,t\,x}\, (1 - \cos s x)\, dM_2
\]
for all t and s. Since we can and do assume that $M\{0\} = 0$ for any admissible Levy measure M, we have $M_1 = M_2$. If we know that $\sigma_1^2 = \sigma_2^2$ and $M_1 = M_2$, it is easy to see that $a_1$ must equal $a_2$.
Finally,

Corollary 3.23. (Levy-Khintchine representation). Any infinitely divisible distribution $\mu$ admits a representation $\mu = e(M, \sigma^2, a)$ for some admissible Levy measure M, $\sigma^2 \ge 0$ and real number a.

Proof. We can write $\mu = \mu_n^{*n} = \mu_n * \mu_n * \cdots * \mu_n$ with n terms. If we show that $\mu_n \Rightarrow \delta_0$, then the triangular array is uniformly infinitesimal and by the earlier theorem on accompanying laws $\mu$ will be the limit of some $\lambda_n = e(M_n, 0, a_n)$, and therefore has to be of the form $e(M, \sigma^2, a)$ for some choice of admissible Levy measure M, $\sigma^2 \ge 0$ and real a. In a neighborhood around 0, $\hat{\mu}(t)$ is close to 1 and it is easy to check that
\[
\hat{\mu}_n(t) = [\,\hat{\mu}(t)\,]^{\frac{1}{n}} \to 1
\]
as $n \to \infty$ in that neighborhood. As we saw before this implies that $\mu_n \Rightarrow \delta_0$.
Applications.

1. Convergence to the Poisson distribution. Let $\{X_{n,j} : 1 \le j \le k_n\}$ be $k_n$ independent random variables taking the values 0 or 1 with probabilities $1 - p_{n,j}$ and $p_{n,j}$ respectively. We assume that
\[
\lim_{n\to\infty} \sup_{1\le j\le k_n} p_{n,j} = 0,
\]
which is the uniform infinitesimality condition. We are interested in the limiting distribution of $S_n = \sum_{j=1}^{k_n} X_{n,j}$ as $n \to \infty$. Since we have to center by the mean we can pick any level, say $\frac{1}{2}$, for truncation. Then the truncated means are all 0. The accompanying laws are given by $e(M_n, 0, a_n)$ with $M_n = \big(\sum_j p_{n,j}\big)\,\delta_1$ and $a_n = \big(\sum_j p_{n,j}\big)\,\theta(1)$. It is clear that a limit exists if and only if $\lambda_n = \sum_j p_{n,j}$ has a limit $\lambda$ as $n \to \infty$, and the limit in such a case is the Poisson distribution with parameter $\lambda$.
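The convergence is easy to observe numerically. The sketch below (an added illustration with $k_n = n$ and $p_{n,j} = \frac{\lambda}{n}$, so $\lambda_n \equiv \lambda$) compares the empirical point masses of $S_n$ with the Poisson masses.

```python
from math import exp, factorial

import numpy as np

rng = np.random.default_rng(7)
lam, n, trials = 3.0, 1000, 200000
s = rng.binomial(n, lam / n, size=trials)  # S_n: sum of n Bernoulli(lam/n)

for r in range(7):
    poisson_mass = exp(-lam) * lam**r / factorial(r)
    print(r, np.mean(s == r), poisson_mass)  # empirical vs Poisson mass at r
```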
2. Convergence to the normal distribution. If the limit of $S_n = \sum_{j=1}^{k_n} X_{n,j}$ of $k_n$ uniformly infinitesimal mutually independent random variables exists, then the limit is normal if and only if $M \equiv 0$. If $a_{n,j}$ is the centering needed, this is equivalent to
\[
\lim_{n\to\infty} \sum_j P[\,|X_{n,j} - a_{n,j}| \ge \epsilon\,] = 0
\]
for all $\epsilon > 0$. Since $\lim_{n\to\infty} \sup_j |a_{n,j}| = 0$, this is equivalent to
\[
\lim_{n\to\infty} \sum_j P[\,|X_{n,j}| \ge \epsilon\,] = 0
\]
for each $\epsilon > 0$.
3. The limiting variance and the mean are given by
$$\sigma^2 = \lim_{n\to\infty} \sum_j E\big[\,[X_{n,j} - a_{n,j}]^2 : |X_{n,j} - a_{n,j}| \le 1\,\big]$$
and
$$a = \lim_{n\to\infty} \sum_j a_{n,j}, \qquad\text{where}\qquad a_{n,j} = \int_{|x|\le 1} x\, d\alpha_{n,j}.$$
Suppose that $E[X_{n,j}] = 0$ for all $1 \le j \le k_n$ and $n$. Assume that $\sigma_n^2 = \sum_j E\{[X_{n,j}]^2\}$ and that $\sigma^2 = \lim_{n\to\infty} \sigma_n^2$ exists. What do we need in order to make sure that the limiting distribution is normal with mean $0$ and variance $\sigma^2$? Let $\alpha_{n,j}$ be the distribution of $X_{n,j}$. Because the mean of $\alpha_{n,j}$ vanishes,
$$|a_{n,j}|^2 = \Big|\int_{|x|\le 1} x\, d\alpha_{n,j}\Big|^2 = \Big|\int_{|x|>1} x\, d\alpha_{n,j}\Big|^2 \le \alpha_{n,j}[\,|x|>1\,] \int |x|^2\, d\alpha_{n,j}$$
and
$$\sum_{j=1}^{k_n} |a_{n,j}|^2 \le \Big[\sum_{1\le j\le k_n} \int |x|^2\, d\alpha_{n,j}\Big]\Big[\sup_{1\le j\le k_n} \alpha_{n,j}[\,|x|>1\,]\Big] \le \sigma_n^2 \Big[\sup_{1\le j\le k_n} \alpha_{n,j}[\,|x|>1\,]\Big] \to 0.$$
Because $\sum_{j=1}^{k_n} |a_{n,j}|^2 \to 0$ as $n \to \infty$ we must have
$$\sigma^2 = \lim_{n\to\infty} \sum_j \int_{|x|\le\epsilon} |x|^2\, d\alpha_{n,j}$$
for every $\epsilon > 0$, or equivalently
$$\lim_{n\to\infty} \sum_j \int_{|x|>\epsilon} |x|^2\, d\alpha_{n,j} = 0$$
for every $\epsilon$, establishing the necessity as well as the sufficiency in Lindeberg's Theorem. A simple calculation shows that
$$\sum_j |a_{n,j}| \le \sum_j \int_{|x|>1} |x|\, d\alpha_{n,j} \le \sum_j \int_{|x|>1} |x|^2\, d\alpha_{n,j} \to 0,$$
establishing that the limiting Normal distribution has mean $0$.
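The Poisson limit in application 1 is easy to watch numerically. The following sketch is my own illustration, not from the text; the array size $k_n = n$, the choice $p_{n,j} = \lambda/n$ and the value of $\lambda$ are arbitrary. It compares the distribution of $S_n$ with the Poisson law of parameter $\lambda$; replacing these probabilities by ones whose sum diverges is the situation of Exercise 3.26 below.

```python
import numpy as np
from math import exp, factorial

rng = np.random.default_rng(0)

# Triangular array: k_n = n Bernoulli variables with p_{n,j} = lam / n,
# so sum_j p_{n,j} = lam is fixed while sup_j p_{n,j} -> 0.
lam, n, trials = 3.0, 2000, 20000
S = rng.binomial(n=n, p=lam / n, size=trials)   # S_n = sum_j X_{n,j}

# Compare empirical frequencies with the Poisson(lam) mass function.
for k in range(8):
    empirical = np.mean(S == k)
    poisson = exp(-lam) * lam**k / factorial(k)
    print(f"P[S={k}]  empirical {empirical:.4f}  Poisson {poisson:.4f}")
```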
Exercise 3.26. What happens in the Poisson limit theorem (application 1) if $\lambda_n = \sum_j p_{n,j} \to \infty$ as $n \to \infty$? Can you show that the distribution of $\frac{S_n - \lambda_n}{\sqrt{\lambda_n}}$ converges to the standard Normal distribution?
3.9 Laws of the iterated logarithm.
When we are dealing with a sequence of independent identically distributed random variables $X_1, \cdots, X_n, \cdots$ with mean $0$ and variance $1$, we have a strong law of large numbers asserting that
$$P\Big[\lim_{n\to\infty} \frac{X_1 + \cdots + X_n}{n} = 0\Big] = 1$$
and a central limit theorem asserting that
$$P\Big[\frac{X_1 + \cdots + X_n}{\sqrt n} \le a\Big] \to \int_{-\infty}^{a} \frac{1}{\sqrt{2\pi}} \exp\Big[-\frac{x^2}{2}\Big]\, dx.$$
It is a reasonable question to ask if the random variables $\frac{X_1+\cdots+X_n}{\sqrt n}$ themselves converge to some limiting random variable $Y$ that is distributed according to the standard normal distribution. The answer is no, and it is not hard to show.
Lemma 3.24. For any sequence $n_j \uparrow \infty$ of numbers,
$$P\Big[\limsup_{j\to\infty} \frac{X_1 + \cdots + X_{n_j}}{\sqrt{n_j}} = +\infty\Big] = 1.$$
Proof. Let us define
$$Z = \limsup_{j\to\infty} \frac{X_1 + \cdots + X_{n_j}}{\sqrt{n_j}},$$
which can be $+\infty$. Because the normal distribution has an infinitely long tail, i.e. the probability of exceeding any given value is positive, we must have
$$P\big[Z \ge a\big] > 0$$
for any $a$. But $Z$ is a random variable that does not depend on the particular values of $X_1, \cdots, X_n$, so $\{Z \ge a\}$ is a set in the tail $\sigma$-field. By Kolmogorov's zero-one law $P[Z \ge a]$ must be either $0$ or $1$. Since it cannot be $0$ it must be $1$.
Since we know that $\frac{X_1+\cdots+X_n}{n} \to 0$ with probability $1$ as $n \to \infty$, the question arises as to the rate at which this happens. The law of the iterated logarithm provides an answer.

Theorem 3.25. For any sequence $X_1, \cdots, X_n, \cdots$ of independent identically distributed random variables with mean $0$ and variance $1$,
$$P\Big[\limsup_{n\to\infty} \frac{X_1 + \cdots + X_n}{\sqrt{n \log\log n}} = \sqrt 2\,\Big] = 1.$$
We will not prove this theorem in the most general case, which assumes only the existence of two moments. We will assume instead that $E[|X|^{2+\delta}] < \infty$ for some $\delta > 0$. We shall first reduce the proof to an estimate on the tail behavior of the distributions of $\frac{S_n}{\sqrt n}$ by a careful application of the Borel-Cantelli Lemma. This estimate is obvious if $X_1, \cdots, X_n, \cdots$ are themselves normally distributed, and we will show how to extend it to a large class of distributions that satisfy the additional moment condition. It is clear that we are interested in showing that for $\lambda > \sqrt 2$,
$$P\Big[\,S_n \ge \lambda\sqrt{n \log\log n} \text{ infinitely often}\,\Big] = 0.$$
It would be sufficient, because of the Borel-Cantelli lemma, to show that for any $\lambda > \sqrt 2$,
$$\sum_n P\Big[\,S_n \ge \lambda\sqrt{n \log\log n}\,\Big] < \infty.$$
This however is too strong. The condition of the Borel-Cantelli lemma is not necessary in this context because of the strong dependence between the partial sums $S_n$. The function $\varphi(n) = \sqrt{n \log\log n}$ is clearly well defined and non-decreasing for $n \ge 3$, and it is sufficient for our purposes to show that for any $\lambda > \sqrt 2$ we can find some sequence $k_n \uparrow \infty$ of integers such that
$$\sum_n P\Big[\sup_{k_{n-1} \le j \le k_n} S_j \ge \lambda\, \varphi(k_{n-1})\Big] < \infty. \qquad (3.33)$$
This will establish that with probability $1$,
$$\limsup_{n\to\infty}\ \sup_{k_{n-1} \le j \le k_n} \frac{S_j}{\varphi(k_{n-1})} \le \lambda,$$
or by the monotonicity of $\varphi$,
$$\limsup_{n\to\infty} \frac{S_n}{\varphi(n)} \le \lambda$$
with probability $1$. Since $\lambda > \sqrt 2$ is arbitrary, the upper bound in the law of the iterated logarithm will follow. Each term in the sum of (3.33) can be estimated as in Levy's inequality,
$$P\Big[\sup_{k_{n-1} \le j \le k_n} S_j \ge \lambda\, \varphi(k_{n-1})\Big] \le 2\, P\Big[\,S_{k_n} \ge (\lambda - \delta)\, \varphi(k_{n-1})\,\Big]$$
with $0 < \delta < \lambda$, provided
$$\sup_{1 \le j \le k_n - k_{n-1}} P\Big[\,|S_j| \ge \delta\, \varphi(k_{n-1})\,\Big] \le \frac{1}{2}.$$
Our choice of $k_n$ will be $k_n = [\rho^n]$ for some $\rho > 1$, and therefore
$$\lim_{n\to\infty} \frac{\varphi(k_{n-1})}{\sqrt{k_n}} = \infty,$$
and by Chebychev's inequality, for any fixed $\delta > 0$,
$$\sup_{1\le j\le k_n} P\Big[\,|S_j| \ge \delta\,\varphi(k_{n-1})\,\Big] \le \frac{E[S_{k_n}^2]}{[\delta\,\varphi(k_{n-1})]^2} = \frac{k_n}{[\delta\,\varphi(k_{n-1})]^2} = \frac{k_n}{\delta^2\, k_{n-1} \log\log k_{n-1}} = o(1) \text{ as } n \to \infty. \qquad (3.34)$$
By choosing $\delta$ small enough so that $\lambda - \delta > \sqrt 2$, it is sufficient to show that for any $\lambda' > \sqrt 2$,
$$\sum_n P\Big[\,S_{k_n} \ge \lambda'\, \varphi(k_{n-1})\,\Big] < \infty.$$
By picking $\rho$ sufficiently close to $1$ (so that $\frac{\lambda'}{\sqrt\rho} > \sqrt 2$), because $\frac{\varphi(k_{n-1})}{\varphi(k_n)} \simeq \frac{1}{\sqrt\rho}$, we can reduce this to the convergence of
$$\sum_n P\Big[\,S_{k_n} \ge \lambda\, \varphi(k_n)\,\Big] < \infty \qquad (3.35)$$
for all $\lambda > \sqrt 2$. If we use the estimate $P[X \ge a] \le \exp[-\frac{a^2}{2}]$ that is valid for the standard normal distribution, we can verify (3.35):
$$\sum_n \exp\Big[-\frac{\lambda^2\, (\varphi(k_n))^2}{2\, k_n}\Big] < \infty$$
for any $\lambda > \sqrt 2$.
To prove the lower bound we select again a subsequence, $k_n = [\rho^n]$ with some $\rho > 1$, and look at $Y_n = S_{k_{n+1}} - S_{k_n}$, which are now independent random variables. The tail probability of the Normal distribution has the lower bound
$$P[X \ge a] = \frac{1}{\sqrt{2\pi}}\int_a^\infty \exp\Big[-\frac{x^2}{2}\Big]\, dx \ge \frac{1}{\sqrt{2\pi}}\int_a^\infty \exp\Big[-\frac{x^2}{2} - x\Big](x+1)\, dx \ge \frac{1}{\sqrt{2\pi}} \exp\Big[-\frac{(a+1)^2}{2}\Big].$$
If we assume Normal-like tail probabilities we can conclude that
$$\sum_n P\Big[\,Y_n \ge \lambda\, \varphi(k_{n+1})\,\Big] \ge \frac{1}{\sqrt{2\pi}}\sum_n \exp\Big[-\frac{1}{2}\Big(1 + \frac{\lambda\,\varphi(k_{n+1})}{\sqrt{\rho^{n+1} - \rho^n}}\Big)^2\Big] = +\infty$$
provided $\frac{\lambda^2 \rho}{2(\rho - 1)} < 1$, and conclude by the Borel-Cantelli lemma that $Y_n = S_{k_{n+1}} - S_{k_n}$ exceeds $\lambda\,\varphi(k_{n+1})$ infinitely often for such $\lambda$. On the other hand,
from the upper bound we already have (replacing $X_i$ by $-X_i$)
$$P\Big[\limsup_{n\to\infty} \frac{-S_{k_n}}{\varphi(k_{n+1})} \le \frac{\sqrt 2}{\sqrt\rho}\Big] = 1.$$
Consequently
$$P\Big[\limsup_{n\to\infty} \frac{S_{k_{n+1}}}{\varphi(k_{n+1})} \ge \frac{\sqrt{2(\rho-1)} - \sqrt 2}{\sqrt\rho}\Big] = 1$$
and therefore
$$P\Big[\limsup_{n\to\infty} \frac{S_n}{\varphi(n)} \ge \frac{\sqrt{2(\rho-1)} - \sqrt 2}{\sqrt\rho}\Big] = 1.$$
We now take $\rho$ arbitrarily large and we are done.
We saw that the law of the iterated logarithm depends on two things:

(i) For any $a > 0$ and $p < \frac{a^2}{2}$, an upper bound for the probability
$$P\big[\,S_n \ge a\sqrt{n\log\log n}\,\big] \le C_p\, [\log n]^{-p}$$
with some constant $C_p$.

(ii) For any $a > 0$ and $p > \frac{a^2}{2}$, a lower bound for the probability
$$P\big[\,S_n \ge a\sqrt{n\log\log n}\,\big] \ge C_p\, [\log n]^{-p}$$
with some, possibly different, constant $C_p$.
Both inequalities can be obtained from a uniform rate of convergence in the central limit theorem:
$$\sup_a \Big| P\Big\{\frac{S_n}{\sqrt n} \ge a\Big\} - \int_a^\infty \frac{1}{\sqrt{2\pi}} \exp\Big[-\frac{x^2}{2}\Big]\, dx \Big| \le C\, n^{-\alpha} \qquad (3.36)$$
for some $\alpha > 0$. Such an error estimate is provided in the following theorem.

Theorem 3.26. (Berry-Esseen theorem). Assume that the i.i.d. sequence $\{X_j\}$ with mean zero and variance one satisfies an additional moment condition $E|X|^{2+\delta} < \infty$ for some $\delta > 0$. Then for some $\alpha > 0$ the estimate (3.36) holds.
Proof. The proof will be carried out after two lemmas.

Lemma 3.27. Let $-\infty < a < b < \infty$ be given and let $0 < h < \frac{b-a}{2}$ be a small positive number. Consider the function $f_{a,b,h}(x)$ defined as
$$f_{a,b,h}(x) = \begin{cases} 0 & \text{for } -\infty < x \le a-h\\[2pt] \frac{x-a+h}{2h} & \text{for } a-h \le x \le a+h\\[2pt] 1 & \text{for } a+h \le x \le b-h\\[2pt] 1 - \frac{x-b+h}{2h} & \text{for } b-h \le x \le b+h\\[2pt] 0 & \text{for } b+h \le x < \infty. \end{cases}$$
For any probability distribution $\alpha$ with characteristic function $\phi(t)$,
$$\int_{-\infty}^\infty f_{a,b,h}(x)\, d\alpha(x) = \frac{1}{2\pi}\int_{-\infty}^\infty \phi(y)\, \frac{e^{-iay} - e^{-iby}}{iy}\, \frac{\sin hy}{hy}\, dy.$$
Proof. This is essentially the Fourier inversion formula. Note that
$$f_{a,b,h}(x) = \frac{1}{2\pi}\int_{-\infty}^\infty e^{ixy}\, \frac{e^{-iay} - e^{-iby}}{iy}\, \frac{\sin hy}{hy}\, dy.$$
We can start with the double integral
$$\frac{1}{2\pi}\int\!\!\int e^{ixy}\, \frac{e^{-iay} - e^{-iby}}{iy}\, \frac{\sin hy}{hy}\, dy\, d\alpha(x)$$
and apply Fubini's theorem to obtain the lemma.
Lemma 3.28. Let $\alpha, \beta$ be two probability measures with zero means, having $\phi(y), \psi(y)$ for their respective characteristic functions. Then
$$\int_{-\infty}^\infty f_{a,h}(x)\, d(\alpha - \beta)(x) = \frac{1}{2\pi}\int_{-\infty}^\infty [\phi(y) - \psi(y)]\, \frac{e^{-iay}}{iy}\, \frac{\sin hy}{hy}\, dy,$$
where $f_{a,h}(x) = f_{a,\infty,h}(x)$ is given by
$$f_{a,h}(x) = \begin{cases} 0 & \text{for } -\infty < x \le a-h\\[2pt] \frac{x-a+h}{2h} & \text{for } a-h \le x \le a+h\\[2pt] 1 & \text{for } a+h \le x < \infty. \end{cases}$$
Proof. We just let $b \to \infty$ in the previous lemma. Since $|\phi(y) - \psi(y)| = o(|y|)$ near $y = 0$, there is no problem in applying the Riemann-Lebesgue Lemma.

We now proceed with the proof of the theorem. We have
$$\alpha[[a, \infty)] \le \int f_{a-h,h}(x)\, d\alpha(x) \le \alpha[[a-2h, \infty)]$$
and
$$\beta[[a, \infty)] \le \int f_{a-h,h}(x)\, d\beta(x) \le \beta[[a-2h, \infty)].$$
Therefore, if we assume that $\beta$ has a density bounded by $C$,
$$\alpha[[a, \infty)] - \beta[[a, \infty)] \le 2hC + \int f_{a-h,h}(x)\, d(\alpha - \beta)(x).$$
Since we get a similar bound in the other direction as well,
$$\sup_a \big|\alpha[[a,\infty)] - \beta[[a,\infty)]\big| \le \sup_a \Big|\int f_{a-h,h}(x)\, d(\alpha-\beta)(x)\Big| + 2hC \le \frac{1}{2\pi}\int |\phi(y) - \psi(y)|\, \frac{|\sin hy|}{h\, y^2}\, dy + 2hC. \qquad (3.37)$$
Now we return to the proof of the theorem. We take $\alpha$ to be the distribution $\mu_n$ of $\frac{S_n}{\sqrt n}$, having as its characteristic function $\phi_n(y) = [\phi(\frac{y}{\sqrt n})]^n$, where $\phi(y)$ is the characteristic function of the common distribution of the $\{X_i\}$ and has the expansion
$$\phi(y) = 1 - \frac{y^2}{2} + O(|y|^{2+\delta})$$
for some $\delta > 0$. We therefore get, for some constant $C$ and for $\gamma = \frac{\delta}{2+\delta}$,
$$\Big|\phi_n(y) - \exp\Big[-\frac{y^2}{2}\Big]\Big| \le C\, \frac{|y|^{2+\delta}}{n^{\delta/2}}\, \exp\Big[-\frac{y^2}{4}\Big] \qquad \text{if } |y| \le n^{\gamma}.$$
Therefore
$$\int \Big|\phi_n(y) - e^{-\frac{y^2}{2}}\Big|\, \frac{|\sin hy|}{h\, y^2}\, dy = \int_{|y|\le n^\gamma} \Big|\phi_n(y) - e^{-\frac{y^2}{2}}\Big|\, \frac{|\sin hy|}{h\, y^2}\, dy + \int_{|y|\ge n^\gamma} \Big|\phi_n(y) - e^{-\frac{y^2}{2}}\Big|\, \frac{|\sin hy|}{h\, y^2}\, dy$$
$$\le \frac{C}{h}\Big[\frac{1}{n^{\delta/2}}\int_{|y|\le n^\gamma} |y|^{\delta}\, e^{-\frac{y^2}{4}}\, dy + \int_{|y|\ge n^\gamma} \frac{dy}{|y|^2}\Big] \le C\, \frac{n^{-\delta/2} + n^{-\gamma}}{h} = \frac{C}{h\, n^{\frac{\delta}{2+\delta}}}.$$
Substituting this bound in (3.37), with $\beta$ the standard normal distribution whose density is bounded, we get
$$\sup_a \big|\mu_n[[a,\infty)] - \beta[[a,\infty)]\big| \le C_1 h + \frac{C}{h\, n^{\frac{\delta}{2+\delta}}}.$$
By picking $h = n^{-\frac{\delta}{2(2+\delta)}}$ we get
$$\sup_a \big|\mu_n[[a,\infty)] - \beta[[a,\infty)]\big| \le C\, n^{-\frac{\delta}{2(2+\delta)}}$$
and we are done.
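The rate in (3.36) can be probed by simulation. The sketch below is an illustration under stated assumptions (centered exponential summands, which have all moments; a Monte Carlo estimate of the left side of (3.36)); it shows the sup-distance shrinking as $n$ grows.

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(1)

def normal_tail(a):
    """P[N(0,1) >= a]."""
    return 0.5 * (1.0 - erf(a / sqrt(2.0)))

# X_i = exponential(1) - 1: mean 0, variance 1, all moments finite.
# S_n = sum X_i is then a centered Gamma(n) variable, sampled directly.
trials = 400_000
for n in [10, 100, 1000]:
    S = (rng.gamma(shape=n, scale=1.0, size=trials) - n) / np.sqrt(n)
    grid = np.linspace(-3.0, 3.0, 121)
    err = max(abs(np.mean(S >= a) - normal_tail(a)) for a in grid)
    print(f"n = {n:4d}   estimated sup_a |difference| ~ {err:.4f}")
```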
Chapter 4
Dependent Random Variables
4.1 Conditioning
One of the key concepts in probability theory is the notion of conditional probability and conditional expectation. Suppose that we have a probability space $(\Omega, F, P)$ consisting of a space $\Omega$, a $\sigma$-field $F$ of subsets of $\Omega$, and a probability measure $P$ on the $\sigma$-field $F$. If we have a set $A \in F$ of positive measure, then conditioning with respect to $A$ means we restrict ourselves to the set $A$: $\Omega$ gets replaced by $A$, and the $\sigma$-field $F$ by the $\sigma$-field $F_A$ of subsets of $A$ that are in $F$. For $B \subset A$ we define
$$P_A(B) = \frac{P(B)}{P(A)}.$$
We could achieve the same thing by defining, for arbitrary $B \in F$,
$$P_A(B) = \frac{P(A \cap B)}{P(A)},$$
in which case $P_A(\cdot)$ is a measure defined on $F$ as well, but one that is concentrated on $A$ and assigns $0$ probability to $A^c$. The definition of conditional probability is
$$P(B|A) = \frac{P(A \cap B)}{P(A)}.$$
Similarly the conditional expectation of an integrable function $f(\omega)$, given a set $A \in F$ of positive probability, is defined to be
$$E\{f|A\} = \frac{\int_A f(\omega)\, dP}{P(A)}.$$
In particular, if we take $f = 1_B$ for some $B \in F$ we recover the definition of conditional probability. In general, if we know $P(B|A)$ and $P(A)$ we can recover $P(A \cap B) = P(A)P(B|A)$, but we cannot recover $P(B)$. But if we know $P(B|A)$ as well as $P(B|A^c)$, along with $P(A)$ and $P(A^c) = 1 - P(A)$, then
$$P(B) = P(A \cap B) + P(A^c \cap B) = P(A)P(B|A) + P(A^c)P(B|A^c).$$
More generally, if $\mathcal{P}$ is a partition of $\Omega$ into a finite or even a countable number of disjoint measurable sets $A_1, \cdots, A_j, \cdots$,
$$P(B) = \sum_j P(A_j)P(B|A_j).$$
If $\xi$ is a random variable taking distinct values $\{a_j\}$ on $\{A_j\}$, then
$$P(B|\xi = a_j) = P(B|A_j)$$
or, more generally,
$$P(B|\xi = a) = \frac{P(B \cap \{\xi = a\})}{P(\xi = a)}$$
provided $P(\xi = a) > 0$. One of our goals is to seek a definition that makes sense when $P(\xi = a) = 0$. This involves dividing $0$ by $0$ and should involve differentiation of some kind. In the countable case we may think of $P(B|\xi = a_j)$ as a function $f_B(\omega)$ which is equal to $P(B|A_j)$ on $\xi = a_j$. We can rewrite our definition of
$$f_B(a_j) = P(B|\xi = a_j)$$
as
$$\int_{\xi = a_j} f_B(\omega)\, dP = P(B \cap \{\xi = a_j\}) \quad\text{for each } j,$$
or, summing over any arbitrary collection $E$ of $j$'s,
$$\int_{\xi \in E} f_B(\omega)\, dP = P(B \cap \{\xi \in E\}).$$
Sets of the form $\{\xi \in E\}$ form a sub $\sigma$-field $\Sigma \subset F$, and we can rewrite the definition as
$$\int_A f_B(\omega)\, dP = P(B \cap A)$$
for all $A \in \Sigma$. Of course, in this case $A \in \Sigma$ if and only if $A$ is a union of the atoms $\{\xi = a_j\}$ of the partition over a finite or countable subcollection of the possible values of $a$. Similar considerations apply to the conditional expectation of a random variable $G$ given $\xi$. The equation becomes
$$\int_A g(\xi(\omega))\, dP = \int_A G(\omega)\, dP,$$
or we can rewrite this as
$$\int_A g(\omega)\, dP = \int_A G(\omega)\, dP$$
for all $A \in \Sigma$, where instead of demanding that $g$ be a function of $\xi$ we demand that $g$ be $\Sigma$ measurable, which is the same thing. Now the random variable $\xi$ is out of the picture, and rightly so. What is important is the information we have if we know $\xi$, and that is the same if we replace $\xi$ by a one-to-one function of itself. The $\sigma$-field $\Sigma$ abstracts that information nicely. So it turns out that the proper notion of conditioning involves a sub $\sigma$-field $\Sigma \subset F$. If $G$ is an integrable function and $\Sigma \subset F$ is given, we will seek another integrable function $g$ that is $\Sigma$ measurable and satisfies
$$\int_A g(\omega)\, dP = \int_A G(\omega)\, dP$$
for all $A \in \Sigma$. We will prove existence and uniqueness of such a $g$, call it the conditional expectation of $G$ given $\Sigma$, and denote it by $g = E[G|\Sigma]$.
The way to prove the above result will take us on a detour. A signed measure $\mu$ on a measurable space $(\Omega, F)$ is a set function $\mu(\cdot)$, defined for $A \in F$, which is countably additive but not necessarily nonnegative. Countable additivity is again in either of the following two equivalent senses:
$$\mu(\cup_n A_n) = \sum_n \mu(A_n)$$
for any countable collection of disjoint sets in $F$, or
$$\lim_{n\to\infty} \mu(A_n) = \mu(A)$$
whenever $A_n \uparrow A$ or $A_n \downarrow A$. Examples of such $\mu$ can be constructed by taking the difference $\mu_1 - \mu_2$ of two nonnegative measures $\mu_1$ and $\mu_2$.
Definition 4.1. A set $A \in F$ is totally positive (totally negative) for $\mu$ if for every subset $B \in F$ with $B \subset A$, $\mu(B) \ge 0$ ($\le 0$).

Remark 4.1. A measurable subset of a totally positive set is totally positive. Any countable union of totally positive sets is again totally positive.
Lemma 4.1. If $\mu$ is a countably additive signed measure on $(\Omega, F)$, then
$$\sup_{A \in F} |\mu(A)| < \infty.$$
Proof. The key idea in the proof is that, since $\mu(\Omega)$ is a finite number, if $\mu(A)$ is large so is $\mu(A^c)$, with an opposite sign. In fact, it is not hard to see that $\big||\mu(A)| - |\mu(A^c)|\big| \le |\mu(\Omega)|$ for all $A \in F$. Another fact is that if $\sup_{B \subset A}|\mu(B)|$ and $\sup_{B \subset A^c}|\mu(B)|$ are finite, so is $\sup_B |\mu(B)|$. Now let us complete the proof. Given a subset $A \in F$ with $\sup_{B \subset A}|\mu(B)| = \infty$ and any positive number $N$, there is a subset $A_1 \in F$ with $A_1 \subset A$ such that $|\mu(A_1)| \ge N$ and $\sup_{B \subset A_1}|\mu(B)| = \infty$. This is obvious, because if we pick a set $E \subset A$ with $|\mu(E)|$ very large, so will $|\mu(A \setminus E)|$ be. At least one of the two sets $E$, $A \setminus E$ will have the second property, and we can call it $A_1$. If we proceed by induction we have a decreasing sequence $A_n \downarrow$ with $|\mu(A_n)| \uparrow \infty$, and that contradicts countable additivity.
Lemma 4.2. Given a subset $A \in F$ with $\mu(A) = a > 0$, there is a subset $\widetilde A \subset A$ that is totally positive, with $\mu(\widetilde A) \ge a$.

Proof. Let us define $m = \inf_{B \subset A} \mu(B)$. Since the empty set is included, $m \le 0$. If $m = 0$ then $A$ is totally positive and we are done. So let us assume that $m < 0$. By the previous lemma $m > -\infty$. Let us find $B_1 \subset A$ such that $\mu(B_1) \le \frac{m}{2}$. Then for $A_1 = A \setminus B_1$ we have $A_1 \subset A$, $\mu(A_1) \ge a$ and $\inf_{B \subset A_1} \mu(B) \ge \frac{m}{2}$. By induction we can find $A_k$ with $A \supset A_1 \supset \cdots \supset A_k$, $\mu(A_k) \ge a$ for every $k$, and $\inf_{B \subset A_k} \mu(B) \ge \frac{m}{2^k}$. Clearly, if we define $\widetilde A = \cap_k A_k$, which is the decreasing limit, $\widetilde A$ works.
Theorem 4.3. (Hahn-Jordan Decomposition). Given a countably additive signed measure $\mu$ on $(\Omega, F)$, it can always be written as $\mu = \mu^+ - \mu^-$, the difference of two nonnegative measures. Moreover $\mu^+$ and $\mu^-$ may be chosen to be orthogonal, i.e. there are disjoint sets $\Omega^+, \Omega^- \in F$ such that $\mu^+(\Omega^-) = \mu^-(\Omega^+) = 0$. In fact $\Omega^+$ and $\Omega^-$ can be taken to be subsets of $\Omega$ that are respectively totally positive and totally negative for $\mu$; $\mu^\pm$ then become just the restrictions of $\pm\mu$ to $\Omega^\pm$.
Proof. Totally positive sets are closed under countable unions, disjoint or not. Let us define $m^+ = \sup_A \mu(A)$. If $m^+ = 0$ then $\mu(A) \le 0$ for all $A$, and we can take $\Omega^+ = \emptyset$ and $\Omega^- = \Omega$, which works. Assume that $m^+ > 0$. There exist sets $A_n$ with $\mu(A_n) \ge m^+ - \frac{1}{n}$ and therefore totally positive subsets $\widetilde A_n$ of $A_n$ with $\mu(\widetilde A_n) \ge m^+ - \frac{1}{n}$. Clearly $\Omega^+ = \cup_n \widetilde A_n$ is totally positive and $\mu(\Omega^+) = m^+$. It is easy to see that $\Omega^- = \Omega \setminus \Omega^+$ is totally negative. $\mu^\pm$ can be taken to be the restrictions of $\pm\mu$ to $\Omega^\pm$.
Remark 4.2. If $\mu = \mu^+ - \mu^-$ with $\mu^+$ and $\mu^-$ orthogonal to each other, then they have to be the restrictions of $\pm\mu$ to the totally positive and totally negative sets for $\mu$, and such a representation for $\mu$ is unique. It is clear that in general the representation is not unique, because we can add a common measure $\lambda$ to both $\mu^+$ and $\mu^-$, and the $\lambda$ will cancel when we compute $\mu = \mu^+ - \mu^-$.
Remark 4.3. If $\lambda$ is a nonnegative measure and we define $\mu$ by
$$\mu(A) = \int_A f(\omega)\, d\lambda = \int 1_A(\omega) f(\omega)\, d\lambda,$$
where $f$ is an integrable function, then $\mu$ is a countably additive signed measure with $\Omega^+ = \{\omega : f(\omega) > 0\}$ and $\Omega^- = \{\omega : f(\omega) < 0\}$. If we define $f^\pm(\omega)$ as the positive and negative parts of $f$, then
$$\mu^\pm(A) = \int_A f^\pm(\omega)\, d\lambda.$$
The signed measure $\mu$ that was constructed in the preceding remark enjoys a very special relationship to $\lambda$. For any set $A$ with $\lambda(A) = 0$ we have $\mu(A) = 0$, because the integrand $1_A(\omega)f(\omega)$ is $0$ for $\lambda$-almost all $\omega$, and for all practical purposes is a function that vanishes identically.
Definition 4.2. A signed measure $\mu$ is said to be absolutely continuous with respect to a nonnegative measure $\lambda$, $\mu \ll \lambda$ in symbols, if whenever $\lambda(A)$ is zero for a set $A \in F$ it is also true that $\mu(A) = 0$.

Theorem 4.4. (Radon-Nikodym Theorem). If $\mu \ll \lambda$ then there is an integrable function $f(\omega)$ such that
$$\mu(A) = \int_A f(\omega)\, d\lambda \qquad (4.1)$$
for all $A \in F$. The function $f$ is uniquely determined almost everywhere and is called the Radon-Nikodym derivative of $\mu$ with respect to $\lambda$. It is denoted by
$$f(\omega) = \frac{d\mu}{d\lambda}.$$
Proof. The proof depends on the decomposition theorem. We saw that if the relation (4.1) holds, then $\Omega^+ = \{\omega : f(\omega) > 0\}$. If we define $\mu_a = \mu - a\lambda$, then $\mu_a$ is a signed measure for every real number $a$. Let us define $\Omega(a)$ to be the totally positive subset of $\Omega$ for $\mu_a$. These sets are only defined up to sets of measure zero, and we can only handle a countable number of sets of measure $0$ at one time. So it is prudent to restrict $a$ to the set $Q$ of rational numbers. Roughly speaking, $\Omega(a)$ will be the set $\{f(\omega) > a\}$ and we will try to construct $f$ from the sets $\Omega(a)$ by the definition
$$f(\omega) = \sup[\,a \in Q : \omega \in \Omega(a)\,].$$
The plan is to check that the function $f(\omega)$ defined above works. Since $\mu_a$ is getting more negative as $a$ increases, $\Omega(a)$ is $\downarrow$ as $a \uparrow$. There is trouble with sets of measure $0$ for every comparison between two rationals $a_1$ and $a_2$. Collect all such troublesome sets (only a countable number) and throw them away. In other words, we may assume without loss of generality that $\Omega(a_1) \subset \Omega(a_2)$ whenever $a_1 > a_2$. Clearly
$$\{\omega : f(\omega) > x\} = \{\omega : \omega \in \Omega(y) \text{ for some rational } y > x\} = \cup_{y > x,\, y \in Q}\, \Omega(y)$$
and this makes $f$ measurable. If $A \subset \cap_a \Omega(a)$ then $\mu(A) \ge a\,\lambda(A) \ge 0$ for all $a$. If $\lambda(A) > 0$, $\mu(A)$ has to be infinite, which is not possible. Therefore $\lambda(A)$ has to be zero, and by absolute continuity $\mu(A) = 0$ as well. On the other hand, if $A \cap \Omega(a) = \emptyset$ for all $a$, then $\mu(A) \le a\,\lambda(A)$ for all $a$, and again if $\lambda(A) > 0$, $\mu(A) = -\infty$, which is not possible either. Therefore $\lambda(A)$, and by absolute continuity $\mu(A)$, are zero. This proves that $f(\omega)$ is finite almost everywhere with respect to both $\lambda$ and $\mu$. Let us take two real numbers $a < b$ and consider $E_{a,b} = \{\omega : a \le f(\omega) \le b\}$. It is clear that the set $E_{a,b}$ is in $\Omega(a')$ and in $\Omega^c(b')$ for any $a' < a$ and $b' > b$. Therefore, for any set $A \subset E_{a,b}$, by letting $a'$ and $b'$ tend to $a$ and $b$,
$$a\, \lambda(A) \le \mu(A) \le b\, \lambda(A).$$
4.1. CONDITIONING 107
Now we are essentially done. Let us take a grid {nh} and consider E
n
=
{ : nh f() < (n + 1)h} for < n < . Then for any A F and
each n,
(A E
n
) h(A E
n
) nh (A E
n
)
_
AEn
f() d
(n + 1) h(A E
n
) (A E
n
) +h (A E
n
).
Summing over n we have
(A) h (A)
_
A
f() d (A) +h (A)
proving the integrability of f and if we let h 0 establishing
(A) =
_
A
f() d
for all A F.
Remark 4.4. (Uniqueness). If we have two choices of $f$, say $f_1$ and $f_2$, their difference $g = f_1 - f_2$ satisfies
$$\int_A g(\omega)\, d\lambda = 0$$
for all $A \in F$. If we take $A_\epsilon = \{\omega : g(\omega) \ge \epsilon\}$, then $0 = \int_{A_\epsilon} g\, d\lambda \ge \epsilon\,\lambda(A_\epsilon)$, and this implies $\lambda(A_\epsilon) = 0$ for all $\epsilon > 0$, or $g(\omega) \le 0$ almost everywhere with respect to $\lambda$. A similar argument establishes $g(\omega) \ge 0$ almost everywhere with respect to $\lambda$. Therefore $g = 0$ a.e., proving uniqueness.
Exercise 4.1. If $f$ and $g$ are two integrable functions, measurable with respect to a $\sigma$-field $B$, and if $\int_A f(\omega)\, dP = \int_A g(\omega)\, dP$ for all sets $A \in B_0$, a field that generates the $\sigma$-field $B$, then $f = g$ a.e. $P$.

Exercise 4.2. If $\mu(A) \ge 0$ for all $A \in F$, prove that $f(\omega) \ge 0$ almost everywhere.

Exercise 4.3. If $\Omega$ is a countable set and $\lambda(\{\omega\}) > 0$ for each single point set, prove that any measure $\mu$ is absolutely continuous with respect to $\lambda$, and calculate the Radon-Nikodym derivative.
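In the countable setting of Exercise 4.3 the Radon-Nikodym derivative is simply the ratio of point masses, $f(\omega) = \mu(\{\omega\})/\lambda(\{\omega\})$, and the defining relation (4.1) can be checked directly. A minimal sketch (the three point space and the weights are arbitrary illustrative choices):

```python
# Countable (here finite) Omega; lam gives positive mass to every point,
# so every mu is absolutely continuous with respect to lam.
lam = {1: 0.2, 2: 0.3, 3: 0.5}
mu  = {1: 0.1, 2: 0.6, 3: 0.3}

f = {w: mu[w] / lam[w] for w in lam}     # Radon-Nikodym derivative dmu/dlam

A = {2, 3}                               # any subset plays the role of A in (4.1)
assert abs(sum(mu[w] for w in A) - sum(f[w] * lam[w] for w in A)) < 1e-12
print(f)                                 # {1: 0.5, 2: 2.0, 3: 0.6}
```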
Exercise 4.4. Let $F(x)$ be a distribution function on the line with $F(0) = 0$ and $F(1) = 1$, so that the probability measure $\mu$ corresponding to it lives on the interval $[0,1]$. If $F(x)$ satisfies a Lipschitz condition
$$|F(x) - F(y)| \le A\,|x - y|,$$
then prove that $\mu \ll m$ where $m$ is the Lebesgue measure on $[0,1]$. Show also that $0 \le \frac{d\mu}{dm} \le A$ almost surely.

If $\lambda, \mu, \nu$ are three nonnegative measures such that $\nu \ll \mu$ and $\mu \ll \lambda$, then show that $\nu \ll \lambda$ and
$$\frac{d\nu}{d\lambda} = \frac{d\nu}{d\mu}\, \frac{d\mu}{d\lambda} \quad \text{a.e.}$$
Exercise 4.5. If $\lambda, \mu$ are nonnegative measures with $\mu \ll \lambda$ and $\frac{d\mu}{d\lambda} = f$, then show that $g$ is integrable with respect to $\mu$ if and only if $g f$ is integrable with respect to $\lambda$, and
$$\int g(\omega)\, d\mu = \int g(\omega)\, f(\omega)\, d\lambda.$$
Exercise 4.6. Given two nonnegative measures $\lambda$ and $\mu$, $\mu$ is said to be uniformly absolutely continuous with respect to $\lambda$ on $F$ if for any $\epsilon > 0$ there exists a $\delta > 0$ such that for any $A \in F$ with $\lambda(A) < \delta$ it is true that $\mu(A) < \epsilon$. Use the Radon-Nikodym theorem to show that absolute continuity on a $\sigma$-field $F$ implies uniform absolute continuity. If $F_0$ is a field that generates the $\sigma$-field $F$, show by an example that absolute continuity on $F_0$ does not imply absolute continuity on $F$. Show however that uniform absolute continuity on $F_0$ implies uniform absolute continuity, and therefore absolute continuity, on $F$.
Exercise 4.7. If $F$ is a distribution function on the line, show that it is absolutely continuous with respect to Lebesgue measure on the line if and only if for any $\epsilon > 0$ there exists a $\delta > 0$ such that for an arbitrary finite collection of disjoint intervals $I_j = [a_j, b_j]$ with $\sum_j |b_j - a_j| < \delta$ it follows that $\sum_j |F(b_j) - F(a_j)| \le \epsilon$.
4.2 Conditional Expectation
In the Radon-Nikodym theorem, if $\mu \ll \lambda$ are two probability distributions on $(\Omega, F)$, we defined the Radon-Nikodym derivative $f(\omega) = \frac{d\mu}{d\lambda}$ as an $F$ measurable function such that
$$\mu(A) = \int_A f(\omega)\, d\lambda \quad \text{for all } A \in F.$$
If $\Sigma \subset F$ is a sub $\sigma$-field, the absolute continuity of $\mu$ with respect to $\lambda$ on $\Sigma$ is clearly implied by the absolute continuity of $\mu$ with respect to $\lambda$ on $F$. We can therefore apply the Radon-Nikodym theorem on the measurable space $(\Omega, \Sigma)$, and we will obtain a new Radon-Nikodym derivative
$$g(\omega) = \frac{d\mu}{d\lambda}\Big|_{\Sigma}$$
such that
$$\mu(A) = \int_A g(\omega)\, d\lambda \quad \text{for all } A \in \Sigma,$$
and $g$ is $\Sigma$ measurable. Since the old function $f(\omega)$ was only $F$ measurable, in general it cannot be used as the Radon-Nikodym derivative for the sub $\sigma$-field $\Sigma$. Now if $f$ is an integrable function on $(\Omega, F, P)$ and $\Sigma \subset F$ is a sub $\sigma$-field, we can define $\mu$ on $F$ by
$$\mu(A) = \int_A f(\omega)\, dP \quad \text{for all } A \in F$$
and recalculate the Radon-Nikodym derivative $g$ for $\Sigma$; $g$ will be a $\Sigma$ measurable, integrable function such that
$$\mu(A) = \int_A g(\omega)\, dP \quad \text{for all } A \in \Sigma.$$
In other words, $g$ is the perfect candidate for the conditional expectation
$$g(\omega) = E\big[f(\omega)\,|\,\Sigma\big].$$
We have therefore proved the existence of the conditional expectation.

Theorem 4.5. The conditional expectation, as a mapping of $f \to g$, has the following properties.

1. If $g = E[f|\Sigma]$ then $E[g] = E[f]$, and $E[1|\Sigma] = 1$ a.e.

2. If $f$ is nonnegative then $g = E[f|\Sigma]$ is almost surely nonnegative.
3. The map is linear: if $a_1, a_2$ are constants, then
$$E\big[a_1 f_1 + a_2 f_2\,\big|\,\Sigma\big] = a_1 E\big[f_1|\Sigma\big] + a_2 E\big[f_2|\Sigma\big] \quad \text{a.e.}$$

4. If $g = E[f|\Sigma]$, then
$$\int |g(\omega)|\, dP \le \int |f(\omega)|\, dP.$$

5. If $h$ is a bounded $\Sigma$ measurable function, then
$$E\big[f\, h\,\big|\,\Sigma\big] = h\, E\big[f|\Sigma\big] \quad \text{a.e.}$$

6. If $\Sigma_2 \subset \Sigma_1 \subset F$, then
$$E\big[E[f|\Sigma_1]\,\big|\,\Sigma_2\big] = E\big[f|\Sigma_2\big] \quad \text{a.e.}$$

7. Jensen's Inequality. If $\phi(x)$ is a convex function of $x$ and $g = E[f|\Sigma]$, then
$$E\big[\phi(f(\omega))\,\big|\,\Sigma\big] \ge \phi(g(\omega)) \quad \text{a.e.}, \qquad (4.2)$$
and if we take expectations,
$$E[\phi(f)] \ge E[\phi(g)].$$

Proof. (i), (ii) and (iii) are obvious. For (iv) we note that if $d\mu = f\, dP$,
$$\int |f|\, dP = \sup_{A \in F} \mu(A) - \inf_{A \in F} \mu(A),$$
and if we replace $F$ by a sub $\sigma$-field $\Sigma$ the right hand side is decreased. (v) is obvious if $h$ is the indicator function of a set $A$ in $\Sigma$. To go from indicator functions to simple functions to bounded measurable functions is routine. (vi) is an easy consequence of the definition. Finally (vii) corresponds to Theorem 1.7 proved for ordinary expectations, and is proved analogously. We note that if $f_1 \ge f_2$ then $E[f_1|\Sigma] \ge E[f_2|\Sigma]$ a.e., and consequently $E[\max(f_1,f_2)|\Sigma] \ge \max(g_1,g_2)$ a.e., where $g_i = E[f_i|\Sigma]$ for $i = 1, 2$. Since we can represent any convex function as $\phi(x) = \sup_a [\,ax - \psi(a)\,]$, limiting
ourselves to rational $a$, we have only a countable set of functions to deal with, and
$$E\big[\phi(f)|\Sigma\big] = E\Big[\sup_a\, [\,af - \psi(a)\,]\,\Big|\,\Sigma\Big] \ge \sup_a\, \Big[\,a\, E\big[f|\Sigma\big] - \psi(a)\Big] = \sup_a\, \big[\,a\, g - \psi(a)\big] = \phi(g)$$
a.e., and after taking expectations
$$E[\phi(f)] \ge E[\phi(g)].$$
Remark 4.5. Conditional expectation is a form of averaging, i.e. it is linear, takes constants into constants and preserves nonnegativity. Jensen's inequality is now a consequence of convexity.

In a somewhat more familiar context, if $P = \mu_1 \times \mu_2$ is a product measure on $(\Omega, F) = (\Omega_1 \times \Omega_2, F_1 \times F_2)$ and we take $\Sigma = \{A \times \Omega_2 : A \in F_1\}$, then for any function $f(\omega) = f(\omega_1, \omega_2)$ we have $E[f(\omega)|\Sigma] = g(\omega)$, where $g(\omega) = g(\omega_1)$ is given by
$$g(\omega_1) = \int_{\Omega_2} f(\omega_1, \omega_2)\, d\mu_2,$$
so that the conditional expectation is just integrating out the unwanted variable $\omega_2$. We can go one step more. If $\phi(x,y)$ is the joint density on $R^2$ of two random variables $X, Y$ (with respect to the Lebesgue measure on $R^2$), and $\psi(x)$ is the marginal density of $X$ given by
$$\psi(x) = \int_{-\infty}^{\infty} \phi(x,y)\, dy,$$
then for any integrable function $f(x,y)$
$$E[f(X,Y)|X] = E[f(\cdot,\cdot)|\Sigma] = \frac{\int_{-\infty}^{\infty} f(x,y)\, \phi(x,y)\, dy}{\psi(x)},$$
where $\Sigma$ is the $\sigma$-field of vertical strips $A \times (-\infty,\infty)$ with a measurable horizontal base $A$.
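The last formula can be exercised numerically. The sketch below is an illustration; the joint density $\phi(x,y) = x + y$ on the unit square is a hypothetical choice, not from the text. It approximates $E[Y|X=x]$ by a Riemann sum and compares it with the closed form $(x/2 + 1/3)/(x + 1/2)$.

```python
import numpy as np

# Hypothetical joint density on the unit square: phi(x, y) = x + y.
phi = lambda x, y: x + y
f   = lambda x, y: y                    # we condition f(x, y) = y on X

y, dy = np.linspace(0.0, 1.0, 20001, retstep=True)

def cond_exp(x):
    num = np.sum(f(x, y) * phi(x, y)) * dy   # ~ integral of f * phi dy
    den = np.sum(phi(x, y)) * dy             # ~ marginal density psi(x)
    return num / den

# Exact answer for this density: E[Y | X = x] = (x/2 + 1/3) / (x + 1/2).
for x in [0.1, 0.5, 0.9]:
    print(x, cond_exp(x), (x / 2 + 1 / 3) / (x + 0.5))
```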
Exercise 4.8. If $f$ is already $\Sigma$ measurable then $E[f|\Sigma] = f$. This suggests that the map $f \to g = E[f|\Sigma]$ is some sort of a projection. In fact, if we consider the Hilbert space $H = L_2[\Omega, F, P]$ of all $F$ measurable square integrable functions, with inner product
$$\langle f, g\rangle = \int f\, g\, dP,$$
then
$$H_0 = L_2[\Omega, \Sigma, P] \subset H = L_2[\Omega, F, P],$$
and $f \to E[f|\Sigma]$ is seen to be the same as the orthogonal projection from $H$ onto $H_0$. Prove it.

Exercise 4.9. If $F_1 \subset F_2 \subset F$ are two sub $\sigma$-fields of $F$ and $X$ is any integrable function, we can define $X_i = E[X|F_i]$ for $i = 1, 2$. Show that $X_1 = E[X_2|F_1]$ a.e.

Conditional expectation is then the best nonlinear predictor if the loss function is the expected (mean) square error.
4.3 Conditional Probability
We now turn our attention to conditional probability. If we take $f = 1_B(\omega)$ then $E[f|\Sigma] = P(\omega, B)$ is called the conditional probability of $B$ given $\Sigma$. It is characterized by the property that it is $\Sigma$ measurable as a function of $\omega$, and for any $A \in \Sigma$
$$P(A \cap B) = \int_A P(\omega, B)\, dP.$$

Theorem 4.6. $P(\omega, \cdot)$ has the following properties.

1. $P(\omega, \Omega) = 1$, $P(\omega, \emptyset) = 0$ a.e.

2. For any $B \in F$, $0 \le P(\omega, B) \le 1$ a.e.

3. For any countable collection $\{B_j\}$ of disjoint sets in $F$,
$$P(\omega, \cup_j B_j) = \sum_j P(\omega, B_j) \quad \text{a.e.}$$

4. If $B \in \Sigma$, $P(\omega, B) = 1_B(\omega)$ a.e.

Proof. All are easy consequences of properties of conditional expectations. Property (iii) perhaps needs an explanation: if $E[|f_n - f|] \to 0$, then by the properties of conditional expectation $E[|E\{f_n|\Sigma\} - E\{f|\Sigma\}|] \to 0$. Property (iii) is an easy consequence of this.
The problem with the above theorem is that every property is valid only almost everywhere. There are exceptional sets of measure zero for each case. While each null set, or a countable number of them, can be ignored, we have an uncountable number of null sets, and we would like a single null set outside which all the properties hold. This means constructing a good version of the conditional probability, which may not always be possible. If possible, such a version is called a regular conditional probability. The existence of such a regular version depends on the space $(\Omega, F)$ and the sub $\sigma$-field $\Sigma$ being nice. If $\Omega$ is a complete separable metric space and $F$ are its Borel sets, and if $\Sigma$ is any countably generated sub $\sigma$-field of $F$, then it is nice enough. We will prove it in the special case when $\Omega = [0,1]$ is the unit interval and $F$ are the Borel subsets $B$ of $[0,1]$; $\Sigma$ can be any countably generated sub $\sigma$-field of $F$.

Remark 4.6. In fact the case is not so special. There is a theorem [6] which states that if $(\Omega, F)$ is any complete separable metric space that has an uncountable number of points, then there is a one-to-one measurable map with a measurable inverse between $(\Omega, F)$ and $([0,1], B)$. There is no loss of generality in assuming that $(\Omega, F)$ is just $([0,1], B)$.
Theorem 4.7. Let $P$ be a probability distribution on $([0,1], B)$ and let $\Sigma \subset B$ be a sub $\sigma$-field. There exists a family of probability distributions $Q_x$ on $([0,1], B)$ such that for every $A \in B$, $Q_x(A)$ is $\Sigma$ measurable in $x$, and for every $B$ measurable $f$,
$$\int f(y)\, Q_x(dy) = E^P[f(\omega)|\Sigma] \quad \text{a.e. } P. \qquad (4.3)$$
If in addition $\Sigma$ is countably generated, i.e. there is a field $\Sigma_0$, consisting of a countable number of Borel subsets of $[0,1]$, such that the $\sigma$-field generated by $\Sigma_0$ is $\Sigma$, then
$$Q_x(A) = 1_A(x) \quad \text{for all } A \in \Sigma. \qquad (4.4)$$
Proof. The trick is not to be too ambitious in the first place, but to try to construct the conditional expectations
$$Q(\omega, B) = E\big[1_B(\omega)|\Sigma\big]$$
only for sets $B$ given by $B = [0, x]$ for rational $x$. We call our conditional expectation, which is in fact a conditional probability, $F(\omega, x)$. By the properties of conditional expectations, for any pair of rationals $x < y$ there is a null set $E_{x,y}$ such that for $\omega \notin E_{x,y}$
$$F(\omega, x) \le F(\omega, y).$$
Moreover, for any rational $x < 0$ there is a null set $N_x$ outside which $F(\omega, x) = 0$, and similar null sets $N_x$ for $x > 1$, outside which $F(\omega, x) = 1$. If we collect all these null sets, of which there are only countably many, and take their union, we get a null set $N$ such that for $\omega \notin N$ we have a family $F(\omega, x)$, defined for rational $x$, that satisfies
$$F(\omega, x) \le F(\omega, y) \quad \text{if } x < y \text{ are rational},$$
$$F(\omega, x) = 0 \quad \text{for rational } x < 0,$$
$$F(\omega, x) = 1 \quad \text{for rational } x > 1,$$
$$P(A \cap [0, x]) = \int_A F(\omega, x)\, dP \quad \text{for all } A \in \Sigma.$$
For $\omega \notin N$ and real $y$ we can define
$$G(\omega, y) = \lim_{x \downarrow y,\ x \text{ rational}} F(\omega, x).$$
For $\omega \notin N$, $G$ is a right continuous nondecreasing function (a distribution function) with $G(\omega, y) = 0$ for $y < 0$ and $G(\omega, y) = 1$ for $y \ge 1$. There is then a probability measure $\widetilde Q(\omega, B)$ on the Borel subsets of $[0,1]$ such that $\widetilde Q(\omega, [0,y]) = G(\omega, y)$ for all $y$. $\widetilde Q$ is our candidate for the regular conditional probability. Clearly $\widetilde Q(\omega, I)$ is $\Sigma$ measurable for all intervals $I$, and by standard arguments will continue to be $\Sigma$ measurable for all Borel sets $B$. If we check that
$$P(A \cap [0, x]) = \int_A G(\omega, x)\, dP \quad \text{for all } A \in \Sigma$$
for all $0 \le x \le 1$, then
$$P(A \cap I) = \int_A \widetilde Q(\omega, I)\, dP \quad \text{for all } A \in \Sigma$$
for all intervals $I$, and by standard arguments this will extend to finite disjoint unions of half open intervals, which constitute a field, and finally to the $\sigma$-field generated by that field. To verify that for all real $y$
$$P(A \cap [0, y]) = \int_A G(\omega, y)\, dP \quad \text{for all } A \in \Sigma,$$
we start from
$$P(A \cap [0, x]) = \int_A F(\omega, x)\, dP \quad \text{for all } A \in \Sigma,$$
valid for rational $x$, and let $x \downarrow y$ through rationals. From the countable additivity of $P$ the left hand side converges to $P(A \cap [0, y])$, and by the bounded convergence theorem the right hand side converges to $\int_A G(\omega, y)\, dP$, and we are done.
Finally, from the uniqueness of the conditional expectation, if $A \in \Sigma$
$$\widetilde Q(\omega, A) = 1_A(\omega)$$
provided $\omega \notin N_A$, a null set that depends on $A$. We can take a countable set $\Sigma_0$ of generators $A$ that forms a field and get a single null set $N$ such that if $\omega \notin N$
$$\widetilde Q(\omega, A) = 1_A(\omega)$$
for all $A \in \Sigma_0$. Since both sides are countably additive measures in $A$, and as they agree on $\Sigma_0$, they have to agree on $\Sigma$ as well.
Exercise 4.10. (Disintegration Theorem.) Let $\mu$ be a probability measure on the plane $R^2$ with a marginal distribution $\alpha$ for the first coordinate. In other words, $\alpha$ is such that, for any $f$ that is a bounded measurable function of $x$,
$$\int_{R^2} f(x)\, d\mu = \int_R f(x)\, d\alpha.$$
Show that there exist measures $\mu_x$, depending measurably on $x$, such that $\mu_x[\{x\} \times R] = 1$, i.e. $\mu_x$ is supported on the vertical line $\{(x,y) : y \in R\}$, and $\mu = \int_R \mu_x\, d\alpha$. The converse is of course easier: given $\alpha$ and $\mu_x$ we can construct a unique $\mu$ such that $\mu$ disintegrates as expected.
4.4 Markov Chains
One of the ways of generating a sequence of dependent random variables is to think of a system evolving in time. We have time points that are discrete, say $T = 0, 1, \cdots, N, \cdots$. The state of the system is described by a point $x$ in the state space $X$ of the system. The state space $X$ comes with a natural $\sigma$-field of subsets $F$. At time $0$ the system is in a random state and its distribution is specified by a probability distribution $\mu_0$ on $(X, F)$. At successive times $T = 1, 2, \cdots$ the system changes its state, and given the past history $(x_0, \cdots, x_{k-1})$ of the states of the system at times $T = 0, \cdots, k-1$, the probability that the system finds itself at time $k$ in a subset $A \in F$ is given by $\pi_k(x_0, \cdots, x_{k-1}; A)$. For each $(x_0, \cdots, x_{k-1})$, $\pi_k$ defines a probability measure on $(X, F)$, and for each $A \in F$, $\pi_k(x_0, \cdots, x_{k-1}; A)$ is assumed to be a measurable function of $(x_0, \cdots, x_{k-1})$ on the space $(X^k, F^k)$, which is the product of $k$ copies of the space $(X, F)$ with itself. We can inductively define measures $\mu_k$ on $(X^{k+1}, F^{k+1})$ that describe the probability distribution of the entire history $(x_0, \cdots, x_k)$ of the system through time $k$. To go from $\mu_{k-1}$ to $\mu_k$ we think of $(X^{k+1}, F^{k+1})$ as the product of $(X^k, F^k)$ with $(X, F)$, and construct on $(X^{k+1}, F^{k+1})$ a probability measure with marginal $\mu_{k-1}$ on $(X^k, F^k)$ and conditionals $\pi_k(x_0, \cdots, x_{k-1}; \cdot)$ on the fibers $(x_0, \cdots, x_{k-1}) \times X$. This will define $\mu_k$ and the induction can proceed. We may stop at some finite terminal time $N$ or go on indefinitely. If we do go on indefinitely, we will have a consistent family of finite dimensional distributions $\{\mu_k\}$ on $(X^{k+1}, F^{k+1})$, and we may try to use Kolmogorov's theorem to construct a probability measure $P$ on the space $(X^\infty, F^\infty)$ of sequences $\{x_j : j \ge 0\}$ representing the total evolution of the system for all times.

Remark 4.7. Kolmogorov's theorem requires some assumptions on $(X, F)$ that are satisfied if $X$ is a complete separable metric space and $F$ are the Borel sets. However, in the present context, there is a result known as Tulcea's theorem (see [8]) that proves the existence of a $P$ on $(X^\infty, F^\infty)$ for any choice of $(X, F)$, exploiting the fact that the consistent family of finite dimensional distributions $\mu_k$ arise from well defined successive regular conditional probability distributions.

An important subclass is generated when the transition probability depends on the past history only through the current state. In other words,
$$\pi_k(x_0, \cdots, x_{k-1}; \cdot) = \pi_{k-1,k}(x_{k-1}; \cdot).$$
In such a case the process is called a Markov Process with transition probabilities $\pi_{k-1,k}(\cdot, \cdot)$. An even smaller subclass arises when we demand that $\pi_{k-1,k}(\cdot, \cdot)$ be the same for different values of $k$. A single transition probability $\pi(x, A)$ and the initial distribution $\mu_0$ determine the entire process, i.e. the measure $P$ on $(X^\infty, F^\infty)$. Such processes are called time-homogeneous Markov Processes, or Markov Processes with stationary transition probabilities.
Chapman-Kolmogorov Equations: If we have the transition probabilities $\pi_{k,k+1}$ of transition from time $k$ to $k+1$ of a Markov Chain, it is possible to obtain directly the transition probabilities from time $k$ to $k+\ell$ for any $\ell \ge 2$. We do it by induction on $\ell$. Define
$$\pi_{k,k+\ell+1}(x, A) = \int_X \pi_{k,k+\ell}(x, dy)\, \pi_{k+\ell,k+\ell+1}(y, A) \qquad (4.5)$$
or equivalently, in a more direct fashion,
$$\pi_{k,k+\ell+1}(x, A) = \int_X \cdots \int_X \pi_{k,k+1}(x, dy_{k+1}) \cdots \pi_{k+\ell,k+\ell+1}(y_{k+\ell}, A).$$
Theorem 4.8. The transition probabilities $\pi_{k,m}(\cdot, \cdot)$ satisfy the relations
$$\pi_{k,n}(x, A) = \int_X \pi_{k,m}(x, dy)\, \pi_{m,n}(y, A) \qquad (4.6)$$
for any $k < m < n$, and for the Markov Process defined by the one step transition probabilities $\pi_{k,k+1}(\cdot, \cdot)$, for any $n > m$,
$$P[x_n \in A\,|\,\Sigma_m] = \pi_{m,n}(x_m, A) \quad \text{a.e.},$$
where $\Sigma_m$ is the $\sigma$-field of past history up to time $m$, generated by the coordinates $x_0, x_1, \cdots, x_m$.
Proof. The identity is basically algebra. The multiple integral can be carried out by iteration in any order, and after enough variables are integrated we get our identity. To prove that the conditional probabilities are given by the right formula, we need to establish
$$P[\{x_n \in A\} \cap B] = \int_B \pi_{m,n}(x_m, A)\, dP$$
for all $B \in \Sigma_m$ and $A \in F$. We write
$$P[\{x_n \in A\} \cap B] = \int_{\{x_n \in A\} \cap B} dP$$
$$= \int \cdots \int_{\{x_n \in A\} \cap B} d\mu(x_0)\, \pi_{0,1}(x_0, dx_1) \cdots \pi_{m-1,m}(x_{m-1}, dx_m)\, \pi_{m,m+1}(x_m, dx_{m+1}) \cdots \pi_{n-1,n}(x_{n-1}, dx_n)$$
$$= \int \cdots \int_B d\mu(x_0)\, \pi_{0,1}(x_0, dx_1) \cdots \pi_{m-1,m}(x_{m-1}, dx_m)\, \pi_{m,m+1}(x_m, dx_{m+1}) \cdots \pi_{n-1,n}(x_{n-1}, A)$$
$$= \int \cdots \int_B d\mu(x_0)\, \pi_{0,1}(x_0, dx_1) \cdots \pi_{m-1,m}(x_{m-1}, dx_m)\, \pi_{m,n}(x_m, A)$$
$$= \int_B \pi_{m,n}(x_m, A)\, dP,$$
and we are done.
Remark 4.8. If the chain has stationary transition probabilities, then the transition probabilities $\pi_{m,n}(x, dy)$ from time $m$ to time $n$ depend only on the difference $k = n - m$ and are given by what are usually called the $k$ step transition probabilities. They are defined inductively by
$$\pi^{(k+1)}(x, A) = \int_X \pi^{(k)}(x, dy)\, \pi(y, A)$$
and satisfy the Chapman-Kolmogorov equations
$$\pi^{(k+\ell)}(x, A) = \int_X \pi^{(k)}(x, dy)\, \pi^{(\ell)}(y, A) = \int_X \pi^{(\ell)}(x, dy)\, \pi^{(k)}(y, A).$$
Suppose we have a probability measure $P$ on the product space $X \times Y \times Z$ with the product $\sigma$-field. The Markov property in this context refers to the equality
$$E^P[g(z)|\Sigma_{x,y}] = E^P[g(z)|\Sigma_y] \quad \text{a.e. } P \qquad (4.7)$$
for bounded measurable functions $g$ on $Z$, where we have used $\Sigma_{x,y}$ to denote the $\sigma$-field generated by the projection onto $X \times Y$, and $\Sigma_y$ the corresponding $\sigma$-field generated by the projection onto $Y$. The Markov property in the reverse direction is the similar condition for bounded measurable functions $f$ on $X$:
$$E^P[f(x)|\Sigma_{y,z}] = E^P[f(x)|\Sigma_y] \quad \text{a.e. } P. \qquad (4.8)$$
They look different, but they are both equivalent to the symmetric condition
$$E^P[f(x)g(z)|\Sigma_y] = E^P[f(x)|\Sigma_y]\, E^P[g(z)|\Sigma_y] \quad \text{a.e. } P, \qquad (4.9)$$
which says that given the present, the past and the future are conditionally independent. In view of the symmetry it is sufficient to prove the following:

Theorem 4.9. For any $P$ on $X \times Y \times Z$ the relations (4.7) and (4.9) are equivalent.
Proof. Let us fix $f$ and $g$, and denote the common value in (4.7) by $\hat g(y)$. Then
$$E^P[f(x)g(z)|\Sigma_y] = E^P\big[E^P[f(x)g(z)|\Sigma_{x,y}]\,\big|\,\Sigma_y\big] \quad \text{a.e. } P$$
$$= E^P\big[f(x)\,E^P[g(z)|\Sigma_{x,y}]\,\big|\,\Sigma_y\big] \quad \text{a.e. } P$$
$$= E^P\big[f(x)\,\hat g(y)\,\big|\,\Sigma_y\big] \quad \text{a.e. } P \quad (\text{by } (4.7))$$
$$= E^P\big[f(x)\,\big|\,\Sigma_y\big]\, \hat g(y) \quad \text{a.e. } P$$
$$= E^P\big[f(x)\,\big|\,\Sigma_y\big]\, E^P\big[g(z)\,\big|\,\Sigma_y\big] \quad \text{a.e. } P,$$
which is (4.9). Conversely, we assume (4.9) and denote by $\hat g(x,y)$ and $\hat g(y)$ the expressions on the left and right sides of (4.7). Let $b(y)$ be a bounded measurable function on $Y$. Then
$$E^P[f(x)b(y)\,\hat g(x,y)] = E^P[f(x)b(y)g(z)]$$
$$= E^P\big[b(y)\,E^P[f(x)g(z)|\Sigma_y]\big]$$
$$= E^P\Big[b(y)\,\big(E^P[f(x)|\Sigma_y]\big)\big(E^P[g(z)|\Sigma_y]\big)\Big]$$
$$= E^P\Big[b(y)\,\big(E^P[f(x)|\Sigma_y]\big)\,\hat g(y)\Big]$$
$$= E^P[f(x)b(y)\,\hat g(y)].$$
Since $f$ and $b$ are arbitrary, this implies that $\hat g(x,y) = \hat g(y)$ a.e. $P$.
Let us look at some examples.

1. Suppose we have an urn containing a certain (nonzero) number of balls, some red and others green. A ball is drawn at random and its color is noted. Then it is returned to the urn along with an extra ball of the same color. Then a new ball is drawn at random, and the process continues ad infinitum. The current state of the system can be characterized by two integers $r, g$ such that $r + g \ge 1$. The initial state of the system is some $r_0, g_0$ with $r_0 + g_0 \ge 1$. The system can go from $(r, g)$ to either $(r+1, g)$ with probability $\frac{r}{r+g}$ or to $(r, g+1)$ with probability $\frac{g}{r+g}$. This is clearly an example of a Markov Chain with stationary transition probabilities. (A simulation sketch for this chain follows Exercise 4.11 below.)

2. Consider a queue for service in a store. Suppose at each of the times $1, 2, \cdots$ a random number of new customers arrive and join the queue. If the queue is nonempty at some time, then exactly one customer will be served and will leave the queue at the next time point. The distribution of the number of new arrivals is specified by $\{p_j : j \ge 0\}$, where $p_j$ is the probability that exactly $j$ new customers arrive at a given time. The numbers of new arrivals at distinct times are assumed to be independent. The queue length is then a Markov Chain on the state space $X = \{0, 1, \cdots\}$ of nonnegative integers. The transition probabilities $\pi(i, j)$ are given by $\pi(0, j) = p_j$, because there is no service when nobody is in the queue to begin with, and all the new arrivals join the queue. On the other hand, $\pi(i, j) = p_{j-i+1}$ if $j + 1 \ge i \ge 1$, because one person leaves the queue after being served.

3. Consider a reservoir into which water flows. The amount of additional water flowing into the reservoir on any given day is random, and has a distribution $\alpha$ on $[0, \infty)$. The demand is also random for any given day, with a probability distribution $\beta$ on $[0, \infty)$. We may assume that the inflows and demands on successive days are random variables $\xi_n$ and $\eta_n$ that have $\alpha$ and $\beta$ for their common distributions and are all mutually independent. We may also wish to assume a percentage loss due to evaporation. In any case, the storage levels on successive days obey the recurrence relation
$$S_{n+1} = \big[(1-p)S_n + \xi_n - \eta_n\big]^+,$$
where $p$ is the loss, and we have put in the condition that the outflow is the demand unless the stored amount is less than the demand, in which case the outflow is the available quantity. The current amount in storage is a Markov Process with stationary transition probabilities.
4. Let $X_1, \cdots, X_n, \cdots$ be a sequence of independent random variables with a common distribution $\alpha$. Let $S_n = Y + X_1 + \cdots + X_n$ for $n \ge 1$, with $S_0 = Y$, where $Y$ is a random variable independent of $X_1, \ldots, X_n, \ldots$ with distribution $\mu$. Then $S_n$ is a Markov chain on $R$ with one step transition probability $\pi(x, A) = \alpha(A - x)$ and initial distribution $\mu$. The $n$ step transition probability is $\alpha^{*n}(A - x)$, where $\alpha^{*n}$ is the $n$-fold convolution of $\alpha$. This is often referred to as a random walk.
The last two examples can be described by models of the type
$$x_n = f(x_{n-1}, \xi_n),$$
where $x_n$ is the current state and $\xi_n$ is some random external disturbance. The $\xi_n$ are assumed to be independent and identically distributed. They could have two components, like inflow and demand. The new state is a deterministic function of the old state and the noise.

Exercise 4.11. Verify that the first two examples can be cast in the above form. In fact there is no loss of generality in assuming that the $\xi_j$ are mutually independent random variables having as common distribution the uniform distribution on the interval $[0, 1]$.
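In the spirit of Exercise 4.11, here is a sketch that drives the urn of example 1 by i.i.d. uniform variables through a map $x_n = f(x_{n-1}, \xi_n)$. The encoding of the state as a pair $(r, g)$ and the particular update rule are my own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)

def urn_step(state, xi):
    """One move of the urn chain: state = (r, g), xi uniform on [0, 1]."""
    r, g = state
    return (r + 1, g) if xi < r / (r + g) else (r, g + 1)

state = (1, 1)                      # r_0 = g_0 = 1
for xi in rng.uniform(size=10):     # i.i.d. uniform noise drives the chain
    state = urn_step(state, xi)
print(state)                        # (r, g) after ten draws
```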
Given a Markov Chain with stationary transition probabilities $\pi(x, dy)$ on a state space $(X, F)$, the behavior of $\pi^{(n)}(x, dy)$ for large $n$ is an important and natural question. In the best situation, that of independent random variables, $\pi^{(n)}(x, A) = \mu(A)$ is independent of $x$ as well as $n$. Hopefully, after a long time the chain will forget its origins and $\pi^{(n)}(x, \cdot) \to \mu(\cdot)$, in some suitable sense, for some $\mu$ that does not depend on $x$. If that happens, then from the relation
$$\pi^{(n+1)}(x, A) = \int \pi^{(n)}(x, dy)\, \pi(y, A)$$
we conclude
$$\mu(A) = \int \pi(y, A)\, d\mu(y) \quad \text{for all } A \in F.$$
Measures that satisfy the above property, abbreviated as $\mu\pi = \mu$, are called invariant measures for the Markov Chain. If we start with an initial distribution $\mu$ which is invariant, then the probability measure $P$ has $\mu$ as its marginal at every time. In fact $P$ is stationary, i.e., invariant with respect to time translation, and can be extended to a stationary process where time runs from $-\infty$ to $+\infty$.
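For a finite state space, the invariant measure $\mu\pi = \mu$ is a left eigenvector of the transition matrix for the eigenvalue $1$, and for an irreducible aperiodic chain the rows of $\pi^{(n)}$ converge to it (this is Theorem 4.15 below). A sketch, reusing an arbitrary 3-state matrix:

```python
import numpy as np

pi = np.array([[0.5, 0.5, 0.0],
               [0.1, 0.6, 0.3],
               [0.2, 0.2, 0.6]])

# Left eigenvector for eigenvalue 1: solve mu pi = mu, normalized to sum 1.
w, v = np.linalg.eig(pi.T)
mu = np.real(v[:, np.argmin(abs(w - 1.0))])
mu = mu / mu.sum()

# The rows of pi^(n) converge to mu for this irreducible aperiodic chain.
print(mu)
print(np.linalg.matrix_power(pi, 200)[0])   # ~ the same vector
```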
4.5 Stopping Times and Renewal Times
One of the important notions in the analysis of Markov Chains is the idea of stopping times and renewal times. A function
$$\tau(\omega) : \Omega \to \{n : n \ge 0\} \cup \{+\infty\}$$
is a random variable defined on the space $\Omega = X^\infty$ such that for every $n \ge 0$ the set $\{\omega : \tau(\omega) = n\}$ (or equivalently, for each $n \ge 0$, the set $\{\omega : \tau(\omega) \le n\}$) is measurable with respect to the $\sigma$-field $F_n$ generated by $X_j : 0 \le j \le n$. It is not necessary that $\tau(\omega) < \infty$ for every $\omega$. Such random variables are called stopping times. Examples of stopping times are constant times $n \ge 0$, the first visit to a state $x$, or the second visit to a state $x$. The important thing is that in order to decide if $\tau \le n$, i.e. to know if whatever is supposed to happen did happen before time $n$, the chain need be observed only up to time $n$. Examples of random times that are not stopping times are easy to find: the last time a site is visited is not a stopping time, nor is the first time such that at the next time one is in a state $x$. An important fact is that the Markov property extends to stopping times. Just as we have $\sigma$-fields $F_n$ associated with constant times $n$, we have a $\sigma$-field $F_\tau$ associated to any stopping time $\tau$. This is the information we have when we observe the chain up to time $\tau$. Formally,
$$F_\tau = \big\{A : A \in F \text{ and } A \cap \{\tau \le n\} \in F_n \text{ for each } n\big\}.$$
One can check from the definition that $\tau$ is $F_\tau$ measurable, and so is $X_\tau$ on the set $\tau < \infty$. If $\tau$ is the time of first visit to $y$, then $\tau$ is a stopping time, and the event that the chain visits a state $z$ before visiting $y$ is $F_\tau$ measurable.
Lemma 4.10. (Strong Markov Property.) At any stopping time $\tau$ the Markov property holds, in the sense that the conditional distribution of $X_{\tau+1}, \cdots, X_{\tau+n}, \cdots$, conditioned on $F_\tau$, is the same as that of the original chain starting from the state $x = X_\tau$, on the set $\tau < \infty$. In other words,
$$P_x\big\{X_{\tau+1} \in A_1, \cdots, X_{\tau+n} \in A_n\,\big|\,F_\tau\big\} = \int_{A_1} \cdots \int_{A_n} \pi(X_\tau, dx_1) \cdots \pi(x_{n-1}, dx_n)$$
a.e. on $\{\tau < \infty\}$.

Proof. Let $A \in F_\tau$ be given with $A \subset \{\tau < \infty\}$. Then
$$P_x\big\{A \cap \{X_{\tau+1} \in A_1, \cdots, X_{\tau+n} \in A_n\}\big\}$$
$$= \sum_k P_x\big\{A \cap \{\tau = k\} \cap \{X_{k+1} \in A_1, \cdots, X_{k+n} \in A_n\}\big\}$$
$$= \sum_k \int_{A \cap \{\tau = k\}} \int_{A_1} \cdots \int_{A_n} \pi(X_k, dx_{k+1}) \cdots \pi(x_{k+n-1}, dx_{k+n})\, dP_x$$
$$= \int_A \int_{A_1} \cdots \int_{A_n} \pi(X_\tau, dx_1) \cdots \pi(x_{n-1}, dx_n)\, dP_x.$$
We have used the fact that if $A \in F_\tau$ then $A \cap \{\tau = k\} \in F_k$ for every $k \ge 0$.

Remark 4.9. If $X_\tau = y$ a.e. with respect to $P_x$ on the set $\tau < \infty$, then at time $\tau$, when it is finite, the process starts afresh with no memory of the past, and will have conditionally the same probabilities in the future as $P_y$. At such times the process renews itself, and these times are called renewal times.
4.6 Countable State Space
From the point of view of analysis, a particularly simple situation is when the state space $X$ is a countable set; it can be taken to be the integers $\{x : x \ge 1\}$. Many applications fall into this category, and an understanding of what happens in this situation will tell us what to expect in general.

The one step transition probability is a matrix $\pi(x, y)$ with nonnegative entries such that $\sum_y \pi(x, y) = 1$ for each $x$. Such matrices are called stochastic matrices. The $n$ step transition matrix is just the $n$-th power of the matrix, defined inductively by
$$\pi^{(n+1)}(x, y) = \sum_z \pi^{(n)}(x, z)\, \pi(z, y).$$
To be consistent, one defines $\pi^{(0)}(x, y) = \delta_{x,y}$, which is $1$ if $x = y$ and $0$ otherwise. The problem is to analyze the behaviour for large $n$ of $\pi^{(n)}(x, y)$. A state $x$ is said to communicate with a state $y$ if $\pi^{(n)}(x, y) > 0$ for some $n \ge 1$. We will assume for simplicity that every state communicates with every other state. Such Markov Chains are called irreducible. Let us first limit ourselves to the study of irreducible chains. Given an irreducible Markov chain with transition probabilities $\pi(x, y)$, we define $f_n(x)$ as the probability of returning to $x$ for the first time at the $n$-th step, assuming that the chain starts from the state $x$. Using the convention that $P_x$ refers to the measure on sequences for the chain starting from $x$ and $\{X_j\}$ are the successive positions of the chain,
$$f_n(x) = P_x\big\{X_j \ne x \text{ for } 1 \le j \le n-1 \text{ and } X_n = x\big\} = \sum_{y_1 \ne x} \cdots \sum_{y_{n-1} \ne x} \pi(x, y_1)\, \pi(y_1, y_2) \cdots \pi(y_{n-1}, x).$$
Since the $f_n(x)$ are probabilities of disjoint events, $\sum_n f_n(x) \le 1$. The state $x$ is called transient if $\sum_n f_n(x) < 1$ and recurrent if $\sum_n f_n(x) = 1$. The recurrent case is divided into two situations. If we denote by $\tau_x = \inf\{n \ge 1 : X_n = x\}$ the time of first visit to $x$, then recurrence is $P_x\{\tau_x < \infty\} = 1$. A recurrent state $x$ is called positive recurrent if
$$E^{P_x}\{\tau_x\} = \sum_{n \ge 1} n f_n(x) < \infty$$
and null recurrent if
$$E^{P_x}\{\tau_x\} = \sum_{n \ge 1} n f_n(x) = \infty.$$

Lemma 4.11. If, for a (not necessarily irreducible) chain starting from $x$, the probability of ever visiting $y$ is positive, then so is the probability of visiting $y$ before returning to $x$.
Proof. Assume that for the chain starting from $x$ the probability of visiting $y$ before returning to $x$ is zero. But when it returns to $x$ it starts afresh, and so will not visit $y$ before it returns again. This reasoning can be repeated, and so the chain will have to visit $x$ infinitely often before visiting $y$. But this will use up all the time, and so it cannot visit $y$ at all.

Lemma 4.12. For an irreducible chain all states $x$ are of the same type.

Proof. Let $x$ be recurrent and $y$ be given. Since the chain is irreducible, for some $k$, $\pi^{(k)}(x, y) > 0$. By the previous lemma, for the chain starting from $x$, there is a positive probability of visiting $y$ before returning to $x$. After each successive return to $x$, the chain starts afresh, and there is a fixed positive probability of visiting $y$ before the next return to $x$. Since there are infinitely many returns to $x$, $y$ will be visited infinitely many times as well; so $y$ is also a recurrent state.

We now prove that if $x$ is positive recurrent then so is $y$. We saw already that the probability $p = P_x\{\tau_y < \tau_x\}$ of visiting $y$ before returning to $x$ is positive. Clearly
$$E^{P_x}\{\tau_x\} \ge P_x\{\tau_y < \tau_x\}\, E^{P_y}\{\tau_x\}$$
and therefore
$$E^{P_y}\{\tau_x\} \le \frac{1}{p}\, E^{P_x}\{\tau_x\} < \infty.$$
On the other hand, we can write
$$E^{P_x}\{\tau_y\} \le \int_{\tau_y < \tau_x} \tau_x\, dP_x + \int_{\tau_x < \tau_y} \tau_y\, dP_x$$
$$= \int_{\tau_y < \tau_x} \tau_x\, dP_x + \int_{\tau_x < \tau_y} \big\{\tau_x + E^{P_x}\{\tau_y\}\big\}\, dP_x$$
$$= \int_{\tau_y < \tau_x} \tau_x\, dP_x + \int_{\tau_x < \tau_y} \tau_x\, dP_x + (1-p)\, E^{P_x}\{\tau_y\}$$
$$= \int \tau_x\, dP_x + (1-p)\, E^{P_x}\{\tau_y\}$$
by the renewal property at the stopping time $\tau_x$. Therefore
$$E^{P_x}\{\tau_y\} \le \frac{1}{p}\, E^{P_x}\{\tau_x\}.$$
We also have
$$E^{P_y}\{\tau_y\} \le E^{P_y}\{\tau_x\} + E^{P_x}\{\tau_y\} \le \frac{2}{p}\, E^{P_x}\{\tau_x\},$$
proving that $y$ is positive recurrent.
Transient Case: We have the following theorem regarding transience.

Theorem 4.13. An irreducible chain is transient if and only if
$$G(x, y) = \sum_{n=0}^{\infty} \pi^{(n)}(x, y) < \infty \quad \text{for all } x, y.$$
Moreover, for any two states $x$ and $y$,
$$G(x, y) = f(x, y)\, G(y, y)$$
and
$$G(x, x) = \frac{1}{1 - f(x, x)},$$
where $f(x, y) = P_x\{\tau_y < \infty\}$.

Proof. Each time the chain returns to $x$ there is a probability $1 - f(x, x)$ of never returning. The number of returns then has the geometric distribution
$$P_x\{\text{exactly } n \text{ returns to } x\} = (1 - f(x, x))\, f(x, x)^n,$$
and the expected number of returns is given by
$$\sum_{k=1}^{\infty} \pi^{(k)}(x, x) = \frac{f(x, x)}{1 - f(x, x)}.$$
The left hand side comes from the calculation
$$E^{P_x} \sum_{k=1}^{\infty} 1_{\{x\}}(X_k) = \sum_{k=1}^{\infty} \pi^{(k)}(x, x),$$
and the right hand side from the calculation of the mean of a geometric distribution. Since we count the visit at time $0$ as a visit to $x$, we add $1$ to both sides to get our formula. If we want to calculate the expected number of visits to $y$ when we start from $x$, first we have to get to $y$, and the probability of that is $f(x, y)$. Then by the renewal property it is exactly the same as the expected number of visits to $y$ starting from $y$, including the visit at time $0$, and that equals $G(y, y)$.
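The identities of Theorem 4.13 lend themselves to simulation. The sketch below is an illustration, not from the text: it uses a biased random walk on the integers, for which the return probability $f(0,0) = 2\min(p, 1-p)$ is a classical fact, and compares a Monte Carlo estimate of $G(0,0)$ with $1/(1 - f(0,0))$.

```python
import numpy as np

rng = np.random.default_rng(3)
p = 0.7                 # biased walk: step +1 w.p. p, -1 w.p. 1-p (transient)

def visits_to_zero(horizon=2000):
    """Count visits to 0 (including time 0) along one path started at 0."""
    pos, count = 0, 1
    for step in 2 * (rng.uniform(size=horizon) < p) - 1:
        pos += step
        count += pos == 0
    return count

G = np.mean([visits_to_zero() for _ in range(2000)])   # estimate of G(0,0)
# For this walk f(0,0) = 2 min(p, 1-p) = 0.6, so G(0,0) = 1/(1-0.6) = 2.5.
print(G)
```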
Before we study the recurrent behavior we need the notion of periodicity. For each state $x$ let us define $D_x = \{n : \pi^{(n)}(x, x) > 0\}$ to be the set of times at which a return to $x$ is possible if one starts from $x$. We define $d_x$ to be the greatest common divisor of $D_x$.
Lemma 4.14. For any irreducible chain, $d_x = d$ for all $x \in X$, and for each $x$, $D_x$ contains all sufficiently large multiples of $d$.

Proof. Let us define
$$D_{x,y} = \{n : \pi^{(n)}(x, y) > 0\},$$
so that $D_x = D_{x,x}$. By the Chapman-Kolmogorov equations
$$\pi^{(m+n)}(x, y) \ge \pi^{(m)}(x, z)\, \pi^{(n)}(z, y)$$
for every $z$, so that if $m \in D_{x,z}$ and $n \in D_{z,y}$, then $m + n \in D_{x,y}$. In particular, if $m, n \in D_x$ it follows that $m + n \in D_x$. Since any pair of states communicate with each other, given $x, y \in X$ there are positive integers $n_1$ and $n_2$ such that $n_1 \in D_{x,y}$ and $n_2 \in D_{y,x}$. This implies that, with the choice of $\ell = n_1 + n_2$, $n + \ell \in D_x$ whenever $n \in D_y$; similarly $n + \ell \in D_y$ whenever $n \in D_x$. Since $\ell$ itself belongs to both $D_x$ and $D_y$, both $d_x$ and $d_y$ divide $\ell$. Suppose $n \in D_x$. Then $n + \ell \in D_y$ and therefore $d_y$ divides $n + \ell$. Since $d_y$ divides $\ell$, $d_y$ must divide $n$. Since this is true for every $n \in D_x$ and $d_x$ is the greatest common divisor of $D_x$, $d_y$ must divide $d_x$. Similarly $d_x$ must divide $d_y$. Hence $d_x = d_y$. We now complete the proof of the lemma. Let $d$ be the greatest common divisor of $D_x$. Then it is the greatest common divisor of a finite subset $n_1, n_2, \cdots, n_q$ of $D_x$, and there will exist integers $a_1, a_2, \cdots, a_q$ such that
$$a_1 n_1 + a_2 n_2 + \cdots + a_q n_q = d.$$
Some of the $a$'s will be positive and others negative. Separating them out, and remembering that all the $n_i$ are divisible by $d$, we find two integers $md$ and $(m+1)d$ that both belong to $D_x$. If now $n = kd$ with $k > m^2$, we can write $k = \ell m + r$ with $\ell \ge m$ and remainder $r$ less than $m$. Then
$$kd = (\ell m + r)d = \ell md + r(m+1)d - rmd = (\ell - r)\, md + r\, (m+1)d \in D_x$$
since $\ell > r$.
Remark 4.10. For an irreducible chain the common value $d$ is called the period of the chain, and an irreducible chain with period $d = 1$ is called aperiodic.

The simplest example of a periodic chain is one with two states, where the chain shuttles back and forth between the two: $\pi(x, y) = 1$ if $x \ne y$ and $0$ if $x = y$. A simple calculation yields $\pi^{(n)}(x, x) = 1$ if $n$ is even and $0$ otherwise. There is oscillatory behavior in $n$ that persists. The main theorem for irreducible, aperiodic, recurrent chains is the following.
Theorem 4.15. Let $\pi(x, y)$ be the one step transition probability for a recurrent aperiodic Markov chain, and let $\pi^{(n)}(x, y)$ be the $n$-step transition probabilities. If the chain is null recurrent, then
$$\lim_{n\to\infty} \pi^{(n)}(x, y) = 0 \quad \text{for all } x, y.$$
If the chain is positive recurrent, then of course $E^{P_x}\{\tau_x\} = m(x) < \infty$ for all $x$, and in that case
$$\lim_{n\to\infty} \pi^{(n)}(x, y) = q(y) = \frac{1}{m(y)}$$
exists for all $x$ and $y$, is independent of the starting point $x$, and $\sum_y q(y) = 1$.
The proof is based on:

Theorem 4.16. (Renewal Theorem.) Let $\{f_n : n \ge 1\}$ be a sequence of nonnegative numbers such that
$$\sum_n f_n = 1, \qquad \sum_n n f_n = m \le \infty,$$
and the greatest common divisor of $\{n : f_n > 0\}$ is $1$. Suppose that $\{p_n : n \ge 0\}$ are defined by $p_0 = 1$ and recursively by
$$p_n = \sum_{j=1}^{n} f_j\, p_{n-j}. \qquad (4.10)$$
Then
$$\lim_{n\to\infty} p_n = \frac{1}{m},$$
where if $m = \infty$ the right hand side is taken as $0$.
Proof. The proof is based on several steps.

Step 1. We have inductively $p_n \le 1$. Let $a = \limsup_{n\to\infty} p_n$. We can choose a subsequence $n_k$ such that $p_{n_k} \to a$. We can assume without loss of generality that $p_{n_k + j} \to q_j$ as $k \to \infty$ for all positive and negative integers $j$ as well. Of course the limit $q_0$ for $j = 0$ is $a$. In relation (4.10) we can pass to the limit along the subsequence and use the dominated convergence theorem to obtain
$$q_n = \sum_{j=1}^{\infty} f_j\, q_{n-j} \qquad (4.11)$$
valid for $-\infty < n < \infty$. In particular,
$$q_0 = \sum_{j=1}^{\infty} f_j\, q_{-j}. \qquad (4.12)$$

Step 2: Because $a = \limsup p_n$ we can conclude that $q_j \le a$ for all $j$. If we denote by $S = \{n : f_n > 0\}$, then (4.12) forces $q_{-k} = a$ for $k \in S$. We can then deduce from equation (4.11) that $q_{-k} = a$ for $k = k_1 + k_2$ with $k_1, k_2 \in S$. By repeating the same reasoning, $q_{-k} = a$ for $k = k_1 + k_2 + \cdots + k_\ell$ with $k_i \in S$. By Lemma 3.6, because the greatest common factor of the integers in $S$ is $1$, there is a $k_0$ such that for $k \ge k_0$ we have $q_{-k} = a$. We now apply the relation (4.11) again to conclude that $q_j = a$ for all positive as well as negative $j$.
Step 3: If we add up equation (4.10) for $n = 1, \cdots, N$ we get
$$p_1 + p_2 + \cdots + p_N = (f_1 + f_2 + \cdots + f_N) + (f_1 + f_2 + \cdots + f_{N-1})\, p_1 + \cdots + (f_1 + f_2 + \cdots + f_{N-k})\, p_k + \cdots + f_1\, p_{N-1}.$$
If we denote by $T_j = \sum_{i \ge j} f_i$, we have $T_1 = 1$ and $\sum_{j \ge 1} T_j = m$. We can now rewrite the identity as
$$\sum_{j=1}^{N} T_j\, p_{N-j+1} = \sum_{j=1}^{N} f_j.$$

Step 4: Because $p_{N-j} \to a$ for every $j$ along the subsequence $N = n_k$, if $\sum_j T_j = m < \infty$ we can deduce from the dominated convergence theorem that $ma = 1$, and we conclude that
$$\limsup_{n\to\infty} p_n = \frac{1}{m}.$$
If $\sum_j T_j = \infty$, by Fatou's Lemma $a = 0$. Exactly the same argument applies to the liminf, and we conclude that
$$\liminf_{n\to\infty} p_n = \frac{1}{m}.$$
This concludes the proof of the renewal theorem.
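The renewal theorem is easy to observe numerically: iterate the recursion (4.10) for a concrete return-time distribution $\{f_n\}$ with greatest common divisor $1$ (the choice below is an arbitrary illustration) and watch $p_n$ settle at $1/m$.

```python
# f_n supported on {1, 3}, gcd 1: f_1 = 0.4, f_3 = 0.6, so m = 0.4*1 + 0.6*3 = 2.2
f = {1: 0.4, 3: 0.6}
m = sum(n * w for n, w in f.items())

N = 60
p = [1.0] + [0.0] * N                       # p_0 = 1
for n in range(1, N + 1):
    p[n] = sum(w * p[n - j] for j, w in f.items() if j <= n)   # recursion (4.10)

print(p[N], 1.0 / m)                        # both ~ 0.4545...
```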
We now turn to the

Proof (of Theorem 4.15). If we take a fixed $x \in X$ and consider $f_n = P_x\{\tau_x = n\}$, then $f_n$ and $p_n = \pi^{(n)}(x, x)$ are related by (4.10), and $m = E^{P_x}\{\tau_x\}$. In order to apply the renewal theorem, we need to establish that the greatest common divisor of $S = \{n : f_n > 0\}$ is $1$. In general, if $f_n > 0$ so is $p_n$, so the greatest common divisor of $S$ is always at least that of $\{n : p_n > 0\}$. That by itself does not help us, even though the greatest common divisor of $\{n : p_n > 0\}$ is $1$. On the other hand, if $f_n = 0$ unless $n = kd$ for some $k$, the relation (4.10) can be used inductively to conclude that the same is true of $p_n$. Hence both sets have the same greatest common divisor. We can now conclude that
$$\lim_{n\to\infty} \pi^{(n)}(x, x) = q(x) = \frac{1}{m(x)}.$$
On the other hand, if $f_n(x, y) = P_x\{\tau_y = n\}$, then
$$\pi^{(n)}(x, y) = \sum_{k=1}^{n} f_k(x, y)\, \pi^{(n-k)}(y, y),$$
and recurrence implies $\sum_{k \ge 1} f_k(x, y) = 1$ for all $x$ and $y$. Therefore
$$\lim_{n\to\infty} \pi^{(n)}(x, y) = q(y) = \frac{1}{m(y)},$$
independent of the starting point $x$. In order to complete the proof we have to establish that
$$Q = \sum_y q(y) = 1.$$
It is clear by Fatou's lemma that $\sum_y q(y) = Q \le 1$. By letting $n \to \infty$ in the Chapman-Kolmogorov equation
$$\pi^{(n+1)}(x, y) = \sum_z \pi^{(n)}(x, z)\, \pi(z, y)$$
and using Fatou's lemma we get
$$q(y) \ge \sum_z \pi(z, y)\, q(z).$$
Summing with respect to $y$ we obtain
$$Q \ge \sum_{z,y} \pi(z, y)\, q(z) = Q,$$
so equality holds in this relation. Therefore
$$q(y) = \sum_z \pi(z, y)\, q(z)$$
for every $y$, i.e. $q(\cdot)$ is an invariant measure. By iteration,
$$q(y) = \sum_z \pi^{(n)}(z, y)\, q(z),$$
and if we let $n \to \infty$ again, an application of the bounded convergence theorem yields
$$q(y) = Q\, q(y),$$
implying $Q = 1$, and we are done.
Let us now consider an irreducible Markov Chain with one step transition probability $\pi(x, y)$ that is periodic with period $d > 1$. Let us choose and fix a reference point $x_0 \in X$. For each $x \in X$ let $D_{x_0,x} = \{n : \pi^{(n)}(x_0, x) > 0\}$.

Lemma 4.17. If $n_1, n_2 \in D_{x_0,x}$ then $d$ divides $n_1 - n_2$.

Proof. Since the chain is irreducible, there is an $m$ such that $\pi^{(m)}(x, x_0) > 0$. By the Chapman-Kolmogorov equations $\pi^{(m+n_i)}(x_0, x_0) > 0$ for $i = 1, 2$. Therefore $m + n_i \in D_{x_0} = D_{x_0,x_0}$ for $i = 1, 2$. This implies that $d$ divides both $m + n_1$ as well as $m + n_2$. Thus $d$ divides $n_1 - n_2$.
132 CHAPTER 4. DEPENDENT RANDOM VARIABLES
The residues modulo $d$ of all the integers in $D_{x_0, x}$ are therefore the same and equal
some number $r(x)$ satisfying $0 \le r(x) \le d - 1$. By definition $r(x_0) = 0$. Let
us define $X_j = \{x : r(x) = j\}$. Then $\{X_j : 0 \le j \le d-1\}$ is a partition of $X$
into disjoint sets with $x_0 \in X_0$.
Lemma 4.18. If $x \in X$, then $\pi^{(n)}(x, y) = 0$ unless $r(x) + n = r(y)$ mod $d$.

Proof. Suppose that $x \in X$ and $\pi(x, y) > 0$. Then if $m \in D_{x_0, x}$, then
$(m + 1) \in D_{x_0, y}$. Therefore $r(x) + 1 = r(y)$ modulo $d$. The proof can be
completed by induction. The chain marches through $\{X_j\}$ in a cyclical way,
from a state in $X_j$ to one in $X_{j+1}$ (with $j+1$ taken modulo $d$).
Theorem 4.19. Let $X$ be irreducible and positive recurrent with period $d$.
Then
$$\lim_{\substack{n\to\infty \\ n + r(x) = r(y) \text{ mod } d}} \pi^{(n)}(x, y) = \frac{d}{m(y)}.$$
Of course $\pi^{(n)}(x, y) = 0$ unless $n + r(x) = r(y)$ modulo $d$.

Proof. If we replace $\pi$ by $\hat\pi$ where $\hat\pi(x, y) = \pi^{(d)}(x, y)$, then $\hat\pi(x, y) = 0$ unless
both $x$ and $y$ are in the same $X_j$. The restriction of $\hat\pi$ to each $X_j$ defines an
irreducible aperiodic Markov chain. Since each time step under $\hat\pi$ is actually
$d$ units of time, we can apply the earlier results and we get, for $x, y \in X_j$
for some $j$,
$$\lim_{k\to\infty} \pi^{(k d)}(x, y) = \frac{d}{m(y)}.$$
We note that
$$\pi^{(n)}(x, y) = \sum_{1 \le m \le n} f_m(x, y)\, \pi^{(n-m)}(y, y),$$
$$f_m(x, y) = P_x\{\tau_y = m\} = 0 \quad \text{unless } r(x) + m = r(y) \text{ modulo } d,$$
$$\pi^{(n-m)}(y, y) = 0 \quad \text{unless } n - m = 0 \text{ modulo } d,$$
$$\sum_m f_m(x, y) = 1.$$
The theorem now follows.
Suppose now we have a chain that is not irreducible. Let us collect all
the transient states and call the set $X_{tr}$. The complement consists of all the
recurrent states and will be denoted by $X_{re}$.

Lemma 4.20. If $x \in X_{re}$ and $y \in X_{tr}$, then $\pi(x, y) = 0$.

Proof. If $x$ is a recurrent state and $\pi(x, y) > 0$, the chain will return to $x$
infinitely often, and each time there is a positive probability of visiting $y$. By
the renewal property these are independent events, and so $y$ will be recurrent
too.
The set of recurrent states $X_{re}$ can be divided into one or more equivalence
classes according to the following procedure. Two recurrent states $x$ and $y$ are
in the same equivalence class if $f(x, y) = P_x\{\tau_y < \infty\}$, the probability of ever
visiting $y$ starting from $x$, is positive. Because of recurrence, if $f(x, y) > 0$ then
$f(x, y) = f(y, x) = 1$. The restriction of the chain to a single equivalence class
is irreducible and possibly periodic. Different equivalence classes could have
different periods; some could be positive recurrent and others null recurrent.
We can combine all our observations into the following theorem.
Theorem 4.21. If $y$ is transient then $\sum_n \pi^{(n)}(x, y) < \infty$ for all $x$. If $y$
is null recurrent (belongs to an equivalence class that is null recurrent) then
$\pi^{(n)}(x, y) \to 0$ for all $x$, but $\sum_n \pi^{(n)}(x, y) = \infty$ if $x$ is in the same equivalence
class or $x \in X_{tr}$ with $f(x, y) > 0$. In all other cases $\pi^{(n)}(x, y) = 0$ for all
$n \ge 1$. If $y$ is positive recurrent and belongs to an equivalence class with
period $d$ with $m(y) = E^{P_y}\{\tau_y\}$, then for a nontransient $x$, $\pi^{(n)}(x, y) = 0$
unless $x$ is in the same equivalence class and $r(x) + n = r(y)$ modulo $d$. In
such a case,
$$\lim_{\substack{n\to\infty \\ r(x) + n = r(y) \text{ mod } d}} \pi^{(n)}(x, y) = \frac{d}{m(y)}.$$
If $x$ is transient then
$$\lim_{\substack{n\to\infty \\ n = r \text{ mod } d}} \pi^{(n)}(x, y) = f(r, x, y)\, \frac{d}{m(y)}$$
where
$$f(r, x, y) = P_x\{X_{kd+r} = y \text{ for some } k \ge 0\}.$$
Proof. The only statement that needs an explanation is the last one. The
chain starting from a transient state $x$ may at some time get into a positive
recurrent equivalence class $X_j$ with period $d$. If it does, it never leaves that
class and so gets absorbed in that class. The probability of this is $f(x, y)$,
where $y$ can be any state in $X_j$. However, if the period $d$ is greater than
1, there will be cyclical subclasses $C_1, \dots, C_d$ of $X_j$. Depending on which
subclass the chain enters and when, the phase of its future is determined.
There are $d$ such possible phases. For instance, if the subclasses are ordered
in the correct way, getting into $C_1$ at time $n$ is the same as getting into
$C_2$ at time $n + 1$ and so on. $f(r, x, y)$ is the probability of getting into the
equivalence class in a phase that visits the cyclical subclass containing $y$ at
times $n$ that are equal to $r$ modulo $d$.
Example 4.1. (Simple Random Walk). If $X = Z^d$, the integral lattice in $R^d$, a random walk is a Markov chain
with transition probability $\pi(x, y) = p(y - x)$ where $\{p(z)\}$ specifies the
probability distribution of a single step. We will assume for simplicity that
$p(z) = 0$ except when $z \in F$, where $F$ consists of the $2d$ neighbors of 0, and
$p(z) = \frac{1}{2d}$ for each $z \in F$. For $\theta \in R^d$ the characteristic function $\hat p(\theta)$ of
$p(\cdot)$ is given by $\frac{1}{d}(\cos\theta_1 + \cos\theta_2 + \cdots + \cos\theta_d)$. The chain is easily seen to be
irreducible, but periodic of period 2. Return to the starting point is possible
only after an even number of steps.
$$\pi^{(2n)}(0, 0) = \Big(\frac{1}{2\pi}\Big)^d \int_{T^d} [\hat p(\theta)]^{2n}\, d\theta \simeq C\, n^{-\frac{d}{2}}.$$
To see this asymptotic behavior let us first note that the integration can be
restricted to the set where $|\hat p(\theta)| \ge 1 - \delta$, i.e. near the two points $(0, 0, \dots, 0)$
and $(\pi, \pi, \dots, \pi)$ where $|\hat p(\theta)| = 1$. Since the behavior is similar at both
points let us concentrate near the origin. There
$$\frac{1}{d} \sum_{j=1}^{d} \cos\theta_j \le 1 - c \sum_j \theta_j^2 \le \exp\Big[-c \sum_j \theta_j^2\Big]$$
for some $c > 0$ and
$$\Big(\frac{1}{d} \sum_{j=1}^{d} \cos\theta_j\Big)^{2n} \le \exp\Big[-2nc \sum_j \theta_j^2\Big],$$
and with a change of variables the upper bound is clear. We have a similar
lower bound as well. The random walk is recurrent if $d = 1$ or 2 but transient
if $d \ge 3$.
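The dichotomy can be made tangible by simulation: the empirical probability of returning to the origin within a long but finite horizon is close to 1 in $d = 1, 2$ and clearly bounded away from 1 in $d = 3$. A rough sketch (horizon and trial counts are arbitrary choices; in $d = 2$ the approach to 1 is logarithmically slow, so the finite-horizon figure understates the true return probability):

```python
import random

def returns_to_origin(d, steps):
    """Run a nearest-neighbor walk on Z^d; report whether it revisits 0."""
    x = [0] * d
    for _ in range(steps):
        i = random.randrange(d)
        x[i] += random.choice((-1, 1))
        if all(c == 0 for c in x):
            return True
    return False

random.seed(0)
for d in (1, 2, 3):
    trials = 1000
    hits = sum(returns_to_origin(d, 5000) for _ in range(trials))
    print(f"d={d}: empirical return probability ~ {hits / trials:.3f}")
```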
Exercise 4.12. If the distribution $p(\cdot)$ is arbitrary, determine when the chain
is irreducible and when it is irreducible and aperiodic.

Exercise 4.13. If $\sum_z z\,p(z) = m \ne 0$, conclude that the chain is transient by
an application of the strong law of large numbers.

Exercise 4.14. If $\sum_z z\,p(z) = m = 0$, and if the covariance matrix, given by
$\sum_z z_i z_j\, p(z) = \sigma_{i,j}$, is nondegenerate, show that the transience or recurrence
is determined by the dimension as in the case of the nearest neighbor random
walk.
Exercise 4.15. Can you make sense of the formal calculation
$$\sum_n \pi^{(n)}(0, 0) = \sum_n \Big(\frac{1}{2\pi}\Big)^d \int_{T^d} [\hat p(\theta)]^n\, d\theta = \Big(\frac{1}{2\pi}\Big)^d \int_{T^d} \frac{1}{1 - \hat p(\theta)}\, d\theta = \Big(\frac{1}{2\pi}\Big)^d \int_{T^d} \mathrm{Re}\,\Big[\frac{1}{1 - \hat p(\theta)}\Big]\, d\theta$$
to conclude that a necessary and sufficient condition for transience or recurrence
is the convergence or divergence of the integral
$$\int_{T^d} \mathrm{Re}\,\Big[\frac{1}{1 - \hat p(\theta)}\Big]\, d\theta,$$
with an integrand $\mathrm{Re}\,\big[\frac{1}{1 - \hat p(\theta)}\big]$ that is seen to be nonnegative?

Hint: Consider instead the sum
$$\sum_{n=0}^{\infty} \sigma^n\, \pi^{(n)}(0, 0) = \sum_n \Big(\frac{1}{2\pi}\Big)^d \int_{T^d} \sigma^n [\hat p(\theta)]^n\, d\theta = \Big(\frac{1}{2\pi}\Big)^d \int_{T^d} \frac{1}{1 - \sigma\hat p(\theta)}\, d\theta = \Big(\frac{1}{2\pi}\Big)^d \int_{T^d} \mathrm{Re}\,\Big[\frac{1}{1 - \sigma\hat p(\theta)}\Big]\, d\theta$$
for $0 < \sigma < 1$ and let $\sigma \uparrow 1$.
Example 4.2. (The Queue Problem). In the example of customers arriving, except in the trivial cases of $p_0 = 0$
or $p_0 + p_1 = 1$, the chain is irreducible and aperiodic. Since the service
rate is at most 1, if the arrival rate $m = \sum_j j\, p_j > 1$ then the queue will
get longer, and by an application of the law of large numbers it is seen that the
queue length will become infinite as time progresses. This is the transient
behavior of the queue. If $m < 1$, one can expect the situation to be stable and
there should be an asymptotic distribution for the queue length. If $m = 1$,
it is the borderline case and one should probably expect this to be the null
recurrent case. The actual proofs are not hard. In time $n$ the actual number
of customers served is at most $n$, because the queue may sometimes be empty.
If $\{\xi_i : i \ge 1\}$ are the numbers of new customers arriving at time $i$ and $X_0$ is
the initial number in the queue, then the number $X_n$ in the queue at time $n$
satisfies $X_n \ge X_0 + (\sum_{i=1}^{n} \xi_i) - n$, and if $m > 1$ it follows from the law of large
numbers that $\lim_{n\to\infty} X_n = +\infty$, thereby establishing transience. To prove
positive recurrence when $m < 1$ it is sufficient to prove that the equations
$$\sum_x q(x)\,\pi(x, y) = q(y)$$
have a nontrivial nonnegative solution such that $\sum_x q(x) < \infty$. We shall
proceed to show that this is indeed the case. Since the equation is linear we
can always normalize the solution so that $\sum_x q(x) = 1$. By iteration
$$\sum_x q(x)\,\pi^{(n)}(x, y) = q(y)$$
for every $n$. If $\lim_{n\to\infty} \pi^{(n)}(x, y) = 0$ for every $x$ and $y$, because $\sum_x q(x) = 1 < \infty$,
by the bounded convergence theorem the left-hand side tends to 0 as $n \to \infty$;
therefore $q \equiv 0$ and is trivial. This rules out the transient
and the null recurrent cases. In our case $\pi(0, y) = p_y$ and $\pi(x, y) = p_{y-x+1}$
if $y \ge x - 1$ and $x \ge 1$. In all other cases $\pi(x, y) = 0$. The equations for
$\{q_x = q(x)\}$ are then
$$q_0\, p_y + \sum_{x=1}^{y+1} q_x\, p_{y-x+1} = q_y \quad \text{for } y \ge 0. \tag{4.13}$$
Multiplying equation (4.13) by $z^y$ and summing over $y \ge 0$, we get
$$q_0\, P(z) + \frac{1}{z}\, P(z)\,[Q(z) - q_0] = Q(z)$$
where $P(z)$ and $Q(z)$ are the generating functions
$$P(z) = \sum_{x=0}^{\infty} p_x z^x, \qquad Q(z) = \sum_{x=0}^{\infty} q_x z^x.$$
We can solve for $Q$ to get
$$\frac{Q(z)}{q_0} = P(z)\,\Big[1 - \frac{P(z) - 1}{z - 1}\Big]^{-1} = P(z) \sum_{k=0}^{\infty} \Big[\frac{P(z) - 1}{z - 1}\Big]^k = P(z) \sum_{k=0}^{\infty} \Big[\sum_{j=1}^{\infty} p_j\,(1 + z + \cdots + z^{j-1})\Big]^k,$$
which is a power series in $z$ with nonnegative coefficients. If $m < 1$, we can let
$z \uparrow 1$ to get
$$\frac{Q(1)}{q_0} = \sum_{k=0}^{\infty} \Big[\sum_{j=1}^{\infty} j\, p_j\Big]^k = \sum_{k=0}^{\infty} m^k = \frac{1}{1 - m} < \infty,$$
proving positive recurrence.
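Equation (4.13) also yields the invariant distribution by forward substitution: the $x = y+1$ term contributes $q_{y+1}p_0$, so $q_{y+1}$ is determined by $q_0, \dots, q_y$. A sketch, with an arbitrary arrival distribution of mean $m < 1$ chosen for illustration:

```python
# Invariant distribution of the queue chain from equation (4.13),
# computed by forward substitution.  The arrival distribution p is an
# arbitrary example with mean m = 0.7 < 1.
p = [0.5, 0.3, 0.2]
def parr(j):                            # p_j, zero beyond the list
    return p[j] if 0 <= j < len(p) else 0.0

N = 200                                 # truncation level
q = [1.0]                               # set q_0 = 1, normalize at the end
for y in range(N):
    s = q[0] * parr(y) + sum(q[x] * parr(y - x + 1) for x in range(1, y + 1))
    q.append((q[y] - s) / p[0])         # solve (4.13) for q_{y+1}

total = sum(q)
q = [v / total for v in q]              # normalized invariant distribution
print("q(0), q(1), q(2) =", q[0], q[1], q[2])
print("tail mass beyond 20:", sum(q[21:]))
```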
The case $m = 1$ is a little bit harder. The calculations carried out earlier
are still valid, and we know in this case that there exists $q(x) \ge 0$ such that
$q(x) < \infty$ for each $x$, $\sum_x q(x) = \infty$, and $\sum_x q(x)\,\pi(x, y) = q(y)$.
In other words the chain admits an infinite invariant measure. Such a chain
cannot be positive recurrent. To see this we note that
$$q(y) = \sum_x \pi^{(n)}(x, y)\, q(x),$$
and if the chain were positive recurrent,
$$\lim_{n\to\infty} \pi^{(n)}(x, y) = \bar q(y)$$
would exist with $\sum_y \bar q(y) = 1$ and $\bar q(y) > 0$. By Fatou's lemma
$$q(y) \ge \sum_x \bar q(y)\, q(x) = \infty,$$
giving us a contradiction. To decide between transience and null recurrence
a more detailed investigation is needed. We will outline a general procedure.
Suppose we have a state $x_0$ that is fixed and we would like to calculate
$F_{x_0}(\ell) = P_{x_0}\{\tau_{x_0} \le \ell\}$. If we can do this, then we can answer questions
about transience, recurrence etc. If $\lim_{\ell\to\infty} F_{x_0}(\ell) < 1$ then the chain is
transient, and otherwise recurrent. In the recurrent case the convergence or
divergence of
$$E^{P_{x_0}}\{\tau_{x_0}\} = \sum_{\ell} [1 - F_{x_0}(\ell)]$$
determines if it is positive or null recurrent. If we can determine
$$F_y(\ell) = P_y\{\tau_{x_0} \le \ell\}$$
for $y \ne x_0$, then for $\ell \ge 1$
$$F_{x_0}(\ell) = \pi(x_0, x_0) + \sum_{y \ne x_0} \pi(x_0, y)\, F_y(\ell - 1).$$
We shall outline a procedure for determining, for $\lambda > 0$,
$$U(\lambda, y) = E^{P_y}\big[\exp[-\lambda\, \tau_{x_0}]\big].$$
Clearly $U(x) = U(\lambda, x)$ satisfies
$$U(x) = e^{-\lambda} \sum_y \pi(x, y)\, U(y) \quad \text{for } x \ne x_0 \tag{4.14}$$
and $U(x_0) = 1$. One would hope that if we solve these equations then we
have our $U$. This requires uniqueness. Since our $U$ is bounded, in fact by
1, it is sufficient to prove uniqueness within the class of bounded solutions
of equation (4.14). We will now establish that any bounded solution $U$ of
equation (4.14) with $U(x_0) = 1$ is given by
$$U(y) = U(\lambda, y) = E^{P_y}\big[\exp[-\lambda\, \tau_{x_0}]\big].$$
Let us define $E_n = \{X_1 \ne x_0, X_2 \ne x_0, \dots, X_{n-1} \ne x_0, X_n = x_0\}$. Then
we will prove, by induction, that for any solution $U$ of equation (4.14) with
$U(\lambda, x_0) = 1$,
$$U(y) = \sum_{j=1}^{n} e^{-\lambda j}\, P_y\{E_j\} + e^{-\lambda n} \int_{\tau_{x_0} > n} U(X_n)\, dP_y. \tag{4.15}$$
By letting $n \to \infty$ we would obtain
$$U(y) = \sum_{j=1}^{\infty} e^{-\lambda j}\, P_y\{E_j\} = E^{P_y}\{e^{-\lambda \tau_{x_0}}\}$$
because $U$ is bounded and $\lambda > 0$.
For the induction step,
$$\begin{aligned}
\int_{\tau_{x_0} > n} U(X_n)\, dP_y &= e^{-\lambda} \int_{\tau_{x_0} > n} \Big[\sum_y \pi(X_n, y)\, U(y)\Big]\, dP_y \\
&= e^{-\lambda}\, P_y\{E_{n+1}\} + e^{-\lambda} \int_{\tau_{x_0} > n} \Big[\sum_{y \ne x_0} \pi(X_n, y)\, U(y)\Big]\, dP_y \\
&= e^{-\lambda}\, P_y\{E_{n+1}\} + e^{-\lambda} \int_{\tau_{x_0} > n+1} U(X_{n+1})\, dP_y,
\end{aligned}$$
completing the induction argument. In our case, if we take $x_0 = 0$ and try
$U_\sigma(x) = e^{-\sigma x}$ with $\sigma > 0$, then for $x \ge 1$
$$\sum_y \pi(x, y)\, U_\sigma(y) = \sum_{y \ge x-1} e^{-\sigma y}\, p_{y-x+1} = \sum_{y \ge 0} e^{-\sigma(x+y-1)}\, p_y = e^{-\sigma x}\, e^{\sigma} \sum_{y \ge 0} e^{-\sigma y}\, p_y = \psi(\sigma)\, U_\sigma(x)$$
where
$$\psi(\sigma) = e^{\sigma} \sum_{y \ge 0} e^{-\sigma y}\, p_y.$$
Let us solve $e^{\lambda} = \psi(\sigma)$ for $\sigma$, which is the same as solving $\log \psi(\sigma) = \lambda$ for
$\lambda > 0$, to get a solution $\sigma = \sigma(\lambda) > 0$. Then
$$U(\lambda, x) = e^{-\sigma(\lambda)\, x} = E^{P_x}\{e^{-\lambda \tau_0}\}.$$
We see now that recurrence is equivalent to $\sigma(\lambda) \to 0$ as $\lambda \to 0$, and positive
recurrence to $\sigma(\lambda)$ being differentiable at $\lambda = 0$. The function $\log \psi(\sigma)$
is convex and its slope at the origin is $1 - m$. If $m > 1$ it dips below 0
initially for $\sigma > 0$ and then comes back up to 0 for some positive $\sigma_0$ before
turning positive for good. In that situation $\lim_{\lambda \to 0} \sigma(\lambda) = \sigma_0 > 0$ and that
is transience. If $m < 1$ then $\log \psi(\sigma)$ has a positive slope at the origin and
$\sigma'(0) = \frac{1}{\lambda'(0)} = \frac{1}{1-m} < \infty$. If $m = 1$, then $\log \psi$ has zero slope at the origin
and $\sigma'(0) = \infty$. This concludes the discussion of this problem.
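Since $\psi$ is explicit, the equation $\log\psi(\sigma) = \lambda$ can also be solved numerically, and one can watch $\sigma(\lambda)/\lambda$ approach $\sigma'(0) = 1/(1-m)$ as $\lambda \downarrow 0$. A sketch using the same kind of illustrative arrival distribution as above:

```python
import math

p = [0.5, 0.3, 0.2]                       # arrival distribution, m = 0.7
m = sum(j * pj for j, pj in enumerate(p))

def log_psi(s):
    # log psi(sigma) = sigma + log sum_y e^{-sigma y} p_y
    return s + math.log(sum(pj * math.exp(-s * j) for j, pj in enumerate(p)))

def sigma(lam, hi=50.0):
    lo = 0.0                              # log_psi(0) = 0 <= lam
    for _ in range(100):                  # bisection on the increasing branch
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if log_psi(mid) < lam else (lo, mid)
    return (lo + hi) / 2

for lam in (0.1, 0.01, 0.001):
    print(lam, sigma(lam) / lam)          # approaches 1/(1-m) = 10/3
```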
Example 4.3. (The Urn Problem.) We now turn to a discussion of the urn problem, with
$$\pi(p, q\,;\, p+1, q) = \frac{p}{p+q} \quad\text{and}\quad \pi(p, q\,;\, p, q+1) = \frac{q}{p+q},$$
and $\pi$ zero otherwise. In this case the equation
$$F(p, q) = \frac{p}{p+q}\, F(p+1, q) + \frac{q}{p+q}\, F(p, q+1) \quad \text{for all } p, q,$$
which will play a role later, has lots of solutions. In particular $F(p, q) = \frac{p}{p+q}$
is one, and for any $0 < x < 1$,
$$F_x(p, q) = \frac{1}{\beta(p, q)}\, x^{p-1}(1-x)^{q-1}, \quad\text{where}\quad \beta(p, q) = \frac{\Gamma(p)\,\Gamma(q)}{\Gamma(p+q)},$$
is a solution as well. The former is defined on $p+q > 0$ whereas the latter is
defined only on $p > 0,\, q > 0$. Actually if $p$ or $q$ is initially 0 it remains so
forever and there is nothing to study in that case. If $f$ is a continuous function
on $[0, 1]$ then
$$F_f(p, q) = \int_0^1 F_x(p, q)\, f(x)\, dx$$
is a solution, and if we want we can extend $F_f$ by making it $f(1)$ on $q = 0$
and $f(0)$ on $p = 0$. It is a simple exercise to verify
$$\lim_{\substack{p, q \to \infty\\ \frac{p}{p+q} \to x}} F_f(p, q) = f(x)$$
for any continuous $f$ on $[0, 1]$. We will show that the ratio $\xi_n = \frac{p_n}{p_n+q_n}$, which
is random, stabilizes asymptotically (i.e. has a limit) to a random variable $\xi$,
and if we start from $p, q$ the distribution of $\xi$ is the Beta distribution on
$[0, 1]$ with density
$$F_x(p, q) = \frac{1}{\beta(p, q)}\, x^{p-1}(1-x)^{q-1}.$$
Suppose we have a Markov chain on some state space $X$ with transition
probability $\pi(x, y)$, and $U(x)$ is a bounded function on $X$ that solves
$$U(x) = \sum_y \pi(x, y)\, U(y).$$
Such functions are called (bounded) harmonic functions for the chain. Consider
the random variables $\xi_n = U(X_n)$ for such a harmonic function. The $\xi_n$ are
uniformly bounded by the bound for $U$. If we denote by $\eta_n = \xi_n - \xi_{n-1}$,
an elementary calculation reveals
$$E^{P_x}\{\eta_{n+1}\} = E^{P_x}\{U(X_{n+1}) - U(X_n)\} = E^{P_x}\big\{E^{P_x}\{U(X_{n+1}) - U(X_n)\,|\,\mathcal{F}_n\}\big\}$$
where $\mathcal{F}_n$ is the $\sigma$-field generated by $X_0, \dots, X_n$. But
$$E^{P_x}\{U(X_{n+1}) - U(X_n)\,|\,\mathcal{F}_n\} = \sum_y \pi(X_n, y)\,[U(y) - U(X_n)] = 0.$$
A similar calculation shows that
$$E^{P_x}\{\eta_n\, \eta_m\} = 0$$
for $m \ne n$. If we write
$$U(X_n) = U(X_0) + \eta_1 + \eta_2 + \cdots + \eta_n,$$
this is an orthogonal sum in $L_2[P_x]$, and because $U$ is bounded,
$$E^{P_x}\{|U(X_n)|^2\} = |U(x)|^2 + \sum_{i=1}^{n} E^{P_x}\{|\eta_i|^2\} \le C$$
is bounded in $n$. Therefore $\lim_{n\to\infty} U(X_n) = \xi$ exists in $L_2[P_x]$ and
$E^{P_x}\{\xi\} = U(x)$. Actually the limit exists almost surely, and we will show
this when we discuss martingales later. In our example if we take $U(p, q) = \frac{p}{p+q}$,
as remarked earlier, this is a harmonic function bounded by 1 and therefore
$$\lim_{n\to\infty} \frac{p_n}{p_n + q_n} = \xi$$
exists in $L_2[P_x]$.
Moreover, if we take $U(p, q) = F_f(p, q)$ for some continuous
$f$ on $[0, 1]$, because $F_f(p, q) \to f(x)$ as $p, q \to \infty$ with $\frac{p}{p+q} \to x$, $U(p_n, q_n)$ has a
limit as $n \to \infty$ and this limit has to be $f(\xi)$. On the other hand
$$E^{P_{p,q}}\{U(p_n, q_n)\} = U(p_0, q_0) = F_f(p_0, q_0) = \frac{1}{\beta(p_0, q_0)} \int_0^1 f(x)\, x^{p_0-1}(1 - x)^{q_0-1}\, dx,$$
giving us
$$E^{P_{p,q}}\{f(\xi)\} = \frac{1}{\beta(p, q)} \int_0^1 f(x)\, x^{p-1}(1 - x)^{q-1}\, dx,$$
thereby identifying the distribution of $\xi$ under $P_{p,q}$ as the Beta distribution
with the right parameters.
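A direct simulation of the urn shows the fraction $\xi_n$ stabilizing, run by run, to a random limit whose law matches the Beta density just identified. A minimal sketch (starting composition and run lengths are arbitrary choices):

```python
import random

def polya_fraction(p, q, steps=1000):
    """Run the urn: a color is drawn with probability proportional to its
    count and one ball of that color is added; return the final fraction."""
    for _ in range(steps):
        if random.random() < p / (p + q):
            p += 1
        else:
            q += 1
    return p / (p + q)

random.seed(1)
samples = [polya_fraction(2, 1) for _ in range(5000)]
# For Beta(2, 1) the density is 2x on [0, 1]: mean 2/3, P(xi < 1/2) = 1/4.
print("sample mean:", sum(samples) / len(samples))
print("P(xi < 1/2):", sum(s < 0.5 for s in samples) / len(samples))
```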
Example 4.4. (Branching Process). Consider a population in which each
individual member replaces itself at the beginning of each day by a random
number of offspring. Every individual has the same probability distribution,
but the numbers of offspring for different individuals are distributed
independently of each other. The distribution of the number $N$ of offspring is given
by $P[N = i] = p_i$ for $i \ge 0$. If there are $X_n$ individuals in the population on
a given day, then the number of individuals $X_{n+1}$ present on the next day
has the representation
$$X_{n+1} = N_1 + N_2 + \cdots + N_{X_n}$$
as the sum of $X_n$ independent random variables, each having the offspring
distribution $\{p_i : i \ge 0\}$. $X_n$ is seen to be a Markov chain on the set
of nonnegative integers. Note that if $X_n$ ever becomes zero, i.e. if every
member on a given day produces no offspring, then the population remains
extinct.

If one uses generating functions, then the transition probabilities $\pi_{i,j}$ of the
chain are given by
$$\sum_j \pi_{i,j}\, z^j = \Big[\sum_j p_j z^j\Big]^i.$$
What is the long time behavior of the chain?

Let us denote by $m$ the expected number of offspring of any individual,
i.e.
$$m = \sum_{i \ge 0} i\, p_i.$$
Then
$$E[X_{n+1}\,|\,\mathcal{F}_n] = m\, X_n.$$
1. If $m < 1$, then the population becomes extinct sooner or later. This is
easy to see. Consider
$$E\Big[\sum_{n \ge 0} X_n \,\Big|\, \mathcal{F}_0\Big] = \sum_{n \ge 0} m^n X_0 = \frac{1}{1 - m}\, X_0 < \infty.$$
By an application of Fubini's theorem, if $S = \sum_{n \ge 0} X_n$, then
$$E[S\,|\,X_0 = i] = \frac{i}{1 - m} < \infty,$$
proving that $P[S < \infty] = 1$. In particular
$$P\big[\lim_{n\to\infty} X_n = 0\big] = 1.$$

2. If $m = 1$ and $p_1 = 1$, then $X_n \equiv X_0$ and the population size never
changes, each individual replacing itself every time by exactly one offspring.
3. If $m = 1$ and $p_1 < 1$, then $p_0 > 0$, and there is a positive probability
$q(i) = q^i$ that the population becomes extinct when it starts with $i$
individuals. Here $q$ is the probability of the population becoming extinct
when we start with $X_0 = 1$. If we have initially $i$ individuals, each of
the $i$ family lines has to become extinct for the entire population to
become extinct. The number $q$ must therefore be a solution of the
equation
$$q = P(q)$$
where $P(z)$ is the generating function
$$P(z) = \sum_{i \ge 0} p_i\, z^i.$$
If we show that the equation $P(z) = z$ has only the solution $z = 1$
in $0 \le z \le 1$, then the population becomes extinct with probability 1,
although $E[S] = \infty$ in this case. If $P(1) = 1$ and $P(a) = a$ for some
$0 \le a < 1$, then by the mean value theorem applied to $P(z) - z$ we
must have $P'(z) = 1$ for some $0 < z < 1$. But if $0 < z < 1$,
$$P'(z) = \sum_{i \ge 1} i\, z^{i-1} p_i < \sum_{i \ge 1} i\, p_i = 1,$$
a contradiction.
4. If $m > 1$ but $p_0 = 0$ the problem is trivial: there is no chance of the
population becoming extinct. Let us assume that $p_0 > 0$. The equation
$P(z) = z$ has another solution $z = q$ besides $z = 1$, in the range
$0 < z < 1$. This is seen by considering the function $g(z) = P(z) - z$.
We have $g(0) > 0$, $g(1) = 0$, $g'(1) > 0$, which implies another root. But
$g(z)$ is convex and therefore there can be at most one more root. If we
can rule out the possibility of the extinction probability being equal to 1,
then this root $q$ must be the extinction probability when we start with
a single individual at time 0. Let us denote by $q_n$ the probability of
extinction within $n$ days. Then
$$q_{n+1} = \sum_i p_i\, q_n^i = P(q_n)$$
and $q_1 < 1$. A simple consequence of the monotonicity of $P(z)$ and
the inequalities $P(z) > z$ for $z < q$ and $P(z) < z$ for $z > q$ is that if
we start with any $a < 1$ and iterate $q_{n+1} = P(q_n)$ with $q_1 = a$, then
$q_n \to q$.

If the population does not become extinct, one can show that it has
to grow indefinitely. This is best done using martingales and we will
revisit this example later as Example 5.6.
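The fixed-point iteration $q_{n+1} = P(q_n)$ described above is immediately computable. The sketch below, for an arbitrarily chosen supercritical offspring distribution, starts from $q_1 = P(0)$, the probability of extinction within one day, and converges to the smaller root of $P(z) = z$:

```python
# Extinction probability of a branching process by iterating q <- P(q).
p = [0.25, 0.25, 0.5]                     # offspring distribution, m = 1.25 > 1

def P(z):                                 # generating function sum_i p_i z^i
    return sum(pi * z**i for i, pi in enumerate(p))

q = P(0.0)                                # q_1: extinct within one day
for _ in range(200):
    q = P(q)                              # q_{n+1} = P(q_n), increases to q
print("extinction probability q =", q)    # 0.5 for this distribution
print("check P(q) - q =", P(q) - q)       # ~ 0: q solves P(z) = z
```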
Example 4.5. Let $X$ be the set of integers. Assume that transitions from $x$
are possible only to $x-1$, $x$, and $x+1$. The transition matrix $\pi(x, y)$ appears
as a tridiagonal matrix with $\pi(x, y) = 0$ unless $|x - y| \le 1$. For simplicity let
us assume that $\pi(x, x)$, $\pi(x, x-1)$ and $\pi(x, x+1)$ are positive for all $x$.
The chain is then irreducible and aperiodic. Let us try to solve for
$$U(x) = P_x\{\tau_0 = \infty\}$$
that satisfies the equation
$$U(x) = \pi(x, x-1)\, U(x-1) + \pi(x, x)\, U(x) + \pi(x, x+1)\, U(x+1)$$
for $x \ne 0$, with $U(0) = 0$. The equations decouple into a set for $x > 0$ and a
set for $x < 0$. If we denote by $V(x) = U(x+1) - U(x)$ for $x \ge 0$, then we
always have
$$U(x) = \pi(x, x-1)\, U(x) + \pi(x, x)\, U(x) + \pi(x, x+1)\, U(x)$$
so that
$$\pi(x, x+1)\, V(x) - \pi(x, x-1)\, V(x-1) = 0$$
or
$$\frac{V(x)}{V(x-1)} = \frac{\pi(x, x-1)}{\pi(x, x+1)}$$
and therefore
$$V(x) = V(0) \prod_{i=1}^{x} \frac{\pi(i, i-1)}{\pi(i, i+1)}$$
and
$$U(x) = V(0)\Big[1 + \sum_{y=1}^{x-1} \prod_{i=1}^{y} \frac{\pi(i, i-1)}{\pi(i, i+1)}\Big].$$
If the chain is to be transient we must have, for some choice of $V(0)$,
$0 \le U(x) \le 1$ for all $x > 0$, and this will be possible only if
$$\sum_{y=1}^{\infty} \prod_{i=1}^{y} \frac{\pi(i, i-1)}{\pi(i, i+1)} < \infty,$$
which then becomes a necessary condition for
$$P_x\{\tau_0 = \infty\} > 0$$
for $x > 0$. There is a similar condition on the negative side:
$$\sum_{y=1}^{\infty} \prod_{i=1}^{y} \frac{\pi(-i, -i+1)}{\pi(-i, -i-1)} < \infty.$$
Transience needs at least one of the two series to converge. Actually the
converse is also true. If, for instance, the series on the positive side converges,
then we get a function $U(x)$ with $0 \le U(x) \le 1$ and $U(0) = 0$ that satisfies
$$U(x) = \pi(x, x-1)\, U(x-1) + \pi(x, x)\, U(x) + \pi(x, x+1)\, U(x+1),$$
and by iteration one can prove that for each $n$,
$$U(x) = \int_{\tau_0 > n} U(X_n)\, dP_x \le P_x\{\tau_0 > n\},$$
so the existence of a nontrivial U implies transience.
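For a concrete tridiagonal chain the transience series is easy to evaluate. In the sketch below the transition probabilities are an illustrative choice with drift away from 0 on the positive side, so the series converges and the chain is transient:

```python
# Transience test for a tridiagonal chain on the integers:
# sum_y prod_{i<=y} pi(i, i-1)/pi(i, i+1) < infinity  (positive side).
def pi_down(i):   # pi(i, i-1): illustrative choice, drift away from 0
    return 0.25
def pi_up(i):     # pi(i, i+1); the remaining 0.25 is pi(i, i)
    return 0.5

total, prod = 0.0, 1.0
for y in range(1, 200):
    prod *= pi_down(y) / pi_up(y)    # prod_{i=1}^{y} pi(i,i-1)/pi(i,i+1)
    total += prod
print("partial sum of the series:", total)   # converges (to 1): transient
```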
Exercise 4.16. Determine the conditions for positive recurrence in the previous
example.

Exercise 4.17. We replace the set of integers by the set of nonnegative integers
and assume that $\pi(0, y) = 0$ for $y \ge 2$. Such processes are called birth
and death processes. Work out the conditions in that case.

Exercise 4.18. In the special case of a birth and death process with $\pi(0, 1) =
\pi(0, 0) = \frac{1}{2}$, and for $x \ge 1$, $\pi(x, x) = \frac{1}{3}$, $\pi(x, x-1) = \frac{1}{3} + a_x$, $\pi(x, x+1) =
\frac{1}{3} - a_x$ with $a_x \simeq \frac{\delta}{x^{\alpha}}$ for large $x$, find conditions on the positive exponent
$\alpha$ and the real constant $\delta$ for the chain to be transient, null recurrent and
positive recurrent.
Exercise 4.19. The notion of a Markov chain makes sense for a finite chain
$X_0, \dots, X_n$. Formulate it precisely. Show that if the chain $\{X_j : 0 \le j \le n\}$
is Markov, so is the reversed chain $\{Y_j : 0 \le j \le n\}$ where $Y_j = X_{n-j}$ for
$0 \le j \le n$. Can the transition probabilities of the reversed chain be determined
from the transition probabilities of the forward chain? If the forward chain has
stationary transition probabilities, does the same hold true for the reversed
chain? What if we assume that the chain has a finite invariant probability
distribution and we initialize the chain to start with an initial distribution
which is the invariant distribution?

Exercise 4.20. Consider the simple chain on the nonnegative integers with the
following transition probabilities: $\pi(0, x) = p_x$ for $x \ge 0$ with $\sum_{x=0}^{\infty} p_x = 1$;
for $x > 0$, $\pi(x, x-1) = 1$ and $\pi(x, y) = 0$ for all other $y$. Determine
conditions on $\{p_x\}$ in order that the chain may be transient, null recurrent
or positive recurrent. Determine the invariant probability measure in the
positive recurrent case.

Exercise 4.21. Show that any null recurrent equivalence class must necessarily
contain an infinite number of states. In particular any Markov chain
with a finite state space has only transient and positive recurrent states, and
moreover the set of positive recurrent states must be nonempty.
Chapter 5
Martingales.
5.1 Definitions and properties
The theory of martingales plays a very important and useful role in the study
of stochastic processes. A formal definition is given below.
Definition 5.1. Let $(\Omega, \mathcal{F}, P)$ be a probability space. A martingale
sequence of length $n$ is a chain $X_1, X_2, \dots, X_n$ of random variables and
corresponding sub $\sigma$-fields $\mathcal{F}_1, \mathcal{F}_2, \dots, \mathcal{F}_n$ that satisfy the following relations:

1. Each $X_i$ is an integrable random variable which is measurable with respect
to the corresponding $\sigma$-field $\mathcal{F}_i$.

2. The $\sigma$-fields $\mathcal{F}_i$ are increasing, i.e. $\mathcal{F}_i \subset \mathcal{F}_{i+1}$ for every $i$.

3. For every $i \in [1, 2, \dots, n-1]$, we have the relation
$$X_i = E\{X_{i+1}\,|\,\mathcal{F}_i\} \quad \text{a.e. } P.$$

Remark 5.1. We can have an infinite martingale sequence $\{(X_i, \mathcal{F}_i) : i \ge 1\}$,
which requires only that for every $n$, $\{(X_i, \mathcal{F}_i) : 1 \le i \le n\}$ be a martingale
sequence of length $n$. This is the same as conditions (i), (ii) and (iii) above
except that they have to be true for every $i \ge 1$.
Remark 5.2. From the properties of conditional expectations we see that
$E\{X_i\} = E\{X_{i+1}\}$ for every $i$, and therefore $E\{X_i\} = c$ for some constant $c$.
We can define $\mathcal{F}_0$ to be the trivial $\sigma$-field consisting of $\{\emptyset, \Omega\}$ and $X_0 \equiv c$. Then
$\{(X_i, \mathcal{F}_i) : i \ge 0\}$ is a martingale sequence as well.

Remark 5.3. We can define $Y_{i+1} = X_{i+1} - X_i$ so that $X_j = c + \sum_{1 \le i \le j} Y_i$ and
property (iii) reduces to
$$E\{Y_{i+1}\,|\,\mathcal{F}_i\} = 0 \quad \text{a.e. } P.$$
Such sequences are called martingale differences. If $Y_i$ is a sequence of
independent random variables with mean 0, and for each $i$ we take $\mathcal{F}_i$ to
be the $\sigma$-field generated by the random variables $\{Y_j : 1 \le j \le i\}$, then
$X_j = c + \sum_{1 \le i \le j} Y_i$ will be a martingale relative to the $\sigma$-fields $\mathcal{F}_i$.
Remark 5.4. We can generate martingale sequences by the following procedure.
Given any increasing family of $\sigma$-fields $\{\mathcal{F}_j\}$, and any integrable
random variable $X$ on $(\Omega, \mathcal{F}, P)$, we take $X_i = E\{X\,|\,\mathcal{F}_i\}$ and it is easy to
check that $\{(X_i, \mathcal{F}_i)\}$ is a martingale sequence. Of course every finite
martingale sequence is generated this way, for we can always take $X$ to be $X_n$,
the last one. For infinite sequences this raises an important question that we
will answer later.

If one participates in a fair gambling game, the asset $X_n$ of the player at
time $n$ is supposed to be a martingale. One can take for $\mathcal{F}_n$ the $\sigma$-field of all
the results of the game through time $n$. The condition $E[X_{n+1} - X_n\,|\,\mathcal{F}_n] = 0$
is the assertion that the game is neutral irrespective of past history.

A related notion is that of a super- or sub-martingale. If, in the definition
of a martingale, we replace the equality in (iii) by an inequality we get super-
or sub-martingales. For a sub-martingale we demand the relation

(iiia) for every $i$,
$$X_i \le E\{X_{i+1}\,|\,\mathcal{F}_i\} \quad \text{a.e. } P,$$
while for a super-martingale the relation is

(iiib) for every $i$,
$$X_i \ge E\{X_{i+1}\,|\,\mathcal{F}_i\} \quad \text{a.e. } P.$$
Lemma 5.1. If $\{(X_i, \mathcal{F}_i)\}$ is a martingale and $\varphi$ is a convex (or concave)
function of one variable such that $\varphi(X_n)$ is integrable for every $n$, then
$\{(\varphi(X_n), \mathcal{F}_n)\}$ is a sub- (or super-) martingale.

Proof. An easy consequence of Jensen's inequality (4.2) for conditional
expectations.

Remark 5.5. A particular case is $\varphi(x) = |x|^p$ with $1 \le p < \infty$. For any
martingale $(X_n, \mathcal{F}_n)$ and $1 \le p < \infty$, $(|X_n|^p, \mathcal{F}_n)$ is a sub-martingale provided
$E[|X_n|^p] < \infty$.
Theorem 5.2. (Doob's inequality.) Suppose $\{X_j\}$ is a martingale sequence
of length $n$. Then
$$P\Big[\omega : \sup_{1 \le j \le n} |X_j| \ge \ell\Big] \le \frac{1}{\ell} \int_{\{\sup_{1 \le j \le n}|X_j| \ge \ell\}} |X_n|\, dP \le \frac{1}{\ell} \int |X_n|\, dP. \tag{5.1}$$

Proof. Let us define $S(\omega) = \sup_{1 \le j \le n} |X_j(\omega)|$. Then
$$\{\omega : S(\omega) \ge \ell\} = E = \cup_j E_j$$
is written as a disjoint union, where
$$E_j = \{\omega : |X_1(\omega)| < \ell, \dots, |X_{j-1}| < \ell, |X_j| \ge \ell\}.$$
We have
$$P(E_j) \le \frac{1}{\ell} \int_{E_j} |X_j|\, dP \le \frac{1}{\ell} \int_{E_j} |X_n|\, dP. \tag{5.2}$$
The second inequality in (5.2) follows from the fact that $|x|$ is a convex
function of $x$, and therefore $|X_j|$ is a sub-martingale. In particular
$E\{|X_n|\,|\,\mathcal{F}_j\} \ge |X_j|$ a.e. $P$ and $E_j \in \mathcal{F}_j$. Summing up (5.2) over $j = 1, \dots, n$ we obtain the
theorem.

Remark 5.6. We could have started with
$$P(E_j) \le \frac{1}{\ell^p} \int_{E_j} |X_j|^p\, dP$$
and obtained for $p \ge 1$
$$P(E_j) \le \frac{1}{\ell^p} \int_{E_j} |X_n|^p\, dP. \tag{5.3}$$
Compare it with (3.9) for p = 2.
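Inequality (5.1) can also be tested empirically on the simplest martingale, the $\pm 1$ random walk. The sketch below estimates both sides by simulation (all parameters are arbitrary choices):

```python
import random

random.seed(0)
n, ell, trials = 100, 10.0, 20000
exceed, tail_mean = 0, 0.0
for _ in range(trials):
    x, smax = 0, 0
    for _ in range(n):                      # martingale X_j: +-1 random walk
        x += random.choice((-1, 1))
        smax = max(smax, abs(x))
    if smax >= ell:
        exceed += 1
        tail_mean += abs(x)                 # accumulate |X_n| on the event
print("P(sup |X_j| >= l)       :", exceed / trials)
print("E[|X_n|; sup >= l] / l  :", tail_mean / trials / ell)  # dominates it
```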
This simple inequality has various implications. For example:

Corollary 5.3. (Doob's inequality.) Let $\{X_j : 1 \le j \le n\}$ be a martingale.
Then if, as before, $S(\omega) = \sup_{1 \le j \le n} |X_j(\omega)|$, we have
$$E[S^p] \le \Big(\frac{p}{p-1}\Big)^p E[\,|X_n|^p\,].$$
The proof is a consequence of the following fairly general lemma.

Lemma 5.4. Suppose $X$ and $Y$ are two nonnegative random variables on a
probability space such that for every $\ell \ge 0$,
$$P\{Y \ge \ell\} \le \frac{1}{\ell} \int_{Y \ge \ell} X\, dP.$$
Then for every $p > 1$,
$$\int Y^p\, dP \le \Big(\frac{p}{p-1}\Big)^p \int X^p\, dP.$$

Proof. Let us denote the tail probability by $T(\ell) = P\{Y \ge \ell\}$. Then with
$\frac{1}{p} + \frac{1}{q} = 1$, i.e. $(p-1)q = p$,
$$\begin{aligned}
\int Y^p\, dP &= -\int_0^\infty y^p\, dT(y) = p \int_0^\infty y^{p-1}\, T(y)\, dy && \text{(integrating by parts)}\\
&\le p \int_0^\infty y^{p-1}\, \frac{dy}{y} \int_{Y \ge y} X\, dP && \text{(by assumption)}\\
&= p \int X \Big[\int_0^Y y^{p-2}\, dy\Big]\, dP && \text{(by Fubini's theorem)}\\
&= \frac{p}{p-1} \int X\, Y^{p-1}\, dP\\
&\le \frac{p}{p-1} \Big[\int X^p\, dP\Big]^{\frac{1}{p}} \Big[\int Y^{q(p-1)}\, dP\Big]^{\frac{1}{q}} && \text{(by H\"older's inequality)}\\
&= \frac{p}{p-1} \Big[\int X^p\, dP\Big]^{\frac{1}{p}} \Big[\int Y^p\, dP\Big]^{1-\frac{1}{p}}.
\end{aligned}$$
This simplifies to
$$\int Y^p\, dP \le \Big(\frac{p}{p-1}\Big)^p \int X^p\, dP,$$
provided $\int Y^p\, dP$ is finite. In general, given $Y$, we can truncate it at level $N$
to get $Y_N = \min(Y, N)$, and for $0 < \ell \le N$,
$$P\{Y_N \ge \ell\} = P\{Y \ge \ell\} \le \frac{1}{\ell} \int_{Y \ge \ell} X\, dP = \frac{1}{\ell} \int_{Y_N \ge \ell} X\, dP,$$
with $P\{Y_N \ge \ell\} = 0$ for $\ell > N$. This gives us uniform bounds on $\int Y_N^p\, dP$
and we can pass to the limit. So we have the strong implication that the
finiteness of $\int X^p\, dP$ implies the finiteness of $\int Y^p\, dP$.
Exercise 5.1. The result is false for $p = 1$. Construct a nonnegative martingale
$X_n$ with $E[X_n] \equiv 1$ such that $\xi = \sup_n X_n$ is not integrable. Consider
$\Omega = [0, 1]$, $\mathcal{F}$ the Borel $\sigma$-field and $P$ the Lebesgue measure. Suppose we
take $\mathcal{F}_n$ to be the $\sigma$-field generated by intervals with end points of the form
$\frac{j}{2^n}$ for some integer $j$. It corresponds to a partition with $2^n$ sets. Consider
the random variables
$$X_n(x) = \begin{cases} 2^n & \text{for } 0 \le x \le 2^{-n} \\ 0 & \text{for } 2^{-n} < x \le 1. \end{cases}$$
Check that it is a martingale and calculate $\int \xi(x)\, dx$. This is the winning
strategy of doubling one's bets until the losses are recouped.
Exercise 5.2. If $X_n$ is a martingale such that the differences $Y_n = X_n - X_{n-1}$
are all square integrable, show that for $n \ne m$, $E[Y_n Y_m] = 0$. Therefore
$$E[X_n^2] = E[X_0^2] + \sum_{j=1}^{n} E[Y_j^2].$$
If in addition $\sup_n E[X_n^2] < \infty$, then show that there is a random variable
$X$ such that
$$\lim_{n\to\infty} E[\,|X_n - X|^2\,] = 0.$$
5.2 Martingale Convergence Theorems.

If $\mathcal{F}_n$ is an increasing family of $\sigma$-fields and $X_n$ is a martingale sequence with
respect to $\mathcal{F}_n$, one can always assume without loss of generality that the
full $\sigma$-field $\mathcal{F}$ is the smallest $\sigma$-field generated by $\cup_n \mathcal{F}_n$. If for some $p \ge 1$,
$X \in L_p$, and we define $X_n = E[X\,|\,\mathcal{F}_n]$, then $X_n$ is a martingale and, by
Jensen's inequality, $\sup_n E[\,|X_n|^p\,] \le E[\,|X|^p\,]$. We would like to prove

Theorem 5.5. For $p \ge 1$, if $X \in L_p$, then $\lim_{n\to\infty} \|X_n - X\|_p = 0$.
Proof. Assume first that $X$ is a bounded function. Then by the properties of
conditional expectation $\sup_n \sup_\omega |X_n| < \infty$. In particular $E[X_n^2]$ is uniformly
bounded. By Exercise 5.2 at the end of the last section, $\lim_{n\to\infty} X_n = Y$ exists
in $L_2$. By the properties of conditional expectations, for $A \in \mathcal{F}_m$,
$$\int_A Y\, dP = \lim_{n\to\infty} \int_A X_n\, dP = \int_A X\, dP.$$
This is true for all $A \in \mathcal{F}_m$ for any $m$. Since $\mathcal{F}$ is generated by $\cup_m \mathcal{F}_m$, the
above relation is true for $A \in \mathcal{F}$. As $X$ and $Y$ are $\mathcal{F}$-measurable we conclude
that $X = Y$ a.e. $P$. See Exercise 4.1. For a sequence of functions that
are bounded uniformly in $n$, convergence in the various $L_p$ are all equivalent, and
therefore convergence in $L_2$ implies the convergence in $L_p$ for any $1 \le p < \infty$.

If now $X \in L_p$ for some $1 \le p < \infty$, we can approximate it by a bounded
$X^\epsilon$ so that $\|X^\epsilon - X\|_p < \epsilon$. Let us denote by $X^\epsilon_n$ the conditional expectations
$E[X^\epsilon\,|\,\mathcal{F}_n]$. By the properties of conditional expectations $\|X^\epsilon_n - X_n\|_p \le \epsilon$
for all $n$, and as we saw earlier $\|X^\epsilon_n - X^\epsilon\|_p \to 0$ as $n \to \infty$. It now follows
that
$$\limsup_{n, m \to\infty} \|X_n - X_m\|_p \le 2\epsilon,$$
and as $\epsilon > 0$ is arbitrary we are done.
In general, if we have a martingale $\{X_n\}$, we wish to know when it comes
from a random variable $X \in L_p$ in the sense that $X_n = E[X\,|\,\mathcal{F}_n]$.

Theorem 5.6. If for some $p > 1$ a martingale $\{X_n\}$ is bounded in $L_p$, in
the sense that $\sup_n \|X_n\|_p < \infty$, then there is a random variable $X \in L_p$ such
that $X_n = E[X\,|\,\mathcal{F}_n]$ for $n \ge 1$. In particular $\|X_n - X\|_p \to 0$ as $n \to \infty$.
Proof. Suppose $\|X_n\|_p$ is uniformly bounded. For $p > 1$, since $L_p$ is the dual
of $L_q$ with $\frac{1}{p} + \frac{1}{q} = 1$, bounded sets are weakly compact. See [7] or [3]. We
can therefore choose a subsequence $X_{n_j}$ that converges weakly in $L_p$ to a
limit in the weak topology. We call this limit $X$. Then consider $A \in \mathcal{F}_n$ for
some fixed $n$. The function $\mathbf{1}_A(\omega) \in L_q$, and
$$\int_A X\, dP = \langle \mathbf{1}_A, X\rangle = \lim_{j\to\infty} \langle \mathbf{1}_A, X_{n_j}\rangle = \lim_{j\to\infty} \int_A X_{n_j}\, dP = \int_A X_n\, dP.$$
The last equality follows from the fact that $\{X_n\}$ is a martingale, $A \in \mathcal{F}_n$
and $n_j > n$ eventually. It now follows that $X_n = E[X\,|\,\mathcal{F}_n]$. We can now
apply the preceding theorem.
Exercise 5.3. For $p = 1$ the result is false. Exercise 5.1 gives us at the same
time a counterexample of an $L_1$ bounded martingale that does not converge
in $L_1$ and so cannot be represented as $X_n = E[X\,|\,\mathcal{F}_n]$.
We can show that the convergence in the preceding theorems is also valid
almost everywhere.

Theorem 5.7. Let $X \in L_p$ for some $p \ge 1$. Then the martingale $X_n =
E[X\,|\,\mathcal{F}_n]$ converges to $X$ for almost all $\omega$ with respect to $P$.
Proof. From Hölder's inequality $\|X\|_1 \le \|X\|_p$. Clearly it is sufficient to
prove the theorem for $p = 1$. Let us denote by $M \subset L_1$ the set of functions
$X \in L_1$ for which the theorem is true. Clearly $M$ is a linear subset of $L_1$.
We will prove that it is closed in $L_1$ and that it is dense in $L_1$. If we denote
by $M_n$ the space of $\mathcal{F}_n$-measurable functions in $L_1$, then $M_n$ is a closed
subspace of $L_1$. By standard approximation theorems $\cup_n M_n$ is dense in $L_1$.
Since it is obvious that $M \supset M_n$ for every $n$, it follows that $M$ is dense in
$L_1$. Let $Y_j \in M \subset L_1$ and $Y_j \to X$ in $L_1$. Let us define $Y_{n,j} = E[Y_j\,|\,\mathcal{F}_n]$.
With $X_n = E[X\,|\,\mathcal{F}_n]$, by Doob's inequality (5.1) and Jensen's inequality
(4.2),
$$P\Big[\sup_{1 \le n \le N} |X_n| \ge \ell\Big] \le \frac{1}{\ell} \int_{\{\omega : \sup_{1 \le n \le N}|X_n| \ge \ell\}} |X_N|\, dP \le \frac{E[\,|X_N|\,]}{\ell} \le \frac{E[\,|X|\,]}{\ell}$$
and therefore $X_n$ is almost surely a bounded sequence. Since we know that
$X_n \to X$ in $L_1$, it suffices to show that
$$\limsup_{n\to\infty} X_n - \liminf_{n\to\infty} X_n = 0 \quad \text{a.e. } P.$$
If we write $X = Y_j + (X - Y_j)$, then $X_n = Y_{n,j} + (X_n - Y_{n,j})$, and
$$\begin{aligned}
\limsup_{n\to\infty} X_n - \liminf_{n\to\infty} X_n &\le [\limsup_{n\to\infty} Y_{n,j} - \liminf_{n\to\infty} Y_{n,j}]\\
&\quad + [\limsup_{n\to\infty} (X_n - Y_{n,j}) - \liminf_{n\to\infty} (X_n - Y_{n,j})]\\
&= \limsup_{n\to\infty} (X_n - Y_{n,j}) - \liminf_{n\to\infty} (X_n - Y_{n,j})\\
&\le 2 \sup_n |X_n - Y_{n,j}|.
\end{aligned}$$
Here we have used the fact that $Y_j \in M$ for every $j$ and hence
$$\limsup_{n\to\infty} Y_{n,j} - \liminf_{n\to\infty} Y_{n,j} = 0 \quad \text{a.e. } P.$$
Finally
$$P\Big[\limsup_{n\to\infty} X_n - \liminf_{n\to\infty} X_n \ge \epsilon\Big] \le P\Big[\sup_n |X_n - Y_{n,j}| \ge \frac{\epsilon}{2}\Big] \le \frac{2}{\epsilon}\, E[\,|X - Y_j|\,]$$
$$= 0,$$
since the left-hand side is independent of $j$ and the term on the right
tends to 0 as $j \to \infty$.
The only case where the situation is unclear is when $p = 1$. If $X_n$ is an
$L_1$ bounded martingale, it is not clear that it comes from an $X$. If it did
arise from an $X$, then $X_n$ would converge to it in $L_1$ and in particular would
have to be uniformly integrable. The converse is also true.

Theorem 5.8. If $X_n$ is a uniformly integrable martingale, then there is a
random variable $X$ such that $X_n = E[X\,|\,\mathcal{F}_n]$, and then of course $X_n \to X$ in
$L_1$.
Proof. The uniform integrability of $X_n$ implies weak compactness in $L_1$,
and if $X$ is any weak limit of $X_n$ [see [7]], it is not difficult to show, as in
Theorem 5.6, that $X_n = E[X\,|\,\mathcal{F}_n]$.

Remark 5.7. Note that for $p > 1$, a martingale $X_n$ that is bounded in $L_p$ is
uniformly integrable in $L_p$, i.e. $|X_n|^p$ is uniformly integrable. This is false for
$p = 1$. The $L_1$ bounded martingale that we constructed earlier in Exercise
5.1 as a counterexample is not convergent in $L_1$ and therefore cannot be
uniformly integrable. We will defer the analysis of $L_1$ bounded martingales
to the next section.
5.3 Doob Decomposition Theorem.

The simplest example of a submartingale is a sequence of functions that is
nondecreasing in $n$ for every (almost all) $\omega$. In some sense the simplest
example is also the most general. More precisely, the decomposition theorem
of Doob asserts the following.

Theorem 5.9. (Doob decomposition theorem.) If $\{X_n : n \ge 1\}$ is a
sub-martingale on $(\Omega, \mathcal{F}_n, P)$, then $X_n$ can be written as $X_n = Y_n + A_n$,
with the following properties:

1. $(Y_n, \mathcal{F}_n)$ is a martingale.

2. $A_{n+1} \ge A_n$ for almost all $\omega$ and for every $n \ge 1$.

3. $A_1 \equiv 0$.

4. For every $n \ge 2$, $A_n$ is $\mathcal{F}_{n-1}$-measurable.

$X_n$ determines $Y_n$ and $A_n$ uniquely.

Proof. Let $X_n$ be any sequence of integrable functions such that $X_n$ is $\mathcal{F}_n$-measurable,
and is represented as $X_n = Y_n + A_n$, with $Y_n$ and $A_n$ satisfying
(1), (3) and (4). Then
$$A_n - A_{n-1} = E[X_n - X_{n-1}\,|\,\mathcal{F}_{n-1}] \tag{5.4}$$
are uniquely determined. Since $A_1 = 0$, all the $A_n$ are uniquely determined as
well. Property (2) is then plainly equivalent to the submartingale property.
To establish the representation, we define $A_n$ inductively by (5.4). It is
routine to verify that $Y_n = X_n - A_n$ is a martingale and the monotonicity of
$A_n$ is a consequence of the submartingale property.
Remark 5.8. Actually, given any sequence of integrable functions $\{X_n : n \ge 1\}$
such that $X_n$ is $\mathcal{F}_n$-measurable, equation (5.4) along with $A_1 = 0$ defines
$\mathcal{F}_{n-1}$-measurable functions $A_n$ that are integrable, such that $X_n = Y_n + A_n$ and
$(Y_n, \mathcal{F}_n)$ is a martingale. The decomposition is always unique. It is easy
to verify from (5.4) that $\{A_n\}$ is increasing (or decreasing) if and only if
$\{X_n\}$ is a sub- (or super-) martingale. Such a decomposition is called the
semi-martingale decomposition.

Remark 5.9. It is the demand that $A_n$ be $\mathcal{F}_{n-1}$-measurable that leads to
uniqueness. If we have to deal with continuous time this will become a
thorny issue.
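For a concrete submartingale the decomposition can be written down explicitly: with $X_n = S_n^2$ for a $\pm 1$ random walk $S_n$, equation (5.4) gives $A_n - A_{n-1} = E[S_n^2 - S_{n-1}^2\,|\,\mathcal{F}_{n-1}] = 1$, so $A_n = n-1$ and $Y_n = S_n^2 - (n-1)$ must be a martingale. A sketch checking numerically that $E[Y_n]$ stays constant:

```python
import random

# Doob decomposition of the submartingale X_n = S_n^2 for a +-1 walk:
# A_n = n - 1 (with A_1 = 0), so Y_n = S_n^2 - (n - 1) is a martingale.
random.seed(2)
trials, horizon = 50000, 50
means = [0.0] * (horizon + 1)
for _ in range(trials):
    s = 0
    for n in range(1, horizon + 1):
        s += random.choice((-1, 1))
        means[n] += s * s - (n - 1)        # Y_n along this path
print([round(means[n] / trials, 3) for n in (1, 10, 25, 50)])  # all ~ 1.0
```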
We now return to the study of $L_1$ bounded martingales. A nonnegative
martingale is clearly $L_1$ bounded because $E[\,|X_n|\,] = E[X_n] = E[X_1]$.
One easy way to generate $L_1$ bounded martingales is to take the difference
of two nonnegative martingales. We have the converse as well.
Theorem 5.10. Let $X_n$ be an $L_1$ bounded martingale. Then there are two
nonnegative martingales $Y_n$ and $Z_n$, relative to the same $\sigma$-fields $\mathcal{F}_n$, such that
$X_n = Y_n - Z_n$.

Proof. For each $j$ and $n \ge j$, we define
$$Y_{j,n} = E[\,|X_n|\;|\,\mathcal{F}_j].$$
By the submartingale property of $|X_n|$,
$$Y_{j,n+1} - Y_{j,n} = E[(|X_{n+1}| - |X_n|)\,|\,\mathcal{F}_j] = E[E[(|X_{n+1}| - |X_n|)\,|\,\mathcal{F}_n]\,|\,\mathcal{F}_j] \ge 0$$
almost surely. $Y_{j,n}$ is nonnegative and $E[Y_{j,n}] = E[\,|X_n|\,]$ is bounded in $n$.
By the monotone convergence theorem, for each $j$ there exists some $Y_j$ in
$L_1$ such that $Y_{j,n} \to Y_j$ in $L_1$ as $n \to \infty$. Since limits of martingales are
again martingales, and $(Y_{j,n}, \mathcal{F}_j)$ is a martingale in $j$ for each fixed $n$, it follows
that $Y_j$ is a martingale. Moreover
$$Y_j + X_j = \lim_{n\to\infty} E[\,|X_n| + X_n\,|\,\mathcal{F}_j] \ge 0,$$
and
$$X_j = (Y_j + X_j) - Y_j$$
does it!
We can always assume that our nonnegative martingale has its expectation
equal to 1, because we can always multiply by a suitable constant. Here
is a way in which such martingales arise. Suppose we have a probability
space $(\Omega, \mathcal{F}, P)$ and an increasing family of sub $\sigma$-fields $\mathcal{F}_n$ of $\mathcal{F}$ that
generate $\mathcal{F}$. Suppose $Q$ is another probability measure on $(\Omega, \mathcal{F})$ which may
or may not be absolutely continuous with respect to $P$ on $\mathcal{F}$. Let us suppose
however that $Q \ll P$ on each $\mathcal{F}_n$, i.e. whenever $A \in \mathcal{F}_n$ and $P(A) = 0$, it
follows that $Q(A) = 0$. Then the sequence of Radon-Nikodym derivatives
$$X_n = \frac{dQ}{dP}\Big|_{\mathcal{F}_n}$$
of $Q$ with respect to $P$ on $\mathcal{F}_n$ is a nonnegative martingale with expectation
1. It comes from an $X$ if and only if $Q \ll P$ on $\mathcal{F}$, and this is the uniformly
integrable case. By the Lebesgue decomposition we reduce our consideration to
the case when $Q \perp P$. Let us change the reference measure to $P' = \frac{P+Q}{2}$.
The Radon-Nikodym derivative
$$X'_n = \frac{dQ}{dP'}\Big|_{\mathcal{F}_n} = \frac{2X_n}{1 + X_n}$$
is uniformly integrable with respect to $P'$ and $X'_n \to X'$ a.e. $P'$. From
the orthogonality $P \perp Q$ we know that there are disjoint sets $E, E^c$ with
$P(E) = 1$ and $Q(E^c) = 1$. Then
$$Q(A) = Q(A \cap E) + Q(A \cap E^c) = Q(A \cap E^c) = 2P'(A \cap E^c) = \int_A 2\,\mathbf{1}_{E^c}(\omega)\, dP'.$$
It is now seen that
$$X' = \frac{dQ}{dP'}\Big|_{\mathcal{F}} = \begin{cases} 2 & \text{a.e. } Q \\ 0 & \text{a.e. } P \end{cases}$$
from which one concludes that
$$P\Big[\lim_{n\to\infty} X_n = 0\Big] = 1.$$
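The singular case is easy to visualize with coin tossing: take P to be fair coins and Q a biased coin, so that $Q \perp P$ on the full $\sigma$-field; the likelihood ratio $X_n = \frac{dQ}{dP}\big|_{\mathcal{F}_n}$ then tends to 0 a.e. P even though $E^P[X_n] \equiv 1$. A minimal sketch (the bias 0.6 is an arbitrary choice):

```python
import random, math

# Likelihood-ratio martingale X_n = dQ/dP|F_n for coin tossing:
# P = fair coin, Q = coin with heads probability 0.6.
# Under P, X_n -> 0 a.s. while E^P[X_n] = 1 for every n.
random.seed(3)
qh = 0.6
for _ in range(5):                       # a few sample paths under P
    logx = 0.0
    for n in range(2000):
        head = random.random() < 0.5     # toss under P
        logx += math.log(2 * qh if head else 2 * (1 - qh))
    print("X_2000 =", math.exp(logx))    # tiny on every path
```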
Exercise 5.4. In order to establish that a nonnegative martingale has an
almost sure limit (which may not be an $L_1$ limit), show that we can assume,
without loss of generality, that we are in the following situation:
$$\Omega = \prod_{j=1}^{\infty} R\,; \qquad \mathcal{F}_n = \sigma[x_1, \dots, x_n]\,; \qquad X_j(\omega) = x_j.$$
The existence of a $Q$ such that
$$\frac{dQ}{dP}\Big|_{\mathcal{F}_n} = x_n$$
is essentially Kolmogorov's consistency theorem (Theorem 3.5). Now complete
the proof.
Remark 5.10. We shall give a more direct proof of almost sure convergence
of an $L_1$ bounded martingale later on by means of the upcrossing inequality.
5.4 Stopping Times.

The notion of stopping times that we studied in the context of Markov chains
is important again in the context of martingales. In fact the concept of
stopping times is relevant whenever one has an ordered sequence of sub
$\sigma$-fields and is concerned about conditioning with respect to them.

Let $(\Omega, \mathcal{F})$ be a measurable space and $\{\mathcal{F}_t : t \in T\}$ a family of sub
$\sigma$-fields. $T$ is an ordered set, usually a set of real numbers or integers, of the form
$T = \{t : a \le t \le b\}$ or $\{t : t \ge a\}$. We will assume that $T = \{0, 1, 2, \dots\}$,
the set of nonnegative integers. The family $\mathcal{F}_n$ is assumed to be increasing
with $n$. In other words
$$\mathcal{F}_m \subset \mathcal{F}_n \quad \text{if } m < n.$$
An $\mathcal{F}$-measurable random variable $\tau(\omega)$ mapping $\Omega \to \{0, 1, \dots, \infty\}$ is
said to be a stopping time if for every $n \ge 0$ the set $\{\omega : \tau(\omega) \le n\} \in \mathcal{F}_n$. A
stopping time may actually take the value $\infty$ on a nonempty subset of $\Omega$.
The idea behind the definition of a stopping time, as we saw in the study
of Markov chains, is that the decision to stop at time $n$ can be based only on
the information available up to that time.

Exercise 5.5. Show that the function $\tau(\omega) \equiv k$ is a stopping time for any
admissible value of the constant $k$.
Exercise 5.6. Show that if $\tau$ is a stopping time and $f : T \to T$ is a
nondecreasing function that satisfies $f(t) \ge t$ for all $t \in T$, then
$$\tau'(\omega) = f(\tau(\omega))$$
is again a stopping time.

Exercise 5.7. Show that if $\tau_1, \tau_2$ are stopping times so are $\max(\tau_1, \tau_2)$ and
$\min(\tau_1, \tau_2)$. In particular any stopping time $\tau$ is an increasing limit of the
bounded stopping times $\tau_n(\omega) = \min(\tau(\omega), n)$.

To every stopping time $\tau$ we associate a stopped $\sigma$-field $\mathcal{F}_\tau \subset \mathcal{F}$ defined
by
$$\mathcal{F}_\tau = \{A : A \in \mathcal{F} \text{ and } A \cap \{\omega : \tau(\omega) \le n\} \in \mathcal{F}_n \text{ for every } n\}. \tag{5.5}$$
This should be thought of as the information available up to the stopping
time $\tau$. In other words, events in $\mathcal{F}_\tau$ correspond to questions that can be
answered with a yes or no if we stop observing the process at time $\tau$.

Exercise 5.8. Verify that for any stopping time $\tau$, $\mathcal{F}_\tau$ is indeed a sub
$\sigma$-field, i.e. is closed under countable unions and complementations. If $\tau(\omega) \equiv k$
then $\mathcal{F}_\tau = \mathcal{F}_k$. If $\tau_1 \le \tau_2$ are stopping times, $\mathcal{F}_{\tau_1} \subset \mathcal{F}_{\tau_2}$. Finally, if $\tau$ is a
stopping time then it is $\mathcal{F}_\tau$-measurable.

Exercise 5.9. If $X_n(\omega)$ is a sequence of measurable functions on $(\Omega, \mathcal{F})$ such
that for every $n \in T$, $X_n$ is $\mathcal{F}_n$-measurable, then on the set $\{\omega : \tau(\omega) <
\infty\}$, which is an $\mathcal{F}_\tau$-measurable set, the function $X_\tau(\omega) = X_{\tau(\omega)}(\omega)$ is $\mathcal{F}_\tau$-measurable.
The following theorem, called Doob's optional stopping theorem, is one of
the central facts in the theory of martingale sequences.

Theorem 5.11. (Optional Stopping Theorem.) Let $\{X_n : n \ge 0\}$ be a
sequence of random variables defined on a probability space $(\Omega, \mathcal{F}, P)$, which
is a martingale sequence with respect to the filtration $(\Omega, \mathcal{F}_n, P)$, and let
$0 \le \tau_1 \le \tau_2 \le C$ be two bounded stopping times. Then
$$E[X_{\tau_2}\,|\,\mathcal{F}_{\tau_1}] = X_{\tau_1} \quad \text{a.e.}$$
Proof. Since $\mathcal{F}_{\tau_1} \subset \mathcal{F}_{\tau_2} \subset \mathcal{F}_C$, it is sufficient to show that for any martingale
$\{X_n\}$
$$E[X_k\,|\,\mathcal{F}_\tau] = X_\tau \quad \text{a.e.} \tag{5.6}$$
provided $\tau$ is a stopping time bounded by the integer $k$. To see this we note
that, in view of Exercise 4.9,
$$E[X_k\,|\,\mathcal{F}_{\tau_1}] = E[E[X_k\,|\,\mathcal{F}_{\tau_2}]\,|\,\mathcal{F}_{\tau_1}],$$
and if (5.6) holds, then
$$E[X_{\tau_2}\,|\,\mathcal{F}_{\tau_1}] = X_{\tau_1} \quad \text{a.e.}$$
Let $A \in \mathcal{F}_\tau$. If we define $E_j = \{\omega : \tau(\omega) = j\}$, then $\Omega = \cup_1^k E_j$ is
a disjoint union. Moreover $A \cap E_j \in \mathcal{F}_j$ for every $j = 1, \dots, k$. By the
martingale property
$$\int_{A \cap E_j} X_k\, dP = \int_{A \cap E_j} X_j\, dP = \int_{A \cap E_j} X_\tau\, dP,$$
and summing over $j = 1, \dots, k$ gives
$$\int_A X_k\, dP = \int_A X_\tau\, dP$$
for every $A \in \mathcal{F}_\tau$, and we are done.
Remark 5.11. In particular, if $X_n$ is a martingale sequence and $\tau$ is a bounded
stopping time, then $E[X_\tau] = E[X_0]$. This property, obvious for constant
times, has now been extended to bounded stopping times. In a fair game,
a policy to quit at an opportune time gives no advantage to the gambler,
so long as he or she cannot foresee the future.

Exercise 5.10. The property extends to sub- or super-martingales. For
example if $X_n$ is a sub-martingale, then for any two bounded stopping times
$\tau_1 \le \tau_2$, we have
$$E[X_{\tau_2}\,|\,\mathcal{F}_{\tau_1}] \ge X_{\tau_1} \quad \text{a.e.}$$
One cannot use the earlier proof directly, but one can reduce it to the
martingale case by applying the Doob decomposition theorem.
Exercise 5.11. Boundedness is important. Take $X_0 = 0$ and
$$X_n = \xi_1 + \xi_2 + \cdots + \xi_n \quad \text{for } n \ge 1,$$
where the $\xi_i$ are independent identically distributed random variables taking the
values $\pm 1$ with probability $\frac{1}{2}$ each. Let $\tau = \inf\{n : X_n = 1\}$. Then $\tau$ is a stopping
time, $P[\tau < \infty] = 1$, but $\tau$ is not bounded. $X_\tau = 1$ with probability 1, and
trivially $E[X_\tau] = 1 \ne 0$.
Exercise 5.12. It does not mean that we can never consider stopping times
that are unbounded. Let $\tau$ be an unbounded stopping time. For every $k$,
$\tau_k = \min(\tau, k)$ is a bounded stopping time and $E[X_{\tau_k}] = 0$ for every $k$. As
$k \to \infty$, $\tau_k \to \tau$ and $X_{\tau_k} \to X_\tau$. If we can establish uniform integrability of
$X_{\tau_k}$ we can pass to the limit. In particular if $S(\omega) = \sup_{0 \le n \le \tau(\omega)} |X_n(\omega)|$ is
integrable then $\sup_k |X_{\tau_k}(\omega)| \le S(\omega)$ and therefore $E[X_\tau] = 0$.

Exercise 5.13. Use a similar argument to show that if
$$S(\omega) = \sup_{0 \le k \le \tau_2(\omega)} |X_k(\omega)|$$
is integrable, then for any $\tau_1 \le \tau_2$,
$$E[X_{\tau_2}\,|\,\mathcal{F}_{\tau_1}] = X_{\tau_1} \quad \text{a.e.}$$

Exercise 5.14. The previous exercise needs the fact that if $\tau_n \uparrow \tau$ are
stopping times, then
$$\sigma\Big[\bigcup_n \mathcal{F}_{\tau_n}\Big] = \mathcal{F}_\tau.$$
Prove it.

Exercise 5.15. Let us go back to the earlier exercise (Exercise 5.11) where
we had
$$X_n = \xi_1 + \cdots + \xi_n$$
as a sum of $n$ independent random variables taking the values $\pm 1$ with
probability $\frac{1}{2}$ each. Show that if $\tau$ is a stopping time with $E[\tau] < \infty$, then $S(\omega) =
\sup_{1 \le n \le \tau(\omega)} |X_n(\omega)|$ is square integrable and therefore $E[X_\tau] = 0$. [Hint: Use
the fact that $X_n^2 - n$ is a martingale.]
5.5 Upcrossing Inequality.

The following inequality due to Doob, which controls the oscillations of a
martingale sequence, is very useful for proving the almost sure convergence
of $L_1$ bounded martingales directly. Let $\{X_j : 0 \le j \le n\}$ be a martingale
sequence with $n+1$ terms. Let us take two real numbers $a < b$. An upcrossing
is a pair of terms $X_k$ and $X_l$, with $k < l$, for which $X_k \le a < b \le X_l$. Starting
from $X_0$, we locate the first term that is at most $a$ and then the first term
following it that is at least $b$. This is the first upcrossing. In our martingale
sequence there will be a certain number of completed upcrossings (of course
over disjoint intervals), and then at the end we may be in the middle of
an upcrossing, or may not even have started on one because we are still on
the way down from a level above $b$ to one below $a$. In any case there will
be a certain number $U(a, b)$ of completed upcrossings. Doob's upcrossing
inequality gives a uniform upper bound on the expected value of $U(a, b)$ in
terms of $E[|X_n|]$, i.e. one that does not depend otherwise on $n$.

Theorem 5.12. (Doob's upcrossing inequality.) For any $n$,
$$E[U(a, b)] \le \frac{1}{b - a}\, E\big[(a - X_n)^+\big] \le \frac{1}{b - a}\, \big[|a| + E[\,|X_n|\,]\big]. \tag{5.7}$$
Proof. Let us define recursively
$$\begin{aligned}
\tau_1 &= n \wedge \inf\{k : X_k \le a\}\\
\tau_2 &= n \wedge \inf\{k : k \ge \tau_1,\; X_k \ge b\}\\
&\;\;\vdots\\
\tau_{2j} &= n \wedge \inf\{k : k \ge \tau_{2j-1},\; X_k \ge b\}\\
\tau_{2j+1} &= n \wedge \inf\{k : k \ge \tau_{2j},\; X_k \le a\}\\
&\;\;\vdots
\end{aligned}$$
Since $\tau_k \ge \tau_{k-1} + 1$, $\tau_n = n$. Consider the quantity
$$D(\omega) = \sum_{j=1}^{n} [X_{\tau_{2j}} - X_{\tau_{2j-1}}],$$
which could very well have lots of 0's at the end. In any case the first few
terms correspond to upcrossings, and each such term is at least $(b - a)$, and there
are $U(a, b)$ of them. Before the 0's begin there may be at most one nonzero
term which is an incomplete upcrossing, i.e. when $\tau_{2l-1} < n = \tau_{2l}$ for some
$l$. It is then equal to $(X_n - X_{\tau_{2l-1}}) \ge X_n - a$ for some $l$. If on the other hand
we end in the middle of a downcrossing, i.e. $\tau_{2l} < n = \tau_{2l+1}$, there is no
incomplete upcrossing. Therefore
$$D(\omega) \ge (b - a)\, U(a, b) + R_n(\omega)$$
with the remainder $R_n(\omega)$ satisfying
$$R_n(\omega) = \begin{cases} 0 & \text{if } \tau_{2l} < n = \tau_{2l+1} \\ (X_n - a) & \text{if } \tau_{2l-1} < n = \tau_{2l}. \end{cases}$$
By the optional stopping theorem $E[D(\omega)] = 0$. This gives the bound
$$E[U(a, b)] \le \frac{1}{b - a}\, E\big[-R_n(\omega)\big] \le \frac{1}{b - a}\, E\big[(a - X_n)^+\big] \le \frac{1}{b - a}\, E\big[|a - X_n|\big] \le \frac{1}{b - a}\, E\big[|a| + |X_n|\big].$$
Remark 5.12. In particular if $X_n$ is an $L_1$ bounded martingale, then the
number of upcrossings of any interval $[a, b]$ is finite with probability 1. From
Doob's inequality, the sequence $X_n$ is almost surely bounded. It now follows,
by taking a countable number of intervals $[a, b]$ with rational endpoints, that
$X_n$ has a limit almost surely. If $X_n$ is uniformly integrable then the convergence
is in $L_1$ and then $X_n = E[X\,|\,\mathcal{F}_n]$. If we have a uniform $L_p$ bound
on $X_n$, then $X \in L_p$ and $X_n \to X$ in $L_p$. All of our earlier results on the
convergence of martingales now follow.

Exercise 5.16. For the proof it is sufficient that we have a supermartingale.
In fact we can change signs, and so a submartingale works just as well.
5.6 Martingale Transforms, Option Pricing.

If $X_n$ is a martingale with respect to $(\Omega, \mathcal{F}_n, P)$ and $Y_n$ are the differences
$X_n - X_{n-1}$, a martingale transform $\hat X_n$ of $X_n$ is given by the formula
$$\hat X_n = \hat X_{n-1} + a_{n-1}\, Y_n, \quad \text{for } n \ge 1,$$
where $a_{n-1}$ is $\mathcal{F}_{n-1}$-measurable and has enough integrability assumptions to
make $a_{n-1} Y_n$ integrable. An elementary calculation shows that
$$E[\hat X_n\,|\,\mathcal{F}_{n-1}] = \hat X_{n-1},$$
making $\hat X_n$ a martingale as well. $\hat X_n$ is called a martingale transform of $X_n$.
The interpretation is: if we have a fair game, we can choose the size and side of
our bet at each stage based on the prior history, and the game will continue to
be fair. It is important to note that $X_n$ may be a sum of independent random
variables with mean zero; but the independence of the increments may be
destroyed, and $\hat X_n$ will in general no longer have the independent increments
property.
Exercise 5.17. Suppose $X_n = \xi_1 + \cdots + \xi_n$, where the $\xi_j$ are independent random
variables taking the values $\pm 1$ with probability $\frac{1}{2}$ each. Let $\hat X_n$ be the martingale
transform given by
$$\hat X_n = \sum_{j=1}^{n} a_{j-1}(\omega)\, \xi_j$$
where $a_j$ is $\mathcal{F}_j$-measurable, $\mathcal{F}_j$ being the $\sigma$-field generated by $\xi_1, \dots, \xi_j$.
Calculate $E\big[(\hat X_n)^2\big]$.
Suppose $X_n$ is a sequence of nonnegative random variables that represent
the value of a security that is traded in the marketplace at a price that
is $X_n$ for day $n$ and changes overnight between day $n$ and day $n+1$ from
$X_n$ to $X_{n+1}$. We could, at the end of day $n$, based on any information $\mathcal{F}_n$
that is available to us at the end of that day, be either long or short on the
security. The quantity $a_n(\omega)$ is the number of shares that we choose to own
overnight between day $n$ and day $n+1$, and that could be a function of all the
information available to us up to that point. Positive values of $a_n$ represent
long positions and negative values represent short positions. Our gain or loss
overnight is given by $a_n(X_{n+1} - X_n)$ and the cumulative gain (loss) is the
transform
$$\hat X_n - \hat X_0 = \sum_{j=1}^{n} a_{j-1}\,(X_j - X_{j-1}).$$
A contingent claim (European option) is really a gamble or a bet based
on the value of $X_N$ at some terminal date $N$. The nature of the claim is that
there is a function $f(x)$ such that if the security trades on that day at a price
$x$ then the claim pays an amount of $f(x)$. A call is an option to buy at a
certain price $a$ and the payoff is $f(x) = (x - a)^+$, whereas a put is an option
to sell at a fixed price $a$ and therefore has a payoff function $f(x) = (a - x)^+$.
Replicating a claim, if it is possible at all, is determining $a_0, a_1, \dots, a_{N-1}$
and the initial value $V_0$ such that the transform
$$V_N = V_0 + \sum_{j=0}^{N-1} a_j\,(X_{j+1} - X_j)$$
at time $N$ equals the claim $f(X_N)$ under every conceivable behavior of the
price movements $X_1, X_2, \dots, X_N$. If the claim can be exactly replicated
starting from an initial capital of $V_0$, then $V_0$ becomes the price of that
option. Anyone could sell the option at that price, use the proceeds as
capital, follow the strategy dictated by the coefficients $a_0, \dots, a_{N-1}$, and
have exactly enough to pay off the claim at time $N$. Here we are ignoring
transaction costs as well as interest rates. It is not always true that a claim
can be replicated.
Let us assume for simplicity that the stock prices are always nonnegative
integral multiples of some unit. The set of possible prices can then
be taken to be the set of nonnegative integers. Let us make the crucial
assumption that if the price on some day is $x$, the price on the next day is $x \pm 1$. It
has to move up or down a notch; it cannot jump two or more steps or even
stay the same. When the stock price hits 0 we assume that the company
goes bankrupt and the stock stays at 0 forever. In all other cases, from day
to day, it always moves either up or down a notch.
Let us value the claim $f$ for one period. If the price at day $N-1$ is $x \ne 0$
and we have assets $c$ on hand and invest in $a$ shares, we will end up on day
$N$ with either assets of $c + a$ and a claim of $f(x+1)$, or assets of $c - a$ with
a claim of $f(x-1)$. In order to make sure that we break even in either case,
we need
$$f(x+1) = c + a\,; \qquad f(x-1) = c - a,$$
and solving for $a$ and $c$, we get
$$c(x) = \frac{1}{2}\,[f(x-1) + f(x+1)]\,; \qquad a(x) = \frac{1}{2}\,[f(x+1) - f(x-1)].$$
The value of the claim with one day left is
$$V_{N-1}(x) = \begin{cases} \frac{1}{2}\,[f(x-1) + f(x+1)] & \text{if } x \ge 1 \\ f(0) & \text{if } x = 0 \end{cases}$$
and we can proceed by iteration:
$$V_{j-1}(x) = \begin{cases} \frac{1}{2}\,[V_j(x-1) + V_j(x+1)] & \text{if } x \ge 1 \\ V_j(0) & \text{if } x = 0 \end{cases}$$
for $j \ge 1$, till we arrive at the value $V_0(x)$ of the claim at time 0 and price $x$.
The corresponding value of $a = a_{j-1}(x) = \frac{1}{2}\,[V_j(x+1) - V_j(x-1)]$ gives us
the number of shares to hold between day $j-1$ and day $j$ if the current price at
time $j-1$ equals $x$.
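The backward recursion for $V_j$ and the hedge ratios $a_{j-1}$ is a few lines of code. A sketch pricing a call $f(x) = (x - a)^+$ under the one-notch model (strike, horizon, starting price and grid size are arbitrary choices):

```python
# Backward induction for the claim value V_j(x) in the one-notch model,
# with absorption at 0.
N, strike, M = 10, 5, 40
V = [max(x - strike, 0.0) for x in range(M + 1)]       # V_N(x) = f(x)

for j in range(N, 0, -1):                              # V_{j-1} from V_j
    hedge = [(V[x + 1] - V[x - 1]) / 2 for x in range(1, M)]  # a_{j-1}(x)
    V = [V[0]] + [(V[x - 1] + V[x + 1]) / 2 for x in range(1, M)] + [V[M]]
    # The frozen top boundary V[M] is a truncation artifact; M is chosen
    # large enough that it never influences prices reachable in N steps.

print("V_0(8) =", V[8], "  initial hedge a_0(8) =", hedge[7])
```

The printed value agrees with the martingale expectation $E^{P_8}[f(X_N)]$, as in Remark 5.15 below.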
Remark 5.13. The important fact is that the value is determined by arbitrage
and is unaffected by the actual movement of the price so long as it is
compatible with the model.

Remark 5.14. The value does not depend on any statistical assumptions on
the various probabilities of transitions of price levels between successive days.

Remark 5.15. However, the value can be interpreted as the expected value
$$V_0(x) = E^{P_x}\big[f(X_N)\big]$$
where $P_x$ is the random walk starting at $x$, with probability $\frac{1}{2}$ for transitions
up or down a level, which is absorbed at 0.

Remark 5.16. $P_x$ can be characterized as the unique probability distribution
of $(X_0, \dots, X_N)$ such that $P_x[X_0 = x] = 1$, $P_x[\,|X_j - X_{j-1}| = 1\,|\,X_{j-1} \ge 1] = 1$
for $1 \le j \le N$, and $X_j$ is a martingale with respect to $(\Omega, \mathcal{F}_j, P_x)$ where $\mathcal{F}_j$
is generated by $X_0, \dots, X_j$.

Exercise 5.18. It is not necessary for the argument that the set of possible
price levels be equally spaced. If we make the assumption that for each price
level $x > 0$, the price on the following day can take only one of two possible
values $h(x) > x$ and $l(x) < x$, with a possible bankruptcy if the level 0 is
reached, a similar analysis can be worked out. Carry it out.
5.7 Martingales and Markov Chains.

One of the ways of specifying the joint distribution of a sequence $X_0, \dots, X_n$
of random variables is to specify the distribution of $X_0$ and, for each $j \ge 1$,
specify the conditional distribution of $X_j$ given the $\sigma$-field $\mathcal{F}_{j-1}$ generated by
$X_0, \dots, X_{j-1}$. Equivalently, instead of the conditional distributions one can
specify the conditional expectations $E[f(X_j)\,|\,\mathcal{F}_{j-1}]$ for $1 \le j \le n$. Let us
write
$$h_{j-1}(X_0, \dots, X_{j-1}) = E[f(X_j)\,|\,\mathcal{F}_{j-1}] - f(X_{j-1}),$$
so that, for $1 \le j \le n$,
$$E\big[\{f(X_j) - f(X_{j-1}) - h_{j-1}(X_0, \dots, X_{j-1})\}\,|\,\mathcal{F}_{j-1}\big] = 0,$$
or
$$Z^f_j = f(X_j) - f(X_0) - \sum_{i=1}^{j} h_{i-1}(X_0, \dots, X_{i-1})$$
is a martingale for every $f$. It is not difficult to see that the specification
of $\{h_i\}$ for each $f$ is enough to determine all the successive conditional
expectations and therefore the conditional distributions. If in addition the
initial distribution of $X_0$ is specified, then the distribution of $X_0, \dots, X_n$ is
completely determined.
If for each $j$ and $f$ the corresponding $h_{j-1}(X_0, \dots, X_{j-1})$ is a function
$h_{j-1}(X_{j-1})$ of $X_{j-1}$ only, then the distribution of $(X_0, \dots, X_n)$ is Markov
and the transition probabilities are seen to be given by the relation
$$h_{j-1}(X_{j-1}) = E\big[[f(X_j) - f(X_{j-1})]\,|\,\mathcal{F}_{j-1}\big] = \int [f(y) - f(X_{j-1})]\; \pi_{j-1,j}(X_{j-1}, dy).$$
In the case of a stationary Markov chain the relationship is
$$h_{j-1}(X_{j-1}) = h(X_{j-1}) = E\big[[f(X_j) - f(X_{j-1})]\,|\,\mathcal{F}_{j-1}\big] = \int [f(y) - f(X_{j-1})]\; \pi(X_{j-1}, dy).$$
If we introduce the linear transformation (transition operator)
$$(\Pi f)(x) = \int f(y)\, \pi(x, dy), \tag{5.8}$$
then
$$h(x) = ([\Pi - I]f)(x).$$
Remark 5.17. In the case of a Markov chain on a countable state space
$$(\Pi f)(x) = \sum_y \pi(x, y)\, f(y)$$
and
$$h(x) = ([\Pi - I]f)(x) = \sum_y [f(y) - f(x)]\, \pi(x, y).$$
Remark 5.18. The measure $P_x$ on the space $(\Omega, \mathcal{F})$ of sequences $\{x_j : j \ge 0\}$
from the state space $X$, that corresponds to the Markov process with
transition probability $\pi(x, dy)$ and initial state $x$, can be characterized as
the unique measure on $(\Omega, \mathcal{F})$ such that
$$P_x\big[\omega : x_0 = x\big] = 1$$
and, for every bounded measurable function $f$ defined on the state space $X$,
$$f(x_n) - f(x_0) - \sum_{j=1}^{n} h(x_{j-1})$$
is a martingale with respect to $(\Omega, \mathcal{F}_n, P_x)$, where
$$h(x) = \int_X [f(y) - f(x)]\, \pi(x, dy).$$
Let $A \subset X$ be a measurable subset and let $\tau_A = \inf\{n \ge 0 : x_n \in A\}$ be
the first entrance time into $A$. It is easy to see that $\tau_A$ is a stopping time. It
need not always be true that $P_x\{\tau_A < \infty\} = 1$. But $U_A(x) = P_x\{\tau_A < \infty\}$ is
a well defined measurable function of $x$ that satisfies $0 \le U_A(x) \le 1$ for all $x$
and is the exit probability from the set $A^c$. By its very definition $U_A(x) \equiv 1$
on $A$, and if $x \notin A$, by the Markov property,
$$U_A(x) = \pi(x, A) + \int_{A^c} U_A(y)\, \pi(x, dy) = \int_X U_A(y)\, \pi(x, dy).$$
In other words $U_A$ satisfies $0 \le U_A \le 1$ and is a solution of
$$(\Pi - I)V = 0 \ \text{ on } A^c, \qquad V = 1 \ \text{ on } A. \tag{5.9}$$
Theorem 5.13. Among all nonnegative solutions $V$ of the equation (5.9),
$U_A(x) = P_x\{\tau_A < \infty\}$ is the smallest. If $U_A(x) \equiv 1$, then any bounded
solution of the equation
$$(\Pi - I)V = 0 \ \text{ on } A^c, \qquad V = f \ \text{ on } A \tag{5.10}$$
is equal to
$$V(x) = E^{P_x}\big[f(x_{\tau_A})\big]. \tag{5.11}$$
In particular, if $U_A(x) = 1$ for all $x \notin A$, then any bounded solution $V$ of
equation (5.10) is unique and is given by the formula (5.11).
Proof. First we establish that any nonnegative solution $V$ of (5.9) dominates
$U_A$. Let us replace $V$ by $W = \min(V, 1)$. Then $0 \le W \le 1$ everywhere,
$W(x) = 1$ for $x \in A$, and for $x \notin A$,
$$(\Pi W)(x) = \int_X W(y)\, \pi(x, dy) \le \int_X V(y)\, \pi(x, dy) = V(x).$$
Since $\Pi W \le 1$ as well, we conclude that $\Pi W \le W$ on $A^c$. On the other hand
it is obvious that $\Pi W \le 1 = W$ on $A$. Since we have shown that $\Pi W \le W$
everywhere, it follows that $\{W(x_n)\}$ is a supermartingale with respect to
$(\Omega, \mathcal{F}_n, P_x)$. In particular for any bounded stopping time $\tau$,
$$E^{P_x}\big[W(x_\tau)\big] \le E^{P_x}\big[W(x_0)\big] = W(x).$$
While we cannot take $\tau = \tau_A$ (since $\tau_A$ may not be bounded), we can always
take $\tau = \tau_N = \min(\tau_A, N)$ to conclude
$$E^{P_x}\big[W(x_{\tau_N})\big] \le E^{P_x}\big[W(x_0)\big] = W(x).$$
Let us let $N \to \infty$. On the set $\{\omega : \tau_A(\omega) < \infty\}$, $\tau_N \uparrow \tau_A$ and $W(x_{\tau_N}) \to
W(x_{\tau_A}) = 1$. Since $W$ is nonnegative and bounded,
$$W(x) \ge \limsup_{N\to\infty} E^{P_x}\big[W(x_{\tau_N})\big] \ge \limsup_{N\to\infty} \int_{\tau_A < \infty} W(x_{\tau_N})\, dP_x = P_x\{\tau_A < \infty\} = U_A(x).$$
Since $V(x) \ge W(x)$, it follows that $V(x) \ge U_A(x)$.

For a bounded solution $V$ of (5.10), let us define $h = (\Pi - I)V$, which will
be a function vanishing on $A^c$. We know that
$$V(x_n) - V(x_0) - \sum_{j=1}^{n} h(x_{j-1})$$
is a martingale with respect to $(\Omega, \mathcal{F}_n, P_x)$, and let us use the stopping theorem
with $\tau_N = \min(\tau_A, N)$. Since $h(x_{j-1}) = 0$ for $j \le \tau_A$, we obtain
$$V(x) = E^{P_x}\big[V(x_{\tau_N})\big].$$
If we now make the assumption that $U_A(x) = P_x\{\tau_A < \infty\} \equiv 1$, let $N \to \infty$
and use the bounded convergence theorem, it is easy to see that
$$V(x) = E^{P_x}\big[f(x_{\tau_A})\big],$$
which proves (5.11) and the rest of the theorem.
Such arguments are powerful tools for the study of qualitative properties of Markov chains. Solutions to equations of the type $[\Pi - I]V = f$ are often easily constructed. They can be used to produce martingales, submartingales or supermartingales that have a certain behavior, and that in turn implies certain qualitative behavior of the Markov chain. We will now see several illustrations of this method.
Example 5.1. Consider the symmetric simple random walk in one dimension. We know from recurrence that the random walk exits the interval $(-R, R)$ in a finite time. But we want to get some estimates on the exit time $\tau_R$. Consider the function $u(x) = \cos \lambda x$. The function $f(x) = [\Pi u](x)$ can be calculated and

$$f(x) = \frac{1}{2}[\cos \lambda(x-1) + \cos \lambda(x+1)] = \cos\lambda\,\cos\lambda x = \cos\lambda\,u(x).$$

If $\lambda < \frac{\pi}{2R}$, then $\cos\lambda x \ge \cos\lambda R > 0$ in $[-R, R]$. Consider $Z_n = e^{n\sigma}\cos\lambda x_n$ with $\sigma = -\log\cos\lambda$. Then

$$E^{P_x}\left[Z_n\,|\,\mathcal{F}_{n-1}\right] = e^{n\sigma} f(x_{n-1}) = e^{n\sigma}\cos\lambda\,\cos\lambda x_{n-1} = Z_{n-1}.$$

If $\tau_R$ is the exit time from the interval $(-R, R)$, for any $N$ we have

$$E^{P_x}\left[Z_{\tau_R \wedge N}\right] = E^{P_x}\left[Z_0\right] = \cos\lambda x.$$

Since $\sigma > 0$ and $\cos\lambda x \ge \cos\lambda R > 0$ for $x \in [-R, R]$, if $R$ is an integer, we can claim that

$$E^{P_x}\left[e^{\sigma[\tau_R \wedge N]}\right] \le \frac{\cos\lambda x}{\cos\lambda R}.$$

Since the estimate is uniform in $N$, we can let $N \to \infty$ to get the estimate

$$E^{P_x}\left[e^{\sigma\tau_R}\right] \le \frac{\cos\lambda x}{\cos\lambda R}.$$
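The bound is easy to test by Monte Carlo; the values $R = 5$ and starting point $x = 0$ below are illustrative choices, not from the text:

```python
import numpy as np

R = 5
lam = 0.9 * np.pi / (2 * R)            # any lambda < pi/(2R)
sigma = -np.log(np.cos(lam))           # sigma = -log cos(lambda) > 0

rng = np.random.default_rng(1)
n_paths = 20000
vals = np.empty(n_paths)
for k in range(n_paths):
    x, t = 0, 0
    while abs(x) < R:                  # tau_R: first exit from (-R, R)
        x += rng.choice((-1, 1))
        t += 1
    vals[k] = np.exp(sigma * t)

bound = 1.0 / np.cos(lam * R)          # cos(lam * 0) / cos(lam * R)
print("E[e^{sigma tau_R}] ~", vals.mean(), " bound:", bound)
```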
Exercise 5.19. Can you prove equality above? What is the range of validity of the equality? Is $E^{P_x}\left[e^{\sigma\tau_R}\right] < \infty$ for all $\sigma > 0$?
Example 5.2. Let us make life slightly more complicated by taking a Markov chain in $Z^d$ with transition probabilities

$$\pi(x, y) = \begin{cases} \frac{1}{2d} + \epsilon(x, y) & \text{if } |x - y| = 1\\ 0 & \text{if } |x - y| \neq 1\end{cases}$$

so that we have slightly perturbed the random walk, with perhaps even a possible bias.

Exact calculations like in Example 5.1 are of course no longer possible. Let us try to estimate again the exit time from a ball of radius $R$. For $\lambda > 0$ consider the function

$$F(x) = \exp\left[\lambda \sum_{i=1}^{d} |x_i|\right]$$

defined on $Z^d$. We can get an estimate of the form

$$(\Pi F)(x_1,\dots,x_d) \ge \rho\,F(x_1,\dots,x_d)$$

for some choices of $\lambda > 0$ and $\rho > 1$ that may depend on $R$. Now proceed as in Example 5.1.
Example 5.3. We can use these methods to show that the random walk is transient in dimension $d \ge 3$.

For $0 < \alpha < d - 2$ consider the function $V(x) = \frac{1}{|x|^\alpha}$ for $x \neq 0$ with $V(0) = 1$. An approximate calculation of $(\Pi V)(x)$ yields, for sufficiently large $|x|$ (i.e. $|x| \ge L$ for some $L$), the estimate

$$(\Pi V)(x) - V(x) \le 0.$$

If we start initially from an $x$ with $|x| > L$ and take $\tau_L$ to be the first entrance time into the ball of radius $L$, one gets by the stopping theorem the inequality

$$E^{P_x}\left[V(x_{\tau_L \wedge N})\right] \le V(x).$$

If $\tau_L \le N$, then $|x_{\tau_L}| \le L$. In any case $V(x_{\tau_L \wedge N}) \ge 0$. Therefore,

$$P_x\left[\tau_L \le N\right] \le \frac{V(x)}{\inf_{|y| \le L} V(y)}$$

valid uniformly in $N$. Letting $N \to \infty$,

$$P_x\left[\tau_L < \infty\right] \le \frac{V(x)}{\inf_{|y| \le L} V(y)}.$$

If we let $|x| \to \infty$, keeping $L$ fixed, we see the transience. Note that recurrence implies that $P_x\left[\tau_L < \infty\right] = 1$ for all $x$. The proof of transience really only required a function $V$, defined for large $|x|$, that was strictly positive for each $x$, went to $0$ as $|x| \to \infty$, and had the property $(\Pi V)(x) \le V(x)$ for large values of $|x|$.
Example 5.4. We will now show that the random walk is recurrent in $d = 2$. This is harder because the recurrence of the random walk in $d = 2$ is right on the border. We want to construct a function $V(x) \to \infty$ as $|x| \to \infty$ that satisfies $(\Pi V)(x) \le V(x)$ for large $|x|$. If we succeed, then we can estimate by a stopping argument the probability that the chain starting from a point $x$ in the annulus $\ell < |x| < L$ exits at the outer circle before getting inside the inner circle:

$$P_x\left[\tau_L < \tau_\ell\right] \le \frac{V(x)}{\inf_{|y| \ge L} V(y)}.$$

We also have for every $L$,

$$P_x\left[\tau_L < \infty\right] = 1.$$

This proves that $P_x\left[\tau_\ell < \infty\right] = 1$, thereby proving recurrence. The natural candidate is $F(x) = \log |x|$ for $x \neq 0$. A computation yields

$$(\Pi F)(x) - F(x) \le \frac{C}{|x|^4}$$

which does not quite make it. On the other hand if $U(x) = |x|^{-1}$, for large values of $|x|$,

$$(\Pi U)(x) - U(x) \ge \frac{c}{|x|^3}$$

for some $c > 0$. The choice of $V(x) = F(x) - CU(x) = \log |x| - \frac{C}{|x|}$ works with any $C > 0$.
Example 5.5. We can use these methods for proving positive recurrence as well.

Suppose $X$ is a countable set and we can find $V \ge 0$, a finite set $F$ and a constant $C \ge 0$ such that

$$(\Pi V)(x) - V(x) \le \begin{cases} -1 & \text{for } x \notin F\\ \ \ C & \text{for } x \in F\end{cases}$$

Let us let $U = \Pi V - V$, and we have

$$-V(x) \le E^{P_x}\left[V(x_n) - V(x)\right] = E^{P_x}\left[\sum_{j=1}^{n} U(x_{j-1})\right] \le E^{P_x}\left[\sum_{j=1}^{n} C\,1_F(x_{j-1}) - \sum_{j=1}^{n} 1_{F^c}(x_{j-1})\right]$$
$$= E^{P_x}\left[\sum_{j=1}^{n} [-1 + (1 + C)1_F(x_{j-1})]\right] = -n + (1 + C)\sum_{j=1}^{n}\sum_{y \in F} \pi^{(j-1)}(x, y) = -n + o(n) \quad\text{as } n \to \infty$$

if the process is not positive recurrent. This is a contradiction.

For instance, suppose $X = Z$, the integers, and we have a little bit of bias towards the origin in the random walk:

$$\pi(x, x+1) - \pi(x, x-1) \le -\frac{a}{|x|} \quad\text{if } x > 0,$$
$$\pi(x, x-1) - \pi(x, x+1) \le -\frac{a}{|x|} \quad\text{if } x < 0.$$

With $V(x) = x^2$, for $x > 0$,

$$(\Pi V)(x) \le (x+1)^2\,\frac{1}{2}\left(1 - \frac{a}{|x|}\right) + (x-1)^2\,\frac{1}{2}\left(1 + \frac{a}{|x|}\right) = x^2 + 1 - 2a.$$

If $a > \frac{1}{2}$, we can multiply $V$ by a constant and it works.
Exercise 5.20. What happens when

$$\pi(x, x+1) - \pi(x, x-1) = -\frac{1}{2x}$$

for $|x| \ge 10$? (See Exercise 4.16.)
Example 5.6. Let us return to our example of a branching process, Example 4.4. We see from the relation

$$E[X_{n+1}\,|\,\mathcal{F}_n] = m X_n$$

that $\frac{X_n}{m^n}$ is a martingale. If $m < 1$ we saw before quite easily that the population becomes extinct. If $m = 1$, $X_n$ is a martingale. Since it is nonnegative it is $L_1$ bounded and must have an almost sure limit as $n \to \infty$. Since the population is an integer, this means that the size eventually stabilizes. The limit can only be $0$ because the population cannot stabilize at any other size. If $m > 1$ there is a probability $0 < q < 1$ such that $P[X_n \to 0\,|\,X_0 = 1] = q$. We can show that with probability $1 - q$, $X_n \to \infty$. To see this, consider the function $u(x) = q^x$. In the notation of Example 4.4

$$E[q^{X_{n+1}}\,|\,\mathcal{F}_n] = \left[\sum_j q^j p_j\right]^{X_n} = [P(q)]^{X_n} = q^{X_n}$$

so that $q^{X_n}$ is a nonnegative martingale. It then has an almost sure limit, which can only be $0$ or $1$. If $q$ is the probability that it is $1$, i.e. that $X_n \to 0$, then $1 - q$ is the probability that it is $0$, i.e. that $X_n \to \infty$.
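For a concrete offspring law the dichotomy is easy to watch numerically: the extinction frequency matches the root $q$ of $P(q) = q$ and the surviving populations blow up. A sketch with the illustrative offspring distribution $(p_0, p_1, p_2) = (0.25, 0.25, 0.5)$, so $m = 1.25$ and $q = 0.5$:

```python
import numpy as np

p = np.array([0.25, 0.25, 0.5])        # offspring law; m = 1.25 > 1

# q = smallest fixed point of the generating function P(q) in (0, 1];
# iterating q <- P(q) from q = 0 converges to it.
q = 0.0
for _ in range(200):
    q = p[0] + p[1] * q + p[2] * q**2

rng = np.random.default_rng(2)
n_paths, n_gen = 10000, 40
extinct = 0
for _ in range(n_paths):
    x = 1
    for _ in range(n_gen):
        if x == 0:
            break
        x = rng.choice(3, size=x, p=p).sum()   # next generation
    extinct += (x == 0)
print("q =", q, " extinction frequency by generation 40:", extinct / n_paths)
```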
Chapter 6

Stationary Stochastic Processes.
6.1 Ergodic Theorems.

A stationary stochastic process is a collection $\{\xi_n : n \in Z\}$ of random variables with values in some space $(X, \mathcal{B})$ such that the joint distribution of $(\xi_{n_1},\dots,\xi_{n_k})$ is the same as that of $(\xi_{n_1+n},\dots,\xi_{n_k+n})$ for every choice of $k \ge 1$ and $n, n_1,\dots,n_k \in Z$. Assuming that the space $(X, \mathcal{B})$ is reasonable and Kolmogorov's consistency theorem applies, we can build a measure $P$ on the countable product space $\Omega$ of sequences $\{x_n : n \in Z\}$ with values in $X$, defined for sets in the product $\sigma$-field $\mathcal{F}$. On the space $\Omega$ there is the natural shift $T$ defined by $(T\omega)(n) = x_{n+1}$ for $\omega$ with $\omega(n) = x_n$. The random variables $x_n(\omega) = \omega(n)$ are essentially equivalent to $\{\xi_n\}$. The stationarity of the process is reflected in the invariance of $P$ with respect to $T$, i.e. $PT^{-1} = P$. We can, without being specific, consider a space $\Omega$, a $\sigma$-field $\mathcal{F}$, a one-to-one invertible measurable map $T$ from $\Omega \to \Omega$ with a measurable inverse $T^{-1}$, and finally a probability measure $P$ on $(\Omega, \mathcal{F})$ that is $T$-invariant, i.e. $P(T^{-1}A) = P(A)$ for every $A \in \mathcal{F}$. One says that $P$ is an invariant measure for $T$, or $T$ is a measure preserving transformation for $P$. If we have a measurable map $\xi : (\Omega, \mathcal{F}) \to (X, \mathcal{B})$, then it is easily seen that $\xi_n(\omega) = \xi(T^n\omega)$ defines a stationary stochastic process. The study of stationary stochastic processes is then more or less the same as the study of measure preserving (i.e. probability preserving) transformations.
The basic transformation $T : \Omega \to \Omega$ induces a linear transformation $U$ on the space of functions defined on $\Omega$ by the rule $(Uf)(\omega) = f(T\omega)$. Because $T$ is measure preserving it is easy to see that

$$\int_\Omega f(\omega)\,dP = \int_\Omega f(T\omega)\,dP = \int_\Omega (Uf)(\omega)\,dP$$

as well as

$$\int_\Omega |f(\omega)|^p\,dP = \int_\Omega |f(T\omega)|^p\,dP = \int_\Omega |(Uf)(\omega)|^p\,dP.$$

In other words $U$ acts as an isometry (i.e. a norm preserving linear transformation) on the various $L_p$ spaces for $1 \le p < \infty$, and in fact it is an isometry on $L_\infty$ as well. Moreover the transformation induced by $T^{-1}$ is the inverse of $U$, so that $U$ is also invertible. In particular $U$ is unitary (or orthogonal) on $L_2$. This means it preserves the inner product $\langle\cdot,\cdot\rangle$:

$$\langle f, g\rangle = \int f(\omega)g(\omega)\,dP = \int f(T\omega)g(T\omega)\,dP = \langle Uf, Ug\rangle.$$

Of course our linear transformation $U$ is very special and satisfies $U1 = 1$ and $U(fg) = (Uf)(Ug)$.
A basic theorem known as the Ergodic Theorem asserts the following.

Theorem 6.1. For any $f \in L_1(P)$ the limit

$$\lim_{n\to\infty} \frac{f(\omega) + f(T\omega) + \dots + f(T^{n-1}\omega)}{n} = g(\omega)$$

exists for almost all $\omega$ with respect to $P$, as well as in $L_1(P)$. Moreover if $f \in L_p$ for some $p$ satisfying $1 < p < \infty$, then the function $g \in L_p$ and the convergence takes place in that $L_p$. Moreover the limit $g(\omega)$ is given by the conditional expectation

$$g(\omega) = E^P[f\,|\,\mathcal{I}]$$

where the $\sigma$-field $\mathcal{I}$, called the invariant $\sigma$-field, is defined as

$$\mathcal{I} = \{A : TA = A\}.$$
Proof. First we prove the convergence in the various $L_p$ spaces. These are called mean ergodic theorems. The easiest situation to prove is when $p = 2$. Let us define

$$H_0 = \{f : f \in H,\ Uf = f\} = \{f : f \in H,\ f(T\omega) = f(\omega)\}.$$

Since $H_0$ contains constants, it is a closed nontrivial subspace of $H = L_2(P)$, of dimension at least one. Since $U$ is unitary, $Uf = f$ if and only if $U^{-1}f = U^*f = f$, where $U^*$ is the adjoint of $U$. The orthogonal complement $H_0^\perp$ can be defined as

$$H_0^\perp = \{g : \langle g, f\rangle = 0\ \ \forall f : U^*f = f\} = \overline{\mathrm{Range}\,(I - U)}.$$

Clearly if we let

$$A_n f = \frac{f + Uf + \dots + U^{n-1}f}{n}$$

then $\|A_n f\|_2 \le \|f\|_2$ for every $f \in H$, and $A_n f = f$ for every $n$ and $f \in H_0$. Therefore for $f \in H_0$, $A_n f \to f$ as $n \to \infty$. On the other hand if $f = (I - U)g$,

$$A_n f = \frac{g - U^n g}{n}$$

and $\|A_n f\|_2 \le \frac{2\|g\|_2}{n} \to 0$ as $n \to \infty$. Since $\|A_n\| \le 1$, it follows that $A_n f \to 0$ as $n \to \infty$ for every $f \in H_0^\perp = \overline{\mathrm{Range}\,(I - U)}$. (See Exercise 6.1.) If we denote by $\widehat{\Pi}$ the orthogonal projection from $H \to H_0$, we see that $A_n f \to \widehat{\Pi} f$ as $n \to \infty$ for every $f \in H$, establishing the $L_2$ ergodic theorem.

There is an alternate characterization of $H_0$. Functions $f$ in $H_0$ are invariant under $T$, i.e. have the property that $f(T\omega) = f(\omega)$. For any invariant function $f$ the level sets $\{\omega : a < f(\omega) < b\}$ are invariant under $T$. We can therefore talk about invariant sets $\{A : A \in \mathcal{F},\ T^{-1}A = A\}$. Technically we should allow ourselves to differ by sets of measure zero, and one defines $\mathcal{I} = \{A : P(A \,\Delta\, T^{-1}A) = 0\}$ as the $\sigma$-field of almost invariant sets. Nothing is therefore lost by taking $\mathcal{I}$ to be the $\sigma$-field of invariant sets. We can identify the orthogonal projection as (see Exercise 4.8)

$$\widehat{\Pi} f = E^P\left[f\,|\,\mathcal{I}\right]$$

and $\widehat{\Pi}$, as the conditional expectation operator, is well defined on $L_p$ as an operator of norm $1$, for all $p$ in the range $1 \le p \le \infty$. If $f \in L_\infty$, then $\|A_n f\|_\infty \le \|f\|_\infty$, and by the bounded convergence theorem, for any $p$ satisfying $1 \le p < \infty$, we have $\|A_n f - \widehat{\Pi} f\|_p \to 0$ as $n \to \infty$. Since $L_\infty$ is dense in $L_p$ and $\|A_n\| \le 1$ in all the $L_p$ spaces, it is easily seen, by a simple approximation argument, that for each $p$ in $1 \le p < \infty$ and $f \in L_p$,

$$\lim_{n\to\infty} \|A_n f - \widehat{\Pi} f\|_p = 0$$

proving the mean ergodic theorem in all the $L_p$ spaces.

We now concentrate on proving almost sure convergence of $A_n f$ to $\widehat{\Pi} f$ for $f \in L_1(P)$. This part is often called the individual ergodic theorem or Birkhoff's theorem. It will be based on an analog of Doob's inequality for martingales. First we will establish an inequality called the maximal ergodic theorem.
Theorem 6.2. (Maximal Ergodic Theorem.) Let $f \in L_1(P)$ and for $n \ge 1$, let

$$E_n^0 = \{\omega : \sup_{1 \le j \le n} [f(\omega) + f(T\omega) + \dots + f(T^{j-1}\omega)] \ge 0\}.$$

Then

$$\int_{E_n^0} f(\omega)\,dP \ge 0.$$
Proof. Let

$$h_n(\omega) = \sup_{1 \le j \le n} [f(\omega) + f(T\omega) + \dots + f(T^{j-1}\omega)] = f(\omega) + \max(0, h_{n-1}(T\omega)) = f(\omega) + h^+_{n-1}(T\omega)$$

where

$$h^+_n(\omega) = \max(0, h_n(\omega)).$$

On $E_n^0$, $h_n(\omega) = h^+_n(\omega)$ and therefore

$$f(\omega) = h^+_n(\omega) - h^+_{n-1}(T\omega).$$

Consequently,

$$\int_{E_n^0} f(\omega)\,dP = \int_{E_n^0} [h^+_n(\omega) - h^+_{n-1}(T\omega)]\,dP$$
$$\ge \int_{E_n^0} [h^+_n(\omega) - h^+_n(T\omega)]\,dP \qquad (\text{because } h^+_{n-1}(\omega) \le h^+_n(\omega))$$
$$= \int_{E_n^0} h^+_n(\omega)\,dP - \int_{TE_n^0} h^+_n(\omega)\,dP \qquad (\text{because of the invariance of } T)$$
$$\ge 0.$$

The last step follows from the fact that for any integrable function $h(\omega)$, $\int_E h(\omega)\,dP$ is the largest when we take for $E$ the set $E = \{\omega : h(\omega) \ge 0\}$.
Now we establish the analog of Doob's inequality, or maximal inequality, sometimes referred to as the weak-type $1{-}1$ inequality.

Lemma 6.3. For any $f \in L_1(P)$ and $\ell > 0$, denoting by $E_n^\ell$ the set

$$E_n^\ell = \{\omega : \sup_{1 \le j \le n} |(A_j f)(\omega)| \ge \ell\}$$

we have

$$P\left[E_n^\ell\right] \le \frac{1}{\ell}\int_{E_n^\ell} |f(\omega)|\,dP.$$

In particular

$$P\left[\omega : \sup_{j \ge 1} |(A_j f)(\omega)| \ge \ell\right] \le \frac{1}{\ell}\int |f(\omega)|\,dP.$$

Proof. We can assume without loss of generality that $f \in L_1(P)$ is nonnegative. Apply the maximal ergodic theorem (Theorem 6.2) to $f - \ell$. If

$$E_n^\ell = \left\{\omega : \sup_{1 \le j \le n} \frac{f(\omega) + f(T\omega) + \dots + f(T^{j-1}\omega)}{j} > \ell\right\},$$

then

$$\int_{E_n^\ell} [f(\omega) - \ell]\,dP \ge 0$$

or

$$P[E_n^\ell] \le \frac{1}{\ell}\int_{E_n^\ell} f(\omega)\,dP.$$

We are done.
Given the lemma, the proof of the almost sure ergodic theorem follows along the same lines as the proof of the almost sure convergence in the martingale context. If $f \in H_0$ it is trivial. For $f = (I - U)g$ with $g \in L_\infty$ it is equally trivial because $\|A_n f\|_\infty \le \frac{2\|g\|_\infty}{n}$. So the almost sure convergence is valid for $f = f_1 + f_2$ with $f_1 \in H_0$ and $f_2 = (I - U)g$ with $g \in L_\infty$. But such functions are dense in $L_1(P)$. Once we have almost sure convergence for a dense set in $L_1(P)$, the almost sure convergence for every $f \in L_1(P)$ follows by routine approximation using Lemma 6.3. See the proof of Theorem 5.7.
Exercise 6.1. For any bounded linear transformation $A$ on a Hilbert space $H$, show that the closure of the range of $A$, i.e. $\overline{\mathrm{Range}\,A}$, is the orthogonal complement of the null space $\{f : A^*f = 0\}$, where $A^*$ is the adjoint of $A$.

Exercise 6.2. Show that any almost invariant set differs by a set of measure $0$ from an invariant set, i.e. if $P(A\,\Delta\,T^{-1}A) = 0$ then there is a $B \in \mathcal{F}$ with $P(A\,\Delta\,B) = 0$ and $T^{-1}B = B$.
Although the ergodic theorem implies a strong law of large numbers for any stationary sequence of random variables, in particular a sequence of independent identically distributed random variables, it is not quite the end of the story. For the law of large numbers, we need to know that the limit $\widehat{\Pi} f$ is a constant, which will then equal $\int f(\omega)\,dP$. To claim this, we need to know that the invariant $\sigma$-field $\mathcal{I}$ is trivial, or essentially consists of the whole space $\Omega$ and the empty set $\emptyset$. An invariant measure $P$ is said to be ergodic for the transformation $T$ if every $A \in \mathcal{I}$, i.e. every invariant set, has measure $0$ or $1$. Then every invariant function is almost surely a constant and $\widehat{\Pi} f = E\left[f\,|\,\mathcal{I}\right] = \int f(\omega)\,dP$.
Theorem 6.4. Any product measure is ergodic for the shift.

Proof. Let $A$ be an invariant set. Then $A$ can be approximated by sets $A_n$ in the $\sigma$-field corresponding to the coordinates from $[-n, n]$. Since $A$ is invariant, $T^{2n}A_n$ will approximate $A$ just as well. This proves that $A$ actually belongs to the tail $\sigma$-field, the remote past as well as the remote future. Now we can use Kolmogorov's $0{-}1$ law (Theorem 3.15) to assert that $P(A) = 0$ or $1$.
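In particular, for i.i.d. coordinates the ergodic theorem reduces to the strong law of large numbers, and it applies just as well to functions of several coordinates. A tiny numerical illustration with coin tosses and $f(\omega) = x_0 x_1$, whose Birkhoff averages must converge to $E[f] = 0$:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10**6
x = rng.choice((-1.0, 1.0), size=n + 1)   # i.i.d. coordinates of omega

# f(T^j omega) = x_j * x_{j+1}; the average over the orbit is the
# Birkhoff average of f under the shift.
print("Birkhoff average:", (x[:-1] * x[1:]).mean())   # ~ 0 = E[f]
```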
6.2 Structure of Stationary Measures.

Given a space $(\Omega, \mathcal{F})$ and a measurable transformation $T$ with a measurable inverse $T^{-1}$, we can consider the space $\mathcal{M}$ of all $T$-invariant probability measures on $(\Omega, \mathcal{F})$. The set $\mathcal{M}$, which may be empty, is easily seen to be a convex set.

Exercise 6.3. Let $\Omega = Z$, the integers, and for $n \in Z$, let $Tn = n + 1$. Show that $\mathcal{M}$ is empty.

Theorem 6.5. A probability measure $P \in \mathcal{M}$ is ergodic if and only if it is an extreme point of $\mathcal{M}$.
Proof. A point of a convex set is extreme if it cannot be written as a nontrivial convex combination of two other points from that set. Suppose $P \in \mathcal{M}$ is not extremal. Then $P$ can be written as a nontrivial convex combination of $P_1, P_2 \in \mathcal{M}$, i.e. for some $0 < a < 1$ and $P_1 \neq P_2$, $P = aP_1 + (1-a)P_2$. We claim that such a $P$ cannot be ergodic. If it were, by definition, $P(A) = 0$ or $1$ for every $A \in \mathcal{I}$. Since $P(A)$ can be $0$ or $1$ only when $P_1(A) = P_2(A) = 0$ or $P_1(A) = P_2(A) = 1$, it follows that for every invariant set $A \in \mathcal{I}$, $P_1(A) = P_2(A)$. We now show that if two invariant measures $P_1$ and $P_2$ agree on $\mathcal{I}$, they agree on $\mathcal{F}$. Let $f(\omega)$ be any bounded $\mathcal{F}$-measurable function. Consider the function

$$h(\omega) = \lim_{n\to\infty} \frac{1}{n}[f(\omega) + f(T\omega) + \dots + f(T^{n-1}\omega)]$$

defined on the set $E$ where the limit exists. By the ergodic theorem $P_1(E) = P_2(E) = 1$ and $h$ is $\mathcal{I}$-measurable. Moreover, by the stationarity of $P_1, P_2$ and the bounded convergence theorem,

$$E^{P_i}[f(\omega)] = \int_E h(\omega)\,dP_i \quad\text{for } i = 1, 2.$$

Since $P_1 = P_2$ on $\mathcal{I}$, $h$ is $\mathcal{I}$-measurable and $P_i(E) = 1$ for $i = 1, 2$, we see that

$$E^{P_1}[f(\omega)] = E^{P_2}[f(\omega)].$$

Since $f$ is arbitrary this implies that $P_1 = P_2$ on $\mathcal{F}$.

Conversely if $P$ is not ergodic, then there is an $A \in \mathcal{I}$ with $0 < P(A) < 1$ and we define

$$P_1(E) = \frac{P(A \cap E)}{P(A)}; \qquad P_2(E) = \frac{P(A^c \cap E)}{P(A^c)}.$$

Since $A \in \mathcal{I}$ it follows that the $P_i$ are stationary. Moreover $P = P(A)P_1 + P(A^c)P_2$ and hence $P$ is not extremal.
One of the questions in the theory of convex sets is the existence of sufficiently many extremal points, enough to recover the convex set by taking convex combinations. In particular one can ask if any point in the convex set can be obtained by taking a weighted average of the extremals. The next theorem answers the question in our context. We will assume that our space $(\Omega, \mathcal{F})$ is nice, i.e. $\Omega$ is a complete separable metric space with its Borel sets.

Theorem 6.6. For any invariant measure $P$, there is a probability measure $\mu_P$ on the set $\mathcal{M}_e$ of ergodic measures such that

$$P = \int_{\mathcal{M}_e} Q\,\mu_P(dQ).$$

Proof. If we denote by $P_\omega$ the regular conditional probability distribution of $P$ given $\mathcal{I}$, which exists (see Theorem 4.4) because $(\Omega, \mathcal{F})$ is nice, then

$$P = \int_\Omega P_\omega\,P(d\omega).$$

We will complete the proof by showing that $P_\omega$ is an ergodic stationary probability measure for almost all $\omega$ with respect to $P$. We can then view $P_\omega$ as a map $\Omega \to \mathcal{M}_e$ and $\mu_P$ will be the image of $P$ under the map. Our integral representation in terms of ergodic measures will just be an immediate consequence of the change of variables formula.

Lemma 6.7. For any stationary probability measure $P$, for almost all $\omega$ with respect to $P$, the regular conditional probability distribution $P_\omega$ of $P$ given $\mathcal{I}$ is stationary and ergodic.
Proof. Let us first prove stationarity. We need to prove that $P_\omega(A) = P_\omega(TA)$ a.e. We have to negotiate carefully through null sets. Since a measure on the Borel $\sigma$-field $\mathcal{F}$ of a complete separable metric space is determined by its values on a countable generating field $\mathcal{F}_0 \subset \mathcal{F}$, it is sufficient to prove that for each fixed $A \in \mathcal{F}_0$, $P_\omega(A) = P_\omega(TA)$ a.e. $P$. Since $P_\omega$ is $\mathcal{I}$-measurable, all we need to show is that for any $E \in \mathcal{I}$,

$$\int_E P_\omega(A)\,P(d\omega) = \int_E P_\omega(TA)\,P(d\omega)$$

or equivalently

$$P(E \cap A) = P(E \cap TA).$$

This is obvious because $P$ is stationary and $E$ is invariant.

We now turn to ergodicity. Again there is a minefield of null sets to negotiate. It is a simple exercise to check that if, for some stationary measure $Q$, the ergodic theorem is valid with an almost surely constant limit for the indicator functions $1_A$ with $A \in \mathcal{F}_0$, then $Q$ is ergodic. This needs to be checked only for a countable collection of sets $\{A\}$. We need therefore only to check that any invariant function is constant almost surely with respect to almost all $P_\omega$. Equivalently, for any invariant set $E$, $P_\omega(E)$ must be shown, $P$-almost surely, to be equal to $0$ or $1$. But $P_\omega(E) = 1_E(\omega)$ a.e. and is always $0$ or $1$. This completes the proof.
Exercise 6.4. Show that any two distinct ergodic invariant measures $P_1$ and $P_2$ are orthogonal on $\mathcal{I}$, i.e. there is an invariant set $E$ such that $P_1(E) = 1$ and $P_2(E) = 0$.

Exercise 6.5. Let $(\Omega, \mathcal{F}) = ([0,1), \mathcal{B})$ and $Tx = x + a \pmod 1$. If $a$ is irrational there is just one invariant measure $P$, namely the uniform distribution on $[0,1)$. This is seen by Fourier analysis. See Remark 2.2.

$$\int e^{i2\pi nx}\,dP = \int e^{i2\pi n(Tx)}\,dP = \int e^{i2\pi n(x+a)}\,dP = e^{i2\pi na}\int e^{i2\pi nx}\,dP.$$

If $a$ is irrational, $e^{i2\pi na} = 1$ if and only if $n = 0$. Therefore

$$\int e^{i2\pi nx}\,dP = 0 \quad\text{for } n \neq 0$$

which makes $P$ uniform. Now let $a = \frac{p}{q}$ be rational with $(p, q) = 1$, i.e. $p$ and $q$ relatively prime. Then, for any $x$, the discrete distribution with probabilities $\frac{1}{q}$ at the points $\{x, x+a, x+2a, \dots, x+(q-1)a\}$ is invariant and ergodic. We can denote this distribution by $P_x$. If we limit $x$ to the interval $0 \le x < \frac{1}{q}$ then $x$ is uniquely determined by $P_x$. Complete the example by determining all $T$-invariant probability distributions on $[0,1)$ and find the integral representation in terms of the ergodic ones.
6.3 Stationary Markov Processes.

Let $\pi(x, dy)$ be a transition probability function on $(X, \mathcal{B})$, where $X$ is a state space and $\mathcal{B}$ is a $\sigma$-field of measurable subsets of $X$. A stochastic process with values in $X$ is a probability measure on the space $(\Omega, \mathcal{F})$, where $\Omega$ is the space of sequences $\{x_n : -\infty < n < \infty\}$ with values in $X$, and $\mathcal{F}$ is the product $\sigma$-field. The space $(\Omega, \mathcal{F})$ has some natural sub $\sigma$-fields. For any two integers $m \le n$, we have the sub $\sigma$-field $\mathcal{F}_n^m = \sigma\{x_j : m \le j \le n\}$ corresponding to information about the process during the time interval $[m, n]$. In addition we have $\mathcal{F}_n = \mathcal{F}_n^{-\infty} = \sigma\{x_j : j \le n\}$ and $\mathcal{F}^m = \mathcal{F}_\infty^m = \sigma\{x_j : j \ge m\}$ that correspond to the past and the future. $P$ is a Markov process on $(\Omega, \mathcal{F})$ with transition probability $\pi(\cdot,\cdot)$ if for every $n$, $A \in \mathcal{B}$ and $P$-almost all $\omega$,

$$P\left[x_{n+1} \in A\,|\,\mathcal{F}_n\right] = \pi(x_n, A).$$

Remark 6.1. Given a $\pi$, it is not always true that $P$ exists. A simple but illuminating example is to take $X = \{0, 1, \dots, n, \dots\}$, the nonnegative integers, and define $\pi(x, x+1) = 1$, so that all the process does is move one step to the right every time. Such a process, if it had started a long time back, will be found nowhere today! So it does not exist. On the other hand, if we take $X$ to be the set of all integers then $P$ is seen to exist. In fact there are lots of them. What is true, however, is that given any initial distribution $\mu$ and initial time $m$, there exists a unique process $P$ on $(\Omega, \mathcal{F}^m)$, i.e. defined on the future $\sigma$-field from time $m$ on, that is Markov with transition probability $\pi$ and satisfies $P\{x_m \in A\} = \mu(A)$ for all $A \in \mathcal{B}$.

The shift $T$ acts naturally as a measurable invertible map of the product space $\Omega$ into itself, and the notion of a stationary process makes sense. The following theorem connects stationarity and the Markov property.

Theorem 6.8. Let the transition probability $\pi$ be given. Let $P$ be a stationary Markov process with transition probability $\pi$. Then the one dimensional marginal distribution $\mu$, which is independent of time because of stationarity and given by

$$\mu(A) = P\left[x_n \in A\right],$$

is invariant in the sense that

$$\mu(A) = \int \pi(x, A)\,\mu(dx)$$

for every set $A \in \mathcal{B}$. Conversely, given such a $\mu$, there is a unique stationary Markov process $P$ with marginals $\mu$ and transition probability $\pi$.

Exercise 6.6. Prove the above theorem. Use Remark 4.7.

Exercise 6.7. If $P$ is a stationary Markov process on a countable state space with transition probability $\pi$ and invariant marginal distribution $\mu$, show that the time reversal map that maps $\{x_n\}$ to $\{x_{-n}\}$ takes $P$ to another stationary Markov process $Q$, and express the transition probability $\hat\pi$ of $Q$ as explicitly as you can in terms of $\pi$ and $\mu$.
Exercise 6.8. If $\mu$ is an invariant measure for $\pi$, show that the conditional expectation map $\Pi : f(\cdot) \to \int f(y)\,\pi(\cdot, dy)$ induces a contraction in $L_p(\mu)$ for any $p \in [1, \infty]$. We say that a Markov process is reversible if the time reversed process $Q$ of the previous example coincides with $P$. Show that the $P$ corresponding to $\pi$ and $\mu$ is reversible if and only if the corresponding $\Pi$ in $L_2(\mu)$ is self-adjoint or symmetric.
Since a given transition probability $\pi$ may in general have several invariant measures $\mu$, there will be several stationary Markov processes with transition probability $\pi$. Let $\mathcal{M}$ be the set of invariant probability measures for the transition probability $\pi(x, dy)$, i.e.

$$\mathcal{M} = \left\{\mu : \mu(A) = \int_X \pi(x, A)\,d\mu(x) \ \text{for all } A \in \mathcal{B}\right\}.$$

$\mathcal{M}$ is a convex set of probability measures and we denote by $\mathcal{M}_e$ its (possibly empty) set of extremals. For each $\mu \in \mathcal{M}$, we have the corresponding stationary Markov process $P_\mu$ and the map $\mu \to P_\mu$ is clearly linear. If we want $P_\mu$ to be an ergodic stationary process, then it must be an extremal in the space of all stationary processes. The extremality of $\mu$ in $\mathcal{M}$ is therefore a necessary condition for $P_\mu$ to be ergodic. That it is also sufficient is a little bit of a surprise. The following theorem is the key step in the proof. The remaining part is routine.
Theorem 6.9. Let $\mu$ be an invariant measure for $\pi$ and $P = P_\mu$ the corresponding stationary Markov process. Let $\mathcal{I}$ be the $\sigma$-field of shift invariant subsets of $\Omega$. To within sets of $P$ measure $0$, $\mathcal{I} \subset \mathcal{F}_0^0$.

Proof. This theorem describes completely the structure of nontrivial sets in the $\sigma$-field $\mathcal{I}$ of invariant sets for a stationary Markov process with transition probability $\pi$ and marginal distribution $\mu$. Suppose that the state space can be partitioned nontrivially, i.e. with $0 < \mu(A) < 1$, into two sets $A$ and $A^c$ that satisfy $\pi(x, A) = 1$ a.e. on $A$ and $\pi(x, A^c) = 1$ a.e. on $A^c$. Then the event

$$E = \{\omega : x_n \in A \ \text{for all } n \in Z\}$$

provides a nontrivial set in $\mathcal{I}$. The theorem asserts the converse. The proof depends on the fact that an invariant set $E$ is in the remote past $\mathcal{F}_{-\infty} = \cap_n \mathcal{F}_n$ as well as in the remote future $\mathcal{F}^\infty = \cap_m \mathcal{F}^m$. See the proof of Theorem 6.4. For a Markov process the past and the future are conditionally independent given the present. See Theorem 4.9. This implies that

$$P\left[E\,|\,\mathcal{F}_0^0\right] = P\left[E \cap E\,|\,\mathcal{F}_0^0\right] = P\left[E\,|\,\mathcal{F}_0^0\right]\,P\left[E\,|\,\mathcal{F}_0^0\right]$$

and it must therefore equal either $0$ or $1$. This in turn means that corresponding to any invariant set $E \in \mathcal{I}$, there exists $A \subset X$ belonging to $\mathcal{B}$ such that $E = \{\omega : x_n \in A \ \text{for all } n \in Z\}$ up to a set of $P$ measure $0$. If the Markov process starts from $A$ or $A^c$, it does not ever leave it. That means $0 < \mu(A) < 1$ and

$$\pi(x, A^c) = 0 \ \text{for a.e. } x \in A \quad\text{and}\quad \pi(x, A) = 0 \ \text{for a.e. } x \in A^c.$$
Remark 6.2. One way to generate Markov processes with multiple invariant measures is to start with two Markov processes with transition probabilities $\pi_i(x_i, dy_i)$ on $X_i$ and invariant measures $\mu_i$, and consider the disjoint union $X = X_1 \cup X_2$. Define

$$\pi(x, A) = \begin{cases} \pi_1(x, A \cap X_1) & \text{if } x \in X_1\\ \pi_2(x, A \cap X_2) & \text{if } x \in X_2\end{cases}$$

Then any one of the two processes can be going on, depending on which world we are in. Both $\mu_1$ and $\mu_2$ are invariant measures. We have combined two distinct possibilities into one. What we have shown is that when we have multiple invariant measures they essentially arise in this manner.

Remark 6.3. We can therefore look at the convex set of measures $\mu$ that are invariant, i.e. $\mu\pi = \mu$. The extremals of this convex set are precisely the ones that correspond to ergodic stationary processes, and they are called ergodic or extremal invariant measures. If the set of invariant probability measures is nonempty for some $\pi$, then there are enough extremals to recover an arbitrary invariant measure as an integral or weighted average of extremal ones.
Exercise 6.9. Show that any two distinct extremal invariant measures $\mu_1$ and $\mu_2$ for the same $\pi$ are orthogonal on $\mathcal{B}$.

Exercise 6.10. Consider the operator $\Pi$ on the $L_p(\mu)$ spaces corresponding to a given invariant measure $\mu$. The dimension of the eigenspace $\{f : \Pi f = f\}$ that corresponds to the eigenvalue $1$ determines the extremality of $\mu$. Clarify this statement.
Exercise 6.11. Let $P_x$ be the Markov process with stationary transition probability $\pi(x, dy)$ starting at time $0$ from $x \in X$. Let $f$ be a bounded measurable function on $X$. Then for almost all $x$ with respect to any extremal invariant measure $\mu$,

$$\lim_{n\to\infty} \frac{1}{n}[f(x_1) + \dots + f(x_n)] = \int f(y)\,\mu(dy)$$

for almost all $\omega$ with respect to $P_x$.

Exercise 6.12. We saw in the earlier section that any stationary process is an integral over stationary ergodic processes. If we represent a stationary Markov process $P_\mu$ as the integral

$$P_\mu = \int R\,Q(dR)$$

over stationary ergodic processes, show that the integral really involves only stationary Markov processes with transition probability $\pi$, so that the integral is really of the form

$$P_\mu = \int_{\mathcal{M}_e} P_\nu\,Q(d\nu)$$

or equivalently

$$\mu = \int_{\mathcal{M}_e} \nu\,Q(d\nu).$$

Exercise 6.13. If there is a reference measure $\alpha$ such that $\pi(x, dy)$ has a density $p(x, y)$ with respect to $\alpha$ for every $x$, then show that any invariant measure is absolutely continuous with respect to $\alpha$. In this case the eigenspace $\{f : \Pi f = f\}$ in $L_2(\mu)$ gives a complete picture of all the invariant measures.

The question of when there is at most one invariant measure for the Markov process with transition probability $\pi$ is a difficult one. If we have a density $p(x, y)$ with respect to a reference measure $\alpha$, and if for each $x$, $p(x, y) > 0$ for almost all $y$ with respect to $\alpha$, then there can be at most one invariant measure. We saw already that any invariant measure has a density with respect to $\alpha$. If there are at least two invariant measures, then there are at least two ergodic ones which are orthogonal. If we denote by $f_1$ and $f_2$ their densities with respect to $\alpha$, by orthogonality we know that they are supported on disjoint invariant sets $A_1$ and $A_2$. In particular $p(x, y) = 0$ for almost all $x$ on $A_1$ in the support of $f_1$ and almost all $y$ in $A_2$ with respect to $\alpha$. By our positivity assumption we must have $\alpha(A_2) = 0$, which is a contradiction.
6.4 Mixing properties of Markov Processes.

One of the questions that is important in the theory of Markov processes is the rapidity with which the memory of the initial state is lost. There is no unique way of assessing it and, depending on the circumstances, this could happen in many different ways at many different rates. Let $\pi^{(n)}(x, dy)$ be the $n$ step transition probability. The issue is how the measures $\pi^{(n)}(x, dy)$ depend less and less on $x$ as $n \to \infty$. Suppose we measure this dependence by

$$\delta_n = \sup_{x,y \in X}\ \sup_{A \in \mathcal{B}} |\pi^{(n)}(x, A) - \pi^{(n)}(y, A)|;$$

then the following is true.

Theorem 6.10. Either $\delta_n \equiv 1$ for all $n \ge 1$, or $\delta_n \le C\rho^n$ for some $0 \le \rho < 1$.

Proof. From the Chapman-Kolmogorov equations

$$\pi^{(n+m)}(x, A) - \pi^{(n+m)}(y, A) = \int \pi^{(m)}(z, A)\,[\pi^{(n)}(x, dz) - \pi^{(n)}(y, dz)].$$

If $f(x)$ is a function with $|f(x) - f(y)| \le C$ and $\lambda = \lambda_1 - \lambda_2$ is the difference of two probability measures with $\|\lambda\| = \sup_A |\lambda(A)|$, then it is elementary to estimate, using $\int c\,d\lambda = 0$,

$$\left|\int f\,d\lambda\right| = \inf_c \left|\int (f - c)\,d\lambda\right| \le 2\inf_c \{\sup_x |f(x) - c|\}\,\|\lambda\| \le 2\cdot\frac{C}{2}\,\|\lambda\| = C\|\lambda\|.$$

It follows that the sequence $\delta_n$ is submultiplicative, i.e.

$$\delta_{m+n} \le \delta_m\,\delta_n.$$

Our theorem follows from this property. As soon as some $\delta_k = a < 1$ we have

$$\delta_n \le [\delta_k]^{[\frac{n}{k}]} \le C\rho^n \quad\text{with } \rho = a^{\frac{1}{k}}.$$
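For a finite chain $\delta_n$ is directly computable: it is the largest total variation distance between rows of the $n$-step matrix, and submultiplicativity shows up as geometric decay. A sketch with an invented 3-state matrix:

```python
import numpy as np

P = np.array([[0.9 , 0.05, 0.05],
              [0.05, 0.9 , 0.05],
              [0.1 , 0.1 , 0.8 ]])

def delta(M):
    # sup_{x,y} sup_A |M(x, A) - M(y, A)| = max TV distance between rows
    return max(0.5 * np.abs(M[i] - M[j]).sum()
               for i in range(len(M)) for j in range(len(M)))

Pn = P.copy()
for n in range(1, 8):
    print(n, round(delta(Pn), 6))      # decays like C * rho^n
    Pn = Pn @ P
```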
Although this is an easy theorem, it can be applied in some contexts.

Remark 6.4. If $\pi(x, dy)$ has density $p(x, y)$ with respect to some reference measure $\alpha$ and $p(x, y) \ge q(y) \ge 0$ for all $y$, with $\delta = \int q(y)\,d\alpha > 0$, then it is elementary to show that $\delta_1 \le (1 - \delta)$.

Remark 6.5. If $\delta_n \to 0$, we can estimate

$$|\pi^{(n)}(x, A) - \pi^{(n+m)}(x, A)| = \left|\int [\pi^{(n)}(x, A) - \pi^{(n)}(y, A)]\,\pi^{(m)}(x, dy)\right| \le \delta_n$$

and conclude from the estimate that

$$\lim_{n\to\infty} \pi^{(n)}(x, A) = \mu(A)$$

exists. $\mu$ is seen to be an invariant probability measure.

Remark 6.6. In this context the invariant measure is unique. If $\lambda$ is another invariant measure, because

$$\lambda(A) = \int \pi^{(n)}(y, A)\,\lambda(dy)$$

for every $n \ge 1$,

$$\lambda(A) = \lim_{n\to\infty} \int \pi^{(n)}(y, A)\,\lambda(dy) = \mu(A).$$

Remark 6.7. The stationary process $P_\mu$ has the property that if $E \in \mathcal{F}_m$ and $F \in \mathcal{F}^n$ with a gap of $k = n - m > 0$, then

$$P_\mu[E \cap F] = \int_E \int_X \pi^{(k)}(x_m(\omega), dx)\,P_x(T^n F)\,P_\mu(d\omega)$$
$$P_\mu[E]\,P_\mu[F] = \int_E \int_X \mu(dx)\,P_x(T^n F)\,P_\mu(d\omega)$$
$$P_\mu[E \cap F] - P_\mu[E]\,P_\mu[F] = \int_E \int_X P_x(T^n F)\,[\pi^{(k)}(x_m(\omega), dx) - \mu(dx)]\,P_\mu(d\omega)$$

from which it follows that

$$|P_\mu[E \cap F] - P_\mu[E]\,P_\mu[F]| \le \delta_k\,P_\mu(E)$$

proving an asymptotic independence property for $P_\mu$.
There are situations in which we know that an invariant probability measure $\mu$ exists for $\pi$ and we wish to establish that $\pi^{(n)}(x, A)$ converges to $\mu(A)$ uniformly in $A$ for each $x \in X$, but not necessarily uniformly over the starting points $x$. Uniformity in the starting point is very special. We will illustrate this by an example.

Example 6.1. The Ornstein-Uhlenbeck process is a Markov chain on the state space $X = R$, the real line, with transition probability $\pi(x, dy)$ given by a Gaussian distribution with mean $\rho x$ and variance $\sigma^2$. It has a density $p(x, y)$ with respect to the Lebesgue measure, so that $\pi(x, A) = \int_A p(x, y)\,dy$ with

$$p(x, y) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left[-\frac{(y - \rho x)^2}{2\sigma^2}\right].$$

It arises from the auto-regressive representation

$$x_{n+1} = \rho x_n + \sigma\xi_{n+1}$$

where $\xi_1,\dots,\xi_n,\dots$ are independent standard Gaussians. The characteristic function $\phi(t)$ of any invariant measure satisfies, for every $n \ge 1$,

$$\phi(t) = \phi(\rho t)\exp\left[-\frac{\sigma^2 t^2}{2}\right] = \phi(\rho^n t)\exp\left[-\frac{\left(\sum_{j=0}^{n-1}\rho^{2j}\right)\sigma^2 t^2}{2}\right]$$

by induction on $n$. Therefore

$$|\phi(t)| \le \exp\left[-\frac{\left(\sum_{j=0}^{n-1}\rho^{2j}\right)\sigma^2 t^2}{2}\right]$$

and this cannot be a characteristic function unless $|\rho| < 1$ (otherwise by letting $n \to \infty$ we see that $\phi(t) = 0$ for $t \neq 0$, and $\phi$ is therefore discontinuous at $t = 0$). If $|\rho| < 1$, by letting $n \to \infty$ and observing that $\phi(\rho^n t) \to \phi(0) = 1$,

$$\phi(t) = \exp\left[-\frac{\sigma^2 t^2}{2(1 - \rho^2)}\right].$$

The only possible invariant measure is the Gaussian with mean $0$ and variance $\frac{\sigma^2}{1 - \rho^2}$. One can verify that this Gaussian is in fact an invariant measure. If $|\rho| < 1$ a direct computation shows that $\pi^{(n)}(x, dy)$ is a Gaussian with mean $\rho^n x$ and variance $\sigma_n^2 = \sum_{j=0}^{n-1}\rho^{2j}\,\sigma^2 \to \frac{\sigma^2}{1 - \rho^2}$ as $n \to \infty$. Clearly there is uniform convergence only over bounded sets of starting points $x$. This is typical.
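A direct simulation of the auto-regressive recursion shows both effects: the memory of a far-away starting point dies out like $\rho^n$, and the empirical variance settles at $\sigma^2/(1-\rho^2)$. The values $\rho = 0.8$, $\sigma = 1$ are illustrative:

```python
import numpy as np

rho, sigma = 0.8, 1.0
rng = np.random.default_rng(4)
n = 200000
x = np.empty(n)
x[0] = 50.0                            # start far from equilibrium
for k in range(n - 1):
    x[k + 1] = rho * x[k] + sigma * rng.standard_normal()

burn = 1000                            # discard the transient
print("sample variance:", x[burn:].var(),
      " theory:", sigma**2 / (1 - rho**2))
```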
6.5 Central Limit Theorem for Martingales.

If $\{\xi_n\}$ is an ergodic stationary sequence of random variables with mean zero, then we know from the ergodic theorem that the mean $\frac{\xi_1 + \dots + \xi_n}{n}$ converges to zero almost surely, by the law of large numbers. We want to develop some methods for proving the central limit theorem, i.e. the convergence of the distribution of $\frac{\xi_1 + \dots + \xi_n}{\sqrt n}$ to some Gaussian distribution with mean $0$ and variance $\sigma^2$. Under the best of situations, since the covariance $\rho_k = E[\xi_n\xi_{n+k}]$ may not be $0$ for all $k \neq 0$, if we assume that $\sum_{-\infty<j<\infty} |\rho_j| < \infty$, we get

$$\sigma^2 = \lim_{n\to\infty}\frac{1}{n}E[(\xi_1 + \dots + \xi_n)^2] = \lim_{n\to\infty}\sum_{|j|\le n}\left(1 - \frac{|j|}{n}\right)\rho_j = \sum_{-\infty<j<\infty}\rho_j = \rho_0 + 2\sum_{j=1}^{\infty}\rho_j.$$

The standard central limit theorem with $\sqrt n$ scaling is not likely to work if the covariances do not decay rapidly enough to be summable. When the covariances $\{\rho_k\}$ are all $0$ for $k \neq 0$, the variance calculation yields $\sigma^2 = \rho_0$ just as in the independent case, but there is no guarantee that the central limit theorem is valid.

A special situation is when $\{\xi_j\}$ are square integrable martingale differences. With the usual notation for the $\sigma$-fields $\mathcal{F}_n^m$ for $m \le n$ (remember that $m$ can be $-\infty$ while $n$ can be $+\infty$) we assume that

$$E\{\xi_n\,|\,\mathcal{F}_{n-1}\} = 0 \quad\text{a.e.}$$

and in this case by conditioning we see that $\rho_k = 0$ for $k \neq 0$. It is a useful and important observation that in this context the central limit theorem always holds. The distribution of $Z_n = \frac{\xi_1 + \dots + \xi_n}{\sqrt n}$ converges to the normal distribution with mean $0$ and variance $\sigma^2 = \rho_0$. The proof is a fairly simple modification of the usual proof of the central limit theorem. Let us define

$$\psi(n, j, t) = \exp\left[\frac{\sigma^2 t^2 j}{2n}\right]E\left[\exp\left[it\,\frac{\xi_1 + \dots + \xi_j}{\sqrt n}\right]\right]$$

and write

$$\psi(n, n, t) - 1 = \sum_{j=1}^{n}[\psi(n, j, t) - \psi(n, j-1, t)]$$

leaving us with the estimation of

$$\sum_{j=1}^{n}\left[\psi(n, j, t) - \psi(n, j-1, t)\right].$$
Theorem 6.11. For an ergodic stationary sequence $\{\xi_j\}$ of square integrable martingale differences, the central limit theorem is always valid.

Proof. We let $S_j = \xi_1 + \dots + \xi_j$ and calculate

$$\psi(n, j, t) - \psi(n, j-1, t) = \exp\left[\frac{\sigma^2 t^2 j}{2n}\right]E\left[\exp\left[it\frac{S_{j-1}}{\sqrt n}\right]\left\{\exp\left[it\frac{\xi_j}{\sqrt n}\right] - \exp\left[-\frac{\sigma^2 t^2}{2n}\right]\right\}\right].$$

We can replace it with

$$\theta(n, j, t) = \exp\left[\frac{\sigma^2 t^2 j}{2n}\right]E\left[\exp\left[it\frac{S_{j-1}}{\sqrt n}\right]\left\{\frac{(\sigma^2 - \xi_j^2)t^2}{2n}\right\}\right]$$

because the error can be controlled by Taylor's expansion. In fact, if we use the martingale difference property to kill the linear term, we can bound the difference, in an arbitrary finite interval $|t| \le T$, by

$$\left|[\psi(n, j, t) - \psi(n, j-1, t)] - \theta(n, j, t)\right| \le C_T\,E\left[\left|\exp\left[it\frac{\xi_j}{\sqrt n}\right] - 1 - it\frac{\xi_j}{\sqrt n} + \frac{t^2\xi_j^2}{2n}\right|\right] + C_T\left|\exp\left[-\frac{\sigma^2 t^2}{2n}\right] - 1 + \frac{\sigma^2 t^2}{2n}\right|$$

where $C_T$ is a constant that depends only on $T$. The right hand side is independent of $j$ because of stationarity. By Taylor expansions in the variable $\frac{t}{\sqrt n}$ of each of the two terms on the right, it is easily seen that

$$\sup_{|t|\le T,\ 1\le j\le n}\left|[\psi(n, j, t) - \psi(n, j-1, t)] - \theta(n, j, t)\right| = o\left(\frac{1}{n}\right).$$

Therefore

$$\sup_{|t|\le T}\ \sum_{j=1}^{n}\left|[\psi(n, j, t) - \psi(n, j-1, t)] - \theta(n, j, t)\right| = n\,o\left(\frac{1}{n}\right) \to 0.$$

We now concentrate on estimating $|\sum_{j=1}^{n}\theta(n, j, t)|$. We pick an integer $k$ which will be large but fixed. We divide $[1, n]$ into blocks of size $k$, with perhaps an incomplete block at the end. We will now replace $\theta(n, j, t)$ by

$$\theta_k(n, j, t) = \exp\left[\frac{\sigma^2 t^2 kr}{2n}\right]E\left[\exp\left[it\frac{S_{kr}}{\sqrt n}\right]\left\{\frac{(\sigma^2 - \xi_j^2)t^2}{2n}\right\}\right]$$

for $kr + 1 \le j \le k(r+1)$ and $r \ge 0$.

Using stationarity it is easy to estimate, for $r \le \frac{n}{k}$,

$$\left|\sum_{j=kr+1}^{k(r+1)}\theta_k(n, j, t)\right| \le C(t)\,\frac{1}{n}\,E\left[\left|\sum_{j=kr+1}^{k(r+1)}(\sigma^2 - \xi_j^2)\right|\right] = C(t)\,\frac{k}{n}\,\Delta(k)$$

where $\Delta(k) \to 0$ as $k \to \infty$ by the $L_1$ ergodic theorem. After all, $\{\xi_j^2\}$ is a stationary sequence with mean $\sigma^2$ and the ergodic theorem applies. Since the above estimate is uniform in $r$, the left over incomplete block at the end causes no problem, and since there are approximately $\frac{n}{k}$ blocks, we conclude that

$$\left|\sum_{j=1}^{n}\theta_k(n, j, t)\right| \le C(t)\,\Delta(k).$$

On the other hand, by stationarity,

$$\sum_{j=1}^{n}|\theta_k(n, j, t) - \theta(n, j, t)| \le n\sup_{1\le j\le n}|\theta_k(n, j, t) - \theta(n, j, t)| \le C(t)\sup_{1\le j\le k}E\left[\left|\exp\left[\frac{\sigma^2 t^2 j}{2n}\right]\exp\left[it\frac{S_{j-1}}{\sqrt n}\right] - 1\right|\,|\sigma^2 - \xi_j^2|\right]$$

and it is elementary to show by the dominated convergence theorem that the right hand side tends to $0$ as $n \to \infty$ for each finite $k$.

This concludes the proof of the theorem.
One may think that the assumption that $\{\xi_n\}$ is a martingale difference is too restrictive to be useful. Let $\{X_n\}$ be any stationary process with zero mean. We can often succeed in writing $X_n = \xi_{n+1} + \eta_{n+1}$ where $\xi_n$ is a martingale difference and $\eta_n$ is negligible, in the sense that $E[(\sum_{j=1}^{n}\eta_j)^2] = o(n)$. Then the central limit theorem for $\{X_n\}$ can be deduced from that of $\{\xi_n\}$. A cheap way to prove $E[(\sum_{j=1}^{n}\eta_j)^2] = o(n)$ is to establish that $\eta_n = Z_n - Z_{n+1}$ for some stationary square integrable sequence $\{Z_n\}$. Then $\sum_{j=1}^{n}\eta_j$ telescopes and the needed estimate is obvious. Here is a way to construct $Z_n$ from $X_n$ so that $X_n + (Z_{n+1} - Z_n)$ is a martingale difference.

Let us define

$$Z_n = \sum_{j=0}^{\infty}E\left[X_{n+j}\,|\,\mathcal{F}_n\right].$$

There is no guarantee that the series converges, but we can always hope. After all, if the memory is weak, prediction $j$ steps ahead should be futile if $j$ is large. Therefore, if $X_{n+j}$ is becoming independent of $\mathcal{F}_n$ as $j$ gets large, one would expect $E\left[X_{n+j}\,|\,\mathcal{F}_n\right]$ to approach $E[X_{n+j}]$, which is assumed to be $0$. By stationarity $n$ plays no role. If $Z_0$ can be defined, the shift operator $T$ can be used to define $Z_n(\omega) = Z_0(T^n\omega)$. Let us assume that the $\{Z_n\}$ exist and are square integrable. Then

$$Z_n = E\left[Z_{n+1}\,|\,\mathcal{F}_n\right] + X_n$$

or equivalently

$$X_n = Z_n - E\left[Z_{n+1}\,|\,\mathcal{F}_n\right] = [Z_n - Z_{n+1}] + [Z_{n+1} - E\left[Z_{n+1}\,|\,\mathcal{F}_n\right]] = \eta_{n+1} + \xi_{n+1}$$

where $\eta_{n+1} = Z_n - Z_{n+1}$ and $\xi_{n+1} = Z_{n+1} - E\left[Z_{n+1}\,|\,\mathcal{F}_n\right]$. It is easy to see that $E[\xi_{n+1}\,|\,\mathcal{F}_n] = 0$.
For a stationary ergodic Markov process $\{X_n\}$ on state space $(X, \mathcal{B})$, with transition probability $\pi(x, dy)$ and invariant measure $\mu$, we can prove the central limit theorem by this method. Let $Y_j = f(X_j)$. Using the Markov property we can calculate

$$Z_0 = \sum_{j=0}^{\infty}E[f(X_j)\,|\,\mathcal{F}_0] = \sum_{j=0}^{\infty}[\Pi^j f](X_0) = [[I - \Pi]^{-1}f](X_0).$$

If the equation $[I - \Pi]U = f$ can be solved with $U \in L_2(\mu)$, then

$$\xi_{n+1} = U(X_{n+1}) - U(X_n) + f(X_n)$$

is a martingale difference and we have a central limit theorem for $\frac{\sum_{j=1}^{n}f(X_j)}{\sqrt n}$ with variance given by

$$\sigma^2 = E^P\left[\xi_1^2\right] = E^P\left[[U(X_1) - U(X_0) + f(X_0)]^2\right].$$
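On a finite state space all of this is linear algebra: $\mu$ is a left eigenvector of the transition matrix, the equation $[I - \Pi]U = f$ is a linear system (solvable once $f$ is centered so that $\int f\,d\mu = 0$), and $\sigma^2$ is a finite sum. A sketch with an invented two-state chain:

```python
import numpy as np

P = np.array([[0.2, 0.8],
              [0.6, 0.4]])
vals, vecs = np.linalg.eig(P.T)
mu = np.real(vecs[:, np.argmin(np.abs(vals - 1))])
mu = mu / mu.sum()                     # invariant measure

f = np.array([1.0, -1.0])
f = f - mu @ f                         # center f: integral against mu is 0

# Solve (I - P) U = f; the system is rank deficient (constants are in the
# kernel), so pin U[0] = 0 to pick one solution.
A = np.eye(2) - P
A[0] = [1.0, 0.0]
b = f.copy()
b[0] = 0.0
U = np.linalg.solve(A, b)

# sigma^2 = E[(U(X_1) - U(X_0) + f(X_0))^2] with X_0 ~ mu, X_1 ~ P(X_0, .)
sigma2 = sum(mu[x] * P[x, y] * (U[y] - U[x] + f[x])**2
             for x in range(2) for y in range(2))
print("CLT variance sigma^2 =", sigma2)
```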
Exercise 6.14. Let us consider a two state Markov chain with states $\{1, 2\}$. Let the transition probabilities be given by $\pi(1,1) = \pi(2,2) = p$ and $\pi(1,2) = \pi(2,1) = q$ with $0 < p, q < 1$, $p + q = 1$. The invariant measure is given by $\mu(1) = \mu(2) = \frac{1}{2}$ for all values of $p$. Consider the random variable $S_n = A_n - B_n$, where $A_n$ and $B_n$ are respectively the number of visits to the states $1$ and $2$ during the first $n$ steps. Prove a central limit theorem for $\frac{S_n}{\sqrt n}$ and calculate the limiting variance as a function $\sigma^2(p)$ of $p$. How does $\sigma^2(p)$ behave as $p \to 0$ or $1$? Can you explain it? What is the value of $\sigma^2(\frac{1}{2})$? Could you have guessed it?
Exercise 6.15. Consider a random walk on the nonnegative integers with

$$\pi(x, y) = \begin{cases} \frac{1}{2} & \text{for all } x = y \ge 0\\[2pt] \frac{1-\delta}{4} & \text{for } y = x + 1,\ x \ge 1\\[2pt] \frac{1+\delta}{4} & \text{for } y = x - 1,\ x \ge 1\\[2pt] \frac{1}{2} & \text{for } x = 0,\ y = 1\end{cases}$$

where $0 < \delta < 1$. Prove that the chain is positive recurrent and find the invariant measure $\mu(x)$ explicitly. If $f(x)$ is a function on $x \ge 0$ with compact support, solve explicitly the equation $[I - \Pi]U = f$. Show that either $U$ grows exponentially at infinity or is a constant for large $x$. Show that it is a constant if and only if $\sum_x f(x)\mu(x) = 0$. What can you say about the central limit theorem for $\sum_{j=0}^{n}f(X_j)$ for such functions $f$?
6.6 Stationary Gaussian Processes.

Considering the importance of Gaussian distributions in probability theory, it is only natural to study stationary Gaussian processes, i.e. stationary processes $\{X_n\}$ that have Gaussian distributions as their finite dimensional joint distributions. Since a joint Gaussian distribution is determined by its means and covariances, we need only specify $E[X_n]$ and $\mathrm{Cov}(X_n, X_m) = E[X_n X_m] - E[X_n]E[X_m]$. Recall that the joint density on $R^N$ of $N$ Gaussian random variables with mean $m = \{m_i\}$ and covariance $C = \{\sigma_{i,j}\}$ is given by

$$p(y) = \left[\frac{1}{\sqrt{2\pi}}\right]^N\frac{1}{\sqrt{\mathrm{Det}\,C}}\exp\left[-\frac{1}{2}\langle(y - m), C^{-1}(y - m)\rangle\right].$$

Here $m$ is the vector of means and $C^{-1}$ is the inverse of the positive definite covariance matrix $C$. If $C$ is only positive semidefinite, the Gaussian distribution lives on a lower dimensional hyperplane and is singular. By stationarity $E[X_n] = c$ is independent of $n$ and $\mathrm{Cov}(X_n, X_m) = \rho_{n-m}$ can depend only on the difference $n-m$. By symmetry $\rho_k = \rho_{-k}$. Because the covariance matrix is always positive semidefinite, the sequence $\rho_k$ has the positive definiteness property

$$\sum_{k,j=1}^{n}\rho_{j-k}\,z_j\bar z_k \ge 0$$

for all choices of $n$ and complex numbers $z_1,\dots,z_n$. By Bochner's theorem (see Theorem 2.2) there exists a nonnegative measure $\mu$ on the circle $S$, that is thought of as $[0, 2\pi]$ with end points identified, such that

$$\rho_k = \int_0^{2\pi}\exp[ik\theta]\,d\mu(\theta)$$

and because of the symmetry of $\rho_k$, $\mu$ is symmetric as well with respect to $\theta \to 2\pi - \theta$. It is convenient to assume that $c = 0$. One can always add it back. Given a Gaussian process it is natural to carry out linear operations that will leave the Gaussian character unchanged. Rather than working with the $\sigma$-fields $\mathcal{F}_n^m$ we will work with the linear subspaces $H_n^m$ spanned by $\{X_j : m \le j \le n\}$ and the infinite spans $H_n = \bigvee_{m\le n} H_n^m$ and $H^m = \bigvee_{n\ge m} H_n^m$, that are considered as linear subspaces of the Hilbert space $H = \bigvee_{m\le n} H_n^m$, which lies inside $L_2(P)$. But $H$ is a small part of $L_2(P)$, consisting only of linear functions of $\{X_j\}$. The analogs of Kolmogorov's tail $\sigma$-field are the subspaces $\cap_m H^m$ and $\cap_n H_n$, that are denoted by $H^\infty$ and $H_{-\infty}$. The analog of Kolmogorov's zero-one law would be that these subspaces are trivial, having in them only the zero function. The symmetry in $\rho_k$ implies that the processes $\{X_n\}$ and $\{X_{-n}\}$ have the same underlying distributions, so that both tails behave identically. A stationary Gaussian process $\{X_n\}$ with mean $0$ is said to be purely nondeterministic if the tail subspaces are trivial.
In finite dimensional theory a covariance matrix can be diagonalized, or better still written in the special form $TT^*$, which gives a linear representation of the Gaussian random variables in terms of canonical or independent standard Gaussian random variables. The point to note is that if $X$ is standard Gaussian with mean zero and covariance $I = \{\delta_{i,j}\}$, then for any linear transformation $T$, $Y = TX$ is again Gaussian with mean zero and covariance $C = TT^*$. In other words if

$$Y_i = \sum_k t_{i,k}X_k$$

then

$$C_{i,j} = \sum_k t_{i,k}t_{j,k}.$$

In fact for any $C$ we can find a $T$ which is upper or lower triangular, i.e. $t_{i,k} = 0$ for $i > k$ or $i < k$. If the indices correspond to time, this can be interpreted as a causal representation in terms of current and future, or current and past, variables only.
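The lower triangular choice is exactly the Cholesky factorization, and it makes the causality visible: each $Y_i$ involves only $X_k$ with $k \le i$. A small sketch with an invented covariance:

```python
import numpy as np

B = np.array([[2.0, 1.0, 0.0],
              [0.5, 1.5, 0.3],
              [0.2, 0.1, 1.0]])
C = B @ B.T                            # a positive definite covariance

T = np.linalg.cholesky(C)              # lower triangular, C = T T*
rng = np.random.default_rng(5)
X = rng.standard_normal((3, 100000))   # independent standard Gaussians
Y = T @ X                              # Y_i is a combination of X_k, k <= i
print(np.round(np.cov(Y), 2))          # ~ C
```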
The following questions have simple answers.

Q1. When does a Gaussian process have a moving average representation in terms of independent Gaussians, i.e. a representation of the form

$$X_n = \sum_{m=-\infty}^{\infty}a_{n-m}\xi_m \quad\text{with}\quad \sum_{n=-\infty}^{\infty}a_n^2 < \infty$$

in terms of i.i.d. Gaussians $\{\xi_k\}$ with mean $0$ and variance $1$?

If we have such a representation, then the covariance $\rho_k$ is easily calculated as the convolution

$$\rho_k = \sum_j a_j a_{j+k} = [a * a](k)$$

and that will make $\{\rho_k\}$ the Fourier coefficients of the function

$$f = \left|\sum_j a_j e^{ij\theta}\right|^2$$

which is the square of a function in $L_2(S)$. In other words the spectral measure will be absolutely continuous with a density $f$ with respect to the normalized Lebesgue measure $\frac{d\theta}{2\pi}$. Conversely, if we have a $\mu$ with a density $f$, its square root will be a function in $L_2$ and will therefore have Fourier coefficients $a_n$ in $l_2$, and a moving average representation holds in terms of i.i.d. random variables with these weights.
Q2. When does a Gaussian process have a representation that is causal, i.e. of the form

$$X_n = \sum_{j\ge 0}a_j\xi_{n-j} \quad\text{with}\quad \sum_{j\ge 0}a_j^2 < \infty\,?$$

If we do have a causal representation, then the remote past of the $\{X_k\}$ process is clearly part of the remote past of the $\{\xi_k\}$ process. By Kolmogorov's zero-one law, the remote past for independent Gaussians is trivial, and a causal representation is therefore possible for $\{X_k\}$ only if its remote past is trivial. The converse is true as well. The subspace $H_n$ is spanned by $H_{n-1}$ and $X_n$. Therefore either $H_n = H_{n-1}$, or $H_{n-1}$ has codimension $1$ in $H_n$. In the former case, by stationarity, $H_n = H_{n-1}$ for every $n$. This in turn implies $H_{-\infty} = H = H^\infty$. Assuming that the process is not identically zero, i.e. $\rho_0 = \mu(S) > 0$, this makes the remote past or future the whole thing and definitely nontrivial. So we may assume that $H_n = H_{n-1}\oplus e_n$, where $e_n$ is a one dimensional subspace spanned by a unit vector $\xi_n$. Since all our random variables are linear combinations of a Gaussian collection, they all have Gaussian distributions. We have the shift operator $U$ satisfying $UX_n = X_{n+1}$ and we can assume without loss of generality that $U\xi_n = \xi_{n+1}$ for every $n$. If we start with $X_0$ in our Hilbert space,

$$X_0 = a_0\xi_0 + R_1$$

with $R_1 \in H_{-1}$. We can continue and write

$$R_1 = a_1\xi_{-1} + R_2$$

and so on. We will then have for every $n$

$$X_0 = a_0\xi_0 + a_1\xi_{-1} + \dots + a_n\xi_{-n} + R_{(n+1)}$$

with $R_{(n+1)} \in H_{-(n+1)}$. Since $\cap_n H_{-n} = \{0\}$, we conclude that the expansion

$$X_0 = \sum_{j=0}^{\infty}a_j\xi_{-j}$$

is valid.
Q3. What are the conditions on the spectral density $f$ in order that the process may admit a causal representation? From our answer to Q1 we know that we have to solve the following analytical problem. Given the spectral measure with a nonnegative density $f \in L_1(S)$, when can we write $f = |g|^2$ for some $g \in L_2(S)$ that admits a Fourier representation $g = \sum_{j\ge 0}a_j e^{ij\theta}$ involving only nonnegative frequencies? This has the following neat solution, which is far from obvious.

Theorem 6.12. The process determined by the spectral density $f$ admits a causal representation if and only if $f(\theta)$ satisfies

$$\int_S \log f(\theta)\,d\theta > -\infty.$$

Remark 6.8. Notice that the condition basically prevents $f$ from vanishing on a set of positive measure or having very flat zeros.
The proof will use methods from the theory of functions of a complex variable.

Proof. Define

$$g(\theta) = \sum_{n\ge 0}c_n\exp[in\theta]$$

as the Fourier series of some $g \in L_2(S)$, with $c_n \neq 0$ for some $n \ge 0$. In fact we can assume without loss of generality that $c_0 \neq 0$, by removing a suitable factor of $e^{ik\theta}$ which will not affect $|g(\theta)|$. Then we will show that

$$\frac{1}{2\pi}\int_S \log|g(\theta)|\,d\theta \ge \log|c_0|.$$

Consider the function

$$G(z) = \sum_{n\ge 0}c_n z^n$$

as an analytic function in the disc $|z| < 1$. It has boundary values

$$\lim_{r\to 1}G(re^{i\theta}) = g(\theta)$$

in $L_2(S)$. Since $G$ is an analytic function we know, from the theory of functions of a complex variable, that $\log|G(re^{i\theta})|$ is subharmonic and has the mean value property

$$\frac{1}{2\pi}\int_S\log|G(re^{i\theta})|\,d\theta \ge \log|G(0)| = \log|c_0|.$$

Since $G(re^{i\theta})$ has a limit in $L_2(S)$, the positive part of $\log|G|$, which is dominated by $|G|$, is uniformly integrable. For the negative part we apply Fatou's lemma and derive our estimate.

Now for the converse. Let $f \in L_1(S)$. Assume $\int_S\log f(\theta)\,d\theta > -\infty$, or equivalently $\log f \in L_1(S)$. Define the Fourier coefficients

$$a_n = \frac{1}{4\pi}\int_S\log f(\theta)\exp[-in\theta]\,d\theta.$$

Because $\log f$ is integrable, the $\{a_n\}$ are uniformly bounded and the power series

$$A(z) = \sum_{n\ge 0} a_n z^n$$

is well defined for $|z| < 1$. We define

$$G(z) = \exp[A(z)].$$

We will show that

$$\lim_{r\to 1}G(re^{i\theta}) = g(\theta)$$

exists in $L_2(S)$ and $f = |g|^2$, $g$ being the boundary value of an analytic function in the disc. The integral condition on $\log f$ is then the necessary and sufficient condition for writing $f = |g|^2$ with $g$ involving only nonnegative frequencies. We estimate

$$|G(re^{i\theta})|^2 = \exp\left[2\,\mathrm{Re}\,A(re^{i\theta})\right] = \exp\left[2\sum_{j=0}^{\infty}a_j r^j\cos j\theta\right] = \exp\left[2\sum_{j=0}^{\infty}r^j\cos j\theta\,\frac{1}{4\pi}\int_S\log f(\varphi)\cos j\varphi\,d\varphi\right]$$
$$= \exp\left[\frac{1}{2\pi}\int_S\log f(\varphi)\left[\sum_{j=0}^{\infty}r^j\cos j\theta\cos j\varphi\right]d\varphi\right] = \exp\left[\int_S\log f(\varphi)\,K(r,\theta,\varphi)\,d\varphi\right] \le \int_S f(\varphi)\,K(r,\theta,\varphi)\,d\varphi.$$

Here $K$ is the Poisson kernel for the disc,

$$K(r,\theta,\varphi) = \frac{1}{2\pi}\sum_{j=0}^{\infty}r^j\cos j\theta\cos j\varphi,$$

which is nonnegative with $\int_S K(r,\theta,\varphi)\,d\varphi = 1$. The last step is a consequence of Jensen's inequality. The function

$$f_r(\theta) = \int_S f(\varphi)\,K(r,\theta,\varphi)\,d\varphi$$

converges to $f$ as $r \to 1$ in $L_1(S)$ by the properties of the Poisson kernel. It is therefore uniformly integrable. Since $|G(re^{i\theta})|^2$ is dominated by $f_r$, we get uniform integrability for $|G|^2$ as $r \to 1$. It is seen now that $G$ has a limit $g$ in $L_2(S)$ as $r \to 1$ and $f = |g|^2$.
One of the issues in the theory of time series is that of prediction. We have a stochastic process $\{X_n\}$ that we have observed for times $n \le -1$ and we want to predict $X_0$. The best predictor is $E^P[X_0\,|\,\mathcal{F}_{-1}]$, or in the Gaussian linear context it is the computation of the projection of $X_0$ into $H_{-1}$. If we have a moving average representation, even a causal one, while it is true that $X_j$ is spanned by $\{\xi_k : k \le j\}$, the converse may not be true. If the two spans were the same, then the best predictor for $X_0$ is just

$$\widehat{X}_0 = \sum_{j\ge 1}a_j\xi_{-j}$$

obtained by dropping one term in the original representation. In fact, in answering Q2 the construction yielded a representation with this property. The quantity $|a_0|^2$ is then the prediction error. In any case it is a lower bound.
Q4. What is the value of the prediction error, and how do we actually find the predictor?

The situation is somewhat muddled. Let us assume that we have a purely nondeterministic process, i.e. a process with a spectral density satisfying $\int_S\log f(\theta)\,d\theta > -\infty$. Then $f$ can be represented as

$$f = |g|^2$$

with $g \in H_2$, where by $H_2$ we denote the subspace of those $L_2(S)$ functions that are boundary values of analytic functions in the disc $|z| < 1$, or equivalently functions $g \in L_2(S)$ with only nonnegative frequencies. For any such $g$, we have an analytic function

$$G(z) = G(re^{i\theta}) = \sum_{n\ge 0}a_n r^n e^{in\theta}.$$

For any choice of $g \in H_2$ with $f = |g|^2$, we have

$$|G(0)|^2 = |a_0|^2 \le \exp\left[\frac{1}{2\pi}\int_S\log f(\theta)\,d\theta\right]. \qquad (6.1)$$

There is a choice of $g$, constructed in the proof of the theorem, for which

$$|G(0)|^2 = \exp\left[\frac{1}{2\pi}\int_S\log f(\theta)\,d\theta\right]. \qquad (6.2)$$

The prediction error $\sigma^2(f)$, which depends only on $f$ and not on the choice of $g$, also satisfies

$$\sigma^2(f) \ge |G(0)|^2 \qquad (6.3)$$

for every choice of $g \in H_2$ with $f = |g|^2$. There is a choice of $g$ such that

$$\sigma^2(f) = |G(0)|^2. \qquad (6.4)$$

Therefore from (6.1) and (6.4)

$$\sigma^2(f) \le \exp\left[\frac{1}{2\pi}\int_S\log f(\theta)\,d\theta\right]. \qquad (6.5)$$

On the other hand from (6.2) and (6.3)

$$\sigma^2(f) \ge \exp\left[\frac{1}{2\pi}\int_S\log f(\theta)\,d\theta\right]. \qquad (6.6)$$

We do now have an exact formula

$$\sigma^2(f) = \exp\left[\frac{1}{2\pi}\int_S\log f(\theta)\,d\theta\right] \qquad (6.7)$$

for the prediction error.
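Formula (6.7) is easy to test numerically. For the moving average $X_n = \xi_n + \beta\xi_{n-1}$ with $|\beta| < 1$ (an illustrative choice), the spectral density is $f(\theta) = |1 + \beta e^{i\theta}|^2$ and the prediction error should be $|a_0|^2 = 1$:

```python
import numpy as np

beta = 0.5
theta = np.linspace(0.0, 2.0 * np.pi, 200000, endpoint=False)
f = np.abs(1.0 + beta * np.exp(1j * theta))**2   # spectral density

# (1/2pi) * integral of log f  ~  mean of log f over a uniform grid
sigma2 = np.exp(np.log(f).mean())
print("sigma^2(f) =", sigma2)                    # ~ 1 = |a_0|^2
```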
As for the predictor, it is not quite that simple. In principle it is a limit of linear combinations of $\{X_j : j \le 0\}$ and may not always have a simple concrete representation. But we can understand it a little better. Let us consider the spaces $H$ and $L_2(S; \mu)$ of square integrable functions on $S$ with respect to the spectral measure $\mu$. There is a natural isomorphism between the two Hilbert spaces, if we map

$$\sum a_j X_j \leftrightarrow \sum a_j e^{ij\theta}.$$

The problem then is the question of approximating $e^{i\theta}$ in $L_2(S; \mu)$ by linear combinations of $\{e^{ij\theta} : j \le 0\}$. We have already established that the error, which is nonzero in the purely nondeterministic case, i.e. when $d\mu = \frac{1}{2\pi}f(\theta)\,d\theta$ for some $f \in L_1(S)$ satisfying

$$\int_S\log f(\theta)\,d\theta > -\infty,$$

is given by

$$\sigma^2(f) = \exp\left[\frac{1}{2\pi}\int_S\log f(\theta)\,d\theta\right].$$

We now want to find the best approximation.
In order to get at the predictor we have to make a very special choice of the representation $f = |g|^2$. Simply demanding $g \in L_2(S)$ will not even give causal representations. Demanding $g \in H_2$ will always give us causal representations, but there are too many of these. If we multiply $G(z)$ by an analytic function $V(z)$ that has boundary values $v(\theta)$ satisfying $|v(\theta)| = |V(e^{i\theta})| = 1$ on $S$, then $gv$ is another choice. If we demand that

$$|G(0)|^2 = \exp\left[\frac{1}{2\pi}\int_S\log f(\theta)\,d\theta\right] \qquad (6.8)$$

there is at least one choice that will satisfy it. There is still an ambiguity, albeit a trivial one, among these, for we can always multiply $g$ by a complex number of modulus $1$ and that will not change anything of consequence. We have the following theorem.

Theorem 6.13. The representation $f = |g|^2$, with $g \in H_2$ and satisfying (6.8), is unique to within a multiplicative constant of modulus $1$. In other words, if $f = |g_1|^2 = |g_2|^2$ with both $g_1$ and $g_2$ satisfying (6.8), then $g_1 = \rho g_2$ on $S$, where $\rho$ is a complex number of modulus $1$.
Proof. Let $F(re^{i\theta}) = \log|G(re^{i\theta})|$. It is a subharmonic function and

$$\lim_{r\to 1}F(re^{i\theta}) = \frac{1}{2}\log f(\theta).$$

Because

$$\lim_{r\to 1}G(re^{i\theta}) = g(\theta)$$

in $L_2(S)$, the functions are uniformly integrable in $r$. The positive part of the logarithm $F$ is well controlled and therefore uniformly integrable. Fatou's lemma is applicable and we should always have

$$\limsup_{r\to 1}\frac{1}{2\pi}\int_S F(re^{i\theta})\,d\theta \le \frac{1}{4\pi}\int_S\log f(\theta)\,d\theta.$$

But because $F$ is subharmonic, its average value on a circle of radius $r$ around $0$ is nondecreasing in $r$, and the limsup is the same as the sup. Therefore

$$F(0) \le \sup_{0\le r<1}\frac{1}{2\pi}\int_S F(re^{i\theta})\,d\theta = \limsup_{r\to 1}\frac{1}{2\pi}\int_S F(re^{i\theta})\,d\theta \le \frac{1}{4\pi}\int_S\log f(\theta)\,d\theta.$$

Since we have equality at both ends, that implies a lot of things. In particular $F$ is harmonic and is represented via the Poisson integral in terms of its boundary value $\frac{1}{2}\log f$. In particular $G$ has no zeros in the disc. Obviously $F$ is uniquely determined by $\log f$, and by the Cauchy-Riemann equations the imaginary part of $\log G$ is determined up to an additive constant. Therefore the only ambiguity in $G$ is a multiplicative constant of modulus $1$.
Given the process $\{X_n\}$ with trivial tail subspaces, we saw earlier that it has a representation

$$X_n = \sum_{j=0}^{\infty}a_j\xi_{n-j}$$

in terms of standard i.i.d. Gaussians, and from the construction we also know that $\xi_n \in H_n$ for each $n$. In particular $\xi_0 \in H_0$ and can be approximated by linear combinations of $\{X_j : j \le 0\}$. Let us suppose that $h(\theta)$ represents $\xi_0$ in $L_2(S; f)$. We know that $h(\theta)$ is in the linear span of $\{e^{ij\theta} : j \le 0\}$. We want to find the function $h$. If $\xi_0 \leftrightarrow h$, then by the nature of the isomorphism $\xi_{-n} \leftrightarrow e^{in\theta}h$ and

$$1 = \sum_{j=0}^{\infty}a_j e^{ij\theta}\,h(\theta)$$

is an orthonormal expansion in $L_2(S; f)$. Also, if we denote by

$$G(z) = \sum_{j=0}^{\infty}a_j z^j$$

then the boundary function $g(\theta) = \lim_{r\to 1}G(re^{i\theta})$ has the property

$$g(\theta)h(\theta) = 1$$

and so

$$h(\theta) = \frac{1}{g(\theta)}.$$

Since the function $G$ that we constructed has the property

$$|G(0)|^2 = |a_0|^2 = \sigma^2(f) = \exp\left[\frac{1}{2\pi}\int_S\log f(\theta)\,d\theta\right]$$

it is the canonical choice determined earlier, to within a multiplicative constant of modulus $1$. The predictor then is clearly represented by the function

$$1 - a_0 h(\theta) = 1 - \frac{G(0)}{g(\theta)}.$$
Example 6.2. A wide class of examples is given by densities $f(\theta)$ that are rational trigonometric polynomials of the form

$$f(\theta) = \frac{\left|\sum A_j e^{ij\theta}\right|^2}{\left|\sum B_j e^{ij\theta}\right|^2}.$$

We can always multiply by $e^{ik\theta}$ inside the absolute value and assume that

$$f(\theta) = \frac{|P(e^{i\theta})|^2}{|Q(e^{i\theta})|^2}$$

where $P(z)$ and $Q(z)$ are polynomials in the complex variable $z$. The symmetry of $f$ under $\theta \to -\theta$ means that the coefficients in the polynomials have to be real. The integrability of $f$ will force the polynomial $Q$ not to have any zeros on the circle $|z| = 1$. Given any two complex numbers $c$ and $z$, such that $|z| = 1$ and $c \neq 0$,

$$|z - c| = |\bar z - \bar c| = \left|\frac{1}{z} - \bar c\right| = |1 - \bar c z| = |\bar c|\left|z - \frac{1}{\bar c}\right|.$$

This means in our representation for $f$, first we can omit terms that involve powers of $z$, which have only modulus $1$ on $S$. Next, any term $(z - c)$ that contributes a nonzero root $c$ with $|c| < 1$ can be replaced by $\bar c(z - \frac{1}{\bar c})$, and thus the root can be moved outside the disc without changing the value of $f$. We can therefore rewrite

$$f(\theta) = |g(\theta)|^2 \quad\text{with}\quad G(z) = \frac{P(z)}{Q(z)}$$

with new polynomials $P$ and $Q$ that have no roots inside the unit disc, and with perhaps $P$ alone having roots on $S$. Clearly

$$h(\theta) = \frac{Q(e^{i\theta})}{P(e^{i\theta})}.$$

If $P$ has no roots on $S$, we have a nice convergent power series for $\frac{Q}{P}$ with a radius of convergence larger than $1$, and we are in a very good situation. If $P \equiv 1$, we are in an even better situation with the predictor expressed as a finite sum. If $P$ has a root on $S$, then it could be a little bit of a mess, as the next exercise shows.
6.6. STATIONARY GAUSSIAN PROCESSES. 211
Exercise 6.16. Assume that we have a representation of the form
$$X_n = \xi_n - \xi_{n-1}$$
in terms of standard i.i.d. Gaussians. How will you predict $X_1$ based on $\{X_j : j \le 0\}$?
Exercise 6.17. An autoregressive scheme is a representation of the form
$$X_n = \sum_{j=1}^{k} a_j\,X_{n-j} + \sigma\,\xi_n$$
where $\xi_n$ is a standard Gaussian independent of $\{(X_j, \xi_j) : j \le n-1\}$. In other words the predictor
$$\hat X_n = \sum_{j=1}^{k} a_j\,X_{n-j}$$
and the prediction error $\sigma^2$ are specified for the model. Can you always find a stationary Gaussian process $\{X_n\}$ with spectral density f(θ) that is consistent with the model?
Chapter 7

Dynamic Programming and Filtering.

7.1 Optimal Control.
Optimal control or dynamic programming is a useful and important concept in the theory of Markov processes. We have a state space X and a family $\pi_\alpha$ of transition probability functions indexed by a parameter $\alpha \in A$. The parameter $\alpha$ is called the control parameter and can be chosen at will from the set A. The choice is allowed to vary over time, i.e. $\alpha_j$ can be the parameter of choice for the transition from $x_j$ at time j to $x_{j+1}$ at time j+1. The choice can also depend on the information available up to that point, i.e. $\alpha_j$ can be an $\mathcal{F}_j$-measurable function. Then the conditional probability $P\{x_{j+1} \in A \,|\, \mathcal{F}_j\}$ is given by $\pi_{\alpha_j(x_0, \dots, x_j)}(x_j, A)$. Of course in order for things to make sense we need to assume some measurability conditions. We have a payoff function $f(x_N)$ and the object is to maximize $E\{f(x_N)\}$ by a suitable choice of the functions $\{\alpha_j(x_0, \dots, x_j) : 0 \le j \le N-1\}$. The idea (Bellman's) of dynamic programming is to define recursively (by backward induction), for $0 \le j \le N-1$, the sequence of functions
$$V_j(x) = \sup_{\alpha} \int V_{j+1}(y)\,\pi_{\alpha}(x, dy) \qquad (1)$$
with
$$V_N(x) = f(x)$$
as well as the sequence $\{\alpha^*_j(x) : 0 \le j \le N-1\}$ of functions that provide the supremum in (1):
$$V_j(x) = \int V_{j+1}(y)\,\pi_{\alpha^*_j(x)}(x, dy) = \sup_{\alpha} \int V_{j+1}(y)\,\pi_{\alpha}(x, dy)$$
We then have

Theorem 7.1. If the Markov chain starts from x at time 0, then $V_0(x)$ is the best expected value of the reward. The optimal control is Markovian and is provided by $\{\alpha^*_j(x_j)\}$.
Proof. It is clear that if we pick the control as $\alpha^*_j$ then we have an inhomogeneous Markov chain with transition probability
$$\pi_{j,\,j+1}(x, dy) = \pi_{\alpha^*_j(x)}(x, dy)$$
and if we denote by $P^*_x$ the process corresponding to it that starts from the point x at time 0, we can establish by induction that
$$E^{P^*_x}\{f(x_N)\,|\,\mathcal{F}_{N-j}\} = V_{N-j}(x_{N-j})$$
for $1 \le j \le N$. Taking j = N, we obtain
$$E^{P^*_x}\{f(x_N)\} = V_0(x).$$
To show that $V_0(x)$ is optimal, note that for any admissible (not necessarily Markovian) choice of controls, if P is the measure on $\mathcal{F}_N$ corresponding to a starting point x,
$$E^{P}\{V_{j+1}(x_{j+1})\,|\,\mathcal{F}_j\} \le V_j(x_j)$$
and now it follows that
$$E^{P}\{f(x_N)\} \le V_0(x).$$
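The backward induction (1) is straightforward to carry out when the state space and the control set are finite. A minimal sketch, in which the two transition matrices and the terminal payoff are arbitrary illustrative choices, not an example from the text:

import numpy as np

# Backward induction (1) on the state space {0, 1} with two controls.
pi = {0: np.array([[0.9, 0.1],
                   [0.5, 0.5]]),
      1: np.array([[0.2, 0.8],
                   [0.7, 0.3]])}
f = np.array([0.0, 1.0])              # terminal payoff f(x_N)
N = 5

V = f.copy()                          # V_N = f
policy = []
for j in range(N - 1, -1, -1):
    values = np.stack([pi[a] @ V for a in (0, 1)])  # one row per control
    policy.insert(0, values.argmax(axis=0))         # alpha*_j(x)
    V = values.max(axis=0)                          # V_j(x)
print(V)         # V_0(x): optimal expected reward from each state
print(policy)    # Markovian optimal controls alpha*_0, ..., alpha*_{N-1}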
Exercise 7.1. The problem could be modified by making the reward function equal to
$$E^{P}\Bigl\{\sum_{j=0}^{N} f_j(\alpha_{j-1}, x_j)\Bigr\}$$
and thereby incorporate the cost of control into the reward function. Work out the recursion formula for the optimal reward in this case.
7.2 Optimal Stopping.

A special class of optimization problems is called optimal stopping problems. We have a Markov chain with transition probability $\pi(x, dy)$ and time runs from 0 to N. We have the option to stop at any time based on the history up to that time. If we stop at time k in the state x, the reward is f(k, x). The problem then is to maximize $E_x\{f(\tau, x_\tau)\}$ over all stopping times $0 \le \tau \le N$. If V(k, x) is the optimal reward when the game starts from x at time k, the best we can do starting from x at time k−1 is to earn a reward of
$$V(k-1, x) = \max\Bigl[f(k-1, x),\ \int V(k, y)\,\pi(x, dy)\Bigr]$$
Starting with V(N, x) = f(N, x), by backwards induction we can get V(j, x) for $0 \le j \le N$. The optimal stopping rule is given by
$$\tau^* = \inf\{k : V(k, x_k) = f(k, x_k)\}$$
Theorem 7.2. For any stopping time τ with $0 \le \tau \le N$,
$$E_x\{f(\tau, x_\tau)\} \le V(0, x)$$
and
$$E_x\{f(\tau^*, x_{\tau^*})\} = V(0, x)$$
Proof. Because
$$V(k, x) \ge \int V(k+1, y)\,\pi(x, dy)$$
we conclude that $V(k, x_k)$ is a supermartingale, and an application of Doob's stopping theorem proves the first claim. On the other hand, if V(k, x) > f(k, x) we have
$$V(k, x) = \int V(k+1, y)\,\pi(x, dy)$$
and this means that $V(\tau^* \wedge k,\ x_{\tau^* \wedge k})$ is a martingale, which establishes the second claim.
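On a finite state space the recursion for V(k, x) takes only a few lines. A minimal sketch, where the transition matrix and the stopping reward f(k, x) are arbitrary illustrative choices:

import numpy as np

# Optimal stopping recursion on the state space {0, 1, 2}.
N = 10
pi = np.array([[0.5,  0.5,  0.0],
               [0.25, 0.5,  0.25],
               [0.0,  0.5,  0.5]])
x_vals = np.arange(3)

def f(k, x):
    return (k / N) * x                # reward for stopping at time k in state x

V = f(N, x_vals).astype(float)        # V(N, x) = f(N, x)
for k in range(N - 1, -1, -1):
    V = np.maximum(f(k, x_vals), pi @ V)
# the optimal rule stops at the first k with V(k, x_k) = f(k, x_k)
print(V)                              # V(0, x) for x = 0, 1, 2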
Example 7.1. (The Secretary Problem.) An interesting example is the following game.
We have a lottery with N tickets. Each ticket has a number on it. The numbers $a_1, \dots, a_N$ are distinct, but the player has no idea of what they are. The player draws a ticket at random and looks at the number. He can either keep the ticket or reject it. If he rejects it, he can draw another ticket from the remaining ones and again decide if he wants to keep it. The information available to him is the numbers on the tickets he has so far drawn and discarded, as well as the number on the last ticket that he has drawn and is holding. If he decides to keep the ticket at any stage, then the game ends and that is his ticket. Of course if he continues on till the end, rejecting all of them, he is forced to keep the last one. The player wins only if the ticket he keeps is the one that has the largest number written on it. He cannot go back and claim a ticket that he has already rejected, and he cannot pick a new one unless he rejects the one he is holding. Assuming that the draws are random at each stage, how can the player maximize the probability of winning? How small is this probability?
It is clear that the strategy of picking the first or the last or any fixed draw has probability $\frac{1}{N}$ of winning. It is not a priori clear that the probability $p_N$ of winning under the optimal strategy remains bounded away from 0 for large N. It seems unlikely that any strategy can pick the winner with significant probability for large values of N. Nevertheless the following simple strategy shows that
$$\liminf_{N\to\infty} p_N \ge \frac{1}{4}.$$
Let half the draws go by, no matter what, and then pick the first one which is the highest among the tickets drawn up to the time of the draw. If the second best has already been drawn in the first half and the best is still to come, this strategy will succeed. This has probability nearly $\frac{1}{4}$. In fact the strategy works if the k best tickets have not been seen during the first half, the (k+1)-th has been, and among the k best the highest shows up first in the second half. The probability of this is about $\frac{1}{k\,2^{k+1}}$, and as these are disjoint events
$$\liminf_{N\to\infty} p_N \ge \sum_{k\ge 1} \frac{1}{k\,2^{k+1}} = \frac{1}{2}\log 2$$
If we decide to look at the first Nx tickets rather than $\frac{N}{2}$, the lower bound becomes $x\log\frac{1}{x}$; since $\frac{d}{dx}\bigl[x\log\frac{1}{x}\bigr] = \log\frac{1}{x} - 1$ vanishes at $x = \frac{1}{e}$, an optimization over x leads to $x = \frac{1}{e}$ and the resulting
lower bound
$$\liminf_{N\to\infty} p_N \ge \frac{1}{e}.$$
We will now use the method of optimal stopping to decide on the best strategy for every N and show that the procedure we described is about the best. Since the only thing that matters is the ordering of the numbers, the numbers themselves have no meaning. Consider a Markov chain with two states 0 and 1. The player is in state 1 if he is holding the largest ticket so far. Otherwise he is in state 0. If he is in state 1 and stops at stage k, i.e. when k tickets have been drawn, the probability of his winning is easily calculated to be $\frac{k}{N}$. If he is in state 0, he has to go on, and the probability of landing on 1 at the next step is calculated to be $\frac{1}{k+1}$. If he is at 1 and decides to play on, the probability is still $\frac{1}{k+1}$ for landing on 1 at the next stage. The problem reduces to optimal stopping for a sequence $X_1, X_2, \dots, X_N$ of independent random variables with $P\{X_i = 1\} = \frac{1}{i+1}$, $P\{X_i = 0\} = \frac{i}{i+1}$ and a reward function of $f(i, 1) = \frac{i}{N}$; $f(i, 0) = 0$. Let us define recursively the optimal probabilities
probabilities
V (i , 0) =
1
I + 1
V (i + 1, 1) +
i
i + 1
V (i + 1 , 0)
and
V (i , 1) = max[
i
N
,
1
I + 1
V (i + 1, 1) +
i
i + 1
V (i + 1 , 0)] = max[
i
N
, V (i , 0)]
It is clear what the optimal strategy is. We should always draw if we are in state 0, i.e. we are sure to lose if we stop. If we are holding a ticket that is the largest so far, we should stop provided
$$\frac{i}{N} > V(i, 0)$$
and go on if
$$\frac{i}{N} < V(i, 0).$$
Either strategy is acceptable in case of equality. Since $V(i+1, 1) \ge V(i+1, 0)$ for all i, it follows that $V(i, 0) \ge V(i+1, 0)$. There is therefore a critical $k (= k_N)$ such that $\frac{i}{N} \ge V(i, 0)$ if $i \ge k$ and $\frac{i}{N} \le V(i, 0)$ if $i \le k$. The best strategy is to wait till k tickets have been drawn, discarding every ticket,
and then pick the first one that is the best so far. The last question is the determination of $k = k_N$. For $i \ge k$,
$$V(i, 0) = \frac{1}{i+1}\cdot\frac{i+1}{N} + \frac{i}{i+1}\,V(i+1, 0) = \frac{1}{N} + \frac{i}{i+1}\,V(i+1, 0)$$
or
$$\frac{V(i, 0)}{i} - \frac{V(i+1, 0)}{i+1} = \frac{1}{N}\cdot\frac{1}{i}$$
telling us
$$V(i, 0) = \frac{i}{N}\sum_{j=i}^{N-1}\frac{1}{j}$$
so that
$$k_N = \inf\Bigl\{\,i : \sum_{j=i}^{N-1}\frac{1}{j} < 1\Bigr\}$$
Approximately $\log N - \log k_N = 1$, or $k_N \simeq \frac{N}{e}$.
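The recursion for V(i, 0) and V(i, 1) is easy to carry out exactly, and it confirms both the threshold $k_N \simeq N/e$ and a winning probability close to $1/e$. A sketch, with N = 100 as an arbitrary choice:

import numpy as np

# Exact dynamic programming for the secretary problem.
N = 100
V0 = np.zeros(N + 1)                  # V0[i] = V(i, 0); V0[N] = 0
V1 = np.zeros(N + 1)                  # V1[i] = V(i, 1); V1[N] = N/N = 1
V1[N] = 1.0
for i in range(N - 1, 0, -1):
    V0[i] = V1[i + 1] / (i + 1) + i * V0[i + 1] / (i + 1)
    V1[i] = max(i / N, V0[i])

k_N = min(i for i in range(1, N + 1) if i / N >= V0[i])
print(k_N, N / np.e)                  # threshold k_N, compared to N/e
print(V1[1])                          # winning probability, close to 1/e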
7.3 Filtering.

The problem in filtering is that there is an underlying stochastic process that we cannot observe. There is a related stochastic process, driven by the first one, that we can observe, and we want to use our information to draw conclusions about the state of the unobserved process. A simple but extreme example is when the unobserved process does not move and remains at the same value. Then it becomes a parameter. The driven process may be a sequence of i.i.d. random variables with densities f(θ, x) where θ is the unobserved, unchanging underlying parameter. We have a sample of n independent observations $X_1, \dots, X_n$ from the common distribution f(θ, x), and our goal is then nothing other than parameter estimation. We shall take a Bayesian approach. We have a prior distribution μ(dθ) on the space of parameters, and this can be modified to an a posteriori distribution after the sample is observed. We have the joint distribution
$$\prod_{i=1}^{n} f(\theta, x_i)\,dx_i\;\mu(d\theta)$$
and we calculate the conditional distribution
$$\mu_n(d\theta\,|\,x_1, \dots, x_n)$$
of θ given $x_1, \dots, x_n$. This is our best informed guess about the nature of the unknown parameter. We can use this information as we see fit. If we have an additional observation $x_{n+1}$ we need not recalculate everything; we can simply update by viewing $\mu_n$ as the new prior and calculating the posterior after the single observation $x_{n+1}$.
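On a discrete grid of parameter values this sequential updating is just repeated multiplication by the likelihood followed by renormalization. A minimal sketch, where the likelihood f(θ, x) is taken to be the N(θ, 1) density and the flat prior is a purely illustrative choice, not an example from the text:

import numpy as np

# Sequential Bayesian updating on a grid of parameter values.
rng = np.random.default_rng(0)
thetas = np.linspace(-3.0, 3.0, 61)               # grid of parameter values
post = np.full(thetas.shape, 1.0 / len(thetas))   # flat prior mu

for x in rng.normal(1.0, 1.0, size=20):           # data with true theta = 1
    post *= np.exp(-0.5 * (x - thetas) ** 2)      # times f(theta, x), up to a
                                                  # constant that cancels below
    post /= post.sum()                            # renormalize: this is mu_n
print(thetas[np.argmax(post)])                    # posterior concentrates near 1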
We will just work out a single illustration of this, known as the Kalman-Bucy filter. Suppose the unobserved process $\{x_n\}$ is a Gaussian Markov chain
$$x_{n+1} = \rho\,x_n + \sigma\,\xi_{n+1}$$
with $0 < \rho < 1$, where the noise terms $\xi_n$ are i.i.d. normally distributed random variables with mean 0 and variance 1. The observed process $y_n$ is given by
$$y_n = x_n + \eta_n$$
where the $\{\eta_j\}$ are again independent standard Gaussians that are independent of the $\{\xi_j\}$ as well. If we start with an initial distribution for $x_0$, say one that is Gaussian with mean $m_0$ and variance $\sigma_0^2$, we can compute the joint distribution of $x_0$, $x_1$ and $y_1$ and then the conditional distribution of $x_1$ given $y_1$. This becomes the new distribution of the state $x_1$ based on the observation $y_1$. This allows us to calculate recursively at every stage.
Let us do this explicitly now. The distribution of $(x_1, y_1)$ is jointly normal with mean $(\rho m_0,\ \rho m_0)$, variances $(\rho^2\sigma_0^2 + \sigma^2,\ \rho^2\sigma_0^2 + \sigma^2 + 1)$ and covariance $\rho^2\sigma_0^2 + \sigma^2$. The posterior distribution of $x_1$ is again normal with mean
$$m_1 = \rho m_0 + \frac{\rho^2\sigma_0^2+\sigma^2}{\rho^2\sigma_0^2+\sigma^2+1}\,(y_1 - \rho m_0) = \frac{1}{\rho^2\sigma_0^2+\sigma^2+1}\,\rho m_0 + \frac{\rho^2\sigma_0^2+\sigma^2}{\rho^2\sigma_0^2+\sigma^2+1}\,y_1$$
and variance
$$\sigma_1^2 = (\rho^2\sigma_0^2+\sigma^2)\left(1 - \frac{\rho^2\sigma_0^2+\sigma^2}{\rho^2\sigma_0^2+\sigma^2+1}\right) = \frac{\rho^2\sigma_0^2+\sigma^2}{\rho^2\sigma_0^2+\sigma^2+1}$$
After a long time, while the recursion for $m_n$ remains of the same form,
$$m_n = \frac{1}{\rho^2\sigma_{n-1}^2+\sigma^2+1}\,\rho\,m_{n-1} + \frac{\rho^2\sigma_{n-1}^2+\sigma^2}{\rho^2\sigma_{n-1}^2+\sigma^2+1}\,y_n$$
the variance $\sigma_n^2$ has an asymptotic value $\sigma_\infty^2$ given by the solution of
$$\sigma_\infty^2 = \frac{\rho^2\sigma_\infty^2+\sigma^2}{\rho^2\sigma_\infty^2+\sigma^2+1}.$$
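The whole filter is a two-line recursion and is easy to simulate. A minimal sketch of the recursions just derived, in which ρ, σ and the initial values are arbitrary illustrative choices:

import numpy as np

# Simulation of the filter for x_{n+1} = rho x_n + sigma xi_{n+1},
# y_n = x_n + eta_n, with xi, eta i.i.d. standard normal.
rng = np.random.default_rng(1)
rho, sigma = 0.8, 0.5
m, s2 = 0.0, 1.0                          # m_0 and sigma_0^2
x = rng.normal(m, np.sqrt(s2))            # unobserved initial state

for n in range(200):
    x = rho * x + sigma * rng.normal()    # unobserved state x_{n+1}
    y = x + rng.normal()                  # observation y_{n+1}
    p = rho ** 2 * s2 + sigma ** 2        # variance of x_{n+1} given past y's
    m = (rho * m + p * y) / (p + 1.0)     # posterior mean m_{n+1}
    s2 = p / (p + 1.0)                    # posterior variance sigma_{n+1}^2

print(m, x)    # the filter tracks the hidden state
print(s2)      # has converged to the fixed point sigma_infinity^2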
Index
σ-field, 9
accompanying laws, 78
Bellman, 213
Berry, 97
Berry-Esseen theorem, 97
binomial distribution, 31
Birkhoff, 179, 182
ergodic theorem of, 179, 182
Bochner, 32, 45, 49, 200
theorem of, 32, 45
for the circle, 49
Borel, 58
Borel-Cantelli lemma, 58
bounded convergence theorem, 19
branching process, 142
Bucy, 219
Cantelli, 58
Carathéodory
extension theorem, 11
Cauchy, 35
Cauchy distribution, 35
central limit theorem, 71
central limit theorem
under mixing , 198
change of variables, 23
Chapman, 117
Chapman-Kolmogorov equations, 117
characteristic function, 31
uniqueness theorem, 34
Chebychev, 55
Chebychev's inequality, 55
compound Poisson distribution, 77
conditional expectation, 101, 109
Jensen's inequality, 110
conditional probability, 101, 112
regular version, 113
conditioning, 101
continuity theorem, 39
control, 213
convergence
almost everywhere, 17
in distribution, 38
in law, 38
in probability, 17
convolution, 53
countable additivity, 9
covariance, 29
covariance matrix, 29
Cramer, 39
degenerate distribution, 31
Dirichlet, 33
Dirichlet integral, 33
disintegration theorem, 115
distribution
joint, 24
of a random variable, 24
distribution function, 13
dominated convergence theorem, 21
Doob, 151, 152, 157, 161, 164
decomposition theorem of, 157
inequality of, 152
inequality of, 151
stopping theorem of, 161
upcrossing inequality of, 164
double integral, 27
dynamic programming, 213
ergodic invariant measure, 184
ergodic process, 184
extremality of, 185
ergodic theorem, 179
almost sure, 182
almost sure, 179
maximal, 182
mean, 179
ergodicity, 184
Esseen, 97
exit probability, 170
expectation, 28
exponential distribution, 35
two sided, 35
extension theorem, 11
Fatou, 20
Fatou's lemma, 20
field, 8
σ-field generated by, 10
filter, 219
finite additivity, 9
Fubini, 27
Fubini's theorem, 27
gamma distribution, 35
Gaussian distribution, 35
Gaussian process, 200
stationary, 200
autoregressive schemes, 211
causal representation of, 200
moving average representation
of, 200
prediction of, 205
prediction error of, 205
predictor of, 205
rational spectral density, 210
spectral density of, 200
spectral measure of, 200
generating function, 36
geometric distribution, 34
Hahn, 104
Hahn-Jordan decomposition, 104
independent events, 51
independent random variables, 51
indicator function, 15
induced probability measure, 23
infinitely divisible distributions, 83
integrable functions, 21
integral, 14, 15
invariant measures, 179
inversion theorem, 34
irrational rotations, 187
Jensen, 110
Jordan, 104
Kalman, 219
Kalman-Bucy filter, 219
Khintchine, 89
Kolmogorov, 7, 59, 62, 66, 67, 70,
117
consistency theorem of, 59, 61
inequality of, 62
one series theorem of, 66
three series theorem of, 67
two series theorem of, 66
zero-one law of, 70
Lévy, 39, 63, 86, 89
inequality of, 63
theorem of, 63
Lévy measures, 86
Lévy-Khintchine representation, 89
law of large numbers
strong, 61
weak, 55
law of the iterated logarithm, 93
Lebesgue, 13
extension theorem, 13
Lindeberg, 72, 76
condition of, 72
theorem of, 72
Lipschitz, 108
Lipschitz condition, 108
Lyapunov, 76
condition of, 76
mapping, 22
Markov, 117
chain, 117
process, 117
homogeneous , 117
Markov chain
aperiodic, 133
invariant distribution for, 122
irreducible , 124
periodic behavior, 133
stationary distribution for, 122
Markov process
invariant measures
ergodicity, 189
invariant measures for, 188
mixing, 192
reversible, 189
stationary, 188
Markov property, 119
strong, 123
martingale difference, 150
martingale transform, 165
martingales, 149
almost sure convergence of, 155,
158
central limit theorem for, 196
convergence theorem, 154
sub-, 151
super-, 151
maximal ergodic inequality, 183
mean, 28
measurable function, 15
measurable space, 22
moments, 33, 36
generating function, 36
uniqueness from, 36
monotone class, 9, 12
monotone convergence theorem, 20
negative binomial distribution, 34
Nikodym, 105
normal distribution, 35
optimal control, 213
optimal stopping, 215
option pricing, 167
optional stopping theorem, 161
Ornstein, 194
Ornstein-Uhlenbeck process, 194
outer measure, 11
Poisson, 34, 77
Poisson distribution, 34
positive definite function, 32, 45
probability space, 14
product σ-field, 26
product measure, 25
product space, 24, 25
queues, 136
Radon, 105
Radon-Nikodym
derivative, 105
theorem, 105
random variable, 15
random walk, 121
recurrence, 174
simple, 134
transience, 174
recurrence, 124
null, 124
positive, 124
recurrent states, 133
renewal theorem, 128
repeated integral, 27
Riemann-Stieltjes integral, 30
secretary problem, 216
signed measure, 104
simple function, 15
Stirling, 57
Stirling's formula, 57, 71
stochastic matrix, 124
stopped σ-field, 161
stopping time, 122, 160
transformations, 22, 23
measurable, 23
measure preserving, 179
isometries from, 179
transience, 124
transient states, 133
transition operator, 169
transition probability, 117
stationary, 117
Tulcea, 116
theorem of, 116
Uhlenbeck, 194
uniform distribution, 34
uniform infinitesimality, 76
uniform tightness, 43
upcrossing inequality, 164
urn problem, 140
variance, 29
weak convergence, 38
Weierstrass, 37
factorization, 37