
Probability Theory: STAT310/MATH230; April 24, 2011
Amir Dembo
E-mail address: amir@math.stanford.edu
Department of Mathematics, Stanford University, Stanford, CA 94305.
Contents
Preface 5
Chapter 1. Probability, measure and integration 7
1.1. Probability spaces, measures and σ-algebras 7
1.2. Random variables and their distribution 18
1.3. Integration and the (mathematical) expectation 30
1.4. Independence and product measures 54
Chapter 2. Asymptotics: the law of large numbers 71
2.1. Weak laws of large numbers 71
2.2. The Borel-Cantelli lemmas 77
2.3. Strong law of large numbers 85
Chapter 3. Weak convergence, clt and Poisson approximation 95
3.1. The Central Limit Theorem 95
3.2. Weak convergence 103
3.3. Characteristic functions 117
3.4. Poisson approximation and the Poisson process 133
3.5. Random vectors and the multivariate clt 140
Chapter 4. Conditional expectations and probabilities 151
4.1. Conditional expectation: existence and uniqueness 151
4.2. Properties of the conditional expectation 156
4.3. The conditional expectation as an orthogonal projection 164
4.4. Regular conditional probability distributions 169
Chapter 5. Discrete time martingales and stopping times 175
5.1. Definitions and closure properties 175
5.2. Martingale representations and inequalities 184
5.3. The convergence of Martingales 191
5.4. The optional stopping theorem 204
5.5. Reversed MGs, likelihood ratios and branching processes 210
Chapter 6. Markov chains 225
6.1. Canonical construction and the strong Markov property 225
6.2. Markov chains with countable state space 233
6.3. General state space: Doeblin and Harris chains 255
Chapter 7. Continuous, Gaussian and stationary processes 269
7.1. Definition, canonical construction and law 269
7.2. Continuous and separable modifications 274
7.3. Gaussian and stationary processes 284
Chapter 8. Continuous time martingales and Markov processes 289
8.1. Continuous time filtrations and stopping times 289
8.2. Continuous time martingales 294
8.3. Markov and Strong Markov processes 317
Chapter 9. The Brownian motion 341
9.1. Brownian transformations, hitting times and maxima 341
9.2. Weak convergence and invariance principles 348
9.3. Brownian path: regularity, local maxima and level sets 367
Bibliography 375
Index 377
Preface
These are the lecture notes for a year long, PhD level course in Probability Theory
that I taught at Stanford University in 2004, 2006 and 2009. The goal of this
course is to prepare incoming PhD students in Stanford's mathematics and statistics
departments to do research in probability theory. More broadly, the goal of the text
is to help the reader master the mathematical foundations of probability theory
and the techniques most commonly used in proving theorems in this area. This is
then applied to the rigorous study of the most fundamental classes of stochastic
processes.
Towards this goal, we introduce in Chapter 1 the relevant elements from measure
and integration theory, namely, the probability space and the σ-algebras of events
in it, random variables viewed as measurable functions, their expectation as the
corresponding Lebesgue integral, and the important concept of independence.
Utilizing these elements, we study in Chapter 2 the various notions of convergence
of random variables and derive the weak and strong laws of large numbers.
Chapter 3 is devoted to the theory of weak convergence, the related concepts
of distribution and characteristic functions and two important special cases: the
Central Limit Theorem (in short clt) and the Poisson approximation.
Drawing upon the framework of Chapter 1, we devote Chapter 4 to the definition,
existence and properties of the conditional expectation and the associated regular
conditional probability distribution.
Chapter 5 deals with filtrations, the mathematical notion of information progression in time, and with the corresponding stopping times. Results about the latter
are obtained as a by-product of the study of a collection of stochastic processes
called martingales. Martingale representations are explored, as well as maximal
inequalities, convergence theorems and various applications thereof. Aiming for a
clearer and easier presentation, we focus here on the discrete time settings deferring
the continuous time counterpart to Chapter 8.
Chapter 6 provides a brief introduction to the theory of Markov chains, a vast
subject at the core of probability theory, to which many text books are devoted.
We illustrate some of the interesting mathematical properties of such processes by
examining a few special cases of interest.
Chapter 7 sets the framework for studying right-continuous stochastic processes
indexed by a continuous time parameter, introduces the family of Gaussian pro-
cesses and rigorously constructs the Brownian motion as a Gaussian process of
continuous sample path and zero-mean, stationary independent increments.
Chapter 8 expands our earlier treatment of martingales and strong Markov processes to the continuous time setting, emphasizing the role of right-continuous filtration. The mathematical structure of such processes is then illustrated both in
the context of Brownian motion and that of Markov jump processes.
Building on this, in Chapter 9 we re-construct the Brownian motion via the invariance principle as the limit of certain rescaled random walks. We further delve
into the rich properties of its sample path and the many applications of Brownian
motion to the clt and the Law of the Iterated Logarithm (in short, lil).
The intended audience for this course should have prior exposure to stochastic
processes, at an informal level. While students are assumed to have taken a real
analysis class dealing with Riemann integration, and mastered well this material,
prior knowledge of measure theory is not assumed.
It is quite clear that these notes are much influenced by the text books [Bil95, Dur10, Wil91, KaS97] I have been using.
I thank my students out of whose work this text materialized and my teaching assistants Su Chen, Kshitij Khare, Guoqiang Hu, Julia Salzman, Kevin Sun and Hua
Zhou for their help in the assembly of the notes of more than eighty students into
a coherent document. I am also much indebted to Kevin Ross, Andrea Montanari
and Oana Mocioalca for their feedback on earlier drafts of these notes, to Kevin
Ross for providing all the figures in this text, and to Andrea Montanari, David
Siegmund and Tze Lai for contributing some of the exercises in these notes.
Amir Dembo
Stanford, California
April 2010
CHAPTER 1
Probability, measure and integration
This chapter is devoted to the mathematical foundations of probability theory.
Section 1.1 introduces the basic measure theory framework, namely, the probability space and the σ-algebras of events in it. The next building blocks are random variables, introduced in Section 1.2 as measurable functions ω ↦ X(ω), and their distribution.
This allows us to define in Section 1.3 the important concept of expectation as the corresponding Lebesgue integral, extending the horizon of our discussion beyond the special functions and variables with density to which elementary probability theory is limited. Section 1.4 concludes the chapter by considering independence, the most fundamental aspect that differentiates probability from (general) measure theory, and the associated product measures.
1.1. Probability spaces, measures and σ-algebras
We shall define here the probability space (Ω, F, P) using the terminology of measure theory.
The sample space Ω is a set of all possible outcomes ω ∈ Ω of some random experiment. Probabilities are assigned by A ↦ P(A) to A in a subset F of all possible sets of outcomes. The event space F represents both the amount of information available as a result of the experiment conducted and the collection of all subsets of possible interest to us, where we denote the elements of F as events. A pleasant mathematical framework results by imposing on F the structural conditions of a σ-algebra, as done in Subsection 1.1.1. The most common and useful choices for this σ-algebra are then explored in Subsection 1.1.2. Subsection 1.1.3 provides fundamental supplements from measure theory, namely Dynkin's and Carathéodory's theorems and their application to the construction of Lebesgue measure.
1.1.1. The probability space (Ω, F, P). We use 2^Ω to denote the set of all possible subsets of Ω. The event space is thus a subset F of 2^Ω, consisting of all allowed events, that is, those subsets of Ω to which we shall assign probabilities. We next define the structural conditions imposed on F.
Definition 1.1.1. We say that F ⊆ 2^Ω is a σ-algebra (or a σ-field), if
(a) Ω ∈ F,
(b) if A ∈ F then A^c ∈ F as well (where A^c = Ω \ A),
(c) if A_i ∈ F for i = 1, 2, 3, . . . then also ∪_i A_i ∈ F.
Remark. Using DeMorgan's law, we know that (∪_i A_i^c)^c = ∩_i A_i. Thus the following is equivalent to property (c) of Definition 1.1.1:
(c') If A_i ∈ F for i = 1, 2, 3, . . . then also ∩_i A_i ∈ F.
Definition 1.1.2. A pair (Ω, F) with F a σ-algebra of subsets of Ω is called a measurable space. Given a measurable space (Ω, F), a measure μ is any countably additive non-negative set function on this space. That is, μ : F → [0, ∞], having the properties:
(a) μ(A) ≥ μ(∅) = 0 for all A ∈ F.
(b) μ(∪_n A_n) = Σ_n μ(A_n) for any countable collection of disjoint sets A_n ∈ F.
When in addition μ(Ω) = 1, we call the measure μ a probability measure, and often label it by P (it is also easy to see that then P(A) ≤ 1 for all A ∈ F).
Remark. When (b) of Definition 1.1.2 is relaxed to involve only finite collections of disjoint sets A_n, we say that μ is a finitely additive non-negative set-function. In measure theory we sometimes consider signed measures, whereby μ is no longer non-negative, hence its range is [−∞, ∞], and say that such a measure is finite when its range is R (i.e. no set in F is assigned an infinite measure).
Definition 1.1.3. A measure space is a triplet (Ω, F, μ), with μ a measure on the measurable space (Ω, F). A measure space (Ω, F, P) with P a probability measure is called a probability space.
The next exercise collects some of the fundamental properties shared by all probability measures.
Exercise 1.1.4. Let (Ω, F, P) be a probability space and A, B, A_i events in F. Prove the following properties of every probability measure.
(a) Monotonicity. If A ⊆ B then P(A) ≤ P(B).
(b) Sub-additivity. If A ⊆ ∪_i A_i then P(A) ≤ Σ_i P(A_i).
(c) Continuity from below: If A_i ↑ A, that is, A_1 ⊆ A_2 ⊆ . . . and ∪_i A_i = A, then P(A_i) ↑ P(A).
(d) Continuity from above: If A_i ↓ A, that is, A_1 ⊇ A_2 ⊇ . . . and ∩_i A_i = A, then P(A_i) ↓ P(A).
Remark. In the more general context of measure theory, note that properties (a)-(c) of Exercise 1.1.4 hold for any measure μ, whereas the continuity from above holds whenever μ(A_i) < ∞ for all i sufficiently large. Here is more on this:
Exercise 1.1.5. Prove that a finitely additive non-negative set function μ on a measurable space (Ω, F) with the continuity property
B_n ∈ F, B_n ↓ ∅, μ(B_n) < ∞ ⟹ μ(B_n) → 0
must be countably additive if μ(Ω) < ∞. Give an example showing that this is not necessarily so when μ(Ω) = ∞.
The σ-algebra F always contains at least the set Ω and its complement, the empty set ∅. Necessarily, P(Ω) = 1 and P(∅) = 0. So, if we take F_0 = {∅, Ω} as our σ-algebra, then we are left with no degrees of freedom in the choice of P. For this reason we call F_0 the trivial σ-algebra. Fixing Ω, we may expect that the larger the σ-algebra we consider, the more freedom we have in choosing the probability measure. This indeed holds to some extent, that is, as long as we have no problem satisfying the requirements in the definition of a probability measure. A natural question is when should we expect the maximal possible σ-algebra F = 2^Ω to be useful?
Example 1.1.6. When the sample space Ω is countable we can and typically shall take F = 2^Ω. Indeed, in such situations we assign a probability p_ω > 0 to each ω ∈ Ω, making sure that Σ_ω p_ω = 1. Then, it is easy to see that taking P(A) = Σ_{ω∈A} p_ω for any A ⊆ Ω results in a probability measure on (Ω, 2^Ω). For instance, when Ω is finite, we can take p_ω = 1/|Ω|, the uniform measure on Ω, whereby computing probabilities is the same as counting. Concrete examples are a single coin toss, for which we have Ω_1 = {H, T} (ω = H if the coin lands on its head and ω = T if it lands on its tail), and F_1 = {∅, Ω_1, {H}, {T}}, or when we consider a finite number of coin tosses, say n, in which case Ω_n = {(ω_1, . . . , ω_n) : ω_i ∈ {H, T}, i = 1, . . . , n} is the set of all possible n-tuples of coin tosses, while F_n = 2^{Ω_n} is the collection of all possible sets of n-tuples of coin tosses. Another example pertains to the set of all non-negative integers Ω = {0, 1, 2, . . .} and F = 2^Ω, where we get the Poisson probability measure of parameter λ > 0 when starting from p_k = (λ^k/k!) e^{−λ} for k = 0, 1, 2, . . ..
When Ω is uncountable such a strategy as in Example 1.1.6 will no longer work. The problem is that if we take p_ω = P({ω}) > 0 for uncountably many values of ω, we shall end up with P(Ω) = ∞. Of course we may define everything as before on a countable subset Ω′ of Ω and demand that P(A) = P(A ∩ Ω′) for each A ⊆ Ω. Excluding such trivial cases, to genuinely use an uncountable sample space we need to restrict our σ-algebra F to a strict subset of 2^Ω.
Definition 1.1.7. We say that a probability space (Ω, F, P) is non-atomic, or alternatively call P non-atomic, if P(A) > 0 implies the existence of B ∈ F, B ⊂ A, with 0 < P(B) < P(A).
Indeed, in contrast to the case of countable Ω, the generic uncountable sample space results in a non-atomic probability space (c.f. Exercise 1.1.27). Here is an interesting property of such spaces (see also [Bil95, Problem 2.19]).
Exercise 1.1.8. Suppose P is non-atomic and A ∈ F with P(A) > 0.
(a) Show that for every ε > 0, we have B ⊆ A such that 0 < P(B) < ε.
(b) Prove that if 0 < a < P(A) then there exists B ⊂ A with P(B) = a.
Hint: Fix ε_n ↓ 0 and define inductively numbers x_n and sets G_n ∈ F, with H_0 = ∅, H_n = ∪_{k<n} G_k, x_n = sup{P(G) : G ⊆ A \ H_n, P(H_n ∪ G) ≤ a} and G_n ⊆ A \ H_n such that P(H_n ∪ G_n) ≤ a and P(G_n) ≥ (1 − ε_n)x_n. Consider B = ∪_k G_k.
As you show next, the collection of all measures on a given space is a convex cone.
Exercise 1.1.9. Given any measures μ_n, n ≥ 1 on (Ω, F), verify that μ = Σ_{n=1}^∞ c_n μ_n is also a measure on this space, for any finite constants c_n ≥ 0.
Here are a few properties of probability measures for which the conclusions of Exercise 1.1.4 are useful.
Exercise 1.1.10. A function d : A × A → [0, ∞) is called a semi-metric on the set A if d(x, x) = 0, d(x, y) = d(y, x) and the triangle inequality d(x, z) ≤ d(x, y) + d(y, z) holds. With A △ B = (A ∩ B^c) ∪ (A^c ∩ B) denoting the symmetric difference of subsets A and B of Ω, show that for any probability space (Ω, F, P), the function d(A, B) = P(A △ B) is a semi-metric on F.
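For a finite probability space the semi-metric claim of Exercise 1.1.10 can be sanity-checked by brute force. A minimal sketch (not a proof, of course; the names below are illustrative), using Python's `^` operator for the symmetric difference of sets:

```python
import random
from itertools import permutations

random.seed(0)
omega = range(6)
weights = [random.random() for _ in omega]
p = {x: weights[x] / sum(weights) for x in omega}  # a random probability measure

def prob(a):
    return sum(p[x] for x in a)

def d(a, b):
    return prob(a ^ b)  # P(A triangle B); ^ is symmetric difference for sets

# check the semi-metric axioms on a handful of random events
events = [{x for x in omega if random.random() < 0.5} for _ in range(8)]
symmetric_ok = all(d(a, a) == 0 and abs(d(a, b) - d(b, a)) < 1e-12
                   for a in events for b in events)
triangle_ok = all(d(a, c) <= d(a, b) + d(b, c) + 1e-12
                  for a, b, c in permutations(events, 3))
```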
Exercise 1.1.11. Consider events A_n in a probability space (Ω, F, P) that are almost disjoint in the sense that P(A_n ∩ A_m) = 0 for all n ≠ m. Show that then P(∪_{n=1}^∞ A_n) = Σ_{n=1}^∞ P(A_n).
Exercise 1.1.12. Suppose a random outcome N follows the Poisson probability measure of parameter λ > 0. Find a simple expression for the probability that N is an even integer.
1.1.2. Generated and Borel σ-algebras. Enumerating the sets in the σ-algebra F is not a realistic option for uncountable Ω. Instead, as we see next, the most common construction of σ-algebras is then by implicit means. That is, we demand that certain sets (called the generators) be in our σ-algebra, and take the smallest possible collection for which this holds.
Exercise 1.1.13.
(a) Check that the intersection of (possibly uncountably many) σ-algebras is also a σ-algebra.
(b) Verify that for any σ-algebras H ⊆ G and any H ∈ H, the collection H_H = {A ∈ G : A ∩ H ∈ H} is a σ-algebra.
(c) Show that H ↦ H_H is non-increasing with respect to set inclusions, with H_Ω = H and H_∅ = G. Deduce that H_{H∪H′} = H_H ∩ H_{H′} for any pair H, H′ ∈ H.
In view of part (a) of this exercise we have the following definition.
Definition 1.1.14. Given a collection of subsets A_α ⊆ Ω (not necessarily countable), we denote the smallest σ-algebra F such that A_α ∈ F for all α either by σ({A_α}) or by σ(A_α, α ∈ Γ), and call σ({A_α}) the σ-algebra generated by the sets A_α. That is,
σ({A_α}) = ∩ { G : G ⊆ 2^Ω is a σ-algebra, A_α ∈ G for all α } .
Example 1.1.15. Suppose Ω = S is a topological space (that is, S is equipped with a notion of open subsets, or topology). An example of a generated σ-algebra is the Borel σ-algebra on S, defined as σ({O ⊆ S open}) and denoted by B_S. Of special importance is B_R, which we also denote by B.
Different sets of generators may result in the same σ-algebra. For example, taking Ω = {1, 2, 3} it is easy to see that σ({1}) = σ({2, 3}) = {∅, {1}, {2, 3}, {1, 2, 3}}. A σ-algebra F is countably generated if there exists a countable collection of sets that generates it. Exercise 1.1.17 shows that B_R is countably generated, but as you show next, there exist non countably generated σ-algebras even on Ω = R.
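For a finite Ω such generated σ-algebras can be computed by brute force, since closure under complements and finite unions already gives closure under countable unions. A minimal sketch (our own helper, not part of the text), verifying that σ({1}) and σ({2, 3}) coincide on Ω = {1, 2, 3}:

```python
def generated_sigma_algebra(omega, generators):
    """Close the generators under complement and union; on a finite omega
    this fixed point is the generated sigma-algebra (countable = finite here)."""
    omega = frozenset(omega)
    sets = {frozenset(), omega} | {frozenset(g) for g in generators}
    changed = True
    while changed:
        changed = False
        for a in list(sets):
            if omega - a not in sets:       # close under complements
                sets.add(omega - a)
                changed = True
            for b in list(sets):
                if a | b not in sets:       # close under pairwise unions
                    sets.add(a | b)
                    changed = True
    return sets

# different generators, same sigma-algebra {emptyset, {1}, {2,3}, {1,2,3}}
s1 = generated_sigma_algebra({1, 2, 3}, [{1}])
s2 = generated_sigma_algebra({1, 2, 3}, [{2, 3}])
```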
Exercise 1.1.16. Let F consist of all A ⊆ Ω such that either A is a countable set or A^c is a countable set.
(a) Verify that F is a σ-algebra.
(b) Show that F is countably generated if and only if Ω is a countable set.
Recall that if a collection of sets A is a subset of a σ-algebra G, then also σ(A) ⊆ G. Consequently, to show that σ({A_α}) = σ({B_β}) for two different sets of generators {A_α} and {B_β}, we only need to show that A_α ∈ σ({B_β}) for each α and that B_β ∈ σ({A_α}) for each β. For instance, considering B_Q = σ({(a, b) : a < b ∈ Q}), we have by this approach that B_Q = σ({(a, b) : a < b ∈ R}), as soon as we show that any interval (a, b) is in B_Q. To see this fact, note that for any real a < b there are rational numbers q_n < r_n such that q_n ↓ a and r_n ↑ b, hence (a, b) = ∪_n (q_n, r_n) ∈ B_Q. Expanding on this, the next exercise provides useful alternative definitions of B.
Exercise 1.1.17. Verify the alternative definitions of the Borel σ-algebra B:
σ({(a, b) : a < b ∈ R}) = σ({[a, b] : a < b ∈ R}) = σ({(−∞, b] : b ∈ R})
= σ({(−∞, b] : b ∈ Q}) = σ({O ⊆ R open})
If A R is in B of Example 1.1.15, we say that A is a Borel set. In particular, all
open (closed) subsets of R are Borel sets, as are many other sets. However,
Proposition 1.1.18. There exists a subset of R that is not in B. That is, not all
subsets of R are Borel sets.
Proof. See [Wil91, A.1.1] or [Bil95, page 45].
Example 1.1.19. Another classical example of an uncountable Ω is relevant for studying the experiment with an infinite number of coin tosses, that is, Ω = Ω_1^N for Ω_1 = {H, T} (indeed, setting H = 1 and T = 0, each infinite sequence ω ∈ Ω is in correspondence with a unique real number x ∈ [0, 1], with ω being the binary expansion of x, see Exercise 1.2.13). The σ-algebra should at least allow us to consider any possible outcome of a finite number of coin tosses. The natural σ-algebra in this case is the minimal σ-algebra having this property, or put more formally F_c = σ(A_{θ,k}, θ ∈ Ω_1^k, k = 1, 2, . . .), where A_{θ,k} = {ω ∈ Ω : ω_i = θ_i, i = 1, . . . , k} for θ = (θ_1, . . . , θ_k).
The preceding example is a special case of the construction of a product of measurable spaces, which we detail now.
Example 1.1.20. The product of the measurable spaces (Ω_i, F_i), i = 1, . . . , n is the set Ω = Ω_1 × ··· × Ω_n with the σ-algebra generated by {A_1 × ··· × A_n : A_i ∈ F_i}, denoted by F_1 × ··· × F_n.
You are now to check that the Borel σ-algebra of R^d is the product of d copies of that of R. As we see later, this helps simplify the study of random vectors.
Exercise 1.1.21. Show that for any d < ∞,
B_{R^d} = B × ··· × B = σ({(a_1, b_1) × ··· × (a_d, b_d) : a_i < b_i ∈ R, i = 1, . . . , d})
(you need to prove both identities, with the middle term defined as in Example 1.1.20).
Exercise 1.1.22. Let F = σ(A_α, α ∈ Γ), where the collection of sets {A_α, α ∈ Γ} is uncountable (i.e., Γ is uncountable). Prove that for each B ∈ F there exists a countable sub-collection {A_{α_j}, j = 1, 2, . . .} of {A_α, α ∈ Γ}, such that B ∈ σ(A_{α_j}, j = 1, 2, . . .).
Often there is no explicit enumerative description of the σ-algebra generated by an infinite collection of subsets, but a notable exception is
Exercise 1.1.23. Show that the sets in G = σ({[a, b] : a, b ∈ Z}) are all possible unions of elements from the countable collection {{b}, (b, b + 1), b ∈ Z}, and deduce that B ≠ G.
Probability measures on the Borel σ-algebra of R are examples of regular measures, namely:
Exercise 1.1.24. Show that if P is a probability measure on (R, B) then for any A ∈ B and ε > 0, there exists an open set G containing A such that P(A) + ε > P(G).
Here is more information about B_{R^d}.
Exercise 1.1.25. Show that if μ is a finitely additive non-negative set function on (R^d, B_{R^d}) such that μ(R^d) = 1 and for any Borel set A,
μ(A) = sup{ μ(K) : K ⊆ A, K compact } ,
then μ must be a probability measure.
Hint: Argue by contradiction using the conclusion of Exercise 1.1.5. To this end, recall the finite intersection property (if compact K_i ⊆ R^d are such that ∩_{i=1}^n K_i are non-empty for finite n, then the countable intersection ∩_{i=1}^∞ K_i is also non-empty).
1.1.3. Lebesgue measure and Carathéodory's theorem. Perhaps the most important measure on (R, B) is the Lebesgue measure, λ. It is the unique measure that satisfies λ(F) = Σ_{k=1}^r (b_k − a_k) whenever F = ∪_{k=1}^r (a_k, b_k] for some r < ∞ and a_1 < b_1 < a_2 < b_2 < ··· < b_r. Since λ(R) = ∞, this is not a probability measure. However, when we restrict Ω to be the interval (0, 1] we get
Example 1.1.26. The uniform probability measure on (0, 1] is denoted U and defined as above, now with the added restrictions that 0 ≤ a_1 and b_r ≤ 1. Alternatively, U is the restriction of the measure λ to the sub-σ-algebra B_{(0,1]} of B.
Exercise 1.1.27. Show that ((0, 1], B_{(0,1]}, U) is a non-atomic probability space and deduce that (R, B, λ) is a non-atomic measure space.
Note that any countable union of sets of probability zero has probability zero, but this is not the case for an uncountable union. For example, U({x}) = 0 for every x ∈ (0, 1], but U((0, 1]) = 1.
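The interval-length formula defining λ (and its restriction U) is straightforward to mirror in code. A minimal sketch (the helper name is ours), with disjointness of the intervals assumed rather than checked:

```python
def u_measure(intervals):
    """U of a finite union of half-open intervals (a_k, b_k] in (0, 1];
    the intervals are assumed disjoint, as in the text's representation."""
    assert all(0 <= a < b <= 1 for a, b in intervals)
    return sum(b - a for a, b in intervals)

# the whole interval has measure 1, matching U being a probability measure
full = u_measure([(0.0, 1.0)])
# two disjoint quarters; single points, being degenerate intervals, carry zero length
quarters = u_measure([(0.0, 0.25), (0.5, 0.75)])
```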
As we have seen in Example 1.1.26 it is often impossible to explicitly specify the value of a measure on all sets of the σ-algebra F. Instead, we wish to specify its values on a much smaller and better behaved collection of generators A of F and use Carathéodory's theorem to guarantee the existence of a unique measure on F that coincides with our specified values. To this end, we require that A be an algebra, that is,
Definition 1.1.28. A collection A of subsets of Ω is an algebra (or a field) if
(a) Ω ∈ A,
(b) if A ∈ A then A^c ∈ A as well,
(c) if A, B ∈ A then also A ∪ B ∈ A.
Remark. In view of the closure of an algebra with respect to complements, we could have replaced the requirement that Ω ∈ A with the (more standard) requirement that ∅ ∈ A. As part (c) of Definition 1.1.28 amounts to closure of an algebra under finite unions (and by DeMorgan's law also finite intersections), the difference between an algebra and a σ-algebra is that a σ-algebra is also closed under countable unions.
We sometimes make use of the fact that, unlike generated σ-algebras, the algebra generated by a collection of sets A can be explicitly presented.
Exercise 1.1.29. The algebra generated by a given collection of subsets A, denoted f(A), is the intersection of all algebras of subsets of Ω containing A.
(a) Verify that f(A) is indeed an algebra and that f(A) is minimal in the sense that if G is an algebra and A ⊆ G, then f(A) ⊆ G.
(b) Show that f(A) is the collection of all finite disjoint unions of sets of the form ∩_{j=1}^{n_i} A_{ij}, where for each i and j either A_{ij} or A_{ij}^c are in A.
We next state Carathéodory's extension theorem, a key result from measure theory, and demonstrate how it applies in the context of Example 1.1.26.
Theorem 1.1.30 (Carathéodory's extension theorem). If μ_0 : A → [0, ∞] is a countably additive set function on an algebra A then there exists a measure μ on (Ω, σ(A)) such that μ = μ_0 on A. Furthermore, if μ_0(Ω) < ∞ then such a measure μ is unique.
To construct the measure U on B_{(0,1]} let Ω = (0, 1] and
A = { (a_1, b_1] ∪ ··· ∪ (a_r, b_r] : 0 ≤ a_1 < b_1 < ··· < a_r < b_r ≤ 1, r < ∞ }
be a collection of subsets of (0, 1]. It is not hard to verify that A is an algebra, and further that σ(A) = B_{(0,1]} (c.f. Exercise 1.1.17, for a similar issue, just with (0, 1] replaced by R). With U_0 denoting the non-negative set function on A such that
(1.1.1) U_0( ∪_{k=1}^r (a_k, b_k] ) = Σ_{k=1}^r (b_k − a_k) ,
note that U_0((0, 1]) = 1, hence the existence of a unique probability measure U on ((0, 1], B_{(0,1]}) such that U(A) = U_0(A) for sets A ∈ A follows by Carathéodory's extension theorem, as soon as we verify that
Lemma 1.1.31. The set function U_0 is countably additive on A. That is, if {A_k} is a sequence of disjoint sets in A such that ∪_k A_k = A ∈ A, then U_0(A) = Σ_k U_0(A_k).
The proof of Lemma 1.1.31 is based on
Exercise 1.1.32. Show that U_0 is finitely additive on A. That is, U_0(∪_{k=1}^n A_k) = Σ_{k=1}^n U_0(A_k) for any finite collection of disjoint sets A_1, . . . , A_n ∈ A.
Proof. Let G_n = ∪_{k=1}^n A_k and H_n = A \ G_n. Then, H_n ↓ ∅ and since A_k, A ∈ A, which is an algebra, it follows that G_n and hence H_n are also in A. By definition, U_0 is finitely additive on A, so
U_0(A) = U_0(H_n) + U_0(G_n) = U_0(H_n) + Σ_{k=1}^n U_0(A_k) .
To prove that U_0 is countably additive, it suffices to show that U_0(H_n) → 0, for then
U_0(A) = lim_{n→∞} U_0(G_n) = lim_{n→∞} Σ_{k=1}^n U_0(A_k) = Σ_{k=1}^∞ U_0(A_k) .
To complete the proof, we argue by contradiction, assuming that U_0(H_n) ≥ 2ε for some ε > 0 and all n, where H_n ↓ ∅ are elements of A. By the definition of A and U_0, we can find for each ℓ a set J_ℓ ∈ A whose closure J̄_ℓ is a subset of H_ℓ and U_0(H_ℓ \ J_ℓ) ≤ ε2^{−ℓ} (for example, add to each a_k in the representation of H_ℓ the minimum of ε2^{−ℓ}/r and (b_k − a_k)/2). With U_0 finitely additive on the algebra A this implies that for each n,
U_0( ∪_{ℓ=1}^n (H_ℓ \ J_ℓ) ) ≤ Σ_{ℓ=1}^n U_0(H_ℓ \ J_ℓ) ≤ ε .
As H_n ⊆ H_ℓ for all ℓ ≤ n, we have that
H_n \ ∩_{ℓ≤n} J_ℓ = ∪_{ℓ≤n} (H_n \ J_ℓ) ⊆ ∪_{ℓ≤n} (H_ℓ \ J_ℓ) .
Hence, by finite additivity of U_0 and our assumption that U_0(H_n) ≥ 2ε, also
U_0( ∩_{ℓ≤n} J_ℓ ) = U_0(H_n) − U_0( H_n \ ∩_{ℓ≤n} J_ℓ ) ≥ U_0(H_n) − U_0( ∪_{ℓ≤n} (H_ℓ \ J_ℓ) ) ≥ ε .
In particular, for every n, the set ∩_{ℓ≤n} J_ℓ is non-empty and therefore so are the decreasing sets K_n = ∩_{ℓ≤n} J̄_ℓ. Since K_n are compact sets (by the Heine-Borel theorem), the set ∩_ℓ J̄_ℓ is then non-empty as well, and since J̄_ℓ is a subset of H_ℓ for all ℓ, we arrive at ∩_ℓ H_ℓ non-empty, contradicting our assumption that H_n ↓ ∅.
Remark. The proof of Lemma 1.1.31 is generic (for finite measures). Namely, any non-negative finitely additive set function μ_0 on an algebra A is countably additive if μ_0(H_n) → 0 whenever H_n ∈ A and H_n ↓ ∅. Further, as this proof shows, when Ω is a topological space it suffices for countable additivity of μ_0 to have for any H ∈ A a sequence J_k ∈ A such that the closures J̄_k ⊆ H are compact and μ_0(H \ J_k) → 0 as k → ∞.
Exercise 1.1.33. Show the necessity of the assumption that A be an algebra in Carathéodory's extension theorem, by giving an example of two probability measures μ ≠ ν on a measurable space (Ω, F) such that μ(A) = ν(A) for all A ∈ A and F = σ(A).
Hint: This can be done with Ω = {1, 2, 3, 4} and F = 2^Ω.
It is often useful to assume that the probability space we have is complete, in the
sense we make precise now.
Definition 1.1.34. We say that a measure space (Ω, F, μ) is complete if any subset N of any B ∈ F with μ(B) = 0 is also in F. If further μ = P is a probability measure, we say that the probability space (Ω, F, P) is a complete probability space.
Our next theorem states that any measure space can be completed by adding to its σ-algebra all subsets of sets of zero measure (a procedure that depends on the measure in use).
Theorem 1.1.35. Given a measure space (Ω, F, μ), let N = {N : N ⊆ A for some A ∈ F with μ(A) = 0} denote the collection of μ-null sets. Then, there exists a complete measure space (Ω, F̄, μ̄), called the completion of the measure space (Ω, F, μ), such that F̄ = {F ∪ N : F ∈ F, N ∈ N} and μ̄ = μ on F.
Proof. This is beyond our scope, but see the detailed proof in [Dur10, Theorem A.2.3]. In particular, F̄ = σ(F, N) and μ̄(A ∪ N) = μ(A) for any N ∈ N and A ∈ F (c.f. [Bil95, Problems 3.10 and 10.5]).
The following collections of sets play an important role in proving the easy part of Carathéodory's theorem, the uniqueness of the extension μ.
Definition 1.1.36. A π-system is a collection P of sets closed under finite intersections (i.e. if I ∈ P and J ∈ P then I ∩ J ∈ P).
A λ-system is a collection L of sets containing Ω and B \ A for any A ⊆ B, A, B ∈ L, which is also closed under monotone increasing limits (i.e. if A_i ∈ L and A_i ↑ A, then A ∈ L as well).
Obviously, an algebra is a π-system. Though an algebra may not be a λ-system,
Proposition 1.1.37. A collection F of sets is a σ-algebra if and only if it is both a π-system and a λ-system.
Proof. The fact that a σ-algebra is both a π-system and a λ-system is a trivial consequence of Definition 1.1.1. To prove the converse direction, suppose that F is both a π-system and a λ-system. Then Ω is in the λ-system F and so is A^c = Ω \ A for any A ∈ F. Further, with F also a π-system we have that
A ∪ B = (A^c ∩ B^c)^c ∈ F ,
for any A, B ∈ F. Consequently, if A_i ∈ F then so are also G_n = A_1 ∪ ··· ∪ A_n ∈ F. Since F is a λ-system and G_n ↑ ∪_i A_i, it follows that ∪_i A_i ∈ F as well, completing the verification that F is a σ-algebra.
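On a finite Ω, where monotone limits are eventually constant, Proposition 1.1.37 can be illustrated by brute force. The following sketch (our own helpers, not the text's) exhibits a collection that is both a π-system and a λ-system, and hence a σ-algebra, alongside a λ-system that is neither:

```python
def is_pi_system(c):
    # closed under pairwise (hence finite) intersections
    return all(a & b in c for a in c for b in c)

def is_lambda_system(omega, c):
    # contains Omega, and B \ A whenever A, B in c with A a subset of B;
    # on a finite omega the increasing-limit condition holds automatically
    return omega in c and all(b - a in c for a in c for b in c if a <= b)

def is_sigma_algebra(omega, c):
    return (omega in c
            and all(omega - a in c for a in c)
            and all(a | b in c for a in c for b in c))

omega = frozenset({1, 2, 3, 4})

# both a pi-system and a lambda-system, hence a sigma-algebra
f = {frozenset(), frozenset({1, 2}), frozenset({3, 4}), omega}

# a lambda-system that is not a pi-system (and not a sigma-algebra):
# {1,2} and {2,3} intersect in {2}, which is missing from the collection
g = {frozenset(), frozenset({1, 2}), frozenset({2, 3}),
     frozenset({3, 4}), frozenset({1, 4}), omega}
```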
The main tool in proving the uniqueness of the extension is Dynkin's π-λ theorem, stated next.
Theorem 1.1.38 (Dynkin's π-λ theorem). If P ⊆ L with P a π-system and L a λ-system then σ(P) ⊆ L.
Proof. A short though dense exercise in set manipulations shows that the smallest λ-system containing P is a π-system (for details see [Wil91, Section A.1.3] or the proof of [Bil95, Theorem 3.2]). By Proposition 1.1.37 it is a σ-algebra, hence contains σ(P). Further, it is contained in the λ-system L, as L also contains P.
Remark. Proposition 1.1.37 remains valid even if in the definition of λ-system we relax the condition that B \ A ∈ L for any A ⊆ B, A, B ∈ L, to the condition A^c ∈ L whenever A ∈ L. However, Dynkin's π-λ theorem does not hold under the latter definition.
As we show next, the uniqueness part of Carathéodory's theorem is an immediate consequence of Dynkin's π-λ theorem.
Proposition 1.1.39. If two measures μ_1 and μ_2 on (Ω, σ(P)) agree on the π-system P and are such that μ_1(Ω) = μ_2(Ω) < ∞, then μ_1 = μ_2.
Proof. Let L = {A ∈ σ(P) : μ_1(A) = μ_2(A)}. Our assumptions imply that P ⊆ L and that Ω ∈ L. Further, σ(P) is a π-system (by Proposition 1.1.37), and if A ⊆ B, A, B ∈ L, then by additivity of the finite measures μ_1 and μ_2,
μ_1(B \ A) = μ_1(B) − μ_1(A) = μ_2(B) − μ_2(A) = μ_2(B \ A) ,
that is, B \ A ∈ L. Similarly, if A_n ↑ A and A_n ∈ L, then by the continuity from below of μ_1 and μ_2 (see the remark following Exercise 1.1.4),
μ_1(A) = lim_{n→∞} μ_1(A_n) = lim_{n→∞} μ_2(A_n) = μ_2(A) ,
so that A ∈ L. We conclude that L is a λ-system, hence by Dynkin's π-λ theorem, σ(P) ⊆ L, that is, μ_1 = μ_2.
Remark. With a somewhat more involved proof one can relax the condition μ_1(Ω) = μ_2(Ω) < ∞ to the existence of A_n ∈ P such that A_n ↑ Ω and μ_1(A_n) < ∞ (c.f. [Bil95, Theorem 10.3] for details). Accordingly, in Carathéodory's extension theorem we can relax μ_0(Ω) < ∞ to the assumption that μ_0 is a σ-finite measure, that is, μ_0(A_n) < ∞ for some A_n ∈ A such that A_n ↑ Ω, as is the case with Lebesgue's measure λ on R.
We conclude this subsection with an outline of the proof of Carathéodory's extension theorem, noting that since an algebra A is a π-system and Ω ∈ A, the uniqueness of the extension to σ(A) follows from Proposition 1.1.39. Our outline of the existence of an extension follows [Wil91, Section A.1.8] (or see [Bil95, Theorem 11.3] for the proof of a somewhat stronger result). This outline centers on the construction of the appropriate outer measure, a relaxation of the concept of measure, which we now define.
Definition 1.1.40. An increasing, countably sub-additive, non-negative set function μ* on a measurable space (Ω, F) is called an outer measure. That is, μ* : F → [0, ∞], having the properties:
(a) μ*(∅) = 0 and μ*(A_1) ≤ μ*(A_2) for any A_1, A_2 ∈ F with A_1 ⊆ A_2.
(b) μ*(∪_n A_n) ≤ Σ_n μ*(A_n) for any countable collection of sets A_n ∈ F.
In the first step of the proof we define the increasing, non-negative set function
μ*(E) = inf{ Σ_{n=1}^∞ μ_0(A_n) : E ⊆ ∪_n A_n, A_n ∈ A } ,
for E ∈ F = 2^Ω, and prove that it is countably sub-additive, hence an outer measure on F.
By denition,

(A)
0
(A) for any A /. In the second step we prove that
if in addition A

n
A
n
for A
n
/, then the countable additivity of
0
on /
results with
0
(A)

0
(A
n
). Consequently,

=
0
on the algebra /.
The third step uses the countable additivity of
0
on / to show that for any A /
the outer measure

is additive when splitting subsets of by intersections with A


and A
c
. That is, we show that any element of / is a

-measurable set, as dened


next.
Definition 1.1.41. Let $\mu$ be a non-negative set function on a measurable space $(\Omega, \mathcal{F})$, with $\mu(\emptyset) = 0$. We say that $A \in \mathcal{F}$ is a $\mu$-measurable set if $\mu(F) = \mu(F \cap A) + \mu(F \cap A^c)$ for all $F \in \mathcal{F}$.
The fourth step consists of proving the following general lemma.
Lemma 1.1.42 (Caratheodory's lemma). Let $\mu^*$ be an outer measure on a measurable space $(\Omega, \mathcal{F})$. Then the $\mu^*$-measurable sets in $\mathcal{F}$ form a $\sigma$-algebra $\mathcal{G}$ on which $\mu^*$ is countably additive, so that $(\Omega, \mathcal{G}, \mu^*)$ is a measure space.
In the current setting, with $\mathcal{A}$ contained in the $\sigma$-algebra $\mathcal{G}$, it follows that $\sigma(\mathcal{A}) \subseteq \mathcal{G}$, on which $\mu^*$ is a measure. Thus, the restriction of $\mu^*$ to $\sigma(\mathcal{A})$ is the stated measure that coincides with $\mu_0$ on $\mathcal{A}$.
Remark. In the setting of Caratheodory's extension theorem for finite measures, we have that the $\sigma$-algebra $\mathcal{G}$ of all $\mu^*$-measurable sets is the completion of $\sigma(\mathcal{A})$ with respect to $\mu^*$ (c.f. [Bil95, Page 45]). In the context of Lebesgue's measure $U$ on $\mathcal{B}_{(0,1]}$, this is the $\sigma$-algebra $\overline{\mathcal{B}}_{(0,1]}$ of all Lebesgue measurable subsets of $(0, 1]$. Associated with it are the Lebesgue measurable functions $f : (0, 1] \to \mathbb{R}$ for which $f^{-1}(B) \in \overline{\mathcal{B}}_{(0,1]}$ for all $B \in \mathcal{B}$. However, as noted for example in [Dur10, Theorem A.2.4], the non-Borel set constructed in the proof of Proposition 1.1.18 is also non-Lebesgue measurable.
The following concept of a monotone class of sets is a considerable relaxation of that of a $\lambda$-system (hence also of a $\sigma$-algebra, see Proposition 1.1.37).
Definition 1.1.43. A monotone class is a collection $\mathcal{M}$ of sets closed under both monotone increasing and monotone decreasing limits (i.e. if $A_i \in \mathcal{M}$ and either $A_i \uparrow A$ or $A_i \downarrow A$, then $A \in \mathcal{M}$).
When starting from an algebra instead of a $\pi$-system, one may save effort by applying Halmos's monotone class theorem instead of Dynkin's theorem.
Theorem 1.1.44 (Halmos's monotone class theorem). If $\mathcal{A} \subseteq \mathcal{M}$ with $\mathcal{A}$ an algebra and $\mathcal{M}$ a monotone class, then $\sigma(\mathcal{A}) \subseteq \mathcal{M}$.
Proof. Clearly, any algebra which is a monotone class must be a $\sigma$-algebra. Another short though dense exercise in set manipulations shows that the intersection $m(\mathcal{A})$ of all monotone classes containing an algebra $\mathcal{A}$ is both an algebra and a monotone class (see the proof of [Bil95, Theorem 3.4]). Consequently, $m(\mathcal{A})$ is a $\sigma$-algebra. Since $\mathcal{A} \subseteq m(\mathcal{A})$ this implies that $\sigma(\mathcal{A}) \subseteq m(\mathcal{A})$ and we complete the proof upon noting that $m(\mathcal{A}) \subseteq \mathcal{M}$.
Exercise 1.1.45. We say that a subset $V$ of $\{1, 2, 3, \ldots\}$ has Cesaro density $\gamma(V)$, and write $V \in \mathrm{CES}$, if the limit
$$\gamma(V) = \lim_{n \to \infty} n^{-1} \, |V \cap \{1, 2, 3, \ldots, n\}|$$
exists. Give an example of sets $V_1 \in \mathrm{CES}$ and $V_2 \in \mathrm{CES}$ for which $V_1 \cap V_2 \notin \mathrm{CES}$. Thus, CES is not an algebra.
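Cesaro densities are easy to probe numerically. The sketch below uses one standard construction of such a pair (the specific sets are our choice, not the text's): $V_1$ is the even numbers, while $V_2$ contains the evens on each block $[4^k, 2\cdot 4^k)$ and the odds on $[2\cdot 4^k, 4^{k+1})$. Both have Cesaro density $1/2$, yet the running frequency of $V_1 \cap V_2$ oscillates between roughly $1/3$ and $1/6$.

```python
def in_V1(m):
    # V1 = even numbers; Cesaro density 1/2
    return m % 2 == 0

def in_V2(m):
    # V2 matches the parity of its block: evens on [4^k, 2*4^k),
    # odds on [2*4^k, 4^(k+1)); each block is half full, so gamma(V2) = 1/2
    k = 0
    while 4 ** (k + 1) <= m:
        k += 1
    return m % 2 == 0 if m < 2 * 4 ** k else m % 2 == 1

def running_density(pred, n):
    # n^{-1} |V cap {1, ..., n}|
    return sum(pred(m) for m in range(1, n + 1)) / n

inter = lambda m: in_V1(m) and in_V2(m)
hi = running_density(inter, 2 * 4 ** 6)  # close to 1/3
lo = running_density(inter, 4 ** 7)      # close to 1/6
```

Since the running density of $V_1 \cap V_2$ has distinct limit points along $n = 2\cdot 4^k$ and $n = 4^{k+1}$, the limit defining $\gamma(V_1 \cap V_2)$ fails to exist.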
Here is an alternative specification of the concept of an algebra.
Exercise 1.1.46.
(a) Suppose that $\Omega \in \mathcal{A}$ and that $A \cap B^c \in \mathcal{A}$ whenever $A, B \in \mathcal{A}$. Show that $\mathcal{A}$ is an algebra.
(b) Give an example of a collection $\mathcal{C}$ of subsets of $\Omega$ such that $\Omega \in \mathcal{C}$, if $A \in \mathcal{C}$ then $A^c \in \mathcal{C}$, and if $A, B \in \mathcal{C}$ are disjoint then also $A \cup B \in \mathcal{C}$, while $\mathcal{C}$ is not an algebra.
As we already saw, the $\sigma$-algebra structure is preserved under intersections. However, whereas the increasing union of algebras is an algebra, this is not necessarily the case for $\sigma$-algebras.
Exercise 1.1.47. Suppose that $\mathcal{A}_n$ are classes of sets such that $\mathcal{A}_n \subseteq \mathcal{A}_{n+1}$.
(a) Show that if $\mathcal{A}_n$ are algebras then so is $\bigcup_{n=1}^{\infty} \mathcal{A}_n$.
(b) Provide an example of $\sigma$-algebras $\mathcal{A}_n$ for which $\bigcup_{n=1}^{\infty} \mathcal{A}_n$ is not a $\sigma$-algebra.
1.2. Random variables and their distribution
Random variables are numerical functions $X(\omega)$ of the outcome $\omega$ of our random experiment. However, in order to have a successful mathematical theory, we limit our interest to the subset of measurable functions (or more generally, measurable mappings), as defined in Subsection 1.2.1, and study the closure properties of this collection in Subsection 1.2.2. Subsection 1.2.3 is devoted to the characterization of the collection of distribution functions induced by random variables.
1.2.1. Indicators, simple functions and random variables. We start with the definition of random variables, first in the general case, and then restricted to $\mathbb{R}$-valued variables.
Definition 1.2.1. A mapping $X : \Omega \to S$ between two measurable spaces $(\Omega, \mathcal{F})$ and $(S, \mathcal{S})$ is called an $(S, \mathcal{S})$-valued Random Variable (R.V.) if
$$X^{-1}(B) := \{\omega : X(\omega) \in B\} \in \mathcal{F} \qquad \forall B \in \mathcal{S}.$$
Such a mapping is also called a measurable mapping.
Definition 1.2.2. When we say that $X$ is a random variable, or a measurable function, we mean an $(\mathbb{R}, \mathcal{B})$-valued random variable, which is the most common type of R.V. we shall encounter. We let $m\mathcal{F}$ denote the collection of all $(\mathbb{R}, \mathcal{B})$-valued measurable mappings, so $X$ is a R.V. if and only if $X \in m\mathcal{F}$. If in addition $\Omega$ is a topological space and $\mathcal{F} = \sigma(\{O \subseteq \Omega\ \text{open}\})$ is the corresponding Borel $\sigma$-algebra, we say that $X : \Omega \to \mathbb{R}$ is a Borel (measurable) function. More generally, a random vector is an $(\mathbb{R}^d, \mathcal{B}_{\mathbb{R}^d})$-valued R.V. for some $d < \infty$.
The next exercise shows that a random vector is merely a finite collection of R.V. on the same probability space.
Exercise 1.2.3. Relying on Exercise 1.1.21 and Theorem 1.2.9, show that $X : \Omega \to \mathbb{R}^d$ is a random vector if and only if $X(\omega) = (X_1(\omega), \ldots, X_d(\omega))$ with each $X_i : \Omega \to \mathbb{R}$ a R.V.
Hint: Note that $X^{-1}(B_1 \times \cdots \times B_d) = \bigcap_{i=1}^{d} X_i^{-1}(B_i)$.
We now provide two important generic examples of random variables.
Example 1.2.4. For any $A \in \mathcal{F}$ the function $I_A(\omega) = \begin{cases} 1, & \omega \in A \\ 0, & \omega \notin A \end{cases}$ is a R.V. Indeed, $\{\omega : I_A(\omega) \in B\}$ is for any $B \subseteq \mathbb{R}$ one of the four sets $\emptyset$, $A$, $A^c$ or $\Omega$ (depending on whether $0 \in B$ or not and whether $1 \in B$ or not), all of whom are in $\mathcal{F}$. We call such a R.V. an indicator function.
Exercise 1.2.5. By the same reasoning check that $X(\omega) = \sum_{n=1}^{N} c_n I_{A_n}(\omega)$ is a R.V. for any finite $N$, non-random $c_n \in \mathbb{R}$ and sets $A_n \in \mathcal{F}$. We call any such $X$ a simple function, denoted by $X \in \mathrm{SF}$.
Our next proposition explains why simple functions are quite useful in probability theory.
Proposition 1.2.6. For every R.V. $X(\omega)$ there exists a sequence of simple functions $X_n(\omega)$ such that $X_n(\omega) \to X(\omega)$ as $n \to \infty$, for each fixed $\omega \in \Omega$.
Proof. Let
$$f_n(x) = n \, 1_{x > n} + \sum_{k=0}^{n2^n - 1} k 2^{-n} \, 1_{(k2^{-n}, (k+1)2^{-n}]}(x),$$
noting that for a R.V. $X \ge 0$, we have that $X_n = f_n(X)$ are simple functions. Since $X \ge X_{n+1} \ge X_n$ and $X(\omega) - X_n(\omega) \le 2^{-n}$ whenever $X(\omega) \le n$, it follows that $X_n(\omega) \uparrow X(\omega)$ as $n \to \infty$, for each $\omega$.
We write a general R.V. as $X(\omega) = X_+(\omega) - X_-(\omega)$ where $X_+(\omega) = \max(X(\omega), 0)$ and $X_-(\omega) = -\min(X(\omega), 0)$ are non-negative R.V.-s. By the above argument the simple functions $X_n = f_n(X_+) - f_n(X_-)$ have the convergence property we claimed.
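The dyadic approximation $f_n$ from this proof is concrete enough to compute. A minimal sketch (function names are ours), including the positive/negative-part decomposition for signed values:

```python
import math

def f_n(x, n):
    # f_n(x) = n*1{x > n} + sum_k k 2^-n * 1{(k 2^-n, (k+1) 2^-n]}(x), for x >= 0
    if x > n:
        return n
    k = math.ceil(x * 2 ** n) - 1  # x lies in (k 2^-n, (k+1) 2^-n]
    return max(k, 0) / 2 ** n

def approx(x, n):
    # general signed x via the parts X_+ = max(x, 0), X_- = -min(x, 0)
    return f_n(max(x, 0.0), n) - f_n(max(-x, 0.0), n)

vals = [approx(math.pi, n) for n in range(1, 13)]
```

As the proof asserts, `vals` increases with $n$ and, once $n \ge x$, the error is at most $2^{-n}$.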
Note that in case $\mathcal{F} = 2^{\Omega}$, every mapping $X : \Omega \to S$ is measurable (and therefore is an $(S, \mathcal{S})$-valued R.V.). The choice of the $\sigma$-algebra $\mathcal{F}$ is very important in determining the class of all $(S, \mathcal{S})$-valued R.V. For example, there are non-trivial $\sigma$-algebras $\mathcal{G}$ and $\mathcal{F}$ on $\Omega = \mathbb{R}$ such that $X(\omega) = \omega$ is a measurable function for $(\Omega, \mathcal{F})$, but is non-measurable for $(\Omega, \mathcal{G})$. Indeed, one such example is when $\mathcal{F}$ is the Borel $\sigma$-algebra $\mathcal{B}$ and $\mathcal{G} = \sigma([a, b] : a, b \in \mathbb{Z})$ (for example, the set $\{\omega : \omega \le \alpha\}$ is not in $\mathcal{G}$ whenever $\alpha \notin \mathbb{Z}$).
Building on Proposition 1.2.6 we have the following analog of Halmos's monotone class theorem. It allows us to deduce in the sequel general properties of (bounded) measurable functions upon verifying them only for indicators of elements of $\pi$-systems.
Theorem 1.2.7 (Monotone class theorem). Suppose $\mathcal{H}$ is a collection of $\mathbb{R}$-valued functions on $\Omega$ such that:
(a) The constant function $1$ is an element of $\mathcal{H}$.
(b) $\mathcal{H}$ is a vector space over $\mathbb{R}$. That is, if $h_1, h_2 \in \mathcal{H}$ and $c_1, c_2 \in \mathbb{R}$ then $c_1 h_1 + c_2 h_2$ is in $\mathcal{H}$.
(c) If $h_n \in \mathcal{H}$ are non-negative and $h_n \uparrow h$ where $h$ is a (bounded) real-valued function on $\Omega$, then $h \in \mathcal{H}$.
If $\mathcal{P}$ is a $\pi$-system and $I_A \in \mathcal{H}$ for all $A \in \mathcal{P}$, then $\mathcal{H}$ contains all (bounded) functions on $\Omega$ that are measurable with respect to $\sigma(\mathcal{P})$.
Remark. We stated here two versions of the monotone class theorem, with the less restrictive assumption that (c) holds only for bounded $h$ yielding the weaker conclusion about bounded elements of $m\sigma(\mathcal{P})$. In the sequel we use both versions, which as we see next, are derived by essentially the same proof. Adapting this proof you can also show that any collection $\mathcal{H}$ of non-negative functions on $\Omega$ satisfying the conditions of Theorem 1.2.7, apart from requiring (b) to hold only when $c_1 h_1 + c_2 h_2 \ge 0$, must contain all non-negative elements of $m\sigma(\mathcal{P})$.
Proof. Let $\mathcal{L} = \{A \subseteq \Omega : I_A \in \mathcal{H}\}$. From (a) we have that $\Omega \in \mathcal{L}$, while (b) implies that $B \setminus A$ is in $\mathcal{L}$ whenever $A \subseteq B$ are both in $\mathcal{L}$. Further, in view of (c) the collection $\mathcal{L}$ is closed under monotone increasing limits. Consequently, $\mathcal{L}$ is a $\lambda$-system, so by Dynkin's $\pi$-$\lambda$ theorem, our assumption that $\mathcal{L}$ contains $\mathcal{P}$ results with $\sigma(\mathcal{P}) \subseteq \mathcal{L}$. With $\mathcal{H}$ a vector space over $\mathbb{R}$, this in turn implies that $\mathcal{H}$ contains all simple functions with respect to the measurable space $(\Omega, \sigma(\mathcal{P}))$. In the proof of Proposition 1.2.6 we saw that any (bounded) measurable function is a difference of two (bounded) non-negative functions, each of which is a monotone increasing limit of certain non-negative simple functions. Thus, from (b) and (c) we conclude that $\mathcal{H}$ contains all (bounded) measurable functions with respect to $(\Omega, \sigma(\mathcal{P}))$.
The concept of almost sure prevails throughout probability theory.
Definition 1.2.8. We say that two $(S, \mathcal{S})$-valued R.V. $X$ and $Y$ defined on the same probability space $(\Omega, \mathcal{F}, \mathrm{P})$ are almost surely the same if $\mathrm{P}(\{\omega : X(\omega) \ne Y(\omega)\}) = 0$. This shall be denoted by $X \stackrel{a.s.}{=} Y$. More generally, the same notation applies to any property of R.V. For example, $X(\omega) \ge 0$ a.s. means that $\mathrm{P}(\{\omega : X(\omega) < 0\}) = 0$. Hereafter, we shall consider $X$ and $Y$ such that $X \stackrel{a.s.}{=} Y$ to be the same $S$-valued R.V., hence often omit the qualifier a.s. when stating properties of R.V. We also use the terms almost surely (a.s.), almost everywhere (a.e.), and with probability 1 (w.p.1) interchangeably.
Since the $\sigma$-algebra $\mathcal{S}$ might be huge, it is very important to note that we may verify that a given mapping is measurable without the need to check that the pre-image $X^{-1}(B)$ is in $\mathcal{F}$ for every $B \in \mathcal{S}$. Indeed, as shown next, it suffices to do this only for a collection (of our choice) of generators of $\mathcal{S}$.
Theorem 1.2.9. If $\mathcal{S} = \sigma(\mathcal{A})$ and $X : \Omega \to S$ is such that $X^{-1}(A) \in \mathcal{F}$ for all $A \in \mathcal{A}$, then $X$ is an $(S, \mathcal{S})$-valued R.V.
Proof. We first check that $\widetilde{\mathcal{S}} = \{B \in \mathcal{S} : X^{-1}(B) \in \mathcal{F}\}$ is a $\sigma$-algebra. Indeed:
a). $S \in \widetilde{\mathcal{S}}$ since $X^{-1}(S) = \Omega$.
b). If $A \in \widetilde{\mathcal{S}}$ then $X^{-1}(A) \in \mathcal{F}$. With $\mathcal{F}$ a $\sigma$-algebra, $X^{-1}(A^c) = \big(X^{-1}(A)\big)^c \in \mathcal{F}$. Consequently, $A^c \in \widetilde{\mathcal{S}}$.
c). If $A_n \in \widetilde{\mathcal{S}}$ for all $n$ then $X^{-1}(A_n) \in \mathcal{F}$ for all $n$. With $\mathcal{F}$ a $\sigma$-algebra, then also $X^{-1}(\bigcup_n A_n) = \bigcup_n X^{-1}(A_n) \in \mathcal{F}$. Consequently, $\bigcup_n A_n \in \widetilde{\mathcal{S}}$.
Our assumption that $\mathcal{A} \subseteq \widetilde{\mathcal{S}}$ then translates to $\mathcal{S} = \sigma(\mathcal{A}) \subseteq \widetilde{\mathcal{S}}$, as claimed.
The most important $\sigma$-algebras are those generated by ($(S, \mathcal{S})$-valued) random variables, as defined next.
Exercise 1.2.10. Adapting the proof of Theorem 1.2.9, show that for any mapping $X : \Omega \to S$ and any $\sigma$-algebra $\mathcal{S}$ of subsets of $S$, the collection $\{X^{-1}(B) : B \in \mathcal{S}\}$ is a $\sigma$-algebra. Verify that $X$ is an $(S, \mathcal{S})$-valued R.V. if and only if $\{X^{-1}(B) : B \in \mathcal{S}\} \subseteq \mathcal{F}$, in which case we denote $\{X^{-1}(B) : B \in \mathcal{S}\}$ either by $\sigma(X)$ or by $\mathcal{F}^X$ and call it the $\sigma$-algebra generated by $X$.
To practice your understanding of generated $\sigma$-algebras, solve the next exercise, providing a convenient collection of generators for $\sigma(X)$.
Exercise 1.2.11. If $X$ is an $(S, \mathcal{S})$-valued R.V. and $\mathcal{S} = \sigma(\mathcal{A})$ then $\sigma(X)$ is generated by the collection of sets $X^{-1}(\mathcal{A}) := \{X^{-1}(A) : A \in \mathcal{A}\}$.
An important example of the use of Exercise 1.2.11 corresponds to $(\mathbb{R}, \mathcal{B})$-valued random variables and $\mathcal{A} = \{(-\infty, x] : x \in \mathbb{R}\}$ (or even $\mathcal{A} = \{(-\infty, x] : x \in \mathbb{Q}\}$), which generates $\mathcal{B}$ (see Exercise 1.1.17), leading to the following alternative definition of the $\sigma$-algebra generated by such R.V. $X$.
Definition 1.2.12. Given a function $X : \Omega \to \mathbb{R}$ we denote by $\sigma(X)$ or by $\mathcal{F}^X$ the smallest $\sigma$-algebra $\mathcal{F}$ such that $X(\omega)$ is a measurable mapping from $(\Omega, \mathcal{F})$ to $(\mathbb{R}, \mathcal{B})$. Alternatively,
$$\sigma(X) = \sigma(\{\omega : X(\omega) \le \alpha\}, \alpha \in \mathbb{R}) = \sigma(\{\omega : X(\omega) \le q\}, q \in \mathbb{Q}).$$
More generally, given a random vector $X = (X_1, \ldots, X_n)$, that is, random variables $X_1, \ldots, X_n$ on the same probability space, let $\sigma(X_k, k \le n)$ (or $\mathcal{F}^X_n$) denote the smallest $\sigma$-algebra $\mathcal{F}$ such that $X_k(\omega)$, $k = 1, \ldots, n$, are measurable on $(\Omega, \mathcal{F})$. Alternatively,
$$\sigma(X_k, k \le n) = \sigma(\{\omega : X_k(\omega) \le \alpha\}, \alpha \in \mathbb{R}, k \le n).$$
Finally, given a possibly uncountable collection of functions $X_\gamma : \Omega \to \mathbb{R}$, indexed by $\gamma \in \Gamma$, we denote by $\sigma(X_\gamma, \gamma \in \Gamma)$ (or simply by $\mathcal{F}^X$) the smallest $\sigma$-algebra $\mathcal{F}$ such that $X_\gamma(\omega)$, $\gamma \in \Gamma$, are measurable on $(\Omega, \mathcal{F})$.
The concept of $\sigma$-algebra is needed in order to produce a rigorous mathematical theory. It further has the crucial role of quantifying the amount of information we have. For example, $\sigma(X)$ contains exactly those events $A$ for which we can say whether $\omega \in A$ or not, based on the value of $X(\omega)$. Interpreting Example 1.1.19 as corresponding to sequentially tossing coins, the R.V. $X_n(\omega) = \omega_n$ gives the result of the $n$-th coin toss in our experiment of infinitely many such tosses. The $\sigma$-algebra $\mathcal{F}_n = 2^{\Omega_n}$ of Example 1.1.6 then contains exactly the information we have upon observing the outcome of the first $n$ coin tosses, whereas the larger $\sigma$-algebra $\mathcal{F}_c$ allows us to also study the limiting properties of this sequence (and as you show next, $\mathcal{F}_c$ is isomorphic, in the sense of Definition 1.4.24, to $\mathcal{B}_{[0,1]}$).
Exercise 1.2.13. Let $\mathcal{F}_c$ denote the cylindrical $\sigma$-algebra for the set $\Omega = \{0, 1\}^{\mathbb{N}}$ of infinite binary sequences, as in Example 1.1.19.
(a) Show that $X(\omega) = \sum_{n=1}^{\infty} \omega_n 2^{-n}$ is a measurable map from $(\Omega, \mathcal{F}_c)$ to $([0, 1], \mathcal{B}_{[0,1]})$.
(b) Conversely, let $Y(x) = (\omega_1, \ldots, \omega_n, \ldots)$ where for each $n \ge 1$, $\omega_n(1) = 1$ while $\omega_n(x) = I(\lfloor 2^n x \rfloor\ \text{is an odd number})$ when $x \in [0, 1)$. Show that $Y = X^{-1}$ is a measurable map from $([0, 1], \mathcal{B}_{[0,1]})$ to $(\Omega, \mathcal{F}_c)$.
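The pair of maps in Exercise 1.2.13 is the binary-expansion correspondence, which can be sanity-checked numerically. A small sketch (with our floor-based reading of the parity test, and truncation of $Y$ to its first $N$ coordinates):

```python
def X(bits):
    # X(omega) = sum_n omega_n 2^{-n}, for a finite list of 0/1 coordinates
    return sum(b * 2 ** -(n + 1) for n, b in enumerate(bits))

def Y(x, N):
    # first N coordinates of Y(x): omega_n = 1 iff floor(2^n x) is odd, x in [0, 1)
    return [int(x * 2 ** n) % 2 for n in range(1, N + 1)]

bits = Y(0.3, 30)  # leading binary digits of 0.3
```

Composing the two maps recovers $x$ up to the truncation error $2^{-N}$, the numerical shadow of $Y = X^{-1}$.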
Here are some alternatives for Definition 1.2.12.
Exercise 1.2.14. Verify the following relations and show that each generating collection of sets on the right hand side is a $\pi$-system.
(a) $\sigma(X) = \sigma(\{\omega : X(\omega) \le \alpha\}, \alpha \in \mathbb{R})$
(b) $\sigma(X_k, k \le n) = \sigma(\{\omega : X_k(\omega) \le \alpha_k, 1 \le k \le n\}, \alpha_1, \ldots, \alpha_n \in \mathbb{R})$
(c) $\sigma(X_1, X_2, \ldots) = \sigma(\{\omega : X_k(\omega) \le \alpha_k, 1 \le k \le m\}, \alpha_1, \ldots, \alpha_m \in \mathbb{R}, m \in \mathbb{N})$
(d) $\sigma(X_1, X_2, \ldots) = \sigma\big(\bigcup_n \sigma(X_k, k \le n)\big)$
As you next show, when approximating a random variable by a simple function, one may also specify the latter to be based on sets in any generating algebra.
Exercise 1.2.15. Suppose $(\Omega, \mathcal{F}, \mathrm{P})$ is a probability space, with $\mathcal{F} = \sigma(\mathcal{A})$ for an algebra $\mathcal{A}$.
(a) Show that $\inf\{\mathrm{P}(A \triangle B) : A \in \mathcal{A}\} = 0$ for any $B \in \mathcal{F}$ (recall that $A \triangle B = (A \cap B^c) \cup (A^c \cap B)$).
(b) Show that for any bounded random variable $X$ and $\epsilon > 0$ there exists a simple function $Y = \sum_{n=1}^{N} c_n I_{A_n}$ with $A_n \in \mathcal{A}$ such that $\mathrm{P}(|X - Y| > \epsilon) < \epsilon$.
Exercise 1.2.16. Let $\mathcal{F} = \sigma(A_\gamma, \gamma \in \Gamma)$ and suppose there exist $\omega_1 \ne \omega_2$ such that for any $\gamma \in \Gamma$, either $\omega_1, \omega_2 \in A_\gamma$ or $\omega_1, \omega_2 \in A_\gamma^c$.
(a) Show that if the mapping $X$ is measurable on $(\Omega, \mathcal{F})$ then $X(\omega_1) = X(\omega_2)$.
(b) Provide an explicit $\sigma$-algebra $\mathcal{F}$ of subsets of $\Omega = \{1, 2, 3\}$ and a mapping $X : \Omega \to \mathbb{R}$ which is not a random variable on $(\Omega, \mathcal{F})$.
We conclude with a glimpse of the canonical measurable space associated with a stochastic process $\{X_t, t \in \mathbb{T}\}$ (for more on this, see Lemma 7.1.7).
Exercise 1.2.17. Fixing a possibly uncountable collection of random variables $X_t$, indexed by $t \in \mathbb{T}$, let $\mathcal{F}^X_C = \sigma(X_t, t \in C)$ for each $C \subseteq \mathbb{T}$. Show that
$$\mathcal{F}^X_{\mathbb{T}} = \bigcup_{C\ \text{countable}} \mathcal{F}^X_C$$
and that any R.V. $Z$ on $(\Omega, \mathcal{F}^X_{\mathbb{T}})$ is measurable on $\mathcal{F}^X_C$ for some countable $C \subseteq \mathbb{T}$.
1.2.2. Closure properties of random variables. For the typical measurable space with uncountable $\Omega$ it is impractical to list all possible R.V. Instead, we state a few useful closure properties that often help us in showing that a given mapping $X(\omega)$ is indeed a R.V.
We start with closure with respect to the composition of a R.V. and a measurable mapping.
Proposition 1.2.18. If $X : \Omega \to S$ is an $(S, \mathcal{S})$-valued R.V. and $f$ is a measurable mapping from $(S, \mathcal{S})$ to $(T, \mathcal{T})$, then the composition $f(X) : \Omega \to T$ is a $(T, \mathcal{T})$-valued R.V.
Proof. Considering an arbitrary $B \in \mathcal{T}$, we know that $f^{-1}(B) \in \mathcal{S}$ since $f$ is a measurable mapping. Thus, as $X$ is an $(S, \mathcal{S})$-valued R.V. it follows that
$$[f(X)]^{-1}(B) = X^{-1}(f^{-1}(B)) \in \mathcal{F}.$$
This holds for any $B \in \mathcal{T}$, thus concluding the proof.
In view of Exercise 1.2.3 we have the following special case of Proposition 1.2.18, corresponding to $S = \mathbb{R}^n$ and $T = \mathbb{R}$ equipped with the respective Borel $\sigma$-algebras.
Corollary 1.2.19. Let $X_i$, $i = 1, \ldots, n$ be R.V. on the same measurable space $(\Omega, \mathcal{F})$ and $f : \mathbb{R}^n \to \mathbb{R}$ a Borel function. Then, $f(X_1, \ldots, X_n)$ is also a R.V. on the same space.
To appreciate the power of Corollary 1.2.19, consider the following exercise, in which you show that every continuous function is also a Borel function.
Exercise 1.2.20. Suppose $(S, \rho)$ is a metric space (for example, $S = \mathbb{R}^n$). A function $g : S \to [-\infty, \infty]$ is called lower semi-continuous (l.s.c.) if $\liminf_{\rho(y, x) \to 0} g(y) \ge g(x)$, for all $x \in S$. A function $g$ is said to be upper semi-continuous (u.s.c.) if $-g$ is l.s.c.
(a) Show that if $g$ is l.s.c. then $\{x : g(x) \le b\}$ is closed for each $b \in \mathbb{R}$.
(b) Conclude that semi-continuous functions are Borel measurable.
(c) Conclude that continuous functions are Borel measurable.
A concrete application of Corollary 1.2.19 shows that any linear combination of finitely many R.V.-s is a R.V.
Example 1.2.21. Suppose $X_i$ are R.V.-s on the same measurable space and $c_i \in \mathbb{R}$. Then, $W_n(\omega) = \sum_{i=1}^{n} c_i X_i(\omega)$ are also R.V.-s. To see this, apply Corollary 1.2.19 for $f(x_1, \ldots, x_n) = \sum_{i=1}^{n} c_i x_i$, a continuous, hence Borel (measurable), function (by Exercise 1.2.20).
We turn to explore the closure properties of $m\mathcal{F}$ with respect to operations of a limiting nature, starting with the following key theorem.
Theorem 1.2.22. Let $\overline{\mathbb{R}} = [-\infty, \infty]$ equipped with its Borel $\sigma$-algebra
$$\mathcal{B}_{\overline{\mathbb{R}}} = \sigma([-\infty, b) : b \in \mathbb{R}).$$
If $X_i$ are $\overline{\mathbb{R}}$-valued R.V.-s on the same measurable space, then
$$\inf_{n} X_n, \quad \sup_{n} X_n, \quad \liminf_{n \to \infty} X_n, \quad \limsup_{n \to \infty} X_n$$
are also $\overline{\mathbb{R}}$-valued random variables.
Proof. Pick an arbitrary $b \in \mathbb{R}$. Then,
$$\{\omega : \inf_{n} X_n(\omega) < b\} = \bigcup_{n=1}^{\infty} \{\omega : X_n(\omega) < b\} = \bigcup_{n=1}^{\infty} X_n^{-1}([-\infty, b)) \in \mathcal{F}.$$
Since $\mathcal{B}_{\overline{\mathbb{R}}}$ is generated by $\{[-\infty, b) : b \in \mathbb{R}\}$, it follows by Theorem 1.2.9 that $\inf_n X_n$ is an $\overline{\mathbb{R}}$-valued R.V.
Observing that $\sup_n X_n = -\inf_n(-X_n)$, we deduce from the above and Corollary 1.2.19 (for $f(x) = -x$) that $\sup_n X_n$ is also an $\overline{\mathbb{R}}$-valued R.V.
Next, recall that
$$W = \liminf_{n \to \infty} X_n = \sup_{n} \Big( \inf_{l \ge n} X_l \Big).$$
By the preceding proof we have that $Y_n = \inf_{l \ge n} X_l$ are $\overline{\mathbb{R}}$-valued R.V.-s and hence so is $W = \sup_n Y_n$.
Similarly to the arguments already used, we conclude the proof either by observing that
$$Z = \limsup_{n \to \infty} X_n = \inf_{n} \Big( \sup_{l \ge n} X_l \Big),$$
or by observing that $\limsup_n X_n = -\liminf_n(-X_n)$.
Remark. Since $\inf_n X_n$, $\sup_n X_n$, $\limsup_n X_n$ and $\liminf_n X_n$ may result in the values $\pm\infty$ even when every $X_n$ is $\mathbb{R}$-valued, hereafter we let $m\mathcal{F}$ also denote the collection of $\overline{\mathbb{R}}$-valued R.V.
An important corollary of this theorem deals with the existence of limits of sequences of R.V.
Corollary 1.2.23. For any sequence $X_n \in m\mathcal{F}$, both
$$\Omega_0 = \{\omega : \liminf_{n \to \infty} X_n(\omega) = \limsup_{n \to \infty} X_n(\omega)\}$$
and
$$\Omega_1 = \{\omega : \liminf_{n \to \infty} X_n(\omega) = \limsup_{n \to \infty} X_n(\omega) \in \mathbb{R}\}$$
are measurable sets, that is, $\Omega_0 \in \mathcal{F}$ and $\Omega_1 \in \mathcal{F}$.
Proof. By Theorem 1.2.22 we have that $Z = \limsup_n X_n$ and $W = \liminf_n X_n$ are two $\overline{\mathbb{R}}$-valued variables on the same space, with $Z(\omega) \ge W(\omega)$ for all $\omega$. Hence, $\Omega_1 = \{\omega : Z(\omega) - W(\omega) = 0, Z(\omega) \in \mathbb{R}, W(\omega) \in \mathbb{R}\}$ is measurable (apply Corollary 1.2.19 for $f(z, w) = z - w$), as is $\Omega_0 = W^{-1}(\infty) \cup Z^{-1}(-\infty) \cup \Omega_1$.
The following structural result is yet another consequence of Theorem 1.2.22.
Corollary 1.2.24. For any $d < \infty$ and R.V.-s $Y_1, \ldots, Y_d$ on the same measurable space $(\Omega, \mathcal{F})$, the collection $\mathcal{H} = \{h(Y_1, \ldots, Y_d);\ h : \mathbb{R}^d \to \mathbb{R}\ \text{Borel function}\}$ is a vector space over $\mathbb{R}$ containing the constant functions, such that if $X_n \in \mathcal{H}$ are non-negative and $X_n \uparrow X$, an $\mathbb{R}$-valued function on $\Omega$, then $X \in \mathcal{H}$.
Proof. By Example 1.2.21 the collection of all Borel functions is a vector space over $\mathbb{R}$, which evidently contains the constant functions. Consequently, the same applies for $\mathcal{H}$. Next, suppose $X_n = h_n(Y_1, \ldots, Y_d)$ for Borel functions $h_n$ such that $0 \le X_n(\omega) \uparrow X(\omega)$ for all $\omega$. Then, $h(y) = \sup_n h_n(y)$ is by Theorem 1.2.22 an $\overline{\mathbb{R}}$-valued Borel function on $\mathbb{R}^d$, such that $X = h(Y_1, \ldots, Y_d)$. Setting $\widetilde{h}(y) = h(y)$ when $h(y) \in \mathbb{R}$ and $\widetilde{h}(y) = 0$ otherwise, it is easy to check that $\widetilde{h}$ is a real-valued Borel function. Moreover, with $X : \Omega \to \mathbb{R}$ (finite valued), necessarily $X = \widetilde{h}(Y_1, \ldots, Y_d)$ as well, so $X \in \mathcal{H}$.
The point-wise convergence of R.V., that is, $X_n(\omega) \to X(\omega)$ for every $\omega \in \Omega$, is often too strong of a requirement, as it may fail to hold as a result of the R.V. being ill-defined for a negligible set of values of $\omega$ (that is, a set of zero measure). We thus define the more useful, weaker notion of almost sure convergence of random variables.
Definition 1.2.25. We say that a sequence of random variables $X_n$ on the same probability space $(\Omega, \mathcal{F}, \mathrm{P})$ converges almost surely if $\mathrm{P}(\Omega_0) = 1$. We then set $X_\infty = \limsup_{n \to \infty} X_n$, and say that $X_n$ converges almost surely to $X_\infty$, or use the notation $X_n \stackrel{a.s.}{\to} X_\infty$.
Remark. Note that in Definition 1.2.25 we allow the limit $X_\infty(\omega)$ to take the values $\pm\infty$ with positive probability. So, we say that $X_n$ converges almost surely to a finite limit if $\mathrm{P}(\Omega_1) = 1$, or alternatively, if $X_\infty \in \mathbb{R}$ with probability one.
We proceed with an explicit characterization of the functions measurable with respect to a $\sigma$-algebra of the form $\sigma(Y_k, k \le n)$.
Theorem 1.2.26. Let $\mathcal{G} = \sigma(Y_k, k \le n)$ for some $n < \infty$ and R.V.-s $Y_1, \ldots, Y_n$ on the same measurable space $(\Omega, \mathcal{F})$. Then, $m\mathcal{G} = \{g(Y_1, \ldots, Y_n) : g : \mathbb{R}^n \to \mathbb{R}\ \text{is a Borel function}\}$.
Proof. From Corollary 1.2.19 we know that $Z = g(Y_1, \ldots, Y_n)$ is in $m\mathcal{G}$ for each Borel function $g : \mathbb{R}^n \to \mathbb{R}$. Turning to prove the converse result, recall part (b) of Exercise 1.2.14, whereby the $\sigma$-algebra $\mathcal{G}$ is generated by the $\pi$-system $\mathcal{P} = \{A_\alpha : \alpha = (\alpha_1, \ldots, \alpha_n) \in \mathbb{R}^n\}$, where $I_{A_\alpha} = h_\alpha(Y_1, \ldots, Y_n)$ for the Borel function $h_\alpha(y_1, \ldots, y_n) = \prod_{k=1}^{n} 1_{y_k \le \alpha_k}$. Thus, in view of Corollary 1.2.24, we have by the monotone class theorem that $\mathcal{H} = \{g(Y_1, \ldots, Y_n) : g : \mathbb{R}^n \to \mathbb{R}\ \text{is a Borel function}\}$ contains all elements of $m\mathcal{G}$.
We conclude this sub-section with a few exercises, starting with the Borel measurability of monotone functions (regardless of their continuity properties).
Exercise 1.2.27. Show that any monotone function $g : \mathbb{R} \to \mathbb{R}$ is Borel measurable.
Next, Exercise 1.2.20 implies that the set of points at which a given function $g$ is discontinuous is a Borel set.
Exercise 1.2.28. Fix an arbitrary function $g : S \to \mathbb{R}$.
(a) Show that for any $\delta > 0$ the function $g_-(x, \delta) = \inf\{g(y) : \rho(x, y) < \delta\}$ is u.s.c. and the function $g_+(x, \delta) = \sup\{g(y) : \rho(x, y) < \delta\}$ is l.s.c.
(b) Show that $D_g = \{x : \sup_k g_-(x, k^{-1}) < \inf_k g_+(x, k^{-1})\}$ is exactly the set of points at which $g$ is discontinuous.
(c) Deduce that the set $D_g$ of points of discontinuity of $g$ is a Borel set.
Here is an alternative characterization of $\mathcal{B}$ that complements Exercise 1.2.20.
Exercise 1.2.29. Show that if $\mathcal{F}$ is a $\sigma$-algebra of subsets of $\mathbb{R}$ then $\mathcal{B} \subseteq \mathcal{F}$ if and only if every continuous function $f : \mathbb{R} \to \mathbb{R}$ is in $m\mathcal{F}$ (i.e. $\mathcal{B}$ is the smallest $\sigma$-algebra on $\mathbb{R}$ with respect to which all continuous functions are measurable).
Exercise 1.2.30. Suppose $X_n$ and $X_\infty$ are real-valued random variables and
$$\mathrm{P}(\{\omega : \limsup_{n \to \infty} X_n(\omega) \le X_\infty(\omega)\}) = 1.$$
Show that for any $\epsilon > 0$, there exists an event $A$ with $\mathrm{P}(A) < \epsilon$ and a non-random $N = N(\epsilon)$, sufficiently large, such that $X_n(\omega) < X_\infty(\omega) + \epsilon$ for all $n \ge N$ and every $\omega \in A^c$.
Equipped with Theorem 1.2.22 you can also strengthen Proposition 1.2.6.
Exercise 1.2.31. Show that the class $m\mathcal{F}$ of $\overline{\mathbb{R}}$-valued measurable functions is the smallest class containing SF and closed under point-wise limits.
Finally, relying on Theorem 1.2.26 it is easy to show that a Borel function can only reduce the amount of information quantified by the corresponding generated $\sigma$-algebras, whereas such information content is invariant under invertible Borel transformations. That is:
Exercise 1.2.32. Show that $\sigma(g(Y_1, \ldots, Y_n)) \subseteq \sigma(Y_k, k \le n)$ for any Borel function $g : \mathbb{R}^n \to \mathbb{R}$. Further, if $Y_1, \ldots, Y_n$ and $Z_1, \ldots, Z_m$ defined on the same probability space are such that $Z_k = g_k(Y_1, \ldots, Y_n)$, $k = 1, \ldots, m$ and $Y_i = h_i(Z_1, \ldots, Z_m)$, $i = 1, \ldots, n$, for some Borel functions $g_k : \mathbb{R}^n \to \mathbb{R}$ and $h_i : \mathbb{R}^m \to \mathbb{R}$, then $\sigma(Y_1, \ldots, Y_n) = \sigma(Z_1, \ldots, Z_m)$.
1.2.3. Distribution, density and law. As defined next, every random variable $X$ induces a probability measure on its range which is called the law of $X$.
Definition 1.2.33. The law of a real-valued R.V. $X$, denoted $\mathcal{P}_X$, is the probability measure on $(\mathbb{R}, \mathcal{B})$ such that $\mathcal{P}_X(B) = \mathrm{P}(\{\omega : X(\omega) \in B\})$ for any Borel set $B$.
Remark. Since $X$ is a R.V., it follows that $\mathcal{P}_X(B)$ is well defined for all $B \in \mathcal{B}$. Further, the non-negativity of P implies that $\mathcal{P}_X$ is a non-negative set function on $(\mathbb{R}, \mathcal{B})$, and since $X^{-1}(\mathbb{R}) = \Omega$, also $\mathcal{P}_X(\mathbb{R}) = 1$. Consider next disjoint Borel sets $B_i$, observing that $X^{-1}(B_i) \in \mathcal{F}$ are disjoint subsets of $\Omega$ such that
$$X^{-1}\Big(\bigcup_i B_i\Big) = \bigcup_i X^{-1}(B_i).$$
Thus, by the countable additivity of P we have that
$$\mathcal{P}_X\Big(\bigcup_i B_i\Big) = \mathrm{P}\Big(\bigcup_i X^{-1}(B_i)\Big) = \sum_i \mathrm{P}(X^{-1}(B_i)) = \sum_i \mathcal{P}_X(B_i).$$
This shows that $\mathcal{P}_X$ is also countably additive, hence a probability measure, as claimed in Definition 1.2.33.
Note that the law $\mathcal{P}_X$ of a R.V. $X : \Omega \to \mathbb{R}$ determines the values of the probability measure P on $\sigma(X)$.
Definition 1.2.34. We write $X \stackrel{\mathcal{D}}{=} Y$ and say that $X$ equals $Y$ in law (or in distribution) if and only if $\mathcal{P}_X = \mathcal{P}_Y$.
A good way to practice your understanding of Definitions 1.2.33 and 1.2.34 is by verifying that if $X \stackrel{a.s.}{=} Y$, then also $X \stackrel{\mathcal{D}}{=} Y$ (that is, any two random variables we consider to be the same would indeed have the same law).
The next concept we define, the distribution function, is closely associated with the law $\mathcal{P}_X$ of the R.V.
Definition 1.2.35. The distribution function $F_X$ of a real-valued R.V. $X$ is
$$F_X(\alpha) = \mathrm{P}(\{\omega : X(\omega) \le \alpha\}) = \mathcal{P}_X((-\infty, \alpha]) \qquad \forall \alpha \in \mathbb{R}.$$
Our next result characterizes the set of all functions $F : \mathbb{R} \to [0, 1]$ that are distribution functions of some R.V.
Theorem 1.2.36. A function $F : \mathbb{R} \to [0, 1]$ is a distribution function of some R.V. if and only if
(a) $F$ is non-decreasing,
(b) $\lim_{x \to \infty} F(x) = 1$ and $\lim_{x \to -\infty} F(x) = 0$,
(c) $F$ is right-continuous, i.e. $\lim_{y \downarrow x} F(y) = F(x)$.
Proof. First, assuming that $F = F_X$ is a distribution function, we show that it must have the stated properties (a)-(c). Indeed, if $x \le y$ then $(-\infty, x] \subseteq (-\infty, y]$, and by the monotonicity of the probability measure $\mathcal{P}_X$ (see part (a) of Exercise 1.1.4), we have that $F_X(x) \le F_X(y)$, proving that $F_X$ is non-decreasing. Further, $(-\infty, x] \uparrow \mathbb{R}$ as $x \uparrow \infty$, while $(-\infty, x] \downarrow \emptyset$ as $x \downarrow -\infty$, resulting with property (b) of the theorem by the continuity from below and the continuity from above of the probability measure $\mathcal{P}_X$ on $\mathbb{R}$. Similarly, since $(-\infty, y] \downarrow (-\infty, x]$ as $y \downarrow x$, we get the right continuity of $F_X$ by yet another application of continuity from above of $\mathcal{P}_X$.
We proceed to prove the converse result, that is, assuming $F$ has the stated properties (a)-(c), we consider the random variable $X^-(\omega) = \sup\{y : F(y) < \omega\}$ on the probability space $((0, 1], \mathcal{B}_{(0,1]}, U)$ and show that $F_{X^-} = F$. With $F$ having property (b), we see that for any $\omega > 0$ the set $\{y : F(y) < \omega\}$ is non-empty, and further if $\omega < 1$ then $X^-(\omega) < \infty$, so $X^- : (0, 1) \to \mathbb{R}$ is well defined. The identity
$$\{\omega : X^-(\omega) \le x\} = \{\omega : F(x) \ge \omega\}, \tag{1.2.1}$$
implies that $F_{X^-}(x) = U((0, F(x)]) = F(x)$ for all $x \in \mathbb{R}$, and further, the sets $(0, F(x)]$ are all in $\mathcal{B}_{(0,1]}$, implying that $X^-$ is a measurable function (i.e. a R.V.). Turning to prove (1.2.1), note that if $F(x) \ge \omega$ then $x \notin \{y : F(y) < \omega\}$ and so by definition (and the monotonicity of $F$), $X^-(\omega) \le x$. Now suppose that $\omega > F(x)$. Since $F$ is right continuous, this implies that $F(x + \epsilon) < \omega$ for some $\epsilon > 0$, hence by definition of $X^-$ also $X^-(\omega) \ge x + \epsilon > x$, completing the proof of (1.2.1) and with it the proof of the theorem.
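The map $X^-(\omega) = \sup\{y : F(y) < \omega\}$ in this proof is exactly the inverse-transform sampler used in simulation. A minimal sketch for the exponential distribution $F(x) = 1 - e^{-x}$, where $X^-$ has the closed form $-\log(1 - \omega)$ (sample size and tolerance are our choices):

```python
import math
import random

def X_minus(omega):
    # sup{y : F(y) < omega} for F(y) = 1 - exp(-y), omega in [0, 1)
    return -math.log(1.0 - omega)

random.seed(1)
samples = [X_minus(random.random()) for _ in range(200_000)]

def ecdf(x):
    # empirical distribution function of the simulated sample
    return sum(s <= x for s in samples) / len(samples)
```

By the identity (1.2.1), `ecdf(x)` should approach $F(x)$; for instance `ecdf(1.0)` is close to $1 - e^{-1} \approx 0.632$.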
Check your understanding of the preceding proof by showing that the collection of distribution functions for $\overline{\mathbb{R}}$-valued random variables consists of all $F : \mathbb{R} \to [0, 1]$ that are non-decreasing and right-continuous.
Remark. The construction of the random variable $X^-(\omega)$ in Theorem 1.2.36 is called Skorokhod's representation. You can, and should, verify that the random variable $X^+(\omega) = \sup\{y : F(y) \le \omega\}$ would have worked equally well for that purpose, since $X^+(\omega) \ne X^-(\omega)$ only if $X^+(\omega) > q \ge X^-(\omega)$ for some rational $q$, in which case by definition $F(q) = \omega$, so there are at most countably many such values of $\omega$ (hence $\mathrm{P}(X^+ \ne X^-) = 0$). We shall return to this construction when dealing with convergence in distribution in Section 3.2. An alternative approach to Theorem 1.2.36 is to adapt the construction of the probability measure of Example 1.1.26, taking here $\Omega = \mathbb{R}$ with the corresponding change to $\mathcal{A}$ and replacing the right side of (1.1.1) with $\sum_{k=1}^{r} (F(b_k) - F(a_k))$, yielding a probability measure $\mathcal{P}$ on $(\mathbb{R}, \mathcal{B})$ such that $\mathcal{P}((-\infty, \alpha]) = F(\alpha)$ for all $\alpha \in \mathbb{R}$ (c.f. [Bil95, Theorem 12.4]).
Our next example highlights the possible shape of the distribution function.
Example 1.2.37. Consider Example 1.1.6 of $n$ coin tosses, with $\sigma$-algebra $\mathcal{F}_n = 2^{\Omega_n}$, sample space $\Omega_n = \{H, T\}^n$, and the probability measure $\mathrm{P}_n(A) = \sum_{\omega \in A} p_\omega$, where $p_\omega = 2^{-n}$ for each $\omega \in \Omega_n$ (that is, $\omega = (\omega_1, \omega_2, \ldots, \omega_n)$ for $\omega_i \in \{H, T\}$), corresponding to independent, fair, coin tosses. Let $Y(\omega) = I_{\{\omega_1 = H\}}$ measure the outcome of the first toss. The law of this random variable is
$$\mathcal{P}_Y(B) = \frac{1}{2} 1_{\{0 \in B\}} + \frac{1}{2} 1_{\{1 \in B\}}$$
and its distribution function is
$$F_Y(\alpha) = \mathcal{P}_Y((-\infty, \alpha]) = \mathrm{P}_n(Y(\omega) \le \alpha) = \begin{cases} 1, & \alpha \ge 1 \\ \frac{1}{2}, & 0 \le \alpha < 1 \\ 0, & \alpha < 0 \end{cases}. \tag{1.2.2}$$
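The step shape of (1.2.2) is easy to confirm by simulation: the empirical distribution function of $Y$ is flat away from the atoms $\{0, 1\}$, and its jump at each atom matches that atom's probability mass. A quick sketch (the sample size is our choice):

```python
import random

random.seed(2)
tosses = [random.choice("HT") for _ in range(100_000)]
Y = [1 if t == "H" else 0 for t in tosses]  # Y = I{omega_1 = H} per experiment

def F_Y_hat(a):
    # empirical distribution function of Y
    return sum(y <= a for y in Y) / len(Y)

vals = (F_Y_hat(-0.5), F_Y_hat(0.5), F_Y_hat(1.5))  # about (0, 1/2, 1)
```

The jump at $1$, namely `F_Y_hat(1.0) - F_Y_hat(0.99)`, estimates $\mathrm{P}(Y = 1) = 1/2$.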
Note that in general $\sigma(X)$ is a strict subset of the $\sigma$-algebra $\mathcal{F}$ (in Example 1.2.37 we have that $\sigma(Y)$ determines the probability measure for the first coin toss, but tells us nothing about the probability measure assigned to the remaining $n - 1$ tosses). Consequently, though the law $\mathcal{P}_X$ determines the probability measure P on $\sigma(X)$, it usually does not completely determine P.
Example 1.2.37 is somewhat generic. That is, if the R.V. $X$ is a simple function (or more generally, when the set $\{X(\omega) : \omega \in \Omega\}$ is countable and has no accumulation points), then its distribution function $F_X$ is piecewise constant with jumps at the possible values that $X$ takes and jump sizes that are the corresponding probabilities. Indeed, note that $(-\infty, y] \uparrow (-\infty, x)$ as $y \uparrow x$, so by the continuity from below of $\mathcal{P}_X$ it follows that
$$F_X(x^-) := \lim_{y \uparrow x} F_X(y) = \mathrm{P}(\{\omega : X(\omega) < x\}) = F_X(x) - \mathrm{P}(\{\omega : X(\omega) = x\}),$$
for any R.V. $X$.
A direct corollary of Theorem 1.2.36 shows that any distribution function has a collection of continuity points that is dense in $\mathbb{R}$.
Exercise 1.2.38. Show that a distribution function $F$ has at most countably many points of discontinuity and consequently, that for any $x \in \mathbb{R}$ there exist $y_k$ and $z_k$ at which $F$ is continuous such that $z_k \uparrow x$ and $y_k \downarrow x$.
In contrast with Example 1.2.37, the distribution function of a R.V. with a density is continuous and almost everywhere differentiable. That is:
Definition 1.2.39. We say that a R.V. $X(\omega)$ has a probability density function $f_X$ if and only if its distribution function $F_X$ can be expressed as
$$F_X(\alpha) = \int_{-\infty}^{\alpha} f_X(x)\,dx, \qquad \forall \alpha \in \mathbb{R}. \tag{1.2.3}$$
By Theorem 1.2.36 a probability density function $f_X$ must be an integrable, Lebesgue almost everywhere non-negative function, with $\int_{\mathbb{R}} f_X(x)\,dx = 1$. Such $F_X$ is continuous with $\frac{dF_X}{dx}(x) = f_X(x)$, except possibly on a set of values of $x$ of zero Lebesgue measure.
Remark. To make Definition 1.2.39 precise we temporarily assume that probability density functions $f_X$ are Riemann integrable and interpret the integral in (1.2.3) in this sense. In Section 1.3 we construct Lebesgue's integral and extend the scope of Definition 1.2.39 to Lebesgue integrable density functions $f_X \ge 0$ (in particular, accommodating Borel functions $f_X$). This is the setting we assume thereafter, with the right-hand-side of (1.2.3) interpreted as the integral of $f_X$ with respect to the restriction on $(-\infty, \alpha]$ of the completion of the Lebesgue measure on $\mathbb{R}$ (c.f. Definition 1.3.59 and Example 1.3.60). Further, the function $f_X$ is uniquely defined only as a representative of an equivalence class. That is, in this context we consider $f$ and $g$ to be the same function when the set $\{x : f(x) \ne g(x)\}$ has zero Lebesgue measure.
Building on Example 1.1.26 we next detail a few classical examples of R.V. that
have densities.
Example 1.2.40. The distribution function $F_U$ of the R.V. $U$ of Example 1.1.26 is
$$F_U(\alpha) = P(U \le \alpha) = P(U \in [0, \alpha]) = \begin{cases} 1, & \alpha > 1 \\ \alpha, & 0 \le \alpha \le 1 \\ 0, & \alpha < 0 \end{cases} \tag{1.2.4}$$
and its density is $f_U(u) = \begin{cases} 1, & 0 \le u \le 1 \\ 0, & \text{otherwise.} \end{cases}$

The exponential distribution function is
$$F(x) = \begin{cases} 0, & x \le 0 \\ 1 - e^{-x}, & x \ge 0, \end{cases}$$
corresponding to the density $f(x) = \begin{cases} 0, & x \le 0 \\ e^{-x}, & x > 0, \end{cases}$ whereas the standard normal distribution has the density
$$\varphi(x) = (2\pi)^{-1/2} e^{-x^2/2}\,,$$
with no closed form expression for the corresponding distribution function $\Phi(x) = \int_{-\infty}^{x} \varphi(u)\,du$ in terms of elementary functions.
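As a quick numerical sanity check of the relation $F(x) = \int_{-\infty}^{x} f(u)\,du$ in the exponential case above, one can compare the closed-form distribution function against a crude quadrature of its density. This is only a sketch: the trapezoid step count and the tolerance are arbitrary choices, not part of the text.

```python
import math

def exp_cdf(x):
    # Exponential distribution function: 0 for x <= 0, 1 - e^{-x} for x >= 0
    return 0.0 if x <= 0 else 1.0 - math.exp(-x)

def exp_pdf(x):
    # The corresponding density: e^{-x} on x > 0, zero elsewhere
    return 0.0 if x <= 0 else math.exp(-x)

def integrate(f, a, b, n=100_000):
    # Plain trapezoid rule, accurate enough here to confirm F(x) = int f
    h = (b - a) / n
    return h * (0.5 * (f(a) + f(b)) + sum(f(a + i * h) for i in range(1, n)))

print(abs(exp_cdf(2.0) - integrate(exp_pdf, 0.0, 2.0)))  # near zero
```

The same check applied on a long interval confirms that the density integrates to one, as Theorem 1.2.36 requires.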
1.2. RANDOM VARIABLES AND THEIR DISTRIBUTION 29
Every real-valued R.V. $X$ has a distribution function but not necessarily a density. For example, $X = 0$ w.p.1 has distribution function $F_X(\alpha) = 1_{\{\alpha \ge 0\}}$. Since $F_X$ is discontinuous at $0$, the R.V. $X$ does not have a density.
Definition 1.2.41. We say that a function $F$ is a Lebesgue singular function if it has a zero derivative except on a set of zero Lebesgue measure.

Since the distribution function of any R.V. is non-decreasing, from real analysis we know that it is almost everywhere differentiable. However, perhaps somewhat surprisingly, there are continuous distribution functions that are Lebesgue singular functions. Consequently, there are non-discrete random variables that do not have a density. We next provide one such example.
Example 1.2.42. The Cantor set $\mathcal{C}$ is defined by removing $(1/3, 2/3)$ from $[0, 1]$ and then iteratively removing the middle third of each interval that remains. The uniform distribution on the (closed) set $\mathcal{C}$ corresponds to the distribution function obtained by setting $F(x) = 0$ for $x \le 0$, $F(x) = 1$ for $x \ge 1$, $F(x) = 1/2$ for $x \in [1/3, 2/3]$, then $F(x) = 1/4$ for $x \in [1/9, 2/9]$, $F(x) = 3/4$ for $x \in [7/9, 8/9]$, and so on (which, as you should check, satisfies the properties (a)-(c) of Theorem 1.2.36). From the definition, we see that $dF/dx = 0$ for almost every $x \notin \mathcal{C}$ and that the corresponding probability measure has $P(\mathcal{C}^c) = 0$. As the Lebesgue measure of $\mathcal{C}$ is zero, we see that the derivative of $F$ is zero except on a set of zero Lebesgue measure, and consequently, there is no function $f$ for which $F(x) = \int_{-\infty}^{x} f(y)\,dy$ holds. Though it is somewhat more involved, you may want to check that $F$ is everywhere continuous (c.f. [Bil95, Problem 31.2]).
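The flat-levels description above translates directly into an algorithm: reading the ternary digits of $x$, the Cantor distribution function halves its scale at each digit and stops on the first digit equal to 1 (a removed middle third). A minimal sketch, assuming a fixed truncation depth:

```python
def cantor_F(x, depth=40):
    # Distribution function of the uniform distribution on the Cantor set
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    value, scale = 0.0, 0.5
    for _ in range(depth):
        x *= 3.0
        digit = int(x)
        x -= digit
        if digit == 1:
            # x lies in a removed middle third, where F is flat
            return value + scale
        value += scale * (digit // 2)  # ternary digit 2 contributes, digit 0 does not
        scale *= 0.5
    return value

print(cantor_F(0.5), cantor_F(1 / 9), cantor_F(0.8))  # roughly 0.5, 0.25, 0.75
```

The printed values match the levels listed in the example: $F = 1/2$ on $[1/3, 2/3]$, $F = 1/4$ on $[1/9, 2/9]$ and $F = 3/4$ on $[7/9, 8/9]$.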
Even discrete distribution functions can be quite complex. As the next example
shows, the points of discontinuity of such a function might form a (countable) dense
subset of R (which in a sense is extreme, per Exercise 1.2.38).
Example 1.2.43. Let $q_1, q_2, \ldots$ be an enumeration of the rational numbers and set
$$F(x) = \sum_{i=1}^{\infty} 2^{-i} 1_{[q_i, \infty)}(x)$$
(where $1_{[q_i, \infty)}(x) = 1$ if $x \ge q_i$ and zero otherwise). Clearly, such $F$ is non-decreasing, with limits $0$ and $1$ as $x \to -\infty$ and $x \to \infty$, respectively. It is not hard to check that $F$ is also right continuous, hence a distribution function, whereas by construction $F$ is discontinuous at each rational number.
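A finite truncation of this series is easy to explore numerically. The sketch below uses a hypothetical enumeration of the rationals in $(0, 1)$ by increasing denominator (any enumeration works); with only 20 terms one already sees a jump of size $2^{-1}$ at $q_1 = 1/2$ and a non-decreasing $F$.

```python
from fractions import Fraction

def rationals():
    # One possible enumeration q_1, q_2, ... of the rationals in (0, 1)
    seen, d = set(), 2
    while True:
        for n in range(1, d):
            q = Fraction(n, d)
            if q not in seen:
                seen.add(q)
                yield q
        d += 1

def F(x, terms=20):
    # Truncation of F(x) = sum_i 2^{-i} 1_{[q_i, oo)}(x)
    gen = rationals()
    return sum(2.0 ** -(i + 1) for i, q in zip(range(terms), gen) if x >= q)

# F jumps by 2^{-1} at q_1 = 1/2 and is non-decreasing:
print(F(0.5) - F(0.5 - 1e-6), F(0.2) <= F(0.7))
```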
As we have that $P(\{\omega : X(\omega) \le \alpha\}) = F_X(\alpha)$ for the generators $\{\omega : X(\omega) \le \alpha\}$ of $\sigma(X)$, we are not at all surprised by the following proposition.

Proposition 1.2.44. The distribution function $F_X$ uniquely determines the law $\mathcal{P}_X$ of $X$.
Proof. Consider the collection $\pi(\mathbb{R}) = \{(-\infty, b] : b \in \mathbb{R}\}$ of subsets of $\mathbb{R}$. It is easy to see that $\pi(\mathbb{R})$ is a $\pi$-system, which generates $\mathcal{B}$ (see Exercise 1.1.17). Hence, by Proposition 1.1.39, any two probability measures on $(\mathbb{R}, \mathcal{B})$ that coincide on $\pi(\mathbb{R})$ are the same. Since the distribution function $F_X$ specifies the restriction of such a probability measure $\mathcal{P}_X$ to $\pi(\mathbb{R})$, it thus uniquely determines the values of $\mathcal{P}_X(B)$ for all $B \in \mathcal{B}$.
Different probability measures $P$ on the measurable space $(\Omega, \mathcal{F})$ may trivialize different $\sigma$-algebras. That is,

Definition 1.2.45. If a $\sigma$-algebra $\mathcal{H} \subseteq \mathcal{F}$ and a probability measure $P$ on $(\Omega, \mathcal{F})$ are such that $P(H) \in \{0, 1\}$ for all $H \in \mathcal{H}$, we call $\mathcal{H}$ a $P$-trivial $\sigma$-algebra. Similarly, a random variable $X$ is called $P$-trivial or $P$-degenerate if there exists a non-random constant $c$ such that $P(X \neq c) = 0$.

Using distribution functions we show next that all random variables on a $P$-trivial $\sigma$-algebra are $P$-trivial.
Proposition 1.2.46. If a random variable $X \in m\mathcal{H}$ for a $P$-trivial $\sigma$-algebra $\mathcal{H}$, then $X$ is $P$-trivial.

Proof. By definition, the sets $\{\omega : X(\omega) \le \alpha\}$ are in $\mathcal{H}$ for all $\alpha \in \mathbb{R}$. Since $\mathcal{H}$ is $P$-trivial this implies that $F_X(\alpha) \in \{0, 1\}$ for all $\alpha \in \mathbb{R}$. In view of Theorem 1.2.36 this is possible only if $F_X(\alpha) = 1_{\{\alpha \ge c\}}$ for some non-random $c \in \mathbb{R}$ (for example, set $c = \inf\{\alpha : F_X(\alpha) = 1\}$). That is, $P(X \neq c) = 0$, as claimed.
We conclude with a few exercises about the support of measures on $(\mathbb{R}, \mathcal{B})$.
Exercise 1.2.47. Let $\mu$ be a measure on $(\mathbb{R}, \mathcal{B})$. A point $x$ is said to be in the support of $\mu$ if $\mu(O) > 0$ for every open neighborhood $O$ of $x$. Prove that the support of $\mu$ is a closed set whose complement is the maximal open set on which $\mu$ vanishes.

Exercise 1.2.48. Given an arbitrary closed set $C \subseteq \mathbb{R}$, construct a probability measure on $(\mathbb{R}, \mathcal{B})$ whose support is $C$.
Hint: Try a measure consisting of a countable collection of atoms (i.e. points of positive probability).
As you are to check next, the discontinuity points of a distribution function are closely related to the support of the corresponding law.

Exercise 1.2.49. The support of a distribution function $F$ is the set $S_F = \{x \in \mathbb{R} : F(x + \delta) - F(x - \delta) > 0 \text{ for all } \delta > 0\}$.
(a) Show that all points of discontinuity of $F(\cdot)$ belong to $S_F$, and that any isolated point of $S_F$ (that is, $x \in S_F$ such that $(x - \delta, x + \delta) \cap S_F = \{x\}$ for some $\delta > 0$) must be a point of discontinuity of $F(\cdot)$.
(b) Show that the support of the law $\mathcal{P}_X$ of a random variable $X$, as defined in Exercise 1.2.47, is the same as the support of its distribution function $F_X$.
1.3. Integration and the (mathematical) expectation

A key concept in probability theory is the mathematical expectation of random variables. In Subsection 1.3.1 we provide its definition via the framework of Lebesgue integration with respect to a measure and study properties such as monotonicity and linearity. In Subsection 1.3.2 we consider fundamental inequalities associated with the expectation. Subsection 1.3.3 is about the exchange of integration and limit operations, complemented by uniform integrability and its consequences in Subsection 1.3.4. Subsection 1.3.5 considers densities relative to arbitrary measures and relates our treatment of integration and expectation to Riemann's integral and the classical definition of the expectation for a R.V. with probability density. We conclude with Subsection 1.3.6, about moments of random variables, including their values for a few well known distributions.
1.3.1. Lebesgue integral, linearity and monotonicity. Let $SF_+$ denote the collection of non-negative simple functions with respect to the given measurable space $(\mathbb{S}, \mathcal{F})$ and $m\mathcal{F}_+$ denote the collection of $[0, \infty]$-valued measurable functions on this space. We next define Lebesgue's integral with respect to any measure $\mu$ on $(\mathbb{S}, \mathcal{F})$, first for $\varphi \in SF_+$, then extending it to all $f \in m\mathcal{F}_+$. With the notation $\mu(f) := \int_{\mathbb{S}} f(s)\,d\mu(s)$ for this integral, we also denote by $\mu_0(\cdot)$ the more restrictive integral, defined only on $SF_+$, so as to clarify the role each of these plays in some of our proofs. We call an $\mathbb{R}$-valued measurable function $f \in m\mathcal{F}$ for which $\mu(|f|) < \infty$ a $\mu$-integrable function, and denote the collection of all $\mu$-integrable functions by $L^1(\mathbb{S}, \mathcal{F}, \mu)$, extending the definition of the integral $\mu(f)$ to all $f \in L^1(\mathbb{S}, \mathcal{F}, \mu)$.
Definition 1.3.1. Fix a measure space $(\mathbb{S}, \mathcal{F}, \mu)$ and define $\mu(f)$ by the following four step procedure:
Step 1. Define $\mu_0(I_A) := \mu(A)$ for each $A \in \mathcal{F}$.
Step 2. Any $\varphi \in SF_+$ has a representation $\varphi = \sum_{l=1}^{n} c_l I_{A_l}$ for some finite $n < \infty$, non-random $c_l \in [0, \infty]$ and sets $A_l \in \mathcal{F}$, yielding the definition of the integral via
$$\mu_0(\varphi) := \sum_{l=1}^{n} c_l\, \mu(A_l)\,,$$
where we adopt hereafter the convention that $\infty \cdot 0 = 0 \cdot \infty = 0$.
Step 3. For $f \in m\mathcal{F}_+$ we define
$$\mu(f) := \sup\{\mu_0(\varphi) : \varphi \in SF_+,\ \varphi \le f\}.$$
Step 4. For $f \in m\mathcal{F}$ let $f_+ = \max(f, 0) \in m\mathcal{F}_+$ and $f_- = -\min(f, 0) \in m\mathcal{F}_+$. We then set $\mu(f) = \mu(f_+) - \mu(f_-)$ provided either $\mu(f_+) < \infty$ or $\mu(f_-) < \infty$. In particular, this applies whenever $f \in L^1(\mathbb{S}, \mathcal{F}, \mu)$, for then $\mu(f_+) + \mu(f_-) = \mu(|f|)$ is finite, hence $\mu(f)$ is well defined and finite valued.
We use the notation $\int_{\mathbb{S}} f(s)\,d\mu(s)$ for $\mu(f)$, which we call the Lebesgue integral of $f$ with respect to the measure $\mu$.
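On a finite space every measurable function is simple, so the four steps collapse to a weighted sum over atoms. The sketch below, with a made-up three-point measure, mirrors Step 1 (measure of a set), Step 2 (weighted sums) and Step 4 (splitting into positive and negative parts):

```python
def mu_of_set(mu, A):
    # Step 1: mu_0(I_A) = mu(A), for mu given as a dict of point masses
    return sum(m for s, m in mu.items() if s in A)

def integral(mu, f):
    # Steps 2-4 on a finite space: integrate f_+ and f_- as weighted sums
    pos = sum(max(f(s), 0.0) * m for s, m in mu.items())
    neg = sum(max(-f(s), 0.0) * m for s, m in mu.items())
    return pos - neg

mu = {1: 0.2, 2: 0.3, 3: 0.5}     # a probability measure on S = {1, 2, 3}
print(integral(mu, lambda s: s))  # E X for X(s) = s, i.e. 0.2 + 0.6 + 1.5
```

Note that on an infinite space Step 3 (the supremum over simple minorants) cannot be reduced to a finite sum; this sketch only illustrates the bookkeeping of the definition.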
The expectation $E[X]$ of a random variable $X$ on a probability space $(\Omega, \mathcal{F}, P)$ is merely Lebesgue's integral $\int_{\Omega} X(\omega)\,dP(\omega)$ of $X$ with respect to $P$. That is,
Step 1. $E[I_A] = P(A)$ for any $A \in \mathcal{F}$.
Step 2. Any $\varphi \in SF_+$ has a representation $\varphi = \sum_{l=1}^{n} c_l I_{A_l}$ for some non-random $n < \infty$, $c_l \ge 0$ and sets $A_l \in \mathcal{F}$, to which corresponds
$$E[\varphi] = \sum_{l=1}^{n} c_l E[I_{A_l}] = \sum_{l=1}^{n} c_l P(A_l)\,.$$
Step 3. For $X \in m\mathcal{F}_+$ define
$$EX = \sup\{EY : Y \in SF_+,\ Y \le X\}.$$
Step 4. Represent $X \in m\mathcal{F}$ as $X = X_+ - X_-$, where $X_+ = \max(X, 0) \in m\mathcal{F}_+$ and $X_- = -\min(X, 0) \in m\mathcal{F}_+$, with the corresponding definition
$$EX = EX_+ - EX_-\,,$$
provided either $EX_+ < \infty$ or $EX_- < \infty$.
Remark. Note that we may have $EX = \infty$ while $X(\omega) < \infty$ for all $\omega$. For instance, take the random variable $X(\omega) = \omega$ for $\Omega = \{1, 2, \ldots\}$ and $\mathcal{F} = 2^{\Omega}$. If $P(\omega = k) = c k^{-2}$ with $c = [\sum_{k=1}^{\infty} k^{-2}]^{-1}$ a positive, finite normalization constant, then $EX = c \sum_{k=1}^{\infty} k^{-1} = \infty$.
Similar to the notion of $\mu$-integrable functions introduced in the last step of the definition of Lebesgue's integral, we have the following definition for random variables.

Definition 1.3.2. We say that a random variable $X$ is (absolutely) integrable, or $X$ has finite expectation, if $E|X| < \infty$, that is, both $EX_+ < \infty$ and $EX_- < \infty$. Fixing $1 \le q < \infty$ we denote by $L^q(\Omega, \mathcal{F}, P)$ the collection of random variables $X$ on $(\Omega, \mathcal{F})$ for which $\|X\|_q = [E|X|^q]^{1/q} < \infty$. For example, $L^1(\Omega, \mathcal{F}, P)$ denotes the space of all (absolutely) integrable random variables. We use the short notation $L^q$ when the probability space $(\Omega, \mathcal{F}, P)$ is clear from the context.
We next verify that Lebesgue's integral assigns a unique value to each function $f$ in Definition 1.3.1. To this end, we focus on $\mu_0 : SF_+ \to [0, \infty]$ of Step 2 of our definition and derive its structural properties, such as monotonicity, linearity and invariance under a change of argument on a $\mu$-negligible set.
Lemma 1.3.3. $\mu_0(\varphi)$ assigns a unique value to each $\varphi \in SF_+$. Further,
a). $\mu_0(\varphi) = \mu_0(\psi)$ if $\varphi, \psi \in SF_+$ are such that $\mu(\{s : \varphi(s) \neq \psi(s)\}) = 0$.
b). $\mu_0$ is linear, that is,
$$\mu_0(\varphi + \psi) = \mu_0(\varphi) + \mu_0(\psi)\,, \qquad \mu_0(c\varphi) = c\,\mu_0(\varphi)\,,$$
for any $\varphi, \psi \in SF_+$ and $c \ge 0$.
c). $\mu_0$ is monotone, that is, $\mu_0(\varphi) \le \mu_0(\psi)$ if $\varphi(s) \le \psi(s)$ for all $s \in \mathbb{S}$.
Proof. Note that a non-negative simple function $\varphi \in SF_+$ has many different representations as weighted sums of indicator functions. Suppose for example that
$$\sum_{l=1}^{n} c_l I_{A_l}(s) = \sum_{k=1}^{m} d_k I_{B_k}(s)\,, \tag{1.3.1}$$
for some $c_l \ge 0$, $d_k \ge 0$, $A_l \in \mathcal{F}$, $B_k \in \mathcal{F}$ and all $s \in \mathbb{S}$. There exists a finite partition of $\mathbb{S}$ into at most $2^{n+m}$ disjoint sets $C_i$ such that each of the sets $A_l$ and $B_k$ is a union of some $C_i$, $i = 1, \ldots, 2^{n+m}$. Expressing both sides of (1.3.1) as finite weighted sums of $I_{C_i}$, we necessarily have for each $i$ the same weight on both sides. Due to the (finite) additivity of $\mu$ over unions of disjoint sets $C_i$, we thus get after some algebra that
$$\sum_{l=1}^{n} c_l\, \mu(A_l) = \sum_{k=1}^{m} d_k\, \mu(B_k)\,. \tag{1.3.2}$$
Consequently, $\mu_0(\varphi)$ is well-defined and independent of the chosen representation for $\varphi$. Further, the conclusion (1.3.2) applies also when the two sides of (1.3.1) differ for $s \in C$ as long as $\mu(C) = 0$, hence proving the first stated property of the lemma.
Choosing the representation of $\varphi + \psi$ based on the representations of $\varphi$ and $\psi$ immediately results with the stated linearity of $\mu_0$. Given this, if $\psi(s) \ge \varphi(s)$ for all $s$, then $\psi = \varphi + \xi$ for some $\xi \in SF_+$, implying that $\mu_0(\psi) = \mu_0(\varphi) + \mu_0(\xi) \ge \mu_0(\varphi)$, as claimed.
Remark. The stated monotonicity of $\mu_0$ implies that $\mu(\cdot)$ coincides with $\mu_0(\cdot)$ on $SF_+$. As $\mu_0$ is uniquely defined for each $f \in SF_+$ and $f = f_+$ when $f \in m\mathcal{F}_+$, it follows that $\mu(f)$ is uniquely defined for each $f \in m\mathcal{F}_+ \cup L^1(\mathbb{S}, \mathcal{F}, \mu)$.
All three properties of $\mu_0$ (hence $\mu$) stated in Lemma 1.3.3 for functions in $SF_+$ extend to all of $m\mathcal{F}_+ \cup L^1$. Indeed, the facts that $\mu(cf) = c\mu(f)$, that $\mu(f) \le \mu(g)$ whenever $0 \le f \le g$, and that $\mu(f) = \mu(g)$ whenever $\mu(\{s : f(s) \neq g(s)\}) = 0$ are immediate consequences of our definition (once we have these for $f, g \in SF_+$). Since $f \le g$ implies $f_+ \le g_+$ and $f_- \ge g_-$, the monotonicity of $\mu(\cdot)$ extends to functions in $L^1$ (by Step 4 of our definition). To prove that $\mu(h + g) = \mu(h) + \mu(g)$ for all $h, g \in m\mathcal{F}_+ \cup L^1$ requires an application of the monotone convergence theorem (in short MON), which we now state, while deferring its proof to Subsection 1.3.3.
Theorem 1.3.4 (Monotone convergence theorem). If $0 \le h_n(s) \uparrow h(s)$ for all $s \in \mathbb{S}$ and $h_n \in m\mathcal{F}_+$, then $\mu(h_n) \uparrow \mu(h)$.
Indeed, recall that while proving Proposition 1.2.6 we constructed the sequence $f_n$ such that for every $g \in m\mathcal{F}_+$ we have $f_n(g) \in SF_+$ and $f_n(g) \uparrow g$. Specifying $g, h \in m\mathcal{F}_+$ we have that $f_n(h) + f_n(g) \in SF_+$. So, by Lemma 1.3.3,
$$\mu(f_n(h) + f_n(g)) = \mu_0(f_n(h) + f_n(g)) = \mu_0(f_n(h)) + \mu_0(f_n(g)) = \mu(f_n(h)) + \mu(f_n(g))\,.$$
Since $f_n(h) \uparrow h$ and $f_n(h) + f_n(g) \uparrow h + g$, by monotone convergence,
$$\mu(h + g) = \lim_{n \to \infty} \mu(f_n(h) + f_n(g)) = \lim_{n \to \infty} \mu(f_n(h)) + \lim_{n \to \infty} \mu(f_n(g)) = \mu(h) + \mu(g)\,.$$
To extend this result to $g, h \in m\mathcal{F}_+ \cup L^1$, note that $h_- + g_- = f + (h + g)_-$ for some $f \in m\mathcal{F}_+$ such that $h_+ + g_+ = f + (h + g)_+$. Since $\mu(h_-) < \infty$ and $\mu(g_-) < \infty$, by linearity and monotonicity of $\mu(\cdot)$ on $m\mathcal{F}_+$ necessarily also $\mu(f) < \infty$, and the identity $\mu(h + g) = \mu(h) + \mu(g)$ on $m\mathcal{F}_+ \cup L^1$ follows by elementary algebra. In conclusion, we have just proved that
Proposition 1.3.5. The integral $\mu(f)$ assigns a unique value to each $f \in m\mathcal{F}_+ \cup L^1(\mathbb{S}, \mathcal{F}, \mu)$. Further,
a). $\mu(f) = \mu(g)$ whenever $\mu(\{s : f(s) \neq g(s)\}) = 0$.
b). $\mu$ is linear, that is, for any $f, h, g \in m\mathcal{F}_+ \cup L^1$ and $c \ge 0$,
$$\mu(h + g) = \mu(h) + \mu(g)\,, \qquad \mu(cf) = c\,\mu(f)\,.$$
c). $\mu$ is monotone, that is, $\mu(f) \le \mu(g)$ if $f(s) \le g(s)$ for all $s \in \mathbb{S}$.
Our proof of the identity $\mu(h + g) = \mu(h) + \mu(g)$ is an example of the following general approach to proving that certain properties hold for all $h \in L^1$.

Definition 1.3.6 (Standard Machine). To prove the validity of a certain property for all $h \in L^1(\mathbb{S}, \mathcal{F}, \mu)$, break your proof into four easier steps, following those of Definition 1.3.1.
Step 1. Prove the property for $h$ which is an indicator function.
Step 2. Using linearity, extend the property to all $h \in SF_+$.
Step 3. Using MON, extend the property to all $h \in m\mathcal{F}_+$.
Step 4. Extend the property in question to $h \in L^1$ by writing $h = h_+ - h_-$ and using linearity.
Here is another application of the standard machine.

Exercise 1.3.7. Suppose that a probability measure $\mathcal{P}$ on $(\mathbb{R}, \mathcal{B})$ is such that $\mathcal{P}(B) = \lambda(f I_B)$ for the Lebesgue measure $\lambda$ on $\mathbb{R}$, some non-negative Borel function $f(\cdot)$ and all $B \in \mathcal{B}$. Using the standard machine, prove that then $\mathcal{P}(h) = \lambda(fh)$ for any Borel function $h$ such that either $h \ge 0$ or $\lambda(f|h|) < \infty$.
Hint: See the proof of Proposition 1.3.56.

We shall see more applications of the standard machine later (for example, when proving Proposition 1.3.56 and Theorem 1.3.61).
We next strengthen the non-negativity and monotonicity properties of Lebesgue's integral $\mu(\cdot)$ by showing that

Lemma 1.3.8. If $\mu(h) = 0$ for $h \in m\mathcal{F}_+$, then $\mu(\{s : h(s) > 0\}) = 0$. Consequently, if for $f, g \in L^1(\mathbb{S}, \mathcal{F}, \mu)$ both $\mu(f) = \mu(g)$ and $\mu(\{s : f(s) > g(s)\}) = 0$, then $\mu(\{s : f(s) \neq g(s)\}) = 0$.

Proof. By continuity from below of the measure $\mu$ we have that
$$\mu(\{s : h(s) > 0\}) = \lim_{n \to \infty} \mu(\{s : h(s) > n^{-1}\})$$
(see Exercise 1.1.4). Hence, if $\mu(\{s : h(s) > 0\}) > 0$, then for some $n < \infty$,
$$0 < n^{-1} \mu(\{s : h(s) > n^{-1}\}) = \mu_0(n^{-1} I_{\{h > n^{-1}\}}) \le \mu(h)\,,$$
where the right most inequality is a consequence of the definition of $\mu(h)$ and the fact that $h \ge n^{-1} I_{\{h > n^{-1}\}} \in SF_+$. Thus, our assumption that $\mu(h) = 0$ must imply that $\mu(\{s : h(s) > 0\}) = 0$.
To prove the second part of the lemma, consider $\widetilde{h} = g - f$, which is non-negative outside a set $N \in \mathcal{F}$ such that $\mu(N) = 0$. Hence, $h = (g - f) I_{N^c} \in m\mathcal{F}_+$ and $0 = \mu(g) - \mu(f) = \mu(\widetilde{h}) = \mu(h)$ by Proposition 1.3.5, implying that $\mu(\{s : h(s) > 0\}) = 0$ by the preceding proof. The same applies for $\widetilde{h}$, and the statement of the lemma follows.
We conclude this subsection by stating the results of Proposition 1.3.5 and Lemma 1.3.8 in terms of the expectation on a probability space $(\Omega, \mathcal{F}, P)$.

Theorem 1.3.9. The mathematical expectation $E[X]$ is well defined for every R.V. $X$ on $(\Omega, \mathcal{F}, P)$ provided either $X \ge 0$ almost surely, or $X \in L^1(\Omega, \mathcal{F}, P)$. Further,
(a) $EX = EY$ whenever $X \stackrel{a.s.}{=} Y$.
(b) The expectation is a linear operation, for if $Y$ and $Z$ are integrable R.V. then for any constants $\alpha, \beta$ the R.V. $\alpha Y + \beta Z$ is integrable and $E(\alpha Y + \beta Z) = \alpha(EY) + \beta(EZ)$. The same applies when $Y, Z \ge 0$ almost surely and $\alpha, \beta \ge 0$.
(c) The expectation is monotone. That is, if $Y$ and $Z$ are either integrable or non-negative and $Y \le Z$ almost surely, then $EY \le EZ$. Further, if $Y$ and $Z$ are integrable with $Y \le Z$ a.s. and $EY = EZ$, then $Y \stackrel{a.s.}{=} Z$.
(d) Constants are invariant under the expectation. That is, if $X \stackrel{a.s.}{=} c$ for non-random $c \in (-\infty, \infty]$, then $EX = c$.

Remark. Part (d) of the theorem relies on the fact that $P$ is a probability measure, namely $P(\Omega) = 1$. Indeed, it is obtained by considering the expectation of the simple function $c I_{\Omega}$, to which $X$ equals with probability one.
The linearity of the expectation (i.e. part (b) of the preceding theorem) is often extremely helpful when looking for an explicit formula for it. We next provide a few examples of this.

Exercise 1.3.10. Write $(\Omega, \mathcal{F}, P)$ for a random experiment whose outcome is a recording of the results of $n$ independent rolls of a balanced six-sided die (including their order). Compute the expectation of the random variable $D(\omega)$ which counts the number of different faces of the die recorded in these $n$ rolls.
Exercise 1.3.11 (Matching). In a random matching experiment, we apply a random permutation $\pi$ to the integers $1, 2, \ldots, n$, where each of the possible $n!$ permutations is equally likely. Let $Z_i = I_{\{\pi(i) = i\}}$ be the random variable indicating whether $i = 1, 2, \ldots, n$ is a fixed point of the random permutation, and let $X_n = \sum_{i=1}^{n} Z_i$ count the number of fixed points of the random permutation (i.e. the number of self-matchings). Show that $E[X_n (X_n - 1) \cdots (X_n - k + 1)] = 1$ for $k = 1, 2, \ldots, n$.
Similarly, here is an elementary application of the monotonicity of the expectation (i.e. part (c) of the preceding theorem).

Exercise 1.3.12. Suppose an integrable random variable $X$ is such that $E(X I_A) = 0$ for each $A \in \sigma(X)$. Show that necessarily $X = 0$ almost surely.
1.3.2. Inequalities. The linearity of the expectation often allows us to compute $EX$ even when we cannot compute the distribution function $F_X$. In such cases the expectation can be used to bound tail probabilities, based on the following classical inequality.

Theorem 1.3.13 (Markov's inequality). Suppose $\varphi : \mathbb{R} \to [0, \infty]$ is a Borel function and let $\varphi_*(A) = \inf\{\varphi(y) : y \in A\}$ for any $A \in \mathcal{B}$. Then for any R.V. $X$,
$$\varphi_*(A)\, P(X \in A) \le E(\varphi(X) I_{\{X \in A\}}) \le E\varphi(X).$$

Proof. By the definition of $\varphi_*(A)$ and the non-negativity of $\varphi$ we have that
$$\varphi_*(A) I_{\{x \in A\}} \le \varphi(x) I_{\{x \in A\}} \le \varphi(x)\,,$$
for all $x \in \mathbb{R}$. Therefore, $\varphi_*(A) I_{\{X(\omega) \in A\}} \le \varphi(X(\omega)) I_{\{X(\omega) \in A\}} \le \varphi(X(\omega))$ for every $\omega \in \Omega$. We deduce the stated inequality by the monotonicity of the expectation and the identity $E(\varphi_*(A) I_{\{X \in A\}}) = \varphi_*(A) P(X \in A)$ (due to Step 2 of Definition 1.3.1).
We next specify three common instances of Markov's inequality.

Example 1.3.14. (a). Taking $\varphi(x) = x_+$ and $A = [a, \infty)$ for some $a > 0$ we have that $\varphi_*(A) = a$. Markov's inequality is then
$$P(X \ge a) \le \frac{EX_+}{a}\,,$$
which is particularly appealing when $X \ge 0$, so $EX_+ = EX$.
(b). Taking $\varphi(x) = |x|^q$ and $A = (-\infty, -a] \cup [a, \infty)$ for some $a > 0$, we get that $\varphi_*(A) = a^q$. Markov's inequality is then $a^q P(|X| \ge a) \le E|X|^q$. Considering $q = 2$ and $X = Y - EY$ for $Y \in L^2$, this amounts to
$$P(|Y - EY| \ge a) \le \frac{\mathrm{Var}(Y)}{a^2}\,,$$
which we call Chebyshev's inequality (c.f. Definition 1.3.67 for the variance and moments of a random variable $Y$).
(c). Taking $\varphi(x) = e^{\theta x}$ for some $\theta > 0$ and $A = [a, \infty)$ for some $a \in \mathbb{R}$ we have that $\varphi_*(A) = e^{\theta a}$. Markov's inequality is then
$$P(X \ge a) \le e^{-\theta a} E e^{\theta X}.$$
This bound provides an exponential decay in $a$, at the cost of requiring $X$ to have finite exponential moments.
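Both the Markov and the Chebyshev bounds are easy to see in action with a Monte Carlo sample. This is only a sketch: the Exp(1) law, the sample size and the threshold $a = 3$ are arbitrary choices, and the assertions hold exactly because both inequalities apply to the empirical distribution of any non-negative sample.

```python
import random

random.seed(0)
n = 200_000
sample = [random.expovariate(1.0) for _ in range(n)]  # Exp(1): EX = 1, Var(X) = 1

a = 3.0
mean = sum(sample) / n
var = sum((x - mean) ** 2 for x in sample) / n

markov_tail = sum(x >= a for x in sample) / n          # P(X >= a), about e^{-3}
cheb_tail = sum(abs(x - mean) >= a for x in sample) / n

print(markov_tail, "<=", mean / a)       # Markov: P(X >= a) <= EX / a
print(cheb_tail, "<=", var / a ** 2)     # Chebyshev: P(|X - EX| >= a) <= Var/a^2
```

For this law the true tail $e^{-3} \approx 0.05$ is far below the Markov bound $1/3$, illustrating that these bounds are crude but universal.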
In general, we cannot compute $EX$ explicitly from Definition 1.3.1 except for discrete R.V.-s and for R.V.-s having a probability density function. We thus appeal to the properties of the expectation listed in Theorem 1.3.9, or use various inequalities to bound one expectation by another. To this end, we start with Jensen's inequality, dealing with the effect that a convex function has on the expectation.

Proposition 1.3.15 (Jensen's inequality). Suppose $g(\cdot)$ is a convex function on an open interval $G$ of $\mathbb{R}$, that is,
$$\lambda g(x) + (1 - \lambda) g(y) \ge g(\lambda x + (1 - \lambda) y) \qquad \forall x, y \in G,\ 0 \le \lambda \le 1.$$
If $X$ is an integrable R.V. with $P(X \in G) = 1$ and $g(X)$ is also integrable, then $E(g(X)) \ge g(EX)$.
Proof. The convexity of $g(\cdot)$ on $G$ implies that $g(\cdot)$ is continuous on $G$ (hence $g(X)$ is a random variable) and the existence for each $c \in G$ of $b = b(c) \in \mathbb{R}$ such that
$$g(x) \ge g(c) + b(x - c)\,, \qquad \forall x \in G. \tag{1.3.3}$$
Since $G$ is an open interval of $\mathbb{R}$ with $P(X \in G) = 1$ and $X$ is integrable, it follows that $EX \in G$. Assuming (1.3.3) holds for $c = EX$, that $X \in G$ a.s., and that both $X$ and $g(X)$ are integrable, we have by Theorem 1.3.9 that
$$E(g(X)) = E(g(X) I_{\{X \in G\}}) \ge E[(g(c) + b(X - c)) I_{\{X \in G\}}] = g(c) + b(EX - c) = g(EX)\,,$$
as stated. To derive (1.3.3), note that if $(c - h_2, c + h_1) \subseteq G$ for positive $h_1$ and $h_2$, then by convexity of $g(\cdot)$,
$$\frac{h_2}{h_1 + h_2} g(c + h_1) + \frac{h_1}{h_1 + h_2} g(c - h_2) \ge g(c)\,,$$
which amounts to $[g(c + h_1) - g(c)]/h_1 \ge [g(c) - g(c - h_2)]/h_2$. Considering the infimum over $h_1 > 0$ and the supremum over $h_2 > 0$ we deduce that
$$\inf_{h > 0,\, c + h \in G} \frac{g(c + h) - g(c)}{h} := (D_+ g)(c) \ge (D_- g)(c) := \sup_{h > 0,\, c - h \in G} \frac{g(c) - g(c - h)}{h}\,.$$
With $G$ an open set, obviously $(D_- g)(x) > -\infty$ and $(D_+ g)(x) < \infty$ for any $x \in G$ (in particular, $g(\cdot)$ is continuous on $G$). Now for any $b \in [(D_- g)(c), (D_+ g)(c)] \subseteq \mathbb{R}$ we get (1.3.3) out of the definition of $D_+ g$ and $D_- g$.
Remark. Since $g(\cdot)$ is convex if and only if $-g(\cdot)$ is concave, we may as well state Jensen's inequality for concave functions, just reversing the sign of the inequality in this case. A trivial instance of Jensen's inequality happens when $X(\omega) = x I_A(\omega) + y I_{A^c}(\omega)$ for some $x, y \in \mathbb{R}$ and $A \in \mathcal{F}$ such that $P(A) = \lambda$. Then,
$$EX = x P(A) + y P(A^c) = \lambda x + (1 - \lambda) y\,,$$
whereas $g(X(\omega)) = g(x) I_A(\omega) + g(y) I_{A^c}(\omega)$. So,
$$E g(X) = \lambda g(x) + (1 - \lambda) g(y) \ge g(\lambda x + (1 - \lambda) y) = g(EX)\,,$$
as $g$ is convex.
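A quick empirical illustration of Jensen's inequality, a sketch with the convex choice $g(x) = e^x$ and a uniform sample. Note the inequality holds exactly under the empirical distribution of any sample, so the comparison below is not a matter of luck:

```python
import math
import random

random.seed(1)
xs = [random.uniform(0.0, 2.0) for _ in range(100_000)]  # X ~ U(0, 2), EX = 1

def E(f):
    # expectation under the empirical distribution of the sample
    return sum(f(x) for x in xs) / len(xs)

g = math.exp                   # convex on all of R
lhs, rhs = E(g), g(E(lambda x: x))
print(lhs, ">=", rhs)          # E e^X ~ (e^2 - 1)/2 ~ 3.19 versus e^{EX} ~ e
```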
Applying Jensen's inequality, we show that the spaces $L^q(\Omega, \mathcal{F}, P)$ of Definition 1.3.2 are nested in terms of the parameter $q \ge 1$.

Lemma 1.3.16. Fixing $Y \in m\mathcal{F}$, the mapping $q \mapsto \|Y\|_q = [E|Y|^q]^{1/q}$ is non-decreasing for $q > 0$. Hence, the space $L^q(\Omega, \mathcal{F}, P)$ is contained in $L^r(\Omega, \mathcal{F}, P)$ for any $r \le q$.
Proof. Fix $q > r > 0$ and consider the sequence of bounded R.V. $X_n(\omega) = \min(|Y(\omega)|, n)^r$. Obviously, $X_n$ and $X_n^{q/r}$ are both in $L^1$. Apply Jensen's inequality for the convex function $g(x) = |x|^{q/r}$ and the non-negative R.V. $X_n$, to get that
$$(EX_n)^{q/r} \le E(X_n^{q/r}) = E[\min(|Y|, n)^q] \le E(|Y|^q)\,.$$
For $n \to \infty$ we have that $X_n \uparrow |Y|^r$, so by monotone convergence $(E|Y|^r)^{q/r} \le E(|Y|^q)$. Taking the $1/q$-th power yields the stated result $\|Y\|_r \le \|Y\|_q$.
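Since the lemma holds for every probability measure, it holds in particular for the empirical measure of any finite sample, which makes it easy to check numerically (a sketch; the normal sample is an arbitrary choice):

```python
import random

random.seed(2)
ys = [random.gauss(0.0, 1.0) for _ in range(50_000)]

def norm(q):
    # ||Y||_q = (E|Y|^q)^{1/q} under the empirical distribution of the sample
    return (sum(abs(y) ** q for y in ys) / len(ys)) ** (1.0 / q)

print(norm(1), norm(2), norm(4))  # non-decreasing in q, by Lemma 1.3.16
```

For a standard normal the exact values are $\sqrt{2/\pi} \approx 0.80$, $1$ and $3^{1/4} \approx 1.32$, visibly increasing in $q$.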
We next bound the expectation of the product of two R.V. while assuming nothing about the relation between them.

Proposition 1.3.17 (Hölder's inequality). Let $X, Y$ be two random variables on the same probability space. If $p, q > 1$ with $\frac{1}{p} + \frac{1}{q} = 1$, then
$$E|XY| \le \|X\|_p \|Y\|_q\,. \tag{1.3.4}$$

Remark. Recall that if $XY$ is integrable, then $E|XY|$ is by itself an upper bound on $|E[XY]|$. The special case $p = q = 2$ of Hölder's inequality,
$$E|XY| \le \sqrt{EX^2} \sqrt{EY^2}\,,$$
is called the Cauchy-Schwarz inequality.
Proof. Fixing $p > 1$ and $q = p/(p - 1)$, let $\alpha = \|X\|_p$ and $\beta = \|Y\|_q$. If $\alpha = 0$ then $|X|^p \stackrel{a.s.}{=} 0$ (see Theorem 1.3.9). Likewise, if $\beta = 0$ then $|Y|^q \stackrel{a.s.}{=} 0$. In either case, the inequality (1.3.4) trivially holds. As this inequality also trivially holds when either $\alpha = \infty$ or $\beta = \infty$, we may and shall assume hereafter that both $\alpha$ and $\beta$ are finite and strictly positive. Recall that
$$\frac{x^p}{p} + \frac{y^q}{q} - xy \ge 0\,, \qquad \forall x, y \ge 0$$
(c.f. [Dur10, Page 21], where it is proved by considering the first two derivatives in $x$). Taking $x = |X|/\alpha$ and $y = |Y|/\beta$, we have by linearity and monotonicity of the expectation that
$$1 = \frac{1}{p} + \frac{1}{q} = \frac{E|X|^p}{\alpha^p p} + \frac{E|Y|^q}{\beta^q q} \ge \frac{E|XY|}{\alpha \beta}\,,$$
yielding the stated inequality (1.3.4).
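The $p = q = 2$ case is again exact for the empirical measure of any sample, so a numerical check cannot fail (a sketch with independent standard normals, an arbitrary choice):

```python
import random

random.seed(3)
n = 10_000
X = [random.gauss(0.0, 1.0) for _ in range(n)]
Y = [random.gauss(0.0, 1.0) for _ in range(n)]

mean = lambda v: sum(v) / len(v)
lhs = mean([abs(x * y) for x, y in zip(X, Y)])
rhs = mean([x * x for x in X]) ** 0.5 * mean([y * y for y in Y]) ** 0.5
print(lhs, "<=", rhs)  # Cauchy-Schwarz: E|XY| <= (EX^2)^{1/2} (EY^2)^{1/2}
```

For independent standard normals $E|XY| = (E|X|)^2 = 2/\pi \approx 0.64$ while the right side is about $1$, so the inequality is strict here.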
A direct consequence of Hölder's inequality is the triangle inequality for the norm $\|X\|_p$ of $L^p(\Omega, \mathcal{F}, P)$, that is,

Proposition 1.3.18 (Minkowski's inequality). If $X, Y \in L^p(\Omega, \mathcal{F}, P)$, $p \ge 1$, then $\|X + Y\|_p \le \|X\|_p + \|Y\|_p$.
Proof. With $|X + Y| \le |X| + |Y|$, by monotonicity of the expectation we have the stated inequality in case $p = 1$. Considering hereafter $p > 1$, it follows from Hölder's inequality (Proposition 1.3.17) that
$$E|X + Y|^p = E(|X + Y| \cdot |X + Y|^{p-1}) \le E(|X| |X + Y|^{p-1}) + E(|Y| |X + Y|^{p-1})$$
$$\le (E|X|^p)^{\frac{1}{p}} (E|X + Y|^{(p-1)q})^{\frac{1}{q}} + (E|Y|^p)^{\frac{1}{p}} (E|X + Y|^{(p-1)q})^{\frac{1}{q}} = (\|X\|_p + \|Y\|_p)(E|X + Y|^p)^{\frac{1}{q}}$$
(recall that $(p - 1)q = p$). Since $X, Y \in L^p$ and
$$|x + y|^p \le (|x| + |y|)^p \le 2^{p-1}(|x|^p + |y|^p)\,, \qquad \forall x, y \in \mathbb{R},\ p > 1,$$
it follows that $a_p = E|X + Y|^p < \infty$. There is nothing to prove unless $a_p > 0$, in which case dividing by $(a_p)^{1/q}$ we get that
$$(E|X + Y|^p)^{1 - \frac{1}{q}} \le \|X\|_p + \|Y\|_p\,,$$
giving the stated inequality (since $1 - \frac{1}{q} = \frac{1}{p}$).
Remark. Jensen's inequality applies only for probability measures, while both Hölder's inequality $\mu(|fg|) \le \mu(|f|^p)^{1/p} \mu(|g|^q)^{1/q}$ and Minkowski's inequality apply for any measure $\mu$, with exactly the same proof we provided for probability measures.
To practice your understanding of Markov's inequality, solve the following exercise.

Exercise 1.3.19. Let $X$ be a non-negative random variable with $\mathrm{Var}(X) \le 1/2$. Show that then $P(1 + EX \ge X \ge EX - 1) \ge 1/2$.
To practice your understanding of the proof of Jensen's inequality, try to prove its extension to convex functions on $\mathbb{R}^n$.
Exercise 1.3.20. Suppose $g : \mathbb{R}^n \to \mathbb{R}$ is a convex function and $X_1, X_2, \ldots, X_n$ are integrable random variables, defined on the same probability space and such that $g(X_1, \ldots, X_n)$ is integrable. Show that then $Eg(X_1, \ldots, X_n) \ge g(EX_1, \ldots, EX_n)$.
Hint: Use convex analysis to show that $g(\cdot)$ is continuous and further that for any $c \in \mathbb{R}^n$ there exists $b \in \mathbb{R}^n$ such that $g(x) \ge g(c) + \langle b, x - c \rangle$ for all $x \in \mathbb{R}^n$ (with $\langle \cdot, \cdot \rangle$ denoting the inner product of two vectors in $\mathbb{R}^n$).
Exercise 1.3.21. Let $Y \ge 0$ with $v = E(Y^2) < \infty$.
(a) Show that for any $0 \le a < EY$,
$$P(Y > a) \ge \frac{(EY - a)^2}{E(Y^2)}\,.$$
Hint: Apply the Cauchy-Schwarz inequality to $Y I_{\{Y > a\}}$.
(b) Show that $(E|Y^2 - v|)^2 \le 4v(v - (EY)^2)$.
(c) Derive the second Bonferroni inequality,
$$P\Big(\bigcup_{i=1}^{n} A_i\Big) \ge \sum_{i=1}^{n} P(A_i) - \sum_{1 \le j < i \le n} P(A_i \cap A_j)\,.$$
How does it compare with the bound of part (a) for $Y = \sum_{i=1}^{n} I_{A_i}$?
1.3.3. Convergence, limits and expectation. Asymptotic behavior is a key issue in probability theory. We thus explore here various notions of convergence of random variables and the relations among them, focusing on the integrability conditions needed for exchanging the order of limit and expectation operations. Unless explicitly stated otherwise, throughout this section we assume that all R.V. are defined on the same probability space $(\Omega, \mathcal{F}, P)$.

In Definition 1.2.25 we have encountered the convergence almost surely of R.V. A weaker notion of convergence is convergence in probability, as defined next.
Definition 1.3.22. We say that R.V. $X_n$ converge to a given R.V. $X_\infty$ in probability, denoted $X_n \stackrel{p}{\to} X_\infty$, if $P(\{\omega : |X_n(\omega) - X_\infty(\omega)| > \varepsilon\}) \to 0$ as $n \to \infty$, for any fixed $\varepsilon > 0$. This is equivalent to $|X_n - X_\infty| \stackrel{p}{\to} 0$, and is a special case of the convergence in $\mu$-measure of $f_n \in m\mathcal{F}$ to $f_\infty \in m\mathcal{F}$, that is, $\mu(\{s : |f_n(s) - f_\infty(s)| > \varepsilon\}) \to 0$ as $n \to \infty$, for any fixed $\varepsilon > 0$.
Our next exercise and example clarify the relationship between convergence almost surely and convergence in probability.

Exercise 1.3.23. Verify that convergence almost surely to a finite limit implies convergence in probability, that is, if $X_n \stackrel{a.s.}{\to} X_\infty \in \mathbb{R}$ then $X_n \stackrel{p}{\to} X_\infty$.
Remark 1.3.24. Generalizing Definition 1.3.22, for a separable metric space $(\mathbb{S}, \rho)$ we say that $(\mathbb{S}, \mathcal{B}_{\mathbb{S}})$-valued random variables $X_n$ converge to $X_\infty$ in probability if and only if for every $\varepsilon > 0$, $P(\rho(X_n, X_\infty) > \varepsilon) \to 0$ as $n \to \infty$ (see [Dud89, Section 9.2] for more details). Equipping $\mathbb{S} = \overline{\mathbb{R}}$ with a suitable metric (for example, $\rho(x, y) = |\psi(x) - \psi(y)|$ with $\psi(x) = x/(1 + |x|) : \overline{\mathbb{R}} \to [-1, 1]$), this definition removes the restriction to $X_\infty$ finite in Exercise 1.3.23.
In general, $X_n \stackrel{p}{\to} X_\infty$ does not imply that $X_n \stackrel{a.s.}{\to} X_\infty$.
Example 1.3.25. Consider the probability space $((0, 1], \mathcal{B}_{(0,1]}, U)$ and $X_n(\omega) = 1_{[t_n, t_n + s_n]}(\omega)$ with $s_n \to 0$ as $n \to \infty$ slowly enough and $t_n \in [0, 1 - s_n]$ such that any $\omega \in (0, 1]$ is in infinitely many intervals $[t_n, t_n + s_n]$. The latter property applies if $t_n = (i - 1)/k$ and $s_n = 1/k$ when $n = k(k-1)/2 + i$, $i = 1, 2, \ldots, k$ and $k = 1, 2, \ldots$ (plot the intervals $[t_n, t_n + s_n]$ to convince yourself). Then, $X_n \stackrel{p}{\to} 0$ (since $s_n = U(X_n \neq 0) \to 0$), whereas fixing each $\omega \in (0, 1]$, we have that $X_n(\omega) = 1$ for infinitely many values of $n$, hence $X_n$ does not converge a.s. to zero.
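The blocks-of-shrinking-intervals construction is easy to simulate. The sketch below recovers $(t_n, s_n)$ from $n$ and confirms that a fixed $\omega$ keeps landing in the intervals even though their lengths shrink, so $X_n(\omega) \not\to 0$ while $U(X_n \neq 0) = s_n \to 0$ (the choice $\omega = 0.3$ and the range of $n$ are arbitrary):

```python
def interval(n):
    # For n = k(k-1)/2 + i with 1 <= i <= k: t_n = (i-1)/k, s_n = 1/k
    k = 1
    while k * (k + 1) // 2 < n:
        k += 1
    i = n - k * (k - 1) // 2
    return (i - 1) / k, 1.0 / k          # (t_n, s_n)

def X(n, w):
    t, s = interval(n)
    return 1.0 if t <= w <= t + s else 0.0

w = 0.3
hits = [n for n in range(1, 500) if X(n, w) == 1.0]
lengths = [interval(n)[1] for n in range(1, 500)]
print(len(hits), max(hits), lengths[-1])  # many hits, arbitrarily late, tiny s_n
```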
Associated with each space $L^q(\Omega, \mathcal{F}, P)$ is the notion of $L^q$ convergence, which we now define.

Definition 1.3.26. We say that $X_n$ converges in $L^q$ to $X_\infty$, denoted $X_n \stackrel{L^q}{\to} X_\infty$, if $X_n, X_\infty \in L^q$ and $\|X_n - X_\infty\|_q \to 0$ as $n \to \infty$ (i.e., $E(|X_n - X_\infty|^q) \to 0$ as $n \to \infty$).
Remark. For $q = 2$ we have the explicit formula
$$\|X_n - X\|_2^2 = E(X_n^2) - 2E(X_n X) + E(X^2).$$
Thus, it is often easiest to check convergence in $L^2$.
The following immediate corollary of Lemma 1.3.16 provides an ordering of $L^q$ convergence in terms of the parameter $q$.

Corollary 1.3.27. If $X_n \stackrel{L^q}{\to} X_\infty$ and $q \ge r$, then $X_n \stackrel{L^r}{\to} X_\infty$.
Next note that $L^q$ convergence implies the convergence of the expectation of $|X_n|^q$.

Exercise 1.3.28. Fixing $q \ge 1$, use Minkowski's inequality (Proposition 1.3.18) to show that if $X_n \stackrel{L^q}{\to} X_\infty$, then $E|X_n|^q \to E|X_\infty|^q$, and for $q = 1, 2, 3, \ldots$ also $EX_n^q \to EX_\infty^q$.
Further, it follows from Markov's inequality that convergence in $L^q$ implies convergence in probability (for any value of $q$).

Proposition 1.3.29. If $X_n \stackrel{L^q}{\to} X_\infty$, then $X_n \stackrel{p}{\to} X_\infty$.

Proof. Fixing $q > 0$, recall that Markov's inequality results with
$$P(|Y| > \varepsilon) \le \varepsilon^{-q} E[|Y|^q]\,,$$
for any R.V. $Y$ and any $\varepsilon > 0$ (c.f. part (b) of Example 1.3.14). The assumed convergence in $L^q$ means that $E[|X_n - X_\infty|^q] \to 0$ as $n \to \infty$, so taking $Y = Y_n = X_n - X_\infty$, we necessarily have also $P(|X_n - X_\infty| > \varepsilon) \to 0$ as $n \to \infty$. Since $\varepsilon > 0$ is arbitrary, we see that $X_n \stackrel{p}{\to} X_\infty$, as claimed.
The converse of Proposition 1.3.29 does not hold in general. As we next demonstrate, even the stronger almost sure convergence (see Exercise 1.3.23), and having a non-random constant limit, are not enough to guarantee $L^q$ convergence, for any $q > 0$.
Example 1.3.30. Fixing $q > 0$, consider the probability space $((0, 1], \mathcal{B}_{(0,1]}, U)$ and the R.V. $Y_n(\omega) = n^{1/q} I_{[0, n^{-1}]}(\omega)$. Since $Y_n(\omega) = 0$ for all $n \ge n_0$ and some finite $n_0 = n_0(\omega)$, it follows that $Y_n(\omega) \stackrel{a.s.}{\to} 0$ as $n \to \infty$. However, $E[|Y_n|^q] = n\,U([0, n^{-1}]) = 1$ for all $n$, so $Y_n$ does not converge to zero in $L^q$ (see Exercise 1.3.28).
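For this example the computation is exact, and a few lines confirm both claims: $Y_n(\omega) \to 0$ for each fixed $\omega$, yet $E|Y_n|^q = 1$ for every $n$ (a sketch; the values of $q$ and $\omega$ are arbitrary):

```python
def Y(n, q, w):
    # Y_n(w) = n^{1/q} on (0, 1/n], and 0 elsewhere
    return n ** (1.0 / q) if 0.0 < w <= 1.0 / n else 0.0

def E_abs_Y_q(n, q):
    # E|Y_n|^q = (n^{1/q})^q * U((0, 1/n]) = n * (1/n) = 1
    return (n ** (1.0 / q)) ** q * (1.0 / n)

q, w = 3.0, 0.3
print([Y(n, q, w) for n in (1, 2, 4, 100)])      # eventually 0 for fixed w
print([E_abs_Y_q(n, q) for n in (1, 10, 1000)])  # stays at 1 for all n
```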
Considering Example 1.3.25, where $X_n \stackrel{L^q}{\to} 0$ while $X_n$ does not converge a.s. to zero, and Example 1.3.30, which exhibits the converse phenomenon, we conclude that convergence in $L^q$ and a.s. convergence are in general non-comparable, and neither one is a consequence of convergence in probability.
Nevertheless, a sequence $X_n$ can have at most one limit, regardless of which convergence mode is considered.

Exercise 1.3.31. Check that if $X_n \stackrel{L^q}{\to} X$ and $X_n \stackrel{a.s.}{\to} Y$, then $X \stackrel{a.s.}{=} Y$.
Though we have just seen that in general the order of the limit and expectation operations is non-interchangeable, we examine for the remainder of this subsection various conditions which do allow for such an interchange. Note in passing that upon proving any such result under certain point-wise convergence conditions, we may with no extra effort relax these to the corresponding almost sure convergence (and the same applies for integrals with respect to measures, see part (a) of Theorem 1.3.9, or that of Proposition 1.3.5).

Turning to do just that, we first outline the results which apply in the more general measure theory setting, starting with the proof of the monotone convergence theorem.
Proof of Theorem 1.3.4. By part (c) of Proposition 1.3.5, the proof of which did not use Theorem 1.3.4, we know that $\mu(h_n)$ is a non-decreasing sequence that is bounded above by $\mu(h)$. It therefore suffices to show that
$$\lim_{n \to \infty} \mu(h_n) = \sup_n \sup\{\mu_0(\varphi) : \varphi \in SF_+,\ \varphi \le h_n\} \ge \sup\{\mu_0(\varphi) : \varphi \in SF_+,\ \varphi \le h\} = \mu(h) \tag{1.3.5}$$
(see Step 3 of Definition 1.3.1). That is, it suffices to find for each non-negative simple function $\varphi \le h$ a sequence of non-negative simple functions $\varphi_n \le h_n$ such that $\mu_0(\varphi_n) \to \mu_0(\varphi)$ as $n \to \infty$. To this end, fixing $\varphi$, we may and shall choose without loss of generality a representation $\varphi = \sum_{l=1}^{m} c_l I_{A_l}$ such that $A_l \in \mathcal{F}$ are disjoint and further $c_l \mu(A_l) > 0$ for $l = 1, \ldots, m$ (see the proof of Lemma 1.3.3). Using hereafter the notation $f_*(A) = \inf\{f(s) : s \in A\}$ for $f \in m\mathcal{F}_+$ and $A \in \mathcal{F}$, the condition $\varphi(s) \le h(s)$ for all $s \in \mathbb{S}$ is equivalent to $c_l \le h_*(A_l)$ for all $l$, so
$$\mu_0(\varphi) \le \sum_{l=1}^{m} h_*(A_l)\, \mu(A_l) = V\,.$$
Suppose first that $V < \infty$, that is, $0 < h_*(A_l)\mu(A_l) < \infty$ for all $l$. In this case, fixing $\epsilon < 1$, consider for each $n$ the disjoint sets $A_{l,\epsilon,n} = \{s \in A_l : h_n(s) \ge \epsilon\, h_*(A_l)\} \in \mathcal{F}$ and the corresponding
$$\psi_{\epsilon,n}(s) = \sum_{l=1}^{m} \epsilon\, h_*(A_l) I_{A_{l,\epsilon,n}}(s) \in SF_+\,,$$
where $\psi_{\epsilon,n}(s) \le h_n(s)$ for all $s \in \mathbb{S}$. If $s \in A_l$ then $h(s) \ge h_*(A_l) > \epsilon\, h_*(A_l)$. Thus, $h_n \uparrow h$ implies that $A_{l,\epsilon,n} \uparrow A_l$ as $n \to \infty$, for each $l$. Consequently, by the definition of $\mu(h_n)$ and the continuity from below of $\mu$,
$$\lim_{n \to \infty} \mu(h_n) \ge \lim_{n \to \infty} \mu_0(\psi_{\epsilon,n}) = \epsilon V\,.$$
Taking $\epsilon \uparrow 1$ we deduce that $\lim_n \mu(h_n) \ge V \ge \mu_0(\varphi)$. Next suppose that $V = \infty$, so without loss of generality we may and shall assume that $h_*(A_1)\mu(A_1) = \infty$. Fixing $x \in (0, h_*(A_1))$, let $A_{1,x,n} = \{s \in A_1 : h_n(s) \ge x\} \in \mathcal{F}$, noting that $A_{1,x,n} \uparrow A_1$ as $n \to \infty$ and $\psi_{x,n}(s) = x I_{A_{1,x,n}}(s) \le h_n(s)$ for all $n$ and $s \in \mathbb{S}$ is a non-negative simple function. Thus, again by continuity from below of $\mu$ we have that
$$\lim_{n \to \infty} \mu(h_n) \ge \lim_{n \to \infty} \mu_0(\psi_{x,n}) = x\, \mu(A_1)\,.$$
Taking $x \uparrow h_*(A_1)$ we deduce that $\lim_n \mu(h_n) \ge h_*(A_1)\mu(A_1) = \infty$, completing the proof of (1.3.5) and that of the theorem.
Considering probability spaces, Theorem 1.3.4 tells us that we can exchange the order of the limit and the expectation in case of monotone upward a.s. convergence of non-negative R.V. (with the limit possibly infinite). That is,

Theorem 1.3.32 (Monotone convergence theorem). If $X_n \ge 0$ and $X_n(\omega) \uparrow X_\infty(\omega)$ for almost every $\omega$, then $EX_n \uparrow EX_\infty$.
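To see the theorem in action, here is a small Python check (an added illustration, not part of the notes) using $X_n = \min(X, n)$ for $X$ exponential of mean one, where $E[\min(X,n)] = 1 - e^{-n}$ in closed form, so the expectations indeed increase to $EX = 1$:

```python
import math

# X_n = min(X, n) increases pointwise to X; for X ~ Exponential(1),
# E[min(X, n)] = integral_0^n P(X > t) dt = 1 - exp(-n).
vals = [1 - math.exp(-n) for n in range(1, 8)]
print(all(a < b for a, b in zip(vals, vals[1:])))  # True: EX_n is increasing
print(abs(vals[-1] - 1.0) < 1e-3)                  # True: EX_n approaches EX = 1
```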
In Example 1.3.30 we have a point-wise convergent sequence of R.V. whose expectations exceed that of their limit. In a sense this is always the case, as stated next in Fatou's lemma (which is a direct consequence of the monotone convergence theorem).
42 1. PROBABILITY, MEASURE AND INTEGRATION
Lemma 1.3.33 (Fatou's lemma). For any measure space $(S, \mathcal{F}, \mu)$ and any $f_n \in m\mathcal{F}$, if $f_n(s) \ge g(s)$ for some $\mu$-integrable function $g$, all $n$ and $\mu$-almost-every $s \in S$, then

(1.3.6)  $\liminf_{n\to\infty} \mu(f_n) \ge \mu(\liminf_{n\to\infty} f_n) .$

Alternatively, if $f_n(s) \le g(s)$ for all $n$ and a.e. $s$, then

(1.3.7)  $\limsup_{n\to\infty} \mu(f_n) \le \mu(\limsup_{n\to\infty} f_n) .$
Proof. Assume first that $f_n \in m\mathcal{F}_+$ and let $h_n(s) = \inf_{k \ge n} f_k(s)$, noting that $h_n \in m\mathcal{F}_+$ is a non-decreasing sequence, whose point-wise limit is $h(s) := \liminf_{n\to\infty} f_n(s)$. By the monotone convergence theorem, $\mu(h_n) \uparrow \mu(h)$. Since $f_n(s) \ge h_n(s)$ for all $s \in S$, the monotonicity of the integral (see Proposition 1.3.5) implies that $\mu(f_n) \ge \mu(h_n)$ for all $n$. Considering the liminf as $n \to \infty$ we arrive at (1.3.6).
Turning to extend this inequality to the more general setting of the lemma, note that our conditions imply that $f_n \stackrel{a.e.}{=} g + (f_n - g)_+$ for each $n$. Considering the countable union of the $\mu$-negligible sets in which one of these identities is violated, we thus have that

$h := \liminf_{n\to\infty} f_n \stackrel{a.e.}{=} g + \liminf_{n\to\infty} (f_n - g)_+ .$

Further, $\mu(f_n) = \mu(g) + \mu((f_n - g)_+)$ by the linearity of the integral on $m\mathcal{F}_+ \cup L^1$. Taking $n \to \infty$ and applying (1.3.6) for $(f_n - g)_+ \in m\mathcal{F}_+$ we deduce that

$\liminf_{n\to\infty} \mu(f_n) \ge \mu(g) + \mu(\liminf_{n\to\infty} (f_n - g)_+) = \mu(g) + \mu(h - g) = \mu(h)$

(where for the right-most identity we used the linearity of the integral, as well as the fact that $g$ is $\mu$-integrable).
Finally, we get (1.3.7) for $f_n$ by considering (1.3.6) for $-f_n$.
Remark. In terms of the expectation, Fatou's lemma is the statement that if R.V. $X_n \ge X$, almost surely, for some $X \in L^1$ and all $n$, then

(1.3.8)  $\liminf_{n\to\infty} E(X_n) \ge E(\liminf_{n\to\infty} X_n) ,$

whereas if $X_n \le X$, almost surely, for some $X \in L^1$ and all $n$, then

(1.3.9)  $\limsup_{n\to\infty} E(X_n) \le E(\limsup_{n\to\infty} X_n) .$

Some text books call (1.3.9) and (1.3.7) the Reverse Fatou Lemma (e.g. [Wil91, Section 5.4]).
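A quick illustration (added here, not in the original text) of how strict the inequality (1.3.8) can be: on $((0,1], \mathcal{B}_{(0,1]}, U)$ take $X_n = n\, I_{(0,1/n)}$, so $X_n(\omega) \to 0$ for every fixed $\omega$ while $EX_n = 1$ for all $n$.

```python
# E[X_n] = n * U((0, 1/n)) = 1 for every n, computed exactly below,
# whereas liminf_n X_n(w) = 0 for each w, so E[liminf X_n] = 0.
expectations = [n * (1 / n) for n in range(1, 6)]
print(all(abs(e - 1) < 1e-12 for e in expectations))  # True
e_liminf = 0.0
print(min(expectations) >= e_liminf)  # True: liminf E[X_n] >= E[liminf X_n], strictly here
```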
Using Fatou's lemma, we can easily prove Lebesgue's dominated convergence theorem (in short, DOM).

Theorem 1.3.34 (Dominated convergence theorem). For any measure space $(S, \mathcal{F}, \mu)$ and any $f_n \in m\mathcal{F}$, if for some $\mu$-integrable function $g$ and $\mu$-almost-every $s \in S$ both $f_n(s) \to f_\infty(s)$ as $n \to \infty$, and $|f_n(s)| \le g(s)$ for all $n$, then $f_\infty$ is $\mu$-integrable and further $\mu(|f_n - f_\infty|) \to 0$ as $n \to \infty$.
Proof. Up to a $\mu$-negligible subset of $S$, our assumption that $|f_n| \le g$ and $f_n \to f_\infty$ implies that $|f_\infty| \le g$, hence $f_\infty$ is $\mu$-integrable. Applying Fatou's lemma (1.3.7) for $|f_n - f_\infty| \le 2g$ such that $\limsup_n |f_n - f_\infty| = 0$, we conclude that

$0 \le \limsup_{n\to\infty} \mu(|f_n - f_\infty|) \le \mu(\limsup_{n\to\infty} |f_n - f_\infty|) = \mu(0) = 0 ,$

as claimed.
By Minkowski's inequality, $\mu(|f_n - f_\infty|) \to 0$ implies that $\mu(|f_n|) \to \mu(|f_\infty|)$. The dominated convergence theorem provides us with a simple sufficient condition for the converse implication in case also $f_n \to f_\infty$ a.e.

Lemma 1.3.35 (Scheffé's lemma). If $f_n \in m\mathcal{F}$ converges a.e. to $f_\infty \in m\mathcal{F}$ and $\mu(|f_n|) \to \mu(|f_\infty|) < \infty$, then $\mu(|f_n - f_\infty|) \to 0$ as $n \to \infty$.

Remark. In terms of expectation, Scheffé's lemma states that if $X_n \xrightarrow{a.s.} X_\infty$ and $E|X_n| \to E|X_\infty| < \infty$, then $X_n \xrightarrow{L^1} X_\infty$ as well.
Proof. As already noted, we may assume without loss of generality that $f_n(s) \to f_\infty(s)$ for all $s \in S$, that is, $g_n(s) = f_n(s) - f_\infty(s) \to 0$ as $n \to \infty$, for all $s \in S$. Further, since $\mu(|f_n|) \to \mu(|f_\infty|) < \infty$, we may and shall assume also that $f_n$ are $\mathbb{R}$-valued and $\mu$-integrable for all $n \le \infty$, hence $g_n \in L^1(S, \mathcal{F}, \mu)$ as well.
Suppose first that $f_n \in m\mathcal{F}_+$ for all $n \le \infty$. In this case, $0 \le (g_n)_-(s) \le f_\infty(s)$ for all $n$ and $s$. As $(g_n)_-(s) \to 0$ for every $s \in S$, applying the dominated convergence theorem we deduce that $\mu((g_n)_-) \to 0$. From the assumptions of the lemma (and the linearity of the integral on $L^1$), we get that $\mu(g_n) = \mu(f_n) - \mu(f_\infty) \to 0$ as $n \to \infty$. Since $|x| = x + 2x_-$ for any $x \in \mathbb{R}$, it thus follows by linearity of the integral on $L^1$ that $\mu(|g_n|) = \mu(g_n) + 2\mu((g_n)_-) \to 0$ for $n \to \infty$, as claimed.
In the general case of $f_n \in m\mathcal{F}$, we know that both $0 \le (f_n)_+(s) \to (f_\infty)_+(s)$ and $0 \le (f_n)_-(s) \to (f_\infty)_-(s)$ for every $s$, so by (1.3.6) of Fatou's lemma, we have that

$\mu(|f_\infty|) = \mu((f_\infty)_+) + \mu((f_\infty)_-) \le \liminf_{n\to\infty} \mu((f_n)_-) + \liminf_{n\to\infty} \mu((f_n)_+)$
$\le \liminf_{n\to\infty} \big[\mu((f_n)_-) + \mu((f_n)_+)\big] = \lim_{n\to\infty} \mu(|f_n|) = \mu(|f_\infty|) .$

Hence, necessarily both $\mu((f_n)_+) \to \mu((f_\infty)_+)$ and $\mu((f_n)_-) \to \mu((f_\infty)_-)$. Since $|x - y| \le |x_+ - y_+| + |x_- - y_-|$ for all $x, y \in \mathbb{R}$ and we already proved the lemma for the non-negative $(f_n)_-$ and $(f_n)_+$, we see that

$\lim_{n\to\infty} \mu(|f_n - f_\infty|) \le \lim_{n\to\infty} \mu(|(f_n)_+ - (f_\infty)_+|) + \lim_{n\to\infty} \mu(|(f_n)_- - (f_\infty)_-|) = 0 ,$

concluding the proof of the lemma.
We conclude this sub-section with quite a few exercises, starting with an alternative characterization of convergence almost surely.

Exercise 1.3.36. Show that $X_n \xrightarrow{a.s.} 0$ if and only if for each $\epsilon > 0$ there is $n$ so that for each random integer $M$ with $M(\omega) \ge n$ for all $\omega$ we have that

$P(\{\omega : |X_{M(\omega)}(\omega)| > \epsilon\}) < \epsilon .$
Exercise 1.3.37. Let $Y_n$ be (real-valued) random variables on $(\Omega, \mathcal{F}, P)$, and $N_k$ positive integer valued random variables on the same probability space.
(a) Show that $Y_{N_k}(\omega) = Y_{N_k(\omega)}(\omega)$ are random variables on $(\Omega, \mathcal{F})$.
(b) Show that if $Y_n \xrightarrow{a.s.} Y_\infty$ and $N_k \xrightarrow{a.s.} \infty$ then $Y_{N_k} \xrightarrow{a.s.} Y_\infty$.
(c) Provide an example of $Y_n \xrightarrow{p} 0$ and $N_k \xrightarrow{a.s.} \infty$ such that almost surely $Y_{N_k} = 1$ for all $k$.
(d) Show that if $Y_n \xrightarrow{a.s.} Y_\infty$ and $P(N_k > r) \to 1$ as $k \to \infty$, for every fixed $r < \infty$, then $Y_{N_k} \xrightarrow{p} Y_\infty$.
In the following four exercises you find some of the many applications of the monotone convergence theorem.

Exercise 1.3.38. You are now to relax the non-negativity assumption in the monotone convergence theorem.
(a) Show that if $E[(X_1)_-] < \infty$ and $X_n(\omega) \uparrow X(\omega)$ for almost every $\omega$, then $EX_n \uparrow EX$.
(b) Show that if in addition $\sup_n E[(X_n)_+] < \infty$, then $X \in L^1(\Omega, \mathcal{F}, P)$.
Exercise 1.3.39. In this exercise you are to show that for any R.V. $X \ge 0$,

(1.3.10)  $EX = \lim_{\epsilon \downarrow 0} E_\epsilon X$  for  $E_\epsilon X = \sum_{j=0}^{\infty} \epsilon j\, P(\{\omega : \epsilon j < X(\omega) \le \epsilon(j+1)\}) .$

First use monotone convergence to show that $E_{\epsilon_k} X$ converges to $EX$ along the sequence $\epsilon_k = 2^{-k}$. Then, check that $E_\epsilon X \le E_\delta X + \delta$ for any $\epsilon, \delta > 0$ and deduce from it the identity (1.3.10).
Applying (1.3.10) verify that if $X$ takes at most countably many values $x_1, x_2, \ldots$, then $EX = \sum_i x_i P(\{\omega : X(\omega) = x_i\})$ (this applies to every R.V. $X \ge 0$ on a countable $\Omega$). More generally, verify that such formula applies whenever the series is absolutely convergent (which amounts to $X \in L^1$).
Exercise 1.3.40. Use monotone convergence to show that for any sequence of non-negative R.V. $Y_n$,

$E\Big(\sum_{n=1}^{\infty} Y_n\Big) = \sum_{n=1}^{\infty} EY_n .$
Exercise 1.3.41. Suppose $X_n, X \in L^1(\Omega, \mathcal{F}, P)$ are such that
(a) $X_n \ge 0$ almost surely, $E[X_n] = 1$, $E[X_n \log X_n] \le 1$, and
(b) $E[X_n Y] \to E[XY]$ as $n \to \infty$, for each bounded random variable $Y$ on $(\Omega, \mathcal{F})$.
Show that then $X \ge 0$ almost surely, $E[X] = 1$ and $E[X \log X] \le 1$.
Hint: Jensen's inequality is handy for showing that $E[X \log X] \le 1$.
Next come a few direct applications of the dominated convergence theorem.

Exercise 1.3.42.
(a) Show that for any random variable $X$, the function $t \mapsto E[e^{-|t-X|}]$ is continuous on $\mathbb{R}$ (this function is sometimes called the bilateral exponential transform).
(b) Suppose $X \ge 0$ is such that $EX^q < \infty$ for some $q > 0$. Show that then $q^{-1}(EX^q - 1) \to E\log X$ as $q \downarrow 0$ and deduce that also $q^{-1}\log EX^q \to E\log X$ as $q \downarrow 0$.
Hint: Fixing $x \ge 0$ deduce from convexity of $q \mapsto x^q$ that $q^{-1}(x^q - 1) \downarrow \log x$ as $q \downarrow 0$.
Exercise 1.3.43. Suppose $X$ is an integrable random variable.
(a) Show that $E(|X|\, I_{\{|X|>n\}}) \to 0$ as $n \to \infty$.
(b) Deduce that for any $\epsilon > 0$ there exists $\delta > 0$ such that

$\sup\{E[|X|\, I_A] : P(A) \le \delta\} \le \epsilon .$

(c) Provide an example of $X \ge 0$ with $EX = \infty$ for which the preceding fails, that is, $P(A_k) \to 0$ as $k \to \infty$ while $E[X I_{A_k}]$ is bounded away from zero.
The following generalization of the dominated convergence theorem is also left as an exercise.

Exercise 1.3.44. Suppose $g_n(\cdot) \le f_n(\cdot) \le h_n(\cdot)$ are $\mu$-integrable functions in the same measure space $(S, \mathcal{F}, \mu)$ such that for $\mu$-almost-every $s \in S$ both $g_n(s) \to g_\infty(s)$, $f_n(s) \to f_\infty(s)$ and $h_n(s) \to h_\infty(s)$ as $n \to \infty$. Show that if further $g_\infty$ and $h_\infty$ are $\mu$-integrable functions such that $\mu(g_n) \to \mu(g_\infty)$ and $\mu(h_n) \to \mu(h_\infty)$, then $f_\infty$ is $\mu$-integrable and $\mu(f_n) \to \mu(f_\infty)$.
Finally, here is a demonstration of one of the many issues that are particularly easy to resolve with respect to the $L^2(\Omega, \mathcal{F}, P)$ norm.

Exercise 1.3.45. Let $X = (X(t))_{t \in \mathbb{R}}$ be a mapping from $\mathbb{R}$ into $L^2(\Omega, \mathcal{F}, P)$. Show that $t \mapsto X(t)$ is a continuous mapping (with respect to the norm $\|\cdot\|_2$ on $L^2(\Omega, \mathcal{F}, P)$), if and only if both

$\mu(t) = E[X(t)]$  and  $r(s,t) = E[X(s)X(t)] - \mu(s)\mu(t)$

are continuous real-valued functions ($r(s,t)$ is continuous as a map from $\mathbb{R}^2$ to $\mathbb{R}$).
1.3.4. $L^1$-convergence and uniform integrability. For probability theory, the dominated convergence theorem states that if random variables $X_n \xrightarrow{a.s.} X_\infty$ are such that $|X_n| \le Y$ for all $n$ and some random variable $Y$ such that $EY < \infty$, then $X_\infty \in L^1$ and $X_n \xrightarrow{L^1} X_\infty$. Since constants have finite expectation (see part (d) of Theorem 1.3.9), we have as its corollary the bounded convergence theorem, that is,

Corollary 1.3.46 (Bounded Convergence). Suppose that a.s. $|X_n(\omega)| \le K$ for some finite non-random constant $K$ and all $n$. If $X_n \xrightarrow{a.s.} X_\infty$, then $X_\infty \in L^1$ and $X_n \xrightarrow{L^1} X_\infty$.
We next state a uniform integrability condition that together with convergence in probability implies the convergence in $L^1$.

Definition 1.3.47. A possibly uncountable collection of R.V.-s $\{X_\alpha, \alpha \in \mathcal{I}\}$ is called uniformly integrable (U.I.) if

(1.3.11)  $\lim_{M\to\infty} \sup_{\alpha} E[|X_\alpha|\, I_{\{|X_\alpha|>M\}}] = 0 .$
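Since the examples in this section are built from indicators, the quantity in (1.3.11) can often be computed in closed form. The following Python sketch is an added illustration (the two families, on $((0,1], U)$, are assumptions chosen for contrast): one family is U.I. and one is not.

```python
# Each X_n = a_n * I(event of probability p_n); then
# E[|X_n| I_{|X_n| > M}] equals a_n * p_n when a_n > M, and 0 otherwise.
def tail_sup(M, family, N=1000):
    sup = 0.0
    for n in range(1, N):
        a, p = family(n)
        if a > M:
            sup = max(sup, a * p)
    return sup

ui = lambda n: (1.0, 1.0 / n)        # X_n = I_{(0,1/n)}: uniformly bounded, U.I.
bad = lambda n: (float(n), 1.0 / n)  # X_n = n * I_{(0,1/n)}: tail term a_n*p_n = 1

print([tail_sup(M, ui) for M in (2, 10, 100)])            # all zero
print([round(tail_sup(M, bad), 6) for M in (2, 10, 100)]) # stuck at 1: not U.I.
```

The second family is exactly the Fatou example: the sup in (1.3.11) never falls below $1$, no matter how large $M$ is.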
Our next lemma shows that U.I. is a relaxation of the condition of dominated convergence, and that U.I. still implies the boundedness in $L^1$ of $\{X_\alpha, \alpha \in \mathcal{I}\}$.

Lemma 1.3.48. If $|X_\alpha| \le Y$ for all $\alpha$ and some R.V. $Y$ such that $EY < \infty$, then the collection $\{X_\alpha\}$ is U.I. In particular, any finite collection of integrable R.V. is U.I.
Further, if $\{X_\alpha\}$ is U.I. then $\sup_\alpha E|X_\alpha| < \infty$.
Proof. By monotone convergence, $E[Y I_{\{Y \le M\}}] \uparrow EY$ as $M \to \infty$, for any R.V. $Y \ge 0$. Hence, if in addition $EY < \infty$, then by linearity of the expectation, $E[Y I_{\{Y > M\}}] \to 0$ as $M \to \infty$. Now, if $|X_\alpha| \le Y$ then $|X_\alpha|\, I_{\{|X_\alpha|>M\}} \le Y I_{\{Y>M\}}$, hence $E[|X_\alpha|\, I_{\{|X_\alpha|>M\}}] \le E[Y I_{\{Y>M\}}]$, which does not depend on $\alpha$, and for $Y \in L^1$ converges to zero when $M \to \infty$. We thus proved that if $|X_\alpha| \le Y$ for all $\alpha$ and some $Y$ such that $EY < \infty$, then $\{X_\alpha\}$ is a U.I. collection of R.V.-s.
For a finite collection of R.V.-s $X_i \in L^1$, $i = 1, \ldots, k$, take $Y = |X_1| + |X_2| + \cdots + |X_k| \in L^1$ such that $|X_i| \le Y$ for $i = 1, \ldots, k$, to see that any finite collection of integrable R.V.-s is U.I.
Finally, since

$E|X_\alpha| = E[|X_\alpha|\, I_{\{|X_\alpha| \le M\}}] + E[|X_\alpha|\, I_{\{|X_\alpha|>M\}}] \le M + \sup_\alpha E[|X_\alpha|\, I_{\{|X_\alpha|>M\}}] ,$

we see that if $\{X_\alpha, \alpha \in \mathcal{I}\}$ is U.I. then $\sup_\alpha E|X_\alpha| < \infty$.
We next state and prove Vitali's convergence theorem for probability measures, deferring the general case to Exercise 1.3.53.

Theorem 1.3.49 (Vitali's convergence theorem). Suppose $X_n \xrightarrow{p} X_\infty$. Then, the collection $\{X_n\}$ is U.I. if and only if $X_n \xrightarrow{L^1} X_\infty$, which in turn is equivalent to $X_n$ being integrable for all $n$ and $E|X_n| \to E|X_\infty|$.

Remark. In view of Lemma 1.3.48, Vitali's theorem relaxes the assumed a.s. convergence $X_n \to X_\infty$ of the dominated (or bounded) convergence theorem, and of Scheffé's lemma, to that of convergence in probability.
Proof. Suppose first that $|X_n| \le M$ for some non-random finite constant $M$ and all $n$. For each $\epsilon > 0$ let $B_{n,\epsilon} = \{\omega : |X_n(\omega) - X_\infty(\omega)| > \epsilon\}$. The assumed convergence in probability means that $P(B_{n,\epsilon}) \to 0$ as $n \to \infty$ (see Definition 1.3.22). Since $P(|X_\infty| \ge M + \epsilon) \le P(B_{n,\epsilon})$, taking $n \to \infty$ and considering $\epsilon = \epsilon_k \downarrow 0$, we get by continuity from below of $P$ that almost surely $|X_\infty| \le M$. So, $|X_n - X_\infty| \le 2M$ and by linearity and monotonicity of the expectation, for any $n$ and $\epsilon > 0$,

$E|X_n - X_\infty| = E[|X_n - X_\infty|\, I_{B_{n,\epsilon}^c}] + E[|X_n - X_\infty|\, I_{B_{n,\epsilon}}] \le E[\epsilon\, I_{B_{n,\epsilon}^c}] + E[2M\, I_{B_{n,\epsilon}}] \le \epsilon + 2M\, P(B_{n,\epsilon}) .$

Since $P(B_{n,\epsilon}) \to 0$ as $n \to \infty$, it follows that $\limsup_{n\to\infty} E|X_n - X_\infty| \le \epsilon$. Taking $\epsilon \downarrow 0$ we deduce that $E|X_n - X_\infty| \to 0$ in this case.
Moving to deal now with the general case of a collection $\{X_n\}$ that is U.I., let $\varphi_M(x) = \max(\min(x, M), -M)$. As $|\varphi_M(x) - \varphi_M(y)| \le |x - y|$ for any $x, y \in \mathbb{R}$, our assumption $X_n \xrightarrow{p} X_\infty$ implies that $\varphi_M(X_n) \xrightarrow{p} \varphi_M(X_\infty)$ for any fixed $M < \infty$. With $|\varphi_M(\cdot)| \le M$, we then have by the preceding proof of bounded convergence that $\varphi_M(X_n) \xrightarrow{L^1} \varphi_M(X_\infty)$. Further, by Minkowski's inequality, also $E|\varphi_M(X_n)| \to E|\varphi_M(X_\infty)|$. By Lemma 1.3.48, our assumption that $\{X_n\}$ are U.I. implies their $L^1$ boundedness, and since $|\varphi_M(x)| \le |x|$ for all $x$, we deduce that for any $M$,

(1.3.12)  $\infty > c := \sup_n E|X_n| \ge \lim_{n\to\infty} E|\varphi_M(X_n)| = E|\varphi_M(X_\infty)| .$

With $|\varphi_M(X_\infty)| \uparrow |X_\infty|$ as $M \uparrow \infty$, it follows from monotone convergence that $E|\varphi_M(X_\infty)| \uparrow E|X_\infty|$, hence $E|X_\infty| \le c < \infty$ in view of (1.3.12). Fixing $\epsilon > 0$, choose $M = M(\epsilon) < \infty$ large enough for $E[|X_\infty|\, I_{\{|X_\infty|>M\}}] < \epsilon$, and further increasing $M$ if needed, by the U.I. condition also $E[|X_n|\, I_{\{|X_n|>M\}}] < \epsilon$ for all $n$. Considering the expectation of the inequality $|x - \varphi_M(x)| \le |x|\, I_{\{|x|>M\}}$ (which holds for all $x \in \mathbb{R}$), with $x = X_n$ and $x = X_\infty$, we obtain that

$E|X_n - X_\infty| \le E|X_n - \varphi_M(X_n)| + E|\varphi_M(X_n) - \varphi_M(X_\infty)| + E|X_\infty - \varphi_M(X_\infty)| \le 2\epsilon + E|\varphi_M(X_n) - \varphi_M(X_\infty)| .$

Recall that $\varphi_M(X_n) \xrightarrow{L^1} \varphi_M(X_\infty)$, hence $\limsup_{n\to\infty} E|X_n - X_\infty| \le 2\epsilon$. Taking $\epsilon \downarrow 0$ completes the proof of $L^1$ convergence of $X_n$ to $X_\infty$.
Suppose now that $X_n \xrightarrow{L^1} X_\infty$. Then, by Jensen's inequality (for the convex function $g(x) = |x|$),

$\big| E|X_n| - E|X_\infty| \big| \le E\big[\, \big|\, |X_n| - |X_\infty| \,\big| \,\big] \le E|X_n - X_\infty| \to 0 .$

That is, $E|X_n| \to E|X_\infty|$ and $X_n$, $n \le \infty$, are integrable.
It thus remains only to show that if $X_n \xrightarrow{p} X_\infty$, all of which are integrable, and $E|X_n| \to E|X_\infty|$, then the collection $\{X_n\}$ is U.I. To this end, for any $M > 1$, let

$\psi_M(x) = |x|\, I_{\{|x| \le M-1\}} + (M-1)(M - |x|)\, I_{(M-1,M]}(|x|) ,$

a piecewise-linear, continuous, bounded function, such that $\psi_M(x) = |x|$ for $|x| \le M-1$ and $\psi_M(x) = 0$ for $|x| \ge M$. Fixing $\epsilon > 0$, with $X_\infty$ integrable, by dominated convergence $E|X_\infty| - E\psi_m(X_\infty) \le \epsilon$ for some finite $m = m(\epsilon)$. Further, as $|\psi_m(x) - \psi_m(y)| \le (m-1)|x - y|$ for any $x, y \in \mathbb{R}$, our assumption $X_n \xrightarrow{p} X_\infty$ implies that $\psi_m(X_n) \xrightarrow{p} \psi_m(X_\infty)$. Hence, by the preceding proof of bounded convergence, followed by Minkowski's inequality, we deduce that $E\psi_m(X_n) \to E\psi_m(X_\infty)$ as $n \to \infty$. Since $|x|\, I_{\{|x|>m\}} \le |x| - \psi_m(x)$ for all $x \in \mathbb{R}$, our assumption $E|X_n| \to E|X_\infty|$ thus implies that for some $n_0 = n_0(\epsilon)$ finite and all $n \ge n_0$ and $M \ge m(\epsilon)$,

$E[|X_n|\, I_{\{|X_n|>M\}}] \le E[|X_n|\, I_{\{|X_n|>m\}}] \le E|X_n| - E\psi_m(X_n) \le E|X_\infty| - E\psi_m(X_\infty) + \epsilon \le 2\epsilon .$

As each $X_n$ is integrable, $E[|X_n|\, I_{\{|X_n|>M\}}] \le 2\epsilon$ for some $M \ge m$ finite and all $n$ (including also $n < n_0(\epsilon)$). The fact that such finite $M = M(\epsilon)$ exists for any $\epsilon > 0$ amounts to the collection $\{X_n\}$ being U.I.
The following exercise builds upon the bounded convergence theorem.

Exercise 1.3.50. Show that for any $X \ge 0$ (do not assume $E(1/X) < \infty$), both
(a) $\lim_{y\to\infty} y\, E[X^{-1}\, I_{\{X>y\}}] = 0$, and
(b) $\lim_{y\downarrow 0} y\, E[X^{-1}\, I_{\{X>y\}}] = 0$.
Next is an example of the advantage of Vitali's convergence theorem over the dominated convergence theorem.

Exercise 1.3.51. On $((0,1], \mathcal{B}_{(0,1]}, U)$, let $X_n(\omega) = (n/\log n)\, I_{(0,n^{-1})}(\omega)$ for $n \ge 2$. Show that the collection $\{X_n\}$ is U.I. such that $X_n \xrightarrow{a.s.} 0$ and $EX_n \to 0$, but there is no random variable $Y$ with finite expectation such that $Y \ge X_n$ for all $n \ge 2$ and almost all $\omega \in (0, 1]$.
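For this exercise the relevant quantities are available in closed form, and the Python computation below (an added, hint-style illustration, not part of the notes) displays both effects: $EX_n = 1/\log n \to 0$, while a natural lower bound on the expectation of any dominating $Y$ grows without bound.

```python
import math

# E[X_n] = (n / log n) * U((0, 1/n)) = 1 / log n, which tends to 0.
print([1 / math.log(n) for n in (10, 10**4, 10**8)])

# Any dominating Y must satisfy Y >= X_n = n/log n on (1/(n+1), 1/n], so
# E[Y] >= sum_n (n/log n) * (1/n - 1/(n+1)), a series comparable to
# sum 1/(n log n), which diverges.  Its partial sums keep growing:
s, checkpoints = 0.0, []
for n in range(2, 10**5):
    s += (n / math.log(n)) * (1 / n - 1 / (n + 1))
    if n in (10**2, 10**3, 10**4):
        checkpoints.append(round(s, 3))
print(checkpoints)  # strictly increasing, with no finite limit
```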
By a simple application of Vitali's convergence theorem you can derive a classical result of analysis, dealing with the convergence of Cesàro averages.

Exercise 1.3.52. Let $U_n$ denote a random variable whose law is the uniform probability measure on $(0,n]$, namely, Lebesgue measure restricted to the interval $(0,n]$ and normalized by $n^{-1}$ to a probability measure. Show that $g(U_n) \xrightarrow{p} 0$ as $n \to \infty$, for any Borel function $g(\cdot)$ such that $|g(y)| \to 0$ as $y \to \infty$. Further, assuming that also $\sup_y |g(y)| < \infty$, deduce that $E|g(U_n)| = n^{-1}\int_0^n |g(y)|\,dy \to 0$ as $n \to \infty$.
Here is Vitali's convergence theorem for a general measure space.

Exercise 1.3.53. Given a measure space $(S, \mathcal{F}, \mu)$, suppose $f_n, f_\infty \in m\mathcal{F}$ with $\mu(|f_n|)$ finite and $\mu(|f_n - f_\infty| > \epsilon) \to 0$ as $n \to \infty$, for each fixed $\epsilon > 0$. Show that $\mu(|f_n - f_\infty|) \to 0$ as $n \to \infty$ if and only if both $\sup_n \mu(|f_n|\, I_{\{|f_n|>k\}}) \to 0$ and $\sup_n \mu(|f_n|\, I_{A_k}) \to 0$ for $k \to \infty$ and some $A_k \in \mathcal{F}$ such that $\mu(A_k^c) < \infty$.
We conclude this subsection with a useful sufficient criterion for uniform integrability and a few of its consequences.

Exercise 1.3.54. Let $f \ge 0$ be a Borel function such that $f(r)/r \to \infty$ as $r \to \infty$. Suppose $Ef(|X_\alpha|) \le C$ for some finite non-random constant $C$ and all $\alpha \in \mathcal{I}$. Show that then $\{X_\alpha : \alpha \in \mathcal{I}\}$ is a uniformly integrable collection of R.V.
Exercise 1.3.55.
(a) Construct random variables $X_n$ such that $\sup_n E(|X_n|) < \infty$, but the collection $\{X_n\}$ is not uniformly integrable.
(b) Show that if $\{X_n\}$ is a U.I. collection and $\{Y_n\}$ is a U.I. collection, then $\{X_n + Y_n\}$ is also U.I.
(c) Show that if $X_n \xrightarrow{p} X_\infty$ and the collection $\{X_n\}$ is uniformly integrable, then $E(X_n I_A) \to E(X_\infty I_A)$ as $n \to \infty$, for any measurable set $A$.
1.3.5. Expectation, density and Riemann integral. Applying the standard machine we now show that, fixing a measure space $(S, \mathcal{F}, \mu)$, each non-negative measurable function $f$ induces a measure $f\mu$ on $(S, \mathcal{F})$, with $f$ being the natural generalization of the concept of probability density function.

Proposition 1.3.56. Fix a measure space $(S, \mathcal{F}, \mu)$. Every $f \in m\mathcal{F}_+$ induces a measure $f\mu$ on $(S, \mathcal{F})$ via $(f\mu)(A) = \mu(fI_A)$ for all $A \in \mathcal{F}$. These measures satisfy the composition relation $h(f\mu) = (hf)\mu$ for all $f, h \in m\mathcal{F}_+$. Further, $h \in L^1(S, \mathcal{F}, f\mu)$ if and only if $fh \in L^1(S, \mathcal{F}, \mu)$, and then $(f\mu)(h) = \mu(fh)$.
Proof. Fixing $f \in m\mathcal{F}_+$, obviously $f\mu$ is a non-negative set function on $(S, \mathcal{F})$ with $(f\mu)(\emptyset) = \mu(fI_\emptyset) = \mu(0) = 0$. To check that $f\mu$ is countably additive, hence a measure, let $A = \bigcup_k A_k$ for a countable collection of disjoint sets $A_k \in \mathcal{F}$. Since $\sum_{k=1}^n fI_{A_k} \uparrow fI_A$, it follows by monotone convergence and linearity of the integral that

$\mu(fI_A) = \lim_{n\to\infty} \mu\Big(\sum_{k=1}^n fI_{A_k}\Big) = \lim_{n\to\infty} \sum_{k=1}^n \mu(fI_{A_k}) = \sum_k \mu(fI_{A_k}) .$

Thus, $(f\mu)(A) = \sum_k (f\mu)(A_k)$, verifying that $f\mu$ is a measure.
Fixing $f \in m\mathcal{F}_+$, we turn to prove that the identity

(1.3.13)  $(f\mu)(hI_A) = \mu(fhI_A)$  for all $A \in \mathcal{F}$,

holds for any $h \in m\mathcal{F}_+$. Since the left side of (1.3.13) is the value assigned to $A$ by the measure $h(f\mu)$ and the right side of this identity is the value assigned to the same set by the measure $(hf)\mu$, this would verify the stated composition rule $h(f\mu) = (hf)\mu$. The proof of (1.3.13) proceeds by applying the standard machine:
Step 1. If $h = I_B$ for $B \in \mathcal{F}$ we have by the definition of the integral of an indicator function that

$(f\mu)(I_B I_A) = (f\mu)(I_{A\cap B}) = (f\mu)(A \cap B) = \mu(fI_{A\cap B}) = \mu(fI_B I_A) ,$

which is (1.3.13).
Step 2. Take $h \in SF_+$ represented as $h = \sum_{l=1}^n c_l I_{B_l}$ with $c_l \ge 0$ and $B_l \in \mathcal{F}$. Then, by Step 1 and the linearity of the integrals with respect to $f\mu$ and with respect to $\mu$, we see that

$(f\mu)(hI_A) = \sum_{l=1}^n c_l (f\mu)(I_{B_l} I_A) = \sum_{l=1}^n c_l \mu(fI_{B_l} I_A) = \mu\Big(f \sum_{l=1}^n c_l I_{B_l} I_A\Big) = \mu(fhI_A) ,$

again yielding (1.3.13).
Step 3. For any $h \in m\mathcal{F}_+$ there exist $h_n \in SF_+$ such that $h_n \uparrow h$. By Step 2 we know that $(f\mu)(h_n I_A) = \mu(fh_n I_A)$ for any $A \in \mathcal{F}$ and all $n$. Further, $h_n I_A \uparrow hI_A$ and $fh_n I_A \uparrow fhI_A$, so by monotone convergence (for both integrals with respect to $f\mu$ and $\mu$),

$(f\mu)(hI_A) = \lim_{n\to\infty} (f\mu)(h_n I_A) = \lim_{n\to\infty} \mu(fh_n I_A) = \mu(fhI_A) ,$

completing the proof of (1.3.13) for all $h \in m\mathcal{F}_+$.
Writing $h \in m\mathcal{F}$ as $h = h_+ - h_-$ with $h_+ = \max(h, 0) \in m\mathcal{F}_+$ and $h_- = -\min(h, 0) \in m\mathcal{F}_+$, it follows from the composition rule that

$\int h_\pm \, d(f\mu) = (f\mu)(h_\pm I_S) = h_\pm(f\mu)(S) = ((h_\pm f)\mu)(S) = \mu(fh_\pm I_S) = \int fh_\pm \, d\mu .$

Observing that $fh_\pm = (fh)_\pm$ when $f \in m\mathcal{F}_+$, we thus deduce that $h$ is $f\mu$-integrable if and only if $fh$ is $\mu$-integrable, in which case $\int h \, d(f\mu) = \int fh \, d\mu$, as stated.
Fixing a measure space $(S, \mathcal{F}, \mu)$, every set $D \in \mathcal{F}$ induces a $\sigma$-algebra $\mathcal{F}_D = \{A \in \mathcal{F} : A \subseteq D\}$. Let $\mu_D$ denote the restriction of $\mu$ to $(D, \mathcal{F}_D)$. As a corollary of Proposition 1.3.56 we express the integral with respect to $\mu_D$ in terms of the original measure $\mu$.
Corollary 1.3.57. Fixing $D \in \mathcal{F}$ let $h_D$ denote the restriction of $h \in m\mathcal{F}$ to $(D, \mathcal{F}_D)$. Then, $\mu_D(h_D) = \mu(hI_D)$ for any $h \in m\mathcal{F}_+$. Further, $h_D \in L^1(D, \mathcal{F}_D, \mu_D)$ if and only if $hI_D \in L^1(S, \mathcal{F}, \mu)$, in which case also $\mu_D(h_D) = \mu(hI_D)$.
Proof. Note that the measure $I_D \mu$ of Proposition 1.3.56 coincides with $\mu_D$ on the $\sigma$-algebra $\mathcal{F}_D$ and assigns to any set $A \in \mathcal{F}$ the same value it assigns to $A \cap D \in \mathcal{F}_D$. By Definition 1.3.1 this implies that $(I_D\mu)(h) = \mu_D(h_D)$ for any $h \in m\mathcal{F}_+$. The corollary is thus a re-statement of the composition and integrability relations of Proposition 1.3.56 for $f = I_D$.
Remark 1.3.58. Corollary 1.3.57 justifies using hereafter the notation $\int_A f\,d\mu$ or $\mu(f; A)$ for $\mu(fI_A)$, or writing $E(X; A) = \int_A X(\omega)\,dP(\omega)$ for $E(XI_A)$. With this notation in place, Proposition 1.3.56 states that each $Z \ge 0$ such that $EZ = 1$ induces a probability measure $Q = ZP$ such that $Q(A) = \int_A Z\,dP$ for all $A \in \mathcal{F}$, and then $E_Q(W) := \int W\,dQ = E(ZW)$ whenever $W \ge 0$ or $ZW \in L^1(\Omega, \mathcal{F}, P)$ (the assumption $EZ = 1$ translates to $Q(\Omega) = 1$).
Proposition 1.3.56 is closely related to the probability density function of Definition 1.2.39. En-route to showing this, we first define the collection of Lebesgue integrable functions.

Definition 1.3.59. Consider Lebesgue's measure $\lambda$ on $(\mathbb{R}, \mathcal{B})$ as in Section 1.1.3, and its completion $\bar\lambda$ on $(\mathbb{R}, \bar{\mathcal{B}})$ (see Theorem 1.1.35). A set $B \in \bar{\mathcal{B}}$ is called Lebesgue measurable, and $f : \mathbb{R} \to \mathbb{R}$ is called a Lebesgue integrable function if $f \in m\bar{\mathcal{B}}$ and $\bar\lambda(|f|) < \infty$. As we show in Proposition 1.3.64, any non-negative Riemann integrable function is also Lebesgue integrable, and the integral values coincide, justifying the notation $\int_B f(x)\,dx$ for $\bar\lambda(f; B)$, where the function $f$ and the set $B$ are both Lebesgue measurable.
Example 1.3.60. Suppose $f$ is a non-negative Lebesgue integrable function such that $\int_{\mathbb{R}} f(x)\,dx = 1$. Then, $\mathcal{P} = f\bar\lambda$ of Proposition 1.3.56 is a probability measure on $(\mathbb{R}, \bar{\mathcal{B}})$ such that $\mathcal{P}(B) = \bar\lambda(f; B) = \int_B f(x)\,dx$ for any Lebesgue measurable set $B$. By Theorem 1.2.36 it is easy to verify that $F(\alpha) = \mathcal{P}((-\infty, \alpha])$ is a distribution function, such that $F(\alpha) = \int_{-\infty}^{\alpha} f(x)\,dx$. That is, $\mathcal{P}$ is the law of a R.V. $X : \Omega \to \mathbb{R}$ whose probability density function is $f$ (c.f. Definition 1.2.39 and Proposition 1.2.44).
Our next theorem allows us to compute expectations of functions of a R.V. $X$ in the space $(\mathbb{R}, \mathcal{B}, \mathcal{P}_X)$, using the law $\mathcal{P}_X$ of $X$ (c.f. Definition 1.2.33) and calculus, instead of working on the original general probability space. One of its immediate consequences is the obvious fact that if $X \stackrel{\mathcal{D}}{=} Y$ then $Eh(X) = Eh(Y)$ for any non-negative Borel function $h$.

Theorem 1.3.61 (Change of variables formula). Let $X : \Omega \to \mathbb{R}$ be a random variable on $(\Omega, \mathcal{F}, P)$ and $h$ a Borel measurable function such that $Eh_+(X) < \infty$ or $Eh_-(X) < \infty$. Then,

(1.3.14)  $\int_\Omega h(X(\omega))\,dP(\omega) = \int_{\mathbb{R}} h(x)\,d\mathcal{P}_X(x) .$
Proof. Apply the standard machine with respect to $h \in m\mathcal{B}$:
Step 1. Taking $h = I_B$ for $B \in \mathcal{B}$, note that by the definition of expectation of indicators

$Eh(X) = E[I_B(X(\omega))] = P(\{\omega : X(\omega) \in B\}) = \mathcal{P}_X(B) = \int h(x)\,d\mathcal{P}_X(x) .$

Step 2. Representing $h \in SF_+$ as $h = \sum_{l=1}^m c_l I_{B_l}$ for $c_l \ge 0$, the identity (1.3.14) follows from Step 1 by the linearity of the expectation in both spaces.
Step 3. For $h \in m\mathcal{B}_+$, consider $h_n \in SF_+$ such that $h_n \uparrow h$. Since $h_n(X(\omega)) \uparrow h(X(\omega))$ for all $\omega$, we get by monotone convergence on $(\Omega, \mathcal{F}, P)$, followed by applying Step 2 for $h_n$, and finally monotone convergence on $(\mathbb{R}, \mathcal{B}, \mathcal{P}_X)$, that

$\int_\Omega h(X(\omega))\,dP(\omega) = \lim_{n\to\infty} \int_\Omega h_n(X(\omega))\,dP(\omega) = \lim_{n\to\infty} \int_{\mathbb{R}} h_n(x)\,d\mathcal{P}_X(x) = \int_{\mathbb{R}} h(x)\,d\mathcal{P}_X(x) ,$

as claimed.
Step 4. Write a Borel function $h(x)$ as $h_+(x) - h_-(x)$. Then, by Step 3, (1.3.14) applies for both non-negative functions $h_+$ and $h_-$. Further, at least one of these two identities involves finite quantities. So, taking their difference and using the linearity of the expectation (in both probability spaces) leads to the same result for $h$.
Combining Theorem 1.3.61 with Example 1.3.60, we show that the expectation of a Borel function of a R.V. $X$ having a density $f_X$ can be computed by performing calculus-type integration on the real line.

Corollary 1.3.62. Suppose that the distribution function of a R.V. $X$ is of the form (1.2.3) for some Lebesgue integrable function $f_X(x)$. Then, for any Borel measurable function $h : \mathbb{R} \to \mathbb{R}$, the R.V. $h(X)$ is integrable if and only if $\int |h(x)| f_X(x)\,dx < \infty$, in which case $Eh(X) = \int h(x) f_X(x)\,dx$. The latter formula applies also for any non-negative Borel function $h(\cdot)$.
Proof. Recall from Example 1.3.60 that the law $\mathcal{P}_X$ of $X$ equals the probability measure $f_X \bar\lambda$. For $h \ge 0$ we thus deduce from Theorem 1.3.61 that $Eh(X) = (f_X\bar\lambda)(h)$, which by the composition rule of Proposition 1.3.56 is given by $\bar\lambda(f_X h) = \int h(x) f_X(x)\,dx$. The decomposition $h = h_+ - h_-$ then completes the proof of the general case.
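In practice Corollary 1.3.62 is what lets one compute $Eh(X)$ by ordinary calculus. As a sanity check (an added Python sketch, with the exponential density and $h(x) = x^2$ as arbitrary choices), a crude midpoint Riemann sum of $\int h(x) f_X(x)\,dx$ recovers the exact second moment $EX^2 = 2$:

```python
import math

# Midpoint-rule approximation of E[h(X)] = integral of h(x) * f_X(x) dx over [a, b].
def expectation(h, f, a=0.0, b=40.0, steps=400000):
    dx = (b - a) / steps
    return sum(h(a + (i + 0.5) * dx) * f(a + (i + 0.5) * dx)
               for i in range(steps)) * dx

# X exponential with density f_X(x) = exp(-x) for x > 0; then E[X^2] = 2.
val = expectation(lambda x: x * x, lambda x: math.exp(-x))
print(abs(val - 2.0) < 1e-3)  # True
```

The truncation at $b = 40$ is harmless here since the density's tail beyond that point is negligible.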
Our next task is to compare Lebesgue's integral (of Definition 1.3.1) with Riemann's integral. To this end recall,

Definition 1.3.63. A function $f : (a,b] \to [0, \infty]$ is Riemann integrable with integral $R(f) < \infty$ if for any $\epsilon > 0$ there exists $\delta = \delta(\epsilon) > 0$ such that $|\sum_l f(x_l)\lambda(J_l) - R(f)| < \epsilon$, for any $x_l \in J_l$ and $\{J_l\}$ a finite partition of $(a,b]$ into disjoint subintervals whose length $\lambda(J_l) < \delta$.
Lebesgue's integral of a function $f$ is based on splitting its range to small intervals and approximating $f(s)$ by a constant on the subset of $S$ for which $f(\cdot)$ falls into each such interval. As such, it accommodates an arbitrary domain $S$ of the function, in contrast to Riemann's integral, where the domain of integration is split into small rectangles, hence limited to $\mathbb{R}^d$. As we next show, even for $S = (a,b]$, if $f \ge 0$ (or more generally, $f$ bounded) is Riemann integrable, then it is also Lebesgue integrable, with the integrals coinciding in value.

Proposition 1.3.64. If $f(x)$ is a non-negative Riemann integrable function on an interval $(a,b]$, then it is also Lebesgue integrable on $(a,b]$ and $\bar\lambda(f) = R(f)$.
Proof. Let $f_*(J) = \inf\{f(x) : x \in J\}$ and $f^*(J) = \sup\{f(x) : x \in J\}$. Varying $x_l$ over $J_l$ we see that

(1.3.15)  $R(f) - \epsilon \le \sum_l f_*(J_l)\lambda(J_l) \le \sum_l f^*(J_l)\lambda(J_l) \le R(f) + \epsilon ,$

for any finite partition $\Pi$ of $(a,b]$ into disjoint subintervals $J_l$ such that $\sup_l \lambda(J_l) \le \delta(\epsilon)$. For any such partition, the non-negative simple functions $\ell(\Pi) = \sum_l f_*(J_l) I_{J_l}$ and $u(\Pi) = \sum_l f^*(J_l) I_{J_l}$ are such that $\ell(\Pi) \le f \le u(\Pi)$, whereas $R(f) - \epsilon \le \lambda(\ell(\Pi)) \le \lambda(u(\Pi)) \le R(f) + \epsilon$, by (1.3.15). Consider the dyadic partitions $\Pi_n$ of $(a,b]$ to $2^n$ intervals of length $(b-a)2^{-n}$ each, such that $\Pi_{n+1}$ is a refinement of $\Pi_n$ for each $n = 1, 2, \ldots$. Note that $u(\Pi_n)(x) \ge u(\Pi_{n+1})(x)$ for all $x \in (a,b]$ and any $n$, hence $u(\Pi_n)(x) \downarrow u_\infty(x)$, a Borel measurable $\mathbb{R}$-valued function (see Exercise 1.2.31). Similarly, $\ell(\Pi_n)(x) \uparrow \ell_\infty(x)$ for all $x \in (a,b]$, with $\ell_\infty$ also Borel measurable, and by the monotonicity of Lebesgue's integral,

$R(f) \le \lim_{n\to\infty} \lambda(\ell(\Pi_n)) \le \lambda(\ell_\infty) \le \lambda(u_\infty) \le \lim_{n\to\infty} \lambda(u(\Pi_n)) \le R(f) .$

We deduce that $\lambda(u_\infty) = \lambda(\ell_\infty) = R(f)$ for $\ell_\infty \le u_\infty$. The set $\{x \in (a,b] : f(x) \ne \ell_\infty(x)\}$ is a subset of the Borel set $\{x \in (a,b] : u_\infty(x) > \ell_\infty(x)\}$ whose Lebesgue measure is zero (see Lemma 1.3.8). Consequently, $f$ is Lebesgue measurable on $(a,b]$ with $\bar\lambda(f) = \lambda(\ell_\infty) = R(f)$, as stated.
Here is an alternative, direct proof of the fact that $Q$ in Remark 1.3.58 is a probability measure.

Exercise 1.3.65. Suppose $E|X| < \infty$ and $A = \bigcup_n A_n$ for some disjoint sets $A_n \in \mathcal{F}$.
(a) Show that then

$\sum_{n=0}^{\infty} E(X; A_n) = E(X; A) ,$

that is, the sum converges absolutely and has the value on the right.
(b) Deduce from this that for $Z \ge 0$ with $EZ$ positive and finite, $Q(A) := E[ZI_A]/EZ$ is a probability measure.
(c) Suppose that $X$ and $Y$ are non-negative random variables on the same probability space $(\Omega, \mathcal{F}, P)$ such that $EX = EY < \infty$. Deduce from the preceding that if $EXI_A = EY I_A$ for any $A$ in a $\pi$-system $\mathcal{A}$ such that $\mathcal{F} = \sigma(\mathcal{A})$, then $X \stackrel{a.s.}{=} Y$.
Exercise 1.3.66. Suppose $\mathcal{P}$ is a probability measure on $(\mathbb{R}, \mathcal{B})$ and $f \ge 0$ is a Borel function such that $\mathcal{P}(B) = \int_B f(x)\,dx$ for $B = (-\infty, b]$, $b \in \mathbb{R}$. Using the $\pi$-$\lambda$ theorem show that this identity applies for all $B \in \mathcal{B}$. Building on this result, use the standard machine to directly prove Corollary 1.3.62 (without Proposition 1.3.56).
1.3.6. Mean, variance and moments. We start with the definition of moments of a random variable.

Definition 1.3.67. If $k$ is a positive integer then $EX^k$ is called the $k$th moment of $X$. When it is well defined, the first moment $m_X = EX$ is called the mean. If $EX^2 < \infty$, then the variance of $X$ is defined to be

(1.3.16)  $\mathrm{Var}(X) = E(X - m_X)^2 = EX^2 - m_X^2 \le EX^2 .$

Since $E(aX + b) = aEX + b$ (linearity of the expectation), it follows from the definition that

(1.3.17)  $\mathrm{Var}(aX + b) = E(aX + b - E(aX + b))^2 = a^2 E(X - m_X)^2 = a^2 \mathrm{Var}(X) .$
We turn to some examples, starting with R.V. having a density.

Example 1.3.68. If $X$ has the exponential distribution then

$EX^k = \int_0^\infty x^k e^{-x}\,dx = k!$

for any $k$ (see Example 1.2.40 for its density). The mean of $X$ is $m_X = 1$ and its variance is $EX^2 - (EX)^2 = 1$. For any $\lambda > 0$, it is easy to see that $T = X/\lambda$ has density $f_T(t) = \lambda e^{-\lambda t} 1_{\{t>0\}}$, called the exponential density of parameter $\lambda$. By (1.3.17) it follows that $m_T = 1/\lambda$ and $\mathrm{Var}(T) = 1/\lambda^2$.
Similarly, if $X$ has a standard normal distribution, then by symmetry, for $k$ odd,

$EX^k = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} x^k e^{-x^2/2}\,dx = 0 ,$

whereas by integration by parts, the even moments satisfy the relation

(1.3.18)  $EX^{2\ell} = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} x^{2\ell-1}\, x e^{-x^2/2}\,dx = (2\ell - 1)\, EX^{2\ell-2} ,$

for $\ell = 1, 2, \ldots$. In particular,

$\mathrm{Var}(X) = EX^2 = 1 .$

Consider $G = \sigma X + \mu$, where $\sigma > 0$ and $\mu \in \mathbb{R}$, whose density is

$f_G(y) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(y-\mu)^2}{2\sigma^2}} .$

We call the law of $G$ the normal distribution of mean $\mu$ and variance $\sigma^2$ (as $EG = \mu$ and $\mathrm{Var}(G) = \sigma^2$).
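The recursion (1.3.18) gives $EX^{2\ell} = (2\ell - 1)!! = 1, 3, 15, 105, \ldots$ for a standard normal $X$. The Python check below (an added illustration, not part of the notes) compares the recursion against a direct Riemann-sum evaluation of the defining integral.

```python
import math

# Direct numeric evaluation of (2*pi)^(-1/2) * integral of x^k * exp(-x^2/2).
def gauss_moment(k, lim=12.0, steps=100000):
    dx = 2 * lim / steps
    return sum((-lim + (i + 0.5) * dx) ** k
               * math.exp(-((-lim + (i + 0.5) * dx) ** 2) / 2)
               for i in range(steps)) * dx / math.sqrt(2 * math.pi)

# Even moments from the recursion E X^{2l} = (2l - 1) * E X^{2l-2}, with E X^0 = 1.
rec = [1.0]
for l in range(1, 5):
    rec.append((2 * l - 1) * rec[-1])
print(rec)  # [1.0, 1.0, 3.0, 15.0, 105.0]
print(all(abs(gauss_moment(2 * l) - rec[l]) < 1e-4 for l in range(1, 5)))  # True
```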
Next are some examples of R.V. with a finite or countable set of possible values.

Example 1.3.69. We say that $B$ has a Bernoulli distribution of parameter $p \in [0,1]$ if $P(B = 1) = 1 - P(B = 0) = p$. Clearly,

$EB = p \cdot 1 + (1-p) \cdot 0 = p .$

Further, $B^2 = B$, so $EB^2 = EB = p$ and

$\mathrm{Var}(B) = EB^2 - (EB)^2 = p - p^2 = p(1-p) .$

Recall that $N$ has a Poisson distribution with parameter $\lambda \ge 0$ if

$P(N = k) = \frac{\lambda^k}{k!} e^{-\lambda}$  for $k = 0, 1, 2, \ldots$

(where in case $\lambda = 0$, $P(N = 0) = 1$). Observe that for $k = 1, 2, \ldots$,

$E(N(N-1)\cdots(N-k+1)) = \sum_{n=k}^{\infty} n(n-1)\cdots(n-k+1) \frac{\lambda^n}{n!} e^{-\lambda} = \lambda^k \sum_{n=k}^{\infty} \frac{\lambda^{n-k}}{(n-k)!} e^{-\lambda} = \lambda^k .$

Using this formula, it follows that $EN = \lambda$ while

$\mathrm{Var}(N) = EN^2 - (EN)^2 = \lambda .$
The random variable Z is said to have a Geometric distribution of success probability
p ∈ (0, 1) if
P(Z = k) = p(1 − p)^{k−1}  for k = 1, 2, . . .
This is the distribution of the number of independent coin tosses needed till the first
appearance of a Head, or more generally, the number of independent trials till the
first occurrence in this sequence of a specific event whose probability is p. Then,
EZ = Σ_{k≥1} k p(1 − p)^{k−1} = 1/p,
EZ(Z − 1) = Σ_{k≥2} k(k − 1) p(1 − p)^{k−1} = 2(1 − p)/p²,
Var(Z) = EZ(Z − 1) + EZ − (EZ)² = (1 − p)/p².
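These sums can likewise be checked by truncating the series; the truncation point below is an ad-hoc choice:

```python
def geom_moments(p, terms=3000):
    # Truncated series for EZ and E[Z(Z-1)] when P(Z=k) = p(1-p)^{k-1}.
    ez = sum(k * p * (1 - p) ** (k - 1) for k in range(1, terms))
    ezz = sum(k * (k - 1) * p * (1 - p) ** (k - 1) for k in range(2, terms))
    return ez, ezz

p = 0.3
ez, ezz = geom_moments(p)
assert abs(ez - 1 / p) < 1e-9                              # EZ = 1/p
assert abs(ezz - 2 * (1 - p) / p ** 2) < 1e-9              # EZ(Z-1) = 2(1-p)/p^2
assert abs(ezz + ez - ez ** 2 - (1 - p) / p ** 2) < 1e-9   # Var(Z) = (1-p)/p^2
```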
Exercise 1.3.70. Consider a counting random variable N_n = Σ_{i=1}^n I_{A_i}.
(a) Provide a formula for Var(N_n) in terms of P(A_i) and P(A_i ∩ A_j) for i ≠ j.
(b) Using your formula, find the variance of the number N_n of empty boxes
when distributing at random r distinct balls among n distinct boxes, where
each of the possible n^r assignments of balls to boxes is equally likely.
Exercise 1.3.71. Show that if P(X ∈ [a, b]) = 1, then Var(X) ≤ (b − a)²/4.
1.4. Independence and product measures
In Subsection 1.4.1 we build up the notion of independence, from events to random
variables via σ-algebras, relating it to the structure of the joint distribution function.
Subsection 1.4.2 considers finite product measures associated with the joint
law of independent R.V.-s. This is followed by Kolmogorov's extension theorem,
which we use in order to construct infinitely many independent R.V.-s. Subsection
1.4.3 is about Fubini's theorem and its applications for computing the expectation
of functions of independent R.V.-s.
1.4.1. Definition and conditions for independence. Recall the classical
definition that two events A, B ∈ F are independent if P(A ∩ B) = P(A)P(B).
For example, suppose two fair dice are thrown (i.e. Ω = {1, 2, 3, 4, 5, 6}² with
F = 2^Ω and the uniform probability measure). Let E_1 = {sum of the two dice is 6}
and E_2 = {first die is 4}; then E_1 and E_2 are not independent, since
P(E_1) = P({(1,5), (2,4), (3,3), (4,2), (5,1)}) = 5/36,  P(E_2) = P({ω : ω_1 = 4}) = 1/6,
and
P(E_1 ∩ E_2) = P({(4, 2)}) = 1/36 ≠ P(E_1)P(E_2).
However, one can check that E_2 and E_3 = {sum of the two dice is 7} are independent.
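The dice computation is small enough to enumerate exhaustively; a sketch in exact rational arithmetic:

```python
from fractions import Fraction
from itertools import product

omega = list(product(range(1, 7), repeat=2))          # two fair dice
def P(event):                                         # uniform measure on omega
    return Fraction(sum(1 for w in omega if event(w)), len(omega))

E1 = lambda w: w[0] + w[1] == 6                       # sum of the two dice is 6
E2 = lambda w: w[0] == 4                              # first die is 4
E3 = lambda w: w[0] + w[1] == 7                       # sum of the two dice is 7

assert P(E1) == Fraction(5, 36) and P(E2) == Fraction(1, 6)
assert P(lambda w: E1(w) and E2(w)) != P(E1) * P(E2)  # E1, E2 not independent
assert P(lambda w: E2(w) and E3(w)) == P(E2) * P(E3)  # E2, E3 independent
```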
In analogy with the independence of events we define the independence of two
random vectors and, more generally, that of two σ-algebras.
Definition 1.4.1. Two σ-algebras H, G ⊆ F are independent (also denoted
P-independent), if
P(G ∩ H) = P(G)P(H),  ∀G ∈ G, H ∈ H,
that is, two σ-algebras are independent if every event in one of them is independent
of every event in the other.
The random vectors X = (X_1, . . . , X_n) and Y = (Y_1, . . . , Y_m) on the same
probability space are independent if the corresponding σ-algebras σ(X_1, . . . , X_n)
and σ(Y_1, . . . , Y_m) are independent.
Remark. Our definition of independence of random variables is consistent with
that of independence of events. For example, if the events A, B ∈ F are independent,
then so are I_A and I_B. Indeed, we need to show that σ(I_A) = {∅, Ω, A, A^c}
and σ(I_B) = {∅, Ω, B, B^c} are independent. Since P(∅) = 0 and ∅ is invariant
under intersections, whereas P(Ω) = 1 and all events are invariant under intersection
with Ω, it suffices to consider G ∈ {A, A^c} and H ∈ {B, B^c}. We check independence
first for G = A and H = B^c. Noting that A is the union of the disjoint events
A ∩ B and A ∩ B^c, we have that
P(A ∩ B^c) = P(A) − P(A ∩ B) = P(A)[1 − P(B)] = P(A)P(B^c),
where the middle equality is due to the assumed independence of A and B. The
proof for all other choices of G and H is very similar.
More generally, we define the mutual independence of events as follows.
Definition 1.4.2. Events A_i ∈ F are P-mutually independent if for any L < ∞
and distinct indices i_1, i_2, . . . , i_L,
P(A_{i_1} ∩ A_{i_2} ∩ · · · ∩ A_{i_L}) = ∏_{k=1}^L P(A_{i_k}).
We next generalize the definition of mutual independence to σ-algebras, random
variables and beyond. This definition applies to the mutual independence of both
a finite and an infinite number of such objects.
Definition 1.4.3. We say that the collections of events A_α ⊆ F with α ∈ J
(possibly an infinite index set) are P-mutually independent if for any L < ∞ and
distinct α_1, α_2, . . . , α_L ∈ J,
P(A_1 ∩ A_2 ∩ · · · ∩ A_L) = ∏_{k=1}^L P(A_k),  ∀A_k ∈ A_{α_k}, k = 1, . . . , L.
We say that random variables X_α, α ∈ J, are P-mutually independent if the
σ-algebras σ(X_α), α ∈ J, are P-mutually independent.
When the probability measure P in consideration is clear from the context, we say
that random variables, or collections of events, are mutually independent.
Our next theorem gives a sufficient condition for the mutual independence of
a collection of σ-algebras which, as we later show, greatly simplifies the task of
checking independence.
Theorem 1.4.4. Suppose G_i = σ(A_i) ⊆ F for i = 1, 2, . . . , n, where the A_i are
π-systems. Then, a sufficient condition for the mutual independence of the G_i is
that A_i, i = 1, . . . , n, are mutually independent.
Proof. Let H = A_{i_1} ∩ A_{i_2} ∩ · · · ∩ A_{i_L}, where i_1, i_2, . . . , i_L are distinct
elements from {1, 2, . . . , n − 1} and A_i ∈ A_i for i = 1, . . . , n − 1. Consider the two
finite measures μ_1(A) = P(A ∩ H) and μ_2(A) = P(H)P(A) on the measurable
space (Ω, G_n). Note that
μ_1(Ω) = P(Ω ∩ H) = P(H) = P(H)P(Ω) = μ_2(Ω).
If A ∈ A_n, then by the mutual independence of A_i, i = 1, . . . , n, it follows that
μ_1(A) = P(A_{i_1} ∩ A_{i_2} ∩ · · · ∩ A_{i_L} ∩ A) = (∏_{k=1}^L P(A_{i_k})) P(A)
= P(A_{i_1} ∩ A_{i_2} ∩ · · · ∩ A_{i_L}) P(A) = μ_2(A).
Since the finite measures μ_1(·) and μ_2(·) agree on the π-system A_n and on Ω, it
follows that μ_1 = μ_2 on G_n = σ(A_n) (see Proposition 1.1.39). That is,
P(G ∩ H) = P(G)P(H) for any G ∈ G_n.
Since this applies for arbitrary A_i ∈ A_i, i = 1, . . . , n − 1, in view of Definition
1.4.3 we have just proved that if A_1, A_2, . . . , A_n are mutually independent, then
A_1, A_2, . . . , A_{n−1}, G_n are mutually independent.
Applying the latter relation for G_n, A_1, . . . , A_{n−1} (which are mutually independent,
since Definition 1.4.3 is invariant to a permutation of the order of the collections),
we get that G_n, A_1, . . . , A_{n−2}, G_{n−1} are mutually independent. After n such
iterations we have the stated result. □
Because the mutual independence of the collections of events A_α, α ∈ J, amounts
to the mutual independence of any finite number of these collections, we have the
immediate consequence:
Corollary 1.4.5. If π-systems of events A_α, α ∈ J, are mutually independent,
then σ(A_α), α ∈ J, are also mutually independent.
Another immediate consequence deals with the closure of mutual independence
under projections.
Corollary 1.4.6. If the π-systems of events H_{α,β}, (α, β) ∈ Γ, are mutually
independent, then the σ-algebras G_α = σ(∪_β H_{α,β}) are also mutually independent.
Proof. Let A_α be the collection of sets of the form A = ∩_{j=1}^m H_j, where
H_j ∈ H_{α,β_j} for some m < ∞ and distinct β_1, . . . , β_m. Since the H_{α,β} are
π-systems, it follows that so is A_α for each α. Since a finite intersection of sets
A_{α_k} ∈ A_{α_k}, k = 1, . . . , L, is merely a finite intersection of sets from distinct
collections H_{α_k,β_j(k)}, the assumed mutual independence of the H_{α,β} implies
the mutual independence of the A_α. By Corollary 1.4.5, this in turn implies the
mutual independence of σ(A_α). To complete the proof, simply note that for any
β, each H ∈ H_{α,β} is also an element of A_α, implying that G_α ⊆ σ(A_α). □
Relying on the preceding corollary you can now establish the following characterization
of independence (which is key to proving Kolmogorov's 0-1 law).
Exercise 1.4.7. Show that if for each n ≥ 1 the σ-algebras F_n^X = σ(X_1, . . . , X_n)
and σ(X_{n+1}) are P-mutually independent, then the random variables X_1, X_2, X_3, . . .
are P-mutually independent. Conversely, show that if X_1, X_2, X_3, . . . are independent,
then for each n ≥ 1 the σ-algebras F_n^X and T_n^X = σ(X_r, r > n) are independent.
It is easy to check that a P-trivial σ-algebra H is P-independent of any other
σ-algebra G ⊆ F. Conversely, as we show next, independence is a great tool for
proving that a σ-algebra is P-trivial.
Lemma 1.4.8. If each of the σ-algebras G_k ⊆ G_{k+1} is P-independent of a
σ-algebra H ⊆ σ(∪_{k≥1} G_k), then H is P-trivial.
Remark. In particular, if H is P-independent of itself, then H is P-trivial.
Proof. Since G_k ⊆ G_{k+1} for all k and the G_k are σ-algebras, it follows that
A = ∪_{k≥1} G_k is a π-system. The assumed P-independence of H and G_k for each
k yields the P-independence of H and A. Thus, by Theorem 1.4.4 we have that
H and σ(A) are P-independent. Since H ⊆ σ(A), it follows that in particular
P(H) = P(H ∩ H) = P(H) · P(H) for each H ∈ H. So, necessarily P(H) ∈ {0, 1}
for all H ∈ H. That is, H is P-trivial. □
We next define the tail σ-algebra of a stochastic process.
Definition 1.4.9. For a stochastic process {X_k} we set T_n^X = σ(X_r, r > n)
and call T^X = ∩_n T_n^X the tail σ-algebra of the process {X_k}.
As we next see, the P-triviality of the tail σ-algebra of independent random
variables is an immediate consequence of Lemma 1.4.8. This result, due to Kolmogorov,
is just one of the many 0-1 laws that exist in probability theory.
Corollary 1.4.10 (Kolmogorov's 0-1 law). If {X_k} are P-mutually independent,
then the corresponding tail σ-algebra T^X is P-trivial.
Proof. Note that F_k^X ⊆ F_{k+1}^X and T^X ⊆ F_∞^X = σ(X_k, k ≥ 1) = σ(∪_{k≥1} F_k^X)
(see Exercise 1.2.14 for the latter identity). Further, recall from Exercise 1.4.7 that
for any n ≥ 1, the σ-algebras F_n^X and T_n^X are P-mutually independent. Hence,
each of the σ-algebras F_k^X is also P-mutually independent of the tail σ-algebra
T^X, which by Lemma 1.4.8 is thus P-trivial. □
Out of Corollary 1.4.6 we deduce that functions of disjoint collections of mutually
independent random variables are mutually independent.
Corollary 1.4.11. If R.V.-s X_{k,j}, 1 ≤ k ≤ m, 1 ≤ j ≤ l(k), are mutually
independent and f_k : R^{l(k)} → R are Borel functions, then Y_k = f_k(X_{k,1}, . . . , X_{k,l(k)})
are mutually independent random variables for k = 1, . . . , m.
Proof. We apply Corollary 1.4.6 for the index set Γ = {(k, j) : 1 ≤ k ≤ m,
1 ≤ j ≤ l(k)} and the mutually independent π-systems H_{k,j} = σ(X_{k,j}), to deduce
the mutual independence of G_k = σ(∪_j H_{k,j}). Recall that G_k = σ(X_{k,j}, 1 ≤ j ≤ l(k))
and σ(Y_k) ⊆ G_k (see Definition 1.2.12 and Exercise 1.2.32). We complete the
proof by noting that the Y_k are mutually independent if and only if the σ(Y_k) are
mutually independent. □
Our next result is an application of Theorem 1.4.4 to the independence of random
variables.
Corollary 1.4.12. Real-valued random variables X_1, X_2, . . . , X_m on the same
probability space (Ω, F, P) are mutually independent if and only if
(1.4.1)  P(X_1 ≤ x_1, . . . , X_m ≤ x_m) = ∏_{i=1}^m P(X_i ≤ x_i),  ∀x_1, . . . , x_m ∈ R.
Proof. Let A_i denote the collection of subsets of Ω of the form X_i^{−1}((−∞, b])
for b ∈ R. Recall that A_i generates σ(X_i) (see Exercise 1.2.11), whereas (1.4.1)
states that the π-systems A_i are mutually independent (by continuity from below
of P, taking x_i ↑ ∞ for i ≠ i_1, i ≠ i_2, . . . , i ≠ i_L has the same effect as taking a
subset of distinct indices i_1, . . . , i_L from {1, . . . , m}). So, just apply Theorem 1.4.4
to conclude the proof. □
The condition (1.4.1) for mutual independence of R.V.-s is further simplified when
these variables are either discrete valued or have a density.
Exercise 1.4.13. Suppose (X_1, . . . , X_m) are random variables and (S_1, . . . , S_m)
are countable sets such that P(X_i ∈ S_i) = 1 for i = 1, . . . , m. Show that if
P(X_1 = x_1, . . . , X_m = x_m) = ∏_{i=1}^m P(X_i = x_i)
whenever x_i ∈ S_i, i = 1, . . . , m, then X_1, . . . , X_m are mutually independent.
Exercise 1.4.14. Suppose the random vector X = (X_1, . . . , X_m) has a joint
probability density function f_X(x) = g_1(x_1) · · · g_m(x_m). That is,
P((X_1, . . . , X_m) ∈ A) = ∫_A g_1(x_1) · · · g_m(x_m) dx_1 · · · dx_m,  ∀A ∈ B_{R^m},
where the g_i are non-negative, Lebesgue integrable functions. Show that then
X_1, . . . , X_m are mutually independent.
Beware that pairwise independence (of each pair A_k, A_j for k ≠ j) does not imply
mutual independence of all the events in question, and the same applies to three or
more random variables. Here is an illustrating example.
Exercise 1.4.15. Consider the sample space Ω = {0, 1, 2}² with the probability
measure on (Ω, 2^Ω) that assigns equal probability (i.e. 1/9) to each possible value
of ω = (ω_1, ω_2) ∈ Ω. Then, X(ω) = ω_1 and Y(ω) = ω_2 are independent R.V.-s,
each taking the values 0, 1, 2 with equal (i.e. 1/3) probability. Define Z_0 = X,
Z_1 = (X + Y) mod 3 and Z_2 = (X + 2Y) mod 3.
(a) Show that Z_0 is independent of Z_1, Z_0 is independent of Z_2, and Z_1 is
independent of Z_2, but if we know the value of Z_0 and Z_1, then we also
know Z_2.
(b) Construct four {−1, 1}-valued random variables such that any three of
them are independent but all four are not.
Hint: Consider products of independent random variables.
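A brute-force enumeration confirms the pairwise independence claimed in part (a); the closing loop exhibits the identity Z_2 = (2Z_1 − Z_0) mod 3 (not stated in the exercise, but one way to see that Z_0 and Z_1 determine Z_2):

```python
from fractions import Fraction
from itertools import product

omega = list(product(range(3), repeat=2))   # 9 equally likely outcomes
Z = [lambda w: w[0],                        # Z_0 = X
     lambda w: (w[0] + w[1]) % 3,           # Z_1 = (X + Y) mod 3
     lambda w: (w[0] + 2 * w[1]) % 3]       # Z_2 = (X + 2Y) mod 3

def P(event):
    return Fraction(sum(1 for w in omega if event(w)), len(omega))

for i in range(3):                          # every pair (Z_i, Z_j) is independent
    for j in range(i + 1, 3):
        for a in range(3):
            for b in range(3):
                assert (P(lambda w: Z[i](w) == a and Z[j](w) == b)
                        == P(lambda w: Z[i](w) == a) * P(lambda w: Z[j](w) == b))

for w in omega:                             # yet Z_0, Z_1 determine Z_2
    assert Z[2](w) == (2 * Z[1](w) - Z[0](w)) % 3
```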
Here is a somewhat counter-intuitive example about tail σ-algebras, followed by
an elaboration on the theme of Corollary 1.4.11.
Exercise 1.4.16. Let σ(A, A′) denote the smallest σ-algebra G such that any
function measurable on A or on A′ is also measurable on G. Let W_0, W_1, W_2, . . .
be independent random variables with P(W_n = +1) = P(W_n = −1) = 1/2 for all
n. For each n ≥ 1, define X_n := W_0 W_1 · · · W_n.
(a) Prove that the variables X_1, X_2, . . . are independent.
(b) Show that S = σ(σ(W_0), T^X) is a strict subset of the σ-algebra
T = ∩_n σ(σ(W_0), T_n^X).
Hint: Show that W_0 ∈ mT is independent of S.
Exercise 1.4.17. Consider random variables (X_{i,j}, 1 ≤ i, j ≤ n) on the same
probability space. Suppose that the σ-algebras R_1, . . . , R_n are P-mutually
independent, where R_i = σ(X_{i,j}, 1 ≤ j ≤ n) for i = 1, . . . , n. Suppose further that
the σ-algebras C_1, . . . , C_n are P-mutually independent, where C_j = σ(X_{i,j}, 1 ≤ i ≤ n).
Prove that the random variables (X_{i,j}, 1 ≤ i, j ≤ n) must then be P-mutually
independent.
We conclude this subsection with an application in number theory.
Exercise 1.4.18. Recall Euler's zeta-function, which for real s > 1 is given by
ζ(s) = Σ_{k=1}^∞ k^{−s}. Fixing such s, let X and Y be independent random variables
with P(X = k) = P(Y = k) = k^{−s}/ζ(s) for k = 1, 2, . . ..
(a) Show that the events D_p = {X is divisible by p}, with p a prime number,
are P-mutually independent.
(b) By considering the event {X = 1}, provide a probabilistic explanation of
Euler's formula 1/ζ(s) = ∏_p (1 − 1/p^s).
(c) Show that the probability that no perfect square other than 1 divides X
is precisely 1/ζ(2s).
(d) Show that P(G = k) = k^{−2s}/ζ(2s), where G is the greatest common
divisor of X and Y.
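Euler's formula from part (b) is easy to probe numerically; in the sketch below the truncation points (200000 series terms, primes below 1000) are arbitrary accuracy choices:

```python
import math

s = 2                                     # any real s > 1 works; s = 2 is convenient
zeta = sum(k ** -s for k in range(1, 200000))          # truncated zeta(s)
primes = [p for p in range(2, 1000)                    # primes by trial division
          if all(p % d for d in range(2, int(p ** 0.5) + 1))]
euler_product = math.prod(1 - p ** -s for p in primes)
assert abs(euler_product - 1 / zeta) < 1e-3            # 1/zeta(s) = prod_p (1 - p^{-s})
```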
1.4.2. Product measures and Kolmogorov's theorem. Recall Example
1.1.20 that given two measurable spaces (Ω_1, F_1) and (Ω_2, F_2), the product
(measurable) space (Ω, F) consists of Ω = Ω_1 × Ω_2 and F = F_1 × F_2, which is the
same as F = σ(A) for
A = {∪_{j=1}^m A_j × B_j : A_j ∈ F_1, B_j ∈ F_2, m < ∞},
where throughout, such finite unions are taken to be of pairwise disjoint subsets
of Ω.
We now construct product measures on such product spaces, first for two, then
for finitely many, probability (or even σ-finite) measures. As we show thereafter,
these product measures are associated with the joint law of independent R.V.-s.
Theorem 1.4.19. Given two σ-finite measures μ_i on (Ω_i, F_i), i = 1, 2, there exists
a unique σ-finite measure μ on the product space (Ω, F) such that
μ(∪_{j=1}^m A_j × B_j) = Σ_{j=1}^m μ_1(A_j) μ_2(B_j),  ∀A_j ∈ F_1, B_j ∈ F_2, m < ∞.
We denote μ = μ_1 × μ_2 and call it the product of the measures μ_1 and μ_2.
Proof. By Carathéodory's extension theorem, it suffices to show that A is an
algebra on which μ is countably additive (see Theorem 1.1.30 for the case of finite
measures). To this end, note that Ω = Ω_1 × Ω_2 ∈ A. Further, A is closed under
intersections, since
(∪_{j=1}^m A_j × B_j) ∩ (∪_{i=1}^n C_i × D_i) = ∪_{i,j} [(A_j × B_j) ∩ (C_i × D_i)]
= ∪_{i,j} (A_j ∩ C_i) × (B_j ∩ D_i).
It is also closed under complementation, for
(∪_{j=1}^m A_j × B_j)^c = ∩_{j=1}^m [(A_j^c × B_j) ∪ (A_j × B_j^c) ∪ (A_j^c × B_j^c)].
By DeMorgan's law, A is an algebra.
Note that countable unions of disjoint elements of A are also countable unions of
disjoint elements of the collection R = {A × B : A ∈ F_1, B ∈ F_2} of measurable
rectangles. Hence, if we show that
(1.4.2)  Σ_{j=1}^m μ_1(A_j) μ_2(B_j) = Σ_i μ_1(C_i) μ_2(D_i),
whenever ∪_{j=1}^m A_j × B_j = ∪_i (C_i × D_i) for some m < ∞, A_j, C_i ∈ F_1 and
B_j, D_i ∈ F_2 (all unions being of disjoint sets), then we deduce that the value of
μ(E) is independent of the representation we choose for E ∈ A in terms of
measurable rectangles, and further that μ is countably additive on A. To this end,
note that the preceding set identity amounts to
Σ_{j=1}^m I_{A_j}(x) I_{B_j}(y) = Σ_i I_{C_i}(x) I_{D_i}(y),  ∀x ∈ Ω_1, y ∈ Ω_2.
Hence, fixing x ∈ Ω_1, we have that φ(y) = Σ_{j=1}^m I_{A_j}(x) I_{B_j}(y) ∈ SF_+ is the
monotone increasing limit of φ_n(y) = Σ_{i=1}^n I_{C_i}(x) I_{D_i}(y) ∈ SF_+ as n → ∞.
Thus, by linearity of the integral with respect to μ_2 and monotone convergence,
g(x) := Σ_{j=1}^m μ_2(B_j) I_{A_j}(x) = μ_2(φ) = lim_n μ_2(φ_n) = lim_n Σ_{i=1}^n I_{C_i}(x) μ_2(D_i).
We deduce that the non-negative g(x) ∈ mF_1 is the monotone increasing limit of
the non-negative measurable functions h_n(x) = Σ_{i=1}^n μ_2(D_i) I_{C_i}(x). Hence, by
the same reasoning,
Σ_{j=1}^m μ_2(B_j) μ_1(A_j) = μ_1(g) = lim_n μ_1(h_n) = Σ_i μ_2(D_i) μ_1(C_i),
proving (1.4.2) and the theorem. □
It follows from Theorem 1.4.19, by induction on n, that given any finite collection
of σ-finite measure spaces (Ω_i, F_i, μ_i), i = 1, . . . , n, there exists a unique product
measure μ_1 × · · · × μ_n on the product space (Ω, F) (i.e., Ω = Ω_1 × · · · × Ω_n and
F = σ({A_1 × · · · × A_n : A_i ∈ F_i, i = 1, . . . , n})), such that
(1.4.3)  (μ_1 × · · · × μ_n)(A_1 × · · · × A_n) = ∏_{i=1}^n μ_i(A_i),  ∀A_i ∈ F_i, i = 1, . . . , n.
Remark 1.4.20. A notable special case of this construction is when Ω_i = R with
the Borel σ-algebra and the Lebesgue measure of Section 1.1.3. The product space
is then R^n with its Borel σ-algebra, and the product measure is λ_n, the Lebesgue
measure on R^n.
The notion of the law P_X of a real-valued random variable X, as in Definition
1.2.33, naturally extends to the joint law P_X of a random vector X = (X_1, . . . , X_n),
which is the probability measure P_X = P ∘ X^{−1} on (R^n, B_{R^n}).
We next characterize the joint law of independent random variables X_1, . . . , X_n
as the product of the laws of X_i for i = 1, . . . , n.
Proposition 1.4.21. Random variables X_1, . . . , X_n on the same probability space,
having laws μ_i = P_{X_i}, are mutually independent if and only if their joint law is
μ_1 × · · · × μ_n.
Proof. By Definition 1.4.3 and the identity (1.4.3), if X_1, . . . , X_n are mutually
independent, then for B_i ∈ B,
P_X(B_1 × · · · × B_n) = P(X_1 ∈ B_1, . . . , X_n ∈ B_n)
= ∏_{i=1}^n P(X_i ∈ B_i) = ∏_{i=1}^n μ_i(B_i) = (μ_1 × · · · × μ_n)(B_1 × · · · × B_n).
This shows that the law of (X_1, . . . , X_n) and the product measure μ_1 × · · · × μ_n
agree on the collection of all measurable rectangles B_1 × · · · × B_n, a π-system that
generates B_{R^n} (see Exercise 1.1.21). Consequently, these two probability measures
agree on B_{R^n} (c.f. Proposition 1.1.39).
Conversely, if P_X = μ_1 × · · · × μ_n, then by the same reasoning, for Borel sets B_i,
P(∩_{i=1}^n {ω : X_i(ω) ∈ B_i}) = P_X(B_1 × · · · × B_n) = (μ_1 × · · · × μ_n)(B_1 × · · · × B_n)
= ∏_{i=1}^n μ_i(B_i) = ∏_{i=1}^n P({ω : X_i(ω) ∈ B_i}),
which amounts to the mutual independence of X_1, . . . , X_n. □
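For discrete laws, the product-measure characterization reduces to finite sums; a sketch with two assumed toy laws:

```python
from fractions import Fraction
from itertools import product

mu1 = {0: Fraction(1, 4), 1: Fraction(3, 4)}      # law of X_1 (toy choice)
mu2 = {-1: Fraction(1, 3), 2: Fraction(2, 3)}     # law of X_2 (toy choice)
# the product measure mu1 x mu2 on the four-point product space:
joint = {(x, y): p * q for (x, p), (y, q) in product(mu1.items(), mu2.items())}

# its marginals recover mu1 and mu2 ...
assert all(sum(joint[(x, y)] for y in mu2) == mu1[x] for x in mu1)
assert all(sum(joint[(x, y)] for x in mu1) == mu2[y] for y in mu2)
# ... and rectangle probabilities factor, as in Proposition 1.4.21
B1, B2 = {1}, {-1, 2}
assert (sum(joint[(x, y)] for x in B1 for y in B2)
        == sum(mu1[x] for x in B1) * sum(mu2[y] for y in B2))
```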
We wish to extend the construction of product measures to that of an infinite
collection of independent random variables. To this end, let N = {1, 2, . . .} denote
the set of natural numbers and R^N = {x = (x_1, x_2, . . .) : x_i ∈ R} denote the
collection of all infinite sequences of real numbers. We equip R^N with the product
σ-algebra B_c = σ(R), generated by the collection R of all finite dimensional
measurable rectangles (also called cylinder sets), that is, sets of the form
{x : x_1 ∈ B_1, . . . , x_n ∈ B_n}, where B_i ∈ B, i = 1, . . . , n ∈ N (e.g. see Example
1.1.19).
Kolmogorov's extension theorem provides the existence of a unique probability
measure P on (R^N, B_c) whose projections coincide with a given consistent sequence
of probability measures μ_n on (R^n, B_{R^n}).
Theorem 1.4.22 (Kolmogorov's extension theorem). Suppose we are given
probability measures μ_n on (R^n, B_{R^n}) that are consistent, that is,
μ_{n+1}(B_1 × · · · × B_n × R) = μ_n(B_1 × · · · × B_n),  ∀B_i ∈ B, i = 1, . . . , n < ∞.
Then, there is a unique probability measure P on (R^N, B_c) such that
(1.4.4)  P({ω : ω_i ∈ B_i, i = 1, . . . , n}) = μ_n(B_1 × · · · × B_n),  ∀B_i ∈ B, i ≤ n < ∞.
Proof. (sketch only) We take a similar approach as in the proof of Theorem
1.4.19. That is, we use (1.4.4) to define the non-negative set function P_0 on the
collection R of all finite dimensional measurable rectangles, where by the consistency
of {μ_n} the value of P_0 is independent of the specific representation chosen
for a set in R. Then, we extend P_0 to a finitely additive set function on the algebra
A = {∪_{j=1}^m E_j : E_j ∈ R, m < ∞},
in the same linear manner we used when proving Theorem 1.4.19. Since A generates
B_c and P_0(R^N) = μ_n(R^n) = 1, by Carathéodory's extension theorem it suffices to
check that P_0 is countably additive on A. The countable additivity of P_0 is verified
by the method we already employed when dealing with Lebesgue's measure. That
is, by the remark after Lemma 1.1.31, it suffices to prove that P_0(H_n) → 0 whenever
H_n ∈ A and H_n ↓ ∅. The proof by contradiction of the latter, adapting the
argument of Lemma 1.1.31, is based on approximating each H ∈ A by a finite
union J_k ⊆ H of compact rectangles, such that P_0(H \ J_k) → 0 as k → ∞. This is
done for example in [Bil95, Page 490]. □
Example 1.4.23. To systematically construct an infinite sequence of independent
random variables X_i of prescribed laws P_{X_i} = μ_i, we apply Kolmogorov's
extension theorem for the product measures ν_n = μ_1 × · · · × μ_n constructed
following Theorem 1.4.19 (where it is by definition that the sequence ν_n is consistent).
Alternatively, for infinite product measures one can take arbitrary probability
spaces (Ω_i, F_i, μ_i) and directly show by contradiction that P_0(H_n) → 0 whenever
H_n ∈ A and H_n ↓ ∅ (for more details, see [Str93, Exercise 1.1.14]).
Remark. As we shall find in Sections 6.1 and 7.1, Kolmogorov's extension theorem
is the key to the study of stochastic processes, where it relates the law of the process
to its finite dimensional distributions. Certain properties of R are key to the proof
of Kolmogorov's extension theorem, which indeed is false if (R, B) is replaced with
an arbitrary measurable space (S, S) (see the discussions in [Dur10, Subsection
2.1.4] and [Dud89, notes for Section 12.1]). Nevertheless, as you show next, the
conclusion of this theorem applies for any B-isomorphic measurable space (S, S).
Definition 1.4.24. Two measurable spaces (S, S) and (T, T) are isomorphic if
there exists a one to one and onto measurable mapping between them whose inverse
is also a measurable mapping. A measurable space (S, S) is B-isomorphic if it is
isomorphic to a Borel subset T of R equipped with the induced Borel σ-algebra
T = {B ∩ T : B ∈ B}.
Here is our generalized version of Kolmogorov's extension theorem.
Corollary 1.4.25. Given a measurable space (S, S), let S^N denote the collection
of all infinite sequences of elements of S, equipped with the product σ-algebra S_c
generated by the collection of all cylinder sets of the form {s : s_1 ∈ A_1, . . . , s_n ∈ A_n},
where A_i ∈ S for i = 1, . . . , n. If (S, S) is B-isomorphic, then for any consistent
sequence of probability measures ν_n on (S^n, S^n) (that is, ν_{n+1}(A_1 × · · · × A_n × S) =
ν_n(A_1 × · · · × A_n) for all n and A_i ∈ S), there exists a unique probability measure
Q on (S^N, S_c) such that for all n and A_i ∈ S,
(1.4.5)  Q({s : s_i ∈ A_i, i = 1, . . . , n}) = ν_n(A_1 × · · · × A_n).
Next comes a guided proof of Corollary 1.4.25 out of Theorem 1.4.22.
Exercise 1.4.26.
(a) Verify that our proof of Theorem 1.4.22 applies in case (R, B) is replaced
by T ∈ B equipped with the induced Borel σ-algebra T (with R^N and B_c
replaced by T^N and T_c, respectively).
(b) Fixing such (T, T) and (S, S) isomorphic to it, let g : S → T be one to
one and onto such that both g and g^{−1} are measurable. Check that the
one to one and onto mappings g_n(s) = (g(s_1), . . . , g(s_n)) are measurable
and deduce that μ_n(B) = ν_n(g_n^{−1}(B)) are consistent probability measures
on (T^n, T^n).
(c) Consider the one to one and onto mapping g_∞(s) = (g(s_1), . . . , g(s_n), . . .)
from S^N to T^N and the unique probability measure P on (T^N, T_c) for
which (1.4.4) holds. Verify that S_c is contained in the σ-algebra of subsets
A of S^N for which g_∞(A) is in T_c, and deduce that Q(A) = P(g_∞(A)) is
a probability measure on (S^N, S_c).
(d) Conclude your proof of Corollary 1.4.25 by showing that this Q is the
unique probability measure for which (1.4.5) holds.
Remark. Recall that Carathéodory's extension theorem applies for any σ-finite
measure. It follows that, by the same proof as in the preceding exercise, any
consistent sequence of σ-finite measures ν_n uniquely determines a σ-finite measure
Q on (S^N, S_c) for which (1.4.5) holds, a fact which we use in later parts of this text
(for example, in the study of Markov chains in Section 6.1).
Our next proposition shows that in most applications one encounters B-isomorphic
measurable spaces (for which Kolmogorov's theorem applies).
Proposition 1.4.27. If S ∈ B_M for a complete separable metric space M, and S
is the restriction of B_M to S, then (S, S) is B-isomorphic.
Remark. While we do not provide the proof of this proposition, we note in passing
that it is an immediate consequence of [Dud89, Theorem 13.1.1].
1.4.3. Fubini's theorem and its application. Returning to (Ω, F, μ), the
product of two σ-finite measure spaces as in Theorem 1.4.19, we now prove that:
Theorem 1.4.28 (Fubini's theorem). Suppose μ = μ_1 × μ_2 is the product of the
σ-finite measures μ_1 on (X, X) and μ_2 on (Y, Y). If h ∈ mF for F = X × Y is such
that h ≥ 0 or ∫ |h| dμ < ∞, then
(1.4.6)  ∫_{X×Y} h dμ = ∫_X [∫_Y h(x, y) dμ_2(y)] dμ_1(x) = ∫_Y [∫_X h(x, y) dμ_1(x)] dμ_2(y).
Remark. The iterated integrals on the right side of (1.4.6) are finite and well
defined whenever ∫ |h| dμ < ∞. However, for h ∉ mF_+ the inner integrals might
be well defined only in the almost everywhere sense.
Proof of Fubini's theorem. Clearly, it suffices to prove the first identity
of (1.4.6), as the second immediately follows by exchanging the roles of the two
measure spaces. We thus prove Fubini's theorem by showing that
(1.4.7)  y ↦ h(x, y) ∈ mY,  ∀x ∈ X,
(1.4.8)  x ↦ f_h(x) := ∫_Y h(x, y) dμ_2(y) ∈ mX,
so the double integral on the right side of (1.4.6) is well defined, and
(1.4.9)  ∫_{X×Y} h dμ = ∫_X f_h(x) dμ_1(x).
We do so in three steps: first proving (1.4.7)–(1.4.9) for finite measures and bounded
h, proceeding to extend these results to non-negative h and σ-finite measures, and
then showing that (1.4.6) holds whenever h ∈ mF and ∫ |h| dμ is finite.
Step 1. Let H denote the collection of bounded functions on X × Y for which
(1.4.7)–(1.4.9) hold. Assuming that both μ_1(X) and μ_2(Y) are finite, we deduce
that H contains all bounded h ∈ mF by verifying the assumptions of the monotone
class theorem (i.e. Theorem 1.2.7) for H and the π-system R = {A × B : A ∈ X, B ∈ Y}
of measurable rectangles (which generates F).
To this end, note that if h = I_E and E = A × B ∈ R, then either h(x, ·) = I_B(·)
(in case x ∈ A), or h(x, ·) is identically zero (when x ∉ A). With I_B ∈ mY we thus
have (1.4.7) for any such h. Further, in this case the simple function
f_h(x) = μ_2(B) I_A(x) on (X, X) is in mX and
∫_{X×Y} I_E dμ = (μ_1 × μ_2)(E) = μ_2(B) μ_1(A) = ∫_X f_h(x) dμ_1(x).
Consequently, I_E ∈ H for all E ∈ R; in particular, the constant functions are in H.
Next, with both mY and mX vector spaces over R, by the linearity of h ↦ f_h
over the vector space of bounded functions satisfying (1.4.7), and the linearity of
f_h ↦ μ_1(f_h) and h ↦ μ(h) over the vector spaces of bounded measurable f_h and
h, respectively, we deduce that H is also a vector space over R.
Finally, if non-negative h_n ∈ H are such that h_n ↑ h, then for each x ∈ X the
mapping y ↦ h(x, y) = sup_n h_n(x, y) is in mY_+ (by Theorem 1.2.22). Further,
f_{h_n} ∈ mX_+ and by monotone convergence f_{h_n} ↑ f_h (for all x ∈ X), so by the
same reasoning f_h ∈ mX_+. Applying monotone convergence twice more, it thus
follows that
μ(h) = sup_n μ(h_n) = sup_n μ_1(f_{h_n}) = μ_1(f_h),
so h satisfies (1.4.7)–(1.4.9). In particular, if h is bounded then also h ∈ H.
Step 2. Suppose now that h ∈ mF_+. If μ_1 and μ_2 are finite measures, then
we have shown in Step 1 that (1.4.7)–(1.4.9) hold for the bounded non-negative
functions h_n = min(h, n). With h_n ↑ h we have further seen that (1.4.7)–(1.4.9)
hold also for the possibly unbounded h. Further, the closure of (1.4.8) and (1.4.9)
with respect to monotone increasing limits of non-negative functions has been shown
by monotone convergence, and as such it extends to σ-finite measures μ_1 and μ_2.
Turning now to σ-finite μ_1 and μ_2, recall that there exist E_n = A_n × B_n ∈ R
such that A_n ↑ X, B_n ↑ Y, μ_1(A_n) < ∞ and μ_2(B_n) < ∞. As h is the monotone
increasing limit of h_n = h I_{E_n} ∈ mF_+, it thus suffices to verify that for each n
the non-negative f_n(x) = ∫_Y h_n(x, y) dμ_2(y) is measurable with μ(h_n) = μ_1(f_n).
Fixing n and simplifying our notations to E = E_n, A = A_n and B = B_n, recall
Corollary 1.3.57 that μ(h_n) = μ_E(h_E) for the restrictions h_E and μ_E of h and μ to
the measurable space (E, F_E). Also, as E = A × B we have that F_E = X_A × Y_B
and μ_E = (μ_1)_A × (μ_2)_B for the finite measures (μ_1)_A and (μ_2)_B. Finally, as
f_n(x) = f_{h_E}(x) := ∫_B h_E(x, y) d(μ_2)_B(y) when x ∈ A and zero otherwise, it
follows that μ_1(f_n) = (μ_1)_A(f_{h_E}). We have thus reduced our problem (for h_n)
to the case of finite measures μ_E = (μ_1)_A × (μ_2)_B, which we have already
successfully resolved.
Step 3. Write h ∈ mF as h = h_+ − h_−, with h_± ∈ mF_+. By Step 2 we know
that y ↦ h_±(x, y) ∈ mY for each x ∈ X, hence the same applies for y ↦ h(x, y).
Let X_0 denote the subset of X for which ∫_Y |h(x, y)| dμ_2(y) < ∞. By linearity of
the integral with respect to μ_2 we have that for all x ∈ X_0,
(1.4.10)  f_h(x) = f_{h_+}(x) − f_{h_−}(x)
is finite. By Step 2 we know that f_{h_±} ∈ mX, hence X_0 = {x : f_{h_+}(x) + f_{h_−}(x) < ∞}
is in X. From Step 2 we further have that μ_1(f_{h_±}) = μ(h_±), whereby our
assumption that ∫ |h| dμ = μ_1(f_{h_+} + f_{h_−}) < ∞ implies that μ_1(X_0^c) = 0. Let
f̃_h(x) = f_{h_+}(x) − f_{h_−}(x) on X_0 and f̃_h(x) = 0 for all x ∉ X_0. Clearly, f̃_h ∈ mX
is μ_1-almost-everywhere the same as the inner integral on the right side of (1.4.6).
Moreover, in view of (1.4.10) and linearity of the integrals with respect to μ_1 and
μ, we deduce that
μ(h) = μ(h_+) − μ(h_−) = μ_1(f_{h_+}) − μ_1(f_{h_−}) = μ_1(f̃_h),
which is exactly the identity (1.4.6). □
Equipped with Fubini's theorem, we have the following simpler formula for the
expectation of a Borel function h of two independent R.V.-s.
Theorem 1.4.29. Suppose that X and Y are independent random variables of laws
μ_1 = P_X and μ_2 = P_Y. If h : R² → R is a Borel measurable function such that
h ≥ 0 or E|h(X, Y)| < ∞, then
(1.4.11)  Eh(X, Y) = ∫ [∫ h(x, y) dμ_1(x)] dμ_2(y).
In particular, for Borel functions f, g : R → R such that f, g ≥ 0 or E|f(X)| < ∞
and E|g(Y)| < ∞,
(1.4.12)  E(f(X)g(Y)) = Ef(X) Eg(Y).
Proof. Subject to minor changes of notations, the proof of Theorem 1.3.61
applies to any (S, S)-valued R.V. Considering this theorem for the random vector
(X, Y), whose joint law is μ_1 × μ_2 (c.f. Proposition 1.4.21), together with Fubini's
theorem, we see that
Eh(X, Y) = ∫_{R²} h(x, y) d(μ_1 × μ_2)(x, y) = ∫ [∫ h(x, y) dμ_1(x)] dμ_2(y),
which is (1.4.11). Take now h(x, y) = f(x)g(y) for non-negative Borel functions
f(x) and g(y). In this case, the iterated integral on the right side of (1.4.11) can
be further simplified to
E(f(X)g(Y)) = ∫ [∫ f(x)g(y) dμ_1(x)] dμ_2(y) = ∫ g(y) [∫ f(x) dμ_1(x)] dμ_2(y)
= ∫ [Ef(X)] g(y) dμ_2(y) = Ef(X) Eg(Y)
(with Theorem 1.3.61 applied twice here), which is the stated identity (1.4.12).
To deal with Borel functions f and g that are not necessarily non-negative, first
apply (1.4.12) for the non-negative functions |f| and |g| to get that E(|f(X)g(Y)|) =
E|f(X)| E|g(Y)| < ∞. Thus, the assumed integrability of f(X) and of g(Y) allows
us to apply again (1.4.11) for h(x, y) = f(x)g(y). Now repeat the argument we
used for deriving (1.4.12) in the case of non-negative Borel functions. □
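Identity (1.4.12) is easy to test by simulation; the sketch below uses f(x) = cos x and g(y) = y² for an Exp(1) sample and an independently drawn N(0,1) sample (the sample size and tolerance are arbitrary choices):

```python
import math, random

random.seed(0)
n = 200000
xs = [random.expovariate(1.0) for _ in range(n)]   # X ~ Exp(1)
ys = [random.gauss(0.0, 1.0) for _ in range(n)]    # Y ~ N(0,1), drawn independently
f = lambda x: math.cos(x)
g = lambda y: y * y
lhs = sum(f(x) * g(y) for x, y in zip(xs, ys)) / n     # estimates E f(X)g(Y)
rhs = (sum(map(f, xs)) / n) * (sum(map(g, ys)) / n)    # estimates Ef(X) Eg(Y)
assert abs(lhs - rhs) < 0.05    # both are near E cos(X) * E Y^2 = 1/2
```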
Another consequence of Fubini's theorem is the following integration by parts formula.

Lemma 1.4.30 (integration by parts). Suppose $H(x) = \int_{-\infty}^{x} h(y)\, dy$ for a non-negative Borel function $h$ and all $x \in \mathbb{R}$. Then, for any random variable $X$,

(1.4.13)  $E H(X) = \int_{\mathbb{R}} h(y)\, P(X > y)\, dy$.

Proof. Combining the change of variables formula (Theorem 1.3.61) with our assumption about $H(\cdot)$, we have that

$E H(X) = \int_{\mathbb{R}} H(x)\, dP_X(x) = \int_{\mathbb{R}} \Big[ \int_{\mathbb{R}} h(y) I_{x>y}\, d\lambda(y) \Big]\, dP_X(x)$,

where $\lambda$ denotes Lebesgue's measure on $(\mathbb{R}, \mathcal{B})$. For each $y \in \mathbb{R}$, the expectation of the simple function $x \mapsto h(y) I_{x>y}$ with respect to $(\mathbb{R}, \mathcal{B}, P_X)$ is merely $h(y) P(X > y)$. Thus, applying Fubini's theorem for the non-negative measurable function $h(y) I_{x>y}$ on the product space $\mathbb{R} \times \mathbb{R}$ equipped with its Borel $\sigma$-algebra $\mathcal{B}_{\mathbb{R}^2}$, and the $\sigma$-finite measures $\mu_1 = P_X$ and $\mu_2 = \lambda$, we have that

$E H(X) = \int_{\mathbb{R}} \Big[ \int_{\mathbb{R}} h(y) I_{x>y}\, dP_X(x) \Big]\, d\lambda(y) = \int_{\mathbb{R}} h(y)\, P(X > y)\, dy$,

as claimed. □
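As a quick numerical sanity check of (1.4.13) (a sketch, not from the text), take $X$ exponential of parameter one and $h(y) = 2y\,1_{y>0}$, so that $H(x) = x^2$ on $x \ge 0$; both sides should then equal $E X^2 = 2$.

```python
import math

# Right side of (1.4.13) for X ~ Exponential(1), where P(X > y) = exp(-y),
# and h(y) = 2y on y > 0: a left-endpoint Riemann sum, truncated at y = 40
# (the tail beyond 40 is negligible for this density).
dy = 1e-4
rhs = sum(2 * (k * dy) * math.exp(-k * dy) for k in range(1, 400000)) * dy
print(rhs)   # ≈ 2.0 = E[X^2]
```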
Indeed, as we see next, by combining the integration by parts formula with Hölder's inequality we can convert bounds on tail probabilities to bounds on the moments of the corresponding random variables.

Lemma 1.4.31.
(a) For any $r > p > 0$ and any random variable $Y \ge 0$,

$E Y^p = \int_0^\infty p y^{p-1}\, P(Y > y)\, dy = \int_0^\infty p y^{p-1}\, P(Y \ge y)\, dy = (1 - \tfrac{p}{r}) \int_0^\infty p y^{p-1}\, E[\min(Y/y, 1)^r]\, dy$.

(b) If $X, Y \ge 0$ are such that $P(Y \ge y) \le y^{-1} E[X I_{Y \ge y}]$ for all $y > 0$, then $\|Y\|_p \le q \|X\|_p$ for any $p > 1$ and $q = p/(p-1)$.
(c) Under the same hypothesis also $E Y \le 1 + E[X (\log Y)_+]$.
Proof. (a) The first identity is merely the integration by parts formula for $h_p(y) = p y^{p-1} 1_{y>0}$ and $H_p(x) = x^p 1_{x \ge 0}$, and the second identity follows by the fact that $P(Y = y) = 0$ up to a (countable) set of zero Lebesgue measure. Finally, it is easy to check that $H_p(x) = \int_{\mathbb{R}} h_{p,r}(x,y)\, dy$ for the non-negative Borel function

$h_{p,r}(x,y) = (1 - p/r)\, p y^{p-1} \min(x/y, 1)^r\, 1_{x \ge 0}\, 1_{y > 0}$

and any $r > p > 0$. Hence, replacing $h(y) I_{x>y}$ throughout the proof of Lemma 1.4.30 by $h_{p,r}(x,y)$ we find that $E[H_p(X)] = \int_0^\infty E[h_{p,r}(X, y)]\, dy$, which is exactly our third identity.
(b) In a similar manner it follows from Fubini's theorem that for $p > 1$ and any non-negative random variables $X$ and $Y$,

$E[X Y^{p-1}] = E[X H_{p-1}(Y)] = E\Big[ \int_{\mathbb{R}} h_{p-1}(y)\, X I_{Y \ge y}\, dy \Big] = \int_{\mathbb{R}} h_{p-1}(y)\, E[X I_{Y \ge y}]\, dy$.

Thus, with $y^{-1} h_p(y) = q h_{p-1}(y)$, our hypothesis implies that

$E Y^p = \int_{\mathbb{R}} h_p(y)\, P(Y \ge y)\, dy \le \int_{\mathbb{R}} q h_{p-1}(y)\, E[X I_{Y \ge y}]\, dy = q E[X Y^{p-1}]$.

Applying Hölder's inequality we deduce that

$E Y^p \le q E[X Y^{p-1}] \le q \|X\|_p \|Y^{p-1}\|_q = q \|X\|_p\, [E Y^p]^{1/q}$,

where the right-most equality is due to the fact that $(p-1) q = p$. In case $Y$ is bounded, dividing both sides of the preceding bound by $[E Y^p]^{1/q}$ implies that $\|Y\|_p \le q \|X\|_p$. To deal with the general case, let $Y_n = Y \wedge n$, $n = 1, 2, \ldots$ and note that either $\{Y_n \ge y\}$ is empty (for $n < y$) or $\{Y_n \ge y\} = \{Y \ge y\}$. Thus, our assumption implies that $P(Y_n \ge y) \le y^{-1} E[X I_{Y_n \ge y}]$ for all $y > 0$ and $n \ge 1$. By the preceding argument $\|Y_n\|_p \le q \|X\|_p$ for any $n$. Taking $n \to \infty$ it follows by monotone convergence that $\|Y\|_p \le q \|X\|_p$.
(c) Considering part (a) with $p = 1$, we bound $P(Y \ge y)$ by one for $y \in [0,1]$ and by $y^{-1} E[X I_{Y \ge y}]$ for $y > 1$, to get by Fubini's theorem that

$E Y = \int_0^\infty P(Y \ge y)\, dy \le 1 + \int_1^\infty y^{-1} E[X I_{Y \ge y}]\, dy = 1 + E\Big[ X \int_1^\infty y^{-1} I_{Y \ge y}\, dy \Big] = 1 + E[X (\log Y)_+]$. □
We further have the following corollary of (1.4.12), dealing with the expectation of a product of mutually independent R.V.

Corollary 1.4.32. Suppose that $X_1, \ldots, X_n$ are P-mutually independent random variables such that either $X_i \ge 0$ for all $i$, or $E|X_i| < \infty$ for all $i$. Then,

(1.4.14)  $E\Big[ \prod_{i=1}^n X_i \Big] = \prod_{i=1}^n E X_i$,

that is, the expectation on the left exists and has the value given on the right.
Proof. By Corollary 1.4.11 we know that $X = X_1$ and $Y = X_2 \cdots X_n$ are independent. Taking $f(x) = |x|$ and $g(y) = |y|$ in Theorem 1.4.29, we thus have that $E|X_1 \cdots X_n| = E|X_1|\, E|X_2 \cdots X_n|$ for any $n \ge 2$. Applying this identity iteratively for $X_l, \ldots, X_n$, starting with $l = m$, then $l = m+1, m+2, \ldots, n-1$, leads to

(1.4.15)  $E|X_m \cdots X_n| = \prod_{k=m}^n E|X_k|$,

holding for any $1 \le m \le n$. If $X_i \ge 0$ for all $i$, then $|X_i| = X_i$ and we have (1.4.14) as the special case $m = 1$.

To deal with the proof in case $X_i \in L^1$ for all $i$, note that for $m = 2$ the identity (1.4.15) tells us that $E|Y| = E|X_2 \cdots X_n| < \infty$, so using Theorem 1.4.29 with $f(x) = x$ and $g(y) = y$ we have that $E(X_1 \cdots X_n) = (E X_1)\, E(X_2 \cdots X_n)$. Iterating this identity for $X_l, \ldots, X_n$, starting with $l = 1$, then $l = 2, 3, \ldots, n-1$, leads to the desired result (1.4.14). □
Another application of Theorem 1.4.29 provides us with the familiar formula for the probability density function of the sum $X+Y$ of independent random variables $X$ and $Y$, having densities $f_X$ and $f_Y$ respectively.

Corollary 1.4.33. Suppose that a R.V. $X$ with a Borel measurable probability density function $f_X$ and a R.V. $Y$ with a Borel measurable probability density function $f_Y$ are independent. Then, the random variable $Z = X + Y$ has the probability density function

$f_Z(z) = \int_{\mathbb{R}} f_X(z - y) f_Y(y)\, dy$.

Proof. Fixing $z \in \mathbb{R}$, apply Theorem 1.4.29 for $h(x,y) = 1_{(x+y \le z)}$, to get that

$F_Z(z) = P(X + Y \le z) = E h(X,Y) = \int_{\mathbb{R}} \Big[ \int_{\mathbb{R}} h(x,y)\, dP_X(x) \Big]\, dP_Y(y)$.
Considering the inner integral for a fixed value of $y$, we have that

$\int_{\mathbb{R}} h(x,y)\, dP_X(x) = \int_{\mathbb{R}} I_{(-\infty, z-y]}(x)\, dP_X(x) = P_X((-\infty, z-y]) = \int_{-\infty}^{z-y} f_X(x)\, dx$,

where the right-most equality is by the existence of a density $f_X(x)$ for $X$ (c.f. Definition 1.2.39). Clearly, $\int_{-\infty}^{z-y} f_X(x)\, dx = \int_{-\infty}^{z} f_X(x-y)\, dx$. Thus, applying Fubini's theorem for the Borel measurable function $g(x,y) = f_X(x-y) \ge 0$ and the product of the $\sigma$-finite Lebesgue measure on $(-\infty, z]$ and the probability measure $P_Y$, we see that

$F_Z(z) = \int_{\mathbb{R}} \Big[ \int_{-\infty}^{z} f_X(x-y)\, dx \Big]\, dP_Y(y) = \int_{-\infty}^{z} \Big[ \int_{\mathbb{R}} f_X(x-y)\, dP_Y(y) \Big]\, dx$

(in this application of Fubini's theorem we replace one iterated integral by another, exchanging the order of integrations). Since this applies for any $z \in \mathbb{R}$, it follows by definition that $Z$ has the probability density

$f_Z(z) = \int_{\mathbb{R}} f_X(z-y)\, dP_Y(y) = E f_X(z - Y)$.

With $Y$ having density $f_Y$, the stated formula for $f_Z$ is a consequence of Corollary 1.3.62. □
Definition 1.4.34. The expression $\int f(z-y) g(y)\, dy$ is called the convolution of the non-negative Borel functions $f$ and $g$, denoted by $f \star g(z)$. The convolution of the measures $\mu$ and $\nu$ on $(\mathbb{R}, \mathcal{B})$ is the measure $\mu \star \nu$ on $(\mathbb{R}, \mathcal{B})$ such that $\mu \star \nu(B) = \int \nu(B - x)\, d\mu(x)$ for any $B \in \mathcal{B}$ (where $B - x = \{y : x + y \in B\}$).

Corollary 1.4.33 states that if two independent random variables $X$ and $Y$ have densities, then so does $Z = X+Y$, whose density is the convolution of the densities of $X$ and $Y$. Without assuming the existence of densities, one can show by a similar argument that the law of $X+Y$ is the convolution of the law of $X$ and the law of $Y$ (c.f. [Dur10, Theorem 2.1.10] or [Bil95, Page 266]).
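A numerical sketch of Corollary 1.4.33 (illustrative, not from the text): for independent $X, Y$ uniform on $(0,1)$, the convolution formula should reproduce the triangular density, $f_Z(z) = z$ on $(0,1]$ and $2 - z$ on $(1,2)$.

```python
def f_unif(x):
    # common density of X and Y: uniform on (0, 1)
    return 1.0 if 0.0 < x < 1.0 else 0.0

def f_Z(z, dy=1e-3):
    # f_Z(z) = ∫ f_X(z - y) f_Y(y) dy, discretized over y in (0, 1)
    return sum(f_unif(z - k * dy) * f_unif(k * dy) for k in range(1000)) * dy

print(f_Z(0.5), f_Z(1.5), f_Z(2.5))   # ≈ 0.5, ≈ 0.5, 0.0
```

The discretization error here is of order `dy`; refining the grid brings the values arbitrarily close to the exact triangular density.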
Convolution is often used in analysis to provide a more regular approximation to a given function. Here are a few of the reasons for doing so.
Exercise 1.4.35. Suppose the Borel functions $f, g$ are such that $g$ is a probability density and $\int |f(x)|\, dx$ is finite. Consider the scaled densities $g_n(\cdot) = n g(n\,\cdot)$, $n \ge 1$.
(a) Show that $f \star g(y)$ is a Borel function with $\int |f \star g(y)|\, dy \le \int |f(x)|\, dx$, and if $g$ is uniformly continuous, then so is $f \star g$.
(b) Show that if $g(x) = 0$ whenever $|x| \ge 1$, then $f \star g_n(y) \to f(y)$ as $n \to \infty$, for any continuous $f$ and each $y \in \mathbb{R}$.
Next you find two of the many applications of Fubini's theorem in real analysis.

Exercise 1.4.36. Show that the set $G_f = \{(x,y) \in \mathbb{R}^2 : 0 \le y \le f(x)\}$ of points under the graph of a non-negative Borel function $f : \mathbb{R} \to [0, \infty)$ is in $\mathcal{B}_{\mathbb{R}^2}$, and deduce the well-known formula $\lambda_2(G_f) = \int f(x)\, d\lambda(x)$ for its area.
Exercise 1.4.37. For $n \ge 2$, consider the unit sphere $S^{n-1} = \{x \in \mathbb{R}^n : \|x\| = 1\}$ equipped with the topology induced by $\mathbb{R}^n$. Let the surface measure of $A \in \mathcal{B}_{S^{n-1}}$ be $\nu(A) = n \lambda_n(C_{0,1}(A))$, for $C_{a,b}(A) = \{r x : r \in (a,b],\ x \in A\}$ and the $n$-fold product Lebesgue measure $\lambda_n$ (as in Remark 1.4.20).
(a) Check that $C_{a,b}(A) \in \mathcal{B}_{\mathbb{R}^n}$ and deduce that $\nu(\cdot)$ is a finite measure on $S^{n-1}$ (which is further invariant under orthogonal transformations).
(b) Verify that $\lambda_n(C_{a,b}(A)) = \frac{b^n - a^n}{n}\, \nu(A)$ and deduce that for any $B \in \mathcal{B}_{\mathbb{R}^n}$,

$\lambda_n(B) = \int_0^\infty \Big[ \int_{S^{n-1}} I_{r x \in B}\, d\nu(x) \Big]\, r^{n-1}\, d\lambda(r)$.

Hint: Recall that $\lambda_n(\alpha B) = \alpha^n \lambda_n(B)$ for any $\alpha \ge 0$ and $B \in \mathcal{B}_{\mathbb{R}^n}$.
Combining (1.4.12) with Theorem 1.2.26 leads to the following characterization of the independence between two random vectors (compare with Definition 1.4.1).

Exercise 1.4.38. Show that the $\mathbb{R}^n$-valued random variable $(X_1, \ldots, X_n)$ and the $\mathbb{R}^m$-valued random variable $(Y_1, \ldots, Y_m)$ are independent if and only if

$E(h(X_1, \ldots, X_n)\, g(Y_1, \ldots, Y_m)) = E(h(X_1, \ldots, X_n))\, E(g(Y_1, \ldots, Y_m))$,

for all bounded, Borel measurable functions $g : \mathbb{R}^m \to \mathbb{R}$ and $h : \mathbb{R}^n \to \mathbb{R}$. Then show that the assumption of $h(\cdot)$ and $g(\cdot)$ bounded can be relaxed to both $h(X_1, \ldots, X_n)$ and $g(Y_1, \ldots, Y_m)$ being in $L^1(\Omega, \mathcal{F}, P)$.
Here is another application of (1.4.12):

Exercise 1.4.39. Show that $E(f(X) g(X)) \ge (E f(X))(E g(X))$ for every random variable $X$ and any bounded non-decreasing functions $f, g : \mathbb{R} \to \mathbb{R}$.
In the following exercise you bound the exponential moments of certain random variables.

Exercise 1.4.40. Suppose $Y$ is an integrable random variable such that $E[e^Y]$ is finite and $E[Y] = 0$.
(a) Show that if $|Y| \le \kappa$ then

$\log E[e^Y] \le \kappa^{-2} (e^{\kappa} - \kappa - 1)\, E[Y^2]$.

Hint: Use the Taylor expansion of $e^Y - Y - 1$.
(b) Show that if $E[Y^2 e^Y] \le \kappa^2 E[e^Y]$ then

$\log E[e^Y] \le \log \cosh(\kappa)$.

Hint: Note that $\Gamma(u) = \log E[e^{uY}]$ is convex, non-negative and finite on $[0,1]$, with $\Gamma(0) = 0$ and $\Gamma'(0) = 0$. Verify that $\Gamma''(u) + \Gamma'(u)^2 = E[Y^2 e^{uY}]/E[e^{uY}]$ is non-decreasing on $[0,1]$, and that $\gamma(u) = \log \cosh(\kappa u)$ satisfies the differential equation $\gamma''(u) + \gamma'(u)^2 = \kappa^2$.
As demonstrated next, Fubini's theorem is also handy in proving the impossibility of certain constructions.

Exercise 1.4.41. Explain why it is impossible to have P-mutually independent random variables $U_t(\omega)$, $t \in [0,1]$, on the same probability space $(\Omega, \mathcal{F}, P)$, having each the uniform probability measure on $[-1/2, 1/2]$, such that $t \mapsto U_t(\omega)$ is a Borel function for almost every $\omega \in \Omega$.
Hint: Show that $E[(\int_0^r U_t(\omega)\, dt)^2] = 0$ for all $r \in [0,1]$.
Random variables $X$ and $Y$ such that $E(X^2) < \infty$ and $E(Y^2) < \infty$ are called uncorrelated if $E(XY) = E(X) E(Y)$. It follows from (1.4.12) that independent random variables $X, Y$ with finite second moments are uncorrelated. While the converse is not necessarily true, it does apply for pairs of random variables that take only two different values each.

Exercise 1.4.42. Suppose $X$ and $Y$ are uncorrelated random variables.
(a) Show that if $X = I_A$ and $Y = I_B$ for some $A, B \in \mathcal{F}$ then $X$ and $Y$ are also independent.
(b) Using this, show that if a $\{a, b\}$-valued R.V. $X$ and a $\{c, d\}$-valued R.V. $Y$ are uncorrelated, then they are also independent.
(c) Give an example of a pair of R.V. $X$ and $Y$ that are uncorrelated but not independent.
Next come a pair of exercises utilizing Corollary 1.4.32.

Exercise 1.4.43. Suppose $X$ and $Y$ are random variables on the same probability space, $X$ has a Poisson distribution with parameter $\lambda > 0$, and $Y$ has a Poisson distribution with parameter $\mu > \lambda$ (see Example 1.3.69).
(a) Show that if $X$ and $Y$ are independent then $P(X \ge Y) \le \exp(-(\sqrt{\mu} - \sqrt{\lambda})^2)$.
(b) Taking $\mu = \gamma \lambda$ for $\gamma > 1$, find $I(\gamma) > 0$ such that $P(X \ge Y) \le 2 \exp(-\lambda I(\gamma))$ even when $X$ and $Y$ are not independent.
Exercise 1.4.44. Suppose $X$ and $Y$ are independent random variables of identical distribution such that $X > 0$ and $E[X] < \infty$.
(a) Show that $E[X^{-1} Y] > 1$ unless $X(\omega) = c$ for some non-random $c$ and almost every $\omega$.
(b) Provide an example in which $E[X^{-1} Y] = \infty$.
We conclude this section with a concrete application of Corollary 1.4.33, computing the density of the sum of mutually independent R.V., each having the same exponential density. To this end, recall

Definition 1.4.45. The gamma density with parameters $\alpha > 0$ and $\lambda > 0$ is given by

$f_{\alpha,\lambda}(s) = \Gamma(\alpha)^{-1} \lambda^{\alpha} s^{\alpha-1} e^{-\lambda s}\, 1_{s>0}$,

where $\Gamma(\alpha) = \int_0^\infty s^{\alpha-1} e^{-s}\, ds$ is finite and positive. In particular, $\alpha = 1$ corresponds to the exponential density $f_T$ of Example 1.3.68.
Exercise 1.4.46. Suppose $X$ has a gamma density of parameters $\alpha_1$ and $\lambda$, and $Y$ has a gamma density of parameters $\alpha_2$ and $\lambda$. Show that if $X$ and $Y$ are independent then $X + Y$ has a gamma density of parameters $\alpha_1 + \alpha_2$ and $\lambda$. Deduce that if $T_1, \ldots, T_n$ are mutually independent R.V. each having the exponential density of parameter $\lambda$, then $W_n = \sum_{i=1}^n T_i$ has the gamma density of parameters $\alpha = n$ and $\lambda$.
CHAPTER 2

Asymptotics: the law of large numbers

Building upon the foundations of Chapter 1 we turn to deal with asymptotic theory. To this end, this chapter is devoted to degenerate limit laws, that is, situations in which a sequence of random variables converges to a non-random (constant) limit. Though not exclusively dealing with it, our focus here is on the sequence of empirical averages $n^{-1} \sum_{i=1}^n X_i$ as $n \to \infty$.

Section 2.1 deals with the weak law of large numbers, where convergence in probability (or in $L^q$ for some $q > 1$) is considered. This is strengthened in Section 2.3 to a strong law of large numbers, namely, to convergence almost surely. The key tools for this improvement are the Borel-Cantelli lemmas, to which Section 2.2 is devoted.
2.1. Weak laws of large numbers

A weak law of large numbers corresponds to the situation where the normalized sums of a large number of random variables converge in probability to a non-random constant. Usually, the derivation of a weak law involves the computation of variances, on which we focus in Subsection 2.1.1. However, the $L^2$ convergence we obtain there is of a somewhat limited scope of applicability. To remedy this, we introduce the method of truncation in Subsection 2.1.2 and illustrate its power in a few representative examples.
2.1.1. $L^2$ limits for sums of uncorrelated variables. The key to our derivation of weak laws of large numbers is the computation of variances. As a preliminary step we define the covariance of two R.V. and extend the notion of a pair of uncorrelated random variables to a (possibly infinite) family of R.V.

Definition 2.1.1. The covariance of two random variables $X, Y \in L^2(\Omega, \mathcal{F}, P)$ is

$\mathrm{Cov}(X, Y) = E[(X - EX)(Y - EY)] = E XY - EX\, EY$,

so in particular, $\mathrm{Cov}(X, X) = \mathrm{Var}(X)$. We say that random variables $X_\alpha \in L^2(\Omega, \mathcal{F}, P)$ are uncorrelated if

$E(X_\alpha X_\beta) = E(X_\alpha) E(X_\beta)$ for all $\alpha \ne \beta$,

or equivalently, if

$\mathrm{Cov}(X_\alpha, X_\beta) = 0$ for all $\alpha \ne \beta$.
As we next show, the variance of the sum of finitely many uncorrelated random variables is the sum of the variances of the variables.

Lemma 2.1.2. Suppose $X_1, \ldots, X_n$ are uncorrelated random variables (which necessarily are defined on the same probability space). Then,

(2.1.1)  $\mathrm{Var}(X_1 + \cdots + X_n) = \mathrm{Var}(X_1) + \cdots + \mathrm{Var}(X_n)$.

Proof. Let $S_n = \sum_{i=1}^n X_i$. By Definition 1.3.67 of the variance and linearity of the expectation we have that

$\mathrm{Var}(S_n) = E([S_n - E S_n]^2) = E\Big( \Big[ \sum_{i=1}^n X_i - \sum_{i=1}^n E X_i \Big]^2 \Big) = E\Big( \Big[ \sum_{i=1}^n (X_i - E X_i) \Big]^2 \Big)$.

Writing the square of the sum as the sum of all possible cross-products, we get that

$\mathrm{Var}(S_n) = \sum_{i,j=1}^n E[(X_i - E X_i)(X_j - E X_j)] = \sum_{i,j=1}^n \mathrm{Cov}(X_i, X_j) = \sum_{i=1}^n \mathrm{Cov}(X_i, X_i) = \sum_{i=1}^n \mathrm{Var}(X_i)$,

where we use the fact that $\mathrm{Cov}(X_i, X_j) = 0$ for each $i \ne j$, since $X_i$ and $X_j$ are uncorrelated. □
Equipped with this lemma we have our

Theorem 2.1.3 ($L^2$ weak law of large numbers). Consider $S_n = \sum_{i=1}^n X_i$ for uncorrelated random variables $X_1, \ldots, X_n, \ldots$. Suppose that $\mathrm{Var}(X_i) \le C$ and $E X_i = x$ for some finite constants $C$, $x$, and all $i = 1, 2, \ldots$. Then, $n^{-1} S_n \xrightarrow{L^2} x$ as $n \to \infty$, and hence also $n^{-1} S_n \xrightarrow{p} x$.

Proof. Our assumptions imply that $E(n^{-1} S_n) = x$, and further by Lemma 2.1.2 we have the bound $\mathrm{Var}(S_n) \le n C$. Recall the scaling property (1.3.17) of the variance, implying that

$E\big[ (n^{-1} S_n - x)^2 \big] = \mathrm{Var}(n^{-1} S_n) = \frac{1}{n^2} \mathrm{Var}(S_n) \le \frac{C}{n} \to 0$

as $n \to \infty$. Thus, $n^{-1} S_n \xrightarrow{L^2} x$ (recall Definition 1.3.26). By Proposition 1.3.29 this implies that also $n^{-1} S_n \xrightarrow{p} x$. □
The most important special case of Theorem 2.1.3 is,

Example 2.1.4. Suppose that $X_1, \ldots, X_n$ are independent and identically distributed (or, in short, i.i.d.), with $E X_1^2 < \infty$. Then, $E X_i^2 = C$ and $E X_i = m_X$ are both finite and independent of $i$. So, the $L^2$ weak law of large numbers tells us that $n^{-1} S_n \xrightarrow{L^2} m_X$, and hence also $n^{-1} S_n \xrightarrow{p} m_X$.

Remark. As we shall see, the weaker condition $E|X_i| < \infty$ suffices for the convergence in probability of $n^{-1} S_n$ to $m_X$. In Section 2.3 we show that it even suffices for the convergence almost surely of $n^{-1} S_n$ to $m_X$, a statement called the strong law of large numbers.

Exercise 2.1.5. Show that the conclusion of the $L^2$ weak law of large numbers holds even for correlated $X_i$, provided $E X_i = x$ and $\mathrm{Cov}(X_i, X_j) \le r(|i - j|)$ for all $i, j$, and some bounded sequence $r(k) \to 0$ as $k \to \infty$.
With an eye on generalizing the $L^2$ weak law of large numbers we observe that

Lemma 2.1.6. If the random variables $Z_n \in L^2(\Omega, \mathcal{F}, P)$ and the non-random $b_n$ are such that $b_n^{-2} \mathrm{Var}(Z_n) \to 0$ as $n \to \infty$, then $b_n^{-1}(Z_n - E Z_n) \xrightarrow{L^2} 0$.

Proof. We have $E[(b_n^{-1}(Z_n - E Z_n))^2] = b_n^{-2} \mathrm{Var}(Z_n) \to 0$. □

Example 2.1.7. Let $Z_n = \sum_{k=1}^n X_k$ for uncorrelated random variables $X_k$. If $\mathrm{Var}(X_k)/k \to 0$ as $k \to \infty$, then Lemma 2.1.6 applies for $Z_n$ and $b_n = n$, hence $n^{-1}(Z_n - E Z_n) \to 0$ in $L^2$ (and in probability). Alternatively, if also $\mathrm{Var}(X_k) \to 0$, then Lemma 2.1.6 applies even for $Z_n$ and $b_n = n^{1/2}$.
Many limit theorems involve random variables of the form $S_n = \sum_{k=1}^n X_{n,k}$, that is, the row sums of triangular arrays of random variables $\{X_{n,k} : k = 1, \ldots, n\}$. Here are two such examples, both relying on Lemma 2.1.6.
Example 2.1.8 (Coupon collector's problem). Consider i.i.d. random variables $U_1, U_2, \ldots$, each distributed uniformly on $\{1, 2, \ldots, n\}$. Let $|\{U_1, \ldots, U_l\}|$ denote the number of distinct elements among the first $l$ variables, and $\tau^n_k = \inf\{l : |\{U_1, \ldots, U_l\}| = k\}$ be the first time one has $k$ distinct values. We are interested in the asymptotic behavior as $n \to \infty$ of $T_n = \tau^n_n$, the time it takes to have at least one representative of each of the $n$ possible values.

To motivate the name assigned to this example, think of collecting a set of $n$ different coupons, where independently of all previous choices, each item is chosen at random in such a way that each of the possible $n$ outcomes is equally likely. Then, $T_n$ is the number of items one has to collect till having the complete set.

Setting $\tau^n_0 = 0$, let $X_{n,k} = \tau^n_k - \tau^n_{k-1}$ denote the additional time it takes to get an item different from the first $k - 1$ distinct items collected. Note that $X_{n,k}$ has a geometric distribution of success probability $q_{n,k} = 1 - \frac{k-1}{n}$, hence $E X_{n,k} = q_{n,k}^{-1}$ and $\mathrm{Var}(X_{n,k}) \le q_{n,k}^{-2}$ (see Example 1.3.69). Since

$T_n = \tau^n_n - \tau^n_0 = \sum_{k=1}^n (\tau^n_k - \tau^n_{k-1}) = \sum_{k=1}^n X_{n,k}$,

we have by linearity of the expectation that

$E T_n = \sum_{k=1}^n \Big( 1 - \frac{k-1}{n} \Big)^{-1} = n \sum_{j=1}^n j^{-1} = n (\log n + \gamma_n)$,

where $\gamma_n = \sum_{j=1}^n j^{-1} - \int_1^n x^{-1}\, dx$ is between zero and one (by monotonicity of $x \mapsto x^{-1}$). Further, $X_{n,k}$ is independent of each earlier waiting time $X_{n,j}$, $j = 1, \ldots, k-1$, hence we have by Lemma 2.1.2 that

$\mathrm{Var}(T_n) = \sum_{k=1}^n \mathrm{Var}(X_{n,k}) \le \sum_{k=1}^n \Big( 1 - \frac{k-1}{n} \Big)^{-2} \le n^2 \sum_{j=1}^{\infty} j^{-2} = C n^2$,

for some $C < \infty$. Applying Lemma 2.1.6 with $b_n = n \log n$, we deduce that

$\frac{T_n - n(\log n + \gamma_n)}{n \log n} \xrightarrow{L^2} 0$.

Since $\gamma_n / \log n \to 0$, it follows that

$\frac{T_n}{n \log n} \xrightarrow{L^2} 1$,

and $T_n/(n \log n) \to 1$ in probability as well.
One possible extension of Example 2.1.8 concerns innitely many possible coupons.
That is,
Exercise 2.1.9. Suppose $\xi_k$ are i.i.d. positive integer valued random variables, with $P(\xi_1 = i) = p_i > 0$ for $i = 1, 2, \ldots$. Let $D_l = |\{\xi_1, \ldots, \xi_l\}|$ denote the number of distinct elements among the first $l$ variables.
(a) Show that $D_n \xrightarrow{a.s.} \infty$ as $n \to \infty$.
(b) Show that $n^{-1} E D_n \to 0$ as $n \to \infty$ and deduce that $n^{-1} D_n \xrightarrow{p} 0$.
Hint: Recall that $(1-p)^n \ge 1 - np$ for any $p \in [0,1]$ and $n \ge 0$.
Example 2.1.10 (An occupancy problem). Suppose we distribute at random $r$ distinct balls among $n$ distinct boxes, where each of the possible $n^r$ assignments of balls to boxes is equally likely. We are interested in the asymptotic behavior of the number $N_n$ of empty boxes when $r/n \to \alpha \in [0, \infty]$, while $n \to \infty$. To this end, let $A_i$ denote the event that the $i$-th box is empty, so $N_n = \sum_{i=1}^n I_{A_i}$. Since $P(A_i) = (1 - 1/n)^r$ for each $i$, it follows that $E(n^{-1} N_n) = (1 - 1/n)^r \to e^{-\alpha}$. Further, $E N_n^2 = \sum_{i,j=1}^n P(A_i \cap A_j)$ and $P(A_i \cap A_j) = (1 - 2/n)^r$ for each $i \ne j$. Hence, splitting the sum according to $i = j$ or $i \ne j$, we see that

$\mathrm{Var}(n^{-1} N_n) = \frac{1}{n^2} E N_n^2 - \Big( 1 - \frac{1}{n} \Big)^{2r} = \frac{1}{n} \Big( 1 - \frac{1}{n} \Big)^r + \Big( 1 - \frac{1}{n} \Big) \Big( 1 - \frac{2}{n} \Big)^r - \Big( 1 - \frac{1}{n} \Big)^{2r}$.

As $n \to \infty$, the first term on the right side goes to zero, and with $r/n \to \alpha$, each of the other two terms converges to $e^{-2\alpha}$. Consequently, $\mathrm{Var}(n^{-1} N_n) \to 0$, so applying Lemma 2.1.6 for $b_n = n$ we deduce that

$\frac{N_n}{n} \to e^{-\alpha}$  in $L^2$ and in probability.
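Evaluating the exact mean and variance formulas of Example 2.1.10 (a numerical sketch, with the illustrative choice $r = 2n$, i.e. $\alpha = 2$) confirms that the mean approaches $e^{-2}$ while the variance vanishes:

```python
import math

for n in [10, 100, 10000]:
    r = 2 * n   # so alpha = r/n = 2
    mean = (1 - 1 / n) ** r   # E(n^{-1} N_n), tends to exp(-2)
    var = ((1 / n) * (1 - 1 / n) ** r
           + (1 - 1 / n) * (1 - 2 / n) ** r
           - (1 - 1 / n) ** (2 * r))   # Var(n^{-1} N_n), tends to 0
    print(n, mean, var)
print(math.exp(-2))   # limiting mean ≈ 0.1353
```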
2.1.2. Weak laws and truncation. Our next order of business is to extend the weak law of large numbers for row sums $S_n$ in triangular arrays of independent $X_{n,k}$ which lack a finite second moment. Of course, with $S_n$ no longer in $L^2$, there is no way to establish convergence in $L^2$. So, we aim to retain only the convergence in probability, using truncation. That is, we consider the row sums $\bar{S}_n$ for the truncated array $\bar{X}_{n,k} = X_{n,k} I_{\{|X_{n,k}| \le b_n\}}$, with $b_n \to \infty$ slowly enough to control the variance of $\bar{S}_n$ and fast enough for $P(\bar{S}_n \ne S_n) \to 0$. As we next show, this gives the convergence in probability for $\bar{S}_n$, which translates to the same convergence result for $S_n$.
Theorem 2.1.11 (Weak law for triangular arrays). Suppose that for each $n$, the random variables $X_{n,k}$, $k = 1, \ldots, n$ are pairwise independent. Let $\bar{X}_{n,k} = X_{n,k} I_{\{|X_{n,k}| \le b_n\}}$ for non-random $b_n > 0$ such that as $n \to \infty$ both

(a)  $\sum_{k=1}^n P(|X_{n,k}| > b_n) \to 0$,  and
(b)  $b_n^{-2} \sum_{k=1}^n \mathrm{Var}(\bar{X}_{n,k}) \to 0$.

Then, $b_n^{-1}(S_n - a_n) \xrightarrow{p} 0$ as $n \to \infty$, where $S_n = \sum_{k=1}^n X_{n,k}$ and $a_n = \sum_{k=1}^n E \bar{X}_{n,k}$.
Proof. Let $\bar{S}_n = \sum_{k=1}^n \bar{X}_{n,k}$. Clearly, for any $\epsilon > 0$,

$\Big\{ \Big| \frac{S_n - a_n}{b_n} \Big| > \epsilon \Big\} \subseteq \{ S_n \ne \bar{S}_n \} \cup \Big\{ \Big| \frac{\bar{S}_n - a_n}{b_n} \Big| > \epsilon \Big\}$.

Consequently,

(2.1.2)  $P\Big( \Big| \frac{S_n - a_n}{b_n} \Big| > \epsilon \Big) \le P(S_n \ne \bar{S}_n) + P\Big( \Big| \frac{\bar{S}_n - a_n}{b_n} \Big| > \epsilon \Big)$.

To bound the first term, note that our condition (a) implies that as $n \to \infty$,

$P(\bar{S}_n \ne S_n) \le P\Big( \bigcup_{k=1}^n \{ \bar{X}_{n,k} \ne X_{n,k} \} \Big) \le \sum_{k=1}^n P(\bar{X}_{n,k} \ne X_{n,k}) = \sum_{k=1}^n P(|X_{n,k}| > b_n) \to 0$.

Turning to bound the second term in (2.1.2), recall that pairwise independence is preserved under truncation, hence $\bar{X}_{n,k}$, $k = 1, \ldots, n$ are uncorrelated random variables (to convince yourself, apply (1.4.12) for the appropriate functions). Thus, an application of Lemma 2.1.2 yields that as $n \to \infty$,

$\mathrm{Var}(b_n^{-1} \bar{S}_n) = b_n^{-2} \sum_{k=1}^n \mathrm{Var}(\bar{X}_{n,k}) \to 0$,

by our condition (b). Since $a_n = E \bar{S}_n$, from Chebyshev's inequality we deduce that for any fixed $\epsilon > 0$,

$P\Big( \Big| \frac{\bar{S}_n - a_n}{b_n} \Big| > \epsilon \Big) \le \epsilon^{-2} \mathrm{Var}(b_n^{-1} \bar{S}_n) \to 0$,

as $n \to \infty$. In view of (2.1.2), this completes the proof of the theorem. □
Specializing the weak law of Theorem 2.1.11 to a single sequence yields the following.

Proposition 2.1.12 (Weak law of large numbers). Consider i.i.d. random variables $X_i$, such that $x P(|X_1| > x) \to 0$ as $x \to \infty$. Then, $n^{-1} S_n - \mu_n \xrightarrow{p} 0$, where $S_n = \sum_{i=1}^n X_i$ and $\mu_n = E[X_1 I_{\{|X_1| \le n\}}]$.

Proof. We get the result as an application of Theorem 2.1.11 for $X_{n,k} = X_k$ and $b_n = n$, in which case $a_n = n \mu_n$. Turning to verify condition (a) of this theorem, note that

$\sum_{k=1}^n P(|X_{n,k}| > n) = n P(|X_1| > n) \to 0$

as $n \to \infty$, by our assumption. Thus, all that remains to do is to verify that condition (b) of Theorem 2.1.11 holds here. This amounts to showing that as $n \to \infty$,

$\Delta_n = n^{-2} \sum_{k=1}^n \mathrm{Var}(\bar{X}_{n,k}) = n^{-1} \mathrm{Var}(\bar{X}_{n,1}) \to 0$.
Recall that for any R.V. $Z$,

$\mathrm{Var}(Z) = E Z^2 - (E Z)^2 \le E|Z|^2 = \int_0^\infty 2y\, P(|Z| > y)\, dy$

(see part (a) of Lemma 1.4.31 for the right identity). Considering $Z = \bar{X}_{n,1} = X_1 I_{\{|X_1| \le n\}}$, for which $P(|Z| > y) = P(|X_1| > y) - P(|X_1| > n) \le P(|X_1| > y)$ when $0 < y < n$ and $P(|Z| > y) = 0$ when $y \ge n$, we deduce that

$\Delta_n = n^{-1} \mathrm{Var}(Z) \le n^{-1} \int_0^n g(y)\, dy$,

where by our assumption, $g(y) = 2y\, P(|X_1| > y) \to 0$ as $y \to \infty$. Further, the non-negative Borel function $g(y) \le 2y$ is then uniformly bounded on $[0, \infty)$, hence $n^{-1} \int_0^n g(y)\, dy \to 0$ as $n \to \infty$ (c.f. Exercise 1.3.52). Verifying that $\Delta_n \to 0$, we established condition (b) of Theorem 2.1.11 and thus completed the proof of the proposition. □
Remark. The condition $x P(|X_1| > x) \to 0$ as $x \to \infty$ is indeed necessary for the existence of non-random $\mu_n$ such that $n^{-1} S_n - \mu_n \xrightarrow{p} 0$ (c.f. [Fel71, Pages 234-236] for a proof).
Exercise 2.1.13. Let $X_i$ be i.i.d. with $P(X_1 = (-1)^k k) = 1/(c k^2 \log k)$ for integers $k \ge 2$ and a normalization constant $c = \sum_k 1/(k^2 \log k)$. Show that $E|X_1| = \infty$, but there is a non-random $\mu < \infty$ such that $n^{-1} S_n \xrightarrow{p} \mu$.
As a corollary to Proposition 2.1.12 we next show that $n^{-1} S_n \xrightarrow{p} m_X$ as soon as the i.i.d. random variables $X_i$ are in $L^1$.

Corollary 2.1.14. Consider $S_n = \sum_{k=1}^n X_k$ for i.i.d. random variables $X_i$ such that $E|X_1| < \infty$. Then, $n^{-1} S_n \xrightarrow{p} E X_1$ as $n \to \infty$.
Proof. In view of Proposition 2.1.12, it suffices to show that if $E|X_1| < \infty$, then both $n P(|X_1| > n) \to 0$ and $E X_1 - \mu_n = E[X_1 I_{\{|X_1| > n\}}] \to 0$ as $n \to \infty$. To this end, recall that $E|X_1| < \infty$ implies that $P(|X_1| < \infty) = 1$, and hence the sequence $X_1 I_{\{|X_1| > n\}}$ converges to zero a.s. and is bounded by the integrable $|X_1|$. Thus, by dominated convergence $E[X_1 I_{\{|X_1| > n\}}] \to 0$ as $n \to \infty$. Applying dominated convergence for the sequence $n I_{\{|X_1| > n\}}$ (which also converges a.s. to zero and is bounded by the integrable $|X_1|$), we deduce that $n P(|X_1| > n) = E[n I_{\{|X_1| > n\}}] \to 0$ when $n \to \infty$, thus completing the proof of the corollary. □
We conclude this section by considering an example for which $E|X_1| = \infty$ and Proposition 2.1.12 does not apply, but nevertheless, Theorem 2.1.11 allows us to deduce that $c_n^{-1} S_n \xrightarrow{p} 1$ for some $c_n$ such that $c_n/n \to \infty$.

Example 2.1.15. Let $X_i$ be i.i.d. random variables such that $P(X_1 = 2^j) = 2^{-j}$ for $j = 1, 2, \ldots$. This has the interpretation of a game, where in each of its independent rounds you win $2^j$ dollars if it takes exactly $j$ tosses of a fair coin to get the first Head. This example is called the St. Petersburg paradox, since though $E X_1 = \infty$, you clearly would not pay an infinite amount just in order to play this game. Applying Theorem 2.1.11 we find that one should be willing to pay roughly $n \log_2 n$ dollars for playing $n$ rounds of this game, since $S_n/(n \log_2 n) \xrightarrow{p} 1$ as $n \to \infty$. Indeed, the conditions of Theorem 2.1.11 apply for $b_n = 2^{m_n}$ provided the integers $m_n \to \infty$ are such that $m_n - \log_2 n \to \infty$. Taking $m_n \approx \log_2 n + \log_2(\log_2 n)$ implies that $b_n \approx n \log_2 n$ and $a_n/(n \log_2 n) = m_n/\log_2 n \to 1$ as $n \to \infty$, with the consequence of $S_n/(n \log_2 n) \xrightarrow{p} 1$ (for details see [Dur10, Example 2.2.7]).
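A seeded simulation sketch of Example 2.1.15 (an illustrative experiment, not part of the text): one round pays $2^j$ where $j$ is the number of fair-coin tosses until the first Head, and $S_n/(n \log_2 n)$ should be of order one for large $n$ (the convergence is logarithmically slow, and the heavy tail makes individual runs quite variable).

```python
import math
import random

random.seed(1)   # fixed seed for reproducibility

def one_round():
    # number of tosses until the first Head, paying 2^j
    j = 1
    while random.random() < 0.5:   # Tail with probability 1/2
        j += 1
    return 2 ** j

n = 100000
s_n = sum(one_round() for _ in range(n))
ratio = s_n / (n * math.log2(n))
print(ratio)   # typically of order 1, but with large fluctuations
```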
2.2. The Borel-Cantelli lemmas

When dealing with asymptotic theory, we often wish to understand the relation between countably many events $A_n$ in the same probability space. The two Borel-Cantelli lemmas of Subsection 2.2.1 provide information on the probability of the set of outcomes that are in infinitely many of these events, based only on $P(A_n)$. There are numerous applications of these lemmas, a few of which are given in Subsection 2.2.2, while many more appear in later sections of these notes.
2.2.1. Limit superior and the Borel-Cantelli lemmas. We are often interested in the limits superior and the limits inferior of a sequence of events $A_n$ on the same measurable space $(\Omega, \mathcal{F})$.

Definition 2.2.1. For a sequence of subsets $A_n \subseteq \Omega$, define

$A_\infty := \limsup A_n = \bigcap_{m=1}^{\infty} \bigcup_{l=m}^{\infty} A_l = \{\omega : \omega \in A_n \text{ for infinitely many } n\text{'s}\} = \{\omega : \omega \in A_n \text{ infinitely often}\} = \{A_n \text{ i.o.}\}$

Similarly,

$\liminf A_n = \bigcup_{m=1}^{\infty} \bigcap_{l=m}^{\infty} A_l = \{\omega : \omega \in A_n \text{ for all but finitely many } n\text{'s}\} = \{\omega : \omega \in A_n \text{ eventually}\} = \{A_n \text{ ev.}\}$
Remark. Note that if $A_n \in \mathcal{F}$ are measurable, then so are $\limsup A_n$ and $\liminf A_n$. By DeMorgan's law, we have that $\{A_n \text{ ev.}\} = (\{A_n^c \text{ i.o.}\})^c$, that is, $\omega \in A_n$ for all $n$ large enough if and only if $\omega \in A_n^c$ for finitely many $n$'s. Also, if $\omega \in A_n$ eventually, then certainly $\omega \in A_n$ infinitely often, that is,

$\liminf A_n \subseteq \limsup A_n$.

The notations $\limsup A_n$ and $\liminf A_n$ are due to the intimate connection of these sets to the limsup and liminf of the indicator functions on the sets $A_n$. For example,

$\limsup_{n \to \infty} I_{A_n}(\omega) = I_{\limsup A_n}(\omega)$,

since for a given $\omega$, the limsup on the left side equals 1 if and only if the sequence $n \mapsto I_{A_n}(\omega)$ contains an infinite subsequence of ones. In other words, if and only if the given $\omega$ is in infinitely many of the sets $A_n$. Similarly,

$\liminf_{n \to \infty} I_{A_n}(\omega) = I_{\liminf A_n}(\omega)$,

since for a given $\omega$, the liminf on the left side equals 1 if and only if there are only finitely many zeros in the sequence $n \mapsto I_{A_n}(\omega)$ (for otherwise, their limit inferior is zero). In other words, if and only if the given $\omega$ is in $A_n$ for all $n$ large enough.
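The set identities above can be made concrete on a finite toy space (an illustrative construction, not from the text): with $\Omega = \{0, \ldots, 5\}$ and the periodic sequence $A_n = \{0,2,4\}$ for even $n$, $A_n = \{0,1,2\}$ for odd $n$, one gets $\{A_n \text{ i.o.}\} = \{0,1,2,4\}$ and $\{A_n \text{ ev.}\} = \{0,2\}$.

```python
omega = range(6)

def A(n):
    # periodic sequence of events: evens {0,2,4} alternating with {0,1,2}
    return {0, 2, 4} if n % 2 == 0 else {0, 1, 2}

# omega is in limsup iff its indicator sequence hits 1 infinitely often, and in
# liminf iff it is eventually always 1; for this 2-periodic example it suffices
# to scan one full period far out (here n = 10, 11).
limsup = {w for w in omega if any(w in A(n) for n in (10, 11))}
liminf = {w for w in omega if all(w in A(n) for n in (10, 11))}
print(sorted(limsup), sorted(liminf))   # [0, 1, 2, 4] [0, 2]
```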
In view of the preceding remark, Fatou's lemma yields the following relations.

Exercise 2.2.2. Prove that for any sequence $A_n \in \mathcal{F}$,

$P(\limsup A_n) \ge \limsup_n P(A_n) \ge \liminf_n P(A_n) \ge P(\liminf A_n)$.

Show that the right-most inequality holds even when the probability measure is replaced by an arbitrary measure $\mu(\cdot)$, but the left-most inequality may then fail unless $\mu(\bigcup_{k \ge n} A_k) < \infty$ for some $n$.

Practice your understanding of the concepts of limsup and liminf of sets by solving the following exercise.

Exercise 2.2.3. Assume that $P(\limsup A_n) = 1$ and $P(\liminf B_n) = 1$. Prove that $P(\limsup (A_n \cap B_n)) = 1$. What happens if the condition on $B_n$ is weakened to $P(\limsup B_n) = 1$?
Our next result, called the first Borel-Cantelli lemma, states that if the probabilities $P(A_n)$ of the individual events $A_n$ converge to zero fast enough, then almost surely, $A_n$ occurs for only finitely many values of $n$, that is, $P(A_n \text{ i.o.}) = 0$. This lemma is extremely useful, as the possibly complex relation between the different events $A_n$ is irrelevant for its conclusion.

Lemma 2.2.4 (Borel-Cantelli I). Suppose $A_n \in \mathcal{F}$ and $\sum_{n=1}^{\infty} P(A_n) < \infty$. Then, $P(A_n \text{ i.o.}) = 0$.

Proof. Define $N(\omega) = \sum_{k=1}^{\infty} I_{A_k}(\omega)$. By the monotone convergence theorem and our assumption,

$E[N(\omega)] = E\Big[ \sum_{k=1}^{\infty} I_{A_k}(\omega) \Big] = \sum_{k=1}^{\infty} P(A_k) < \infty$.

Since the expectation of $N$ is finite, certainly $P(\{\omega : N(\omega) = \infty\}) = 0$. Noting that the set $\{\omega : N(\omega) = \infty\}$ is merely $\{\omega : \omega \in A_n \text{ i.o.}\}$, the conclusion $P(A_n \text{ i.o.}) = 0$ of the lemma follows. □
Our next result, left for the reader to prove, relaxes somewhat the conditions of Lemma 2.2.4.

Exercise 2.2.5. Suppose $A_n \in \mathcal{F}$ are such that $\sum_{n=1}^{\infty} P(A_n \cap A_{n+1}^c) < \infty$ and $P(A_n) \to 0$. Show that then $P(A_n \text{ i.o.}) = 0$.
The first Borel-Cantelli lemma states that if the series $\sum_n P(A_n)$ converges then almost every $\omega$ is in finitely many sets $A_n$. If $P(A_n) \to 0$, but the series $\sum_n P(A_n)$ diverges, then the event $\{A_n \text{ i.o.}\}$ might or might not have positive probability. In this sense, Borel-Cantelli I is not tight, as the following example demonstrates.

Example 2.2.6. Consider the uniform probability measure $U$ on $((0,1], \mathcal{B}_{(0,1]})$, and the events $A_n = (0, 1/n]$. Then $A_n \downarrow \emptyset$, so $\{A_n \text{ i.o.}\} = \emptyset$, but $U(A_n) = 1/n$, so $\sum_n U(A_n) = \infty$ and Borel-Cantelli I does not apply.

Recall also Example 1.3.25, showing the existence of $A_n = (t_n, t_n + 1/n]$ such that $U(A_n) = 1/n$ while $\{A_n \text{ i.o.}\} = (0,1]$. Thus, in general the probability of $\{A_n \text{ i.o.}\}$ depends on the relation between the different events $A_n$.
As seen in the preceding example, the divergence of the series $\sum_n P(A_n)$ is not sufficient for the occurrence of a set of positive probability of $\omega$ values, each of which is in infinitely many events $A_n$. However, upon adding the assumption that the events $A_n$ are mutually independent (flagrantly not the case in Example 2.2.6), we conclude that almost all $\omega$ must be in infinitely many of the events $A_n$:

Lemma 2.2.7 (Borel-Cantelli II). Suppose $A_n \in \mathcal{F}$ are mutually independent and $\sum_{n=1}^{\infty} P(A_n) = \infty$. Then, necessarily $P(A_n \text{ i.o.}) = 1$.

Proof. Fix $0 < m < n < \infty$. Use the mutual independence of the events $A_l$ and the inequality $1 - x \le e^{-x}$ for $x \ge 0$, to deduce that

$P\Big( \bigcap_{l=m}^{n} A_l^c \Big) = \prod_{l=m}^{n} P(A_l^c) = \prod_{l=m}^{n} (1 - P(A_l)) \le \prod_{l=m}^{n} e^{-P(A_l)} = \exp\Big( -\sum_{l=m}^{n} P(A_l) \Big)$.

As $n \to \infty$, the set $\bigcap_{l=m}^{n} A_l^c$ shrinks. With the series in the exponent diverging, by continuity from above of the probability measure $P(\cdot)$ we see that for any $m$,

$P\Big( \bigcap_{l=m}^{\infty} A_l^c \Big) \le \exp\Big( -\sum_{l=m}^{\infty} P(A_l) \Big) = 0$.

Take the complement to see that $P(B_m) = 1$ for $B_m = \bigcup_{l=m}^{\infty} A_l$ and all $m$. Since $B_m \downarrow \{A_n \text{ i.o.}\}$ when $m \to \infty$, it follows by continuity from above of $P(\cdot)$ that

$P(A_n \text{ i.o.}) = \lim_{m \to \infty} P(B_m) = 1$,

as stated. □
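The two lemmas can be contrasted in a seeded simulation (an illustrative experiment, not from the text) with independent events $A_n$ of probability $p_n$: for $p_n = 1/n^2$ the series converges and only a handful of the $A_n$ ever occur, while for $p_n = 1/n$ it diverges and occurrences keep accumulating (at roughly logarithmic rate over any finite horizon).

```python
import random

random.seed(2)   # fixed seed for reproducibility
N = 200000       # finite horizon for the experiment

# count how many of the independent events A_1, ..., A_N occur in one sample
hits_sq = sum(random.random() < 1.0 / n ** 2 for n in range(1, N + 1))
hits_lin = sum(random.random() < 1.0 / n for n in range(1, N + 1))
print(hits_sq, hits_lin)   # hits_sq stays small (mean ≈ pi^2/6); hits_lin ≈ log N
```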
As an immediate corollary of the two Borel-Cantelli lemmas, we observe yet an-
other 0-1 law.
Corollary 2.2.8. If $A_n \in \mathcal{F}$ are P-mutually independent then $P(A_n \text{ i.o.})$ is either 0 or 1. In other words, for any given sequence of mutually independent events, either almost all outcomes are in infinitely many of these events, or almost all outcomes are in finitely many of them.
The Kochen-Stone lemma, left as an exercise, generalizes Borel-Cantelli II to situations lacking independence.
Exercise 2.2.9. Suppose $A_k$ are events on the same probability space such that $\sum_k P(A_k) = \infty$ and
\[
\limsup_{n\to\infty} \Big(\sum_{k=1}^n P(A_k)\Big)^2 \Big/ \Big(\sum_{1\le j,k\le n} P(A_j \cap A_k)\Big) = \alpha > 0\,.
\]
Prove that then $P(A_n \text{ i.o.}) \ge \alpha$.
Hint: Consider part (a) of Exercise 1.3.21 for $Y_n = \sum_{k\le n} I_{A_k}$ and $a_n = E Y_n$.
80 2. ASYMPTOTICS: THE LAW OF LARGE NUMBERS
2.2.2. Applications. In the sequel we explore various applications of the two Borel-Cantelli lemmas. In doing so, unless explicitly stated otherwise, all events and random variables are defined on the same probability space.
We know that the convergence a.s. of $X_n$ to $X_\infty$ implies the convergence in probability of $X_n$ to $X_\infty$, but not vice versa (see Exercise 1.3.23 and Example 1.3.25). As our first application of Borel-Cantelli I, we refine the relation between these two modes of convergence, showing that convergence in probability is equivalent to convergence almost surely along sub-sequences.
Theorem 2.2.10. $X_n \xrightarrow{p} X_\infty$ if and only if for every subsequence $m \mapsto X_{n(m)}$ there exists a further sub-subsequence $X_{n(m_k)}$ such that $X_{n(m_k)} \xrightarrow{a.s.} X_\infty$ as $k \to \infty$.
We start the proof of this theorem with a simple analysis lemma.
Lemma 2.2.11. Let $y_n$ be a sequence in a topological space. If every subsequence $y_{n(m)}$ has a further sub-subsequence $y_{n(m_k)}$ that converges to $y$, then $y_n \to y$.
Proof. If $y_n$ does not converge to $y$, then there exists an open set $G$ containing $y$ and a subsequence $y_{n(m)}$ such that $y_{n(m)} \notin G$ for all $m$. But clearly, then we cannot find a further subsequence of $y_{n(m)}$ that converges to $y$. $\square$
Remark. Applying Lemma 2.2.11 to $y_n = E|X_n - X_\infty|$ we deduce that $X_n \xrightarrow{L^1} X_\infty$ if and only if any subsequence $n(m)$ has a further sub-subsequence $n(m_k)$ such that $X_{n(m_k)} \xrightarrow{L^1} X_\infty$ as $k \to \infty$.
Proof of Theorem 2.2.10. First, we show sufficiency, assuming $X_n \xrightarrow{p} X_\infty$. Fix a subsequence $n(m)$ and $\epsilon_k \downarrow 0$. By the definition of convergence in probability, there exists a sub-subsequence $n(m_k)$ such that $P\big(|X_{n(m_k)} - X_\infty| > \epsilon_k\big) \le 2^{-k}$. Call this sequence of events $A_k = \big\{\omega : |X_{n(m_k)}(\omega) - X_\infty(\omega)| > \epsilon_k\big\}$. Then the series $\sum_k P(A_k)$ converges. Therefore, by Borel-Cantelli I, $P(\limsup_k A_k) = 0$. For any $\omega \notin \limsup_k A_k$ there are only finitely many values of $k$ such that $|X_{n(m_k)} - X_\infty| > \epsilon_k$, or alternatively, $|X_{n(m_k)} - X_\infty| \le \epsilon_k$ for all $k$ large enough. Since $\epsilon_k \downarrow 0$, it follows that $X_{n(m_k)}(\omega) \to X_\infty(\omega)$ when $\omega \notin \limsup_k A_k$, that is, with probability one.
Conversely, fix $\epsilon > 0$. Let $y_n = P(|X_n - X_\infty| > \epsilon)$. By assumption, for every subsequence $n(m)$ there exists a further subsequence $n(m_k)$ so that $X_{n(m_k)}$ converges to $X_\infty$ almost surely, hence in probability, and in particular, $y_{n(m_k)} \to 0$. Applying Lemma 2.2.11 we deduce that $y_n \to 0$, and since $\epsilon > 0$ is arbitrary it follows that $X_n \xrightarrow{p} X_\infty$. $\square$
It is not hard to check that convergence almost surely is invariant under application
of an a.s. continuous mapping.
Exercise 2.2.12. Let $g : \mathbb{R} \to \mathbb{R}$ be a Borel function and denote by $D_g$ its set of discontinuities. Show that if $X_n \xrightarrow{a.s.} X_\infty$ finite valued, and $P(X_\infty \in D_g) = 0$, then $g(X_n) \xrightarrow{a.s.} g(X_\infty)$ as well (recall Exercise 1.2.28 that $D_g \in \mathcal{B}$). This applies for a continuous function $g$, in which case $D_g = \emptyset$.
A direct consequence of Theorem 2.2.10 is that convergence in probability is also preserved under an a.s. continuous mapping (and if the mapping is also bounded, we even get $L^1$ convergence).
Corollary 2.2.13. Suppose $X_n \xrightarrow{p} X_\infty$, $g$ is a Borel function and $P(X_\infty \in D_g) = 0$. Then, $g(X_n) \xrightarrow{p} g(X_\infty)$. If in addition $g$ is bounded, then $g(X_n) \xrightarrow{L^1} g(X_\infty)$ (and $E g(X_n) \to E g(X_\infty)$).
Proof. Fix a subsequence $X_{n(m)}$. By Theorem 2.2.10 there exists a subsequence $X_{n(m_k)}$ such that $P(A) = 1$ for $A = \{\omega : X_{n(m_k)}(\omega) \to X_\infty(\omega) \text{ as } k \to \infty\}$. Let $B = \{\omega : X_\infty(\omega) \notin D_g\}$, noting that by assumption $P(B) = 1$. For any $\omega \in A \cap B$ we have $g(X_{n(m_k)}(\omega)) \to g(X_\infty(\omega))$ by the continuity of $g$ outside $D_g$. Therefore, $g(X_{n(m_k)}) \xrightarrow{a.s.} g(X_\infty)$. Now apply Theorem 2.2.10 in the reverse direction: For any subsequence, we have just constructed a further subsequence with convergence a.s., hence $g(X_n) \xrightarrow{p} g(X_\infty)$.
Finally, if $g$ is bounded, then the collection $\{g(X_n)\}$ is U.I. yielding, by Vitali's convergence theorem, its convergence in $L^1$ (and hence that $E g(X_n) \to E g(X_\infty)$). $\square$
You are next to extend the scope of Theorem 2.2.10 and the continuous mapping
of Corollary 2.2.13 to random variables taking values in a separable metric space.
Exercise 2.2.14. Recall the definition of convergence in probability in a separable metric space $(S, \rho)$ as in Remark 1.3.24.
(a) Extend the proof of Theorem 2.2.10 to apply for any $(S, \mathcal{B}_S)$-valued random variables $X_n$, $n \le \infty$ (and in particular for $\mathbb{R}$-valued variables).
(b) Denote by $D_g$ the set of discontinuities of a Borel measurable $g : S \to \mathbb{R}$ (defined similarly to Exercise 1.2.28, where real-valued functions are considered). Suppose $X_n \xrightarrow{p} X_\infty$ and $P(X_\infty \in D_g) = 0$. Show that then $g(X_n) \xrightarrow{p} g(X_\infty)$ and if in addition $g$ is bounded, then also $g(X_n) \xrightarrow{L^1} g(X_\infty)$.
The following result in analysis is obtained by combining the continuous mapping
of Corollary 2.2.13 with the weak law of large numbers.
Exercise 2.2.15 (Inverting Laplace transforms). The Laplace transform of a bounded, continuous function $h(x)$ on $[0, \infty)$ is the function $L_h(s) = \int_0^\infty e^{-sx} h(x)\,dx$ on $(0, \infty)$.
(a) Show that for any $s > 0$ and positive integer $k$,
\[
\frac{(-1)^{k-1} s^k L_h^{(k-1)}(s)}{(k-1)!} = \int_0^\infty e^{-sx} \frac{s^k x^{k-1}}{(k-1)!}\, h(x)\,dx = E[h(W_k)]\,,
\]
where $L_h^{(k-1)}(\cdot)$ denotes the $(k-1)$-th derivative of the function $L_h(\cdot)$ and $W_k$ has the gamma density with parameters $k$ and $s$.
(b) Recall Exercise 1.4.46 that for $s = n/y$ the law of $W_n$ coincides with the law of $n^{-1}\sum_{i=1}^n T_i$ where $T_i \ge 0$ are i.i.d. random variables, each having the exponential distribution of parameter $1/y$ (with $E T_1 = y$ and finite moments of all order, c.f. Example 1.3.68). Deduce that the inversion formula
\[
h(y) = \lim_{n\to\infty} \frac{(-1)^{n-1} (n/y)^n}{(n-1)!}\, L_h^{(n-1)}(n/y)\,,
\]
holds for any $y > 0$.
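The probabilistic mechanism behind part (b), namely that $E[h(W_n)] \to h(y)$ because $W_n$ is an empirical mean of exponentials concentrating at $y$, is easy to check by simulation. A minimal sketch (the choice $h(x) = 1/(1+x)$, the target $y = 2$, the seed and the sample sizes are all illustrative assumptions, not taken from the text):

```python
import random

def mean_h_of_Wn(h, y, n, trials, rng):
    # W_n is the average of n i.i.d. exponentials with mean y;
    # estimate E[h(W_n)] by averaging over independent trials.
    total = 0.0
    for _ in range(trials):
        w = sum(rng.expovariate(1.0 / y) for _ in range(n)) / n
        total += h(w)
    return total / trials

rng = random.Random(0)
h = lambda x: 1.0 / (1.0 + x)   # bounded, continuous on [0, infinity)
y = 2.0
exact = h(y)                    # h(2) = 1/3
approx = mean_h_of_Wn(h, y, n=400, trials=2000, rng=rng)
print(exact, approx)            # the estimate should land close to 1/3
```

Since $W_n \xrightarrow{a.s.} y$ by the strong law and $h$ is bounded and continuous, bounded convergence gives $E[h(W_n)] \to h(y)$, which is exactly the limit driving the inversion formula.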
The next application of Borel-Cantelli I provides our first strong law of large numbers.
Proposition 2.2.16. Suppose $E[Z_n^2] \le C$ for some $C < \infty$ and all $n$. Then, $n^{-1} Z_n \xrightarrow{a.s.} 0$ as $n \to \infty$.
Proof. Fixing $\epsilon > 0$ let $A_k = \{\omega : |k^{-1} Z_k(\omega)| > \epsilon\}$ for $k = 1, 2, \ldots$. Then, by Chebyshev's inequality and our assumption,
\[
P(A_k) = P(\{\omega : |Z_k(\omega)| > \epsilon k\}) \le \frac{E(Z_k^2)}{(\epsilon k)^2} \le \frac{C}{\epsilon^2 k^2}\,.
\]
Since $\sum_k k^{-2} < \infty$, it follows by Borel-Cantelli I that $P(A_\infty) = 0$, where $A_\infty = \{\omega : |k^{-1} Z_k(\omega)| > \epsilon \text{ for infinitely many values of } k\}$. Hence, for any fixed $\epsilon > 0$, with probability one $k^{-1}|Z_k(\omega)| \le \epsilon$ for all large enough $k$, that is, $\limsup_n n^{-1}|Z_n(\omega)| \le \epsilon$ a.s. Considering a sequence $\epsilon_m \downarrow 0$ we conclude that $n^{-1} Z_n \to 0$ for $n \to \infty$ and a.e. $\omega$. $\square$
Exercise 2.2.17. Let $S_n = \sum_{l=1}^n X_l$, where $X_i$ are i.i.d. random variables with $E X_1 = 0$ and $E X_1^4 < \infty$.
(a) Show that $n^{-1} S_n \xrightarrow{a.s.} 0$.
Hint: Verify that Proposition 2.2.16 applies for $Z_n = n^{-1} S_n^2$.
(b) Show that $n^{-1} D_n \xrightarrow{a.s.} 0$ where $D_n$ denotes the number of distinct integers among $\{\xi_k, k \le n\}$ and $\xi_k$ are i.i.d. integer valued random variables.
Hint: $D_n \le 2M + \sum_{k=1}^n I_{|\xi_k| \ge M}$.
In contrast, here is an example where the empirical averages of integrable, zero
mean independent variables do not converge to zero. Of course, the trick is to have
non-identical distributions, with the bulk of the probability drifting to negative one.
Exercise 2.2.18. Suppose $X_i$ are mutually independent random variables such that $P(X_n = n^2 - 1) = 1 - P(X_n = -1) = n^{-2}$ for $n = 1, 2, \ldots$. Show that $E X_n = 0$ for all $n$, while $n^{-1}\sum_{i=1}^n X_i \xrightarrow{a.s.} -1$ for $n \to \infty$.
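The centering of this two-point law can be checked exactly, and one simulated path of the averages typically drifts to $-1$ because all but finitely many $X_n$ equal $-1$. A minimal sketch (the seed and horizon are arbitrary choices):

```python
from fractions import Fraction
import random

def exact_mean(n):
    # E[X_n] = (n^2 - 1) * n^(-2) + (-1) * (1 - n^(-2)), computed in exact arithmetic
    p = Fraction(1, n * n)
    return (n * n - 1) * p + (-1) * (1 - p)

# the two-point law is exactly centered for every n
print([exact_mean(n) for n in (1, 2, 5, 100)])   # each entry is zero

# one sample path of the empirical averages
rng = random.Random(1)
total, N = 0, 10**5
for n in range(1, N + 1):
    x = n * n - 1 if rng.random() < 1.0 / (n * n) else -1
    total += x
print(total / N)   # typically close to -1
```

By Borel-Cantelli I, $\sum_n P(X_n \ne -1) = \sum_n n^{-2} < \infty$ forces $X_n = -1$ eventually, which is why the averages ignore the huge but rare positive values.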
Next we have a few other applications of Borel-Cantelli I, starting with some additional properties of convergence a.s.
Exercise 2.2.19. Show that for any R.V. $X_n$:
(a) $X_n \xrightarrow{a.s.} 0$ if and only if $P(|X_n| > \epsilon \text{ i.o.}) = 0$ for each $\epsilon > 0$.
(b) There exist non-random constants $b_n$ such that $X_n / b_n \xrightarrow{a.s.} 0$.
Exercise 2.2.20. Show that if $W_n > 0$ and $E W_n \le 1$ for every $n$, then almost surely,
\[
\limsup_{n\to\infty} n^{-1}\log W_n \le 0\,.
\]
Our next example demonstrates how Borel-Cantelli I is typically applied in the
study of the asymptotic growth of running maxima of random variables.
Example 2.2.21 (Head runs). Let $\{X_k, k \in \mathbb{Z}\}$ be a two-sided sequence of i.i.d. $\{0,1\}$-valued random variables, with $P(X_1 = 1) = P(X_1 = 0) = 1/2$. With $\ell_m = \max\{i : X_{m-i+1} = \cdots = X_m = 1\}$ denoting the length of the run of 1's going backwards from time $m$, we are interested in the asymptotics of the longest such run during $1, 2, \ldots, n$, that is,
\[
L_n = \max\{\ell_m : m = 1, \ldots, n\} = \max\{m - k : X_{k+1} = \cdots = X_m = 1 \text{ for some } m = 1, \ldots, n\}\,.
\]
Noting that $\ell_m + 1$ has a geometric distribution of success probability $p = 1/2$, we deduce by an application of Borel-Cantelli I that for each $\epsilon > 0$, with probability one, $\ell_n \le (1+\epsilon)\log_2 n$ for all $n$ large enough. Hence, on the same set of probability one, we have $N = N(\omega)$ finite such that $L_n \le \max(L_N, (1+\epsilon)\log_2 n)$ for all $n \ge N$. Dividing by $\log_2 n$ and considering $n \to \infty$ followed by $\epsilon_k \downarrow 0$, this implies that
\[
\limsup_{n\to\infty} \frac{L_n}{\log_2 n} \le 1 \quad \text{a.s.}
\]
For each fixed $\epsilon > 0$ let $A_n = \{L_n < k_n\}$ for $k_n = [(1-\epsilon)\log_2 n]$. Noting that
\[
A_n \subseteq \bigcap_{i=1}^{m_n} B_i^c\,,
\]
for $m_n = [n/k_n]$ and the independent events $B_i = \{X_{(i-1)k_n+1} = \cdots = X_{i k_n} = 1\}$, yields a bound of the form $P(A_n) \le \exp(-n^\epsilon/(2\log_2 n))$ for all $n$ large enough (c.f. [Dur10, Example 2.3.3] for details). Since $\sum_n P(A_n) < \infty$, we have that
\[
\liminf_{n\to\infty} \frac{L_n}{\log_2 n} \ge 1 \quad \text{a.s.}
\]
by yet another application of Borel-Cantelli I, followed by $\epsilon_k \downarrow 0$. We thus conclude that
\[
\frac{L_n}{\log_2 n} \xrightarrow{a.s.} 1\,.
\]
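The $\log_2 n$ growth of the longest head run shows up already at moderate sample sizes; here is a quick simulation sketch (the seed and horizon are arbitrary assumptions, not from the text):

```python
import math, random

def longest_head_run(flips):
    # length of the longest run of 1's in a 0/1 sequence
    best = cur = 0
    for x in flips:
        cur = cur + 1 if x == 1 else 0
        best = max(best, cur)
    return best

rng = random.Random(2)
n = 10**5
flips = [rng.randint(0, 1) for _ in range(n)]
L_n = longest_head_run(flips)
print(L_n, math.log2(n))   # L_n is typically within a few units of log2(n) ~ 16.6
```

The fluctuation of $L_n$ around $\log_2 n$ is of order one, which is consistent with the a.s. convergence of the ratio $L_n/\log_2 n$ to 1.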
The next exercise combines both Borel-Cantelli lemmas to provide the 0-1 law for
another problem about head runs.
Exercise 2.2.22. Let $\{X_k\}$ be a sequence of i.i.d. $\{0,1\}$-valued random variables, with $P(X_1 = 1) = p$ and $P(X_1 = 0) = 1 - p$. Let $A_k$ be the event that $X_m = \cdots = X_{m+k-1} = 1$ for some $2^k \le m \le 2^{k+1} - k$. Show that $P(A_k \text{ i.o.}) = 1$ if $p \ge 1/2$ and $P(A_k \text{ i.o.}) = 0$ if $p < 1/2$.
Hint: When $p \ge 1/2$ consider only $m = 2^k + (i-1)k$ for $i = 1, \ldots, [2^k/k]$.
Here are a few direct applications of the second Borel-Cantelli lemma.
Exercise 2.2.23. Suppose that $Z_k$ are i.i.d. random variables such that $P(Z_1 = z) < 1$ for any $z \in \mathbb{R}$.
(a) Show that $P(Z_k \text{ converges for } k \to \infty) = 0$.
(b) Determine the values of $\limsup_n (Z_n/\log n)$ and $\liminf_n (Z_n/\log n)$ in case $Z_k$ has the exponential distribution (of parameter $\lambda = 1$).
After deriving the classical bounds on the tail of the normal distribution, you use both Borel-Cantelli lemmas in bounding the fluctuations of the sums of i.i.d. standard normal variables.
Exercise 2.2.24. Let $G_i$ be i.i.d. standard normal random variables.
(a) Show that for any $x > 0$,
\[
(x^{-1} - x^{-3})\, e^{-x^2/2} \le \int_x^\infty e^{-y^2/2}\,dy \le x^{-1} e^{-x^2/2}\,.
\]
Many texts prove these estimates, for example see [Dur10, Theorem 1.2.3].
(b) Show that, with probability one,
\[
\limsup_{n\to\infty} \frac{G_n}{\sqrt{2\log n}} = 1\,.
\]
(c) Let $S_n = G_1 + \cdots + G_n$. Recall that $n^{-1/2} S_n$ has the standard normal distribution. Show that
\[
P(|S_n| < 2\sqrt{n\log n},\ \text{ev.}) = 1\,.
\]
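The two-sided estimate in part (a) is easy to sanity-check numerically, using the identity $\int_x^\infty e^{-y^2/2}\,dy = \sqrt{\pi/2}\,\operatorname{erfc}(x/\sqrt{2})$. A minimal sketch (the grid of test points is an arbitrary choice):

```python
import math

def gaussian_tail_integral(x):
    # integral of exp(-y^2/2) over [x, infinity), via the complementary error function
    return math.sqrt(math.pi / 2) * math.erfc(x / math.sqrt(2))

checks = []
for x in (0.5, 1.0, 2.0, 3.0, 5.0):
    tail = gaussian_tail_integral(x)
    lower = (1 / x - 1 / x**3) * math.exp(-x * x / 2)
    upper = (1 / x) * math.exp(-x * x / 2)
    checks.append(lower <= tail <= upper)
print(checks)   # all True
```

Note that for $x \le 1$ the lower bound is non-positive, so only the upper bound is informative there; the sandwich becomes tight as $x \to \infty$.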
Remark. Ignoring the dependence between the elements of the sequence $S_k$, the bound in part (c) of the preceding exercise is not tight. The definite result here is the law of the iterated logarithm (in short lil), which states that when the i.i.d. summands are of zero mean and variance one,
\[
(2.2.1) \qquad P\Big(\limsup_{n\to\infty} \frac{S_n}{\sqrt{2n\log\log n}} = 1\Big) = 1\,.
\]
We defer the derivation of (2.2.1) to Theorem 9.2.29, building on a similar lil for the Brownian motion (but, see [Bil95, Theorem 9.5] for a direct proof of (2.2.1), using both Borel-Cantelli lemmas).
The next exercise relates explicit integrability conditions for i.i.d. random variables to the asymptotics of their running maxima.
Exercise 2.2.25. Consider possibly $\overline{\mathbb{R}}$-valued, i.i.d. random variables $Y_i$ and their running maxima $M_n = \max_{k\le n} Y_k$.
(a) Using (2.3.4) if needed, show that $P(|Y_n| > n \text{ i.o.}) = 0$ if and only if $E[|Y_1|] < \infty$.
(b) Show that $n^{-1} Y_n \xrightarrow{a.s.} 0$ if and only if $E[|Y_1|] < \infty$.
(c) Show that $n^{-1} M_n \xrightarrow{a.s.} 0$ if and only if $E[(Y_1)_+] < \infty$ and $P(Y_1 > -\infty) > 0$.
(d) Show that $n^{-1} M_n \xrightarrow{p} 0$ if and only if $n P(Y_1 > n) \to 0$ and $P(Y_1 > -\infty) > 0$.
(e) Show that $n^{-1} Y_n \xrightarrow{p} 0$ if and only if $P(|Y_1| < \infty) = 1$.
In the following exercise, you combine Borel-Cantelli I and the variance computation of Lemma 2.1.2 to improve upon Borel-Cantelli II.
Exercise 2.2.26. Suppose $\sum_{n=1}^\infty P(A_n) = \infty$ for pairwise independent events $A_i$. Let $S_n = \sum_{i=1}^n I_{A_i}$ be the number of events occurring among the first $n$.
(a) Prove that $\mathrm{Var}(S_n) \le E(S_n)$ and deduce from it that $S_n/E(S_n) \xrightarrow{p} 1$.
(b) Applying Borel-Cantelli I show that $S_{n_k}/E(S_{n_k}) \xrightarrow{a.s.} 1$ as $k \to \infty$, where $n_k = \inf\{n : E(S_n) \ge k^2\}$.
(c) Show that $E(S_{n_{k+1}})/E(S_{n_k}) \to 1$ and since $n \mapsto S_n$ is non-decreasing, deduce that $S_n/E(S_n) \xrightarrow{a.s.} 1$.
Remark. Borel-Cantelli II is the a.s. convergence $S_n \to \infty$ for $n \to \infty$, which is a consequence of part (c) of the preceding exercise (since $E S_n \to \infty$).
We conclude this section with an example in which the asymptotic rate of growth
of random variables of interest is obtained by an application of Exercise 2.2.26.
Example 2.2.27 (Record values). Let $X_i$ be a sequence of i.i.d. random variables with a continuous distribution function $F_X(x)$. The event $A_k = \{X_k > X_j, j = 1, \ldots, k-1\}$ represents the occurrence of a record at the $k$-th instance (for example, think of $X_k$ as an athlete's $k$-th distance jump). We are interested in the asymptotics of the count $R_n = \sum_{i=1}^n I_{A_i}$ of record events during the first $n$ instances. Because of the continuity of $F_X$ we know that a.s. the values of $X_i$, $i = 1, 2, \ldots$ are distinct. Further, rearranging the random variables $X_1, X_2, \ldots, X_n$ in a decreasing order induces a random permutation $\pi_n$ on $\{1, 2, \ldots, n\}$, where all $n!$ possible permutations are equally likely. From this it follows that $P(A_k) = P(\pi_k(k) = 1) = 1/k$, and though definitely not obvious at first sight, the events $A_k$ are mutually independent (see [Dur10, Example 2.3.2] for details). So, $E R_n = \log n + \gamma_n$ where $\gamma_n$ is between zero and one, and from Exercise 2.2.26 we deduce that $(\log n)^{-1} R_n \xrightarrow{a.s.} 1$ as $n \to \infty$. Note that this result is independent of the law of $X$, as long as the distribution function $F_X$ is continuous.
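The logarithmic growth of record counts is easy to observe empirically, since $E R_n$ equals the harmonic sum $H_n = \sum_{k\le n} 1/k$. A minimal sketch (seed, trial count and horizon are arbitrary assumptions):

```python
import random

def count_records(xs):
    # number of strict running maxima ("records") in a sequence
    best = float("-inf")
    records = 0
    for x in xs:
        if x > best:
            records += 1
            best = x
    return records

rng = random.Random(3)
n, trials = 1000, 200
H_n = sum(1.0 / k for k in range(1, n + 1))   # E[R_n] = H_n, about log(n) + 0.58
avg = sum(count_records([rng.random() for _ in range(n)])
          for _ in range(trials)) / trials
print(H_n, avg)   # the empirical average record count should be close to H_n
```

Using uniform samples here is no loss of generality: as the text notes, the law of $R_n$ does not depend on the (continuous) law of $X$.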
2.3. Strong law of large numbers
In Corollary 2.1.14 we got the classical weak law of large numbers, namely, the convergence in probability of the empirical averages $n^{-1}\sum_{i=1}^n X_i$ of i.i.d. integrable random variables $X_i$ to the mean $E X_1$. Assuming in addition that $E X_1^4 < \infty$, you used Borel-Cantelli I in Exercise 2.2.17 en-route to the corresponding strong law of large numbers, that is, replacing the convergence in probability with the stronger notion of convergence almost surely.
We provide here two approaches to the strong law of large numbers, both of which get rid of the unnecessary finite moment assumptions. Subsection 2.3.1 follows Etemadi's (1981) direct proof of this result via the subsequence method. Subsection 2.3.2 deals in a more systematic way with the convergence of random series, yielding the strong law of large numbers as one of its consequences.
2.3.1. The subsequence method. Etemadi's key observation is that it essentially suffices to consider non-negative $X_i$, for which upon proving the a.s. convergence along a not too sparse subsequence $n_l$, the interpolation to the whole sequence can be done by the monotonicity of $n \mapsto \sum_{i=1}^n X_i$. This is an example of a general approach to a.s. convergence, called the subsequence method, which you have already encountered in Exercise 2.2.26.
We thus start with the strong law for integrable, non-negative variables.
Proposition 2.3.1. Let $S_n = \sum_{i=1}^n X_i$ for non-negative, pairwise independent and identically distributed, integrable random variables $X_i$. Then, $n^{-1} S_n \xrightarrow{a.s.} E X_1$ as $n \to \infty$.
Proof. The proof progresses along the themes of Section 2.1, starting with the truncation $\overline{X}_k = X_k I_{|X_k| \le k}$ and its corresponding sums $\overline{S}_n = \sum_{i=1}^n \overline{X}_i$. Since $X_i$ are identically distributed and $x \mapsto P(|X_1| > x)$ is non-increasing, we have that
\[
\sum_{k=1}^\infty P(\overline{X}_k \ne X_k) = \sum_{k=1}^\infty P(|X_1| > k) \le \int_0^\infty P(|X_1| > x)\,dx = E|X_1| < \infty
\]
(see part (a) of Lemma 1.4.31 for the rightmost identity and recall our assumption that $X_1$ is integrable). Thus, by Borel-Cantelli I, with probability one, $\overline{X}_k(\omega) = X_k(\omega)$ for all but finitely many $k$'s, in which case necessarily $\sup_n |\overline{S}_n(\omega) - S_n(\omega)|$ is finite. This shows that $n^{-1}(\overline{S}_n - S_n) \xrightarrow{a.s.} 0$, whereby it suffices to prove that $n^{-1}\overline{S}_n \xrightarrow{a.s.} E X_1$.
To this end, we next show that it suffices to prove the following lemma about almost sure convergence of $\overline{S}_n$ along suitably chosen subsequences.
Lemma 2.3.2. Fixing $\alpha > 1$ let $n_l = [\alpha^l]$. Under the conditions of the proposition, $n_l^{-1}(\overline{S}_{n_l} - E\overline{S}_{n_l}) \xrightarrow{a.s.} 0$ as $l \to \infty$.
By dominated convergence, $E[X_1 I_{|X_1| \le k}] \to E X_1$ as $k \to \infty$, and consequently, as $n \to \infty$,
\[
\frac{1}{n} E\overline{S}_n = \frac{1}{n}\sum_{k=1}^n E\overline{X}_k = \frac{1}{n}\sum_{k=1}^n E[X_1 I_{|X_1| \le k}] \to E X_1
\]
(we have used here the consistency of Cesàro averages, c.f. Exercise 1.3.52 for an integral version). Thus, assuming that Lemma 2.3.2 holds, we have that $n_l^{-1}\overline{S}_{n_l} \xrightarrow{a.s.} E X_1$ when $l \to \infty$, for each $\alpha > 1$.
We complete the proof of the proposition by interpolating from the subsequences $n_l = [\alpha^l]$ to the whole sequence. To this end, fix $\alpha > 1$. Since $n \mapsto \overline{S}_n$ is non-decreasing, we have for all $\omega$ and any $n \in [n_l, n_{l+1}]$,
\[
\frac{n_l}{n_{l+1}}\, \frac{\overline{S}_{n_l}(\omega)}{n_l} \le \frac{\overline{S}_n(\omega)}{n} \le \frac{n_{l+1}}{n_l}\, \frac{\overline{S}_{n_{l+1}}(\omega)}{n_{l+1}}\,.
\]
With $n_l/n_{l+1} \to 1/\alpha$ for $l \to \infty$, the a.s. convergence of $m^{-1}\overline{S}_m$ along the subsequence $m = n_l$ implies that the event
\[
A_\alpha := \Big\{\omega : \frac{1}{\alpha} E X_1 \le \liminf_{n\to\infty} \frac{\overline{S}_n(\omega)}{n} \le \limsup_{n\to\infty} \frac{\overline{S}_n(\omega)}{n} \le \alpha\, E X_1\Big\}\,,
\]
has probability one. Consequently, taking $\alpha_m \downarrow 1$, we deduce that the event $B := \bigcap_m A_{\alpha_m}$ also has probability one, and further, $n^{-1}\overline{S}_n(\omega) \to E X_1$ for each $\omega \in B$. We thus deduce that $n^{-1}\overline{S}_n \xrightarrow{a.s.} E X_1$, as needed to complete the proof of the proposition. $\square$
Remark. The monotonicity of certain random variables (here $n \mapsto \overline{S}_n$), is crucial to the successful application of the subsequence method. The subsequence $n_l$ for which we need a direct proof of convergence is completely determined by the scaling function $b_n^{-1}$ applied to this monotone sequence (here $b_n = n$); we need $b_{n_{l+1}}/b_{n_l} \to \alpha$, which should be arbitrarily close to 1. For example, the same subsequences $n_l = [\alpha^l]$ are to be used whenever $b_n$ is roughly of a polynomial growth in $n$, while even $n_l = (l!)^c$ would work in case $b_n = \log n$.
Likewise, the truncation level is determined by the highest moment of the basic variables which is assumed to be finite. For example, we can take $\overline{X}_k = X_k I_{|X_k| \le k^p}$ for any $p > 0$ such that $E|X_1|^{1/p} < \infty$.
Proof of Lemma 2.3.2. Note that $E[\overline{X}_k^2]$ is non-decreasing in $k$. Further, $\overline{X}_k$ are pairwise independent, hence uncorrelated, so by Lemma 2.1.2,
\[
\mathrm{Var}(\overline{S}_n) = \sum_{k=1}^n \mathrm{Var}(\overline{X}_k) \le \sum_{k=1}^n E[\overline{X}_k^2] \le n\, E[\overline{X}_n^2] = n\, E[X_1^2 I_{|X_1| \le n}]\,.
\]
Combining this with Chebyshev's inequality yields the bound
\[
P(|\overline{S}_n - E\overline{S}_n| \ge \epsilon n) \le (\epsilon n)^{-2}\,\mathrm{Var}(\overline{S}_n) \le \epsilon^{-2} n^{-1} E[X_1^2 I_{|X_1| \le n}]\,,
\]
for any $\epsilon > 0$. Applying Borel-Cantelli I for the events $A_l = \{|\overline{S}_{n_l} - E\overline{S}_{n_l}| \ge \epsilon n_l\}$, followed by $\epsilon_m \downarrow 0$, we get the a.s. convergence to zero of $n^{-1}|\overline{S}_n - E\overline{S}_n|$ along any subsequence $n_l$ for which
\[
\sum_{l=1}^\infty n_l^{-1} E[X_1^2 I_{|X_1| \le n_l}] = E\Big[X_1^2 \sum_{l=1}^\infty n_l^{-1} I_{|X_1| \le n_l}\Big] < \infty
\]
(the latter identity is a special case of Exercise 1.3.40). Since $E|X_1| < \infty$, it thus suffices to show that for $n_l = [\alpha^l]$ and any $x > 0$,
\[
(2.3.1) \qquad u(x) := \sum_{l=1}^\infty n_l^{-1} I_{x \le n_l} \le c\, x^{-1}\,,
\]
where $c = 2\alpha/(\alpha - 1) < \infty$. To establish (2.3.1) fix $\alpha > 1$ and $x > 0$, setting $L = \min\{l \ge 1 : n_l \ge x\}$. Then, $\alpha^L \ge x$, and since $[y] \ge y/2$ for all $y \ge 1$,
\[
u(x) = \sum_{l=L}^\infty n_l^{-1} \le 2\sum_{l=L}^\infty \alpha^{-l} = c\,\alpha^{-L} \le c\, x^{-1}\,.
\]
So, we have established (2.3.1) and hence completed the proof of the lemma. $\square$
As already promised, it is not hard to extend the scope of the strong law of large
numbers beyond integrable and non-negative random variables.
Theorem 2.3.3 (Strong law of large numbers). Let $S_n = \sum_{i=1}^n X_i$ for pairwise independent and identically distributed random variables $X_i$, such that either $E[(X_1)_+]$ is finite or $E[(X_1)_-]$ is finite. Then, $n^{-1} S_n \xrightarrow{a.s.} E X_1$ as $n \to \infty$.
Proof. First consider non-negative $X_i$. The case of $E X_1 < \infty$ has already been dealt with in Proposition 2.3.1. In case $E X_1 = \infty$, consider $S_n^{(m)} = \sum_{i=1}^n X_i^{(m)}$ for the bounded, non-negative, pairwise independent and identically distributed random variables $X_i^{(m)} = \min(X_i, m) \le X_i$. Since Proposition 2.3.1 applies for $X_i^{(m)}$, it follows that a.s. for any fixed $m < \infty$,
\[
(2.3.2) \qquad \liminf_{n\to\infty} n^{-1} S_n \ge \liminf_{n\to\infty} n^{-1} S_n^{(m)} = E X_1^{(m)} = E\min(X_1, m)\,.
\]
Taking $m \to \infty$, by monotone convergence $E\min(X_1, m) \uparrow E X_1 = \infty$, so (2.3.2) results with $n^{-1} S_n \xrightarrow{a.s.} \infty = E X_1$.
Turning to the general case, we have the decomposition $X_i = (X_i)_+ - (X_i)_-$ of each random variable to its positive and negative parts, with
\[
(2.3.3) \qquad n^{-1} S_n = n^{-1}\sum_{i=1}^n (X_i)_+ - n^{-1}\sum_{i=1}^n (X_i)_-\,.
\]
Since $(X_i)_+$ are non-negative, pairwise independent and identically distributed, it follows that $n^{-1}\sum_{i=1}^n (X_i)_+ \xrightarrow{a.s.} E[(X_1)_+]$ as $n \to \infty$. For the same reason, also $n^{-1}\sum_{i=1}^n (X_i)_- \xrightarrow{a.s.} E[(X_1)_-]$. Our assumption that either $E[(X_1)_+] < \infty$ or $E[(X_1)_-] < \infty$ implies that $E X_1 = E[(X_1)_+] - E[(X_1)_-]$ is well defined, and in view of (2.3.3) we have the stated a.s. convergence of $n^{-1} S_n$ to $E X_1$. $\square$
Exercise 2.3.4. You are to prove now a converse to the strong law of large numbers (for a more general result, due to Feller (1946), see [Dur10, Theorem 2.5.9]).
(a) Let $Y$ denote the integer part of a random variable $Z \ge 0$. Show that $Y = \sum_{n=1}^\infty I_{\{Z \ge n\}}$, and deduce that
\[
(2.3.4) \qquad \sum_{n=1}^\infty P(Z \ge n) \le E Z \le 1 + \sum_{n=1}^\infty P(Z \ge n)\,.
\]
(b) Suppose $X_i$ are i.i.d. R.V.s with $E[|X_1|^\alpha] = \infty$ for some $\alpha > 0$. Show that for any $k > 0$,
\[
\sum_{n=1}^\infty P(|X_n| > k n^{1/\alpha}) = \infty\,,
\]
and deduce that a.s. $\limsup_n n^{-1/\alpha}|X_n| = \infty$.
(c) Conclude that if $S_n = X_1 + X_2 + \cdots + X_n$, then
\[
\limsup_{n\to\infty} n^{-1/\alpha}|S_n| = \infty\,, \quad \text{a.s.}
\]
We provide next two classical applications of the strong law of large numbers, the first of which deals with the large sample asymptotics of the empirical distribution function.
Example 2.3.5 (Empirical distribution function). Let
\[
\widehat{F}_n(x) = n^{-1}\sum_{i=1}^n I_{(-\infty, x]}(X_i)\,,
\]
denote the observed fraction of values among the first $n$ variables of the sequence $\{X_i\}$ which do not exceed $x$. The functions $\widehat{F}_n(\cdot)$ are thus called the empirical distribution functions of this sequence.
For i.i.d. $X_i$ with distribution function $F_X$ our next result improves the strong law of large numbers by showing that $\widehat{F}_n$ converges uniformly to $F_X$ as $n \to \infty$.
Theorem 2.3.6 (Glivenko-Cantelli). For i.i.d. $X_i$ with arbitrary distribution function $F_X$, as $n \to \infty$,
\[
D_n = \sup_{x\in\mathbb{R}} |\widehat{F}_n(x) - F_X(x)| \xrightarrow{a.s.} 0\,.
\]
Remark. While outside our scope, we note in passing the Dvoretzky-Kiefer-Wolfowitz inequality that $P(D_n > \epsilon) \le 2\exp(-2n\epsilon^2)$ for any $n$ and all $\epsilon > 0$, quantifying the rate of convergence of $D_n$ to zero (see [DKW56], or [Mas90] for the optimal pre-exponential constant).
Proof. By the right continuity of both $x \mapsto \widehat{F}_n(x)$ and $x \mapsto F_X(x)$ (c.f. Theorem 1.2.36), the value of $D_n$ is unchanged when the supremum over $x \in \mathbb{R}$ is replaced by the one over $x \in \mathbb{Q}$ (the rational numbers). In particular, this shows that each $D_n$ is a random variable (c.f. Theorem 1.2.22).
Applying the strong law of large numbers for the i.i.d. non-negative $I_{(-\infty,x]}(X_i)$ whose expectation is $F_X(x)$, we deduce that $\widehat{F}_n(x) \xrightarrow{a.s.} F_X(x)$ for each fixed non-random $x \in \mathbb{R}$. Similarly, considering the strong law of large numbers for the i.i.d. non-negative $I_{(-\infty,x)}(X_i)$ whose expectation is $F_X(x^-)$, we have that $\widehat{F}_n(x^-) \xrightarrow{a.s.} F_X(x^-)$ for each fixed non-random $x \in \mathbb{R}$. Consequently, for any fixed $l < \infty$ and $x_{1,l}, \ldots, x_{l,l}$ we have that
\[
D_{n,l} = \max\Big(\max_{k=1}^l |\widehat{F}_n(x_{k,l}) - F_X(x_{k,l})|,\ \max_{k=1}^l |\widehat{F}_n(x_{k,l}^-) - F_X(x_{k,l}^-)|\Big) \xrightarrow{a.s.} 0\,,
\]
as $n \to \infty$. Choosing $x_{k,l} = \inf\{x : F_X(x) \ge k/(l+1)\}$, we get out of the monotonicity of $x \mapsto \widehat{F}_n(x)$ and $x \mapsto F_X(x)$ that $D_n \le D_{n,l} + l^{-1}$ (c.f. [Bil95, Proof of Theorem 20.6]). Therefore, taking $n \to \infty$ followed by $l \to \infty$ completes the proof of the theorem. $\square$
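For a sample with no ties, $D_n$ can be computed from the order statistics as $\max_i \max\{i/n - F_X(x_{(i)}),\ F_X(x_{(i)}) - (i-1)/n\}$, since $\widehat{F}_n$ is piecewise constant. A minimal sketch for uniform samples (the seed and sample sizes are arbitrary choices):

```python
import random

def ks_distance(sample, cdf):
    # D_n = sup_x |F_n(x) - F(x)|, evaluated via the order statistics
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        d = max(d, i / n - cdf(x), cdf(x) - (i - 1) / n)
    return d

rng = random.Random(5)
uniform_cdf = lambda x: min(max(x, 0.0), 1.0)
for n in (100, 10000):
    sample = [rng.random() for _ in range(n)]
    print(n, ks_distance(sample, uniform_cdf))   # shrinks roughly like 1/sqrt(n)
```

The observed decay rate matches the Dvoretzky-Kiefer-Wolfowitz remark above: $D_n$ concentrates at scale $n^{-1/2}$.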
We turn to our second example, which is about counting processes.
Example 2.3.7 (Renewal theory). Let $\tau_i$ be i.i.d. positive, finite random variables and $T_n = \sum_{k=1}^n \tau_k$. Here $T_n$ is interpreted as the time of the $n$-th occurrence of a given event, with $\tau_k$ representing the length of the time interval between the $(k-1)$ occurrence and that of the $k$-th such occurrence. Associated with $T_n$ is the dual process $N_t = \sup\{n : T_n \le t\}$ counting the number of occurrences during the time interval $[0, t]$. In the next exercise you are to derive the strong law for the large $t$ asymptotics of $t^{-1} N_t$.
Exercise 2.3.8. Consider the setting of Example 2.3.7.
(a) By the strong law of large numbers argue that $n^{-1} T_n \xrightarrow{a.s.} E\tau_1$. Then, adopting the convention $\frac{1}{\infty} = 0$, deduce that $t^{-1} N_t \xrightarrow{a.s.} 1/E\tau_1$ for $t \to \infty$.
Hint: From the definition of $N_t$ it follows that $T_{N_t} \le t < T_{N_t+1}$ for all $t \ge 0$.
(b) Show that $t^{-1} N_t \xrightarrow{a.s.} 1/E\tau_2$ as $t \to \infty$, even if the law of $\tau_1$ is different from that of the i.i.d. $\tau_i$, $i \ge 2$.
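Part (a) is easy to check by simulating one long renewal path; the sketch below uses exponential inter-arrival times with mean 2, so the limiting rate is $1/E\tau_1 = 1/2$ (the distribution, seed and horizon are arbitrary assumptions):

```python
import random

rng = random.Random(6)
mean_tau = 2.0
t_max = 10**5
time, count = 0.0, 0
while True:
    time += rng.expovariate(1.0 / mean_tau)   # tau_k, i.i.d. with E[tau] = 2
    if time > t_max:
        break
    count += 1                                 # N_t counts occurrences by time t_max
rate = count / t_max
print(rate, 1.0 / mean_tau)   # the empirical rate should be near 0.5
```

The sandwich $T_{N_t} \le t < T_{N_t+1}$ from the hint is exactly what the loop implements: it stops at the first arrival past $t_{\max}$.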
Here is a strengthening of the preceding result to convergence in $L^1$.
Exercise 2.3.9. In the context of Example 2.3.7 fix $\epsilon > 0$ such that $P(\tau_1 > \epsilon) > 0$ and let $\widetilde{T}_n = \sum_{k=1}^n \widetilde{\tau}_k$ for the i.i.d. random variables $\widetilde{\tau}_i = \epsilon I_{\{\tau_i > \epsilon\}}$. Note that $\widetilde{T}_n \le T_n$ and consequently $N_t \le \widetilde{N}_t = \sup\{n : \widetilde{T}_n \le t\}$.
(a) Show that $\limsup_{t\to\infty} t^{-2} E\widetilde{N}_t^2 < \infty$.
(b) Deduce that $\{t^{-1} N_t : t \ge 1\}$ is uniformly integrable (see Exercise 1.3.54), and conclude that $t^{-1} E N_t \to 1/E\tau_1$ when $t \to \infty$.
The next exercise deals with an elaboration over Example 2.3.7.
Exercise 2.3.10. For $i = 1, 2, \ldots$ the $i$-th light bulb burns for an amount of time $\tau_i$ and then remains burned out for time $s_i$ before being replaced by the $(i+1)$-th bulb. Let $R_t$ denote the fraction of time during $[0, t]$ in which we have a working light. Assuming that the two sequences $\{\tau_i\}$ and $\{s_i\}$ are independent, each consisting of i.i.d. positive and integrable random variables, show that $R_t \xrightarrow{a.s.} E\tau_1/(E\tau_1 + E s_1)$.
Here is another exercise, dealing with sampling at times of heads in independent
fair coin tosses, from a non-random bounded sequence of weights v(l), the averages
of which converge.
Exercise 2.3.11. For a sequence $\{B_i\}$ of i.i.d. Bernoulli random variables of parameter $p = 1/2$, let $T_n$ be the time that the corresponding partial sums reach level $n$. That is, $T_n = \inf\{k : \sum_{i=1}^k B_i \ge n\}$, for $n = 1, 2, \ldots$.
(a) Show that $n^{-1} T_n \xrightarrow{a.s.} 2$ as $n \to \infty$.
(b) Given non-negative, non-random $\{v(k)\}$ show that $k^{-1}\sum_{i=1}^k v(T_i) \xrightarrow{a.s.} s$ as $k \to \infty$, for some non-random $s$, if and only if $n^{-1}\sum_{l=1}^n v(l) B_l \xrightarrow{a.s.} s/2$ as $n \to \infty$.
(c) Deduce that if $n^{-1}\sum_{l=1}^n v(l)^2$ is bounded and $n^{-1}\sum_{l=1}^n v(l) \to s$ as $n \to \infty$, then $k^{-1}\sum_{i=1}^k v(T_i) \xrightarrow{a.s.} s$ as $k \to \infty$.
Hint: For part (c) consider first the limit of $n^{-1}\sum_{l=1}^n v(l)(B_l - 0.5)$ as $n \to \infty$.
We conclude this subsection with a few additional applications of the strong law of large numbers, first to a problem of universal hypothesis testing, then an application involving stochastic geometry, and finally one motivated by investment science.
Exercise 2.3.12. Consider i.i.d. $[0,1]$-valued random variables $\{X_k\}$.
(a) Find Borel measurable functions $f_n : [0,1]^n \to \{0,1\}$, which are independent of the law of $X_k$, such that $f_n(X_1, X_2, \ldots, X_n) \xrightarrow{a.s.} 0$ whenever $E X_1 < 1/2$ and $f_n(X_1, X_2, \ldots, X_n) \xrightarrow{a.s.} 1$ whenever $E X_1 > 1/2$.
(b) Modify your answer to assure that $f_n(X_1, X_2, \ldots, X_n) \xrightarrow{a.s.} 1$ also in case $E X_1 = 1/2$.
Exercise 2.3.13. Let $U_n$ be i.i.d. random vectors, each uniformly distributed on the unit ball $\{u \in \mathbb{R}^2 : |u| \le 1\}$. Consider the $\mathbb{R}^2$-valued random vectors $X_n = |X_{n-1}| U_n$, $n = 1, 2, \ldots$ starting at a non-random, non-zero vector $X_0$ (that is, each point is uniformly chosen in a ball centered at the origin and whose radius is the distance from the origin to the previously chosen point). Show that $n^{-1}\log|X_n| \xrightarrow{a.s.} -1/2$ as $n \to \infty$.
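The constant $-1/2$ comes from $E\log|U_1| = \int_0^1 (\log r)\,2r\,dr = -1/2$, since the radius of a uniform point in the unit disk has density $2r$; the strong law then applies to $\log|X_n| = \log|X_0| + \sum_{i\le n}\log|U_i|$. A quick Monte Carlo check of this expectation (seed and sample size are arbitrary assumptions):

```python
import math, random

rng = random.Random(7)
N = 200000
total = 0.0
count = 0
while count < N:
    # rejection-sample a uniform point in the unit disk
    x, y = 2 * rng.random() - 1, 2 * rng.random() - 1
    r2 = x * x + y * y
    if 0 < r2 <= 1:
        total += 0.5 * math.log(r2)   # log|U| = (1/2) log(x^2 + y^2)
        count += 1
mean_log_r = total / N
print(mean_log_r)   # should be close to -0.5
```

Working with $\log(x^2+y^2)/2$ avoids a square root, and rejection sampling keeps the disk distribution exactly uniform.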
Exercise 2.3.14. Let $V_n$ be i.i.d. non-negative random variables. Fixing $r > 0$ and $q \in (0,1]$, consider the sequence $W_0 = 1$ and $W_n = (qr + (1-q)V_n)W_{n-1}$, $n = 1, 2, \ldots$. A motivating example is of $W_n$ recording the relative growth of a portfolio where a constant fraction $q$ of one's wealth is re-invested each year in a risk-less asset that grows by $r$ per year, with the remainder re-invested in a risky asset whose annual growth factors are the random $V_n$.
(a) Show that $n^{-1}\log W_n \xrightarrow{a.s.} w(q)$, for $w(q) = E\log(qr + (1-q)V_1)$.
(b) Show that $q \mapsto w(q)$ is concave on $(0,1]$.
(c) Using Jensen's inequality show that $w(q) \le w(1)$ in case $E V_1 \le r$. Further, show that if $E V_1^{-1} \le r^{-1}$, then the almost sure convergence applies also for $q = 0$ and that $w(q) \le w(0)$.
(d) Assuming that $E V_1^2 < \infty$ and $E V_1^{-2} < \infty$ show that $\sup\{w(q) : q \in [0,1]\}$ is finite, and further that the maximum of $w(q)$ is obtained at some $q^* \in (0,1)$ when $E V_1 > r > 1/E V_1^{-1}$. Interpret your results in terms of the preceding investment example.
Hint: Consider small $q > 0$ and small $1-q > 0$ and recall that $\log(1+x) \ge x - x^2/2$ for any $x \ge 0$.
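For a concrete two-point law the growth rate $w(q)$ and its interior maximizer can be computed directly. A sketch with $r = 1$ and $V_1 \in \{2, 1/2\}$ equally likely (these specific values are illustrative assumptions, chosen so that $E V_1 > r > 1/E V_1^{-1}$ as in part (d)):

```python
import math

def w(q, r=1.0):
    # w(q) = E log(q*r + (1-q)*V_1) for V_1 = 2 or 1/2 with probability 1/2 each
    return 0.5 * math.log(q * r + (1 - q) * 2.0) \
         + 0.5 * math.log(q * r + (1 - q) * 0.5)

# grid search for the maximizing fraction q*
qs = [i / 1000 for i in range(1001)]
q_star = max(qs, key=w)
print(q_star, w(q_star), w(0.0), w(1.0))
# here w(0) = w(1) = 0 while the balanced portfolio grows: q* = 0.5, w(q*) ~ 0.059
```

Both pure strategies have zero long-run growth rate in this example, yet re-balancing to the mixed fraction $q^* = 1/2$ yields strictly positive exponential growth, which is the investment moral of part (d).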
2.3.2. Convergence of random series. A second approach to the strong law of large numbers is based on studying the convergence of random series. The key tool in this approach is Kolmogorov's maximal inequality, which we prove next.
Proposition 2.3.15 (Kolmogorov's maximal inequality). Suppose the random variables $Y_1, \ldots, Y_n$ are mutually independent, with $E Y_l^2 < \infty$ and $E Y_l = 0$ for $l = 1, \ldots, n$. Then, for $Z_k = Y_1 + \cdots + Y_k$ and any $z > 0$,
\[
(2.3.5) \qquad z^2\, P\big(\max_{1\le k\le n} |Z_k| \ge z\big) \le \mathrm{Var}(Z_n)\,.
\]
Remark. Chebyshev's inequality gives only $z^2 P(|Z_n| \ge z) \le \mathrm{Var}(Z_n)$, which is significantly weaker and insufficient for our current goals.
Proof. Fixing $z > 0$ we decompose the event $A = \{\max_{1\le k\le n}|Z_k| \ge z\}$ according to the minimal index $k$ for which $|Z_k| \ge z$. That is, $A$ is the union of the disjoint events $A_k = \{|Z_k| \ge z > |Z_j|, j = 1, \ldots, k-1\}$ over $1 \le k \le n$. Obviously,
\[
(2.3.6) \qquad z^2 P(A) = \sum_{k=1}^n z^2 P(A_k) \le \sum_{k=1}^n E[Z_k^2; A_k]\,,
\]
since $Z_k^2 \ge z^2$ on $A_k$. Further, $E Z_n = 0$ and the $A_k$ are disjoint, so
\[
(2.3.7) \qquad \mathrm{Var}(Z_n) = E Z_n^2 \ge \sum_{k=1}^n E[Z_n^2; A_k]\,.
\]
It suffices to show that $E[(Z_n - Z_k)Z_k; A_k] = 0$ for any $1 \le k \le n$, since then
\begin{align*}
E[Z_n^2; A_k] - E[Z_k^2; A_k] &= E[(Z_n - Z_k)^2; A_k] + 2E[(Z_n - Z_k)Z_k; A_k] \\
&= E[(Z_n - Z_k)^2; A_k] \ge 0\,,
\end{align*}
and (2.3.5) follows by comparing (2.3.6) and (2.3.7). Since $Z_k I_{A_k}$ can be represented as a non-random Borel function of $(Y_1, \ldots, Y_k)$, it follows that $Z_k I_{A_k}$ is measurable on $\sigma(Y_1, \ldots, Y_k)$. Consequently, for fixed $k$ and $l > k$ the variables $Y_l$ and $Z_k I_{A_k}$ are independent, hence uncorrelated. Further $E Y_l = 0$, so
\[
E[(Z_n - Z_k)Z_k; A_k] = \sum_{l=k+1}^n E[Y_l Z_k I_{A_k}] = \sum_{l=k+1}^n E(Y_l)E(Z_k I_{A_k}) = 0\,,
\]
completing the proof of Kolmogorov's inequality. $\square$
Equipped with Kolmogorov's inequality, we provide an easy-to-check sufficient condition for the convergence of random series of independent R.V.
Theorem 2.3.16. Suppose $X_i$ are independent random variables with $\mathrm{Var}(X_i) < \infty$ and $E X_i = 0$. If $\sum_n \mathrm{Var}(X_n) < \infty$ then w.p.1 the random series $\sum_n X_n(\omega)$ converges (that is, the sequence $S_n(\omega) = \sum_{k=1}^n X_k(\omega)$ has a finite limit $S_\infty(\omega)$).
Proof. Applying Kolmogorov's maximal inequality for the independent variables $Y_l = X_{l+r}$, we have that for any $\epsilon > 0$ and positive integers $r$ and $n$,
\[
P\big(\max_{r\le k\le r+n} |S_k - S_r| \ge \epsilon\big) \le \epsilon^{-2}\,\mathrm{Var}(S_{r+n} - S_r) = \epsilon^{-2}\sum_{l=r+1}^{r+n}\mathrm{Var}(X_l)\,.
\]
Taking $n \to \infty$, we get by continuity from below of $P$ that
\[
P\big(\sup_{k\ge r} |S_k - S_r| \ge \epsilon\big) \le \epsilon^{-2}\sum_{l=r+1}^\infty \mathrm{Var}(X_l)\,.
\]
By our assumption that $\sum_n \mathrm{Var}(X_n)$ is finite, it follows that $\sum_{l>r}\mathrm{Var}(X_l) \to 0$ as $r \to \infty$. Hence, if we let $T_r = \sup_{n,m\ge r}|S_n - S_m|$, then for any $\epsilon > 0$,
\[
P(T_r \ge 2\epsilon) \le P\big(\sup_{k\ge r}|S_k - S_r| \ge \epsilon\big) \to 0
\]
as $r \to \infty$. Further, $r \mapsto T_r(\omega)$ is non-increasing, hence,
\[
P\big(\limsup_{M\to\infty} T_M \ge 2\epsilon\big) = P\big(\inf_M T_M \ge 2\epsilon\big) \le P(T_r \ge 2\epsilon) \to 0\,.
\]
That is, $T_M(\omega) \xrightarrow{a.s.} 0$ for $M \to \infty$. By definition, the convergence to zero of $T_M(\omega)$ is the statement that $S_n(\omega)$ is a Cauchy sequence. Since every Cauchy sequence in $\mathbb{R}$ converges to a finite limit, we have the stated a.s. convergence of $S_n(\omega)$. $\square$
We next provide some applications of Theorem 2.3.16.
Example 2.3.17. Considering non-random $a_n$ such that $\sum_n a_n^2 < \infty$ and independent Bernoulli variables $B_n$ of parameter $p = 1/2$, Theorem 2.3.16 tells us that $\sum_n (-1)^{B_n} a_n$ converges with probability one. That is, when the signs in $\sum_n \pm a_n$ are chosen on the toss of a fair coin, the series almost always converges (though quite possibly $\sum_n |a_n| = \infty$).
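With $a_n = 1/n$ this yields the random-sign harmonic series: $\sum_n 1/n = \infty$, yet $\sum_n 1/n^2 < \infty$, so the randomly signed series converges a.s. One seeded path shows the partial sums stabilizing (a sketch; the seed and the snapshot times are arbitrary assumptions):

```python
import random

rng = random.Random(8)
partial, snapshots = 0.0, {}
for n in range(1, 200001):
    sign = 1.0 if rng.random() < 0.5 else -1.0   # the random sign (-1)^{B_n}
    partial += sign / n
    if n in (10**4, 10**5, 2 * 10**5):
        snapshots[n] = partial
print(snapshots)   # the three snapshots differ only slightly
```

The tail fluctuation beyond index $N$ has standard deviation $\big(\sum_{n>N} n^{-2}\big)^{1/2} \approx N^{-1/2}$, which is why the later snapshots are so close to each other.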
Exercise 2.3.18. Consider the record events $A_k$ of Example 2.2.27.
(a) Verify that the events $A_k$ are mutually independent with $P(A_k) = 1/k$.
(b) Show that the random series $\sum_{n\ge 2} (I_{A_n} - 1/n)/\log n$ converges almost surely and deduce that $(\log n)^{-1} R_n \xrightarrow{a.s.} 1$ as $n \to \infty$.
(c) Provide a counterexample to the preceding in case the distribution function $F_X(x)$ is not continuous.
The link between convergence of random series and the strong law of large numbers
is the following classical analysis lemma.
Lemma 2.3.19 (Kronecker's lemma). Consider two sequences of real numbers $x_n$ and $b_n$, where $b_n > 0$ and $b_n \uparrow \infty$. If $\sum_n x_n/b_n$ converges, then $s_n/b_n \to 0$ for $s_n = x_1 + \cdots + x_n$.
Proof. Let $u_n = \sum_{k=1}^n (x_k/b_k)$, which by assumption converges to a finite limit denoted $u_\infty$. Setting $u_0 = b_0 = 0$, summation by parts yields the identity,
$$s_n = \sum_{k=1}^{n} b_k (u_k - u_{k-1}) = b_n u_n - \sum_{k=1}^{n} (b_k - b_{k-1}) u_{k-1}\,.$$
Since $u_n \to u_\infty$ and $b_n \uparrow \infty$, the Cesaro averages $b_n^{-1} \sum_{k=1}^{n} (b_k - b_{k-1}) u_{k-1}$ also converge to $u_\infty$. Consequently, $s_n/b_n \to u_\infty - u_\infty = 0$.
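Kronecker's lemma is purely deterministic, so it can be checked on a concrete pair of sequences. In the sketch below (an illustration, not from the text) we take $x_n = (-1)^n$ and $b_n = n$, so that $\sum_n x_n/b_n$ is the alternating harmonic series:

```python
import numpy as np

# Numerical illustration of Kronecker's lemma with the (illustrative) choice
# x_n = (-1)^n and b_n = n: sum_n x_n/b_n converges (alternating harmonic
# series, to -log 2), so the lemma forces s_n/b_n -> 0.
n = np.arange(1, 200_001)
x = (-1.0) ** n
series = np.cumsum(x / n)   # partial sums of sum_n x_n/b_n -> -log 2
ratio = np.cumsum(x) / n    # s_n/b_n -> 0
print(series[-1], ratio[-1])
```

Note that the converse fails in general: $s_n/b_n \to 0$ alone does not imply convergence of $\sum_n x_n/b_n$.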
Theorem 2.3.16 provides an alternative proof for the strong law of large numbers of Theorem 2.3.3 in case $X_i$ are i.i.d. (that is, replacing pairwise independence by mutual independence). Indeed, applying the same truncation scheme as in the proof of Proposition 2.3.1, it suffices to prove the following alternative to Lemma 2.3.2.

Lemma 2.3.20. For integrable i.i.d. random variables $X_k$, let $\overline{S}_m = \sum_{k=1}^m \overline{X}_k$ and $\overline{X}_k = X_k I_{|X_k| \le k}$. Then, $n^{-1}(\overline{S}_n - E\overline{S}_n) \overset{a.s.}{\to} 0$ as $n \to \infty$.
Lemma 2.3.20, in contrast to Lemma 2.3.2, does not require the restriction to a subsequence $n_l$. Consequently, in this proof of the strong law there is no need for an interpolation argument, so it is carried out directly for $\overline{X}_k$, with no need to split each variable into its positive and negative parts.
Proof of Lemma 2.3.20. We will shortly show that
(2.3.8) $$\sum_{k=1}^{\infty} k^{-2} \mathrm{Var}(\overline{X}_k) \le 2 E|X_1|\,.$$
With $X_1$ integrable, applying Theorem 2.3.16 for the independent variables $Y_k = k^{-1}(\overline{X}_k - E\overline{X}_k)$ this implies that for some $A$ with $P(A) = 1$, the random series $\sum_n Y_n(\omega)$ converges for all $\omega \in A$. Using Kronecker's lemma for $b_n = n$ and $x_n = \overline{X}_n(\omega) - E\overline{X}_n$ we get that $n^{-1} \sum_{k=1}^{n} (\overline{X}_k - E\overline{X}_k) \to 0$ as $n \to \infty$, for every $\omega \in A$, as stated.
The proof of (2.3.8) is similar to the computation employed in the proof of Lemma 2.3.2. That is, $\mathrm{Var}(\overline{X}_k) \le E\overline{X}_k^2 = E[X_1^2 I_{|X_1| \le k}]$ and $k^{-2} \le 2/(k(k+1))$, yielding that
$$\sum_{k=1}^{\infty} k^{-2} \mathrm{Var}(\overline{X}_k) \le \sum_{k=1}^{\infty} \frac{2}{k(k+1)} E[X_1^2 I_{|X_1| \le k}] \le E[X_1^2\, v(|X_1|)]\,,$$
where for any $x > 0$,
$$v(x) = 2 \sum_{k \ge x} \frac{1}{k(k+1)} = 2 \sum_{k \ge x} \Big(\frac{1}{k} - \frac{1}{k+1}\Big) = \frac{2}{\lceil x \rceil} \le 2 x^{-1}\,.$$
Consequently, $E[X_1^2\, v(|X_1|)] \le 2 E|X_1|$, and (2.3.8) follows.
Many of the ingredients of this proof of the strong law of large numbers are also
relevant for solving the following exercise.
Exercise 2.3.21. Let $c_n$ be a bounded sequence of non-random constants, and $X_i$ i.i.d. integrable R.V.-s of zero mean. Show that $n^{-1} \sum_{k=1}^{n} c_k X_k \overset{a.s.}{\to} 0$ for $n \to \infty$.
Next you will find a few exercises that illustrate how useful Kronecker's lemma is when proving the strong law of large numbers in the case of independent but not identically distributed summands.
Exercise 2.3.22. Let $S_n = \sum_{k=1}^{n} Y_k$ for independent random variables $Y_i$ such that $\mathrm{Var}(Y_k) < B < \infty$ and $EY_k = 0$ for all $k$. Show that $[n(\log n)^{1+\varepsilon}]^{-1/2} S_n \overset{a.s.}{\to} 0$ as $n \to \infty$, where $\varepsilon > 0$ is fixed (this falls short of the law of the iterated logarithm of (2.2.1), but each $Y_k$ is allowed here to have a different distribution).
Exercise 2.3.23. Suppose the independent random variables $X_i$ are such that $\mathrm{Var}(X_k) \le p_k < \infty$ and $EX_k = 0$ for $k = 1, 2, \ldots$.
(a) Show that if $\sum_k p_k < \infty$ then $n^{-1} \sum_{k=1}^{n} k X_k \overset{a.s.}{\to} 0$.
(b) Conversely, assuming $\sum_k p_k = \infty$, give an example of independent random variables $X_i$, such that $\mathrm{Var}(X_k) \le p_k$, $EX_k = 0$, for which almost surely $\limsup_n X_n(\omega) = 1$.
(c) Show that the example you just gave is such that with probability one, the sequence $n^{-1} \sum_{k=1}^{n} k X_k(\omega)$ does not converge to a finite limit.
Exercise 2.3.24. Consider independent, non-negative random variables $X_n$.
(a) Show that if
(2.3.9) $$\sum_{n=1}^{\infty} \big[P(X_n \ge 1) + E(X_n I_{X_n < 1})\big] < \infty$$
then the random series $\sum_n X_n(\omega)$ converges w.p.1.
(b) Prove the converse, namely, that if $\sum_n X_n(\omega)$ converges w.p.1. then (2.3.9) holds.
(c) Suppose $G_n$ are mutually independent random variables, with $G_n$ having the normal distribution $\mathcal{N}(\mu_n, v_n)$. Show that w.p.1. the random series $\sum_n G_n^2(\omega)$ converges if and only if $e = \sum_n (\mu_n^2 + v_n)$ is finite.
(d) Suppose $\tau_n$ are mutually independent random variables, with $\tau_n$ having the exponential distribution of parameter $\lambda_n > 0$. Show that w.p.1. the random series $\sum_n \tau_n(\omega)$ converges if and only if $\sum_n 1/\lambda_n$ is finite.
Hint: For part (b) recall that for any $a_n \in [0, 1)$, the series $\sum_n a_n$ is finite if and only if $\prod_n (1 - a_n) > 0$. For part (c) let $f(y) = \sum_n \min\big((\mu_n + \sqrt{v_n}\, y)^2, 1\big)$ and observe that if $e = \infty$ then $f(y) + f(-y) = \infty$ for all $y \ne 0$.
You can now also show that for such a strong law of large numbers (that is, with independent but not identically distributed summands), it suffices to strengthen the corresponding weak law (only) along the subsequence $n_r = 2^r$.
Exercise 2.3.25. Let $Z_k = \sum_{j=1}^{k} Y_j$ where $Y_j$ are mutually independent R.V.-s.
(a) Fixing $\gamma > 0$ show that if $2^{-r} Z_{2^r} \overset{a.s.}{\to} 0$ then $\sum_r P(|Z_{2^{r+1}} - Z_{2^r}| > \gamma 2^r)$ is finite, and if $m^{-1} Z_m \overset{p}{\to} 0$ then $\max_{m < k \le 2m} P(|Z_{2m} - Z_k| \ge \gamma m) \to 0$.
(b) Adapting the proof of Kolmogorov's maximal inequality show that for any $n$ and $z > 0$,
$$P\Big(\max_{1 \le k \le n} |Z_k| \ge 2z\Big) \min_{1 \le k \le n} P(|Z_n - Z_k| < z) \le P(|Z_n| > z)\,.$$
(c) Deduce that if both $m^{-1} Z_m \overset{p}{\to} 0$ and $2^{-r} Z_{2^r} \overset{a.s.}{\to} 0$ then also $n^{-1} Z_n \overset{a.s.}{\to} 0$.
Hint: For part (c) combine parts (a) and (b) with $z = \gamma n$, $n = 2^r$ and the mutually independent $Y_{j+n}$, $1 \le j \le n$, to show that $\sum_r P(2^{-r} D_r \ge 2\gamma)$ is finite for $D_r = \max_{2^r < k \le 2^{r+1}} |Z_k - Z_{2^r}|$ and any fixed $\gamma > 0$.
CHAPTER 3

Weak convergence, clt and Poisson approximation

After dealing in Chapter 2 with examples in which random variables converge to non-random constants, we focus here on the more general theory of weak convergence, that is, situations in which the laws of random variables converge to a limiting law, typically of a non-constant random variable. To motivate this theory, we start with Section 3.1, where we derive the celebrated Central Limit Theorem (in short, clt), the most widely used example of weak convergence. This is followed by the exposition of the theory, to which Section 3.2 is devoted. Section 3.3 is about the key tool of characteristic functions and their role in establishing convergence results such as the clt. This tool is used in Section 3.4 to derive the Poisson approximation and provide an introduction to the Poisson process. In Section 3.5 we generalize the characteristic function to the setting of random vectors and study their properties while deriving the multivariate clt.
3.1. The Central Limit Theorem

We start this section with the property of the normal distribution that makes it the likely limit for properly scaled sums of independent random variables. This is followed by a bare-hands proof of the clt for triangular arrays in Subsection 3.1.1. We then present in Subsection 3.1.2 some of the many examples and applications of the clt.

Recall the normal distribution of mean $\mu \in \mathbb{R}$ and variance $v > 0$, denoted hereafter $\mathcal{N}(\mu, v)$, the density of which is
(3.1.1) $$f(y) = \frac{1}{\sqrt{2\pi v}} \exp\Big(-\frac{(y - \mu)^2}{2v}\Big)\,.$$
As we show next, the normal distribution is preserved when the sum of independent variables is considered (which is the main reason for its role as the limiting law for the clt).
Lemma 3.1.1. Let $Y_{n,k}$ be mutually independent random variables, each having the normal distribution $\mathcal{N}(\mu_{n,k}, v_{n,k})$. Then, $G_n = \sum_{k=1}^{n} Y_{n,k}$ has the normal distribution $\mathcal{N}(\mu_n, v_n)$, with $\mu_n = \sum_{k=1}^{n} \mu_{n,k}$ and $v_n = \sum_{k=1}^{n} v_{n,k}$.
Proof. Recall that $Y$ has a $\mathcal{N}(\mu, v)$ distribution if and only if $Y - \mu$ has the $\mathcal{N}(0, v)$ distribution. Therefore, we may and shall assume without loss of generality that $\mu_{n,k} = 0$ for all $k$ and $n$. Further, it suffices to prove the lemma for $n = 2$, as the general case immediately follows by an induction argument. With $n = 2$ fixed, we simplify our notations by omitting it everywhere. Next recall the formula of Corollary 1.4.33 for the probability density function of $G = Y_1 + Y_2$, which for $Y_i$ of $\mathcal{N}(0, v_i)$ distribution, $i = 1, 2$, is
$$f_G(z) = \int \frac{1}{\sqrt{2\pi v_1}} \exp\Big(-\frac{(z - y)^2}{2 v_1}\Big) \frac{1}{\sqrt{2\pi v_2}} \exp\Big(-\frac{y^2}{2 v_2}\Big)\, dy\,.$$
Comparing this with the formula of (3.1.1) for $v = v_1 + v_2$, it just remains to show that for any $z \in \mathbb{R}$,
(3.1.2) $$1 = \int \frac{1}{\sqrt{2\pi u}} \exp\Big(\frac{z^2}{2v} - \frac{(z - y)^2}{2 v_1} - \frac{y^2}{2 v_2}\Big)\, dy\,,$$
where $u = v_1 v_2/(v_1 + v_2)$. It is not hard to check that the argument of the exponential function in (3.1.2) is $-(y - cz)^2/(2u)$ for $c = v_2/(v_1 + v_2)$. Consequently, (3.1.2) is merely the obvious fact that the $\mathcal{N}(cz, u)$ density function integrates to one (as any density function should), no matter what the value of $z$ is.
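Lemma 3.1.1 (for $n = 2$) is easy to confirm empirically. The sketch below (an illustration, not from the text; the values of $v_1$, $v_2$, the sample size and the seed are arbitrary choices) checks that summing independent $\mathcal{N}(0, v_1)$ and $\mathcal{N}(0, v_2)$ samples gives mean $0$ and variance $v_1 + v_2$:

```python
import numpy as np

# Empirical check of Lemma 3.1.1 for n = 2: a sum of independent N(0, v1)
# and N(0, v2) samples has mean 0 and variance v1 + v2. Sample size, seed
# and the values of v1, v2 are illustrative choices.
rng = np.random.default_rng(1)
v1, v2 = 2.0, 3.0
m = 500_000
g = rng.normal(0.0, np.sqrt(v1), m) + rng.normal(0.0, np.sqrt(v2), m)
print(g.mean(), g.var())  # close to 0 and v1 + v2 = 5
```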
Considering Lemma 3.1.1 for $Y_{n,k} = (nv)^{-1/2}(Y_k - \mu)$ and i.i.d. random variables $Y_k$, each having a normal distribution of mean $\mu$ and variance $v$, we see that $\mu_{n,k} = 0$ and $v_{n,k} = 1/n$, so $G_n = (nv)^{-1/2}(\sum_{k=1}^{n} Y_k - n\mu)$ has the standard $\mathcal{N}(0, 1)$ distribution, regardless of $n$.
3.1.1. Lindeberg's clt for triangular arrays. Our next proposition, the celebrated clt, states that the distribution of $\widehat{S}_n = (nv)^{-1/2}(\sum_{k=1}^{n} X_k - n\mu)$ approaches the standard normal distribution in the limit $n \to \infty$, even though $X_k$ may well be non-normal random variables.
Proposition 3.1.2 (Central Limit Theorem). Let
$$\widehat{S}_n = \frac{1}{\sqrt{nv}} \Big(\sum_{k=1}^{n} X_k - n\mu\Big)\,,$$
where $X_k$ are i.i.d. with $v = \mathrm{Var}(X_1) \in (0, \infty)$ and $\mu = E(X_1)$. Then,
(3.1.3) $$\lim_{n \to \infty} P(\widehat{S}_n \le b) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{b} \exp\Big(-\frac{y^2}{2}\Big)\, dy \quad \text{for every } b \in \mathbb{R}.$$
As we have seen in the context of the weak law of large numbers, it pays to extend the scope of consideration to triangular arrays in which the random variables $X_{n,k}$ are independent within each row, but not necessarily of identical distribution. This is the context of Lindeberg's clt, which we state next.

Theorem 3.1.3 (Lindeberg's clt). Let $\widehat{S}_n = \sum_{k=1}^{n} X_{n,k}$ for $P$-mutually independent random variables $X_{n,k}$, $k = 1, \ldots, n$, such that $EX_{n,k} = 0$ for all $k$ and
$$v_n = \sum_{k=1}^{n} EX_{n,k}^2 \to 1 \quad \text{as } n \to \infty.$$
Then, the conclusion (3.1.3) applies if for each $\varepsilon > 0$,
(3.1.4) $$g_n(\varepsilon) = \sum_{k=1}^{n} E[X_{n,k}^2 ; |X_{n,k}| \ge \varepsilon] \to 0 \quad \text{as } n \to \infty.$$
Note that the variables in different rows need not be independent of each other and could even be defined on different probability spaces.
Remark 3.1.4. Under the assumptions of Proposition 3.1.2 the variables $X_{n,k} = (nv)^{-1/2}(X_k - \mu)$ are mutually independent and such that
$$EX_{n,k} = (nv)^{-1/2}(EX_k - \mu) = 0\,, \qquad v_n = \sum_{k=1}^{n} EX_{n,k}^2 = \frac{1}{nv} \sum_{k=1}^{n} \mathrm{Var}(X_k) = 1\,.$$
Further, per fixed $n$ these $X_{n,k}$ are identically distributed, so
$$g_n(\varepsilon) = n E[X_{n,1}^2 ; |X_{n,1}| \ge \varepsilon] = v^{-1} E\big[(X_1 - \mu)^2 I_{|X_1 - \mu| \ge \varepsilon \sqrt{nv}}\big]\,.$$
For each $\varepsilon > 0$ the sequence $(X_1 - \mu)^2 I_{|X_1 - \mu| \ge \varepsilon \sqrt{nv}}$ converges a.s. to zero for $n \to \infty$ and is dominated by the integrable random variable $(X_1 - \mu)^2$. Thus, by dominated convergence, $g_n(\varepsilon) \to 0$ as $n \to \infty$. We conclude that all assumptions of Theorem 3.1.3 are satisfied for this choice of $X_{n,k}$, hence Proposition 3.1.2 is a special instance of Lindeberg's clt, to which we turn our attention next.
Let $r_n = \max\{\sqrt{v_{n,k}} : k = 1, \ldots, n\}$ for $v_{n,k} = EX_{n,k}^2$. Since for every $n$, $k$ and $\varepsilon > 0$,
$$v_{n,k} = EX_{n,k}^2 = E[X_{n,k}^2 ; |X_{n,k}| < \varepsilon] + E[X_{n,k}^2 ; |X_{n,k}| \ge \varepsilon] \le \varepsilon^2 + g_n(\varepsilon)\,,$$
it follows that
$$r_n^2 \le \varepsilon^2 + g_n(\varepsilon) \qquad \forall n, \varepsilon > 0\,,$$
hence Lindeberg's condition (3.1.4) implies that $r_n \to 0$ as $n \to \infty$.
Remark. Lindeberg proved Theorem 3.1.3, introducing the condition (3.1.4). Later, Feller proved that (3.1.3) plus $r_n \to 0$ implies that Lindeberg's condition holds. Together, these two results are known as the Feller-Lindeberg Theorem.
We see that the variables $X_{n,k}$ are of uniformly small variance for large $n$. So, considering independent random variables $Y_{n,k}$ that are also independent of the $X_{n,k}$ and such that each $Y_{n,k}$ has a $\mathcal{N}(0, v_{n,k})$ distribution, for a smooth function $h(\cdot)$ one may control $|Eh(\widehat{S}_n) - Eh(G_n)|$ by a Taylor expansion upon successively replacing the $X_{n,k}$ by $Y_{n,k}$. This indeed is the outline of Lindeberg's proof, whose core is the following lemma.
Lemma 3.1.5. For $h : \mathbb{R} \to \mathbb{R}$ of continuous and uniformly bounded second and third derivatives, $G_n$ having the $\mathcal{N}(0, v_n)$ law, every $n$ and $\varepsilon > 0$, we have that
$$|Eh(\widehat{S}_n) - Eh(G_n)| \le \Big(\frac{\varepsilon}{6} + \frac{r_n}{2}\Big) v_n \|h'''\|_\infty + g_n(\varepsilon) \|h''\|_\infty\,,$$
with $\|f\|_\infty = \sup_{x \in \mathbb{R}} |f(x)|$ denoting the supremum norm.
Remark. Recall that $G_n \overset{D}{=} \sigma_n G$ for $\sigma_n = \sqrt{v_n}$. So, assuming $v_n \to 1$ and Lindeberg's condition, which implies that $r_n \to 0$ for $n \to \infty$, it follows from the lemma that $|Eh(\widehat{S}_n) - Eh(\sigma_n G)| \to 0$ as $n \to \infty$. Further, $|h(\sigma_n x) - h(x)| \le |\sigma_n - 1|\, |x|\, \|h'\|_\infty$, so taking the expectation with respect to the standard normal law we see that $|Eh(\sigma_n G) - Eh(G)| \to 0$ if the first derivative of $h$ is also uniformly bounded. Hence,
(3.1.5) $$\lim_{n \to \infty} Eh(\widehat{S}_n) = Eh(G)\,,$$
for any continuous function $h(\cdot)$ of continuous and uniformly bounded first three derivatives. This is actually all we need from Lemma 3.1.5 in order to prove Lindeberg's clt. Further, as we show in Section 3.2, convergence in distribution as in (3.1.3) is equivalent to (3.1.5) holding for all continuous, bounded functions $h(\cdot)$.
Proof of Lemma 3.1.5. Let $G_n = \sum_{k=1}^{n} Y_{n,k}$ for mutually independent $Y_{n,k}$, distributed according to $\mathcal{N}(0, v_{n,k})$, that are independent of $\{X_{n,k}\}$. Fixing $n$ and $h$, we simplify the notations by eliminating $n$, that is, we write $Y_k$ for $Y_{n,k}$, and $X_k$ for $X_{n,k}$. To facilitate the proof define the mixed sums
$$U_l = \sum_{k=1}^{l-1} X_k + \sum_{k=l+1}^{n} Y_k\,, \qquad l = 1, \ldots, n\,.$$
Note the following identities
$$G_n = U_1 + Y_1\,, \qquad U_l + X_l = U_{l+1} + Y_{l+1}\,, \ l = 1, \ldots, n-1\,, \qquad U_n + X_n = \widehat{S}_n\,,$$
which imply that,
(3.1.6) $$|Eh(G_n) - Eh(\widehat{S}_n)| = |Eh(U_1 + Y_1) - Eh(U_n + X_n)| \le \sum_{l=1}^{n} \Delta_l\,,$$
where $\Delta_l = |E[h(U_l + Y_l) - h(U_l + X_l)]|$, for $l = 1, \ldots, n$. For any $l$ and $\xi \in \mathbb{R}$, consider the remainder term
$$R_l(\xi) = h(U_l + \xi) - h(U_l) - \xi h'(U_l) - \frac{\xi^2}{2} h''(U_l)$$
in the second order Taylor expansion of $h(\cdot)$ at $U_l$. By Taylor's theorem, we have that
$$|R_l(\xi)| \le \frac{\|h'''\|_\infty |\xi|^3}{6} \quad \text{(from the third order expansion)},$$
$$|R_l(\xi)| \le \|h''\|_\infty |\xi|^2 \quad \text{(from the second order expansion)},$$
whence,
(3.1.7) $$|R_l(\xi)| \le \min\Big(\frac{\|h'''\|_\infty |\xi|^3}{6}\,,\ \|h''\|_\infty |\xi|^2\Big)\,.$$
Considering the expectation of the difference between the two identities,
$$h(U_l + X_l) = h(U_l) + X_l h'(U_l) + \frac{X_l^2}{2} h''(U_l) + R_l(X_l)\,,$$
$$h(U_l + Y_l) = h(U_l) + Y_l h'(U_l) + \frac{Y_l^2}{2} h''(U_l) + R_l(Y_l)\,,$$
we get that
$$\Delta_l \le \big|E[(X_l - Y_l) h'(U_l)]\big| + \Big|E\Big[\Big(\frac{X_l^2}{2} - \frac{Y_l^2}{2}\Big) h''(U_l)\Big]\Big| + \big|E[R_l(X_l) - R_l(Y_l)]\big|\,.$$
Recall that $X_l$ and $Y_l$ are independent of $U_l$ and chosen such that $EX_l = EY_l$ and $EX_l^2 = EY_l^2$. As the first two terms in the bound on $\Delta_l$ vanish we have that
(3.1.8) $$\Delta_l \le E|R_l(X_l)| + E|R_l(Y_l)|\,.$$
Further, utilizing (3.1.7),
$$E|R_l(X_l)| \le \|h'''\|_\infty E\Big[\frac{|X_l|^3}{6} ; |X_l| < \varepsilon\Big] + \|h''\|_\infty E[|X_l|^2 ; |X_l| \ge \varepsilon] \le \frac{\varepsilon}{6} \|h'''\|_\infty E[|X_l|^2] + \|h''\|_\infty E[X_l^2 ; |X_l| \ge \varepsilon]\,.$$
Summing these bounds over $l = 1, \ldots, n$, by our assumption that $\sum_{l=1}^{n} EX_l^2 = v_n$ and the definition of $g_n(\varepsilon)$, we get that
(3.1.9) $$\sum_{l=1}^{n} E|R_l(X_l)| \le \frac{\varepsilon}{6} v_n \|h'''\|_\infty + g_n(\varepsilon) \|h''\|_\infty\,.$$
Recall that $Y_l/\sqrt{v_{n,l}}$ is a standard normal random variable, whose fourth moment is 3 (see (1.3.18)). By monotonicity in $q$ of the $L_q$-norms (c.f. Lemma 1.3.16), it follows that $E[|Y_l/\sqrt{v_{n,l}}|^3] \le 3$, hence $E|Y_l|^3 \le 3 v_{n,l}^{3/2} \le 3 r_n v_{n,l}$. Utilizing once more (3.1.7) and the fact that $v_n = \sum_{l=1}^{n} v_{n,l}$, we arrive at
(3.1.10) $$\sum_{l=1}^{n} E|R_l(Y_l)| \le \frac{\|h'''\|_\infty}{6} \sum_{l=1}^{n} E|Y_l|^3 \le \frac{r_n}{2} v_n \|h'''\|_\infty\,.$$
Plugging (3.1.8)-(3.1.10) into (3.1.6) completes the proof of the lemma.
Plugging (3.1.8)(3.1.10) into (3.1.6) completes the proof of the lemma.
In view of (3.1.5), Lindebergs clt builds on the following elementary lemma,
whereby we approximate the indicator function on (, b] by continuous, bounded
functions h
k
: R R for each of which Lemma 3.1.5 applies.
Lemma 3.1.6. There exist h

k
(x) of continuous and uniformly bounded rst three
derivatives, such that 0 h

k
(x) I
(,b)
(x) and 1 h
+
k
(x) I
(,b]
(x) as
k .
Proof. There are many ways to prove this. Here is one which is from first principles, hence requires no analysis knowledge. The function $\psi : [0, 1] \to [0, 1]$ given by $\psi(x) = 140 \int_x^1 u^3 (1 - u)^3\, du$ is monotone decreasing, with continuous derivatives of all orders, such that $\psi(0) = 1$, $\psi(1) = 0$ and whose first three derivatives at 0 and at 1 are all zero. Its extension $\psi(x) = \psi(\min(x, 1)_+)$ to a function on $\mathbb{R}$ that is one for $x \le 0$ and zero for $x \ge 1$ is thus non-increasing, with continuous and uniformly bounded first three derivatives. It is easy to check that the translated and scaled functions $h_k^{+}(x) = \psi(k(x - b))$ and $h_k^{-}(x) = \psi(k(x - b) + 1)$ have all the claimed properties.
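The claimed properties of $\psi$ can be verified mechanically from the explicit antiderivative of $u^3(1-u)^3 = u^3 - 3u^4 + 3u^5 - u^6$. The sketch below (a numerical sanity check, not part of the text) confirms $\psi(0) = 1$, $\psi(1) = 0$ and monotonicity; the normalizing constant works out because $140 \int_0^1 u^3(1-u)^3\, du = 140 \cdot \frac{3!\,3!}{7!} = 1$:

```python
# Sanity checks for the smoothing function psi(x) = 140 int_x^1 u^3(1-u)^3 du
# from the proof of Lemma 3.1.6, via the explicit antiderivative of
# u^3(1-u)^3 = u^3 - 3u^4 + 3u^5 - u^6.
import numpy as np

def anti(u):
    # antiderivative of u^3 (1 - u)^3, with anti(0) = 0
    return u**4 / 4 - 3 * u**5 / 5 + u**6 / 2 - u**7 / 7

def psi(x):
    return 140.0 * (anti(1.0) - anti(x))

xs = np.linspace(0.0, 1.0, 101)
vals = psi(xs)
print(vals[0], vals[-1])  # psi(0) = 1 and psi(1) = 0; psi is decreasing
```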
Proof of Theorem 3.1.3. Applying (3.1.5) for $h_k^{-}(\cdot)$, then taking $k \to \infty$, we have by monotone convergence that
$$\liminf_{n \to \infty} P(\widehat{S}_n < b) \ge \lim_{n \to \infty} E[h_k^{-}(\widehat{S}_n)] = E[h_k^{-}(G)] \uparrow F_G(b^-)\,.$$
Similarly, considering $h_k^{+}(\cdot)$, then taking $k \to \infty$, we have by bounded convergence that
$$\limsup_{n \to \infty} P(\widehat{S}_n \le b) \le \lim_{n \to \infty} E[h_k^{+}(\widehat{S}_n)] = E[h_k^{+}(G)] \downarrow F_G(b)\,.$$
Since $F_G(\cdot)$ is a continuous function we conclude that $P(\widehat{S}_n \le b)$ converges to $F_G(b) = F_G(b^-)$, as $n \to \infty$. This holds for every $b \in \mathbb{R}$ as claimed.
3.1.2. Applications of the clt. We start with the simpler, i.i.d. case. In doing so, we use the notation $Z_n \overset{D}{\to} G$ when the analog of (3.1.3) holds for the sequence $Z_n$, that is, $P(Z_n \le b) \to P(G \le b)$ as $n \to \infty$ for all $b \in \mathbb{R}$ (where $G$ is a standard normal variable).
Example 3.1.7 (Normal approximation of the Binomial). Consider i.i.d. random variables $B_i$, each of whom is Bernoulli of parameter $0 < p < 1$ (i.e. $P(B_1 = 1) = 1 - P(B_1 = 0) = p$). The sum $S_n = B_1 + \cdots + B_n$ has the Binomial distribution of parameters $(n, p)$, that is,
$$P(S_n = k) = \binom{n}{k} p^k (1 - p)^{n-k}\,, \qquad k = 0, \ldots, n.$$
For example, if $B_i$ indicates that the $i$th independent toss of the same coin lands on a Head, then $S_n$ counts the total number of Heads in the first $n$ tosses of the coin. Recall that $EB_1 = p$ and $\mathrm{Var}(B_1) = p(1 - p)$ (see Example 1.3.69), so the clt states that $(S_n - np)/\sqrt{np(1-p)} \overset{D}{\to} G$. It allows us to approximate, for all large enough $n$, the typically non-computable weighted sums of binomial terms by integrals with respect to the standard normal density.
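The quality of this approximation can be seen directly by comparing the exact Binomial CDF at $np + b\sqrt{np(1-p)}$ with $\Phi(b)$. The sketch below (an illustration; the values of $n$, $p$, $b$ are arbitrary choices, not from the text) computes the binomial probabilities in log-space to avoid underflow:

```python
import math

# Normal approximation of the Binomial (Example 3.1.7): compare the exact
# Binomial(n, p) CDF at np + b*sqrt(np(1-p)) with Phi(b). The values of
# n, p, b are illustrative choices.
n, p, b = 2000, 0.3, 1.0
sd = math.sqrt(n * p * (1 - p))
k_max = math.floor(n * p + b * sd)
cdf = 0.0
for k in range(k_max + 1):
    # log of the binomial pmf, via log-gamma, to avoid underflow
    logpmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
              + k * math.log(p) + (n - k) * math.log(1 - p))
    cdf += math.exp(logpmf)
phi_b = 0.5 * (1.0 + math.erf(b / math.sqrt(2.0)))
print(cdf, phi_b)  # both close to Phi(1), about 0.84
```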
Here is another example that is similar and almost as widely used.

Example 3.1.8 (Normal approximation of the Poisson distribution). It is not hard to verify that the sum of two independent Poisson random variables has the Poisson distribution, with a parameter which is the sum of the parameters of the summands. Thus, by induction, if $X_i$ are i.i.d. each of Poisson distribution of parameter 1, then $N_n = X_1 + \ldots + X_n$ has a Poisson distribution of parameter $n$. Since $E(N_1) = \mathrm{Var}(N_1) = 1$ (see Example 1.3.69), the clt applies for $(N_n - n)/n^{1/2}$. This provides an approximation for the distribution function of the Poisson variable $N_\lambda$ of parameter $\lambda$ that is a large integer. To deal with non-integer values $\lambda = n + \eta$ for some $\eta \in (0, 1)$, consider the mutually independent Poisson variables $N_n$, $N_\eta$ and $N_{1-\eta}$. Since $N_\lambda \overset{D}{=} N_n + N_\eta$ and $N_{n+1} \overset{D}{=} N_n + N_\eta + N_{1-\eta}$, this provides a monotone coupling, that is, a construction of the random variables $N_n$, $N_\lambda$ and $N_{n+1}$ on the same probability space, such that $N_n \le N_\lambda \le N_{n+1}$. Because of this monotonicity, for any $\varepsilon > 0$ and all $n \ge n_0(b, \varepsilon)$ the event $\{(N_\lambda - \lambda)/\sqrt{\lambda} \le b\}$ is between $\{(N_{n+1} - (n+1))/\sqrt{n+1} \le b - \varepsilon\}$ and $\{(N_n - n)/\sqrt{n} \le b + \varepsilon\}$. Considering the limit as $n \to \infty$ followed by $\varepsilon \downarrow 0$, it thus follows that the convergence $(N_n - n)/n^{1/2} \overset{D}{\to} G$ implies also that $(N_\lambda - \lambda)/\lambda^{1/2} \overset{D}{\to} G$ as $\lambda \to \infty$. In words, the normal distribution is a good approximation of a Poisson with large parameter.
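As with the binomial case, the approximation can be inspected by comparing the exact Poisson CDF at $\lambda + b\sqrt{\lambda}$ with $\Phi(b)$. In the sketch below (an illustration; $\lambda$ and $b$ are arbitrary choices) the pmf terms are accumulated via the recursion $P(N = k) = (\lambda/k)\, P(N = k-1)$ in log-space:

```python
import math

# Normal approximation of the Poisson (Example 3.1.8): exact CDF of
# Poisson(lam) at lam + b*sqrt(lam) versus Phi(b). The values of lam and b
# are illustrative choices.
lam, b = 400.0, 0.5
k_max = math.floor(lam + b * math.sqrt(lam))
logpmf = -lam                  # log P(N = 0)
cdf = math.exp(logpmf)
for k in range(1, k_max + 1):
    logpmf += math.log(lam / k)   # log P(N = k) = log P(N = k-1) + log(lam/k)
    cdf += math.exp(logpmf)
phi_b = 0.5 * (1.0 + math.erf(b / math.sqrt(2.0)))
print(cdf, phi_b)
```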
In Theorem 2.3.3 we established the strong law of large numbers when the summands $X_i$ are only pairwise independent. Unfortunately, as the next example shows, pairwise independence is not good enough for the clt.
Example 3.1.9. Consider i.i.d. $\xi_i$ such that $P(\xi_i = 1) = P(\xi_i = -1) = 1/2$ for all $i$. Set $X_1 = \xi_1$ and successively let $X_{2^k + j} = X_j \xi_{k+2}$ for $j = 1, \ldots, 2^k$ and $k = 0, 1, \ldots$. Note that each $X_l$ is a $\{-1, 1\}$-valued variable, specifically, a product of a different finite subset of the $\xi_i$-s that corresponds to the positions of ones in the binary representation of $2l - 1$ (with $\xi_1$ for its least significant digit, $\xi_2$ for the next digit, etc.). Consequently, each $X_l$ is of zero mean, and if $l \ne r$ then in $EX_l X_r$ at least one of the $\xi_i$-s will appear exactly once, resulting with $EX_l X_r = 0$, hence with $\{X_l\}$ being uncorrelated variables. Recall part (b) of Exercise 1.4.42, that such variables are pairwise independent. Further, $EX_l = 0$ and $X_l \in \{-1, 1\}$ mean that $P(X_l = 1) = P(X_l = -1) = 1/2$ are identically distributed. As for the zero mean variables $S_n = \sum_{j=1}^{n} X_j$, we have arranged things such that $S_1 = \xi_1$ and for any $k \ge 0$
$$S_{2^{k+1}} = \sum_{j=1}^{2^k} (X_j + X_{2^k + j}) = \sum_{j=1}^{2^k} X_j (1 + \xi_{k+2}) = S_{2^k} (1 + \xi_{k+2})\,,$$
hence $S_{2^k} = \xi_1 \prod_{i=2}^{k+1} (1 + \xi_i)$ for all $k \ge 1$. In particular, $S_{2^k} = 0$ unless $\xi_2 = \xi_3 = \ldots = \xi_{k+1} = 1$, an event of probability $2^{-k}$. Thus, $P(S_{2^k} \ne 0) = 2^{-k}$ and certainly the clt result (3.1.3) does not hold along the subsequence $n = 2^k$.
We turn next to applications of Lindeberg's triangular array clt, starting with the asymptotic of the count of record events till time $n \ge 1$.

Exercise 3.1.10. Consider the count $R_n$ of record events during the first $n$ instances of i.i.d. R.V. with a continuous distribution function, as in Example 2.2.27. Recall that $R_n = B_1 + \cdots + B_n$ for mutually independent Bernoulli random variables $B_k$ such that $P(B_k = 1) = 1 - P(B_k = 0) = k^{-1}$.
(a) Check that $b_n / \log n \to 1$ where $b_n = \mathrm{Var}(R_n)$.
(b) Show that Lindeberg's clt applies for $X_{n,k} = (\log n)^{-1/2}(B_k - k^{-1})$.
(c) Recall that $|ER_n - \log n| \le 1$, and conclude that $(R_n - \log n)/\sqrt{\log n} \overset{D}{\to} G$.

Remark. Let $\mathcal{S}_n$ denote the symmetric group of permutations on $\{1, \ldots, n\}$. For $s \in \mathcal{S}_n$ and $i \in \{1, \ldots, n\}$, denoting by $L_i(s)$ the smallest $j \ge 1$ such that $s^j(i) = i$, we call $\{s^j(i) : 1 \le j \le L_i(s)\}$ the cycle of $s$ containing $i$. If each $s \in \mathcal{S}_n$ is equally likely, then the law of the number $T_n(s)$ of different cycles in $s$ is the same as that of $R_n$ of Example 2.2.27 (for a proof see [Dur10, Example 2.2.4]). Consequently, Exercise 3.1.10 also shows that in this setting $(T_n - \log n)/\sqrt{\log n} \overset{D}{\to} G$.
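A Monte Carlo experiment makes the conclusion of Exercise 3.1.10 visible, though the $\sqrt{\log n}$ scale means convergence is quite slow. The sketch below (an illustration; sample sizes and seed are arbitrary choices) simulates $R_n$ as a sum of independent Bernoulli$(1/k)$ indicators:

```python
import numpy as np

# Monte Carlo illustration of Exercise 3.1.10: R_n is a sum of independent
# Bernoulli(1/k) indicators, and (R_n - log n)/sqrt(log n) is approximately
# standard normal. Sample sizes and seed are illustrative choices.
rng = np.random.default_rng(2)
n, paths = 5_000, 1_000
u = rng.random((paths, n))
records = (u < 1.0 / np.arange(1, n + 1)).sum(axis=1)
z = (records - np.log(n)) / np.sqrt(np.log(n))
# The mean of z is shifted by (ER_n - log n)/sqrt(log n), which is bounded
# by 1/sqrt(log n), and its variance approaches Var(R_n)/log n -> 1.
print(z.mean(), z.var())
```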
Part (a) of the following exercise is a special case of Lindeberg's clt, known also as Lyapunov's theorem.

Exercise 3.1.11 (Lyapunov's theorem). Let $S_n = \sum_{k=1}^{n} X_k$ for $X_k$ mutually independent such that $v_n = \mathrm{Var}(S_n) < \infty$.
(a) Show that if there exists $q > 2$ such that
$$\lim_{n \to \infty} v_n^{-q/2} \sum_{k=1}^{n} E(|X_k - EX_k|^q) = 0\,,$$
then $v_n^{-1/2}(S_n - ES_n) \overset{D}{\to} G$.
(b) Use the preceding result to show that $n^{-1/2} S_n \overset{D}{\to} G$ when also $EX_k = 0$, $EX_k^2 = 1$ and $E|X_k|^q \le C$ for some $q > 2$, $C < \infty$ and $k = 1, 2, \ldots$.
(c) Show that $(\log n)^{-1/2} S_n \overset{D}{\to} G$ when the mutually independent $X_k$ are such that $P(X_k = 0) = 1 - k^{-1}$ and $P(X_k = 1) = P(X_k = -1) = 1/(2k)$.
The next application of Lindeberg's clt involves the use of truncation (which we have already introduced in the context of the weak law of large numbers), to derive the clt for normalized sums of certain i.i.d. random variables of infinite variance.

Proposition 3.1.12. Suppose $X_k$ are i.i.d. of symmetric distribution, that is, $X_1 \overset{D}{=} -X_1$ (or $P(X_1 > x) = P(X_1 < -x)$ for all $x$), such that $P(|X_1| > x) = x^{-2}$ for $x \ge 1$. Then,
$$\frac{1}{\sqrt{n \log n}} \sum_{k=1}^{n} X_k \overset{D}{\to} G \quad \text{as } n \to \infty.$$
Remark 3.1.13. Note that $\mathrm{Var}(X_1) = EX_1^2 = \int_0^\infty 2x\, P(|X_1| > x)\, dx = \infty$ (c.f. part (a) of Lemma 1.4.31), so the usual clt of Proposition 3.1.2 does not apply here. Indeed, the infinite variance of the summands results in a different normalization of the sums $S_n = \sum_{k=1}^{n} X_k$, one that is tailored to the specific tail behavior of $x \mapsto P(|X_1| > x)$.
Caution should be exercised here, since when $P(|X_1| > x) = x^{-\alpha}$ for $x > 1$ and some $0 < \alpha < 2$, there is no way to approximate the distribution of $(S_n - a_n)/b_n$ by the standard normal distribution. Indeed, in this case $b_n = n^{1/\alpha}$ and the approximation is by an $\alpha$-stable law (c.f. Definition 3.3.31 and Exercise 3.3.33).
Proof. We plan to apply Lindeberg's clt for the truncated random variables $X_{n,k} = b_n^{-1} X_k I_{|X_k| \le c_n}$, where $b_n = \sqrt{n \log n}$ and $c_n \ge 1$ are such that both $c_n/b_n \to 0$ and $c_n/\sqrt{n} \to \infty$. Indeed, for each $n$ the variables $X_{n,k}$, $k = 1, \ldots, n$, are i.i.d. of bounded and symmetric distribution (since both the distribution of $X_k$ and the truncation function are symmetric). Consequently, $EX_{n,k} = 0$ for all $n$ and $k$. Further, we have chosen $b_n$ such that
$$v_n = n EX_{n,1}^2 = \frac{n}{b_n^2} E[X_1^2 I_{|X_1| \le c_n}] = \frac{n}{b_n^2} \int_0^{c_n} 2x \big[P(|X_1| > x) - P(|X_1| > c_n)\big]\, dx$$
$$= \frac{n}{b_n^2} \Big[\int_0^1 2x\, dx + \int_1^{c_n} \frac{2}{x}\, dx - \int_0^{c_n} \frac{2x}{c_n^2}\, dx\Big] = \frac{2 n \log c_n}{b_n^2} \to 1$$
as $n \to \infty$. Finally, note that $|X_{n,k}| \le c_n/b_n \to 0$ as $n \to \infty$, implying that $g_n(\varepsilon) = 0$ for any $\varepsilon > 0$ and all $n$ large enough, hence Lindeberg's condition trivially holds. We thus deduce from Lindeberg's clt that $\frac{1}{\sqrt{n \log n}} \overline{S}_n \overset{D}{\to} G$ as $n \to \infty$, where $\overline{S}_n = \sum_{k=1}^{n} X_k I_{|X_k| \le c_n}$ is the sum of the truncated variables. We have chosen the truncation level $c_n$ large enough to assure that
$$P(\overline{S}_n \ne S_n) \le \sum_{k=1}^{n} P(|X_k| > c_n) = n P(|X_1| > c_n) = n c_n^{-2} \to 0$$
for $n \to \infty$, hence we may now conclude that $\frac{1}{\sqrt{n \log n}} S_n \overset{D}{\to} G$ as claimed.
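Proposition 3.1.12 can be probed by simulation, keeping in mind that the convergence is logarithmically slow. In the sketch below (an illustration, not from the text; sizes and seed are arbitrary choices) a variable with exactly the tail $P(|X| > x) = x^{-2}$, $x \ge 1$, is produced as $\pm U^{-1/2}$ for $U$ uniform on $(0, 1)$:

```python
import numpy as np

# Simulation of Proposition 3.1.12: symmetric X with P(|X| > x) = x^{-2} for
# x >= 1 (if U is uniform(0,1), |X| = U^{-1/2} has exactly this tail), under
# the nonstandard normalization sqrt(n log n). Sizes/seed are illustrative;
# convergence is slow (logarithmic), so the match is only rough.
rng = np.random.default_rng(3)
n, paths = 10_000, 400
u = rng.random((paths, n))
signs = rng.choice([-1.0, 1.0], size=(paths, n))
s = (signs / np.sqrt(u)).sum(axis=1) / np.sqrt(n * np.log(n))
frac = np.mean(np.abs(s) <= 1.0)
print(frac)  # roughly 2/3, slowly approaching P(|G| <= 1), about 0.683
```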
We conclude this section with Kolmogorov's three series theorem, the most definitive result on the convergence of random series.

Theorem 3.1.14 (Kolmogorov's three series theorem). Suppose $X_k$ are independent random variables. For non-random $c > 0$ let $X_n^{(c)} = X_n I_{|X_n| \le c}$ be the corresponding truncated variables and consider the three series
(3.1.11) $$\sum_n P(|X_n| > c)\,, \qquad \sum_n EX_n^{(c)}\,, \qquad \sum_n \mathrm{Var}(X_n^{(c)})\,.$$
Then, the random series $\sum_n X_n$ converges a.s. if and only if for some $c > 0$ all three series of (3.1.11) converge.
Remark. By convergence of a series we mean the existence of a finite limit to the sum of its first $m$ entries when $m \to \infty$. Note that the theorem implies that if all three series of (3.1.11) converge for some $c > 0$, then they necessarily converge for every $c > 0$.
Proof. We prove the sufficiency first, that is, assume that for some $c > 0$ all three series of (3.1.11) converge. By Theorem 2.3.16 and the finiteness of $\sum_n \mathrm{Var}(X_n^{(c)})$ it follows that the random series $\sum_n (X_n^{(c)} - EX_n^{(c)})$ converges a.s. Then, by our assumption that $\sum_n EX_n^{(c)}$ converges, also $\sum_n X_n^{(c)}$ converges a.s. Further, by assumption the sequence of probabilities $P(X_n \ne X_n^{(c)}) = P(|X_n| > c)$ is summable, hence by Borel-Cantelli I, we have that a.s. $X_n \ne X_n^{(c)}$ for at most finitely many $n$'s. The convergence a.s. of $\sum_n X_n^{(c)}$ thus results with the convergence a.s. of $\sum_n X_n$, as claimed.
We turn to prove the necessity of the convergence of the three series in (3.1.11) for the convergence of $\sum_n X_n$, which is where we use the clt. To this end, assume the random series $\sum_n X_n$ converges a.s. (to a finite limit) and fix an arbitrary constant $c > 0$. The convergence of $\sum_n X_n$ implies that $|X_n| \to 0$, hence a.s. $|X_n| > c$ for only finitely many $n$'s. In view of the independence of these events and Borel-Cantelli II, necessarily the sequence $P(|X_n| > c)$ is summable, that is, the series $\sum_n P(|X_n| > c)$ converges. Further, the convergence a.s. of $\sum_n X_n$ then results with the a.s. convergence of $\sum_n X_n^{(c)}$.
Suppose now that the non-decreasing sequence $v_n = \sum_{k=1}^{n} \mathrm{Var}(X_k^{(c)})$ is unbounded, in which case the latter convergence implies that a.s. $T_n = v_n^{-1/2} \sum_{k=1}^{n} X_k^{(c)} \to 0$ when $n \to \infty$. We further claim that in this case Lindeberg's clt applies for $\widehat{S}_n = \sum_{k=1}^{n} X_{n,k}$, where
$$X_{n,k} = v_n^{-1/2} \big(X_k^{(c)} - m_k^{(c)}\big)\,, \quad \text{and} \quad m_k^{(c)} = EX_k^{(c)}\,.$$
Indeed, per fixed $n$ the variables $X_{n,k}$ are mutually independent of zero mean and such that $\sum_{k=1}^{n} EX_{n,k}^2 = 1$. Further, since $|X_k^{(c)}| \le c$ and we assumed that $v_n \uparrow \infty$, it follows that $|X_{n,k}| \le 2c/\sqrt{v_n} \to 0$ as $n \to \infty$, resulting with Lindeberg's condition holding (as $g_n(\varepsilon) = 0$ when $\varepsilon > 2c/\sqrt{v_n}$, i.e. for all $n$ large enough). Combining Lindeberg's clt conclusion that $\widehat{S}_n \overset{D}{\to} G$ and $T_n \overset{a.s.}{\to} 0$, we deduce that $(\widehat{S}_n - T_n) \overset{D}{\to} G$ (c.f. Exercise 3.2.8). However, since $\widehat{S}_n - T_n = -v_n^{-1/2} \sum_{k=1}^{n} m_k^{(c)}$ are non-random, the sequence $P(\widehat{S}_n - T_n \le 0)$ is composed of zeros and ones, hence cannot converge to $P(G \le 0) = 1/2$. We arrive at a contradiction to our assumption that $v_n \uparrow \infty$, and so conclude that the sequence $\mathrm{Var}(X_n^{(c)})$ is summable, that is, the series $\sum_n \mathrm{Var}(X_n^{(c)})$ converges.
By Theorem 2.3.16, the summability of $\mathrm{Var}(X_n^{(c)})$ implies that the series $\sum_n (X_n^{(c)} - m_n^{(c)})$ converges a.s. We have already seen that $\sum_n X_n^{(c)}$ converges a.s., so it follows that their difference $\sum_n m_n^{(c)}$, which is the middle term of (3.1.11), converges as well.
3.2. Weak convergence

Focusing here on the theory of weak convergence, we first consider in Subsection 3.2.1 the convergence in distribution in a more general setting than that of the clt. This is followed by the study in Subsection 3.2.2 of weak convergence of probability measures and the theory associated with it, most notably its relation to other modes of convergence, such as convergence in total variation or point-wise convergence of probability density functions. We conclude by introducing in Subsection 3.2.3 the key concept of uniform tightness, which is instrumental to the derivation of weak convergence statements, as demonstrated in later sections of this chapter.
3.2.1. Convergence in distribution. Motivated by the clt, we explore here the convergence in distribution, its relation to convergence in probability, some additional properties and examples in which the limiting law is not the normal law. To start off, here is the definition of convergence in distribution.

Definition 3.2.1. We say that R.V.-s $X_n$ converge in distribution to a R.V. $X_\infty$, denoted by $X_n \overset{D}{\to} X_\infty$, if $F_{X_n}(\alpha) \to F_{X_\infty}(\alpha)$ as $n \to \infty$ for each fixed $\alpha$ which is a continuity point of $F_{X_\infty}$.
Similarly, we say that distribution functions $F_n$ converge weakly to $F_\infty$, denoted by $F_n \overset{w}{\to} F_\infty$, if $F_n(\alpha) \to F_\infty(\alpha)$ as $n \to \infty$ for each fixed $\alpha$ which is a continuity point of $F_\infty$.
Remark. If the limit R.V. $X_\infty$ has a probability density function, or more generally whenever $F_{X_\infty}$ is a continuous function, the convergence in distribution of $X_n$ to $X_\infty$ is equivalent to the point-wise convergence of the corresponding distribution functions. Such is the case of the clt, since the normal R.V. $G$ has a density. Further,

Exercise 3.2.2. Show that if $F_n \overset{w}{\to} F_\infty$ and $F_\infty(\cdot)$ is a continuous function, then also $\sup_x |F_n(x) - F_\infty(x)| \to 0$.
The clt is not the only example of convergence in distribution we have already met. Recall the Glivenko-Cantelli theorem (see Theorem 2.3.6), whereby a.s. the empirical distribution functions $\widehat{F}_n$ of an i.i.d. sequence of variables $X_i$ converge uniformly, hence point-wise, to the true distribution function $F_X$.
Here is an explicit necessary and sufficient condition for the convergence in distribution of integer valued random variables.

Exercise 3.2.3. Let $X_n$, $1 \le n \le \infty$, be integer valued R.V.-s. Show that $X_n \overset{D}{\to} X_\infty$ if and only if $P(X_n = k) \to P(X_\infty = k)$ as $n \to \infty$ for each $k \in \mathbb{Z}$.
In contrast with all of the preceding examples, we demonstrate next why the convergence $X_n \overset{D}{\to} X_\infty$ has been chosen to be strictly weaker than the point-wise convergence of the corresponding distribution functions. We also see that $Eh(X_n) \to Eh(X_\infty)$ or not, depending upon the choice of $h(\cdot)$, and even within the collection of continuous functions with image in $[-1, 1]$, the rate of this convergence is not uniform in $h$.

Example 3.2.4. The random variables $X_n = 1/n$ converge in distribution to $X_\infty = 0$. Indeed, it is easy to check that $F_{X_n}(\alpha) = I_{[1/n, \infty)}(\alpha)$ converge to $F_{X_\infty}(\alpha) = I_{[0, \infty)}(\alpha)$ at each $\alpha \ne 0$. However, there is no convergence at the discontinuity point $\alpha = 0$ of $F_{X_\infty}$, as $F_{X_\infty}(0) = 1$ while $F_{X_n}(0) = 0$ for all $n$.
Further, $Eh(X_n) = h(\frac{1}{n}) \to h(0) = Eh(X_\infty)$ if and only if $h(x)$ is continuous at $x = 0$, and the rate of convergence varies with the modulus of continuity of $h(x)$ at $x = 0$.
More generally, if $X_n = X + 1/n$ then $F_{X_n}(\alpha) = F_X(\alpha - 1/n) \to F_X(\alpha^-)$ as $n \to \infty$. So, in order for $X + 1/n$ to converge in distribution to $X$ as $n \to \infty$, we have to restrict such convergence to the continuity points of the limiting distribution function $F_X$, as done in Definition 3.2.1.
We have seen in Examples 3.1.7 and 3.1.8 that the normal distribution is a good approximation for the Binomial and the Poisson distributions (when the corresponding parameter is large). Our next example is of the same type, now with the approximation of the Geometric distribution by the Exponential one.

Example 3.2.5 (Exponential approximation of the Geometric). Let $Z_p$ be a random variable with a Geometric distribution of parameter $p \in (0, 1)$, that is, $P(Z_p \ge k) = (1-p)^{k-1}$ for any positive integer $k$. As $p \downarrow 0$, we see that
$$P(p Z_p > t) = (1-p)^{\lfloor t/p \rfloor} \to e^{-t} \quad \text{for all } t \ge 0.$$
That is, $p Z_p \xrightarrow{D} T$, with $T$ having a standard exponential distribution. As $Z_p$ corresponds to the number of independent trials till the first occurrence of a specific event whose probability is $p$, this approximation corresponds to waiting for the occurrence of rare events.
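The displayed convergence is easy to verify numerically (an added illustration, not from the notes); here $(1-p)^{\lfloor t/p\rfloor}$ is the exact tail of $pZ_p$ and the error vanishes as $p \downarrow 0$:

```python
from math import exp, floor

def tail(p: float, t: float) -> float:
    # P(p * Z_p > t) = (1-p)^floor(t/p) for the Geometric law P(Z_p >= k) = (1-p)^(k-1)
    return (1.0 - p) ** floor(t / p)

# the error against the standard exponential tail decreases as p -> 0
errs = [abs(tail(p, 1.5) - exp(-1.5)) for p in (0.1, 0.01, 0.001)]
```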
At this point, you are to check that convergence in probability implies the convergence in distribution, which is hence weaker than all notions of convergence explored in Section 1.3.3 (and is perhaps a reason for naming it weak convergence). The converse cannot hold, for example because convergence in distribution does not even require $X_n$ and $X_\infty$ to be defined on the same probability space. However, convergence in distribution is equivalent to convergence in probability when the limiting random variable is a non-random constant.

Exercise 3.2.6. Show that if $X_n \xrightarrow{p} X_\infty$, then $X_n \xrightarrow{D} X_\infty$. Conversely, if $X_n \xrightarrow{D} X_\infty$ and $X_\infty$ is almost surely a non-random constant, then $X_n \xrightarrow{p} X_\infty$.
Further, as the next theorem shows, given $F_n \xrightarrow{w} F_\infty$ it is possible to construct random variables $Y_n$, $n \le \infty$, such that $F_{Y_n} = F_n$ and $Y_n \xrightarrow{a.s.} Y_\infty$. The catch of course is to construct the appropriate coupling, that is, to specify the relation between the different $Y_n$-s.

Theorem 3.2.7. Let $F_n$ be a sequence of distribution functions that converges weakly to $F_\infty$. Then there exist random variables $Y_n$, $1 \le n \le \infty$, on the probability space $((0, 1], \mathcal{B}_{(0,1]}, U)$ such that $F_{Y_n} = F_n$ for $1 \le n \le \infty$ and $Y_n \xrightarrow{a.s.} Y_\infty$.
Proof. We use Skorokhod's representation as in the proof of Theorem 1.2.36. That is, for $\omega \in (0, 1]$ and $1 \le n \le \infty$ let $Y_n^+(\omega) \ge Y_n^-(\omega)$ be
$$Y_n^+(\omega) = \sup\{y : F_n(y) \le \omega\}, \qquad Y_n^-(\omega) = \sup\{y : F_n(y) < \omega\}.$$
While proving Theorem 1.2.36 we saw that $F_{Y_n^-} = F_n$ for any $n \le \infty$, and as remarked there, $Y_n^-(\omega) = Y_n^+(\omega)$ for all but at most countably many values of $\omega$, hence $P(Y_n^- = Y_n^+) = 1$. It thus suffices to show that for all $\omega \in (0, 1)$,
$$Y_\infty^+(\omega) \ge \limsup_{n\to\infty} Y_n^+(\omega) \ge \limsup_{n\to\infty} Y_n^-(\omega) \ge \liminf_{n\to\infty} Y_n^-(\omega) \ge Y_\infty^-(\omega). \tag{3.2.1}$$
Indeed, then $Y_n^-(\omega) \to Y_\infty^-(\omega)$ for any $\omega \in A = \{\omega : Y_\infty^+(\omega) = Y_\infty^-(\omega)\}$, where $P(A) = 1$. Hence, setting $Y_n = Y_n^+$ for $1 \le n \le \infty$ would complete the proof of the theorem.
106 3. WEAK CONVERGENCE, clt AND POISSON APPROXIMATION
Turning to prove (3.2.1), note that the two middle inequalities are trivial. Fixing $\omega \in (0, 1)$ we proceed to show that
$$Y_\infty^+(\omega) \ge \limsup_{n\to\infty} Y_n^+(\omega). \tag{3.2.2}$$
Since the continuity points of $F_\infty$ form a dense subset of $\mathbb{R}$ (see Exercise 1.2.38), it suffices for (3.2.2) to show that if $z > Y_\infty^+(\omega)$ is a continuity point of $F_\infty$, then necessarily $z \ge Y_n^+(\omega)$ for all $n$ large enough. To this end, note that $z > Y_\infty^+(\omega)$ implies by definition that $F_\infty(z) > \omega$. Since $z$ is a continuity point of $F_\infty$ and $F_n \xrightarrow{w} F_\infty$, we know that $F_n(z) \to F_\infty(z)$. Hence, $F_n(z) > \omega$ for all sufficiently large $n$. By definition of $Y_n^+$ and monotonicity of $F_n$, this implies that $z \ge Y_n^+(\omega)$, as needed. The proof of
$$\liminf_{n\to\infty} Y_n^-(\omega) \ge Y_\infty^-(\omega) \tag{3.2.3}$$
is analogous. For $y < Y_\infty^-(\omega)$ we know by monotonicity of $F_\infty$ that $F_\infty(y) < \omega$. Assuming further that $y$ is a continuity point of $F_\infty$, this implies that $F_n(y) < \omega$ for all sufficiently large $n$, which in turn results with $y \le Y_n^-(\omega)$. Taking continuity points $y_k$ of $F_\infty$ such that $y_k \uparrow Y_\infty^-(\omega)$ will yield (3.2.3) and complete the proof. □
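The coupling of the proof can be made concrete (a sketch added for illustration; the choice $F_n = $ CDF of $X + 1/n$ with $X$ standard exponential, as in the discussion after Example 3.2.4, is an assumption for the demo): all $Y_n$ are driven by one uniform variable $\omega$, and $Y_n(\omega) \to Y_\infty(\omega)$.

```python
import math

def F(n, y):
    # F_n = CDF of X + 1/n with X ~ Exponential(1); n = math.inf gives F_infinity
    shift = 0.0 if n == math.inf else 1.0 / n
    return 0.0 if y < shift else 1.0 - math.exp(-(y - shift))

def Y(n, w):
    # Y_n(w) = sup{y : F_n(y) < w}, available in closed form for this family
    shift = 0.0 if n == math.inf else 1.0 / n
    return shift - math.log(1.0 - w)

# sanity: Y(n, .) inverts F(n, .), and Y_n(w) -> Y_infinity(w) for each fixed w
checks = [abs(F(n, Y(n, w)) - w) for n in (1, 7, math.inf) for w in (0.2, 0.8)]
gaps = [abs(Y(10**6, w) - Y(math.inf, w)) for w in (0.1, 0.5, 0.9)]
```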
The next exercise provides useful ways to get convergence in distribution for one sequence out of that of another sequence. Its result is also called the converging together lemma or Slutsky's lemma.

Exercise 3.2.8. Suppose that $X_n \xrightarrow{D} X_\infty$ and $Y_n \xrightarrow{D} Y_\infty$, where $Y_\infty$ is non-random and for each $n$ the variables $X_n$ and $Y_n$ are defined on the same probability space.
(a) Show that then $X_n + Y_n \xrightarrow{D} X_\infty + Y_\infty$.
Hint: Recall that the collection of continuity points of $F_{X_\infty}$ is dense.
(b) Deduce that if $Z_n - X_n \xrightarrow{D} 0$ then $X_n \xrightarrow{D} X$ if and only if $Z_n \xrightarrow{D} X$.
(c) Show that $Y_n X_n \xrightarrow{D} Y_\infty X_\infty$.
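Part (a) is easy to see in simulation (a seeded Monte Carlo illustration added here; the specific laws are assumptions for the demo): with $X_n$ approximately standard normal and $Y_n \to 2$ non-random, $X_n + Y_n$ is approximately $N(2, 1)$.

```python
import random

random.seed(0)
n = 10**5
c = 2.0
xs = [random.gauss(0.0, 1.0) for _ in range(n)]             # X_n => N(0,1)
ys = [c + random.uniform(-1.0, 1.0) / n for _ in range(n)]  # Y_n => the constant c
zs = [x + y for x, y in zip(xs, ys)]                        # Slutsky: approximately N(c, 1)
mean = sum(zs) / n
var = sum((z - mean) ** 2 for z in zs) / n
```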
For example, here is an application of Exercise 3.2.8 en-route to a clt connected
to renewal theory.
Exercise 3.2.9.
(a) Suppose $N_m$ are non-negative integer-valued random variables and $b_m \to \infty$ are non-random integers such that $N_m/b_m \xrightarrow{p} 1$. Show that if $S_n = \sum_{k=1}^n X_k$ for i.i.d. random variables $X_k$ with $v = \mathrm{Var}(X_1) \in (0, \infty)$ and $E(X_1) = 0$, then $S_{N_m}/\sqrt{v b_m} \xrightarrow{D} G$ as $m \to \infty$.
Hint: Use Kolmogorov's inequality to show that $S_{N_m}/\sqrt{v b_m} - S_{b_m}/\sqrt{v b_m} \xrightarrow{p} 0$.
(b) Let $N_t = \sup\{n : S_n \le t\}$ for $S_n = \sum_{k=1}^n Y_k$ and i.i.d. random variables $Y_k > 0$ such that $v = \mathrm{Var}(Y_1) \in (0, \infty)$ and $E(Y_1) = 1$. Show that $(N_t - t)/\sqrt{v t} \xrightarrow{D} G$ as $t \to \infty$.
Theorem 3.2.7 is key to solving the following:
Exercise 3.2.10. Suppose that $Z_n \xrightarrow{D} Z_\infty$. Show that then $b_n(f(c + Z_n/b_n) - f(c))/f'(c) \xrightarrow{D} Z_\infty$ for every positive constants $b_n \to \infty$ and every Borel function $f : \mathbb{R} \to \mathbb{R}$ (not necessarily continuous) that is differentiable at $c \in \mathbb{R}$, with a derivative $f'(c) \ne 0$.
Consider the following exercise as a cautionary note about your interpretation of
Theorem 3.2.7.
Exercise 3.2.11. Let $M_n = \sum_{k=1}^n \prod_{i=1}^k U_i$ and $W_n = \sum_{k=1}^n \prod_{i=k}^n U_i$, where $U_i$ are i.i.d. uniformly on $[0, c]$ and $c > 0$.
(a) Show that $M_n \xrightarrow{a.s.} M_\infty$ as $n \to \infty$, with $M_\infty$ taking values in $[0, \infty]$.
(b) Prove that $M_\infty$ is a.s. finite if and only if $c < e$ (but $E M_\infty$ is finite only for $c < 2$).
(c) In case $c < e$ prove that $W_n \xrightarrow{D} M_\infty$ as $n \to \infty$, while $W_n$ can not have an almost sure limit. Explain why this does not contradict Theorem 3.2.7.
The next exercise relates the decay (in $n$) of $\sup_s |F_{X_\infty}(s) - F_{X_n}(s)|$ to that of $\sup |E h(X_n) - E h(X_\infty)|$ over all functions $h : \mathbb{R} \to [-M, M]$ with $\sup_x |h'(x)| \le L$.

Exercise 3.2.12. Let $\Delta_n = \sup_s |F_{X_\infty}(s) - F_{X_n}(s)|$.
(a) Show that if $\sup_x |h(x)| \le M$ and $\sup_x |h'(x)| \le L$, then for any $b > a$, $C = 4M + L(b - a)$ and all $n$,
$$|E h(X_n) - E h(X_\infty)| \le C \Delta_n + 4 M P(X_\infty \notin [a, b]).$$
(b) Show that if $X_\infty \in [a, b]$ and $f_{X_\infty}(x) \ge \eta > 0$ for all $x \in [a, b]$, then $|Q_n(\omega) - Q_\infty(\omega)| \le \eta^{-1} \Delta_n$ for any $\omega \in (\Delta_n, 1 - \Delta_n)$, where $Q_n(\omega) = \sup\{x : F_{X_n}(x) < \omega\}$ denotes the $\omega$-quantile for the law of $X_n$. Using this, construct $Y_n \stackrel{D}{=} X_n$ such that $P(|Y_n - Y_\infty| > \eta^{-1}\Delta_n) \le 2\Delta_n$, and deduce the bound of part (a), albeit with the larger value $4M + L/\eta$ of $C$.
Here is another example of convergence in distribution, this time in the context
of extreme value theory.
Exercise 3.2.13. Let $M_n = \max_{1\le i\le n} T_i$, where $T_i$, $i = 1, 2, \ldots$ are i.i.d. random variables of distribution function $F_T(t)$. Noting that $F_{M_n}(x) = F_T(x)^n$, show that $b_n^{-1}(M_n - a_n) \xrightarrow{D} M_\infty$ when:
(a) $F_T(t) = 1 - e^{-t}$ for $t \ge 0$ (i.e. $T_i$ are Exponential of parameter one). Here, $a_n = \log n$, $b_n = 1$ and $F_{M_\infty}(y) = \exp(-e^{-y})$ for $y \in \mathbb{R}$.
(b) $F_T(t) = 1 - t^{-\alpha}$ for $t \ge 1$ and $\alpha > 0$. Here, $a_n = 0$, $b_n = n^{1/\alpha}$ and $F_{M_\infty}(y) = \exp(-y^{-\alpha})$ for $y > 0$.
(c) $F_T(t) = 1 - |t|^{\alpha}$ for $-1 \le t \le 0$ and $\alpha > 0$. Here, $a_n = 0$, $b_n = n^{-1/\alpha}$ and $F_{M_\infty}(y) = \exp(-|y|^{\alpha})$ for $y \le 0$.
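For case (a) the convergence can be checked by direct computation, since $F_{M_n}(a_n + y) = (1 - e^{-(\log n + y)})^n$ in closed form (an added numerical illustration, not part of the notes):

```python
from math import exp, log

def F_Mn_centered(n: int, y: float) -> float:
    # P(M_n - log n <= y) = F_T(log n + y)^n with F_T(t) = 1 - e^{-t}
    return (1.0 - exp(-(log(n) + y))) ** n

gumbel = lambda y: exp(-exp(-y))  # the Type 1 limit of part (a)
errs = [abs(F_Mn_centered(10**6, y) - gumbel(y)) for y in (-1.0, 0.0, 2.0)]
```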
Remark. Up to a linear transformation of $y$, the three distributions of $M_\infty$ provided in Exercise 3.2.13 are the only possible limits of maxima of i.i.d. random variables. They are thus called the extreme value distributions of Type 1 (or Gumbel-type) in case (a), Type 2 (or Fréchet-type) in case (b), and Type 3 (or Weibull-type) in case (c). Indeed,
Exercise 3.2.14.
(a) Building upon part (a) of Exercise 2.2.24, show that if $G$ has the standard normal distribution, then for any $y \in \mathbb{R}$,
$$\lim_{t\to\infty} \frac{1 - F_G(t + y/t)}{1 - F_G(t)} = e^{-y}.$$
(b) Let $M_n = \max_{1\le i\le n} G_i$ for i.i.d. standard normal random variables $G_i$. Show that $b_n(M_n - b_n) \xrightarrow{D} M_\infty$, where $F_{M_\infty}(y) = \exp(-e^{-y})$ and $b_n$ is such that $1 - F_G(b_n) = n^{-1}$.
(c) Show that $b_n/\sqrt{2\log n} \to 1$ as $n \to \infty$ and deduce that $M_n/\sqrt{2\log n} \xrightarrow{p} 1$.
(d) More generally, suppose $T_t = \inf\{x \ge 0 : M_x \ge t\}$, where $x \mapsto M_x$ is some monotone non-decreasing family of random variables such that $M_0 = 0$. Show that if $e^{-t} T_t \xrightarrow{D} T_\infty$ as $t \to \infty$, with $T_\infty$ having the standard exponential distribution, then $(M_x - \log x) \xrightarrow{D} M_\infty$ as $x \to \infty$, where $F_{M_\infty}(y) = \exp(-e^{-y})$.
Our next example is of a more combinatorial flavor.
Exercise 3.2.15 (The birthday problem). Suppose $X_i$ are i.i.d. with each $X_i$ uniformly distributed on $\{1, \ldots, n\}$. Let $T_n = \min\{k : X_k = X_l \text{ for some } l < k\}$ mark the first coincidence among the entries of the sequence $X_1, X_2, \ldots$, so
$$P(T_n > r) = \prod_{k=2}^{r} \Big(1 - \frac{k-1}{n}\Big)$$
is the probability that among $r$ items chosen uniformly and independently from a set of $n$ different objects, no two are the same (the name birthday problem corresponds to $n = 365$, with the items interpreted as the birthdays for a group of size $r$). Show that $P(n^{-1/2} T_n > s) \to \exp(-s^2/2)$ as $n \to \infty$, for any fixed $s \ge 0$.
Hint: Recall that $-x - x^2 \le \log(1 - x) \le -x$ for $x \in [0, 1/2]$.
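The stated limit is already visible for moderate $n$, and the same product recovers the classical birthday numbers (an added numerical check, not from the notes):

```python
from math import exp, sqrt

def survival(n: int, r: int) -> float:
    # P(T_n > r) = prod_{k=2}^{r} (1 - (k-1)/n)
    out = 1.0
    for k in range(2, r + 1):
        out *= 1.0 - (k - 1) / n
    return out

n, s = 10**6, 1.0
err = abs(survival(n, int(s * sqrt(n))) - exp(-s * s / 2))
```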
The symmetric, simple random walk on the integers is the sequence of random variables $S_n = \sum_{k=1}^n \xi_k$, where $\xi_k$ are i.i.d. such that $P(\xi_k = 1) = P(\xi_k = -1) = \frac{1}{2}$. From the clt we already know that $n^{-1/2} S_n \xrightarrow{D} G$. The next exercise provides the asymptotics of the first and last visits to zero by this random sequence, namely $R = \inf\{\ell \ge 1 : S_\ell = 0\}$ and $L_n = \sup\{\ell \le n : S_\ell = 0\}$. Much more is known about this random sequence (c.f. [Dur10, Section 4.3] or [Fel68, Chapter 3]).
Exercise 3.2.16. Let $q_{n,r} = P(S_1 > 0, \ldots, S_{n-1} > 0, S_n = r)$ and
$$p_{n,r} = P(S_n = r) = 2^{-n} \binom{n}{k}, \qquad k = (n + r)/2.$$
(a) Counting paths of the walk, prove the discrete reflection principle that $P_x(R < n, S_n = y) = P_{-x}(S_n = y) = p_{n,x+y}$ for any positive integers $x, y$, where $P_x(\cdot)$ denote probabilities for the walk starting at $S_0 = x$.
(b) Verify that $q_{n,r} = \frac{1}{2}(p_{n-1,r-1} - p_{n-1,r+1})$ for any $n, r \ge 1$.
Hint: Paths of the walk contributing to $q_{n,r}$ must have $S_1 = 1$. Hence, use part (a) with $x = 1$ and $y = r$.
(c) Deduce that $P(R > n) = p_{n-1,0} + p_{n-1,1}$ and that $P(L_{2n} = 2k) = p_{2k,0}\, p_{2n-2k,0}$ for $k = 0, 1, \ldots, n$.
(d) Using Stirling's formula (that $\sqrt{2\pi n}(n/e)^n / n! \to 1$ as $n \to \infty$), show that $\sqrt{\pi n}\, P(R > 2n) \to 1$ and that $(2n)^{-1} L_{2n} \xrightarrow{D} X$, where $X$ has the arc-sine probability density function $f_X(x) = \frac{1}{\pi\sqrt{x(1-x)}}$ on $[0, 1]$.
(e) Let $H_{2n}$ count the number of $1 \le k \le 2n$ such that $S_k \ge 0$ and $S_{k-1} \ge 0$. Show that $H_{2n} \stackrel{D}{=} L_{2n}$, hence $(2n)^{-1} H_{2n} \xrightarrow{D} X$.
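The exact formula of part (c) makes the arc-sine limit of part (d) easy to probe numerically (an added illustration, not from the notes): the probabilities $p_{2k,0}\,p_{2n-2k,0}$ sum to one, and their partial sums track the arc-sine distribution function $\frac{2}{\pi}\arcsin(\sqrt{x})$.

```python
from math import asin, comb, pi, sqrt

def p0(m: int) -> float:
    # p_{m,0} = P(S_m = 0); zero for odd m
    return comb(m, m // 2) / 2**m if m % 2 == 0 else 0.0

n = 500
pmf = [p0(2 * k) * p0(2 * n - 2 * k) for k in range(n + 1)]  # P(L_{2n} = 2k)
total = sum(pmf)
# P(L_{2n} <= 2n/4) should be close to (2/pi) * arcsin(sqrt(1/4)) = 1/3
quarter = sum(pmf[: n // 4 + 1])
target = 2.0 / pi * asin(sqrt(0.25))
```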
3.2.2. Weak convergence of probability measures. We first extend the definition of weak convergence from distribution functions to measures on Borel $\sigma$-algebras.

Definition 3.2.17. For a topological space $S$, let $C_b(S)$ denote the collection of all continuous bounded functions on $S$. We say that a sequence of probability measures $\nu_n$ on a topological space $S$ equipped with its Borel $\sigma$-algebra (see Example 1.1.15) converges weakly to a probability measure $\nu_\infty$, denoted $\nu_n \xrightarrow{w} \nu_\infty$, if $\nu_n(h) \to \nu_\infty(h)$ for each $h \in C_b(S)$.
As we show next, Definition 3.2.17 is an alternative definition of convergence in distribution which, in contrast to Definition 3.2.1, applies to more general R.V. (for example, to the $\mathbb{R}^d$-valued random variables we consider in Section 3.5).

Proposition 3.2.18. The weak convergence of distribution functions is equivalent to the weak convergence of the corresponding laws as probability measures on $(\mathbb{R}, \mathcal{B})$. Consequently, $X_n \xrightarrow{D} X_\infty$ if and only if for each $h \in C_b(\mathbb{R})$, we have $E h(X_n) \to E h(X_\infty)$ as $n \to \infty$.
Proof. Suppose first that $F_n \xrightarrow{w} F_\infty$ and let $Y_n$, $1 \le n \le \infty$, be the random variables given by Theorem 3.2.7, such that $Y_n \xrightarrow{a.s.} Y_\infty$. For $h \in C_b(\mathbb{R})$ we have by continuity of $h$ that $h(Y_n) \xrightarrow{a.s.} h(Y_\infty)$, and by bounded convergence, with $\nu_n$ denoting the probability measure whose distribution function is $F_n$, also
$$\nu_n(h) = E(h(Y_n)) \to E(h(Y_\infty)) = \nu_\infty(h).$$
Conversely, suppose that $\nu_n \xrightarrow{w} \nu_\infty$ per Definition 3.2.17. Fixing $\alpha \in \mathbb{R}$, let the non-negative $h_k^{\pm} \in C_b(\mathbb{R})$ be such that $h_k^-(x) \uparrow I_{(-\infty,\alpha)}(x)$ and $h_k^+(x) \downarrow I_{(-\infty,\alpha]}(x)$ as $k \to \infty$ (c.f. Lemma 3.1.6 for a construction of such functions). We have by the weak convergence of the laws when $n \to \infty$, followed by monotone convergence as $k \to \infty$, that
$$\liminf_{n\to\infty} \nu_n((-\infty, \alpha)) \ge \lim_{n\to\infty} \nu_n(h_k^-) = \nu_\infty(h_k^-) \uparrow \nu_\infty((-\infty, \alpha)) = F_\infty(\alpha^-).$$
Similarly, considering $h_k^+(\cdot)$ and then $k \to \infty$, we have by bounded convergence that
$$\limsup_{n\to\infty} \nu_n((-\infty, \alpha]) \le \lim_{n\to\infty} \nu_n(h_k^+) = \nu_\infty(h_k^+) \downarrow \nu_\infty((-\infty, \alpha]) = F_\infty(\alpha).$$
For any continuity point $\alpha$ of $F_\infty$ we conclude that $F_n(\alpha) = \nu_n((-\infty, \alpha])$ converges as $n \to \infty$ to $F_\infty(\alpha) = F_\infty(\alpha^-)$, thus completing the proof. □
By yet another application of Theorem 3.2.7 we find that convergence in distribution is preserved under a.s. continuous mappings (see Corollary 2.2.13 for the analogous statement for convergence in probability).
Proposition 3.2.19 (Continuous mapping). For a Borel function $g$ let $D_g$ denote its set of points of discontinuity. If $X_n \xrightarrow{D} X_\infty$ and $P(X_\infty \in D_g) = 0$, then $g(X_n) \xrightarrow{D} g(X_\infty)$. If in addition $g$ is bounded, then $E g(X_n) \to E g(X_\infty)$.
Proof. Given $X_n \xrightarrow{D} X_\infty$, by Theorem 3.2.7 there exist $Y_n \stackrel{D}{=} X_n$ such that $Y_n \xrightarrow{a.s.} Y_\infty$. Fixing $h \in C_b(\mathbb{R})$, clearly $D_{h\circ g} \subseteq D_g$, so
$$P(Y_\infty \in D_{h\circ g}) \le P(Y_\infty \in D_g) = 0.$$
Therefore, by Exercise 2.2.12, it follows that $h(g(Y_n)) \xrightarrow{a.s.} h(g(Y_\infty))$. Since $h \circ g$ is bounded and $Y_n \stackrel{D}{=} X_n$ for all $n$, it follows by bounded convergence that
$$E h(g(X_n)) = E h(g(Y_n)) \to E h(g(Y_\infty)) = E h(g(X_\infty)).$$
This holds for any $h \in C_b(\mathbb{R})$, so by Proposition 3.2.18 we conclude that $g(X_n) \xrightarrow{D} g(X_\infty)$. □
Our next theorem collects several equivalent characterizations of weak convergence of probability measures on $(\mathbb{R}, \mathcal{B})$. To this end we need the following definition.

Definition 3.2.20. For a subset $A$ of a topological space $S$, we denote by $\partial A$ the boundary of $A$, that is, $\partial A = \bar{A} \setminus A^{o}$ is the closed set of points in the closure of $A$ but not in the interior of $A$. For a measure $\mu$ on $(S, \mathcal{B}_S)$ we say that $A \in \mathcal{B}_S$ is a $\mu$-continuity set if $\mu(\partial A) = 0$.
Theorem 3.2.21 (portmanteau theorem). The following four statements are equivalent for any probability measures $\nu_n$, $1 \le n \le \infty$, on $(\mathbb{R}, \mathcal{B})$.
(a) $\nu_n \xrightarrow{w} \nu_\infty$
(b) For every closed set $F$, one has $\limsup_n \nu_n(F) \le \nu_\infty(F)$
(c) For every open set $G$, one has $\liminf_n \nu_n(G) \ge \nu_\infty(G)$
(d) For every $\nu_\infty$-continuity set $A$, one has $\lim_n \nu_n(A) = \nu_\infty(A)$
Remark. As shown in Subsection 3.5.1, this theorem holds with $(\mathbb{R}, \mathcal{B})$ replaced by any metric space $S$ and its Borel $\sigma$-algebra $\mathcal{B}_S$.
For $\nu_n = \mathcal{P}_{X_n}$ we get the formulation of the Portmanteau theorem for random variables $X_n$, $1 \le n \le \infty$, where the following four statements are then equivalent to $X_n \xrightarrow{D} X_\infty$:
(a) $E h(X_n) \to E h(X_\infty)$ for each bounded continuous $h$
(b) For every closed set $F$ one has $\limsup_n P(X_n \in F) \le P(X_\infty \in F)$
(c) For every open set $G$ one has $\liminf_n P(X_n \in G) \ge P(X_\infty \in G)$
(d) For every Borel set $A$ such that $P(X_\infty \in \partial A) = 0$, one has $\lim_n P(X_n \in A) = P(X_\infty \in A)$
Proof. It suffices to show that (a) $\Rightarrow$ (b) $\Rightarrow$ (c) $\Rightarrow$ (d) $\Rightarrow$ (a), which we shall establish in that order. To this end, with $F_n(x) = \nu_n((-\infty, x])$ denoting the corresponding distribution functions, we replace $\nu_n \xrightarrow{w} \nu_\infty$ of (a) by the equivalent condition $F_n \xrightarrow{w} F_\infty$ (see Proposition 3.2.18).
(a) $\Rightarrow$ (b). Assuming $F_n \xrightarrow{w} F_\infty$, we have the random variables $Y_n$, $1 \le n \le \infty$, of Theorem 3.2.7, such that $\mathcal{P}_{Y_n} = \nu_n$ and $Y_n \xrightarrow{a.s.} Y_\infty$. Since $F$ is closed, the function $I_F$ is upper semi-continuous and bounded by one, so it follows that a.s.
$$\limsup_{n\to\infty} I_F(Y_n) \le I_F(Y_\infty),$$
and by Fatou's lemma,
$$\limsup_{n\to\infty} \nu_n(F) = \limsup_{n\to\infty} E I_F(Y_n) \le E \limsup_{n\to\infty} I_F(Y_n) \le E I_F(Y_\infty) = \nu_\infty(F),$$
as stated in (b).
(b) $\Rightarrow$ (c). The complement $F = G^{c}$ of an open set $G$ is a closed set, so by (b) we have that
$$1 - \liminf_{n\to\infty} \nu_n(G) = \limsup_{n\to\infty} \nu_n(G^{c}) \le \nu_\infty(G^{c}) = 1 - \nu_\infty(G),$$
implying that (c) holds. In an analogous manner we can show that (c) $\Rightarrow$ (b), so (b) and (c) are equivalent.
(c) $\Rightarrow$ (d). Since (b) and (c) are equivalent, we assume now that both (b) and (c) hold. Then, applying (c) for the open set $G = A^{o}$ and (b) for the closed set $F = \bar{A}$, we have that
$$\nu_\infty(\bar{A}) \ge \limsup_{n\to\infty} \nu_n(\bar{A}) \ge \limsup_{n\to\infty} \nu_n(A) \ge \liminf_{n\to\infty} \nu_n(A) \ge \liminf_{n\to\infty} \nu_n(A^{o}) \ge \nu_\infty(A^{o}). \tag{3.2.4}$$
Further, $\bar{A} = A^{o} \cup \partial A$, so $\nu_\infty(\partial A) = 0$ implies that $\nu_\infty(\bar{A}) = \nu_\infty(A^{o}) = \nu_\infty(A)$ (with the last equality due to the fact that $A^{o} \subseteq A \subseteq \bar{A}$). Consequently, for such a set $A$ all the inequalities in (3.2.4) are equalities, yielding (d).
(d) $\Rightarrow$ (a). Consider the set $A = (-\infty, \alpha]$, where $\alpha$ is a continuity point of $F_\infty$. Then, $\partial A = \{\alpha\}$ and $\nu_\infty(\{\alpha\}) = F_\infty(\alpha) - F_\infty(\alpha^-) = 0$. Applying (d) for this choice of $A$, we have that
$$\lim_{n\to\infty} F_n(\alpha) = \lim_{n\to\infty} \nu_n((-\infty, \alpha]) = \nu_\infty((-\infty, \alpha]) = F_\infty(\alpha),$$
which is our version of (a). □
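The one-sided inequalities in (b) and (c) can be strict, as the point masses $\delta_{1/n} \Rightarrow \delta_0$ of Example 3.2.4 show (a small added illustration; the sets below are chosen for the demo):

```python
# nu_n = point mass at 1/n, nu_inf = point mass at 0 (Example 3.2.4)
N = range(1, 1001)
nu_F = [1.0 if 1.0 / n == 0.0 else 0.0 for n in N]  # F = {0} closed: nu_n(F) = 0
nu_G = [1.0 if 1.0 / n > 0.0 else 0.0 for n in N]   # G = (0, inf) open: nu_n(G) = 1
limsup_F, liminf_G = max(nu_F), min(nu_G)
nu_inf_F, nu_inf_G = 1.0, 0.0  # nu_inf(F) = 1, nu_inf(G) = 0
# (d) does not apply to these sets: the boundary {0} carries nu_inf-mass one
```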
We turn to relate the weak convergence to the point-wise convergence of probability density functions. To this end, we first define a new concept of convergence of measures, the convergence in total-variation.
Definition 3.2.22. The total variation norm of a finite signed measure $\nu$ on the measurable space $(S, \mathcal{F})$ is
$$\|\nu\|_{tv} = \sup\{\nu(h) : h \in m\mathcal{F}, \ \sup_{s\in S} |h(s)| \le 1\}.$$
We say that a sequence of probability measures $\nu_n$ converges in total variation to a probability measure $\nu_\infty$, denoted $\nu_n \xrightarrow{t.v.} \nu_\infty$, if $\|\nu_n - \nu_\infty\|_{tv} \to 0$.
Remark. Note that $\|\nu\|_{tv} = 1$ for any probability measure $\nu$ (since $\nu(h) \le \nu(|h|) \le \|h\|_\infty \nu(1) \le 1$ for the functions $h$ considered, with equality for $h = 1$). By a similar reasoning, $\|\nu - \nu'\|_{tv} \le 2$ for any two probability measures $\nu, \nu'$ on $(S, \mathcal{F})$.
Convergence in total-variation obviously implies weak convergence of the same probability measures, but the converse fails, as demonstrated for example by $\nu_n = \delta_{1/n}$, the probability measure on $(\mathbb{R}, \mathcal{B})$ assigning probability one to the point $1/n$, which converge weakly to $\nu_\infty = \delta_0$ (see Example 3.2.4), whereas $\|\nu_n - \nu_\infty\|_{tv} = 2$ for all $n$. The difference of course has to do with the non-uniformity of the weak convergence with respect to the continuous function $h$.
To gain a better understanding of the convergence in total-variation, we consider
an important special case.
Proposition 3.2.23. Suppose $P = f\mu$ and $Q = g\mu$ for some measure $\mu$ on $(S, \mathcal{F})$ and $f, g \in m\mathcal{F}_+$ such that $\mu(f) = \mu(g) = 1$. Then,
$$\|P - Q\|_{tv} = \int_S |f(s) - g(s)| \, d\mu(s). \tag{3.2.5}$$
Further, suppose $\nu_n = f_n \mu$ with $f_n \in m\mathcal{F}_+$ such that $\mu(f_n) = 1$ for all $n \le \infty$. Then, $\nu_n \xrightarrow{t.v.} \nu_\infty$ if $f_n(s) \to f_\infty(s)$ for $\mu$-almost-every $s \in S$.
Proof. For any measurable function $h : S \to [-1, 1]$ we have that
$$(f\mu)(h) - (g\mu)(h) = \mu(fh) - \mu(gh) = \mu((f - g)h) \le \mu(|f - g|),$$
with equality when $h(s) = \mathrm{sgn}(f(s) - g(s))$ (see Proposition 1.3.56 for the left-most identity, and note that $fh$ and $gh$ are in $L^1(S, \mathcal{F}, \mu)$). Consequently, $\|P - Q\|_{tv} = \sup\{(f\mu)(h) - (g\mu)(h) : h \text{ as above}\} = \mu(|f - g|)$, as claimed.
For $\nu_n = f_n \mu$, we thus have that $\|\nu_n - \nu_\infty\|_{tv} = \mu(|f_n - f_\infty|)$, so the convergence in total-variation is equivalent to $f_n \to f_\infty$ in $L^1(S, \mathcal{F}, \mu)$. Since $f_n \ge 0$ and $\mu(f_n) = 1$ for any $n \le \infty$, it follows from Scheffé's lemma (see Lemma 1.3.35) that the latter convergence is a consequence of $f_n(s) \to f_\infty(s)$ for $\mu$-a.e. $s \in S$. □
Two specic instances of Proposition 3.2.23 are of particular value in applications.
Example 3.2.24. Let $\nu_n = \mathcal{P}_{X_n}$ denote the laws of random variables $X_n$ that have probability density functions $f_n$, $n = 1, 2, \ldots, \infty$. Recall Exercise 1.3.66 that then $\nu_n = f_n \lambda$ for Lebesgue's measure $\lambda$ on $(\mathbb{R}, \mathcal{B})$. Hence, by the preceding proposition, the point-wise convergence of $f_n(x)$ to $f_\infty(x)$ implies the convergence in total-variation of $\mathcal{P}_{X_n}$ to $\mathcal{P}_{X_\infty}$, and in particular implies that $X_n \xrightarrow{D} X_\infty$.
Example 3.2.25. Similarly, if $X_n$ are integer valued for $n = 1, 2, \ldots$, then $\nu_n = f_n \mu$ for $f_n(k) = P(X_n = k)$ and the counting measure $\mu$ on $(\mathbb{Z}, 2^{\mathbb{Z}})$ such that $\mu(\{k\}) = 1$ for each $k \in \mathbb{Z}$. So, by the preceding proposition, the point-wise convergence of Exercise 3.2.3 is not only necessary and sufficient for weak convergence, but also for convergence in total-variation of the laws of $X_n$ to that of $X_\infty$.
In the next exercise, you are to rephrase Example 3.2.25 in terms of the topological
space of all probability measures on Z.
Exercise 3.2.26. Show that $d(\nu, \mu) = \|\nu - \mu\|_{tv}$ is a metric on the collection of all probability measures on $\mathbb{Z}$, and that in this space the convergence in total variation is equivalent to the weak convergence, which in turn is equivalent to the point-wise convergence at each $x \in \mathbb{Z}$.
Hence, under the framework of Example 3.2.25, the Glivenko-Cantelli theorem tells us that the empirical measures of integer valued i.i.d. R.V.-s $X_i$ converge in total-variation to the true law of $X_1$.
Here is an example from statistics that corresponds to the framework of Example
3.2.24.
Exercise 3.2.27. Let $V_{n+1}$ denote the central value on a list of $2n+1$ values (that is, the $(n+1)$-th largest value on the list). Suppose the list consists of mutually independent R.V., each chosen uniformly in $[0, 1)$.
(a) Show that $V_{n+1}$ has probability density function $(2n+1)\binom{2n}{n} v^n (1-v)^n$ at each $v \in [0, 1)$.
(b) Verify that the density $f_n(v)$ of $\widetilde{V}_n = \sqrt{2n}(2V_{n+1} - 1)$ is of the form $f_n(v) = c_n (1 - v^2/(2n))^n$ for some normalization constant $c_n$ that is independent of $|v| \le \sqrt{2n}$.
(c) Deduce that for $n \to \infty$ the densities $f_n(v)$ converge point-wise to the standard normal density, and conclude that $\widetilde{V}_n \xrightarrow{D} G$.
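Parts (b) and (c) can be verified numerically (an added check; the explicit value $c_n = (2n+1)\binom{2n}{n}4^{-n}/(2\sqrt{2n})$ is derived here from part (a) by the change of variables, and is not stated in the notes):

```python
from math import comb, exp, pi, sqrt

def f_n(n: int, v: float) -> float:
    # density of sqrt(2n) * (2 * V_{n+1} - 1), via change of variables in part (a)
    c_n = (2 * n + 1) * comb(2 * n, n) / 4**n / (2.0 * sqrt(2.0 * n))
    return c_n * (1.0 - v * v / (2.0 * n)) ** n

phi = lambda v: exp(-v * v / 2.0) / sqrt(2.0 * pi)  # standard normal density
errs = [abs(f_n(2000, v) - phi(v)) for v in (0.0, 1.0, 2.0)]
```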
Here is an interesting interpretation of the clt in terms of weak convergence of
probability measures.
Exercise 3.2.28. Let $\mathcal{M}$ denote the set of probability measures $\mu$ on $(\mathbb{R}, \mathcal{B})$ for which $\int x^2 \, d\mu(x) = 1$ and $\int x \, d\mu(x) = 0$, and let $\gamma \in \mathcal{M}$ denote the standard normal distribution. Consider the mapping $T : \mathcal{M} \to \mathcal{M}$ where $T\mu$ is the law of $(X_1 + X_2)/\sqrt{2}$ for $X_1$ and $X_2$ i.i.d. of law $\mu$ each. Explain why the clt implies that $T^m \mu \xrightarrow{w} \gamma$ as $m \to \infty$, for any $\mu \in \mathcal{M}$. Show that $T\gamma = \gamma$ (see Lemma 3.1.1), and explain why $\gamma$ is the unique, globally attracting fixed point of $T$ in $\mathcal{M}$.
Your next exercise is the basis behind the celebrated method of moments for weak
convergence.
Exercise 3.2.29. Suppose that $X$ and $Y$ are $[0, 1]$-valued random variables such that $E(X^n) = E(Y^n)$ for $n = 0, 1, 2, \ldots$.
(a) Show that $E p(X) = E p(Y)$ for any polynomial $p(\cdot)$.
(b) Show that $E h(X) = E h(Y)$ for any continuous function $h : [0, 1] \to \mathbb{R}$, and deduce that $X \stackrel{D}{=} Y$.
Hint: Recall Weierstrass' approximation theorem, that if $h$ is continuous on $[0, 1]$ then there exist polynomials $p_n$ such that $\sup_{x\in[0,1]} |h(x) - p_n(x)| \to 0$ as $n \to \infty$.
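The hint can be made constructive (an added sketch, not from the notes): the Bernstein polynomials $B_n h(x) = E[h(S_n/n)]$ with $S_n \sim \mathrm{Bin}(n, x)$ realize Weierstrass' theorem and, fittingly for this chapter, do so by a law-of-large-numbers argument.

```python
from math import comb

def bernstein(h, n: int, x: float) -> float:
    # B_n h(x) = E[h(S_n / n)] with S_n ~ Binomial(n, x)
    return sum(h(k / n) * comb(n, k) * x**k * (1.0 - x) ** (n - k)
               for k in range(n + 1))

h = lambda x: abs(x - 0.5)  # continuous on [0,1], not itself a polynomial
grid = [i / 50.0 for i in range(51)]

def err(n: int) -> float:
    # sup-norm error of the n-th Bernstein approximation over a grid
    return max(abs(bernstein(h, n, x) - h(x)) for x in grid)
```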
We conclude with the following example about weak convergence of measures in the space of infinite binary sequences.
Exercise 3.2.30. Consider the topology of coordinate-wise convergence on $S = \{0, 1\}^{\mathbb{N}}$ and the Borel probability measures $\nu_n$ on $S$, where $\nu_n$ is the uniform measure over the $\binom{2n}{n}$ binary sequences of precisely $n$ ones among the first $2n$ coordinates, followed by zeros from position $2n + 1$ onwards. Show that $\nu_n \xrightarrow{w} \nu_\infty$, where $\nu_\infty$ denotes the law of i.i.d. Bernoulli random variables of parameter $p = 1/2$.
Hint: Any open subset of $S$ is a countable union of disjoint sets of the form $A_{\theta,k} = \{\omega \in S : \omega_i = \theta_i, i = 1, \ldots, k\}$ for some $\theta = (\theta_1, \ldots, \theta_k) \in \{0, 1\}^k$ and $k \in \mathbb{N}$.
3.2.3. Uniform tightness and vague convergence. So far we have studied the properties of weak convergence. We turn to deal with general ways to establish such convergence, a subject to which we return in Subsection 3.3.2. To this end, the most important concept is that of uniform tightness, which we now define.
Definition 3.2.31. We say that a probability measure $\mu$ on $(S, \mathcal{B}_S)$ is tight if for each $\varepsilon > 0$ there exists a compact set $K_\varepsilon \subseteq S$ such that $\mu(K_\varepsilon^{c}) < \varepsilon$. A collection $\{\mu_\alpha\}$ of probability measures on $(S, \mathcal{B}_S)$ is called uniformly tight if for each $\varepsilon > 0$ there exists one compact set $K_\varepsilon$ such that $\mu_\alpha(K_\varepsilon^{c}) < \varepsilon$ for all $\alpha$.
Since bounded closed intervals are compact and $[-M, M]^{c} \downarrow \emptyset$ as $M \uparrow \infty$, by continuity from above we deduce that each probability measure $\mu$ on $(\mathbb{R}, \mathcal{B})$ is tight. The same argument applies for a finite collection of probability measures on $(\mathbb{R}, \mathcal{B})$ (just choose the maximal value among the finitely many values of $M = M_\mu$ that are needed for the different measures). Further, in the case of $S = \mathbb{R}$ which we study here, one can take without loss of generality the compact $K_\varepsilon$ as a symmetric bounded interval $[-M_\varepsilon, M_\varepsilon]$, or even consider instead $(-M_\varepsilon, M_\varepsilon]$ (whose closure is compact) in order to simplify notations. Thus, expressing uniform tightness in terms of the corresponding distribution functions leads in this setting to the following alternative definition.
Definition 3.2.32. A sequence of distribution functions $F_n$ is called uniformly tight if for every $\varepsilon > 0$ there exists $M = M_\varepsilon$ such that
$$\limsup_{n\to\infty} \, [1 - F_n(M) + F_n(-M)] < \varepsilon.$$
Remark. As most texts use, in the context of Definition 3.2.32, tight (or tight sequence) instead of uniformly tight, we shall adopt the same convention here.
Uniform tightness of distribution functions has some structural resemblance to the U.I. condition (1.3.11). As such, we have the following simple sufficient condition for uniform tightness (which is the analog of Exercise 1.3.54).
Exercise 3.2.33. A sequence of probability measures $\nu_n$ on $(\mathbb{R}, \mathcal{B})$ is uniformly tight if $\sup_n \nu_n(f(|x|))$ is finite for some non-negative Borel function $f$ such that $f(r) \to \infty$ as $r \to \infty$. Alternatively, if $\sup_n E f(|X_n|) < \infty$ then the distribution functions $F_{X_n}$ form a tight sequence.
The importance of uniform tightness is that it guarantees the existence of limit
points for weak convergence.
Theorem 3.2.34 (Prohorov theorem). A collection $\Gamma$ of probability measures on a complete, separable metric space $S$ equipped with its Borel $\sigma$-algebra $\mathcal{B}_S$ is uniformly tight if and only if for any sequence $\mu_m \in \Gamma$ there exists a subsequence $\mu_{m_k}$ that converges weakly to some probability measure $\mu_\infty$ on $(S, \mathcal{B}_S)$ (where $\mu_\infty$ is not necessarily in $\Gamma$ and may depend on the subsequence $m_k$).
Remark. For a proof of Prohorov's theorem, which is beyond the scope of these notes, see [Dud89, Theorem 11.5.4].
Instead of Prohorov's theorem, we prove here a bare-hands substitute for the special case $S = \mathbb{R}$. When doing so, it is convenient to have the following notion of convergence of distribution functions.
Definition 3.2.35. When a sequence $F_n$ of distribution functions converges to a right continuous, non-decreasing function $F_\infty$ at all continuity points of $F_\infty$, we say that $F_n$ converges vaguely to $F_\infty$, denoted $F_n \xrightarrow{v} F_\infty$.
In contrast with weak convergence, vague convergence allows for the limit $F_\infty(x) = \mu_\infty((-\infty, x])$ to correspond to a measure $\mu_\infty$ such that $\mu_\infty(\mathbb{R}) < 1$.
Example 3.2.36. Suppose $F_n = p I_{[n,\infty)} + q I_{[-n,\infty)} + (1 - p - q) F$ for some $p, q \ge 0$ such that $p + q \le 1$ and a distribution function $F$ that is independent of $n$. It is easy to check that $F_n \xrightarrow{v} F_\infty$ as $n \to \infty$, where $F_\infty = q + (1 - p - q)F$ is the distribution function of an $\overline{\mathbb{R}}$-valued random variable, with probability mass $p$ at $+\infty$ and mass $q$ at $-\infty$. If $p + q > 0$ then $F_\infty$ is not a distribution function of any measure on $\mathbb{R}$ and $F_n$ does not converge weakly.
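A concrete instance (an added illustration; the Cauchy choice of $F$ is an assumption for the demo): with $p = 0.3$ and $q = 0.2$, the point-wise limit of $F_n$ rises only from $q = 0.2$ to $1 - p = 0.7$, so the limiting measure has total mass $1 - p - q = 0.5$.

```python
from math import atan, pi

p, q = 0.3, 0.2
F = lambda x: 0.5 + atan(x) / pi  # standard Cauchy CDF, independent of n

def F_n(n: float, x: float) -> float:
    # F_n = p * I_[n,inf) + q * I_[-n,inf) + (1-p-q) * F
    return p * (x >= n) + q * (x >= -n) + (1.0 - p - q) * F(x)

F_lim = lambda x: q + (1.0 - p - q) * F(x)  # the vague limit
errs = [abs(F_n(1e9, x) - F_lim(x)) for x in (-5.0, 0.0, 5.0)]
```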
The preceding example is generic, that is, the space $\overline{\mathbb{R}}$ is compact, so the only loss of mass when dealing with weak convergence on $\mathbb{R}$ has to do with its escape to $\pm\infty$. It is thus not surprising that every sequence of distribution functions has vague limit points, as stated by the following theorem.
Theorem 3.2.37 (Helly's selection theorem). For every sequence $F_n$ of distribution functions there is a subsequence $F_{n_k}$ and a non-decreasing, right continuous function $F_\infty$ such that $F_{n_k}(y) \to F_\infty(y)$ as $k \to \infty$ at all continuity points $y$ of $F_\infty$, that is, $F_{n_k} \xrightarrow{v} F_\infty$.
Deferring the proof of Helly's theorem to the end of this section, we show next that uniform tightness is exactly what prevents probability mass from escaping to $\pm\infty$, thus assuring the existence of limit points for weak convergence.
Lemma 3.2.38. The sequence of distribution functions $F_n$ is uniformly tight if and only if each vague limit point of this sequence is a distribution function. That is, if and only if when $F_{n_k} \xrightarrow{v} F$, necessarily $1 - F(x) + F(-x) \to 0$ as $x \to \infty$.
Proof. Suppose first that $F_n$ is uniformly tight and $F_{n_k} \xrightarrow{v} F$. Fixing $\varepsilon > 0$, there exist $r_1 < -M_\varepsilon$ and $r_2 > M_\varepsilon$ that are both continuity points of $F$. Then, by the definition of vague convergence and the monotonicity of $F_n$,
$$1 - F(r_2) + F(r_1) = \lim_{k\to\infty} \big(1 - F_{n_k}(r_2) + F_{n_k}(r_1)\big) \le \limsup_{n\to\infty} \big(1 - F_n(M_\varepsilon) + F_n(-M_\varepsilon)\big) < \varepsilon.$$
It follows that $\limsup_{x\to\infty} (1 - F(x) + F(-x)) \le \varepsilon$, and since $\varepsilon > 0$ is arbitrarily small, $F$ must be a distribution function of some probability measure on $(\mathbb{R}, \mathcal{B})$.
Conversely, suppose $F_n$ is not uniformly tight, in which case by Definition 3.2.32, for some $\varepsilon > 0$ and $n_k \to \infty$,
$$1 - F_{n_k}(k) + F_{n_k}(-k) \ge \varepsilon \quad \text{for all } k. \tag{3.2.6}$$
By Helly's theorem, there exists a vague limit point $F$ of $F_{n_k}$ as $k \to \infty$. That is, for some $k_l \to \infty$ as $l \to \infty$ we have that $F_{n_{k_l}} \xrightarrow{v} F$. For any two continuity points $r_1 < 0 < r_2$ of $F$, we thus have by the definition of vague convergence, the monotonicity of $F_{n_{k_l}}$, and (3.2.6), that
$$1 - F(r_2) + F(r_1) = \lim_{l\to\infty} \big(1 - F_{n_{k_l}}(r_2) + F_{n_{k_l}}(r_1)\big) \ge \liminf_{l\to\infty} \big(1 - F_{n_{k_l}}(k_l) + F_{n_{k_l}}(-k_l)\big) \ge \varepsilon.$$
Considering now $r = \min(-r_1, r_2) \to \infty$, this shows that $\inf_r (1 - F(r) + F(-r)) \ge \varepsilon$, hence the vague limit point $F$ cannot be a distribution function of a probability measure on $(\mathbb{R}, \mathcal{B})$. □
Remark. Comparing Definitions 3.2.31 and 3.2.32, we see that if a collection of probability measures on $(\mathbb{R}, \mathcal{B})$ is uniformly tight, then for any sequence $\mu_m$ the corresponding sequence $F_m$ of distribution functions is uniformly tight. In view of Lemma 3.2.38 and Helly's theorem, this implies the existence of a subsequence $m_k$ and a distribution function $F_\infty$ such that $F_{m_k} \xrightarrow{w} F_\infty$. By Proposition 3.2.18 we deduce that $\mu_{m_k} \xrightarrow{w} \mu_\infty$, a probability measure on $(\mathbb{R}, \mathcal{B})$, thus proving the only direction of Prohorov's theorem that we ever use.
Proof of Theorem 3.2.37. Fix a sequence of distribution functions $F_n$. The key to the proof is to observe that there exists a sub-sequence $n_k$ and a non-decreasing function $H : \mathbb{Q} \to [0, 1]$ such that $F_{n_k}(q) \to H(q)$ for any $q \in \mathbb{Q}$. This is done by a standard analysis argument called the principle of diagonal selection. That is, let $q_1, q_2, \ldots$ be an enumeration of the set $\mathbb{Q}$ of all rational numbers. There exists then a limit point $H(q_1)$ of the sequence $F_n(q_1) \in [0, 1]$, that is, a sub-sequence $n_k^{(1)}$ such that $F_{n_k^{(1)}}(q_1) \to H(q_1)$. Since $F_{n_k^{(1)}}(q_2) \in [0, 1]$, there exists a further sub-sequence $n_k^{(2)}$ of $n_k^{(1)}$ such that
$$F_{n_k^{(2)}}(q_i) \to H(q_i) \quad \text{for } i = 1, 2.$$
In the same manner we get a collection of nested sub-sequences $n_k^{(i)} \subseteq n_k^{(i-1)}$ such that
$$F_{n_k^{(i)}}(q_j) \to H(q_j), \quad \text{for all } j \le i.$$
The diagonal $n_k^{(k)}$ then has the property that
$$F_{n_k^{(k)}}(q_j) \to H(q_j), \quad \text{for all } j,$$
so $n_k = n_k^{(k)}$ is our desired sub-sequence, and since each $F_n$ is non-decreasing, the limit function $H$ must also be non-decreasing on $\mathbb{Q}$.
Let $F_\infty(x) := \inf\{H(q) : q \in \mathbb{Q}, q > x\}$, noting that $F_\infty \in [0, 1]$ is non-decreasing. Further, $F_\infty$ is right continuous, since
$$\lim_{x_n \downarrow x} F_\infty(x_n) = \inf\{H(q) : q \in \mathbb{Q}, q > x_n \text{ for some } n\} = \inf\{H(q) : q \in \mathbb{Q}, q > x\} = F_\infty(x).$$
Suppose that $x$ is a continuity point of the non-decreasing function $F_\infty$. Then, for any $\varepsilon > 0$ there exists $y < x$ such that $F_\infty(x) - \varepsilon < F_\infty(y)$, and rational numbers $y < r_1 < x < r_2$ such that $H(r_2) < F_\infty(x) + \varepsilon$. It follows that
$$F_\infty(x) - \varepsilon < F_\infty(y) \le H(r_1) \le H(r_2) < F_\infty(x) + \varepsilon. \tag{3.2.7}$$
Recall that $F_{n_k}(x) \in [F_{n_k}(r_1), F_{n_k}(r_2)]$ and $F_{n_k}(r_i) \to H(r_i)$ as $k \to \infty$, for $i = 1, 2$. Thus, by (3.2.7), for all $k$ large enough,
$$F_\infty(x) - \varepsilon < F_{n_k}(r_1) \le F_{n_k}(x) \le F_{n_k}(r_2) < F_\infty(x) + \varepsilon,$$
which since $\varepsilon > 0$ is arbitrary implies $F_{n_k}(x) \to F_\infty(x)$ as $k \to \infty$. □
Exercise 3.2.39. Suppose that the sequence of distribution functions $F_{X_k}$ is uniformly tight and $E X_k^2 < \infty$ are such that $E X_n^2 \to \infty$ as $n \to \infty$. Show that then also $\mathrm{Var}(X_n) \to \infty$ as $n \to \infty$.
Hint: If $|E X_{n_l}|^2 \to \infty$ then $\sup_l \mathrm{Var}(X_{n_l}) < \infty$ yields $X_{n_l}/E X_{n_l} \xrightarrow{L^2} 1$, whereas the uniform tightness of $F_{X_{n_l}}$ implies that $X_{n_l}/E X_{n_l} \xrightarrow{p} 0$.
Using Lemma 3.2.38 and Helly's theorem, you next explore the possibility of establishing weak convergence for non-negative random variables out of the convergence of the corresponding Laplace transforms.
Exercise 3.2.40.
(a) Based on Exercise 3.2.29, show that if $Z \ge 0$ and $W \ge 0$ are such that $E(e^{-sZ}) = E(e^{-sW})$ for each $s > 0$, then $Z \stackrel{D}{=} W$.
(b) Further, show that for any $Z \ge 0$, the function $L_Z(s) = E(e^{-sZ})$ is infinitely differentiable at all $s > 0$, and for any positive integer $k$,
$$E[Z^k] = (-1)^k \lim_{s \downarrow 0} \frac{d^k}{ds^k} L_Z(s),$$
even when (both sides are) infinite.
(c) Suppose that $X_n \ge 0$ are such that $L(s) = \lim_{n\to\infty} E(e^{-sX_n})$ exists for all $s > 0$ and $L(s) \to 1$ for $s \downarrow 0$. Show that then the sequence of distribution functions $F_{X_n}$ is uniformly tight and that there exists a random variable $X_\infty \ge 0$ such that $X_n \xrightarrow{D} X_\infty$ and $L(s) = E(e^{-sX_\infty})$ for all $s > 0$.
Hint: To show that $X_n \xrightarrow{D} X_\infty$, try reading and adapting the proof of Theorem 3.3.17.
(d) Let $X_n = n^{-1} \sum_{k=1}^n k I_k$ for $I_k \in \{0, 1\}$ independent random variables, with $P(I_k = 1) = k^{-1}$. Show that there exists $X_\infty \ge 0$ such that $X_n \xrightarrow{D} X_\infty$ and $E(e^{-sX_\infty}) = \exp(\int_0^1 t^{-1}(e^{-st} - 1)\, dt)$ for all $s > 0$.

3.3. CHARACTERISTIC FUNCTIONS 117
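For part (d), both sides are computable (an added numerical check, not from the notes): independence gives $E(e^{-sX_n}) = \prod_{k=1}^n \big(1 + (e^{-sk/n}-1)/k\big)$, while expanding the integrand term-by-term gives $\int_0^1 t^{-1}(e^{-st}-1)\,dt = \sum_{j\ge 1} (-s)^j/(j\cdot j!)$.

```python
from math import exp, factorial

def laplace_n(n: int, s: float) -> float:
    # E[e^{-s X_n}] for X_n = n^{-1} * sum_k k*I_k with P(I_k = 1) = 1/k
    out = 1.0
    for k in range(1, n + 1):
        out *= 1.0 + (exp(-s * k / n) - 1.0) / k
    return out

def laplace_limit(s: float) -> float:
    # exp( int_0^1 t^{-1}(e^{-st} - 1) dt ), the integral expanded as a power series
    return exp(sum((-s) ** j / (j * factorial(j)) for j in range(1, 40)))

errs = [abs(laplace_n(5000, s) - laplace_limit(s)) for s in (0.5, 1.0, 3.0)]
```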
Remark. The idea of using transforms to establish weak convergence shall be
further developed in Section 3.3, with the Fourier transform instead of the Laplace
transform.
3.3. Characteristic functions

This section is about the fundamental concept of the characteristic function, its relevance for the theory of weak convergence, and in particular for the clt.
In Subsection 3.3.1 we define the characteristic function, providing illustrating examples and certain general properties such as the relation between finite moments of a random variable and the degree of smoothness of its characteristic function. In Subsection 3.3.2 we recover the distribution of a random variable from its characteristic function, and building upon it, relate tightness and weak convergence with the point-wise convergence of the associated characteristic functions. We conclude with Subsection 3.3.3 in which we re-prove the clt of Section 3.1 as an application of the theory of characteristic functions we have thus developed. The same approach will serve us well in other settings which we consider in the sequel (c.f. Sections 3.4 and 3.5).
3.3.1. Definition, examples, moments and derivatives. We start off with the definition of the characteristic function of a random variable. To this end, recall that a $\mathbb{C}$-valued random variable is a function $Z : \Omega \to \mathbb{C}$ such that the real and imaginary parts of $Z$ are measurable, and for $Z = X + iY$ with $X, Y \in \mathbb{R}$ integrable random variables (and $i = \sqrt{-1}$), let $E(Z) = E(X) + iE(Y) \in \mathbb{C}$.

Definition 3.3.1. The characteristic function $\Phi_X$ of a random variable $X$ is the map $\mathbb{R} \to \mathbb{C}$ given by
$$\Phi_X(\theta) = E[e^{i\theta X}] = E[\cos(\theta X)] + iE[\sin(\theta X)] ,$$
where $\theta \in \mathbb{R}$ and obviously both $\cos(\theta X)$ and $\sin(\theta X)$ are integrable R.V.-s.
We also denote by $\Phi_\mu(\theta)$ the characteristic function associated with a probability measure $\mu$ on $(\mathbb{R}, \mathcal{B})$. That is, $\Phi_\mu(\theta) = \mu(e^{i\theta x})$ is the characteristic function of a R.V. $X$ whose law $\mathcal{P}_X$ is $\mu$.
Here are some of the properties of characteristic functions, where the complex conjugate $x - iy$ of $z = x + iy \in \mathbb{C}$ is denoted throughout by $\overline{z}$ and the modulus of $z = x + iy$ is $|z| = \sqrt{x^2 + y^2}$.

Proposition 3.3.2. Let $X$ be a R.V. and $\Phi_X$ its characteristic function, then
(a) $\Phi_X(0) = 1$
(b) $\Phi_X(-\theta) = \overline{\Phi_X(\theta)}$
(c) $|\Phi_X(\theta)| \le 1$
(d) $\theta \mapsto \Phi_X(\theta)$ is a uniformly continuous function on $\mathbb{R}$
(e) $\Phi_{aX+b}(\theta) = e^{ib\theta}\Phi_X(a\theta)$
Proof. For (a), $\Phi_X(0) = E[e^{i0X}] = E[1] = 1$. For (b), note that
$$\Phi_X(-\theta) = E\cos(-\theta X) + iE\sin(-\theta X) = E\cos(\theta X) - iE\sin(\theta X) = \overline{\Phi_X(\theta)} .$$
For (c), note that the function $|z| = \sqrt{x^2 + y^2} : \mathbb{R}^2 \to \mathbb{R}$ is convex, hence by Jensen's inequality (c.f. Exercise 1.3.20),
$$|\Phi_X(\theta)| = |Ee^{i\theta X}| \le E|e^{i\theta X}| = 1$$
(since the modulus $|e^{i\theta x}| = 1$ for any real $x$ and $\theta$).
For (d), since $\Phi_X(\theta + h) - \Phi_X(\theta) = Ee^{i\theta X}(e^{ihX} - 1)$, it follows by Jensen's inequality for the modulus function that
$$|\Phi_X(\theta + h) - \Phi_X(\theta)| \le E\big[|e^{i\theta X}||e^{ihX} - 1|\big] = E|e^{ihX} - 1| = \delta(h)$$
(using the fact that $|zv| = |z||v|$). Since $2 \ge |e^{ihX} - 1| \to 0$ as $h \to 0$, by bounded convergence $\delta(h) \to 0$. As the bound $\delta(h)$ on the modulus of continuity of $\Phi_X(\cdot)$ is independent of $\theta$, we have uniform continuity of $\Phi_X(\cdot)$ on $\mathbb{R}$.
For (e) simply note that $\Phi_{aX+b}(\theta) = Ee^{i\theta(aX+b)} = e^{ib\theta}Ee^{i(a\theta)X} = e^{ib\theta}\Phi_X(a\theta)$. $\square$
We also have the following relation between finite moments of the random variable and the derivatives of its characteristic function.

Lemma 3.3.3. If $E|X|^n < \infty$, then the characteristic function $\Phi_X(\theta)$ of $X$ has continuous derivatives up to the $n$-th order, given by
$$(3.3.1)\qquad \frac{d^k}{d\theta^k}\Phi_X(\theta) = E[(iX)^k e^{i\theta X}] , \quad\text{for } k = 1, \ldots, n .$$
Proof. Note that for any $x, h \in \mathbb{R}$,
$$e^{ihx} - 1 = ix\int_0^h e^{iux}\,du .$$
Consequently, for any $h \ne 0$, $\theta \in \mathbb{R}$ and positive integer $k$ we have the identity
$$(3.3.2)\qquad \Delta_{k,h}(x) = h^{-1}\big[(ix)^{k-1}e^{i(\theta+h)x} - (ix)^{k-1}e^{i\theta x}\big] - (ix)^k e^{i\theta x} = (ix)^k e^{i\theta x}\, h^{-1}\int_0^h (e^{iux} - 1)\,du ,$$
from which we deduce that $|\Delta_{k,h}(x)| \le 2|x|^k$ for all $\theta$ and $h \ne 0$, and further that $|\Delta_{k,h}(x)| \to 0$ as $h \to 0$. Thus, for $k = 1, \ldots, n$ we have by dominated convergence (and Jensen's inequality for the modulus function) that
$$|E\Delta_{k,h}(X)| \le E|\Delta_{k,h}(X)| \to 0 \quad\text{for } h \to 0 .$$
Taking $k = 1$, we have from (3.3.2) that
$$E\Delta_{1,h}(X) = h^{-1}\big(\Phi_X(\theta + h) - \Phi_X(\theta)\big) - E[iXe^{i\theta X}] ,$$
so its convergence to zero as $h \to 0$ amounts to the identity (3.3.1) holding for $k = 1$. In view of this, considering now (3.3.2) for $k = 2$, we have that
$$E\Delta_{2,h}(X) = h^{-1}\big(\Phi'_X(\theta + h) - \Phi'_X(\theta)\big) - E[(iX)^2 e^{i\theta X}] ,$$
and its convergence to zero as $h \to 0$ amounts to (3.3.1) holding for $k = 2$. We continue in this manner for $k = 3, \ldots, n$ to complete the proof of (3.3.1). The continuity of the derivatives follows by dominated convergence from the convergence to zero of $|(ix)^k e^{i(\theta+h)x} - (ix)^k e^{i\theta x}| \le 2|x|^k$ as $h \to 0$ (with $k = 1, \ldots, n$). $\square$
The converse of Lemma 3.3.3 does not hold. That is, there exist random variables with $E|X| = \infty$ for which $\Phi_X(\theta)$ is differentiable at $\theta = 0$ (c.f. Exercise 3.3.23). However, as we see next, the existence of a finite second derivative of $\Phi_X(\theta)$ at $\theta = 0$ implies that $EX^2 < \infty$.

Lemma 3.3.4. If $\liminf_{\theta \to 0} \theta^{-2}\big(2\Phi_X(0) - \Phi_X(\theta) - \Phi_X(-\theta)\big) < \infty$, then $EX^2 < \infty$.
Proof. Note that $\theta^{-2}\big(2\Phi_X(0) - \Phi_X(\theta) - \Phi_X(-\theta)\big) = Eg_\theta(X)$, where
$$g_\theta(x) = \theta^{-2}\big(2 - e^{i\theta x} - e^{-i\theta x}\big) = 2\theta^{-2}\big[1 - \cos(\theta x)\big] \to x^2 \quad\text{for } \theta \to 0 .$$
Since $g_\theta(x) \ge 0$ for all $\theta$ and $x$, it follows by Fatou's lemma that
$$\liminf_{\theta\to 0} Eg_\theta(X) \ge E\big[\liminf_{\theta\to 0} g_\theta(X)\big] = EX^2 ,$$
thus completing the proof of the lemma. $\square$
We continue with a few explicit computations of the characteristic function.

Example 3.3.5. Consider a Bernoulli random variable $B$ of parameter $p$, that is, $P(B = 1) = p$ and $P(B = 0) = 1 - p$. Its characteristic function is by definition
$$\Phi_B(\theta) = E[e^{i\theta B}] = pe^{i\theta} + (1 - p)e^{i\theta 0} = pe^{i\theta} + 1 - p .$$
The same type of explicit formula applies to any discrete valued R.V. For example, if $N$ has the Poisson distribution of parameter $\lambda$ then
$$(3.3.3)\qquad \Phi_N(\theta) = E[e^{i\theta N}] = \sum_{k=0}^{\infty} \frac{(\lambda e^{i\theta})^k}{k!}\, e^{-\lambda} = \exp\big(\lambda(e^{i\theta} - 1)\big) .$$
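The closed form in (3.3.3) can be checked numerically by truncating the defining series; the following sketch is just a sanity check, not part of the development, and the values of $\lambda$ and $\theta$ are arbitrary illustrative choices.

```python
import cmath
from math import exp, factorial

# Truncated series E[e^{i*theta*N}] = sum_k P(N=k) e^{i*theta*k} for N ~ Poisson(lam),
# compared with the closed form exp(lam*(e^{i*theta}-1)) of (3.3.3).
lam, theta = 2.5, 1.3                      # arbitrary illustrative values
series = exp(-lam) * sum(
    (lam * cmath.exp(1j * theta)) ** k / factorial(k) for k in range(60)
)
closed = cmath.exp(lam * (cmath.exp(1j * theta) - 1))
# the terms decay factorially, so truncation at k = 59 is far below rounding error
```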
The characteristic function has an explicit form also when the R.V. $X$ has a probability density function $f_X$ as in Definition 1.2.39. Indeed, then by Corollary 1.3.62 we have that
$$(3.3.4)\qquad \Phi_X(\theta) = \int_{\mathbb{R}} e^{i\theta x} f_X(x)\,dx ,$$
which is merely the Fourier transform of the density $f_X$ (and is well defined since $\cos(\theta x)f_X(x)$ and $\sin(\theta x)f_X(x)$ are both integrable with respect to Lebesgue's measure).
Example 3.3.6. If $G$ has the $\mathcal{N}(\mu, v)$ distribution, namely, the probability density function $f_G(y)$ is given by (3.1.1), then its characteristic function is
$$\Phi_G(\theta) = e^{i\theta\mu - v\theta^2/2} .$$
Indeed, recall Example 1.3.68 that $G = \sigma X + \mu$ for $\sigma = \sqrt{v}$ and $X$ of a standard normal distribution $\mathcal{N}(0, 1)$. Hence, considering part (e) of Proposition 3.3.2 for $a = \sqrt{v}$ and $b = \mu$, it suffices to show that $\Phi_X(\theta) = e^{-\theta^2/2}$. To this end, as $X$ is integrable, we have from Lemma 3.3.3 that
$$\Phi'_X(\theta) = E(iXe^{i\theta X}) = -\int_{\mathbb{R}} x\sin(\theta x)f_X(x)\,dx$$
(since $x\cos(\theta x)f_X(x)$ is an integrable odd function, whose integral is thus zero). The standard normal density is such that $f'_X(x) = -xf_X(x)$, hence integrating by parts we find that
$$\Phi'_X(\theta) = \int_{\mathbb{R}} \sin(\theta x)f'_X(x)\,dx = -\theta\int_{\mathbb{R}} \cos(\theta x)f_X(x)\,dx = -\theta\,\Phi_X(\theta)$$
(since $\sin(\theta x)f_X(x)$ is an integrable odd function). We know that $\Phi_X(0) = 1$ and since $\varphi(\theta) = e^{-\theta^2/2}$ is the unique solution of the ordinary differential equation $\varphi'(\theta) = -\theta\varphi(\theta)$ with $\varphi(0) = 1$, it follows that $\Phi_X(\theta) = \varphi(\theta)$.
Example 3.3.7. In another example, applying the formula (3.3.4) we see that the random variable $U = U(a, b)$ whose probability density function is $f_U(x) = (b - a)^{-1}1_{a<x<b}$, has the characteristic function
$$\Phi_U(\theta) = \frac{e^{i\theta b} - e^{i\theta a}}{i\theta(b - a)}$$
(recall that $\int_a^b e^{zx}\,dx = (e^{zb} - e^{za})/z$ for any $z \in \mathbb{C}$). For $a = -b$ the characteristic function simplifies to $\sin(\theta b)/(\theta b)$. Or, in case $b = 1$ and $a = 0$ we have $\Phi_U(\theta) = (e^{i\theta} - 1)/(i\theta)$ for the random variable $U$ of Example 1.1.26.
For $a = 0$ and $z = -\lambda + i\theta$, $\lambda > 0$, the same integration identity applies also when $b \uparrow \infty$ (since the real part of $z$ is negative). Consequently, by (3.3.4), the exponential distribution of parameter $\lambda > 0$, whose density is $f_T(t) = \lambda e^{-\lambda t}1_{t>0}$ (see Example 1.3.68), has the characteristic function $\Phi_T(\theta) = \lambda/(\lambda - i\theta)$.
Finally, for the density $f_S(s) = 0.5e^{-|s|}$ it is not hard to check that $\Phi_S(\theta) = 0.5/(1 - i\theta) + 0.5/(1 + i\theta) = 1/(1 + \theta^2)$ (just break the integration over $s \in \mathbb{R}$ in (3.3.4) according to the sign of $s$).
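As a hedged numerical illustration (not part of the text), one can approximate the Fourier integral (3.3.4) for the uniform density on $(a, b)$ by a trapezoidal sum and compare it with the closed form above; the values of $a$, $b$ and $\theta$ below are arbitrary.

```python
import numpy as np

a, b, theta = -1.0, 2.0, 1.7                 # arbitrary illustrative values
x = np.linspace(a, b, 200001)
g = np.exp(1j * theta * x) / (b - a)          # e^{i*theta*x} f_U(x) on (a, b)
numeric = ((g[:-1] + g[1:]) / 2 * np.diff(x)).sum()   # trapezoidal rule
closed = (np.exp(1j * theta * b) - np.exp(1j * theta * a)) / (1j * theta * (b - a))
```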
We next express the characteristic function of the sum of independent random variables in terms of the characteristic functions of the summands. This relation makes the characteristic function a useful tool for proving weak convergence statements involving sums of independent variables.

Lemma 3.3.8. If $X$ and $Y$ are two independent random variables, then
$$\Phi_{X+Y}(\theta) = \Phi_X(\theta)\Phi_Y(\theta) .$$

Proof. By the definition of the characteristic function,
$$\Phi_{X+Y}(\theta) = Ee^{i\theta(X+Y)} = E[e^{i\theta X}e^{i\theta Y}] = E[e^{i\theta X}]E[e^{i\theta Y}] ,$$
where the right-most equality is obtained by the independence of $X$ and $Y$ (i.e. applying (1.4.12) for the integrable $f(x) = g(x) = e^{i\theta x}$). Observing that the right-most expression is $\Phi_X(\theta)\Phi_Y(\theta)$ completes the proof. $\square$
Here are three simple applications of this lemma.
Example 3.3.9. If $X$ and $Y$ are independent and uniform on $(-1/2, 1/2)$ then by Corollary 1.4.33 the random variable $\Delta = X + Y$ has the triangular density, $f_\Delta(x) = (1 - |x|)1_{|x|\le 1}$. Thus, by Example 3.3.7, Lemma 3.3.8, and the trigonometric identity $\cos\theta = 1 - 2\sin^2(\theta/2)$, we have that its characteristic function is
$$\Phi_\Delta(\theta) = [\Phi_X(\theta)]^2 = \Big(\frac{2\sin(\theta/2)}{\theta}\Big)^2 = \frac{2(1 - \cos\theta)}{\theta^2} .$$
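This closed form can be cross-checked numerically (an illustrative sketch with an arbitrary $\theta$): integrating $e^{i\theta x}$ against the triangular density directly, as in (3.3.4), should reproduce $2(1 - \cos\theta)/\theta^2$.

```python
import numpy as np

theta = 2.3                                    # arbitrary nonzero value
x = np.linspace(-1, 1, 200001)
g = np.exp(1j * theta * x) * (1 - np.abs(x))   # e^{i*theta*x} f_Delta(x)
numeric = ((g[:-1] + g[1:]) / 2 * np.diff(x)).sum()   # trapezoidal rule
closed = 2 * (1 - np.cos(theta)) / theta**2
```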
Exercise 3.3.10. Let $X, \widetilde{X}$ be i.i.d. random variables.
(a) Show that the characteristic function of $Z = X - \widetilde{X}$ is a non-negative, real-valued function.
(b) Prove that there do not exist $a < b$ and i.i.d. random variables $X, \widetilde{X}$ such that $X - \widetilde{X}$ is the uniform random variable on $(a, b)$.
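Part (a) can be checked in a small concrete case (an illustrative sketch; the parameter values are arbitrary). For i.i.d. Bernoulli($p$) variables, $Z = X - \widetilde{X}$ takes the values $-1, 0, 1$, and its characteristic function computed directly from these probabilities agrees with $|\Phi_X(\theta)|^2$, which is real and non-negative.

```python
import numpy as np

p, theta = 0.3, 0.9                        # arbitrary illustrative values
phi_X = p * np.exp(1j * theta) + 1 - p     # Bernoulli characteristic function
# direct computation of Phi_Z: P(Z=1) = P(Z=-1) = p(1-p), P(Z=0) = p^2 + (1-p)^2
phi_Z = (p**2 + (1 - p) ** 2) + 2 * p * (1 - p) * np.cos(theta)
```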
In the next exercise you construct a random variable $X$ whose law has no atoms while its characteristic function does not converge to zero for $\theta \to \infty$.

Exercise 3.3.11. Let $X = 2\sum_{k=1}^{\infty} 3^{-k}B_k$ for $B_k$ i.i.d. Bernoulli random variables such that $P(B_k = 1) = P(B_k = 0) = 1/2$.
(a) Show that $\Phi_X(3^k\pi) = \Phi_X(\pi) \ne 0$ for $k = 1, 2, \ldots$.
(b) Recall that $X$ has the uniform distribution on the Cantor set $\mathcal{C}$, as specified in Example 1.2.42. Verify that $x \mapsto F_X(x)$ is everywhere continuous, hence the law $\mathcal{P}_X$ has no atoms (i.e. points of positive probability).
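A numerical sketch of part (a) (assuming the reconstructed scaling points $3^k\pi$): by independence, the characteristic function of $X = 2\sum_k 3^{-k}B_k$ factors as the infinite product $\prod_{k\ge 1}(1 + e^{i2\cdot 3^{-k}\theta})/2$, and truncating this product exhibits the claimed invariance at $\theta = \pi$.

```python
import numpy as np

def phi(theta, terms=40):
    # truncated product form of E[exp(i*theta*X)] for X = 2*sum_k 3^{-k} B_k
    k = np.arange(1, terms + 1)
    return np.prod((1 + np.exp(2j * theta * 3.0 ** (-k))) / 2)

# invariance Phi_X(3^k * pi) = Phi_X(pi), illustrated for k = 2 (theta = 9*pi)
v1, v2 = phi(np.pi), phi(9 * np.pi)
```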
3.3.2. Inversion, continuity and convergence. Is it possible to recover the distribution function from the characteristic function? The answer is essentially yes.
Theorem 3.3.12 (Lévy's inversion theorem). Suppose $\Phi_X$ is the characteristic function of a random variable $X$ whose distribution function is $F_X$. For any real numbers $a < b$ and $\theta$, let
$$(3.3.5)\qquad \Psi_{a,b}(\theta) = \frac{1}{2\pi}\int_a^b e^{-i\theta u}\,du = \frac{e^{-i\theta a} - e^{-i\theta b}}{i2\pi\theta} .$$
Then,
$$(3.3.6)\qquad \lim_{T\to\infty}\int_{-T}^{T} \Psi_{a,b}(\theta)\Phi_X(\theta)\,d\theta = \frac{1}{2}\big[F_X(b) + F_X(b^-)\big] - \frac{1}{2}\big[F_X(a) + F_X(a^-)\big] .$$
Furthermore, if $\int_{\mathbb{R}} |\Phi_X(\theta)|\,d\theta < \infty$, then $X$ has the bounded continuous probability density function
$$(3.3.7)\qquad f_X(x) = \frac{1}{2\pi}\int_{\mathbb{R}} e^{-i\theta x}\Phi_X(\theta)\,d\theta .$$

Remark. The identity (3.3.7) is a special case of the Fourier transform inversion formula, and as such is in duality with $\Phi_X(\theta) = \int_{\mathbb{R}} e^{i\theta x}f_X(x)\,dx$ of (3.3.4). The formula (3.3.6) should be considered its integrated version, which thereby holds even in the absence of a density for $X$.
Here is a simple application of the duality between (3.3.7) and (3.3.4).

Example 3.3.13. The Cauchy density is $f_X(x) = 1/[\pi(1 + x^2)]$. Recall Example 3.3.7 that the density $f_S(s) = 0.5e^{-|s|}$ has the positive, integrable characteristic function $1/(1 + \theta^2)$. Thus, by (3.3.7),
$$0.5e^{-|s|} = \frac{1}{2\pi}\int_{\mathbb{R}} \frac{1}{1 + t^2}\, e^{-its}\,dt .$$
Multiplying both sides by two, then changing $t$ to $x$ and $s$ to $-\theta$, we get (3.3.4) for the Cauchy density, resulting with its characteristic function $\Phi_X(\theta) = e^{-|\theta|}$.
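The duality can also be seen numerically (an illustrative sketch, not from the text): plugging $\Phi_X(\theta) = e^{-|\theta|}$ into the inversion formula (3.3.7) and evaluating the integral by a trapezoidal sum recovers the Cauchy density $1/[\pi(1 + x^2)]$ at an arbitrarily chosen point.

```python
import numpy as np

x = 0.7                                    # arbitrary evaluation point
theta = np.linspace(-40, 40, 800001)       # e^{-|theta|} is negligible beyond |theta| = 40
g = np.exp(-1j * theta * x) * np.exp(-np.abs(theta))
integral = ((g[:-1] + g[1:]) / 2 * np.diff(theta)).sum()   # trapezoidal rule
density = (integral / (2 * np.pi)).real
```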
When using characteristic functions for proving limit theorems we do not need the explicit formulas of Lévy's inversion theorem, but rather only the fact that the characteristic function determines the law, that is:

Corollary 3.3.14. If the characteristic functions of two random variables $X$ and $Y$ are the same, that is $\Phi_X(\theta) = \Phi_Y(\theta)$ for all $\theta$, then $X \stackrel{\mathcal{D}}{=} Y$.
Remark. While the real-valued moment generating function $M_X(s) = E[e^{sX}]$ is perhaps a simpler object than the characteristic function, it has a somewhat limited scope of applicability. For example, the law of a random variable $X$ is uniquely determined by $M_X(\cdot)$ provided $M_X(s)$ is finite for all $s \in [-\delta, \delta]$, some $\delta > 0$ (c.f. [Bil95, Theorem 30.1]). More generally, assuming all moments of $X$ are finite, the Hamburger moment problem is about uniquely determining the law of $X$ from a given sequence of moments $EX^k$. You saw in Exercise 3.2.29 that this is always possible when $X$ has bounded support, but unfortunately, this is not always the case when $X$ has unbounded support. For more on this issue, see [Dur10, Subsection 3.3.5].
Proof of Corollary 3.3.14. Since $\Phi_X = \Phi_Y$, comparing the right side of (3.3.6) for $X$ and $Y$ shows that
$$[F_X(b) + F_X(b^-)] - [F_X(a) + F_X(a^-)] = [F_Y(b) + F_Y(b^-)] - [F_Y(a) + F_Y(a^-)] .$$
As $F_X$ is a distribution function, both $F_X(a) \to 0$ and $F_X(a^-) \to 0$ when $a \to -\infty$. For this reason also $F_Y(a) \to 0$ and $F_Y(a^-) \to 0$. Consequently,
$$F_X(b) + F_X(b^-) = F_Y(b) + F_Y(b^-) \quad\text{for all } b \in \mathbb{R} .$$
In particular, this implies that $F_X = F_Y$ on the collection $\mathcal{C}$ of continuity points of both $F_X$ and $F_Y$. Recall that $F_X$ and $F_Y$ have each at most a countable set of points of discontinuity (see Exercise 1.2.38), so the complement of $\mathcal{C}$ is countable, and consequently $\mathcal{C}$ is a dense subset of $\mathbb{R}$. Thus, as distribution functions are non-decreasing and right-continuous we know that $F_X(b) = \inf\{F_X(x) : x > b, x \in \mathcal{C}\}$ and $F_Y(b) = \inf\{F_Y(x) : x > b, x \in \mathcal{C}\}$. Since $F_X(x) = F_Y(x)$ for all $x \in \mathcal{C}$, this identity extends to all $b \in \mathbb{R}$, resulting with $X \stackrel{\mathcal{D}}{=} Y$. $\square$
Remark. In Lemma 3.1.1, it was shown directly that the sum of independent random variables of normal distributions $\mathcal{N}(\mu_k, v_k)$ has the normal distribution $\mathcal{N}(\mu, v)$ where $\mu = \sum_k \mu_k$ and $v = \sum_k v_k$. The proof easily reduces to dealing with two independent random variables, $X$ of distribution $\mathcal{N}(\mu_1, v_1)$ and $Y$ of distribution $\mathcal{N}(\mu_2, v_2)$, and showing that $X + Y$ has the normal distribution $\mathcal{N}(\mu_1 + \mu_2, v_1 + v_2)$. Here is an easy proof of this result via characteristic functions. First by the independence of $X$ and $Y$ (see Lemma 3.3.8), and their normality (see Example 3.3.6),
$$\Phi_{X+Y}(\theta) = \Phi_X(\theta)\Phi_Y(\theta) = \exp(i\theta\mu_1 - v_1\theta^2/2)\exp(i\theta\mu_2 - v_2\theta^2/2) = \exp\big(i\theta(\mu_1 + \mu_2) - \tfrac{1}{2}(v_1 + v_2)\theta^2\big) .$$
We recognize this expression as the characteristic function corresponding to the $\mathcal{N}(\mu_1 + \mu_2, v_1 + v_2)$ distribution, which by Corollary 3.3.14 must indeed be the distribution of $X + Y$.
Proof of Lévy's inversion theorem. Consider the product measure $\nu = \mathcal{P}_X \times \lambda_T$ of the law $\mathcal{P}_X$ of $X$, which is a probability measure on $\mathbb{R}$, and Lebesgue's measure $\lambda_T$ of $[-T, T]$, noting that $\nu$ is a finite measure on $\mathbb{R} \times [-T, T]$ of total mass $2T$.
Fixing $a < b \in \mathbb{R}$ let $h_{a,b}(x, \theta) = \Psi_{a,b}(\theta)e^{i\theta x}$, where by (3.3.5) and Jensen's inequality for the modulus function (and the uniform measure on $[a, b]$),
$$|h_{a,b}(x, \theta)| = |\Psi_{a,b}(\theta)| \le \frac{1}{2\pi}\int_a^b |e^{-i\theta u}|\,du = \frac{b - a}{2\pi} .$$
Consequently, $\int |h_{a,b}|\,d\nu < \infty$, and applying Fubini's theorem, we conclude that
$$J_T(a, b) := \int_{-T}^{T} \Psi_{a,b}(\theta)\Phi_X(\theta)\,d\theta = \int_{-T}^{T} \Psi_{a,b}(\theta)\Big(\int_{\mathbb{R}} e^{i\theta x}\,d\mathcal{P}_X(x)\Big)d\theta = \int_{-T}^{T}\Big(\int_{\mathbb{R}} h_{a,b}(x, \theta)\,d\mathcal{P}_X(x)\Big)d\theta = \int_{\mathbb{R}}\Big(\int_{-T}^{T} h_{a,b}(x, \theta)\,d\theta\Big)d\mathcal{P}_X(x) .$$
Since $h_{a,b}(x, \theta)$ is the difference between the function $e^{i\theta u}/(i2\pi\theta)$ at $u = x - a$ and the same function at $u = x - b$, it follows that
$$\int_{-T}^{T} h_{a,b}(x, \theta)\,d\theta = R(x - a, T) - R(x - b, T) .$$
Further, as the cosine function is even and the sine function is odd,
$$R(u, T) = \int_{-T}^{T} \frac{e^{i\theta u}}{i2\pi\theta}\,d\theta = \frac{1}{\pi}\int_0^T \frac{\sin(\theta u)}{\theta}\,d\theta = \frac{\mathrm{sgn}(u)}{\pi}\, S(|u|T) ,$$
with $S(r) = \int_0^r x^{-1}\sin x\,dx$ for $r > 0$.
Even though the Lebesgue integral $\int_0^\infty x^{-1}\sin x\,dx$ does not exist, because both the integral of the positive part and the integral of the negative part are infinite, we still have that $S(r)$ is uniformly bounded on $(0, \infty)$ and
$$\lim_{r\to\infty} S(r) = \frac{\pi}{2}$$
(c.f. Exercise 3.3.15). Consequently,
$$\lim_{T\to\infty}\, [R(x - a, T) - R(x - b, T)] = g_{a,b}(x) = \begin{cases} 0 & \text{if } x < a \text{ or } x > b \\ \frac{1}{2} & \text{if } x = a \text{ or } x = b \\ 1 & \text{if } a < x < b \end{cases} .$$
Since $S(\cdot)$ is uniformly bounded, so is $|R(x - a, T) - R(x - b, T)|$ and by bounded convergence,
$$\lim_{T\to\infty} J_T(a, b) = \lim_{T\to\infty}\int_{\mathbb{R}} [R(x - a, T) - R(x - b, T)]\,d\mathcal{P}_X(x) = \int_{\mathbb{R}} g_{a,b}(x)\,d\mathcal{P}_X(x) = \frac{1}{2}\mathcal{P}_X(\{a\}) + \mathcal{P}_X((a, b)) + \frac{1}{2}\mathcal{P}_X(\{b\}) .$$
With $\mathcal{P}_X(\{a\}) = F_X(a) - F_X(a^-)$, $\mathcal{P}_X((a, b)) = F_X(b^-) - F_X(a)$ and $\mathcal{P}_X(\{b\}) = F_X(b) - F_X(b^-)$, we arrive at the assertion (3.3.6).
Suppose now that $\int_{\mathbb{R}} |\Phi_X(\theta)|\,d\theta = C < \infty$. This implies that both the real and the imaginary parts of $e^{-i\theta x}\Phi_X(\theta)$ are integrable with respect to Lebesgue's measure on $\mathbb{R}$, hence $f_X(x)$ of (3.3.7) is well defined. Further, $|f_X(x)| \le C$ is uniformly bounded and by dominated convergence with respect to Lebesgue's measure on $\mathbb{R}$,
$$\lim_{h\to 0} |f_X(x + h) - f_X(x)| \le \lim_{h\to 0} \frac{1}{2\pi}\int_{\mathbb{R}} |e^{-i\theta x}||\Phi_X(\theta)||e^{-i\theta h} - 1|\,d\theta = 0 ,$$
implying that $f_X(\cdot)$ is also continuous. Turning to prove that $f_X(\cdot)$ is the density of $X$, note that
$$|\Psi_{a,b}(\theta)\Phi_X(\theta)| \le \frac{b - a}{2\pi}\,|\Phi_X(\theta)| ,$$
so by dominated convergence we have that
$$(3.3.8)\qquad \lim_{T\to\infty} J_T(a, b) = J_\infty(a, b) = \int_{\mathbb{R}} \Psi_{a,b}(\theta)\Phi_X(\theta)\,d\theta .$$
Further, in view of (3.3.5), upon applying Fubini's theorem for the integrable function $e^{-i\theta u}I_{[a,b]}(u)\Phi_X(\theta)$ with respect to Lebesgue's measure on $\mathbb{R}^2$, we see that
$$J_\infty(a, b) = \frac{1}{2\pi}\int_{\mathbb{R}}\Big(\int_a^b e^{-i\theta u}\,du\Big)\Phi_X(\theta)\,d\theta = \int_a^b f_X(u)\,du ,$$
for the bounded continuous function $f_X(\cdot)$ of (3.3.7). In particular, $J_\infty(a, b)$ must be continuous in both $a$ and $b$. Comparing (3.3.8) with (3.3.6) we see that
$$J_\infty(a, b) = \frac{1}{2}[F_X(b) + F_X(b^-)] - \frac{1}{2}[F_X(a) + F_X(a^-)] ,$$
so the continuity of $J_\infty(\cdot, \cdot)$ implies that $F_X(\cdot)$ must also be continuous everywhere, with
$$F_X(b) - F_X(a) = J_\infty(a, b) = \int_a^b f_X(u)\,du ,$$
for all $a < b$. This shows that necessarily $f_X(x)$ is a non-negative real-valued function, which is the density of $X$. $\square$
Exercise 3.3.15. Integrating $\int z^{-1}e^{iz}\,dz$ around the contour formed by the upper semi-circles of radii $\epsilon$ and $r$ and the intervals $[-r, -\epsilon]$ and $[\epsilon, r]$, deduce that $S(r) = \int_0^r x^{-1}\sin x\,dx$ is uniformly bounded on $(0, \infty)$ with $S(r) \to \pi/2$ as $r \to \infty$.
Our strategy for handling the clt and similar limit results is to establish the convergence of characteristic functions and deduce from it the corresponding convergence in distribution. One ingredient for this is of course the fact that the characteristic function uniquely determines the corresponding law. Our next result provides an important second ingredient, that is, an explicit sufficient condition for uniform tightness in terms of the limit of the characteristic functions.
Lemma 3.3.16. Suppose $\mu_n$ are probability measures on $(\mathbb{R}, \mathcal{B})$ and $\Phi_n(\theta) = \mu_n(e^{i\theta x})$ the corresponding characteristic functions. If $\Phi_n(\theta) \to \Phi(\theta)$ as $n \to \infty$, for each $\theta \in \mathbb{R}$, and further $\Phi(\cdot)$ is continuous at $\theta = 0$, then the sequence $\{\mu_n\}$ is uniformly tight.
Remark. To see why continuity of the limit $\Phi(\cdot)$ at $0$ is required, consider the sequence $\mu_n$ of normal distributions $\mathcal{N}(0, n^2)$. From Example 3.3.6 we see that the point-wise limit $\Phi(\theta) = I_{\theta=0}$ of $\Phi_n(\theta) = \exp(-n^2\theta^2/2)$ exists but is discontinuous at $\theta = 0$. However, for any $M < \infty$ we know that $\mu_n([-M, M]) = \mu_1([-M/n, M/n]) \to 0$ as $n \to \infty$, so clearly the sequence $\mu_n$ is not uniformly tight. Indeed, the corresponding distribution functions $F_n(x) = F_1(x/n)$ converge vaguely to $F_\infty(x) = F_1(0) = 1/2$, which is not a distribution function (reflecting the escape of all the probability mass to $\pm\infty$).
Proof. We start the proof by deriving the key inequality
$$(3.3.9)\qquad \frac{1}{r}\int_{-r}^{r} \big(1 - \Phi_\mu(\theta)\big)\,d\theta \ge \mu\big([-2/r, 2/r]^c\big) ,$$
which holds for every probability measure $\mu$ on $(\mathbb{R}, \mathcal{B})$ and any $r > 0$, relating the smoothness of the characteristic function at $0$ with the tail decay of the corresponding probability measure at $\pm\infty$. To this end, fixing $r > 0$, note that
$$J(x) := \int_{-r}^{r} (1 - e^{i\theta x})\,d\theta = 2r - \int_{-r}^{r} (\cos\theta x + i\sin\theta x)\,d\theta = 2r - \frac{2\sin rx}{x} .$$
So $J(x)$ is non-negative (since $|\sin u| \le |u|$ for all $u$), and bounded below by $2r - 2/|x|$ (since $|\sin u| \le 1$). Consequently,
$$(3.3.10)\qquad J(x) \ge \max\Big(2r - \frac{2}{|x|},\, 0\Big) \ge rI_{\{|x|>2/r\}} .$$
Now, applying Fubini's theorem for the function $1 - e^{i\theta x}$, whose modulus is bounded by $2$, and the product of the probability measure $\mu$ and Lebesgue's measure on $[-r, r]$, which is a finite measure of total mass $2r$, we get the identity
$$\int_{-r}^{r} \big(1 - \Phi_\mu(\theta)\big)\,d\theta = \int_{-r}^{r}\Big(\int_{\mathbb{R}} (1 - e^{i\theta x})\,d\mu(x)\Big)d\theta = \int_{\mathbb{R}} J(x)\,d\mu(x) .$$
Thus, the lower bound (3.3.10) and monotonicity of the integral imply that
$$\frac{1}{r}\int_{-r}^{r} \big(1 - \Phi_\mu(\theta)\big)\,d\theta = \frac{1}{r}\int_{\mathbb{R}} J(x)\,d\mu(x) \ge \int_{\mathbb{R}} I_{\{|x|>2/r\}}\,d\mu(x) = \mu\big([-2/r, 2/r]^c\big) ,$$
hence establishing (3.3.9).
We turn to the application of this inequality for proving the uniform tightness. Since $\Phi_n(0) = 1$ for all $n$ and $\Phi_n(0) \to \Phi(0)$, it follows that $\Phi(0) = 1$. Further, $\Phi(\cdot)$ is continuous at $\theta = 0$, so for any $\varepsilon > 0$ there exists $r = r(\varepsilon) > 0$ such that
$$\frac{\varepsilon}{4} \ge |1 - \Phi(\theta)| \quad\text{for all } \theta \in [-r, r] ,$$
and hence also
$$\frac{\varepsilon}{2} \ge \frac{1}{r}\int_{-r}^{r} |1 - \Phi(\theta)|\,d\theta .$$
The point-wise convergence of $\Phi_n$ to $\Phi$ implies that $|1 - \Phi_n(\theta)| \to |1 - \Phi(\theta)|$. By bounded convergence with respect to the uniform measure of $\theta$ on $[-r, r]$, it follows that for some finite $n_0 = n_0(\varepsilon)$ and all $n \ge n_0$,
$$\varepsilon \ge \frac{1}{r}\int_{-r}^{r} |1 - \Phi_n(\theta)|\,d\theta ,$$
which in view of (3.3.9) results with
$$\varepsilon \ge \frac{1}{r}\int_{-r}^{r} \big[1 - \Phi_n(\theta)\big]\,d\theta \ge \mu_n\big([-2/r, 2/r]^c\big) .$$
Since $\varepsilon > 0$ is arbitrary and $M = 2/r$ is independent of $n$, by Definition 3.2.32 this amounts to the uniform tightness of the sequence $\{\mu_n\}$. $\square$
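A hedged numerical check of the key inequality (3.3.9), for the standard normal law (where $\Phi(\theta) = e^{-\theta^2/2}$ is real); the choice $r = 1$ is arbitrary.

```python
import numpy as np
from math import erfc, sqrt

# (1/r) * int_{-r}^{r} (1 - phi(theta)) dtheta  >=  P(|X| > 2/r)  for X ~ N(0,1)
r = 1.0                                        # arbitrary illustrative value
theta = np.linspace(-r, r, 200001)
integrand = 1 - np.exp(-theta**2 / 2)          # phi is real for symmetric laws
lhs = ((integrand[:-1] + integrand[1:]) / 2 * np.diff(theta)).sum() / r
rhs = erfc((2 / r) / sqrt(2))                  # P(|N(0,1)| > 2/r)
```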
Building upon Corollary 3.3.14 and Lemma 3.3.16 we can finally relate the point-wise convergence of characteristic functions to the weak convergence of the corresponding measures.

Theorem 3.3.17 (Lévy's continuity theorem). Let $\mu_n$, $1 \le n \le \infty$, be probability measures on $(\mathbb{R}, \mathcal{B})$.
(a) If $\mu_n \xrightarrow{w} \mu_\infty$, then $\Phi_n(\theta) \to \Phi_\infty(\theta)$ for each $\theta \in \mathbb{R}$.
(b) Conversely, if $\Phi_n(\theta)$ converges point-wise to a limit $\Phi(\theta)$ that is continuous at $\theta = 0$, then $\{\mu_n\}$ is a uniformly tight sequence and $\mu_n \xrightarrow{w} \mu$ such that $\Phi_\mu = \Phi$.
Proof. For part (a), since both $x \mapsto \cos(\theta x)$ and $x \mapsto \sin(\theta x)$ are bounded continuous functions, the assumed weak convergence of $\mu_n$ to $\mu_\infty$ implies that $\Phi_n(\theta) = \mu_n(e^{i\theta x}) \to \mu_\infty(e^{i\theta x}) = \Phi_\infty(\theta)$ (c.f. Definition 3.2.17).
Turning to deal with part (b), recall that by Lemma 3.3.16 we know that the collection $\Gamma = \{\mu_n\}$ is uniformly tight. Hence, by Prohorov's theorem (see the remark preceding the proof of Lemma 3.2.38), for every subsequence $\mu_{n(m)}$ there is a further sub-subsequence $\mu_{n(m_k)}$ that converges weakly to some probability measure $\widehat\mu$. Though in general $\widehat\mu$ might depend on the specific choice of $n(m)$, we deduce from part (a) of the theorem that necessarily $\Phi_{\widehat\mu} = \Phi$. Since the characteristic function uniquely determines the law (see Corollary 3.3.14), here the same limit $\widehat\mu = \mu$ applies for all choices of $n(m)$. In particular, fixing $h \in C_b(\mathbb{R})$, the sequence $y_n = \mu_n(h)$ is such that every subsequence $y_{n(m)}$ has a further sub-subsequence $y_{n(m_k)}$ that converges to $y = \mu(h)$. Consequently, $y_n = \mu_n(h) \to y = \mu(h)$ (see Lemma 2.2.11), and since this applies for all $h \in C_b(\mathbb{R})$, we conclude that $\mu_n \xrightarrow{w} \mu$ such that $\Phi_\mu = \Phi$. $\square$
Here is a direct consequence of Lévy's continuity theorem.

Exercise 3.3.18. Show that if $X_n \xrightarrow{\mathcal{D}} X_\infty$, $Y_n \xrightarrow{\mathcal{D}} Y_\infty$ and $Y_n$ is independent of $X_n$ for $1 \le n \le \infty$, then $X_n + Y_n \xrightarrow{\mathcal{D}} X_\infty + Y_\infty$.
Combining Exercise 3.3.18 with the Portmanteau theorem and the clt, you can now show that a finite second moment is necessary for the convergence in distribution of $n^{-1/2}\sum_{k=1}^{n} X_k$ for i.i.d. $X_k$.
Exercise 3.3.19. Suppose $X_k, \widetilde{X}_k$ are i.i.d. and $n^{-1/2}\sum_{k=1}^{n} X_k \xrightarrow{\mathcal{D}} Z$ (with the limit $Z \in \mathbb{R}$).
(a) Set $Y_k = X_k - \widetilde{X}_k$ and show that $n^{-1/2}\sum_{k=1}^{n} Y_k \xrightarrow{\mathcal{D}} Z - \widetilde{Z}$, with $Z$ and $\widetilde{Z}$ i.i.d.
(b) Let $U_k = Y_k I_{|Y_k| \le b}$ and $V_k = Y_k I_{|Y_k| > b}$. Show that for any $u < \infty$ and all $n$,
$$P\Big(\sum_{k=1}^{n} Y_k \ge u\sqrt{n}\Big) \ge P\Big(\sum_{k=1}^{n} U_k \ge u\sqrt{n},\ \sum_{k=1}^{n} V_k \ge 0\Big) \ge \frac{1}{2}P\Big(\sum_{k=1}^{n} U_k \ge u\sqrt{n}\Big) .$$
(c) Apply the Portmanteau theorem and the clt for the bounded i.i.d. $\{U_k\}$ to get that for any $u, b < \infty$,
$$P(Z - \widetilde{Z} \ge u) \ge \frac{1}{2}P\big(G \ge u/\sqrt{EU_1^2}\big) .$$
Considering the limit $b \uparrow \infty$ followed by $u \uparrow \infty$, deduce that $EY_1^2 < \infty$.
(d) Conclude that if $n^{-1/2}\sum_{k=1}^{n} X_k \xrightarrow{\mathcal{D}} Z$, then necessarily $EX_1^2 < \infty$.
Remark. The trick of replacing $X_k$ by the variables $Y_k = X_k - \widetilde{X}_k$, whose law is symmetric (i.e. $Y_k \stackrel{\mathcal{D}}{=} -Y_k$), is very useful in many problems. It is often called the symmetrization trick.
Exercise 3.3.20. Provide an example of a random variable $X$ with a bounded probability density function but for which $\int_{\mathbb{R}} |\Phi_X(\theta)|\,d\theta = \infty$, and another example of a random variable $X$ whose characteristic function $\Phi_X(\theta)$ is not differentiable at $\theta = 0$.
As you find out next, Lévy's inversion theorem can help when computing densities.

Exercise 3.3.21. Suppose the random variables $U_k$ are i.i.d., where the law of each $U_k$ is the uniform probability measure on $(-1, 1)$. Considering Example 3.3.7, show that for each $n \ge 2$ the probability density function of $S_n = \sum_{k=1}^{n} U_k$ is
$$f_{S_n}(s) = \frac{1}{\pi}\int_0^{\infty} \cos(\theta s)(\sin\theta/\theta)^n\,d\theta ,$$
and deduce that $\int_0^{\infty} \cos(\theta s)(\sin\theta/\theta)^n\,d\theta = 0$ for all $s > n \ge 2$.
Exercise 3.3.22. Deduce from Example 3.3.13 that if $X_k$ are i.i.d., each having the Cauchy density, then $n^{-1}\sum_{k=1}^{n} X_k$ has the same distribution as $X_1$, for any value of $n$.
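A simulation sketch of this stability property (illustrative, with an arbitrary fixed seed): if the sample mean of $n$ standard Cauchy variables is again standard Cauchy, its empirical median should stay near $0$ and its interquartile range near $2$ (the standard Cauchy quartiles are $\mp 1$), no matter how large $n$ is.

```python
import numpy as np

rng = np.random.default_rng(0)              # fixed seed for reproducibility
n, reps = 50, 20000
# reps independent sample means, each of n i.i.d. standard Cauchy variables
means = rng.standard_cauchy((reps, n)).mean(axis=1)
q25, q50, q75 = np.quantile(means, [0.25, 0.5, 0.75])
```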
We next relate differentiability of $\Phi_X(\theta)$ with the weak law of large numbers and show that it does not imply that $E|X|$ is finite.
Exercise 3.3.23. Let $S_n = \sum_{k=1}^{n} X_k$ where the i.i.d. random variables $X_k$ have each the characteristic function $\Phi_X(\theta)$.
(a) Show that if $\frac{d\Phi_X}{d\theta}(0) = z \in \mathbb{C}$, then $z = ia$ for some $a \in \mathbb{R}$ and $n^{-1}S_n \xrightarrow{p} a$ as $n \to \infty$.
(b) Show that if $n^{-1}S_n \xrightarrow{p} a$, then $\Phi_X(h_k)^{n_k} \to e^{ia}$ for any $h_k \to 0$ and $n_k = [1/h_k]$, and deduce that $\frac{d\Phi_X}{d\theta}(0) = ia$.
(c) Conclude that the weak law of large numbers holds (i.e. $n^{-1}S_n \xrightarrow{p} a$ for some non-random $a$), if and only if $\Phi_X(\theta)$ is differentiable at $\theta = 0$ (this result is due to E.J.G. Pitman, see [Pit56]).
(d) Use Exercise 2.1.13 to provide a random variable $X$ for which $\Phi_X(\theta)$ is differentiable at $\theta = 0$ but $E|X| = \infty$.
As you show next, $X_n \xrightarrow{\mathcal{D}} X_\infty$ yields convergence of $\Phi_{X_n}(\theta)$ to $\Phi_{X_\infty}(\theta)$, uniformly over compact subsets of $\mathbb{R}$.

Exercise 3.3.24. Show that if $X_n \xrightarrow{\mathcal{D}} X_\infty$ then for any $r$ finite,
$$\lim_{n\to\infty} \sup_{|\theta|\le r} |\Phi_{X_n}(\theta) - \Phi_{X_\infty}(\theta)| = 0 .$$
Hint: By Theorem 3.2.7 you may further assume that $X_n \xrightarrow{a.s.} X_\infty$.
Characteristic functions of modulus one correspond to lattice or degenerate laws, as you show in the following refinement of part (c) of Proposition 3.3.2.

Exercise 3.3.25. Suppose $|\Phi_Y(\theta)| = 1$ for some $\theta \ne 0$.
(a) Show that $Y$ is a $(2\pi/\theta)$-lattice random variable, namely, that $Y \bmod (2\pi/\theta)$ is P-degenerate.
Hint: Check conditions for equality when applying Jensen's inequality for $(\cos\theta Y, \sin\theta Y)$ and the convex function $g(x, y) = \sqrt{x^2 + y^2}$.
(b) Deduce that if in addition $|\Phi_Y(\lambda\theta)| = 1$ for some $\lambda \notin \mathbb{Q}$, then $Y$ must be P-degenerate, in which case $\Phi_Y(\theta) = \exp(i\theta c)$ for some $c \in \mathbb{R}$.
Building on the preceding two exercises, you are to prove next the following convergence of types result.

Exercise 3.3.26. Suppose $Z_n \xrightarrow{\mathcal{D}} Y$ and $\alpha_n Z_n + \beta_n \xrightarrow{\mathcal{D}} \widetilde{Y}$ for some $\widetilde{Y}$, non-P-degenerate $Y$, and non-random $\alpha_n \ge 0$, $\beta_n$.
(a) Show that $\alpha_n \to \alpha \ge 0$ finite.
Hint: Start with the finiteness of limit points of $\{\alpha_n\}$.
(b) Deduce that $\beta_n \to \beta$ finite.
(c) Conclude that $\widetilde{Y} \stackrel{\mathcal{D}}{=} \alpha Y + \beta$.
Hint: Recall Slutsky's lemma.
Remark. This convergence of types fails for P-degenerate $Y$. For example, if $Z_n \stackrel{\mathcal{D}}{=} \mathcal{N}(0, n^{-3})$, then both $Z_n \xrightarrow{\mathcal{D}} 0$ and $nZ_n \xrightarrow{\mathcal{D}} 0$. Similarly, if $Z_n \stackrel{\mathcal{D}}{=} \mathcal{N}(0, 1)$ then $\alpha_n Z_n \stackrel{\mathcal{D}}{=} \mathcal{N}(0, 1)$ for the non-converging sequence $\alpha_n = (-1)^n$ (of alternating signs).
Mimicking the proof of Lévy's inversion theorem, for random variables of bounded support you get the following alternative inversion formula, based on the theory of Fourier series.

Exercise 3.3.27. Suppose the R.V. $X$ supported on $(0, t)$ has the characteristic function $\Phi_X$ and the distribution function $F_X$. Let $\theta_0 = 2\pi/t$ and $\Psi_{a,b}(\theta)$ be as in (3.3.5), with $\Psi_{a,b}(0) = \frac{b-a}{2\pi}$.
(a) Show that for any $0 < a < b < t$,
$$\lim_{T\to\infty} \sum_{k=-T}^{T} \theta_0\Big(1 - \frac{|k|}{T}\Big)\Psi_{a,b}(k\theta_0)\Phi_X(k\theta_0) = \frac{1}{2}[F_X(b) + F_X(b^-)] - \frac{1}{2}[F_X(a) + F_X(a^-)] .$$
Hint: Recall that $S_T(r) = \sum_{k=1}^{T}(1 - k/T)\frac{\sin kr}{k}$ is uniformly bounded for $r \in (0, 2\pi)$ and integer $T \ge 1$, and $S_T(r) \to \frac{\pi - r}{2}$ as $T \to \infty$.
(b) Show that if $\sum_k |\Phi_X(k\theta_0)| < \infty$ then $X$ has the bounded continuous probability density function, given for $x \in (0, t)$ by
$$f_X(x) = \frac{\theta_0}{2\pi}\sum_{k\in\mathbb{Z}} e^{-ik\theta_0 x}\Phi_X(k\theta_0) .$$
(c) Deduce that if R.V.-s $X$ and $Y$ supported on $(0, t)$ are such that $\Phi_X(k\theta_0) = \Phi_Y(k\theta_0)$ for all $k \in \mathbb{Z}$, then $X \stackrel{\mathcal{D}}{=} Y$.
Here is an application of the preceding exercise for the random walk on the circle $S^1$ of radius one (c.f. Definition 5.1.6 for the random walk on $\mathbb{R}$).

Exercise 3.3.28. Let $t = 2\pi$ and denote by $\Omega$ the unit circle $S^1$, parametrized by the angular coordinate to yield the identification $\Omega = [0, t]$ where both end-points are considered the same point. We equip $\Omega$ with the topology induced by $[0, t]$ and the surface measure $\lambda_\Omega$ similarly induced by Lebesgue's measure (as in Exercise 1.4.37). In particular, R.V.-s on $(\Omega, \mathcal{B}_\Omega)$ correspond to Borel periodic functions on $\mathbb{R}$, of period $t$. In this context we call $U$ of law $t^{-1}\lambda_\Omega$ a uniform R.V. and call $S_n = (\sum_{k=1}^{n} \xi_k) \bmod t$, with i.i.d. $\xi, \xi_k$, a random walk.
(a) Verify that Exercise 3.3.27 applies for $\theta_0 = 1$ and R.V.-s on $\Omega$.
(b) Show that if probability measures $\mu_n$ on $(\Omega, \mathcal{B}_\Omega)$ are such that $\Phi_{\mu_n}(k) \to \Phi(k)$ for $n \to \infty$ and fixed $k \in \mathbb{Z}$, then $\mu_n \xrightarrow{w} \mu$ and $\Phi(k) = \Phi_\mu(k)$ for all $k \in \mathbb{Z}$.
Hint: Since $\Omega$ is compact the sequence $\{\mu_n\}$ is uniformly tight.
(c) Show that $\Phi_U(k) = 1_{k=0}$ and $\Phi_{S_n}(k) = \Phi_\xi(k)^n$. Deduce from these facts that if $\xi$ has a density with respect to $\lambda_\Omega$ then $S_n \xrightarrow{\mathcal{D}} U$ as $n \to \infty$.
Hint: Recall part (a) of Exercise 3.3.25.
(d) Check that if $\xi = \alpha$ is non-random for some $\alpha/t \notin \mathbb{Q}$, then $S_n$ does not converge in distribution, but $S_{K_n} \xrightarrow{\mathcal{D}} U$ for $K_n$ which are uniformly chosen in $\{1, 2, \ldots, n\}$, independently of the sequence $\{\xi_k\}$.
3.3.3. Revisiting the clt. Applying the theory of Subsection 3.3.2 we provide an alternative proof of the clt, based on characteristic functions. One can prove many other weak convergence results for sums of random variables by properly adapting this approach, which is exactly what we will do when demonstrating the convergence to stable laws (see Exercise 3.3.33), and in proving the Poisson approximation theorem (in Subsection 3.4.1), and the multivariate clt (in Section 3.5).
To this end, we start by deriving the analog of the bound (3.1.7) for the characteristic function.
Lemma 3.3.29. If a random variable $X$ has $E(X) = 0$ and $E(X^2) = v < \infty$, then for all $\theta \in \mathbb{R}$,
$$\Big|\Phi_X(\theta) - \Big(1 - \frac{1}{2}v\theta^2\Big)\Big| \le \theta^2\, E\min\big(|X|^2,\ |\theta||X|^3/6\big) .$$
Proof. Let $R_2(x) = e^{ix} - 1 - ix - (ix)^2/2$. Then, rearranging terms, recalling $E(X) = 0$ and using Jensen's inequality for the modulus function, we see that
$$\Big|\Phi_X(\theta) - \Big(1 - \frac{1}{2}v\theta^2\Big)\Big| = \Big|E\Big[e^{i\theta X} - 1 - i\theta X - \frac{i^2\theta^2 X^2}{2}\Big]\Big| = |ER_2(\theta X)| \le E|R_2(\theta X)| .$$
Since $|R_2(x)| \le \min(|x|^2, |x|^3/6)$ for any $x \in \mathbb{R}$ (see also Exercise 3.3.34), by monotonicity of the expectation we get that $E|R_2(\theta X)| \le \theta^2\, E\min(|X|^2, |\theta||X|^3/6)$, completing the proof of the lemma. $\square$
The following simple complex analysis estimate is needed for relating the approximation of the characteristic function of summands to that of their sum.

Lemma 3.3.30. Suppose $z_{n,k} \in \mathbb{C}$ are such that $z_n = \sum_{k=1}^{n} z_{n,k} \to z_\infty$ and $\eta_n = \sum_{k=1}^{n} |z_{n,k}|^2 \to 0$ when $n \to \infty$. Then,
$$\Pi_n := \prod_{k=1}^{n} (1 + z_{n,k}) \to \exp(z_\infty) \quad\text{for } n \to \infty .$$
Proof. Recall that the power series expansion
$$\log(1 + z) = \sum_{k=1}^{\infty} (-1)^{k-1}\frac{z^k}{k}$$
converges for $|z| < 1$. In particular, for $|z| \le 1/2$ it follows that
$$|\log(1 + z) - z| \le \sum_{k=2}^{\infty} \frac{|z|^k}{k} \le |z|^2\sum_{k=2}^{\infty} \frac{2^{-(k-2)}}{k} \le |z|^2\sum_{k=2}^{\infty} 2^{-(k-1)} = |z|^2 .$$
Let $\delta_n = \max\{|z_{n,k}| : k = 1, \ldots, n\}$. Note that $\delta_n^2 \le \eta_n$, so our assumption that $\eta_n \to 0$ implies that $\delta_n \le 1/2$ for all $n$ sufficiently large, in which case
$$|\log\Pi_n - z_n| = \Big|\sum_{k=1}^{n} \log(1 + z_{n,k}) - \sum_{k=1}^{n} z_{n,k}\Big| \le \sum_{k=1}^{n} |\log(1 + z_{n,k}) - z_{n,k}| \le \eta_n .$$
With $z_n \to z_\infty$ and $\eta_n \to 0$, it follows that $\log\Pi_n \to z_\infty$. Consequently, $\Pi_n \to \exp(z_\infty)$ as claimed. $\square$
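A small numerical illustration of Lemma 3.3.30 in its simplest case $z_{n,k} = z_n/n$ with $z_n = z$ constant, where the product reduces to $(1 + z/n)^n \to e^z$; the value of $z$ is an arbitrary complex number.

```python
import cmath

z = -0.5 + 2.0j                       # arbitrary complex number
n = 10**6
approx = (1 + z / n) ** n             # prod_{k=1}^n (1 + z/n), here eta_n = |z|^2/n -> 0
limit = cmath.exp(z)
```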
We will give now an alternative proof of the clt of Theorem 3.1.2.

Proof of Theorem 3.1.2. From Example 3.3.6 we know that $\Phi_G(\theta) = e^{-\theta^2/2}$ is the characteristic function of the standard normal distribution. So, by Lévy's continuity theorem it suffices to show that $\Phi_{\widehat{S}_n}(\theta) \to \exp(-\theta^2/2)$ as $n \to \infty$, for each $\theta \in \mathbb{R}$. Recall that $\widehat{S}_n = \sum_{k=1}^{n} X_{n,k}$, with $X_{n,k} = (X_k - \mu)/\sqrt{vn}$ i.i.d. random variables, so by independence (see Lemma 3.3.8) and scaling (see part (e) of Proposition 3.3.2), we have that
$$\Pi_n := \Phi_{\widehat{S}_n}(\theta) = \prod_{k=1}^{n} \Phi_{X_{n,k}}(\theta) = \Phi_Y(n^{-1/2}\theta)^n = (1 + z_n/n)^n ,$$
where $Y = (X_1 - \mu)/\sqrt{v}$ and $z_n = z_n(\theta) := n[\Phi_Y(n^{-1/2}\theta) - 1]$. Applying Lemma 3.3.30 for $z_{n,k} = z_n/n$, it remains only to show that $z_n \to -\theta^2/2$ (for then $\eta_n = |z_n|^2/n \to 0$). Indeed, since $E(Y) = 0$ and $E(Y^2) = 1$, we have from Lemma 3.3.29 that
$$|z_n + \theta^2/2| = \big|n[\Phi_Y(n^{-1/2}\theta) - 1] + \theta^2/2\big| \le EV_n ,$$
for $V_n = \min(|\theta Y|^2, n^{-1/2}|\theta Y|^3/6)$. With $V_n \xrightarrow{a.s.} 0$ as $n \to \infty$ and $V_n \le |\theta|^2|Y|^2$ which is integrable, it follows by dominated convergence that $EV_n \to 0$ as $n \to \infty$, hence $z_n \to -\theta^2/2$, completing the proof of Theorem 3.1.2. $\square$
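The mechanism of the proof can be seen numerically in the simplest symmetric case (an illustrative check with an arbitrary $\theta$): for $Y = \pm 1$ with probability $1/2$ each, $\Phi_Y(t) = \cos t$, so $\Phi_{\widehat{S}_n}(\theta) = \cos(\theta/\sqrt{n})^n$, which approaches $e^{-\theta^2/2}$ as $n$ grows.

```python
import numpy as np

theta = 1.5                                  # arbitrary fixed frequency
n = 10**6
approx = np.cos(theta / np.sqrt(n)) ** n     # characteristic function of hat{S}_n
limit = np.exp(-theta**2 / 2)                # standard normal characteristic function
```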
We proceed with a brief introduction of stable laws, their domain of attraction and the corresponding limit theorems (which are a natural generalization of the clt).

Definition 3.3.31. A random variable $Y$ has a stable law if it is non-degenerate and for any $m \ge 1$ there exist constants $d_m > 0$ and $c_m$, such that $Y_1 + \ldots + Y_m \stackrel{\mathcal{D}}{=} d_m Y + c_m$, where $Y_i$ are i.i.d. copies of $Y$. Such a variable has a symmetric stable law if in addition $Y \stackrel{\mathcal{D}}{=} -Y$. We further say that a random variable $X$ is in the domain of attraction of a non-degenerate $Y$ if there exist constants $b_n > 0$ and $a_n$ such that $Z_n(X) = (S_n - a_n)/b_n \xrightarrow{\mathcal{D}} Y$ for $S_n = \sum_{k=1}^{n} X_k$ and i.i.d. copies $X_k$ of $X$.

By definition, the collection of stable laws is closed under the affine map $Y \mapsto vY + \mu$ for $\mu \in \mathbb{R}$ and $v > 0$ (which correspond to the centering and scale of the law, but not necessarily its mean and variance). Clearly, each stable law is in its own domain of attraction and, as we see next, only stable laws have a non-empty domain of attraction.
Proposition 3.3.32. If $X$ is in the domain of attraction of some non-degenerate variable $Y$, then $Y$ must have a stable law.

Proof. Fix $m \ge 1$, and setting $n = km$ let $\beta_n = b_n/b_k > 0$ and $\alpha_n = (a_n - m a_k)/b_k$. We then have the representation
\[ \beta_n Z_n(X) + \alpha_n = \sum_{i=1}^m Z_k^{(i)} , \]
where $Z_k^{(i)} = (X_{(i-1)k+1} + \cdots + X_{ik} - a_k)/b_k$ are i.i.d. copies of $Z_k(X)$. From our assumption that $Z_k(X) \stackrel{D}{\to} Y$ we thus deduce (by at most $m-1$ applications of Exercise 3.3.18) that $\beta_n Z_n(X) + \alpha_n \stackrel{D}{\to} \widehat{Y}$, where $\widehat{Y} = Y_1 + \cdots + Y_m$ for i.i.d. copies $Y_i$ of $Y$. Moreover, by assumption $Z_n(X) \stackrel{D}{\to} Y$, hence by the convergence of types $\widehat{Y} \stackrel{D}{=} d_m Y + c_m$ for some finite non-random $d_m \ge 0$ and $c_m$ (c.f. Exercise 3.3.26). Recall from Lemma 3.3.8 that $\Phi_{\widehat{Y}}(\theta) = [\Phi_Y(\theta)]^m$. So, with $Y$ assumed non-degenerate the same applies to $\widehat{Y}$ (see Exercise 3.3.25), and in particular $d_m > 0$. Since this holds for any $m \ge 1$, by definition $Y$ has a stable law. □
We have already seen two examples of symmetric stable laws, namely those associated with the zero-mean normal density and with the Cauchy density of Example 3.3.13. Indeed, as you show next, for each $\alpha \in (0,2)$ there corresponds the symmetric $\alpha$-stable variable $Y_\alpha$ whose characteristic function is
\[ \Phi_{Y_\alpha}(\theta) = \exp(-|\theta|^\alpha) \]
(so the Cauchy distribution corresponds to the symmetric stable of index $\alpha = 1$ and the normal distribution corresponds to index $\alpha = 2$).
Exercise 3.3.33. Fixing $\alpha \in (0,2)$, suppose $X \stackrel{D}{=} -X$ and $P(|X| > x) = x^{-\alpha}$ for all $x \ge 1$.
(a) Check that $\Phi_X(\theta) = 1 - \beta(|\theta|)|\theta|^\alpha$, where $\beta(r) = \alpha \int_r^\infty (1 - \cos u) u^{-(\alpha+1)}\,du$ converges as $r \downarrow 0$ to $\beta(0)$ finite and positive.
(b) Setting $\Phi_{\alpha,0}(\theta) = \exp(-|\theta|^\alpha)$, $b_n = (\beta(0) n)^{1/\alpha}$ and $\widehat{S}_n = b_n^{-1} \sum_{k=1}^n X_k$ for i.i.d. copies $X_k$ of $X$, deduce that $\Phi_{\widehat{S}_n}(\theta) \to \Phi_{\alpha,0}(\theta)$ as $n \to \infty$, for any fixed $\theta \in \mathbb{R}$.
(c) Conclude that $X$ is in the domain of attraction of a symmetric stable variable $Y_\alpha$, whose characteristic function is $\Phi_{\alpha,0}(\cdot)$.
(d) Fix $\alpha = 1$ and show that with probability one $\limsup_n \widehat{S}_n = \infty$ and $\liminf_n \widehat{S}_n = -\infty$.
Hint: Recall Kolmogorov's 0-1 law. The same proof applies for any $\alpha > 0$ once we verify that $Y_\alpha$ has unbounded support.
(e) Show that if $\alpha = 1$ then $\frac{1}{n \log n} \sum_{k=1}^n |X_k| \to 1$ in probability but not almost surely (in contrast, $X$ is integrable when $\alpha > 1$, in which case the strong law of large numbers applies).
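A Monte Carlo sketch of parts (b)-(c) (an illustration added here; the function names and the simplified normalization $b_n = n^{1/\alpha}$, which absorbs the constant $\beta(0)^{1/\alpha}$ into the scale of the limit, are choices of this sketch rather than anything in the notes):

```python
import random

def symmetric_pareto(alpha: float, rng: random.Random) -> float:
    """One draw of a symmetric X with P(|X| > x) = x^{-alpha} for x >= 1."""
    u = 1.0 - rng.random()                      # uniform on (0, 1]
    mag = u ** (-1.0 / alpha)                   # |X| = U^{-1/alpha} >= 1
    return mag if rng.random() < 0.5 else -mag  # independent fair sign

def scaled_sum(alpha: float, n: int, rng: random.Random) -> float:
    """n^{-1/alpha} * sum of n i.i.d. copies; approximately alpha-stable in law."""
    return sum(symmetric_pareto(alpha, rng) for _ in range(n)) / n ** (1.0 / alpha)

rng = random.Random(0)
sample = sorted(scaled_sum(1.5, 500, rng) for _ in range(201))
print(sample[100])  # sample median, near 0 by symmetry
```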
Remark. While outside the scope of these notes, one can show that (up to scaling) any symmetric stable variable must be of the form $Y_\alpha$ for some $\alpha \in (0,2]$. Further, for any $\alpha \in (0,2)$ the necessary and sufficient condition for $X \stackrel{D}{=} -X$ to be in the domain of attraction of $Y_\alpha$ is that the function $L(x) = x^\alpha P(|X| > x)$ is slowly varying at $\infty$ (that is, $L(ux)/L(x) \to 1$ for $x \to \infty$ and fixed $u > 0$). Indeed, as shown for example in [Bre92, Theorem 9.32], up to the mapping $Y \mapsto vY + \mu$, the collection of all stable laws forms a two parameter family $Y_{\alpha,\kappa}$, parametrized by the index $\alpha \in (0,2]$ and skewness $\kappa \in [-1,1]$. The corresponding characteristic functions are
\[ (3.3.11) \qquad \Phi_{\alpha,\kappa}(\theta) = \exp\big( -|\theta|^\alpha (1 + i \kappa\, \mathrm{sgn}(\theta)\, g_\alpha(\theta)) \big) , \]
where $g_1(r) = (2/\pi)\log|r|$ and $g_\alpha = \tan(\pi\alpha/2)$ is constant for all $\alpha \ne 1$ (in particular, $g_2 = 0$ so the parameter $\kappa$ is irrelevant when $\alpha = 2$). Further, in case $\alpha < 2$ the domain of attraction of $Y_{\alpha,\kappa}$ consists precisely of the random variables $X$ for which $L(x) = x^\alpha P(|X| > x)$ is slowly varying at $\infty$ and $(P(X > x) - P(X < -x))/P(|X| > x) \to \kappa$ as $x \to \infty$ (for example, see [Bre92, Theorem 9.34]). To complete this picture, we recall [Fel71, Theorem XVII.5.1], that $X$ is in the domain of attraction of the normal variable $Y_2$ if and only if $L(x) = E[X^2 I_{\{|X| \le x\}}]$ is slowly varying (as is of course the case whenever $EX^2$ is finite).
As shown in the following exercise, controlling the modulus of the remainder term for the $n$-th order Taylor approximation of $e^{ix}$ one can generalize the bound on $\Phi_X(\theta)$ beyond the case $n = 2$ of Lemma 3.3.29.

Exercise 3.3.34. For any $x \in \mathbb{R}$ and non-negative integer $n$, let
\[ R_n(x) = e^{ix} - \sum_{k=0}^{n} \frac{(ix)^k}{k!} . \]
(a) Show that $R_n(x) = \int_0^x i R_{n-1}(y)\,dy$ for all $n \ge 1$ and deduce by induction on $n$ that
\[ |R_n(x)| \le \min\Big( \frac{2|x|^n}{n!}, \frac{|x|^{n+1}}{(n+1)!} \Big) \quad \text{for all } x \in \mathbb{R},\ n = 0, 1, 2, \ldots . \]
(b) Conclude that if $E|X|^n < \infty$ then
\[ \Big| \Phi_X(\theta) - \sum_{k=0}^{n} \frac{(i\theta)^k E X^k}{k!} \Big| \le |\theta|^n\, E\Big[ \min\Big( \frac{2|X|^n}{n!}, \frac{|\theta|\,|X|^{n+1}}{(n+1)!} \Big) \Big] . \]
By solving the next exercise you generalize the proof of Theorem 3.1.2 via characteristic functions to the setting of Lindeberg's clt.

Exercise 3.3.35. Consider $\widehat{S}_n = \sum_{k=1}^n X_{n,k}$ for mutually independent random variables $X_{n,k}$, $k = 1, \ldots, n$, of zero mean and variance $v_{n,k}$, such that $v_n = \sum_{k=1}^n v_{n,k} \to 1$ as $n \to \infty$.
(a) Fixing $\theta \in \mathbb{R}$ show that
\[ \lambda_n = \Phi_{\widehat{S}_n}(\theta) = \prod_{k=1}^n (1 + z_{n,k}) , \]
where $z_{n,k} = \Phi_{X_{n,k}}(\theta) - 1$.
(b) With $z_\infty = -\theta^2/2$, use Lemma 3.3.29 to verify that $|z_{n,k}| \le 2\theta^2 v_{n,k}$ and further, for any $\varepsilon > 0$,
\[ |z_n - v_n z_\infty| \le \sum_{k=1}^n |z_{n,k} - v_{n,k} z_\infty| \le \theta^2 g_n(\varepsilon) + \frac{\varepsilon |\theta|^3}{6} v_n , \]
where $z_n = \sum_{k=1}^n z_{n,k}$ and $g_n(\varepsilon)$ is given by (3.1.4).
(c) Recall that Lindeberg's condition $g_n(\varepsilon) \to 0$ implies that $r_n^2 = \max_k v_{n,k} \to 0$ as $n \to \infty$. Deduce that then $z_n \to z_\infty$ and $\gamma_n = \sum_{k=1}^n |z_{n,k}|^2 \to 0$ when $n \to \infty$.
(d) Applying Lemma 3.3.30, conclude that $\widehat{S}_n \stackrel{D}{\to} G$.

We conclude this section with an exercise that reviews various techniques one may use for establishing convergence in distribution for sums of independent random variables.
Exercise 3.3.36. Throughout this problem $S_n = \sum_{k=1}^n X_k$ for mutually independent random variables $X_k$.
(a) Suppose that $P(X_k = k^\alpha) = P(X_k = -k^\alpha) = 1/(2k^\beta)$ and $P(X_k = 0) = 1 - k^{-\beta}$. Show that for any fixed $\alpha \in \mathbb{R}$ and $\beta > 1$, the series $S_n$ converges almost surely as $n \to \infty$.
(b) Consider the setting of part (a) when $0 < \beta < 1$ and $\gamma = 2\alpha - \beta + 1$ is positive. Find non-random $b_n$ such that $b_n^{-1} S_n \stackrel{D}{\to} Z$ and $0 < F_Z(z) < 1$ for some $z \in \mathbb{R}$. Provide also the characteristic function $\Phi_Z(\theta)$ of $Z$.
(c) Repeat part (b) in case $\beta = 1$ and $\alpha > 0$ (see Exercise 3.1.11 for $\alpha = 0$).
(d) Suppose now that $P(X_k = 2k) = P(X_k = -2k) = 1/(2k^2)$ and $P(X_k = 1) = P(X_k = -1) = 0.5(1 - k^{-2})$. Show that $S_n/\sqrt{n} \stackrel{D}{\to} G$.
3.4. Poisson approximation and the Poisson process
Subsection 3.4.1 deals with the Poisson approximation theorem and a few of its applications. It leads naturally to the introduction of the Poisson process in Subsection 3.4.2, where we also explore its relation to sums of i.i.d. Exponential variables and to order statistics of i.i.d. uniform random variables.
3.4.1. Poisson approximation. The Poisson approximation theorem is about the law of the sum $S_n$ of a large number ($= n$) of independent random variables. In contrast to the clt, which also deals with such objects, here all variables are non-negative integer valued and the variance of $S_n$ remains bounded, allowing for the approximation in law of $S_n$ by an integer valued variable. The Poisson distribution results when the number of terms in the sum grows while the probability that each of them is non-zero decays. As such, the Poisson approximation is about counting the number of occurrences among many independent rare events.
Theorem 3.4.1 (Poisson approximation). Let $S_n = \sum_{k=1}^n Z_{n,k}$, where for each $n$ the random variables $Z_{n,k}$, $1 \le k \le n$, are mutually independent, each taking values in the set of non-negative integers. Suppose that $p_{n,k} = P(Z_{n,k} = 1)$ and $\varepsilon_{n,k} = P(Z_{n,k} \ge 2)$ are such that as $n \to \infty$,
(a) $\sum_{k=1}^n p_{n,k} \to \lambda < \infty$,
(b) $\max_{k=1,\ldots,n} p_{n,k} \to 0$,
(c) $\sum_{k=1}^n \varepsilon_{n,k} \to 0$.
Then $S_n \stackrel{D}{\to} N_\lambda$, which has the Poisson distribution of parameter $\lambda$, as $n \to \infty$.
Proof. The first step of the proof is to apply truncation, comparing $S_n$ with
\[ \overline{S}_n = \sum_{k=1}^n \overline{Z}_{n,k} , \]
where $\overline{Z}_{n,k} = Z_{n,k} I_{\{Z_{n,k} \le 1\}}$ for $k = 1, \ldots, n$. Indeed, observe that
\[ P(S_n \ne \overline{S}_n) \le \sum_{k=1}^n P(Z_{n,k} \ne \overline{Z}_{n,k}) = \sum_{k=1}^n P(Z_{n,k} \ge 2) = \sum_{k=1}^n \varepsilon_{n,k} \to 0 \quad \text{for } n \to \infty , \]
by assumption (c). Hence, $(S_n - \overline{S}_n) \stackrel{p}{\to} 0$. Consequently, the convergence $\overline{S}_n \stackrel{D}{\to} N_\lambda$ of the sums of truncated variables implies that also $S_n \stackrel{D}{\to} N_\lambda$ (c.f. Exercise 3.2.8).
As seen in the context of the clt, characteristic functions are a powerful tool for establishing convergence in distribution of sums of independent random variables (see Subsection 3.3.3). This is also evident in our proof of the Poisson approximation theorem. That is, to prove that $\overline{S}_n \stackrel{D}{\to} N_\lambda$, it suffices by Lévy's continuity theorem to show the convergence of the characteristic functions $\Phi_{\overline{S}_n}(\theta) \to \Phi_{N_\lambda}(\theta)$ for each $\theta \in \mathbb{R}$.
To this end, recall that $\overline{Z}_{n,k}$ are independent Bernoulli variables of parameters $p_{n,k}$, $k = 1, \ldots, n$. Hence, by Lemma 3.3.8 and Example 3.3.5 we have that for $z_{n,k} = p_{n,k}(e^{i\theta} - 1)$,
\[ \Phi_{\overline{S}_n}(\theta) = \prod_{k=1}^n \Phi_{\overline{Z}_{n,k}}(\theta) = \prod_{k=1}^n (1 - p_{n,k} + p_{n,k} e^{i\theta}) = \prod_{k=1}^n (1 + z_{n,k}) . \]
Our assumption (a) implies that for $n \to \infty$
\[ z_n := \sum_{k=1}^n z_{n,k} = \Big( \sum_{k=1}^n p_{n,k} \Big) (e^{i\theta} - 1) \to \lambda (e^{i\theta} - 1) := z_\infty . \]
Further, since $|z_{n,k}| \le 2 p_{n,k}$, our assumptions (a) and (b) imply that for $n \to \infty$,
\[ \gamma_n = \sum_{k=1}^n |z_{n,k}|^2 \le 4 \sum_{k=1}^n p_{n,k}^2 \le 4 \big( \max_{k=1,\ldots,n} p_{n,k} \big) \Big( \sum_{k=1}^n p_{n,k} \Big) \to 0 . \]
Applying Lemma 3.3.30 we conclude that when $n \to \infty$,
\[ \Phi_{\overline{S}_n}(\theta) \to \exp(z_\infty) = \exp(\lambda(e^{i\theta} - 1)) = \Phi_{N_\lambda}(\theta) \]
(see (3.3.3) for the last identity), thus completing the proof. □
Remark. Recall from Example 3.2.25 that the weak convergence of the laws of the integer valued $\overline{S}_n$ to that of $N_\lambda$ also implies their convergence in total variation. In the setting of the Poisson approximation theorem, taking $\lambda_n = \sum_{k=1}^n p_{n,k}$, the more quantitative result
\[ \| \mathcal{P}_{\overline{S}_n} - \mathcal{P}_{N_{\lambda_n}} \|_{tv} = \sum_{k=0}^\infty | P(\overline{S}_n = k) - P(N_{\lambda_n} = k) | \le 2 \min\Big( \frac{1}{\lambda_n}, 1 \Big) \sum_{k=1}^n p_{n,k}^2 \]
due to Stein (1987) also holds (see also [Dur10, (3.6.1)] for a simpler argument, due to Hodges and Le Cam (1960), which is just missing the factor $\min(\frac{1}{\lambda_n}, 1)$).
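Stein's bound is easy to probe numerically in the Binomial setting $p_{n,k} = p$ (treated in Example 3.4.2 of these notes). The sketch below (an added illustration, not from the notes) computes the total variation sum in the convention of the display above and checks it against $2 \min(\lambda_n^{-1}, 1) \sum_k p_{n,k}^2 = 2 \min((np)^{-1}, 1)\, n p^2$:

```python
import math

def binom_pmf(n: int, p: float, k: int) -> float:
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(lam: float, k: int) -> float:
    # computed in log space to avoid overflow for large k
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

def tv_sum(n: int, p: float) -> float:
    """sum_k |P(S_n = k) - P(N = k)| for S_n ~ Binomial(n, p), N ~ Poisson(np);
    the Poisson mass beyond k = n + 200 is negligible for these parameters."""
    lam = n * p
    return sum(abs((binom_pmf(n, p, k) if k <= n else 0.0) - poisson_pmf(lam, k))
               for k in range(n + 200))

n, p = 100, 0.02
stein = 2 * min(1 / (n * p), 1) * n * p * p
print(tv_sum(n, p), stein)  # the computed sum sits below Stein's bound
```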
3.4. POISSON APPROXIMATION AND THE POISSON PROCESS 135
For the remainder of this subsection we list applications of the Poisson approximation theorem, starting with

Example 3.4.2 (Poisson approximation for the Binomial). Take independent variables $Z_{n,k} \in \{0, 1\}$, so $\varepsilon_{n,k} = 0$, with $p_{n,k} = p_n$ that does not depend on $k$. Then, the variable $S_n = \overline{S}_n$ has the Binomial distribution of parameters $(n, p_n)$. By Stein's result, the Binomial distribution of parameters $(n, p_n)$ is approximated well by the Poisson distribution of parameter $\lambda_n = n p_n$, provided $p_n \to 0$. In case $\lambda_n = n p_n \to \lambda < \infty$, Theorem 3.4.1 yields that the Binomial $(n, p_n)$ laws converge weakly as $n \to \infty$ to the Poisson distribution of parameter $\lambda$. This is in agreement with Example 3.1.7, where we approximate the Binomial distribution of parameters $(n, p)$ by the normal distribution, for in Example 3.1.8 we saw that, upon the same scaling, $N_{\lambda_n}$ is also approximated well by the normal distribution when $\lambda_n \to \infty$.
Recall the occupancy problem where we distribute at random $r$ distinct balls among $n$ distinct boxes and each of the possible $n^r$ assignments of balls to boxes is equally likely. In Example 2.1.10 we considered the asymptotic fraction of empty boxes when $r/n \to \alpha$ and $n \to \infty$. Noting that the number of balls $M_{n,k}$ in the $k$-th box follows the Binomial distribution of parameters $(r, n^{-1})$, we deduce from Example 3.4.2 that $M_{n,k} \stackrel{D}{\to} N_\alpha$. Thus, $P(M_{n,k} = 0) \to P(N_\alpha = 0) = e^{-\alpha}$. That is, for large $n$ each box is empty with probability about $e^{-\alpha}$, which may explain (though not prove) the result of Example 2.1.10. Here we use the Poisson approximation theorem to tackle a different regime, in which $r = r_n$ is of order $n \log n$, and consequently, there are fewer empty boxes.
Proposition 3.4.3. Let $S_n$ denote the number of empty boxes. Assuming $r = r_n$ is such that $n e^{-r/n} \to \lambda \in [0, \infty)$, we have that $S_n \stackrel{D}{\to} N_\lambda$ as $n \to \infty$.
Proof. Let $Z_{n,k} = I_{\{M_{n,k} = 0\}}$ for $k = 1, \ldots, n$; that is, $Z_{n,k} = 1$ if the $k$-th box is empty and $Z_{n,k} = 0$ otherwise. Note that $S_n = \sum_{k=1}^n Z_{n,k}$, with each $Z_{n,k}$ having the Bernoulli distribution of parameter $p_n = (1 - n^{-1})^r$. Our assumption about $r_n$ guarantees that $n p_n \to \lambda$. If the occupancies $Z_{n,k}$ of the various boxes were mutually independent, then the stated convergence of $S_n$ to $N_\lambda$ would have followed from Theorem 3.4.1. Unfortunately, this is not the case, so we present a bare-hands approach showing that the dependence is weak enough to retain the same conclusion. To this end, first observe that for any $l = 1, 2, \ldots, n$, the probability that given boxes $k_1 < k_2 < \ldots < k_l$ are all empty is
\[ P(Z_{n,k_1} = Z_{n,k_2} = \cdots = Z_{n,k_l} = 1) = \Big( 1 - \frac{l}{n} \Big)^r . \]
Let $p_l = p_l(r, n) = P(S_n = l)$ denote the probability that exactly $l$ boxes are empty out of the $n$ boxes into which the $r$ balls are placed at random. Then, considering all possible choices of the locations of these $l \ge 1$ empty boxes we get the identities $p_l(r, n) = b_l(r, n)\, p_0(r, n - l)$ for
\[ (3.4.1) \qquad b_l(r, n) = \binom{n}{l} \Big( 1 - \frac{l}{n} \Big)^r . \]
Further, $p_0(r, n) = 1 - P(\text{at least one empty box})$, so that by the inclusion-exclusion formula,
\[ (3.4.2) \qquad p_0(r, n) = \sum_{l=0}^n (-1)^l b_l(r, n) . \]
According to part (b) of Exercise 3.4.4, $p_0(r, n) \to e^{-\lambda}$. Further, for fixed $l$ we have that $(n - l) e^{-r/(n-l)} \to \lambda$, so as before we conclude that $p_0(r, n - l) \to e^{-\lambda}$. By part (a) of Exercise 3.4.4 we know that $b_l(r, n) \to \lambda^l / l!$ for fixed $l$, hence $p_l(r, n) \to e^{-\lambda} \lambda^l / l!$. As $p_l = P(S_n = l)$, the proof of the proposition is thus complete. □
The following exercise provides the estimates one needs during the proof of Proposition 3.4.3 (for more details, see [Dur10, Theorem 3.6.5]).

Exercise 3.4.4. Assuming $n e^{-r/n} \to \lambda$, show that
(a) $b_l(r, n)$ of (3.4.1) converges to $\lambda^l / l!$ for each fixed $l$.
(b) $p_0(r, n)$ of (3.4.2) converges to $e^{-\lambda}$.
Finally, here is an application of Proposition 3.4.3 to the coupon collector's problem of Example 2.1.8, where $T_n$ denotes the number of independent trials it takes to have at least one representative of each of the $n$ possible values (and each trial produces a value $U_i$ that is distributed uniformly on the set of $n$ possible values).

Example 3.4.5 (Revisiting the coupon collector's problem). For any $x \in \mathbb{R}$, we have that
\[ (3.4.3) \qquad \lim_{n \to \infty} P(T_n - n \log n \le n x) = \exp(-e^{-x}) , \]
which is an improvement over our weak law result that $T_n / (n \log n) \to 1$. Indeed, to derive (3.4.3) view the first $r$ trials of the coupon collector as the random placement of $r$ balls into $n$ distinct boxes that correspond to the $n$ possible values. From this point of view, the event $\{T_n \le r\}$ corresponds to filling all $n$ boxes with the $r$ balls, that is, having none empty. Taking $r = r_n = [n \log n + n x]$ we have that $n e^{-r_n/n} \to \lambda = e^{-x}$, and so it follows from Proposition 3.4.3 that $P(T_n \le r_n) \to P(N_\lambda = 0) = e^{-\lambda}$, as stated in (3.4.3).
Note that though $T_n = \sum_{k=1}^n X_{n,k}$ with $X_{n,k}$ independent, the convergence in distribution of $T_n$, given by (3.4.3), is to a non-normal limit. This should not surprise you, for the terms $X_{n,k}$ with $k$ near $n$ are large and do not satisfy Lindeberg's condition.
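The limit (3.4.3) shows up clearly in simulation. Below is a sketch (an added illustration; $n$ and the trial count are arbitrary choices) estimating $P(T_n - n \log n \le nx)$ at $x = 0$ and comparing it with $\exp(-e^{0}) = e^{-1}$:

```python
import math
import random

def coupon_time(n: int, rng: random.Random) -> int:
    """Number of uniform draws from {0,...,n-1} until every value has appeared."""
    seen, t = set(), 0
    while len(seen) < n:
        seen.add(rng.randrange(n))
        t += 1
    return t

rng = random.Random(0)
n, x, trials = 500, 0.0, 400
threshold = n * math.log(n) + n * x
freq = sum(coupon_time(n, rng) <= threshold for _ in range(trials)) / trials
print(freq, math.exp(-math.exp(-x)))  # both near 0.368
```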
Exercise 3.4.6. Recall that $\tau_n^\ell$ denotes the first time one has $\ell$ distinct values when collecting coupons that are uniformly distributed on $\{1, 2, \ldots, n\}$. Using the Poisson approximation theorem show that if $n \to \infty$ and $\ell = \ell(n)$ is such that $\ell n^{-1/2} \to \lambda \in [0, \infty)$, then $\tau_n^\ell - \ell \stackrel{D}{\to} N$ with $N$ a Poisson random variable of parameter $\lambda^2/2$.
3.4.2. Poisson Process. The Poisson process is a continuous time stochastic process $\{N_t(\omega), t \ge 0\}$ which belongs to the following class of counting processes.

Definition 3.4.7. A counting process is a mapping $\omega \mapsto N_t(\omega)$, where $N_t(\omega)$ is a piecewise constant, non-decreasing, right continuous function of $t \ge 0$, with $N_0(\omega) = 0$ and (countably) infinitely many jump discontinuities, each of which is of size one.

Associated with each sample path $N_t(\omega)$ of such a process are the jump times $0 = T_0 < T_1 < \cdots < T_n < \cdots$ such that $T_k = \inf\{t \ge 0 : N_t \ge k\}$ for each $k$, or equivalently
\[ N_t = \sup\{k \ge 0 : T_k \le t\} . \]
In applications we find such $N_t$ as counting the number of discrete events occurring in the interval $[0, t]$ for each $t \ge 0$, with $T_k$ denoting the arrival or occurrence time of the $k$-th such event.

Remark. It is possible to extend the notion of counting processes to discrete events indexed on $\mathbb{R}^d$, $d \ge 2$. This is done by assigning random integer counts $N_A$ to Borel subsets $A$ of $\mathbb{R}^d$ in an additive manner, that is, $N_{A \cup B} = N_A + N_B$ whenever $A$ and $B$ are disjoint. Such processes are called point processes. See also Exercise 7.1.13 for more about the Poisson point process and inhomogeneous Poisson processes of non-constant rate.
Among all counting processes we characterize the Poisson process by the joint distribution of its jump (arrival) times $T_k$.

Definition 3.4.8. The Poisson process of rate $\lambda > 0$ is the unique counting process with the gaps between jump times $\tau_k = T_k - T_{k-1}$, $k = 1, 2, \ldots$ being i.i.d. random variables, each having the exponential distribution of parameter $\lambda$.

Thus, from Exercise 1.4.46 we deduce that the $k$-th arrival time $T_k$ of the Poisson process of rate $\lambda$ has the gamma density of parameters $\alpha = k$ and $s = \lambda$,
\[ f_{T_k}(u) = \frac{\lambda^k u^{k-1}}{(k-1)!} e^{-\lambda u} 1_{\{u > 0\}} . \]
As we have seen in Example 2.3.7, counting processes appear in the context of renewal theory. In particular, as shown in Exercise 2.3.8, the Poisson process of rate $\lambda$ satisfies the strong law of large numbers $t^{-1} N_t \stackrel{a.s.}{\to} \lambda$.
Recall that a random variable $N$ has the Poisson($\lambda$) law if
\[ P(N = n) = \frac{\lambda^n}{n!} e^{-\lambda} , \quad n = 0, 1, 2, \ldots . \]
Our next proposition, which is often used as an alternative definition of the Poisson process, also explains its name.

Proposition 3.4.9. For any $\ell$ and any $0 = t_0 < t_1 < \cdots < t_\ell$, the increments $N_{t_1}$, $N_{t_2} - N_{t_1}, \ldots, N_{t_\ell} - N_{t_{\ell-1}}$, are independent random variables and for some $\lambda > 0$ and all $t > s \ge 0$, the increment $N_t - N_s$ has the Poisson($\lambda(t - s)$) law.

Thus, the Poisson process has independent increments, each having a Poisson law, where the parameter of the count $N_t - N_s$ is proportional to the length of the corresponding interval $[s, t]$.
The proof of Proposition 3.4.9 relies on the lack of memory of the exponential distribution. That is, if the law of a random variable $T$ is exponential (of some parameter $\lambda > 0$), then for all $t, s \ge 0$,
\[ (3.4.4) \qquad P(T > t + s \,|\, T > t) = \frac{P(T > t + s)}{P(T > t)} = \frac{e^{-\lambda(t+s)}}{e^{-\lambda t}} = e^{-\lambda s} = P(T > s) . \]
Indeed, the key to the proof of Proposition 3.4.9 is the following lemma.

Lemma 3.4.10. Fixing $t > 0$, the variables $\tau'_j$ with $\tau'_1 = T_{N_t + 1} - t$, and $\tau'_j = T_{N_t + j} - T_{N_t + j - 1}$, $j \ge 2$, are i.i.d., each having the exponential distribution of parameter $\lambda$. Further, the collection $\{\tau'_j\}$ is independent of $N_t$, which has the Poisson distribution of parameter $\lambda t$.

Remark. Note that in particular, $E_t = T_{N_t + 1} - t$, which counts the time till the next arrival occurs, hence is called the excess life time at $t$, follows the exponential distribution of parameter $\lambda$.
Proof. Fixing $t > 0$ and $n \ge 1$ let $H_n(x) = P(t \ge T_n > t - x)$. With $H_n(x) = \int_0^x f_{T_n}(t - y)\,dy$ and $T_n$ independent of $\tau_{n+1}$, we get by Fubini's theorem (for $I_{\{t \ge T_n > t - \tau_{n+1}\}}$), and the integration by parts of Lemma 1.4.30, that
\[ (3.4.5) \qquad P(N_t = n) = P(t \ge T_n > t - \tau_{n+1}) = E[H_n(\tau_{n+1})] = \int_0^t f_{T_n}(t - y) P(\tau_{n+1} > y)\,dy = \int_0^t \frac{\lambda^n (t - y)^{n-1}}{(n-1)!} e^{-\lambda(t-y)} e^{-\lambda y}\,dy = e^{-\lambda t} \frac{(\lambda t)^n}{n!} . \]
As this applies for any $n \ge 1$, it follows that $N_t$ has the Poisson distribution of parameter $\lambda t$. Similarly, observe that for any $s_1 \ge 0$ and $n \ge 1$,
\[ P(N_t = n, \tau'_1 > s_1) = P(t \ge T_n > t - \tau_{n+1} + s_1) = \int_0^t f_{T_n}(t - y) P(\tau_{n+1} > s_1 + y)\,dy = e^{-\lambda s_1} P(N_t = n) = P(\tau_1 > s_1) P(N_t = n) . \]
Since $T_0 = 0$, $P(N_t = 0) = e^{-\lambda t}$ and $T_1 = \tau_1$, in view of (3.4.4) this conclusion extends to $n = 0$, proving that $\tau'_1$ is independent of $N_t$ and has the same exponential law as $\tau_1$.
Next, fix an arbitrary integer $k \ge 2$ and non-negative $s_j \ge 0$ for $j = 1, \ldots, k$. Then, for any $n \ge 0$, since $\tau_{n+j}$, $j \ge 2$, are i.i.d. and independent of $(T_n, \tau_{n+1})$,
\[ P(N_t = n, \tau'_j > s_j, j = 1, \ldots, k) = P(t \ge T_n > t - \tau_{n+1} + s_1,\ T_{n+j} - T_{n+j-1} > s_j, j = 2, \ldots, k) = P(t \ge T_n > t - \tau_{n+1} + s_1) \prod_{j=2}^k P(\tau_{n+j} > s_j) = P(N_t = n) \prod_{j=1}^k P(\tau_j > s_j) . \]
Since $s_j \ge 0$ and $n \ge 0$ are arbitrary, this shows that the random variables $N_t$ and $\tau'_j$, $j = 1, \ldots, k$, are mutually independent (c.f. Corollary 1.4.12), with each $\tau'_j$ having an exponential distribution of parameter $\lambda$. As $k$ is arbitrary, the independence of $N_t$ and the countable collection $\{\tau'_j\}$ follows by Definition 1.4.3. □
Proof of Proposition 3.4.9. Fix $t, s_j \ge 0$, $j = 1, \ldots, k$, and non-negative integers $n$ and $m_j$, $1 \le j \le k$. The event $\{N_{s_j} = m_j, 1 \le j \le k\}$ is of the form $\{(\tau_1, \ldots, \tau_r) \in H\}$ for $r = m_k + 1$ and
\[ H = \bigcap_{j=1}^k \big\{ x \in [0, \infty)^r : x_1 + \cdots + x_{m_j} \le s_j < x_1 + \cdots + x_{m_j + 1} \big\} . \]
Since the event $\{(\tau'_1, \ldots, \tau'_r) \in H\}$ is merely $\{N_{t + s_j} - N_t = m_j, 1 \le j \le k\}$, it follows from Lemma 3.4.10 that
\[ P(N_t = n, N_{t+s_j} - N_t = m_j, 1 \le j \le k) = P(N_t = n, (\tau'_1, \ldots, \tau'_r) \in H) = P(N_t = n) P((\tau_1, \ldots, \tau_r) \in H) = P(N_t = n) P(N_{s_j} = m_j, 1 \le j \le k) . \]
By induction on $\ell$ this identity implies that if $0 = t_0 < t_1 < t_2 < \cdots < t_\ell$, then
\[ (3.4.6) \qquad P(N_{t_i} - N_{t_{i-1}} = n_i, 1 \le i \le \ell) = \prod_{i=1}^{\ell} P(N_{t_i - t_{i-1}} = n_i) \]
(the case $\ell = 1$ is trivial, and to advance the induction to $\ell + 1$ set $k = \ell$, $t = t_1$, $n = n_1$ and $s_j = t_{j+1} - t_1$, $m_j = \sum_{i=2}^{j+1} n_i$).
Considering (3.4.6) for $\ell = 2$, $t_2 = t > s = t_1$, and summing over the values of $n_1$ we see that $P(N_t - N_s = n_2) = P(N_{t-s} = n_2)$, hence by (3.4.5) we conclude that $N_t - N_s$ has the Poisson distribution of parameter $\lambda(t - s)$, as claimed. □
The Poisson process is also related to the order statistics $V_{n,k}$ for the uniform measure, as stated in the next two exercises.

Exercise 3.4.11. Let $U_1, U_2, \ldots, U_n$ be i.i.d. with each $U_i$ having the uniform measure on $(0, 1]$. Denote by $V_{n,k}$ the $k$-th smallest number in $\{U_1, \ldots, U_n\}$.
(a) Show that $(V_{n,1}, \ldots, V_{n,n})$ has the same law as $(T_1/T_{n+1}, \ldots, T_n/T_{n+1})$, where $T_k$ are the jump (arrival) times for a Poisson process of rate $\lambda$ (see Subsection 1.4.2 for the definition of the law $\mathcal{P}_X$ of a random vector $X$).
(b) Taking $\lambda = 1$, deduce that $n V_{n,k} \stackrel{D}{\to} T_k$ as $n \to \infty$ while $k$ is fixed, where $T_k$ has the gamma density of parameters $\alpha = k$ and $s = 1$.
Exercise 3.4.12. Fixing any positive integer $n$ and $0 \le t_1 \le t_2 \le \cdots \le t_n \le t$, show that
\[ P(T_k \le t_k, k = 1, \ldots, n \,|\, N_t = n) = \frac{n!}{t^n} \int_0^{t_1} \int_{x_1}^{t_2} \cdots \Big( \int_{x_{n-1}}^{t_n} dx_n \Big) dx_{n-1} \cdots dx_1 . \]
That is, conditional on the event $\{N_t = n\}$, the first $n$ jump times $\{T_k : k = 1, \ldots, n\}$ have the same law as the order statistics $\{V_{n,k} : k = 1, \ldots, n\}$ of a sample of $n$ i.i.d. random variables $U_1, \ldots, U_n$, each of which is uniformly distributed in $[0, t]$.
Here is an application of Exercise 3.4.12.

Exercise 3.4.13. Consider a Poisson process $N_t$ of rate $\lambda$ and jump times $T_k$.
(a) Compute the values of $g(n) = E\big( I_{\{N_t = n\}} \prod_{k=1}^n T_k \big)$.
(b) Compute the value of $v = E\big( \sum_{k=1}^{N_t} (t - T_k) \big)$.
(c) Suppose that $T_k$ is the arrival time at the train station of the $k$-th passenger for a train that departs the station at time $t$. What is the meaning of $N_t$ and of $v$ in this case?
The representation of the order statistics $V_{n,k}$ in terms of the jump times of a Poisson process is very useful when studying the large $n$ asymptotics of their spacings $R_{n,k}$. For example,

Exercise 3.4.14. Let $R_{n,k} = V_{n,k} - V_{n,k-1}$, $k = 1, \ldots, n$, denote the spacings between the $V_{n,k}$ of Exercise 3.4.11 (with $V_{n,0} = 0$). Show that as $n \to \infty$,
\[ (3.4.7) \qquad \frac{n}{\log n} \max_{k=1,\ldots,n} R_{n,k} \stackrel{p}{\to} 1 , \]
and further for each fixed $x \ge 0$,
\[ (3.4.8) \qquad G_n(x) := n^{-1} \sum_{k=1}^n I_{\{R_{n,k} > x/n\}} \stackrel{p}{\to} e^{-x} , \]
\[ (3.4.9) \qquad B_n(x) := P\big( \min_{k=1,\ldots,n} R_{n,k} > x/n^2 \big) \to e^{-x} . \]
As we show next, the Poisson approximation theorem provides a characterization of the Poisson process that is very attractive for modeling real-world phenomena.

Corollary 3.4.15. If $N_t$ is a Poisson process of rate $\lambda > 0$, then for any fixed $k$, $0 < t_1 < t_2 < \cdots < t_k$ and non-negative integers $n_1, n_2, \ldots, n_k$,
\[ P(N_{t_k + h} - N_{t_k} = 1 \,|\, N_{t_j} = n_j, j \le k) = \lambda h + o(h) , \]
\[ P(N_{t_k + h} - N_{t_k} \ge 2 \,|\, N_{t_j} = n_j, j \le k) = o(h) , \]
where $o(h)$ denotes a function $f(h)$ such that $h^{-1} f(h) \to 0$ as $h \downarrow 0$.

Proof. Fixing $k$, the $t_j$ and the $n_j$, denote by $A$ the event $\{N_{t_j} = n_j, j \le k\}$. For a Poisson process of rate $\lambda$ the random variable $N_{t_k + h} - N_{t_k}$ is independent of $A$, with $P(N_{t_k + h} - N_{t_k} = 1) = e^{-\lambda h} \lambda h$ and $P(N_{t_k + h} - N_{t_k} \ge 2) = 1 - e^{-\lambda h}(1 + \lambda h)$. Since $e^{-\lambda h} = 1 - \lambda h + o(h)$ we see that the Poisson process satisfies this corollary. □
Our next exercise explores the phenomenon of thinning, that is, the partitioning of Poisson variables as sums of mutually independent Poisson variables of smaller parameter.

Exercise 3.4.16. Suppose $X_i$ are i.i.d. with $P(X_i = j) = p_j$ for $j = 0, 1, \ldots, k$, and $N$ is a Poisson random variable of parameter $\lambda$ that is independent of $\{X_i\}$. Let
\[ N_j = \sum_{i=1}^{N} I_{\{X_i = j\}} , \qquad j = 0, \ldots, k . \]
(a) Show that the variables $N_j$, $j = 0, 1, \ldots, k$, are mutually independent, with $N_j$ having a Poisson distribution of parameter $\lambda p_j$.
(b) Show that the sub-sequence of jump times $\widehat{T}_k$ obtained by independently keeping with probability $p$ each of the jump times $T_k$ of a Poisson process $N_t$ of rate $\lambda$, yields in turn a Poisson process $\widehat{N}_t$ of rate $p\lambda$.
We conclude this section by noting the superposition property, namely that the sum of two independent Poisson processes is yet another Poisson process.

Exercise 3.4.17. Suppose $N_t = N_t^{(1)} + N_t^{(2)}$ where $N_t^{(1)}$ and $N_t^{(2)}$ are two independent Poisson processes of rates $\lambda_1 > 0$ and $\lambda_2 > 0$, respectively. Show that $N_t$ is a Poisson process of rate $\lambda_1 + \lambda_2$.
3.5. Random vectors and the multivariate clt
The goal of this section is to extend the clt to random vectors, that is, $\mathbb{R}^d$-valued random variables. Towards this end, we revisit in Subsection 3.5.1 the theory of weak convergence, this time in the more general setting of $\mathbb{R}^d$-valued random variables. Subsection 3.5.2 is devoted to the extension of characteristic functions and Lévy's theorems to the multivariate setting, culminating with the Cramér-Wold reduction of convergence in distribution of random vectors to that of their one dimensional linear projections. Finally, in Subsection 3.5.3 we introduce the important concept of Gaussian random vectors and prove the multivariate clt.
3.5.1. Weak convergence revisited. Recall Definition 3.2.17 of weak convergence for a sequence of probability measures on a topological space $\mathbb{S}$, which suggests the following definition for convergence in distribution of $\mathbb{S}$-valued random variables.

Definition 3.5.1. We say that $(\mathbb{S}, \mathcal{B}_{\mathbb{S}})$-valued random variables $X_n$ converge in distribution to a $(\mathbb{S}, \mathcal{B}_{\mathbb{S}})$-valued random variable $X_\infty$, denoted by $X_n \stackrel{D}{\to} X_\infty$, if $\mathcal{P}_{X_n} \stackrel{w}{\to} \mathcal{P}_{X_\infty}$.
As already remarked, the Portmanteau theorem about equivalent characterizations of weak convergence holds also when the probability measures $\mu_n$ are on a Borel measurable space $(\mathbb{S}, \mathcal{B}_{\mathbb{S}})$ with $(\mathbb{S}, \rho)$ any metric space (and in particular for $\mathbb{S} = \mathbb{R}^d$).

Theorem 3.5.2 (Portmanteau theorem). The following five statements are equivalent for any probability measures $\mu_n$, $1 \le n \le \infty$, on $(\mathbb{S}, \mathcal{B}_{\mathbb{S}})$, with $(\mathbb{S}, \rho)$ any metric space.
(a) $\mu_n \stackrel{w}{\to} \mu_\infty$
(b) For every closed set $F$, one has $\limsup_{n \to \infty} \mu_n(F) \le \mu_\infty(F)$
(c) For every open set $G$, one has $\liminf_{n \to \infty} \mu_n(G) \ge \mu_\infty(G)$
(d) For every $\mu_\infty$-continuity set $A$, one has $\lim_{n \to \infty} \mu_n(A) = \mu_\infty(A)$
(e) If the Borel function $g : \mathbb{S} \to \mathbb{R}$ is such that $\mu_\infty(D_g) = 0$, then $\mu_n \circ g^{-1} \stackrel{w}{\to} \mu_\infty \circ g^{-1}$, and if in addition $g$ is bounded then $\mu_n(g) \to \mu_\infty(g)$.
Remark. For $\mathbb{S} = \mathbb{R}$, the equivalence of (a)-(d) is the content of Theorem 3.2.21, while Proposition 3.2.19 derives (e) out of (a) (in the context of convergence in distribution, that is, $X_n \stackrel{D}{\to} X_\infty$ and $P(X_\infty \in D_g) = 0$ implying that $g(X_n) \stackrel{D}{\to} g(X_\infty)$). In addition to proving the converse of the continuous mapping property, we extend the validity of this equivalence to any metric space $(\mathbb{S}, \rho)$, for we shall apply it again in Subsection 9.2, considering there $\mathbb{S} = C([0, \infty))$, the metric space of all continuous functions on $[0, \infty)$.

Proof. The derivation of (b) $\Leftrightarrow$ (c) $\Rightarrow$ (d) in Theorem 3.2.21 applies for any topological space. The direction (e) $\Rightarrow$ (a) is also obvious since $h \in C_b(\mathbb{S})$ has $D_h = \emptyset$ and $C_b(\mathbb{S})$ is a subset of the bounded Borel functions on the same space (c.f. Exercise 1.2.20). So taking $g \in C_b(\mathbb{S})$ in (e) results with (a). It thus remains only to show that (a) $\Rightarrow$ (b) and that (d) $\Rightarrow$ (e), which we proceed to show next.
only to show that (a) (b) and that (d) (e), which we proceed to show next.
(a) (b). Fixing A B
S
let
A
(x) = inf
yA
(x, y) : S [0, ). Since [
A
(x)

A
(x

)[ (x, x

) for any x, x

, it follows that x
A
(x) is a continuous function
on (S, ). Consequently, h
r
(x) = (1 r
A
(x))
+
C
b
(S) for all r 0. Further,

A
(x) = 0 for all x A, implying that h
r
I
A
for all r. Thus, applying part (a)
of the Portmanteau theorem for h
r
we have that
limsup
n

n
(A) lim
n

n
(h
r
) =

(h
r
) .
As
A
(x) = 0 if and only if x A it follows that h
r
I
A
as r , resulting with
limsup
n

n
(A)

(A) .
142 3. WEAK CONVERGENCE, clt AND POISSON APPROXIMATION
Taking A = A = F a closed set, we arrive at part (b) of the theorem.
(d) $\Rightarrow$ (e). Fix a Borel function $g : \mathbb{S} \to \mathbb{R}$ with $K = \sup_x |g(x)| < \infty$ such that $\mu_\infty(D_g) = 0$. Clearly, $\{\alpha \in \mathbb{R} : \mu_\infty(g^{-1}(\{\alpha\})) > 0\}$ is a countable set. Thus, fixing $\varepsilon > 0$ we can pick $\ell < \infty$ and $\alpha_0 < \alpha_1 < \cdots < \alpha_\ell$ such that $\mu_\infty(g^{-1}(\{\alpha_i\})) = 0$ for $0 \le i \le \ell$, $\alpha_0 < -K < K < \alpha_\ell$ and $\alpha_i - \alpha_{i-1} < \varepsilon$ for $1 \le i \le \ell$. Let $A_i = \{x : \alpha_{i-1} < g(x) \le \alpha_i\}$ for $i = 1, \ldots, \ell$, noting that $\partial A_i \subseteq \{x : g(x) = \alpha_{i-1}\} \cup \{x : g(x) = \alpha_i\} \cup D_g$. Consequently, by our assumptions about $g(\cdot)$ and $\{\alpha_i\}$ we have that $\mu_\infty(\partial A_i) = 0$ for each $i = 1, \ldots, \ell$. It thus follows from part (d) of the Portmanteau theorem that
\[ \sum_{i=1}^{\ell} \alpha_i \mu_n(A_i) \to \sum_{i=1}^{\ell} \alpha_i \mu_\infty(A_i) \]
as $n \to \infty$. Our choice of $\alpha_i$ and $A_i$ is such that $g \le \sum_{i=1}^{\ell} \alpha_i I_{A_i} \le g + \varepsilon$, resulting with
\[ \mu_n(g) \le \sum_{i=1}^{\ell} \alpha_i \mu_n(A_i) \le \mu_n(g) + \varepsilon \]
for $n = 1, 2, \ldots, \infty$. Considering first $n \to \infty$ followed by $\varepsilon \downarrow 0$, we establish that $\mu_n(g) \to \mu_\infty(g)$. More generally, recall that $D_{h \circ g} \subseteq D_g$ for any $g : \mathbb{S} \to \mathbb{R}$ and $h \in C_b(\mathbb{R})$. Thus, by the preceding proof $\mu_n(h \circ g) \to \mu_\infty(h \circ g)$ as soon as $\mu_\infty(D_g) = 0$. This applies for every $h \in C_b(\mathbb{R})$, so in this case $\mu_n \circ g^{-1} \stackrel{w}{\to} \mu_\infty \circ g^{-1}$. □
We next show that the relation of Exercise 3.2.6 between convergence in probability and convergence in distribution also extends to any metric space $(\mathbb{S}, \rho)$, a fact we will later use in Subsection 9.2, when considering the metric space of all continuous functions on $[0, \infty)$.

Corollary 3.5.3. If random variables $X_n$, $1 \le n \le \infty$, on the same probability space and taking values in a metric space $(\mathbb{S}, \rho)$ are such that $\rho(X_n, X_\infty) \stackrel{p}{\to} 0$, then $X_n \stackrel{D}{\to} X_\infty$.

Proof. Fixing $h \in C_b(\mathbb{S})$ and $\varepsilon > 0$, we have by continuity of $h(\cdot)$ that $G_r \uparrow \mathbb{S}$, where
\[ G_r = \{ y \in \mathbb{S} : |h(x) - h(y)| \le \varepsilon \text{ whenever } \rho(x, y) \le r^{-1} \} . \]
By definition, if $X_\infty \in G_r$ and $\rho(X_n, X_\infty) \le r^{-1}$ then $|h(X_n) - h(X_\infty)| \le \varepsilon$. Hence, for any $n, r \ge 1$,
\[ E[|h(X_n) - h(X_\infty)|] \le \varepsilon + 2 \|h\|_\infty \big( P(X_\infty \notin G_r) + P(\rho(X_n, X_\infty) > r^{-1}) \big) , \]
where $\|h\|_\infty = \sup_{x \in \mathbb{S}} |h(x)|$ is finite (by the boundedness of $h$). Considering $n \to \infty$ followed by $r \to \infty$, we deduce from the convergence in probability of $\rho(X_n, X_\infty)$ to zero that
\[ \limsup_{n \to \infty} E[|h(X_n) - h(X_\infty)|] \le \varepsilon + 2 \|h\|_\infty \lim_{r \to \infty} P(X_\infty \notin G_r) = \varepsilon . \]
Since this applies for any $\varepsilon > 0$, it follows by the triangle inequality that $E h(X_n) \to E h(X_\infty)$ for all $h \in C_b(\mathbb{S})$, i.e. $X_n \stackrel{D}{\to} X_\infty$. □
Remark. The notion of a distribution function for an $\mathbb{R}^d$-valued random vector $X = (X_1, \ldots, X_d)$ is
\[ F_X(x) = P(X_1 \le x_1, \ldots, X_d \le x_d) . \]
Inducing a partial order on $\mathbb{R}^d$ by $x \le y$ if and only if $y - x$ has only non-negative coordinates, each distribution function $F_X(x)$ has the three properties listed in Theorem 1.2.36. Unfortunately, these three properties are not sufficient for a given function $F : \mathbb{R}^d \to [0, 1]$ to be a distribution function. For example, since the measure of each rectangle $A = \prod_{i=1}^d (a_i, b_i]$ should be non-negative, the additional constraint of the form $\Delta_A F = \sum_{j=1}^{2^d} \pm F(x^j) \ge 0$ should hold if $F(\cdot)$ is to be a distribution function. Here $x^j$ enumerates the $2^d$ corners of the rectangle $A$, and each corner is taken with a positive sign if and only if it has an even number of coordinates from the collection $\{a_1, \ldots, a_d\}$. Adding the fourth property that $\Delta_A F \ge 0$ for each rectangle $A \subseteq \mathbb{R}^d$, we get necessary and sufficient conditions for $F(\cdot)$ to be a distribution function of some $\mathbb{R}^d$-valued random variable (c.f. [Bil95, Theorem 12.5] for a detailed proof).
Recall Definition 3.2.31 of uniform tightness, where for $\mathbb{S} = \mathbb{R}^d$ we can take $K_\varepsilon = [-M_\varepsilon, M_\varepsilon]^d$ with no loss of generality. Though Prohorov's theorem about uniform tightness (i.e. Theorem 3.2.34) is beyond the scope of these notes, we shall only need in the sequel the fact that a uniformly tight sequence of probability measures has at least one limit point. This can be proved for $\mathbb{S} = \mathbb{R}^d$ in a manner similar to what we have done in Theorem 3.2.37 and Lemma 3.2.38 for $\mathbb{S} = \mathbb{R}^1$, using the corresponding concept of distribution function $F_X(\cdot)$ (see [Dur10, Theorem 3.9.2] for more details).
3.5.2. Characteristic function. We start by extending the useful notion of characteristic function to the context of $\mathbb{R}^d$-valued random variables (which we also call hereafter random vectors).
Definition 3.5.4. Adopting the notation $(x, y) = \sum_{i=1}^d x_i y_i$ for $x, y \in \mathbb{R}^d$, a random vector $X = (X_1, X_2, \ldots, X_d)$ with values in $\mathbb{R}^d$ has the characteristic function
$$\Phi_X(\theta) = E[e^{i(\theta, X)}],$$
where $\theta = (\theta_1, \theta_2, \ldots, \theta_d) \in \mathbb{R}^d$ and $i = \sqrt{-1}$.
Remark. The characteristic function $\Phi_X : \mathbb{R}^d \to \mathbb{C}$ exists for any $X$ since
(3.5.1) $\quad e^{i(\theta, X)} = \cos(\theta, X) + i \sin(\theta, X)$,
with both real and imaginary parts being bounded (hence integrable) random variables. Actually, it is easy to check that all five properties of Proposition 3.3.2 hold, where part (e) is modified to $\Phi_{A^t X + b}(\theta) = \exp(i(b, \theta)) \Phi_X(A\theta)$, for any non-random $d \times d$-dimensional matrix $A$ and $b \in \mathbb{R}^d$ (with $A^t$ denoting the transpose of the matrix $A$).
Here is the extension of the notion of probability density function (as in Definition 1.2.39) to a random vector.
Definition 3.5.5. Suppose $f_X$ is a non-negative Borel measurable function with $\int_{\mathbb{R}^d} f_X(x)\,dx = 1$. We say that a random vector $X = (X_1, \ldots, X_d)$ has a probability density function $f_X(\cdot)$ if for every $b = (b_1, \ldots, b_d)$,
$$F_X(b) = \int_{-\infty}^{b_1} \cdots \int_{-\infty}^{b_d} f_X(x_1, \ldots, x_d)\,dx_d \cdots dx_1$$
(such $f_X$ is sometimes called the joint density of $X_1, \ldots, X_d$). This is the same as saying that the law of $X$ is of the form $f_X \lambda^d$ with $\lambda^d$ the $d$-fold product Lebesgue measure on $\mathbb{R}^d$ (i.e. the $d > 1$ extension of Example 1.3.60).
Example 3.5.6. We have the following extension of the Fourier transform formula (3.3.4) to random vectors $X$ with density,
$$\Phi_X(\theta) = \int_{\mathbb{R}^d} e^{i(\theta, x)} f_X(x)\,dx$$
(this is merely a special case of the extension of Corollary 1.3.62 to $h : \mathbb{R}^d \to \mathbb{R}$).
We next state and prove the corresponding extension of Lévy's inversion theorem.
Theorem 3.5.7 (Lévy's inversion theorem). Suppose $\Phi_X(\theta)$ is the characteristic function of a random vector $X = (X_1, \ldots, X_d)$ whose law is $\mathcal{P}_X$, a probability measure on $(\mathbb{R}^d, \mathcal{B}_{\mathbb{R}^d})$. If $A = [a_1, b_1] \times \cdots \times [a_d, b_d]$ with $\mathcal{P}_X(\partial A) = 0$, then
(3.5.2) $\quad \mathcal{P}_X(A) = \lim_{T \to \infty} \int_{[-T,T]^d} \prod_{j=1}^d \Psi_{a_j, b_j}(\theta_j)\, \Phi_X(\theta)\,d\theta$
for $\Psi_{a,b}(\theta)$ of (3.3.5). Further, the characteristic function determines the law of a random vector. That is, if $\Phi_X(\theta) = \Phi_Y(\theta)$ for all $\theta$ then $X$ has the same law as $Y$.
Proof. We derive (3.5.2) by adapting the proof of Theorem 3.3.12. First apply Fubini's theorem with respect to the product of Lebesgue's measure on $[-T,T]^d$ and the law of $X$ (both of which are finite measures on $\mathbb{R}^d$) to get the identity
$$J_T(a, b) := \int_{[-T,T]^d} \prod_{j=1}^d \Psi_{a_j, b_j}(\theta_j)\, \Phi_X(\theta)\,d\theta = \int_{\mathbb{R}^d} \Big[\prod_{j=1}^d \int_{-T}^{T} h_{a_j, b_j}(x_j, \theta_j)\,d\theta_j\Big]\, d\mathcal{P}_X(x)$$
(where $h_{a,b}(x, \theta) = \Psi_{a,b}(\theta) e^{i\theta x}$). In the course of proving Theorem 3.3.12 we have seen that for $j = 1, \ldots, d$ the integral over $\theta_j$ is uniformly bounded in $T$ and that it converges to $g_{a_j, b_j}(x_j)$ as $T \to \infty$. Thus, by bounded convergence it follows that
$$\lim_{T \to \infty} J_T(a, b) = \int_{\mathbb{R}^d} g_{a,b}(x)\,d\mathcal{P}_X(x),$$
where
$$g_{a,b}(x) = \prod_{j=1}^d g_{a_j, b_j}(x_j)$$
is zero on $(\overline{A})^c$ and one on $A^o$ (see the explicit formula for $g_{a,b}(x)$ provided there). So, our assumption that $\mathcal{P}_X(\partial A) = 0$ implies that the limit of $J_T(a, b)$ as $T \to \infty$ is merely $\mathcal{P}_X(A)$, thus establishing (3.5.2).

Suppose now that $\Phi_X(\theta) = \Phi_Y(\theta)$ for all $\theta$. Adapting the proof of Corollary 3.3.14 to the current setting, let $\mathcal{A} = \{\alpha \in \mathbb{R} : P(X_j = \alpha) > 0 \text{ or } P(Y_j = \alpha) > 0 \text{ for some } j = 1, \ldots, d\}$, noting that if all the coordinates $a_j, b_j$, $j = 1, \ldots, d$, of a rectangle $A$ are from the complement of $\mathcal{A}$ then both $\mathcal{P}_X(\partial A) = 0$ and $\mathcal{P}_Y(\partial A) = 0$. Thus, by (3.5.2) we have that $\mathcal{P}_X(A) = \mathcal{P}_Y(A)$ for any $A$ in the collection $\mathcal{C}$ of rectangles with coordinates in the complement of $\mathcal{A}$. Recall that $\mathcal{A}$ is countable, so for any rectangle $A$ there exist $A_n \in \mathcal{C}$ such that $A_n \downarrow A$, and by continuity from above of both $\mathcal{P}_X$ and $\mathcal{P}_Y$ it follows that $\mathcal{P}_X(A) = \mathcal{P}_Y(A)$ for every rectangle $A$. In view of Proposition 1.1.39 and Exercise 1.1.21 this implies that the probability measures $\mathcal{P}_X$ and $\mathcal{P}_Y$ agree on all Borel subsets of $\mathbb{R}^d$. □
We next provide the ingredients needed when using characteristic functions en route to the derivation of a convergence in distribution result for random vectors. To this end, we start with the following analog of Lemma 3.3.16.
Lemma 3.5.8. Suppose the random vectors $X_n$, $1 \le n \le \infty$, on $\mathbb{R}^d$ are such that $\Phi_{X_n}(\theta) \to \Phi_{X_\infty}(\theta)$ as $n \to \infty$ for each $\theta \in \mathbb{R}^d$. Then, the corresponding sequence of laws $\{\mathcal{P}_{X_n}\}$ is uniformly tight.
Proof. Fixing $\theta \in \mathbb{R}^d$ consider the sequence of random variables $Y_n = (\theta, X_n)$. Since $\Phi_{Y_n}(s) = \Phi_{X_n}(s\theta)$ for $1 \le n \le \infty$, we have that $\Phi_{Y_n}(s) \to \Phi_{Y_\infty}(s)$ for all $s \in \mathbb{R}$. The uniform tightness of the laws of $Y_n$ then follows by Lemma 3.3.16. Considering $\theta_1, \ldots, \theta_d$ which are the unit vectors in the $d$ different coordinates, we have the uniform tightness of the laws of $X_{n,j}$ for the sequence of random vectors $X_n = (X_{n,1}, X_{n,2}, \ldots, X_{n,d})$ and each fixed coordinate $j = 1, \ldots, d$. For the compact sets $K_\epsilon = [-M_\epsilon, M_\epsilon]^d$ and all $n$,
$$P(X_n \notin K_\epsilon) \le \sum_{j=1}^d P(|X_{n,j}| > M_\epsilon).$$
As $d$ is finite, this leads from the uniform tightness of the laws of $X_{n,j}$ for each $j = 1, \ldots, d$ to the uniform tightness of the laws of $X_n$. □
Equipped with Lemma 3.5.8 we are ready to state and prove Lévy's continuity theorem.
Theorem 3.5.9 (Lévy's continuity theorem). Let $X_n$, $1 \le n \le \infty$, be random vectors with characteristic functions $\Phi_{X_n}(\theta)$. Then, $X_n \xrightarrow{D} X_\infty$ if and only if $\Phi_{X_n}(\theta) \to \Phi_{X_\infty}(\theta)$ as $n \to \infty$ for each fixed $\theta \in \mathbb{R}^d$.
Proof. This is a re-run of the proof of Theorem 3.3.17, adapted to $\mathbb{R}^d$-valued random variables. First, both $x \mapsto \cos((\theta, x))$ and $x \mapsto \sin((\theta, x))$ are bounded continuous functions, so if $X_n \xrightarrow{D} X_\infty$, then clearly as $n \to \infty$,
$$\Phi_{X_n}(\theta) = E\big[e^{i(\theta, X_n)}\big] \to E\big[e^{i(\theta, X_\infty)}\big] = \Phi_{X_\infty}(\theta).$$
For the converse direction, assuming that $\Phi_{X_n} \to \Phi_{X_\infty}$ point-wise, we know from Lemma 3.5.8 that the collection $\{\mathcal{P}_{X_n}\}$ is uniformly tight. Hence, by Prohorov's theorem, for every subsequence $n(m)$ there is a further sub-subsequence $n(m_k)$ such that $\mathcal{P}_{X_{n(m_k)}}$ converges weakly to some probability measure $\mathcal{P}_Y$, possibly dependent upon the choice of $n(m)$. As $X_{n(m_k)} \xrightarrow{D} Y$, we have by the preceding part of the proof that $\Phi_{X_{n(m_k)}} \to \Phi_Y$, and necessarily $\Phi_Y = \Phi_{X_\infty}$. The characteristic function determines the law (see Theorem 3.5.7), so $Y \stackrel{D}{=} X_\infty$ is independent of the choice of $n(m)$. Thus, fixing $h \in C_b(\mathbb{R}^d)$, the sequence $y_n = Eh(X_n)$ is such that every subsequence $y_{n(m)}$ has a further sub-subsequence $y_{n(m_k)}$ that converges to $y_\infty$. Consequently, $y_n \to y_\infty$ (see Lemma 2.2.11). This applies for all $h \in C_b(\mathbb{R}^d)$, so we conclude that $X_n \xrightarrow{D} X_\infty$, as stated. □
Remark. As in the case of Theorem 3.3.17, it is not hard to show that if $\Phi_{X_n}(\theta) \to \Phi(\theta)$ as $n \to \infty$ and $\Phi(\cdot)$ is continuous at $\theta = 0$, then $\Phi$ is necessarily the characteristic function of some random vector $X_\infty$ and consequently $X_n \xrightarrow{D} X_\infty$.
The proof of the multivariate clt is just one of the results that rely on the following immediate corollary of Lévy's continuity theorem.
Corollary 3.5.10 (Cramér-Wold device). A sufficient condition for $X_n \xrightarrow{D} X_\infty$ is that $(\theta, X_n) \xrightarrow{D} (\theta, X_\infty)$ for each $\theta \in \mathbb{R}^d$.
Proof. Since $(\theta, X_n) \xrightarrow{D} (\theta, X_\infty)$ it follows by Lévy's continuity theorem (for $d = 1$, that is, Theorem 3.3.17), that
$$\lim_{n \to \infty} E\big[e^{i(\theta, X_n)}\big] = E\big[e^{i(\theta, X_\infty)}\big].$$
As this applies for any $\theta \in \mathbb{R}^d$, we get that $X_n \xrightarrow{D} X_\infty$ by applying Lévy's continuity theorem in $\mathbb{R}^d$ (i.e., Theorem 3.5.9), now in the converse direction. □
Remark. Beware that it is not enough to consider only finitely many values of $\theta$ in the Cramér-Wold device. For example, consider the random vectors $(X_n, Y_n)$ with $\{X_n, Y_{2n}\}$ i.i.d. and $Y_{2n+1} = X_{2n+1}$. Convince yourself that in this case $X_n \xrightarrow{D} X_1$ and $Y_n \xrightarrow{D} Y_1$ but the random vectors $(X_n, Y_n)$ do not converge in distribution (to any limit).
The computation of the characteristic function is much simplified in the presence of independence.
Exercise 3.5.11. Show that if $Y = (Y_1, \ldots, Y_d)$ with $Y_k$ mutually independent R.V., then for all $\theta = (\theta_1, \ldots, \theta_d) \in \mathbb{R}^d$,
(3.5.3) $\quad \Phi_Y(\theta) = \prod_{k=1}^d \Phi_{Y_k}(\theta_k)$.
Conversely, show that if (3.5.3) holds for all $\theta \in \mathbb{R}^d$, then the random variables $Y_k$, $k = 1, \ldots, d$, are mutually independent of each other.
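As an illustration of the product formula (not from the text), for $Y_1, Y_2$ i.i.d. Exp$(1)$ one has $\Phi_{Y_k}(t) = 1/(1 - it)$, so the joint characteristic function should equal the product $1/((1 - it_1)(1 - it_2))$; the helper name `mc_phi` and the sample size are hypothetical.

```python
import cmath
import random

random.seed(1)
n = 40000
y1 = [random.expovariate(1.0) for _ in range(n)]
y2 = [random.expovariate(1.0) for _ in range(n)]  # independent of y1

def mc_phi(values, t):
    """Monte Carlo estimate of E[exp(i t Y)] for a scalar sample."""
    return sum(cmath.exp(1j * t * v) for v in values) / len(values)

t1, t2 = 0.8, -1.3
# joint characteristic function of Y = (Y1, Y2) at theta = (t1, t2)
phi_joint = sum(cmath.exp(1j * (t1 * a + t2 * b)) for a, b in zip(y1, y2)) / n
phi_prod = mc_phi(y1, t1) * mc_phi(y2, t2)
# for Exp(1): Phi(t) = 1 / (1 - i t), so the product has a closed form
exact = 1 / ((1 - 1j * t1) * (1 - 1j * t2))
print(abs(phi_joint - exact), abs(phi_prod - exact))
```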
3.5.3. Gaussian random vectors and the multivariate clt. Recall the
following linear algebra concept.
Definition 3.5.12. A $d \times d$ matrix $A$ with entries $A_{jk}$ is called non-negative definite (or positive semidefinite) if $A_{jk} = A_{kj}$ for all $j, k$, and for any $\theta \in \mathbb{R}^d$,
$$(\theta, A\theta) = \sum_{j=1}^d \sum_{k=1}^d \theta_j A_{jk} \theta_k \ge 0.$$
We are ready to define the class of multivariate normal distributions via the corresponding characteristic functions.
Definition 3.5.13. We say that a random vector $X = (X_1, X_2, \ldots, X_d)$ is Gaussian, or alternatively that it has a multivariate normal distribution, if
(3.5.4) $\quad \Phi_X(\theta) = e^{-\frac{1}{2}(\theta, V\theta)} e^{i(\theta, \mu)}$,
for some non-negative definite $d \times d$ matrix $V$, some $\mu = (\mu_1, \ldots, \mu_d) \in \mathbb{R}^d$ and all $\theta = (\theta_1, \ldots, \theta_d) \in \mathbb{R}^d$. We denote such a law by $\mathcal{N}(\mu, V)$.
Remark. For $d = 1$ this definition coincides with Example 3.3.6.
Our next proposition proves that the multivariate $\mathcal{N}(\mu, V)$ distribution is well defined and further links the vector $\mu$ and the matrix $V$ to the first two moments of this distribution.
Proposition 3.5.14. The formula (3.5.4) corresponds to the characteristic function of a probability measure on $\mathbb{R}^d$. Further, the parameters $\mu$ and $V$ of the Gaussian random vector $X$ are merely $\mu_j = EX_j$ and $V_{jk} = \mathrm{Cov}(X_j, X_k)$, $j, k = 1, \ldots, d$.
Proof. Any non-negative definite matrix $V$ can be written as $V = U^t D^2 U$ for some orthogonal matrix $U$ (i.e., such that $U^t U = I$, the $d \times d$-dimensional identity matrix), and some diagonal matrix $D$. Consequently,
$$(\theta, V\theta) = (A\theta, A\theta)$$
for $A = DU$ and all $\theta \in \mathbb{R}^d$. We claim that (3.5.4) is the characteristic function of the random vector $X = A^t Y + \mu$, where $Y = (Y_1, \ldots, Y_d)$ has i.i.d. coordinates $Y_k$, each of which has the standard normal distribution. Indeed, by Exercise 3.5.11, $\Phi_Y(\theta) = \exp(-\frac{1}{2}(\theta, \theta))$ is the product of the characteristic functions $\exp(-\theta_k^2/2)$ of the standard normal distribution (see Example 3.3.6), and by part (e) of Proposition 3.3.2, $\Phi_X(\theta) = \exp(i(\theta, \mu)) \Phi_Y(A\theta)$, yielding the formula (3.5.4).

We have just shown that $X$ has the $\mathcal{N}(\mu, V)$ distribution if $X = A^t Y + \mu$ for a Gaussian random vector $Y$ (whose distribution is $\mathcal{N}(0, I)$), such that $EY_j = 0$ and $\mathrm{Cov}(Y_j, Y_k) = 1_{j=k}$ for $j, k = 1, \ldots, d$. It thus follows by linearity of the expectation and the bi-linearity of the covariance that $EX_j = \mu_j$ and $\mathrm{Cov}(X_j, X_k) = [E\, A^t Y (A^t Y)^t]_{jk} = (A^t I A)_{jk} = V_{jk}$, as claimed. □
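The construction $X = A^t Y + \mu$ with $A^t A = V$ is also how one samples $\mathcal{N}(\mu, V)$ in practice; any factor with $L L^t = V$ works, e.g. a Cholesky factor. A minimal stdlib sketch for $d = 2$ (the helper names `chol2` and `sample_normal` are hypothetical, and the factorization used is Cholesky rather than the $DU$ factor of the proof):

```python
import math
import random

def chol2(V):
    """Lower-triangular L with L L^t = V, for a 2x2 positive definite V."""
    a, b, c = V[0][0], V[0][1], V[1][1]
    l11 = math.sqrt(a)
    l21 = b / l11
    l22 = math.sqrt(c - l21 * l21)
    return [[l11, 0.0], [l21, l22]]

def sample_normal(mu, L):
    """X = L Y + mu with Y a pair of i.i.d. standard normals."""
    y = (random.gauss(0, 1), random.gauss(0, 1))
    return (mu[0] + L[0][0] * y[0],
            mu[1] + L[1][0] * y[0] + L[1][1] * y[1])

random.seed(2)
mu, V = (1.0, -2.0), [[2.0, 0.6], [0.6, 1.0]]
L = chol2(V)
xs = [sample_normal(mu, L) for _ in range(50000)]
m0 = sum(x[0] for x in xs) / len(xs)
m1 = sum(x[1] for x in xs) / len(xs)
cov01 = sum((x[0] - m0) * (x[1] - m1) for x in xs) / len(xs)
print(m0, m1, cov01)  # empirical mean and cross-covariance
```

The empirical mean and covariance should be close to the prescribed $\mu$ and $V_{12}$, in line with the proposition.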
Definition 3.5.13 allows for $V$ that is non-invertible, so for example the constant random vector $X = \mu$ is considered a Gaussian random vector though it obviously does not have a density. The reason we make this choice is to have the collection of multivariate normal distributions closed with respect to $L^2$-convergence, as we prove below to be the case.
Proposition 3.5.15. Suppose Gaussian random vectors $X_n$ converge in $L^2$ to a random vector $X_\infty$, that is, $E[\|X_n - X_\infty\|^2] \to 0$ as $n \to \infty$. Then, $X_\infty$ is a Gaussian random vector, whose parameters are the limits of the corresponding parameters of $X_n$.
Proof. Recall that the convergence in $L^2$ of $X_n$ to $X_\infty$ implies that $\mu_n = EX_n$ converge to $\mu_\infty = EX_\infty$ and the element-wise convergence of the covariance matrices $V_n$ to the corresponding covariance matrix $V_\infty$. Further, the $L^2$-convergence implies the corresponding convergence in probability and hence, by bounded convergence, $\Phi_{X_n}(\theta) \to \Phi_{X_\infty}(\theta)$ for each $\theta \in \mathbb{R}^d$. Since
$$\Phi_{X_n}(\theta) = e^{-\frac{1}{2}(\theta, V_n \theta)} e^{i(\theta, \mu_n)}$$
for any $n < \infty$, it follows that the same applies for $n = \infty$. It is a well known fact of linear algebra that the element-wise limit $V_\infty$ of non-negative definite matrices $V_n$ is necessarily also non-negative definite. In view of Definition 3.5.13, we see that the limit $X_\infty$ is a Gaussian random vector, whose parameters are the limits of the corresponding parameters of $X_n$. □
One of the main reasons for the importance of the multivariate normal distribution
is the following clt (which is the multivariate extension of Proposition 3.1.2).
Theorem 3.5.16 (Multivariate clt). Let $\widehat{S}_n = n^{-\frac{1}{2}} \sum_{k=1}^n (X_k - \mu)$, where $X_k$ are i.i.d. random vectors with finite second moments and such that $\mu = EX_1$. Then, $\widehat{S}_n \xrightarrow{D} G$, with $G$ having the $\mathcal{N}(0, V)$ distribution and where $V$ is the $d \times d$-dimensional covariance matrix of $X_1$.
Proof. Consider the i.i.d. random vectors $Y_k = X_k - \mu$, each having also the covariance matrix $V$. Fixing an arbitrary vector $\theta \in \mathbb{R}^d$ we proceed to show that $(\theta, \widehat{S}_n) \xrightarrow{D} (\theta, G)$, which in view of the Cramér-Wold device completes the proof of the theorem. Indeed, note that $(\theta, \widehat{S}_n) = n^{-\frac{1}{2}} \sum_{k=1}^n Z_k$, where $Z_k = (\theta, Y_k)$ are i.i.d. $\mathbb{R}$-valued random variables, having zero mean and variance
$$v_\theta = \mathrm{Var}(Z_1) = E[(\theta, Y_1)^2] = (\theta, E[Y_1 Y_1^t]\,\theta) = (\theta, V\theta).$$
Observing that the clt of Proposition 3.1.2 thus applies to $(\theta, \widehat{S}_n)$, it remains only to verify that the resulting limit distribution $\mathcal{N}(0, v_\theta)$ is indeed the law of $(\theta, G)$. To this end note that by Definitions 3.5.4 and 3.5.13, for any $s \in \mathbb{R}$,
$$\Phi_{(\theta, G)}(s) = \Phi_G(s\theta) = e^{-\frac{1}{2} s^2 (\theta, V\theta)} = e^{-v_\theta s^2/2},$$
which is the characteristic function of the $\mathcal{N}(0, v_\theta)$ distribution (see Example 3.3.6). Since the characteristic function uniquely determines the law (see Corollary 3.3.14), we are done. □
Here is an explicit example for which the multivariate clt applies.
Example 3.5.17. The simple random walk on $\mathbb{Z}^d$ is $S_n = \sum_{k=1}^n X_k$ where $X$, $X_k$ are i.i.d. random vectors such that
$$P(X = +e_i) = P(X = -e_i) = \frac{1}{2d}, \quad i = 1, \ldots, d,$$
and $e_i$ is the unit vector in the $i$-th direction, $i = 1, \ldots, d$. In this case $EX = 0$ and if $i \ne j$ then $EX_i X_j = 0$, resulting with the covariance matrix $V = (1/d)I$ for the multivariate normal limit in distribution of $n^{-1/2} S_n$.
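A small simulation (not part of the text) makes the $V = (1/d)I$ limit visible for $d = 2$: the scaled endpoints $n^{-1/2} S_n$ should have per-coordinate variance near $1/2$ and near-zero cross-covariance. The parameters `n` and `reps` are arbitrary choices.

```python
import random

random.seed(3)
d, n, reps = 2, 400, 2000

def walk_endpoint():
    """Endpoint of an n-step simple random walk on Z^d, scaled by n^{-1/2}."""
    pos = [0] * d
    for _ in range(n):
        i = random.randrange(d)          # pick a coordinate direction
        pos[i] += random.choice((-1, 1))  # step +e_i or -e_i
    return [p / n ** 0.5 for p in pos]

pts = [walk_endpoint() for _ in range(reps)]
var0 = sum(p[0] ** 2 for p in pts) / reps
cov01 = sum(p[0] * p[1] for p in pts) / reps
print(var0, cov01)  # near 1/d = 0.5 and near 0, matching V = (1/d) I
```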
Building on Lindeberg's clt for weighted sums of i.i.d. random variables, the following multivariate normal limit is the basis for the convergence of random walks to Brownian motion, to which Section 9.2 is devoted.
Exercise 3.5.18. Suppose $\xi_k$ are i.i.d. with $E\xi_1 = 0$ and $E\xi_1^2 = 1$. Consider the random functions $\widehat{S}_n(t) = n^{-1/2} S(nt)$ where $S(t) = \sum_{k=1}^{[t]} \xi_k + (t - [t])\xi_{[t]+1}$ and $[t]$ denotes the integer part of $t$.
(a) Verify that Lindeberg's clt applies for $\widehat{S}_n = \sum_{k=1}^n a_{n,k} \xi_k$ whenever the non-random $a_{n,k}$ are such that $r_n = \max\{|a_{n,k}| : k = 1, \ldots, n\} \to 0$ and $v_n = \sum_{k=1}^n a_{n,k}^2 \to 1$.
(b) Let $c(s, t) = \min(s, t)$ and fixing $0 = t_0 \le t_1 < \cdots < t_d$, denote by $C$ the $d \times d$ matrix of entries $C_{jk} = c(t_j, t_k)$. Show that for any $\theta \in \mathbb{R}^d$,
$$\sum_{r=1}^d (t_r - t_{r-1}) \Big(\sum_{j=r}^d \theta_j\Big)^2 = (\theta, C\theta).$$
(c) Using the Cramér-Wold device deduce that $(\widehat{S}_n(t_1), \ldots, \widehat{S}_n(t_d)) \xrightarrow{D} G$ with $G$ having the $\mathcal{N}(0, C)$ distribution.
As we see in the next exercise, there is more to a Gaussian random vector than
each coordinate having a normal distribution.
Exercise 3.5.19. Suppose $X_1$ has a standard normal distribution and $S$ is independent of $X_1$ and such that $P(S = 1) = P(S = -1) = 1/2$.
(a) Check that $X_2 = SX_1$ also has a standard normal distribution.
(b) Check that $X_1$ and $X_2$ are uncorrelated random variables, each having the standard normal distribution, while $X = (X_1, X_2)$ is not a Gaussian random vector and where $X_1$ and $X_2$ are not independent variables.
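The phenomenon in this exercise is easy to observe numerically (this sketch is not part of the text): $X_1 X_2$ averages to zero (uncorrelated), yet $E[X_1^2 X_2^2] = E[X_1^4] = 3 \ne 1 = E[X_1^2]E[X_2^2]$ (dependent), and $X_1 + X_2$ puts mass $1/2$ at the point $0$, so it cannot be normal.

```python
import random

random.seed(4)
n = 100000
pairs = []
for _ in range(n):
    x1 = random.gauss(0, 1)
    s = random.choice((-1, 1))   # sign independent of x1
    pairs.append((x1, s * x1))   # x2 = s * x1

corr = sum(a * b for a, b in pairs) / n        # E[X1 X2] = E[S] E[X1^2] = 0
m22 = sum((a * b) ** 2 for a, b in pairs) / n  # E[X1^2 X2^2] = E[X1^4] = 3
# X1 + X2 = (1 + S) X1 vanishes exactly when S = -1, i.e. with prob. 1/2
frac_zero = sum(1 for a, b in pairs if a + b == 0.0) / n
print(corr, m22, frac_zero)
```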
Motivated by the proof of Proposition 3.5.14, here is an important property of Gaussian random vectors which may also be considered to be an alternative to Definition 3.5.13.
Exercise 3.5.20. A random vector $X$ has the multivariate normal distribution if and only if $\big(\sum_{i=1}^d a_{ji} X_i,\ j = 1, \ldots, m\big)$ is a Gaussian random vector for any non-random coefficients $a_{11}, a_{12}, \ldots, a_{md} \in \mathbb{R}$.
The classical definition of the multivariate normal density applies for a strict subset of the distributions we consider in Definition 3.5.13.
Definition 3.5.21. We say that $X$ has a non-degenerate multivariate normal distribution if the matrix $V$ is invertible, or alternatively, when $V$ is a (strictly) positive definite matrix, that is, $(\theta, V\theta) > 0$ whenever $\theta \ne 0$.
We next relate the density of a random vector with its characteristic function, and
provide the density for the non-degenerate multivariate normal distribution.
Exercise 3.5.22.
(a) Show that if $\int_{\mathbb{R}^d} |\Phi_X(\theta)|\,d\theta < \infty$, then $X$ has the bounded continuous probability density function
(3.5.5) $\quad f_X(x) = \frac{1}{(2\pi)^d} \int_{\mathbb{R}^d} e^{-i(\theta, x)} \Phi_X(\theta)\,d\theta$.
(b) Show that a random vector $X$ with a non-degenerate multivariate normal distribution $\mathcal{N}(\mu, V)$ has the probability density function
$$f_X(x) = (2\pi)^{-d/2} (\det V)^{-1/2} \exp\Big(-\frac{1}{2}\big(x - \mu, V^{-1}(x - \mu)\big)\Big).$$
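A direct way to sanity-check the density in part (b) is to evaluate it on a grid and confirm it sums to (approximately) one; the sketch below hard-codes $d = 2$, and the function name `normal_pdf2` plus the grid parameters are hypothetical choices.

```python
import math

def normal_pdf2(x, mu, V):
    """Density of a non-degenerate N(mu, V) in d = 2, per the formula above."""
    a, b, c = V[0][0], V[0][1], V[1][1]
    det = a * c - b * b                  # det V, positive by non-degeneracy
    inv00, inv01, inv11 = c / det, -b / det, a / det  # entries of V^{-1}
    u = (x[0] - mu[0], x[1] - mu[1])
    quad = inv00 * u[0] ** 2 + 2 * inv01 * u[0] * u[1] + inv11 * u[1] ** 2
    return math.exp(-quad / 2) / (2 * math.pi * math.sqrt(det))

# Riemann-sum check that the density integrates to (approximately) 1
mu, V = (0.5, -1.0), [[1.5, 0.4], [0.4, 0.8]]
h, R = 0.05, 8.0
total = sum(normal_pdf2((mu[0] + i * h, mu[1] + j * h), mu, V) * h * h
            for i in range(int(-R / h), int(R / h))
            for j in range(int(-R / h), int(R / h)))
print(total)  # close to 1
```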
Here is an application to the uniform distribution over the sphere in $\mathbb{R}^n$, as $n \to \infty$.
Exercise 3.5.23. Suppose $Y_k$ are i.i.d. random variables with $EY_1^2 = 1$ and $EY_1 = 0$. Let $W_n = n^{-1} \sum_{k=1}^n Y_k^2$ and $X_{n,k} = Y_k/\sqrt{W_n}$ for $k = 1, \ldots, n$.
(a) Noting that $W_n \xrightarrow{a.s.} 1$, deduce that $X_{n,1} \xrightarrow{D} Y_1$.
(b) Show that $n^{-1/2} \sum_{k=1}^n X_{n,k} \xrightarrow{D} G$ whose distribution is $\mathcal{N}(0, 1)$.
(c) Show that if $Y_k$ are standard normal random variables, then the random vector $X_n = (X_{n,1}, \ldots, X_{n,n})$ has the uniform distribution over the surface of the sphere of radius $\sqrt{n}$ in $\mathbb{R}^n$ (i.e., the unique measure supported on this sphere and invariant under orthogonal transformations), and interpret the preceding results for this special case.
We conclude the section with the following exercise, which is a multivariate, Lindeberg's type clt.
Exercise 3.5.24. Let $y^t$ denote the transpose of the vector $y \in \mathbb{R}^d$ and $\|y\|$ its Euclidean norm. The independent random vectors $Y_k$ on $\mathbb{R}^d$ are such that $Y_k \stackrel{D}{=} -Y_k$,
$$\lim_{n \to \infty} \sum_{k=1}^n P(\|Y_k\| > \sqrt{n}) = 0,$$
and for some symmetric, (strictly) positive definite matrix $V$ and any fixed $\epsilon \in (0, 1]$,
$$\lim_{n \to \infty} n^{-1} \sum_{k=1}^n E\big(Y_k Y_k^t I_{\|Y_k\| \le \epsilon\sqrt{n}}\big) = V.$$
(a) Let $T_n = \sum_{k=1}^n X_{n,k}$ for $X_{n,k} = n^{-1/2} Y_k I_{\|Y_k\| \le \sqrt{n}}$. Show that $T_n \xrightarrow{D} G$, with $G$ having the $\mathcal{N}(0, V)$ multivariate normal distribution.
(b) Let $\widehat{S}_n = n^{-1/2} \sum_{k=1}^n Y_k$ and show that $\widehat{S}_n \xrightarrow{D} G$.
(c) Show that $(\widehat{S}_n)^t V^{-1} \widehat{S}_n \xrightarrow{D} Z$ and identify the law of $Z$.
CHAPTER 4
Conditional expectations and probabilities
The most important concept in probability theory is the conditional expectation, to which this chapter is devoted. In contrast with the elementary definition often used for a finite or countable sample space, the conditional expectation, as defined in Section 4.1, is itself a random variable. Section 4.2 details the important properties of the conditional expectation. Section 4.3 provides a representation of the conditional expectation as an orthogonal projection in Hilbert space. Finally, in Section 4.4 we represent the conditional expectation also as the expectation with respect to the random regular conditional probability distribution.
4.1. Conditional expectation: existence and uniqueness
In Subsection 4.1.1 we review the elementary definition of the conditional expectation $E(X|Y)$ in case of discrete valued R.V.-s $X$ and $Y$. This motivates our formal definition of the conditional expectation for any pair of R.V.s such that $X$ is integrable. The existence and uniqueness of the conditional expectation is shown there based on the Radon-Nikodym theorem, the proof of which we provide in Subsection 4.1.2.
4.1.1. Conditional expectation: motivation and definition. Suppose the R.V.s $X$ and $Z$ on a probability space $(\Omega, \mathcal{F}, P)$ are both simple functions. More precisely, let $X$ take the distinct values $x_1, \ldots, x_m \in \mathbb{R}$ and $Z$ take the distinct values $z_1, \ldots, z_n \in \mathbb{R}$, where without loss of generality we assume that $P(Z = z_i) > 0$ for $i = 1, \ldots, n$. Then, from elementary probability theory, we know that for any $i = 1, \ldots, n$, $j = 1, \ldots, m$,
$$P(X = x_j \mid Z = z_i) = \frac{P(X = x_j, Z = z_i)}{P(Z = z_i)},$$
and we can compute the corresponding conditional expectation
$$E[X \mid Z = z_i] = \sum_{j=1}^m x_j P(X = x_j \mid Z = z_i).$$
Noting that this conditional expectation is a function of $\omega$ (via the value of $Z(\omega)$), we define the R.V. $Y = E[X|Z]$ on the same probability space such that $Y(\omega) = E[X \mid Z = z_i]$ whenever $\omega$ is such that $Z(\omega) = z_i$.
Example 4.1.1. Suppose that $X = \omega_1$ and $Z = \omega_2$ on the probability space $(\Omega, \mathcal{F}, P)$ with $\mathcal{F} = 2^\Omega$, $\Omega = \{1, 2\}^2$ and
$$P(\{1,1\}) = 0.5, \quad P(\{1,2\}) = 0.1, \quad P(\{2,1\}) = 0.1, \quad P(\{2,2\}) = 0.3.$$
Then,
$$P(X = 1 \mid Z = 1) = \frac{P(X = 1, Z = 1)}{P(Z = 1)} = \frac{5}{6},$$
implying that $P(X = 2 \mid Z = 1) = \frac{1}{6}$ and
$$E[X \mid Z = 1] = 1 \cdot \frac{5}{6} + 2 \cdot \frac{1}{6} = \frac{7}{6}.$$
Likewise, check that $E[X \mid Z = 2] = \frac{7}{4}$, hence $E[X|Z] = \frac{7}{6} I_{\{Z=1\}} + \frac{7}{4} I_{\{Z=2\}}$.
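The arithmetic of this example can be reproduced mechanically from the joint table (the helper name `cond_exp` below is a hypothetical illustration, not notation from the text):

```python
# Joint pmf of (X, Z) on {1, 2}^2, as in the example (omega = (omega1, omega2))
p = {(1, 1): 0.5, (1, 2): 0.1, (2, 1): 0.1, (2, 2): 0.3}

def cond_exp(z):
    """E[X | Z = z] computed directly from the joint table."""
    pz = sum(pr for (x, zz), pr in p.items() if zz == z)
    return sum(x * pr for (x, zz), pr in p.items() if zz == z) / pz

print(cond_exp(1), cond_exp(2))  # 7/6 and 7/4
```

Averaging $E[X|Z]$ over the law of $Z$ recovers $E[X]$, a first glimpse of the "tower" identity $E[Y I_G] = E[X I_G]$ with $G = \Omega$.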
Partitioning $\Omega$ into the discrete collection of $Z$-atoms, namely the sets $G_i = \{\omega : Z(\omega) = z_i\}$ for $i = 1, \ldots, n$, observe that $Y(\omega)$ is constant on each of these sets. The $\sigma$-algebra $\mathcal{G} = \mathcal{F}_Z = \sigma(Z) = \{Z^{-1}(B), B \in \mathcal{B}\}$ is in this setting merely the collection of all $2^n$ possible unions of various $Z$-atoms. Hence, $\mathcal{G}$ is finitely generated and since $Y(\omega)$ is constant on each generator $G_i$ of $\mathcal{G}$, we see that $Y(\omega)$ is measurable on $(\Omega, \mathcal{G})$. Further, since any $G \in \mathcal{G}$ is of the form $G = \bigcup_{i \in J} G_i$ for the disjoint sets $G_i$ and some $J \subseteq \{1, \ldots, n\}$, we find that
$$E[Y I_G] = \sum_{i \in J} E[Y I_{G_i}] = \sum_{i \in J} E[X \mid Z = z_i] P(Z = z_i) = \sum_{i \in J} \sum_{j=1}^m x_j P(X = x_j, Z = z_i) = E[X I_G].$$
To summarize, in case $X$ and $Z$ are simple functions and $\mathcal{G} = \sigma(Z)$, we have $Y = E[X|Z]$ as a R.V. on $(\Omega, \mathcal{G})$ such that $E[Y I_G] = E[X I_G]$ for all $G \in \mathcal{G}$. Since both properties make sense for any $\sigma$-algebra $\mathcal{G}$ and any integrable R.V. $X$, this suggests the definition of the conditional expectation as given by the following theorem.
Theorem 4.1.2. Given $X \in L^1(\Omega, \mathcal{F}, P)$ and a $\sigma$-algebra $\mathcal{G} \subseteq \mathcal{F}$, there exists a R.V. $Y$, called the conditional expectation (C.E.) of $X$ given $\mathcal{G}$ and denoted by $E[X|\mathcal{G}]$, such that $Y \in L^1(\Omega, \mathcal{G}, P)$ and for any $G \in \mathcal{G}$,
(4.1.1) $\quad E[(X - Y) I_G] = 0$.
Moreover, if (4.1.1) holds for any $G \in \mathcal{G}$ and R.V.s $Y$ and $\widetilde{Y}$, both of which are in $L^1(\Omega, \mathcal{G}, P)$, then $P(\widetilde{Y} = Y) = 1$. In other words, the C.E. is uniquely defined for $P$-almost every $\omega$.
Remark. We call $Y \in L^1(\Omega, \mathcal{G}, P)$ that satisfies (4.1.1) for all $G \in \mathcal{G}$ a version of the C.E. $E[X|\mathcal{G}]$. In view of the preceding theorem, unless stated otherwise we consider all versions of the C.E. as being the same R.V.
Given our motivation for Theorem 4.1.2, we let $E[X|Z]$ stand for $E[X|\mathcal{F}_Z]$ and likewise $E[X|Z_1, Z_2, \ldots]$ stand for $E[X|\mathcal{F}_Z]$, where $\mathcal{F}_Z = \sigma(Z_1, Z_2, \ldots)$.

To check whether a R.V. is a C.E. with respect to a given $\sigma$-algebra $\mathcal{G}$, it suffices to verify (4.1.1) for some $\pi$-system that contains $\Omega$ and generates $\mathcal{G}$, as you show in the following exercise. This useful general observation is often key to determining an explicit formula for the C.E.
Exercise 4.1.3. Suppose that $\mathcal{P}$ is a $\pi$-system of subsets of $\Omega$ such that $\Omega \in \mathcal{P}$ and $\mathcal{G} = \sigma(\mathcal{P}) \subseteq \mathcal{F}$. Show that if $X \in L^1(\Omega, \mathcal{F}, P)$ and $Y \in L^1(\Omega, \mathcal{G}, P)$ are such that $E[X I_G] = E[Y I_G]$ for every $G \in \mathcal{P}$, then $Y = E[X|\mathcal{G}]$.
To prove the existence of the C.E. we need the following definition of absolute continuity of measures.
Definition 4.1.4. Let $\mu$ and $\nu$ be two measures on a measurable space $(S, \mathcal{F})$. We say that $\nu$ is absolutely continuous with respect to $\mu$, denoted by $\nu \ll \mu$, if
$$\mu(A) = 0 \implies \nu(A) = 0$$
for any set $A \in \mathcal{F}$.
Recall Proposition 1.3.56, that given a measure $\mu$ on $(S, \mathcal{F})$, any $f \in m\mathcal{F}_+$ induces a new measure $f\mu$ on $(S, \mathcal{F})$. The next theorem, whose proof is deferred to Subsection 4.1.2, shows that all absolutely continuous $\sigma$-finite measures with respect to a $\sigma$-finite measure $\mu$ are of this form.
Theorem 4.1.5 (Radon-Nikodym theorem). If $\mu$ and $\nu$ are two $\sigma$-finite measures on $(S, \mathcal{F})$ such that $\nu \ll \mu$, then there exists $f \in m\mathcal{F}_+$ finite valued such that $\nu = f\mu$. Further, if $f\mu = g\mu$ then $\mu(\{s : f(s) \ne g(s)\}) = 0$.
Remark. The assumption in the Radon-Nikodym theorem that $\mu$ is a $\sigma$-finite measure can be somewhat relaxed, but not completely dispensed with.
Definition 4.1.6. The function $f$ such that $\nu = f\mu$ is called the Radon-Nikodym derivative (or density) of $\nu$ with respect to $\mu$, and denoted $f = \frac{d\nu}{d\mu}$.
We note in passing that a real-valued R.V. has a probability density function $f$ if and only if its law is absolutely continuous with respect to the completion $\bar\lambda$ of Lebesgue measure on $(\mathbb{R}, \mathcal{B})$, with $f$ being the corresponding Radon-Nikodym derivative (c.f. Example 1.3.60).
Proof of Theorem 4.1.2. Given two versions $Y$ and $\widetilde{Y}$ of $E[X|\mathcal{G}]$, we apply (4.1.1) for the set $G_\epsilon = \{\omega : Y(\omega) - \widetilde{Y}(\omega) > \epsilon\}$ to see that (by linearity of the expectation),
$$0 = E[X I_{G_\epsilon}] - E[X I_{G_\epsilon}] = E[(Y - \widetilde{Y}) I_{G_\epsilon}] \ge \epsilon P(G_\epsilon).$$
Hence, $P(G_\epsilon) = 0$. Since this applies for any $\epsilon > 0$ and $G_\epsilon \uparrow G_0$ as $\epsilon \downarrow 0$, we deduce that $P(Y - \widetilde{Y} > 0) = 0$. The same argument applies with the roles of $Y$ and $\widetilde{Y}$ reversed, so $P(Y - \widetilde{Y} = 0) = 1$ as claimed.

We turn to the existence of the C.E., assuming first that $X \in L^1(\Omega, \mathcal{F}, P)$ is also non-negative. Let $\mu$ denote the probability measure obtained by restricting $P$ to the measurable space $(\Omega, \mathcal{G})$ and $\nu$ denote the measure obtained by restricting $XP$ of Proposition 1.3.56 to this measurable space, noting that $\nu$ is a finite measure (since $\nu(\Omega) = (XP)(\Omega) = E[X] < \infty$). If $G \in \mathcal{G}$ is such that $\mu(G) = P(G) = 0$, then by definition also $\nu(G) = (XP)(G) = 0$. Therefore, $\nu$ is absolutely continuous with respect to $\mu$, and by the Radon-Nikodym theorem there exists $Y \in m\mathcal{G}_+$ such that $\nu = Y\mu$. This implies that for any $G \in \mathcal{G}$,
$$E[X I_G] = P(X I_G) = \nu(G) = (Y\mu)(G) = \mu(Y I_G) = E[Y I_G]$$
(and in particular, that $E[Y] = \nu(\Omega) < \infty$), proving the existence of the C.E. for non-negative R.V.s.

Turning to deal with the case of a general integrable R.V. $X$, we use the representation $X = X_+ - X_-$ with $X_+ \ge 0$ and $X_- \ge 0$ such that both $E[X_+]$ and $E[X_-]$ are finite. Set $Y = Y_+ - Y_-$ where the integrable, non-negative R.V.s $Y_\pm = E[X_\pm|\mathcal{G}]$ exist by the preceding argument. Then, $Y \in m\mathcal{G}$ is integrable, and by definition of $Y_\pm$ we have that for any $G \in \mathcal{G}$,
$$E[Y I_G] = E[Y_+ I_G] - E[Y_- I_G] = E[X_+ I_G] - E[X_- I_G] = E[X I_G].$$
This establishes (4.1.1) and completes the proof of the theorem. □
Remark. Beware that for $Y = E[X|\mathcal{G}]$ often $Y_\pm \ne E[X_\pm|\mathcal{G}]$ (for example, take the trivial $\mathcal{G} = \{\emptyset, \Omega\}$ and $P(X = 1) = P(X = -1) = 1/2$, for which $Y = 0$ while $E[X_\pm|\mathcal{G}] = 1/2$).
Exercise 4.1.7. Suppose either $E(Y_k)_+$ is finite or $E(Y_k)_-$ is finite for random variables $Y_k$, $k = 1, 2$, on $(\Omega, \mathcal{F}, P)$ such that $E[Y_1 I_A] \le E[Y_2 I_A]$ for any $A \in \mathcal{F}$. Show that then $P(Y_1 \le Y_2) = 1$.
In the next exercise you show that the Radon-Nikodym density preserves the
product structure.
Exercise 4.1.8. Suppose that $\nu_k \ll \mu_k$ are pairs of $\sigma$-finite measures on $(S_k, \mathcal{F}_k)$ for $k = 1, \ldots, n$, with the corresponding Radon-Nikodym derivatives $f_k = d\nu_k/d\mu_k$.
(a) Show that the $\sigma$-finite product measure $\nu = \nu_1 \times \cdots \times \nu_n$ on the product space $(S, \mathcal{F})$ is absolutely continuous with respect to the $\sigma$-finite measure $\mu = \mu_1 \times \cdots \times \mu_n$ on $(S, \mathcal{F})$, with $d\nu/d\mu = \prod_{k=1}^n f_k$.
(b) Suppose $\nu$ and $\mu$ are probability measures on $S = \{(s_1, \ldots, s_n) : s_k \in S_k, k = 1, \ldots, n\}$. Show that $\{f_k(s_k), k = 1, \ldots, n\}$ are both mutually $\nu$-independent and mutually $\mu$-independent.
4.1.2. Proof of the Radon-Nikodym theorem. This section is devoted to proving the Radon-Nikodym theorem, which we have already used for establishing the existence of C.E. This is done by proving the more general Lebesgue decomposition, based on the following definition.
Definition 4.1.9. Two measures $\mu_1$ and $\mu_2$ on the same measurable space $(S, \mathcal{F})$ are mutually singular if there is a set $A \in \mathcal{F}$ such that $\mu_1(A) = 0$ and $\mu_2(A^c) = 0$. This is denoted by $\mu_1 \perp \mu_2$, and we sometimes state that $\mu_1$ is singular with respect to $\mu_2$, instead of $\mu_1$ and $\mu_2$ mutually singular.
Equipped with the concept of mutually singular measures, we next state the
Lebesgue decomposition and show that the Radon-Nikodym theorem is a direct
consequence of this decomposition.
Theorem 4.1.10 (Lebesgue decomposition). Suppose $\nu$ and $\mu$ are measures on the same measurable space $(S, \mathcal{F})$ such that $\nu(S)$ and $\mu(S)$ are finite. Then, $\nu = \nu_{ac} + \nu_s$, where the measure $\nu_s$ is singular with respect to $\mu$ and $\nu_{ac} = f\mu$ for some $f \in m\mathcal{F}_+$. Further, such a decomposition of $\nu$ is unique (per given $\mu$).
Remark. To build your intuition, note that the Lebesgue decomposition is quite explicit for $\sigma$-finite measures on a countable space $S$ (with $\mathcal{F} = 2^S$). Indeed, then $\nu_{ac}$ and $\nu_s$ are the restrictions of $\nu$ to the support $S_\mu = \{s \in S : \mu(\{s\}) > 0\}$ of $\mu$ and its complement, respectively, with $f(s) = \nu(\{s\})/\mu(\{s\})$ for $s \in S_\mu$ the Radon-Nikodym derivative of $\nu_{ac}$ with respect to $\mu$ (see Exercise 1.2.47 for more on the support of a measure).
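The countable-space recipe of this remark can be carried out literally (the sketch below, with its made-up four-point space and measure values, is purely illustrative):

```python
# Finite S; mu and nu given by point masses. nu_ac is the part of nu carried
# by the support of mu, with density f(s) = nu({s}) / mu({s}) there.
S = ["a", "b", "c", "d"]
mu = {"a": 0.5, "b": 0.5, "c": 0.0, "d": 0.0}
nu = {"a": 0.2, "b": 0.1, "c": 0.4, "d": 0.3}

supp = [s for s in S if mu[s] > 0]
f = {s: nu[s] / mu[s] for s in supp}  # Radon-Nikodym derivative on the support
nu_ac = {s: (f[s] * mu[s] if s in f else 0.0) for s in S}
nu_s = {s: nu[s] - nu_ac[s] for s in S}

# nu = nu_ac + nu_s, nu_ac = f mu, and nu_s lives off the support of mu
print(nu_ac, nu_s)
```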
Proof of the Radon-Nikodym theorem. Assume first that $\mu(S)$ and $\nu(S)$ are finite. Let $\nu = \nu_{ac} + \nu_s$ be the unique Lebesgue decomposition induced by $\mu$. Then, by definition there exists a set $A \in \mathcal{F}$ such that $\nu_s(A^c) = \mu(A) = 0$. Further, our assumption that $\nu \ll \mu$ implies that $\nu_s(A) \le \nu(A) = 0$ as well, hence $\nu_s(S) = 0$, i.e. $\nu = \nu_{ac} = f\mu$ for some $f \in m\mathcal{F}_+$.

Next, in case $\mu$ and $\nu$ are $\sigma$-finite measures, the sample space $S$ is a countable union of disjoint sets $A_n \in \mathcal{F}$ such that both $\mu(A_n)$ and $\nu(A_n)$ are finite. Considering the measures $\mu_n = \mu I_{A_n}$ and $\nu_n = \nu I_{A_n}$, such that $\mu_n(S) = \mu(A_n)$ and $\nu_n(S) = \nu(A_n)$ are finite, our assumption that $\nu \ll \mu$ implies that $\nu_n \ll \mu_n$. Hence, by the preceding argument, for each $n$ there exists $f_n \in m\mathcal{F}_+$ such that $\nu_n = f_n \mu_n$. With $\nu = \sum_n \nu_n$ and $\nu_n = (f_n I_{A_n})\mu$ (by the composition relation of Proposition 1.3.56), it follows that $\nu = f\mu$ for $f = \sum_n f_n I_{A_n} \in m\mathcal{F}_+$ finite valued.

As for the uniqueness of the Radon-Nikodym derivative $f$, suppose that $f\mu = g\mu$ for some $g \in m\mathcal{F}_+$ and a $\sigma$-finite measure $\mu$. Consider $E_n = D_n \cap \{s : g(s) - f(s) \ge 1/n,\ g(s) \le n\}$ and measurable $D_n \uparrow S$ such that $\mu(D_n) < \infty$. Then, necessarily both $\mu(f I_{E_n})$ and $\mu(g I_{E_n})$ are finite, with
$$n^{-1} \mu(E_n) \le \mu((g - f) I_{E_n}) = (g\mu)(E_n) - (f\mu)(E_n) = 0,$$
implying that $\mu(E_n) = 0$. Considering the union over $n = 1, 2, \ldots$ we deduce that $\mu(\{s : g(s) > f(s)\}) = 0$, and upon reversing the roles of $f$ and $g$, also $\mu(\{s : g(s) < f(s)\}) = 0$. □
Remark. Following the same argument as in the preceding proof of the Radon-Nikodym theorem, one easily concludes that the Lebesgue decomposition applies also for any two $\sigma$-finite measures $\mu$ and $\nu$.

Our next lemma is the key to the proof of the Lebesgue decomposition.
Lemma 4.1.11. If the finite measures $\nu$ and $\mu$ on $(S, \mathcal{F})$ are not mutually singular, then there exist $B \in \mathcal{F}$ and $\epsilon > 0$ such that $\mu(B) > 0$ and $\nu(A) \ge \epsilon\mu(A)$ for all $A \in \mathcal{F}_B$.
The proof of this lemma is based on the Hahn-Jordan decomposition of a finite signed measure to its positive and negative parts (for a definition of a finite signed measure see the remark after Definition 1.1.2).
Theorem 4.1.12 (Hahn decomposition). For any finite signed measure $\rho : \mathcal{F} \to \mathbb{R}$ there exists $D \in \mathcal{F}$ such that $\rho_+ = \rho I_D$ and $\rho_- = -\rho I_{D^c}$ are measures on $(S, \mathcal{F})$.
See [Bil95, Theorem 32.1] for a proof of the Hahn decomposition as stated here, or [Dud89, Theorem 5.6.1] for the same conclusion in case of a general, that is $[-\infty, \infty]$-valued, signed measure, where uniqueness of the Hahn-Jordan decomposition of a signed measure as the difference between the mutually singular measures $\rho_\pm$ is also shown (see also [Dur10, Theorems A.4.3 and A.4.4]).
Remark. If $\rho I_B$ is a measure we call $B \in \mathcal{F}$ a positive set for the signed measure $\rho$, and if $-\rho I_B$ is a measure we say that $B \in \mathcal{F}$ is a negative set for $\rho$. So, the Hahn decomposition provides a partition of $S$ into a positive set (for $\rho$) and a negative set (for $\rho$).
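On a finite space the Hahn decomposition is trivial to exhibit: take $D = \{s : \rho(\{s\}) \ge 0\}$. A minimal sketch (the four-point signed "masses" below are made up for illustration):

```python
# rho: a finite signed measure on a finite set, given by point "masses"
# that may be negative. D = {s : rho({s}) >= 0} is a positive set and its
# complement a negative set, giving a Hahn decomposition.
rho = {"a": 0.4, "b": -0.3, "c": 0.1, "d": -0.2}

D = {s for s, v in rho.items() if v >= 0}
rho_plus = {s: (v if s in D else 0.0) for s, v in rho.items()}        # rho I_D
rho_minus = {s: (-v if s not in D else 0.0) for s, v in rho.items()}  # -rho I_{D^c}

# both parts are (non-negative) measures and rho = rho_plus - rho_minus
print(rho_plus, rho_minus)
```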
Proof of Lemma 4.1.11. Let $A = \bigcup_n D_n$ where $D_n$, $n = 1, 2, \ldots$, is a positive set for the Hahn decomposition of the finite signed measure $\nu - n^{-1}\mu$. Since $A^c$ is contained in the negative set $D_n^c$ for $\nu - n^{-1}\mu$, it follows that $\nu(A^c) \le n^{-1}\mu(A^c)$. Taking $n \to \infty$ we deduce that $\nu(A^c) = 0$. If $\mu(D_n) = 0$ for all $n$ then $\mu(A) = 0$ and necessarily $\nu$ is singular with respect to $\mu$, contradicting the assumptions of the lemma. Therefore, $\mu(D_n) > 0$ for some finite $n$. Taking $\epsilon = n^{-1}$ and $B = D_n$ results with the thesis of the lemma. □
Proof of the Lebesgue decomposition. Our goal is to construct $f \in m\mathcal{F}_+$ such that the measure $\nu_s = \nu - f\mu$ is singular with respect to $\mu$. Since necessarily $\nu_s(A) \ge 0$ for any $A \in \mathcal{F}$, such a function $f$ must belong to
$$\mathcal{H} = \{h \in m\mathcal{F}_+ : \nu(A) \ge (h\mu)(A), \text{ for all } A \in \mathcal{F}\}.$$
Indeed, we take $f$ to be an element of $\mathcal{H}$ for which $(f\mu)(S)$ is maximal. To show that such $f$ exists, note first that $\mathcal{H}$ is closed under non-decreasing passages to the limit (by monotone convergence). Further, if $h$ and $\widetilde{h}$ are both in $\mathcal{H}$ then also $\max\{h, \widetilde{h}\} \in \mathcal{H}$, since with $\Gamma = \{s : h(s) > \widetilde{h}(s)\}$ we have that for any $A \in \mathcal{F}$,
$$\nu(A) = \nu(A \cap \Gamma) + \nu(A \cap \Gamma^c) \ge \mu(h I_{A \cap \Gamma}) + \mu(\widetilde{h} I_{A \cap \Gamma^c}) = \mu(\max\{h, \widetilde{h}\} I_A).$$
That is, $\mathcal{H}$ is also closed under the formation of finite maxima and in particular, the function $\lim_n \max(h_1, \ldots, h_n)$ is in $\mathcal{H}$ for any $h_n \in \mathcal{H}$. Now let $\gamma = \sup\{(h\mu)(S) : h \in \mathcal{H}\}$, noting that $\gamma \le \nu(S)$ is finite. Choosing $h_n \in \mathcal{H}$ such that $(h_n\mu)(S) \ge \gamma - n^{-1}$ results with $f = \lim_n \max(h_1, \ldots, h_n)$ in $\mathcal{H}$ such that $(f\mu)(S) \ge \lim_n (h_n\mu)(S) = \gamma$. Since $f$ is an element of $\mathcal{H}$, both $\nu_{ac} = f\mu$ and $\nu_s = \nu - f\mu$ are finite measures.

If $\nu_s$ fails to be singular with respect to $\mu$, then by Lemma 4.1.11 there exist $B \in \mathcal{F}$ and $\epsilon > 0$ such that $\mu(B) > 0$ and $\nu_s(A) \ge \epsilon(\mu I_B)(A)$ for all $A \in \mathcal{F}$. Since $\nu = \nu_s + f\mu$, this implies that $f + \epsilon I_B \in \mathcal{H}$. However, $((f + \epsilon I_B)\mu)(S) \ge \gamma + \epsilon\mu(B) > \gamma$, contradicting the fact that $\gamma$ is the finite maximal value of $(h\mu)(S)$ over $h \in \mathcal{H}$. Consequently, this construction of $f$ has $\nu = f\mu + \nu_s$ with a finite measure $\nu_s$ that is singular with respect to $\mu$.

Finally, to prove the uniqueness of the Lebesgue decomposition, suppose there exist $f_1, f_2 \in m\mathcal{F}_+$ such that both $\nu - f_1\mu$ and $\nu - f_2\mu$ are singular with respect to $\mu$. That is, there exist $A_1, A_2 \in \mathcal{F}$ such that $\mu(A_i) = 0$ and $(\nu - f_i\mu)(A_i^c) = 0$ for $i = 1, 2$. Considering $A = A_1 \cup A_2$, it follows that $\mu(A) = 0$ and $(\nu - f_i\mu)(A^c) = 0$ for $i = 1, 2$. Consequently, for any $E \in \mathcal{F}$ we have that $(\nu - f_1\mu)(E) = \nu(E \cap A) = (\nu - f_2\mu)(E)$, proving the uniqueness of $\nu_s$, and hence of the decomposition of $\nu$ as $\nu_{ac} + \nu_s$. □
We conclude with a simple application of Radon-Nikodym theorem in conjunction
with Lemma 1.3.8.
Exercise 4.1.13. Suppose and are two -nite measures on the same measur-
able space (S, T) such that (A) (A) for all A T. Show that if (g) = (g) is
nite for some g mT such that (s : g(s) 0) = 0 then () = ().
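On a finite state space the Lebesgue decomposition is completely explicit, which makes it easy to sanity-check. Below is a minimal numerical sketch, assuming NumPy; the two measures and the four-point state space are hypothetical illustrations, not from the text. The density $f = d\nu_{ac}/d\mu$ is the ratio of point masses where $\mu > 0$, and the singular part is whatever $\nu$-mass sits on $\{\mu = 0\}$.

```python
import numpy as np

# Hypothetical point masses of two finite measures on {0, 1, 2, 3}.
mu = np.array([0.5, 0.3, 0.0, 0.2])   # mu gives no mass to state 2
nu = np.array([0.1, 0.4, 0.3, 0.2])

# Density f = d(nu_ac)/d(mu) on {mu > 0}; zero elsewhere.
f = np.where(mu > 0, nu / np.where(mu > 0, mu, 1.0), 0.0)
nu_ac = f * mu                  # absolutely continuous part: nu_ac << mu
nu_s = nu - nu_ac               # singular part, carried by {mu = 0}

assert np.allclose(nu_ac + nu_s, nu)     # nu = nu_ac + nu_s
assert np.all(nu_ac[mu == 0] == 0.0)     # nu_ac vanishes where mu does
assert np.allclose(nu_s[mu > 0], 0.0)    # nu_s and mu are mutually singular
```

Here the singular part has total mass $0.3$, exactly the $\nu$-mass of the $\mu$-null state.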
4.2. Properties of the conditional expectation
In some generic settings the C.E. is rather explicit. One such example is when $X$ is measurable on the conditioning $\sigma$-algebra $\mathcal{G}$.

Example 4.2.1. If $X \in L^1(\Omega, \mathcal{G}, P)$ then $Y = X \in m\mathcal{G}$ satisfies (4.1.1), so $E[X|\mathcal{G}] = X$. In particular, if $X = c$ is a constant R.V. then $E[X|\mathcal{G}] = c$ for any $\sigma$-algebra $\mathcal{G}$.
Here is an extension of this example.

Exercise 4.2.2. Suppose that the $(\mathbb{Y}, \mathcal{Y})$-valued random variable $Y$ is measurable on $\mathcal{G}$ and the $(\mathbb{X}, \mathcal{X})$-valued random variable $Z$ is $P$-independent of $\mathcal{G}$. Show that if $\varphi$ is measurable on the product space $(\mathbb{X} \times \mathbb{Y}, \mathcal{X} \times \mathcal{Y})$ and $\varphi(Z, Y)$ is integrable, then $E[\varphi(Z, Y)|\mathcal{G}] = g(Y)$ where $g(y) = E[\varphi(Z, y)]$.

Since only constant random variables are measurable on $\mathcal{F}_0 = \{\emptyset, \Omega\}$, by definition of the C.E. clearly $E[X|\mathcal{F}_0] = EX$. We show next that $E[X|\mathcal{H}] = EX$ also whenever the conditioning $\sigma$-algebra $\mathcal{H}$ is independent of $\sigma(X)$ (and in particular, when $\mathcal{H}$ is $P$-trivial).

Proposition 4.2.3. If $X \in L^1(\Omega, \mathcal{F}, P)$ and the $\sigma$-algebra $\mathcal{H}$ is independent of $\sigma(\sigma(X), \mathcal{G})$, then
\[ E[X|\sigma(\mathcal{H}, \mathcal{G})] = E[X|\mathcal{G}] . \]
For $\mathcal{G} = \{\emptyset, \Omega\}$, this implies that
\[ \mathcal{H} \text{ independent of } \sigma(X) \implies E[X|\mathcal{H}] = EX . \]

Remark. Recall that a $P$-trivial $\sigma$-algebra $\mathcal{H} \subseteq \mathcal{F}$ is independent of $\sigma(X)$ for any $X \in m\mathcal{F}$. Hence, by Proposition 4.2.3 in this case $E[X|\mathcal{H}] = EX$ for all $X \in L^1(\Omega, \mathcal{F}, P)$.
Proof. Let $Y = E[X|\mathcal{G}] \in m\mathcal{G}$. Because $\mathcal{H}$ is independent of $\sigma(\mathcal{G}, \sigma(X))$, it follows that for any $G \in \mathcal{G}$ and $H \in \mathcal{H}$ the random variable $I_H$ is independent of both $X I_G$ and $Y I_G$. Consequently,
\[ E[X I_{G \cap H}] = E[X I_G I_H] = E[X I_G] \, E I_H , \]
\[ E[Y I_{G \cap H}] = E[Y I_G I_H] = E[Y I_G] \, E I_H . \]
Further, $E[X I_G] = E[Y I_G]$ by the definition of $Y$, hence $E[X I_A] = E[Y I_A]$ for any $A \in \mathcal{A} = \{G \cap H : G \in \mathcal{G}, H \in \mathcal{H}\}$. Applying Exercise 4.1.3 with $Y \in L^1(\Omega, \mathcal{G}, P) \subseteq L^1(\Omega, \sigma(\mathcal{H}, \mathcal{G}), P)$ and $\mathcal{A}$ a $\pi$-system of subsets containing $\Omega$ and generating $\sigma(\mathcal{G}, \mathcal{H})$, we thus conclude that
\[ E[X|\sigma(\mathcal{G}, \mathcal{H})] = Y = E[X|\mathcal{G}] \]
as claimed. $\square$
We turn to derive various properties of the C.E. operation, starting with its positivity and linearity (per fixed conditioning $\sigma$-algebra).

Proposition 4.2.4. Let $X \in L^1(\Omega, \mathcal{F}, P)$ and set $Y = E[X|\mathcal{G}]$ for some $\sigma$-algebra $\mathcal{G} \subseteq \mathcal{F}$. Then,
(a) $EX = EY$.
(b) (Positivity) $X \ge 0 \implies Y \ge 0$ a.s. and $X > 0 \implies Y > 0$ a.s.

Proof. Considering $G = \Omega$ in the definition of the C.E. we find that
\[ EX = E[X I_G] = E[Y I_G] = EY . \]
Turning to the positivity of the C.E., note that if $X \ge 0$ a.s. then $0 \le E[X I_G] = E[Y I_G] \le 0$ for $G = \{\omega : Y(\omega) \le 0\} \in \mathcal{G}$. Hence, in this case $E[Y I_{\{Y \le 0\}}] = 0$. That is, almost surely $Y \ge 0$. Further, $P(X > \epsilon, Y \le 0) \le \epsilon^{-1} E[X I_{\{X > \epsilon\}} I_{\{Y \le 0\}}] \le \epsilon^{-1} E[X I_{\{Y \le 0\}}] = 0$ for any $\epsilon > 0$, so $P(X > 0, Y = 0) = 0$ as well. $\square$
We next show that the C.E. operator is linear.

Proposition 4.2.5 (Linearity). Let $X, Y \in L^1(\Omega, \mathcal{F}, P)$ and $\mathcal{G} \subseteq \mathcal{F}$ a $\sigma$-algebra. Then, for any $\alpha, \beta \in \mathbb{R}$,
\[ E[\alpha X + \beta Y | \mathcal{G}] = \alpha E[X|\mathcal{G}] + \beta E[Y|\mathcal{G}] . \]

Proof. Let $Z = E[X|\mathcal{G}]$ and $V = E[Y|\mathcal{G}]$. Since $Z, V \in L^1(\Omega, \mathcal{F}, P)$ the same applies for $\alpha Z + \beta V$. Further, for any $G \in \mathcal{G}$, by linearity of the expectation operator and the definition of the C.E.,
\[ E[(\alpha Z + \beta V) I_G] = \alpha E[Z I_G] + \beta E[V I_G] = \alpha E[X I_G] + \beta E[Y I_G] = E[(\alpha X + \beta Y) I_G] , \]
as claimed. $\square$

From its positivity and linearity we immediately get the monotonicity of the C.E.

Corollary 4.2.6 (Monotonicity). If $X, Y \in L^1(\Omega, \mathcal{F}, P)$ are such that $X \ge Y$, then $E[X|\mathcal{G}] \ge E[Y|\mathcal{G}]$ for any $\sigma$-algebra $\mathcal{G} \subseteq \mathcal{F}$.

In the following exercise you are to combine the linearity and positivity of the C.E. with Fubini's theorem.

Exercise 4.2.7. Show that if $X, Y \in L^1(\Omega, \mathcal{F}, P)$ are such that $E[X|Y] = Y$ and $E[Y|X] = X$ then almost surely $X = Y$.
Hint: First show that $E[(X - Y) I_{\{X > c \ge Y\}}] = 0$ for any non-random $c$.
We next deal with the relationship between the C.E.s of the same R.V. for nested conditioning $\sigma$-algebras.

Proposition 4.2.8 (Tower property). Suppose $X \in L^1(\Omega, \mathcal{F}, P)$ and the $\sigma$-algebras $\mathcal{H}$ and $\mathcal{G}$ are such that $\mathcal{H} \subseteq \mathcal{G} \subseteq \mathcal{F}$. Then, $E[X|\mathcal{H}] = E[E(X|\mathcal{G})|\mathcal{H}]$.

Proof. Recall that $Y = E[X|\mathcal{G}]$ is integrable, hence $Z = E[Y|\mathcal{H}]$ is integrable. Fixing $A \in \mathcal{H}$ we have that $E[Y I_A] = E[Z I_A]$ by the definition of the C.E. $Z$. Since $\mathcal{H} \subseteq \mathcal{G}$, also $A \in \mathcal{G}$, hence $E[X I_A] = E[Y I_A]$ by the definition of the C.E. $Y$. We deduce that $E[X I_A] = E[Z I_A]$ for all $A \in \mathcal{H}$. It then follows from the definition of the C.E. that $Z$ is a version of $E[X|\mathcal{H}]$. $\square$

Remark. The tower property is also called the law of iterated expectations.

Any $\sigma$-algebra $\mathcal{G}$ contains the trivial $\sigma$-algebra $\mathcal{F}_0 = \{\emptyset, \Omega\}$. Applying the tower property with $\mathcal{H} = \mathcal{F}_0$ and using the fact that $E[Y|\mathcal{F}_0] = EY$ for any integrable random variable $Y$, it follows that for any $\sigma$-algebra $\mathcal{G}$,
(4.2.1) $EX = E[X|\mathcal{F}_0] = E[E(X|\mathcal{G})|\mathcal{F}_0] = E[E(X|\mathcal{G})]$.
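On a finite probability space, conditioning on a finitely generated $\sigma$-algebra is just averaging over the atoms of the corresponding partition, so the tower property and (4.2.1) can be checked directly. A minimal sketch, assuming NumPy; the 12-point space, the nested partitions and the `cond_exp` helper are hypothetical illustrations, not the text's notation.

```python
import numpy as np

rng = np.random.default_rng(0)
omega = np.arange(12)                  # uniform P on a hypothetical 12-point space
X = rng.normal(size=12)                # an integrable R.V.
G = omega // 2                         # atoms of a finer sigma-algebra (pairs)
H = omega // 4                         # coarser sigma-algebra: each atom of H is a union of atoms of G

def cond_exp(Z, labels):
    """E[Z | sigma(labels)]: average Z over each atom of the partition."""
    out = np.empty_like(Z)
    for a in np.unique(labels):
        out[labels == a] = Z[labels == a].mean()
    return out

lhs = cond_exp(X, H)                   # E[X|H]
rhs = cond_exp(cond_exp(X, G), H)      # E[E(X|G)|H]
assert np.allclose(lhs, rhs)           # tower property
assert np.isclose(cond_exp(X, G).mean(), X.mean())   # (4.2.1): EX = E[E(X|G)]
```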
Here is an application of the tower property, leading to a stronger conclusion than what one has from Proposition 4.2.3.

Lemma 4.2.9. If the integrable R.V. $X$ and $\sigma$-algebra $\mathcal{G}$ are such that $E[X|\mathcal{G}]$ is independent of $X$, then $E[X|\mathcal{G}] = E[X]$.

Proof. Let $Z = E[X|\mathcal{G}]$. Applying the tower property for $\mathcal{H} = \sigma(Z) \subseteq \mathcal{G}$ we have that $E[X|\mathcal{H}] = E[Z|\mathcal{H}]$. Clearly, $E[Z|\mathcal{H}] = Z$ (see Example 4.2.1), whereas our assumption that $X$ is independent of $Z$ implies that $E[X|\mathcal{H}] = E[X]$ (see Proposition 4.2.3). Consequently, $Z = E[X]$, as claimed. $\square$

As shown next, we can take out what is known when computing the C.E.

Proposition 4.2.10. Suppose $Y \in m\mathcal{G}$ and $X \in L^1(\Omega, \mathcal{F}, P)$ are such that $XY \in L^1(\Omega, \mathcal{F}, P)$. Then, $E[XY|\mathcal{G}] = Y E[X|\mathcal{G}]$.
Proof. Let $Z = E[X|\mathcal{G}]$, which is well defined due to our assumption that $E|X| < \infty$. With $Y Z \in m\mathcal{G}$ and $E|XY| < \infty$, it suffices to check that
(4.2.2) $E[XY I_A] = E[Z Y I_A]$
for all $A \in \mathcal{G}$. Indeed, if $Y = I_B$ for $B \in \mathcal{G}$ then $Y I_A = I_G$ for $G = B \cap A \in \mathcal{G}$, so (4.2.2) follows from the definition of the C.E. $Z$. By linearity of the expectation, this extends to $Y$ which is a simple function on $(\Omega, \mathcal{G})$. Recall that for $X \ge 0$, by positivity of the C.E. also $Z \ge 0$, so by monotone convergence (4.2.2) then applies for all $Y \in m\mathcal{G}_+$. In general, let $X = X_+ - X_-$ and $Y = Y_+ - Y_-$ for $Y_\pm \in m\mathcal{G}_+$ and the integrable $X_\pm \ge 0$. Since $|XY| = (X_+ + X_-)(Y_+ + Y_-)$ is integrable, so are the products $X_\pm Y_\pm$, and (4.2.2) holds for each of the four possible choices of the pair $(X_\pm, Y_\pm)$, with $Z_\pm = E[X_\pm|\mathcal{G}]$ instead of $Z$. Upon noting that $Z = Z_+ - Z_-$ (by linearity of the C.E.), and $XY = X_+ Y_+ - X_- Y_+ - X_+ Y_- + X_- Y_-$, it readily follows that (4.2.2) applies also for $X$ and $Y$. $\square$
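The "take out what is known" identity is again easy to see on a finite space, where any $\mathcal{G}$-measurable $Y$ is constant on the atoms of the partition generating $\mathcal{G}$. A minimal sketch, assuming NumPy; the space and helper are hypothetical illustrations.

```python
import numpy as np

rng = np.random.default_rng(1)
omega = np.arange(10)
X = rng.normal(size=10)
labels = omega // 2                    # atoms of a hypothetical finite sigma-algebra G
Y = np.cos(labels.astype(float))       # G-measurable: constant on each atom

def cond_exp(Z, labels):
    """E[Z | sigma(labels)]: average Z over each atom of the partition."""
    out = np.empty_like(Z)
    for a in np.unique(labels):
        out[labels == a] = Z[labels == a].mean()
    return out

# Proposition 4.2.10: E[XY|G] = Y E[X|G]
assert np.allclose(cond_exp(X * Y, labels), Y * cond_exp(X, labels))
```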
Adopting hereafter the notation $P(A|\mathcal{G})$ for $E[I_A|\mathcal{G}]$, the following exercises illustrate some of the many applications of Propositions 4.2.8 and 4.2.10.
Exercise 4.2.11. For any $\sigma$-algebras $\mathcal{G}_i \subseteq \mathcal{F}$, $i = 1, 2, 3$, let $\mathcal{G}_{ij} = \sigma(\mathcal{G}_i, \mathcal{G}_j)$ and prove that the following conditions are equivalent:
(a) $P[A_3|\mathcal{G}_{12}] = P[A_3|\mathcal{G}_2]$ for all $A_3 \in \mathcal{G}_3$.
(b) $P[A_1 \cap A_3|\mathcal{G}_2] = P[A_1|\mathcal{G}_2] \, P[A_3|\mathcal{G}_2]$ for all $A_1 \in \mathcal{G}_1$ and $A_3 \in \mathcal{G}_3$.
(c) $P[A_1|\mathcal{G}_{23}] = P[A_1|\mathcal{G}_2]$ for all $A_1 \in \mathcal{G}_1$.

Remark. Taking $\mathcal{G}_1 = \sigma(X_k, k < n)$, $\mathcal{G}_2 = \sigma(X_n)$ and $\mathcal{G}_3 = \sigma(X_k, k > n)$, condition (a) of the preceding exercise states that the sequence of random variables $\{X_k\}$ has the Markov property. That is, the conditional probability of a future event $A_3$ given the past and present information $\mathcal{G}_{12}$ is the same as its conditional probability given the present $\mathcal{G}_2$ alone. Condition (c) makes the same statement, but with time reversed, while condition (b) says that past and future events $A_1$ and $A_3$ are conditionally independent given the present information, that is, $\mathcal{G}_2$.
Exercise 4.2.12. Let $Z = (X, Y)$ be a uniformly chosen point in $(0, 1)^2$. That is, $X$ and $Y$ are independent random variables, each having the $U(0,1)$ measure of Example 1.1.26. Set $T = 2 I_A(Z) + 10 I_B(Z) + 4 I_C(Z)$ where $A = \{(x, y) : 0 < x < 1/4, \ 3/4 < y < 1\}$, $B = \{(x, y) : 1/4 < x < 3/4, \ 0 < y < 1/2\}$ and $C = \{(x, y) : 3/4 < x < 1, \ 1/4 < y < 1\}$.
(a) Find an explicit formula for the conditional expectation $W = E(T|X)$ and use it to determine the conditional expectation $U = E(TX|X)$.
(b) Find the value of $E[(T - W) \sin(e^X)]$.

Exercise 4.2.13. Fixing a positive integer $k$, compute $E(X|Y)$ in case $Y = kX - [kX]$ for $X$ having the $U(0,1)$ measure of Example 1.1.26 (and where $[x]$ denotes the integer part of $x$).

Exercise 4.2.14. Fixing $t \in \mathbb{R}$ and an integrable random variable $X$, let $Y = \max(X, t)$ and $Z = \min(X, t)$. Setting $a_t = E[X|X \le t]$ and $b_t = E[X|X \ge t]$, show that
\[ E[X|Y] = Y I_{\{Y > t\}} + a_t I_{\{Y = t\}} \quad \text{and} \quad E[X|Z] = Z I_{\{Z < t\}} + b_t I_{\{Z = t\}} . \]
Exercise 4.2.15. Let $X, Y$ be i.i.d. random variables. Suppose $\xi$ is independent of $(X, Y)$, with $P(\xi = 1) = p$, $P(\xi = 0) = 1 - p$. Let $Z = (Z_1, Z_2)$ where $Z_1 = \xi X + (1 - \xi) Y$ and $Z_2 = \xi Y + (1 - \xi) X$.
(a) Prove that $Z$ and $\xi$ are independent.
(b) Obtain an explicit expression for $E[g(X, Y)|Z]$, in terms of $Z_1$ and $Z_2$, where $g : \mathbb{R}^2 \to \mathbb{R}$ is a bounded Borel function.

Exercise 4.2.16. Suppose $EX^2 < \infty$ and define $\mathrm{Var}(X|\mathcal{G}) = E[(X - E(X|\mathcal{G}))^2 | \mathcal{G}]$.
(a) Show that $E[\mathrm{Var}(X|\mathcal{G}_2)] \le E[\mathrm{Var}(X|\mathcal{G}_1)]$ for any two $\sigma$-algebras $\mathcal{G}_1 \subseteq \mathcal{G}_2$ (that is, the dispersion of $X$ about its conditional mean decreases as the $\sigma$-algebra grows).
(b) Show that for any $\sigma$-algebra $\mathcal{G}$,
\[ \mathrm{Var}[X] = E[\mathrm{Var}(X|\mathcal{G})] + \mathrm{Var}[E(X|\mathcal{G})] . \]
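Part (b) of the preceding exercise is the familiar decomposition of variance into within-group and between-group terms. A Monte Carlo sketch, assuming NumPy; the three-component Gaussian mixture is a hypothetical illustration, and the tolerance only reflects sampling error (this is a numerical check, not a proof).

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical mixture model: G = sigma(K), and X | K ~ N(mu_K, 1).
K = rng.integers(0, 3, size=100_000)
mu = np.array([-1.0, 0.0, 2.0])
X = mu[K] + rng.normal(size=K.size)

E_X_given_G = mu[K]                  # E[X|G], exact for this model
Var_X_given_G = np.ones(K.size)      # Var(X|G) = 1 on every atom

total = np.mean(Var_X_given_G) + np.var(E_X_given_G)
# Var X = E[Var(X|G)] + Var[E(X|G)], up to Monte Carlo error
assert abs(np.var(X) - total) < 0.05
```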
Exercise 4.2.17. Suppose $N$ is a non-negative, integer valued R.V. which is independent of the independent, integrable R.V.-s $\xi_i$ on the same probability space, and that $\sum_i P(N \ge i) E|\xi_i|$ is finite.
(a) Check that
\[ X(\omega) = \sum_{i=1}^{N(\omega)} \xi_i(\omega) \]
is integrable and deduce that $EX = \sum_i P(N \ge i) E\xi_i$.
(b) Suppose in addition that $\xi_i$ are identically distributed, in which case this is merely Wald's identity $EX = EN \, E\xi_1$. Show that if both $\xi_1$ and $N$ are square-integrable, then so is $X$ and
\[ \mathrm{Var}(X) = \mathrm{Var}(\xi_1) \, EN + \mathrm{Var}(N) (E\xi_1)^2 . \]
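Wald's identity and the variance formula of part (b) are easy to probe by simulation. A hedged sketch, assuming NumPy; the Poisson/exponential choices are illustrative only (here $EN = \mathrm{Var}(N) = 4$ and $E\xi_1 = \mathrm{Var}(\xi_1) = 1$, so $EX = 4$ and $\mathrm{Var}(X) = 8$), and the tolerances account for Monte Carlo error.

```python
import numpy as np

rng = np.random.default_rng(3)
trials = 100_000
N = rng.poisson(4.0, size=trials)           # square-integrable N, independent of the xi_i
# X = sum_{i=1}^N xi_i with xi_i i.i.d. Exp(1)
X = np.array([rng.exponential(1.0, size=n).sum() for n in N])

assert abs(X.mean() - 4.0 * 1.0) < 0.1      # Wald: EX = EN * E xi_1
var_pred = 1.0 * 4.0 + 4.0 * 1.0**2         # Var(xi_1) EN + Var(N) (E xi_1)^2 = 8
assert abs(X.var() - var_pred) < 0.4
```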
Suppose $XY$, $X$ and $Y$ are integrable. Combining Proposition 4.2.10 and (4.2.1), convince yourself that if $E[X|Y] = EX$ then $E[XY] = EX \, EY$. Recall that if $X$ and $Y$ are independent and integrable then $E[X|Y] = EX$ (c.f. Proposition 4.2.3). As you show next, the converse implications are false and further, one cannot dispense of the nesting relationship between the two $\sigma$-algebras in the tower property.
Exercise 4.2.18. Provide examples of $X, Y \in \{-1, 0, 1\}$ such that
(a) $E[XY] = EX \, EY$ but $E[X|Y] \ne EX$.
(b) $E[X|Y] = EX$ but $X$ is not independent of $Y$.
(c) For $\Omega = \{1, 2, 3\}$ and $\mathcal{F}_i = \sigma(\{i\})$, $i = 1, 2, 3$,
\[ E[E(X|\mathcal{F}_1)|\mathcal{F}_2] \ne E[E(X|\mathcal{F}_2)|\mathcal{F}_1] . \]
As shown in the sequel, per fixed conditioning $\sigma$-algebra we can interpret the C.E. as an expectation in a different (conditional) probability space. Indeed, every property of the expectation has a corresponding extension to the C.E. For example, the extension of Jensen's inequality is

Proposition 4.2.19 (Jensen's inequality). Suppose $g(\cdot)$ is a convex function on an open interval $G$ of $\mathbb{R}$, that is,
\[ \lambda g(x) + (1 - \lambda) g(y) \ge g(\lambda x + (1 - \lambda) y) \qquad \forall x, y \in G, \ 0 \le \lambda \le 1 . \]
If $X$ is an integrable R.V. with $P(X \in G) = 1$ and $g(X)$ is also integrable, then almost surely $E[g(X)|\mathcal{H}] \ge g(E[X|\mathcal{H}])$ for any $\sigma$-algebra $\mathcal{H}$.

Proof. Recall our derivation of (1.3.3) showing that
\[ g(x) \ge g(c) + (D_- g)(c)(x - c) \qquad \forall c, x \in G . \]
Further, with $(D_- g)(\cdot)$ a finite, non-decreasing function on $G$ where $g(\cdot)$ is continuous, it follows that
\[ g(x) = \sup_{c \in G \cap \mathbb{Q}} \{ g(c) + (D_- g)(c)(x - c) \} = \sup_n \{ a_n x + b_n \} \]
for some sequences $\{a_n\}$ and $\{b_n\}$ in $\mathbb{R}$ and all $x \in G$.
Since $P(X \in G) = 1$, almost surely $g(X) \ge a_n X + b_n$ and by monotonicity of the C.E. also $E[g(X)|\mathcal{H}] \ge a_n Y + b_n$ for $Y = E[X|\mathcal{H}]$. Further, $P(Y \in G) = 1$ due to the linearity and positivity of the C.E., so almost surely $E[g(X)|\mathcal{H}] \ge \sup_n \{ a_n Y + b_n \} = g(Y)$, as claimed. $\square$
Example 4.2.20. Fixing $q \ge 1$ and applying (the conditional) Jensen's inequality for the convex function $g(x) = |x|^q$, we have that $E[|X|^q | \mathcal{H}] \ge |E[X|\mathcal{H}]|^q$ for any $X \in L^q(\Omega, \mathcal{F}, P)$. So, by the tower property and the monotonicity of the expectation,
\[ \|X\|_q^q = E|X|^q = E[E(|X|^q | \mathcal{H})] \ge E[|E(X|\mathcal{H})|^q] = \|E(X|\mathcal{H})\|_q^q . \]
In conclusion, $\|X\|_q \ge \|E(X|\mathcal{H})\|_q$ for all $q \ge 1$.
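Both the pointwise conditional Jensen inequality and the resulting $L^q$ contraction can be observed on a finite space, where conditioning is atom-wise averaging. A minimal sketch, assuming NumPy; all names are hypothetical illustrations.

```python
import numpy as np

rng = np.random.default_rng(4)
omega = np.arange(12)
X = rng.normal(size=12)
labels = omega // 3                    # atoms of a hypothetical finite sigma-algebra H

def cond_exp(Z, labels):
    """E[Z | sigma(labels)]: average Z over each atom of the partition."""
    out = np.empty_like(Z)
    for a in np.unique(labels):
        out[labels == a] = Z[labels == a].mean()
    return out

Y = cond_exp(X, labels)                # E[X|H]
for q in (1.0, 2.0, 3.0):
    # ||E(X|H)||_q <= ||X||_q : conditioning is an L^q contraction
    assert np.mean(np.abs(Y) ** q) <= np.mean(np.abs(X) ** q) + 1e-12
# pointwise conditional Jensen for g(x) = x^2:
assert np.all(cond_exp(X**2, labels) >= Y**2 - 1e-12)
```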
Exercise 4.2.21. Let $Z = E[X|\mathcal{G}]$ for an integrable random variable $X$ and a $\sigma$-algebra $\mathcal{G}$.
(a) Show that if $EZ^2 = EX^2 < \infty$ then $Z = X$ a.s.
(b) Suppose that $Z = E[X|\mathcal{G}]$ has the same law as $X$. Show that then $Z = X$ a.s. even if $EX^2 = \infty$.
Hint: Show that $E[(|X| - X) I_A] = 0$ for $A = \{Z \ge 0\} \in \mathcal{G}$, so $X \ge 0$ for almost every $\omega \in A$. Applying this for $X - c$ with $c$ non-random, deduce that $P(X < c \le Z) = 0$ and conclude that $X \ge Z$ a.s.
In the following exercises you are to derive the conditional versions of Markov's and Hölder's inequalities.

Exercise 4.2.22. Suppose $p > 0$ is non-random and $X$ is a random variable on $(\Omega, \mathcal{F}, P)$ with $E|X|^p$ finite.
(a) Prove that for every $\sigma$-algebra $\mathcal{G} \subseteq \mathcal{F}$, with probability one,
\[ E[|X|^p | \mathcal{G}] = \int_0^\infty p x^{p-1} P(|X| > x | \mathcal{G}) \, dx . \]
(b) Deduce the conditional version of Markov's inequality, that for any $a > 0$,
\[ P(|X| \ge a \,|\, \mathcal{G}) \le a^{-p} E[|X|^p | \mathcal{G}] \]
(compare with Lemma 1.4.31 and Example 1.3.14).

Exercise 4.2.23. Suppose $E|X|^p < \infty$ and $E|Y|^q < \infty$ for some $p, q > 1$ such that $\frac{1}{p} + \frac{1}{q} = 1$. Prove the conditional Hölder's inequality
\[ E[|XY| \,|\, \mathcal{G}] \le (E[|X|^p | \mathcal{G}])^{1/p} (E[|Y|^q | \mathcal{G}])^{1/q} \]
(compare with Proposition 1.3.17).
Here are the corresponding extensions of some of the convergence theorems of Section 1.3.3.

Theorem 4.2.24 (Monotone convergence for C.E.). If $0 \le X_m \uparrow X_\infty$ a.s. and $EX_\infty < \infty$, then $E[X_m|\mathcal{G}] \uparrow E[X_\infty|\mathcal{G}]$.

Proof. Let $Y_m = E[X_m|\mathcal{G}] \in m\mathcal{G}_+$. By monotonicity of the C.E. we have that the sequence $Y_m$ is a.s. non-decreasing, hence it has a limit $Y_\infty \in m\mathcal{G}_+$ (possibly infinite). We complete the proof by showing that $Y_\infty = E[X_\infty|\mathcal{G}]$. Indeed, for any $G \in \mathcal{G}$,
\[ E[Y_\infty I_G] = \lim_m E[Y_m I_G] = \lim_m E[X_m I_G] = E[X_\infty I_G] , \]
where since $Y_m \uparrow Y_\infty$ and $X_m \uparrow X_\infty$ the first and third equalities follow by the monotone convergence theorem (the unconditional version), and the second equality from the definition of the C.E. $Y_m$. Considering $G = \Omega$ we see that $Y_\infty$ is integrable. In conclusion, $E[X_m|\mathcal{G}] = Y_m \uparrow Y_\infty = E[X_\infty|\mathcal{G}]$, as claimed. $\square$
Lemma 4.2.25 (Fatou's lemma for C.E.). If the non-negative, integrable $X_n$ on the same measurable space $(\Omega, \mathcal{F})$ are such that $\liminf_n X_n$ is integrable, then for any $\sigma$-algebra $\mathcal{G} \subseteq \mathcal{F}$,
\[ E\Big[ \liminf_n X_n \,\Big|\, \mathcal{G} \Big] \le \liminf_n E[X_n|\mathcal{G}] \qquad \text{a.s.} \]

Proof. Applying the monotone convergence theorem for the C.E. of the non-decreasing sequence of non-negative R.V.s $Z_n = \inf\{X_k : k \ge n\}$ (whose limit is the integrable $\liminf_n X_n$), results with
(4.2.3) $E[\liminf_n X_n \,|\, \mathcal{G}] = E(\lim_n Z_n \,|\, \mathcal{G}) = \lim_n E[Z_n|\mathcal{G}]$ a.s.
Since $Z_n \le X_n$ it follows that $E[Z_n|\mathcal{G}] \le E[X_n|\mathcal{G}]$ for all $n$ and
(4.2.4) $\lim_n E[Z_n|\mathcal{G}] = \liminf_n E[Z_n|\mathcal{G}] \le \liminf_n E[X_n|\mathcal{G}]$ a.s.
Upon combining (4.2.3) and (4.2.4) we obtain the thesis of the lemma. $\square$
Fatou's lemma leads to the C.E. version of the dominated convergence theorem.

Theorem 4.2.26 (Dominated convergence for C.E.). If $\sup_m |X_m|$ is integrable and $X_m \xrightarrow{a.s.} X_\infty$, then $E[X_m|\mathcal{G}] \xrightarrow{a.s.} E[X_\infty|\mathcal{G}]$.

Proof. Let $Y = \sup_m |X_m|$ and $Z_m = Y - X_m \ge 0$. Applying Fatou's lemma for the C.E. of the non-negative, integrable R.V.s $Z_m \le 2Y$, we see that
\[ E\Big[ \liminf_m Z_m \,\Big|\, \mathcal{G} \Big] \le \liminf_m E[Z_m|\mathcal{G}] \qquad \text{a.s.} \]
Since $X_m$ converges, by the linearity of the C.E. and integrability of $Y$ this is equivalent to
\[ E\Big[ \lim_m X_m \,\Big|\, \mathcal{G} \Big] \ge \limsup_m E[X_m|\mathcal{G}] \qquad \text{a.s.} \]
Applying the same argument for the non-negative, integrable R.V.s $W_m = Y + X_m$ results with
\[ E\Big[ \lim_m X_m \,\Big|\, \mathcal{G} \Big] \le \liminf_m E[X_m|\mathcal{G}] \qquad \text{a.s.} \]
We thus conclude that a.s. the $\liminf$ and $\limsup$ of the sequence $E[X_m|\mathcal{G}]$ coincide and are equal to $E[X_\infty|\mathcal{G}]$, as stated. $\square$
Exercise 4.2.27. Let $X_1, X_2$ be random variables defined on the same probability space $(\Omega, \mathcal{F}, P)$ and $\mathcal{G} \subseteq \mathcal{F}$ a $\sigma$-algebra. Prove that (a), (b) and (c) below are equivalent.
(a) For any Borel sets $B_1$ and $B_2$,
\[ P(X_1 \in B_1, X_2 \in B_2 \,|\, \mathcal{G}) = P(X_1 \in B_1 \,|\, \mathcal{G}) \, P(X_2 \in B_2 \,|\, \mathcal{G}) . \]
(b) For any bounded Borel functions $h_1$ and $h_2$,
\[ E[h_1(X_1) h_2(X_2) \,|\, \mathcal{G}] = E[h_1(X_1)|\mathcal{G}] \, E[h_2(X_2)|\mathcal{G}] . \]
(c) For any bounded Borel function $h$,
\[ E[h(X_1) \,|\, \sigma(\mathcal{G}, \sigma(X_2))] = E[h(X_1)|\mathcal{G}] . \]

Definition 4.2.28. If one of the equivalent conditions of Exercise 4.2.27 holds, we say that $X_1$ and $X_2$ are conditionally independent given $\mathcal{G}$.

Exercise 4.2.29. Suppose that $X$ and $Y$ are conditionally independent given $\sigma(Z)$ and that $X$ and $Z$ are conditionally independent given $\mathcal{T}$, where $\mathcal{T} \subseteq \sigma(Z)$. Prove that then $X$ and $Y$ are conditionally independent given $\mathcal{T}$.
Our next result shows that the C.E. operation is continuous with respect to $L^q$ convergence.

Theorem 4.2.30. Suppose $X_n \xrightarrow{L^q} X_\infty$. That is, $X_n, X_\infty \in L^q(\Omega, \mathcal{F}, P)$ are such that $E(|X_n - X_\infty|^q) \to 0$. Then, $E[X_n|\mathcal{G}] \xrightarrow{L^q} E[X_\infty|\mathcal{G}]$ for any $\sigma$-algebra $\mathcal{G} \subseteq \mathcal{F}$.

Proof. We saw already in Example 4.2.20 that $E[X_n|\mathcal{G}]$ are in $L^q(\Omega, \mathcal{G}, P)$ for $n \le \infty$. Further, by the linearity of the C.E., Jensen's inequality for the convex function $|x|^q$ as in this example, and the tower property of (4.2.1),
\[ E[|E(X_n|\mathcal{G}) - E(X_\infty|\mathcal{G})|^q] = E[|E(X_n - X_\infty|\mathcal{G})|^q] \le E[E(|X_n - X_\infty|^q \,|\, \mathcal{G})] = E[|X_n - X_\infty|^q] \to 0 , \]
by our hypothesis, yielding the thesis of the theorem. $\square$
As you will show, the C.E. operation is also continuous with respect to the following topology of weak $L^q$ convergence.

Definition 4.2.31. Let $L^\infty(\Omega, \mathcal{F}, P)$ denote the collection of all random variables on $(\Omega, \mathcal{F})$ which are $P$-a.s. bounded, with $\|Y\|_\infty$ denoting the smallest non-random $K$ such that $P(|Y| \le K) = 1$. Setting $p(q) : [1, \infty] \to [1, \infty]$ via $p(q) = q/(q-1)$, we say that $X_n$ converges weakly in $L^q$ to $X_\infty$, denoted $X_n \xrightarrow{wL^q} X_\infty$, if $X_n, X_\infty \in L^q$ and $E[(X_n - X_\infty) Y] \to 0$ for each fixed $Y$ such that $\|Y\|_{p(q)}$ is finite (compare with Definition 1.3.26).

Exercise 4.2.32. Show that $E[Y E(X|\mathcal{G})] = E[X E(Y|\mathcal{G})]$ for any $\sigma$-algebra $\mathcal{G} \subseteq \mathcal{F}$, provided that for some $q \ge 1$ and $p = q/(q-1)$ both $\|X\|_q$ and $\|Y\|_p$ are finite. Deduce that if $X_n \xrightarrow{wL^q} X_\infty$ then $E[X_n|\mathcal{G}] \xrightarrow{wL^q} E[X_\infty|\mathcal{G}]$ for any $\sigma$-algebra $\mathcal{G} \subseteq \mathcal{F}$.
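The self-adjointness identity of Exercise 4.2.32 can be checked at a glance on a finite space. A minimal sketch, assuming NumPy; the space, the partition and the helper are hypothetical illustrations.

```python
import numpy as np

rng = np.random.default_rng(7)
omega = np.arange(12)
X = rng.normal(size=12)
Y = rng.normal(size=12)
labels = omega // 3                    # atoms of a hypothetical finite sigma-algebra G

def cond_exp(Z, labels):
    """E[Z | sigma(labels)]: average Z over each atom of the partition."""
    out = np.empty_like(Z)
    for a in np.unique(labels):
        out[labels == a] = Z[labels == a].mean()
    return out

# E[Y E(X|G)] = E[X E(Y|G)]: conditioning is a self-adjoint operator
lhs = np.mean(Y * cond_exp(X, labels))
rhs = np.mean(X * cond_exp(Y, labels))
assert np.isclose(lhs, rhs)
```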
In view of Example 4.2.20 we already know that for each integrable random variable $X$ the collection $\{E[X|\mathcal{G}] : \mathcal{G} \subseteq \mathcal{F} \text{ is a } \sigma\text{-algebra}\}$ is bounded in $L^1(\Omega, \mathcal{F}, P)$. As we show next, this collection is even uniformly integrable (U.I.), a key fact in our study of uniformly integrable martingales (see Subsection 5.3.1).

Proposition 4.2.33. For any $X \in L^1(\Omega, \mathcal{F}, P)$, the collection $\{E[X|\mathcal{H}] : \mathcal{H} \subseteq \mathcal{F} \text{ is a } \sigma\text{-algebra}\}$ is U.I.

Proof. Fixing $\epsilon > 0$, let $\delta = \delta(X, \epsilon) > 0$ be as in part (b) of Exercise 1.3.43 and set the finite constant $M = \delta^{-1} E|X|$. By Markov's inequality and Example 4.2.20 we get that $M P(A) \le E|Y| \le E|X|$ for $A = \{|Y| \ge M\} \in \mathcal{H}$ and $Y = E[X|\mathcal{H}]$. Hence, $P(A) \le \delta$ by our choice of $M$, whereby our choice of $\delta$ results with $E[|X| I_A] \le \epsilon$ (c.f. part (b) of Exercise 1.3.43). Further, by (the conditional) Jensen's inequality $|Y| \le E[|X| \,|\, \mathcal{H}]$ (see Example 4.2.20). Therefore, by definition of the C.E. $E[|X| \,|\, \mathcal{H}]$,
\[ E[|Y| I_{\{|Y| \ge M\}}] = E[|Y| I_A] \le E[E(|X| \,|\, \mathcal{H}) I_A] = E[|X| I_A] \le \epsilon . \]
Since this applies for any $\sigma$-algebra $\mathcal{H} \subseteq \mathcal{F}$ and the value of $M = M(X, \epsilon)$ does not depend on $Y$, we conclude that the collection of such $Y = E[X|\mathcal{H}]$ is U.I. $\square$

To check your understanding of the preceding derivation, prove the following natural extension of Proposition 4.2.33.

Exercise 4.2.34. Let $\mathcal{C}$ be a uniformly integrable collection of random variables on $(\Omega, \mathcal{F}, P)$. Show that the collection $\mathcal{D}$ of all R.V. $Y$ such that $Y \overset{a.s.}{=} E[X|\mathcal{H}]$ for some $X \in \mathcal{C}$ and $\sigma$-algebra $\mathcal{H} \subseteq \mathcal{F}$, is U.I.
Here is a somewhat counter-intuitive fact about the conditional expectation.

Exercise 4.2.35. Suppose $Y_n \xrightarrow{a.s.} Y_\infty$ in $(\Omega, \mathcal{F}, P)$ when $n \to \infty$ and $\{Y_n\}$ are uniformly integrable.
(a) Show that $E[Y_n|\mathcal{G}] \xrightarrow{L^1} E[Y_\infty|\mathcal{G}]$ for any $\sigma$-algebra $\mathcal{G} \subseteq \mathcal{F}$.
(b) Provide an example of such a sequence $Y_n$ and a $\sigma$-algebra $\mathcal{G} \subseteq \mathcal{F}$ such that $E[Y_n|\mathcal{G}]$ does not converge almost surely to $E[Y_\infty|\mathcal{G}]$.
4.3. The conditional expectation as an orthogonal projection
It readily follows from our next proposition that for $X \in L^2(\Omega, \mathcal{F}, P)$ and $\sigma$-algebras $\mathcal{G} \subseteq \mathcal{F}$ the C.E. $Y = E[X|\mathcal{G}]$ is the unique $Y \in L^2(\Omega, \mathcal{G}, P)$ such that
(4.3.1) $\|X - Y\|_2 = \inf\{\|X - W\|_2 : W \in L^2(\Omega, \mathcal{G}, P)\}$.

Proposition 4.3.1. For any $X \in L^2(\Omega, \mathcal{F}, P)$ and $\sigma$-algebras $\mathcal{G} \subseteq \mathcal{F}$, a R.V. $Y \in L^2(\Omega, \mathcal{G}, P)$ is optimal in the sense of (4.3.1) if and only if it satisfies the orthogonality relations
(4.3.2) $E[(X - Y) Z] = 0$ for all $Z \in L^2(\Omega, \mathcal{G}, P)$.
Further, any such R.V. $Y$ is a version of $E[X|\mathcal{G}]$.

Proof. If $Y \in L^2(\Omega, \mathcal{G}, P)$ satisfies (4.3.1) then considering $W = Y + \lambda Z$ it follows that for any $Z \in L^2(\Omega, \mathcal{G}, P)$ and $\lambda \in \mathbb{R}$,
\[ 0 \le \|X - Y - \lambda Z\|_2^2 - \|X - Y\|_2^2 = \lambda^2 E Z^2 - 2\lambda E[(X - Y) Z] . \]
By elementary calculus, this inequality holds for all $\lambda \in \mathbb{R}$ if and only if $E[(X - Y) Z] = 0$. Conversely, suppose $Y \in L^2(\Omega, \mathcal{G}, P)$ satisfies (4.3.2) and fix $W \in L^2(\Omega, \mathcal{G}, P)$. Then, considering (4.3.2) for $Z = W - Y$ we see that
\[ \|X - W\|_2^2 = \|X - Y\|_2^2 - 2 E[(X - Y)(W - Y)] + \|W - Y\|_2^2 \ge \|X - Y\|_2^2 , \]
so necessarily $Y$ satisfies (4.3.1). Finally, since $I_G \in L^2(\Omega, \mathcal{G}, P)$ for any $G \in \mathcal{G}$, if $Y$ satisfies (4.3.2) then it also satisfies the identity (4.1.1) which characterizes the C.E. $E[X|\mathcal{G}]$. $\square$
Example 4.3.2. If $\mathcal{G} = \sigma(A_1, \ldots, A_n)$ for finite $n$ and disjoint sets $A_i$ such that $P(A_i) > 0$ for $i = 1, \ldots, n$, then $L^2(\Omega, \mathcal{G}, P)$ consists of all variables of the form $W = \sum_{i=1}^n v_i I_{A_i}$, $v_i \in \mathbb{R}$. A R.V. $Y$ of this form satisfies (4.3.1) if and only if the corresponding $v_i$ minimize
\[ E\Big[ \Big( X - \sum_{i=1}^n v_i I_{A_i} \Big)^2 \Big] - E X^2 = \sum_{i=1}^n P(A_i) v_i^2 - 2 \sum_{i=1}^n v_i E[X I_{A_i}] , \]
which amounts to $v_i = E[X I_{A_i}] / P(A_i)$. In particular, if $Z = \sum_{i=1}^n z_i I_{A_i}$ for distinct $z_i$-s, then $\sigma(Z) = \mathcal{G}$ and we thus recover our first definition of the C.E.
\[ E[X|Z] = \sum_{i=1}^n \frac{E[X I_{\{Z = z_i\}}]}{P(Z = z_i)} \, I_{\{Z = z_i\}} . \]
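The least-squares characterization of this example can be tried out numerically: with $v_i = E[X I_{A_i}]/P(A_i)$, perturbing any coefficient strictly increases the squared distance. A minimal sketch, assuming NumPy; the uniform 12-point space and three atoms are hypothetical illustrations.

```python
import numpy as np

rng = np.random.default_rng(5)
p = np.full(12, 1 / 12)               # uniform P on a hypothetical 12-point space
X = rng.normal(size=12)
labels = np.arange(12) // 4           # disjoint atoms A_1, A_2, A_3

# v_i = E[X I_{A_i}] / P(A_i), the optimal coefficients of Example 4.3.2
v = np.array([(p * X)[labels == i].sum() / p[labels == i].sum()
              for i in range(3)])
Y = v[labels]                         # the projection  sum_i v_i I_{A_i}

def mse(W):
    """||X - W||_2^2 with respect to P."""
    return (p * (X - W) ** 2).sum()

# any perturbation of a coefficient increases the squared distance by P(A_i) * eps^2
for i in range(3):
    for eps in (-0.1, 0.1):
        w = v.copy()
        w[i] += eps
        assert mse(w[labels]) > mse(Y)
```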
As shown in the sequel, using (4.3.1) as an alternative characterization of the C.E. of $X \in L^2(\Omega, \mathcal{F}, P)$, we can prove the existence of the C.E. without invoking the Radon-Nikodym theorem. We start by defining the relevant concepts from the theory of Hilbert spaces on which this approach is based.
Definition 4.3.3. A linear vector space is a set $H$ that is closed under operations of addition and multiplication by (real-valued) scalars. That is, if $h_1, h_2 \in H$ then $h_1 + h_2 \in H$ and $\alpha h \in H$ for all $\alpha \in \mathbb{R}$, where $\alpha(h_1 + h_2) = \alpha h_1 + \alpha h_2$, $(\alpha + \beta) h = \alpha h + \beta h$, $\alpha(\beta h) = (\alpha \beta) h$ and $1 h = h$. A normed vector space is a linear vector space $H$ equipped with a norm $\|\cdot\|$. That is, a non-negative function on $H$ such that $\|\alpha h\| = |\alpha| \|h\|$ for all $\alpha \in \mathbb{R}$ and $d(h_1, h_2) = \|h_1 - h_2\|$ is a metric on $H$.

Definition 4.3.4. A sequence $h_n$ in a normed vector space is called a Cauchy sequence if $\sup_{k, m \ge n} \|h_k - h_m\| \to 0$ as $n \to \infty$, and we say that $h_n$ converges to $h \in H$ if $\|h_n - h\| \to 0$ as $n \to \infty$. A Banach space is a normed vector space in which every Cauchy sequence converges.

Building on the preceding, we define the concept of inner product and the corresponding Hilbert spaces and sub-spaces.

Definition 4.3.5. A Hilbert space is a Banach space $H$ whose norm is of the form $(h, h)^{1/2}$ for a bi-linear, symmetric function $(h_1, h_2) : H \times H \to \mathbb{R}$ such that $(h, h) \ge 0$, and we call such $(h_1, h_2)$ an inner product for $H$. A subset $K$ of a Hilbert space which is closed under addition and under multiplication by a scalar is called a Hilbert sub-space if every Cauchy sequence $h_n \in K$ has a limit in $K$.

Here are two elementary properties of inner products we use in the sequel.

Exercise 4.3.6. Let $\|h\| = (h, h)^{1/2}$ with $(h_1, h_2)$ an inner product for a linear vector space $H$. Show that Schwarz's inequality
\[ (u, v)^2 \le \|u\|^2 \|v\|^2 , \]
and the parallelogram law $\|u + v\|^2 + \|u - v\|^2 = 2\|u\|^2 + 2\|v\|^2$ hold for any $u, v \in H$.
Our next proposition shows that for each finite $q \ge 1$ the space $L^q(\Omega, \mathcal{F}, P)$ is a Banach space for the norm $\|\cdot\|_q$, the usual addition of R.V.s and the multiplication of a R.V. $X(\omega)$ by a non-random (scalar) constant. Further, $L^2(\Omega, \mathcal{G}, P)$ is a Hilbert sub-space of $L^2(\Omega, \mathcal{F}, P)$ for any $\sigma$-algebras $\mathcal{G} \subseteq \mathcal{F}$.
Proposition 4.3.7. Upon identifying $\mathbb{R}$-valued R.V. which are equal with probability one as being in the same equivalence class, for each $q \ge 1$ and $\sigma$-algebra $\mathcal{F}$, the space $L^q(\Omega, \mathcal{F}, P)$ is a Banach space for the norm $\|\cdot\|_q$. Further, $L^2(\Omega, \mathcal{G}, P)$ is then a Hilbert sub-space of $L^2(\Omega, \mathcal{F}, P)$ for the inner product $(X, Y) = E X Y$ and any $\sigma$-algebras $\mathcal{G} \subseteq \mathcal{F}$.

Proof. Fixing $q \ge 1$, we identify $X$ and $Y$ such that $P(X \ne Y) = 0$ as being the same element of $L^q(\Omega, \mathcal{F}, P)$. The resulting set of equivalence classes is a normed vector space. Indeed, both $\|\cdot\|_q$, the addition of R.V. and the multiplication by a non-random scalar are compatible with this equivalence relation. Further, if $X, Y \in L^q(\Omega, \mathcal{F}, P)$ then $\|\alpha X\|_q = |\alpha| \|X\|_q < \infty$ for all $\alpha \in \mathbb{R}$ and by Minkowski's inequality $\|X + Y\|_q \le \|X\|_q + \|Y\|_q < \infty$. Consequently, $L^q(\Omega, \mathcal{F}, P)$ is closed under the operations of addition and multiplication by a non-random scalar, with $\|\cdot\|_q$ a norm on this collection of equivalence classes.
Suppose next that $X_n \in L^q$ is a Cauchy sequence for $\|\cdot\|_q$. Then, by definition, there exist $k_n \uparrow \infty$ such that $\|X_r - X_s\|_q^q < 2^{-n(q+1)}$ for all $r, s \ge k_n$. Observe that by Markov's inequality
\[ P(|X_{k_{n+1}} - X_{k_n}| \ge 2^{-n}) \le 2^{nq} \|X_{k_{n+1}} - X_{k_n}\|_q^q < 2^{-n} , \]
and consequently the sequence $P(|X_{k_{n+1}} - X_{k_n}| \ge 2^{-n})$ is summable. By Borel-Cantelli I it follows that $\sum_n |X_{k_{n+1}}(\omega) - X_{k_n}(\omega)|$ is finite with probability one, in which case clearly
\[ X_{k_n} = X_{k_1} + \sum_{i=1}^{n-1} (X_{k_{i+1}} - X_{k_i}) \]
converges to a finite limit $X(\omega)$. Next let $X = \limsup_n X_{k_n}$ (which per Theorem 1.2.22 is an $\overline{\mathbb{R}}$-valued R.V.). Then, fixing $n$ and $r \ge k_n$, for any $t \ge n$,
\[ E\big[ |X_r - X_{k_t}|^q \big] = \|X_r - X_{k_t}\|_q^q \le 2^{-nq} , \]
so that by the a.s. convergence of $X_{k_t}$ to $X$ and Fatou's lemma
\[ E|X_r - X|^q = E\big[ \lim_t |X_r - X_{k_t}|^q \big] \le \liminf_t E|X_r - X_{k_t}|^q \le 2^{-nq} . \]
This inequality implies that $X_r - X \in L^q$ and hence also $X \in L^q$. As $r \to \infty$ so does $n$, and we can further deduce from the preceding inequality that $X_r \xrightarrow{L^q} X$.
Recall that $|E X Y| \le \sqrt{E X^2} \sqrt{E Y^2}$ by the Cauchy-Schwarz inequality. Thus, the bi-linear, symmetric function $(X, Y) = E X Y$ on $L^2 \times L^2$ is real-valued and compatible with our equivalence relation. As $\|X\|_2^2 = (X, X)$, the Banach space $L^2(\Omega, \mathcal{F}, P)$ is a Hilbert space with respect to this inner product.
Finally, observe that for any $\sigma$-algebra $\mathcal{G} \subseteq \mathcal{F}$ the subset $L^2(\Omega, \mathcal{G}, P)$ of the Hilbert space $L^2(\Omega, \mathcal{F}, P)$ is closed under addition of R.V.s and multiplication by a non-random constant. Further, as shown before, the $L^2$ limit of a Cauchy sequence $X_n \in L^2(\Omega, \mathcal{G}, P)$ is $\limsup_n X_{k_n}$, which also belongs to $L^2(\Omega, \mathcal{G}, P)$. Hence, the latter is a Hilbert subspace of $L^2(\Omega, \mathcal{F}, P)$. $\square$
Remark. With minor notational modifications, this proof shows that for any measure $\mu$ on $(S, \mathcal{F})$ and $q \ge 1$ finite, the set $L^q(S, \mathcal{F}, \mu)$ of $\mu$-a.e. equivalence classes of $\mathbb{R}$-valued, measurable functions $f$ such that $\mu(|f|^q) < \infty$, is a Banach space. This is merely a special case of a general extension of this property, corresponding to $\mathbb{Y} = \mathbb{R}$ in your next exercise.

Exercise 4.3.8. For $q \ge 1$ finite and a given Banach space $(\mathbb{Y}, \|\cdot\|)$, consider the space $L^q(S, \mathcal{F}, \mu; \mathbb{Y})$ of all $\mu$-a.e. equivalence classes of functions $f : S \to \mathbb{Y}$, measurable with respect to the Borel $\sigma$-algebra induced on $\mathbb{Y}$ by $\|\cdot\|$ and such that $\mu(\|f(\cdot)\|^q) < \infty$.
(a) Show that $\|f\|_q = \mu(\|f(\cdot)\|^q)^{1/q}$ makes $L^q(S, \mathcal{F}, \mu; \mathbb{Y})$ into a Banach space.
(b) For future applications of the preceding, verify that the space $\mathbb{Y} = C_b(\mathbb{T})$ of bounded, continuous real-valued functions on a topological space $\mathbb{T}$ is a Banach space for the supremum norm $\|f\| = \sup\{|f(t)| : t \in \mathbb{T}\}$.

Your next exercise extends Proposition 4.3.7 to the collection $L^\infty(\Omega, \mathcal{F}, P)$ of all $\mathbb{R}$-valued R.V. which are in equivalence classes of bounded random variables.

Exercise 4.3.9. Fixing a probability space $(\Omega, \mathcal{F}, P)$, prove the following facts:
(a) $L^\infty(\Omega, \mathcal{F}, P)$ is a Banach space for $\|X\|_\infty = \inf\{M : P(|X| \le M) = 1\}$.
(b) $\|X\|_q \uparrow \|X\|_\infty$ as $q \to \infty$, for any $X \in L^\infty(\Omega, \mathcal{F}, P)$.
(c) If $E[|X|^q] < \infty$ for some $q > 0$ then $E|X|^q \to P(|X| > 0)$ as $q \downarrow 0$.
(d) The collection SF of simple functions is dense in $L^q(\Omega, \mathcal{F}, P)$ for any $1 \le q \le \infty$.
(e) The collection $C_b(\mathbb{R})$ of bounded, continuous real-valued functions is dense in $L^q(\mathbb{R}, \mathcal{B}, \lambda)$ for any $q \ge 1$ finite.
Hint: The (bounded) monotone class theorem might be handy.
In view of Proposition 4.3.7, the existence of the C.E. of $X \in L^2$ which satisfies (4.3.1), or the equivalent condition (4.3.2), is a special instance of the following fundamental geometric property of Hilbert spaces.

Theorem 4.3.10 (Orthogonal projection). Given $h \in H$ and a Hilbert sub-space $G$ of $H$, let $d = \inf\{\|h - g\| : g \in G\}$. Then, there exists a unique $\widehat{h} \in G$, called the orthogonal projection of $h$ on $G$, such that $d = \|h - \widehat{h}\|$. This is also the unique $\widehat{h} \in G$ such that $(h - \widehat{h}, f) = 0$ for all $f \in G$.
Proof. We start with the existence of $\widehat{h} \in G$ such that $d = \|h - \widehat{h}\|$. To this end, let $g_n \in G$ be such that $\|h - g_n\| \to d$. Applying the parallelogram law for $u = h - \frac{1}{2}(g_m + g_k)$ and $v = \frac{1}{2}(g_m - g_k)$ we find that
\[ \|h - g_k\|^2 + \|h - g_m\|^2 = 2 \big\|h - \tfrac{1}{2}(g_m + g_k)\big\|^2 + 2 \big\|\tfrac{1}{2}(g_m - g_k)\big\|^2 \ge 2 d^2 + \tfrac{1}{2} \|g_m - g_k\|^2 \]
since $\frac{1}{2}(g_m + g_k) \in G$. Taking $k, m \to \infty$, both $\|h - g_k\|^2$ and $\|h - g_m\|^2$ approach $d^2$ and hence by the preceding inequality $\|g_m - g_k\| \to 0$. In conclusion, $g_n$ is a Cauchy sequence in the Hilbert sub-space $G$, which thus converges to some $\widehat{h} \in G$. Recall that $\|h - \widehat{h}\| \ge d$ by the definition of $d$. Since for $n \to \infty$ both $\|h - g_n\| \to d$ and $\|g_n - \widehat{h}\| \to 0$, the converse inequality is a consequence of the triangle inequality
\[ \|h - \widehat{h}\| \le \|h - g_n\| + \|g_n - \widehat{h}\| . \]
Next, suppose there exist $g_1, g_2 \in G$ such that $(h - g_i, f) = 0$ for $i = 1, 2$ and all $f \in G$. Then, by linearity of the inner product, $(g_1 - g_2, f) = 0$ for all $f \in G$. Considering $f = g_1 - g_2 \in G$ we see that $(g_1 - g_2, g_1 - g_2) = \|g_1 - g_2\|^2 = 0$, so necessarily $g_1 = g_2$.
We complete the proof by showing that $\widehat{h} \in G$ is such that $\|h - \widehat{h}\|^2 \le \|h - g\|^2$ for all $g \in G$ if and only if $(h - \widehat{h}, f) = 0$ for all $f \in G$. This is done exactly as in the proof of Proposition 4.3.1. That is, by symmetry and bi-linearity of the inner product, for all $f \in G$ and $\lambda \in \mathbb{R}$,
\[ \|h - \widehat{h} - \lambda f\|^2 - \|h - \widehat{h}\|^2 = \lambda^2 \|f\|^2 - 2\lambda (h - \widehat{h}, f) . \]
We arrive at the stated conclusion upon noting that fixing $f$, this function is non-negative for all $\lambda$ if and only if $(h - \widehat{h}, f) = 0$. $\square$
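In finite dimensions the orthogonal projection of the theorem is computable by least squares, and both of its characterizations (minimal distance and orthogonality to the sub-space) can be verified directly. A minimal sketch, assuming NumPy; the subspace and vector are hypothetical illustrations.

```python
import numpy as np

rng = np.random.default_rng(6)
# Hypothetical Hilbert space H = R^5 with subspace G = span of the columns of B.
B = rng.normal(size=(5, 2))
h = rng.normal(size=5)

# least squares solves the normal equations, giving the projection of h on G
coef, *_ = np.linalg.lstsq(B, h, rcond=None)
h_hat = B @ coef

# orthogonality: h - h_hat is perpendicular to every generator of G
assert np.allclose(B.T @ (h - h_hat), 0)
# minimality: no other element of G is closer to h
g = B @ rng.normal(size=2)
assert np.linalg.norm(h - h_hat) <= np.linalg.norm(h - g) + 1e-12
```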
Applying Theorem 4.3.10 for the Hilbert subspace $G = L^2(\Omega, \mathcal{G}, P)$ of $L^2(\Omega, \mathcal{F}, P)$ (see Proposition 4.3.7), you have the existence of a unique $Y \in G$ satisfying (4.3.2) for each non-negative $X \in L^2$.

Exercise 4.3.11. Show that for any non-negative integrable $X$, not necessarily in $L^2$, the sequence $Y_n \in G$ corresponding to $X_n = \min(X, n)$ is non-decreasing and that its limit $Y$ satisfies (4.1.1). Verify that this allows you to prove Theorem 4.1.2 without ever invoking the Radon-Nikodym theorem.

Exercise 4.3.12. Suppose $\mathcal{G} \subseteq \mathcal{F}$ is a $\sigma$-algebra.
(a) Show that for any $X \in L^1(\Omega, \mathcal{F}, P)$ there exists some $G \in \mathcal{G}$ such that
\[ E[X I_G] = \sup_{A \in \mathcal{G}} E[X I_A] . \]
Any $G$ with this property is called $\mathcal{G}$-optimal for $X$.
(b) Show that $Y = E[X|\mathcal{G}]$ almost surely, if and only if for any $r \in \mathbb{R}$, the event $\{\omega : Y(\omega) > r\}$ is $\mathcal{G}$-optimal for the random variable $(X - r)$.
Here is an alternative proof of the existence of $E[X|\mathcal{G}]$ for non-negative $X \in L^2$ which avoids the orthogonal projection, as well as the Radon-Nikodym theorem (the general case then follows as in Exercise 4.3.11).

Exercise 4.3.13. Suppose $X \in L^2(\Omega, \mathcal{F}, P)$ is non-negative. Assume first that the $\sigma$-algebra $\mathcal{G} \subseteq \mathcal{F}$ is countably generated. That is, $\mathcal{G} = \sigma(B_1, B_2, \ldots)$ for some $B_k \in \mathcal{F}$.
(a) Let $Y_n = E[X|\mathcal{G}_n]$ for the finitely generated $\mathcal{G}_n = \sigma(B_k, k \le n)$ (for its existence, see Example 4.3.2). Show that $Y_n$ is a Cauchy sequence in $L^2(\Omega, \mathcal{G}, P)$, hence it has a limit $Y \in L^2(\Omega, \mathcal{G}, P)$.
(b) Show that $Y = E[X|\mathcal{G}]$.
Hint: Theorem 2.2.10 might be of some help.
Assume now that $\mathcal{G}$ is not countably generated.
(c) Let $\mathcal{H}_1 \subseteq \mathcal{H}_2$ be finite $\sigma$-algebras. Show that
\[ E[E(X|\mathcal{H}_1)^2] \le E[E(X|\mathcal{H}_2)^2] . \]
(d) Let $\gamma = \sup E[E(X|\mathcal{H})^2]$, where the supremum is over all finite $\sigma$-algebras $\mathcal{H} \subseteq \mathcal{G}$. Show that $\gamma$ is finite, and that there exists an increasing sequence of finite $\sigma$-algebras $\mathcal{H}_n$ such that $E[E(X|\mathcal{H}_n)^2] \to \gamma$ as $n \to \infty$.
(e) Let $\mathcal{H}_\infty = \sigma(\bigcup_n \mathcal{H}_n)$ and $Y_n = E[X|\mathcal{H}_n]$ for the $\mathcal{H}_n$ in part (d). Explain why your proof of part (b) implies the $L^2$ convergence of $Y_n$ to a R.V. $Y$ such that $E[Y I_A] = E[X I_A]$ for any $A \in \mathcal{H}_\infty$.
(f) Fixing $A \in \mathcal{G}$ such that $A \notin \mathcal{H}_\infty$, let $\mathcal{H}_{n,A} = \sigma(A, \mathcal{H}_n)$ and $Z_n = E[X|\mathcal{H}_{n,A}]$. Explain why some sub-sequence of $Z_n$ has an a.s. and $L^2$ limit, denoted $Z$. Show that $E Z^2 = E Y^2 = \gamma$ and deduce that $E[(Y - Z)^2] = 0$, hence $Z = Y$ a.s.
(g) Show that $Y$ is a version of the C.E. $E[X|\mathcal{G}]$.
4.4. Regular conditional probability distributions
We first show that if the random vector (X, Z) ∈ R² has a probability den-
sity function f_{X,Z}(x, z) (per Definition 3.5.5), then the C.E. E[X|Z] can be com-
puted out of the corresponding conditional probability density (as done in a typ-
ical elementary probability course). To this end, let
    f_Z(z) = ∫_R f_{X,Z}(x, z) dx  and  f_X(x) = ∫_R f_{X,Z}(x, z) dz
denote the probability density functions of Z and X. That
is, f_Z(z) = λ(f_{X,Z}(·, z)) and f_X(x) = λ(f_{X,Z}(x, ·)) for Lebesgue measure λ and the
Borel function f_{X,Z} on R². Recall that f_Z(·) and f_X(·) are non-negative Borel
functions (for example, consider our proof of Fubini's theorem in case of Lebesgue
measure on (R², ℬ_{R²}) and the non-negative integrable Borel function h = f_{X,Z}).
So, defining the conditional probability density function of X given Z as
    f_{X|Z}(x|z) = f_{X,Z}(x, z)/f_Z(z)  if f_Z(z) > 0,
    f_{X|Z}(x|z) = f_X(x)  otherwise,
guarantees that f_{X|Z} : R² → [0, ∞) is Borel measurable and ∫_R f_{X|Z}(x|z) dx = 1 for
all z ∈ R.
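As a quick numerical illustration of this construction (my own sketch, not from the text: the standard bivariate normal pair, the grid and the step size are all assumptions), one can discretize a joint density, form f_{X|Z} exactly as above, and confirm that it integrates to one in x:

```python
import numpy as np

# Assumed example: standard bivariate normal (X, Z) with correlation rho.
rho, z = 0.6, 0.7
xs = np.linspace(-8.0, 8.0, 2001)
dx = xs[1] - xs[0]

def f_XZ(x, zz):
    # joint density of a standard bivariate normal with correlation rho
    c = 1.0 / (2.0 * np.pi * np.sqrt(1.0 - rho**2))
    return c * np.exp(-(x**2 - 2.0 * rho * x * zz + zz**2) / (2.0 * (1.0 - rho**2)))

f_Z = (f_XZ(xs, z) * dx).sum()        # marginal f_Z(z) by quadrature in x
f_cond = f_XZ(xs, z) / f_Z            # conditional density f_{X|Z}(.|z)
total_mass = (f_cond * dx).sum()      # equals 1 by construction
cond_mean = (xs * f_cond * dx).sum()  # for this joint law equals rho * z
```

For this particular joint law the conditional distribution of X given Z = z is N(ρz, 1 − ρ²), so the computed conditional mean should also match ρz.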
Proposition 4.4.1. Suppose the random vector (X, Z) has a probability density
function f_{X,Z}(x, z) and g(·) is a Borel function on R such that E|g(X)| < ∞.
Then, ĝ(Z) is a version of E[g(X)|Z] for the Borel function
(4.4.1)    ĝ(z) = ∫_R g(x) f_{X|Z}(x|z) dx,
in case ∫_R |g(x)| f_{X,Z}(x, z) dx is finite (taking otherwise ĝ(z) = 0).

Proof. Since the Borel function h(x, z) = g(x) f_{X,Z}(x, z) is integrable with
respect to Lebesgue measure on (R², ℬ_{R²}), it follows that ĝ(·) is also a Borel function
(c.f. our proof of Fubini's theorem). Further, by Fubini's theorem the integrability
of g(X) implies that λ(R \ A) = 0 for A = {z : ∫ |g(x)| f_{X,Z}(x, z) dx < ∞}, and
since the law of Z has the density f_Z with respect to λ, this implies that P(Z ∈ A) = 1. By Jensen's inequality,
    |ĝ(z)| ≤ ∫ |g(x)| f_{X|Z}(x|z) dx,  z ∈ A.
Thus, by Fubini's theorem and the definition of f_{X|Z} we have that
    ∞ > E|g(X)| = ∫ |g(x)| f_X(x) dx ≥ ∫ |g(x)| [∫_A f_{X|Z}(x|z) f_Z(z) dz] dx
    = ∫_A [∫ |g(x)| f_{X|Z}(x|z) dx] f_Z(z) dz ≥ ∫_A |ĝ(z)| f_Z(z) dz = E|ĝ(Z)|.
So, ĝ(Z) is integrable. With (4.4.1) holding for all z ∈ A and P(Z ∈ A) = 1, by
Fubini's theorem and the definition of f_{X|Z} we have that for any Borel set B,
    E[ĝ(Z) I_B(Z)] = ∫_{B∩A} ĝ(z) f_Z(z) dz = ∫ [∫ g(x) f_{X|Z}(x|z) dx] I_{B∩A}(z) f_Z(z) dz
    = ∫_{R²} g(x) I_{B∩A}(z) f_{X,Z}(x, z) dx dz = E[g(X) I_B(Z)].
This amounts to E[ĝ(Z) I_G] = E[g(X) I_G] for any G ∈ σ(Z) = {Z⁻¹(B) : B ∈ ℬ},
so indeed ĝ(Z) is a version of E[g(X)|Z]. □
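The defining identity E[ĝ(Z) I_B(Z)] = E[g(X) I_B(Z)] can be checked by simulation. In this sketch (my own example, not part of the proof: the bivariate normal pair, g(x) = x², the closed-form ĝ, and the set B are all assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
rho, m = 0.5, 200_000
Z = rng.standard_normal(m)
X = rho * Z + np.sqrt(1.0 - rho**2) * rng.standard_normal(m)  # (X, Z) bivariate normal

g = lambda x: x**2
# Closed form of ghat(z) = E[g(X)|Z=z] for THIS joint law: X|Z=z ~ N(rho*z, 1-rho^2)
ghat = lambda z: (rho * z) ** 2 + (1.0 - rho**2)

indB = (Z > 0.3) & (Z < 1.5)      # I_B(Z) for an arbitrarily chosen Borel set B
lhs = (ghat(Z) * indB).mean()     # E[ghat(Z) I_B(Z)]
rhs = (g(X) * indB).mean()        # E[g(X) I_B(Z)]
```

With a seeded generator the two Monte Carlo averages agree up to sampling error, as the proposition predicts.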
To each conditional probability density f_{X|Z}(·|·) corresponds the collection of con-
ditional probability measures
    P̂_{X|Z}(B, ω) = ∫_B f_{X|Z}(x|Z(ω)) dx.
The remainder of this section deals with the following generalization of the latter object.

Definition 4.4.2. Let Y : Ω → S be an (S, 𝒮)-valued R.V. in the probability space
(Ω, ℱ, P), per Definition 1.2.1, and 𝒢 ⊆ ℱ a σ-algebra. The collection P̂_{Y|𝒢}(·, ·) :
𝒮 × Ω → [0, 1] is called the regular conditional probability distribution (R.C.P.D.)
of Y given 𝒢 if:
(a) P̂_{Y|𝒢}(A, ω) is a version of the C.E. E[I_{Y∈A}|𝒢] for each fixed A ∈ 𝒮.
(b) For any fixed ω ∈ Ω, the set function P̂_{Y|𝒢}(·, ω) is a probability measure
on (S, 𝒮).
In case S = Ω, 𝒮 = ℱ and Y(ω) = ω, we call this collection the regular conditional
probability (R.C.P.) on ℱ given 𝒢, denoted also by P̂(A|𝒢)(ω).

If the R.C.P. exists, then we can define all conditional expectations through the
R.C.P. Unfortunately, the R.C.P. might not exist (see [Bil95, Exercise 33.11] for
an example in which there exists no R.C.P. on ℱ given 𝒢).
Recall that each C.E. is uniquely determined only a.e. Hence, for any countable
collection of disjoint sets A_n ∈ ℱ there is possibly a set of ω of probability zero
for which a given collection of C.E. is such that
    P(∪_n A_n | 𝒢)(ω) ≠ Σ_n P(A_n | 𝒢)(ω).
In case we need to examine an uncountable number of such collections in order to
see whether P(·|𝒢) is a measure on (Ω, ℱ), the corresponding exceptional sets of ω
can pile up to a non-negligible set, hence the reason why a R.C.P. might not exist.
Nevertheless, as our next proposition shows, the R.C.P.D. exists for any condi-
tioning σ-algebra 𝒢 and any real-valued random variable X. In this setting, the
R.C.P.D. is the analog of the law of X as in Definition 1.2.33, but now given the
information contained in 𝒢.

Proposition 4.4.3. For any real-valued random variable X and any σ-algebra
𝒢 ⊆ ℱ, there exists a R.C.P.D. P̂_{X|𝒢}(·, ·).
Proof. Consider the random variables H(q, ω) = E[I_{X≤q}|𝒢](ω), indexed by
q ∈ Q. By monotonicity of the C.E. we know that if q ≤ r then H(q, ω) ≤ H(r, ω)
for all ω ∉ A_{qr}, where A_{qr} ∈ 𝒢 is such that P(A_{qr}) = 0. Further, by linearity
and dominated convergence of C.E.s, H(q + n⁻¹, ω) → H(q, ω) as n → ∞ for all
ω ∉ B_q, where B_q ∈ 𝒢 is such that P(B_q) = 0. For the same reason, H(q, ω) → 0
as q → −∞ and H(q, ω) → 1 as q → ∞ for all ω ∉ C, where C ∈ 𝒢 is such
that P(C) = 0. Since Q is countable, the set D = C ∪ (∪_{r,q} A_{rq}) ∪ (∪_q B_q) is also in
𝒢 with P(D) = 0. Next, for a fixed non-random distribution function G(·), let
F(x, ω) = inf{G(r, ω) : r ∈ Q, r > x}, where G(r, ω) = H(r, ω) if ω ∉ D and
G(r, ω) = G(r) otherwise. Clearly, for all ω ∈ Ω the non-decreasing function
x → F(x, ω) converges to zero when x → −∞ and to one when x → ∞, as C ⊆ D.
Furthermore, x → F(x, ω) is right continuous, hence a distribution function, since
    lim_{x_n ↓ x} F(x_n, ω) = inf{G(r, ω) : r ∈ Q, r > x_n for some n}
    = inf{G(r, ω) : r ∈ Q, r > x} = F(x, ω).
Thus, to each ω ∈ Ω corresponds a unique probability measure P̂(·, ω) on (R, ℬ)
such that P̂((−∞, x], ω) = F(x, ω) for all x ∈ R (recall Theorem 1.2.36 for its
existence and Proposition 1.2.44 for its uniqueness).
Note that G(q, ·) ∈ m𝒢 for all q ∈ Q, hence so is F(x, ·) for each x ∈ R (see
Theorem 1.2.22). It follows that {B ∈ ℬ : P̂(B, ·) ∈ m𝒢} is a λ-system (see
Corollary 1.2.19 and Theorem 1.2.22), containing the π-system 𝒫 = {R, (−∞, q] :
q ∈ Q}, hence by Dynkin's theorem P̂(B, ·) ∈ m𝒢 for all B ∈ ℬ. Further, for
ω ∉ D and q ∈ Q,
    H(q, ω) = G(q, ω) ≤ F(q, ω) ≤ G(q + n⁻¹, ω) = H(q + n⁻¹, ω) → H(q, ω)
as n → ∞ (specifically, the left-most inequality holds for ω ∉ ∪_r A_{r,q} and the right-
most limit holds for ω ∉ B_q). Hence, P̂(B, ω) = E[I_{X∈B}|𝒢](ω) for any B ∈ 𝒫
and ω ∉ D. Since P(D) = 0 it follows from the definition of the C.E. that for any
G ∈ 𝒢 and B ∈ 𝒫,
    ∫_G P̂(B, ω) dP(ω) = E[I_{X∈B} I_G].
Fixing G ∈ 𝒢, by monotone convergence and linearity of the expectation, the set ℒ
of B ∈ ℬ for which this equation holds is a λ-system. Consequently, ℒ ⊇ σ(𝒫) = ℬ.
Since this applies for all G ∈ 𝒢, we conclude that P̂(B, ·) is a version of E[I_{X∈B}|𝒢]
for each B ∈ ℬ. That is, P̂(B, ω) is per Definition 4.4.2 the R.C.P.D. of X given
𝒢. □
Remark. The reason behind Proposition 4.4.3 is that σ(X) inherits the structure
of the Borel σ-algebra ℬ, which in turn is not too big due to the fact that the rational
numbers are dense in R. Indeed, as you are to deduce in the next exercise, there
exists a R.C.P.D. for any (S, 𝒮)-valued R.V. X with a ℬ-isomorphic (S, 𝒮).

Exercise 4.4.4. Suppose (S, 𝒮) is ℬ-isomorphic, that is, there exists a Borel set
T (equipped with the induced Borel σ-algebra 𝒯 = {B ∩ T : B ∈ ℬ}) and a one-to-
one and onto mapping g : S → T such that both g and g⁻¹ are measurable. For
any σ-algebra 𝒢 and (S, 𝒮)-valued R.V. X let P̂_{Y|𝒢}(·, ·) denote the R.C.P.D. of the
real-valued random variable Y = g(X).
(a) Explain why without loss of generality P̂_{Y|𝒢}(T, ω) = 1 for all ω ∈ Ω.
(b) Verify that for any A ∈ 𝒮 both {ω : X(ω) ∈ A} = {ω : Y(ω) ∈ g(A)} and
g(A) ∈ ℬ.
(c) Deduce that Q̂(A, ω) = P̂_{Y|𝒢}(g(A), ω) is the R.C.P.D. of X given 𝒢.
Our next exercise provides a generalization of Proposition 4.4.3 which is key to
the canonical construction of Markov chains in Section 6.1. We note in passing
that, to conform with the notation for Markov chains, we reverse the order of the
arguments in the transition probabilities P̂_{X|Y}(y, A) with respect to that of the
R.C.P.D. P̂_{X|σ(Y)}(A, ω).

Exercise 4.4.5. Suppose (S, 𝒮) is ℬ-isomorphic and X and Y are (S, 𝒮)-valued
R.V. in the same probability space (Ω, ℱ, P). Prove that there exists a (regular) tran-
sition probability P̂_{X|Y}(·, ·) : S × 𝒮 → [0, 1] such that
(a) For each A ∈ 𝒮 fixed, y → P̂_{X|Y}(y, A) is a measurable function and
P̂_{X|Y}(Y(ω), A) is a version of the C.E. E[I_{X∈A}|σ(Y)](ω).
(b) For any fixed ω ∈ Ω, the set function P̂_{X|Y}(Y(ω), ·) is a probability
measure on (S, 𝒮).
Hint: With g : S → T as before, show that σ(Y) = σ(g(Y)) and deduce from
Theorem 1.2.26 that P̂_{X|σ(g(Y))}(A, ω) = f(A, g(Y(ω))) for each A ∈ 𝒮, where z →
f(A, z) is a Borel function.
Here is the extension of the change of variables formula (1.3.14) to the setting of
conditional distributions.

Exercise 4.4.6. Suppose X ∈ mℱ and Y ∈ m𝒢 for some σ-algebras 𝒢 ⊆ ℱ are
real-valued. Prove that, for any Borel function h : R² → R such that E|h(X, Y)| <
∞, almost surely,
    E[h(X, Y)|𝒢] = ∫_R h(x, Y(ω)) dP̂_{X|𝒢}(x, ω).

For an integrable R.V. X (and a non-random constant Y = c), this exercise
provides the representation
    E[X|𝒢] = ∫_R x dP̂_{X|𝒢}(x, ω)
of the C.E. in terms of the corresponding R.C.P.D. (with the right side denoting the
Lebesgue integral of Definition 1.3.1 for the probability space (R, ℬ, P̂_{X|𝒢}(·, ω))).
Solving the next exercise should improve your understanding of the relation be-
tween the R.C.P.D. and the conditional probability density function.
Exercise 4.4.7. Suppose that the random vector (X, Y, Z) has a probability density
function f_{X,Y,Z} per Definition 3.5.5.
(a) Express the R.C.P.D. P̂_{Y|σ(X,Z)} in terms of f_{X,Y,Z}.
(b) Using this expression show that if X is independent of (Y, Z), then
    E[Y|X, Z] = E[Y|Z].
(c) Provide an example of random variables X, Y, Z such that X is indepen-
dent of Y and
    E[Y|X, Z] ≠ E[Y|Z].

Exercise 4.4.8. Let S_n = Σ_{k=1}^n ξ_k for i.i.d. integrable random variables ξ_k.
(a) Show that E[ξ_1|S_n] = n⁻¹ S_n.
Hint: Consider E[ξ_{π(1)} I_{S_n∈B}] for B ∈ ℬ and a uniformly chosen ran-
dom permutation π of {1, . . . , n} which is independent of {ξ_k}.
(b) Find P(ξ_1 ≤ b|S_2) in case the i.i.d. ξ_k are Exponential of parameter λ.
Hint: See the representation of Exercise 3.4.11.
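Part (a) of Exercise 4.4.8 rests on the exchangeability of (ξ_1, . . . , ξ_n) given S_n. A Monte Carlo sketch (my own illustration, not the exercise's solution: exponential ξ_k and the set B = (4, ∞) are arbitrary choices) checks the equivalent identity E[ξ_1 I_{S_n∈B}] = E[(S_n/n) I_{S_n∈B}]:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 5, 400_000
xi = rng.exponential(1.0, size=(m, n))   # i.i.d. integrable xi_k
S = xi.sum(axis=1)                       # S_n for each sample path
indB = (S > 4.0)                         # I_{S_n in B} for B = (4, infinity)
lhs = (xi[:, 0] * indB).mean()           # E[xi_1 I_{S_n in B}]
rhs = ((S / n) * indB).mean()            # E[(S_n / n) I_{S_n in B}]
```

Since the two integrands have the same conditional expectation given S_n, the averages agree up to sampling error.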
Exercise 4.4.9. Let E[X|X < Y] = E[X I_{X<Y}]/P(X < Y) for integrable X and
Y such that P(X < Y) > 0. For each of the following statements, either show that
it implies E[X|X < Y] ≤ EX or provide a counter example.
(a) X and Y are independent.
(b) The random vector (X, Y) has the same joint law as the random vector
(Y, X) and P(X = Y) = 0.
(c) EX² < ∞, EY² < ∞ and E[XY] ≤ EX EY.
Exercise 4.4.10. Suppose (X, Y) are distributed according to a multivariate nor-
mal distribution, with EX = EY = 0 and EY² > 0. Show that E[X|Y] = ρY with
ρ = E[XY]/EY².
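A simulation sketch of this fact (the coefficients 0.75 and 2.0 below are arbitrary choices of mine, not from the exercise): since E[X|Y] = ρY, the residual X − ρY should be uncorrelated with functions of Y.

```python
import numpy as np

rng = np.random.default_rng(2)
m = 300_000
Y = 2.0 * rng.standard_normal(m)         # EY = 0, EY^2 = 4
X = 0.75 * Y + rng.standard_normal(m)    # (X, Y) jointly normal, EX = 0
rho = (X * Y).mean() / (Y**2).mean()     # empirical E[XY] / EY^2
resid = X - rho * Y
c_lin = (resid * Y).mean()               # zero by the choice of rho
c_sin = (resid * np.sin(Y)).mean()       # ~0: resid is independent of Y here
```

That the residual is orthogonal to sin(Y) as well (not just to Y) is what distinguishes the conditional expectation from a mere linear projection; for jointly normal pairs the two coincide.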
CHAPTER 5
Discrete time martingales and stopping times
In this chapter we study a collection of stochastic processes called martingales.
To simplify our presentation we focus on discrete time martingales and filtrations,
also called discrete parameter martingales and filtrations, with definitions and ex-
amples provided in Section 5.1 (indeed, a discrete time stochastic process is merely
a sequence of random variables defined on the same probability space). As we shall
see in Section 5.4, martingales play a key role in computations involving stopping
times. Martingales share many other useful properties, chiefly among which are
tail bounds and convergence theorems. Section 5.2 deals with martingale repre-
sentations and tail inequalities, some of which are applied in Section 5.3 to prove
various convergence theorems. Section 5.5 further demonstrates the usefulness of
martingales in the study of branching processes, likelihood ratios, and exchangeable
processes.

5.1. Definitions and closure properties

Subsection 5.1.1 introduces the concepts of filtration, martingale and stopping
time and provides a few illustrating examples and interpretations. Subsection 5.1.2
introduces the related super-martingales and sub-martingales, as well as the power-
ful martingale transform and other closure properties of this collection of stochastic
processes.
5.1.1. Martingales, filtrations and stopping times: definitions and
examples. Intuitively, a filtration represents any procedure of collecting more and
more information as time goes on. Our starting point is the following rigorous
mathematical definition of a (discrete time) filtration.

Definition 5.1.1. A filtration is a non-decreasing family of sub-σ-algebras {F_n}
of our measurable space (Ω, F). That is, F_0 ⊆ F_1 ⊆ F_2 ⊆ ··· ⊆ F_n ⊆ ··· ⊆ F and
F_n is a σ-algebra for each n. We denote by {F_n, F_∞} a filtration {F_n} and the
associated σ-algebra F_∞ = σ(∪_k F_k), such that the relation F_k ⊆ F_∞ applies for all
0 ≤ k ≤ ∞.

Given a filtration, we are interested in stochastic processes (S.P.s) such that for
each n the information gathered by that time suffices for evaluating the value of
the n-th element of the process. That is,

Definition 5.1.2. A S.P. {X_n, n = 0, 1, . . .} is adapted to a filtration {F_n}, also
denoted F_n-adapted, if σ(X_n) ⊆ F_n for each n (that is, X_n ∈ mF_n for each n).

At this point you should convince yourself that {X_n} is adapted to the filtration
{F_n} if and only if σ(X_0, X_1, . . . , X_n) ⊆ F_n for all n. That is,
Definition 5.1.3. The filtration {F_n^X} with F_n^X = σ(X_0, X_1, . . . , X_n) is the min-
imal filtration with respect to which {X_n} is adapted. We therefore call it the
canonical filtration for the S.P. {X_n}.

Whenever it is clear from the context what is meant, we shall use the notation X_n
both for the whole S.P. {X_n} and for the n-th R.V. of this process, and likewise we
may sometimes use F_n to denote the whole filtration {F_n}.
A martingale consists of a filtration and an adapted S.P. which can represent the
outcome of a fair gamble. That is, the expected future reward given current
information is exactly the current value of the process, or as a rigorous definition:

Definition 5.1.4. A martingale (denoted MG) is a pair (X_n, F_n), where {F_n} is
a filtration and {X_n} is an integrable S.P. (that is, E|X_n| < ∞ for all n) adapted
to this filtration, such that
(5.1.1)    E[X_{n+1}|F_n] = X_n  for all n, a.s.
Remark. The slower a filtration n → F_n grows, the easier it is for an adapted
S.P. to be a martingale. That is, if H_n ⊇ F_n for all n and a S.P. {X_n} adapted to
the filtration {F_n} is such that (X_n, H_n) is a martingale, then by the tower property
(X_n, F_n) is also a martingale. In particular, if (X_n, F_n) is a martingale then {X_n}
is also a martingale with respect to its canonical filtration. For this reason, hereafter
the statement "X_n is a MG" (without explicitly specifying the filtration) means
that X_n is a MG with respect to its canonical filtration F_n^X = σ(X_k, k ≤ n).

We next provide an alternative characterization of the martingale property.

Proposition 5.1.5. If X_n = Σ_{k=0}^n D_k then the canonical filtration for {X_n} is
the same as the canonical filtration for {D_n}. Further, (X_n, F_n) is a martingale if
and only if {D_n} is an integrable S.P., adapted to {F_n}, such that E[D_{n+1}|F_n] = 0
a.s. for all n.

Remark. The martingale differences associated with {X_n} are D_n = X_n − X_{n−1},
n ≥ 1, and D_0 = X_0.
Proof. With both the transformation from (X_0, . . . , X_n) to (D_0, . . . , D_n) and
its inverse being continuous (hence Borel), it follows that F_n^X = F_n^D for each n
(c.f. Exercise 1.2.32). Therefore, {X_n} is adapted to a given filtration {F_n} if and
only if {D_n} is adapted to this filtration (see Definition 5.1.3). It is easy to show
by induction on n that E|X_k| < ∞ for k = 0, . . . , n if and only if E|D_k| < ∞ for
k = 0, . . . , n. Hence, {X_n} is an integrable S.P. if and only if {D_n} is. Finally, with
X_n ∈ mF_n it follows from the linearity of the C.E. that
    E[X_{n+1}|F_n] − X_n = E[X_{n+1} − X_n|F_n] = E[D_{n+1}|F_n],
and the alternative expression for the martingale property follows from (5.1.1). □
Our first example of a martingale is the random walk, perhaps the most funda-
mental stochastic process.

Definition 5.1.6. The random walk is the stochastic process S_n = S_0 + Σ_{k=1}^n ξ_k
with real-valued, independent, identically distributed {ξ_k} which are also indepen-
dent of S_0. Unless explicitly stated otherwise, we always set S_0 to be zero. We say
that the random walk is symmetric if the law of ξ_k is the same as that of −ξ_k. We
call it a simple random walk (on Z), in short srw, if ξ_k ∈ {−1, 1}. The srw is
completely characterized by the parameter p = P(ξ_k = 1), which is always assumed
to be in (0, 1) (or alternatively, by q = 1 − p = P(ξ_k = −1)). Thus, the symmetric
srw corresponds to p = 1/2 = q (and the asymmetric srw corresponds to p ≠ 1/2).

The random walk is a MG (with respect to its canonical filtration) whenever
E|ξ_1| < ∞ and Eξ_1 = 0.
Remark. More generally, such partial sums S_n form a MG even when the in-
dependent and integrable R.V. ξ_k of zero mean have non-identical distributions,
and the canonical filtration of {S_n} is merely {F_n^ξ}, where F_n^ξ = σ(ξ_1, . . . , ξ_n). In-
deed, this is an application of Proposition 5.1.5 for independent, integrable D_k =
S_k − S_{k−1} = ξ_k, k ≥ 1 (with D_0 = 0), where E[D_{n+1}|D_0, D_1, . . . , D_n] = ED_{n+1} = 0
for all n ≥ 0 by our assumption that Eξ_k = 0 for all k.
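The defining property E[S_{n+1}|F_n] = S_n says the next increment is conditionally mean-zero, so empirically it should be uncorrelated with any F_n-measurable statistic. A seeded sketch for the symmetric srw (the particular test statistics below are my choice, not from the text):

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 200_000, 10
xi = rng.choice([-1.0, 1.0], size=(m, n))   # symmetric srw steps, E xi_k = 0
S = np.cumsum(xi, axis=1)
inc = S[:, 6] - S[:, 5]                     # one increment of the walk
c1 = (inc * S[:, 5]).mean()                 # vs. the current position (F_n-measurable)
c2 = (inc * (S[:, 3] > 0)).mean()           # vs. an earlier event (also F_n-measurable)
```

Both sample correlations vanish up to Monte Carlo error, as the martingale property demands.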
Definition 5.1.7. We say that a stochastic process {X_n} is square-integrable if
EX_n² < ∞ for all n. Similarly, we call a martingale (X_n, F_n) such that EX_n² < ∞
for all n an L²-MG (or a square-integrable MG).

Square-integrable martingales have zero-mean, uncorrelated differences and admit
an elegant decomposition of conditional second moments.

Exercise 5.1.8. Suppose (X_n, F_n) and (Y_n, F_n) are square-integrable martingales.
(a) Show that the corresponding martingale differences D_n are uncorrelated
and that each D_n, n ≥ 1, has zero mean.
(b) Show that for any ℓ ≥ n ≥ 0,
    E[X_ℓ Y_ℓ|F_n] − X_n Y_n = E[(X_ℓ − X_n)(Y_ℓ − Y_n)|F_n]
    = Σ_{k=n+1}^ℓ E[(X_k − X_{k−1})(Y_k − Y_{k−1})|F_n].
(c) Deduce that if sup_k |X_k| ≤ C non-random then for any ℓ ≥ 1,
    E[(Σ_{k=1}^ℓ D_k²)²] ≤ 6C⁴.
Remark. A square-integrable stochastic process with zero-mean, mutually inde-
pendent differences is necessarily a martingale (consider Proposition 5.1.5). So, in
view of part (a) of Exercise 5.1.8, the MG property is between the more restrictive
requirement of having zero-mean, independent differences, and the not as useful
property of just having zero-mean, uncorrelated differences. While in general these
three conditions are not the same, as you show next they do coincide in case of
Gaussian stochastic processes.

Exercise 5.1.9. A stochastic process {X_n} is Gaussian if for each n the ran-
dom vector (X_1, . . . , X_n) has the multivariate normal distribution (c.f. Definition
3.5.13). Show that having independent or uncorrelated differences are equivalent
properties for such processes, which together with each of these differences having
a zero mean is then also equivalent to the MG property.
Products of R.V.s are another classical source of martingales.

Example 5.1.10. Consider the stochastic process M_n = Π_{k=1}^n Y_k for independent,
integrable random variables Y_k ≥ 0. Its canonical filtration coincides with {F_n^Y} (see
Exercise 1.2.32), and taking out what is known we get by independence that
    E[M_{n+1}|F_n^Y] = E[Y_{n+1} M_n|F_n^Y] = M_n E[Y_{n+1}|F_n^Y] = M_n E[Y_{n+1}],
so {M_n} is a MG, which we then call the product martingale, if and only if
EY_k = 1 for all k ≥ 1 (for a general sequence {Y_n} we need instead that a.s.
E[Y_{n+1}|Y_1, . . . , Y_n] = 1 for all n).

Remark. In investment applications, the MG condition EY_k = 1 corresponds to a
neutral return rate, and is not the same as the condition E[log Y_k] = 0 under which
the associated partial sums S_n = log M_n form a MG.
We proceed to define the important concept of stopping time (in the simpler
context of a discrete parameter filtration).

Definition 5.1.11. A random variable τ taking values in {0, 1, . . . , n, . . . , ∞} is a
stopping time for the filtration {F_n} (also denoted F_n-stopping time), if the event
{ω : τ(ω) ≤ n} is in F_n for each finite n ≥ 0.

Remark. Intuitively, a stopping time corresponds to a situation where the deci-
sion whether to stop or not at any given (non-random) time step is based on the
information available by that time step. As we shall amply see in the sequel, one of
the advantages of MGs is in providing a handle on explicit computations associated
with various stopping times.

The next two exercises provide examples of stopping times. Practice your under-
standing of this concept by solving them.

Exercise 5.1.12. Suppose that θ and τ are stopping times for the same filtration
{F_n}. Show that then min(θ, τ), max(θ, τ) and θ + τ are also stopping times for this
filtration.

Exercise 5.1.13. Show that the first hitting time τ(ω) = min{k ≥ 0 : X_k(ω) ∈ B}
of a Borel set B ⊆ R by a sequence {X_k} is a stopping time for the canonical
filtration {F_n^X}. Provide an example where the last hitting time θ = sup{k ≥ 0 :
X_k ∈ B} of a set B by the sequence is not a stopping time (not surprising, since
we need to know the whole sequence {X_k} in order to verify that there are no visits
to B after a given time n).
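The distinction in Exercise 5.1.13 is visible in code: a first hitting time is computable from the path prefix X_0, . . . , X_n alone, which is precisely the statement that {τ ≤ n} ∈ F_n^X. A small sketch (the function and variable names here are mine):

```python
import numpy as np

def first_hitting_time(path, in_B):
    """tau(omega) = min{k >= 0 : X_k(omega) in B}; infinity if B is never hit.

    Note the loop inspects the path one step at a time, so deciding
    {tau <= n} uses only the first n + 1 values of the path.
    """
    for k, x in enumerate(path):
        if in_B(x):
            return k
    return float("inf")

rng = np.random.default_rng(4)
path = np.cumsum(rng.choice([-1, 1], size=50))   # a srw sample path
in_B = lambda x: x >= 3                          # the Borel set B = [3, infinity)
tau = first_hitting_time(path, in_B)

n = 20
from_full = (tau <= n)                                         # using the whole path
from_prefix = (first_hitting_time(path[: n + 1], in_B) <= n)   # using only X_0..X_n
```

No analogous prefix computation exists for the last hitting time, which is why it generally fails to be a stopping time.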
Here is an elementary application of first hitting times.

Exercise 5.1.14 (Reflection principle). Suppose {S_n} is a symmetric random
walk starting at S_0 = 0 (see Definition 5.1.6).
(a) Show that P(S_n − S_k ≥ 0) ≥ 1/2 for k = 1, 2, . . . , n.
(b) Fixing x > 0, let τ = inf{k ≥ 0 : S_k > x} and show that
    P(S_n > x) ≥ Σ_{k=1}^n P(τ = k, S_n − S_k ≥ 0) ≥ (1/2) Σ_{k=1}^n P(τ = k).
(c) Deduce that for any n and x > 0,
    P(max_{k=1}^n S_k > x) ≤ 2 P(S_n > x).
(d) Considering now the symmetric srw, show that for any positive integers
n, x,
    P(max_{k=1}^n S_k ≥ x) = 2 P(S_n ≥ x) − P(S_n = x)
and that Z_{2n+1} =_D (|S_{2n+1}| − 1)/2, where Z_n denotes the number of (strict)
sign changes within S_0 = 0, S_1, . . . , S_n.
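The identity in part (d) of Exercise 5.1.14 is exact for the symmetric srw, and for small n it can be verified by brute-force enumeration of all 2^n equally likely sign paths (a check, not a proof; the values of n and x below are arbitrary):

```python
import itertools

n, x = 10, 3
hit = at_x = at_least_x = 0
for signs in itertools.product((-1, 1), repeat=n):
    s = m = 0
    for step in signs:
        s += step
        m = max(m, s)
    hit += (m >= x)          # event {max_{1<=k<=n} S_k >= x}
    at_least_x += (s >= x)   # event {S_n >= x}
    at_x += (s == x)         # event {S_n = x}
# Claim of part (d): hit / 2^n == (2 * at_least_x - at_x) / 2^n
```

The count identity hit = 2·at_least_x − at_x holds exactly, mirroring the reflection bijection between paths that touch level x yet finish below it and paths finishing strictly above it.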
We conclude this subsection with a useful sufficient condition for the integrability
of a stopping time.

Exercise 5.1.15. Suppose the F_n-stopping time τ is such that a.s.
    P[τ ≤ n + r|F_n] ≥ ε
for some positive integer r, some ε > 0 and all n.
(a) Show that P(τ > kr) ≤ (1 − ε)^k for any positive integer k.
Hint: Use induction on k.
(b) Deduce that in this case Eτ < ∞.
5.1.2. Sub-martingales, super-martingales and stopped martingales.
Often when operating on a MG we naturally end up with a sub-martingale or a
super-martingale, as defined next. Moreover, these processes share many of the
properties of martingales, so it is useful to develop a unified theory for them.

Definition 5.1.16. A sub-martingale (denoted sub-MG) is an integrable S.P. {X_n},
adapted to the filtration {F_n}, such that
    E[X_{n+1}|F_n] ≥ X_n  for all n, a.s.
A super-martingale (denoted sup-MG) is an integrable S.P. {X_n}, adapted to the
filtration {F_n}, such that
    E[X_{n+1}|F_n] ≤ X_n  for all n, a.s.
(A typical S.P. {X_n} is neither a sub-MG nor a sup-MG, as the sign of the R.V.
E[X_{n+1}|F_n] − X_n may well be random, or possibly dependent upon n.)

Remark 5.1.17. Note that {X_n} is a sub-MG if and only if {−X_n} is a sup-MG.
By this identity, all results about sub-MGs have dual statements for sup-MGs and
vice versa. We often state only one out of each such pair of statements. Further,
{X_n} is a MG if and only if it is both a sub-MG and a sup-MG. As a result,
every statement holding for either sub-MGs or sup-MGs also holds for MGs.
Example 5.1.18. Expanding on Example 5.1.10, if the non-negative, integrable
random variables Y_k are such that E[Y_n|Y_1, . . . , Y_{n−1}] ≥ 1 a.s. for all n then M_n =
Π_{k=1}^n Y_k is a sub-MG, and if E[Y_n|Y_1, . . . , Y_{n−1}] ≤ 1 a.s. for all n then M_n is a
sup-MG. Such martingales appear for example in mathematical finance, where Y_k
denotes the random proportional change in the value of a risky asset at the k-th
trading round. So, a positive conditional mean return rate yields a sub-MG while
a negative conditional mean return rate gives a sup-MG.

The sub-martingale (and super-martingale) property is closed with respect to the
addition of S.P.s.

Exercise 5.1.19. Show that if {X_n} and {Y_n} are sub-MGs with respect to a
filtration {F_n}, then so is {X_n + Y_n}. In contrast, show that for any sub-MG {Y_n}
there exists an integrable {X_n} adapted to {F_n^Y} such that {X_n + Y_n} is not a sub-MG
with respect to any filtration.
Here are some of the properties of sub-MGs (and of sup-MGs).

Proposition 5.1.20. If (X_n, F_n) is a sub-MG, then a.s. E[X_ℓ|F_m] ≥ X_m for
any ℓ > m. Consequently, for a sub-MG necessarily n → EX_n is non-decreasing.
Similarly, for a sup-MG a.s. E[X_ℓ|F_m] ≤ X_m (with n → EX_n non-increasing),
and for a martingale a.s. E[X_ℓ|F_m] = X_m for all ℓ > m (with E[X_n] independent
of n).

Proof. Suppose {X_n} is a sub-MG and ℓ = m + k for k ≥ 1. Then,
    E[X_{m+k}|F_m] = E[E(X_{m+k}|F_{m+k−1})|F_m] ≥ E[X_{m+k−1}|F_m],
with the equality due to the tower property and the inequality by the definition
of a sub-MG and monotonicity of the C.E. Iterating this inequality for decreasing
values of k we deduce that E[X_{m+k}|F_m] ≥ E[X_m|F_m] = X_m for all non-negative
integers k, m, as claimed. Next, taking the expectation of this inequality, we have
by monotonicity of the expectation and (4.2.1) that E[X_{m+k}] ≥ E[X_m] for all
k, m ≥ 0, or equivalently, that n → EX_n is non-decreasing.
To get the corresponding results for a super-martingale {X_n} note that then
{−X_n} is a sub-martingale, see Remark 5.1.17. As already mentioned there, if
{X_n} is a MG then it is both a super-martingale and a sub-martingale, hence
both E[X_ℓ|F_m] ≥ X_m and E[X_ℓ|F_m] ≤ X_m, resulting in E[X_ℓ|F_m] = X_m, as
stated. □
Exercise 5.1.21. Show that a sub-martingale (X_n, F_n) is a martingale if and only
if EX_n = EX_0 for all n.
We next detail a few examples in which sub-MGs or sup-MGs naturally appear,
starting with an immediate consequence of Jensen's inequality.

Proposition 5.1.22. Suppose Φ : R → R is convex and E[|Φ(X_n)|] < ∞.
(a) If (X_n, F_n) is a martingale then (Φ(X_n), F_n) is a sub-martingale.
(b) If x → Φ(x) is also non-decreasing, (Φ(X_n), F_n) is a sub-martingale even
when (X_n, F_n) is only a sub-martingale.

Proof. With Φ(X_n) integrable and adapted, it suffices to check that a.s.
E[Φ(X_{n+1})|F_n] ≥ Φ(X_n) for all n. To this end, since Φ(·) is convex and X_n is
integrable, by the conditional Jensen's inequality,
    E[Φ(X_{n+1})|F_n] ≥ Φ(E[X_{n+1}|F_n]),
so it remains only to verify that Φ(E[X_{n+1}|F_n]) ≥ Φ(X_n). This clearly applies
when (X_n, F_n) is a MG, and even for a sub-MG (X_n, F_n), provided that Φ(·) is
monotone non-decreasing. □

Example 5.1.23. Typical convex functions for which the preceding proposition is
often applied are Φ(x) = |x|^p, p ≥ 1, Φ(x) = (x − c)₊, Φ(x) = max(x, c) (for c ∈ R),
Φ(x) = e^{λx} and Φ(x) = x log x (the latter only for non-negative S.P.). Considering
instead Φ(·) concave leads to a sup-MG, as for example when Φ(x) = min(x, c) or
Φ(x) = x^p for some p ∈ (0, 1) or Φ(x) = log x (the latter two cases restricted to non-
negative S.P.). For example, if {X_n} is a sub-martingale then {(X_n − c)₊} is also
a sub-martingale (since (x − c)₊ is a convex, non-decreasing function). Similarly,
if {X_n} is a super-martingale, then {min(X_n, c)} is also a super-martingale (since
{−X_n} is a sub-martingale and the function −min(x, c) = max(−x, −c) is convex
and non-decreasing).
Here is a concrete application of Proposition 5.1.22.

Exercise 5.1.24. Suppose ξ_i are mutually independent, Eξ_i = 0 and Eξ_i² = σ_i².
(a) Let S_n = Σ_{i=1}^n ξ_i and s_n² = Σ_{i=1}^n σ_i². Show that S_n² is a sub-martingale
and S_n² − s_n² is a martingale.
(b) Show that if in addition m_n = Π_{i=1}^n E e^{ξ_i} are finite, then e^{S_n} is a
sub-martingale and M_n = e^{S_n}/m_n is a martingale.

Remark. A special case of Exercise 5.1.24 is the random walk {S_n} of Definition
5.1.6, with S_n² − nEξ_1² being a MG when ξ_1 is square-integrable and of zero mean.
Likewise, e^{S_n} is a sub-MG whenever Eξ_1 = 0 and Ee^{ξ_1} is finite. Though e^{S_n} is in
general not a MG, the normalized M_n = e^{S_n}/[Ee^{ξ_1}]^n is merely the product MG of
Example 5.1.10 for the i.i.d. variables Y_i = e^{ξ_i}/E(e^{ξ_1}).
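As a sanity check of the preceding remark (the value of λ, the horizon and the sample size are my arbitrary choices), for the symmetric srw the normalized exponential M_n = e^{λS_n}/[Ee^{λξ_1}]^n should have EM_n = EM_0 = 1 for every n:

```python
import numpy as np

rng = np.random.default_rng(5)
lam = 0.5
m, n = 400_000, 8
xi = rng.choice([-1.0, 1.0], size=(m, n))   # symmetric srw steps
S = xi.sum(axis=1)                          # S_n for each sample path
mgf = np.cosh(lam)                          # E e^{lam * xi_1} for a symmetric sign
M = np.exp(lam * S) / mgf**n                # M_n = e^{lam S_n} / m_n
mean_M = M.mean()                           # should be close to 1
```

Without the normalization by m_n = cosh(λ)^n the mean of e^{λS_n} grows with n, which is the sub-MG behavior noted above.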
Here is another family of super-martingales, this time related to super-harmonic
functions.

Definition 5.1.25. A lower semi-continuous function f : R^d → R is super-
harmonic if for any x and r > 0,
    f(x) ≥ (1/|B(0, r)|) ∫_{B(x,r)} f(y) dy,
where B(x, r) = {y : |x − y| ≤ r} is the ball of radius r centered at x and |B(x, r)|
denotes its volume.

Exercise 5.1.26. Suppose S_n = x + Σ_{k=1}^n ξ_k for i.i.d. ξ_k that are chosen uniformly
on the ball B(0, 1) in R^d (i.e. using Lebesgue's measure on this ball, scaled by
its volume). Show that if f(·) is super-harmonic on R^d then f(S_n) is a super-
martingale.
Hint: When checking the integrability of f(S_n), recall that a lower semi-continuous
function is bounded below on any compact set.
We next define the important concept of a martingale transform, and show that
it is a powerful and flexible method for generating martingales.

Definition 5.1.27. We call a sequence {V_n} predictable (or pre-visible) for the
filtration {F_n}, also denoted F_n-predictable, if V_n is measurable on F_{n−1} for all
n ≥ 1. The sequence of random variables
    Y_n = Σ_{k=1}^n V_k (X_k − X_{k−1}),  n ≥ 1,  Y_0 = 0,
is called the martingale transform of the F_n-predictable {V_n} with respect to a sub-
or super-martingale (X_n, F_n).

Theorem 5.1.28. Suppose {Y_n} is the martingale transform of the F_n-predictable
{V_n} with respect to a sub- or super-martingale (X_n, F_n).
(a) If {Y_n} is integrable and (X_n, F_n) is a martingale, then (Y_n, F_n) is also a
martingale.
(b) If {Y_n} is integrable, V_n ≥ 0 and (X_n, F_n) is a sub-martingale (or super-
martingale) then (Y_n, F_n) is also a sub-martingale (super-martingale, re-
spectively).
(c) For the integrability of Y_n it suffices in both cases to have |V_n| ≤ c_n for
some non-random finite constants c_n, or alternatively to have V_n ∈ L^q
and X_n ∈ L^p for all n and some p, q > 1 such that 1/q + 1/p = 1.
Proof. With {V_n} and {X_n} adapted to the filtration {F_n}, it follows that
V_k X_l ∈ mF_k ⊆ mF_n for all l ≤ k ≤ n. By inspection Y_n ∈ mF_n as well (see
Corollary 1.2.19), i.e. {Y_n} is adapted to {F_n}.
Turning to prove part (c) of the theorem, note that for each n the variable Y_n is
a finite sum of terms of the form V_k X_l. If V_k ∈ L^q and X_l ∈ L^p for some p, q > 1
such that 1/q + 1/p = 1, then by Hölder's inequality V_k X_l is integrable. Alternatively,
since a super-martingale X_l is in particular integrable, V_k X_l is integrable as soon
as |V_k| is bounded by a non-random finite constant. In conclusion, if either of these
conditions applies for all k, l then obviously {Y_n} is an integrable S.P.
Recall that Y_{n+1} − Y_n = V_{n+1}(X_{n+1} − X_n) and V_{n+1} ∈ mF_n (since {V_n} is F_n-
predictable). Therefore, taking out V_{n+1} which is measurable on F_n, we find that
    E[Y_{n+1} − Y_n|F_n] = E[V_{n+1}(X_{n+1} − X_n)|F_n] = V_{n+1} E[X_{n+1} − X_n|F_n].
This expression is zero when (X_n, F_n) is a MG and non-negative when V_{n+1} ≥ 0
and (X_n, F_n) is a sub-MG. Since the preceding applies for all n, we consequently
have that (Y_n, F_n) is a MG in the former case and a sub-MG in the latter. Finally,
to complete the proof also in case of a sup-MG (X_n, F_n), note that then {−Y_n} is the
MG transform of {V_n} with respect to the sub-MG (−X_n, F_n). □
Here are two concrete examples of a martingale transform.

Example 5.1.29. The S.P. $Y_n = \sum_{k=1}^n X_{k-1}(X_k - X_{k-1})$ is a MG whenever $X_n \in L^2(\Omega, \mathcal{F}, P)$ is a MG (indeed, $V_n = X_{n-1}$ is predictable for the canonical filtration of $X_n$; consider $p = q = 2$ in part (c) of Theorem 5.1.28).

Example 5.1.30. Given an integrable process $V_n$ suppose that for each $k \ge 1$ the bounded $\xi_k$ has zero mean and is independent of $\mathcal{F}_{k-1} = \sigma(\xi_1, \ldots, \xi_{k-1}, V_1, \ldots, V_k)$. Then, $Y_n = \sum_{k=1}^n V_k \xi_k$ is a martingale for the filtration $\mathcal{F}_n$. Indeed, by assumption, the differences $\xi_n$ of $X_n = \sum_{k=1}^n \xi_k$ are such that $E[\xi_k \mid \mathcal{F}_{k-1}] = 0$ for all $k \ge 1$. Hence, $(X_n, \mathcal{F}_n)$ is a martingale (c.f. Proposition 5.1.5), and $Y_n$ is the martingale transform of the $\mathcal{F}_n$-predictable $V_n$ with respect to the martingale $(X_n, \mathcal{F}_n)$ (where the integrability of $Y_n$ is a consequence of the boundedness of each $\xi_k$ and integrability of each $V_k$). In discrete mathematics applications one often uses a special case of this construction, with an auxiliary sequence of random i.i.d. signs $\xi_k \in \{-1, 1\}$ such that $P(\xi_1 = 1) = \frac{1}{2}$ and $\{\xi_n\}$ is independent of the given integrable S.P. $V_n$.
We next define the important concept of a stopped stochastic process and then use the martingale transform to show that stopped sub and super martingales are also sub-MGs (sup-MGs, respectively).

Definition 5.1.31. Given a stochastic process $X_n$ and a random variable $\tau$ taking values in $\{0, 1, \ldots, n, \ldots, \infty\}$, the stopped at $\tau$ stochastic process, denoted $X_{n \wedge \tau}$, is given by
$$X_{n \wedge \tau}(\omega) = \begin{cases} X_n(\omega), & n \le \tau(\omega) \\ X_{\tau(\omega)}(\omega), & n > \tau(\omega) \end{cases}$$

Theorem 5.1.32. If $(X_n, \mathcal{F}_n)$ is a sub-MG (or a sup-MG or a MG) and $\theta \le \tau$ are stopping times for $\mathcal{F}_n$, then $(X_{n \wedge \tau} - X_{n \wedge \theta}, \mathcal{F}_n)$ is also a sub-MG (or sup-MG or MG, respectively). In particular, taking $\theta = 0$ we have that $(X_{n \wedge \tau}, \mathcal{F}_n)$ is then a sub-MG (or sup-MG or MG, respectively).

Proof. We may and shall assume that $(X_n, \mathcal{F}_n)$ is a sub-MG (just consider $-X_n$ in case $X_n$ is a sup-MG and both when $X_n$ is a MG). Let $V_k(\omega) = I_{\{\theta(\omega) < k \le \tau(\omega)\}}$. Since $\theta \le \tau$ are two $\mathcal{F}_n$-stopping times, it follows that $V_k(\omega) = I_{\{\theta(\omega) \le k-1\}} - I_{\{\tau(\omega) \le k-1\}}$ is measurable on $\mathcal{F}_{k-1}$ for all $k \ge 1$. Thus, $V_n$ is a bounded, non-negative $\mathcal{F}_n$-predictable sequence. Further, since
$$X_{n \wedge \tau}(\omega) - X_{n \wedge \theta}(\omega) = \sum_{k=1}^n I_{\{\theta(\omega) < k \le \tau(\omega)\}}\big(X_k(\omega) - X_{k-1}(\omega)\big)$$
is the martingale transform of $V_n$ with respect to the sub-MG $(X_n, \mathcal{F}_n)$, we know from Theorem 5.1.28 that $(X_{n \wedge \tau} - X_{n \wedge \theta}, \mathcal{F}_n)$ is also a sub-MG. Finally, considering the latter sub-MG for $\theta = 0$ and adding to it the sub-MG $(X_0, \mathcal{F}_n)$, we conclude that $(X_{n \wedge \tau}, \mathcal{F}_n)$ is a sub-MG (c.f. Exercise 5.1.19 and note that $X_{n \wedge 0} = X_0$). □
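As a concrete illustration of Theorem 5.1.32 (our own sketch, not part of the notes), exact enumeration shows that stopping the simple symmetric random walk, a martingale, at the first passage of a level leaves every expectation $E X_{n \wedge \tau}$ at $E X_0 = 0$:

```python
from itertools import product
from fractions import Fraction

def stopped_mean(n, level=1):
    """Exact E[S_{n ^ tau}] for the simple symmetric walk S_k and the
    stopping time tau = inf{k >= 1 : S_k >= level}, by enumerating all
    2^n equally likely paths of length n."""
    total = Fraction(0)
    for signs in product([-1, 1], repeat=n):
        s, stopped_val = 0, None
        for x in signs:
            s += x
            if stopped_val is None and s >= level:
                stopped_val = s   # tau has occurred; freeze the value
        total += stopped_val if stopped_val is not None else s
    return total / 2**n

# the stopped martingale is still a martingale: E[S_{n ^ tau}] = 0
print([stopped_mean(n) for n in range(1, 7)])
```

The same computation with a sub-martingale (e.g. $S_k^2$) would show a non-decreasing sequence of expectations instead.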
Theorem 5.1.32 thus implies the following key ingredient in the proof of Doob's optional stopping theorem (to which we return in Section 5.4).

Corollary 5.1.33. If $(X_n, \mathcal{F}_n)$ is a sub-MG and $\theta \le \tau$ are $\mathcal{F}_n$-stopping times, then $EX_{n \wedge \tau} \ge EX_{n \wedge \theta}$ for all $n$. The reverse inequality holds in case $(X_n, \mathcal{F}_n)$ is a sup-MG, with $EX_{n \wedge \tau} = EX_{n \wedge \theta}$ for all $n$ in case $(X_n, \mathcal{F}_n)$ is a MG.

Proof. Suffices to consider $X_n$ which is a sub-MG for the filtration $\mathcal{F}_n$. In this case we have from Theorem 5.1.32 that $Y_n = X_{n \wedge \tau} - X_{n \wedge \theta}$ is also a sub-MG for this filtration. Noting that $Y_0 = 0$ we thus get from Proposition 5.1.20 that $EY_n \ge 0$. Theorem 5.1.32 also implies the integrability of $X_{n \wedge \theta}$ so by linearity of the expectation we conclude that $EX_{n \wedge \tau} \ge EX_{n \wedge \theta}$. □
An important concept associated with each stopping time is the stopped $\sigma$-algebra defined next.

Definition 5.1.34. The stopped $\sigma$-algebra $\mathcal{F}_\tau$ associated with the stopping time $\tau$ for a filtration $\{\mathcal{F}_n\}$ is the collection of events $A \in \mathcal{F}_\infty$ such that $A \cap \{\omega : \tau(\omega) \le n\} \in \mathcal{F}_n$ for all $n$.

With $\mathcal{F}_n$ representing the information known at time $n$, think of $\mathcal{F}_\tau$ as quantifying the information known upon stopping at $\tau$. Some of the properties of these stopped $\sigma$-algebras are detailed in the next exercise.

Exercise 5.1.35. Let $\theta$ and $\tau$ be $\mathcal{F}_n$-stopping times.
(a) Verify that $\mathcal{F}_\tau$ is a $\sigma$-algebra and if $\tau(\omega) = n$ is non-random then $\mathcal{F}_\tau = \mathcal{F}_n$.
(b) Suppose $X_n \in m\mathcal{F}_n$ for all $n$ (including $n = \infty$ unless $\tau$ is finite for all $\omega$). Show that then $X_\tau \in m\mathcal{F}_\tau$. Deduce that $\tau \in m\mathcal{F}_\tau$ and $X_k I_{\{\tau = k\}} \in m\mathcal{F}_\tau$ for any $k$ non-random.
(c) Show that for any integrable $Y_n$ and non-random $k$,
$$E[Y_\tau I_{\{\tau = k\}} \mid \mathcal{F}_\tau] = E[Y_k \mid \mathcal{F}_k]\, I_{\{\tau = k\}} \,.$$
(d) Show that if $\theta \le \tau$ then $\mathcal{F}_\theta \subseteq \mathcal{F}_\tau$.
Our next exercise shows that the martingale property is equivalent to the strong martingale property whereby conditioning at stopped $\sigma$-algebras $\mathcal{F}_\tau$ replaces the one at $\mathcal{F}_n$ for non-random $n$.

Exercise 5.1.36. Given an integrable stochastic process $X_n$ adapted to a filtration $\{\mathcal{F}_n\}$, show that $(X_n, \mathcal{F}_n)$ is a martingale if and only if $E[X_n \mid \mathcal{F}_\tau] = X_\tau$ for any non-random, finite $n$ and all $\mathcal{F}_n$-stopping times $\tau \le n$.

For non-integrable stochastic processes we generalize the concept of a martingale into that of a local martingale.

Exercise 5.1.37. The pair $(X_n, \mathcal{F}_n)$ is called a local martingale if $X_n$ is adapted to the filtration $\mathcal{F}_n$ and there exist $\mathcal{F}_n$-stopping times $\tau_k \uparrow \infty$ with probability one such that $(X_{n \wedge \tau_k}, \mathcal{F}_n)$ is a martingale for each $k$. Show that any martingale is a local martingale and any integrable, local martingale is a martingale.

We conclude with the renewal property of stopping times with respect to the canonical filtration of an i.i.d. sequence.

Exercise 5.1.38. Suppose $\tau$ is an a.s. finite stopping time with respect to the canonical filtration $\mathcal{F}^Z_n$ of a sequence $\{Z_k\}$ of i.i.d. R.V.-s.
(a) Show that $\mathcal{T}^Z_\tau = \sigma(Z_{\tau+k}, k \ge 1)$ is independent of the stopped $\sigma$-algebra $\mathcal{F}^Z_\tau$.
(b) Provide an example of a finite $\mathcal{F}^Z_n$-stopping time $\tau$ and independent $\{Z_k\}$ for which $\mathcal{T}^Z_\tau$ is not independent of $\mathcal{F}^Z_\tau$.
5.2. Martingale representations and inequalities

In Subsection 5.2.1 we show that martingales are at the core of all adapted processes. We further explore there the structure of certain sub-martingales, introducing the increasing process associated with square-integrable martingales. This is augmented in Subsection 5.2.2 by the study of maximal inequalities for sub-martingales (and martingales). Such inequalities are an important technical tool in many applications of probability theory. In particular, they are the key to the convergence results of Section 5.3.

5.2.1. Martingale decompositions. To demonstrate the relevance of martingales to the study of general stochastic processes, we start with a representation of any adapted, integrable, discrete-time S.P. as the sum of a martingale and a predictable process.
Theorem 5.2.1 (Doob's decomposition). Given an integrable stochastic process $X_n$, adapted to a discrete parameter filtration $\{\mathcal{F}_n\}$, $n \ge 0$, there exists a decomposition $X_n = Y_n + A_n$ such that $(Y_n, \mathcal{F}_n)$ is a MG and $A_n$ is an $\mathcal{F}_n$-predictable sequence. This decomposition is unique up to the value of $Y_0 \in m\mathcal{F}_0$.

Proof. Let $A_0 = 0$ and for $n \ge 1$ set
$$A_n = A_{n-1} + E[X_n - X_{n-1} \mid \mathcal{F}_{n-1}] \,.$$
By definition of the conditional expectation we see that $A_k - A_{k-1}$ is measurable on $\mathcal{F}_{k-1}$ for any $k \ge 1$. Since $\mathcal{F}_{k-1} \subseteq \mathcal{F}_{n-1}$ for all $k \le n$ and $A_n = A_0 + \sum_{k=1}^n (A_k - A_{k-1})$, it follows that $A_n$ is $\mathcal{F}_n$-predictable. We next check that $Y_n = X_n - A_n$ is a MG. To this end, recall that since $X_n$ is integrable so is $X_n - X_{n-1}$, whereas the C.E. only reduces the $L^1$ norm (see Example 4.2.20). Therefore, $E|A_n - A_{n-1}| \le E|X_n - X_{n-1}| < \infty$. Hence, $A_n$ is integrable, as is $X_n$, implying by Minkowski's inequality that $Y_n$ is integrable as well. With $X_n$ adapted and $A_n$ predictable, hence adapted, to $\{\mathcal{F}_n\}$, we see that $Y_n$ is also $\mathcal{F}_n$-adapted. It remains to check the martingale condition, that almost surely $E[Y_n - Y_{n-1} \mid \mathcal{F}_{n-1}] = 0$ for all $n \ge 1$. Indeed, by linearity of the C.E. and the construction of the $\mathcal{F}_n$-predictable sequence $A_n$, for any $n \ge 1$,
$$E[Y_n - Y_{n-1} \mid \mathcal{F}_{n-1}] = E[X_n - X_{n-1} - (A_n - A_{n-1}) \mid \mathcal{F}_{n-1}] = E[X_n - X_{n-1} \mid \mathcal{F}_{n-1}] - (A_n - A_{n-1}) = 0 \,.$$
We finish the proof by checking that such a decomposition is unique up to the choice of $Y_0$. To this end, suppose that $X_n = Y_n + A_n = \widetilde{Y}_n + \widetilde{A}_n$ are two such decompositions of a given stochastic process $X_n$. Then, $\widetilde{Y}_n - Y_n = A_n - \widetilde{A}_n$. Since $A_n$ and $\widetilde{A}_n$ are both $\mathcal{F}_n$-predictable sequences while $(Y_n, \mathcal{F}_n)$ and $(\widetilde{Y}_n, \mathcal{F}_n)$ are martingales, we find that
$$A_n - \widetilde{A}_n = E[A_n - \widetilde{A}_n \mid \mathcal{F}_{n-1}] = E[\widetilde{Y}_n - Y_n \mid \mathcal{F}_{n-1}] = \widetilde{Y}_{n-1} - Y_{n-1} = A_{n-1} - \widetilde{A}_{n-1} \,.$$
Thus, $A_n - \widetilde{A}_n$ is independent of $n$ and if in addition $Y_0 = \widetilde{Y}_0$ then $A_n - \widetilde{A}_n = A_0 - \widetilde{A}_0 = \widetilde{Y}_0 - Y_0 = 0$ for all $n$. In conclusion, both sequences $A_n$ and $Y_n$ are uniquely determined as soon as we determine $Y_0$, a R.V. measurable on $\mathcal{F}_0$. □
Doob's decomposition has more structure when $(X_n, \mathcal{F}_n)$ is a sub-MG.

Exercise 5.2.2. Check that the predictable part of Doob's decomposition of a sub-martingale $(X_n, \mathcal{F}_n)$ is a non-decreasing sequence, that is, $A_n \le A_{n+1}$ for all $n$.

Remark. As shown in Subsection 5.3.2, Doob's decomposition is particularly useful in connection with square-integrable martingales $X_n$, where one can relate the limit of $X_n$ as $n \to \infty$ with that of the non-decreasing sequence $A_n$ in the decomposition of $X_n^2$.

We next evaluate Doob's decomposition for two classical sub-MGs.
Example 5.2.3. Consider the sub-MG $S_n^2$ for the random walk $S_n = \sum_{k=1}^n \xi_k$, where $\xi_k$ are i.i.d. random variables with $E\xi_1 = 0$ and $E\xi_1^2 = 1$. Since $Y_n = S_n^2 - n$ is a martingale (see Exercise 5.1.24), and Doob's decomposition $S_n^2 = Y_n + A_n$ is unique, it follows that the non-decreasing predictable part in the decomposition of $S_n^2$ is $A_n = n$.
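The computation in this example can be mirrored numerically. The sketch below (ours, not from the text) takes the special case of the $\pm 1$ walk, accumulates the predictable increments $E[X_k - X_{k-1} \mid \mathcal{F}_{k-1}]$ of $X_k = S_k^2$ along each path, and confirms that $A_n = n$ on every path:

```python
from itertools import product

def doob_predictable_part(path):
    """A_1, ..., A_n along a given +-1 path for X_n = S_n^2, using
    A_n - A_{n-1} = E[X_n - X_{n-1} | F_{n-1}]
                  = ((S_{n-1}+1)^2 + (S_{n-1}-1)^2)/2 - S_{n-1}^2."""
    s, a, out = 0, 0, []
    for x in path:
        a += ((s + 1) ** 2 + (s - 1) ** 2) / 2 - s ** 2  # always equals 1
        s += x
        out.append(a)
    return out

# A_n = n regardless of the realized path, matching Example 5.2.3
for path in product([-1, 1], repeat=4):
    assert doob_predictable_part(path) == [1.0, 2.0, 3.0, 4.0]
```

For most sub-MGs (as in the next example) the increments depend on the path, and the same routine would return a genuinely random sequence $A_n$.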
In contrast with the preceding example, the non-decreasing predictable part in Doob's decomposition is for most sub-MGs a non-degenerate random sequence, as is the case in our next example.

Example 5.2.4. Consider the sub-MG $(M_n, \mathcal{F}^Z_n)$ where $M_n = \prod_{i=1}^n Z_i$ for i.i.d. integrable $Z_i \ge 0$ such that $EZ_1 > 1$ (see Example 5.1.10). The non-decreasing predictable part of its Doob's decomposition is such that for $n \ge 1$
$$A_{n+1} - A_n = E[M_{n+1} - M_n \mid \mathcal{F}^Z_n] = E[Z_{n+1} M_n - M_n \mid \mathcal{F}^Z_n] = M_n E[Z_{n+1} - 1 \mid \mathcal{F}^Z_n] = M_n (EZ_1 - 1)$$
(since $Z_{n+1}$ is independent of $\mathcal{F}^Z_n$). In this case $A_n = (EZ_1 - 1) \sum_{k=1}^{n-1} M_k + A_1$, where we are free to choose for $A_1$ any non-random constant. We see that $A_n$ is a non-degenerate random sequence (assuming the R.V. $Z_i$ are not a.s. constant).
We conclude with the representation of any $L^1$-bounded martingale as the difference of two non-negative martingales (resembling the representation $X = X_+ - X_-$ for an integrable R.V. $X$ and non-negative $X_\pm$).

Exercise 5.2.5. Let $(X_n, \mathcal{F}_n)$ be a martingale with $\sup_n E|X_n| < \infty$. Show that there is a representation $X_n = Y_n - Z_n$ with $(Y_n, \mathcal{F}_n)$ and $(Z_n, \mathcal{F}_n)$ non-negative martingales such that $\sup_n E[Y_n] < \infty$ and $\sup_n E[Z_n] < \infty$.
5.2.2. Maximal and up-crossing inequalities. Martingales are rather tame stochastic processes. In particular, as we see next, the tail of $\max_{k \le n} X_k$ is bounded by moments of $X_n$. This is a major improvement over Markov's inequality, relating the typically much smaller tail of the R.V. $X_n$ to its moments (see part (b) of Example 1.3.14).

Theorem 5.2.6 (Doob's inequality). For any sub-martingale $X_n$ and $x > 0$ let $\tau_x = \min\{k \ge 0 : X_k \ge x\}$. Then, for any finite $n \ge 0$,

(5.2.1) $\quad P(\max_{k=0}^{n} X_k \ge x) \le x^{-1} E[X_n I_{\{\tau_x \le n\}}] \le x^{-1} E[(X_n)_+] \,.$

Proof. Since $X_{\tau_x} \ge x$ whenever $\tau_x$ is finite, setting
$$A_n = \{\omega : \tau_x(\omega) \le n\} = \{\omega : \max_{k=0}^{n} X_k(\omega) \ge x\} \,,$$
it follows that
$$E[X_{n \wedge \tau_x}] = E[X_{\tau_x} I_{\{\tau_x \le n\}}] + E[X_n I_{\{\tau_x > n\}}] \ge x P(A_n) + E[X_n I_{A_n^c}] \,.$$
With $X_n$ a sub-MG and $n \wedge \tau_x \le n$ a pair of $\mathcal{F}^X_n$-stopping times, it follows from Corollary 5.1.33 that $E[X_{n \wedge \tau_x}] \le E[X_n]$. Therefore, $E[X_n] - E[X_n I_{A_n^c}] \ge x P(A_n)$ which is exactly the left inequality in (5.2.1). The right inequality there holds by monotonicity of the expectation and the trivial fact $X I_A \le (X)_+$ for any R.V. $X$ and any measurable set $A$. □
Remark. Doob's inequality generalizes Kolmogorov's maximal inequality. Indeed, consider $X_k = Z_k^2$ for the $L^2$-martingale $Z_k = Y_1 + \cdots + Y_k$, where $Y_l$ are mutually independent with $EY_l = 0$ and $EY_l^2 < \infty$. By Proposition 5.1.22 $X_k$ is a sub-MG, so by Doob's inequality we obtain that for any $z > 0$,
$$P(\max_{1 \le k \le n} |Z_k| \ge z) = P(\max_{1 \le k \le n} X_k \ge z^2) \le z^{-2} E[(X_n)_+] = z^{-2} \mathrm{Var}(Z_n)$$
which is exactly Kolmogorov's maximal inequality of Proposition 2.3.15.
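Both inequalities in (5.2.1) can be checked exactly on a small example. The sketch below (ours, not from the text) uses the sub-MG $X_k = S_k^2$ for the simple symmetric walk and exact rational arithmetic over all paths:

```python
from itertools import product
from fractions import Fraction

def doob_check(n, x):
    """For the sub-MG X_k = S_k^2 (S the simple symmetric walk), return
    the three quantities in (5.2.1) as exact fractions:
    P(max_{k<=n} X_k >= x),  E[X_n 1_{tau_x <= n}]/x,  E[(X_n)_+]/x."""
    p = hit = plus = Fraction(0)
    w = Fraction(1, 2**n)               # weight of each of the 2^n paths
    for signs in product([-1, 1], repeat=n):
        s, m = 0, 0
        for step in signs:
            s += step
            m = max(m, s * s)
        if m >= x:                       # the event {tau_x <= n}
            p += w
            hit += w * s * s
        plus += w * max(s * s, 0)
    return p, hit / x, plus / x

lhs, mid, rhs = doob_check(8, 9)
assert lhs <= mid <= rhs                 # the two inequalities of (5.2.1)
```

Running the same check for several $(n, x)$ pairs shows how much slack each of the two bounds carries.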
Combining Doob's inequality with Doob's decomposition of non-negative sub-martingales, we arrive at the following bounds, due to Lenglart.

Lemma 5.2.7. Let $V_n = \max_{k=0}^{n} Z_k$ and $A_n$ denote the $\mathcal{F}_n$-predictable sequence in Doob's decomposition of a non-negative sub-martingale $(Z_n, \mathcal{F}_n)$ with $Z_0 = 0$. Then, for any $\mathcal{F}_n$-stopping time $\tau$ and all $x, y > 0$,

(5.2.2) $\quad P(V_\tau \ge x, A_\tau \le y) \le x^{-1} E(A_\tau \wedge y) \,.$

Further, in this case $E[V_\tau^p] \le c_p E[A_\tau^p]$ for $c_p = 1 + 1/(1-p)$ and any $p \in (0, 1)$.

Proof. Since $M_n = Z_n - A_n$ is a MG with respect to the filtration $\{\mathcal{F}_n\}$ (starting at $M_0 = 0$), by Theorem 5.1.32 the same applies for the stopped stochastic process $M_{n \wedge \tau}$, with $\tau$ any $\mathcal{F}_n$-stopping time. By the same reasoning $Z_{n \wedge \tau} = M_{n \wedge \tau} + A_{n \wedge \tau}$ is a sub-MG with respect to $\{\mathcal{F}_n\}$. Applying Doob's inequality (5.2.1) for this non-negative sub-MG we deduce that for any $n$ and $x > 0$,
$$P(V_{n \wedge \tau} \ge x) = P(\max_{k=0}^{n} Z_{k \wedge \tau} \ge x) \le x^{-1} E[Z_{n \wedge \tau}] = x^{-1} E[A_{n \wedge \tau}] \,.$$
Both $V_{n \wedge \tau}$ and $A_{n \wedge \tau}$ are non-negative and non-decreasing in $n$ (see Exercise 5.2.2), so by monotone convergence we have that $P(V_\tau \ge x) \le x^{-1} E A_\tau$. In particular, fixing $y > 0$, since $A_n$ is $\mathcal{F}_n$-predictable, $\theta = \min\{n \ge 0 : A_{n+1} > y\}$ is an $\mathcal{F}_n$-stopping time. Further, with $A_n$ non-decreasing, $\theta < \infty$ if and only if $A_\infty > y$ in which case $A_\theta \le y$ (by the definition of $\theta$). Consequently, $A_{\tau \wedge \theta} \le A_\tau \wedge y$ and as $\{V_\tau \ge x, A_\tau \le y\} \subseteq \{V_{\tau \wedge \theta} \ge x\}$ we arrive at the inequality (5.2.2).

Next, considering (5.2.2) for $x = y$ we see that for $Y = A_\tau$ and any $y > 0$,
$$P(V_\tau \ge y) \le P(Y \ge y) + E[\min(Y/y, 1)] \,.$$
Multiplying both sides of this inequality by $p y^{p-1}$ and integrating over $y \in (0, \infty)$, upon taking $r = 1 > p$ in part (a) of Lemma 1.4.31 we conclude that
$$E V_\tau^p \le E Y^p + (1-p)^{-1} E Y^p \,,$$
as claimed. □
To practice your understanding, adapt the proof of Doob's inequality en-route to the following dual inequality (which is often called Doob's second sub-MG inequality).

Exercise 5.2.8. Show that for any sub-MG $X_n$, finite $n \ge 0$ and $x > 0$,

(5.2.3) $\quad P(\min_{k=0}^{n} X_k \le -x) \le x^{-1} \big(E[(X_n)_+] - E[X_0]\big) \,.$

Here is a typical example of an application of Doob's inequality.

Exercise 5.2.9. Fixing $s > 0$, the independent variables $Z_n$ are such that $P(Z_n = 1) = P(Z_n = -1) = n^{-s}/2$ and $P(Z_n = 0) = 1 - n^{-s}$. Starting at $Y_0 = 0$, for $n \ge 1$ let
$$Y_n = n^s Y_{n-1} |Z_n| + Z_n I_{\{Y_{n-1} = 0\}} \,.$$
(a) Show that $Y_n$ is a martingale and that for any $x > 0$ and $n \ge 1$,
$$P(\max_{k=1}^{n} Y_k \ge x) \le \frac{1}{2x} \Big[1 + \sum_{k=1}^{n-1} (k+1)^{-s} (1 - k^{-s})\Big] \,.$$
(b) Show that $Y_n \xrightarrow{p} 0$ as $n \to \infty$ and further $Y_n \xrightarrow{a.s.} 0$ if and only if $s > 1$, but there is no value of $s$ for which $Y_n \xrightarrow{L^1} 0$.
Martingales also provide bounds on the probability that the sum of bounded independent variables is too close to its mean (in lieu of the clt).

Exercise 5.2.10. Let $S_n = \sum_{k=1}^n \xi_k$ where $\xi_k$ are independent and $E\xi_k = 0$, $|\xi_k| \le K$ for all $k$. Let $s_n^2 = \sum_{k=1}^n E\xi_k^2$. Using Corollary 5.1.33 for the martingale $S_n^2 - s_n^2$ and a suitable stopping time show that
$$P(\max_{k=1}^{n} |S_k| \le x) \le (x + K)^2 / s_n^2 \,.$$
If the positive part of the sub-MG has finite $p$-th moment you can improve the rate of decay in $x$ in Doob's inequality by an application of Proposition 5.1.22 for the convex non-decreasing $\phi(y) = \max(y, 0)^p$, denoted hereafter by $(y)_+^p$. Further, in case of a MG the same argument yields comparable bounds on tail probabilities for the maximum of $|Y_k|$.

Exercise 5.2.11.
(a) Show that for any sub-MG $Y_n$, $p \ge 1$, finite $n \ge 0$ and $y > 0$,
$$P(\max_{k=0}^{n} Y_k \ge y) \le y^{-p} E\big[\max(Y_n, 0)^p\big] \,.$$
(b) Show that in case $Y_n$ is a martingale, also
$$P(\max_{k=1}^{n} |Y_k| \ge y) \le y^{-p} E\big[|Y_n|^p\big] \,.$$
(c) Suppose the martingale $Y_n$ is such that $Y_0 = 0$. Using the fact that $(Y_n + c)^2$ is a sub-martingale and optimizing over $c$, show that for $y > 0$,
$$P(\max_{k=0}^{n} Y_k \ge y) \le \frac{EY_n^2}{EY_n^2 + y^2} \,.$$
Here is the version of Doob's inequality for non-negative sup-MGs and its application for the random walk.

Exercise 5.2.12.
(a) Show that if $\tau$ is a stopping time for the canonical filtration of a non-negative super-martingale $X_n$ then $EX_0 \ge EX_{n \wedge \tau} \ge E[X_\tau I_{\{\tau \le n\}}]$ for any finite $n$.
(b) Deduce that if $X_n$ is a non-negative super-martingale then for any $x > 0$,
$$P(\sup_k X_k \ge x) \le x^{-1} EX_0 \,.$$
(c) Suppose $S_n$ is a random walk with $E\xi_1 = -\mu < 0$ and $\mathrm{Var}(\xi_1) = \sigma^2 > 0$. Let $\alpha = \mu/(\sigma^2 + \mu^2)$ and $f(x) = 1/(1 + \alpha(z - x)_+)$. Show that $f(S_n)$ is a super-martingale and use this to conclude that for any $z > 0$,
$$P(\sup_k S_k \ge z) \le \frac{1}{1 + \alpha z} \,.$$
Hint: Taking $v(x) = \alpha f(x)^2 1_{\{x < z\}}$ show that $g_x(y) = f(x) + v(x)[(y - x) + \alpha(y - x)^2] \ge f(y)$ for all $x$ and $y$. Then show that $f(S_n) = E[g_{S_n}(S_{n+1}) \mid S_k, k \le n]$.
Integrating Doob's inequality we next get bounds on the moments of the maximum of a sub-MG.

Corollary 5.2.13 ($L^p$ maximal inequalities). If $X_n$ is a sub-MG then for any $n$ and $p > 1$,

(5.2.4) $\quad E\big[(\max_{k \le n} X_k)_+^p\big] \le q^p \, E\big[(X_n)_+^p\big] \,,$

where $q = p/(p-1)$ is a finite universal constant. Consequently, if $Y_n$ is a MG then for any $n$ and $p > 1$,

(5.2.5) $\quad E\big[\big(\max_{k \le n} |Y_k|\big)^p\big] \le q^p \, E\big[|Y_n|^p\big] \,.$

Proof. The bound (5.2.4) is obtained by applying part (b) of Lemma 1.4.31 for the non-negative variables $X = (X_n)_+$ and $Y = (\max_{k \le n} X_k)_+$. Indeed, the hypothesis $P(Y \ge y) \le y^{-1} E[X I_{\{Y \ge y\}}]$ of this lemma is provided by the left inequality in (5.2.1) and its conclusion that $EY^p \le q^p EX^p$ is precisely (5.2.4). In case $Y_n$ is a martingale, we get (5.2.5) by applying (5.2.4) for the non-negative sub-MG $X_n = |Y_n|$. □

Remark. A bound such as (5.2.5) can not hold for all sub-MGs. For example, the non-random sequence $Y_k = (k - n) \wedge 0$ is a sub-MG with $|Y_0| = n$ but $Y_n = 0$.
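Inequality (5.2.5) is also easy to verify exactly on a small case. The sketch below (ours, not from the text) takes $p = 2$, so $q^p = 4$, and the martingale $Y_k = S_k$ for the simple symmetric walk:

```python
from itertools import product
from fractions import Fraction

def lp_maximal_check(n, p=2):
    """Exact comparison of the two sides of (5.2.5) for the martingale
    Y_k = S_k: returns (E[(max_k |Y_k|)^p], q^p * E[|Y_n|^p])."""
    lhs = rhs = Fraction(0)
    w = Fraction(1, 2**n)
    for signs in product([-1, 1], repeat=n):
        s, m = 0, 0
        for step in signs:
            s += step
            m = max(m, abs(s))
        lhs += w * m**p
        rhs += w * abs(s)**p
    q = Fraction(p, p - 1)
    return lhs, q**p * rhs

lhs, rhs = lp_maximal_check(8, p=2)
assert lhs <= rhs   # Doob's L^2 maximal inequality holds with room to spare
```

Here $E[|Y_n|^2] = n$, so the right side is $4n$, while the left side is the exact second moment of the running maximum of $|S_k|$.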
The following two exercises show that while $L^p$ maximal inequalities as in Corollary 5.2.13 can not hold for $p = 1$, such an inequality does hold provided we replace $E(X_n)_+$ in the bound by $E[X_n (\log X_n)_+]$.

Exercise 5.2.14. Consider the martingale $M_n = \prod_{k=1}^n Y_k$ for i.i.d. non-negative random variables $Y_k$ with $EY_1 = 1$ and $P(Y_1 = 1) < 1$.
(a) Explain why $E(\log Y_1)_+$ is finite and why the strong law of large numbers implies that $n^{-1} \log M_n$ converges a.s. to a limit in $[-\infty, 0)$ when $n \to \infty$.
(b) Deduce that $M_n \xrightarrow{a.s.} 0$ as $n \to \infty$ and that consequently $M_n$ is not uniformly integrable.
(c) Show that if (5.2.4) applies for $p = 1$ and some $q < \infty$, then any non-negative martingale would have been uniformly integrable.

Exercise 5.2.15. Show that if $X_n$ is a non-negative sub-MG then
$$E\big[\max_{k \le n} X_k\big] \le (1 - e^{-1})^{-1} \big\{1 + E[X_n (\log X_n)_+]\big\} \,.$$
Hint: Apply part (c) of Lemma 1.4.31 and recall that $x(\log y)_+ \le e^{-1} y + x(\log x)_+$ for any $x, y \ge 0$.
We just saw that in general $L^1$-bounded martingales might not be U.I. Nevertheless, as you show next, for sums of independent zero-mean random variables these two properties are equivalent.

Exercise 5.2.16. Suppose $S_n = \sum_{k=1}^n \xi_k$ with $\xi_k$ independent.
(a) Prove Ottaviani's inequality. Namely, show that for any $n$ and $t, s \ge 0$,
$$P(\max_{k=1}^{n} |S_k| \ge t + s) \le P(|S_n| \ge t) + P(\max_{k=1}^{n} |S_k| \ge t + s) \max_{k=1}^{n} P(|S_n - S_k| > s) \,.$$
(b) Suppose further that $\xi_k$ is integrable, $E\xi_k = 0$ and $\sup_n E|S_n| < \infty$. Show that then $E[\sup_k |S_k|]$ is finite.
In the spirit of Doob's inequality bounding the tail probability of the maximum of a sub-MG $X_k$, $k = 0, 1, \ldots, n$ in terms of the value of $X_n$, we will bound the oscillations of $X_k$, $k = 0, 1, \ldots, n$ over an interval $[a, b]$ in terms of $X_0$ and $X_n$. To this end, we require the following definition of up-crossings.

Definition 5.2.17. The number of up-crossings of the interval $[a, b]$ by $X_k(\omega)$, $k = 0, 1, \ldots, n$, denoted $U_n[a, b](\omega)$, is the largest $\ell \in \mathbb{Z}_+$ such that $X_{s_i}(\omega) < a$ and $X_{t_i}(\omega) > b$ for $1 \le i \le \ell$ and some $0 \le s_1 < t_1 < \cdots < s_\ell < t_\ell \le n$.

For example, Fig. 1 depicts two up-crossings of $[a, b]$.

Figure 1. Illustration of up-crossings of $[a, b]$ by $X_k(\omega)$
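Definition 5.2.17 translates directly into a counting routine. The following sketch (ours, not part of the notes) counts completed passages from strictly below $a$ to strictly above $b$, exactly as in the definition:

```python
def upcrossings(xs, a, b):
    """Number of up-crossings of [a, b] by the finite sequence xs, per
    Definition 5.2.17: each up-crossing is a visit strictly below a
    followed later by a visit strictly above b."""
    count, below = 0, False
    for x in xs:
        if not below and x < a:
            below = True            # entered (-inf, a); wait to cross above b
        elif below and x > b:
            below = False           # up-crossing completed
            count += 1
    return count

assert upcrossings([0, -1, 2, -2, 0, 3], a=0, b=1) == 2
assert upcrossings([5, 6, 7], a=0, b=1) == 0
```

The internal flag `below` plays exactly the role of the predictable sequence $V_k$ used in the proof of Lemma 5.2.18 below: it is $1$ while waiting for the process to climb from below $a$ to above $b$.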
Our next result, Doob's up-crossing inequality, is the key to the a.s. convergence of sup-MGs (and sub-MGs) on which Section 5.3 is based.

Lemma 5.2.18 (Doob's up-crossing inequality). If $X_n$ is a sup-MG then

(5.2.6) $\quad (b - a) E(U_n[a, b]) \le E[(X_n - a)_-] - E[(X_0 - a)_-] \qquad \forall a < b \,.$

Proof. Fixing $a < b$, let $V_1 = I_{\{X_0 < a\}}$ and for $n = 2, 3, \ldots$, define recursively $V_n = I_{\{V_{n-1}=1, X_{n-1} \le b\}} + I_{\{V_{n-1}=0, X_{n-1} < a\}}$. Informally, the sequence $V_k$ is zero while waiting for the process $X_n$ to enter $(-\infty, a)$ after which time it reverts to one and stays so while waiting for this process to enter $(b, \infty)$. See Figure 1 for an illustration in which black circles depict indices $k$ such that $V_k = 1$ and open circles indicate those values of $k$ with $V_k = 0$. Clearly, the sequence $V_n$ is predictable for the canonical filtration of $X_n$. Let $Y_n$ denote the martingale transform of $V_n$ with respect to $X_n$ (per Definition 5.1.27). By the choice of $V_k$ every up-crossing of the interval $[a, b]$ by $X_k$, $k = 0, 1, \ldots, n$ contributes to $Y_n$ the difference between the value of the process at the end of the up-crossing (i.e. the last in the corresponding run of black circles), which is at least $b$, and its value at the start of the up-crossing (i.e. the last in the preceding run of open circles), which is at most $a$. Thus, each up-crossing increases $Y_n$ by at least $(b - a)$ and if $X_0 < a$ then the first up-crossing must have contributed at least $(b - X_0) = (b - a) + (X_0 - a)_-$ to $Y_n$. The only other contribution to $Y_n$ is by the up-crossing of the interval $[a, b]$ that is in progress at time $n$ (if there is such), and since it started at value at most $a$, its contribution to $Y_n$ is at least $-(X_n - a)_-$. We thus conclude that
$$Y_n \ge (b - a) U_n[a, b] + (X_0 - a)_- - (X_n - a)_-$$
for all $\omega$. With $V_n$ predictable, bounded and non-negative it follows that $Y_n$ is a super-martingale (see parts (b) and (c) of Theorem 5.1.28). Thus, considering the expectation of the preceding inequality yields the up-crossing inequality (5.2.6) since $0 = EY_0 \ge EY_n$ for the sup-MG $Y_n$. □
Doob's up-crossing inequality implies that the total number of up-crossings of $[a, b]$ by a non-negative sup-MG has a finite expectation. In this context, Dubins' up-crossing inequality, which you are to derive next, provides universal (i.e. depending only on $a/b$), exponential bounds on tail probabilities of this random variable.

Exercise 5.2.19. Suppose $(X^1_n, \mathcal{F}_n)$ and $(X^2_n, \mathcal{F}_n)$ are both sup-MGs and $\tau$ is an $\mathcal{F}_n$-stopping time such that $X^1_\tau \ge X^2_\tau$.
(a) Show that $W_n = X^1_n I_{\{\tau > n\}} + X^2_n I_{\{\tau \le n\}}$ is a sup-MG with respect to $\mathcal{F}_n$ and deduce that so is $Y_n = X^1_n I_{\{\tau \ge n\}} + X^2_n I_{\{\tau < n\}}$ (this is sometimes called the switching principle).
(b) For a sup-MG $X_n \ge 0$ and constants $b > a > 0$ define the $\mathcal{F}^X_n$-stopping times $\theta_0 = -1$, $\tau_\ell = \inf\{k > \theta_\ell : X_k \le a\}$ and $\theta_{\ell+1} = \inf\{k > \tau_\ell : X_k \ge b\}$, $\ell = 0, 1, \ldots$. That is, the $\ell$-th up-crossing of $(a, b)$ by $X_n$ starts at $\tau_{\ell-1}$ and ends at $\theta_\ell$. For $\ell = 0, 1, \ldots$ let $Z_n = a^{-\ell} b^{\ell}$ when $n \in [\theta_\ell, \tau_\ell)$ and $Z_n = a^{-\ell-1} b^{\ell} X_n$ for $n \in [\tau_\ell, \theta_{\ell+1})$. Show that $(Z_n, \mathcal{F}^X_n)$ is a sup-MG.
(c) For $b > a > 0$ let $U_\infty[a, b]$ denote the total number of up-crossings of the interval $[a, b]$ by a non-negative super-martingale $X_n$. Deduce from the preceding that for any positive integer $\ell$,
$$P(U_\infty[a, b] \ge \ell) \le \Big(\frac{a}{b}\Big)^{\ell} E[\min(X_0/a, 1)]$$
(this is Dubins' up-crossing inequality).
5.3. The convergence of Martingales

As we shall see in this section, a sub-MG (or a sup-MG), has an integrable limit under relatively mild integrability assumptions. For example, in this context $L^1$-boundedness (i.e. the finiteness of $\sup_n E|X_n|$), yields a.s. convergence (see Doob's convergence theorem), while the $L^1$-convergence of $X_n$ is equivalent to the stronger hypothesis of uniform integrability of this process (see Theorem 5.3.12). Finally, the even stronger $L^p$-convergence applies for the smaller sub-class of $L^p$-bounded martingales (see Doob's $L^p$ martingale convergence).

Indeed, these convergence results are closely related to the fact that the maximum and up-crossings counts of a sub-MG do not grow too rapidly (and the same applies for sup-MGs and martingales). To further explore this direction, we next link the finiteness of the total number of up-crossings $U_\infty[a, b]$ of intervals $[a, b]$, $b > a$, by a process $X_n$ to its a.s. convergence.
Lemma 5.3.1. If for each $b > a$ almost surely $U_\infty[a, b] < \infty$, then $X_n \xrightarrow{a.s.} X_\infty$ where $X_\infty$ is an $\overline{\mathbb{R}}$-valued random variable.

Proof. Note that the event that $X_n(\omega)$ has an almost sure ($\overline{\mathbb{R}}$-valued) limit as $n \to \infty$ is the complement of
$$\Gamma = \bigcup_{\substack{a, b \in \mathbb{Q} \\ a < b}} \Gamma_{a,b} \,,$$
where for each $b > a$,
$$\Gamma_{a,b} = \{\omega : \liminf_n X_n(\omega) < a < b < \limsup_n X_n(\omega)\} \,.$$
Since $\Gamma$ is a countable union of these events, it thus suffices to show that $P(\Gamma_{a,b}) = 0$ for any $a, b \in \mathbb{Q}$, $a < b$. To this end note that if $\omega \in \Gamma_{a,b}$ then $\limsup_n X_n(\omega) > b$ and $\liminf_n X_n(\omega) < a$ are both limit points of the sequence $X_n(\omega)$, hence the total number of up-crossings of the interval $[a, b]$ by this sequence is infinite. That is, $\Gamma_{a,b} \subseteq \{\omega : U_\infty[a, b](\omega) = \infty\}$. So, from our hypothesis that $U_\infty[a, b]$ is finite almost surely it follows that $P(\Gamma_{a,b}) = 0$ for each $a < b$, resulting with the stated conclusion. □
Combining Doob's up-crossing inequality of Lemma 5.2.18 with Lemma 5.3.1 we now prove Doob's a.s. convergence theorem for sup-MGs (and sub-MGs).

Theorem 5.3.2 (Doob's convergence theorem). Suppose the sup-MG $(X_n, \mathcal{F}_n)$ is such that $\sup_n E[(X_n)_-] < \infty$. Then, $X_n \xrightarrow{a.s.} X_\infty$ and $E|X_\infty| \le \liminf_n E|X_n|$ is finite.

Proof. Fixing $b > a$, recall that $0 \le U_n[a, b] \uparrow U_\infty[a, b]$ as $n \to \infty$, where $U_\infty[a, b]$ denotes the total number of up-crossings of $[a, b]$ by the sequence $\{X_n\}$. Hence, by monotone convergence $E(U_\infty[a, b]) = \sup_n E(U_n[a, b])$. Further, with $(x - a)_- \le |a| + x_-$, we get from Doob's up-crossing inequality and the monotonicity of the expectation that
$$E(U_n[a, b]) \le \frac{1}{(b - a)} E(X_n - a)_- \le \frac{1}{(b - a)} \Big[|a| + \sup_n E[(X_n)_-]\Big] \,.$$
Thus, our hypothesis that $\sup_n E[(X_n)_-] < \infty$ implies that $E(U_\infty[a, b])$ is finite, hence in particular $U_\infty[a, b]$ is finite almost surely. Since this applies for any $b > a$, we have from Lemma 5.3.1 that $X_n \xrightarrow{a.s.} X_\infty$. Further, with $X_n$ a sup-MG, we have that $E|X_n| = EX_n + 2E(X_n)_- \le EX_0 + 2E(X_n)_-$ for all $n$. Using this observation in conjunction with Fatou's lemma for $0 \le |X_n| \xrightarrow{a.s.} |X_\infty|$ and our hypothesis, we find that
$$E|X_\infty| \le \liminf_n E|X_n| \le EX_0 + 2 \sup_n E[(X_n)_-] < \infty \,,$$
as stated. □
Remark. In particular, Doob's convergence theorem implies that if $(X_n, \mathcal{F}_n)$ is a non-negative sup-MG then $X_n \xrightarrow{a.s.} X_\infty$ for some integrable $X_\infty$ (and in this case $EX_\infty \le EX_0$). The same convergence applies for a non-positive sub-MG and more generally, for any sub-MG with $\sup_n E(X_n)_+ < \infty$. Further, the following exercise provides alternative equivalent conditions for the applicability of Doob's convergence theorem.

Exercise 5.3.3. Show that the following five conditions are equivalent for any sub-MG $X_n$ (and if $X_n$ is a sup-MG, just replace $(X_n)_+$ by $(X_n)_-$).
(a) $\lim_n E|X_n|$ exists and is finite.
(b) $\sup_n E|X_n| < \infty$.
(c) $\liminf_n E|X_n| < \infty$.
(d) $\lim_n E(X_n)_+$ exists and is finite.
(e) $\sup_n E(X_n)_+ < \infty$.
Our first application of Doob's convergence theorem extends Doob's inequality (5.2.1) to the following bound on the maximal value of a U.I. sub-MG.

Corollary 5.3.4. For any U.I. sub-MG $X_n$ and $x > 0$,

(5.3.1) $\quad P(X_k \ge x \text{ for some } k < \infty) \le x^{-1} E[X_\infty I_{\{\tau_x < \infty\}}] \le x^{-1} E[(X_\infty)_+] \,,$

where $\tau_x = \min\{k \ge 0 : X_k \ge x\}$.

Proof. Let $A_n = \{\tau_x \le n\} = \{\max_{k \le n} X_k \ge x\}$ and $A_\infty = \{\tau_x < \infty\} = \{X_k \ge x \text{ for some } k < \infty\}$. Then, $A_n \uparrow A_\infty$ and as the U.I. sub-MG $X_n$ is $L^1$-bounded, we have from Doob's convergence theorem that $X_n \xrightarrow{a.s.} X_\infty$. Consequently, $X_n I_{A_n}$ and $(X_n)_+$ converge almost surely to $X_\infty I_{A_\infty}$ and $(X_\infty)_+$, respectively. Since these two sequences are U.I. we further have that $E[X_n I_{A_n}] \to E[X_\infty I_{A_\infty}]$ and $E[(X_n)_+] \to E[(X_\infty)_+]$. Recall Doob's inequality (5.2.1) that

(5.3.2) $\quad P(A_n) \le x^{-1} E[X_n I_{A_n}] \le x^{-1} E[(X_n)_+]$

for any $n$ finite. Taking $n \to \infty$ we conclude that
$$P(A_\infty) \le x^{-1} E[X_\infty I_{A_\infty}] \le x^{-1} E[(X_\infty)_+]$$
which is precisely our stated inequality (5.3.1). □
Applying Doob's convergence theorem we also find that martingales of bounded differences either converge to a finite limit or oscillate between $-\infty$ and $+\infty$.

Proposition 5.3.5. Suppose $X_n$ is a martingale of uniformly bounded differences. That is, almost surely $\sup_n |X_n - X_{n-1}| \le c$ for some finite non-random constant $c$. Then, $P(A \cup B) = 1$ for the events
$$A = \{\omega : \lim_n X_n(\omega) \text{ exists and is finite}\} \,,$$
$$B = \{\omega : \limsup_n X_n(\omega) = \infty \text{ and } \liminf_n X_n(\omega) = -\infty\} \,.$$

Proof. We may and shall assume without loss of generality that $X_0 = 0$ (otherwise, apply the proposition for the MG $Y_n = X_n - X_0$). Fixing a positive integer $k$, consider the stopping time $\tau_k(\omega) = \inf\{n \ge 0 : X_n(\omega) \le -k\}$ for the canonical filtration of $X_n$ and the associated stopped sup-MG $Y_n = X_{n \wedge \tau_k}$ (per Theorem 5.1.32). By definition of $\tau_k$ and our hypothesis of $X_n$ having uniformly bounded differences, it follows that $Y_n(\omega) \ge -(k + c)$ for all $n$. Consequently, $\sup_n E(Y_n)_- \le k + c$ and by Doob's convergence theorem $Y_n(\omega) \to Y_\infty(\omega) \in \mathbb{R}$ for all $\omega \notin \Gamma_k$ and some measurable $\Gamma_k$ such that $P(\Gamma_k) = 0$. In particular, if $\tau_k(\omega) = \infty$ and $\omega \notin \Gamma_k$ then $X_n(\omega) = Y_n(\omega)$ has a finite limit, so $\omega \in A$. This shows that $A^c \subseteq \{\tau_k < \infty\} \cup \Gamma_k$ for all $k$, and hence $A^c \subseteq B_- \cup \bigcup_k \Gamma_k$ where $B_- = \bigcap_k \{\tau_k < \infty\} = \{\omega : \liminf_n X_n(\omega) = -\infty\}$. With $P(\Gamma_k) = 0$ for all $k$, we thus deduce that $P(A \cup B_-) = 1$. Applying the preceding argument for the sup-MG $-X_n$ we find that $P(A \cup B_+) = 1$ for $B_+ = \{\omega : \limsup_n X_n(\omega) = \infty\}$. Combining these two results we conclude that $P(A \cup (B_- \cap B_+)) = 1$ as stated. □
Remark. Consider a random walk $S_n = \sum_{k=1}^n \xi_k$ with zero-mean, bounded increments $\xi_k$ (i.e. $|\xi_k| \le c$ with $c$ a finite non-random constant). Then, $v = E\xi_k^2$ is finite and the event $A$ where $S_n(\omega) \to S_\infty(\omega)$ as $n \to \infty$ for some $S_\infty(\omega)$ finite, implies that $\widehat{S}_n = (nv)^{-1/2} S_n \to 0$. Combining the clt $\widehat{S}_n \xrightarrow{D} G$ with Fatou's lemma and part (d) of the Portmanteau theorem we find that for any $\epsilon > 0$,
$$P(A) \le E[\liminf_n I_{\{|\widehat{S}_n| \le \epsilon\}}] \le \liminf_n P(|\widehat{S}_n| \le \epsilon) = P(|G| \le \epsilon) \,.$$
Taking $\epsilon \downarrow 0$ we deduce that $P(A) = 0$. Hence, by Proposition 5.3.5 such a random walk is an example of a non-converging MG for which a.s.
$$\limsup_n S_n = \infty = -\liminf_n S_n \,.$$
Here is another application of the preceding proposition.

Exercise 5.3.6. Consider the $\mathcal{F}_n$-adapted $W_n \ge 0$, such that $\sup_n |W_{n+1} - W_n| \le K$ for some finite non-random constant $K$ and $W_0 = 0$. Suppose there exist non-random, positive constants $a$ and $b$ such that for all $n \ge 0$,
$$E[W_{n+1} - W_n + a \mid \mathcal{F}_n] I_{\{W_n \ge b\}} \le 0 \,.$$
With $N_n = \sum_{k=1}^n I_{\{W_k < b\}}$, show that $P(N_\infty \text{ is finite}) = 0$.
Hint: Check that $X_n = W_n + an - (K + a) N_{n-1}$ is a sup-MG of uniformly bounded differences.
As we show next, Doob's convergence theorem leads to the integrability of $X_\tau$ for any $L^1$-bounded sub-MG $X_n$ and any stopping time $\tau$.

Lemma 5.3.7. If $(X_n, \mathcal{F}_n)$ is a sub-MG and $\sup_n E[(X_n)_+] < \infty$ then $E|X_\tau| < \infty$ for any $\mathcal{F}_n$-stopping time $\tau$.

Proof. Since $((X_n)_+, \mathcal{F}_n)$ is a sub-MG (see Proposition 5.1.22), it follows that $E[(X_{n \wedge \tau})_+] \le E[(X_n)_+]$ for all $n$ (consider Theorem 5.1.32 for the sub-MG $(X_n)_+$ and $\tau = \infty$). Thus, our hypothesis that $\sup_n E[(X_n)_+]$ is finite results with $\sup_n E[(Y_n)_+]$ finite, where $Y_n = X_{n \wedge \tau}$. Applying Doob's convergence theorem for the sub-MG $(Y_n, \mathcal{F}_n)$ we have that $Y_n \xrightarrow{a.s.} Y_\infty$ with $Y_\infty = X_\tau$ integrable. □
We further get the following relation, which is key to establishing Doob's optional stopping for certain sup-MGs (and sub-MGs).

Proposition 5.3.8. Suppose $(X_n, \mathcal{F}_n)$ is a non-negative sup-MG and $\theta \le \tau$ are stopping times for the filtration $\{\mathcal{F}_n\}$. Then, $EX_\theta \ge EX_\tau$ are finite valued.

Proof. From Theorem 5.1.32 we know that $Z_n = X_{n \wedge \tau} - X_{n \wedge \theta}$ is a sup-MG (as are $X_{n \wedge \tau}$ and $X_{n \wedge \theta}$), with $Z_0 = 0$. Thus, $E[X_{n \wedge \theta}] \ge E[X_{n \wedge \tau}]$ are finite and since $\theta \le \tau$, subtracting from both sides the finite $E[X_n I_{\{\theta \ge n\}}]$ we find that
$$E[X_\theta I_{\{\theta < n\}}] \ge E[X_\tau I_{\{\tau < n\}}] + E[X_n I_{\{\tau \ge n\}} I_{\{\theta < n\}}] \,.$$
The sup-MG $X_{n \wedge \tau}$ is non-negative, so by Doob's convergence theorem $X_{n \wedge \tau} \xrightarrow{a.s.} X_\tau$ and in view of Fatou's lemma
$$\liminf_n E[X_n I_{\{\tau \ge n\}} I_{\{\theta < n\}}] \ge E[X_\tau I_{\{\tau = \infty\}} I_{\{\theta < \infty\}}] \,.$$
Further, by monotone convergence $E[X_\theta I_{\{\theta < n\}}] \uparrow E[X_\theta I_{\{\theta < \infty\}}]$ and $E[X_\tau I_{\{\tau < n\}}] \uparrow E[X_\tau I_{\{\tau < \infty\}}]$. Hence, taking $n \to \infty$ results with
$$E[X_\theta I_{\{\theta < \infty\}}] \ge E[X_\tau I_{\{\tau < \infty\}}] + E[X_\tau I_{\{\tau = \infty\}} I_{\{\theta < \infty\}}] \,.$$
Adding the identity $E[X_\theta I_{\{\theta = \infty\}}] = E[X_\tau I_{\{\tau = \infty\}} I_{\{\theta = \infty\}}]$, which holds for $\theta \le \tau$, yields the stated inequality $E[X_\theta] \ge E[X_\tau]$. Considering $\theta = 0$ we further see that $E[X_0] \ge E[X_\theta] \ge E[X_\tau] \ge 0$ are finite, as claimed. □
Solving the next exercise should improve your intuition about the domain of va-
lidity of Proposition 5.1.22 and of Doob's convergence theorem.
Exercise 5.3.9.
(a) Provide an example of a sub-martingale X_n for which X_n^2 is a super-
martingale and explain why it does not contradict Proposition 5.1.22.
(b) Provide an example of a martingale which converges a.s. to −∞ and
explain why it does not contradict Theorem 5.3.2.
Hint: Try S_n = Σ_{i=1}^n ξ_i, with zero-mean, independent but not identically
distributed ξ_i.
We conclude this sub-section with a few additional applications of Doob's conver-
gence theorem.
Exercise 5.3.10. Suppose X_n and Y_n are non-negative, integrable processes
adapted to the filtration T_n such that Σ_{n≥1} Y_n < ∞ a.s. and E[X_{n+1} | T_n] ≤
(1 + Y_n)X_n + Y_n for all n. Show that X_n converges a.s. to a finite limit as n → ∞.
Hint: Find a non-negative super-martingale (W_n, T_n) whose convergence implies
that of X_n.
Exercise 5.3.11. Let X_k be mutually independent but not necessarily integrable
random variables, such that each X_k is symmetric (that is, X_k and −X_k have the
same law) and S_n = Σ_{k=1}^n X_k converges a.s.
(a) Fixing c < ∞ non-random, let Y_n^{(c)} = Σ_{k=1}^n |S_{k−1}| I_{|S_{k−1}|≤c} X_k I_{|X_k|≤c}.
Show that Y_n^{(c)} is a martingale with respect to the filtration T_n^X and
that sup_n ||Y_n^{(c)}||_2 < ∞.
Hint: Kolmogorov's three series theorem may help in proving that Y_n^{(c)}
is L^2-bounded.
(b) Show that Y_n = Σ_{k=1}^n |S_{k−1}| X_k converges a.s.
5.3.1. Uniformly integrable martingales. The main result of this subsec-
tion is the following L^1 convergence theorem for uniformly integrable (U.I.) sub-
MGs (and sup-MGs).
196 5. DISCRETE TIME MARTINGALES AND STOPPING TIMES
Theorem 5.3.12. If (X_n, T_n) is a sub-MG, then X_n is U.I. (c.f. Definition
1.3.47) if and only if X_n → X_∞ in L^1, in which case also X_n → X_∞ a.s. and
X_n ≤ E[X_∞ | T_n] for all n.
Remark. If X_n is uniformly integrable then sup_n E|X_n| is finite (see Lemma
1.3.48). Thus, the assumption of Theorem 5.3.12 is stronger than that of Theorem
5.3.2, as is its conclusion.
Proof. If X_n is U.I. then sup_n E|X_n| < ∞. For X_n a sub-MG it thus
follows by Doob's convergence theorem that X_n → X_∞ a.s. with X_∞ integrable. Ob-
viously, this implies that X_n → X_∞ in probability. Similarly, if we start instead by assuming
that X_n → X_∞ in L^1 then also X_n → X_∞ in probability. Either way, Vitali's convergence theorem
(i.e. Theorem 1.3.49) tells us that uniform integrability is equivalent to L^1 con-
vergence when X_n → X_∞ in probability. We thus deduce that for sub-MGs the U.I. property is
equivalent to L^1 convergence, and either one of these yields also the corresponding
a.s. convergence.
Turning to show that X_n ≤ E[X_∞ | T_n] for all n, recall that X_m ≤ E[X_ℓ | T_m] for
all ℓ > m and any sub-MG (see Proposition 5.1.20). Further, since X_ℓ → X_∞ in L^1 it
follows that E[X_ℓ | T_m] → E[X_∞ | T_m] in L^1 as ℓ → ∞, per fixed m (see Theorem 4.2.30).
The latter implies the a.s. convergence of these conditional expectations along some
sub-sequence ℓ_k (c.f. Theorem 2.2.10). Hence, we conclude that for any m, a.s.
X_m ≤ liminf_{k→∞} E[X_{ℓ_k} | T_m] ≤ E[X_∞ | T_m] ,
i.e., X_n ≤ E[X_∞ | T_n] for all n. □
The preceding theorem identifies the collection of U.I. martingales as merely the
set of all Doob's martingales, a concept we now define.
Definition 5.3.13. The sequence X_n = E[X | T_n], with X an integrable R.V. and
T_n a filtration, is called Doob's martingale of X with respect to T_n.
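In a finite setting Doob's martingales are easy to exhibit explicitly. The following sketch (our own illustrative example, not from the text) takes X = f(ω) for three fair coin flips, with T_n generated by the first n flips, computes X_n = E[X | T_n] by averaging f over the flips not yet seen, and checks the martingale property by exhaustive enumeration.

```python
from itertools import product

# A finite sketch (our own example): Doob's martingale X_n = E[X | T_n] for
# X = f(omega), a function of three fair coin flips, where T_n is generated
# by the first n flips.  E[X | T_n] averages f over the flips not yet seen.
def f(w):                      # an arbitrary integrable R.V. (our choice)
    return w[0] + 2 * w[1] * w[2]

def doob(prefix, n_total=3):
    """Value of X_n on the atom of T_n determined by the observed prefix."""
    tails = list(product((0, 1), repeat=n_total - len(prefix)))
    return sum(f(prefix + t) for t in tails) / len(tails)

# Martingale property: averaging X_{n+1} over the next flip recovers X_n.
for n in range(3):
    for prefix in product((0, 1), repeat=n):
        assert doob(prefix) == (doob(prefix + (0,)) + doob(prefix + (1,))) / 2

print(doob(()), doob((1,)), doob((1, 1)))  # 1.0 1.5 2.0
```

Here X_0 = E[X] = 1, and the conditional expectations refine as more flips are revealed; any integrable f would do.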
Corollary 5.3.14. A martingale (X_n, T_n) is U.I. if and only if X_n = E[X_∞ | T_n] is
a Doob's martingale with respect to T_n, or equivalently, if and only if X_n → X_∞ in L^1.
Proof. Theorem 5.3.12 states that a sub-MG (hence also a MG) is U.I. if and
only if it converges in L^1 and in this case X_n ≤ E[X_∞ | T_n]. Applying this theorem
also for −X_n we deduce that a U.I. martingale is necessarily a Doob's martingale
of the form X_n = E[X_∞ | T_n]. Conversely, the sequence X_n = E[X | T_n] for some
integrable X and a filtration T_n is U.I. (see Proposition 4.2.33). □
We next generalize Theorem 4.2.26 about dominated convergence of C.E.
Theorem 5.3.15 (Levy's upward theorem). Suppose sup_m |X_m| is integrable,
X_n → X_∞ a.s. and T_n ↑ T_∞. Then E[X_n | T_n] → E[X_∞ | T_∞], both a.s. and in L^1.
Remark. Levy's upward theorem is trivial if X_n is adapted to T_n (which is
obviously not part of its assumptions). On the other hand, recall that in view of
part (b) of Exercise 4.2.35, having X_n U.I. and X_n → X_∞ a.s. is in general not
enough even for the a.s. convergence of E[X_n | G] to E[X_∞ | G].
Proof. Consider first the special case where X_n = X does not depend on
n. Then, Y_n = E[X | T_n] is a U.I. martingale. Therefore, E[Y_∞ | T_n] = E[X | T_n]
for all n, where Y_∞ denotes the a.s. and L^1 limit of Y_n (see Corollary 5.3.14).
As Y_n ∈ mT_n ⊆ mT_∞, clearly Y_∞ = lim_n Y_n ∈ mT_∞. Further, by the definition of
the C.E., E[X I_A] = E[Y_∞ I_A] for all A in the π-system ∪_n T_n, hence with
T_∞ = σ(∪_n T_n) it follows that Y_∞ = E[X | T_∞] (see Exercise 4.1.3).
Turning to the general case, with Z = sup_m |X_m| integrable and X_m → X_∞ a.s., we
deduce that X_∞ and W_k = sup{|X_n − X_∞| : n ≥ k} ≤ 2Z are both integrable. So,
the conditional Jensen's inequality and the monotonicity of the C.E. imply that for
all n ≥ k,
|E[X_n | T_n] − E[X_∞ | T_n]| ≤ E[|X_n − X_∞| | T_n] ≤ E[W_k | T_n] .
Consequently, considering n → ∞ we find by the special case of the theorem where
X_n is replaced by W_k independent of n (which we already proved), that
limsup_{n→∞} |E[X_n | T_n] − E[X_∞ | T_n]| ≤ lim_{n→∞} E[W_k | T_n] = E[W_k | T_∞] .
Similarly, we know that E[X_∞ | T_n] → E[X_∞ | T_∞] a.s. Further, by definition W_k ↓ 0
a.s. as k → ∞, so also E[W_k | T_∞] ↓ 0 by the usual dominated convergence of
C.E. (see Theorem 4.2.26). Combining these two a.s. convergence results and the
preceding inequality, we deduce that E[X_n | T_n] → E[X_∞ | T_∞] a.s., as stated. Finally,
since |E[X_n | T_n]| ≤ E[Z | T_n] for all n, it follows that E[X_n | T_n] is U.I. and hence
the a.s. convergence of this sequence to E[X_∞ | T_∞] yields its convergence in L^1 as
well (c.f. Theorem 1.3.49). □
Considering Levy's upward theorem for X_n = X_∞ = I_A and A ∈ T_∞ yields the
following corollary.
Corollary 5.3.16 (Levy's 0-1 law). If T_n ↑ T_∞ and A ∈ T_∞, then E[I_A | T_n] → I_A a.s.
As shown in the sequel, Kolmogorov's 0-1 law about the P-triviality of the tail σ-
algebra T^X = ∩_n σ(X_{n+1}, X_{n+2}, …) of independent random variables is a special case of Levy's
0-1 law.
Proof of Corollary 1.4.10. Let T_∞^X = σ(∪_n T_n^X). Recall Definition 1.4.9,
implying that T^X ⊆ σ(X_{n+1}, X_{n+2}, …) ⊆ T_∞^X for all n. Thus, by Levy's 0-1 law E[I_A | T_n^X] → I_A a.s. for
any A ∈ T^X. By assumption the X_k are P-mutually independent, hence for any
A ∈ T^X the R.V. I_A, being measurable on σ(X_{n+1}, X_{n+2}, …), is independent of the σ-algebra T_n^X. Consequently,
E[I_A | T_n^X] = P(A) a.s. for all n. We deduce that P(A) = I_A a.s., implying that P(A) ∈
{0, 1} for all A ∈ T^X, as stated. □
The generalization of Theorem 4.2.30 which you derive next also relaxes the as-
sumptions of Levy's upward theorem in case only L^1 convergence is of interest.
Exercise 5.3.17. Show that if X_n → X_∞ in L^1 and T_n ↑ T_∞ then E[X_n | T_n] →
E[X_∞ | T_∞] in L^1.
Here is an example of the importance of uniform integrability when dealing with
convergence.
Exercise 5.3.18. Suppose X_n → 0 a.s. are [0, 1]-valued random variables and M_n
is a non-negative MG.
(a) Provide an example where E[X_n M_n] = 1 for all n finite.
(b) Show that if M_n is U.I. then E[X_n M_n] → 0.
Definition 5.3.19. A continuous function x : [0, 1) → R is absolutely continuous
if for every ε > 0 there exists δ > 0 such that for all k < ∞ and s_1 < t_1 ≤ s_2 < t_2 ≤
⋯ ≤ s_k < t_k ∈ [0, 1),
Σ_{ℓ=1}^k |t_ℓ − s_ℓ| ≤ δ implies Σ_{ℓ=1}^k |x(t_ℓ) − x(s_ℓ)| ≤ ε .
The next exercise uses convergence properties of MGs to prove a classical result
in real analysis, namely, that an absolutely continuous function is differentiable for
Lebesgue a.e. t ∈ [0, 1).
Exercise 5.3.20. On the probability space ([0, 1), B, U) consider the events
A_{i,n} = [(i − 1)2^{−n}, i 2^{−n}) for i = 1, …, 2^n, n = 0, 1, …,
and the associated σ-algebras T_n = σ(A_{i,n}, i = 1, …, 2^n).
(a) Write an explicit formula for E[h | T_n], for h ∈ L^1([0, 1), B, U).
(b) For h_{i,n} = 2^n (x(i 2^{−n}) − x((i − 1)2^{−n})), show that X_n(t) = Σ_{i=1}^{2^n} h_{i,n} I_{A_{i,n}}(t)
is a martingale with respect to T_n.
(c) Show that for absolutely continuous x(·) the martingale X_n is U.I.
Hint: Show that P(|X_n| > λ) ≤ c/λ for some constant c < ∞ and all
n, λ > 0.
(d) Show that then there exists h ∈ L^1([0, 1), B, U) such that
x(t) − x(s) = ∫_s^t h(u) du for all 1 > t ≥ s ≥ 0.
(e) Recall Lebesgue's theorem, that
δ^{−1} ∫_s^{s+δ} |h(s) − h(u)| du → 0 as δ ↓ 0,
for a.e. s ∈ [0, 1). Using it, conclude that dx/dt = h for almost every
t ∈ [0, 1).
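For intuition, here is a small numerical sketch of parts (b) and (e) for the absolutely continuous function x(t) = t² (our choice of x, not from the text): the dyadic martingale X_n(t) approaches the derivative dx/dt = 2t.

```python
# A numerical sketch for x(t) = t^2 (our choice of an absolutely continuous
# function): on A_{i,n} = [(i-1)2^-n, i 2^-n) the martingale takes the value
# h_{i,n} = 2^n (x(i 2^-n) - x((i-1) 2^-n)), and X_n(t) should approach
# dx/dt = 2t as n grows.
def x(u):
    return u * u

def X(n, t):
    i = int(t * 2 ** n) + 1            # index with t in A_{i,n}
    return 2 ** n * (x(i * 2.0 ** -n) - x((i - 1) * 2.0 ** -n))

vals = [X(n, 0.3) for n in (2, 6, 12, 20)]
print(vals)                            # tends to 2 * 0.3 = 0.6
```

Each refinement averages the two finer values, which is exactly the martingale property of part (b).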
Recall that a uniformly bounded p-th moment for some p > 1 implies U.I. (see
Exercise 1.3.54). Strengthening the L^1 convergence of Theorem 5.3.12, the next
proposition shows that an L^p-bounded martingale converges to its a.s. limit also
in L^p (provided p > 1). In contrast to the preceding convergence results, this
one does not hold for sub-MGs (or sup-MGs) which are not MGs (for example, let
τ = inf{k ≥ 1 : ξ_k = 0} for independent ξ_k such that P(ξ_k ≠ 0) = k^2/(k + 1)^2,
so P(τ ≥ n) = n^{−2}, and verify that X_n = n I_{τ≥n} → 0 a.s. but EX_n^2 = 1, so this
L^2-bounded sup-MG does not converge to zero in L^2).
Proposition 5.3.21 (Doob's L^p martingale convergence). If the MG X_n
is such that sup_n E|X_n|^p < ∞ for some p > 1, then there exists a R.V. X_∞ such
that X_n → X_∞ almost surely and in L^p. Moreover, ||X_∞||_p ≤ liminf_{n→∞} ||X_n||_p.
Proof. Being L^p bounded, the MG X_n is L^1 bounded and Doob's martin-
gale convergence theorem applies here, so X_n → X_∞ a.s. for some integrable R.V. X_∞.
Further, applying Fatou's lemma for |X_n|^p ≥ 0 we have that
liminf_{n→∞} E(|X_n|^p) ≥ E(liminf_{n→∞} |X_n|^p) = E|X_∞|^p
as claimed, with X_∞ ∈ L^p. It thus suffices to verify that E|X_n − X_∞|^p → 0 as
n → ∞. To this end, with c = sup_n E|X_n|^p finite we have by the L^p maximal
inequality of (5.2.5) that EZ_n ≤ q^p c for Z_n = max_{k≤n} |X_k|^p and any finite n. Since
0 ≤ Z_n ↑ Z = sup_{k<∞} |X_k|^p, we have by monotone convergence that EZ ≤ q^p c is
finite. As X_∞ is the a.s. limit of X_n it follows that |X_∞|^p ≤ Z as well. Hence,
Y_n = |X_n − X_∞|^p ≤ (|X_n| + |X_∞|)^p ≤ 2^p Z. With Y_n → 0 a.s. and Y_n ≤ 2^p Z for
integrable Z, we deduce by dominated convergence that EY_n → 0 as n → ∞, thus
completing the proof of the proposition. □
Remark. Proposition 5.3.21 does not have an L^1 analog. Indeed, as we have
seen already in Exercise 5.2.14, there exists a non-negative MG M_n such that
EM_n = 1 for all n and M_n → M_∞ = 0 almost surely, so obviously, M_n does not
converge to M_∞ in L^1.
Example 5.3.22. Consider the martingale S_n = Σ_{k=1}^n ξ_k for independent, square-
integrable, zero-mean random variables ξ_k such that Σ_k Eξ_k^2 < ∞. Since ES_n^2 =
Σ_{k=1}^n Eξ_k^2, it follows from Proposition 5.3.21 that the random series S_n(ω) →
S_∞(ω) almost surely and in L^2 (see also Theorem 2.3.16 for a direct proof of this
result, based on Kolmogorov's maximal inequality).
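A quick simulation sketch of this example (with increments ξ_k = ε_k/k for fair random signs ε_k, our choice, so Σ_k Eξ_k² = π²/6 < ∞) shows the partial sums settling to a finite limit.

```python
import random

# A simulation sketch of Example 5.3.22 with xi_k = eps_k / k for fair random
# signs eps_k (our choice), so that sum_k E[xi_k^2] = pi^2/6 < oo.  The
# partial sums S_n then settle down to a finite limit.
random.seed(0)
S, snapshot = 0.0, {}
for k in range(1, 50001):
    S += random.choice((-1.0, 1.0)) / k
    if k in (1000, 10000, 50000):
        snapshot[k] = S

fluct = abs(snapshot[50000] - snapshot[1000])
print(snapshot, fluct)   # the late partial sums differ only slightly
```

The residual fluctuation past time n has variance Σ_{k>n} 1/k² ≈ 1/n, which is why the snapshots barely move.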
Exercise 5.3.23. Suppose Z_n = n^{−1/2} Σ_{k=1}^n ξ_k for i.i.d. ξ_k ∈ L^2(Ω, F, P) of zero
mean and unit variance. Let T_n = σ(ξ_k, k ≤ n) and T_∞ = σ(ξ_k, k < ∞).
(a) Prove that EWZ_n → 0 for any fixed W ∈ L^2(Ω, T_∞, P).
(b) Deduce that the same applies for any W ∈ L^2(Ω, F, P) and conclude that
Z_n does not converge in L^2.
(c) Show that though Z_n → G in distribution, for G a standard normal variable, there exists no
Z_∞ ∈ mF such that Z_n → Z_∞ in probability.
We conclude this sub-section with the application of martingales to the study of
Pólya's urn scheme.
Example 5.3.24 (Pólya's urn). Consider an urn that initially contains r red and
b blue marbles. At the k-th step a marble is drawn at random from the urn, with all
possible choices being equally likely, and it and c_k more marbles of the same color are
then returned to the urn. With N_n = r + b + Σ_{k=1}^n c_k counting the number of marbles
in the urn after n iterations of this procedure, let R_n denote the number of red
marbles at that time and M_n = R_n/N_n the corresponding fraction of red marbles.
Since R_{n+1} ∈ {R_n, R_n + c_{n+1}} with P(R_{n+1} = R_n + c_{n+1} | T_n^M) = R_n/N_n = M_n, it follows
that E[R_{n+1} | T_n^M] = R_n + c_{n+1} M_n = N_{n+1} M_n. Consequently, E[M_{n+1} | T_n^M] = M_n
for all n, with M_n a uniformly bounded martingale.
For the study of Pólya's urn scheme we need the following definition.
Definition 5.3.25. The beta density with parameters α > 0 and β > 0 is
f_{α,β}(u) = [Γ(α + β)/(Γ(α)Γ(β))] u^{α−1} (1 − u)^{β−1} 1_{u∈[0,1]} ,
where Γ(α) = ∫_0^∞ s^{α−1} e^{−s} ds is finite and positive (compare with Definition 1.4.45).
In particular, α = β = 1 corresponds to the density f_U(u) of the uniform measure
on (0, 1], as in Example 1.2.40.
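As a sanity check of this definition, the following sketch evaluates f_{α,β} via the Gamma function, confirms numerically that it integrates to one, and that α = β = 1 gives the uniform density (all parameter values here are our own choices).

```python
import math

# A numerical sanity check of Definition 5.3.25: f_{alpha,beta} integrates
# to one over [0, 1], and alpha = beta = 1 recovers the uniform density.
def f_beta(u, a, b):
    c = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return c * u ** (a - 1) * (1.0 - u) ** (b - 1)

def total_mass(a, b, m=20000):         # trapezoid rule on [0, 1], a, b >= 1
    h = 1.0 / m
    s = 0.5 * (f_beta(0.0, a, b) + f_beta(1.0, a, b))
    s += sum(f_beta(i * h, a, b) for i in range(1, m))
    return s * h

print(total_mass(2, 3), f_beta(0.37, 1, 1))   # ~1.0 and exactly 1.0
```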
Exercise 5.3.26. Let M_n be the martingale of Example 5.3.24.
(a) Show that M_n → M_∞ a.s. and in L^p for any p > 1.
(b) Assuming further that c_k = c for all k ≥ 1, show that for ℓ = 0, …, n,
P(R_n = r + ℓc) = C(n, ℓ) Π_{i=0}^{ℓ−1} (r + ic) Π_{j=0}^{n−ℓ−1} (b + jc) / Π_{k=0}^{n−1} (r + b + kc) ,
where C(n, ℓ) = n!/(ℓ!(n − ℓ)!), and deduce that M_∞ has the beta density with parameters α = r/c and
β = b/c (in particular, M_∞ has the law of U(0, 1] when r = b = c_k > 0).
(c) For r = b = c_k > 0 show that P(sup_{k≥1} M_k > 3/4) ≤ 2/3.
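The formula in part (b) can be checked exactly in rational arithmetic. The sketch below (with illustrative values of n, r, b, c of our own choosing) confirms that it defines a probability distribution whose mean of M_n equals M_0 = r/(r + b), as the martingale property demands, and that r = b = c gives the uniform law on the n + 1 atoms.

```python
from fractions import Fraction
from math import comb

# Exact check of the law in part (b) (values n, r, b, c are our choice):
# P(R_n = r + l*c) = C(n,l) prod_{i<l}(r+ic) prod_{j<n-l}(b+jc) / prod_{k<n}(r+b+kc).
def polya_law(n, r, b, c):
    def prod(vals):
        out = Fraction(1)
        for v in vals:
            out *= v
        return out
    den = prod(r + b + k * c for k in range(n))
    return {l: comb(n, l) * prod(r + i * c for i in range(l))
                          * prod(b + j * c for j in range(n - l)) / den
            for l in range(n + 1)}

n, r, b, c = 12, 3, 2, 5
law = polya_law(n, r, b, c)
assert sum(law.values()) == 1                         # a probability law
mean_M = sum(p * Fraction(r + l * c, r + b + n * c) for l, p in law.items())
assert mean_M == Fraction(r, r + b)                   # E[M_n] = M_0 for a MG
assert len(set(polya_law(7, 1, 1, 1).values())) == 1  # uniform when r = b = c
print(mean_M)
```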
Exercise 5.3.27 (Bernard Friedman's urn). Consider the following variant of
Pólya's urn scheme, where after the k-th step one returns to the urn, in addition to
the marble drawn and the c_k marbles of its color, also d_k ≥ 1 marbles of the opposite
color. Show that if c_k, d_k are uniformly bounded and r + b > 0, then M_n → 1/2 a.s.
Hint: With X_n = (M_n − 1/2)^2 check that E[X_n | T_{n−1}^M] ≤ (1 − a_n)X_{n−1} + u_n, where
the non-negative constants a_n and u_n are such that Σ_k u_k < ∞ and Σ_k a_k = ∞.
Exercise 5.3.28. Fixing b_n ∈ [ε, 1] for some ε > 0, suppose X_n are [0, 1]-valued,
T_n-adapted such that X_{n+1} = (1 − b_n)X_n + b_n B_n, n ≥ 0, and P(B_n = 1 | T_n) =
1 − P(B_n = 0 | T_n) = X_n. Show that X_n → X_∞ ∈ {0, 1} a.s. and P(X_∞ = 1 | T_0) = X_0.
5.3.2. Square-integrable martingales. If (X_n, T_n) is a square-integrable
martingale then (X_n^2, T_n) is a sub-MG, so by Doob's decomposition X_n^2 = M_n + A_n
for a non-decreasing T_n-predictable sequence A_n and a MG (M_n, T_n) with M_0 =
0. In the course of proving Doob's decomposition we saw that A_n − A_{n−1} =
E[X_n^2 − X_{n−1}^2 | T_{n−1}], and part (b) of Exercise 5.1.8 provides the alternative expression
A_n − A_{n−1} = E[(X_n − X_{n−1})^2 | T_{n−1}], motivating the following definition.
Definition 5.3.29. The sequence A_n = X_0^2 + Σ_{k=1}^n E[(X_k − X_{k−1})^2 | T_{k−1}] is called
the predictable compensator of an L^2-martingale (X_n, T_n) and is denoted by angle-
brackets, that is, A_n = ⟨X⟩_n.
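As a minimal illustration (our own example, not from the text): for a walk with independent, symmetric ±s_k increments the predictable compensator is the non-random ⟨S⟩_n = Σ_{k≤n} s_k², and the identity EX_n² = E⟨X⟩_n can be checked by enumerating all sign patterns.

```python
from itertools import product

# A minimal sketch (our own example) of Definition 5.3.29: for the martingale
# S_n = sum_{k<=n} xi_k with independent fair-signed increments xi_k = +-s_k,
# E[(S_k - S_{k-1})^2 | T_{k-1}] = s_k^2, so <S>_n = sum_{k<=n} s_k^2 is
# non-random, and E[S_n^2] = E<S>_n is verified by enumerating sign patterns.
s = [1.0, 0.5, 2.0, 0.25]
compensator = sum(v * v for v in s)          # <S>_n
ES2 = 0.0
for signs in product((-1, 1), repeat=len(s)):
    Sn = sum(e * v for e, v in zip(signs, s))
    ES2 += Sn * Sn / 2 ** len(s)

print(ES2, compensator)   # equal: E[S_n^2] = E<S>_n
```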
With EM_n = EM_0 = 0 it follows that EX_n^2 = E⟨X⟩_n, so X_n is L^2-bounded if
and only if sup_n E⟨X⟩_n is finite. Further, n ↦ ⟨X⟩_n is non-decreasing, so it converges
to a limit, denoted hereafter ⟨X⟩_∞. With ⟨X⟩_n ≥ ⟨X⟩_0 = X_0^2 integrable, it further
follows by monotone convergence that E⟨X⟩_n ↑ E⟨X⟩_∞, so X_n is L^2-bounded
if and only if ⟨X⟩_∞ is integrable, in which case X_n converges a.s. and in L^2 (see
Doob's L^2 convergence theorem). As we show in the sequel, much more can be said
about the relation between the convergence of X_n(ω) and the random variable ⟨X⟩_∞.
To simplify the notations, assume hereafter that X_0 = 0 = ⟨X⟩_0, so ⟨X⟩_n ≥ 0 (the
transformation of our results to the general case is trivial).
We start with the following explicit bounds on E[sup_n |X_n|^p] for p ≤ 2, from which
we deduce that X_n is U.I. (hence converges a.s. and in L^1) whenever ⟨X⟩_∞^{1/2} is
integrable.
Proposition 5.3.30. There exist finite constants c_q, q ∈ (0, 1], such that if (X_n, T_n)
is an L^2-MG with X_0 = 0, then
E[sup_k |X_k|^{2q}] ≤ c_q E[⟨X⟩_∞^q] .
Remark. Our proof gives c_q = (2 − q)/(1 − q) for q < 1 and c_1 = 4.
Proof. Let V_n = max_{k≤n} |X_k|^2, noting that V_n ↑ V_∞ = sup_k |X_k|^2 as
n → ∞. As already explained, EX_n^2 ↑ E⟨X⟩_∞ for n → ∞. Thus, applying the
bound (5.2.5) of Corollary 5.2.13 for p = 2 we find that
E[V_n] ≤ 4EX_n^2 ≤ 4E⟨X⟩_∞ ,
and considering n → ∞ we get our thesis for q = 1 (by monotone convergence).
Turning to the case of 0 < q < 1, note that (V_∞)^q = sup_k |X_k|^{2q}. Further, the
T_n-predictable part in Doob's decomposition of the non-negative sub-martingale
Z_n = X_n^2 is A_n = ⟨X⟩_n. Hence, applying Lemma 5.2.7 with p = q
yields the stated bound. □
Here is an application of Proposition 5.3.30 to the study of a certain class of
random walks.
Exercise 5.3.31. Let S_n = Σ_{k=1}^n ξ_k for i.i.d. ξ_k of zero mean and finite second
moment. Suppose τ is an T_n^ξ-stopping time such that E[√τ] is finite.
(a) Compute the predictable compensator of the L^2-martingale (S_n, T_n^ξ).
(b) Deduce that S_{τ∧n} is U.I. and that ES_τ = 0.
We deduce from Proposition 5.3.30 that X_n(ω) converges a.s. to a finite limit
when ⟨X⟩_∞^{1/2} is integrable. A considerable refinement of this conclusion is offered
by our next result, relating such convergence to ⟨X⟩_∞ being finite at ω!
Theorem 5.3.32. Suppose (X_n, T_n) is an L^2 martingale with X_0 = 0.
(a) X_n(ω) converges to a finite limit for a.e. ω for which ⟨X⟩_∞(ω) is finite.
(b) X_n(ω)/⟨X⟩_n(ω) → 0 for a.e. ω for which ⟨X⟩_∞(ω) is infinite.
(c) If the martingale differences X_n − X_{n−1} are uniformly bounded then the
converse of part (a) holds. That is, ⟨X⟩_∞(ω) is finite for a.e. ω for
which X_n(ω) converges to a finite limit.
Proof. (a) Recall that for any n and any T_n-stopping time θ we have the identity
X_{θ∧n}^2 = M_{θ∧n} + ⟨X⟩_{θ∧n} with EM_{θ∧n} = 0, resulting with sup_n E[X_{θ∧n}^2] = E⟨X⟩_θ.
For example, see the proof of Proposition 5.3.30, where we also noted that θ_v =
min{n ≥ 0 : ⟨X⟩_{n+1} > v} are T_n-stopping times such that ⟨X⟩_{θ_v} ≤ v. Thus, setting
Y_n = X_{θ_k∧n} for a positive integer k, the martingale (Y_n, T_n) is L^2-bounded and as
such it almost surely has a finite limit. Further, we saw there that if ⟨X⟩_∞(ω) is
finite then θ_k(ω) = ∞ for some random positive integer k = k(ω), in which case
X_{θ_k∧n} = X_n for all n. Since we consider only countably many values of k, this
yields the thesis of part (a) of the theorem.
(b). Since V_n = (1 + ⟨X⟩_n)^{−1} is an T_n-predictable sequence of bounded variables,
its martingale transform Y_n = Σ_{k=1}^n V_k(X_k − X_{k−1}) with respect to the square-
integrable martingale X_n is also a square-integrable martingale for the filtration
T_n (c.f. Theorem 5.1.28). Further, since V_k ∈ mT_{k−1}, it follows that for all k ≥ 1,
⟨Y⟩_k − ⟨Y⟩_{k−1} = E[(Y_k − Y_{k−1})^2 | T_{k−1}] = V_k^2 E[(X_k − X_{k−1})^2 | T_{k−1}]
= (⟨X⟩_k − ⟨X⟩_{k−1})/(1 + ⟨X⟩_k)^2 ≤ (1 + ⟨X⟩_{k−1})^{−1} − (1 + ⟨X⟩_k)^{−1}
(as (x − y)/(1 + x)^2 ≤ (1 + y)^{−1} − (1 + x)^{−1} for all x ≥ y ≥ 0 and ⟨X⟩_k ≥ 0 is
non-decreasing in k). With Y_0 = ⟨Y⟩_0 = 0, adding the preceding inequalities
over k = 1, …, n, we deduce that ⟨Y⟩_n ≤ 1 − 1/(1 + ⟨X⟩_n) ≤ 1 for all n. Thus,
by part (a) of the theorem, for almost every ω, Y_n(ω) has a finite limit. That is,
for a.e. ω the series Σ_n x_n/b_n converges, where b_n = 1 + ⟨X⟩_n(ω) is a positive,
non-decreasing sequence and X_n(ω) = Σ_{k=1}^n x_k for all n. If in addition to the
convergence of this series also ⟨X⟩_∞(ω) = ∞, then b_n ↑ ∞ and by Kronecker's
lemma X_n(ω)/b_n → 0. In this case b_n/(b_n − 1) → 1, so we conclude that then also
X_n/(b_n − 1) → 0, which is exactly the thesis of part (b) of the theorem.
(c). Suppose that P(⟨X⟩_∞ = ∞, sup_n |X_n| < ∞) > 0. Then, there exists some finite r
such that P(⟨X⟩_∞ = ∞, θ_r = ∞) > 0 for the T_n-stopping time θ_r = inf{m ≥ 0 :
|X_m| > r}. Since sup_m |X_m − X_{m−1}| ≤ c for some non-random finite constant c, we
have that |X_{n∧θ_r}| ≤ r + c, from which we deduce that E⟨X⟩_{n∧θ_r} = EX_{n∧θ_r}^2 ≤ (r + c)^2
for all n. With 0 ≤ ⟨X⟩_{n∧θ_r} ↑ ⟨X⟩_{θ_r}, by monotone convergence also
E[⟨X⟩_∞ I_{θ_r=∞}] ≤ E[⟨X⟩_{θ_r}] ≤ (r + c)^2 .
This contradicts our assumption that P(⟨X⟩_∞ = ∞, θ_r = ∞) > 0. In conclusion,
necessarily, P(⟨X⟩_∞ = ∞, sup_n |X_n| < ∞) = 0. Consequently, with sup_n |X_n|
finite on the set of ω values for which X_n(ω) converges to a finite limit, it follows
that ⟨X⟩_∞(ω) is finite for a.e. such ω. □
We next prove Levy's extension of both Borel-Cantelli lemmas (which is a neat
application of the preceding theorem).
Proposition 5.3.33 (Borel-Cantelli III). Consider events A_n ∈ T_n for some
filtration T_n. Let S_n = Σ_{k=1}^n I_{A_k} count the number of events occurring among
the first n, with S_∞ = Σ_k I_{A_k} the corresponding total number of occurrences. Sim-
ilarly, let Z_n = Σ_{k=1}^n p_k denote the sum of the first n conditional probabilities
p_k = P(A_k | T_{k−1}), and Z_∞ = Σ_k p_k. Then, for almost every ω,
(a) If Z_∞(ω) is finite, then so is S_∞(ω).
(b) If Z_∞(ω) is infinite, then S_n(ω)/Z_n(ω) → 1.
Remark. Given any sequence of events, by the tower property E p_k = P(A_k) for
all k, and setting T_n = σ(A_k, k ≤ n) guarantees that A_k ∈ T_k for all k. Hence,
(a) If EZ_∞ = Σ_k P(A_k) is finite, then from part (a) of Proposition 5.3.33 we
deduce that Σ_k I_{A_k} is finite a.s., thus recovering the first Borel-Cantelli lemma.
(b) For T_n = σ(A_k, k ≤ n) and mutually independent events A_k we have that
p_k = P(A_k) and Z_n = ES_n for all n. Thus, in this case, part (b) of Proposition
5.3.33 is merely the statement that S_n/ES_n → 1 a.s. when Σ_k P(A_k) = ∞, which is
your extension of the second Borel-Cantelli lemma via Exercise 2.2.26.
Proof. Clearly, M_n = S_n − Z_n is square-integrable and T_n-adapted. Further,
as M_n − M_{n−1} = I_{A_n} − E[I_{A_n} | T_{n−1}] and Var(I_{A_n} | T_{n−1}) = p_n(1 − p_n), it follows that
the predictable compensator of the L^2 martingale (M_n, T_n) is ⟨M⟩_n = Σ_{k=1}^n p_k(1 −
p_k). Hence, ⟨M⟩_n ≤ Z_n for all n, and if Z_∞(ω) is finite, then so is ⟨M⟩_∞(ω). By
part (a) of Theorem 5.3.32, for a.e. such ω the finite limit M_∞(ω) of M_n(ω) exists,
implying that S_∞ = M_∞ + Z_∞ is finite as well.
With S_n = M_n + Z_n, it suffices for part (b) of the proposition to show that
M_n/Z_n → 0 for a.e. ω for which Z_∞(ω) = ∞. To this end, note first that by
the preceding argument, the finite limit M_∞(ω) exists also for a.e. ω for which
Z_∞(ω) = ∞ while ⟨M⟩_∞(ω) is finite. For such ω we have that M_n/Z_n → 0 (since
M_n(ω) is a bounded sequence while Z_n(ω) is unbounded). Finally, from part (b)
of Theorem 5.3.32 we know that M_n/⟨M⟩_n ≥ M_n/Z_n converges to zero for a.e. ω
for which ⟨M⟩_∞(ω) is infinite. □
The following extension of Kolmogorov's three series theorem uses both Theorem
5.3.32 and Levy's extension of the Borel-Cantelli lemmas.
Exercise 5.3.34. Suppose X_n is adapted to the filtration T_n and, for any n, the
R.C.P.D. of X_n given T_{n−1} equals the R.C.P.D. of −X_n given T_{n−1}. For non-
random c > 0 let X_n^{(c)} = X_n I_{|X_n|≤c} be the corresponding truncated variables.
(a) Verify that (Z_n, T_n) is a MG, where Z_n = Σ_{k=1}^n X_k^{(c)}.
(b) Considering the series
(5.3.3)    Σ_n P(|X_n| > c | T_{n−1})  and  Σ_n Var(X_n^{(c)} | T_{n−1}) ,
show that for a.e. ω the series Σ_n X_n(ω) has a finite limit if and only if
both series in (5.3.3) converge.
(c) Provide an example where the convergence in part (b) occurs with proba-
bility 0 < p < 1.
We now consider sufficient conditions for the almost sure convergence of the mar-
tingale transform.
Exercise 5.3.35. Suppose Y_n = Σ_{k=1}^n V_k(Z_k − Z_{k−1}) is the martingale transform
of the T_n-predictable V_n with respect to the martingale (Z_n, T_n), per Definition
5.1.27.
(a) Show that if Z_n is L^2-bounded and V_n is uniformly bounded then
Y_n → Y_∞ a.s., with Y_∞ finite.
(b) Deduce that for an L^2-bounded MG Z_n the sequence Y_n(ω) converges to a
finite limit for a.e. ω for which sup_{k≥1} |V_k(ω)| is finite.
(c) Suppose now that V_k is predictable for the canonical filtration T_n of
the i.i.d. ξ_k. Show that if ξ_k has the same law as −ξ_k and u ↦ uP(|ξ_1| ≥ u) is bounded
above, then the series Σ_n V_n ξ_n has a finite limit for a.e. ω for which
Σ_{k≥1} |V_k(ω)| is finite.
Hint: Consider Exercise 5.3.34 for the adapted sequence X_k = V_k ξ_k.
Here is another application of Levy's extension of the Borel-Cantelli lemmas.
Exercise 5.3.36. Suppose X_n = 1 + Σ_{k=1}^n D_k, n ≥ 0, where the {−1, 1}-valued
D_k is T_k-adapted and such that E[D_k | T_{k−1}] ≥ δ for some non-random 1 > δ > 0
and all k ≥ 1.
(a) Show that (X_n, T_n) is a sub-martingale and provide its Doob decomposi-
tion.
(b) Using this decomposition and Levy's extension of the Borel-Cantelli lem-
mas, show that X_n → ∞ almost surely.
(c) Let Z_n = γ^{X_n} for γ = (1 − δ)/(1 + δ). Show that (Z_n, T_n) is a super-
martingale and deduce that P(inf_n X_n ≤ 0) ≤ γ.
As we show next, the predictable compensator controls the exponential tails of
martingales of bounded differences.
Exercise 5.3.37. Fix λ > 0 non-random and an L^2 martingale (M_n, T_n) with
M_0 = 0 and bounded differences sup_k |M_k − M_{k−1}| ≤ 1.
(a) Show that N_n = exp(λM_n − (e^λ − λ − 1)⟨M⟩_n) is a sup-MG for T_n.
Hint: Recall part (a) of Exercise 1.4.40.
(b) Show that for any a.s. finite T_n-stopping time τ and constants u, r > 0,
P(M_τ ≥ u, ⟨M⟩_τ ≤ r) ≤ exp(−λu + r(e^λ − λ − 1)) .
(c) Show that if the martingale S_n of Example 5.3.22 has uniformly bounded
differences |ξ_k| ≤ 1, then E exp(λS_∞) is finite for S_∞ = Σ_k ξ_k and any
λ ∈ R.
Applying part (c) of the preceding exercise, you are next to derive the following tail
estimate, due to Dvoretzky, in the context of Levy's extension of the Borel-Cantelli
lemmas.
Exercise 5.3.38. Suppose A_k ∈ T_k for some filtration T_k. Let S_n = Σ_{k=1}^n I_{A_k}
and Z_n = Σ_{k=1}^n P(A_k | T_{k−1}). Show that P(S_n ≥ r + u, Z_n ≤ r) ≤ e^u (r/(r + u))^{r+u}
for all n and u, r > 0, then deduce that for any 0 < r < 1,
P(∪_{k=1}^n A_k) ≤ er + P(Σ_{k=1}^n P(A_k | T_{k−1}) > r) .
Hint: Recall from the proof of Borel-Cantelli III that the L^2-martingale M_n = S_n − Z_n
has differences bounded by one and ⟨M⟩_n ≤ Z_n.
We conclude this section with a refinement of the well known Azuma-Hoeffding
concentration inequality for martingales of bounded differences, from which we
deduce the strong law of large numbers for such martingales.
Exercise 5.3.39. Suppose (M_n, T_n) is a martingale with M_0 = 0 and differences
D_k = M_k − M_{k−1}, k ≥ 1, such that for some finite γ_k,
E[D_k^2 e^{λD_k} | T_{k−1}] ≤ γ_k^2 E[e^{λD_k} | T_{k−1}] < ∞ .
(a) Show that N_n = exp(λM_n − λ^2 r_n/2) is a sup-MG for T_n provided
λ ∈ [0, 1] and r_n = Σ_{k=1}^n γ_k^2.
Hint: Recall part (b) of Exercise 1.4.40.
(b) Deduce that for I(x) = (x ∧ 1)(2x − x ∧ 1) and any u ≥ 0,
P(M_n ≥ u) ≤ exp(−r_n I(u/r_n)/2) .
(c) Conclude that b_n^{−1} M_n → 0 a.s. for any martingale M_n of uniformly bounded
differences and non-random b_n such that b_n/√(n log n) → ∞.
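The bound of part (b) can be tested against an exact computation. For the symmetric ±1 random walk the differences satisfy D_k² = 1, so one may take γ_k = 1 and r_n = n (a sketch under these assumptions; the binomial tail is computed exactly).

```python
from math import comb, exp

# A check of part (b) in an exactly computable case (our setup): for the
# symmetric +-1 random walk M_n the differences obey D_k^2 = 1, so gamma_k = 1
# and r_n = n.  The exact binomial tail P(M_n >= u) must lie below
# exp(-n I(u/n)/2) with I(x) = min(x,1) * (2x - min(x,1)).
def tail(n, u):                      # P(M_n >= u) with M_n = 2 Bin(n,1/2) - n
    return sum(comb(n, k) for k in range(n + 1) if 2 * k - n >= u) / 2.0 ** n

def bound(n, u):
    x = u / n
    m = min(x, 1.0)
    return exp(-n * m * (2 * x - m) / 2)

n = 20
for u in range(0, n + 1):
    assert tail(n, u) <= bound(n, u)
print(tail(20, 8), bound(20, 8))     # roughly 0.058 <= 0.202
```

For u ≤ n the bound reduces to the familiar Azuma-Hoeffding exp(−u²/(2n)); for u > n it switches to the linear-rate regime.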
5.4. The optional stopping theorem
This section is about the use of martingales in computations involving stopping
times. The key tool for doing so is the following theorem.
Theorem 5.4.1 (Doob's optional stopping). Suppose θ ≤ τ are T_n-stopping
times and X_n = Y_n + V_n for sub-MGs (V_n, T_n), (Y_n, T_n) such that V_n is non-positive
and Y_{τ∧n} is uniformly integrable. Then, the R.V. X_τ and X_θ are integrable and
EX_τ ≥ EX_θ ≥ EX_0 (where X_τ(ω) and X_θ(ω) are set as limsup_n X_n(ω) in case
the corresponding stopping time is infinite).
Remark 5.4.2. Doob's optional stopping theorem holds for any sub-MG (X_n, T_n)
such that X_{τ∧n} is uniformly integrable (just set V_n = 0). Alternatively, it holds
also whenever E[X_∞ | T_n] ≥ X_n for some integrable X_∞ and all n (for then the
martingale Y_n = E[X_∞ | T_n] is U.I. by Corollary 5.3.14, hence Y_{τ∧n} is also U.I. by
Proposition 5.4.4, and the sub-MG V_n = X_n − Y_n is by assumption non-positive).
5.4. THE OPTIONAL STOPPING THEOREM 205
By far the most common application has (X_n, T_n) a martingale, in which case
it yields that EX_0 = EX_τ for any T_n-stopping time τ such that X_{τ∧n} is U.I.
(for example, whenever τ is bounded, or under the more general conditions of
Proposition 5.4.4).
Proof. By linearity of the expectation, it suffices to prove the claim sepa-
rately for Y_n = 0 and for V_n = 0. Dealing first with Y_n = 0, i.e. with a non-positive
sub-MG (V_n, T_n), note that (−V_n, T_n) is then a non-negative sup-MG. Thus, the
inequality E[V_τ] ≥ E[V_θ] ≥ E[V_0] and the integrability of V_τ and V_θ are immediate
consequences of Proposition 5.3.8.
Considering hereafter the sub-MG (Y_n, T_n) such that Y_{τ∧n} is U.I., since θ ≤ τ
are T_n-stopping times it follows by Theorem 5.1.32 that U_n = Y_{τ∧n}, Z_n = Y_{θ∧n} and
U_n − Z_n are all sub-MGs with respect to T_n. In particular, EU_n ≥ EZ_n ≥ EZ_0 for
all n. Our assumption that the sub-MG (U_n, T_n) is U.I. results with U_n → U_∞ a.s.
and in L^1 (see Theorem 5.3.12). Further, as we show in part (c) of Proposition 5.4.4,
in this case U_{θ∧n} = Z_n is U.I., so by the same reasoning, Z_n → Z_∞ a.s. and in L^1.
We thus deduce that EU_∞ ≥ EZ_∞ ≥ EZ_0. By definition, U_∞ = lim_n Y_{τ∧n} = Y_τ
and Z_∞ = lim_n Y_{θ∧n} = Y_θ. Consequently, EY_τ ≥ EY_θ ≥ EY_0, as claimed. □
We complement Theorem 5.4.1 by first strengthening its conclusion and then pro-
viding explicit sufficient conditions for the uniform integrability of Y_{τ∧n}.
Lemma 5.4.3. Suppose X_n is adapted to the filtration T_n and the T_n-stopping
time τ is such that for any T_n-stopping time θ ≤ τ the R.V. X_θ is integrable and
E[X_τ] ≥ E[X_θ]. Then, also E[X_τ | T_θ] ≥ X_θ a.s.
Proof. Fixing A ∈ T_θ, set η = θ I_A + τ I_{A^c}. Note that η is also an
T_n-stopping time, since for any n,
{η ≤ n} = (A ∩ {θ ≤ n}) ∪ (A^c ∩ {τ ≤ n})
= (A ∩ {θ ≤ n}) ∪ ((A^c ∩ {θ ≤ n}) ∩ {τ ≤ n}) ∈ T_n
because both A and A^c are in T_θ and {τ ≤ n} ∈ T_n (c.f. Definition 5.1.34 of
the σ-algebra T_θ). By assumption, X_τ, X_θ, X_η are integrable and EX_τ ≥ EX_η.
Since X_η = X_θ I_A + X_τ I_{A^c}, subtracting the finite E[X_τ I_{A^c}] from both sides of this
inequality results with E[X_τ I_A] ≥ E[X_θ I_A]. This holds for all A ∈ T_θ, and with
E[X_τ I_A] = E[Z I_A] for Z = E[X_τ | T_θ] (by the definition of the conditional expectation),
we see that E[(Z − X_θ)I_A] ≥ 0 for all A ∈ T_θ. Since both Z and X_θ are measurable
on T_θ (see part (b) of Exercise 5.1.35), it thus follows that a.s. Z ≥ X_θ, as
claimed. □
Proposition 5.4.4. Suppose Y_n is integrable and τ is a stopping time for a
filtration T_n. Then, Y_{τ∧n} is uniformly integrable if any one of the following
conditions holds.
(a) Eτ < ∞ and a.s. E[|Y_n − Y_{n−1}| | T_{n−1}] ≤ c for some finite, non-random
c.
(b) Y_n I_{τ>n} is uniformly integrable and Y_τ I_{τ<∞} is integrable.
(c) (Y_n, T_n) is a uniformly integrable sub-MG (or sup-MG).
Proof. (a) Clearly, |Y_{τ∧n}| ≤ Z_n, where
Z_n = |Y_0| + Σ_{k=1}^n |Y_{τ∧k} − Y_{τ∧(k−1)}| = |Y_0| + Σ_{k=1}^n |Y_k − Y_{k−1}| I_{τ≥k} ,
is non-decreasing in n. Hence, sup_n |Y_{τ∧n}| ≤ Z_∞, implying that Y_{τ∧n} is U.I.
whenever EZ_∞ is finite (c.f. Lemma 1.3.48). Proceeding to show that this is the
case under condition (a), recall that I_{τ≥k} ∈ mT_{k−1} for all k (since τ is an T_n-
stopping time). Thus, taking out what is known, by the tower property we find
that under condition (a),
E[|Y_k − Y_{k−1}| I_{τ≥k}] = E[E(|Y_k − Y_{k−1}| | T_{k−1}) I_{τ≥k}] ≤ c P(τ ≥ k)
for all k ≥ 1. Summing this bound over k = 1, 2, … results with
EZ_∞ ≤ E|Y_0| + c Σ_{k=1}^∞ P(τ ≥ k) = E|Y_0| + c Eτ ,
with the integrability of Z_∞ being a consequence of the hypothesis in condition (a)
that τ is integrable.
(b) Next note that |X_{τ∧n}| ≤ |X_τ| I_{τ<∞} + |X_n| I_{τ>n} for every n, any sequence of
random variables X_n and any τ ∈ {0, 1, 2, …, ∞}. Condition (b) states that the
sequence |Y_n| I_{τ>n} is U.I. and that the variable |Y_τ| I_{τ<∞} is integrable. Thus,
taking the expectation of the preceding inequality in case X_n = Y_n I_{|Y_n|>M}, we find
that when condition (b) holds,
sup_n E[|Y_{τ∧n}| I_{|Y_{τ∧n}|>M}] ≤ E[|Y_τ| I_{|Y_τ|>M} I_{τ<∞}] + sup_n E[|Y_n| I_{|Y_n|>M} I_{τ>n}]
converges to zero as M → ∞. That is, |Y_{τ∧n}| is then a U.I. sequence.
(c) The hypothesis of (c) that Y_n is U.I. implies that Y_n I_{τ>n} is also U.I. and
that sup_n E[(Y_n)^+] is finite. With τ an T_n-stopping time and (Y_n, T_n) a sub-MG,
it further follows by Lemma 5.3.7 that Y_τ I_{τ<∞} is integrable. Having arrived at the
hypothesis of part (b), we are done. □
Since Y_{τ∧n} is U.I. whenever τ is bounded, we have the following immediate
consequences of Doob's optional stopping theorem, Remark 5.4.2 and Lemma 5.4.3.
Corollary 5.4.5. For any sub-MG (X_n, T_n) and any non-decreasing sequence τ_k
of T_n-stopping times, (X_{τ_k}, T_{τ_k}, k ≥ 0) is a sub-MG when either sup_k τ_k is a non-
random finite integer, or a.s. X_n ≤ E[X_∞ | T_n] for an integrable X_∞ and all n ≥ 0.
Check that by part (b) of Exercise 5.2.16 and part (c) of Proposition 5.4.4, it follows
from Doob's optional stopping theorem that ES_τ = 0 for any stopping time τ with
respect to the canonical filtration of S_n = Σ_{k=1}^n ξ_k, provided the independent ξ_k
are integrable with Eξ_k = 0 and sup_n E|S_n| < ∞.
Sometimes Doob's optional stopping theorem is applied en-route to a useful con-
tradiction. For example,
Exercise 5.4.6. Show that if X_n is a sub-martingale such that EX_0 ≥ 0 and
inf_n X_n < 0 a.s., then necessarily E[sup_n X_n] = ∞.
Hint: Assuming first that sup_n |X_n| is integrable, apply Doob's optional stopping
theorem to arrive at a contradiction. Then consider the same argument for the
sub-MG Z_n = max{X_n, −1}.
Exercise 5.4.7. Fixing b > 0, let τ_b = min{n ≥ 0 : S_n ≥ b} for the random walk
S_n of Definition 5.1.6, and suppose ξ_n = S_n − S_{n−1} are uniformly bounded, of
zero mean and positive variance.
(a) Show that τ_b is almost surely finite.
Hint: See Proposition 5.3.5.
(b) Show that E[min{S_n : n ≤ τ_b}] = −∞.
Martingales often provide much information about specic stopping times. We
detail below one such example, pertaining to the srw of Denition 5.1.6.
Corollary 5.4.8 (Gambler's Ruin). Fixing positive integers a and b, the probability that a SRW S_n, starting at S_0 = 0, hits −a before first hitting +b is r = (e^{λb} − 1)/(e^{λb} − e^{−λa}) for λ = log[(1 − p)/p] ≠ 0. For the symmetric SRW, i.e. when p = 1/2, this probability is r = b/(a + b).
Remark. The probability r is often called the gambler's ruin, or ruin probability, for a gambler with initial capital of +a, betting on the outcome of independent rounds of the same game, a unit amount per round, gaining or losing an amount equal to his bet in each round and stopping when either all his capital is lost (the ruin event), or his accumulated gains reach the amount +b.
Proof. Consider the stopping time τ_{a,b} = inf{n ≥ 0 : S_n ≥ b or S_n ≤ −a} for the canonical filtration of the SRW. That is, τ_{a,b} is the first time that the SRW exits the interval (−a, b). Since (S_k + k)/2 has the Binomial(k, p) distribution it is not hard to check that sup_ℓ P(S_k = ℓ) → 0, hence P(τ_{a,b} > k) ≤ P(−a < S_k < b) → 0 as k → ∞. Consequently, τ_{a,b} is finite a.s. Further, starting at S_0 ∈ (−a, b) and using only increments ξ_k ∈ {−1, 1}, necessarily S_{τ_{a,b}} ∈ {−a, b} with probability one. Our goal is thus to compute the ruin probability r = P(S_{τ_{a,b}} = −a). To this end, note that E e^{λξ_k} = p e^λ + (1 − p) e^{−λ} = 1 for λ = log[(1 − p)/p]. Thus,

M_n = exp(λ S_n) = Π_{k=1}^n e^{λ ξ_k}

is, for such λ, a non-negative MG with M_0 = 1 (c.f. Example 5.1.10). Clearly, M_{n∧τ_{a,b}} = exp(λ S_{n∧τ_{a,b}}) ≤ exp(|λ| max(a, b)) is uniformly bounded (in n), hence uniformly integrable. So, applying Doob's optional stopping theorem for this MG and stopping time, we have that

1 = E M_0 = E[M_{τ_{a,b}}] = E[e^{λ S_{τ_{a,b}}}] = r e^{−λa} + (1 − r) e^{λb},

which easily yields the stated explicit formula for r in case λ ≠ 0 (i.e. p ≠ 1/2). Finally, recall that S_n is a martingale for the symmetric SRW, with S_{n∧τ_{a,b}} uniformly bounded, hence uniformly integrable. So, applying Doob's optional stopping theorem for this MG, we find that in the symmetric case

0 = E S_0 = E[S_{τ_{a,b}}] = −a r + b(1 − r),

that is, r = b/(a + b) when p = 1/2. □
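As a numerical sanity check (a sketch, not part of the text, with hypothetical helper names), the ruin probability can also be recovered without martingales from the first-step recursion h(x) = p·h(x+1) + (1−p)·h(x−1) on (−a, b), with boundary values h(−a) = 1 and h(b) = 0, and compared with the formula of Corollary 5.4.8:

```python
import math

def ruin_prob_iterative(a, b, p, iters=20000):
    """Solve h(x) = p*h(x+1) + (1-p)*h(x-1) on (-a, b) with
    h(-a) = 1, h(b) = 0 by fixed-point iteration; h(0) is the
    probability that the walk started at 0 hits -a before +b."""
    h = {x: 0.0 for x in range(-a, b + 1)}
    h[-a] = 1.0
    for _ in range(iters):
        for x in range(-a + 1, b):
            h[x] = p * h[x + 1] + (1 - p) * h[x - 1]
    return h[0]

def ruin_prob_formula(a, b, p):
    """Explicit formula of Corollary 5.4.8."""
    if abs(p - 0.5) < 1e-12:
        return b / (a + b)
    lam = math.log((1 - p) / p)
    return (math.exp(lam * b) - 1) / (math.exp(lam * b) - math.exp(-lam * a))
```

One can also verify directly that r solves the optional-stopping identity r·e^{−λa} + (1 − r)·e^{λb} = 1.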
Here is an interesting consequence of the Gambler's ruin formula.
Example 5.4.9. Initially, at step k = 0 zero is the only occupied site in Z. Then, at each step a new particle starts at zero and follows a symmetric SRW, independently of the previous particles, till it lands on an unoccupied site, whereby it stops and thereafter occupies this site. The set of occupied sites after k steps is thus an interval of length k + 1 and we let R_k ∈ {1, . . . , k + 1} count the number of non-negative integers occupied after k steps (starting at R_0 = 1).
Clearly, R_{k+1} ∈ {R_k, R_k + 1} and P(R_{k+1} = R_k | F_k) = R_k/(k + 2) by the preceding Gambler's ruin formula. Thus, R_k follows the evolution of Bernard Friedman's urn with parameters d_k = r = b = 1 and c_k = 0. Consequently, by Exercise 5.3.27 we have that (n + 1)^{−1} R_n → 1/2 a.s.
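A quick illustration of the example's conclusion (a sketch, not from the text; the function name is ours): one can simulate R_k directly from its conditional law P(R_{k+1} = R_k | past) = R_k/(k + 2) and watch the fraction R_n/(n + 1) settle near 1/2.

```python
import random

def simulate_occupied_fraction(n, seed=0):
    """Simulate R_k via its conditional law
    P(R_{k+1} = R_k | past) = R_k/(k+2), starting at R_0 = 1,
    and return R_n/(n+1)."""
    rng = random.Random(seed)
    R = 1
    for k in range(n):
        if rng.random() >= R / (k + 2):
            R += 1          # the new particle occupies a non-negative site
    return R / (n + 1)
```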
You are now to derive Wald's identities about stopping times for the random walk, and use them to gain further information about the stopping times τ_{a,b} of the preceding corollary.
Exercise 5.4.10. Let τ be an integrable stopping time for the canonical filtration of the random walk S_n.
(a) Show that if ξ_1 is integrable, then Wald's identity E S_τ = E ξ_1 E τ holds.
Hint: Use the representation S_τ = Σ_{k≥1} ξ_k I_{τ≥k} and independence.
(b) Show that if in addition ξ_1 is square-integrable, then Wald's second identity E[(S_τ − τ E ξ_1)^2] = Var(ξ_1) E τ holds as well.
Hint: Explain why you may assume that E ξ_1 = 0, prove the identity with τ ∧ n instead of τ and use Doob's L^2 convergence theorem.
(c) Show that if ξ_1 ≥ 0 then Wald's identity applies also when E τ = ∞ (under the convention that ∞ · 0 = 0).
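Wald's first identity can be checked against exact numerics (a sketch under our own naming, not the author's): propagate the law of the walk until it exits (−a, b), accumulate E τ and E S_τ, and compare E S_τ with E ξ_1 · E τ = (2p − 1) E τ.

```python
def wald_check(a, b, p, horizon=4000):
    """Exact E[tau] and E[S_tau] for tau = first exit of (-a, b) by a
    SRW with up-probability p, via propagating the law of the walk
    until the surviving mass is negligible."""
    probs = {0: 1.0}              # law of S_n restricted to (-a, b)
    E_tau = E_S = 0.0
    for n in range(1, horizon + 1):
        nxt = {}
        for x, w in probs.items():
            for y, q in ((x + 1, p), (x - 1, 1 - p)):
                if y <= -a or y >= b:
                    E_tau += n * w * q   # absorbed at time n
                    E_S += y * w * q
                else:
                    nxt[y] = nxt.get(y, 0.0) + w * q
        probs = nxt
    return E_tau, E_S
```

In the symmetric case this also exhibits the answer to Exercise 5.4.11(b), namely E τ_{a,b} = ab.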
Exercise 5.4.11. For the SRW S_n and positive integers a, b consider the stopping time τ_{a,b} = min{n ≥ 0 : S_n ∉ (−a, b)} as in the proof of Corollary 5.4.8.
(a) Check that E[τ_{a,b}] < ∞.
Hint: See Exercise 5.1.15.
(b) Combining Corollary 5.4.8 with Wald's identities, compute the value of E[τ_{a,b}].
(c) Show that τ_{a,b} ↑ τ_b = min{n ≥ 0 : S_n = b} for a ↑ ∞ (where the minimum over the empty set is +∞), and deduce that E τ_b = b/(2p − 1) when p > 1/2.
(d) Show that τ_b is almost surely finite when p ≥ 1/2.
(e) Find constants c_1 and c_2 such that Y_n = S_n^4 − 6n S_n^2 + c_1 n^2 + c_2 n is a martingale for the symmetric SRW, and use it to evaluate E[(τ_{b,b})^2] in this case.
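A hedged numerical probe of part (e) (a sketch, names ours): one candidate choice of constants is c_1 = 3, c_2 = 2 (check it!), and with a = b optional stopping then suggests E τ_{b,b} = b² and E[(τ_{b,b})²] = (5b⁴ − 2b²)/3. Exact propagation of the walk's law lets one test this for small b:

```python
def exit_time_moments(b, horizon=2000):
    """E[tau] and E[tau^2] for tau = tau_{b,b}, the first exit of
    (-b, b) by the symmetric SRW, via exact propagation of the law
    of the walk (the surviving mass decays geometrically)."""
    probs = {0: 1.0}
    m1 = m2 = 0.0
    for n in range(1, horizon + 1):
        nxt = {}
        for x, w in probs.items():
            for y in (x + 1, x - 1):
                if abs(y) >= b:
                    m1 += n * w * 0.5
                    m2 += n * n * w * 0.5
                else:
                    nxt[y] = nxt.get(y, 0.0) + w * 0.5
        probs = nxt
    return m1, m2
```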
We next provide a few applications of Doob's optional stopping theorem, starting with information on the law of τ_b for the SRW (and certain other random walks).
Exercise 5.4.12. Consider the stopping time τ_b = inf{n ≥ 0 : S_n = b} and the martingale M_n = exp(λ S_n) M(λ)^{−n} for a SRW S_n, with b a positive integer and M(λ) = E[e^{λ ξ_1}].
(a) Show that if p = 1 − q ∈ [1/2, 1) then e^{λb} E[M(λ)^{−τ_b}] = 1 for every λ > 0.
(b) Deduce that for p ∈ [1/2, 1) and every 0 < s < 1,

E[s^{τ_1}] = (1 − √(1 − 4pqs^2)) / (2qs),

and E[s^{τ_b}] = (E[s^{τ_1}])^b.
(c) Show that if 0 < p < 1/2 then P(τ_b < ∞) = exp(−λ_* b) for λ_* = log[(1 − p)/p] > 0.
(d) Deduce that for p ∈ (0, 1/2) the variable Z = 1 + max_{k≥0} S_k has a Geometric distribution of success probability 1 − e^{−λ_*}.
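Part (b)'s generating function can be compared with a direct computation (a sketch, hypothetical names): accumulate s^n P(τ_1 = n) by propagating the walk killed upon first reaching +1; the truncation error after nmax steps is O(s^{nmax}).

```python
import math

def gen_func_tau1_numeric(p, s, nmax=400):
    """E[s^{tau_1}] by propagating the law of the SRW killed on first
    reaching +1; contributions beyond nmax are O(s^nmax)."""
    probs = {0: 1.0}
    total = 0.0
    for n in range(1, nmax + 1):
        nxt = {}
        for x, w in probs.items():
            for y, q in ((x + 1, p), (x - 1, 1 - p)):
                if y == 1:
                    total += (s ** n) * w * q
                else:
                    nxt[y] = nxt.get(y, 0.0) + w * q
        probs = nxt
    return total

def gen_func_tau1_formula(p, s):
    """Closed form of Exercise 5.4.12(b)."""
    q = 1 - p
    return (1 - math.sqrt(1 - 4 * p * q * s * s)) / (2 * q * s)
```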
Exercise 5.4.13. Consider τ_b = min{n ≥ 0 : S_n ≥ b} for b > 0, in case the i.i.d. increments ξ_n = S_n − S_{n−1} of the random walk S_n are such that P(ξ_1 > 0) > 0 and ξ_1 conditioned on {ξ_1 > 0} has the Exponential law of parameter α.
(a) Show that for any n finite, conditional on τ_b = n the law of S_{τ_b} − b is also Exponential of parameter α.
Hint: Recall the memory-less property of the exponential distribution.
(b) With M(θ) = E[e^{θ ξ_1}] and θ_* ≥ 0 denoting the maximal solution of M(θ) = 1, verify the existence of a monotone decreasing, continuous function u : (0, 1] → [θ_*, ∞) such that M(u(s)) = 1/s.
(c) Evaluate E[s^{τ_b} I_{τ_b<∞}] for 0 < s < 1, and P(τ_b < ∞), in terms of u(s) and α.
Exercise 5.4.14. A monkey types a random sequence of capital letters ξ_k that are chosen independently of each other, with each ξ_k chosen uniformly from amongst the 26 possible values A, B, . . . , Z.
(a) Suppose that just before each time step n = 1, 2, . . ., a new gambler arrives on the scene and bets $1 that ξ_n = P. If he loses, he leaves, whereas if he wins, he receives $26, all of which he bets on the event {ξ_{n+1} = R}. If he now loses, he leaves, whereas if he wins, he bets his current fortune of $26^2 on the event that ξ_{n+2} = O, and so on, through the word PROBABILITY. Show that the amount of money M_n that the gamblers have collectively earned by time n is a martingale with respect to F^ξ_n.
(b) Let L_n denote the number of occurrences of the word PROBABILITY in the first n letters typed by the monkey and τ = inf{n ≥ 11 : L_n = 1} the first time by which it produced this word. Using Doob's optional stopping theorem show that E τ = a for a = 26^{11}. Does the same apply for the first time τ by which the monkey produces the word ABRACADABRA and if not, what is E τ?
(c) Show that n^{−1} L_n → 1/a a.s., and further that (L_n − n/a)/√(vn) →_D G for some finite, positive constant v.
Hint: Fixing β < 1/2, partition {11, . . . , n} into m = n^β consecutive blocks K_i, each of length either ℓ = n/m − 10 or ℓ + 1, with gaps of length 10 between them. With W_i denoting the number of occurrences of PROBABILITY within K_i, apply Lindeberg's CLT to (L'_n − n'/a)/√(v_n), where L'_n = Σ_{i=1}^m W_i, n' = n − 10m and v_n = Σ_{i=1}^m Var(W_i), then show that n^{−1} v_n converges.
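The gamblers' argument of parts (a)-(b) yields a general recipe: E τ equals the sum of |A|^k over every k for which the word's length-k prefix equals its length-k suffix (k equal to the full length included). A minimal sketch (function name ours):

```python
def expected_wait(word, alphabet_size=26):
    """E[tau] for the first completion of `word` in an i.i.d. uniform
    letter stream: sum alphabet_size**k over each k such that the
    length-k prefix of the word equals its length-k suffix."""
    L = len(word)
    return sum(alphabet_size ** k
               for k in range(1, L + 1) if word[:k] == word[-k:])
```

PROBABILITY has no proper prefix matching a suffix, whereas ABRACADABRA self-overlaps at lengths 1 and 4, which is exactly why its expected waiting time is larger.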
Exercise 5.4.15. Consider a fair game consisting of successive turns whose outcomes are the i.i.d. signs ξ_k ∈ {−1, 1} such that P(ξ_1 = 1) = 1/2, and where upon betting the wagers V_k in each turn, your gain (or loss) after n turns is Y_n = Σ_{k=1}^n ξ_k V_k. Here is a betting system {V_k}, predictable with respect to the canonical filtration F^ξ_n, as in Example 5.1.30, that surely makes a profit in this fair game!
Choose a finite sequence x_1, x_2, . . . , x_ℓ of non-random positive numbers. For each k ≥ 1, wager an amount V_k that equals the sum of the first and last terms in your sequence prior to your k-th turn. Then, to update your sequence: if you just won your bet, delete those two numbers, while if you lost it, append their sum as an extra term x_{ℓ+1} = x_1 + x_ℓ at the right-hand end of the sequence. You play iteratively according to this rule till your sequence is empty (and if your sequence ever consists of one term only, you wager that amount, so upon winning you delete this term, while upon losing you append it to the sequence to obtain two terms).
(a) Let v = Σ_{i=1}^ℓ x_i. Show that the sum of the terms in your sequence after n turns is a martingale S_n = v − Y_n with respect to F^ξ_n. Deduce that with probability one you terminate playing with a profit v at the finite F^ξ_n-stopping time τ = inf{n ≥ 0 : S_n = 0}.
(b) Show that E τ is finite.
Hint: Consider the number of terms N_n in your sequence after n turns.
(c) Show that the expected value of your aggregate maximal loss till termination, namely E L for L = min_{k≤τ} Y_k, is infinite (which is why you are not to attempt this gambling scheme).
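The cancellation scheme of this exercise is easy to simulate (a sketch with our own names). The invariant profit + (sum of the sequence) = v holds at every turn, which is the content of part (a): upon termination the profit is exactly v.

```python
import random

def labouchere(x, p=0.5, seed=1, max_turns=10**6):
    """Play the cancellation system on the list x; returns
    (terminated, profit, worst_drawdown)."""
    rng = random.Random(seed)
    seq = list(x)
    profit, worst = 0, 0
    for _ in range(max_turns):
        if not seq:
            return True, profit, worst
        v = seq[0] + seq[-1] if len(seq) > 1 else seq[0]
        if rng.random() < p:                   # win the wager v
            profit += v
            seq = seq[1:-1] if len(seq) > 1 else []
        else:                                  # lose the wager v
            profit -= v
            seq.append(v)
        worst = min(worst, profit)
    return False, profit, worst
```

Running many seeds shows the worst drawdown occasionally being enormous, in line with part (c).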
In the next exercise you derive a time-reversed version of the L^2 maximal inequality (5.2.4) by an application of Corollary 5.4.5.
Exercise 5.4.16. Associate to any given martingale (Y_n, G_n) the record times θ_{k+1} = min{j ≥ 0 : Y_j > Y_{θ_k}}, k = 0, 1, . . ., starting at θ_0 = 0.
(a) Fixing m finite, set τ_k = θ_k ∧ m and explain why (Y_{τ_k}, G_{τ_k}) is a MG.
(b) Deduce that if E Y_m^2 is finite then

Σ_{k=1}^m E[(Y_{τ_k} − Y_{τ_{k−1}})^2] = E Y_m^2 − E Y_0^2.

Hint: Apply Exercise 5.1.8.
(c) Conclude that for any martingale Y_n and all m,

E[(max_{ℓ≤m} Y_ℓ − Y_m)^2] ≤ E Y_m^2.
5.5. Reversed MGs, likelihood ratios and branching processes
With martingales applied throughout probability theory, we present here just a few selected applications. Our first example, Sub-section 5.5.1, deals with the analysis of extinction probabilities for branching processes. We then study in Sub-section 5.5.2 the likelihood ratios for independent experiments with the help of Kakutani's theorem about product martingales. Finally, in Sub-section 5.5.3 we develop the theory of reversed martingales and, applying it, provide zero-one law and representation results for exchangeable processes.
5.5.1. Branching processes: extinction probabilities. We use martingales to study the extinction probabilities of branching processes, the object we define next.
Definition 5.5.1 (Branching process). The branching process is a discrete time stochastic process Z_n taking non-negative integer values, such that Z_0 = 1 and for any n ≥ 1,

Z_n = Σ_{j=1}^{Z_{n−1}} N_j^{(n)},

where N and N_j^{(n)}, j = 1, 2, . . ., are i.i.d. non-negative integer valued R.V.s with finite mean m_N = E N < ∞, and where we use the convention that if Z_{n−1} = 0 then also Z_n = 0. We call a branching process sub-critical when m_N < 1, critical when m_N = 1 and super-critical when m_N > 1.
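The recursion of Definition 5.5.1 translates directly into a simulator (a sketch, not from the text; `offspring` is any sampler for N, and the names are ours):

```python
import random

def branching_path(offspring, generations, seed=None):
    """Simulate Z_0 = 1, Z_n = sum of Z_{n-1} i.i.d. offspring counts;
    offspring(rng) draws one copy of N."""
    rng = random.Random(seed)
    path = [1]
    for _ in range(generations):
        z = path[-1]                          # Z_{n-1}
        path.append(sum(offspring(rng) for _ in range(z)))
    return path
```

With the deterministic offspring N ≡ 2 the recursion gives Z_n = 2^n exactly, while a sub-critical sampler (e.g. N uniform on {0, 1}, so m_N = 1/2) dies out quickly.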
Remark. The S.P. Z_n is interpreted as counting the size of an evolving population, with N_j^{(n)} being the number of offspring of the j-th individual of generation (n−1) and Z_n being the size of the n-th generation. Associated with the branching process is the family tree, with the root denoting the 0-th generation and having N_j^{(n)} edges from vertex j at distance n from the root to vertices at distance (n+1) from the root. Random trees generated in such a fashion are called Galton-Watson trees and are the subject of much research. We focus here on the simpler S.P. Z_n and shall use throughout the filtration F_n = σ(N_j^{(k)}, k ≤ n, j = 1, 2, . . .). We note in passing that in general F^Z_n is a strict subset of F_n (since in general one cannot recover the number of offspring of each individual knowing only the total population sizes at the different generations). Though not dealt with here, more sophisticated related models have also been successfully studied by probabilists. For example: the branching process with immigration, where one adds to Z_n an external random variable I_n that counts the number of individuals immigrating into the population at the n-th generation; the age-dependent branching process, where individuals have random life-times during which they produce offspring according to an age-dependent probability generating function; the multi-type branching process, where each individual is assigned a label (type), possibly depending on the type of its parent and with a different law for the number of offspring of each type; and the branching process in random environment, where the law of the number of offspring per individual is itself a random variable (part of the a priori given random environment).
Our goal here is to find the probability p_ex of population extinction, formally defined as follows.
Definition 5.5.2. The extinction probability of a branching process is

p_ex := P({ω : Z_n(ω) = 0 for all n large enough}).

Obviously, p_ex = 0 whenever P(N = 0) = 0 and p_ex = 1 whenever P(N = 0) = 1. Hereafter we exclude these degenerate cases by assuming that 1 > P(N = 0) > 0.
To this end, we first deduce that with probability one, conditional upon non-extinction the branching process grows unboundedly.
Lemma 5.5.3. If P(N = 0) > 0 then with probability one either Z_n → ∞ or Z_n = 0 for all n large enough.
Proof. We start by proving that for any filtration F_n ↑ F_∞ and any S.P. Z_n ≥ 0: if for some A ∈ F_∞, some non-random ε_k > 0 and all large positive integers k, n,

(5.5.1)  P(A | F_n) I_{[0,k]}(Z_n) ≥ ε_k I_{[0,k]}(Z_n),

then P(A ∪ B) = 1 for B = {lim_n Z_n = ∞}. Indeed, the events C_k = {Z_n ≤ k, i.o. in n} are by (5.5.1) such that C_k ⊆ {P(A | F_n) ≥ ε_k, i.o. in n}. By Lévy's 0-1 law, P(A | F_n) → I_A except on a set D such that P(D) = 0, hence also C_k ∖ D ⊆ {I_A ≥ ε_k} = A for all k. With ∪_k C_k ⊇ B^c it follows that B^c ∖ D ⊆ A, yielding our claim that P(A ∪ B) = 1.
Turning now to the branching process Z_n, let A = {ω : Z_n(ω) = 0 for all n large enough}, which is in F_∞, noting that if Z_n ≤ k and N_j^{(n+1)} = 0 for j = 1, . . . , k, then Z_{n+1} = 0, hence ω ∈ A. Consequently, by the independence of {N_j^{(n+1)}, j = 1, . . .} and F_n it follows that

E[I_A | F_n] I_{{Z_n ≤ k}} ≥ E[I_{{Z_{n+1}=0}} | F_n] I_{{Z_n ≤ k}} ≥ P(N = 0)^k I_{{Z_n ≤ k}}

for all n and k. That is, (5.5.1) holds in this case for ε_k = P(N = 0)^k > 0. As shown already, this implies that with probability one either Z_n → ∞ or Z_n = 0 for all n large enough. □
The generating function

(5.5.2)  L(s) = E[s^N] = P(N = 0) + Σ_{k=1}^∞ P(N = k) s^k

plays a key role in analyzing the branching process. In this task, we employ the following martingales associated with the branching process.
Lemma 5.5.4. Suppose 1 > P(N = 0) > 0. Then, (X_n, F_n) is a martingale, where X_n = m_N^{−n} Z_n. In the super-critical case we also have the martingale (M_n, F_n) for M_n = ρ^{Z_n} and ρ ∈ (0, 1) the unique such solution of s = L(s). The same applies in the sub-critical case if there exists a solution ρ ∈ (1, ∞) of s = L(s).
Proof. Since the value of Z_n is a non-random function of {N_j^{(k)}, k ≤ n, j = 1, 2, . . .}, it follows that both X_n and M_n are F_n-adapted. We proceed to show by induction on n that the non-negative processes Z_n and s^{Z_n}, for each s > 0 such that L(s) ≤ max(s, 1), are integrable with

(5.5.3)  E[Z_{n+1} | F_n] = m_N Z_n,   E[s^{Z_{n+1}} | F_n] = L(s)^{Z_n}.

Indeed, recall that the i.i.d. random variables N_j^{(n+1)} of finite mean m_N are independent of F_n, on which Z_n is measurable. Hence, by linearity of the expectation it follows that for any A ∈ F_n,

E[Z_{n+1} I_A] = Σ_{j=1}^∞ E[N_j^{(n+1)} I_{{Z_n ≥ j}} I_A] = Σ_{j=1}^∞ E[N_j^{(n+1)}] E[I_{{Z_n ≥ j}} I_A] = m_N E[Z_n I_A].

This verifies the integrability of Z_n ≥ 0 as well as the identity E[Z_{n+1} | F_n] = m_N Z_n of (5.5.3), which amounts to the martingale condition E[X_{n+1} | F_n] = X_n for X_n = m_N^{−n} Z_n. Similarly, fixing s > 0,

s^{Z_{n+1}} = Σ_{ℓ=0}^∞ I_{{Z_n = ℓ}} Π_{j=1}^ℓ s^{N_j^{(n+1)}}.

Hence, by linearity of the expectation and the independence of {s^{N_j^{(n+1)}}} and F_n,

E[s^{Z_{n+1}} I_A] = Σ_{ℓ=0}^∞ E[I_{{Z_n = ℓ}} I_A Π_{j=1}^ℓ s^{N_j^{(n+1)}}]
= Σ_{ℓ=0}^∞ E[I_{{Z_n = ℓ}} I_A] Π_{j=1}^ℓ E[s^{N_j^{(n+1)}}] = Σ_{ℓ=0}^∞ E[I_{{Z_n = ℓ}} I_A] L(s)^ℓ = E[L(s)^{Z_n} I_A].

Since Z_n ≥ 0 and L(s) ≤ max(s, 1), this implies that E s^{Z_{n+1}} ≤ 1 + E s^{Z_n} and the integrability of s^{Z_n} follows by induction on n. Given that s^{Z_n} is integrable and the preceding identity holds for all A ∈ F_n, we have thus verified the right identity in (5.5.3), which in case s = L(s) is precisely the martingale condition for M_n = s^{Z_n}.
Finally, to prove that s = L(s) has a unique solution in (0, 1) when m_N = E N > 1, note that the function s ↦ L(s) of (5.5.2) is continuous and bounded on [0, 1]. Further, since L(1) = 1 and L'(1) = E N > 1, it follows that L(s) < s for some 0 < s < 1. With L(0) = P(N = 0) > 0 we have by continuity that s = L(s) for some s ∈ (0, 1). To show the uniqueness of such a solution, note that E N > 1 implies that P(N = k) > 0 for some k > 1, so L''(s) = Σ_{k=2}^∞ k(k − 1) P(N = k) s^{k−2} is positive and finite on (0, 1). Consequently, L(·) is strictly convex there. Hence, if ρ ∈ (0, 1) is such that ρ = L(ρ), then L(s) < s for s ∈ (ρ, 1), so such a solution ρ ∈ (0, 1) is unique. □
Remark. Since X_n = m_N^{−n} Z_n is a martingale with X_0 = 1, it follows that E Z_n = m_N^n for all n ≥ 0. Thus, a sub-critical branching process, i.e. one with m_N < 1, has mean total population size

E[Σ_{n=0}^∞ Z_n] = Σ_{n=0}^∞ m_N^n = 1/(1 − m_N) < ∞,

which is finite.
We now determine the extinction probabilities for branching processes.
Proposition 5.5.5. Suppose 1 > P(N = 0) > 0. If m_N ≤ 1 then p_ex = 1. In contrast, if m_N > 1 then p_ex = ρ, with m_N^{−n} Z_n → X_∞ a.s. and Z_n → Z_∞ ∈ {0, ∞} a.s.
Remark. In words, we find that for sub-critical and non-degenerate critical branching processes the population eventually dies off, whereas non-degenerate super-critical branching processes survive forever with positive probability, and conditional upon such survival their population size grows unboundedly in time.
Proof. Applying Doob's martingale convergence theorem to the non-negative MG X_n of Lemma 5.5.4 we have that X_n → X_∞ a.s., with X_∞ almost surely finite. In case m_N ≤ 1 this implies that Z_n = m_N^n X_n is almost surely bounded (in n), hence by Lemma 5.5.3 necessarily Z_n = 0 for all large n, i.e. p_ex = 1. In case m_N > 1 we have by Doob's martingale convergence theorem that M_n → M_∞ a.s. for the non-negative MG M_n = ρ^{Z_n} of Lemma 5.5.4. Since ρ ∈ (0, 1) and Z_n ≥ 0, it follows that this MG is bounded by one, hence U.I., and with Z_0 = 1 it follows that E M_∞ = E M_0 = ρ (see Theorem 5.3.12). Recall from Lemma 5.5.3 that Z_n → Z_∞ ∈ {0, ∞} a.s., so M_∞ = ρ^{Z_∞} ∈ {0, 1} with

p_ex = P(Z_∞ = 0) = P(M_∞ = 1) = E M_∞ = ρ

as stated. □
Remark. For a non-degenerate critical branching process (i.e. when m_N = 1 and P(N = 0) > 0), we have seen that the martingale Z_n converges to 0 with probability one, while E Z_n = E Z_0 = 1. Consequently, this MG is L^1-bounded but not U.I. (for another example, see Exercise 5.2.14). Further, as either Z_n = 0 or Z_n ≥ 1, it follows that in this case 1 = E(Z_n | Z_n ≥ 1)(1 − q_n) for q_n = P(Z_n = 0). Further, here q_n ↑ p_ex = 1, so we deduce that conditional upon non-extinction the mean population size E(Z_n | Z_n ≥ 1) = 1/(1 − q_n) grows to infinity as n → ∞.
As you show next, if a super-critical branching process has a square-integrable offspring distribution then m_N^{−n} Z_n converges in law to a non-degenerate random variable. The Kesten-Stigum L log L theorem (which we do not prove here) states that the latter property holds if and only if E[N log N] is finite.
Exercise 5.5.6. Consider a super-critical branching process Z_n where the number of offspring is of mean m_N = E[N] > 1 and variance v_N = Var(N) < ∞.
(a) Compute E[X_n^2] for X_n = m_N^{−n} Z_n.
(b) Show that P(X_∞ > 0) > 0 for the a.s. limit X_∞ of the martingale X_n.
(c) Show that P(X_∞ = 0) = ρ and deduce that for a.e. ω, if the branching process survives forever, that is, Z_n(ω) > 0 for all n, then X_∞(ω) > 0.
The generating function L(s) = E[s^N] yields information about the laws of Z_n and that of X_∞ of Proposition 5.5.5.
Proposition 5.5.7. Consider the generating functions L_n(s) = E[s^{Z_n}] for s ∈ [0, 1] and a branching process Z_n starting with Z_0 = 1. Then, L_0(s) = s and L_n(s) = L[L_{n−1}(s)] for n ≥ 1 and L(·) of (5.5.2). Consequently, the generating function L̂_∞(s) = E[s^{X_∞}] of X_∞ is a solution of L̂_∞(s) = L[L̂_∞(s^{1/m_N})] which converges to one as s ↑ 1.
Remark. In particular, the probability q_n = P(Z_n = 0) = L_n(0) that the branching process is extinct after n generations is given by the recursion q_n = L(q_{n−1}) for n ≥ 1, starting at q_0 = 0. Since the continuous function L(s) is above s on the interval from zero to the smallest positive solution of s = L(s), it follows that q_n is a monotone non-decreasing sequence that converges to this solution, which is thus the value of p_ex. This alternative evaluation of p_ex does not use martingales. Though implicit here, it instead relies on the Markov property of the branching process (c.f. Example 6.1.10).
Proof. Recall that Z_1 = N_1^{(1)}, and if Z_1 = k then the branching process Z_n for n ≥ 2 has the same law as the sum of k i.i.d. variables, each having the same law as Z_{n−1} (with the j-th such variable counting the number of individuals in the n-th generation who are descendants of the j-th individual of the first generation). Consequently, E[s^{Z_n} | Z_1 = k] = E[s^{Z_{n−1}}]^k for all n ≥ 2 and k ≥ 0. Summing over the disjoint events {Z_1 = k} we have by the tower property that for n ≥ 2,

L_n(s) = E[E(s^{Z_n} | Z_1)] = Σ_{k=0}^∞ P(N = k) L_{n−1}(s)^k = L[L_{n−1}(s)]

for L(·) of (5.5.2), as claimed. Obviously, L_0(s) = s and L_1(s) = E[s^N] = L(s). From this identity we conclude that L̂_n(s) = L[L̂_{n−1}(s^{1/m_N})] for L̂_n(s) = E[s^{X_n}] and X_n = m_N^{−n} Z_n. With X_n → X_∞ a.s. we have by bounded convergence that L̂_n(s) → L̂_∞(s) = E[s^{X_∞}], which by the continuity of r ↦ L(r) on [0, 1] is thus a solution of the identity L̂_∞(s) = L[L̂_∞(s^{1/m_N})]. Further, by monotone convergence L̂_∞(s) ↑ L̂_∞(1) = 1 as s ↑ 1. □
Remark. Of course, q_n = P(T ≤ n) provides the distribution function of the time of extinction T = min{k ≥ 0 : Z_k = 0}. For example, if N has the Bernoulli(p) distribution for some 0 < p < 1 then T is merely a Geometric(1 − p) random variable, but in general the law of T is more involved.
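The recursion q_n = L(q_{n−1}) from the remark following Proposition 5.5.7 gives a practical scheme for computing p_ex: iterate L from q_0 = 0 until the fixed point. A minimal sketch (assuming L is supplied as a function; names ours):

```python
import math

def extinction_prob(L, iters=10000):
    """Iterate q_n = L(q_{n-1}) from q_0 = 0; converges monotonically
    to the smallest root of s = L(s), which is p_ex."""
    q = 0.0
    for _ in range(iters):
        q = L(q)
    return q
```

For example, for Geometric offspring P(N = k) = p(1 − p)^k with p = 1/3 (so m_N = 2) this recovers p_ex = p/(1 − p) = 1/2, and for Poisson(2) offspring (L(s) = e^{2(s−1)}) it converges to the root of s = e^{2(s−1)} near 0.203; in a critical case the convergence to p_ex = 1 is much slower.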
The generating function L̂_∞(·) determines the law of X_∞ ≥ 0 (see Exercise 3.2.40). For example, as you show next, in the special case where N has the Geometric distribution, conditioned on non-extinction X_∞ is an exponential random variable.
Exercise 5.5.8. Suppose Z_n is a branching process with Z_0 = 1 and N + 1 having a Geometric(p) distribution for some 0 < p < 1 (that is, P(N = k) = p(1 − p)^k for k = 0, 1, . . .). Here m = m_N = (1 − p)/p, so the branching process is sub-critical if p > 1/2, critical if p = 1/2 and super-critical if p < 1/2.
(a) Check that L(s) = p/(1 − (1 − p)s) and ρ = 1/m. Then verify that

L_n(s) = (p m^n (1 − s) + (1 − p)s − p) / ((1 − p)(1 − s) m^n + (1 − p)s − p),

except in the critical case, for which L_n(s) = (n − (n − 1)s)/((n + 1) − ns).
(b) Show that in the super-critical case L̂_∞(e^{−λ}) = ρ + (1 − ρ)^2/(λ + (1 − ρ)) for all λ ≥ 0, and deduce that conditioned on non-extinction X_∞ has the exponential distribution of parameter (1 − ρ).
(c) Show that in the sub-critical case E[s^{Z_n} | Z_n ≠ 0] → (1 − m)s/[1 − ms], and deduce that then the law of Z_n conditioned upon non-extinction converges weakly to a Geometric(1 − m) distribution.
(d) Show that in the critical case E[e^{−λ Z_n/n} | Z_n ≠ 0] → 1/(1 + λ) for all λ ≥ 0, and deduce that then the law of n^{−1} Z_n conditioned upon non-extinction converges weakly to an exponential distribution (of parameter one).
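The closed forms in part (a) can be sanity-checked against the composition rule L_n = L ∘ L_{n−1} of Proposition 5.5.7 (a sketch with our own names):

```python
def L_geo(p):
    """Offspring pgf for the Geometric(p)-minus-one offspring law."""
    return lambda s: p / (1 - (1 - p) * s)

def Ln_closed(p, n, s):
    """Closed form for L_n(s) stated in Exercise 5.5.8(a)."""
    if p == 0.5:                      # critical case
        return (n - (n - 1) * s) / ((n + 1) - n * s)
    m = (1 - p) / p
    num = p * m**n * (1 - s) + (1 - p) * s - p
    den = (1 - p) * (1 - s) * m**n + (1 - p) * s - p
    return num / den

def Ln_iterated(p, n, s):
    """n-fold composition L(L(...L(s)...)), i.e. L_n(s)."""
    L = L_geo(p)
    for _ in range(n):
        s = L(s)
    return s
```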
The following exercise demonstrates that martingales are also useful in the study of Galton-Watson trees.
Exercise 5.5.9. Consider a super-critical branching process Z_n such that 1 ≤ N ≤ κ for some non-random finite κ. A vertex of the corresponding Galton-Watson tree T_∞ is called a branch point if it has more than one offspring. For each vertex v ∈ T_∞ let C(v) count the number of branch points one encounters when traversing along a path from the root of the tree to v (possibly counting the root, but not counting v, among these branch points).
(a) Let T_n denote the set of vertices in T_∞ at distance n from the root. Show that for each λ > 0,

X_n := M(λ)^{−n} Σ_{v∈T_n} e^{−λ C(v)}

is a martingale when M(λ) = m_N e^{−λ} + P(N = 1)(1 − e^{−λ}).
(b) Let B_n = min{C(v) : v ∈ T_n}. Show that a.s. liminf_n n^{−1} B_n ≥ α, where α > 0 is non-random (and possibly depends on the offspring distribution).
5.5.2. Product martingales and Radon-Nikodym derivatives. We start with an explicit characterization of uniform integrability for the product martingale of Example 5.1.10.
Theorem 5.5.10 (Kakutani's Theorem). Let M_∞ denote the a.s. limit of the product martingale M_n = Π_{k=1}^n Y_k, with M_0 = 1 and independent, integrable Y_k ≥ 0 such that E Y_k = 1 for all k ≥ 1. By Jensen's inequality, a_k = E[√Y_k] is in (0, 1] for all k ≥ 1. The following five statements are then equivalent:

(a) M_n is U.I.;  (b) M_n → M_∞ in L^1;  (c) E M_∞ = 1;
(d) Π_k a_k > 0;  (e) Σ_k (1 − a_k) < ∞,

and if any (every) one of them fails, then M_∞ = 0 a.s.
Proof. Statement (a) implies statement (b) because any U.I. martingale converges in L^1 (see Theorem 5.3.12). Further, the L^1 convergence per statement (b) implies that E M_n → E M_∞, and since E M_n = E M_0 = 1 for all n, this results with E M_∞ = 1 as well, which is statement (c).
Considering the non-negative martingale N_n = Π_{k=1}^n (√Y_k / a_k), we next show that (c) implies (d) by proving the contra-positive. Indeed, by Doob's convergence theorem N_n → N_∞ a.s., with N_∞ finite a.s. Hence, if statement (d) fails to hold (that is, Π_{k=1}^n a_k → 0), then M_n = N_n^2 (Π_{k=1}^n a_k)^2 → 0 a.s. So in this case M_∞ = 0 a.s. and statement (c) also fails to hold.
In contrast, if statement (d) holds then N_n is L^2-bounded, since for all n,

E N_n^2 = (Π_{k=1}^n a_k)^{−2} E M_n ≤ (Π_k a_k)^{−2} = c < ∞.

Thus, with M_k ≤ N_k^2 it follows by the L^2-maximal inequality that for all n,

E[max_{k≤n} M_k] ≤ E[max_{k≤n} N_k^2] ≤ 4 E[N_n^2] ≤ 4c.

Hence, M_k ≥ 0 are such that sup_k M_k is integrable and in particular M_n is U.I. (that is, (a) holds).
Finally, to see why statements (d) and (e) are equivalent, note that upon applying the Borel-Cantelli lemmas for independent events A_n with P(A_n) = 1 − a_n, the divergence of the series Σ_k (1 − a_k) is equivalent to P(A_n^c eventually) = 0, which for strictly positive a_k is equivalent to Π_k a_k = 0. □
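Kakutani's dichotomy is visible numerically (a sketch, names ours). Take, say, Y_k = 1 ± ε_k with equal probability, so that E Y_k = 1 and a_k = (√(1 + ε_k) + √(1 − ε_k))/2 ≈ 1 − ε_k²/8: the product Π_k a_k stays bounded away from zero when Σ_k ε_k² < ∞ and drifts to zero otherwise.

```python
import math

def kakutani_product(eps_of_k, n_terms):
    """prod_k a_k for Y_k = 1 +/- eps_k with equal probability,
    where a_k = E[sqrt(Y_k)] = (sqrt(1+e) + sqrt(1-e))/2."""
    log_prod = 0.0
    for k in range(1, n_terms + 1):
        e = eps_of_k(k)
        a = 0.5 * (math.sqrt(1 + e) + math.sqrt(1 - e))
        log_prod += math.log(a)
    return math.exp(log_prod)
```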
We next consider another martingale that is key to the study of likelihood ratios in sequential statistics. To this end, let P and Q be two probability measures on the same measurable space (Ω, F_∞), with P_n = P|_{F_n} and Q_n = Q|_{F_n} denoting the restrictions of P and Q to a filtration F_n ↑ F_∞.
Theorem 5.5.11. Suppose Q_n ≪ P_n for all n, with M_n = dQ_n/dP_n denoting the corresponding Radon-Nikodym derivatives on (Ω, F_n). Then,
(a) (M_n, F_n) is a martingale on the probability space (Ω, F_∞, P), where M_n → M_∞ a.s. as n → ∞ and M_∞ is P-a.s. finite.
(b) If M_n is uniformly P-integrable then Q ≪ P and dQ/dP = M_∞.
(c) More generally, the Lebesgue decomposition of Q to its absolutely continuous and singular parts with respect to P is

(5.5.4)  Q = Q_ac + Q_s = M_∞ · P + I_{{M_∞=∞}} · Q.

Remark. From the decomposition (5.5.4) it follows that if Q ≪ P then both Q(M_∞ < ∞) = 1 and P(M_∞) = 1, while if Q ⊥ P then both Q(M_∞ = ∞) = 1 and P(M_∞ = 0) = 1.
Example 5.5.12. Suppose F_n = σ(Π_n) and the countable partitions Π_n = {A_{i,n}} ⊆ F of Ω are nested (that is, for each n the partition Π_{n+1} is a refinement of Π_n). It is not hard to check directly that

M_n = Σ_{{i : P(A_{i,n}) > 0}} [Q(A_{i,n}) / P(A_{i,n})] I_{A_{i,n}}

is an F_n-sup-MG for (Ω, F, P), and is further an F_n-martingale if Q(A_{i,n}) = 0 whenever P(A_{i,n}) = 0 (which is precisely the assumption made in Theorem 5.5.11). We have seen this construction in Exercise 5.3.20, where Π_n are the dyadic partitions of Ω = [0, 1), P is taken to be Lebesgue's measure on [0, 1) and Q([s, t)) = x(t) − x(s) is the signed measure associated with the function x(·).
Proof. (a). By the Radon-Nikodym theorem, M_n ∈ mF_n is non-negative and P-integrable (since P_n(M_n) = Q_n(Ω) = 1). Further, Q(A) = Q_n(A) = (M_n · P_n)(A) = (M_n · P)(A) for all A ∈ F_n. In particular, if k ≤ n and A ∈ F_k, then (since F_k ⊆ F_n),

P(M_n I_A) = Q(A) = P(M_k I_A),

so in (Ω, F_∞, P) we have M_k = E[M_n | F_k] by definition of the conditional expectation. Finally, by Doob's convergence theorem the non-negative MG M_n converges P-a.s. to M_∞, which is P-a.s. finite.
(b). We have seen already that if A ∈ F_k then Q(A) = P(M_n I_A) for all n ≥ k. Hence, if M_n is further uniformly P-integrable then also P(M_n I_A) → P(M_∞ I_A), so taking n → ∞ we deduce that in this case Q(A) = P(M_∞ I_A) for any A ∈ ∪_k F_k (and in particular for A = Ω). Since the probability measures Q and M_∞ · P then coincide on the π-system ∪_k F_k, they agree also on the σ-algebra F_∞ generated by this π-system (recall Proposition 1.1.39).
(c). To deal with the general case, where M_n is not necessarily uniformly P-integrable, consider the probability measure S = (P + Q)/2 and its restrictions S_n = (P_n + Q_n)/2 to F_n. As P_n ≪ S_n and Q_n ≪ S_n, there exist V_n = dP_n/dS_n ≥ 0 and W_n = dQ_n/dS_n ≥ 0 such that V_n + W_n = 2. Per part (a) the bounded (V_n, F_n) and (W_n, F_n) are martingales on (Ω, F_∞, S), having the S-a.s. finite limits V_∞ and W_∞, respectively. Further, as shown in part (b), V_∞ = dP/dS and W_∞ = dQ/dS. Recall that W_n · S_n = Q_n = M_n · P_n = M_n V_n · S_n, so S-a.s. M_n V_n = W_n = 2 − V_n for all n. Consequently, S-a.s. V_n > 0 and M_n = (2 − V_n)/V_n. Considering n → ∞ we deduce that S-a.s. M_n → M_∞ = (2 − V_∞)/V_∞ = W_∞/V_∞, possibly infinite, and I_{{M_∞<∞}} = I_{{V_∞>0}}. Thus,

Q = W_∞ · S = I_{{V_∞>0}} M_∞ V_∞ · S + I_{{V_∞=0}} W_∞ · S = I_{{M_∞<∞}} M_∞ · P + I_{{M_∞=∞}} · Q,

and since M_∞ is finite P-a.s., this is precisely the stated Lebesgue decomposition of Q with respect to P. □
Combining Theorem 5.5.11 and Kakutani's theorem, we next deduce that if the marginals of one infinite product measure are absolutely continuous with respect to those of another, then either the former product measure is absolutely continuous with respect to the latter, or these two measures are mutually singular. This dichotomy is a key result in the treatment by theoretical statistics of the problem of hypothesis testing (with independent observables under both the null hypothesis and the alternative hypothesis).
Proposition 5.5.13. Suppose that P and Q are product measures on (R^N, B_c) which make the coordinates X_n(ω) = ω_n independent, with the respective laws Q ∘ X_k^{−1} ≪ P ∘ X_k^{−1} for each k ∈ N. Let Y_k(ω) = [d(Q ∘ X_k^{−1})/d(P ∘ X_k^{−1})](X_k(ω)) then denote the likelihood ratios of the marginals. Then, M_∞ = lim_n Π_{k=1}^n Y_k exists a.s. under both P and Q. If γ = Π_k P(√Y_k) is positive then Q is absolutely continuous with respect to P with dQ/dP = M_∞, whereas if γ = 0 then Q is singular with respect to P, such that Q-a.s. M_∞ = ∞ while P-a.s. M_∞ = 0.
Remark 5.5.14. Note that the preceding Y_k are identically distributed when both P and Q are products of i.i.d. random variables. Hence in this case γ > 0 if and only if P(√Y_1) = 1, which with P(Y_1) = 1 is equivalent to P[(√Y_1 − 1)^2] = 0, i.e. to having P-a.s. Y_1 = 1. The latter condition implies that P-a.s. M_∞ = 1, so Q = P. We thus deduce that any Q ≠ P that are both products of i.i.d. random variables are mutually singular, and for n large enough the likelihood test of comparing M_n to a fixed threshold decides correctly between the two hypotheses regarding the law of {X_k}, since P-a.s. M_n → 0 while Q-a.s. M_n → ∞.
Proof. We are in the setting of Theorem 5.5.11 for Ω = R^N and the filtration

F^X_n = σ(X_k : 1 ≤ k ≤ n) ↑ F^X_∞ = σ(X_k, k < ∞) = B_c

(c.f. Exercise 1.2.14 and the definition of B_c preceding Kolmogorov's extension theorem). Here M_n = dQ_n/dP_n = Π_{k=1}^n Y_k, and the mutual independence of {X_k} implies that {Y_k} are both P-independent and Q-independent (c.f. part (b) of Exercise 4.1.8). In the course of proving part (c) of Theorem 5.5.11 we have shown that M_n → M_∞ both P-a.s. and Q-a.s. Further, recall part (a) of Theorem 5.5.11, that M_n is a martingale on (Ω, F^X_∞, P). From Kakutani's theorem we know that the product martingale M_n is uniformly P-integrable when γ > 0 (see (d) implying (a) there), whereas if γ = 0 then P-a.s. M_∞ = 0. By part (b) of Theorem 5.5.11 the uniform P-integrability of M_n results with Q = M_∞ · P ≪ P. In contrast, when P-a.s. M_∞ = 0 we get from the decomposition of part (c) of Theorem 5.5.11 that Q_ac = 0 and Q = I_{{M_∞=∞}} · Q, so in this case Q-a.s. M_∞ = ∞ and Q ⊥ P. □
Here is a concrete application of the preceding proposition.
Exercise 5.5.15. Suppose P and Q are two product probability measures on the
set Ω = {0, 1}^ℕ of infinite binary sequences, equipped with the product σ-algebra
generated by its cylinder sets, with p_k = P(ω : ω_k = 1) strictly between zero and
one and q_k = Q(ω : ω_k = 1) ∈ [0, 1].
(a) Deduce from Proposition 5.5.13 that Q is absolutely continuous with respect to P if and only if ∑_k (1 − √(p_k q_k) − √((1 − p_k)(1 − q_k))) is finite.
(b) Show that if ∑_k |p_k − q_k| is finite then Q is absolutely continuous with respect to P.
(c) Show that if p_k, q_k ∈ [ε, 1 − ε] for some ε > 0 and all k, then Q ≪ P if and only if ∑_k (p_k − q_k)^2 < ∞.
(d) Show that if ∑_k q_k < ∞ and ∑_k p_k = ∞ then Q ⊥ P, so in general the condition ∑_k (p_k − q_k)^2 < ∞ is not sufficient for absolute continuity of Q with respect to P.
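The criterion of part (a) is easy to probe numerically. Here is a Python sketch (an added illustration; the two sequences p_k ≡ 1/2, q_k = 1/2 + 1/(k+2) and q_k = 1/2 + 1/√(k+4) are hypothetical examples in the spirit of part (c)): summable squared differences keep the criterion sum bounded, non-summable ones make it diverge.

```python
import math

def kakutani_terms(pairs):
    # Summands 1 - sqrt(p q) - sqrt((1-p)(1-q)) from part (a);
    # Q << P exactly when their sum over k is finite.
    return [1.0 - math.sqrt(p * q) - math.sqrt((1.0 - p) * (1.0 - q))
            for p, q in pairs]

N = 200000
# q_k - p_k = 1/(k+2): (p_k - q_k)^2 summable, partial sums stay bounded.
conv = sum(kakutani_terms((0.5, 0.5 + 1.0 / (k + 2)) for k in range(N)))
# q_k - p_k = 1/sqrt(k+4): (p_k - q_k)^2 not summable, partial sums grow
# like (1/2) log N.
div = sum(kakutani_terms((0.5, 0.5 + 1.0 / math.sqrt(k + 4)) for k in range(N)))
```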
In the spirit of Theorem 5.5.11, as you show next, a positive martingale (Z_n, F_n)
induces a collection of probability measures Q_n that are equivalent to P_n = P|_{F_n}
(i.e. both Q_n ≪ P_n and P_n ≪ Q_n), and satisfy a certain martingale Bayes rule.
5.5. REVERSED MGS, LIKELIHOOD RATIOS AND BRANCHING PROCESSES 219
In particular, the following discrete time analog of Girsanov's theorem shows that
such a construction can significantly simplify certain computations upon moving from
P_n to Q_n.
Exercise 5.5.16. Suppose (Z_n, F_n) is a (strictly) positive MG on (Ω, F, P), normalized so that EZ_0 = 1. Let P_n = P|_{F_n} and consider the equivalent probability
measure Q_n on (Ω, F_n) of Radon-Nikodym derivative dQ_n/dP_n = Z_n.
(a) Show that Q_k = Q_n|_{F_k} for any 0 ≤ k ≤ n.
(b) Fixing 0 ≤ k ≤ m ≤ n and Y ∈ L^1(Ω, F_m, P), show that Q_n-a.s. (hence also P-a.s.), E_{Q_n}[Y|F_k] = E[Y Z_m|F_k]/Z_k.
(c) For F_n = F^ξ_n, the canonical filtration of i.i.d. standard normal variables {ξ_k}, and any bounded, F^ξ_n-predictable {V_n}, consider the measures Q_n induced by the exponential martingale Z_n = exp(Y_n − (1/2) ∑_{k=1}^n V_k^2), where Y_n = ∑_{k=1}^n ξ_k V_k. Show that X of coordinates X_m = ∑_{k=1}^m (ξ_k − V_k), 1 ≤ m ≤ n, is under Q_n a Gaussian random vector whose law is the same as that of {∑_{k=1}^m ξ_k : 1 ≤ m ≤ n} under P.
Hint: Use characteristic functions.
5.5.3. Reversed martingales and 0-1 laws. Reversed martingales, which
we next define, though less common than martingales, are key tools in the proof of
many asymptotics (e.g. 0-1 laws).
Definition 5.5.17. A reversed martingale (in short RMG) is a martingale indexed
by the non-positive integers. That is, integrable X_n, n ≤ 0, adapted to a filtration F_n,
n ≤ 0, such that E[X_{n+1}|F_n] = X_n for all n ≤ −1. We denote by F_n ↓ F_{−∞}
a filtration {F_n}_{n≤0} and the associated σ-algebra F_{−∞} = ∩_{n≤0} F_n, such that the
relation F_{−∞} ⊆ F_k applies for any k ≤ 0.
Remark. One similarly defines reversed subMG-s (and supMG-s), by replacing
E[X_{n+1}|F_n] = X_n for all n ≤ −1 with the condition E[X_{n+1}|F_n] ≥ X_n for all
n ≤ −1 (or the condition E[X_{n+1}|F_n] ≤ X_n for all n ≤ −1, respectively). Since
(X_{n+k}, F_{n+k}), k = 0, ..., −n, is then a MG (or subMG, or supMG), any result
about subMG-s, supMG-s and MG-s that does not involve the limit as n → ∞
(such as Doob's decomposition, and the maximal and up-crossing inequalities) applies
also for reversed subMG-s, reversed supMG-s and RMG-s.
As we see next, RMG-s are the dual of Doob's martingales (with time moving
backwards), hence U.I., and as such each RMG converges both a.s. and in L^1 as
n → −∞.
Theorem 5.5.18 (Lévy's downward theorem). With X_0 integrable, (X_n, F_n),
n ≤ 0, is a RMG if and only if X_n = E[X_0|F_n] for all n ≤ 0. Further, E[X_0|F_n]
→ E[X_0|F_{−∞}] almost surely and in L^1 when n → −∞.
Remark. Actually, (X_n, F_n) is a RMG for X_n = E[Y|F_n], n ≤ 0, and any integrable Y (possibly Y ∉ mF_0). Further, E[Y|F_n] → E[Y|F_{−∞}] almost surely
and in L^1. This is merely a restatement of Lévy's downward theorem, since for
X_0 = E[Y|F_0] we have by the tower property that E[Y|F_n] = E[X_0|F_n] for any
n ≤ 0.
Proof. Suppose (X_n, F_n) is a RMG. Then, fixing n < 0 and applying Proposition 5.1.20 for the MG (X_{n+k}, F_{n+k}), k = 0, ..., −n (taking there ℓ = −n > m =
0), we deduce that E[X_0|F_n] = X_n. Conversely, suppose X_n = E[X_0|F_n] for X_0
integrable and all n ≤ 0. Then, X_n ∈ L^1(Ω, F_n, P) by the definition of C.E. and
further, with F_n ⊆ F_{n+1}, we have by the tower property that
X_n = E[X_0|F_n] = E[E(X_0|F_{n+1})|F_n] = E[X_{n+1}|F_n],
so any such (X_n, F_n) is a RMG.
Setting hereafter X_n = E[X_0|F_n], note that for each n ≤ 0 and a < b, by
Doob's up-crossing inequality for the MG (X_{n+k}, F_{n+k}), k = 0, ..., −n, we have
that E(U_n[a, b]) ≤ (b − a)^{−1} E[(X_0 − a)_−] (where U_n[a, b] denotes the number of up-crossings of the interval [a, b] by X_k(ω), k = n, ..., 0). By monotone convergence
this implies that E(U_∞[a, b]) ≤ (b − a)^{−1} E[(X_0 − a)_−] is finite (for any a < b).
Repeating the proof of Lemma 5.3.1, now for n → −∞, we thus deduce that
X_n → X_{−∞} a.s. as n → −∞. Recall from Proposition 4.2.33 that {E[X_0|F_n]} is U.I., hence
by Vitali's convergence theorem also X_n → X_{−∞} in L^1 when n → −∞ (and in particular
the random variable X_{−∞} is integrable).
We now complete the proof by showing that X_{−∞} = E[X_0|F_{−∞}]. Indeed, fixing
k ≤ 0, since X_n ∈ mF_k for all n ≤ k, it follows that X_{−∞} = limsup_{n→−∞} X_n
is also in mF_k. This applies for all k ≤ 0, hence X_{−∞} ∈ m[∩_{k≤0} F_k] = mF_{−∞}.
Further, E[X_n I_A] → E[X_{−∞} I_A] for any A ∈ F_{−∞} (by the L^1 convergence of
X_n to X_{−∞}), and as A ∈ F_{−∞} ⊆ F_n, also E[X_0 I_A] = E[X_n I_A] for all n ≤ 0.
Thus, E[X_{−∞} I_A] = E[X_0 I_A] for all A ∈ F_{−∞}, so by the definition of conditional
expectation, X_{−∞} = E[X_0|F_{−∞}]. □
Similarly to Lévy's upward theorem, as you show next, Lévy's downward theorem
can be extended to accommodate a dominated sequence of random variables, and
if X_0 ∈ L^p for some p > 1, then X_n → X_{−∞} in L^p as n → −∞ (which is the analog of
Doob's L^p martingale convergence).
Exercise 5.5.19. Suppose F_n ↓ F_{−∞} and Y_n → Y_{−∞} a.s. as n → −∞. Show that if
sup_n |Y_n| is integrable, then E[Y_n|F_n] → E[Y_{−∞}|F_{−∞}] a.s. when n → −∞.
Exercise 5.5.20. Suppose (X_n, F_n) is a RMG. Show that if E|X_0|^p is finite and
p > 1, then X_n → X_{−∞} in L^p when n → −∞.
Not all reversed subMG-s are U.I., but here is an explicit characterization of those
that are.
Exercise 5.5.21. Show that a reversed subMG {X_n} is U.I. if and only if inf_n EX_n
is finite.
Our first application of RMG-s is to provide an alternative proof of the strong law
of large numbers of Theorem 2.3.3, with the added bonus of L^1 convergence.
Theorem 5.5.22 (Strong law of large numbers). Suppose S_n = ∑_{k=1}^n ξ_k for
i.i.d. integrable {ξ_k}. Then, n^{−1} S_n → Eξ_1 a.s. and in L^1 when n → ∞.
Proof. Let X_{−m} = (m + 1)^{−1} S_{m+1} for m ≥ 0, and define the corresponding
filtration F_{−m} = σ(X_{−k}, k ≥ m). Recall part (a) of Exercise 4.4.8, that X_{−n} =
E[ξ_1|X_{−n}] for each n ≥ 0. Further, clearly F_{−n} = σ(G_n, T_n) for G_n = σ(X_{−n}) and
T_n = σ(ξ_r, r > n + 1). With T_n independent of σ(σ(ξ_1), G_n), we thus have that
X_{−n} = E[ξ_1|F_{−n}] for each n ≥ 0 (see Proposition 4.2.3). Consequently, (X_n, F_n) is
a RMG which by Lévy's downward theorem converges for n → −∞ both a.s. and
in L^1 to the finite valued random variable X_{−∞} = E[ξ_1|F_{−∞}]. Combining this
and the tower property leads to EX_{−∞} = Eξ_1, so it only remains to show that
P(X_{−∞} ≠ c) = 0 for some non-random constant c. To this end, note that for any ℓ
finite,
X_{−∞} = limsup_{m→∞} m^{−1} ∑_{k=1}^m ξ_k = limsup_{m→∞} m^{−1} ∑_{k=ℓ+1}^m ξ_k.
Clearly, X_{−∞} ∈ mσ(ξ_r, r > ℓ) for any ℓ, so X_{−∞} is also measurable on the tail σ-algebra
T = ∩_ℓ σ(ξ_r, r > ℓ) of the sequence {ξ_k}. We complete the proof upon noting that the
σ-algebra T is P-trivial (by Kolmogorov's 0-1 law and the independence of {ξ_k}), so
in particular, a.s. X_{−∞} equals a non-random constant (see Proposition 1.2.46). □
In preparation for the Hewitt-Savage 0-1 law and de Finetti's theorem, we now
define the exchangeable σ-algebra and exchangeable random variables.
Definition 5.5.23 (Exchangeable σ-algebra and random variables). Consider the product measurable space (ℝ^ℕ, B_c) as in Kolmogorov's extension theorem.
Let ℰ_m ⊆ B_c denote the σ-algebra of events that are invariant under permutations of
the first m coordinates; that is, A ∈ ℰ_m if (ω_{π(1)}, ..., ω_{π(m)}, ω_{m+1}, ...) ∈ A for any
permutation π of {1, ..., m} and all ω = (ω_1, ω_2, ...) ∈ A. The exchangeable σ-algebra
ℰ = ∩_m ℰ_m consists of all events that are invariant under all finite permutations of
coordinates. Similarly, we call an infinite sequence of R.V.-s {ξ_k}_{k≥1} on the same
measurable space exchangeable if (ξ_1, ..., ξ_m) =^D (ξ_{π(1)}, ..., ξ_{π(m)}) for any m and
any permutation π of {1, ..., m}; that is, the joint law of an exchangeable sequence
is invariant under finite permutations of coordinates.
Our next lemma summarizes the use of RMG-s in this context.
Lemma 5.5.24. Suppose ξ_k(ω) = ω_k is an exchangeable sequence of random variables on (ℝ^ℕ, B_c). For any bounded Borel function φ : ℝ^ℓ → ℝ and m ≥ ℓ let
Ŝ_m(φ) = (m)_ℓ^{−1} ∑_i φ(ω_{i_1}, ..., ω_{i_ℓ}), where i = (i_1, ..., i_ℓ) is an ℓ-tuple of distinct integers from {1, ..., m} and (m)_ℓ = m!/(m − ℓ)! is the number of such ℓ-tuples. Then,
(5.5.6)   Ŝ_m(φ) → E[φ(ω_1, ..., ω_ℓ)|ℰ]
a.s. and in L^1 when m → ∞.
Proof. Fixing m ≥ ℓ, since the value of Ŝ_m(φ) is invariant under any permutation of the first m coordinates of ω, we have that Ŝ_m(φ) is measurable on ℰ_m.
Further, this bounded R.V. is obviously integrable, so
(5.5.6)   Ŝ_m(φ) = E[Ŝ_m(φ)|ℰ_m] = (m)_ℓ^{−1} ∑_i E[φ(ω_{i_1}, ..., ω_{i_ℓ})|ℰ_m].
Fixing any ℓ-tuple of distinct integers i_1, ..., i_ℓ from {1, ..., m}, by our exchangeability assumption, the probability measure on (ℝ^ℕ, B_c) is invariant under any permutation π of the first m coordinates of ω such that π(i_k) = k for k = 1, ..., ℓ. Consequently, E[φ(ω_{i_1}, ..., ω_{i_ℓ})I_A] = E[φ(ω_1, ..., ω_ℓ)I_A] for any A ∈ ℰ_m, implying that
E[φ(ω_{i_1}, ..., ω_{i_ℓ})|ℰ_m] = E[φ(ω_1, ..., ω_ℓ)|ℰ_m]. Since this applies for any ℓ-tuple of distinct integers from {1, ..., m}, it follows by (5.5.6) that Ŝ_m(φ) = E[φ(ω_1, ..., ω_ℓ)|ℰ_m]
for all m ≥ ℓ. In conclusion, considering the filtration F_n = ℰ_{−n}, n ≤ 0, for which
F_{−∞} = ℰ, we have in view of the remark following Lévy's downward theorem that
(Ŝ_{−n}(φ), ℰ_{−n}), n ≤ 0, is a RMG and the convergence in (5.5.5) holds a.s. and in
L^1. □
Remark. Noting that any sequence of i.i.d. random variables is also exchangeable,
our first application of Lemma 5.5.24 is the following zero-one law.
Theorem 5.5.25 (Hewitt-Savage 0-1 law). The exchangeable σ-algebra ℰ of a
sequence of i.i.d. random variables ξ_k(ω) = ω_k is P-trivial (that is, P(A) ∈ {0, 1}
for any A ∈ ℰ).
Remark. Given the Hewitt-Savage 0-1 law, we can simplify the proof of Theorem
5.5.22 upon noting that for each m the σ-algebra F_{−m} is contained in ℰ_{m+1}, hence
F_{−∞} ⊆ ℰ must also be P-trivial.
Proof. As the i.i.d. ξ_k(ω) = ω_k are exchangeable, from Lemma 5.5.24 we
have that for any bounded Borel φ : ℝ^ℓ → ℝ, almost surely Ŝ_m(φ) → Ŝ_∞(φ) =
E[φ(ω_1, ..., ω_ℓ)|ℰ].
We proceed to show that Ŝ_∞(φ) = E[φ(ω_1, ..., ω_ℓ)]. To this end, fixing a finite
integer r < m, let
Ŝ_{m,r}(φ) = (m)_ℓ^{−1} ∑_{i : i_1 > r, ..., i_ℓ > r} φ(ω_{i_1}, ..., ω_{i_ℓ})
denote the contribution of the ℓ-tuples i that do not intersect {1, ..., r}. Since
there are exactly (m − r)_ℓ such ℓ-tuples and φ is bounded, it follows that
|Ŝ_m(φ) − Ŝ_{m,r}(φ)| ≤ [1 − (m − r)_ℓ/(m)_ℓ] ‖φ‖_∞ ≤ c/m
for some c = c(r, ℓ, φ) finite and all m. Consequently, for any r,
Ŝ_∞(φ) = lim_{m→∞} Ŝ_m(φ) = lim_{m→∞} Ŝ_{m,r}(φ).
Further, by the mutual independence of {ξ_k} we have that the Ŝ_{m,r}(φ) are independent
of F^ξ_r, hence the same applies for their limit Ŝ_∞(φ). Applying Lemma 4.2.9 for
X = (ω_1, ..., ω_ℓ) we deduce that E[φ(ω_1, ..., ω_ℓ)|ℰ] = E[φ(ω_1, ..., ω_ℓ)]. Recall
that I_G is, for each G ∈ F^ξ_ℓ, a bounded Borel function of (ω_1, ..., ω_ℓ). Hence,
E[I_G|ℰ] = E[I_G]. Thus, by the tower property and taking out what is known,
P(A ∩ G) = E[I_A E[I_G|ℰ]] = E[I_A]E[I_G] for any A ∈ ℰ. That is, ℰ and F^ξ_ℓ are
independent for each finite ℓ, so by Lemma 1.4.8 we conclude that ℰ is a P-trivial
σ-algebra, as claimed. □
The proof of de Finetti's theorem requires the following algebraic identity, which
we leave as an exercise for the reader.
Exercise 5.5.26. Fixing bounded Borel functions f : ℝ^{ℓ−1} → ℝ and g : ℝ → ℝ, let
h_j(x_1, ..., x_ℓ) = f(x_1, ..., x_{ℓ−1})g(x_j) for j = 1, ..., ℓ. Show that for any sequence
{ω_k} and any m ≥ ℓ,
Ŝ_m(h_ℓ) = (m/(m − ℓ + 1)) Ŝ_m(f) Ŝ_m(g) − (1/(m − ℓ + 1)) ∑_{j=1}^{ℓ−1} Ŝ_m(h_j).
Theorem 5.5.27 (de Finetti's theorem). If ξ_k(ω) = ω_k is an exchangeable
sequence, then conditional on ℰ the random variables {ξ_k, k ≥ 1} are mutually independent and identically distributed.
Remark. For example, if the exchangeable {ξ_k} are {0, 1}-valued, then by de Finetti's
theorem these are i.i.d. Bernoulli variables of parameter p, conditional on ℰ. The
joint (unconditional) law of {ξ_k} is thus that of a mixture of i.i.d. Bernoulli(p)
sequences, with p a [0, 1]-valued random variable, measurable on ℰ.
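Such a mixture is easy to sample. The following Python sketch (an added illustration) draws p uniformly on [0, 1] and then a pair of i.i.d. Bernoulli(p) coins given p; the pair frequencies match E[p] = 1/2 and E[p^2] = 1/3, exhibiting exchangeability together with positive correlation (so no unconditional independence):

```python
import random

rng = random.Random(1)
n = 200000
pairs = []
for _ in range(n):
    p = rng.random()                       # p ~ Uniform(0,1), drawn once
    pairs.append((rng.random() < p,        # then (xi_1, xi_2) are i.i.d.
                  rng.random() < p))       # Bernoulli(p) given p

p1 = sum(a for a, _ in pairs) / n          # ~ E[p] = 1/2
p2 = sum(b for _, b in pairs) / n          # ~ 1/2 as well (exchangeability)
p11 = sum(a and b for a, b in pairs) / n   # ~ E[p^2] = 1/3 > 1/4 = p1 * p2
```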
Proof. In view of Exercise 5.5.26, upon applying (5.5.5) of Lemma 5.5.24
for the exchangeable sequence {ξ_k} and the bounded Borel functions f, g and h_ℓ, we
deduce that
E[f(ξ_1, ..., ξ_{ℓ−1})g(ξ_ℓ)|ℰ] = E[f(ξ_1, ..., ξ_{ℓ−1})|ℰ] E[g(ξ_ℓ)|ℰ].
By induction on ℓ this leads to the identity
E[∏_{k=1}^ℓ g_k(ξ_k)|ℰ] = ∏_{k=1}^ℓ E[g_k(ξ_k)|ℰ]
for all ℓ and bounded Borel g_k : ℝ → ℝ. Taking g_k = I_{B_k} for B_k ∈ B we have
P[(ξ_1, ..., ξ_ℓ) ∈ B_1 × ⋯ × B_ℓ|ℰ] = ∏_{k=1}^ℓ P(ξ_k ∈ B_k|ℰ),
which implies that conditional on ℰ the R.V.-s {ξ_k} are mutually independent (see
Proposition 1.4.21). Further, E[g(ξ_1)I_A] = E[g(ξ_r)I_A] for any A ∈ ℰ, bounded
Borel g(·), positive integer r and exchangeable variables ξ_k(ω) = ω_k, from which it
follows that conditional on ℰ these R.V.-s are also identically distributed. □
We conclude this section with exercises detailing further applications of RMG-s:
to the study of certain U-statistics, to solving the ballot problem, and in the
context of mixing conditions.
Exercise 5.5.28. Suppose {ξ_k} are i.i.d. random variables and h : ℝ^2 → ℝ a
Borel function such that E[|h(ξ_1, ξ_2)|] < ∞. For each m ≥ 2 let
W_{−2m} = (1/(m(m − 1))) ∑_{1≤i≠j≤m} h(ξ_i, ξ_j).
For example, note that W_{−2m} = (1/(m − 1)) ∑_{k=1}^m (ξ_k − m^{−1} ∑_{i=1}^m ξ_i)^2 is of this form,
corresponding to h(x, y) = (x − y)^2/2.
(a) Show that W_n = E[h(ξ_1, ξ_2)|F^W_n] for n ≤ 0, hence (W_n, F^W_n) is a RMG, and determine its almost sure limit as n → −∞.
(b) Assuming in addition that v = E[h(ξ_1, ξ_2)^2] is finite, find the limit of E[W_n^2] as n → −∞.
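The almost sure limit of part (a) can be previewed numerically with the sample-variance kernel of the example. In the following Python sketch (an added illustration with i.i.d. Uniform(0,1) samples), the U-statistic for h(x, y) = (x − y)^2/2 reduces to the unbiased sample variance and approaches E[h(ξ_1, ξ_2)] = Var(ξ_1) = 1/12:

```python
import random

def u_stat(sample):
    # Order-2 U-statistic with kernel h(x, y) = (x - y)^2 / 2; summing
    # over ordered pairs i != j and dividing by m(m - 1) reduces, after
    # expanding the square, to the unbiased sample variance.
    m = len(sample)
    mean = sum(sample) / m
    return sum((x - mean) ** 2 for x in sample) / (m - 1)

rng = random.Random(2)
w = u_stat([rng.random() for _ in range(4000)])   # near 1/12 = Var(U(0,1))
```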
Exercise 5.5.29 (The ballot problem). Let S_k = ∑_{i=1}^k ξ_i for i.i.d. integer
valued ξ_j ≥ 0, and for n ≥ 2 consider the event Γ_n = {S_j < j for 1 ≤ j ≤ n}.
(a) Show that X_{−k} = k^{−1} S_k is a RMG for the filtration F_{−k} = σ(S_j, j ≥ k), and that τ = inf{n : X_n ≥ 1} ∧ (−1) is a stopping time for it.
(b) Show that I_{Γ_n} = 1 − X_τ whenever S_n ≤ n, hence P(Γ_n|S_n) = (1 − S_n/n)^+.
The name ballot problem is attached to Exercise 5.5.29 since for ξ_j ∈ {0, 2} we
interpret the 0-s and 2-s as n votes for two candidates A and B in a ballot, with
Γ_n = {A leads B throughout the counting} and P(Γ_n|B gets r votes) = (1 − 2r/n)^+.
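The conditional probability above can be checked by brute force for small n. This Python sketch (an added verification) enumerates all increment sequences in {0, 2}^n, groups them by S_n, and compares the fraction of leading paths with (1 − S_n/n)^+; since all arrangements of a given multiset of increments are equally likely, the count ratio equals the conditional probability:

```python
from fractions import Fraction
from itertools import product

def ballot_check(n):
    # For each value s of S_n, compare the fraction of increment
    # sequences in {0,2}^n satisfying S_j < j for all j <= n against
    # the ballot formula (1 - s/n)^+, using exact rational arithmetic.
    counts = {}
    for xs in product((0, 2), repeat=n):
        s, lead = 0, True
        for j, x in enumerate(xs, start=1):
            s += x
            lead = lead and (s < j)
        tot, good = counts.get(s, (0, 0))
        counts[s] = (tot + 1, good + lead)
    return all(Fraction(good, tot) == max(Fraction(n - s, n), Fraction(0))
               for s, (tot, good) in counts.items())

ok = ballot_check(8) and ballot_check(9)
```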
As you find next, the ballot problem yields explicit formulas for the probability
distributions of the stopping times τ_b = inf{n ≥ 0 : S_n = b} associated with the
srw {S_n}.
Exercise 5.5.30. Let R = inf{ℓ ≥ 1 : S_ℓ = 0} denote the first visit to zero by the
srw {S_n}. Using a path reversal counting argument followed by the ballot problem,
show that for any positive integers n, b,
P(τ_b = n|S_n = b) = P(R > n|S_n = b) = b/n,
and deduce that for any k ≥ 0,
P(τ_b = b + 2k) = b ((b + 2k − 1)!/(k!(k + b)!)) p^{b+k} q^k.
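The closed form can be verified against a direct first-passage computation. Here is a Python sketch (an added check, with illustrative values b = 2, p = 0.6) that propagates the sub-probability law of S_n over paths that have not yet reached level b:

```python
from math import factorial

def first_passage_pmf(b, p, nmax):
    # P(tau_b = n) for the srw with P(step = +1) = p, by propagating
    # the mass of paths that have not yet reached level b.
    q = 1.0 - p
    alive = {0: 1.0}
    pmf = {}
    for n in range(1, nmax + 1):
        nxt, hit = {}, 0.0
        for s, w in alive.items():
            for step, ps in ((1, p), (-1, q)):
                if s + step == b:
                    hit += w * ps
                else:
                    nxt[s + step] = nxt.get(s + step, 0.0) + w * ps
        pmf[n] = hit
        alive = nxt
    return pmf

def tau_formula(b, p, k):
    # b (b + 2k - 1)! / (k! (k + b)!) p^{b+k} q^k from Exercise 5.5.30.
    return (b * factorial(b + 2 * k - 1) / (factorial(k) * factorial(k + b))
            * p ** (b + k) * (1.0 - p) ** k)

pmf = first_passage_pmf(2, 0.6, 12)
```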
Exercise 5.5.31. Show that for any A ∈ F and σ-algebra G ⊆ F,
sup_{B∈G} |P(A ∩ B) − P(A)P(B)| ≤ E[|P(A|G) − P(A)|].
Next, deduce that if G_n ↓ G_∞ as n → ∞ and G_∞ is P-trivial, then
lim_{m→∞} sup_{B∈G_m} |P(A ∩ B) − P(A)P(B)| = 0.
CHAPTER 6
Markov chains
The rich theory of Markov processes is the subject of many text books and one can
easily teach a full course on this subject alone. Thus, we limit ourselves here to
discrete time Markov chains and to their most fundamental properties. Specifically,
in Section 6.1 we provide definitions and examples, and prove the strong Markov
property of such chains. Section 6.2 explores the key concepts of recurrence, transience, invariant and reversible measures, as well as the asymptotic (long time)
behavior for time homogeneous Markov chains of countable state space. These concepts and results are then generalized in Section 6.3 to the class of Harris Markov
chains.
6.1. Canonical construction and the strong Markov property
We start with the definition of a Markov chain.
Definition 6.1.1. Given a filtration {F_n}, an F_n-adapted stochastic process {X_n}
taking values in a measurable space (S, 𝒮) is called an F_n-Markov chain with state
space (S, 𝒮) if for any A ∈ 𝒮,
(6.1.1)   P[X_{n+1} ∈ A|F_n] = P[X_{n+1} ∈ A|X_n]   ∀n, a.s.
Remark. We call {X_n} a Markov chain in case F_n = σ(X_k, k ≤ n), noting that
if {X_n} is an F_n-Markov chain then it is also a Markov chain. Indeed, F^X_n =
σ(X_k, k ≤ n) ⊆ F_n since {X_n} is adapted to F_n, so by the tower property we
have that for any F_n-Markov chain, any A ∈ 𝒮 and all n, almost surely,
P[X_{n+1} ∈ A|F^X_n] = E[E[I_{X_{n+1}∈A}|F_n]|F^X_n] = E[E[I_{X_{n+1}∈A}|X_n]|F^X_n]
= E[I_{X_{n+1}∈A}|X_n] = P[X_{n+1} ∈ A|X_n].
The key object in characterizing an F_n-Markov chain is its transition probabilities, as defined next.
Definition 6.1.2. A set function p : S × 𝒮 → [0, 1] is a transition probability if:
(a) For each x ∈ S, A ↦ p(x, A) is a probability measure on (S, 𝒮).
(b) For each A ∈ 𝒮, x ↦ p(x, A) is a measurable function on (S, 𝒮).
We say that an F_n-Markov chain {X_n} has transition probabilities p_n(x, A) if
almost surely P[X_{n+1} ∈ A|F_n] = p_n(X_n, A) for every n ≥ 0 and every A ∈ 𝒮, and
call it a homogeneous F_n-Markov chain if p_n(x, A) = p(x, A) for all n, x ∈ S and
A ∈ 𝒮.
With b𝒮 ⊆ m𝒮 denoting the collection of all bounded (ℝ, B_ℝ)-valued measurable
mappings on (S, 𝒮), we next express E[h(X_{k+1})|F_k] for h ∈ b𝒮 in terms of the
transition probabilities of the F_n-Markov chain {X_n}.
Lemma 6.1.3. If {X_n} is an F_n-Markov chain with state space (S, 𝒮) and transition probabilities p_n(·,·), then for any h ∈ b𝒮 and all k ≥ 0,
(6.1.2)   E[h(X_{k+1})|F_k] = (p_k h)(X_k),
where h ↦ (p_k h) : b𝒮 → b𝒮 and (p_k h)(x) = ∫ p_k(x, dy) h(y) denotes the Lebesgue
integral of h(·) under the probability measure p_k(x, ·) per fixed x ∈ S.
Proof. Let H ⊆ b𝒮 denote the collection of bounded, measurable ℝ-valued
functions h(·) for which (p_k h)(x) ∈ b𝒮 and (6.1.2) holds for all k ≥ 0. Since p_k(·,·)
are transition probabilities of the chain, I_A ∈ H for all A ∈ 𝒮 (c.f. Definition
6.1.2). Thus, we complete the proof of the lemma upon checking that H satisfies the
conditions of the (bounded version of the) monotone class theorem (i.e. Theorem
1.2.7). To this end, for a constant h we have that p_k h is also constant and evidently
(6.1.2) then holds. Further, with b𝒮 a vector space over ℝ, due to the linearity of
both the conditional expectation on the left side of (6.1.2) and the expectation on
its right side, so is H. Next, suppose h_m ∈ H, h_m ≥ 0 and h_m ↑ h ∈ b𝒮. Then,
by monotone convergence (p_k h_m)(x) ↑ (p_k h)(x) for each x ∈ S and all k ≥ 0. In
particular, with p_k h_m ∈ b𝒮 and p_k h bounded by the bound on h, it follows that
p_k h ∈ b𝒮. Further, by the monotone convergence of conditional expectations and
the boundedness of h(X_{k+1}), also E[h_m(X_{k+1})|F_k] ↑ E[h(X_{k+1})|F_k]. It thus follows
that h ∈ H, and with all conditions of the monotone class theorem holding for H
and the π-system 𝒮, we have that b𝒮 ⊆ H, as stated. □
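For a finite state space, (6.1.2) is simply a matrix-vector product: (p h)(x) = ∑_y p(x, y) h(y) gives E[h(X_{k+1})|X_k = x]. The following Python sketch (an added illustration, with a hypothetical 3-state transition matrix) makes this concrete:

```python
def apply_transition(P, h):
    # (p h)(x) = sum_y p(x, y) h(y): the transition operator acting on a
    # bounded function h over a finite state space {0, ..., len(P) - 1}.
    return [sum(P[x][y] * h[y] for y in range(len(h))) for x in range(len(P))]

# Hypothetical 3-state transition matrix; each row is p(x, .).
P = [[0.5, 0.5, 0.0],
     [0.1, 0.6, 0.3],
     [0.0, 0.2, 0.8]]
ph = apply_transition(P, [1.0, 0.0, 2.0])   # E[h(X_{k+1}) | X_k = x]
```

Note that constant functions are mapped to themselves, matching the first step of the proof above.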
Our construction of product measures extends to products of transition probabilities. Indeed, you should check at this point that the proof of Theorem 1.4.19 easily
adapts to yield the following proposition.
Proposition 6.1.4. Given a σ-finite measure μ_1 on (X, 𝒳) and μ_2 : X × 𝒮 → [0, 1]
such that B ↦ μ_2(x, B) is a probability measure on (S, 𝒮) for each fixed x ∈ X
and x ↦ μ_2(x, B) is measurable on (X, 𝒳) for each fixed B ∈ 𝒮, there exists a
unique σ-finite measure μ on the product space (X × S, 𝒳 × 𝒮), denoted hereafter by
μ = μ_1 ⊗ μ_2, such that
μ(A × B) = ∫_A μ_1(dx) μ_2(x, B),   ∀A ∈ 𝒳, B ∈ 𝒮.
We turn to show how relevant the preceding proposition is for Markov chains.
Proposition 6.1.5. To any σ-finite measure μ on (S, 𝒮) and any sequence of
transition probabilities p_n(·,·) there correspond unique σ-finite measures μ_k =
μ ⊗ p_0 ⊗ ⋯ ⊗ p_{k−1} on (S^{k+1}, 𝒮^{k+1}), k = 1, 2, ..., such that
μ_k(A_0 × ⋯ × A_k) = ∫_{A_0} μ(dx_0) ∫_{A_1} p_0(x_0, dx_1) ⋯ ∫_{A_k} p_{k−1}(x_{k−1}, dx_k)
for any A_i ∈ 𝒮, i = 0, ..., k. If μ is a probability measure, then {μ_k} is a consistent
sequence of probability measures (that is, μ_{k+1}(A × S) = μ_k(A) for any k finite and
A ∈ 𝒮^{k+1}).
Further, if {X_n} is a Markov chain with state space (S, 𝒮), transition probabilities
p_n(·,·) and initial distribution μ(A) = P(X_0 ∈ A) on (S, 𝒮), then for any h_ℓ ∈ b𝒮
and all k ≥ 0,
(6.1.3)   E[∏_{ℓ=0}^k h_ℓ(X_ℓ)] = ∫ μ(dx_0) h_0(x_0) ∫ p_0(x_0, dx_1) h_1(x_1) ⋯ ∫ p_{k−1}(x_{k−1}, dx_k) h_k(x_k),
so in particular, {X_n} has the finite dimensional distributions (f.d.d.)
(6.1.4)   P(X_0 ∈ A_0, ..., X_n ∈ A_n) = μ ⊗ p_0 ⊗ ⋯ ⊗ p_{n−1}(A_0 × ⋯ × A_n).
Proof. Starting at the σ-finite measure μ on (S, 𝒮) and applying Proposition 6.1.4 for μ_2(x, B) = p_0(x, B) on S × 𝒮 yields the σ-finite measure μ_1 =
μ ⊗ p_0 on (S^2, 𝒮^2). Applying this proposition once more, now for the σ-finite measure μ_1 and
μ_2((x_0, x_1), B) = p_1(x_1, B) for x = (x_0, x_1) ∈ S × S, yields the σ-finite measure
μ_2 = μ ⊗ p_0 ⊗ p_1 on (S^3, 𝒮^3), and upon repeating this procedure k times we arrive
at the σ-finite measure μ_k = μ ⊗ p_0 ⊗ ⋯ ⊗ p_{k−1} on (S^{k+1}, 𝒮^{k+1}). Since p_n(x, S) = 1
for all n and x ∈ S, it follows that if μ is a probability measure, so are the μ_k, which by
construction are also consistent.
Suppose next that the Markov chain {X_n} has transition probabilities p_n(·,·) and
initial distribution μ. Fixing k and h_ℓ ∈ b𝒮, we have by the tower property and
(6.1.2) that
E[∏_{ℓ=0}^k h_ℓ(X_ℓ)] = E[∏_{ℓ=0}^{k−1} h_ℓ(X_ℓ) E(h_k(X_k)|F^X_{k−1})] = E[∏_{ℓ=0}^{k−1} h_ℓ(X_ℓ) (p_{k−1}h_k)(X_{k−1})].
Further, with p_{k−1}h_k ∈ b𝒮 (see Lemma 6.1.3), also h_{k−1}(p_{k−1}h_k) ∈ b𝒮, and we
get (6.1.3) by induction on k, starting at Eh_0(X_0) = ∫ μ(dx_0)h_0(x_0). The formula
(6.1.4) for the f.d.d. is merely the special case of (6.1.3) corresponding to indicator
functions h_ℓ = I_{A_ℓ}. □
Remark 6.1.6. Using (6.1.1) we deduce from Exercise 4.4.5 that any F_n-Markov
chain with a B-isomorphic state space has transition probabilities. We proceed to
define the law of such a Markov chain and, building on Proposition 6.1.5, show that
it is uniquely determined by the initial distribution and transition probabilities of
the chain.
Definition 6.1.7. The law of a Markov chain {X_n} with a B-isomorphic state space
(S, 𝒮) and initial distribution μ is the unique probability measure P_μ on (S^∞, 𝒮_c)
with S^∞ = S^{ℤ_+}, per Corollary 1.4.25, with the specified f.d.d.
P_μ({s : s_i ∈ A_i, i = 0, ..., n}) = P(X_0 ∈ A_0, ..., X_n ∈ A_n)
for A_i ∈ 𝒮. We denote by P_x the law P_μ in case μ(A) = I_{x∈A} (i.e. when X_0 = x
is non-random).
Remark. Definition 6.1.7 provides the (joint) law for any stochastic process {X_n}
with a B-isomorphic state space (that is, it applies for any sequence of (S, 𝒮)-valued
R.V. on the same probability space).
Here is our canonical construction of Markov chains out of their transition probabilities and initial distributions.
Theorem 6.1.8. If (S, 𝒮) is B-isomorphic, then to any collection of transition
probabilities p_n : S × 𝒮 → [0, 1] and any probability measure μ on (S, 𝒮) there
corresponds a Markov chain Y_n(s) = s_n on the measurable space (S^∞, 𝒮_c) with
state space (S, 𝒮), transition probabilities p_n(·,·), initial distribution μ and f.d.d.
(6.1.5)   P_μ({s : (s_0, ..., s_k) ∈ A}) = μ ⊗ p_0 ⊗ ⋯ ⊗ p_{k−1}(A),   ∀A ∈ 𝒮^{k+1}, k < ∞.
Remark. In particular, this construction implies that for any probability measure
μ on (S, 𝒮) and all A ∈ 𝒮_c,
(6.1.6)   P_μ(A) = ∫ μ(dx) P_x(A).
We shall use the latter identity as an alternative definition for P_μ, one that is applicable
even for a non-finite initial measure μ (namely, when μ(S) = ∞), noting that if μ is
σ-finite then P_μ is also the unique σ-finite measure on (S^∞, 𝒮_c) for which (6.1.5)
holds (see the remark following Corollary 1.4.25).
Proof. The given transition probabilities p_n(·,·) and probability measure μ
on (S, 𝒮) determine the consistent probability measures μ_k = μ ⊗ p_0 ⊗ ⋯ ⊗ p_{k−1}
per Proposition 6.1.5 and thereby, via Corollary 1.4.25, yield the stochastic process
Y_n(s) = s_n on (S^∞, 𝒮_c), of law P_μ, state space (S, 𝒮) and f.d.d. {μ_k}. Taking k = 0
in (6.1.5) confirms that its initial distribution is indeed μ. Further, fixing k ≥ 0
finite, let Y = (Y_0, ..., Y_k) and note that for any A ∈ 𝒮^{k+1} and B ∈ 𝒮,
E[I_{Y∈A} I_{Y_{k+1}∈B}] = μ_{k+1}(A × B) = ∫_A μ_k(dy) p_k(y_k, B) = E[I_{Y∈A} p_k(Y_k, B)]
(where the first and last equalities are due to (6.1.5)). Consequently, for any B ∈ 𝒮
and k ≥ 0 finite, p_k(Y_k, B) is a version of the C.E. E[I_{Y_{k+1}∈B}|F^Y_k] for F^Y_k =
σ(Y_0, ..., Y_k), thus showing that {Y_n} is a Markov chain of transition probabilities
p_n(·,·). □
Remark. Conversely, given a Markov chain {X_n} of state space (S, 𝒮), applying this construction for its transition probabilities and initial distribution yields a
Markov chain {Y_n} that has the same law as {X_n}. To see this, recall (6.1.4) that
the f.d.d. of a Markov chain are uniquely determined by its transition probabilities and initial distribution, and further, for a B-isomorphic state space, the f.d.d.
uniquely determine the law P_μ of the corresponding stochastic process. For this
reason we consider (S^∞, 𝒮_c, P_μ) to be the canonical probability space for Markov
chains, with X_n(ω) = ω_n given by the coordinate maps.
The evaluation of the f.d.d. of a Markov chain is considerably more explicit when
the state space S is a countable set (in which case 𝒮 = 2^S), as then
p_n(x, A) = ∑_{y∈A} p_n(x, y)
for any A ⊆ S, so the transition probabilities are determined by p_n(x, y) ≥ 0 such
that ∑_{y∈S} p_n(x, y) = 1 for all n and x ∈ S (and all Lebesgue integrals are in this case
merely sums). In particular, if S is a finite set and the chain is homogeneous, then
identifying S with {1, ..., m} for some m < ∞, we view p(x, y) as the (x, y)-th entry
of an m × m dimensional transition probability matrix, and express probabilities of
interest in terms of powers of the latter matrix.
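For instance, the n-step law μP^n agrees with summing the f.d.d. of (6.1.4) over the intermediate states. A Python sketch (an added illustration with a hypothetical two-state matrix) for n = 2:

```python
def mat_mul(A, B):
    # Plain matrix product, enough to form powers of a small
    # transition probability matrix.
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

P = [[0.9, 0.1],
     [0.4, 0.6]]          # hypothetical transition matrix
mu = [1.0, 0.0]           # start at state 0

P2 = mat_mul(P, P)
# P(X_2 = y) via mu P^2 ...
via_power = [sum(mu[x] * P2[x][y] for x in range(2)) for y in range(2)]
# ... and via summing (6.1.4) over the intermediate state x_1.
via_paths = [sum(mu[x0] * P[x0][x1] * P[x1][y]
                 for x0 in range(2) for x1 in range(2)) for y in range(2)]
```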
For homogeneous Markov chains whose state space is S = ℝ^d (or a product of
closed intervals thereof), equipped with the corresponding Borel σ-algebra, computations are relatively explicit when for each x ∈ S the transition probability p(x, ·)
is absolutely continuous with respect to (the completion of) Lebesgue measure on
S. Its non-negative Radon-Nikodym derivative p(x, y) is then called the transition
probability kernel of the chain. In this case (ph)(x) = ∫ h(y)p(x, y)dy, and the right
side of (6.1.4) amounts to iterated integrations of the kernel p(x, y) with respect to
Lebesgue measure on S.
Here are a few homogeneous Markov chains of considerable interest in probability
theory and its applications.
Example 6.1.9 (Random walk). The random walk S_n = S_0 + ∑_{k=1}^n ξ_k, where {ξ_k}
are i.i.d. ℝ^d-valued random variables that are also independent of S_0, is an example
of a homogeneous Markov chain. Indeed, S_{n+1} = S_n + ξ_{n+1} with ξ_{n+1} independent
of F^S_n = σ(S_0, ..., S_n). Hence, P[S_{n+1} ∈ A|F^S_n] = P[S_n + ξ_{n+1} ∈ A|S_n]. With
ξ_{n+1} having the same law as ξ_1, we thus get that P[S_n + ξ_{n+1} ∈ A|S_n] = p(S_n, A)
for the transition probabilities p(x, A) = P(ξ_1 ∈ {y − x : y ∈ A}) (c.f. Exercise
4.2.2) and the state space S = ℝ^d (with its Borel σ-algebra).
Example 6.1.10 (Branching process). Another homogeneous Markov chain
is the branching process {Z_n} of Definition 5.5.1, having the countable state space
S = {0, 1, 2, ...} (and the σ-algebra 𝒮 = 2^S). The transition probabilities are in this
case p(x, A) = P(∑_{j=1}^x N_j ∈ A) for integer x ≥ 1, and p(0, A) = I_{0∈A}.
Example 6.1.11 (Renewal Markov chain). Suppose q_k ≥ 0 and ∑_{k=1}^∞ q_k = 1.
Taking S = {0, 1, 2, ...} (and 𝒮 = 2^S), a homogeneous Markov chain with transition
probabilities p(0, j) = q_{j+1} for j ≥ 0 and p(i, i − 1) = 1 for i ≥ 1 is called a renewal
chain.
As you are now to show, in a renewal (Markov) chain {X_n} the value of X_n is
the amount of time from n to the first of the (integer valued) renewal times {T_k}
in [n, ∞), where τ_m = T_m − T_{m−1} are i.i.d. and P(τ_1 = j) = q_j (compare with
Example 2.3.7).
Exercise 6.1.12. Suppose {τ_k} are i.i.d. positive integer valued random variables
with P(τ_1 = j) = q_j. Let T_m = T_0 + ∑_{k=1}^m τ_k for a non-negative integer valued random
variable T_0 which is independent of {τ_k}.
(a) Show that N_ℓ = inf{k ≥ 0 : T_k ≥ ℓ}, ℓ = 0, 1, ..., are finite stopping times for the filtration G_n = σ(T_0, τ_k, k ≤ n).
(b) Show that for each fixed non-random ℓ, the random variable τ_{N_ℓ+1} is independent of the stopped σ-algebra G_{N_ℓ} and has the same law as τ_1.
(c) Let X_n = min{(T_k − n)^+ : T_k ≥ n}. Show that X_{n+1} = X_n + τ_{N_n+1} I_{X_n=0} − 1 is a homogeneous Markov chain whose transition probabilities are given in Example 6.1.11.
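The recursion of part (c) can be checked deterministically against the definition of X_n. In the following Python sketch (an added check; the gaps are drawn once and both formulas are evaluated on the same renewal times), the two computations coincide at every n:

```python
import random

rng = random.Random(3)
taus = [rng.randint(1, 5) for _ in range(60)]    # i.i.d. gaps tau_k >= 1
T = [0]                                           # renewal times, T_0 = 0
for t in taus:
    T.append(T[-1] + t)

horizon = T[-1] - 1
# Direct definition: X_n = min{(T_k - n)^+ : T_k >= n}.
direct = [min(t - n for t in T if t >= n) for n in range(horizon + 1)]

# Recursion X_{n+1} = X_n + tau_{N_n + 1} I_{X_n = 0} - 1: at a renewal
# (X_n = 0) the next gap is added before the unit countdown.
rec = [direct[0]]
for n in range(horizon):
    x = rec[-1]
    if x == 0:
        x += taus[T.index(n)]    # T.index(n) = N_n, the renewal index at n
    rec.append(x - 1)
```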
Example 6.1.13 (Birth and death chain). A homogeneous Markov chain {X_n}
whose state space is S = {0, 1, 2, ...} and for which X_{n+1} − X_n ∈ {−1, 0, 1} is called
a birth and death chain.
Exercise 6.1.14 (Bayesian estimator). Let Θ and {U_k} be independent random
variables, each of which is uniformly distributed on (0, 1). Let S_n = ∑_{k=1}^n X_k for
X_k = sgn(Θ − U_k). That is, first pick Θ according to the uniform distribution, then
generate a srw {S_n} with each of its increments being +1 with probability Θ and
−1 otherwise.
(a) Compute P(X_{n+1} = 1|X_1, ..., X_n).
(b) Show that {S_n} is a Markov chain. Is it a homogeneous chain?
Exercise 6.1.15 (First order auto-regressive process). The first order
auto-regressive process {X_k} is defined via X_n = αX_{n−1} + ξ_n for n ≥ 1, where α is
a non-random scalar constant and {ξ_k} are i.i.d. ℝ^d-valued random variables that
are independent of X_0.
(a) With F_n = σ(X_0, ξ_k, k ≤ n), verify that {X_n} is a homogeneous F_n-Markov chain of state space S = ℝ^d (equipped with its Borel σ-algebra), and provide its transition probabilities.
(b) Suppose |α| < 1 and X_0 = γξ_0 for a non-random scalar γ, with each ξ_k having the multivariate normal distribution N(0, V) of zero mean and covariance matrix V. Find the values of γ for which the law of X_n is independent of n.
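Part (b) can be previewed in the scalar case d = 1 (an added sketch): the variance recursion Var(X_n) = α² Var(X_{n−1}) + V has the fixed point V/(1 − α²), so starting from Var(X_0) = γ²V with γ² = 1/(1 − α²) keeps the centered Gaussian law of X_n constant in n, while any other start only converges to it:

```python
def ar1_variances(alpha, v_noise, v0, n):
    # Var(X_k) for X_k = alpha X_{k-1} + xi_k with Var(xi) = v_noise,
    # starting from Var(X_0) = v0 (scalar case, d = 1).
    out = [v0]
    for _ in range(n):
        out.append(alpha ** 2 * out[-1] + v_noise)
    return out

alpha, v = 0.8, 1.0
v_star = v / (1.0 - alpha ** 2)               # the gamma^2 = 1/(1 - alpha^2) start
stationary = ar1_variances(alpha, v, v_star, 50)
transient = ar1_variances(alpha, v, 0.0, 50)  # converges to v_star instead
```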
As we see in the sequel, our next result, the strong Markov property, is extremely
useful. It applies to any homogeneous Markov chain with a B-isomorphic state
space and allows us to handle expectations of random variables shifted by any
stopping time with respect to the canonical filtration of the chain.
Proposition 6.1.16 (Strong Markov property). Consider a canonical probability space (S^∞, 𝒮_c, P_μ), a homogeneous Markov chain X_n(ω) = ω_n constructed
on it via Theorem 6.1.8, its canonical filtration F^X_n = σ(X_k, k ≤ n) and the shift
operator θ : S^∞ ↦ S^∞ such that (θω)_k = ω_{k+1} for all k ≥ 0 (with the corresponding
iterates (θ^n ω)_k = ω_{k+n} for k, n ≥ 0). Then, for any h_n ∈ b𝒮_c with sup_{n,ω} |h_n(ω)|
finite, and any F^X_n-stopping time τ,
(6.1.7)   E_μ[h_τ(θ^τ ω)|F^X_τ] I_{τ<∞} = E_{X_τ}[h_τ] I_{τ<∞}.
Remark. Here F^X_τ is the stopped σ-algebra associated with the stopping time τ
(c.f. Definition 5.1.34) and E_μ (or E_x) indicates expectation taken with respect
to P_μ (respectively, P_x). Both sides of (6.1.7) are set to zero when τ(ω) = ∞,
and otherwise the right hand side is g(n, x) = E_x[h_n] evaluated at n = τ(ω) and
x = X_{τ(ω)}(ω).
The strong Markov property is a significant extension of the Markov property:
(6.1.8)   E_μ[h(θ^n ω)|F^X_n] = E_{X_n}[h],
holding almost surely for any non-negative integer n and fixed h ∈ b𝒮_c (that is,
the identity (6.1.7) with τ = n non-random). This in turn generalizes Lemma 6.1.3,
where (6.1.8) is proved in the special case of h(ω_1) and h ∈ b𝒮.
Proof. We first prove (6.1.8) for h(ω) = ∏_{ℓ=0}^k g_ℓ(ω_ℓ) with g_ℓ ∈ bS, ℓ = 0, …, k.
To this end, fix B ∈ S^{n+1} and recall that μ_m = μ ⊗ p_0 ⊗ ⋯ ⊗ p_{m−1} are the f.d.d. for
P_μ. Consequently, by (6.1.3) and the definition of θ^n,
E_μ[h(θ^n ω) I_B(ω_0, …, ω_n)] = μ_{n+k}[ I_B(x_0, …, x_n) ∏_{ℓ=0}^k g_ℓ(x_{ℓ+n}) ]
= μ_n[ I_B(x_0, …, x_n) g_0(x_n) ∫ p(x_n, dy_1) g_1(y_1) ⋯ ∫ p(y_{k−1}, dy_k) g_k(y_k) ]
= E_μ[ I_B(X_0, …, X_n) E_{X_n}(h) ] .
This holds for all B ∈ S^{n+1}, which by definition of the conditional expectation
amounts to (6.1.8).
The collection H ⊆ bS_c of bounded, measurable h : S^∞ → R for which (6.1.8)
holds clearly contains the constant functions and is a vector space over R (by
linearity of the expectation and the conditional expectation). Moreover, by the
monotone convergence theorem for conditional expectations, if h_m ∈ H are non-
negative and h_m ↑ h which is bounded, then also h ∈ H. Taking in the preceding
g_ℓ = I_{B_ℓ} we see that I_A ∈ H for any A in the π-system P of cylinder sets (i.e.
whenever A = {ω : ω_0 ∈ B_0, …, ω_k ∈ B_k} for some k finite and B_ℓ ∈ S). We thus
deduce by the (bounded version of the) monotone class theorem that H = bS_c, the
collection of all bounded functions on S^∞ that are measurable with respect to the
σ-algebra S_c generated by P.
Having established the Markov property (6.1.8), fixing h_n ∈ bS_c and an F^X_n-
stopping time τ, we proceed to prove (6.1.7) by decomposing both sides of the
latter identity according to the value of τ. Specifically, the bounded random
variables Y_n = h_n(θ^n ω) are integrable and, applying (6.1.8) for h = h_n, we have that
E_μ[Y_n | F^X_n] = g(n, X_n). Hence, by part (c) of Exercise 5.1.35, for any finite integer
k ≥ 0,
E_μ[h_τ(θ^τ ω) I_{τ=k} | F^X_τ] = g(k, X_k) I_{τ=k} = g(τ, X_τ) I_{τ=k} .
The identity (6.1.7) is then established by taking out the F^X_τ-measurable indicator
of {τ = k} and summing over k = 0, 1, … (where the finiteness of sup_{n,ω} |h_n(ω)|
provides the required integrability).
Exercise 6.1.17. Modify the last step of the proof of Proposition 6.1.16 to show
that (6.1.7) holds as soon as ∑_k E_{X_k}[|h_k|] I_{τ=k} is P_μ-integrable.
Here are a few applications of the Markov and strong Markov properties.
Exercise 6.1.18. Consider a homogeneous Markov chain X_n with B-isomorphic
state space (S, S). Fixing B_l ∈ S, let Γ_n = ⋃_{l>n} {X_l ∈ B_l} and Γ = {X_l ∈ B_l i.o.}.
(a) Using the Markov property and Lévy's upward theorem (Theorem 5.3.15),
show that P(Γ_n | X_n) → I_Γ a.s.
(b) Show that P(X_n ∈ A_n i.o.) = 0 for any A_n ∈ S such that for
some ε > 0 and all n, with probability one,
P(Γ_n | X_n) ≥ ε I_{X_n ∈ A_n} .
(c) Suppose A, B ∈ S are such that P_x(X_l ∈ B for some l ≥ 1) ≥ ε for some
ε > 0 and all x ∈ A. Deduce that
P({X_n ∈ A finitely often} ∪ {X_n ∈ B i.o.}) = 1 .
Exercise 6.1.19 (Reflection principle). Consider a symmetric random walk
S_n = ∑_{k=1}^n ξ_k, that is, ξ_k are i.i.d. real-valued and such that ξ_1 and −ξ_1 have
the same law. With ω_n = S_n, use the strong Markov property for the stopping time
τ = inf{k ≤ n : ω_k > b} and h_k(ω) = I_{ω_{n−k} > b} to show that for any b > 0,
P(max_{k≤n} S_k > b) ≤ 2 P(S_n > b) .
Derive also the following, more precise result for the symmetric srw, where for any
integer b > 0,
P(max_{k≤n} S_k ≥ b) = 2 P(S_n > b) + P(S_n = b) .
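The second identity can be confirmed by exhaustive enumeration of all 2^n sign paths of the symmetric srw (a brute-force verification sketch with arbitrarily chosen small n and b; not part of the exercise):

```python
from itertools import product

def walk_probs(n, b):
    """Exact P(max_{k<=n} S_k >= b), P(S_n > b), P(S_n = b), symmetric srw."""
    count_max, count_gt, count_eq = 0, 0, 0
    for signs in product((1, -1), repeat=n):   # each path has probability 2^-n
        s, m = 0, 0
        for step in signs:
            s += step
            m = max(m, s)
        count_max += (m >= b)
        count_gt += (s > b)
        count_eq += (s == b)
    total = 2 ** n
    return count_max / total, count_gt / total, count_eq / total

lhs, tail, atom = walk_probs(10, 3)   # reflection: lhs == 2*tail + atom
```

Both sides count the same collection of paths, so the equality here is exact, not merely approximate.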
The concept of an invariant measure for a homogeneous Markov chain, which we now
introduce, plays an important role in our study of such chains throughout Sections
6.2 and 6.3.
Definition 6.1.20. A measure μ on (S, S) such that μ(S) > 0 is called a positive
or non-zero measure. An event A ∈ S_c is called shift invariant if A = θ^{−1}A (i.e.
A = {ω : θ(ω) ∈ A}), and a positive measure ν on (S^∞, S_c) is called shift invariant
if ν ∘ θ^{−1}(·) = ν(·) (i.e. ν(A) = ν({ω : θ(ω) ∈ A}) for all A ∈ S_c). We say
that a stochastic process X_n with a B-isomorphic state space (S, S) is (strictly)
stationary if its joint law is shift invariant. A positive σ-finite measure μ on
a B-isomorphic space (S, S) is called an invariant measure for a transition
probability p(·, ·) if it defines via (6.1.6) a shift invariant measure P_μ(·). In particular,
starting at X_0 chosen according to an invariant probability measure results in
a stationary Markov chain X_n.
Lemma 6.1.21. Suppose a σ-finite measure μ and transition probability p_0(·, ·) on
(S, S) are such that μ ⊗ p_0(S × A) = μ(A) for any A ∈ S. Then, for all k ≥ 1 and
A ∈ S^{k+1},
μ ⊗ p_0 ⊗ ⋯ ⊗ p_k(S × A) = μ ⊗ p_1 ⊗ ⋯ ⊗ p_k(A) .
Proof. Our assumption that μ((p_0 f)) = μ(f) for f = I_A and any A ∈ S
extends by the monotone class theorem to all f ∈ bS. Fixing A_i ∈ S and k ≥ 1,
let f_k(x) = I_{A_0}(x) p_1 ⊗ ⋯ ⊗ p_k(x, A_1 × ⋯ × A_k) (where p_1 ⊗ ⋯ ⊗ p_k(x, ·) are the
probability measures of Proposition 6.1.5 in case μ = δ_x is the probability measure
supported on the singleton {x} and p_0(y, {y}) = 1 for all y ∈ S). Since (p_j h) ∈ bS
for any h ∈ bS and j ≥ 1 (see Lemma 6.1.3), it follows that f_k ∈ bS as well.
Further, μ(f_k) = μ ⊗ p_1 ⊗ ⋯ ⊗ p_k(A) for A = A_0 × A_1 × ⋯ × A_k. By the same
reasoning also
μ((p_0 f_k)) = ∫_S μ(dy) ∫_{A_0} p_0(y, dx) p_1 ⊗ ⋯ ⊗ p_k(x, A_1 × ⋯ × A_k) = μ ⊗ p_0 ⊗ ⋯ ⊗ p_k(S × A) .
Thus, the stated identity holds for the π-system of product sets A = A_0 × ⋯ × A_k
which generates S^{k+1}, and since μ ⊗ p_1 ⊗ ⋯ ⊗ p_k(B_n × S^k) = μ(B_n) < ∞ for some
B_n ↑ S, this identity extends to all of S^{k+1} (see the remark following Proposition
1.1.39).
Remark 6.1.22. Let μ_k = μ ⊗ p^{⊗k} denote the σ-finite measures of Proposition 6.1.5
in case p_n(·, ·) = p(·, ·) for all n (with μ_0 = μ). Specializing Lemma 6.1.21 to
this setting, we see that if μ_1(S × A) = μ_0(A) for any A ∈ S then μ_{k+1}(S × A) = μ_k(A)
for all k ≥ 0 and A ∈ S^{k+1}.
Building on the preceding remark we next characterize the invariant measures for
a given transition probability.
Proposition 6.1.23. A positive σ-finite measure μ(·) on B-isomorphic (S, S) is an
invariant measure for transition probability p(·, ·) if and only if μ ⊗ p(S × A) = μ(A)
for all A ∈ S.
Proof. With μ a positive σ-finite measure, so are the measures P_μ and P_μ ∘ θ^{−1}
on (S^∞, S_c), which for a B-isomorphic space (S, S) are uniquely determined by
their finite dimensional distributions (see the remark following Corollary 1.4.25).
By (6.1.5) the f.d.d. of P_μ are the σ-finite measures μ_k(A) = μ ⊗ p^{⊗k}(A) for A ∈ S^{k+1}
and k = 0, 1, … (where μ_0 = μ). By definition, the corresponding f.d.d. of
P_μ ∘ θ^{−1} are μ_{k+1}(S × A). Therefore, a positive σ-finite measure μ is an invariant
measure for p(·, ·) if and only if μ_{k+1}(S × A) = μ_k(A) for any non-negative integer
k and A ∈ S^{k+1}, which by Remark 6.1.22 is equivalent to μ ⊗ p(S × A) = μ(A) for
all A ∈ S.
6.2. Markov chains with countable state space
Throughout this section we restrict our attention to homogeneous Markov chains
X_n on a countable (finite or infinite) state space S, setting as usual S = 2^S and
p(x, y) = P_x(X_1 = y) for the corresponding transition probabilities. Noting that
such chains admit the canonical construction of Theorem 6.1.8 since their state
space is B-isomorphic (c.f. Proposition 1.4.27 for M = S equipped with the metric
d(x, y) = 1_{x≠y}), we start with a few useful consequences of the Markov and strong
Markov properties that apply for any homogeneous Markov chain on a countable
state space.
Proposition 6.2.1 (Chapman-Kolmogorov). For any x, y ∈ S and non-negative
integers k ≤ n,
(6.2.1)  P_x(X_n = y) = ∑_{z∈S} P_x(X_k = z) P_z(X_{n−k} = y) .
Proof. Using the canonical construction of the chain whereby X_n(ω) = ω_n,
we combine the tower property with the Markov property for h(ω) = I_{ω_{n−k} = y},
followed by a decomposition according to the value z of X_k, to get that
P_x(X_n = y) = E_x[h(θ^k ω)] = E_x[ E_x[h(θ^k ω) | F^X_k] ] = E_x[E_{X_k}(h)]
= ∑_{z∈S} P_x(X_k = z) E_z(h) .
This concludes the proof, as E_z(h) = P_z(X_{n−k} = y).
Remark. The Chapman-Kolmogorov equations of (6.2.1) are a concrete special
case of the more general Chapman-Kolmogorov semi-group representation p^n =
p^k p^{n−k} of the n-step transition probabilities p^n(x, y) = P_x(X_n = y). See [Dyn65]
for more on this representation, which is at the core of the analytic treatment of
general Markov chains and processes (and beyond our scope).
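For a finite state space the semi-group representation is simply matrix multiplication: with P[x][y] = p(x, y), the n-step transition probabilities form the matrix power P^n, and (6.2.1) is the identity P^n = P^k P^{n−k}. A small sketch (the 3-state kernel below is an arbitrary illustration, not from the text):

```python
def mat_mul(A, B):
    """Product of two square matrices stored as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(P, n):
    """n-th power of a square matrix, starting from the identity."""
    R = [[float(i == j) for j in range(len(P))] for i in range(len(P))]
    for _ in range(n):
        R = mat_mul(R, P)
    return R

P = [[0.5, 0.5, 0.0],
     [0.2, 0.3, 0.5],
     [0.0, 0.6, 0.4]]
lhs = mat_pow(P, 7)                          # the 7-step kernel P^7
rhs = mat_mul(mat_pow(P, 3), mat_pow(P, 4))  # P^3 P^4, i.e. (6.2.1) with k = 3
```

The two matrices agree entry-wise (up to floating-point rounding), and every row of P^7 remains a probability vector.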
We proceed to derive some results about first hitting times of subsets of the state
space by the Markov chain, where by convention we use τ_A = inf{n ≥ 0 : X_n ∈ A}
in case the initial state matters, and the strictly positive T_A = inf{n ≥ 1 : X_n ∈ A}
when it does not, with τ_y = τ_{{y}} and T_y = T_{{y}}. To this end, we start with the first
entrance decomposition of {X_n = y} according to the value of T_y (which serves as
an alternative to the Chapman-Kolmogorov decomposition of the same event via
the value in S of X_k).
Exercise 6.2.2 (First entrance decomposition).
For a homogeneous Markov chain X_n on (S, S), let T_{y,r} = inf{n ≥ r : X_n = y}
(so T_y = T_{y,1} and τ_y = T_{y,0}).
(a) Show that for any x, y ∈ S, B ∈ S and positive integers r ≤ n,
P_x(X_n ∈ B, T_{y,r} ≤ n) = ∑_{k=0}^{n−r} P_x(T_{y,r} = n − k) P_y(X_k ∈ B) .
(b) Deduce that in particular,
P_x(X_n = y) = ∑_{k=r}^{n} P_x(T_{y,r} = k) P_y(X_{n−k} = y) .
(c) Conclude that for any y ∈ S and non-negative integers r, ℓ,
∑_{j=0}^{ℓ} P_y(X_j = y) ≥ ∑_{n=r}^{ℓ+r} P_y(X_n = y) .
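A numerical sanity check of part (b) with r = 1 (an illustration with an arbitrary 3-state kernel, not from the text): the first-passage probabilities P_x(T_y = k) are obtained by propagating mass that has so far avoided y, and the first-entrance sum is compared with the direct n-step probability.

```python
def first_entrance_check(P, x, y, n):
    """Return (P_x(X_n = y), first-entrance decomposition of the same number)."""
    m = len(P)
    # dist[z] = P_x(X_k = z, T_y > k): propagate while forbidding visits to y
    dist = [float(z == x) for z in range(m)]
    hit = []                                   # hit[k-1] = P_x(T_y = k)
    for _ in range(n):
        hit.append(sum(dist[z] * P[z][y] for z in range(m)))
        dist = [0.0 if w == y else sum(dist[z] * P[z][w] for z in range(m))
                for w in range(m)]
    # direct probabilities P_x(X_n = y) and P_y(X_j = y) for j = 0..n
    px = [float(z == x) for z in range(m)]
    py = [float(z == y) for z in range(m)]
    pys = [py[y]]
    for _ in range(n):
        px = [sum(px[z] * P[z][w] for z in range(m)) for w in range(m)]
        py = [sum(py[z] * P[z][w] for z in range(m)) for w in range(m)]
        pys.append(py[y])
    decomposition = sum(hit[k - 1] * pys[n - k] for k in range(1, n + 1))
    return px[y], decomposition

P = [[0.1, 0.9, 0.0],
     [0.4, 0.2, 0.4],
     [0.5, 0.0, 0.5]]
direct, decomposed = first_entrance_check(P, x=0, y=2, n=6)
```

The two returned numbers coincide, which is exactly the identity of part (b).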
In contrast, here is an application of the last entrance decomposition.
Exercise 6.2.3 (Last entrance decomposition). Show that for a homogeneous
Markov chain X_n on state space (S, S), all x, y ∈ S, B ∈ S and n ≥ 1,
P_x(X_n ∈ B, T_y ≤ n) = ∑_{k=0}^{n−1} P_x(X_{n−k} = y) P_y(X_k ∈ B, T_y > k) .
Hint: With L_n = max{1 ≤ ℓ ≤ n : X_ℓ = y} denoting the last visit to y by the
chain during 1, …, n, observe that {T_y ≤ n} is the union of the disjoint events
{L_n = n − k}, k = 0, …, n − 1.
Next, we express certain hitting probabilities for Markov chains in terms of
harmonic functions for these chains.
Definition 6.2.4. Extending Definition 5.1.25, we say that f : S → R which is
either bounded below or bounded above is super-harmonic for the transition
probability p(x, y) at x ∈ S when f(x) ≥ ∑_{y∈S} p(x, y)f(y). Likewise, f(·) is sub-
harmonic at x when this inequality is reversed, and harmonic at x in case an equality
holds. Such a function is called super-harmonic (or sub-harmonic, harmonic,
respectively) for p(·, ·) (or for the corresponding chain X_n), if it is super-harmonic
(or sub-harmonic, harmonic, respectively) at all x ∈ S. Equivalently, f(·) which is
either bounded below or bounded above is harmonic provided f(X_n) is a martingale
whenever the initial distribution of the chain is such that f(X_0) is integrable.
Similarly, f(·) bounded below is super-harmonic if f(X_n) is a super-martingale
whenever f(X_0) is integrable.
Exercise 6.2.5. Suppose S \ C is finite, inf_{x∉C} P_x(τ_C < ∞) > 0 and A ⊆ C,
B = C \ A are both non-empty.
(a) Show that there exist N < ∞ and ε > 0 such that P_y(τ_C > kN) ≤ (1 − ε)^k
for all k ≥ 1 and y ∈ S.
(b) Show that g(x) = P_x(τ_A < τ_B) is harmonic at every x ∉ C.
(c) Show that if a bounded function g(·) is harmonic at every x ∉ C then
g(X_{n∧τ_C}) is a martingale.
(d) Deduce that g(x) = P_x(τ_A < τ_B) is the only bounded function harmonic
at every x ∉ C for which g(x) = 1 when x ∈ A and g(x) = 0 when x ∈ B.
(e) Show that if f : S → R_+ satisfies f(x) = 1 + ∑_{y∈S} p(x, y)f(y) at every
x ∉ C then f(X_{n∧τ_C}) + n ∧ τ_C is a martingale. Deduce that if in addition
f(x) = 0 for x ∈ C then f(x) = E_x τ_C.
The next exercise demonstrates a few of the many interesting explicit formulas one
may find for finite state Markov chains.
Exercise 6.2.6. Throughout, X_n is a Markov chain on S = {0, 1, …, N} of
transition probability p(x, y).
(a) Use induction to show that in case N = 1, p(0, 1) = α and p(1, 0) = β
such that α + β > 0,
P_μ(X_n = 0) = β/(α + β) + (1 − α − β)^n ( μ(0) − β/(α + β) ) .
(b) Fixing μ(0) and θ_1 ≠ θ_0 non-random, suppose α = β and, conditional
on {X_n}, the variables B_k are independent Bernoulli(θ_{X_k}). Evaluate the
mean and variance of the additive functional S_n = ∑_{k=1}^n B_k.
(c) Verify that E_x[(X_n − N/2)] = (1 − 2/N)^n (x − N/2) for the Ehrenfest chain
whose transition probabilities are p(x, x − 1) = x/N = 1 − p(x, x + 1).
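Parts (a) and (c) can be checked exactly by iterating the corresponding kernels (a verification sketch; the parameter values α = 0.3, β = 0.5, μ(0) = 0.9, N = 6 are arbitrary choices, not from the text):

```python
def step(dist, P):
    """One step of the chain: push the row vector dist through the kernel P."""
    m = len(P)
    return [sum(dist[z] * P[z][w] for z in range(m)) for w in range(m)]

# part (a): two states with p(0,1) = alpha, p(1,0) = beta
alpha, beta, mu0, n = 0.3, 0.5, 0.9, 8
P2 = [[1 - alpha, alpha], [beta, 1 - beta]]
dist = [mu0, 1 - mu0]
for _ in range(n):
    dist = step(dist, P2)
prob0 = dist[0]
formula_a = beta / (alpha + beta) \
    + (1 - alpha - beta) ** n * (mu0 - beta / (alpha + beta))

# part (c): Ehrenfest chain on {0,...,N}: p(x, x-1) = x/N, p(x, x+1) = 1 - x/N
N, x0 = 6, 1
PE = [[0.0] * (N + 1) for _ in range(N + 1)]
for x in range(N + 1):
    if x > 0:
        PE[x][x - 1] = x / N
    if x < N:
        PE[x][x + 1] = 1 - x / N
dist = [float(z == x0) for z in range(N + 1)]
for _ in range(n):
    dist = step(dist, PE)
mean_shift = sum(dist[z] * (z - N / 2) for z in range(N + 1))
formula_c = (1 - 2 / N) ** n * (x0 - N / 2)
```

In both cases the iterated value matches the closed-form expression.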
6.2.1. Classification of states, recurrence and transience. We start
with the partition of the countable state space of a homogeneous Markov chain
into its intercommunicating (equivalence) classes, as defined next.
Definition 6.2.7. Let ρ_{xy} = P_x(T_y < ∞) denote the probability that starting at x
the chain eventually visits the state y. State y is said to be accessible from state
x ≠ y if ρ_{xy} > 0 (or alternatively, we then say that x leads to y). Two states x ≠ y,
each accessible from the other, are said to intercommunicate, denoted by x ↔ y. A
non-empty collection of states C ⊆ S is called irreducible if each two states in
C intercommunicate, and closed if there is no y ∉ C and x ∈ C such that y is
accessible from x.
Remark. Evidently an irreducible set C may be a non-closed set and vice versa.
For example, if p(x, y) > 0 for any x, y ∈ S then S \ {z} is irreducible and non-closed
(for any z ∈ S). More generally, adopting hereafter the convention that x ↔ x,
any non-empty proper subset of an irreducible set is irreducible and non-closed.
Conversely, when there exists y ∈ S such that p(x, y) = 0 for all x ∈ S \ {y}, then S
is closed and reducible. More generally, a closed set that has a closed proper subset
is reducible. Note however the following elementary properties.
Exercise 6.2.8.
(a) Show that if ρ_{xy} > 0 and ρ_{yz} > 0 then also ρ_{xz} > 0.
(b) Deduce that intercommunication is an equivalence relation (that is,
x ↔ x, if x ↔ y then also y ↔ x, and if both x ↔ y and y ↔ z then also
x ↔ z).
(c) Explain why its equivalence classes partition S into maximal irreducible
sets such that the directed graph indicating which class leads to which
is both transitive (i.e. if C_1 leads to C_2 and C_2 leads to C_3 then also C_1
leads to C_3) and acyclic (i.e. if C_1 leads to C_2 then C_2 does not lead to
C_1).
For our study of the qualitative behavior of such chains we further classify each
state as either a transient state, visited only finitely many times by the chain, or a
recurrent state, to which the chain returns with certainty (infinitely many times)
once it has been reached by the chain. To this end, we make use of the following
formal definition and key proposition.
Definition 6.2.9. A state y ∈ S is called recurrent (or persistent) if ρ_{yy} = 1 and
transient if ρ_{yy} < 1.
Proposition 6.2.10. With T^0_y = 0, let T^k_y = inf{n > T^{k−1}_y : X_n = y} for k ≥ 1
denote the time of the k-th return to state y ∈ S (so T^1_y = T_y > 0 regardless of X_0).
Then, for any x, y ∈ S and k ≥ 1,
(6.2.2)  P_x(T^k_y < ∞) = ρ_{xy} ρ_{yy}^{k−1} .
Further, let N_∞(y) denote the number of visits to state y by the Markov chain at
positive times. Then, E_x N_∞(y) = ρ_{xy}/(1 − ρ_{yy}) is positive if and only if ρ_{xy} > 0, in
which case it is finite when y is transient and infinite when y is recurrent.
Proof. The identity (6.2.2) is merely the observation that starting at x, in
order to have k visits to y, one has first to reach y and then to have k − 1 consecutive
returns to y. More formally, the event {T_y < ∞} = ⋃_n {T_y ≤ n} is in S_c, so fixing
k ≥ 2, the strong Markov property applies for the stopping time τ = T^{k−1}_y and the
indicator function h = I_{T_y<∞}. Further, τ < ∞ implies that h(θ^τ ω) = I_{T^k_y<∞}(ω)
and X_τ = y, so E_{X_τ} h = P_y(T_y < ∞) = ρ_{yy}. Combining the tower property with
the strong Markov property we thus find that
P_x(T^k_y < ∞) = E_x[h(θ^τ ω) I_{τ<∞}] = E_x[ E_x[h(θ^τ ω) | F^X_τ] I_{τ<∞} ]
= E_x[ρ_{yy} I_{τ<∞}] = ρ_{yy} P_x(T^{k−1}_y < ∞) ,
and (6.2.2) follows by induction on k, starting with the trivial case k = 1.
Next note that if the chain makes at least k visits to state y, then the k-th return
to y occurs at finite time, and vice versa. That is, {T^k_y < ∞} = {N_∞(y) ≥ k}, and
from the identity (6.2.2) we get that
(6.2.3)  E_x N_∞(y) = ∑_{k=1}^∞ P_x(N_∞(y) ≥ k) = ∑_{k=1}^∞ P_x(T^k_y < ∞)
= ∑_{k=1}^∞ ρ_{xy} ρ_{yy}^{k−1} = ρ_{xy}/(1 − ρ_{yy}) when ρ_{xy} > 0, and = 0 when ρ_{xy} = 0,
as claimed.
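The identity E_x N_∞(y) = ρ_{xy}/(1 − ρ_{yy}) can be checked numerically on a small chain with transient states (an illustration with an assumed kernel, not from the text; for this kernel ρ_{01} = 3/8, ρ_{11} = 1/4, so both sides equal 1/2):

```python
def hitting_prob(P, x, y, n_max=400):
    """Approximate rho_xy = P_x(T_y < infinity) via the first-passage sum."""
    m = len(P)
    dist = [float(z == x) for z in range(m)]   # mass that has avoided y so far
    total = 0.0
    for _ in range(n_max):
        total += sum(dist[z] * P[z][y] for z in range(m))   # step into y now
        dist = [0.0 if w == y else sum(dist[z] * P[z][w] for z in range(m))
                for w in range(m)]
    return total

def expected_visits(P, x, y, n_max=400):
    """Truncated E_x N(y) = sum_{n>=1} P_x(X_n = y); converges for transient y."""
    m = len(P)
    dist = [float(z == x) for z in range(m)]
    total = 0.0
    for _ in range(n_max):
        dist = [sum(dist[z] * P[z][w] for z in range(m)) for w in range(m)]
        total += dist[y]
    return total

# states 0, 1 are transient, state 2 is absorbing (illustrative kernel)
P = [[0.2, 0.3, 0.5],
     [0.4, 0.1, 0.5],
     [0.0, 0.0, 1.0]]
lhs = expected_visits(P, 0, 1)                              # E_0 N(1)
rhs = hitting_prob(P, 0, 1) / (1 - hitting_prob(P, 1, 1))   # rho_01/(1 - rho_11)
```

The truncation errors decay geometrically here, so the two sides agree to high precision.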
In the same spirit as the preceding proof, you next show that successive returns to
the same state by a Markov chain are renewal times.
Exercise 6.2.11. Fix a recurrent state y ∈ S of a Markov chain X_n. Let R_k =
T^k_y and r_k = R_k − R_{k−1} be the number of steps between consecutive returns to y.
(a) Deduce from the strong Markov property that under P_y the random
vectors Y_k = (r_k, X_{R_{k−1}}, …, X_{R_k−1}) for k = 1, 2, … are independent and
identically distributed.
(b) Show that for any probability measure μ, under P_μ and conditional on the
event {T_y < ∞}, the random vectors Y_k are independent of each other,
and further Y_k has the same law as Y_2 for all k ≥ 2, with Y_2 having then
the law of Y_1 under P_y.
Here is a direct consequence of Proposition 6.2.10.
Corollary 6.2.12. Each of the following characterizes a recurrent state y:
(a) ρ_{yy} = 1;
(b) P_y(T^k_y < ∞) = 1 for all k;
(c) P_y(X_n = y, i.o.) = 1;
(d) P_y(N_∞(y) = ∞) = 1;
(e) E_y N_∞(y) = ∞.
Proof. Considering (6.2.2) for x = y we have that (a) implies (b). Given
(b) we have w.p.1 that X_{n_k} = y for infinitely many n_k = T^k_y, k = 1, 2, …,
which is (c). Clearly, the events in (c) and (d) are identical, and evidently (d)
implies (e). To complete the proof simply note that if ρ_{yy} < 1 then by (6.2.3)
E_y N_∞(y) = ρ_{yy}/(1 − ρ_{yy}) is finite.
We are ready for the main result of this section, a decomposition of the recurrent
states into disjoint irreducible closed sets.
Theorem 6.2.13 (Decomposition theorem). A countable state space S of a
homogeneous Markov chain can be partitioned uniquely as
S = T ∪ R_1 ∪ R_2 ∪ ⋯
where T is the set of transient states and the R_i are disjoint, irreducible closed sets
of recurrent states with ρ_{xy} = 1 whenever x, y ∈ R_i.
Remark. An alternative statement of the decomposition theorem is that for any
pair of recurrent states ρ_{xy} = ρ_{yx} ∈ {0, 1}, while ρ_{xy} = 0 if x is recurrent and y
is transient (so x ↦ {y ∈ S : ρ_{xy} > 0} induces a unique partition of the recurrent
states into irreducible closed sets).
Proof. Suppose x ↔ y. Then, ρ_{xy} > 0 implies that P_x(X_K = y) > 0 for
some finite K, and ρ_{yx} > 0 implies that P_y(X_L = x) > 0 for some finite L. By the
Chapman-Kolmogorov equations we have for any integer n ≥ 0,
(6.2.4)  P_x(X_{K+n+L} = x) = ∑_{z,v∈S} P_x(X_K = z) P_z(X_n = v) P_v(X_L = x)
≥ P_x(X_K = y) P_y(X_n = y) P_y(X_L = x) .
As E_y N_∞(y) = ∑_{n=1}^∞ P_y(X_n = y), summing the preceding inequality over n ≥ 1
we find that E_x N_∞(x) ≥ c E_y N_∞(y) with c = P_x(X_K = y) P_y(X_L = x) positive.
If x is a transient state then E_x N_∞(x) is finite (see Corollary 6.2.12), hence the
same applies for y. Reversing the roles of x and y, we conclude that any two
intercommunicating states x and y are either both transient or both recurrent.
More generally, an irreducible set of states C is either transient (i.e. every x ∈ C
is transient) or recurrent (i.e. every x ∈ C is recurrent).
We thus consider the unique partition of S into (disjoint) maximal irreducible
equivalence classes of ↔ (see Exercise 6.2.8), with R_i denoting those equivalence
classes that are recurrent, and proceed to show that if x is recurrent and ρ_{xy} > 0
for y ≠ x, then ρ_{yx} = 1. The latter implies that any y accessible from x ∈ R_i must
intercommunicate with x, so with R_i a maximal irreducible set, necessarily such y
is also in R_i. We thus conclude that each R_i is closed, with ρ_{xy} = 1 whenever
x, y ∈ R_i, as claimed.
To complete the proof, fix a state y ≠ x that is accessible by the chain from
a recurrent state x, noting that then L = inf{n ≥ 1 : P_x(X_n = y) > 0} is
finite. Further, because L is the minimal such value, there exist y_0 = x, y_L = y
and y_i ≠ x for 1 ≤ i ≤ L such that ∏_{k=1}^L p(y_{k−1}, y_k) > 0. Consequently, if
P_y(T_x = ∞) = 1 − ρ_{yx} > 0, then
P_x(T_x = ∞) ≥ ∏_{k=1}^L p(y_{k−1}, y_k) (1 − ρ_{yx}) > 0 ,
in contradiction of the assumption that x is recurrent.
The decomposition theorem motivates the following definition, as an irreducible
chain is either a recurrent chain or a transient chain.
Definition 6.2.14. A homogeneous Markov chain is called an irreducible Markov
chain (or in short, irreducible), if S is irreducible; a recurrent Markov chain (or in
short, recurrent), if every x ∈ S is recurrent; and a transient Markov chain (or in
short, transient), if every x ∈ S is transient.
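For a finite state space, the decomposition of Theorem 6.2.13 can be computed mechanically (an illustrative sketch, not part of the text): the communicating classes are found from mutual reachability, and since a finite closed irreducible set is recurrent (see Proposition 6.2.15 below), a class is recurrent exactly when it is closed.

```python
def decompose(P):
    """Return (recurrent classes, transient states) of a finite-state kernel."""
    m = len(P)
    reach = [[i == j or P[i][j] > 0 for j in range(m)] for i in range(m)]
    for k in range(m):                 # transitive closure (Floyd-Warshall style)
        for i in range(m):
            for j in range(m):
                reach[i][j] = reach[i][j] or (reach[i][k] and reach[k][j])
    classes, seen = [], set()
    for i in range(m):
        if i in seen:
            continue
        cls = {j for j in range(m) if reach[i][j] and reach[j][i]}
        seen |= cls
        classes.append(cls)
    # a communicating class is recurrent iff it is closed (finite state space)
    recurrent = [c for c in classes
                 if all(j in c for i in c for j in range(m) if reach[i][j])]
    covered = set().union(*recurrent) if recurrent else set()
    return recurrent, sorted(set(range(m)) - covered)

# transient state 0 leading to the two closed classes {1} and {2, 3}
P = [[0.5, 0.25, 0.25, 0.0],
     [0.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 0.0, 1.0],
     [0.0, 0.0, 1.0, 0.0]]
rec, trans = decompose(P)
```

For the sample kernel this recovers the partition S = T ∪ R_1 ∪ R_2 with T = {0}, R_1 = {1}, R_2 = {2, 3}.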
By definition, once the chain enters a closed set it remains forever in this set.
Hence, if X_0 ∈ R_i we may as well take R_i to be the whole state space. The case
of X_0 ∈ T is more involved, for then the chain either remains forever in the set of
transient states, or eventually lies in the first irreducible set of recurrent states
it entered. As we next show, the first of these possibilities does not occur when T
(or S) is finite (and any irreducible chain of finite state space is recurrent).
Proposition 6.2.15. If F is a finite set of transient states then for any initial
distribution μ, P_μ(X_n ∈ F i.o.) = 0. Hence, any finite closed set C contains at least
one recurrent state, and if C is also irreducible then C is recurrent.
Proof. Let N_∞(F) = ∑_{y∈F} N_∞(y) denote the totality of positive time the
chain spends in a set F. If F is a finite set of transient states then by Proposition
6.2.10 and linearity of the expectation, E_x N_∞(F) is finite, hence P_x(N_∞(F) =
∞) = 0. With S countable and x arbitrary, it follows that P_μ(N_∞(F) = ∞) = 0
for any initial distribution μ. This is precisely our first claim (as N_∞(F) is infinite
if and only if X_n ∈ F for infinitely many values of n). If C is a closed set then,
starting at x ∈ C, the chain stays in C forever. Thus, P_x(N_∞(C) = ∞) = 1 and, to
not contradict our first claim, if such C is finite then it must contain at least one
recurrent state, which is our second claim. Finally, while proving the decomposition
theorem we showed that if an irreducible set contains a recurrent state then all its
states are recurrent, thus yielding our third and last claim.
We proceed to study the recurrence versus transience of states for some
homogeneous Markov chains we have encountered in Section 6.1. To this end, starting
with the branching process, we make use of the following definition.
Definition 6.2.16. If a singleton {x} is a closed set of a homogeneous Markov
chain, then we call x an absorbing state for the chain. Indeed, once the chain visits
an absorbing state it remains there (so an absorbing state is recurrent).
Example 6.2.17 (Branching processes). By our definition of the branching
process Z_n we have that 0 is an absorbing state (as p(0, 0) = 1, hence ρ_{0k} = 0
for all k ≥ 1). If P(N = 0) > 0 then clearly ρ_{k0} ≥ p(k, 0) = P(N = 0)^k > 0 and
ρ_{kk} ≤ 1 − ρ_{k0} < 1 for all k ≥ 1, so all states other than 0 are transient.
Exercise 6.2.18. Suppose a homogeneous Markov chain X_n with state space
S = {0, 1, …, N} is a martingale for any initial distribution.
(a) Show that 0 and N are absorbing states, that is, p(0, 0) = p(N, N) = 1.
(b) Show that if also P_x(τ_{{0,N}} < ∞) > 0 for all x then all other states are
transient and ρ_{xN} = P_x(τ_N < τ_0) = x/N.
(c) Check that this applies for the symmetric srw on S (with absorption at
0 and N), in which case also E_x τ_{{0,N}} = x(N − x).
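Parts (b) and (c) for the symmetric srw reduce to two finite difference equations, which can be checked pointwise (a verification sketch for the assumed value N = 10): g(x) = x/N is harmonic off the boundary with g(0) = 0, g(N) = 1, which pins down ρ_{xN}, and f(x) = x(N − x) satisfies the mean-exit-time equation of part (e) of Exercise 6.2.5.

```python
# Pointwise check on {0,...,N} for the symmetric srw absorbed at 0 and N:
# g(x) = x/N solves g(x) = (g(x-1)+g(x+1))/2, g(0)=0, g(N)=1;
# f(x) = x(N-x) solves f(x) = 1 + (f(x-1)+f(x+1))/2, f(0)=f(N)=0.
N = 10
g = [x / N for x in range(N + 1)]
f = [x * (N - x) for x in range(N + 1)]
harmonic_ok = all(abs(g[x] - (g[x - 1] + g[x + 1]) / 2) < 1e-12
                  for x in range(1, N))
time_ok = all(abs(f[x] - (1 + (f[x - 1] + f[x + 1]) / 2)) < 1e-12
              for x in range(1, N))
boundary_ok = (g[0], g[N], f[0], f[N]) == (0.0, 1.0, 0, 0)
```

All three flags come out true, confirming the stated formulas for this N.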
Example 6.2.19 (Renewal Markov chain). The renewal Markov chain of Example
6.1.11 has p(i, i − 1) = 1 for i ≥ 1, so evidently ρ_{i0} = 1 for all i ≥ 1 and
hence also ρ_{00} = 1, namely 0 is a recurrent state. Recall that p(0, j) = q_{j+1}, so if
{k : q_k > 0} is unbounded, then ρ_{0j} > 0 for all j, so the only closed set containing
0 is S = Z_+. Consequently, in this case the renewal chain is recurrent. If on the
other hand K = sup{k : q_k > 0} < ∞, then R = {0, 1, …, K − 1} is an irreducible
closed set of recurrent states and all other states are transient. Indeed, starting at
any positive integer j, this chain enters its recurrent class of states after at most j
steps and stays there forever.
Your next exercise pursues another approach to the classification of states,
expressing the return probabilities ρ_{xx} in terms of limiting values of certain
generating functions. Applying this approach to the asymmetric srw on the integers
provides us with an example of a transient (irreducible) chain.
Exercise 6.2.20. Given a homogeneous Markov chain of countable state space S
and x ∈ S, consider for −1 < s < 1 the generating functions f(s) = E_x[s^{T_x}] and
u(s) = ∑_{k≥0} E_x[s^{T^k_x}] = ∑_{n≥0} P_x(X_n = x) s^n .
(a) Show that u(s) = u(s)f(s) + 1.
(b) Show that u(s) ↑ 1 + E_x[N_∞(x)] as s ↑ 1, while f(s) ↑ ρ_{xx}, and deduce
that E_x[N_∞(x)] = ρ_{xx}/(1 − ρ_{xx}).
(c) Consider the srw on Z with p(i, i + 1) = p and p(i, i − 1) = q = 1 − p.
Show that in this case u(s) = (1 − 4pqs²)^{−1/2} is independent of the initial
state x.
Hint: Recall that (1 − t)^{−1/2} = ∑_{m=0}^∞ \binom{2m}{m} 2^{−2m} t^m for any 0 ≤ t < 1.
(d) Deduce that the srw on Z has ρ_{xx} = 2 min(p, q) for all x, so for 0 < p < 1,
p ≠ 1/2 this irreducible chain is transient, whereas for p = 1/2 it is
recurrent.
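The conclusion of part (d) can be illustrated numerically (a sketch with the assumed value p = 0.7, not part of the exercise): for p ≠ 1/2, the first-return probability P_0(T_0 ≤ n) converges geometrically fast to ρ_{xx} = 2 min(p, q), here 0.6.

```python
def return_prob(p, n_max):
    """P_0(T_0 <= n_max) for the srw on Z with up-probability p, by exact DP."""
    q = 1.0 - p
    dist = {0: 1.0}          # position -> probability of paths not yet returned
    total = 0.0
    for _ in range(n_max):
        new = {}
        for pos, pr in dist.items():
            for move, w in ((1, p), (-1, q)):
                npos = pos + move
                if npos == 0:
                    total += pr * w      # first return to the origin
                else:
                    new[npos] = new.get(npos, 0.0) + pr * w
        dist = new
    return total

rho = return_prob(0.7, 200)   # close to 2*min(p, q) = 0.6
```

For p = 1/2 the same truncation converges only at rate n^{−1/2} to the limit 1, consistent with recurrence but null tail behavior.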
Our next proposition explores a powerful method for proving recurrence of an
irreducible chain by the construction of super-harmonic functions (per Definition
6.2.4).
Proposition 6.2.21. Suppose S is irreducible for a chain X_n and there exists
h : S → [0, ∞) with finite level sets G_r = {x : h(x) < r} that is super-harmonic at
every x ∈ S \ G_r for this chain and some finite r. Then, the chain X_n is recurrent.
Proof. If S is finite then the chain is recurrent by Proposition 6.2.15.
Assuming hereafter that S is infinite, fix r_0 large enough so that the finite set F = G_{r_0}
is non-empty and h(·) is super-harmonic at every x ∉ F. By Proposition 6.2.15 and
part (c) of Exercise 6.1.18 (for B = F = S \ A), if P_x(τ_F < ∞) = 1 for all x ∈ S then F
contains at least one recurrent state, so by irreducibility of S the chain is recurrent,
as claimed. Proceeding to show that P_x(τ_F < ∞) = 1 for all x ∈ S, fix r > r_0 and
let C = C_r = F ∪ (S \ G_r). Note that h(·) is super-harmonic at every x ∉ C, hence
h(X_{n∧τ_C}) is a non-negative sup-MG under P_x for any x ∈ S. Further, S \ C is a
subset of G_r, hence a finite set, so it follows by irreducibility of S that P_x(τ_C < ∞) = 1
for all x ∈ S (see part (a) of Exercise 6.2.5). Consequently, from Proposition 5.3.8
we get that
h(x) ≥ E_x h(X_{τ_C}) ≥ r P_x(τ_C < τ_F)
(since h(X_{τ_C}) ≥ r when τ_C < τ_F). Thus,
P_x(τ_F < ∞) ≥ P_x(τ_F ≤ τ_C) ≥ 1 − h(x)/r ,
and taking r → ∞ we deduce that P_x(τ_F < ∞) = 1 for all x ∈ S, as claimed.
Here is a concrete application of Proposition 6.2.21.
Exercise 6.2.22. Suppose S_n is an irreducible random walk on Z with zero-
mean increments ξ_k such that |ξ_k| ≤ r for some finite integer r. Show that S_n
is a recurrent chain.
The following exercises complement Proposition 6.2.21.
Exercise 6.2.23. Suppose that S is irreducible for some homogeneous Markov
chain. Show that this chain is recurrent if and only if the only non-negative super-
harmonic functions for it are the constant functions.
Exercise 6.2.24. Suppose X_n is an irreducible birth and death chain with p_i =
p(i, i + 1), q_i = p(i, i − 1) and r_i = 1 − p_i − q_i = p(i, i) ≥ 0, where p_i and q_i are
positive for i > 0, q_0 = 0 and p_0 > 0. Let
h(m) = ∑_{k=0}^{m−1} ∏_{j=1}^{k} (q_j/p_j) ,
for m ≥ 1 and h(0) = 0.
(a) Check that h(·) is harmonic for the chain at all positive integers.
(b) Fixing a < x < b in S = Z_+, verify that P_x(τ_C < ∞) = 1 for C = {a, b}
and that h(X_{n∧τ_C}) is a bounded martingale under P_x. Deduce that
P_x(T_a < T_b) = (h(b) − h(x)) / (h(b) − h(a)) .
(c) Considering a = 0 and b → ∞, show that the chain is transient if and
only if h(·) is bounded above.
(d) Suppose i(p_i/q_i − 1) → c as i → ∞. Show that the chain is recurrent if
c < 1 and transient if c > 1, so in particular, when p_i = p = 1 − q_i for
all i > 0 the chain is recurrent if and only if p ≤ 1/2.
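The formula of part (b) can be checked against exact absorption dynamics (an illustration with assumed constant rates p_i = 0.6, q_i = 0.3 and lazy r_i = 0.1; not the exercise's solution):

```python
def gambler_check(p, q, a, x, b, n_iter=600):
    """Compare P_x(T_a < T_b) from the h-formula with absorption dynamics."""
    # h(m) = sum_{k=0}^{m-1} prod_{j=1}^{k} q_j/p_j, with h(0) = 0
    h, prod, s = [0.0], 1.0, 1.0
    for m in range(1, b + 1):
        h.append(s)
        prod *= q[m] / p[m]
        s += prod
    formula = (h[b] - h[x]) / (h[b] - h[a])
    # iterate the kernel on (a, b), absorbing mass that steps onto a or b
    dist, hit_a = {x: 1.0}, 0.0
    for _ in range(n_iter):
        new = {}
        for pos, pr in dist.items():
            moves = ((pos + 1, p[pos]), (pos - 1, q[pos]),
                     (pos, 1.0 - p[pos] - q[pos]))
            for npos, w in moves:
                if w <= 0.0:
                    continue
                if npos == a:
                    hit_a += pr * w
                elif npos != b:
                    new[npos] = new.get(npos, 0.0) + pr * w
        dist = new
    return formula, hit_a

b = 8
p = [0.6] * (b + 1)              # p_i = 0.6 (values at endpoints unused)
q = [0.0] + [0.3] * b            # q_i = 0.3 for i >= 1, q_0 = 0
formula, iterated = gambler_check(p, q, a=0, x=3, b=b)
```

With q_j/p_j = 1/2 here, h(m) = 2(1 − 2^{−m}) and both computations give (h(8) − h(3))/h(8) = 31/255.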
6.2.2. Invariant, excessive and reversible measures. Recall Proposition
6.1.23, whereby an invariant measure for the transition probability p(x, y) is uniquely
determined by a non-zero μ : S → [0, ∞) such that
(6.2.5)  μ(y) = ∑_{x∈S} μ(x) p(x, y) ,  ∀y ∈ S .
To simplify our notations, we thus regard such a function μ as the corresponding
invariant measure. Similarly, we say that μ : S → [0, ∞) is a finite, positive, or
probability measure, when ∑_x μ(x) is finite (positive, or equals one, respectively),
and call {x : μ(x) > 0} the support of the measure μ.
Definition 6.2.25. Relaxing the notion of invariance, we say that a non-zero μ :
S → [0, ∞] is an excessive measure if
μ(y) ≥ ∑_{x∈S} μ(x) p(x, y) ,  ∀y ∈ S .
Example 6.2.26. Some chains do not have any invariant measure. For example,
in a birth and death chain with p_i = 1, i ≥ 0, the identity (6.2.5) is merely μ(0) = 0
and μ(i) = μ(i − 1) for i ≥ 1, whose only solution is the zero function. However,
the totally asymmetric srw on Z with p(x, x + 1) = 1 at every integer x has the
invariant measure μ(x) = 1, although just as in the preceding birth and death chain
all its states are transient, with the only closed set being the whole state space.
Nevertheless, as we show next, to every recurrent state corresponds an invariant
measure.
Proposition 6.2.27. Let T_z denote the possibly infinite return time to a state z
by a homogeneous Markov chain X_n. Then,
μ_z(y) = E_z[ ∑_{n=0}^{T_z−1} I_{X_n=y} ] ,
is an excessive measure for X_n, the support of which is the closed set of all states
accessible from z. If z is recurrent then μ_z(·) is an invariant measure, whose support
is the closed and recurrent equivalence class of z.
Remark. We have by the second claim of Proposition 6.2.15 (for the closed set S)
that any chain with a finite state space has at least one recurrent state. Further,
recall that any invariant measure is σ-finite, which for a finite state space amounts
to being a finite measure. Hence, by Proposition 6.2.27, any chain with a finite state
space has at least one invariant probability measure.
Example 6.2.28. For a transient state z the excessive measure μ_z(y) may be
infinite at some y ∈ S. For example, the transition probability p(x, 0) = 1 for all
x ∈ S = {0, 1} has 0 as an absorbing (recurrent) state and 1 as a transient state,
with T_1 = ∞ and μ_1(1) = 1 while μ_1(0) = ∞.
Proof. Using the canonical construction of the chain, we set
h_k(ω, y) = ∑_{n=0}^{T_z(ω)−1} I_{ω_{n+k}=y} ,
so that μ_z(y) = E_z h_0(ω, y). By the tower property and the Markov property of
the chain,
E_z h_1(ω, y) = E_z[ ∑_{n=0}^∞ I_{T_z>n} I_{X_{n+1}=y} ∑_{x∈S} I_{X_n=x} ]
= ∑_{x∈S} ∑_{n=0}^∞ E_z[ I_{T_z>n} I_{X_n=x} P_z(X_{n+1} = y | F^X_n) ]
= ∑_{x∈S} ∑_{n=0}^∞ E_z[ I_{T_z>n} I_{X_n=x} ] p(x, y) = ∑_{x∈S} μ_z(x) p(x, y) .
The key to the proof is the observation that if ω_0 = z then h_0(ω, y) ≥ h_1(ω, y) for
any y ∈ S, with equality when y ≠ z or T_z(ω) < ∞ (in which case ω_{T_z(ω)} = ω_0).
Consequently, for any state y,
μ_z(y) = E_z h_0(ω, y) ≥ E_z h_1(ω, y) = ∑_{x∈S} μ_z(x) p(x, y) ,
with equality when y ≠ z or z is recurrent (in which case P_z(T_z < ∞) = 1).
By definition μ_z(z) = 1, so μ_z(·) is an excessive measure. Iterating the preceding
inequality k times, we further deduce that μ_z(y) ≥ ∑_x μ_z(x) P_x(X_k = y) for any
k ≥ 1 and y ∈ S, with equality when z is recurrent. If ρ_{zy} = 0 then clearly
μ_z(y) = 0, while if ρ_{zy} > 0 then P_z(X_k = y) > 0 for some k finite, hence μ_z(y) ≥
μ_z(z) P_z(X_k = y) > 0. The support of μ_z is thus the closed set of states accessible
from z, which for z recurrent is its equivalence class. Finally, note that if x ↔ z
then P_x(X_k = z) > 0 for some k finite, so 1 = μ_z(z) ≥ μ_z(x) P_x(X_k = z), implying
that μ_z(x) < ∞. That is, if z is recurrent then μ_z is a σ-finite, positive invariant
measure, as claimed.
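As a numerical companion to Proposition 6.2.27 (an illustration with an arbitrary irreducible 3-state kernel, not from the text), the occupation measure of the excursion from z can be accumulated exactly as in the proof, and then checked to satisfy μ_z P = μ_z:

```python
def occupation_measure(P, z, n_iter=2000):
    """mu_z(y) = E_z[ sum_{n=0}^{T_z - 1} I_{X_n = y} ], by truncated summation."""
    m = len(P)
    mu = [float(y == z) for y in range(m)]     # the n = 0 term: one visit to z
    dist = [float(y == z) for y in range(m)]   # P_z(X_n = ., no return to z yet)
    for _ in range(n_iter):
        dist = [0.0 if w == z else sum(dist[y] * P[y][w] for y in range(m))
                for w in range(m)]
        for w in range(m):
            mu[w] += dist[w]
    return mu

# arbitrary irreducible 3-state kernel (illustration only)
P = [[0.1, 0.6, 0.3],
     [0.5, 0.0, 0.5],
     [0.2, 0.7, 0.1]]
mu = occupation_measure(P, z=0)
mu_P = [sum(mu[x] * P[x][y] for x in range(3)) for y in range(3)]
```

Here μ_z(0) = 1 by construction, and μ_z and μ_z P agree coordinate-wise up to the (geometrically small) truncation error.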
What about uniqueness of the invariant measure for a given transition probability?
By definition, the set of invariant measures for p(·, ·) is a convex cone (that is, if μ_1
and μ_2 are invariant measures, possibly the same, then for any positive c_1 and c_2
the measure c_1μ_1 + c_2μ_2 is also invariant). Thus, hereafter we say that the invariant
measure is unique whenever it is unique up to multiplication by a positive constant.
The first negative result in this direction comes from Proposition 6.2.27. Indeed,
the invariant measures μ_z and μ_x are clearly mutually singular (and in particular,
not constant multiples of each other) whenever the two recurrent states x and z do
not intercommunicate. In contrast, your next exercise yields a positive result: the
invariant measure supported within each recurrent equivalence class of states is
unique (and given by Proposition 6.2.27).
Exercise 6.2.29. Suppose μ : S → (0, ∞) is a strictly positive invariant measure
for the transition probability p(·, ·) of a Markov chain X_n on the countable set S.
(a) Verify that q(x, y) = μ(y)p(y, x)/μ(x) is a transition probability on S.
(b) Verify that if ν : S → [0, ∞) is an excessive measure for p(·, ·) then
h(x) = ν(x)/μ(x) is super-harmonic for q(·, ·).
(c) Show that if p(·, ·) is irreducible and recurrent, then so is q(·, ·). Deduce
from Exercise 6.2.23 that then h(x) is a constant function, hence ν(x) =
cμ(x) for some c > 0 and all x ∈ S.
Proposition 6.2.30. If $R$ is a recurrent equivalence class of states then the invariant measure whose support is contained in $R$ is unique (and has $R$ as its support). In particular, the invariant measure of an irreducible, recurrent chain is unique (up to multiplication by a constant) and strictly positive.
Proof. Recall from the decomposition theorem that $R$ is closed, hence the restriction of $p(\cdot,\cdot)$ to $R$ is also a transition probability, and when considering invariant measures supported within $R$ we may as well take $S = R$. That is, hereafter we assume that the chain is recurrent. In this case we have by Proposition 6.2.27 a strictly positive invariant measure $\mu = \mu_z$ on $S = R$. To complete the proof recall the conclusion of Exercise 6.2.29, that any $\sigma$-finite excessive measure (and in particular any invariant measure) is then a constant multiple of $\mu$. □
Propositions 6.2.27 and 6.2.30 provide a complete picture of the invariant measures supported outside the set $T$ of transient states, as the convex cone generated by the mutually singular, unique invariant measures $\mu_z(\cdot)$ supported on each closed recurrent equivalence class. Complementing it, your next exercise shows that an invariant measure must be zero at all transient states that lead to at least one recurrent state, and if it is positive at some $v \in T$ then it is also positive at any $y \in T$ accessible from $v$.
Exercise 6.2.31. Let $\mu(\cdot)$ be an invariant measure for a Markov chain $\{X_k\}$ on $S$.
(a) Iterating (6.2.5) verify that $\mu(y) = \sum_x \mu(x) P_x(X_k = y)$ for all $k \ge 1$ and $y \in S$.
(b) Deduce that if $\mu(v) > 0$ for some $v \in S$ then $\mu(y) > 0$ for any $y$ accessible from $v$.
(c) Show that if $R$ is a recurrent equivalence class then $\mu(x) p(x,y) = 0$ for all $x \notin R$ and $y \in R$.
Hint: Exercise 6.2.29 may be handy here.
(d) Deduce that if such $R$ is accessible from $v \notin R$ then $\mu(v) = 0$.
We complete our discussion of (non-)uniqueness of the invariant measure with an example of a transient chain having two strictly positive invariant measures that are not constant multiples of each other.
Example 6.2.32 (srw on Z). Consider the srw, a homogeneous Markov chain with state space $\mathbb{Z}$ and transition probability $p(x, x+1) = 1 - p(x, x-1) = p$ for some $0 < p < 1$. You can easily verify that both the counting measure $\nu(x) \equiv 1$ and $\nu_0(x) = (p/(1-p))^x$ are invariant measures for this chain, with $\nu_0$ a constant multiple of $\nu$ only in the symmetric case $p = 1/2$. Recall from Exercise 6.2.20 that this chain is transient for $p \ne 1/2$ and recurrent for $p = 1/2$, and observe that neither $\nu$ nor $\nu_0$ is a finite measure. Indeed, as we show in the sequel, a finite invariant measure of a Markov chain must be zero at all transient states.
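Both invariance claims in Example 6.2.32 are easy to check numerically. The following sketch (my addition, not part of the text) verifies the balance identity $\nu(y) = \sum_x \nu(x) p(x,y)$ for the counting measure and for $\nu_0(x) = (p/(1-p))^x$ on a window of states, using that only $y-1$ and $y+1$ lead to $y$:

```python
# Invariance check for the srw on Z with p(x, x+1) = p, p(x, x-1) = 1 - p.
# Only x = y - 1 and x = y + 1 contribute to sum_x nu(x) p(x, y).
p = 0.3  # an asymmetric case, p != 1/2

def nu(x):
    return 1.0                      # counting measure nu(x) = 1

def nu0(x):
    return (p / (1 - p)) ** x       # geometric measure nu_0(x) = (p/(1-p))^x

def invariance_gap(m, y):
    # |m(y) - sum_x m(x) p(x, y)|: mass enters y from y-1 (prob p) and y+1 (prob 1-p)
    return abs(m(y) - (m(y - 1) * p + m(y + 1) * (1 - p)))

max_gap = max(invariance_gap(m, y) for m in (nu, nu0) for y in range(-5, 6))
```

Since neither measure has finite total mass, this is only a pointwise check of the balance equations, matching the text's observation that no invariant probability measure exists here.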
Remark. Evidently, having a uniform (or counting) invariant measure (i.e. $\nu(x) \equiv c > 0$ for all $x \in S$), as in the preceding example, is equivalent to the transition probability being doubly stochastic, that is, $\sum_{x\in S} p(x,y) = 1$ for all $y \in S$.
Example 6.2.32 motivates our next subject, which is the conditions under which a Markov chain is reversible, starting with the relevant definitions.
244 6. MARKOV CHAINS
Definition 6.2.33. A non-zero $\nu: S \to [0,\infty)$ is called a reversible measure for the transition probability $p(\cdot,\cdot)$ if the detailed balance relation $\nu(x) p(x,y) = \nu(y) p(y,x)$ holds for all $x, y \in S$. We say that a transition probability $p(\cdot,\cdot)$ (or the corresponding Markov chain) is reversible if it has a reversible measure.
Remark. Every reversible measure is an invariant measure, for summing the detailed balance relation over $x \in S$ yields the identity (6.2.5), but there are non-reversible invariant measures. For example, the uniform invariant measure of a doubly stochastic transition probability $p(\cdot,\cdot)$ is non-reversible as soon as $p(x,y) \ne p(y,x)$ for some $x, y \in S$. Indeed, for the asymmetric srw of Example 6.2.32 (i.e., when $p \ne 1/2$), the (constant) counting measure $\nu$ is non-reversible while $\nu_0$ is a reversible measure (as you can easily check on your own).
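The remark's doubly stochastic example can be made concrete on a finite state space. In this sketch (my own illustration, not from the text) the biased walk on a cycle of $n$ states is doubly stochastic, so the uniform measure is invariant, yet for $p \ne 1/2$ detailed balance fails:

```python
import numpy as np

n, p = 5, 0.7
P = np.zeros((n, n))
for x in range(n):
    P[x, (x + 1) % n] = p        # step clockwise with probability p
    P[x, (x - 1) % n] = 1 - p    # step counter-clockwise otherwise

nu = np.ones(n)                                 # uniform (counting) measure
invariant = bool(np.allclose(nu @ P, nu))       # doubly stochastic => invariant
flows = nu[:, None] * P                         # nu(x) p(x, y)
reversible = bool(np.allclose(flows, flows.T))  # detailed balance?
```

For $p = 0.7$ this gives `invariant == True` and `reversible == False`, in line with the remark.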
As their name suggests, reversible measures have to do with the time reversed chain (and the corresponding adjoint transition probability), which we now define.
Definition 6.2.34. If $\mu(\cdot)$ is an invariant measure for transition probability $p(x,y)$, then $q(x,y) = \mu(y) p(y,x)/\mu(x)$ is a transition probability on the support of $\mu(\cdot)$, which we call the adjoint (or dual) of $p(\cdot,\cdot)$ with respect to $\mu$. The corresponding chain of law $Q_\mu$ is called the time reversed chain (with respect to $\mu$).
It is not hard, and left to the reader, to check that for any invariant probability measure $\mu$ the stationary Markov chains $\{Y_n\}$ of law $Q_\mu$ and $\{X_n\}$ of law $P_\mu$ are such that $(Y_k, \ldots, Y_\ell) \overset{D}{=} (X_\ell, \ldots, X_k)$ for any $k \le \ell$ finite. Indeed, this is why $\{Y_n\}$ is called the time reversed chain.
Also note that $\mu(\cdot)$ is a reversible measure if and only if $p(\cdot,\cdot)$ is self-adjoint with respect to $\mu(\cdot)$ (that is, $q(x,y) = p(x,y)$ on the support of $\mu(\cdot)$). Alternatively put, $\mu(\cdot)$ is a reversible measure if and only if $P_\mu = Q_\mu$, that is, the shift invariant law of the chain induced by $\mu$ is the same as that of its time reversed chain.
By Definition 6.2.33 the set of reversible measures for $p(\cdot,\cdot)$ is a convex cone. The following exercise affirms that reversible measures are zero outside the closed equivalence classes of the chain and uniquely determined by it within each such class. It thus reduces the problem of characterizing reversible chains (and measures) to doing so for irreducible chains.
Exercise 6.2.35. Suppose $\nu(x)$ is a reversible measure for the transition probability $p(x,y)$ of a Markov chain $\{X_n\}$ with a countable state space $S$.
(a) Show that $\nu(x) P_x(X_k = y) = \nu(y) P_y(X_k = x)$ for any $x, y \in S$ and all $k \ge 1$.
(b) Deduce that if $\nu(x) > 0$ then any $y$ accessible from $x$ must intercommunicate with $x$.
(c) Conclude that the support of $\nu(\cdot)$ is a disjoint union of closed equivalence classes, within each of which the measure is uniquely determined by $p(\cdot,\cdot)$ up to a non-negative constant multiple.
We proceed to characterize reversible irreducible Markov chains as random walks on networks.
Definition 6.2.36. A network (or weighted graph) consists of a countable (finite or infinite) set of vertices $V$ with a symmetric weight function $w: V \times V \to [0,\infty)$ (i.e. $w_{xy} = w_{yx}$ for all $x, y \in V$). Further requiring that $\pi(x) = \sum_{y\in V} w_{xy}$ is finite and positive for each $x \in V$, a random walk on the network is a homogeneous Markov chain of state space $V$ and transition probability $p(x,y) = w_{xy}/\pi(x)$. That is, when at state $x$ the probability of the chain moving to state $y$ is proportional to the weight $w_{xy}$ of the pair $x, y$.
Remark. For example, an undirected graph is merely a network the weights $w_{xy}$ of which are either one (indicating an edge in the graph whose ends are $x$ and $y$) or zero (no such edge). Assuming such a graph has positive and finite degrees, the random walker moves at each time step to a vertex chosen uniformly at random from those adjacent in the graph to its current position.
Exercise 6.2.37. Check that a random walk on a network has a strictly positive reversible measure $\pi(x) = \sum_y w_{xy}$ and that a Markov chain is reversible if and only if there exists an irreducible closed set $V$ on which it is a random walk (with weights $w_{xy} = \nu(x) p(x,y)$).
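A quick numerical companion to Exercise 6.2.37 (my sketch, with arbitrary random weights): starting from any symmetric weight matrix, the normalization $p(x,y) = w_{xy}/\pi(x)$ produces a chain for which $\pi$ satisfies detailed balance, hence is invariant:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.random((6, 6))
W = (W + W.T) / 2                 # symmetric weights w_xy = w_yx
pi = W.sum(axis=1)                # pi(x) = sum_y w_xy
P = W / pi[:, None]               # p(x, y) = w_xy / pi(x), rows sum to one

flows = pi[:, None] * P           # pi(x) p(x, y) = w_xy, symmetric by construction
balanced = bool(np.allclose(flows, flows.T))
invariant = bool(np.allclose(pi @ P, pi))
```

The invariance also follows directly: $\sum_x \pi(x) p(x,y) = \sum_x w_{xy} = \pi(y)$ by symmetry of $w$.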
Example 6.2.38 (Birth and death chain). We leave for the reader to check that the irreducible birth and death chain of Exercise 6.2.24 is a random walk on the network $\mathbb{Z}_+$ with weights $w_{x,x+1} = p_x \nu(x) = q_{x+1} \nu(x+1)$, $w_{xx} = r_x \nu(x)$ and $w_{xy} = 0$ for $|x - y| > 1$, and the unique reversible measure $\nu(x) = \prod_{i=1}^x \frac{p_{i-1}}{q_i}$ (with $\nu(0) = 1$).
Remark. Though irreducibility does not imply uniqueness of the invariant measure (c.f. Example 6.2.32), if $\mu$ is an invariant measure of the preceding birth and death chain then $\mu(x+1)$ is determined by (6.2.5) from $\mu(x)$ and $\mu(x-1)$, so starting at $\mu(0) = 1$ we conclude that the reversible measure of Example 6.2.38 is also the unique invariant measure for this chain.
We conclude our discussion of reversible measures with an explicit condition for reversibility of an irreducible chain, whose proof is left for the reader (for example, see [Dur10, Theorem 6.5.1]).
Exercise 6.2.39 (Kolmogorov's cycle condition). Show that an irreducible chain of transition probability $p(x,y)$ is reversible if and only if $p(x,y) > 0$ whenever $p(y,x) > 0$ and
$$\prod_{i=1}^k p(x_{i-1}, x_i) = \prod_{i=1}^k p(x_i, x_{i-1})\,,$$
for any $k \ge 3$ and any cycle $x_0, x_1, \ldots, x_k = x_0$.
Remark. The renewal Markov chain of Example 6.1.11 is one of the many recurrent chains that fail to satisfy Kolmogorov's condition (and thus are not reversible).
Turning to investigate the existence and support of finite invariant measures (or equivalently, that of invariant probability measures), we further partition the recurrent states of the chain according to the integrability (or lack thereof) of the corresponding return times.
Definition 6.2.40. With $T_z$ denoting the first return time to state $z$, a recurrent state $z$ is called positive recurrent if $E_z(T_z) < \infty$ and null recurrent otherwise.
Indeed, invariant probability measures require the existence of positive recurrent states, on which they are supported.
Proposition 6.2.41. If $\mu(\cdot)$ is an invariant probability measure then all states $z$ with $\mu(z) > 0$ are positive recurrent. Further, if the support of $\mu(\cdot)$ is an irreducible set $R$ of positive recurrent states then $\mu(z) = 1/E_z(T_z)$ for all $z \in R$.
Proof. Recall from Proposition 6.2.10 that for any initial probability measure $\mu(\cdot)$ the number of visits $N_\infty(z) = \sum_{n\ge 1} I_{\{X_n = z\}}$ to a state $z$ by the chain is such that
$$\sum_{n=1}^\infty P_\mu(X_n = z) = E_\mu N_\infty(z) = \sum_{x\in S} \mu(x) E_x N_\infty(z) = \sum_{x\in S} \mu(x) \frac{\rho_{xz}}{1 - \rho_{zz}} \le \frac{1}{1 - \rho_{zz}}$$
(since $\rho_{xz} \le 1$ for all $x$). Starting at $X_0$ chosen according to an invariant probability measure $\mu(\cdot)$ results with a stationary Markov chain $\{X_n\}$ and in particular $P_\mu(X_n = z) = \mu(z)$ for all $n$. The left side of the preceding inequality is thus infinite for positive $\mu(z)$ and invariant probability measure $\mu(\cdot)$. Consequently, in this case $\rho_{zz} = 1$, or equivalently $z$ must be a recurrent state of the chain. Since this applies for any $z \in S$ we conclude that $\mu(\cdot)$ is supported outside the set $T$ of transient states.
Next, recall that for any $z \in S$,
$$\mu_z(S) = \sum_{y\in S} \mu_z(y) = E_z\Big[\sum_{y\in S} \sum_{n=0}^{T_z - 1} I_{\{X_n = y\}}\Big] = E_z T_z\,,$$
so $\mu_z$ is a finite measure if and only if $z$ is a positive recurrent state of the chain. If the support of $\mu(\cdot)$ is an irreducible equivalence class $R$ then we deduce from Propositions 6.2.27 and 6.2.30 that $\mu_z$ is a finite measure and $\mu(z) = \mu_z(z)/\mu_z(S) = 1/E_z T_z$ for any $z \in R$. Consequently, $R$ must be a positive recurrent equivalence class, that is, all states of $R$ are positive recurrent.
To complete the proof, note that by the decomposition theorem any invariant probability measure $\mu(\cdot)$ is a mixture of such invariant probability measures, each supported on a different closed recurrent class $R_i$, which by the preceding argument must all be positive recurrent. □
In the course of proving Proposition 6.2.41 we have shown that positive and null recurrence are equivalence class properties. That is, an irreducible set of states $C$ is either positive recurrent (i.e. every $z \in C$ is positive recurrent), null recurrent (i.e. every $z \in C$ is null recurrent), or transient. Further, recall the discussion after Proposition 6.2.27, that any chain with a finite state space has an invariant probability measure, from which we get the following corollary.
Corollary 6.2.42. For an irreducible Markov chain the existence of an invariant probability measure is equivalent to the existence of a positive recurrent state, in which case every state is positive recurrent. We call such a chain positive recurrent and note that any irreducible chain with a finite state space is positive recurrent.
For the remainder of this section we consider the existence and non-existence of invariant probability measures for some Markov chains of interest.
Example 6.2.43. Since the invariant measure of a recurrent chain is unique up to a constant multiple (see Proposition 6.2.30) and a transient chain has no invariant probability measure (see Corollary 6.2.42), if an irreducible chain has an invariant measure $\mu(\cdot)$ for which $\sum_x \mu(x) = \infty$ then it has no invariant probability measure.
For example, since the counting measure $\nu$ is an invariant measure for the (irreducible) srw of Example 6.2.32, this chain does not have an invariant probability measure, regardless of the value of $p$. For the same reason, the symmetric srw on $\mathbb{Z}$ (i.e. where $p = 1/2$) is a null recurrent chain.
Similarly, the irreducible birth and death chain of Exercise 6.2.24 has an invariant probability measure if and only if its reversible measure $\nu(x) = \prod_{i=1}^x \frac{p_{i-1}}{q_i}$ is finite (c.f. Example 6.2.38). In particular, if $p_j = 1 - q_j = p$ for all $j \ge 1$ then this chain is positive recurrent with an invariant probability measure when $p < 1/2$ but null recurrent for $p = 1/2$ (and transient when $1 > p > 1/2$).
Finally, a random walk on a graph is irreducible if and only if the graph is connected. With $\pi(v) \ge 1$ for all $v \in V$ (see Definition 6.2.36), it is positive recurrent only for finite graphs.
Exercise 6.2.44. Check that $\mu(j) = \sum_{k>j} q_k$ is an invariant measure for the recurrent renewal Markov chain of Example 6.1.11 in case $\{k : q_k > 0\}$ is unbounded (see Example 6.2.19). Conclude that this chain is positive recurrent if and only if $\sum_k k q_k$ is finite.
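The invariance claim of Exercise 6.2.44 can be checked numerically on a truncated renewal chain. In this sketch (my construction, not from the text; $q_k$ is taken geometric and renormalized so the truncation is exact) the chain jumps from $0$ to $k-1$ with probability $q_k$ and otherwise counts down by one:

```python
import numpy as np

K = 30
q = np.array([0.5 ** k for k in range(1, K + 1)])
q /= q.sum()                    # q_k for k = 1..K, renormalized after truncation
P = np.zeros((K, K))            # states 0, 1, ..., K-1
P[0, :] = q                     # from 0 jump to k-1 with probability q_k
for j in range(1, K):
    P[j, j - 1] = 1.0           # deterministic countdown

mu = np.array([q[j:].sum() for j in range(K)])   # mu(j) = sum_{k > j} q_k
invariant = bool(np.allclose(mu @ P, mu))
total_mass = float(mu.sum())                     # equals sum_k k q_k
mean_return = float((np.arange(1, K + 1) * q).sum())
```

Here `total_mass == mean_return` reflects the exercise's criterion: the chain is positive recurrent exactly when $\sum_k k q_k < \infty$, i.e. when $\mu$ can be normalized.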
In the next exercise you find how the invariant probability measure is modified by the introduction of holding times.
Exercise 6.2.45. Let $\pi(\cdot)$ be the unique invariant probability measure of an irreducible, positive recurrent Markov chain $\{X_n\}$ with transition probability $p(x,y)$ such that $p(x,x) = 0$ for all $x \in S$. Fixing $r(x) \in (0,1)$, consider the Markov chain $\{Y_n\}$ whose transition probability is $q(x,x) = 1 - r(x)$ and $q(x,y) = r(x) p(x,y)$ for all $y \ne x$. Show that $\{Y_n\}$ is an irreducible, recurrent chain of invariant measure $\mu(x) = \pi(x)/r(x)$ and deduce that $\{Y_n\}$ is further positive recurrent if and only if $\sum_x \pi(x)/r(x) < \infty$.
Though we have established the next result in a more general setting, the proof we outline here is elegant, self-contained and instructive.
Exercise 6.2.46. Suppose $g(\cdot)$ is a strictly concave bounded function on $[0,\infty)$ and $\pi(\cdot)$ is a strictly positive invariant probability measure for irreducible transition probability $p(x,y)$. For any $\nu: S \to [0,\infty)$ let $(\nu p)(y) = \sum_{x\in S} \nu(x) p(x,y)$ and
$$c(\nu) = \sum_{y\in S} g\Big(\frac{\nu(y)}{\pi(y)}\Big)\, \pi(y)\,.$$
(a) Show that $c(\nu p) \ge c(\nu)$.
(b) Assuming $p(x,y) > 0$ for all $x, y \in S$, deduce from part (a) that any invariant measure $\nu(\cdot)$ for $p(x,y)$ is a constant multiple of $\pi(\cdot)$.
(c) Extend this conclusion to any irreducible $p(x,y)$ by checking that
$$\bar p(x,y) = \sum_{n=1}^\infty 2^{-n} P_x(X_n = y) > 0\,, \qquad \forall x, y \in S\,,$$
and that invariant measures for $p(x,y)$ are also invariant for $\bar p(x,y)$.
Here is an introduction to the powerful method of Lyapunov (or energy) functions.
Exercise 6.2.47. Let $\tau_z = \inf\{n \ge 0 : Z_n = z\}$ and $\mathcal{F}^Z_n = \sigma(Z_k, k \le n)$, for a Markov chain $\{Z_n\}$ of transition probabilities $p(x,y)$ on a countable state space $S$.
(a) Show that $V_n = Z_{n\wedge\tau_z}$ is an $\mathcal{F}^Z_n$-Markov chain and compute its transition probabilities $q(x,y)$.
(b) Suppose $h: S \to [0,\infty)$ is such that $h(z) = 0$, the function $(ph)(x) = \sum_y p(x,y) h(y)$ is finite everywhere and $h(x) \ge (ph)(x) + \epsilon$ for some $\epsilon > 0$ and all $x \ne z$. Show that $(W_n, \mathcal{F}^Z_n)$ is a sup-MG under $P_x$ for $W_n = h(V_n) + \epsilon (n \wedge \tau_z)$ and any $x \in S$.
(c) Deduce that $E_x \tau_z \le h(x)/\epsilon$ for any $x \in S$ and conclude that $z$ is positive recurrent in the stronger sense that $E_x T_z$ is finite for all $x \in S$.
(d) Fixing $\epsilon > 0$ consider i.i.d. random vectors $v_k = (\xi_k, \eta_k)$ such that $P(v_1 = (1,0)) = P(v_1 = (0,1)) = 0.25 - \epsilon$ and $P(v_1 = (-1,0)) = P(v_1 = (0,-1)) = 0.25 + \epsilon$. The chain $Z_n = (X_n, Y_n)$ on $\mathbb{Z}^2$ is such that $X_{n+1} = X_n + \mathrm{sgn}(X_n)\,\xi_{n+1}$ and $Y_{n+1} = Y_n + \mathrm{sgn}(Y_n)\,\eta_{n+1}$, where $\mathrm{sgn}(0) = 0$. Prove that $(0,0)$ is positive recurrent in the sense of part (c).
Exercise 6.2.48. Consider the Markov chain $Z_n = \xi_n + (Z_{n-1} - 1)^+$, $n \ge 1$, on $S = \{0, 1, 2, \ldots\}$, where $\{\xi_n\}$ are i.i.d. $S$-valued such that $P(\xi_1 > 1) > 0$ and $E\xi_1 = 1 - \epsilon$ for some $\epsilon > 0$.
(a) Show that $\{Z_n\}$ is positive recurrent.
(b) Find its invariant probability measure $\pi(\cdot)$ in case $P(\xi_1 = k) = p(1-p)^k$, $k \in S$, for some $p \in (1/2, 1)$.
(c) Is this Markov chain reversible?
6.2.3. Aperiodicity and limit theorems. Building on our classification of states and study of the invariant measures of homogeneous Markov chains with countable state space $S$, we focus here on the large $n$ asymptotics of the state $X_n(\omega)$ of the chain and its law.
We start with the asymptotic behavior of the occupation time
$$N_n(y) = \sum_{\ell=1}^n I_{\{X_\ell = y\}}\,,$$
of state $y$ by the Markov chain during its first $n$ steps.
Proposition 6.2.49. For any probability measure $\mu$ on $S$ and all $y \in S$,
$$(6.2.6)\qquad \lim_{n\to\infty} n^{-1} N_n(y) = \frac{1}{E_y(T_y)}\, I_{\{T_y < \infty\}} \quad P_\mu\text{-a.s.}$$
Remark. This special case of the strong law of large numbers for Markov additive functionals (see Exercise 6.2.62 for its generalization) tells us that if a Markov chain visits a positive recurrent state then it asymptotically occupies it for a positive fraction of time, while the fraction of time it occupies each null recurrent or transient state is zero (hence the reason for the name "null recurrent").
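A simulation illustrates (6.2.6) for a small positive recurrent chain (my sketch; the transition matrix is arbitrary). For an irreducible finite chain the a.s. limit of $n^{-1}N_n(y)$ is $\pi(y) = 1/E_y(T_y)$, which we compute exactly from the invariance equations and compare with simulated occupation frequencies:

```python
import numpy as np

P = np.array([[0.1, 0.9, 0.0],
              [0.4, 0.0, 0.6],
              [0.5, 0.5, 0.0]])
# exact invariant probability: solve pi (P - I) = 0 together with sum(pi) = 1
A = P.T - np.eye(3)
A[2] = 1.0                                  # replace one redundant equation
pi = np.linalg.solve(A, np.array([0.0, 0.0, 1.0]))

rng = np.random.default_rng(1)
n, x = 100_000, 0
visits = np.zeros(3)
for _ in range(n):
    x = rng.choice(3, p=P[x])
    visits[x] += 1                          # N_n(y) for each state y

max_err = float(np.abs(visits / n - pi).max())
```

With $n = 10^5$ steps the empirical frequencies agree with $\pi$ up to the usual $O(n^{-1/2})$ statistical error.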
Proof. First note that if $y$ is transient then $E_x N_\infty(y)$ is finite by (6.2.3) for any $x \in S$. Hence, $P_\mu$-a.s. $N_\infty(y)$ is finite and consequently $n^{-1} N_n(y) \to 0$ as $n \to \infty$. Furthermore, since $P_y(T_y = \infty) = 1 - \rho_{yy} > 0$, in this case $E_y(T_y) = \infty$ and (6.2.6) follows.
Turning to consider recurrent $y \in S$, note that if $T_y(\omega) = \infty$ then $N_n(y)(\omega) = 0$ for all $n$ and (6.2.6) trivially holds. Thus, assuming hereafter that $T_y(\omega) < \infty$, we have by recurrence of $y$ that a.s. $T^k_y(\omega) < \infty$ for all $k$ (see Corollary 6.2.12). Recall part (b) of Exercise 6.2.11, that under $P_\mu$ and conditional on $\{T_y < \infty\}$, the positive, finite random variables $\tau_k = T^k_y - T^{k-1}_y$ are independent of each other, with $\tau_k$, $k \ge 2$ further identically distributed and of mean value $E_y(T_y)$. Since $N_n(y) = \sup\{k \ge 0 : T^k_y \le n\}$, as you have showed in part (b) of Exercise 2.3.8, it follows from the strong law of large numbers that $n^{-1} N_n(y) \overset{a.s.}{\to} 1/E_y(T_y)$ for $n \to \infty$. This completes the proof, as by assumption $I_{\{T_y < \infty\}} = 1$ in the present case. □
Here is a direct application of Proposition 6.2.49.
Exercise 6.2.50. Consider the positions $X_n$ of a particle starting at $X_0 = x \in S$ and moving in $S = \{0, \ldots, r\}$ according to the following rules. From any position $1 \le y \le r-1$ the particle moves to $y - 1$ or $y + 1$, and each such move is made with probability $1/2$ independently of all other moves, whereas from positions $0$ and $r$ the particle moves in one step to position $k \in S$.
(a) Fixing $y \in S$ and $k \in \{1, \ldots, r-1\}$ find the almost sure limit $\pi(k,y)$ of $n^{-1} N_n(y)$ as $n \to \infty$.
(b) Find the almost sure limit $\pi(y)$ of $n^{-1} N_n(y)$ in case upon reaching either $0$ or $r$ the particle next moves to an independently and uniformly chosen position $K \in \{1, \ldots, r-1\}$.
Your next task is to prove the following ratio limit theorem for the occupation times $N_n(y)$ within each irreducible, closed recurrent set of states. In particular, it refines the limited information provided by Proposition 6.2.49 in case $y$ is a null recurrent state.
Exercise 6.2.51. Suppose $y \in S$ is a recurrent state for the chain $\{X_n\}$. Let $\mu_y(\cdot)$ denote the invariant measure of the chain per Proposition 6.2.27, whose support is the closed and recurrent equivalence class $R_y$ of $y$. Decomposing the path $\{X_\ell\}$ at the successive return times $T^k_y$ show that for any $x, w \in R_y$,
$$\lim_{n\to\infty} \frac{N_n(w)}{N_n(y)} = \mu_y(w)\,, \qquad P_x\text{-a.s.}$$
Hint: Use Exercise 6.2.11 and the monotonicity of $n \mapsto N_n(w)$.
Proceeding to study the asymptotics of $P_x(X_n = y)$ we start with the following consequence of Proposition 6.2.49.
Corollary 6.2.52. For all $x, y \in S$,
$$(6.2.7)\qquad \lim_{n\to\infty} \frac{1}{n} \sum_{\ell=1}^n P_x(X_\ell = y) = \frac{\rho_{xy}}{E_y(T_y)}\,.$$
Further, for any transient state $y \in T$,
$$(6.2.8)\qquad \lim_{n\to\infty} P_x(X_n = y) = \frac{\rho_{xy}}{E_y(T_y)}\,.$$
Proof. Since $\sup_n n^{-1} N_n(y) \le 1$, the convergence in (6.2.7) follows from Proposition 6.2.49 by bounded convergence (i.e. Corollary 1.3.46).
For a transient state $y$ the sequence $P_x(X_n = y)$ is summable (to the finite value $E_x N_\infty(y)$, c.f. Proposition 6.2.10), hence $P_x(X_n = y) \to 0$ as $n \to \infty$. Further, this amounts to (6.2.8) as in this case $E_y(T_y) = \infty$. □
Corollary 6.2.52 tells us that for every Markov chain the Cesàro averages of $P_x(X_n = y)$ converge. In contrast, our next example shows that even for an irreducible chain of finite state space the sequence $n \mapsto P_x(X_n = y)$ may fail to converge pointwise.
Example 6.2.53. Consider the Markov chain $\{X_n\}$ on state space $S = \{0, 1\}$ with transition probabilities $p(x,y) = 1_{\{x \ne y\}}$. Then, $P_x(X_n = y) = 1_{\{n \text{ even}\}}$ when $x = y$ and $P_x(X_n = y) = 1_{\{n \text{ odd}\}}$ when $x \ne y$, so the sequence $n \mapsto P_x(X_n = y)$ alternates between zero and one, having no limit for any fixed $(x,y) \in S^2$.
Nevertheless, as we prove in the sequel (more precisely, in Theorem 6.2.59), periodicity of the state $y$ is the only reason for such non-convergence of $P_x(X_n = y)$.
Definition 6.2.54. The period $d_x$ of a state $x \in S$ of a Markov chain $\{X_n\}$ is the greatest common divisor (g.c.d.) of the set $J_x = \{n \ge 1 : P_x(X_n = x) > 0\}$, with $d_x = 0$ in case $J_x$ is empty. Similarly, we say that the chain is of period $d$ if $d_x = d$ for all $x \in S$. A state $x$ is called aperiodic if $d_x \le 1$ and a Markov chain is called aperiodic if every $x \in S$ is aperiodic.
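The period of Definition 6.2.54 can be computed from the support of matrix powers. A small sketch (mine, not from the text; $J_x$ is truncated at `nmax` steps, which suffices once enough return lengths have appeared):

```python
import numpy as np
from math import gcd

def period(P, x, nmax=50):
    # gcd of {n >= 1 : P^n(x, x) > 0}, with n truncated at nmax
    A = (np.asarray(P) > 0).astype(int)      # support (adjacency) matrix
    An = np.eye(len(A), dtype=int)
    d = 0                                    # gcd(0, n) == n seeds the recursion
    for n in range(1, nmax + 1):
        An = ((An @ A) > 0).astype(int)      # support of P^n, avoiding underflow
        if An[x, x]:
            d = gcd(d, n)
    return d

flip = [[0.0, 1.0], [1.0, 0.0]]              # deterministic flip: period 2
lazy = [[0.5, 0.5], [1.0, 0.0]]              # self-loop at 0: aperiodic
```

Working with the 0/1 support matrix rather than $P^n$ itself avoids floating-point underflow deciding positivity.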
As the first step in this program, we show that the period is constant on each irreducible set.
Lemma 6.2.55. The set $J_x$ contains all large enough integer multiples of $d_x$ and if $x \leftrightarrow y$ then $d_x = d_y$.
Proof. Considering (6.2.4) for $x = y$ and $L = 0$ we find that $J_x$ is closed under addition. Hence, this set contains all large enough integer multiples of $d_x$, because every non-empty set $J$ of positive integers which is closed under addition must contain all large enough integer multiples of its g.c.d. $d$. Indeed, it suffices to prove this fact when $d = 1$ since the general case then follows upon considering the non-empty set $J' = \{n \ge 1 : nd \in J\}$ whose g.c.d. is one (and which is also closed under addition). Further, note that any integer $n \ge \ell^2$ is of the form $n = \ell^2 + k\ell + r = r(\ell+1) + (\ell - r + k)\ell$ for some $k \ge 0$ and $0 \le r < \ell$. Hence, if two consecutive integers $\ell$ and $\ell+1$ are in $J$ then so are all integers $n \ge \ell^2$. We thus complete the proof by showing that $K = \inf\{m - \ell : m, \ell \in J, m > \ell\} > 1$ is in contradiction with $J$ having g.c.d. $d = 1$. Indeed, both $m_0$ and $m_0 + K$ are in $J$ for some positive integer $m_0$ and if $d = 1$ then $J$ must contain also a positive integer of the form $m_1 = sK + r$ for some $0 < r < K$ and $s \ge 0$. With $J$ closed under addition, $(s+1)(m_0 + K) > (s+1)m_0 + m_1$ must then both be in $J$ but their difference is $(s+1)K - m_1 = K - r < K$, in contradiction with the definition of $K$.
If $x \leftrightarrow y$ then in view of the inequality (6.2.4) there exist finite $K$ and $L$ such that $K + n + L \in J_x$ whenever $n \in J_y$. Moreover, $K + L \in J_x$, so every $n \in J_y$ must also be an integer multiple of $d_x$ (as $d_x$ divides both $K + n + L$ and $K + L$, it divides their difference $n$). Consequently, $d_x$ is a common divisor of $J_y$ and therefore $d_y$, being the greatest common divisor of $J_y$, is an integer multiple of $d_x$. Reversing the roles of $x$ and $y$ we likewise have that $d_x$ is an integer multiple of $d_y$, from which we conclude that in this case $d_x = d_y$. □
The key for determining the asymptotics of $P_x(X_n = y)$ is to handle this question for aperiodic irreducible chains, to which end the next lemma is most useful.
Lemma 6.2.56. Consider two independent copies $\{X_n\}$ and $\{Y_n\}$ of an aperiodic, irreducible chain on a countable state space $S$ with transition probabilities $p(\cdot,\cdot)$. The Markov chain $Z_n = (X_n, Y_n)$ on $S^2$ of transition probabilities $p_2((x', y'), (x, y)) = p(x', x)\, p(y', y)$ is then also aperiodic and irreducible. If $\{X_n\}$ has invariant probability measure $\pi(\cdot)$ then $\{Z_n\}$ is further positive recurrent and has the invariant probability measure $\pi_2(x, y) = \pi(x)\pi(y)$.
Remark. Example 6.2.53 shows that for periodic $p(\cdot,\cdot)$ the chain of transition probabilities $p_2(\cdot,\cdot)$ may not be irreducible.
Proof. Fix states $z' = (x', y') \in S^2$ and $z = (x, y) \in S^2$. Since $p(\cdot,\cdot)$ are the transition probabilities of an irreducible chain, there exist $K$ and $L$ finite such that $P_{x'}(X_K = x) > 0$ and $P_{y'}(Y_L = y) > 0$. Further, by the aperiodicity of this chain we have from Lemma 6.2.55 that both $P_x(X_{n+L} = x) > 0$ and $P_y(Y_{K+n} = y) > 0$ for all $n$ large enough, in which case from (6.2.4) we deduce that $P_{z'}(Z_{K+n+L} = z) > 0$ as well. As this applies for any $z', z \in S^2$, the chain $\{Z_n\}$ is irreducible. Further, considering $z' = z$ we see that $J_z$ contains all large enough integers, hence $\{Z_n\}$ is also aperiodic. Finally, it is easy to verify that if $\pi(\cdot)$ is an invariant probability measure for $p(\cdot,\cdot)$ then $\pi_2(x,y) = \pi(x)\pi(y)$ is an invariant probability measure for $p_2(\cdot,\cdot)$, whose existence implies positive recurrence of the chain $\{Z_n\}$ (see Corollary 6.2.42). □
The following Markovian coupling complements Lemma 6.2.56.
Theorem 6.2.57. Let $\{X_n\}$ and $\{Y_n\}$ be two independent copies of an aperiodic, irreducible Markov chain. Suppose further that the irreducible chain $Z_n = (X_n, Y_n)$ is recurrent. Then, regardless of the initial distribution of $(X_0, Y_0)$, the first meeting time $\tau = \min\{\ell \ge 0 : X_\ell = Y_\ell\}$ of the two processes is a.s. finite and for any $n$,
$$(6.2.9)\qquad \|P_{X_n} - P_{Y_n}\|_{tv} \le 2\, P(\tau > n)\,,$$
where $P_{X_n}$ denotes the law of $X_n$ and $\|\cdot\|_{tv}$ denotes the total variation norm of Definition 3.2.22.
Proof. Recall from Lemma 6.2.56 that the Markov chain $Z_n = (X_n, Y_n)$ on $S^2$ is irreducible. We have further assumed that $\{Z_n\}$ is recurrent, hence $\tau_z = \min\{\ell \ge 0 : Z_\ell = z\}$ is a.s. finite (for any $z \in S \times S$), regardless of the initial measure of $Z_0 = (X_0, Y_0)$. Consequently,
$$\tau = \inf\{\tau_z : z = (x,x) \text{ for some } x \in S\}$$
is also a.s. finite, as claimed.
Turning to prove the inequality (6.2.9), fixing $g \in bS$ bounded by one, recall that the chains $\{X_n\}$ and $\{Y_n\}$ have the same transition probabilities and further $X_\tau = Y_\tau$. Thus, for any $k \le n$,
$$I_{\{\tau = k\}}\, E_{X_k}[g(X_{n-k})] = I_{\{\tau = k\}}\, E_{Y_k}[g(Y_{n-k})]\,.$$
By the Markov property and taking out the known $I_{\{\tau = k\}}$ it thus follows that
$$E[I_{\{\tau = k\}}\, g(X_n)] = E\big(I_{\{\tau = k\}}\, E_{X_k}[g(X_{n-k})]\big) = E\big(I_{\{\tau = k\}}\, E_{Y_k}[g(Y_{n-k})]\big) = E[I_{\{\tau = k\}}\, g(Y_n)]\,.$$
Summing over $0 \le k \le n$ we deduce that $E[I_{\{\tau \le n\}}\, g(X_n)] = E[I_{\{\tau \le n\}}\, g(Y_n)]$ and hence
$$E g(X_n) - E g(Y_n) = E[I_{\{\tau > n\}}\, g(X_n)] - E[I_{\{\tau > n\}}\, g(Y_n)] = E[I_{\{\tau > n\}}\, (g(X_n) - g(Y_n))]\,.$$
Since $|g(X_n) - g(Y_n)| \le 2$, we conclude that $|E g(X_n) - E g(Y_n)| \le 2 P(\tau > n)$ for any $g \in bS$ bounded by one, which is precisely what is claimed in (6.2.9). □
Remark. Another Markovian coupling corresponds to replacing the transition probabilities $p_2((x', y'), (x, y))$ with $p(x', x)\, 1_{\{y = x\}}$ whenever $x' = y'$. Doing so extends the identity $Y_\tau = X_\tau$ to $Y_n = X_n$ for all $n \ge \tau$, thus yielding the bound $P(X_n \ne Y_n) \le P(\tau > n)$ while each coordinate of the coupled chain evolves as before according to the original transition probabilities $p(\cdot,\cdot)$.
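The coupling bound (6.2.9) can be tested numerically (my sketch; the three-state matrix below is arbitrary but aperiodic and irreducible). We estimate $P(\tau > n)$ for two independent copies by simulation and compare with the exact total variation distance between the two laws at time $n$:

```python
import numpy as np

P = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.3, 0.5],
              [0.4, 0.1, 0.5]])
rng = np.random.default_rng(2)
n, trials = 8, 4000
not_met = 0
for _ in range(trials):
    x, y = 0, 2                           # X_0 = 0, Y_0 = 2, independent copies
    met = False
    for _t in range(n):
        x = rng.choice(3, p=P[x])
        y = rng.choice(3, p=P[y])
        if x == y:
            met = True
            break
    if not met:
        not_met += 1
bound = 2 * not_met / trials              # Monte Carlo estimate of 2 P(tau > n)

Pn = np.linalg.matrix_power(P, n)
tv = float(np.abs(Pn[0] - Pn[2]).sum())   # ||P_{X_n} - P_{Y_n}||_tv for the two starts
```

In line with (6.2.9), `tv` should not exceed the estimated `bound` by more than the Monte Carlo error.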
The tail behavior of the first meeting time controls the rate of convergence of $n \mapsto P_x(X_n = y)$. As you are to show next, this convergence is exponentially fast when the state space is finite.
Exercise 6.2.58. Show that if the aperiodic, irreducible Markov chain $\{X_n\}$ has finite state space, then $P(\tau > n) \le \exp(-\lambda n)$ for the first meeting time $\tau$ of Theorem 6.2.57, some $\lambda > 0$ and any $n$ large enough.
Hint: First assume that $p(x,y) > 0$ for all $x, y \in S$. Then show that $P_x(X_r = y) > 0$ for some finite $r$ and all $x, y$ and consider the chain $\{Z_{nr}\}$.
The following consequence of Theorem 6.2.57 is a major step in our analysis of the asymptotics of $P_x(X_n = y)$.
Theorem 6.2.59. The convergence (6.2.8) holds whenever $y$ is an aperiodic state of the Markov chain $\{X_n\}$. In particular, if this Markov chain is irreducible, positive recurrent and aperiodic then for any $x \in S$,
$$\lim_{n\to\infty} \|P_x(X_n \in \cdot) - \pi(\cdot)\|_{tv} = 0\,.$$
Proof. If $\rho_{xy} = 0$ then $P_x(X_n = y) = 0$ for all $n$ and (6.2.8) trivially holds. Otherwise,
$$\rho_{xy} = \sum_{k=1}^\infty P_x(T_y = k)$$
is finite. Hence, in view of the first entrance decomposition
$$P_x(X_n = y) = \sum_{k=1}^n P_x(T_y = k)\, P_y(X_{n-k} = y)$$
(see part (b) of Exercise 6.2.2), the asymptotics (6.2.8) follows by bounded convergence (with respect to the law of $T_y$ conditional on $\{T_y < \infty\}$), from
$$(6.2.10)\qquad \lim_{n\to\infty} P_y(X_n = y) = \frac{1}{E_y(T_y)}\,.$$
Turning to prove (6.2.10), in view of Corollary 6.2.52 we may and shall assume hereafter that $y$ is an aperiodic recurrent state. Further, recall that by Theorem 6.2.13 it then suffices to consider the aperiodic, irreducible, recurrent chain $\{X_n\}$ obtained upon restricting the original Markov chain to the closed equivalence class of $y$, which with some abuse of notation we denote hereafter also by $S$.
Suppose first that $\{X_n\}$ is positive recurrent and so it has the invariant probability measure $\pi(w) = 1/E_w(T_w)$ (see Proposition 6.2.41). The irreducible chain $Z_n = (X_n, Y_n)$ of Lemma 6.2.56 is then recurrent, so we apply Theorem 6.2.57 for $X_0 = y$ and $Y_0$ chosen according to the invariant probability measure $\pi$. Since $\{Y_n\}$ is a stationary Markov chain (see Definition 6.1.20), in particular $Y_n \overset{D}{=} Y_0$ has the law $\pi$ for all $n$. Moreover, the corresponding first meeting time $\tau$ is a.s. finite. Hence, $P(\tau > n) \to 0$ as $n \to \infty$ and by (6.2.9) the law of $X_n$ converges in total variation to $\pi$. This convergence in total variation further implies that $P_y(X_n = y) \to \pi(y)$ when $n \to \infty$ (c.f. Example 3.2.25), which is precisely the statement of (6.2.10).
Next, consider a null recurrent aperiodic, irreducible chain $\{X_n\}$, in which case our thesis is that $P_y(X_n = y) \to 0$ when $n \to \infty$. This clearly holds if the irreducible chain $\{Z_n\}$ of Lemma 6.2.56 is transient, for setting $z = (y,y)$ we then have upon applying Corollary 6.2.52 for the chain $\{Z_n\}$, that as $n \to \infty$
$$P_z(Z_n = z) = P_y(X_n = y)^2 \to 0\,.$$
Proceeding to prove our thesis when the chain $\{Z_n\}$ is recurrent, suppose to the contrary that the sequence $n \mapsto P_y(X_n = y)$ has a limit point $\beta(y) > 0$. Then, mapping $S$ in a one to one manner into $\mathbb{Z}$, we deduce from Helly's theorem that along a further sub-sequence $n_\ell$ the distributions of $X_{n_\ell}$ under $P_y$ converge vaguely, hence pointwise (see Exercise 3.2.3), to some finite, positive measure $\beta$ on $S$. We complete the proof of the theorem by showing that $\beta$ is an excessive measure for the irreducible, recurrent chain $\{X_n\}$. Indeed, by part (c) of Exercise 6.2.29 this would imply the existence of a finite invariant measure for $\{X_n\}$, in contradiction with our assumption that this chain is null recurrent (see Corollary 6.2.42).
To prove that $\beta$ is an excessive measure, note first that considering Theorem 6.2.57 for $Z_0 = (x,y)$ we get from (6.2.9) that $|P_x(X_n = w) - P_y(X_n = w)| \to 0$ as $n \to \infty$, for any $x, w \in S$. Consequently, $P_x(X_{n_\ell} = w) \to \beta(w)$ as $\ell \to \infty$, for every $x, w \in S$. Moreover, from the Chapman-Kolmogorov equations we have that for any $w \in S$, any finite set $F \subseteq S$ and all $\ell \ge 1$,
$$\sum_{z\in S} p(x,z)\, P_z(X_{n_\ell} = w) = P_x(X_{n_\ell + 1} = w) \ge \sum_{z\in F} P_x(X_{n_\ell} = z)\, p(z,w)\,.$$
In the limit $\ell \to \infty$ this yields, by bounded convergence (with respect to the probability measure $p(x,\cdot)$ on $S$), that for all $w \in S$
$$\beta(w) = \sum_{z\in S} p(x,z)\, \beta(w) \ge \sum_{z\in F} \beta(z)\, p(z,w)\,.$$
Taking $F \uparrow S$ we conclude by monotone convergence that $\beta(\cdot)$ is an excessive measure on $S$, as we have claimed before. □
Turning to the behavior of $P_x(X_n = y)$ for a periodic state $y$, we start with the following consequence of Theorem 6.2.59.
Corollary 6.2.60. The convergence (6.2.8) holds whenever $y$ is a null recurrent state of the Markov chain $\{X_n\}$, and if $y$ is a positive recurrent state of $\{X_n\}$ having period $d = d_y$, then
$$(6.2.11)\qquad \lim_{n\to\infty} P_y(X_{nd} = y) = \frac{d}{E_y(T_y)}\,.$$
Proof. If $y \in S$ has period $d \ge 1$ for the chain $\{X_n\}$ then $P_y(X_n = y) = 0$ whenever $n$ is not an integer multiple of $d$. Hence, the expected return time to such state $y$ by the Markov chain $Y_n = X_{nd}$ is precisely $1/d$ of the expected return time $E_y(T_y)$ for $\{X_n\}$. Therefore, (6.2.11) is merely a reformulation of the limit (6.2.10) for the chain $\{Y_n\}$ at its aperiodic state $y \in S$.
If $y$ is a null recurrent state of $\{X_n\}$ then $E_y(T_y) = \infty$, so we have just established that $P_y(X_n = y) \to 0$ as $n \to \infty$. It thus follows by the first entrance decomposition at $T_y$ that in this case $P_x(X_n = y) \to 0$ for any $x \in S$ (as in the opening of the proof of Theorem 6.2.59). □
In the next exercise, you extend (6.2.11) to the asymptotic behavior of $P_x(X_n = y)$ for any two states $x, y$ in a recurrent chain (which is not necessarily aperiodic).

Exercise 6.2.61. Suppose $\{X_n\}$ is an irreducible, recurrent chain of period $d$. For each $x, y \in S$ let $J_{x,y} = \{n \ge 1 : P_x(X_n = y) > 0\}$.
(a) Fixing $z \in S$ show that there exist integers $0 \le r_y < d$ such that if $n \in J_{z,y}$ then $d$ divides $n - r_y$.
(b) Show that if $n \in J_{x,y}$ then $n = (r_y - r_x) \bmod d$ and deduce that $S_i = \{y \in S : r_y = i\}$, $i = 0, \ldots, d-1$ are the irreducible equivalence classes of the aperiodic chain $\{X_{nd}\}$ ($S_i$ are called the cyclic classes of $\{X_n\}$).
(c) Show that for all $x, y \in S$,
$\lim_{n \to \infty} P_x(X_{nd + r_y - r_x} = y) = \frac{d}{E_y(T_y)}$.
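For a finite chain the cyclic classes of Exercise 6.2.61 can be extracted mechanically; the sketch below (on an assumed period-2 example, not a chain from the text) recovers the period $d$ and the residues $r_y$ from truncations of the sets $J_{x,y}$.

```python
from functools import reduce
from math import gcd

# Support of an assumed period-2 chain on {0,1,2,3}:
# 0 -> {1,3}, 1 -> {2}, 2 -> {1,3}, 3 -> {0}
adj = {0: [1, 3], 1: [2], 2: [1, 3], 3: [0]}

def step_lengths(adj, x, N):
    """For each y, the set {n <= N : P_x(X_n = y) > 0}, a truncation of J_{x,y}."""
    lengths = {y: set() for y in adj}
    frontier = {x}
    for n in range(1, N + 1):
        frontier = {z for y in frontier for z in adj[y]}
        for y in frontier:
            lengths[y].add(n)
    return lengths

J = step_lengths(adj, 0, 50)
d = reduce(gcd, J[0])                        # period of the chain
r = {y: min(J[y]) % d for y in adj}          # residues r_y of part (a)
classes = {i: sorted(y for y in adj if r[y] == i) for i in range(d)}
```

Every $n \in J_{0,y}$ is then congruent to $r_y$ modulo $d$, and the classes $S_0 = \{0, 2\}$, $S_1 = \{1, 3\}$ are exactly the cyclic classes through which the chain rotates.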
Remark. It is not always true that if a recurrent state $y$ has period $d$ then $P_x(X_{nd+r} = y) \to d\,\rho_{xy}/E_y(T_y)$ for some $r = r(x, y) \in \{0, \ldots, d-1\}$. Indeed, let $p(x, y)$ be the transition probabilities of the renewal chain with $q_1 = 0$ and $q_k > 0$ for $k \ge 2$ (see Example 6.1.11), except for setting $p(1, 2) = 1$ (instead of $p(1, 0) = 1$ in the renewal chain). The corresponding Markov chain has precisely two recurrent states, $y = 1$ and $y = 2$, both of period $d = 2$ and mean return times $E_1(T_1) = E_2(T_2) = 2$. Further, $\rho_{02} = 1$ but $P_0(X_{nd} = 2) \to \alpha$ and $P_0(X_{nd+1} = 2) \to 1 - \alpha$, where $\alpha = \sum_k q_{2k}$ is strictly between zero and one.
We next consider the large-$n$ asymptotic behavior of the Markov additive functional $A_n^f = \sum_{\ell=1}^{n} f(X_\ell)$, where $\{X_\ell\}$ is an irreducible, positive recurrent Markov chain. In the following two exercises you establish first the strong law of large numbers (thereby generalizing Proposition 6.2.49), and then the central limit theorem for such Markov additive functionals.
Exercise 6.2.62. Suppose $\{X_n\}$ is an irreducible, positive recurrent chain of initial probability measure $\nu$ and invariant probability measure $\pi(\cdot)$. Let $f : S \to \mathbb{R}$ be such that $\pi(|f|) < \infty$.
(a) Fixing $y \in S$ let $R_k = T_y^k$. Show that the random variables
$Z_k^f = \sum_{\ell = R_{k-1}}^{R_k - 1} f(X_\ell)$, $k \ge 1$,
are mutually independent and moreover $Z_k^f$, $k \ge 2$ are identically distributed with $E Z_2^{|f|}$ finite.
Hint: Consider Exercise 6.2.11.
(b) With $S_n^f = \sum_{k=1}^{N_n(y)} Z_{k+1}^f$ show that
$\lim_{n \to \infty} n^{-1} S_n^f = \frac{E Z_2^f}{E_y(T_y)} = \pi(f)$  $P_\nu$-a.s.
(c) Show that $P_\nu$-a.s. $\max\{n^{-1} Z_k^{|f|} : k \le n\} \to 0$ when $n \to \infty$ and deduce that $n^{-1} A_n^f \to \pi(f)$ with $P_\nu$ probability one.
Exercise 6.2.63. For $\{X_n\}$ as in Exercise 6.2.62 suppose that $f : S \to \mathbb{R}$ is such that $\pi(f) = 0$ and $v_{|f|} = E_y[(Z_1^{|f|})^2]$ is finite.
(a) Show that $n^{-1/2} S_n^f \xrightarrow{D} \sqrt{u}\, G$ as $n \to \infty$, for $u = v_f / E_y(T_y)$ finite and $G$ a standard normal variable.
Hint: See part (a) of Exercise 3.2.9.
(b) Show that $\max\{n^{-1/2} Z_k^{|f|} : k \le n\} \xrightarrow{p} 0$ and deduce that $n^{-1/2} A_n^f \xrightarrow{D} \sqrt{u}\, G$.
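A quick Monte Carlo sanity check of the strong law of Exercise 6.2.62, on an assumed two-state chain (not an example from the text): the running average $n^{-1} A_n^f$ should settle near $\pi(f)$.

```python
import random

random.seed(0)
flip = {0: 0.3, 1: 0.2}     # flip[x] = probability of leaving state x
f = {0: 1.0, 1: -1.0}

# Invariant law of this two-state chain and the limit pi(f) = -0.2
pi = {0: flip[1] / (flip[0] + flip[1]), 1: flip[0] / (flip[0] + flip[1])}
pi_f = pi[0] * f[0] + pi[1] * f[1]

n, x, A = 200_000, 0, 0.0
for _ in range(n):
    if random.random() < flip[x]:
        x = 1 - x
    A += f[x]
estimate = A / n            # n^{-1} A_n^f, should be close to pi(f)
```

With these parameters $\pi = (0.4, 0.6)$ so $\pi(f) = -0.2$; the fluctuations of the estimate around this value are of the order predicted by the central limit theorem of Exercise 6.2.63.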
Building upon their strong law of large numbers, you are next to show that irreducible, positive recurrent chains have P-trivial tail $\sigma$-algebra and the laws of any two such chains are mutually singular (for the analogous results for i.i.d. variables, see Corollary 1.4.10 and Remark 5.5.14, respectively).

Exercise 6.2.64. Suppose $\{X_n\}$ is an irreducible, positive recurrent chain of law $P_x$ on $(S^\infty, \mathcal{S}_c)$ (as in Definition 6.1.7).
(a) Show that $P_x(A)$ is independent of $x \in S$ whenever $A$ is in the tail $\sigma$-algebra $\mathcal{T}^X$ (of Definition 1.4.9).
(b) Deduce that $\mathcal{T}^X$ is P-trivial.
Exercise 6.2.65. Suppose $\{X_n\}$ is an irreducible, positive recurrent chain of transition probability $p(x, y)$, initial and invariant probability measures $\nu(\cdot)$ and $\pi(\cdot)$, respectively.
(a) Show that $\{(X_n, X_{n+1})\}$ is an irreducible, positive recurrent chain on $S_+^2 = \{(x, y) : x, y \in S, p(x, y) > 0\}$, of initial and invariant measures $\nu(x)p(x, y)$ and $\pi(x)p(x, y)$, respectively.
(b) Let $P_\nu$ and $P'_{\nu'}$ denote the laws of two irreducible, positive recurrent chains on the same countable state space $S$, whose transition probabilities $p(x, y)$ and $p'(x, y)$ are not identical. Show that $P_\nu$ and $P'_{\nu'}$ are mutually singular measures (per Definition 4.1.9).
Hint: Consider the conclusion of Exercise 6.2.62 (for $f(\cdot) = 1_x(\cdot)$, or, if the invariant measures $\pi$ and $\pi'$ are identical, then for $f(\cdot) = 1_{(x,y)}(\cdot)$ and the induced pair-chains of part (a)).
Exercise 6.2.66. Fixing $1 > \beta > \alpha > 0$ let $P_n^{\alpha,\beta}$ denote the law of $(X_0, \ldots, X_n)$ for the Markov chain $\{X_k\}$ of state space $S = \{-1, 1\}$ starting from $X_0 = 1$ and evolving according to transition probability $p(1, 1) = \alpha = 1 - p(1, -1)$ and $p(-1, -1) = \beta = 1 - p(-1, 1)$. Fixing an integer $b > 0$ consider the stopping time $\tau_b = \inf\{n \ge 0 : A_n = b\}$ where $A_n = \sum_{k=1}^{n} X_k$.
(a) Setting $\lambda = \log(\beta/\alpha)$, $h(1) = 1$ and $h(-1) = \beta(1-\beta)/(\alpha(1-\alpha))$, show that the Radon-Nikodym derivative $M_n = dP_n^{\beta,\alpha}/dP_n^{\alpha,\beta}$ is of the form $M_n = \exp(\lambda A_n) h(X_n)$.
(b) Deduce that $P^{\alpha,\beta}(\tau_b < \infty) = \exp(-\lambda b)/h(1)$.
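The closed form in part (a) of Exercise 6.2.66 can be checked by brute-force enumeration of all paths; the sketch below assumes the parametrization $p(1,1) = \alpha$, $p(-1,-1) = \beta$ (one reading of the exercise, not the only possible one) and compares the product of one-step likelihood ratios with $\exp(\lambda A_n) h(X_n)$.

```python
from itertools import product
from math import exp, log

alpha, beta = 0.3, 0.6                      # 1 > beta > alpha > 0

def kernel(a, b):
    # assumed parametrization: p(1,1)=a, p(1,-1)=1-a, p(-1,-1)=b, p(-1,1)=1-b
    return {(1, 1): a, (1, -1): 1 - a, (-1, -1): b, (-1, 1): 1 - b}

p, p_swapped = kernel(alpha, beta), kernel(beta, alpha)
lam = log(beta / alpha)
h = {1: 1.0, -1: beta * (1 - beta) / (alpha * (1 - alpha))}

max_err = 0.0
for path in product([1, -1], repeat=6):     # all length-6 paths, X_0 = 1
    x, ratio = 1, 1.0
    for y in path:
        ratio *= p_swapped[(x, y)] / p[(x, y)]  # dP^{beta,alpha}/dP^{alpha,beta}
        x = y
    A_n = sum(path)
    max_err = max(max_err, abs(ratio - exp(lam * A_n) * h[path[-1]]))
```

Every path gives agreement up to floating-point error, which is exactly the statement that the path likelihood ratio depends on the trajectory only through $A_n$ and the endpoint $X_n$.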
Exercise 6.2.67. Suppose $\{X_n\}$ is a Markov chain of transition probability $p(x, y)$ and $g(\cdot) = (ph)(\cdot) - h(\cdot)$ for some bounded function $h(\cdot)$ on $S$. Show that $h(X_n) - \sum_{\ell=0}^{n-1} g(X_\ell)$ is then a martingale.
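The martingale claim of Exercise 6.2.67 can be verified numerically on an assumed 3-state kernel (not one from the text): with $g = ph - h$, the expectation $E_x[h(X_n) - \sum_{\ell=0}^{n-1} g(X_\ell)]$ must stay equal to $h(x)$ for every $n$.

```python
P = [[0.2, 0.5, 0.3],
     [0.6, 0.1, 0.3],
     [0.25, 0.25, 0.5]]
h = [1.0, -2.0, 4.0]
S = range(3)
g = [sum(P[x][y] * h[y] for y in S) - h[x] for x in S]   # g = (ph) - h

def expected_M(n):
    """u[x] = E_x[ h(X_n) - sum_{l=0}^{n-1} g(X_l) ], by backward recursion."""
    u = list(h)
    for _ in range(n):
        # first-step decomposition: u_{n+1}(x) = -g(x) + sum_y p(x,y) u_n(y)
        u = [-g[x] + sum(P[x][y] * u[y] for y in S) for x in S]
    return u

max_err = max(abs(expected_M(n)[x] - h[x]) for n in range(8) for x in S)
```

The recursion makes the mechanism visible: since $u_{n+1} = -g + p u_n$ and $g = ph - h$, the value $u_n \equiv h$ is a fixed point, i.e. the compensated process has constant expectation.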
6.3. General state space: Doeblin and Harris chains
The refined analysis of homogeneous Markov chains with countable state space is possible because such chains hit states with positive probability. This does not happen in many important applications where the state space is uncountable. However, most proofs require only having one point of the state space that the chain hits with probability one. As we shall see, subject to the rather mild irreducibility and recurrence properties of Section 6.3.1, it is possible to create such a point (called a recurrent atom), even in an uncountable state space, by splitting the chain transitions. Guided by successive visits of the recurrent atom for the split chain, we establish in Section 6.3.2 the existence and attractiveness of invariant (probability) measures for the split chain (which then yield such results about the original chain).

6.3.1. Minorization, splitting, irreducibility and recurrence. Considering hereafter homogeneous Markov chains, we start by imposing a minorization property of the transition probability $p(\cdot, \cdot)$ which yields the splitting of these transitions.

Definition 6.3.1. Consider a B-isomorphic state space $(S, \mathcal{S})$. Suppose there exists a non-zero measurable function $v : S \to [0, 1]$ and a probability measure $q(\cdot)$ on $(S, \mathcal{S})$ such that the transition probability of the chain $\{X_n\}$ is of the form
(6.3.1) $p(x, \cdot) = (1 - v(x))\, \hat{p}(x, \cdot) + v(x) q(\cdot)$,
for some transition probability $\hat{p}(x, \cdot)$ and $v(x) q(\cdot) \le p(x, \cdot)$. Amending the state space to $\bar{S} = S \cup \{\alpha\}$ with the corresponding $\sigma$-algebra $\bar{\mathcal{S}} = \{A, A \cup \{\alpha\} : A \in \mathcal{S}\}$, we then consider the split chain $\{\bar{X}_n\}$ on $(\bar{S}, \bar{\mathcal{S}})$ with transition probability
$\bar{p}(x, A) = (1 - v(x))\, \hat{p}(x, A)$  for $x \in S$, $A \in \mathcal{S}$,
$\bar{p}(x, \{\alpha\}) = v(x)$  for $x \in S$,
$\bar{p}(\alpha, B) = \int q(dy)\, \bar{p}(y, B)$  for $B \in \bar{\mathcal{S}}$.
The transitions of $\{X_n\}$ on $S$ have been split by moving to the pseudo-atom $\alpha$ with probability $v(x)$. The random times in which the split chain is at state $\alpha$ are regeneration times for $\{\bar{X}_n\}$. That is, stopping times where future transitions are decoupled from the past. Indeed, the event $\{\bar{X}_n = \alpha\}$ corresponds to $X_n$ moving to a second copy of $S$ where it is distributed according to the so-called regeneration measure $q(\cdot)$, independently of $\bar{X}_{n-1}$.
As the transitions of the split chain outside $\alpha$ occur according to the excess probability $(1 - v(x))\hat{p}(x, \cdot)$, we can further merge the split chain to get back the original. That is,
Definition 6.3.2. The merge transition probability $m(\cdot, \cdot)$ on $(\bar{S}, \mathcal{S})$ is such that $m(x, \{x\}) = 1$ for all $x \in S$ and $m(\alpha, \cdot) = q(\cdot)$. Associated with it is the split mapping $f \mapsto \bar{f} : b\mathcal{S} \to b\bar{\mathcal{S}}$ such that $\bar{f}(\cdot) = (mf)(\cdot) = \int m(\cdot, dy) f(y)$.
We note in passing that $\bar{f}(x) = f(x)$ for all $x \in S$ and $\bar{f}(\alpha) = q(f)$, and further use in the sequel the following elementary fact about the closure of transition probabilities under composition.
Corollary 6.3.3. Given any transition probabilities $\pi_i : \mathbb{X} \times \mathcal{A} \to [0, 1]$, $i = 1, 2$, the set function $\pi_1 \pi_2 : \mathbb{X} \times \mathcal{A} \to [0, 1]$ such that $\pi_1 \pi_2(x, A) = \int \pi_1(x, dy)\, \pi_2(y, A)$ for all $x \in \mathbb{X}$ and $A \in \mathcal{A}$ is a transition probability.

Proof. From Proposition 6.1.4 we see that
$\pi_1 \pi_2(x, A) = (\pi_1(x, \cdot) \otimes \pi_2)(\mathbb{X} \times A) = (\pi_1 \pi_2(\cdot, A))(x)$.
Now, by the first equality, $A \mapsto \pi_1 \pi_2(x, A)$ is a probability measure on $(\mathbb{X}, \mathcal{A})$ for each $x \in \mathbb{X}$, and by the second equality, $x \mapsto \pi_1 \pi_2(x, A)$ is a measurable function on $(\mathbb{X}, \mathcal{A})$ for each $A \in \mathcal{A}$, as required in Definition 6.1.2.
Equipped with these notations we have the following coupling of $\{X_n\}$ and $\{\bar{X}_n\}$.

Proposition 6.3.4. Consider the setup of Definitions 6.3.1 and 6.3.2.
(a). $m\bar{p} = \bar{p}$ and the restriction of $\bar{p}m$ to $(S, \mathcal{S})$ equals $p$.
(b). Suppose $\{Z_n\}$ is an inhomogeneous Markov chain on $(\bar{S}, \bar{\mathcal{S}})$ with transition probabilities $p_{2k} = m$ and $p_{2k+1} = \bar{p}$. Then, $\bar{X}_n = Z_{2n}$ is a Markov chain of transition probability $\bar{p}$ and $X_n = Z_{2n+1} \in S$ is a Markov chain of transition probability $p$. Setting an initial measure $\bar{\mu}$ for $Z_0 = \bar{X}_0$ corresponds to having the initial measure $\mu(A) = \bar{\mu}(A) + \bar{\mu}(\{\alpha\}) q(A)$ for $X_0 \in S$.
(c). $E_\mu[f(X_n)] = E_{\bar{\mu}}[\bar{f}(\bar{X}_n)]$ for any $f \in b\mathcal{S}$, any initial distribution $\bar{\mu}$ on $(\bar{S}, \bar{\mathcal{S}})$ and all $n \ge 0$.
(c). E

[f(X
n
)] = E

[f(X
n
)] for any f bo, any initial distribution on (S, o)
and all n 0.
Proof. (a). Since m(x, x) = 1 it follows that mp(x, B) = p(x, B) for all
x S and B o. Further, m(, ) = q() so mp(, B) =
_
q(dy)p(y, B) which by
denition of p equals p(, B) (see Denition 6.3.1). Similarly, if either B = A o
or B = A, then by denition of the merge m and split p transition probabilities
we have as claimed that for any x S,
pm(x, B) = p(x, A) +p(x, )q(A) = p(x, A) .
(b). As m(x, ) = 0 for all x S, this follows directly from part (a). Indeed,
Z
0
= X
0
of measure is mapped by transition m to X
0
= Z
1
S of measure
= m, then by transition p to X
1
= Z
2
, followed by transition m to X
1
= Z
3
S
and so on. Therefore, the transition probability between X
n1
and X
n
is mp = p
and the one between X
n1
and X
n
is pm restricted to (S, o), namely p.
(c). Constructing X
n
and X
n
as in part (b), if the initial distribution of X
0
assigns zero mass to then = with X
0
= X
0
. Further, by construction
E

[f(X
n
)] = E

[(mf)(X
n
)] which by denition of the split mapping is precisely
E

[f(X
n
)], as claimed.
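The splitting and merging of Definitions 6.3.1-6.3.2 and Proposition 6.3.4 are easy to carry out concretely. The sketch below (on an assumed two-state kernel, not one from the text) builds the excess kernel $\hat{p}$, the split kernel $\bar{p}$ on $S \cup \{\alpha\}$, and checks that merging the split chain's law recovers the law of the original chain at every step.

```python
# An assumed kernel p, regeneration measure q and small function v
# satisfying the minorization v(x) q(.) <= p(x,.) of (6.3.1).
p = [[0.5, 0.5],
     [0.25, 0.75]]
q = [0.5, 0.5]
v = [0.5, 0.25]

# Excess kernel phat from (6.3.1): p = (1 - v) phat + v q
phat = [[(p[x][y] - v[x] * q[y]) / (1 - v[x]) for y in (0, 1)] for x in (0, 1)]

# Split kernel pbar on {0, 1, alpha}, index 2 playing the atom alpha
pbar = [[(1 - v[x]) * phat[x][0], (1 - v[x]) * phat[x][1], v[x]] for x in (0, 1)]
pbar.append([sum(q[y] * pbar[y][j] for y in (0, 1)) for j in range(3)])

def evolve(dist, K, n):
    for _ in range(n):
        dist = [sum(dist[x] * K[x][y] for x in range(len(K)))
                for y in range(len(K))]
    return dist

n = 7
mu_n = evolve([1.0, 0.0], p, n)                # law of X_n
nu_n = evolve([1.0, 0.0, 0.0], pbar, n)        # law of the split chain
merged = [nu_n[0] + nu_n[2] * q[0], nu_n[1] + nu_n[2] * q[1]]   # apply m
```

The merged law agrees with `mu_n` exactly, which is part (c) of Proposition 6.3.4 for indicator functions $f$.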
We plan to study existence and attractiveness of invariant (probability) measures for the split chain $\{\bar{X}_n\}$, then apply Proposition 6.3.4 to transfer such results to the original chain $\{X_n\}$. This however requires the recurrence of the atom $\alpha$. To this end, we must restrict the so-called small function $v(x)$ of (6.3.1), motivating the next definition.

Definition 6.3.5. A homogeneous Markov chain $\{X_n\}$ on $(S, \mathcal{S})$ is called a strong Doeblin chain if the minorization condition (6.3.1) holds with a constant small function. That is, when $\inf_x p(x, A) \ge \delta q(A)$ for some probability measure $q$ on $(S, \mathcal{S})$, a positive constant $\delta > 0$ and all $A \in \mathcal{S}$. We call $\{X_n\}$ a Doeblin chain in case $Y_n = X_{rn}$ is a strong Doeblin chain for some finite $r$, namely when $P_x(X_r \in A) \ge \delta q(A)$ for all $x \in S$ and $A \in \mathcal{S}$.
The Doeblin condition allows us to construct a split chain $\{\bar{Y}_n\}$ that visits its atom $\alpha$ at each time step with probability $\delta \in (0, 1]$. Considering part (c) of Exercise 6.1.18 (with $A = \{\alpha\}$), it follows that $P_{\bar{\mu}}(\bar{Y}_n = \alpha \text{ i.o.}) = 1$ for any initial distribution $\bar{\mu}$. So, in any Doeblin chain the atom $\alpha$ is a recurrent state of the split chain. Further, since $T_\alpha = \inf\{n \ge 1 : \bar{Y}_n = \alpha\}$ is such that $P_x(T_\alpha = 1) = \delta$ for all $x \in \bar{S}$, by the Markov property of $\bar{Y}_n$ (and Exercise 5.1.15), we deduce that $E_{\bar{\mu}}[T_\alpha] \le 1/\delta$ is finite and uniformly bounded (in terms of the initial distribution $\bar{\mu}$). Consequently, the atom $\alpha$ is a positive recurrent, aperiodic state of the split chain, which is accessible with probability one from each of its states.
As we see in Section 6.3.2, this is more than enough to assure that starting at any initial state, the law of $\bar{Y}_n$ converges in total variation norm to the unique invariant probability measure for $\{\bar{Y}_n\}$.
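The Doeblin condition forces this convergence to be geometrically fast. The sketch below (assumed 3-state kernel, not from the text) computes $\delta = \sum_y \min_x p(x, y)$ and checks the classical coupling bound $\|\mu P^n - \nu P^n\|_{TV} \le (1 - \delta)^n$ for two extreme starting laws.

```python
P = [[0.5, 0.3, 0.2],
     [0.2, 0.4, 0.4],
     [0.3, 0.3, 0.4]]
# Doeblin constant: inf_x p(x,.) >= delta q(.) with q the normalized row minima
delta = sum(min(P[x][y] for x in range(3)) for y in range(3))   # here 0.7

def step(dist):
    return [sum(dist[x] * P[x][y] for x in range(3)) for y in range(3)]

def tv(a, b):
    return 0.5 * sum(abs(u - w) for u, w in zip(a, b))

mu, nu = [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]
bound_holds = True
for n in range(1, 11):
    mu, nu = step(mu), step(nu)
    bound_holds = bound_holds and tv(mu, nu) <= (1 - delta) ** n + 1e-12
```

The bound is the quantitative content of the "uniformly bounded return time" argument above: one coupling attempt per step succeeds with probability at least $\delta$.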
You are next going to examine which Markov chains of countable state space are Doeblin chains.

Exercise 6.3.6. Suppose $\mathcal{S} = 2^S$ with $S$ a countable set.
(a) Show that a Markov chain of state space $(S, \mathcal{S})$ is a Doeblin chain if and only if there exist $a \in S$ and $r$ finite such that $\inf_x P_x(X_r = a) > 0$.
(b) Deduce that for any Doeblin chain $S = T \cup R$, where $R = \{y \in S : \rho_{ay} > 0\}$ is a non-empty irreducible, closed set of positive recurrent, aperiodic states and $T = \{y \in S : \rho_{ay} = 0\}$ consists of transient states, all of which lead to $R$.
(c) Verify that a Markov chain on a finite state space is a Doeblin chain if and only if it has an aperiodic state $a \in S$ that is accessible from any other state.
(d) Check that branching processes with $0 < P(N = 0) < 1$, renewal Markov chains and birth and death chains are never Doeblin chains.
The preceding exercise shows that the Doeblin (recurrence) condition is too strong for many chains of interest. We thus replace it by the weaker H-irreducibility condition whereby the small function $v(x)$ is only assumed bounded below on a small, accessible set $C$. To this end, we start with the definitions of an accessible set and a weakly irreducible Markov chain.

Definition 6.3.7. We say that $A \in \mathcal{S}$ is accessible by the Markov chain $\{X_n\}$ if $P_x(T_A < \infty) > 0$ for all $x \in S$.
Given a non-zero $\sigma$-finite measure $\varphi$ on $(S, \mathcal{S})$, the chain is $\varphi$-irreducible if any set $A \in \mathcal{S}$ with $\varphi(A) > 0$ is accessible by it. Finally, a homogeneous Markov chain on $(S, \mathcal{S})$ is called weakly irreducible if it is $\varphi$-irreducible for some non-zero $\sigma$-finite measure $\varphi$ (in particular, any Doeblin chain is weakly irreducible).
Remark. Modern texts on Markov chains typically refer to the preceding as the standard definition of irreducibility but we use here the term weak irreducibility to clearly distinguish it from the elementary definition for a countable $S$. Indeed, in case $S$ is a countable set, let $\lambda_S$ denote the corresponding counting measure of $S$. A Markov chain of state space $S$ is then $\lambda_S$-irreducible if and only if $\rho_{xy} > 0$ for all $x, y \in S$, matching our Definition 6.2.14 of irreducibility, whereas a chain on $S$ countable is weakly irreducible if and only if $\rho_{xa} > 0$ for some $a \in S$ and all $x \in S$. In particular, a weakly irreducible chain of a countable state space $S$ has exactly one non-empty equivalence class of intercommunicating states (i.e. $\{y \in S : \rho_{ay} > 0\}$), which is further accessible by the chain.
As we show next, a weakly irreducible chain has a maximal irreducibility measure $\psi$ such that $\psi(A) > 0$ if and only if $A \in \mathcal{S}$ is accessible by the chain.
Proposition 6.3.8. Suppose $\{X_n\}$ is a weakly irreducible Markov chain on $(S, \mathcal{S})$. Then, there exists a probability measure $\psi$ on $(S, \mathcal{S})$ such that for any $A \in \mathcal{S}$,
(6.3.2) $\psi(A) > 0 \iff P_x(T_A < \infty) > 0 \ \ \forall x \in S$.
We call such $\psi$ a maximal irreducibility measure for the chain.

Remark. Clearly, if a chain is $\varphi$-irreducible, then any non-zero $\sigma$-finite measure absolutely continuous with respect to $\varphi$ (per Definition 4.1.4), is also an irreducibility measure for this chain. The converse holds in case of a maximal irreducibility measure. That is, unraveling Definition 6.3.7 it follows from (6.3.2) that $\{X_n\}$ is $\varphi$-irreducible if and only if the non-zero $\sigma$-finite measure $\varphi$ is absolutely continuous with respect to $\psi$.
Proof. Let $\varphi$ be a non-zero $\sigma$-finite irreducibility measure of the given weakly irreducible chain $\{X_n\}$. Taking $D \in \mathcal{S}$ such that $\varphi(D) \in (0, \infty)$ we see that $\{X_n\}$ is also $q$-irreducible for the probability measure $q(\cdot) = \varphi(\cdot \cap D)/\varphi(D)$. We claim that (6.3.2) holds for the probability measure $\psi(A) = \int_S q(dx)\, k(x, A)$ on $(S, \mathcal{S})$, where
(6.3.3) $k(x, A) = \sum_{n=1}^{\infty} 2^{-n} P_x(X_n \in A)$.
Indeed, with $\{T_A < \infty\} = \bigcup_{n \ge 1} \{X_n \in A\}$, clearly $P_x(T_A < \infty) > 0$ if and only if $k(x, A) > 0$. Consequently, if $P_x(T_A < \infty)$ is positive for all $x \in S$ then so is $k(x, A)$ and hence $\psi(A) > 0$. Conversely, if $\psi(A) > 0$ then necessarily $q(C) > 0$ for $C = \{x \in S : k(x, A) \ge \epsilon\}$ and some $\epsilon > 0$ small enough. In particular, fixing $x \in S$, as $\{X_n\}$ is $q$-irreducible, also $P_x(T_C < \infty) > 0$. That is, there exists a positive integer $m = m(x)$ such that $P_x(X_m \in C) > 0$. It now follows by the Markov property at $m$ (for $h = \sum_{\ell \ge 1} 2^{-\ell} I_{\{X_\ell \in A\}}$), that
$k(x, A) \ge 2^{-m} \sum_{\ell=1}^{\infty} 2^{-\ell} P_x(X_{m+\ell} \in A) = 2^{-m} E_x[k(X_m, A)] \ge 2^{-m} \epsilon\, P_x(X_m \in C) > 0$.
Since this is equivalent to $P_x(T_A < \infty) > 0$ and applies for all $x \in S$, we have established the identity (6.3.2).
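The construction in this proof is easy to carry out for a finite chain. Below (an assumed kernel with one state that leads into the rest but is reached from nowhere), the measure $\psi = \int q(dx)\, k(x, \cdot)$ charges exactly the accessible states.

```python
# State 2 leads into {0,1} but is never reached, so psi({2}) should be 0.
P = [[0.5, 0.5, 0.0],
     [0.5, 0.5, 0.0],
     [1.0, 0.0, 0.0]]
q = [1.0, 0.0, 0.0]   # an irreducibility probability measure (mass at state 0)

def k_row(x, N=40):
    """k(x, {y}) = sum_{n>=1} 2^{-n} P_x(X_n = y), truncated at N terms."""
    row, dist = [0.0, 0.0, 0.0], [0.0] * 3
    dist[x] = 1.0
    for n in range(1, N + 1):
        dist = [sum(dist[z] * P[z][y] for z in range(3)) for y in range(3)]
        row = [row[y] + 2.0 ** (-n) * dist[y] for y in range(3)]
    return row

psi = [sum(q[x] * k_row(x)[y] for x in range(3)) for y in range(3)]
```

Here $\psi \approx (0.5, 0.5, 0)$: the unreachable state 2 gets zero mass while both accessible states are charged, exactly the dichotomy (6.3.2) asserts.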
We next define the notions of a small set and an H-irreducible chain.

Definition 6.3.9. An accessible set $C \in \mathcal{S}$ of a Markov chain $\{X_n\}$ on $(S, \mathcal{S})$ is called an $r$-small set if the transition probability $(x, A) \mapsto P_x(X_r \in A)$ satisfies the minorization condition (6.3.1) with a small function that is constant and positive on $C$. That is, when $P_x(X_r \in \cdot) \ge \delta I_C(x) q(\cdot)$ for some positive constant $\delta > 0$ and probability measure $q$ on $(S, \mathcal{S})$.
We further use small set for 1-small set, and call the chain H-irreducible if it has an $r$-small set for some finite $r \ge 1$ and strong H-irreducible in case $r = 1$.
Clearly, a chain is Doeblin if and only if $S$ is an $r$-small set for some $r \ge 1$, and is further strong Doeblin in case $r = 1$. In particular, a Doeblin chain is H-irreducible and a strong Doeblin chain is also strong H-irreducible.
Exercise 6.3.10. Prove the following properties of H-irreducible chains.
(a) Show that an H-irreducible chain is $q$-irreducible, hence weakly irreducible.
(b) Show that if $\{X_n\}$ is strong H-irreducible then the atom $\alpha$ of the split chain $\{\bar{X}_n\}$ is accessible by $\{\bar{X}_n\}$ from all states in $\bar{S}$.
(c) Show that in a countable state space every weakly irreducible chain is strong H-irreducible.
Hint: Try $C = \{a\}$ and $q(\cdot) = p(a, \cdot)$ for some $a \in S$.
Actually, the converse to part (a) of Exercise 6.3.10 holds as well. That is, weak irreducibility is equivalent to H-irreducibility (for the proof, see [Num84, Proposition 2.6]), and weakly irreducible chains can be analyzed via the study of an appropriate split chain. For simplicity we focus hereafter on the somewhat more restricted setting of strong H-irreducible chains. The following example shows that it still applies for many Markov chains of interest.

Example 6.3.11 (Continuous transition densities). Let $S = \mathbb{R}^d$ with $\mathcal{S} = \mathcal{B}_S$. Suppose that for each $x \in \mathbb{R}^d$ the transition probability has a density $p(x, y)$ with respect to Lebesgue measure $\lambda^d(\cdot)$ on $\mathbb{R}^d$ such that $(x, y) \mapsto p(x, y)$ is jointly continuous in $x$ and $y$. Picking $u$ and $v$ such that $p(u, v) > 0$, there exist a neighborhood $C$ of $u$ and a bounded neighborhood $K$ of $v$, such that $\inf\{p(x, y) : x \in C, y \in K\} > 0$. Hence, setting $q(\cdot)$ to be the uniform measure on $K$ (i.e. $q(A) = \lambda^d(A \cap K)/\lambda^d(K)$ for any $A \in \mathcal{S}$), such a chain is strong H-irreducible provided $C$ is an accessible set. For example, this occurs whenever $p(x, u) > 0$ for all $x \in \mathbb{R}^d$.
Remark 6.3.12. Though our study of Markov chains has been mostly concerned with measure theoretic properties of $(S, \mathcal{S})$ (e.g. being B-isomorphic), quite often $S$ is actually a topological state space with $\mathcal{S}$ its Borel $\sigma$-algebra. As seen in the preceding example, continuity properties of the transition probability are then of much relevance in the study of Markov chains on $S$. In this context, we say that $p : S \times \mathcal{B}_S \to [0, 1]$ is a strong Feller transition probability, when the linear operator $(ph)(\cdot) = \int p(\cdot, dy)\, h(y)$ of Lemma 6.1.3 maps every bounded $\mathcal{B}_S$-measurable function $h$ to $ph \in C_b(S)$, a continuous bounded function on $S$. In case of continuous transition densities, as in Example 6.3.11, the transition probability is strong Feller whenever the collection of probability measures $\{p(x, \cdot), x \in S\}$ is uniformly tight (per Definition 3.2.31).
In case $\mathcal{S} = \mathcal{B}_S$ we further have the following topological notions of reachability and irreducibility.

Definition 6.3.13. Suppose $\{X_n\}$ is a Markov chain on a topological space $S$ equipped with its Borel $\sigma$-algebra $\mathcal{S} = \mathcal{B}_S$. We call $x \in S$ a reachable state of $\{X_n\}$ if any neighborhood of $x$ is accessible by this chain, and call the chain O-irreducible (or open set irreducible), if every $x \in S$ is reachable, that is, every open set is accessible by $\{X_n\}$.

Remark. Equipping a countable state space $S$ with its discrete topology yields the Borel $\sigma$-algebra $\mathcal{S} = 2^S$, in which case O-irreducibility is equivalent to our earlier Definition 6.2.14 of irreducibility.
For more general topological state spaces (such as $S = \mathbb{R}^d$), by their definitions, a weakly irreducible chain is O-irreducible if and only if its maximal irreducibility measure $\psi$ is such that $\psi(O) > 0$ for any open subset $O$ of $S$. Conversely,

Exercise 6.3.14. Show that if a strong Feller transition probability $p(\cdot, \cdot)$ has a reachable state $x \in S$, then it is weakly irreducible.
Hint: Try the irreducibility measure $\varphi(\cdot) = p(x, \cdot)$.
Remark. The minorization (6.3.1) may cause the maximal irreducibility measure for the split chain to be supported on a smaller subset of the state space than the one for the original chain. For example, consider the trivial Doeblin chain of i.i.d. $\{X_n\}$, that is, $p(x, \cdot) = q(\cdot)$. In this case, taking $v(x) = 1$ results with the split chain $\bar{X}_n = \alpha$ for all $n \ge 1$, so the maximal irreducibility measures $\bar{\psi} = \delta_\alpha$ and $\psi = q$ of $\{\bar{X}_n\}$ and $\{X_n\}$ are then mutually singular.
This is of course precluded by our additional requirement that $v(x) q(\cdot) \le p(x, \cdot)$. For a strong H-irreducible chain $\{X_n\}$ it is easily accommodated by, for example, setting $v(x) = \delta' I_C(x)$ with $\delta' = \delta/2 > 0$, and then the restriction of $\bar{\psi}$ to $\mathcal{S}$ is a maximal irreducibility measure for $\{X_n\}$.
Strong H-irreducible chains with a recurrent atom $\alpha$ are called H-recurrent chains. That is,

Definition 6.3.15. A strong H-irreducible chain $\{X_n\}$ is called H-recurrent if $P_\alpha(T_\alpha < \infty) = 1$. By the strong Markov property of $\{\bar{X}_n\}$ at the consecutive visit times $T_\alpha^k$ of $\alpha$, H-recurrence further implies that $P_\alpha(T_\alpha^k \text{ finite for all } k) = 1$, or equivalently $P_\alpha(\bar{X}_n = \alpha \text{ i.o.}) = 1$.
Here are a few examples and exercises to clarify the concept of H-recurrence.
Example 6.3.16. Many strong H-irreducible chains are not H-recurrent. For example, combining part (c) of Exercise 6.3.10 with the remark following Definition 6.3.7 we see that such are all irreducible transient chains on a countable state space. By the same reasoning, a Markov chain of countable state space $S$ is H-recurrent if and only if $S = T \cup R$ with $R$ a non-empty irreducible, closed set of recurrent states and $T$ a collection of transient states that lead to $R$ (c.f. part (b) of Exercise 6.3.6 for such a decomposition in case of Doeblin chains). In particular, such chains are not necessarily recurrent in the sense of Definition 6.2.14. For example, the chain on $S = \{1, 2, \ldots\}$ with transitions $p(k, 1) = 1 - p(k, k+1) = k^{-s}$ for some constant $s > 0$, is H-recurrent but has only one recurrent state, i.e. $R = \{1\}$. Further, $\rho_{k1} < 1$ for all $k \ne 1$ when $s > 1$, while $\rho_{k1} = 1$ for all $k$ when $s \le 1$.
Remark. Advanced texts on Markov chains refer to what we call H-recurrence as the standard definition of recurrence and call such chains Harris recurrent when in addition $P_x(T_\alpha < \infty) = 1$ for all $x \in S$. As seen in the preceding example, both notions are weaker than the elementary notion of recurrence for countable $S$, per Definition 6.2.14. For this reason, we adopt here the convention of calling H-recurrence (with H after Harris), what is not the usual definition of Harris recurrence.
Exercise 6.3.17. Verify that any strong Doeblin chain is also H-recurrent. Conversely, show that for any H-recurrent chain $\{X_n\}$ there exist $C \in \mathcal{S}$ and a probability distribution $q$ on $(S, \mathcal{S})$ such that $P_q(T_C^k \text{ finite for all } k) = 1$ and the Markov chain $Z_k = X_{T_C^{k+1}}$ for $k \ge 0$ is then a strong Doeblin chain.
The next proposition shows that similarly to the elementary notion of recurrence, H-recurrence is transferred from the atom $\alpha$ to all sets that are accessible from it. Building on this proposition, you show in Exercise 6.3.19 that the same applies when starting at any irreducibility probability measure of the split chain and that every set in $\bar{\mathcal{S}}$ is either almost surely visited or almost surely never reached from $\alpha$ by the split chain.
Proposition 6.3.18. For an H-recurrent chain $\{X_n\}$ consider the probability measure
(6.3.4) $\mu_\alpha(B) = \sum_{n=1}^{\infty} 2^{-n} P_\alpha(\bar{X}_n \in B)$.
Then, $P_\alpha(\bar{X}_n \in B \text{ i.o.}) = 1$ whenever $\mu_\alpha(B) > 0$.

Proof. Clearly, $\mu_\alpha(B) > 0$ if and only if $P_\alpha(T_B < \infty) > 0$. Further, if $\epsilon = P_\alpha(T_B < \infty) > 0$, then considering the split chain starting at $\bar{X}_0 = \alpha$, we have from part (c) of Exercise 6.1.18 that
$P_\alpha(\{\bar{X}_n = \alpha \text{ finitely often}\} \cup \{\bar{X}_n \in B \text{ i.o.}\}) = 1$.
As $P_\alpha(\bar{X}_n = \alpha \text{ i.o.}) = 1$ by the assumed H-recurrence, our thesis that $P_\alpha(\bar{X}_n \in B \text{ i.o.}) = 1$ follows.
Exercise 6.3.19. Suppose $\mu_\alpha$ is the probability measure of (6.3.4) for an H-recurrent chain.
(a) Argue that $\{\alpha\}$ is accessible by the split chain $\{\bar{X}_n\}$ and show that $\mu_\alpha$ is a maximal irreducibility measure for it.
(b) Show that $P_\alpha(D) = P_q(D)$ for any shift invariant $D \in \bar{\mathcal{S}}_c$ (i.e. where $D = \theta^{-1} D$).
(c) In case $B \in \bar{\mathcal{S}}$ is such that $\mu_\alpha(B) > 0$ explain why $P_x(\bar{X}_n \in B \text{ i.o.}) = 1$ for $\mu_\alpha$-a.e. $x \in S$ and $P_{\bar{\mu}}(\bar{X}_n \in B \text{ i.o.}) = 1$ for any probability measure $\bar{\mu}$.
(d) Show that $P_\alpha(T_B < \infty) \in \{0, 1\}$ for all $B \in \bar{\mathcal{S}}$.
Given a strong H-irreducible chain $\{X_n\}$ there is no unique way to select the small set $C$, regeneration measure $q(\cdot)$ and $\delta > 0$ such that $p(x, \cdot) \ge \delta I_C(x) q(\cdot)$. Consequently, there are many different split chains for each chain $\{X_n\}$. Nevertheless, as you show next, H-recurrence is determined by the original chain $\{X_n\}$.

Exercise 6.3.20. Suppose $\{\bar{X}_n\}$ and $\{\bar{X}'_n\}$ are two different split chains for the same strong H-irreducible chain $\{X_n\}$, with the corresponding atoms $\alpha$ and $\alpha'$. Relying on Proposition 6.3.18 prove that $P_\alpha(T_\alpha < \infty) = 1$ if and only if $P_{\alpha'}(T_{\alpha'} < \infty) = 1$.
The concept of H-recurrence builds on measure theoretic properties of the chain, namely the minorization associated with strong H-irreducibility. In contrast, for a topological state space we have the following topological concept of O-recurrence, built on reachability of states.

Definition 6.3.21. A state $x$ of a Markov chain $\{X_n\}$ on (topological) state space $(S, \mathcal{B}_S)$ is called O-recurrent (or open set recurrent), if $P_x(X_n \in O \text{ i.o.}) = 1$ for any neighborhood $O$ of $x$ in $S$. All states $x \in S$ which are not O-recurrent are called O-transient. Such a chain is then called O-recurrent if every $x \in S$ is O-recurrent and O-transient if every $x \in S$ is O-transient.

Remark. As was the case with O-irreducibility versus irreducibility, for a countable state space $S$ equipped with its discrete topology, being O-recurrent (or O-transient), is equivalent to being recurrent (or transient, respectively), per Definitions 6.2.9 and 6.2.14.
The concept of O-recurrence is in particular suitable for the study of random walks. Indeed,

Exercise 6.3.22. Suppose $S_n = S_0 + \sum_{k=1}^{n} \xi_k$ is a random walk on $\mathbb{R}^d$.
(a) Show that if $\{S_n\}$ has one reachable state, then it is O-irreducible.
(b) Show that either $\{S_n\}$ is an O-recurrent chain or it is an O-transient chain.
(c) Show that if $\{S_n\}$ is O-recurrent, then
$\mathbb{S} = \{x \in \mathbb{R}^d : P_x(\|S_n\| < r \text{ i.o.}) > 0 \text{ for all } r > 0\}$
is a closed sub-group of $\mathbb{R}^d$ (i.e. $0 \in \mathbb{S}$ and if $x, y \in \mathbb{S}$ then also $x - y \in \mathbb{S}$), with respect to which $\{S_n\}$ is O-irreducible (i.e. $P_y(T_{B(x,r)} < \infty) > 0$ for all $r > 0$ and $x, y \in \mathbb{S}$).
In case of one-dimensional random walks, you are to recover next the Chung-Fuchs theorem, stating that if $n^{-1} S_n$ converges to zero in probability, then this Markov chain is O-recurrent.

Exercise 6.3.23 (Chung-Fuchs theorem). Suppose $\{S_n\}$ is a random walk on $\mathbb{S} \subseteq \mathbb{R}$.
(a) Show that such a random walk is O-recurrent if and only if for each $r > 0$,
$\sum_{n=0}^{\infty} P_0(|S_n| < r) = \infty$.
(b) Show that for any $r > 0$ and $k \in \mathbb{Z}$,
$\sum_{n=0}^{\infty} P_0(S_n \in [kr, (k+1)r)) \le \sum_{m=0}^{\infty} P_0(|S_m| < r)$,
and deduce that it suffices to check divergence of the series in part (a) for large $r$.
(c) Conclude that if $n^{-1} S_n \xrightarrow{p} 0$ as $n \to \infty$, then $\{S_n\}$ is O-recurrent.
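Part (a) reduces O-recurrence to divergence of $\sum_n P_0(|S_n| < r)$. For the simple symmetric walk on $\mathbb{Z}$ (where $n^{-1} S_n \to 0$, so Chung-Fuchs applies), these partial sums grow like $\sqrt{N}$; the sketch computes them by exact dynamic programming rather than simulation.

```python
def low_order_sum(N, r=2):
    """sum_{n=0}^{N} P_0(|S_n| < r) for the simple symmetric +-1 walk."""
    dist, total = {0: 1.0}, 1.0      # n = 0 contributes P(|S_0| < r) = 1
    for _ in range(N):
        new = {}
        for s, pr in dist.items():   # convolve with the +-1 step distribution
            new[s + 1] = new.get(s + 1, 0.0) + 0.5 * pr
            new[s - 1] = new.get(s - 1, 0.0) + 0.5 * pr
        dist = new
        total += sum(pr for s, pr in dist.items() if abs(s) < r)
    return total

s100, s900 = low_order_sum(100), low_order_sum(900)
```

Multiplying the horizon by 9 roughly triples the partial sum, consistent with $\sqrt{N}$ growth and hence with divergence of the full series, i.e. O-recurrence of this walk.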
6.3.2. Invariant measures, aperiodicity and asymptotic behavior. We consider hereafter an H-recurrent Markov chain $\{X_n\}$ of transition probability $p(\cdot, \cdot)$ on the B-isomorphic state space $(S, \mathcal{S})$ with its recurrent pseudo-atom $\alpha$ and the corresponding split and merge chains $\bar{p}(\cdot, \cdot)$, $m(\cdot, \cdot)$ on $(\bar{S}, \bar{\mathcal{S}})$ per Definitions 6.3.1 and 6.3.2.
The following lemma characterizes the invariant measures of the split chain $\bar{p}(\cdot, \cdot)$ and their relation to the invariant measures for $p(\cdot, \cdot)$. To this end, we use hereafter $\nu_1 \nu_2$ also for the measure $\nu_1 \nu_2(A) = \nu_1(\nu_2(\cdot, A))$ on $(\mathbb{X}, \mathcal{A})$ in case $\nu_1$ is a measure on $(\mathbb{X}, \mathcal{A})$ and $\nu_2$ is a transition probability on this space, and let $\bar{p}^n(x, B)$ denote the transition probability $P_x(\bar{X}_n \in B)$ on $(\bar{S}, \bar{\mathcal{S}})$.
Lemma 6.3.24. A measure $\bar{\mu}$ on $(\bar{S}, \bar{\mathcal{S}})$ is invariant for the split chain $\bar{p}(\cdot, \cdot)$ of a strong H-irreducible chain if and only if $\bar{\mu} = \bar{\mu}\bar{p}$ and $0 < \bar{\mu}(\{\alpha\}) < \infty$. Further, $\bar{\mu}m$ is then an invariant measure for the original chain $p(\cdot, \cdot)$. Conversely, if $\mu$ is an invariant measure for $p(\cdot, \cdot)$ then the measure $\mu\bar{p}$ is invariant for the split chain.
Proof. Recall Proposition 6.1.23 that a measure $\bar{\mu}$ is invariant for the split chain if and only if $\bar{\mu}$ is positive, $\sigma$-finite and
$\bar{\mu}(B) = \bar{\mu} \otimes \bar{p}(\bar{S} \times B) = \bar{\mu}\bar{p}(B)$  $\forall B \in \bar{\mathcal{S}}$.
Likewise, a measure $\mu$ is invariant for $p$ if and only if $\mu$ is positive, $\sigma$-finite and $\mu(A) = \mu p(A)$ for all $A \in \mathcal{S}$.
We first show that if $\bar{\mu}$ is invariant for $\bar{p}$ then $\mu = \bar{\mu}m$ is invariant for $p$. Indeed, note that from Definition 6.3.2 it follows that
(6.3.5) $\mu(A) = \bar{\mu}(A) + \bar{\mu}(\{\alpha\}) q(A)$  $\forall A \in \mathcal{S}$
and in particular, such $\mu$ is a positive, $\sigma$-finite measure on $(S, \mathcal{S})$ for any $\sigma$-finite $\bar{\mu}$ on $(\bar{S}, \bar{\mathcal{S}})$, and any probability measure $q(\cdot)$ on $(S, \mathcal{S})$. Further, starting the inhomogeneous Markov chain $\{Z_n\}$ of Proposition 6.3.4 with initial measure $\bar{\mu}$ for $Z_0 = \bar{X}_0$ yields the measure $\mu$ for $Z_1 = X_0$. By construction, the measure of $Z_2 = \bar{X}_1$ is then $\mu\bar{p}$ and that of $Z_3 = X_1$ is $(\mu\bar{p})m = \mu(\bar{p}m)$. Next, the invariance of $\bar{\mu}$ for $\bar{p}$ implies that the measure of $\bar{X}_1$ equals that of $\bar{X}_0$. Consequently, the measure of $X_1$ must equal that of $X_0$, namely $\mu = \mu(\bar{p}m)$. With $m(\cdot, \{\alpha\}) \equiv 0$ necessarily $\mu(\{\alpha\}) = 0$ and the identity $\mu = \mu(\bar{p}m)$ holds also for the restrictions to $(S, \mathcal{S})$ of both $\mu$ and $\bar{p}m$. Since the latter equals $p$ (see part (a) of Proposition 6.3.4), we conclude that $\mu = \mu p$, as claimed.
Conversely, let $\bar{\mu} = \mu\bar{p}$ where $\mu$ is an invariant measure for $p$ (and we set $\mu(\{\alpha\}) = 0$). Since $\mu$ is $\sigma$-finite, there exist $A_n \uparrow S$ such that $\mu(A_n) < \infty$ for all $n$ and necessarily also $q(A_n) > 0$ for all $n$ large enough (by monotonicity from below of the probability measure $q(\cdot)$). Further, the invariance of $\mu$ implies that $\bar{\mu}m = (\mu\bar{p})m = \mu$, i.e. the relation (6.3.5) holds. In particular, $\bar{\mu}(\bar{S}) = \mu(S)$ so $\bar{\mu}$ inherits the positivity of $\mu$. Moreover, both $\bar{\mu}(\{\alpha\}) = \infty$ and $\bar{\mu}(A_n) = \infty$ contradict the finiteness of $\mu(A_n)$ for all $n$, so the measure $\bar{\mu}$ is $\sigma$-finite on $(\bar{S}, \bar{\mathcal{S}})$. Next, start the chain $\{Z_n\}$ at $Z_0 = \bar{X}_0 \in S$ of initial measure $\mu$. It yields the same measure $\mu = \mu m$ for $Z_1 = X_0$, with measure $\bar{\mu} = \mu\bar{p}$ for $Z_2 = \bar{X}_1$ followed by $\bar{\mu}m = \mu$ for $Z_3 = X_1$ and $\mu\bar{p}$ for $Z_4 = \bar{X}_2$. As the measure of $X_1$ equals that of $X_0$, it follows that the measure $\mu\bar{p}$ of $\bar{X}_2$ equals the measure $\bar{\mu}$ of $\bar{X}_1$, i.e. $\bar{\mu}$ is invariant for $\bar{p}$.
Finally, suppose the measure $\bar{\mu}$ satisfies $\bar{\mu} = \bar{\mu}\bar{p}$. Iterating this identity we deduce that $\bar{\mu} = \bar{\mu}\bar{p}^n$ for all $n \ge 1$, hence also $\bar{\mu} = \bar{\mu}\bar{k}$ for the transition probability
(6.3.6) $\bar{k}(x, B) = \sum_{n=1}^{\infty} 2^{-n} \bar{p}^n(x, B)$.
Due to its strong H-irreducibility, the atom $\alpha$ of the split chain is an accessible set for the transition probability $\bar{p}$ (see part (b) of Exercise 6.3.10). So, from (6.3.6) we deduce that $\bar{k}(x, \{\alpha\}) > 0$ for all $x \in \bar{S}$. Consequently, as $n \to \infty$,
$B_n = \{x \in \bar{S} : \bar{k}(x, \{\alpha\}) \ge n^{-1}\} \uparrow \bar{S}$,
whereas by the identity $\bar{\mu}(\{\alpha\}) = (\bar{\mu}\bar{k})(\{\alpha\})$ also $\bar{\mu}(\{\alpha\}) \ge n^{-1} \bar{\mu}(B_n)$. This proves the first claim of the lemma. Indeed, we have just shown that when $\bar{\mu} = \bar{\mu}\bar{p}$ it follows that $\bar{\mu}$ is positive if and only if $\bar{\mu}(\{\alpha\}) > 0$ and $\sigma$-finite if and only if $\bar{\mu}(\{\alpha\}) < \infty$.
Our next result shows that, similarly to Proposition 6.2.27, the recurrent atom $\alpha$ induces an invariant measure for the split chain (and hence also one for the original chain).

Proposition 6.3.25. If $\{X_n\}$ is H-recurrent of transition probability $p(\cdot, \cdot)$ then
(6.3.7) $\bar{\mu}_\alpha(B) = E_\alpha\Big[\sum_{n=0}^{T_\alpha - 1} I_{\{\bar{X}_n \in B\}}\Big]$
is an invariant measure for $\bar{p}(\cdot, \cdot)$.
Proof. Let $\bar{\mu}_{\alpha,n}(B) = P_\alpha(\bar{X}_n \in B, T_\alpha > n)$, noting that
(6.3.8) $\bar{\mu}_\alpha(B) = \sum_{n=0}^{\infty} \bar{\mu}_{\alpha,n}(B)$  $\forall B \in \bar{\mathcal{S}}$
and $\bar{\mu}_{\alpha,n}(g) = E_\alpha[I_{\{T_\alpha > n\}} g(\bar{X}_n)]$ for all $g \in b\bar{\mathcal{S}}$. Since $\{T_\alpha > n\} \in \mathcal{T}_n^{\bar{X}} = \sigma(\bar{X}_k, k \le n)$, we have by the tower and Markov properties that, for each $n \ge 0$,
$P_\alpha(\bar{X}_{n+1} \in B, T_\alpha > n) = E_\alpha[I_{\{T_\alpha > n\}} P_\alpha(\bar{X}_{n+1} \in B \,|\, \mathcal{T}_n^{\bar{X}})] = E_\alpha[I_{\{T_\alpha > n\}} \bar{p}(\bar{X}_n, B)] = \bar{\mu}_{\alpha,n}(\bar{p}(\cdot, B)) = (\bar{\mu}_{\alpha,n}\bar{p})(B)$.
Hence,
$(\bar{\mu}_\alpha\bar{p})(B) = \sum_{n=0}^{\infty} (\bar{\mu}_{\alpha,n}\bar{p})(B) = \sum_{n=0}^{\infty} P_\alpha(\bar{X}_{n+1} \in B, T_\alpha > n) = E_\alpha\Big[\sum_{n=1}^{T_\alpha} I_{\{\bar{X}_n \in B\}}\Big] = \bar{\mu}_\alpha(B)$
since $P_\alpha(T_\alpha < \infty, \bar{X}_0 = \bar{X}_{T_\alpha}) = 1$. We thus established that $\bar{\mu}_\alpha\bar{p} = \bar{\mu}_\alpha$ and as $\bar{\mu}_\alpha(\{\alpha\}) = 1$ we conclude from Lemma 6.3.24 that it is an invariant measure for the split chain.
Building on Lemma 6.3.24 and Proposition 6.3.25 we proceed to the uniqueness of the invariant measure for an H-recurrent chain, namely the extension of Proposition 6.2.30 to a typically uncountable state space.

Theorem 6.3.26. Up to a constant multiple, the unique invariant measure for an H-recurrent transition probability $p(\cdot, \cdot)$ is the restriction to $(S, \mathcal{S})$ of $\bar{\mu}_\alpha m$, where $\bar{\mu}_\alpha$ is per (6.3.7).
Remark. As $\bar\mu_\alpha(\bar S) = \bar E_\alpha(\tau_\alpha)$, it follows from the theorem that an H-recurrent chain has an invariant probability measure if and only if $\bar E_\alpha(\tau_\alpha) = \bar E_q(\tau_\alpha) < \infty$. In accordance with Definition 6.2.40 we call such chains positive H-recurrent. While the value of $\bar E_\alpha(\tau_\alpha)$ depends on the specific split chain one associates with $\{X_n\}$, it follows from the preceding that positive H-recurrence, i.e. the finiteness of $\bar E_\alpha(\tau_\alpha)$, is determined by the original chain. Further, in view of the relation (6.3.5) between $\bar\mu_\alpha m$ and $\bar\mu_\alpha$ and the decomposition (6.3.8) of $\bar\mu_\alpha$, the unique invariant probability measure for $\{X_n\}$ is then
$$(6.3.9)\qquad \pi(A) = \frac{1}{\bar E_q(\tau_\alpha)} \sum_{n=0}^{\infty} \bar P_q(\bar X_n \in \bar A, \tau_\alpha > n) \qquad \forall A \in \mathcal S\,.$$
Proof. By Lemma 6.3.24, to any invariant measure $\mu$ for $p$ (with $\mu(\cdot)$ not identically zero) corresponds an invariant measure $\bar\mu$ for the split chain $\bar p$. It is also shown there that $0 < \bar\mu(\alpha) < \infty$. Hence, with no loss of generality we assume hereafter that the given invariant measure $\mu$ for $p$ has already been divided by this positive, finite constant, and so $\bar\mu(\alpha) = 1$. Recall that while proving Lemma 6.3.24 we further noted that $\mu = \bar\mu m$, due to the invariance of $\bar\mu$ for $\bar p$. Consequently, to prove the theorem it suffices to show that $\bar\mu = \bar\mu_\alpha$ (for then $\mu = \bar\mu m = \bar\mu_\alpha m$).

To this end, fix $\bar B \in \bar{\mathcal S}$ and recall from the proof of Lemma 6.3.24 that $\bar\mu$ is also invariant for $\bar p^n$ and any $n \ge 1$. Using the latter invariance property and applying Exercise 6.2.3 for $y = \alpha$ and the split chain $\{\bar X_n\}$, we find that
$$\bar\mu(\bar B) = (\bar\mu\bar p^n)(\bar B) = \int_{\bar S} \bar\mu(d\bar x)\, \bar P_{\bar x}(\bar X_n \in \bar B) \ge \int_{\bar S} \bar\mu(d\bar x)\, \bar P_{\bar x}(\bar X_n \in \bar B, \tau_\alpha \le n) = \sum_{k=0}^{n-1} (\bar\mu\bar p^{n-k})(\alpha)\, \bar P_\alpha(\bar X_k \in \bar B, \tau_\alpha > k) = \sum_{k=0}^{n-1} \bar\mu_{\alpha,k}(\bar B)\,,$$
with $\bar\mu_{\alpha,k}(\cdot)$ per the decomposition (6.3.8) of $\bar\mu_\alpha(\cdot)$. Taking $n \to \infty$, we thus deduce that
$$(6.3.10)\qquad \bar\mu(\bar B) \ge \sum_{k=0}^{\infty} \bar\mu_{\alpha,k}(\bar B) = \bar\mu_\alpha(\bar B) \qquad \forall \bar B \in \bar{\mathcal S}\,.$$
We proceed to show that this inequality actually holds with equality, namely, that $\bar\mu = \bar\mu_\alpha$. To this end, recall that while proving Lemma 6.3.24 we showed that invariant measures for $\bar p$, such as $\bar\mu$ and $\bar\mu_\alpha$, are also invariant for the transition probability $k(\cdot,\cdot)$ of (6.3.6), and by strong H-irreducibility the measurable function $\bar g(\cdot) = k(\cdot, \alpha)$ is strictly positive on $\bar S$. Therefore,
$$\bar\mu(\bar g) = (\bar\mu k)(\alpha) = \bar\mu(\alpha) = 1 = \bar\mu_\alpha(\alpha) = (\bar\mu_\alpha k)(\alpha) = \bar\mu_\alpha(\bar g)\,.$$
Recall Exercise 4.1.13 that an identity such as $\bar\mu(\bar g) = \bar\mu_\alpha(\bar g) = 1$ for a strictly positive $\bar g \in m\bar{\mathcal S}$ strengthens the inequality (6.3.10) between the two $\sigma$-finite measures $\bar\mu$ and $\bar\mu_\alpha$ on $(\bar S, \bar{\mathcal S})$ into the claimed equality $\bar\mu = \bar\mu_\alpha$. □
The next result is a natural extension of Theorem 6.2.57.

Theorem 6.3.27. Suppose $\{X_n\}$ and $\{Y_n\}$ are independent copies of a strong H-irreducible chain. Then, for any initial distribution of $(X_0, Y_0)$ and all $n$,
$$(6.3.11)\qquad \|\mathcal P_{X_n} - \mathcal P_{Y_n}\|_{\rm tv} \le 2 P(\tau_\alpha > n)\,,$$
where $\|\cdot\|_{\rm tv}$ denotes the total variation norm of Definition 3.2.22 and $\tau_\alpha = \min\{\ell \ge 0 : \bar X_\ell = \bar Y_\ell = \alpha\}$ is the time of the first joint visit of the atom $\alpha$ by the corresponding copies of the split chain under the coupling of Proposition 6.3.4.
Proof. Fixing $g \in b\mathcal S$ bounded by one, recall that the split mapping yields $\bar g \in b\bar{\mathcal S}$ of the same bound, and by part (c) of Proposition 6.3.4,
$$E g(X_n) - E g(Y_n) = \bar E \bar g(\bar X_n) - \bar E \bar g(\bar Y_n)$$
for any joint initial distribution of $(X_0, Y_0)$ on $(S^2, \mathcal S \otimes \mathcal S)$ and all $n \ge 0$. Further, since $\bar X_n = \bar Y_n$ in case $\tau_\alpha \le n$, following the proof of Theorem 6.2.57 one finds that $|\bar E \bar g(\bar X_n) - \bar E \bar g(\bar Y_n)| \le 2 P(\tau_\alpha > n)$. Since this applies for all $g \in b\mathcal S$ bounded by one, we are done. □
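In code, a coupling bound of the form (6.3.11) can be verified exactly on a toy example. The sketch below is illustrative only: it uses a made-up two-state chain and couples the two independent copies at their first meeting anywhere (rather than at the pseudo-atom of a split chain), which obeys the same total-variation inequality with $\tau$ the meeting time:

```python
# Exact check of a coupling bound like (6.3.11) on a two-state chain.
# The chain and initial laws are made up for illustration; tau is the
# first time the two independent copies meet (anywhere, not at an atom).
P = [[0.9, 0.1],
     [0.4, 0.6]]
mu, nu = [1.0, 0.0], [0.0, 1.0]          # laws of X_0 and Y_0

# joint law of (X_n, Y_n), with the copies moving together once they meet
joint = {(x, y): mu[x] * nu[y] for x in range(2) for y in range(2)}
for n in range(1, 30):
    mu = [sum(mu[x] * P[x][y] for x in range(2)) for y in range(2)]
    nu = [sum(nu[x] * P[x][y] for x in range(2)) for y in range(2)]
    new = {(x, y): 0.0 for x in range(2) for y in range(2)}
    for (x, y), pr in joint.items():
        if x == y:                        # coupled: one move, shared by both
            for z in range(2):
                new[(z, z)] += pr * P[x][z]
        else:                             # not yet coupled: independent moves
            for a in range(2):
                for b in range(2):
                    new[(a, b)] += pr * P[x][a] * P[y][b]
    joint = new
    not_met = sum(p for (x, y), p in joint.items() if x != y)   # P(tau > n)
    tv = sum(abs(m - v) for m, v in zip(mu, nu))                # ||.||_tv
    assert tv <= 2 * not_met + 1e-12     # the coupling inequality, each n
```

Since each coordinate of the coupled pair is still marginally a copy of the chain, the inequality must hold at every step, and indeed both sides decay geometrically here.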
Our goal is to extend the scope of the convergence result of Theorem 6.2.59 to the setting of positive H-recurrent chains. To this end, we first adapt Definition 6.2.54 of an aperiodic chain.

Definition 6.3.28. The period of a strongly H-irreducible chain is the g.c.d. $d_\alpha$ of the set $J_\alpha = \{n \ge 1 : \bar P_\alpha(\bar X_n = \alpha) > 0\}$ of return times to its pseudo-atom $\alpha$, and such a chain is called aperiodic if it has period one. For example, $q(C) > 0$ implies aperiodicity of the chain.

Remark. Recall that being (strongly) H-irreducible amounts for a countable state space to having exactly one non-empty equivalence class of intercommunicating states (which is accessible from any other state). The preceding definition then coincides with the common period of these intercommunicating states per Definition 6.2.54. More generally, our definition of the period of the chain seems to depend on which small set and regeneration measure one chooses. However, in analogy with Exercise 6.3.20, after some work it can be shown that any two split chains for the same strong H-irreducible chain induce the same period.
Theorem 6.3.29. Let $\pi(\cdot)$ denote the unique invariant probability measure of an aperiodic, positive H-recurrent Markov chain $\{X_n\}$. If $x \in S$ is such that $P_x(\tau_\alpha < \infty) = 1$, then
$$(6.3.12)\qquad \lim_{n \to \infty} \|P_x(X_n \in \cdot) - \pi(\cdot)\|_{\rm tv} = 0\,.$$

Remark. It follows from (6.3.9) that $\pi(\cdot)$ is absolutely continuous with respect to the measure of Proposition 6.3.18. Hence, by parts (a) and (c) of Exercise 6.3.19, both
$$(6.3.13)\qquad \bar P_{\bar\pi}(\tau_\alpha < \infty) = 1\,,$$
and $P_x(\tau_\alpha < \infty) = 1$ for $\pi$-a.e. $x \in S$. Consequently, the convergence result (6.3.12) holds for $\pi$-a.e. $x \in S$.
Proof. Consider independent copies $\{\bar X_n\}$ and $\{\bar Y_n\}$ of the split chain, starting at $X_0 = x$ and at $\bar Y_0$ whose law is the invariant probability measure $\bar\pi$ of the split chain. The corresponding $X_n$ and $Y_n$ per Proposition 6.3.4 have the laws $P_x(X_n \in \cdot)$ and $\pi(\cdot)$, respectively. Hence, in view of Theorem 6.3.27, to establish (6.3.12) it suffices to show that with probability one $\bar X_n = \bar Y_n = \alpha$ for some finite, possibly random value of $n$. Proceeding to prove the latter fact, recall (6.3.13) and the H-recurrence of the chain, in view of which we have with probability one that $\bar Y_n = \alpha$ for infinitely many values of $n$, say at random times $R_k$. Similarly, our assumption that $P_x(\tau_\alpha < \infty) = 1$ implies that with probability one $\bar X_n = \alpha$ for infinitely many values of $n$, say at another sequence of random times $\widetilde R_k$, and it remains to show that these two random subsets of $\{1, 2, \ldots\}$ intersect with probability one. To this end, note that upon adapting the argument used in solving Exercise 6.2.11 you find that $R_1$, $\widetilde R_1$, $r_k = R_{k+1} - R_k$ and $\widetilde r_k = \widetilde R_{k+1} - \widetilde R_k$ for $k \ge 1$ are mutually independent, with $\{r_k, \widetilde r_k, k \ge 1\}$ identically distributed, each following the law of $\tau_\alpha$ under $\bar P_\alpha$. Let $W_{n+1} = W_n + Z_n$ and $\widetilde W_{n+1} = \widetilde W_n + \widetilde Z_n$, starting at $W_0 = \widetilde W_0 = 1$, where the i.i.d. $Z$, $\{Z_n\}$, $\{\widetilde Z_n\}$ are independent of $\{\bar X_n\}$ and $\{\bar Y_n\}$ and such that $P(Z = k) = 2^{-k}$ for $k \ge 1$. It then follows by the strong Markov property of the split chains that $S_n = R_{W_n} - \widetilde R_{\widetilde W_n}$, $n \ge 0$, is a random walk on $\mathbb Z$, whose i.i.d. increments $\{\xi_n\}$ have each the law of the difference between two independent copies of $T^Z_\alpha$ under $\bar P_\alpha$. As mentioned already, our thesis follows from $P(S_n = 0 \text{ i.o.}) = 1$, which in view of Corollary 6.2.12 and Theorem 6.2.13 is in turn an immediate consequence of our claim that $\{S_n\}$ is an irreducible, recurrent Markov chain.
Turning to prove that $\{S_n\}$ is irreducible, note that since $Z$ is independent of $\{T^k_\alpha\}$, for any $n \ge 1$,
$$\bar P_\alpha(T^Z_\alpha = n) = \sum_{k=1}^{\infty} 2^{-k}\, \bar P_\alpha(T^k_\alpha = n) = \sum_{k=1}^{\infty} 2^{-k}\, \bar P_\alpha(N_n(\alpha) = k, \bar X_n = \alpha)\,.$$
Consequently, $\bar P_\alpha(T^Z_\alpha = n) > 0$ if and only if $\bar P_\alpha(\bar X_n = \alpha) > 0$. That is, the support of the law of $T^Z_\alpha$ is the set $J_\alpha$ of possible return times to $\alpha$. By the assumed aperiodicity of the chain, the g.c.d. of $J_\alpha$ is one (see Definition 6.3.28). Further, by definition this subset of positive integers is closed under addition, hence as we have seen in the course of proving Lemma 6.2.55, the set $J_\alpha$ contains all large enough integers. As $\xi_1$ is the difference between two independent copies of $T^Z_\alpha$, the law of each of which is strictly positive for all large enough positive integers, clearly $P(\xi_1 = z) > 0$ for all $z \in \mathbb Z$, out of which the irreducibility of $\{S_n\}$ follows.

As for the recurrence of $\{S_n\}$, note that by the assumed positive H-recurrence of $\{X_n\}$ and the independence of $Z$ and this chain,
$$\bar E_\alpha(T^Z_\alpha) = \sum_{k=1}^{\infty} \bar E_\alpha(T^k_\alpha)\, P(Z = k) = \bar E_\alpha(\tau_\alpha) \sum_{k=1}^{\infty} k\, P(Z = k) = \bar E_\alpha(\tau_\alpha)\, E(Z) < \infty\,.$$
Hence, the increments $\{\xi_n\}$ of the irreducible random walk $\{S_n\}$ on $\mathbb Z$ are integrable and of zero mean. Consequently, $n^{-1} S_n \stackrel{p}{\to} 0$ as $n \to \infty$, which by the Chung-Fuchs theorem implies the recurrence of $\{S_n\}$ (see Exercise 6.3.23). □
Exercise 6.3.30. Suppose $\{X_k\}$ is the first order auto-regressive process $X_n = \alpha X_{n-1} + \xi_n$, $n \ge 1$, with $|\alpha| < 1$ and where the integrable i.i.d. $\{\xi_n\}$ have a strictly positive, continuous density $f_\xi(\cdot)$ with respect to Lebesgue measure on $\mathbb R^d$.
(a) Show that $\{X_k\}$ is a strong H-irreducible chain.
(b) Show that $V_n = \sum_{k=0}^{n} \alpha^k \xi_k$ converges a.s. to $V_\infty = \sum_{k \ge 0} \alpha^k \xi_k$, whose law $\pi(\cdot)$ is an invariant probability measure for $\{X_k\}$.
(c) Show that $\{X_k\}$ is positive H-recurrent.
(d) Explain why $\{X_k\}$ is aperiodic and deduce that starting at any fixed $x \in \mathbb R^d$ the law of $X_n$ converges in total variation to $\pi(\cdot)$.
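In the Gaussian special case the limit in part (d) is explicit, which makes for a quick simulation sketch (all parameters below are arbitrary choices): with $d = 1$ and $\xi_n \sim N(0,1)$, part (b) identifies the invariant law $\pi$ as $N(0, 1/(1-\alpha^2))$, and the law of $X_n$ started far from the origin settles there:

```python
# Simulation sketch of Exercise 6.3.30(d) in the Gaussian case, d = 1.
# With xi ~ N(0,1) the invariant law pi is N(0, 1/(1 - alpha^2)),
# by summing the series V = sum_k alpha^k xi_k. Illustrative parameters.
import random

random.seed(0)
alpha, n_steps, n_paths = 0.5, 50, 20000
samples = []
for _ in range(n_paths):
    x = 10.0                                     # fixed, far-away start
    for _ in range(n_steps):
        x = alpha * x + random.gauss(0.0, 1.0)   # X_n = alpha X_{n-1} + xi_n
    samples.append(x)

mean = sum(samples) / n_paths
var = sum((s - mean) ** 2 for s in samples) / n_paths
# stationary variance is 1/(1 - alpha^2) = 4/3; alpha^50 * 10 is negligible
assert abs(mean) < 0.05 and abs(var - 1 / (1 - alpha ** 2)) < 0.05
```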
Exercise 6.3.31. Show that if $\{X_n\}$ is an aperiodic, positive H-recurrent chain and $x, y \in S$ are such that $P_x(\tau_\alpha < \infty) = P_y(\tau_\alpha < \infty) = 1$, then for any $A \in \mathcal S$,
$$\lim_{n \to \infty} |P_x(X_n \in A) - P_y(X_n \in A)| = 0\,.$$
Exercise 6.3.32. Suppose $\{\xi_n\}$ are i.i.d. with $P(\xi_1 = 1) = 1 - P(\xi_1 = -1) = b$ and $\{U_n\}$ are i.i.d., uniform on $[-5, 5]$ and independent of $\{\xi_n\}$. Consider the Markov chain $\{X_n\}$ with state space $S = \mathbb R$ such that $X_n = X_{n-1} + \xi_n\, {\rm sign}(X_{n-1})$ when $|X_{n-1}| > 5$ and $X_n = X_{n-1} + U_n$ otherwise.
(a) Show that this chain is strongly H-irreducible for any $0 \le b < 1$.
(b) Show that it has a unique invariant measure (up to a constant multiple) when $0 \le b \le 1/2$.
(c) Show that if $0 \le b < 1/2$ the chain has a unique invariant probability measure $\pi(\cdot)$ and that $P_x(X_n \in B) \to \pi(B)$ as $n \to \infty$, for any $x \in \mathbb R$ and every Borel set $B$.
CHAPTER 7
Continuous, Gaussian and stationary processes
A discrete parameter stochastic process (S.P.) is merely a sequence of random variables. We have encountered and constructed many such processes when considering martingales and Markov chains in Sections 5.1 and 6.1, respectively. Our focus here is on continuous time processes, each of which consists of an uncountable collection of random variables (defined on the same probability space).

We have successfully constructed by an ad-hoc method one such process, namely the Poisson process of Section 3.4. In contrast, Section 7.1 provides a canonical construction of a S.P., viewed as a collection of R.V.-s $\{X_t(\omega), t \in \mathcal T\}$. This construction, based on the specification of finite dimensional distributions, applies for any index set $\mathcal T$ and any S.P. taking values in a $\mathcal B$-isomorphic measurable space. However, this approach ignores the sample function $t \mapsto X_t(\omega)$ of the process. Consequently, the resulting law of the S.P. provides no information about probabilities such as that of continuity of the sample function, or whether it is ever zero, or the distribution of $\sup_{t \in \mathcal T} X_t$. We thus detail in Section 7.2 a way to circumvent this difficulty, whereby we guarantee, under suitable conditions, the continuity of the sample function for almost all outcomes $\omega$, or at the very least, its (Borel) measurability.

We conclude this chapter by studying in Section 7.3 the concept of stationarity (of processes and their increments) and the class of Gaussian (stochastic) processes, culminating with the definition and construction of the Brownian motion.
7.1. Definition, canonical construction and law

We start with the definition of a stochastic process.
Definition 7.1.1. Given $(\Omega, \mathcal F, P)$, a stochastic process, denoted $\{X_t\}$, is a collection $\{X_t : t \in \mathcal T\}$ of R.V.-s. In case the index set $\mathcal T$ is an interval in $\mathbb R$ we call it a continuous time S.P. The function $t \mapsto X_t(\omega)$ is called the sample function (or sample path, realization, or trajectory) of the S.P. at $\omega$.

We shall follow the approach we have taken in constructing product measures (in Section 1.4.2) and repeated for dealing with Markov chains (in Section 6.1). To this end, we start with the finite dimensional distributions associated with the S.P.

Definition 7.1.2. By finite dimensional distributions (f.d.d.) of a S.P. $\{X_t, t \in \mathcal T\}$ we refer to the collection of probability measures $\mu_{t_1, t_2, \ldots, t_n}(\cdot)$ on $\mathcal B^n$, indexed by $n$ and distinct $t_k \in \mathcal T$, $k = 1, \ldots, n$, where
$$\mu_{t_1, t_2, \ldots, t_n}(B) = P((X_{t_1}, X_{t_2}, \ldots, X_{t_n}) \in B)\,,$$
for any Borel subset $B$ of $\mathbb R^n$.
[Figure 1 here: two sample functions $t \mapsto X_t(\omega_1)$ and $t \mapsto X_t(\omega_2)$, plotted against $t$.]

Figure 1. Sample functions of a continuous time stochastic process, corresponding to two outcomes $\omega_1$ and $\omega_2$.
Not all f.d.d. are relevant here, for you should convince yourself that the f.d.d. of any S.P. must be consistent, as specified next.

Definition 7.1.3. We say that a collection of finite dimensional distributions is consistent if for any $B_k \in \mathcal B$, distinct $t_k \in \mathcal T$ and finite $n$,
$$(7.1.1)\qquad \mu_{t_1, \ldots, t_n}(B_1 \times \cdots \times B_n) = \mu_{t_{\pi(1)}, \ldots, t_{\pi(n)}}(B_{\pi(1)} \times \cdots \times B_{\pi(n)})\,,$$
for any permutation $\pi$ of $\{1, 2, \ldots, n\}$, and
$$(7.1.2)\qquad \mu_{t_1, \ldots, t_{n-1}}(B_1 \times \cdots \times B_{n-1}) = \mu_{t_1, \ldots, t_{n-1}, t_n}(B_1 \times \cdots \times B_{n-1} \times \mathbb R)\,.$$
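The two consistency requirements can be checked mechanically on the simplest discrete example, the f.d.d. of a symmetric simple random walk, by brute-force enumeration (an illustrative sketch; the times and Borel sets below are arbitrary choices):

```python
# Finite sanity check of (7.1.1)-(7.1.2) for the f.d.d. of the simple
# random walk S_n = xi_1 + ... + xi_n, P(xi = 1) = P(xi = -1) = 1/2.
from itertools import product, permutations

def fdd(times, boxes):
    """mu_{t_1,...,t_n}(B_1 x ... x B_n), by enumerating all sign paths."""
    T = max(times)
    total = 0.0
    for signs in product([1, -1], repeat=T):
        walk = [0]
        for s in signs:
            walk.append(walk[-1] + s)
        if all(walk[t] in B for t, B in zip(times, boxes)):
            total += 0.5 ** T
    return total

times, boxes = [1, 2, 3], [{0, 1}, {-2, 0, 2}, {1, 3}]
# (7.1.1): invariance under any permutation pi of the coordinates
for pi in permutations(range(3)):
    assert abs(fdd(times, boxes) -
               fdd([times[i] for i in pi], [boxes[i] for i in pi])) < 1e-12
# (7.1.2): appending a final coordinate with B_n = R changes nothing
R = set(range(-4, 5))                  # holds every value S_3 can take
assert abs(fdd([1, 2], boxes[:2]) - fdd([1, 2, 3], boxes[:2] + [R])) < 1e-12
```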
Here is a simpler, equivalent definition of consistent f.d.d. in case $\mathcal T$ is linearly (i.e. totally) ordered.

Lemma 7.1.4. In case $\mathcal T$ is a linearly ordered set (for example, $\mathcal T$ countable, or $\mathcal T \subseteq \mathbb R$), it suffices to define as f.d.d. the collection of probability measures $\mu_{s_1, \ldots, s_n}(\cdot)$ running over $s_1 < s_2 < \cdots < s_n$ in $\mathcal T$ and finite $n$, where such a collection is consistent if and only if for any $A_i \in \mathcal B$ and $k = 1, \ldots, n$,
$$(7.1.3)\qquad \mu_{s_1, \ldots, s_n}(A_1 \times \cdots \times A_{k-1} \times \mathbb R \times A_{k+1} \times \cdots \times A_n) = \mu_{s_1, \ldots, s_{k-1}, s_{k+1}, \ldots, s_n}(A_1 \times \cdots \times A_{k-1} \times A_{k+1} \times \cdots \times A_n)\,.$$
Proof. Since the set $\mathcal T$ is linearly ordered, for any distinct $t_i \in \mathcal T$, $i = 1, \ldots, n$, there exists a unique permutation $\pi$ on $\{1, \ldots, n\}$ such that $s_i = t_{\pi(i)}$ are in increasing order and, taking the random vector $(X_{s_1}, \ldots, X_{s_n})$ of (joint) distribution $\mu_{s_1, \ldots, s_n}(\cdot)$, we set $\mu_{t_1, \ldots, t_n}$ as the distribution of the vector $(X_{t_1}, \ldots, X_{t_n})$ of permuted coordinates. This unambiguously extends the definition of the f.d.d. from the ordered $s_1 < \cdots < s_n$ to all distinct $t_i \in \mathcal T$. Proceeding to verify the consistency of these f.d.d., note that by our definition the identity (7.1.1) holds whenever $\{t_{\pi(i)}\}$ are in increasing order. Permutations of $\{1, \ldots, n\}$ form a group with respect to composition, so (7.1.1) extends to $\{t_{\pi(i)}\}$ of arbitrary order. Next suppose that in the permutation $\pi$ of $\{1, \ldots, n\}$ such that $s_i = t_{\pi(i)}$ are in increasing order we have $n = \pi(k)$ for some $1 \le k \le n$. Then, setting $B_n = \mathbb R$ and $A_i = B_{\pi(i)}$ leads to $A_k = \mathbb R$, and from (7.1.1) and (7.1.3) it follows that
$$\mu_{t_1, \ldots, t_n}(B_1 \times \cdots \times B_{n-1} \times \mathbb R) = \mu_{s_1, \ldots, s_{k-1}, s_{k+1}, \ldots, s_n}(A_1 \times \cdots \times A_{k-1} \times A_{k+1} \times \cdots \times A_n)\,.$$
Further, $(t_1, \ldots, t_{n-1})$ is the image of $(s_1, \ldots, s_{k-1}, s_{k+1}, \ldots, s_n)$ under the permutation $\pi^{-1}$ restricted to $\{1, \ldots, k-1, k+1, \ldots, n\}$, so a second application of (7.1.1) results with the consistency condition (7.1.2). □
Our goal is to establish the existence and uniqueness (in law) of the S.P. associated with any given consistent collection of f.d.d. We shall do so via a canonical construction, whereby we set $\Omega = \mathbb R^{\mathcal T}$ and $\mathcal F = \mathcal B^{\mathcal T}$, as follows.

Definition 7.1.5. Let $\mathbb R^{\mathcal T}$ denote the collection of all functions $x(t) : \mathcal T \to \mathbb R$. A finite dimensional measurable rectangle in $\mathbb R^{\mathcal T}$ is any set of the form $\{x(\cdot) : x(t_i) \in B_i, i = 1, \ldots, n\}$ for a positive integer $n$, $B_i \in \mathcal B$ and $t_i \in \mathcal T$, $i = 1, \ldots, n$. The cylindrical $\sigma$-algebra $\mathcal B^{\mathcal T}$ is the $\sigma$-algebra generated by the collection of all finite dimensional measurable rectangles.

Note that in case $\mathcal T = \{1, 2, \ldots\}$, the $\sigma$-algebra $\mathcal B^{\mathcal T}$ is precisely the product $\sigma$-algebra $\mathcal B_c$ used in stating and proving Kolmogorov's extension theorem. Further, enumerating $C = \{t_k\}$, it is not hard to see that $\mathcal B^C$ is in one to one correspondence with $\mathcal B_c$ for any infinite, countable $C \subseteq \mathcal T$.
The next concept is handy in studying the structure of $\mathcal B^{\mathcal T}$ for uncountable $\mathcal T$.

Definition 7.1.6. We say that $A \subseteq \mathbb R^{\mathcal T}$ has a countable representation if
$$A = \{x(\cdot) \in \mathbb R^{\mathcal T} : (x(t_1), x(t_2), \ldots) \in D\}\,,$$
for some $D \in \mathcal B_c$ and $C = \{t_k\} \subseteq \mathcal T$. The set $C$ is then called the (countable) base of the (countable) representation $(C, D)$ of $A$.

Indeed, $\mathcal B^{\mathcal T}$ consists of the sets in $\mathbb R^{\mathcal T}$ having a countable representation, and $\mathcal F^X = \sigma(X_t, t \in \mathcal T)$ is further the pre-image of $\mathcal B^{\mathcal T}$ via the mapping $X_\cdot(\omega) : \Omega \to \mathbb R^{\mathcal T}$.
Lemma 7.1.7. The $\sigma$-algebra $\mathcal B^{\mathcal T}$ is the collection $\mathcal C$ of all subsets of $\mathbb R^{\mathcal T}$ that have a countable representation. Further, for any S.P. $\{X_t, t \in \mathcal T\}$, the $\sigma$-algebra $\mathcal F^X$ is the collection of sets of the form $\{\omega : X_\cdot(\omega) \in A\}$ with $A \in \mathcal B^{\mathcal T}$.
Proof. First note that for any subsets $\mathcal T_1 \subseteq \mathcal T_2$ of $\mathcal T$, the restriction to $\mathcal T_1$ of functions on $\mathcal T_2$ induces a measurable projection $p : (\mathbb R^{\mathcal T_2}, \mathcal B^{\mathcal T_2}) \to (\mathbb R^{\mathcal T_1}, \mathcal B^{\mathcal T_1})$. Further, enumerating over a countable $C$ maps the corresponding cylindrical $\sigma$-algebra $\mathcal B^C$ in a one to one manner into the product $\sigma$-algebra $\mathcal B_c$. Thus, if $A \in \mathcal C$ has the countable representation $(C, D)$ then $A = p^{-1}(D)$ for the measurable projection $p$ from $\mathbb R^{\mathcal T}$ to $\mathbb R^C$, hence $A \in \mathcal B^{\mathcal T}$. Having just shown that $\mathcal C \subseteq \mathcal B^{\mathcal T}$, we turn to show that conversely $\mathcal B^{\mathcal T} \subseteq \mathcal C$. Since each finite dimensional measurable rectangle has a countable representation (of a finite base), this is an immediate consequence of the fact that $\mathcal C$ is a $\sigma$-algebra. Indeed, $\mathbb R^{\mathcal T}$ has a countable representation (of empty base), and if $A \in \mathcal C$ has the countable representation $(C, D)$ then $A^c$ has the countable representation $(C, D^c)$. Finally, if $A_k \in \mathcal C$ has a countable representation $(C_k, D_k)$ for $k = 1, 2, \ldots$ then the subset $C = \cup_k C_k$ of $\mathcal T$ serves as a common countable base for these sets. That is, $A_k$ has the countable representation $(C, \widetilde D_k)$, for $k = 1, 2, \ldots$ and $\widetilde D_k = p_k^{-1}(D_k) \in \mathcal B_c$, where $p_k$ denotes the measurable projection from $\mathbb R^C$ to $\mathbb R^{C_k}$. Consequently, as claimed, $\cup_k A_k \in \mathcal C$ for it has the countable representation $(C, \cup_k \widetilde D_k)$.

As for the second part of the lemma, temporarily imposing on $\Omega$ the $\sigma$-algebra $2^\Omega$ makes $X_\cdot : \Omega \to \mathbb R^{\mathcal T}$ an $(S, \mathcal S)$-valued R.V. for $S = \mathbb R^{\mathcal T}$ and $\mathcal S = \mathcal B^{\mathcal T}$. From Exercises 1.2.10 and 1.2.11 we thus deduce that the collection of sets $\{\omega : X_\cdot(\omega) \in A\}$, $A \in \mathcal B^{\mathcal T}$, is the $\sigma$-algebra generated by the sets of the form $\{\omega : X_{t_i}(\omega) \in B_i, i = 1, \ldots, n\}$ for $B_i \in \mathcal B$, $t_i \in \mathcal T$ and finite $n$, which is precisely the $\sigma$-algebra $\mathcal F^X$. □
Combining Lemma 7.1.7 and Kolmogorov's extension theorem, we proceed with the promised canonical construction, yielding the following conclusion.

Proposition 7.1.8. For any consistent collection of f.d.d., there exists a probability space $(\Omega, \mathcal F, P)$ and a stochastic process $\{X_t(\omega), t \in \mathcal T\}$ on it, whose f.d.d. are in agreement with the given collection. Further, the restriction of the probability measure $P$ to the $\sigma$-algebra $\mathcal F^X$ is uniquely determined by the specified f.d.d.
Proof. Starting with the existence of the probability space, suppose first that $\mathcal T = C$ is countable. In this case, enumerating over $C = \{s_j\}$ we further have from the consistency condition (7.1.3) of Lemma 7.1.4 that it suffices to consider the sequence of f.d.d. $\mu_{s_1, \ldots, s_n}$ for $n = 1, 2, \ldots$, and the existence of a probability measure $P_C$ on $(\mathbb R^C, \mathcal B_c)$ that agrees with the given f.d.d. follows by Kolmogorov's extension theorem (i.e. Theorem 1.4.22). Moving to deal with uncountable $\mathcal T$, take $\Omega = \mathbb R^{\mathcal T}$ and $\mathcal F = \mathcal B^{\mathcal T}$ with $X_t(\omega) = \omega_t$. Recall Lemma 7.1.7, that any $A \in \mathcal B^{\mathcal T}$ has a countable representation $(C, D)$, so we can assign $P(A) = P_C(D)$, where $P_C$ is defined through Kolmogorov's extension theorem for the countable subset $C$ of $\mathcal T$. We proceed to show that $P(\cdot)$ is well defined. That is, $P_{C_1}(D_1) = P_{C_2}(D_2)$ for any two countable representations $(C_1, D_1)$ and $(C_2, D_2)$ of the same set $A \in \mathcal B^{\mathcal T}$. Since $C = C_1 \cup C_2$ is then also a countable base for $A$, we may and shall assume that $C_1 \subseteq C_2$, in which case necessarily $D_2 = p_{21}^{-1}(D_1)$ for the measurable projection $p_{21}$ from $\mathbb R^{C_2}$ to $\mathbb R^{C_1}$. By their construction, $P_{C_i}$ for $i = 1, 2$ coincide on all finite dimensional measurable rectangles with a base in $C_1$. Hence, $P_{C_1} = P_{C_2} \circ p_{21}^{-1}$ and in particular $P_{C_2}(D_2) = P_{C_1}(D_1)$. By construction the non-negative set function $P$ on $(\mathbb R^{\mathcal T}, \mathcal B^{\mathcal T})$ has the specified f.d.d. for $X_t(\omega) = \omega_t$, so we complete the proof of existence by showing that $P$ is countably additive. To this end, as shown in the proof of Lemma 7.1.7, any sequence of disjoint sets $A_k \in \mathcal B^{\mathcal T}$ admits countable representations $(C, \widetilde D_k)$, $k = 1, 2, \ldots$, with a common base $C$ and disjoint $\widetilde D_k \in \mathcal B_c$. Hence, by the countable additivity of $P_C$,
$$P(\cup_k A_k) = P_C(\cup_k \widetilde D_k) = \sum_k P_C(\widetilde D_k) = \sum_k P(A_k)\,.$$
As for uniqueness, recall Lemma 7.1.7 that every set in $\mathcal F^X$ is of the form $\{\omega : (X_{t_1}(\omega), X_{t_2}(\omega), \ldots) \in D\}$ for some $D \in \mathcal B_c$ and $C = \{t_j\}$ a countable subset of $\mathcal T$. Fixing such $C$, recall Kolmogorov's extension theorem, that the law of $(X_{t_1}, X_{t_2}, \ldots)$ on $\mathcal B_c$ is uniquely determined by the specified laws of $(X_{t_1}, \ldots, X_{t_n})$ for $n = 1, 2, \ldots$. Since this applies for any countable $C$, we see that the whole restriction of $P$ to $\mathcal F^X$ is uniquely determined by the given collection of f.d.d. □
Remark. Recall Corollary 1.4.25 that Kolmogorov's extension theorem holds when $(\mathbb R, \mathcal B)$ is replaced by any $\mathcal B$-isomorphic measurable space $(S, \mathcal S)$. Check that the same thus applies for the preceding proof, hence Proposition 7.1.8 holds for any $(S, \mathcal S)$-valued S.P. $\{X_t\}$ provided $(S, \mathcal S)$ is $\mathcal B$-isomorphic (c.f. [Dud89, Theorem 12.1.2] for an even more general setting in which the same applies).
Motivated by Proposition 7.1.8 our definition of the law of the S.P. is as follows.

Definition 7.1.9. The law (or distribution) of a S.P. is the probability measure $\mathcal P_X$ on $\mathcal B^{\mathcal T}$ such that for all $A \in \mathcal B^{\mathcal T}$,
$$\mathcal P_X(A) = P(\{\omega : X_\cdot(\omega) \in A\})\,.$$
Proposition 7.1.8 tells us that the f.d.d. uniquely determine the law of any S.P. and provide the probability of any event in $\mathcal F^X$. However, for our construction to be considered a success story, we want most events of interest to be in $\mathcal F^X$. That is, their image via the sample function should be in $\mathcal B^{\mathcal T}$. Unfortunately, as we show next, this is certainly not the case for uncountable $\mathcal T$.
Lemma 7.1.10. Fixing $\beta \in \mathbb R$ and $I = [a, b)$ for some $a < b$, the following sets
$$A_\beta = \{x \in \mathbb R^I : x(t) \le \beta \text{ for all } t \in I\}\,,$$
$$C(I) = \{x \in \mathbb R^I : t \mapsto x(t) \text{ is continuous on } I\}\,,$$
are not in $\mathcal B^I$.

Proof. In view of Lemma 7.1.7, if $A_\beta \in \mathcal B^I$ then $A_\beta$ has a countable base $C = \{t_k\}$ and in particular the values of $x(t_k)$ determine whether $x(\cdot) \in A_\beta$ or not. But $C$ is a strict subset of the uncountable index set $I$, so fixing some values $x(t_k) \le \beta$ for all $t_k \in C$, the function $x(\cdot)$ on $I$ still may or may not be in $A_\beta$, as by definition the latter further requires that $x(t) \le \beta$ for all $t \in I \setminus C$. Similarly, if $C(I) \in \mathcal B^I$ then it has a countable base $C = \{t_k\}$ and the values of $x(t_k)$ determine whether $x(\cdot) \in C(I)$. However, since $C \ne I$, fixing $x(\cdot)$ continuous on $I \setminus \{t\}$ with $t \in I \setminus C$, the function $x(\cdot)$ may or may not be continuous on $I$, depending on the value of $x(t)$. □
Remark. With $A_\beta \notin \mathcal B^I$, the canonical construction provides minimal information about $M_I = \sup_{t \in I} X_t$, which typically is not even measurable with respect to $\mathcal F^X$. However, note that $A_\beta \in \mathcal F^X$ in case all sample functions of $\{X_t\}$ are right-continuous. That is, for such S.P. the law of $M_I$ is uniquely determined by the f.d.d. We return to this point in Section 7.2 when considering separable modifications. Similarly, since $C(I) \notin \mathcal B^I$, the canonical construction does not assign a probability for continuity of the sample function. To further demonstrate that this type of difficulty is generic, recall that by our ad-hoc construction of the Poisson process out of its jump times, all sample functions of this process are in
$$Z_\uparrow = \{x \in \mathbb Z_+^I : t \mapsto x(t) \text{ is non-decreasing}\}\,,$$
where $I = [0, \infty)$. However, convince yourself that $Z_\uparrow \notin \mathcal B^I$, so had we applied the canonical construction starting from the f.d.d. of the Poisson process, we would not have had any probability assigned to this key property of its sample functions.

You are next to extend the phenomena illustrated by Lemma 7.1.10, providing a host of relevant subsets of $\mathbb R^I$ which are not in $\mathcal B^I$.
Exercise 7.1.11. Let $I \subseteq \mathbb R$ denote an interval of positive length.
(a) Show that none of the following collections of functions is in $\mathcal B^I$: all linear functions, all polynomials, all constants, all non-decreasing functions, all functions of bounded variation, all differentiable functions, all analytic functions, all functions continuous at a fixed $t \in I$.
(b) Show that $\mathcal B^I$ fails to contain the collection of functions that vanish somewhere in $I$, the collection of functions such that $x(s) < x(t)$ for some $s < t$, and the collection of functions with at least one local maximum.
(c) Show that $C(I)$ has no non-empty subset $A \in \mathcal B^I$, but the complement of $C(I)$ in $\mathbb R^I$ has a non-empty subset $A \in \mathcal B^I$.
(d) Show that the completion $\bar{\mathcal B}^I$ of $\mathcal B^I$ with respect to any probability measure $P$ on $\mathcal B^I$ fails to contain the set $A = B(I)$ of all Borel measurable functions $x : I \to \mathbb R$.
Hint: Consider $A$ and $A^c$.
In contrast to the preceding exercise, independence of the increments of a S.P. is determined by its f.d.d.

Exercise 7.1.12. A continuous time S.P. $\{X_t, t \ge 0\}$ has independent increments if $X_{t+h} - X_t$ is independent of $\mathcal F^X_t = \sigma(X_s, 0 \le s \le t)$ for any $h > 0$ and all $t \ge 0$. Show that if $X_{t_1}, X_{t_2} - X_{t_1}, \ldots, X_{t_n} - X_{t_{n-1}}$ are mutually independent, for all $n < \infty$ and $0 \le t_1 < t_2 < \cdots < t_n < \infty$, then $\{X_t\}$ has independent increments. Hence, this property is determined by the f.d.d. of $\{X_t\}$.
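On a discrete-time stand-in, the factorization asserted in the exercise can be read off the f.d.d. directly. The sketch below enumerates the eight sign paths of a symmetric simple random walk and checks that $S_1$ and $S_3 - S_1$ are independent (the times and values are arbitrary choices for illustration):

```python
# Independence of increments for the simple random walk, by enumeration.
from itertools import product

paths = list(product([1, -1], repeat=3))           # (xi_1, xi_2, xi_3)

def S(p, n):
    return sum(p[:n])                              # S_n, with S_0 = 0

def prob(event):
    return sum(1 for p in paths if event(p)) / len(paths)

for a in (-1, 1):
    for b in (-2, 0, 2):
        joint = prob(lambda p: S(p, 1) == a and S(p, 3) - S(p, 1) == b)
        split = prob(lambda p: S(p, 1) == a) * prob(lambda p: S(p, 3) - S(p, 1) == b)
        assert abs(joint - split) < 1e-12          # joint law factorizes
```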
Here is the canonical construction for Poisson random measures, where $\mathcal T$ is not a subset of $\mathbb R$ (for example, the Poisson point processes, where $\mathcal T = \mathcal B_{\mathbb R^d}$).

Exercise 7.1.13. Let $\mathcal T = \{A \in \mathcal A : \mu(A) < \infty\}$ for a given measure space $(\mathbb X, \mathcal A, \mu)$. Construct a S.P. $\{N_A : A \in \mathcal T\}$ such that $N_A$ has the Poisson($\mu(A)$) law for each $A \in \mathcal T$ and $N_{A_k}$, $k = 1, \ldots, n$ are $P$-mutually independent whenever $A_k$, $k = 1, \ldots, n$ are disjoint sets.
Hint: Given $A_j \in \mathcal T$, $j = 1, 2$, let $B_{j1} = A_j = B_{j0}^c$ and $N_{b_1, b_2}$, for $b_1, b_2 \in \{0, 1\}$ such that $(b_1, b_2) \ne (0, 0)$, be independent R.V. of Poisson($\mu(B_{1 b_1} \cap B_{2 b_2})$) law. As the distribution of $(N_{A_1}, N_{A_2})$ take the joint law of $(N_{1,1} + N_{1,0}, N_{1,1} + N_{0,1})$.
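One ingredient of the hint can be sanity-checked numerically: independent Poisson laws convolve to a Poisson law, so $N_{A_1} = N_{1,1} + N_{1,0}$ indeed has the Poisson($\mu(A_1)$) law. A sketch with made-up values of $\mu$ on the partition:

```python
# Independent Poisson(m11) and Poisson(m10) variables sum to
# Poisson(m11 + m10); checked termwise on the pmf by direct convolution.
from math import exp, factorial

def pois(lam, k):
    return exp(-lam) * lam ** k / factorial(k)

m11, m10 = 0.7, 0.5        # mu(A1 ∩ A2) and mu(A1 \ A2), so mu(A1) = 1.2
for k in range(15):
    conv = sum(pois(m11, j) * pois(m10, k - j) for j in range(k + 1))
    assert abs(conv - pois(m11 + m10, k)) < 1e-12
```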
Remark. The Poisson process $\widetilde N_t$ of rate one is merely the restriction to the sets $A = [0, t]$, $t \ge 0$, of the Poisson random measure $N_A$ in case $\mu(\cdot)$ is Lebesgue's measure on $[0, \infty)$. More generally, in case $\mu(\cdot)$ has density $f(\cdot)$ with respect to Lebesgue's measure on $[0, \infty)$, we call such restriction $X_t = N_{[0,t]}$ the inhomogeneous Poisson process of rate function $f(t) \ge 0$, $t \ge 0$. It is a counting process of independent increments, which is a non-random time change $X_t = \widetilde N_{\mu([0,t])}$ of a Poisson process of rate one, but in general the gaps between jump times of $\{X_t\}$ are neither i.i.d. nor of exponential distribution.
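The time-change description suggests a simple way to simulate such a process, sketched below for the illustrative rate $f(t) = 2t$, so that $\mu([0, t]) = t^2$: map the jump times of a rate-one process through the inverse time change $s \mapsto \sqrt{s}$, and check that the mean count on $[0, 1]$ is near $\mu([0, 1]) = 1$:

```python
# Inhomogeneous Poisson process via time change (illustrative numbers).
import random

random.seed(1)
trials, total = 20000, 0
for _ in range(trials):
    s, jumps = 0.0, []
    while True:
        s += random.expovariate(1.0)   # rate-one jump times: i.i.d. exp gaps
        if s > 1.0:                    # we only watch X_t for t in [0, 1]
            break
        jumps.append(s ** 0.5)         # inverse of mu([0,t]) = t^2 is sqrt
    total += len(jumps)
# X_1 ~ Poisson(mu([0,1])) = Poisson(1), so the average count is near 1
assert abs(total / trials - 1.0) < 0.05
```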
7.2. Continuous and separable modifications

The canonical construction of Section 7.1 determines the law of a S.P. $\{X_t\}$ on $\mathcal B^{\mathcal T}$, the image of $\mathcal F^X$. Recall that $\mathcal F^X$ is inadequate as far as properties of the sample functions $t \mapsto X_t(\omega)$ are concerned. Nevertheless, a typical patch of this approach is to choose among the S.P. with the given f.d.d. one that has regular enough sample functions. To illustrate this, we start with a simple explicit example in which path properties are not entirely determined by the f.d.d.
Example 7.2.1. Consider the S.P.-s
$$Y_t(\omega) = 0 \quad \forall t, \omega\,, \qquad X_t(\omega) = \begin{cases} 1, & t = \omega \\ 0, & \text{otherwise} \end{cases}$$
on the probability space $([0, 1], \mathcal B_{[0,1]}, U)$, with $U$ the uniform measure on $I = [0, 1]$. Since $A_t = \{\omega : X_t(\omega) \ne Y_t(\omega)\} = \{t\}$, clearly $P(X_t = Y_t) = 1$ for each fixed $t \in I$. Moreover, $P(\cup_{i=1}^{n} A_{t_i}) = 0$ for any $t_1, \ldots, t_n \in I$, hence $\{X_t\}$ has the same f.d.d. as $\{Y_t\}$. However, $P(\omega : \sup_{t \in I} X_t(\omega) \ne 0) = 1$, whereas $P(\omega : \sup_{t \in I} Y_t(\omega) \ne 0) = 0$. Similarly, $P(\omega : X_\cdot(\omega) \in C(I)) = 0$, whereas $P(\omega : Y_\cdot(\omega) \in C(I)) = 1$.
While the two S.P. of Example 7.2.1 have different maximal value and differ in their sample path continuity, we would typically consider one to be merely a (small) modification of the other, motivating our next definition.

Definition 7.2.2. Stochastic processes $\{X_t\}$ and $\{Y_t\}$ are called versions of one another if they have the same f.d.d. A S.P. $\{Y_t, t \in \mathcal T\}$ is further called a modification of $\{X_t, t \in \mathcal T\}$ if $P(X_t \ne Y_t) = 0$ for all $t \in \mathcal T$, and two such S.P. are called indistinguishable if $\{\omega : X_t(\omega) \ne Y_t(\omega) \text{ for some } t \in \mathcal T\}$ is a $P$-null set (hence, upon completing the space, $P(X_t \ne Y_t \text{ for some } t \in \mathcal T) = 0$). Similarly to Definition 1.2.8, throughout we consider two indistinguishable S.P.-s to be the same process, hence often omit the qualifier a.s. in reference to sample function properties that apply for all $t \in \mathcal T$.
For example, $\{Y_t\}$ is the continuous modification of $\{X_t\}$ in Example 7.2.1, but these two processes are clearly distinguishable. In contrast, modifications with a.s. right-continuous sample functions are indistinguishable.

Exercise 7.2.3. Show that continuous time S.P.-s $\{X_t\}$ and $\{Y_t\}$ which are modifications of each other and have w.p.1. right-continuous sample functions must also be indistinguishable.

You should also convince yourself at this point that, as we have implied, if $\{Y_t\}$ is a modification of $\{X_t\}$, then $\{Y_t\}$ is also a version of $\{X_t\}$. The converse fails, for while a modification has to be defined on the same probability space as the original S.P., this is not required of versions. Even on the same probability space it is easy to find a pair of versions which are not modifications of each other.

Example 7.2.4. For the uniform probability measure on the finite set $\Omega = \{H, T\}$, the constant in time S.P.-s $X_t(\omega) = I_{\{\omega = H\}}$ and $Y_t(\omega) = 1 - X_t(\omega)$ are clearly versions of each other but not modifications of each other.
We proceed to derive a relatively easy to check sufficient condition for the existence of a (continuous) modification of the S.P. which has Hölder continuous sample functions, as defined next.

Definition 7.2.5. Recall that a function $f(t)$ on a metric space $(\mathcal T, d(\cdot,\cdot))$ is locally $\gamma$-Hölder continuous if
$$\sup_{\{t \ne s,\ d(t,u) \vee d(s,u) < h_u\}} \frac{|f(t) - f(s)|}{d(t, s)^\gamma} \le c_u\,,$$
for $\gamma > 0$, some $c : \mathcal T \to [0, \infty)$ and $h : \mathcal T \to (0, \infty]$, and is uniformly $\gamma$-Hölder continuous if the same applies for constant $c < \infty$ and $h \equiv \infty$. In case $\gamma = 1$ such functions are also called locally (or uniformly) Lipschitz continuous, respectively. We say that a S.P. $\{Y_t, t \in \mathcal T\}$ is locally/uniformly $\gamma$-Hölder/Lipschitz continuous with respect to a metric $d(\cdot,\cdot)$ on $\mathcal T$ if its sample functions $t \mapsto Y_t(\omega)$ have the corresponding property (for some S.P. $c_u(\omega) < \infty$ and $h_u(\omega) > 0$, further requiring $c$ to be a non-random constant for uniform continuity). Since local Hölder continuity implies continuity, clearly then $P(\omega : Y_\cdot(\omega) \in C(\mathcal T)) = 1$. That is, such processes have continuous sample functions. We also use the term continuous modification to denote a modification $\widetilde X_t$ of a given S.P. $X_t$ such that $\widetilde X_t$ has continuous sample functions (and similarly define locally/uniformly $\gamma$-Hölder continuous modifications).

Remark. The Euclidean norm $d(t, s) = \|t - s\|$ is used for sample path continuity of a random field, namely, where $\mathcal T \subseteq \mathbb R^r$ for some finite $r$, taking $d(t, s) = |t - s|$ for a continuous time S.P. Also, recall that for a compact metric space $(\mathcal T, d)$ there is no difference between local and uniform Hölder continuity of $f : \mathcal T \to \mathbb R$, so in this case local $\gamma$-Hölder continuity of a S.P. $\{Y_t, t \in \mathcal T\}$ is equivalent to
$$P\Big(\omega : \sup_{s \ne t \in \mathcal T} \frac{|Y_t(\omega) - Y_s(\omega)|}{d(t, s)^\gamma} \le c(\omega)\Big) = 1\,,$$
for some finite R.V. $c(\omega)$.
Theorem 7.2.6 (Kolmogorov-Čentsov continuity theorem). Suppose $\{X_t\}$ is a S.P. indexed on $\mathcal T = I^r$, with $I$ a compact interval. If there exist positive constants $\alpha$, $\beta$ and finite $c$ such that
$$(7.2.1)\qquad E[|X_t - X_s|^\alpha] \le c\, \|t - s\|^{r + \beta}\,, \qquad \text{for all } s, t \in \mathcal T,$$
then there exists a continuous modification of $\{X_t, t \in \mathcal T\}$ which is also locally $\gamma$-Hölder continuous for any $0 < \gamma < \beta/\alpha$.

Remark. Since condition (7.2.1) involves only the joint distribution of $(X_s, X_t)$, it is determined by the f.d.d. of the process. Consequently, either all versions of the given S.P. satisfy (7.2.1) or none of them does.
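As a concrete instance of (7.2.1): the Brownian motion constructed in Section 7.3 has increments $B_t - B_s \sim N(0, |t-s|)$, so $E|B_t - B_s|^4 = 3|t-s|^2$, i.e. (7.2.1) holds with $\alpha = 4$, $r = 1$, $\beta = 1$, yielding locally $\gamma$-Hölder paths for any $\gamma < 1/4$ (and, using arbitrarily high moments, for any $\gamma < 1/2$). A seeded Monte Carlo check of the fourth-moment identity (illustrative values):

```python
# E|B_t - B_s|^4 = 3|t-s|^2 for a Gaussian increment of variance h = |t-s|.
import random

random.seed(2)
h, n = 0.3, 200000
est = sum(abs(h ** 0.5 * random.gauss(0.0, 1.0)) ** 4 for _ in range(n)) / n
assert abs(est - 3 * h ** 2) < 0.05 * 3 * h ** 2   # within 5% of 3|t-s|^2
```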
Proof. We consider hereafter the case of $r = 1$, assuming with no loss of generality that $\mathcal T = [0, 1]$, and leave to the reader the adaptation of the proof to $r \ge 2$ (to this end, see [KaS97, Solution of Problem 2.2.9]).

Our starting point is the bound
$$(7.2.2)\qquad P(|X_t - X_s| \ge \varepsilon) \le \varepsilon^{-\alpha} E[|X_t - X_s|^\alpha] \le c\, \varepsilon^{-\alpha} |t - s|^{1 + \beta}\,,$$
which holds for any $\varepsilon > 0$ and $t, s \in [0, 1]$, where the first inequality follows from Markov's inequality and the second from (7.2.1). From this bound we establish the a.s. local Hölder continuity of the sample function of $X_t$ over the collection $Q^{(2)}_1 = \cup_{\ell \ge 1} Q^{(2,\ell)}_1$ of dyadic rationals in $[0, 1]$, where $Q^{(2,\ell)}_T = \{j 2^{-\ell} \le T,\ j \in \mathbb Z_+\}$. To this end, fixing $\gamma < \beta/\alpha$ and considering (7.2.2) for $\varepsilon = 2^{-\gamma\ell}$, we have by finite sub-additivity that
$$P\Big(\max_{j=0}^{2^\ell - 1} |X_{(j+1)2^{-\ell}} - X_{j 2^{-\ell}}| \ge 2^{-\gamma\ell}\Big) \le c\, 2^{-\eta\ell}\,,$$
for $\eta = \beta - \gamma\alpha > 0$. Since $\sum_\ell 2^{-\eta\ell}$ is finite, it then follows by Borel-Cantelli I that
$$\max_{j=0}^{2^\ell - 1} |X_{(j+1)2^{-\ell}} - X_{j 2^{-\ell}}| < 2^{-\gamma\ell}\,, \qquad \forall \ell \ge n_\gamma(\omega)\,,$$
where $n_\gamma(\omega)$ is finite for all $\omega \notin N_\gamma$ and $N_\gamma \in \mathcal F$ has zero probability.

As you show in Exercise 7.2.7 this implies the local $\gamma$-Hölder continuity of $t \mapsto X_t(\omega)$ over the dyadic rationals. That is,
$$(7.2.3)\qquad |X_t(\omega) - X_s(\omega)| \le c(\gamma)\, |t - s|^\gamma\,,$$
for $c(\gamma) = 2/(1 - 2^{-\gamma})$ finite and any $t, s \in Q^{(2)}_1$ such that $|t - s| < h_\gamma(\omega) = 2^{-n_\gamma(\omega)}$.
Turning to construct the S.P. {X̃_t, t ∈ T}, we fix γ_k ↑ β/α and set N_* = ∪_k N_{γ_k}. Considering the R.V.-s X̃_s(ω) = X_s(ω) I_{N_*^c}(ω) for s ∈ Q^{(2)}_1, we further set X̃_t = lim_n X̃_{s_n} for some non-random s_n ∈ Q^{(2)}_1 such that s_n → t ∈ [0, 1] \ Q^{(2)}_1. Indeed, in view of (7.2.3), by the uniform continuity of s ↦ X̃_s(ω) over Q^{(2)}_1, the sequence n ↦ X̃_{s_n}(ω) is Cauchy, hence convergent, per ω ∈ Ω. By construction, the S.P. {X̃_t, t ∈ [0, 1]} is such that

|X̃_t(ω) − X̃_s(ω)| ≤ c_k |t − s|^{γ_k},

for any k and t, s ∈ [0, 1] such that |t − s| < h̃_k(ω), where h̃_k = I_{N_*} + I_{N_*^c} h_{γ_k} is positive for all ω and c_k = c(γ_k) is finite. That is, X̃_t is locally γ-Hölder continuous on T = [0, 1] for any γ = γ_k, hence also for all γ < β/α (and in particular, {X̃_t, t ∈ [0, 1]} has continuous sample functions).
It thus remains only to verify that X̃_t is a modification of X_t. To this end, observe first that since P(N_{γ_k}) = 0 for all k, also P(N_*) = 0. Further, X̃_s(ω) = X_s(ω) for all s ∈ Q^{(2)}_1 and ω ∉ N_*. Next, from (7.2.2) we have that P(|X_t − X_{s_n}| ≥ ε) → 0 for any fixed ε > 0 and s_n → t, that is, X_{s_n} →^p X_t. Hence, recall Theorem 2.2.10, also X_{s_{n(k)}} →^{a.s.} X_t along some subsequence k ↦ n(k). Considering an arbitrary t ∈ [0, 1] \ Q^{(2)}_1 and the sequence s_n ∈ Q^{(2)}_1 as in the construction of X̃_t, we have in addition that X̃_{s_{n(k)}} → X̃_t. Consequently, P(X̃_t ≠ X_t) ≤ P(N_*) = 0, from which we conclude that X̃_t is a modification of X_t on T = [0, 1]. □
Exercise 7.2.7. Fixing x ∈ ℝ^{[0,1]}, let

Δ_{n,r}(x) = max_{j=0,…,2^n−r} |x((j + r)2^{−n}) − x(j 2^{−n})|.

(a) Show that for any integers k > m ≥ 0,

sup{ |x(t) − x(s)| : t, s ∈ Q^{(2,k)}_1, |t − s| < 2^{−m} } ≤ 2 ∑_{n=m+1}^{k} Δ_{n,1}(x).

Hint: Applying induction on k consider s < t and s ≤ s′ ≤ t′ ≤ t, where s′ = min{u ∈ Q^{(2,k−1)}_1 : u ≥ s} and t′ = max{u ∈ Q^{(2,k−1)}_1 : u ≤ t}.
(b) Fixing γ > 0, let c_γ = 2/(1 − 2^{−γ}) and deduce that if Δ_{n,1}(x) ≤ 2^{−γn} for all n ≥ n_*, then

|x(t) − x(s)| ≤ c_γ |t − s|^γ for all t, s ∈ Q^{(2)}_1 such that |t − s| < 2^{−n_*}.

Hint: Apply part (a) for m ≥ n_* such that 2^{−(m+1)} ≤ |t − s| < 2^{−m}.
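The chaining bound of part (a) is easy to check numerically. The sketch below (Python/NumPy; the random-walk sample function and the levels k, m are arbitrary illustrative choices) evaluates Δ_{n,1} on nested dyadic grids and verifies the stated inequality for one sampled function:

```python
import numpy as np

rng = np.random.default_rng(1)
k, m = 10, 3
N = 2**k + 1
# values of some function on the level-k dyadic grid (here a random-walk
# interpolation, standing in for a Brownian-like sample path)
x = np.concatenate([[0.0], np.cumsum(rng.normal(0, 2**(-k / 2), N - 1))])

def delta(x, n, k):
    """Delta_{n,1}(x): max |x((j+1)2^-n) - x(j 2^-n)| over the level-n grid."""
    v = x[::2**(k - n)]          # subsample the level-k grid down to level n
    return np.max(np.abs(np.diff(v)))

# left side of (a): sup |x(t)-x(s)| over dyadic pairs with |t-s| < 2^-m
lhs = max(abs(x[i] - x[j]) for i in range(N) for j in range(i + 1, N)
          if (j - i) < 2**(k - m))
rhs = 2 * sum(delta(x, n, k) for n in range(m + 1, k + 1))
print(f"sup = {lhs:.4f} <= 2 * sum of dyadic increments = {rhs:.4f}")
```

Since part (a) is a deterministic inequality, the comparison holds for every sampled x, not merely on average.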
We next identify the restriction of the cylindrical σ-algebra of ℝ^T to C(T) as the Borel σ-algebra on the space of continuous functions, starting with T = I^r for I a compact interval.
278 7. CONTINUOUS, GAUSSIAN AND STATIONARY PROCESSES
Lemma 7.2.8. For T = I^r and I ⊂ ℝ a compact interval, consider the topological space (C(T), ‖·‖_∞) of continuous functions on T, equipped with the topology induced by the supremum norm ‖x‖_∞ = sup_{t∈T} |x(t)|. The corresponding Borel σ-algebra, denoted hereafter B_{C(T)}, coincides with {A ∩ C(T) : A ∈ B^T}.

Proof. Recall that for any z ∈ C(T),

‖z‖_∞ = sup_{t ∈ T∩ℚ^r} |z(t)|.

Hence, each open ball

B(x, r) = {y ∈ C(T) : ‖y − x‖_∞ < r}

in S = (C(T), ‖·‖_∞) is the countable intersection of R_t ∩ C(T) for the corresponding one dimensional measurable rectangles R_t ∈ B^T indexed by t ∈ T ∩ ℚ^r. Consequently, each open ball B(x, r) is in the σ-algebra G = {A ∩ C(T) : A ∈ B^T}. With Γ denoting a countable dense subset of the separable metric space S, it readily follows that S has a countable base U, consisting of the balls B(x, 1/n) for positive integers n and centers x ∈ Γ. With every open set thus being a countable union of elements from U, it follows that B_S = σ(U). Further, U ⊆ G, hence also B_S ⊆ G.
Conversely, recall that G = σ(O) for the collection O of sets of the form

O = {x ∈ C(T) : x(t_i) ∈ O_i, i = 1, …, n},

with n finite, t_i ∈ T and open O_i ⊆ ℝ, i = 1, …, n. Clearly, each O ∈ O is an open subset of S and it follows that G ⊆ B_S. □
In the next exercise, you adapt the proof of Lemma 7.2.8 for T = [0, ∞) (and the same would apply for T ⊆ ℝ^r which is the product of one-dimensional intervals).

Exercise 7.2.9. For T = [0, ∞), equip the set C(T) of continuous functions on T with the topology of uniform convergence on compact subsets of T. Show that the corresponding Borel σ-algebra B_{C(T)} coincides with {A ∩ C(T) : A ∈ B^T}.
Hint: Uniform convergence on compacts is equivalent to convergence in the complete, separable metric space S = (C([0, ∞)), ρ(·, ·)), where ρ(x, y) = ∑_{j=1}^∞ 2^{−j} ψ(‖x − y‖_j) for ‖x‖_t = sup_{s∈[0,t]} |x(s)| and ψ(r) = r/(1 + r) (c.f. [Dud89, Page 355]).
Combining Proposition 7.1.8, Exercise 7.2.3, Theorem 7.2.6 and Lemma 7.2.8 yields the following useful canonical construction for continuous-time processes of a.s. continuous sample path.

Corollary 7.2.10. Given a consistent collection of f.d.d. indexed on T = I^r (with I ⊂ ℝ a compact interval), such that (7.2.1) holds (for some positive α, β and finite c), there exists a S.P. X̃_·(ω) : Ω → (C(T), ‖·‖_∞), measurable with respect to B_{C(T)}, which has the specified f.d.d. and is indistinguishable from any of its continuous modifications.

Remark. An alternative approach is to directly construct the sample functions of stochastic processes of interest. That is, to view the process from the start as a random variable X_·(ω) taking values in a certain topological space of functions equipped with its Borel σ-algebra (for example, the space C(T) with a suitable metric). In dealing with the Brownian motion, we pursue both approaches, first relying on the canonical construction (namely, Corollary 7.2.10), and then proving instead an invariance principle via weak convergence in C(T) (c.f. Section 9.2).
In contrast with Theorem 7.2.6, here is an example of a S.P. with no continuous modification, for which (7.2.1) holds with β = 0.

Example 7.2.11. Consider the S.P. X_t(ω) = I_{{ω > t}}, for t ∈ [0, 1] and the uniform probability measure U on Ω = (0, 1]. Then, E[|X_t − X_s|^α] = U((s, t]) = |t − s| for all 0 ≤ s < t ≤ 1, so {X_t, t ∈ [0, 1]} satisfies (7.2.1) with c = 1, β = 0 and any α > 0. However, if X̃_t is a modification of X_t then a.s. X̃_t(ω) = X_t(ω) at all t ∈ (0, 1] ∩ ℚ, from which it follows that s ↦ X̃_s(ω) is discontinuous at s = ω.
While direct application of Theorem 7.2.6 is limited to (locally γ-Hölder) continuous modifications on compact intervals, say [0, T], it is easy to combine these to one (locally γ-Hölder) continuous modification, valid on [0, ∞).

Lemma 7.2.12. Suppose there exist T_n ↑ ∞ such that the continuous time S.P. {X_t, t ≥ 0} has (locally γ-Hölder) continuous modifications {X̃^{(n)}_t, t ∈ [0, T_n]}. Then, the S.P. {X_t, t ≥ 0} also has such a modification on [0, ∞).

Proof. By assumption, for each positive integer n, the event

A_n = {ω : X̃^{(n)}_t(ω) = X_t(ω), ∀ t ∈ ℚ ∩ [0, T_n]},

has probability one. The event A_* = ∩_n A_n of probability one is then such that X̃^{(n)}_t(ω) = X̃^{(m)}_t(ω) for all ω ∈ A_*, positive integers n, m and any t ∈ ℚ ∩ [0, T_n ∧ T_m]. By continuity of t ↦ X̃^{(n)}_t(ω) and t ↦ X̃^{(m)}_t(ω) it follows that for all ω ∈ A_*,

X̃^{(n)}_t(ω) = X̃^{(m)}_t(ω),    ∀ n, m, t ∈ [0, T_n ∧ T_m].

Consequently, for such ω there exists a function t ↦ X̃_t(ω) on [0, ∞) that coincides with each of the functions X̃^{(n)}_t(ω) on its interval of definition [0, T_n]. By assumption the latter are (locally γ-Hölder) continuous, so the same applies for the sample function t ↦ X̃_t(ω) on [0, ∞). Setting X̃_t(ω) ≡ 0 in case ω ∉ A_* completes the construction of the S.P. {X̃_t, t ≥ 0} with (locally γ-Hölder) continuous sample functions, such that for any t ∈ [0, T_n],

P(X_t ≠ X̃_t) ≤ P(A_*^c) + P(X_t ≠ X̃^{(n)}_t) = 0.

Since T_n ↑ ∞, we conclude that this S.P. is a (locally γ-Hölder) continuous modification of {X_t, t ≥ 0}. □
The following application of the Kolmogorov–Čentsov theorem demonstrates the importance of its free parameter α.

Exercise 7.2.13. Suppose {X_t, t ∈ I} is a continuous time S.P. such that E(X_t) = 0 and E(X_t²) = 1 for all t ∈ I, a compact interval on the line.
(a) Show that if for some finite c, p > 1 and h > 0,

(7.2.4)    E[X_t X_s] ≥ 1 − c(t − s)^p    for all s < t ≤ s + h, t, s ∈ I,

then there exists a continuous modification of {X_t, t ∈ I} which is also locally γ-Hölder continuous, for γ < (p − 1)/2.
(b) Show that if (X_s, X_t) is a multivariate normal for each t > s, then it suffices for the conclusion of part (a) to have E[X_t X_s] ≥ 1 − c(t − s)^{p−1} instead of (7.2.4).
Hint: In part (a) use α = 2 while for part (b) try α = 2k and k ↑ ∞.
Example 7.2.14. There exist S.P.-s satisfying (7.2.4) with p = 1 for which there is no continuous modification. One such process is the random telegraph signal R_t = (−1)^{N_t} R_0, where P(R_0 = 1) = P(R_0 = −1) = 1/2 and R_0 is independent of the Poisson process N_t of rate one. The process R_t alternately jumps between −1 and +1 at the random jump times T_k of the Poisson process N_t. Hence, by the same argument as in Example 7.2.11 it does not have a continuous modification. Further, for any t > s ≥ 0,

E[R_s R_t] = 1 − 2P(R_s ≠ R_t) ≥ 1 − 2P(N_s < N_t) ≥ 1 − 2(t − s),

so R_t satisfies (7.2.4) with p = 1 and c = 2.
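Since N_t − N_s has the Poisson(t − s) law and E[z^N] = e^{λ(z−1)} for N Poisson(λ), in fact E[R_s R_t] = E[(−1)^{N_t−N_s}] = e^{−2(t−s)} ≥ 1 − 2(t − s). A quick simulation sketch of this identity (Python/NumPy; the choice of s, t and the sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
s, t = 0.3, 0.5
# N_t - N_s ~ Poisson(t - s) for the rate-one Poisson process
jumps = rng.poisson(t - s, size=100_000)
corr = np.mean((-1.0) ** jumps)     # estimates E[R_s R_t]
exact = np.exp(-2 * (t - s))        # pgf of Poisson at z = -1
print(f"simulated {corr:.4f}, exact {exact:.4f}, "
      f"lower bound 1 - 2(t-s) = {1 - 2 * (t - s):.4f}")
```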
Remark. The S.P. R_t of Example 7.2.14 is a special instance of the continuous-time Markov jump processes, which we study in Section 8.3.3. Though the sample function of this process is a.s. discontinuous, it has the following RCLL property, as is the case for all continuous-time Markov jump processes.

Definition 7.2.15. Given a countable C ⊆ I we say that a function x ∈ ℝ^I is C-separable at t if there exists a sequence s_k ∈ C that converges to t such that x(s_k) → x(t). If this holds at all t ∈ I, we call x(·) a C-separable function. A continuous time S.P. {X_t, t ∈ I} is separable if there exists a non-random, countable C ⊆ I such that all sample functions t ↦ X_t(ω) are C-separable. Such a process is further right-continuous with left-limits (in short, RCLL), if the sample function t ↦ X_t(ω) is right-continuous and of left-limits at any t ∈ I (that is, for h ↓ 0 both X_{t+h}(ω) → X_t(ω) and the limit of X_{t−h}(ω) exists). Similarly, a modification which is a separable S.P. or one having RCLL sample functions is called a separable modification, or RCLL modification of the S.P., respectively. As usual, it suffices to have any of these properties w.p.1 (for we do not differentiate between a pair of indistinguishable S.P.).
Remark. Clearly, a S.P. of continuous sample functions is also RCLL and a S.P. having right-continuous sample functions (in particular, any RCLL process), is further separable. To summarize,

Hölder continuity ⟹ Continuity ⟹ RCLL ⟹ Separable.

But, the S.P. X_t of Example 7.2.1 is non-separable. Indeed, C-separability of t ↦ X_t(ω) at t = ω requires that ω ∈ C, so for any countable subset C of [0, 1] we have that P(t ↦ X_t is C-separable) ≤ P(ω ∈ C) = 0.
One motivation for the notion of separability is its prevalence. Namely, to any consistent collection of f.d.d. indexed on an interval I, corresponds a separable S.P. with these f.d.d. This is achieved at the small cost of possibly moving from real-valued variables to R̄-valued variables (each of which is nevertheless a.s. real-valued).
Proposition 7.2.16. Any continuous time S.P. {X_t, t ∈ I} admits a separable modification (consisting possibly of R̄-valued variables). Hence, to any consistent collection of f.d.d. indexed on I corresponds an (R̄^I, (B_{R̄})^I)-valued separable S.P. with these f.d.d.

We prove this proposition following [Bil95, Theorem 38.1], but leave its technical engine (i.e. [Bil95, Lemma 1, Page 529]) as your next exercise.
Exercise 7.2.17. Suppose {Y_t, t ∈ I} is a continuous time S.P.
(a) Fixing B ∈ B, consider the probabilities p(D) = P(Y_s ∈ B for all s ∈ D), for countable D ⊆ I. Show that for any A ⊆ I there exists a countable subset D_* = D_*(A, B) of A such that p(D_*) = inf{p(D) : countable D ⊆ A}.
Hint: Let D_* = ∪_k D_k where p(D_k) ≤ k^{−1} + inf{p(D) : countable D ⊆ A}.
(b) Deduce that if t ∈ A then N_t(A, B) = {ω : Y_s(ω) ∈ B for all s ∈ D_*(A, B) and Y_t(ω) ∉ B} has zero probability.
(c) Let C denote the union of D_*(A, B) over all A = I ∩ (q_1, q_2) and B = (q_3, q_4)^c, with q_i ∈ ℚ. Show that at any t ∈ I there exists N_t ∈ F such that P(N_t) = 0 and the sample functions t ↦ Y_t(ω) are C-separable at t for every ω ∉ N_t.
Hint: Let N_t denote the union of N_t(A, B) over the sets (A, B) as in the definition of C, such that further t ∈ A.
Proof. Assuming first that {Y_t, t ∈ I} is a (0, 1)-valued S.P., set Ỹ_· = Y_· on the countable, dense C ⊆ I of part (c) of Exercise 7.2.17. Then, fixing non-random s_n ∈ C such that s_n → t ∈ I \ C we define the R.V.-s

Ỹ_t = Y_t I_{N_t^c} + I_{N_t} limsup_n Y_{s_n},

for the events N_t of zero probability from part (c) of Exercise 7.2.17. The resulting S.P. {Ỹ_t, t ∈ I} is a [0, 1]-valued modification of Y_t (since P(Ỹ_t ≠ Y_t) ≤ P(N_t) = 0 for each t ∈ I). It clearly suffices to check C-separability of t ↦ Ỹ_t(ω) at each fixed t ∉ C and this holds by our construction if ω ∈ N_t and by part (c) of Exercise 7.2.17 in case ω ∈ N_t^c. For any (0, 1)-valued S.P. Y_t we have thus constructed a separable [0, 1]-valued modification Ỹ_t. To handle an ℝ-valued S.P. {X_t, t ∈ I}, let {Ỹ_t, t ∈ I} denote the [0, 1]-valued, separable modification of the (0, 1)-valued S.P. Y_t = F_G(X_t), with F_G(·) denoting the standard normal distribution function. Since F_G(·) has a continuous inverse F_G^{−1} : [0, 1] → R̄ (where F_G^{−1}(0) = −∞ and F_G^{−1}(1) = ∞), it directly follows that X̃_t = F_G^{−1}(Ỹ_t) is an R̄-valued separable modification of the S.P. X_t. □
Here are a few elementary and useful consequences of separability.

Exercise 7.2.18. Suppose the S.P. {X_t, t ∈ I} is C-separable and J ⊆ I an open interval.
(a) Show that

sup_{t∈J} X_t = sup_{t∈J∩C} X_t

is in mF^X, hence its law is determined by the f.d.d.
(b) Similarly, show that for any h > 0 and s ∈ I,

sup_{t∈[s,s+h)} |X_t − X_s| = sup_{t∈[s,s+h)∩C} |X_t − X_s|

is in mF^X with its law determined by the f.d.d.
The joint measurability of sample functions is an important property to have.

Definition 7.2.19. A continuous time S.P. {X_t, t ∈ I} is measurable if X_·(·) : I × Ω → ℝ is measurable with respect to B̄_I × F̄ (that is, for any B ∈ B, the subset {(t, ω) : X_t(ω) ∈ B} of I × Ω is in B̄_I × F̄, where as usual B̄_I denotes the completion of the Borel σ-algebra with respect to Lebesgue's measure on I and F̄ is the completion of F with respect to P).
As we show in Proposition 8.1.8, any right-continuous S.P. (and in particular, RCLL), is also measurable. While separability does not imply measurability, building on the obvious measurability of (simple) RCLL processes, following the proof of [Doo53, Theorem II.2.6] we show next that to any consistent and continuous in probability collection of f.d.d. corresponds a both separable and measurable S.P. having the specified f.d.d.

Definition 7.2.20. A S.P. {X_t, t ∈ I} is continuous in probability if for any t ∈ I and ε > 0,

lim_{s→t} P(|X_s − X_t| > ε) = 0.

Remark. Continuity in probability is a very mild property, which is completely determined by the f.d.d. and has little to do with the sample functions of the process. For example, note that the Poisson process is continuous in probability, as are the random telegraph noise R_t of Example 7.2.14 and even the non-separable S.P. X_t of Example 7.2.1.
Proposition 7.2.21. Any continuous in probability process {X_t, t ∈ I} has an (R̄^I, (B_{R̄})^I)-valued separable modification which is further a measurable process.

Proof. It suffices to consider I = [0, 1]. Indeed, by an affine time change the same proof then applies for any compact interval I, and if I is unbounded, simply decompose it to countably many disjoint bounded intervals and glue together the corresponding separable and measurable modifications of the given process.
Further, in view of Proposition 7.2.16 and the transformation via F_G(·) we have utilized in its proof, we consider with no loss of generality a (0, 1)-valued {s_k}-separable, continuous in probability S.P. {Y_t, t ∈ [0, 1]} and provide a [0, 1]-valued measurable modification Ỹ_t of Y_t, which we then verify to be also a separable process. To this end, with no loss of generality, assume further that s_1 = 0. Then, for any n ∈ ℕ set t_{n+1} = 2 and with 0 = t_1 < ⋯ < t_n the monotone increasing rearrangement of {s_k, k = 1, …, n}, consider the [0, 1]-valued, RCLL stochastic process

Y^{(n)}_t = ∑_{j=1}^n Y_{t_j} I_{[t_j, t_{j+1})}(t),

which is clearly also a measurable S.P. By the denseness of {s_k} in [0, 1], it follows from the continuity in probability of Y_t that Y^{(n)}_t →^p Y_t as n → ∞, for any fixed t ∈ [0, 1]. Hence, by bounded convergence E[|Y^{(n)}_t − Y^{(m)}_t|] → 0 as n, m → ∞ for each t ∈ [0, 1]. Then, by yet another application of bounded convergence,

lim_{m,n→∞} E[|Y^{(n)}_T − Y^{(m)}_T|] = 0,

where the R.V. T ∈ [0, 1] is chosen independently of P, according to the uniform probability measure U corresponding to Lebesgue's measure λ(·) restricted to ([0, 1], B_{[0,1]}). By Fubini's theorem, this amounts to Y^{(n)}_t(ω) being a Cauchy, hence convergent, sequence in L¹([0, 1] × Ω, B̄_{[0,1]} × F̄, U × P) (recall Proposition 4.3.7 that the latter is a Banach space). In view of Theorem 2.2.10, upon passing to a suitable subsequence n_j we thus have that (t, ω) ↦ Y^{(n_j)}_t(ω) converges to some B̄_{[0,1]} × F̄-measurable function (t, ω) ↦ Y^{(∞)}_t(ω) for all (t, ω) ∉ N, where we may and shall assume that {s_k} × Ω ⊆ N^c, N ∈ B̄_{[0,1]} × F̄ and U × P(N) = 0. Taking now

Ỹ_t(ω) = I_{N^c}(t, ω) Y^{(∞)}_t(ω) + I_N(t, ω) Y_t(ω),

note that Ỹ_t(ω) = Y^{(∞)}_t(ω) for a.e. (t, ω), so with Y^{(∞)}_t(ω) a measurable process, by the completeness of our product σ-algebra, the S.P. {Ỹ_t, t ∈ [0, 1]} is also measurable. Further, fixing t ∈ [0, 1], if Ỹ_t(ω) ≠ Y_t(ω) then ω ∈ A_t = {ω : Y^{(n_j)}_t(ω) → Y^{(∞)}_t(ω) ≠ Y_t(ω)}. But, recall that Y^{(n_j)}_t →^p Y_t for all t ∈ [0, 1], hence P(A_t) = 0, i.e. {Ỹ_t, t ∈ [0, 1]} is a modification of the given process {Y_t, t ∈ [0, 1]}.
Finally, since Ỹ_t coincides with the {s_k}-separable S.P. Y_t on the set {s_k}, the sample function t ↦ Ỹ_t(ω) is, by our construction, {s_k}-separable at any t ∈ [0, 1] such that (t, ω) ∈ N. Moreover, Y^{(n_j)}_t = Y_{s_k} = Ỹ_{s_k} for some k = k(j, t), with s_{k(j,t)} → t by the denseness of {s_k} in [0, 1]. Hence, if (t, ω) ∉ N then

Ỹ_t(ω) = lim_j Y^{(n_j)}_t(ω) = lim_j Ỹ_{s_{k(j,t)}}(ω).

Thus, {Ỹ_t, t ∈ [0, 1]} is {s_k}-separable and, as claimed, it is a separable, measurable modification of {Y_t, t ∈ [0, 1]}. □
Recall (1.4.7) that the measurability of the process, namely of (t, ω) ↦ X_t(ω), implies that all its sample functions t ↦ X_t(ω) are Lebesgue measurable functions on I. Measurability of a S.P. also results with well defined integrals of its sample function. For example, if a Borel function h(t, x) is such that ∫_I E[|h(t, X_t)|] dt is finite, then by Fubini's theorem t ↦ E[h(t, X_t)] is in L¹(I, B_I, λ), the integral ∫_I h(s, X_s) ds is an a.s. finite R.V. and

∫_I E[h(s, X_s)] ds = E[ ∫_I h(s, X_s) ds ].

Conversely, as you are to show next, under mild conditions the differentiability of sample functions t ↦ X_t implies the differentiability of t ↦ E[X_t].
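For instance, with W_t the standard Wiener process of Definition 7.3.12 below and h(s, x) = x², the displayed identity reads ∫_0^1 E[W_s²] ds = ∫_0^1 s ds = 1/2 = E[∫_0^1 W_s² ds]. A Monte Carlo sketch of this exchange of integral and expectation (Python/NumPy; grid size and path count are arbitrary, and the path integral is approximated by a Riemann sum):

```python
import numpy as np

rng = np.random.default_rng(3)
n_paths, n_steps = 10_000, 400
dt = 1.0 / n_steps
# standard Brownian paths on [0, 1] via cumulative sums of N(0, dt) increments
W = np.cumsum(rng.normal(0, np.sqrt(dt), (n_paths, n_steps)), axis=1)
# E[ int_0^1 W_s^2 ds ]: Riemann sum per path, then average over paths
lhs = np.mean(np.sum(W**2, axis=1) * dt)
print(f"E[int_0^1 W_s^2 ds] ~ {lhs:.4f} vs int_0^1 E[W_s^2] ds = 0.5")
```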
Exercise 7.2.22. Suppose each sample function t ↦ X_t(ω) of a continuous time S.P. {X_t, t ∈ I} is differentiable at any t ∈ I.
(a) Verify that ∂X_t/∂t is a random variable for each fixed t ∈ I.
(b) Show that if |X_t − X_s| ≤ |t − s| Y for some integrable random variable Y, a.e. ω ∈ Ω and all t, s ∈ I, then t ↦ E[X_t] has a finite derivative and for any t ∈ I,

(d/dt) E(X_t) = E[ ∂X_t/∂t ].
We next generalize the lack of correlation of independent R.V. to the setting of continuous time S.P.-s.

Exercise 7.2.23. Suppose square-integrable, continuous time S.P.-s {X_t, t ∈ I} and {Y_t, t ∈ I} are P-independent. That is, both processes are defined on the same probability space and the σ-algebras F^X and F^Y are P-independent. Show that in this case,

E[X_t Y_t | F^Z_s] = E[X_t | F^X_s] E[Y_t | F^Y_s],

for any s ≤ t ∈ I, where Z_t = (X_t, Y_t) ∈ ℝ², F^X_s = σ(X_u, u ∈ I, u ≤ s) and F^Y_s, F^Z_s are similarly defined.
7.3. Gaussian and stationary processes
Building on Definition 3.5.13 of Gaussian random vectors, we have the following important class of (centered) Gaussian (stochastic) processes, which plays a key role in our construction of the Brownian motion.

Definition 7.3.1. A S.P. {X_t, t ∈ T} is a Gaussian process (or Gaussian S.P.) if (X_{t_1}, …, X_{t_n}) is a Gaussian random vector for any n finite and t_k ∈ T, k = 1, …, n. Alternatively, a S.P. is Gaussian if and only if it has multivariate normal f.d.d. We further say that a Gaussian S.P. is centered if its mean function m(t) = E[X_t] is zero.
Recall the following notion of non-negative definiteness, based on Definition 3.5.12.

Definition 7.3.2. A symmetric function c(t, s) = c(s, t) on a product set T × T is called non-negative definite (or positive semidefinite) if for any finite n and t_k ∈ T, k = 1, …, n, the n × n matrix of entries c(t_j, t_k) is non-negative definite. That is, for any a_k ∈ ℝ, k = 1, …, n,

(7.3.1)    ∑_{j=1}^n ∑_{k=1}^n a_j c(t_j, t_k) a_k ≥ 0.

Example 7.3.3. Note that the auto-covariance function c(t, s) = Cov(X_t, X_s) of a square-integrable S.P. {X_t, t ∈ T} is non-negative definite. Indeed, the left side of (7.3.1) is in this case precisely the non-negative Var(∑_{j=1}^n a_j X_{t_j}).
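In matrix terms, (7.3.1) says that each matrix [c(t_j, t_k)] has no negative eigenvalues. A quick numerical sanity check of this for the candidate c(t, s) = min(t, s), the Brownian auto-covariance of Definition 7.3.12 below (the grid of times is an arbitrary illustrative choice):

```python
import numpy as np

t = np.array([0.2, 0.5, 0.9, 1.3, 2.0])
C = np.minimum.outer(t, t)       # candidate auto-covariance c(t,s) = min(t,s)
eig = np.linalg.eigvalsh(C)      # C is symmetric, so eigvalsh applies
print("eigenvalues:", np.round(eig, 4))
```

All eigenvalues come out non-negative, as Example 7.3.3 predicts for any auto-covariance of a square-integrable S.P.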
Convince yourself that non-negative definiteness is the only property that the auto-covariance function of a Gaussian S.P. must have and further that the following is an immediate corollary of the canonical construction and the definitions of Gaussian random vectors and stochastic processes.

Exercise 7.3.4.
(a) Show that for any index set T, the law of a Gaussian S.P. is uniquely determined by its mean and auto-covariance functions.
(b) Show that a Gaussian S.P. exists for any mean function and any non-negative definite auto-covariance function.

Remark. An interesting consequence of Exercise 7.3.4 is the existence of an isonormal process on any vector space H equipped with an inner product as in Definition 4.3.5. That is, a centered Gaussian process {X_h, h ∈ H} indexed by elements of H whose auto-covariance function is given by the inner product (h_1, h_2) : H × H → ℝ. Indeed, the latter is non-negative definite on H × H since for h = ∑_{j=1}^n a_j h_j ∈ H,

∑_{j=1}^n ∑_{k=1}^n a_j (h_j, h_k) a_k = (h, h) ≥ 0.
One of the useful properties of Gaussian processes is their closure with respect to L²-convergence (as a consequence of Proposition 3.5.15).

Proposition 7.3.5. If the S.P. {X_t, t ∈ T} and the Gaussian S.P.-s {X^{(k)}_t, t ∈ T} are such that E[(X_t − X^{(k)}_t)²] → 0 as k → ∞, for each fixed t ∈ T, then {X_t, t ∈ T} is a Gaussian S.P. whose mean and auto-covariance functions are the pointwise limits of those for the processes {X^{(k)}_t, t ∈ T}.
7.3. GAUSSIAN AND STATIONARY PROCESSES 285
Proof. Fix n finite and t_k ∈ T, k = 1, …, n. Applying Proposition 3.5.15 for the sequence of Gaussian random vectors X_k = (X^{(k)}_{t_1}, …, X^{(k)}_{t_n}), we deduce that X_∞ = (X_{t_1}, …, X_{t_n}) is also a Gaussian random vector whose mean and covariance parameters (μ, V) are the element-wise limits of the parameters of the sequence of random vectors X_k. With this holding for all f.d.d. of the S.P. {X_t, t ∈ T}, by Definition 7.3.1 the latter is a Gaussian S.P. (of the stated mean and auto-covariance functions). □
Recall Exercise 5.1.9, that for a Gaussian random vector (Y_{t_2} − Y_{t_1}, …, Y_{t_n} − Y_{t_{n−1}}), with n finite and t_1 < t_2 < ⋯ < t_n, having independent coordinates is equivalent to having uncorrelated coordinates. Hence, from Exercise 7.1.12 we deduce that

Corollary 7.3.6. A continuous time, Gaussian S.P. {Y_t, t ∈ I} has independent increments if and only if Cov(Y_t − Y_u, Y_s) = 0 for all s ≤ u < t ∈ I.

Remark. Check that the zero covariance condition in this corollary is equivalent to the Gaussian process having auto-covariance function of the form c(t, s) = g(t ∧ s).
Recall Definition 6.1.20 that a discrete time stochastic process {X_n} with a B-isomorphic state space (S, 𝒮), is (strictly) stationary if its law P_X is shift invariant, namely, P_X ∘ θ^{−1} = P_X for the shift operator (θω)_k = ω_{k+1} on S^∞. This concept of invariance of the law of the process to translation of time extends naturally to continuous time S.P.

Definition 7.3.7. The (time) shifts θ_s : S^{[0,∞)} → S^{[0,∞)} are defined for s ≥ 0 via θ_s(x)(·) = x(· + s), and a continuous time S.P. {X_t, t ≥ 0} is called stationary (or strictly stationary), if its law P_X is invariant under any time shift θ_s, s ≥ 0. That is, P_X ∘ (θ_s)^{−1} = P_X for all s ≥ 0. For a two-sided continuous time S.P. {X_t, t ∈ ℝ} the definition of time shifts extends to s ∈ ℝ and stationarity is then the invariance of the law under θ_s for any s ∈ ℝ.
Recall Proposition 7.1.8 that the law of a continuous time S.P. is uniquely determined by its f.d.d. Consequently, such a process is (strictly) stationary if and only if its f.d.d. are invariant to translation of time. That is, if and only if

(7.3.2)    (X_{t_1}, …, X_{t_n}) =^D (X_{t_1+s}, …, X_{t_n+s})

for any n finite and s, t_i ≥ 0 (or for any s, t_i ∈ ℝ in case of a two-sided continuous time S.P.). In contrast, here is a much weaker concept of stationarity.

Definition 7.3.8. A square-integrable continuous time S.P. of constant mean function and auto-covariance function of the form c(t, s) = r(|t − s|) is called weakly stationary (or L²-stationary).

Indeed, considering (7.3.2) for n = 1 and n = 2, clearly any square-integrable stationary S.P. is also weakly stationary. As you show next, the converse fails in general, but applies for all Gaussian S.P.

Exercise 7.3.9. Show that any weakly stationary Gaussian S.P. is also (strictly) stationary. In contrast, provide an example of a (non-Gaussian) weakly stationary process which is not stationary.
To gain more insight about stationary processes solve the following exercise.
Exercise 7.3.10. Suppose {X_t, t ≥ 0} is a centered weakly stationary S.P. of auto-covariance function r(t).
(a) Show that |r(t)| ≤ r(0) for all t > 0 and further, if r(h) = r(0) for some h > 0 then X_{t+h} =^{a.s.} X_t for each t ≥ 0.
(b) Deduce that any centered, weakly stationary process of independent increments must be a modification of the trivial process having constant sample functions X_t(ω) = X_0(ω) for all t ≥ 0 and ω ∈ Ω.
Definition 7.3.11. We say that a continuous time S.P. {X_t, t ∈ I} has stationary increments if for t, s ∈ I the law of the increment X_t − X_s depends only on t − s.

We conclude this chapter with the definition and construction of the celebrated Brownian motion, which is the most fundamental continuous time stochastic process.

Definition 7.3.12. A S.P. {W_t, t ≥ 0} is called a Brownian motion (or a Wiener process) starting at x ∈ ℝ, if it is a Gaussian process of mean function m(t) = x and auto-covariance c(t, s) = Cov(W_t, W_s) = t ∧ s, whose sample functions t ↦ W_t(ω) are continuous. The case of x = 0 is called the standard Brownian motion (or standard Wiener process).
In addition to constructing the Brownian motion, you are to show next that it has stationary, independent increments.

Exercise 7.3.13.
(a) Construct a continuous time Gaussian S.P. {B_t, t ≥ 0} of the mean and auto-covariance functions of Definition 7.3.12.
Hint: Look for f.d.d. such that B_0 = x and having independent increments B_t − B_s of zero mean and variance t − s.
(b) Show that there exists a Wiener process, namely a continuous modification {W_t, t ≥ 0} of {B_t, t ≥ 0}.
Hint: Try the Kolmogorov–Čentsov theorem for α = 4.
(c) Deduce that for any T finite, the S.P. {W_t, t ∈ [0, T]} can be viewed as the random variable W_·(ω) : (Ω, F) → (C([0, T]), ‖·‖_∞), which is measurable with respect to the Borel σ-algebra on C([0, T]) and is further a.s. locally γ-Hölder continuous for any γ < 1/2.
Hint: As in Exercise 7.2.13 try α = 2k with k ↑ ∞ in (7.2.1).
(d) Show that the S.P. {B_t, t ≥ 0} is non-stationary, but it is a process of stationary, independent increments.

Example 7.3.14. Convince yourself that every stationary process has stationary increments while the Brownian motion of Exercise 7.3.13 is an example of a non-stationary process with stationary (independent) increments. The same phenomenon applies for discrete time S.P. (in which case the symmetric SRW serves as an example of a non-stationary process with stationary, independent increments).
An alternative construction of the Wiener process on I = [0, T] is as the infinite series

W_t = x + ∑_{k=0}^∞ a_k(t) G_k,
[Figure 2. Three sample functions t ↦ W_t of Brownian motion on [0, 3]. The density curves illustrate that the random variable W_1 has a N(0, 1) law, while W_2 has a N(0, 2) law.]
with G_k i.i.d. standard normal random variables and a_k(·) continuous functions on I such that

(7.3.3)    ∑_{k=0}^∞ a_k(t) a_k(s) = t ∧ s = (1/2)(|t + s| − |t − s|).

For example, taking T = 1/2 and expanding f(x) = |x| for |x| ≤ 1 into a Fourier series, one finds that

|x| = 1/2 − ∑_{k=0}^∞ [4 / ((2k+1)² π²)] cos((2k+1)πx).

Hence, by the trigonometric identity cos(a − b) − cos(a + b) = 2 sin(a) sin(b) it follows that (7.3.3) holds for

a_k(t) = [2 / ((2k+1)π)] sin((2k+1)πt).

Though we shall not do so, the continuity w.p.1 of t ↦ W_t is then obtained by showing that for any ε > 0,

P(‖∑_{k=n}^∞ a_k(·) G_k‖_∞ ≥ ε) → 0

as n → ∞ (see [Bry95, Theorem 8.1.3]).
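This series is easy to put on a computer. The sketch below (Python/NumPy; the truncation level K and the time grid are arbitrary choices) sums the truncated series ∑_{k<K} a_k(t) G_k for the a_k(·) displayed above, and checks that ∑_k a_k(t)² approximates Var(W_t) = t ∧ t = t at t = 1/2:

```python
import numpy as np

rng = np.random.default_rng(4)
K = 2000                           # truncation level of the series
t = np.linspace(0.0, 0.5, 501)     # T = 1/2, as in the Fourier expansion above
k = np.arange(K)
# a_k(t) = 2 sin((2k+1) pi t) / ((2k+1) pi)
A = 2.0 * np.sin(np.outer(t, (2 * k + 1) * np.pi)) / ((2 * k + 1) * np.pi)
W = A @ rng.standard_normal(K)     # one truncated sample path, with x = 0
# sanity check: sum_k a_k(t)^2 should approximate min(t,t) = t at t = 1/2
print("sum_k a_k(0.5)^2 =", np.sum(A[-1] ** 2))   # close to 0.5
```

The slow 1/k decay of the coefficients is why the uniform convergence claim above requires a genuine argument rather than a crude tail bound.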
We turn to explore some interesting Gaussian processes of continuous sample functions that are derived out of the Wiener process {W_t, t ≥ 0}.

Exercise 7.3.15. With {W_t, t ≥ 0} a standard Wiener process, show that each of the following is a Gaussian S.P. of continuous sample functions, compute its mean and auto-covariance functions and determine whether or not it is a stationary process.
(a) The standard Brownian bridge B̂_t = W_t − min(t, 1) W_1.
(b) The Ornstein–Uhlenbeck process U_t = e^{−t/2} W_{e^t}.
(c) The Brownian motion with drift Z^{(r,σ)}_t = σ W_t + r t + x, with non-random drift r ∈ ℝ and diffusion coefficient σ > 0.
(d) The integrated Brownian motion I_t = ∫_0^t W_s ds.
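Anticipating the computation in part (b): Cov(U_t, U_s) = e^{−(t+s)/2} (e^t ∧ e^s) = e^{−|t−s|/2}, so the Ornstein–Uhlenbeck process is weakly stationary. A Monte Carlo sketch of this covariance via the time-change definition (Python/NumPy; the times and sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)
n_paths = 200_000
times = np.array([1.0, 1.5, 3.0])
et = np.exp(times)
# sample (W_{e^t1}, W_{e^t2}, W_{e^t3}) through independent Gaussian increments
std = np.sqrt(np.diff(np.concatenate([[0.0], et])))
W = np.cumsum(rng.normal(0.0, std, size=(n_paths, 3)), axis=1)
U = np.exp(-times / 2) * W          # U_t = e^{-t/2} W_{e^t}
for i, j in [(0, 1), (0, 2), (1, 2)]:
    emp = np.mean(U[:, i] * U[:, j])
    print(f"Cov(U_{times[i]}, U_{times[j]}) ~ {emp:.3f},"
          f" exp(-|dt|/2) = {np.exp(-abs(times[i] - times[j]) / 2):.3f}")
```

The empirical covariances depend on the times only through their gaps, in line with weak stationarity.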
Exercise 7.3.16. Suppose {W_t, t ≥ 0} is a standard Wiener process.
(a) Compute E(W_s | W_t) and Var(W_s | W_t), first for s > t, then for s < t.
(b) Show that t^{−1} W_t →^{a.s.} 0 when t → ∞.
Hint: As we show in the sequel, the martingale {W_t, t ≥ 0} satisfies Doob's L² maximal inequality.
(c) Show that for t ∈ [0, 1] the S.P. B̂_t = (1 − t) W_{t/(1−t)} (with B̂_1 = 0) has the same law as the standard Brownian bridge and its sample functions are continuous w.p.1.
(d) Show that restricted to [0, 1], the law of the standard Brownian bridge matches that of {W_t, t ∈ [0, 1]}, conditioned upon W_1 = 0 (hence the name Brownian bridge).
The fractional Brownian motion is another Gaussian S.P. of considerable interest in financial mathematics and in the analysis of computer and queuing networks.

Exercise 7.3.17. For H ∈ (0, 1), the fractional Brownian motion (or in short, fBM) of Hurst parameter H is the centered Gaussian S.P. {X_t, t ≥ 0}, of auto-covariance function

c(t, s) = (1/2)[ |t|^{2H} + |s|^{2H} − |t − s|^{2H} ],    s, t ≥ 0.

(a) Show that the square-integrability with respect to Lebesgue's measure of g(u) = |1 − u|^{H−1/2} sgn(1 − u) + |u|^{H−1/2} sgn(u) (which you need not verify), implies that c(t, s) = ∫ g_t(x) g_s(x) dx for g_t(x) = ‖g‖_2^{−1} |t|^{H−1/2} g(x/t) in case t > 0 and g_0(x) = 0.
Hint: g_t(s + x) − g_s(s + x) = g_{t−s}(x), hence ‖g_t − g_s‖_2² = ‖g_{t−s}‖_2².
(b) Deduce that the fBM {X_t, t ≥ 0} exists and has a continuous modification which is also locally γ-Hölder continuous for any 0 < γ < H.
(c) Verify that for H = 1/2 this modification is the standard Wiener process.
(d) Show that for any non-random b > 0, the S.P. {b^{−H} X_{bt}, t ≥ 0} is an fBM of the same Hurst parameter H.
(e) For which values of H are the increments of the fBM stationary and for which values are they independent?
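Since the fBM is specified by its covariance, its f.d.d. can be sampled directly by factoring the covariance matrix, without the integral representation of part (a). A sketch (Python/NumPy; H, the time grid, the path counts, and the tiny diagonal jitter guarding against round-off are all arbitrary illustrative choices):

```python
import numpy as np

def fbm_fdd(H, times, n_paths, rng):
    """Sample the fBM f.d.d. at the given times by factoring its covariance."""
    t = np.asarray(times, dtype=float)
    # c(t,s) = (|t|^{2H} + |s|^{2H} - |t-s|^{2H}) / 2
    C = 0.5 * (t[:, None]**(2 * H) + t[None, :]**(2 * H)
               - np.abs(t[:, None] - t[None, :])**(2 * H))
    L = np.linalg.cholesky(C + 1e-12 * np.eye(len(t)))   # round-off jitter
    return rng.standard_normal((n_paths, len(t))) @ L.T

rng = np.random.default_rng(6)
X = fbm_fdd(0.8, np.linspace(0.01, 1.0, 100), 5, rng)    # five sample paths
# check Var(X_1) = 1^{2H} = 1 (cf. the self-similarity of part (d))
v = np.var(fbm_fdd(0.8, [1.0], 50_000, rng))
print("Var(X_1) ~", v)
```

That the Cholesky factorization succeeds is itself a numerical witness that the displayed c(t, s) is non-negative definite, as part (a) asserts.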
CHAPTER 8

Continuous time martingales and Markov processes

Continuous time filtrations and stopping times are introduced in Section 8.1, emphasizing the differences with the corresponding notions for discrete time processes and the connections to sample path continuity. Building upon it and Chapter 5 about discrete time martingales, we review in Section 8.2 the theory of continuous time martingales. Similarly, Section 8.3 builds upon Chapter 6 about Markov chains, in providing a short introduction to the rich theory of strong Markov processes.
8.1. Continuous time filtrations and stopping times

We start with the definitions of continuous time filtrations and S.P. adapted to them (compare with Definitions 5.1.1 and 5.1.2, respectively).

Definition 8.1.1. A (continuous time) filtration is a non-decreasing family of sub-σ-algebras {F_t} of the measurable space (Ω, F), indexed by t ≥ 0. By {F_t, t ≥ 0} ∪ {F_∞} we denote such a filtration {F_t} and the associated minimal σ-algebra F_∞ = σ(∪_t F_t), so that F_s ⊆ F_t for all 0 ≤ s ≤ t ≤ ∞.
Denition 8.1.2. A (continuous time) S.P. X
t
, t 0 is adapted to a (contin-
uous time) ltration T
t
, or in short T
t
-adapted, if X
t
mT
t
for each t 0 or
equivalently, if T
X
t
T
t
for all t 0.
Remark 8.1.3. To avoid cumbersome technical difficulties, we assume throughout
that the filtration is augmented so that every P-null set is in F_0. That is, if N ⊆ A
for some A ∈ F with P(A) = 0 then N ∈ F_0 (which is a somewhat stronger
assumption than the completion of both F and F_0). In particular, this assures that
any modification of an F_t-adapted continuous time S.P. remains F_t-adapted.
When dealing with continuous time processes it helps if each new piece of informa-
tion has a definite first time of arrival, as captured mathematically by the concept
of right-continuous filtration.
Definition 8.1.4. To any continuous time filtration {F_t} we associate the corre-
sponding left-filtration F_{t−} = σ(F_s, s < t) at time t, consisting of all events prior
to t (where we set F_{0−} = F_0), and right-filtration F_{t+} = ∩_{ε>0} F_{t+ε} at time t, con-
sisting of all events immediately after t. A filtration {F_t} is called right-continuous
if it coincides with its right-filtration, that is F_t = F_{t+} for all t ≥ 0.
The next example ties the preceding definitions to those in the much simpler
setting of Chapter 5.
The next example ties the preceding denitions to those in the much simpler
setting of Chapter 5.
289
290 8. CONTINUOUS TIME MARTINGALES AND MARKOV PROCESSES
Example 8.1.5. To each discrete time filtration {G_n, n ∈ Z_+} corresponds the
interpolated (continuous time) filtration F_t = G_{[t]}, where [t] denotes the integer part
of t ≥ 0. Convince yourself that any interpolated filtration is right-continuous, but
usually not left-continuous. That is, F_t ≠ F_{t−} (at any integer t = n in which G_n ≠
G_{n−1}), with each jump in the filtration accounting for a new piece of information
arriving at that time.
Similarly, we associate with any G_n-adapted discrete time S.P. {Y_n} an inter-
polated continuous time S.P. X_t = Y_{[t]}, t ≥ 0, noting that {X_t, t ≥ 0} is then
F_t-adapted if and only if {Y_n} is G_n-adapted.
Example 8.1.6. In analogy with Definition 5.1.3, another generic continuous time
filtration is the canonical filtration F_t^X = σ(X_s, 0 ≤ s ≤ t) associated with each
continuous time S.P. {X_t, t ≥ 0}.
Unfortunately, sample path continuity of a S.P. X_t does not guarantee the right-
continuity of its canonical filtration F_t^X. Indeed, considering the uniform prob-
ability measure on Ω = {−1, 1} note that the canonical filtration F_t^X of the
S.P. X_t(ω) = ωt, which has continuous sample functions, is evidently not right-
continuous at t = 0 (as F_0^X = {∅, Ω}, while F_t^X = F = 2^Ω for all t > 0).
When a S.P. {X_s, s ≥ 0} is F_t-adapted, we can view {X_s, s ∈ [0, t]} as a S.P. on the
smaller measurable space (Ω, F_t), for each t ≥ 0. However, as seen in Section 7.2,
more is required in order to have Borel sample functions, prompting the following
extension of Definition 7.2.19 (and refinement of Definition 8.1.2).
Definition 8.1.7. An F_t-adapted S.P. {X_t, t ≥ 0} is called F_t-progressively mea-
surable if X_s(ω) : [0, t] × Ω → R is measurable with respect to B_{[0,t]} × F_t, for each
t ≥ 0.
Remark. In contrast to Definition 7.2.19, we have dropped the completion of the
relevant σ-algebras in the preceding definition. Indeed, the standing assumption of
Remark 8.1.3 guarantees the completeness of each σ-algebra of the filtration F_t and,
as we see next, progressive measurability is in any case equivalent to adaptedness
for all RCLL processes.
Proposition 8.1.8. An F_t-adapted S.P. {X_s, s ≥ 0} of right-continuous sample
functions is also F_t-progressively measurable.
Proof. Fixing t > 0, let Q_{t+}^{(2,ℓ)} denote the finite set of dyadic rationals of the
form j2^{−ℓ} ∈ [0, t] augmented by t and arranged in increasing order 0 = t_0^ℓ < t_1^ℓ <
⋯ < t_{k_ℓ}^ℓ = t (where k_ℓ = ⌈t2^ℓ⌉). The ℓ-th approximation of the sample function
X_s(ω) for s ∈ [0, t], is then
X_s^{(ℓ)}(ω) = X_0 I_{{0}}(s) + Σ_{j=1}^{k_ℓ} X_{t_j^ℓ}(ω) I_{(t_{j−1}^ℓ, t_j^ℓ]}(s) .
Note that per positive integer ℓ and B ∈ B,
{(s, ω) ∈ [0, t] × Ω : X_s^{(ℓ)}(ω) ∈ B} = {0} × X_0^{−1}(B) ∪ ⋃_{j=1}^{k_ℓ} (t_{j−1}^ℓ, t_j^ℓ] × X_{t_j^ℓ}^{−1}(B) ,
which is in the product σ-algebra B_{[0,t]} × F_t, since each of the sets X_{t_j^ℓ}^{−1}(B) is
in F_t (recall that {X_s, s ≥ 0} is F_t-adapted and t_j^ℓ ∈ [0, t]). Consequently, each
of the maps (s, ω) ↦ X_s^{(ℓ)}(ω) is a real-valued R.V. on the product measurable
space ([0, t] × Ω, B_{[0,t]} × F_t). Further, by right-continuity of the sample functions
s ↦ X_s(ω), for each fixed (s, ω) ∈ [0, t] × Ω the sequence X_s^{(ℓ)}(ω) converges as
ℓ → ∞ to X_s(ω), which is thus a R.V. on the same (product) measurable space
(recall Corollary 1.2.23). □
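The dyadic approximation in the proof above can be made concrete with a small sketch (ours, not from the text): for a right-continuous path, sampling at the right endpoint of each level-ℓ dyadic cell converges pointwise to the true path as ℓ grows. The path `X` and the point `s` below are arbitrary illustration choices.

```python
import math

# Sketch of the proof's l-th approximation: X^{(l)}_s equals the value of
# the path at the right endpoint t_j of the dyadic cell containing s.
# By right-continuity, t_j decreases to s, so X^{(l)}_s -> X_s.

def X(s):
    # a right-continuous (RCLL) sample path with a jump at s = 0.3
    return 1.0 if s >= 0.3 else 0.0

def approx(s, t, l):
    """X^{(l)}_s: the path evaluated at the right dyadic endpoint t_j >= s."""
    if s == 0.0:
        return X(0.0)
    tj = min(math.ceil(s * 2**l) / 2**l, t)  # right endpoint of the cell
    return X(tj)

s, t = 0.29, 1.0
vals = [approx(s, t, l) for l in range(1, 20)]
assert vals[-1] == X(s)  # the approximation settles at the true value
```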
Associated with any filtration {F_t} is the collection of all F_t-stopping times and
the corresponding stopped σ-algebras (compare with Definitions 5.1.11 and 5.1.34).
Definition 8.1.9. A random variable τ : Ω → [0, ∞] is called a stopping time for
the (continuous time) filtration {F_t}, or in short an F_t-stopping time, if {ω : τ(ω) ≤
t} ∈ F_t for all t ≥ 0. Associated with each F_t-stopping time τ is the stopped
σ-algebra F_τ = {A ∈ F_∞ : A ∩ {τ ≤ t} ∈ F_t for all t ≥ 0} (which quantifies the
information in the filtration at the stopping time τ).
The F_{t+}-stopping times are also called F_t-Markov times (or F_t-optional times),
with the corresponding Markov σ-algebras F_{τ+} = {A ∈ F_∞ : A ∩ {τ ≤ t} ∈ F_{t+}
for all t ≥ 0}.
Remark. As their name suggests, Markov/optional times appear both in the con-
text of Doob's optional stopping theorem (in Section 8.2.3), and in that of the
strong Markov property (see Section 8.3.2).
Obviously, any non-random constant t ≥ 0 is a stopping time. Further, by def-
inition, every F_t-stopping time is also an F_t-Markov time and the two concepts
coincide for right-continuous filtrations. Similarly, the Markov σ-algebra F_{τ+} con-
tains the stopped σ-algebra F_τ for any F_t-stopping time (and they coincide in case
of right-continuous filtrations).
Your next exercise provides a more explicit characterization of Markov times and
closure properties of Markov and stopping times (some of which you saw before in
Exercise 5.1.12).
Exercise 8.1.10.
(a) Show that θ is an F_t-Markov time if and only if {ω : θ(ω) < t} ∈ F_t for
all t ≥ 0.
(b) Show that if τ_n, n ∈ Z_+ are F_t-stopping times, then so are τ_1 ∧ τ_2,
τ_1 + τ_2 and sup_n τ_n.
(c) Show that if θ_n, n ∈ Z_+ are F_t-Markov times, then in addition to θ_1 +
θ_2 and sup_n θ_n, also inf_n θ_n, liminf_n θ_n and limsup_n θ_n are F_t-Markov
times.
(d) In the setting of part (c) show that θ_1 + θ_2 is an F_t-stopping time when
either both θ_1 and θ_2 are strictly positive, or alternatively, when θ_1 is a
strictly positive F_t-stopping time.
Similarly, here are some of the basic properties of stopped σ-algebras (compare
with Exercise 5.1.35), followed by additional properties of Markov σ-algebras.
Exercise 8.1.11. Suppose τ and θ are F_t-stopping times.
(a) Verify that τ ∈ mF_τ, that F_τ is a σ-algebra, and if τ(ω) = t is non-
random then F_τ = F_t.
(b) Show that F_{τ∧θ} = F_τ ∩ F_θ and deduce that each of the events {τ < θ},
{θ ≤ τ}, {τ = θ} belongs to F_τ ∩ F_θ.
Hint: Show first that if A ∈ F_θ then A ∩ {θ ≤ τ} ∈ F_τ.
(c) Show that for any integrable R.V. Z,
E[Z|F_τ] I_{{τ ≤ θ}} = E[Z|F_{τ∧θ}] I_{{τ ≤ θ}} ,
and deduce that
E[E(Z|F_τ)|F_θ] = E[Z|F_{τ∧θ}] .
(d) Show that if θ ≥ τ and θ ∈ mF_τ then θ is an F_t-stopping time.
Exercise 8.1.12. Suppose θ, θ_n are F_t-Markov times.
(a) Verify that F_{θ+} = {A ∈ F_∞ : A ∩ {θ < t} ∈ F_t for all t ≥ 0}.
(b) Suppose θ_1 is further an F_t-stopping time and θ ≤ θ_1 with a strict inequality
whenever θ is finite. Show that then F_{θ+} ⊆ F_{θ_1}.
(c) Setting θ = inf_n θ_n, show that F_{θ+} = ∩_n F_{θ_n+}. Deduce that if θ_n are
F_t-stopping times and θ < θ_n whenever θ is finite, then F_{θ+} = ∩_n F_{θ_n}.
In contrast to adaptedness, progressive measurability transfers to stopped pro-
cesses (i.e. the continuous time extension of Definition 5.1.31), which is essential
when dealing in Section 8.2 with stopped sub-martingales (i.e. the continuous time
extension of Theorem 5.1.32).
Proposition 8.1.13. Given an F_t-progressively measurable S.P. {X_s, s ≥ 0}, the
stopped at θ (an F_t-stopping time) S.P. {X_{s∧θ(ω)}(ω), s ≥ 0} is also F_t-progressively
measurable. In particular, if either θ < ∞ or there exists X_∞ ∈ mF_∞, then
X_θ ∈ mF_θ.
Proof. Fixing t > 0, denote by S the product σ-algebra B_{[0,t]} × F_t on the
product space S = [0, t] × Ω. The assumed F_t-progressive measurability of {X_s, s ≥
0} amounts to the measurability of g_1 : (S, S) → (R, B) such that g_1(s, ω) = X_s(ω).
Further, as (s, ω) ↦ X_{s∧θ(ω)}(ω) is the composition g_1(g_2(s, ω)) for the mapping
g_2(s, ω) = (s ∧ θ(ω), ω) from (S, S) to itself, by Proposition 1.2.18 the F_t-progressive
measurability of the stopped S.P. follows from our claim that g_2 is measurable.
Indeed, recall that θ is an F_t-stopping time, so {ω : θ(ω) > u} ∈ F_t for any
u ∈ [0, t]. Hence, for any fixed u ∈ [0, t] and A ∈ F_t,
g_2^{−1}((u, t] × A) = (u, t] × (A ∩ {ω : θ(ω) > u}) ∈ S ,
which suffices for measurability of g_2 (since the product σ-algebra S is generated
by the collection {(u, t] × A : u ∈ [0, t], A ∈ F_t}).
Turning to the second claim, since {X_s, s ≥ 0} is F_t-progressively measurable,
we have that for any fixed B ∈ B and finite t ≥ 0,
X_θ^{−1}(B) ∩ θ^{−1}([0, t]) = {ω : X_{t∧θ(ω)}(ω) ∈ B} ∩ {ω : θ(ω) ≤ t} ∈ F_t
(recall (1.4.7) that {ω : (t, ω) ∈ A} is in F_t for any set A ∈ S). Moreover, by
our assumptions X_θ^{−1}(B) ∩ θ^{−1}({∞}) is in F_∞, hence so is its union with the sets
X_θ^{−1}(B) ∩ θ^{−1}([0, n]), n ∈ Z_+, which is precisely X_θ^{−1}(B). We have thus shown
that X_θ^{−1}(B) ∈ F_θ for any B ∈ B, namely, that X_θ ∈ mF_θ. □
Recall Exercise 5.1.13 that for discrete time S.P. and filtrations, the first hitting
time τ_B of a Borel set B by an F_n-adapted process is an F_n-stopping time. Unfortu-
nately, this may fail in the continuous time setting, even when considering an open
set B and the canonical filtration F_t^X of a S.P. of continuous sample functions.
Example 8.1.14. Indeed, consider B = (0, ∞) and the S.P. X_t(ω) = ωt of Exam-
ple 8.1.6. In this case, τ_B(1) = 0 while τ_B(−1) = ∞, so the event {ω : τ_B(ω) ≤
0} = {1} is not in F_0^X = {∅, Ω} (hence τ_B is not an F_t^X-stopping time). As shown
next, this problem is only due to the lack of right-continuity in the filtration F_t^X.
Proposition 8.1.15. Consider an F_t-adapted, right-continuous S.P. {X_s, s ≥ 0}.
Then, the first hitting time τ_B(ω) = inf{t ≥ 0 : X_t(ω) ∈ B} is an F_t-Markov
time for an open set B and further an F_t-stopping time when B is a closed set and
{X_s, s ≥ 0} has continuous sample functions.
Proof. Fixing t > 0, by definition of τ_B the set τ_B^{−1}([0, t)) is the union of
X_s^{−1}(B) over all s ∈ [0, t). Further, if the right-continuous function s ↦ X_s(ω)
intersects an open set B at some s ∈ [0, t) then necessarily X_q(ω) ∈ B at some
q ∈ Q_t = Q ∩ [0, t). Consequently,
(8.1.1)   τ_B^{−1}([0, t)) = ⋃_{s∈Q_t} X_s^{−1}(B) .
Now, the F_t-adaptedness of X_s implies that X_s^{−1}(B) ∈ F_s ⊆ F_t for all s ≤ t,
and in particular for any s in the countable collection Q_t. We thus deduce from
(8.1.1) that {τ_B < t} ∈ F_t for all t ≥ 0, and in view of part (a) of Exercise 8.1.10,
conclude that τ_B is an F_t-Markov time in case B is open.
Assuming hereafter that B is closed and u ↦ X_u continuous, we claim that for
any t > 0,
(8.1.2)   {τ_B ≤ t} = ⋃_{0≤s≤t} X_s^{−1}(B) = ⋂_{k=1}^∞ {τ_{B_k} < t} := A_t ,
where B_k = {x ∈ R : |x − y| < k^{−1}, for some y ∈ B}, and that the left identity
in (8.1.2) further holds for t = 0. Clearly, X_0^{−1}(B) ∈ F_0 and for B_k open, by the
preceding proof {τ_{B_k} < t} ∈ F_t. Hence, (8.1.2) implies that {τ_B ≤ t} ∈ F_t for all
t ≥ 0, namely, that τ_B is an F_t-stopping time.
Turning to verify (8.1.2), fix t > 0 and recall that if ω ∈ A_t then |X_{s_k}(ω) − y_k| <
k^{−1} for some s_k ∈ [0, t) and y_k ∈ B. Upon passing to a sub-sequence, s_k → s ∈ [0, t],
hence by continuity of the sample function X_{s_k}(ω) → X_s(ω). This in turn implies
that y_k → X_s(ω) ∈ B (because B is a closed set). Conversely, if X_s(ω) ∈ B for some
s ∈ [0, t) then also τ_{B_k} ≤ s < t for all k ≥ 1, whereas even if only X_t(ω) = y ∈ B,
by continuity of the sample function also X_s(ω) → y for s ↑ t (and once again
τ_{B_k} < t for all k ≥ 1). To summarize, ω ∈ A_t if and only if there exists s ∈ [0, t]
such that X_s(ω) ∈ B, as claimed. Considering hereafter t ≥ 0 (possibly t = 0),
the existence of s ∈ [0, t] such that X_s ∈ B results with τ_B ≤ t. Conversely,
if τ_B(ω) ≤ t then X_{s_n}(ω) ∈ B for some s_n(ω) ≤ t + n^{−1} and all n. But then
s_{n_k} → s ≤ t along some sub-sequence n_k → ∞, so for B closed, by continuity of
the sample function also X_{s_{n_k}}(ω) → X_s(ω) ∈ B. □
We conclude with a technical result on which we shall later rely, for example,
in proving the optional stopping theorem and in the study of the strong Markov
property.
Lemma 8.1.16. Given an F_t-Markov time θ, let θ_ℓ = 2^{−ℓ}([2^ℓ θ] + 1) for ℓ ≥ 1.
Then, θ_ℓ are F_t-stopping times and A ∩ {ω : θ_ℓ(ω) = q} ∈ F_q for any A ∈ F_{θ+},
ℓ ≥ 1 and q ∈ Q^{(2,ℓ)} = {k2^{−ℓ}, k ∈ Z_+}.
Proof. By its construction, θ_ℓ takes values in the discrete set Q^{(2,ℓ)}.
Moreover, with {ω : θ(ω) < t} ∈ F_t for any t ≥ 0 (see part (a) of Exercise 8.1.10),
it follows that for any q ∈ Q^{(2,ℓ)},
{ω : θ_ℓ(ω) = q} = {ω : θ(ω) ∈ [q − 2^{−ℓ}, q)} ∈ F_q .
Hence, θ_ℓ is an F_t-stopping time, as claimed. Next, fixing A ∈ F_{θ+}, in view of
Definitions 8.1.4 and 8.1.9, the sets A_{t,m} = A ∩ {ω : θ(ω) ≤ t − m^{−1}} are in F_t for
any m ≥ 1. Further, by the preceding, fixing q ∈ Q^{(2,ℓ)} and q' = q − 2^{−ℓ} you have
that
A ∩ {ω : θ_ℓ(ω) = q} = A ∩ {ω : θ(ω) ∈ [q', q)} = (∪_{m≥1} A_{q,m}) \ (∪_{m≥1} A_{q',m})
is the difference between an element of F_q and one of F_{q'} ⊆ F_q. Consequently,
A ∩ {ω : θ_ℓ(ω) = q} is in F_q, as claimed. □
8.2. Continuous time martingales
As we show in this section, once the technical challenges involved with the conti-
nuity of time are taken care of, the results of Chapter 5 extend in a natural way
to the collection of continuous time (sub and super) martingales. Similar to the
break-up of Chapter 5, we devote Subsection 8.2.1 to the definition, examples and
closure properties of this collection of S.P. (compare with Section 5.1), followed by
Subsection 8.2.2 about tail (and upcrossing) inequalities and convergence properties
of such processes (compare with Sections 5.2.2 and 5.3, respectively). The state-
ment, proof and applications of Doob's optional stopping theorem are explored in
Subsection 8.2.3 (compare with Section 5.4), with martingale representations being
the focus of Subsection 8.2.4 (compare with Sections 5.2.1 and 5.3.2).
8.2.1. Definition, examples and closure properties. For a continuous
filtration, it is not enough to consider the martingale property one step ahead, so
we replace Definitions 5.1.4 and 5.1.16 by the following continuous time analog of
Proposition 5.1.20.
Definition 8.2.1. The pair (X_t, F_t, t ≥ 0) is called a continuous time martingale
(in short MG), if the integrable (continuous time) S.P. {X_t, t ≥ 0} is adapted to
the (continuous time) filtration {F_t, t ≥ 0} and for any fixed t ≥ s ≥ 0, the identity
E[X_t|F_s] = X_s holds a.s. Replacing the preceding identity with E[X_t|F_s] ≥ X_s a.s.
for each t ≥ s ≥ 0, or with E[X_t|F_s] ≤ X_s a.s. for each t ≥ s ≥ 0, defines the
continuous time sub-MG and continuous time sup-MG, respectively. These three
classes of continuous time S.P. are related in the same manner as in the discrete
time setting (c.f. Remark 5.1.17).
It immediately follows from the preceding definition that t ↦ EX_t is non-decreasing
for a sub-MG, non-increasing for a sup-MG, and constant (in time) for a MG.
Further, unless explicitly stated otherwise, one uses the canonical filtration when
studying MGs (or sub/sup-MGs).
Exercise 8.2.2. Suppose (X_t, F_t, t ≥ 0) is a continuous time sub-MG.
(a) Show that (X_t, F_t^X, t ≥ 0) is also a sub-MG.
(b) Show that if EX_t = EX_0 for all t ≥ 0, then (X_t, F_t, t ≥ 0) is also a
martingale.
The decomposition of conditional second moments, as in part (b) of Exercise 5.1.8,
applies for all continuous time square-integrable MGs.
Exercise 8.2.3. Suppose (X_t, F_t, t ≥ 0) is a square-integrable MG. Verify that
(8.2.1)   E[X_t^2|F_s] − X_s^2 = E[(X_t − X_s)^2|F_s]   for any t ≥ s ≥ 0 ,
and deduce that t ↦ EX_t^2 is non-decreasing.
As you see next, the Wiener process and the compensated Poisson process play
the same role that the random walk of zero-mean increments plays in the discrete
time setting (with the Wiener process being the prototypical MG of continuous sample
functions, and the compensated Poisson process the prototypical MG of discontinuous
RCLL sample functions).
Proposition 8.2.4. Any integrable S.P. {X_t, t ≥ 0} of independent increments
(see Exercise 7.1.12), and constant mean function is a MG.
Proof. Recall that a S.P. X_t has independent increments if X_{t+h} − X_t is
independent of F_t^X, for all h > 0 and t ≥ 0. We have also assumed that E|X_t| < ∞
and EX_t = EX_0 for all t ≥ 0. Therefore, E[X_{t+h} − X_t|F_t^X] = E[X_{t+h} − X_t] = 0.
Further, X_t ∈ mF_t^X and hence E[X_{t+h}|F_t^X] = X_t. That is, (X_t, F_t^X, t ≥ 0) is a
MG, as claimed. □
Example 8.2.5. In view of Exercise 7.3.13 and Proposition 8.2.4 we have that the
Wiener process/Brownian motion (W_t, t ≥ 0) of Definition 7.3.12 is a martingale.
Combining Proposition 3.4.9 and Exercise 7.1.12, we see that the Poisson process
N_t of rate λ has independent increments and mean function EN_t = λt. Conse-
quently, by Proposition 8.2.4 the compensated Poisson process M_t = N_t − λt is
also a martingale (and F_t^M = F_t^N).
Similarly to Exercise 5.1.9, as you check next, a Gaussian martingale {X_t, t ≥ 0}
is necessarily square-integrable and of independent increments, in which case M_t =
X_t^2 − ⟨X⟩_t is also a martingale.
Exercise 8.2.6.
(a) Show that if {X_t, t ≥ 0} is a square-integrable S.P. having zero-mean
independent increments, then (X_t^2 − ⟨X⟩_t, F_t^X, t ≥ 0) is a MG with ⟨X⟩_t =
EX_t^2 − EX_0^2 a non-random, non-decreasing function.
(b) Prove that the conclusion of part (a) applies to any martingale {X_t, t ≥ 0}
which is a Gaussian S.P.
(c) Deduce that if {X_t, t ≥ 0} is square-integrable, with X_0 = 0 and zero-
mean, stationary independent increments, then (X_t^2 − tEX_1^2, F_t^X, t ≥ 0)
is a MG.
In the context of the Brownian motion {B_t, t ≥ 0}, we deduce from part (b) of
Exercise 8.2.6 that {B_t^2 − t, t ≥ 0} is a MG. This is merely a special case of the
following collection of MGs associated with the standard Brownian motion.
Exercise 8.2.7. Let u_{k+1}(t, y, θ) = (∂/∂θ) u_k(t, y, θ) for k ≥ 0 and u_0(t, y, θ) =
exp(θy − θ^2 t/2).
(a) Show that for any θ ∈ R the S.P. (u_0(t, B_t, θ), t ≥ 0) is a martingale with
respect to F_t^B.
(b) Check that for k = 1, 2, . . .,
u_k(t, y, 0) = Σ_{r=0}^{[k/2]} [k!/((k − 2r)! r!)] y^{k−2r} (−t/2)^r .
(c) Deduce that the S.P. (u_k(t, B_t, θ), t ≥ 0), k = 1, 2, . . . are also MGs
with respect to F_t^B, as are B_t^2 − t, B_t^3 − 3tB_t, B_t^4 − 6tB_t^2 + 3t^2 and
B_t^6 − 15tB_t^4 + 45t^2B_t^2 − 15t^3.
(d) Verify that for each k ∈ Z_+ and θ ∈ R the function u_k(t, y, θ) solves the
heat equation u_t(t, y) + (1/2) u_{yy}(t, y) = 0.
The collection of sub-MGs (equivalently, sup-MGs or MGs), is closed under the
addition of S.P. (compare with Exercise 5.1.19).
Exercise 8.2.8. Suppose (X_t, F_t) and (Y_t, F_t) are sub-MGs and t ↦ f(t) a non-
decreasing, non-random function.
(a) Verify that (X_t + Y_t, F_t) is a sub-MG and hence so is (X_t + f(t), F_t).
(b) Rewrite this, first for sup-MGs X_t and Y_t, then in case of MGs.
With the same proof as in Proposition 5.1.22, you are next to verify that the
collection of sub-MGs (and that of sup-MGs), is also closed under the application
of a non-decreasing convex (concave, respectively) function (c.f. Example 5.1.23
for the most common choices of this function).
Exercise 8.2.9. Suppose the integrable S.P. {X_t, t ≥ 0} and convex function Φ :
R → R are such that E[|Φ(X_t)|] < ∞ for all t ≥ 0. Show that if (X_t, F_t) is a MG
then (Φ(X_t), F_t) is a sub-MG and the same applies even when (X_t, F_t) is only a
sub-MG, provided Φ(·) is also non-decreasing.
As you show next, the martingale Bayes rule of Exercise 5.5.16 applies also for a
positive, continuous time martingale (Z_t, F_t, t ≥ 0).
Exercise 8.2.10. Suppose (Z_t, F_t, t ≥ 0) is a (strictly) positive MG on (Ω, F, P),
normalized so that EZ_0 = 1. For each t > 0, let P_t = P|_{F_t} and consider the equiva-
lent probability measure Q_t on (Ω, F_t) of Radon-Nikodym derivative dQ_t/dP_t = Z_t.
(a) Show that Q_s = Q_t|_{F_s} for any s ∈ [0, t].
(b) Fixing u ≤ s ∈ [0, t] and Y ∈ L^1(Ω, F_s, Q_t) show that Q_t-a.s. (hence
also P-a.s.), E_{Q_t}[Y|F_u] = E[Y Z_s|F_u]/Z_u.
(c) Verify that if λ' > 0 and N_t is a Poisson process of rate λ > 0 then
Z_t = e^{(λ−λ')t}(λ'/λ)^{N_t} is a strictly positive martingale with EZ_0 = 1 and
show that {N_t, t ∈ [0, T]} is a Poisson process of rate λ' under the measure
Q_T, for any finite T.
Remark. Up to the re-parametrization θ = log(λ'/λ), the martingale Z_t of part (c)
of the preceding exercise is of the form Z_t = u_0(t, N_t, θ) for u_0(t, y, θ) = exp(θy −
λt(e^θ − 1)). Building on it and following the line of reasoning of Exercise 8.2.7
yields the analogous collection of martingales for the Poisson process {N_t, t ≥ 0}.
For example, here the functions u_k(t, y, θ) on (t, y) ∈ R_+ × Z_+ solve the equation
u_t(t, y) + λ[u(t, y + 1) − u(t, y)] = 0, with M_t = u_1(t, N_t, 0) being the compensated
Poisson process of Example 8.2.5 while u_2(t, N_t, 0) is the martingale M_t^2 − λt.
Remark. While beyond our scope, we note in passing that in continuous time the
martingale transform of Definition 5.1.27 is replaced by the stochastic integral Y_t =
∫_0^t V_s dX_s. This stochastic integral results with stochastic differential equations and
is the main object of study of stochastic calculus (to which many texts are devoted,
among them [KaS97]). In case V_s = X_s is the Wiener process W_s, the analog
of Example 5.1.29 is Y_t = ∫_0^t W_s dW_s, which for the appropriate definition of the
stochastic integral (due to Itô), is merely the martingale Y_t = (1/2)(W_t^2 − t). Indeed,
Itô's stochastic integral is defined via martingale theory, at the cost of deviating
from the standard integration by parts formula. The latter would have applied if
the sample functions t ↦ W_t(ω) were differentiable w.p.1., which is definitely not
the case (as we shall see in Section 9.3).
Exercise 8.2.11. Suppose the S.P. {X_t, t ≥ 0} is integrable and F_t-adapted. Show
that if E[X_u] ≥ E[X_τ] for any u ≥ 0 and F_t-stopping time τ whose range τ(Ω) is
a finite subset of [0, u], then (X_t, F_t, t ≥ 0) is a sub-MG.
Hint: Consider τ = sI_A + uI_{A^c} with s ∈ [0, u] and A ∈ F_s.
We conclude this sub-section with the relations between continuous and discrete
time (sub/super) martingales.
Example 8.2.12. Convince yourself that to any discrete time sub-MG (Y_n, G_n, n ∈
Z_+) corresponds the interpolated continuous time sub-MG (X_t, F_t, t ≥ 0) of the in-
terpolated right-continuous filtration F_t = G_{[t]} and RCLL S.P. X_t = Y_{[t]} of Example
8.1.5.
Remark 8.2.13. In proving results about continuous time MGs (or sub-MGs/sup-
MGs), we often rely on the converse of Example 8.2.12. Namely, for any non-
random, non-decreasing sequence s_k ∈ [0, ∞), if (X_t, F_t) is a continuous time
MG (or sub-MG/sup-MG), then clearly (X_{s_k}, F_{s_k}, k ∈ Z_+) is a discrete time MG
(or sub-MG/sup-MG, respectively), while (X_{s_k}, F_{s_k}, k ∈ Z_−) is a RMG (or reversed
sub-MG/sup-MG, respectively), where s_0 ≥ s_{−1} ≥ ⋯ ≥ s_{−k} ≥ ⋯.
8.2.2. Inequalities and convergence. In this section we extend the tail
inequalities and convergence properties of discrete time sub-MGs (or sup-MGs), to
the corresponding results for sub-MGs (and sup-MGs) of right-continuous sample
functions, which we call hereafter in short right-continuous sub-MGs (or sup-MGs).
We start with Doob's inequality (compare with Theorem 5.2.6).
Theorem 8.2.14 (Doob's inequality). If {X_s, s ≥ 0} is a right-continuous
sub-MG, then for t ≥ 0 finite, M_t = sup_{0≤s≤t} X_s, and any x > 0,
(8.2.2)   P(M_t ≥ x) ≤ x^{−1} E[X_t I_{{M_t ≥ x}}] ≤ x^{−1} E[(X_t)_+] .
Proof. It suffices to show that for any y > 0
(8.2.3)   yP(M_t > y) ≤ E[X_t I_{{M_t > y}}] .
Indeed, by dominated convergence, taking y ↑ x yields the left inequality in (8.2.2),
and the proof is then complete since E[ZI_A] ≤ E[(Z)_+] for any event A and inte-
grable R.V. Z.
Turning to prove (8.2.3), fix hereafter t ≥ 0 and let Q_{t+}^{(2,ℓ)} denote the finite set
of dyadic rationals of the form j2^{−ℓ} ∈ [0, t] augmented by t. Recall Remark
8.2.13 that enumerating Q_{t+}^{(2,ℓ)} in a non-decreasing order produces a discrete time
sub-MG X_{s_k}. Applying Doob's inequality (5.2.1) for this sub-MG, we find that
xP(M(ℓ) ≥ x) ≤ E[X_t I_{{M(ℓ) ≥ x}}] for
M(ℓ) = max_{s ∈ Q_{t+}^{(2,ℓ)}} X_s ,
and any x > 0. Considering x ↓ y, it then follows by dominated convergence that
yP(M(ℓ) > y) ≤ E[X_t I_{{M(ℓ) > y}}] ,
for any ℓ ≥ 1. Next note that as ℓ ↑ ∞,
M(ℓ) ↑ M(∞) = sup_{s ∈ Q_t^{(2)} ∪ {t}} X_s ,
and moreover, M(∞) = M_t by the right-continuity of the sample function t ↦ X_t
(compare with part (a) of Exercise 7.2.18). Consequently, for ℓ → ∞ both P(M(ℓ) >
y) ↑ P(M_t > y) and E[X_t I_{{M(ℓ) > y}}] → E[X_t I_{{M_t > y}}], thus completing the proof of
(8.2.3). □
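Doob's inequality (8.2.2) can be sanity-checked by Monte Carlo (an illustration of ours, not a proof): for the non-negative sub-MG X_s = |W_s| built from a discretized Wiener process, both sides of the inequality are easy to estimate. All parameters below are arbitrary choices.

```python
import numpy as np

# Estimate both sides of P(M_t >= x) <= x^{-1} E[X_t 1_{M_t >= x}] for the
# sub-MG X_s = |W_s| on a discrete time grid; the discrete-time Doob
# inequality (5.2.1) holds exactly for the sampled sequence.

rng = np.random.default_rng(1)
n_paths, n_steps, t, x = 50_000, 100, 1.0, 1.5

dW = rng.standard_normal((n_paths, n_steps)) * np.sqrt(t / n_steps)
W = np.cumsum(dW, axis=1)
X = np.abs(W)                       # |W_s| is a non-negative sub-MG
M_t = X.max(axis=1)                 # running maximum up to time t
X_t = X[:, -1]

lhs = (M_t >= x).mean()
mid = (X_t * (M_t >= x)).mean() / x
rhs = X_t.mean() / x                # E[(X_t)_+]/x, since X_t >= 0
assert lhs <= mid + 1e-3            # Doob's inequality, up to MC error
assert mid <= rhs + 1e-12
```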
With (y)_+^p denoting hereafter the function (max(y, 0))^p, we proceed with the re-
finement of Doob's inequality for MGs or when the positive part of a sub-MG has
finite p-th moment for some p > 1 (compare to Exercise 5.2.11).
Exercise 8.2.15.
(a) Show that in the setting of Theorem 8.2.14, for any p ≥ 1, finite t ≥ 0
and x > 0,
P(M_t ≥ x) ≤ x^{−p} E[(X_t)_+^p] .
(b) Show that if {Y_s, s ≥ 0} is a right-continuous MG, then
P( sup_{0≤s≤t} |Y_s| ≥ y) ≤ y^{−p} E[|Y_t|^p] .
By integrating Doob's inequality (8.2.2) you bound the moments of the supremum
of a right-continuous sub-MG over a compact time interval.
Corollary 8.2.16 (L^p maximal inequalities). With q = q(p) = p/(p − 1), for
any p > 1, t ≥ 0 and a right-continuous sub-MG {X_s, s ≥ 0},
(8.2.4)   E[( sup_{0≤u≤t} X_u)_+^p] ≤ q^p E[(X_t)_+^p] ,
and if {Y_s, s ≥ 0} is a right-continuous MG then also
(8.2.5)   E[( sup_{0≤u≤t} |Y_u|)^p] ≤ q^p E[|Y_t|^p] .
Proof. Adapting the proof of Corollary 5.2.13, the bound (8.2.4) is just the
conclusion of part (b) of Lemma 1.4.31 for the non-negative variables X = (X_t)_+
and Y = (M_t)_+, with the left inequality in (8.2.2) providing its hypothesis. We
are thus done, as the bound (8.2.5) is merely (8.2.4) in case of the non-negative
sub-MG X_t = |Y_t|. □
In case p = 1 we have the following extension of Exercise 5.2.15.
Exercise 8.2.17. Suppose {X_s, s ≥ 0} is a non-negative, right-continuous sub-MG.
Show that for any t ≥ 0,
E[ sup_{0≤u≤t} X_u ] ≤ (1 − e^{−1})^{−1} { 1 + E[X_t (log X_t)_+] } .
Hint: Relying on Exercise 5.2.15, interpolate as in our derivation of (8.2.2).
Doob's fundamental up-crossing inequality (see Lemma 5.2.18), extends to the
number of up-crossings in dyadic-times, as defined next.
Definition 8.2.18. The number of up-crossings (in dyadic-times), of the interval
[a, b] by the continuous time S.P. {X_u, u ∈ [0, t]}, is the random variable U_t[a, b] =
sup_ℓ U_{t,ℓ}[a, b], where U_{t,ℓ}[a, b](ω) denotes the number of up-crossings of [a, b] by the
finite sequence X_{s_k}(ω), s_k ∈ Q_{t+}^{(2,ℓ)}, as in Definition 5.2.17.
Remark. It is easy to check that for any right-continuous S.P. the number of
up-crossings in dyadic-times coincides with the natural definition of the number of
up-crossings U_t^*[a, b] = sup{U_F[a, b] : F a finite subset of [0, t]}, where U_F[a, b] is the
number of up-crossings of [a, b] by {X_s(ω), s ∈ F}. However, for example the non-
random S.P. X_t = I_{{t ∉ Q}} has zero up-crossings in dyadic-times, while U_t^*[a, b] = ∞
for any 1 > b > a > 0 and t > 0. Also, U_t^*[a, b] may be non-measurable on (Ω, F)
in the absence of right continuity of the underlying S.P.
Lemma 8.2.19 (Doob's up-crossing inequality). If {X_s, s ≥ 0} is a sup-MG,
then for any t ≥ 0,
(8.2.6)   (b − a)E(U_t[a, b]) ≤ E[(X_t − a)_−] − E[(X_0 − a)_−]   ∀a < b .
Proof. Fix b > a and t ≥ 0. Since Q_{t+}^{(2,ℓ)} is a non-decreasing sequence of
finite sets, by definition ℓ ↦ U_{t,ℓ} is non-decreasing and by monotone convergence
it suffices to show that for all ℓ,
(b − a)E( U_{t,ℓ}[a, b] ) ≤ E[(X_t − a)_−] − E[(X_0 − a)_−] .
Recall Remark 8.2.13 that enumerating s_k ∈ Q_{t+}^{(2,ℓ)} in a non-decreasing order pro-
duces a discrete time sup-MG {X_{s_k}, k = 0, . . . , n} with s_0 = 0 and s_n = t, so this
is merely Doob's up-crossing inequality (5.2.6). □
Since Doob's maximal and up-crossing inequalities apply for any right-continuous
sub-MG (and sup-MG), so do most convergence results we have deduced from them
in Section 5.3. For completeness, we provide a short summary of these results (and
briefly outline how to adapt their proofs), starting with Doob's a.s. convergence
theorem.
Theorem 8.2.20 (Doob's convergence theorem). Suppose the right-continuous
sup-MG {X_t, t ≥ 0} is such that sup_t E[(X_t)_−] < ∞. Then, X_t →(a.s.) X_∞ and
E|X_∞| ≤ liminf_{t→∞} E|X_t| is finite.
Proof. Let U_∞[a, b] = sup_{n∈Z_+} U_n[a, b]. Paralleling the proof of Theorem
5.3.2, in view of our assumption that sup_t E[(X_t)_−] is finite, it follows from Lemma
8.2.19 and monotone convergence that E(U_∞[a, b]) is finite for each b > a. Hence,
w.p.1. the variables U_∞[a, b](ω) are finite for all a, b ∈ Q, a < b. By sample path
right-continuity and diagonal selection, in the set
Γ_{a,b} = {ω : liminf_{t→∞} X_t(ω) < a < b < limsup_{t→∞} X_t(ω)} ,
it suffices to consider t ∈ Q^{(2)}, hence Γ_{a,b} ∈ F. Further, if ω ∈ Γ_{a,b}, then
X_{q_{2k−1}}(ω) < a < b < X_{q_{2k}}(ω) ,
for some dyadic rationals q_k ↑ ∞, hence U_∞[a, b](ω) = ∞. Consequently, the a.s.
convergence of X_t to X_∞ follows as in the proof of Lemma 5.3.1. Finally, the stated
bound on E|X_∞| is then derived exactly as in the proof of Theorem 5.3.2. □
Remark. Similarly to Exercise 5.3.3, for a right-continuous sub-MG X_t the finite-
ness of sup_t E|X_t|, of sup_t E[(X_t)_+] and of liminf_{t→∞} E|X_t| are equivalent to each other
and to the existence of a finite limit for E|X_t| (or equivalently, for lim_{t→∞} E[(X_t)_+]),
each of which further implies that X_t →(a.s.) X_∞ integrable. Replacing (X_t)_+ by (X_t)_−,
the same applies for sup-MGs. In particular, any non-negative, right-continuous
sup-MG {X_t, t ≥ 0} converges a.s. to an integrable X_∞ such that EX_∞ ≤ EX_0.
Note that Doob's convergence theorem does not apply for the Wiener process
{W_t, t ≥ 0} (as E[(W_t)_+] = √(t/(2π)) is unbounded). Indeed, as we see in Exercise
8.2.34, almost surely, limsup_{t→∞} W_t = ∞ and liminf_{t→∞} W_t = −∞. That is, the
magnitude of oscillations of the Brownian sample path grows indefinitely.
In contrast, Doob's convergence theorem allows you to extend Doob's inequality
(8.2.2) to the maximal value of a U.I. right-continuous sub-MG over all t ≥ 0.
Exercise 8.2.21. Let M_∞ = sup_{s≥0} X_s for a U.I. right-continuous sub-MG {X_t, t ≥
0}. Show that X_t →(a.s.) X_∞ integrable and for any x > 0,
(8.2.7)   P(M_∞ ≥ x) ≤ x^{−1} E[X_∞ I_{{M_∞ ≥ x}}] ≤ x^{−1} E[(X_∞)_+] .
Hint: Start with (8.2.3) and adapt the proof of Corollary 5.3.4.
The following integrability condition is closely related to L^1 convergence of right-
continuous sub-MGs (and sup-MGs).
Definition 8.2.22. We say that a sub-MG (X_t, F_t, t ≥ 0) is right closable, or
has a last element (X_∞, F_∞), if F_t ⊆ F_∞ and X_∞ ∈ L^1(Ω, F_∞, P) is such that
for any t ≥ 0, almost surely E[X_∞|F_t] ≥ X_t. A similar definition applies for a
sup-MG, but with E[X_∞|F_t] ≤ X_t, and for a MG, in which case we require that
E[X_∞|F_t] = X_t, namely, that X_t is a Doob's martingale of X_∞ with respect to
F_t (see Definition 5.3.13).
Building upon Doob's convergence theorem, we extend Theorem 5.3.12 and Corol-
lary 5.3.14, showing that for right-continuous MGs the properties of having a last
element, uniform integrability and L^1 convergence, are equivalent to each other.
Proposition 8.2.23. The following conditions are equivalent for a right-continuous
non-negative sub-MG {X_t, t ≥ 0}:
(a) {X_t} is U.I.;
(b) X_t →(L^1) X_∞;
(c) X_t →(a.s.) X_∞, a last element of {X_t}.
Further, even without non-negativity (a) ⇔ (b) ⇒ (c) and a right-continuous
MG has any, hence all, of these properties, if and only if it is a Doob martingale.
Remark. By definition, any non-negative sup-MG has a last element X_∞ = 0 (and
obviously, the same applies for any non-positive sub-MG), but many non-negative
sup-MGs are not U.I. (for example, any non-degenerate critical branching process
is such, as explained in the remark following the proof of Proposition 5.5.5). So,
whereas a MG with a last element is U.I., this is not always the case for sub-MGs
(and sup-MGs).
Proof. (a) $\Leftrightarrow$ (b): U.I. implies $L^1$-boundedness, which for a right-continuous sub-MG yields, by Doob's convergence theorem, the convergence a.s., and hence in probability, of $X_t$ to an integrable $X_\infty$. Clearly, the $L^1$ convergence of $X_t$ to $X_\infty$ also implies such convergence in probability. Either way, recall Vitali's convergence theorem (i.e. Theorem 1.3.49), that U.I. is equivalent to $L^1$ convergence when $X_t \stackrel{p}{\to} X_\infty$. We thus deduce the equivalence of (a) and (b) for right-continuous sub-MGs, where either (a) or (b) yields the corresponding a.s. convergence.
8.2. CONTINUOUS TIME MARTINGALES 301
(a) and (b) yield a last element: With $X_\infty$ denoting the a.s. and $L^1$ limit of the U.I. collection $\{X_t\}$, it is left to show that $E[X_\infty | \mathcal{F}_s] \ge X_s$ for any $s \ge 0$. Fixing $t > s$ and $A \in \mathcal{F}_s$, by the definition of sub-MG we have $E[X_t I_A] \ge E[X_s I_A]$. Further, $E[X_t I_A] \to E[X_\infty I_A]$ (recall part (c) of Exercise 1.3.55). Consequently, $E[X_\infty I_A] \ge E[X_s I_A]$ for all $A \in \mathcal{F}_s$. That is, $E[X_\infty | \mathcal{F}_s] \ge X_s$.
Last element and non-negative $\Rightarrow$ (a): Since $X_t \ge 0$ and $EX_t \le EX_\infty$ finite, it follows that for any finite $t \ge 0$ and $M > 0$, by Markov's inequality $P(X_t > M) \le M^{-1} EX_t \le M^{-1} EX_\infty \to 0$ as $M \to \infty$. It then follows that $E[X_\infty I_{\{X_t > M\}}]$ converges to zero as $M \to \infty$, uniformly in $t$ (recall part (b) of Exercise 1.3.43). Further, by definition of the last element we have that $E[X_t I_{\{X_t > M\}}] \le E[X_\infty I_{\{X_t > M\}}]$. Therefore, $E[X_t I_{\{X_t > M\}}]$ also converges to zero as $M \to \infty$, uniformly in $t$, i.e. $\{X_t\}$ is U.I.
Equivalence for MGs: For a right-continuous MG the equivalent properties (a) and (b) imply the a.s. convergence to $X_\infty$ such that for any fixed $t \ge 0$, a.s. $X_t \le E[X_\infty | \mathcal{F}_t]$. Applying this also for the right-continuous MG $-X_t$ we deduce that $X_\infty$ is a last element of the Doob's martingale $X_t = E[X_\infty | \mathcal{F}_t]$. To complete the proof recall that any Doob's martingale is U.I. (see Proposition 4.2.33).
Finally, paralleling the proof of Proposition 5.3.21, upon combining Doob's convergence theorem 8.2.20 with Doob's $L^p$ maximal inequality (8.2.5), we arrive at Doob's $L^p$ MG convergence.
Proposition 8.2.24 (Doob's $L^p$ martingale convergence). If a right-continuous MG $\{X_t, t \ge 0\}$ is $L^p$-bounded for some $p > 1$, then $X_t \to X_\infty$ a.s. and in $L^p$. Moreover, $\|X_\infty\|_p = \liminf_{t \to \infty} \|X_t\|_p$.
Throughout we rely on right continuity of the sample functions to control the tails of continuous time sub/sup-MGs and thereby deduce convergence properties. Of course, the interpolated MGs of Example 8.2.12 and the MGs derived in Exercise 8.2.7 out of the Wiener process are right-continuous. More generally, as shown next, for any MG the right-continuity of the filtration translates (after a modification) into RCLL sample functions, and only a little more is required for an RCLL modification in case of a sup-MG (or a sub-MG).
Theorem 8.2.25. Suppose $(X_t, \mathcal{F}_t, t \ge 0)$ is a sup-MG with right-continuous filtration $\{\mathcal{F}_t, t \ge 0\}$ and $t \mapsto EX_t$ is right-continuous. Then, there exists an RCLL modification $\widetilde{X}_t$, $t \ge 0$, of $X_t$, $t \ge 0$, such that $(\widetilde{X}_t, \mathcal{F}_t, t \ge 0)$ is a sup-MG.
Proof. Step 1. To construct $\widetilde{X}_t$, $t \ge 0$, recall Lemma 8.2.19 that any sup-MG $X_t$, $t \ge 0$, has a finite expected number of up-crossings $E(U_n[a, b])$ for each $b > a$ and $n \in \mathbb{Z}_+$. Hence, $P(\Gamma) = 0$, where
$\Gamma = \{\omega : U_n[a, b](\omega) = \infty, \text{ for some } n \in \mathbb{Z}_+,\ a, b \in \mathbb{Q},\ b > a\}$.
Further, if $\omega$ is such that for some $0 \le t < n$,
$\liminf_{q \downarrow t,\, q \in \mathbb{Q}^{(2)}} X_q(\omega) < \limsup_{q \downarrow t,\, q \in \mathbb{Q}^{(2)}} X_q(\omega)$,
then there exist $a, b \in \mathbb{Q}$, $b > a$, and a decreasing sequence $q_k \in \mathbb{Q}^{(2)}_n$ such that $X_{q_{2k}}(\omega) < a < b < X_{q_{2k-1}}(\omega)$, which in turn implies that $U_n[a, b](\omega)$ is infinite. Thus, if $\omega \notin \Gamma$ then the limits $X_{t^+}(\omega)$ of $X_q(\omega)$ over dyadic rationals $q \downarrow t$ exist at all $t \ge 0$. Considering the R.V.-s $M^{\pm}_n = \sup\{(X_q)^{\pm} : q \in \mathbb{Q}^{(2)}_n\}$ and the event
$\Gamma^* = \{\omega : M^+_n(\omega) \vee M^-_n(\omega) = \infty, \text{ for some } n \in \mathbb{Z}_+\}$,
observe that if $\omega \notin \Gamma^*$ then $X_{t^+}(\omega)$ are finite valued for all $t \ge 0$. Further, setting $\widetilde{X}_t(\omega) = X_{t^+}(\omega) I_{(\Gamma \cup \Gamma^*)^c}(\omega)$, note that $\widetilde{X}_t(\omega)$ is measurable and finite for each $t \ge 0$. We conclude the construction by verifying that $P(\Gamma^*) = 0$. Indeed, recall that right-continuity was applied only at the end of the proof of Doob's inequality (8.2.2), so using only the sub-MG property of $(X_t)^-$ we have that for all $y > 0$,
$P(M^-_n > y) \le y^{-1} E[(X_n)^-]$.
Hence, $M^-_n$ is a.s. finite. Starting with Doob's second inequality (5.2.3) for the sub-MG $-X_t$, by the same reasoning $P(M^+_n > y) \le y^{-1}(E[(X_n)^-] + E[X_0])$ for all $y > 0$. Thus, $M^+_n$ is also a.s. finite and as claimed $P(\Gamma^*) = 0$.
Step 2. Recall that our convention, as in Remark 8.1.3, implies that the P-null event $\Gamma \cup \Gamma^* \in \mathcal{F}_0$. It then follows by the $\mathcal{F}_t$-adaptedness of $X_t$ and the preceding construction of $X_{t^+}$, that $\widetilde{X}_t$ is $\mathcal{F}_{t^+}$-adapted, namely $\mathcal{F}_t$-adapted (by the assumed right-continuity of $\mathcal{F}_t$, $t \ge 0$). Clearly, our construction of $X_{t^+}$ yields right-continuous sample functions $t \mapsto \widetilde{X}_t(\omega)$. Further, a re-run of part of Step 1 yields the RCLL property, by showing that for any $\omega \in \Gamma^c$ the sample function $t \mapsto X_{t^+}(\omega)$ has finite left limits at each $t > 0$. Indeed, otherwise there exist $a, b \in \mathbb{Q}$, $b > a$, and $s_k \uparrow t$ such that $X_{s^+_{2k-1}}(\omega) < a < b < X_{s^+_{2k}}(\omega)$. By construction of $X_{t^+}$ this implies the existence of $q_k \in \mathbb{Q}^{(2)}_n$ such that $q_k \uparrow t$ and $X_{q_{2k-1}}(\omega) < a < b < X_{q_{2k}}(\omega)$. Consequently, in this case $U_n[a, b](\omega) = \infty$, in contradiction with $\omega \in \Gamma^c$.
Step 3. Fixing $s \ge 0$, we show that $X_{s^+} = X_s$ for a.e. $\omega$, hence the $\mathcal{F}_t$-adapted S.P. $\widetilde{X}_t$, $t \ge 0$, is a modification of the sup-MG $(X_t, \mathcal{F}_t, t \ge 0)$ and as such, $(\widetilde{X}_t, \mathcal{F}_t, t \ge 0)$ is also a sup-MG. Turning to show that $X_{s^+} \stackrel{a.s.}{=} X_s$, fix non-random dyadic rationals $q_k \downarrow s$ as $k \to \infty$ and recall Remark 8.2.13 that $(X_{q_k}, \mathcal{F}_{q_k}, k \in \mathbb{Z}_-)$ is a reversed sup-MG. Further, from the sup-MG property, for any $A \in \mathcal{F}_s$,
$\sup_k E[X_{q_k} I_A] \le E[X_s I_A] < \infty$.
Considering $A = \Omega$, we deduce by Exercise 5.5.21 that the collection $\{X_{q_k}\}$ is U.I. and thus, the a.s. convergence of $X_{q_k}$ to $X_{s^+}$ yields that $E[X_{q_k} I_A] \to E[X_{s^+} I_A]$ (recall part (c) of Exercise 1.3.55). Moreover, $E[X_{q_k}] \to E[X_s]$ in view of the assumed right-continuity of $t \mapsto E[X_t]$. Consequently, taking $k \to \infty$ we deduce that $E[X_{s^+} I_A] \le E[X_s I_A]$ for all $A \in \mathcal{F}_s$, with equality in case $A = \Omega$. With both $X_{s^+}$ and $X_s$ measurable on $\mathcal{F}_s$, it thus follows that a.s. $X_{s^+} = X_s$, as claimed.
8.2.3. The optional stopping theorem. We are ready to extend the very useful Doob's optional stopping theorem (see Theorem 5.4.1) to the setting of right-continuous sub-MGs.

Theorem 8.2.26 (Doob's optional stopping). If $(X_t, \mathcal{F}_t, t \in [0, \infty])$ is a right-continuous sub-MG with a last element $(X_\infty, \mathcal{F}_\infty)$ in the sense of Definition 8.2.22, then for any $\mathcal{F}_t$-Markov times $\theta \le \tau$, the integrable $X_\theta$ and $X_\tau$ are such that $EX_\theta \le EX_\tau$, with equality in case of a MG.
Remark. Recall Proposition 8.1.8 that right-continuous, $\mathcal{F}_t$-adapted $\{X_s, s \ge 0\}$ is $\mathcal{F}_t$-progressively measurable, hence also $\mathcal{F}_{t^+}$-progressively measurable. With the existence of $X_\infty \in m\mathcal{F}_\infty$, it then follows from Proposition 8.1.13 that $X_\tau \in m\mathcal{F}_{\tau^+}$ is a R.V. for any $\mathcal{F}_t$-Markov time $\tau$ (and by the same argument $X_\tau \in m\mathcal{F}_\tau$ in case $\tau$ is an $\mathcal{F}_t$-stopping time).
Proof. Fixing $\ell \ge 1$ and setting $s_k = k 2^{-\ell}$ for $k \in \mathbb{Z}_+$, recall Remark 8.2.13 that $(X_{s_k}, \mathcal{F}_{s_k}, k \in \mathbb{Z}_+)$ is a discrete time sub-MG. Further, the assumed existence of a last element $(X_\infty, \mathcal{F}_\infty)$ for the sub-MG $(X_t, \mathcal{F}_t, t \ge 0)$ implies that a.s. $E[X_\infty | \mathcal{F}_{s_k}] \ge X_{s_k}$ for any $k \in \mathbb{Z}_+$. With a slight abuse of notations we call an $\{s_k\}$-valued R.V. $\eta$ an $\{\mathcal{F}_{s_k}\}$-stopping time if $\{\eta \le s_k\} \in \mathcal{F}_{s_k}$ for all $k \in \mathbb{Z}_+$. Then, as explained in Remark 5.4.2, it thus follows from Theorem 5.4.1 that for any $\{\mathcal{F}_{s_k}\}$-stopping times $\theta_\ell \le \tau_\ell$, the R.V. $X_{\theta_\ell}$ and $X_{\tau_\ell}$ are integrable, with
(8.2.8)    $E[X_{\tau_\ell}] \ge E[X_{\theta_\ell}] \ge E[X_0]$.
In Lemma 8.1.16 we have constructed an $\{\mathcal{F}_{s_k}\}$-stopping time $\tau_\ell = 2^{-\ell}([2^\ell \tau] + 1)$ for the given $\mathcal{F}_t$-Markov time $\tau$. Similarly, we have the $\{\mathcal{F}_{s_k}\}$-stopping time $\theta_\ell = 2^{-\ell}([2^\ell \theta] + 1)$ corresponding to the $\mathcal{F}_t$-Markov time $\theta$. Our assumption that $\theta \le \tau$ translates to $\theta_\ell \le \tau_\ell$, hence the inequality (8.2.8) holds for any positive integer $\ell$. By their construction, $\tau_\ell(\omega) \downarrow \tau(\omega)$ and $\theta_\ell(\omega) \downarrow \theta(\omega)$ as $\ell \to \infty$. Thus, by the assumed right-continuity of $t \mapsto X_t(\omega)$, we have the a.s. convergence of $X_{\tau_\ell}$ to $X_\tau$ and of $X_{\theta_\ell}$ to $X_\theta$ (when $\ell \to \infty$).
We claim that $(X_{\tau_{-n}}, \mathcal{F}_{\tau_{-n}}, n \in \mathbb{Z}_-)$ is a discrete time reversed sub-MG. Indeed, fixing $\ell \ge 2$, note that $\mathbb{Q}^{(2,\ell-1)}$ is a subset of $\mathbb{Q}^{(2,\ell)} = \{s_k\}$. Appealing once more to Remark 5.4.2, we can thus apply Lemma 5.4.3 for the pair $\tau_{\ell-1} \ge \tau_\ell$ of $\{\mathcal{F}_{s_k}\}$-stopping times and deduce that a.s.
$E[X_{\tau_{\ell-1}} | \mathcal{F}_{\tau_\ell}] \ge X_{\tau_\ell}$.
The latter inequality holds for all $\ell \ge 2$, amounting to the claimed reversed sub-MG property. Since in addition $\inf_n E X_{\tau_{-n}} \ge E X_0$ is finite (see (8.2.8)), we deduce from Exercise 5.5.21 that the sequence $\{X_{\tau_\ell}\}_{\ell=1}^{\infty}$ is U.I. The same argument shows that $\{X_{\theta_\ell}\}_{\ell=1}^{\infty}$ is U.I. Hence, both sequences converge in $L^1$ to their respective limits $X_\tau$ and $X_\theta$. In particular, both variables are integrable and in view of (8.2.8) they further satisfy the stated inequality $E X_\tau \ge E X_\theta$.
We proceed with a few of the consequences of Doob's optional stopping theorem, starting with the extension of Lemma 5.4.3 to our setting.

Corollary 8.2.27. If $(X_t, \mathcal{F}_t, t \in [0, \infty])$ is a right-continuous sub-MG with a last element, then $E[X_\tau | \mathcal{F}_{\theta^+}] \ge X_\theta$ w.p.1. for any $\mathcal{F}_t$-Markov times $\theta \le \tau$ (with equality in case of a MG), and if $\theta$ is an $\mathcal{F}_t$-stopping time, then further $E[X_\tau | \mathcal{F}_\theta] \ge X_\theta$ w.p.1. (again with equality in case of a MG).

Proof. Fixing $A \in \mathcal{F}_{\theta^+}$, it follows as in the proof of Lemma 5.4.3 that $\eta = \theta I_A + \tau I_{A^c}$ is an $\mathcal{F}_{t^+}$-stopping time. Thus, applying Theorem 8.2.26 for $\eta \le \tau$ we deduce that $E[X_\eta] \le E[X_\tau]$. Further, $X_\theta$ and $X_\tau$ are integrable, so proceeding as in the proof of Lemma 5.4.3 we get that $E[(Z_+ - X_\theta) I_A] \ge 0$ for $Z_+ = E[X_\tau | \mathcal{F}_{\theta^+}]$ and all $A \in \mathcal{F}_{\theta^+}$. Recall, as noted just after the statement of Theorem 8.2.26, that $X_\theta \in m\mathcal{F}_{\theta^+}$ for the $\mathcal{F}_{t^+}$-stopping time $\theta$, and consequently, a.s. $Z_+ \ge X_\theta$, as claimed.
In case $\theta$ is an $\mathcal{F}_t$-stopping time, note that by the tower property (and taking out the known $I_A$), also $E[(Z - X_\theta) I_A] \ge 0$ for $Z = E[Z_+ | \mathcal{F}_\theta] = E[X_\tau | \mathcal{F}_\theta]$ and all $A \in \mathcal{F}_\theta$. Here, as noted before, we further have that $X_\theta \in m\mathcal{F}_\theta$ and consequently, in this case, a.s. $Z \ge X_\theta$ as well. Finally, if $(X_t, \mathcal{F}_t, t \ge 0)$ is further a MG, combine the statement of the corollary for the sub-MGs $(X_t, \mathcal{F}_t)$ and $(-X_t, \mathcal{F}_t)$ to find that a.s. $X_\theta = Z_+$ (and $X_\theta = Z$ for an $\mathcal{F}_t$-stopping time $\theta$).
Remark 8.2.28. We refer hereafter to both Theorem 8.2.26 and its refinement in Corollary 8.2.27 as Doob's optional stopping. Clearly, both apply if the right-continuous sub-MG $(X_t, \mathcal{F}_t, t \ge 0)$ is such that a.s. $E[Y | \mathcal{F}_t] \ge X_t$ for some integrable R.V. $Y$ and each $t \ge 0$ (for by the tower property, such a sub-MG has the last element $X_\infty = E[Y | \mathcal{F}_\infty]$). Further, note that if $\tau$ is a bounded $\mathcal{F}_t$-Markov time, namely $\tau \in [0, T]$ for some non-random finite $T$, then you may dispense with the requirement of a last element by considering these results for $Y_t = X_{t \wedge T}$ (whose last element $Y_\infty$ is the integrable $X_T \in m\mathcal{F}_T \subseteq m\mathcal{F}_\infty$, and where $Y_\tau = X_\tau$, $Y_\theta = X_\theta$). As this applies whenever both $\theta$ and $\tau$ are non-random, we deduce from Corollary 8.2.27 that if $X_t$ is a right-continuous sub-MG (or MG) for some filtration $\mathcal{F}_t$, then it is also a right-continuous sub-MG (or MG, respectively) for the corresponding filtration $\mathcal{F}_{t^+}$.
The latter observation leads to the following result about the stopped continuous time sub-MG (compare to Theorem 5.1.32).

Corollary 8.2.29. If $\tau$ is an $\mathcal{F}_t$-stopping time and $(X_t, \mathcal{F}_t, t \ge 0)$ is a right-continuous sub-MG (or sup-MG or a MG), then $\widehat{X}_t = X_{t \wedge \tau(\omega)}(\omega)$ is also a right-continuous sub-MG (or sup-MG or MG, respectively), for this filtration.

Proof. Recall part (b) of Exercise 8.1.10, that $\tau \wedge u$ is a bounded $\mathcal{F}_t$-stopping time for each $u \in [0, \infty)$. Further, fixing $s \le u$, note that for any $A \in \mathcal{F}_s$,
$\eta = (s \wedge \tau) I_A + (u \wedge \tau) I_{A^c}$,
is an $\mathcal{F}_t$-stopping time such that $\eta \le u \wedge \tau$. Indeed, as $\mathcal{F}_s \subseteq \mathcal{F}_t$ when $s \le t$, clearly
$\{\eta \le t\} = \{\tau \le t\} \cup (A \cap \{s \le t\}) \cup (A^c \cap \{u \le t\}) \in \mathcal{F}_t$,
for all $t \ge 0$. In view of Remark 8.2.28 we thus deduce, upon applying Theorem 8.2.26, that $E[I_A X_{u \wedge \tau}] \ge E[I_A X_{s \wedge \tau}]$ for all $A \in \mathcal{F}_s$. From this we conclude that the sub-MG condition $E[X_{u \wedge \tau} | \mathcal{F}_s] \ge X_{s \wedge \tau}$ holds a.s., whereas the right-continuity of $t \mapsto X_{t \wedge \tau}$ is an immediate consequence of the right-continuity of $t \mapsto X_t$.
In the discrete time setting we have derived Theorem 5.4.1 also for U.I. $\{X_{\tau \wedge n}\}$ and mostly used it in this form (see Remark 5.4.2). Similarly, you now prove Doob's optional stopping theorem for a right-continuous sub-MG $(X_t, \mathcal{F}_t, t \ge 0)$ and an $\mathcal{F}_t$-stopping time $\tau$ such that $\{X_{t \wedge \tau}\}$ is U.I.

Exercise 8.2.30. Suppose $(X_t, \mathcal{F}_t, t \ge 0)$ is a right-continuous sub-MG.
(a) Fixing finite, non-random $u \ge 0$, show that for any $\mathcal{F}_t$-stopping times $\theta \le \tau$, a.s. $E[X_{\tau \wedge u} | \mathcal{F}_\theta] \ge X_{\theta \wedge u}$ (with equality in case of a MG).
Hint: Apply Corollary 8.2.27 for the stopped sub-MG $(X_{t \wedge u}, \mathcal{F}_t, t \ge 0)$.
(b) Show that if $\{X_{u \wedge \tau}, u \ge 0\}$ is U.I. then further $X_\theta$ and $X_\tau$ (defined as $\limsup_{t \to \infty} X_t$ in case $\tau = \infty$), are integrable and $E[X_\tau | \mathcal{F}_\theta] \ge X_\theta$ a.s. (again with equality for a MG).
Hint: Show that $Y_u = X_{u \wedge \tau}$ has a last element.
Relying on Corollary 8.2.27 you can now also extend Corollary 5.4.5.

Exercise 8.2.31. Suppose $(X_t, \mathcal{F}_t, t \ge 0)$ is a right-continuous sub-MG and $\{\tau_k\}$ is a non-decreasing sequence of $\mathcal{F}_t$-stopping times. Show that if $(X_t, \mathcal{F}_t, t \ge 0)$ has a last element or $\sup_k \tau_k \le T$ for some non-random finite $T$, then $(X_{\tau_k}, \mathcal{F}_{\tau_k}, k \in \mathbb{Z}_+)$ is a discrete time sub-MG.
Next, restarting a right-continuous sub-MG at a stopping time yields another sub-MG, and an interesting formula for the distribution of the supremum of certain non-negative MGs.

Exercise 8.2.32. Suppose $(X_t, \mathcal{F}_t, t \ge 0)$ is a right-continuous sub-MG and that $\tau$ is a bounded $\mathcal{F}_t$-stopping time.
(a) Verify that if $\{\mathcal{F}_t\}$ is a right-continuous filtration, then so is $\mathcal{G}_t = \mathcal{F}_{\tau + t}$.
(b) Taking $Y_t = X_{\tau + t} - X_\tau$, show that $(Y_t, \mathcal{G}_t, t \ge 0)$ is a right-continuous sub-MG.
Exercise 8.2.33. Consider a non-negative MG $\{Z_t, t \ge 0\}$ of continuous sample functions, such that $Z_0 = 1$ and $Z_t \stackrel{a.s.}{\to} 0$ as $t \to \infty$. Show that for any $x > 1$,
$P(\sup_{t > 0} Z_t \ge x) = x^{-1}$.
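The identity in Exercise 8.2.33 is easy to probe numerically. The sketch below is an illustration added here (not part of the original text; the horizon, step size and sample sizes are arbitrary choices). It uses the exponential martingale $Z_t = \exp(W_t - t/2)$, which is continuous with $Z_0 = 1$ and $Z_t \to 0$ a.s., together with the fact that $\sup_t Z_t \ge x$ exactly when $\sup_t (W_t - t/2) \ge \log x$.

```python
import numpy as np

rng = np.random.default_rng(0)
n_paths, n_steps, dt = 4000, 3000, 0.01  # horizon T = 30

# Track the running maximum of W_t - t/2 without storing whole paths.
pos = np.zeros(n_paths)
running_max = np.zeros(n_paths)
for _ in range(n_steps):
    pos += rng.normal(0.0, np.sqrt(dt), n_paths) - 0.5 * dt
    np.maximum(running_max, pos, out=running_max)

# Z_t = exp(W_t - t/2) is a continuous non-negative MG with Z_0 = 1 and
# Z_t -> 0 a.s., so Exercise 8.2.33 predicts P(sup_t Z_t >= x) = 1/x.
for x in (2.0, 4.0):
    est = (running_max >= np.log(x)).mean()
    print(f"x = {x}: simulated {est:.3f}, predicted {1 / x:.3f}")
```

Both the finite horizon and the discrete grid miss some crossings, so the estimates sit slightly below $1/x$; refining `dt` shrinks that bias.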
We conclude this sub-section with concrete applications of Doob's optional stopping theorem in the context of first hitting times for the Wiener process $(W_t, t \ge 0)$ of Definition 7.3.12.

Exercise 8.2.34. Let $Z^{(r)}_t = W_t - rt$ denote the Brownian motion with drift $-r \le 0$ (and continuous sample functions), starting at $Z^{(r)}_0 = 0$.
(a) Check that the first hitting time $\tau^{(r)}_b = \inf\{t \ge 0 : Z^{(r)}_t \ge b\}$ of level $b > 0$ is an $\mathcal{F}^W_t$-stopping time.
(b) For $s > 0$ set $\theta(r, s) = r + \sqrt{r^2 + 2s}$ and show that
$E[\exp(-s \tau^{(r)}_b)] = \exp(-\theta(r, s) b)$.
Hint: Check that $\frac{1}{2}\theta^2 - \theta r - s = 0$ at $\theta = \theta(r, s)$, then stop the martingale $u_0(t, W_t, \theta(r, s))$ of Exercise 8.2.7 at $\tau^{(r)}_b$.
(c) Letting $s \downarrow 0$ deduce that $P(\tau^{(r)}_b < \infty) = \exp(-2rb)$.
(d) Considering now $r = 0$ and $b \to \infty$, deduce that a.s. $\limsup_{t \to \infty} W_t = \infty$ and $\liminf_{t \to \infty} W_t = -\infty$.
Exercise 8.2.35. Consider the exit time $\tau^{(r)}_{a,b} = \inf\{t \ge 0 : Z^{(r)}_t \notin (-a, b)\}$ of an interval, for the S.P. $Z^{(r)}_t = W_t - rt$ of continuous sample functions, where $W_0 = 0$, $r \in \mathbb{R}$ and $a, b > 0$ are finite non-random.
(a) Check that $\tau^{(r)}_{a,b}$ is an a.s. finite $\mathcal{F}^W_t$-stopping time and show that for any $r \ne 0$,
$P(Z^{(r)}_{\tau^{(r)}_{a,b}} = -a) = 1 - P(Z^{(r)}_{\tau^{(r)}_{a,b}} = b) = \dfrac{e^{2rb} - 1}{e^{2rb} - e^{-2ra}}$,
while $P(W_{\tau^{(0)}_{a,b}} = -a) = b/(a + b)$.
Hint: For $r \ne 0$ consider $u_0(t, W_t, 2r)$ of Exercise 8.2.7 stopped at $\tau^{(r)}_{a,b}$.
(b) Show that for all $s \ge 0$
$E(e^{-s \tau^{(0)}_{a,b}}) = \dfrac{\sinh(a\sqrt{2s}) + \sinh(b\sqrt{2s})}{\sinh((a + b)\sqrt{2s})}$.
Hint: Stop the MGs $u_0(t, W_t, \pm\sqrt{2s})$ of Exercise 8.2.7 at $\tau_{a,b} = \tau^{(0)}_{a,b}$.
(c) Deduce that $E\tau_{a,b} = ab$ and $\mathrm{Var}(\tau_{a,b}) = \frac{ab}{3}(a^2 + b^2)$.
Hint: Recall part (b) of Exercise 3.2.40.
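The closed forms in parts (a) and (c) lend themselves to a quick Monte Carlo check. In the sketch below (an added illustration; the helper name `exit_stats`, the step size and the sample sizes are arbitrary choices, not from the text), the drifted path is advanced on a grid until it leaves $(-a, b)$.

```python
import numpy as np

rng = np.random.default_rng(1)

def exit_stats(r, a, b, n_paths=2000, dt=1e-3, max_steps=40000):
    """Simulate Z_t = W_t - r t until it leaves (-a, b); return the
    fraction of paths exiting at -a and the mean exit time."""
    pos = np.zeros(n_paths)
    t_exit = np.full(n_paths, np.nan)
    side_low = np.zeros(n_paths, dtype=bool)
    active = np.ones(n_paths, dtype=bool)
    for step in range(1, max_steps + 1):
        n = int(active.sum())
        if n == 0:
            break
        pos[active] += rng.normal(0.0, np.sqrt(dt), n) - r * dt
        out_low, out_high = active & (pos <= -a), active & (pos >= b)
        for out, low in ((out_low, True), (out_high, False)):
            t_exit[out] = step * dt
            side_low[out] = low
        active &= ~(out_low | out_high)
    return side_low.mean(), np.nanmean(t_exit)

a = b = 1.0
r = 0.5
p_low, _ = exit_stats(r, a, b)
exact = (np.exp(2 * r * b) - 1) / (np.exp(2 * r * b) - np.exp(-2 * r * a))
print(f"part (a): P(exit at -a) ~ {p_low:.3f}, exact {exact:.3f}")

_, mean_tau = exit_stats(0.0, a, b)
print(f"part (c): E[tau] ~ {mean_tau:.3f}, exact {a * b:.3f}")
```

Since the exit time is a.s. finite, no truncation argument is needed here; the grid only introduces a small positive bias in the exit time.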
Here is a related result about the first hitting time of spheres by a standard d-dimensional Brownian motion.

Definition 8.2.36. The standard d-dimensional Brownian motion is the $\mathbb{R}^d$-valued S.P. $\{W(t), t \ge 0\}$ such that $W(t) = (W_1(t), \ldots, W_d(t))$ with $\{W_i(t), t \ge 0\}$, $i = 1, 2, \ldots, d$, mutually independent, standard (one-dimensional) Wiener processes. It is clearly a MG and a centered $\mathbb{R}^d$-valued Gaussian S.P. of continuous sample functions and stationary, independent increments.

Exercise 8.2.37. Let $\mathcal{F}^W_t = \sigma(W(s), s \le t)$ denote the canonical filtration of a standard k-dimensional Brownian motion, $R_t = \|W(t)\|_2$ its Euclidean distance from the origin and $\tau_b = \inf\{t \ge 0 : R_t \ge b\}$ the corresponding first hitting time of a sphere of radius $b > 0$ centered at the origin.
(a) Show that $M_t = R^2_t - kt$ is an $\mathcal{F}^W_t$-martingale of continuous sample functions and that $\tau_b$ is an a.s. finite $\mathcal{F}^W_t$-stopping time.
(b) Deduce that $E[\tau_b] = b^2/k$.
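Since part (a) makes $\tau_b$ a.s. finite, the identity of part (b) can be checked by simulation. The sketch below (an added illustration with arbitrary grid and sample-size choices) runs a discretized 3-dimensional Brownian motion until its Euclidean norm first reaches $b = 1$, where $E[\tau_b] = b^2/3$.

```python
import numpy as np

rng = np.random.default_rng(2)
k, b, dt = 3, 1.0, 1e-3
n_paths, max_steps = 2000, 20000

pos = np.zeros((n_paths, k))
tau = np.full(n_paths, np.nan)
active = np.ones(n_paths, dtype=bool)
for step in range(1, max_steps + 1):
    n = int(active.sum())
    if n == 0:
        break
    # advance only the paths that have not yet reached radius b
    pos[active] += rng.normal(0.0, np.sqrt(dt), (n, k))
    hit = active & (np.linalg.norm(pos, axis=1) >= b)
    tau[hit] = step * dt
    active &= ~hit
print(f"mean tau ~ {np.nanmean(tau):.3f}, exact b^2/k = {b**2 / k:.3f}")
```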
Remark. The S.P. $\{R_t, t \ge 0\}$ of the preceding exercise is called the Bessel process with dimension k. Though we shall not do so, it can be shown that the S.P. $B_t = R_t - \nu \int_0^t R_s^{-1}\, ds$ is well-defined and in fact is a standard Wiener process (c.f. [KaS97, Proposition 3.3.21]), with $\nu = (k - 1)/2$ the corresponding index of the Bessel process. The Bessel process is thus defined for all $\nu \ge 1/2$ (and, starting at $R_0 = r > 0$, also for $0 < \nu < 1/2$). One can then further show that if $R_0 = r > 0$ then $P_r(\inf_{t \ge 0} R_t > 0) = I_{\{\nu > 1/2\}}$ (hence the k-dimensional Brownian motion is O-transient for $k \ge 3$, see Definition 6.3.21), and $P_r(R_t > 0$, for all $t \ge 0) = 1$ even for the critical case of $\nu = 1/2$ (so by translation, for any given point $z \in \mathbb{R}^2$, the two-dimensional Brownian path, starting at any position other than $z$, w.p.1. enters every disc of positive radius centered at $z$ but never reaches the point $z$).
8.2.4. Doob-Meyer decomposition and square-integrable martingales. In this section we study the structure of square-integrable martingales and in particular the roles of the corresponding predictable compensator and quadratic variation. In doing so, we fix throughout the probability space $(\Omega, \mathcal{F}, P)$ and a right-continuous filtration $\{\mathcal{F}_t\}$ on it, augmented so that every P-null set is in $\mathcal{F}_0$ (see Remark 8.1.3).

Definition 8.2.38. We denote by $\mathcal{M}_2$ the vector space of all square-integrable martingales $\{X_t, t \ge 0\}$ for the fixed right-continuous filtration, which start at $X_0 = 0$ and have right-continuous sample functions. We further denote by $\mathcal{M}^c_2$ the linear subspace of $\mathcal{M}_2$ consisting of those square-integrable martingales whose sample functions are continuous (and as before $X_0 = 0$).
As in the discrete time setting of Section 5.3.2, the key to the study of a square-integrable martingale $X \in \mathcal{M}_2$ is the Doob-Meyer decomposition of $X_t^2$ as the sum of a martingale and the predictable quadratic variation $\langle X \rangle_t$. More generally, the Doob-Meyer decomposition is the continuous time analog of Doob's decomposition of any discrete time integrable process as the sum of a martingale and a predictable sequence. The extension of the concept of predictable S.P. to the continuous time setting is quite subtle and outside our scope, but recall Exercise 5.2.2 that when decomposing a sub-MG, the non-martingale component should be an increasing process, as defined next.

Definition 8.2.39. An $\mathcal{F}_t$-adapted, integrable S.P. $\{A_t, t \ge 0\}$ of right-continuous, non-decreasing sample functions starting at $A_0 = 0$, is called an increasing process (or more precisely, an $\mathcal{F}_t$-increasing process).

Remark. An increasing process is obviously a non-negative, right-continuous sub-MG. By monotonicity, $A_\infty = \lim_{t \to \infty} A_t$ is a well defined random variable and, due to Proposition 8.2.23, integrability of $A_\infty$ is equivalent to $\{A_t\}$ being U.I., which in turn is equivalent to this sub-MG having a last element (i.e. being right closable).
Recall the notion of q-th variation of a function $f : [a, b] \mapsto \mathbb{R}$, with $q > 0$ a parameter, which we next extend to the q-th variation of continuous time S.P.-s.

Definition 8.2.40. For any finite partition $\pi = \{a = s^{(\pi)}_0 < s^{(\pi)}_1 < \cdots < s^{(\pi)}_k = b\}$ of $[a, b]$, let $\|\pi\| = \max_{i=1}^{k} \{s^{(\pi)}_i - s^{(\pi)}_{i-1}\}$ denote the length of the longest interval in $\pi$ and
$V^{(q)}_{(\pi)}(f) = \sum_{i=1}^{k} |f(s^{(\pi)}_i) - f(s^{(\pi)}_{i-1})|^q$
denote the q-th variation of the function $f(\cdot)$ on the partition $\pi$. The q-th variation of $f(\cdot)$ on $[a, b]$ is then the $[0, \infty]$-valued
(8.2.9)    $V^{(q)}(f) = \lim_{\|\pi\| \to 0} V^{(q)}_{(\pi)}(f)$,
provided such limit exists (namely, the same $\overline{\mathbb{R}}$-valued limit exists along each sequence $\pi_n$, $n \ge 1$, such that $\|\pi_n\| \to 0$). Similarly, the q-th variation on $[a, b]$ of a S.P. $\{X_t, t \ge 0\}$, denoted $V^{(q)}(X)$, is the limit in probability of $V^{(q)}_{(\pi)}(X_\cdot(\omega))$ per (8.2.9), if such a limit exists, and when this occurs for any compact interval $[0, t]$ we have the q-th variation, denoted $V^{(q)}(X)_t$, as a stochastic process with non-negative, non-decreasing sample functions, such that $V^{(q)}(X)_0 = 0$.

Remark. As you are soon to find out, of most relevance here is the case of q-th variation for $q = 2$, which is also called the quadratic variation. Note also that $V^{(1)}_{(\pi)}(f)$ is bounded above by the total variation of the function $f$, namely $V(f) = \sup\{V^{(1)}_{(\pi)}(f) : \pi$ a finite partition of $[a, b]\}$ (which induces at each interval a norm on the linear subspace of functions of finite total variation; see also the related Definition 3.2.22 of the total variation norm for finite signed measures). Further, as you show next, if $V^{(1)}(f)$ exists then it equals $V(f)$ (but beware that $V^{(1)}(f)$ may not exist, for example, in case $f(t) = 1_{\mathbb{Q}}(t)$).
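As a concrete illustration of Definition 8.2.40 (added here, with arbitrary grid sizes): for the Wiener process the quadratic variation is $V^{(2)}(W)_t = t$, so on a single simulated path the partition sums for $q = 2$ stabilize near $t$ as $\|\pi\| \to 0$, while the $q = 1$ sums grow without bound, consistent with Brownian paths having infinite total variation.

```python
import numpy as np

rng = np.random.default_rng(3)
t, n_fine = 1.0, 2**17
# One Wiener path sampled on the fine dyadic grid {j * t / n_fine}.
w = np.concatenate([[0.0], np.cumsum(rng.normal(0.0, np.sqrt(t / n_fine), n_fine))])

# V^(q)_(pi) over nested dyadic partitions pi of [0, t] with ||pi|| = t/n.
for n in (2**8, 2**11, 2**14, 2**17):
    incr = np.diff(w[:: n_fine // n])
    v1, v2 = np.abs(incr).sum(), (incr**2).sum()
    print(f"||pi|| = t/{n}: V^(1) = {v1:9.2f}, V^(2) = {v2:.4f}")
```

The $V^{(2)}$ column hovers around $t = 1$ while the $V^{(1)}$ column grows roughly like $\sqrt{n}$.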
Exercise 8.2.41.
(a) Show that if $f : [a, b] \mapsto \mathbb{R}$ is monotone then $V^{(1)}_{(\pi)}(f) = \max_{t \in [a, b]} f(t) - \min_{t \in [a, b]} f(t)$ for any finite partition $\pi$, so in this case $V^{(1)}(f) = V(f)$ is finite.
(b) Show that $V^{(1)}_{(\pi)}(\cdot)$ is non-decreasing with respect to a refinement of the finite partition $\pi$ of $[a, b]$ and hence, for each $f$ there exist finite partitions $\pi_n$ such that $\|\pi_n\| \to 0$ and $V^{(1)}_{(\pi_n)}(f) \to V(f)$.
(c) For $f : [0, \infty) \mapsto \mathbb{R}$ and $t \ge 0$ let $V(f)_t$ denote the value of $V(f)$ for the interval $[0, t]$. Show that if $f(\cdot)$ is left-continuous, then so is the non-decreasing function $t \mapsto V(f)_t$.
(d) Show that if $t \mapsto X_t$ is left-continuous, then $V(X)_t$ is $\mathcal{F}^X_t$-progressively measurable, and $\gamma_n = \inf\{t \ge 0 : V(X)_t \ge n\}$ are non-decreasing $\mathcal{F}^X_t$-Markov times such that $V(X)_{t \wedge \gamma_n} \le n$ for all $n$ and $t$.
Hint: Show that it is enough to consider for $V(X)_t$ the countable collection of finite partitions for which $s^{(\pi)}_i \in \mathbb{Q}^{(2)} \cap [0, t)$ augmented by $t$, then note that $\{\gamma_n < t\} = \bigcup_{k \ge 1} \{V(X)_{t - k^{-1}} \ge n\}$.
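Part (a) amounts to a telescoping sum, which the following added sketch confirms on a monotone $f$ over a few random partitions (the choice $f(t) = t^2$ on $[0, 2]$ is arbitrary).

```python
import numpy as np

f = np.square  # monotone (non-decreasing) on [0, 2]
a, b = 0.0, 2.0

rng = np.random.default_rng(4)
for _ in range(3):
    # random finite partition a = s_0 < s_1 < ... < s_k = b
    s = np.sort(np.concatenate([[a, b], rng.uniform(a, b, 20)]))
    v1 = np.abs(np.diff(f(s))).sum()  # V^(1)_(pi)(f); telescopes for monotone f
    assert np.isclose(v1, f(b) - f(a))
print("V^(1)_(pi)(f) = f(b) - f(a) = 4.0 for every partition pi")
```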
From the preceding exercise we see that any increasing process $A_t$ has finite total variation, with $V(A)_t = V^{(1)}(A)_t = A_t$ for all $t$. This is certainly not the case for non-constant continuous martingales, as shown in the next lemma (which is also key to the uniqueness of the Doob-Meyer decomposition for sub-MGs of continuous sample path).
Lemma 8.2.42. A martingale $M_t$ of continuous sample functions and finite total variation on each compact interval, is indistinguishable from a constant.

Remark. Sample path continuity is necessary here, for in its absence we have the compensated Poisson process $M_t = N_t - t$, which is a martingale (see Example 8.2.5) of finite total variation on compact intervals (since $V(M)_t \le V(N)_t + V(t)_t = N_t + t$ by part (a) of Exercise 8.2.41).
Proof. Considering the martingale $\widetilde{M}_t = M_t - M_0$, such that $V(\widetilde{M})_t = V(M)_t$ for all $t$, we may and shall assume hereafter that $M_0 = 0$. Suppose first that $V(M)_t \le K$ is bounded, uniformly in $t$ and $\omega$, by a non-random finite constant. In particular, $|M_t| \le K$ for all $t \ge 0$, and fixing a finite partition $\pi = \{0 = s_0 < s_1 < \cdots < s_k = t\}$, the discrete time martingale $\{M_{s_i}\}$ is square integrable and, as shown in part (b) of Exercise 5.1.8,
$E[M_t^2] = E\big[\sum_{i=1}^{k} (M_{s_i} - M_{s_{i-1}})^2\big] = E[V^{(2)}_{(\pi)}(M)]$.
By the definition of the q-th variation and our assumption that $V(M)_t \le K$, it follows that
$V^{(2)}_{(\pi)}(M) \le K \sup_{i=1}^{k} |M_{s_i} - M_{s_{i-1}}| = K D_\pi$.
Taking expectation on both sides, we deduce in view of the preceding identity that $E[M_t^2] \le K E D_\pi$, where $0 \le D_\pi \le V(M)_t \le K$ for all finite partitions $\pi$ of $[0, t]$. Further, by the uniform continuity of $t \mapsto M_t(\omega)$ on $[0, T]$ we have that $D_\pi(\omega) \to 0$ when $\|\pi\| \to 0$, hence $E[D_\pi] \to 0$ as $\|\pi\| \to 0$ and consequently $E[M_t^2] = 0$.
We have thus shown that if the continuous martingale $M_t$ is such that $\sup_t V(M)_t$ is bounded by a non-random constant, then $M_t(\omega) = 0$ for any $t \ge 0$ and a.e. $\omega$. To deal with the general case, recall Remark 8.2.28 that $(M_t, \mathcal{F}^M_{t^+}, t \ge 0)$ is a continuous martingale, hence by Corollary 8.2.29 and part (d) of Exercise 8.2.41 so is $(M_{t \wedge \gamma_n}, \mathcal{F}^M_{t^+}, t \ge 0)$, where $\gamma_n = \inf\{t \ge 0 : V(M)_t \ge n\}$ are non-decreasing and $V(M)_{t \wedge \gamma_n} \le n$ for all $n$ and $t$. Consequently, for any $t \ge 0$, w.p.1. $M_{t \wedge \gamma_n} = 0$ for $n = 1, 2, \ldots$. The assumed finiteness of $V(M)_t(\omega)$ implies that $\gamma_n \uparrow \infty$, hence $M_{t \wedge \gamma_n} \to M_t$ as $n \to \infty$, resulting in $M_t(\omega) = 0$ for a.e. $\omega$. Finally, by the continuity of $t \mapsto M_t(\omega)$, the martingale $M$ must then be indistinguishable from the zero stochastic process (see Exercise 7.2.3).
Considering a bounded, continuous martingale $X_t$, the next lemma allows us to conclude in the sequel that $V^{(2)}_{(\pi)}(X)$ converges in $L^2$ as $\|\pi\| \to 0$, and that its limit can be set to be an increasing process.
Lemma 8.2.43. Suppose $X \in \mathcal{M}^c_2$. For any partition $\pi = \{0 = s_0 < s_1 < \cdots\}$ of $[0, \infty)$ with a finite number of points on each compact interval, the S.P. $M^{(\pi)}_t = X_t^2 - V^{(\pi)}_t(X)$ is an $\mathcal{F}_t$-martingale of continuous sample path, where
(8.2.10)    $V^{(\pi)}_t(X) = \sum_{i=1}^{k} (X_{s_i} - X_{s_{i-1}})^2 + (X_t - X_{s_k})^2$,    $t \in [s_k, s_{k+1})$.
If in addition $\sup_t |X_t| \le K$ for some finite, non-random constant $K$, then $V^{(2)}_{(\pi_n)}(X)$ is a Cauchy sequence in $L^2(\Omega, \mathcal{F}, P)$ for any fixed $b$ and finite partitions $\pi_n$ of $[0, b]$ such that $\|\pi_n\| \to 0$.
Proof. (a). With $\mathcal{F}^X_t \subseteq \mathcal{F}_t$, the $\mathcal{F}_t$-adapted process $M_t = M^{(\pi)}_t$ of continuous sample paths is integrable (by the assumed square integrability of $X_t$). Noting that for any $k \ge 0$ and all $s_k \le s < t \le s_{k+1}$,
$M_t - M_s = X_t^2 - X_s^2 - (X_t - X_{s_k})^2 + (X_s - X_{s_k})^2 = 2 X_{s_k}(X_t - X_s)$,
clearly then $E[M_t - M_s | \mathcal{F}_s] = 2 X_{s_k} E[X_t - X_s | \mathcal{F}_s] = 0$, by the martingale property of $(X_t, \mathcal{F}_t)$, which suffices for verifying that $(M_t, \mathcal{F}_t, t \ge 0)$ is a martingale.
(b). Utilizing these martingales, we now turn to prove the second claim of the lemma. To this end, fix two finite partitions $\pi$ and $\pi'$ of $[0, b]$ and let $\widehat{\pi}$ denote the partition based on the collection of points $\pi \cup \pi'$. With $U'_t = V^{(\pi')}_t(X)$ and $U_t = V^{(\pi)}_t(X)$, applying part (a) of the proof for the martingale $Z_t = M^{(\pi)}_t - M^{(\pi')}_t = U'_t - U_t$ (which is square-integrable by the assumed boundedness of $X_t$), we deduce that $Z_t^2 - V^{(\widehat{\pi})}_t(Z)$, $t \in [0, b]$, is a martingale whose value at $t = 0$ is zero. Noting that $Z_b = V^{(2)}_{(\pi')}(X) - V^{(2)}_{(\pi)}(X)$, it then follows that
$E[(V^{(2)}_{(\pi')}(X) - V^{(2)}_{(\pi)}(X))^2] = E[Z_b^2] = E[V^{(\widehat{\pi})}_b(Z)]$.
Next, recall (8.2.10) that $V^{(\widehat{\pi})}_b(Z)$ is a finite sum of terms of the form $(U'_u - U'_s - U_u + U_s)^2 \le 2(U'_u - U'_s)^2 + 2(U_u - U_s)^2$. Consequently, $V^{(\widehat{\pi})}(Z) \le 2 V^{(\widehat{\pi})}(U') + 2 V^{(\widehat{\pi})}(U)$, and to conclude that $V^{(2)}_{(\pi_n)}(X)$ is a Cauchy sequence in $L^2(\Omega, \mathcal{F}, P)$ for any finite partitions $\pi_n$ of $[0, b]$ such that $\|\pi_n\| \to 0$, it suffices to show that $E[V^{(\widehat{\pi})}_b(U)] \to 0$ as $\|\pi'\| \vee \|\pi\| \to 0$.
To establish the latter claim, note first that since $\widehat{\pi}$ is a refinement of $\pi$, each interval $[t_j, t_{j+1}]$ of $\widehat{\pi}$ is contained within some interval $[s_i, s_{i+1}]$ of $\pi$, and then
$U_{t_{j+1}} - U_{t_j} = (X_{t_{j+1}} - X_{s_i})^2 - (X_{t_j} - X_{s_i})^2 = (X_{t_{j+1}} - X_{t_j})(X_{t_{j+1}} + X_{t_j} - 2X_{s_i})$
(see (8.2.10)). Since $t_{j+1} - s_i \le \|\pi\|$, this implies in turn that
$(U_{t_{j+1}} - U_{t_j})^2 \le 4 (X_{t_{j+1}} - X_{t_j})^2 [\mathrm{osc}_{\|\pi\|}(X)]^2$,
where $\mathrm{osc}_\delta(X) = \sup\{|X_t - X_s| : |t - s| \le \delta,\ t, s \in [0, b]\}$. Consequently, $V^{(\widehat{\pi})}_b(U) \le 4 V^{(\widehat{\pi})}_b(X) [\mathrm{osc}_{\|\pi\|}(X)]^2$ and by the Cauchy-Schwarz inequality
$(E V^{(\widehat{\pi})}_b(U))^2 \le 16\, E[(V^{(\widehat{\pi})}_b(X))^2]\, E[(\mathrm{osc}_{\|\pi\|}(X))^4]$.
The random variables $\mathrm{osc}_\delta(X)$ are uniformly (in $\delta$ and $\omega$) bounded (by $2K$) and converge to zero as $\delta \downarrow 0$ (in view of the uniform continuity of $t \mapsto X_t$ on $[0, b]$). Thus, by bounded convergence, the right-most expectation in the preceding inequality goes to zero as $\|\pi\| \to 0$. To complete the proof simply note that $V^{(\widehat{\pi})}_b(X)$ is of the form $\sum_{j \ge 1} D_j^2$ for the differences $D_j = X_{t_j} - X_{t_{j-1}}$ of the uniformly bounded discrete time martingale $\{X_{t_j}\}$, hence $E[(V^{(\widehat{\pi})}_b(X))^2] \le 6 K^4$ by part (c) of Exercise 5.1.8.
Building on the preceding lemma, the following decomposition is an important special case of the more general Doob-Meyer decomposition and a key ingredient in the theory of stochastic integration.

Theorem 8.2.44. For $X \in \mathcal{M}^c_2$, the continuous modification of $V^{(2)}(X)_t$ is the unique $\mathcal{F}_t$-increasing process $A_t = \langle X \rangle_t$ of continuous sample functions, such that $M_t = X_t^2 - A_t$ is an $\mathcal{F}_t$-martingale (also of continuous sample functions), and any two such decompositions of $X_t^2$ as the sum of a martingale and an increasing process are indistinguishable.
Proof. Step 1. Uniqueness. If $X_t^2 = M_t + A_t = N_t + B_t$ with $A_t, B_t$ increasing processes of continuous sample paths and $M_t, N_t$ martingales, then $Y_t = N_t - M_t = A_t - B_t$ is a martingale of continuous sample paths, starting at $Y_0 = A_0 - B_0 = 0$, such that $V(Y)_t \le V(A)_t + V(B)_t = A_t + B_t$ is finite for any finite $t$. From Lemma 8.2.42 we then deduce that w.p.1. $Y_t = 0$ for all $t \ge 0$ (i.e. $A_t$ is indistinguishable from $B_t$), proving the stated uniqueness of the decomposition.
Step 2. Existence of $V^{(2)}(X)_t$ when $X$ is uniformly bounded. Turning to construct such a decomposition, assume first that $X \in \mathcal{M}^c_2$ is uniformly (in $t$ and $\omega$) bounded by a non-random finite constant. Let $V_\ell(t) = V^{(\pi_\ell)}_t(X)$ of (8.2.10) for the partitions $\pi_\ell$ of $[0, \infty)$ whose elements are the dyadic $\mathbb{Q}^{(2,\ell)} = \{k 2^{-\ell}, k \in \mathbb{Z}_+\}$. By definition, $V_\ell(t) = V^{(2)}_{(\widehat{\pi}_\ell)}(X)$ for the partitions $\widehat{\pi}_\ell$ of $[0, t]$ whose elements are the finite collections of dyadics from $\pi_\ell \cap [0, t]$ augmented by $t$. Since $\|\widehat{\pi}_\ell\| \le \|\pi_\ell\| = 2^{-\ell}$, we deduce from Lemma 8.2.43 that per $t \ge 0$ fixed, $\{V_\ell(t), \ell \ge 1\}$ is a Cauchy sequence in $L^2(\Omega, \mathcal{F}, P)$. Recall Proposition 4.3.7 that any Cauchy sequence in $L^2(\Omega, \mathcal{F}, P)$ has a limit. So, in particular, $V_\ell(t)$ converges in $L^2$, for $\ell \to \infty$, to some $U(t, \omega)$. For any (other) sequence $\widetilde{\pi}_{2n}$ of finite partitions of $[0, t]$ such that $\|\widetilde{\pi}_{2n}\| \to 0$, upon interlacing $\widetilde{\pi}_{2n+1} = \widehat{\pi}_n$ we further have by Lemma 8.2.43 that $V^{(2)}_{(\widetilde{\pi}_n)}(X)$ is a Cauchy, hence convergent in $L^2$, sequence. Its limit coincides with the sub-sequential limit $U(t, \omega)$ along $n_\ell = 2\ell + 1$, which also matches the $L^2$ limit of $V^{(2)}_{(\widetilde{\pi}_{2n})}(X)$. As this applies for any finite partitions $\widetilde{\pi}_{2n}$ of $[0, t]$ such that $\|\widetilde{\pi}_{2n}\| \to 0$, we conclude that $(t, \omega) \mapsto U(t, \omega)$ is the quadratic variation of $X_t$.
Step 3. Constructing $A_t$. Turning to produce a continuous modification $A_t(\omega)$ of $U(t, \omega)$, recall Lemma 8.2.43 that for each $\ell$ the process $M_{\ell, t} = X_t^2 - V_\ell(t)$ is an $\mathcal{F}_t$-martingale of continuous sample path. The same applies for $V_n(t) - V_m(t) = M_{m,t} - M_{n,t}$, so fixing an integer $j \ge 1$ we deduce by Doob's $L^2$ maximal inequality (see (8.2.5) of Corollary 8.2.16), that
$E[\|V_n - V_m\|_j^2] \le 4 E[(V_n(j) - V_m(j))^2]$,
where $\|f\|_j = \sup\{|f(t)| : t \in [0, j]\}$ makes $\mathbb{Y} = C([0, j])$ into a Banach space (see part (b) of Exercise 4.3.8). In view of the $L^2$ convergence of $V_n(j)$ we have that $E[(V_n(j) - V_m(j))^2] \to 0$ as $n, m \to \infty$, hence $V_n : \Omega \mapsto \mathbb{Y}$ is a Cauchy sequence in $L^2(\Omega, \mathcal{F}, P; \mathbb{Y})$, which by part (a) of Exercise 4.3.8 converges in this space to some $U_j(\cdot, \omega) \in C([0, j])$. By the preceding we further deduce that $U_j(t, \omega)$ is a continuous modification on $[0, j]$ of the pointwise $L^2$ limit function $U(t, \omega)$. In view of Exercise 7.2.3, for any $j' > j$ the S.P.-s $U_{j'}$ and $U_j$ are indistinguishable on $[0, j]$, so there exists one square-integrable, continuous modification $A : \Omega \mapsto C([0, \infty))$ of $U(t, \omega)$ whose restriction to each $[0, j]$ coincides with $U_j$ (up to one P-null set).
Step 4. The decomposition: $A_t$ increasing and $M_t = X_t^2 - A_t$ a martingale. First, as $V_\ell(0) = 0$ for all $\ell$, also $A_0 = U(0, \omega) = 0$. We saw in Step 3 that $\|V_\ell - A\|_j \to 0$ in $L^2$, hence also in $L^1$, and consequently $E[\psi(\|V_\ell - A\|_j)] \to 0$ when $\ell \to \infty$, for $\psi(r) = r/(1 + r) \le r$ and any fixed positive integer $j$. Hence,
$E[\rho(V_\ell, A)] = \sum_{j=1}^{\infty} 2^{-j} E[\psi(\|V_\ell - A\|_j)] \to 0$,
as $\ell \to \infty$, where $\rho(\cdot, \cdot)$ is a metric on $C([0, \infty))$ for the topology of uniform convergence on compact intervals (see Exercise 7.2.9). To verify that $A_t$ is an $\mathcal{F}_t$-increasing process, recall Theorem 2.2.10 that $\rho(V_{n_k}, A) \stackrel{a.s.}{\to} 0$ along some non-random subsequence $n_k$. That is, with $\mathcal{F}_0$ augmented as usual by all P-null sets, $V_{n_k}(t, \omega) \to A_t(\omega)$ as $k \to \infty$, for all $t \ge 0$ and $\omega \notin N$, where $N \in \mathcal{F}_0$ is such that $P(N) = 0$. Setting $A_t \equiv 0$ when $\omega \in N$, the $\mathcal{F}_t$-adaptedness of $V_{n_k}(t)$ transfers to $A_t$. Also, by construction, $t \mapsto V^{(\pi_\ell)}_t$ is non-decreasing when restricted to the times in $\pi_\ell$. Moreover, if $q < q'$ are in $\mathbb{Q}^{(2)}$ then for all $k$ large enough $q, q' \in \pi_{n_k}$, implying that $V_{n_k}(q) \le V_{n_k}(q')$. Taking $k \to \infty$ it follows that $A_q(\omega) \le A_{q'}(\omega)$ for all $\omega$, thus by sample path continuity, $A_t$ is an $\mathcal{F}_t$-increasing process.
Finally, since the $\mathcal{F}_t$-martingales $M_{\ell, t}$ converge in $L^1$ for $\ell \to \infty$ (and $t \ge 0$ fixed) to the $\mathcal{F}_t$-adapted process $M_t = X_t^2 - A_t$, it is easy to check that $(M_t, \mathcal{F}_t, t \ge 0)$ is a martingale.
Step 5. Localization. Having established the stated decomposition in case $X \in \mathcal{M}^c_2$ is uniformly bounded by a non-random constant, we remove the latter condition by localizing via the stopping times $\theta_r = \inf\{t \ge 0 : |X_t| \ge r\}$ for positive integers $r$. Indeed, note that since $X_t(\omega)$ is bounded on any compact time interval, $\theta_r \uparrow \infty$ when $r \to \infty$ (for each $\omega$). Further, with $X^{(r)}_t = X_{t \wedge \theta_r}$ a uniformly bounded (by $r$), continuous martingale (see Corollary 8.2.29), by the preceding proof we have $\mathcal{F}_t$-increasing processes $A^{(r)}_t$, each of which is the continuous modification of the quadratic variation $V^{(2)}(X^{(r)})_t$, such that $M^{(r)}_t = X^2_{t \wedge \theta_r} - A^{(r)}_t$ are continuous $\mathcal{F}_t$-martingales. Since $E[\rho(V^{(\ell)}(X^{(r)}), A^{(r)})] \to 0$ for $\ell \to \infty$ and each positive integer $r$, in view of Theorem 2.2.10 we get by diagonal selection the existence of a non-random sub-sequence $n_k \to \infty$ and a $P$-null set $N'$ such that $\rho(V^{(n_k)}(X^{(r)}), A^{(r)}) \to 0$ for $k \to \infty$, all $r$ and $\omega \notin N'$. From (8.2.10) we note that $V^{(\ell)}_t(X^{(r)}) = V^{(\ell)}_{t \wedge \theta_r}(X)$ for any $t$, $\ell$, $r$ and $\omega$. Consequently, if $\omega \notin N'$ then $A^{(r)}_t = A^{(r)}_{t \wedge \theta_r}$ for all $t \ge 0$ and $A^{(r)}_t = A^{(r')}_t$ as long as $r \le r'$ and $t \le \theta_r$. Since $t \mapsto A^{(r)}_t(\omega)$ are non-decreasing, necessarily $A^{(r)}_t \le A^{(r')}_t$ for all $t \ge 0$. We thus deduce that $A^{(r)}_t \uparrow A_t$ for any $\omega \notin N'$ and all $t \ge 0$. Further, with $A^{(r)}_t$ independent of $r$ as soon as $\theta_r(\omega) \ge t$, the non-decreasing sample function $t \mapsto A_t(\omega)$ inherits the continuity of $t \mapsto A^{(r)}_t(\omega)$. Taking $A_t(\omega) \equiv 0$ for $\omega \in N'$ we proceed to show that $A_t$ is integrable, hence an $\mathcal{F}_t$-increasing process of continuous sample functions. To this end, fixing $u \ge 0$ and setting $Z_r = X^2_{u \wedge \theta_r}$, by monotone convergence $E Z_r = E M^{(r)}_u + E A^{(r)}_u \uparrow E A_u$ when $r \to \infty$ (as $M^{(r)}_t$ are martingales, starting at $M^{(r)}_0 = 0$). Since $u \wedge \theta_r \uparrow u$ and the sample functions $t \mapsto X_t$ are continuous, clearly $Z_r \to X^2_u$. Moreover, $\sup_r |Z_r| \le (\sup_{0 \le s \le u} |X_s|)^2$ is integrable (by Doob's $L^2$ maximal inequality (8.2.5)), so by dominated convergence $E Z_r \to E X^2_u$. Consequently, $E A_u = E X^2_u$ is finite, as claimed.
Next, fixing $t \ge 0$, $\varepsilon > 0$, $r \in \mathbb{Z}_+$ and a finite partition $\pi$ of $[0,t]$, since $V^{(2)}_{(\pi)}(X) = V^{(2)}_{(\pi)}(X^{(r)})$ whenever $\theta_r \ge t$, clearly,
$$\{|V^{(2)}_{(\pi)}(X) - A_t| \ge 2\varepsilon\} \subseteq \{\theta_r < t\} \cup \{|V^{(2)}_{(\pi)}(X^{(r)}) - A^{(r)}_t| \ge \varepsilon\} \cup \{|A^{(r)}_t - A_t| \ge \varepsilon\} .$$
We have shown already that $V^{(2)}_{(\pi)}(X^{(r)}) \stackrel{p}{\to} A^{(r)}_t$ as $\|\pi\| \to 0$. Hence,
$$\limsup_{\|\pi\| \to 0} P(|V^{(2)}_{(\pi)}(X) - A_t| \ge 2\varepsilon) \le P(\theta_r < t) + P(|A^{(r)}_t - A_t| \ge \varepsilon)$$
and considering $r \to \infty$ we deduce that $V^{(2)}_{(\pi)}(X) \stackrel{p}{\to} A_t$. That is, the process $A_t$ is a modification of the quadratic variation of $X_t$.
We complete the proof by verifying that the integrable, $\mathcal{F}_t$-adapted process $M_t = X^2_t - A_t$ of continuous sample functions satisfies the martingale condition. Indeed, since $M^{(r)}_t$ are $\mathcal{F}_t$-martingales, we have for each $s \le u$ and all $r$ that w.p.1
$$E[X^2_{u \wedge \theta_r} | \mathcal{F}_s] = E[A^{(r)}_u | \mathcal{F}_s] + M^{(r)}_s .$$
Considering $r \to \infty$ we have already seen that $X^2_{u \wedge \theta_r} \to X^2_u$ and a.s. $A^{(r)}_u \uparrow A_u$, hence also $M^{(r)}_s \stackrel{a.s.}{\to} M_s$. With $\sup_r X^2_{u \wedge \theta_r}$ integrable, we get by dominated convergence of C.E. that $E[X^2_{u \wedge \theta_r} | \mathcal{F}_s] \to E[X^2_u | \mathcal{F}_s]$ (see Theorem 4.2.26). Similarly, $E[A^{(r)}_u | \mathcal{F}_s] \uparrow E[A_u | \mathcal{F}_s]$ by monotone convergence of C.E., hence w.p.1 $E[X^2_u | \mathcal{F}_s] = E[A_u | \mathcal{F}_s] + M_s$ for each $s \le u$, namely, $(M_t, \mathcal{F}_t)$ is a martingale.
The following exercise shows that $X \in \mathcal{M}^c_2$ has zero $q$-th variation for all $q > 2$. Moreover, unless $X_t \in \mathcal{M}^c_2$ is zero throughout an interval of positive length, its $q$-th variation for $0 < q < 2$ is infinite with positive probability and its sample paths are then not locally $\gamma$-Hölder continuous for any $\gamma > 1/2$.

Exercise 8.2.45.
(a) Suppose a S.P. $\{X_t, t \ge 0\}$ of continuous sample functions has an a.s. finite $r$-th variation $V^{(r)}(X)_t$ for each fixed $t > 0$. Show that then for each $t > 0$ and $q > r$ a.s. $V^{(q)}(X)_t = 0$, whereas if $0 < q < r$, then $V^{(q)}(X)_t = \infty$ for a.e. $\omega$ for which $V^{(r)}(X)_t > 0$.
(b) Show that if $X \in \mathcal{M}^c_2$ and $\widetilde{A}_t$ is a S.P. of continuous sample path and finite total variation on compact intervals, then the quadratic variation of $X_t + \widetilde{A}_t$ is $\langle X \rangle_t$.
(c) Suppose $X \in \mathcal{M}^c_2$ and an $\mathcal{F}_t$-stopping time $\tau$ are such that $\langle X \rangle_\tau = 0$. Show that $P(X_{t \wedge \tau} = 0$ for all $t \ge 0) = 1$.
(d) Show that if a S.P. $\{X_t, t \ge 0\}$ is locally $\gamma$-Hölder continuous on $[0,T]$ for some $\gamma > 1/2$, then its quadratic variation on this interval is zero.
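Though outside the text's formal development, the dichotomy of Exercise 8.2.45 is easy to observe numerically for the Brownian motion, whose quadratic ($r = 2$) variation is finite and positive, so its $q$-th variation vanishes for $q > 2$ and blows up for $q < 2$ as the mesh shrinks. A minimal Python sketch with illustrative parameters (numpy assumed):

```python
import numpy as np

def q_variation(increments, q):
    """Sum of |X_{t_i} - X_{t_{i-1}}|^q over a partition."""
    return np.sum(np.abs(increments) ** q)

rng = np.random.default_rng(0)
t = 1.0
for n in [2**8, 2**12, 2**16]:
    # Brownian increments over a uniform partition of [0, t] with n cells.
    dW = rng.normal(0.0, np.sqrt(t / n), size=n)
    v1, v2, v4 = (q_variation(dW, q) for q in (1, 2, 4))
    print(n, v1, v2, v4)
```

On refinement one sees $V^{(1)} \approx \sqrt{2nt/\pi} \to \infty$, $V^{(2)} \approx t$ and $V^{(4)} = O(1/n) \to 0$, in line with parts (a) and the theorem.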
Here is the necessary and sufficient condition under which a sub-MG has a Doob-Meyer decomposition, namely, is the sum of a martingale and an increasing part.

Definition 8.2.46. An $\mathcal{F}_t$-progressively measurable (and in particular an $\mathcal{F}_t$-adapted, right-continuous) S.P. $\{Y_t, t \ge 0\}$ is of class DL if the collection $\{Y_{\tau \wedge u} : \tau$ an $\mathcal{F}_t$-stopping time$\}$ is U.I. for each finite, non-random $u$.

Theorem 8.2.47 (Doob-Meyer decomposition). A right-continuous sub-MG $\{Y_t, t \ge 0\}$ for $\mathcal{F}_t$ admits the decomposition $Y_t = M_t + A_t$ with $M_t$ a right-continuous $\mathcal{F}_t$-martingale and $A_t$ an $\mathcal{F}_t$-increasing process, if and only if $\{Y_t, t \ge 0\}$ is of class DL.
Remark 8.2.48. To extend the uniqueness of the Doob-Meyer decomposition beyond sub-MGs with continuous sample functions, one has to require $A_t$ to be a natural process. While we do not define this concept here, we note in passing that every continuous increasing process is a natural process and a natural process is also an increasing process (c.f. [KaS97, Definition 1.4.5]), whereas the uniqueness is attained since if a finite linear combination of natural processes is a martingale, then it is indistinguishable from zero (c.f. proof of [KaS97, Theorem 1.4.10]).
Proof outline. We focus on constructing the Doob-Meyer decomposition for $\{Y_t, t \in I\}$ in case $I = [0,1]$. To this end, start with the right-continuous modification of the non-positive $\mathcal{F}_t$-sub-martingale $Z_t = Y_t - E[Y_1 | \mathcal{F}_t]$, which exists since $t \mapsto E Z_t$ is right-continuous (see Theorem 8.2.25). Suppose you can find $A_1 \in L^1(\Omega, \mathcal{F}_1, P)$ such that
(8.2.11) $\qquad A_t = Z_t + E[A_1 | \mathcal{F}_t]$,
is $\mathcal{F}_t$-increasing on $I$. Then, $M_t = Y_t - A_t$ must be right-continuous, integrable and $\mathcal{F}_t$-adapted. Moreover, for any $t \in I$,
$$M_t = Y_t - A_t = Y_t - Z_t - E[A_1 | \mathcal{F}_t] = E[M_1 | \mathcal{F}_t] .$$
So, by the tower property $(M_t, \mathcal{F}_t, t \in I)$ satisfies the martingale condition and we are done.
Proceeding to construct such $A_1$, fix $\ell \ge 1$ and for the (ordered) finite set $\mathbb{Q}^{(2,\ell)}_1$ of dyadic rationals recall Doob's decomposition (in Theorem 5.2.1), of the discrete time sub-MG $\{Z_{s_j}, \mathcal{F}_{s_j}, s_j \in \mathbb{Q}^{(2,\ell)}_1\}$ as the sum of a discrete time U.I. martingale $\{M^{(\ell)}_{s_j}, s_j \in \mathbb{Q}^{(2,\ell)}_1\}$ and the predictable, non-decreasing (in view of Exercise 5.2.2), finite sequence $\{A^{(\ell)}_{s_j}, s_j \in \mathbb{Q}^{(2,\ell)}_1\}$, starting with $A^{(\ell)}_0 = 0$. Noting that $Z_1 = 0$, or equivalently $M^{(\ell)}_1 = -A^{(\ell)}_1$, it follows that for any $q \in \mathbb{Q}^{(2,\ell)}_1$
(8.2.12) $\qquad A^{(\ell)}_q = Z_q - M^{(\ell)}_q = Z_q - E[M^{(\ell)}_1 | \mathcal{F}_q] = Z_q + E[A^{(\ell)}_1 | \mathcal{F}_q]$.
Relying on the fact that the sub-MG $\{Y_t, t \in I\}$ is of class DL, this representation allows one to deduce that the collection $\{A^{(\ell)}_1, \ell \ge 1\}$ is U.I. (for details see [KaS97, proof of Theorem 1.4.10]). This in turn implies by the Dunford-Pettis compactness criterion that there exists an integrable $A_1$ and a non-random sub-sequence $n_k \to \infty$ such that $A^{(n_k)}_1 \stackrel{wL^1}{\to} A_1$, as in Definition 4.2.31. Now consider the $\mathcal{F}_t$-adapted, integrable S.P. defined via (8.2.11), where by Theorem 8.2.25 (and the assumed right continuity of the filtration $\mathcal{F}_t$), we may and shall assume that the U.I. MG $E[A_1 | \mathcal{F}_t]$ has right-continuous sample functions (and hence, so does $t \mapsto A_t$). Since $\mathbb{Q}^{(2,\ell)}_1 \uparrow \mathbb{Q}^{(2)}_1$, upon comparing (8.2.11) and (8.2.12) we find that for any $q \in \mathbb{Q}^{(2)}_1$ and all large enough $\ell$
$$A^{(\ell)}_q - A_q = E[A^{(\ell)}_1 - A_1 | \mathcal{F}_q] .$$
Consequently, $A^{(n_k)}_q \stackrel{wL^1}{\to} A_q$ for all $q \in \mathbb{Q}^{(2)}_1$ (see Exercise 4.2.32). In particular, $A_0 = 0$ and setting $q < q' \in \mathbb{Q}^{(2)}_1$, $V = I_{\{A_q > A_{q'}\}}$, we deduce by the monotonicity of $j \mapsto A^{(\ell)}_{s_j}$ for each $\ell$ and $\omega$, that
$$E[(A_{q'} - A_q)V] = \lim_{k \to \infty} E[(A^{(n_k)}_{q'} - A^{(n_k)}_q)V] \ge 0 .$$
So, by our choice of $V$ necessarily $P(A_q > A_{q'}) = 0$ and consequently, w.p.1. the sample functions $t \mapsto A_t(\omega)$ are non-decreasing over $\mathbb{Q}^{(2)}_1$. By right-continuity the same applies over $I$ and we are done, for $\{A_t, t \in I\}$ of (8.2.11) is thus indistinguishable from an $\mathcal{F}_t$-increasing process.
The same argument applies for $I = [0,r]$ and any $r \in \mathbb{Z}_+$. While we do not do so here, the $\mathcal{F}_t$-increasing process $\{A_t, t \in I\}$ can be further shown to be a natural process. By the uniqueness of such decompositions, as alluded to in Remark 8.2.48, it then follows that the restriction of the process $A_t$ constructed on $[0,r']$ to a smaller interval $[0,r]$ is indistinguishable from the increasing process one constructed directly on $[0,r]$. Thus, concatenating the processes $\{A_t, t \le r\}$ and $\{M_t, t \le r\}$ yields the stated Doob-Meyer decomposition on $[0,\infty)$.
As for the much easier converse, fixing non-random $u \in \mathbb{R}$, by monotonicity of $t \mapsto A_t$ the collection $\{A_{\tau \wedge u} : \tau$ an $\mathcal{F}_t$-stopping time$\}$ is dominated by the integrable $A_u$, hence U.I. Applying Doob's optional stopping theorem for the right-continuous MG $(M_t, \mathcal{F}_t)$, you further have that $M_{\tau \wedge u} = E[M_u | \mathcal{F}_{\tau \wedge u}]$ for any $\mathcal{F}_t$-stopping time $\tau$ (see part (a) of Exercise 8.2.30), so by Proposition 4.2.33 the collection $\{M_{\tau \wedge u} : \tau$ an $\mathcal{F}_t$-stopping time$\}$ is also U.I. In conclusion, the existence of such a Doob-Meyer decomposition $Y_t = M_t + A_t$ implies that the right-continuous sub-MG $\{Y_t, t \ge 0\}$ is of class DL (recall part (b) of Exercise 1.3.55).
Your next exercise provides a concrete instance in which the Doob-Meyer decomposition applies, connecting it with the decomposition in Theorem 8.2.44 of the non-negative sub-MG $Y_t = X^2_t$ of continuous sample path, as the sum of the quadratic variation $\langle X \rangle_t$ and the continuous martingale $X^2_t - \langle X \rangle_t$.

Exercise 8.2.49. Suppose $\{Y_t, t \ge 0\}$ is a non-negative, right-continuous sub-MG for $\mathcal{F}_t$.
(a) Show that $Y_t$ is in class DL.
(b) Show that if $Y_t$ further has continuous sample functions, then the processes $M_t$ and $A_t$ in its Doob-Meyer decomposition also have continuous sample functions (and are thus unique).
Remark. From the preceding exercise and Remark 8.2.48, we associate to each $X \in \mathcal{M}_2$ a unique natural process, denoted $\langle X \rangle_t$ and called the predictable quadratic variation of $X$, such that $X^2_t - \langle X \rangle_t$ is a right-continuous martingale. However, when $X \notin \mathcal{M}^c_2$, it is no longer the case that the predictable quadratic variation matches the quadratic variation of Definition 8.2.40 (as a matter of fact, the latter may not exist).
Example 8.2.50. A standard Brownian Markov process consists of a standard Wiener process $\{W_t, t \ge 0\}$ and filtration $\{\mathcal{F}_t, t \ge 0\}$ such that $\mathcal{F}^W_s \subseteq \mathcal{F}_s$ for any $s \ge 0$ while $(W_t - W_s, t \ge s)$ is independent of $\mathcal{F}_s$ (see also Definition 8.3.7 for its Markov property). For a right-continuous augmented filtration $\mathcal{F}_t$, such a process $W_t$ is in $\mathcal{M}^c_2$ and further, $M_t = W^2_t - t$ is a martingale of continuous sample path. We thus deduce from Theorem 8.2.44 that its (predictable) quadratic variation is the non-random $\langle W \rangle_t = t$, which by Exercise 8.2.45 implies that the total variation of the Brownian sample path is a.s. infinite on any interval of positive length. More generally, recall part (b) of Exercise 8.2.6 that $\langle X \rangle_t$ is non-random for any Gaussian martingale, hence so is the quadratic variation of any Gaussian martingale of continuous sample functions.
As you show next, the type of convergence to the quadratic variation may be strengthened (e.g. to convergence in $L^2$ or a.s.) for certain S.P. by imposing some restrictions on the partitions considered.

Exercise 8.2.51. Let $V^{(2)}_{(\pi_n)}(W)$ denote the quadratic variations of the Wiener process on a sequence of finite partitions $\pi_n$ of $[0,t]$ such that $\|\pi_n\| \to 0$ as $n \to \infty$.
(a) Show that $V^{(2)}_{(\pi_n)}(W) \stackrel{L^2}{\to} t$.
(b) Show that $V^{(2)}_{(\pi_n)}(W) \stackrel{a.s.}{\to} t$ whenever $\sum_{n=1}^{\infty} \|\pi_n\| < \infty$.

Remark. However, beware that for a.e. $\omega$ there exist random finite partitions $\pi_n$ of $[0,1]$ such that $\|\pi_n\| \to 0$ and $V^{(2)}_{(\pi_n)}(W) \to \infty$ (see [Fre71, Page 48]).
Example 8.2.52. While we shall not prove it, Lévy's martingale characterization of the Brownian motion states the converse of Example 8.2.50: any $X \in \mathcal{M}^c_2$ of quadratic variation $\langle X \rangle_t = t$ must be a standard Brownian Markov process (c.f. [KaS97, Theorem 3.3.16]). However, recall Example 8.2.5 that for a Poisson process $N_t$ of rate $\lambda$, the compensated process $M_t = N_t - \lambda t$ is in $\mathcal{M}_2$ and you can easily check that $M^2_t - \lambda t$ is then a right-continuous martingale. Since the continuous increasing process $\lambda t$ is natural, we deduce from the uniqueness of the Doob-Meyer decomposition that $\langle M \rangle_t = \lambda t$. More generally, by the same argument we deduce from part (c) of Exercise 8.2.6 that $\langle X \rangle_t = t E(X^2_1)$ for any square-integrable S.P. with $X_0 = 0$ and zero-mean, stationary independent increments. In particular, this shows that sample path continuity is necessary for Lévy's characterization of the Brownian motion, and that the standard Wiener process is the only zero-mean, square-integrable stochastic process $X_t$ of continuous sample path and stationary independent increments, such that $X_0 = 0$.
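The identities $E M_t = 0$ and $E[M^2_t] = \lambda t$ for the compensated Poisson process of Example 8.2.52 are easy to check by simulation. A quick numerical sanity check, outside the text's formal development (illustrative parameters, numpy assumed):

```python
import numpy as np

rng = np.random.default_rng(1)
lam, t, paths = 3.0, 2.0, 200_000

# N_t ~ Poisson(lam * t); compensated process M_t = N_t - lam * t.
N_t = rng.poisson(lam * t, size=paths)
M_t = N_t - lam * t

print(M_t.mean())        # ~ 0: E[M_t] = 0 (martingale property at time t)
print((M_t**2).mean())   # ~ lam * t: E[M_t^2] = <M>_t = lam * t
```

With these parameters the sample averages agree with $0$ and $\lambda t = 6$ up to Monte Carlo error of order $10^{-2}$.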
Building upon Lévy's characterization, you can now prove the following special case of the extremely useful Girsanov's theorem.

Exercise 8.2.53. Suppose $(W_t, \mathcal{F}_t, t \ge 0)$ is a standard Brownian Markov process on a probability space $(\Omega, \mathcal{F}, P)$ and, fixing non-random parameters $\theta \in \mathbb{R}$ and $T > 0$, consider the exponential $\mathcal{F}_t$-martingale $Z_t = \exp(-\theta W_t - \theta^2 t/2)$ and the corresponding probability measure $Q_T(A) = E(I_A Z_T)$ on $(\Omega, \mathcal{F}_T)$.
(a) Show that $V^{(2)}(Z)_t = \theta^2 \int_0^t Z^2_u \, du$.
(b) Show that $\widetilde{W}_u = W_u + \theta u$ is for $u \in [0,T]$ an $\mathcal{F}_u$-martingale on the probability space $(\Omega, \mathcal{F}_T, Q_T)$.
(c) Deduce that $(\widetilde{W}_t, \mathcal{F}_t, t \le T)$ is a standard Brownian Markov process on $(\Omega, \mathcal{F}_T, Q_T)$.
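Two one-dimensional consequences of the exercise can be checked deterministically: $E_P[Z_T] = 1$ (so $Q_T$ is a probability measure) and $E_{Q_T}[\widetilde{W}_T] = E_P[Z_T(W_T + \theta T)] = 0$. Since $W_T \sim N(0,T)$ under $P$, both expectations reduce to Gaussian integrals, which the sketch below (illustrative parameters, not part of the text) evaluates by Gauss-Hermite quadrature:

```python
import numpy as np

theta, T = 0.5, 1.0

# Probabilists' Gauss-Hermite nodes/weights: E[f(xi)] for xi ~ N(0, 1).
x, w = np.polynomial.hermite_e.hermegauss(80)
w = w / np.sqrt(2 * np.pi)            # normalize the weights to sum to 1

W_T = np.sqrt(T) * x                  # W_T ~ N(0, T) under P
Z_T = np.exp(-theta * W_T - theta**2 * T / 2)

EZ = np.sum(w * Z_T)                              # E_P[Z_T], should be 1
EW_tilde_Q = np.sum(w * Z_T * (W_T + theta * T))  # E_Q[W_T + theta*T], should be 0
print(EZ, EW_tilde_Q)
```

Both values match $1$ and $0$ to near machine precision, since the integrands are entire functions handled exactly well by the quadrature.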
Here is the extension to the continuous time setting of Lemma 5.2.7 and Proposition 5.3.30.

Exercise 8.2.54. Let $V_t = \sup_{s \in [0,t]} Y_s$ and $A_t$ be the increasing process of continuous sample functions in the Doob-Meyer decomposition of a non-negative, continuous $\mathcal{F}_t$-submartingale $\{Y_t, t \ge 0\}$ with $Y_0 = 0$.
(a) Show that $P(V_\tau \ge x, A_\tau < y) \le x^{-1} E(A_\tau \wedge y)$ for all $x, y > 0$ and any $\mathcal{F}_t$-stopping time $\tau$.
(b) Setting $c_1 = 4$ and $c_q = (2-q)/(1-q)$ for $q \in (0,1)$, conclude that $E[\sup_s |X_s|^{2q}] \le c_q E[\langle X \rangle_\infty^q]$ for any $X \in \mathcal{M}^c_2$ and $q \in (0,1]$, hence $\{|X_t|^{2q}, t \ge 0\}$ is U.I. when $\langle X \rangle_\infty^q$ is integrable.
Taking $X, Y \in \mathcal{M}_2$ we deduce from the Doob-Meyer decomposition of $X \pm Y \in \mathcal{M}_2$ that $(X \pm Y)^2_t - \langle X \pm Y \rangle_t$ are right-continuous $\mathcal{F}_t$-martingales. Considering their difference we deduce that $XY - \langle X, Y \rangle$ is a martingale for
$$\langle X, Y \rangle_t = \frac{1}{4}\Big[ \langle X + Y \rangle_t - \langle X - Y \rangle_t \Big]$$
(this is an instance of the more general polarization technique). In particular, $XY$ is a right-continuous martingale whenever $\langle X, Y \rangle = 0$, prompting our next definition.
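The polarization identity above mirrors an exact algebraic identity for sums of products of increments, namely $ab = \frac{1}{4}[(a+b)^2 - (a-b)^2]$ applied term by term. A tiny deterministic check, outside the text's development, on arbitrary increment sequences:

```python
import numpy as np

rng = np.random.default_rng(2)
dX = rng.normal(size=1000)            # increments of two discrete processes
dY = rng.normal(size=1000) + 0.3 * dX

cross = np.sum(dX * dY)               # discrete cross variation of X and Y
polar = 0.25 * (np.sum((dX + dY)**2) - np.sum((dX - dY)**2))

print(cross, polar)                   # identical up to floating-point rounding
```

Summing $\Delta X_i \Delta Y_i$ and applying the identity to each term shows the two quantities agree exactly, which is the discrete shadow of $\langle X, Y \rangle = \frac{1}{4}[\langle X+Y \rangle - \langle X-Y \rangle]$.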
Definition 8.2.55. For any pair $X, Y \in \mathcal{M}_2$, we call the S.P. $\langle X, Y \rangle_t$ the bracket of $X$ and $Y$ and say that $X, Y \in \mathcal{M}_2$ are orthogonal if for any $t \ge 0$ the bracket $\langle X, Y \rangle_t$ is a.s. zero.
Remark. It is easy to check that $\langle X, X \rangle = \langle X \rangle$ for any $X \in \mathcal{M}_2$. Further, for any $s \in [0,t]$, w.p.1.
$$E[(X_t - X_s)(Y_t - Y_s) | \mathcal{F}_s] = E[X_t Y_t - X_s Y_s | \mathcal{F}_s] = E[\langle X, Y \rangle_t - \langle X, Y \rangle_s | \mathcal{F}_s] ,$$
so the orthogonality of $X, Y \in \mathcal{M}_2$ amounts to $X$ and $Y$ having uncorrelated increments over $[s,t]$, conditionally on $\mathcal{F}_s$. Here is more on the structure of the bracket as a bi-linear form on $\mathcal{M}_2$, which on $\mathcal{M}^c_2$ coincides with the cross variation of $X$ and $Y$.
Exercise 8.2.56. Show that for all $X, X_i, Y \in \mathcal{M}_2$:
(a) $\langle c_1 X_1 + c_2 X_2, Y \rangle = c_1 \langle X_1, Y \rangle + c_2 \langle X_2, Y \rangle$ for any $c_i \in \mathbb{R}$, $i = 1, 2$.
Hint: Recall Remark 8.2.48 that a martingale of the form $\sum_{j=1}^{\kappa} c_j \langle U_j \rangle$ for $U_j \in \mathcal{M}_2$ and $\kappa$ finite, is zero.
(b) $\langle X, Y \rangle = \langle Y, X \rangle$.
(c) $|\langle X, Y \rangle|^2 \le \langle X \rangle \langle Y \rangle$.
(d) With $Z_t = V(\langle X, Y \rangle)_t$, for a.e. $\omega$ and all $0 \le s < t < \infty$,
$$Z_t - Z_s \le \frac{1}{2}\big[ \langle X \rangle_t - \langle X \rangle_s + \langle Y \rangle_t - \langle Y \rangle_s \big] .$$
(e) Show that for $X, Y \in \mathcal{M}^c_2$ the bracket $\langle X, Y \rangle_t$ is also the limit in probability as $\|\pi\| \to 0$ of
$$\sum_{i=1}^{k} \big[ X_{t^{(\pi)}_i} - X_{t^{(\pi)}_{i-1}} \big]\big[ Y_{t^{(\pi)}_i} - Y_{t^{(\pi)}_{i-1}} \big] ,$$
where $\pi = \{0 = t^{(\pi)}_0 < t^{(\pi)}_1 < \cdots < t^{(\pi)}_k = t\}$ is a finite partition of $[0,t]$.
We conclude with a brief introduction to stochastic integration (for more on this topic, see [KaS97, Section 3.2]). Following our general approach to integration, the Itô stochastic integral $I_t = \int_0^t X_s \, dW_s$ is constructed first for simple processes $X_t$, i.e. those having sample path that are piecewise constant on non-random intervals, as you are to do next.

Exercise 8.2.57. Suppose $(W_t, \mathcal{F}_t)$ is a standard Brownian Markov process and $X_t$ is a bounded, $\mathcal{F}_t$-adapted, left-continuous simple process. That is,
$$X_t(\omega) = \xi_0(\omega) 1_{\{0\}}(t) + \sum_{i=0}^{\infty} \xi_i(\omega) 1_{(t_i, t_{i+1}]}(t) ,$$
where $0 = t_0 < t_1 < \cdots < t_k < \cdots$ is a non-random unbounded sequence and the $\mathcal{F}_{t_n}$-adapted sequence $\xi_n(\omega)$ is bounded uniformly in $n$ and $\omega$.
(a) With $A_t = \int_0^t X^2_u \, du$, show that both
$$I_t = \sum_{j=0}^{k-1} \xi_j (W_{t_{j+1}} - W_{t_j}) + \xi_k (W_t - W_{t_k}), \quad \text{when } t \in [t_k, t_{k+1}) ,$$
and $I^2_t - A_t$ are martingales with respect to $\mathcal{F}_t$.
(b) Deduce that $I_t \in \mathcal{M}^c_2$ with $A_t = \langle I \rangle_t$ being its quadratic variation, and in particular $E I^2_t = \int_0^t E[X^2_u] \, du$.
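A Monte Carlo sketch of Exercise 8.2.57, outside the text's development, for the simplest case of a deterministic step integrand: the stochastic integral $I_t$ is then a sum of independent Gaussian terms, $E I_t = 0$, and the isometry $E I^2_t = \int_0^t X^2_u \, du$ can be checked against the exact value. The names `t_knots` and `xi` are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
t_knots = np.array([0.0, 0.5, 1.0, 2.0])   # non-random times t_0 < t_1 < ...
xi = np.array([1.0, -2.0, 0.5])            # value of X on each (t_i, t_{i+1}]
paths = 400_000

# I_t at t = 2.0: sum of xi_j * (W_{t_{j+1}} - W_{t_j}) over the partition.
dt = np.diff(t_knots)
dW = rng.normal(0.0, np.sqrt(dt), size=(paths, len(dt)))
I_t = dW @ xi

exact = np.sum(xi**2 * dt)    # int_0^t X_u^2 du, which equals E[I_t^2]
print(I_t.mean(), (I_t**2).mean(), exact)
```

For these parameters $\int_0^t X^2_u du = 2.75$ and the simulated second moment matches it to within Monte Carlo error of order $10^{-2}$.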
8.3. Markov and Strong Markov processes
In Subsection 8.3.1 we define Markov semi-groups and the corresponding Markov processes. We also extend the construction of Markov chains from Subsection 6.1 to deal with these S.P. This is followed in Subsection 8.3.2 with the study of the strong Markov property and the related Feller property, showing in particular that both the Brownian motion and the Poisson process are strong Markov processes. We then devote Subsection 8.3.3 to the study of Markov jump processes, which are the natural extension of both Markov chains and (compound) Poisson processes.

8.3.1. Markov semi-groups, processes and the Markov property. We start with the definition of a Markov process, focusing on (time) homogeneous processes having (stationary, regular) transition probabilities (compare with Definitions 6.1.1 and 6.1.2).
Definition 8.3.1 (Markov processes). A collection $\{p_{s,t}(\cdot, \cdot), t \ge s \ge 0\}$ of transition probabilities on a measurable space $(S, \mathcal{S})$ (as in Definition 6.1.2), is consistent if it satisfies the Chapman-Kolmogorov equations
(8.3.1) $\qquad p_{t_1,t_3}(x, B) = p_{t_1,t_2}\, p_{t_2,t_3}(x, B), \quad \forall x \in S, \; B \in \mathcal{S},$
for any $t_3 \ge t_2 \ge t_1 \ge 0$ (c.f. Corollary 6.3.3 for the composition of transition probabilities). In particular, $p_{t,t}(x, B) = I_B(x) = \delta_x(B)$ for any $t \ge 0$. Such a collection is called a Markov semi-group (of stationary transition probabilities), if in addition $p_{s,t} = p_{t-s}$ for all $t \ge s \ge 0$. The Chapman-Kolmogorov equations are then
(8.3.2) $\qquad p_{s+u}(x, B) = p_s\, p_u(x, B), \quad \forall x \in S, \; B \in \mathcal{S}, \; u, s \ge 0,$
with $p_0(x, B) = I_B(x) = \delta_x(B)$ being the semi-group identity element.
An $\mathcal{F}_t$-adapted S.P. $\{X_t, t \ge 0\}$ taking values in $(S, \mathcal{S})$ is an $\mathcal{F}_t$-Markov process of (consistent) transition probabilities $\{p_{s,t}, t \ge s \ge 0\}$ and state space $(S, \mathcal{S})$ if for any $t \ge s \ge 0$ and $B \in \mathcal{S}$, almost surely
(8.3.3) $\qquad P(X_t \in B | \mathcal{F}_s) = p_{s,t}(X_s, B) .$
It is further a (time) homogeneous $\mathcal{F}_t$-Markov process of semi-group $\{p_u, u \ge 0\}$ if for any $u, s \ge 0$ and $B \in \mathcal{S}$, almost surely
(8.3.4) $\qquad P(X_{s+u} \in B | \mathcal{F}_s) = p_u(X_s, B) .$
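When $S$ is a finite set, the Chapman-Kolmogorov equation (8.3.2) is just matrix multiplication of the transition matrices. A minimal illustration, outside the text, for a homogeneous discrete-time chain where $p_t = P^t$ with $P$ the one-step matrix (illustrative numbers):

```python
import numpy as np

# One-step transition matrix of a homogeneous chain on a 3-point state space.
P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])

def p(t):
    """Semi-group element p_t = P^t for integer t >= 0; p_0 is the identity."""
    return np.linalg.matrix_power(P, t)

s, u = 2, 3
# Chapman-Kolmogorov (8.3.2): p_{s+u}(x, B) = sum_y p_s(x, y) p_u(y, B).
assert np.allclose(p(s + u), p(s) @ p(u))
print(p(s + u)[0])   # row x = 0: the law of the chain at time 5 started at state 0
```

The assertion is exact up to rounding since $P^{s+u} = P^s P^u$ holds identically, which is the countable-state form of (8.3.2).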
Remark. Recall that for Markov chains, which are discrete time S.P.-s, one considers only $t, s \in \mathbb{Z}_+$. In this case (8.3.1) is automatically satisfied by setting $p_{s,t} = p_{s,s+1}\, p_{s+1,s+2} \cdots p_{t-1,t}$ to be the composition of the (one-step) transition probabilities of the Markov chain (see Definition 6.1.2), with $p_{s,s+1} = p$ independent of $s$ when the chain is homogeneous. Similarly, if an $\mathcal{F}_t$-adapted S.P. $(X_t, t \ge 0)$ satisfies (8.3.4) for $p_t(x, B) = P(X_t \in B | X_0 = x)$ and $x \mapsto p_t(x, B)$ is measurable per fixed $t \ge 0$ and $B \in \mathcal{S}$, then considering the tower property for $I_B(X_{s+u}) I_{\{x\}}(X_0)$ and $\sigma(X_0) \subseteq \mathcal{F}_s$, one easily verifies that (8.3.2) holds, hence $(X_t, t \ge 0)$ is a homogeneous $\mathcal{F}_t$-Markov process. More generally, in analogy with our definition of Markov chains via (6.1.1), one may opt to say that an $\mathcal{F}_t$-adapted S.P. $(X_t, t \ge 0)$ is an $\mathcal{F}_t$-Markov process provided for each $B \in \mathcal{S}$ and $t \ge s \ge 0$,
$$P(X_t \in B | \mathcal{F}_s) \stackrel{a.s.}{=} P(X_t \in B | X_s) .$$
Indeed, as noted in Remark 6.1.6 (in view of Exercise 4.4.5), for B-isomorphic $(S, \mathcal{S})$ this suffices for the existence of transition probabilities which satisfy (8.3.3). However, this simpler to verify, plausible definition of Markov processes results with Chapman-Kolmogorov equations holding only up to a null set per fixed $t_3 \ge t_2 \ge t_1 \ge 0$. The study of such processes is consequently made more cumbersome, which is precisely why we, like most texts, do not take this route.
By Lemma 6.1.3 we deduce from Definition 8.3.1 that for any $f \in b\mathcal{S}$ and all $t \ge s \ge 0$,
(8.3.5) $\qquad E[f(X_t) | \mathcal{F}_s] = (p_{s,t} f)(X_s) ,$
where $f \mapsto (p_{s,t} f) : b\mathcal{S} \to b\mathcal{S}$ and $(p_{s,t} f)(x) = \int p_{s,t}(x, dy) f(y)$ denotes the Lebesgue integral of $f(\cdot)$ under the probability measure $p_{s,t}(x, \cdot)$ per fixed $x \in S$.
The Chapman-Kolmogorov equations are necessary and sufficient for generating consistent Markovian f.d.d. out of a given collection of transition probabilities and a specified initial probability distribution. As outlined next, we thus canonically construct the Markov process, following the same approach as in proving Theorem 6.1.8 (for Markov chains), and Proposition 7.1.8 (for continuous time S.P.).
Theorem 8.3.2. Suppose $(S, \mathcal{S})$ is B-isomorphic. Given any $(S, \mathcal{S})$-valued consistent transition probabilities $\{p_{s,t}, t \ge s \ge 0\}$, the probability distribution $\mu$ on $(S, \mathcal{S})$ uniquely determines the linearly ordered consistent f.d.d.
(8.3.6) $\qquad \mu_{0,s_1,\ldots,s_n} = \mu \otimes p_{0,s_1} \otimes \cdots \otimes p_{s_{n-1},s_n}$
for $0 = s_0 < s_1 < \cdots < s_n$, and there exists a Markov process of state space $(S, \mathcal{S})$ having these f.d.d. Conversely, the f.d.d. of any Markov process having initial probability distribution $\mu(B) = P(X_0 \in B)$ and satisfying (8.3.3), are given by (8.3.6).
Proof. Recall Proposition 6.1.5 that $\mu \otimes p_{0,s_1} \otimes \cdots \otimes p_{s_{n-1},s_n}$ denotes the Markov-product-like measures, whose evaluation on product sets is by iterated integrations over the transition probabilities $p_{s_{k-1},s_k}$, in reverse order $k = n, \ldots, 1$, followed by a final integration over the initial measure $\mu$. As shown in this proposition, given any transition probabilities $\{p_{s,t}, t \ge s \ge 0\}$, the probability distribution $\mu$ on $(S, \mathcal{S})$ uniquely determines $\mu \otimes p_{0,s_1} \otimes \cdots \otimes p_{s_{n-1},s_n}$, namely, the f.d.d. specified in (8.3.6). We then uniquely specify the remaining f.d.d. as the probability measures $\mu_{s_1,\ldots,s_n}(D) = \mu_{s_0,s_1,\ldots,s_n}(S \times D)$. Proceeding to check the consistency of these f.d.d. note that $p_{s,u} \otimes p_{u,t}(\cdot, S \times \cdot) = p_{s,t}(\cdot, \cdot)$ for any $s < u < t$ (by the Chapman-Kolmogorov identity (8.3.1)). Thus, considering $s = s_{k-1}$, $u = s_k$ and $t = s_{k+1}$ we deduce that if $D = A_0 \times \cdots \times A_n$ with $A_k = S$ for some $k = 1, \ldots, n-1$, then
$$\mu \otimes p_{s_0,s_1} \otimes \cdots \otimes p_{s_{n-1},s_n}(D) = \mu \otimes \cdots \otimes p_{s_{k-1},s_{k+1}} \otimes \cdots \otimes p_{s_{n-1},s_n}(D_k)$$
for $D_k = A_0 \times \cdots \times A_{k-1} \times A_{k+1} \times \cdots \times A_n$, which are precisely the consistency conditions of (7.1.3) for the f.d.d. $\mu_{s_0,\ldots,s_n}$. These consistency requirements are further handled in case of a product set $D$ with $A_n = S$ by observing that for all $x \in S$ and any transition probability, $p_{s_{n-1},s_n}(x, S) = 1$, whereas our definition of $\mu_{s_1,\ldots,s_n}$ already dealt with $A_0 = S$. Having shown that this collection of f.d.d. is consistent, recall that Proposition 7.1.8 applies even with $(\mathbb{R}, \mathcal{B})$ replaced by the B-isomorphic measurable space $(S, \mathcal{S})$. Setting $\mathbb{T} = [0, \infty)$, it provides the construction of a S.P. $Y_t(\omega) = \omega(t)$, $t \in \mathbb{T}$ via the coordinate maps on the canonical probability space $(S^{\mathbb{T}}, \mathcal{S}^{\mathbb{T}}, P_\mu)$ with the f.d.d. of (8.3.6). Turning next to verify that $(Y_t, \mathcal{F}^Y_t, t \in \mathbb{T})$ satisfies the Markov condition (8.3.3), fix $t \ge s \ge 0$, $B \in \mathcal{S}$ and recall that, for $t > s$ as in the proof of Theorem 6.1.8, and by definition in case $t = s$,
(8.3.7) $\qquad E[I_{\{Y \in A\}} I_B(Y_t)] = E[I_{\{Y \in A\}}\, p_{s,t}(Y_s, B)]$
for any finite dimensional measurable rectangle $A = \{x(\cdot) : x(t_i) \in B_i, \; i = 1, \ldots, n\}$ such that $t_i \in [0,s]$ and $B_i \in \mathcal{S}$. Thus, the collection
$$\mathcal{A} = \{A \in \mathcal{S}^{[0,s]} : \text{(8.3.7) holds for } A\} ,$$
contains the $\pi$-system of finite dimensional measurable rectangles which generates $\mathcal{S}^{[0,s]}$, and in particular, $S^{[0,s]} \in \mathcal{A}$. Further, by linearity of the expectation $\mathcal{A}$ is closed under proper difference and by monotone convergence if $A_n \in \mathcal{A}$ is such that $A_n \uparrow A$ then $A \in \mathcal{A}$ as well. Consequently, $\mathcal{A}$ is a $\lambda$-system and by Dynkin's theorem, (8.3.7) holds for every set in $\mathcal{S}^{[0,s]} = \mathcal{F}^Y_s$ (see Lemma 7.1.7). It then follows that $P(Y_t \in B | \mathcal{F}^Y_s) = p_{s,t}(Y_s, B)$ a.s. for each $t \ge s \ge 0$ and $B \in \mathcal{S}$. That is, $\{Y_t, t \ge 0\}$ is an $\mathcal{F}^Y_t$-Markov process.
Conversely, suppose $\{X_t, t \ge 0\}$ satisfies (8.3.3) and has initial probability distribution $\mu(\cdot)$. Then, for any $t_0 > \cdots > t_n \ge s \ge 0$, and $f_\ell \in b\mathcal{S}$, $\ell = 0, \ldots, n$, almost surely,
(8.3.8) $\qquad E\Big[\prod_{\ell=0}^{n} f_\ell(X_{t_\ell}) \Big| \mathcal{F}_s\Big] = \int p_{s,t_n}(X_s, dy_n) f_n(y_n) \cdots \int p_{t_1,t_0}(y_1, dy_0) f_0(y_0) .$
The latter identity is proved by induction on $n$, where denoting its right side by $g_{n+1,s}(X_s)$, we see that $g_{n+1,s} = p_{s,t_n}(f_n g_{n,t_n})$ and the case $n = 0$ is merely (8.3.5). In the induction step we have from the tower property and $\mathcal{F}_t$-adaptedness of $X_t$ that
$$E\Big[\prod_{\ell=0}^{n} f_\ell(X_{t_\ell}) \Big| \mathcal{F}_s\Big] = E\Big[f_n(X_{t_n})\, E\Big[\prod_{\ell=0}^{n-1} f_\ell(X_{t_\ell}) \Big| \mathcal{F}_{t_n}\Big] \Big| \mathcal{F}_s\Big] = E[f_n(X_{t_n})\, g_{n,t_n}(X_{t_n}) | \mathcal{F}_s] = g_{n+1,s}(X_s) ,$$
where the induction hypothesis is used in the second equality and (8.3.5) in the third. In particular, considering the expected value of (8.3.8) for $s = 0$ and indicator functions $f_\ell(\cdot)$ it follows that the f.d.d. of this process are given by (8.3.6), as claimed.
Remark 8.3.3. As in Lemma 7.1.7, for B-isomorphic state space $(S, \mathcal{S})$ any $F \in \mathcal{F}^X$ is of the form $F = (X_\cdot)^{-1}(A)$ for some $A \in \mathcal{S}^{\mathbb{T}}$, where $X_\cdot(\omega) : \Omega \to S^{\mathbb{T}}$ denotes the collection of sample functions of the given Markov process $\{X_t, t \ge 0\}$. Then, $P(F) = P_\mu(A)$, so while proving Theorem 8.3.2 we have defined the law $P_\mu(\cdot)$ of a Markov process $\{X_t, t \ge 0\}$ as the unique probability measure on $\mathcal{S}^{[0,\infty)}$ such that
$$P_\mu(\{\omega : \omega(s_\ell) \in B_\ell, \; \ell = 0, \ldots, n\}) = P(X_{s_0} \in B_0, \ldots, X_{s_n} \in B_n) ,$$
for $B_\ell \in \mathcal{S}$ and distinct $s_\ell \ge 0$ (compare with Definition 6.1.7 for the law of a Markov chain). We denote by $P_x$ the law $P_\mu$ in case $\mu(B) = I_{x \in B}$, namely, when $X_0 = x$ is non-random, and note that $P_\mu(A) = \int_S P_x(A)\, \mu(dx)$ for any probability measure $\mu$ on $(S, \mathcal{S})$ and all $A \in \mathcal{S}^{\mathbb{T}}$, with $P_x$ uniquely determined by the specified (consistent) transition probabilities $\{p_{s,t}, t \ge s \ge 0\}$.
The evaluation of the f.d.d. of a Markov process is more explicit when $S$ is a countable set, as then $p_{s,t}(x, B) = \sum_{y \in B} p_{s,t}(x, y)$ for any $B \subseteq S$ (and all Lebesgue integrals are merely sums). Likewise, in case $S = \mathbb{R}^d$ (equipped with $\mathcal{S} = \mathcal{B}_S$), computations are relatively explicit if for each $t > s \ge 0$ and $x \in S$ the probability measure $p_{s,t}(x, \cdot)$ is absolutely continuous with respect to Lebesgue measure on $S$, in which case $(p_{s,t} f)(x) = \int p_{s,t}(x, y) f(y)\, dy$ and the right side of (8.3.8) amounts to iterated integrations of the transition probability kernel $p_{s,t}(x, y)$ of the process with respect to Lebesgue measure on $S$.
The next exercise is about the closure of the collection of Markov processes under certain invertible non-random measurable mappings.

Exercise 8.3.4. Suppose $(X_t, \mathcal{F}^X_t, t \ge 0)$ is a Markov process of state space $(S, \mathcal{S})$, $u : [0,\infty) \to [0,\infty)$ is an invertible, strictly increasing function and for each $t \ge 0$ the measurable mapping $\phi_t : (S, \mathcal{S}) \to (\widetilde{S}, \widetilde{\mathcal{S}})$ is invertible, with $\phi_t^{-1}$ measurable.
(a) Setting $Y_t = \phi_t(X_{u(t)})$, verify that $\mathcal{F}^Y_t = \mathcal{F}^X_{u(t)}$ and that $(Y_t, \mathcal{F}^Y_t, t \ge 0)$ is a Markov process of state space $(\widetilde{S}, \widetilde{\mathcal{S}})$.
(b) Show that if $(X_t, \mathcal{F}^X_t, t \ge 0)$ is a homogeneous Markov process then so is $Z_t = \phi_0(X_t)$.
Of particular note is the following collection of Markov processes.

Proposition 8.3.5. If a real-valued S.P. $\{X_t, t \ge 0\}$ has independent increments, then $(X_t, \mathcal{F}^X_t, t \ge 0)$ is a Markov process of transition probabilities $p_{s,t}(y, B) = P_{X_t - X_s}(\{z : y + z \in B\})$, and if $\{X_t, t \ge 0\}$ further has stationary, independent increments, then this Markov process is homogeneous.
Proof. Considering Exercise 4.2.2 for $\mathcal{G} = \mathcal{F}^X_s$, $Y = X_s \in m\mathcal{G}$ and the R.V. $Z = Z_{t,s} = X_t - X_s$ which is independent of $\mathcal{G}$, you find that (8.3.3) holds for $p_{s,t}(y, B) = P(y + Z \in B)$, which in case of stationary increments depends only on $t - s$. Clearly, $B \mapsto P(y + Z \in B)$ is a probability measure on $(\mathbb{R}, \mathcal{B})$, for any $t \ge s$ and $y \in \mathbb{R}$. Further, if $B = (-\infty, b]$ then $p_{s,t}(y, B) = F_Z(b - y)$ is a Borel function of $y$ (see Exercise 1.2.27). As the $\lambda$-system $\mathcal{A} = \{B \in \mathcal{B} : y \mapsto P(y + Z \in B)$ is a Borel function$\}$ contains the $\pi$-system $\{(-\infty, b] : b \in \mathbb{R}\}$ generating $\mathcal{B}$, it follows that $\mathcal{A} = \mathcal{B}$, hence $p_{s,t}(\cdot, \cdot)$ is a transition probability for each $t \ge s \ge 0$. To verify that the Chapman-Kolmogorov equations hold, fix $u \in [s,t]$ noting that $Z_{s,t} = Z_{s,u} + Z_{u,t}$, with $Z_{u,t} = X_t - X_u$ independent of $Z_{s,u} = X_u - X_s$. Hence, by the tower property,
$$p_{s,t}(y, B) = E[P(y + Z_{s,u} + Z_{u,t} \in B | Z_{s,u})] = E[p_{u,t}(y + Z_{s,u}, B)] = (p_{s,u}(p_{u,t} I_B))(y) = p_{s,u}\, p_{u,t}(y, B) ,$$
and this relation, i.e. (8.3.1), holds for all $y \in \mathbb{R}$ and $B \in \mathcal{B}$, as claimed.
Among the consequences of Proposition 8.3.5 is the fact that both the Brownian motion and the Poisson process (potentially starting at $N_0 = x \in \mathbb{R}$), are homogeneous Markov processes of explicit Markov semi-groups.

Example 8.3.6. Recall Proposition 3.4.9 and Exercise 7.3.13 that both the Poisson process and the Brownian motion are processes of stationary independent increments. Further, this property clearly extends to the Brownian motion with drift $Z^{(r)}_t = W_t + rt + x$, and to the Poisson process with drift $N^{(r)}_t = N_t + rt + x$, where the drift $r \in \mathbb{R}$ is a non-random constant, $x \in \mathbb{R}$ is the specified (under $P_x$) initial value of $N^{(r)}_0$ (or $Z^{(r)}_0$), and $N_t - N_0$ is a Poisson process of rate $\lambda$. Consequently, both $\{Z^{(r)}_t, t \ge 0\}$ and $\{N^{(r)}_t, t \ge 0\}$ are real-valued homogeneous Markov processes. Specifically, from the preceding proposition we have that the Markov semi-group of the Brownian motion with drift is $p_t(x + rt, B)$, where for $t > 0$,
(8.3.9) $\qquad p_t(x, B) = \int_B \frac{e^{-(y-x)^2/2t}}{\sqrt{2\pi t}}\, dy ,$
having the transition probability kernel $p_t(x, y) = \exp(-(y-x)^2/2t)/\sqrt{2\pi t}$. Similarly, the Markov semi-group of the Poisson process with drift is $q_t(x + rt, B)$, where
(8.3.10) $\qquad q_t(x, B) = e^{-\lambda t} \sum_{k=0}^{\infty} \frac{(\lambda t)^k}{k!}\, I_B(x + k) .$
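For the Brownian semi-group (8.3.9) the Chapman-Kolmogorov equation (8.3.2) reads $\int p_s(x,z)\, p_u(z,y)\, dz = p_{s+u}(x,y)$, i.e. Gaussian kernels convolve. A direct numerical check (outside the text, illustrative parameters):

```python
import numpy as np

def kernel(t, x, y):
    """Brownian transition kernel p_t(x, y) of (8.3.9)."""
    return np.exp(-(y - x)**2 / (2 * t)) / np.sqrt(2 * np.pi * t)

s, u, x, y = 0.3, 0.7, 0.2, -0.5
z = np.linspace(-12.0, 12.0, 60_001)   # grid for the intermediate point
dz = z[1] - z[0]

lhs = np.sum(kernel(s, x, z) * kernel(u, z, y)) * dz   # p_s p_u (x, y)
rhs = kernel(s + u, x, y)                              # p_{s+u}(x, y)
print(lhs, rhs)
```

Since the integrand is smooth with Gaussian tails, the Riemann sum agrees with $p_{s+u}(x,y)$ to near machine precision, illustrating that $N(0,s) * N(0,u) = N(0, s+u)$.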
Remark. Homogeneous Markov chains are characterized by their (one-step) transition probabilities, whereas each homogeneous Markov process has a full semi-group $\{p_t(\cdot), t \ge 0\}$. While outside our scope, we note in passing that the semi-group relation (8.3.2) can be rearranged as $s^{-1}(p_{s+t} - p_t) = s^{-1}(p_s - p_0) p_t$, which subject to the appropriate regularity conditions should yield for $s \downarrow 0$ the celebrated backward Kolmogorov equation $\partial_t p_t = L p_t$. The operator $L = \lim_{s \downarrow 0} s^{-1}(p_s - p_0)$ is then called the generator of the Markov process (or its semi-group). For example, the transition probability kernel $p_t(x + rt, y)$ of the Brownian motion with drift solves the partial differential equation (pde), $u_t = \frac{1}{2} u_{xx} + r u_x$, and the generator of this semi-group is $L u = \frac{1}{2} u_{xx} + r u_x$ (c.f. [KaS97, Chapter 5]). For this reason, many computations about Brownian motion can also be done by solving rather simple elliptic or parabolic pde-s.
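The pde claim in the preceding remark can be checked pointwise by finite differences: with $u(t,x) = p_t(x + rt, y_0)$ the drifted kernel, one verifies $u_t = \frac{1}{2} u_{xx} + r u_x$ numerically. A small sketch, outside the text, with illustrative parameters:

```python
import numpy as np

r, y0 = 0.4, 1.0

def u(t, x):
    """Drifted Brownian kernel u(t, x) = p_t(x + r*t, y0), a function of (t, x)."""
    m = y0 - x - r * t
    return np.exp(-m * m / (2 * t)) / np.sqrt(2 * np.pi * t)

t, x, h = 0.5, 0.3, 1e-4

# Central finite differences for u_t, u_x and u_xx at (t, x).
u_t  = (u(t + h, x) - u(t - h, x)) / (2 * h)
u_x  = (u(t, x + h) - u(t, x - h)) / (2 * h)
u_xx = (u(t, x + h) - 2 * u(t, x) + u(t, x - h)) / h**2

print(u_t, 0.5 * u_xx + r * u_x)   # the two sides of the backward equation
```

The two printed values agree to several digits, consistent with $Lu = \frac{1}{2} u_{xx} + r u_x$ being the generator of this semi-group.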
We saw in Proposition 8.3.5 that the Wiener process $(W_t, t \ge 0)$ is a homogeneous $\mathcal{F}^W_t$-Markov process of continuous sample functions and the Markov semi-group of (8.3.9). This motivates the following definition of a Brownian Markov process $(W_t, \mathcal{F}_t)$, where our accommodation of possible enlargements of the filtration and different initial distributions will be useful in future applications.

Definition 8.3.7 (Brownian Markov process). We call $(W_t, \mathcal{F}_t)$ a Brownian Markov process if $\{W_t, t \ge 0\}$ of continuous sample functions is a homogeneous $\mathcal{F}_t$-Markov process with the Brownian semi-group $\{p_t, t \ge 0\}$ of (8.3.9). If in addition $W_0 = 0$, we call such a process a standard Brownian Markov process.
You can easily check that if a Markov process is also a stationary process (see Definition 7.3.7), then it must be a homogeneous Markov process, but many homogeneous Markov processes are non-stationary (for example, recall Examples 7.3.14 and 8.3.6, that the Brownian motion is a non-stationary yet homogeneous Markov process). Stationarity of such S.P.-s is related to the important concept of invariant probability measures which we define next (compare with Definition 6.1.20).

Definition 8.3.8. A probability measure $\mu$ on a B-isomorphic space $(S, \mathcal{S})$ is called an invariant (probability) measure for a semi-group of transition probabilities $\{p_u, u \ge 0\}$, if the induced law $P_\mu(\cdot) = \int_S P_x(\cdot)\, \mu(dx)$ (see Remark 8.3.3), is invariant under any time shift $\theta_s$, $s \ge 0$.
Here is the explicit characterization of invariant measures for a given Markov semi-group and their connection to stationary Markov processes.

Exercise 8.3.9. Adapting the proof of Proposition 6.1.23, show that a probability measure $\mu$ on B-isomorphic $(S, \mathcal{S})$ is an invariant measure for a Markov semi-group $\{p_u, u \ge 0\}$ if and only if $\mu p_t = \mu$ for any $t \ge 0$, and that a homogeneous Markov process $\{X_t, t \ge 0\}$ is a stationary S.P. if and only if the initial distribution $\mu(B) = P(X_0 \in B)$ is an invariant probability measure for the corresponding Markov semi-group.
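For a concrete instance of the criterion $\mu p_t = \mu$, the Ornstein-Uhlenbeck semi-group (see Exercise 8.3.10 below) maps Gaussian laws to Gaussian laws, with $N(0,1)$ invariant. A small numerical sketch of ours, assuming the standard OU transition kernel, namely $U_t \sim N(x e^{-t/2},\, 1 - e^{-t})$ given $U_0 = x$:

```python
import math

def ou_step(mean, var, t):
    # The OU kernel maps the Gaussian law N(mean, var) to N(a*mean, a^2*var + 1 - a^2),
    # where a = e^{-t/2}; indeed given U_0 = x, U_t ~ N(x*e^{-t/2}, 1 - e^{-t}).
    a = math.exp(-t / 2.0)
    return a * mean, a * a * var + (1.0 - a * a)

m1, v1 = ou_step(0.0, 1.0, 0.8)   # N(0,1): candidate invariant law, unchanged
m2, v2 = ou_step(3.0, 0.25, 0.8)  # a non-invariant initial law is moved
```

Here $N(0,1)$ is an exact fixed point of the kernel for every $t$, while any other Gaussian law changes after one step.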
Pursuing similar themes, your next exercise examines some of the most fundamental S.P. one derives out of the Brownian motion.

Exercise 8.3.10. With $\{W_t, t \ge 0\}$ a Wiener process, consider the Geometric Brownian motion $Y_t = e^{W_t}$, the Ornstein-Uhlenbeck process $U_t = e^{-t/2} W_{e^t}$, the Brownian motion with drift $Z^{(r,\sigma)}_t = \sigma W_t + rt$ and the standard Brownian bridge on $[0, 1]$ (as in Exercises 7.3.15-7.3.16).
8.3. MARKOV AND STRONG MARKOV PROCESSES 323

Figure 1. Illustration of sample paths for processes in Exercise 8.3.10: the Brownian bridge $B_t$, the Geometric Brownian motion $Y_t$, the Ornstein-Uhlenbeck process $U_t$, and the Brownian motion with drift ($r = 1$, $\sigma = 2$, $x_0 = 1$).
(a) Determine which of these four S.P. is a Markov process with respect to its canonical filtration, and among those, which are also homogeneous.
(b) Find among these S.P. a homogeneous Markov process whose increments are neither independent nor stationary.
(c) Find among these S.P. a Markov process of stationary increments, which is not a homogeneous Markov process.
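Sample paths like those in Figure 1 can all be simulated from a single discretized Wiener path. The following sketch is our own (grid size and parameters are arbitrary choices); the Ornstein-Uhlenbeck path uses its exact Gaussian one-step transition rather than the time-changed representation:

```python
import numpy as np

rng = np.random.default_rng(0)
n, T = 1000, 1.0
t = np.linspace(0.0, T, n + 1)
# Wiener path on [0, T] from i.i.d. N(0, T/n) increments
W = np.concatenate([[0.0], np.cumsum(rng.standard_normal(n)) * np.sqrt(T / n)])

Y = np.exp(W)            # geometric Brownian motion Y_t = e^{W_t}
Z = 2.0 * W + 1.0 * t    # Brownian motion with drift (here sigma = 2, r = 1)
B = W - t * W[-1]        # standard Brownian bridge on [0, 1]

# Ornstein-Uhlenbeck via its exact Gaussian transition U_{t+dt} | U_t
dt = T / n
a, s = np.exp(-dt / 2.0), np.sqrt(1.0 - np.exp(-dt))
U = np.empty(n + 1)
U[0] = rng.standard_normal()     # start from the invariant N(0, 1) law
for k in range(n):
    U[k + 1] = a * U[k] + s * rng.standard_normal()
```

By construction the bridge is pinned at both endpoints and the geometric Brownian motion stays strictly positive.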
Homogeneous Markov processes possess the following Markov property, extending the invariance (8.3.4) of the process under the time shifts $\theta_s$ of Definition 7.3.7 to any bounded $\mathcal{S}^{[0,\infty)}$-measurable function of its sample path (compare to (6.1.8) in case of homogeneous Markov chains).
Proposition 8.3.11 (Markov property). Suppose $(X_t, t \ge 0)$ is a homogeneous $\mathcal{F}_t$-Markov process on a B-isomorphic state space $(S, \mathcal{S})$ and let $\{P_x\}$ denote the corresponding family of laws associated with its semi-group. Then, $x \mapsto E_x[h]$ is measurable on $(S, \mathcal{S})$ for any $h \in b\mathcal{S}^{[0,\infty)}$, and further, for any $s \ge 0$, almost surely

(8.3.11)    $E[h \circ \theta_s(X_\cdot(\omega)) \,|\, \mathcal{F}_s] = E_{X_s}[h]\,.$
Remark 8.3.12. From Lemma 7.1.7 you can easily deduce that any $V \in b\mathcal{F}^X_\infty$ is of the form $V = h(X_\cdot)$ with $h \in b\mathcal{S}^{[0,\infty)}$. Further, in view of Exercise 7.2.9, for a real-valued Markov process $\{X_t, t \ge 0\}$ of continuous sample functions, the preceding proposition applies for any bounded Borel function $h(\cdot)$ on the space $C([0,\infty))$ of continuous functions equipped with the topology of uniform convergence on compact intervals.
Proof. Fixing $s \ge 0$, in case $h(x(\cdot)) = \prod_{\ell=0}^{n} f_\ell(x(u_\ell))$ for finite $n$, $f_\ell \in b\mathcal{S}$ and $u_0 > \cdots > u_n \ge 0$, we have by (8.3.8) for $t_\ell = s + u_\ell$ and the semi-group $p_{r,t} = p_{t-r}$ of $(X_t, t \ge 0)$, that
$$E\Big[\prod_{\ell=0}^{n} f_\ell(X_{t_\ell}) \,\Big|\, \mathcal{F}_s\Big] = p_{u_n}\big(f_n\, p_{u_{n-1}-u_n}\big(\cdots (f_1\, p_{u_0-u_1} f_0)\big)\big)(X_s) = E_{X_s}\Big[\prod_{\ell=0}^{n} f_\ell(X_{u_\ell})\Big]\,.$$
The measurability of $x \mapsto E_x[h]$ for such functionals $h(\cdot)$ is verified by induction on $n$, where if $n = 0$ then for $f_0 \in b\mathcal{S}$, by Lemma 6.1.3 also $E_x h = g_1(x) = p_{u_0} f_0(x)$ is in $b\mathcal{S}$, and by the same argument, in the induction step $g_{n+1}(x) = p_{u_n}(f_n g_n)(x)$ are also in $b\mathcal{S}$.

To complete the proof consider the collection $\mathcal{H}$ of functionals $h \in b\mathcal{S}^{[0,\infty)}$ such that $x \mapsto E_x[h]$ is $\mathcal{S}$-measurable and (8.3.11) holds. The linearity of the (conditional) expectation and the monotone convergence theorem result with $\mathcal{H}$ a vector space that is closed under monotone limits, respectively. Further, as already shown, $\mathcal{H}$ contains the indicators $h(\cdot) = I_A(\cdot)$ with $A = \{x(\cdot) : x(u_\ell) \in B_\ell \in \mathcal{S},\ \ell = 0, \ldots, n\}$ a finite dimensional measurable rectangle. Thus, $\mathcal{H}$ satisfies the conditions of the monotone class theorem. Consequently $\mathcal{H} = b\mathcal{S}^{[0,\infty)}$, that is, for each $h \in b\mathcal{S}^{[0,\infty)}$ both $x \mapsto E_x[h] \in b\mathcal{S}$ and (8.3.11) holds w.p.1.
8.3.2. Strong Markov processes and Feller semi-groups. Given a homogeneous $\mathcal{F}_t$-Markov process $(X_t, t \ge 0)$, we seek to strengthen its Markov property about the shift of the sample path by non-random $s \ge 0$ (see Proposition 8.3.11), to the strong Markov property, whereby shifting by any $\mathcal{F}_t$-Markov time $\tau$ is accommodated (see Proposition 6.1.16 about Markov chains having this property).
Definition 8.3.13 (strong Markov process). We say that an $\mathcal{F}_t$-progressively measurable, homogeneous Markov process $\{X_t, t \ge 0\}$ on B-isomorphic state space $(S, \mathcal{S})$, has the strong Markov property (or that $(X_t, \mathcal{F}_t)$ is a strong Markov process), if for any bounded $h(s, x(\cdot))$ measurable on the product $\sigma$-algebra $\mathcal{U} = \mathcal{B}_{[0,\infty)} \times \mathcal{S}^{[0,\infty)}$, and any $\mathcal{F}_t$-Markov time $\tau$, almost surely

(8.3.12)    $I_{\{\tau<\infty\}}\, E[h(\tau, X_{\tau+\cdot}(\omega)) \,|\, \mathcal{F}_{\tau^+}] = g_h(\tau, X_\tau)\, I_{\{\tau<\infty\}}\,,$

where $g_h(s, x) = E_x[h(s, \cdot)]$ is bounded and measurable on $\mathcal{B}_{[0,\infty)} \times \mathcal{S}$, $\{P_x\}$ are the laws associated with the semi-group of $(X_t, \mathcal{F}_t)$, $\mathcal{F}_{\tau^+}$ is the Markov $\sigma$-algebra associated with $\tau$ (c.f. Definition 8.1.9), and both sides of (8.3.12) are set to zero when $\tau(\omega) = \infty$.
As noted in Remark 8.3.12, every $V \in b\mathcal{F}^X_\infty$ is of the form $V = h(X_\cdot)$, with $h(\cdot)$ in the scope of the strong Markov property, which for a real-valued homogeneous Markov process $\{X_t, t \ge 0\}$ of continuous sample functions, contains all bounded Borel functions $h(\cdot, \cdot)$ on $[0, \infty) \times C([0, \infty))$. In applications it is often handy to further have a time varying functional $h(s, x(\cdot))$ (for example, see our proof of the reflection principle, in Proposition 9.1.10).
Remark. Recall that the Markov time $\tau$ is an $\mathcal{F}_{t^+}$-stopping time (see Definition 8.1.9), hence the assumed $\mathcal{F}_t$-progressive measurability of $X_t$ guarantees that on the event $\{\tau < \infty\}$ the R.V. $\tau$ and $X_\tau$ are measurable on $\mathcal{F}_{\tau^+}$ (see Proposition 8.1.13), hence by our definition so is $g_h(\tau, X_\tau)$. While we have stated and proved Proposition 8.1.13 only in case of real-valued S.P. $\{X_t, t \ge 0\}$, the same proof (and conclusion), applies for any state space $(S, \mathcal{S})$. We lose nothing by assuming progressive measurability of $X_t$ since for a right-continuous process this is equivalent to its adaptedness (see Proposition 8.1.8, whose proof and conclusion extend to any topological state space).
Here is an immediate consequence of Definition 8.3.13.

Corollary 8.3.14. If $(X_t, \mathcal{F}_t)$ is a strong Markov process and $\tau$ is an $\mathcal{F}_t$-stopping time, then for any $h \in b\mathcal{U}$, almost surely

(8.3.13)    $I_{\{\tau<\infty\}}\, E[h(\tau, X_{\tau+\cdot}(\omega)) \,|\, \mathcal{F}_\tau] = g_h(\tau, X_\tau)\, I_{\{\tau<\infty\}}\,.$

In particular, if $(X_t, \mathcal{F}_t)$ is a strong Markov process, then $\{X_t, t \ge 0\}$ is a homogeneous $\mathcal{F}_{t^+}$-Markov process and for any $s \ge 0$ and $h \in b\mathcal{S}^{[0,\infty)}$, almost surely

(8.3.14)    $E[h(X_\cdot) \,|\, \mathcal{F}_{s^+}] = E[h(X_\cdot) \,|\, \mathcal{F}_s]\,.$
Proof. By the preceding remark, having $\tau$ an $\mathcal{F}_t$-stopping time results with $g_h(\tau, X_\tau) I_{\{\tau<\infty\}}$ which is measurable on $\mathcal{F}_\tau$. Thus, taking the expectation of (8.3.12) conditional on $\mathcal{F}_\tau \subseteq \mathcal{F}_{\tau^+}$ and applying the tower property, results with (8.3.13).
Comparing (8.3.13) and (8.3.12) for constant in time $h(x(\cdot))$ and the non-random, finite stopping time $\tau = s$ we deduce that (8.3.14) holds whenever $h = h_0 \circ \theta_s$ for some $h_0 \in b\mathcal{S}^{[0,\infty)}$. Since $(X_t, \mathcal{F}_t)$ is a homogeneous Markov process, considering $h_0(x(\cdot)) = I_B(x(u))$ for $u \ge 0$ and $B \in \mathcal{S}$, it follows that (8.3.4) holds also for $(X_t, \mathcal{F}_{t^+})$, namely, that $\{X_t, t \ge 0\}$ is a homogeneous $\mathcal{F}_{t^+}$-Markov process. With $\mathcal{H}$ denoting the collection of functionals $h \in b\mathcal{S}^{[0,\infty)}$ for which (8.3.14) holds, by the monotone class theorem it suffices to check that this is the case when $h(x(\cdot)) = \prod_{m=1}^{k} I_{B_m}(x(u_m))$, with $k$ finite, $u_m \ge 0$ and $B_m \in \mathcal{S}$. Representing such functionals as $h(\cdot) = h_1(\cdot)\, h_0 \circ \theta_s(\cdot)$ with $h_0(x(\cdot)) = \prod_{u_m \ge s} I_{B_m}(x(u_m - s))$ and $h_1(x(\cdot)) = \prod_{u_m < s} I_{B_m}(x(u_m))$, we complete the proof by noting that $h_1(X_\cdot)$ is measurable with respect to $\mathcal{F}_s \subseteq \mathcal{F}_{s^+}$, so can be taken out of both conditional expectations in (8.3.14) and thus eliminated.
To make the most use of the strong Markov property, Definition 8.3.13 calls for an arbitrary $h \in b\mathcal{U}$. As we show next, for checking that a specific S.P. is a strong Markov process, it suffices to verify (8.3.12) only for $h(s, x(\cdot)) = I_B(x(u))$ and bounded Markov times (compare with the definition of a homogeneous Markov process via (8.3.4)), which is a far more manageable task.

Proposition 8.3.15. An $\mathcal{F}_t$-progressively measurable, homogeneous Markov process $\{X_t, t \ge 0\}$ with a semi-group $\{p_u, u \ge 0\}$ on B-isomorphic state space $(S, \mathcal{S})$, has the strong Markov property if for any $u \ge 0$, $B \in \mathcal{S}$ and bounded $\mathcal{F}_t$-Markov times $\tau$, almost surely

(8.3.15)    $P[X_{\tau+u} \in B \,|\, \mathcal{F}_{\tau^+}] = p_u(X_\tau, B)\,.$
Proof. Step 1. We start by extending the validity of (8.3.15) to any a.s. finite Markov time. To this end, fixing $u \ge 0$, $B \in \mathcal{S}$, $n \in \mathbb{Z}_+$ and a $[0,\infty]$-valued $\mathcal{F}_t$-Markov time $\tau$, recall that $\tau_n = \tau \wedge n$ is a bounded $\mathcal{F}_{t^+}$-stopping time (c.f. part (c) of Exercise 8.1.10). Further, the bounded $I_{\{\tau \le n\}}$ and $p_u(X_{\tau_n}, B)$ are both measurable on $\mathcal{F}_{\tau_n^+}$ (see part (a) of Exercise 8.1.11 and Proposition 8.1.13, respectively). Hence, multiplying the identity (8.3.15) in case of $\tau_n$ by $I_{\{\tau \le n\}}$, and taking in, then out, what is known, we find that a.s.
$$0 = E\big[I_{\{\tau \le n\}}\big(I_B(X_{\tau_n+u}) - p_u(X_{\tau_n}, B)\big) \,\big|\, \mathcal{F}_{\tau_n^+}\big] = I_{\{\tau \le n\}}\, E[Z \,|\, \mathcal{F}_{\tau_n^+}]\,,$$
for the bounded R.V.
$$Z = I_{\{\tau<\infty\}}\big[I_B(X_{\tau+u}) - p_u(X_\tau, B)\big]\,.$$
By part (c) of Exercise 8.1.11 it then follows that w.p.1. $I_{\{\tau \le n\}} E[Z \,|\, \mathcal{F}_{\tau^+}] = 0$. Taking $n \to \infty$ we deduce that a.s. $E[Z \,|\, \mathcal{F}_{\tau^+}] = 0$. Upon taking out the known $I_{\{\tau<\infty\}}\, p_u(X_\tau, B)$ we represent this as

(8.3.16)    $E[I_{\{\tau<\infty\}} f(X_{\tau+u}) \,|\, \mathcal{F}_{\tau^+}] = I_{\{\tau<\infty\}} (p_u f)(X_\tau)\,,$ almost surely,

for $f(\cdot) = I_B(\cdot)$. By linearity of the expectation and conditional expectation, this identity extends from indicators to all $\mathcal{S}$-measurable simple functions, whereby it follows by monotone convergence that it holds for all $f \in b\mathcal{S}$.
Step 2. We are ready to prove that (8.3.12) holds for any $\mathcal{F}_t$-Markov time $\tau$, in case $h(s, x(\cdot)) = f_0(s) \prod_{\ell=1}^{n} f_\ell(x(u_\ell))$, with bounded Borel $f_0 : [0,\infty) \to \mathbb{R}$, $f_\ell \in b\mathcal{S}$ and $u_1 > \cdots > u_n \ge 0 = u_{n+1}$. As $f_0(\tau) \in b\mathcal{F}_{\tau^+}$, one can always take this (known) part of $h(\cdot, \cdot)$ out of the conditional expectation in (8.3.12) and thereafter eliminate it. Thus, setting $f_0 = 1$ we proceed to prove by induction on $n$ that (8.3.12) holds, namely, that for any $\mathcal{F}_t$-Markov time $\tau$, $f_\ell \in b\mathcal{S}$ and $u_1 > \cdots > u_n \ge 0$, almost surely,
$$E\Big[I_{\{\tau<\infty\}} \prod_{\ell=1}^{n} f_\ell(X_{\tau+u_\ell}) \,\Big|\, \mathcal{F}_{\tau^+}\Big] = I_{\{\tau<\infty\}}\, g_n(X_\tau)\,,$$
for the bounded, $\mathcal{S}$-measurable functions $g_1 = p_{u_1-u_2} f_1$ and
$$g_\ell = p_{u_\ell - u_{\ell+1}}(f_\ell\, g_{\ell-1})\,, \qquad \ell = 2, \ldots, n\,.$$
The identity (8.3.16) is the $n = 1$ basis of the proof. To carry out the induction step, recall part (c) of Exercise 8.1.10 that $\tau_\ell = \tau + u_\ell$ is a decreasing sequence of $\mathcal{F}_t$-Markov times, which are finite if and only if $\tau$ is, and further, $\mathcal{F}_{\tau^+} \subseteq \mathcal{F}_{\tau_n^+}$ (see part (b) of Exercise 8.1.11). It thus follows by the tower property and taking out the known term $f_n(X_{\tau+u_n}) \in b\mathcal{F}_{\tau_n^+}$ (when $\tau < \infty$, see Proposition 8.1.13), that
$$E\Big[I_{\{\tau<\infty\}} \prod_{\ell=1}^{n} f_\ell(X_{\tau_\ell}) \,\Big|\, \mathcal{F}_{\tau^+}\Big] = E\Big[f_n(X_{\tau_n})\, E\Big[I_{\{\tau_n<\infty\}} \prod_{\ell=1}^{n-1} f_\ell(X_{\tau_\ell}) \,\Big|\, \mathcal{F}_{\tau_n^+}\Big] \,\Big|\, \mathcal{F}_{\tau^+}\Big]$$
$$= E\big[I_{\{\tau<\infty\}}\, f_n(X_{\tau_n})\, g_{n-1}(X_{\tau_n}) \,\big|\, \mathcal{F}_{\tau^+}\big] = I_{\{\tau<\infty\}}\, g_n(X_\tau)\,.$$
Indeed, since $u_\ell - u_n$ are non-random and positive, the induction hypothesis applies for the $\mathcal{F}_t$-Markov time $\tau_n$ to yield the second equality, whereas the third equality is established by considering the identity (8.3.16) for $f = f_n g_{n-1}$ and $u = u_n$.
Step 3. Similarly to the proof of Proposition 8.3.11, fixing $A \in \mathcal{F}_{\tau^+}$, yet another application of the monotone class theorem shows that any $h \in b\mathcal{U}$ is in the collection $\mathcal{H} \subseteq b\mathcal{U}$ for which $g_h(s, x) = E_x[h(s, \cdot)]$ is measurable on $\mathcal{B}_{[0,\infty)} \times \mathcal{S}$ and

(8.3.17)    $E[I_{\{\tau<\infty\}} I_A\, h(\tau, X_{\tau+\cdot})] = E[g_h(\tau, X_\tau)\, I_{\{\tau<\infty\}} I_A]\,.$

Indeed, in Step 2 we have shown that $\mathcal{H}$ contains the indicators on the $\pi$-system
$$\mathcal{P} = \{B \times D : B \in \mathcal{B}_{[0,\infty)},\ D \in \mathcal{S}^{[0,\infty)}\ \text{a finite dimensional measurable rectangle}\}\,,$$
such that $\mathcal{U} = \sigma(\mathcal{P})$. Further, constants are in $\mathcal{H}$, which by the linearity of the expectation (and hence of $h \mapsto g_h$), is a vector space. Finally, if $h_n \uparrow h$ bounded and $h_n \in \mathcal{H}$ are non-negative, then $h \in b\mathcal{U}$ and by monotone convergence $g_{h_n} \uparrow g_h$ bounded and measurable, with the pair $(h, g_h)$ also satisfying (8.3.17). Since $g_h(\tau, X_\tau) I_{\{\tau<\infty\}}$ is in $b\mathcal{F}_{\tau^+}$ and the preceding argument applies for all $A \in \mathcal{F}_{\tau^+}$, we conclude that per $\tau$ and $h$ the identity (8.3.12) holds w.p.1., as claimed.

As you are to show now, both Markov and strong Markov properties apply for product laws of finitely many independent processes, each of which has the corresponding property.
Exercise 8.3.16. Suppose on some probability space $(\Omega, \mathcal{F}, P)$ we have homogeneous Markov processes $(X^{(i)}_t, \mathcal{F}^{(i)}_t)$ of B-isomorphic state spaces $(S_i, \mathcal{S}_i)$ and Markov semi-groups $p^{(i)}_t(\cdot, \cdot)$, such that $\mathcal{F}^{(i)}_\infty$, $i = 1, \ldots, k$, are P-mutually independent.
(a) Let $X_t = (X^{(1)}_t, \ldots, X^{(k)}_t)$ and $\mathcal{F}_t = \sigma(\mathcal{F}^{(1)}_t, \ldots, \mathcal{F}^{(k)}_t)$. Show that $(X_t, \mathcal{F}_t)$ is a homogeneous Markov process, of the Markov semi-group
$$p_t(x, B_1 \times \cdots \times B_k) = \prod_{i=1}^{k} p^{(i)}_t(x_i, B_i)$$
on the B-isomorphic state space $(S, \mathcal{S})$, where $S = S_1 \times \cdots \times S_k$ and $\mathcal{S} = \mathcal{S}_1 \times \cdots \times \mathcal{S}_k$.
(b) Suppose in addition that for each $i = 1, \ldots, k$ the $\mathcal{F}^{(i)}_t$-progressively measurable process $X^{(i)}_t$ has the strong Markov property. Show that the strong Markov property then holds for the $\mathcal{F}_t$-progressively measurable $(S, \mathcal{S})$-valued stochastic process $X_t$.
Recall Proposition 6.1.16 that every homogeneous Markov chain of a B-isomorphic state space has the strong Markov property, and that in this context every Markov time is a stopping time and takes only countably many possible values. As expected, you are to show next that any homogeneous Markov process has the strong Markov property (8.3.13) for such stopping times.
Exercise 8.3.17. Suppose $(X_t, \mathcal{F}_t)$ is a homogeneous Markov process, $(S, \mathcal{S})$ its B-isomorphic state space and $\tau : \Omega \to C$ is an $\mathcal{F}_t$-stopping time with countable $C = \{s_k\} \subseteq [0, \infty]$.
(a) Show that $A \cap \{\omega : \tau(\omega) = s_k\} \in \mathcal{F}_{s_k}$ for any finite $s_k \in C$ and $A \in \mathcal{F}_\tau$.
(b) Deduce that $h(\tau, X_{\tau+\cdot}) I_{\{\tau<\infty\}}$ is a R.V. and $g_h(\tau, X_\tau) I_{\{\tau<\infty\}} \in b\mathcal{F}_\tau$, provided $h(s_k, \cdot)$ are $\mathcal{S}^{[0,\infty)}$-measurable and uniformly bounded on $C \times S^{[0,\infty)}$.
(c) Conclude that (8.3.13) holds a.s. for any such $\tau$ and $h$.
For the Feller semi-groups we define next (compare with the strong Feller property of Remark 6.3.12), the right-continuity of sample functions yields the strong Markov property.

Definition 8.3.18. A Feller semi-group is a Markov semi-group $\{p_u, u \ge 0\}$ on $(\mathbb{R}, \mathcal{B})$ such that $p_t : C_b(\mathbb{R}) \to C_b(\mathbb{R})$ for any $t \ge 0$. That is, $x \mapsto (p_t f)(x)$ is continuous for any fixed bounded, continuous function $f$ and $t \ge 0$.
Proposition 8.3.19. Any right-continuous homogeneous Markov process $(X_t, \mathcal{F}_t)$ with a Feller semi-group (of transition probabilities), is a strong Markov process.

Proof. Fixing $u \ge 0$, a bounded $\mathcal{F}_t$-Markov time $\tau$, $A \in \mathcal{F}_{\tau^+}$ and $f \in C_b(\mathbb{R})$, we proceed to show that

(8.3.18)    $E[f(X_{\tau+u}) I_A] = E[(p_u f)(X_\tau) I_A]\,.$

Indeed, recall that in Lemma 8.1.16 we have constructed a sequence of finite $\mathcal{F}_t$-stopping times $\tau_\ell = 2^{-\ell}([2^\ell \tau] + 1)$ taking values in the countable set of non-negative dyadic rationals, such that $\tau_\ell \downarrow \tau$. Further, for any $\ell$ we have that $A \in \mathcal{F}_{\tau_\ell}$ (see part (b) of Exercise 8.1.12), hence as shown in Exercise 8.3.17,
$$E[f(X_{\tau_\ell+u}) I_A] = E[(p_u f)(X_{\tau_\ell}) I_A]\,.$$
Due to the sample path right-continuity, both $X_{\tau_\ell+u} \to X_{\tau+u}$ and $X_{\tau_\ell} \to X_\tau$. Since $f \in C_b(\mathbb{R})$ and $p_u f \in C_b(\mathbb{R})$ (by the assumed Feller property), as $\ell \to \infty$ both $f(X_{\tau_\ell+u}) \to f(X_{\tau+u})$ and $(p_u f)(X_{\tau_\ell}) \to (p_u f)(X_\tau)$. We thus deduce by bounded convergence that (8.3.18) holds.
Next, consider non-negative $f_k \in C_b(\mathbb{R})$ such that $f_k \uparrow I_{(-\infty,b)}$ (see Lemma 3.1.6 for an explicit construction of such $f_k$). By monotone convergence $p_u f_k \uparrow p_u I_{(-\infty,b)}$ and hence

(8.3.19)    $E[I_B(X_{\tau+u}) I_A] = E[p_u(X_\tau, B) I_A]\,,$

for any $B$ in the $\pi$-system $\{(-\infty, b) : b \in \mathbb{R}\}$ which generates the Borel $\sigma$-algebra $\mathcal{B}$. The collection $\mathcal{L}$ of sets $B \in \mathcal{B}$ for which the preceding identity holds is a $\lambda$-system (by linearity of the expectation and monotone convergence), so by Dynkin's theorem it holds for any Borel set $B$. Since this applies for any $A \in \mathcal{F}_{\tau^+}$, the strong Markov property of $(X_t, \mathcal{F}_t)$ follows from Proposition 8.3.15, upon noting that the right-continuity of $t \mapsto X_t$ implies that $X_t$ is $\mathcal{F}_t$-progressively measurable, with $p_u(X_\tau, B) \in m\mathcal{F}_{\tau^+}$ (see Propositions 8.1.8 and 8.1.13, respectively).
Taking advantage of the preceding result, you can now verify that any right-continuous S.P. of stationary, independent increments is a strong Markov process.

Exercise 8.3.20. Suppose $\{X_t, t \ge 0\}$ is a real-valued process of stationary, independent increments.
(a) Show that $\{X_t, t \ge 0\}$ has a Feller semi-group.
(b) Show that if $\{X_t, t \ge 0\}$ is also right-continuous, then it is a strong Markov process. Deduce that this applies in particular for the Poisson process (starting at $N_0 = x \in \mathbb{R}$ as in Example 8.3.6), as well as for any Brownian Markov process $(X_t, \mathcal{F}_t)$.
(c) Suppose the right-continuous $\{X_t, t \ge 0\}$ is such that $\lim_{t \downarrow 0} E|X_t| = 0$ and $X_0 = 0$. Show that $X_t$ is integrable for all $t \ge 0$ and $M_t = X_t - t E X_1$ is a martingale. Deduce that then $E[X_\tau] = E[\tau] E[X_1]$ for any integrable $\mathcal{F}^X_t$-stopping time $\tau$.
Hint: Relying on Wald's identity (see part (a) of Exercise 5.4.10), establish the last claim first for the $\mathcal{F}^X_t$-stopping times $\tau_\ell = 2^{-\ell}([2^\ell \tau] + 1)$.
Our next example demonstrates that some regularity of the semi-group is needed when aiming at the strong Markov property (i.e., merely considering the canonical filtration of a homogeneous Markov process with continuous sample functions is not enough).
Example 8.3.21. Suppose $X_0$ is independent of the standard Wiener process $\{W_t, t \ge 0\}$ and $q = P(X_0 = 0) \in (0, 1)$. The S.P. $X_t = X_0 + W_t I_{\{X_0 \ne 0\}}$ has continuous sample functions and for any fixed $s \ge 0$, a.s. $I_{\{X_0 \ne 0\}} = I_{\{X_s \ne 0\}}$ (as the difference occurs on the event $\{W_s = -X_0 \ne 0\}$ which is of zero probability). Further, the independence of increments of $W_t$ implies the same for $X_t$ conditioned on $X_0$, hence for any $u \ge 0$ and Borel set $B$, almost surely,
$$P(X_{s+u} \in B \,|\, \mathcal{F}^X_s) = I_{\{0 \in B\}} I_{\{X_0 = 0\}} + P(W_{s+u} - W_s + X_s \in B \,|\, X_s)\, I_{\{X_0 \ne 0\}} = \bar p_u(X_s, B)\,,$$
where $\bar p_u(x, B) = p_0(x, B) I_{\{x = 0\}} + p_u(x, B) I_{\{x \ne 0\}}$ for the Brownian semi-group $p_u(\cdot)$. Clearly, per $u$ fixed, $\bar p_u(\cdot, \cdot)$ is a transition probability on $(\mathbb{R}, \mathcal{B})$ and $p_0(x, B)$ is the identity element for the semi-group relation $\bar p_{u+s} = \bar p_u \bar p_s$ which is easily shown to hold (but this is not a Feller semi-group, since $x \mapsto (\bar p_t f)(x)$ is discontinuous at $x = 0$ whenever $f(0) \ne E f(W_t)$). In view of Definition 8.3.1, we have just shown that $\bar p_u(\cdot, \cdot)$ is the Markov semi-group associated with the $\mathcal{F}^X_t$-progressively measurable homogeneous Markov process $\{X_t, t \ge 0\}$ (regardless of the distribution of $X_0$). However, $(X_t, \mathcal{F}^X_t)$ is not a strong Markov process. Indeed, note that $\tau = \inf\{t \ge 0 : X_t = 0\}$ is an $\mathcal{F}^X_t$-stopping time (see Proposition 8.1.15), which is finite a.s. (since if $X_0 \ne 0$ then $X_t = W_t + X_0$ and $\tau = \tau^{(0)}_{-X_0}$ of Exercise 8.2.34, whereas for $X_0 = 0$ obviously $\tau = 0$). Further, by continuity of the sample functions, $X_\tau = 0$ whenever $\tau < \infty$, so if $(X_t, \mathcal{F}^X_t)$ was a strong Markov process, then in particular, a.s.
$$P(X_{\tau+1} > 0 \,|\, \mathcal{F}^X_\tau) = \bar p_1(0, (0, \infty)) = 0$$
(this is merely (8.3.15) for the stopping time $\tau$, $u = 1$ and $B = (0, \infty)$). However, the latter identity fails whenever $X_0 \ne 0$ (i.e. with probability $1 - q > 0$), for then the left side is merely $p_1(0, (0, \infty)) = 1/2$ (since $(W_t, \mathcal{F}^X_t)$ is a Brownian Markov process, hence a strong Markov process, see Exercise 8.3.20).
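The failure of the strong Markov property in this example can also be seen by simulation: starting from $X_0 = 1$, run the path on a grid to (approximately) the hitting time $\tau$ of $0$, evolve one more unit of time, and the fraction of paths with $X_{\tau+1} > 0$ comes out near $1/2$, not the value $0$ predicted by $\bar p_1(0, (0, \infty))$. A rough Monte Carlo sketch of ours (the grid discretization, horizon, and function name are all our own choices):

```python
import random
random.seed(1)

def sign_one_unit_after_hitting_zero(x0, dt=0.01, horizon=30.0):
    # Run X_t = x0 + W_t on a grid until it first crosses 0 (approximating tau),
    # then evolve one more unit of time; report whether X_{tau+1} > 0.
    x, t, sd = x0, 0.0, dt ** 0.5
    while t < horizon:
        x += random.gauss(0.0, sd)
        t += dt
        if x <= 0.0:
            for _ in range(int(1.0 / dt)):
                x += random.gauss(0.0, sd)
            return x > 0.0
    return None  # path did not reach 0 within the horizon; discard it

results = [sign_one_unit_after_hitting_zero(1.0) for _ in range(300)]
hits = [b for b in results if b is not None]
frac = sum(hits) / len(hits)   # close to 1/2, consistent with the text
```

The hitting time of Brownian motion is a.s. finite but heavy tailed, so a finite horizon discards a small fraction of paths; this does not affect the qualitative conclusion.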
Here is an alternative, martingale based, proof that any Brownian Markov process is a strong Markov process.

Exercise 8.3.22. Suppose $(X_t, \mathcal{F}_t)$ is a Brownian Markov process.
(a) Let $R_t$ and $I_t$ denote the real and imaginary parts of the complex-valued S.P. $M_t = \exp(i\theta X_t + t\theta^2/2)$. Show that both $(R_t, \mathcal{F}_t)$ and $(I_t, \mathcal{F}_t)$ are MG-s.
(b) Fixing a bounded $\mathcal{F}_t$-stopping time $\tau$, show that $E[M_{\tau+u} \,|\, \mathcal{F}_{\tau^+}] = M_\tau$ w.p.1.
(c) Deduce that w.p.1. the R.C.P.D. of $W_{\tau+u}$ given $\mathcal{F}_{\tau^+}$ matches the normal distribution of mean $W_\tau(\omega)$ and variance $u$.
(d) Conclude that the $\mathcal{F}_t$-progressively measurable homogeneous Markov process $\{W_t, t \ge 0\}$ is a strong Markov process.
8.3.3. Markov jump processes. This section is about the following Markov processes, which in many respects are very close to Markov chains.

Definition 8.3.23. A function $x : \mathbb{R}_+ \to S$ is called a step function if it is constant on each of the intervals $[s_{k-1}, s_k)$, for some countable (possibly finite), set of isolated points $0 = s_0 < s_1 < s_2 < \cdots$. A continuous-time stochastic process $(X_t, t \ge 0)$ taking values in some measurable space $(S, \mathcal{S})$ is called a pure jump process if its sample functions are step functions. A Markov pure jump process is a homogeneous Markov process which, starting at any non-random $X_0 = x \in S$, is also a pure jump process on its B-isomorphic state space $(S, \mathcal{S})$.

Remark. We often use Markov jump process for Markov pure jump process, and note in passing that these processes are sometimes also called continuous time Markov chains.

The relatively explicit analysis of Markov jump processes, as provided here, owes much to the fact that the jump times in their sample functions are isolated. Many interesting, and harder to analyze, Markov processes have piecewise constant sample functions, but with accumulation points of jump times.
We start by showing that the strong Markov property applies for all Markov jump processes.

Proposition 8.3.24. Any Markov jump process $(X_t, \mathcal{F}_t)$ is a strong Markov process.

Proof. Though we did not even endow the state space $(S, \mathcal{S})$ with a topology, the sample functions $t \mapsto X_t$, being step functions, are trivially right continuous, hence the Markov jump process is $\mathcal{F}_t$-progressively measurable (see Proposition 8.1.8). Fixing $u \ge 0$, a bounded $\mathcal{F}_t$-Markov time $\tau$, $A \in \mathcal{F}_{\tau^+}$ and $B \in \mathcal{S}$, as in the proof of Proposition 8.3.19 the identity (8.3.19) holds for some sequence $\tau_\ell$ of $\mathcal{F}_t$-stopping times such that $\tau_\ell \downarrow \tau$. Since the right-continuous sample functions $t \mapsto X_t$ of a jump process are constant except possibly for isolated jump times, both $X_{\tau_\ell} = X_\tau$ and $X_{\tau_\ell+u} = X_{\tau+u}$ for all $\ell$ large enough. Consequently, $I_B(X_{\tau_\ell+u}) = I_B(X_{\tau+u})$ and $p_u(X_{\tau_\ell}, B) = p_u(X_\tau, B)$ for all $\ell$ large enough, so by bounded convergence the identity (8.3.19) also holds for the $\mathcal{F}_t$-Markov time $\tau$. Since this applies for any $A \in \mathcal{F}_{\tau^+}$, as explained while proving Proposition 8.3.19, the strong Markov property of $(X_t, \mathcal{F}_t)$ then follows from Proposition 8.3.15.
Example 8.3.25. The semi-group of a Markov jump process is often not a Feller semi-group (so Proposition 8.3.24 is not a special case of Proposition 8.3.19). For example, setting $\mathrm{sgn}(0) = 0$ it is easy to check that $p_t(x, A) = e^{-t} I_{\{x \in A\}} + (1 - e^{-t}) I_{\{\mathrm{sgn}(x) \in A\}}$ is a Markov semi-group on $\mathbb{R}$, which is not a Feller semi-group (as $(p_1 h)(x) = e^{-1} h(x) + (1 - e^{-1}) 1_{\{x \ne 0\}}$ is discontinuous for $h(x) = x^2 \wedge 1 \in C_b(\mathbb{R})$). This semi-group corresponds to a Markov jump process $X_t$ with at most one jump per sample function, such that starting at any state $X_0$ other than the (absorbing) states $-1$, $0$ and $1$, it jumps to $\mathrm{sgn}(X_0) \in \{-1, 0, 1\}$ at a random time having the exponential distribution of parameter one.
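The claimed discontinuity of $p_1 h$ at $x = 0$ for $h(x) = x^2 \wedge 1$ is a one-line computation, sketched here (the helper names are our own):

```python
import math

def sgn(x):
    return 0.0 if x == 0.0 else math.copysign(1.0, x)

def h(x):
    return min(x * x, 1.0)   # h(x) = min(x^2, 1), bounded and continuous

def p1h(x):
    # (p_1 h)(x) = e^{-1} h(x) + (1 - e^{-1}) h(sgn(x)), per Example 8.3.25
    return math.exp(-1.0) * h(x) + (1.0 - math.exp(-1.0)) * h(sgn(x))

at_zero = p1h(0.0)       # = 0, since sgn(0) = 0 and h(0) = 0
near_zero = p1h(1e-9)    # close to 1 - e^{-1}: p_1 h jumps at x = 0
```

So $p_1$ maps a bounded continuous $h$ to a function discontinuous at the origin, confirming the semi-group is not Feller.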
In view of Lemma 7.1.7, the law of a homogeneous Markov process does not tell us directly whether or not it is a Markov jump process. In fact, a Markov jump process corresponds to the piecewise constant RCLL modification of the given Markov law (and such a modification is essentially unique, see Exercise 7.2.3), so one of the central issues here is to determine when such a modification exists.
With the Poisson process as our prototypical example of a Markov jump process, we borrow from the treatment of the Poisson process (in Subsection 3.4.2), and proceed to describe the jump parameters of Markov jump processes. These parameters then serve as a convenient alternative to the general characterization of a homogeneous Markov process via its (Markov) semi-group.
Proposition 8.3.26. Suppose $(X_t, t \ge 0)$ is a Markov jump process. Then, $\tau = \inf\{t \ge 0 : X_t \ne X_0\}$ is an $\mathcal{F}^X_t$-stopping time which under $P_x$ has the exponential distribution of parameter $\lambda_x$, for all $x \in S$ and some measurable $\lambda : S \to \mathbb{R}_+$.
Further, if $\lambda_x > 0$ then $\tau$ is $P_x$-almost-surely finite and $P_x$-independent of the $S$-valued random variable $X_\tau$.

Proof. Since each $t \mapsto X_t(\omega)$ is a step function, clearly, for each $t \ge 0$
$$\{\tau \le t\} = \bigcup_{q \in \mathbb{Q}^{(2)},\, q \le t} \{X_q \ne X_0\} \in \mathcal{F}^X_t$$
and so $\tau$ is a strictly positive $\mathcal{F}^X_t$-stopping time. Fixing $x \in S$ and $u \ge 0$, by the same reasoning $\tau_x = \inf\{t \ge u : X_t \ne x\}$ is an $\mathcal{F}^X_t$-stopping time. Under $P_x$ the event $\{\tau > u\}$ implies that $X_u = x$ and $\tau = \tau_{X_u}$. Thus, applying the Markov property for $h = I_{\{\tau > t\}}$ (so $h(X_{u+\cdot}) = I_{\{\tau_{X_u} > u+t\}}$), we have that
$$P_x(\tau > u + t) = P_x(\tau_{X_u} > u + t,\, X_u = x,\, \tau > u) = E_x[P(\tau_{X_u} > u + t \,|\, \mathcal{F}^X_u)\, I_{\{\tau > u, X_u = x\}}] = P_x(\tau > t)\, P_x(\tau > u)\,.$$
So, the $[0,1]$-valued function $g(t) = P_x(\tau > t)$ is such that $g(t + u) = g(t)g(u)$ for all $t, u \ge 0$. In particular, by elementary algebra $g(t) = g(1)^t$ for all $t \in \mathbb{Q}^{(2)}$, which by the right-continuity of $t \mapsto g(t)$ extends to all $t \ge 0$. Since $\tau$ is strictly positive we have that $g(0) = 1$ and consequently, $1 \ge g(1) > 0$. Setting $g(1) = \exp(-\lambda_x)$ we thus conclude that $P_x(\tau > t) = \exp(-\lambda_x t)$ for some $\lambda : S \to \mathbb{R}_+$ which is measurable (by the measurability of $x \mapsto P_x(\tau > 1)$).

Suppose now that $\lambda_x > 0$, in which case $\tau$ is finite $P_x$-almost surely. Then, $X_\tau$ is well defined and applying the Markov property for $h = I_{\{X_{\tau_x} \in B\}}$ (so $h(X_{u+\cdot}) = I_B(X_{\tau_{X_u}})$), we have that for any $B \in \mathcal{S}$ and $u \ge 0$,
$$P_x(X_\tau \in B,\, \tau > u) = P_x(X_{\tau_x} \in B,\, \tau > u,\, X_u = x) = E_x[P(X_{\tau_{X_u}} \in B \,|\, \mathcal{F}^X_u)\, I_{\{\tau > u, X_u = x\}}] = P_x(X_\tau \in B)\, P_x(\tau > u)\,.$$
As $\{\tau > u\}$ and $\{X_\tau \in B,\, \tau < \infty\}$ are $P_x$-independent for any $u \ge 0$ and $B \in \mathcal{S}$, it follows that the random variables $\tau$ and $X_\tau$ are $P_x$-independent, as claimed.
Markov jump processes have the following parameters.

Definition 8.3.27. We call $p(x, A)$ and $\lambda_x$ the jump transition probability and jump rates of a Markov jump process $\{X_t, t \ge 0\}$, if $p(x, A) = P_x(X_\tau \in A)$ for $A \in \mathcal{S}$ and $x \in S$ of positive jump rate $\lambda_x$, while $p(x, A) = I_{\{x \in A\}}$ in case $\lambda_x = 0$. More generally, a pair $(\lambda, p)$ with $\lambda : S \to \mathbb{R}_+$ measurable and $p(\cdot, \cdot)$ a transition probability on $(S, \mathcal{S})$ such that $p(x, \{x\}) = I_{\{\lambda_x = 0\}}$ is called jump parameters.
The jump parameters provide the following canonical construction of Markov jump processes.

Theorem 8.3.28. Suppose $(\lambda, p)$ are jump parameters on a B-isomorphic space $(S, \mathcal{S})$. Let $\{Z_n, n \ge 0\}$ be the homogeneous Markov chain of transition probability $p(\cdot, \cdot)$ and initial state $Z_0 = x \in S$. For each $y \in S$ let $\{\sigma_j(y), j \ge 1\}$ be i.i.d. random variables, independent of $\{Z_n\}$ and having each the exponential distribution of parameter $\lambda_y$. Set $T_0 = 0$, $T_k = \sum_{j=1}^{k} \sigma_j(Z_{j-1})$, $k \ge 1$ and $X_t = Z_k$ for all $t \in [T_k, T_{k+1})$, $k \ge 0$. Assuming $P_x(T_\infty < \infty) = 0$ for all $x \in S$, the process $\{X_t, t \ge 0\}$ thus constructed is the unique Markov jump process with the given jump parameters. Conversely, $(\lambda, p)$ are the parameters of a Markov jump process if and only if $P_x(T_\infty < \infty) = 0$ for all $x \in S$.
Remark. The random time $\sigma_j(Z_{j-1})$ is often called the holding time at state $Z_{j-1}$ (or alternatively, the $j$-th holding time), along the sample path of the Markov jump process.
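The construction in Theorem 8.3.28 translates directly into a simulation recipe: alternate exponential holding times with moves of the embedded chain. A minimal sketch (the two-state rates and the deterministic flip dynamics are a hypothetical example of ours, not from the text):

```python
import random
random.seed(0)

def jump_process_path(x, t_end, rates, p_next):
    # Canonical construction of Theorem 8.3.28: embedded chain Z_n moved by
    # p_next, holding time at state z ~ Exponential(rates[z]); rate 0 absorbs.
    z, t, path = x, 0.0, [(0.0, x)]
    while True:
        lam = rates[z]
        if lam == 0.0:
            break                      # T_{k+1} = infinity: stay at z forever
        t += random.expovariate(lam)   # holding time sigma_{k+1}(Z_k)
        if t >= t_end:
            break
        z = p_next(z)
        path.append((t, z))
    return path

# Hypothetical two-state example: rates 1 and 2, deterministic flips 0 <-> 1.
path = jump_process_path(0, 10.0, rates={0: 1.0, 1: 2.0}, p_next=lambda z: 1 - z)
states = [z for _, z in path]
```

The returned list of (jump time, state) pairs determines the step sample function $X_t = Z_k$ on $[T_k, T_{k+1})$.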
Proof. Part I. Existence.
Starting from jump parameters $(\lambda, p)$ and $X_0 = x$, if $P_x(T_\infty < \infty) = 0$ then our construction produces $\{X_t, t \ge 0\}$ which is indistinguishable from a pure jump process and whose parameters coincide with the specified $(\lambda, p)$. So, assuming hereafter with no loss of generality that $T_\infty(\omega) = \infty$ for all $\omega$, we proceed to show that $(X_t, \mathcal{F}^X_t)$ is a homogeneous Markov process. Indeed, since $p_t(x, B) = P_x(X_t \in B)$ is per $t \ge 0$ a transition probability on $(S, \mathcal{S})$, this follows as soon as we show that $P_x(X_{s+u} \in B \,|\, \mathcal{F}^X_s) = P_{X_s}(X_u \in B)$ for any fixed $s, u \ge 0$, $x \in S$ and $B \in \mathcal{S}$.

Turning to prove the latter identity, fix $s, u, x, B$ and note that
$$\{X_u \in B\} = \bigcup_{\ell \ge 0} \{Z_\ell \in B,\ T_{\ell+1} > u \ge T_\ell\}\,,$$
is of the form $\{X_u \in B\} = \{(Z_\ell, T_\ell) \in A_u\}$ where $A_u \in (\mathcal{S} \times \mathcal{B}_{[0,\infty]})^c$. Hence, this event is determined by the law of the homogeneous Markov chain $\{(Z_n, T_n), n \ge 0\}$ on $S \times [0, \infty]$. With $Y_t = \sup\{k \ge 0 : T_k \le t\}$ counting the number of jumps in the interval $[0, t]$, we further have that if $Y_s = k$, then $X_s = Z_k$ and $\{X_{s+u} \in B\} = \{(Z_{k+\ell}, T_{k+\ell} - s) \in A_u\}$. Moreover, since $t \mapsto X_t(\omega)$ is a step function, $\mathcal{F}^X_s = \sigma(Y_s, Z_k, T_k, k \le Y_s)$. Thus, decomposing $\Omega$ as the union of the disjoint events $\{Y_s = k\}$, it suffices to show that under $P_x$, the law of $(Z_{k+\ell}, T_{k+\ell} - s)$ conditional on $\sigma(Z_k, T_k)$ and the event $\{Y_s = k\} = \{\sigma_{k+1}(Z_k) > s - T_k \ge 0\}$, is the same as the law of $(Z_\ell, T_\ell)$ under $P_{Z_k}$. In our construction, given $Z_k \in S$, the random variable $\sigma = \sigma_{k+1}(Z_k) = T_{k+1} - T_k$ is independent of $T_k$ and follows the exponential distribution of parameter $\lambda_{Z_k}$. Hence, setting $\eta = s - T_k \ge 0$, by the lack of memory of this exponential distribution, for any $t, s, k \ge 0$,
$$P_x(T_{k+1} > t + s \,|\, T_k, Z_k, Y_s = k) = P_x(\sigma > t + \eta \,|\, T_k, Z_k, \sigma > \eta) = P(\sigma > t \,|\, Z_k)\,.$$
That is, under $P_x$, the law of $T_{k+1} - s$ conditional on $\mathcal{F}^X_s$ and the event $\{Y_s = k\}$, is the same as the law of $T_1$ under $P_{Z_k}$. With $\{Z_n, n \ge 0\}$ a homogeneous Markov chain whose transition probabilities are independent of $\{T_n, n \ge 0\}$, it follows that further the joint law of $(Z_{k+1}, T_{k+1} - s)$ conditional on $\mathcal{F}^X_s$ and the event $\{Y_s = k\}$ is the same as the joint law of $(Z_1, T_1)$ under $P_{Z_k}$. This completes our proof that $(X_t, \mathcal{F}^X_t)$ is a homogeneous Markov process, since for any $z \in S$, conditional on $\{Z_{k+1} = z\}$ the value of $(Z_{k+1+\ell}, T_{k+1+\ell} - T_{k+1})$ is independent of $T_{k+1}$ and by the Markov property has the same joint law as $(Z_{1+\ell}, T_{1+\ell} - T_1)$ given $Z_1 = z$.
Part II. Uniqueness. Start conversely with a Markov pure jump process (

X
t
, t 0)
such that

X
0
= x and whose jump parameters per Denition 8.3.27 are (, p).
In the sequel we show that with probability one we can embed within its sample
function t

X
t
() a realization of the Markov chain Z
n
, n 0 of transition
probability p(, ), starting at Z
0
= x, such that

X
t
= Z
k
for all t [T
k
, T
k+1
),
k 0 and with T
0
= 0, show that for any k 0, conditionally on Z
j
, T
j
, j k,
the variables
k+1
= T
k+1
T
k
and Z
k+1
are independent of each other, with
k+1
having the exponential distribution of parameter
Z
k
.
8.3. MARKOV AND STRONG MARKOV PROCESSES 333
This of course implies that even conditionally on the innite sequence Z
n
, n 0,
the holding times
k+1
, k 0 are independent of each other, with
k+1
maintain-
ing its exponential distribution of parameter
Z
k
. Further, since t

X
t
() is a
step function (see Denition 8.3.23), necessarily here T

() = for all .
This applies for any non-random x S, thus showing that any Markov pure jump
process can be constructed as in the statement of the theorem, provided (, p) are
such that P
x
(T

< ) = 0 for all x S, with the latter condition also necessary


for (, p) to be the jump parameters of any Markov pure jump process (and a
moment thought will convince you that this completes the proof of the theorem).
Turning to the promised embedding, let T_0 = 0, Z_0 = X_0 = x and T_1 = T_0 + τ_1 for τ_1 = inf{t ≥ 0 : X_t ≠ Z_0}. Recall Proposition 8.3.26 that τ_1 has the exponential distribution of parameter λ_x and is an F^X_t-stopping time. In case λ_x = 0 we are done, for then T_1 = ∞ and X_t(ω) = Z_0(ω) = x for all t ≥ 0. Otherwise, recall Proposition 8.3.26 that T_1 is finite w.p.1., in which case Z_1 = X_{T_1} is well defined and independent of T_1, with the law of Z_1 being p(Z_0, ·). Further, since (X_t, F^X_t) is a strong Markov process (see Proposition 8.3.24) and excluding the null set {ω : T_1(ω) = ∞}, upon applying the strong Markov property at the finite stopping time T_1 we deduce that conditional on F^X_{T_1} the process {X_{T_1+t}, t ≥ 0} is a Markov pure jump process, of the same jump parameters, but now starting at Z_1. We can thus repeat this procedure and w.p.1. construct the sequence τ_{k+1} = inf{t ≥ 0 : X_{t+T_k} ≠ X_{T_k}}, k ≥ 1, where T_{k+1} = T_k + τ_{k+1} are F^X_t-stopping times and Z_k = X_{T_k} (terminating at T_{k+1} = ∞ if λ_{Z_k} = 0). This is the embedding described before, for indeed X_t = Z_k for all t ∈ [T_k, T_{k+1}), the sequence {Z_n, n ≥ 0} has the law of a homogeneous Markov chain of transition probability p(·, ·) (starting at Z_0 = x), and conditionally on σ(Z_j, T_j, j ≤ k) ⊇ F^X_{T_k}, the variables τ_{k+1} and Z_{k+1} are independent of each other, with τ_{k+1} having the exponential distribution of parameter λ_{Z_k}.
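The embedding just described doubles as a simulation recipe: draw the embedded chain one step at a time, attaching to each visited state an independent Exponential(λ_x) holding time. Below is a minimal sketch of this construction for a finite state space; the rate vector, transition matrix and time horizon are illustrative choices, not taken from the text.

```python
import random

def simulate_jump_process(x0, rates, p, t_max, rng=random):
    """Simulate a Markov pure jump process on states 0..n-1 up to time t_max.

    rates[x] is the jump rate lambda_x; p[x][y] is the jump transition
    probability (with p[x][x] = 0).  Returns the jump times T_k and the
    embedded chain Z_k, so that X_t = Z_k for t in [T_k, T_{k+1}).
    """
    t, z = 0.0, x0
    times, states = [0.0], [x0]
    while True:
        lam = rates[z]
        if lam == 0:                       # absorbing state: T_{k+1} = infinity
            break
        t += rng.expovariate(lam)          # holding time ~ Exp(lambda_z)
        if t > t_max:
            break
        z = rng.choices(range(len(p[z])), weights=p[z])[0]  # Z_{k+1} ~ p(z, .)
        times.append(t)
        states.append(z)
    return times, states

# toy 3-state example (illustrative parameters)
rates = [1.0, 2.0, 0.5]
p = [[0.0, 0.7, 0.3],
     [0.5, 0.0, 0.5],
     [1.0, 0.0, 0.0]]
T, Z = simulate_jump_process(0, rates, p, t_max=10.0)
```

Absorbing states (λ_x = 0) correspond to T_{k+1} = ∞, which the sketch handles by stopping the loop.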
Remark 8.3.29. When the jump rates λ_x = λ are constant, the corresponding jump times T_k are those of a Poisson process N_t of rate λ, which is independent of the Markov chain {Z_n}. Hence, in this case the Markov jump process has the particularly simple structure X_t = Z_{N_t}.

Here is a more explicit, equivalent condition for the existence of a Markov pure jump process with the specified jump parameters (λ, p). It implies in particular that such a process exists whenever the jump rates are bounded (i.e. sup_x λ_x finite).

Exercise 8.3.30. Suppose (λ, p) are jump parameters on the B-isomorphic state space (S, 𝒮).
(a) Show that P_x(T_∞ < ∞) = 0 if and only if P_x(∑_n λ_{Z_n}^{−1} < ∞) = 0.
Hint: Upon conditioning on {Z_n} consider part (d) of Exercise 2.3.24.
(b) Conclude that to any jump parameters p(·, ·) and λ ∈ b𝒮 corresponds a well defined, unique Markov jump process, constructed as in Theorem 8.3.28.

Remark. The event {ω : T_∞(ω) < ∞} is often called an explosion. A further distinction can then be made between the pure (or non-explosive) Markov jump processes we consider here, and the explosive Markov jump processes such that P_x(T_∞ < ∞) > 0 for some x ∈ S, which nevertheless can be constructed as in Theorem 8.3.28 to have step sample functions, but only up to the time T_∞ of explosion.
Example 8.3.31 (birth processes). Markov (jump) processes which are also counting processes are called birth processes. The state space of such processes is S = {0, 1, 2, . . .} and in view of Theorem 8.3.28 they correspond to jump transitions p(x, x + 1) = 1. Specifically, these processes are of the form X_t = sup{k ≥ 0 : ∑_{j=X_0}^{k−1} τ_j ≤ t}, where the holding times τ_j, j ≥ 1, are independent Exponential(λ_j) random variables. In view of Exercise 8.3.30 such processes are non-explosive if and only if ∑_{j≥k} λ_j^{−1} = ∞ for all k ≥ 0. For example, this is the case when λ_j = jλ + λ_0 with λ_0 ≥ 0 and λ > 0, and such a process is then called a simple birth with immigration process if also λ_0 > 0, or merely a simple birth process if λ_0 = 0 (in contrast, the Poisson process corresponds to λ = 0 and λ_0 > 0). The latter processes serve in modeling the growth in time of a population composed of individuals who independently give birth at rate λ (following an exponentially distributed holding time between consecutive birth events), with additional immigration into the population at rate λ_0, independently of birth events.

Remark. In the context of Example 8.3.31, E_x T_k = ∑_{j=x}^{x+k−1} λ_j^{−1} for the arrival time T_k to state k + x, so taking for example λ_j = j^β for some β > 1 results with an explosive Markov jump process. Indeed, then E_x T_k ≤ c for finite c = ∑_{j≥1} j^{−β} and any x, k ≥ 1. By monotone convergence E_x T_∞ ≤ c, so within an integrable, hence a.s. finite time T_∞ the sample function t ↦ X_t escapes to infinity, hence the name explosion given to such phenomena. But, observe that unbounded jump rates do not necessarily imply an explosion (as for example, in case of simple birth processes), and explosion may occur for one initial state but not for another (for example, here λ_0 = 0 so there is no explosion if starting at x = 0).
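The comparison in this remark is easy to see numerically: for λ_j = j^β with β > 1 the expected arrival times E_1 T_k = ∑_{j=1}^{k} j^{−β} stay bounded, whereas for the simple birth rates λ_j = j they grow like the harmonic sum. A quick sketch (β = 2 is an arbitrary instance of the remark's condition):

```python
def expected_arrival_time(rates, k, x=1):
    """E_x[T_k] = sum of 1/lambda_j for j = x, ..., x+k-1 (mean holding times)."""
    return sum(1.0 / rates(j) for j in range(x, x + k))

# explosive: lambda_j = j**2 (beta = 2 > 1); E_1[T_k] stays below pi^2/6
explosive = [expected_arrival_time(lambda j: j**2, k) for k in (10, 100, 1000)]

# non-explosive simple birth: lambda_j = j; E_1[T_k] grows like log k
simple_birth = [expected_arrival_time(lambda j: float(j), k) for k in (10, 100, 1000)]
```

With β = 2 the values approach π²/6 ≈ 1.645 (so E_1 T_∞ is finite and the process explodes), while the simple-birth column diverges logarithmically.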
As you are to verify now, the jump parameters characterize the relatively explicit generator for the semi-group of a Markov jump process, which in particular satisfies Kolmogorov's forward (in case of bounded jump rates) and backward equations.

Definition 8.3.32. The linear operator L : b𝒮 ↦ m𝒮 such that (Lh)(x) = λ_x ∫ (h(y) − h(x)) p(x, dy) for h ∈ b𝒮 is called the generator of the Markov jump process corresponding to jump parameters (λ, p). In particular, (L I_{{x}^c})(x) = λ_x and more generally (L I_B)(x) = λ_x p(x, B) for any B ⊆ {x}^c (so specifying the generator is in this context equivalent to specifying the jump parameters).
Exercise 8.3.33. Consider a Markov jump process {X_t, t ≥ 0} of semi-group p_t(·, ·) and jump parameters (λ, p) as in Definition 8.3.32. Let T_k = ∑_{j=1}^k τ_j denote the jump times of the sample function s ↦ X_s(ω) and Y_t = ∑_{k≥1} I_{{T_k ≤ t}} the number of such jumps in the interval [0, t].
(a) Show that if λ_x > 0 then

P_x(τ_2 ≤ t | τ_1) = ∫ (1 − e^{−λ_y t}) p(x, dy),

and deduce that t^{−1} P_x(Y_t ≥ 2) → 0 as t ↓ 0, for any x ∈ S.
(b) Fixing x ∈ S and h ∈ b𝒮, show that

|(p_s h)(x) − (p_0 h)(x) − E_x[(h(X_{τ_1}) − h(x)) I_{{τ_1 ≤ s}}]| ≤ 2‖h‖_∞ P_x(Y_s ≥ 2),

and deduce that for L per Definition 8.3.32,

(8.3.20)   lim_{s↓0} s^{−1}((p_s h)(x) − (p_0 h)(x)) = (Lh)(x),

where the convergence in (8.3.20) is uniform over {h : ‖h‖_∞ ≤ 1}.
(c) Verify that t ↦ (L p_t h)(x) is continuous and t ↦ (p_t h)(x) is differentiable for any x ∈ S, h ∈ b𝒮, t ≥ 0, and conclude that the backward Kolmogorov equation holds. Specifically, show that

(8.3.21)   ∂/∂t (p_t h)(x) = (L p_t h)(x)   ∀t ≥ 0, x ∈ S, h ∈ b𝒮.

(d) Show that if sup_{x∈S} λ_x is finite, then L : b𝒮 ↦ b𝒮, the convergence in (8.3.20) is also uniform in x, and Kolmogorov's forward equation (also known as the Fokker-Planck equation) holds. That is,

(8.3.22)   ∂/∂t (p_t h)(x) = (p_t (Lh))(x)   ∀t ≥ 0, x ∈ S, h ∈ b𝒮.
Remark. Exercise 8.3.33 relates the Markov semi-group with the corresponding jump parameters, showing that a Markov semi-group p_t(·, ·) corresponds to a Markov jump process only if for any x ∈ S, the limit

(8.3.23)   lim_{t↓0} t^{−1}(1 − p_t(x, {x})) = λ_x

exists, is finite and 𝒮-measurable. Moreover, necessarily then also

(8.3.24)   lim_{t↓0} t^{−1} p_t(x, B) = λ_x p(x, B)   ∀B ⊆ {x}^c,

for some transition probability p(·, ·). Recall Theorem 8.3.28 that with the exception of possible explosion, the converse applies, namely whenever (8.3.23) and (8.3.24) hold, the semi-group p_t(·, ·) corresponds to a (possibly explosive) Markov jump process. We note in passing that while Kolmogorov's backward equation (8.3.21) is well defined for any jump parameters, the existence of a solution which is a Markov semi-group is equivalent to non-explosion of the corresponding Markov jump process.

In particular, in case of bounded jump rates the conditions (8.3.23) and (8.3.24) are equivalent to the Markov process being a Markov pure jump process, and in this setting you are now to characterize the invariant measures for the Markov jump process in terms of its jump parameters (or equivalently, in terms of its generator).

Exercise 8.3.34. Suppose (λ, p) are jump parameters on a B-isomorphic state space (S, 𝒮) such that sup_{x∈S} λ_x is finite.
(a) Show that a probability measure ν is invariant for the corresponding Markov jump process if and only if ν(Lh) = 0 for the generator L : b𝒮 ↦ b𝒮 of these jump parameters and all h ∈ b𝒮.
Hint: Combine Exercises 8.3.9 and 8.3.33 (utilizing the boundedness of x ↦ (Lh)(x)).
(b) Deduce that ν is an invariant probability measure for (λ, p) if and only if (λν)p = λν, where λν denotes the measure A ↦ ∫_A λ_x ν(dx).
In particular, the invariant probability measures of a Markov jump process with constant jump rates are precisely the invariant measures for its jump transition probability.
Of particular interest is the following special family of Markov jump processes.

Definition 8.3.35. Real-valued Markov pure jump processes with a constant jump rate λ whose jump transition probability is of the form p(x, B) = P_ξ({z : x + z ∈ B}) for some law P_ξ on (ℝ, B) are called compound Poisson processes. Recall Remark 8.3.29 that a compound Poisson process is of the form X_t = S_{N_t} for a random walk S_n = S_0 + ∑_{k=1}^n ξ_k with i.i.d. ξ_k of law P_ξ, which are independent of the Poisson process N_t of rate λ.

Remark. The random telegraph signal R_t = (−1)^{N_t} R_0 of Example 7.2.14 is a Markov jump process on S = {−1, 1} with constant jump rate λ, which is not a compound Poisson process (as its transition probabilities p(−1, {1}) = p(1, {−1}) = 1 do not correspond to a random walk).
As we see next, compound Poisson processes retain many of the properties of the Poisson process.

Proposition 8.3.36. A compound Poisson process {X_t, t ≥ 0} has stationary, independent increments, and the characteristic function of its Markov semi-group p_t(x, ·) is

(8.3.25)   E_x[e^{iθX_t}] = e^{iθx + λt(Φ_ξ(θ)−1)},

where Φ_ξ(θ) denotes the characteristic function of the corresponding jump sizes ξ_k.

Proof. We start by proving that {X_t, t ≥ 0} has independent increments, where by Exercise 7.1.12 it suffices to fix 0 = t_0 < t_1 < t_2 < ··· < t_n and show that the random variables D_i = X_{t_i} − X_{t_{i−1}}, i = 1, . . . , n, are mutually independent. To this end, note that N_{t_0} = 0 and conditional on the event {N_{t_i} = m_i} for m_i = ∑_{j=1}^i r_j and fixed r = (r_1, . . . , r_n) ∈ ℤ^n_+, we have that D_i = ∑_{k=m_{i−1}+1}^{m_i} ξ_k are mutually independent, with D_i then having the same distribution as the random walk S_{r_i} starting at S_0 = 0. So, for any f_i ∈ bB, by the tower property and the mutual independence of N_{t_i} − N_{t_{i−1}}, 1 ≤ i ≤ n,

E_x[∏_{i=1}^n f_i(D_i)] = E[E_x(∏_{i=1}^n f_i(D_i) | F^N)]
= ∑_{r∈ℤ^n_+} ∏_{i=1}^n [ P(N_{t_i} − N_{t_{i−1}} = r_i) E_0[f_i(S_{r_i})] ]
= ∏_{i=1}^n [ ∑_{r_i=0}^∞ P(N_{t_i} − N_{t_{i−1}} = r_i) E_0[f_i(S_{r_i})] ] = ∏_{i=1}^n E_x[f_i(D_i)],

yielding the mutual independence of D_i, i = 1, . . . , n.

We have just seen that for each t > s the increment X_t − X_s has under P_x the same law as S_{N_t − N_s} has under P_0. Since N_t − N_s =_D N_{t−s}, it follows by the independence of the random walk {S_r} and the Poisson process {N_t, t ≥ 0} that P_x(X_t − X_s ∈ ·) = P_0(S_{N_{t−s}} ∈ ·) depends only on t − s, which by Definition 7.3.11 amounts to {X_t, t ≥ 0} having stationary increments.

Finally, the identity (3.3.3) extends to E[z^{N_t}] = exp(λt(z − 1)) for N_t having a Poisson distribution of parameter λt and any complex variable z. Thus, as E_x[e^{iθS_r}] = e^{iθx} Φ_ξ(θ)^r (see Lemma 3.3.8), utilizing the independence of {S_r} from N_t, we conclude that

E_x[e^{iθX_t}] = E[E_x(e^{iθS_{N_t}} | N_t)] = e^{iθx} E[Φ_ξ(θ)^{N_t}] = e^{iθx + λt(Φ_ξ(θ)−1)},

for all t ≥ 0 and x, θ ∈ ℝ, as claimed.
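Formula (8.3.25) is easy to confirm by simulation: draw N_t from the Poisson distribution, sum N_t i.i.d. jumps, and compare the empirical characteristic function with exp(λt(Φ_ξ(θ) − 1)). A minimal sketch with standard normal jump sizes, so Φ_ξ(θ) = e^{−θ²/2} (all parameter choices are illustrative):

```python
import cmath, math, random

def compound_poisson_sample(lam, t, jump, rng):
    """One draw of X_t = sum_{k <= N_t} xi_k for N_t ~ Poisson(lam * t), X_0 = 0."""
    # Poisson draw via exponential inter-arrival times
    n, s = 0, rng.expovariate(lam)
    while s <= t:
        n += 1
        s += rng.expovariate(lam)
    return sum(jump(rng) for _ in range(n))

rng = random.Random(0)
lam, t, theta = 2.0, 1.5, 0.7
samples = [compound_poisson_sample(lam, t, lambda r: r.gauss(0, 1), rng)
           for _ in range(20000)]
empirical = sum(cmath.exp(1j * theta * x) for x in samples) / len(samples)
phi_xi = math.exp(-theta**2 / 2)            # char. fn of N(0,1) jump sizes
exact = cmath.exp(lam * t * (phi_xi - 1))   # formula (8.3.25) with x = 0
```

With 20000 samples the empirical value typically agrees with exp(λt(Φ_ξ(θ) − 1)) to within a few hundredths.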
Exercise 8.3.37. Let {X_t, t ≥ 0} be a compound Poisson process of jump rate λ.
(a) Show that if the corresponding jump sizes ξ_k are square integrable then E_0 X_t = λt Eξ_1 and Var(X_t) = λt Eξ_1².
(b) Show that if Eξ_1 = 0 then {X_t, t ≥ 0} is a martingale. More generally, u_0(t, X_t, θ) is a martingale for u_0(t, y, θ) = exp(θy − λt(M_ξ(θ) − 1)) and any θ ∈ ℝ for which the moment generating function M_ξ(θ) = E[e^{θξ_1}] is finite.
Here is the analog for compound Poisson processes of the thinning of Poisson variables.

Proposition 8.3.38. Suppose {X_t, t ≥ 0} is a compound Poisson process of jump rate λ and jump size law P_ξ. Fixing a disjoint finite partition of ℝ \ {0} to Borel sets B_j, j = 1, . . . , m, consider the decomposition X_t = X_0 + ∑_{j=1}^m X_t^{(j)} in terms of the contributions

X_t^{(j)} = ∑_{k=1}^{N_t} ξ_k I_{B_j}(ξ_k)

to X_t by jumps whose size belongs to B_j. Then {X_t^{(j)}, t ≥ 0} for j = 1, . . . , m are independent compound Poisson processes of jump rates λ^{(j)} = λ P(ξ ∈ B_j) and i.i.d. jump sizes ξ_k^{(j)} such that P(ξ^{(j)} ∈ ·) = P(ξ ∈ · | ξ ∈ B_j), starting at X_0^{(j)} = 0.
Proof. While one can directly prove this result along the lines of Exercise 3.4.16, we resort to an indirect alternative, whereby we set X̂_t = X_0 + ∑_{j=1}^m Y_t^{(j)} for the independent compound Poisson processes Y_t^{(j)} of jump rates λ^{(j)} and i.i.d. jump sizes ξ_k^{(j)}, starting at Y_0^{(j)} = 0. By construction, X̂_t is a pure jump process whose jump times T_k(ω) are contained in the union over j = 1, . . . , m of the isolated jump times T_k^{(j)}(ω) of t ↦ Y_t^{(j)}(ω). Recall that each T_k^{(j)} has the gamma density of parameters α = k and λ^{(j)} (see Exercise 1.4.46 and Definition 3.4.8). Therefore, by the independence of {Y_t^{(j)}, t ≥ 0}, w.p.1. no two jump times among {T_k^{(j)}, j, k ≥ 1} are the same, in which case X̂_t^{(j)} = Y_t^{(j)} for all j and t ≥ 0 (as the jump sizes of each Y_t^{(j)} are in the disjoint element B_j of the specified finite partition of ℝ \ {0}). With the ℝ^m-valued process (X̂^{(1)}, . . . , X̂^{(m)}) being indistinguishable from (Y^{(1)}, . . . , Y^{(m)}), it thus suffices to show that {X̂_t, t ≥ 0} is a compound Poisson process of the specified jump rate λ and jump size law P_ξ.

To this end, recall Proposition 8.3.36 that each of the processes Y_t^{(j)} has stationary independent increments and due to their independence, the same applies for {X̂_t, t ≥ 0}, which is thus a real-valued homogeneous Markov process (see Proposition 8.3.5). Next, note that λ = ∑_{j=1}^m λ^{(j)} and for all θ ∈ ℝ,

∑_{j=1}^m λ^{(j)} Φ_{ξ^{(j)}}(θ) = ∑_{j=1}^m λ E[e^{iθξ} I_{B_j}(ξ)] = λ Φ_ξ(θ),

so we have from (8.3.25) and Lemma 3.3.8 that for any θ ∈ ℝ,

E_x[e^{iθX̂_t}] = e^{iθx} ∏_{j=1}^m E[e^{iθY_t^{(j)}}] = e^{iθx} ∏_{j=1}^m e^{λ^{(j)} t(Φ_{ξ^{(j)}}(θ)−1)} = e^{iθx + λt(Φ_ξ(θ)−1)} = E_x[e^{iθX_t}].

That is, denoting by p_t(·, ·) and p̂_t(·, ·) the Markov semi-groups of {X_t, t ≥ 0} and {X̂_t, t ≥ 0} respectively, we found that per fixed x ∈ ℝ and t ≥ 0 the transition probabilities p_t(x, ·) and p̂_t(x, ·) have the same characteristic function. Consequently, by Levy's inversion theorem p_t(·, ·) = p̂_t(·, ·) for all t ≥ 0, i.e., these two semi-groups are identical. Obviously, this implies that the Markov pure jump processes X_t and X̂_t have the same jump parameters (see (8.3.23) and (8.3.24)), so as claimed {X̂_t, t ≥ 0} is a compound Poisson process of jump rate λ and jump size law P_ξ.
As in the case of Markov chains, the jump transition probability of a Markov jump process with countable state space S is of the form p(x, A) = ∑_{y∈A} p(x, y). In this case, accessibility and intercommunication of states, as well as irreducible, transient and recurrent classes of states, are defined according to the transition probability p(x, y) and obey the relations explored already in Subsection 6.2.1. Moreover, as you are to check next, Kolmogorov's equations (8.3.21) and (8.3.22) are more explicit in this setting.

Exercise 8.3.39. Suppose (λ, p) are the parameters of a Markov jump process on a countable state space S.
(a) Check that p_s(x, z) = P_x(X_s = z) are then the solution of the countable system of linear ODEs

dp_s(x, z)/ds = ∑_{y∈S} q(x, y) p_s(y, z)   ∀s ≥ 0, x, z ∈ S,

starting at p_0(x, z) = I_{x=z}, where q(x, x) = −λ_x and q(x, y) = λ_x p(x, y) for x ≠ y.
(b) Show that p_s(x, z) must also satisfy the corresponding forward equation

dp_s(x, z)/ds = ∑_{y∈S} p_s(x, y) q(y, z)   ∀s ≥ 0, x, z ∈ S.

(c) In case S is a finite set, show that the matrix P_s of entries p_s(x, z) is given by P_s = e^{sQ} = ∑_{k=0}^∞ (s^k/k!) Q^k, where Q is the matrix of entries q(x, y).
The formula P_s = e^{sQ} explains why Q, and more generally L, is called the generator of the semi-group P_s.
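For a finite state space, part (c) invites a numerical check: assemble Q from (λ, p), sum the exponential series, and verify that each P_s is a stochastic matrix satisfying the semi-group property P_{s+u} = P_s P_u. A minimal sketch with illustrative rates (the naive series truncation is adequate here because ‖sQ‖ is small; a robust implementation would use a library matrix exponential instead):

```python
import numpy as np

def generator(rates, p):
    """Q with q(x,x) = -lambda_x and q(x,y) = lambda_x p(x,y) for y != x."""
    Q = np.array(rates)[:, None] * np.array(p)
    np.fill_diagonal(Q, -np.array(rates))
    return Q

def expm_series(A, terms=60):
    """Matrix exponential via the truncated series sum_k A^k / k!."""
    out, term = np.eye(A.shape[0]), np.eye(A.shape[0])
    for k in range(1, terms):
        term = term @ A / k
        out += term
    return out

# toy 3-state chain (illustrative parameters)
rates = [1.0, 2.0, 0.5]
p = [[0.0, 0.7, 0.3],
     [0.5, 0.0, 0.5],
     [1.0, 0.0, 0.0]]
Q = generator(rates, p)
P1 = expm_series(Q)        # P_s at s = 1
P2 = expm_series(2 * Q)    # P_s at s = 2
```

Since p(x, x) = 0 and the rows of p sum to one, each row of Q sums to zero, which is what makes every P_s = e^{sQ} stochastic.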
From Exercise 8.3.34 we further deduce that, at least for bounded jump rates, an invariant probability measure ν for the Markov jump process is uniquely determined by the function ν : S → [0, 1] such that ∑_x ν(x) = 1 and

(8.3.26)   λ_y ν(y) = ∑_{x∈S} ν(x) λ_x p(x, y)   ∀y ∈ S.

For constant positive jump rates this condition coincides with the characterization (6.2.5) of invariant probability measures for the jump transition probability. Consequently, for such jump processes the invariant, reversible and excessive measures as well as positive and null recurrent states are defined as the corresponding objects for the jump transition probability and obey the relations explored already in Subsection 6.2.2.

Remark. While we do not pursue this further, we note in passing that more generally, a measure ν(·) is reversible for a Markov jump process with countable state space S if and only if λ_y ν(y) p(y, x) = ν(x) λ_x p(x, y) for any x, y ∈ S (so any reversible probability measure is by (8.3.26) invariant for the Markov jump process). Similarly, in general we call x ∈ S with λ_x = 0 an absorbing, hence positive recurrent, state and say that a non-absorbing state is positive recurrent if it has finite mean return time. That is, if E_x T_x < ∞ for the first return time T_x = inf{t ≥ τ_1 : X_t = x} to state x. It can then be shown, in analogy with Proposition 6.2.41, that any invariant probability measure ν(·) is zero outside the positive recurrent states, and if its support is an irreducible class R of non-absorbing positive recurrent states, then ν(z) = 1/(λ_z E_z[T_z]) (see [GS01, Section 6.9] for more details).

To practice your understanding, the next exercise explores in more depth the important family of birth and death Markov jump processes (or in short, birth and death processes).
Exercise 8.3.40 (Birth and death processes). A birth and death process is a Markov jump process X_t on S = {0, 1, 2, . . .} for which {Z_n} is a birth and death chain. That is, p(x, x + 1) = p_x = 1 − p(x, x − 1) for all x ∈ S (where of course p_0 = 1). Assuming λ_x > 0 for all x and p_x ∈ (0, 1) for all x > 0, let

γ(k) = (λ_0/λ_k) ∏_{i=1}^k p_{i−1}/(1 − p_i).

Show that X_t is irreducible and has an invariant probability measure if and only if c = ∑_{k≥0} γ(k) is finite, in which case its invariant measure is ν(k) = γ(k)/c.
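For concrete rates the recipe is short: compute γ(k), normalize, and (as a sanity check) verify the balance relation (8.3.26). The sketch below uses M/M/1-style parameters, births at rate a and deaths at rate b, so λ_0 = a, λ_x = a + b and p_x = a/(a + b) for x ≥ 1; all values are illustrative:

```python
def gamma_bd(rates, p_up, k):
    """gamma(k) = (lambda_0/lambda_k) * prod_{i=1..k} p_{i-1}/(1 - p_i)."""
    g = rates(0) / rates(k)
    for i in range(1, k + 1):
        g *= p_up(i - 1) / (1.0 - p_up(i))
    return g

# M/M/1-style rates: births at rate a, deaths at rate b (illustrative)
a, b = 1.0, 2.0
rates = lambda x: a if x == 0 else a + b
p_up = lambda x: 1.0 if x == 0 else a / (a + b)

K = 200                                    # truncation level for the sum over k
g = [gamma_bd(rates, p_up, k) for k in range(K)]
c = sum(g)                                 # here c is essentially 1/(1 - a/b) = 2
nu = [gk / c for gk in g]
```

Here γ(k) = (a/b)^k, so ν(k) = (1 − a/b)(a/b)^k, the familiar geometric law of the M/M/1 queue.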
The next exercise deals with independent random sampling along the path of a Markov pure jump process.

Exercise 8.3.41. Let Y_k = X_{T̂_k}, k = 0, 1, . . ., where T̂_k = ∑_{i=1}^k σ_i and the i.i.d. σ_i ≥ 0 are independent of the Markov pure jump process {X_t, t ≥ 0}.
(a) Show that {Y_k} is a homogeneous Markov chain and verify that any invariant probability measure ν for {X_t} is also an invariant measure for {Y_k}.
(b) Show that in case of constant jump rates λ_x = λ and each σ_i having the exponential distribution of parameter λ̂ > 0, one has the representation Y_k = Z_{L_k} of sampling the embedded chain {Z_n} at L_k = ∑_{i=1}^k (η_i − 1) for i.i.d. η_i ≥ 1, each having the Geometric distribution of success probability p = λ̂/(λ + λ̂).
(c) Conclude that if T̂_k are the jump times of a Poisson process of rate λ̂ > 0 which is independent of the compound Poisson process X_t, then {Y_k} is a random walk, the increment of which has the law of ∑_{i=1}^{η_1−1} ξ_i.
Compare your next result with part (a) of Exercise 8.2.45.

Exercise 8.3.42. Suppose {X_t, t ≥ 0} is a real-valued Markov pure jump process, with 0 = T_0 < T_1 < T_2 < ··· denoting the jump times of its sample function. Show that for any q > 0 its finite q-th variation V^{(q)}(X)_t exists, and is given by

V^{(q)}(X)_t = ∑_{k≥1} I_{{T_k ≤ t}} |X_{T_k} − X_{T_{k−1}}|^q.
CHAPTER 9
The Brownian motion
The Brownian motion is the most fundamental continuous time stochastic process. We have seen already in Section 7.3 that it is a Gaussian process of continuous sample functions and independent, stationary increments. In addition, it is a martingale of the type considered in Section 8.2 and has the strong Markov property of Section 8.3. Having all these beautiful properties allows for a rich mathematical theory. For example, many probabilistic computations involving the Brownian motion can be made explicit by solving partial differential equations. Further, the Brownian motion is the cornerstone of diffusion theory and of stochastic integration. As such it is the most fundamental object in applications to and modeling of natural and man-made phenomena.

This chapter deals with some of the most interesting properties of the Brownian motion. Specifically, in Section 9.1 we use stopping time, Markov and martingale theory to study path properties of this process, focusing on passage times and running maxima. Expressing in Section 9.2 random walks and discrete time MGs as time-changed Brownian motion, we prove Donsker's celebrated invariance principle. It then provides fundamental results about these discrete time S.P.-s, such as the law of the iterated logarithm (in short, lil), and the martingale clt. Finally, the fascinating aspects of the (lack of) regularity of the Brownian sample path are the focus of Section 9.3.
9.1. Brownian transformations, hitting times and maxima
We start with a few elementary path transformations under which the Wiener process of Definition 7.3.12 is invariant (see also Figure 2 illustrating its sample functions).
Exercise 9.1.1. For {W_t, t ≥ 0} a standard Wiener process, show that the S.P.-s Ŵ_t^{(i)}, i = 1, . . . , 6, are also standard Wiener processes.
(a) (Symmetry) Ŵ_t^{(1)} = −W_t, t ≥ 0.
(b) (Time-homogeneity) Ŵ_t^{(2)} = W_{T+t} − W_T, t ≥ 0, with T > 0 a non-random constant.
(c) (Time-reversal) Ŵ_t^{(3)} = W_T − W_{T−t}, for t ∈ [0, T], with T > 0 a non-random constant.
(d) (Scaling) Ŵ_t^{(4)} = α^{−1/2} W_{αt}, t ≥ 0, with α > 0 a non-random constant.
(e) (Time-inversion) Ŵ_t^{(5)} = tW_{1/t} for t > 0 and Ŵ_0^{(5)} = 0.
(f) (Averaging) Ŵ_t^{(6)} = ∑_{k=1}^n c_k W_t^{(k)}, t ≥ 0, where W_t^{(k)} are independent copies of the Wiener process and c_k non-random such that ∑_{k=1}^n c_k² = 1.
(g) Show that Ŵ_t^{(2)} and Ŵ_t^{(3)} are independent Wiener processes and evaluate q_t = P_x(W_T > W_{T−t} > W_{T+t}), where t ∈ [0, T].
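A centered Gaussian process of continuous sample functions is a standard Wiener process precisely when Cov(Ŵ_s, Ŵ_t) = s ∧ t, so parts (a)-(f) largely reduce to covariance computations. The following sketch records these computations for the scaling and time-inversion maps (the evaluation points are arbitrary):

```python
def bm_cov(s, t):
    """Covariance of standard Brownian motion: Cov(W_s, W_t) = min(s, t)."""
    return min(s, t)

def scaled_cov(s, t, alpha):
    """Cov of alpha^{-1/2} W_{alpha s} and alpha^{-1/2} W_{alpha t} (part (d))."""
    return bm_cov(alpha * s, alpha * t) / alpha

def inverted_cov(s, t):
    """Cov of s W_{1/s} and t W_{1/t}, for s, t > 0 (part (e))."""
    return s * t * bm_cov(1 / s, 1 / t)

pairs = [(0.3, 1.7), (2.0, 2.0), (5.0, 0.1)]
ok_scaling = all(abs(scaled_cov(s, t, 3.7) - bm_cov(s, t)) < 1e-12 for s, t in pairs)
ok_inversion = all(abs(inverted_cov(s, t) - bm_cov(s, t)) < 1e-12 for s, t in pairs)
```

Indeed, min(αs, αt)/α = min(s, t) and st·min(1/s, 1/t) = st/max(s, t) = min(s, t), so both transformed processes have the Brownian covariance.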
Remark. These invariance transformations are extensively used in the study of the Wiener process. As a token demonstration, note that since time-inversion maps L_{a,b} = sup{t ≥ 0 : W_t ∉ (−at, bt)} to the stopping time τ_{a,b} of Exercise 8.2.35, it follows that L_{a,b} is a.s. finite and P(W_{L_{a,b}} = bL_{a,b}) = a/(a + b).
Recall Exercise 8.3.20 (or Exercise 8.3.22) that any Brownian Markov process (W_t, F_t) is a strong Markov process, yielding the following consequence of Corollary 8.3.14 (and of the identification of the Borel σ-algebra of C([0, ∞)) as the restriction of the cylindrical σ-algebra B^{[0,∞)} to C([0, ∞)); see Exercise 7.2.9).

Corollary 9.1.2. If (W_t, F_t) is a Brownian Markov process, then {W_t, t ≥ 0} is a homogeneous F_{t+}-Markov process and further, for any s ≥ 0 and Borel measurable functional h : C([0, ∞)) → ℝ, almost surely

(9.1.1)   E[h(W_·) | F_{s+}] = E[h(W_·) | F_s].
From this corollary and the Brownian time-inversion property we further deduce both Blumenthal's 0-1 law about the P_x-triviality of the σ-algebra F^W_{0+} and its analog about the P_x-triviality of the tail σ-algebra of the Wiener process (compare the latter with Kolmogorov's 0-1 law). To this end, we first extend the definition of the tail σ-algebra, as in Definition 1.4.9, to continuous time S.P.-s.

Definition 9.1.3. Associate with any continuous time S.P. {X_t, t ≥ 0} the canonical future σ-algebras 𝒯^X_t = σ(X_s, s ≥ t), with the corresponding tail σ-algebra of the process being 𝒯^X = ∩_{t≥0} 𝒯^X_t.

Proposition 9.1.4 (Blumenthal's 0-1 law). Let P_x denote the law of the Wiener process {W_t, t ≥ 0} starting at W_0 = x (identifying (Ω, F^W) with C([0, ∞)) and its Borel σ-algebra). Then, P_x(A) ∈ {0, 1} for each A ∈ F^W_{0+} and x ∈ ℝ. Further, if A ∈ 𝒯^W then either P_x(A) = 0 for all x or P_x(A) = 1 for all x.
Proof. Applying Corollary 9.1.2 for the Wiener process starting at W_0 = x and its canonical filtration, we have by the P_x-triviality of F^W_0 that for each A ∈ F^W_{0+},

I_A = E_x[I_A | F^W_{0+}] = E_x[I_A | F^W_0] = P_x(A)   P_x-a.s.

Hence, P_x(A) ∈ {0, 1}. Proceeding to prove our second claim, set X_0 = 0 and X_t = tW_{1/t} for t > 0, noting that {X_t, t ≥ 0} is a standard Wiener process (see part (e) of Exercise 9.1.1). Further, 𝒯^W_t = F^X_{1/t} for any t > 0, hence

𝒯^W = ∩_{t>0} 𝒯^W_t = ∩_{t>0} F^X_{1/t} = F^X_{0+}.

Consequently, applying our first claim for the canonical filtration of the standard Wiener process {X_t} we see that P_0(A) ∈ {0, 1} for any A ∈ F^X_{0+} = 𝒯^W. Moreover, since A ∈ 𝒯^W ⊆ 𝒯^W_1, it is of the form I_A = I_D ∘ θ_1 for some D ∈ F^W, so by the tower and Markov properties,

P_x(A) = E_x[(I_D ∘ θ_1)(ω)] = E_x[P_{W_1}(D)] = ∫ p_1(x, y) P_y(D) dy,

for the strictly positive Brownian transition kernel p_1(x, y) = e^{−(x−y)²/2}/√(2π). If P_0(A) = 0 then necessarily P_y(D) = 0 for Lebesgue almost every y, hence also P_x(A) = 0 for all x ∈ ℝ. Conversely, if P_0(A) = 1 then P_0(A^c) = 0 and with A^c ∈ 𝒯^W, by the preceding argument 1 − P_x(A) = P_x(A^c) = 0 for all x ∈ ℝ.
9.1. BROWNIAN TRANSFORMATIONS, HITTING TIMES AND MAXIMA 343
Blumenthals 0-1 law is very useful in determining properties of the Brownian
sample function in the limits t 0 and t . Here are few of its many consequences.
Corollary 9.1.5. Let
0
+ = inft 0 : W
t
> 0,
0
= inft 0 : W
t
< 0 and
T
0
= inft > 0 : W
t
= 0. Then, P
0
(
0
+ = 0) = P
0
(
0
= 0) = P
0
(T
0
= 0) = 1
and w.p.1. the standard Wiener process changes sign innitely many times in any
time interval [0, ], > 0. Further, for any x R, with P
x
-probability one,
limsup
t
1

t
W
t
= , liminf
t
1

t
W
t
= , W
un
= 0 for some u
n
() .
Proof. Since P_0(τ_{0+} ≤ t) ≥ P_0(W_t > 0) = 1/2 for all t > 0, also P_0(τ_{0+} = 0) ≥ 1/2. Further, τ_{0+} is an F^W_t-Markov time (see Proposition 8.1.15). Hence, {τ_{0+} = 0} = {τ_{0+} ≤ 0} ∈ F^W_{0+} and from Blumenthal's 0-1 law it follows that P_0(τ_{0+} = 0) = 1. By the symmetry property of the standard Wiener process (see part (a) of Exercise 9.1.1), also P_0(τ_{0−} = 0) = 1. Combining these two facts we deduce that P_0-a.s. there exist t_n ↓ 0 and s_n ↓ 0 such that W_{t_n} > 0 > W_{s_n} for all n. By sample path continuity, this implies the existence of u_n ↓ 0 such that W_{u_n} = 0 for all n. Hence, P_0(T_0 = 0) = 1. As for the second claim, note that for any r > 0,

P_0(W_n ≥ r√n i.o.) ≥ limsup_n P_0(W_n ≥ r√n) = P_0(W_1 ≥ r) > 0,

where the first inequality is due to Exercise 2.2.2 and the equality holds by the scaling property of W_t (see part (d) of Exercise 9.1.1). Since {W_n ≥ r√n i.o.} ∈ 𝒯^W we thus deduce from Blumenthal's 0-1 law that P_x(W_n ≥ r√n i.o.) = 1 for any x ∈ ℝ. Considering r_k ↑ ∞ this implies that limsup_{t→∞} W_t/√t = ∞ with P_x-probability one. Further, by the symmetry property of the standard Wiener process,

P_0(W_n ≤ −r√n i.o.) = P_0(W_n ≥ r√n i.o.) > 0,

so the preceding argument leads to liminf_{t→∞} W_t/√t = −∞ with P_x-probability one. In particular, P_x-a.s. there exist t_n → ∞ and s_n → ∞ such that W_{t_n} > 0 > W_{s_n}, which by sample path continuity implies the existence of u_n → ∞ such that W_{u_n} = 0 for all n.
Combining the strong Markov property of the Brownian Markov process and the independence of its increments, we deduce next that each a.s. finite Markov time τ is a regeneration time for this process, where it starts afresh independently of the path it took up to this (random) time.

Corollary 9.1.6. If (W_t, F_t) is a Brownian Markov process and τ is an a.s. finite F_t-Markov time, then the S.P. {W_{τ+t} − W_τ, t ≥ 0} is a standard Wiener process, which is independent of F_{τ+}.
Proof. With τ an a.s. finite F_t-Markov time and {W_t, t ≥ 0} an F_t-progressively measurable process, it follows that B_t = W_{t+τ} − W_τ is a R.V. on our probability space and {B_t, t ≥ 0} is a well defined S.P. whose sample functions inherit the continuity of those of {W_t, t ≥ 0}. Since the S.P. Ŵ_t = W_t − W_0 has the f.d.d. hence the law of the standard Wiener process, fixing h ∈ bB^{[0,∞)} and ĥ(x(·)) = h(x(·) − x(0)), the value of g_ĥ(y) = E_y[ĥ(W_·)] = E[h(Ŵ_·)] is independent of y. Consequently, fixing A ∈ F_{τ+}, by the tower property and the strong Markov property (8.3.12) of the Brownian Markov process (W_t, F_t) we have that

E[I_A h(B_·)] = E[I_A ĥ(W_{τ+·})] = E[I_A g_ĥ(W_τ)] = P(A) E[h(Ŵ_·)].

In particular, considering A = Ω we deduce that the S.P. {B_t, t ≥ 0} has the f.d.d. and hence the law of the standard Wiener process Ŵ_t. Further, recall Lemma 7.1.7 that for any F ∈ F^B, the indicator I_F is of the form I_F = h(B_·) for some h ∈ bB^{[0,∞)}, in which case by the preceding P(A ∩ F) = P(A)P(F). Since this applies for any F ∈ F^B and A ∈ F_{τ+}, we have established the P-independence of the two σ-algebras, namely, the stated independence of {B_t, t ≥ 0} and F_{τ+}.
Beware that to get such a regeneration it is imperative to start with a Markov time τ. To convince yourself, solve the following exercise.

Exercise 9.1.7. Suppose {W_t, t ≥ 0} is a standard Wiener process.
(a) Provide an example of an a.s. finite random variable τ ≥ 0 such that {W_{τ+t} − W_τ, t ≥ 0} does not have the law of a standard Brownian motion.
(b) Provide an example of a finite F^W_t-stopping time τ such that ⌊τ⌋ is not an F^W_t-stopping time.
Combining Corollary 9.1.6 with the fact that w.p.1. τ_{0+} = 0, you are next to prove the somewhat surprising fact that w.p.1. a Brownian Markov process enters (b, ∞) as soon as it exits (−∞, b).

Exercise 9.1.8. Let τ_{b+} = inf{t ≥ 0 : W_t > b} for b ≥ 0 and a Brownian Markov process (W_t, F_t).
(a) Show that P_0(τ_b ≠ τ_{b+}) = 0.
(b) Suppose a finite random variable H ≥ 0 is independent of F^W. Show that {τ_H ≠ τ_{H+}} has probability zero.
The strong Markov property of the Wiener process also provides the probability that starting at x ∈ (c, d) it reaches level d before level c (i.e., the event {W_{τ_{a,b}} = b} of Exercise 8.2.35, with b = d − x and a = x − c).

Exercise 9.1.9. Consider the stopping time τ = inf{t ≥ 0 : W_t ∉ (c, d)} for a Wiener process {W_t, t ≥ 0} starting at x ∈ (c, d).
(a) Using the strong Markov property of W_t show that u(x) = P_x(W_τ = d) is a harmonic function, namely, u(x) = (u(x + r) + u(x − r))/2 for any c ≤ x − r < x < x + r ≤ d, with boundary conditions u(c) = 0 and u(d) = 1.
(b) Check that v(x) = (x − c)/(d − c) is a harmonic function satisfying the same boundary conditions as u(x).
Since boundary conditions at x = c and x = d uniquely determine the value of a harmonic function in (c, d) (a fact you do not need to prove), you thus showed that P_x(W_τ = d) = (x − c)/(d − c).
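The same formula already holds exactly for the simple symmetric random walk on the integers, by the discrete version of the harmonic-function argument in part (a) (with u(x ± 1) in place of u(x ± r)), which also makes it easy to check by simulation; the grid and parameters below are illustrative:

```python
import random

def exit_at_top(x, c, d, rng):
    """Run a simple symmetric random walk from x until it exits (c, d);
    return True if it exits at d (the discrete analogue of {W_tau = d})."""
    while c < x < d:
        x += rng.choice((-1, 1))
    return x == d

rng = random.Random(1)
c, d, x = 0, 10, 3
trials = 20000
hits = sum(exit_at_top(x, c, d, rng) for _ in range(trials))
estimate = hits / trials     # should be close to (x - c)/(d - c) = 0.3
```

With 20000 trials the Monte Carlo error is a few tenths of a percent, so the estimate lands close to 0.3.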
We proceed to derive some of the many classical explicit formulas involving Brownian hitting times, starting with the celebrated reflection principle, which provides among other things the probability density functions of the passage times for a standard Wiener process and of their dual, the running maxima of this process.
[Figure 1 appears here: a Brownian sample path W_s on s ∈ [0, 3], together with its reflection at the level b = 1 after the first passage time.]
Figure 1. Illustration of the reflection principle for Brownian motion.
Proposition 9.1.10 (Reflection principle). With {W_t, t ≥ 0} the standard Wiener process, let M_t = sup_{s∈[0,t]} W_s denote its running maxima and T_b = inf{t ≥ 0 : W_t = b} the corresponding passage times. Then, for any t, b > 0,

(9.1.2)   P(M_t ≥ b) = P(τ_b ≤ t) = P(T_b ≤ t) = 2P(W_t ≥ b).

Remark. The reflection principle was stated by P. Levy [Lev39] and first rigorously proved by Hunt [Hun56]. It is attributed to D. Andre [And1887] who solved the ballot problem of Exercise 5.5.29 by a similar symmetry argument (leading also to the reflection principle for symmetric random walks, as in Exercise 6.1.19).
Proof. Recall Proposition 8.1.15 that τ_b is a stopping time for F^W_t. Further, since b > 0 = W_0 and s ↦ W_s is continuous, clearly τ_b = T_b and W_{T_b} = b whenever T_b is finite. Heuristically, given that T_b = s < u we have that W_s = b, and by reflection symmetry of the Brownian motion we expect the conditional law of W_u − W_s to retain its symmetry around zero, as illustrated in Figure 1. This of course leads to the prediction that for any u, b > 0,

(9.1.3)   P(T_b < u, W_u > b) = (1/2) P(T_b < u).

With W_0 = 0, by sample path continuity {W_u > b} ⊆ {T_b < u}, so the preceding prediction implies that

P(T_b < u) = 2P(T_b < u, W_u > b) = 2P(W_u > b).

The supremum M_t(ω) of the continuous function s ↦ W_s(ω) over the compact interval [0, t] is attained at some s ∈ [0, t], hence the identity {M_t ≥ b} = {τ_b ≤ t} holds for all t, b > 0. Thus, considering u ↓ t > 0 leads, in view of the continuity of (u, b) ↦ P(W_u > b), to the statement (9.1.2) of the proposition. Turning to rigorously prove (9.1.3), we rely on the strong Markov property of the standard Wiener process for the F^W_t-stopping time T_b and the functional h(s, x(·)) = I_A(s, x(·)), where A = {(s, x(·)) : x(·) ∈ C(ℝ_+), s ∈ [0, u) and x(u − s) > b}. To this end, note that F_{y,a,a′} = {x ∈ C([0, ∞)) : x(u − s) ≥ y for all s ∈ [a, a′]} is closed (with respect to uniform convergence on compact subsets of [0, ∞)), and x(u − s) > b for some s ∈ [0, u) if and only if x(·) ∈ F_{b_k,q,q′} for some b_k = b + 1/k, k ≥ 1 and q < q′ ∈ Q^{(2)}_u. So, A is the countable union of the closed sets [q, q′] × F_{b_k,q,q′}, hence Borel measurable on [0, ∞) × C([0, ∞)). Next recall that by the definition of the set A,

g_h(s, b) = E_b[I_A(s, W_·)] = I_{[0,u)}(s) P_b(W_{u−s} > b) = (1/2) I_{[0,u)}(s).

Further, h(s, x(s + ·)) = I_{[0,u)}(s) I_{{x(u)>b}} and W_{T_b} = b whenever T_b is finite, so taking the expectation of (8.3.13) yields (for our choices of h(·, ·) and τ = T_b) the identity

E[I_{{T_b<u}} I_{{W_u>b}}] = E[h(T_b, W_{T_b+·})] = E[g_h(T_b, W_{T_b})] = E[g_h(T_b, b)] = (1/2) E[I_{{T_b<u}}],

which is precisely (9.1.3).
Since $t^{-1/2} W_t \stackrel{D}{=} G$, a standard normal variable of continuous distribution function, we deduce from the reflection principle that the distribution functions of $T_b$ and $M_t$ are continuous and such that $F_{T_b}(t) = 1 - F_{M_t}(b) = 2(1 - F_G(b/\sqrt{t}))$. In particular, $P(T_b > t) \to 0$ as $t \to \infty$, hence $T_b$ is a.s. finite. We further have the corresponding explicit probability density functions on $[0, \infty)$,

(9.1.4)    $f_{T_b}(t) = \frac{\partial}{\partial t} F_{T_b}(t) = \frac{b}{\sqrt{2 \pi t^3}}\, e^{-b^2/(2t)}$,

(9.1.5)    $f_{M_t}(b) = \frac{\partial}{\partial b} F_{M_t}(b) = \sqrt{\frac{2}{\pi t}}\, e^{-b^2/(2t)}$.
Remark. From the preceding formula for the density of $T_b = \tau_b$ you can easily check that it has infinite expected value, in contrast with the exit times $\tau_{a,b}$ of bounded intervals $(-a, b)$, which have finite moments (see part (c) of Exercise 8.2.35 for finiteness of the second moment and note that the same method extends to all moments). Recall that in part (b) of Exercise 8.2.34 you have already found that the Laplace transform of the density of $T_b$ is

$L_{f_{T_b}}(s) = \int_0^\infty e^{-st} f_{T_b}(t)\, dt = e^{-\sqrt{2s}\, b}$

(and for inverting Laplace transforms, see Exercise 2.2.15). Further, using the density of passage times, you can now derive the well-known arc-sine law for the last exit of the Brownian motion from zero by time one.
Exercise 9.1.11. For the standard Wiener process $W_t$ and any $t > 0$, consider the time $L_t = \sup\{s \in [0, t] : W_s = 0\}$ of last exit from zero by $t$, and the Markov time $R_t = \inf\{s > t : W_s = 0\}$ of first return to zero after $t$.
(a) Verify that $P_x(T_y > u) = P(T_{|y-x|} > u)$ for any $x, y \in \mathbb{R}$, and with $p_t(x, y)$ denoting the Brownian transition probability kernel, show that for $u > 0$ and $0 < u < t$, respectively,

$P(R_t > t + u) = \int p_t(0, y)\, P(T_{|y|} > u)\, dy$,

$P(L_t \le u) = \int p_u(0, y)\, P(T_{|y|} > t - u)\, dy$.

(b) Deduce from (9.1.4) that the probability density function of $R_t - t$ is $f_{R_t - t}(u) = \sqrt{t}/(\pi \sqrt{u}\,(t + u))$.
Hint: Express $\partial P(R_t > t + u)/\partial u$ as one integral over $y \in \mathbb{R}$, then change variables to $z^2 = y^2 (1/u + 1/t)$.
9.1. BROWNIAN TRANSFORMATIONS, HITTING TIMES AND MAXIMA 347
(c) Show that $L_t$ has the arc-sine law $P(L_t \le u) = (2/\pi) \arcsin(\sqrt{u/t})$ and hence the density $f_{L_t}(u) = 1/(\pi \sqrt{u(t - u)})$ on $[0, t]$.
(d) Find the joint probability density function of $(L_t, R_t)$.
Remark. Knowing the law of $L_t$ is quite useful, for $\{L_t > u\}$ is just the event $\{W_s = 0$ for some $s \in (u, t]\}$. You have encountered the arc-sine law in Exercise 3.2.16 (where you proved the discrete reflection principle for the path of the symmetric srw). Indeed, as shown in Section 9.2 by Donsker's invariance principle, these two arc-sine laws are equivalent.
Here are a few additional results about passage times and running maxima.

Exercise 9.1.12. Generalizing the proof of (9.1.3), deduce that for a standard Wiener process, any $u > 0$ and $a_1 < a_2 \le b$,

(9.1.6)    $P(T_b < u, a_1 < W_u < a_2) = P(2b - a_2 < W_u < 2b - a_1)$,

and conclude that the joint density of $(M_t, W_t)$ is

(9.1.7)    $f_{W_t, M_t}(a, b) = \frac{2(2b - a)}{\sqrt{2 \pi t^3}}\, e^{-(2b-a)^2/(2t)}$,

for $b \ge \max(a, 0)$ and zero otherwise.
Exercise 9.1.13. Let $W(t) = (W_1(t), W_2(t))$ denote the two-dimensional Brownian motion of Definition 8.2.36, starting at a non-random $W(0) = (x_1, x_2)$ with $x_1 > 0$ and $x_2 > 0$.
(a) Find the density of $\tau = \inf\{t \ge 0 : W_1(t) = 0$ or $W_2(t) = 0\}$.
(b) Find the joint density of $(\tau, W_1(\tau), W_2(\tau))$ with respect to Lebesgue measure on $\{(t, x, y) \in [0, \infty)^3 : x = 0$ or $y = 0\}$.
Hint: The identity (9.1.6) might be handy here.
Exercise 9.1.14. Consider a Brownian Markov process $(W_t, \mathcal{F}_t)$ with $W_0 \ge 0$ and $p_t(x, B) = P_x(W_t \in B)$ its Brownian semi-group of transition probabilities.
(a) Show that $(W_{t \wedge T_0}, \mathcal{F}_t)$ is a homogeneous Markov process on $[0, \infty)$ whose transition probabilities are: $p_{-,t}(0, \{0\}) = 1$, and if $x > 0$ then $p_{-,t}(x, B) = p_t(x, B) - p_t(-x, B)$ for $B \subseteq (0, \infty)$, while $p_{-,t}(x, \{0\}) = 2 p_t(x, (-\infty, 0])$.
(b) Show that $(|W_t|, \mathcal{F}_t)$ is a homogeneous Markov process on $[0, \infty)$ whose transition probabilities are $p_{+,t}(x, B) = p_t(x, B) + p_t(-x, B)$ (for $x \ge 0$ and $B \subseteq [0, \infty)$).

Remark. We call $(W_{t \wedge T_0}, \mathcal{F}_t)$ the Brownian motion absorbed at zero and $(|W_t|, \mathcal{F}_t)$ the reflected Brownian motion. These are the simplest possible ways of constraining the Brownian motion to have state space $[0, \infty)$.
Exercise 9.1.15. The Brownian Markov process $(W_t, \mathcal{F}_t)$ starts at $W_0 = 0$.
(a) Show that $Y_t = M_t - W_t$ is an $\mathcal{F}_t$-Markov process, of the same transition probabilities $p_{+,t}$, $t \ge 0$ on $[0, \infty)$ as the reflected Brownian motion.
(b) Deduce that $\{Y_t, t \ge 0\}$ has the same law as the reflected Brownian motion.
Exercise 9.1.16. For a Brownian Markov process $(W_t, \mathcal{F}_t)$ starting at $W_0 = 0$, show that $L_{*,t} = \sup\{s \in [0, t] : W_s = M_t\}$ has the same arc-sine law as $L_t$.
When solving the next exercise beware that the sample function $b \mapsto T_b(\omega)$ is w.p.1. discontinuous. In fact, though we shall not prove it, $\{T_b, b \ge 0\}$ is a left-continuous, purely discontinuous increasing process, i.e. there is no interval on which $b \mapsto T_b$ is continuous (c.f. [KaS97, Section 6.2.A]).
Exercise 9.1.17. Consider the passage times $T_b$ for a Brownian Markov process $(W_t, \mathcal{F}_t)$ starting at $W_0 = 0$.
(a) Show that for any $h \in b\mathcal{B}$ and $0 \le b < c$,

$E[h(T_c - T_b) \mid \mathcal{F}_{T_b^+}] = E_b[h(T_c)] = E_0[h(T_{c-b})]$.

(b) Deduce that $\{T_b, b \ge 0\}$ is a S.P. of stationary, non-negative independent increments, whose Markov semi-group has the transition probability kernel

$q_t(x, y) = \frac{t}{\sqrt{2 \pi (y - x)_+^3}}\, e^{-t^2/(2 (y - x)_+)}$,

corresponding to the one-sided $1/2$-stable density of (9.1.4).
(c) Show that $\{\tau_b^+, b \ge 0\}$ of Exercise 9.1.8 is a right-continuous modification of $\{T_b, b \ge 0\}$, hence a strong Markov process of same transition probabilities.
(d) Show that $T_c \stackrel{D}{=} c^2 T_1$ for any $c \in \mathbb{R}$.
Exercise 9.1.18. Suppose that for some $b > 0$ fixed, $\{\tau_k\}$ are i.i.d. each having the probability density function (9.1.4) of $T_b$.
(a) Show that $n^{-2} \sum_{k=1}^n \tau_k \stackrel{D}{=} T_b$ (which is why we say that the law of $T_b$ is $\alpha$-stable for $\alpha = 1/2$).
(b) Show that $P(n^{-2} \max_{k=1}^n \tau_k \le y) \to \exp(-b \sqrt{2/(\pi y)})$ for all $y \ge 0$ (compare with part (b) of Exercise 3.2.13).
Exercise 9.1.19. Consider a standard Wiener process $\{W_t, t \ge 0\}$.
(a) Fixing $b, t > 0$ let $\tau_{b,t} = \inf\{s \ge t : V_s \ge b\}$ for $V_s = |W_s|/\sqrt{s}$. Check that $\tau_{b,t} \stackrel{D}{=} t\, \tau_{b,1}$, then show that for $b < 1$,

$E \tau_{b,1} = \frac{1}{1 - b^2}\, E[(V_1^2 - b^2) I_{\{V_1 \ge b\}}]$,

whereas $E \tau_{b,1} = \infty$ in case $b \ge 1$.
(b) Considering now $V_s = \int_0^s \exp[c(W_s - W_u) - c^2 (s - u)/2]\, du$ for $c \in \mathbb{R}$ non-random, verify that $V_s - s$ is a martingale and deduce that in this case $E \tau_{b,0} = b$ for any $b > 0$.
9.2. Weak convergence and invariance principles

Consider the linearly interpolated, time-space rescaled random walk $\widehat{S}_n(t) = n^{-1/2} S(nt)$ (as depicted in Figure 2, for the symmetric srw), where

(9.2.1)    $S(t) = \sum_{k=1}^{[t]} \xi_k + (t - [t])\, \xi_{[t]+1}$,

and $\{\xi_k\}$ are i.i.d. Recall Exercise 3.5.18 that by the clt, if $E\xi_1 = 0$ and $E\xi_1^2 = 1$, then as $n \to \infty$ the f.d.d. of the S.P. $\widehat{S}_n(\cdot)$ of continuous sample path, converge weakly to those of the standard Wiener process. Since f.d.d. uniquely determine
Figure 2. Scaled srw for different values of $n$ (four panels: the first 20, 100, 1000 and 5000 steps, each plotted over $[0, 1]$).
the law of a S.P. it is thus natural to expect also to have the stronger, convergence in distribution, as defined next.

Definition 9.2.1. We say that S.P. $\{X_n(t), t \ge 0\}$ of continuous sample functions converge in distribution to a S.P. $\{X_\infty(t), t \ge 0\}$, denoted $X_n(\cdot) \stackrel{D}{\longrightarrow} X_\infty(\cdot)$, if the corresponding laws converge weakly in the topological space $\mathbb{S}$ consisting of $C([0, \infty))$ equipped with the topology of uniform convergence on compact subsets of $[0, \infty)$. That is, if $g(X_n(\cdot)) \stackrel{D}{\longrightarrow} g(X_\infty(\cdot))$ whenever $g : C([0, \infty)) \to \mathbb{R}$ Borel measurable, is such that w.p.1. the sample function of $X_\infty(\cdot)$ is not in the set $D_g$ of points of discontinuity of $g$ (with respect to uniform convergence on compact subsets of $[0, \infty)$).
As we state now and prove in the sequel, such functional clt, also known as Donsker's invariance principle, indeed holds.

Theorem 9.2.2 (Donsker's invariance principle). If $\{\xi_k\}$ are i.i.d. with $E\xi_1 = 0$ and $E\xi_1^2 = 1$, then for $S(\cdot)$ of (9.2.1), the S.P. $\widehat{S}_n(\cdot) = n^{-1/2} S(n\,\cdot)$ converge in distribution, as $n \to \infty$, to the standard Wiener process.

Remark. The preceding theorem is called an invariance principle because the limiting process does not depend on the law of the summands $\xi_k$ of the random walk. However, the condition $E\xi_1^2 < \infty$ is almost necessary for the $n^{-1/2}$ scaling and for having a Brownian limit process. Indeed, note Remark 3.1.13 that both fail as soon as $E|\xi_1|^\alpha = \infty$ for some $0 < \alpha < 2$.
Since $h(x(\cdot)) = f(x(t_1), \ldots, x(t_k))$ is continuous and bounded on $C([0, \infty))$ for any $f \in C_b(\mathbb{R}^k)$ and each finite subset $\{t_1, \ldots, t_k\}$ of $[0, \infty)$, convergence in distribution of S.P. of continuous sample path implies the weak convergence of their f.d.d. But, beware that the convergence of f.d.d. does not necessarily imply convergence in distribution, even for S.P. of continuous sample functions.
Exercise 9.2.3. Give a counter-example to show that weak convergence of the f.d.d. of S.P. $X_n(\cdot)$ of continuous sample functions to those of S.P. $X_\infty(\cdot)$ of continuous sample functions, does not imply that $X_n(\cdot) \stackrel{D}{\longrightarrow} X_\infty(\cdot)$.
Hint: Try $X_n(t) = nt\, 1_{[0,1/n]}(t) + (2 - nt)\, 1_{[1/n,2/n]}(t)$.
Nevertheless, with $\mathbb{S} = (C([0, \infty)), \rho)$ a complete, separable metric space (c.f. Exercise 7.2.9), we have the following useful partial converse as an immediate consequence of Prohorov's theorem.

Proposition 9.2.4. If the laws of S.P. $X_n(\cdot)$ of continuous sample functions are uniformly tight in $C([0, \infty))$ and for $n \to \infty$ the f.d.d. of $X_n(\cdot)$ converge weakly to the f.d.d. of $X_\infty(\cdot)$, then $X_n(\cdot) \stackrel{D}{\longrightarrow} X_\infty(\cdot)$.
Proof. Recall part (e) of Theorem 3.5.2, that by the Portmanteau theorem $X_n(\cdot) \stackrel{D}{\longrightarrow} X_\infty(\cdot)$ as in Definition 9.2.1, if and only if the corresponding laws $\mu_n = \mathcal{P}_{X_n}$ converge weakly on the metric space $\mathbb{S} = C([0, \infty))$ (and its Borel $\sigma$-algebra). That is, if and only if $E h(X_n(\cdot)) \to E h(X_\infty(\cdot))$ for each $h$ continuous and bounded on $\mathbb{S}$ (also denoted by $\mu_n \stackrel{w}{\to} \mu_\infty$, see Definition 3.2.17). Let $\mu_n^{(m)}$ be a subsequence of $\mu_n$. Since $\{\mu_n\}$ is uniformly tight, so is $\{\mu_n^{(m)}\}$. Thus, by Prohorov's theorem, there exists a further sub-subsequence $\mu_n^{(m_k)}$ such that $\mu_n^{(m_k)}$ converges weakly to a probability measure $\mu_\infty$ on $\mathbb{S}$. Recall Proposition 7.1.8 that the f.d.d. uniquely determine the law of S.P. of continuous sample functions. Hence, from the assumed convergence of f.d.d. of $X_n(\cdot)$ to those of $X_\infty(\cdot)$, we deduce that $\mu_\infty = \mathcal{P}_{X_\infty}$. Consequently, $E h(X_n^{(m_k)}(\cdot)) \to E h(X_\infty(\cdot))$ for each $h \in C_b(\mathbb{S})$ (see Exercise 7.2.9). Fixing $h \in C_b(\mathbb{S})$ note that we have just shown that every subsequence $y_n^{(m)}$ of the sequence $y_n = E h(X_n(\cdot))$ has a further sub-subsequence $y_n^{(m_k)}$ that converges to $y_\infty = E h(X_\infty(\cdot))$. Hence, we deduce by Lemma 2.2.11 that $y_n \to y_\infty$. Since this holds for all $h \in C_b(\mathbb{S})$, we conclude that $X_n(\cdot) \stackrel{D}{\longrightarrow} X_\infty(\cdot)$.
Having Proposition 9.2.4 and the convergence of f.d.d. of $\widehat{S}_n(\cdot)$, Donsker's invariance principle is a consequence of the uniform tightness in $\mathbb{S}$ of the laws of these S.P.-s. In view of Definition 3.2.31, we prove this uniform tightness by exhibiting compact sets $K_\epsilon$ such that $\sup_n P(\widehat{S}_n \notin K_\epsilon) \to 0$ as $\epsilon \to 0$. To this end, recall the following classical result of functional analysis (for a proof see [KaS97, Theorem 2.4.9] or the more general version provided in [Dud89, Theorem 2.4.7]).
Theorem 9.2.5 (Arzelà-Ascoli theorem). A set $K \subseteq C([0, \infty))$ has compact closure with respect to uniform convergence on compact intervals, if and only if $\sup_{x \in K} |x(0)|$ is finite and for $t > 0$ fixed, $\sup_{x \in K} \mathrm{osc}_{t,\delta}(x(\cdot)) \to 0$ as $\delta \to 0$, where

(9.2.2)    $\mathrm{osc}_{t,\delta}(x(\cdot)) = \sup_{0 \le h \le \delta}\ \sup_{0 \le s \le s+h \le t} |x(s + h) - x(s)|$,

is just the maximal absolute increment of $x(\cdot)$ over all pairs of times in $[0, t]$ which are within distance $\delta$ of each other.
The Arzelà-Ascoli theorem suggests the following strategy for proving uniform tightness.

Exercise 9.2.6. Let $\mathbb{S}$ denote the set $C([0, \infty))$ equipped with the topology of uniform convergence on compact intervals, and consider its subsets $F_{r,\delta} = \{x(\cdot) : x(0) = 0, \mathrm{osc}_{r,\delta}(x(\cdot)) \le 1/r\}$ for $\delta > 0$ and integer $r \ge 1$.
(a) Verify that the functional $x(\cdot) \mapsto \mathrm{osc}_{t,\delta}(x(\cdot))$ is continuous on $\mathbb{S}$ per fixed $t$ and $\delta$ and further that per $x(\cdot)$ fixed, the function $\mathrm{osc}_{t,\delta}(x(\cdot))$ is non-decreasing in $t$ and in $\delta$. Deduce that $F_{r,\delta}$ are closed sets and for any $\delta_r \downarrow 0$, the intersection $\bigcap_r F_{r,\delta_r}$ is a compact subset of $\mathbb{S}$.
(b) Show that if S.P.-s $\{X_n(t), t \ge 0\}$ of continuous sample functions are such that $X_n(0) = 0$ for all $n$ and for any $r \ge 1$,

$\lim_{\delta \to 0}\ \sup_{n \ge 1} P(\mathrm{osc}_{r,\delta}(X_n(\cdot)) > r^{-1}) = 0$,

then the corresponding laws are uniformly tight in $\mathbb{S}$.
Hint: Let $K_\epsilon = \bigcap_r F_{r,\delta_r}$ with $\delta_r \downarrow 0$ such that $P(X_n \notin F_{r,\delta_r}) \le \epsilon 2^{-r}$.
Since $\widehat{S}_n(\cdot) = n^{-1/2} S(n\,\cdot)$ and $S(0) = 0$, by the preceding exercise the uniform tightness of the laws of $\widehat{S}_n(\cdot)$, and hence Donsker's invariance principle, is an immediate consequence of the following bound.
Proposition 9.2.7. If $\{\xi_k\}$ are i.i.d. with $E\xi_1 = 0$ and $E\xi_1^2$ finite, then

$\lim_{\delta \to 0}\ \sup_{n \ge 1} P(\mathrm{osc}_{nr,n\delta}(S(\cdot)) > r^{-1} \sqrt{n}) = 0$,

for $S(t) = \sum_{k=1}^{[t]} \xi_k + (t - [t])\, \xi_{[t]+1}$ and any integer $r \ge 1$.
Proof. Fixing $r \ge 1$, let $q_{n,\delta} = P(\mathrm{osc}_{nr,n\delta}(S(\cdot)) > r^{-1} \sqrt{n})$. Since $t \mapsto S(t)$ is uniformly continuous on compacts, $\mathrm{osc}_{nr,n\delta}(S(\cdot))(\omega) \to 0$ when $\delta \to 0$ (for each $\omega$). Consequently, $q_{n,\delta} \to 0$ for each fixed $n$, hence uniformly over $n \le n_0$ and any fixed $n_0$. With $\delta \mapsto q_{n,\delta}$ non-decreasing, this implies that $\sup_n q_{n,\delta} \to 0$ provided $\limsup_{k \to \infty} q_{k,\delta} \to 0$ when $\delta \to 0$. To show the latter, observe that since the piecewise linear $S(t)$ changes slope only at integer values of $t$,

$\mathrm{osc}_{kr,k\delta}(S(\cdot)) \le \mathrm{osc}_{kr,m}(S(\cdot)) \le M_{m,\ell}$,

for $m = [k\delta] + 1$, $\ell = rk/m$ and

(9.2.3)    $M_{m,\ell} = \max_{1 \le i \le m,\ 0 \le j \le \ell m - 1} |S(i + j) - S(j)|$.

Thus, for any $\delta > 0$,

$\limsup_{k \to \infty} q_{k,\delta} \le \limsup_{m \to \infty} P(M_{m,\ell(v)} > v \sqrt{m})$,

where $v = r^{-1} \sqrt{k/m} \to 1/(r \sqrt{\delta})$ as $k \to \infty$ and $\ell(v) = r^3 v^2$. Since $v \to \infty$ when $\delta \to 0$, we complete the proof by appealing to part (c) of Exercise 9.2.8.
As you have just seen, the key to the proof of Proposition 9.2.7 is the following bound on maximal fluctuations of increments of the random walk.

Exercise 9.2.8. Suppose $S_m = \sum_{k=1}^m \xi_k$ for i.i.d. $\{\xi_k\}$ such that $E\xi_1 = 0$ and $E\xi_1^2 = 1$. For integers $m, \ell \ge 1$, let $S(m) = S_m$ and $M_{m,\ell}$ be as in (9.2.3), with $M_{m,0} = \max_{i=1}^m |S_i|$.
(a) Show that for any $m \ge 1$ and $t \ge 0$,

$P(M_{m,0} \ge t + \sqrt{2m}) \le 2 P(|S_m| \ge t)$.

Hint: Use Ottaviani's inequality (see part (a) of Exercise 5.2.16).
(b) Show that for any $m, \ell \ge 1$ and $x \ge 0$,

$P(M_{m,\ell} > 2x) \le \ell\, P(M_{m,1} > 2x) \le \ell\, P(M_{2m,0} > x)$.

(c) Deduce that if $v^{-2} \log \ell(v) \to 0$, then

$\limsup_{v \to \infty}\ \limsup_{m \to \infty} P(M_{m,\ell(v)} > v \sqrt{m}) = 0$.

Hint: Recall that $m^{-1/2} S_m \stackrel{D}{\longrightarrow} G$ by the clt.
Applying Donsker's invariance principle, you can induce limiting results for random walks out of the corresponding facts about the standard Brownian motion, which we have found already in Section 9.1.
Example 9.2.9. Recall the running maxima $M_t = \sup_{s \in [0,t]} W_s$, whose density we got in (9.1.5) out of the reflection principle. Since $h_0(x(\cdot)) = \sup\{x(s) : s \in [0, 1]\}$ is continuous with respect to uniform convergence on $C([0, 1])$, we have from Donsker's invariance principle that as $n \to \infty$,

$h_0(\widehat{S}_n) = \frac{1}{\sqrt{n}} \max_{k=0}^n S_k \stackrel{D}{\longrightarrow} M_1$

(where we have used the fact that the maximum of the linearly interpolated function $S(t)$ must be obtained at some integer value of $t$). The functions $h_\gamma(x(\cdot)) = \int_0^1 x(s)^\gamma\, ds$ for $\gamma = 1, 2, \ldots$ are also continuous on $C([0, 1])$, so by same reasoning,

$h_\gamma(\widehat{S}_n) = n^{-(1+\gamma/2)} \int_0^n S(u)^\gamma\, du \stackrel{D}{\longrightarrow} \int_0^1 (W_u)^\gamma\, du$.

Similar limits can be obtained by considering $h_\gamma(|x(\cdot)|)$.
Exercise 9.2.10.
(a) Building on Example 9.2.9, show that for any $\gamma \ge 1$,

$n^{-(1+\gamma/2)} \sum_{k=1}^n (S_k)^\gamma \stackrel{D}{\longrightarrow} \int_0^1 (W_u)^\gamma\, du$,

as soon as $E\xi_1 = 0$ and $E\xi_1^2 = 1$ (i.e. there is no need to assume finiteness of the $\gamma$-th moment of $\xi_1$), and in case $\gamma = 1$ the limit law is merely a normal of zero mean and variance $1/3$.
(b) The cardinality of the set $\{S_0, \ldots, S_n\}$ is called the range of the walk by time $n$ and denoted $\mathrm{rng}_n$. Show that for the symmetric srw on $\mathbb{Z}$,

$n^{-1/2}\, \mathrm{rng}_n \stackrel{D}{\longrightarrow} \sup_{s \le 1} W_s - \inf_{s \le 1} W_s$.
We continue in the spirit of Example 9.2.9, except for dealing with functionals that are no longer continuous throughout $C([0, \infty))$.

Example 9.2.11. Let $\tau_n = n^{-1} \inf\{k \ge 1 : S_k \ge \sqrt{n}\}$. As shown in Exercise 9.2.12, considering the function $g_1(x(\cdot)) = \inf\{t \ge 0 : x(t) \ge 1\}$ we find that $\tau_n \stackrel{D}{\longrightarrow} T_1$ as $n \to \infty$, where the density of $T_1 = \inf\{t \ge 0 : W_t \ge 1\}$ is given in (9.1.4).
Similarly, let $A_t(b) = \int_0^t I_{\{W_s > b\}}\, ds$ denote the occupation time of $B = (b, \infty)$ by the standard Brownian motion up to time $t$. Then, considering $g(x(\cdot), B) = \int_0^1 I_{\{x(s) \in B\}}\, ds$, we find that for $n \to \infty$ and $b \in \mathbb{R}$ fixed,

(9.2.4)    $\widehat{A}_n(b) = \frac{1}{n} \sum_{k=1}^n I_{\{S_k > b \sqrt{n}\}} \stackrel{D}{\longrightarrow} A_1(b)$,

as you are also to justify upon solving Exercise 9.2.12. Of particular note is the case of $b = 0$, where Lévy's arc-sine law tells us that $A_t(0) \stackrel{D}{=} L_t$ of Exercise 9.1.11 (as shown for example in [KaS97, Proposition 4.4.11]).

Recall the arc-sine limiting law of Exercise 3.2.16 for $n^{-1} \sup\{\ell \le n : S_{\ell-1} S_\ell \le 0\}$ in case of the symmetric srw. In view of Exercise 9.1.11, working with $g_0(x(\cdot)) = \sup\{s \in [0, 1] : x(s) = 0\}$ one can extend the validity of this limit law to any random walk with increments of mean zero and variance one (c.f. [Dur10, Example 8.6.3]).
Exercise 9.2.12.
(a) Let $g_1^+(x(\cdot)) = \inf\{t \ge 0 : x(t) > 1\}$. Show that $P(W_\cdot \in \mathcal{G}) = 1$ for the subset $\mathcal{G} = \{x(\cdot) : x(0) = 0$ and $g_1(x(\cdot)) = g_1^+(x(\cdot)) < \infty\}$ of $C([0, \infty))$, and that $g_1(x_n(\cdot)) \to g_1(x(\cdot))$ for any sequence $x_n(\cdot) \in C([0, \infty))$ which converges uniformly on compacts to $x(\cdot) \in \mathcal{G}$. Further, show that $\tau_n - n^{-1} \le g_1(\widehat{S}_n) \le \tau_n$ and deduce that $\tau_n \stackrel{D}{\longrightarrow} T_1$.
(b) To justify (9.2.4), first verify that the non-negative $g(x(\cdot), (b, \infty))$ is continuous on any sequence whose limit is in $\mathcal{G} = \{x(\cdot) : g(x(\cdot), \{b\}) = 0\}$ and that $E[g(W_\cdot, \{b\})] = 0$, hence $g(\widehat{S}_n, (b, \infty)) \stackrel{D}{\longrightarrow} A_1(b)$. Then, deduce that $\widehat{A}_n(b) \stackrel{D}{\longrightarrow} A_1(b)$ by showing that for any $\epsilon > 0$ and $n \ge 1$,

$g(\widehat{S}_n, (b + \epsilon, \infty)) - \eta_n(\epsilon) \le \widehat{A}_n(b) \le g(\widehat{S}_n, (b - \epsilon, \infty)) + \eta_n(\epsilon)$,

with $\eta_n(\epsilon) = n^{-1} \sum_{k=1}^n I_{\{|\xi_k| > \epsilon \sqrt{n}\}}$ converging in probability to zero when $n \to \infty$.
Our next result is a refinement due to Kolmogorov and Smirnov, of the Glivenko-Cantelli theorem (which states that for i.i.d. $\{X, X_k\}$ the empirical distribution functions $F_n(x) = n^{-1} \sum_{i=1}^n I_{(-\infty,x]}(X_i)$, converge w.p.1., uniformly in $x$, to the distribution function of $X$, whichever it may be, see Theorem 2.3.6).
Corollary 9.2.13. Suppose $\{X_k\}$, $X$ are i.i.d. and $x \mapsto F_X(x)$ is a continuous function. Then, setting $D_n = \sup_{x \in \mathbb{R}} |F_n(x) - F_X(x)|$, as $n \to \infty$,

(9.2.5)    $n^{1/2} D_n \stackrel{D}{\longrightarrow} \sup_{t \in [0,1]} |\widehat{B}_t|$,

for the standard Brownian bridge $\widehat{B}_t = W_t - t W_1$ on $[0, 1]$.
Remark. Though not proved here, more generally $n^{1/2} D_n \stackrel{D}{\longrightarrow} \sup_{x \in \mathbb{R}} |\widehat{B}_{F_X(x)}|$ (which for continuous $F_X(\cdot)$ coincides with (9.2.5)).
Proof. Recall the Skorokhod construction $X_k = X^-(U_k)$ of Theorem 1.2.36, with i.i.d. uniform variables $U_k$ on $(0, 1]$, such that $\{X_k \le x\} = \{U_k \le F_X(x)\}$ for all $x \in \mathbb{R}$ (see (1.2.1)), by which it follows that

$D_n = \sup_{u \in F_X(\mathbb{R})} \Big| n^{-1} \sum_{i=1}^n I_{(0,u]}(U_i) - u \Big|$.
The assumed continuity of the distribution function $F_X(\cdot)$ further implies that $F_X(\mathbb{R}) \supseteq (0, 1)$. Throwing away the sample $U_n$, let $V_{n-1,k}$, $k = 1, \ldots, n-1$, denote the $k$-th smallest number in $\{U_1, \ldots, U_{n-1}\}$. It is not hard to check that $|D_n - \widehat{D}_n| \le 2 n^{-1}$ for

$\widehat{D}_n = \max_{k=1}^{n-1} \Big| \frac{k}{n} - V_{n-1,k} \Big|$.

Now, from Exercise 3.4.11 we have the representation $V_{n-1,k} = T_k / T_n$, where $T_k = \sum_{j=1}^k \tau_j$ for i.i.d. $\{\tau_j\}$, each having the exponential distribution of parameter one. Next, note that $S_k = T_k - k$ is a random walk $S_k = \sum_{j=1}^k \xi_j$ with i.i.d. $\xi_j = \tau_j - 1$ such that $E\xi_1 = 0$ and $E\xi_1^2 = 1$. Hence, setting $Z_n = n / T_n$ one easily checks that

$n^{1/2} \widehat{D}_n = Z_n \max_{k=1}^n \Big| n^{-1/2} S_k - \frac{k}{n} (n^{-1/2} S_n) \Big| = Z_n \sup_{t \in [0,1]} |\widehat{S}_n(t) - t\, \widehat{S}_n(1)|$,

where $\widehat{S}_n(t) = n^{-1/2} S(nt)$ for $S(\cdot)$ of (9.2.1), so $t \mapsto \widehat{S}_n(t) - t\, \widehat{S}_n(1)$ is linear on each of the intervals $[k/n, (k+1)/n]$, $k \ge 0$. Consequently,

$n^{1/2} \widehat{D}_n = Z_n\, g(\widehat{S}_n(\cdot))$,

for $g(x(\cdot)) = \sup_{t \in [0,1]} |x(t) - t\, x(1)|$. By the strong law of large numbers, $Z_n \stackrel{a.s.}{\longrightarrow} \frac{1}{E\tau_1} = 1$. Moreover, $g(\cdot)$ is continuous on $C([0, 1])$, so by Donsker's invariance principle $g(\widehat{S}_n(\cdot)) \stackrel{D}{\longrightarrow} g(W_\cdot) = \sup_{t \in [0,1]} |\widehat{B}_t|$ (see Definition 9.2.1), and by Slutsky's lemma, first $n^{1/2} \widehat{D}_n = Z_n\, g(\widehat{S}_n(\cdot))$ and then $n^{1/2} D_n$ (which is within $2 n^{-1/2}$ of $n^{1/2} \widehat{D}_n$), have the same limit in distribution (see part (c), then part (b) of Exercise 3.2.8).
You are now to provide an explicit formula for the distribution function $F_{KS}(\cdot)$ of $\sup_{t \in [0,1]} |\widehat{B}_t|$.
Exercise 9.2.14. Consider the standard Brownian bridge $\widehat{B}_t$ on $[0, 1]$, as in Exercises 7.3.15-7.3.16.
(a) Show that $q_b = P(\sup_{t \in [0,1]} \widehat{B}_t \ge b) = \exp(-2 b^2)$ for any $b > 0$.
Hint: Argue that $q_b = P(\theta_b^{(b)} < \infty)$ for $\theta_b^{(r)}$ of Exercise 8.2.34.
(b) Deduce that for any non-random $a, c > 0$,

$P\big( \inf_{t \in [0,1]} \widehat{B}_t \le -a$ or $\sup_{t \in [0,1]} \widehat{B}_t \ge c \big) = \sum_{n \ge 1} (-1)^{n-1} (p_n + r_n)$,

where $p_{2n} = r_{2n} = q_{na+nc}$, $r_{2n+1} = q_{na+nc+c}$ and $p_{2n+1} = q_{na+nc+a}$.
Hint: Using inclusion-exclusion prove this expression for $p_n = P($for some $0 < t_1 < \cdots < t_n \le 1$, $\widehat{B}_{t_i} = -a$ for odd $i$ and $\widehat{B}_{t_i} = c$ for even $i)$ and $r_n$ similarly defined, just with $\widehat{B}_{t_i} = c$ at odd $i$ and $\widehat{B}_{t_i} = -a$ at even $i$. Then use the reflection principle for Brownian motion $W_t$ such that $|W_1| \le \epsilon$ to equate these with the relevant $q_b$ (in the limit $\epsilon \downarrow 0$).
(c) Conclude that for any non-random $b > 0$,

(9.2.6)    $F_{KS}(b) = P\big( \sup_{t \in [0,1]} |\widehat{B}_t| \le b \big) = 1 - 2 \sum_{n=1}^\infty (-1)^{n-1} e^{-2 n^2 b^2}$.
Remark. The typical approach to accepting/rejecting the hypothesis that i.i.d. observations $\{X_k\}$ have been generated according to a specified continuous distribution $F_X$ is by thresholding the value of $F_{KS}(b)$ at the observed Kolmogorov-Smirnov statistic $b = n^{1/2} D_n$, per (9.2.6). To this end, while outside our scope, using the so called Hungarian construction, which is a much sharper coupling alternative to Corollary 9.2.21, one can further find the rate (in $n$) of the convergence in distribution in (9.2.5) (for details, see [SW86, Chapter 12.1]).
We conclude with a sufficient condition for convergence in distribution on $C([0, \infty))$.

Exercise 9.2.15. Suppose $C([0, \infty))$-valued random variables $X_n$, $1 \le n \le \infty$, defined on the same probability space, are such that $\|X_n - X_\infty\|_\ell \stackrel{p}{\to} 0$ for $n \to \infty$ and any $\ell \ge 1$ fixed (where $\|x\|_t = \sup_{s \in [0,t]} |x(s)|$). Show that $X_n \stackrel{D}{\longrightarrow} X_\infty$ in the topological space $\mathbb{S}$ consisting of $C([0, \infty))$ equipped with the topology of uniform convergence on compact subsets of $[0, \infty)$.
Hint: Consider Exercise 7.2.9 and Corollary 3.5.3.
9.2.1. Skorokhod's representation and the martingale clt. We pursue here an alternative approach for proving invariance principles, which is better suited to deal with dependence, culminating with a Lindeberg type, martingale clt. We first utilize the continuity of the Brownian path to deduce, in view of Corollary 3.5.3, that linear interpolation of random discrete samples along the path converges in distribution to the Brownian motion, provided the sample times approach a uniform density.
Lemma 9.2.16. Suppose $\{W(t), t \ge 0\}$ is a standard Wiener process and $k \mapsto T_{n,k}$ are non-decreasing, such that $T_{n,[nt]} \stackrel{p}{\to} t$ when $n \to \infty$, for each fixed $t \in [0, \ell]$. Then, $\|\widehat{S}_n - W\|_\ell \stackrel{p}{\to} 0$ for the norm $\|x(\cdot)\|_\ell = \sup_{t \in [0,\ell]} |x(t)|$ of $C([0, \ell])$ and $\widehat{S}_n(t) = S_n(nt)$, where

(9.2.7)    $S_n(t) = W(T_{n,[t]}) + (t - [t]) \big( W(T_{n,[t]+1}) - W(T_{n,[t]}) \big)$.
Remark. In view of Exercise 9.2.15, the preceding lemma implies in particular that if $T_{n,[nt]} \stackrel{p}{\to} t$ for each fixed $t \ge 0$, then $\widehat{S}_n(\cdot) \stackrel{D}{\longrightarrow} W(\cdot)$ in $C([0, \infty))$.
Proof. Recall that each sample function $t \mapsto W(t)(\omega)$ is uniformly continuous on $[0, \ell + 1]$, hence $\mathrm{osc}_{\ell+1,\delta}(W(\cdot))(\omega) \to 0$ as $\delta \to 0$ (see (9.2.2) for the definition of $\mathrm{osc}_{t,\delta}(x(\cdot))$). Fixing $\epsilon > 0$ note that as $r \to \infty$,

$G_r = \{\omega : \mathrm{osc}_{\ell+1,3/r}(W(\cdot))(\omega) \le \epsilon\} \uparrow \Omega$,

so by continuity from below of probability measures, $P(G_r) \ge 1 - \epsilon$ for some integer $r$. Setting $s_j = j/r$ for $j = 0, 1, \ldots, \ell r + 1$, our hypothesis that $T_{n,[nt]} \stackrel{p}{\to} t$ per fixed $t \ge 0$, hence uniformly on any finite collection of times, implies that for some finite $n_0 = n_0(\epsilon, r)$ and all $n \ge n_0$,

$P\big( \max_{0 \le j \le \ell r + 1} |T_{n,[n s_j]} - s_j| \le r^{-1} \big) \ge 1 - \epsilon$.

Further, by the monotonicity of $k \mapsto T_{n,k}$, if $t \in [s_{j-1}, s_j)$ and $n \ge r$, then

$T_{n,[n s_{j-1}]} - s_j \le T_{n,[nt]} - t \le T_{n,[nt]+1} - t \le T_{n,[n s_{j+1}]} - s_{j-1}$

and since $s_{j+1} - s_{j-1} = 2/r$, it follows that for any $n \ge \max(n_0, r)$,

(9.2.8)    $P\big( \sup_{b \in \{0,1\},\, t \in [0,\ell)} |T_{n,[nt]+b} - t| \le 3 r^{-1} \big) \ge 1 - \epsilon$.

Recall that for any $n \ge 1$ and $t \ge 0$,

$|\widehat{S}_n(t) - W(t)| \le (1 - \lambda)\, |W(T_{n,[nt]}) - W(t)| + \lambda\, |W(T_{n,[nt]+1}) - W(t)|$,

where $\lambda = nt - [nt] \in [0, 1)$. Observe that if both $G_r$ and the event in (9.2.8) occur, then by definition of $G_r$ each of the two terms on the right-side of the last inequality is at most $\epsilon$. We thus see that $\|\widehat{S}_n - W\|_\ell \le \epsilon$ whenever both $G_r$ and the event in (9.2.8) occur. That is, $P(\|\widehat{S}_n - W\|_\ell \le \epsilon) \ge 1 - 2\epsilon$. Since this applies for all $\epsilon > 0$, we have just shown that $\|\widehat{S}_n - W\|_\ell \stackrel{p}{\to} 0$, as claimed.
The key tool in our program is an alternative Skorokhod representation of random variables. Whereas in Theorem 1.2.36 we applied the inverse of the desired distribution function to a uniform random variable on $[0, 1]$, here we construct a stopping time of the form $\tau_{A,B} = \inf\{t \ge 0 : W_t \notin (-A, B)\}$ such that $W_\tau$ has the stated, mean zero law. To this end, your next exercise exhibits the appropriate random levels $(A, B)$ to be used in this construction.
Definition 9.2.17. Given a random variable $V \ge 0$ of positive, finite mean, we say that $Z \ge 0$ is a size-biased sample of $V$ if $E[g(Z)] = E[V g(V)]/EV$ for all $g \in b\mathcal{B}$. Alternatively, the Radon-Nikodym derivative between the corresponding laws is $\frac{d P_Z}{d P_V}(v) = v/EV$.
Exercise 9.2.18. Suppose $X$ is an integrable random variable, such that $EX = 0$ (so $EX_+ = EX_-$ is finite). Consider the $[0, \infty)^2$-valued random vector

$(A, B) = (0, 0)\, I_{\{X=0\}} + (Z, X)\, I_{\{X>0\}} + (-X, Y)\, I_{\{X<0\}}$,

where $Y$ and $Z$ are size-biased samples of $X_+$ and $X_-$, respectively, which are further independent of $X$. Show that then for any $f \in b\mathcal{B}$,

(9.2.9)    $E[r(A, B) f(-A) + (1 - r(A, B)) f(B)] = E[f(X)]$,

where $r(a, b) = b/(a + b)$ for $a > 0$ and $r(0, b) = 1$.
Hint: It suffices to show that $E[X h(Z, X) I_{\{X>0\}}] = E[(-X) h(-X, Y) I_{\{X<0\}}]$ for $h(a, b) = (f(b) - f(-a))/(a + b)$.
Theorem 9.2.19 (Skorokhod's representation). Suppose $(W_t, \mathcal{F}_t, t \ge 0)$ is a Brownian Markov process such that $W_0 = 0$ and $\mathcal{F}_\infty$ is independent of the $[0, 1]$-valued independent uniform variables $U_i$, $i = 1, 2$. With $\mathcal{G}_t = \sigma(\sigma(U_1, U_2), \mathcal{F}_t)$, given the law $\mathcal{P}_X$ of an integrable $X$ such that $EX = 0$, there exists an a.s. finite $\mathcal{G}_t$-stopping time $\tau$ such that $W_\tau \stackrel{D}{=} X$, $E\tau = EX^2$ and $E\tau^2 \le 2 EX^4$.
Remark. The extra randomness in the form of $U_i$, $i = 1, 2$ is not needed when $X \in \{-a, b\}$ takes only two values (for you can easily check that then Exercise 9.2.18 simply sets $A = a$ and $B = b$). In fact, one can eliminate it altogether, at the cost of a more involved proof (see [Bil95, Theorem 37.6] or [MP09, Sections 5.3.1-5.3.2] for how this is done).
Proof. We first construct $(X, Y, Z)$ of Exercise 9.2.18 from the independent pair of uniformly distributed random variables $(U_1, U_2)$. That is, given the specified distribution function $F_X(\cdot)$ of $X$, we set $X(\omega) = \sup\{x : F_X(x) < U_1(\omega)\}$ as in Theorem 1.2.36 so that $X \in m\sigma(U_1)$. Noting that $F_X(\cdot)$ uniquely determines the distribution functions $F_Y(y) = I_{\{y > 0\}} \int_0^y v\, dF_X(v)/EX_+$ and $F_Z(z) = I_{\{z > 0\}} \int_{-z}^0 (-v)\, dF_X(v)/EX_-$, we apply the same procedure to construct the strictly positive $Y$ and $Z$ out of $U_2$. With the resulting pair $(Y, Z)$ measurable on $\sigma(U_2)$
and independent of $X$, we proceed to have the $[0, \infty)^2$-valued random vector $(A, B)$ measurable on $\sigma(U_1, U_2)$ as in Exercise 9.2.18.
We claim that $\tau = \inf\{t \ge 0 : W_t \notin (-A, B)\}$ is an a.s. finite $\mathcal{G}_t$-stopping time. Indeed, considering the $\mathcal{F}_t$-adapted $M_t^+ = \sup_{s \in [0,t]} W_s$ and $M_t^- = \inf_{s \in [0,t]} W_s$, note that by continuity of the Brownian sample function, $\tau \le t$ if and only if either $M_t^+ \ge B$ or $M_t^- \le -A$. With $(A, B)$ measurable on $\sigma(U_1, U_2) \subseteq \mathcal{G}_0$ it follows that $\{\omega : \tau(\omega) \le t\} \in \mathcal{G}_t$ for each $t \ge 0$. Further, recall part (d) of Exercise 8.2.34 that w.p.1. $M_t^+ \uparrow \infty$, in which case the $\mathcal{G}_t$-stopping time $\tau(\omega)$ is finite.
Setting $\mathbb{Y} = [0, \infty)^2$, we deduce from the preceding analysis of the events $\{\tau \le t\}$ that $\tau(\omega) = h(\omega, (A(\omega), B(\omega)))$ with $h(\omega, (a, b)) = \tau_{a,b}(\omega)$ measurable on the product space $(\Omega, \mathcal{F}_\infty) \times (\mathbb{Y}, \mathcal{B}_{\mathbb{Y}})$. Since $W_t$ is $\mathcal{F}_t$-progressively measurable, the same applies for $V = I_{\{\tau < \infty\}} f(W_\tau, \tau)$ and any bounded below Borel function $f$ on $\mathbb{R}^2$. With $(A, B) \in m\sigma(U_1, U_2)$ independent of $\mathcal{F}_\infty$, it thus follows from Fubini's theorem that $E[V] = E[g(A, B)]$ for

$g(a, b) = E[I_{\{\tau_{a,b} < \infty\}} f(W_{\tau_{a,b}}, \tau_{a,b})]$.

In particular, considering $V = I_{\{\tau < \infty\}} f(W_\tau)$, since $W_t$ is a Brownian Markov process starting at $W_0 = 0$, we have from part (a) of Exercise 8.2.35 that

$g(a, b) = r(a, b) f(-a) + (1 - r(a, b)) f(b)$,

where $r(a, b) = b/(a + b)$ for $a > 0$ and $r(0, b) = 1$. In view of the identity (9.2.9) we thus deduce that for any $f \in b\mathcal{B}$,

$E[f(W_\tau)] = E[V] = E[g(A, B)] = E[f(X)]$.

That is, $W_\tau \stackrel{D}{=} X$, as claimed. Recall part (c) of Exercise 8.2.35 that $E[\tau_{a,b}] = ab$. Since $ab = r(a, b) f(-a) + (1 - r(a, b)) f(b)$ for $f(x) = x^2$ and any $(a, b) \in \mathbb{Y}$, we deduce by the same reasoning that $E[\tau] = E[AB] = E[X^2]$. Similarly, the bound $E\tau^2 \le \frac{5}{3} EX^4$ follows from the identity $E[\tau_{a,b}^2] = (ab)^2 + ab(a^2 + b^2)/3$ of part (c) of Exercise 8.2.35, and the inequality

$ab(a^2 + b^2 + 3ab) \le 5 ab(a^2 + b^2 - ab) = 5 [r(a, b) a^4 + (1 - r(a, b)) b^4]$

for all $(a, b) \in \mathbb{Y}$.
Building on Theorem 9.2.19 we have the following representation, due to Strassen, of any discrete time martingale via sampling along the path of the Brownian motion.
Theorem 9.2.20 (Strassen's martingale representation). Suppose the probability space contains a martingale $(M_\ell, \mathcal{F}_\ell)$ such that $M_0 = 0$, the i.i.d. $[0, 1]$-valued uniform $\{U_i\}$ and a standard Wiener process $\{W(t), t \ge 0\}$, both independent of $\mathcal{F}_\infty$ and of each other. Then:
(a) The filtrations $\mathcal{F}_{k,t} = \sigma(\mathcal{F}_k, \sigma(U_i, i \le 2k), \mathcal{F}^W_t)$ are such that $(W(t), \mathcal{F}_{k,t})$ is a Brownian Markov process for any $1 \le k \le \infty$.
(b) There exist non-decreasing a.s. finite $\mathcal{F}_{k,t}$-stopping times $T_k$, starting with $T_0 = 0$, where $\tau_k = T_k - T_{k-1}$ and the filtration $\mathcal{H}_k = \mathcal{F}_{k,T_k}$ are such that w.p.1. $E[\tau_k \mid \mathcal{H}_{k-1}] = E[D_k^2 \mid \mathcal{F}_{k-1}]$ and $E[\tau_k^2 \mid \mathcal{H}_{k-1}] \le 2 E[D_k^4 \mid \mathcal{F}_{k-1}]$ for the martingale differences $D_k = M_k - M_{k-1}$ and all $k \ge 1$.
(c) The discrete time process $\{W(T_\ell)\}$ has the same f.d.d. as $\{M_\ell\}$.
Remark. In case $\mathcal{F}_\ell = \sigma(M_k, k \le \ell)$, our proof constructs the martingale as samples $M_\ell = W(T_\ell)$ of the standard Wiener process $\{W(t), t \ge 0\}$ at a.s. finite, non-decreasing $\sigma(\sigma(U_i, i \le 2k), \mathcal{F}^W_t)$-stopping times $T_k$. Upon eliminating the
extra randomness $(U_1, U_2)$ in Theorem 9.2.19, we thus get the embedding of $\{M_n\}$ inside the path $t \mapsto W_t$, where $T_k$ are $\mathcal{F}^W_t$-stopping times and part (b) of the theorem applies for the corresponding stopped $\sigma$-algebras $\mathcal{H}_k = \mathcal{F}^W_{T_k}$. Indeed, we have stipulated the a-priori existence of $(M_\ell, \mathcal{F}_\ell)$ independently of the Wiener process only in order to accommodate non-canonical filtrations $\mathcal{F}_\ell$.
Proof. (a). Fixing $1 \le k \le \infty$, since $\mathcal{H} = \mathcal{F}_{k,0}$ is independent of $\mathcal{F}^W_\infty$, we have from Proposition 4.2.3 that for any $B \in \mathcal{B}$ and $u, s \ge 0$,

$E[I_B(W(u+s)) \mid \mathcal{F}_{k,s}] = E[I_B(W(u+s)) \mid \sigma(\mathcal{H}, \mathcal{F}^W_s)] = E[I_B(W(u+s)) \mid \mathcal{F}^W_s]$.

With $(W(t), \mathcal{F}^W_t)$ a Brownian Markov process, it thus follows that so are $(W(t), \mathcal{F}_{k,t})$.
(b). Starting at $T_0 = 0$ we sequentially construct the non-decreasing $\mathcal{F}_{k,t}$-stopping times $T_k$. Assuming $T_{k-1}$ has been constructed already, consider Corollary 9.1.6 for the Brownian Markov process $(W(t), \mathcal{F}_{k-1,t})$ and the $\mathcal{F}_{k-1,t}$-stopping time $T_{k-1}$, to deduce that $W_k(t) = W(T_{k-1} + t) - W(T_{k-1})$ is a standard Wiener process which is independent of $\mathcal{H}_{k-1}$. The pair $(U_{2k-1}, U_{2k})$ of $[0, 1]$-valued independent uniform variables is by assumption independent of $\mathcal{F}_{k-1,\infty}$, hence of both $\mathcal{H}_{k-1}$ and the standard Wiener process $W_k(t)$. Recall that $D_k = M_k - M_{k-1}$ is integrable and by the martingale property $E[D_k \mid \mathcal{F}_{k-1}] = 0$. With $\mathcal{F}_{k-1} \subseteq \mathcal{H}_{k-1}$ and representing our probability measure as a product measure on $(\Omega, \mathcal{F}_\infty)$, we thus apply Theorem 9.2.19 for the continuous filtration $\mathcal{G}_t = \sigma(U_{2k-1}, U_{2k}, W_k(s), s \le t)$ which is independent of $\mathcal{H}_{k-1}$ and the random distribution function

$F_{X_k}(x; \omega) = \widehat{P}_{D_k \mid \mathcal{F}_{k-1}}((-\infty, x], \omega)$,

corresponding to the zero-mean R.C.P.D. of $D_k$ given $\mathcal{F}_{k-1}$. The resulting a.s. finite $\mathcal{G}_t$-stopping time $\tau_k$ is such that

$E[\tau_k \mid \mathcal{H}_{k-1}] = \int x^2\, dF_{X_k}(x; \omega) = E[D_k^2 \mid \mathcal{F}_{k-1}]$

(see Exercise 4.4.6 for the second identity), and by the same reasoning $E[\tau_k^2 \mid \mathcal{H}_{k-1}] \le 2 E[D_k^4 \mid \mathcal{F}_{k-1}]$, while the R.C.P.D. of $W_k(\tau_k)$ given $\mathcal{H}_{k-1}$ matches the law $\widehat{P}_{D_k \mid \mathcal{F}_{k-1}}$ of $X_k$. Note that the threshold levels $(A_k, B_k)$ of Exercise 9.2.18 are measurable on $\mathcal{F}_{k,0}$ since by right-continuity of distribution functions their construction requires only the values of $(U_{2k-1}, U_{2k})$ and $F_{X_k}(q; \omega)$, $q \in \mathbb{Q}$. For example, for any $x \in \mathbb{R}$,

$\{\omega : X_k(\omega) \le x\} = \{\omega : U_{2k-1}(\omega) \le F_{X_k}(q; \omega)$ for all $q \in \mathbb{Q}, q > x\}$,

with similar identities for $\{Y_k \le y\}$ and $\{Z_k \le z\}$, whose distribution functions at $q \in \mathbb{Q}$ are ratios of integrals of the type $\int_0^q v\, dF_{X_k}(v; \omega)$, each of which is the limit as $n \to \infty$ of the $\mathcal{F}_{k-1}$-measurable

$\sum_{\ell \le qn} (\ell/n) \big( F_{X_k}(\ell/n + 1/n; \omega) - F_{X_k}(\ell/n; \omega) \big)$.

Further, setting $T_k = T_{k-1} + \tau_k$, from the proof of Theorem 9.2.19 we have that $T_k \le t$ if and only if $T_{k-1} \le t$ and either $\sup_{u \in [0,t]} \{W(u) - W(u \wedge T_{k-1})\} \ge B_k$ or $\inf_{u \in [0,t]} \{W(u) - W(u \wedge T_{k-1})\} \le -A_k$. Consequently, the event $\{T_k \le t\}$ is in

$\sigma(A_k, B_k, t \wedge T_{k-1}, I_{\{T_{k-1} \le t\}}, W(s), s \le t) \subseteq \mathcal{F}_{k,t}$,

by the $\mathcal{F}_{k,0}$-measurability of $(A_k, B_k)$ and our hypothesis that $T_{k-1}$ is an $\mathcal{F}_{k-1,t}$-stopping time.
9.2. WEAK CONVERGENCE AND INVARIANCE PRINCIPLES 359
(c). With $W(T_0) = M_0 = 0$, it clearly suffices to show that the f.d.d. of $\{W_\ell(\theta_\ell) = W(T_\ell) - W(T_{\ell-1})\}$ match those of $\{D_\ell = M_\ell - M_{\ell-1}\}$. To this end, recall that $\mathcal{H}_k = \mathcal{F}_{k,T_k}$ is a filtration (see part (b) of Exercise 8.1.11), and in part (b) we saw that $W_k(\theta_k)$ is $\mathcal{H}_k$-adapted such that its R.C.P.D. given $\mathcal{H}_{k-1}$ matches the R.C.P.D. of the $\mathcal{F}_k$-adapted $D_k$, given $\mathcal{F}_{k-1}$. Hence, for any $f_\ell \in b\mathcal{B}$, we have from the tower property that
$$\mathbb{E}\Big[\prod_{\ell=1}^{n} f_\ell(W_\ell(\theta_\ell))\Big] = \mathbb{E}\Big[\mathbb{E}[f_n(W_n(\theta_n)) \mid \mathcal{H}_{n-1}] \prod_{\ell=1}^{n-1} f_\ell(W_\ell(\theta_\ell))\Big] = \mathbb{E}\Big[\mathbb{E}[f_n(D_n) \mid \mathcal{F}_{n-1}] \prod_{\ell=1}^{n-1} f_\ell(W_\ell(\theta_\ell))\Big]$$
$$= \mathbb{E}\Big[\mathbb{E}[f_n(D_n) \mid \mathcal{F}_{n-1}] \prod_{\ell=1}^{n-1} f_\ell(D_\ell)\Big] = \mathbb{E}\Big[\prod_{\ell=1}^{n} f_\ell(D_\ell)\Big] \,,$$
where the third equality is from the induction assumption that $\{D_\ell\}_{\ell=1}^{n-1}$ has the same law as $\{W_\ell(\theta_\ell)\}_{\ell=1}^{n-1}$.
The following corollary of Strassen's representation recovers Skorokhod's representation of the random walk $S_n$ as the samples of Brownian motion at a sequence of stopping times with i.i.d. increments.

Corollary 9.2.21 (Skorokhod's representation for random walks). Suppose $\xi_1$ is integrable and of zero mean. The random walk $S_n = \sum_{k=1}^{n} \xi_k$ of i.i.d. $\{\xi_k\}$ can be represented as $S_n = W(T_n)$ for $T_0 = 0$, i.i.d. $\theta_k = T_k - T_{k-1} \ge 0$ such that $\mathbb{E}[\theta_1] = \mathbb{E}[\xi_1^2]$ and a standard Wiener process $\{W(t), t \ge 0\}$. Further, each $T_k$ is a stopping time for $\mathcal{F}_{k,t} = \sigma((U_i, i \le 2k), \mathcal{F}^W_t)$ (with i.i.d. $[0,1]$-valued uniform $\{U_i\}$ that are independent of $\{W_t, t \ge 0\}$).
Proof. The construction we provided in proving Theorem 9.2.20 is based on inductively applying Theorem 9.2.19 for $k = 1, 2, \ldots$, where $X_k$ follows the R.C.P.D. of the MG difference $D_k$ given $\mathcal{F}_{k-1}$. For a martingale of independent differences, such as the random walk $S_n$, we can a priori produce the independent thresholds $(A_k, B_k)$, $k \ge 1$, out of the given pairs of $U_i$, independently of the Wiener process. Then, in view of Corollary 9.1.6, for $k = 1, 2, \ldots$ both $(A_k, B_k)$ and the standard Wiener process $W_k(\cdot) = W(T_{k-1} + \cdot) - W(T_{k-1})$ are independent of the stopped at $T_{k-1}$ element of the continuous time filtration $\sigma(A_i, B_i, i < k, W(s), s \le t)$. Consequently, so is the stopping time $\theta_k = \inf\{t \ge 0 : W_k(t) \notin (-A_k, B_k)\}$ with respect to the continuous time filtration $\mathcal{G}_{k,t} = \sigma(A_k, B_k, W_k(s), s \le t)$, from which we conclude that $\{\theta_k\}$ are in this case i.i.d.
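The exit-time construction above is easy to probe numerically. Below is a minimal simulation sketch (ours, not from the text) of the simplest case of Corollary 9.2.21, i.i.d. $\xi_k = \pm 1$, where the thresholds degenerate to $A_k = B_k = 1$ and $\theta_k$ is the first exit time of the increment process $W_k$ from $(-1,1)$; the discretization step `dt`, the sample sizes and all names are our own choices.

```python
import numpy as np

def embed_pm1_walk(n_steps, dt=1e-3, rng=None):
    """Embed a +/-1 random walk in a discretized Brownian path: theta_k is
    the first exit time of the increment process W_k from (-1, 1), so that
    W(T_k) - W(T_{k-1}) is a +/-1 step (up to discretization overshoot),
    with E[theta_k] = E[xi_1^2] = 1."""
    rng = np.random.default_rng(0) if rng is None else rng
    sqdt = dt ** 0.5
    exit_times, increments = [], []
    for _ in range(n_steps):
        w, t = 0.0, 0.0
        while abs(w) < 1.0:          # run the increment process to its exit
            w += sqdt * rng.standard_normal()
            t += dt
        exit_times.append(t)
        increments.append(w)
    return np.array(exit_times), np.array(increments)

thetas, incs = embed_pm1_walk(400)
```

Up to the discretization overshoot, the embedded increments are $\pm 1$, and the identity $\mathbb{E}[\theta_1] = \mathbb{E}[\xi_1^2] = 1$ shows up in the sample mean of the exit times.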
Combining Strassen's martingale representation with Lemma 9.2.16, we are now in position to prove a Lindeberg type martingale clt.

Theorem 9.2.22 (martingale clt, Lindeberg's). Suppose that for any $n \ge 1$ fixed, $(M_{n,\ell}, \mathcal{F}_{n,\ell})$ is a (discrete time) $L^2$-martingale, starting at $M_{n,0} = 0$, and the corresponding martingale differences $D_{n,k} = M_{n,k} - M_{n,k-1}$ and predictable compensators
$$\langle M_n \rangle_\ell = \sum_{k=1}^{\ell} \mathbb{E}[D_{n,k}^2 \mid \mathcal{F}_{n,k-1}] \,,$$
are such that for any fixed $t \in [0,1]$, as $n \to \infty$,
(9.2.10) $\langle M_n \rangle_{[nt]} \stackrel{p}{\to} t$.
If in addition, for each $\epsilon > 0$,
(9.2.11) $g_n(\epsilon) = \sum_{k=1}^{n} \mathbb{E}[D_{n,k}^2 I_{\{|D_{n,k}| \ge \epsilon\}} \mid \mathcal{F}_{n,k-1}] \stackrel{p}{\to} 0$,
then as $n \to \infty$, the linearly interpolated, time-scaled S.P.
(9.2.12) $\widehat{S}_n(t) = M_{n,[nt]} + (nt - [nt])\, D_{n,[nt]+1}$,
converges in distribution on $C([0,1])$ to the standard Wiener process $\{W(t), t \in [0,1]\}$.
Remark. For martingale differences $\{D_{n,k}, k = 1, 2, \ldots\}$ that are mutually independent per fixed $n$, our assumption (9.2.11) reduces to Lindeberg's condition (3.1.4) and the predictable compensators $v_{n,t} = \langle M_n \rangle_{[nt]}$ are then non-random. In particular, $v_{n,t} = [nt]/n \to t$ in case $D_{n,k} = n^{-1/2}\xi_k$ for i.i.d. $\{\xi_k\}$ of zero mean and unit variance, in which case $g_n(\epsilon) = \mathbb{E}[\xi_1^2;\, |\xi_1| \ge \epsilon\sqrt{n}] \to 0$ (see Remark 3.1.4), and we recover Donsker's invariance principle as a special case of Lindeberg's martingale clt.
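As a quick sanity check of the theorem, one can simulate a martingale whose differences are not independent — below their conditional variance depends on the previous step — yet whose compensator still satisfies (9.2.10), and verify that the terminal value $M_{n,n} = \widehat{S}_n(1)$ is approximately standard normal. This toy construction is ours, not from the text, and all parameter values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)

def martingale_terminal(n, rng):
    """One sample of M_{n,n}: differences D_{n,k} = n^{-1/2} sigma_k eps_k
    with Rademacher eps_k and *predictable* sigma_k^2 = 1 + 0.9 eps_{k-1},
    so the D_{n,k} are uncorrelated but not independent, while the
    compensator n^{-1} sum_{k <= nt} sigma_k^2 -> t by the strong law,
    i.e. (9.2.10) holds; (9.2.11) is trivial since |D_{n,k}| -> 0."""
    eps = rng.choice([-1.0, 1.0], size=n)
    sig2 = np.ones(n)
    sig2[1:] += 0.9 * eps[:-1]        # depends only on the past: predictable
    return float(np.sum(np.sqrt(sig2 / n) * eps))

samples = np.array([martingale_terminal(400, rng) for _ in range(2000)])
```

The empirical mean, standard deviation and central coverage of `samples` should all be close to those of $\mathcal{N}(0,1)$.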
Proof. Step 1. We first prove a somewhat stronger convergence statement for martingales of uniformly bounded differences and predictable compensators. Specifically, using $\|\cdot\|$ for the supremum norm $\|x(\cdot)\| = \sup_{t \in [0,1]} |x(t)|$ on $C([0,1])$, we proceed under the additional assumption that for some non-random $\epsilon_n \to 0$,
$$\langle M_n \rangle_n \le 2 \,, \qquad \max_{k=1}^{n} |D_{n,k}| \le 2\epsilon_n \,,$$
to construct a coupling of $\widehat{S}_n(\cdot)$ of (9.2.12) and the standard Wiener process $W(\cdot)$ in the same probability space, such that $\|\widehat{S}_n - W\| \stackrel{p}{\to} 0$. To this end, apply Theorem 9.2.20 simultaneously for the martingales $(M_{n,\cdot}, \mathcal{F}_{n,\cdot})$ with the same standard Wiener process $\{W(t), t \ge 0\}$ and auxiliary i.i.d. $[0,1]$-valued uniform $\{U_i\}$, to get the representation $M_{n,\ell} = W(T_{n,\ell})$. Recall that $T_{n,\ell} = \sum_{k=1}^{\ell} \theta_{n,k}$ where for each $n$, the non-negative $\{\theta_{n,k}\}$ are adapted to the filtration $\{\mathcal{H}_{n,k}, k \ge 1\}$ and such that w.p.1. for $k = 1, \ldots, n$,
(9.2.13) $\mathbb{E}[\theta_{n,k} \mid \mathcal{H}_{n,k-1}] = \mathbb{E}[D_{n,k}^2 \mid \mathcal{F}_{n,k-1}]$,
(9.2.14) $\mathbb{E}[\theta_{n,k}^2 \mid \mathcal{H}_{n,k-1}] \le 2\,\mathbb{E}[D_{n,k}^4 \mid \mathcal{F}_{n,k-1}]$.
Under this representation, the process $\widehat{S}_n(\cdot)$ of (9.2.12) is of the form considered in Lemma 9.2.16, and as shown there, $\|\widehat{S}_n - W\| \stackrel{p}{\to} 0$ provided $T_{n,[nt]} \stackrel{p}{\to} t$ for each fixed $t \in [0,1]$.
To verify the latter convergence in probability, set $\widehat{T}_{n,\ell} = T_{n,\ell} - \langle M_n \rangle_\ell$ and $\Delta_{n,k} = \theta_{n,k} - \mathbb{E}[\theta_{n,k} \mid \mathcal{H}_{n,k-1}]$. Note that by the identities of (9.2.13),
$$\langle M_n \rangle_\ell = \sum_{k=1}^{\ell} \mathbb{E}[D_{n,k}^2 \mid \mathcal{F}_{n,k-1}] = \sum_{k=1}^{\ell} \mathbb{E}[\theta_{n,k} \mid \mathcal{H}_{n,k-1}] \,,$$
hence $\widehat{T}_{n,\ell} = \sum_{k=1}^{\ell} \Delta_{n,k}$ is for each $n$, the $\mathcal{H}_{n,\cdot}$-martingale part in Doob's decomposition of the integrable, $\mathcal{H}_{n,\cdot}$-adapted sequence $\{T_{n,\ell}, \ell \ge 0\}$. Further, considering the expectation in both sides of (9.2.14), by our assumed uniform bound $|D_{n,k}| \le 2\epsilon_n$ it follows that for any $k = 1, \ldots, n$,
$$\mathbb{E}[\Delta_{n,k}^2] \le \mathbb{E}[\theta_{n,k}^2] \le 2\,\mathbb{E}[D_{n,k}^4] \le 8\epsilon_n^2\, \mathbb{E}[D_{n,k}^2] \,,$$
where the left-most inequality is just the $L^2$-reduction of conditional centering (namely, $\mathbb{E}[\mathrm{Var}(X \mid \mathcal{H})] \le \mathbb{E}[X^2]$, see part (a) of Exercise 4.2.16). Consequently, the martingale $\{\widehat{T}_{n,\ell}, \ell \ge 0\}$ is square-integrable and since its differences $\Delta_{n,k}$ are uncorrelated (see part (a) of Exercise 5.1.8), we deduce that for any $n$ and $\ell \le n$,
$$\mathbb{E}[\widehat{T}_{n,\ell}^2] = \sum_{k=1}^{\ell} \mathbb{E}[\Delta_{n,k}^2] \le 8\epsilon_n^2 \sum_{k=1}^{n} \mathbb{E}[D_{n,k}^2] = 8\epsilon_n^2\, \mathbb{E}[\langle M_n \rangle_n] \,.$$
Recall our assumption that $\langle M_n \rangle_n$ is uniformly bounded, hence fixing $t \in [0,1]$, we conclude that $\widehat{T}_{n,[nt]} \stackrel{L^2}{\to} 0$ as $n \to \infty$. This of course implies the convergence to zero in probability of $\widehat{T}_{n,[nt]}$, and in view of assumption (9.2.10) and Slutsky's lemma, also $T_{n,[nt]} = \widehat{T}_{n,[nt]} + \langle M_n \rangle_{[nt]} \stackrel{p}{\to} t$ as $n \to \infty$.
Step 2. We next eliminate the superfluous assumption $\langle M_n \rangle_n \le 2$ via the strategy employed in proving part (a) of Theorem 5.3.32. That is, consider the $\mathcal{F}_{n,\cdot}$-stopped martingales $\overline{M}_{n,\ell} = M_{n,\tau_n \wedge \ell}$ for the stopping times $\tau_n = n \wedge \min\{\ell < n : \langle M_n \rangle_{\ell+1} > 2\}$, such that $\langle M_n \rangle_{\tau_n} \le 2$. As the corresponding martingale differences are $\overline{D}_{n,k} = D_{n,k} I_{\{k \le \tau_n\}}$, you can easily verify that for all $\ell \le n$, $\langle \overline{M}_n \rangle_\ell = \langle M_n \rangle_{\tau_n \wedge \ell}$. In particular, $\langle \overline{M}_n \rangle_n = \langle M_n \rangle_{\tau_n} \le 2$. Further, if $\tau_n = n$ then $\langle \overline{M}_n \rangle_{[nt]} = \langle M_n \rangle_{[nt]}$ for all $t \in [0,1]$ and the function
$$\overline{S}_n(t) = \overline{M}_{n,[nt]} + (nt - [nt])\,\overline{D}_{n,[nt]+1} \,,$$
coincides on $[0,1]$ with $\widehat{S}_n(\cdot)$ of (9.2.12). Due to the monotonicity of $\ell \mapsto \langle M_n \rangle_\ell$ we have that $\{\tau_n < n\} = \{\langle M_n \rangle_n > 2\}$, so our assumption (9.2.10) that $\langle M_n \rangle_n \stackrel{p}{\to} 1$ implies that $\mathbb{P}(\tau_n < n) \to 0$. Consequently, $\langle \overline{M}_n \rangle_{[nt]} \stackrel{p}{\to} t$ and applying Step 1 for the martingales $\{\overline{M}_{n,\ell}, \ell \le n\}$ we have the coupling of $\overline{S}_n(\cdot)$ and the standard Wiener process $W(\cdot)$ such that $\|\overline{S}_n - W\| \stackrel{p}{\to} 0$. Combining it with the (natural) coupling of $\overline{S}_n(\cdot)$ and $\widehat{S}_n(\cdot)$ such that $\mathbb{P}(\overline{S}_n(\cdot) \ne \widehat{S}_n(\cdot)) \le \mathbb{P}(\tau_n < n) \to 0$, we arrive at the coupling of $\widehat{S}_n(\cdot)$ and $W(\cdot)$ such that $\|\widehat{S}_n - W\| \stackrel{p}{\to} 0$.
Step 3. We establish the clt under the condition (9.2.11), by reducing the problem to the setting we have already handled in Step 2. This is done by a truncation argument similar to the one we used in proving Theorem 2.1.11, except that now we need to re-center the martingale differences after truncation is done (to convince yourself that some truncation is required, consider the special case of i.i.d. $\{D_{n,k}\}$ with infinite fourth moment, and note that a stopping argument as in Step 2 is not feasible here because unlike $\langle M_n \rangle_\ell$, the martingale differences are not $\mathcal{F}_{n,\ell}$-predictable).
Specifically, our assumption (9.2.11) implies the existence of finite $n_j \uparrow \infty$ such that $\mathbb{P}(g_n(j^{-1}) \ge j^{-3}) \le j^{-1}$ for all $j \ge 1$ and $n \ge n_j$, with $n_1 = 1$. Setting $\epsilon_n = j^{-1}$ for $n \in [n_j, n_{j+1})$ it then follows that as $n \to \infty$ both $\epsilon_n \to 0$ and for $\delta > 0$ fixed,
$$\mathbb{P}(\epsilon_n^{-2} g_n(\epsilon_n) \ge \delta) \le \mathbb{P}(\epsilon_n^{-2} g_n(\epsilon_n) \ge \epsilon_n) \to 0 \,.$$
In conclusion, there exist non-random $\epsilon_n \to 0$ such that $\epsilon_n^{-2} g_n(\epsilon_n) \stackrel{p}{\to} 0$, which we use hereafter as the truncation level for the martingale differences $\{D_{n,k}, k \le n\}$. That is, consider for each $n$, the new martingale $\widetilde{M}_{n,\ell} = \sum_{k=1}^{\ell} \widetilde{D}_{n,k}$, where
$$\widetilde{D}_{n,k} = \underline{D}_{n,k} - \mathbb{E}[\underline{D}_{n,k} \mid \mathcal{F}_{n,k-1}] \,, \qquad \underline{D}_{n,k} = D_{n,k}\, I_{\{|D_{n,k}| < \epsilon_n\}} \,.$$
By construction, $|\widetilde{D}_{n,k}| \le 2\epsilon_n$ for all $k \le n$. Hence, with $\overline{D}_{n,k} = D_{n,k}\, I_{\{|D_{n,k}| \ge \epsilon_n\}}$ and $\widetilde{S}_n(\cdot)$ denoting the linearly interpolated process of the form (9.2.12) that corresponds to $\widetilde{M}_{n,\cdot}$, by Slutsky's lemma and the preceding steps of the proof we have a coupling such that $\|\widetilde{S}_n - W\| \stackrel{p}{\to} 0$, as soon as we show that for all $\ell \le n$,
$$0 \le \langle M_n \rangle_\ell - \langle \widetilde{M}_n \rangle_\ell \le \sum_{k=1}^{\ell} \mathbb{E}[\overline{D}_{n,k}^2 \mid \mathcal{F}_{n,k-1}]$$
(for the right hand side is bounded by $2 g_n(\epsilon_n)$ which by our choice of $\epsilon_n$ converges to zero in probability, so the convergence (9.2.10) of the predictable compensators transfers from $M_{n,\cdot}$ to $\widetilde{M}_{n,\cdot}$). These inequalities are in turn a direct consequence of the bounds
(9.2.15) $\mathbb{E}[\widetilde{D}_{n,k}^2 \mid \mathcal{G}] \le \mathbb{E}[\underline{D}_{n,k}^2 \mid \mathcal{G}] \le \mathbb{E}[D_{n,k}^2 \mid \mathcal{G}] \le \mathbb{E}[\widetilde{D}_{n,k}^2 \mid \mathcal{G}] + 2\,\mathbb{E}[\overline{D}_{n,k}^2 \mid \mathcal{G}]$
holding for $\mathcal{G} = \mathcal{F}_{n,k-1}$ and all $1 \le k \le n$. The left-most inequality in (9.2.15) is merely an instance of the $L^2$-reduction of conditional centering, while the middle one follows from the identity $D_{n,k} = \underline{D}_{n,k} + \overline{D}_{n,k}$ upon realizing that by definition $\underline{D}_{n,k} \ne 0$ if and only if $\overline{D}_{n,k} = 0$, so
$$\mathbb{E}[D_{n,k}^2 \mid \mathcal{G}] = \mathbb{E}[\underline{D}_{n,k}^2 \mid \mathcal{G}] + \mathbb{E}[\overline{D}_{n,k}^2 \mid \mathcal{G}] \,.$$
The latter identity also yields the right-most inequality in (9.2.15), for $\mathbb{E}[\underline{D}_{n,k} \mid \mathcal{G}] = -\mathbb{E}[\overline{D}_{n,k} \mid \mathcal{G}]$ (due to the martingale condition $\mathbb{E}[D_{n,k} \mid \mathcal{F}_{n,k-1}] = 0$), hence
$$\mathbb{E}[\underline{D}_{n,k}^2 \mid \mathcal{G}] - \mathbb{E}[\widetilde{D}_{n,k}^2 \mid \mathcal{G}] = \big(\mathbb{E}[\underline{D}_{n,k} \mid \mathcal{G}]\big)^2 = \big(\mathbb{E}[\overline{D}_{n,k} \mid \mathcal{G}]\big)^2 \le \mathbb{E}[\overline{D}_{n,k}^2 \mid \mathcal{G}] \,.$$
Now that we have exhibited a coupling for which $\|\widetilde{S}_n - W\| \stackrel{p}{\to} 0$, if $\|\widehat{S}_n - \widetilde{S}_n\| \stackrel{p}{\to} 0$ then by the triangle inequality for the supremum norm $\|\cdot\|$ there also exists a coupling with $\|\widehat{S}_n - W\| \stackrel{p}{\to} 0$ (to construct the latter coupling, since there exist non-random $\delta_n \to 0$ such that $\mathbb{P}(\|\widehat{S}_n - \widetilde{S}_n\| \ge \delta_n) \to 0$, given the coupled $\widetilde{S}_n(\cdot)$ and $W(\cdot)$ you simply construct $\widehat{S}_n(\cdot)$ per $n$, conditional on the value of $\widetilde{S}_n(\cdot)$, in such a way that the joint law of $(\widehat{S}_n, \widetilde{S}_n)$ minimizes $\mathbb{P}(\|\widehat{S}_n - \widetilde{S}_n\| \ge \delta_n)$ subject to the specified laws of $\widehat{S}_n(\cdot)$ and of $\widetilde{S}_n(\cdot)$). In view of Corollary 3.5.3 this implies the convergence in distribution of $\widehat{S}_n(\cdot)$ to $W(\cdot)$ (on the metric space $(C([0,1]), \|\cdot\|)$).
Turning to verify that $\|\widehat{S}_n - \widetilde{S}_n\| \stackrel{p}{\to} 0$, recall first that $|\overline{D}_{n,k}| \le |\overline{D}_{n,k}|^2/\epsilon_n$, hence for $\mathcal{G} = \mathcal{F}_{n,k-1}$,
$$|\mathbb{E}(\underline{D}_{n,k} \mid \mathcal{G})| = |\mathbb{E}(\overline{D}_{n,k} \mid \mathcal{G})| \le \mathbb{E}(|\overline{D}_{n,k}| \mid \mathcal{G}) \le \frac{1}{\epsilon_n}\, \mathbb{E}[\overline{D}_{n,k}^2 \mid \mathcal{G}] \,.$$
Note also that if the event $\Gamma_n = \{|D_{n,k}| < \epsilon_n \text{ for all } k \le n\}$ occurs, then $D_{n,k} - \widetilde{D}_{n,k} = \mathbb{E}[\underline{D}_{n,k} \mid \mathcal{F}_{n,k-1}]$ for all $k$. Therefore,
$$I_{\Gamma_n} \|\widehat{S}_n - \widetilde{S}_n\| \le I_{\Gamma_n} \sum_{k=1}^{n} |D_{n,k} - \widetilde{D}_{n,k}| \le \sum_{k=1}^{n} |\mathbb{E}(\underline{D}_{n,k} \mid \mathcal{F}_{n,k-1})| \le \frac{1}{\epsilon_n}\, g_n(\epsilon_n) \,,$$
and our choice of $\epsilon_n \to 0$ such that $\epsilon_n^{-2} g_n(\epsilon_n) \stackrel{p}{\to} 0$ implies that $I_{\Gamma_n} \|\widehat{S}_n - \widetilde{S}_n\| \stackrel{p}{\to} 0$.
We thus complete the proof by showing that $\mathbb{P}(\Gamma_n^c) \to 0$. Indeed, fixing $n$ and $r > 0$, we apply Exercise 5.3.38 for the events $A_k = \{|D_{n,k}| \ge \epsilon_n\}$ adapted to the filtration $\{\mathcal{F}_{n,k}, k \ge 0\}$, and by Markov's inequality for C.E. (see part (b) of Exercise 4.2.22), arrive at
$$\mathbb{P}(\Gamma_n^c) = \mathbb{P}\Big(\bigcup_{k=1}^{n} A_k\Big) \le (1 - e^{-r}) + \mathbb{P}\Big(\sum_{k=1}^{n} \mathbb{P}(|D_{n,k}| \ge \epsilon_n \mid \mathcal{F}_{n,k-1}) > r\Big)$$
$$\le (1 - e^{-r}) + \mathbb{P}\Big(\epsilon_n^{-2} \sum_{k=1}^{n} \mathbb{E}[\overline{D}_{n,k}^2 \mid \mathcal{F}_{n,k-1}] > r\Big) = (1 - e^{-r}) + \mathbb{P}(\epsilon_n^{-2} g_n(\epsilon_n) > r) \,.$$
Consequently, our choice of $\epsilon_n$ implies that $\mathbb{P}(\Gamma_n^c) \le 3r$ for any $r > 0$ and all $n$ large enough. So, upon considering $r \downarrow 0$ we deduce that $\mathbb{P}(\Gamma_n^c) \to 0$. As explained before this concludes our proof of the martingale clt.
Specializing Theorem 9.2.22 to the case of a single martingale $(M_\ell, \mathcal{F}_\ell)$ leads to the following corollary.

Corollary 9.2.23. Suppose an $L^2$-martingale $(M_\ell, \mathcal{F}_\ell)$ starting at $M_0 = 0$, is of $\mathcal{F}_\ell$-predictable compensators such that $n^{-1}\langle M \rangle_n \stackrel{p}{\to} 1$ and as $n \to \infty$,
$$n^{-1} \sum_{k=1}^{n} \mathbb{E}\big[(M_k - M_{k-1})^2;\, |M_k - M_{k-1}| \ge \epsilon \sqrt{n}\big] \to 0 \,,$$
for any fixed $\epsilon > 0$. Then, as $n \to \infty$, the linearly interpolated, time-scaled S.P.
(9.2.16) $\widehat{S}_n(t) = n^{-1/2}\big(M_{[nt]} + (nt - [nt])(M_{[nt]+1} - M_{[nt]})\big)$,
converge in distribution on $C([0,1])$ to the standard Wiener process.

Proof. Simply consider Theorem 9.2.22 for $M_{n,\ell} = n^{-1/2} M_\ell$ and $\mathcal{F}_{n,\ell} = \mathcal{F}_\ell$. In this case $\langle M_n \rangle_\ell = n^{-1} \langle M \rangle_\ell$, so by the monotonicity of $\ell \mapsto \langle M \rangle_\ell$ the condition (9.2.10) amounts to $n^{-1}\langle M \rangle_n \stackrel{p}{\to} 1$, and in stating the corollary we merely replaced the condition (9.2.11) by the stronger assumption that $\mathbb{E}[g_n(\epsilon)] \to 0$ as $n \to \infty$.
Further specializing Theorem 9.2.22, you are to derive next the martingale extension of Lyapunov's clt.

Exercise 9.2.24. Let $Z_k = \sum_{i=1}^{k} X_i$, where the $\mathcal{F}_k$-adapted, square-integrable $\{X_k\}$ are such that w.p.1. $\mathbb{E}[X_k \mid \mathcal{F}_{k-1}] = \mu$ for some non-random $\mu$ and all $k \ge 1$. Setting $V_{n,q} = n^{-q/2} \sum_{k=1}^{n} \mathbb{E}[|X_k - \mu|^q \mid \mathcal{F}_{k-1}]$ suppose further that $V_{n,2} \stackrel{p}{\to} 1$, while for some $q > 2$ non-random, $V_{n,q} \stackrel{p}{\to} 0$.
(a) Setting $M_k = Z_k - k\mu$ show that $\widehat{S}_n(\cdot)$ of (9.2.16) converges in distribution on $C([0,1])$ to the standard Wiener process.
(b) Deduce that $L_n \stackrel{\mathcal{D}}{\to} L_\infty$, where
$$L_n = n^{-1/2} \max_{0 \le k \le n} \Big\{ Z_k - \frac{k}{n} Z_n \Big\}$$
and $\mathbb{P}(L_\infty \ge b) = \exp(-2b^2)$ for any $b > 0$.
(c) In case $\mu > 0$, set $T_b = \inf\{k \ge 1 : Z_k > b\}$ and show that $b^{-1} T_b \stackrel{p}{\to} 1/\mu$ when $b \to \infty$.
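Part (b) is easy to check by simulation (a sketch of ours, with arbitrary parameter values and $\mu = 0$): for i.i.d. zero-mean steps the statistic $L_n$ is the maximum of the "bridged" walk, and its upper tail should be close to $\exp(-2b^2)$, the tail of the maximum of a Brownian bridge.

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps = 400, 3000
X = rng.choice([-1.0, 1.0], size=(reps, n))           # i.i.d. steps, mu = 0
Z = np.cumsum(X, axis=1)
k = np.arange(1, n + 1)
bridge = Z - (k / n) * Z[:, -1:]                      # Z_k - (k/n) Z_n
L = np.maximum(bridge.max(axis=1), 0.0) / np.sqrt(n)  # the k = 0 term is 0
emp_half = float(np.mean(L >= 0.5))   # limit: exp(-2 * 0.25) ~ 0.607
emp_one = float(np.mean(L >= 1.0))    # limit: exp(-2) ~ 0.135
```

The empirical tail probabilities sit slightly below the limit values, since the discrete-time maximum misses the continuum excursions.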
The following exercises present typical applications of the martingale clt, starting with the least-squares parameter estimation for first order auto-regressive processes (see Exercises 6.1.15 and 6.3.30 for other aspects of these processes).

Exercise 9.2.25 (First order auto-regressive process). Consider the $\mathbb{R}$-valued S.P. $Y_0 = 0$ and $Y_n = \alpha Y_{n-1} + D_n$ for $n \ge 1$, with $\{D_n\}$ a uniformly bounded $\mathcal{F}_n$-adapted martingale difference sequence such that a.s. $\mathbb{E}[D_k^2 \mid \mathcal{F}_{k-1}] = 1$ for all $k \ge 1$, and $\alpha \in (-1, 1)$ a non-random parameter.
(a) Check that $\{Y_n\}$ is uniformly bounded. Deduce that $n^{-1} \sum_{k=1}^{n} D_k^2 \stackrel{a.s.}{\to} 1$ and $n^{-1} Z_n \stackrel{a.s.}{\to} 0$, where $Z_n = \sum_{k=1}^{n} Y_{k-1} D_k$.
Hint: See part (c) of Exercise 5.3.39.
(b) Let $V_n = \sum_{k=1}^{n} Y_{k-1}^2$. Considering the estimator $\widehat{\alpha}_n = \frac{1}{V_n} \sum_{k=1}^{n} Y_k Y_{k-1}$ of the parameter $\alpha$, conclude that $\sqrt{V_n}\,(\widehat{\alpha}_n - \alpha) \stackrel{\mathcal{D}}{\to} \mathcal{N}(0,1)$ as $n \to \infty$.
(c) Suppose now that $\alpha = 1$. Show that in this case $(Y_n, \mathcal{F}_n)$ is a martingale of uniformly bounded differences and deduce from the martingale clt that the two-dimensional random vectors $(n^{-1} Z_n, n^{-2} V_n)$ converge in distribution to $(\frac{1}{2}(W_1^2 - 1), \int_0^1 W_t^2\,dt)$ with $\{W_t, t \ge 0\}$ a standard Wiener process. Conclude that in this case
$$\sqrt{V_n}\,(\widehat{\alpha}_n - \alpha) \stackrel{\mathcal{D}}{\to} \frac{W_1^2 - 1}{2\sqrt{\int_0^1 W_t^2\,dt}} \,.$$
(d) Show that the conclusion of part (c) applies in case $\alpha = -1$, except for multiplying the limiting variable by $-1$.
Hint: Consider the sequence $(-1)^k Y'_k$ with $Y'_k$ corresponding to $D'_k = (-1)^k D_k$ and $\alpha = 1$.
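A simulation sketch of parts (a)–(b) (ours, not from the text; the Rademacher choice of $D_n$ and all parameter values are arbitrary): the least-squares estimator $\widehat{\alpha}_n$ should be close to $\alpha$, and the studentized error $\sqrt{V_n}(\widehat{\alpha}_n - \alpha)$ approximately standard normal.

```python
import numpy as np

rng = np.random.default_rng(3)
alpha, n = 0.5, 20000
D = rng.choice([-1.0, 1.0], size=n)   # bounded MG differences, E[D_k^2 | F_{k-1}] = 1
Y = np.zeros(n + 1)
for k in range(1, n + 1):             # Y_k = alpha Y_{k-1} + D_k
    Y[k] = alpha * Y[k - 1] + D[k - 1]
V = float(np.sum(Y[:-1] ** 2))                       # V_n = sum Y_{k-1}^2
alpha_hat = float(np.sum(Y[1:] * Y[:-1]) / V)        # least-squares estimator
z = np.sqrt(V) * (alpha_hat - alpha)                 # approx N(0, 1) by part (b)
```

Note that $z = Z_n/\sqrt{V_n}$ is exactly the normalized martingale of part (a), so its size is of order one even though $\widehat{\alpha}_n - \alpha$ itself is of order $n^{-1/2}$.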
Exercise 9.2.26. Let $L_n = n^{-1/2} \max_{0 \le k \le n} \big\{\sum_{i=1}^{k} c_i Y_i\big\}$, where the $\mathcal{F}_k$-adapted, square-integrable $\{Y_k\}$ are such that w.p.1. $\mathbb{E}[Y_k \mid \mathcal{F}_{k-1}] = 0$ and $\mathbb{E}[Y_k^2 \mid \mathcal{F}_{k-1}] = 1$ for all $k \ge 1$. Suppose further that $\sup_{k \ge 1} \mathbb{E}[|Y_k|^q \mid \mathcal{F}_{k-1}]$ is finite a.s. for some $q > 2$ and $c_k \in m\mathcal{F}_{k-1}$ are such that $n^{-1} \sum_{k=1}^{n} c_k^2 \stackrel{p}{\to} 1$. Show that $L_n \stackrel{\mathcal{D}}{\to} L_\infty$ with $\mathbb{P}(L_\infty \ge b) = 2\,\mathbb{P}(G \ge b)$ for a standard normal variable $G$ and all $b \ge 0$.
Hint: Consider part (a) of Exercise 9.2.24.
9.2.2. Law of the iterated logarithm. With $W_t$ a centered normal variable of variance $t$, one expects the Brownian sample function to grow as $\sqrt{t}$ both for $t \to \infty$ and for $t \downarrow 0$. While this is true for fixed, non-random times, such reasoning ignores the random fluctuations of the path (as we have discussed before in the context of random walks, see Exercise 2.2.24). Accounting for these we obtain the following law of the iterated logarithm (lil).

Theorem 9.2.27 (Khinchin's lil). Set $h(t) = \sqrt{2t \log\log(1/t)}$ for $t < 1/e$ and $\overline{h}(t) = t\,h(1/t)$. Then, for standard Wiener processes $\{W_t, t \ge 0\}$ and $\{\overline{W}_t, t \ge 0\}$, w.p.1. the following hold:
(9.2.17) $\limsup_{t \downarrow 0} \frac{W_t}{h(t)} = 1 \,, \qquad \liminf_{t \downarrow 0} \frac{W_t}{h(t)} = -1 \,,$
(9.2.18) $\limsup_{t \to \infty} \frac{\overline{W}_t}{\overline{h}(t)} = 1 \,, \qquad \liminf_{t \to \infty} \frac{\overline{W}_t}{\overline{h}(t)} = -1 \,.$
Remark. To determine the scale $h(t)$ recall the tail estimate $\mathbb{P}(G \ge x) = \exp(-\frac{x^2}{2}(1 + o(1)))$ which implies that for $t_n = \lambda^{2n}$, $\lambda \in (0,1)$, the sequence $\mathbb{P}(W_{t_n} \ge b\,h(t_n)) = n^{-b^2 + o(1)}$ is summable when $b > 1$ but not summable when $b < 1$. Indeed, using such tail bounds we prove the lil in the form of (9.2.17) by the subsequence method you have seen already in the proof of Proposition 2.3.1. Specifically, we consider such time skeleton $\{t_n\}$ and apply Borel-Cantelli I for $b > 1$ and $\lambda$ near one (where Doob's inequality controls the fluctuations of $t \mapsto W_t$ by those at $\{t_n\}$), en-route to the upper bound. To get a matching lower bound we use Borel-Cantelli II for $b < 1$ and the independent increments $W_{t_n} - W_{t_{n+1}}$ (which are near $W_{t_n}$ when $\lambda$ is small).

Proof. Since $\overline{h}(t) = t\,h(1/t)$, by the time-inversion invariance of the standard Wiener process, it follows upon considering $\overline{W}_t = t W_{1/t}$ that (9.2.18) is equivalent to (9.2.17). Further, by the symmetry of this process, it suffices to prove the statement about the limsup in (9.2.17).
Proceeding to upper bound $W_t/h(t)$, applying Doob's inequality (8.2.2) for the non-negative martingale $X_s = \exp(\theta(W_s - \theta s/2))$ (see part (a) of Exercise 8.2.7), such that $\mathbb{E}[X_0] = 1$, we find that for any $t, \theta, y \ge 0$,
$$\mathbb{P}\big(\sup_{s \in [0,t]}\{W_s - \theta s/2\} \ge y\big) = \mathbb{P}\big(\sup_{s \in [0,t]} X_s \ge e^{\theta y}\big) \le e^{-\theta y} \,.$$
Fixing $\lambda, \delta \in (0,1)$, consider this inequality for $t_n = \lambda^{2n}$, $y_n = h(t_n)/2$, $\theta_n = (1 + 2\delta) h(t_n)/t_n$ and $n > n_0(\lambda) = 1/(2\log(1/\lambda))$. Since $\exp(h(t)^2/2t) = \log(1/t)$, it follows that $e^{-\theta_n y_n} = (n_0/n)^{1+2\delta}$ is summable. Thus, by Borel-Cantelli I, for some $n_1 = n_1(\lambda, \delta, \omega) \ge n_0$ finite, w.p.1. $\sup_{s \in [0,t_n]}\{W_s - \theta_n s/2\} \le y_n$ for all $n \ge n_1$. With $\log\log(1/t)$ non-increasing on $(0, 1/e]$, we then have that for every $s \in (t_{n+1}, t_n]$ and $n \ge n_1$,
$$W_s \le y_n + \theta_n t_n/2 = (1 + \delta)\, h(t_n) \le \frac{1 + \delta}{\lambda}\, h(s) \,.$$
Therefore, w.p.1. $\limsup_{s \downarrow 0} W_s/h(s) \le (1 + \delta)/\lambda$. Considering $\delta_k = 1/k \downarrow 0$, $\lambda_k = 1 - 1/k \uparrow 1$ and $k \to \infty$ we conclude that
(9.2.19) $\limsup_{s \downarrow 0} \frac{W_s}{h(s)} \le 1$, w.p.1.
To bound below the left side of (9.2.19), consider the independent events $A_n = \{W_{t_n} - W_{t_{n+1}} \ge (1 - \lambda^2) h(t_n)\}$, where as before $t_n = \lambda^{2n}$, $n > n_0(\lambda)$ and $\lambda \in (0,1)$ is fixed. Setting $x_n = (1 - \lambda^2) h(t_n)/\sqrt{t_n - t_{n+1}}$, we have by the time-homogeneity and scaling properties of the standard Wiener process (see parts (b) and (d) of Exercise 9.1.1), that $\mathbb{P}(A_n) = \mathbb{P}(W_1 \ge x_n)$. Further, noting that both $x_n^2/2 = (1 - \lambda^2)\log\log(1/t_n)$ and $n x_n^{-1} \exp(-x_n^2/2) \to \infty$ as $n \to \infty$, by the lower bound on the tail of the standard normal distribution (see part (a) of Exercise 2.2.24), we have that for some $\kappa > 0$ and all $n$ large enough,
$$\mathbb{P}(A_n) = \mathbb{P}(W_1 \ge x_n) \ge \frac{1 - x_n^{-2}}{\sqrt{2\pi}\, x_n}\, e^{-x_n^2/2} \ge \kappa\, n^{-1} \,.$$
Now, by Borel-Cantelli II the divergence of the series $\sum_n \mathbb{P}(A_n)$ implies that w.p.1. $W_{t_n} - W_{t_{n+1}} \ge (1 - \lambda^2) h(t_n)$ for infinitely many values of $n$. Further, applying the bound (9.2.19) for the standard Wiener process $\{-W_s, s \ge 0\}$, we know that w.p.1. $-W_{t_{n+1}} \le 2 h(t_{n+1}) \le 4\lambda\, h(t_n)$ for all $n$ large enough. Upon adding these two bounds, we have that w.p.1. $W_{t_n} \ge (1 - 4\lambda - \lambda^2)\, h(t_n)$ for infinitely many values of $n$. Finally, considering $\lambda_k = 1/k$ and $k \to \infty$, we conclude that w.p.1. $\limsup_{t \downarrow 0} W_t/h(t) \ge 1$, which completes the proof.
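Convergence in the lil is notoriously slow, but even a modest simulation (ours, not from the text; a coin-flip walk serves as a proxy for the Wiener process, anticipating Proposition 9.2.29 below) shows the running maximum of $|S_n|/\overline{h}(n)$ staying on the right scale — of order one, rather than drifting to $0$ or $\infty$.

```python
import numpy as np

rng = np.random.default_rng(4)
N = 200000
S = np.cumsum(rng.choice([-1.0, 1.0], size=N))    # coin-flip walk as a Wiener proxy
n = np.arange(1, N + 1)
mask = n >= 100                                    # keep log log n well-defined and > 0
hbar = np.sqrt(2 * n[mask] * np.log(np.log(n[mask])))
R = float(np.max(np.abs(S[mask]) / hbar))          # running max of |S_n| / hbar(n)
```

By the lil the limsup of this ratio is $1$ almost surely; at finite horizons the running maximum is typically of order one but need not be near $1$.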
As illustrated by the next exercise, restricted to a sufficiently sparsely spaced $\{t_n\}$, the a.s. maximal fluctuations of the Wiener process are closer to the fixed time clt scale of $O(\sqrt{t})$, than the lil scale $\overline{h}(t)$.

Exercise 9.2.28. Show that for $t_n = \exp(\exp(n))$ and a Brownian Markov process $\{W_t, t \ge 0\}$, almost surely,
$$\limsup_{n \to \infty} W_{t_n}\big/\sqrt{2 t_n \log\log\log t_n} = 1 \,.$$
Combining Khinchin's lil and the representation of the random walk as samples of the Brownian motion at random times, we have the corresponding lil of Hartman-Wintner for the random walk.

Proposition 9.2.29 (Hartman-Wintner's lil). Suppose $S_n = \sum_{k=1}^{n} \xi_k$, where $\{\xi_k\}$ are i.i.d. with $\mathbb{E}\xi_1 = 0$ and $\mathbb{E}\xi_1^2 = 1$. Then, w.p.1.
$$\limsup_{n \to \infty} S_n/\overline{h}(n) = 1 \,,$$
where $\overline{h}(n) = n\,h(1/n) = \sqrt{2n \log\log n}$.

Remark. Recall part (c) of Exercise 2.3.4 that if $\mathbb{E}[|\xi_1|^\alpha] = \infty$ for some $0 < \alpha < 2$ then w.p.1. $n^{-1/\alpha}|S_n|$ is unbounded, hence so is $|S_n|/\overline{h}(n)$ and in particular the lil fails.

Proof. In Corollary 9.2.21 we represent the random walk as $S_n = W_{T_n}$ for the standard Wiener process $\{W_t, t \ge 0\}$ and $T_n = \sum_{k=1}^{n} \theta_k$ with non-negative i.i.d. $\{\theta_k\}$ of mean one. By the strong law of large numbers, $n^{-1} T_n \stackrel{a.s.}{\to} 1$. Thus, fixing $\epsilon > 0$, w.p.1. there exists $n_0(\omega)$ finite such that $n/(1+\epsilon) \le T_n \le n(1+\epsilon)$ for all $n \ge n_0$. With $t_\ell = e^e (1+\epsilon)^\ell$ for $\ell \ge 0$, and $V_\ell = \sup\{|W_s - W_t| : s, t \in [t_\ell, t_{\ell+3}]\}$, note that if $n \in [t_{\ell+1}, t_{\ell+2}]$ and $n \ge n_0(\omega)$, then $|W_{T_n} - W_n| \le V_\ell$. Further, $t \mapsto \overline{h}(t)$ is non-decreasing, so w.p.1.
$$\limsup_{n \to \infty} \Big|\frac{S_n}{\overline{h}(n)} - \frac{W_n}{\overline{h}(n)}\Big| \le \limsup_{\ell \to \infty} \frac{V_\ell}{\overline{h}(t_\ell)}$$
and in view of (9.2.18), it suffices to show that for some non-random $\eta_\epsilon \to 0$ as $\epsilon \downarrow 0$,
(9.2.20) $\limsup_{\ell \to \infty} \frac{V_\ell}{\overline{h}(t_\ell)} \le \eta_\epsilon$, w.p.1.
To this end, by the triangle inequality,
$$V_\ell \le 2 \sup\{|W_s - W_{t_\ell}| : s \in [t_\ell, t_{\ell+3}]\} \,.$$
It thus follows that for $\gamma_\epsilon = (t_{\ell+3} - t_\ell)/t_\ell = (1+\epsilon)^3 - 1$ and all $\ell \ge 0$,
$$\mathbb{P}\big(V_\ell \ge \sqrt{8\gamma_\epsilon}\, \overline{h}(t_\ell)\big) \le \mathbb{P}\Big(\sup_{s \in [t_\ell, t_{\ell+3}]} |W_s - W_{t_\ell}| \ge \sqrt{2\gamma_\epsilon}\, \overline{h}(t_\ell)\Big) = \mathbb{P}\big(\sup_{u \in [0,1]} |W_u| \ge x_\ell\big) \le 4\,\mathbb{P}(W_1 \ge x_\ell) \,,$$
where $x_\ell = \sqrt{2\gamma_\epsilon}\, \overline{h}(t_\ell)/\sqrt{t_{\ell+3} - t_\ell} = \sqrt{4 \log\log t_\ell} \ge 2$, the equality is by time-homogeneity and scaling of the Brownian motion and the last inequality is a consequence of the symmetry of Brownian motion and the reflection principle (see (9.1.2)). With $\ell^2 \exp(-x_\ell^2/2) = (\ell/\log t_\ell)^2$ bounded above (in $\ell$), applying the upper bound of part (a) of Exercise 2.2.24 for the standard normal distribution of $W_1$ we find that for some finite $\kappa_\epsilon$ and all $\ell \ge 0$,
$$\mathbb{P}(W_1 \ge x_\ell) \le (2\pi)^{-1/2} x_\ell^{-1} e^{-x_\ell^2/2} \le \kappa_\epsilon\, \ell^{-2} \,.$$
Having just shown that $\sum_\ell \mathbb{P}(V_\ell \ge \sqrt{8\gamma_\epsilon}\, \overline{h}(t_\ell))$ is finite, we deduce by Borel-Cantelli I that (9.2.20) holds for $\eta_\epsilon = \sqrt{8\gamma_\epsilon}$, which as explained before, completes the proof.
Remark. Strassen's lil goes further than Hartman-Wintner's lil, in characterizing the almost sure limit set (i.e., the collection of all limits of convergent subsequences in $C([0,1])$), for $\{S(n\,\cdot)/\overline{h}(n)\}$ and $S(\cdot)$ of (9.2.1), as
$$\mathcal{K} = \Big\{ x(\cdot) \in C([0,1]) : x(t) = \int_0^t y(s)\,ds \,, \ \int_0^1 y(s)^2\,ds \le 1 \Big\} \,.$$
While Strassen's lil is outside our scope, here is a small step in this direction.

Exercise 9.2.30. Show that w.p.1. $[-1, 1]$ is the limit set of the $\mathbb{R}$-valued sequence $\{S_n/\overline{h}(n)\}$.
9.3. Brownian path: regularity, local maxima and level sets

Recall Exercise 7.3.13 that the Brownian sample function is a.s. locally $\gamma$-Hölder continuous for any $\gamma < 1/2$ and Khinchin's lil tells us that it is not $\gamma$-Hölder continuous for any $\gamma \ge 1/2$ on any interval $[0, t]$. Generalizing the latter irregularity property, we first state and prove the classical result of Paley, Wiener and Zygmund (see [PWZ33]), showing that a.s. a Brownian Markov process has nowhere differentiable sample functions (not even at a random time $t = t(\omega)$).

Definition 9.3.1. For a continuous function $f : [0,\infty) \to \mathbb{R}$ and $\gamma \in (0,1]$, the upper and lower (right) $\gamma$-derivatives at $s \ge 0$ are the $\overline{\mathbb{R}}$-valued
$$D^\gamma f(s) = \limsup_{u \downarrow 0} u^{-\gamma}[f(s+u) - f(s)] \quad \text{and} \quad D_\gamma f(s) = \liminf_{u \downarrow 0} u^{-\gamma}[f(s+u) - f(s)] \,,$$
which always exist. The Dini derivatives correspond to $\gamma = 1$ and are denoted by $D^1 f(s)$ and $D_1 f(s)$. Indeed, a continuous function $f$ is differentiable from the right at $s$ if $D^1 f(s) = D_1 f(s)$ is finite.

Proposition 9.3.2 (Paley-Wiener-Zygmund). With probability one, the sample function of a Wiener process $t \mapsto W_t(\omega)$ is nowhere differentiable. More precisely, for $\gamma = 1$ and any $T \le \infty$,
(9.3.1) $\mathbb{P}(\omega : -\infty < D_\gamma W_t(\omega) \le D^\gamma W_t(\omega) < \infty \text{ for some } t \in [0,T]) = 0 \,.$
Proof. Fixing integers $k, r \ge 1$, let
$$A_{kr} = \bigcup_{s \in [0,1]} \bigcap_{u \in [0,1/r]} \{\omega : |W_{s+u}(\omega) - W_s(\omega)| \le k u\} \,.$$
Note that if $-c \le D_1 W_t(\omega) \le D^1 W_t(\omega) \le c$ for some $t \in [0,1]$ and $c < \infty$, then by definition of the Dini derivatives, $\omega \in A_{kr}$ for $k = [c] + 1$ and some $r \ge 1$. We thus establish (9.3.1) for $\gamma = 1$ and $T = 1$, as soon as we show that $A_{kr} \subseteq C$ for some $C \in \mathcal{F}^W$ such that $\mathbb{P}(C) = 0$ (due to the uncountable union/intersection in the definition of $A_{kr}$, it is a priori not in $\mathcal{F}^W$, but recall Remark 8.1.3 that we add all $\mathbb{P}$-null sets to $\mathcal{F}_0 \subseteq \mathcal{F}$, hence a posteriori $A_{kr} \in \mathcal{F}$ and $\mathbb{P}(A_{kr}) = 0$). To this end we set
$$C = \bigcap_{n \ge 4r}\, \bigcup_{i=1}^{n} C_{n,i} \in \mathcal{F}^W \,, \quad \text{where for } i = 1, \ldots, n,$$
$$C_{n,i} = \{\omega : |W_{(i+j)/n}(\omega) - W_{(i+j-1)/n}(\omega)| \le 8k/n \text{ for } j = 1, 2, 3\} \,.$$
To see that $A_{kr} \subseteq C$ note that for any $n \ge 4r$, if $\omega \in A_{kr}$ then for some integer $1 \le i \le n$ there exists $s \in [(i-1)/n, i/n]$ such that $|W_t(\omega) - W_s(\omega)| \le k(t-s)$ for all $t \in [s, s+1/r]$. This applies in particular for $t = (i+j)/n$, $j = 0, 1, 2, 3$, in which case $0 \le t - s \le 4/n \le 1/r$ and consequently, $|W_t(\omega) - W_s(\omega)| \le 4k/n$. Then, by the triangle inequality necessarily also $\omega \in C_{n,i}$.
We next show that $\mathbb{P}(C) = 0$. Indeed, note that for each $i, n$ the random variables $G_j = \sqrt{n}\,(W_{(i+j)/n} - W_{(i+j-1)/n})$, $j = 1, 2, \ldots$, are independent, each having the standard normal distribution. With their density bounded by $1/\sqrt{2\pi} \le 1/2$, it follows that $\mathbb{P}(|G_j| \le \epsilon) \le \epsilon$ for all $\epsilon > 0$ and consequently,
$$\mathbb{P}(C_{n,i}) = \prod_{j=1}^{3} \mathbb{P}(|G_j| \le 8k n^{-1/2}) \le (8k)^3\, n^{-3/2} \,.$$
This in turn implies that $\mathbb{P}(C) \le \sum_{i \le n} \mathbb{P}(C_{n,i}) \le (8k)^3/\sqrt{n}$ for any $n \ge 4r$ and upon taking $n \to \infty$, results with $\mathbb{P}(C) = 0$, as claimed.
Having established (9.3.1) for $T = 1$, we note that by the scaling property of the Wiener process, the same applies for any finite $T$. Finally, the subset of $\Omega$ considered there in case $T = \infty$ is merely the increasing limit as $n \to \infty$ of such subsets for $T = n$, hence also of zero probability.
You can even improve upon this negative result as follows.

Exercise 9.3.3. Adapting the proof of Proposition 9.3.2 show that for any fixed $\gamma > \frac{1}{2}$, w.p.1. the sample function $t \mapsto W_t(\omega)$ is nowhere $\gamma$-Hölder continuous. That is, (9.3.1) holds for any $\gamma > 1/2$.

Exercise 9.3.4. Let $\{X_t, t \ge 0\}$ be a stochastic process of stationary increments which satisfies for some $H \in (0,1)$ the self-similarity property $X_{ct} \stackrel{\mathcal{D}}{=} c^H X_t$, for all $t \ge 0$ and $c > 0$. Show that if $\mathbb{P}(X_1 = 0) = 0$ then for any $t \ge 0$ fixed, a.s. $\limsup_{u \downarrow 0} u^{-1}|X_{t+u} - X_t| = \infty$. Hence, w.p.1. the sample functions $t \mapsto X_t$ are not differentiable at any fixed $t \ge 0$.

Almost surely the sample path of the Brownian motion is locally $\gamma$-Hölder continuous for any $\gamma < \frac{1}{2}$ but not for any $\gamma > \frac{1}{2}$. Further, by the lil its modulus of continuity is at least $h(\delta) = \sqrt{2\delta \log\log(1/\delta)}$. The exact modulus of continuity of the Brownian path is provided by the following theorem, due to Paul Lévy (see [Lev37]).
Theorem 9.3.5 (Lévy's modulus of continuity). Setting $g(\delta) = \sqrt{2\delta \log(1/\delta)}$ for $\delta \in (0,1]$, with $\{W_t, t \in [0,T]\}$ a Wiener process, for any $0 < T < \infty$, almost surely,
(9.3.2) $\limsup_{\delta \downarrow 0} \frac{\mathrm{osc}_{T,\delta}(W_\cdot)}{g(\delta)} = 1 \,,$
where $\mathrm{osc}_{T,\delta}(x(\cdot))$ is defined in (9.2.2).

Remark. This result does not extend to $T = \infty$, as by independence of the unbounded Brownian increments, for any $\delta > 0$, with probability one
$$\mathrm{osc}_{\infty,\delta}(W_\cdot) \ge \max_{k \ge 1}\{|W_{k\delta} - W_{(k-1)\delta}|\} = \infty \,.$$
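Numerically (a rough sketch of ours; the grid sizes and the scale $\delta$ are arbitrary choices), the oscillation of a simulated path at scale $\delta$, divided by $g(\delta)$, is already close to $1$ at moderate resolutions.

```python
import numpy as np

rng = np.random.default_rng(5)
N = 2 ** 17                                   # dyadic grid of N steps on [0, 1]
W = np.concatenate([[0.0], np.cumsum(rng.standard_normal(N) / np.sqrt(N))])

def osc(path, r):
    """Largest |W_t - W_s| over grid pairs with |t - s| <= r / N."""
    return max(float(np.max(np.abs(path[k:] - path[:-k]))) for k in range(1, r + 1))

delta = 2.0 ** -10
ratio = osc(W, int(delta * N)) / np.sqrt(2 * delta * np.log(1 / delta))
```

Note the contrast with the lil: the supremum over all windows of width $\delta$ lives on the scale $g(\delta)$, with $\log(1/\delta)$ in place of $\log\log(1/\delta)$.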
Proof. Fixing $0 < T < \infty$, note that $g(T\delta)/(\sqrt{T}\, g(\delta)) \to 1$ as $\delta \downarrow 0$. Further, $\mathrm{osc}_{1,\delta}(\widetilde{W}_\cdot) = T^{-1/2}\,\mathrm{osc}_{T,T\delta}(W_\cdot)$ where $\widetilde{W}_s = T^{-1/2} W_{Ts}$ is a standard Wiener process on $[0,1]$. Consequently, it suffices to prove (9.3.2) only for $T = 1$.
Setting hereafter $T = 1$, we start with the easier lower bound on the left side of (9.3.2). To this end, fix $\epsilon \in (0,1)$ and note that by independence of the increments of the Wiener process,
$$\mathbb{P}\big(\Delta_{\ell,1}(W_\cdot) \le (1-\epsilon)\, g(2^{-\ell})\big) = (1 - q_\ell)^{2^\ell} \le \exp(-2^\ell q_\ell) \,,$$
where $q_\ell = \mathbb{P}(|W_{2^{-\ell}}| > (1-\epsilon)\, g(2^{-\ell}))$ and
(9.3.3) $\Delta_{\ell,r}(x(\cdot)) = \max_{j=0}^{2^\ell - r} |x((j+r)2^{-\ell}) - x(j 2^{-\ell})| \,.$
Further, by scaling and the lower bound of part (a) of Exercise 2.2.24, it is easy to check that for some $\ell_0 = \ell_0(\epsilon)$ and all $\ell \ge \ell_0$,
$$q_\ell = \mathbb{P}\big(|G| > (1-\epsilon)\sqrt{2\ell \log 2}\big) \ge 2^{-\ell(1-\epsilon)} \,.$$
By definition $\mathrm{osc}_{1,2^{-\ell}}(x(\cdot)) \ge \Delta_{\ell,1}(x(\cdot))$ for any $x : [0,1] \to \mathbb{R}$ and with $\exp(-2^\ell q_\ell) \le \exp(-2^{\ell\epsilon})$ summable, it follows by Borel-Cantelli I that w.p.1.
$$\mathrm{osc}_{1,2^{-\ell}}(W_\cdot) \ge \Delta_{\ell,1}(W_\cdot) > (1-\epsilon)\, g(2^{-\ell})$$
for all $\ell \ge \ell_1(\epsilon, \omega)$ finite. In particular, for any $\epsilon > 0$ fixed, w.p.1.
$$\limsup_{\delta \downarrow 0} \frac{\mathrm{osc}_{1,\delta}(W_\cdot)}{g(\delta)} \ge 1 - \epsilon \,,$$
and considering $\epsilon_k = 1/k \downarrow 0$, we conclude that
$$\limsup_{\delta \downarrow 0} \frac{\mathrm{osc}_{1,\delta}(W_\cdot)}{g(\delta)} \ge 1 \quad \text{w.p.1.}$$
To show the matching upper bound, we fix $\epsilon \in (0,1)$ and $b = b(\epsilon) = (1+2\epsilon)/(1-\epsilon)$ and consider the events
$$A_\ell = \bigcap_{r \le 2^{\ell\epsilon}} \big\{\Delta_{\ell,r}(W_\cdot) < \sqrt{b}\, g(r 2^{-\ell})\big\} \,.$$
By the sub-additivity of probabilities,
$$\mathbb{P}(A_\ell^c) \le \sum_{r \le 2^{\ell\epsilon}} \mathbb{P}\big(\Delta_{\ell,r}(W_\cdot) \ge \sqrt{b}\, g(r 2^{-\ell})\big) \le \sum_{r \le 2^{\ell\epsilon}}\, \sum_{j=0}^{2^\ell - r} \mathbb{P}(|G_{r,j}| \ge x_{\ell,r})$$
where $x_{\ell,r} = \sqrt{2b \log(2^\ell/r)}$ and $G_{r,j} = (W_{(j+r)2^{-\ell}} - W_{j 2^{-\ell}})/\sqrt{r 2^{-\ell}}$ have the standard normal distribution. Since $\epsilon > 0$, clearly $x_{\ell,r}$ is bounded away from zero for $r \le 2^{\ell\epsilon}$ and $\exp(-x_{\ell,r}^2/2) = (r 2^{-\ell})^b$. Hence, from the upper bound of part (a) of Exercise 2.2.24 we deduce that the $r$-th term of the outer sum is at most $2^\ell C (r 2^{-\ell})^b$ for some finite constant $C = C(\epsilon)$. Further, for some finite $\kappa = \kappa(\epsilon)$ and all $\ell \ge 1$,
$$\sum_{r \le 2^{\ell\epsilon}} r^b \le \int_0^{2^{\ell\epsilon}+1} t^b\, dt \le \kappa\, 2^{\ell\epsilon(b+1)} \,.$$
Therefore, as $(1-\epsilon)\, b(\epsilon) - (1+\epsilon) = \epsilon > 0$,
$$\mathbb{P}(A_\ell^c) \le C\, 2^\ell\, 2^{-\ell b}\, \kappa\, 2^{\ell\epsilon(b+1)} = C \kappa\, 2^{-\ell\epsilon} \,,$$
and since $\sum_\ell \mathbb{P}(A_\ell^c)$ is finite, by the first Borel-Cantelli lemma, on a set $\Omega_\epsilon$ of probability one, $\omega \in A_\ell$ for all $\ell \ge \ell_0(\epsilon, \omega)$ finite. As you show in Exercise 9.3.6, it then follows from the continuity of $t \mapsto W_t$ that on $\Omega_\epsilon$,
$$|W_{s+h}(\omega) - W_s(\omega)| \le \sqrt{b}\, g(h)\,[1 + \eta(\epsilon, \ell_0(\epsilon,\omega), h)] \,,$$
where $\eta(\epsilon, \ell, h) \to 0$ as $h \downarrow 0$. Consequently, for any $\omega \in \Omega_\epsilon$,
$$\limsup_{\delta \downarrow 0}\, g(\delta)^{-1} \sup_{0 \le s \le s+h \le 1,\ h = \delta} |W_{s+h}(\omega) - W_s(\omega)| \le \sqrt{b} \,.$$
Since $g(\delta)$ is non-decreasing on $[0, 1/e]$, we can further replace the condition $h = \delta$ in the preceding inequality by $h \in [0,\delta]$ and deduce that on $\Omega_\epsilon$,
$$\limsup_{\delta \downarrow 0} \frac{\mathrm{osc}_{1,\delta}(W_\cdot)}{g(\delta)} \le \sqrt{b(\epsilon)} \,.$$
Taking $\epsilon_k = 1/k$ for which $b(1/k) \downarrow 1$ we conclude that w.p.1. the same bound also holds with $b = 1$.
Exercise 9.3.6. Suppose $x \in C([0,1])$ and $\Delta_{m,r}(x)$ are as in (9.3.3).
(a) Show that for any $m, r \ge 0$,
$$\sup_{r 2^{-m} \le |t-s| \le (r+1) 2^{-m}} |x(t) - x(s)| \le 4 \sum_{\ell = m+1}^{\infty} \Delta_{\ell,1}(x) + \Delta_{m,r}(x) \,.$$
Hint: Deduce from part (a) of Exercise 7.2.7 that this holds if in addition $t, s \in \mathbb{Q}^{(2,k)}_1$ for some $k > m$.
(b) Show that for some $c$ finite, if $2^{-(m+1)(1-\epsilon)} \le h \le 1/e$ with $m \ge 0$ and $\epsilon \in (0,1)$, then
$$\sum_{\ell = m+1}^{\infty} g(2^{-\ell}) \le c\, g(2^{-m-1}) \le \frac{c}{\sqrt{1-\epsilon}}\, 2^{-(m+1)\epsilon/2}\, g(h) \,.$$
Hint: Recall that $g(h) = \sqrt{2h \log(1/h)}$ is non-decreasing on $[0, 1/e]$.
(c) Conclude that there exists $\eta(\epsilon, \ell_0, h) \to 0$ as $h \downarrow 0$, such that if $\Delta_{\ell,r}(x) \le \sqrt{b}\, g(r 2^{-\ell})$ for some $\epsilon \in (0,1)$ and all $1 \le r \le 2^{\ell\epsilon}$, $\ell \ge \ell_0$, then
$$\sup_{0 \le s \le s+h \le 1} |x(s+h) - x(s)| \le \sqrt{b}\, g(h)\,[1 + \eta(\epsilon, \ell_0, h)] \,.$$
We take up now the study of level sets of the standard Wiener process,
(9.3.4) $\mathcal{Z}_\omega(b) = \{t \ge 0 : W_t(\omega) = b\} \,,$
for non-random $b \in \mathbb{R}$, starting with its zero set $\mathcal{Z}_\omega = \mathcal{Z}_\omega(0)$.

Proposition 9.3.7. For a.e. $\omega$, the zero set $\mathcal{Z}_\omega$ of the standard Wiener process, is closed, unbounded, of zero Lebesgue measure and having no isolated points.

Remark. Recall that by Baire's category theorem, any closed subset of $\mathbb{R}$ having no isolated points, must be uncountable (c.f. [Dud89, Theorem 2.5.2]).

Proof. First note that $(t,\omega) \mapsto W_t(\omega)$ is measurable with respect to $\mathcal{B}_{[0,\infty)} \times \mathcal{F}$ and hence so is the set $\mathcal{Z} = \{(t,\omega) : W_t(\omega) = 0\}$. Applying Fubini's theorem for the product measure $\mathrm{Leb} \times \mathbb{P}$ and $h(t,\omega) = I_{\mathcal{Z}}(t,\omega) = I_{\{W_t(\omega) = 0\}}$ we find that $\mathbb{E}[\mathrm{Leb}(\mathcal{Z}_\omega)] = (\mathrm{Leb} \times \mathbb{P})(\mathcal{Z}) = \int_0^\infty \mathbb{P}(W_t = 0)\, dt = 0$. Thus, the set $\mathcal{Z}_\omega$ is w.p.1. of zero Lebesgue measure, as claimed. The set $\mathcal{Z}_\omega$ is closed since it is the inverse image of the closed set $\{0\}$ under the continuous mapping $t \mapsto W_t$. In Corollary 9.1.5 we have further shown that w.p.1. $\mathcal{Z}_\omega$ is unbounded and that the continuous function $t \mapsto W_t$ changes sign infinitely many times in any interval $[0,\epsilon]$, $\epsilon > 0$, from which it follows that zero is an accumulation point of $\mathcal{Z}_\omega$.
Next, with $A_{s,t} = \{\omega : \mathcal{Z}_\omega \cap (s,t) \text{ is a single point}\}$, note that the event that $\mathcal{Z}_\omega$ has an isolated point in $(0,\infty)$ is the countable union of $A_{s,t}$ over $s, t \in \mathbb{Q}$ such that $0 < s < t$. Consequently, to show that w.p.1. $\mathcal{Z}_\omega$ has no isolated point, it suffices to show that $\mathbb{P}(A_{s,t}) = 0$ for any $0 < s < t$. To this end, consider the a.s. finite $\mathcal{F}^W_t$-Markov times $R_r = \inf\{u > r : W_u = 0\}$, $r \ge 0$. Fixing $0 < s < t$, let $\tau = R_s$, noting that $A_{s,t} \subseteq \{\tau < t \le R_\tau\}$ and consequently $\mathbb{P}(A_{s,t}) \le \mathbb{P}(R_\tau > \tau)$. By continuity of $t \mapsto W_t$ we know that $W_\tau = 0$, hence $R_\tau - \tau = \inf\{u > 0 : W_{\tau+u} - W_\tau = 0\}$. Recall Corollary 9.1.6 that $\{W_{\tau+u} - W_\tau, u \ge 0\}$ is a standard Wiener process and therefore $\mathbb{P}(R_\tau > \tau) = \mathbb{P}(R_0 > 0) = \mathbb{P}_0(T_0 > 0) = 0$ (as shown already in Corollary 9.1.5).
In view of its strong Markov property, the level sets of the Wiener process inherit the properties of its zero set.

Corollary 9.3.8. For any fixed $b \in \mathbb{R}$ and a.e. $\omega$, the level set $\mathcal{Z}_\omega(b)$ is closed, unbounded, of zero Lebesgue measure and having no isolated points.

Proof. Fixing $b \in \mathbb{R}$, $b \ne 0$, consider the $\mathcal{F}^W_t$-Markov time $T_b = \inf\{s > 0 : W_s = b\}$. While proving the reflection principle we have seen that w.p.1. $T_b$ is finite and $W_{T_b} = b$, in which case it follows from (9.3.4) that $t \in \mathcal{Z}_\omega(b)$ if and only if $t = T_b + u$ for $u \ge 0$ such that $\widetilde{W}_u(\omega) = 0$, where $\widetilde{W}_u = W_{T_b + u} - W_{T_b}$ is, by Corollary 9.1.6, a standard Wiener process. That is, up to a translation by $T_b(\omega)$ the level set $\mathcal{Z}_\omega(b)$ is merely the zero set of $\widetilde{W}_t$ and we conclude the proof by applying Proposition 9.3.7 for the latter zero set.
Remark 9.3.9. Recall Example 8.2.50 that for a.e. $\omega$ the sample function $W_t(\omega)$ is of unbounded total variation on each finite interval $[s,t]$ with $s < t$. Thus, from part (a) of Exercise 8.2.41 we deduce that on any such interval w.p.1. the sample function $W_\cdot(\omega)$ is non-monotone. Since every nonempty interval includes one with rational endpoints, of which there are only countably many, we conclude that for a.e. $\omega$, the sample path $t \mapsto W_t(\omega)$ of the Wiener process is monotone in no interval. Here is an alternative, direct proof of this fact.
Exercise 9.3.10. Let $A_n = \bigcap_{i=1}^n \{\omega : W_{i/n}(\omega) - W_{(i-1)/n}(\omega) \ge 0\}$ and $A = \{\omega : t \mapsto W_t(\omega) \text{ is non-decreasing on } [0,1]\}$.
(a) Show that $\mathbb{P}(A_n) = 2^{-n}$ for all $n \ge 1$ and that $A = \bigcap_n A_n \in \mathcal{F}$ has zero probability.
(b) Deduce that for any interval $[s,t]$ with $0 \le s < t$ non-random the probability that $W_\cdot(\omega)$ is monotone on $[s,t]$ is zero and conclude that the event $F \in \mathcal{F}$ where $t \mapsto W_t(\omega)$ is monotone on some non-empty open interval, has zero probability.
Hint: Recall the invariance transformations of Exercise 9.1.1 and that $F$ can be expressed as a countable union of events indexed by $s < t \in \mathbb{Q}$.
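Part (a) also lends itself to a quick numerical sanity check. A minimal sketch (assuming the numpy library; the sample sizes are arbitrary, not from the text): the $n$ increments are i.i.d. $N(0, 1/n)$, each nonnegative with probability $1/2$, so by independence all $n$ are nonnegative with probability $2^{-n}$.

```python
import numpy as np

# Monte Carlo check of part (a): the n Brownian increments
# W_{i/n} - W_{(i-1)/n} are i.i.d. N(0, 1/n), so each is >= 0 with
# probability 1/2 and, by independence, all n of them are >= 0 with
# probability 2^{-n}.
rng = np.random.default_rng(0)
n, trials = 4, 200_000
increments = rng.normal(scale=np.sqrt(1.0 / n), size=(trials, n))
p_hat = np.mean(np.all(increments >= 0, axis=1))
print(p_hat)  # close to 2**-4 = 0.0625
```

For $n = 4$ the estimate settles near $2^{-4} = 0.0625$, with Monte Carlo error of order $10^{-3}$ at this number of trials.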
Our next objects of interest are the collections of local maxima and points of increase along the Brownian sample path.
Definition 9.3.11. We say that $t \ge 0$ is a point of local maximum of $f : [0,\infty) \to \mathbb{R}$ if there exists $\delta > 0$ such that $f(t) \ge f(s)$ for all $s \in [(t-\delta)_+, t+\delta]$, $s \ne t$, and a point of strict local maximum if further $f(t) > f(s)$ for any such $s$. Similarly, we say that $t \ge 0$ is a point of increase of $f : [0,\infty) \to \mathbb{R}$ if there exists $\delta > 0$ such that $f((t-h)_+) \le f(t) \le f(t+h)$ for all $h \in (0, \delta]$.
The irregularity of the Brownian sample path suggests that it has many local maxima, as we shall indeed show, based on the following exercise in real analysis.
Exercise 9.3.12. Fix $f : [0,\infty) \to \mathbb{R}$.
(a) Show that the set of strict local maxima of $f$ is countable.
Hint: For any $\delta > 0$, the points of $M_\delta = \{t \ge 0 : f(t) > f(s) \text{ for all } s \in [(t-\delta)_+, t+\delta], s \ne t\}$ are isolated.
(b) Suppose $f$ is continuous but monotone on no interval. Show that if $f(b) > f(a)$ for $b > a \ge 0$, then there exist $b > u_3 > u_2 > u_1 \ge a$ such that $f(u_2) > f(u_3) > f(u_1) = f(a)$, and deduce that $f$ has a local maximum in $[u_1, u_3]$.
Hint: Set $u_1 = \sup\{t \in [a,b) : f(t) = f(a)\}$.
(c) Conclude that for a continuous function $f$ which is monotone on no interval, the set of local maxima of $f$ is dense in $[0,\infty)$.
Proposition 9.3.13. For a.e. $\omega$, the set of points of local maximum for the Wiener sample path $W_t(\omega)$ is a countable, dense subset of $[0,\infty)$ and all local maxima are strict.
Remark. Recall that the upper Dini derivative $D^1 f(t)$ of Definition 9.3.1 is non-positive, hence finite, at every point $t$ of local maximum of $f(\cdot)$. Thus, Proposition 9.3.13 provides a dense set of points $t \ge 0$ where $D^1 W_t(\omega) < \infty$ and by symmetry of the Brownian motion, another dense set where $D_1 W_t(\omega) > -\infty$, though as we have seen in Proposition 9.3.2, w.p.1. there is no point $t(\omega) \ge 0$ for which both apply.
Proof. If a continuous function $f$ has a non-strict local maximum then there exist rational numbers $0 \le q_1 < q_4$ such that the set $\mathcal{A} = \{u \in (q_1, q_4) : f(u) = \sup_{t \in [q_1,q_4]} f(t)\}$ has an accumulation point in $[q_1, q_4]$. In particular, for some rational numbers $0 \le q_1 < q_2 < q_3 < q_4$ the set $\mathcal{A}$ intersects both intervals $(q_1, q_2)$ and $(q_3, q_4)$. Thus, setting $M_{s,r} = \sup_{t \in [s,r]} W_t$, if $\mathbb{P}(M_{s_3,s_4} = M_{s_1,s_2}) = 0$ for each $0 \le s_1 < s_2 < s_3 < s_4$, then w.p.1. every local maximum of $W_t(\omega)$ is strict. This is all we need to show, since in view of Remark 9.3.9, Exercise 9.3.12 and the continuity of Brownian motion, w.p.1. the (countable) set of (strict) local maxima of $W_t(\omega)$ is dense on $[0,\infty)$. Now, fixing $0 \le s_1 < s_2 < s_3 < s_4$ note that $M_{s_3,s_4} - M_{s_1,s_2} = Z - Y + X$ for the mutually independent $Z = \sup_{t \in [s_3,s_4]} (W_t - W_{s_3})$, $Y = \sup_{t \in [s_1,s_2]} (W_t - W_{s_2})$ and $X = W_{s_3} - W_{s_2}$. Since $g(x) = \mathbb{P}(X = x) = 0$ for all $x \in \mathbb{R}$, we are done as by Fubini's theorem, $\mathbb{P}(M_{s_3,s_4} = M_{s_1,s_2}) = \mathbb{P}(X - Y + Z = 0) = \mathbb{E}[g(Y - Z)] = 0$.
Remark. While proving Proposition 9.3.13 we have shown that for any countable collection of disjoint intervals $I_i$, w.p.1. the corresponding maximal values $\sup_{t \in I_i} W_t$ of the Brownian motion must all be distinct. In particular, $\mathbb{P}(W_q = W_{q'} \text{ for some } q \ne q' \in \mathbb{Q}) = 0$, which of course does not contradict the fact that $\mathbb{P}(W_0 = W_t \text{ for uncountably many } t \ge 0) = 1$ (as implied by Proposition 9.3.7).
Here is a remarkable contrast with Proposition 9.3.13, showing that the Brownian sample path has no point of increase (try to imagine such a path!).
Theorem 9.3.14 (Dvoretzky, Erdős, Kakutani). Almost every sample path of the Wiener process has no point of increase (or decrease).
For the proof of this result, see [MP09, Theorem 5.14].
Bibliography
[And1887] Désiré André, Solution directe du problème résolu par M. Bertrand, Comptes Rendus Acad. Sci. Paris, 105, (1887), 436–437.
[Bil95] Patrick Billingsley, Probability and measure, third edition, John Wiley and Sons, 1995.
[Bre92] Leo Breiman, Probability, Classics in Applied Mathematics, Society for Industrial and Applied Mathematics, 1992.
[Bry95] Wlodzimierz Bryc, The normal distribution, Springer-Verlag, 1995.
[Doo53] Joseph Doob, Stochastic processes, Wiley, 1953.
[Dud89] Richard Dudley, Real analysis and probability, Chapman and Hall, 1989.
[Dur10] Rick Durrett, Probability: Theory and Examples, fourth edition, Cambridge U Press, 2010.
[DKW56] Aryeh Dvoretzky, Jack Kiefer and Jacob Wolfowitz, Asymptotic minimax character of the sample distribution function and of the classical multinomial estimator, Ann. Math. Stat., 27, (1956), 642–669.
[Dyn65] Eugene Dynkin, Markov processes, volumes 1-2, Springer-Verlag, 1965.
[Fel71] William Feller, An introduction to probability theory and its applications, volume II, second edition, John Wiley and sons, 1971.
[Fel68] William Feller, An introduction to probability theory and its applications, volume I, third edition, John Wiley and sons, 1968.
[Fre71] David Freedman, Brownian motion and diffusion, Holden-Day, 1971.
[GS01] Geoffrey Grimmett and David Stirzaker, Probability and random processes, 3rd ed., Oxford University Press, 2001.
[Hun56] Gilbert Hunt, Some theorems concerning Brownian motion, Trans. Amer. Math. Soc., 81, (1956), 294–319.
[KaS97] Ioannis Karatzas and Steven E. Shreve, Brownian motion and stochastic calculus, Springer Verlag, third edition, 1997.
[Lev37] Paul Lévy, Théorie de l'addition des variables aléatoires, Gauthier-Villars, Paris, (1937).
[Lev39] Paul Lévy, Sur certains processus stochastiques homogènes, Compositio Math., 7, (1939), 283–339.
[KT75] Samuel Karlin and Howard M. Taylor, A first course in stochastic processes, 2nd ed., Academic Press, 1975.
[Mas90] Pascal Massart, The tight constant in the Dvoretzky-Kiefer-Wolfowitz inequality, Ann. Probab. 18, (1990), 1269–1283.
[MP09] Peter Mörters and Yuval Peres, Brownian motion, Cambridge University Press, 2010.
[Num84] Esa Nummelin, General irreducible Markov chains and non-negative operators, Cambridge University Press, 1984.
[Oks03] Bernt Øksendal, Stochastic differential equations: An introduction with applications, 6th ed., Universitext, Springer Verlag, 2003.
[PWZ33] Raymond E.A.C. Paley, Norbert Wiener and Antoni Zygmund, Note on random functions, Math. Z. 37, (1933), 647–668.
[Pit56] E. J. G. Pitman, On the derivative of a characteristic function at the origin, Ann. Math. Stat. 27 (1956), 1156–1160.
[SW86] Galen R. Shorack and Jon A. Wellner, Empirical processes with applications to statistics, Wiley, 1986.
[Str93] Daniel W. Stroock, Probability theory: an analytic view, Cambridge university press, 1993.
[Wil91] David Williams, Probability with martingales, Cambridge university press, 1991.
Index
λ-system, 14
μ-integrable, 31
π-system, 14
σ-algebra, 7, 175
σ-algebra, P-trivial, 30, 56, 157
σ-algebra, Borel, 10
σ-algebra, Markov, 291, 324
σ-algebra, completion, 14, 274, 275, 282, 289, 290
σ-algebra, countably generated, 10
σ-algebra, cylindrical, 271, 323, 342
σ-algebra, exchangeable, 221
σ-algebra, generated, 10, 20
σ-algebra, induced, 62
σ-algebra, invariant, 232
σ-algebra, optional, 291
σ-algebra, stopped, 183, 230, 291
σ-algebra, tail, 57, 255, 342
σ-algebra, trivial, 8, 158
σ-field, 7
0-1 law, 79
0-1 law, Blumenthal's, 342
0-1 law, Hewitt-Savage, 222
0-1 law, Kolmogorov's, 57, 131, 197, 342
0-1 law, Lévy's, 197
absolutely continuous, 153, 229, 320
absolutely continuous, mutually, 218
algebra, 12
almost everywhere, 20
almost surely, 20
angle-brackets, 200
arc-sine law, 108, 347
arc-sine law, Lévy's, 353
Bessel process, 306
Bessel process, index, 306
birth processes, 333
Bonferroni's inequality, 38
Borel function, 18
Borel set, 11
Borel-Cantelli I, 78, 166, 202
Borel-Cantelli II, 79, 202
branching process, 210, 229, 239
Brownian bridge, 288, 322, 353
Brownian motion, 148, 286, 341
Brownian motion, d-dimensional, 306
Brownian motion, absorbed, 347
Brownian motion, drift, 288, 321
Brownian motion, fractional, 288
Brownian motion, geometric, 322
Brownian motion, integral, 288
Brownian motion, level set, 371
Brownian motion, local maximum, 372
Brownian motion, maxima, 345, 352
Brownian motion, modulus of continuity, 368
Brownian motion, nowhere differentiable, 367
Brownian motion, nowhere monotone, 371
Brownian motion, reflected, 347
Brownian motion, standard, 286, 343, 359
Brownian motion, total variation, 315
Brownian motion, zero set, 371
Cantor set, 29, 121
Carathéodory's extension theorem, 13
Carathéodory's lemma, 16
Cauchy sequence, 165
central limit theorem, 96, 113, 348
central limit theorem, Donsker's, 349
central limit theorem, functional, 349
central limit theorem, Lindeberg's, 96, 209, 355, 359
central limit theorem, Lyapunov's, 101, 363
central limit theorem, Markov additive functional, 254
central limit theorem, multivariate, 147, 150
Cesàro averages, 47, 249
change of variables, 50
Chapman-Kolmogorov equations, 233, 318, 322
characteristic function, 117, 143, 336
compensator, predictable, 200, 306, 307, 359
conditional expectation, 152, 172
cone, convex, 9, 242
consistent, 61
continuous mapping, 80, 81, 109
continuous modification, 275
continuous, Hölder, 368
convergence almost surely, 24
convergence in $L^q$, 39, 163
convergence in $L^q$, weak, 163
convergence in distribution, 104, 141, 349
convergence in measure, 39
convergence in probability, 39, 105, 142, 307
convergence weakly, 350
convergence, bounded, 45
convergence, dominated, 42, 162, 196, 312
convergence, monotone, 33, 41, 161, 312
convergence, of types, 128, 131
convergence, total-variation, 111, 251, 266
convergence, uniformly integrable, 46
convergence, vague, 114
convergence, Vitali's theorem, 46, 48, 220
convergence, weak, 104, 109, 350
convolution, 68
countable representation, 22, 271
counting process, 136, 333
coupling, 105, 357
coupling, Markovian, 251, 252
coupling, monotone, 100
coupon collector's problem, 73, 136
covariance, 71
Cramér-Wold device, 146
De Morgan's law, 7, 77
density, Cesàro, 17
derivative, Dini, 367, 372
diagonal selection, principle, 115, 299, 312
distribution function, 26, 88, 104, 142
distribution, Bernoulli, 53, 92, 100, 119, 134
distribution, beta, 199
distribution, Binomial, 100, 135
distribution, Cauchy, 121, 131
distribution, exponential, 28, 52, 83, 105, 120, 137, 172, 354
distribution, extreme value, 107
distribution, gamma, 70, 81, 137, 139, 337
distribution, geometric, 53, 73, 83, 105, 208, 339
distribution, multivariate normal, 146, 149, 173, 177, 230, 279, 284
distribution, multivariate normal, non-degenerate, 149
distribution, normal, 28, 53, 95, 119, 287
distribution, Poisson, 53, 70, 100, 119, 133, 137
distribution, Poisson thinning, 140, 337
distribution, stable, 348
distribution, support, 30
distribution, triangular, 120
Doob's convergence theorem, 192, 299
Doob's decomposition, 184, 307, 313
Doob's optional stopping, 194, 204, 302
Doob-Meyer decomposition, 307, 313
doubly stochastic, 243
Dynkin's theorem, 15
event, 7
event space, 7
event, shift invariant, 232, 262
expectation, 31
extinction probability, 211
Fatou's lemma, 42, 162
Feller property, 327
Feller property, strong, 260
field, 12
filtration, 56, 175, 225, 318
filtration, augmented, 289
filtration, canonical, 176, 290
filtration, continuous time, 289, 294
filtration, interpolated, 290, 297
filtration, left, 289
filtration, left-continuous, 290
filtration, right, 289
filtration, right-continuous, 289, 306
finite dimensional distributions, 148, 227, 269, 319
finite dimensional distributions, consistent, 270, 319
Fokker-Planck equation, 335
Fubini's theorem, 63, 283
function, absolutely continuous, 198
function, continuous, 109, 167, 260
function, Hölder continuous, 275
function, harmonic, 234
function, indicator, 18
function, Lebesgue integrable, 28, 50, 283
function, Lebesgue singular, 29
function, Lipschitz continuous, 275
function, measurable, 17
function, non-negative definite, 284
function, Riemann integrable, 51
function, semi-continuous, 22, 110, 181
function, separable, 280
function, simple, 18, 167
function, slowly varying, 131
function, step, 329
function, sub-harmonic, 234
function, super-harmonic, 181, 234, 242
Galton-Watson trees, 211, 215
Girsanov's theorem, 219, 316
Glivenko-Cantelli theorem, 88, 353
graph, weighted, 244
Hahn decomposition, 155
harmonic function, 344
Helly's selection theorem, 114, 253
hitting time, first, 178, 207, 293, 305
hitting time, last, 178
holding time, 331, 334
hypothesis testing, 90
i.i.d., 72
independence, 54
independence, P, 54
independence, mutual, 55, 146
independence, pairwise, 100
independence, stochastic processes, 283
inequality, $L^p$ martingale, 189, 298, 311
inequality, Cauchy-Schwarz, 37
inequality, Chebyshev's, 35
inequality, Doob's, 186, 193, 297
inequality, Doob's up-crossing, 190, 299
inequality, Doob's, second, 187
inequality, Hölder's, 37
inequality, Jensen's, 36, 160, 180
inequality, Kolmogorov's, 90, 106
inequality, Markov's, 35
inequality, Minkowski's, 37
inequality, Ottaviani's, 189, 352
inequality, Schwarz's, 165
inequality, up-crossing, 191
inner product, 165, 284
integration by parts, 65
invariance principle, Donsker, 349, 360
Kakutani's theorem, 215
Kesten-Stigum theorem, 214
Kochen-Stone lemma, 79
Kolmogorov's backward equation, 322, 335
Kolmogorov's cycle condition, 245
Kolmogorov's extension theorem, 61, 228, 272
Kolmogorov's forward equation, 335, 338
Kolmogorov's three series theorem, 102, 195, 203
Kolmogorov-Centsov theorem, 276
Kolmogorov-Smirnov statistic, 355
Kronecker's lemma, 92, 202
Lévy's characterization theorem, 315
Lévy's continuity theorem, 125, 145
Lévy's downward theorem, 219
Lévy's inversion theorem, 121, 128, 144, 338
Lévy's upward theorem, 196
Laplace transform, 81, 116, 346
law, 25, 60, 144
law of large numbers, strong, 71, 82, 87, 92, 189, 204, 220, 248, 249, 254
law of large numbers, strong, non-negative variables, 85
law of large numbers, weak, 71, 75
law of large numbers, weak, in $L^2$, 72
law of the iterated logarithm, 84, 364
law, joint, 60, 139
law, size biased, 356
law, stochastic process, 227
Lebesgue decomposition, 154, 216
Lebesgue integral, 31, 172, 226, 318
Lebesgue measure, 12, 16, 28, 34, 153
Lenglart inequality, 186, 316
likelihood ratio, 218
lim inf, 77
lim sup, 77
local maximum, 372
Markov chain, 225
Markov chain, ψ-irreducible, 258
Markov chain, aperiodic, 250, 266
Markov chain, atom, 256
Markov chain, birth and death, 229, 240, 245, 247
Markov chain, continuous time, 329
Markov chain, cyclic decomposition, 254
Markov chain, Ehrenfest, 235
Markov chain, Feller, 260
Markov chain, first entrance decomposition, 234
Markov chain, H-recurrent, 261
Markov chain, homogeneous, 225
Markov chain, irreducible, 238, 258
Markov chain, last entrance decomposition, 234
Markov chain, law, 227
Markov chain, minorization, 256
Markov chain, O-recurrent, 262
Markov chain, O-transient, 262
Markov chain, open set irreducible, 260
Markov chain, period, 250, 266
Markov chain, positive H-recurrent, 265
Markov chain, positive recurrent, 246
Markov chain, recurrent, 238, 243, 261
Markov chain, recurrent atom, 256
Markov chain, renewal, 229, 239, 245, 254
Markov chain, reversible, 244, 245
Markov chain, stationary, 232, 252
Markov chain, transient, 238, 243
Markov class, closed, 235, 258
Markov class, irreducible, 235, 258, 338
Markov occupation time, 236, 248
Markov process, 289, 318
Markov process, birth and death, 339
Markov process, Brownian, 315, 322
Markov process, generator, 322, 334, 338
Markov process, homogeneous, 318, 324
Markov process, jump parameters, 331
Markov process, jump rates, 331
Markov process, jump, explosive, 333
Markov process, jump, pure, 329
Markov process, law, 320
Markov process, O-recurrent, 306
Markov process, O-transient, 306
Markov process, stationary, 322
Markov process, strong, 324, 342
Markov property, 159, 230, 323
Markov property, strong, 230, 324
Markov semi-group, 233, 318
Markov semi-group, Feller, 327
Markov state, absorbing, 238
Markov state, accessible, 235, 258, 338
Markov state, aperiodic, 250, 258
Markov state, intercommunicate, 235, 338
Markov state, null recurrent, 245, 249, 338
Markov state, O-recurrent, 262
Markov state, O-transient, 262
Markov state, period, 250
Markov state, positive recurrent, 245, 258, 338
Markov state, reachable, 260, 262
Markov state, recurrent, 236, 241, 338
Markov state, transient, 236, 338
Markov time, 291, 302, 324, 343
Markov, accessible set, 258
Markov, additive functional, 254
Markov, Doeblin chain, 257
Markov, equivalence class property, 246
Markov, H-irreducible, 259
Markov, Harris chain, 261
Markov, initial distribution, 226, 319
Markov, jump process, 280, 282
Markov, meeting time, 251
Markov, minorization, 256
Markov, occupation ratios, 249
Markov, small function, 257
Markov, small set, 259
Markov, split chain, 256
martingale, 176, 234, 355
martingale difference, 176, 193
martingale differences, bounded, 364
martingale transform, 181, 203
martingale, $L^2$, 177, 306
martingale, $L^p$, 198
martingale, $L^p$, right-continuous, 301
martingale, backward, 219, 297
martingale, Bayes rule, 219, 296
martingale, continuous time, 294
martingale, cross variation, 316
martingale, differences, 359
martingale, Doob's, 196, 204, 300
martingale, Gaussian, 177
martingale, interpolated, 297, 301
martingale, local, 184, 311
martingale, orthogonal, 316
martingale, product, 178, 215
martingale, reversed, 219, 297
martingale, right closed, 300
martingale, square-integrable, 177, 306
martingale, square-integrable, bracket, 316
martingale, square-integrable, continuous, 306
martingale, sub, 179
martingale, sub, continuous time, 294
martingale, sub, last element, 300, 302, 307
martingale, sub, reversed, 219, 220, 303
martingale, sub, right closed, 307
martingale, sub, right-continuous, 297
martingale, sup, reversed, 302
martingale, super, 179, 234
martingale, super, continuous time, 294
martingale, uniformly integrable, 195, 300
maximal inequalities, 186
mean, 52
measurable space, 8
measurable function, 18
measurable function, bounded, 225
measurable mapping, 18
measurable rectangles, 59
measurable set, 7
measurable space, isomorphic, 21, 62, 171, 227, 272, 319, 320
measurable space, product, 11
measure, 8
measure space, 8
measure space, complete, 14
measure, σ-finite, 16
measure, completion, 28, 50, 153
measure, counting, 112, 258
measure, excessive, 241, 242, 253, 338
measure, finite, 8
measure, invariant, 232, 241, 322
measure, invariant probability, 232, 245, 265, 322, 335
measure, invariant, unique, 242
measure, maximal irreducibility, 258
measure, mutually singular, 154, 217, 242, 255
measure, non-atomic, 9, 12
measure, outer, 16
measure, positive, 232
measure, probability, 8
measure, product, 59, 60
measure, regular, 11
measure, restricted, 49, 64
measure, reversible, 244, 338
measure, shift invariant, 232, 285
measure, signed, 8, 155
measure, support, 30, 154, 241
measure, surface of sphere, 68, 128
measure, tight, 113
measure, uniform, 12, 47, 69, 120, 127, 128, 139
measures, uniformly tight, 113, 124, 143, 145, 260, 350
memoryless property, 137, 209, 332
merge transition probability, 256
modification, 275
modification, continuous, 276
modification, RCLL, 280, 301, 313
modification, separable, 280
moment, 52
moment generating function, 122
moment problem, 113, 122
monotone class, 17
monotone class theorem, 19, 63, 324, 325
monotone class theorem, Halmos's, 17
monotone function, 308
network, 244
non-negative definite matrix, 146
norm, 165
occupancy problem, 74, 135
occupation time, 353
optional time, 291, 302, 324
order statistics, 139, 354
Ornstein-Uhlenbeck process, 288, 322
orthogonal projection, 167
parallelogram law, 165
passage time, 345, 371
point of increase, 372
point process, 137
point process, Poisson, 137, 274
Poisson approximation, 133
Poisson process, 137, 274, 295, 315
Poisson process, arrival times, 137
Poisson process, compensated, 295, 296, 308, 315
Poisson process, compound, 335
Poisson process, drift, 321
Poisson process, excess life time, 138
Poisson process, inhomogeneous, 137, 274
Poisson process, jump times, 137
Poisson process, rate, 137
Poisson process, superposition, 140
Poisson process, time change, 274
Poisson, thinning, 140
polarization, 316
Portmanteau theorem, 110, 141, 350
pre-visible, 181
predictable, 181, 184, 187, 307, 313
probability density function, 28, 143, 153, 169, 346
probability density function, conditional, 169
probability density function, joint, 58
probability space, 8
probability space, canonical, 228, 319
probability space, complete, 14
Prohorov's theorem, 114, 143, 350
quadratic variation, predictable, 306, 307
Radon-Nikodym derivative, 153, 215, 216, 219, 229, 255, 320, 356
Radon-Nikodym theorem, 153
random field, 276
random variable, 18
random variable, P-degenerate, 30, 127, 130
random variable, P-trivial, 30
random variable, integrable, 32
random variable, lattice, 127
random variables, exchangeable, 221
random vector, 18, 143, 146, 284
random walk, 128, 176, 181, 185, 194, 207–209, 229, 240, 245, 263, 348
random walk, simple, 108, 148, 177, 179, 208, 229, 239
random walk, simple, range, 352
random walk, symmetric, 176, 178, 208, 232
record values, 85, 92, 101, 210
reflection principle, 108, 178, 345, 352
regeneration measure, 256
regeneration times, 256, 343
regular conditional probability, 169, 170
regular conditional probability distribution, 170, 203
renewal theory, 89, 106, 137, 229, 236, 249
renewal times, 89, 229, 236, 249
RMG, 219, 297
ruin probability, 207, 305
sample function, continuous, 276, 286, 306
sample function, RCLL, 280
sample function, right-continuous, 297
sample space, 7
Scheffé's lemma, 43
set function, countably additive, 8, 272
set function, finitely additive, 8
set, boundary, 110
set, continuity, 110
set, cylinder, 61
set, Lebesgue measurable, 50
set, negative, 155
set, negligible, 24
set, null, 14, 275, 289, 306
set, positive, 155
Skorokhod representation, 27, 105, 353, 356
Slutsky's lemma, 106, 128, 354, 361
space, $L^q$, 32, 163
split mapping, 256
square-integrable, 177
srw, 286
stable law, 130
stable law, domain of attraction, 130
stable law, index, 131
stable law, skewness, 132
stable law, symmetric, 130
standard machine, 33, 48, 50
state space, 225, 318
Stirling's formula, 108
stochastic integral, 297, 317
stochastic process, 175, 225, 269, 318
stochastic process, α-stable, 348
stochastic process, adapted, 175, 225, 289, 294, 318
stochastic process, auto-covariance, 284, 287
stochastic process, auto-regressive, 230, 268, 364
stochastic process, Bessel, 306
stochastic process, canonical construction, 269, 271
stochastic process, continuous, 276
stochastic process, continuous in probability, 282
stochastic process, continuous time, 269
stochastic process, DL, 313
stochastic process, Gaussian, 177, 284, 286
stochastic process, Gaussian, centered, 284
stochastic process, Hölder continuous, 276, 312
stochastic process, increasing, 307
stochastic process, independent increments, 274, 285, 286, 295, 315, 321, 336
stochastic process, indistinguishable, 275, 308, 310
stochastic process, integrable, 176
stochastic process, interpolated, 290, 297
stochastic process, isonormal, 284
stochastic process, law, 232, 273, 285, 288
stochastic process, Lipschitz continuous, 276
stochastic process, mean, 284
stochastic process, measurable, 281
stochastic process, progressively measurable, 290, 308, 313, 324
stochastic process, pure jump, 329
stochastic process, sample function, 269
stochastic process, sample path, 136, 269
stochastic process, separable, 280, 282
stochastic process, simple, 317
stochastic process, stationary, 232, 285, 288, 322
stochastic process, stationary increments, 286, 321, 336
stochastic process, stopped, 183, 292, 304
stochastic process, supremum, 281
stochastic process, variation of, 307
stochastic process, weakly stationary, 285
stochastic process, Wiener, 286, 295
stochastic process, Wiener, standard, 286
stochastic processes, canonical construction, 278
stopping time, 178, 230, 291, 302, 305, 325
sub-space, Hilbert, 165
subsequence method, 85, 365
symmetrization, 126
take out what is known, 158
tower property, 158
transition probabilities, stationary, 318
transition probability, 171, 225, 318
transition probability, adjoint, 244
transition probability, Feller, 260
transition probability, jump, 331
transition probability, kernel, 229, 320, 321
transition probability, matrix, 228
truncation, 74, 101, 134
uncorrelated, 69, 71, 75, 177
uniformly integrable, 45, 163, 204, 300
up-crossings, 190, 299
urn, B. Friedman, 200, 208
urn, Pólya, 199
variance, 52
variation, 307
variation, quadratic, 307
variation, total, 111, 307, 371
vector space, Banach, 165
vector space, Hilbert, 165, 284
vector space, linear, 165
vector space, normed, 165
version, 152, 275
Wald's identities, 208, 328
Wald's identity, 160
weak law, truncation, 74
Wiener process, maxima, 345
Wiener process, standard, 343, 359
with probability one, 20