
Probability Theory

Matthias Löwe
Academic Year 2001/2002
Contents

1 Introduction
2 Basics, Random Variables
3 Expectation, Moments, and Jensen's inequality
4 Convergence of random variables
5 Independence
6 Products and Sums of Independent Random Variables
7 Infinite product probability spaces
8 Zero-One Laws
9 Laws of Large Numbers
10 The Central Limit Theorem
11 Conditional Expectation
12 Martingales
13 Brownian motion
13.1 Construction of Brownian Motion
14 Appendix
1 Introduction
Chance, luck, and fortune have been at the centre of the human mind ever since people started to think. At the latest when people started to play cards or roll dice for money, there has also been a desire for a mathematical description and understanding of chance. Experience tells us that even in a situation that is governed purely by chance there seems to be some regularity. For example, in a long sequence of fair coin tosses we will typically see heads about half of the time and tails about half of the time. This was formulated already by Jakob Bernoulli (published 1713) and is called a law of large numbers. Not much later the French mathematician de Moivre analyzed how much the typical number of heads in a series of $n$ fair coin tosses fluctuates around $\frac{n}{2}$. He thereby discovered the first form of what nowadays has become known as the Central Limit Theorem.

However, a mathematically clean description of these results could not be given by either of the two authors. The problem with such a description is that, of course, the average number of heads in a sequence of fair coin tosses will only typically converge to $\frac{1}{2}$. One could, e.g., imagine that we are rather unlucky and toss only heads. So the principal question was: what is the probability of an event? The standard idea in the early days was to define it as the limit of the average frequency of occurrences of the event in a long row of typical experiments. But here we are back again at the original problem of what a typical sequence is. It is natural that, even if we can define the concept of typical sequences, they are very difficult to work with. This is why Hilbert, in his famous talk at the International Congress of Mathematicians 1900 in Paris, mentioned the axiomatic foundation of probability theory as the sixth of his 23 open problems in mathematics.

This problem was solved by A. N. Kolmogorov in 1933 by ingeniously making use of the newly developed field of measure theory: a probability is understood as a measure on the space of all outcomes of the random experiment. This measure is chosen in such a way that it has total mass one.

From this starting point the whole framework of probability theory was developed: from the laws of large numbers over the Central Limit Theorem to the very new field of mathematical finance (and many, many others).

In this course we will meet the most important highlights of probability theory and then turn into the direction of stochastic processes, which are basic for mathematical finance.
2 Basics, Random Variables
As mentioned in the introduction, Kolmogorov's concept understands a probability on a space of outcomes as a measure with total mass one on this space. So in the framework of probability theory we will always consider a triple $(\Omega, \mathcal{F}, P)$ and call it a probability space:

Definition 2.1 A probability space is a triple $(\Omega, \mathcal{F}, P)$, where

- $\Omega$ is a set; $\omega \in \Omega$ is called a state of the world.
- $\mathcal{F}$ is a $\sigma$-algebra over $\Omega$; $A \in \mathcal{F}$ is called an event.
- $P$ is a measure on $\mathcal{F}$ with $P(\Omega) = 1$; $P(A)$ is called the probability of the event $A$.
A probability space can be considered as an experiment we perform. Of course, in an experiment in physics one is not always interested in measuring everything one could in principle measure. Such a measurement in probability theory is called a random variable.

Definition 2.2 A random variable $X$ is a mapping $X : \Omega \to \mathbb{R}^d$ that is measurable. Here we endow $\mathbb{R}^d$ with its Borel $\sigma$-algebra $\mathcal{B}^d$.
The important fact about random variables is that the underlying probability space $(\Omega, \mathcal{F}, P)$ does not really matter. For example, consider the two experiments
\[ \Omega_1 = \{0, 1\}, \quad \mathcal{F}_1 = \mathcal{P}(\Omega_1), \quad P_1(\{0\}) = \tfrac{1}{2}, \]
and
\[ \Omega_2 = \{1, 2, \ldots, 6\}, \quad \mathcal{F}_2 = \mathcal{P}(\Omega_2), \quad P_2(\{i\}) = \tfrac{1}{6}, \ i \in \Omega_2. \]
Consider the random variables $X_1 : \Omega_1 \to \mathbb{R}$, $\omega \mapsto \omega$, and
\[ X_2 : \Omega_2 \to \mathbb{R}, \quad i \mapsto \begin{cases} 0 & i \text{ is even} \\ 1 & i \text{ is odd.} \end{cases} \]
Then
\[ P_1(X_1 = 0) = P_2(X_2 = 0) = \tfrac{1}{2}, \]
and therefore $X_1$ and $X_2$ have the same behavior even though they are defined on completely different spaces. What we learn from this example is that what really matters for a random variable is the distribution $P \circ X^{-1}$:
Definition 2.3 The distribution $P_X$ of a random variable $X : \Omega \to \mathbb{R}^d$ is the following probability measure on $(\mathbb{R}^d, \mathcal{B}^d)$:
\[ P_X(A) := P(X \in A) = P(X^{-1}(A)), \quad A \in \mathcal{B}^d. \]
So the distribution of a random variable is its image measure in $\mathbb{R}^d$.
Example 2.4 Important distributions of random variables $X$ (which we have already met in introductory courses in probability and statistics) are:

- The Binomial distribution with parameters $n$ and $p$, i.e. a random variable $X$ is binomially distributed with parameters $n$ and $p$ ($B(n, p)$-distributed) if
\[ P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}, \quad 0 \le k \le n. \]
The binomial distribution is the distribution of the number of 1s in $n$ independent coin tosses with success probability $p$.

- The Normal distribution with parameters $\mu$ and $\sigma^2$, i.e. a random variable $X$ is normally distributed with parameters $\mu$ and $\sigma^2$ ($\mathcal{N}(\mu, \sigma^2)$-distributed) if
\[ P(X \le a) = \frac{1}{\sqrt{2\pi\sigma^2}} \int_{-\infty}^{a} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} \, dx. \]
Here $a \in \mathbb{R}$.

- The Dirac distribution with atom in $a \in \mathbb{R}$, i.e. a random variable $X$ is Dirac distributed with parameter $a$ if
\[ P(X = b) = \delta_a(b) = \begin{cases} 1 & \text{if } b = a \\ 0 & \text{otherwise.} \end{cases} \]
Here $b \in \mathbb{R}$.

- The Poisson distribution with parameter $\lambda > 0$, i.e. a random variable $X$ is Poisson distributed with parameter $\lambda$ ($\mathcal{P}(\lambda)$-distributed) if
\[ P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}, \quad k \in \mathbb{N}_0 = \mathbb{N} \cup \{0\}. \]

- The Multivariate Normal distribution with parameters $\mu \in \mathbb{R}^d$ and $\Sigma$, i.e. a random variable $X$ is normally distributed in dimension $d$ with parameters $\mu$ and $\Sigma$ ($\mathcal{N}(\mu, \Sigma)$-distributed) if $\Sigma$ is a symmetric, positive definite $d \times d$ matrix and, for $A = (-\infty, a_1] \times \ldots \times (-\infty, a_d]$,
\[ P(X \in A) = \frac{1}{\sqrt{(2\pi)^d \det \Sigma}} \int_{-\infty}^{a_1} \!\!\ldots \int_{-\infty}^{a_d} \exp\left( -\frac{1}{2} \langle \Sigma^{-1}(x - \mu), (x - \mu) \rangle \right) dx_1 \ldots dx_d. \]
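As a quick aside, these textbook distributions are easy to experiment with numerically. The following sketch is not part of the original notes; it assumes Python with numpy and compares the empirical frequencies of $B(n, p)$ samples (built from coin tosses, as described above) with the binomial formula:

    import numpy as np
    from math import comb

    rng = np.random.default_rng(0)
    n, p, trials = 10, 0.3, 200_000

    # Each row: n independent coin tosses with success probability p;
    # the number of 1s in a row is one B(n, p) sample.
    samples = (rng.random((trials, n)) < p).sum(axis=1)

    for k in range(n + 1):
        empirical = np.mean(samples == k)
        exact = comb(n, k) * p**k * (1 - p) ** (n - k)
        print(f"k={k}: empirical {empirical:.4f}  exact {exact:.4f}")

The two columns agree to within the sampling error of the simulation.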
3 Expectation, Moments, and Jensen's inequality

We are now going to consider important characteristics of random variables.

Definition 3.1 The expectation of a random variable is defined as
\[ E(X) := E_P(X) := \int X \, dP = \int X(\omega) \, dP(\omega), \]
if this integral is well defined.

Notice that for $A \in \mathcal{F}$ one has $E(1_A) = \int 1_A \, dP = P(A)$. Quite often one may want to integrate $f(X)$ for a function $f : \mathbb{R}^d \to \mathbb{R}^m$. How does this work?
Proposition 3.2 Let $X : \Omega \to \mathbb{R}^d$ be a random variable and $f : \mathbb{R}^d \to \mathbb{R}$ be a measurable function. Then
\[ \int f \circ X \, dP = E[f \circ X] = \int f \, dP_X. \tag{3.1} \]

Proof. If $f = 1_A$, $A \in \mathcal{B}^d$, we have
\[ E[f \circ X] = E(1_A \circ X) = P(X \in A) = P_X(A) = \int 1_A \, dP_X. \]
Hence (3.1) holds true for simple functions $f = \sum_{i=1}^{n} \alpha_i 1_{A_i}$, $\alpha_i \in \mathbb{R}$, $A_i \in \mathcal{B}^d$. The standard approximation techniques yield (3.1) for general integrable $f$.
In particular Proposition 3.2 yields that
\[ E[X] = \int x \, dP_X(x). \]
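Both sides of (3.1) can be approximated numerically, which makes the change-of-variables formula concrete. Here is a minimal sketch (my own illustration, assuming Python with numpy): the left-hand side averages $f$ over draws of $X$, the right-hand side integrates $f$ against the density of $P_X$, here for $X$ standard normal and $f(x) = x^2$.

    import numpy as np

    rng = np.random.default_rng(1)

    def f(x):
        return x ** 2               # any measurable f with f(X) integrable

    # Left-hand side of (3.1): average f over draws of X.
    x = rng.standard_normal(1_000_000)
    lhs = f(x).mean()

    # Right-hand side: Riemann sum of f against the N(0,1) density.
    grid = np.linspace(-8.0, 8.0, 20_001)
    density = np.exp(-grid**2 / 2) / np.sqrt(2 * np.pi)
    rhs = (f(grid) * density).sum() * (grid[1] - grid[0])

    print(lhs, rhs)                 # both approximately E[X^2] = 1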
Exercise 3.3 If $X : \Omega \to \mathbb{N}_0$, then
\[ EX = \sum_{n=0}^{\infty} n P(X = n) = \sum_{n=1}^{\infty} P(X \ge n). \]

Exercise 3.4 If $X : \Omega \to \mathbb{R}$, then
\[ \sum_{n=1}^{\infty} P(|X| \ge n) \le E(|X|) \le \sum_{n=0}^{\infty} P(|X| \ge n). \]
Thus $X$ has an expectation if and only if $\sum_{n=0}^{\infty} P(|X| \ge n)$ converges.
Further characteristics of random variables are the $p$-th moments.

Definition 3.5
1. For $p \ge 1$, the $p$-th moment of a random variable is defined as $E(X^p)$.
2. The centered $p$-th moment of a random variable is defined as $E[(X - EX)^p]$.
3. The variance of a random variable $X$ is its centered second moment, hence
\[ V(X) := E\left[ (X - EX)^2 \right]. \]
Its standard deviation is defined as $\sigma := \sqrt{V(X)}$.
Proposition 3.6 $V(X) < \infty$ if and only if $X \in L^2(P)$. In this case
\[ V(X) = E(X^2) - (E(X))^2 \tag{3.2} \]
as well as
\[ V(X) \le E(X^2) \tag{3.3} \]
and
\[ (EX)^2 \le E(X^2). \tag{3.4} \]

Proof. If $V(X) < \infty$, then $X - EX \in L^2(P)$. But $L^2(P)$ is a vector space and the constant $EX \in L^2(P)$, so $X = (X - EX) + EX \in L^2(P)$. On the other hand, if $X \in L^2(P)$, then $X \in L^1(P)$. Hence $EX$ exists and is a constant. Therefore also
\[ X - EX \in L^2(P). \]
By linearity of expectation then:
\[ V(X) = E(X - EX)^2 = EX^2 - 2(EX)^2 + (EX)^2 = EX^2 - (EX)^2. \]
This immediately implies (3.3) and (3.4).
Exercise 3.7 Show that $X$ and $X - EX$ have the same variance.
It will turn out in the next step that (3.4) is a special case of a much more general principle. To this end recall the concept of a convex function: a function $\varphi : \mathbb{R} \to \mathbb{R}$ is convex if for all $\alpha \in (0, 1)$ we have
\[ \varphi(\alpha x + (1 - \alpha) y) \le \alpha \varphi(x) + (1 - \alpha) \varphi(y) \]
for all $x, y \in \mathbb{R}$. In a first course in analysis one learns that convexity is implied by $\varphi'' \ge 0$. On the other hand, convex functions do not need to be differentiable. The following exercise shows that they are close to being differentiable.

Exercise 3.8 Let $I$ be an interval and $\varphi : I \to \mathbb{R}$ be a convex function. Show that the right derivative $\varphi'_+(x)$ exists for all $x \in \mathring{I}$ (the interior of $I$) and that the left derivative $\varphi'_-(x)$ exists for all $x \in \mathring{I}$. Hence $\varphi$ is continuous on $\mathring{I}$. Show moreover that $\varphi'_+$ is monotonically increasing on $\mathring{I}$ and that
\[ \varphi(y) \ge \varphi(x) + \varphi'_+(x)(y - x) \]
holds for $x \in \mathring{I}$, $y \in I$.

Applying Exercise 3.8 yields the generalization of (3.4) mentioned above.
Theorem 3.9 (Jensen's inequality) Let $X : \Omega \to \mathbb{R}$ be a random variable on $(\Omega, \mathcal{F}, P)$ and assume $X$ is $P$-integrable and takes values in an open interval $I \subseteq \mathbb{R}$. Then $EX \in I$, and for every convex $\varphi : I \to \mathbb{R}$, $\varphi \circ X$ is a random variable. If this random variable $\varphi \circ X$ is $P$-integrable, it holds that
\[ \varphi(E(X)) \le E(\varphi \circ X). \]

Proof. Assume $I = (\alpha, \beta)$. Thus $X(\omega) < \beta$ for all $\omega$, so $E(X) \le \beta$, and in fact $E(X) < \beta$. Indeed, $E(X) = \beta$ would imply that the strictly positive random variable $\beta - X$ equals $0$ on a set of $P$-measure one, i.e. $P$-almost surely; this is a contradiction. Analogously $EX > \alpha$. According to Exercise 3.8, $\varphi$ is continuous on $\mathring{I} = I$, hence Borel-measurable. Now we know
\[ \varphi(y) \ge \varphi(x) + \varphi'_+(x)(y - x) \tag{3.5} \]
for all $x, y \in I$, with equality for $x = y$. Hence
\[ \varphi(y) = \sup_{x \in I} \left[ \varphi(x) + \varphi'_+(x)(y - x) \right] \tag{3.6} \]
for all $y \in I$. Putting $y = X(\omega)$ in (3.5) yields
\[ \varphi \circ X \ge \varphi(x) + \varphi'_+(x)(X - x) \]
and by integration
\[ E(\varphi \circ X) \ge \varphi(x) + \varphi'_+(x)(E(X) - x) \]
for all $x \in I$. Together with (3.6) this gives
\[ E(\varphi \circ X) \ge \sup_{x \in I} \left[ \varphi(x) + \varphi'_+(x)(EX - x) \right] = \varphi(E(X)). \]
This is the assertion of Jensen's inequality.
Corollary 3.10 Let $X \in L^p(P)$ for some $p \ge 1$. Then
\[ |E(X)|^p \le E(|X|^p). \]

Exercise 3.11 Let $I$ be an open interval and $\varphi : I \to \mathbb{R}$ be convex. For $x_1, \ldots, x_n \in I$ and $\alpha_1, \ldots, \alpha_n \in \mathbb{R}_+$ with $\sum_{i=1}^{n} \alpha_i = 1$, show that
\[ \varphi\left( \sum_{i=1}^{n} \alpha_i x_i \right) \le \sum_{i=1}^{n} \alpha_i \varphi(x_i). \]
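Jensen's inequality, and in particular its special case (3.4), is easy to observe in a simulation. A small sketch of my own (assuming Python with numpy): for $X$ exponential with $EX = 1$ and the convex function $\varphi(t) = t^2$ we see $\varphi(EX) \le E(\varphi(X))$.

    import numpy as np

    rng = np.random.default_rng(2)
    x = rng.exponential(scale=1.0, size=1_000_000)   # X >= 0 with EX = 1

    # phi(t) = t^2 is convex, so phi(EX) <= E[phi(X)];
    # for Exp(1): (EX)^2 = 1 while E[X^2] = 2.
    print(x.mean() ** 2)       # approximately 1
    print((x ** 2).mean())     # approximately 2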
4 Convergence of random variables

Already in the course on measure theory we met three different types of convergence:

Definition 4.1 Let $(X_n)$ be a sequence of random variables.
1. $X_n$ converges stochastically (or in probability) to a random variable $X$ if for each $\varepsilon > 0$
\[ \lim_{n \to \infty} P(|X_n - X| \ge \varepsilon) = 0. \]
2. $X_n$ converges almost surely to a random variable $X$ if for each $\varepsilon > 0$
\[ P\left( \limsup_{n \to \infty} \{|X_n - X| \ge \varepsilon\} \right) = 0. \]
3. $X_n$ converges to a random variable $X$ in $L^p$, or in $p$-norm, if
\[ \lim_{n \to \infty} E(|X_n - X|^p) = 0. \]
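To make the gap between the first two notions tangible, here is a small illustration of my own (assuming Python with numpy), based on the classical "moving blocks" sequence; it is not from the original notes, and it anticipates part 1 of the theorem below.

    import numpy as np

    # Omega = [0, 1) with Lebesgue measure.  The "moving blocks"
    # X_n = 1_{[j 2^-k, (j+1) 2^-k)}, where n = 2^k + j, 0 <= j < 2^k,
    # satisfy P(|X_n| >= eps) = 2^-k -> 0 (stochastic convergence to 0),
    # yet every omega is hit by exactly one block on each level k, so
    # X_n(omega) does not converge for any omega: no a.s. convergence.

    def X(n, omega):
        k = int(np.log2(n))
        j = n - 2**k
        return 1.0 if j * 2.0**-k <= omega < (j + 1) * 2.0**-k else 0.0

    omega = 0.37                                   # one fixed outcome
    hits = [n for n in range(1, 2**12) if X(n, omega) == 1.0]
    print(hits)                                    # one hit per level k
    print([2.0 ** -int(np.log2(n)) for n in (10, 100, 1000)])  # P(X_n = 1)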
Already in measure theory we proved:

Theorem 4.2
1. If $X_n$ converges to $X$ almost surely or in $L^p$, then it also converges stochastically. Neither of the converses is true.
2. Almost sure convergence does not imply convergence in $L^p$ and vice versa.
Definition 4.3 Let $(\Omega, \mathcal{F})$ be a measurable space, where $\Omega$ is a topological space endowed with its Borel $\sigma$-algebra $\mathcal{F}$; this means $\mathcal{F}$ is generated by the topology on $\Omega$. Moreover, for each $n \in \mathbb{N}$ let $\mu_n$ and $\mu$ be probability measures on $(\Omega, \mathcal{F})$. We say that $\mu_n$ converges weakly to $\mu$ if for each bounded, continuous, real-valued function $f : \Omega \to \mathbb{R}$ (we write $C_b(\Omega)$ for the space of all such functions) it holds that
\[ \mu_n(f) := \int f \, d\mu_n \to \int f \, d\mu =: \mu(f) \quad \text{as } n \to \infty. \tag{4.1} \]
Theorem 4.4 Let $(X_n)_{n \in \mathbb{N}}$ be a sequence of real-valued random variables on a space $(\Omega, \mathcal{F}, P)$. Assume $(X_n)$ converges to a random variable $X$ stochastically. Then $P_{X_n}$ (the sequence of distributions) converges weakly to $P_X$, i.e.
\[ \lim_{n \to \infty} \int f \, dP_{X_n} = \int f \, dP_X, \]
or equivalently
\[ \lim_{n \to \infty} E(f \circ X_n) = E(f \circ X) \]
for all $f \in C_b(\mathbb{R})$.

If $X$ is constant $P$-a.s., the converse also holds true.
Proof. First assume $f \in C_b(\mathbb{R})$ is uniformly continuous. Then for $\varepsilon > 0$ there is a $\delta > 0$ such that for any $x', x'' \in \mathbb{R}$
\[ |x' - x''| < \delta \implies |f(x') - f(x'')| < \varepsilon. \]
Define $A_n := \{|X_n - X| \ge \delta\}$, $n \in \mathbb{N}$. Then
\begin{align*}
\left| \int f \, dP_{X_n} - \int f \, dP_X \right| &= |E(f \circ X_n) - E(f \circ X)| \le E(|f \circ X_n - f \circ X|) \\
&= E(|f \circ X_n - f \circ X| \, 1_{A_n}) + E(|f \circ X_n - f \circ X| \, 1_{A_n^c}) \\
&\le 2 \|f\|_\infty P(A_n) + \varepsilon.
\end{align*}
Here we used the notation $\|f\|_\infty := \sup\{|f(x)| : x \in \mathbb{R}\}$ and that
\[ |f \circ X_n - f \circ X| \le |f \circ X_n| + |f \circ X| \le 2 \|f\|_\infty. \]
But since $X_n \to X$ stochastically, $P(A_n) \to 0$ as $n \to \infty$, so for $n$ large enough $P(A_n) \le \frac{\varepsilon}{2\|f\|_\infty}$. Thus for such $n$
\[ \left| \int f \, dP_{X_n} - \int f \, dP_X \right| \le 2\varepsilon. \]

Now let $f \in C_b(\mathbb{R})$ be arbitrary and denote $I_n := [-n, n]$. Since $I_n \uparrow \mathbb{R}$ as $n \to \infty$ we have $P_X(I_n) \to 1$ as $n \to \infty$. Thus for all $\varepsilon > 0$ there is $n_0(\varepsilon) =: n_0$ such that
\[ 1 - P_X(I_{n_0}) = P_X(\mathbb{R} \setminus I_{n_0}) < \varepsilon. \]
We choose the continuous function $u_{n_0}$ in the following way:
\[ u_{n_0}(x) = \begin{cases} 1 & x \in I_{n_0} \\ 0 & |x| \ge n_0 + 1 \\ -x + n_0 + 1 & n_0 < x < n_0 + 1 \\ x + n_0 + 1 & -n_0 - 1 < x < -n_0. \end{cases} \]
Eventually put $\tilde{f} := u_{n_0} f$. Since $\tilde{f} \equiv 0$ outside the compact set $[-n_0 - 1, n_0 + 1]$, the function $\tilde{f}$ is uniformly continuous (and so is $u_{n_0}$) and hence
\[ \lim_{n \to \infty} \int \tilde{f} \, dP_{X_n} = \int \tilde{f} \, dP_X \]
as well as
\[ \lim_{n \to \infty} \int u_{n_0} \, dP_{X_n} = \int u_{n_0} \, dP_X; \]
thus also
\[ \lim_{n \to \infty} \int (1 - u_{n_0}) \, dP_{X_n} = \int (1 - u_{n_0}) \, dP_X. \]

By the triangle inequality
\[ \left| \int f \, dP_{X_n} - \int f \, dP_X \right| \le \int |f - \tilde{f}| \, dP_{X_n} + \left| \int \tilde{f} \, dP_{X_n} - \int \tilde{f} \, dP_X \right| + \int |f - \tilde{f}| \, dP_X. \tag{4.2} \]
For large $n \ge n_1(\varepsilon)$ we have $\left| \int \tilde{f} \, dP_{X_n} - \int \tilde{f} \, dP_X \right| \le \varepsilon$. Furthermore from $0 \le 1 - u_{n_0} \le 1_{\mathbb{R} \setminus I_{n_0}}$ we obtain
\[ \int (1 - u_{n_0}) \, dP_X \le P_X(\mathbb{R} \setminus I_{n_0}) < \varepsilon, \]
so that for all $n \ge n_2(\varepsilon)$ also
\[ \int (1 - u_{n_0}) \, dP_{X_n} < \varepsilon. \]
This yields
\[ \int |f - \tilde{f}| \, dP_X = \int |f| (1 - u_{n_0}) \, dP_X \le \|f\|_\infty \varepsilon \]
on the one hand, and
\[ \int |f - \tilde{f}| \, dP_{X_n} \le \|f\|_\infty \varepsilon \]
for all $n \ge n_2(\varepsilon)$ on the other. Hence we obtain from (4.2) for large $n$:
\[ \left| \int f \, dP_{X_n} - \int f \, dP_X \right| \le 2 \|f\|_\infty \varepsilon + \varepsilon. \]
This proves weak convergence of the distributions.

For the converse, let $\alpha \in \mathbb{R}$ and assume $X \equiv \alpha$ ($X$ is identically equal to $\alpha \in \mathbb{R}$ $P$-almost surely). This means $P_X = \delta_\alpha$, where $\delta_\alpha$ is the Dirac measure concentrated in $\alpha$. For the open interval $I = (\alpha - \varepsilon, \alpha + \varepsilon)$, $\varepsilon > 0$, we may find $f \in C_b(\mathbb{R})$ with $f \le 1_I$ and $f(\alpha) = 1$. Then
\[ \int f \, dP_{X_n} \le P_{X_n}(I) = P(X_n \in I) \le 1. \]
Since we assumed weak convergence of $P_{X_n}$ to $P_X$, we know that
\[ \int f \, dP_{X_n} \to \int f \, dP_X = f(\alpha) = 1 \]
as $n \to \infty$. Since $\int f \, dP_{X_n} \le P(X_n \in I) \le 1$, this implies
\[ P(X_n \in I) \to 1 \]
as $n \to \infty$. But
\[ \{X_n \in I\} = \{|X_n - \alpha| < \varepsilon\} \]
and thus
\[ P(|X_n - X| \ge \varepsilon) = P(|X_n - \alpha| \ge \varepsilon) = 1 - P(|X_n - \alpha| < \varepsilon) \to 0 \]
for all $\varepsilon > 0$. This means $X_n$ converges stochastically to $X$.
Definition 4.5 Let $X_n$, $X$ be random variables on a probability space $(\Omega, \mathcal{F}, P)$. If $P_{X_n}$ converges weakly to $P_X$, we also say that $X_n$ converges to $X$ in distribution.

Remark 4.6 If the random variables in Theorem 4.4 are $\mathbb{R}^d$-valued and the $f : \mathbb{R}^d \to \mathbb{R}$ belong to $C_b(\mathbb{R}^d)$, the statement of the theorem stays valid.
Example 4.7 For each sequence $\sigma_n > 0$ with $\lim_{n \to \infty} \sigma_n = 0$ we have
\[ \lim_{n \to \infty} \mathcal{N}(0, \sigma_n^2) = \delta_0 \quad \text{(weakly)}. \]
Here $\delta_0$ denotes the Dirac measure concentrated in $0 \in \mathbb{R}$. Indeed, substituting $x = \sigma y$ we obtain
\[ \frac{1}{\sqrt{2\pi\sigma^2}} \int_{-\infty}^{\infty} f(x) e^{-\frac{x^2}{2\sigma^2}} \, dx = \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}} e^{-\frac{y^2}{2}} f(\sigma y) \, dy. \]
Now for all $y \in \mathbb{R}$
\[ \left| \frac{1}{\sqrt{2\pi}} e^{-\frac{y^2}{2}} f(\sigma_n y) \right| \le \|f\|_\infty e^{-\frac{y^2}{2}}, \]
which is integrable. Thus by dominated convergence
\begin{align*}
\lim_{n \to \infty} \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi\sigma_n^2}} e^{-\frac{x^2}{2\sigma_n^2}} f(x) \, dx &= \lim_{n \to \infty} \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}} e^{-\frac{y^2}{2}} f(\sigma_n y) \, dy \\
&= \int_{-\infty}^{\infty} \lim_{n \to \infty} \frac{1}{\sqrt{2\pi}} e^{-\frac{y^2}{2}} f(\sigma_n y) \, dy = f(0) = \int f \, d\delta_0.
\end{align*}
Exercise 4.8 Show that the converse direction in Theorem 4.4 is not true if we drop the assumption that $X$ is constant $P$-a.s.

Exercise 4.9 Assume the following holds for a sequence $(X_n)$ of random variables on a probability space $(\Omega, \mathcal{F}, P)$:
\[ P(|X_n| > \varepsilon) < \varepsilon \]
for all $n$ large enough (larger than $n(\varepsilon)$), for each given $\varepsilon > 0$. Is this equivalent to stochastic convergence of $X_n$ to 0?

Exercise 4.10 For a sequence of Poisson distributions $(\mu_n)$ with parameters $\lambda_n > 0$ show that
\[ \lim_{n \to \infty} \mu_n = \delta_0 \quad \text{(weakly)} \]
if $\lim_{n \to \infty} \lambda_n = 0$. Is there a probability measure $\mu$ on $\mathcal{B}^1$ with
\[ \lim_{n \to \infty} \mu_n = \mu \quad \text{(weakly)} \]
if $\lim_{n \to \infty} \lambda_n = \infty$?
5 Independence

The concept of independence of events is one of the most essential in probability theory. It is met already in the introductory courses. Its background is the following.

Assume we are given a probability space $(\Omega, \mathcal{F}, P)$. For two events $A, B \in \mathcal{F}$ with $P(B) > 0$ one may ask how the probability of the event $A$ changes if we already know that $B$ has happened: we only need to consider the probability of $A \cap B$. To obtain a probability again we normalize by $P(B)$ and get the conditional probability of $A$ given $B$:
\[ P(A \mid B) = \frac{P(A \cap B)}{P(B)}. \]
Now we would call $A$ and $B$ independent if the knowledge that $B$ has happened does not change the probability that $A$ will happen or not, i.e. if $P(A \mid B)$ and $P(A)$ are the same:
\[ P(A \mid B) = P(A). \]
In other words, $A$ and $B$ are independent if
\[ P(A \cap B) = P(A) \cdot P(B). \]
More generally we define:

Definition 5.1 A family $(A_i)_{i \in I}$ of events in $\mathcal{F}$ is called independent if for each choice of distinct indices $i_1, \ldots, i_n \in I$
\[ P(A_{i_1} \cap \ldots \cap A_{i_n}) = P(A_{i_1}) \cdot \ldots \cdot P(A_{i_n}). \tag{5.1} \]

Exercise 5.2 Give an example of a sequence of events that are pairwise independent, i.e. each pair of events from this sequence is independent, but not independent (i.e. all events together are not independent).
We generalize Definition 5.1 to set systems in the following way:

Definition 5.3 For each $i \in I$ let $\mathcal{E}_i \subseteq \mathcal{F}$ be a collection of events. The $(\mathcal{E}_i)_{i \in I}$ are called independent if (5.1) holds for each choice of distinct $i_1, \ldots, i_n \in I$, each $n \in \mathbb{N}$, and each $A_{i_\nu} \in \mathcal{E}_{i_\nu}$, $\nu = 1, \ldots, n$.

Exercise 5.4 Show the following:
1. A family $(\mathcal{E}_i)_{i \in I}$ is independent if and only if every finite sub-family is independent.
2. Independence of $(\mathcal{E}_i)_{i \in I}$ is maintained if we reduce the families $(\mathcal{E}_i)$. More precisely, let $(\mathcal{E}_i)$ be independent and $\mathcal{E}'_i \subseteq \mathcal{E}_i$; then also the families $(\mathcal{E}'_i)$ are independent.
3. If for all $n \in \mathbb{N}$ the $(\mathcal{E}^n_i)_{i \in I}$ are independent and for all $n \in \mathbb{N}$ and $i \in I$ we have $\mathcal{E}^n_i \subseteq \mathcal{E}^{n+1}_i$, then $\left( \bigcup_n \mathcal{E}^n_i \right)_{i \in I}$ is independent.
Exercise 5.5 If $(\mathcal{E}_i)_{i \in I}$ are independent, then so are the Dynkin systems $(\mathcal{D}(\mathcal{E}_i))_{i \in I}$. Here $\mathcal{D}(\mathcal{A})$ is the Dynkin system generated by $\mathcal{A}$; it coincides with the intersection of all Dynkin systems containing $\mathcal{A}$. (See the Appendix for definitions.)

Corollary 5.6 Let $(\mathcal{E}_i)_{i \in I}$ be an independent family of $\cap$-stable sets $\mathcal{E}_i \subseteq \mathcal{F}$. Then also the family $(\sigma(\mathcal{E}_i))_{i \in I}$ of the $\sigma$-algebras generated by the $\mathcal{E}_i$ is independent.
Theorem 5.7 Let $(\mathcal{E}_i)_{i \in I}$ be an independent family of $\cap$-stable sets $\mathcal{E}_i \subseteq \mathcal{F}$ and
\[ I = \bigcup_{j \in J} I_j \quad \text{with } I_i \cap I_j = \emptyset, \ i \ne j. \]
Let $\mathcal{A}_j := \sigma\left( \bigcup_{i \in I_j} \mathcal{E}_i \right)$. Then also $(\mathcal{A}_j)_{j \in J}$ is independent.

Proof. For $j \in J$ let $\hat{\mathcal{E}}_j$ be the system of all sets of the form
\[ E_{i_1} \cap \ldots \cap E_{i_n}, \]
where $\emptyset \ne \{i_1, \ldots, i_n\} \subseteq I_j$ and $E_{i_\nu} \in \mathcal{E}_{i_\nu}$, $\nu = 1, \ldots, n$, are arbitrary. Then $\hat{\mathcal{E}}_j$ is $\cap$-stable. As an immediate consequence of the independence of the $(\mathcal{E}_i)_{i \in I}$, also the $(\hat{\mathcal{E}}_j)_{j \in J}$ are independent. Eventually $\mathcal{A}_j = \sigma(\hat{\mathcal{E}}_j)$. Thus the assertion follows from Corollary 5.6.
In the next step we want to show that events that depend on all but finitely many $\sigma$-algebras of a countable family of independent $\sigma$-algebras can only have probability zero or one. To this end we need the following definition.

Definition 5.8 Let $(\mathcal{A}_n)_n$ be a sequence of $\sigma$-algebras from $\mathcal{F}$ and let
\[ \mathcal{F}_n := \sigma\left( \bigcup_{m=n}^{\infty} \mathcal{A}_m \right) \]
be the $\sigma$-algebra generated by $\mathcal{A}_n, \mathcal{A}_{n+1}, \ldots$. Then
\[ \mathcal{F}_\infty := \bigcap_{n=1}^{\infty} \mathcal{F}_n \]
is called the $\sigma$-algebra of the tail events.

Exercise 5.9 Why is $\mathcal{F}_\infty$ a $\sigma$-algebra?
This is now the result announced above:

Theorem 5.10 (Kolmogorov's Zero-One Law) Let $(\mathcal{A}_n)$ be an independent sequence of $\sigma$-algebras $\mathcal{A}_n \subseteq \mathcal{F}$. Then for every tail event $A \in \mathcal{F}_\infty$ it holds that
\[ P(A) = 0 \quad \text{or} \quad P(A) = 1. \]

Proof. Let $A \in \mathcal{F}_\infty$ and let $\mathcal{D}$ be the system of all sets $D \in \mathcal{F}$ that are independent of $A$. We want to show that $A \in \mathcal{D}$.

In the exercise below we show that $\mathcal{D}$ is a Dynkin system. By Theorem 5.7 the $\sigma$-algebra $\mathcal{F}_{n+1}$ is independent of the $\sigma$-algebra
\[ \mathcal{A}^n := \sigma(\mathcal{A}_1 \cup \ldots \cup \mathcal{A}_n). \]
Since $A \in \mathcal{F}_{n+1}$, we know that $\mathcal{A}^n \subseteq \mathcal{D}$ for every $n \in \mathbb{N}$. Thus
\[ \mathcal{A}^\infty := \bigcup_{n=1}^{\infty} \mathcal{A}^n \subseteq \mathcal{D}. \]
Obviously $(\mathcal{A}^n)$ is increasing. For $E, F \in \mathcal{A}^\infty$ there thus exists $n$ with $E, F \in \mathcal{A}^n$, hence $E \cap F \in \mathcal{A}^n$ and thus $E \cap F \in \mathcal{A}^\infty$. This means that $\mathcal{A}^\infty$ is $\cap$-stable. Since $\mathcal{A}^\infty \subseteq \mathcal{D}$, i.e. $(\mathcal{A}^\infty, \{A\})$ is independent, from Exercise 5.5 we conclude that $(\mathcal{D}(\mathcal{A}^\infty), \{A\})$ is independent, so that $\sigma(\mathcal{A}^\infty) = \mathcal{D}(\mathcal{A}^\infty) \subseteq \mathcal{D}$. Moreover $\mathcal{A}_n \subseteq \mathcal{A}^\infty$ for all $n$, hence $\mathcal{F}_1 = \sigma\left( \bigcup_n \mathcal{A}_n \right) \subseteq \sigma(\mathcal{A}^\infty)$. Therefore
\[ A \in \mathcal{F}_\infty \subseteq \mathcal{F}_1 \subseteq \sigma(\mathcal{A}^\infty) \subseteq \mathcal{D}. \]
Therefore $A$ is independent of itself, i.e. it holds that
\[ P(A) = P(A \cap A) = P(A) \cdot P(A) = (P(A))^2. \]
Hence $P(A) \in \{0, 1\}$, as asserted.

Exercise 5.11 Show that $\mathcal{D}$ from the proof of Theorem 5.10 is a Dynkin system.
As an immediate consequence of the Kolmogorov Zero-One Law (Theorem 5.10) we obtain:

Theorem 5.12 (Borel's Zero-One Law) For each independent sequence $(A_n)_n$ of events $A_n \in \mathcal{F}$ we have
\[ P(A_n \text{ for infinitely many } n) = 0 \quad \text{or} \quad P(A_n \text{ for infinitely many } n) = 1, \]
i.e.
\[ P(\limsup A_n) \in \{0, 1\}. \]

Proof. Let $\mathcal{A}_n = \sigma(A_n)$, i.e. $\mathcal{A}_n = \{\emptyset, \Omega, A_n, A_n^c\}$. It follows that $(\mathcal{A}_n)_n$ is independent. For $Q_n := \bigcup_{m=n}^{\infty} A_m$ we have $Q_n \in \mathcal{F}_n$. Since $(\mathcal{F}_n)_n$ is decreasing, we even have $Q_m \in \mathcal{F}_n$ for all $m \ge n$, $n \in \mathbb{N}$. Since $(Q_n)_n$ is decreasing we obtain
\[ \limsup_{n} A_n = \bigcap_{k=1}^{\infty} Q_k = \bigcap_{k=j}^{\infty} Q_k \in \mathcal{F}_j \]
for all $j \in \mathbb{N}$. Hence $\limsup A_n \in \mathcal{F}_\infty$, and the assertion follows from Kolmogorov's zero-one law.
Exercise 5.13 In every $\mathcal{F}$ the pairs $(A, B)$ (any $A, B \in \mathcal{F}$ with $P(A) = 0$ or $P(A) = 1$ or $P(B) = 0$ or $P(B) = 1$) are pairs of independent sets. If these are the only pairs of independent sets, we call $\mathcal{F}$ independence-free. Show that the following space is independence-free:
\[ \Omega = \mathbb{N}, \quad \mathcal{F} = \mathcal{P}(\Omega), \quad P(\{k\}) = 2^{-k!} \text{ for each } k \ge 2, \quad P(\{1\}) = 1 - \sum_{k=2}^{\infty} P(\{k\}). \]
(Hint: by passing to complements if necessary, you may assume that $1 \notin A$ and $1 \notin B$.)
A special case of the above abstract setting is the concept of independent random variables. This will be introduced next. Again we work over a probability space $(\Omega, \mathcal{F}, P)$.

Definition 5.14 A family of random variables $(X_i)_{i \in I}$ is called independent if the $\sigma$-algebras $(\sigma(X_i))_{i \in I}$ generated by them are independent.
For finite families there is another criterion (which is important since, by definition of independence, we only need to check the independence of finite families).

Theorem 5.15 Let $X_1, \ldots, X_n$ be a sequence of $n$ random variables with values in measurable spaces $(\Omega_i, \mathcal{A}_i)$ with $\cap$-stable generators $\mathcal{E}_i$ of $\mathcal{A}_i$. Then $X_1, \ldots, X_n$ are independent if and only if
\[ P(X_1 \in E_1, \ldots, X_n \in E_n) = \prod_{i=1}^{n} P(X_i \in E_i) \]
for all $E_i \in \mathcal{E}_i$, $i = 1, \ldots, n$.

Proof. Put
\[ \mathcal{G}_i := \left\{ X_i^{-1}(E_i) : E_i \in \mathcal{E}_i \right\}. \]
Then $\mathcal{G}_i$ generates $\sigma(X_i)$; $\mathcal{G}_i$ is $\cap$-stable, and we may assume $\Omega \in \mathcal{G}_i$. According to Corollary 5.6 we need to show the independence of $(\mathcal{G}_i)_{i=1,\ldots,n}$, which is equivalent to
\[ P(G_1 \cap \ldots \cap G_n) = P(G_1) \cdot \ldots \cdot P(G_n) \]
for all choices of $G_i \in \mathcal{G}_i$. Sufficiency of the stated condition is evident, since we may choose $G_i = \Omega$ for the appropriate $i$.
Exercise 5.16 Random variables $X_1, \ldots, X_{n+1}$ with values in $(\Omega_i, \mathcal{F}_i)$ are independent if and only if $X_1, \ldots, X_n$ are independent and $X_{n+1}$ is independent of $(X_1, \ldots, X_n)$.
The following theorem states that a measurable deformation of an independent family of random variables stays independent:

Theorem 5.17 Let $(X_i)_{i \in I}$ be a family of independent random variables $X_i$ with values in $(\Omega_i, \mathcal{A}_i)$ and let
\[ f_i : (\Omega_i, \mathcal{A}_i) \to (\Omega'_i, \mathcal{A}'_i) \]
be measurable. Then $(f_i(X_i))_{i \in I}$ is independent.
Proof. Let $i_1, \ldots, i_n \in I$. Then
\begin{align*}
P\left( f_{i_1}(X_{i_1}) \in A'_{i_1}, \ldots, f_{i_n}(X_{i_n}) \in A'_{i_n} \right) &= P\left( X_{i_1} \in f_{i_1}^{-1}(A'_{i_1}), \ldots, X_{i_n} \in f_{i_n}^{-1}(A'_{i_n}) \right) \\
&= \prod_{\nu=1}^{n} P\left( X_{i_\nu} \in f_{i_\nu}^{-1}(A'_{i_\nu}) \right) = \prod_{\nu=1}^{n} P\left( f_{i_\nu}(X_{i_\nu}) \in A'_{i_\nu} \right)
\end{align*}
by the independence of $(X_i)_{i \in I}$. Here the $A'_{i_\nu} \in \mathcal{A}'_{i_\nu}$ were arbitrary.
Already Theorem 5.15 gives rise to the idea that independence of random variables may be somehow related to product measures. This is made more precise in the following theorem. To this end let $X_1, \ldots, X_n$ be random variables such that
\[ X_i : (\Omega, \mathcal{F}) \to (\Omega_i, \mathcal{A}_i). \]
Define
\[ Y := X_1 \otimes \ldots \otimes X_n : \Omega \to \Omega_1 \times \ldots \times \Omega_n. \]
Then the distribution of $Y$, which we denote by $P_Y$, can be computed as $P_Y = P_{X_1 \otimes \ldots \otimes X_n}$. Note that $P_Y$ is a probability measure on $\bigotimes_{i=1}^{n} \mathcal{A}_i$.
Theorem 5.18 The random variables $X_1, \ldots, X_n$ are independent if and only if their joint distribution is the product measure of the individual distributions, i.e. if
\[ P_{X_1 \otimes \ldots \otimes X_n} = P_{X_1} \otimes \ldots \otimes P_{X_n}. \]

Proof. Let $A_i \in \mathcal{A}_i$, $i = 1, \ldots, n$. Then with $Y = X_1 \otimes \ldots \otimes X_n$:
\[ P_Y\left( \prod_{i=1}^{n} A_i \right) = P\left( Y \in \prod_{i=1}^{n} A_i \right) = P(X_1 \in A_1, \ldots, X_n \in A_n), \]
as well as
\[ P_{X_i}(A_i) = P(X_i \in A_i), \quad i = 1, \ldots, n. \]
Now $P_Y$ is the product measure of the $P_{X_i}$ if and only if
\[ P_Y(A_1 \times \ldots \times A_n) = P_{X_1}(A_1) \cdot \ldots \cdot P_{X_n}(A_n). \]
But this is identical with
\[ P(X_1 \in A_1, \ldots, X_n \in A_n) = \prod_{i=1}^{n} P(X_i \in A_i), \]
and according to Theorem 5.15 this is equivalent to the independence of the $X_i$.
6 Products and Sums of Independent Random Variables

In this section we will study independent random variables in greater detail.

Theorem 6.1 Let $X_1, \ldots, X_n$ be independent, real-valued random variables. Then
\[ E\left( \prod_{i=1}^{n} X_i \right) = \prod_{i=1}^{n} E(X_i) \tag{6.1} \]
if $EX_i$ is well defined (and finite) for all $i$. The proof shows that then also $E\left( \prod_{i=1}^{n} X_i \right)$ is well defined.
Proof. We know that $Q := \bigotimes_{i=1}^{n} P_{X_i}$ is the joint distribution of $X_1, \ldots, X_n$. By Proposition 3.2 and Fubini's theorem
\begin{align*}
E\left( \left| \prod_{i=1}^{n} X_i \right| \right) &= \int |x_1 \cdot \ldots \cdot x_n| \, dQ(x_1, \ldots, x_n) \\
&= \int \ldots \int |x_1| \cdot \ldots \cdot |x_n| \, dP_{X_1}(x_1) \ldots dP_{X_n}(x_n) \\
&= \int |x_1| \, dP_{X_1}(x_1) \cdot \ldots \cdot \int |x_n| \, dP_{X_n}(x_n).
\end{align*}
This shows that integrability of the $X_i$ implies integrability of $\prod_{i=1}^{n} X_i$. In this case the equalities are also true without absolute values. This proves the result.
Exercise 6.2 For any two integrable random variables $X, Y$, Theorem 6.1 tells us that independence of $X, Y$ implies
\[ E(X \cdot Y) = E(X) \cdot E(Y). \]
Show that the converse is not true.

Definition 6.3 For any two random variables $X, Y$ that are integrable and have an integrable product, we define the covariance of $X$ and $Y$ to be
\[ \mathrm{cov}(X, Y) = E[(X - EX)(Y - EY)] = E(XY) - EX \, EY. \]
$X$ and $Y$ are called uncorrelated if $\mathrm{cov}(X, Y) = 0$.

Remark 6.4 If $X, Y$ are independent, then $\mathrm{cov}(X, Y) = 0$.
Theorem 6.5 Let $X_1, \ldots, X_n$ be square integrable random variables. Then
\[ V\left( \sum_{i=1}^{n} X_i \right) = \sum_{i=1}^{n} V(X_i) + \sum_{i \ne j} \mathrm{cov}(X_i, X_j). \tag{6.2} \]
In particular, if $X_1, \ldots, X_n$ are uncorrelated,
\[ V\left( \sum_{i=1}^{n} X_i \right) = \sum_{i=1}^{n} V(X_i). \tag{6.3} \]
Proof. We have
\begin{align*}
V\left( \sum_{i=1}^{n} X_i \right) &= E\left[ \left( \sum_{i=1}^{n} (X_i - EX_i) \right)^2 \right] \\
&= E\left[ \sum_{i=1}^{n} (X_i - EX_i)^2 + \sum_{i \ne j} (X_i - EX_i)(X_j - EX_j) \right] \\
&= \sum_{i=1}^{n} V(X_i) + \sum_{i \ne j} \mathrm{cov}(X_i, X_j).
\end{align*}
This proves (6.2). For (6.3) just note that for uncorrelated random variables $X, Y$ one has $\mathrm{cov}(X, Y) = 0$.
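The additivity of the variance in (6.3) is easy to check by simulation. A quick sketch of my own (assuming Python with numpy), with $n$ independent Exp(1) summands of variance 1 each:

    import numpy as np

    rng = np.random.default_rng(3)
    n, trials = 5, 500_000

    # n independent Exp(1) columns, each with variance 1; by (6.3)
    # the variance of the row sums should be close to n.
    X = rng.exponential(size=(trials, n))
    print(X.sum(axis=1).var())     # approximately n = 5
    print(X.var(axis=0))           # the individual variances, each ~ 1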
Eventually we turn to determining the distribution of a sum of independent random variables.

Theorem 6.6 Let $X_1, \ldots, X_n$ be independent $\mathbb{R}^d$-valued random variables. Then the distribution of the sum $S_n := X_1 + \ldots + X_n$ is given by the convolution product of the distributions of the $X_i$, i.e.
\[ P_{S_n} = P_{X_1} * P_{X_2} * \ldots * P_{X_n}. \]

Proof. Again let $Y := X_1 \otimes \ldots \otimes X_n : \Omega \to (\mathbb{R}^d)^n$, and let
\[ A_n : \mathbb{R}^d \times \ldots \times \mathbb{R}^d \to \mathbb{R}^d \]
denote vector addition. Then $S_n = A_n \circ Y$, hence a random variable. Now $P_{S_n}$ is the image measure of $P$ under $A_n \circ Y$, which we denote by $(A_n \circ Y)(P)$. Thus
\[ P_{S_n} = (A_n \circ Y)(P) = A_n(P_Y). \]
Now $P_Y = \bigotimes P_{X_i}$. So by the definition of the convolution product
\[ P_{X_1} * \ldots * P_{X_n} = A_n(P_Y) = P_{S_n}. \]

More explicitly, in the case $d = 1$, let $g(x_1, \ldots, x_n) = 1$ for $x_1 + \ldots + x_n \le s$ and $g(x_1, \ldots, x_n) = 0$ otherwise. Then an application of Fubini's theorem yields
\begin{align*}
P(S_n \le s) &= E(g(X_1, \ldots, X_n)) = \int \!\! \int g(x_1, x_2, \ldots, x_n) \, dP_{X_1}(x_1) \, dP_{(X_2, \ldots, X_n)}(x_2, \ldots, x_n) \\
&= \int P(X_1 \le s - x_2 - \ldots - x_n) \, dP_{(X_2, \ldots, X_n)}(x_2, \ldots, x_n).
\end{align*}
In the case that $X_1$ has a density $f_{X_1}$ with respect to Lebesgue measure, and the order of differentiation with respect to $s$ and integration can be exchanged, it follows that $S_n$ has a density $f_{S_n}$ and
\[ f_{S_n}(s) = \int f_{X_1}(s - x_2 - \ldots - x_n) \, dP_{(X_2, \ldots, X_n)}(x_2, \ldots, x_n). \]
The same formula holds in the case that $X_1, \ldots, X_n$ have a discrete distribution (that is, almost surely assume values in a fixed countable subset of $\mathbb{R}$), if densities are taken with respect to the counting measure.
Example 6.7
1. As we learned in Introduction to Statistics, the convolution of a Binomial distribution with parameters $n$ and $p$, $B(n, p)$, and a Binomial distribution $B(m, p)$ is a $B(n + m, p)$ distribution:
\[ B(n, p) * B(m, p) = B(n + m, p). \]
2. As we learned in Introduction to Statistics, the convolution of a $\mathcal{P}(\lambda)$-distribution (a Poisson distribution with parameter $\lambda$) with a $\mathcal{P}(\mu)$-distribution is a $\mathcal{P}(\lambda + \mu)$-distribution:
\[ \mathcal{P}(\lambda) * \mathcal{P}(\mu) = \mathcal{P}(\lambda + \mu). \]
3. As has been communicated in Introduction to Probability:
\[ \mathcal{N}(\mu_1, \sigma_1^2) * \mathcal{N}(\mu_2, \sigma_2^2) = \mathcal{N}(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2). \]
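The second convolution identity can be observed directly by simulation. The following is a hypothetical illustration of my own (Python with numpy assumed), comparing the empirical law of a sum of independent Poisson samples with the $\mathcal{P}(\lambda + \mu)$ mass function:

    import numpy as np
    from math import exp, factorial

    rng = np.random.default_rng(4)
    lam, mu, trials = 1.5, 2.5, 500_000

    # Sum of independent Poisson(lam) and Poisson(mu) samples ...
    s = rng.poisson(lam, trials) + rng.poisson(mu, trials)

    # ... compared with the Poisson(lam + mu) probability mass function.
    for k in range(6):
        exact = exp(-(lam + mu)) * (lam + mu) ** k / factorial(k)
        print(f"k={k}: empirical {np.mean(s == k):.4f}  exact {exact:.4f}")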
7 Infinite product probability spaces

Many theorems in probability theory start with: "Let $X_1, X_2, \ldots, X_n, \ldots$ be a sequence of i.i.d. random variables." But how do we know that such sequences really exist? This will be shown in this section. In the last section we established the framework for some of the most important theorems of probability, such as the Weak Law of Large Numbers or the Central Limit Theorem. Those are the theorems that assume: "Let $X_1, X_2, \ldots, X_n$ be i.i.d. random variables." Others, such as the Strong Law of Large Numbers, ask for the behavior of an infinite sequence of independent and identically distributed (i.i.d.) random variables; they usually start like "Let $X_1, X_2, \ldots$ be a sequence of i.i.d. random variables." The natural first question to ask is: does such a sequence exist at all?

In the same way as the existence of a finite sequence of i.i.d. random variables is related to finite product measures, the answer to the above question is related to infinite product measures. So we assume that we are given a sequence of measure spaces $(\Omega_n, \mathcal{A}_n, \mu_n)$, of which we moreover assume that $\mu_n(\Omega_n) = 1$ for all $n$.

We construct $(\Omega, \mathcal{A})$ as follows: we want each $\omega \in \Omega$ to be a sequence $(\omega_n)_n$ where $\omega_n \in \Omega_n$. So we put
\[ \Omega := \prod_{n=1}^{\infty} \Omega_n. \]
Moreover we have the idea that a probability measure on $\Omega$ should be defined by what happens on the first $n$ coordinates, $n \in \mathbb{N}$. So for $A_1 \in \mathcal{A}_1, A_2 \in \mathcal{A}_2, \ldots, A_n \in \mathcal{A}_n$, $n \in \mathbb{N}$, we want
\[ A := A_1 \times \ldots \times A_n \times \Omega_{n+1} \times \Omega_{n+2} \times \ldots \tag{7.1} \]
to be in $\mathcal{A}$. By independence we want to define a measure $\mu$ on $(\Omega, \mathcal{A})$ that assigns to $A$ defined in (7.1) the mass
\[ \mu(A) = \mu_1(A_1) \cdot \ldots \cdot \mu_n(A_n). \]
We will solve this problem in greater generality. Let $I$ be an index set and $(\Omega_i, \mathcal{A}_i, \mu_i)_{i \in I}$ be measure spaces with $\mu_i(\Omega_i) = 1$. For $\emptyset \ne K \subseteq I$ define
\[ \Omega_K := \prod_{i \in K} \Omega_i, \tag{7.2} \]
in particular $\Omega := \Omega_I$. Let $p^K_J$ for $J \subseteq K$ denote the canonical projection from $\Omega_K$ to $\Omega_J$. For $J = \{i\}$ we will also write $p^K_i$ instead of $p^K_{\{i\}}$, and $p_i$ in place of $p^I_i$. Obviously
\[ p^L_J = p^K_J \circ p^L_K \quad (J \subseteq K \subseteq L) \tag{7.3} \]
and
\[ p_J := p^I_J = p^K_J \circ p_K \quad (J \subseteq K). \tag{7.4} \]
Moreover denote by
\[ \mathcal{H}(I) := \{J \subseteq I : J \ne \emptyset, \ |J| \text{ is finite}\}. \]
For $J \in \mathcal{H}(I)$ the $\sigma$-algebras and measures
\[ \mathcal{A}_J := \bigotimes_{i \in J} \mathcal{A}_i \quad \text{and} \quad \mu_J := \bigotimes_{i \in J} \mu_i \]
are defined by Fubini's theorem from measure theory.

In analogy to the finite-dimensional case we define:

Definition 7.1 The product $\sigma$-algebra $\bigotimes_{i \in I} \mathcal{A}_i$ of the $\sigma$-algebras $(\mathcal{A}_i)_{i \in I}$ is defined as the smallest $\sigma$-algebra $\mathcal{A}$ on $\Omega$ such that all projections $p_i : \Omega \to \Omega_i$ are $(\mathcal{A}, \mathcal{A}_i)$-measurable. Hence
\[ \bigotimes_{i \in I} \mathcal{A}_i := \sigma(p_i, \ i \in I). \tag{7.5} \]

Exercise 7.2 Show that
\[ \bigotimes_{i \in I} \mathcal{A}_i = \sigma(p_J, \ J \in \mathcal{H}(I)). \tag{7.6} \]
According to the above we are now looking for a measure $\mu$ on $(\Omega, \mathcal{A})$ that assigns mass $\mu_1(A_1) \cdot \ldots \cdot \mu_n(A_n)$ to each $A$ as defined in (7.1). In other words,
\[ \mu\left( p_J^{-1}\left( \prod_{i \in J} A_i \right) \right) = \mu_J\left( \prod_{i \in J} A_i \right). \]
The question whether such a measure exists is solved in:

Theorem 7.3 On $\mathcal{A} := \bigotimes_{i \in I} \mathcal{A}_i$ there is a unique measure $\mu$ with
\[ p_J(\mu) := \mu \circ p_J^{-1} = \mu_J \tag{7.7} \]
for all $J \in \mathcal{H}(I)$. It holds that $\mu(\Omega) = 1$.
Proof. We may assume $|I| = \infty$, since otherwise the result is known from Fubini's theorem. We start with some preparatory considerations.

In Exercise 7.4 below it will be shown that $p^K_J$ is $(\mathcal{A}_K, \mathcal{A}_J)$-measurable for $J \subseteq K$ and that $p^K_J(\mu_K) = \mu_J$ $(J \subseteq K$, $J, K \in \mathcal{H}(I))$. Hence, if we introduce the $\sigma$-algebra of the $J$-cylinder sets
\[ \mathcal{Z}_J := p_J^{-1}(\mathcal{A}_J) \quad (J \in \mathcal{H}(I)), \tag{7.8} \]
the measurability of $p^K_J$ implies $(p^K_J)^{-1}(\mathcal{A}_J) \subseteq \mathcal{A}_K$ and thus
\[ \mathcal{Z}_J \subseteq \mathcal{Z}_K \quad (J \subseteq K, \ J, K \in \mathcal{H}(I)). \tag{7.9} \]
Eventually we introduce the system of all cylinder sets
\[ \mathcal{Z} := \bigcup_{J \in \mathcal{H}(I)} \mathcal{Z}_J. \]
Note that due to (7.9), for $Z_1, Z_2 \in \mathcal{Z}$ we have $Z_1, Z_2 \in \mathcal{Z}_J$ for a suitably chosen $J \in \mathcal{H}(I)$. Hence $\mathcal{Z}$ is an algebra (but generally not a $\sigma$-algebra). From (7.5) and (7.6) it follows that
\[ \mathcal{A} = \sigma(\mathcal{Z}). \]

Now we come to the main part of the proof. This will be divided into four parts.

1. Assume $Z \in \mathcal{Z}$, $Z = p_J^{-1}(A)$, $J \in \mathcal{H}(I)$, $A \in \mathcal{A}_J$. According to (7.7), $Z$ must get mass $\mu(Z) = \mu_J(A)$. We have to show that this is well defined. So let
\[ Z = p_J^{-1}(A) = p_K^{-1}(B) \]
for $J, K \in \mathcal{H}(I)$, $A \in \mathcal{A}_J$, $B \in \mathcal{A}_K$. If $J \subseteq K$ we obtain
\[ p_J^{-1}(A) = p_K^{-1}\left( (p^K_J)^{-1}(A) \right) \]
and thus
\[ p_K^{-1}(B) = p_K^{-1}(B') \quad \text{with } B' := (p^K_J)^{-1}(A). \]
Since $p_K(\Omega) = \Omega_K$ we obtain
\[ B = B' = (p^K_J)^{-1}(A). \]
Thus by the introductory considerations
\[ \mu_K(B) = \mu_J(A). \]
For arbitrary $J, K$ define $L := J \cup K$. Since $J, K \subseteq L$, (7.9) implies the existence of $C \in \mathcal{A}_L$ with $p_L^{-1}(C) = p_J^{-1}(A) = p_K^{-1}(B)$. Therefore, from what we have just seen,
\[ \mu_L(C) = \mu_J(A) \quad \text{and} \quad \mu_L(C) = \mu_K(B), \]
hence $\mu_J(A) = \mu_K(B)$. Thus the function
\[ \mu_0\left( p_J^{-1}(A) \right) := \mu_J(A) \quad (J \in \mathcal{H}(I), \ A \in \mathcal{A}_J) \tag{7.10} \]
is well defined on $\mathcal{Z}$.

2. Now we show that $\mu_0$ as defined in (7.10) is a volume on $\mathcal{Z}$. Trivially it holds that $\mu_0 \ge 0$ and $\mu_0(\emptyset) = 0$. Moreover, as shown above, for $Y, Z \in \mathcal{Z}$ with $Y \cap Z = \emptyset$ there are $J \in \mathcal{H}(I)$ and $A, B \in \mathcal{A}_J$ such that $Y = p_J^{-1}(A)$, $Z = p_J^{-1}(B)$. Now $Y \cap Z = \emptyset$ implies $A \cap B = \emptyset$, and due to
\[ Y \cup Z = p_J^{-1}(A \cup B) \]
we obtain
\[ \mu_0(Y \cup Z) = \mu_J(A \cup B) = \mu_J(A) + \mu_J(B) = \mu_0(Y) + \mu_0(Z), \]
hence the finite additivity of $\mu_0$.

It remains to show that $\mu_0$ is also $\sigma$-additive. Then the general principles from measure theory yield that $\mu_0$ can be uniquely extended to a measure $\mu$ on $\sigma(\mathcal{Z}) = \mathcal{A}$. This $\mu$ is a probability measure, because $\Omega = p_J^{-1}(\Omega_J)$ for all $J \in \mathcal{H}(I)$ and therefore
\[ \mu(\Omega) = \mu_0(\Omega) = \mu_J(\Omega_J) = 1. \]
To prove the $\sigma$-additivity of $\mu_0$ we first show:

3. Let $Z \in \mathcal{Z}$ and $J \in \mathcal{H}(I)$. Then for all $\omega_J \in \Omega_J$ the set
\[ Z_{\omega_J} := \{\omega \in \Omega : (\omega_J, p_{I \setminus J}(\omega)) \in Z\} \]
is a cylinder set. This set consists of all $\omega \in \Omega$ with the following property: if we replace the coordinates $\omega_i$ with $i \in J$ by the corresponding coordinates of $\omega_J$, we obtain a point in $Z$. Moreover
\[ \mu_0(Z) = \int \mu_0(Z_{\omega_J}) \, d\mu_J(\omega_J). \tag{7.11} \]
This is shown by the following consideration. For $Z \in \mathcal{Z}$ there are $K \in \mathcal{H}(I)$ and $A \in \mathcal{A}_K$ such that $Z = p_K^{-1}(A)$; this means that $\mu_0(Z) = \mu_K(A)$. Since $I$ is infinite we may assume $J \subseteq K$ and $J \ne K$. For the $\omega_J$-section of $A$ in $\Omega_K$, which we call $A_{\omega_J}$, i.e. for the set of all $\omega' \in \Omega_{K \setminus J}$ with $(\omega_J, \omega') \in A$, it holds that
\[ Z_{\omega_J} = p_{K \setminus J}^{-1}(A_{\omega_J}). \]
By Fubini's theorem $A_{\omega_J} \in \mathcal{A}_{K \setminus J}$, and hence the $Z_{\omega_J} = p_{K \setminus J}^{-1}(A_{\omega_J})$ are $(K \setminus J)$-cylinder sets. Since $\mu_K = \mu_J \otimes \mu_{K \setminus J}$, Fubini's theorem implies
\[ \mu_0(Z) = \mu_K(A) = \int \mu_{K \setminus J}(A_{\omega_J}) \, d\mu_J(\omega_J). \tag{7.12} \]
But this is (7.11), since $\mu_0(Z_{\omega_J}) = \mu_{K \setminus J}(A_{\omega_J})$ (because of $Z_{\omega_J} = p_{K \setminus J}^{-1}(A_{\omega_J})$).

4. Eventually we show that $\mu_0$ is $\emptyset$-continuous and thus $\sigma$-additive. To this end let $(Z_n)$ be a decreasing sequence of cylinder sets in $\mathcal{Z}$ with $\alpha := \inf_n \mu_0(Z_n) > 0$. We will show that
\[ \bigcap_{n=1}^{\infty} Z_n \ne \emptyset. \tag{7.13} \]
Now each $Z_n$ is of the form $Z_n = p_{J_n}^{-1}(A_n)$, $J_n \in \mathcal{H}(I)$, $A_n \in \mathcal{A}_{J_n}$. Due to (7.9) we may assume $J_1 \subseteq J_2 \subseteq J_3 \subseteq \ldots$. We apply the result proved in part 3 to $J = J_1$ and $Z = Z_n$. As $\omega_{J_1} \mapsto \mu_0\left( (Z_n)_{\omega_{J_1}} \right)$ is $\mathcal{A}_{J_1}$-measurable,
\[ Q_n := \left\{ \omega_{J_1} \in \Omega_{J_1} : \mu_0\left( (Z_n)_{\omega_{J_1}} \right) \ge \frac{\alpha}{2} \right\} \in \mathcal{A}_{J_1}. \]
Since all the $\mu_J$ have mass one, we obtain from (7.11):
\[ \mu_0(Z_n) \le \mu_{J_1}(Q_n) + \frac{\alpha}{2}, \]
hence $\mu_{J_1}(Q_n) \ge \frac{\alpha}{2} > 0$ for all $n \in \mathbb{N}$. Together with $(Z_n)$, also $(Q_n)$ is decreasing. $\mu_{J_1}$, as a finite measure, is $\emptyset$-continuous, which implies $\bigcap_{n=1}^{\infty} Q_n \ne \emptyset$. Hence there is
\[ \omega_{J_1} \in \bigcap_{n=1}^{\infty} Q_n \quad \text{with} \quad \mu_0\left( (Z_n)_{\omega_{J_1}} \right) \ge \frac{\alpha}{2} > 0 \text{ for all } n. \tag{7.14} \]
Successive application of part 3 implies via induction that for each $k \in \mathbb{N}$ there is $\omega_{J_k} \in \Omega_{J_k}$ with
\[ \mu_0\left( (Z_n)_{\omega_{J_k}} \right) \ge \frac{\alpha}{2^k} > 0 \quad \text{and} \quad p^{J_{k+1}}_{J_k}\left( \omega_{J_{k+1}} \right) = \omega_{J_k}. \]
Due to this second property there is $\omega_0 \in \Omega$ with $p_{J_k}(\omega_0) = \omega_{J_k}$ for all $k$. Because of (7.14) we have $(Z_n)_{\omega_{J_n}} \ne \emptyset$, so that there is $\omega_n$ with $\left( \omega_{J_n}, p_{I \setminus J_n}(\omega_n) \right) \in Z_n$. But then also
\[ \left( \omega_{J_n}, p_{I \setminus J_n}(\omega_0) \right) = \omega_0 \in Z_n. \]
Thus $\omega_0 \in \bigcap_{n=1}^{\infty} Z_n$, which proves (7.13).

Therefore $\mu_0$ is $\sigma$-additive and hence has an extension $\mu$ to $\mathcal{A}$ by Carathéodory's theorem (Theorem 14.7). Since $\mu_0$ is finite, the extension is unique, and it is a probability measure: for $J \in \mathcal{H}(I)$ we have $\Omega = p_J^{-1}(\Omega_J)$ and hence
\[ \mu(\Omega) = \mu_0(\Omega) = \mu_J(\Omega_J) = 1. \]
This proves the theorem.
We conclude the chapter with an exercise that was left open during this proof.

Exercise 7.4 With the notation of this section, in particular of Theorem 7.3, show that $p^K_J$ is $(\mathcal{A}_K, \mathcal{A}_J)$-measurable $(J \subseteq K$, $J, K \in \mathcal{H}(I))$ and that
\[ p^K_J(\mu_K) = \mu_J. \]
8 Zero-One Laws

Already in Section 5 we encountered the prototype of a zero-one law: for a sequence of independent events $(A_n)_n$ we have Borel's Zero-One Law (Theorem 5.12):
\[ P(\limsup A_n) \in \{0, 1\}. \]
In a first step we will now ask when the probability in question is zero and when it is one. This leads to the following frequently used lemma:

Lemma 8.1 (Borel-Cantelli Lemma) Let $(A_n)$ be a sequence of events over a probability space $(\Omega, \mathcal{F}, P)$. Then
\[ \sum_{n=1}^{\infty} P(A_n) < \infty \implies P(\limsup A_n) = 0. \tag{8.1} \]
If the events $(A_n)$ are pairwise independent, then also
\[ \sum_{n=1}^{\infty} P(A_n) = \infty \implies P(\limsup A_n) = 1. \tag{8.2} \]

Remark 8.2 The Borel-Cantelli Lemma is most often used in the form of (8.1). Note that this part does not require any knowledge about the dependence structure of the $A_n$.
Proof of Lemma 8.1. (8.1) is easy. Define
\[ A := \limsup A_n = \bigcap_{n=1}^{\infty} \bigcup_{i=n}^{\infty} A_i. \]
This implies
\[ A \subseteq \bigcup_{i=n}^{\infty} A_i \quad \text{for all } n \in \mathbb{N}, \]
and thus
\[ P(A) \le P\left( \bigcup_{i=n}^{\infty} A_i \right) \le \sum_{i=n}^{\infty} P(A_i). \tag{8.3} \]
Since $\sum_{i=1}^{\infty} P(A_i)$ converges, $\sum_{i=n}^{\infty} P(A_i)$ converges to zero as $n \to \infty$. This implies $P(A) = 0$, hence (8.1).

For (8.2) again put $A := \limsup A_n$ and furthermore
\[ I_n := 1_{A_n}, \quad S_n := \sum_{j=1}^{n} I_j, \quad \text{and eventually} \quad S := \sum_{j=1}^{\infty} I_j. \]
Since the $A_n$ are assumed to be pairwise independent, they are pairwise uncorrelated as well. Hence
\[ V(S_n) = \sum_{j=1}^{n} V(I_j) = \sum_{j=1}^{n} \left[ E(I_j^2) - E(I_j)^2 \right] = E(S_n) - \sum_{j=1}^{n} E(I_j)^2 \le E(S_n), \]
where the second equality follows since $I_j^2 = I_j$. Now by assumption
\[ \sum_{n=1}^{\infty} E(I_n) = +\infty. \]
Since $S_n \uparrow S$ this is equivalent to
\[ \lim_{n \to \infty} E(S_n) = E(S) = \infty. \tag{8.4} \]
On the other hand $\omega \in A$ if and only if $\omega \in A_n$ for infinitely many $n$, which is the case if and only if $S(\omega) = +\infty$. The assertion thus is
\[ P(S = +\infty) = 1. \]
This can be seen as follows. By Chebyshev's inequality
\[ P(|S_n - E(S_n)| \le \varepsilon) \ge 1 - \frac{V(S_n)}{\varepsilon^2} \]
for all $\varepsilon > 0$. Because of (8.4) we may assume that $ES_n > 0$ and choose $\varepsilon = \frac{1}{2} ES_n$. Hence
\[ P\left( S_n \ge \frac{1}{2} E(S_n) \right) \ge P\left( |S_n - ES_n| \le \frac{1}{2} ES_n \right) \ge 1 - 4 \frac{V(S_n)}{E(S_n)^2}. \]
But $V(S_n) \le E(S_n)$ and $E(S_n) \to \infty$. Thus
\[ \lim \frac{V(S_n)}{E(S_n)^2} = 0. \]
Therefore for all $\varepsilon > 0$ and all $n$ large enough
\[ P\left( S_n \ge \frac{1}{2} ES_n \right) \ge 1 - \varepsilon. \]
But now $S \ge S_n$ and hence also
\[ P\left( S \ge \frac{1}{2} ES_n \right) \ge 1 - \varepsilon \]
for all $\varepsilon > 0$. But this implies $P(S = +\infty) = 1$, which is what we wanted to show.
Example 8.3 Let $(X_n)$ be a sequence of real-valued random variables which satisfies
\[ \sum_{n=1}^{\infty} P(|X_n| > \varepsilon) < \infty \tag{8.5} \]
for all $\varepsilon > 0$. Then $X_n \to 0$ $P$-a.s. Indeed, the Borel-Cantelli Lemma says that (8.5) implies that
\[ P(|X_n| > \varepsilon \text{ infinitely often in } n) = 0. \]
But this is exactly the definition of almost sure convergence of $X_n$ to 0.
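The dichotomy in the Borel-Cantelli Lemma can be watched along a single realization. A small sketch of my own (assuming Python with numpy), with independent events whose probabilities are summable in one case and non-summable in the other:

    import numpy as np

    rng = np.random.default_rng(5)
    N = 100_000
    u = rng.random(N)
    n = np.arange(1, N + 1)

    # Independent events with P(A_n) = 1/n^2 (summable series) versus
    # P(B_n) = 1/n (divergent series), realized from the same uniforms.
    hits_A = n[u < 1.0 / n**2]
    hits_B = n[u < 1.0 / n]

    print(len(hits_A), hits_A)         # finitely many hits, all early
    print(len(hits_B), hits_B[-5:])    # hits keep occurring up to N

In the first case (8.1) predicts only finitely many occurrences; in the second, (8.2) predicts infinitely many, and indeed the hits never stop.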
Exercise 8.4 Is (8.5) equivalent to $P$-almost sure convergence of $X_n$ to 0?

Here is how Theorem 5.10 translates to random variables.

Theorem 8.5 (Kolmogorov's 0-1 Law) Let $(X_n)_n$ be a sequence of independent random variables with values in arbitrary measurable spaces. Then for every tail event $A$, i.e. for each $A$ with
\[ A \in \bigcap_{n=1}^{\infty} \sigma(X_m, \ m \ge n), \]
it holds that $P(A) \in \{0, 1\}$.

Exercise 8.6 Derive Theorem 8.5 from Theorem 5.10.
Corollary 8.7 Let $(X_n)_{n \in \mathbb{N}}$ be a sequence of independent, real-valued random variables. Define
\[ \mathcal{F}_\infty := \bigcap_{m=1}^{\infty} \sigma(X_i, \ i \ge m) \]
to be the tail $\sigma$-algebra. If $T$ is a real-valued random variable that is measurable with respect to $\mathcal{F}_\infty$, then $T$ is $P$-almost surely constant, i.e. there is $\gamma \in \mathbb{R}$ such that
\[ P(T = \gamma) = 1. \]
Such random variables $T : \Omega \to \mathbb{R}$ that are $\mathcal{F}_\infty$-measurable are called tail functions.

Proof. For each $\alpha \in \mathbb{R}$ we have that
\[ \{T \le \alpha\} \in \mathcal{F}_\infty. \]
This implies $P(T \le \alpha) \in \{0, 1\}$. On the other hand, $\alpha \mapsto P(T \le \alpha)$ being a distribution function, we have
\[ \lim_{\alpha \to -\infty} P(T \le \alpha) = 0 \quad \text{and} \quad \lim_{\alpha \to +\infty} P(T \le \alpha) = 1. \]
Let $C := \{\alpha \in \mathbb{R} : P(T \le \alpha) = 1\}$ and $\gamma := \inf C = \inf\{\alpha \in \mathbb{R} : P(T \le \alpha) = 1\}$. Then for an appropriately chosen decreasing sequence $(\gamma_n) \subseteq C$ we have $\gamma_n \downarrow \gamma$, and since $\{T \le \gamma_n\} \downarrow \{T \le \gamma\}$ we have $\gamma \in C$. Hence $\gamma = \min C$. This implies
\[ P(T < \gamma) = 0, \]
which implies
\[ P(T = \gamma) = 1. \]
Exercise 8.8 A coin is tossed infinitely often. Show that every finite sequence
\[ (\eta_1, \ldots, \eta_k), \quad \eta_i \in \{H, T\}, \ k \in \mathbb{N}, \]
occurs infinitely often with probability one.
Exercise 8.9 Try to prove (8.2) in the Borel-Cantelli Lemma for independent events $(A_n)$ as follows:
1. For each sequence $(\alpha_n)$ of real numbers with $0 \le \alpha_n \le 1$ we have
\[ \prod_{i=1}^{n} (1 - \alpha_i) \le \exp\left( -\sum_{i=1}^{n} \alpha_i \right). \tag{8.6} \]
This implies
\[ \sum_{n=1}^{\infty} \alpha_n = \infty \implies \lim_{n \to \infty} \prod_{i=1}^{n} (1 - \alpha_i) = 0. \]
2. For $A := \limsup A_n$ we have
\[ 1 - P(A) = \lim_{n \to \infty} P\left( \bigcap_{m=n}^{\infty} A_m^c \right) = \lim_{n \to \infty} \lim_{N \to \infty} \prod_{m=n}^{N} (1 - P(A_m)). \]
3. As $\sum P(A_n)$ diverges, because of 1 we have
\[ \lim_{N \to \infty} \prod_{m=n}^{N} (1 - P(A_m)) = 0, \]
and hence $P(A) = 1$ because of 2. Fill in the missing details!
9 Laws of Large Numbers

The central goal of probability theory is to describe the asymptotic behavior of a sequence of random variables. In its easiest form this has already been done for i.i.d. sequences in Introduction to Probability and Statistics. In the first theorem of this section this is slightly generalized.

Theorem 9.1 (Khintchine) Let $(X_n)_{n \in \mathbb{N}}$ be a sequence of square integrable, real-valued random variables that are pairwise uncorrelated. Assume
\[ \lim_{n \to \infty} \frac{1}{n^2} \sum_{i=1}^{n} V(X_i) = 0. \]
Then the weak law of large numbers holds, i.e.
\[ \lim_{n \to \infty} P\left( \left| \frac{1}{n} \sum_{i=1}^{n} X_i - \frac{1}{n} E\sum_{i=1}^{n} X_i \right| > \varepsilon \right) = 0 \quad \text{for all } \varepsilon > 0. \]
Proof. By Chebyshev's inequality, for each $\varepsilon > 0$:
\begin{align*}
P\left( \left| \frac{1}{n} \sum_{i=1}^{n} (X_i - EX_i) \right| > \varepsilon \right) &\le \frac{1}{\varepsilon^2} V\left( \frac{1}{n} \sum_{i=1}^{n} (X_i - EX_i) \right) = \frac{1}{\varepsilon^2} \frac{1}{n^2} V\left( \sum_{i=1}^{n} (X_i - EX_i) \right) \\
&= \frac{1}{\varepsilon^2} \frac{1}{n^2} \sum_{i=1}^{n} V(X_i - EX_i) = \frac{1}{\varepsilon^2} \frac{1}{n^2} \sum_{i=1}^{n} V(X_i).
\end{align*}
Here we used that the random variables are pairwise uncorrelated. By assumption the latter expression converges to zero.
Remark 9.2 As we will learn in the next theorem, for a pairwise independent, identically distributed sequence square integrability is not even required.

Theorem 9.1 raises the question whether we can replace the stochastic convergence there by almost sure convergence. This will be shown in the following theorem. Such a theorem is called a strong law of large numbers. Its first form was proved by Kolmogorov. We will present a proof due to Etemadi from 1981.

Theorem 9.3 (Strong Law of Large Numbers; Etemadi 1981) For each sequence $(X_n)_n$ of real-valued, pairwise independent, identically distributed, integrable random variables the Strong Law of Large Numbers holds, i.e.
\[ P\left( \limsup_{n \to \infty} \left| \frac{1}{n} \sum_{i=1}^{n} X_i - EX_1 \right| > \varepsilon \right) = 0 \quad \text{for each } \varepsilon > 0. \]
Before we prove Theorem 9.3 let us make a couple of remarks. These should reveal the structure of the proof a bit:

1. Denote $S_n = \sum_{i=1}^{n} X_i$. Then Theorem 9.3 asserts that $\frac{1}{n} S_n \to EX_1$, $P$-almost surely.

2. Together with $X_n$ also $X_n^+$ and $X_n^-$ (where $X_n^+ = \max(X_n, 0)$ and $X_n^- = (-X_n)^+$) satisfy the assumptions of Theorem 9.3. Since $X_n = X_n^+ - X_n^-$, it therefore suffices to prove Theorem 9.3 for positive random variables. We therefore assume $X_n \ge 0$ for the rest of the proof.

3. All proofs of the Strong Law of Large Numbers use the following trick: we truncate the random variables $X_n$ by cutting off values that are too large. We therefore introduce
\[ Y_n := X_n 1_{\{|X_n| < n\}} = X_n 1_{\{X_n < n\}}. \]
Of course, if $\mu$ is the distribution of $X_n$ and $\mu_n$ is the distribution of $Y_n$, then $\mu_n \ne \mu$. Indeed $\mu_n = f_n(\mu)$, where
\[ f_n(x) := \begin{cases} x & \text{if } 0 \le x < n, \\ 0 & \text{otherwise.} \end{cases} \]
The idea behind truncation is that we gain square integrability of the sequence. Indeed:
\[ E(Y_n^2) = E(f_n^2 \circ X_n) = \int f_n^2(x) \, d\mu(x) = \int_0^n x^2 \, d\mu(x) < \infty. \]

4. Of course, after having gained information about the $Y_n$ we need to translate these results back to the $X_n$. To this end we will apply the Borel-Cantelli Lemma and show that
\[ \sum_{n=1}^{\infty} P(X_n \ne Y_n) < \infty. \]
This implies that with probability one $X_n \ne Y_n$ for only finitely many $n$. In particular, if we can show that $\frac{1}{n} \sum_{i=1}^{n} Y_i \to \gamma$ $P$-a.s., we can also show that $\frac{1}{n} \sum_{i=1}^{n} X_i \to \gamma$ $P$-a.s.

5. For the purposes of the proof we remark the following: let $\alpha > 1$ and for $n \in \mathbb{N}$ let
\[ k_n := [\alpha^n] \]
denote the largest integer $\le \alpha^n$. This means $k_n \in \mathbb{N}$ and
\[ k_n \le \alpha^n < k_n + 1. \]
Since
\[ \lim_{n \to \infty} \frac{\alpha^n - 1}{\alpha^n} = 1, \]
there is a number $c_\alpha$, $0 < c_\alpha < 1$, such that
\[ k_n > \alpha^n - 1 \ge c_\alpha \alpha^n \quad \text{for all } n \in \mathbb{N}. \tag{9.1} \]
We now turn to the proof.
Proof of Theorem 9.3.

Step 1: Without loss of generality $X_n \ge 0$. Define $Y_n = 1_{\{X_n < n\}} X_n$. Then the $Y_n$ are pairwise independent and square integrable. Define
\[ S'_n := \sum_{i=1}^{n} (Y_i - EY_i). \]
Let $\varepsilon > 0$ and $\alpha > 1$. Using Chebyshev's inequality and the pairwise independence of the random variables $(Y_n)$, we obtain
\[ P\left( \left| \frac{1}{n} S'_n \right| > \varepsilon \right) \le \frac{1}{\varepsilon^2} V\left( \frac{1}{n} S'_n \right) = \frac{1}{\varepsilon^2} \frac{1}{n^2} V(S'_n) = \frac{1}{n^2 \varepsilon^2} \sum_{i=1}^{n} V(Y_i). \]
Observe that $V(Y_i) = E(Y_i^2) - (E(Y_i))^2 \le E(Y_i^2)$. Thus
\[ P\left( \left| \frac{1}{n} S'_n \right| > \varepsilon \right) \le \frac{1}{n^2 \varepsilon^2} \sum_{i=1}^{n} E(Y_i^2). \]
For $k_n = [\alpha^n]$ this gives
\[ P\left( \left| \frac{1}{k_n} S'_{k_n} \right| > \varepsilon \right) \le \frac{1}{\varepsilon^2 k_n^2} \sum_{i=1}^{k_n} E(Y_i^2) \]
for all $n \in \mathbb{N}$. Thus
\[ \sum_{n=1}^{\infty} P\left( \left| \frac{1}{k_n} S'_{k_n} \right| > \varepsilon \right) \le \frac{1}{\varepsilon^2} \sum_{n=1}^{\infty} \sum_{i=1}^{k_n} \frac{1}{k_n^2} E(Y_i^2). \]
By rearranging the order of summation we obtain
\[ \sum_{n=1}^{\infty} P\left( \left| \frac{1}{k_n} S'_{k_n} \right| > \varepsilon \right) \le \frac{1}{\varepsilon^2} \sum_{j=1}^{\infty} t_j E(Y_j^2), \]
where
\[ t_j := \sum_{n=n_j}^{\infty} \frac{1}{k_n^2} \]
and $n_j$ is the smallest $n$ with $k_n \ge j$. From (9.1) we obtain
\[ t_j \le \frac{1}{c_\alpha^2} \sum_{n=n_j}^{\infty} \frac{1}{\alpha^{2n}} = \frac{1}{c_\alpha^2} \, \alpha^{-2n_j} \frac{1}{1 - \alpha^{-2}} = d_\alpha \alpha^{-2n_j}, \]
where $d_\alpha = c_\alpha^{-2} (1 - \alpha^{-2})^{-1} > 0$. This implies
\[ t_j \le d_\alpha j^{-2}. \]
By using the above,
\[ \sum_{n=1}^{\infty} P\left( \left| \frac{1}{k_n} S'_{k_n} \right| > \varepsilon \right) \le \frac{d_\alpha}{\varepsilon^2} \sum_{j=1}^{\infty} \frac{1}{j^2} \sum_{k=1}^{j} \int_{k-1}^{k} x^2 \, d\mu(x). \]
Again rearranging the order of summation yields:
\[ \sum_{j=1}^{\infty} \frac{1}{j^2} \sum_{k=1}^{j} \int_{k-1}^{k} x^2 \, d\mu(x) = \sum_{k=1}^{\infty} \left( \sum_{j=k}^{\infty} \frac{1}{j^2} \right) \int_{k-1}^{k} x^2 \, d\mu(x). \]
Since
\[ \sum_{j=k}^{\infty} \frac{1}{j^2} < \frac{1}{k^2} + \frac{1}{k(k+1)} + \frac{1}{(k+1)(k+2)} + \ldots = \frac{1}{k^2} + \left( \frac{1}{k} - \frac{1}{k+1} \right) + \left( \frac{1}{k+1} - \frac{1}{k+2} \right) + \ldots = \frac{1}{k^2} + \frac{1}{k} \le \frac{2}{k}, \]
this yields
\[ \sum_{n=1}^{\infty} P\left( \left| \frac{1}{k_n} S'_{k_n} \right| > \varepsilon \right) \le \frac{2 d_\alpha}{\varepsilon^2} \sum_{k=1}^{\infty} \int_{k-1}^{k} \frac{x}{k} \, x \, d\mu(x) \le \frac{2 d_\alpha}{\varepsilon^2} \sum_{k=1}^{\infty} \int_{k-1}^{k} x \, d\mu(x) = \frac{2 d_\alpha}{\varepsilon^2} E(X_1) < \infty. \]
Thus by the Borel-Cantelli Lemma
\[ P\left( \left| \frac{1}{k_n} S'_{k_n} \right| > \varepsilon \text{ infinitely often in } n \right) = 0. \]
But this is equivalent to the almost sure convergence
\[ \lim_{n \to \infty} \frac{1}{k_n} S'_{k_n} = 0 \quad P\text{-a.s.} \tag{9.2} \]

Step 2: Next let us see that indeed $\frac{1}{k_n} \sum_{i=1}^{k_n} Y_i$ can only converge to $E(X_1)$. By definition of $Y_n$ we have that
\[ E(Y_n) = \int x \, d\mu_n(x) = \int_0^n x \, d\mu(x). \]
Thus by monotone convergence
\[ E(X_1) = \lim_{n \to \infty} E(Y_n). \]
By Exercise 9.6 below this implies
\[ E(X_1) = \lim_{n \to \infty} \frac{1}{n} (EY_1 + \ldots + EY_n). \tag{9.3} \]
Since by definition of the sums $S'_n$ we have
\[ \frac{1}{k_n} S'_{k_n} = \frac{1}{k_n} \sum_{i=1}^{k_n} Y_i - \frac{1}{k_n} \sum_{i=1}^{k_n} E(Y_i), \]
(9.2) and (9.3) together imply
\[ \lim_{n \to \infty} \frac{1}{k_n} \sum_{i=1}^{k_n} Y_i = \lim_{n \to \infty} \frac{1}{k_n} S'_{k_n} + \lim_{n \to \infty} \frac{1}{k_n} \sum_{i=1}^{k_n} EY_i = EX_1 \quad P\text{-a.s.}, \]
which is what we wanted to show in this step.

Step 3: Now we are aiming at removing the truncation from the $X_n$. Consider the sum
\[ \sum_{n=1}^{\infty} P(X_n \ne Y_n) = \sum_{n=1}^{\infty} P(X_n \ge n). \]
According to Exercise 3.4 this is at most $E(X_1)$, so it is bounded. Therefore
\[ P(X_n \ne Y_n \text{ infinitely often}) = 0. \]
Hence there is a (random) $n_0$ such that with probability one $X_n = Y_n$ for all $n \ge n_0$. But the finitely many differences drop out when averaging, hence
\[ \lim_{n \to \infty} \frac{1}{k_n} S_{k_n} = EX_1 \quad P\text{-a.s.} \]

Step 4: Eventually we show that the theorem holds not only for subsequences $k_n$ chosen as above, but also for the whole sequence.

For fixed $\alpha > 1$, of course, the sequence $(k_n)_n$ is fixed and diverges to $+\infty$. Hence for every $m \in \mathbb{N}$ there exists $n \in \mathbb{N}$ such that
\[ k_n < m \le k_{n+1}. \]
Since we assumed the $X_i$ to be non-negative, this implies
\[ S_{k_n} \le S_m \le S_{k_{n+1}}, \]
hence
\[ \frac{S_{k_n}}{k_n} \cdot \frac{k_n}{m} \le \frac{S_m}{m} \le \frac{S_{k_{n+1}}}{k_{n+1}} \cdot \frac{k_{n+1}}{m}. \]
The definition of $k_n$ yields
\[ k_n \le \alpha^n < k_n + 1 \le m \le k_{n+1} \le \alpha^{n+1}. \]
This gives
\[ \frac{k_{n+1}}{m} < \frac{\alpha^{n+1}}{\alpha^n} = \alpha \quad \text{as well as} \quad \frac{k_n}{m} > \frac{\alpha^n - 1}{\alpha^{n+1}}. \]
Now, given $\alpha$, for all $n \ge n_1 = n_1(\alpha)$ we have $\alpha^n - 1 \ge \alpha^{n-1}$. Hence, if $m \ge k_{n_1}$ and thus $n \ge n_1$, we obtain
\[ \frac{k_n}{m} > \frac{\alpha^n - 1}{\alpha^{n+1}} \ge \frac{\alpha^{n-1}}{\alpha^{n+1}} = \frac{1}{\alpha^2}. \]
Now for each $\alpha$ we have a set $\Omega_\alpha$ with $P(\Omega_\alpha) = 1$ and
\[ \lim_{n \to \infty} \frac{1}{k_n} S_{k_n}(\omega) = EX_1 \quad \text{for all } \omega \in \Omega_\alpha. \]
Without loss of generality we may assume that the $X_i$ are not identically equal to zero $P$-a.s.; otherwise the assertion of the Strong Law of Large Numbers is trivially true. Therefore we may assume without loss of generality that $EX_1 > 0$. Since $\alpha > 1$, we then have
\[ \frac{1}{\alpha} EX_1 < \frac{1}{k_n} S_{k_n}(\omega) < \alpha \, EX_1 \]
for all $\omega \in \Omega_\alpha$ and all $n$ large enough. For such $m$ and $\omega$ this means
\[ \left( \frac{1}{\alpha^3} - 1 \right) EX_1 < \frac{1}{m} S_m(\omega) - EX_1 < \left( \alpha^2 - 1 \right) EX_1. \]
Define
\[ \Omega_1 := \bigcap_{n=1}^{\infty} \Omega_{1 + \frac{1}{n}}. \]
Then $P(\Omega_1) = 1$ and
\[ \lim_{m \to \infty} \frac{1}{m} S_m(\omega) = EX_1 \]
for all $\omega \in \Omega_1$. This proves the theorem.
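The conclusion of the theorem is easy to watch along a single simulated path. A small sketch of my own (assuming Python with numpy), with Exp(1) summands so that $EX_1 = 1$:

    import numpy as np

    rng = np.random.default_rng(6)

    # One realization of i.i.d. Exp(1) variables; the running averages
    # (1/n) S_n settle down to EX_1 = 1 along this single path.
    x = rng.exponential(size=1_000_000)
    running_mean = np.cumsum(x) / np.arange(1, x.size + 1)

    for n in (10, 1_000, 100_000, 1_000_000):
        print(n, running_mean[n - 1])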
Remark 9.4 Theorem 9.3 in particular implies that for i.i.d. sequences of random variables $(X_n)$ with a finite first moment the Strong Law of Large Numbers holds true. Since stochastic convergence is implied by almost sure convergence, Theorem 9.1, the Weak Law of Large Numbers, holds true for such sequences as well. Therefore the finiteness of the second moment is not necessary for Theorem 9.1 to hold for i.i.d. sequences.

Remark 9.5 One might, of course, ask whether a finite first moment is necessary for Theorem 9.3 to hold. Indeed one can prove that if a sequence of i.i.d. random variables is such that $\frac{1}{n} \sum_{i=1}^{n} X_i$ converges almost surely to some random variable $Y$ (necessarily a tail function as in Corollary 8.7!), then $EX_1$ exists and $Y = EX_1$ almost surely. This will not be shown in the context of this course.
Exercise 9.6 Let $(a_m)_m$ be real numbers such that $\lim_{m \to \infty} a_m = a$. Show that this implies the convergence of their Cesàro means:
\[ \lim_{n \to \infty} \frac{1}{n} (a_1 + a_2 + \ldots + a_n) = a. \]

Exercise 9.7 Prove the Strong Law of Large Numbers for a sequence of i.i.d. random variables $(X_n)_n$ with a finite fourth moment, i.e. for random variables with $E(X_1^4) < \infty$. Do not use the statement of Theorem 9.3 explicitly.
Remark 9.8 A very natural question to ask in the context of Theorem 9.3 is: how fast does $\frac{1}{n} S_n$ converge to $EX_1$? I.e., given a sequence of i.i.d. random variables $(X_n)_n$, what is
\[ P\left( \left| \frac{1}{n} \sum X_i - EX_1 \right| \ge \varepsilon \right)? \]
If $X_1$ has a finite logarithmic moment generating function, i.e. if
\[ M(t) := \log E e^{t X_1} < \infty \quad \text{for all } t, \]
the answer is: exponentially fast. Indeed, Cramér's theorem (which cannot be proven in the context of this course) asserts the following. Let $I : \mathbb{R} \to \mathbb{R}$ be given by
\[ I(x) = \sup_{t} \left[ xt - M(t) \right]. \]
Then for every closed set $A \subseteq \mathbb{R}$
\[ \limsup_{n \to \infty} \frac{1}{n} \log P\left( \frac{1}{n} \sum_{i=1}^{n} X_i \in A \right) \le -\inf_{x \in A} I(x), \]
and for every open set $O \subseteq \mathbb{R}$
\[ \liminf_{n \to \infty} \frac{1}{n} \log P\left( \frac{1}{n} \sum_{i=1}^{n} X_i \in O \right) \ge -\inf_{x \in O} I(x). \]
This is called a principle of large deviations for the random variables $\frac{1}{n} \sum_{i=1}^{n} X_i$. In particular, one can show that the function $I$ is convex and non-negative, with
\[ I(x) = 0 \iff x = EX_1. \]
We therefore obtain: for all $\delta > 0$ there is $N$ such that for all $n > N$
\[ P\left( \left| \frac{1}{n} \sum X_i - EX_1 \right| \ge \varepsilon \right) \le e^{-n \min\left( \tilde{I}(EX_1 + \varepsilon), \, \tilde{I}(EX_1 - \varepsilon) \right) + n\delta}, \]
where $\tilde{I}$ is the $I$-function introduced above, evaluated for the random variables $X_i - EX_1$. The speed of convergence is thus exponentially fast.
Example 9.9 For a Bernoulli $B(1, p)$ random variable $X$,
\[ e^{M(t)} = E(e^{tX}) = p e^t + (1 - p); \qquad I(x) = x \log\left( \frac{x}{p} \right) + (1 - x) \log\left( \frac{1 - x}{1 - p} \right). \]
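The exponential decay rate for Bernoulli averages can be probed numerically, although the convergence of $\frac{1}{n} \log P$ is slow. The following is a rough sketch of my own (Python with numpy assumed):

    import numpy as np

    rng = np.random.default_rng(7)
    p, eps = 0.5, 0.1
    x = p + eps

    # Bernoulli rate function I(x) = x log(x/p) + (1-x) log((1-x)/(1-p)).
    I = x * np.log(x / p) + (1 - x) * np.log((1 - x) / (1 - p))

    for n in (50, 100, 200):
        means = rng.binomial(n, p, size=1_000_000) / n
        prob = np.mean(means >= x)
        print(n, np.log(prob) / n)    # tends (slowly) towards -I

    print(-I)                         # about -0.0201 for these parameters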
Exercise 9.10 Determine the functions $M$ and $I$ for a normally distributed random variable.

Exercise 9.11 Argue that if the moment generating function of a random variable $X$ is finite, then all its moments are finite. In particular, both Laws of Large Numbers are applicable to a sequence $X_1, X_2, \ldots$ of i.i.d. random variables distributed like $X$.
At the end of this section we will turn to two applications of the Law of Large Numbers which are interesting in their own right.

The first of these two applications is in number theory. Let $(\Omega, \mathcal{F}, P)$ be given by $\Omega = [0, 1)$, $\mathcal{F} = \mathcal{B}^1_\Omega$, and $P = \lambda^1_\Omega$ (Lebesgue measure). For every number $\omega \in \Omega$ we may consider its $g$-adic representation
\[ \omega = \sum_{n=1}^{\infty} \frac{\omega_n}{g^n}. \tag{9.4} \]
Here $g \ge 2$ is a natural number and $\omega_n \in \{0, \ldots, g - 1\}$. This representation is unique if we require that the digits $\omega_n$ are not eventually all equal to $g - 1$. For each $\ell \in \{0, \ldots, g - 1\}$ let $S^{\ell, g}_n(\omega)$ be the number of all $i \in \{1, \ldots, n\}$ with $\omega_i(\omega) = \ell$ in its $g$-adic representation (9.4). We will call a number $\omega \in [0, 1)$ $g$-normal¹ if
\[ \lim_{n \to \infty} \frac{1}{n} S^{\ell, g}_n(\omega) = \frac{1}{g} \]
for all $\ell = 0, \ldots, g - 1$. Hence $\omega$ is $g$-normal if in the long run all of its digits occur with the same frequency. We will call $\omega$ absolutely normal if $\omega$ is $g$-normal for all $g \in \mathbb{N}$, $g \ge 2$.

¹ The usual meaning of $g$-normality is that each string of digits $\ell_1 \ell_2 \ldots \ell_k$ occurs with frequency $g^{-k}$.

Now for a number $\omega \in [0, 1)$ randomly chosen according to Lebesgue measure, the $\omega_i(\omega)$ are i.i.d. random variables; they have as their distribution the uniform distribution on the set $\{0, \ldots, g - 1\}$. This has to be shown in Exercise 9.13 below and is a consequence of the uniformity of Lebesgue measure. Hence the random variables
\[ X^{\ell, g}_n(\omega) = \begin{cases} 1 & \text{if } \omega_n(\omega) = \ell \\ 0 & \text{otherwise} \end{cases} \]
are i.i.d. random variables for each $g$ and $\ell$. Moreover $S^{\ell, g}_n(\omega) = \sum_{i=1}^{n} X^{\ell, g}_i(\omega)$. According to the Strong Law of Large Numbers (Theorem 9.3)
\[ \frac{1}{n} S^{\ell, g}_n(\omega) \to E(X^{\ell, g}_1) = \frac{1}{g} \quad \lambda^1\text{-a.s.} \]
for all $\ell \in \{0, \ldots, g - 1\}$ and all $g \ge 2$. This means $\lambda^1$-almost every number $\omega$ is $g$-normal, i.e. there is a set $N_g$ with $\lambda^1(N_g) = 0$ such that $\omega$ is $g$-normal for all $\omega \in N_g^c$. Now
\[ N := \bigcup_{g=2}^{\infty} N_g \]
is a set of Lebesgue measure zero as well. This readily implies

Theorem 9.12 (E. Borel) $\lambda^1$-almost every $\omega \in [0, 1)$ is absolutely normal.
It is rather surprising that hardly any normal numbers (in the usual meaning, see the footnote above) are known. Champernowne (1933) showed that
\[ \omega = 0.1234567891011121314\ldots \]
is 10-normal. Whether $\sqrt{2}$, $\log 2$, $e$ or $\pi$ are normal of any kind has not been shown yet. There are no absolutely normal numbers known at all.

Exercise 9.13 Show that for every $g \ge 2$ the random variables $\omega_n(\omega)$ introduced above are i.i.d. random variables that are uniformly distributed on $\{0, \ldots, g - 1\}$.
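The digit-frequency statement is easy to illustrate by simulation, using the fact from Exercise 9.13 that the digits of a Lebesgue-random $\omega$ are i.i.d. uniform. A small sketch of my own (Python with numpy assumed):

    import numpy as np

    rng = np.random.default_rng(8)
    g, n_digits = 10, 1_000_000

    # Draw the first n g-adic digits of a Lebesgue-random omega:
    # i.i.d. uniform on {0, ..., g-1}.  Their empirical frequencies
    # approach 1/g, as in Theorem 9.12.
    digits = rng.integers(0, g, size=n_digits)
    print(np.bincount(digits, minlength=g) / n_digits)   # each entry ~ 0.1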
The second application derives a classical result from analysis which in principle has nothing to do with probability theory, but which is related to the Strong Law of Large Numbers. As may be well known, the approximation theorem of Stone and Weierstraß asserts that every continuous function on $[a, b]$ (more generally, on every compact set) can be approximated uniformly by polynomials. Obviously it suffices to prove this for $[a, b] = [0, 1]$. So let $f \in C([0, 1])$ be a continuous function on $[0, 1]$. Define the $n$-th Bernstein polynomial for $f$ as
\[ B_n f(x) = \sum_{k=0}^{n} \binom{n}{k} f\left( \frac{k}{n} \right) x^k (1 - x)^{n-k}. \]
Theorem 9.14 For each $f \in C([0, 1])$ the polynomials $B_n f$ converge to $f$ uniformly on $[0, 1]$.

Proof. Since $f$ is continuous and $[0, 1]$ is compact, $f$ is uniformly continuous on $[0, 1]$, i.e. for each $\varepsilon > 0$ there exists $\delta > 0$ such that
\[ |x - y| < \delta \implies |f(x) - f(y)| < \varepsilon. \]
Now consider a sequence $(X_n)_n$ of i.i.d. Bernoulli random variables with parameter $p$. Put
\[ \bar{S}_n := \frac{1}{n} S_n := \frac{1}{n} \sum_{i=1}^{n} X_i. \]
Note that $S_n$ is $B(n, p)$-distributed, so that $E(f \circ \bar{S}_n) = \sum_{k=0}^{n} f\left( \frac{k}{n} \right) \binom{n}{k} p^k (1 - p)^{n-k} = B_n f(p)$. By Chebyshev's inequality
\[ P(|\bar{S}_n - p| \ge \delta) \le \frac{1}{\delta^2} V(\bar{S}_n) = \frac{1}{n^2 \delta^2} V(S_n) = \frac{p(1 - p)}{n \delta^2} \le \frac{1}{4 n \delta^2}. \tag{9.5} \]
This yields
\begin{align*}
|B_n f(p) - f(p)| &= |E(f \circ \bar{S}_n) - f(p)| = \left| \int f(x) \, dP_{\bar{S}_n}(x) - f(p) \right| \\
&\le \int_{\{|x - p| < \delta\}} |f(x) - f(p)| \, dP_{\bar{S}_n}(x) + \int_{\{|x - p| \ge \delta\}} |f(x) - f(p)| \, dP_{\bar{S}_n}(x) \\
&\le \varepsilon + 2 \|f\|_\infty P(|\bar{S}_n - p| \ge \delta) \le \varepsilon + \frac{2 \|f\|_\infty}{4 n \delta^2}.
\end{align*}
Here $\|f\|_\infty$ is the sup-norm of $f$. Hence
\[ \sup_{p \in [0, 1]} |B_n f(p) - f(p)| \le \varepsilon + \frac{2 \|f\|_\infty}{4 n \delta^2}. \]
This can be made smaller than $2\varepsilon$ by choosing $n$ large enough.

Notice that the Weak Law of Large Numbers by itself would yield, instead of inequality (9.5), only an inequality of the kind
\[ \forall \varepsilon > 0 \ \exists N \text{ such that } \forall n \ge N : P(|\bar{S}_n - p| \ge \delta) \le \varepsilon. \]
In this approach it is not clear that $N$ can be chosen independently of $p$, so we would only get pointwise convergence.
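Bernstein polynomials are also pleasant to compute. Here is a minimal sketch of my own (assuming Python with numpy), evaluating the defining sum directly and measuring the uniform error on a grid:

    import numpy as np
    from math import comb

    def bernstein(f, n, x):
        # B_n f(x) = sum_k C(n,k) f(k/n) x^k (1-x)^(n-k)
        k = np.arange(n + 1)
        weights = np.array([comb(n, j) for j in k], dtype=float)
        return float((weights * f(k / n) * x**k * (1 - x) ** (n - k)).sum())

    f = lambda t: np.abs(t - 0.4)      # continuous but not differentiable
    xs = np.linspace(0.0, 1.0, 501)

    for n in (10, 50, 200):
        err = max(abs(bernstein(f, n, x) - f(x)) for x in xs)
        print(n, err)                  # the uniform error decreases in n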
10 The Central Limit Theorem

In the previous section we met one of the central theorems of probability theory, the Law of Large Numbers: if $EX_1$ exists, then for a sequence of i.i.d. random variables $(X_n)$ the averages $\frac{1}{n} \sum_{i=1}^{n} X_i$ converge to $EX_1$ (a.s.). The following theorem, called the Central Limit Theorem, analyzes the fine structure in the Law of Large Numbers. Its name is due to Pólya; the proof of the following theorem goes back to Charles Stein.

First of all notice that, in a certain sense, in order to analyze the fine structure of $\sum_{i=1}^{n} X_i$, the scaling $\frac{1}{n}$ of the Weak Law of Large Numbers is already an overscaling. On this scale we just cannot see the shape of the distribution anymore, since by scaling $\sum_{i=1}^{n} X_i$ by a factor $\frac{1}{n}$ we have reduced its variance to the order $\frac{1}{n}$, which converges to zero. What we see in the Law of Large Numbers is a bell-shaped curve with a tiny, tiny width. Here is what we get if we scale the variance to one:

Theorem 10.1 (Central Limit Theorem, CLT) Let $X_1, X_2, \ldots$ be a sequence of independent random variables with identical distribution (the same for all $n$) and $EX_1^2 < \infty$. Then
\[ \lim_{n \to \infty} P\left( \frac{\sum_{i=1}^{n} (X_i - EX_1)}{\sqrt{n \, V X_1}} \le a \right) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{a} e^{-x^2/2} \, dx. \tag{10.1} \]
Before proving the Central Limit Theorem let us remark that it holds under weaker assumptions as well.

Remark 10.2 Indeed, the Central Limit Theorem also holds under the following weaker assumption. Assume given, for $n = 1, 2, \ldots$, an independent family of random variables $X_{ni}$, $i = 1, \ldots, n$. For $j = 1, \ldots, n$ let $\mu_{nj} = E X_{nj}$ and
\[ s_n := \sqrt{\sum_{i=1}^{n} V X_{ni}}. \]
The sequence $\left( (X_{ni})_{i=1}^{n} \right)_n$ is said to satisfy the Lindeberg condition if
\[ L_n(\varepsilon) \to 0 \quad \text{as } n \to \infty \]
for all $\varepsilon > 0$. Here
\[ L_n(\varepsilon) = \frac{1}{s_n^2} \sum_{j=1}^{n} E\left[ (X_{nj} - \mu_{nj})^2 ; \ |X_{nj} - \mu_{nj}| \ge \varepsilon s_n \right]. \]
Intuitively speaking, the Lindeberg condition asks that none of the variables dominates the whole sum.

The generalized form of the CLT stated above now asserts that if the sequence $(X_{ni})$ satisfies the Lindeberg condition, it also satisfies the CLT, i.e.
\[ \lim_{n \to \infty} P\left( \frac{\sum_{i=1}^{n} (X_{ni} - \mu_{ni})}{\sqrt{\sum_{i=1}^{n} V X_{ni}}} \le a \right) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{a} e^{-x^2/2} \, dx. \]
The proof of this more general theorem basically mimics the proof we will give below for Theorem 10.1. We spare ourselves the additional technical work.
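Before going into the proof, the statement (10.1) itself can be checked empirically. A quick sketch of my own (Python with numpy assumed), comparing the empirical distribution function of standardized sums with the Gaussian one:

    import numpy as np
    from math import erf, sqrt

    rng = np.random.default_rng(9)
    n, trials = 100, 100_000

    # Standardized sums of i.i.d. Exp(1) variables (EX_1 = VX_1 = 1):
    # by (10.1) their distribution function approaches that of N(0,1).
    z = (rng.exponential(size=(trials, n)).sum(axis=1) - n) / sqrt(n)

    for a in (-1.0, 0.0, 1.0):
        gauss = 0.5 * (1.0 + erf(a / sqrt(2.0)))   # N(0,1) cdf at a
        print(a, np.mean(z <= a), gauss)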
We will present a proof of the CLT that goes back to Charles Stein. It is based on a couple of facts.

Fact 1: It suffices to prove the CLT for i.i.d. random variables with $EX_1 = 0$. Otherwise one just subtracts $EX_1$ from each of the $X_i$.

Fact 2: Define
\[ S_n := \sum_{i=1}^{n} X_i \quad \text{and} \quad \sigma^2 := V(X_1). \]
Theorem 10.1 asserts the convergence in distribution of the (normalized) $S_n$ to a Gaussian random variable. What we need to show is thus
\[ E\left[ f\left( \frac{S_n}{\sqrt{n \sigma^2}} \right) \right] \to \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} f(x) e^{-x^2/2} \, dx = E[f(Y)] \tag{10.2} \]
as $n \to \infty$ for all $f : \mathbb{R} \to \mathbb{R}$ that are uniformly continuous and bounded. Here $Y$ is a standard normal random variable, i.e. it is $\mathcal{N}(0, 1)$-distributed.

We prepare the proof of the CLT with two lemmata.
Lemma 10.3 Let $f : \mathbb{R} \to \mathbb{R}$ be bounded and uniformly continuous. Define
\[ \mathcal{N}(f) := \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} f(y) e^{-y^2/2} \, dy \]
and
\[ g(x) := e^{x^2/2} \int_{-\infty}^{x} (f(y) - \mathcal{N}(f)) e^{-y^2/2} \, dy. \]
Then $g$ fulfills
\[ g'(x) - x g(x) = f(x) - \mathcal{N}(f). \tag{10.3} \]

Proof. Differentiating $g$ gives
\[ g'(x) = x e^{x^2/2} \int_{-\infty}^{x} (f(y) - \mathcal{N}(f)) e^{-y^2/2} \, dy + e^{x^2/2} (f(x) - \mathcal{N}(f)) e^{-x^2/2} = x g(x) + f(x) - \mathcal{N}(f). \]
The importance of the above lemma becomes obvious if we substitute a random variable $X$ into (10.3) and take expectations:
\[ E[g'(X) - X g(X)] = E[f(X) - \mathcal{N}(f)]. \]
If $X \sim \mathcal{N}(0, 1)$ is standard normal, the right-hand side is zero and so is the left-hand side. The idea is thus that, instead of showing that
\[ E[f(U_n) - \mathcal{N}(f)] \]
converges to zero, we may show the same for
\[ E[g'(U_n) - U_n g(U_n)]. \]
The next step discusses the function $g$ introduced above.
Lemma 10.4 Let $f : \mathbb R \to \mathbb R$ be bounded and uniformly continuous and let $g$ be the solution of

$$g'(x) - x\, g(x) = f(x) - \mathcal N(f). \qquad (10.4)$$

Then $g(x)$, $x g(x)$ and $g'(x)$ are bounded and continuous.

Proof. Obviously $g$ is even differentiable, hence continuous. But then also $x g(x)$ is continuous. Eventually

$$g'(x) = x\, g(x) + f(x) - \mathcal N(f)$$

is continuous as the sum of continuous functions. For the boundedness part first note that any continuous function on a compact set is bounded, hence we only need to check that the functions $g$, $xg$, and $g'$ are bounded as $x \to \pm\infty$.

To this end first note that

$$g(x) = e^{x^2/2} \int_{-\infty}^x \left( f(y) - \mathcal N(f) \right) e^{-y^2/2}\, dy = -\, e^{x^2/2} \int_x^{\infty} \left( f(y) - \mathcal N(f) \right) e^{-y^2/2}\, dy$$

(this is true since the whole integral over $\mathbb R$ must equal zero).

For $x \le 0$ we have

$$|g(x)| \le \sup_{y \le 0} |f(y) - \mathcal N(f)|\ e^{x^2/2} \int_{-\infty}^x e^{-y^2/2}\, dy,$$

while for $x \ge 0$ we have

$$|g(x)| \le \sup_{y \ge 0} |f(y) - \mathcal N(f)|\ e^{x^2/2} \int_x^{\infty} e^{-y^2/2}\, dy.$$

Now for $x < 0$, since $|y|/|x| \ge 1$ for $y \le x$,

$$e^{x^2/2} \int_{-\infty}^x e^{-y^2/2}\, dy \le e^{x^2/2} \int_{-\infty}^x \frac{|y|}{|x|}\, e^{-y^2/2}\, dy = \frac{1}{|x|},$$

and similarly for $x > 0$

$$e^{x^2/2} \int_x^{\infty} e^{-y^2/2}\, dy \le e^{x^2/2} \int_x^{\infty} \frac{y}{x}\, e^{-y^2/2}\, dy = \frac{1}{|x|}. \qquad (10.5)$$

Thus we see that for $x \le -1$

$$|g(x)| \le |x\, g(x)| \le \sup_{y \le 0} |f(y) - \mathcal N(f)|,$$

as well as for $x \ge 1$

$$|g(x)| \le |x\, g(x)| \le \sup_{y \ge 0} |f(y) - \mathcal N(f)|.$$

Hence $g$ and $xg$ are bounded. But then also $g'$ is bounded, since

$$g'(x) = x\, g(x) + f(x) - \mathcal N(f).$$
Now we turn to proving the CLT.

Proof of Theorem 10.1. Besides assuming that $EX_i = 0$ for all $i$ we may also assume that $\sigma^2 = \mathbb V X_1 = 1$; otherwise we just replace $X_i$ by $X_i/\sqrt{\sigma^2}$. We write $S_n := \sum_{i=1}^n X_i$ and recall that in order to prove the assertion it suffices to show that for all $f$ bounded and uniformly continuous

$$E\left[ g'\left(\frac{S_n}{\sqrt n}\right) - \frac{S_n}{\sqrt n}\, g\left(\frac{S_n}{\sqrt n}\right) \right] \to 0$$

as $n \to \infty$. Here $g$ is defined as above.

But using the identity

$$\int_0^1 \frac{X_j}{\sqrt n} \left[ g'\left( \frac{S_n}{\sqrt n} - (1-s)\frac{X_j}{\sqrt n} \right) - g'\left( \frac{S_n - X_j}{\sqrt n} \right) \right] ds = g\left(\frac{S_n}{\sqrt n}\right) - g\left(\frac{S_n - X_j}{\sqrt n}\right) - \frac{X_j}{\sqrt n}\, g'\left(\frac{S_n - X_j}{\sqrt n}\right),$$

which is to be proven in Exercise 10.5 below, we arrive at
$$\begin{aligned}
E\left[ g'\left(\frac{S_n}{\sqrt n}\right) - \frac{S_n}{\sqrt n}\, g\left(\frac{S_n}{\sqrt n}\right) \right]
&= \sum_{j=1}^n E\left[ \frac 1n\, g'\left(\frac{S_n}{\sqrt n}\right) - \frac{X_j}{\sqrt n}\, g\left(\frac{S_n}{\sqrt n}\right) \right] \\
&= \sum_{j=1}^n E\left[ \frac 1n\, g'\left(\frac{S_n}{\sqrt n}\right) - \frac{X_j}{\sqrt n}\, g\left(\frac{S_n - X_j}{\sqrt n}\right) - \frac{X_j^2}{n}\, g'\left(\frac{S_n - X_j}{\sqrt n}\right) \right. \\
&\qquad\qquad \left. -\ \frac{X_j^2}{n} \int_0^1 \left( g'\left(\frac{S_n}{\sqrt n} - (1-s)\frac{X_j}{\sqrt n}\right) - g'\left(\frac{S_n - X_j}{\sqrt n}\right) \right) ds \right] \\
&= \sum_{j=1}^n E\left[ \frac 1n\, g'\left(\frac{S_n}{\sqrt n}\right) - \frac 1n\, g'\left(\frac{S_n - X_j}{\sqrt n}\right) - \frac{X_j^2}{n} \int_0^1 \left( g'\left(\frac{S_n}{\sqrt n} - (1-s)\frac{X_j}{\sqrt n}\right) - g'\left(\frac{S_n - X_j}{\sqrt n}\right) \right) ds \right].
\end{aligned}$$

In the last step we used linearity of expectation together with $EX_i = 0$ for all $i$, as well as independence of $X_i$ and $S_n - X_i$ for all $i$, together with $EX_i^2 = 1$. Let us define

$$\Delta_j := E\left[ \frac 1n\, g'\left(\frac{S_n}{\sqrt n}\right) - \frac 1n\, g'\left(\frac{S_n - X_j}{\sqrt n}\right) - \frac{X_j^2}{n} \int_0^1 \left( g'\left(\frac{S_n}{\sqrt n} - (1-s)\frac{X_j}{\sqrt n}\right) - g'\left(\frac{S_n - X_j}{\sqrt n}\right) \right) ds \right],$$

and let us write $T_j$ for the random variable inside this expectation, so that $\Delta_j = E[T_j]$.
The idea will now be that the continuous function $g'$ is uniformly continuous on every compact set. So, if $X_j/\sqrt n$ is small, so are $g'\left(\frac{S_n}{\sqrt n}\right) - g'\left(\frac{S_n - X_j}{\sqrt n}\right)$ and $g'\left(\frac{S_n}{\sqrt n} - (1-s)\frac{X_j}{\sqrt n}\right) - g'\left(\frac{S_n - X_j}{\sqrt n}\right)$, as long as $S_n/\sqrt n$ is inside the chosen compact set. On the other hand, the probabilities that $S_n/\sqrt n$ is outside a chosen large compact set or that $X_j/\sqrt n$ is large are very small. This together with the boundedness of $g$ and $g'$ will basically yield the proof. For $K > 0$, $\delta > 0$ we write

$$\Delta^1_j := E\left[ T_j\, \mathbf 1_{\{|X_j/\sqrt n| \le \delta\}}\, \mathbf 1_{\{|S_n/\sqrt n| \le K\}} \right], \quad
\Delta^2_j := E\left[ T_j\, \mathbf 1_{\{|X_j/\sqrt n| \le \delta\}}\, \mathbf 1_{\{|S_n/\sqrt n| > K\}} \right], \quad
\Delta^3_j := E\left[ T_j\, \mathbf 1_{\{|X_j/\sqrt n| > \delta\}} \right].$$

Hence

$$\sum_{j=1}^n \Delta_j = \sum_{j=1}^n \left( \Delta^1_j + \Delta^2_j + \Delta^3_j \right) = \sum_{j=1}^n \Delta^1_j + \sum_{j=1}^n \Delta^2_j + \sum_{j=1}^n \Delta^3_j.$$
We first consider the $\Delta^2_j$-terms. By Chebyshev's inequality

$$P\left( \left|\frac{S_n}{\sqrt n}\right| > K \right) \le \frac{\mathbb V(S_n/\sqrt n)}{K^2} = \frac{1}{K^2}
\quad\text{and}\quad
P\left( \left|\frac{S_{n-1}}{\sqrt n}\right| > K - \delta \right) \le \frac{\mathbb V(S_{n-1}/\sqrt n)}{(K-\delta)^2} = \frac{n-1}{n (K-\delta)^2}.$$

Hence for given $\varepsilon > 0$ we can find $K$ so large that

$$P\left( \left|\frac{S_n}{\sqrt n}\right| > K \right) \le \varepsilon \quad\text{and}\quad P\left( \left|\frac{S_{n-1}}{\sqrt n}\right| > K - \delta \right) \le \varepsilon.$$

Since $g'$ is bounded by $\|g'\| := \sup_{x\in\mathbb R} |g'(x)|$, the first two summands of $T_j$ are bounded in absolute value by $\frac 1n\, 2\|g'\|$ and the integral term by $\frac{X_j^2}{n}\, 2\|g'\|$. Moreover, on the event $\{|X_j/\sqrt n| \le \delta\} \cap \{|S_n/\sqrt n| > K\}$ we have $|(S_n - X_j)/\sqrt n| > K - \delta$. We obtain:

$$\begin{aligned}
\sum_{j=1}^n |\Delta^2_j|
&\le \sum_{j=1}^n E\left[ \frac 1n\, 2\|g'\|\, \mathbf 1_{\{|S_n/\sqrt n| > K\}} \right] + \frac{2\|g'\|}{n} \sum_{j=1}^n E\left[ X_j^2\, \mathbf 1_{\{|X_j/\sqrt n| \le \delta\}}\, \mathbf 1_{\{|(S_n - X_j)/\sqrt n| > K-\delta\}} \right] \\
&\le 2\|g'\|\, P\left( \left|\frac{S_n}{\sqrt n}\right| > K \right) + \frac{2\|g'\|}{n} \sum_{j=1}^n E\left[ X_j^2 \right] P\left( \left|\frac{S_{n-1}}{\sqrt n}\right| > K - \delta \right) \\
&\le 2\|g'\|\varepsilon + 2\|g'\|\varepsilon = 4\|g'\|\varepsilon,
\end{aligned}$$

where we used the independence of $X_j$ and $S_n - X_j$ and the fact that $(S_n - X_j)/\sqrt n$ has the same distribution as $S_{n-1}/\sqrt n$.
For the $\Delta^1_j$-terms observe that for every fixed $K > 0$ the continuous function $g'$ is uniformly continuous on $[-K-1, K+1]$. This means that given $\varepsilon > 0$, there is $\delta > 0$ (we may take $\delta \le 1$) such that

$$|x - y| \le \delta \implies |g'(x) - g'(y)| \le \varepsilon$$

for $x, y \in [-K-1, K+1]$. For given $\varepsilon > 0$ we choose such a $\delta$, and $K$ as in the first step. On the event $\{|X_j/\sqrt n| \le \delta\} \cap \{|S_n/\sqrt n| \le K\}$ all the points $S_n/\sqrt n$, $(S_n - X_j)/\sqrt n$ and $S_n/\sqrt n - (1-s)X_j/\sqrt n$ ($0 \le s \le 1$) lie in $[-K-1, K+1]$ and differ from each other by at most $\delta$. Hence both the difference of the first two summands of $T_j$ and the integrand of the third are bounded by $\varepsilon$ there, so that

$$\sum_{j=1}^n |\Delta^1_j| \le \sum_{j=1}^n \left( \frac 1n\, \varepsilon + \frac{EX_j^2}{n}\, \varepsilon \right) = n \cdot \frac{2\varepsilon}{n} = 2\varepsilon.$$
Eventually we turn to the $\Delta^3_j$-terms. Since $EX_1^2 < \infty$, there exists an $n_0$ such that for a given $\varepsilon > 0$, all $n \ge n_0$ and $\delta$ as above we have

$$E\left[ X_1^2\, \mathbf 1_{\{|X_1/\sqrt n| > \delta\}} \right] < \varepsilon.$$

Bounding the summands of $T_j$ as before, this implies

$$\begin{aligned}
\sum_{j=1}^n |\Delta^3_j|
&\le \sum_{j=1}^n \left( \frac 2n\, \|g'\|\, E\left[ \mathbf 1_{\{|X_j/\sqrt n| > \delta\}} \right] + \frac 2n\, \|g'\|\, E\left[ X_j^2\, \mathbf 1_{\{|X_j/\sqrt n| > \delta\}} \right] \right) \\
&= 2\|g'\|\, P\left( \left|\frac{X_1}{\sqrt n}\right| > \delta \right) + 2\|g'\|\, E\left[ X_1^2\, \mathbf 1_{\{|X_1/\sqrt n| > \delta\}} \right] \le 4\|g'\|\varepsilon,
\end{aligned}$$

where in the last step we also used that, by Chebyshev's inequality, $P(|X_1/\sqrt n| > \delta) \le 1/(\delta^2 n) \le \varepsilon$ for $n$ sufficiently large.
Hence for a given $\varepsilon > 0$, with the choice of $\delta > 0$ and $K$ as above, we obtain for all sufficiently large $n$

$$\left| E\left[ g'\left(\frac{S_n}{\sqrt n}\right) - \frac{S_n}{\sqrt n}\, g\left(\frac{S_n}{\sqrt n}\right) \right] \right| = \left| \sum_{j=1}^n \Delta^1_j + \sum_{j=1}^n \Delta^2_j + \sum_{j=1}^n \Delta^3_j \right| \le 2\varepsilon + 8\|g'\|\varepsilon.$$

This can be made arbitrarily small by letting $\varepsilon \to 0$. This proves the theorem.
Exercise 10.5 Let $X_1, \dots, X_n$ be i.i.d. random variables and $S_n = \sum_{i=1}^n X_i$. Let $g : \mathbb R \to \mathbb R$ be a continuously differentiable function. Show that for all $j$

$$\int_0^1 \left[ g'\big(S_n - (1-s)X_j\big) - g'\big(S_n - X_j\big) \right] X_j\, ds = g(S_n) - g(S_n - X_j) - X_j\, g'(S_n - X_j).$$
We conclude the section with an informal discussion of two extensions of the Central Limit Theorem. The first is of practical importance, the second of more theoretical interest.

When one tries to apply the Central Limit Theorem, e.g. for a sequence of i.i.d. random variables, it is of course not only important to know that

$$\bar X_n := \frac{\sum_{i=1}^n (X_i - EX_1)}{\sqrt{n\, \mathbb V X_1}}$$

converges (in distribution) to a random variable $Z \sim \mathcal N(0, 1)$. One also needs to know how close the distributions of $\bar X_n$ and $Z$ are. This is stated in the following theorem due to Berry and Esseen:
Theorem 10.6 (Berry-Esseen) Let $(X_i)_{i\in\mathbb N}$ be a sequence of i.i.d. random variables with $E(|X_1|^3) < \infty$. Then for a $\mathcal N(0,1)$-distributed random variable $Z$ it holds:

$$\sup_{a\in\mathbb R} \left| P\left( \frac{\sum_{i=1}^n (X_i - EX_1)}{\sqrt{n\, \mathbb V X_1}} \le a \right) - P(Z \le a) \right| \le \frac{C}{\sqrt n}\, \frac{E\left( |X_1 - EX_1|^3 \right)}{(\mathbb V X_1)^{3/2}}.$$

The numerical value of the universal constant $C$ is below 6 and larger than 0.4; the latter bound is rather easy to prove.
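The rate $1/\sqrt n$ can be observed numerically. The following sketch (again only an added illustration with arbitrary parameters) estimates the Kolmogorov distance $\sup_a |P(\bar X_n \le a) - P(Z \le a)|$ for Rademacher variables, for which $E|X_1|^3 = \mathbb V X_1 = 1$; multiplied by $\sqrt n$ it stays bounded, as the theorem predicts:

```python
import math
import numpy as np

def Phi(a):
    # Standard normal distribution function.
    return 0.5 * (1.0 + math.erf(a / math.sqrt(2.0)))

rng = np.random.default_rng(1)
m = 100000                               # simulated sums per n (illustrative)

for n in (10, 40, 160):
    # Rademacher summands, normalized as in the theorem.
    X = 2 * rng.integers(0, 2, size=(m, n), dtype=np.int8) - 1
    S = np.sort(X.sum(axis=1) / math.sqrt(n))
    grid = np.linspace(-3.0, 3.0, 601)
    emp = np.searchsorted(S, grid, side="right") / m
    dist = max(abs(e - Phi(a)) for e, a in zip(emp, grid))
    print(n, round(dist, 4), round(dist * math.sqrt(n), 3))
```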
The second extension of the Central Limit Theorem starts with the following observation: Let $X_1, X_2, \dots$ be a sequence of i.i.d. random variables with finite variance and expectation zero. Then the law of large numbers says that $\frac 1n \sum_{i=1}^n X_i$ converges to $EX_1 = 0$ in probability and almost surely. But it tells nothing about the size of the fluctuations. This is considered in greater detail by the Central Limit Theorem. The latter describes the asymptotic probabilities

$$P\left( \frac{\sum_{i=1}^n (X_i - EX_1)}{\sqrt{n\, \mathbb V(X_1)}} \ge a \right).$$

Since these probabilities are positive for all $a \in \mathbb R$ according to the Central Limit Theorem, it can be shown that the fluctuations of $\sum_{i=1}^n X_i$ are larger than $\sqrt n$; more precisely, for each positive $a \in \mathbb R$ it holds with probability one that

$$\limsup_{n\to\infty} \frac{\sum_{i=1}^n X_i}{\sqrt n} \ge a.$$

The question for the precise size of the fluctuations, i.e. for the right scaling $(a_n)$ such that

$$\limsup_{n\to\infty} \frac{\sum_{i=1}^n X_i}{a_n}$$

is almost surely finite, is answered by the law of the iterated logarithm:
Theorem 10.7 (Law of the Iterated Logarithm, Hartman-Wintner) Let $(X_i)_{i\in\mathbb N}$ be a sequence of i.i.d. random variables with $EX_1 = 0$ and $\sigma^2 := \mathbb V X_1 < \infty$ ($\sigma > 0$). Then for $S_n := \sum_{i=1}^n X_i$ it holds

$$\limsup_{n\to\infty} \frac{S_n}{\sqrt{2n \log\log n}} = +\sigma \quad P\text{-a.s.}$$

and

$$\liminf_{n\to\infty} \frac{S_n}{\sqrt{2n \log\log n}} = -\sigma \quad P\text{-a.s.}$$
Due to the restricted time we will not be able to prove the Law of the Iterated Logarithm in the context of this course. Despite its theoretical interest its practical relevance is rather limited. To understand why, notice that the correction to the scaling $\sqrt{n\, \mathbb V X_1}$ from the Central Limit Theorem to the Law of the Iterated Logarithm is of order $\sqrt{\log\log n}$. Even for a fantastically large number of observations, $10^{100}$ (which is more than one observation per atom in the universe), $\sqrt{\log\log n}$ is really small, e.g.

$$\sqrt{\log\log 10^{100}} = \sqrt{\log(100 \log 10)} = \sqrt{\log 100 + \log\log 10} \approx \sqrt{5.44} \approx 2.33.$$
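(The arithmetic can be checked in one line; this snippet is of course only an added illustration.)

```python
import math

loglog = math.log(100 * math.log(10))   # log log 10^100, natural logarithms
print(loglog, math.sqrt(loglog))        # approx. 5.44 and 2.33
```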
11 Conditional Expectation

To understand the concept of conditional expectation, we will start with a little example.

Example 11.1 Let $\Omega$ be a finite population and let the random variable $X(\omega)$ denote the income of person $\omega$. So, if we are only interested in income, $X$ contains the full information of our experiment. Now assume we are a sociologist and want to measure the influence of a person's religion on his income. Then we are not interested in the full information given by $X$, but only in how $X$ behaves on each of the sets

$$\{\text{catholic}\}, \{\text{protestant}\}, \{\text{islamic}\}, \{\text{jewish}\}, \{\text{atheist}\},$$

etc. This leads to the concept of conditional expectation.

The basic idea of conditional expectation will be, given a random variable

$$X : (\Omega, \mathcal F) \to \mathbb R$$

and a sub-$\sigma$-algebra $\mathcal A$ of $\mathcal F$, to introduce a new random variable $E[X \mid \mathcal A] =: X_0$ such that $X_0$ is $\mathcal A$-measurable and

$$\int_C X_0\, dP = \int_C X\, dP$$

for all $C \in \mathcal A$. So $X_0$ contains all information necessary when we only consider events in $\mathcal A$. First we need to see that such an $X_0$ can be found in a unique way.
Theorem 11.2 Let $(\Omega, \mathcal F, P)$ be a probability space and $X$ an integrable random variable. Let $\mathcal G \subseteq \mathcal F$ be a sub-$\sigma$-algebra. Then (up to $P$-a.s. equality) there is a unique random variable $X_0$ which is $\mathcal G$-measurable and satisfies

$$\int_C X_0\, dP = \int_C X\, dP \quad\text{for all } C \in \mathcal G. \qquad (11.1)$$

If $X \ge 0$, then $X_0 \ge 0$ $P$-a.s.
Proof. First we treat the case $X \ge 0$. Denote $P_0 := P|_{\mathcal G}$ and $Q := (XP)|_{\mathcal G}$. Both $P_0$ and $Q$ are measures on $\mathcal G$; $P_0$ even is a probability measure. By definition

$$Q(C) = \int_C X\, dP.$$

Hence $Q(C) = 0$ for all $C$ with $P(C) = 0 = P_0(C)$. Hence $Q \ll P_0$. By the theorem of Radon-Nikodym there is a $\mathcal G$-measurable function $X_0 \ge 0$ on $\Omega$ such that $Q = X_0 P_0$. Thus

$$\int_C X_0\, dP_0 = \int_C X\, dP \quad\text{for all } C \in \mathcal G.$$

Hence

$$\int_C X_0\, dP = \int_C X\, dP \quad\text{for all } C \in \mathcal G.$$

Hence $X_0$ satisfies (11.1). For $\tilde X_0$ that is $\mathcal G$-measurable and satisfies (11.1), the set $C = \{\tilde X_0 < X_0\}$ is in $\mathcal G$ and $\int_C \tilde X_0\, dP = \int_C X_0\, dP$, whence $P(C) = 0$. In the same way $P(\{\tilde X_0 > X_0\}) = 0$. Therefore $\tilde X_0$ is $P$-a.s. equal to $X_0$.

The proof for arbitrary, integrable $X$ is left to the reader.
Exercise 11.3 Prove Theorem 11.2 for arbitrary, integrable $X : \Omega \to \mathbb R$.

Definition 11.4 Under the conditions of Theorem 11.2 the random variable $X_0$ (which is $P$-a.s. unique) is called the conditional expectation of $X$ given $\mathcal G$. It is denoted by

$$X_0 =: E[X \mid \mathcal G] =: E^{\mathcal G}[X].$$

If $\mathcal G$ is generated by a sequence of random variables $(Y_i)_{i\in I}$, i.e. $\mathcal G = \sigma(Y_i, i \in I)$, we write

$$E\left[ X \mid (Y_i)_{i\in I} \right] = E[X \mid \mathcal G].$$

If $I = \{1, \dots, n\}$ we also write $E[X \mid Y_1, \dots, Y_n]$.

Note that, in order to check whether $Y$ ($Y$ $\mathcal G$-measurable) is a conditional expectation of $X$ given the sub-$\sigma$-algebra $\mathcal G$, we need to check

$$\int_C Y\, dP = \int_C X\, dP$$

for all $C \in \mathcal G$. This determines $E[X \mid \mathcal G]$ only $P$-a.s. We therefore also speak about different versions of the conditional expectation.
Example 11.5 1. If $\mathcal G = \{\emptyset, \Omega\}$, then the constant random variable $EX$ is a version of $E[X \mid \mathcal G]$. Indeed, if $C = \emptyset$, then any variable does the job. If $C = \Omega$:

$$\int_\Omega X\, dP = EX = \int_\Omega EX\, dP.$$

2. If $\mathcal G$ is generated by the family $(B_i)_{i\in I}$ of mutually disjoint sets (i.e. $B_i \cap B_j = \emptyset$ if $i \ne j$), where $I$ is countable and $B_i \in \mathcal A$ (the original space being $(\Omega, \mathcal A, P)$) and $P(B_i) > 0$, then

$$E[X \mid \mathcal G] = \sum_{i\in I} \frac{1}{P(B_i)}\, \mathbf 1_{B_i} \int_{B_i} X\, dP \quad P\text{-a.s.},$$

as will be checked in the following exercise.
Exercise 11.6 Show that the assertion of Example 11.5.2. is true.
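On a finite probability space the formula of Example 11.5.2 can be implemented directly. The following Python sketch (an added illustration; the space, the measure and the partition below are arbitrary choices of mine) computes this version of $E[X \mid \mathcal G]$ and verifies the defining property (11.1) on the generating sets:

```python
import numpy as np

rng = np.random.default_rng(2)
N = 12                                        # a finite Omega = {0, ..., N-1}
P = rng.random(N); P /= P.sum()               # a probability measure on Omega
X = rng.normal(size=N)                        # a random variable X
labels = np.arange(N) % 3                     # partition into B_0, B_1, B_2

# E[X | G] = sum_i 1_{B_i} * (int_{B_i} X dP) / P(B_i)
X0 = np.empty(N)
for i in np.unique(labels):
    B = labels == i
    X0[B] = (X[B] * P[B]).sum() / P[B].sum()

# Check (11.1): the integrals of X0 and X agree on every B_i.
for i in np.unique(labels):
    B = labels == i
    assert np.isclose((X0[B] * P[B]).sum(), (X[B] * P[B]).sum())
print(X0)
```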
Exercise 11.7 Show that the following assertions for the conditional expectation $E[X \mid \mathcal G]$ of random variables $X, Y : (\Omega, \mathcal A) \to (\mathbb R, \mathcal B^1)$ ($\mathcal G \subseteq \mathcal A$) are true:

1. $E[E[X \mid \mathcal G]] = EX$.
2. If $X$ is $\mathcal G$-measurable, then $E[X \mid \mathcal G] = X$ $P$-a.s.
3. If $X = Y$ $P$-a.s., then $E[X \mid \mathcal G] = E[Y \mid \mathcal G]$ $P$-a.s.
4. If $X \equiv \alpha$, then $E[X \mid \mathcal G] = \alpha$ $P$-a.s.
5. $E[\alpha X + \beta Y \mid \mathcal G] = \alpha E[X \mid \mathcal G] + \beta E[Y \mid \mathcal G]$ $P$-a.s. Here $\alpha, \beta \in \mathbb R$.
6. $X \le Y$ $P$-a.s. implies $E[X \mid \mathcal G] \le E[Y \mid \mathcal G]$ $P$-a.s.
The following theorems have proofs that are almost identical with the proofs of the corresponding theorems for expectations:

Theorem 11.8 (monotone convergence) Let $(X_n)$ be an increasing sequence of positive random variables with $X = \sup X_n$, $X$ integrable. Then

$$\sup_n E[X_n \mid \mathcal G] = \lim_{n\to\infty} E[X_n \mid \mathcal G] = E\left[ \lim_{n\to\infty} X_n \mid \mathcal G \right] = E[X \mid \mathcal G] \quad P\text{-a.s.}$$

Theorem 11.9 (dominated convergence) Let $(X_n)$ be a sequence of random variables converging pointwise to an (integrable) random variable $X$, such that there is an integrable random variable $Y$ with $Y \ge |X_n|$. Then

$$\lim_{n\to\infty} E[X_n \mid \mathcal G] = E[X \mid \mathcal G] \quad P\text{-a.s.}$$
Also Jensen's inequality has a generalization to conditional expectations:

Theorem 11.10 (Jensen's inequality) Let $X$ be an integrable random variable taking values in an open interval $I \subseteq \mathbb R$ and let $q : I \to \mathbb R$ be a convex function. Then for each $\mathcal G \subseteq \mathcal A$ the conditional expectation $E[X \mid \mathcal G]$ takes values in $I$ ($P$-a.s.) and

$$q\left( E[X \mid \mathcal G] \right) \le E[q \circ X \mid \mathcal G] \quad P\text{-a.s.}$$

An immediate consequence of Theorem 11.10 is the following (for $p \ge 1$):

$$|E[X \mid \mathcal G]|^p \le E[|X|^p \mid \mathcal G],$$

which implies

$$E\left( |E[X \mid \mathcal G]|^p \right) \le E\left( |X|^p \right).$$

Denoting by

$$N_p(f) = \left( \int |f|^p\, dP \right)^{1/p},$$

this means

$$N_p\left( E[X \mid \mathcal G] \right) \le N_p(X), \quad X \in L^p(P).$$

This holds for $1 \le p < \infty$. $N_p(f)$ is called the $L^p$-norm of $f$. The case $p = \infty$, which means that if $X$ is bounded $P$-a.s. by some $M \ge 0$ then so is $E[X \mid \mathcal G]$, follows from Exercise 11.7.
We slightly reformulate the definition of conditional expectation to discuss its further properties.

Lemma 11.11 Let $X$ be a positive integrable random variable and let $X_0 : (\Omega, \mathcal A) \to (\mathbb R, \mathcal B^1)$ be a positive, $\mathcal G$-measurable, integrable random variable that is a version of $E[X \mid \mathcal G]$. Then

$$\int Z X_0\, dP = \int Z X\, dP \qquad (11.2)$$

for all $\mathcal G$-measurable, positive random variables $Z$.

Proof. From (11.1) we obtain (11.2) for step functions $Z$. The general result follows from monotone convergence.
We are now prepared to show a number of properties of conditional expectations which we will call smoothing properties.

Theorem 11.12 (Smoothing properties of conditional expectations) Let $(\Omega, \mathcal F, P)$ be a probability space and $X \in L^p(P)$, $Y \in L^q(P)$, $1 \le p \le \infty$, $\frac 1p + \frac 1q = 1$.

1. If $\mathcal G \subseteq \mathcal F$ and $X$ is $\mathcal G$-measurable, then

$$E[XY \mid \mathcal G] = X\, E[Y \mid \mathcal G]. \qquad (11.3)$$

2. If $\mathcal G_1, \mathcal G_2 \subseteq \mathcal F$ with $\mathcal G_1 \subseteq \mathcal G_2$, then

$$E[E[X \mid \mathcal G_2] \mid \mathcal G_1] = E[E[X \mid \mathcal G_1] \mid \mathcal G_2] = E[X \mid \mathcal G_1].$$

Proof.

1. First assume that $X, Y \ge 0$. Let $X$ be $\mathcal G$-measurable and $C \in \mathcal G$. Then

$$\int_C XY\, dP = \int \mathbf 1_C XY\, dP = \int \mathbf 1_C X\, E[Y \mid \mathcal G]\, dP = \int_C X\, E[Y \mid \mathcal G]\, dP.$$

Indeed, this follows immediately from Lemma 11.11, since $\mathbf 1_C X$ is $\mathcal G$-measurable. On the other hand, we also have $XY \in L^1(P)$ and

$$\int_C XY\, dP = \int_C E[XY \mid \mathcal G]\, dP.$$

Since $X\, E[Y \mid \mathcal G]$ is $\mathcal G$-measurable, we obtain

$$E[XY \mid \mathcal G] = X\, E[Y \mid \mathcal G] \quad P\text{-a.s.}$$

In the case $X \in L^p(P)$, $Y \in L^q(P)$ we observe that then $XY \in L^1(P)$ and conclude as above.

2. Observe that, of course, $E[X \mid \mathcal G_1]$ is $\mathcal G_1$-measurable and, since $\mathcal G_1 \subseteq \mathcal G_2$, also $\mathcal G_2$-measurable. Property 2 in Exercise 11.7 then implies

$$E[E[X \mid \mathcal G_1] \mid \mathcal G_2] = E[X \mid \mathcal G_1] \quad P\text{-a.s.}$$

Moreover, for all $C \in \mathcal G_1$

$$\int_C E[X \mid \mathcal G_1]\, dP = \int_C X\, dP.$$

Hence for all $C \in \mathcal G_1$

$$\int_C E[X \mid \mathcal G_1]\, dP = \int_C E[X \mid \mathcal G_2]\, dP.$$

But this means

$$E[E[X \mid \mathcal G_2] \mid \mathcal G_1] = E[X \mid \mathcal G_1] \quad P\text{-a.s.}$$
The previous theorem leads to yet another characterization of the conditional expectation. To this end take $X \in L^2(P)$ and denote $X_0 := E[X \mid \mathcal G]$ for a $\mathcal G \subseteq \mathcal F$. Let $Z \in L^2(P)$ be $\mathcal G$-measurable. Then $X_0 \in L^2(P)$ and by (11.3)

$$E[Z(X - X_0) \mid \mathcal G] = Z\, E[X - X_0 \mid \mathcal G] = Z\left( E[X \mid \mathcal G] - X_0 \right) = Z(X_0 - X_0) = 0.$$

Taking expectations shows that $X - X_0$ is orthogonal in $L^2(P)$ to every $\mathcal G$-measurable $Z \in L^2(P)$.

Theorem 11.13 For all $X \in L^2(P)$ and each $\mathcal G \subseteq \mathcal F$ the conditional expectation $E[X \mid \mathcal G]$ is (up to a.s. equality) the unique $\mathcal G$-measurable random variable $X_0 \in L^2(P)$ with

$$E\left[ (X - X_0)^2 \right] = \min\left\{ E\left[ (X - Y)^2 \right] :\ Y \in L^2(P),\ Y\ \mathcal G\text{-measurable} \right\}.$$
Proof. Let $Y \in L^2(P)$ be $\mathcal G$-measurable. Put $X_0 := E[X \mid \mathcal G]$. Then

$$E\left( (X-Y)^2 \right) = E\left( (X - X_0 + X_0 - Y)^2 \right) = E\left( (X-X_0)^2 \right) + E\left( (X_0-Y)^2 \right) + 2E\left( (X-X_0)(X_0-Y) \right).$$

But $E\left( (X-X_0)(X_0-Y) \right) = 0$, since $X_0 - Y$ is $\mathcal G$-measurable (use the orthogonality established before Theorem 11.13). This gives

$$E\left[ (X-Y)^2 \right] - E\left[ (X-X_0)^2 \right] = E\left[ (X_0-Y)^2 \right]. \qquad (11.4)$$

Due to positivity of squares we hence obtain

$$E\left[ (X-X_0)^2 \right] \le E\left[ (X-Y)^2 \right].$$

If, on the other hand,

$$E\left[ (X-X_0)^2 \right] = E\left[ (X-Y)^2 \right],$$

then

$$E\left[ (X_0-Y)^2 \right] = 0,$$

which implies $Y = X_0 = E[X \mid \mathcal G]$ $P$-a.s.

The last theorem states that $E[X \mid \mathcal G]$ for $X \in L^2(P)$ is the best approximation of $X$ in the space of $\mathcal G$-measurable functions in the sense of a least squares approximation. It is the projection of $X$ onto the space of square integrable, $\mathcal G$-measurable functions.

Exercise 11.14 Prove that for $X \in L^2(P)$, $\mu = E(X)$ is the number that minimizes $E\left( (X - \mu)^2 \right)$.
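Theorem 11.13 can also be observed numerically: no $\mathcal G$-measurable $Y$, i.e. no function that is constant on the blocks of a generating partition, achieves a smaller mean squared distance to $X$ than $E[X \mid \mathcal G]$. A small self-contained Python sketch (an added illustration on a hypothetical finite space, in the spirit of the earlier one):

```python
import numpy as np

rng = np.random.default_rng(3)
N = 12
P = rng.random(N); P /= P.sum()          # a probability on a finite Omega
X = rng.normal(size=N)
labels = np.arange(N) % 3                # partition generating G

def cond_exp(V):
    """Conditional expectation of V given the partition (Example 11.5.2)."""
    V0 = np.empty(N)
    for i in np.unique(labels):
        B = labels == i
        V0[B] = (V[B] * P[B]).sum() / P[B].sum()
    return V0

X0 = cond_exp(X)
best = ((X - X0) ** 2 * P).sum()

# Random G-measurable candidates Y never beat X0 (least squares property).
for _ in range(1000):
    Y = cond_exp(rng.normal(size=N) * 5.0)
    assert ((X - Y) ** 2 * P).sum() >= best - 1e-12
print(best)
```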
With the help of conditional expectation we can also give a new definition of conditional probability.

Definition 11.15 Let $(\Omega, \mathcal F, P)$ be a probability space and $\mathcal G \subseteq \mathcal F$ be a sub-$\sigma$-algebra. For $A \in \mathcal F$

$$P[A \mid \mathcal G] := E[\mathbf 1_A \mid \mathcal G]$$

is called the conditional probability of $A$ given $\mathcal G$.

Example 11.16 In the situation of Example 11.5.2 the conditional probability of $A \in \mathcal F$ is given by

$$P(A \mid \mathcal G) = \sum_{i\in I} P(A \mid B_i)\, \mathbf 1_{B_i} := \sum_{i\in I} \frac{P(A \cap B_i)}{P(B_i)}\, \mathbf 1_{B_i}.$$

In a last step we will introduce (but not prove) conditional expectations on events with zero probability. Of course, in general this will just give nonsense, but in the case of a conditional expectation $E[X \mid Y = y]$, where $X, Y$ are random variables such that $(X, Y)$ has a Lebesgue density, we can give this expression a meaning.
Theorem 11.17 Let $X, Y$ be real valued random variables such that $(X, Y)$ has a density $f : \mathbb R^2 \to \mathbb R^+_0$ with respect to two-dimensional Lebesgue measure $\lambda^2$. Assume that $X$ is integrable and that

$$f_0(y) := \int f(x, y)\, dx > 0 \quad\text{for all } y \in \mathbb R.$$

Then the function $E(X \mid Y)$ will be denoted by

$$y \mapsto E(X \mid Y = y)$$

and one has

$$E(X \mid Y = y) = \frac{1}{f_0(y)} \int x\, f(x, y)\, dx \quad\text{for } P_Y\text{-a.e. } y \in \mathbb R.$$

In particular

$$E(X \mid Y) = \frac{1}{f_0(Y)} \int x\, f(x, Y)\, dx \quad P\text{-a.s.}$$
We will also need the following relationship between conditional expectation and independence, which is a generalization of Example 11.5, case 1.

Lemma 11.18 Let $X$ be an integrable real valued random variable and $\mathcal G \subseteq \mathcal F$ a sub-$\sigma$-algebra such that $X$ is independent of $\mathcal G$, that is, $\sigma(X)$ and $\mathcal G$ are independent. Then

$$E(X \mid \mathcal G) = E(X) \quad P\text{-a.s.}$$

Proof. Suppose $X \ge 0$. Then an increasing sequence of step functions $X_n$ can be constructed by $X_n = \lfloor 2^n X \rfloor / 2^n$. Then $X_n$ converges monotonically to $X$. Notice that $X_n$ is a linear combination of indicator functions $\mathbf 1_A$ with $A \in \sigma(X)$. And

$$\int_C \mathbf 1_A\, dP = P(C \cap A) = P(C) P(A) = \int_C E(\mathbf 1_A)\, dP.$$

Thus $E(\mathbf 1_A \mid \mathcal G) = E(\mathbf 1_A)$, by linearity $E(X_n \mid \mathcal G) = E(X_n)$, and by the monotone convergence theorem $E(X \mid \mathcal G) = E(X)$. The general case follows by linearity, $X = X^+ - X^-$.

Exercise 11.19 Let $X$ and $Y$ be as in Theorem 11.17, such that $X$ and $Y$ are independent. Then $X$ is independent of $\sigma(Y)$, and by Lemma 11.18 we have $E(X \mid Y) = E(X \mid \sigma(Y)) = E(X)$. Apply Theorem 11.17 to give an alternative derivation of this fact.
12 Martingales

In this section we are going to define a notion that will turn out to be of central interest in all of so-called stochastic analysis and mathematical finance. A key role in its definition will be taken by conditional expectation. In this section we will just give the definition and a couple of examples. There is a rich theory of martingales; parts of this theory we will meet in a class on Stochastic Calculus.

Definition 12.1 Let $(\Omega, \mathcal F, P)$ be a probability space and $I$ be a (linearly) ordered set, i.e. for $s, t \in I$ either $s \le t$ or $t \le s$, where $s \le t$ and $t \le s$ implies $s = t$, and $s \le t$, $t \le u$ implies $s \le u$. For $t \in I$ let $\mathcal F_t \subseteq \mathcal F$ be a $\sigma$-algebra. $(\mathcal F_t)_{t\in I}$ is called a filtration if $s \le t$ implies $\mathcal F_s \subseteq \mathcal F_t$. A sequence of random variables $(X_t)_{t\in I}$ is called $(\mathcal F_t)_{t\in I}$-adapted if $X_t$ is $\mathcal F_t$-measurable for all $t \in I$.
Exercise 12.2 Construct a filtration on a probability space with $|I| \ge 3$.

Example 12.3 Let $(X_t)_{t\in I}$ be a family of random variables, and $I$ a linearly ordered set. Then

$$\mathcal F_t = \sigma(X_s,\ s \le t)$$

is a filtration and $(X_t)$ is adapted with respect to $(\mathcal F_t)$. $(\mathcal F_t)_{t\in I}$ is called the canonical filtration with respect to $(X_t)_{t\in I}$.
Definition 12.4 Let $(\Omega, \mathcal F, P)$ be a probability space and $I$ a linearly ordered set. Let $(\mathcal F_t)_{t\in I}$ be a filtration and $(X_t)_{t\in I}$ be an $(\mathcal F_t)$-adapted sequence of integrable random variables. $(X_t)$ is called an $(\mathcal F_t)$-supermartingale if

$$E[X_t \mid \mathcal F_s] \le X_s \quad P\text{-a.s.} \qquad (12.1)$$

for all $s \le t$. (12.1) is equivalent with

$$\int_C X_t\, dP \le \int_C X_s\, dP \quad\text{for all } C \in \mathcal F_s. \qquad (12.2)$$

$(X_t)$ is called an $(\mathcal F_t)$-submartingale if $(-X_t)$ is an $(\mathcal F_t)$-supermartingale. Eventually $(X_t)$ is called a martingale if it is both a submartingale and a supermartingale. This means that

$$E[X_t \mid \mathcal F_s] = X_s \quad P\text{-a.s.}$$

for $s \le t$ or, equivalently,

$$\int_C X_t\, dP = \int_C X_s\, dP, \quad C \in \mathcal F_s.$$
Exercise 12.5 Show that the conditions (12.1) and (12.2) are equivalent.

Remark 12.6 1. If $(\mathcal F_t)$ is the canonical filtration with respect to $(X_t)_{t\in I}$, then $(X_t)$ is often simply called a supermartingale, submartingale, or a martingale.

2. (12.1) and (12.2) are evidently correct for $s = t$ (with equality). Hence these properties only need to be checked for $s < t$.

3. Putting $C = \Omega$ in (12.2) we obtain for a supermartingale $(X_t)_t$:

$$s \le t \implies E(X_s) \ge E(X_t).$$

Hence for supermartingales $(E(X_s))_s$ is a decreasing sequence, while for a submartingale $(E(X_s))_s$ is an increasing sequence.

4. In particular, if each of the random variables $X_s$ is almost surely constant, e.g. if $\Omega$ is a singleton (a set with just one element), then $(X_s)$ is a decreasing sequence if $(X_s)$ is a supermartingale, and it is an increasing sequence if $(X_s)$ is a submartingale. Hence martingales are (in a certain sense) the stochastic generalization of constant sequences.
Exercise 12.7 Let $(X_t)$, $(Y_t)$ be adapted to the same filtration and $\alpha, \beta \in \mathbb R$. Show the following:

1. If $(X_t)$ and $(Y_t)$ are martingales, then $(\alpha X_t + \beta Y_t)$ is a martingale.

2. If $(X_t)$ and $(Y_t)$ are supermartingales, then so is $(X_t \wedge Y_t) = (\min(X_t, Y_t))$.

3. If $(X_t)$ is a submartingale, so is $(X_t^+, \mathcal F_t)$.

4. If $(X_t)$ is a martingale taking values in an open set $J \subseteq \mathbb R$ and $q : J \to \mathbb R$ is convex, then $(q \circ X_t, \mathcal F_t)$ is a submartingale, if $q(X_t)$ is integrable for all $t$.
Of course, at first glance the definition of a martingale may look a bit weird. We will therefore give a couple of examples to show that it is not as strange as expected.

Example 12.8 Let $(X_n)$ be an i.i.d. sequence of $\mathbb R$-valued integrable random variables. Put $S_n = X_1 + \dots + X_n$ and consider the canonical filtration $\mathcal F_n = \sigma(S_m, m \le n)$. By Lemma 11.18 we have

$$E[X_{n+1} \mid S_1, \dots, S_n] = E[X_{n+1}] \quad P\text{-a.s.}$$

and by part 2 of Exercise 11.7

$$E[X_i \mid S_1, \dots, S_n] = X_i \quad P\text{-a.s.}$$

for all $i = 1, \dots, n$. Adding these $n + 1$ equations gives

$$E[S_{n+1} \mid \mathcal F_n] = S_n + E[X_{n+1}] \quad P\text{-a.s.}$$

If $EX_i = 0$ for all $i$, then

$$E[S_{n+1} \mid \mathcal F_n] = S_n,$$

i.e. $(S_n)$ is a martingale. If $E[X_i] \le 0$, then

$$E[S_{n+1} \mid \mathcal F_n] \le S_n,$$

i.e. $(S_n)$ is a supermartingale. In the same way $(S_n)$ is a submartingale if $EX_i \ge 0$.
Example 12.9 Consider the following game. For each $n \in \mathbb N$ a coin with probability $p$ for heads is tossed. If it shows heads ($X_n = +1$) our player receives money, otherwise ($X_n = -1$) he loses money. The way he wins or loses is determined in the following way. Before the game starts he determines a sequence $(\varphi_n)_n$ of functions

$$\varphi_n : \{H, T\}^n \to \mathbb R_+.$$

In round number $n + 1$ he plays for $\varphi_n(X_1, \dots, X_n)$ Euros, depending on how the first $n$ games ended. If we denote by $S_n$ his capital at time $n$, then

$$S_1 = X_1 \quad\text{and}\quad S_{n+1} = S_n + \varphi_n(X_1, \dots, X_n) \cdot X_{n+1}.$$

Hence

$$\begin{aligned}
E[S_{n+1} \mid X_1, \dots, X_n] &= S_n + \varphi_n(X_1, \dots, X_n)\, E[X_{n+1} \mid X_1, \dots, X_n] \\
&= S_n + \varphi_n(X_1, \dots, X_n)\, E(X_{n+1}) \\
&= S_n + (2p - 1)\, \varphi_n(X_1, \dots, X_n),
\end{aligned}$$

since $X_{n+1}$ is independent of $X_1, \dots, X_n$ and $E(X_{n+1}) = 2p - 1$. Hence for $p = \frac 12$

$$E[S_{n+1} \mid X_1, \dots, X_n] = S_n,$$

so $(S_n)$ is a martingale, while for $p > \frac 12$

$$E[S_{n+1} \mid X_1, \dots, X_n] \ge S_n,$$

hence $(S_n)$ is a submartingale, and for $p < \frac 12$

$$E[S_{n+1} \mid X_1, \dots, X_n] \le S_n,$$

so $(S_n)$ is a supermartingale. This explains the idea that martingales are generalizations of fair games.
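A short simulation (an added illustration; the strategy $\varphi$ below is an arbitrary choice of mine) shows the three regimes of Example 12.9: for $p = \frac 12$ the expected capital stays constant, while it drifts downward or upward for unfair coins, whatever previsible strategy is used:

```python
import numpy as np

rng = np.random.default_rng(4)

def play(p, phi, n_rounds, n_paths):
    """Simulate Example 12.9: S_{n+1} = S_n + phi(n, history) * X_{n+1}."""
    S = np.zeros(n_paths)
    heads = np.zeros(n_paths)           # number of heads so far
    for n in range(n_rounds):
        stake = phi(n, heads)           # depends on the past only (previsible)
        X = np.where(rng.random(n_paths) < p, 1.0, -1.0)
        S += stake * X
        heads += (X > 0)
    return S

# An arbitrary previsible strategy: bet more after many heads.
phi = lambda n, heads: 1.0 + heads / (n + 1.0)

for p in (0.4, 0.5, 0.6):
    print(p, play(p, phi, n_rounds=50, n_paths=200000).mean())
# For p = 1/2 the mean stays near 0 (martingale); it drifts otherwise.
```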
Exercise 12.10 Let $X_1, X_2, \dots$ be a sequence of independent random variables with finite variances $\mathbb V(X_i) = \sigma_i^2$. Then

$$\left( \sum_{i=1}^n (X_i - E(X_i)) \right)^2 - \sum_{i=1}^n \sigma_i^2$$

is a martingale with respect to the filtration $\mathcal F_n = \sigma(X_1, \dots, X_n)$.
Exercise 12.11 Consider the gambler's martingale. Consider an i.i.d. sequence $(X_n)_{n=1}^\infty$ of Bernoulli variables with values $-1$ and $1$, each with probability $1/2$. Consider the sequence $(Y_n)$ such that $Y_n = 2^{n-1}$ if $X_1 = \dots = X_{n-1} = -1$, and $Y_n = 0$ if $X_i = 1$ for some $i \le n-1$. Show that $S_n = \sum_{i=1}^n X_i Y_i$ is a martingale. Show that $S_n$ almost surely converges and determine its limit $S_\infty$. Observe that $S_n \ne E(S_\infty \mid \mathcal F_n)$.
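For Exercise 12.11 a simulation is instructive (again only an added illustration): the doubling strategy yields $E(S_n) = 0$ for every $n$, while $S_n = 1$ with probability $1 - 2^{-n}$, so $S_n \to S_\infty \equiv 1$ almost surely although $E(S_\infty) = 1 \ne 0$:

```python
import numpy as np

rng = np.random.default_rng(5)
n, paths = 30, 100000

S = np.zeros(paths)
alive = np.ones(paths, dtype=bool)      # no +1 observed yet
for i in range(n):
    Y = np.where(alive, 2.0 ** i, 0.0)  # Y_{i+1} = 2^i while all tosses were -1
    X = np.where(rng.random(paths) < 0.5, 1.0, -1.0)
    S += Y * X
    alive &= (X < 0)

print(S.mean())                         # E S_n = 0: (S_n) is a martingale
print((S == 1.0).mean())                # P(S_n = 1) = 1 - 2^{-n}: S_n -> 1 a.s.
```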
Example 12.12 In a sense Example 12.8 is both a special case and a generalization of the following example. To this end let $X_1, \dots, X_n, \dots$ denote an i.i.d. sequence of $\mathbb R^d$-valued random variables. Assume

$$P(X_i = +e_k) = P(X_i = -e_k) = \frac{1}{2d}$$

for all $i = 1, 2, \dots$ and all $k = 1, \dots, d$. Here $e_k$ denotes the $k$-th unit vector. Define the stochastic process $(S_n)$ by

$$S_0 = 0 \quad\text{and}\quad S_n = \sum_{i=1}^n X_i.$$

This process is called the random walk in $d$ dimensions. Some of its properties will be discussed below. First we will see that indeed $(S_n)$ is a martingale. Indeed,

$$E[S_{n+1} \mid X_1, \dots, X_n] = E[X_{n+1}] + S_n = S_n.$$

As a matter of fact, not only is $(S_n)$ a martingale but, in a certain sense, it is the discrete time martingale.
Since the random walk in $d$ dimensions is the model for a discrete time martingale (the standard model of a continuous time martingale will be introduced in the following section), it is worthwhile studying some of its properties. This has been done in thousands of research papers in the past 50 years. We will just mention one interesting property here, which reveals a dichotomy in the random walk's behavior between dimensions $d = 1, 2$ and $d \ge 3$.

Definition 12.13 Let $(S_n)$ be a stochastic process in $\mathbb Z^d$, i.e. for each $n \in \mathbb N$, $S_n$ is a random variable with values in $\mathbb Z^d$. $(S_n)$ is called recurrent in a state $x \in \mathbb Z^d$ if

$$P(S_n = x \text{ infinitely often in } n) = 1.$$

It is called transient in $x$ if

$$P(S_n = x \text{ infinitely often in } n) < 1.$$

$(S_n)$ is called recurrent (transient) if each $x \in \mathbb Z^d$ is recurrent (transient).

Proposition 12.14 In the situation of Example 12.12, if $x \in \mathbb Z^d$ is recurrent, then all $y \in \mathbb Z^d$ are recurrent.

Exercise 12.15 Show Proposition 12.14.

We will show a variant of the following

Theorem 12.16 The random walk $(S_n)$ introduced in Example 12.12 is recurrent in dimensions $d = 1, 2$ and transient for $d \ge 3$.
To prove a version of Theorem 12.16 we will first discuss the property of recurrence:

Lemma 12.17 Let $f_k$ denote the probability that the random walk returns to the origin after $k$ steps for the first time, and let $p_k$ denote the probability that the random walk is at the origin after $k$ steps. Then a random walk $(S_n)$ is recurrent if and only if $\sum_k f_k = 1$, and this is the case if and only if $\sum_k p_k = \infty$.
Proof. The first equivalence is easy. Denote by $\Omega_k$ the set of all realizations of the random walk returning to the origin for the first time after $k$ steps. If the random walk $(S_n)$ is recurrent, then with probability one there exists a $k > 0$ such that $S_k = 0$ and $S_l \ne 0$ for all $0 < l < k$. Hence $\sum_k f_k = 1$. On the other hand, if $\sum_k f_k = 1$, then $P(\bigcup_k \Omega_k) = 1$. Hence with probability one there exists a $k > 0$ such that $S_k = 0$ and $S_l \ne 0$ for all $0 < l < k$. But then the situation at times $0$ and $k$ is completely the same, and hence there exists $k' > k$ such that $S_{k'} = 0$ and $S_l \ne 0$ for all $k < l < k'$. Iterating this gives that $S_k = 0$ for infinitely many $k$'s with probability one.

In order to relate $f_k$ and $p_k$ we derive the following recursion:

$$p_k = f_k + f_{k-1}\, p_1 + \dots + f_0\, p_k \qquad (12.3)$$

(the last summand is just added for completeness; we have $f_0 = 0$). Indeed this is again easy to see. The left hand side is just the probability to be at the origin at time $k$. This event is the disjoint union of the events to be at $0$ for the first time after $1 \le l \le k$ steps and to walk from zero to zero in the remaining $k - l$ steps. Hence we obtain

$$p_k = \sum_{i=1}^k f_i\, p_{k-i} \quad\text{and}\quad p_0 = 1. \qquad (12.4)$$

Define the generating functions

$$F(z) = \sum_{k\ge 0} f_k z^k \quad\text{and}\quad P(z) = \sum_{k\ge 0} p_k z^k.$$

Multiplying the left and right hand sides in (12.4) with $z^k$ and summing from $k = 0$ to infinity gives

$$P(z) = 1 + P(z) F(z),$$

i.e.

$$F(z) = 1 - 1/P(z).$$

By Abel's theorem

$$\sum_{k=1}^\infty f_k = F(1) = \lim_{z\uparrow 1} F(z) = 1 - \lim_{z\uparrow 1} \frac{1}{P(z)}.$$

First assume that $\sum_k p_k < \infty$. Then

$$\lim_{z\uparrow 1} P(z) = P(1) = \sum_k p_k < \infty$$

and thus

$$\lim_{z\uparrow 1} \frac{1}{P(z)} = 1 \Big/ \sum_k p_k > 0.$$

Hence $\sum_{k=1}^\infty f_k < 1$ and the random walk $(S_n)$ is transient.

Next assume that $\sum_k p_k = \infty$ and fix $\varepsilon > 0$. Then we find $N$ such that

$$\sum_{k=0}^N p_k \ge \frac{2}{\varepsilon}.$$

Then for $z$ sufficiently close to one we have $\sum_{k=0}^N p_k z^k \ge \frac 1\varepsilon$, and consequently for such $z$

$$\frac{1}{P(z)} \le \frac{1}{\sum_{k=0}^N p_k z^k} \le \varepsilon.$$

But this implies that

$$\lim_{z\uparrow 1} \frac{1}{P(z)} = 0,$$

and therefore $\sum_{k=1}^\infty f_k = 1$ and the random walk $(S_n)$ is recurrent.
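The recursion (12.4) also gives a way to compute the first-return probabilities $f_k$ numerically. For the simple random walk on $\mathbb Z$ one has $p_{2k} = \binom{2k}{k} 2^{-2k}$, and the partial sums of the $f_k$ creep towards $1$, in accordance with recurrence. A small Python sketch (an added illustration; the horizon $K$ is an arbitrary choice):

```python
K = 2000                        # time horizon (illustrative)
p = [0.0] * (K + 1)
p[0] = 1.0
for k in range(2, K + 1, 2):    # p_k = C(k, k/2) 2^-k via the ratio p_k / p_{k-2}
    p[k] = p[k - 2] * (k - 1) / k

f = [0.0] * (K + 1)
for k in range(1, K + 1):       # solve the renewal equation (12.4) for f_k
    f[k] = p[k] - sum(f[i] * p[k - i] for i in range(1, k))

print(sum(f))                   # creeps towards 1: the walk is recurrent
```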
Exercise 12.18 What does the Borel-Cantelli Lemma 8.1 have to say in the above situation? Do not overlook that the events $\{S_n = x\}$ ($n \in \mathbb N$) may be dependent.
We will now apply this criterion to analyze recurrence and transience for a random walk similar to the one defined in Example 12.12.

To this end define the following random walk $(R_n)$ in $d$ dimensions. For $k \in \mathbb N$ let $Y^k_1, \dots, Y^k_d$ be i.i.d. random variables taking values in $\{-1, +1\}$ with $P(Y^k_1 = 1) = P(Y^k_1 = -1) = 1/2$. Let $X_k$ be the random vector $X_k = (Y^k_1, \dots, Y^k_d)$. Define $R_0 \equiv 0$ and for $n \ge 1$

$$R_n = \sum_{k=1}^n X_k.$$
Theorem 12.19 The random walk $(R_n)$ defined above is recurrent in dimensions $d = 1, 2$ and transient for $d \ge 3$.

Proof. Consider a sequence of i.i.d. random variables $(Z_k)$ taking values in $\{-1, +1\}$ with $P(Z_k = 1) = P(Z_k = -1) = 1/2$. Write $q_k = P\left( \sum_{i=1}^{2k} Z_i = 0 \right)$. Then we apply Stirling's formula

$$\lim_{n\to\infty} n! \Big/ \left( \sqrt{2\pi}\, n^{n+1/2} e^{-n} \right) = 1$$

to obtain

$$q_k = \binom{2k}{k} 2^{-2k} = \frac{(2k)!}{k!\,k!}\, 2^{-2k} \sim \frac{\sqrt{4\pi k}\, \left(\frac{2k}{e}\right)^{2k}}{2\pi k\, \left(\frac ke\right)^{2k}\, 2^{2k}} = \frac{1}{\sqrt{\pi k}}.$$

Hence the probability of a single coordinate of $R_{2n}$ to be zero ($R_n$ cannot be zero if $n$ is odd) asymptotically behaves like $\frac{1}{\sqrt{\pi n}}$. Since the $d$ coordinates of $(R_n)$ are independent random walks, we obtain

$$P(R_{2n} = 0) \sim \left( \frac{1}{\pi n} \right)^{d/2}.$$

But

$$\sum_n \left( \frac{1}{\pi n} \right)^{d/2} = \infty$$

for $d = 1$ and $d = 2$, while

$$\sum_n \left( \frac{1}{\pi n} \right)^{d/2} < \infty$$

for $d \ge 3$. By Lemma 12.17 this proves the theorem.
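The dichotomy can also be observed in a Monte Carlo experiment: the mean number of returns to the origin keeps growing with the time horizon for $d = 1, 2$ but saturates for $d = 3$. The following Python sketch (an added illustration with modest, arbitrary sample sizes) simulates the walk $(R_n)$:

```python
import numpy as np

rng = np.random.default_rng(6)

def mean_visits(d, steps, paths=400):
    """Average number of visits of (R_n) to the origin within `steps` steps."""
    X = 2 * rng.integers(0, 2, size=(paths, steps, d), dtype=np.int8) - 1
    R = X.cumsum(axis=1, dtype=np.int32)        # one walk per path
    return np.all(R == 0, axis=2).sum(axis=1).mean()

for d in (1, 2, 3):
    print(d, mean_visits(d, 500), mean_visits(d, 2000))
# For d = 1, 2 the mean number of visits grows with the horizon
# (sum_k p_k = infinity); for d = 3 it stays bounded (transience).
```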
We will give two results about martingales. The first one is inspired by the fact that, given a random variable $M$ and a filtration $(\mathcal F_t)$ of $\mathcal F$, the family of conditional expectations $E(M \mid \mathcal F_t)$ yields a martingale. Can a martingale always be described in this way? That is the content of the Martingale Limit Theorem.

Theorem 12.20 Suppose $(M_t)$ is a martingale with respect to the filtration $(\mathcal F_t)$ and that the martingale is (uniformly) square integrable, that is, $\limsup_t E(M_t^2) < \infty$. Then there is a square integrable random variable $M_\infty$ such that $M_t = E(M_\infty \mid \mathcal F_t)$ a.s. Moreover $\lim M_t = M_\infty$ in the $L^2$ sense.
Proof. The basic property that we will use is that the space of square integrable random variables $L^2(\Omega, \mathcal F, P)$ with the $L^2$ inner product is a complete vector space; in other words, a Cauchy sequence converges. Recall that for any $t < s$ it holds that $M_t = E(M_s \mid \mathcal F_t)$, and we have seen in Theorem 11.13 that then $M_s - M_t$ is perpendicular to $M_t$. In particular we have the Pythagoras formula

$$E(M_s^2) = E(M_t^2) + E\left( (M_s - M_t)^2 \right).$$

This implies that $E(M_s^2)$ is increasing in $s$, and therefore its limit exists and equals $\limsup_t E(M_t^2)$, which is finite. Therefore, given $\varepsilon > 0$ there is a $u$ such that for $t > s > u$ we have $E((M_s - M_t)^2) < \varepsilon$. That means that $(M_s)_s$ is a Cauchy sequence in $L^2$. Let $M_\infty$ be its limit. In particular $M_\infty$ is a square integrable random variable. Since orthogonal projection onto the subspace of $\mathcal F_t$-measurable functions is a continuous map, it holds that $E(M_\infty \mid \mathcal F_t) = \lim_s E(M_s \mid \mathcal F_t) = M_t$.

With some extra effort one may show that $M_\infty$ is also the limit in the sense of almost sure convergence. The Martingale Limit Theorem is valid under more general circumstances; for example, it is already sufficient that $\limsup_t E(|M_t|) < \infty$ for the almost sure convergence of $M_t$ (for convergence in the $L^1$ sense and the representation $M_t = E(M_\infty \mid \mathcal F_t)$ one additionally needs uniform integrability).
An important concept for random processes is the concept of a stopping time.

Definition 12.21 A stopping time is a random variable $\tau : \Omega \to I \cup \{\infty\}$ such that for all $t \in I$ the set $\{\omega;\ \tau(\omega) \le t\}$ is $\mathcal F_t$-measurable. Here $I \cup \{\infty\}$ is ordered such that $t < \infty$ for all $t \in I$. Given a process $(M_t)$, the stopped process $M_{\tau\wedge t}$ is given by $M_{\tau\wedge t}(\omega) = M_s(\omega)$ where $s = \tau(\omega) \wedge t = \min(\tau(\omega), t)$.

Example 12.22 Given $A \in \mathcal F_u$, a stopping time is constructed by $\tau_A = \infty \cdot \mathbf 1_{A^c} + u \cdot \mathbf 1_A$, that is, $\tau_A(\omega) = \infty$ if $\omega \notin A$, and $\tau_A(\omega) = u$ if $\omega \in A$.
Exercise 12.23 If $T : \Omega \to \mathbb R$ is constant, then $T$ is a stopping time. If $S$ and $T$ are stopping times, then $\max(S, T)$ and $\min(S, T)$ are stopping times.

A very important property of martingales is the following Martingale Stopping Theorem.

Theorem 12.24 Let $(M_t)_t$ be a martingale and $\tau$ a stopping time. Then the stopped process $(M_{\tau\wedge t})_t$ is a martingale.

Proof. It is easy to see that an adapted process stopped at a stopping time is again an adapted process. We will give a proof of the martingale property for the simple stopping time $\tau_A$ given above. If $s > t \ge u$, let $B \in \mathcal F_t$; then, since $B \cap A$ and $B \cap A^c$ belong to $\mathcal F_t$,

$$\begin{aligned}
\int_B E(M_{\tau\wedge s} \mid \mathcal F_t)\, dP &= \int_B M_{\tau\wedge s}\, dP = \int_{B\cap A} M_{\tau\wedge s}\, dP + \int_{B\cap A^c} M_{\tau\wedge s}\, dP = \int_{B\cap A} M_u\, dP + \int_{B\cap A^c} M_s\, dP \\
&= \int_{B\cap A} M_u\, dP + \int_{B\cap A^c} E(M_s \mid \mathcal F_t)\, dP = \int_{B\cap A} M_u\, dP + \int_{B\cap A^c} M_t\, dP = \int_B M_{\tau\wedge t}\, dP.
\end{aligned}$$

If $u > t$ and $s > t$, then $M_{\tau\wedge t} = M_t$, and one can apply Theorem 11.12 together with the case already treated (note that $\tau \wedge u = u$):

$$E(M_{\tau\wedge s} \mid \mathcal F_t) = E\left( E(M_{\tau\wedge s} \mid \mathcal F_u) \mid \mathcal F_t \right) = E(M_{u\wedge s} \mid \mathcal F_t) = M_t = M_{\tau\wedge t}.$$

Exercise 12.25 Modify the proof of Theorem 12.24 to show that a stopped supermartingale is a supermartingale.
Exercise 12.26 Consider the roulette game. There are several possibilities for a bet, given by a number $p \in (0, 1)$ such that with probability $\frac{36}{37} p$ the return is $p^{-1}$ times the stake, and the return is zero with probability $1 - \frac{36}{37} p$. The probabilities $p$ are such that $p^{-1} \in \mathbb N$. Suppose you start with an initial fortune $X_0 \in \mathbb N$ and perform a sequence of bets until this fortune is reduced to zero. We are interested in the expected value of the total sum of stakes. To determine this, consider the sequence of subsequent fortunes $X_i$, and consider the sequence of stakes $Y_i$, meaning that the stake in bet $i$ is $Y_i = Y_i(X_1, \dots, X_{i-1})$ ($Y_i \le X_{i-1}$). In particular, if for this stake the probability $p$ is chosen, either $X_i = X_{i-1} - Y_i + p^{-1} Y_i$ (with probability $\frac{36}{37} p$) or $X_i = X_{i-1} - Y_i$ (with probability $1 - \frac{36}{37} p$). Show that

$$\left( X_i + \frac{1}{37} \sum_{j=1}^i Y_j \right)_i$$

is a martingale with respect to the filtration $\mathcal F_i = \sigma(X_1, \dots, X_i)$. The stopping time $N$ is the first time $i$ such that $X_i = 0$. Show that $E\left( \sum_{j=1}^N Y_j \right) = 37\, X_0$.
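The identity $E(\sum_{j=1}^N Y_j) = 37\, X_0$ can be checked by simulation. The sketch below (an added illustration) uses the constant stake $Y_i = 1$ and the bet $p = 1/2$; by the stopping argument of the exercise, any admissible choice gives the same expectation:

```python
import numpy as np

rng = np.random.default_rng(7)

def total_stakes(X0, p=0.5, max_rounds=10**6):
    """Bet 1 Euro per round at fixed p until ruin; return the sum of stakes."""
    x, total = X0, 0
    for _ in range(max_rounds):     # ruin happens a.s.; the cap is a safeguard
        if x == 0:
            break
        total += 1                  # stake Y_i = 1 (any Y_i <= X_{i-1} works)
        if rng.random() < (36.0 / 37.0) * p:
            x += -1 + int(round(1.0 / p))
        else:
            x -= 1
    return total

X0 = 5
est = np.mean([total_stakes(X0) for _ in range(2000)])
print(est, 37 * X0)                 # Monte Carlo estimate vs. the exact value
```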
13 Brownian motion

In this section we will construct the continuous time martingale, Brownian motion. Besides this, Brownian motion is also a building block of stochastic calculus and stochastic analysis.

In stochastic analysis one studies random functions of one variable and various kinds of integrals and derivatives thereof. The argument of these functions is usually interpreted as time, so the functions themselves can be thought of as the path of a random process.

Here, like in other areas of mathematics, going from the discrete to the continuous yields a pay-off in simplicity and smoothness, at the price of a formally more complicated analysis. Compare, to make an analogy, the integral $\int_0^n x^3\, dx$ with the sum $\sum_{k=1}^n k^3$. The integral requires a more refined analysis for its definition and its properties, but once this has been done the integral is easier to calculate. Similarly, in stochastic analysis you will become acquainted with a convenient differential calculus as a reward for some hard work in analysis.

Stochastic analysis can be applied in a wide variety of situations. We sketch a few examples below.
Stochastic analysis can be applied in a wide variety of situations. We sketch a few
examples below.
1. Some dierential equations become more realistic when we allow some randomness
in their coecients. Consider for example the following growth equation, used among
other places in population biology:
d
dt
S
t
= (r + N
t
)S
t
. (13.1)
Here, S
t
is the size of the population at time t, r is the average growth rate of the
population, and the noise N
t
models random uctuations in the growth rate.
2. At time t = 0 an investor buys stocks and bonds on the nancial market, i.e., he
divides his initial capital C
0
into A
0
shares of stock and B
0
shares of bonds. The
bonds will yield a guaranteed interest rate r
t
. If we assume that the stock price S
t
satises the growth equation (13.1), then his capital C
t
at time t is
C
t
= A
t
S
t
+B
t
e
r

t
, (13.2)
where A
t
and B
t
are the amounts of stocks and bonds held at time t. With a keen eye
on the market the investor sells stocks to buy bonds and vice versa. If his tradings
are self-nancing, then dC
t
= A
t
dS
t
+B
t
d(e
r

t
). An interesting question is:
- What would he be prepared to pay for a so-called European call option, i.e.,
the right (bought at time 0) to purchase at time T > 0 a share of stock at a
predetermined price K?
The rational answer, q say, was found by Black and Scholes (1973) through an analysis
of the possible strategies leading from an initial investment q to a payo C
T
. Their
formula is being used on the stock markets all over the world.
3. The Langevin equation describes the behaviour of a dust particle suspended in a uid:
m
d
dt
V
t
= V
t
+ N
t
. (13.3)
Here, V
t
is the velocity at time t of the dust particle, the friction exerted on the
particle due to the viscosity of the uid is V
t
, and the noise N
t
stands for the
disturbance due to the thermal motion of the surrounding uid molecules colliding
with the particle.
58
4. The path of the dust particle in example 3 is observed with some inaccuracy. One
measures the perturbed signal Z(t) given by
Z
t
= V
t
+

N
t
. (13.4)
Here

N
t
is again a noise. One is interested in the best guess for the actual value of
V
t
, given the observation Z
s
for 0 s t. This is called a ltering problem: how to
lter away the noise

N
t
. Kalman and Bucy (1961) found a linear algorithm, which
was almost immediately applied in aerospace engineering. Filtering theory is now a
ourishing and extremely useful discipline.
5. Stochastic analysis can help solve boundary value problems such as the Dirichlet
problem. If the value of a harmonic function f on the boundary of some bounded
regular region D R
n
is known, then one can express the value of f in the interior
of D as follows:
E(f (B
x

)) = f(x), (13.5)
where B
x
t
:= x +
_
t
0
N
t
dt is an integrated noise or Brownian motion, starting at x,
and denotes the time when this Brownian motion rst reaches the boundary. (A
harmonic function f is a function satisfying f = 0 with the Laplacian.)
The goal of the course Stochastic Analysis is to make sense of the above equations, and to
work with them.
In all the above examples the unexplained symbol $N_t$ occurs, which is to be thought of as a completely random function of $t$, in other words, the continuous time analogue of a sequence of independent identically distributed random variables. In a first attempt to catch this concept, let us formulate the following requirements:

1. $N_t$ is independent of $N_s$ for $t \ne s$;
2. The random variables $N_t$ ($t \ge 0$) all have the same probability distribution $\nu$;
3. $E(N_t) = 0$.

However, when taken literally these requirements do not produce what we want. This is seen by the following argument. By requirement 1 we have for every point in time an independent value of $N_t$. We shall show that such a continuous i.i.d. family $N_t$ is not measurable in $t$, unless it is identically 0.

Let $\nu$ denote the probability distribution of $N_t$, which by requirement 2 does not depend on $t$, i.e., $\nu([a, b]) := P[a \le N_t \le b]$. Divide $\mathbb R$ into two half lines, one extending from $-\infty$ to $a$ and the other extending from $a$ to $\infty$. If $N_t$ is not a constant function of $t$, then there must be a value of $a$ such that each of the half lines has positive measure. So

$$p := P(N_t \le a) = \nu((-\infty, a]) \in (0, 1). \qquad (13.6)$$

Now consider the set of time points where the noise $N_t$ is low: $E := \{t \ge 0 : N_t \le a\}$. It can be shown that with probability 1 the set $E$ is not Lebesgue measurable. Without giving a full proof we can understand this as follows. Let $\lambda$ denote the Lebesgue measure on $\mathbb R$. If $E$ were measurable, then by requirement 1 and Eq. (13.6) it would be reasonable to expect its relative share in any interval $(c, d)$ to be $p$, i.e.,

$$\lambda(E \cap (c, d)) = p\, (d - c). \qquad (13.7)$$

On the other hand, it is known from measure theory that every measurable set $E$ is arbitrarily thick somewhere with respect to the Lebesgue measure $\lambda$, i.e., for all $\alpha < 1$ an interval $(c, d)$ can be found such that

$$\lambda(E \cap (c, d)) > \alpha\, (d - c)$$

(cf. Halmos (1974) Th. III.16.A). This clearly contradicts Eq. (13.7), so $E$ is not measurable. This is a bad property of $N_t$: for, in view of (13.1), (13.3), (13.4) and (13.5), we would like to integrate $N_t$.
For this reason, let us approach the problem from another angle. Instead of $N_t$, let us consider the integral of $N_t$, and give it a name:

$$B_t := \int_0^t N_s\, ds.$$

The three requirements on the evasive object $N_t$ then translate into three quite sensible requirements for $B_t$:

BM1. For $0 = t_0 \le t_1 \le \dots \le t_n$ the random variables $B_{t_{j+1}} - B_{t_j}$ ($j = 0, \dots, n-1$) are independent;

BM2. $B_t$ has stationary increments, i.e., the joint probability distribution of

$$\left( B_{t_1+s} - B_{u_1+s},\ B_{t_2+s} - B_{u_2+s},\ \dots,\ B_{t_n+s} - B_{u_n+s} \right)$$

does not depend on $s \ge 0$, where $t_i > u_i$ for $i = 1, 2, \dots, n$ are arbitrary;

BM3. $E(B_t - B_0) = 0$ for all $t$.

We add a normalisation:

BM4. $B_0 = 0$ and $E(B_1^2) = 1$.

Still, these four requirements do not determine $B_t$. For example, the compensated Poisson jump process also satisfies them. Our fifth requirement fixes the process $B_t$ uniquely:

BM5. $t \mapsto B_t$ is continuous a.s.

The object $B_t$ so defined is called the Wiener process, or (by a slight abuse of physical terminology) Brownian motion. In the next section we shall give a rigorous and explicit construction of this process.

Before we go into details we remark the following.
Exercise 13.1 Show that BM5, together with BM1 and BM2, implies the following: For any $\varepsilon > 0$

$$n\, P\left( |B_{t+\frac 1n} - B_t| > \varepsilon \right) \to 0 \qquad (13.8)$$

as $n \to \infty$. Hint: compare with inequality (8.6).

Exercise 13.1 helps us to specify the increments of Brownian motion in the following way².

Exercise 13.2 Suppose BM1, BM2, BM4 and (13.8) hold. Apply the Central Limit Theorem (Lindeberg's condition, Remark 10.2) to

$$X_{n,k} := B_{\frac{kt}{n}} - B_{\frac{(k-1)t}{n}}$$

and conclude that $B_{s+t} - B_s$, $t > 0$, has a normal distribution with variance $t$, i.e.

$$P(B_{s+t} - B_s \in A) = \frac{1}{\sqrt{2\pi t}} \int_A e^{-\frac{x^2}{2t}}\, dx.$$

As a matter of fact, BM1 and BM5 already imply that the increments $B_{s+t} - B_s$ are normally distributed³. We may therefore replace BM2 by:

BM2'. If $s \ge 0$ and $t > 0$, then

$$P(B_{s+t} - B_s \in A) = \frac{1}{\sqrt{2\pi t}} \int_A e^{-\frac{x^2}{2t}}\, dx.$$

We can now define Brownian motion as follows.

Definition 13.3 A one-dimensional Brownian motion is a real-valued process $(B_t)_{t\ge 0}$ with the properties BM1, BM2', and BM5.
13.1 Construction of Brownian Motion

Whenever a stochastic process with certain properties is defined, the most natural question to ask is: does such a process exist? Of course, the answer is yes, otherwise these lecture notes would not have been written.

In this section we shall construct Brownian motion on $[0, T]$. For the sake of simplicity we will take $T = 1$; the construction for general $T$ can be carried out along the same lines, or by just concatenating independent Brownian motions.

The construction we shall use was given by P. Lévy in 1948. Since we saw that the increments of Brownian motion are independent Gaussian random variables, the idea is to construct Brownian motion from these Gaussian increments.

² See R. Durrett (1991), Probability: Theory and Examples, Section 7.1, Exercise 1.1, p. 334. Unfortunately, there is something wrong with this exercise. See the 3rd edition (2005) for a correct treatment.
³ See e.g. I. Gihman, A. Skorohod, The Theory of Stochastic Processes I, Ch. III, §5, Theorem 5, p. 189. For a high-tech approach, see N. Ikeda, S. Watanabe, Stochastic Differential Equations and Diffusion Processes, Ch. II, Theorem 6.1, p. 74.
More precisely, we start with the following observation. Suppose we had already constructed Brownian motion, say $(B_t)_{0\le t\le T}$. Take two times $0 \le s < t \le T$, put $\tau := \frac{s+t}{2}$, and let

$$p(v, x, y) := \frac{1}{\sqrt{2\pi v}}\, e^{-(y-x)^2/2v}, \quad v > 0,\ x, y \in \mathbb R,$$

be the Gaussian kernel centered in $x$ with variance $v$. Then, conditioned on $B_s = x$ and $B_t = z$, the random variable $B_\tau$ is normal with mean $\mu := \frac{x+z}{2}$ and variance $\sigma^2 := \frac{t-s}{4}$. Indeed, since $B_s$, $B_\tau - B_s$, and $B_t - B_\tau$ are independent, we obtain

$$\begin{aligned}
P[B_s \in dx,\ B_\tau \in dy,\ B_t \in dz] &= p(s, 0, x)\, p\left( \frac{t-s}{2}, x, y \right) p\left( \frac{t-s}{2}, y, z \right) dx\, dy\, dz \\
&= p(s, 0, x)\, p(t-s, x, z)\, \frac{1}{\sqrt{2\pi \sigma^2}}\, e^{-\frac{(y-\mu)^2}{2\sigma^2}}\, dx\, dy\, dz
\end{aligned}$$

(which is just a bit of algebra). Dividing by

$$P[B_s \in dx,\ B_t \in dz] = p(s, 0, x)\, p(t-s, x, z)\, dx\, dz$$

we obtain

$$P[B_\tau \in dy \mid B_s \in dx,\ B_t \in dz] = \frac{1}{\sqrt{2\pi \sigma^2}}\, e^{-\frac{(y-\mu)^2}{2\sigma^2}}\, dy,$$

which is our claim.

This suggests that we might be able to construct Brownian motion on $[0, 1]$ by interpolation.
This suggests that we might be able to construct Brownian motion on [0, 1] by interpo-
lation.
To carry out this program, we begin with a sequence
(n)
k
, k I(n), n N
0
of inde-
pendent, standard normal random variables on some probability space (, T, P). Here
I(n) := k N, k 2
n
, k = 2l + 1 for some l N
denotes the set of odd, positive integers less than 2
n
. For each n N
0
we dene a process
B
(n)
:= B
(n)
t
: 0 t 1 by recursion and linear interpolation of the preceeding process,
as follows. For n N, B
(n)
k/2
n1
will agree with B
(n1)
k/2
n1
, for all k = 0, 1, . . . , 2
n1
. Thus for
each n we only need to specify the values of B
(n)
k/2
n
for k I(n). We start with
B
(0)
0
= 0 and B
(1)
1
=
(0)
1
.
If the values of B
(n1)
k/2
n1
, k = 0, 1 . . . 2
n1
have been dened (an thus B
(n1)
t
, k/2
n1
t
(k + 1)/2
n1
is the linear interpolation between B
(n1)
k/2
n1
and B
(n1)
(k+1)/2
n1
) and k I(n), we
denote s = (k 1)/2
n
, t = (k + 1)/2
n
, =
1
2
(B
(n1)
s
+ B
(n1)
t
) and
2
=
ts
4
= 2
n+1
and
set in accordance with the above observations
B
(n)
k/2
n
:= B
(n)
(t+s)/2
:= +
(n)
k
.
We shall show that, almost surely, B
(n)
t
converges uniformly in t to a continuous function
B
t
(as n ) and that B
t
is a Brownian motion.
We start with giving a more convenient representation of the processes $B^{(n)}$, $n = 0, 1, \dots$. We define the Haar functions by $H^{(0)}_1(t) \equiv 1$ and, for $n \in \mathbb N$, $k \in I(n)$,

$$H^{(n)}_k(t) := \begin{cases} +2^{(n-1)/2}, & \frac{k-1}{2^n} \le t < \frac{k}{2^n}, \\ -2^{(n-1)/2}, & \frac{k}{2^n} \le t < \frac{k+1}{2^n}, \\ 0 & \text{otherwise.} \end{cases}$$

The Schauder functions are defined by

$$S^{(n)}_k(t) := \int_0^t H^{(n)}_k(u)\, du, \quad 0 \le t \le 1,\ n \in \mathbb N_0,\ k \in I(n).$$

Note that $S^{(0)}_1(t) = t$, and that for $n \ge 1$ the graphs of $S^{(n)}_k$ are little tents of height $2^{-(n+1)/2}$ centered at $k/2^n$ and non-overlapping for different values of $k \in I(n)$. Clearly $B^{(0)}_t = \xi^{(0)}_1 S^{(0)}_1(t)$, and by induction on $n$ it is readily verified that

$$B^{(n)}_t(\omega) = \sum_{m=0}^n \sum_{k\in I(m)} \xi^{(m)}_k(\omega)\, S^{(m)}_k(t), \quad 0 \le t \le 1,\ n \in \mathbb N. \qquad (13.9)$$
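The representation (13.9) translates directly into a construction of approximate Brownian paths; equivalently one can iterate the midpoint recursion $B^{(n)}_{k/2^n} = \mu + \sigma\, \xi^{(n)}_k$. The following Python sketch (an added illustration, not part of Lévy's argument) implements this recursion on the dyadic grid and checks that increments over a mesh of width $h$ have variance $h$:

```python
import numpy as np

def brownian_levy(n_levels, rng):
    """Levy construction of B^{(n)} on the dyadic grid of mesh 2**-n_levels.

    Refines level by level: each new midpoint is the average of its two
    neighbours plus sigma * xi with sigma**2 = 2**-(n+1)."""
    B = np.array([0.0, rng.normal()])       # B_0 = 0, B_1 = xi^(0)_1
    for n in range(1, n_levels + 1):
        sigma = 2.0 ** (-(n + 1) / 2.0)
        mid = 0.5 * (B[:-1] + B[1:]) + sigma * rng.normal(size=len(B) - 1)
        new = np.empty(2 * len(B) - 1)
        new[0::2] = B                       # keep the old dyadic values
        new[1::2] = mid                     # insert the refined midpoints
        B = new
    return B

rng = np.random.default_rng(8)
B = brownian_levy(10, rng)                  # 2**10 + 1 points on [0, 1]
inc = np.diff(B)                            # increments over mesh h = 2**-10
print(inc.var() * len(inc))                 # should be close to 1
```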
Lemma 13.4 As $n \to \infty$, the sequence of functions $\{ B^{(n)}_t(\omega),\ 0 \le t \le 1 \}$, $n \in \mathbb N_0$, given by (13.9) converges uniformly in $t$ to a continuous function $\{ B_t(\omega),\ 0 \le t \le 1 \}$ for almost every $\omega \in \Omega$.

Proof. Let $b_n := \max_{k\in I(n)} |\xi^{(n)}_k|$. Observe that for $x > 0$ and each $n, k$

$$P\left( |\xi^{(n)}_k| > x \right) = \sqrt{\frac 2\pi} \int_x^\infty e^{-u^2/2}\, du \le \sqrt{\frac 2\pi} \int_x^\infty \frac ux\, e^{-u^2/2}\, du = \sqrt{\frac 2\pi}\, \frac 1x\, e^{-x^2/2},$$

which gives

$$P(b_n > n) = P\left( \bigcup_{k\in I(n)} \left\{ |\xi^{(n)}_k| > n \right\} \right) \le 2^n\, P\left( |\xi^{(n)}_1| > n \right) \le \sqrt{\frac 2\pi}\, \frac{2^n}{n}\, e^{-n^2/2}$$

for all $n \in \mathbb N$. Since

$$\sum_n \sqrt{\frac 2\pi}\, \frac{2^n}{n}\, e^{-n^2/2} < \infty,$$

the Borel-Cantelli Lemma implies that there is a set $\tilde\Omega$ with $P(\tilde\Omega) = 1$ such that for $\omega \in \tilde\Omega$ there is an $n_0(\omega)$ such that for all $n \ge n_0(\omega)$ it holds true that $b_n(\omega) \le n$. But then

$$\sum_{n\ge n_0(\omega)} \sum_{k\in I(n)} \left| \xi^{(n)}_k(\omega)\, S^{(n)}_k(t) \right| \le \sum_{n\ge n_0(\omega)} n\, 2^{-(n+1)/2} < \infty;$$

so for $\omega \in \tilde\Omega$, $B^{(n)}_t(\omega)$ converges uniformly in $t$ to a limit $B_t$. The uniformity of the convergence implies the continuity of the limit $t \mapsto B_t$.
The following exercise facilitates the construction of Brownian motion substantially:

Exercise 13.5 Check the following in a textbook of functional analysis: The inner product

$$\langle f, g \rangle := \int_0^1 f(t)\, g(t)\, dt$$

turns $L^2[0, 1]$ into a Hilbert space, and the Haar functions $\{ H^{(n)}_k;\ k \in I(n),\ n \in \mathbb N_0 \}$ form a complete, orthonormal system.

Thus the Parseval equality

$$\langle f, g \rangle = \sum_{n=0}^\infty \sum_{k\in I(n)} \langle f, H^{(n)}_k \rangle \langle g, H^{(n)}_k \rangle \qquad (13.10)$$

holds true. Applying (13.10) to $f = \mathbf 1_{[0,t]}$ and $g = \mathbf 1_{[0,s]}$ yields

$$\sum_{n=0}^\infty \sum_{k\in I(n)} S^{(n)}_k(t)\, S^{(n)}_k(s) = s \wedge t. \qquad (13.11)$$
Now we are able to prove:

Theorem 13.6 With the above notations,

$$B_t := \lim_{n\to\infty} B^{(n)}_t$$

is a Brownian motion on $[0, 1]$.
Proof. In view of our definition of Brownian motion it suffices to prove that for $0 = t_0 < t_1 < \dots < t_n \le 1$ the increments $(B_{t_j} - B_{t_{j-1}})_{j=1,\dots,n}$ are independent and normally distributed with mean zero and variance $t_j - t_{j-1}$. For this we will show that the Fourier transforms satisfy the appropriate condition, namely that for $\lambda_j \in \mathbb R$ (and as usual $i := \sqrt{-1}$)

$$E\left[ \exp\left( i \sum_{j=1}^n \lambda_j (B_{t_j} - B_{t_{j-1}}) \right) \right] = \prod_{j=1}^n \exp\left( -\frac 12 \lambda_j^2 (t_j - t_{j-1}) \right). \qquad (13.12)$$

To derive (13.12) it is most natural to exploit the construction of $B_t$ from Gaussian random variables. Set $\lambda_{n+1} := 0$; by Abel summation (recall $B_{t_0} = B_0 = 0$) we have $\sum_{j=1}^n \lambda_j (B_{t_j} - B_{t_{j-1}}) = \sum_{j=1}^n (\lambda_j - \lambda_{j+1}) B_{t_j}$. Use the independence and normality of the $\xi^{(m)}_k$ to compute, for $M \in \mathbb N$,

$$\begin{aligned}
E\left[ \exp\left( i \sum_{j=1}^n (\lambda_j - \lambda_{j+1}) B^{(M)}_{t_j} \right) \right]
&= E\left[ \exp\left( i \sum_{m=0}^M \sum_{k\in I(m)} \xi^{(m)}_k \sum_{j=1}^n (\lambda_j - \lambda_{j+1}) S^{(m)}_k(t_j) \right) \right] \\
&= \prod_{m=0}^M \prod_{k\in I(m)} E\left[ \exp\left( i\, \xi^{(m)}_k \sum_{j=1}^n (\lambda_j - \lambda_{j+1}) S^{(m)}_k(t_j) \right) \right] \\
&= \prod_{m=0}^M \prod_{k\in I(m)} \exp\left( -\frac 12 \left( \sum_{j=1}^n (\lambda_j - \lambda_{j+1}) S^{(m)}_k(t_j) \right)^2 \right) \\
&= \exp\left( -\frac 12 \sum_{j=1}^n \sum_{l=1}^n (\lambda_j - \lambda_{j+1})(\lambda_l - \lambda_{l+1}) \sum_{m=0}^M \sum_{k\in I(m)} S^{(m)}_k(t_j)\, S^{(m)}_k(t_l) \right).
\end{aligned}$$

Now we send $M \to \infty$ (the integrands have modulus one and $B^{(M)}_{t_j} \to B_{t_j}$ a.s., so dominated convergence applies on the left hand side) and apply (13.11) to obtain

$$\begin{aligned}
E\left[ \exp\left( i \sum_{j=1}^n \lambda_j (B_{t_j} - B_{t_{j-1}}) \right) \right]
&= E\left[ \exp\left( i \sum_{j=1}^n (\lambda_j - \lambda_{j+1}) B_{t_j} \right) \right] \\
&= \exp\left( -\frac 12 \sum_{j=1}^n \sum_{l=1}^n (\lambda_j - \lambda_{j+1})(\lambda_l - \lambda_{l+1})\, (t_j \wedge t_l) \right) \\
&= \exp\left( -\sum_{j=1}^{n-1} \sum_{l=j+1}^n (\lambda_j - \lambda_{j+1})(\lambda_l - \lambda_{l+1})\, t_j - \frac 12 \sum_{j=1}^n (\lambda_j - \lambda_{j+1})^2\, t_j \right) \\
&= \exp\left( -\sum_{j=1}^{n-1} (\lambda_j - \lambda_{j+1})\, \lambda_{j+1}\, t_j - \frac 12 \sum_{j=1}^n (\lambda_j - \lambda_{j+1})^2\, t_j \right) \\
&= \exp\left( -\frac 12 \sum_{j=1}^{n-1} (\lambda_j^2 - \lambda_{j+1}^2)\, t_j - \frac 12 \lambda_n^2\, t_n \right) = \prod_{j=1}^n \exp\left( -\frac 12 \lambda_j^2 (t_j - t_{j-1}) \right),
\end{aligned}$$

where we used $\sum_{l=j+1}^n (\lambda_l - \lambda_{l+1}) = \lambda_{j+1}$ and $(\lambda_j - \lambda_{j+1})\lambda_{j+1} + \frac 12 (\lambda_j - \lambda_{j+1})^2 = \frac 12 (\lambda_j^2 - \lambda_{j+1}^2)$.
14 Appendix

Let $\Omega$ be a set and $\mathcal P(\Omega)$ the collection of subsets of $\Omega$.

Definition 14.1 A system of sets $\mathcal R \subseteq \mathcal P(\Omega)$ is called a ring if it satisfies

- $\emptyset \in \mathcal R$,
- $A, B \in \mathcal R \implies A \setminus B \in \mathcal R$,
- $A, B \in \mathcal R \implies A \cup B \in \mathcal R$.

If additionally $\Omega \in \mathcal R$, then $\mathcal R$ is called an algebra.

Note that for $A, B \in \mathcal R$ their intersection $A \cap B = A \setminus (A \setminus B) \in \mathcal R$.

Definition 14.2 A system $\mathcal D \subseteq \mathcal P(\Omega)$ is called a Dynkin system if it satisfies

- $\Omega \in \mathcal D$,
- $D \in \mathcal D \implies D^c \in \mathcal D$,
- for every sequence $(D_n)_{n\in\mathbb N}$ of pairwise disjoint sets $D_n \in \mathcal D$, their union $\bigcup_n D_n$ is also in $\mathcal D$.

The following theorem holds:

Theorem 14.3 A Dynkin system $\mathcal D$ is a $\sigma$-algebra if and only if for any two $A, B \in \mathcal D$ we have

$$A \cap B \in \mathcal D.$$

Similar to the case of $\sigma$-algebras, for every system of sets $\mathcal E \subseteq \mathcal P(\Omega)$ there is a smallest Dynkin system $\mathcal D(\mathcal E)$ generated by (and containing) $\mathcal E$. The importance of Dynkin systems is mainly due to the following:

Theorem 14.4 For every $\mathcal E \subseteq \mathcal P(\Omega)$ with

$$A, B \in \mathcal E \implies A \cap B \in \mathcal E$$

we have

$$\mathcal D(\mathcal E) = \sigma(\mathcal E).$$

Definition 14.5 Let $\mathcal R$ be a ring. A function

$$\mu : \mathcal R \to [0, \infty]$$

is called a volume if it satisfies $\mu(\emptyset) = 0$ and

$$\mu\left( \bigcup_{i=1}^n A_i \right) = \sum_{i=1}^n \mu(A_i) \qquad (14.1)$$

for all pairwise disjoint sets $A_1, \dots, A_n \in \mathcal R$ and all $n \in \mathbb N$. A volume $\mu$ is called a pre-measure if

$$\mu\left( \bigcup_{i=1}^\infty A_i \right) = \sum_{i=1}^\infty \mu(A_i) \qquad (14.2)$$

for every pairwise disjoint sequence of sets $(A_i)_{i\in\mathbb N} \subseteq \mathcal R$ with $\bigcup_{i=1}^\infty A_i \in \mathcal R$. We will call (14.1) finite additivity and (14.2) $\sigma$-additivity.

A pre-measure on a $\sigma$-algebra $\mathcal A$ is called a measure.

Theorem 14.6 Let $\mathcal R$ be a ring and $\mu$ be a volume on $\mathcal R$. If $\mu$ is a pre-measure, then it is $\emptyset$-continuous, i.e. for all $(A_n)_n$, $A_n \in \mathcal R$, with $\mu(A_n) < \infty$ and $A_n \downarrow \emptyset$ it holds that $\lim_{n\to\infty} \mu(A_n) = 0$. If $\mathcal R$ is an algebra and $\mu(\Omega) < \infty$, then the reverse also holds: an $\emptyset$-continuous volume is a pre-measure.

Theorem 14.7 (Carathéodory) For every pre-measure $\mu$ on a ring $\mathcal R$ over $\Omega$ there is at least one way to extend $\mu$ to a measure on $\sigma(\mathcal R)$.

In the case that $\mathcal R$ is an algebra and $\mu$ is $\sigma$-finite (i.e. $\Omega$ is the countable union of subsets of finite measure), this extension is unique.