
M1 - Mathématiques Appliquées

à l’Économie et à la Finance
Erasmus Mundus
University Paris 1

Probability 1
Annie Millet

The aim of these lectures is to recall the basic notions of measure theory already seen in the
third year of Licence, concentrating on the tools that are constantly used by probabilists, as
well as the purely probabilistic notions. Most of the measure-theoretical results are stated
without proofs, except for the new ones.
We will develop in more details the conditional expectation (which is crucial in statistics,
in further Probability lectures on discrete and continuous time processes), Gaussian vectors
and convergence theorems.

[Figure: scatter plot; horizontal axis from −8 to 8, vertical axis from −6 to 6.]
Table of Contents

1. Reminders on the concepts of measure theory . . . . . . . . . . . 4


1.1. σ-algebra . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2. Measurable functions - Step functions . . . . . . . . . . . . . . . 5
1.3. Non Negative Measure . . . . . . . . . . . . . . . . . . . . . . 7
1.4. Integral of a non negative measurable function - Monotone Convergence Theorem . . . . 10
1.5. Integral of a measurable function of any sign . . . . . . . . . . . 12
1.6. Null sets - Lebesgue Theorems. . . . . . . . . . . . . . . . . . . 14
1.7. Spaces L1 , L2 , Lp and L∞ . . . . . . . . . . . . . . . . . . . . . . 16
1.8. Product Measure - Fubini’s Theorem . . . . . . . . . . . . . . . . 20
2. Probabilistic formulation . . . . . . . . . . . . . . . . . . . . . 23
2.1. Real Random Variables - Distribution - Distribution Function. . . . 23
2.2. Random Vectors - Change of variables. . . . . . . . . . . . . . . 32
2.3. Independence . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.4. Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.4.1. Method of inversion of the distribution function . . . . . . . . . . 37
2.4.2. Rejection method . . . . . . . . . . . . . . . . . . . . . . . 39
2.4.3. Simulation of Gaussian random variables . . . . . . . . . . . . 42
3. Conditional Expectation . . . . . . . . . . . . . . . . . . . . . . 43
3.1. Conditional Probability. . . . . . . . . . . . . . . . . . . . . . 43
3.2. Orthogonal Projection Theorem . . . . . . . . . . . . . . . . . 44
3.3. Construction of the conditional expectation E(X/G) . . . . . . . . 46
3.4. Conditional distribution and E(Y /X) . . . . . . . . . . . . . . . . 48
3.4.1. Definition . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.4.2. Case of a random variable X taking values in a finite or countable set 49
3.4.3. Case of the pair (X, Y ) with a density - Conditional Density . . . . 51
3.4.4. Application to simulation . . . . . . . . . . . . . . . . . . . 52
3.5. Properties of conditional expectation . . . . . . . . . . . . . . . 52
3.5.1. Different generalization of the properties of the expected value . . . . 53
3.5.2. Conditional expectation and independence . . . . . . . . . . . . 55
4. Characteristic function - Gaussian Vectors . . . . . . . . . . . . 57
4.1. Characteristic function of a real random variable . . . . . . . . . 57
4.2. Characteristic function of random vectors . . . . . . . . . . . . 60
4.3. Gaussian vectors . . . . . . . . . . . . . . . . . . . . . . . . . 62
5. Convergence Theorems . . . . . . . . . . . . . . . . . . . . . . . 68
5.1. Convergence in probability . . . . . . . . . . . . . . . . . . . . 68
5.2. Laws of large numbers . . . . . . . . . . . . . . . . . . . . . . 71
5.3. Convergence in distribution . . . . . . . . . . . . . . . . . . . . 73
5.3.1. Definition . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.3.2. Convergence in distribution and distribution function . . . . . . . 76
5.3.3. Convergence in distribution and characteristic function . . . . . . . 77
5.4. Central Limit Theorem . . . . . . . . . . . . . . . . . . . . . . 81
1 Reminders on the concepts of measure theory

The aim of this chapter is to give reminders on the general theory of integration that
will be used in probability. Only the few results which are « new » compared to the lectures
given in the third year of Licence will be proved.
Convention of notation. If A ⊂ X, we denote by Ac the complement of A in X.

1.1 σ-algebra

The information that we need for a qualitative model is given by a set Ω (which often
corresponds to all of the possible results of an experiment) and the subsets of Ω (which
are results of the experiment that have particular properties). However, it is necessary to
give the rules of manipulation for these subsets which will later allow the introduction of
quantitative information. We lightly touch on all of this in a more general setting. So as to
reserve the notation Ω for a probability space, we will refer to the reference space here as
X.

Definition 1.1 Let X be a space. A σ-algebra on X is a family X of subsets of X such that


1. ∅ ∈ X ,
2. For all A ∈ X we have Ac ∈ X ,
3. For all sequences (An , n ≥ 1) of elements of X , ∪n An ∈ X .
We say that (X, X ) is a measurable space and the elements of X are the measurable sets.

The following remark brings together the important properties of a σ-algebra X on X that
follow immediately from the preceding definition.

Remark 1.2 Let (X, X ) be a measurable space.


1. For all sequences (An , n ≥ 1) of elements of X , ∩n An ∈ X . Moreover, the upper limit lim sup_n An = ∩n (∪k≥n Ak) and the lower limit lim inf_n An = ∪n (∩k≥n Ak) of the sequence (An , n ≥ 1) are in X . They are, respectively, the set of elements x of X that belong to infinitely many of the sets An, and the set of elements that belong to all the sets An from a certain index on.
2. Every union (resp. intersection) of a finite number of elements of X belongs to X .
3. If A ∈ X , B ∈ X , A\B := A ∩ B c ∈ X .
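To illustrate the upper and lower limits on a simple case, consider the alternating sequence A_{2n} = A, A_{2n+1} = B for two measurable sets A and B: one checks that lim sup_n An = A ∪ B (the points belonging to infinitely many An), while lim inf_n An = A ∩ B (the points belonging to all the An from some index on).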

It is obvious that the intersection of a family of σ-algebras on X is also a σ-algebra on


X, and the set P(X) of all subsets of X is a σ-algebra on X. However, the union of two
σ-algebras is not necessarily a σ-algebra, as is seen in the following example: X = [0, 1[,
F1 = {∅, [0, 1/2[, [1/2, 1[, X}, F2 = {∅, [0, 1/4[, [1/4, 1[, X}. The set [0, 1/2[ ∩ [1/4, 1[ = [1/4, 1/2[ ∉ F1 ∪ F2. This
remark justifies the following definition.

Definition 1.3 For every family A of subsets of X there exists a smallest σ-algebra containing A, called the σ-algebra generated by A and written σ(A); it is the intersection of all
the σ-algebras that contain A.

The following examples of σ-algebras generated by a family of sets will be used often.



1. If A ⊂ X and A = {A}, the σ-algebra generated by A is σ(A) = {∅, A, Ac , X}.
2. If the subsets (An , n ∈ N) of X are pairwise disjoint and their union is X, (An , n ∈ N) is said
to be a partition of X. The σ-algebra generated by this partition is the set of all unions
(finite or countable) of elements of the partition.
3. If X = R and I denotes the family of intervals ] − ∞, t] for t ∈ R, the σ-algebra
generated by I is the Borel σ-algebra of R, denoted R. It contains all the open intervals,
closed intervals, half-open intervals, and all the open sets of R (which are the union
of a sequence of disjoint open intervals), all the closed sets of R,...
We also have R = σ(C) with C = {[a, b] : a ∈ R, b ∈ R, a ≤ b}, or
C = {]a, b[ : a ∈ R, b ∈ R, a < b}, or C = {] − ∞, a[ : a ∈ R}. However, it is not possible
to describe the Borel sets explicitly. There exist non-Borel sets,
but in practice all the sets that we construct « naturally » are Borel sets.
4. We write [0, +∞] = [0, +∞[ ∪ {+∞}. The σ-algebra B([0, +∞]) is the σ-algebra generated by the family A = {[a, b], 0 ≤ a ≤ b ≤ +∞}.
5. If X = Rd and P denotes the family of open rectangles of Rd (that is to say the sets Π_{i=1}^d ]ai , bi[ where
ai < bi for all i = 1, · · · , d), the σ-algebra generated by P is the σ-algebra of Borel sets
of Rd, denoted Rd.
6. If X is a topological space, the Borel σ-algebra on X is the σ-algebra denoted B(X)
generated by the family O of open sets of X. When X = Rd for the usual topology, we
find the preceding Borel σ-algebra, that is, B(Rd) = Rd.
7. Let (Y, Y) be a measurable space and f : X → Y be a mapping. The family of inverse
images of the elements of Y by f , which we write X = {A = f −1 (B) : B ∈ Y}, is a
σ-algebra on X. We write X = f −1 (Y).
The following example shows that if X is a σ-algebra on X, the set of images of
the elements of X by f is not necessarily a σ-algebra on Y: X = Y = R, f(x) = x²,
X = σ({[−1, 2]}) = {∅, [−1, 2], ]−∞, −1[ ∪ ]2, +∞[, R}. The set of images of these sets
by f is {∅, [0, 4], ]1, +∞[, [0, +∞[}, which is not a σ-algebra.

1.2 Measurable functions - Step functions

The following notion is that of mappings compatible with measurable spaces.

Definition 1.4 Let (X, X ) and (Y, Y) be measurable spaces. A map f : (X, X ) → (Y, Y) is
measurable if for all B ∈ Y, its preimage f⁻¹(B) belongs to X , which is written f⁻¹(Y) ⊂ X .

Example 1.5 1. Every constant mapping is measurable for any σ-algebra on the domain
and range.
2. If (X, X ) is a measurable space, for all A ∈ X , the indicator function of A, 1_A :
(X, X ) → (R, R), defined by 1_A(x) = 1 if x ∈ A and 0 otherwise, is measurable.

We remark that if Y = σ(A) is the σ-algebra generated by A, a mapping f : (X, X ) → (Y, Y)


is measurable if and only if f −1 (A) ⊂ X . We deduce that every mapping f , continuous from
a topological space X to a topological space Y, is measurable from (X, B(X)) into (Y, B(Y)).
Proposition 1.6 Let (Xi , Xi ), i = 1, 2, 3 be measurable spaces, f1 : (X1 , X1 ) → (X2 , X2 )
and f2 : (X2 , X2 ) → (X3 , X3 ) measurable mappings. Then f2 ◦ f1 : (X1 , X1 ) → (X3 , X3 ) is
measurable.



One immediately deduces the following corollary.
Corollary 1.7 Let (X, X ) be a measurable space, f, g be measurable functions from (X, X )
to (R, R). Then the functions |f |, |f |p for p ∈]0, +∞[, f + = sup(f, 0), f − = sup(−f, 0),
f + g, f g, sup(f, g), inf(f, g) are measurable. Furthermore, the sets {x ∈ X : f (x) = g(x)},
{x ∈ X : f (x) ≤ g(x)}, {x ∈ X : f (x) < g(x)}, {x ∈ X : f (x) ≥ g(x)} and {x ∈ X : f (x) >
g(x)} are measurable.
The following notion of a step function, which favors chosen values rather than the sets
where these values are taken, is central. The sets, although measurable, can be much more
complicated than intervals. This provides a great flexibility to the abstract measure theory,
compared with that of Riemann integrals.
Definition 1.8 Let (X, X ) be a measurable space. A function f : (X, X ) → R is a step
function if it is measurable (R being endowed with its Borel σ-algebra) and takes only a finite
number of values, i.e., if there exist an integer n ≥ 1 and, for all i = 1, · · · , n, sets Ai ∈ X
and constants ci ∈ R such that

f = Σ_{i=1}^n ci 1_{Ai}.    (1.1)

We remark that if we take the sets Ai pairwise disjoint, the constants ci are unique. The
step functions are stable by sum, product, supremum, infimum.

The following theorem is crucial in the theory ; it allows us to approximate measurable


functions by step functions.
Theorem 1.9 (i) Let (X, X ) be a measurable space and f : (X, X ) → ([0, +∞], B([0, +∞])) a
measurable function. Then there exists an increasing sequence (fn , n ≥ 1) of step functions,
with non negative values, which converges pointwise to f.
(ii) Every finite (resp. bounded) measurable function f : (X, X ) → R is the pointwise limit
(resp. the uniform limit) of a sequence of step functions.
Proof.
(i) For all n ≥ 1 and every i = 0, · · · , n2^n − 1, let

A_{n,i} = {x ∈ X : f(x) ∈ [i 2^{-n}, (i + 1) 2^{-n}[}   and   B_n = {x ∈ X : f(x) ≥ n}.

The functions

f_n = Σ_{i=0}^{n2^n − 1} i 2^{-n} 1_{A_{n,i}} + n 1_{B_n}

are non negative step functions, the sequence (f_n , n ≥ 1) is increasing, and for all x, (f_n(x), n ≥ 1) converges to f(x).
(ii) If f takes negative values, we use (i) to approximate f^+ = sup(f, 0) and f^- =
sup(−f, 0), and we obtain a sequence of step functions whose differences converge pointwise to
f. If the function f is also bounded, the sequences that approximate f^+ and f^- converge
uniformly and their differences then converge uniformly to f = f^+ − f^-. □
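As a concrete illustration of this construction (with X = [0, 1], X = B([0, 1]) and f(x) = x), for every n ≥ 2 we have B_n = ∅ and

f_n(x) = 2^{-n} ⌊2^n x⌋,   so that   0 ≤ f(x) − f_n(x) ≤ 2^{-n} for all x ∈ [0, 1];

the sequence (f_n) is indeed increasing and converges to f, here even uniformly.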
Convention of notation. In the following, if f : X → R and A ⊂ R, we will write

{f ∈ A} := {x ∈ X : f(x) ∈ A} = f⁻¹(A).



In the same way, we will write {f ≤ t} = {x ∈ X : f(x) ≤ t}, {f = a} = {x ∈ X : f(x) = a}, ...
If X and Y are topological spaces, X = B(X) and Y = B(Y) denote their Borel σ-algebras,
a mapping f : X → Y measurable from (X, X ) into (Y, Y) is called a Borel map.

1.3 Non Negative Measure

The qualitative description of the results of an experiment is insufficient. It must be


completed by quantitative information, that is to say that every set of the σ-algebra must
be assigned a non negative number, its « measure ».
Definition 1.10 Let (X, X ) be a measurable space. A non negative measure on (X, X ) is a
mapping µ : X → [0, +∞] such that
1. µ(∅) = 0.
2. If (An , n ≥ 1) is a sequence of pairwise disjoint measurable sets, µ(∪n An) = Σ_n µ(An).
We say that (X, X , µ) is a measure space.
If µ(X) < +∞, we say that the measure µ is finite. If µ(X) = 1, we say that µ is a
probability. If there exists an increasing sequence (Xn , n ≥ 1) of measurable sets such that
∪n Xn = X and µ(Xn ) < +∞ for every integer n, we say that µ is σ-finite.
We remark that we allow µ(A) = +∞ with the following algebraic conventions : a+(+∞) =
+∞, (+∞) + (+∞) = +∞.
Convention When the measure µ is a probability on a measurable space (Ω, F ) that describes the results of a random experiment, we will usually write P and we say that the space
(Ω, F , P ) is a probability space.
The following examples of measure are fundamental.
Example 1.11 1. The Dirac measure at point a ∈ X is the probability δa on X defined
by δa (B) = 1 if a ∈ B and δa (B) = 0 if a 6∈ B.
2. The counting measure on (N, P(N)) is the measure µ defined by µ({n}) = 1 for every integer
n ≥ 0. For all A ⊂ N, µ(A) = |A|, where |A| denotes the number of elements of A. It
is a σ-finite measure, but not a finite measure.
3. If (an , n ≥ 0) is a sequence of elements of X and (αn , n ≥ 0) is a sequence of non negative real numbers, the measure µ = Σ_n αn δ_{an} is σ-finite. When Σ_n αn = 1, it is a
probability. The counting measure corresponds to the case where X = N, an = n and
αn = 1.
The following proposition gives some of the usual properties of non negative measures.
Its proof is left as an exercise.
Proposition 1.12 Let (X, X , µ) be a measure space.
(i) For all A ∈ X , B ∈ X , µ(A ∪ B) + µ(A ∩ B) = µ(A) + µ(B). If moreover µ(A ∩ B) < ∞,
then µ(A ∪ B) = µ(A) + µ(B) − µ(A ∩ B).
(ii) For every sequence (An , n ≥ 0) of measurable sets, µ(∪n An) ≤ Σ_n µ(An).
(iii) For every increasing sequence (An , n ≥ 0) of measurable sets (An ⊂ An+1), µ(∪n An) = limn µ(An).
(iv) For every decreasing sequence (An , n ≥ 0) of measurable sets such that µ(A0) < ∞,
µ(∩n An) = limn µ(An).



The following example shows that in the point (iv), the property µ(A0 ) < ∞ is crucial.
Let X = N, and µ be the counting measure on X = P(N), and for all n ≥ 0, An = {k ∈ N :
k ≥ n}. Then µ(An ) = +∞ for every integer n while ∩n An = ∅.
The following theorem is fundamental.

Theorem 1.13 There exists a unique non negative measure λ on (R, R), called the Lebesgue
measure, such that
(i) λ([a, b]) = b − a for all a ≤ b.
(ii) λ is invariant by translation, that is to say that for all a ∈ R and all A ∈ R,
λ(a + A) = λ(A).
(iii) The measure λ is σ-finite.

We may characterize λ by requiring in point (i) only that λ([0, 1]) = 1. This theorem is the
reason why the « good » σ-algebra on R is the Borel σ-algebra and not P(R). The proof
of the existence of this measure is long and its arguments are not much used in the rest of
the document. The proof of the uniqueness uses a result which helps in a number of contexts
where we wish to establish a uniqueness result; it relies on the following theorem.

Theorem 1.14 (Monotone Class Theorem or Dynkin’s Theorem ) Let C be a family of


subsets of X that is stable by finite intersection, that is to say such that if A ∈ C and B ∈ C,
then A ∩ B ∈ C. Let B be a family of subsets which contains C such that
(i) X ∈ B
(ii) If A ∈ B and B ∈ B, A ⊂ B implies B\A ∈ B.
(iii) If An ⊂ An+1 is an increasing sequence of elements of B, then ∪n An ∈ B.
Then, the σ-algebra σ(C) generated by C is contained in B.

Proof. We will say that a λ-system is a family of subsets of X that satisfies the conditions
(i)-(iii) of the theorem. A σ-algebra is clearly a λ-system. We write λ(C) as the intersection
of all the λ-systems that contain C ; we then deduce that λ(C) is contained in B. To check
that λ(C) contains σ(C), it suffices to prove that λ(C) is a σ-algebra.
It is clear that λ(C) satisfies the properties (1) and (2) of the definition 1.1 ; it suffices
then to prove that λ(C) is stable by countable union. We first show that λ(C) is stable by
finite intersection. For all A ∈ C, we denote

Π(A) = {B ∈ λ(C) : A ∩ B ∈ λ(C)}.

It is easy to show that C ⊂ Π(A) ⊂ λ(C). To prove that λ(C) = Π(A), it then suffices
to show that Π(A) is a λ-system. First of all, A ∩ X = A ∈ C ⊂ λ(C), so that X ∈ Π(A).
Let B1 ⊂ B2 be elements of Π(A). Then B2\B1 ∈ λ(C) and (B2\B1) ∩ A = (B2 ∩ A)\(B1 ∩ A) ∈ λ(C),
since it is the difference of two elements of λ(C), one contained in the other; hence B2\B1 ∈ Π(A). Finally, let (Bn , n ≥ 0) be an
increasing sequence of elements of Π(A). Then (∪n Bn) ∩ A = ∪n (Bn ∩ A), where (Bn ∩ A)
is an increasing sequence of elements of λ(C). We deduce that ∪n Bn ∈ Π(A). Hence
we have shown that λ(C) = Π(A), that is to say that for all A ∈ C and B ∈ λ(C), we have
A ∩ B ∈ λ(C).
Now let A ∈ λ(C). The previous reasoning shows that C ⊂ Π(A) and, in the same way, that
Π(A) = λ(C). We deduce that the family λ(C) is stable by finite intersection. The stability
by complementation then implies that λ(C) is also stable by finite union. Finally,
let (An , n ≥ 1) be a sequence (not necessarily increasing) of elements of λ(C). For all



n ≥ 1, we write Bn = ∪_{1≤k≤n} Ak ∈ λ(C) (stability by finite union). Then the sequence (Bn) is increasing and by construction
∪n Bn = ∪n An, so that ∪n An ∈ λ(C) by (iii), which completes the proof. □

We immediately deduce the following result from the theorem :


Theorem 1.15 Let C be a family of subsets of X stable by finite intersection, and let P1 and P2 be
probabilities on the σ-algebra σ(C). If P1(A) = P2(A) for all A ∈ C, then P1 = P2.
Proof. Let B = {B ∈ σ(C) : P1(B) = P2(B)} ⊂ σ(C). By hypothesis C ⊂ B, and we
easily check that B satisfies the properties (i)-(iii) of Theorem 1.14. We deduce from that
theorem that σ(C) ⊂ B, which completes the proof. □
The proof of the following corollary is left to the reader.
Corollary 1.16 Let C be a family of subsets of X that is stable by finite intersection. Let µ1
and µ2 be measures on σ(C) that agree on C and are σ-finite on C, that is to say that there
exists an increasing sequence of sets An ∈ C such that ∪n An = X and µ1(An) = µ2(An) < +∞
for all n ≥ 0. Then µ1 = µ2.
The previous corollary clearly gives the uniqueness of the Lebesgue measure on R. Indeed,
it suffices to take C = {[a, b] : a ≤ b}.
We remark that if λ denotes the Lebesgue measure on (R, R), then for all real numbers
x, λ({x}) = 0, and for every countable set A (for example A = N, A = Q), λ(A) = 0. Likewise,
for a, b ∈ R with a < b, λ(]a, b]) = λ([a, b]) = λ([a, b[) = λ(]a, b[) = b − a.
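These facts can be checked directly: for every x ∈ R and every ε > 0, {x} ⊂ [x, x + ε], so λ({x}) ≤ ε and hence λ({x}) = 0; if A = {xn, n ∈ N} is countable, the σ-subadditivity (point (ii) of Proposition 1.12) gives λ(A) ≤ Σ_n λ({xn}) = 0.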

Proposition 1.17 Let λ be the Lebesgue measure on R. For every integer d ≥ 2, the measure
λd = λ^{⊗d} on the Borel σ-algebra Rd is the unique measure on Rd such that, for ai < bi,
i = 1, · · · , d,

λd( Π_{i=1}^d ]ai, bi[ ) = Π_{i=1}^d (bi − ai).

The following result is a « functional » version of the Monotone Class Theorem ; its proof
is left as an exercise.
Theorem 1.18 Let H be a vector subspace of the set of bounded functions from Ω to R
such that
(i) the constant functions belong to H.
(ii) If (hn ) is a sequence of elements of H that converge uniformly to h, then h ∈ H.
(iii) If (hn ) is an increasing sequence of non negative functions of H such that the
function h = supn hn is bounded, then h ∈ H.
Let C be a subset of H that is stable by multiplication. Then H contains all the bounded
measurable functions from (Ω, σ(C)) to (R, R).
By construction, λ2 ({x} × R) = λ2 (R × {x}) = 0 and, more generally, every line is a null
set for the Lebesgue measure λ2. In the same way, every vector (or affine) subspace of
Rd of dimension strictly less than d is a null set for the Lebesgue measure λd. The following
theorem allows the transfer of a measure from the domain space to the range space by a
measurable function.

Theorem 1.19 Let (X, X ) and (Y, Y) be measurable spaces, f : (X, X ) → (Y, Y) a measurable
function and µ a non negative measure on X . The function ν : Y → [0, +∞] defined by
ν(B) = µ(f⁻¹(B)) is a measure on Y, called the image measure (or pushforward) of µ by f.
We often use the notation ν = µ ◦ f⁻¹ or ν = µf. Its total mass is equal to that of µ.



1.4 Integral of a non negative measurable function - Monotone Convergence Theorem

In the rest of this section, except where the contrary is explicitly noted, (X, X , µ) denotes
a measure space. We also use the systematic convention that 0 × (+∞) = 0.
We begin by defining the integrals of the « simplest » functions, that is to say the indicator
functions, requiring that the integral of 1_A with respect to µ be µ(A). To preserve
linearity, it is then natural to adopt the following definition.
Definition 1.20 Let f = Σ_{i=1}^n αi 1_{Ai} be a non negative step function. Using the convention
0 × (+∞) = 0, we define the integral of f with respect to µ by

∫ ( Σ_{i=1}^n αi 1_{Ai} ) dµ := Σ_{i=1}^n αi µ(Ai).    (1.2)

We readily notice that the preceding definition of ∫ f dµ is independent of the decomposition
of the step function f as a linear combination of indicator functions. We will suppose in the
following that the sets Ai are pairwise disjoint. We easily verify the following proposition:

Proposition 1.21 Let f and g be non negative step functions, c a non negative real number.
(i) ∫ (f + cg) dµ = ∫ f dµ + c ∫ g dµ.
(ii) If 0 ≤ f ≤ g, then 0 ≤ ∫ f dµ ≤ ∫ g dµ.

For example, if µ is the counting measure on P(N) and if fn(i) = i for all i ≤ n and
fn(i) = n for all i > n, then ∫ fn dµ = +∞, while if gn(i) = i for all i ≤ n and gn(i) = 0 for all
i > n, then ∫ gn dµ = Σ_{i=0}^n i = n(n+1)/2.
If λ denotes the Lebesgue measure on R, and if f : R → [0, +∞[ is defined by f(x) = 1
if x ∈ [0, 1[, f(x) = 2 if x ∈ [1, 4] and f(x) = 0 if x ∉ [0, 4], then ∫ f dλ = 1 · λ([0, 1[) + 2 · λ([1, 4]) = 1 + 6 = 7.
We then define the integral of a non negative measurable function by using point (ii) of Proposition 1.21.

Definition 1.22 Let f : (X, X ) → ([0, +∞], B([0, +∞])) be a non negative measurable
function. We then define the integral of f with respect to µ by
∫ f dµ = sup { ∫ ϕ dµ : 0 ≤ ϕ ≤ f, ϕ a step function }.    (1.3)

Notation When several spaces or several measures are in play, the preceding integral will
sometimes be denoted ∫_X f(x) dµ(x) or ∫_X f(x) µ(dx) instead of ∫_X f dµ or ∫ f dµ.

Although the definition (1.3) of the integral of a non negative measurable function is
« intrinsic », it is rarely useful for computing this integral. Theorem 1.9 gave a constructive
procedure for approximating a non negative measurable function by an increasing sequence of step
functions. The following theorem is one of the most fundamental ones of the theory. It
allows, by passage to the increasing limit, the concrete computation of integrals of non
negative measurable functions.




Theorem 1.23 (Monotone Convergence Theorem) Let (fn , n ≥ 0) be an increasing sequence of
non negative measurable functions from (X, X ) to ([0, +∞], B([0, +∞])). Then
f = limn fn : X → [0, +∞] is measurable from (X, X ) to ([0, +∞], B([0, +∞])) and

∫ (limn fn) dµ = limn ∫ fn dµ.    (1.4)

The following examples demonstrate that the monotonicity of the sequence (fn ) is crucial,
either for bounded sequences if the measure is infinite, or if the measure is finite with
unbounded sequences.

Example 1.24 (i) We consider the measure space (R, R, λ), where λ denotes the Lebesgue
measure. For n ≥ 0 let fn = 1_{[n,n+1[}. Then the sequence (fn) converges pointwise to 0, but
∫ fn dλ = 1 for every integer n. Thus, for a set of infinite measure, it does not suffice that
the sequence (fn) be bounded to exchange limit and integral.
(ii) We consider the measure space ([0, 1], B([0, 1]), λ), where λ denotes the restriction
of the Lebesgue measure to [0, 1]. Then for all α > 0, the sequence (fn = n^α 1_{]0, 1/n]}, n ≥ 1)
converges pointwise to 0, but ∫ fn dλ = 1 for every integer n if α = 1, limn ∫ fn dλ = 0 if
0 < α < 1 and ∫ fn dλ tends to +∞ if α > 1. Again, it is not sufficient for the measure to
be a probability in order to exchange limit and integral.
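The computations behind these two statements are immediate: in (i), ∫ fn dλ = λ([n, n+1[) = 1 for every n, while in (ii), ∫ fn dλ = n^α λ(]0, 1/n]) = n^{α−1}, which tends to 0, equals 1, or tends to +∞ according as α < 1, α = 1 or α > 1.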

We immediately deduce from the Monotone Convergence Theorem 1.23 and from Theorem
1.9 the following results.

1. If f : (X, X ) → ([0, +∞], B([0, +∞])) is a non negative measurable function, then for all
a ∈ X, ∫ f dδa = f(a).
2. If f : N → [0, +∞], then f is measurable for the σ-algebra P(N), and if µ denotes the
counting measure on P(N),

∫ f dµ = Σ_{n≥0} f(n).

The evaluation of a function and the sum of a series of non negative terms then appear as
particular cases of the integrals of functions for the appropriate measures.
The following proposition gathers some properties of the integral of a non negative
measurable function.

Proposition 1.25 Let f and g be non negative measurable functions from (X, X ) to ([0, +∞], B([0, +∞])). Then
(i) If 0 ≤ f ≤ g, we have 0 ≤ ∫ f dµ ≤ ∫ g dµ.
(ii) If ∫ f dµ = 0, we have µ(f ≠ 0) = 0.
(iii) If ∫ f dµ < +∞, we have µ(f = +∞) = 0.
(iv) (Markov Inequality) For every number a ∈ ]0, +∞[,

µ(f ≥ a) ≤ (1/a) ∫ f dµ.    (1.5)
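The Markov inequality (1.5) follows from a one-line argument: since f is non negative, a 1_{{f ≥ a}} ≤ f pointwise, and integrating this inequality (using point (i) and Definition 1.20) gives a µ(f ≥ a) ≤ ∫ f dµ.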

Finally, the Monotone Convergence Theorem allows the construction of non negative
measures from other non negative measures; the proof of the following theorem is left as an exercise.




Theorem 1.26 Let (X, X , µ) be a measure space, f : (X, X ) → ([0, +∞], B([0, +∞])) a
non negative measurable function. Then the mapping ν : X → [0, +∞] defined by
ν(A) = ∫ (f 1_A) dµ ,   ∀A ∈ X ,

is a non negative measure. We say that the measure ν has density f with respect to the
measure µ.
We immediately see that if ν has density f with respect to µ, the total mass of ν is equal
to ν(X) = ∫ f dµ; the measure ν is thus a probability if ∫ f dµ = 1.
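As a simple illustration, take µ = λ the Lebesgue measure on R and f = 1_{[0,1]}; the measure ν of density f with respect to λ is ν(A) = λ(A ∩ [0, 1]) for A ∈ R, and since ∫ f dλ = 1, ν is a probability (the uniform probability on [0, 1]).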

1.5 Integral of a measurable function of any sign

Again in this section (X, X , µ) denotes a measure space. In order to preserve the linearity
of the integral, since we can write f = f^+ − f^-, where f^+ = f ∨ 0 and f^- = (−f) ∨ 0, it
is tempting to define the integral of f as the difference of the integrals of f^+ and of f^-.
However, we cannot adopt this definition in full generality; indeed, we would have
to give a meaning to the expression (+∞) − (+∞). This leads to the following definition,
based on the fact that |f| = f^+ + f^-.
Definition 1.27 A measurable function f : (X, X ) → (R, R) is µ-integrable (or integrable
with respect to the measure µ) if ∫ |f| dµ < +∞. We write L1(µ) for the collection of
µ-integrable functions, and if f ∈ L1(µ) we set ∫ f dµ = ∫ f^+ dµ − ∫ f^- dµ and ‖f‖_1 = ∫ |f| dµ.
In the previous definition we are assured that the integrals of f^+ and of f^- are both finite.
Since we can give a meaning to a − b when at most one of the two terms of the difference is +∞,
we can also define the semi-integrable functions by requiring only that one of the integrals
∫ f^+ dµ or ∫ f^- dµ be finite; we again set ∫ f dµ = ∫ f^+ dµ − ∫ f^- dµ ∈ [−∞, +∞].
Remark 1.28 If f = f1 − f2 where f1 and f2 are non negative integrable measurable functions,
we have f^+ ≤ f1 and f^- ≤ f2, and (f1 − f^+) − (f2 − f^-) = 0. We deduce that
∫ f dµ = ∫ f1 dµ − ∫ f2 dµ, which proves the linearity of the integral.
Proposition 1.29 Let f and g be µ-integrable functions, α and β be real numbers. Then
αf + βg is µ-integrable and ∫ (αf + βg) dµ = α ∫ f dµ + β ∫ g dµ.

In the particular case (X, X ) = (R, R), the following proposition connects the integral
of a function f with respect to the Lebesgue measure λ and the Riemann integral of f . It
is a valuable tool for effectively calculating integrals with respect to the Lebesgue measure.
Proposition 1.30 (i) Every function f that is Riemann integrable on the closed, bounded
interval [a, b] is integrable for the Lebesgue measure λ and
∫ (f 1_{[a,b]}) dλ = ∫_a^b f(x) dx.    (1.6)

(ii) Every function f admitting an absolutely convergent generalized Riemann integral on


an open or semi-open interval with endpoints a, b ∈ [−∞, +∞] is integrable for the Lebesgue
measure on that interval and the equation (1.6) is also satisfied.




However, measure theory cannot accommodate functions whose generalized Riemann integral
converges without converging absolutely, as the following example shows. The function
f : R → R defined by f(x) = (sin(x)/x) 1_{[1,+∞[}(x) is not integrable (nor even semi-integrable)
for the Lebesgue measure, although the Riemann integral ∫_{−∞}^{+∞} f(x) dx converges;
it does not converge absolutely.
Let a ∈ X; every measurable function f : (X, X ) → (R, R) is integrable for the Dirac
measure δa and ∫ f dδa = f(a).
A function f : N → R (also called a sequence) is integrable for the counting measure µ
on P(N) if and only if the series with general term f(n) is absolutely convergent, that is to
say Σ_n |f(n)| < +∞. In this case, ∫ f dµ = Σ_n f(n). The function f : N → R defined
by f(n) = (−1)^n/(n+1) is thus not integrable for the counting measure, although the alternating
series Σ_n f(n) is convergent.

The following theorems are constantly used in probability theory. The first one links the
integrals of a function g with respect to the measure µ and with respect to the measure ν
of density f with respect to µ.

Theorem 1.31 Let (X, X , µ) be a measure space, f : (X, X ) → ([0, +∞], B([0, +∞])) a
non negative measurable function and ν the measure of density f with respect to µ defined
by Theorem 1.26. Then
(i) For every non negative measurable function ϕ : (X, X ) → ([0, +∞], B([0, +∞])),

∫ ϕ dν = ∫ (f ϕ) dµ.    (1.7)

(ii) The measurable function ϕ : (X, X ) → (R, R) is ν-integrable if and only if the
function f ϕ is µ-integrable, and if the functions are integrable the equality (1.7) remains
valid.
Proof. (i) By definition, for every set A ∈ X , ∫ 1_A dν = ν(A) = ∫ (f 1_A) dµ, and the equation
(1.7) is then true when ϕ = 1_A. By linearity of the integral, the equation (1.7) remains true
when ϕ is a non negative step function. Let ϕ be a non negative measurable function and
(ϕn) an increasing sequence of non negative step functions that converges pointwise to ϕ
(its existence is ensured by Theorem 1.9). Because f is non negative, the sequence (f ϕn)
is increasing and converges pointwise to f ϕ. The Monotone Convergence Theorem 1.23, applied
to the sequence (ϕn) with respect to the measure ν and to the sequence (f ϕn) with respect
to the measure µ, yields that the equation (1.7) is true for every non negative measurable
function ϕ, which proves (i).
(ii) Let ϕ : (X, X ) → (R, R) be a measurable function; then |ϕ| is a non negative
measurable function and (1.7) shows that ϕ ∈ L1(ν) if and only if f ϕ ∈ L1(µ). Moreover,
if ϕ ∈ L1(ν), then ϕ^+ and ϕ^- are non negative, measurable, and belong to L1(ν). Using (i),
we deduce that f ϕ^+ and f ϕ^- are non negative measurable functions that belong to L1(µ)
and satisfy f ϕ = f ϕ^+ − f ϕ^-. Remark 1.28 and point (i) allow us to deduce that
the equality (1.7) is true for ϕ, which ends the proof of (ii). □

The next theorem links the integral of ϕ ◦ f with respect to µ and the integral of ϕ with
respect to the image measure of µ by f, denoted µ ◦ f⁻¹. Its proof is left as an exercise.




Theorem 1.32 (Theorem of the image measure) Let (X, X ) and (Y, Y) be measurable
spaces, µ a non negative measure on X and ν = µ ◦ f −1 the image measure of µ by f
defined on Y in Theorem 1.19. Then
(i) For every non negative measurable function ϕ : (Y, Y) → ([0, +∞], B([0, +∞])),

∫_Y ϕ d(µ ◦ f⁻¹) = ∫_X (ϕ ◦ f) dµ.    (1.8)

(ii) For every measurable function ϕ : (Y, Y) → (R, R), ϕ is µ ◦ f −1 -integrable if and
only if ϕ ◦ f is µ-integrable. In that case, the equality (1.8) remains true.
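As a simple illustration, take X = [0, 1] with µ the restriction of the Lebesgue measure, Y = R and f(x) = x². Then (1.8) reads ∫_R ϕ d(µ ◦ f⁻¹) = ∫_0^1 ϕ(x²) dx for every non negative Borel function ϕ; in particular, (µ ◦ f⁻¹)([0, t]) = µ({x ∈ [0, 1] : x² ≤ t}) = √t for t ∈ [0, 1].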

1.6 Null sets - Lebesgue Theorems

The Monotone Convergence Theorem leads to the following result, which is used to show
that functions are integrable.

Theorem 1.33 (Fatou's Lemma) Let (fn , n ≥ 1) be a sequence of non negative measurable
functions. Then

0 ≤ ∫ (lim inf_n fn) dµ ≤ lim inf_n ∫ fn dµ ≤ +∞.

We immediately deduce that if sup_n ∫ |fn| dµ < +∞ and if the sequence (fn) converges
pointwise to f, then f is µ-integrable.
We seek to extend the Monotone Convergence Theorem in several ways: on one hand, by
weakening the notion of pointwise convergence of the sequence (fn), and on the other
hand by not requiring that the sequence be increasing and non negative. In this section (X, X , µ)
is a measure space.

Definition 1.34 (i) A measurable set A is a µ-null set if µ(A) = 0. When no confusion
concerning the measure µ is possible, we say that a µ-null set is a null set.
(ii) A property is true µ almost everywhere if it is true on the complement of a µ-null
set.
(iii) A function f : X → R is µ-null if it is zero µ-almost everywhere.

Thus, the set Q of rational numbers is a null set for the Lebesgue measure λ on R, and {x} × R
is a null set for the Lebesgue measure λ2 on R2. The sequence of functions fn : [0, 1] → R
defined by fn(x) = n 1_{[0, 1/n]}(x) converges to 0 λ-almost everywhere.
If two non negative measurable (or integrable) functions f and g are such that f = g µ-almost
everywhere, then ∫ f dµ = ∫ g dµ.

Proposition 1.35 (i) A function f ∈ L1(µ) (that is to say µ-integrable) is finite µ-almost
everywhere.
(ii) A function f is µ-null if and only if ∫ |f| dµ = 0.
(iii) An integrable function f is µ-null if and only if ∫_A f dµ = 0 for every set A ∈ X .

Proof. The Markov inequality (1.5) shows that for every integer n ≥ 1,

µ(|f| ≥ n) ≤ (1/n) ∫ |f| dµ   and   µ(|f| ≥ 1/n) ≤ n ∫ |f| dµ.




(i) The sequence of sets ({|f| ≥ n}, n ≥ 1) is decreasing, µ(|f| ≥ 1) < +∞ and ∩n {|f| ≥ n} =
{|f| = +∞}. Using property (iv) of Proposition 1.12 and the first inequality above, we deduce that µ(|f| = +∞) =
limn µ(|f| ≥ n) = 0.
(ii) If ∫ |f| dµ = 0, the sequence of sets ({|f| ≥ 1/n}, n ≥ 1) is increasing and µ(|f| ≥ 1/n) = 0 for
all n. Point (iii) of Proposition 1.12 shows that µ(f ≠ 0) = µ(∪n {|f| ≥ 1/n}) = 0. The
converse is evident.
(iii) If f is µ-null, then for all A ∈ X , |∫_A f dµ| ≤ ∫ |f| dµ = 0. Conversely, for every
integer n ≥ 1, let An = {f ≥ 1/n} and Bn = {f ≤ −1/n}. Then 0 ≤ (1/n) µ(An) ≤ ∫_{An} f dµ = 0,
so µ(An) = 0 and in the same way µ(Bn) = 0. Since the sequence (An ∪ Bn , n ≥ 1) is
increasing and {f ≠ 0} = ∪n (An ∪ Bn), we have µ(f ≠ 0) = 0. □
The following result is the second « classical » theorem that allows us to exchange limit
and integral for a sequence of functions.
Theorem 1.36 (Dominated Convergence Theorem) Let (fn , n ≥ 1) be a sequence of mea-
surable functions from (X, X ) to (R, R) such that
(i) The sequence (fn ) converges to f µ-almost everywhere.
(ii) There exists a function g ∈ L1 (µ) such that for every n ≥ 1, |fn | ≤ g µ-almost
everywhere.
Then limn ∫ fn dµ = ∫ f dµ. Furthermore, we have the stronger result: limn ∫ |fn − f| dµ = 0.
We see here that these results on the convergence of sequences of integrals improve the
corresponding ones in the framework of Riemann integrals. Indeed, in that framework one
has to impose that the sequence of Riemann integrable functions (fn) converges uniformly to f.
(fn ) converges uniformly to f . Furthermore, the hypothesis of « domination » in part (ii)
of the Dominated Convergence Theorem may be satisfied without uniform convergence, as
shown in the next example. We let X = [0, 1] endowed with the Borel σ-algebra and the
Lebesgue measure λ (restricted to [0,1]). For every n ≥ 1, x ∈ [0, 1] set
fn(x) = min( e^{−nx²}/√x , n ).

The sequence (fn) of continuous (and thus Borel) functions is such that (fn(x), n ≥ 1)
converges to 0 for all x ∈ ]0, 1]. The sequence (fn) thus converges to 0 λ-almost everywhere.
Moreover, |fn(x)| ≤ g(x) with g(x) = 1/√x, and since ∫_0^1 (1/√x) dx = 2 < +∞, g is λ-integrable
on [0, 1]. The Dominated Convergence Theorem 1.36 thus applies and limn ∫ fn dλ = 0, while
the sequence (fn) does not converge uniformly to 0.
We immediately deduce the following corollary, which allows the interchange of series and integral.
Corollary 1.37 Let (gn , n ≥ 1) be a sequence of measurable functions from (X, X ) to
(R, R).
(i) If the functions gn are non negative for every integer n, then

∫ ( Σ_n gn ) dµ = Σ_n ∫ gn dµ.    (1.9)

(ii) If Σ_n ∫ |gn| dµ < +∞, then the functions gn, Σ_n |gn| and the function defined
almost everywhere by Σ_n gn are µ-integrable. Furthermore, the equality (1.9) is also true.




Proof. (i) We apply the Monotone Convergence Theorem 1.23 to the increasing sequence
of partial sums fn = Σ_{k=1}^n gk and conclude by linearity of the integral.
(ii) Let g = Σ_n |gn|. From (i), g ∈ L1(µ), so g < +∞ µ-almost everywhere and the
series with general term gn(x), which is almost everywhere absolutely convergent, is almost
everywhere convergent. On the null set where Σ_n gn is not defined, we give it an
arbitrary value, for example 0. The sequence of partial sums fn = Σ_{k=1}^n gk then converges
µ-almost everywhere to Σ_n gn and |fn| ≤ g for all n. The Dominated Convergence Theorem
and the linearity of the integral conclude the proof. □
The following theorems give sufficient conditions for continuity and differentiability of
integrals depending on a parameter. Since these properties may be characterized
by the convergence of sequences, they are immediate consequences of the Dominated
Convergence Theorem and their proofs are left as an exercise.

Theorem 1.38 Let I be an open set of Rk , f : Rd × I → R a Borel function and µ a non


negative measure on Rd . We suppose that :
(i) For all t ∈ I, the function x ∈ Rd → f (x, t) is µ-integrable.
(ii) For µ-almost all x ∈ Rd the function t ∈ I → f (x, t) is continuous.
(iii) For all t0 ∈ I there exist a ball V centered at t0 and contained in I and a function
g ∈ L1(µ) such that |f(x, t)| ≤ g(x) for µ-almost all x and for all t ∈ V.
Then the function t ∈ I → ∫_{Rd} f(x, t) dµ(x) is continuous on I.

Theorem 1.39 Let I be an open set of Rk , f : Rd × I → R a Borel function and µ a non


negative measure on Rd . We fix j ∈ {1, · · · , k} and suppose that :
(i) For all t ∈ I, the function x ∈ Rd → f (x, t) is µ-integrable.
(ii) For µ-almost all x ∈ Rd the function t ∈ I → f(x, t) admits a partial derivative ∂f/∂tj (x, t).
(iii) For all t0 ∈ I there exist a ball V centered at t0 and contained in I and a function
g ∈ L1(µ) such that |∂f/∂tj (x, t)| ≤ g(x) for µ-almost all x and for all t ∈ V.
Then the function t ∈ I → F(t) = ∫_{Rd} f(x, t) dµ(x) admits a partial derivative ∂F/∂tj (t) on I
and

∂F/∂tj (t) = ∫_{Rd} ∂f/∂tj (x, t) dµ(x).
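As a simple illustration (with d = k = 1, I = R and µ the restriction of the Lebesgue measure to [0, 1], so that the domination in (iii) is easy to check): for f(x, t) = e^{tx}, one has |∂f/∂t (x, t)| = |x| e^{tx} ≤ e^{|t0|+1} =: g(x) for all t in the ball of radius 1 around t0, and g is constant, hence λ-integrable on [0, 1]. The theorem then gives F'(t) = ∫_0^1 x e^{tx} dx for F(t) = ∫_0^1 e^{tx} dx, which can also be checked directly since F(t) = (e^t − 1)/t for t ≠ 0.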

1.7 Spaces L1 , L2 , Lp and L∞

The goal of this section is to study the functions whose pth power is integrable, as
generalizations of the space L1 . Again we take a measure space (X, X , µ).

Definition 1.40 For every real p ∈ ]0, +∞[, let Lp(X, X , µ) be the set of measurable functions
from (X, X ) to (R, R) such that ∫ |f|^p dµ < +∞. When no confusion on the space and the
measure is possible, we will write more simply Lp = Lp(X, X , µ).

If a, b and α are real numbers, |αa + b|^p ≤ (2 max(|α| |a|, |b|))^p ≤ 2^p |α|^p |a|^p + 2^p |b|^p. We
deduce that Lp is a vector space. The following result will be very useful in probability.

Proposition 1.41 If µ(X) < +∞, then for 0 < p ≤ q, Lq ⊂ Lp .




Proof. Let µ be a finite measure and 0 < p ≤ q. Then |f|^p ≤ |f|^q 1_{{|f| ≥ 1}} + 1_{{|f| < 1}}. If
f ∈ Lq, we deduce that

∫ |f|^p dµ ≤ ∫ |f|^q dµ + µ(|f| < 1) < +∞.   □

In the case of the Lebesgue measure λ on R, the function f defined by f(x) = (1/√x) 1_{]0,1]}(x)
is such that f ∈ L1 but f ∉ L2 (indeed ∫_0^1 x^{−1/2} dx = 2 < +∞ while ∫_0^1 x^{−1} dx = +∞), while the function g defined by g(x) = 1/(|x|+1) is such that
g ∈ L2 but g ∉ L1. In the following we will suppose p ≥ 1 and, if f : (X, X ) → (R, R) is
measurable, we will write

‖f‖_p = ( ∫ |f|^p dµ )^{1/p}.    (1.10)

When f ∈ L2 , we say that the function f is square integrable. The following theorem will
be very useful in the sequel.

Theorem 1.42 (Schwarz's Inequality) Let f and g be two functions belonging to L2. Then
f g ∈ L1 and

∫ |f g| dµ ≤ ‖f‖_2 ‖g‖_2.    (1.11)

We also have the triangle inequality ‖f + g‖_2 ≤ ‖f‖_2 + ‖g‖_2.

Proof. Let f, g be functions in L2. Then the linearity of the integral implies that for all
a ∈ R,

0 ≤ ∫ (a|f| + |g|)² dµ = a² ∫ |f|² dµ + 2a ∫ |f g| dµ + ∫ |g|² dµ.

This quadratic polynomial in a is non negative for every a ∈ R, so its discriminant is non positive, which proves the inequality (1.11).
Furthermore, in the particular case a = 1, the preceding identity and (1.11) show that ‖f + g‖_2² =
∫ (f + g)² dµ ≤ ∫ (|f| + |g|)² dµ ≤ (‖f‖_2 + ‖g‖_2)². □

We may generalize the Schwarz inequality and the triangle inequality of Theorem 1.42
to the case p > 1. We say that the real numbers p ∈ [1, +∞] and q ∈ [1, +∞] are conjugate
if

1/p + 1/q = 1,    (1.12)

with the convention 1/(+∞) = 0, i.e. q = p/(p − 1) for p ∈ ]1, +∞[. Then p = 1 and q = +∞ are conjugate, and so
are p = q = 2.

Theorem 1.43 (Hölder's Inequality) Let p ∈ ]1, +∞[ and q ∈ ]1, +∞[ be conjugate exponents.
(i) Let f and g be non negative measurable functions from (X, X ) to ([0, +∞[, B([0, +∞[)).
Then

0 ≤ ∫ f g dµ ≤ ‖f‖_p ‖g‖_q ≤ +∞.    (1.13)

Furthermore, if ‖f‖_p + ‖g‖_q < +∞, the inequality (1.13) is an equality if and only if there
exist non negative real numbers a and b such that (a, b) ≠ (0, 0) and a f^p = b g^q µ-almost
everywhere.




(ii) Let f ∈ Lp and g ∈ Lq. Then f g ∈ L1 and

‖f g‖_1 ≤ ‖f‖_p ‖g‖_q.    (1.14)

Furthermore, the inequality (1.14) is an equality if and only if there exist non negative real
numbers a and b such that (a, b) ≠ (0, 0) and a |f|^p = b |g|^q µ-almost everywhere.

Proof. (i) Let α ∈ ]0, 1[. First of all we show the Young inequality for two non negative real
numbers u and v:

u^α v^{1−α} ≤ αu + (1 − α)v,    (1.15)

with equality if and only if u = v. Let ϕα : [0, +∞[ → R be the function defined by ϕα(x) =
x^α − αx. Then ϕα is differentiable on ]0, +∞[, with ϕα'(x) > 0 on ]0, 1[ and ϕα'(x) < 0
on ]1, +∞[. We deduce that ϕα(x) ≤ ϕα(1) = 1 − α for all x ∈ ]0, +∞[, with equality if and only
if x = 1. When u ≥ 0 and v > 0, we write the inequality ϕα(x) ≤ 1 − α with x = u/v and
multiply by v > 0; this yields (1.15). Finally, (1.15) is obviously true if u ≥ 0 and v = 0.

We now prove (1.13). This inequality is clear if ‖f‖_p = 0 or ‖g‖_q = 0, because in that
case f or g is µ-null and f g = 0 µ-almost everywhere (with the convention 0 × (+∞) = 0); we
remark that in this case (1.13) is an equality and that f or g is zero µ-almost everywhere.
So we suppose that ‖f‖_p ≠ 0 and ‖g‖_q ≠ 0. If one of these terms is +∞, again (1.13)
is evident. Thus we suppose that ‖f‖_p ∈ ]0, +∞[ and ‖g‖_q ∈ ]0, +∞[. Let α = 1/p ∈ ]0, 1[;
then 1 − α = 1/q. The Young inequality (1.15) applied to u = f^p(x)/‖f‖_p^p and v = g^q(x)/‖g‖_q^q shows that for
all x ∈ X,

f(x) g(x) / (‖f‖_p ‖g‖_q) ≤ (1/p) f^p(x)/‖f‖_p^p + (1/q) g^q(x)/‖g‖_q^q,    (1.16)

with equality only for the elements x of X such that f^p(x)/‖f‖_p^p = g^q(x)/‖g‖_q^q. Integrating, we
deduce

0 ≤ ∫ f g dµ ≤ ‖f‖_p ‖g‖_q [ (1/p) ∫ (f^p(x)/‖f‖_p^p) µ(dx) + (1/q) ∫ (g^q(x)/‖g‖_q^q) µ(dx) ] = ‖f‖_p ‖g‖_q,

which proves (1.13). If the inequality (1.13) is an equality, then the inequality (1.16) is of the
form ϕ ≤ ψ with ∫ ϕ dµ = ∫ ψ dµ. Point (ii) of Proposition 1.35 shows that
ψ − ϕ is zero µ-almost everywhere, that is to say f^p(x)/‖f‖_p^p = g^q(x)/‖g‖_q^q for µ-almost all x. Because
‖f‖_p ≠ 0 and ‖g‖_q ≠ 0, this concludes the proof of (i).
Part (ii) is an immediate consequence of (i) applied to |f| and |g|. □
We immediately deduce the following corollary.

Corollary 1.44 Let f ∈ Lp and g ∈ Lq. Then |∫ f g dµ| ≤ ‖f‖_p ‖g‖_q.
It remains to generalize the triangle inequality; this is the object of the following result.
Theorem 1.45 (Minkowski Inequality) Let p ∈ [1, +∞[, f and g be functions in Lp. Then

‖f + g‖_p ≤ ‖f‖_p + ‖g‖_p.

Moreover, when p > 1, this inequality is an equality if and only if f = 0 µ-almost everywhere
or there exists a real number α ≥ 0 such that g = αf µ-almost everywhere. When p = 1, the triangle inequality
is an equality if and only if f g ≥ 0 µ-almost everywhere.




Proof. The triangle inequality is trivial for p = 1, thus we suppose that p > 1.
Integrating the inequality |f + g|^p ≤ |f| |f + g|^{p−1} + |g| |f + g|^{p−1}, we deduce

‖f + g‖_p^p ≤ ∫ |f| |f + g|^{p−1} dµ + ∫ |g| |f + g|^{p−1} dµ.

If q is the conjugate exponent of p, the Hölder inequality leads to

‖f + g‖_p^p ≤ ‖f‖_p ( ∫ |f + g|^{(p−1)q} dµ )^{1/q} + ‖g‖_p ( ∫ |f + g|^{(p−1)q} dµ )^{1/q}.

Since (p − 1)q = p, we deduce

‖f + g‖_p^p ≤ (‖f‖_p + ‖g‖_p) ‖f + g‖_p^{p−1}.

Because Lp is a vector space, ‖f + g‖_p < +∞. If ‖f + g‖_p = 0, the triangle inequality is
trivial. Otherwise, we may divide the last inequality by ‖f + g‖_p^{p−1} and we deduce
the triangle inequality.
The characterization of the case of equality is not proved. □
The definition of ‖·‖_p shows that if a ∈ R and f ∈ Lp, ‖af‖_p = |a| ‖f‖_p; thanks to the
Minkowski Inequality, ‖·‖_p is a semi-norm, but it is not a norm. In fact ‖f‖_p = 0 implies
that ∫ |f|^p dµ = 0, that is |f|^p = 0 µ-almost everywhere, that is to say that f is a µ-null
function, but not necessarily f = 0.


Definition 1.46 Let p ∈ [1, +∞[. We say that the functions f and g in Lp are equivalent
(denoted f ∼ g) if ‖f − g‖_p = 0, that is to say if f = g µ-almost everywhere. This equivalence
relation is compatible with addition and multiplication by a real number. We write L^p for
the set of equivalence classes of Lp by ∼. If f̄ ∈ L^p denotes the equivalence class of f ∈ Lp,
we write ‖f̄‖_p = ‖f‖_p (which does not depend on the representative of f̄).
Theorem 1.47 (i) If f̄ and ḡ are two classes in L^2,

⟨f̄, ḡ⟩ = ⟨f, g⟩ := ∫ f g dµ    (1.17)

does not depend on the representatives f ∈ L2 and g ∈ L2 and defines a scalar product on
L^2, with associated norm given by ⟨f, f⟩ = ‖f‖_2². The vector space L^2 is complete for this norm;
it is a Hilbert space.
(ii) The vector space L^p is complete for the norm ‖·‖_p.
(ii) The vector space Lp is complete for the norm k.kp .
Convention In the following, so as to ease the notation, we will make a consistent abuse of
notation and identify the equivalence class f̄ ∈ L^2 with any of its representatives, that is with a measurable
function f ∈ L2 (which is defined µ-almost everywhere).
Definition 1.48 For every measurable function f : (X, X ) → (R, R), we define

‖f‖_∞ = inf{a ∈ [0, +∞[ : µ(|f| > a) = 0},    (1.18)

with the convention inf ∅ = +∞. We write L∞ for the set of measurable functions for which
there exists a number a ∈ R+ such that |f| ≤ a µ-almost everywhere, that is to say such
that ‖f‖_∞ < +∞.
The set L∞ is a vector space and ‖·‖_∞ is a semi-norm. The Hölder inequality (1.14) is also
true when p = 1 and q = +∞. The preceding equivalence relation allows the introduction
of a norm on the quotient space L^∞, and we will apply the same convention of notation for
the elements of L^∞.
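A simple example shows that ‖·‖_∞ differs in general from the supremum: for the Lebesgue measure on R and f = 1_Q, we have sup_x |f(x)| = 1, but λ(|f| > 0) = λ(Q) = 0, so ‖f‖_∞ = 0 and f belongs to the class of the zero function.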




1.8 Product Measure - Fubini’s Theorem

The goal of this section is to define a σ-algebra and a measure on a product space X × Y
and to calculate integrals with respect to this measure.
Definition 1.49 Let (X, X ) and (Y, Y) be measurable spaces. The product σ-algebra of X
and Y is the σ-algebra on X × Y, denoted X ⊗ Y, generated by the family of rectangles
{A × B : A ∈ X , B ∈ Y}, that is to say

X ⊗ Y = σ({A × B : A ∈ X , B ∈ Y}).
Remark 1.50 (i) The σ-algebra X ⊗ Y is the smallest σ-algebra on X × Y that makes
measurable the canonical projections ΠX : X × Y → X and ΠY : X × Y → Y defined by
ΠX (x, y) = x and ΠY (x, y) = y when the σ-algebra for the domain X (resp. Y) is X (resp.
Y).
(ii) Let (V, V) be a measurable space. A mapping f = (fX , fY ) : (V, V) → (X × Y, X ⊗ Y)
is measurable if and only if the mappings fX (resp. fY ) are measurable from (V, V) to (X, X )
(resp. to (Y, Y)).
The previous definition may be extended to a finite product of spaces. For i = 1, · · · , d,
let (Xi , Xi) be a measurable space and let X = Π_{i=1}^d Xi. If A = {Π_{i=1}^d Ai : Ai ∈ Xi} denotes
the set of all products of elements of the σ-algebras Xi, the σ-algebra on X generated by
A is called the product σ-algebra and written X = ⊗_{i=1}^d Xi. In the particular case where
(Xi , Xi) = (R, R) for all i = 1, · · · , d, that is X = Rd, the Borel σ-algebra Rd coincides with the
σ-algebra R^{⊗d}. Moreover, the operation ⊗ on σ-algebras is associative, that is to say that
⊗_{i=1}^3 Xi = (X1 ⊗ X2) ⊗ X3 = X1 ⊗ (X2 ⊗ X3).
Proposition 1.51 Let (X, X ) and (Y, Y) be two measurable spaces.
(i) Let A ∈ X ⊗ Y; the sections A_x = {y ∈ Y : (x, y) ∈ A} belong to Y for all x ∈ X and
A^y = {x ∈ X : (x, y) ∈ A} belong to X for all y ∈ Y.
(ii) Let f : (X × Y, X ⊗ Y) → (R, R) be a measurable function. For all x ∈ X the
section of f with respect to x, defined by f_x(y) = f(x, y) for y ∈ Y, is measurable from
(Y, Y) to (R, R); in the same way, for all y ∈ Y, the section f^y with respect to y, defined by
f^y(x) = f(x, y) for x ∈ X, is measurable from (X, X ) to (R, R).
If (X, X , µ) and (Y, Y, ν) are two measure spaces, the following theorem defines a measure
on the measurable product space (X × Y, X ⊗ Y).
Theorem 1.52 Let (X, X , µ) and (Y, Y, ν) be two σ-finite measure spaces. There exists a
unique measure denoted µ ⊗ ν on (X × Y, X ⊗ Y) defined by
(µ ⊗ ν)(A × B) = µ(A) ν(B) , ∀A ∈ X , ∀B ∈ Y, (1.19)
with the convention 0 × (+∞) = 0.
The existence of this measure is admitted. Its uniqueness is a consequence of the Monotone
Class Theorem (via Corollary 1.16). We remark that we may iterate this construction
and define a product measure µ1 ⊗ · · · ⊗ µd on the product σ-algebra X1 ⊗ · · · ⊗ Xd defined
on the product space X = Π_{i=1}^d Xi. Finally, we remark that the Lebesgue measure λd on
Rd, defined on the Borel σ-algebra Rd = R^{⊗d}, is the product measure λ^{⊗d}.
The following theorem links the integrals of non negative measurable functions defined
on X × Y with respect to µ ⊗ ν to the integrals with respect to µ and to ν.


Theorem 1.53 (Fubini-Tonelli’s Theorem)


Let (X, X , µ) and (Y, Y, ν) be σ-finite measure spaces and let f : (X × Y, X ⊗ Y) → ([0, +∞], B([0, ∞]))
be a non negative measurable function. Then the mapping y ∈ Y → ∫_X f (x, y) µ(dx)
is measurable from (Y, Y) to ([0, +∞], B([0, ∞])) and the mapping x ∈ X → ∫_Y f (x, y) ν(dy)
is measurable from (X, X ) to ([0, +∞], B([0, ∞])). Furthermore,

∫_{X×Y} f d(µ ⊗ ν) = ∫_X ( ∫_Y f (x, y) ν(dy) ) µ(dx) = ∫_Y ( ∫_X f (x, y) µ(dx) ) ν(dy) . (1.20)

It remains to extend the theorem to functions of any sign.


Theorem 1.54 (Fubini-Lebesgue’s Theorem)
Let (X, X , µ) and (Y, Y, ν) be σ-finite measure spaces, f : (X × Y, X ⊗ Y) → (R, R) a
function that belongs to L1 (µ ⊗ ν), that is to say integrable for the product measure µ ⊗ ν.
Then
(i) For µ-almost all x ∈ X, the function y → f (x, y) belongs to L1 (ν) and for ν-almost all y ∈ Y,
the function x → f (x, y) belongs to L1 (µ).
(ii) We have x → ∫_Y f (x, y) dν(y) ∈ L1 (µ) and y → ∫_X f (x, y) dµ(x) ∈ L1 (ν), where these
functions are defined, respectively, µ-almost everywhere and ν-almost everywhere.
(iii)

∫_{X×Y} f d(µ ⊗ ν) = ∫_X ( ∫_Y f (x, y) ν(dy) ) µ(dx) = ∫_Y ( ∫_X f (x, y) µ(dx) ) ν(dy) . (1.21)

Convention of notation We often write the equations (1.20) and (1.21) in the form
∫_X ∫_Y µ(dx) ν(dy) f (x, y) = ∫_X µ(dx) ∫_Y f (x, y) ν(dy) = ∫_Y ν(dy) ∫_X µ(dx) f (x, y).

These two theorems, Fubini-Tonelli and Fubini-Lebesgue, give another proof of Corollary
1.37. They make it possible to define double series, either with non negative terms or absolutely
convergent, and to compute them.
The next notion of convolution product will be used in probability.

Proposition 1.55 Let f and g be Borel functions from R to R integrable for Lebesgue
measure λ. Then the function f ∗ g defined by
(f ∗ g)(x) = ∫ f (x − y) g(y) λ(dy) = ∫ f (y) g(x − y) λ(dy) (1.22)

is integrable for Lebesgue measure and


‖f ∗ g‖_1 ≤ ‖f‖_1 ‖g‖_1 . (1.23)
We call f ∗ g the convolution product of f and g. Moreover, if f or g is of class C^k for k ≥ 0,
f ∗ g is also of class C^k .
Proof. It suffices to apply the Fubini-Tonelli Theorem 1.53 to obtain
‖f ∗ g‖_1 = ∫ |f ∗ g(x)| λ(dx) ≤ ∫ ( ∫ |f (x − y)| |g(y)| λ(dy) ) λ(dx)
≤ ∫ |g(y)| ( ∫ |f (x − y)| λ(dx) ) λ(dy) = ‖g‖_1 ‖f‖_1 ,


where the last equality occurs by the invariance of the Lebesgue measure by translation.
The regularity of the convolution product follows from Theorems 1.38 and 1.39 on the continuity
and differentiability of integrals depending on a parameter, which are immediate
consequences of the dominated convergence theorem. 2
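As a purely illustrative sketch (not part of the original notes), the inequality (1.23) can be checked numerically by discretizing the convolution integral; the two densities chosen below (exponential and standard Gaussian) and the grid parameters are arbitrary assumptions made only for this example.

import numpy as np

# Riemann-sum approximation of (f * g)(x) on a regular grid
dx = 0.01
x = np.arange(-20.0, 20.0, dx)
f = np.where(x >= 0, np.exp(-x), 0.0)          # density of E(1)
g = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)     # density of N(0, 1)

conv = np.convolve(f, g, mode="same") * dx     # approximate values of (f * g) on the grid

print((np.abs(f) * dx).sum(), (np.abs(g) * dx).sum())   # ||f||_1 and ||g||_1, both close to 1
print((np.abs(conv) * dx).sum())                        # ||f*g||_1, bounded by ||f||_1 ||g||_1

Up to discretization and truncation error, the last value does not exceed the product of the first two, in agreement with (1.23).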
The Hölder inequality yields more precise properties of the convolution product acting
on the spaces Lp .

Proposition 1.56 (i) Let p ∈ [1, +∞] and q be conjugate exponents, f ∈ Lp (λ) and g ∈
Lq (λ). Then
sup_{x∈R} |(f ∗ g)(x)| ≤ ‖f‖_p ‖g‖_q . (1.24)

(ii) Let p ∈ [1, +∞], f ∈ Lp (λ) and g ∈ L1 (λ). Then f ∗ g ∈ Lp (λ) and we have

‖f ∗ g‖_p ≤ ‖f‖_p ‖g‖_1 . (1.25)

Proof. (i) The inequality (1.24) is immediate if p = +∞ or q = +∞. So we suppose


p, q ∈]1, +∞[. The Hölder inequality shows that for all x ∈ R,
|f ∗ g(x)| ≤ ( ∫ |f (x − y)|^p λ(dy) )^{1/p} ( ∫ |g(y)|^q λ(dy) )^{1/q}

and we deduce (1.24) by invariance of the Lebesgue measure by translation.


(ii) Again, (1.25) is obvious if p = +∞. For p ∈ [1, +∞[, we apply the Hölder inequality
to the measure |g(y)|λ(dy). Then for all x ∈ R, if q is the conjugate exponent of p,
(|f | ∗ |g|)(x) = ∫ |f (x − y)| |g(y)| λ(dy) ≤ ( ∫ |f (x − y)|^p |g(y)| λ(dy) )^{1/p} ( ∫ |g(y)| λ(dy) )^{1/q} .

Raising this inequality to the power p then integrating with respect to the Lebesgue measure
λ(dx), because |f |^p ∈ L1 , the inequality (1.23) implies

‖ |f | ∗ |g| ‖_p^p ≤ ‖ |f |^p ∗ |g| ‖_1 ‖g‖_1^{p/q} ≤ ‖ |f |^p ‖_1 ‖g‖_1^{1+p/q} = ‖f‖_p^p ‖g‖_1^p ,

which concludes the proof of (1.25) since ‖f ∗ g‖_p^p ≤ ‖ |f | ∗ |g| ‖_p^p . 2


2 Probabilistic formulation

In all of this chapter, we consider a space Ω, described as all the possible values of an
experiment or of an observation « that depends on chance ».
The σ-algebra that describes the information defined by properties of the observations
is traditionally written F . The sets belonging to the σ-algebra are called events and the
elements of Ω are denoted ω. When the space Ω is finite or countable (for example N or
Nd ), the usual σ-algebra is the space of all subsets of Ω, so that F = P(Ω). When Ω is not
countable, for example when Ω = R, the σ-algebra P(Ω) of all subsets of Ω is too large to
work with, and we are led to use a smaller σ-algebra, for example the Borel σ-algebra R.
The probability P on F is such that P (A) numerically describes the « chances that
the property that defines A will be realized when we look at the results of a particular
experiment ».
The simplest probabilistic model is the one where the space Ω is finite. We often take as
σ-algebra F = P(Ω), the set of all subsets of Ω, and the uniform probability on F defined
by P (A) = |A|/|Ω| for all A ⊂ Ω; it describes mathematically the fact that all outcomes
occur in the same way (a fair coin, a fair die, a person chosen « at random » in a
population, ...)
This model is certainly very limited for translating more complex phenomena and we do
not take it up again here. The following results give the essential reminders of probability
theory using the tools of abstract measure from the preceding chapter.

2.1 Real Random Variables - Distribution - Distribution Function

We begin by reviewing the notions of measurable function and integral by use of the
usual terminology in probability. In the beginning it is a simple problem of vocabulary, but
rather soon problems specific to probability are reached.
Convention For every Borel function ψ : R → R, non negative or integrable with respect
to the Lebesgue measure λ, we will write
∫_R ψ dλ = ∫_R ψ(x) dx.

Definition 2.1 A real-valued random variable is a measurable function X : (Ω, F ) →


(R, R), or again X : (Ω, F ) → ([−∞, +∞], B([−∞, +∞])). A discrete random variable
is a measurable function X : (Ω, F ) → (N, P(N)). If X : Ω → [0, +∞] is measurable, the
integral of X with respect to P is called the expectation (or expected value) of X and denoted
E(X), i.e.,
E(X) := ∫_Ω X dP. (2.1)
A variable X is integrable if it is P -integrable, that is to say if E(|X|) = ∫_Ω |X| dP < +∞ ;
in this case, the integral ∫ X dP is likewise defined and we write again E(X) = ∫ X dP .

We remark that every discrete random variable is also a real random variable.
The null sets are the elements of F of P -measure zero, that is to say P -null. A property
true P -almost everywhere will be said true almost surely and denoted a.s.
We thus translate again the part (i) of the Proposition 1.35 into saying that an integrable
random variable is finite almost surely. As an exercise we can prove that this property leads


to the following lemma, which is very useful in probability for showing the almost sure
convergence that we will study in Chapter 5.

Lemma 2.2 (Borel-Cantelli Lemma) Let (Ω, F , P ) be a probability space and (A_n , n ≥ 1)
a sequence of measurable sets such that Σ_n P (A_n ) < +∞. Then P (lim sup_n A_n ) = 0.
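As a purely illustrative sketch (not part of the notes), the Borel-Cantelli phenomenon can be observed numerically: with independent uniform draws U_n and A_n = {U_n ≤ 1/n²}, the series Σ_n P(A_n) converges, so almost surely only finitely many of the A_n occur; the sample size and the seed below are arbitrary choices.

import random

random.seed(0)
N = 10**6
# indices n <= N for which the event A_n = {U_n <= 1/n^2} occurs
occurred = [n for n in range(1, N + 1) if random.random() <= 1.0 / n**2]
print(len(occurred), occurred[:10])   # typically a handful of events, all with small indices

In a typical run only one or two events occur, and only for very small n, which is consistent with the conclusion of the lemma.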

In the spirit of the section 1.7, we introduce the spaces Lp and Lp . A random variable X
is said to be square integrable if it belongs to L2 := L2 (P ), of power p integrable if it belongs
to Lp := Lp (P ), 1 ≤ p < +∞ and (essentially) bounded if it belongs to L∞ := L∞ (P ). For
all p ∈ [1, +∞[, and every real random variable X,
‖X‖_p = [E(|X|^p)]^{1/p} . (2.2)
In the following, we will make the abuse of language indicated at the end of section 1.7 by
identifying an element of the quotient space Lp with one of its representatives and a random
variable belonging to Lp with its equivalence class in Lp . Then, we will speak of the « p
norm » of a random variable X ∈ Lp , whereas we should rather speak of its semi-norm and of
the norm of the equivalence class of X. The Hölder inequality implies that the spaces Lp
are included one in another.

Theorem 2.3 Let (Ω, F , P ) be a probability space. Then if 1 ≤ p1 ≤ p2 ≤ +∞, Lp2 ⊂ Lp1 .
More precisely, for every real random variable X,

‖X‖_{p1} ≤ ‖X‖_{p2} , ∀(p1 , p2 ) such that 1 ≤ p1 ≤ p2 ≤ +∞. (2.3)

Proof. The inequality (2.3) is clear if X ∉ L^{p2} and we suppose that ‖X‖_{p2} < +∞. Obviously,
if X ∈ L∞ , |X| ≤ ‖X‖_∞ a.s. and thus for all p1 ∈ [1, +∞[, |X|^{p1} ≤ ‖X‖_∞^{p1} . By
integrating this inequality, we deduce (2.3) when p2 = ∞.
We suppose 1 ≤ p1 < p2 < +∞, we write p = p2/p1 ∈ ]1, +∞[ and we let q denote the
conjugate exponent of p. Then the Hölder inequality (1.14) applied to the function g = 1,
such that ‖g‖_q = 1, implies

‖X‖_{p1}^{p1} = E(|X|^{p1} × 1) ≤ ( ∫ |X|^{p1 p} dP )^{1/p} × 1 = ‖X‖_{p2}^{p1} ,

which ends the proof of (2.3). 2


The following notion generalizes domination and likewise allows to exchange limit and
integral.

Definition 2.4 Let E be a family of measurable functions. We say that E is equiintegrable


(or uniformly integrable) if
lim_{a→+∞} sup_{f ∈E} ∫_{{|f |≥a}} |f | dP = 0. (2.4)

Let g be an integrable function ; the dominated convergence theorem 1.36 proves that
E = {g} is equiintegrable. We infer that E = {measurable f : |f | ≤ |g|} is also an
equiintegrable family. The following result demonstrates that a bounded sequence in Lp ,
p > 1 is uniformly integrable ; this will be useful to study stochastic processes.


Theorem 2.5 Let E be a family of measurable functions such that there exists p ∈]1, +∞[
satisfying M = sup{‖f‖_p : f ∈ E} < +∞. Then, E is uniformly integrable.

Proof. Let q ∈]1, +∞[ be the conjugate exponent of p. For all a > 0, the Hölder inequality
implies, for f ∈ E that
∫_{{|f |≥a}} |f | dP ≤ ( ∫ |f |^p dP )^{1/p} (P (|f | ≥ a))^{1/q} .

The Markov inequality (1.5) shows that P (|f | ≥ a) ≤ (1/a^p) ∫ |f |^p dP . We deduce that for every
function f ∈ E,

∫_{{|f |≥a}} |f | dP ≤ M^{1+p/q} a^{−p/q} ,

which finishes the proof. 2


The following example shows that a sequence of functions bounded in L1 is not necessarily
uniformly integrable. Let X = [0, 1], X = B([0, 1]), P the restriction of the Lebesgue
measure to [0, 1] and f_n = n 1_{[0,1/n]} . Then for all a > 0 and n ≥ a, ∫_{{|f_n |≥a}} |f_n | dP = ∫ |f_n | dP = 1.
On the other hand, a sequence may be bounded in L2 without being dominated by an integrable
function, as shown in the following example. For every integer n and every integer
k = 1, · · · , 2^n , let f_{n,k} = 2^{n/2} 1_{[(k−1)2^{−n} , k2^{−n} ]} . Then sup_n sup_{1≤k≤2^n} ‖f_{n,k}‖_2 = 1, but for all
x ∈ [0, 1], sup_n sup_{1≤k≤2^n} |f_{n,k}(x)| = +∞.
We may characterize equiintegrability by a property of « continuity » of the integral over
sets of « small » probability.

Proposition 2.6 A family of measurable functions E is equiintegrable if and only if the


following two properties are satisfied :
(i) C = sup{‖f‖_1 : f ∈ E} < +∞
(ii) For every ε > 0 there exists α > 0 such that for every set A ∈ F such that P (A) ≤ α
and every function f ∈ E, ∫_A |f | dP < ε.

Proof. If E is uniformly integrable, for all ε ∈ ]0, 1] there exists a > 0 such that for every
function f ∈ E, ∫_{{|f |≥a}} |f | dP ≤ ε, and thus ‖f‖_1 ≤ a + 1, which proves (i). Furthermore, for
every measurable set A,

∫_A |f | dP = ∫_{A∩{|f |≥a}} |f | dP + ∫_{A∩{|f |<a}} |f | dP ≤ ε + a P (A),

which proves (ii).


Conversely, let E be a class of measurable functions that satisfy the conditions (i) and
(ii). Then for all a > 0, the Markov inequality (1.5) and (i) prove that for all functions
f ∈ E, P (|f | ≥ a) ≤ C/a. Moreover, for fixed ε > 0, let α > 0 be such that (ii) is true. Then for
a > C/α and A = {|f | ≥ a}, the property (ii) proves that ∫_{{|f |≥a}} |f | dP < ε for every function
f ∈ E, which concludes the proof. 2
The following theorem generalizes the dominated convergence theorem.

Theorem 2.7 Let (fn , n ≥ 1) be a sequence of measurable functions that converge almost
surely to f and are equiintegrable. Then limn E(|f − fn |) = 0 and limn E(fn ) = E(f ).


Proof. Following the Proposition 2.6, the sequence fn is bounded in L1 and the Fatou
Lemma 1.33 implies that f is integrable. For every a > 0,
E(|f_n − f |) ≤ ∫_{{|f_n |≥a}} |f_n | dP + ∫_{{|f |≥a}} |f | dP + ∫ |f_n 1_{{|f_n |<a}} − f 1_{{|f |<a}} | dP.

For every ε > 0, the Definition 2.4 shows the existence of a > 0 such that for every function
g contained in the equiintegrable family E = {f } ∪ {f_n : n ≥ 1}, we have ∫ |g| 1_{{|g|≥a}} dP ≤ ε.
Moreover, the sequence f_n 1_{{|f_n |<a}} − f 1_{{|f |<a}} converges almost surely to 0 and is dominated
by the constant 2a, which is integrable. The dominated convergence theorem thus shows
that ∫ |f_n 1_{{|f_n |<a}} − f 1_{{|f |<a}} | dP converges to 0. This completes the proof. 2
Theorem 2.3 shows that the space L2 of square integrable random variables is included
in the space of integrable random variables. The following computation shows that, for every
random variable X that is square integrable, E(X) is the best constant approximation of X
for the ‖ ‖_2 norm. Indeed, if X is square integrable, E(X) is well defined, E(X − E(X)) = 0
and for every real number a, the linearity of the integral implies

E(|X − a|^2) = E(|X − E(X)|^2) + 2 (E(X) − a) E(X − E(X)) + (E(X) − a)^2 ≥ E(|X − E(X)|^2).

Definition 2.8 Let X be a random variable that is square integrable. The variance of X is

Var(X) = E(|X − E(X)|^2) = E(X^2) − E(X)^2 . (2.5)

Following the Proposition 1.35 (ii), a random variable is almost surely constant (and equal
to its expectation) if and only if it is square integrable and has zero variance. Clearly, if X is
integrable, E(aX + b) = aE(X) + b and if X is square integrable, V ar(aX + b) = a2 V ar(X).
The variance of a real variable measures the « dispersion » of X around its expectation.
The notion of image measure is crucial. The following theorem is a rephrasing of Theo-
rems 1.19 and 1.32.

Theorem 2.9 Let X : Ω → R be a random variable. The probability P_X = P ◦ X^{−1} , the
image measure (or pushforward measure) of P by X, is called the distribution of X. For
every Borel set A ∈ R, P_X (A) = P (X ∈ A) and for every Borel function ϕ : R → [0, +∞],

E[ϕ(X)] = ∫_Ω ϕ(X(ω)) dP (ω) = ∫_R ϕ(x) dP_X (x) = ∫ ϕ dP_X . (2.6)

The function ϕ is P_X -integrable if and only if ϕ(X) is P -integrable and in this case we have
anew E(ϕ(X)) = ∫ ϕ dP_X .
Moreover, if ν is a probability on R such that E(ϕ(X)) = ∫_R ϕ dν for every non negative
Borel function ϕ (or for every bounded Borel function ϕ), then ν = P_X .

We remark that the distribution of X does not describe the probability P on all the σ-
algebra F , but on the σ-algebra generated by X, σ(X) = {X ∈ A : A ∈ R}. Finally we
remark that for every probability ν on the Borel σ-algebra R, there exists a probability
space (Ω, F , P ) and a real random variable X : Ω → R of distribution ν. In fact it suffices
to take Ω = R, F = R, P = ν and X the identity on R.


When X : Ω → R, the Monotone Class Theorem 1.15 applied to the class C = {]a, b[:
a < b, a, b ∈ R} shows that the distribution of X is characterized by the probability of X
on the intervals of R. We may further reduce the class of intervals required to
characterize the distribution of X by considering C = {] − ∞, t], t ∈ R}.
Proposition 2.10 Let X : (Ω, F ) → (R, R) be a real random variable and P be a probability
on F . The distribution of X is characterized by the distribution function of X, that is by
the function F : R → [0, 1] defined by

F (t) = P (X ≤ t), ∀t ∈ R. (2.7)

Moreover, the distribution function F is increasing, right continuous, lim t→−∞ F (t) = 0 and
limt→+∞ F (t) = 1.
When X : Ω → N is a discrete random variable, its distribution is characterized by the
set of non negative numbers (P (X = n), n ≥ 0) such that Σ_n P (X = n) = 1. In this case,
the distribution of X is written as a weighted series of Dirac masses, as

P_X = Σ_{n≥0} P (X = n) δ_n .

In this case, for every Borel function ϕ which is non negative or such that ϕ(X) is integrable,

E(ϕ(X)) = Σ_{n≥0} ϕ(n) P (X = n).

In particular,

E(X) = Σ_{n≥0} n P (X = n) ∈ [0, +∞].

Moreover, if F is the distribution function of X, for every integer n ≥ 0, P (X = n) =
F (n) − F (n − 1). Every set (a_n , n ≥ 0) of non negative numbers that satisfies 0 < S := Σ_{n≥0} a_n < +∞
may, after normalization, define the distribution of a discrete random variable, by setting
P (X = n) = a_n /S.

Example 2.11 Here are some examples of « usual » discrete distributions.
1) Bernoulli distribution of parameter p ∈ [0, 1], denoted B(p)
This distribution models an experiment with two possible results (heads or tails, success
or failure, ...) We encode the results 0 (failure) and 1 (success). A random variable X :
Ω → {0, 1} follows a Bernoulli distribution of parameter p ∈ [0, 1] if P (X = 1) = p and
P (X = 0) = 1 − p. It is a bounded random variable, which thus belongs to all the Lp ,
1 ≤ p ≤ +∞ spaces. We have E(X) = p and V ar(X) = p(1 − p) ; we then find that X is
constant, that is « deterministic » , if p = 0 or p = 1.
2) Binomial distribution of parameters n ≥ 1 and p ∈ [0, 1], denoted B(n, p)
This distribution models the total number of successes when we repeat n times « in an
independent way » the same experiment that every time follows the Bernoulli distribution of
parameter p. We will return to this in section 2.3. If a random variable X : Ω → {0, · · · , n}
follows a binomial distribution B(n, p), then for every integer k = 0, · · · , n, P (X = k) = C_n^k p^k (1 − p)^{n−k} .
Again, a binomial random variable B(n, p) belongs to all the Lp spaces for 1 ≤ p ≤ +∞.
We have E(X) = np and V ar(X) = np(1 − p).


3) Poisson distribution of parameter λ denoted P(λ)


One of the motivations for the study of this distribution is the following problem : we
repeat an infinite number of times in an independent way the same experiment which lasts a
random time with exponential distribution (which will be recalled later). We wish to model
the number of realizations of this experiment in a given interval of time.
A random variable X : Ω → N follows the Poisson distribution of parameter λ > 0 if for
every integer n ≥ 0, P (X = n) = exp(−λ) λ^n /n! . This random variable is not bounded, but it
belongs to all the Lp spaces, p ∈ [1, +∞[. We have E(X) = λ and Var(X) = λ.
4) Geometric distribution of parameter a denoted G(a)
One of the motivations of the study of this distribution is the following problem : we
repeat an infinite number of times in an independent way the same random experiment with
two outcomes encoded success and failure (thus modeled as Bernoulli random variables) and
we wish to know how many times we need to perform the experiment before having the first
success.
A random variable X : Ω → N follows a geometric distribution of parameter a ∈ [0, 1[ if
for every integer n, P (X = n) = (1 − a) a^n . This random variable does not describe exactly
the waiting time of the preceding problem, which is modeled by a geometric distribution
shifted by 1 (since the waiting time is greater than or equal to 1); the corresponding
computations use geometric series. Unless a = 0 (in which case X is almost surely equal
to 0), the random variable X is not bounded, but it belongs to all the Lp spaces, 1 ≤ p < +∞.
We have E(X) = a/(1 − a) and Var(X) = a/(1 − a)^2 .

Note that sometimes, in order to describe exactly the waiting time of the first success,
one says that the distribution of X is geometric with parameter a if for any integer n ≥ 1,
P (X = n) = (1 − a) a^{n−1} .
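As a quick illustration (a sketch added here, not part of the notes), the moments E(X) = a/(1 − a) and Var(X) = a/(1 − a)² of the geometric distribution G(a) can be checked by simulation; the sampler below uses the convention P(X = n) = (1 − a)aⁿ and an arbitrary value of a.

import random, statistics

random.seed(1)

def geometric(a):
    # number of "failures" before the first "success": P(X = n) = (1 - a) * a**n
    n = 0
    while random.random() < a:
        n += 1
    return n

a, N = 0.3, 200_000
sample = [geometric(a) for _ in range(N)]
print(statistics.mean(sample), a / (1 - a))            # empirical mean vs a/(1-a)
print(statistics.pvariance(sample), a / (1 - a) ** 2)  # empirical variance vs a/(1-a)^2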

Another fundamental example of the distribution of a random variable uses Theorem


1.31.

Definition 2.12 A real random variable X (or its distribution) has density f (with respect
to the Lebesgue measure) if its distribution PX has the density f with respect to λ, that is if
for every Borel set A ∈ R,
P (X ∈ A) = ∫_R 1_A f dλ = ∫_A f (x) dx. (2.8)

The function f is Borel, non negative, such that ∫_R f (x) λ(dx) = 1 and for every Borel
function ϕ non negative (or such that ϕf is λ-integrable),

E(ϕ(X)) = ∫_R ϕ(x) f (x) dx. (2.9)

The validity of (2.9) for every Borel non negative (or bounded) function ϕ implies that
the distribution of X has the density f with respect to the Lebesgue measure λ. We say more
simply that X has density f .

If X is a non negative or integrable random variable of density f , we then deduce


E(X) = ∫_R x f (x) dx.


Moreover, every non negative Borel function g on R such that 0 < I := ∫_R g(x) dx < +∞
gives, after normalization, a probability density on R ; it suffices to put f (x) = g(x)/I.
Let X be a real variable of density f . Then the distribution function F of X, given by
F (t) = ∫_{−∞}^t f (x) dx, is continuous, that is to say that P (X = t) = 0 for
all t ∈ R. Then it is natural to try to connect the derivative of F , if it exists, to f . This
point is rather delicate. We will admit that every increasing function G on the interval [a, b]
(thus Riemann integrable on [a, b]) that is also continuous on [a, b] is differentiable λ-almost
everywhere on [a, b]. Then, if G′ is its derivative (defined λ-almost everywhere) the following
inequality is true :

∫_a^b G′(x) dx ≤ G(b) − G(a). (2.10)

Indeed, for every integer n ≥ 1, let G_n (x) = n[G(x + 1/n) − G(x)] for a ≤ x ≤ b − 1/n and
G_n (x) = 0 for b − 1/n < x ≤ b. Then the functions G_n are non negative and converge
λ-almost everywhere to G′. The Fatou Lemma 1.33 allows the conclusion that

∫_a^b G′(x) dλ(x) ≤ lim inf_n ∫_a^b G_n (x) dλ(x).

Furthermore,

∫_a^b G_n (x) dλ(x) = n ∫_{b−1/n}^b G(x) dx − n ∫_a^{a+1/n} G(x) dx
= n [ Φ(b) − Φ(b − 1/n) ] − n [ Φ(a + 1/n) − Φ(a) ] ,

where we denote Φ(t) = ∫_a^t G(x) dx. Since G is continuous at a and at b, Φ admits a right-
hand derivative at a equal to G(a) and a left-hand derivative at b equal to G(b). The sequence
∫_a^b G_n (x) dλ(x) thus converges to G(b) − G(a).
The following example shows that the inequality (2.10) between the integral of the derivative
and the increment of G may be strict. If G(x) = 1_{[1/2,1]} (x), a = 0 and b = 1, then G′(x) = 0 λ-almost
everywhere on [a, b] and ∫_0^1 G′(x) dx = 0 < G(1) − G(0) = 1. A more complex example shows
that this inequality may be strict even if G is continuous.
Nevertheless, the indefinite integral F (t) = ∫_{−∞}^t f (x) dx of a function f integrable with
respect to Lebesgue measure λ (such a function F is called absolutely continuous) is λ-almost
everywhere differentiable and F ′ = f λ-almost everywhere. This will not be proved here. We then deduce

Proposition 2.13 (i) Let X be a real random variable of density f . Then the distribution
function F of X is continuous, almost everywhere differentiable and F ′(x) = f (x) λ-almost
everywhere.
(ii) Let X be a real random variable whose distribution function F is continuous ; then
F is differentiable almost everywhere. Moreover, if ∫_{−∞}^{+∞} F ′(x) dx = 1, we deduce that X has
density F ′.

Proof. The point (i) is clear. For the point (ii), the differentiability of F λ-almost everywhere
is clear. For every integer n, we know by (2.10) that ∫_{−n}^n F ′(x) dx ≤ F (n) − F (−n) ≤ 1.
Furthermore, since F is increasing, F ′ ≥ 0 λ-almost everywhere. The monotone convergence
theorem then shows that ∫_R F ′(x) dx ≤ 1, that is to say that F ′ is integrable. If we write


Φ(t) = ∫_{−∞}^t F ′(x) dx, the part (i) shows that Φ′ = F ′ λ-almost everywhere. Furthermore,
∫_{−∞}^{+∞} F ′(x) dx = 1 and for all t ∈ R, (2.10) and the monotone convergence theorem show
that ∫_{−∞}^t F ′(x) dx ≤ F (t) while ∫_t^{+∞} F ′(x) dx ≤ 1 − F (t). We then deduce that for every t,
Φ(t) = ∫_{−∞}^t F ′(x) dx = F (t), which shows that Φ is the distribution function of a distribution
µ on R of density F ′, and since, according to Proposition 2.10, the distribution function
characterizes the distribution, we deduce that X has density F ′. 2
We calculate the distribution of the image of a real random variable X by a function Φ
in the following manner.

Theorem 2.14 Let X be a random variable of density f , I =]a, b[ an open interval of R
(with possibly a = −∞ or b = +∞) such that f = 0 almost everywhere on ]a, b[^c . Then
P (X ∈]a, b[) = 1. Let Φ : I → J =]c, d[ be a bijective function, differentiable on I. Then,
for every non negative Borel function φ :]c, d[→ [0, +∞[, we have the change of variable
formula :

∫_a^b φ(Φ(x)) |Φ′(x)| dx = ∫_c^d φ(y) dy. (2.11)

If φ is of arbitrary sign, 1_{]c,d[} φ is λ-integrable if and only if φ ◦ Φ |Φ′| 1_{]a,b[} is λ-integrable
and in this case, the equation (2.11) remains true. The random variable Y = Φ(X) almost
surely takes its values in the interval ]c, d[ and its density is the function g defined by g(y) = 0
if y ∉ ]c, d[ and, for y ∈]c, d[, if Ψ :]c, d[→]a, b[ designates the reciprocal function of Φ,

g(y) = f (Ψ(y)) |Ψ′(y)|. (2.12)


Proof. In the evident way, P (X ∈ ]a, b[^c ) = ∫_{]a,b[^c} f (x) dx = 0. The classical change of
variables theorem of differential calculus gives (2.11) when φ is continuous. Since the continuous
functions are dense in L1 , we extend (2.11) further to the integrable functions by dominated
convergence, and next to non negative Borel ones by monotone convergence.
For deducing the density of Y = Φ(X), let ϕ :]c, d[→ R be a non negative Borel function.
Then if Ψ designates the reciprocal function of Φ, Ψ is differentiable on ]c, d[ and Ψ′(y) =
1/Φ′(Ψ(y)). Applying (2.11) to φ(y) = ϕ(y) f (Ψ(y)) |Ψ′(y)|, we deduce that

E[ϕ(Y )] = ∫_a^b ϕ(Φ(x)) f (x) dx = ∫_c^d ϕ(y) |Ψ′(y)| f (Ψ(y)) dy.

The characterization of the density of Y proved in (2.9) completes the proof. 2
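As an illustrative sketch (not part of the notes), formula (2.12) can be checked by simulation: if X follows the E(1) distribution on I = ]0, +∞[ and Φ(x) = x², then Ψ(y) = √y and the density of Y = X² predicted by (2.12) is g(y) = e^{−√y}/(2√y) on ]0, +∞[; the interval [1, 4] and the sample size below are arbitrary choices.

import math, random

random.seed(2)
N = 200_000
# X ~ E(1) simulated by inversion (see section 2.4.1), then Y = X^2
ys = [(-math.log(1.0 - random.random())) ** 2 for _ in range(N)]

emp = sum(1 for y in ys if 1.0 <= y <= 4.0) / N
exact = math.exp(-1.0) - math.exp(-2.0)   # integral of g over [1, 4] (substitute u = sqrt(y))
print(emp, exact)                          # the two values should be close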

Example 2.15 The following examples of densities of random variables are classic.
1) Uniform distribution on the interval [a, b], written U ([a, b])
This distribution models a random phenomenon that takes the real values between a
and b, such that the probability of falling in an interval is proportional to its length, that is
to say that the values are placed « at random » in the interval [a, b].
A random variable X : Ω → R follows a uniform distribution on the interval [a, b], where
a and b are real numbers such that a < b, if its density is the function f = (1/(b − a)) 1_{[a,b]} . It only
takes values between a and b ; it is almost surely bounded and is contained in all the Lp
spaces with p ∈ [1, +∞].


We have E(X) = (a + b)/2, Var(X) = (b − a)^2 /12 and the distribution function F is such that

F (t) = 0 if t < a, F (t) = (t − a)/(b − a) if a ≤ t ≤ b, F (t) = 1 if t > b.

2) Exponential distribution of parameter λ, written E(λ)


This distribution models the duration of the life of a material that does not give an
external sign of aging, or more generally the waiting time of a phenomenon whose lack of
occurrence will give no indication of the time that remains to wait before it will occur.
A random variable X : Ω → R follows an exponential distribution of parameter λ > 0
if its density is the function f defined by f (x) = λ exp(−λx) 1[0,+∞[(x). It is almost surely
non negative, but not bounded. However, it belongs to all the Lp spaces for p ∈ [1, +∞[.
We have E(X) = 1/λ and Var(X) = 1/λ^2 . Its distribution function is defined by F (t) =
(1 − exp(−λt)) 1_{[0,+∞[} (t). In particular P (X ≥ t) = exp(−λt) for all t > 0.
3) Gamma distribution of parameters λ and a, written Γ(λ, a)
It is an important generalization of the exponential distribution from a technical point
of view. A random variable X : Ω → R follows a distribution Γ(λ, a) with λ > 0 and a > 0
if its density is the function f defined by

f (x) = (λ^a / Γ(a)) exp(−λx) x^{a−1} 1_{]0,+∞[} (x) ,

where for all a > 0 we write

Γ(a) = ∫_0^{+∞} x^{a−1} e^{−x} dx.
We see that the distributions Γ(λ, 1) and E(λ) are equal. Again, a random variable of
distribution Γ(λ, a) is not bounded but belongs to all the Lp spaces for p ∈ [1, +∞[.
We recall that for all a > 0, Γ(a + 1) = aΓ(a), that for every integer n ≥ 1 we have
Γ(n) = (n − 1)! and that Γ(1/2) = √π.
We have E(X) = a/λ and Var(X) = a/λ^2 . We do not write down an explicit expression of
the distribution function, except in particular cases of values of a.
4) Standard Gaussian (or normal) distribution, written N (0, 1)
It is a fundamental distribution in the theory of probability. It models the distribution of phenomena
that are the aggregate of very numerous random observations that are « microscopic » in-
dependent, and similar, such as a large number of small shocks (for example it is inherent
in the modeling of the course of assets in the stock market, ...)
A random variable follows the Gaussian distribution N (0, 1) if its density is the function
f defined by

f (x) = (1/√(2π)) exp(−x^2 /2).
This random variable « loads » every interval [a, b] such that a < b, that is P (X ∈ [a, b]) > 0.
It is not bounded but belongs to all the Lp -spaces with p ∈ [1, +∞[. We cannot write down an
explicit formula for the distribution function (which is calculated by numerical methods),
but it is easy to check that its two parameters are respectively the expectation and the
variance, so that
E(X) = 0 and V ar(X) = 1.


5) Gaussian (or normal distribution) N (m, σ 2 )


It is the distribution of the image of a variable X of distribution N (0, 1) by the affine
function x ↦ y = m + σx, where we agree that σ = √(σ^2) ≥ 0. Thus, when X follows a
distribution N (0, 1), the random variable Y = m + σX follows a Gaussian distribution
N (m, σ^2 ). We then immediately have E(Y ) = m, Var(Y ) = σ^2 . If σ = 0, Y is almost
surely equal to m and if σ > 0, then a simple change of variable shows that the density of
Y is

g(y) = (1/(σ√(2π))) exp(−(y − m)^2 /(2σ^2)) .
Again, Y belongs to all the Lp spaces for p ∈ [1, +∞[ but does not belong to L∞ (except if
σ = 0).
6) Cauchy distribution
It is a very simple example of a random variable that does not belong to any of the Lp
spaces, 1 ≤ p ≤ +∞ ; in particular it is not integrable.
A random variable X : Ω → R follows a Cauchy distribution if its density f is defined
by f (x) = 1/(π(1 + x^2 )). Then, E(|X|) = +∞ and the distribution function satisfies F (t) =
(1/π) Arctan(t) + 1/2 for every real number t.

2.2 Random Vectors - Change of variables

We are interested in the measurable functions from (Ω, F ) to (Rd , Rd ).

Definition 2.16 A random variable with values in Rd , or random vector, is a measurable


function X : (Ω, F ) → (Rd , Rd ). We say that X = (X1 , · · · , Xd ) is integrable (resp. belongs
to Lp for p ∈ [1, +∞]) if each of its components Xi , 1 ≤ i ≤ d is integrable (resp. belongs
to Lp ). If X ∈ L1 , the expectation of X is the vector of Rd defined by

E(X) = (E(X1 ), · · · , E(Xd )).

We remark that a vector X = (X1 , · · · , Xd ) : Ω → Rd is a random vector if and


only if each of its components Xi , i = 1, · · · , d is a real random variable. Moreover, the
linearity of the integral shows that if h : Rd → Rr is a linear function, the random vector
h(X) := h ◦ X : Ω → Rr is integrable and
 
E h(X) = h E(X) .

Convention In the rest it will be convenient to commit the consistent abuse of notation in
identifying a vector of Rd with the column matrix of its components in the canonical basis.
If A designates the matrix associated to h in the canonical basis, we will write with this
convention :

E(X) is the column matrix with entries E(X_1 ), · · · , E(X_d ), and E(AX) = A E(X) gives the expectation of the vector h(X).

The variance is replaced by the covariance matrix of the random vector.


Theorem 2.17 (i) Let X and Y be real random variables that are square integrable. The
covariance of X and Y is a real number defined by

Cov(X, Y ) = E(XY ) − E(X)E(Y ) = E[(X − E(X)) (Y − E(Y ))] = Cov(Y, X). (2.13)

Furthermore, Cov(X, X) = Var(X) and |Cov(X, Y )| ≤ √Var(X) √Var(Y ).
(ii) Let X = (X1 , · · · , Xd ) be a square integrable random vector. The covariance matrix
of X is the square d × d matrix denoted Γ_X defined by

Γ_X = ( Cov(X_i , X_j ) : 1 ≤ i, j ≤ d ) = E(X X̃) − E(X) E(X̃), (2.14)

where B̃ designates the transpose matrix of the matrix B. It is a non negative, symmetric
matrix. More precisely, for every vector (a_1 , · · · , a_d ) ∈ R^d , if we identify a with the (d, 1) column
matrix of the a_i ,

ã Γ_X a = Var( Σ_{i=1}^d a_i X_i ) ≥ 0. (2.15)

Finally, for every linear function h : R^d → R^r with matrix A in the canonical basis,

Γ_{AX} = A Γ_X Ã. (2.16)


Proof. (i) It suffices to apply the Schwarz inequality (1.11).
(ii) Following the point (i), ΓX is well defined and symmetric. The equations (2.15) and
(2.16) come from immediate calculations. 2
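As a numerical illustration (a sketch, not part of the notes), the identities (2.14) and (2.16) can be checked on simulated data; the choice of the matrix A and of standard Gaussian components below is arbitrary.

import numpy as np

rng = np.random.default_rng(3)
d, n = 3, 100_000
X = rng.standard_normal((d, n))          # n observations of a vector with Gamma_X = I_3
A = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, -1.0]])         # an arbitrary 2 x 3 matrix

gamma_X = np.cov(X)                      # empirical covariance matrix of X (close to I_3)
gamma_AX = np.cov(A @ X)                 # empirical covariance matrix of AX
print(np.round(gamma_AX, 2))
print(np.round(A @ gamma_X @ A.T, 2))    # formula (2.16): Gamma_AX = A Gamma_X A~

The two printed matrices should agree up to sampling error.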
The image measure theorem allows us to define the distribution of a random vector
X : Ω → R^d as the probability P_X on R^d defined by
P_X (A) = P (X ∈ A) , ∀A ∈ R_d .
The monotone class theorem 1.14 shows that the distribution of a random vector X =
(X_1 , · · · , X_d ) may be characterized by the family

P( X ∈ Π_{i=1}^d [a_i , b_i ] ) = P( ∩_{i=1}^d {X_i ∈ [a_i , b_i ]} ). (2.17)
Definition 2.18 The random vector X : Ω → R^d has density f if f : R^d → [0, +∞[
is a non negative Borel function such that P (X ∈ A) = ∫_A f (x) dλ_d (x) for all A ∈ R_d .
The random vector X = (X_1 , · · · , X_d ) has density f if and only if for every choice of
real numbers a_i < b_i , if A_i = [a_i , b_i ],

P( ∩_{i=1}^d {X_i ∈ A_i } ) = ∫_{A_1 ×···×A_d} f (x_1 , · · · , x_d ) dλ_d (x_1 , · · · , x_d ).

Finally, X = (X_1 , · · · , X_d ) has density f if and only if for every Borel function ϕ : R^d →
[0, +∞[,

E(ϕ(X)) = ∫_{R^d} ϕ(x) f (x) dλ_d (x),

and this equality is also true if ϕ is Borel (of any sign) such that ϕ(X) is integrable, or
in an equivalent way, ϕf is λ_d -integrable.
The calculation of integrals with respect to the measure λd , the product measure of
the Lebesgue measure on R, is accomplished by application of the Fubini-Tonelli Theorem
1.53 or Fubini-Lebesgue Theorem 1.54. These theorems allow calculation of the densities of
« sub-vectors » of X.


Theorem 2.19 Let X = (X_1 , · · · , X_d ) be a random vector of density f , k an integer between
1 and d − 1, 1 ≤ i_1 < i_2 < · · · < i_k ≤ d and Y = (X_{i_1} , · · · , X_{i_k} ). Then the vector Y has the
density g defined by

g(y) = ∫_{R^{d−k}} f (x_1 , · · · , x_d ) dλ_{d−k} (z), (2.18)

denoting y = (x_{i_1} , · · · , x_{i_k} ), j_1 < · · · < j_{d−k} the elements of {1, · · · , d}\{i_1 , · · · , i_k } and
z = (x_{j_1} , · · · , x_{j_{d−k}} ).
In particular, if d = 2, each component X_i of the pair (X_1 , X_2 ) has a density, called the
i-th marginal density f_i , defined respectively by

f_1 (x_1 ) = ∫_R f (x_1 , x_2 ) dx_2 and f_2 (x_2 ) = ∫_R f (x_1 , x_2 ) dx_1 . (2.19)

We calculate the density of transformations of a random vector by the following change


of variables theorem. We first of all characterize permissible « changes of variables ».

Definition 2.20 Let ∆ be an open set of Rd and Φ : ∆ → Rd . We say that Φ is a C 1 -


diffeomorphism of ∆ on its image D = Φ(∆) if
(i) Φ is a bijection of ∆ on D
(ii) Φ is of class C 1 on ∆, that is to say that the partial derivatives exist and are conti-
nuous.
(iii) Φ′(a) is an invertible linear function from R^d into R^d for all a ∈ ∆, that is to say, if
J_Φ (a) designates the Jacobian matrix of Φ at the point a, defined by (∂Φ_i /∂x_j (a), 1 ≤ i, j ≤ d),
its determinant |J_Φ (a)| is different from zero at every point a ∈ ∆.
Then D is an open set of R^d and for all a ∈ ∆, (Φ^{−1})′(Φ(a)) = (Φ′(a))^{−1} .

Theorem 2.21 Let ∆ be an open set of Rd and Φ : ∆ → D a C 1 -diffeomorphism of ∆ on


its image D. Then
(i) For every non negative Borel function φ : D → [0, +∞[,
∫_D φ(y) dy = ∫_∆ φ(Φ(x)) |J_Φ (x)| dx. (2.20)

(ii) A Borel function φ is integrable for the restriction of the Lebesgue measure on D if
and only if the function φ ◦ Φ |JΦ | is integrable for the restriction of the Lebesgue measure
on ∆. In this case, the equation (2.20) is again true.

We immediately deduce the following corollary on the density of Φ(X).

Corollary 2.22 Let ∆ be an open set of Rd and Φ : ∆ → D a C 1 -diffeomorphism of ∆ on


its image D, X : Ω → Rd a random variable of density f such that P (X ∈ ∆) = 1. Then
the random variable Y = Φ(X) is defined almost surely with values in R^d , and its density
is the function g defined by

g(y) = f (Ψ(y)) |J_Ψ (y)| 1_D (y), (2.21)

where Ψ := Φ−1 : D → ∆ designates the reciprocal function of Φ.
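As an illustrative sketch (not part of the notes), Corollary 2.22 can be checked on the polar-coordinate change of variables: if (X₁, X₂) is uniform on the unit disc and Ψ(r, θ) = (r cos θ, r sin θ), then |J_Ψ(r, θ)| = r, so (2.21) predicts that (R, Θ) has density r/π on ]0, 1[×]−π, π[, hence that R has density 2r on ]0, 1[ and P(R ≤ 1/2) = 1/4; the sampler below uses the rejection method of section 2.4.2 and an arbitrary sample size.

import math, random

random.seed(4)

def uniform_on_disc():
    # rejection from the square [-1, 1]^2 (cf. section 2.4.2)
    while True:
        x, y = 2 * random.random() - 1, 2 * random.random() - 1
        if x * x + y * y <= 1:
            return x, y

N = 200_000
radii = [math.hypot(*uniform_on_disc()) for _ in range(N)]
print(sum(1 for r in radii if r <= 0.5) / N)   # should be close to 1/4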


2.3 Independence

It is necessary for us to more precisely formalize the implicit notions in the models
described for introducing the classical distributions, such as the Poisson distribution, the
geometric distribution, ...
The notion of independence is « purely probabilistic » ; its objective is to give a mathe-
matical formulation of successive experiments such that « the results of the early ones do not
influence the ones that follow », but it is not exactly equivalent to this intuitive property.
On the other hand, it is necessary to avoid any confusion with linear independence.
The central notion is that of independence of σ-algebras, to which all the other definitions
return.
Definition 2.23 (i) Two events A and B in F are independent if
P (A ∩ B) = P (A)P (B).
(ii) A family (Gk , 1 ≤ k ≤ n) of sub-σ-algebras of F is independent if
P (∩_{k=1}^n A_k ) = Π_{k=1}^n P (A_k ) , ∀A_k ∈ G_k , 1 ≤ k ≤ n. (2.22)
A sequence (G_k , k ≥ 1) of sub-σ-algebras of F is independent if and only if for every integer
n, the family (Gk , 1 ≤ k ≤ n) is independent.
(iii) A family of sets (Ak , 1 ≤ k ≤ n) (resp. (Ak , k ≥ 1)) of events is independent if the
family of σ-algebras (σ(Ak ), 1 ≤ k ≤ n) (resp. (σ(Ak ), k ≥ 1) ) is independent.
(iv) A finite family of random vectors (Xk : 1 ≤ k ≤ n) (resp. a sequence of random
vectors (Xk , k ≥ 1)), where Xk : Ω → Rdk , is independent if the family of the σ-algebras
σ(Xk ) = Xk−1 (Rdk ) is independent.
(v) A random vector X : Ω → Rd and a sub-σ-algebra G ⊂ F are independent if the
σ-algebras σ(X) = X −1 (Rd ) and G are independent.
We immediately see that the events A and B are independent if and only if the σ-
algebras σ(A) = {∅, Ω, A, Ac } and σ(B) are independent. Furthermore, if the random vectors
Xk : Ω → Rdk , 1 ≤ k ≤ n, are independent, and if Φk : Rdk → Rrk are Borel functions, then the
random vectors Yk = Φk ◦ Xk : Ω → Rrk are independent.
The following result characterizes the distribution of a random vector (X1 , · · · , Xd ) where
the components are independent (or blocks of the components are independent). Its proof
rests on the fact that a probability on R^d is characterized by the values that it takes on
the boxes Π_{i=1}^d [a_i , b_i ].
Theorem 2.24 Let Xi : Ω → Rdi , 1 ≤ i ≤ n be random vectors, k ∈ {1, · · · , n − 1} and
Y = (X1 , · · · , Xk ), Z = (Xk+1 , · · · , Xn ) be sub-vectors of X = (Y, Z) = (X1 , · · · , Xn ). Then
the properties are equivalent :
(i) The random vectors Y and Z are independent
(ii) The distribution of X on R^{d_1 +···+d_n} is equal to the product of the distributions of Y
and of Z, respectively on R^{d_1 +···+d_k} and on R^{d_{k+1} +···+d_n} , that is to say P_{(Y,Z)} = P_Y ⊗ P_Z .
In the particular case where the random vectors Y and Z have for density, respectively,
g and h, these two properties are also equivalent to
(iii) The density of the pair of random vectors X = (Y, Z) is the « product » of the
densities g and h ; more precisely, it is the function f of x = (y, z) defined by
f (y, z) = g(y)h(z). (2.23)


This result extends to a finite number of sub-vectors of X. In the particular case of real
random variables Xi that have density fi , the random variables X1 , · · · , Xn are independent
if and only if the density of the vector X = (X1 , · · · , Xn ) on Rn is the function f defined
by f (x1 , · · · , xn ) = f1 (x1 ) · · · fn (xn ).
We deduce from this characterization of independence that if the random vectors X_k ,
1 ≤ k ≤ n, are independent, then for all k = 1, · · · , n − 1 the vectors Y = (X_1 , · · · , X_k ) and
Z = (X_{k+1} , · · · , X_n ) are independent, and consequently that if Φ : R^{d_1 +···+d_k} → R^l and Ψ :
R^{d_{k+1} +···+d_n} → R^r are Borel functions, the random vectors Φ(Y ) and Ψ(Z) are independent.
Again, this property extends to a finite number of sub-vectors of X = (X1 , · · · , Xn ).
The theorems of Fubini-Tonelli 1.53 and of Fubini-Lebesgue 1.54 indicate that if X and
Y are independent real random variables of densities, respectively, f and g, then X + Y has
for density the convolution product f ∗ g defined by (1.22). The Fubini theorems imply the
following result.
Theorem 2.25 Let Xi : Ω → Rdi , i = 1, 2 be random independent vectors and Φi : Rdi → R
be Borel functions.
(i) If the functions Φi are non negative,

E[Φ_1 (X_1 ) Φ_2 (X_2 )] = E[Φ_1 (X_1 )] E[Φ_2 (X_2 )] . (2.24)

(ii) If the functions Φi are such that the random variables Φi (Xi ) are integrable, the
random variable Φ1 (X1 )Φ2 (X2 ) is also integrable and the equation (2.24) is also true.
We remark that the independence allows to weaken the square integrability hypothesis
for the product of real random variables to be integrable. We deduce in particular that if two
random variables X and Y are independent and integrable, the product XY is integrable
and that E(XY ) = E(X)E(Y ), that is to say that the covariance Cov(X, Y ) = 0. The
converse is false, as shown in the following example (which has other properties that will be
laid out later).
Example 2.26 Let X be an N (0, 1) random variable, a > 0, and Y be the random variable
defined by Y = X 1_{{|X|>a}} − X 1_{{|X|≤a}} . Then the random variable Y is also
N (0, 1), but it is a non-constant function of X which is not independent of X. If we write
G(t) = ∫_0^t x^2 (1/√(2π)) exp(−x^2 /2) dx, the function G : [0, +∞[→ [0, 1/2[ is continuous, G(0) = 0 and
lim_{t→+∞} G(t) = 1/2. The intermediate value theorem then shows that there exists a > 0
(which is furthermore unique) such that G(a) = 1/4. For this value of a, E(X^2 1_{{|X|≤a}}) =
2G(a) = 1/2 = E(X^2 1_{{|X|>a}}), hence E(XY ) = 0 and thus Cov(X, Y ) = 0.
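A numerical check of this example (a sketch, not part of the notes): the value of a is obtained below by bisection on G, using the identity G(t) = (Φ(t) − 1/2) − t φ(t), where φ and Φ denote the N(0, 1) density and distribution function; this identity, the seed and the sample size are the only ingredients added here.

import math, random

def G(t):
    # G(t) = integral_0^t x^2 phi(x) dx = (Phi(t) - 1/2) - t*phi(t), by integration by parts
    phi = math.exp(-t * t / 2) / math.sqrt(2 * math.pi)
    Phi = 0.5 * (1 + math.erf(t / math.sqrt(2)))
    return (Phi - 0.5) - t * phi

lo, hi = 0.0, 10.0
for _ in range(100):                 # bisection for G(a) = 1/4 (G is increasing)
    mid = (lo + hi) / 2
    lo, hi = (mid, hi) if G(mid) < 0.25 else (lo, mid)
a = lo

random.seed(5)
N = 500_000
xs = [random.gauss(0.0, 1.0) for _ in range(N)]
ys = [x if abs(x) > a else -x for x in xs]
print(a)                                        # approximately 1.54
print(sum(x * y for x, y in zip(xs, ys)) / N)   # close to 0: Cov(X, Y) = 0
p = sum(1 for x in xs if abs(x) <= a) / N       # note that {|Y| <= a} = {|X| <= a}
print(p, p * p)   # P(|X|<=a, |Y|<=a) = p differs from P(|X|<=a) P(|Y|<=a) = p^2: X and Y are dependent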
On the other hand, we deduce from Theorem 2.25 the
Corollary 2.27 Let X = (X1 , · · · , Xd ) : Ω → Rd be a random vector.
(i) If the components Xi of X are pairwise independent, the covariance matrix of X is
diagonal.
(ii) If k ∈ {1, · · · , d−1} is such that the vectors Y = (X_1 , · · · , X_k ) and Z = (X_{k+1} , · · · , X_d )
are independent, then the covariance matrix Γ_X of X is block diagonal, that is to say

Γ_X = ( Γ_Y   0  )
      ( 0    Γ_Z ).
The converses of the two results of the Corollary 2.27 are false.


2.4 Simulation

In a number of concrete situations, instead of performing a random experiment
(which may be expensive, dangerous, ...) modeled by a real random variable or a random
vector X, we prefer to « simulate » this random variable, that is to say to obtain a numerical
or vectorial result x that corresponds to an observation X(ω) of this experiment.
This number or this vector x is called a realization of the random variable X. We will wish
moreover to simulate a realization of a sequence of independent random variables of the
same distribution as X.
It is then crucial to answer the following question : how do we generate a sequence of
numbers (x_n , n ≥ 1) which is a realization (X_n (ω) , n ≥ 1) of a sequence of independent
random variables with the same given distribution ?
We will see that, from a theoretical point of view, answering this question reduces to the
case of the uniform distribution U ([0, 1]) on the interval [0, 1] ; the corresponding simulated
values are called pseudo-random numbers. The question of characterizing such a sequence as
« statistically acceptable » (as much for the adequacy of the distribution U ([0, 1]) as for
the independence of the consecutive draws) will not be addressed here. Finding a « good »
pseudo-random number generator has long been an applied problem of first importance,
but there now exist program libraries which contain excellent generators (e.g.
the « Mersenne Twister ».)
Most of the generators are of « congruential » type, that is, they generate an integer
sequence (x_n , n ≥ 0) given by the induction relation :

x_{n+1} = a x_n + c (mod m) ;

the initial value x_0 is called the seed, a is the multiplier, c is the increment and m the modulus
of the sequence. The sequence (x_n ) takes its values between 0 and m − 1 and the sequence
(x_n /m , n ≥ 1) takes its values in the interval [0, 1[. The maximal period of such a generator
is m and it is important, for the simulations of large samples of the given distribution, to
have generators of large period. The period of the Mersenne Twister is of the order of 10^6000 .
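For concreteness, here is a minimal congruential generator written in Python (a sketch; the constants are those of the classical Park-Miller « minimal standard » generator, chosen only as an example, and in practice a library generator such as the Mersenne Twister should be preferred).

def lcg(x0, a=16807, c=0, m=2**31 - 1):
    # congruential generator x_{n+1} = a*x_n + c (mod m); yields x_n / m in [0, 1[
    x = x0
    while True:
        x = (a * x + c) % m
        yield x / m

gen = lcg(x0=12345)
print([next(gen) for _ in range(5)])    # five pseudo-random numbers in [0, 1[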

We suppose then that we know how to simulate the realization of a sample of uniform
distribution on [0, 1], that is to say a numerical sequence (un , n ≥ 0) of reals in [0, 1], which is
a realization (Un (ω), n ≥ 0) for a sequence (Un , n ≥ 0) of independent random variables with
the same uniform distribution U ([0, 1]), for example by executing the command Random in a
program. We now describe three classical methods for simulating other real distributions
from (U_n ).

2.4.1 Method of inversion of the distribution function

We seek to simulate realizations of independent real random variables (X_n , n ≥ 1)
with the same distribution, whose distribution function F : R → [0, 1] is defined by F (t) =
P (X_1 ≤ t) for all t ∈ R.

Proposition 2.28 We write F^{−1} : ]0, 1[→ R for the pseudo-inverse of F defined by

F^{−1}(u) = inf{t : F (t) ≥ u} for all u ∈]0, 1[ .

Then if U follows a distribution U (]0, 1[), F^{−1}(U ) has distribution function F .

2007-2008 Probability 1 - Annie Millet


38 2 Probabilistic formulation

Proof. We show first of all that for all u ∈]0, 1[ and t ∈ R, F^{−1}(u) ≤ t if and only if
u ≤ F (t). In fact, if u ≤ F (t), by definition F^{−1}(u) ≤ t. Conversely, let y > t ≥ F^{−1}(u) ;
then, because F is increasing, F (y) ≥ u and because F is right-continuous, when y > t
converges to t, we deduce F (t) ≥ u. Hence P (F^{−1}(U ) ≤ t) = P (U ≤ F (t)) = F (t). 2
−1
If the distribution function F of X is explicit, we deduce that (F (Un ), n ≥ 1) is a
sample of the distribution X. This furnishes for example a simulation algorithm when :

Case 1. X takes a finite (or countable) number of values. We suppose that the
values taken by X are (a_i , 0 ≤ i ≤ N ), ordered in increasing manner, and that P (X = a_i ) =
p_i for all i. We then calculate F_i = p_0 + · · · + p_i for all i and for all u ∈]0, 1[ we write :
F^{−1}(u) = a_0 1_{{u≤F_0 }} + Σ_{i≥1} a_i 1_{{F_{i−1} <u≤F_i }} .
Example of a Bernoulli distribution of parameter p : P (X = 0) = q = 1 − p and
P (X = 1) = p . We deduce the simulation of n independent random variables of the same
Bernoulli distribution of parameter p ∈]0, 1[ that we place in the table X (by using the fact
that if U follows a uniform distribution U ([0, 1]), 1 − U also follows a uniform distribution
U ([0, 1])) :
For k = 1, ..., n
If (Random < p)
X[k] ← 1
Else X[k] ← 0
End
X takes a finite number of values : If X takes N + 1 values, at the beginning of the
program we calculate the values of F_i , which we store in the array F [i], i = 0, · · · , N , and we
also store the values a_i in an array a[i]. The critical loop is then the following :
i←0
U ← Random
While (U > F [i])
i←i+1
End
X[k] ← a[i]
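A runnable counterpart of the two loops above (a sketch; the function and variable names are ours): it simulates a variable taking the values a_0 < · · · < a_N with probabilities p_0, · · · , p_N by inversion of the distribution function.

import random

def simulate_discrete(values, probs):
    # inversion of the distribution function: F_i = p_0 + ... + p_i
    u, cumulative = random.random(), 0.0
    for a_i, p_i in zip(values, probs):
        cumulative += p_i
        if u <= cumulative:
            return a_i
    return values[-1]          # guard against rounding errors

random.seed(6)
sample = [simulate_discrete([0, 1, 2], [0.5, 0.3, 0.2]) for _ in range(100_000)]
print(sample.count(0) / len(sample), sample.count(1) / len(sample))   # close to 0.5 and 0.3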

Case 2. Exponential distribution E(λ) of parameter λ > 0. The density of X is
f (x) = λ e^{−λx} 1_{{x>0}} and the distribution function is thus F (t) = 0 if t ≤ 0 and F (t) =
1 − e^{−λt} < 1 if t > 0. We deduce, for u ∈ ]0, 1[ : F^{−1}(u) = − ln(1 − u)/λ. If U follows a uniform
distribution U ([0, 1]), 1 − U also follows a uniform distribution U ([0, 1]) and we deduce
a simulation algorithm of an exponential distribution of parameter λ :
X = − ln( Random )/λ
The use of the exponential distributions provides a simulation method of the Poisson dis-
tribution of parameter λ.
Proposition 2.29 Let (Ei , i ≥ 1) be a sequence of independent random variables of the
same exponential distribution of parameter λ > 0 ; then for every integer n ≥ 1,
p_n = P (E_1 + · · · + E_n ≤ 1 < E_1 + · · · + E_{n+1} ) = e^{−λ} λ^n /n! .


Proof : For every integer n ≥ 1,

p_n = λ^{n+1} ∫_{{x_1 +···+x_n ≤1<x_1 +···+x_{n+1} }} exp(−λ(x_1 + · · · + x_{n+1} )) dx_1 · · · dx_{n+1}
    = λ^n ∫_{{x_1 +···+x_n ≤1}} exp(−λ(x_1 + · · · + x_n )) exp(−λ[1 − (x_1 + · · · + x_n )]) dx_1 · · · dx_n
    = e^{−λ} λ^n ∫_{{x_1 +···+x_n ≤1}} dx_1 · · · dx_n = e^{−λ} λ^n /n! . 2

We deduce that, if we simulate independent random variables (U_i , i ≥ 1) with the same
distribution U ([0, 1]) and if n(ω) designates the first integer n ≥ 0 such that U_1 U_2 · · · U_{n(ω)+1} < e^{−λ} ,
then n follows a Poisson distribution P(λ). Hence a simulation algorithm of a random variable
X of distribution P(λ) reads :

a ← exp(−λ), X ← 0
U ← Random
While (U > a) do
U ← U ∗ Random , X ← X + 1
End
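A Python transcription of this algorithm (a sketch; the empirical mean at the end is compared with λ, since E(X) = λ for the P(λ) distribution, and the parameter values are arbitrary).

import math, random

def poisson(lam):
    # multiply uniforms until the product drops below exp(-lam); X = number of factors - 1
    threshold, x, u = math.exp(-lam), 0, random.random()
    while u > threshold:
        u *= random.random()
        x += 1
    return x

random.seed(7)
lam, N = 2.5, 100_000
sample = [poisson(lam) for _ in range(N)]
print(sum(sample) / N)       # close to E(X) = lam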

2.4.2 Rejection method


We suppose first of all that we know how to simulate by the algorithm A a random
variable of uniform distribution on a Borel set D ⊂ Rd (for example the square ] − 1, +1[2 )
and that we wish to simulate a random variable of uniform distribution on the Borel subset
C ⊂ D. The algorithm

Do X ← A
While (X ∉ C)
End
Return X
gives a simulation of the uniform distribution on C. In fact, let (Xn , n ≥ 1) be a sequence of
independent random variables of uniform distribution on D and τ = inf{n ≥ 1 : Xn ∈ C} ;
the preceding algorithm returns the random variable Xτ such that for all Borel subsets
B ⊂ C,

P (X_τ ∈ B) = Σ_{k=1}^∞ P ({τ = k} ∩ {X_k ∈ B}) = Σ_{k=1}^∞ P (X_1 ∉ C)^{k−1} P (X_k ∈ B)
= Σ_{k=1}^∞ ( 1 − |C|/|D| )^{k−1} |B|/|D| = (|B|/|D|) (|D|/|C|) = |B|/|C| .

The following figure shows a rejection method for the simulation of the uniform distribution
on the unit disc from a uniform distribution on the square [−1, 1]^2 . Among 10 000 points drawn in
the square, only 7 848 are kept because they are in the unit disc ; this is consistent with the
equality π/4 = 0.785 398. This simulation has been obtained as follows :


[Figure : 10 000 points drawn uniformly on the square [−1, 1]^2 ; the points falling inside the unit disc are kept.]

Do U ← 2 ∗ Random − 1, V ← 2 ∗ Random − 1
While (U ∗ U + V ∗ V > 1)
End
X ← U and Y ← V
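For the record, here is a minimal Python transcription of the loop above (a sketch ; the function name uniform_in_disk is ours, not the notes') :

import random

def uniform_in_disk():
    # Keep drawing (U, V) uniformly in [-1, 1]^2 until the point falls in the unit disk.
    while True:
        u = 2.0 * random.random() - 1.0
        v = 2.0 * random.random() - 1.0
        if u * u + v * v <= 1.0:
            return u, v    # (X, Y) is uniform on the unit disk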

We now write the general rejection method. The idea lies in the following result : Simulating a random variable of density f amounts to drawing a point at random under the graph of f and returning the abscissa of this point. In fact if (X, Y) is a random variable with uniform distribution under the graph of the function f, then for all t ∈ R,

P(X ≤ t) = ∫_{−∞}^{t} ∫_{0}^{f(x)} dy dx.

We want to simulate a random variable X whose distribution has density f and we suppose that there exists a distribution of density g « which is easily simulated » and a
constant c > 0 such that :
f (x) ≤ c g(x) , ∀x ∈ R .
Because f and g are densities, we have c ≥ 1. This result may be generalized to a density
with respect to any measure and justifies the following rejection method :
Proposition 2.30 Let f and g be densities such that f ≤ c g ; we write q(x) = f(x)/(c g(x)) ∈ [0, 1].
Let Y1 be a random variable of density g and U1 be a random variable of uniform distribution U([0, 1]) independent of Y1. If U1 ≤ q(Y1), we set X = Y1. Otherwise, we reject Y1 and we simulate a sequence of independent random variables Yi of density g and Ui of uniform distribution U([0, 1]) up to τ = inf{i ≥ 1 : Ui ≤ q(Yi)}. Then the random variable X = Yτ has density f, τ − 1 follows a geometric distribution of parameter 1/c and E(τ) = c.


Proof : Because f and g are probability densities, we have :

P(U1 > q(Y1)) = ∫_{−∞}^{+∞} ∫_{f(y)/(c g(y))}^{1} g(y) du dy = ∫_{−∞}^{+∞} [ g(y) − (1/c) f(y) ] dy = 1 − 1/c.

We deduce that for every integer k ≥ 1, P(τ = k) = (1 − 1/c)^{k−1} (1/c), while for all t ∈ R :

P(X ≤ t) = Σ_{k=1}^{∞} (1 − 1/c)^{k−1} ∫_{−∞}^{t} ∫_{0}^{f(y)/(c g(y))} g(y) du dy
         = c ∫_{−∞}^{t} g(y) f(y)/(c g(y)) dy = ∫_{−∞}^{t} f(y) dy.

We remark that this method applies to the case where the random variables X and Y have
a density with respect to the same measure (which is not necessarily the Lebesgue measure,
but may be the counting measure). 2
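Before turning to the Gamma distribution, here is a hedged Python sketch of the rejection method of Proposition 2.30 ; the function names (rejection_sample, sample_g, density_g) and the final triangular-density example are ours, only for illustration :

import random

def rejection_sample(density_f, sample_g, density_g, c):
    # Simulate a draw of density f, assuming f <= c * g and that sample_g simulates density g.
    # Returns the accepted value and the number tau of proposals (E(tau) = c).
    tau = 0
    while True:
        tau += 1
        y = sample_g()                                  # candidate Y_i of density g
        u = random.random()                             # U_i uniform on [0, 1], independent of Y_i
        if u <= density_f(y) / (c * density_g(y)):      # accept when U_i <= q(Y_i)
            return y, tau

# Example : the density f(x) = 2x on ]0, 1[ satisfies f <= 2 g for the uniform density g = 1 on ]0, 1[.
x, tau = rejection_sample(lambda x: 2.0 * x, random.random, lambda x: 1.0, 2.0)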
Application to the Gamma distribution The rejection method allows for example the simulation of a random variable of distribution Γ(λ, a), that is to say of density f(x) = (λ^a/Γ(a)) exp(−λx) x^{a−1} 1_{]0,+∞[}(x), where λ and a are strictly positive parameters and Γ(a) = ∫_0^{+∞} e^{−x} x^{a−1} dx.
If X and Y are independent random variables of distribution Γ(λ, a) and Γ(λ, b) respectively, the random variable X + Y follows a distribution Γ(λ, a + b). Moreover, the distribution Γ(λ, 1) is the exponential distribution E(λ). Adding n independent exponential random variables E(λ) thus leads to a random variable of distribution Γ(λ, n), for any integer n greater than or equal to 1.
Finally a change of variables shows that if Y follows a distribution Γ(1, a), the random variable X = Y/λ follows a distribution Γ(λ, a). Hence, in order to simulate all the distributions Γ(λ, a), it suffices to know how to simulate a random variable of distribution Γ(1, a) for a parameter a ∈ ]0, 1[, which is possible by the following rejection method of Ahrens and Dieter (1974) modified by Best (1983). Note an important numerical feature of this method : it is not necessary to calculate Γ(a).
Let a ∈ ]0, 1[ and f(x) = (1/Γ(a)) e^{−x} x^{a−1} 1_{]0,+∞[}(x), and

g(x) = (a e/(a + e)) [ x^{a−1} 1_{]0,1[}(x) + e^{−x} 1_{[1,+∞[}(x) ] ;

then f ≤ [(a + e)/(a e Γ(a))] g and for all x > 0 :

q(x) = f(x) / ( [(a + e)/(a e Γ(a))] g(x) ) = e^{−x} 1_{]0,1[}(x) + x^{a−1} 1_{[1,+∞[}(x).
Let Y be a random variable of density g ; we may easily calculate the distribution function
G of Y and its inverse is defined for z ∈ ]0, 1[ by :

G^{−1}(z) = ( (a + e) z / e )^{1/a} 1_{]0, e/(a+e)[}(z) − ln( (a + e)(1 − z) / (a e) ) 1_{[e/(a+e), 1[}(z).
(1) We simulate a random variable U of uniform distribution U ([0, 1]) and then calculate
Y = G−1 (U ). We then simulate V with uniform distribution U ([0, 1]) independent of U .
(2) If V ≤ q(Y ), we set X = Y and if not we return to (1).
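The two steps above translate into the following hedged Python sketch (the function names gamma_unit_scale and gamma_sample are ours) :

import math
import random

def gamma_unit_scale(a):
    # Simulate Gamma(1, a) for 0 < a < 1 by the rejection method above ; Gamma(a) is never needed.
    ratio = (a + math.e) / math.e              # the cut point of G^{-1} is e / (a + e) = 1 / ratio
    while True:
        u = random.random()
        if u < 1.0 / ratio:                    # left branch of G^{-1} : Y in ]0, 1[
            y = (ratio * u) ** (1.0 / a)
            q = math.exp(-y)                   # q(Y) = e^{-Y} on ]0, 1[
        else:                                  # right branch of G^{-1} : Y >= 1
            y = -math.log((a + math.e) * (1.0 - u) / (a * math.e))
            q = y ** (a - 1.0)                 # q(Y) = Y^{a-1} on [1, +infinity[
        if random.random() <= q:               # step (2) : accept when V <= q(Y)
            return y

def gamma_sample(lam, a):
    # Simulate Gamma(lambda, a) for a in ]0, 1[ using the scaling X = Y / lambda.
    return gamma_unit_scale(a) / lam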


2.4.3 Simulation of Gaussian random variables


The distribution function of a standard (centered and reduced) Gaussian distribution N(0, 1) is not explicit and the use of the tabulated distribution function accumulates errors. We lay out an « exact » method of simulation called the Box-Muller method. If X1 and X2 are independent standard Gaussian random variables N(0, 1), then the random variables X_i², i = 1, 2 are independent and a change of variables shows that they follow a Gamma distribution Γ(1/2, 1/2) of density f(x) = (1/√(2π)) e^{−x/2} x^{−1/2} 1_{]0,+∞[}(x). The random variable R² = X1² + X2² thus follows an exponential distribution of parameter 1/2. If we set X1 = R cos(θ) and X2 = R sin(θ), a change of variables shows that θ follows a uniform distribution on the interval [0, 2π] and is independent of R. We deduce that

Proposition 2.31 Let U1 and U2 be independent random variables of the same uniform
distribution U ([0, 1]) ; then the random variables
X1 = √(−2 ln(U1)) cos(2π U2) and X2 = √(−2 ln(U1)) sin(2π U2)

are independent Gaussian N (0, 1).

(We may show this proposition directly as an exercise.)
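A minimal Python sketch of Proposition 2.31 (the function name box_muller is ours) :

import math
import random

def box_muller():
    # Return a pair of independent N(0, 1) variables built from two independent U([0, 1]) draws.
    u1 = random.random()
    while u1 == 0.0:                     # avoid log(0) ; this event has probability zero in theory
        u1 = random.random()
    u2 = random.random()
    r = math.sqrt(-2.0 * math.log(u1))
    return r * math.cos(2.0 * math.pi * u2), r * math.sin(2.0 * math.pi * u2)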

[Figure : « Histogramme des échantillons par Box-Muller » — histogram of the Box-Muller samples (density, from 0 to 0.40, against the values, from −6 to 6).]

The above figure shows the histogram of the simulation of Gaussian random variables
N (0, 1) by the Box-Muller method with the help of 10 000 pairs of independent uniform
draws, and the graph of the theoretical density.


3 Conditional Expectation

This notion is fundamental in the study of a number of stochastic processes (martingales,


Markov chains, Brownian motion, ...) that intervene in the modeling of many phenomena,
in particular in finance.
We first of all seek the best approximation of a random variable X that « depends on all
the information contained in the σ-algebra F » in terms of another random variable which
only uses « more reduced (or smaller) information » , that is to say which is measurable
with respect to a sub-σ-algebra G of F . This idea of the use of reduced information has
already been used to define the classical notion of conditional probability. On the other
hand, we have seen that the expectation of a square integrable random variable already solves the question of the best approximation in L2 when the σ-algebra G = {∅, Ω} is the trivial σ-algebra, with respect to which the only random variables are the constants. In some particular cases, we will see that solving the approximation problem in L2 amounts to calculating the expectation with respect to the conditional probabilities.
We will next generalize this notion beyond the framework of approximation in L2 : as
with expectation, we define the conditional expectation of non negative or integrable random
variables given a sub σ-algebra, and we extend numerous results from the previous chapter
to this context.

3.1 Conditional Probability

We recall the definition :


Definition 3.1 Let A ∈ F be an event such that P (A) > 0. For every event B ∈ F , the
conditional probability of B given A is defined by
P(B/A) = P(A ∩ B) / P(A).
It is easy to see that B ∈ F → P (B/A) is a probability that only gives positive mass to
subsets of A. This notion is a very useful tool for describing the random phenomena that
occur sequentially, in such a way that the second experiment depends on the result of the first. It then allows us to avoid a tedious description of the model. For example, we lay out two urns U1 and U2 ; each urn Ui contains Ni black balls and Bi white balls. We choose an urn at random and then choose a ball at random from it. It is clear that in this case, if A = {ω : we draw from the urn U1} and B = {ω : we draw a black ball}, then P(B/A) = N1/(N1 + B1).
The following Proposition brings together some useful properties of conditional proba-
bility.
Proposition 3.2 (i) Let (An, n ≥ 1) be a partition of Ω into events such that P(An) > 0 for every integer n. Then for all B ∈ F,

P(B) = Σ_n P(B/An) P(An).

(ii) Let A ∈ F and B ∈ F be events such that P(A) > 0, P(A^c) > 0 and P(B) > 0. Then

P(A/B) = P(B/A) P(A) / [ P(B/A) P(A) + P(B/A^c) P(A^c) ].


(iii) Let (Ak, 1 ≤ k ≤ n) be a family of measurable sets such that P(∩_{k=1}^{n−1} Ak) > 0. Then

P(∩_{k=1}^{n} Ak) = P(A1) P(A2/A1) P(A3/A1 ∩ A2) · · · P(An/A1 ∩ ... ∩ An−1).
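As a purely numerical illustration of (i) and (ii), here is a hedged Python sketch for the two-urn example above ; the counts N1, B1, N2, B2 are hypothetical values chosen only for the computation :

# Two urns chosen with probability 1/2 each ; urn Ui holds Ni black and Bi white balls.
N1, B1 = 3, 7      # hypothetical counts for urn U1
N2, B2 = 6, 4      # hypothetical counts for urn U2

P_A = 0.5                                 # P(A) : urn U1 is chosen
P_B_given_A = N1 / (N1 + B1)              # P(B/A) : black ball given urn U1
P_B_given_Ac = N2 / (N2 + B2)             # P(B/A^c)

P_B = P_B_given_A * P_A + P_B_given_Ac * (1 - P_A)     # formula (i)
P_A_given_B = P_B_given_A * P_A / P_B                  # formula (ii), Bayes' rule
print(P_B, P_A_given_B)                                # 0.45 and 1/3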

3.2 Orthogonal Projection Theorem

This theorem is an infinite-dimensional version of the classical theorem of orthogonal projection in Rd onto a vector subspace of dimension strictly less than d. It allows the reformulation of the problem of minimization of the norm in terms of orthogonality constraints, in an equivalent way that is technically easier to manipulate.
We recall that the space L2 := L2(Ω, F, P) is a Hilbert space, that is to say that it is complete for the norm ‖·‖₂ associated to the scalar product defined by (1.17), which in the particular case of a probability is written for X ∈ L2 and Y ∈ L2

⟨X, Y⟩ = E(XY). (3.1)

Theorem 3.3 (Orthogonal Projection Theorem) (i) Let H be a closed vector subspace of L2. For all X ∈ L2, the following properties are equivalent for an element Π(X) ∈ H :
(a) ‖X − Π(X)‖₂ = inf{‖X − Z‖₂ : Z ∈ H}.
(b) ⟨X − Π(X), Z⟩ = 0 for all Z ∈ H.
(ii) For all X ∈ L2, there exists a unique element Π(X) ∈ H that satisfies the properties (a) or (b), called the orthogonal projection of X on H.
(iii) The map from L2 to H defined by X ↦ Π(X) is linear and ‖Π(X)‖₂ ≤ ‖X‖₂.

Proof. (i) First of all we show that (a) implies (b). Let Z ∈ H ; for all λ ∈ R, Π(X) + λZ ∈ H, hence

‖X − Π(X)‖₂² ≤ ‖X − Π(X) − λZ‖₂² = ‖X − Π(X)‖₂² − 2λ⟨X − Π(X), Z⟩ + λ²‖Z‖₂².

We deduce that the trinomial λ²‖Z‖₂² − 2λ⟨X − Π(X), Z⟩ ≥ 0 for all λ ∈ R ; the discriminant of this trinomial is thus negative or zero, which implies (b).
Conversely, we suppose that (b) is true and let Z ∈ H. Then Π(X) − Z ∈ H and

‖X − Z‖₂² = ‖(X − Π(X)) + (Π(X) − Z)‖₂²
= ‖X − Π(X)‖₂² + 2⟨X − Π(X), Π(X) − Z⟩ + ‖Π(X) − Z‖₂²
= ‖X − Π(X)‖₂² + ‖Π(X) − Z‖₂² ≥ ‖X − Π(X)‖₂².

(ii) Let m = inf{‖X − Z‖₂² : Z ∈ H} and for all n ≥ 1, let Z_n ∈ H such that ‖X − Z_n‖₂² ≤ m + 1/n. The parallelogram identity

‖a + b‖₂² + ‖a − b‖₂² = 2‖a‖₂² + 2‖b‖₂²

applied to a = X − Z_n and b = X − Z_{n+k} for every integer n ≥ 1 and k ≥ 1, and the fact that (Z_n + Z_{n+k})/2 ∈ H, thus shows

4‖X − (Z_n + Z_{n+k})/2‖₂² + ‖Z_n − Z_{n+k}‖₂² = 2‖X − Z_n‖₂² + 2‖X − Z_{n+k}‖₂²,


which implies

‖Z_n − Z_{n+k}‖₂² ≤ 2(m + 1/n) + 2(m + 1/(n+k)) − 4m ≤ 4/n.

The sequence (Z_n) is thus Cauchy and converges in L2 to a limit Π(X). Since H is closed, we deduce that Π(X) ∈ H ; on the other hand ‖X − Π(X)‖₂² = m. Let Y ∈ H be another element that equally satisfies ‖X − Y‖₂² = m. We again apply the parallelogram identity with a = X − Π(X) and b = X − Y. Then, because (Y + Π(X))/2 ∈ H,

4m + ‖Y − Π(X)‖₂² ≤ 4‖X − (Y + Π(X))/2‖₂² + ‖Y − Π(X)‖₂² = 2‖a‖₂² + 2‖b‖₂² = 2m + 2m,

which implies that ‖Y − Π(X)‖₂² = 0, that is to say Y = Π(X).
(iii) Let X, Y be square integrable random variables and λ ∈ R. Then U = Π(X) + λΠ(Y) ∈ H and for all Z ∈ H

⟨(X + λY) − (Π(X) + λΠ(Y)), Z⟩ = ⟨X − Π(X), Z⟩ + λ⟨Y − Π(Y), Z⟩ = 0.

The characterization of Π(X + λY) given in (b) then shows that Π(X + λY) = Π(X) + λΠ(Y).
Finally, the orthogonality of Π(X) ∈ H and X − Π(X) given by the property (b) shows that ‖X‖₂² = ‖Π(X)‖₂² + ‖X − Π(X)‖₂² ≥ ‖Π(X)‖₂², which completes the proof. 2

Example 3.4 • In the particular case where H is the vector subspace generated by the
function 1, that is to say the set of constants, we recover the fact that the best approximation
in L2 of a square integrable random variable by a constant is its expectation E(X). The
quadratic error, E(|X − E(X)|2 ) is the variance of X.
• If X designates a square integrable random variable, and if H designates the vector subspace of L2 generated by the random variables 1 and X, or in an equivalent way by the random variables 1 and X − E(X), we see that H is of dimension less than or equal to two and is thus closed. In this case, for every element Y of L2, Π(Y) = a + b(X − E(X)) and the constants a and b are characterized by the following properties, which consist of writing the orthogonality of Y − Π(Y) with the generating family of H formed by 1 and X − E(X) :

⟨Y − [a + b(X − E(X))], 1⟩ = E( Y − a − b(X − E(X)) ) = 0 ,
⟨Y − [a + b(X − E(X))], X − E(X)⟩ = E( [Y − a − b(X − E(X))][X − E(X)] ) = 0.

We deduce a = E(Y) and thus b Var(X) = Cov(X, Y).


If X is almost surely constant, then H is of dimension 1 (the set of constant random
variables) and by uniqueness of the orthogonal projection Π(Y ) = E(Y ), b is arbitrary (even
if Π(Y ) is unique, its decomposition in the generating family need not be unique ; indeed,
this generating family is not a basis because X − E(X) = 0).
If X is not almost surely constant, Var(X) > 0 and b = Cov(X, Y)/Var(X). We recall that the linear regression Π(Y) of Y by X is defined by

Π(Y) = E(Y) + [Cov(X, Y)/Var(X)] (X − E(X)).


Furthermore ‖Y − Π(Y)‖₂² = Var(Y)(1 − ρ²(X, Y)), where ρ(X, Y) = Cov(X, Y)/√(Var(X) Var(Y)) ∈ [−1, +1] according to the Schwarz inequality. We deduce that if X and Y are correlated, ‖Y − Π(Y)‖₂² < Var(Y), that is to say that the use of X has allowed the reduction of the quadratic error of the approximation of Y. However, if X and Y are independent, or more generally if they are not correlated, the use of X has not improved the approximation of Y.
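To make the formula concrete, here is a hedged Python sketch that estimates a = E(Y) and b = Cov(X, Y)/Var(X) from simulated data ; the model Y = 2X + noise is a hypothetical choice for illustration only :

import random

n = 100_000
xs = [random.random() for _ in range(n)]                     # X uniform on [0, 1]
ys = [2.0 * x + random.gauss(0.0, 0.1) for x in xs]          # hypothetical Y

mean_x = sum(xs) / n
mean_y = sum(ys) / n
var_x = sum((x - mean_x) ** 2 for x in xs) / n
cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n

a = mean_y                        # coefficient of the constant 1
b = cov_xy / var_x                # coefficient of X - E(X)
regression = lambda x: a + b * (x - mean_x)     # empirical version of Pi(Y)
print(a, b)                       # close to 1.0 and 2.0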

3.3 Construction of the conditional expectation E(X/G)

We say that a random variable Y : Ω → Rd is G-measurable if Y −1 (Rd ) ⊂ G. The goal


of this section is to generalize the previous examples in two directions. On the one hand, in the particular case G = σ(X), we would like to approximate, in the norm ‖·‖₂, the random variable Y ∈ L2 by a σ(X)-measurable random variable, that is to say by a Borel function (not necessarily affine) of X. More generally, we look for the best approximation in L2 of Y by a G-measurable square integrable random variable, that is to say one which only depends on the information described by the sub-σ-algebra G. This problem naturally appears in the study of
random processes, that is to say in families of random variables (Xn , n ≥ 0) or (Xt , t ∈ [0, 1])
that depend on time. In these models, the parameter n ∈ N or t ∈ [0, 1] represents the instant
at which we make the observation and the evolution of the phenomenon is such that what
we observe at an instant is not independent from that which we have already observed. The
« large » σ-algebra F contains the information of all the observations whereas the « small »
σ-algebra G only contains the information up to that instant n0 or t0 < 1. The conditional
expectation is often the tool that allows the description of evolution in time.
On the other hand, for technical reasons we would like to define the conditional expectation for random variables Y which are not necessarily square integrable, a case which can no longer be treated as a problem of minimization of the ‖·‖₂-norm. This last generalization is natural, because we will see that in the particular case where G = σ(X), computing the conditional expectation of a non negative or integrable random variable Y amounts to computing its « expectation with respect to the conditional distribution of Y given X = x ».

We begin by defining the conditional expectation when X ∈ L2


Theorem 3.5 Let G be a sub-σ-algebra of F and X be a square integrable real random
variable. There exists a random variable written E G (X) (or also E(X/G)) (unique up to
almost sure equivalence), G-measurable, square integrable , such that for all G-measurable
square integrable random variables Z

E(XZ) = E(E G (X)Z). (3.2)

This random variable is called the conditional expectation of X given G.


Proof. We again apply the orthogonal projection theorem to the set H of the elements of L2 of which at least one representative is G-measurable. As H = L2(Ω, G, P), it is a complete space and it is closed. By the usual abuse of language, we will identify the class of the orthogonal projection of X on H, which belongs to H, with a G-measurable representative (which is then unique up to almost sure equivalence) written E^G(X), and the equation (3.2) is just a translation of the property (b) of the orthogonal projection theorem 3.3. 2

For extending this theorem to the case where X is non negative or integrable, it is
necessary to solve two problems : the existence and the uniqueness of the extension.


The uniqueness in a non negative or integrable framework requires a characterization,


up to almost sure equivalence, by integration against indicator functions (or more generally
against square integrable random variables). This is ensured by the following lemma.

Lemma 3.6 Let X and Y be G-measurable random variables, which are either both non
negative, or both integrable, and such that E(1A X) ≤ E(1A Y ) (resp. E(1A X) = E(1A Y ))
for all A ∈ G. Then X ≤ Y a.s. (resp. X = Y a.s.).

Proof. For all a < b, we write F (a, b) = {Y ≤ a < b ≤ X}. Then {Y < X} =
∪a<b,a,b∈Q F (a, b) and it suffices to prove that P (F (a, b)) = 0 for all a < b. We suppose
that P (F (a, b)) > 0 ; then

E(1F (a,b) Y ) ≤ aP (F (a, b)) < bP (F (a, b)) ≤ E(1F (a,b) X),

which provides a contradiction. The a.s. equality is obtained by interchanging X and Y .


2

The following result allows us to show the existence of the extensions announced above.

Theorem 3.7 Let X be a non negative (resp. integrable) random variable. Then there exists
a non negative (resp. integrable) random variable E G (X), unique up to almost sure equiva-
lence, such that
E(1A X) = E(1A E G (X)) , ∀A ∈ G. (3.3)

Proof. The uniqueness comes from Lemma 3.6 and it thus suffices to prove the existence.
Let X ≥ 0 and for all n, Xn = X ∧ n ∈ L2 . Using Theorem 3.5 for every integer
n ≥ 1, E G (Xn ) ∈ L2 is such that for all A ∈ G, E(1A Xn ) = E(1A E G Xn ). Because the
sequence Xn is increasing, for all A ∈ G and for n ≥ 1, E(1A Xn ) ≤ E(1A Xn+1 ), hence
E(1A E G (Xn )) ≤ E(1A E G (Xn+1 )). Using Lemma 3.6, we deduce that the sequence E G (Xn )
is almost surely increasing. It thus converges almost surely to a G-measurable, non negative,
random variable, written Y . Furthermore, the monotone convergence theorem implies that
for all A ∈ G, E(1A Y ) = limn E(1A E G (Xn )) = limn E(1A Xn ) = E(1A X). We then deduce
that Y = E G (X). Moreover, if X is integrable, when A = Ω we deduce that E G (X) is
likewise integrable.
Let X = X + − X − ∈ L1 . Then the random variables X + and X − are non negative and
integrable ; both random variables E G (X + ) and E G (X − ) are thus also integrable (and hence
a.s. finite) and G-measurable. It then suffices to put E G (X) = E G (X + ) − E G (X − ). 2

The following proposition collects the immediate consequences of Theorem 3.7 and of
Lemma 3.6 and generalizes the usual properties of expectation. Its proof is left as an exercise.

Proposition 3.8 (i) E G (1) = 1 and if G = {∅, Ω}, E G (X) = E(X) a.s. for all non negative
or integrable random variables X.
(ii) For X, Y integrable, a, b ∈ R (resp. for X, Y non negative, a, b ∈ [0, +∞[),

E G (aX + bY ) = aE G (X) + bE G (Y ) a.s.

(iii) For X ≤ Y integrable (resp. non negative), E G (X) ≤ E G (Y ). 


(iv) For every integrable or non negative random variable X, E(E^G(X)) = E(X).


The following proposition presents a characterization of E^G(X) similar to that of (3.2), as soon as the terms of the equality make sense.

Proposition 3.9 (i) Let X be a non negative random variable. Then for every non negative and G-measurable random variable Z,

E(Z E^G(X)) = E(ZX). (3.4)

(ii) Let X be an integrable random variable. Then for each bounded and G-measurable random variable Z, the equation (3.4) is satisfied.
Proof. Let X ≥ 0 ; then E G (X) ≥ 0 a.s. and the equation (3.3) shows that (3.4) is true if
Z = 1A and A ∈ G. By linearity of E G (property (ii) of the Proposition 3.8), the equation
(3.4) is also true if Z is a non negative step function. Finally the monotone convergence
theorem 1.23 and Theorem 1.9 allow us to conclude.
If X is integrable, it suffices to apply the point (i) in decomposing X = X + − X − and
Z = Z + − Z − and to observe that the four integrals that we obtain are finite. 2

3.4 Conditional distribution and E(Y /X)

This section handles the problem of the approximation of Y by a measurable function (not necessarily affine) of X, and more generally the particular case of the conditional expectation given the σ-algebra G = σ(X). Let X : Ω → Rd be a random vector. The space H to which we apply the orthogonal projection theorem is H = L2(Ω, σ(X), P), that is to say the vector space of the (classes of) square integrable random variables of which a representative is measurable with respect to the σ-algebra generated by X, that is
σ(X) = {X −1 (B) : B ∈ Rd }. In the following, we will often identify a σ(X)-measurable
random variable and its almost sure equivalence class. We next extend this notion to the
case of random variables Y which are non negative or integrable.
3.4.1 Definition
As H is the space L2 corresponding to the σ-algebra σ(X), it is complete for the norm
k.k2 ; it is thus a closed subspace of L2 = L2 (Ω, F , P ). The following lemma characterizes
the random variables Y : Ω → Rr that are measurable with respect to σ(X).
Lemma 3.10 For every random vector X : Ω → Rd , a random vector Y : Ω → Rr is
measurable with respect to the σ-algebra σ(X), that is to say Y −1 (Rr ) ⊂ σ(X), if and only
if there exists a Borel function ϕ : Rd → Rr such that Y = ϕ(X).
Proof. Because the measurability in a product space for a σ-algebra is equivalent to that
of each component (cf. the Remark 1.50, (ii)), it suffices to verify the lemma when r = 1.
If A = X^{−1}(B) with B ∈ R, the indicator function of A is written 1_A = 1_B ∘ X. We deduce that if Y = Σ_{i=1}^n a_i 1_{X^{−1}(B_i)} is a step function on σ(X), then Y = ϕ ∘ X with ϕ = Σ_{i=1}^n a_i 1_{B_i}.
that if Y = ni=1 ai 1X −1 (Ai ) is a step function on σ(X), then Y = ϕ◦X with ϕ = ni=1 ai 1Bi .
If Y is σ(X)-measurable, non negative, it can be approximated by an increasing sequence of
non negative step functions on the σ-algebra σ(X) (cf. Theorem 1.9). We deduce that there
exists a non negative Borel function ϕ such that Y = ϕ ◦ X ; by decomposing Y = Y + − Y − ,
we deduce that this equality finally extends to a random variable Y of any sign. 2

The Theorem 3.5 thus shows that for every random vector X : Ω → Rd and for every
square integrable random variable Y : Ω → R, there exists a unique (up to almost sure


equivalence) random variable ϕ ◦ X ∈ H such that for every Borel function ψ : Rd → R, if


E(ψ(X)²) < +∞, then ⟨Y − ϕ(X), ψ(X)⟩ = 0, that is to say E(Y ψ(X)) = E(ϕ(X) ψ(X)).

The Theorem 3.7 shows finally that for every random vector X : Ω → Rd and for every
non negative or integrable random variable Y : Ω → R, there exists a unique (up to almost
equivalence) non negative or integrable random variable ϕ ∘ X such that for B ∈ Rd

E(Y 1_B(X)) = E(ϕ(X) 1_B(X)). (3.5)

Convention of notation The conditional expectation of a non negative or integrable


random variable Y given the σ-algebra σ(X) is called the conditional expectation of Y given
X and written E(Y /X). Furthermore, the Borel function ϕ such that E(Y /X) = ϕ ◦ X is
written x ↦ E(Y/X = x). The Theorem 3.7 and the Proposition 3.9 show that the Borel function x ↦ E(Y/X = x) is characterized (up to almost sure equivalence) by the following two properties if Y is non negative (or integrable) :

E(Y/X = x) ≥ 0 (resp. E( E(Y/X = x) ∘ X ) < +∞ ), (3.6)
E(Y ψ(X)) = E( [E(Y/X = x) ∘ X] ψ(X) ), ∀ψ : R^d → R non negative (resp. bounded). (3.7)

It remains to concretely calculate E(Y/X), that is to say x ↦ E(Y/X = x). We remark that because this function is then composed with X, it suffices to calculate it for the values x ∈ Rd effectively taken on by X. More precisely, if P(X ∈ A) = 0 for a set A ∈ Rd, then we may set E(Y/X = x) = 0 for x ∈ A.

We will only make the explicit calculations in two particular simple cases, which cover
a large number of situations.

3.4.2 Case of a random variable X taking values in a finite or countable set


To ease notations, we will suppose that X : Ω → N and will seek ϕ : N → R such that
E(Y /X) = ϕ ◦ X. Following the previous remark it suffices to calculate ϕ(n) for all the
integers n such that P (X = n) > 0 for us to deduce
E(Y/X) = Σ_{n : P(X=n)>0} ϕ(n) 1_{X=n}.

Therefore we fix k ∈ N such that P (X = k) > 0 and let ψk = 1{k} be the indicator
function of the singleton {k}. Then ψk ≥ 0, E(ψk (X)2 ) = P (X = k) < +∞ and following
(3.4),

E(Y 1_{X=k}) = E( [ Σ_{n : P(X=n)>0} ϕ(n) 1_{X=n} ] 1_{X=k} ).

We deduce E(Y 1{X=k} ) = ϕ(k)P (X = k), which implies that



E(Y/X) = Σ_k [ E(Y 1_{X=k}) / P(X = k) ] 1_{X=k} ,



We see that for each set {X = k}, E(Y/X = k) = E(Y 1_{X=k}) / P(X = k) is the expectation of Y
with respect to the conditional probability P (./X = k). In fact, this property is immediate
if Y = 1A , A ∈ F , next extended to step random variables by linearity, to non negative
random variables by monotone convergence and then to integrable random variables by the
difference of the positive and negative parts.

More generally, let Y : Ω → Rr be a random vector and φ : Rr → R a Borel function


such that φ ≥ 0 or φ ◦ Y ∈ L1 . The preceding calculation applied to φ(Y ) in place of Y
shows that

E(φ(Y)/X) = Σ_k [ E(φ(Y) 1_{X=k}) / P(X = k) ] 1_{X=k}. (3.8)

Again, for calculating the values E(φ(Y) 1_{X=k}) / P(X = k), it suffices to use the distribution of Y given X = k. Thus, when the random variable Y is also discrete,

E(φ(Y)/X) = Σ_k ( Σ_n φ(n) P(Y = n / X = k) ) 1_{X=k}.

We may also suppose that the conditional distribution of Y given X = k has a density gk
with respect to the Lebesgue measure. In this case, the random variable Y takes real values
and for every Borel set A ∈ R and every integer k, the equality

P(X = k, Y ∈ A) := P({X = k} ∩ {Y ∈ A}) = P(X = k) ∫_A g_k(y) dy

describes the distribution of the pair (X, Y). In this case,

E(φ(Y)/X) = Σ_k [ P(X = k) ∫ φ(y) g_k(y) dy / P(X = k) ] 1_{X=k} = Σ_k ( ∫ φ(y) g_k(y) dy ) 1_{X=k}.

Example 3.11 Let X and Y be independent random variables of Poisson distribution with
parameters λ and µ respectively. We wish to calculate the distribution of X given the sum
S = X +Y which follows a Poisson distribution of parameter λ+µ. For every pair of integers
n ≥ k ≥ 0 it is necessary to calculate

P(X = k | X + Y = n) = P(X = k, Y = n − k) / P(X + Y = n) = [ e^{−λ} λ^k e^{−µ} µ^{n−k} n! ] / [ k!(n − k)! e^{−(λ+µ)} (λ + µ)^n ] = C_n^k p^k (1 − p)^{n−k},

with p = λ/(λ + µ).
The distribution of X given X + Y = n is thus the binomial distribution B(n, p) and E(X|X + Y = n) = np, which immediately implies E(X|S) = pS. A similar calculation, using the expectation and the variance of a binomial distribution, gives E(X²|S) = p²S² + p(1 − p)S.
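A quick Monte Carlo sketch (in Python, with hypothetical parameters λ = 2 and µ = 3) illustrating that the empirical mean of X on each set {S = n} is close to p n :

import random

def poisson(lam):
    # Simulate a Poisson(lam) variable by counting exponential inter-arrival times before time 1.
    total, count = 0.0, 0
    while True:
        total += random.expovariate(lam)
        if total > 1.0:
            return count
        count += 1

lam, mu = 2.0, 3.0
p = lam / (lam + mu)
stats = {}                                  # n -> (number of draws with S = n, accumulated X)
for _ in range(200_000):
    x, y = poisson(lam), poisson(mu)
    cnt, acc = stats.get(x + y, (0, 0))
    stats[x + y] = (cnt + 1, acc + x)

for n in sorted(stats):
    cnt, acc = stats[n]
    if cnt > 1000:                          # only well-populated values of n
        print(n, acc / cnt, p * n)          # empirical E(X | S = n) versus p * n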


3.4.3 Case of the pair (X, Y ) with a density - Conditional Density


Let X : Ω → Rd and Y : Ω → Rr be random vectors such that the pair (X, Y ) has the
density f (x, y) with respect to the Lebesgue measure on Rd+r and let φ : Rr → R be a Borel
function such that φ ≥ 0 or φ(Y) is integrable. We want to calculate E[φ(Y)/X]. First of all, following (2.18), the vector X has density g(x) = ∫_{R^r} f(x, y) dλ_r(y) = ∫_{R^r} f(x, y) dy. By analogy with the conditional probability, if (x, y) ↦ f(x, y) is continuous and if ∆(x) and ∆(y) are « small » balls centered at x and y respectively, we have P[X ∈ ∆(x)] ∼ g(x) λ_d(∆(x)) > 0 and P( (X, Y) ∈ ∆(x) × ∆(y) / X ∈ ∆(x) ) ∼ f(x, y) λ_d(∆(x)) λ_r(∆(y)) / ( g(x) λ_d(∆(x)) ) = (f(x, y)/g(x)) λ_r(∆(y)). We deduce the following definition :

Definition 3.12 Let X : Ω → Rd and Y : Ω → Rr be random vectors such that the


pair (X, Y) has the density f(x, y) with respect to the Lebesgue measure on Rd+r and let g(x) = ∫_{R^r} f(x, y) dy be the density of X. Then for all x ∈ Rd such that g(x) > 0, the conditional density of Y given X = x is the function

q(y/x) := f(x, y)/g(x) = f(x, y) / ∫_{R^r} f(x, y) dy. (3.9)

We remark that if g(x) > 0, ∫_{R^r} q(y/x) dy = 1, which justifies the terminology.
The following theorem shows that in order to compute E(φ(Y )/X = x), we proceed
formally like we did to compute E(φ(Y )). We replace the density of Y by the conditional
density of Y given X = x.

Theorem 3.13 Let X : Ω → Rd and Y : Ω → Rr be random vectors such that the pair
(X, Y) has the density f(x, y) with respect to the Lebesgue measure on Rd+r and φ : Rd+r → R is a Borel function such that φ ≥ 0 or φ ∘ (X, Y) ∈ L1. Then for λ_d-almost all x ∈ Rd such that g(x) = ∫_{R^r} f(x, y) dy > 0, if q(y/x) denotes the conditional density of Y given X = x,

E(φ(X, Y)/X = x) = ∫_{R^r} φ(x, y) q(y/x) dy = ∫_{R^r} φ(x, y) [ f(x, y) / ∫_{R^r} f(x, z) dz ] dy. (3.10)

Moreover, if we set E(φ(X, Y)/X = x) = 0 if g(x) = 0, then

E(φ(X, Y)/X) = E(φ(X, Y)/X = x) ∘ X.

Proof. For all B ∈ Rd , the characterization ϕ(x) = E(φ(X, Y )/X = x) given by (3.3) and
the Fubini-Lebesgue theorem 1.54 show that
E[ϕ(X) 1_B(X)] = ∫_B ϕ(x) g(x) dx = E[φ(X, Y) 1_B(X)]
= ∫_B ( ∫_{R^r} φ(x, y) f(x, y) dy ) dx = ∫_B ( ∫_{R^r} φ(x, y) q(y/x) dy ) g(x) dx.

Proposition 1.35 (iii) applied to the Lebesgue measure λ_d on Rd implies that if α(x) = ϕ(x) − ∫_{R^r} φ(x, y) q(y/x) dy, the function g(x)α(x) is zero almost everywhere, that is to say that α(x) = 0 for λ_d-almost all x such that g(x) > 0. This concludes the proof. 2


A similar calculation to the preceding one shows moreover, using the notations of Theorem 3.13, that if φ : Rd+r → [0, +∞[ is Borel,

E[φ(X, Y)|X = x] = ∫_{R^r} φ(x, y) q(y/x) dy = ∫_{R^r} φ(x, y) [ f(x, y) / ∫_{R^r} f(x, z) dz ] dy.
In the particular case where X and Y are independent, of respective densities g and h,
then f (x, y) = g(x)h(y), hence q(y/x) = h(y) and E(φ(Y )/X) = E(φ(Y )). We deduce again
that X is useless to improve the approximation in L2 of a function of Y because as soon as
φ(Y ) belongs to L2 , or more generally is non negative or integrable, E(φ(Y )/X) = E[φ(Y )].
Example 3.14 Let X and Y be independent random variables of the same exponential
distribution of parameter λ > 0. We want to calculate the conditional distribution of X given
S = X + Y . We calculate the density of the pair (X, S), denoted by f (x, s). The density of
the pair (X, Y ) is the product of the densities of X and of Y , and if D = {(x, s) : 0 < x < s}
the mapping Φ :]0, +∞[2 → D defined by Φ(x, y) = (x, x + y) is a C 1 -diffeomorphism. The
Jacobian of the inverse mapping calculated at the point (x, s) ∈ D is 1. The change of
variables formula implies that for all non negative Borel functions ϕ : R² → [0, +∞[,

E(ϕ(X, S)) = ∫_{]0,+∞[²} ϕ(x, x + y) λ² e^{−λ(x+y)} dx dy = ∫_D ϕ(x, s) λ² e^{−λs} dx ds.

The pair (X, S) then has as density the function f(x, s) = λ² e^{−λs} 1_{0<x<s}. The marginal density of S is then λ² s e^{−λs} 1_{[0,+∞[}(s) and we indeed recover a Gamma distribution Γ(λ, 2). If s > 0, the conditional density of X given S = s is q(x|s) = (1/s) 1_{]0,s[}(x). It is a uniform distribution on the interval ]0, s[. We deduce that for s > 0, E(X|S = s) = s/2 and hence E(X|S) = S/2. In the same manner, E(X^p|S) = S^p/(p + 1) for every real number p ≥ 1.
A direct reasoning allows us to find E(X|S), but not the conditional expectation of an arbitrary non negative or integrable Borel function of X. The random variables X and Y « play symmetric roles », which implies that E(X|X+Y) = E(Y|X+Y). Because E(.|X+Y) is linear, X + Y = E(X + Y|X + Y) = 2E(X|X + Y), which yields E(X|X + Y) = (X + Y)/2 without tedious calculation of the conditional distributions. Note that the above argument is valid as soon as X and Y are independent and identically distributed.
3.4.4 Application to simulation
The preceding notion of conditional distribution allows the simulation of random vectors
(X1 , · · · , Xn ) whose components are not independent. In fact, it suffices to simulate the
first component X1 by the methods of Section 2.4 ; this provides a value x1 = X1(ω) ∈ R. We
next simulate the conditional distribution of X2 given X1 = x1 , which is again a probability
on R ; this supplies x2 = X2 (ω) ∈ R. We then simulate the conditional distribution of X3
given (X1 , X2 ) = (x1 , x2 ), and so on up to the simulation of the conditional distribution of
Xn given (X1 , · · · , Xn−1 ) = (x1 , · · · , xn−1 ).
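A hedged Python sketch of this sequential scheme for a pair (X1, X2) ; the particular choice « X1 exponential E(λ) and, given X1 = x1, X2 uniform on ]0, x1[ » is ours, only to illustrate the mechanism :

import random

def simulate_pair(lam):
    # Step 1 : simulate X1 ; step 2 : simulate X2 from the conditional distribution given X1 = x1.
    x1 = random.expovariate(lam)
    x2 = random.uniform(0.0, x1)
    return x1, x2

For n components one simply chains n such steps, each one using the conditional distribution given the values already drawn.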

3.5 Properties of conditional expectation

We show here the properties of the conditional expectation E G with respect to some
sub-σ-algebra G of F (which may be σ(X) for some random variable X).


3.5.1 Different generalizations of the properties of the expected value


The first proposition shows that G-measurable random variables act with respect to E G
like the constants with respect to expectation.

Proposition 3.15 Let X and Y be non negative random variables (or such that Y and XY
are integrable) and X be G-measurable. Then

E G (XY ) = X E G (Y ) a.s. (3.11)

Proof. We suppose that X and Y are non negative. Then for all A ∈ G, 1_A X is non negative and G-measurable. Proposition 3.9 then shows that E(1_A X E^G(Y)) = E(1_A X Y) = E(1_A (XY)). Since X E^G(Y) is non negative and G-measurable, we deduce that it is a.s. equal to E^G(XY). When X and Y change sign, it suffices to write X = X^+ − X^- and Y = Y^+ − Y^-. 2

Proposition 3.16 Let G ⊂ H be sub-σ-algebras of F and X be a non negative or integrable


random variable. Then

E^G(E^H(X)) = E^G(X) a.s. (3.12)

In particular E(E^H(X)) = E(X) for every non negative or integrable random variable X.

Proof. Let X be non negative and Y = E^G(E^H(X)). Then Y is non negative, G-measurable,
and for all A ∈ G ⊂ H,

E(1A Y ) = E(1A E H (X)) = E(1A X),

which implies that Y = E G (X) a.s. In order to extend this equality to the case where X is
integrable, we decompose X = X + − X − . Because G = {∅, Ω} ⊂ H for every sub-σ-algebra
H of F, the proposition 3.8 (i) allows us to conclude E(E^H(X)) = E(X). 2

The following result is a conditional Schwarz inequality.


Proposition 3.17 Let X ∈ L2 and Y ∈ L2. Then |E^G(XY)|² ≤ E^G(X²) E^G(Y²) a.s.
Proof. For all a ∈ R there exists a null set N_a such that

0 ≤ E^G((aX + Y)²) = a² E^G(X²) + 2a E^G(XY) + E^G(Y²) on N_a^c.

The set N = ∪a∈Q Na is also a null set and the preceding inequality is true on N c for all
rational numbers a and thus by continuity for all the real numbers a. We deduce that the
discriminant of this trinomial is almost surely negative or zero, which concludes the proof.
2

The following result is a generalization of the Jensen inequality. It allows us to show that E^G contracts each ‖·‖_p norm.
Proposition 3.18 (i) Let f : R → R be a convex function and X a random variable such that X and f(X) are integrable. Then f(E^G(X)) ≤ E^G(f(X)) a.s.
(ii) For all p ∈ [1, +∞[ and X ∈ L^p, |E^G(X)|^p ≤ E^G(|X|^p) a.s. and for all p ∈ [1, +∞], ‖E^G(X)‖_p ≤ ‖X‖_p. The conditional expectation is then a contraction of each L^p space, 1 ≤ p ≤ +∞.


Proof. (i) For all x ∈ R there exists a line below the graph of f going through the point (x, f(x)), that is to say an affine function y ↦ g_x(y) = α_x(y − x) + f(x) such that g_x(y) ≤ f(y) for all y ∈ R. Considering the countable family of points x ∈ Q, which we will write x_n, we deduce that g = sup_n g_{x_n} is a convex, continuous function such that g(x) = f(x) for all x ∈ Q. By continuity, we then deduce that f = g, that is to say that f is the supremum of a sequence of affine functions g_n(x) = a_n x + b_n ≤ f(x). For all n we deduce that a_n X + b_n ≤ f(X), and then that there exists a null set N_n such that a_n E^G(X)(ω) + b_n ≤ E^G(f(X))(ω) for ω ∈ N_n^c. Then N = ∪N_n is a null set and on N^c,

f(E^G(X)) = sup_n ( a_n E^G(X) + b_n ) ≤ E^G(f(X)).

(ii) For p ∈ [1, +∞[, the function |x|^p is convex and it suffices to apply (i). Integrating with respect to P the inequality |E^G(X)|^p ≤ E^G(|X|^p), we deduce ‖E^G(X)‖_p ≤ ‖X‖_p. The case p = +∞ results from Proposition 3.8 (i) and (iii). 2

The following result generalizes the convergence theorems of the first chapter. Its proof
is left as an exercise.

Proposition 3.19 Let (Xn , n ≥ 1) be a sequence of random variables.


(i) If Xn ≥ 0 for all n, and Xn increases to X, then E G (Xn ) increases to E G (X) almost
surely.
(ii) If Xn ≥ 0 for all n, then E G (lim inf Xn ) ≤ lim inf E G (Xn ) a.s.
(iii) If there exists Y ∈ L1 such that |Xn | ≤ Y a.s. and Xn → X a.s., then E G (Xn ) →
E G (X) a.s.

The following result allows us to find equiintegrable families of random variables related
to conditional expectation. It is important in the theory of martingales.

Theorem 3.20 (i) Let (X_i, i ∈ I) be a family of equiintegrable random variables and (G_j, j ∈ J) a family of sub-σ-algebras of F. Then the family of random variables (Y_i^j = E^{G_j}(X_i) : i ∈ I, j ∈ J) is equiintegrable.
(ii) In particular, we deduce that :
(a) Let X be an integrable random variable and (G_j, j ∈ J) a family of sub-σ-algebras of F. Then the family (E^{G_j}(X), j ∈ J) is equiintegrable.
(b) Let (X_i) be a family of equiintegrable variables and G a sub-σ-algebra of F. Then the family of random variables Y_i = E^G(X_i) is equiintegrable.

Proof. (i) For all i ∈ I and j ∈ J we write Y_i^j = E^{G_j}(X_i). For all a > 0, the event {|Y_i^j| ≥ a} ∈ G_j and according to Proposition 3.18 (i), |Y_i^j| ≤ E^{G_j}(|X_i|). We deduce

∫_{{|Y_i^j| ≥ a}} |Y_i^j| dP ≤ ∫_{{|Y_i^j| ≥ a}} E^{G_j}(|X_i|) dP = ∫_{{|Y_i^j| ≥ a}} |X_i| dP.

Moreover, the Markov inequality (1.5) and Proposition 3.16 show that P(|Y_i^j| ≥ a) ≤ (1/a) E( E^{G_j}(|X_i|) ) = (1/a) E(|X_i|). Since, according to Proposition 2.6, sup_i E(|X_i|) < +∞, we deduce that given any α > 0, P(|Y_i^j| ≥ a) ≤ α for a large enough. Proposition 2.6 then allows us to conclude that for all ε > 0, ∫_{{|Y_i^j| ≥ a}} |X_i| dP ≤ ε for a large enough, which concludes the proof.
(ii) If X is integrable, the family {X} is equiintegrable. The point (a) is thus an immediate consequence of (i). The point (b) is a particular case of (i). 2


3.5.2 Conditional expectation and independence


We recall that a random variable X and a σ-algebra G are independent if the σ-algebras
σ(X) and G are independent. The following lemma gives a first characterization of the
independence of X and G.

Lemma 3.21 Let X : Ω → Rd be a random variable and G a sub-σ-algebra of F . Then X


and G are independent if and only if for every non negative Borel function φ : R d → [0, +∞[
and every G-measurable non negative random variable Y ,
E[φ(X)Y ] = E[φ(X)]E(Y ). (3.13)

Proof. If X and G are independent and Y is G-measurable, σ(Y ) ⊂ G. The independence


of G and σ(X) then implies independence of σ(Y ) and of σ(X), which implies (3.13).
Conversely, suppose that (3.13) is true for all φ and Y . Let A ∈ G and Y = 1A ; Y
is G-measurable and non negative. Let B ∈ σ(X) ; then B = X −1 (C) with C ∈ Rd and if
φ = 1C , 1B = φ(X). The equality (3.13) allows us to conclude that P (A∩B) = E(Y φ(X)) =
E(Y )E(φ(X)) = P (A)P (B). The σ-algebras G and σ(X) are thus independent. 2

The following proposition allows us to characterize the independence of X and G in terms


of conditional expectation. Nevertheless, one should notice that the equality E^G(X) = E(X) a.s. is not sufficient to show the independence of X and G. We will give a counterexample as an exercise.

Proposition 3.22 (i) Let X be a non negative or integrable random variable, independent
of the sub-σ-algebra G. Then, E G (X) = E(X) a.s.
(ii) Let X : Ω → Rd be a random vector and G a sub-σ-algebra. Then X and G are
independent if and only if for every non negative Borel function φ : Rd → [0, +∞[
E G (φ(X)) = E(φ(X)). (3.14)

Proof. (i) The constant E(X) (real or equal to +∞) is G-measurable. Furthermore, for all
A ∈ G, E(1A X) = P (A)E(X) = E[1A E(X)], and the Theorem 3.7 allows us to conclude
E G (X) = E(X) a.s.
(ii) If X and G are independent, (3.14) is a consequence of (i). Conversely, if Y is G-
measurable and non negative, (3.14) and Proposition 3.9 show that

E(φ(X)Y) = E( E^G(φ(X)) Y ) = E( E(φ(X)) Y ) = E[φ(X)] E(Y).
The lemma 3.21 then implies the independence of X and G. 2

The previous Proposition may be generalized as follows.

Proposition 3.23 Let G be a sub-σ-algebra of F, X : Ω → Rd and Y : Ω → Rr be random


vectors such that Y is G-measurable and X is independent of G. Then for every non negative
Borel function φ : Rd+r → [0, +∞[,
E[φ(X, Y )/G] = ϕ(Y ) a.s. where ϕ(y) = E[φ(X, y)].
In particular, if X and Y are independent,
E[φ(X, Y )/Y = y] = E[φ(X, y)] a.s.


Proof. Let Z be a non negative G-measurable random variable. We write PX for the dis-
tribution of X and P(Y,Z) for the distribution of the pair (Y, Z). The random vectors X and
(Y, Z) are clearly independent and the Theorem 2.24, plus the Fubini-Tonelli Theorem show
that
E[Zφ(X, Y)] = ∫_{R^d} ∫_{R^{r+1}} z φ(x, y) P_{(Y,Z)}(dy, dz) P_X(dx)
= ∫_{R^{r+1}} z ( ∫_{R^d} φ(x, y) P_X(dx) ) P_{(Y,Z)}(dy, dz)
= ∫_{R^{r+1}} z E[φ(X, y)] P_{(Y,Z)}(dy, dz) = E[Zϕ(Y)].

Since ϕ(Y ) is G-measurable, the Theorem 3.5 concludes the proof. 2


4 Characteristic function - Gaussian Vectors

In this chapter, we consider a fixed probability space (Ω, F , P ) which we will not recall
in a systematic way in the sequel. The Fourier transform is a very powerful tool in analysis.
Probabilists traditionally give it another name : the name of characteristic function. Fur-
thermore, the notion of Gaussian vector is central in the theory because of the convergence
theorems in distribution which will be seen in the following chapter. Its infinite dimensional
extension, the Brownian motion, is the basis for many models, especially in finance.

4.1 Characteristic function of a real random variable

Let X be a real random variable. For all t ∈ R the random variables cos(tX) and sin(tX)
are bounded, thus integrable. For every real number a, we recall that eia = cos(a)+i sin(a) ∈
C is a complex number of modulus 1. By requiring that the real part Re(Z) and the imaginary part Im(Z) be random variables, we naturally define the notion of a complex random variable Z : Ω → C. We say
that Z is integrable (resp. of pth power integrable) if and only if its real and imaginary parts
are. If Z ∈ L1 , we put E(Z) = E(Re(Z)) + iE(Im(Z)). All the properties of expectation
of the real random variables extend to complex random variables. We will write |z| for the
modulus of the complex number z.

Definition 4.1 Let X : Ω → R be a real random variable. Its characteristic function is the
function ΦX : R → C defined by

Φ_X(t) = E(e^{itX}) = E(cos(tX)) + i E(sin(tX)). (4.1)

Of course, we have for the real random variable X, ΦX (0) = 1.


If X : Ω → N, Φ_X(t) = Σ_{n≥0} e^{itn} P(X = n), and when the distribution of X : Ω → R has density f, then Φ_X(t) = ∫_R e^{itx} f(x) dx is the Fourier transform f̂(t) of f calculated at the point t. In general, Φ_X is the Fourier transform of the probability P_X,
that is to say of the distribution of X.

Example 4.2 1. If X is a Bernoulli random variable B(p) of parameter p ∈ [0, 1],
Φ_X(t) = p e^{it} + 1 − p.
2. If the distribution of X is binomial B(n, p) of parameters n and p ∈ [0, 1], the binomial formula shows that Φ_X(t) = [p e^{it} + 1 − p]^n. We will recapture this result in a different way later.
3. If the distribution of X is Poisson P(λ), where λ > 0, one can prove as an exercise that Φ_X(t) = exp( λ(e^{it} − 1) ).
4. If X follows a uniform distribution on [0, 1] and t ≠ 0, Φ_X(t) = ∫_0^1 e^{itx} dx = (e^{it} − 1)/(it). Certain calculations of integrals may be made « as usual » although the functions take complex values.
5. If X follows the Cauchy distribution, using techniques of analytic functions (calculus of residues) we obtain Φ_X(t) = e^{−|t|}. We will recover this result differently.

The following proposition gives the immediate properties of the characteristic function.


Proposition 4.3 Let X : Ω → R be a real random variable.


(i) For all a, b ∈ R, ΦaX+b (t) = eibt ΦX (at).
(ii) The characteristic function ΦX is bounded by 1 and is uniformly continuous.

Proof. Part (i) results immediately from the linearity of the integral and the evident equality Φ_{aX}(t) = Φ_X(at) for all a, t ∈ R.
(ii) Because |e^{itX}| = 1, |Φ_X(t)| ≤ E(|e^{itX}|) = 1. Moreover, for all ε > 0 there exists N such that P(|X| ≥ N) < ε. The mean value theorem shows that for all x, s, t ∈ R, |e^{itx} − e^{isx}| ≤ 2|x||t − s|. We deduce that if |t − s| ≤ ε/(2N),

|Φ_X(t) − Φ_X(s)| ≤ 2P(|X| ≥ N) + ∫_{−N}^{N} |e^{itx} − e^{isx}| dP_X(x) ≤ 2ε + 2N|t − s| ≤ 3ε. 2

The following property (i) is fundamental. The part (ii) allows us, in certain cases, to
recapture the density from the characteristic function. The theorem is temporarily assumed :
it will be proved in the last chapter (Proposition 5.18).

Theorem 4.4 Let X and Y be real random variables.


(i) The distributions PX and PY of X and Y are equal if and only if the characteristic
functions ΦX and ΦY are equal.
(ii) If X has density f and if f̂ = Φ_X ∈ L1(λ) where λ designates the Lebesgue measure on R, we have the Fourier inversion formula :

f(x) = (1/2π) ∫_R f̂(t) e^{−itx} dt for almost all x. (4.2)

A direct calculation shows that

∫_R e^{−itx} e^{−|t|} dt = ∫_{−∞}^{0} e^{−itx+t} dt + ∫_{0}^{+∞} e^{−itx−t} dt = 1/(−ix + 1) + (−1)/(−ix − 1) = 2/(1 + x²).

We then see that if f̂(t) = e^{−|t|}, f̂ ∈ L1(λ) and (1/2π) ∫_R e^{−itx} f̂(t) dt = (1/π) · 1/(1 + x²), that is to say that we find the density of a Cauchy distribution.


By combining the Theorems 1.38 on continuity and 1.39 on differentiability of integrals
which depend on a parameter, we deduce the following :

Theorem 4.5 (i) Let X be a real random variable and n ≥ 1 an integer such that E(|X|^n) < +∞. Then the characteristic function of X is of class C^n and for every k = 1, · · · , n,

Φ_X^{(k)}(t) = i^k E( X^k e^{itX} ). (4.3)

In particular, Φ_X^{(k)}(0) = i^k E(X^k).
(ii) Let X ∈ L2 ; then Φ_X'(0) = iE(X), Φ_X''(0) = −E(X²). Moreover, there exists α > 0 and a function ε : ]−α, +α[ → C such that lim_{t→0} ε(t) = 0 and for |t| < α,

Φ_X(t) = exp( itE(X) − (t²/2) Var(X) + t² ε(t) ). (4.4)


Proof. (i) We apply Theorem 1.39 to the function (t, x) → f(x, t) = e^{itx} (in fact separately to the real part and to the imaginary part of this function, but we bring together the separate results so that we may directly reason with a complex-valued function), I = R, d = 1 and to the measure µ = P_X. If X is integrable, |∂f/∂t (x, t)| = |ix e^{itx}| = |x| ∈ L1(P_X). We deduce the formula (4.3) for k = 1. It then suffices to reason by induction on the successive derivatives (and the successive powers of X) up to order n.
(ii) If X ∈ L2, Φ_X is of class C² and the Taylor formula of order 2 at 0 gives

Φ_X(t) = 1 + itE(X) − (t²/2) E(X²) + o(t²).

The function ln(1 + z) may be defined for a complex number z such that |z| < 1 as ln(1 + z) = Σ_{k=1}^{∞} (−1)^{k+1} z^k/k and by approximating ln(1 + z) with z − z²/2 + o(|z|²) when |z| is small enough, we deduce (4.4). 2

Example 4.6 The following example is fundamental. By applying Theorem 4.5 (i), we calculate the characteristic function of a Gaussian N(m, σ²) random variable Y. Note that in this case, in equation (4.4) the term ε(t) is identically zero. This will explain the central role that Gaussian random variables play in the convergence theorems.
If Y follows a Gaussian distribution N(m, 0), Y is a.s. equal to its expectation m and Φ_Y(t) = e^{itm}.
If σ ≠ 0, X = (Y − m)/σ is Gaussian of distribution N(0, 1) and Y = m + σX. From Proposition 4.3, Φ_Y(t) = e^{itm} Φ_X(σt) and it then suffices to calculate Φ_X. Because X ∈ L1, for all t ∈ R,

Φ_X'(t) = (i/√(2π)) [ ∫_R x cos(tx) e^{−x²/2} dx + i ∫_R x sin(tx) e^{−x²/2} dx ] = −(1/√(2π)) ∫_R x e^{−x²/2} sin(tx) dx.

In fact, the first integral is zero because the integrand is odd and integrable. Integration by
parts in the second integral shows that
∫_R x e^{−x²/2} sin(tx) dx = [ −e^{−x²/2} sin(tx) ]_{−∞}^{+∞} + t ∫_R e^{−x²/2} cos(tx) dx.

Since by parity, Φ_X(t) = (1/√(2π)) ∫_R cos(tx) e^{−x²/2} dx, we deduce that Φ_X is real valued and that for all t, Φ_X'(t) = −t Φ_X(t). We deduce that ln(|Φ_X(t)|) = −t²/2 + C where C is a real constant. Since Φ_X is continuous and does not vanish, it keeps a constant sign. Furthermore, Φ_X(0) = 1 implies that Φ_X(t) = e^{−t²/2}.
We finally check that if Y follows a N(m, σ²) Gaussian distribution,

E(e^{itY}) = e^{itm − σ²t²/2} = exp( itE(Y) − (t²/2) Var(Y) ). (4.5)

We can show as an exercise the following result on the moments of a Gaussian N(0, 1) random variable X : for every odd integer n, E(X^n) = 0 and for every even integer n = 2k, E(X^{2k}) = (2k)!/(2^k k!).


4.2 Characteristic function of random vectors

We generalize the preceding definition and results to random vectors X : Ω → Rd . We


write ⟨x, y⟩ for the scalar product of the vectors x and y of Rd.

Definition 4.7 Let X = (X1 , · · · , Xd ) : Ω → Rd be a random vector. Its characteristic


function is the function ΦX : Rd → C defined for every vector t = (t1 , · · · , td ) ∈ Rd by
Φ_X(t) = E( e^{i⟨t,X⟩} ) = E( e^{i Σ_{k=1}^d t_k X_k} ). (4.6)

The following theorem is a d-dimensional version of the first results on real random
variables.

Theorem 4.8 (i) For every random vector X : Ω → Rd , ΦX is uniformly continuous on


Rd , of modulus at most 1 and ΦX (0) = 1.
(ii) For every a ∈ R and b, t ∈ Rd, Φ_{aX+b}(t) = e^{i⟨t,b⟩} Φ_X(at).
(iii) Two random vectors X and Y have the same distribution if and only if their cha-
racteristic functions are equal.
(iv) Let X : Ω → Rd and n ≥ 1 be an integer such that X ∈ L^n. Then Φ_X is of class C^n and for every multi-index α = (α_j, 1 ≤ j ≤ d) of length |α| = Σ_{j=1}^d α_j ≤ n,

∂^{|α|} Φ_X(t) / ( ∂t_1^{α_1} · · · ∂t_d^{α_d} ) = i^{|α|} E( X_1^{α_1} · · · X_d^{α_d} e^{i⟨t,X⟩} ).

The point (iii) will be proved in the following chapter (Proposition 5.18).
The other properties are shown in a way similar to the corresponding proofs in the
preceding section and the details are left as an exercise.
In the particular case of square integrable random vectors, the following result is a
multidimensional version of the equation (4.4). Its proof is left as an exercise. Recall that
we are committing the abuse of language consisting of identifying a vector and the column
matrix of its components in the canonical basis and that we write t̃ for the matrix transpose
of the matrix t.

Theorem 4.9 Let X : Ω → Rd be a square integrable random vector. We write E(X) for
its expectation vector and ΓX for its covariance matrix. Then
Φ_X(t) = 1 + i⟨t, E(X)⟩ − (1/2) Σ_{k,l=1}^d t_k t_l E(X_k X_l) + o(|t|²).

There exists α > 0 and a function ε : ]−α, +α[ → C such that lim_{s→0} ε(s) = 0, and such that for t ∈ Rd with ‖t‖ < α one has :

Φ_X(t) = exp( i Σ_{k=1}^d t_k E(X_k) − (1/2) Σ_{k,l=1}^d t_k t_l Cov(X_k, X_l) + ‖t‖² ε(‖t‖) )
       = exp( i t̃ E(X) − (1/2) t̃ Γ_X t + ‖t‖² ε(‖t‖) ). (4.7)


The characterization of the distribution by the characteristic function yields a characterization of the independence of random vectors.
Theorem 4.10 Let (Xk : Ω → Rdk , 1 ≤ k ≤ n) be random vectors. They are independent
if and only if the characteristic function of the vector X = (X1 , · · · , Xn ) : Ω → Rd1 +···+dn is
expressed for all t = (t1 , · · · , tn ) with tk ∈ Rdk as
Φ_X(t) = Π_{k=1}^n Φ_{X_k}(t_k).

Proof. To fix the ideas, we consider only a pair (X, Y) of real random variables. We write Z for a random variable with values in R² and with distribution P_X ⊗ P_Y. For all (s, t) ∈ R², the Fubini-Lebesgue theorem 1.54 shows that

Φ_Z(s, t) = ∫_{R²} e^{i(sx+ty)} dP_X(x) ⊗ dP_Y(y) = ∫_R e^{isx} dP_X(x) ∫_R e^{ity} dP_Y(y) = Φ_X(s) Φ_Y(t).
Because the characteristic function characterizes the distribution, we deduce that Z and the
pair (X, Y ) have the same distribution if and only if ΦZ (s, t) = Φ(X,Y ) (s, t). Thus Theorem
2.24 concludes the proof. 2
The following proposition is very useful to compute the distribution of the sum of inde-
pendent random variables. In the case of independent random variables X and Y of densities
f and g, respectively, it shows a classical result : the Fourier transform of the convolution
product f ∗ g of f and g (which is the density of X + Y ) is equal to the product of the
Fourier transforms of f and of g.

Proposition 4.11 Let (X_k, 1 ≤ k ≤ n) be independent random vectors with values in Rd and S = Σ_{k=1}^n X_k their sum. For all t ∈ Rd,

Φ_S(t) = Π_{k=1}^n Φ_{X_k}(t).

Proof. It suffices to use Theorem 2.25 and the fact that for all t ∈ Rd the random variables e^{i⟨t,X_k⟩} are integrable (because bounded by 1). We immediately deduce

Φ_S(t) = E( Π_{k=1}^n e^{i⟨t,X_k⟩} ) = Π_{k=1}^n E( e^{i⟨t,X_k⟩} ). 2
Example 4.12 1. The sum of two independent Gaussian random variables X and Y with distributions N(m_1, σ_1²) and N(m_2, σ_2²) respectively is Gaussian with distribution N(m_1 + m_2, σ_1² + σ_2²). To check this, it suffices to compute the characteristic function of X + Y by means of the preceding Proposition and to use the characteristic function of the Gaussian distribution N(m, σ²) computed in (4.5).
2. Let (X_n, n ≥ 1) be a sequence of independent random variables with the same Cauchy distribution and let X̄_n = (1/n) Σ_{k=1}^n X_k be the average of the first n terms of this sequence. Then X̄_n follows a Cauchy distribution. To prove this, it suffices to show that the characteristic function of X̄_n is e^{−|t|}. For all t ∈ R,

E( e^{itX̄_n} ) = E( exp( i(t/n) Σ_{k=1}^n X_k ) ) = Π_{k=1}^n Φ_{X_k}(t/n) = e^{−n|t|/n} = e^{−|t|}.
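This stability can be checked numerically ; the following hedged Python sketch compares, at a few points, the empirical distribution function of X̄_n with the Cauchy distribution function F(x) = 1/2 + arctan(x)/π :

import math
import random

n, trials = 50, 100_000
averages = []
for _ in range(trials):
    # a standard Cauchy variable can be simulated as tan(pi (U - 1/2)) with U uniform on ]0, 1[
    sample = [math.tan(math.pi * (random.random() - 0.5)) for _ in range(n)]
    averages.append(sum(sample) / n)

for x in (-2.0, -1.0, 0.0, 1.0, 2.0):
    empirical = sum(1 for a in averages if a <= x) / trials
    print(x, empirical, 0.5 + math.atan(x) / math.pi)    # the two values should be close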


One can show as an exercise that if X and Y are independent, identically distributed, square integrable real random variables, such that E(X) = 0, Var(X) = 1 and (X + Y)/√2 has the same distribution as X, then X (and Y) are Gaussian N(0, 1). One can also check that in this case, the random variables (X + Y)/√2 and (X − Y)/√2 are independent Gaussian N(0, 1).

4.3 Gaussian vectors

We have seen that real Gaussian random variables are characterized by the fact that
their characteristic function is the exponential of a polynomial of degree two (without the
extra term in the Taylor expansion of order two). The notion of Gaussian vector extends
this property in arbitrary dimension.

Definition 4.13 A random vector X = (X1 , · · · , Xd ) is Gaussian if and only if every linear
combination of its components is a real Gaussian random variable (of nonnegative variance).

The above definition immediately shows that if X : Ω → Rd is a Gaussian vector and if


L : Rd → Rr is a linear mapping, then the vector Y = L(X) : Ω → Rr is a Gaussian vector.
Following Theorem 2.17, if X = (X1, · · · , Xd) is a Gaussian vector, then for t ∈ R^d the random variable ⟨t, X⟩ is Gaussian with expectation ⟨t, E(X)⟩ and variance t̃ ΓX t ; the characteristic function of X is
$$E\big(e^{i\langle t, X\rangle}\big) = e^{\,i\langle t, E(X)\rangle - \frac{1}{2}\, \tilde t\, \Gamma_X\, t}. \qquad (4.8)$$

We see in particular that if the vector Y = (Y1, · · · , Yd) is a centered Gaussian vector with covariance matrix Id_d, and X = σY, the density of X is
$$g_\sigma(x) = \frac{1}{(\sigma\sqrt{2\pi})^d}\, e^{-\frac{\|x\|^2}{2\sigma^2}},$$
and its characteristic function is
$$\hat g_\sigma(x) = e^{-\frac{1}{2}\sigma^2 \|x\|^2} = \Big(\frac{\sqrt{2\pi}}{\sigma}\Big)^d g_{\frac{1}{\sigma}}(x). \qquad (4.9)$$

We then see that, up to a normalization factor, the characteristic function of X coincides with the density g_{1/σ} of (1/σ)Y. This remark and the properties of convergence of σY when σ → 0 are key points to prove the injectivity of the Fourier transform and the Fourier inversion formula, which will be shown later.
Furthermore, the first example 4.12 shows that if the components of the random vector
X = (X1 , · · · , Xd ) are independent Gaussians, then the vector X is Gaussian. It is obvious
that the components of a Gaussian random vector are real Gaussian random variables.
However, if the components of a vector are Gaussian random variables, the vector need not
be Gaussian (even if the components are uncorrelated), as shown in Example 2.26.
In fact, in that case for all a the two random variables X and Y are N (0, 1) Gaussians and
for a suitable choice of a, Cov(X, Y ) = 0. The difference X − Y is nonzero and is bounded
by a, which prevents it from being Gaussian of zero variance (it would be constantly equal
to 0) or of strictly positive variance, because in this case we would have P (|X − Y | > a) > 0.
Thus, the vector (X, Y ) is not Gaussian.
The following result underscores a very important property of Gaussian vectors.


Theorem 4.14 Let X = (X1 , · · · , Xd ) : Ω → Rd be a Gaussian vector. Then


(i) For all 1 ≤ j < k ≤ d, the components Xj and Xk are independent if and only if
Cov(Xj , Xk ) = 0.
(ii) The components of X are independent if and only if the covariance matrix of X is
diagonal.
(iii) For all k ∈ {1, · · · , d − 1}, the vectors Y = (X1, · · · , Xk) and Z = (X_{k+1}, · · · , Xd) are independent if and only if
$$\Gamma_X = \begin{pmatrix} \Gamma_Y & 0 \\ 0 & \Gamma_Z \end{pmatrix}.$$

Proof. (i) is an immediate consequence of (ii) because the subvector (Xj, Xk) of X is Gaussian.
(ii) If the components of a square integrable random vector X are independent, the covariance matrix of X is always diagonal. Conversely, we suppose that the covariance matrix ΓX is diagonal. To show that the components of X are independent, it suffices to calculate the characteristic function of X and to apply Theorem 4.10. For every vector t ∈ R^d, equation (4.8) and the fact that ΓX is diagonal show that
$$\Phi_X(t) = e^{\,i\sum_{k=1}^d t_k E(X_k) - \frac{1}{2}\sum_{k=1}^d t_k^2\,\mathrm{Var}(X_k)} = \prod_{k=1}^d e^{\,i t_k E(X_k) - \frac{1}{2} t_k^2\,\mathrm{Var}(X_k)} = \prod_{k=1}^d \Phi_{X_k}(t_k). \qquad \Box$$

The proof of (iii) is left as an exercise. 2


The following figure shows the simulation of 10,000 pairs of independent identically
distributed N (0, 1) random variables.

[Scatter plot of the 10 000 simulated pairs.]
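A simulation like the one plotted above can be reproduced in a few lines; the sketch below (an added illustration, not from the original notes, with matplotlib assumed available for the plot) draws 10 000 independent N(0, 1) pairs and checks that the empirical covariance matrix is close to the 2 × 2 identity, in line with Theorem 4.14 (ii).

import numpy as np
import matplotlib.pyplot as plt   # assumed available for the scatter plot

rng = np.random.default_rng(1)
points = rng.standard_normal(size=(10_000, 2))   # 10 000 i.i.d. N(0,1) pairs

print(np.cov(points, rowvar=False))   # empirical covariance, close to the identity matrix

plt.scatter(points[:, 0], points[:, 1], s=1)
plt.axis("equal")
plt.show()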


The covariance matrix of a vector X in R^d is symmetric and non-negative semi-definite (that is to say that for every vector a ∈ R^d, ã ΓX a ≥ 0). We recall the results from linear algebra that allow us to reduce the matrix ΓX and to decompose it into a product AÃ. We commit the usual abuse of notation of identifying a vector of R^d with the column matrix of its coefficients in the canonical basis. We write ‖x‖ for the Euclidean norm of x.
Theorem 4.15 Let M be a d × d matrix. The following properties for M are equivalent :
(i) M is the change of basis matrix from an orthonormal basis to an orthonormal basis.
(ii) For every vector x of Rd , kM xk = kxk.
(iii) The column vectors of M are normalized and pairwise orthogonal.
(iv) The inverse matrix of M is equal to its transpose, that is to say that M M̃ = M̃ M =
Idd .
We say that such a matrix M is orthogonal.
The following theorem shows that covariance matrices can be diagonalized.
Theorem 4.16 Let Γ be a symmetric matrix with real coefficients. Then Γ is diagonalizable
in R by an orthogonal change of basis matrix M , that is to say that there exists a diagonal
matrix D with real coefficients and an orthogonal matrix M such that D = M −1 Γ M =
M̃ Γ M . If Γ is symmetric and non-negative semi-definite (resp. positive definite), the
elements of D are nonnegative (resp. strictly positive) and there exists a square matrix
(resp. invertible matrix) A such that Γ = AÃ.
Proof. The diagonalization of symmetric matrices is a classical result of linear algebra. For every eigenvector xk of Γ associated to the eigenvalue λk, x̃k Γ xk = λk ‖xk‖². We deduce that if Γ is non-negative semi-definite (resp. positive definite) the eigenvalues of Γ are nonnegative (resp. strictly positive). We write ∆ for the diagonal matrix such that ∆_{k,k} = √λk for all k = 1, · · · , d. We deduce that Γ = M D M̃ = M ∆ ∆̃ M̃ = A Ã for A = M ∆. If the matrix Γ is positive definite, the matrices M and ∆ are invertible and thus A is also invertible. □
We immediately deduce that if X is a Gaussian vector with covariance matrix ΓX, if M is an orthogonal matrix such that D = M⁻¹ ΓX M is diagonal and if Y = M̃X, we have ΓY = M̃ ΓX M = D. The vector Y = M̃X is a Gaussian vector and its covariance matrix is diagonal, which implies that the components of Y are independent.
For every real number m and every real number σ² ≥ 0, we know how to construct a Gaussian random variable of expectation m and of variance σ². Furthermore, if σ ≠ 0, the density of this Gaussian random variable is
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\ \exp\Big(-\frac{1}{2}\,(x-m)\,\frac{1}{\sigma^2}\,(x-m)\Big).$$
If σ = 0, the distribution of the Gaussian random variable is concentrated at m and does not have a density. Similarly, the following theorem allows us to construct a Gaussian vector with prescribed expectation vector and covariance matrix Γ ; when Γ is invertible, the Gaussian vector has an explicit density.
Theorem 4.17 Let m ∈ Rd and Γ be a symmetric non negative semi-definite d × d matrix.
There exists a Gaussian vector X = (X1 , · · · , Xd ) of expectation vector m and of covariance
matrix Γ. This vector will be denoted N (m, Γ).
If the matrix Γ is also invertible, then X has for density the function f defined for every vector x ∈ R^d by
$$f(x) = \frac{1}{(2\pi)^{d/2}\sqrt{\det(\Gamma)}}\ \exp\Big(-\frac{1}{2}\,\widetilde{(x-m)}\;\Gamma^{-1}\,(x-m)\Big). \qquad (4.10)$$


Proof. Following Theorem 4.16, there exists a d × d matrix A such that Γ = AÃ. Let Y = (Y1, · · · , Yd) be a vector whose components are independent Gaussian N(0, 1) random variables. The vector Y is then Gaussian. Let L : R^d → R^d denote the linear mapping whose matrix in the canonical basis is A. Then the vector Z = AY is Gaussian with covariance matrix AÃ = Γ. It then suffices to consider the vector X = m + AY ; it is a Gaussian vector with expectation m and the same covariance matrix as that of AY.
If the matrix Γ is invertible, the matrix A is invertible as well, and Φ : R^d → R^d defined by x = Φ(y) = Ay + m is a C¹-diffeomorphism from R^d to R^d. The Jacobian matrix of Φ⁻¹ is A⁻¹, the density of the vector Y is
$$g(y) = \Big(\frac{1}{\sqrt{2\pi}}\Big)^d e^{-\frac{1}{2}\|y\|^2} = \Big(\frac{1}{\sqrt{2\pi}}\Big)^d e^{-\frac{1}{2}\tilde y\, y},$$
and from Theorem 2.21 the density of X = Φ(Y) is then f(x) = |det(A⁻¹)| g(Φ⁻¹(x)). Since det(Γ) = det(A) det(Ã) = [det(A)]² and y = Φ⁻¹(x) = A⁻¹(x − m), we deduce (4.10) from the identity
$$\tilde y\, y = \widetilde{A^{-1}(x-m)}\; A^{-1}(x-m) = \widetilde{(x-m)}\,\Gamma^{-1}(x-m). \qquad \Box$$
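The construction in the proof translates directly into a simulation recipe: factor Γ = AÃ and set X = m + AY with Y standard Gaussian. The sketch below is an added illustration (not part of the notes); it uses NumPy's Cholesky factorization for A, and the 3 × 3 matrix Γ and the vector m are arbitrary examples.

import numpy as np

rng = np.random.default_rng(2)

m = np.array([1.0, 0.0, -2.0])                       # arbitrary expectation vector
Gamma = np.array([[4.0, 1.0, 0.5],                   # arbitrary symmetric positive definite matrix
                  [1.0, 3.0, 1.0],
                  [0.5, 1.0, 2.0]])

A = np.linalg.cholesky(Gamma)                        # Gamma = A @ A.T
Y = rng.standard_normal(size=(100_000, 3))           # i.i.d. N(0,1) components
X = m + Y @ A.T                                      # each row of X is a sample of N(m, Gamma)

print(X.mean(axis=0))                                # close to m
print(np.cov(X, rowvar=False))                       # close to Gamma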

The Gaussian vectors finally have a remarkable property relative to conditioning. The
proof of the following result is left as an exercise.
Theorem 4.18 Let (X, Y) : Ω → R² be a Gaussian vector of expectation vector m = (m1, m2) and of covariance matrix
$$\Gamma = \begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix},$$
where σ1 > 0, σ2 > 0 and ρ ∈ [−1, +1].
The random variables X − m1 and (Y − m2) − ρ(σ2/σ1)(X − m1) are independent centered Gaussians. The conditional expectation of Y given X is E(Y|X) = m2 + ρ(σ2/σ1)(X − m1). It is an affine function of X and it is then also the linear regression of Y on X.
• If |ρ| = 1, then Var(σ2 X − ρσ1 Y) = 0, that is to say that the components X and Y are not linearly independent (in the sense of linear algebra). The support of the distribution of the vector (X, Y) is the line of equation σ2 x − ρσ1 y = σ2 m1 − ρσ1 m2.
• If |ρ| < 1, the vector (X, Y) has for density
$$f(x,y) = \frac{\exp\Big(-\frac{1}{2\sigma_1^2\sigma_2^2(1-\rho^2)}\big[\sigma_2^2(x-m_1)^2 - 2\rho\sigma_1\sigma_2(x-m_1)(y-m_2) + \sigma_1^2(y-m_2)^2\big]\Big)}{2\pi\,\sigma_1\sigma_2\sqrt{1-\rho^2}}.$$
The conditional distribution of Y given X = x is Gaussian
$$N\Big(m_2 + \frac{\rho\sigma_2}{\sigma_1}(x-m_1)\,,\ \sigma_2^2(1-\rho^2)\Big).$$
The result generalizes as follows: let X = (Y, Z) be a Gaussian vector. The conditional distribution of Z given Y = y is that of a Gaussian vector. Moreover, E(Z|Y) = a + ∑_{i=1}^k b_i Y_i if Y = (Y1, · · · , Yk), that is to say that the conditional expectation E(Z|Y) is a solution of the problem of multiple regression of Z on Y.
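To see Theorem 4.18 numerically, one can simulate a correlated Gaussian pair and compare the least squares regression line of Y on X with the affine function m2 + ρ(σ2/σ1)(x − m1). The sketch below is an added illustration with arbitrarily chosen parameters; it is not part of the original notes.

import numpy as np

rng = np.random.default_rng(3)
m1, m2, s1, s2, rho = 1.0, -1.0, 2.0, 1.0, 0.6     # arbitrary parameters, |rho| < 1

# Simulate the pair via the decomposition of Theorem 4.18.
Y1 = rng.standard_normal(200_000)
Y2 = rng.standard_normal(200_000)
X = m1 + s1 * Y1
Y = m2 + s2 * (rho * Y1 + np.sqrt(1 - rho**2) * Y2)

# Least squares regression of Y on X: slope and intercept.
slope, intercept = np.polyfit(X, Y, deg=1)
print(slope, rho * s2 / s1)                         # slope close to rho*s2/s1
print(intercept, m2 - rho * s2 / s1 * m1)           # intercept close to m2 - (rho*s2/s1)*m1

# The residual Y - E(Y|X) is uncorrelated with X and has variance s2^2*(1 - rho^2).
resid = Y - (m2 + rho * s2 / s1 * (X - m1))
print(np.corrcoef(X, resid)[0, 1], resid.var(), s2**2 * (1 - rho**2))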

We can give more precise results for the decomposition of a non negative semi-definite
symmetric d × d matrix Γ of rank r ≤ d. In this case, there exists a d × d matrix A such that

Γ = AÃ. When Γ is invertible (that is to say positive definite) we lay out a numerical method, called the Choleski decomposition, for finding a lower triangular matrix A. The Choleski decomposition of Γ (which is available in numerous program libraries) is calculated in the following way:
$$A_{1,1} = \sqrt{\Gamma_{1,1}}\,, \qquad A_{i,1} = \frac{\Gamma_{1,i}}{A_{1,1}} \ \text{ for } 2 \le i \le d,$$
$$\text{for } i \text{ increasing from } 2 \text{ to } d:\quad A_{i,i} = \sqrt{\Gamma_{i,i} - \sum_{1\le k\le i-1} |A_{i,k}|^2}\,,$$
$$\text{for } i < j \le d:\quad A_{j,i} = \frac{\Gamma_{i,j} - \sum_{k=1}^{i-1} A_{i,k} A_{j,k}}{A_{i,i}}\,, \qquad A_{i,j} = 0.$$
   
When d = 2 and
$$\Gamma = \begin{pmatrix} \sigma_1^2 & \rho\,\sigma_1\sigma_2 \\ \rho\,\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix}, \qquad \text{we have} \qquad A = \begin{pmatrix} \sigma_1 & 0 \\ \rho\,\sigma_2 & \sigma_2\sqrt{1-\rho^2} \end{pmatrix}.$$
Then if Y1 and Y2 are independent Gaussian N(0, 1) random variables, the vector X = (X1, X2) defined by X1 = m1 + σ1 Y1, X2 = m2 + σ2 (ρ Y1 + √(1 − ρ²) Y2) is Gaussian of expectation vector m = (m1, m2) and of covariance matrix Γ.
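A direct transcription of the recursion above into Python (an added sketch, not part of the notes; indices are shifted to start at 0) can be checked against numpy.linalg.cholesky and against the 2 × 2 formula just given; the values of σ1, σ2, ρ below are arbitrary.

import numpy as np

def choleski(Gamma):
    """Lower triangular A with Gamma = A @ A.T, following the recursion above.
    Gamma is assumed symmetric positive definite; indices start at 0."""
    d = Gamma.shape[0]
    A = np.zeros((d, d))
    A[0, 0] = np.sqrt(Gamma[0, 0])
    A[1:, 0] = Gamma[0, 1:] / A[0, 0]
    for i in range(1, d):
        A[i, i] = np.sqrt(Gamma[i, i] - np.sum(A[i, :i] ** 2))
        for j in range(i + 1, d):
            A[j, i] = (Gamma[i, j] - np.sum(A[i, :i] * A[j, :i])) / A[i, i]
    return A

sigma1, sigma2, rho = 2.0, 1.0, -0.3                 # arbitrary 2x2 example
Gamma = np.array([[sigma1**2, rho * sigma1 * sigma2],
                  [rho * sigma1 * sigma2, sigma2**2]])
A = choleski(Gamma)
print(A)                                             # [[sigma1, 0], [rho*sigma2, sigma2*sqrt(1-rho^2)]]
print(np.allclose(A @ A.T, Gamma))                   # True
print(np.allclose(A, np.linalg.cholesky(Gamma)))     # True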
The figures below show the simulation of 10 000 vectors X in R², centered and of covariance matrix
$$\Gamma = \begin{pmatrix} 3 & -2 \\ -2 & 3 \end{pmatrix} \qquad\text{and}\qquad \bar\Gamma = \begin{pmatrix} 3 & 2 \\ 2 & 3 \end{pmatrix}$$
respectively. The eigenvalues of Γ are 5 and 1 with eigenvectors, respectively, v1 = (1, −1) and v2 = (1, 1). Its Choleski decomposition is Γ = AÃ with
$$A = \begin{pmatrix} \sqrt{3} & 0 \\ -\frac{2}{\sqrt{3}} & \sqrt{\frac{5}{3}} \end{pmatrix}.$$
We see that in this case the points are concentrated in an ellipse of axes v1 and v2 and of lengths that depend on the eigenvalues of Γ. The eigenvalues of Γ̄ are 5 and 1 with eigenvectors, respectively, the vectors v̄1 = (1, 1) and v̄2 = (1, −1) and we observe a rotation of the ellipse.

[Two scatter plots of the 10 000 simulated points, for Γ and for Γ̄, each concentrated in the ellipse described above.]

The following results are important in statistics.


Definition 4.19 For every integer n ≥ 1, a χ²_n is a random variable ∑_{k=1}^n Xk² where the random variables (Xk, 1 ≤ k ≤ n) are independent N(0, 1).
We can show as an exercise that the square of a Gaussian random variable N(0, 1) follows a Γ(1/2, 1/2) distribution. The distribution of a χ²_n is then an immediate consequence of the following result.


If X and Y are independent random variables with distributions Γ(λ, a) and Γ(λ, b) respectively, then X + Y follows a Γ(λ, a + b) distribution. We deduce that a χ²_n follows a Γ(1/2, n/2) distribution.

Proposition 4.20 Let (Xk, 1 ≤ k ≤ n) be independent random variables with the same distribution N(m, σ²) ; we write X̄n = (1/n)∑_{k=1}^n Xk and Σn(X) = ∑_{k=1}^n (Xk − X̄n)². The random variables X̄n and Σn(X) are independent, with distributions, respectively, N(m, σ²/n) and σ²Y where Y is a χ²_{n−1}.

Proof. If the variance is zero, the Proposition is clear. If σ > 0, we want to return to the case
of N (0, 1) random variables. In fact, for all k ≥ 1, the random variable Yk = σ −1 (Xk − m)
is N (0, 1), the Yk are independent, X̄n = m + σ Ȳn and Σn (X) = σ 2 Σn (Y ).
Thus, we suppose that m = 0 and σ = 1. The average X̄n is a centered Gaussian random variable of variance (1/n²)∑_{k=1}^n Var(Xk) = 1/n. The vector (X1, · · · , Xn) is Gaussian and the vector (X̄n, X1 − X̄n, · · · , Xn − X̄n) is thus Gaussian. Because Σn(X) is a Borel function of the last n components of this vector, it suffices to verify that X̄n and (X1 − X̄n, · · · , Xn − X̄n) are independent and, thanks to Theorem 4.14, that for every k = 1, · · · , n, Cov(X̄n, Xk − X̄n) = 0 ; this is straightforward.
It then remains to show that Σn(X) is the sum of the squares of n − 1 independent N(0, 1) random variables. We write v1 = (1/√n, · · · , 1/√n). The vector v1 is of norm 1 and may be completed to an orthonormal basis (v1, · · · , vn) of R^n. We write A for the change of basis matrix from the canonical basis to the basis (v1, · · · , vn). It is an orthogonal matrix, as is its transpose Ã. The vector Z = ÃX is Gaussian, centered, and of covariance matrix Ã Id_n A = Id_n. The random variables Zk are independent N(0, 1) Gaussians and Z1 = (1/√n)∑_{k=1}^n Xk = √n X̄n. Moreover, from Theorem 4.15, ∑_{k=1}^n Zk² = ∑_{k=1}^n Xk². Finally,
$$\Sigma_n(X) = \sum_{k=1}^n X_k^2 - 2\bar X_n \sum_{k=1}^n X_k + n(\bar X_n)^2 = \sum_{k=1}^n X_k^2 - n(\bar X_n)^2 = \sum_{k=2}^n Z_k^2.$$

This completes the proof. 2

Definition 4.21 For every integer n ≥ 1, a random variable Tn = X/√(Y/n) follows a Student distribution with n degrees of freedom if the random variables X and Y are independent with distributions, respectively, N(0, 1) and χ²_n.

We immediately deduce that for an integer n ≥ 2, if (X1, · · · , Xn) are independent random variables with the same Gaussian distribution N(m, σ²) with σ > 0, the random variable
$$T = \frac{\sqrt{n}\,(\bar X_n - m)}{\sqrt{\frac{1}{n-1}\sum_{k=1}^n (X_k - \bar X_n)^2}}$$
follows a Student distribution with n − 1 degrees of freedom. The important feature is that
neither the definition of T nor its distribution depends on the parameter σ. This is used
in statistics to test results on the expectation of Gaussian samples when the variance is
unknown.
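A small simulation (added here as an illustration, with scipy assumed available for the Student distribution and the mean, variance and sample size chosen arbitrarily) shows how the statistic T is computed from a Gaussian sample and that its tails match the Student t_{n−1} distribution, regardless of σ.

import numpy as np
from scipy import stats        # assumed available for the Student t distribution

rng = np.random.default_rng(4)
m, sigma, n = 5.0, 3.0, 10                      # arbitrary mean, standard deviation and sample size

samples = rng.normal(m, sigma, size=(50_000, n))
xbar = samples.mean(axis=1)
s = samples.std(axis=1, ddof=1)                 # sqrt of sum (X_k - Xbar)^2 / (n - 1)
T = np.sqrt(n) * (xbar - m) / s                 # the statistic of the text

# Compare an empirical tail probability with the Student t_{n-1} distribution:
print(np.mean(np.abs(T) > 2.0), 2 * stats.t.sf(2.0, df=n - 1))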


5 Convergence Theorems

The aim of this chapter is twofold. On the one hand, we introduce two more notions of
convergence for sequences of random variables and compare them with those studied in the
first chapter about measure theory. On the other hand, we will prove classical convergence
results for the average of sequences of independent identically distributed random variables.
In all this chapter, unless specified otherwise, we consider a probability space (Ω, F , P )
which will not be systematically recalled in the sequel. The spaces Lp , 1 ≤ p ≤ +∞ are the
spaces Lp (P ) defined in the first chapter.
Convention of notation. In all of this chapter, if (Xn, n ≥ 1) denotes a sequence of real random variables or of random vectors, we denote by Sn = ∑_{k=1}^n Xk the sum and by X̄n = Sn/n the average of the first n random variables. Finally, a sequence of independent identically distributed random variables is written i.i.d.

5.1 Convergence in probability

Recall that by the Schwarz inequality (resp. Hölder inequality), if a sequence of random
variables (Xn ) converges to X in L2 (resp. in Lp with 1 < p < +∞), then (Xn ) converges
to X in L1 . More generally, if 1 ≤ p1 < p2 ≤ +∞ and if the sequence (Xn ) converges to X
in Lp2 , then (Xn ) converges to X in Lp1 .
The converse is false, as the following example shows: for Ω = ]0, 1[ consider the Borel σ-algebra with the Lebesgue measure and Xn = n^a 1_{]0,1/n[}. If a p1 < 1, the sequence (Xn) converges to 0 in L^{p1}, while for all p2 > p1, if a p2 > 1, the sequence ‖Xn‖_{p2} tends to +∞ and (Xn) then does not converge to 0 in L^{p2}. In the same way, if a > 0, the sequence (Xn) is not bounded in L^∞, while for p < +∞, if ap < 1, it converges to 0 in L^p.
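Indeed, an explicit computation of the L^p norm of Xn makes both thresholds visible:
$$\|X_n\|_p^p = \int_0^{1/n} n^{ap}\,dx = n^{ap-1}, \qquad\text{so}\qquad \|X_n\|_p = n^{a-\frac{1}{p}} \longrightarrow 0 \iff ap < 1.$$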

The following definition introduces a notion of convergence which is weaker than that of
convergence in L1 and of almost sure convergence. We write kxk for the Euclidean norm of
a vector x ∈ Rd .

Definition 5.1 A sequence of random vectors Xn : Ω → Rd converges to the random vector


X : Ω → Rd in probability if for all ε > 0,

$$\lim_n P(\|X_n - X\| \ge \varepsilon) = 0. \qquad (5.1)$$

We remark immediately that a sequence of random vectors (Xn = (Xn^1, · · · , Xn^d) ∈ R^d, n ≥ 1) converges to X = (X^1, · · · , X^d) in probability if and only if for each i = 1, · · · , d, the sequence of real random variables (Xn^i, n ≥ 1) converges in probability to X^i. The following theorem relates convergence in L¹, almost sure convergence and convergence in probability.

Theorem 5.2 Let (Xn : Ω → Rd , n ≥ 1) be a sequence of random vectors and X : Ω → Rd .


Then :
(i) If the sequence (Xn ) converges to X in L1 , then the sequence (Xn ) converges to X
in probability.
(ii) If the sequence (Xn ) converges to X almost surely, then the sequence (Xn ) converges
to X in probability.
(iii) If the sequence (Xn ) converges to X in probability, then there exists a subsequence
(nk , k ≥ 1) such that the sequence (Xnk , k ≥ 1) converges to X almost surely.


(iv) If the sequence (Xn ) converges to X in probability and is equiintegrable, then the
sequence (Xn ) converges to X in L1 .

Proof.
(i) The Markov inequality (1.5) shows that for all ε > 0,

$$P(\|X_n - X\| \ge \varepsilon) \le \frac{1}{\varepsilon}\, E(\|X_n - X\|),$$
which proves that convergence in L1 implies convergence in probability.
(ii) For every integer N ≥ 1 and all ε > 0, we write ΩN = {ω : supn≥N kXn − Xk ≥ ε}.
The sequence of sets ΩN is decreasing and the almost sure convergence of (Xn ) to X shows
that P (ΩN ) → 0 when N → +∞. Since {kXN − Xk ≥ ε} ⊂ ΩN , we deduce that almost
sure convergence implies convergence in probability.

(iii) For every integer k ≥ 1, let Nk be an integer such that P kXn − Xk ≥ k1 ≤ k12 for
all n ≥ Nk . We then construct a sequence (nk , k ≥ 1) such that P for all k ≥P1, nk+1 > nk
and nk ≥ Nk . We write Ak = {kXnk − Xk ≥ k1 }. The series k P (Ak ) ≤ k k −2 is thus
convergent and the Borel-Cantelli lemma 2.2 implies that P (lim sup Ak ) = 0. We deduce
that, for almost all ω, there exists an integer K(ω) such that for k ≥ K(ω), kXnk − Xk(ω) ≤
1
k
, which shows that the sequence (Xnk (ω) − X(ω), k ≥ 1) converges to 0.
(iv) It suffices to reason on each component of Xn, that is to say we may assume that d = 1. From (iii), there exists a subsequence (nk) such that (X_{nk}, k ≥ 1) converges to X almost surely. The equiintegrability of the sequence (Xn) shows that sup_k E(|X_{nk}|) < +∞ (cf. Proposition 2.6) and the Fatou lemma 1.33 thus shows that X is integrable. Moreover, for every ε > 0 and a > 0,
$$E(\|X_n - X\|) \le \varepsilon + E\big(\|X_n - X\|\,\mathbf 1_{\{\|X_n - X\|\ge \varepsilon\}}\big) \le \varepsilon + E\big(\|X_n\|\,\mathbf 1_{\{\|X_n\|\ge a\}} + \|X\|\,\mathbf 1_{\{\|X\|\ge a\}}\big) + E\big[\|X_n - X\|\,\mathbf 1_{\{\|X_n - X\|\ge\varepsilon\}}\,\mathbf 1_{\{\|X_n - X\|\le 2a\}}\big].$$
From Proposition 2.6, for every ε > 0 there exists a > 0 such that E(‖X‖ 1_{{‖X‖ ≥ a}}) ≤ ε and, for all n, E(‖Xn‖ 1_{{‖Xn‖ ≥ a}}) ≤ ε. We deduce E(‖Xn − X‖) ≤ 3ε + 2a P(‖Xn − X‖ ≥ ε) ≤ 4ε if n is large enough. □

Let Ω = [0, 1[ be endowed with the Borel σ-algebra and the Lebesgue measure. For all a > 0, the sequence Xn = n^a 1_{[0,1/n[} converges in probability to 0, while if a > 1 the sequence Xn does not converge to 0 in L¹. Likewise, for every k ≥ 1 and for n = 0, · · · , 2^k − 1, let X_{k,n} = 1_{[n2^{−k},(n+1)2^{−k}[}. By ordering the pairs (k, n) so that (k1, n1) ≤ (k2, n2) if k1 < k2, or if k1 = k2 and n1 ≤ n2, we obtain a sequence which converges to 0 in L¹, and thus also in probability, but not almost surely. We then see that the converses of properties (i) and (ii) are false (without extraction of a subsequence or additional conditions).
The convergence in probability is metrizable, as the following proposition shows.

Proposition 5.3 Let X and Y be random variables with values in Rd , and


 
$$d(X,Y) = E\Big(\frac{\|X - Y\|}{1 + \|X - Y\|}\Big). \qquad (5.2)$$


Then d is a distance (on the set of random variables quotiented by the a.s. null random variables, that is to say those such that ‖X‖ = 0 a.s.). The set of random vectors (quotiented by the a.s. null vectors) is complete for d.
Furthermore, for every sequence of random vectors (Xn ) with values in Rd , the sequence
(Xn ) converges to X in probability if and only if the sequence d(Xn , X) converges to 0.

Proof. (i) First of all we show that d is a distance. The function f : [0, +∞[ → [0, +∞[ defined by f(x) = x/(1 + x) is increasing and the triangle inequality for the norm on R^d implies that for X, Y, Z : Ω → R^d,
$$d(X,Z) \le E\big(f(\|X-Y\| + \|Y-Z\|)\big) \le E\Big(\frac{\|X-Y\|}{1+\|X-Y\|} + \frac{\|Y-Z\|}{1+\|Y-Z\|}\Big),$$
which implies the triangle inequality for d. Clearly d(X, Y) = d(Y, X). Finally, if d(X, Y) = 0, we have ‖X − Y‖/(1 + ‖X − Y‖) = 0 a.s., which implies X = Y a.s.
= 0 a.s., which implies X = Y a.s.
(ii) We show that the set of random vectors endowed with the distance
P d is complete.
First of all let (Xn ) be a sequence of random vectors such that n d(Xn , Xn+1 ) < +∞.
We first show that (Xn ) converges almost surely to a random vector X. In fact, the monotone
convergence theorem 1.23 (or the Fubini-Tonelli theorem 1.53 works as well) shows that
P kXn −Xn+1 k P P kXn −Xn+1 k
E( n 1+kX n −Xn+1 k
) = n d(Xn , Xn+1 ) < +∞, which implies that n 1+kX n −Xn+1 k
< +∞
P
a.s. Because
P thean convergence of thePseries of positive terms n an is equivalent to that of
the series n 1+an , we deduce that n kXn − Xn+1 k < +∞ a.s., and thus that the sequence
(Xn ) converges almost surely to a random vector, denoted X.
kXn −Xk
The sequence 1+kX n −Xk
then converges almost surely to 0 and is dominated by 1. The
dominated convergence theorem yields that the sequence d(Xn , X) converges to 0.
This shows that the metric space of the (equivalence classes of) random vectors endowed
with the distance d is complete. Indeed,
P if the sequence (Yn ) is Cauchy, we are able to extract
a subsequence Xk = Ynk such that k d(Xk , Xk+1 ) < +∞. Hence there exists X such that
d(Xnk , X) → 0 when k → +∞. A Cauchy sequence with a subsequence converging to X is
such that the whole sequence converges to X . This concludes the proof of the completeness
of the metric space.

(iii) It remains to show that convergence in probability is equivalent to convergence for d. Let (Xn) be a sequence that converges to X in probability. Then
$$E\Big(\frac{\|X_n - X\|}{1 + \|X_n - X\|}\Big) \le \frac{\varepsilon}{1+\varepsilon} + \int_{\{\|X_n - X\|\ge\varepsilon\}} \frac{\|X_n - X\|}{1 + \|X_n - X\|}\, dP \le \varepsilon + P(\|X_n - X\|\ge\varepsilon) \le 2\varepsilon$$
for n large enough.
Conversely, if the sequence (Xn) does not converge to X in probability, there exist ε > 0 and α > 0 such that for all N, there exists n ≥ N for which P(‖Xn − X‖ ≥ ε) ≥ α. Then, since x ∈ [0, +∞[ → x/(1 + x) is increasing,
$$d(X_n, X) \ge \int_{\{\|X_n - X\|\ge\varepsilon\}} \frac{\|X_n - X\|}{1 + \|X_n - X\|}\, dP \ge \frac{\varepsilon}{1+\varepsilon}\, P(\|X_n - X\|\ge\varepsilon) \ge \frac{\varepsilon}{1+\varepsilon}\,\alpha.$$

The sequence d(Xn , X) thus does not converge to 0. 2
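As an added numerical illustration (not in the original notes), the distance d of (5.2) can be estimated by Monte Carlo; for Xn = X + bounded noise of size 1/n, the estimated d(Xn, X) decreases to 0, in line with convergence in probability. The choice of X and of the noise below is arbitrary.

import numpy as np

rng = np.random.default_rng(5)
N = 200_000
X = rng.standard_normal(N)                        # one sample of X per simulated omega

def d_estimate(U, V):
    """Monte Carlo estimate of d(U, V) = E(|U - V| / (1 + |U - V|))."""
    diff = np.abs(U - V)
    return np.mean(diff / (1.0 + diff))

for n in (1, 10, 100, 1000):
    Xn = X + rng.uniform(-1.0, 1.0, size=N) / n   # X_n = X + bounded noise of size 1/n
    print(n, d_estimate(Xn, X))                    # decreases towards 0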


The following proposition shows by example that the sum (and the product) of sequences
of real random variables which converge in probability converge to the sum (and the product)
of the limits. Its proof is left as an exercise. (Note that one can return to a compact set of
R2 and use the uniform continuity of a continuous function on a compact set.)

Proposition 5.4 (i) Let (Xn ) and (Yn ) be sequences of real random variables such that Xn
(resp. Yn ) converges in probability to X (resp. to Y ) and let f : R2 → R be a continuous
function. Then the sequence f (Xn , Yn ) converges in probability to f (X, Y ).
(ii) If P(X = 0) = 0, the sequence 1/Xn converges in probability to 1/X.

5.2 Laws of large numbers

First we show that the average of a sequence of square integrable i.i.d. random variables
converges in probability. This result, whose proof is very simple, suffices in many concrete
situations. We first of all prove a little more general result, of which the weak law of large
numbers is an immediate consequence.
For every real random variable Z ∈ L2 and λ > 0, the following version of the Markov
inequality is called the Bienayme-Chebychev inequality :
$$P(|Z - E(Z)| \ge \lambda) \le \frac{\mathrm{Var}(Z)}{\lambda^2}. \qquad (5.3)$$
It follows immediately from the Markov inequality applied to the random variable |Z − E(Z)|² and the constant λ² (and it shows directly that convergence in L² implies convergence in probability).

Theorem 5.5 Let (Xn) be a sequence of independent square integrable real random variables ; for all n ≥ 1, we write mn = E(Xn) and σn² = Var(Xn). We suppose that there exists m ∈ R such that
$$\lim_n \frac{1}{n}\sum_{k=1}^n m_k = m \qquad\text{and}\qquad \sum_{k=1}^n \sigma_k^2 = o(n^2).$$

Then the sequence X̄n converges in L² (and thus in probability) to m.

Proof. Because (a + b)2 ≤ 2(a2 + b2 ), the independence of the sequence (Xn ) implies that
for all n ≥ 1
 2  2 2
1 Xn 1 Xn 2 Xn 1 Xn
2   2
E(|X̄n −m| ) ≤ 2E X̄n − mk +2 mk − m ≤ 2 σk +2 mk − m .
n n n n
k=1 k=1 k=1 k=1

This completes the proof. 2

If the sequence (Xn) is i.i.d., real valued, square integrable, of expectation m and of variance σ², we have (1/n)∑_{k=1}^n mk = m and ∑_{k=1}^n σk² = nσ². By reasoning on each component of the random vector, we immediately deduce
of the random vector, we immediately deduce


Corollary 5.6 Let (Xn ) be a sequence of independent identically distributed square inte-
grable random vectors. Then the sequence (X̄n ) converges in L2 , and thus also in probability,
to E(X).


However, in certain situations, the weak law of large numbers is insufficient. For example,
the Monte-Carlo method consists in approximating the expectation of a random variable by
the average X̄n (ω) for a single realization (xn = Xn (ω) , n ≥ 1) of the sequence (Xn , n ≥ 1),
that is to say for a single value of ω. It is clear that a result of convergence in probability is
insufficient for proving that for almost every realization, the sequence X̄n (ω) approximates
E(X). On the other hand, we would like to weaken the square integrability condition by the
more natural integrability condition on the i.i.d. sequence (Xn ). These two improvements
are achieved in the following theorem.

Theorem 5.7 Let (Xn ) be a sequence of independent identically distributed integrable ran-
dom vectors. Then the sequence (X̄n ) converges almost surely to E(X1 ).

Proof. (1) By reasoning on each component of the sequence (Xn ), we return to the case of
i.i.d. real integrable random variables. Furthermore,
 replacing the i.i.d. sequence (Xn ) by the
centered i.i.d. sequence Yn = Xn − E(Xn ), n ≥ 1 , we return to the case where E(X1 ) = 0.
We thus consider an i.i.d. sequence of integrable, centered real random variables (Xn ).
(2) We will only prove the theorem in the particular case of square integrable, centered, i.i.d. real random variables (Xn). We first show that the sequence X̄_{n²} converges a.s. to 0. Indeed, for all ε > 0, the Bienayme-Chebychev inequality (5.3) shows that
$$P(|\bar X_{n^2}| \ge \varepsilon) \le \frac{1}{\varepsilon^2}\,\mathrm{Var}(\bar X_{n^2}) = \frac{n^2\,\mathrm{Var}(X_1)}{n^4\,\varepsilon^2} = \frac{\mathrm{Var}(X_1)}{n^2\,\varepsilon^2}.$$
The Borel-Cantelli lemma (2.2) shows that P(lim sup{|X̄_{n²}| ≥ n^{−1/4}}) = 0. For almost all ω, there then exists an integer N(ω) such that for n ≥ N(ω), |X̄_{n²}(ω)| ≤ n^{−1/4}, which shows the almost sure convergence of X̄_{n²} to 0.
(3) We now suppose that (Xn ) is square integrable, and we show that the sequence X̄n
converges a.s. For every integer n ≥ 1, we write

$$A_n^\varepsilon = \{\omega : \exists k \in \{n^2, n^2+1, \cdots, (n+1)^2-1\},\ |\bar X_k - \bar X_{n^2}|(\omega) > \varepsilon\}.$$
Then the Markov inequality (1.5), the trivial inequality (a + b)² ≤ 2a² + 2b² and the independence of the Xn imply that for all ε > 0,
$$P(A_n^\varepsilon) \le \sum_{k=n^2}^{(n+1)^2-1} P\big(|\bar X_k - \bar X_{n^2}| \ge \varepsilon\big) \le \frac{1}{\varepsilon^2} \sum_{k=n^2}^{(n+1)^2-1} E\big(|\bar X_k - \bar X_{n^2}|^2\big)$$
$$\le \frac{1}{\varepsilon^2} \sum_{k=n^2}^{(n+1)^2-1} E\bigg(\Big|\Big(\frac{1}{k} - \frac{1}{n^2}\Big)\sum_{i=1}^{n^2} X_i + \frac{1}{k}\sum_{i=n^2+1}^{k} X_i\Big|^2\bigg)$$
$$\le \frac{2}{\varepsilon^2} \sum_{k=n^2}^{(n+1)^2-1} \frac{(k-n^2)^2}{(k\,n^2)^2}\,\mathrm{Var}\Big(\sum_{i=1}^{n^2} X_i\Big) + \frac{2}{\varepsilon^2} \sum_{k=n^2}^{(n+1)^2-1} \frac{1}{k^2}\,\mathrm{Var}\Big(\sum_{i=n^2+1}^{k} X_i\Big)$$
$$\le \frac{2}{\varepsilon^2} \sum_{k=n^2}^{(n+1)^2-1} \frac{(2n)^2}{n^8}\, n^2\,\mathrm{Var}(X_1) + \frac{2}{\varepsilon^2} \sum_{k=n^2}^{(n+1)^2-1} \frac{1}{n^4}\, 2n\,\mathrm{Var}(X_1) \le \frac{C}{\varepsilon^2}\Big(\frac{1}{n^3} + \frac{1}{n^2}\Big) \le \frac{C}{\varepsilon^2\, n^2}.$$


Because ∑_n P(A_n^{n^{−1/4}}) < +∞, the Borel-Cantelli lemma implies that P(lim sup A_n^{n^{−1/4}}) = 0. We deduce that for almost all ω and all ε > 0, there exists N1(ω) such that for all n ≥ N1(ω), and for all k ∈ {n², · · · , (n + 1)² − 1}, |X̄k − X̄_{n²}|(ω) ≤ ε. Moreover, for almost all ω there exists N2(ω) such that for all n ≥ N2(ω), |X̄_{n²}(ω)| ≤ ε. The sequence (X̄n) then converges almost surely to 0. □
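The Monte-Carlo use of this theorem mentioned earlier is easy to visualize: along a single simulated realization of an i.i.d. integrable sequence, the running averages X̄n approach E(X1). The sketch below is an added illustration; the exponential distribution and its parameter are arbitrary choices.

import numpy as np

rng = np.random.default_rng(6)
X = rng.exponential(scale=2.0, size=1_000_000)      # i.i.d. integrable sample, E(X_1) = 2
running_avg = np.cumsum(X) / np.arange(1, X.size + 1)

for n in (10, 1_000, 100_000, 1_000_000):
    print(n, running_avg[n - 1])                    # approaches E(X_1) = 2 along this single realization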

5.3 Convergence in distribution

The following notion corresponds to the weak convergence of the distributions (which are probabilities on Borel σ-fields). It is the weakest of the convergences studied and, contrary to the preceding ones, it does not require that all the random variables be defined on the same probability space.
5.3.1 Definition
As its name indicates, this notion depends only on the distributions of the random variables, not on the probability space on which they are defined (contrary to all the preceding convergences). It is defined as the weak convergence of the pushforwards of the probabilities by the random variables.

Definition 5.8 (i) A sequence of probabilities (µn) on R^d converges weakly to the probability µ if for every bounded, continuous function f : R^d → R the sequence (∫ f dµn, n ≥ 1) converges to ∫ f dµ.
(ii) Let (Ωn, Fn, Pn) be a sequence of probability spaces and (Ω, F, P) a probability space. A sequence of random variables Xn : (Ωn, Fn) → (R^d, R_d) converges in distribution to the random variable X : (Ω, F) → (R^d, R_d) if the sequence (Pn)_{Xn} = Pn ∘ Xn^{−1} of distributions of Xn converges weakly to the distribution P_X = P ∘ X^{−1} of X. If all the probability spaces are the same, the convergence in distribution of the sequence (Xn) to X is then defined by the convergence of the sequence (E[f(Xn)], n ≥ 1) to E[f(X)] for every bounded continuous function f : R^d → R.
R
A probability µ on R^d is characterized by the family of integrals ∫ f dµ for continuous bounded functions f (from the functional version of the monotone class theorem); a sequence (Xn) of random vectors converges in distribution to X exactly when the sequence of distributions (P_{Xn}, n ≥ 0) converges weakly to the distribution P_X. The limiting distribution is therefore unique (but neither the limiting random variable nor the probability space on which it is defined is).
The following proposition allows us to characterize convergence in distribution of a se-
quence of random variables (Xn ) by restricting the class of functions f in the definition.
Thus, restricting the convergence to continuous functions with compact support, we have a
notion of vague convergence.
Convention of notation In order to simplify the notations, we will suppose in the sequel
that all of the probability spaces on which the random variables are defined are the same
space (Ω, F , P ). Many results remain valid in the case of a sequence of random variables
defined on different probability spaces.

Proposition 5.9 A sequence of random vectors Xn : Ω → Rd converges in distribution to


X : Ω → Rd if and only if for every continuous function f : Rd → R with compact support,
the sequence E[f (Xn )], n ≥ 1 converges to E[f (X)].


Proof. The convergence of the integrals of continuous bounded functions implies that of
continuous functions with compact support (or which tend to zero at infinity). Conversely,
let f be a bounded continuous function, (hk , k ≥ 1) a sequence of continuous functions
with compact support which increase to 1 (for example defined by hk (x) = 1 if kxk ≤ k,
hk (x) = 0 if ‖x‖ ≥ k + 1 and hk (x) = k + 1 − ‖x‖ if k < ‖x‖ < k + 1). Then
|E[f (Xn ) − f (X)]| ≤ |E[(f hk )(Xn ) − (f hk )(X)]|
+|E[f (Xn )(1 − hk (Xn ))]| + |E[f (X)(1 − hk (X))]|
≤ |E[(f hk )(Xn ) − (f hk )(X)]|
+kf k∞ (1 − E[hk (Xn )]) + kf k∞ (1 − E[hk (X)]) .
For all ε > 0, the dominated convergence theorem 1.36 shows that we may choose k such that
1 − E[hk (X)] < ε. The function hk being continuous with compact support, the sequence
E[hk (Xn )] converges to E[hk (X)] when n → ∞ and we may find N1 such that for n ≥ N1 ,
1 − E(hk (Xn )) < 2ε. Finally, the support of the function (f hk ) is compact and we may find
N2 such that for all n ≥ N2 , |E[(f hk )(Xn )−(f hk )(X)]| ≤ ε. We deduce that for n ≥ N1 ∨N2 ,
|E[f (Xn ) − f (X)]| ≤ (1 + 3kf k∞ )ε. 2
The example µn = δn provides us with a sequence of probabilities such that for every continuous function f which tends to 0 at infinity (for example, continuous with compact support), the sequence (∫ f dµn, n ≥ 1) converges to 0. There is vague convergence of the sequence (δn) to 0 but there is a « loss of total mass ». The sequence of random variables (Xn = n) then does not converge in distribution. When we require that the limit measure
also be a probability (for example the distribution of a random variable) we avoid the
problem of loss of mass. Another more technical notion is that of tightness which will not
be addressed in these notes.
Proposition 5.10 If a sequence of random vectors (Xn ) with values in Rd converges in
distribution to a random vector X with values in Rd , the sequence PXn of distributions of
Xn is tight, that is to say that for all ε > 0 there exists a constant K > 0 such that
$$\sup_n P(\|X_n\| \ge K) \le \varepsilon. \qquad (5.4)$$

Proof. For all ε > 0, there exists K1 such that P (kXk ≥ K1 ) ≤ ε. We may construct
a continuous function f such that f (x) = 0 if kxk ≤ K1 , 0 ≤ f ≤ 1 and f (x) = 1 if
kxk ≥ K1 + 1. Then there exists N such that for n ≥ N , |E(f (X)) − E(f (Xn ))| ≤ ε and
P (kXn k ≥ K1 + 1) ≤ E[f (Xn )] ≤ E[f (X)] + ε ≤ P (kXk ≥ K1 ) + ε ≤ 2ε.
It remains to choose K2 such that for n = 1, · · · , N − 1, P(‖Xn‖ ≥ K2) ≤ ε to deduce that for all n ≥ 1 and K = (K1 + 1) ∨ K2, the inequality P(‖Xn‖ ≥ K) ≤ 2ε is true. □
The following theorem shows that convergence in distribution is the weakest of all of the
notions of convergence that we have studied.
Proposition 5.11 Let (Xn ) be a sequence of random vectors with values in Rd defined on
the same probability space.
(i) If the sequence (Xn ) converges in probability to X, then (Xn ) converges in distribution
to X.
(ii) If the sequence (Xn ) converges in distribution to an a.s. constant random variable
X, then (Xn ) converges to X in probability.


Proof. (i) Let f : Rd → R be a continuous function with compact support. It is uniformly


continuous and for all ε > 0 there then exists α > 0 such that kx − yk ≤ α implies
|f (x) − f (y)| ≤ ε. Then

|E[f (Xn ) − f (X)]| ≤ ε + 2kf k∞ P (kXn − Xk ≥ α).

The convergence of P (kXn −Xk ≥ α) to 0 then shows that the sequence E[f (Xn )] converges
to E[f (X)] and Proposition 5.9 concludes the proof.
(ii) Let (Xn ) be a sequence that converges in distribution to the constant vector a ∈ Rd .
For all λ > 0 let f be a continuous function such that 0 ≤ f ≤ 1, f (x) = 1 if kx − ak ≥ λ
and f (x) = 0 if kx − ak ≤ λ2 . Then, when n → ∞

P (kXn − ak ≥ λ) ≤ E[f (Xn )] → 0. 2

The following example shows that, except in the case of a constant limit, the convergence
in distribution is strictly weaker than that in probability.

Example 5.12 Let X and Y be independent random variables of Bernoulli distribution of


parameter p ∈]0, 1[. Let Xn = Y for every integer n. Then the sequence (Xn ) converges in
distribution to Y and thus also to X. However for all λ < 1, P (|Xn − X| ≥ λ) = P (X =
0, Y = 1) + P (X = 1, Y = 0) = 2p(1 − p). The sequence (Xn ) thus does not converge in
probability to X.

The following table gathers the links between the different types of convergence.
$$X_n \xrightarrow{\;L^2\;} X \;\Longrightarrow\; X_n \xrightarrow{\;L^1\;} X \;\Longrightarrow\; X_n \xrightarrow{\;\text{proba}\;} X \;\Longrightarrow\; X_n \xrightarrow{\;\text{law}\;} X, \qquad X_n \xrightarrow{\;\text{a.s.}\;} X \;\Longrightarrow\; X_n \xrightarrow{\;\text{proba}\;} X.$$

If the random variables Xn and X take discrete values, we easily characterize the conver-
gence in distribution of (Xn ) to X.

Proposition 5.13 Let (Xn ) be a sequence of random variables which take their values on
a finite set I ⊂ R or in I = N. Then the sequence (Xn ) converges in distribution to the
random variable X : Ω → I if and only if for all i ∈ I, the sequence P (Xn = i), n ≥ 1
converges to P (X = i).

Proof. Let f : R → R be a continuous function with compact support. The support of f contains only a finite subset J ⊂ I. Then E[f(Xn)] = ∑_{i∈J} f(i) P(Xn = i) converges to E[f(X)] = ∑_{i∈J} f(i) P(X = i) if for each i ∈ I, P(Xn = i) converges to P(X = i) when n → ∞.
Conversely, for all i ∈ I, there exists a bounded continuous function fi such that fi (i) = 1
and such that for all j 6= i, fi (j) = 0. We deduce that P (Xn = i) = E[fi (Xn )] converges to
P (X = i) = E[fi (X)] if the sequence (Xn ) converges to X in distribution. 2

One can show as an exercise that if (Xn ) is a sequence of random variables with Binomial
distribution B(n, pn ) such that npn → λ > 0 when n tends to +∞, then the sequence (Xn )
converges in distribution to a random variable X with Poisson distribution of parameter λ.
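The exercise can be checked numerically by comparing the probabilities P(Xn = i) of a B(n, λ/n) distribution with the Poisson probabilities, in the spirit of Proposition 5.13. The sketch below is an added illustration (scipy assumed available; λ = 3 is an arbitrary choice).

import numpy as np
from scipy import stats     # assumed available for the binomial and Poisson pmfs

lam = 3.0
for n in (10, 100, 1000):
    p_n = lam / n
    i = np.arange(8)
    binom_pmf = stats.binom.pmf(i, n, p_n)
    poisson_pmf = stats.poisson.pmf(i, lam)
    print(n, np.max(np.abs(binom_pmf - poisson_pmf)))   # max gap over i = 0,...,7 shrinks as n grows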


5.3.2 Convergence in distribution and distribution function


Let (Xn) be a sequence of random variables that converges in distribution to X ; we seek to show that for a Borel set A of R^d, the sequence (P(Xn ∈ A), n ≥ 1) converges to P(X ∈ A). The following example shows that one has to add constraints on A. Indeed, the indicator function 1_A is bounded, but it is not continuous.
Example 5.14 The sequence of random variables Xn = 1/n converges in distribution to X = 0. However, the set A = ]−∞, 0] is such that for every integer n, P(Xn ∈ A) = 0 while P(X ∈ A) = 1. This example clearly shows that the link between the distribution of X and the boundary of A is crucial.
Theorem 5.15 Let (Xn ) be a sequence of random vectors with values in Rd which converges
in distribution to X and let A be a Borel set of Rd with boundary δ(A). Then when n → ∞,
if P (X ∈ δA) = 0, P (Xn ∈ A) → P (X ∈ A). (5.5)
Proof. We define two sequences (fk) and (gk) of continuous functions such that (fk) increases to 1_{Å} (the indicator of the interior of A) and (gk) decreases to 1_{Ā}.
Indeed, the function x → d(x, A^c) is continuous and is zero on the closure of A^c. The sequence of functions defined by
$$f_k = \inf(1, k\,d(\cdot, A^c))$$
is thus an increasing sequence of continuous functions such that 0 ≤ fk ≤ 1, and which satisfies fk(x) = 0 if x belongs to the closure of A^c, that is to say to (Å)^c, while fk(x) → 1 if d(x, A^c) > 0, that is to say if x ∈ Å. In the same way let
$$g_k = 1 - \inf(1, k\,d(\cdot, A)).$$
The sequence (gk) of continuous functions is decreasing, 0 ≤ gk ≤ 1, gk(x) = 1 if x ∈ Ā and gk(x) → 0 if x ∉ Ā. Thus for every integer k ≥ 1,
$$f_k \le 1_{\mathring{A}} \le 1_{\bar A} \le g_k.$$

The hypothesis made on the boundary of A and the dominated convergence theorem imply
that for all ε > 0, there exists an integer K such that E[gK(X) − fK(X)] ≤ ε, since gk − fk
decreases to 1_Ā − 1_Å = 1_{δA} and P(X ∈ δA) = 0. The convergence in distribution of (Xn)
to X then yields the existence of an integer N such that for all n ≥ N,
|E[gK(Xn) − gK(X)]| ≤ ε and |E[fK(Xn) − fK(X)]| ≤ ε. Since fK ≤ 1A ≤ gK, we have
E[fK(Xn)] ≤ P(Xn ∈ A) ≤ E[gK(Xn)] and the same inequalities for X ; we then deduce that
for all n ≥ N,

    |P(Xn ∈ A) − P(X ∈ A)| ≤ 3ε,

which concludes the proof. 2

For special sets like A =] − ∞, t], we connect the convergence in distribution of real
random variables to the simple convergence of their distribution functions. We recall that
for every random variable X : Ω → R the distribution function F of X is defined by
F (t) = P (X ≤ t).
Theorem 5.16 Let (Xn) be a sequence of real random variables of distribution functions
Fn and X be a real random variable of distribution function F. Then the sequence (Xn)
converges to X in distribution if and only if

    for all t ∈ R such that P(X = t) = 0,   Fn(t) = P(Xn ≤ t) → F(t) = P(X ≤ t).




Proof. The forward implication comes from the preceding theorem, because the boundary
of ]−∞, t] is {t}. The converse implication is admitted without proof. This theorem implies
in particular that if the random variable X has a density (so that P(X = t) = 0 for every t),
then (Xn) converges to X in distribution if and only if the sequence (Fn) of distribution
functions of Xn converges simply to the distribution function F of X. 2

Example 5.17 Theorem 5.16 is well suited for showing the convergence in distribution
of a sequence of random variables defined as the supremum or infimum of sequences of
independent random variables. Let (Xn) be a sequence of independent random variables
with the same uniform distribution on the interval [a, b], a < b, all defined on the same
probability space. Then the sequences In = inf(X1, · · · , Xn) and Mn = sup(X1, · · · , Xn)
converge in probability, respectively, to a and b.
Indeed, by Proposition 5.11, it suffices to prove that the sequences (In) and (Mn) converge
in distribution respectively to the constants a and b. The distribution function of the constant
b is the function F = 1_[b,+∞[. Then Theorem 5.16 shows that the sequence (Mn) converges
in distribution to b if and only if for all t ≠ b, the sequence P(Mn ≤ t) converges to
F(t). For all t ≤ a, P(Mn ≤ t) = 0, for all t ≥ b, P(Mn ≤ t) = 1 and for all t ∈ ]a, b[,
P(Mn ≤ t) = ((t − a)/(b − a))^n → 0 when n → ∞. The proof of the convergence in
distribution of (In), which is similar, is left as an exercise.
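As a numerical illustration (a Python/numpy sketch added here; the choice a = 0, b = 1 and the test point t = 0.95 are arbitrary), one can compare the empirical value of P(Mn ≤ t) with the exact value ((t − a)/(b − a))^n and watch it collapse to 0 as n grows.

```python
import numpy as np

rng = np.random.default_rng(1)
a, b, N = 0.0, 1.0, 10_000                   # uniform distribution on [a, b], N simulated samples of M_n

t = a + 0.95 * (b - a)                       # a fixed point t < b
for n in (10, 100, 1000):
    X = rng.uniform(a, b, size=(N, n))
    M_n = X.max(axis=1)                      # M_n = sup(X_1, ..., X_n)
    print(n, (M_n <= t).mean(), ((t - a) / (b - a)) ** n)   # empirical vs exact P(M_n <= t)
```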
5.3.3 Convergence in distribution and characteristic function
The characteristic function is again a very powerful tool for characterizing the conver-
gence in distribution. Furthermore, the convergence in distribution and the convolution with
Gaussian distributions whose variances tend to 0 allow us to recover the distribution of a
random variable from its characteristic function.
The following result is important, because it is the basis of the injectivity of the Fourier
transform and of the Fourier inversion formula which were stated in the previous chapter
(Theorems 4.4 and 4.8). We recall that we write ⟨x, y⟩ for the scalar product of the vectors
x and y and ‖x‖² = ⟨x, x⟩ for the square of the Euclidean norm of x.

Theorem 5.18 (i) For all n, let Xn be a Gaussian random vector of Rd with independent
N(0, 1) components, that is to say of distribution N(0, Idd). Then for every sequence (σn) which
converges to 0, the sequence of random vectors σn Xn converges in probability to 0.
(ii) Furthermore, for every probability µ on Rd, if we write

    µ̂(t) = ∫_{Rd} e^{i⟨t,x⟩} µ(dx)

for the Fourier transform of µ and

    gσ(x) = (σ√(2π))^{−d} exp( −‖x‖² / (2σ²) )

for the density of the centered Gaussian vector of covariance matrix σ² Idd, then for every
sequence (σn) which converges to 0, the sequence of probabilities with density defined by

    hσn(x) = µ ∗ gσn(x) = ∫_{Rd} gσn(x − y) µ(dy) = (2π)^{−d} ∫_{Rd} µ̂(t) e^{−i⟨t,x⟩} exp( −σn²‖t‖²/2 ) dt    (5.6)
converges weakly to µ. This means that for every random variable Y of distribution µ in-
dependent of the sequence (Xn), the sequence of random variables (σn Xn + Y) converges in
distribution to Y.




(iii) Finally, for every function f integrable with respect to the Lebesgue measure λd, if

    f̂(t) = ∫_{Rd} e^{i⟨t,x⟩} f(x) dx

denotes the Fourier transform of f, when the sequence (σn) tends to 0, the sequence of
functions

    (f ∗ gσn)(x) = (2π)^{−d} ∫_{Rd} f̂(t) e^{−i⟨t,x⟩} exp( −σn²‖t‖²/2 ) dt

converges to f in L1(λd). Moreover, if the functions f and f̂ are both integrable for the
Lebesgue measure, we have the Fourier inversion formula

    f(x) = (2π)^{−d} ∫_{Rd} e^{−i⟨t,x⟩} f̂(t) dt ,   for almost all x ∈ Rd.    (5.7)
 
Proof. (i) For all λ > 0, P(‖σn Xn‖ ≥ λ) = P(‖Y‖ ≥ λ/σn) if Y follows a N(0, Idd) distribution.
Because ‖Y‖ < +∞ a.s., we deduce that P(‖σn Xn‖ ≥ λ) → 0 when σn → 0.

(ii) A change of variables and the Fubini-Tonelli theorem 1.53 show that since σn Xn and Y
are independent random variables, respectively Gaussian N(0, σn² Idd) and of distribution µ,
for every positive Borel function φ : Rd → [0, +∞[ we have

    E[φ(σn Xn + Y)] = ∫_{Rd} ∫_{Rd} φ(x + y) gσn(x) dx µ(dy) = ∫_{Rd} φ(z) ( ∫_{Rd} gσn(z − y) µ(dy) ) dz,

which shows that the density of σn Xn + Y is hσn.


On the other hand, for all σ > 0, equation (4.9) shows that ĝ_{1/σ}(x) = ĝ_{1/σ}(−x) and
gσ(x) = (σ√(2π))^{−d} ĝ_{1/σ}(x), so that we have

    hσ(x) = (σ√(2π))^{−d} ∫_{Rd} ĝ_{1/σ}(y − x) µ(dy)
          = (σ√(2π))^{−d} ∫_{Rd} ( (σ/√(2π))^d ∫_{Rd} exp( i⟨t, y − x⟩ − σ²‖t‖²/2 ) dt ) µ(dy).

Since |exp( i⟨t, y − x⟩ − σ²‖t‖²/2 )| = exp( −σ²‖t‖²/2 ), this function is integrable with respect
to the product measure dt ⊗ µ and the Fubini Lebesgue theorem 1.54 shows that

    hσ(x) = (2π)^{−d} ∫_{Rd} ( ∫_{Rd} e^{i⟨t,y⟩} µ(dy) ) e^{−i⟨t,x⟩ − σ²‖t‖²/2} dt = (2π)^{−d} ∫_{Rd} µ̂(t) e^{−i⟨t,x⟩ − σ²‖t‖²/2} dt.

This proves (5.6).


Let f : Rd → R be a bounded continuous function ; then if

    Dσ = | ∫_{Rd} hσ(x) f(x) dx − ∫_{Rd} f(y) µ(dy) |,




the Fubini Lebesgue theorem 1.54 applied to the measure µ(dy) ⊗ dx and to the function
gσ(x − y) f(x) shows that

    Dσ ≤ ∫_{Rd} | ∫_{Rd} gσ(x − y) f(x) dx − f(y) | µ(dy).

For all ε > 0 there exists λ > 0 such that for all σ > 0,

    ∫_{‖x−y‖≥σλ} gσ(x − y) dx = ∫_{‖z‖≥λ} g1(z) dz ≤ ε.

Splitting the integral over the sets {‖x − y‖ < σλ} and {‖x − y‖ ≥ σλ} and using the
continuity of f at the point y, we deduce that for all y, when σ → 0, ∫_{Rd} gσ(x − y) f(x) dx →
f(y). Furthermore, this difference is bounded by 2‖f‖∞ and the dominated convergence
theorem 1.36 yields that Dσn → 0 when σn → 0.

(iii) The convergence of (f ∗ gσn ) to f in L1 (λd ) is left as an exercise.


We show the Fourier inversion formula if f and f̂ are both integrable. When σ → 0, for every
x ∈ Rd the function t → (2π)^{−d} f̂(t) exp( −i⟨t, x⟩ − σ²‖t‖²/2 ) converges pointwise to
t → (2π)^{−d} f̂(t) e^{−i⟨t,x⟩} and remains dominated by (2π)^{−d} |f̂| ∈ L1(λd). The dominated
convergence theorem 1.36 then shows that for all x ∈ Rd, when σn → 0, the sequence
(f ∗ gσn(x), n ≥ 1) converges to (2π)^{−d} ∫_{Rd} e^{−i⟨t,x⟩} f̂(t) dt. By again applying the dominated
convergence theorem we deduce that for every bounded continuous function ϕ : Rd → R, when n → +∞,

    ∫_{Rd} (f ∗ gσn)(x) ϕ(x) dx → ∫_{Rd} ( (2π)^{−d} ∫_{Rd} e^{−i⟨t,x⟩} f̂(t) dt ) ϕ(x) dx.

We write µ for the measure with density f with respect to the Lebesgue measure. Then,
µ ∗ gσn = f ∗ gσn and from (ii), the sequence of measures with density f ∗ gσn converges
weakly to the measure with density f (with respect to the Lebesgue measure), that is to
say for every bounded continuous function ϕ : Rd → R,
    ∫_{Rd} (f ∗ gσn)(x) ϕ(x) dx → ∫_{Rd} ϕ(x) f(x) dx.
The equality ∫_{Rd} ϕ(x) f(x) dx = ∫_{Rd} ϕ(x) g(x) dx, where g(x) = (2π)^{−d} ∫_{Rd} e^{−i⟨t,x⟩} f̂(t) dt
denotes the limit function obtained above, valid for every bounded continuous function ϕ,
allows us to deduce that f(x) = g(x) for λd almost all x, which shows the Fourier inversion
formula (5.7). 2

This theorem proves part of the results announced in Theorems 4.4 and 4.8, in particular
the injectivity of the Fourier transform. Indeed, if µ and ν are two
probabilities on Rd whose Fourier transforms µ̂ and ν̂ are equal (that is to say, if X and Y
are two random vectors of distributions µ and ν respectively whose characteristic functions
ΦX and ΦY are equal), then from the formula (5.6), the measures with density µ ∗ gσn = ν ∗ gσn
are equal and, when σn → 0, they converge weakly, respectively, to µ and ν. The uniqueness
of the weak limit of this sequence of probabilities shows that µ = ν.
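To make (5.6) concrete in dimension d = 1, here is a small numerical sketch (Python with numpy, added for this English version; the choice µ = N(0, 1), for which µ̂(t) = e^{−t²/2} and µ ∗ gσ is the N(0, 1 + σ²) distribution, is only a convenient test case, and σ = 0.3 is arbitrary). The right-hand side of (5.6), evaluated by a plain Riemann sum, matches the exact density of µ ∗ gσ.

```python
import numpy as np

sigma = 0.3                                    # standard deviation of the smoothing Gaussian
t = np.linspace(-40.0, 40.0, 20001)            # integration grid in the t variable
dt = t[1] - t[0]
mu_hat = np.exp(-t**2 / 2)                     # Fourier transform of mu = N(0, 1)

for x in (-1.0, 0.0, 0.5, 2.0):
    # right-hand side of (5.6): (2 pi)^{-1} * integral of mu_hat(t) e^{-itx} e^{-sigma^2 t^2 / 2} dt
    integrand = mu_hat * np.exp(-1j * t * x) * np.exp(-sigma**2 * t**2 / 2)
    h = (integrand.sum() * dt / (2 * np.pi)).real
    # exact density of mu * g_sigma = N(0, 1 + sigma^2) at the point x
    exact = np.exp(-x**2 / (2 * (1 + sigma**2))) / np.sqrt(2 * np.pi * (1 + sigma**2))
    print(x, round(h, 6), round(exact, 6))
```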




Theorem 5.19 Let (Xn ) be a sequence of random vectors with values in Rd and X a random
vector with values in Rd . Then the sequence (Xn ) converges in distribution to X if and only
if the sequence of characteristic functions of Xn converges to the characteristic function of
X, that is to say if :
    E(e^{i⟨t,Xn⟩}) → E(e^{i⟨t,X⟩}) for all t ∈ Rd.

Proof. For all t ∈ Rd, the function x → e^{i⟨t,x⟩} is bounded and continuous, and the convergence
in distribution of (Xn ) to X thus implies the simple convergence of the sequence (ΦXn ) to
ΦX . One may show besides that the sequence (ΦXn ) converges to ΦX uniformly on every
compact set.
Conversely, the equation (5.6) shows that for all n and all σ > 0,

    PXn ∗ gσ(x) = (2π)^{−d} ∫_{Rd} ΦXn(t) e^{−i⟨t,x⟩ − σ²‖t‖²/2} dt.

We recall that for every random vector Y, |ΦY| ≤ 1. Since for all σ > 0 and x ∈ Rd the sequence
of functions t → ΦXn(t) e^{−i⟨t,x⟩ − σ²‖t‖²/2} converges to t → ΦX(t) e^{−i⟨t,x⟩ − σ²‖t‖²/2} and is
dominated by the function t → e^{−σ²‖t‖²/2} ∈ L1(λd), the dominated convergence theorem shows that
PXn ∗ gσ(x) converges to PX ∗ gσ(x) for all x.
We may rewrite this convergence in the following form : let V be the vector space generated by

    E = {y → gσ(x − y) : x ∈ Rd, σ > 0} ;

it consists of continuous functions that tend to 0 at infinity.

Then for every function φ ∈ V, E[φ(Xn )] → E[φ(X)]. The Stone-Weierstrass Theorem


shows that the vector space V is dense in the space of continuous functions which tend to 0
at infinity for the norm ‖ · ‖∞ of uniform convergence. We deduce that for every continuous
function f with compact support (and which thus tends to 0 at infinity) and all ε > 0, there
exists a function h ∈ V such that ‖f − h‖∞ < ε. Then

    |E[f(Xn)] − E[f(X)]| ≤ | ∫ (f − h) dPXn | + | ∫ (f − h) dPX | + |E[h(Xn)] − E[h(X)]|
                         ≤ 2ε + |E[h(Xn)] − E[h(X)]| ≤ 3ε

if n is large enough. Proposition 5.9 then allows us to conclude that the sequence (Xn )
converges in distribution to X. 2

The following theorem is a refinement of one of the implications of the preceding theorem,
since it does not require that the limit of the characteristic functions of Xn be known in
advance to be a characteristic function. It is admitted without proof.

Theorem 5.20 (Lévy’s theorem) Let (Xn) be a sequence of real random variables whose
characteristic functions (ΦXn) converge simply to a function Φ which is continuous at 0.
Then Φ is the characteristic function of a real random variable X and the sequence (Xn)
converges to X in distribution.




Example 5.21 Let (Xn, n ≥ 1) be a sequence of independent real random variables such
that P(Xn = 2^{−n}) = P(Xn = −2^{−n}) = 1/2 for every integer n ≥ 1. The sequence
Sn = Σ_{k=1}^n Xk converges in distribution. Indeed, for all t ∈ R the characteristic function of Sn is

    ΦSn(t) = Π_{k=1}^n ( e^{it2^{−k}} + e^{−it2^{−k}} ) / 2 = Π_{k=1}^n cos(t 2^{−k}).

The trigonometric formula sin(2a) = 2 sin(a) cos(a) shows that for all n ≥ 1,

    2^n sin(t 2^{−n}) ΦSn(t) = sin(t).

When n → ∞, 2^n sin(t 2^{−n}) converges to t. Then for all t ≠ 0, the sequence ΦSn(t) converges
to sin(t)/t, whereas ΦSn(0) = 1 converges to 1. The limit function Φ defined by Φ(t) = sin(t)/t if
t ≠ 0 and Φ(0) = 1 is continuous at 0 and the Lévy theorem shows that it is the characteristic
function of a random variable X such that (Sn) converges in distribution to X.
We may then identify the distribution of X as being uniform on the interval ]−1, 1[.
Indeed, if f(x) = (1/2) 1_{]−1,+1[}(x), for t ≠ 0,

    f̂(t) = (1/2) ∫_{−1}^{1} e^{itx} dx = (1/(2it)) ( e^{it} − e^{−it} ) = sin(t)/t,

while f̂(0) = 1. We deduce that Φ(t) = f̂(t), that is to say that X follows a uniform
distribution on ]−1, +1[.
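A quick simulation (a Python/numpy sketch added for illustration; the truncation at n = 30 terms is an arbitrary choice) confirms the identification: the empirical distribution function of Sn is already indistinguishable from that of the uniform distribution on ]−1, 1[.

```python
import numpy as np

rng = np.random.default_rng(2)
n, N = 30, 100_000                              # n terms in S_n, N simulated copies of S_n

signs = rng.choice([-1.0, 1.0], size=(N, n))    # independent signs +-1 with probability 1/2
weights = 2.0 ** -np.arange(1, n + 1)           # 2^{-k} for k = 1, ..., n
S_n = signs @ weights                           # S_n = sum over k of (+-1) * 2^{-k}

for t in (-0.5, 0.0, 0.25, 0.9):
    print(t, round((S_n <= t).mean(), 4), (t + 1) / 2)   # empirical vs uniform distribution function
```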

5.4 Central Limit Theorem

This theorem shows the central role that Gaussian random variables and Gaussian ran-
dom vectors play in probability. Indeed, properly renormalized, a sum of square integrable
independent identically distributed real random variables converges in distribution to a
« universal » Gaussian distribution N(0, 1). Moreover, this theorem indicates that the speed
of convergence of the average X̄n to E(X) in the strong law of large numbers is of order 1/√n.
We will prove this theorem in the classical setting of a square integrable i.i.d. sequence,
but it has a great number of extensions.
Conventions of notation In this section, all of the random variables must be defined on
the same probability space (Ω, F , P ).
For every sequence of random vectors (Xn, n ≥ 1), we write

    Sn = Σ_{k=1}^n Xk ,   X̄n = Sn / n.

Theorem 5.22 (Central Limit Theorem)


Let (Xn, n ≥ 1) be a sequence of square integrable, independent random vectors with the
same distribution, with values in Rd, with expectation vector E(X) and covariance matrix
Γ. Then the sequence

    ( √n [X̄n − E(X)] , n ≥ 1 ) converges in distribution to a Gaussian vector N(0, Γ).    (5.8)




Proof. We immediately reduce to the case where E(X) = 0. Indeed, the sequence Yn =
Xn − E(X) is also i.i.d., square integrable, of covariance matrix Γ, and Ȳn = X̄n − E(X)
with E(Yn) = E(Ȳn) = 0.
We thus suppose that E(X) = 0. We write Φ for the characteristic function of X1 ;
Theorem 4.9 shows that for a vector t ∈ Rd whose norm ‖t‖ is small enough,
Φ(t) = exp( −(1/2) t̃ Γ t + o(‖t‖²) ), where t̃ denotes the transpose of t. For all t ∈ Rd,
we deduce that for n large enough,

    Φ_{√n X̄n}(t) = ( Φ(t/√n) )^n = exp( −(n/2) (t̃/√n) Γ (t/√n) + n o(‖t/√n‖²) )
                 = exp( −(1/2) t̃ Γ t + n o(‖t‖²/n) ).

Therefore, since n o(‖t‖²/n) → 0 for fixed t, the sequence of characteristic functions of √n X̄n
converges simply to the characteristic function of a Gaussian vector N(0, Γ) by the equation (4.8).
Theorem 5.19 allows us to conclude. 2

For real random variables, we may renormalize by the square root of the variance of
the distribution and obtain a limit distribution N(0, 1) which does not depend on the initial
common distribution.

Corollary 5.23 Let (Xn , n ≥ 1) be a sequence of square integrable, independent, identically


distributed, non constant real random variables. Then the sequence of random variables
    (Sn − E(Sn)) / √Var(Sn) = √n (X̄n − E(X1)) / √Var(X1)

converges in distribution to a random variable N(0, 1).
If we write F(t) = (1/√(2π)) ∫_{−∞}^{t} e^{−x²/2} dx for the distribution function of a random variable
N(0, 1), then

    P( (Sn − n E(X1)) / √(n Var(X1)) ≤ t ) → F(t) ,   ∀t ∈ R.
Proof. The definition of convergence in distribution shows that if the sequence of random
variables (Xn) converges in distribution to X, then for every constant a ∈ R the sequence (aXn)
converges in distribution to aX. Thus, Theorem 5.22 applied with d = 1 (for a sequence (Xn) of
non constant random variables) and the preceding remark applied to a = 1/√Var(X1) show that
the sequence √n (X̄n − E(X1)) / √Var(X1) converges in distribution to a Gaussian N(0, 1). An
easy calculation shows that this sequence may be written by « normalizing » the sequence
Sn, that is to say in the form (Sn − E(Sn)) / √Var(Sn), and Theorem 5.16 concludes the proof
thanks to the continuity of F. 2
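The following sketch (Python with numpy; the exponential distribution, the values n = 500, N = 20 000 and the test points are arbitrary choices, not taken from the text) illustrates the corollary: the empirical distribution function of the normalized sums (Sn − E(Sn))/√Var(Sn) is close to F, computed here with math.erf.

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(3)
n, N = 500, 20_000                           # n terms per sum, N independent copies of S_n

X = rng.exponential(scale=1.0, size=(N, n))  # i.i.d. Exp(1): E(X_1) = 1, Var(X_1) = 1
S_n = X.sum(axis=1)
Z = (S_n - n) / sqrt(n)                      # (S_n - E(S_n)) / sqrt(Var(S_n))

def F(t):                                    # distribution function of N(0, 1)
    return 0.5 * (1.0 + erf(t / sqrt(2.0)))

for t in (-1.5, 0.0, 1.0, 2.0):
    print(t, round((Z <= t).mean(), 4), round(F(t), 4))
```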

Example 5.24 1) Let (Xn) be an i.i.d. sequence of Bernoulli random variables with para-
meter p ∈ ]0, 1[. Then the sequence (Sn − np) / √(np(1 − p)) converges in distribution to a
Gaussian random variable N(0, 1). Since for all n ≥ 1 the random variable Sn follows a binomial
distribution B(n, p), this immediately yields the Gaussian approximation of a binomial
distribution for n large enough.
2) We assume that a computer rounds every number to the nearest 10^{−9}, that is to say that it
keeps 9 digits after the decimal point. It computes the sum of 10^6 elementary operations, and
every rounding error follows a uniform distribution on the interval [−(1/2)10^{−9}, (1/2)10^{−9}].
The rounding errors are independent and the error in the final result is the sum of the
errors made in each operation. We want to find the probability that the absolute value of
the final error is less than (1/2)10^{−6}. We introduce a sequence (Xn, 1 ≤ n ≤ 10^6) of independent
random variables with the same uniform distribution on [−(1/2)10^{−9}, (1/2)10^{−9}]. Then E(X1) = 0
and Var(X1) = (1/12) 10^{−18}. We deduce that if S = Σ_{k=1}^{10^6} Xk, the distribution of
S / (10^3 √((1/12) 10^{−18})) is close to the Gaussian distribution N(0, 1). If Y follows a Gaussian
distribution N(0, 1),

    P( |S| ≤ 10^{−6}/2 ) = P( 2√3 |S| / 10^{−6} ≤ 2√3 × 10^{−6} / (2 × 10^{−6}) ) ≈ P( |Y| ≤ √3 ) = 2F(√3) − 1 ≈ 0.91674.
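One may verify this value by simulation (a Python/numpy sketch added here; the number of repetitions is kept moderate since each repetition draws 10^6 uniform variables): the empirical frequency of {|S| ≤ (1/2)10^{−6}} comes out close to 2F(√3) − 1 ≈ 0.9167.

```python
import numpy as np

rng = np.random.default_rng(4)
n_ops, N = 10**6, 1000                         # 10^6 operations per computation, N repetitions
half = 0.5e-9                                  # each rounding error is uniform on [-half, half]

count = 0
for _ in range(N):
    S = rng.uniform(-half, half, n_ops).sum()  # total error of one full computation
    count += abs(S) <= 0.5e-6
print(count / N)                               # should be close to 2F(sqrt(3)) - 1 ~ 0.9167
```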

The vector central limit theorem implies the following result which is the basis of the χ2
test.
Let (Yn ) be a sequence of independent identically distributed random variables taking
their values in a finite state space I consisting of k elements, written 1, · · · , k. The dis-
tribution of Yi is thus the vector p = (pi , 1 ≤ i ≤ k) ∈ Rk where P (Y1 = i) = pi .
We write (ei, 1 ≤ i ≤ k) for the canonical basis vectors of Rk and for all n, we write
Xn = (Xn^1, · · · , Xn^k) : Ω → Rk for the random variable defined by
Xn(ω) = ei if and only if Yn(ω) = i.
The ith component Sn^i(ω) of the vector Sn(ω) = Σ_{j=1}^n Xj(ω) is then the number of draws
Yj(ω), 1 ≤ j ≤ n, which take the value i, and the vector X̄n gives the observed frequencies of
the different values i ∈ I.
The random variables Xn are clearly independent and identically distributed. Furthermore,
E(X1) = p and for all i, j ∈ {1, · · · , k},

    Cov(X1^i, X1^j) = pi δ_{i,j} − pi pj ,   where δ_{i,i} = 1 and δ_{i,j} = 0 if i ≠ j.

If we write Γ for the covariance matrix of X1, the central limit theorem 5.22 shows that the
sequence √n (Sn/n − p) converges in distribution to a Gaussian vector N(0, Γ). In order to
obtain a limit which does not depend on the pi, it is necessary to weight the various components
of X1 differently.
Theorem 5.25 Under these hypotheses and with the preceding notation, the sequence

    Tn = Σ_{i=1}^k (n/pi) ( Sn(i)/n − pi )² = −n + Σ_{i=1}^k Sn(i)² / (n pi)   converges in distribution to a χ²_{k−1}.    (5.9)

Proof. The equality between both expressions of Tn is easily verified. Let f : Rk → R be
the function defined by

    f(x) = Σ_{i=1}^k x_i² / pi .

Then f is continuous and Tn = f( √n (Sn/n − p) ). Moreover, for every bounded continuous
function g : R → R, the function g ◦ f : Rk → R is bounded and continuous, which implies
that when n → +∞,

    E[ (g ◦ f)( √n (Sn/n − p) ) ] → E[ (g ◦ f)(N(0, Γ)) ].




We deduce that the sequence (Tn) converges in distribution to the random variable T =
f(N(0, Γ)). It then remains to verify that the image by f of a Gaussian vector N(0, Γ) is a
χ²_{k−1}.
The vector v1 = (√p1, · · · , √pk) ∈ Rk has norm 1 and may be completed to an ortho-
normal basis (v1, · · · , vk) of Rk. Let A : Rk → Rk be an orthogonal transformation such
that A(v1) = e1. We write N for a Gaussian vector N(0, Γ) and N/√p for the vector whose
components are Ni/√pi for 1 ≤ i ≤ k, and Z = A(N/√p). Then Z is a centered Gaussian vector of
covariance matrix

    ΓZ = A Γ_{N/√p} Ã = A ( δ_{i,j} − √pi √pj , 1 ≤ i, j ≤ k ) Ã = Idk − e1 ẽ1 ,

where Ã denotes the transpose of A.

The covariance matrix of Z is diagonal and the components of Z are thus independent by Theo-
rem 4.14. For i = 2, · · · , k, we have Var(Zi) = 1 whereas Var(Z1) = 0, so that Z1 = 0 a.s.
By construction, T = f(N) = ‖N/√p‖² = ‖Z‖², because by Theorem 4.15 the orthogonal
transformation A preserves the Euclidean norm. Furthermore, T = Σ_{i=1}^k Zi² = Σ_{i=2}^k Zi² is the
sum of the squares of k − 1 independent Gaussian N(0, 1) random variables, which completes the proof.
2

We deduce that :
• if the sequence (Yn) follows the distribution p = (p1, · · · , pk), the sequence (Tn)
converges in distribution to a χ²_{k−1} ;
• if on the contrary the distribution of the sequence (Yn) is p̄ = (p̄1, · · · , p̄k) ≠ p, there
exists at least one index i ∈ {1, · · · , k} such that p̄i ≠ pi. Hence, the sequence Sn(i)/n → p̄i a.s.
according to the strong law of large numbers. In this case Tn ≥ (n/pi) ( Sn(i)/n − pi )² converges
almost surely to +∞.
This gives the rejection region for the χ² test (for the adequacy of the distribution p for
the Yn) : {Tn ≥ a}, where the value of a is given by the level of the test and the table of the
distribution function of a χ²_{k−1} random variable.
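As a sketch of how the test is used in practice (Python with numpy, added for illustration; the distribution p, the sample size and the 5% level are arbitrary choices, and the χ²_{k−1} quantile is itself estimated by simulation so as to rely only on numpy, whereas in practice one would read it from a table as the text says):

```python
import numpy as np

rng = np.random.default_rng(5)
p = np.array([0.2, 0.3, 0.5])                    # hypothesised distribution on k = 3 values
k, n = len(p), 5000

Y = rng.choice(k, size=n, p=p)                   # draws Y_1, ..., Y_n with distribution p
S = np.bincount(Y, minlength=k)                  # S_n(i) = number of j <= n with Y_j = i

T_n = n * np.sum((S / n - p) ** 2 / p)           # the statistic (5.9)

# 95% quantile of chi^2_{k-1}, estimated as a quantile of sums of k-1 squared N(0, 1) variables
chi2_sample = (rng.standard_normal((100_000, k - 1)) ** 2).sum(axis=1)
a = np.quantile(chi2_sample, 0.95)

print(T_n, a, "reject p" if T_n >= a else "do not reject p")
```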

Acknowledgments : I wish to thank Wayne Tarrant for his help with the translation of
this manuscript from French to English.

Bibliography
(1) Billingsley, P., Probability and Measure, Wiley series in probability, 1995.
(2) Briane, M. , Pagès, G., Théorie de l’intégration, Vuibert, 1998.
(3) Gradinaru, M., Roynette, B., Lectures and exercises, Probability, M1 University Nancy 1.
(4) Jacod, J., Protter, P., Probability Essentials, Springer, 2004
(5) Neveu, J., Probability Lectures, École Polytechnique.
