As we argued in the previous chapter, Pattern Recognition is founded on Probability Theory; here
we review the main results that will be needed in the book. This chapter is not intended as a
replacement for a course in Probability, but it will serve as a reference to the reader. For the sake of
precision, the language of measure theory is used in this chapter, but measure-theoretical concepts
will not be required in the remainder of the book.
A sample space S is the set of all outcomes of an experiment. An event E is a subset E ⊆ S. Event E is said to occur if it contains the outcome of the experiment.
Example 1.1. If the experiment consists of flipping two coins, then the sample space is S = {(H, H), (H, T), (T, H), (T, T)}. The event E that the first coin lands tails is E = {(T, H), (T, T)}. □
Example 1.2. If the experiment consists in measuring the lifetime of a lightbulb, then
S = {t ∈ ℝ | t ≥ 0} .   (1.2)
The event that the lightbulb will fail at or earlier than 2 time units is the real interval E = [0, 2]. □
CHAPTER 1. REVIEW OF PROBABILITY THEORY
If E ⊆ F then the occurrence of E implies the occurrence of F. The union E ∪ F occurs iff (if and only if) E, F, or both E and F occur. On the other hand, the intersection E ∩ F occurs iff both E and F occur. If E ∩ F = ∅, then E or F may occur but not both. Finally, the complement event E^c occurs iff E does not occur.
A probability space is a triple (S, F, P) consisting of a sample space S, a σ-algebra F containing all the events of interest, and a probability measure P, i.e., a real-valued function defined on each event E ∈ F that satisfies Kolmogorov's axioms:
A1. 0 ≤ P(E) ≤ 1 ,
A2. P(S) = 1 ,
A3. P(⋃_{i=1}^∞ E_i) = ∑_{i=1}^∞ P(E_i), for any sequence of pairwise disjoint events E_1, E_2, …
Among the properties that follow from the axioms are:
P1. P(E^c) = 1 − P(E) ,
P3. P(E ∪ F) = P(E) + P(F) − P(E ∩ F) .
Two important limiting events can be defined for any sequence of events E_1, E_2, …:
lim sup_{n→∞} E_n = ⋂_{n=1}^∞ ⋃_{i=n}^∞ E_i   and   lim inf_{n→∞} E_n = ⋃_{n=1}^∞ ⋂_{i=n}^∞ E_i .
We can see that lim sup_{n→∞} E_n occurs iff E_n occurs for an infinite number of n, that is, E_n occurs infinitely often; this event is also denoted by [E_n i.o.]. Similarly, lim inf_{n→∞} E_n occurs iff E_n occurs for all but a finite number of n, that is, E_n eventually occurs for all n.
The Borel-Cantelli Lemmas, which specify the probabilities of lim sup and lim inf events, are an important tool for our purposes.
First Borel-Cantelli Lemma: if ∑_{n=1}^∞ P(E_n) < ∞, then P([E_n i.o.]) = 0.
Proof.
P( ⋂_{n=1}^∞ ⋃_{i=n}^∞ E_i ) = P( lim_{n→∞} ⋃_{i=n}^∞ E_i )
  = lim_{n→∞} P( ⋃_{i=n}^∞ E_i )
  ≤ lim_{n→∞} ∑_{i=n}^∞ P(E_i)   (1.12)
  = 0 ,
since the tail sums of a convergent series vanish. □
The converse to the First Lemma holds if the events are independent: if the events E_1, E_2, … are independent and ∑_{n=1}^∞ P(E_n) = ∞, then P([E_n i.o.]) = 1. Therefore,
P([E_n i.o.]) = P( lim_{n→∞} ⋃_{i=n}^∞ E_i ) = lim_{n→∞} P( ⋃_{i=n}^∞ E_i ) = 1 − lim_{n→∞} P( ⋂_{i=n}^∞ E_i^c ) ,   (1.15)
where the last equality follows from DeMorgan's Law. Now, note that, by independence,
P( ⋂_{i=n}^∞ E_i^c ) = ∏_{i=n}^∞ P(E_i^c) = ∏_{i=n}^∞ (1 − P(E_i)) .   (1.16)
Using the inequality 1 − x ≤ exp(−x), it follows that
P( ⋂_{i=n}^∞ E_i^c ) ≤ ∏_{i=n}^∞ exp(−P(E_i)) = exp( −∑_{i=n}^∞ P(E_i) ) = 0 ,   (1.17)
since, by assumption, ∑_{i=n}^∞ P(E_i) = ∞, for all n. From (1.15) and (1.17) it follows that P([E_n i.o.]) = 1, as required. □
1.1. BASIC CONCEPTS
Conditional probability is one of the most important concepts in Statistical Signal Processing,
Pattern Recognition, and in Probability Theory in general.
Given that an event F has occurred, for E to occur, E ∩ F has to occur. In addition, the sample space gets restricted to those outcomes in F, so a normalization factor P(F) has to be introduced. Therefore, provided that P(F) > 0,
P(E | F) = P(E ∩ F) / P(F) .   (1.18)
For simplicity, it is usual to write P(E ∩ F) = P(E, F) to indicate the joint probability of E and F.
From (1.18), one then obtains
P (E, F ) = P (E | F )P (F ) , (1.19)
which is known as the multiplication rule. One can also condition on multiple events:
P(E | F_1, F_2, …, F_n) = P(E ∩ F_1 ∩ F_2 ∩ … ∩ F_n) / P(F_1 ∩ F_2 ∩ … ∩ F_n) .   (1.20)
This allows one to generalize the multiplication rule thus:
P(E ∩ F_1 ∩ … ∩ F_n) = P(E | F_1, …, F_n) P(F_n | F_1, …, F_{n−1}) ⋯ P(F_2 | F_1) P(F_1) .
The Law of Total Probability is a consequence of the axioms of probability and the multiplication rule:
P(E) = P(E, F) + P(E, F^c) = P(E | F)P(F) + P(E | F^c)P(F^c) .
This property allows one to compute a hard unconditional probability in terms of easier conditional ones. It can be extended to multiple conditioning events via
P(E) = ∑_{i=1}^n P(E, F_i) = ∑_{i=1}^n P(E | F_i)P(F_i) ,   (1.23)
for pairwise disjoint F_i such that ⋃_i F_i ⊇ E.
From (1.18) and (1.19), one obtains Bayes' Theorem:
P(E | F) = P(F | E)P(E) / P(F) , provided that P(F) > 0.
Bayes Theorem can be interpreted as a way to (1) "invert" the probability P(F | E) to obtain the probability P(E | F); or (2) "update" the "prior" probability P(E) to obtain the "posterior" probability P(E | F). The former interpretation is the foundation of estimation and detection in Statistical Signal Processing, while the latter is the foundation of Bayesian Statistics. Bayes Theorem plays a fundamental role in Pattern Recognition as well.
Events E and F are independent if the occurrence of one does not carry information as to the occurrence of the other. That is,
P(E, F) = P(E)P(F) ,
which, if P(F) > 0, is equivalent to P(E | F) = P(E).
If E and F are independent, so are the pairs (E, F^c), (E^c, F), and (E^c, F^c). However, E being independent of F and G does not imply that E is independent of F ∩ G. Furthermore, three events E, F, G are independent if P(E, F, G) = P(E)P(F)P(G) and each pair of events is independent.
This can be extended to independence of any number of events, by requiring that the joint probability
factor and that all subsets of events be independent.
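The classical two-coin counterexample makes the distinction concrete; this is an illustrative sketch (the event names E, F, G below are chosen for this example, not taken from the text):

```python
from fractions import Fraction
from itertools import product

# Sample space: two fair coin flips, each of the 4 outcomes has probability 1/4
S = list(product("HT", repeat=2))

def P(event):
    return Fraction(sum(1 for w in S if event(w)), len(S))

def both(A, B):
    return lambda w: A(w) and B(w)

E = lambda w: w[0] == "H"   # first coin heads
F = lambda w: w[1] == "H"   # second coin heads
G = lambda w: w[0] == w[1]  # the two coins agree

# Each pair of events is independent:
assert P(both(E, F)) == P(E) * P(F)
assert P(both(E, G)) == P(E) * P(G)
assert P(both(F, G)) == P(F) * P(G)

# ...but E is NOT independent of F ∩ G, so the triple is not independent:
FG = both(F, G)
print(P(both(E, FG)), P(E) * P(FG))  # 1/4 versus 1/8
```

Here E is independent of F and of G separately, yet observing F ∩ G (both coins heads) determines E completely.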
Finally, we remark that P (·|F ) is a probability measure, so that it satisfies all theorems mentioned
previously. In particular, it is possible to define the notion of conditional independence of events.
Random variables are the basic units of Pattern Recognition, as discussed in Chapter 1. A random variable can be thought of roughly as a "random number." Formally, a random variable X defined on a probability space (S, F, P) is a measurable function X : S → ℝ with respect to F and the usual Borel algebra of ℝ (see Section 1.1.2 for the required definitions). Thus, a random variable X assigns to each outcome ω ∈ S a real number X(ω); see Figure 1.1 for an illustration.
For a Borel set A ⊆ ℝ, the event that X takes a value in A is
{X ∈ A} = X^{−1}(A) ⊆ S .   (1.27)
It can be shown that all probability questions about a random variable X can be phrased in terms
of the probabilities of a simple set of events:
These events can be written more simply as {X ≤ x}, for x ∈ ℝ. The cumulative distribution function (CDF) of a random variable X is the function F_X : ℝ → [0, 1] which gives the probability of these events:
F_X(x) = P({X ≤ x}) , x ∈ ℝ .   (1.29)
For simplicity, henceforth we will often remove the braces around statements involving random variables; e.g., we will write F_X(x) = P(X ≤ x), for x ∈ ℝ.
Properties of a CDF:
1. F_X is nondecreasing: x_1 ≤ x_2 implies F_X(x_1) ≤ F_X(x_2);
2. lim_{x→−∞} F_X(x) = 0 and lim_{x→∞} F_X(x) = 1;
3. F_X is right-continuous: lim_{x→a+} F_X(x) = F_X(a), for all a ∈ ℝ.
It can be shown that a random variable X is uniquely specified by its CDF F_X and, conversely, given a function F_X satisfying items 1-3 above, there is a unique random variable X associated with it.
The notion of a probability density function (PDF) is fundamental in Probability Theory (and Pattern Recognition). However, it is a secondary notion to that of a CDF. In fact, every random variable X must have a CDF F_X, but not all random variables have a PDF. They do if the CDF F_X is continuous and differentiable everywhere but for a countable number of points. In this case, X is said to be a continuous random variable, with PDF given by:
p_X(x) = dF_X(x)/dx ,   (1.30)
at all points x ∈ ℝ where the derivative is defined. See Figure 1.2 for an illustration of a uniform continuous random variable. Note that F_X is continuous, and differentiable everywhere except at the points a and b.
Figure 1.2: The CDF and PDF of a uniform continuous random variable.
In this chapter, for precision, we always use the subscript X to denote quantities associated with
a random variable X, e.g., we write FX (x) and pX (x). Elsewhere in the book, we often omit the
subscript, e.g. we write F (x) and p(x), when there is no possibility of confusion.
Probability statements about X can then be made in terms of integration of p_X. For example,
F_X(x) = ∫_{−∞}^{x} p_X(u) du , x ∈ ℝ ,
P(x_1 ≤ X ≤ x_2) = ∫_{x_1}^{x_2} p_X(x) dx , x_1, x_2 ∈ ℝ .   (1.31)
Useful continuous random variables include the already mentioned uniform r.v. over the interval [a, b], with density
p_X(x) = 1/(b − a) , a < x < b ,   (1.32)
the univariate Gaussian r.v. with parameters µ and σ > 0, such that
p_X(x) = (1/√(2πσ²)) exp( −(x − µ)²/(2σ²) ) , x ∈ ℝ ,   (1.33)
the exponential r.v. with parameter λ > 0, such that
p_X(x) = λ e^{−λx} , x ≥ 0 ,   (1.34)
the gamma r.v. with parameters λ > 0 and t > 0, such that
p_X(x) = λ e^{−λx} (λx)^{t−1} / Γ(t) , x ≥ 0 ,   (1.35)
where Γ(t) = ∫_0^∞ e^{−u} u^{t−1} du, and the beta r.v. with parameters a, b > 0, such that
p_X(x) = x^{a−1} (1 − x)^{b−1} / B(a, b) , 0 < x < 1 ,   (1.36)
where B(a, b) = Γ(a)Γ(b)/Γ(a + b). Among these, the Gaussian is the only one defined over the entire real line; the exponential and gamma are defined over the nonnegative real numbers, while the uniform and beta have bounded support. In fact, the uniform r.v. over [0, 1] is just a beta with a = b = 1, while an exponential r.v. is a gamma with t = 1.
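The two special cases just mentioned can be checked numerically by evaluating the densities pointwise; this is a small illustrative sketch using the standard-library gamma function:

```python
import math

def gamma_pdf(x, lam, t):
    # p_X(x) = lam * exp(-lam*x) * (lam*x)**(t-1) / Gamma(t), for x >= 0
    return lam * math.exp(-lam * x) * (lam * x) ** (t - 1) / math.gamma(t)

def exp_pdf(x, lam):
    # p_X(x) = lam * exp(-lam*x), for x >= 0
    return lam * math.exp(-lam * x)

def beta_pdf(x, a, b):
    # B(a, b) = Gamma(a) Gamma(b) / Gamma(a + b)
    B = math.gamma(a) * math.gamma(b) / math.gamma(a + b)
    return x ** (a - 1) * (1 - x) ** (b - 1) / B

# A gamma density with t = 1 reduces to the exponential density:
for x in (0.1, 1.0, 3.0):
    assert abs(gamma_pdf(x, 2.0, 1) - exp_pdf(x, 2.0)) < 1e-12

# A beta density with a = b = 1 reduces to the uniform density on (0, 1):
for x in (0.2, 0.5, 0.9):
    assert abs(beta_pdf(x, 1, 1) - 1.0) < 1e-12
```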
1.2. RANDOM VARIABLES
If the random variable X only takes an at most countable number of values, then it is said to be a discrete random variable. For example, let X be the numerical outcome of the roll of a six-sided die. The CDF F_X for this example can be seen in Figure 1.3. We can see that F_X for a discrete random variable X is a "staircase" function. In particular, it is not possible to define a PDF in this case. Instead, one defines the probability mass function (PMF) p_X(k) = P(X = k), for each value k taken by X.
Figure 1.3: The CDF and PMF of a uniform discrete random variable.
The PMF p_X corresponds to the "jumps" in the staircase CDF F_X. See Figure 1.3 for the PMF in the previous die-rolling example.
Useful discrete random variables include the already mentioned uniform r.v. over a finite set of numbers K, with PMF
p_X(k) = 1/|K| , k ∈ K ,   (1.38)
the Bernoulli with parameter 0 < p < 1, with PMF
p_X(0) = 1 − p ,  p_X(1) = p ,   (1.39)
the Binomial with parameters n ∈ {1, 2, …} and 0 < p < 1, such that
p_X(k) = (n choose k) p^k (1 − p)^{n−k} , k = 0, 1, …, n ,   (1.40)
the Poisson with parameter λ > 0, such that
p_X(k) = e^{−λ} λ^k / k! , k = 0, 1, …   (1.41)
and the Geometric with parameter 0 < p < 1, such that
p_X(k) = (1 − p)^{k−1} p , k = 1, 2, …   (1.42)
A binomial r.v. with parameters n and p has the distribution of a sum of n i.i.d. Bernoulli r.v.s with parameter p.
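The last statement can be verified exactly by convolving n Bernoulli PMFs and comparing with the closed-form Binomial PMF (1.40); a minimal sketch:

```python
import math

def convolve(p, q):
    # PMF of the sum of two independent nonnegative-integer-valued r.v.s,
    # given their PMFs as lists indexed by value
    out = [0.0] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            out[i + j] += pi * qj
    return out

n, p = 5, 0.3
bern = [1 - p, p]            # Bernoulli(p) PMF on {0, 1}
pmf = [1.0]                  # PMF of the empty sum (the constant 0)
for _ in range(n):
    pmf = convolve(pmf, bern)  # add one independent Bernoulli at a time

# Compare with the closed-form Binomial(n, p) PMF
for k in range(n + 1):
    binom = math.comb(n, k) * p**k * (1 - p) ** (n - k)
    assert abs(pmf[k] - binom) < 1e-12
```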
Random variables that are neither continuous nor discrete are called mixed random variables. They are often, but not necessarily, mixtures of continuous and discrete random variables, such as linear combinations of these (hence the name "mixed"). The following table summarizes the classification of random variables.
Joint CDFs, PDFs, and PMFs are crucial elements in Pattern Recognition. As in the case of the usual CDF, PDF, and PMF, these concepts involve only the probabilities of certain special events. We review below only the case of two random variables; the extension to finite sets of jointly distributed random variables (i.e., random vectors) is fairly straightforward.
Two random variables X and Y are said to be jointly distributed if they are defined on the same probability space (S, F, P) (it can be shown that this is sufficient for the mapping (X, Y) : S → ℝ² to be measurable with respect to F and the Borel algebra of ℝ²). In this case, the joint CDF of X and Y is the joint probability of the events {X ≤ x} and {Y ≤ y}, for x, y ∈ ℝ. Formally, we define a function F_XY : ℝ × ℝ → [0, 1] given by
F_XY(x, y) = P(X ≤ x, Y ≤ y) , x, y ∈ ℝ .
This is the probability of the "lower-left quadrant" with corner at (x, y). Note that F_XY(x, ∞) = F_X(x) and F_XY(∞, y) = F_Y(y). These are called the marginal CDFs.
If X and Y are jointly distributed and F_XY is continuous and has continuous derivatives up to second order, then X and Y are jointly continuous random variables, with joint density
p_XY(x, y) = ∂²F_XY(x, y)/∂x∂y , x, y ∈ ℝ ,   (1.44)
where the order of differentiation does not matter. The joint density function p_XY(x, y) integrates to 1 over ℝ². The marginal densities are given by
p_X(x) = ∫_{−∞}^{∞} p_XY(x, y) dy , x ∈ ℝ ,
p_Y(y) = ∫_{−∞}^{∞} p_XY(x, y) dx , y ∈ ℝ .   (1.45)
1.3. EXPECTATION AND VARIANCE
The random variables X and Y are independent if p_XY(x, y) = p_X(x)p_Y(y), for all x, y ∈ ℝ. It can be shown that if X and Y are independent and Z = X + Y, then
p_Z(z) = ∫_{−∞}^{∞} p_X(x) p_Y(z − x) dx , z ∈ ℝ ,   (1.46)
with a similar expression in the discrete case for the corresponding PMFs. The above integral is known as the convolution integral.
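In the discrete case, the integral becomes a sum over PMF values; a minimal sketch for the sum of two independent fair dice:

```python
from fractions import Fraction

# PMF of a fair six-sided die, as a dictionary {value: probability}
die = {k: Fraction(1, 6) for k in range(1, 7)}

# Discrete convolution: p_Z(z) = sum over x of p_X(x) p_Y(z - x)
p_Z = {}
for x, px in die.items():
    for y, py in die.items():
        p_Z[x + y] = p_Z.get(x + y, 0) + px * py

print(p_Z[7])                 # 1/6, the most likely sum
assert sum(p_Z.values()) == 1 # p_Z is a valid PMF
```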
The conditional density of X given Y = y is defined, at points where p_Y(y) > 0, as
p_{X|Y}(x | y) = p_XY(x, y) / p_Y(y) , x ∈ ℝ .   (1.47)
The concepts of joint PMF, marginal PMFs, and conditional PMF can be defined in a similar way. For conciseness, this is omitted in this brief review.
Expectation is a fundamental concept in Probability Theory and Pattern Recognition. It has several important interpretations regarding a random variable: 1) its "mean" value; 2) a summary of its distribution (sometimes referred to as a "location parameter"); 3) a prediction of its future value. The latter meaning is the most important one for Pattern Recognition. The variance of a random variable, on the other hand, gives 1) its "spread" around the mean; 2) a second summary of its distribution (the "scale parameter"); 3) the uncertainty in the prediction of its future value by the expectation.
1.3.1 Expectation
The expectation E[X] of a random variable X can be seen as an average of its values weighted by their probabilities:
E[X] = ∫_{−∞}^{∞} x p_X(x) dx .   (1.49)
If f : ℝ → ℝ is Borel-measurable and concave (i.e., f lies at or above a line segment joining any two points of its graph), then Jensen's Inequality asserts that
E[f(X)] ≤ f(E[X]) .
This can be extended directly to any finite number of jointly distributed random variables.
Analogous formulas concerning expectation for discrete random variables can be obtained by replac-
ing integration with summation and PDFs by PMFs.
From (1.52) and the linearity property of integration, one obtains the well-known linearity property
of expectation,
E[aX + bY] = aE[X] + bE[Y] ,   (1.53)
where no conditions on X and Y apart from the existence of the expectations are assumed. Once
again, this property can be easily extended to any finite number of jointly distributed random
variables.
It can be shown that E[f (X)g(Y )] = E[f (X)]E[g(Y )] for any Borel-measurable functions f, g : R !
R if, and only if, X and Y are independent. If this condition is satisfied for at least f (X) = X and
g(Y ) = Y , that is, if E[XY ] = E[X]E[Y ], then X and Y are said to be uncorrelated. Of course,
independence implies uncorrelatedness. The converse is only true in special cases; e.g. jointly
Gaussian random variables.
Expectation preserves order, in the sense that if P (X > Y ) = 1, then E[X] > E[Y ].
Hölder's Inequality states that, for 1 < r < ∞ and 1/r + 1/s = 1,
E[|XY|] ≤ E[|X|^r]^{1/r} E[|Y|^s]^{1/s} .   (1.54)
The special case r = s = 2 is the Cauchy-Schwarz Inequality:
E[|XY|] ≤ √( E[X²] E[Y²] ) .   (1.55)
The expectation of a random variable X is affected by its probability tails, given by F_X(a) = P(X ≤ a) and 1 − F_X(a) = P(X > a). If the probability tails fail to vanish sufficiently fast (X has "fat tails"), then E[X] will not be finite, and the expectation is undefined. For a nonnegative random variable X (i.e., one for which P(X ≥ 0) = 1), there is only one probability tail, the upper tail P(X > a), and there is a simple formula relating E[X] to it:
E[X] = ∫_0^∞ P(X > x) dx .   (1.56)
A small E[X] constrains the upper tail to be thin. This is guaranteed by Markov's Inequality: if X is a nonnegative random variable,
P(X ≥ a) ≤ E[X]/a , for all a > 0 .   (1.57)
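For an integer-valued nonnegative random variable, the integral in (1.56) reduces to a sum of upper-tail probabilities, and Markov's Inequality can be checked exactly; an illustrative sketch using a fair die:

```python
from fractions import Fraction

# Fair die: X takes values 1..6 with probability 1/6 each
values = range(1, 7)

def P(event):
    return Fraction(sum(1 for k in values if event(k)), 6)

EX = sum(Fraction(k, 6) for k in values)   # E[X] = 7/2

# Discrete form of (1.56): E[X] equals the sum of upper-tail probabilities
tail_sum = sum(P(lambda k, x=x: k > x) for x in range(6))
assert tail_sum == EX

# Markov's Inequality (1.57): P(X >= a) <= E[X]/a for all a > 0
for a in values:
    assert P(lambda k, a=a: k >= a) <= EX / a
```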
Finally, a particular result that is of interest to our purposes relates an exponentially-vanishing upper tail of a nonnegative random variable to a bound on its expectation.
Lemma 1.3. If X is a nonnegative random variable such that P(X > t) ≤ c e^{−at²}, for all t > 0 and given a, c > 0, we have
E[X] ≤ √( (1 + log c)/a ) .   (1.58)
Proof. Note that P(X² > t) = P(X > √t) ≤ c e^{−at}. From (1.56) we get, for any u > 0:
E[X²] = ∫_0^∞ P(X² > t) dt = ∫_0^u P(X² > t) dt + ∫_u^∞ P(X² > t) dt
  ≤ u + ∫_u^∞ c e^{−at} dt = u + (c/a) e^{−au} .   (1.59)
By direct differentiation, it is easy to verify that the upper bound on the right-hand side is minimized at u = (log c)/a. Substituting this value back into the bound leads to E[X²] ≤ (1 + log c)/a. The result then follows from the fact that E[X] ≤ √(E[X²]). □
1.3.2 Variance
The variance Var(X) of a random variable X is a nonnegative quantity related to the spread of the values of X around the mean E[X]:
Var(X) = E[(X − E[X])²] = E[X²] − E[X]² .
A small variance constrains the random variable to be close to its mean with high probability. This follows from Chebyshev's Inequality:
P(|X − E[X]| ≥ τ) ≤ Var(X)/τ² , for all τ > 0 .   (1.62)
Chebyshev's Inequality follows directly from the application of Markov's Inequality (1.57) to the random variable |X − E[X]|² with a = τ².
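Chebyshev's Inequality can also be checked exactly on a small discrete distribution; a sketch, again using a fair die:

```python
from fractions import Fraction

# Fair die: E[X] = 7/2 and Var(X) = E[X^2] - E[X]^2 = 35/12
probs = {k: Fraction(1, 6) for k in range(1, 7)}
EX = sum(k * p for k, p in probs.items())
VarX = sum(k**2 * p for k, p in probs.items()) - EX**2
assert VarX == Fraction(35, 12)

# Chebyshev: P(|X - E[X]| >= tau) <= Var(X)/tau^2, here with tau = 2
tau = 2
lhs = sum(p for k, p in probs.items() if abs(k - EX) >= tau)
assert lhs <= VarX / tau**2   # 1/3 <= 35/48
```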
Expectation has the linearity property, so that, given any pair of jointly distributed random variables X and Y, it is always true that E[X + Y] = E[X] + E[Y] (provided that all expectations exist). However, it is not always true that Var(X + Y) = Var(X) + Var(Y). In order to investigate this issue, it is necessary to introduce the covariance between X and Y:
Cov(X, Y) = E[(X − E[X])(Y − E[Y])] = E[XY] − E[X]E[Y] .
If Cov(X, Y) > 0 then X and Y are positively correlated; if Cov(X, Y) < 0, they are negatively correlated. Clearly, X and Y are uncorrelated if and only if Cov(X, Y) = 0, and Cov(X, X) = Var(X). In addition, Cov(∑_{i=1}^n X_i, ∑_{j=1}^m Y_j) = ∑_{i=1}^n ∑_{j=1}^m Cov(X_i, Y_j). It follows that
Var( ∑_{i=1}^n X_i ) = ∑_{i=1}^n Var(X_i) + 2 ∑_{i<j} Cov(X_i, X_j) .   (1.65)
Hence, the variance is additive over sums of pairwise uncorrelated random variables.
It follows directly from the Cauchy-Schwarz Inequality (1.55) that |Cov(X, Y)| ≤ √(Var(X)Var(Y)). Therefore, the covariance can be normalized to be in the interval [−1, 1] thus:
ρ(X, Y) = Cov(X, Y) / √(Var(X)Var(Y)) ,   (1.66)
with −1 ≤ ρ(X, Y) ≤ 1. This is called the correlation coefficient between X and Y. The closer |ρ|
is to 1, the tighter is the relationship between X and Y. The limiting case ρ(X, Y) = ±1 occurs if and only if Y = a ± bX with b > 0, i.e., X and Y are perfectly related to each other through a linear (affine) relationship. For this reason, ρ(X, Y) is sometimes called the linear correlation coefficient between X and Y; it does not respond to nonlinear relationships.
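The last point is easy to demonstrate on small discrete joint distributions; in this illustrative sketch, a perfect affine relation gives ρ = 1, while a perfect but nonlinear relation (Y = X² with symmetric X) gives ρ = 0:

```python
import math
from fractions import Fraction

def rho(joint):
    # joint: {(x, y): probability}; returns the correlation coefficient (1.66)
    EX = sum(x * p for (x, y), p in joint.items())
    EY = sum(y * p for (x, y), p in joint.items())
    cov = sum((x - EX) * (y - EY) * p for (x, y), p in joint.items())
    vx = sum((x - EX) ** 2 * p for (x, y), p in joint.items())
    vy = sum((y - EY) ** 2 * p for (x, y), p in joint.items())
    return cov / math.sqrt(vx * vy)

# Perfect affine relation Y = 2X + 1: rho = 1
lin = {(x, 2 * x + 1): Fraction(1, 3) for x in (-1, 0, 1)}
assert abs(rho(lin) - 1.0) < 1e-12

# Perfect *nonlinear* relation Y = X^2 with symmetric X: rho = 0
quad = {(x, x * x): Fraction(1, 3) for x in (-1, 0, 1)}
assert abs(rho(quad)) < 1e-12
```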
Conditional expectation allows the prediction of the value of a random variable given the observed
value of the other, i.e., prediction given data, while conditional variance yields the uncertainty of that
prediction. Conditional expectation and conditional variance are thus key probabilistic concepts in
Pattern Recognition.
If X and Y are jointly continuous random variables and the conditional density pX|Y (x | y) is well
defined for Y = y, then the conditional expectation of X given Y = y is:
E[X | Y = y] = ∫_{−∞}^{∞} x p_{X|Y}(x | y) dx ,   (1.67)
with a similar definition for discrete random variables using conditional PMFs.
Most of the properties of expectation and variance apply without modification to conditional expectations and conditional variances, respectively. For example, E[∑_{i=1}^n X_i | Y = y] = ∑_{i=1}^n E[X_i | Y = y] and Var(aX + c | Y = y) = a² Var(X | Y = y).
Now, both E[X | Y = y] and Var(X | Y = y) are deterministic quantities for each value of Y = y
(just as the ordinary expectation and variance are). But if the specific value Y = y is not specified
and allowed to vary, then we can look at E[X | Y ] and Var(X | Y ) as functions of the random
variable Y , and therefore, random variables themselves. The reasons why these are valid random
variables are nontrivial and beyond the scope of this review (see [Billingsley, 1995]).
One can show that the expectation of the random variable E[X | Y] is precisely E[X]:
E[ E[X | Y] ] = E[X] .
On the other hand, it is not the case that Var(X) = E[Var(X | Y)]. The correct decomposition is slightly more complicated:
Var(X) = E[Var(X | Y)] + Var(E[X | Y]) .   (1.71)
Now, suppose one is interested in predicting the value of a random variable Y using a predictor Ŷ. One would like Ŷ to be optimal according to some criterion. The criterion most widely used is the Mean Square Error:
MSE(Ŷ) = E[(Y − Ŷ)²] .
It can be shown easily that the minimum-MSE (MMSE) estimator is simply the mean: Ŷ* = E[Y]. This is a constant estimator, since no data are available. Clearly, the MSE of Ŷ* is simply the variance of Y. Therefore, the best one can do in the absence of any extra information is to predict the mean E[Y], with an uncertainty equal to the variance Var(Y).
If Var(Y) is very small, i.e., if there were very small uncertainty in Y to begin with, then E[Y] could actually be an acceptable estimator. In practice, this is rarely the case. Therefore, observations of an auxiliary random variable X (i.e., data) are sought to improve prediction. Naturally, it is known (or hoped) that X and Y are not independent, otherwise no improvement over the constant estimator is possible. One defines the conditional MSE of a data-dependent estimator Ŷ = h(X) to be
MSE(h(X) | X) = E[(Y − h(X))² | X] .
By taking expectation over X, one obtains the unconditional MSE: E[(Y − h(X))²]. The conditional MSE is often the most important one in practice, since it concerns the particular data at hand, while the unconditional MSE is data-independent and used to compare the performance of different predictors. Regardless, it can be shown that the MMSE estimator in both cases is the conditional mean h*(X) = E[Y | X]. This is one of the most important results in Signal Processing and Pattern Recognition. The function η(x) = E[Y | X = x] is the optimal regression of Y on X. This is not in general the optimal estimator if Y is discrete; e.g., in the case of classification. This is because η(X) may not be in the range of values taken by Y, so it does not define a valid estimator. We will see in Chapter 3 how to modify η(·) to obtain the optimal estimator (optimal classifier) in this case.
1.4. VECTOR RANDOM VARIABLES
Vector random variables, or random vectors, are defined analogously to ordinary random variables (see Section 1.2). A random vector X defined on a probability space (S, F, P) is a measurable function X : S → ℝ^d with respect to F and the Borel algebra of ℝ^d (the required definitions are given in Section 1.1.2). Alternatively, if X_1, …, X_d are jointly distributed random variables on (S, F, P), then it can be shown that X = (X_1, …, X_d) is a proper random vector defined on the same probability space.
The distribution of a random vector X is the joint distribution of the component random variables.
The expected value of X is the vector of expected values:
E[X] = [ E[X_1], …, E[X_d] ]ᵀ .   (1.74)
The second moments of a random vector are contained in the d × d covariance matrix
Σ = E[(X − E[X])(X − E[X])ᵀ] ,
where Σ_ii = Var(X_i) and Σ_ij = Cov(X_i, X_j), for i, j = 1, …, d. The covariance matrix is real symmetric and thus diagonalizable:
Σ = U D Uᵀ ,   (1.76)
where U is the orthogonal matrix of eigenvectors and D is the diagonal matrix of eigenvalues. All eigenvalues are nonnegative (Σ is positive semi-definite). In fact, except for "degenerate" cases, all eigenvalues are positive, and so Σ is invertible (Σ is said to be positive definite in this case). In the invertible case, the transformed random vector
Y = Σ^{−1/2}(X − E[X])
has zero mean and covariance matrix I_d (so that all components of Y are zero-mean, unit-variance, and uncorrelated). This is called the Whitening or Mahalanobis transformation.
Given an i.i.d. sample X_1, …, X_n, the sample mean estimator of µ = E[X] is µ̂ = (1/n) ∑_{i=1}^n X_i. It can be shown [Casella and Berger, 2002] that this estimator is unbiased (that is, E[µ̂] = µ) and consistent (that is, µ̂ converges in probability to µ as n → ∞; see Section 1.5 and Theorem 1.2). On the other hand, the sample covariance estimator is given by:
Σ̂ = (1/(n − 1)) ∑_{i=1}^n (X_i − µ̂)(X_i − µ̂)ᵀ .   (1.79)
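A univariate sketch of these estimators, using made-up data, shows the 1/(n−1) normalization of (1.79) agreeing with the standard library's (bias-corrected) sample variance:

```python
import statistics

# A small made-up sample; the same formulas extend componentwise to vectors
x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(x)

mu_hat = sum(x) / n                                       # sample mean
var_hat = sum((xi - mu_hat) ** 2 for xi in x) / (n - 1)   # 1/(n-1) normalization

assert mu_hat == statistics.mean(x)
assert abs(var_hat - statistics.variance(x)) < 1e-12      # statistics uses n-1
```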
The multivariate Gaussian is the most important probability distribution in Engineering and Science. The random vector X has a multivariate Gaussian distribution with mean µ and covariance matrix Σ (assuming Σ invertible, so that also det(Σ) > 0) if its density is given by
p(x) = (1/√((2π)^d det(Σ))) exp( −(1/2)(x − µ)ᵀ Σ^{−1} (x − µ) ) .   (1.80)
The multivariate Gaussian has ellipsoidal contours of constant density of the form
(x − µ)ᵀ Σ^{−1} (x − µ) = c² , c > 0 .   (1.81)
The axes of the ellipsoids are given by the eigenvectors of Σ and the lengths of the axes are proportional to the square roots of its eigenvalues. In the case Σ = σ²I_d, where I_d denotes the d × d identity matrix, the contours are spherical with center at µ. This can be seen by substituting Σ = σ²I_d in (1.81), which leads to the following equation for the contours:
‖x − µ‖² = (cσ)² .
If d = 1, one gets the univariate Gaussian distribution X ~ N(µ, σ²). With µ = 0 and σ = 1, the CDF of X is given by
P(X ≤ x) = Φ(x) = ∫_{−∞}^{x} (1/√(2π)) e^{−u²/2} du .   (1.83)
It is clear that the function Φ(·) satisfies the property Φ(−x) = 1 − Φ(x).
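The standard normal CDF has no closed form, but it can be evaluated through the error function via the identity Φ(x) = (1 + erf(x/√2))/2; a sketch, with a crude Riemann-sum cross-check of the integral in (1.83):

```python
import math

def Phi(x):
    # Standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

assert Phi(0.0) == 0.5
# Symmetry property: Phi(-x) = 1 - Phi(x)
for x in (0.5, 1.0, 2.0):
    assert abs(Phi(-x) - (1.0 - Phi(x))) < 1e-14

# Cross-check Phi(1) against a direct Riemann sum of the N(0,1) density
dx = 1e-4
grid = (k * dx for k in range(-100000, 10000))   # u from -10 to ~1
riemann = sum(math.exp(-u * u / 2) * dx for u in grid) / math.sqrt(2 * math.pi)
assert abs(riemann - Phi(1.0)) < 1e-3
```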
The following are useful properties of a multivariate Gaussian random vector X ~ N(µ, Σ):
• The components of X are independent if and only if they are uncorrelated, i.e., Σ is a diagonal matrix.
• The whitening transformation Y = Σ^{−1/2}(X − µ) produces a multivariate Gaussian Y ~ N(0, I_d) (so that all components of Y are zero-mean, unit-variance, and uncorrelated Gaussian random variables).
• If Y and X are jointly multivariate Gaussian, then the distribution of Y given X is again multivariate Gaussian.

1.5. CONVERGENCE OF RANDOM SEQUENCES
It is often necessary in Pattern Recognition to investigate the long-term behavior of random sequences, such as the sequence of true or estimated classification error rates indexed by sample size. In this section and the next, we review basic results about convergence of random sequences. We consider only the case of real-valued random variables, but nearly all the definitions and results can be directly extended to random vectors, with the appropriate modifications.
1. "Sure" convergence: X_n → X surely if lim_{n→∞} X_n(ω) = X(ω) for all outcomes ω ∈ S in the sample space.

2. Almost-sure convergence: X_n → X almost surely (or with probability 1), denoted by X_n →^{a.s.} X, if the set of outcomes ω ∈ S for which lim_{n→∞} X_n(ω) = X(ω) has probability 1.

3. L^p-convergence: X_n → X in L^p, for p > 0, also denoted by X_n →^{L^p} X, if E[|X_n|^p] < ∞ for n = 1, 2, …, E[|X|^p] < ∞, and the p-norm of the difference between X_n and X converges to zero:
lim_{n→∞} E[|X_n − X|^p] = 0 .   (1.85)
4. Convergence in Probability: X_n → X in probability, also denoted by X_n →^P X, if the "probability of error" converges to zero:
lim_{n→∞} P(|X_n − X| > τ) = 0 , for all τ > 0 .   (1.86)

5. Convergence in Distribution: X_n → X in distribution, also denoted by X_n →^D X, if the corresponding CDFs converge:
lim_{n→∞} F_{X_n}(a) = F_X(a) ,   (1.87)
at all points a ∈ ℝ where F_X is continuous.
Sure and almost-sure convergence have to do with convergence of the sequence realizations {X_n(ω)} to the corresponding limit X(ω), so many properties of ordinary convergence apply. For example, if f : ℝ → ℝ is a continuous function, then X_n → X a.s. implies that f(X_n) → f(X) a.s. as well (it is possible to show that this is also true for convergence in probability).
Stronger relations between the modes of convergence can be proved in special cases. In particular, mean-square convergence and convergence in probability can be shown to be equivalent for uniformly bounded sequences. Classifier error rate sequences are uniformly bounded, so this is an important topic in Pattern Recognition.
A random sequence {X_n; n = 1, 2, …} is uniformly bounded if there exists a finite K > 0, which does not depend on n, such that |X_n| < K with probability 1, meaning that P(|X_n| < K) = 1, for all n = 1, 2, … The classification error rate sequence {ε_n; n = 1, 2, …} is an example of a uniformly bounded random sequence, with K = 1. We have the following theorem.
Theorem 1.1. Let {X_n; n = 1, 2, …} be a uniformly bounded random sequence. The following statements are equivalent.
(1) X_n → X in L^p, for some p > 0.
(2) X_n → X in L^p, for all p > 0.
(3) X_n → X in probability.
Proof. First note that we can assume without loss of generality that X = 0, since X_n → X if and only if X_n − X → 0, and X_n − X is also uniformly bounded, with E[|X_n − X|^p] < ∞. Showing that (1) ⇔ (2) requires showing that X_n → 0 in L^p, for some p > 0, implies that X_n → 0 in L^q, for all q > 0. First observe that E[|X_n|^q] ≤ E[K^q] = K^q < ∞, for all q > 0. If q > p, the result is immediate. Let 0 < q < p. With X = |X_n|^q, Y = 1, and r = p/q in Hölder's Inequality (1.54), we can write
E[|X_n|^q] ≤ E[|X_n|^p]^{q/p} .   (1.90)
Hence, if E[|X_n|^p] → 0, then E[|X_n|^q] → 0, proving the assertion. To show that (2) ⇔ (3), first we show the direct implication by writing Markov's Inequality (1.57) with X = |X_n|^p and a = τ^p:
P(|X_n| ≥ τ) ≤ E[|X_n|^p]/τ^p , for all τ > 0 .   (1.91)
The right-hand side goes to 0 by hypothesis, and thus so does the left-hand side, which is equivalent to (1.86) with X = 0. To show the reverse implication, write
E[|X_n|^p] = E[|X_n|^p ; |X_n| < τ] + E[|X_n|^p ; |X_n| ≥ τ] ≤ τ^p + K^p P(|X_n| ≥ τ) .   (1.92)
By assumption, P(|X_n| ≥ τ) → 0, for all τ > 0, so that lim E[|X_n|^p] ≤ τ^p. Letting τ → 0 then yields the desired result. □
The previous theorem states that convergence in mean-square and in probability are equivalent for uniformly bounded sequences. The relationship between the modes of convergence becomes:
Sure ⇒ Almost-sure ⇒ (Mean-square ⇔ Probability) ⇒ Distribution   (1.93)
In particular, for a uniformly bounded sequence, X_n → X in probability implies E[X_n] → E[X].
Proof. From the previous theorem, X_n → X in L¹, i.e., E[|X_n − X|] → 0. But |E[X_n − X]| ≤ E[|X_n − X|], so |E[X_n − X]| → 0, i.e., E[X_n] → E[X]. □
Example 1.3. Consider a sequence of independent binary random variables X_1, X_2, … that take on values in {0, 1}, such that
P({X_n = 1}) = 1/n , n = 1, 2, …   (1.94)
Then X_n → 0 in probability, since P(X_n > τ) → 0, for every τ > 0. By Theorem 1.1, X_n → 0 in L^p as well. However, X_n does not converge to 0 with probability one. Indeed,
∑_{n=1}^∞ P({X_n = 1}) = ∑_{n=1}^∞ 1/n = ∞ ,   (1.95)
so that, by the Second Borel-Cantelli Lemma, X_n = 1 occurs infinitely often with probability one, and X_n does not converge with probability one. However, if convergence of the probabilities to zero is faster, e.g.
P({X_n = 1}) = 1/n² , n = 1, 2, …   (1.97)
then ∑_{n=1}^∞ P({X_n = 1}) < ∞ and the First Borel-Cantelli Lemma ensures that X_n → 0 with probability one.
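The dichotomy in the example hinges entirely on whether the series of probabilities diverges (1/n) or converges (1/n²); the partial sums can be sketched numerically:

```python
# Partial sums of the two series in Example 1.3.  Borel-Cantelli turns on
# whether sum P(X_n = 1) diverges (harmonic series) or converges (1/n^2).
N = 10**6
harmonic = sum(1.0 / n for n in range(1, N + 1))
p_series = sum(1.0 / n**2 for n in range(1, N + 1))

print(harmonic)   # grows like log N (about 14.4 here): the series diverges
print(p_series)   # approaches pi^2/6, about 1.6449: the series converges
assert harmonic > 14
assert p_series < 1.645
```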
In the previous example, note that, with P(X_n = 1) = 1/n, the probability of observing a 1 becomes infinitesimally small as n → ∞, so the sequence consists, for all practical purposes, of all zeros for large enough n. Convergence in probability and in L^p of X_n to 0 agrees with this fact, but the lack of convergence with probability 1 does not. This shows that convergence with probability 1 may be too stringent a criterion to be useful in practice, and convergence in probability and in L^p (assuming boundedness) may be enough. For example, this is the case in most Signal Processing applications, where L² is the criterion of choice.²
The following two theorems are the classical limiting theorems for random sequences, the proofs of
which can be found in any advanced text in Probability Theory, e.g. [Chung, 1974].
Theorem 1.2. (Law of Large Numbers.) Given an i.i.d. random sequence {X_n; n = 1, 2, …} with common finite mean µ,
(1/n) ∑_{i=1}^n X_i → µ , with probability 1.   (1.98)

² More generally, Engineering applications are concerned with average performance and rates of failure.
1.6. ADDITIONAL TOPICS
Mainly for historical reasons, the previous theorem is sometimes called the Strong Law of Large
Numbers, with the weaker result involving only convergence in probability being called the Weak
Law of Large Numbers.
Theorem 1.3. (Central Limit Theorem.) Given an i.i.d. random sequence {X_n; n = 1, 2, …} with common finite mean µ and common finite variance σ²,
(1/(σ√n)) ( ∑_{i=1}^n X_i − nµ ) →^D N(0, 1) .   (1.99)
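The CLT can be illustrated without simulation by comparing an exact binomial probability with its normal approximation; a sketch for X_i ~ Bernoulli(1/2), so that S_n ~ Binomial(n, 1/2):

```python
import math

def Phi(x):
    # Standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

n, p = 100, 0.5
mu, sigma = n * p, math.sqrt(n * p * (1 - p))   # mean and std of S_n

# Exact P(S_n <= 55) versus the CLT normal approximation
exact = sum(math.comb(n, k) for k in range(56)) / 2**n
approx = Phi((55.5 - mu) / sigma)               # with continuity correction

print(exact, approx)
assert abs(exact - approx) < 0.01
```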
The previous limiting theorems concern the behavior of a sum of n random variables as n approaches infinity. It is also useful to have an idea of how partial sums differ from expected values for finite n. This problem is addressed by the so-called concentration inequalities, the most famous of which is Hoeffding's Inequality.
Theorem 1.4. (Hoeffding's Inequality, 1963.) Given independent (not necessarily identically distributed) random variables X_1, …, X_n such that P(a ≤ X_i ≤ b) = 1, for i = 1, …, n, the sum S_n = ∑_{i=1}^n X_i satisfies
P(|S_n − E[S_n]| ≥ τ) ≤ 2 exp( −2τ²/(n(b − a)²) ) , for all τ > 0 .   (1.100)
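The bound in (1.100) can be compared against an exact tail probability; a sketch for i.i.d. Bernoulli(1/2) variables, where [a, b] = [0, 1] and S_n is Binomial(n, 1/2):

```python
import math

n, tau = 100, 10

# Exact two-sided tail P(|S_n - E[S_n]| >= tau), with E[S_n] = n/2
exact = sum(math.comb(n, k) for k in range(n + 1)
            if abs(k - n / 2) >= tau) / 2**n

# Hoeffding bound: 2 exp(-2 tau^2 / (n (b - a)^2)) with b - a = 1
bound = 2 * math.exp(-2 * tau**2 / n)

print(exact, bound)   # the exact tail is well below the bound
assert exact <= bound
```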
Hoeffding's Inequality is a special case of a more general concentration inequality due to McDiarmid.
Theorem 1.5. (McDiarmid's Inequality, 1989.) Given independent (not necessarily identically distributed) random variables X_1, …, X_n defined on a set A and a function f : A^n → ℝ such that
|f(x_1, …, x_{i−1}, x_i, x_{i+1}, …, x_n) − f(x_1, …, x_{i−1}, x′_i, x_{i+1}, …, x_n)| ≤ c_i ,   (1.101)
for i = 1, …, n, the random variable f(X_1, …, X_n) satisfies
P(|f(X_1, …, X_n) − E[f(X_1, …, X_n)]| ≥ τ) ≤ 2 exp( −2τ²/∑_{i=1}^n c_i² ) , for all τ > 0.
Hoeffding's Inequality corresponds to the choice f(x_1, …, x_n) = ∑_{i=1}^n x_i, with c_i = b − a.
An interesting application of the Second Borel-Cantelli Lemma is the thought experiment known as
the "infinite typist monkey." Imagine a monkey that sits at a typewriter banging away randomly
for an infinite amount of time. With probability 1, it will produce Shakespeare's complete works, and
in fact the entire Library of Congress, not just once, but an infinite number of times.
Proof. Let L be the length in characters of the desired work. Let E_n be the event that the n-th
non-overlapping block of L characters produced by the monkey matches, character by character, the
desired work (we are making it even harder for the monkey, as we are ruling out overlapping frames).
Clearly P(E_n) = 27^{-L} > 0, assuming an alphabet of 27 symbols. It is a very small number, but
still positive, so that \sum_{n=1}^{\infty} P(E_n) = \infty. Now, since our monkey never gets
disappointed nor tired, the events E_n are independent. It follows from the Second Borel-Cantelli
Lemma that, with probability 1, E_n occurs infinitely often. Q.E.D.
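To appreciate just how small P(E_n) is, the following back-of-the-envelope computation (not part of the proof) evaluates its order of magnitude for a work of L = 130,000 characters, roughly the length of Hamlet:

```python
import math

# Order of magnitude of P(E_n) = 27**(-L) for a Hamlet-length work
# (about 130,000 characters over a 27-symbol alphabet).
L = 130_000
log10_p = -L * math.log10(27)
print(f"P(E_n) ~ 10^{log10_p:.0f}")  # P(E_n) ~ 10^-186077
```

A probability with roughly 186,000 leading zeros; yet, being positive, it suffices for the lemma.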
The typist monkey would produce a library containing every possible work of literature, in any
language based on the Latin alphabet. This is what the Argentine writer Jorge L. Borges had to say
about such a library, in a 1939 essay called "The Total Library":
Everything would be in its blind volumes. Everything: the detailed history of the future,
Aeschylus’ The Egyptians, the exact number of times that the waters of the Ganges have
reflected the flight of a falcon, the secret and true nature of Rome, the encyclopedia
Novalis would have constructed, my dreams and half-dreams at dawn on August 14, 1934,
the proof of Pierre Fermat’s theorem, the unwritten chapters of Edwin Drood, those same
chapters translated into the language spoken by the Garamantes, the paradoxes Berkeley
invented concerning Time but didn’t publish, Urizen’s books of iron, the premature
epiphanes of Stephen Dedalus, which would be meaningless before a cycle of a thousand
years, the Gnostic Gospel of Basilides, the song the sirens sang, the complete catalog
of the Library, the proof of the inaccuracy of that catalog. Everything: but for every
sensible line or accurate fact there would be millions of meaningless cacophonies, verbal
farragoes, and babblings. Everything: but all the generations of mankind could pass
before the dizzying shelves — shelves that obliterate the day and on which chaos lies —
ever reward them with a tolerable page.
In practice, even if all the atoms in the universe were typist monkeys banging away billions of
characters a second since the Big Bang, the probability of producing Shakespeare's Hamlet, let alone
Borges' total library, within the age of the universe would still be vanishingly small. This shows
that one must be careful with arguments involving infinity.
Given a sequence of events E_1, E_2, ..., a tail event is an event whose occurrence depends on the
whole sequence, but is probabilistically independent of any finite subsequence. Some examples of
tail events are \lim_{n\to\infty} E_n (if \{E_n\} is monotone), \limsup_{n\to\infty} E_n, and \liminf_{n\to\infty} E_n.
One of the most startling results published in Kolmogorov’s 1933 monograph was the so-called
Zero-One Law. It states that, given a sequence of independent events E1 , E2 , . . ., all its tail events
have either probability 0 or probability 1. That is, tail events are either almost-surely impossible
or occur almost surely. In practice, it may be extremely difficult to conclude one way or the other.
The Borel-Cantelli lemmas together give a sufficient condition to decide on the 0-1 probability of
the tail event \limsup_{n\to\infty} E_n, with \{E_n\} an independent sequence.
The St. Petersburg Paradox illustrates the issues with the frequentist approach to probability.
Imagine a game where a fair coin is tossed repeatedly and independently until the first tail appears,
at which point the player is rewarded 2^N dollars, where N is the total number of tosses. According
to the standard frequentist interpretation, the fair price of a game is its expected winnings. The
questions are 1) what are the expected winnings of the coin-flipping game, and 2) how much would
most people be willing to pay to play the game once?
Notice that the number of tosses N is a Geometric random variable with parameter 1/2, so that
P(N = n) = 2^{-n}, for n = 1, 2, ... The expected winnings are therefore
\[
E[W] \;=\; E[2^N] \;=\; \sum_{n=1}^{\infty} 2^n \, 2^{-n} \;=\; \sum_{n=1}^{\infty} 1 \;=\; \infty. \tag{1.103}
\]
However, this expected result is very far from being the most likely result in a single game, as
P(W = ∞) = P(N = ∞) = 0, with the most likely outcome, i.e., the mode of W, being equal to 2,
with P(W = 2) = P(N = 1) = 1/2. What most people would be willing to pay to play this game once
would be a small multiple of that. It is only in the long run (i.e., by playing the game repeatedly
many times) that the average winnings of the game are huge. In this case, however, it is a very
long run, and any player, regardless of how rich they are, would go broke long before attaining the
promised unbounded winnings.
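A simulation sketch (illustrative; the function names are ours) makes the point concrete: the sample average of the winnings keeps creeping upward with the number of games played, but any finite run falls far short of the infinite expectation:

```python
import random

def st_petersburg_game(rng):
    """Play one game: flip a fair coin until the first tail; the reward
    is 2**N dollars, where N is the total number of tosses."""
    n = 1
    while rng.random() < 0.5:  # heads: keep flipping
        n += 1
    return 2 ** n

def average_winnings(games, seed=3):
    """Sample average of the winnings over the given number of games."""
    rng = random.Random(seed)
    return sum(st_petersburg_game(rng) for _ in range(games)) / games

for g in (100, 10_000, 1_000_000):
    print(g, average_winnings(g))  # grows slowly (roughly like log2 of g)
```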
1.7 Bibliographical Remarks

There are many excellent books on the theory of probability. We mention but a few below. At
the advanced undergraduate level, the books by S. Ross [Ross, 1994, Ross, 1995] offer a thorough
treatment of non-measure-theoretic probability. At the graduate level, the books by P. Billingsley
[Billingsley, 1995] and K. Chung [Chung, 1974] provide mathematically rigorous expositions of
measure-theoretic probability theory. The book by J. Rosenthal [Rosenthal, 2006] is a surpris-
Exercises
1. The Monty Hall Problem. This problem nicely demonstrates subtle issues regarding partial
information and prediction. A certain show host has placed a case with US$1,000,000 behind
one of three identical doors. Behind each of the other two doors he placed a donkey. The host
asks the contestant to pick one door but not to open it. The host then opens one of the other
two doors to reveal a donkey. He then asks the contestant whether he wants to stay with his door
or switch to the other unopened door. Assume that the host is honest and that, if the contestant
initially picked the correct door, the host randomly picks one of the two donkey doors to open.
Which of the following strategies is rationally justifiable:
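One way to check a proposed answer empirically is a simulation such as the following sketch (not part of the exercise; names are illustrative):

```python
import random

def monty_hall(switch, trials=100_000, seed=4):
    """Estimate the probability of winning the prize under the 'stay'
    and 'switch' strategies, with an honest host who always opens a
    donkey door different from the contestant's pick."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        prize = rng.randrange(3)
        pick = rng.randrange(3)
        # Host opens a donkey door that is not the contestant's pick
        opened = rng.choice([d for d in range(3) if d != pick and d != prize])
        if switch:
            pick = next(d for d in range(3) if d != pick and d != opened)
        wins += pick == prize
    return wins / trials

print(monty_hall(switch=False))  # about 1/3
print(monty_hall(switch=True))   # about 2/3
```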
2. The random experiment consists of throwing two fair dice. Let us define the events:
D = {the sum of the dice equals 6}
E = {the sum of the dice equals 7}
F = {the first die lands 4}
G = {the second die lands 3}
Show the following, both by arguing and by computing probabilities:
3. Suppose that a typist monkey is typing randomly, but that each time he types a wrong
character, it is discarded from the output. Assume also that the monkey types 24/7 at the rate
of one character per second, and that each character can be one of 27 symbols (the alphabet
without punctuation plus space). Given that Hamlet has about 130,000 characters, what is
the average number of days that it would take the typist monkey to compose the famous play?
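Under the stated assumptions, each kept character takes a Geometric(1/27) number of seconds, with mean 27; the following sketch (a sanity check of the arithmetic, so try the exercise first) converts the resulting expected total to days:

```python
# Expected time for the "discarding" typist monkey to produce Hamlet:
# each correct character takes a Geometric(1/27) number of seconds,
# with mean 27, so the expected total is L * 27 seconds.
L = 130_000
expected_seconds = L * 27
expected_days = expected_seconds / (60 * 60 * 24)
print(round(expected_days, 1))  # about 40.6 days
```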
4. Suppose that 3 balls are selected without replacement from an urn containing 4 white balls,
6 red balls, and 2 black balls. Let Xi = 1 if the i-th ball selected is white, and let Xi = 0
otherwise, for i = 1, 2, 3. Give the joint PMF of
(a) X1 , X2
(b) X1 , X2 , X3
5. Consider 12 independent rolls of a 6-sided die. Let X be the number of 1's and let Y be
the number of 2's obtained. Compute E[X], E[Y], Var(X), Var(Y), E[X + Y], Var(X + Y),
Cov(X, Y), and ρ(X, Y). (Hint: You may want to compute these in the order given.)
(c) Conclude that the conditional expectation E[Y | X] (which can be shown to be the "best"
predictor of Y given X) is, in the Gaussian case, a linear function of X. This is the
foundation of optimal linear filtering in Signal Processing. Plot the regression line for
the case σ_x = σ_y, µ_x = 0, fixed µ_y, and a few values of ρ. What do you observe as the
correlation ρ changes? What happens in the case ρ = 0?
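For reference, the standard bivariate Gaussian regression formula (a well-known result, not derived in the exercise text) is

```latex
E[Y \mid X] \;=\; \mu_y + \rho \, \frac{\sigma_y}{\sigma_x} \,(X - \mu_x),
```

which, with σ_x = σ_y and µ_x = 0, reduces to E[Y | X] = µ_y + ρX, a linear function of X with slope ρ.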
8. Consider the example of a random sequence X(n) of 0-1 binary r.v.’s given in class:
• Set X(0) = 1
• From the next 2 points, pick one randomly and set to 1, the other to zero.
• From the next 3 points, pick one randomly and set to 1, the rest to zero.
• From the next 4 points, pick one randomly and set to 1, the rest to zero.
• ...
Notice that X(n) is clearly converging slowly, in some sense, to zero, but not with probability
one. This leads one to the realization that convergence with probability one is a very strong
requirement; in many practical situations, convergence in probability and in mean-square may
be more adequate.
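The construction can be sketched in code (illustrative; names are ours): every block of length k contains exactly one 1, so X(n) = 1 infinitely often, ruling out convergence to 0 with probability one, while the probability that any given position in the k-th block carries the 1 is only 1/k, which tends to 0, giving convergence in probability:

```python
import random

def block_sequence(num_blocks, seed=5):
    """Construct the 0-1 sequence of Exercise 8: the k-th block has
    length k and contains a single randomly-placed 1."""
    rng = random.Random(seed)
    seq = []
    for k in range(1, num_blocks + 1):
        block = [0] * k
        block[rng.randrange(k)] = 1
        seq.extend(block)
    return seq

seq = block_sequence(1000)
# Exactly one 1 per block, so X(n) = 1 occurs infinitely often...
print(sum(seq))   # 1000 ones among 500500 terms
# ...yet P(X(n) = 1) = 1/k on the k-th block, which tends to 0.
```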
Bibliography
[Billingsley, 1995] Billingsley, P. (1995). Probability and Measure. John Wiley, New York, 3rd edition.
[Casella and Berger, 2002] Casella, G. and Berger, R. (2002). Statistical Inference. Duxbury, Pacific Grove, CA, 2nd edition.
[Chung, 1974] Chung, K. L. (1974). A Course in Probability Theory. Academic Press, New York, 2nd edition.
[Devroye et al., 1996] Devroye, L., Gyorfi, L., and Lugosi, G. (1996). A Probabilistic Theory of Pattern Recognition. Springer, New York.
[Nualart, 2004] Nualart, D. (2004). Kolmogorov and probability theory. Arbor, 178(704):607–619.
[Rosenthal, 2006] Rosenthal, J. (2006). A First Look at Rigorous Probability Theory. World Scientific Publishing, Singapore, 2nd edition.
[Ross, 1994] Ross, S. (1994). A First Course in Probability. Macmillan, New York, 4th edition.
[Ross, 1995] Ross, S. (1995). Stochastic Processes. Wiley, New York, 2nd edition.