
STATS 200 (Stanford University, Summer 2015)

Lecture 2: Convergence Concepts

This lecture covers some additional concepts of probability that address the limiting behavior
of sequences of random variables. In particular, we consider concepts and results related to
convergence of such sequences. At least some of this material should be familiar from previous
courses, but some of it may be new as well.

2.1 Convergence of Random Variables

Before dealing with convergence of random variables, first recall the definition of convergence
for a sequence of real numbers.
Convergence of Real Numbers
Recall that a sequence of real numbers $\{a_n : n \ge 1\}$ converges to $a$ (written $a_n \to a$ as $n \to \infty$) if for every $\varepsilon > 0$, there exists $N \ge 1$ such that $|a_n - a| \le \varepsilon$ for every $n \ge N$.
Convergence in Probability and Convergence in Distribution
For our purposes, there are two main notions of convergence for random variables. Let $\{X_n : n \ge 1\}$ be a sequence of random variables, and let $X$ be a random variable. Let $F^{(X_n)}$ and $F^{(X)}$ denote their cdfs.
We say that $\{X_n : n \ge 1\}$ converges in probability to $X$ (written $X_n \xrightarrow{P} X$ as $n \to \infty$) if for every $\varepsilon > 0$, $P(|X_n - X| > \varepsilon) \to 0$.
We say that $\{X_n : n \ge 1\}$ converges in distribution to $X$ (written $X_n \xrightarrow{D} X$ as $n \to \infty$) if $F^{(X_n)}(x) \to F^{(X)}(x)$ at every point $x$ where $F^{(X)}$ is continuous.
Note that convergence in distribution is defined by convergence of cdfs, rather than the values of the actual random variables. For this reason, it is sometimes simply written as $F^{(X_n)} \xrightarrow{D} F^{(X)}$ or $X_n \xrightarrow{D} F^{(X)}$. We may also replace $F^{(X)}$ with its common name if it has one, e.g., $X_n \xrightarrow{D} N(0, 1)$ if $F^{(X)}$ is the cdf of a standard normal random variable.
Note: Convergence in distribution is also called convergence in law or weak convergence.
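As a concrete illustration of convergence in probability, the following minimal Python sketch (the Uniform(0, 1) population, the value of $\varepsilon$, and the sample sizes are assumptions chosen purely for illustration, not part of these notes) estimates $P(|M_n - 1/2| > \varepsilon)$ by Monte Carlo, where $M_n$ denotes the average of $n$ iid Uniform(0, 1) draws. Since $M_n \xrightarrow{P} 1/2$ (a fact formalized by the weak law of large numbers in Section 2.2), the estimated probabilities shrink toward 0 as $n$ grows.

    import numpy as np

    # Monte Carlo estimate of P(|M_n - 1/2| > eps), where M_n is the mean of n iid
    # Uniform(0, 1) draws.  The population, eps, reps, and sample sizes are illustrative.
    rng = np.random.default_rng(0)
    eps = 0.05
    reps = 1_000   # Monte Carlo repetitions

    for n in [10, 100, 1_000, 10_000]:
        m_n = rng.uniform(0.0, 1.0, size=(reps, n)).mean(axis=1)
        prob = np.mean(np.abs(m_n - 0.5) > eps)
        print(f"n = {n:>6}:  estimated P(|M_n - 1/2| > {eps}) = {prob:.3f}")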

The next theorems address the relationship between these two types of convergence. The
proofs of these theorems and the other theorems in this section can be found at the end of
this lecture in Section 2.4.
Theorem 2.1.1. If $X_n \xrightarrow{P} X$, then $X_n \xrightarrow{D} X$.
Thus, convergence in probability is stronger than convergence in distribution. However, in
the case where the limiting random variable X is actually a constant, they are equivalent.
Theorem 2.1.2. Let $a \in \mathbb{R}$ be a constant. Then $X_n \xrightarrow{P} a$ if and only if $X_n \xrightarrow{D} a$.


Continuous Mapping Theorems


The next theorems address convergence of functions of random variables that are continuous
(or continuous at a particular point).
Theorem 2.1.3. If $X_n \xrightarrow{P} a$ for some constant $a \in \mathbb{R}$ and $g : \mathbb{R} \to \mathbb{R}$ is continuous at $a$, then $g(X_n) \xrightarrow{P} g(a)$.
Theorem 2.1.4. If $X_n \xrightarrow{P} X$ and $g : \mathbb{R} \to \mathbb{R}$ is continuous, then $g(X_n) \xrightarrow{P} g(X)$.
Theorem 2.1.5. If $X_n \xrightarrow{D} X$ and $g : \mathbb{R} \to \mathbb{R}$ is continuous, then $g(X_n) \xrightarrow{D} g(X)$.
Slutsky's Theorem
The following result will be particularly useful later in the course.
Theorem 2.1.6 (Slutsky's Theorem). If $X_n \xrightarrow{D} X$ and $Y_n \xrightarrow{P} a$, where $a \in \mathbb{R}$ is a constant, then $X_n + Y_n \xrightarrow{D} X + a$ and $X_n Y_n \xrightarrow{D} aX$.
Example 2.1.7: Suppose $X_n \xrightarrow{D} X$ and $a_n \to 1$. Then $a_n X_n \xrightarrow{D} X$, noting that this is simply a special case of Slutsky's theorem in which the sequence of random variables $Y_n$ in the theorem statement is a sequence of constants.

Notice that to apply Slutsky's theorem, the sequence of random variables $Y_n$ in the theorem statement must be converging to a constant.
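To see Slutsky's theorem at work numerically, here is a small Monte Carlo sketch (the Exponential(1) population and the simulation sizes are assumptions chosen for illustration). With $\mu = \sigma = 1$, the CLT stated in Section 2.2 gives $\sqrt{n}(\bar{X}_n - \mu)/\sigma \xrightarrow{D} N(0, 1)$, the sample standard deviation $S_n$ satisfies $\sigma/S_n \xrightarrow{P} 1$, and Slutsky's theorem then gives $\sqrt{n}(\bar{X}_n - \mu)/S_n \xrightarrow{D} N(0, 1)$ as well:

    import numpy as np

    # Slutsky illustration (Exponential(1) population, n, and reps are illustrative choices):
    # with mu = sigma = 1, sqrt(n) * (mean - mu) / sigma ->D N(0, 1) by the CLT and
    # sigma / S_n ->P 1, so the studentized statistic sqrt(n) * (mean - mu) / S_n
    # should also look approximately standard normal for large n.
    rng = np.random.default_rng(1)
    n, reps = 2_000, 5_000
    samples = rng.exponential(scale=1.0, size=(reps, n))
    xbar = samples.mean(axis=1)
    s = samples.std(axis=1, ddof=1)                  # S_n, the sample standard deviation
    t_stat = np.sqrt(n) * (xbar - 1.0) / s

    # Compare empirical probabilities with the standard normal cdf (0.025, 0.5, 0.975).
    for q, target in [(-1.96, 0.025), (0.0, 0.5), (1.96, 0.975)]:
        print(f"P(T <= {q:+.2f}): empirical {np.mean(t_stat <= q):.3f}  vs  N(0,1) {target:.3f}")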

2.2 Weak Law of Large Numbers and Central Limit Theorem

We now state two extremely important asymptotic results: the weak law of large numbers
and the central limit theorem. Note that we use the abbreviation iid to mean independent
and identically distributed (i.e., independent with the same cdf).
Note: DeGroot & Schervish use the term random sample to refer to a collection of
iid random variables. This terminology is not standard.

Weak Law of Large Numbers


The weak law of large numbers (or WLLN) formalizes the intuitive notion that the expectation of a random variable may be interpreted as its long-run average.
Theorem 2.2.1 (Weak Law of Large Numbers). Let $\{X_n : n \ge 1\}$ be a sequence of iid random variables with $E(|X_1|) < \infty$. Let $\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$. Then $\bar{X}_n \xrightarrow{P} E(X_1)$.
Proof. For the case where $\mathrm{Var}(X_1) < \infty$, see the proof of Theorem 6.2.4 of DeGroot & Schervish (which relies on Theorem 6.2.2 of DeGroot & Schervish, the Chebyshev inequality). A proof that holds when $\mathrm{Var}(X_1) = \infty$ and $E(|X_1|) < \infty$ is beyond the scope of this course.
Note: Yes, there also exists a strong law of large numbers, which is similar but deals
with a stronger form of convergence called almost sure convergence or convergence with
probability 1. In more sophisticated versions of these theorems, the iid assumption can
be relaxed much more for the weak law than for the strong law.


Example 2.2.2: Let $X_1, X_2, \ldots$ be iid Bernoulli random variables with success probability $\theta$. Then $\bar{X}_n$ is the proportion of successes in the first $n$ trials. The WLLN tells us that $\bar{X}_n \xrightarrow{P} \theta$, noting that $\theta = E(X_1)$.
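A quick numerical illustration of this example (the value $\theta = 0.3$ and the trial counts below are assumptions chosen for illustration): simulate one long sequence of Bernoulli trials and watch the running proportion of successes settle near $\theta$.

    import numpy as np

    # Running proportion of successes in iid Bernoulli(theta) trials (Example 2.2.2);
    # theta and the trial counts are illustrative choices.
    rng = np.random.default_rng(2)
    theta = 0.3
    trials = rng.binomial(1, theta, size=100_000)
    running_mean = np.cumsum(trials) / np.arange(1, trials.size + 1)

    for n in [10, 100, 1_000, 10_000, 100_000]:
        print(f"n = {n:>6}:  proportion of successes = {running_mean[n - 1]:.4f}   (theta = {theta})")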

Central Limit Theorem


The central limit theorem (or CLT) addresses the asymptotic distribution of an average of
iid random variables. Specifically, it states that this asymptotic distribution is a normal
distribution, regardless of the distribution of the individual random variables themselves.
Theorem 2.2.3 (Central Limit Theorem). Let $\{X_n : n \ge 1\}$ be a sequence of iid random variables with $\mathrm{Var}(X_1) < \infty$. Let $\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$. Then $\sqrt{n}(\bar{X}_n - \mu) \xrightarrow{D} N(0, \sigma^2)$, where $\mu = E(X_1)$ and $\sigma^2 = \mathrm{Var}(X_1)$.
Proof. See pages 368–369 of DeGroot & Schervish for an outline. The details are beyond the scope of this course.
Informally, the central limit theorem states that for large $n$, $\bar{X}_n$ is approximately normal with mean $\mu$ and variance $\sigma^2/n$.

Example 2.2.4: In Example 2.2.2, the CLT tells us that $\sqrt{n}(\bar{X}_n - \theta) \xrightarrow{D} N[0, \theta(1 - \theta)]$, noting that $\theta(1 - \theta) = \mathrm{Var}(X_1)$. Informally, this tells us that for large $n$, the distribution of $\bar{X}_n$ is approximately a normal distribution with mean $\theta$ and variance $\theta(1 - \theta)/n$.
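As a sanity check on Example 2.2.4, the following sketch (again, $\theta = 0.3$, $n$, and the number of repetitions are illustrative assumptions) simulates many independent copies of $\sqrt{n}(\bar{X}_n - \theta)$ and compares their empirical mean and variance with the theoretical values 0 and $\theta(1 - \theta) = 0.21$.

    import numpy as np

    # Many independent copies of sqrt(n) * (Xbar_n - theta) for Bernoulli(theta) data;
    # theta, n, and reps are illustrative choices.
    rng = np.random.default_rng(3)
    theta, n, reps = 0.3, 1_000, 20_000
    xbar = rng.binomial(n, theta, size=reps) / n     # each entry is one realization of Xbar_n
    z = np.sqrt(n) * (xbar - theta)

    print(f"empirical mean     = {z.mean():+.4f}   (theory: 0)")
    print(f"empirical variance = {z.var():.4f}   (theory: {theta * (1 - theta):.4f})")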

Note: It may appear as though the WLLN as stated above is implied by the CLT since
$$\bar{X}_n = \frac{1}{\sqrt{n}}\left[\sqrt{n}(\bar{X}_n - \mu)\right] + \mu$$
and $1/\sqrt{n} \to 0$. However, in more sophisticated versions of these theorems, the iid assumption can be relaxed much more for the WLLN than for the CLT. (Even in the versions stated above, the CLT requires a finite variance, while the WLLN does not.)

2.3 Delta Method

Let $\{Y_n : n \ge 1\}$ be a sequence of random variables such that $\sqrt{n}(Y_n - a) \xrightarrow{D} Z$ for some random variable $Z$ and some constant $a \in \mathbb{R}$. Let $g : \mathbb{R} \to \mathbb{R}$ be a function. What can we say about the asymptotic behavior of $g(Y_n)$?

First, note that since $1/\sqrt{n} \to 0$,
$$Y_n - a = \frac{1}{\sqrt{n}}\left[\sqrt{n}(Y_n - a)\right] \xrightarrow{D} 0 \cdot Z = 0.$$
Thus, $Y_n \xrightarrow{D} a$, and hence $Y_n \xrightarrow{P} a$ by Theorem 2.1.2 (since $a$ is a constant).
If $g$ is continuous at $a$, then $g(Y_n) \xrightarrow{P} g(a)$ by Theorem 2.1.3.
However, we can do better. Suppose that $g$ is differentiable at $a$, so that we may write $g(Y_n) - g(a) \approx g'(a)(Y_n - a)$ (a first-order Taylor expansion). Then by Theorem 2.1.5,
$$\sqrt{n}\left[g(Y_n) - g(a)\right] \approx g'(a)\,\sqrt{n}(Y_n - a) \xrightarrow{D} g'(a)\,Z.$$
This is the basic idea of a technique called the delta method.


Delta Method (General Case)


Theorem 2.3.1 (Delta Method). Let $\{Y_n : n \ge 1\}$ be a sequence of random variables such that $\sqrt{n}(Y_n - a) \xrightarrow{D} Z$ for some random variable $Z$ and some constant $a \in \mathbb{R}$. Let $g : \mathbb{R} \to \mathbb{R}$ be continuously differentiable at $a$. Then $\sqrt{n}\left[g(Y_n) - g(a)\right] \xrightarrow{D} g'(a)\,Z$.
Proof. Formally, $\sqrt{n}\left[g(Y_n) - g(a)\right] = g'(Y_n^\ast)\,\sqrt{n}(Y_n - a)$ for some $Y_n^\ast$ between $Y_n$ and $a$ (by the mean value theorem). Note that for any $\varepsilon > 0$, $P(|Y_n^\ast - a| > \varepsilon) \le P(|Y_n - a| > \varepsilon)$, and $P(|Y_n - a| > \varepsilon) \to 0$ since $Y_n \xrightarrow{P} a$. Then $Y_n^\ast \xrightarrow{P} a$, so $g'(Y_n^\ast) \xrightarrow{P} g'(a)$ by Theorem 2.1.3. Since $\sqrt{n}(Y_n - a) \xrightarrow{D} Z$, the result follows by Slutsky's theorem.
Delta Method (Normal Case)
The following special case is by far the most common use of the delta method.
Corollary 2.3.2 (Delta Method, Normal Case). Let $\{Y_n : n \ge 1\}$ be a sequence of random variables such that $\sqrt{n}(Y_n - a) \xrightarrow{D} N(0, \sigma^2)$ for some constants $a \in \mathbb{R}$ and $\sigma^2 > 0$. Let $g : \mathbb{R} \to \mathbb{R}$ be continuously differentiable at $a$. Then $\sqrt{n}\left[g(Y_n) - g(a)\right] \xrightarrow{D} N(0, \sigma^2\,[g'(a)]^2)$.
Proof. Take $Z \sim N(0, \sigma^2)$ in Theorem 2.3.1.
Example 2.3.3: Suppose $X_1, X_2, \ldots$ are iid from the continuous uniform distribution on the interval $[0, 60]$, and we want to find the asymptotic distribution of $(\bar{X}_n)^{-1}$. We have $E(X_1) = 30$ and $\mathrm{Var}(X_1) = 300$, so $\sqrt{n}(\bar{X}_n - 30) \xrightarrow{D} N(0, 300)$ by the CLT. Our function is $g(t) = t^{-1}$, and $g(30) = 1/30$. Its derivative is $g'(t) = -t^{-2}$, which is continuous at 30, and $g'(30) = -1/900$. Then by the delta method,
$$\sqrt{n}\left[(\bar{X}_n)^{-1} - \frac{1}{30}\right] \xrightarrow{D} N(0, 1/2700),$$
noting that $300 \cdot (-1/900)^2 = 1/2700$. Thus, for large $n$, $(\bar{X}_n)^{-1}$ is approximately normal, with mean $1/30$ and variance $1/(2700\,n)$.
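The following Monte Carlo sketch checks this conclusion numerically (the choices of $n$ and the number of repetitions are illustrative assumptions): the simulated values of $\sqrt{n}\left[(\bar{X}_n)^{-1} - 1/30\right]$ should have mean near 0 and variance near $1/2700 \approx 0.00037$.

    import numpy as np

    # Monte Carlo check of Example 2.3.3: X_i iid Uniform[0, 60], so the delta method says
    # sqrt(n) * (1/Xbar_n - 1/30) is approximately N(0, 1/2700) for large n.
    # The values of n and reps are illustrative choices.
    rng = np.random.default_rng(4)
    n, reps = 1_000, 10_000
    xbar = rng.uniform(0.0, 60.0, size=(reps, n)).mean(axis=1)
    w = np.sqrt(n) * (1.0 / xbar - 1.0 / 30.0)

    print(f"empirical mean     = {w.mean():+.6f}   (theory: 0)")
    print(f"empirical variance = {w.var():.6f}   (theory: {1 / 2700:.6f})")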

2.4 Proofs

This section contains several proofs that are not particularly insightful but are provided for
the sake of completeness. There is no need to study these proofs in detail.
Proof of Theorem 2.1.1. Let $F^{(X_n)}$ and $F^{(X)}$ denote the cdfs of $X_n$ and $X$, and let $x \in \mathbb{R}$ be any point where $F^{(X)}$ is continuous. Now let $\varepsilon > 0$. Then there exists $\delta > 0$ such that
$$F^{(X)}(x) - \frac{\varepsilon}{2} < F^{(X)}(x - \delta) \le F^{(X)}(x) \le F^{(X)}(x + \delta) < F^{(X)}(x) + \frac{\varepsilon}{2} \qquad (2.4.1)$$
by the definition of continuity and the fact that the cdf is nondecreasing. Now observe that
$$P(X_n \le x) \le P(X \le x + \delta) + P(|X_n - X| > \delta), \qquad (2.4.2)$$
$$P(X \le x - \delta) \le P(X_n \le x) + P(|X_n - X| > \delta), \qquad (2.4.3)$$
for every $n \ge 1$, noting for each line that if the event inside the left-hand probability occurs, then at least one of the events inside the right-hand probabilities occurs. Combining (2.4.2) and (2.4.3) yields
$$F^{(X)}(x - \delta) - P(|X_n - X| > \delta) \le F^{(X_n)}(x) \le F^{(X)}(x + \delta) + P(|X_n - X| > \delta), \qquad (2.4.4)$$
noting the definitions of $F^{(X_n)}$ and $F^{(X)}$. Then combining (2.4.1) and (2.4.4) yields
$$F^{(X)}(x) - \frac{\varepsilon}{2} - P(|X_n - X| > \delta) \le F^{(X_n)}(x) \le F^{(X)}(x) + \frac{\varepsilon}{2} + P(|X_n - X| > \delta).$$
Since $X_n \xrightarrow{P} X$, there exists $N \ge 1$ such that $P(|X_n - X| > \delta) < \varepsilon/2$ for every $n \ge N$. Then
$$F^{(X)}(x) - \varepsilon \le F^{(X_n)}(x) \le F^{(X)}(x) + \varepsilon$$
for all $n \ge N$, which establishes that $F^{(X_n)}(x) \to F^{(X)}(x)$.
Proof of Theorem 2.1.2. By Theorem 2.1.1, we only need to show that $X_n \xrightarrow{D} a$ implies $X_n \xrightarrow{P} a$. Let $F^{(X_n)}$ denote the cdf of $X_n$, and let $F^{(X)}$ denote the cdf of the random variable $X = a$, which is
$$F^{(X)}(x) = \mathbf{1}(x \ge a) = \begin{cases} 0 & \text{if } x < a, \\ 1 & \text{if } x \ge a. \end{cases}$$
If $X_n \xrightarrow{D} a$, then $F^{(X_n)}(x) \to \mathbf{1}(x \ge a)$ for all $x \ne a$. Now let $\varepsilon > 0$. Then
$$P(|X_n - a| > \varepsilon) = P(X_n < a - \varepsilon) + P(X_n > a + \varepsilon) = P(X_n < a - \varepsilon) + \left[1 - P(X_n \le a + \varepsilon)\right]$$
$$\le P(X_n \le a - \varepsilon) + \left[1 - P(X_n \le a + \varepsilon)\right] = F^{(X_n)}(a - \varepsilon) + \left[1 - F^{(X_n)}(a + \varepsilon)\right] \to 0,$$
noting that $F^{(X_n)}(a - \varepsilon) \to 0$ since $a - \varepsilon < a$ and $F^{(X_n)}(a + \varepsilon) \to 1$ since $a + \varepsilon > a$.
Proof of Theorem 2.1.3. Let $\varepsilon > 0$. There exists $\delta > 0$ such that $|g(x) - g(a)| \le \varepsilon$ for all $x \in \mathbb{R}$ such that $|x - a| \le \delta$. Then $P[|g(X_n) - g(a)| > \varepsilon] \le P(|X_n - a| > \delta) \to 0$ since $X_n \xrightarrow{P} a$.
Proof of Theorem 2.1.4. Let $\varepsilon > 0$ and $\eta > 0$. First note that there exists $c > 0$ such that $P(|X| > c) \le \eta/2$. Then there exists $\delta > 0$ such that $|g(x_1) - g(x_2)| \le \varepsilon$ for all $x_1, x_2 \in \mathbb{R}$ such that $|x_1| \le c$ and $|x_1 - x_2| \le \delta$. Now note that since $X_n \xrightarrow{P} X$, there exists $N \ge 1$ such that $P(|X_n - X| > \delta) \le \eta/2$ for all $n \ge N$. Then
$$P\left[|g(X_n) - g(X)| > \varepsilon\right] \le P(|X| > c) + P(|X_n - X| > \delta) \le \eta$$
for all $n \ge N$. Thus, since $\eta > 0$ was arbitrary, $P[|g(X_n) - g(X)| > \varepsilon] \to 0$.
Proof of Theorem 2.1.5. A full proof is beyond the scope of this course. However, the proof is straightforward in the special case where $g$ is strictly increasing and hence has an inverse function $g^{-1}$. Let $t \in \mathbb{R}$ be any point where $F^{[g(X)]}$ is continuous. Then $P[g(X) = t] = 0$, or equivalently $P[X = g^{-1}(t)] = 0$, which implies that $F^{(X)}$ is continuous at $g^{-1}(t)$. Then
$$F^{[g(X_n)]}(t) = P[X_n \le g^{-1}(t)] = F^{(X_n)}[g^{-1}(t)] \to F^{(X)}[g^{-1}(t)] = P[X \le g^{-1}(t)] = F^{[g(X)]}(t),$$
which establishes the result.


Proof of Theorem 2.1.6. Note that (trivially) $X_n + a \xrightarrow{D} X + a$ and $aX_n \xrightarrow{D} aX$. Then we may take $a = 0$ without loss of generality since $X_n + Y_n = (X_n + a) + (Y_n - a)$ and $X_n Y_n = X_n(Y_n - a) + aX_n$. Now let $\varepsilon > 0$. To prove the first result, let $x \in \mathbb{R}$ be any point where $F^{(X)}$ is continuous. Then there exists $\delta > 0$ such that
$$F^{(X)}(x) - \frac{\varepsilon}{3} < F^{(X)}(x - \delta) \le F^{(X)}(x) \le F^{(X)}(x + \delta) < F^{(X)}(x) + \frac{\varepsilon}{3} \qquad (2.4.5)$$
and such that $F^{(X)}$ is continuous at $x - \delta$ and $x + \delta$. Now note that for each $n \ge 1$,
$$F^{(X_n)}(x - \delta) - P(|Y_n| > \delta) \le P(X_n + Y_n \le x) \le F^{(X_n)}(x + \delta) + P(|Y_n| > \delta) \qquad (2.4.6)$$
by the same argument as in (2.4.2), (2.4.3), and (2.4.4). Then there exists $N \ge 1$ such that
$$F^{(X_n)}(x - \delta) \ge F^{(X)}(x - \delta) - \frac{\varepsilon}{3}, \qquad F^{(X_n)}(x + \delta) \le F^{(X)}(x + \delta) + \frac{\varepsilon}{3}, \qquad P(|Y_n| > \delta) \le \frac{\varepsilon}{3} \qquad (2.4.7)$$
for all $n \ge N$. Then combining (2.4.5), (2.4.6), and (2.4.7) yields
$$F^{(X)}(x) - \varepsilon \le P(X_n + Y_n \le x) \le F^{(X)}(x) + \varepsilon$$
for all $n \ge N$. Thus, $F^{(X_n + Y_n)}(x) \to F^{(X)}(x)$, which establishes the first result.
To prove the second result, note that it suffices to show that $X_n Y_n \xrightarrow{P} 0$. Let $\varepsilon > 0$, and let $c > 0$ be such that $F^{(X)}$ is continuous at $c$ and $-c$ with $F^{(X)}(-c) \le \varepsilon/5$ and $F^{(X)}(c) \ge 1 - \varepsilon/5$. Now note that since $X_n \xrightarrow{D} X$ and $Y_n \xrightarrow{P} 0$, there exists $N \ge 1$ such that
$$F^{(X_n)}(-c) \le F^{(X)}(-c) + \frac{\varepsilon}{5} \le \frac{2\varepsilon}{5}, \qquad F^{(X_n)}(c) \ge F^{(X)}(c) - \frac{\varepsilon}{5} \ge 1 - \frac{2\varepsilon}{5}, \qquad (2.4.8)$$
$$P(|Y_n| > \varepsilon/c) \le \frac{\varepsilon}{5} \qquad (2.4.9)$$
for all $n \ge N$. Then
$$P(|X_n Y_n| > \varepsilon) \le P(|X_n| > c) + P(|Y_n| > \varepsilon/c) \le F^{(X_n)}(-c) + \left[1 - F^{(X_n)}(c)\right] + P(|Y_n| > \varepsilon/c) \le \varepsilon$$
for all $n \ge N$, where the last inequality is by (2.4.8) and (2.4.9). Then $P(|X_n Y_n| > \varepsilon) \to 0$, which establishes the second result.
