
Probability and Statistics

(ENM 503)
Michael A. Carchidi
October 5, 2015
Chapter 9 - Limit Theorems
The following notes are based on the textbook entitled: A First Course in
Probability by Sheldon Ross (9th edition) and these notes can be viewed at
https://canvas.upenn.edu/
after you log in using your PennKey username and password.
1. Introduction
In this chapter, we shall develop some of the more important and beautiful
aspects of probability, which lie in its inequality and limit theorems. Of these, the
most important and profound are the laws of large numbers and the central limit
theorems, since these seem to explain the statistics associated with many of our
everyday activities. The laws of large numbers are concerned with stating conditions
under which the average of a sequence of random variables approaches the expected average. The central limit theorems, which are much stronger, are concerned
with stating conditions under which the sum of a large number of random variables
approaches the normal distribution, thereby explaining the empirical fact that
so many natural populations exhibit the well-known bell-shaped (normal) curves.
The limit theorems also supply a nice tie between the axiomatic approach to probability and the relative frequency (or measure of belief) approach to probability.
Let us begin with a few inequality results, which are quite simple, but which serve
as necessary stepping stones to the stronger limit results.

2. Chebyshev's Inequality and the Weak Law of Large Numbers


The first step in arriving at the laws of large numbers and the central limit theorems involves a very simple result (with a very simple proof) known as Markov's
inequality.

Markov's Inequality

Markov's inequality states that if X is a random variable that takes on only
non-negative values, so that

P(X < 0) = 0,    (1a)

and if a is any positive real number (i.e., a > 0), then

P(X ≥ a) ≤ (1/a) E(X).    (1b)

Proof: To prove this we note that if I is the random variable defined as

I = 1 when X ≥ a, and I = 0 when X < a,    (2)

then (since a > 0), it should be clear that

I = 1 when 1 ≤ X/a, and I = 0 when 1 > X/a,

and this says that I ≤ X/a for all values of the random variable X and all values
of a > 0. Taking expected values of I ≤ X/a, we have

E(I) ≤ E(X/a) = (1/a) E(X).

But by the definition of I in Equation (2), we see that

E(I) = (1)P(X ≥ a) + (0)P(X < a) = P(X ≥ a)

and so we find that

P(X ≥ a) ≤ (1/a) E(X)

and the proof of the theorem is complete.


Note that this can also be seen using only the calculus (in the case when X
is a continuous random variable with pdf f) by writing

P(X ≥ a) = ∫_a^∞ f(t) dt = (1/a) ∫_a^∞ a f(t) dt ≤ (1/a) ∫_a^∞ t f(t) dt

since the range of integration in ∫_a^∞ (·) dt means that 0 < a ≤ t, and the definition of the pdf says that f(t) ≥ 0. But we
also note that since 0 < a, we have

∫_a^∞ t f(t) dt ≤ ∫_0^a t f(t) dt + ∫_a^∞ t f(t) dt = ∫_0^∞ t f(t) dt

and so

P(X ≥ a) ≤ (1/a) ∫_0^∞ t f(t) dt,

and then since

E(X) = ∫_−∞^∞ t f(t) dt = ∫_−∞^0 t f(t) dt + ∫_0^∞ t f(t) dt = ∫_−∞^0 t (0) dt + ∫_0^∞ t f(t) dt = ∫_0^∞ t f(t) dt

we see that

P(X ≥ a) ≤ (1/a) E(X)

follows. Note that we may also write Equation (1b) as

1 − P(X < a) ≤ (1/a) E(X)

or

P(X < a) ≥ 1 − (1/a) E(X).    (1c)
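
As a quick numerical sanity check (an addition to these notes, not part of Ross's text), the following minimal Python sketch estimates P(X ≥ a) by simulation and compares it with the Markov bound (1/a)E(X). The exponential distribution, the sample size and the values of a are illustrative assumptions; any non-negative distribution would do.

```python
import numpy as np

# A minimal simulation check of Markov's inequality P(X >= a) <= E(X)/a for a
# non-negative random variable; the exponential(mean 2) distribution and the
# values of a below are illustrative choices, not part of the notes.
rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=1_000_000)   # E(X) = 2
for a in (1.0, 2.0, 5.0, 10.0):
    empirical = np.mean(x >= a)       # simulated P(X >= a)
    bound = x.mean() / a              # (1/a) E(X), estimated from the same sample
    print(f"a = {a:4.1f}: P(X >= a) ~ {empirical:.4f}  <=  bound {bound:.4f}")
```
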

Chebyshev's Inequality

An immediate consequence of Markov's inequality is Chebyshev's inequality,
which states that if X is any random variable with finite mean μ and standard
deviation σ, and if k is any positive real number, then

P(|X − μ| ≥ k) ≤ σ²/k².    (3a)

This follows by noting that (X − μ)² is a non-negative random variable, and setting
a = k² > 0 in Markov's inequality, we have

P((X − μ)² ≥ k²) ≤ (1/k²) E((X − μ)²) = σ²/k²

or

P(|X − μ| ≥ k) ≤ σ²/k²

which is Equation (3a). Note that we may also write Equation (3a) as

1 − P(|X − μ| < k) ≤ σ²/k²

or

P(|X − μ| < k) ≥ 1 − σ²/k²

or

P(μ − k < X < μ + k) ≥ 1 − σ²/k².    (3b)

Equations (1b,c) and (3a,b) enable one to derive bounds on probabilities when
only the mean, or both the mean and variance, are known for a random variable
X. Let us now consider two examples.

Example #1: Markov's and Chebyshev's Inequalities

Suppose that the number of items produced in a factory is a random variable
with mean 50; let us see what we can say about the probability that a given
week's production will exceed 74 items, assuming that X is a discrete random
variable. For one, we may use only the mean value of 50 and Markov's inequality
(with a = 75) to say that

P(X > 74) = P(X ≥ 75) ≤ (1/75) E(X) = 50/75 = 2/3.

But suppose that we also know that the variance in a week's production is 25;
then we can say something about the probability that a given week's production
will be between 40 and 60 items. Using Chebyshev's inequality, in the form of
Equation (3b), with k = 10, we have

P(50 − 10 < X < 50 + 10) ≥ 1 − σ²/k² = 1 − (5/10)²

or

P(40 < X < 60) ≥ 3/4

so that the probability that a given week's production is between 40 and 60 is at
least as large as 3/4, regardless of the actual distribution of X.

Since Chebyshev's inequality is valid for all random variables X, we should


not expect the bound on the probability to be very close to the actual probability
in most cases, as the following example shows.

Example #2: Chebyshev's Inequality is Numerically Weak

Suppose that the number of items produced in the factory example above is
normal with mean 50 and variance 25, i.e., X ~ N(50, 25). Then the probability
that a given week's production will be between 40 and 60 items can be computed
exactly as

P(40 < X < 60) = P((40 − 50)/5 < (X − 50)/5 < (60 − 50)/5)

or

P(40 < X < 60) = P(−2 < Z < 2) = Φ(2) − Φ(−2) = Φ(2) − (1 − Φ(2))

or

P(40 < X < 60) = 2Φ(2) − 1 = (2/√(2π)) ∫_−∞^2 exp(−t²/2) dt − 1

or

P(40 < X < 60) ≈ 0.9545

which is much larger than the 3/4 = 0.75 provided by Chebyshev's inequality
in Example #1 above.
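
The comparison in Example #2 can be reproduced numerically. The sketch below (an addition, assuming SciPy is available) computes the exact normal probability 2Φ(2) − 1 and the Chebyshev lower bound 3/4 side by side.

```python
from scipy.stats import norm

# Example #2 revisited: X ~ N(50, 25), so sigma = 5 and P(40 < X < 60) = 2*Phi(2) - 1.
mu, sigma, k = 50.0, 5.0, 10.0
exact = norm.cdf((60 - mu) / sigma) - norm.cdf((40 - mu) / sigma)
chebyshev_lower_bound = 1 - (sigma / k) ** 2
print(f"exact probability     = {exact:.4f}")                 # about 0.9545
print(f"Chebyshev lower bound = {chebyshev_lower_bound:.4f}") # 0.75
```
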
Example #3: Zero Variance Means Not Random at All

Suppose that X is a random variable having mean μ and zero variance; then
Chebyshev's inequality would say that

P(|X − μ| < k) ≥ 1 − σ²/k²   or   P(|X − μ| < k) ≥ 1

since σ² = 0. But since P(|X − μ| < k) ≤ 1 for all probabilities, we see then that

1 ≤ P(|X − μ| < k) ≤ 1

which says that

P(|X − μ| < k) = 1

for all k > 0. Taking the limit as k → 0⁺ then says that

P(|X − μ| ≤ 0) = 1

which says that P(X = μ) = 1. In other words, the only random variables having
zero variance are those that are equal to their means with probability 1, and hence
they are not random at all.

You will notice that in the previous example, we used the limit-inequality
that if f(x) < g(x) for all x > 0, then

lim_{x→0⁺} f(x) ≤ lim_{x→0⁺} g(x).

For example, it is true that f(x) = 1 − x < g(x) = 1 for all x > 0, but

lim_{x→0⁺} (1 − x) < lim_{x→0⁺} (1),

resulting in 1 < 1, is not true, but rather

lim_{x→0⁺} (1 − x) ≤ lim_{x→0⁺} (1),

resulting in 1 ≤ 1, which is a true statement. This is why

P(|X − μ| < k) = 1   and   lim_{k→0⁺} P(|X − μ| < k) = lim_{k→0⁺} (1)

becomes

P(lim_{k→0⁺} |X − μ| < lim_{k→0⁺} k) = 1   or   P(|X − μ| ≤ 0) = 1,

where the < in

P(lim_{k→0⁺} |X − μ| < lim_{k→0⁺} k)

changes to a ≤ in

P(|X − μ| ≤ 0).

The Weak Law of Large Numbers

Let X1, X2, X3, ..., be a sequence of independent and identically distributed
random variables, each having the same finite mean E(Xi) = μ. Then, for any
ε > 0,

lim_{n→∞} P(|(X1 + X2 + X3 + ⋯ + Xn)/n − μ| ≥ ε) = 0.    (4a)

To prove this result (assuming, in addition, that each Xi has a finite variance σ²,
which simplifies the proof), we define the sequence of random variables

Yn = (X1 + X2 + X3 + ⋯ + Xn)/n

for n = 1, 2, 3, ..., and note that

E(Yn) = E((X1 + X2 + X3 + ⋯ + Xn)/n) = (1/n) E(X1 + X2 + X3 + ⋯ + Xn)

or

E(Yn) = (1/n)(E(X1) + E(X2) + E(X3) + ⋯ + E(Xn)) = (1/n)(μ + μ + μ + ⋯ + μ)

or E(Yn) = μ for n = 1, 2, 3, .... We also have

V(Yn) = V((X1 + X2 + X3 + ⋯ + Xn)/n) = (1/n²) V(X1 + X2 + X3 + ⋯ + Xn)

or, since the Xi's are independent,

V(Yn) = (1/n²)(V(X1) + V(X2) + V(X3) + ⋯ + V(Xn)) = (1/n²)(σ² + σ² + σ² + ⋯ + σ²)

or V(Yn) = σ²/n for n = 1, 2, 3, .... Then, using Chebyshev's inequality with
X = Yn and k = ε, we have

P(|Yn − E(Yn)| ≥ ε) ≤ V(Yn)/ε²

or

P(|Yn − μ| ≥ ε) ≤ σ²/(nε²)

or

P(|(X1 + X2 + X3 + ⋯ + Xn)/n − μ| ≥ ε) ≤ σ²/(nε²)    (4b)

for n = 1, 2, 3, .... Taking the limit as n → ∞, we have

lim_{n→∞} P(|(X1 + X2 + X3 + ⋯ + Xn)/n − μ| ≥ ε) ≤ lim_{n→∞} σ²/(nε²) = 0

which says that

lim_{n→∞} P(|(X1 + X2 + X3 + ⋯ + Xn)/n − μ| ≥ ε) = 0

for any ε > 0, and the result is now proven. It should be noted that we may not
take the limit of this result as ε → 0⁺, since then the limit

lim_{n→∞} σ²/(nε²) = 0

taken above would no longer be true.
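
The bound (4b) can be watched in action. The following illustrative sketch (not part of the original notes) simulates many sample paths of uniform(0, 1) random variables and reports the fraction of paths whose average lies more than ε away from μ = 1/2, together with the Chebyshev bound σ²/(nε²); the distribution, ε and the number of paths are arbitrary choices.

```python
import numpy as np

# Weak law of large numbers, illustrated: the fraction of sample paths whose
# average is more than eps away from mu shrinks as n grows, and is always
# below the Chebyshev bound sigma^2 / (n eps^2).
rng = np.random.default_rng(1)
mu, eps, paths = 0.5, 0.02, 5000
for n in (100, 1000, 10000):
    samples = rng.uniform(0.0, 1.0, size=(paths, n))
    averages = samples.mean(axis=1)
    fraction_far = np.mean(np.abs(averages - mu) >= eps)
    chebyshev_bound = (1.0 / 12.0) / (n * eps**2)   # sigma^2 = 1/12 for uniform(0, 1)
    print(f"n = {n:6d}: P(|avg - mu| >= eps) ~ {fraction_far:.4f} <= {chebyshev_bound:.4f}")
```
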
3. The Central Limit Theorem

Suppose that a large number n of independent random variables X1, X2,
X3, ..., Xn, having means E(Xi) = μi and variances V(Xi) = σi², satisfy the
assumptions that: (a) the Xi's are uniformly bounded, which means that there is
an M such that

P(|Xi| < M) = 1

for all i, and (b)

lim_{n→∞} ∑_{i=1}^n σi² = ∞,

then the central limit theorem states that the sum of these Xi's,

X = ∑_{i=1}^n Xi,

has a distribution that is approximately normal with mean and variance given
by

μ = μ1 + μ2 + μ3 + ⋯ + μn   and   σ² = σ1² + σ2² + σ3² + ⋯ + σn².

Therefore, this theorem provides a simple method for computing approximate
probabilities for large sums of independent random variables via the standard
normal cdf

Φ(z) = (1/√(2π)) ∫_−∞^z exp(−t²/2) dt.

In addition, this theorem helps explain the remarkable fact that the empirical
frequencies of so many natural populations exhibit bell-shaped (normal) curves, as
shown in the following figure.

[Figure: A Bell-Shaped Curve]

One simpler form of the central limit theorem states that if X1, X2, X3, ..., Xn
is a sequence of independent and identically distributed random variables, each
having mean μ and variance σ², and if

Zn = (X1 + X2 + X3 + ⋯ + Xn − nμ)/(σ√n)    (5a)

for n = 1, 2, 3, ..., then, for any −∞ < z < +∞,

lim_{n→∞} P(Zn ≤ z) = Φ(z) = (1/√(2π)) ∫_−∞^z exp(−t²/2) dt.    (5b)

This says that the Zn's tend to the standard normal distribution as n gets large.
To prove this result, let us assume that the moment generating function of each
of the Xi's (denoted by M_Xi(t) = M(t)) exists and is finite. Now if E(Xi) = μ
and V(Xi) = σ², we first note that

Zn = (X1 + X2 + X3 + ⋯ + Xn − nμ)/(σ√n)
   = [(X1 − μ)/σ + (X2 − μ)/σ + (X3 − μ)/σ + ⋯ + (Xn − μ)/σ]/√n
   = (X1* + X2* + X3* + ⋯ + Xn*)/√n

where Xi* = (Xi − μ)/σ, and E(Xi*) = 0 and V(Xi*) = 1. Thus we now have

Zn = (X1* + X2* + X3* + ⋯ + Xn*)/√n

with E(Xi*) = 0 and V(Xi*) = 1. Then

M_Zn(t) = E(exp(t(X1* + X2* + X3* + ⋯ + Xn*)/√n))
        = E(exp(tX1*/√n) exp(tX2*/√n) exp(tX3*/√n) ⋯ exp(tXn*/√n)).

Since the Xi's, and hence the Xi*'s, are independent, the exp(tXi*/√n)'s are also
independent and so

M_Zn(t) = E(exp(tX1*/√n)) E(exp(tX2*/√n)) ⋯ E(exp(tXn*/√n)),

but since the Xi*'s are all identically distributed, we have

M_Zn(t) = M(t/√n) M(t/√n) ⋯ M(t/√n)

and so

M_Zn(t) = (M(t/√n))^n   where   M(t) = E(exp(tXj*)).

Taking the natural logarithm of both sides of this equation, we have

ln(M_Zn(t)) = n L(t/√n)   where   L(t) = ln(M(t)).    (6)

Now

L(0) = ln(M(0)) = ln(1) = 0

and, using what we know about moment generating functions from the previous
chapter, we have

L′(t) = M′(t)/M(t)   so that   L′(0) = M′(0)/M(0) = E(Xi*)/1 = 0

and

L″(t) = (M″(t)M(t) − M′(t)M′(t))/(M(t))²

so that

L″(0) = (M″(0)M(0) − M′(0)M′(0))/(M(0))² = (E((Xi*)²)(1) − E(Xi*)E(Xi*))/(1)²

or

L″(0) = E((Xi*)²) − (E(Xi*))² = V(Xi*) = 1.

Now taking the limit of Equation (6) as n → ∞, we find that

lim_{n→∞} ln(M_Zn(t)) = lim_{n→∞} n L(t n^(−1/2)) = lim_{n→∞} L(t n^(−1/2))/n^(−1),

which is of the indeterminate form 0/0 (since L(0) = 0), and so using L'Hopital's
rule (differentiating numerator and denominator with respect to n), we have

lim_{n→∞} ln(M_Zn(t)) = lim_{n→∞} [−(1/2) t n^(−3/2) L′(t n^(−1/2))]/[−n^(−2)]

or

lim_{n→∞} ln(M_Zn(t)) = (1/2) t lim_{n→∞} L′(t n^(−1/2))/n^(−1/2),

which is again of the form 0/0 (since L′(0) = 0), and using L'Hopital's rule again,
we find that

lim_{n→∞} ln(M_Zn(t)) = (1/2) t lim_{n→∞} [−(1/2) t n^(−3/2) L″(t n^(−1/2))]/[−(1/2) n^(−3/2)] = (1/2) t² lim_{n→∞} L″(t n^(−1/2)),

which says that

lim_{n→∞} ln(M_Zn(t)) = (1/2) t² L″(0) = t²/2.

This says that

lim_{n→∞} M_Zn(t) = exp(t²/2) = M_Z(t)   where   Z ~ N(0, 1),

and using

lim_{n→∞} M_Zn(t) = M_{lim Zn}(t) = M_Z(t)

and the uniqueness of the moment generating function, we conclude that

lim_{n→∞} Zn = Z ~ N(0, 1)

and the proof is complete.
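
A small simulation makes the convergence in (5b) visible. The sketch below (an addition, assuming NumPy and SciPy are available) standardizes sums of i.i.d. exponential(1) random variables, for which μ = σ = 1, and compares the empirical P(Zn ≤ z) with Φ(z); the distribution, the sample sizes and the test points z are illustrative choices.

```python
import numpy as np
from scipy.stats import norm

# Central limit theorem, illustrated: standardized sums of i.i.d. exponential(1)
# random variables (mu = sigma = 1) are compared with the standard normal cdf.
rng = np.random.default_rng(2)
mu = sigma = 1.0
for n in (5, 30, 200):
    sums = rng.exponential(scale=1.0, size=(100_000, n)).sum(axis=1)
    z_n = (sums - n * mu) / (sigma * np.sqrt(n))
    for z in (-1.0, 0.0, 1.0):
        print(f"n = {n:3d}, z = {z:+.1f}: P(Z_n <= z) ~ {np.mean(z_n <= z):.4f}"
              f"  vs  Phi(z) = {norm.cdf(z):.4f}")
```
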


Example #4: Measuring Large Distances - Normal Approximation
An astronomer is interested in measuring the distance (in light years) from
the earth to a distant star. Although the astronomer has a measuring technique,
he knows that because of changing atmospheric conditions and normal error, each
time a measurement is made, it will not yield the same distance and hence not
the exact distance every time. Instead, merely an estimate to the exact distance
is measured each time. As a result, the astronomer plans to make a series of
measurements and then use the average of these as an estimate for the distance
from the earth to the star. If all of these measurements are independent and
identically distributed (which are both reasonable assumptions), and if each measurement has a common mean d (the actual distance in light years) and a common
standard deviation of 2 (light years), as governed by the measuring conditions,
12

how many measurements must the astronomer make if he wants to be at least


95% certain that his estimate is accurate to within 0.5 light years?
To solve this we assume that the astronomer makes n measurements denoted
by X1 , X2 , X3 , ..., Xn such that E(Xi ) = d (in light years) and (Xi ) = 2 light
years. Then from the central limit theorem, it follows that
Zn =

X1 + X2 + X3 + + Xn nd

N(0, 1)
n

for large n. Setting


An =

X1 + X2 + X3 + + Xn
n

we see that

An = d + Zn .
n

Then

P (d 0.5 An d + 0.5) = P d 0.5 d + Zn d + 0.5


n
or

or

or



n
n
Zn
P (d 0.5 An d + 0.5) = P
2
2


n
n

P (d 0.5 An d + 0.5) =
2
2

n
1.
P (d 0.5 An d + 0.5) = 2
2

Setting this so that


n
1 0.95,
P (d 0.5 An d + 0.5) = 2
2
we find that

n

0.975
2

n
1 (0.975) = 1.96.
2

or

13

This says that


n (3.92)2 = (3.92 2)2 = 61.47

so that the astronomer must take at least 62 observations.


Example #5: Measuring Large Distances - Chebyshev's Inequality

The previous example assumes that n = 62 is large enough so that the
central limit theorem is valid. If the astronomer is worried about the validity
of this, he may solve the problem using Chebyshev's inequality, which does not
require a large-n assumption. Toward this end, we have

E(Xi) = d   and   V(Xi) = σ²

for each i = 1, 2, 3, ..., n, and with

An = (1/n) ∑_{i=1}^n Xi

we have

E(An) = d   and   V(An) = σ²/n.

Using Chebyshev's inequality with X = An and k = 0.5, we then have

P(|An − d| ≥ k) ≤ V(An)/k² = σ²/(nk²)

and so

1 − P(|An − d| < k) ≤ σ²/(nk²)

or

P(|An − d| < k) ≥ 1 − σ²/(nk²)

or

P(d − k < An < d + k) ≥ 1 − σ²/(nk²).

Setting σ = 2 and setting this probability larger than (or equal to) 0.95, we find
that

1 − σ²/(nk²) ≥ 0.95

yielding

n ≥ σ²/((0.05)k²) = 2²/((0.05)(0.5)²) = 320,

which is more than five times the value of 62 obtained in the previous example.
Whether to take 62 measurements or 320 measurements must now be weighed
against the amount of time and cost each measurement takes.
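
The two sample-size requirements of Examples #4 and #5 can be recomputed directly, as in the illustrative sketch below (an addition, assuming SciPy is available for the normal quantile).

```python
import math
from scipy.stats import norm

# Examples #4 and #5 revisited: the number of measurements needed so that the
# average (sigma = 2) is within 0.5 of the true distance with probability >= 0.95,
# first via the CLT and then via Chebyshev's inequality.
sigma, half_width, confidence = 2.0, 0.5, 0.95

z = norm.ppf(0.5 + confidence / 2)                 # Phi^{-1}(0.975) = 1.96
n_clt = math.ceil((z * sigma / half_width) ** 2)   # (3.92 * 2)^2 -> 62

n_cheb = math.ceil(sigma**2 / ((1 - confidence) * half_width**2))  # 320

print(f"CLT-based sample size       : {n_clt}")    # 62
print(f"Chebyshev-based sample size : {n_cheb}")   # 320
```
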

Example #6: A Poisson Distribution

Suppose that the number of students who enroll in a psychology course is a
Poisson random variable with mean E(X) = 100. The professor in charge of the
course has decided that if the number enrolling is 120 or more, he will teach the
course in two separate sections, whereas if fewer than 120 students enroll, he will
teach all the students together in a single section. Determine the probability that
the professor will have to teach two sections.

The exact solution is

P(X ≥ 120) = ∑_{k=120}^∞ e^(−100) (100)^k / k! = 1 − ∑_{k=0}^{119} e^(−100) (100)^k / k!

which gives

P(X ≥ 120) = 1 − 0.9718 = 0.0282

or P(X ≥ 120) ≈ 2.82%. Using the fact that a Poisson random variable with
mean 100 is the sum of 100 independent Poisson random variables, each with mean
E(Xi) = 1 and V(Xi) = 1, we can make use of the central limit theorem and the
fact that

E(X) = 100 E(Xi) = 100   and   V(X) = 100 V(Xi) = 100

to approximate P(X ≥ 120) as P(X ≥ 119.5), which is the continuity correction,
and get

P(X ≥ 119.5) = P((X − 100)/10 ≥ (119.5 − 100)/10) = P(Z ≥ 1.95) = 1 − Φ(1.95)

which says that P(X ≥ 119.5) = 1 − 0.9744 = 0.0256, which is about

|0.0256 − 0.0282|/0.0282 × 100% = 9.22%

or 9.22% lower than the exact result.
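
The exact Poisson tail and its continuity-corrected normal approximation in Example #6 can be checked with a few lines (an addition, assuming SciPy is available):

```python
from scipy.stats import norm, poisson

# Example #6 revisited: exact P(X >= 120) for X ~ Poisson(100) versus the
# normal approximation with continuity correction.
lam = 100
exact = 1 - poisson.cdf(119, lam)                  # about 0.0282
approx = 1 - norm.cdf((119.5 - lam) / lam**0.5)    # 1 - Phi(1.95), about 0.0256
print(f"exact  P(X >= 120)   = {exact:.4f}")
print(f"normal approximation = {approx:.4f}")
print(f"relative difference  = {100 * (approx - exact) / exact:.2f}%")
```
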


Example #7: Rolling the Dice

If 10 fair six-sided dice are rolled, find the approximate probability that the
sum obtained is between 30 and 40, inclusive. To solve this we let Xi be the
random variable on the roll of the i-th die so that

E(Xi) = (1/6)(1) + (1/6)(2) + (1/6)(3) + (1/6)(4) + (1/6)(5) + (1/6)(6) = 7/2

and

E(Xi²) = (1/6)(1)² + (1/6)(2)² + (1/6)(3)² + (1/6)(4)² + (1/6)(5)² + (1/6)(6)² = 91/6

and

V(Xi) = 91/6 − (7/2)² = 35/12.

Then since X = X1 + X2 + ⋯ + X10, we have (after using the continuity correction)

P(29.5 ≤ X ≤ 40.5) = P((29.5 − 10(7/2))/√(10(35/12)) ≤ (X − 10(7/2))/√(10(35/12)) ≤ (40.5 − 10(7/2))/√(10(35/12))),

which gives

P(29.5 ≤ X ≤ 40.5) = P(−1.0184 ≤ Z ≤ 1.0184) = 2Φ(1.0184) − 1

or P(29.5 ≤ X ≤ 40.5) ≈ 2(0.8458) − 1 ≈ 0.6915.
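
For Example #7 the exact distribution of the sum of 10 dice is small enough to compute by repeated convolution, which gives a feel for the quality of the normal approximation. The sketch below is an addition to the notes, assuming NumPy and SciPy are available.

```python
import numpy as np
from scipy.stats import norm

# Example #7 revisited: exact distribution of the sum of 10 fair dice (by repeated
# convolution) versus the continuity-corrected normal approximation.
die = np.ones(6) / 6.0                    # pmf of one die on faces 1..6
pmf = np.array([1.0])                     # pmf of the sum of zero dice
for _ in range(10):
    pmf = np.convolve(pmf, die)           # add one more die
# After 10 convolutions, index j of pmf corresponds to the sum j + 10 (10..60).
sums = np.arange(10, 61)
exact = pmf[(sums >= 30) & (sums <= 40)].sum()

mu, var = 10 * 3.5, 10 * 35 / 12
approx = norm.cdf((40.5 - mu) / var**0.5) - norm.cdf((29.5 - mu) / var**0.5)
print(f"exact  P(30 <= X <= 40) = {exact:.4f}")
print(f"normal approximation    = {approx:.4f}")   # about 0.6915
```
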
Example #8: Designing a Parking Lot

Prior to the construction of a new mega shopping mall, the builders must plan
the size of the mall's parking lot in terms of the number of shoppers that can be
simultaneously in the mall. If the mall is to thrive, customers are expected to
arrive at a constant rate of 625 per hour, according to a Poisson process, and they
are expected to shop for an average of 4 hours. In the real mall there will be an
upper limit on the number of shoppers, but for planning purposes, the builders can
pretend that the number of shoppers is infinite. Determine the minimum number
of parking places the builders should plan to make if they want to ensure that
they have an adequate capacity at least 97.5% of the time. You may assume that
there is only one shopper per car.

The solution to this problem involves modeling the system as a self-service
queueing system with arrival rate λ = 625 customers/hour and mean shopping time
E(S) = 1/μ = 4 hours, so that the average number of customers in the mall at any
given time is L = λ/μ = 2500, and this number is distributed in a Poisson way, so
that the builders should plan on the number of shoppers to be the smallest value
of c such that the probability there are no more than c shoppers is

P(N ≤ c) = ∑_{n=0}^c Pn = ∑_{n=0}^c (L^n/n!) e^(−L) ≥ p = 0.975.

Solving this for c is quite difficult since L = 2500 is large, making e^(−L) very small
and L^n very large. However, we know from our discussion of Example #6 above
that a normal approximation to the Poisson distribution using E(X) = L and
V(X) = L = 2500 can be used, and using this we have

P(N ≤ c) = P((N − E(X))/√V(X) ≤ (c − E(X))/√V(X)) = P((N − L)/√L ≤ (c − L)/√L)

or, since (N − L)/√L ≈ N(0, 1),

Φ((c − L)/√L) ≥ p

which leads to

c ≥ L + √L Φ⁻¹(p).

With the numbers given in the problem, namely L = 2500 and p = 0.975, we see
that

c ≥ 2500 + √2500 Φ⁻¹(0.975) = 2500 + 50(1.96) = 2598

which means the builders should plan on having a parking complex that can
accommodate at least c_min = 2598 cars.
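
Since modern libraries can evaluate Poisson quantiles even for L = 2500, the normal approximation in Example #8 can be compared with the exact answer, as in this illustrative sketch (an addition, assuming SciPy is available).

```python
from scipy.stats import norm, poisson

# Example #8 revisited: smallest c with P(N <= c) >= 0.975 for N ~ Poisson(2500),
# computed with the exact Poisson quantile function and via the normal
# approximation c = L + sqrt(L) * Phi^{-1}(p) used in the notes.
L, p = 2500, 0.975
c_exact = poisson.ppf(p, L)               # exact Poisson quantile
c_normal = L + L**0.5 * norm.ppf(p)       # 2500 + 50 * 1.96 = 2598
print(f"exact Poisson quantile : {c_exact:.0f}")
print(f"normal approximation   : {c_normal:.0f}")
```
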

Example #9: The Sum of Uniform Random Variables

Suppose that Xi (for i = 1, 2, ..., n) are all independent and identical random
variables, each uniformly distributed over the interval from a to b (with b > a).
Since

E(Xi) = ∫_a^b x/(b − a) dx = (a + b)/2

and

E(Xi²) = ∫_a^b x²/(b − a) dx = (a² + ab + b²)/3

and

V(Xi) = E(Xi²) − (E(Xi))² = (a² + ab + b²)/3 − ((a + b)/2)² = (b − a)²/12,

we see that

Yn = ∑_{k=1}^n Xi ≈ N(μ, σ²)

with

μ = ∑_{k=1}^n E(Xi) = (a + b)n/2   and   σ² = ∑_{k=1}^n V(Xi) = (b − a)²n/12,

and so

P(Yn > c) = P((Yn − μ)/σ > (c − μ)/σ)

or

P(Yn > c) = 1 − Φ((c − μ)/σ).

Using a = 0, b = 1, n = 10 and c = 6, we find that μ = 5, σ² = 5/6 and

P(Y10 > 6) = 1 − Φ((6 − 5)/√(5/6)) = 1 − Φ(√(6/5))

or

P(Y10 > 6) = 1 − (1/√(2π)) ∫_−∞^√(6/5) exp(−t²/2) dt

which reduces to P(Y10 > 6) ≈ 0.1367.
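
A simulation check of Example #9 (an addition to the notes, assuming NumPy and SciPy are available); the number of replications is an arbitrary choice.

```python
import numpy as np
from scipy.stats import norm

# Example #9 revisited: P(Y_10 > 6) for the sum of 10 independent uniform(0, 1)
# random variables, estimated by simulation and by the normal approximation.
rng = np.random.default_rng(3)
sums = rng.uniform(0.0, 1.0, size=(1_000_000, 10)).sum(axis=1)
simulated = np.mean(sums > 6.0)

mu, var = 10 * 0.5, 10 / 12.0
approx = 1 - norm.cdf((6.0 - mu) / var**0.5)   # 1 - Phi(sqrt(6/5)), about 0.1367
print(f"simulated P(Y_10 > 6) = {simulated:.4f}")
print(f"normal approximation  = {approx:.4f}")
```
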

Example #10: Grading Exams

An instructor has N exams that will be graded in sequence. Suppose that the
times required to grade the N exams are independent and identical with mean μ
and standard deviation σ. Estimate the probability that the instructor will grade
at least n of the exams in the first T minutes of work. Letting E(Xi) = μ and
V(Xi) = σ², the total time to grade the first n exams is

Xn = ∑_{i=1}^n Xi

and we want to estimate P(Xn ≤ T). Now

E(Xn) = ∑_{i=1}^n E(Xi) = nμ   and   V(Xn) = ∑_{i=1}^n V(Xi) = nσ².

Then

P(Xn ≤ T) = P((Xn − nμ)/(σ√n) ≤ (T − nμ)/(σ√n)) ≈ Φ((T − nμ)/(σ√n)).

Using n = 25, μ = 20 minutes, σ = 4 minutes and T = 450 minutes, we find that

P(X25 ≤ 450) ≈ Φ((450 − 25(20))/(4(5))) = Φ(−5/2) = (1/√(2π)) ∫_−∞^(−5/2) exp(−t²/2) dt

or P(X25 ≤ 450) ≈ 0.0062.


4. The Strong Law of Large Numbers

The strong law of large numbers is probably the best-known result in probability
theory. It simply states that the average of a sequence of independent random
variables X1, X2, X3, ..., Xn having a common distribution will, with probability
1, converge to the mean (μ) of that common distribution. In other words,

P(lim_{n→∞} (1/n) ∑_{i=1}^n Xi = μ) = 1.    (7)

Computing Probabilities Using The Strong Law of Large Numbers

As an important application of the strong law of large numbers, suppose that
a sequence of independent trials of some experiment is performed and suppose
that E is some fixed event of the experiment that occurs with probability P(E)
on any particular trial. Letting

Xi = 1 if E does occur on the i-th trial, and Xi = 0 if E does not occur on the i-th trial,    (8)

for i = 1, 2, 3, ..., then by the strong law of large numbers

lim_{n→∞} (1/n) ∑_{i=1}^n Xi = (1)P(Xi = 1) + (0)P(Xi = 0) = (1)P(E) + (0)P(Ē)

or

P(E) = lim_{n→∞} (1/n) ∑_{i=1}^n Xi,   which we may also write as   P(E) ≈ (1/n) ∑_{i=1}^n Xi    (9)

for large n. This result is very important in Simulation Theory, which is covered
in detail in the ESE 603 course.
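
Equation (9) is exactly the recipe used in Monte Carlo simulation. As a small illustration (an addition to the notes), the sketch below estimates the probability that two fair dice sum to 7, an event whose probability 1/6 is known, by its long-run relative frequency.

```python
import numpy as np

# Equation (9) in practice: estimating P(E) by the long-run fraction of trials in
# which E occurs. Here E is the illustrative event that two fair dice sum to 7.
rng = np.random.default_rng(4)
n = 1_000_000
rolls = rng.integers(1, 7, size=(n, 2))            # two dice per trial
indicators = (rolls.sum(axis=1) == 7).astype(float)
print(f"relative frequency after {n} trials: {indicators.mean():.5f}  (exact 1/6 = {1/6:.5f})")
```
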
To prove the strong law of large numbers, let us assume (although the theorem
can be proven without this assumption) that the random variable Xi has its first
four moments finite, so that E(Xi) = μ < ∞, E(Xi²) = σ² + μ² < ∞,
E(Xi³) = L < ∞ and E(Xi⁴) = K < ∞. To begin, note that E(Xi) = μ so
that Yi = Xi − μ has E(Yi) = 0. Setting

Sn = ∑_{i=1}^n Yi

we see that

Sn⁴ = (∑_{i=1}^n Yi)⁴ = ∑_{p=1}^n ∑_{q=1}^n ∑_{r=1}^n ∑_{s=1}^n Yp Yq Yr Ys.

The right-hand side of this expression will contain terms of the form

Yi⁴,   Yi³Yj,   Yi²Yj²,   Yi²YjYk   and   YiYjYkYl,

where the i, j, k and l are all different. Since E(Yi) = 0 and all the Yi's are
independent, we see that E(Yi³Yj) = E(Yi³)E(Yj) = 0 along with E(Yi²YjYk) =
E(YiYjYkYl) = 0, so that (keeping track of the multinomial coefficient 4!/(2!2!) = 6,
i.e., 3 for each ordered pair i ≠ j, attached to the Yi²Yj² terms)

E(Sn⁴) = ∑_{i=1}^n E(Yi⁴) + 3 ∑_{i=1}^n ∑_{j≠i} E(Yi²)E(Yj²).

But

E(Yi⁴) = E((Xi − μ)⁴) = E(Xi⁴ − 4μXi³ + 6μ²Xi² − 4μ³Xi + μ⁴)
       = E(Xi⁴) − 4μE(Xi³) + 6μ²E(Xi²) − 4μ³E(Xi) + μ⁴
       = K − 4μE(Xi³) + 6μ²E(Xi²) − 4μ⁴ + μ⁴
       = K − 4μE(Xi³) + 6μ²E(Xi²) − 3μ⁴

and

E(Yj²) = E(Yi²) = E((Xi − μ)²) = V(Xi) = σ².

Then

E(Sn⁴) = ∑_{i=1}^n (K − 4μE(Xi³) + 6μ²E(Xi²) − 3μ⁴) + 3 ∑_{i=1}^n ∑_{j≠i} V(Xi)V(Xj)
       = ∑_{i=1}^n (K − 4μE(Xi³) + 6μ²(σ² + μ²) − 3μ⁴) + 3 ∑_{i=1}^n ∑_{j≠i} σ²σ²
       = n(K + 6μ²σ² + 3μ⁴) − 4μ ∑_{i=1}^n E(Xi³) + 3n(n − 1)σ²σ²,

and setting E(Xi³) = L, we have

E(Sn⁴) = n(K + 6μ²σ² + 3μ⁴) − 4nμL + 3n(n − 1)σ²σ²,

so that

E(Sn⁴/n⁴) = (K + 6μ²σ² + 3μ⁴ − 4μL − 3σ⁴)/n³ + 3σ⁴/n².

Then

∑_{n=1}^∞ E(Sn⁴/n⁴) = E(∑_{n=1}^∞ Sn⁴/n⁴) = (K + 6μ²σ² + 3μ⁴ − 4μL − 3σ⁴) ∑_{n=1}^∞ 1/n³ + 3σ⁴ ∑_{n=1}^∞ 1/n² < ∞

since

∑_{n=1}^∞ 1/n³ = ζ(3) ≈ 1.202   and   ∑_{n=1}^∞ 1/n² = ζ(2) = π²/6 ≈ 1.645

are both finite. But the fact that

E(∑_{n=1}^∞ Sn⁴/n⁴) = E(∑_{n=1}^∞ (Sn/n)⁴) < ∞

says that

∑_{n=1}^∞ (Sn/n)⁴ < ∞

with probability one, and hence

lim_{n→∞} (Sn/n)⁴ = 0   or   lim_{n→∞} Sn/n = 0

with probability 1. Thus we find that

lim_{n→∞} Sn/n = lim_{n→∞} (1/n) ∑_{i=1}^n Yi = lim_{n→∞} (1/n) ∑_{i=1}^n (Xi − μ) = lim_{n→∞} ((1/n) ∑_{i=1}^n Xi) − μ = 0

or

lim_{n→∞} (1/n) ∑_{i=1}^n Xi = μ

with probability 1, and now the proof is complete.


Many students are initially confused about the difference between the weak
and strong laws of large numbers. The weak law of large numbers states that for
any specified large value of n, say n*,

(X1 + X2 + X3 + ⋯ + X_{n*})/n*

is likely to be near μ. However, it does not say that

(X1 + X2 + X3 + ⋯ + Xn)/n

is bound to stay near μ for all values of n larger than n*. Thus it leaves open the
possibility that large values of

|(X1 + X2 + X3 + ⋯ + Xn)/n − μ|

can occur infinitely often (though at infrequent intervals). The strong form shows
that this cannot occur. In particular, it implies that, with probability 1, for any ε > 0,

|(X1 + X2 + X3 + ⋯ + Xn)/n − μ| > ε

only for a finite number of values of n.


5. Other Inequalities

We are sometimes confronted with situations in which we are interested in
obtaining an upper bound for a probability of the form P(X − μ ≥ a), where
a > 0, when only the mean μ = E(X) and variance σ² = V(X) of the random
variable X are known. We may use Chebyshev's inequality and the fact that

P(E) ≤ P(E ∪ F)    (10)

for any two events E and F, and

|A| ≥ a is equivalent to A ≥ a or A ≤ −a

for a > 0, to write

P(X − μ ≥ a) ≤ P((X − μ ≥ a) ∪ (X − μ ≤ −a)) = P(|X − μ| ≥ a) ≤ σ²/a²

so that

P(X − μ ≥ a > 0) ≤ σ²/a².    (11a)

But the following result (known as the one-sided Chebyshev's inequality) says
that we can do better, in that it states

P(X − μ ≥ a > 0) ≤ σ²/(a² + σ²)    (11b)

and

P(X − μ ≤ −a < 0) ≤ σ²/(a² + σ²).    (11c)

To prove this we set Y = X − μ so that E(Y) = E(X) − μ = 0 and V(Y) =
V(X) = σ². Suppose that b > 0; then

Y ≥ a is equivalent to Y + b ≥ a + b.

Hence

P(Y ≥ a) = P(Y + b ≥ a + b) ≤ P((Y + b ≥ a + b) ∪ (Y + b ≤ −(a + b)))

or

P(Y ≥ a) ≤ P(|Y + b| ≥ a + b) = P((Y + b)² ≥ (a + b)²)

since

|Y + b| ≥ a + b is equivalent to (Y + b)² ≥ (a + b)²

for a + b > 0. Applying Markov's inequality, we may write that

P(Y ≥ a) ≤ P((Y + b)² ≥ (a + b)²) ≤ E((Y + b)²)/(a + b)² = E(Y² + 2bY + b²)/(a + b)²

or

P(Y ≥ a) ≤ (E(Y²) + 2bE(Y) + b²)/(a + b)² = (σ² + 2b(0) + b²)/(a + b)²

or

P(Y ≥ a) ≤ (σ² + b²)/(a + b)² = g(b)

for all b > 0. We may now choose the value of b which minimizes g(b) by setting

g′(b) = d/db [(σ² + b²)/(a + b)²] = 2(ab − σ²)/(a + b)³ = 0

resulting in b = σ²/a and hence

g_min = g(σ²/a) = (σ² + (σ²/a)²)/(a + σ²/a)² = σ²/(a² + σ²)

which then says that

P(Y ≥ a) ≤ g_min = σ²/(a² + σ²)

and the proof of Equation (11b) is complete. To prove Equation (11c) we use the
fact that

Y ≤ −a is equivalent to −Y + b ≥ a + b

and proceed along similar lines.


Example #11: The One-Sided Chebyshev's Inequality

Suppose that the number of items produced in a factory during a week is a
random variable with mean μ = 100 and variance σ² = 400. Let us compute an
upper bound on the probability that this week's production will be at least 120.
To do this we use

P(X ≥ 120) = P(X − 100 ≥ 20) = P(Y ≥ 20) ≤ σ²/(a² + σ²) = 400/((20)² + 400)

so that

P(X ≥ 120) ≤ 1/2 = 0.5.

Hence the probability that this week's production will be at least 120 is at most
one-half. Note that Equation (11a) would give the weaker (and quite useless) result

P(X ≥ 120) ≤ σ²/a² = 400/(20)² = 1.

If we attempted to obtain a bound using Markov's inequality (which does not
utilize the variance σ²), we would get

P(X ≥ 120) ≤ E(X)/a = 100/120 = 5/6,

which is a weaker result than P(X ≥ 120) ≤ 1/2.
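
The three bounds of Example #11 are easy to tabulate (an illustrative addition to the notes):

```python
# Example #11 revisited: the three bounds on P(X >= 120) when mu = 100 and
# sigma^2 = 400, computed side by side.
mu, var, threshold = 100.0, 400.0, 120.0
a = threshold - mu                              # a = 20
one_sided = var / (a**2 + var)                  # 400/(400 + 400) = 0.5
two_sided = var / a**2                          # 400/400 = 1 (useless)
markov = mu / threshold                         # 100/120 = 5/6
print(f"one-sided Chebyshev bound : {one_sided:.4f}")
print(f"two-sided Chebyshev bound : {two_sided:.4f}")
print(f"Markov bound              : {markov:.4f}")
```
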

It should be noted that Equations (11b,c) may also be written as

P(X ≥ μ + a) ≤ σ²/(a² + σ²)    (12a)

and

P(X ≤ μ − a) ≤ σ²/(a² + σ²)    (12b)

for any a > 0.


Example #12: The One-Sided Chebyshev's Inequality

A set of 200 people consisting of 100 men and 100 women is randomly divided
into 100 pairs of 2 each. Let us compute an upper bound on the probability that
at most 30 of these pairs will consist of a man and a woman. To solve this, let us
number the men from 1 to 100 and, for i = 1, 2, 3, ..., 100, let

Xi = 1 if man i is paired with a woman, and Xi = 0 if man i is not paired with a woman,

so that

E(Xi) = (1)P(Xi = 1) + (0)P(Xi = 0) = P(Xi = 1).

Then the number of man-woman pairs is given by

X = ∑_{j=1}^{100} Xj.

Because man i is equally likely to be paired with any of the other 199 people, of
which 100 are women, we have

E(Xi) = P(Xi = 1) = 100/199.

Similarly, for i ≠ j,

E(XiXj) = P(Xi = 1 | Xj = 1) P(Xj = 1) = (99/197)(100/199).

Note that

P(Xi = 1 | Xj = 1) = 99/197

follows because, given that man j is paired with a woman, man i is equally likely
to be paired with any of the remaining 197 people, of which 99 are women. Using
these results, we now have

E(X) = E(∑_{j=1}^{100} Xj) = ∑_{j=1}^{100} E(Xj) = ∑_{j=1}^{100} P(Xj = 1) = 100 (100/199)

or

E(X) = 10000/199 ≈ 50.251,

and

V(X) = V(∑_{j=1}^{100} Xj) = ∑_{j=1}^{100} V(Xj) + 2 ∑_{i<j} Cov(Xi, Xj)
     = ∑_{j=1}^{100} (E(Xj²) − (E(Xj))²) + 2 ∑_{i<j} (E(XiXj) − E(Xi)E(Xj))
     = ∑_{j=1}^{100} (P(Xj = 1) − (P(Xj = 1))²) + 2 ∑_{i<j} (E(XiXj) − E(Xi)E(Xj)),

resulting in

V(X) = 100 [100/199 − (100/199)²] + 2 ∑_{i<j} [(99/197)(100/199) − (100/199)²]
     = 100 (9900/39601) + 2 ((100)(100 − 1)/2) (100/7801397)

or

V(X) = 196020000/7801397 ≈ 25.126.

The one-sided Chebyshev's inequality can now be applied to get

P(X ≤ 30) ≤ P(X ≤ 50.25 − 20.25) ≤ 25.126/((20.25)² + 25.126)

or P(X ≤ 30) ≤ 0.0577.
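
Because the one-sided Chebyshev bound uses only the mean and variance, it can be very conservative here. The illustrative simulation below (an addition to the notes; the number of trials is arbitrary) estimates P(X ≤ 30) for the random pairing directly.

```python
import numpy as np

# Example #12 revisited: simulate the random pairing of 100 men and 100 women
# and estimate P(X <= 30), where X is the number of man-woman pairs, to see how
# conservative the one-sided Chebyshev bound 0.0577 is.
rng = np.random.default_rng(5)
trials, count = 20_000, 0
people = np.array([0] * 100 + [1] * 100)              # 0 = man, 1 = woman
for _ in range(trials):
    shuffled = rng.permutation(people)
    pairs = shuffled.reshape(100, 2)
    mixed_pairs = np.sum(pairs[:, 0] != pairs[:, 1])  # man-woman pairs
    if mixed_pairs <= 30:
        count += 1
print(f"simulated P(X <= 30) ~ {count / trials:.5f}  (bound from the notes: 0.0577)")
```
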

Another Form For The One-Sided Chebyshev's Inequality

It should be noted that by setting b = μ + a (or b = μ − a) in Equations
(12a,b), we have

P(X ≥ b) ≤ σ²/((b − μ)² + σ²)    (13a)

when μ < b, and

P(X ≤ b) ≤ σ²/((μ − b)² + σ²)    (13b)

when μ > b, which is what was used in the previous example with b = 30,
μ = 50.25 and σ² = 25.126. By also writing

P(X ≤ b) = 1 − P(X > b) ≤ σ²/((μ − b)² + σ²)

for μ > b, we see that

P(X > b) ≥ 1 − σ²/((μ − b)² + σ²)    (14a)

when μ > b. We also have

P(X < b) ≥ 1 − σ²/((b − μ)² + σ²)    (14b)

when μ < b.

Chernoff Bounds

Suppose that the moment generating function

M_X(t) = E(e^(tX))

of a random variable X is known; then for t > 0, we have

P(X ≥ a) = P(e^(tX) ≥ e^(ta)) ≤ E(e^(tX))/e^(ta) = e^(−ta) M(t).

For t < 0, we have

P(X ≤ a) = P(e^(tX) ≥ e^(ta)) ≤ E(e^(tX))/e^(ta) = e^(−ta) M(t).

Thus we find that

P(X ≥ a) ≤ e^(−ta) M(t)    (15a)

for all t > 0 and

P(X ≤ a) ≤ e^(−ta) M(t)    (15b)

for all t < 0, and these are known as the Chernoff bounds. By choosing the value
of t which minimizes e^(−ta) M(t), we find that

P(X ≥ a) ≤ min_{t>0} (e^(−ta) M(t))    (16a)

and

P(X ≤ a) ≤ min_{t<0} (e^(−ta) M(t)).    (16b)

Example #13: The Chernoff Bounds for N(0, 1)

If Z ~ N(0, 1), then M(t) = e^(t²/2) and since

e^(−ta) M(t) = e^(−ta) e^(t²/2) = e^(t²/2 − at)

we see that

d/dt (e^(−ta) M(t)) = d/dt (e^(t²/2 − at)) = (t − a) e^(t²/2 − at) = 0

occurs when t = a, which says that for a > 0,

min_{t>0} (e^(−ta) M(t)) = e^(a²/2 − a·a) = e^(−a²/2).

In a similar way we have for a < 0,

min_{t<0} (e^(−ta) M(t)) = e^(a²/2 − a·a) = e^(−a²/2).

Thus we find, using Equations (16a,b), that

P(Z ≥ a) ≤ e^(−a²/2) for a > 0   and   P(Z ≤ a) ≤ e^(−a²/2) for a < 0    (17)

which can serve as quick bounds on the standard normal distribution.


Example #14: Chernoff Bounds for The Poisson Random Variable

If X is a Poisson random variable with parameter λ, then

M(t) = e^(λ(e^t − 1))

and then

d/dt (e^(−ta) e^(λ(e^t − 1))) = (λe^t − a) e^(λ(e^t − 1) − ta) = 0

occurs when t = ln(a/λ), provided that a > λ (so that t > 0), which says that for a > λ,

min_{t>0} (e^(−ta) M(t)) = e^(λ(a/λ − 1)) e^(−a ln(a/λ)) = e^(a − λ) (λ/a)^a = e^(−λ) (eλ/a)^a.

Thus we find, using Equation (16a), that for a > λ,

P(X ≥ a) ≤ e^(−λ) (eλ/a)^a    (18)

which can serve as a quick bound on the Poisson distribution.
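
The bounds (17) and (18) can be compared with the exact tail probabilities, as in the following illustrative sketch (an addition, assuming SciPy is available; the values of a and λ = 10 are arbitrary choices).

```python
import math
from scipy.stats import norm, poisson

# Equations (17) and (18) checked numerically: Chernoff bounds versus the exact
# tail probabilities for the standard normal and for a Poisson(10) random variable.
for a in (1.0, 2.0, 3.0):
    print(f"normal : P(Z >= {a}) = {norm.sf(a):.5f} <= exp(-a^2/2) = {math.exp(-a * a / 2):.5f}")

lam = 10.0
for a in (15, 20, 25):                    # a > lambda so that t = ln(a/lambda) > 0
    bound = math.exp(-lam) * (math.e * lam / a) ** a
    print(f"Poisson: P(X >= {a}) = {poisson.sf(a - 1, lam):.5f} <= {bound:.5f}")
```
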


Example #15: A Problem with Gambling

Consider a gambler who, on every play, wins a units with probability p or loses b
units (i.e., wins −b units) with probability q = 1 − p, independently
of his past results. That is, if Xi is the gambler's winnings on the i-th play, then
the Xi's are independent and

P(Xi = a) = p   and   P(Xi = −b) = q = 1 − p.

Letting

Sn = ∑_{i=1}^n Xi

denote the gambler's winnings after n plays, let us use Chernoff's results to determine a bound on P(Sn ≥ w). Toward this end we first note that

M_Xi(t) = E(e^(tXi)) = p e^(ta) + q e^(−tb).

Then

M_Sn(t) = E(e^(tSn)) = E(e^(t(X1 + X2 + ⋯ + Xn))) = E(e^(tX1) e^(tX2) ⋯ e^(tXn))

or

M_Sn(t) = E(e^(tX1)) E(e^(tX2)) ⋯ E(e^(tXn)) = (p e^(ta) + q e^(−tb))^n.

Using Chernoff's bound, we now have

P(Sn ≥ w) ≤ e^(−wt) (p e^(ta) + q e^(−tb))^n

for all t > 0. Since

d/dt (e^(−wt) (p e^(ta) + q e^(−tb))^n) = −w e^(−wt) (p e^(ta) + q e^(−tb))^n
    + n e^(−wt) (pa e^(ta) − qb e^(−tb)) (p e^(ta) + q e^(−tb))^(n−1)

or

d/dt (e^(−wt) (p e^(ta) + q e^(−tb))^n)
    = e^(−wt) (p e^(ta) + q e^(−tb))^(n−1) (n(pa e^(ta) − qb e^(−tb)) − w(p e^(ta) + q e^(−tb))) = 0

gives

n(pa e^(ta) − qb e^(−tb)) − w(p e^(ta) + q e^(−tb)) = 0

or

(na − w) p e^(t(a+b)) = (nb + w) q,

which says that

t_min = (1/(a + b)) ln((nb + w)q / ((na − w)p)) = (1/(a + b)) ln(R)

with

R = (nb + w)q / ((na − w)p),

provided that t_min > 0, or R > 1. Then

P(Sn ≥ w) ≤ e^(−w t_min) (p e^(t_min a) + q e^(−t_min b))^n

which becomes

P(Sn ≥ w) ≤ R^(−w/(a+b)) (p R^(a/(a+b)) + (1 − p) R^(−b/(a+b)))^n    (19a)

provided that

R = (nb + w)(1 − p) / ((na − w)p) > 1.    (19b)

For example, suppose that n = 10, w = 6, a = b = 1 and p = 1/2. Then

R = (nb + w)(1 − p) / ((na − w)p) = (10(1) + 6)(1 − 1/2) / ((10(1) − 6)(1/2)) = 4 > 1

and

P(S10 ≥ 6) ≤ 4^(−6/2) ((1/2)(4)^(1/2) + (1 − 1/2)(4)^(−1/2))^10

which reduces to

P(S10 ≥ 6) ≤ (1/64)(5/4)^10

showing that P(S10 ≥ 6) ≤ 0.14552. It should be noted that the exact probability
in this case is

P(S10 ≥ 6) = P(gambler wins at least 8 of the first 10 games)

which says that

P(S10 ≥ 6) = (C(10, 8) + C(10, 9) + C(10, 10)) / 2^10 = 56/1024 = 7/128

or P(S10 ≥ 6) ≈ 0.0547.
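
The Chernoff bound (19a) and the exact value for this gambling example can be recomputed as follows (an addition to the notes):

```python
import math

# Example #15 revisited: the Chernoff bound (19a) and the exact binomial
# probability for n = 10, w = 6, a = b = 1, p = 1/2.
n, w, a, b, p = 10, 6, 1, 1, 0.5
q = 1 - p
R = ((n * b + w) * q) / ((n * a - w) * p)                  # R = 4
bound = R ** (-w / (a + b)) * (p * R ** (a / (a + b)) + q * R ** (-b / (a + b))) ** n

# S_10 >= 6 with a = b = 1 means winning at least 8 of the 10 plays.
exact = sum(math.comb(n, k) for k in (8, 9, 10)) / 2 ** n  # 56/1024
print(f"Chernoff bound : {bound:.5f}")   # about 0.14552
print(f"exact          : {exact:.5f}")   # about 0.05469
```
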
Jensen's Expected-Value Inequality

This inequality is on expected values rather than probabilities. It states that
if f(x) is a twice-differentiable convex function (which means that f″(x) ≥ 0 for
all x), then

E(f(X)) ≥ f(E(X))    (20a)

provided both E(f(X)) and E(X) exist and are finite. Also, if f(x) is a twice-differentiable concave function (which means that f″(x) ≤ 0 for all x), then

E(f(X)) ≤ f(E(X))    (20b)

provided both E(f(X)) and E(X) exist and are finite.

To prove this, we expand f(x) as a Taylor series about E(X) = μ and write

f(x) = f(μ) + f′(μ)(x − μ) + (1/2) f″(ξ)(x − μ)²

where ξ is some value between x and μ. Since f″(ξ) ≥ 0, we have

f(x) ≥ f(μ) + f′(μ)(x − μ)

and then

f(X) ≥ f(μ) + f′(μ)(X − μ)

showing that

E(f(X)) ≥ E(f(μ) + f′(μ)(X − μ)) = E(f(μ)) + f′(μ) E(X − μ) = f(μ)

and so we find that

E(f(X)) ≥ f(E(X)).

Of course, if f(x) is a twice-differentiable concave function, then a similar proof
leads to

E(f(X)) ≤ f(E(X)).


Example #16: A Problem in Investing

An investor is faced with the following choices: either she can invest all of her
money in a risky proposition that would lead to a random return X that has mean
m, or she can put the money into a risk-free venture that will lead to a return
of m with probability 1. Suppose that her decision will be made on the basis of
maximizing the expected value of u(R), where R is her return and u is her utility
function. By Jensen's inequality, it follows that if u is a concave function, then

E(u(X)) ≤ u(E(X)) = u(m)

so the risk-free alternative is preferable, whereas if u is convex, then

E(u(X)) ≥ u(E(X)) = u(m)

so the risky investment alternative would be preferred.
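
Jensen's inequality can be illustrated numerically. In the sketch below (an addition to the notes) the concave function u(x) = √x and the convex function f(x) = x² are applied to a lognormally distributed return; the distribution and its parameters are arbitrary choices.

```python
import numpy as np

# Jensen's inequality, illustrated: for the concave u(x) = sqrt(x) we expect
# E(u(X)) <= u(E(X)), and for the convex f(x) = x^2 we expect E(f(X)) >= f(E(X)).
rng = np.random.default_rng(6)
x = rng.lognormal(mean=0.0, sigma=0.5, size=1_000_000)

print(f"concave: E(sqrt(X)) = {np.mean(np.sqrt(x)):.4f} <= sqrt(E(X)) = {np.sqrt(x.mean()):.4f}")
print(f"convex : E(X^2)     = {np.mean(x**2):.4f} >= (E(X))^2   = {x.mean()**2:.4f}")
```
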
6. Bernoulli and Poisson Random Variables

In this final section we consider providing bounds on error probabilities when
approximating a sum of independent Bernoulli random variables by a Poisson
random variable. In other words, we establish bounds on how closely a sum of
independent Bernoulli random variables is approximated by a Poisson random
variable with the same mean. Toward this end, suppose that we want to approximate the sum of independent Bernoulli random variables with respective
parameters p1, p2, ..., pn. Starting with a sequence Y1, Y2, ..., Yn of independent
Poisson random variables, with Yi having mean pi, we will construct a sequence
X1, X2, ..., Xn of independent Bernoulli random variables with parameters p1, p2,
..., pn, respectively, so that

Xi = 0 with probability 1 − pi, and Xi = 1 with probability pi.    (21a)

A plot of e^(−p) and 1 − p (for 0 ≤ p ≤ 1)

[Figure: plots of e^(−p) (bold) and 1 − p (thin) for 0 ≤ p ≤ 1]

shows that e^(−p) ≥ 1 − p, or (1 − p)e^p ≤ 1, and so we may define a sequence U1, U2,
..., Un of independent random variables that are also independent of the Yi's and
are such that

Ui = 0 with probability (1 − pi)e^(pi), and Ui = 1 with probability 1 − (1 − pi)e^(pi).

We also define random variables Zi so that

Zi = 0 if Yi = Ui = 0, and Zi = 1 otherwise.    (21b)

Note that

P(Zi = 0) = P(Yi = Ui = 0) = P(Yi = 0) P(Ui = 0) = e^(−pi) (1 − pi) e^(pi) = 1 − pi

and

P(Zi = 1) = 1 − P(Zi = 0) = pi,

and this shows that Zi = Xi for i = 1, 2, 3, ..., n.

Now if Xi = 0 (or Zi = 0, since Zi = Xi) then so must Yi = 0, by the definition
of Zi. Therefore

P(Xi ≠ Yi) = P(Xi = 1, Yi ≠ 1) = P(Xi = 1, Yi = 0) + P(Yi > 1)

which becomes

P(Xi ≠ Yi) = P(Ui = 1, Yi = 0) + P(Yi > 1)

since Yi = 0 and Ui = 0 would give Zi = 0 = Xi, which is not the case. Thus we now
find that

P(Xi ≠ Yi) = P(Ui = 1) P(Yi = 0) + 1 − P(Yi ≤ 1)

or

P(Xi ≠ Yi) = (1 − (1 − pi)e^(pi))(e^(−pi)) + 1 − P(Yi = 0) − P(Yi = 1)

or

P(Xi ≠ Yi) = e^(−pi) − (1 − pi) + 1 − e^(−pi) − pi e^(−pi)

which reduces to

P(Xi ≠ Yi) = pi − pi e^(−pi) = pi(1 − e^(−pi)) ≤ pi(pi) = pi²

and so we find that

P(Xi ≠ Yi) ≤ pi²    (21c)

for i = 1, 2, 3, ..., n. Setting

X = ∑_{i=1}^n Xi   and   Y = ∑_{i=1}^n Yi    (21d)

and using the fact that X ≠ Y implies that Xi ≠ Yi for some value of i, we see
that

P(X ≠ Y) ≤ P(Xi ≠ Yi for some i) ≤ ∑_{i=1}^n P(Xi ≠ Yi)

or

P(X ≠ Y) ≤ ∑_{i=1}^n P(Xi ≠ Yi) ≤ ∑_{i=1}^n pi².    (21e)

Next, let us define an indicator function for any event B as

I_B = 1 if B does occur, and I_B = 0 if B does not occur,    (21f)

and note that

E(I_B) = (1)P(B) + (0)P(B̄) = P(B)    (21g)

and consider, for any set of real numbers A, the quantity I_{X∈A} − I_{Y∈A}. By the
definition of indicator functions (which equal either 0 or 1) we see that I_{X∈A} −
I_{Y∈A} is either equal to 0 − 1 = −1, 1 − 1 = 0, 0 − 0 = 0 or 1 − 0 = +1, and

I_{X∈A} − I_{Y∈A} = 1

occurs only when I_{X∈A} = 1 and I_{Y∈A} = 0, which says that X ∈ A and Y ∉ A,
so that X ≠ Y. Thus, we may conclude that

I_{X∈A} − I_{Y∈A} ≤ I_{X≠Y}.    (21h)

Taking expectation values of Equation (21h) and using Equation (21g), we find
that

E(I_{X∈A} − I_{Y∈A}) ≤ E(I_{X≠Y})

or

E(I_{X∈A}) − E(I_{Y∈A}) ≤ E(I_{X≠Y})

or

P(X ∈ A) − P(Y ∈ A) ≤ P(X ≠ Y).

Repeating the same argument with the roles of X and Y interchanged gives
P(Y ∈ A) − P(X ∈ A) ≤ P(X ≠ Y), and combining these two results, this also
says that

|P(X ∈ A) − P(Y ∈ A)| ≤ P(X ≠ Y).    (21i)

Finally, since Y is the sum of independent Poisson random variables (each with
parameter λi = pi), we know from the reproductive property of the Poisson
random variable that Y is also Poisson with parameter

λ = ∑_{i=1}^n λi = ∑_{i=1}^n pi,    (21j)

and so we have shown through Equation (21i) that

|P(∑_{i=1}^n Xi ∈ A) − ∑_{i∈A} e^(−λ) λ^i / i!| ≤ ∑_{i=1}^n pi².    (21k)

For the special case when all the pi are the same (and equal to p), we know that
X is binomial with parameters n and p and λ = np, and Equation (21k) becomes

|∑_{i∈A} C(n, i) p^i (1 − p)^(n−i) − ∑_{i∈A} e^(−np) (np)^i / i!| ≤ np²    (21l)

for any set of non-negative integers A.
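
Inequality (21l) can be checked numerically by comparing binomial and Poisson cdfs, which corresponds to sets of the form A = {0, 1, ..., k}. The sketch below is an addition to the notes, assuming SciPy is available; the (n, p) pairs are arbitrary choices.

```python
import numpy as np
from scipy.stats import binom, poisson

# Equation (21l) checked numerically: the largest difference between the
# binomial(n, p) and Poisson(np) cdfs (sets A = {0, 1, ..., k}) is compared
# with the bound n*p^2 for a few illustrative (n, p) pairs.
for n, p in ((100, 0.02), (1000, 0.01), (50, 0.1)):
    k = np.arange(0, n + 1)
    gap = np.max(np.abs(binom.cdf(k, n, p) - poisson.cdf(k, n * p)))
    print(f"n = {n:5d}, p = {p:.2f}: max cdf difference = {gap:.5f} <= n p^2 = {n * p * p:.5f}")
```
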
