
Statistics 110: Intro to Probability

Lectures by Joe Blitzstein


Notes by Max Wang
Harvard University, Fall 2011
Contents

Lecture 2: 9/2/11
Lecture 3: 9/7/11
Lecture 4: 9/9/11
Lecture 5: 9/12/11
Lecture 6: 9/14/11
Lecture 7: 9/16/11
Lecture 8: 9/19/11
Lecture 9: 9/21/11
Lecture 10: 9/23/11
Lecture 11: 9/26/11
Lecture 12: 9/28/11
Lecture 13: 9/30/11
Lecture 14: 10/3/11
Lecture 15: 10/5/11
Lecture 17: 10/14/11
Lecture 18: 10/17/11
Lecture 19: 10/19/11
Lecture 20: 10/21/11
Lecture 21: 10/24/11
Lecture 22: 10/26/11
Lecture 23: 10/28/11
Lecture 24: 10/31/11
Lecture 25: 11/2/11
Lecture 26: 11/4/11
Lecture 27: 11/7/11
Lecture 28: 11/9/11
Lecture 29: 11/14/11
Lecture 30: 11/16/11
Lecture 31: 11/18/11
Lecture 32: 11/21/11
Lecture 33: 11/28/11

Introduction
Statistics 110 is an introductory probability course offered at Harvard University. It covers all the basics of probability: counting principles, probabilistic events, random variables, distributions, conditional probability, expectation, and Bayesian inference. The last few lectures of the course are spent on Markov chains.
These notes were partially live-TeXed; the rest were TeXed from the course videos and then edited for correctness and clarity. I am responsible for all errata in this document, mathematical or otherwise; any merits of the material here should be credited to the lecturer, not to me.
Feel free to email me at mxawng@gmail.com with any comments.

Acknowledgments
In addition to the course staff, acknowledgment goes to Zev Chonoles, whose online lecture notes (http://math.uchicago.edu/~chonoles/expository-notes/) inspired me to post my own. I have also borrowed his format for this introduction page.
The page layout for these notes is based on the layout I used back when I took notes by hand. The LaTeX styles can be found here: https://github.com/mxw/latex-custom.

Copyright
Copyright 2011 Max Wang.
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. This means you are free to edit, adapt, transform, or redistribute this work as long as you
- include an attribution of Joe Blitzstein as the instructor of the course these notes are based on, and an attribution of Max Wang as the note-taker;
- do so in a way that does not suggest that either of us endorses you or your use of this work;
- use this work for noncommercial purposes only; and
- if you adapt or build upon this work, apply this same license to your contributions.
See http://creativecommons.org/licenses/by-nc-sa/4.0/ for the license details.


Lecture 2: 9/2/11

Definition 2.1. A sample space $S$ is the set of all possible outcomes of an experiment.

Definition 2.2. An event $A \subseteq S$ is a subset of a sample space.

Definition 2.3. Assuming that all outcomes are equally likely and that the sample space is finite,
$P(A) = \frac{\#\text{ favorable outcomes}}{\#\text{ possible outcomes}}$
is the probability that $A$ occurs.

Proposition 2.4 (Multiplication Rule). If there are $r$ experiments and each experiment has $n_i$ possible outcomes, then the overall sample space has size $n_1 n_2 \cdots n_r$.

Example. The probability of a full house in a five-card poker hand (without replacement, and without other players) is
$P(\text{full house}) = \dfrac{13\binom{4}{3} \cdot 12\binom{4}{2}}{\binom{52}{5}}$.

Definition 2.5. The binomial coefficient is given by
$\binom{n}{k} = \dfrac{n!}{(n-k)!\,k!}$,
or $0$ if $k > n$.

Theorem 2.6 (Sampling Table). The number of ways to choose $k$ elements from a set of $n$ distinct elements, under each sampling scheme, is given by the following table:

                     ordered              unordered
  replacement        $n^k$                $\binom{n+k-1}{k}$
  no replacement     $\frac{n!}{(n-k)!}$  $\binom{n}{k}$

Lecture 3: 9/7/11

Proposition 3.1. The number of ways to choose $k$ elements from a set of order $n$, with replacement and where order doesn't matter, is $\binom{n+k-1}{k}$.

Proof. This count is equivalent to the number of ways to put $k$ indistinguishable particles in $n$ distinguishable boxes. Suppose we line the particles up; then this count is simply the number of ways to place $n-1$ dividers among the particles, e.g.,
$\bullet\ \bullet \mid\ \mid \bullet \mid\mid \bullet$
There are $\binom{n+k-1}{k} = \binom{n+k-1}{n-1}$ ways to place the particles, which determines the placement of the dividers (or vice versa); this is our result. $\square$

Example.
1. $\binom{n}{k} = \binom{n}{n-k}$.
2. $n\binom{n-1}{k-1} = k\binom{n}{k}$. Pick $k$ people out of $n$, then designate one as special. The RHS represents how many ways we can do this by first picking the $k$ individuals and then making our designation. On the LHS, we see the number of ways to pick a special individual first and then pick the remaining $k-1$ individuals from the remaining pool of $n-1$.
3. (Vandermonde) $\binom{n+m}{k} = \sum_{i=0}^{k} \binom{n}{i}\binom{m}{k-i}$. On the LHS, we choose $k$ people out of $n+m$. On the RHS, we sum up, for every $i$, the number of ways to choose $i$ from the $n$ people and $k-i$ from the $m$ people.

Definition 3.2. A probability space consists of a sample space $S$ along with a function $P : \mathcal{P}(S) \to [0,1]$ taking events to real numbers, where
1. $P(\varnothing) = 0$, $P(S) = 1$;
2. $P\left(\bigcup_{n=1}^{\infty} A_n\right) = \sum_{n=1}^{\infty} P(A_n)$ if the $A_n$ are disjoint.

Lecture 4: 9/9/11

Example (Birthday Problem). The probability that at least two people among a group of $k$ share the same birthday, assuming that birthdays are evenly distributed across the 365 standard days, is given by
$P(\text{match}) = 1 - P(\text{no match}) = 1 - \dfrac{365 \cdot 364 \cdots (365-k+1)}{365^k}$.

Proposition 4.1.
1. $P(A^C) = 1 - P(A)$.
2. If $A \subseteq B$, then $P(A) \le P(B)$.
3. $P(A \cup B) = P(A) + P(B) - P(A \cap B)$.

Proof. All immediate. $\square$

Corollary 4.2 (Inclusion-Exclusion). Generalizing 3 above,
$P\left(\bigcup_{i=1}^{n} A_i\right) = \sum_{i} P(A_i) - \sum_{i<j} P(A_i \cap A_j) + \sum_{i<j<k} P(A_i \cap A_j \cap A_k) - \cdots + (-1)^{n+1} P\left(\bigcap_{i=1}^{n} A_i\right)$.

Example (de Montmort's Problem). Suppose we have $n$ cards labeled $1, \ldots, n$. We want to determine the probability that, for some card in a shuffled deck of such cards, the $i$-th card has value $i$. Since the number of orderings of the deck for which a given set of matches occurs is simply the number of permutations of the remaining cards, we have
$P(A_i) = \frac{(n-1)!}{n!} = \frac{1}{n}$,
$P(A_1 \cap A_2) = \frac{(n-2)!}{n!} = \frac{1}{n(n-1)}$,
$P(A_1 \cap \cdots \cap A_k) = \frac{(n-k)!}{n!}$.
So using the above corollary,
$P(A_1 \cup \cdots \cup A_n) = n\cdot\frac{1}{n} - \frac{n(n-1)}{2!}\cdot\frac{1}{n(n-1)} + \frac{n(n-1)(n-2)}{3!}\cdot\frac{1}{n(n-1)(n-2)} - \cdots = 1 - \frac{1}{2!} + \frac{1}{3!} - \cdots + (-1)^{n-1}\frac{1}{n!} \approx 1 - \frac{1}{e}$.
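As a quick sanity check on the de Montmort calculation above, here is a small Monte Carlo sketch in Python (my addition, not from lecture; the function name and trial count are arbitrary): it shuffles $n$ labeled cards repeatedly and estimates the probability of at least one match, which should be close to $1 - 1/e$ already for moderate $n$.

```python
import math
import random

def matching_prob(n, trials=100_000):
    """Estimate P(some card i lands in position i) for n labeled cards."""
    hits = 0
    for _ in range(trials):
        deck = list(range(n))
        random.shuffle(deck)
        if any(card == pos for pos, card in enumerate(deck)):
            hits += 1
    return hits / trials

print(matching_prob(10))   # roughly 0.63
print(1 - 1 / math.e)      # 0.6321...
```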
Lecture 5: 9/12/11

Note. Translation from English to inclusion-exclusion:
- Probability that at least one of the $A_i$ occurs: $P(A_1 \cup \cdots \cup A_n)$.
- Probability that none of the $A_i$ occurs: $1 - P(A_1 \cup \cdots \cup A_n)$.
- Probability that all of the $A_i$ occur: $P(A_1 \cap \cdots \cap A_n) = 1 - P(A_1^C \cup \cdots \cup A_n^C)$.

Definition 5.1. Two events $A$ and $B$ are independent if
$P(A \cap B) = P(A)P(B)$.
In general, for $n$ events $A_1, \ldots, A_n$, independence requires $i$-wise independence for every $i = 2, \ldots, n$; that is, say, pairwise independence alone does not imply independence.

Note. We will write $P(A \cap B)$ as $P(A, B)$.

Example (Newton-Pepys Problem). Suppose we have some fair dice; we want to determine which of the following is most likely to occur:
1. At least one 6 with 6 dice.
2. At least two 6s with 12 dice.
3. At least three 6s with 18 dice.
For the first case, we have
$P(A) = 1 - \left(\frac{5}{6}\right)^6$.
For the second,
$P(B) = 1 - \left(\frac{5}{6}\right)^{12} - 12\cdot\frac{1}{6}\left(\frac{5}{6}\right)^{11}$,
and for the third,
$P(C) = 1 - \sum_{k=0}^{2} \binom{18}{k}\left(\frac{1}{6}\right)^k\left(\frac{5}{6}\right)^{18-k}$.
(The summand on the RHS is called a binomial probability.)

Thus far, all the probabilities with which we have concerned ourselves have been unconditional. We now turn to conditional probability, which concerns how to update our beliefs (and computed probabilities) based on new evidence.

Definition 5.2. The probability of an event $A$ given $B$ is
$P(A\mid B) = \dfrac{P(A \cap B)}{P(B)}$
if $P(B) > 0$.

Corollary 5.3.
$P(A \cap B) = P(B)P(A\mid B) = P(A)P(B\mid A)$,
or, more generally,
$P(A_1 \cap \cdots \cap A_n) = P(A_1)\,P(A_2\mid A_1)\,P(A_3\mid A_1, A_2)\cdots P(A_n\mid A_1, \ldots, A_{n-1})$.

Theorem 5.4 (Bayes' Theorem).
$P(A\mid B) = \dfrac{P(B\mid A)P(A)}{P(B)}$.
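The Newton-Pepys comparison above is easy to check exactly. The short script below is my own addition (not from lecture); it evaluates the three binomial tail probabilities and confirms that the six-dice event is the most likely.

```python
from math import comb

def at_least(m, n):
    """P(at least m sixes among n fair dice) = 1 - P(fewer than m sixes)."""
    return 1 - sum(comb(n, k) * (1/6)**k * (5/6)**(n - k) for k in range(m))

print(at_least(1, 6))    # ~0.665
print(at_least(2, 12))   # ~0.619
print(at_least(3, 18))   # ~0.597
```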
Lecture 6: 9/14/11

Theorem 6.1 (Law of Total Probability). Let $S$ be a sample space and $A_1, \ldots, A_n$ a partition of $S$. Then
$P(B) = P(B \cap A_1) + \cdots + P(B \cap A_n) = P(B\mid A_1)P(A_1) + \cdots + P(B\mid A_n)P(A_n)$.

Example. Suppose we are given a random two-card hand from a standard deck.
1. What is the probability that both cards are aces given that we have an ace?
$P(\text{both aces}\mid\text{have ace}) = \dfrac{P(\text{both aces, have ace})}{P(\text{have ace})} = \dfrac{\binom{4}{2}/\binom{52}{2}}{1 - \binom{48}{2}/\binom{52}{2}} = \dfrac{1}{33}$.
2. What is the probability that both cards are aces given that we have the ace of spades?
$P(\text{both aces}\mid\text{ace of spades}) = \dfrac{3}{51} = \dfrac{1}{17}$.

Example. Suppose that a patient is being tested for a disease and it is known that 1% of similar patients have the disease. Suppose also that the patient tests positive and that the test is 95% accurate. Let $D$ be the event that the patient has the disease and $T$ the event that he tests positive. Then we know $P(T\mid D) = 0.95 = P(T^C\mid D^C)$. Using Bayes' theorem and the Law of Total Probability, we can compute
$P(D\mid T) = \dfrac{P(T\mid D)P(D)}{P(T)} = \dfrac{P(T\mid D)P(D)}{P(T\mid D)P(D) + P(T\mid D^C)P(D^C)} \approx 0.16$.

Definition 6.2. Two events $A$ and $B$ are conditionally independent given an event $C$ if
$P((A \cap B)\mid C) = P(A\mid C)P(B\mid C)$.

Example. Two conditionally independent events are not necessarily unconditionally independent. For instance, suppose we have a chess opponent of unknown strength. We might say that conditional on the opponent's strength, all game outcomes would be independent. However, without knowing the opponent's strength, earlier games would give us useful information about the opponent's strength; hence, without the conditioning, the game outcomes are not independent.

Example. Two independent events are not necessarily conditionally independent. Suppose we know that a fire alarm goes off (event $A$). Suppose there are only two possible causes, that a fire happened, $F$, or that someone was making popcorn, $C$, and suppose moreover that these events are independent. Given, however, that the alarm went off, we have
$P(F\mid A \cap C^C) = 1$,
and hence we do not have conditional independence.

Lecture 7: 9/16/11

Example (Monty Hall Problem). Suppose there are three doors, behind two of which are goats and behind one of which is a car. Monty Hall, the game show host, knows the contents of each door, but we, the player, do not, and have one chance to choose the car. After we choose a door, Monty opens one of the two remaining doors to reveal a goat (if both remaining doors have goats, he chooses with equal probability). We are then given the option to change our choice; should we do so?
In fact, we should; the chance that switching will give us the car is the same as the chance that we did not originally pick the car, which is $\frac{2}{3}$. However, we can also solve the problem by conditioning. Suppose we have chosen a door (WLOG, say the first). Let $S$ be the event of finding the car by switching, and let $D_i$ be the event that the car is behind door $i$. Then by the Law of Total Probability,
$P(S) = P(S\mid D_1)\tfrac{1}{3} + P(S\mid D_2)\tfrac{1}{3} + P(S\mid D_3)\tfrac{1}{3} = 0 + 1\cdot\tfrac{1}{3} + 1\cdot\tfrac{1}{3} = \tfrac{2}{3}$.
By symmetry, the probability that we succeed conditioned on the door Monty opens is the same.

Example (Simpson's Paradox). Suppose we have the two following tables for the success of two doctors on two different operations:

  Hibbert        heart   band-aid
  success          70       10
  failure          20        0

  Nick           heart   band-aid
  success           2       81
  failure           8        9

Note that although Hibbert has a higher success rate conditional on each operation, Nick's success rate is higher overall. Let us denote by $A$ the event of a successful operation, $B$ the event of being treated by Nick, and $C$ the event of having heart surgery. In other words, then, we have
$P(A\mid B, C) < P(A\mid B^C, C)$ and $P(A\mid B, C^C) < P(A\mid B^C, C^C)$,
but
$P(A\mid B) > P(A\mid B^C)$.
In this example, $C$ is the confounder.
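The reversal in Simpson's paradox above can be verified directly from the two tables. The snippet below is an added illustration (the variable names are mine; the counts come from the tables): it computes each doctor's per-operation and overall success rates.

```python
# counts: (successes, failures) for each doctor and operation type
hibbert = {"heart": (70, 20), "band-aid": (10, 0)}
nick    = {"heart": (2, 8),   "band-aid": (81, 9)}

def rates(table):
    per_op = {op: s / (s + f) for op, (s, f) in table.items()}
    total_success = sum(s for s, _ in table.values())
    total = sum(s + f for s, f in table.values())
    return per_op, total_success / total

print(rates(hibbert))  # heart 0.78, band-aid 1.00, overall 0.80
print(rates(nick))     # heart 0.20, band-aid 0.90, overall 0.83
```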
Lecture 8: 9/19/11

Definition 8.1. A one-dimensional random walk models a (possibly infinite) sequence of successive steps along the number line, where, starting from some position $i$, we have probability $p$ of moving $+1$ and probability $q = 1-p$ of moving $-1$.

Example. An example of a one-dimensional random walk is the gambler's ruin problem, which asks: given two individuals A and B playing a sequence of successive rounds of a game in which they bet \$1, with A winning B's dollar with probability $p$ and A losing a dollar to B with probability $q = 1-p$, what is the probability that A wins the game (supposing A has $i$ dollars and B has $n-i$ dollars)? This problem can be modeled by a random walk with absorbing states at $0$ and $n$, starting at $i$.
To solve this problem, we perform first-step analysis; that is, we condition on the first step. Let $p_i = P(\text{A wins game}\mid\text{A starts at } i)$. Then by the Law of Total Probability, for $1 \le i \le n-1$,
$p_i = p\,p_{i+1} + q\,p_{i-1}$,
and of course we have $p_0 = 0$ and $p_n = 1$. This equation is a difference equation.
To solve this equation, we start by guessing $p_i = x^i$. Then we have
$x^i = p x^{i+1} + q x^{i-1}$, i.e., $p x^2 - x + q = 0$,
$x = \dfrac{1 \pm \sqrt{1 - 4pq}}{2p} = \dfrac{1 \pm \sqrt{1 - 4p(1-p)}}{2p} = \dfrac{1 \pm \sqrt{4p^2 - 4p + 1}}{2p} = \dfrac{1 \pm (2p-1)}{2p} = \dfrac{q}{p} \text{ or } 1$.
As with differential equations, this gives a general solution of the form
$p_i = A\cdot 1^i + B\left(\dfrac{q}{p}\right)^i$
for $p \ne q$ (to avoid a repeated root). Our boundary conditions for $p_0$ and $p_n$ give
$B = -A$ and $1 = A\left(1 - \left(\dfrac{q}{p}\right)^n\right)$.
To solve for the case where $p = q$, we can let $x = q/p \to 1$ and take
$\lim_{x\to 1} \dfrac{1 - x^i}{1 - x^n} = \lim_{x\to 1} \dfrac{i x^{i-1}}{n x^{n-1}} = \dfrac{i}{n}$.
So we have
$p_i = \dfrac{1 - (q/p)^i}{1 - (q/p)^n}$ if $p \ne q$, and $p_i = \dfrac{i}{n}$ if $p = q$.

Now suppose that $p = 0.49$ and $i = n - i$. Then we have the following surprising table:

  N      P(A wins)
  20       0.40
  100      0.12
  200      0.02

Note that this table is for odds that are only slightly against A and for A and B starting off with equal funding; it is easy to see that in a typical gambler's situation, the chance of winning is extremely small.

Definition 8.2. A random variable is a function
$X : S \to \mathbb{R}$
from some sample space $S$ to the real line. A random variable acts as a summary of some aspect of an experiment.

Definition 8.3. A random variable $X$ is said to have the Bernoulli distribution if $X$ has only two possible values, $0$ and $1$, and there is some $p$ such that
$P(X = 1) = p$, $P(X = 0) = 1-p$.
We say that $X \sim \mathrm{Bern}(p)$.

Note. We write $X = 1$ to denote the event
$\{s \in S : X(s) = 1\} = X^{-1}(\{1\})$.

Definition 8.4. The distribution of the number of successes in $n$ independent $\mathrm{Bern}(p)$ trials is called the binomial distribution and is given by
$P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}$,
where $0 \le k \le n$. We write $X \sim \mathrm{Bin}(n, p)$.

Definition 8.5. The probability mass function (PMF) of a discrete random variable (a random variable with enumerable values) is a function that gives the probability that the random variable takes some value. That is, given a discrete random variable $X$, its PMF is
$f_X(x) = P(X = x)$.

In addition to our definition of the binomial distribution by its PMF, we can also express a random variable $X \sim \mathrm{Bin}(n, p)$ as a sum of indicator random variables,
$X = X_1 + \cdots + X_n$,
where $X_i = 1$ if the $i$-th trial succeeds and $0$ otherwise. In other words, the $X_i$ are i.i.d. (independent, identically distributed) $\mathrm{Bern}(p)$.
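The surprising table in the gambler's ruin example above follows directly from the formula $p_i = \frac{1-(q/p)^i}{1-(q/p)^n}$. The sketch below is my addition (function name and rounding are just for illustration); it reproduces the table for $p = 0.49$ and equal starting funds $i = N/2$.

```python
def win_prob(p, i, n):
    """P(A reaches n dollars before 0), starting from i, win prob p per round."""
    q = 1 - p
    if p == q:
        return i / n
    r = q / p
    return (1 - r**i) / (1 - r**n)

for n in (20, 100, 200):
    print(n, round(win_prob(0.49, n // 2, n), 2))
# 20 0.4, 100 0.12, 200 0.02
```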
Lecture 9: 9/21/11

Definition 9.1. The cumulative distribution function (CDF) of a random variable $X$ is
$F_X(x) = P(X \le x)$.

Note. The requirements for a PMF with values $p_i$ are that each $p_i \ge 0$ and $\sum_i p_i = 1$. For $\mathrm{Bin}(n, p)$, we can easily verify this with the binomial theorem, which yields
$\sum_{k=0}^{n} \binom{n}{k} p^k q^{n-k} = (p+q)^n = 1$.

Proposition 9.2. If $X, Y$ are independent random variables and $X \sim \mathrm{Bin}(n, p)$, $Y \sim \mathrm{Bin}(m, p)$, then
$X + Y \sim \mathrm{Bin}(n+m, p)$.

Proof. This is clear from our story definition of the binomial distribution, as well as from our indicator r.v.s. Let us also check this using PMFs.
$P(X+Y = k) = \sum_{j=0}^{k} P(X+Y = k \mid X = j)P(X = j)$
$= \sum_{j=0}^{k} P(Y = k-j \mid X = j)\binom{n}{j} p^j q^{n-j}$
$= \sum_{j=0}^{k} P(Y = k-j)\binom{n}{j} p^j q^{n-j}$ (independence)
$= \sum_{j=0}^{k} \binom{m}{k-j} p^{k-j} q^{m-(k-j)} \binom{n}{j} p^j q^{n-j}$
$= p^k q^{n+m-k} \sum_{j=0}^{k} \binom{m}{k-j}\binom{n}{j}$
$= \binom{n+m}{k} p^k q^{n+m-k}$ (Vandermonde). $\square$

Example. Suppose we draw a random 5-card hand from a standard 52-card deck. We want to find the distribution of the number of aces in the hand. Let $X = \#\text{aces}$. We want to determine the PMF of $X$ (or the CDF, but the PMF is easier). We know that $P(X = k) = 0$ except if $k = 0, 1, 2, 3, 4$. This is clearly not binomial since the trials (of drawing cards) are not independent. For $k = 0, 1, 2, 3, 4$, we have
$P(X = k) = \dfrac{\binom{4}{k}\binom{48}{5-k}}{\binom{52}{5}}$,
which is just the probability of choosing $k$ of the 4 aces and $5-k$ of the non-aces. This is reminiscent of the elk problem in the homework.

Definition 9.3. Suppose we have $w$ white and $b$ black marbles, out of which we choose a simple random sample of $n$. The distribution of the number of white marbles in the sample, which we will call $X$, is given by
$P(X = k) = \dfrac{\binom{w}{k}\binom{b}{n-k}}{\binom{w+b}{n}}$,
where $0 \le k \le w$ and $0 \le n-k \le b$. This is called the hypergeometric distribution, denoted $\mathrm{HGeom}(w, b, n)$.

Proof. We should show that the above is a valid PMF. It is clearly nonnegative. We also have, by Vandermonde's identity,
$\sum_{k=0}^{w} \dfrac{\binom{w}{k}\binom{b}{n-k}}{\binom{w+b}{n}} = \dfrac{\binom{w+b}{n}}{\binom{w+b}{n}} = 1$. $\square$

Note. The difference between the hypergeometric and binomial distributions is whether or not we sample with replacement. We would expect that in the limiting case of $n \to \infty$, they would behave similarly.
Lecture 10: 9/23/11

Proposition 10.1 (Properties of CDFs). A function $F_X$ is a valid CDF iff the following hold about $F_X$:
1. monotonically nondecreasing;
2. right-continuous;
3. $\lim_{x\to-\infty} F_X(x) = 0$ and $\lim_{x\to\infty} F_X(x) = 1$.

Definition 10.2. Two random variables $X$ and $Y$ are independent if, for all $x, y$,
$P(X \le x, Y \le y) = P(X \le x)P(Y \le y)$.
In the discrete case, we can say equivalently that
$P(X = x, Y = y) = P(X = x)P(Y = y)$.

Note. As an aside before we move on to discuss averages and expected values, recall that
$\sum_{i=1}^{n} i = \dfrac{n(n+1)}{2}$.

Example. Suppose we want to find the average of $1, 1, 1, 1, 1, 3, 3, 5$. We could just add these up and divide by 8, or we could formulate the average as a weighted average,
$\frac{5}{8}\cdot 1 + \frac{2}{8}\cdot 3 + \frac{1}{8}\cdot 5$.

Definition 10.3. The expected value or average of a discrete random variable $X$ is
$E(X) = \sum_{x \in \mathrm{Im}(X)} x\,P(X = x)$.

Observation 10.4. Let $X \sim \mathrm{Bern}(p)$. Then
$E(X) = 1\cdot P(X=1) + 0\cdot P(X=0) = p$.

Definition 10.5. If $A$ is some event, then an indicator random variable for $A$ is $X = 1$ if $A$ occurs and $0$ otherwise. By definition, $X \sim \mathrm{Bern}(P(A))$, and by the above, $E(X) = P(A)$.
The above shows that to get the probability of an event, we can simply compute the expected value of an indicator.

Observation 10.6. Let $X \sim \mathrm{Bin}(n, p)$. Then (using the binomial theorem),
$E(X) = \sum_{k=0}^{n} k\binom{n}{k} p^k q^{n-k} = \sum_{k=1}^{n} k\binom{n}{k} p^k q^{n-k} = \sum_{k=1}^{n} n\binom{n-1}{k-1} p^k q^{n-k} = np\sum_{k=1}^{n} \binom{n-1}{k-1} p^{k-1} q^{n-k} = np\sum_{j=0}^{n-1} \binom{n-1}{j} p^j q^{n-1-j} = np$.

Proposition 10.7. Expected value is linear; that is, for random variables $X$ and $Y$ and some constant $c$,
$E(X+Y) = E(X) + E(Y)$ and $E(cX) = cE(X)$.

Observation 10.8. Using linearity, given $X \sim \mathrm{Bin}(n, p)$, since we know $X = X_1 + \cdots + X_n$ where the $X_i$ are i.i.d. $\mathrm{Bern}(p)$, we have
$E(X) = p + \cdots + p = np$.

Example. Suppose that, once again, we are choosing a five-card hand out of a standard deck, with $X = \#\text{aces}$. If $X_i$ is an indicator of the $i$-th card being an ace, we have
$E(X) = E(X_1 + \cdots + X_5) = E(X_1) + \cdots + E(X_5) = 5E(X_1)$ (by symmetry) $= 5\,P(\text{first card is ace}) = \frac{5}{13}$.
Note that this holds even though the $X_i$ are dependent.

Definition 10.9. The geometric distribution, $\mathrm{Geom}(p)$, is the number of failures of independent $\mathrm{Bern}(p)$ trials before the first success. Its PMF is given by (for $X \sim \mathrm{Geom}(p)$)
$P(X = k) = q^k p$
for $k \in \mathbb{N}$. Note that this PMF is valid since
$\sum_{k=0}^{\infty} p q^k = p\cdot\dfrac{1}{1-q} = 1$.

Observation 10.10. Let $X \sim \mathrm{Geom}(p)$. We have our formula for infinite geometric series,
$\sum_{k=0}^{\infty} q^k = \dfrac{1}{1-q}$.
Taking the derivative of both sides gives
$\sum_{k=1}^{\infty} k q^{k-1} = \dfrac{1}{(1-q)^2}$.
Then
$E(X) = \sum_{k=0}^{\infty} k p q^k = p q \sum_{k=1}^{\infty} k q^{k-1} = \dfrac{pq}{(1-q)^2} = \dfrac{q}{p}$.
Alternatively, we can use first-step analysis and write a recursive formula for $E(X)$. If we condition on what happens in the first Bernoulli trial, we have
$E(X) = 0\cdot p + (1 + E(X))q$,
so $E(X) - qE(X) = q$, and hence $E(X) = \dfrac{q}{1-q} = \dfrac{q}{p}$.
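As a quick check of $E(X) = q/p$ for $X \sim \mathrm{Geom}(p)$ (counting failures before the first success), the following sketch simulates the distribution directly. This is an added illustration, not from the notes; the sampler and the choice $p = 0.3$ are arbitrary.

```python
import random

def sample_geom(p):
    """Number of failures before the first success in Bern(p) trials."""
    k = 0
    while random.random() >= p:
        k += 1
    return k

p = 0.3
samples = [sample_geom(p) for _ in range(200_000)]
print(sum(samples) / len(samples))   # close to q/p = 0.7/0.3 = 2.33...
```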
Lecture 11: 9/26/11

Recall our assertion that $E$, the expected value function, is linear. We now prove this statement.

Proof. Let $X$ and $Y$ be discrete random variables. We want to show that $E(X+Y) = E(X) + E(Y)$.
$E(X+Y) = \sum_t t\,P(X+Y = t)$
$= \sum_s (X+Y)(s)\,P(\{s\})$
$= \sum_s (X(s) + Y(s))\,P(\{s\})$
$= \sum_s X(s)P(\{s\}) + \sum_s Y(s)P(\{s\})$
$= \sum_x x\,P(X = x) + \sum_y y\,P(Y = y)$
$= E(X) + E(Y)$.
The proof that $E(cX) = cE(X)$ is similar. $\square$

Definition 11.1. The negative binomial distribution, $\mathrm{NB}(r, p)$, is given by the number of failures of independent $\mathrm{Bern}(p)$ trials before the $r$-th success. The PMF for $X \sim \mathrm{NB}(r, p)$ is given by
$P(X = n) = \binom{n+r-1}{r-1} p^r (1-p)^n$
for $n \in \mathbb{N}$.

Observation 11.2. Let $X \sim \mathrm{NB}(r, p)$. We can write $X = X_1 + \cdots + X_r$ where each $X_i$ is the number of failures between the $(i-1)$-th and $i$-th success. Then $X_i \sim \mathrm{Geom}(p)$. Thus,
$E(X) = E(X_1) + \cdots + E(X_r) = \dfrac{rq}{p}$.

Observation 11.3. Let $X \sim \mathrm{FS}(p)$, where $\mathrm{FS}(p)$ is the time until the first success of independent $\mathrm{Bern}(p)$ trials, counting the success. Then if we take $Y = X - 1$, we have $Y \sim \mathrm{Geom}(p)$. So,
$E(X) = E(Y) + 1 = \dfrac{q}{p} + 1 = \dfrac{1}{p}$.

Example. Suppose we have a random permutation of $\{1, \ldots, n\}$ with $n \ge 2$. What is the expected number of local maxima, that is, numbers greater than both of their neighbors?
Let $I_j$ be the indicator random variable for position $j$ being a local maximum ($1 \le j \le n$). We are interested in
$E(I_1 + \cdots + I_n) = E(I_1) + \cdots + E(I_n)$.
For the non-endpoint positions, in each local neighborhood of three numbers (for instance the neighborhood 28, 3, 8 in the sequence 5, 2, ..., 28, 3, 8, ..., 14), the probability that the largest number is in the center position is $\frac{1}{3}$. Moreover, these positions are all symmetrical. Analogously, the probability that an endpoint position is a local maximum is $\frac{1}{2}$. Then we have
$E(I_1) + \cdots + E(I_n) = \dfrac{n-2}{3} + \dfrac{2}{2} = \dfrac{n+1}{3}$.

Example (St. Petersburg Paradox). Suppose you are given the offer to play a game where a coin is flipped until a heads is landed. Then, for the number of flips $i$ made up to and including the heads, you receive $\$2^i$. How much should you be willing to pay to play this game? That is, what price would make the game fair, or the expected value zero?
Let $X$ be the number of flips of the fair coin up to and including the first heads. Clearly, $X \sim \mathrm{FS}(\frac{1}{2})$. If we let $Y = 2^X$, we want to find $E(Y)$. We have
$E(Y) = \sum_{k=1}^{\infty} 2^k\left(\dfrac{1}{2}\right)^k = \sum_{k=1}^{\infty} 1 = \infty$.
This assumes, however, that our cash source is boundless. If we bound it at $2^K$ for some specific $K$, we should only bet about $K$ dollars for a fair game; this is a sizeable difference.
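The truncation remark above can be made concrete. If the bank can pay at most $2^K$ dollars, the capped expected payoff works out to $K + 1$, in line with the point that a bounded bankroll makes the fair price modest rather than infinite. The sketch below is an added check, not from the notes; the cap $K = 10$ and trial count are arbitrary.

```python
import random

def expected_capped(K):
    """Exact capped expectation: sum_{i<=K} 2^i * 2^-i + 2^K * 2^-K = K + 1."""
    return K + 1

def simulate_capped(K, trials=200_000):
    total = 0
    for _ in range(trials):
        flips = 1
        while random.random() < 0.5:   # keep flipping until heads
            flips += 1
        total += min(2**flips, 2**K)
    return total / trials

print(expected_capped(10), round(simulate_capped(10), 2))   # 11 vs. roughly 11
```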
Lecture 12: 9/28/11

Definition 12.1. The Poisson distribution, $\mathrm{Pois}(\lambda)$, is given by the PMF
$P(X = k) = \dfrac{e^{-\lambda}\lambda^k}{k!}$
for $k \in \mathbb{N}$, $X \sim \mathrm{Pois}(\lambda)$. We call $\lambda$ the rate parameter.

Observation 12.2. Checking that this PMF is indeed valid, we have
$\sum_{k=0}^{\infty} \dfrac{e^{-\lambda}\lambda^k}{k!} = e^{-\lambda}e^{\lambda} = 1$.
Its mean is given by
$E(X) = e^{-\lambda}\sum_{k=1}^{\infty} k\dfrac{\lambda^k}{k!} = e^{-\lambda}\sum_{k=1}^{\infty} \dfrac{\lambda^k}{(k-1)!} = \lambda e^{-\lambda}\sum_{k=1}^{\infty} \dfrac{\lambda^{k-1}}{(k-1)!} = \lambda e^{-\lambda}\sum_{k=0}^{\infty} \dfrac{\lambda^k}{k!} = \lambda$.

The Poisson distribution is often used for applications where we count the successes of a large number of trials where the per-trial success rate is small. For example, the Poisson distribution is a good starting point for counting the number of people who email you over the course of an hour. The number of chocolate chips in a chocolate chip cookie is another good candidate for a Poisson distribution, or the number of earthquakes in a year in some particular region.
Since the Poisson distribution is not bounded, these examples will not be precisely Poisson. However, in general, with a large number of events $A_i$ with small $P(A_i)$, and where the $A_i$ are all independent or weakly dependent, the number of the $A_i$ that occur is approximately $\mathrm{Pois}(\lambda)$ with $\lambda = \sum_{i=1}^{n} P(A_i)$. We call this a Poisson approximation.

Example. Suppose we have $n$ people and we want to know the approximate probability that at least three individuals have the same birthday. There are $\binom{n}{3}$ triplets of people; for each triplet, let $I_{ijk}$ be the indicator r.v. that persons $i$, $j$, and $k$ have the same birthday. Let $X = \#\text{ triple matches}$. Then we know that
$E(X) = \binom{n}{3}\dfrac{1}{365^2}$.
To approximate $P(X \ge 1)$, we approximate $X \sim \mathrm{Pois}(\lambda)$ with $\lambda = E(X)$. Then we have
$P(X \ge 1) = 1 - P(X = 0) \approx 1 - e^{-\lambda}\dfrac{\lambda^0}{0!} = 1 - e^{-\lambda}$.

Proposition 12.3. Let $X \sim \mathrm{Bin}(n, p)$. Then as $n \to \infty$, $p \to 0$, and where $\lambda = np$ is held constant, we have $X \to \mathrm{Pois}(\lambda)$.

Proof. Fix $k$. Then as $n \to \infty$ and $p \to 0$,
$\lim P(X = k) = \lim \binom{n}{k} p^k (1-p)^{n-k} = \lim \dfrac{n(n-1)\cdots(n-k+1)}{k!}\cdot\dfrac{\lambda^k}{n^k}\left(1 - \dfrac{\lambda}{n}\right)^n\left(1 - \dfrac{\lambda}{n}\right)^{-k} = \dfrac{\lambda^k}{k!}e^{-\lambda}$. $\square$
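The Poisson approximation in Proposition 12.3 is easy to see numerically. The sketch below is my addition (the values $n = 1000$, $p = 0.003$ are arbitrary); it compares the $\mathrm{Bin}(n, p)$ and $\mathrm{Pois}(np)$ PMFs for a large $n$ and small $p$.

```python
from math import comb, exp, factorial

n, p = 1000, 0.003
lam = n * p

for k in range(7):
    binom = comb(n, k) * p**k * (1 - p)**(n - k)
    pois = exp(-lam) * lam**k / factorial(k)
    print(k, round(binom, 5), round(pois, 5))   # the two columns nearly agree
```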
Lecture 13: 9/30/11

Definition 13.1. Let $X$ be a random variable. Then $X$ has a probability density function (PDF) $f_X(x)$ if
$P(a \le X \le b) = \int_a^b f_X(x)\,dx$.
A valid PDF must satisfy
1. $f_X(x) \ge 0$ for all $x$;
2. $\int_{-\infty}^{\infty} f_X(x)\,dx = 1$.

Note. For $\epsilon > 0$ very small, we have
$f_X(x_0)\,\epsilon \approx P\left(X \in \left(x_0 - \tfrac{\epsilon}{2}, x_0 + \tfrac{\epsilon}{2}\right)\right)$.

Theorem 13.2. If $X$ has PDF $f_X$, then its CDF is
$F_X(x) = P(X \le x) = \int_{-\infty}^{x} f_X(t)\,dt$.
If $X$ is continuous and has CDF $F_X$, then its PDF is $f_X(x) = F_X'(x)$. Moreover,
$P(a < X < b) = \int_a^b f_X(x)\,dx = F_X(b) - F_X(a)$.

Proof. By the Fundamental Theorem of Calculus. $\square$

Definition 13.3. The expected value of a continuous random variable $X$ is given by
$E(X) = \int_{-\infty}^{\infty} x f_X(x)\,dx$.
Giving the expected value is like giving a one-number summary of the average, but it provides no information about the spread of a distribution.

Definition 13.4. The variance of a random variable $X$ is given by
$\mathrm{Var}(X) = E((X - EX)^2)$,
which is the expected squared distance from $X$ to its mean; that is, it measures, on average, how far $X$ is from its mean.
We can't use $E(X - EX)$ because, by linearity, we have
$E(X - EX) = EX - E(EX) = EX - EX = 0$.
We would like to use $E|X - EX|$, but absolute value is hard to work with; instead, we use the squared distance.

Definition 13.5. The standard deviation of a random variable $X$ is
$\mathrm{SD}(X) = \sqrt{\mathrm{Var}(X)}$.

Note. Another way we can write variance is
$\mathrm{Var}(X) = E((X - EX)^2) = E(X^2 - 2X(EX) + (EX)^2) = E(X^2) - 2E(X)E(X) + (EX)^2 = E(X^2) - (EX)^2$.

Definition 13.6. The uniform distribution, $\mathrm{Unif}(a, b)$, is given by a completely random point chosen in the interval $[a, b]$. Note that the probability of picking a given point $x_0$ is exactly 0; the uniform distribution is continuous. The PDF for $U \sim \mathrm{Unif}(a, b)$ is given by
$f_U(x) = c$ for $a \le x \le b$, and $0$ otherwise,
for some constant $c$. To find $c$, we note that, by the definition of PDF, we have
$\int_a^b c\,dx = 1$, so $c(b - a) = 1$, so $c = \dfrac{1}{b-a}$.
The CDF, then, is given by
$F_U(x) = \int_{-\infty}^{x} f_U(t)\,dt = \int_a^x \dfrac{1}{b-a}\,dt = \dfrac{x-a}{b-a}$
for $a \le x \le b$.

Observation 13.7. The expected value of an r.v. $U \sim \mathrm{Unif}(a, b)$ is
$E(U) = \int_a^b \dfrac{x}{b-a}\,dx = \left.\dfrac{x^2}{2(b-a)}\right|_a^b = \dfrac{b^2 - a^2}{2(b-a)} = \dfrac{(b+a)(b-a)}{2(b-a)} = \dfrac{a+b}{2}$.
This is the midpoint of the interval $[a, b]$.

Finding the variance of $U \sim \mathrm{Unif}(a, b)$, however, is a bit more trouble. We need to determine $E(U^2)$, but it is too much of a hassle to figure out the PDF of $U^2$. Ideally, things would be as simple as
$E(U^2) = \int x^2 f_U(x)\,dx$.
Fortunately, this is true:

Theorem 13.8 (Law of the Unconscious Statistician (LOTUS)). Let $X$ be a continuous random variable, $g : \mathbb{R} \to \mathbb{R}$ continuous. Then
$E(g(X)) = \int_{-\infty}^{\infty} g(x) f_X(x)\,dx$,
where $f_X$ is the PDF of $X$. This allows us to determine the expected value of $g(X)$ without knowing its distribution.

Observation 13.9. The variance of $U \sim \mathrm{Unif}(a, b)$ is given by
$\mathrm{Var}(U) = E(U^2) - (EU)^2 = \int_a^b x^2 f_U(x)\,dx - \left(\dfrac{b+a}{2}\right)^2 = \dfrac{1}{b-a}\int_a^b x^2\,dx - \left(\dfrac{b+a}{2}\right)^2 = \dfrac{b^3 - a^3}{3(b-a)} - \dfrac{(b+a)^2}{4} = \dfrac{(b-a)^2}{12}$.

The following table is useful for comparing discrete and continuous random variables:

                       discrete                     continuous
  PMF/PDF              PMF $P(X = x)$               PDF $f_X(x)$
  CDF                  $F_X(x) = P(X \le x)$        $F_X(x) = P(X \le x)$
  $E(X)$               $\sum_x x P(X = x)$          $\int x f_X(x)\,dx$
  $\mathrm{Var}(X)$    $E(X^2) - (EX)^2$            $E(X^2) - (EX)^2$
  $E(g(X))$ [LOTUS]    $\sum_x g(x)P(X = x)$        $\int g(x) f_X(x)\,dx$
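A quick numerical check of Observation 13.9, and of LOTUS itself, can be done by simulation. This snippet is my addition (the endpoints $a = 2$, $b = 5$ are arbitrary): it estimates $E(U^2)$ directly with $g(u) = u^2$ and compares the resulting variance with $(b-a)^2/12$.

```python
import random

a, b = 2.0, 5.0
us = [random.uniform(a, b) for _ in range(500_000)]
mean = sum(us) / len(us)
second_moment = sum(u * u for u in us) / len(us)   # LOTUS with g(u) = u^2
print(second_moment - mean**2)                     # ~ (b - a)^2 / 12
print((b - a)**2 / 12)                             # 0.75
```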
Lecture 14: 10/3/11

Theorem 14.1 (Universality of the Uniform). Let us take $U \sim \mathrm{Unif}(0, 1)$, $F$ a strictly increasing CDF. Then for $X = F^{-1}(U)$, we have $X \sim F$. Moreover, for any random variable $X$, if $X \sim F$, then $F(X) \sim \mathrm{Unif}(0, 1)$.

Proof. We have
$P(X \le x) = P(F^{-1}(U) \le x) = P(U \le F(x)) = F(x)$,
since $P(U \le F(x))$ is the length of the interval $[0, F(x)]$, which is $F(x)$. For the second part,
$P(F(X) \le x) = P(X \le F^{-1}(x)) = F(F^{-1}(x)) = x$,
since $F$ is $X$'s CDF. But this shows that $F(X) \sim \mathrm{Unif}(0, 1)$. $\square$

Example. Let $F(x) = 1 - e^{-x}$ with $x > 0$ be the CDF of an r.v. $X$. Then $F(X) = 1 - e^{-X} \sim \mathrm{Unif}(0, 1)$ by an application of the second part of Universality of the Uniform.

Example. Let $F(x) = 1 - e^{-x}$ with $x > 0$, and also let $U \sim \mathrm{Unif}(0, 1)$. Suppose we want to simulate $F$ with a random variable $X$; that is, $X \sim F$. Then computing the inverse,
$F^{-1}(u) = -\ln(1 - u)$,
yields $F^{-1}(U) = -\ln(1 - U) \sim F$.

Proposition 14.2. The standard uniform distribution is symmetric; that is, if $U \sim \mathrm{Unif}(0, 1)$, then also $1 - U \sim \mathrm{Unif}(0, 1)$.
Intuitively, this is true because there is no difference between measuring $U$ from the right vs. from the left of $[0, 1]$.
Linear transformations preserve uniformity; that is, $a + bU$ is uniform on some interval. A nonlinear transformation of $U$ is, in general, not uniform.

Definition 14.3. We say that random variables $X_1, \ldots, X_n$ are independent if
- for continuous r.v.s, $P(X_1 \le x_1, \ldots, X_n \le x_n) = P(X_1 \le x_1)\cdots P(X_n \le x_n)$;
- for discrete r.v.s, $P(X_1 = x_1, \ldots, X_n = x_n) = P(X_1 = x_1)\cdots P(X_n = x_n)$.
The expressions on the LHS are called joint CDFs and joint PMFs respectively.
Note that pairwise independence does not imply independence.

Example. Consider the penny matching game, where $X_1, X_2 \sim \mathrm{Bern}(\frac{1}{2})$, i.i.d., and let $X_3$ be the indicator r.v. for the event $X_1 = X_2$ (the r.v. for winning the game). All of these are pairwise independent, but $X_3$ is clearly dependent on the combined outcomes of $X_1$ and $X_2$.

Definition 14.4. The standard normal distribution, $\mathcal{N}(0, 1)$, is defined by the PDF
$f(z) = ce^{-z^2/2}$,
where $c$ is the normalizing constant required to have $f$ integrate to 1.

Proof. We want to prove that our PDF is valid; to do so, we will simply determine the value of the normalizing constant that makes it so. We will integrate the square of the PDF sans constant because it is easier than integrating naively:
$\left(\int_{-\infty}^{\infty} e^{-z^2/2}\,dz\right)^2 = \int_{-\infty}^{\infty} e^{-x^2/2}\,dx\int_{-\infty}^{\infty} e^{-y^2/2}\,dy = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} e^{-(x^2+y^2)/2}\,dx\,dy = \int_0^{2\pi}\int_0^{\infty} e^{-r^2/2}\,r\,dr\,d\theta$.
Substituting $u = \frac{r^2}{2}$, $du = r\,dr$,
$= \int_0^{2\pi}\int_0^{\infty} e^{-u}\,du\,d\theta = 2\pi$.
So our normalizing constant is $c = \dfrac{1}{\sqrt{2\pi}}$. $\square$

Observation 14.5. Let us compute the mean and variance of $Z \sim \mathcal{N}(0, 1)$. We have
$EZ = \dfrac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} z e^{-z^2/2}\,dz = 0$
by symmetry (the integrand is odd). The variance reduces to
$\mathrm{Var}(Z) = E(Z^2) - (EZ)^2 = E(Z^2)$.
By LOTUS,
$E(Z^2) = \dfrac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} z^2 e^{-z^2/2}\,dz = \dfrac{2}{\sqrt{2\pi}}\int_0^{\infty} z^2 e^{-z^2/2}\,dz$ (evenness)
$= \dfrac{2}{\sqrt{2\pi}}\int_0^{\infty} \underbrace{z}_{u}\,\underbrace{z e^{-z^2/2}\,dz}_{dv}$ (by parts)
$= \dfrac{2}{\sqrt{2\pi}}\left(\left.(-ze^{-z^2/2})\right|_0^{\infty} + \int_0^{\infty} e^{-z^2/2}\,dz\right) = \dfrac{2}{\sqrt{2\pi}}\left(0 + \dfrac{\sqrt{2\pi}}{2}\right) = 1$.
We use $\Phi$ to denote the standard normal CDF; so
$\Phi(z) = \dfrac{1}{\sqrt{2\pi}}\int_{-\infty}^{z} e^{-t^2/2}\,dt$.
By symmetry, we also have $\Phi(-z) = 1 - \Phi(z)$.
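Theorem 14.1 is exactly how the simulation example above works in practice. The sketch below is an added illustration (sampler name and check points are mine): it draws uniforms, pushes them through $F^{-1}(u) = -\ln(1-u)$, and compares the empirical mean and the CDF value at 1 against the exponential answers.

```python
import math
import random

def sample_expo():
    """Expo(1) draw via universality of the uniform: X = F^{-1}(U)."""
    u = random.random()
    return -math.log(1 - u)

xs = [sample_expo() for _ in range(200_000)]
print(sum(xs) / len(xs))                    # ~1, the Expo(1) mean
print(sum(x <= 1 for x in xs) / len(xs))    # ~1 - e^{-1} = 0.632...
```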
Lecture 15: 10/5/11

Recall the standard normal distribution. Let $Z$ be an r.v., $Z \sim \mathcal{N}(0, 1)$. Then $Z$ has CDF $\Phi$; it has $E(Z) = 0$, $\mathrm{Var}(Z) = E(Z^2) = 1$, and $E(Z^3) = 0$ (these are called the first, second, and third moments). By symmetry, also $-Z \sim \mathcal{N}(0, 1)$.

Definition 15.1. Let $X = \mu + \sigma Z$, with $\mu \in \mathbb{R}$ (the mean or center), $\sigma > 0$ (the SD or scale). Then we say $X \sim \mathcal{N}(\mu, \sigma^2)$. This is the general normal distribution.
If $X \sim \mathcal{N}(\mu, \sigma^2)$, we have $E(X) = \mu$ and $\mathrm{Var}(\mu + \sigma Z) = \sigma^2\mathrm{Var}(Z) = \sigma^2$. We call $Z = \frac{X-\mu}{\sigma}$ the standardization of $X$. $X$ has CDF
$P(X \le x) = P\left(\dfrac{X-\mu}{\sigma} \le \dfrac{x-\mu}{\sigma}\right) = \Phi\left(\dfrac{x-\mu}{\sigma}\right)$,
which yields a PDF of
$f_X(x) = \dfrac{1}{\sigma\sqrt{2\pi}}e^{-\left(\frac{x-\mu}{\sigma}\right)^2/2}$.
We also have $-X = -\mu + \sigma(-Z) \sim \mathcal{N}(-\mu, \sigma^2)$. Later, we will show that if $X_i \sim \mathcal{N}(\mu_i, \sigma_i^2)$ are independent, then
$X_i + X_j \sim \mathcal{N}(\mu_i + \mu_j, \sigma_i^2 + \sigma_j^2)$ and $X_i - X_j \sim \mathcal{N}(\mu_i - \mu_j, \sigma_i^2 + \sigma_j^2)$.

Observation 15.2. If $X \sim \mathcal{N}(\mu, \sigma^2)$, we have
$P(|X - \mu| \le \sigma) \approx 68\%$, $P(|X - \mu| \le 2\sigma) \approx 95\%$, $P(|X - \mu| \le 3\sigma) \approx 99.7\%$.

Observation 15.3. We observe some properties of the variance.
$\mathrm{Var}(X) = E((X - EX)^2) = E(X^2) - (EX)^2$.
For any constant $c$,
$\mathrm{Var}(X + c) = \mathrm{Var}(X)$, $\mathrm{Var}(cX) = c^2\mathrm{Var}(X)$.
Since variance is not linear, in general, $\mathrm{Var}(X+Y) \ne \mathrm{Var}(X) + \mathrm{Var}(Y)$. However, if $X$ and $Y$ are independent, we do have equality. On the other extreme,
$\mathrm{Var}(X + X) = \mathrm{Var}(2X) = 4\mathrm{Var}(X)$.
Also, in general, $\mathrm{Var}(X) \ge 0$, and $\mathrm{Var}(X) = 0$ iff there is some $a$ with $P(X = a) = 1$.

Observation 15.4. Let us compute the variance of the Poisson distribution. Let $X \sim \mathrm{Pois}(\lambda)$. We have
$E(X^2) = \sum_{k=0}^{\infty} k^2\dfrac{e^{-\lambda}\lambda^k}{k!}$.
To reduce this sum, we can do the following:
$\sum_{k=0}^{\infty} \dfrac{\lambda^k}{k!} = e^{\lambda}$.
Taking the derivative w.r.t. $\lambda$ and multiplying through by $\lambda$,
$\sum_{k=1}^{\infty} k\dfrac{\lambda^{k-1}}{k!} = e^{\lambda}$, so $\sum_{k=1}^{\infty} k\dfrac{\lambda^k}{k!} = \lambda e^{\lambda}$.
Repeating,
$\sum_{k=1}^{\infty} k^2\dfrac{\lambda^{k-1}}{k!} = e^{\lambda} + \lambda e^{\lambda} = e^{\lambda}(\lambda + 1)$, so $\sum_{k=1}^{\infty} k^2\dfrac{\lambda^k}{k!} = \lambda e^{\lambda}(\lambda + 1)$.
So, $E(X^2) = e^{-\lambda}\cdot\lambda e^{\lambda}(\lambda + 1) = \lambda^2 + \lambda$, and for our variance, we have
$\mathrm{Var}(X) = (\lambda^2 + \lambda) - \lambda^2 = \lambda$.

Observation 15.5. Let us compute the variance of the binomial distribution. Let $X \sim \mathrm{Bin}(n, p)$. We can write
$X = I_1 + \cdots + I_n$
where the $I_j$ are i.i.d. $\mathrm{Bern}(p)$. Then,
$X^2 = I_1^2 + \cdots + I_n^2 + 2I_1I_2 + 2I_1I_3 + \cdots + 2I_{n-1}I_n$,
where $I_iI_j$ is the indicator of success on both $i$ and $j$. So
$E(X^2) = nE(I_1^2) + 2\binom{n}{2}E(I_1I_2) = np + n(n-1)p^2 = np + n^2p^2 - np^2$.
So,
$\mathrm{Var}(X) = (np + n^2p^2 - np^2) - n^2p^2 = np(1-p) = npq$.

Proof (of discrete LOTUS). We want to show that $E(g(X)) = \sum_x g(x)P(X = x)$. To do so, once again we can ungroup our expected value expression:
$\sum_x g(x)P(X = x) = \sum_x g(x)\sum_{s : X(s) = x} P(\{s\}) = \sum_x\sum_{s : X(s) = x} g(X(s))P(\{s\}) = \sum_{s \in S} g(X(s))P(\{s\}) = E(g(X))$. $\square$
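Observation 15.2 can be checked against the exact normal CDF using only the standard library, since $\Phi(z) = \frac{1}{2}\left(1 + \mathrm{erf}(z/\sqrt{2})\right)$. This check is an added note, not from lecture.

```python
from math import erf, sqrt

def Phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

for k in (1, 2, 3):
    print(k, Phi(k) - Phi(-k))   # 0.6827, 0.9545, 0.9973
```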
Lecture 17: 10/14/11

Definition 17.1. The exponential distribution, $\mathrm{Expo}(\lambda)$, is defined by the PDF
$f(x) = \lambda e^{-\lambda x}$
for $x > 0$ and 0 elsewhere. We call $\lambda$ the rate parameter. Integrating clearly yields 1, which demonstrates validity. Our CDF is given by
$F(x) = \int_0^x \lambda e^{-\lambda t}\,dt = 1 - e^{-\lambda x}$ for $x > 0$, and 0 otherwise.

Observation 17.2. We can normalize any $X \sim \mathrm{Expo}(\lambda)$ by multiplying by $\lambda$, which gives $Y = \lambda X \sim \mathrm{Expo}(1)$. We have
$P(Y \le y) = P\left(X \le \dfrac{y}{\lambda}\right) = 1 - e^{-\lambda\cdot y/\lambda} = 1 - e^{-y}$.
Let us now compute the mean and variance of $Y \sim \mathrm{Expo}(1)$. We have
$E(Y) = \int_0^{\infty} y e^{-y}\,dy = \left.(-ye^{-y})\right|_0^{\infty} + \int_0^{\infty} e^{-y}\,dy = 1$
for the mean. For the variance,
$\mathrm{Var}(Y) = E(Y^2) - (EY)^2 = \int_0^{\infty} y^2 e^{-y}\,dy - 1 = 1$.
Then for $X = \frac{Y}{\lambda}$, we have $E(X) = \dfrac{1}{\lambda}$ and $\mathrm{Var}(X) = \dfrac{1}{\lambda^2}$.

Definition 17.3. A random variable $X$ has a memoryless distribution if
$P(X \ge s + t \mid X \ge s) = P(X \ge t)$.
Intuitively, if we have a random variable that we interpret as a waiting time, memorylessness means that no matter how long we have already waited, the probability of having to wait a given time more is invariant.

Proposition 17.4. The exponential distribution is memoryless.

Proof. Let $X \sim \mathrm{Expo}(\lambda)$. We know that
$P(X \ge t) = 1 - P(X \le t) = e^{-\lambda t}$.
Meanwhile,
$P(X \ge s+t \mid X \ge s) = \dfrac{P(X \ge s+t, X \ge s)}{P(X \ge s)} = \dfrac{P(X \ge s+t)}{P(X \ge s)} = \dfrac{e^{-\lambda(s+t)}}{e^{-\lambda s}} = e^{-\lambda t} = P(X \ge t)$,
which is our desired result. $\square$

Example. Let $X \sim \mathrm{Expo}(\lambda)$. Then by linearity and by memorylessness,
$E(X \mid X > a) = a + E(X - a \mid X > a) = a + \dfrac{1}{\lambda}$.
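Proposition 17.4 can also be seen in simulation: among exponential draws that exceed $s$, the amount by which they exceed $s$ behaves like a fresh exponential. The sketch below is an added illustration (the values of $\lambda$, $s$, $t$ are arbitrary).

```python
import random

lam, s, t = 0.5, 2.0, 1.5
xs = [random.expovariate(lam) for _ in range(500_000)]

leftovers = [x - s for x in xs if x > s]          # waiting time beyond s, given X > s
p_cond = sum(x > t for x in leftovers) / len(leftovers)
p_uncond = sum(x > t for x in xs) / len(xs)
print(p_cond, p_uncond)   # both close to e^{-lam*t} = e^{-0.75} = 0.472
```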
Lecture 18: 10/17/11

Theorem 18.1. If $X$ is a positive, continuous random variable that is memoryless (i.e., its distribution is memoryless), then there exists $\lambda \in \mathbb{R}$ such that $X \sim \mathrm{Expo}(\lambda)$.

Proof. Let $F$ be the CDF of $X$ and $G = 1 - F$. By memorylessness,
$G(s + t) = G(s)G(t)$.
We can easily derive from this identity that for $k \in \mathbb{Q}$,
$G(kt) = G(t)^k$.
This can be extended to all $k \in \mathbb{R}$. If we take $t = 1$, then we have
$G(x) = G(1)^x = e^{x\ln G(1)}$.
But since $G(1) < 1$, we can define $\lambda = -\ln G(1)$, and we will have $\lambda > 0$. Then this gives us
$F(x) = 1 - G(x) = 1 - e^{-\lambda x}$,
as desired. $\square$

Definition 18.2. A random variable $X$ has moment-generating function (MGF)
$M(t) = E(e^{tX})$
if $M(t)$ is bounded on some interval $(-\epsilon, \epsilon)$ about zero.

Observation 18.3. We might ask why we call $M$ moment-generating. Consider the Taylor expansion of $M$:
$E(e^{tX}) = E\left(\sum_{n=0}^{\infty}\dfrac{X^nt^n}{n!}\right) = \sum_{n=0}^{\infty}\dfrac{E(X^n)t^n}{n!}$.
Note that we cannot simply make use of linearity since our sum is infinite; however, this equation does hold for reasons beyond the scope of the course.
This observation also shows us that
$E(X^n) = M^{(n)}(0)$.

Claim 18.4. If $X$ and $Y$ have the same MGF, then they have the same CDF.
We will not prove this claim.

Observation 18.5. If $X$ has MGF $M_X$ and $Y$ has MGF $M_Y$, with $X$ and $Y$ independent, then
$M_{X+Y}(t) = E(e^{t(X+Y)}) = E(e^{tX})E(e^{tY}) = M_X(t)M_Y(t)$.
The second equality comes from the claim (which we will prove later) that for $X, Y$ independent, $E(XY) = E(X)E(Y)$.

Example. Let $X \sim \mathrm{Bern}(p)$. Then
$M(t) = E(e^{tX}) = pe^t + q$.
Suppose now that $X \sim \mathrm{Bin}(n, p)$. Again, we write $X = I_1 + \cdots + I_n$ where the $I_j$ are i.i.d. $\mathrm{Bern}(p)$. Then we see that
$M(t) = (pe^t + q)^n$.

Example. Let $Z \sim \mathcal{N}(0, 1)$. We have
$M(t) = \dfrac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} e^{tz - z^2/2}\,dz = e^{t^2/2}\dfrac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} e^{-(z-t)^2/2}\,dz = e^{t^2/2}$ (completing the square).

Example. Suppose $X_1, X_2, \ldots$ are conditionally independent (given $p$) random variables that are $\mathrm{Bern}(p)$. Suppose also that $p$ is unknown. In the Bayesian approach, let us treat $p$ as a random variable. Let $p \sim \mathrm{Unif}(0, 1)$; we call this the prior distribution.
Let $S_n = X_1 + \cdots + X_n$. Then $S_n \mid p \sim \mathrm{Bin}(n, p)$. We want to find the posterior distribution, $p \mid S_n$, which will give us $P(X_{n+1} = 1 \mid S_n = k)$. Using Bayes' Theorem,
$f(p \mid S_n = k) = \dfrac{P(S_n = k \mid p)f(p)}{P(S_n = k)} = \dfrac{P(S_n = k \mid p)}{P(S_n = k)} \propto p^k(1-p)^{n-k}$.
In the specific case of $S_n = n$, normalizing is easier:
$f(p \mid S_n = n) = (n+1)p^n$.
Computing $P(X_{n+1} = 1 \mid S_n = n)$ simply requires finding the expected value of an indicator with the above probability $p \mid S_n = n$:
$P(X_{n+1} = 1 \mid S_n = n) = \int_0^1 p(n+1)p^n\,dp = \dfrac{n+1}{n+2}$.
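The $(n+1)/(n+2)$ answer at the end of the Bayesian example above (Laplace's rule of succession) can be checked by simulating the two-stage experiment under the same uniform prior. This sketch is an added illustration; the function name and $n = 5$ are arbitrary.

```python
import random

def trial(n):
    """Draw p ~ Unif(0,1), then n+1 Bern(p) flips; report (first n all ones, flip n+1)."""
    p = random.random()
    first_n_all_ones = all(random.random() < p for _ in range(n))
    next_one = random.random() < p
    return first_n_all_ones, next_one

n, hits, total = 5, 0, 0
for _ in range(400_000):
    all_ones, nxt = trial(n)
    if all_ones:
        total += 1
        hits += nxt
print(hits / total, (n + 1) / (n + 2))   # both ~0.857
```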
Lecture 19: 10/19/11

Observation 19.1. Let $X \sim \mathrm{Expo}(1)$. We want to determine the MGF $M$ of $X$. By LOTUS,
$M(t) = E(e^{tX}) = \int_0^{\infty} e^{tx}e^{-x}\,dx = \int_0^{\infty} e^{-x(1-t)}\,dx = \dfrac{1}{1-t}$, for $t < 1$.
If we write
$\dfrac{1}{1-t} = \sum_{n=0}^{\infty} t^n = \sum_{n=0}^{\infty} n!\dfrac{t^n}{n!}$,
we get immediately that $E(X^n) = n!$.
Now take $Y \sim \mathrm{Expo}(\lambda)$ and let $X = \lambda Y \sim \mathrm{Expo}(1)$. So $Y^n = \dfrac{X^n}{\lambda^n}$, and hence
$E(Y^n) = \dfrac{n!}{\lambda^n}$.

Observation 19.2. Let $Z \sim \mathcal{N}(0, 1)$, and let us determine all its moments. We know that for $n$ odd, by symmetry, $E(Z^n) = 0$. We previously showed that $M(t) = e^{t^2/2}$. But we can write
$e^{t^2/2} = \sum_{n=0}^{\infty}\dfrac{(t^2/2)^n}{n!} = \sum_{n=0}^{\infty}\dfrac{t^{2n}}{2^n n!} = \sum_{n=0}^{\infty}\dfrac{(2n)!}{2^n n!}\cdot\dfrac{t^{2n}}{(2n)!}$.
So
$E(Z^{2n}) = \dfrac{(2n)!}{2^n n!}$.

Observation 19.3. Let $X \sim \mathrm{Pois}(\lambda)$. By LOTUS, its MGF is given by
$E(e^{tX}) = \sum_{k=0}^{\infty} e^{tk}\dfrac{e^{-\lambda}\lambda^k}{k!} = e^{-\lambda}e^{\lambda e^t} = e^{\lambda(e^t - 1)}$.

Observation 19.4. Now let $X \sim \mathrm{Pois}(\lambda)$ and $Y \sim \mathrm{Pois}(\mu)$ be independent. We want to find the distribution of $X + Y$. We can simply multiply their MGFs, yielding
$M_X(t)M_Y(t) = e^{\lambda(e^t-1)}e^{\mu(e^t-1)} = e^{(\lambda+\mu)(e^t-1)}$.
Thus, $X + Y \sim \mathrm{Pois}(\lambda + \mu)$.

Example. Suppose $X, Y$ above are dependent; specifically, take $X = Y$. Then $X + Y = 2X$. But this cannot be Poisson since it only takes on even values. We could also compute the mean and variance,
$E(2X) = 2\lambda$, $\mathrm{Var}(2X) = 4\lambda$,
but they should be equal for a Poisson.

We now turn to the study of joint distributions. Recall that joint distributions for independent random variables can be given simply by multiplying their CDFs; we want also to study cases where random variables are not independent.

Definition 19.5. Let $X, Y$ be random variables. Their joint CDF is given by
$F(x, y) = P(X \le x, Y \le y)$.
In the discrete case, $X$ and $Y$ have a joint PMF given by
$P(X = x, Y = y)$,
and in the continuous case, $X$ and $Y$ have a joint PDF given by
$f(x, y) = \dfrac{\partial^2}{\partial x\,\partial y}F(x, y)$,
and we can compute
$P((X, Y) \in B) = \iint_B f(x, y)\,dx\,dy$.
Their separate CDFs and PMFs (e.g., $P(X \le x)$) are referred to as marginal CDFs, PMFs, or PDFs. $X$ and $Y$ are independent precisely when the joint CDF is equal to the product of the marginal CDFs:
$F(x, y) = F_X(x)F_Y(y)$.
We can show that, equivalently, we can say
$P(X = x, Y = y) = P(X = x)P(Y = y)$ or $f(x, y) = f_X(x)f_Y(y)$
for all $x, y \in \mathbb{R}$.

Definition 19.6. To get the marginal PMF or PDF of a random variable $X$ from its joint PMF or PDF with another random variable $Y$, we can marginalize over $Y$ by computing
$P(X = x) = \sum_y P(X = x, Y = y)$ or $f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x, y)\,dy$.

Example. Let $X \sim \mathrm{Bern}(p)$, $Y \sim \mathrm{Bern}(q)$. Suppose they have joint PMF given by

            Y = 0    Y = 1
  X = 0      2/6      1/6     3/6
  X = 1      2/6      1/6     3/6
             4/6      2/6

Here we have computed the marginal probabilities (in the margin), and they demonstrate that $X$ and $Y$ are independent.

Example. Let us define the uniform distribution on the unit square, $\{(x, y) : x, y \in [0, 1]\}$. We want the joint PDF to be constant everywhere in the square and 0 otherwise; that is,
$f(x, y) = c$ for $0 \le x \le 1$, $0 \le y \le 1$, and 0 otherwise.
Normalizing, we simply need $c = \frac{1}{\text{area}} = 1$. It is apparent that the marginal PDFs are both uniform.

Example. Let us define the uniform distribution on the unit disc, $\{(x, y) : x^2 + y^2 \le 1\}$. The joint PDF can be given by
$f(x, y) = \frac{1}{\pi}$ if $x^2 + y^2 \le 1$, and 0 otherwise.
Given $X = x$, we have $-\sqrt{1-x^2} \le y \le \sqrt{1-x^2}$. We might guess that $Y$ is uniform, but clearly $X$ and $Y$ are dependent in this case, and it turns out that this is not the case.
Lecture 20: 10/21/11

Definition 20.1. Let $X$ and $Y$ be random variables. Then the conditional PDF of $Y|X$ is
$f_{Y|X}(y|x) = \dfrac{f_{X,Y}(x, y)}{f_X(x)} = \dfrac{f_{X|Y}(x|y)f_Y(y)}{f_X(x)}$.
Note that $Y|X$ is shorthand for $Y \mid X = x$.

Example. Recall the PDF for our uniform distribution on the disc,
$f(x, y) = \frac{1}{\pi}$ if $x^2 + y^2 \le 1$, and 0 otherwise.
Marginalizing over $Y$, we have
$f_X(x) = \int_{-\sqrt{1-x^2}}^{\sqrt{1-x^2}} \dfrac{1}{\pi}\,dy = \dfrac{2}{\pi}\sqrt{1-x^2}$
for $-1 \le x \le 1$. As a check, we could integrate this again with respect to $x$ to ensure that it is 1. From this, it is easy to find the conditional PDF,
$f_{Y|X}(y|x) = \dfrac{1/\pi}{\frac{2}{\pi}\sqrt{1-x^2}} = \dfrac{1}{2\sqrt{1-x^2}}$
for $-\sqrt{1-x^2} \le y \le \sqrt{1-x^2}$. Since we are holding $x$ constant, we see that $Y|X \sim \mathrm{Unif}(-\sqrt{1-x^2}, \sqrt{1-x^2})$.
From these computations, it is clear, in many ways, that $X$ and $Y$ are not independent. It is not true that $f_{X,Y} = f_Xf_Y$, nor that $f_{Y|X} = f_Y$.

Proposition 20.2. Let $X, Y$ have joint PDF $f$, and let $g : \mathbb{R}^2 \to \mathbb{R}$. Then
$E(g(X, Y)) = \iint g(x, y)f(x, y)\,dx\,dy$.
This is LOTUS in two dimensions.

Theorem 20.3. If $X, Y$ are independent random variables, then $E(XY) = E(X)E(Y)$.

Proof. We will prove this in the continuous case. Using LOTUS, we have
$E(XY) = \iint xy\,f_{X,Y}(x, y)\,dx\,dy = \iint xy\,f_X(x)f_Y(y)\,dx\,dy$ (by independence)
$= \int E(X)\,y f_Y(y)\,dy = E(X)E(Y)$,
as desired. $\square$

Example. Let $X, Y \sim \mathrm{Unif}(0, 1)$ i.i.d.; we want to find $E|X - Y|$. By LOTUS (and since the joint PDF is 1), we want to integrate
$E|X - Y| = \int_0^1\int_0^1 |x - y|\,dx\,dy = \iint_{x>y}(x - y)\,dx\,dy + \iint_{x\le y}(y - x)\,dx\,dy = 2\iint_{x>y}(x - y)\,dx\,dy$ (by symmetry)
$= 2\int_0^1\int_y^1 (x - y)\,dx\,dy = 2\int_0^1\left(\dfrac{1}{2} - y + \dfrac{y^2}{2}\right)dy = \dfrac{1}{3}$.
If we let $M = \max\{X, Y\}$ and $L = \min\{X, Y\}$, then we would have $|X - Y| = M - L$, and hence also
$E(M - L) = E(M) - E(L) = \dfrac{1}{3}$.
We also have
$E(X + Y) = E(M + L) = E(M) + E(L) = 1$.
This gives $E(M) = \dfrac{2}{3}$ and $E(L) = \dfrac{1}{3}$.

Example (Chicken-Egg Problem). Suppose there are $N \sim \mathrm{Pois}(\lambda)$ eggs, each hatching with probability $p$, independently (these are Bernoulli). Let $X$ be the number of eggs that hatch. Thus, $X|N \sim \mathrm{Bin}(N, p)$. Let $Y$ be the number that don't hatch. Then $X + Y = N$.
Let us find the joint PMF of $X$ and $Y$.
$P(X = i, Y = j) = \sum_{n=0}^{\infty} P(X = i, Y = j \mid N = n)P(N = n)$
$= P(X = i, Y = j \mid N = i+j)P(N = i+j)$
$= P(X = i \mid N = i+j)P(N = i+j)$
$= \dfrac{(i+j)!}{i!\,j!}p^iq^j\cdot\dfrac{e^{-\lambda}\lambda^{i+j}}{(i+j)!}$
$= \left(e^{-\lambda p}\dfrac{(\lambda p)^i}{i!}\right)\left(e^{-\lambda q}\dfrac{(\lambda q)^j}{j!}\right)$.
Since the joint PMF factors, $X$ and $Y$ are independent, with $X \sim \mathrm{Pois}(\lambda p)$ and $Y \sim \mathrm{Pois}(\lambda q)$. In other words, the randomness of the number of eggs offsets the dependence of $Y$ on $X$ that holds for a fixed number of eggs. This is a special property of the Poisson distribution.
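The chicken-egg factorization above says the hatched and unhatched counts are independent Poissons with means $\lambda p$ and $\lambda q$. The following simulation sketch is an added check (the Poisson sampler via exponential interarrival times and the parameter choices are mine): it verifies the marginal means and the near-zero covariance.

```python
import random

def chicken_egg(lam, p):
    """One draw: N ~ Pois(lam) eggs, each hatching independently with prob p."""
    n = 0
    t = random.expovariate(1)          # Poisson count = rate-1 arrivals before time lam
    while t < lam:
        n += 1
        t += random.expovariate(1)
    x = sum(random.random() < p for _ in range(n))
    return x, n - x

lam, p = 6.0, 0.3
draws = [chicken_egg(lam, p) for _ in range(100_000)]
mx = sum(x for x, _ in draws) / len(draws)
my = sum(y for _, y in draws) / len(draws)
cov = sum(x * y for x, y in draws) / len(draws) - mx * my
print(mx, my, cov)   # ~1.8, ~4.2, ~0
```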
Lecture 21: 10/24/11

Theorem 21.1. Let $X \sim \mathcal{N}(\mu_1, \sigma_1^2)$ and $Y \sim \mathcal{N}(\mu_2, \sigma_2^2)$ be independent random variables. Then $X + Y \sim \mathcal{N}(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2)$.

Proof. Since $X$ and $Y$ are independent, we can simply multiply their MGFs. This is given by
$M_{X+Y}(t) = M_X(t)M_Y(t) = \exp\left(\mu_1t + \sigma_1^2\dfrac{t^2}{2}\right)\exp\left(\mu_2t + \sigma_2^2\dfrac{t^2}{2}\right) = \exp\left((\mu_1 + \mu_2)t + (\sigma_1^2 + \sigma_2^2)\dfrac{t^2}{2}\right)$,
which yields our desired result. $\square$

Example. Let $Z_1, Z_2 \sim \mathcal{N}(0, 1)$, i.i.d.; let us find $E|Z_1 - Z_2|$. By the above, $Z_1 - Z_2 \sim \mathcal{N}(0, 2)$. Let $Z \sim \mathcal{N}(0, 1)$. Then
$E|Z_1 - Z_2| = E|\sqrt{2}Z| = \sqrt{2}E|Z| = \sqrt{2}\dfrac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}|z|e^{-z^2/2}\,dz = 2\sqrt{2}\dfrac{1}{\sqrt{2\pi}}\int_0^{\infty} ze^{-z^2/2}\,dz$ (evenness) $= \sqrt{2}\sqrt{\dfrac{2}{\pi}} = \dfrac{2}{\sqrt{\pi}}$.

Definition 21.2. Let $X = (X_1, \ldots, X_k)$ be a multivariate random variable, $p = (p_1, \ldots, p_k)$ a probability vector with $p_j \ge 0$ and $\sum_j p_j = 1$. The multinomial distribution is given by assorting $n$ objects into $k$ categories, each object having probability $p_j$ of being in category $j$, and taking the number of objects in each category, $X_j$. If $X$ has the multinomial distribution, we write $X \sim \mathrm{Mult}_k(n, p)$.
The PMF of $X$ is given by
$P(X_1 = n_1, \ldots, X_k = n_k) = \dfrac{n!}{n_1!\cdots n_k!}p_1^{n_1}\cdots p_k^{n_k}$
if $\sum_k n_k = n$, and 0 otherwise.

Observation 21.3. Let $X \sim \mathrm{Mult}_k(n, p)$. Then the marginal distribution of $X_j$ is simply $X_j \sim \mathrm{Bin}(n, p_j)$, since each object is either in $j$ or not, and we have
$E(X_j) = np_j$, $\mathrm{Var}(X_j) = np_j(1-p_j)$.

Observation 21.4. If we lump some of our categories together for $X \sim \mathrm{Mult}_k(n, p)$, then the result is still multinomial. That is, taking
$Y = (X_1, \ldots, X_{l-1}, X_l + \cdots + X_k)$ and $p' = (p_1, \ldots, p_{l-1}, p_l + \cdots + p_k)$,
we have $Y \sim \mathrm{Mult}_l(n, p')$, and this is true for any combination of lumpings.

Observation 21.5. Let $X \sim \mathrm{Mult}_k(n, p)$. Then given $X_1 = n_1$,
$(X_2, \ldots, X_k) \sim \mathrm{Mult}_{k-1}(n - n_1, (p_2', \ldots, p_k'))$,
where
$p_j' = \dfrac{p_j}{1 - p_1} = \dfrac{p_j}{p_2 + \cdots + p_k}$.
This is symmetric for all $j$.

Definition 21.6. The Cauchy distribution is the distribution of $T = \dfrac{X}{Y}$ with $X, Y \sim \mathcal{N}(0, 1)$ i.i.d.

Note. The Cauchy distribution has no mean, but has the property that an average of many Cauchy distributions is still Cauchy.

Observation 21.7. Let us compute the PDF of $T$ with the Cauchy distribution. The CDF is given by
$P\left(\dfrac{X}{Y} \le t\right) = P\left(\dfrac{X}{|Y|} \le t\right) = P(X \le t|Y|)$
$= \int_{-\infty}^{\infty}\int_{-\infty}^{t|y|}\dfrac{1}{\sqrt{2\pi}}e^{-x^2/2}\dfrac{1}{\sqrt{2\pi}}e^{-y^2/2}\,dx\,dy$
$= \dfrac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}e^{-y^2/2}\Phi(t|y|)\,dy$
$= \sqrt{\dfrac{2}{\pi}}\int_0^{\infty}e^{-y^2/2}\Phi(ty)\,dy$.
There is little we can do to compute this integral. Instead, let us compute the PDF, calling the CDF above $F(t)$. Then we have
$F'(t) = \sqrt{\dfrac{2}{\pi}}\int_0^{\infty}ye^{-y^2/2}\dfrac{1}{\sqrt{2\pi}}e^{-t^2y^2/2}\,dy = \dfrac{1}{\pi}\int_0^{\infty}ye^{-(1+t^2)y^2/2}\,dy$.
Substituting $u = \dfrac{(1+t^2)y^2}{2}$, $du = (1+t^2)y\,dy$, this gives
$F'(t) = \dfrac{1}{\pi(1+t^2)}$.
We could also have performed this computation using the Law of Total Probability. Let $\varphi$ be the standard normal PDF. We have
$P(X \le t|Y|) = \int_{-\infty}^{\infty}P(X \le t|Y| \mid Y = y)\varphi(y)\,dy = \int_{-\infty}^{\infty}P(X \le ty)\varphi(y)\,dy$ (by independence) $= \int_{-\infty}^{\infty}\Phi(ty)\varphi(y)\,dy$,
and then we proceed as before.
Lecture 22: 10/26/11

Definition 22.1. The covariance of random variables $X$ and $Y$ is
$\mathrm{Cov}(X, Y) = E((X - EX)(Y - EY))$.

Note. The following properties are immediately true of the covariance:
1. $\mathrm{Cov}(X, X) = \mathrm{Var}(X)$
2. $\mathrm{Cov}(X, Y) = \mathrm{Cov}(Y, X)$
3. $\mathrm{Cov}(X, Y) = E(XY) - E(X)E(Y)$
4. $\mathrm{Cov}(X, c) = 0$ for $c \in \mathbb{R}$
5. $\mathrm{Cov}(cX, Y) = c\,\mathrm{Cov}(X, Y)$ for $c \in \mathbb{R}$
6. $\mathrm{Cov}(X, Y + Z) = \mathrm{Cov}(X, Y) + \mathrm{Cov}(X, Z)$
The last two properties demonstrate that covariance is bilinear. In general,
$\mathrm{Cov}\left(\sum_{i=1}^{n}a_iX_i, \sum_{j=1}^{m}b_jY_j\right) = \sum_{i,j}a_ib_j\mathrm{Cov}(X_i, Y_j)$.

Observation 22.2. We can use covariance to compute the variance of sums:
$\mathrm{Var}(X + Y) = \mathrm{Cov}(X, X) + \mathrm{Cov}(X, Y) + \mathrm{Cov}(Y, X) + \mathrm{Cov}(Y, Y) = \mathrm{Var}(X) + 2\mathrm{Cov}(X, Y) + \mathrm{Var}(Y)$,
and more generally,
$\mathrm{Var}\left(\sum X_i\right) = \sum\mathrm{Var}(X_i) + 2\sum_{i<j}\mathrm{Cov}(X_i, X_j)$.

Theorem 22.3. If $X, Y$ are independent, then $\mathrm{Cov}(X, Y) = 0$ (we say that they are uncorrelated).

Example. The converse of the above is false. Let $Z \sim \mathcal{N}(0, 1)$, $X = Z$, $Y = Z^2$, and let us compute the covariance:
$\mathrm{Cov}(X, Y) = E(XY) - (EX)(EY) = E(Z^3) - (EZ)(EZ^2) = 0$.
But $X$ and $Y$ are very dependent, since $Y$ is a function of $X$.

Definition 22.4. The correlation of two random variables $X$ and $Y$ is
$\mathrm{Cor}(X, Y) = \dfrac{\mathrm{Cov}(X, Y)}{\mathrm{SD}(X)\,\mathrm{SD}(Y)} = \mathrm{Cov}\left(\dfrac{X - EX}{\mathrm{SD}(X)}, \dfrac{Y - EY}{\mathrm{SD}(Y)}\right)$.
The operation of replacing $X$ by $\dfrac{X - EX}{\mathrm{SD}(X)}$ is called standardization; it gives the result a mean of 0 and a variance of 1.

Theorem 22.5. $|\mathrm{Cor}(X, Y)| \le 1$.

Proof. We could apply Cauchy-Schwarz to get this result immediately, but we shall also provide a direct proof. WLOG, assume $X$ and $Y$ are standardized. Let $\rho = \mathrm{Cor}(X, Y)$. We have
$\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) + 2\rho = 2 + 2\rho$,
and we also have
$\mathrm{Var}(X - Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) - 2\rho = 2 - 2\rho$.
But since $\mathrm{Var} \ge 0$, this yields our result. $\square$

Example. Let $X \sim \mathrm{Bin}(n, p)$. Write $X = X_1 + \cdots + X_n$ where the $X_j$ are i.i.d. $\mathrm{Bern}(p)$. Then
$\mathrm{Var}(X_j) = E(X_j^2) - (EX_j)^2 = p - p^2 = p(1-p)$.
It follows that
$\mathrm{Var}(X) = np(1-p)$,
since $\mathrm{Cov}(X_i, X_j) = 0$ for $i \ne j$ by independence.

Example. Let $(X_1, \ldots, X_k) \sim \mathrm{Mult}_k(n, p)$. We shall compute $\mathrm{Cov}(X_i, X_j)$ for all $i, j$. If $i = j$, then
$\mathrm{Cov}(X_i, X_i) = \mathrm{Var}(X_i) = np_i(1-p_i)$.
Suppose $i \ne j$. We can expect that the covariance will be negative, since more objects in category $i$ means fewer in category $j$. We have
$\mathrm{Var}(X_i + X_j) = np_i(1-p_i) + np_j(1-p_j) + 2\mathrm{Cov}(X_i, X_j)$.
But by lumping $i$ and $j$ together, we also have
$\mathrm{Var}(X_i + X_j) = n(p_i + p_j)(1 - (p_i + p_j))$.
Then solving for the covariance, we have
$\mathrm{Cov}(X_i, X_j) = -np_ip_j$.
Lecture 23: 10/28/11

Example. Let $X \sim \mathrm{HGeom}(w, b, n)$. Let us write $p = \frac{w}{w+b}$ and $N = w + b$. Then we can write $X = X_1 + \cdots + X_n$ where the $X_j$ are $\mathrm{Bern}(p)$. (Note, however, that unlike with the binomial, the $X_j$ are not independent.) Then
$\mathrm{Var}(X) = n\mathrm{Var}(X_1) + 2\binom{n}{2}\mathrm{Cov}(X_1, X_2) = np(1-p) + 2\binom{n}{2}\mathrm{Cov}(X_1, X_2)$.
Computing the covariance, we have
$\mathrm{Cov}(X_1, X_2) = E(X_1X_2) - (EX_1)(EX_2) = \dfrac{w}{w+b}\cdot\dfrac{w-1}{w+b-1} - \left(\dfrac{w}{w+b}\right)^2 = \dfrac{w}{w+b}\cdot\dfrac{w-1}{w+b-1} - p^2$,
and simplifying,
$\mathrm{Var}(X) = \dfrac{N-n}{N-1}np(1-p)$.
The term $\frac{N-n}{N-1}$ is called the finite population correction; it represents the offset from the binomial due to lack of replacement.

Note. Let $A$ be an event and $I_A$ its indicator random variable. It is clear that $I_A^n = I_A$ for any $n \in \mathbb{N}$. It is also clear that $I_AI_B = I_{A\cap B}$.

Theorem 23.1. Let $X$ be a continuous random variable with PDF $f_X$, and let $Y = g(X)$ where $g$ is differentiable and strictly increasing. Then the PDF of $Y$ is given by
$f_Y(y) = f_X(x)\dfrac{dx}{dy}$,
where $y = g(x)$ and $x = g^{-1}(y)$. (Also recall from calculus that $\frac{dy}{dx} = \left(\frac{dx}{dy}\right)^{-1}$.)

Proof. From the CDF of $Y$, we get
$P(Y \le y) = P(g(X) \le y) = P(X \le g^{-1}(y)) = F_X(g^{-1}(y)) = F_X(x)$.
Then, differentiating, we get by the Chain Rule that
$f_Y(y) = f_X(x)\dfrac{dx}{dy}$. $\square$

Example. Consider the log normal distribution, which is given by $Y = e^Z$ for $Z \sim \mathcal{N}(0, 1)$. We have
$f_Z(z) = \dfrac{1}{\sqrt{2\pi}}e^{-z^2/2}$.
To put this in terms of $y$, we substitute $z = \ln y$. Moreover, we know that
$\dfrac{dy}{dz} = e^z = y$,
and so,
$f_Y(y) = \dfrac{1}{y}\dfrac{1}{\sqrt{2\pi}}e^{-(\ln y)^2/2}$.

Theorem 23.2. Suppose that $X$ is a continuous random variable in $n$ dimensions, $Y = g(X)$ where $g : \mathbb{R}^n \to \mathbb{R}^n$ is continuously differentiable and invertible. Then
$f_Y(y) = f_X(x)\left|\det\dfrac{dx}{dy}\right|$,
where $\frac{dx}{dy}$ is the Jacobian matrix with entries $\frac{\partial x_i}{\partial y_j}$.

Observation 23.3. Let $T = X + Y$, where $X$ and $Y$ are independent. In the discrete case, we have
$P(T = t) = \sum_x P(X = x)P(Y = t - x)$.
For the continuous case, we have
$f_T(t) = (f_X * f_Y)(t) = \int_{-\infty}^{\infty} f_X(x)f_Y(t-x)\,dx$.
This is true because we have
$F_T(t) = P(T \le t) = \int_{-\infty}^{\infty} P(X + Y \le t \mid X = x)f_X(x)\,dx = \int_{-\infty}^{\infty} F_Y(t-x)f_X(x)\,dx$,
and then taking the derivative of both sides,
$f_T(t) = \int_{-\infty}^{\infty} f_X(x)f_Y(t-x)\,dx$.

We now briefly turn our attention to proving the existence of objects with some desired property $A$ using probability. We want to show that $P(A) > 0$ for some random object, which implies that some such object must exist.
Reframing this question, suppose each object in our universe of objects has some kind of score associated with this property; then we want to show that there is some object with a good score. But we know that there is an object with score at least equal to the average score, i.e., the score of a random object. Showing that this average is high enough will prove the existence of an object without specifying one.

Example. Suppose there are 100 people in 15 committees of 20 people each, and that each person is on exactly 3 committees. We want to show that there exist 2 committees with overlap $\ge 3$. Let us find the average overlap of two random committees. Using indicator random variables for the probability that a given person is on both of those two committees, we get
$E(\text{overlap}) = 100\cdot\dfrac{\binom{3}{2}}{\binom{15}{2}} = \dfrac{300}{105} = \dfrac{20}{7}$.
Then there exists a pair of committees with overlap of at least $\frac{20}{7}$. But since all overlaps must be integral, there is a pair of committees with overlap $\ge 3$.
Lecture 24: 10/31/11

Definition 24.1. The beta distribution, $\mathrm{Beta}(a, b)$ for $a, b > 0$, is defined by the PDF
$f(x) = cx^{a-1}(1-x)^{b-1}$ for $0 < x < 1$, and 0 otherwise,
where $c$ is a normalizing constant (defined by the beta function).
The beta distribution is a flexible family of continuous distributions on $(0, 1)$. By flexible, we mean that the appearance of the distribution varies significantly depending on the values of its parameters. If $a = b = 1$, the beta reduces to the uniform. If $a = 2$ and $b = 1$, the beta appears as a line with positive slope. If $a = b = \frac{1}{2}$, the beta appears to be concave-up and parabolic; if $a = b = 2$, it is concave down.
The beta distribution is often used as a prior distribution for some parameter on $(0, 1)$. In particular, it is the conjugate prior to the binomial distribution.

Observation 24.2. Suppose that, based on some data, we have $X \mid p \sim \mathrm{Bin}(n, p)$, and that our prior distribution for $p$ is $p \sim \mathrm{Beta}(a, b)$. We want to determine the posterior distribution of $p$, $p \mid X$. We have
$f(p \mid X = k) = \dfrac{P(X = k \mid p)f(p)}{P(X = k)} = \dfrac{\binom{n}{k}p^k(1-p)^{n-k}\cdot cp^{a-1}(1-p)^{b-1}}{P(X = k)} \propto p^{a+k-1}(1-p)^{b+n-k-1}$.
So we have $p \mid X \sim \mathrm{Beta}(a + X, b + n - X)$. We call the beta the conjugate prior to the binomial because both the prior and the posterior distribution are beta.

Observation 24.3. Let us find a specific case of the normalizing constant,
$c^{-1} = \int_0^1 x^k(1-x)^{n-k}\,dx$.
To do this, consider the story of Bayes' billiards. Suppose we have $n+1$ billiard balls, all white; then we paint one pink and throw them along $(0, 1)$, all independently. Let $X$ be the number of white balls to the left of the pink ball. Then, conditioning on where the pink ball ends up, we have
$P(X = k) = \int_0^1 P(X = k \mid p)\underbrace{f(p)}_{1}\,dp = \int_0^1\binom{n}{k}p^k(1-p)^{n-k}\,dp$,
where, given the pink ball's location, $X$ is simply binomial (each white ball has an independent chance $p$ of landing to the left). Note, however, that painting a ball pink and then throwing the balls along $(0, 1)$ is the same as throwing the balls along the interval and then painting one pink. But then it is clear that there is an equal chance for any given number from 0 to $n$ of white balls to be to the pink ball's left. So we have
$\int_0^1\binom{n}{k}p^k(1-p)^{n-k}\,dp = \dfrac{1}{n+1}$.
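The Bayes' billiards identity above is easy to confirm numerically; this is an added check (the midpoint-rule integrator and the choice $n = 10$ are mine). The integral of the binomial probability over $p \in (0, 1)$ does not depend on $k$ and equals $1/(n+1)$.

```python
from math import comb

def beta_integral(n, k, steps=100_000):
    """Midpoint-rule estimate of the integral of C(n,k) p^k (1-p)^(n-k) dp over (0,1)."""
    total = 0.0
    for i in range(steps):
        p = (i + 0.5) / steps
        total += comb(n, k) * p**k * (1 - p)**(n - k)
    return total / steps

n = 10
print([round(beta_integral(n, k), 4) for k in range(n + 1)])
print(1 / (n + 1))   # 0.0909...
```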
Lecture 25: 11/2/11

Definition 25.1. The gamma function is given by
$\Gamma(a) = \int_0^{\infty} x^{a-1}e^{-x}\,dx = \int_0^{\infty} x^ae^{-x}\dfrac{dx}{x}$
for any $a > 0$. The gamma function is a continuous extension of the factorial operator on natural numbers. For $n$ a positive integer,
$\Gamma(n) = (n-1)!$.
More generally,
$\Gamma(x+1) = x\Gamma(x)$.

Definition 25.2. The standard gamma distribution, $\mathrm{Gamma}(a, 1)$, is defined by the PDF
$\dfrac{1}{\Gamma(a)}x^{a-1}e^{-x}$
for $x > 0$, which is simply the integrand of the normalized gamma function. More generally, let $X \sim \mathrm{Gamma}(a, 1)$ and $Y = \frac{X}{\lambda}$. We say that $Y \sim \mathrm{Gamma}(a, \lambda)$. To get the PDF of $Y$, we simply change variables; we have $x = \lambda y$, so
$f_Y(y) = f_X(x)\dfrac{dx}{dy} = \dfrac{1}{\Gamma(a)}(\lambda y)^a e^{-\lambda y}\dfrac{1}{y}$.

Definition 25.3. We define a Poisson process as a process in which events occur continuously and independently such that in any time interval $t$, the number of events which occur is $N_t \sim \mathrm{Pois}(\lambda t)$ for some fixed rate parameter $\lambda$.

Observation 25.4. The time $T_1$ until the first event occurs is $\mathrm{Expo}(\lambda)$:
$P(T_1 > t) = P(N_t = 0) = e^{-\lambda t}$,
which means that
$P(T_1 \le t) = 1 - e^{-\lambda t}$,
as desired. More generally, the time until the next event is always $\mathrm{Expo}(\lambda)$; this is clear from the memoryless property.

Proposition 25.5. Let $T_n$ be the time of the $n$-th event in a Poisson process with rate parameter $\lambda$. Then, for $X_j$ i.i.d. $\mathrm{Expo}(\lambda)$, we have
$T_n = \sum_{j=1}^{n} X_j \sim \mathrm{Gamma}(n, \lambda)$.
The exponential distribution is the continuous analogue of the geometric distribution; in this sense, the gamma distribution is the continuous analogue of the negative binomial distribution.
Proof (of Proposition 25.5). One method of proof, which we will not use, would be to repeatedly convolve the PDFs of the i.i.d. X_j. Instead, we will use MGFs. Suppose that the X_j are i.i.d. Expo(1); we will show that their sum is Gamma(n, 1). The MGF of X_j is given by

M_{X_j}(t) = 1/(1 - t)

for t < 1. Then the MGF of T_n is

M_{T_n}(t) = (1/(1 - t))^n

also for t < 1. We will show that the gamma distribution has the same MGF. Let Y ~ Gamma(n, 1). Then by LOTUS,

E(e^(tY)) = (1/Γ(n)) ∫_0^∞ e^(ty) y^n e^(-y) dy/y = (1/Γ(n)) ∫_0^∞ y^n e^(-(1-t)y) dy/y

Changing variables, with x = (1 - t)y, then

E(e^(tY)) = ((1 - t)^(-n)/Γ(n)) ∫_0^∞ x^n e^(-x) dx/x = (1/(1 - t))^n Γ(n)/Γ(n) = (1/(1 - t))^n

Note that this is the MGF for any n > 0, although the sum-of-exponentials expression requires integral n.

Observation 25.6. Let us compute the moments of X ~ Gamma(a, 1). We want to compute E(X^c). We have

E(X^c) = (1/Γ(a)) ∫_0^∞ x^c x^a e^(-x) dx/x
       = (1/Γ(a)) ∫_0^∞ x^(a+c) e^(-x) dx/x
       = Γ(a + c)/Γ(a)
       = a(a + 1)···(a + c - 1) Γ(a)/Γ(a)
       = a(a + 1)···(a + c - 1)

If instead, we take X ~ Gamma(a, λ), then we will have

E(X^c) = a(a + 1)···(a + c - 1)/λ^c

Lecture 26: 11/4/11

Observation 26.1 (Gamma-Beta). Let us take X ~ Gamma(a, λ) to be your waiting time in line at the bank, and Y ~ Gamma(b, λ) your waiting time in line at the post office. Suppose that X and Y are independent. Let T = X + Y; we know that this has distribution Gamma(a + b, λ).

Let us compute the joint distribution of T and of W = X/(X + Y), the fraction of time spent waiting at the bank. For simplicity of notation, we will take λ = 1. The joint PDF is given by

f_{T,W}(t, w) = f_{X,Y}(x, y) |det ∂(x, y)/∂(t, w)| = (1/(Γ(a)Γ(b))) x^a e^(-x) y^b e^(-y) (1/(xy)) |det ∂(x, y)/∂(t, w)|

We must find the determinant of the Jacobian (here expressed in silly-looking notation). We know that

x + y = t,   x/(x + y) = w

Solving for x and y, we easily find that

x = tw,   y = t(1 - w)

Then the determinant of our Jacobian is given by

det ( w      t
      1 - w  -t ) = -tw - t(1 - w) = -t

Taking the absolute value, we then get

f_{T,W}(t, w) = (1/(Γ(a)Γ(b))) x^a e^(-x) y^b e^(-y) (1/(xy)) t
             = (1/(Γ(a)Γ(b))) w^(a-1) (1 - w)^(b-1) t^(a+b) e^(-t) (1/t)
             = (Γ(a + b)/(Γ(a)Γ(b))) w^(a-1) (1 - w)^(b-1) · (1/Γ(a + b)) t^(a+b) e^(-t) (1/t)

This is a product of some function of w with the PDF of T, so we see that T and W are independent. To find the marginal distribution of W, we note that the PDF of T integrates to 1 just like any PDF, so we have

f_W(w) = ∫ f_{T,W}(t, w) dt = (Γ(a + b)/(Γ(a)Γ(b))) w^(a-1) (1 - w)^(b-1)

This yields W ~ Beta(a, b) and also gives the normalizing constant of the beta distribution. It turns out that if X were distributed according to any other distribution, we would not have independence, but proving so is out of the scope of the course.
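The independence of T = X + Y and W = X/(X + Y) in Observation 26.1 is easy to probe by simulation. A minimal sketch (assuming NumPy; a = 3, b = 5 and the number of trials are arbitrary illustrative choices):

import numpy as np

rng = np.random.default_rng(0)
a, b, trials = 3.0, 5.0, 200_000

x = rng.gamma(shape=a, scale=1.0, size=trials)  # X ~ Gamma(a, 1)
y = rng.gamma(shape=b, scale=1.0, size=trials)  # Y ~ Gamma(b, 1)
t = x + y                                       # T ~ Gamma(a + b, 1)
w = x / (x + y)                                 # W ~ Beta(a, b)

print("corr(T, W):", np.corrcoef(t, w)[0, 1])          # close to 0
print("E(W):", w.mean(), " a/(a+b) =", a / (a + b))
print("E(T):", t.mean(), " a+b =", a + b)

A near-zero sample correlation does not by itself prove independence, of course, but it is consistent with the result above, and the empirical means match the Gamma and Beta values.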


Observation 26.2. Let us find E(W) for W ~ Beta(a, b). Let us write W = X/(X + Y) with X and Y defined as above. We have

E(X/(X + Y)) = E(X)/E(X + Y) = a/(a + b)

Note that in general, the first equality is false! However, because X + Y and X/(X + Y) are independent here, they are uncorrelated, and hence

E(X/(X + Y)) E(X + Y) = E((X/(X + Y))(X + Y)) = E(X)

which justifies the computation.

Definition 26.3. Let X_1, ..., X_n be i.i.d. The order statistics of this sequence are

X_(1) ≤ X_(2) ≤ ··· ≤ X_(n)

where

X_(1) = min{X_1, ..., X_n},   X_(n) = max{X_1, ..., X_n}

and the remaining X_(j) fill out the order. If n is odd, we have the median X_((n+1)/2). The order statistics let us find arbitrary quantiles for the sequence.

The order statistics are hard to work with because they are dependent (and positively correlated), even though we started with i.i.d. random variables. They are particularly tricky in the discrete case because of ties, so we will assume that the X_j are continuous.

Observation 26.4. Let X_1, ..., X_n be i.i.d. continuous with PDF f and CDF F. We want to find the CDF and PDF of X_(j). For the CDF, we have

P(X_(j) ≤ x) = P(at least j of the X_i's are ≤ x)
            = Σ_{k=j}^n (n choose k) P(X_1 ≤ x)^k (1 - P(X_1 ≤ x))^(n-k)
            = Σ_{k=j}^n (n choose k) F(x)^k (1 - F(x))^(n-k)

Turning now to the PDF, recall that a PDF gives a density rather than a probability. We can multiply the PDF of X_(j) at a point x by a tiny interval dx about x in order to obtain the probability that X_(j) is in that interval. Then we can simply count the number of ways to have one of the X_i be in that interval and precisely j - 1 of the X_i below the interval. So,

f_{X_(j)}(x) dx = n (f(x) dx) (n-1 choose j-1) F(x)^(j-1) (1 - F(x))^(n-j)
f_{X_(j)}(x) = n (n-1 choose j-1) F(x)^(j-1) (1 - F(x))^(n-j) f(x)

Example. Let U_1, ..., U_n be i.i.d. Unif(0, 1), and let us determine the distribution of U_(j). Applying the above result, we have

f_{U_(j)}(x) = n (n-1 choose j-1) x^(j-1) (1 - x)^(n-j)

for 0 < x < 1. Thus, we have U_(j) ~ Beta(j, n - j + 1). This confirms our earlier result that, for U_1 and U_2 i.i.d. Unif(0, 1), we have

E|U_1 - U_2| = E(U_max) - E(U_min) = 1/3

because U_max ~ Beta(2, 1) and U_min ~ Beta(1, 2), which have means 2/3 and 1/3 respectively.

Lecture 27: 11/7/11

Example (Two Envelopes Paradox). Suppose we have two envelopes containing sums of money X and Y, and suppose we are told that one envelope has twice as much money as the other. We choose one envelope; by symmetry, take X WLOG. Then it appears that Y has equal probabilities of containing X/2 and of containing 2X, and thus averages 1.25X. So it seems that we ought to switch to envelope Y. But then, by the same reasoning, it would seem we ought to switch back to X.

We can argue about this paradox in two ways. First, we can say, by symmetry, that

E(X) = E(Y)

which is simple and straightforward. We might also, however, try to condition on the value of Y with respect to X using the Law of Total Probability:

E(Y) = E(Y | Y = 2X) P(Y = 2X) + E(Y | Y = X/2) P(Y = X/2)
     = (1/2) E(2X) + (1/2) E(X/2)
     = (5/4) E(X)

Assuming that X and Y are not 0 or infinite, these cannot both be correct, and it is the argument from symmetry that is correct. The flaw in our second argument is that, in general,

E(Y | Y = Z) ≠ E(Z)

because we cannot drop the condition that Y = Z; we must write

E(Y | Y = Z) = E(Z | Y = Z)

In other words, if we let I be the indicator for Y = 2X, we are saying that X and I are dependent.
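Returning to the order statistics of uniforms above, the claim U_(j) ~ Beta(j, n - j + 1) is easy to check numerically. A minimal sketch (assuming NumPy; n = 10, j = 3, and the number of trials are arbitrary illustrative choices):

import numpy as np

rng = np.random.default_rng(1)
n, j, trials = 10, 3, 100_000

u = rng.uniform(0.0, 1.0, size=(trials, n))
u_j = np.sort(u, axis=1)[:, j - 1]     # the jth smallest in each row, i.e. U_(j)

# Beta(j, n-j+1) has mean j/(n+1) and variance j(n-j+1)/((n+1)^2 (n+2)).
print("mean:", u_j.mean(), " j/(n+1) =", j / (n + 1))
print("var: ", u_j.var(), " theory =", j * (n - j + 1) / ((n + 1) ** 2 * (n + 2)))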


Example (Patterns in coin flips). Suppose we repeatedly flip a fair coin. We want to determine how many flips it takes until HT is observed (including the H and the T); similarly, we can ask how many flips it takes to get HH. Let us call these random variables W_HT and W_HH respectively. Note that, by symmetry,

E(W_HH) = E(W_TT)   and   E(W_HT) = E(W_TH)

Let us first consider W_HT. This is the time to the first H, which we will call W_1, plus the time W_2 to the next T. Then we have

E(W_HT) = E(W_1) + E(W_2) = 2 + 2 = 4

because W_i - 1 ~ Geom(1/2).

Now let us consider W_HH. The distinction here is that no progress can be easily made; once we get a heads, we are not decidedly halfway to the goal, because if the next flip is tails, we lose all our work. Instead, we make use of conditional expectation. Let H_i be the event that the ith toss is heads, and T_i = H_i^c the event that it is tails. Then

E(W_HH) = E(W_HH | H_1)(1/2) + E(W_HH | T_1)(1/2)
        = [E(W_HH | H_1, H_2)(1/2) + E(W_HH | H_1, T_2)(1/2)](1/2) + (1 + E(W_HH))(1/2)
        = [2(1/2) + (2 + E(W_HH))(1/2)](1/2) + (1 + E(W_HH))(1/2)

Solving for E(W_HH) gives

E(W_HH) = 6

So far, we have been conditioning expectations on events. Let X and Y be random variables; then this kind of conditioning includes computing E(Y | X = x). If Y is discrete, then

E(Y | X = x) = Σ_y y P(Y = y | X = x)

and if Y is continuous,

E(Y | X = x) = ∫ y f_{Y|X=x}(y|x) dy = ∫ y (f_{X,Y}(x, y)/f_X(x)) dy

if X is continuous.

Definition 27.1. Now let us write

g(x) = E(Y | X = x)

Then

E(Y|X) = g(X)

So, suppose for instance that g(x) = x^2; then g(X) = X^2. We can see that E(Y|X) is a random variable and a function of X. This is a conditional expectation.

Example. Let X and Y be i.i.d. Pois(λ). Then

E(X + Y | X) = E(X|X) + E(Y|X)
             = X + E(Y|X)        (X is a function of itself)
             = X + E(Y)          (X and Y independent)
             = X + λ

Note that, in general,

E(h(X) | X) = h(X)

Now let us determine E(X | X + Y). We can do this in two different ways. First, let T = X + Y and let us find the conditional PMF.

P(X = k | T = n) = P(T = n | X = k) P(X = k)/P(T = n)
                 = P(Y = n - k) P(X = k)/P(T = n)
                 = (e^(-λ) λ^(n-k)/(n - k)!)(e^(-λ) λ^k/k!) / (e^(-2λ) (2λ)^n/n!)
                 = (n choose k) (1/2)^n

That is, X | T = n ~ Bin(n, 1/2). Thus, we have

E(X | T = n) = n/2

which means that

E(X|T) = T/2

In our second method, first we note that

E(X | X + Y) = E(Y | X + Y)

by symmetry (since they are i.i.d.). We have

E(X | X + Y) + E(Y | X + Y) = E(X + Y | X + Y) = X + Y = T

So, without even using the Poisson, E(X|T) = T/2.

Proposition 27.2 (Adam's Law). Let X and Y be random variables. Then

E(E(Y|X)) = E(Y)
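The answers E(W_HT) = 4 and E(W_HH) = 6 from the coin-flip example above can be confirmed by brute-force simulation. A minimal sketch (assuming NumPy; the number of trials is an arbitrary illustrative choice):

import numpy as np

rng = np.random.default_rng(2)

def average_wait(pattern, trials=50_000):
    # Average number of fair-coin flips until `pattern` (e.g. "HH") first appears.
    total = 0
    for _ in range(trials):
        flips = ""
        while not flips.endswith(pattern):
            flips += "H" if rng.integers(0, 2) == 1 else "T"
        total += len(flips)
    return total / trials

print("E(W_HT) ~", average_wait("HT"))   # about 4
print("E(W_HH) ~", average_wait("HH"))   # about 6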

Lecture 28: 11/9/11

Example. Let X ~ N(0, 1), Y = X^2. Then

E(Y|X) = E(X^2|X) = X^2 = Y

On the other hand,

E(X|Y) = E(X|X^2) = 0

since, after observing X^2 = a, then X = ±√a with equal likelihood of being positive or negative (since the standard normal is symmetric about 0). Note that this doesn't mean X and X^2 are independent.

Example. Suppose we have a stick, break off a random piece, and then break off another random piece. We can model this as X ~ Unif(0, 1), Y|X ~ Unif(0, X). We know that

E(Y | X = x) = x/2

and hence

E(Y|X) = X/2

Note that

E(E(Y|X)) = 1/4 = E(Y)

That is, on average, we take half the stick and then take half of that stick, which matches our intuition.

Proposition 28.1. Let X and Y be random variables.

1. E(h(X)Y | X) = h(X)E(Y|X).

2. E(Y|X) = E(Y) if X and Y are independent (the converse, however, is not true in general).

3. E(E(Y|X)) = E(Y). This is called iterated expectation or Adam's Law; it is usually more useful to think of this as computing E(Y) by choosing a simple X to work with.

4. E((Y - E(Y|X))h(X)) = 0. In words, the residual (i.e., Y - E(Y|X)) is uncorrelated with h(X):

Cov(Y - E(Y|X), h(X)) = E((Y - E(Y|X))h(X)) - E(Y - E(Y|X)) E(h(X)) = 0 - 0 · E(h(X)) = 0

To better understand (4), we can think of X and Y (which are functions on the sample space) as vectors, where the vector space has inner product ⟨X, Y⟩ = E(XY). We can think of E(Y|X) as the projection of Y onto the plane consisting of all functions of X. In this picture, the residual vector Y - E(Y|X) is orthogonal to the plane of all functions of X, and thus ⟨Y - E(Y|X), h(X)⟩ = 0.

Proof. We will prove all the properties above.

1. Since we know X, we know h(X), and this is equivalent to factoring out a constant (by linearity).

2. Immediate.

3. We will prove the discrete case. Let E(Y|X) = g(X). Then by discrete LOTUS, we have

Eg(X) = Σ_x g(x) P(X = x)
      = Σ_x E(Y | X = x) P(X = x)
      = Σ_x (Σ_y y P(Y = y | X = x)) P(X = x)
      = Σ_x Σ_y y P(Y = y | X = x) P(X = x)
      = Σ_x Σ_y y P(Y = y, X = x)       (conditional PMF times marginal PMF = joint PMF)
      = Σ_y y Σ_x P(Y = y, X = x)
      = Σ_y y P(Y = y)
      = E(Y)

4. We have

E((Y - E(Y|X))h(X)) = E(Y h(X)) - E(E(Y|X)h(X))
                    = E(Y h(X)) - E(E(h(X)Y|X))
                    = E(Y h(X)) - E(Y h(X))
                    = 0

Definition 28.2. We can define the conditional variance much as we did conditional expectation. Let X and Y be random variables. Then

Var(Y|X) = E(Y^2|X) - (E(Y|X))^2 = E((Y - E(Y|X))^2 | X)

Proposition 28.3 (Eve's Law).

Var(Y) = E(Var(Y|X)) + Var(E(Y|X))

Example. Suppose we have three populations, where X = 1 is the first, X = 2 the second, and X = 3 the third, and suppose we know the mean and variance of the height Y of individuals in each of the separate populations. Then Eve's law says we can take the variance of the three means, and add it to the mean of the three variances, to get the total variance.
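Adam's and Eve's laws can both be checked on the stick-breaking example above. Here E(Y) = E(E(Y|X)) = E(X/2) = 1/4, and Eve's law gives Var(Y) = E(Var(Y|X)) + Var(E(Y|X)) = E(X^2/12) + Var(X/2) = 1/36 + 1/48 = 7/144. A minimal simulation sketch (assuming NumPy; the number of trials is an arbitrary illustrative choice):

import numpy as np

rng = np.random.default_rng(3)
trials = 500_000

x = rng.uniform(0.0, 1.0, size=trials)   # X ~ Unif(0, 1)
y = rng.uniform(0.0, x)                  # Y | X ~ Unif(0, X), broadcast over x

print("E(Y):  ", y.mean(), " vs 1/4 =", 0.25)
print("Var(Y):", y.var(), " vs 7/144 =", 7 / 144)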

Example. Suppose we choose a random city and then choose a random sample of n people in that city. Let X be the number of people with a particular disease, and Q the proportion of people in the chosen city with the disease. Let us determine E(X) and Var(X), assuming Q ~ Beta(a, b) (a mathematically convenient, flexible distribution).

Assume that X|Q ~ Bin(n, Q). Then

E(X) = E(E(X|Q)) = E(nQ) = n a/(a + b)

and

Var(X) = E(Var(X|Q)) + Var(E(X|Q)) = E(nQ(1 - Q)) + n^2 Var(Q)

We have

E(Q(1 - Q)) = (Γ(a + b)/(Γ(a)Γ(b))) ∫_0^1 q^a (1 - q)^b dq
            = (Γ(a + b)/(Γ(a)Γ(b))) (Γ(a + 1)Γ(b + 1)/Γ(a + b + 2))
            = ab Γ(a)Γ(b)Γ(a + b) / (Γ(a)Γ(b)(a + b + 1)(a + b)Γ(a + b))
            = ab/((a + b)(a + b + 1))

and

Var(Q) = μ(1 - μ)/(a + b + 1)

where μ = a/(a + b). This gives us all the information we need to easily compute Var(X).

Lecture 29: 11/14/11

Example. Consider a store with a random number N of customers. Let X_j be the amount the jth customer spends, with E(X_j) = μ and Var(X_j) = σ^2. Assume that N, X_1, X_2, ... are independent. We want to determine the mean and variance of

X = Σ_{j=1}^N X_j

We might, at first, mistakenly invoke linearity to claim that E(X) = Nμ. But this is incoherent; the LHS is a real number whereas the RHS is a random variable. However, this error highlights something useful: we want to make N a constant, so let us condition on N. Then using the Law of Total Probability, we have

E(X) = Σ_{n=0}^∞ E(X | N = n) P(N = n) = Σ_{n=0}^∞ nμ P(N = n) = μ E(N)

Note that we can drop the conditioning because N and the X_j are independent; otherwise, this would not be true. We could also apply Adam's Law to get

E(X) = E(E(X|N)) = E(Nμ) = μ E(N)

To get the variance, we apply Eve's Law to get

Var(X) = E(Var(X|N)) + Var(E(X|N)) = E(Nσ^2) + Var(Nμ) = σ^2 E(N) + μ^2 Var(N)

We now turn our attention to statistical inequalities.

Theorem 29.1 (Cauchy-Schwarz Inequality).

|E(XY)| ≤ √(E(X^2) E(Y^2))

If X and Y are uncorrelated, E(XY) = (EX)(EY), so we do not need the inequality. We will not prove this inequality in general. However, if X and Y have mean 0, then

|Corr(X, Y)| = |E(XY)|/√(E(X^2) E(Y^2)) ≤ 1

Theorem 29.2 (Jensen's Inequality). If g : R → R is convex (i.e., g'' > 0), then

Eg(X) ≥ g(EX)

If g is concave (i.e., g'' < 0), then

Eg(X) ≤ g(EX)

Example. If X is positive, then

E(1/X) ≥ 1/EX

and

E(ln X) ≤ ln(EX)

Proof. It is true of any convex function g that

g(x) ≥ a + bx

if a + bx is the line tangent to any point (x_0, g(x_0)) on the graph of g. Take x_0 = E(X). Then we have

g(x) ≥ a + bx
g(X) ≥ a + bX
Eg(X) ≥ E(a + bX) = a + b E(X) = g(EX)

Theorem 29.3 (Markov Inequality).

P(|X| ≥ a) ≤ E|X|/a

for any a > 0.

Proof. Let I_{|X|≥a} be the indicator random variable for the event |X| ≥ a. It is always true that

a I_{|X|≥a} ≤ |X|

because if I_{|X|≥a} = 1, then |X| ≥ a and the inequality holds, and if I_{|X|≥a} = 0, the inequality is trivial since |X| ≥ 0. Then, taking expected values, we have

a E(I_{|X|≥a}) ≤ E|X|

as desired.

Example. Suppose we have 100 people. It is easily possible that at least 95% of the people are younger than average in the group. However, it is not possible that at least 50% are older than twice the average age.

Theorem 29.4 (Chebyshev Inequality).

P(|X - μ| ≥ a) ≤ Var(X)/a^2

for μ = EX and a > 0. Alternatively, we can write

P(|X - μ| ≥ c SD(X)) ≤ 1/c^2

for c > 0.

Proof.

P(|X - μ| ≥ a) = P((X - μ)^2 ≥ a^2) ≤ E((X - μ)^2)/a^2 = Var(X)/a^2

where the inequality is by Markov.

Lecture 30: 11/16/11

Definition 30.1. Let X_1, X_2, ... be i.i.d. random variables with mean μ and variance σ^2. The sample mean of the first n random variables is

X̄_n = (1/n) Σ_{j=1}^n X_j

We want to answer the question: what happens to the sample mean when n gets large?

Theorem 30.2 (Law of Large Numbers). With probability 1, as n → ∞,

X̄_n → μ

pointwise. That is, the sample mean of a collection of i.i.d. random variables converges to the true mean.

Example. Suppose that X_j ~ Bern(p). The Law of Large Numbers says that (1/n)(X_1 + ··· + X_n) → p.

Note that the Law of Large Numbers says nothing about the value of any individual X_j. For instance, in the above example with simple successes and failures (which we may model as a series of coin flips), flipping heads many times does not mean that a tails is on its way. Rather, it means that the large but finite string of heads is swamped by the infinite flips yet to come.

Theorem 30.3 (Weak Law of Large Numbers). For any c > 0, as n → ∞,

P(|X̄_n - μ| > c) → 0

Proof (of the Weak LLN). By Chebyshev's inequality,

P(|X̄_n - μ| > c) ≤ Var(X̄_n)/c^2 = (1/n^2)(nσ^2)/c^2 = σ^2/(nc^2) → 0

Note that the Law of Large Numbers does not tell us anything about the distribution of X̄_n. To study this distribution, and in particular the rate at which X̄_n - μ → 0, we might consider

n^i (X̄_n - μ)

for various values of i.

Theorem 30.4 (Central Limit Theorem). As n → ∞,

n^(1/2) (X̄_n - μ)/σ → N(0, 1)

in distribution; that is, the CDFs converge. Equivalently,

(Σ_{j=1}^n X_j - nμ)/(n^(1/2) σ) → N(0, 1)
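The convergence in the Central Limit Theorem can be watched directly by standardizing sums of Bernoulli random variables. A minimal sketch (assuming NumPy; p = 0.3, the cutoff 1, and the number of trials are arbitrary illustrative choices) compares P(standardized sum ≤ 1) with Φ(1) ≈ 0.8413:

import math
import numpy as np

rng = np.random.default_rng(5)
p, trials = 0.3, 200_000
phi_1 = 0.5 * (1 + math.erf(1 / math.sqrt(2)))   # standard normal CDF at 1

for n in (10, 100, 10_000):
    s = rng.binomial(n, p, size=trials)               # sums of n Bern(p) draws
    z = (s - n * p) / math.sqrt(n * p * (1 - p))      # standardized sums
    print(n, "P(Z_n <= 1):", np.mean(z <= 1), " Phi(1):", phi_1)

As n grows, the empirical probability approaches Φ(1).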


Proof. We will prove the CLT assuming that the MGF M(t) of the X_j exists (note that we have been assuming all along that the first two moments exist). We will show that the MGFs converge, which will imply that the CDFs converge (however, we will not show this fact).

Let us assume WLOG that μ = 0 and σ = 1. Let

S_n = Σ_{j=1}^n X_j

We will show that the MGF of S_n/√n converges to the MGF of N(0, 1). We have

E(e^(tS_n/√n)) = E(e^(tX_1/√n)) ··· E(e^(tX_n/√n))      (the factors are uncorrelated since independent)
              = E(e^(tX_j/√n))^n
              = M(t/√n)^n

Taking the limit results in the indeterminate form 1^∞, which is hard to work with. Instead, we take the log of both sides and then take the limit, to get

lim_{n→∞} n ln M(t/√n) = lim_{n→∞} ln M(t/√n)/(1/n)
                       = lim_{y→0} ln M(ty)/y^2            (substitute y = 1/√n)
                       = lim_{y→0} t M'(ty)/(2y M(ty))     (L'Hopital's)
                       = (t/2) lim_{y→0} M'(ty)/y          (M(0) = 1, M'(0) = 0)
                       = (t^2/2) lim_{y→0} M''(ty)         (L'Hopital's)
                       = t^2/2
                       = ln e^(t^2/2)

and e^(t^2/2) is the N(0, 1) MGF.

Corollary 30.5. Let X ~ Bin(n, p) with X = Σ_{j=1}^n X_j, X_j ~ Bern(p) i.i.d. Then

P(a ≤ X ≤ b) = P((a - np)/√(npq) ≤ (X - np)/√(npq) ≤ (b - np)/√(npq))
            ≈ Φ((b - np)/√(npq)) - Φ((a - np)/√(npq))

The Poisson approximation works well when n is large, p is small, and λ = np is moderate. In contrast, the Normal approximation works well when n is large and p is near 1/2 (to match the symmetry of the normal).

It seems a little strange that we are approximating a discrete distribution with a continuous distribution. In general, to correct for this, we can write

P(X = a) = P(a - ε < X < a + ε)

where (a - ε, a + ε) contains only the single integer a.

Lecture 31: 11/18/11

Definition 31.1. Let V = Z_1^2 + ··· + Z_n^2 where the Z_j ~ N(0, 1) i.i.d. Then V has the chi-squared distribution with n degrees of freedom, V ~ χ^2_n.

Observation 31.2. It is true, but we will not prove, that

χ^2_1 = Gamma(1/2, 1/2)

Since χ^2_n is the sum of n independent χ^2_1 random variables, we have

χ^2_n = Gamma(n/2, 1/2)

Definition 31.3. Let Z ~ N(0, 1) and V ~ χ^2_n be independent. Let

T = Z/√(V/n)

Then T has the Student-t distribution with n degrees of freedom, T ~ t_n.

Observation 31.4. The Student-t is symmetric; that is, -T ~ t_n. Note that if n = 1, then T is the ratio of two i.i.d. standard normals, so T becomes the Cauchy distribution (and hence has no mean). If n ≥ 2, then

E(T) = E(Z) E(1/√(V/n)) = 0

Note that in general, T ~ t_n will only have moments up to (but not including) the nth.

Observation 31.5. We proved that

E(Z^2) = 1,   E(Z^4) = 1·3,   E(Z^6) = 1·3·5

using MGFs. We can also prove this by noting that

E(Z^(2n)) = E((Z^2)^n)

and that Z^2 ~ χ^2_1 = Gamma(1/2, 1/2). Then we can simply use LOTUS to get our desired moments.
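The heavier tails of the Student-t, and its approach to the standard normal as the degrees of freedom grow (Observation 31.6 below makes the limit precise), can be seen by building T = Z/√(V/n) directly from Definition 31.3. A minimal sketch (assuming NumPy; the tail cutoff 3 and the sample size are arbitrary illustrative choices):

import numpy as np

rng = np.random.default_rng(6)
trials = 500_000

z = rng.standard_normal(trials)
print("normal  P(|Z| > 3):", np.mean(np.abs(z) > 3))

for n in (1, 3, 30):
    v = rng.chisquare(df=n, size=trials)                 # V ~ chi^2_n
    t = rng.standard_normal(trials) / np.sqrt(v / n)     # T ~ t_n by definition
    print(f"t_{n}     P(|T| > 3):", np.mean(np.abs(t) > 3))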

Observation 31.6. The Student-t distribution looks much like the normal distribution but is heavier-tailed, especially if n is small. As n → ∞, we claim that the Student-t converges to the standard normal. Let

T_n = Z/√(V_n/n)

where Z_1, Z_2, ... ~ N(0, 1) i.i.d., V_n = Z_1^2 + ··· + Z_n^2, and Z ~ N(0, 1) is independent of the Z_j. By the Law of Large Numbers, with probability 1,

lim_{n→∞} V_n/n = 1

So T_n → Z, which is standard normal as desired.

Definition 31.7. Let X = (X_1, ..., X_k) be a random vector. We say that X has the multivariate normal distribution (MVN) if every linear combination

t_1 X_1 + ··· + t_k X_k

of the X_j is normal.

Example. Let Z, W be i.i.d. N(0, 1). Then (Z + 2W, 3Z + 5W) is MVN, since

s(Z + 2W) + t(3Z + 5W) = (s + 3t)Z + (2s + 5t)W

is a sum of independent normals and hence normal.

Example. Let Z ~ N(0, 1). Let S be a random sign (±1 with equal probabilities) independent of Z. Then Z and SZ are marginally standard normal. However, (Z, SZ) is not multivariate normal, since Z + SZ is 0 with probability 1/2.

Observation 31.8. Recall that the MGF for X ~ N(μ, σ^2) is given by

E(e^(tX)) = e^(tμ + t^2 σ^2/2)

Suppose that X = (X_1, ..., X_k) is MVN. Let μ_j = EX_j. Then the MGF of X is given by

E(e^(t_1 X_1 + ··· + t_k X_k)) = exp(t_1 μ_1 + ··· + t_k μ_k + (1/2) Var(t_1 X_1 + ··· + t_k X_k))

Theorem 31.9. Let X = (X_1, ..., X_k) be MVN. Then within X, uncorrelated implies independent. For instance, if we write X = (X_1, X_2), if every component of X_1 is uncorrelated with every component of X_2, then X_1 is independent of X_2.

Example. Let X, Y be i.i.d. N(0, 1). Then (X + Y, X - Y) is MVN. We also have that

Cov(X + Y, X - Y) = Var(X) + Cov(X, Y) - Cov(X, Y) - Var(Y) = 0

So by our above theorem, X + Y and X - Y are independent.

Lecture 32: 11/21/11

Definition 32.1. Let X_0, X_1, X_2, ... be a sequence of random variables. We think of X_n as the state of a finite system at a discrete time n (that is, the X_n have discrete indices and each has finite range). The sequence has the Markov property if

P(X_{n+1} = j | X_n = i, X_{n-1} = i_{n-1}, ..., X_0 = i_0) = P(X_{n+1} = j | X_n = i)

In casual terms, in a system with the Markov property, the past and future are conditionally independent given the present. Such a system is called a Markov chain.

If P(X_{n+1} = j | X_n = i) does not depend on the time n, then we denote

q_ij := P(X_{n+1} = j | X_n = i)

called the transition probability, and we call the sequence a homogeneous Markov chain.

To describe a homogeneous Markov chain we simply need to show the states of the process and the transition probabilities. We could, instead, array the q_ij's as a matrix

Q = (q_ij)

called the transition matrix.

Note. More generally, we could consider continuous systems (i.e., continuous state spaces) at continuous times and more broadly study stochastic processes. However, in this course, we will restrict our study to homogeneous Markov chains.

Example. The following describes a (homogeneous) Markov chain on four states 1, 2, 3, 4. [State diagram omitted; its arrows carry the transition probabilities 1/3, 2/3, 1/2, 1/2, 1, 1/2, 1/4, 1/4.] We could alternatively describe the same Markov chain by specifying its transition matrix

Q = ( 1/3  2/3    0    0
      1/2    0  1/2    0
        0    0    0    1
      1/2    0  1/4  1/4 )
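Once the chain is encoded as a matrix, multi-step behavior is just matrix arithmetic: if s is the distribution of the current state, written as a row vector, then sQ^m is the distribution m steps later. A minimal sketch (assuming NumPy; Q is the four-state transition matrix from the example above, and starting in state 1 is an arbitrary illustrative choice):

import numpy as np

Q = np.array([[1/3, 2/3, 0.0, 0.0],
              [1/2, 0.0, 1/2, 0.0],
              [0.0, 0.0, 0.0, 1.0],
              [1/2, 0.0, 1/4, 1/4]])
assert np.allclose(Q.sum(axis=1), 1.0)      # each row is a conditional PMF

s = np.array([1.0, 0.0, 0.0, 0.0])          # start in state 1 with probability 1
for m in (1, 2, 5, 50):
    print(m, s @ np.linalg.matrix_power(Q, m))   # distribution after m steps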

Observation 32.2. Suppose that at time n, X_n has distribution s (a row vector, which represents the PMF). Then

P(X_{n+1} = j) = Σ_i P(X_{n+1} = j | X_n = i) P(X_n = i) = Σ_i q_ij s_i = (sQ)_j

So sQ is the distribution of X_{n+1}. More generally, we have that sQ^j is the distribution of X_{n+j}.

We can also compute the two-step transition probability:

P(X_{n+2} = j | X_n = i) = Σ_k P(X_{n+2} = j | X_{n+1} = k, X_n = i) P(X_{n+1} = k | X_n = i)
                        = Σ_k P(X_{n+2} = j | X_{n+1} = k) P(X_{n+1} = k | X_n = i)
                        = Σ_k q_kj q_ik
                        = Σ_k q_ik q_kj
                        = (Q^2)_ij

More generally, we have

P(X_{n+m} = j | X_n = i) = (Q^m)_ij

Definition 32.3. Let s be some probability vector for a Markov chain with transition matrix Q. We say that s is stationary for the chain if

sQ = s

We also call s a stationary distribution. Note that this is the transpose of an eigenvector equation.

This definition raises the following questions:

1. Does a stationary distribution exist for every Markov chain?

2. Is the stationary distribution unique?

3. Does the chain (in some sense) converge to the stationary distribution?

4. How can we compute it (efficiently)?

Lecture 33: 11/28/11

Example. The following are some pathological examples of Markov chains (sans transition probabilities), in state-diagram form:

1. Unpathological Markov chain

2. Disconnected Markov chain

3. Markov chain with absorbing states

4. Periodic Markov chain

Definition 33.1. A state is recurrent if, starting from that state, there is probability 1 of transitioning back to that state after a finite number of transitions. If a state is not recurrent, it is transient.

Definition 33.2. A Markov chain is irreducible if it is possible (with positive probability) to transition from any state to any other state in a finite number of transitions. Note that in an irreducible chain, all states are recurrent; over an infinite number of transitions, any nonzero probability of returning to a state means that the event of return will occur with probability 1.

Observation 33.3. In our examples above, Markov chains 1 and 4 are irreducible; chains 2 and 3 are not. All the states of chain 2 are recurrent; even though the chain itself has two connected components, we will always (i.e., with probability 1) return to the state which we started from.

However, in chain 3, states 1 and 2 are transient. With probability 1, from states 1 and 2, we will at some point transition to state 0 or 3; after that point, we will never return to state 1 or 2. On the other hand, if we start in 0 or 3, we stay there forever; they are clearly recurrent.

Theorem 33.4. For any irreducible Markov chain,

1. A stationary distribution s exists.

2. s is unique.

3. s_i = 1/r_i, where r_i is the average time to return to state i starting from state i.

4. If Q^m is strictly positive for some m, then

lim_{n→∞} P(X_n = i) = s_i

Alternatively, if t is any (starting-state) probability vector, then

lim_{n→∞} tQ^n = s

Definition 33.5. A Markov chain with transition matrix Q is reversible if there is a probability vector s such that

s_i q_ij = s_j q_ji

for all pairs of states i and j.

Reversibility is also known as time-reversibility. Intuitively, the progression of a reversible Markov chain could be played backwards, and the probabilities would be consistent with the original Markov chain.

Theorem 33.6. If a Markov chain is reversible with respect to s, then s is stationary.

Proof. We know that s_i q_ij = s_j q_ji for some s. Summing over all states,

Σ_i s_i q_ij = Σ_i s_j q_ji = s_j Σ_i q_ji = s_j

But since this is true for every j, this is exactly the statement of

sQ = s

as desired.

Example (Random walk on an undirected network). Consider a random walk on an undirected network. [Diagram omitted: the example network has four vertices 1, 2, 3, 4.] Let d_i be the degree of vertex i (so, in the example, d_1 = 2, d_2 = 2, d_3 = 3, d_4 = 1). Then we claim that (in general)

d_i q_ij = d_j q_ji

for all i, j. Assume i ≠ j. Since the network is undirected, q_ij and q_ji are either both zero or both nonzero. If (i, j) is an edge, then

q_ij = 1/d_i

since our Markov chain represents a random walk. But this suffices to prove our claim.

Let us now normalize the d_i to a stationary vector s. This is easy; we can simply take

s_i = d_i / Σ_j d_j

and we have thus found our desired stationary distribution.
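The degree-proportional stationary distribution s_i = d_i / Σ_j d_j is easy to verify numerically for any small undirected graph. A minimal sketch (assuming NumPy; the adjacency matrix below is a hypothetical four-vertex graph chosen to have the degrees quoted above, d = (2, 2, 3, 1), and is not necessarily the exact network drawn in lecture):

import numpy as np

# Undirected graph on vertices 1..4 with edges {1,2}, {1,3}, {2,3}, {3,4},
# so the degree sequence is d = (2, 2, 3, 1).
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

d = A.sum(axis=1)            # vertex degrees
Q = A / d[:, None]           # random walk: q_ij = 1/d_i for each edge (i, j)
s = d / d.sum()              # claimed stationary distribution

print("s =", s)
print("sQ = s?", np.allclose(s @ Q, s))
# Reversibility check: d_i q_ij = d_j q_ji, i.e. the matrix of d_i q_ij is symmetric.
print("reversible?", np.allclose(d[:, None] * Q, (d[:, None] * Q).T))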