Introduction
Statistics 110 is an introductory statistics course offered at Harvard University. It covers all the basics of probability: counting principles, probabilistic events, random variables, distributions, conditional probability, expectation, and Bayesian inference. The last few lectures of the course are spent on Markov chains.
These notes were partially live-TeXed; the rest were TeXed from course videos and then edited for correctness and clarity. I am responsible for all errata in this document, mathematical or otherwise; any merits of the material here should be credited to the lecturer, not to me.
Feel free to email me at mxawng@gmail.com with any comments.
Acknowledgments
In addition to the course staff, acknowledgment goes to Zev Chonoles, whose online lecture notes (http://math.uchicago.edu/~chonoles/expository-notes/) inspired me to post my own. I have also borrowed his format for this introduction page.
The page layout for these notes is based on the layout I used back when I took notes by hand. The LaTeX styles can be found here: https://github.com/mxw/latex-custom.
Copyright
Copyright 2011 Max Wang.
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
This means you are free to edit, adapt, transform, or redistribute this work as long as you
include an attribution of Joe Blitzstein as the instructor of the course these notes are based on, and an
attribution of Max Wang as the note-taker;
do so in a way that does not suggest that either of us endorses you or your use of this work;
use this work for noncommercial purposes only; and
if you adapt or build upon this work, apply this same license to your contributions.
See http://creativecommons.org/licenses/by-nc-sa/4.0/ for the license details.
Max Wang
Lecture 2 9/2/11
Definition 2.1. A sample space S is the set of all possible outcomes of an experiment.
Definition 2.2. An event A ⊆ S is a subset of the sample space.
Definition 2.3 (Naive definition of probability). If all outcomes are equally likely, then

    P(A) = |A|/|S| = (# favorable outcomes)/(# possible outcomes)

is the probability that A occurs.

Proposition 2.4 (Multiplication Rule). If there are r experiments and each experiment has n_i possible outcomes, then the overall sample space has size n_1 n_2 · · · n_r.

Example. The probability of a full house in a five-card poker hand (without replacement, and without other players) is

    P(full house) = 13 \binom{4}{3} · 12 \binom{4}{2} / \binom{52}{5}

Definition 2.5. The binomial coefficient is \binom{n}{k} = n!/((n − k)! k!), or 0 if k > n.

Theorem 2.6 (Sampling Table). The number of ways to choose k elements out of n distinct elements is given by the following table:

                       ordered           unordered
    replacement        n^k               \binom{n+k−1}{k}
    no replacement     n!/(n − k)!       \binom{n}{k}

For the unordered-with-replacement entry, think of placing k indistinguishable particles into n boxes separated by n − 1 dividers: there are \binom{n+k−1}{k} = \binom{n+k−1}{n−1} ways to place the particles, which determines the placement of the dividers (or vice versa); this is our result.

Example (story proofs).

1. \binom{n}{k} = \binom{n}{n−k}

2. n \binom{n−1}{k−1} = k \binom{n}{k}. Pick k people out of n, then designate one as special. The RHS represents how many ways we can do this by first picking the k individuals and then making our designation. On the LHS, we see the number of ways to pick a special individual and then pick the remaining k − 1 individuals from the remaining pool of n − 1.

3. (Vandermonde) \binom{n+m}{k} = Σ_{i=0}^{k} \binom{n}{i} \binom{m}{k−i}
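The counting arguments above are easy to sanity-check numerically. Below is a small simulation sketch (mine, not part of the original notes) that estimates the full house probability and compares it with the formula; it assumes NumPy is available.

import numpy as np
from math import comb

rng = np.random.default_rng(0)

def is_full_house(hand_ranks):
    # A full house is a triple of one rank plus a pair of another.
    counts = sorted(np.bincount(hand_ranks, minlength=13), reverse=True)
    return counts[0] == 3 and counts[1] == 2

deck_ranks = np.repeat(np.arange(13), 4)   # 52 cards; only ranks matter here
trials = 100_000
hits = sum(is_full_house(rng.choice(deck_ranks, size=5, replace=False))
           for _ in range(trials))

exact = 13 * comb(4, 3) * 12 * comb(4, 2) / comb(52, 5)
print(hits / trials, exact)   # both should be near 0.0014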
Lecture 3 9/7/11

Definition. A probability function P assigns a probability to each event such that:

1. P(∅) = 0, P(S) = 1

2. P(∪_{n=1}^∞ A_n) = Σ_{n=1}^∞ P(A_n) if the A_n are disjoint

Definition. Events A and B are independent if

    P(A ∩ B) = P(A)P(B)

In general, for n events A_1, . . . , A_n, independence requires i-wise independence for every i = 2, . . . , n; that is, pairwise independence alone does not imply independence.

Example (Birthday Problem). The probability that at least two people among a group of k share the same birthday, assuming that birthdays are evenly distributed across the 365 standard days, is given by

    P(match) = 1 − (365 · 364 · · · (365 − k + 1))/365^k
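As a quick numerical check (mine, not in the original notes), the following sketch estimates the birthday-match probability by simulation and compares it with the formula; it assumes NumPy.

import numpy as np

rng = np.random.default_rng(0)

def p_match_exact(k):
    # 1 - 365*364*...*(365-k+1)/365^k
    probs = np.arange(365, 365 - k, -1) / 365.0
    return 1 - probs.prod()

def p_match_sim(k, trials=20_000):
    bdays = rng.integers(0, 365, size=(trials, k))
    has_match = [len(set(row)) < k for row in bdays]
    return np.mean(has_match)

for k in (10, 23, 50):
    print(k, round(p_match_exact(k), 3), round(p_match_sim(k), 3))
# k = 23 should give roughly 0.507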
Lecture 4 9/9/11

Corollary 4.2 (Inclusion-Exclusion). Generalizing the two-event case above,

    P(A_1 ∪ · · · ∪ A_n) = Σ_{i=1}^n P(A_i) − Σ_{i<j} P(A_i ∩ A_j) + Σ_{i<j<k} P(A_i ∩ A_j ∩ A_k) − · · · + (−1)^{n+1} P(A_1 ∩ · · · ∩ A_n)

Example (Matching problem). Shuffle n cards labeled 1, . . . , n, and let A_i be the event that card i lands in position i. There are n terms P(A_i) = 1/n, n(n − 1)/2! terms P(A_i ∩ A_j) = 1/(n(n − 1)), n(n − 1)(n − 2)/3! terms of the next order, and so on, so inclusion-exclusion gives P(A_1 ∪ · · · ∪ A_n) = 1 − 1/2! + 1/3! − · · ·, and the probability of no match is

    P(A_1^C ∩ · · · ∩ A_n^C) = 1 − P(A_1 ∪ · · · ∪ A_n) ≈ 1/e

Thus far, all the probabilities with which we have been concerned are unconditional; we now turn to conditional probability. For events A and B, Bayes' rule states

    P(A|B) = P(B|A)P(A)/P(B)
Lecture 6 9/14/11

Example (conditional independence). Suppose an alarm A can be triggered either by a fire F or by burnt popcorn C. Then P(F | A ∩ C^C) = 1, and hence we do not have conditional independence of F and C given A.

Example. 1. What is the probability that both cards in a two-card hand are aces, given that we have an ace?

    P(both aces | have ace) = P(both aces, have ace)/P(have ace) = (\binom{4}{2}/\binom{52}{2}) / (1 − \binom{48}{2}/\binom{52}{2}) = 1/33

2. What is the probability that both cards are aces, given that we have the ace of spades? Here the answer is 3/51 = 1/17.
Lecture 7 9/16/11

Example (Testing for a disease). Let D be the event that a patient has the disease and T the event that the patient tests positive. By Bayes' rule and the Law of Total Probability,

    P(D|T) = P(T|D)P(D)/P(T) = P(T|D)P(D)/(P(T|D)P(D) + P(T|D^C)P(D^C)) ≈ 0.16

Example (Monty Hall). Let S be the event that we win by switching, and let D_i be the event that the car is behind door i. Conditioning on the location of the car,

    P(S) = P(S|D_1)·(1/3) + P(S|D_2)·(1/3) + P(S|D_3)·(1/3) = 0 + 1·(1/3) + 1·(1/3) = 2/3

By symmetry, the probability that we succeed conditioned on the door Monty opens is the same.
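A short simulation sketch (mine, not from the notes) confirming that switching wins about 2/3 of the time; it uses only Python's standard library.

import random

random.seed(0)

def monty_trial():
    car = random.randrange(3)
    choice = random.randrange(3)
    # Monty opens a door that is neither our choice nor the car.
    openable = [d for d in range(3) if d != choice and d != car]
    opened = random.choice(openable)
    switched = next(d for d in range(3) if d != choice and d != opened)
    return switched == car

trials = 100_000
wins = sum(monty_trial() for _ in range(trials))
print(wins / trials)   # close to 2/3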
Example (Simpson's paradox). Two doctors' results, broken down by type of operation:

    Dr. Hibbert       heart    band-aid
      success           70          10
      failure           20           0

    Dr. Nick          heart    band-aid
      success            2          81
      failure            8           9

Dr. Hibbert has the higher success rate within each type of operation, yet Dr. Nick has the higher overall success rate (83/100 versus 80/100).
Lecture 8 9/19/11

Definition 8.1. A one-dimensional random walk models a (possibly infinite) sequence of successive steps along the number line, where, starting from some position i, we have a probability p of moving +1 and a probability q = 1 − p of moving −1.

(Gambler's ruin.) Let p_i denote the probability of reaching n before 0 when starting from position i. Conditioning on the first step, we have

    p_i = p · p_{i+1} + q · p_{i−1}

Guessing a solution of the form p_i = x^i yields the characteristic equation px² − x + q = 0, with roots

    x = (1 ± sqrt(1 − 4pq))/(2p) = (1 ± sqrt(1 − 4p(1 − p)))/(2p) = (1 ± sqrt(4p² − 4p + 1))/(2p) = (1 ± (2p − 1))/(2p)

that is, x = 1 or x = q/p, for p ≠ q (to avoid a repeated root). The general solution is p_i = A + B(q/p)^i, and our boundary conditions for p_0 = 0 and p_n = 1 give B = −A and 1 = A(1 − (q/p)^n).

To solve for the case where p = q, we can take the limit of the p ≠ q solution:

    lim_{x→1} (1 − x^i)/(1 − x^n) = lim_{x→1} (i x^{i−1})/(n x^{n−1}) = i/n

So we have

    p_i = (1 − (q/p)^i)/(1 − (q/p)^n)    if p ≠ q
    p_i = i/n                            if p = q

Definition. A random variable X with

    P(X = 1) = p,    P(X = 0) = 1 − p

is called a Bernoulli random variable. We say that X ∼ Bern(p).
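A simulation sketch (mine, not from the notes) comparing the empirical probability of reaching n before 0 with the closed form above; NumPy assumed.

import numpy as np

rng = np.random.default_rng(0)

def ruin_exact(i, n, p):
    q = 1 - p
    if p == q:
        return i / n
    r = q / p
    return (1 - r**i) / (1 - r**n)

def ruin_sim(i, n, p, trials=20_000):
    wins = 0
    for _ in range(trials):
        pos = i
        while 0 < pos < n:
            pos += 1 if rng.random() < p else -1
        wins += (pos == n)
    return wins / trials

print(ruin_exact(5, 10, 0.6), ruin_sim(5, 10, 0.6))   # both near 0.88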
Note that a random variable is a function on the sample space; for example, {s ∈ S : X(s) = 1} = X^{−1}({1}).

Definition 8.4. The distribution of successes in n independent Bern(p) trials is called the binomial distribution and is given by

    P(X = k) = \binom{n}{k} p^k (1 − p)^{n−k}

where 0 ≤ k ≤ n. We write X ∼ Bin(n, p).

Definition 8.5. The probability mass function (PMF) of a discrete random variable (a random variable with enumerable values) is a function that gives the probability that the random variable takes some value. That is, given a discrete random variable X, its PMF is

    f_X(x) = P(X = x)
Lecture 9 9/21/11

Definition 9.1. The cumulative distribution function (CDF) of a random variable X is

    F_X(x) = P(X ≤ x)

Note. The requirements for a PMF with values p_i are that each p_i ≥ 0 and Σ_i p_i = 1. For Bin(n, p), we can easily verify this with the binomial theorem, which yields

    Σ_{k=0}^n \binom{n}{k} p^k q^{n−k} = (p + q)^n = 1

Proposition 9.2. If X, Y are independent random variables and X ∼ Bin(n, p), Y ∼ Bin(m, p), then

    X + Y ∼ Bin(n + m, p)

Proof. Conditioning on X, then using independence and Vandermonde's identity,

    P(X + Y = k) = Σ_{j=0}^k P(X + Y = k | X = j) P(X = j)
                 = Σ_{j=0}^k P(Y = k − j | X = j) \binom{n}{j} p^j q^{n−j}
                 = Σ_{j=0}^k P(Y = k − j) \binom{n}{j} p^j q^{n−j}
                 = Σ_{j=0}^k \binom{m}{k−j} p^{k−j} q^{m−(k−j)} \binom{n}{j} p^j q^{n−j}
                 = p^k q^{n+m−k} Σ_{j=0}^k \binom{m}{k−j} \binom{n}{j}
                 = \binom{n+m}{k} p^k q^{n+m−k}

Definition 9.3. Suppose we have w white and b black marbles, out of which we choose a simple random sample of n. The distribution of the number of white marbles in the sample, which we will call X, is given by

    P(X = k) = \binom{w}{k} \binom{b}{n−k} / \binom{w+b}{n}

where 0 ≤ k ≤ w and 0 ≤ n − k ≤ b. This is called the hypergeometric distribution, denoted HGeom(w, b, n).

Proof. We should show that the above is a valid PMF. It is clearly nonnegative. We also have, by Vandermonde's identity,

    Σ_k \binom{w}{k} \binom{b}{n−k} / \binom{w+b}{n} = \binom{w+b}{n} / \binom{w+b}{n} = 1
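The convolution identity is easy to verify numerically. A small sketch (mine) compares the exact Bin(n+m, p) PMF with the convolution of Bin(n, p) and Bin(m, p); SciPy assumed.

import numpy as np
from scipy.stats import binom

n, m, p = 6, 9, 0.3
k = np.arange(n + m + 1)

# Convolve the two PMFs and compare with Bin(n+m, p).
conv = np.convolve(binom.pmf(np.arange(n + 1), n, p),
                   binom.pmf(np.arange(m + 1), m, p))
print(np.allclose(conv, binom.pmf(k, n + m, p)))   # True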
Note. The difference between the hypergeometric and binomial distributions is whether or not we sample with replacement. We would expect that in the limiting case of a large population, they would behave similarly.

Lecture 10 9/23/11

Proposition 10.1 (Properties of CDFs). A function F_X is a valid CDF iff the following hold about F_X:

1. monotonically nondecreasing

2. right-continuous

3. lim_{x→−∞} F_X(x) = 0 and lim_{x→∞} F_X(x) = 1.

Definition 10.2. Two random variables X and Y are independent if ∀x, y,

    P(X ≤ x, Y ≤ y) = P(X ≤ x)P(Y ≤ y)

In the discrete case, we can say equivalently that

    P(X = x, Y = y) = P(X = x)P(Y = y)

Note. As an aside before we move on to discuss averages and expected values, recall that

    Σ_{i=1}^n i = n(n + 1)/2

Observation 10.4. Let X ∼ Bern(p). Then

    E(X) = 1 · P(X = 1) + 0 · P(X = 0) = p

Example. Let X be the number of aces in a five-card hand, and write X = X_1 + · · · + X_5, where X_j is the indicator that the jth card is an ace. Then

    E(X) = E(X_1 + · · · + X_5) = E(X_1) + · · · + E(X_5) = 5E(X_1)    by symmetry

The above shows that to get the probability of an event, we can simply compute the expected value of an indicator.

Observation 10.6. Let X ∼ Bin(n, p). Then (using the identity k\binom{n}{k} = n\binom{n−1}{k−1} and the binomial theorem),

    E(X) = Σ_{k=0}^n k \binom{n}{k} p^k q^{n−k}
         = Σ_{k=1}^n k \binom{n}{k} p^k q^{n−k}
         = Σ_{k=1}^n n \binom{n−1}{k−1} p^k q^{n−k}
         = np Σ_{k=1}^n \binom{n−1}{k−1} p^{k−1} q^{n−k}
         = np Σ_{j=0}^{n−1} \binom{n−1}{j} p^j q^{n−1−j}
         = np

Proposition 10.7. Expected value is linear; that is, for random variables X and Y and some constant c,

    E(X + Y) = E(X) + E(Y)    and    E(cX) = cE(X)
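A quick sketch (mine, not in the notes) illustrating the indicator decomposition: the sample mean of Bin(n, p) draws built from Bernoulli indicators should approach np; NumPy assumed.

import numpy as np

rng = np.random.default_rng(0)
n, p, trials = 10, 0.3, 200_000

# Build each Bin(n, p) draw as a sum of n Bernoulli indicators.
indicators = rng.random((trials, n)) < p
x = indicators.sum(axis=1)
print(x.mean(), n * p)   # both close to 3.0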
The proof that E(cX) = cE(X) is similar.

For the geometric distribution, where X counts the number of failures before the first success in independent Bern(p) trials and P(X = k) = pq^k for k = 0, 1, 2, . . . , this is a valid PMF since

    Σ_{k=0}^∞ p q^k = p · 1/(1 − q) = 1

Starting from the geometric series Σ_{k=0}^∞ q^k = 1/(1 − q) and differentiating term by term,

    Σ_{k=1}^∞ k q^{k−1} = 1/(1 − q)²

Then

    E(X) = Σ_{k=0}^∞ k p q^k = pq Σ_{k=1}^∞ k q^{k−1} = pq/(1 − q)² = q/p

For the first-success version Y = X + 1, this gives E(Y) = E(X) + 1 = q/p + 1 = 1/p.
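A small check (mine): simulate Geom(p) as the number of failures before the first success and compare the sample mean with q/p; NumPy assumed.

import numpy as np

rng = np.random.default_rng(0)
p, trials = 0.25, 100_000

# numpy's geometric counts trials until the first success, so subtract 1
# to get the number of failures before the first success.
failures = rng.geometric(p, size=trials) - 1
print(failures.mean(), (1 - p) / p)   # both near 3.0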
Lecture 12 9/28/11

Definition 12.1. The Poisson distribution, Pois(λ), is given by the PMF

    P(X = k) = e^{−λ} λ^k / k!,    k = 0, 1, 2, . . .

This is a valid PMF since

    Σ_{k=0}^∞ e^{−λ} λ^k / k! = e^{−λ} e^{λ} = 1

Example. If X ∼ Pois(λ), the probability of at least one occurrence is

    P(X ≥ 1) = 1 − P(X = 0) = 1 − e^{−λ}

The expected value is

    E(X) = Σ_{k=0}^∞ k e^{−λ} λ^k / k! = e^{−λ} Σ_{k=1}^∞ λ^k/(k − 1)! = λ e^{−λ} Σ_{k=1}^∞ λ^{k−1}/(k − 1)! = λ

The Poisson also arises as a limit of the binomial: if X ∼ Bin(n, p) with n large, p small, and λ = np held fixed, then

    P(X = k) = \binom{n}{k} p^k (1 − p)^{n−k} = [n(n − 1) · · · (n − k + 1)/k!] (λ/n)^k (1 − λ/n)^{n−k} → e^{−λ} λ^k / k!

as n → ∞ and p → 0.
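A numeric sketch (mine) of the Poisson limit: compare Bin(n, λ/n) and Pois(λ) PMFs for growing n; SciPy assumed.

import numpy as np
from scipy.stats import binom, poisson

lam, k = 2.0, np.arange(8)
for n in (10, 100, 1000):
    err = np.max(np.abs(binom.pmf(k, n, lam / n) - poisson.pmf(k, lam)))
    print(n, err)   # the maximum gap shrinks as n grows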
Lecture 13 9/30/11
Moreover,

    P(a < X < b) = ∫_a^b f_X(x) dx

and the expected value of a continuous random variable X is

    E(X) = ∫_{−∞}^{∞} x f_X(x) dx

Giving the expected value is like giving a one-number summary of the average, but it provides no information about the spread of a distribution.

Definition 13.4. The variance of a random variable X is given by

    Var(X) = E((X − EX)²)

which is the expected value of the squared distance from X to its mean; roughly, it measures how far X is, on average, from its mean. We can't use E(X − EX) because, by linearity, we have

    E(X − EX) = EX − E(EX) = EX − EX = 0

We would like to use E|X − EX|, but absolute value is hard to work with; instead, we use the squared deviation.

Definition 13.5. The standard deviation of a random variable X is

    SD(X) = sqrt(Var(X))

Note. Another way we can write variance is

    Var(X) = E((X − EX)²) = E(X² − 2X(EX) + (EX)²) = E(X²) − (EX)²

Definition. U ∼ Unif(a, b) is the continuous distribution with constant density c on the interval [a, b]. Normalizing,

    ∫_a^b c dx = 1,    c(b − a) = 1,    c = 1/(b − a)

Its CDF is

    F(x) = ∫_a^x 1/(b − a) dt = (x − a)/(b − a)    for a ≤ x ≤ b

Observation 13.7. The expected value of an r.v. U ∼ Unif(a, b) is

    E(U) = ∫_a^b x/(b − a) dx = x²/(2(b − a)) |_a^b = (b² − a²)/(2(b − a)) = (b + a)(b − a)/(2(b − a)) = (b + a)/2

This is the midpoint of the interval [a, b].

Finding the variance of U ∼ Unif(a, b), however, is a bit more trouble. We need to determine E(U²), but it is too much of a hassle to figure out the PDF of U². Ideally, things would be as simple as

    E(U²) = ∫ x² f_U(x) dx
Computing this integral,

    E(U²) = (1/(b − a)) · x³/3 |_a^b = (b³ − a³)/(3(b − a))

so

    Var(U) = E(U²) − (E(U))² = (b³ − a³)/(3(b − a)) − (b + a)²/4 = (b − a)²/12

The following table is useful for comparing discrete and continuous random variables:

                         discrete                continuous
    P?F                  PMF P(X = x)            PDF f_X(x)
    CDF                  F_X(x) = P(X ≤ x)       F_X(x) = P(X ≤ x)
    E(X)                 Σ_x x P(X = x)          ∫ x f_X(x) dx
    Var(X)               E(X²) − (EX)²           E(X²) − (EX)²
    E(g(X)) [LOTUS]      Σ_x g(x) P(X = x)       ∫ g(x) f_X(x) dx
(Universality of the uniform.) Let F be a continuous, strictly increasing CDF and U ∼ Unif(0, 1). Then X = F^{−1}(U) has CDF F; conversely, if X has CDF F, then F(X) ∼ Unif(0, 1).

Proof. We have

    P(X ≤ x) = P(F^{−1}(U) ≤ x) = P(U ≤ F(x)) = F(x)

since P(U ≤ F(x)) is the length of the interval [0, F(x)], which is F(x). For the second part,

    P(F(X) ≤ x) = P(X ≤ F^{−1}(x)) = F(F^{−1}(x)) = x

since F is X's CDF. But this shows that F(X) ∼ Unif(0, 1).

Example. Let F(x) = 1 − e^{−x} with x > 0 (the Expo(1) CDF). Solving u = 1 − e^{−x} for x yields F^{−1}(U) = −ln(1 − U) ∼ F.

Proof (normalizing constant of the normal distribution). We want to prove that our PDF is valid; to do so, we will simply determine the value of the normalizing constant that makes it so. We will integrate the square of the PDF sans constant because it is easier than integrating naively:

    (∫_{−∞}^{∞} e^{−z²/2} dz)² = ∫_{−∞}^{∞} ∫_{−∞}^{∞} e^{−x²/2} e^{−y²/2} dx dy = ∫_0^{2π} ∫_0^∞ e^{−r²/2} r dr dθ

Substituting u = r²/2, du = r dr,

    = ∫_0^{2π} ∫_0^∞ e^{−u} du dθ = 2π

so ∫_{−∞}^{∞} e^{−z²/2} dz = sqrt(2π), and the normalizing constant is 1/sqrt(2π).
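A short sketch (mine) of the universality result: −ln(1 − U) for U ∼ Unif(0, 1) should behave like Expo(1); NumPy assumed.

import numpy as np

rng = np.random.default_rng(0)
u = rng.random(200_000)
x = -np.log(1 - u)          # inverse-CDF transform of Unif(0, 1)

# Expo(1) has mean 1, variance 1, and P(X <= 1) = 1 - e^{-1}.
print(x.mean(), x.var(), (x <= 1).mean(), 1 - np.exp(-1))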
Observation 14.5. Let us compute the mean and variance of Z ∼ N(0, 1). We have

    EZ = (1/sqrt(2π)) ∫_{−∞}^{∞} z e^{−z²/2} dz = 0

by symmetry (the integrand is odd). The variance reduces to

    Var(Z) = E(Z²) − (EZ)² = E(Z²)

By LOTUS,

    E(Z²) = (1/sqrt(2π)) ∫_{−∞}^{∞} z² e^{−z²/2} dz
          = (2/sqrt(2π)) ∫_0^∞ z² e^{−z²/2} dz                        (by evenness)
          = (2/sqrt(2π)) ∫_0^∞ z · z e^{−z²/2} dz                     (by parts, u = z, dv = z e^{−z²/2} dz)
          = (2/sqrt(2π)) ( [−z e^{−z²/2}]_0^∞ + ∫_0^∞ e^{−z²/2} dz )
          = (2/sqrt(2π)) ( 0 + sqrt(2π)/2 )
          = 1

We use Φ to denote the standard normal CDF; so

    Φ(z) = (1/sqrt(2π)) ∫_{−∞}^z e^{−t²/2} dt

By symmetry, we also have

    Φ(−z) = 1 − Φ(z)

Lecture 15 10/5/11

If Z ∼ N(0, 1) and X = μ + σZ, then X ∼ N(μ, σ²), which yields a PDF of

    f_X(x) = (1/(σ sqrt(2π))) e^{−((x−μ)/σ)²/2}

We also have −X = −μ + σ(−Z) ∼ N(−μ, σ²). Later, we will show that if X_i ∼ N(μ_i, σ_i²) are independent, then

    X_i + X_j ∼ N(μ_i + μ_j, σ_i² + σ_j²)    and    X_i − X_j ∼ N(μ_i − μ_j, σ_i² + σ_j²)

Observation 15.2. If X ∼ N(μ, σ²), we have

    P(|X − μ| ≤ σ) ≈ 68%
    P(|X − μ| ≤ 2σ) ≈ 95%
    P(|X − μ| ≤ 3σ) ≈ 99.7%

Observation 15.3. We observe some properties of the variance.

    Var(X) = E((X − EX)²) = E(X²) − (EX)²

For any constant c,

    Var(X + c) = Var(X)
    Var(cX) = c² Var(X)

Since variance is not linear, in general, Var(X + Y) ≠ Var(X) + Var(Y). However, if X and Y are independent, we do have equality. On the other extreme,

    Var(X + X) = Var(2X) = 4 Var(X)

Also, in general,

    Var(X) ≥ 0
    Var(X) = 0 ⟺ ∃a : P(X = a) = 1
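A quick empirical check (mine) of the 68-95-99.7 rule using simulated normals; NumPy assumed.

import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 5.0, 2.0
x = rng.normal(mu, sigma, size=1_000_000)

for k in (1, 2, 3):
    print(k, np.mean(np.abs(x - mu) <= k * sigma))
# roughly 0.683, 0.954, 0.997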
Observation 15.4. Let us compute the variance of the Poisson distribution. Let X ∼ Pois(λ). Starting from e^λ = Σ_{k=0}^∞ λ^k/k!, differentiating and multiplying by λ gives λe^λ = Σ_{k=1}^∞ k λ^k/k!. Repeating,

    Σ_{k=1}^∞ k² λ^{k−1}/k! = e^λ (λ + 1)

So,

    E(X²) = Σ_{k=0}^∞ k² e^{−λ} λ^k/k! = λ e^{−λ} Σ_{k=1}^∞ k² λ^{k−1}/k! = λ e^{−λ} e^λ (λ + 1) = λ² + λ

and therefore Var(X) = E(X²) − (EX)² = λ² + λ − λ² = λ.

We can compute the variance of the binomial similarly, using indicators. Write

    X = I_1 + · · · + I_n

where the I_j are i.i.d. Bern(p). Then

    X² = I_1² + · · · + I_n² + 2I_1I_2 + 2I_1I_3 + · · · + 2I_{n−1}I_n

so E(X²) = np + n(n − 1)p². For the variance,

    Var(X) = (np + n²p² − np²) − n²p² = np(1 − p) = npq

Proof (of Discrete LOTUS). We want to show that E(g(X)) = Σ_x g(x)P(X = x). To do so, once again we can ungroup our expected value expression:

    E(g(X)) = Σ_{s∈S} g(X(s)) P({s})

We can rewrite this as

    Σ_{s∈S} g(X(s)) P({s}) = Σ_x Σ_{s : X(s)=x} g(X(s)) P({s}) = Σ_x g(x) Σ_{s : X(s)=x} P({s}) = Σ_x g(x) P(X = x)

Lecture 17 10/14/11

The exponential distribution Expo(λ) has PDF

    f(x) = λ e^{−λx}    for x > 0

and CDF

    F(x) = ∫_0^x λ e^{−λt} dt = 1 − e^{−λx} for x > 0, and 0 otherwise

Observation 17.2. We can normalize any X ∼ Expo(λ) by multiplying by λ, which gives Y = λX ∼ Expo(1). We have

    P(Y ≤ y) = P(X ≤ y/λ) = 1 − e^{−λ·y/λ} = 1 − e^{−y}

Then

    E(Y) = ∫_0^∞ y e^{−y} dy = [−y e^{−y}]_0^∞ + ∫_0^∞ e^{−y} dy = 1

and

    Var(Y) = E(Y²) − (EY)² = ∫_0^∞ y² e^{−y} dy − 1 = 2 − 1 = 1

Then for X = Y/λ, we have E(X) = 1/λ and Var(X) = 1/λ².

Definition 17.3. A random variable X has a memoryless distribution if

    P(X ≥ s + t | X ≥ s) = P(X ≥ t)

Intuitively, if we have a random variable that we interpret as a waiting time, memorylessness means that no matter how long we have already waited, the probability of having to wait a given time more is invariant.
Proposition 17.4. The exponential distribution is memoryless.

Proof. Let X ∼ Expo(λ). We know that

    P(X ≥ t) = 1 − P(X ≤ t) = e^{−λt}

Meanwhile,

    P(X ≥ s + t | X ≥ s) = P(X ≥ s + t, X ≥ s)/P(X ≥ s) = P(X ≥ s + t)/P(X ≥ s) = e^{−λ(s+t)}/e^{−λs} = e^{−λt} = P(X ≥ t)
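A simulation sketch (mine) of memorylessness: among exponential waiting times that exceed s, the probability of exceeding s + t matches the unconditional probability of exceeding t; NumPy assumed.

import numpy as np

rng = np.random.default_rng(0)
lam, s, t = 0.5, 1.0, 2.0
x = rng.exponential(1 / lam, size=1_000_000)   # numpy parameterizes by scale 1/lambda

lhs = np.mean(x[x >= s] >= s + t)              # P(X >= s+t | X >= s)
rhs = np.mean(x >= t)                          # P(X >= t)
print(lhs, rhs, np.exp(-lam * t))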
Lecture 18 10/17/11

(Memorylessness characterizes the exponential.) Let X be a positive memoryless random variable with survival function G(t) = P(X ≥ t). Memorylessness gives G(s + t) = G(s)G(t), hence

    G(kt) = G(t)^k

for positive integers k. This can be extended to all k ∈ R. If we take t = 1, then we have

    G(x) = G(1)^x = e^{x ln G(1)}

But since G(1) < 1, we can define λ = −ln G(1), and we will have λ > 0. Then this gives us

    F(x) = 1 − G(x) = 1 − e^{−λx}

as desired; the exponential is the only memoryless continuous distribution.

Definition 18.2. A random variable X has moment-generating function (MGF)

    M(t) = E(e^{tX})

if M(t) is bounded on some interval (−ε, ε) about zero.

Observation 18.3. We might ask why we call M moment-generating. Consider the Taylor expansion of M:

    M(t) = E(e^{tX}) = E( Σ_{n=0}^∞ X^n t^n / n! ) = Σ_{n=0}^∞ E(X^n) t^n / n!

Note that we cannot simply make use of linearity since our sum is infinite; however, this equation does hold for reasons beyond the scope of the course. This observation also shows us that the nth moment E(X^n) can be read off from the series expansion of M. For independent X and Y, we also have M_{X+Y}(t) = E(e^{tX} e^{tY}) = E(e^{tX}) E(e^{tY}) = M_X(t) M_Y(t).

Observation (MGF of the standard normal). For Z ∼ N(0, 1),

    M(t) = (1/sqrt(2π)) ∫_{−∞}^{∞} e^{tz − z²/2} dz

Completing the square,

    = e^{t²/2} (1/sqrt(2π)) ∫_{−∞}^{∞} e^{−(z − t)²/2} dz = e^{t²/2}

Example. Suppose X_1, X_2, . . . are conditionally independent (given p) random variables that are Bern(p). Suppose also that p is unknown. In the Bayesian approach, let us treat p as a random variable. Let p ∼ Unif(0, 1); we call this the prior distribution.

Let S_n = X_1 + · · · + X_n. Then S_n | p ∼ Bin(n, p). We want to find the posterior distribution, p | S_n, which will give us P(X_{n+1} = 1 | S_n = k). Using Bayes' Theorem,

    f(p | S_n = k) = P(S_n = k | p) f(p) / P(S_n = k)
Since f(p) = 1 on (0, 1), this is

    f(p | S_n = k) = P(S_n = k | p)/P(S_n = k) ∝ p^k (1 − p)^{n−k}

In particular, if all n trials are successes,

    f(p | S_n = n) = (n + 1) p^n

Computing P(X_{n+1} = 1 | S_n = n) simply requires finding the expected value of an indicator with respect to this posterior:

    P(X_{n+1} = 1 | S_n = n) = ∫_0^1 p (n + 1) p^n dp = (n + 1)/(n + 2)

Lecture 19 10/19/11

Observation 19.1. Let X ∼ Expo(1). We want to determine the MGF M of X. By LOTUS,

    M(t) = E(e^{tX}) = ∫_0^∞ e^{tx} e^{−x} dx = ∫_0^∞ e^{−x(1−t)} dx = 1/(1 − t),    t < 1

If we write

    1/(1 − t) = Σ_{n=0}^∞ t^n = Σ_{n=0}^∞ n! t^n / n!

we can read off E(X^n) = n!. By scaling, for Y ∼ Expo(λ) we have Y = X/λ, so

    E(Y^n) = n!/λ^n

Observation 19.2. Let Z ∼ N(0, 1), and let us determine all its moments. We know that for n odd, by symmetry,

    E(Z^n) = 0

We previously showed that M(t) = e^{t²/2}. But we can write

    e^{t²/2} = Σ_{n=0}^∞ (t²/2)^n / n! = Σ_{n=0}^∞ t^{2n}/(2^n n!) = Σ_{n=0}^∞ [(2n)!/(2^n n!)] t^{2n}/(2n)!

So

    E(Z^{2n}) = (2n)!/(2^n n!)

(MGF of the Poisson.) For X ∼ Pois(λ),

    E(e^{tX}) = Σ_{k=0}^∞ e^{tk} e^{−λ} λ^k/k! = e^{−λ} e^{λe^t} = e^{λ(e^t − 1)}

Observation 19.4. Now let X ∼ Pois(λ) and Y ∼ Pois(μ) independent. We want to find the distribution of X + Y. We can simply multiply their MGFs, yielding

    M_X(t) M_Y(t) = e^{λ(e^t − 1)} e^{μ(e^t − 1)} = e^{(λ+μ)(e^t − 1)}

Thus, X + Y ∼ Pois(λ + μ).

Example. Suppose X, Y above are dependent; specifically, take X = Y. Then X + Y = 2X. But this cannot be Poisson since it only takes on even values. We could also compute the mean and variance:

    E(2X) = 2λ,    Var(2X) = 4λ

(Joint distributions.) For random variables X and Y, the joint CDF is

    F(x, y) = P(X ≤ x, Y ≤ y)

In the discrete case X and Y have a joint PMF P(X = x, Y = y), and in the continuous case, X and Y have a joint PDF given by

    f(x, y) = ∂²F(x, y)/∂x∂y

and we can compute

    P((X, Y) ∈ B) = ∫∫_B f(x, y) dx dy
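A quick check (mine) that independent Poissons add: compare the empirical distribution of X + Y with Pois(λ + μ); NumPy and SciPy assumed.

import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(0)
lam, mu, n = 1.5, 2.5, 500_000
s = rng.poisson(lam, n) + rng.poisson(mu, n)

for k in range(6):
    print(k, np.mean(s == k), poisson.pmf(k, lam + mu))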
Definition 19.6. To get the marginal PMF or PDF of a random variable X from its joint PMF or PDF with another random variable Y, we can marginalize over Y by computing

    P(X = x) = Σ_y P(X = x, Y = y)    or    f_X(x) = ∫ f_{X,Y}(x, y) dy

Example. A joint PMF and its marginals can be laid out in a table; for instance,

              X = 0    X = 1
    Y = 0      2/6      2/6   |  4/6
    Y = 1      1/6      1/6   |  2/6
               3/6      3/6   |   1

where the last column and last row give the marginal PMFs of Y and X.

Lecture 20 10/21/11

Definition 20.1. Let X and Y be random variables. Then the conditional PDF of Y | X is

    f_{Y|X}(y|x) = f_{X,Y}(x, y) / f_X(x)

Example. Recall the PDF for our uniform distribution on the disk,

    f(x, y) = 1/π    if x² + y² ≤ 1,    0 otherwise

Marginalizing over Y, we have

    f_X(x) = ∫_{−sqrt(1−x²)}^{sqrt(1−x²)} (1/π) dy = (2/π) sqrt(1 − x²)

Given X = x, we have −sqrt(1 − x²) ≤ y ≤ sqrt(1 − x²), and

    f_{Y|X}(y|x) = (1/π) / ((2/π) sqrt(1 − x²)) = 1/(2 sqrt(1 − x²))

that is, Y | X = x ∼ Unif(−sqrt(1 − x²), sqrt(1 − x²)).

Example. For the uniform distribution on the unit square,

    f(x, y) = c    for 0 ≤ x ≤ 1, 0 ≤ y ≤ 1,    0 otherwise

Normalizing, we simply need c = 1/area = 1. It is apparent that the marginal PDFs are both uniform.

For independent X and Y, we also have E(XY) = ∫∫ xy f_X(x) f_Y(y) dx dy = ∫ E(X) y f_Y(y) dy = E(X)E(Y).
For Z_1, Z_2 i.i.d. N(0, 1), note that E|Z_1 − Z_2| = E|sqrt(2) Z| = sqrt(2) E|Z| = 2/sqrt(π).

Definition (Multinomial). For X ∼ Mult_k(n, p) with p = (p_1, . . . , p_k), the joint PMF is

    P(X_1 = n_1, . . . , X_k = n_k) = (n!/(n_1! · · · n_k!)) p_1^{n_1} · · · p_k^{n_k}

if Σ_j n_j = n, and 0 otherwise.

Observation 21.3. Let X ∼ Mult_k(n, p). Then the marginal distribution of X_j is simply X_j ∼ Bin(n, p_j), since each object is either in category j or not, and we have

    E(X_j) = n p_j,    Var(X_j) = n p_j(1 − p_j)

(Lumping.) If we merge the last categories into one, so that

    p' = (p_1, . . . , p_{l−1}, p_l + · · · + p_k)

we have Y ∼ Mult_l(n, p'), and this is true for any combination of lumpings.

Example (chicken-egg). Let N ∼ Pois(λ) be the number of eggs, suppose each egg hatches independently with probability p, and let X be the number that hatch and Y = N − X the number that do not, with q = 1 − p. Let us find the joint PMF of X and Y.

    P(X = i, Y = j) = Σ_{n=0}^∞ P(X = i, Y = j | N = n) P(N = n)
                    = P(X = i, Y = j | N = i + j) P(N = i + j)
                    = P(X = i | N = i + j) P(N = i + j)
                    = ((i + j)!/(i! j!)) p^i q^j · e^{−λ} λ^{i+j}/(i + j)!
                    = (e^{−λp} (λp)^i / i!) (e^{−λq} (λq)^j / j!)

so X ∼ Pois(λp) and Y ∼ Pois(λq) are independent. In other words, the randomness of the number of eggs offsets the dependence of Y on X given a fixed number of eggs. This is a special property of the Poisson distribution.
Lecture 22 10/26/11

(Conditional multinomial.) Conditioning on X_1, the remaining category probabilities renormalize to

    p'_j = p_j/(1 − p_1) = p_j/(p_2 + · · · + p_k)

This is symmetric for all j.

Example. Let X, Y be i.i.d. N(0, 1), and let us find the distribution of X/|Y|. We have

    P(X/|Y| ≤ t) = P(X ≤ t|Y|)
                 = ∫∫_{x ≤ t|y|} (1/(2π)) e^{−x²/2} e^{−y²/2} dx dy
                 = ∫_{−∞}^{∞} (1/sqrt(2π)) e^{−y²/2} ( ∫_{−∞}^{t|y|} (1/sqrt(2π)) e^{−x²/2} dx ) dy
                 = ∫_{−∞}^{∞} (1/sqrt(2π)) e^{−y²/2} Φ(t|y|) dy
                 = sqrt(2/π) ∫_0^∞ e^{−y²/2} Φ(ty) dy

(By independence this is also ∫ P(X ≤ ty) φ(y) dy = ∫ Φ(ty) φ(y) dy.) There is little we can do to compute this integral. Instead, let us compute the PDF, calling the CDF above F(t). Then we have

    F'(t) = sqrt(2/π) ∫_0^∞ y e^{−y²/2} (1/sqrt(2π)) e^{−t²y²/2} dy = (1/π) ∫_0^∞ y e^{−(1+t²)y²/2} dy

Substituting u = (1 + t²)y²/2, du = (1 + t²)y dy,

    = 1/(π(1 + t²))

which is the Cauchy distribution.

Some properties of covariance:

2. Cov(X, Y) = Cov(Y, X)

4. ∀c ∈ R, Cov(X, c) = 0

5. ∀c ∈ R, Cov(cX, Y) = c Cov(X, Y)

6. Cov(X, Y + Z) = Cov(X, Y) + Cov(X, Z)

The last two properties demonstrate that covariance is bilinear. In general,

    Cov( Σ_{i=1}^n a_i X_i, Σ_{j=1}^m b_j Y_j ) = Σ_{i,j} a_i b_j Cov(X_i, Y_j)
Definition. The correlation of X and Y is

    Corr(X, Y) = Cov(X, Y) / (SD(X) SD(Y))

The operation of replacing X by (X − EX)/SD(X) is called standardization.

Lecture 23 10/28/11

For an indicator I_A of an event with probability p, E(I_A²) = E(I_A) = p, so

    Var(I_A) = p − p² = p(1 − p)

It follows (writing the binomial as a sum of independent indicators) that

    Var(X) = np(1 − p)

for X ∼ Bin(n, p). Note also that I_A I_B = I_{A∩B}. For the hypergeometric (sampling from w white and b black marbles without replacement), with p = w/(w + b),

    E(I_A I_B) = P(first two marbles both white) = (w/(w + b)) · ((w − 1)/(w + b − 1))

so

    Cov(I_A, I_B) = (w/(w + b)) · ((w − 1)/(w + b − 1)) − p²

(Change of variables.) For Y = g(X) with g strictly increasing,

    P(Y ≤ y) = P(g(X) ≤ y) = P(X ≤ g^{−1}(y)) = F_X(g^{−1}(y)) = F_X(x)

Then, differentiating, we get by the Chain Rule that

    f_Y(y) = f_X(x) |dx/dy|

(For a sum of independent random variables we similarly have P(X + Y ≤ t) = ∫ F_Y(t − x) f_X(x) dx.)
Example. Consider the log normal distribution, which is given by Y = e^Z for Z ∼ N(0, 1). We have

    f_Y(y) = (1/sqrt(2π)) e^{−z²/2} · |dz/dy|

To put this in terms of y, we substitute z = ln y. Moreover, we know that dy/dz = e^z = y, and so,

    f_Y(y) = (1/y) (1/sqrt(2π)) e^{−(ln y)²/2}

Theorem 23.2. Suppose that X is a continuous random variable in n dimensions, and Y = g(X) where g : R^n → R^n is continuously differentiable and invertible. Then

    f_Y(y) = f_X(x) |det(dx/dy)|

where dx/dy is the Jacobian matrix of partial derivatives ∂x_i/∂y_j.

We now briefly turn our attention to proving the existence of objects with some desired property A using probability. We want to show that P(A) > 0 for some random object, which implies that some such object must exist. Reframing this question, suppose each object in our universe of objects has some kind of score associated with this property; then we want to show that there is some object with a good score. But we know that there is an object with score at least equal to the average score, i.e., the score of a random object. Showing that this average is high enough will prove the existence of an object without specifying one.

Example. Suppose there are 100 people in 15 committees of 20 people each, and that each person is on exactly 3 committees. We want to show that there exist 2 committees with overlap at least 3. Let us find the average overlap of two random committees. Using indicator random variables for the event that a given person is on both of those two committees, we get

    E(overlap) = 100 · \binom{3}{2}/\binom{15}{2} = 300/105 = 20/7

Since 20/7 > 2 and overlaps are integers, some pair of committees must overlap in at least 3 people.
Lecture 24 10/31/11
Definition 24.1. The beta distribution, Beta(a, b) for
a, b > 0, is defined by PDF
    f(x) = c x^{a−1} (1 − x)^{b−1}    for 0 < x < 1,    and 0 otherwise
where c is a normalizing constant (defined by the beta
function).
The beta distribution is a flexible family of continuous
distributions on (0, 1). By flexible, we mean that the appearance of the distribution varies significantly depending
on the values of its parameters. If a = b = 1, the beta
reduces to the uniform. If a = 2 and b = 1, the beta appears as a line with positive slope. If a = b = 1/2, the beta
appears to be concave-up and parabolic; if a = b = 2, it
is concave down.
The beta distribution is often used as a prior distribution for some parameter on (0, 1). In particular, it is
the conjugate prior to the binomial distribution.
Observation 24.2. Suppose that, based on some data, we have X | p ∼ Bin(n, p), and that our prior distribution for p is p ∼ Beta(a, b). We want to determine the posterior distribution of p, p | X. We have

    f(p | X = k) = P(X = k | p) f(p) / P(X = k)
                 = \binom{n}{k} p^k (1 − p)^{n−k} · c p^{a−1} (1 − p)^{b−1} / P(X = k)
                 ∝ p^{a+k−1} (1 − p)^{b+n−k−1}

so p | X = k ∼ Beta(a + k, b + n − k).

Lecture 25 11/2/11

Definition 25.1. The gamma function is

    Γ(a) = ∫_0^∞ x^{a−1} e^{−x} dx

for any a > 0. The gamma function is a continuous extension of the factorial operator on natural numbers. For n a positive integer,

    Γ(n) = (n − 1)!

More generally,

    Γ(x + 1) = xΓ(x)

Definition 25.2. The standard gamma distribution, Gamma(a, 1), is defined by PDF

    (1/Γ(a)) x^{a−1} e^{−x}    for x > 0

Proposition. If X_1, . . . , X_n are i.i.d. Expo(λ), then

    T_n = Σ_{j=1}^n X_j ∼ Gamma(n, λ)

The exponential distribution is the continuous analogue of the geometric distribution; in this sense, the gamma distribution is the continuous analogue of the negative binomial distribution.
Proof. One method of proof, which we will not use, would be to repeatedly convolve the PDFs of the i.i.d. X_j. Instead, we will use MGFs. Suppose that the X_j are i.i.d. Expo(1); we will show that their sum is Gamma(n, 1). The MGF of X_j is given by

    M_{X_j}(t) = 1/(1 − t)

for t < 1. Then the MGF of T_n is

    M_{T_n}(t) = (1/(1 − t))^n

also for t < 1. We will show that the gamma distribution has the same MGF. Let Y ∼ Gamma(n, 1). Then by LOTUS,

    E(e^{tY}) = (1/Γ(n)) ∫_0^∞ e^{ty} y^{n−1} e^{−y} dy = (1/Γ(n)) ∫_0^∞ y^{n−1} e^{−(1−t)y} dy

Changing variables, with x = (1 − t)y,

    = ((1 − t)^{−n}/Γ(n)) ∫_0^∞ x^{n−1} e^{−x} dx = (1/(1 − t))^n

Note that this is the MGF for any n > 0, although the sum-of-exponentials expression requires integral n.

Lecture 26 11/4/11

Observation 26.1 (Gamma-Beta). Let us take X ∼ Gamma(a, λ) to be your waiting time in line at the bank, and Y ∼ Gamma(b, λ) your waiting time in line at the post office. Suppose that X and Y are independent. Let T = X + Y; we know that this has distribution Gamma(a + b, λ).

Let us compute the joint distribution of T and of W = X/(X + Y), the fraction of time spent waiting at the bank. For simplicity of notation, we will take λ = 1. The joint PDF is given by

    f_{T,W}(t, w) = f_{X,Y}(x, y) |det(∂(x, y)/∂(t, w))| = (1/(Γ(a)Γ(b))) x^{a−1} e^{−x} y^{b−1} e^{−y} |det(∂(x, y)/∂(t, w))|

We must find the determinant of the Jacobian (here expressed in silly-looking notation). We know that

    x + y = t,    x/(x + y) = w

so x = tw and y = t(1 − w), and the Jacobian determinant has absolute value t. Substituting,

    f_{T,W}(t, w) = [Γ(a + b)/(Γ(a)Γ(b))] w^{a−1} (1 − w)^{b−1} · (1/Γ(a + b)) t^{a+b−1} e^{−t}

This yields W ∼ Beta(a, b), independent of T, and also gives the normalizing constant of the beta distribution.

It turns out that if X were distributed according to any other distribution, we would not have independence, but proving so is out of the scope of the course.
Observation 26.2. Let us find E(W) for W ∼ Beta(a, b). Let us write W = X/(X + Y) with X and Y defined as above. We have

    E(X/(X + Y)) = E(X)/E(X + Y) = a/(a + b)

Note that in general, the first equality is false! However, because X + Y and X/(X + Y) are independent, they are uncorrelated, and hence

    E(X) = E( (X/(X + Y)) · (X + Y) ) = E(X/(X + Y)) E(X + Y)

Definition 26.3. Let X_1, . . . , X_n be i.i.d. The order statistics of this sequence are the sorted values

    X_(1) ≤ X_(2) ≤ · · · ≤ X_(n)

Observation 26.4. Let X_1, . . . , X_n be i.i.d. continuous with PDF f and CDF F. We want to find the CDF and PDF of X_(j). For the CDF, we have

    P(X_(j) ≤ x) = P(at least j of the X_i are ≤ x) = Σ_{k=j}^n \binom{n}{k} P(X_1 ≤ x)^k (1 − P(X_1 ≤ x))^{n−k} = Σ_{k=j}^n \binom{n}{k} F(x)^k (1 − F(x))^{n−k}

Turning now to the PDF, recall that a PDF gives a density rather than a probability. We can multiply the PDF of X_(j) at a point x by a tiny interval dx about x in order to obtain the probability that X_(j) is in that interval. Then we can simply count the number of ways to have one of the X_i be in that interval and precisely j − 1 of the X_i below the interval. So,

    f_{X_(j)}(x) dx = n (f(x) dx) \binom{n−1}{j−1} F(x)^{j−1} (1 − F(x))^{n−j}
    f_{X_(j)}(x) = n \binom{n−1}{j−1} F(x)^{j−1} (1 − F(x))^{n−j} f(x)

Example. Let U_1, . . . , U_n be i.i.d. Unif(0, 1), and let us determine the distribution of U_(j). Applying the above result, we have

    f_{U_(j)}(x) = n \binom{n−1}{j−1} x^{j−1} (1 − x)^{n−j}

for 0 < x < 1. Thus, we have U_(j) ∼ Beta(j, n − j + 1). This confirms our earlier result that, for U_1 and U_2 i.i.d. Unif(0, 1), we have

    E|U_1 − U_2| = E(U_max) − E(U_min) = 2/3 − 1/3 = 1/3

because U_max ∼ Beta(2, 1) and U_min ∼ Beta(1, 2), which have means 2/3 and 1/3 respectively.
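A simulation sketch (mine) of the uniform order statistics result: the jth smallest of n uniforms should have mean j/(n + 1), the Beta(j, n − j + 1) mean; NumPy assumed.

import numpy as np

rng = np.random.default_rng(0)
n, trials = 5, 200_000
u = np.sort(rng.random((trials, n)), axis=1)

for j in range(1, n + 1):
    print(j, u[:, j - 1].mean(), j / (n + 1))   # Beta(j, n-j+1) has mean j/(n+1)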
Lecture 27 11/7/11

Example (two-envelope paradox). One envelope contains twice as much money as the other; let X be the amount in the first envelope and Y the amount in the second. By symmetry, E(X) = E(Y), which is simple and straightforward. We might also, however, try to condition on the value of Y with respect to X using the Law of Total Probability:

    E(Y) = E(Y | Y = 2X) P(Y = 2X) + E(Y | Y = X/2) P(Y = X/2)

and if we could replace E(Y | Y = 2X) by E(2X) and E(Y | Y = X/2) by E(X/2), this would give E(Y) = (5/4)E(X). Assuming that X and Y are not 0 or infinite, these cannot both be correct, and the argument from symmetry is the correct one.

The flaw in our second argument is that, in general, writing Z = 2X,

    E(Y | Y = Z) ≠ E(Z)

because we cannot drop the condition that Y = Z; we must write

    E(Y | Y = Z) = E(Z | Y = Z)

In other words, if we let I be the indicator for Y = 2X, we are saying that X and I are dependent.
Example (Patterns in coin flips). Suppose we repeatedly flip a fair coin. We want to determine how many flips it takes until HT is observed (including the H and T); similarly, we can ask how many flips it takes to get HH. Let us call these random variables W_HT and W_HH respectively. Note that, by symmetry,

    E(W_HH) = E(W_TT)    and    E(W_HT) = E(W_TH)

For W_HT, we can write W_HT = W_1 + W_2, where W_1 is the number of flips until the first H and W_2 the number of additional flips until the next T. Then

    E(W_HT) = E(W_1) + E(W_2) = 2 + 2 = 4

because W_i − 1 ∼ Geom(1/2).

Now let us consider W_HH. The distinction here is that no progress can be easily made; once we get a heads, we are not decidedly halfway to the goal, because if the next flip is tails, we lose all our work. Instead, we make use of conditional expectation. Let H_i be the event that the ith toss is heads, T_i = H_i^C the event that it is tails. Then

    E(W_HH) = E(W_HH | H_1)·(1/2) + E(W_HH | T_1)·(1/2)
            = [E(W_HH | H_1, H_2)·(1/2) + E(W_HH | H_1, T_2)·(1/2)]·(1/2) + (1 + E(W_HH))·(1/2)
            = [2·(1/2) + (2 + E(W_HH))·(1/2)]·(1/2) + (1 + E(W_HH))·(1/2)

Solving for E(W_HH) gives

    E(W_HH) = 6

So far, we have been conditioning expectations on events. Let X and Y be random variables; then this kind of conditioning includes computing E(Y | X = x). If Y is discrete, then

    E(Y | X = x) = Σ_y y P(Y = y | X = x)

and if Y is continuous,

    E(Y | X = x) = ∫ y f_{Y|X}(y|x) dy = ∫ y f_{X,Y}(x, y)/f_X(x) dy    (if X is continuous)

Definition 27.1. Now let us write

    g(x) = E(Y | X = x)

Then

    E(Y|X) = g(X)

So, suppose for instance that g(x) = x²; then g(X) = X². We can see that E(Y|X) is a random variable and a function of X. This is a conditional expectation.

For example, since X is a function of itself,

    E(h(X) | X) = h(X)

and if X and Y are independent,

    E(X + Y | X) = X + E(Y|X) = X + E(Y) = X + μ

Now let us determine E(X | X + Y) for X and Y i.i.d. Pois(λ). We can do this in two different ways. First, let T = X + Y and let us find the conditional PMF:

    P(X = k | T = n) = P(T = n | X = k) P(X = k) / P(T = n)
                     = P(Y = n − k) P(X = k) / P(T = n)
                     = [ (e^{−λ} λ^{n−k}/(n − k)!) (e^{−λ} λ^k/k!) ] / [ e^{−2λ} (2λ)^n / n! ]
                     = \binom{n}{k} (1/2)^n

That is, X | T = n ∼ Bin(n, 1/2). Thus, we have

    E(X | T = n) = n/2

which means that

    E(X|T) = T/2

In our second method, first we note that

    E(X | X + Y) = E(Y | X + Y)

by symmetry (since they are i.i.d.). We have

    E(X | T) + E(Y | T) = E(X + Y | T) = X + Y = T

so again E(X|T) = T/2.
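A quick simulation sketch (mine) of E(X | X + Y) = T/2 for i.i.d. Poissons: group draws by the observed total and average X within each group; NumPy assumed.

import numpy as np

rng = np.random.default_rng(0)
lam, n = 3.0, 500_000
x, y = rng.poisson(lam, n), rng.poisson(lam, n)
t = x + y

for total in (2, 4, 6, 8):
    mask = t == total
    print(total, x[mask].mean(), total / 2)   # conditional mean of X given T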
Lecture 28 11/9/11

Proposition (Properties of conditional expectation).

1. E(h(X)Y | X) = h(X)E(Y|X) ("taking out what's known").

2. If X and Y are independent, then E(Y|X) = E(Y).

3. E(E(Y|X)) = E(Y). This is called iterated expectation or Adam's Law; it is usually more useful to think of this as computing E(Y) by choosing a simple X to work with.

4. E((Y − E(Y|X))h(X)) = 0. In words, the residual (i.e., Y − E(Y|X)) is uncorrelated with h(X).

Proof.

1. Since we know X, we know h(X), and this is equivalent to factoring out a constant (by linearity).

2. Immediate.

3. In the discrete case,

    E(E(Y|X)) = Σ_x E(Y | X = x) P(X = x)
              = Σ_x ( Σ_y y P(Y = y | X = x) ) P(X = x)
              = Σ_y y Σ_x P(Y = y, X = x)
              = Σ_y y P(Y = y)
              = E(Y)

4. We have

    E((Y − E(Y|X))h(X)) = E(Y h(X)) − E(E(Y|X) h(X)) = E(Y h(X)) − E(E(Y h(X) | X)) = E(Y h(X)) − E(Y h(X)) = 0

Definition 28.2. We can define the conditional variance much as we did conditional expectation. Let X and Y be random variables. Then

    Var(Y|X) = E((Y − E(Y|X))² | X) = E(Y²|X) − (E(Y|X))²
Example. Suppose we choose a random city and then choose a random sample of n people in that city. Let X be the number of people with a particular disease, and Q the proportion of people in the chosen city with the disease. Let us determine E(X) and Var(X), assuming Q ∼ Beta(a, b) (a mathematically convenient, flexible distribution).

Assume that X|Q ∼ Bin(n, Q). Then

    E(X) = E(E(X|Q)) = E(nQ) = n · a/(a + b)

and

    Var(X) = E(Var(X|Q)) + Var(E(X|Q))

We have

    E(Q(1 − Q)) = [Γ(a + b)/(Γ(a)Γ(b))] ∫_0^1 q^a (1 − q)^b dq
                = [Γ(a + b)/(Γ(a)Γ(b))] · [Γ(a + 1)Γ(b + 1)/Γ(a + b + 2)]
                = ab(a + b) / ((a + b + 1)(a + b)(a + b))
                = ab / ((a + b)(a + b + 1))

and

    Var(Q) = μ(1 − μ)/(a + b + 1)

where μ = a/(a + b). This gives us all the information we need to easily compute Var(X).

Lecture 29 11/14/11

Example. Consider a store with a random number N of customers. Let X_j be the amount the jth customer spends, with E(X_j) = μ and Var(X_j) = σ². Assume that N, X_1, X_2, . . . are independent. We want to determine the mean and variance of

    X = Σ_{j=1}^N X_j

We might, at first, mistakenly invoke linearity to claim that E(X) = Nμ. But this is incoherent; the LHS is a real number whereas the RHS is a random variable. However, this error highlights something useful: we want to make N a constant, so let us condition on N. Then using the Law of Total Probability, we have

    E(X) = Σ_{n=0}^∞ E(X | N = n) P(N = n) = Σ_{n=0}^∞ nμ P(N = n) = μ E(N)

Note that we can drop the conditional because N and the X_j are independent; otherwise, this would not be true. We could also apply Adam's Law to get

    E(X) = E(E(X|N)) = E(Nμ) = μ E(N)

For the variance,

    Var(X) = E(Var(X|N)) + Var(E(X|N)) = E(Nσ²) + Var(Nμ) = σ² E(N) + μ² Var(N)

We now turn our attention to statistical inequalities.

Theorem 29.1 (Cauchy-Schwarz Inequality).

    |E(XY)| ≤ sqrt(E(X²) E(Y²))

If X and Y are uncorrelated, E(XY) = (EX)(EY), so we don't need the inequality. We will not prove this inequality in general. However, if X and Y have mean 0, then

    |Corr(X, Y)| = |E(XY)| / sqrt(E(X²) E(Y²)) ≤ 1

Theorem 29.2 (Jensen's Inequality). If g : R → R is convex (i.e., g'' > 0), then

    Eg(X) ≥ g(EX)

If g is concave (i.e., g'' < 0), then

    Eg(X) ≤ g(EX)

Example. If X is positive, then

    E(1/X) ≥ 1/EX    and    E(ln X) ≤ ln(EX)
Proof (of Jensen's Inequality). It is true of any convex function g that

    g(x) ≥ a + bx

where a + bx is the tangent line to g at EX (with equality at x = EX). Then

    g(X) ≥ a + bX
    Eg(X) ≥ E(a + bX) = a + bE(X) = g(EX)

Theorem 29.3 (Markov Inequality). For a > 0,

    P(|X| ≥ a) ≤ E|X|/a

Theorem (Chebyshev Inequality). For a > 0,

    P(|X − μ| ≥ a) ≤ Var(X)/a²

Proof.

    P(|X − μ| ≥ a) = P((X − μ)² ≥ a²) ≤ E((X − μ)²)/a² = Var(X)/a²

Lecture 30 11/16/11

Definition 30.1. Let X_1, X_2, . . . be i.i.d. random variables with mean μ and variance σ². The sample mean of the first n random variables is

    X̄_n = (1/n) Σ_{j=1}^n X_j

(Law of Large Numbers.) For any c > 0, P(|X̄_n − μ| > c) → 0, since by Chebyshev,

    P(|X̄_n − μ| > c) ≤ Var(X̄_n)/c² = (1/c²)(σ²/n) = σ²/(nc²) → 0

(Central Limit Theorem.) Moreover,

    (X̄_n − μ) / (σ/n^{1/2}) → N(0, 1)

in distribution.
Lecture 31 11/18/11

Proof sketch (of the CLT, via MGFs). Assume WLOG that the X_j have mean 0 and variance 1, let S_n = Σ_{j=1}^n X_j, and let M be the MGF of X_1. Then

    E(e^{t S_n/sqrt(n)}) = E(e^{t X_1/sqrt(n)})^n = M(t/sqrt(n))^n

Taking logs, we want lim_{n→∞} n ln M(t/sqrt(n)). Substituting y = 1/sqrt(n),

    lim_{n→∞} n ln M(t/sqrt(n)) = lim_{y→0} ln M(ty)/y²
    (L'Hopital's)               = lim_{y→0} t M'(ty)/(2y M(ty))
    [M(0) = 1, M'(0) = 0]       = (t/2) lim_{y→0} M'(ty)/y
    (L'Hopital's)               = (t²/2) lim_{y→0} M''(ty)
                                = t²/2

since M''(0) = E(X_1²) = 1, and e^{t²/2} is the MGF of N(0, 1).

(Chi-square.) χ²_1 = Gamma(1/2, 1/2), and more generally χ²_n = Gamma(n/2, 1/2).

(Normal approximation to the binomial.) For X = Σ_{j=1}^n X_j ∼ Bin(n, p) with q = 1 − p and n large,

    P(a ≤ X ≤ b) = P( (a − np)/sqrt(npq) ≤ (X − np)/sqrt(npq) ≤ (b − np)/sqrt(npq) ) ≈ Φ((b − np)/sqrt(npq)) − Φ((a − np)/sqrt(npq))

Since X is discrete, a point probability is approximated with a continuity correction: P(X = a) ≈ P(a − 1/2 < X < a + 1/2).

The even moments of Z ∼ N(0, 1) are

    E(Z²) = 1,    E(Z⁴) = 1 · 3,    E(Z⁶) = 1 · 3 · 5
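A numerical sketch (mine) of the normal approximation: compare an exact binomial probability with the Φ-based approximation (including the continuity correction); SciPy assumed.

from scipy.stats import binom, norm
import numpy as np

n, p = 100, 0.5
a, b = 45, 55
q = 1 - p
mu, sd = n * p, np.sqrt(n * p * q)

exact = binom.cdf(b, n, p) - binom.cdf(a - 1, n, p)
approx = norm.cdf((b + 0.5 - mu) / sd) - norm.cdf((a - 0.5 - mu) / sd)
print(exact, approx)   # both near 0.73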
(Multivariate normal.) A random vector (X_1, . . . , X_k) is multivariate normal (MVN) if every linear combination of the X_j is normal.

Example. Let Z, W be i.i.d. N(0, 1). Then (Z + 2W, 3Z + 5W) is MVN, since

    s(Z + 2W) + t(3Z + 5W) = (s + 3t)Z + (2s + 5t)W

is a sum of independent normals and hence normal.

Example. Let Z ∼ N(0, 1). Let S be a random sign (±1 with equal probabilities) independent of Z. Then Z and SZ are marginally standard normal. However, (Z, SZ) is not multivariate normal, since Z + SZ is 0 with probability 1/2.

Observation 31.8. Recall that the MGF for X ∼ N(μ, σ²) is given by

    E(e^{tX}) = e^{μt + σ²t²/2}

For X and Y i.i.d. normal,

    Cov(X + Y, X − Y) = Var(X) − Cov(X, Y) + Cov(X, Y) − Var(Y) = 0

So by our above theorem (within an MVN vector, uncorrelated implies independent), X + Y and X − Y are independent.

Lecture 32 11/21/11

Definition 32.1. Let X_0, X_1, X_2, . . . be a sequence of random variables. We think of X_n as the state of a finite system at a discrete time n (that is, the X_n have discrete indices and each has finite range). The sequence has the Markov property if

    P(X_{n+1} = j | X_n = i, X_{n−1} = i_{n−1}, . . . , X_0 = i_0) = P(X_{n+1} = j | X_n = i)

When this does not depend on n, we write

    q_{ij} := P(X_{n+1} = j | X_n = i)

called the transition probability, and we call the sequence a homogeneous Markov chain.

To describe a homogeneous Markov chain we simply need to show the states of the process and the transition probabilities. We could, instead, array the q_{ij}'s as a matrix,

    Q = (q_{ij})

called the transition matrix.

Note. More generally, we could consider continuous systems (i.e., spaces) at continuous times and more broadly study stochastic processes. However, in this course, we will restrict our study to homogeneous Markov chains.
Lecture 33 11/28/11

Observation 32.2. Suppose that at time n, X_n has distribution s (a row vector representing the PMF). Then

    P(X_{n+1} = j) = Σ_i s_i q_{ij} = (sQ)_j

So sQ is the distribution of X_{n+1}. More generally, we have that sQ^j is the distribution of X_{n+j}.

We can also compute the two-step transition probability:

    P(X_{n+2} = j | X_n = i) = Σ_k P(X_{n+2} = j | X_{n+1} = k, X_n = i) P(X_{n+1} = k | X_n = i) = Σ_k q_{ik} q_{kj} = (Q²)_{ij}

More generally, we have

    P(X_{n+m} = j | X_n = i) = (Q^m)_{ij}

A probability vector s is stationary for the chain if sQ = s. For an irreducible Markov chain:

1. A stationary distribution s exists.

2. s is unique.

3. s_i = 1/r_i, where r_i is the average time to return to state i starting from state i.

4. If Q^m is strictly positive for some m, then

    lim_{n→∞} P(X_n = i) = s_i

Alternatively, if t is any (starting-state) probability vector, then lim_{n→∞} tQ^n = s.

Definition 33.5. A Markov chain with transition matrix Q is reversible if there is a probability vector s such that

    s_i q_{ij} = s_j q_{ji}

for all pairs of states i and j.

Reversibility is also known as time-reversibility. Intuitively, the progression of a reversible Markov chain could be played back backwards, and the probabilities would be consistent with the original Markov chain.

Theorem 33.6. If a Markov chain is reversible with respect to s, then s is stationary.

Proof. We know that s_i q_{ij} = s_j q_{ji} for all i and j. Summing over all states i,

    Σ_i s_i q_{ij} = Σ_i s_j q_{ji} = s_j Σ_i q_{ji} = s_j

But since this is true for every j, this is exactly the statement of

    sQ = s

as desired.

Example (random walk on an undirected graph). Consider the chain that moves from a vertex to a uniformly random neighbor, and let d_i be the degree of vertex i. We claim that d_i q_{ij} = d_j q_{ji} for all i, j. Assume i ≠ j. Since the graph is undirected, q_{ij} and q_{ji} are either both zero or both nonzero. If (i, j) is an edge, then

    q_{ij} = 1/d_i

since our Markov chain represents a random walk. But this suffices to prove our claim. Let us now normalize d_i to a stationary vector s_i. This is easy; we can simply take

    s_i = d_i / Σ_j d_j

and we have thus found our desired stationary distribution.
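To close, a small sketch (mine, not from the notes) illustrating these facts numerically: for a random walk on an undirected graph, powers of Q converge to the degree-proportional stationary distribution; NumPy assumed.

import numpy as np

# Adjacency matrix of a small undirected graph (4 vertices).
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)

deg = A.sum(axis=1)
Q = A / deg[:, None]               # q_ij = 1/d_i for each edge (i, j)

s = deg / deg.sum()                # claimed stationary distribution
print(np.allclose(s @ Q, s))       # True: s is stationary

t = np.array([1.0, 0.0, 0.0, 0.0])
print(t @ np.linalg.matrix_power(Q, 50))   # converges to s = [0.2, 0.3, 0.3, 0.2]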