
INTRODUCTION TO BAYESIAN METHODS I

Abstract. We will revisit point estimation and hypothesis testing from the Bayesian perspective.

1. Introduction
1.1. A Math 526 Exercise. Suppose B1 , B2 , B3 give a partition of
a sample space Ω, so that B1 , B2 , B3 are mutually exclusive, and their
union is all of Ω. Given any event A, it can be written as the disjoint
union,
A = (A ∩ B1 ) ∪ (A ∩ B2 ) ∪ (A ∩ B3 ),
thus
P(A) = P(A ∩ B1 ) + P(A ∩ B2 ) + P(A ∩ B3 ).
We also know from the definition of conditional probability that if
each of the Bi's has non-zero probability, then
P(A ∩ Bi ) = P(A|Bi )P(Bi ),
for each i = 1, 2, 3. Thus we obtain that,
P(A) = P(A|B1 )P(B1 ) + P(A|B2 )P(B2 ) + P(A|B3 )P(B3 ).
Recall that this is referred to as the rule of total probability.
Exercise 1 (Two Face). The DC comic book villain Two-Face often
uses a coin to decide the fate of his victims. If the result of the flip is
tails, then the victim is spared, otherwise the victim is killed. It turns
out he actually randomly selects from three coins: a fair one, one that
comes up tails 1/3 of the time, and another that comes up tails 1/10
of the time. What is the probability that a victim is spared?
Solution. Let Sp denote the event that the victim is spared and let
C1 be the event that the fair coin is used, C2 be the event that the coin
that comes up tails 1/3 of the time is used, and C3 be the event that
the coin that comes up tails 1/10 of the time is used. Then
P(Sp) = P(Sp|C1)P(C1) + P(Sp|C2)P(C2) + P(Sp|C3)P(C3)
= (1/2)(1/3) + (1/3)(1/3) + (1/10)(1/3) = 14/45.
Sometimes we also want to compute P(Bi|A), and a bit of algebra gives
the following formula, in the case i = 3:
P(B3|A) = P(B3 ∩ A)/P(A) = P(A|B3)P(B3)/[P(A|B1)P(B1) + P(A|B2)P(B2) + P(A|B3)P(B3)].
Recall that this is referred to as Bayes’ theorem.
Exercise 2. Referring to Exercise 1, suppose that the victim was
spared; what is the probability that the fair coin was used?
Solution. Bayes’ theorem gives
P(C1|Sp) = P(C1 ∩ Sp)/P(Sp) = P(Sp|C1)P(C1)/[P(Sp|C1)P(C1) + P(Sp|C2)P(C2) + P(Sp|C3)P(C3)];
these are all numbers we know.
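As a quick numerical check, here is a minimal Python sketch of Exercises 1 and 2; the code and variable names are ours and not part of the exercises.

```python
# Hedged numerical check of Exercises 1 and 2 (variable names are illustrative).
p_coin = [1/3, 1/3, 1/3]      # P(C1), P(C2), P(C3): a coin is selected at random
p_tails = [1/2, 1/3, 1/10]    # P(Sp|Ci): probability of tails for each coin

# Rule of total probability: P(Sp) = sum_i P(Sp|Ci) P(Ci)
p_spared = sum(pt * pc for pt, pc in zip(p_tails, p_coin))

# Bayes' theorem: P(C1|Sp) = P(Sp|C1) P(C1) / P(Sp)
p_fair_given_spared = p_tails[0] * p_coin[0] / p_spared

print(p_spared)             # 14/45 ≈ 0.3111
print(p_fair_given_spared)  # 15/28 ≈ 0.5357
```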
1.2. Bayesian statistics. In classical statistics, θ ∈ ∆ is unknown so
we take a random sample from fθ and then we make an inference about
θ.
In Bayesian statistics, rather than thinking of the parameter as unknown,
we think of it as a random variable having some unknown
distribution. Let (fθ)θ∈∆ be a family of pdfs. Let Θ be a random
variable with pdf r taking values in ∆. Here r is called the prior pdf for
Θ; we do not really know the true pdf for Θ, and this is a subjective
assignment or guess based on our present knowledge or ignorance. We
think of f (x1 ; θ) = f (x1 |θ) as the conditional pdf of a random variable
X1 that can be generated in the following two-step procedure: first,
we generate Θ = θ, then we generate X1 with pdf fθ . In other words,
we let the joint pdf of X1 and Θ be given by
f (x1 |θ)r(θ).
In shorthand, we will denote this model by writing
X1 |θ ∼ f (x|θ)
Θ ∼ r(θ)
Similarly, we say that X = (X1 , . . . , Xn ) is a random sample from
the conditional distribution of X1 given Θ = θ if X1 |θ ∼ f (x1 |θ) and
L(x; θ) = L(x|θ) = ∏_{i=1}^n f(xi|θ);
in which case the joint pdf of X and Θ is given by
j(x, θ) = L(x|θ)r(θ).
What we are interested in is updating our knowledge or belief about
the distribution of Θ, after observing X = x; more precisely, we consider
s(θ|x) = j(x, θ)/fX(x) = L(x|θ)r(θ)/fX(x),
where fX is the pdf of X, which can be obtained by integrating or
summing the joint density j(x, θ) with respect to θ. We call s the
posterior pdf. Thus ‘prior’ refers to our knowledge of the distribution
of Θ prior to our observation X and ‘posterior’ refers to our knowledge
after our observation of X.
Let us remark on our notation. Earlier in the course, we used Θ to
denote the set of possible parameter values; here, we use ∆ to denote
this set, since we are reserving Θ for a random variable taking values
θ ∈ ∆. We use r to denote the prior pdf, the next letter s to denote
the posterior pdf, j to denote the joint pdf of X and Θ, fX to denote
the pdf of X alone, and fθ = f(·|θ) to denote the pdf of the conditional
distribution of Xi given Θ = θ.
Exercise 3. Let X = (X1 , . . . , Xn ) be a random sample from the con-
ditional distribution of X1 given Θ = θ, where X1 |θ ∼ Bern(θ) and
Θ ∼ Unif(0, 1). Find the posterior distribution.
Solution. Let x ∈ {0, 1}^n and t = x1 + · · · + xn. Let θ ∈ (0, 1). Let
r(θ) = 1[θ ∈ (0, 1)] be the prior pdf for Θ and fX be the pdf for X. We
have that
s(θ|x) = L(x|θ)r(θ)/fX(x) = θ^t(1 − θ)^(n−t)/fX(x).
We have that
fX(x) = ∫_0^1 L(x|θ) dθ = ∫_0^1 θ^t(1 − θ)^(n−t) dθ.

This is actually a somewhat difficult integral. However, we already
know the answer to this:
fX(x) = B(t + 1, n − t + 1).


To see why, recall that the pdf of the beta distribution with parameters
α > 0 and β > 0 is given by
f(t; α, β) = (1/B(α, β)) t^(α−1)(1 − t)^(β−1) 1[t ∈ (0, 1)],
where B(α, β) is a constant so that f integrates to 1. Sometimes the
function B is called the beta function. It can also be shown that
B(α, β) = Γ(α)Γ(β)/Γ(α + β).
Thus we have that s(θ|x) is given by the pdf of the beta distribution in θ
with parameters α = t + 1 and β = n − t + 1, where t = x1 + · · · + xn.
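As a sanity check (not part of the notes; the sample size and sample sum below are hypothetical), one can normalize θ^t(1 − θ)^(n−t) on a grid and compare with the Beta(t + 1, n − t + 1) pdf:

```python
import numpy as np
from scipy.stats import beta

n, t = 10, 7                                   # hypothetical sample size and sample sum
theta = np.linspace(0.001, 0.999, 999)
dtheta = theta[1] - theta[0]

unnorm = theta**t * (1 - theta)**(n - t)       # L(x|theta) * r(theta), with r = 1 on (0, 1)
posterior = unnorm / (unnorm * dtheta).sum()   # normalize on the grid

closed_form = beta.pdf(theta, t + 1, n - t + 1)
print(np.max(np.abs(posterior - closed_form)))  # small; only grid-discretization error
```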
1.3. Calculation tools. In computing a posterior distribution it is
not necessary to directly compute the pdf of X. We have that
s(θ|x) = L(x|θ)r(θ)(fX(x))^(−1).
Since s itself is a pdf, fX (x) (which does not depend on θ) can be
thought of as a normalizing constant. Often one writes
s(θ|x) ∝ G(x; θ),
to mean that there exists a constant c(x) such that
s(θ|x) = c(x)G(x; θ).
Thus
s(θ|x) ∝ L(x|θ)r(θ).
It is often possible to identify the pdf from an expression involving
L(x|θ)r(θ) or other simplified expressions.
For example, in the presence of a sufficient statistic t = t(x), we have
that
L(x|θ) = g(t(x); θ)H(x),
for some functions g and H, where H does not depend on θ. Thus we
have that
s(θ|x) ∝ g(t(x); θ)r(θ).
Since s(θ|x) = s(θ|x′) if t(x) = t(x′), sometimes we will abuse notation
slightly and write
s(θ|t) ∝ g(t(x); θ)r(θ);
here the constant of proportionality depends on t. We will refer to
s(θ|t) as the posterior pdf of Θ given the sufficient statistic t.
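To illustrate this numerically in the setting of Exercise 3 (the sample below is hypothetical), normalizing the full likelihood L(x|θ)r(θ) and normalizing g(t(x); θ)r(θ) give the same posterior, since the binomial coefficient is absorbed into the constant of proportionality:

```python
import numpy as np
from scipy.stats import binom

x = np.array([1, 0, 1, 1, 0, 1, 1])    # hypothetical Bernoulli sample
n, t = len(x), x.sum()
theta = np.linspace(0.001, 0.999, 999)
dtheta = theta[1] - theta[0]

# Full likelihood L(x|theta); the uniform prior of Exercise 3 contributes a factor of 1.
L_full = np.prod([theta**xi * (1 - theta)**(1 - xi) for xi in x], axis=0)
# Likelihood of the sufficient statistic: t ~ Bin(n, theta) given Theta = theta.
g_t = binom.pmf(t, n, theta)

post_full = L_full / (L_full * dtheta).sum()
post_t = g_t / (g_t * dtheta).sum()
print(np.max(np.abs(post_full - post_t)))   # essentially zero
```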
Exercise 4. Let X = (X1 , . . . , Xn ) be a random sample from the
conditional distribution of X1 given Θ = θ, where X1 |θ ∼ N (θ, 1) and
Θ ∼ N (0, 1). Find the posterior distribution.
Solution. Let x ∈ Rn and x̄ be the usual sample mean. We know
that X̄ is a sufficient statistic for θ and that conditional on Θ = θ,
X̄ ∼ N (θ, 1/n). Let g(·; θ) be the pdf of a N (θ, 1/n) random variable.
We have that the posterior pdf of Θ given the sufficient statistic x̄ is given by
s(θ|x̄) ∝ g(x̄; θ)r(θ)
∝ exp[−(n/2)(x̄ − θ)^2] exp[−(1/2)θ^2]
∝ exp[−(1/2)(nx̄^2 − 2nx̄θ + (n + 1)θ^2)]
∝ exp[−(1/2)((n + 1)θ^2 − 2nx̄θ)]
∝ exp[−((n + 1)/2)(θ^2 − 2(n/(n + 1))x̄θ)]
∝ exp[−((n + 1)/2)(θ^2 − 2(n/(n + 1))x̄θ + ((n/(n + 1))x̄)^2)]
∝ exp[−((n + 1)/2)(θ − (n/(n + 1))x̄)^2].
Hence, we can recognize that s(θ|x̄) is the pdf of a normal random
variable with mean (n/(n + 1))x̄ and variance 1/(n + 1).
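A minimal numerical check of Exercise 4 (the sample size and sample mean are hypothetical): the grid posterior should match the N(nx̄/(n + 1), 1/(n + 1)) pdf.

```python
import numpy as np
from scipy.stats import norm

n, xbar = 20, 0.8                     # hypothetical sample size and sample mean
theta = np.linspace(-3, 3, 6001)
dtheta = theta[1] - theta[0]

# g(xbar; theta) r(theta): N(theta, 1/n) likelihood for xbar times the N(0, 1) prior.
unnorm = norm.pdf(xbar, loc=theta, scale=1/np.sqrt(n)) * norm.pdf(theta, 0, 1)
posterior = unnorm / (unnorm * dtheta).sum()

closed_form = norm.pdf(theta, loc=n * xbar / (n + 1), scale=1/np.sqrt(n + 1))
print(np.max(np.abs(posterior - closed_form)))   # small numerical error
```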

2. Conjugate family of distributions


Consider a class of prior pdfs C. Let X = (X1 , . . . , Xn ) be a random
sample from the conditional distribution of X1 given Θ = θ, where
X1|θ ∼ f(x1|θ) and Θ ∼ r(θ), for some r ∈ C. Let t(X) be a
sufficient statistic for θ. We say that the class C gives a conjugate
family of distributions for F = {fθ }θ∈∆ if the posterior pdf of Θ
given the sufficient statistic t is such that s(θ|t) ∈ C for all t and all
r ∈ C.
Exercise 5. Let C = (rα,β )α>0,β>0 , where rα,β is the pdf of the gamma
distribution with parameters α and β. Let F = (fθ)θ>0, where fθ is
the pdf of a Poisson random variable with mean θ. Show that C is a
conjugate family for F.
Solution. Let X = (X1 , . . . , Xn ) be a random sample from the condi-
tional distribution of X1 given Θ = θ, where X1 |θ ∼ fθ and Θ ∼ rα,β .
We know that t(X) = X1 + · · · + Xn is a sufficient statistic for θ and
that conditional on Θ = θ, we have that t(X) ∼ Poi(nθ); let g(t; θ) be
the pdf for t(X). We have that for t ∈ {0, 1, 2, . . .},
s(θ|t) ∝ g(t; θ)rα,β(θ)
∝ [e^(−nθ)(nθ)^t/t!] · [1/(β^α Γ(α))] θ^(α−1) e^(−θ/β)
∝ θ^(α+t−1) exp[−θ(1/β + n)].
We recognize that s(θ|t) gives the pdf of a gamma distribution with
parameters α′ = α + t and β′ = β/(nβ + 1).
Let us remark that in the notation of Exercise 5, we will sometimes
call α′ and β′ the posterior hyperparameters and α and β the prior
hyperparameters.
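Here is a small sketch of the prior-to-posterior update of Exercise 5 (the prior hyperparameters and the Poisson sample are hypothetical), with a grid check against the Gamma(α′, β′) pdf:

```python
import numpy as np
from scipy.stats import gamma, poisson

alpha, beta_ = 2.0, 1.5              # hypothetical prior hyperparameters
x = np.array([3, 1, 4, 2, 2])        # hypothetical Poisson sample
n, t = len(x), x.sum()

# Posterior hyperparameters from Exercise 5.
alpha_post = alpha + t
beta_post = beta_ / (n * beta_ + 1)

theta = np.linspace(0.001, 10, 5000)
dtheta = theta[1] - theta[0]
# g(t; theta) r(theta): Poi(n*theta) likelihood for t times the Gamma(alpha, beta) prior.
unnorm = poisson.pmf(t, n * theta) * gamma.pdf(theta, alpha, scale=beta_)
posterior = unnorm / (unnorm * dtheta).sum()

closed_form = gamma.pdf(theta, alpha_post, scale=beta_post)
print(np.max(np.abs(posterior - closed_form)))   # small numerical error
```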
Exercise 6. Fix σ > 0. Let C = (rµ0,σ0)µ0∈R,σ0>0, where rµ0,σ0 is the
pdf of a normal random variable with mean µ0 and variance σ0^2. Let
F = (fθ)θ∈R, where fθ is the pdf of a normal random variable with
mean θ and variance σ^2. Show that C is a conjugate family for F and
the posterior hyperparameters are given by
µ′ = [σ0^2/(σ0^2 + σ^2/n)] x̄ + [(σ^2/n)/(σ0^2 + σ^2/n)] µ0
and
σ′^2 = (σ0^2 · σ^2/n)/(σ0^2 + σ^2/n).
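The stated formulas can be implemented directly; the following sketch (numbers are hypothetical) also illustrates that µ′ is a weighted average of x̄ and µ0, with the weight on the data growing as the prior becomes more diffuse.

```python
def normal_posterior_hyperparameters(xbar, n, sigma, mu0, sigma0):
    """Posterior hyperparameters (mu', sigma'^2) stated in Exercise 6."""
    s2n = sigma**2 / n                              # sigma^2 / n
    w = sigma0**2 / (sigma0**2 + s2n)               # weight placed on the data
    mu_post = w * xbar + (1 - w) * mu0              # mu'
    var_post = s2n * sigma0**2 / (sigma0**2 + s2n)  # sigma'^2
    return mu_post, var_post

# A diffuse prior (large sigma0) pulls the posterior mean toward xbar;
# a concentrated prior (small sigma0) keeps it near mu0.
print(normal_posterior_hyperparameters(xbar=1.2, n=25, sigma=2.0, mu0=0.0, sigma0=10.0))
print(normal_posterior_hyperparameters(xbar=1.2, n=25, sigma=2.0, mu0=0.0, sigma0=0.01))
```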
Exercise 7. Let C = (rα,β)α>0,β>0, where rα,β is the pdf of a beta
random variable with parameters α and β. Let F = (fθ)θ∈(0,1), where fθ is
the pdf of a Bernoulli random variable with parameter θ ∈ (0, 1). Show
that C is a conjugate family for F and the posterior hyperparameters
are given by
α0 = α + t
and
β 0 = β + n − t,
where t is the sample sum.
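For comparison with Exercise 3 (which is the special case α = β = 1), the stated update can be applied in one line; the hyperparameters and data below are hypothetical.

```python
from scipy.stats import beta

alpha_prior, beta_prior = 2.0, 3.0   # hypothetical prior hyperparameters
n, t = 10, 7                         # hypothetical sample size and sample sum

alpha_post, beta_post = alpha_prior + t, beta_prior + n - t
posterior = beta(alpha_post, beta_post)   # the Beta(alpha', beta') posterior
print(posterior.mean())                   # (alpha + t) / (alpha + beta + n) = 9/15 = 0.6
```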
