
Bayesian Statistical Analysis

Chapter 1: Fundamentals of Bayesian Inference

Tang Yin-cai
yctang@stat.ecnu.edu.cn

SCHOOL OF FINANCE AND STATISTICS

March 11, 2009


1.1 The Bayesian Method and Comparison with Classical Method




Statistical inference

Statistical Inference is a problem in which data have been generated in accordance with some unknown probability distribution, which must be analyzed so that some type of inference about the unknown distribution can be made.

In other words, in a statistics problem there are two or more probability distributions which may have generated the data. By analyzing the data, we attempt
■ to learn about the unknown distribution,
■ to make some inferences about certain properties of the distribution, and
■ to determine the relative likelihood that each possible distribution is actually the correct one.



There are three approaches to Probability:
1. Axiomatic: probability by definition and properties
2. Relative Frequency: repeated trials
3. Degree of belief (subjective): a personal measure of uncertainty

We are quite familiar with the first two, and we use them quite often in decision making, especially when no information or data are available. The third is closely related to Bayesian inference, which we are going to learn.



Let's take a look at Hypothesis Testing as an example to see what classical statistical inference and Bayesian inference do, correspondingly.

Hypothesis Testing is a form of proof by statistical contradiction: evidence is gathered in favor of a theory by demonstrating that the data would be unlikely to be observed if the postulated theoretical model were false.

Why do we do it this way?



Classical Approach

According to probability theory, we would like to express our uncertainty as:

    P(Model is True | Observed Data)

However, based on our epistemological foundations, we cannot state that the model is true with a certain probability X.

Either the model is true, or it is not.



Instead, we are limited to a knowledge of:

    P(Observed Data | Model is True)

■ If P(Observed Data | Model is True) is close to one, then the data are consistent with the model, and we would not reject it as an objective interpretation of reality. Example?
■ If P(Observed Data | Model is True) is not close to one, then the data are inconsistent with the model's predictions, and we reject the model. Example?



Thus we can summarize the three-step procedure for classical hypothesis testing.

■ Step 1. Define the Research Hypothesis. A Research (or Alternative) Hypothesis is a statement, derived from theory, about what the researcher expects to find in the data.
■ Step 2. Define the Null Hypothesis. The Null Hypothesis is a statement of what you would not expect to find if your research or alternative hypothesis were consistent with reality.
■ Step 3. Conduct an analysis of the data to determine whether or not you can reject the null hypothesis with some pre-determined probability. If you can reject the null hypothesis with that probability, then the data are consistent with the model; if you cannot, then the data are not consistent with the model.



Bayesian Approach

Bayesians, in contrast, try to do the following:

■ make inferences based on all available information;
■ see how new data affect our (old) inferences;
■ identify all hypotheses (or states of nature) that may be true;
■ know what each hypothesis (or state of nature) predicts;
■ know how to update our old inferences in light of our observations.

In sum, Bayesians try to do statistics the way scientists think.
See Figure 1 for a schematic representation of Bayesian reasoning.


Figure 1: Schematic Representation of Bayesian Reasoning. [Diagram: by Theory/Creativity, a Hypothesis/Model is formed; Deduction leads from the Hypothesis/Model to Prediction; Prediction stands in epistemic relationships to Data/Observation; Induction leads from the Data back to Inference/Verification/Classification of the Hypothesis/Model.]



Bayes' Theorem/Rule

Based on conditional probability, we have Bayes' Theorem:

    p(θ|y) = p(θ)p(y|θ) / p(y)          (1.1)

p(θ) — called the Prior Distribution
■ the probability distribution for the parameters θ
■ our subjective uncertainty about the parameters before we see the data: we have some idea about what values the parameters might take



p(y) — called the marginal distribution of the data, or the Prior Predictive Distribution
■ the unconditional distribution of the data
■ a constant with respect to θ: it depends only on y

Thus, we may write Bayes' formula as

    p(θ|y) ∝ p(θ)p(y|θ).          (1.2)


p(θ|y) — called the Posterior Distribution
■ proportional to the product of the prior and the likelihood
■ combines the information from the prior with the information from the data
■ can be updated as either kind of information changes.
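The proportional form (1.2) suggests a simple recipe for a discrete set of parameter values: multiply each prior weight by the corresponding likelihood, then renormalize. A minimal sketch in Python (the candidate θ values and weights below are illustrative, not from the text):

```python
# Bayes' rule on a discrete parameter space:
# posterior ∝ prior × likelihood, then normalize so the weights sum to 1.

def posterior(prior, likelihood):
    """prior, likelihood: dicts mapping each theta to p(theta) and p(y | theta)."""
    unnorm = {t: prior[t] * likelihood[t] for t in prior}
    p_y = sum(unnorm.values())              # marginal p(y), the normalizing constant
    return {t: v / p_y for t, v in unnorm.items()}

# Illustrative example: two equally likely hypotheses for theta.
prior = {0.2: 0.5, 0.5: 0.5}
lik = {0.2: 0.10, 0.5: 0.30}                # p(y | theta) for the observed y
post = posterior(prior, lik)
print(post)                                  # theta = 0.5 is now 3 times as probable
```

Note that p(y) drops out of the ratio of posterior probabilities, which is why only the proportional form is needed.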



The key difference between C & B

Deduction and induction are two facets of reasoning.
■ We deduce outcomes from a hypothesis: "If A then B", that is, if (hypothesis) A is true, then B can be concluded (observed).
■ We infer a hypothesis from outcomes by induction.

Figure 2 shows one possible situation which may happen.


Figure 2: An example of Bayesian reasoning. [Diagram: hypothesis A connected to possible outcomes B, C, D, and E.]

If A is true, then we are likely to observe B, C or D. B and C are now observed. Therefore, A is supported!


The key difference between classical and Bayesian reasoning is that the Bayesian believes that knowledge is subjective. Consequently, the Bayesian rejects the epistemological foundation that there exists a "true" data-generating process that can be revealed through a process of elimination.



Subjectivity and objectivity

■ All statistical methods that use probability are subjective in the sense of relying on mathematical idealizations of the world.
■ Bayesian methods are sometimes said to be especially subjective because of their reliance on a prior distribution, but in most problems scientific judgment is necessary to specify both the 'likelihood' and the 'prior' parts of the model.
■ A general principle is at work here: whenever there is replication, in the sense of many exchangeable units observed, there is scope for estimating features of a probability distribution from data and thus making the analysis more 'objective.'


Problems with Classical Statistical Inference


Problem A: p-value and Hypothesis Testing

Example:
■ Background and Data Information:
◆ The staff of Slater School was concerned that their high cancer rate could be due to two nearby high-voltage transmission lines.
◆ There were 8 cases of invasive cancer over a long time among 145 staff members.
◆ Based on the national cancer rate among women of this age, the expected number of cancers is 4.2 (a rate of approximately 3/100).


■ Assumption — independence:
The 145 staff members developed cancer independently of each other, and the chance of cancer, θ, was the same for each staff person. Therefore, the number of cancers, Y, follows a binomial distribution: Y | θ ∼ Bin(145, θ).



The Question

The classical hypothesis test is

    H0 : θ = 0.03   vs.   H1 : θ > 0.03

Instead of answering this question directly, we answer the following question: how well does each of four simplified competing theories explain the data?

    Theory A1 : θ = 0.03
    Theory A2 : θ = 0.04
    Theory A3 : θ = 0.05
    Theory A4 : θ = 0.06



The Likelihood of Theories A1-A4

For each hypothesized θ, from Bin(145, θ), we have

    Pr(Y = 8 | θ) = C(145, 8) θ^8 (1 − θ)^137.          (1.3)

    Theory A1 : Pr(Y = 8 | θ = 0.03) ≈ 0.036
    Theory A2 : Pr(Y = 8 | θ = 0.04) ≈ 0.096
    Theory A3 : Pr(Y = 8 | θ = 0.05) ≈ 0.134
    Theory A4 : Pr(Y = 8 | θ = 0.06) ≈ 0.136

This is a ratio of approximately 1 : 3 : 4 : 4. So Theory A2 explains the data about 3 times as well as Theory A1.
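Equation (1.3) can be evaluated directly with Python's exact integer binomial coefficient; the computed values are close to, though not identical with, the rounded figures quoted above:

```python
from math import comb

# Pr(Y = y | theta) under Bin(145, theta), as in Equation (1.3)
def binom_lik(theta, n=145, y=8):
    return comb(n, y) * theta**y * (1 - theta)**(n - y)

for theta in (0.03, 0.04, 0.05, 0.06):
    print(f"Pr(Y = 8 | theta = {theta}) = {binom_lik(theta):.3f}")
# Exact arithmetic gives roughly 0.040, 0.097, 0.138, 0.139.
```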



From the likelihood principle we see that, once Y = 8 has been observed,

    p(y|θ) = Pr(Y = y|θ)

describes how well each theory, or value of θ, explains the data. No other value of Y is relevant.

The likelihood principle is central to Bayesian reasoning.



Bayesian Analysis

There are other sources of information about whether cancer can be induced by proximity to high-voltage transmission lines.
■ Pro: Some epidemiologists show positive correlations between cancer and proximity.
■ Con: Other epidemiologists do not find these correlations, and physicists and biologists maintain that the energy in the magnetic fields associated with high-voltage power lines is too small to have an appreciable biological effect.



Suppose we judge the pro and con sources equally reliable. Then Theory A1 (no effect) is as likely as Theories A2, A3, and A4 together, and we judge Theories A2, A3, and A4 to be equally likely among themselves. So,

    Pr(A1) ≈ 0.5 ≈ Pr(A2) + Pr(A3) + Pr(A4),

    Pr(A2) ≈ Pr(A3) ≈ Pr(A4) ≈ 1/6.

These quantities will represent our prior beliefs.


Based on Bayes' Theorem and the assumptions about the four theories, we have

    Pr(A1|Y = 8) = Pr(A1) Pr(Y = 8|A1) / Pr(Y = 8)
                 = Pr(A1) Pr(Y = 8|A1) / Σ_{i=1}^{4} Pr(Ai) Pr(Y = 8|Ai)
                 = (1/2 × 0.036) / (1/2 × 0.036 + 1/6 × 0.096 + 1/6 × 0.134 + 1/6 × 0.136)
                 = 0.23

    Pr(A2|Y = 8) = 0.21
    Pr(A3|Y = 8) = 0.28
    Pr(A4|Y = 8) = 0.28.


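The same update can be carried out numerically. A sketch in Python, using the prior weights and the rounded likelihoods quoted earlier (with these rounded inputs, the last digits can differ slightly from the values above):

```python
# Posterior over the four theories: Pr(Ai | Y = 8) ∝ Pr(Ai) × Pr(Y = 8 | Ai).
prior = {"A1": 1/2, "A2": 1/6, "A3": 1/6, "A4": 1/6}
lik   = {"A1": 0.036, "A2": 0.096, "A3": 0.134, "A4": 0.136}  # rounded, from Eq. (1.3)

unnorm = {a: prior[a] * lik[a] for a in prior}
p_y = sum(unnorm.values())                 # Pr(Y = 8), the normalizing constant
post = {a: v / p_y for a, v in unnorm.items()}

for a, p in post.items():
    print(f"Pr({a} | Y = 8) = {p:.2f}")
print(f"Pr(theta > 0.03 | Y = 8) = {1 - post['A1']:.2f}")   # approximately 0.77
```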


Accordingly, we see that each of these four theories is roughly equally likely, and the odds are about 3:1 that the cancer rate at Slater is greater than 0.03.

Therefore, the Bayesian analysis reveals that

    Pr(θ > 0.03|Y = 8) = 0.77,

which would not be sufficient to reject the null hypothesis H0 : θ = 0.03.



Non-Bayesian Analysis

    H0 : θ = 0.03   vs.   H1 : θ > 0.03

p-value of classical statisticians: the probability, under H0, of observing an outcome at least as extreme as that actually observed.

For the Slater problem, we find:

    p-value = Pr(Y = 8|θ = 0.03)
            + Pr(Y = 9|θ = 0.03)
            + Pr(Y = 10|θ = 0.03)
            + · · · + Pr(Y = 145|θ = 0.03)
            ≈ 0.07.
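This tail sum is straightforward to evaluate directly. A sketch in Python, summing the binomial upper tail:

```python
from math import comb

# Binomial pmf under the null: Pr(Y = y | theta = 0.03), Y ~ Bin(145, 0.03)
def binom_pmf(y, n=145, theta=0.03):
    return comb(n, y) * theta**y * (1 - theta)**(n - y)

# p-value = Pr(Y >= 8 | theta = 0.03): sum the pmf over y = 8, ..., 145
p_value = sum(binom_pmf(y) for y in range(8, 146))
print(f"p-value = {p_value:.3f}")   # approximately 0.07
```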


Thus, under a classical hypothesis test (at the significance level α = 0.10), we reject the null hypothesis of no effect from the power lines at Slater.



Critique of p-values

Bayesians claim that the p-value should not be used to compare hypotheses because:
■ hypotheses should be compared by how well they explain the data;
■ the p-value does not account for how well the alternative hypotheses explain the data;
■ the p-value summands are irrelevant, because they don't describe how well any hypothesis explains the observed data.

In short, the p-value does not obey the likelihood principle, because it uses Pr(Y = y|θ) for values of y other than the observed value y = 8. The same is true of all classical hypothesis tests and confidence intervals.


Problem B: Confidence intervals

A 100p% (frequentist) confidence interval (CI) for a parameter θ is an interval constructed according to a specific method (for example, the maximum likelihood method), such that if we were to repeat the experiment numerous times, with a new set of observational data (with different random errors) for each experiment, then 100p% of the confidence intervals we construct using this method would contain the true (fixed) value of θ, whatever that is.


An Example

For example, a random sample survey of American adults may indicate that mean income in the United States is $35,000. Assuming (rather implausibly) that income is normally distributed, we could estimate a 90% confidence interval for our sample mean, perhaps [$15,000, $55,000] for a modestly sized sample. Using conventional frequentist inference, we can conclude that intervals like the one calculated would cover the true (population) mean income 90% of the time over repeated applications of the sampling procedure.
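The coverage claim can be illustrated by simulation: construct the 90% interval over many repeated samples and count how often it contains the true mean. A sketch in Python (the population parameters below are invented for illustration):

```python
import random
import statistics

random.seed(0)
TRUE_MEAN, SD, N, Z90 = 35_000, 12_000, 50, 1.645  # hypothetical population

covered = 0
TRIALS = 2_000
for _ in range(TRIALS):
    sample = [random.gauss(TRUE_MEAN, SD) for _ in range(N)]
    xbar = statistics.mean(sample)
    half = Z90 * SD / N**0.5          # known-sigma 90% z-interval half-width
    if xbar - half <= TRUE_MEAN <= xbar + half:
        covered += 1

print(f"coverage = {covered / TRIALS:.2f}")   # close to 0.90
```

The probability statement attaches to the procedure across repetitions, not to any one computed interval.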



Questions

The questions about the frequentist CI arise:
■ What about non-repeatable data? That is, there is no data-generating process (DGP) creating data sets for us, just a single set of data. How can we apply frequentist procedures?
■ What about asymmetric distributions? For example, for a skewed Be(1, 3) sampling distribution of the sample mean, what is the CI for the population mean? The mode may not be included in the CI. This seems implausible!
■ What about multimodal distributions? How can the two modes lie in the middle of the CI?



Explanation

One reason: different definitions of "probability".
■ The frequentist definition: probability is the long-run expected frequency of occurrence,

    P(A) = n/N,

where n is the number of times event A occurs in N opportunities.
■ The Bayesian view: probability is related to degree of belief. It is a measure of the plausibility of an event given incomplete knowledge.


■ Thus a frequentist believes that a population
mean is real, unknown, and can only be
estimated from the data.
■ Knowing the distribution for the sample mean, he
constructs a confidence interval, centered at the
sample mean.
■ Tricky: Either the true mean is in the interval or
it is not.



■ So the frequentist can’t say there’s a 95%
  probabilityᵃ that the true mean is in this interval,
  because it’s either already in, or it’s not. And
  that’s because, to a frequentist, the true mean,
  being a single fixed value, doesn’t have a
  distribution.
■ The sample mean does. Thus the frequentist
  must use circumlocutions like "95% of
  similar intervals would contain the true mean, if
  each interval were constructed from a different
  random sample like this one."
a. "probability" = long-run fraction having this characteristic.


[Figure 3: Confidence intervals for population mean — 20 confidence
intervals based on the z distribution (x-axis: confidence interval;
y-axis: index).]
Bayesian’s Point of View

Bayesians have an altogether different world-view.
They say that only the data are real. The
population mean is an abstraction, and as such
some values are more believable than others
based on the data and their prior beliefs.
(Sometimes the prior belief is very non-informative,
however.) The Bayesian constructs a credible
interval, centered near the sample mean, but
tempered by "prior" beliefs concerning the mean.


A credible interval (which can also be abbreviated
to CI: how confusing) is an inherently Bayesian
concept: it is an interval such that the parameter is
believed to lie in the interval with probability p.
Fundamentally, the belief (probability) attaches to
the person who makes the statement, rather than
the parameter itself - in other words, it is subjective.



Now the Bayesian can say what the frequentist
cannot: "There is a 95% probabilityᵃ that this
interval contains the mean."ᵇ
a. "probability" = degree of believability.
b. A frequentist is a person whose long-run ambition is to be wrong 5% of the time. A
Bayesian is one who, vaguely expecting a horse, and catching a glimpse of a donkey, strongly
believes he has seen a mule.


The Late Bus Example: What we are concerned
with is the probability that the school bus will be
late in the morning. We observed n mornings and
found that the school bus was late y times. Then y
follows the binomial distribution Bin(n, θ), where θ
is the probability that the school bus is late:

      Pr(Y = y|θ) = C(n, y) θ^y (1 − θ)^(n−y),

where C(n, y) = n!/(y!(n − y)!) is the binomial
coefficient.


Let n = 10, y = 3 and assume that we have no
information about θ. That is, we choose the uniform
distribution on (0, 1) as the prior for θ, called the
noninformative prior of θ. In later subsections
we will discuss the example in a general way
where Beta distributions Be(α, β) will be used as
informative priors for θ. Here the uniform prior is
a special Beta distribution, Be(1, 1). Thus we have
from (1.2) that the posterior distribution is the Beta
distribution Be(4, 8). Its mean, median and mode
are 0.33, 0.32 and 0.30 respectively. Thus the 95%
symmetric credible interval is (0.11, 0.61).ᵃ
a. In R, we get the quantiles: qbeta(0.025, 4, 8) = 0.11, qbeta(0.975, 4, 8) = 0.61.
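The posterior summaries above can be checked by simulation rather than by R's qbeta. A minimal sketch in Python (used here only for illustration; it draws from Be(4, 8) with the standard library's random.betavariate and reads the quantiles off the sorted draws):

```python
import random

random.seed(0)

# Posterior for the late-bus example: Be(1 + y, 1 + n - y) = Be(4, 8)
alpha, beta = 4, 8

# Monte Carlo draws from the posterior, sorted for quantile lookup
draws = sorted(random.betavariate(alpha, beta) for _ in range(100_000))

mean = sum(draws) / len(draws)
median = draws[len(draws) // 2]
lo = draws[int(0.025 * len(draws))]    # ≈ qbeta(0.025, 4, 8)
hi = draws[int(0.975 * len(draws))]    # ≈ qbeta(0.975, 4, 8)

print(round(mean, 2), round(median, 2), round(lo, 2), round(hi, 2))
# ≈ 0.33 0.32 0.11 0.61
```

With 100,000 draws the Monte Carlo error is well below the two decimal places reported on the slide.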


Criticisms of the Bayesian approach

■ The results are subjective. With only a few
  observations, the parameter estimates may be
  sensitive to the choice of priors. (See the Slater
  School case.)
■ Bayesian reply: Bayesians use diffuse
  priors, sensitivity analysis, etc. to mitigate
  the influence of priors on their results.
■ The Bayesian analysis is philosophically
  unsound. Bayesians treat θ as a random variable
  whereas classical analysis treats θ as a fixed, but
  unknown, constant.
■ Bayesian reply: Treating θ as random does not
  necessarily mean that θ is random; rather, it
  expresses our uncertainty/knowledge about θ.


Advantages of Bayesian statistics

There has been a big explosion of Bayesian statistics
over the past 20 years. Approximately 30% of
papers in top statistical reviews are about
Bayesian statistics. Among the top 10 most cited
mathematicians over the last 10 years, 5 are
Bayesian statisticians! Over the last 5 years, 4
COPSS Medals were awarded to Bayesian
statisticians. Bayesian inference has been widely
used because of advantages which classical
statistics may lack.


■ The Bayesian approach is very well adapted to
  many application areas:
  bioinformatics, genetics, epidemiology,
  econometrics, machine learning, spatial
  statistics, clinical trials, survival analysis,
  computer modelling, nuclear magnetic
  resonance, etc.
■ It allows one to incorporate in a principled way any
  prior information available on a given problem.
■ It is honest and makes clear that any analysis
  relies in part on subjectivity.
■ Knowledge synthesis: it formalizes the process of
  learning from data to update beliefs.


■ It is a simple framework, much simpler than
  "standard" approaches. Nevertheless, it is richer
  than the classical approach in modelling, with
  fewer assumptions and less (irrelevant) math too.
■ Classical methods are often special cases of
  Bayesian methods: for instance, basic
  hypothesis testing and estimation, design and
  sample-size computations, linear and non-linear
  regression, non-parametric statistics, etc. It gives
  a direct interpretation of confidence intervals and
  p-values, which is not easy via the classical
  approach.


■ It is straightforward to handle missing data,
  outliers, censored data, sparse data sets, etc.
■ It provides comprehensive and robust estimation
  of models that cannot be fitted otherwise:
  multilevel models, nested random effects, etc.


1.2 Introduction to Bayesian Statistics



Overview



Process of Bayesian Inference

Bayesian inference is the process of fitting a
probability model to a set of data and summarizing
the result by a probability distribution on the
parameters of the model and on unobserved
quantities such as predictions for new
observations.


The process of Bayesian data analysis:
■ Setting up a full probability model: a joint
  probability distribution for all observable and
  unobservable quantities in a problem.
■ Conditioning on observed data: calculating and
  interpreting the appropriate posterior distribution.
■ Evaluating the fit of the model and the
  implications of the resulting posterior
  distribution:
  ◆ Does the model fit the data?
  ◆ Are the substantive conclusions reasonable?
  ◆ How sensitive are the results to the modeling
    assumptions in step 1?


General notation for statistical
inference



Two kinds of estimands

Two kinds of estimands (unobserved quantities for
which statistical inferences are made):

1) potentially observable quantities, such as future
   observations of a process;
2) quantities that are not directly observable, that is,
   parameters that govern the hypothetical process
   leading to the observed data (for example,
   regression coefficients).

The distinction between these two kinds of
estimands is not always precise, but it is generally
useful as a way of understanding how a statistical
model for a particular problem fits into the real
world.


Notations

Notations (they can be scalars or vectors):

■ θ — unobservable quantities or population
  parameters of interest;
■ y = (y1, . . . , yn) — the observed data;
■ ỹ — unknown, but potentially observable,
  quantities;
■ When using matrix notation, we consider vectors
  as column vectors. For example, if u is a vector
  with n components, then uᵀu is a scalar and uuᵀ
  an n × n matrix.


Exchangeability

The n values yi may be regarded as
exchangeable, meaning that the joint probability
density p(y1, . . . , yn) should be invariant to
permutations of the indexes.

■ Generally, it is useful and appropriate to model
  data from an exchangeable distribution as
  independently and identically distributed (iid)
  given some unknown parameter vector θ with
  distribution p(θ).


Bayesian inference



Bayesian Inference

■ Bayesian statistical conclusions about a
  parameter θ, or unobserved data ỹ, are made in
  terms of probability statements.
■ These probability statements are conditional on
  the observed value of y, and in our notation are
  written simply as p(θ|y) or p(ỹ|y).
■ We also implicitly condition on the known values
  of any covariates, x.


Bayes’ rule

■ Prior distribution p(θ)
■ Sampling/data distribution p(y|θ)
■ Joint distribution p(θ, y) = p(θ)p(y|θ)
■ Posterior distribution

      p(θ|y) = p(θ, y)/p(y) = p(θ)p(y|θ)/p(y),          (1.4)

  where
  ◆ p(y) = Σθ p(θ)p(y|θ) for discrete θ, or
  ◆ p(y) = ∫θ p(θ)p(y|θ)dθ for continuous θ
■ or (unnormalized posterior)

      p(θ|y) ∝ p(θ)p(y|θ).                              (1.5)
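For a discrete θ, (1.4) is just elementwise multiplication of prior and likelihood followed by normalization by p(y). A minimal Python sketch (the three candidate θ values and the uniform prior are made up for illustration, reusing the late-bus data y = 3, n = 10):

```python
# Discrete Bayes' rule: posterior = prior * likelihood / p(y), as in (1.4).
from math import comb

thetas = [0.2, 0.3, 0.5]          # hypothetical candidate values of theta
prior  = [1/3, 1/3, 1/3]          # uniform prior p(theta)

n, y = 10, 3
lik = [comb(n, y) * t**y * (1 - t)**(n - y) for t in thetas]  # p(y|theta)

p_y = sum(p * l for p, l in zip(prior, lik))                  # marginal p(y)
posterior = [p * l / p_y for p, l in zip(prior, lik)]         # p(theta|y)

print([round(p, 3) for p in posterior])   # → [0.344, 0.456, 0.2]
```

As expected with y/n = 0.3, the value θ = 0.3 receives the most posterior mass, and the posterior probabilities sum to 1.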


Prediction

■ Before the data y are considered, the distribution
  of the unknown but observable y is

      p(y) = ∫ p(y, θ)dθ = ∫ p(θ)p(y|θ)dθ.             (1.6)

■ This is called
  ◆ the marginal distribution of y, or
  ◆ the prior predictive distribution.
■ Why prior: because it is not conditional on a
  previous observation of the process.
■ Why predictive: because it is the distribution for a
  quantity that is observable.
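Equation (1.6) can be approximated by simulation: draw θ from the prior, then y given θ. A Python sketch for the binomial model with the uniform Be(1, 1) prior (in this conjugate case the prior predictive is known to be uniform over {0, . . . , n}, which the simulation should reproduce; sample sizes here are our choice):

```python
import random
from collections import Counter

random.seed(1)
n = 10
draws = 100_000

counts = Counter()
for _ in range(draws):
    theta = random.random()                              # theta ~ Be(1,1) = U(0,1)
    y = sum(random.random() < theta for _ in range(n))   # y | theta ~ Bin(n, theta)
    counts[y] += 1

# Prior predictive p(y): for the uniform prior this is 1/(n+1) for every y.
probs = [counts[y] / draws for y in range(n + 1)]
print([round(p, 2) for p in probs])   # each entry ≈ 1/11 ≈ 0.09
```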


Posterior predictive distribution

After the data y have been observed, we can
predict an unknown observable ỹ from the same
process:

      p(ỹ|y) = ∫ p(ỹ, θ|y)dθ = ∫ p(ỹ|θ, y)p(θ|y)dθ = ∫ p(ỹ|θ)p(θ|y)dθ.

■ The posterior predictive distribution is an
  average of conditional predictions over the
  posterior distribution of θ.
■ The last step holds because y and ỹ are
  conditionally independent given θ.
■ Why posterior: conditional on the observed y.
■ Why predictive: a prediction for an observable ỹ.


Likelihood

■ Using Bayes’ rule with a chosen probability
  model means that the data y affect the posterior
  inference (1.5) only through the function p(y|θ) —
  the likelihood function (when regarded as a
  function of θ for fixed y).
■ In this way Bayesian inference obeys what is
  sometimes called the likelihood principle, which
  states that for a given sample of data, any two
  probability models p(y|θ) that have the same
  likelihood function yield the same inference for θ.


Likelihood and odds ratios

■ The posterior odds (ratio) for θ1 compared to θ2:

      p(θ1|y)/p(θ2|y) = [p(θ1)p(y|θ1)/p(y)] / [p(θ2)p(y|θ2)/p(y)]
                      = [p(θ1)p(y|θ1)] / [p(θ2)p(y|θ2)].        (1.7)

■ Odds have the attractive property that Bayes’
  rule takes a particularly simple form: in words,
  the posterior odds are equal to the prior odds
  multiplied by the likelihood ratio, p(y|θ1)/p(y|θ2).
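Equation (1.7) in code: posterior odds = prior odds × likelihood ratio. A Python sketch with two hypothetical values of θ under the binomial model (all numbers are made up for illustration):

```python
# Posterior odds = prior odds * likelihood ratio, as in (1.7).
from math import comb

n, y = 10, 3
theta1, theta2 = 0.3, 0.5
prior1, prior2 = 0.5, 0.5          # equal prior mass on the two values

def lik(theta):
    """Binomial likelihood p(y | theta)."""
    return comb(n, y) * theta**y * (1 - theta)**(n - y)

prior_odds = prior1 / prior2
likelihood_ratio = lik(theta1) / lik(theta2)
posterior_odds = prior_odds * likelihood_ratio

print(round(posterior_odds, 3))    # → 2.277
```

With equal prior odds, the posterior odds reduce to the likelihood ratio: the data y = 3 out of n = 10 favor θ1 = 0.3 over θ2 = 0.5 by a factor of about 2.3.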


Computation and software

■ We will rely primarily on the statistical package R
  for graphs, basic simulations, fitting of
  classical simple models (including regression, ...),
  optimization, and some simple programming.
■ We use WinBUGS within R (see Appendix C) as a
  first try for fitting most models.
■ Other related software:
  ◆ First Bayes: http://www.tonyohagan.co.uk/1b/
  ◆ BACC for R/S-plus/Matlab:
    http://www.econ.umn.edu/~bacc
  ◆ MCMCpack: R package (V0.7-3)
  ◆ coda: R package


Specific computational tasks that arise in Bayesian
data analysis include:
■ Vector and matrix manipulations (see Table 1.1)
■ Computing probability density functions (see
  Appendix A)
■ Drawing simulations from probability distributions
■ Structured programming (including looping and
  customized functions)
■ Calculating the linear regression estimate and
  variance matrix
■ Graphics, including scatterplots with overlain
  lines and multiple graphs per page


Our general approach to computation is to fit many
models, gradually increasing the complexity (see
Appendix C for a simple example). Appendix C
illustrates how to perform computations in R and
BUGS in several different ways for a single
example.


1.3 Simple Examples



The Bayes’ Theorem/rule revisited



The Bayes’ Theorem

■ The central idea and goal of the applied Bayesian
  paradigm is to investigate how to combine
  information, and how the model changes, when
  new information from different sources (’data’) is
  received.


The Bayes’ Theorem

■ This is done through Bayes’ rule:

      p(θ|y) = p(θ)p(y|θ)/p(y) ∝ p(θ)p(y|θ),

■ where
  ◆ θ is the parameter of interest;
  ◆ y is the observed data;
  ◆ p(y|θ) is the probability of y given θ (the likelihood);
  ◆ p(θ) is the prior (distribution), the initial
    distribution for θ;
  ◆ p(θ|y) is the posterior distribution for θ, given
    the data y;
  ◆ p(y) is the marginal distribution, the total
    probability of the given data y.


■ Suppose ỹ ∼ p(ỹ|θ) is to be observed. The
  (posterior) predictive distribution of ỹ, given
  observed data y, can be obtained from the
  posterior distribution:

      p(ỹ|y) = ∫ p(ỹ|θ)p(θ|y)dθ.

■ Cf.: the marginal distribution p(y) is sometimes
  called the prior predictive distribution.
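This integral rarely needs to be evaluated analytically; simulation suffices: draw θ from the posterior, then ỹ given θ. A Python sketch for the late-bus posterior Be(4, 8) and a new run of 10 mornings (the new sample size and the use of random.betavariate are our choices for illustration):

```python
import random
from collections import Counter

random.seed(2)

alpha, beta, n_new = 4, 8, 10      # Be(4, 8) posterior; 10 future mornings
draws = 100_000

counts = Counter()
for _ in range(draws):
    theta = random.betavariate(alpha, beta)                      # theta ~ p(theta|y)
    y_new = sum(random.random() < theta for _ in range(n_new))   # y~ | theta
    counts[y_new] += 1

# Posterior predictive mean: E[y~|y] = n_new * E[theta|y] = 10 * 4/12
pred_mean = sum(y * c for y, c in counts.items()) / draws
print(round(pred_mean, 2))   # ≈ 3.33
```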


Types of priors

In applied Bayesian inference, we have three kinds
of priors for the parameter.
1. Uninformative prior:
   ■ uniform, as wide as possible
   ■ sometimes called a flat prior
   ■ problem: often difficult to define
2. Informative prior:
   ■ not uniform
   ■ assumes we have some prior knowledge
3. Conjugate prior:
   ■ prior and posterior have the same
     distributional form
   ■ often makes the maths easier


Since Bayesian inference is virtually determined by
the prior of the parameter θ and the likelihood
p(y|θ), we often write the Bayesian model as

Y |θ ∼ p(y|θ) and θ ∼ p(θ).



Example 1: θ takes two possible values



θ takes two possible values

Assume a DNA trace is found at a crime scene.
Assume the trace is run through a database of
10,000,000 citizens, and a single match is found.
What is the probability of guilt?

Let θ = 1 denote ’guilty’ and θ = 0 not. Then
p(θ = 1) = 10⁻⁷. We also assume
p(match|θ = 1) ≈ 1 and p(match|θ = 0) = 10⁻⁶.

From Bayes’ Theorem, we have

      p(θ = 1|match) = (1 × 10⁻⁷) / (1 × 10⁻⁷ + 10⁻⁶ × 1) ≈ 0.09.
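The arithmetic, as a check in Python (a pure base-rate calculation; the 10⁻⁶ false-match rate is the example's assumption, and the code keeps the exact 1 − 10⁻⁷ factor that the slide approximates by 1):

```python
# Bayes' theorem for the DNA-match example.
prior_guilty = 1e-7          # p(theta = 1): one citizen out of 10,000,000
p_match_guilty = 1.0         # p(match | theta = 1), assumed ~1
p_match_innocent = 1e-6      # p(match | theta = 0), the false-match rate

p_match = (p_match_guilty * prior_guilty
           + p_match_innocent * (1 - prior_guilty))    # total probability
posterior_guilty = p_match_guilty * prior_guilty / p_match

print(round(posterior_guilty, 3))   # → 0.091
```

Despite the seemingly damning single match, the posterior probability of guilt is only about 1/11, because innocent matches among 10,000,000 people are roughly ten times as likely as the guilty one.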


Example 2: Binomial data with beta
prior



Binomial data with beta prior

Suppose that

SCHOOL OF FINANCE AND S TAT I S T I C S

March 11, 2009 Chapter 1 - p. 79/??


Binomial data with beta prior

Suppose that
■ the likelihood (model) for y given θ is binomial

Bin(n, θ), i.e.,


 
n y
p(y|θ) = θ (1 − y)n−y ,
y

SCHOOL OF FINANCE AND S TAT I S T I C S

March 11, 2009 Chapter 1 - p. 79/??


Binomial data with beta prior

Suppose that
■ the likelihood (model) for y given θ is binomial

Bin(n, θ), i.e.,


 
n y
p(y|θ) = θ (1 − y)n−y ,
y

■ and the prior is beta Be(α, β), where the


hyperparameters α and β are known,
1
p(θ) = θα−1 (1 − θ)β−1 , 0 ≤ θ ≤ 1.
B(α, β)

SCHOOL OF FINANCE AND S TAT I S T I C S

March 11, 2009 Chapter 1 - p. 79/??


Binomial data with beta prior

Suppose that
■ the likelihood (model) for y given θ is binomial

Bin(n, θ), i.e.,


 
p(y|θ) = C(n, y) θ^y (1 − θ)^{n−y},  y = 0, 1, ..., n,

■ and the prior is beta Be(α, β), where the


hyperparameters α and β are known,
p(θ) = θ^{α−1} (1 − θ)^{β−1} / B(α, β),  0 ≤ θ ≤ 1.

■ Find the 1) joint, 2) marginal, 3) posterior, and 4) predictive distributions.


■ The results are as follows:
p(y, θ) = C(n, y) θ^{α+y−1} (1 − θ)^{n−y+β−1} / B(α, β),

p(y) = C(n, y) B(y + α, n − y + β) / B(α, β),  y = 0, 1, ..., n,

p(θ|y) = θ^{α+y−1} (1 − θ)^{n−y+β−1} / B(y + α, n − y + β),  0 ≤ θ ≤ 1,

p(ỹ|y) = C(n, ỹ) B(y + ỹ + α, 2n − y − ỹ + β) / B(y + α, n − y + β),  ỹ = 0, 1, ..., n.
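These distributions are easy to evaluate numerically. A minimal sketch using only the standard library (`log_beta` and `marginal_pmf` are illustrative helper names, not from any package), on the Late Bus data y = 3, n = 10 with a uniform prior:

```python
import math

def log_beta(a, b):
    # log B(a, b) computed via log-gamma for numerical stability
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def marginal_pmf(y, n, alpha, beta):
    # Marginal p(y) = C(n, y) B(y + alpha, n - y + beta) / B(alpha, beta)
    log_c = math.lgamma(n + 1) - math.lgamma(y + 1) - math.lgamma(n - y + 1)
    return math.exp(log_c + log_beta(y + alpha, n - y + beta) - log_beta(alpha, beta))

n, y, alpha, beta = 10, 3, 1.0, 1.0            # Late Bus data, uniform Be(1, 1) prior
post_mean = (alpha + y) / (alpha + beta + n)   # mean of the posterior Be(alpha + y, beta + n - y)
total = sum(marginal_pmf(k, n, alpha, beta) for k in range(n + 1))
print(post_mean, total)                        # the marginal pmf sums to 1
```

With α = β = 1 the marginal p(y) works out to the discrete uniform 1/(n + 1) for every y, a handy sanity check.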




■ Note that the posterior of θ, Be(α + y, n − y + β),
has the same form as its prior, a beta distribution.
■ Priors which have the same form as the
posteriors are called conjugate priors.
■ Thus the beta distribution is the conjugate prior for
the proportion θ.
■ If α = 1, β = 1, the prior becomes the uniform
prior.




Sources of influence on the posterior: Now let us
take a numerical example to see how
■ different priors,

■ different data, and

■ new coming data

bring changes to the posterior.




The influence of different priors

The shape of the beta distribution Be(α, β) is


determined by both hyperparameters α and β. The
expectation, mode and variance are
E(θ) = α / (α + β),
M(θ) = (α − 1) / (α + β − 2),
Var(θ) = αβ / ((α + β)^2 (α + β + 1)).
And we can also get the median and different
quantiles. Similar quantities can be obtained from
the posterior beta distributions.
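These summaries are quick to compute; a minimal sketch (`beta_summary` is an illustrative helper name, and the Be(4, 8) example assumes the Late Bus data y = 3, n = 10 with a uniform prior):

```python
def beta_summary(alpha, beta):
    # Mean, mode and variance of Be(alpha, beta); the mode needs alpha, beta > 1
    mean = alpha / (alpha + beta)
    mode = (alpha - 1) / (alpha + beta - 2) if alpha > 1 and beta > 1 else None
    var = alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1))
    return mean, mode, var

# Posterior Be(4, 8): a uniform Be(1, 1) prior updated with y = 3, n = 10
mean, mode, var = beta_summary(4, 8)
print(mean, mode, var)   # 1/3, 0.3, 2/117 (about 0.0171)
```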




Thus we have
■ When both α and β increase, the variance gets
smaller.
■ When β increases, the distribution shifts toward 0
(the mean decreases), and when α increases, it
shifts toward 1 (the mean increases).
For the Late Bus Example, suppose we observe
y = 3 late buses in two weeks (n = 10 days). The
figure below shows 9 priors for θ with (α, β) = (0.5, 0.5),
(0.5, 1.0), (0.5, 1.5), (1.0, 0.5), (1.0, 1.0),
(1.0, 1.5), (1.5, 0.5), (1.5, 1.0), (1.5, 1.5), and the
corresponding overlapping posteriors.



[Figure: the nine priors Be(α, β) listed above, each shown with its overlapping posterior; the posterior means range from 0.29 (α = 0.5, β = 1.5) to 0.38 (α = 1.5, β = 0.5), and the posterior modes from 0.25 to 0.35.]



The influence of different data

■ The result of the Bayesian inference is also


affected by the data or its distribution (likelihood).

■ For the Late Bus Example, we only consider the
case under the flat prior: p(θ) = Be(1, 1) ∝ 1, the
uniform distribution.
■ Figure 5 shows the posterior beta distributions
Be(y + 1, n − y + 1), where n = 5 (one week) and
y = 0, 1, 2, 3, 4, 5.
■ Figure 6 shows the two posteriors for
n = 5, y = 1 and n = 10, y = 3. We see that the
shapes of the posteriors differ a lot.



Figure 5: Posterior distributions Be(α + y, n − y + β) for n = 5 and y = 0, 1, 2, 3, 4, 5, with the flat prior shown for reference.


Figure 6: Posterior distributions Be(α + y, n − y + β) for n = 5, y = 1 and n = 10, y = 3, with the flat prior shown for reference.
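The contrast between the two data sets in Figure 6 can be checked numerically; a minimal sketch under the flat Be(1, 1) prior (`posterior_summary` is an illustrative helper name):

```python
def posterior_summary(y, n):
    # Mean and variance of the posterior Be(y + 1, n - y + 1) under a flat prior
    a, b = y + 1, n - y + 1
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, var

for n, y in [(5, 1), (10, 3)]:
    mean, var = posterior_summary(y, n)
    print(n, y, round(mean, 3), round(var, 4))
```

The larger sample (n = 10, y = 3) gives a noticeably smaller posterior variance: more data concentrates the posterior.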



The influence of new coming data

■ Suppose we observe some data y1. Then, from
Bayes' rule, we get the posterior distribution
p(θ|y1) ∝ p(y1|θ) × p(θ).

■ Later we observe some more data y2 . If it is


independent of the first data set y1 , then
p(y1 and y2 |θ) = p(y1 |θ) × p(y2 |θ).




Hence, from Bayes' rule, we have
p(θ|y1, y2) ∝ p(θ) × p(y1|θ) × p(y2|θ)
∝ p(θ|y1) × p(y2|θ).
That is, we use the first posterior as the prior for
the second posterior. The resulting posterior can
then be used as a new prior distribution which can
be updated with further data.
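For the beta-binomial model this updating is just the addition of counts; a minimal sketch reusing the Late Bus data sets from the figures above (`update_beta` is an illustrative helper name):

```python
def update_beta(alpha, beta, y, n):
    # One conjugate update: Be(alpha, beta) prior + y successes in n Binomial trials
    return alpha + y, beta + n - y

# Sequential updating: yesterday's posterior is today's prior
a, b = 1.0, 1.0                        # flat Be(1, 1) prior
a, b = update_beta(a, b, y=1, n=5)     # first data set (n = 5, y = 1)
a, b = update_beta(a, b, y=3, n=10)    # second, independent data set (n = 10, y = 3)

# Updating once with the pooled data gives the same posterior
a2, b2 = update_beta(1.0, 1.0, y=4, n=15)
print((a, b), (a2, b2))   # both are (5.0, 12.0)
```

Because the two data sets are independent given θ, sequential and batch updating agree exactly.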




■ The Bayesian approach is often talked about as
a learning process:
As we get more data, we add to our store of in-
formation by multiplying it by our current posterior
distribution.
■ It has been argued that this can form the basis of
science and this has been applied to the
(Bayesian) decision making process.
■ For the Late Bus Example, if after 10 weeks we
observe 10 late buses, then we see from Figure 7
that as evidence accumulates, our beliefs about θ
converge, even though our priors differ greatly:
Be(1, 1), Be(2, 5), Be(1, 10).



Figure 7: Posterior distributions with accumulating sampling information: priors Be(1, 1), Be(2, 5), Be(1, 10), updated with n = 5, y = 1 and then n = 50, y = 10.


Example 3: Normal data with normal prior




Example 3: Normal data with normal prior

■ This example is important because it addresses


the normal likelihood and normal prior
combination often used in practice.
■ Assume that
◆ an observation y is normally distributed with
mean θ and known variance σ^2;
◆ the parameter of interest, θ, also has a normal
distribution, with parameters µ and τ^2.


■ Find the marginal, posterior, and predictive
distributions.



The results are as follows:

p(y) = N(µ, σ^2 + τ^2),

p(θ|y) = N( (τ^2 y + σ^2 µ) / (σ^2 + τ^2), σ^2 τ^2 / (σ^2 + τ^2) ),

p(ỹ|y) = N( (τ^2 y + σ^2 µ) / (σ^2 + τ^2), σ^2 + σ^2 τ^2 / (σ^2 + τ^2) ).



If y1, y2, ..., yn are observed instead of a single
observation y, then from the sampling distribution
of ȳ, N(θ, σ^2/n), we have

p(θ|ȳ) = N( (τ^2 ȳ + (σ^2/n) µ) / (σ^2/n + τ^2), (σ^2/n) τ^2 / (σ^2/n + τ^2) ),

p(ỹ|ȳ) = N( (τ^2 ȳ + (σ^2/n) µ) / (σ^2/n + τ^2), σ^2 + (σ^2/n) τ^2 / (σ^2/n + τ^2) ).

We see that the normal distribution is the conjugate
prior for the normal mean.
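A small sketch of these updating formulas (the helper name and the numerical values µ = 0, τ^2 = 1, σ^2 = 4, n = 4, ȳ = 2 are illustrative, not from the text):

```python
def normal_posterior(ybar, n, sigma2, mu, tau2):
    # Posterior N(m, v) for theta given ybar, with known sampling variance sigma2
    s2n = sigma2 / n                      # variance of the sample mean ybar
    m = (tau2 * ybar + s2n * mu) / (s2n + tau2)
    v = s2n * tau2 / (s2n + tau2)
    return m, v

m, v = normal_posterior(ybar=2.0, n=4, sigma2=4.0, mu=0.0, tau2=1.0)
print(m, v)        # posterior mean 1.0, posterior variance 0.5
print(4.0 + v)     # predictive variance for one new observation adds sigma^2
```

The posterior mean is a precision-weighted compromise between the prior mean µ and the sample mean ȳ; as n grows, the weight on ȳ tends to 1.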

