
Bayesian Data Analysis

ENAR Annual Meeting
Tampa, Florida, March 26, 2006


Course contents

Introduction of Bayesian concepts using single-parameter models.
Multiple-parameter models and hierarchical models.
Computation: approximations to the posterior, rejection and importance sampling, and MCMC.
Model checking, diagnostics, model fit.
Linear hierarchical models: random effects and mixed models.
Generalized linear models: logistic, multinomial, Poisson regression.
Hierarchical models for spatially correlated data.

Course emphasis

Notes draw heavily on the book by Gelman et al., Bayesian Data Analysis, 2nd ed., and many of the figures are borrowed directly from that book.

We focus on the implementation of Bayesian methods and interpretation of results.

Little theory, but some is needed to understand the methods.

Lots of examples. Some are not directly drawn from biological problems, but still serve to illustrate methodology.

Biggest idea to get across: inference by simulation.

Software: R (or S-Plus) early on, and WinBUGS for most of the examples after we discuss computational methods.

All of the programs used to construct examples can be downloaded from www.public.iastate.edu/ alicia. On my website, go to Teaching and then to ENAR Short Course.


Single parameter models

There is no real advantage to being a Bayesian in these simple models. We discuss them to introduce important concepts:

Priors: non-informative, conjugate, other informative
Computations
Summary of posterior, presentation of results
Binomial, Poisson, Normal models as examples
Some real examples

Binomial model

Of historical importance: Bayes derived his theorem and Laplace provided computations for the Binomial model.

Before Bayes, the question was: given θ, what are the probabilities of the possible outcomes of y?

Bayes asked: what is Pr(θ₁ < θ < θ₂ | y)?

Using a uniform prior for θ, Bayes showed that

  Pr(θ₁ < θ < θ₂ | y) = ∫_{θ₁}^{θ₂} θ^y (1 − θ)^{n−y} dθ / p(y),

with p(y) = 1/(n + 1) uniform a priori.

Laplace devised an approximation to the integral expression in the numerator.

First application by Laplace: estimate the probability that there were more female births in Paris from 1745 to 1770.

Consider n exchangeable Bernoulli trials y₁, ..., yₙ: yᵢ = 1 a success, yᵢ = 0 a failure.

Exchangeability: the summary is the number of successes in n trials, y.

For θ the probability of success, y ~ Bin(n, θ):

  p(y|θ) = (n choose y) θ^y (1 − θ)^{n−y}.

MLE is the sample proportion: θ̂ = y/n.

Prior on θ: uniform on [0, 1] (for now): p(θ) ∝ 1.

Posterior:

  p(θ|y) ∝ θ^y (1 − θ)^{n−y}.

As a function of θ, the posterior is proportional to a Beta(y + 1, n − y + 1) density.

We could also do the calculations to derive the posterior:

  p(θ|y) = (n choose y) θ^y (1 − θ)^{n−y} / ∫ (n choose y) θ^y (1 − θ)^{n−y} dθ


  = (n + 1) [n! / (y! (n − y)!)] θ^y (1 − θ)^{n−y}
  = [(n + 1)! / (y! (n − y)!)] θ^y (1 − θ)^{n−y}
  = [Γ(n + 2) / (Γ(y + 1) Γ(n − y + 1))] θ^{y+1−1} (1 − θ)^{n−y+1−1}
  = Beta(y + 1, n − y + 1)

Point estimation

Posterior mean: E(θ|y) = (y + 1)/(n + 2).
Note the posterior mean is a compromise between the prior mean (1/2) and the sample proportion y/n.
Posterior mode: y/n.
Posterior median: the value θ* such that Pr(θ ≤ θ*|y) = 0.5.
The best point estimator minimizes the expected loss (more later).

Posterior variance

  Var(θ|y) = (y + 1)(n − y + 1) / [(n + 2)²(n + 3)].

Interval estimation

A 95% credible set (or central posterior interval) is (a, b) if:

  ∫₀^a p(θ|y) dθ = 0.025  and  ∫₀^b p(θ|y) dθ = 0.975.

A 100(1 − α)% highest posterior density (HPD) credible set is the subset C of Θ such that

  C = {θ : p(θ|y) ≥ k(α)}

where k(α) is the largest constant such that Pr(C|y) ≥ 1 − α.

For symmetric unimodal posteriors, central credible sets and highest posterior density credible sets coincide.

In other cases, HPD sets have the smallest size.

Interpretation (for either): the probability that θ is in the set is equal to 1 − α. Which is what we want to say!

[Figure: a central credible set and an HPD set.]

Prediction: prior predictive distribution

  p(y) = ∫₀¹ (n choose y) θ^y (1 − θ)^{n−y} dθ = 1/(n + 1).

A priori, all values of y are equally likely.

Inference by simulation:

Draw values from the posterior: easy for closed-form posteriors, can also be done in other cases (later).
Monte Carlo estimates of point and interval estimates.
Added MC error (due to sampling).
Easy to get interval estimates and estimators for functions of parameters (see the sketch below).

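As an illustration of inference by simulation, a minimal R sketch for the uniform-prior binomial model: draw from the Beta(y + 1, n − y + 1) posterior and summarize by Monte Carlo. The data values (y = 437, n = 980, from the placenta previa example discussed later) are used only for illustration.

# Monte Carlo inference for theta in the binomial model with uniform prior
y <- 437; n <- 980                       # example data (placenta previa study)
draws <- rbeta(10000, y + 1, n - y + 1)  # draws from the Beta posterior
mean(draws)                              # MC estimate of posterior mean
quantile(draws, c(0.025, 0.5, 0.975))    # central 95% credible interval and median
mean(draws < 0.485)                      # Pr(theta < 0.485 | y), a function of theta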

Posterior predictive distribution, to predict the outcome of a new trial ỹ given y successes in the previous n trials:

  Pr(ỹ = 1|y) = ∫₀¹ Pr(ỹ = 1|y, θ) p(θ|y) dθ
             = ∫₀¹ Pr(ỹ = 1|θ) p(θ|y) dθ
             = ∫₀¹ θ p(θ|y) dθ = E(θ|y) = (y + 1)/(n + 2)

Binomial model: different priors

How do we choose priors?

In a purely subjective manner (orthodox)
Using actual information (e.g., from literature or scientific knowledge)
Eliciting from experts
For mathematical convenience
To express ignorance

Unless we have reliable prior information about θ, we prefer to let the data speak for themselves.

Asymptotic argument: as the sample size increases, the likelihood should dominate the posterior.

Conjugate prior

Suppose that we choose a Beta prior for θ:

  p(θ|α, β) ∝ θ^{α−1} (1 − θ)^{β−1}

The posterior is now

  p(θ|y) ∝ θ^{y+α−1} (1 − θ)^{n−y+β−1}

The posterior is again proportional to a Beta:

  p(θ|y) = Beta(y + α, n − y + β).

The Beta is the conjugate prior for the binomial model: the posterior is in the same form as the prior.

To choose prior parameters, think as follows: we observe α successes in α + β prior trials. The prior guess for θ is α/(α + β).

For now, α and β are considered fixed and known, but they can also get their own prior distribution (hierarchical model).


Conjugate priors

Formal definition: let F be a class of sampling distributions and P a class of prior distributions. Then P is conjugate for F if p(θ) ∈ P and p(y|θ) ∈ F implies p(θ|y) ∈ P.

If F is an exponential family then distributions in F have natural conjugate priors.

A distribution in F has the form

  p(yᵢ|θ) = f(yᵢ) g(θ) exp(φ(θ)ᵀ u(yᵢ)),

for φ(θ) the natural parameter and t(y) a sufficient statistic. For an iid sequence, the likelihood function is

  p(y|θ) ∝ g(θ)ⁿ exp(φ(θ)ᵀ t(y)).

Consider the prior density

  p(θ) ∝ g(θ)^η exp(φ(θ)ᵀ ν)

The posterior is also in exponential form:

  p(θ|y) ∝ g(θ)^{n+η} exp(φ(θ)ᵀ (t(y) + ν))

Conjugate prior for the binomial proportion

y is the number of successes in n exchangeable trials, so the sampling distribution is binomial with probability of success θ:

  p(y|θ) ∝ θ^y (1 − θ)^{n−y}

Note that

  p(y|θ) ∝ θ^y (1 − θ)ⁿ (1 − θ)^{−y}
         ∝ (1 − θ)ⁿ exp{y [log θ − log(1 − θ)]}

Written in exponential family form,

  p(y|θ) ∝ (1 − θ)ⁿ exp{y log[θ/(1 − θ)]}

where g(θ)ⁿ = (1 − θ)ⁿ, y is the sufficient statistic, and the logit log[θ/(1 − θ)] is the natural parameter.

Consider a prior for θ: Beta(α, β):

  p(θ) ∝ θ^{α−1} (1 − θ)^{β−1}

If we let ν = α − 1 and η = α + β − 2, we can write

  p(θ) ∝ θ^ν (1 − θ)^{η−ν}

or

  p(θ) ∝ (1 − θ)^η exp{ν log[θ/(1 − θ)]}


Then the posterior is in the same form as the prior:

  p(θ|y) ∝ (1 − θ)^{n+η} exp{(y + ν) log[θ/(1 − θ)]}

Since p(y|θ) ∝ θ^y (1 − θ)^{n−y}, the prior Beta(α, β) suggests that a priori we believe in approximately α successes in α + β trials.

Our prior guess for the probability of success is α/(α + β).

By varying (α + β) (with α/(α + β) fixed), we can incorporate more or less information into the prior: a prior "sample size".

Also note:

  E(θ|y) = (α + y)/(α + β + n)

is always between the prior mean α/(α + β) and the MLE y/n.

Posterior variance

  Var(θ|y) = (α + y)(β + n − y) / [(α + β + n)²(α + β + n + 1)]
           = E(θ|y)[1 − E(θ|y)] / (α + β + n + 1)

As y and n − y get large:

E(θ|y) → y/n
var(θ|y) → (y/n)[1 − (y/n)]/n, which approaches zero at rate 1/n.
The prior parameters have a diminishing influence in the posterior as n → ∞.

Placenta previa example

Condition in which the placenta is implanted low in the uterus, preventing normal delivery.

A study in Germany found that 437 out of 980 births with placenta previa were female, so the observed ratio is 0.446.

Suppose that in the general population, the proportion of female births is 0.485.

Can we say that the proportion of female babies among placenta previa births is lower than in the general population?

If y = number of female births, θ is the probability of a female birth, and p(θ) is uniform, then

  p(θ|y) ∝ θ^y (1 − θ)^{n−y}

and therefore

  θ|y ~ Beta(y + 1, n − y + 1).

Thus a posteriori, θ is Beta(438, 544) and

  E(θ|y) = 438/(438 + 544)
  Var(θ|y) = (438 × 544)/[(438 + 544)²(438 + 544 + 1)]

Check sensitivity to the choice of prior: try several Beta priors with increasingly more information about θ. Fix the prior mean at 0.485 and increase the prior sample size α + β (see the sketch and table below).

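A minimal R sketch of this sensitivity check, assuming a Beta(α, β) prior with mean 0.485 and a varying prior sample size α + β; it reproduces the kind of summaries shown in the table that follows.

# Sensitivity of the placenta previa posterior to increasingly informative Beta priors
y <- 437; n <- 980
prior.mean <- 0.485
for (m in c(2, 5, 10, 20, 100, 200)) {                    # m = prior "sample size" alpha + beta
  a <- prior.mean * m; b <- (1 - prior.mean) * m
  post <- qbeta(c(0.5, 0.025, 0.975), a + y, b + n - y)   # posterior median and 95% credible set
  cat(m, round(post, 3), "\n")
}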

  Prior mean   α + β   Post. median   Post. 2.5th pctile   Post. 97.5th pctile
  0.500          2        0.446            0.415                0.477
  0.485          2        0.446            0.415                0.477
  0.485          5        0.446            0.415                0.477
  0.485         10        0.446            0.415                0.477
  0.485         20        0.447            0.416                0.478
  0.485        100        0.450            0.420                0.479
  0.485        200        0.453            0.424                0.481

Results are robust to the choice of prior, even a very informative prior. The prior mean is not in the posterior 95% credible set.

Conjugate prior for the Poisson rate

y ∈ {0, 1, ...} are counts, with rate λ > 0. The sampling distribution is Poisson:

  p(yᵢ|λ) = λ^{yᵢ} exp(−λ) / yᵢ!

For exchangeable (y₁, ..., yₙ):

  p(y|λ) = ∏ᵢ λ^{yᵢ} exp(−λ)/yᵢ! = λ^{nȳ} exp(−nλ) / ∏ᵢ yᵢ!

Consider Gamma(α, β) as the prior for λ: p(λ) ∝ λ^{α−1} exp(−βλ).

Then the posterior is also Gamma:

  p(λ|y) ∝ λ^{nȳ} exp(−nλ) λ^{α−1} exp(−βλ) ∝ λ^{nȳ+α−1} exp(−(n + β)λ)

For α* = nȳ + α and β* = n + β, the posterior is Gamma(α*, β*).

Note:
The prior mean of λ is α/β.
The posterior mean of λ is
  E(λ|y) = (nȳ + α)/(n + β).
If the sample size n → ∞ then E(λ|y) approaches the MLE of λ.
If the sample size goes to zero, then E(λ|y) approaches the prior mean.

Mean of a normal distribution

y ~ N(θ, σ²) with σ² known.

For {y₁, ..., yₙ} an iid sample, the likelihood is:

  p(y|θ) = ∏ᵢ (2πσ²)^{−1/2} exp(−(yᵢ − θ)²/(2σ²))

Viewed as a function of θ, the likelihood is the exponential of a quadratic in θ:

  p(y|θ) ∝ exp(−(1/(2σ²)) Σᵢ (θ² − 2θyᵢ + yᵢ²))

The conjugate prior for θ must belong to the family of the form

  p(θ) = exp(Aθ² + Bθ + C)


that can be parameterized as

  p(θ) ∝ exp(−(θ − μ₀)²/(2τ₀²))

and then p(θ) is N(μ₀, τ₀²).

(μ₀, τ₀²) are hyperparameters. For now, consider them known.

If p(θ) is conjugate, then the posterior p(θ|y) must also be normal, with parameters (μₙ, τₙ²).

Recall that p(θ|y) ∝ p(θ) p(y|θ). Then

  p(θ|y) ∝ exp(−(1/2)[Σᵢ (yᵢ − θ)²/σ² + (θ − μ₀)²/τ₀²])

Expand the squares and collect terms in θ² and in θ:

  p(θ|y) ∝ exp(−(1/2)[(Σᵢ yᵢ² − 2θ Σᵢ yᵢ + nθ²)/σ² + (θ² − 2θμ₀ + μ₀²)/τ₀²])
        ∝ exp(−(1/2)[((σ² + nτ₀²)θ² − 2θ(nȳτ₀² + μ₀σ²))/(σ²τ₀²)])
        ∝ exp(−(1/2)[((σ²/n) + τ₀²)/((σ²/n)τ₀²)][θ² − 2θ(ȳτ₀² + μ₀(σ²/n))/((σ²/n) + τ₀²)])

Then p(θ|y) is normal with

  Mean: μₙ = (ȳτ₀² + μ₀(σ²/n)) / ((σ²/n) + τ₀²)
  Variance: τₙ² = ((σ²/n)τ₀²) / ((σ²/n) + τ₀²)

Note that

  μₙ = (ȳτ₀² + μ₀(σ²/n)) / ((σ²/n) + τ₀²) = ((n/σ²)ȳ + (1/τ₀²)μ₀) / (n/σ² + 1/τ₀²)

(by dividing numerator and denominator by (σ²/n)τ₀²).

The posterior mean is a weighted average of the prior mean and the sample mean. The weights are given by the precisions n/σ² and 1/τ₀².

As the data precision increases ((σ²/n) → 0), because σ² → 0 or because n → ∞, μₙ → ȳ.

Also, note that

  μₙ = μ₀ ((σ²/n) / ((σ²/n) + τ₀²)) + ȳ (τ₀² / ((σ²/n) + τ₀²)).

Add and subtract μ₀τ₀²/((σ²/n) + τ₀²) to see that

  μₙ = μ₀ + (ȳ − μ₀) (τ₀² / ((σ²/n) + τ₀²)).

The posterior mean is the prior mean shrunken towards the observed value. The amount of shrinkage depends on the relative size of the precisions.

Posterior variance:

Recall that

  p(θ|y) ∝ exp(−(1/2)[((σ²/n) + τ₀²)/((σ²/n)τ₀²)][θ² − 2θ(ȳτ₀² + μ₀(σ²/n))/((σ²/n) + τ₀²)])


Then

  1/τₙ² = ((σ²/n) + τ₀²) / ((σ²/n)τ₀²) = n/σ² + 1/τ₀²

Posterior precision = sum of prior and data precisions.

Posterior predictive distribution

We want to predict the next observation ỹ. Recall that p(ỹ|y) ∝ ∫ p(ỹ|θ) p(θ|y) dθ.

We know that, given θ, ỹ ~ N(θ, σ²), and that θ|y ~ N(μₙ, τₙ²).

Then

  p(ỹ|y) ∝ ∫ exp(−(ỹ − θ)²/(2σ²)) exp(−(θ − μₙ)²/(2τₙ²)) dθ
        ∝ ∫ exp(−(1/2)[(ỹ − θ)²/σ² + (θ − μₙ)²/τₙ²]) dθ

The integrand is the kernel of a bivariate normal, so (ỹ, θ) have a bivariate normal joint posterior. The marginal p(ỹ|y) must be normal.

We need E(ỹ|y) and var(ỹ|y):

  E(ỹ|y) = E[E(ỹ|y, θ)|y] = E(θ|y) = μₙ,

because E(ỹ|y, θ) = E(ỹ|θ) = θ.

  var(ỹ|y) = E[var(ỹ|y, θ)|y] + var[E(ỹ|y, θ)|y]
           = E(σ²|y) + var(θ|y)
           = σ² + τₙ²

because var(ỹ|y, θ) = var(ỹ|θ) = σ².

The variance includes the additional term τₙ² as a penalty for not knowing the true value of θ.

Recall that

  μₙ = (μ₀/τ₀² + nȳ/σ²) / (1/τ₀² + n/σ²)

In μₙ, the prior precision 1/τ₀² and the data precision n/σ² play equivalent roles. Then:

For n large, (ȳ, σ²) determine the posterior.
For τ₀² = σ²/n, the prior has the same weight as adding one more observation with value μ₀.
When τ₀² → ∞ with n fixed, or when n → ∞ with τ₀² fixed:

  p(θ|y) → N(ȳ, σ²/n)

This is a good approximation in practice when prior beliefs about θ are vague or when the sample size is large.

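A small R sketch of this conjugate update and the posterior predictive draw, using illustrative assumed values for the prior (μ₀, τ₀²), the known σ², and the data summaries (n, ȳ):

# Conjugate update for a normal mean with known variance, plus posterior predictive draws
mu0 <- 0; tau0.sq <- 4                 # assumed prior mean and variance
sigma.sq <- 1; n <- 25; ybar <- 0.8    # assumed known variance and data summaries
tau.n.sq <- 1 / (1/tau0.sq + n/sigma.sq)              # posterior variance
mu.n <- tau.n.sq * (mu0/tau0.sq + n*ybar/sigma.sq)    # precision-weighted posterior mean
theta  <- rnorm(5000, mu.n, sqrt(tau.n.sq))           # draws from p(theta | y)
ytilde <- rnorm(5000, theta, sqrt(sigma.sq))          # draws from p(ytilde | y)
c(mu.n, tau.n.sq)                      # posterior mean and variance
c(mean(ytilde), var(ytilde))           # approximately mu.n and sigma.sq + tau.n.sq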

Normal variance

Example of a scale model.

Assume that y ~ N(θ, σ²), with θ known. For iid y₁, ..., yₙ:

  p(y|σ²) ∝ (σ²)^{−n/2} exp(−(1/(2σ²)) Σᵢ (yᵢ − θ)²)
          ∝ (σ²)^{−n/2} exp(−nv/(2σ²))

for the sufficient statistic

  v = (1/n) Σᵢ (yᵢ − θ)².

The likelihood is in exponential family form with natural parameter φ(σ²) = σ^{−2}. Then the natural conjugate prior must be of the form

  p(σ²) ∝ (σ²)^{−α} exp(−β/σ²)

Consider an inverse Gamma prior:

  p(σ²) ∝ (σ²)^{−(α+1)} exp(−β/σ²)

For ease of interpretation, reparameterize as a scaled inverse χ² distribution, with ν₀ degrees of freedom and scale σ₀².

The scale σ₀² corresponds to the prior guess for σ².
Large ν₀: lots of confidence in σ₀² as a good value for σ².

If σ² ~ Inv-χ²(ν₀, σ₀²) then

  p(σ²) ∝ (σ²)^{−(ν₀/2+1)} exp(−ν₀σ₀²/(2σ²))

This corresponds to Inv-Gamma(ν₀/2, ν₀σ₀²/2).

The prior mean is ν₀σ₀²/(ν₀ − 2).
The prior variance behaves like σ₀⁴/ν₀: large ν₀, large prior precision.

Posterior:

  p(σ²|y) ∝ p(σ²) p(y|σ²)
          ∝ (σ²)^{−n/2} exp(−nv/(2σ²)) (σ²)^{−(ν₀/2+1)} exp(−ν₀σ₀²/(2σ²))
          ∝ (σ²)^{−(ν₁/2+1)} exp(−ν₁σ₁²/(2σ²))

with ν₁ = ν₀ + n and

  σ₁² = (nv + ν₀σ₀²)/(n + ν₀)

p(σ²|y) is also a scaled inverse χ².

The posterior scale is a weighted average of the prior guess and the data estimate, with weights given by the prior and sample degrees of freedom.

The prior provides information equivalent to ν₀ observations with average squared deviation equal to σ₀².

As n → ∞, σ₁² → v.

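A short R sketch for drawing σ² from this scaled inverse χ² posterior, using the standard fact that ν₁σ₁²/σ² ~ χ²(ν₁); the numerical values are only illustrative assumptions.

# Draw sigma^2 from a scaled inverse chi-square posterior (illustrative values)
nu0 <- 5; sigma0.sq <- 2                  # assumed prior degrees of freedom and scale
n <- 30; v <- 1.5                         # assumed data summaries
nu1 <- nu0 + n
sigma1.sq <- (n*v + nu0*sigma0.sq) / nu1  # posterior scale
sigma.sq <- nu1 * sigma1.sq / rchisq(5000, df = nu1)  # scaled inverse chi-square draws
quantile(sigma.sq, c(0.025, 0.5, 0.975))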

Poisson model

Appropriate for count data such as number of cases, number of accidents, etc.

For a vector of n iid observations:

  p(y|λ) = ∏ᵢ λ^{yᵢ} e^{−λ}/yᵢ! = λ^{t(y)} e^{−nλ} / ∏ᵢ yᵢ!,

where λ is the rate, yᵢ = 0, 1, ..., and t(y) = Σᵢ yᵢ is the sufficient statistic for λ.

We can write the model in exponential family form:

  p(y|λ) ∝ e^{−nλ} e^{t(y) log λ}

where φ(λ) = log λ is the natural parameter.

The natural conjugate prior must have the form

  p(λ) ∝ e^{−ηλ} e^{ν log λ}

or p(λ) ∝ e^{−ηλ} λ^ν, which looks like a Gamma(ν + 1, η).

With a Gamma(α, β) prior, the posterior is Gamma(α + nȳ, β + n).

Poisson model - Digression

In the conjugate case, we can often derive p(y) using

  p(y) = p(λ) p(y|λ) / p(λ|y)

In the case of the Gamma-Poisson model for a single observation:

  p(y) = Poisson(λ) Gamma(α, β) / Gamma(α + y, β + 1)
       = Γ(α + y) β^α / [Γ(α) y! (1 + β)^{α+y}]
       = (α + y − 1 choose y) (β/(β + 1))^α (1/(β + 1))^y

the density of a negative binomial with parameters (α, β).

Thus we can interpret negative binomial distributions as arising from a mixture of Poissons with rate λ, where λ follows a Gamma distribution:

  p(y) = ∫ Poisson(y|λ) Gamma(λ|α, β) dλ

The negative binomial is a robust alternative to the Poisson.

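A small R sketch illustrating this mixture interpretation by simulation: Gamma-mixed Poisson draws should match draws from the corresponding negative binomial (here with illustrative α = 3, β = 2, not values from the notes).

# Negative binomial as a Gamma mixture of Poissons (illustrative alpha, beta)
alpha <- 3; beta <- 2
lambda <- rgamma(20000, shape = alpha, rate = beta)              # lambda ~ Gamma(alpha, beta)
y.mix  <- rpois(20000, lambda)                                   # y | lambda ~ Poisson(lambda)
y.nb   <- rnbinom(20000, size = alpha, prob = beta/(beta + 1))   # direct negative binomial
rbind(table(y.mix)[1:6]/20000, table(y.nb)[1:6]/20000)           # the two should roughly agree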

Poisson model (contd)

Given the rate, observations are assumed to be exchangeable.

Add an exposure to the model: observations are exchangeable within small exposure intervals.

Examples:
In a city of size 500,000, define the rate of death from cancer per million people; then the exposure is 0.5.
For an intersection in Ames with traffic of one million vehicles per year, consider the number of traffic accidents per million vehicles per year. The exposure for the intersection is 1.
Exposure is typically known and reflects the fact that in each unit, the number of persons or cars or plants or animals that are at risk is different.

Now

  p(y|λ) ∝ λ^{Σᵢ yᵢ} exp(−λ Σᵢ xᵢ)

for xᵢ the exposure of unit i.

With prior p(λ) = Gamma(α, β), the posterior is

  p(λ|y) = Gamma(α + Σᵢ yᵢ, β + Σᵢ xᵢ)

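A minimal R sketch of this Gamma posterior update with exposures, using made-up counts yᵢ and exposures xᵢ and an assumed Gamma(α, β) prior:

# Poisson rate with exposures: Gamma(alpha, beta) prior -> Gamma posterior
y <- c(3, 0, 2, 5, 1)                    # hypothetical counts per unit
x <- c(0.5, 1.0, 0.8, 2.0, 0.6)          # hypothetical exposures per unit
alpha <- 2; beta <- 1                    # assumed prior parameters
lambda <- rgamma(5000, shape = alpha + sum(y), rate = beta + sum(x))
quantile(lambda, c(0.025, 0.5, 0.975))   # posterior summary of the rate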
Non-informative prior distributions

A non-informative prior distribution:
Has minimal impact on the posterior
Lets the data speak for themselves
Is also called vague, flat, or diffuse

So far, we have concentrated on the conjugate family. Conjugate priors can also be almost non-informative.

If y ~ N(θ, σ²), the natural conjugate for θ is N(μ₀, τ₀²), and the posterior is N(μ₁, τ₁²), where

  μ₁ = (μ₀/τ₀² + nȳ/σ²) / (1/τ₀² + n/σ²),   τ₁² = 1 / (1/τ₀² + n/σ²)

For τ₀ → ∞:

  μ₁ → ȳ,   τ₁² → σ²/n

The same result could have been obtained using p(θ) ∝ 1.

The uniform prior is a natural non-informative prior for location parameters (see later).

It is improper:

  ∫ p(θ) dθ = ∫ dθ = ∞

yet it leads to a proper posterior for θ. This is not always the case.

Poisson example: y₁, ..., yₙ iid Poisson(λ); consider p(λ) ∝ λ^{−1/2}.

  ∫₀^∞ λ^{−1/2} dλ = ∞: improper


Yet

  p(λ|y) ∝ λ^{Σᵢ yᵢ} e^{−nλ} λ^{−1/2}
         = λ^{Σᵢ yᵢ + 1/2 − 1} e^{−nλ}

proportional to a Gamma(Σᵢ yᵢ + 1/2, n), which is proper.

Uniform priors arise when we assign equal density to each θ: no preference for one value over another.

But they are not invariant under one-to-one transformations.

Example: φ = exp{θ}, so that θ = log{φ} is the inverse transformation.

dθ/dφ = φ^{−1} is the Jacobian.

Then, if p(θ) is the prior for θ, p(φ) = φ^{−1} p(log φ) is the corresponding prior for the transformation.

For p(θ) ∝ c, p(φ) ∝ φ^{−1}: informative.

An informative prior is needed to arrive at the same answer in both parameterizations.

Jeffreys prior

In 1961, Jeffreys proposed a method for finding non-informative priors that are invariant to one-to-one transformations.

Jeffreys proposed

  p(θ) ∝ [I(θ)]^{1/2}

where I(θ) is the expected Fisher information:

  I(θ) = −E_y [ d²/dθ² log p(y|θ) ]

If θ is a vector, then I(θ) is a matrix with (i, j) element equal to

  −E_y [ d²/(dθᵢ dθⱼ) log p(y|θ) ]

and

  p(θ) ∝ |I(θ)|^{1/2}

Theorem: the Jeffreys prior is locally uniform and therefore non-informative.


Jeffreys prior for the binomial proportion

If y ~ Bin(n, θ) then log p(y|θ) ∝ y log θ + (n − y) log(1 − θ) and

  d²/dθ² log p(y|θ) = −y θ^{−2} − (n − y)(1 − θ)^{−2}

Taking expectations:

  I(θ) = −E_y [ d²/dθ² log p(y|θ) ] = E(y) θ^{−2} + (n − E(y))(1 − θ)^{−2}
       = nθ θ^{−2} + (n − nθ)(1 − θ)^{−2} = n / [θ(1 − θ)].

Then

  p(θ) ∝ [I(θ)]^{1/2} ∝ θ^{−1/2}(1 − θ)^{−1/2} ∝ Beta(1/2, 1/2).

Jeffreys prior for the normal mean

y₁, ..., yₙ iid N(θ, σ²), σ² known.

  p(y|θ) ∝ exp(−(1/(2σ²)) Σ (y − θ)²)
  log p(y|θ) ∝ −(1/(2σ²)) Σ (y − θ)²
  d²/dθ² log p(y|θ) = −n/σ²

constant with respect to θ.

Then I(θ) ∝ constant and p(θ) ∝ constant.

Jeffreys prior for the normal variance

y₁, ..., yₙ iid N(θ, σ²), θ known.

  p(y|σ) ∝ σ^{−n} exp(−(1/(2σ²)) Σ (y − θ)²)
  log p(y|σ) ∝ −n log σ − (1/(2σ²)) Σ (y − θ)²
  d²/dσ² log p(y|σ) = n/σ² − (3/σ⁴) Σ (yᵢ − θ)²

Take the negative of the expectation, so that:

  I(σ) = −n/σ² + 3nσ²/σ⁴ = 2n/σ²

and therefore the Jeffreys prior is p(σ) ∝ σ^{−1}.

Invariance property of Jeffreys priors

The Jeffreys prior is invariant to one-to-one transformations φ = h(θ). Then:

  p(φ) = p(θ) |dθ/dφ| = p(θ) |dh(θ)/dθ|^{−1}

Under invariance, choose p(θ) such that p(φ) constructed as above would match what would be obtained directly.

Consider p(φ) = [I(φ)]^{1/2} and evaluate I(φ) at θ = h^{−1}(φ):

  I(φ) = −E [ d²/dφ² log p(y|φ) ]
       = −E [ d²/dθ² log p(y|θ = h^{−1}(φ)) (dθ/dφ)² ]
       = I(θ) [dθ/dφ]²

Then [I(φ)]^{1/2} = [I(θ)]^{1/2} |dθ/dφ|, as required.


Multiparameter models - Intro

Most realistic problems require models with more than one parameter.

Typically, we are interested in one or a few of those parameters.

Classical approach for estimation in multiparameter models:
1. Maximize a joint likelihood: can get nasty when there are many parameters
2. Proceed in steps

Bayesian approach: base inference on the marginal posterior distributions of the parameters of interest. Parameters that are not of interest are called nuisance parameters.

Nuisance parameters

Consider a model with two parameters (θ₁, θ₂) (e.g., a normal distribution with unknown mean and variance).

We are interested in θ₁, so θ₂ is a nuisance parameter.

The marginal posterior distribution of interest is p(θ₁|y). It can be obtained directly from the joint posterior density

  p(θ₁, θ₂|y) ∝ p(θ₁, θ₂) p(y|θ₁, θ₂)

by integrating with respect to θ₂:

  p(θ₁|y) = ∫ p(θ₁, θ₂|y) dθ₂

Nuisance parameters (contd)

Note too that

  p(θ₁|y) = ∫ p(θ₁|θ₂, y) p(θ₂|y) dθ₂

The marginal of θ₁ is a mixture of conditionals on θ₂, or a weighted average of the conditional evaluated at different values of θ₂. The weights are given by the marginal p(θ₂|y).

Important difference with frequentists!

By averaging the conditional p(θ₁|θ₂, y) over possible values of θ₂, we explicitly recognize our uncertainty about θ₂.

Two extreme cases:
1. Almost certainty about the value of θ₂: if the prior and sample are very informative about θ₂, the marginal p(θ₂|y) will be concentrated around some value θ̂₂. In that case,

  p(θ₁|y) ≈ p(θ₁|θ̂₂, y)

2. Lots of uncertainty about θ₂: the marginal p(θ₂|y) will assign relatively high probability to a wide range of values of θ₂. The point estimate θ̂₂ is no longer reliable. It is important to average over the range of values of θ₂.


Nuisance parameters (contd)

In most cases, the integral is not computed explicitly. Instead, use a two-step simulation approach:

1. Marginal simulation step: draw a value θ₂^(k) from p(θ₂|y), for k = 1, 2, ...
2. Conditional simulation step: for each θ₂^(k), draw a value of θ₁ from the conditional density p(θ₁|θ₂^(k), y)

This is an effective approach when the marginal and the conditional are of standard form. More sophisticated simulation approaches come later.

Example: Normal model

yᵢ iid from N(μ, σ²), both unknown.

Non-informative prior for (μ, σ²) assuming prior independence:

  p(μ, σ²) ∝ 1 · σ^{−2}

Joint posterior:

  p(μ, σ²|y) ∝ p(μ, σ²) p(y|μ, σ²)
            ∝ σ^{−n−2} exp(−(1/(2σ²)) Σᵢ (yᵢ − μ)²)

Note that

  Σᵢ (yᵢ − μ)² = Σᵢ (yᵢ² − 2μyᵢ + μ²)
              = Σᵢ yᵢ² − 2nμȳ + nμ²
              = Σᵢ (yᵢ − ȳ)² + n(ȳ − μ)²

by adding and subtracting nȳ².

Example: Normal model (contd)

Let

  s² = (1/(n − 1)) Σᵢ (yᵢ − ȳ)²

Then we can write the posterior for (μ, σ²) as

  p(μ, σ²|y) ∝ σ^{−n−2} exp(−(1/(2σ²)) [(n − 1)s² + n(ȳ − μ)²])

The sufficient statistics are (ȳ, s²).


Example: Conditional posterior p(μ|σ², y)

Conditional on σ²:

  p(μ|σ², y) = N(ȳ, σ²/n)

We know this from the earlier chapter (posterior of a normal mean when the variance is known).

We can also see this by noting that, viewed as a function of μ only:

  p(μ|σ², y) ∝ exp(−(n/(2σ²)) (ȳ − μ)²)

which we recognize as the kernel of a N(ȳ, σ²/n).

Example: Marginal posterior p(σ²|y)

To get p(σ²|y) we need to integrate p(μ, σ²|y) over μ:

  p(σ²|y) ∝ ∫ σ^{−n−2} exp(−(1/(2σ²)) [(n − 1)s² + n(ȳ − μ)²]) dμ
          ∝ σ^{−n−2} exp(−(n − 1)s²/(2σ²)) ∫ exp(−(n/(2σ²)) (ȳ − μ)²) dμ
          ∝ σ^{−n−2} exp(−(n − 1)s²/(2σ²)) √(2πσ²/n)

Then

  p(σ²|y) ∝ (σ²)^{−(n+1)/2} exp(−(n − 1)s²/(2σ²))

which is proportional to a scaled inverse χ² distribution with degrees of freedom (n − 1) and scale s².

Recall the classical result: conditional on σ², the distribution of the scaled sufficient statistic (n − 1)s²/σ² is χ²_{n−1}.

Normal model: analytical derivation

For the normal model, we can derive the marginal p(μ|y) analytically:

  p(μ|y) = ∫ p(μ, σ²|y) dσ²
         ∝ ∫ (σ²)^{−(n/2+1)} exp(−(1/(2σ²)) [(n − 1)s² + n(ȳ − μ)²]) dσ²

Use the transformation

  z = A/(2σ²)

where A = (n − 1)s² + n(ȳ − μ)². Then

  dσ²/dz = −A/(2z²)

and

  p(μ|y) ∝ ∫ (z/A)^{n/2+1} (A/z²) exp(−z) dz
         ∝ A^{−n/2} ∫ z^{n/2−1} exp(−z) dz

The integrand is an unnormalized Gamma(n/2, 1), so the integral is constant with respect to μ.

Recall that A = (n − 1)s² + n(ȳ − μ)². Then

  p(μ|y) ∝ A^{−n/2}
         ∝ [(n − 1)s² + n(ȳ − μ)²]^{−n/2}
         ∝ [1 + n(μ − ȳ)²/((n − 1)s²)]^{−n/2}

the kernel of a t-distribution with n − 1 degrees of freedom, centered at ȳ and with scale parameter s²/n.

For the non-informative prior p(μ, σ²) ∝ σ^{−2}, the posterior distribution of μ is a non-standard t. Then,

  (μ − ȳ)/(s/√n) | y ~ t_{n−1}

the standard t distribution.

Notice the similarity to the classical result: for iid normal observations from N(μ, σ²), given (μ, σ²), the pivotal quantity

  (ȳ − μ)/(s/√n) | μ, σ² ~ t_{n−1}

A pivot is a non-trivial function of the data and the parameter(s) whose distribution, given θ, is independent of θ. The property is deduced from the sampling distribution, as above.

Baby example of the pivot property: y ~ N(θ, 1). The pivot is x = y − θ. Given θ, x ~ N(0, 1), independent of θ.


Posterior predictive for a future observation

The posterior predictive distribution for a future observation ỹ is a mixture:

  p(ỹ|y) = ∫∫ p(ỹ|y, σ², μ) p(μ, σ²|y) dμ dσ²

The first factor in the integrand is just the normal model, and it does not depend on y at all.

To simulate ỹ from the posterior predictive distribution, do the following (see the sketch below):

1. Draw σ² from Inv-χ²(n − 1, s²)
2. Draw μ from N(ȳ, σ²/n)
3. Draw ỹ from N(μ, σ²)

We can also derive the posterior predictive distribution in analytic form. Note that

  p(ỹ|σ², y) = ∫ p(ỹ|σ², μ) p(μ|σ², y) dμ
             ∝ ∫ exp(−(1/(2σ²)) (ỹ − μ)²) exp(−(n/(2σ²)) (μ − ȳ)²) dμ

After some algebra:

  p(ỹ|σ², y) = N(ȳ, (1 + 1/n) σ²)

Using the same approach as in deriving the posterior distribution of μ, we find that

  p(ỹ|y) = t_{n−1}(ȳ, (1 + 1/n)^{1/2} s)

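A minimal R sketch of the three-step simulation above, assuming only the data summaries n, ȳ, and s² (the values below are made up):

# Simulate from p(ytilde | y) under the noninformative prior p(mu, sigma^2) proportional to 1/sigma^2
n <- 20; ybar <- 10.2; s2 <- 4.1                 # assumed data summaries
m <- 5000
sigma2 <- (n - 1) * s2 / rchisq(m, df = n - 1)   # sigma^2 ~ Inv-chi^2(n-1, s^2)
mu     <- rnorm(m, ybar, sqrt(sigma2 / n))       # mu | sigma^2, y
ytilde <- rnorm(m, mu, sqrt(sigma2))             # ytilde | mu, sigma^2
quantile(ytilde, c(0.025, 0.5, 0.975))           # predictive interval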
Normal data and conjugate prior

Recall that using a non-informative prior, we found that

  p(μ|σ², y) = N(ȳ, σ²/n)
  p(σ²|y) = Inv-χ²(n − 1, s²)

Then, factoring p(μ, σ²) = p(μ|σ²) p(σ²), the conjugate prior for σ² would also be a scaled inverse χ² and for μ (conditional on σ²) would be normal. Consider

  μ|σ² ~ N(μ₀, σ²/κ₀)
  σ² ~ Inv-χ²(ν₀, σ₀²)

Jointly:

  p(μ, σ²) ∝ σ^{−1} (σ²)^{−(ν₀/2+1)} exp(−(1/(2σ²)) [ν₀σ₀² + κ₀(μ₀ − μ)²])

Note that μ and σ² are not independent a priori.

Posterior density for (μ, σ²):

Multiply the likelihood by the N-Inv-χ²(μ₀, σ²/κ₀; ν₀, σ₀²) prior.
Expand the two squares in μ.
Complete the square by adding and subtracting a term depending on ȳ and μ₀.

Then p(μ, σ²|y) = N-Inv-χ²(μₙ, σₙ²/κₙ; νₙ, σₙ²) where

  μₙ = (κ₀/(κ₀ + n)) μ₀ + (n/(κ₀ + n)) ȳ
  κₙ = κ₀ + n,   νₙ = ν₀ + n
  νₙσₙ² = ν₀σ₀² + (n − 1)s² + (κ₀n/(κ₀ + n)) (ȳ − μ₀)²

Interpretation of the posterior parameters:

μₙ is a weighted average, as earlier.
νₙσₙ²: sum of the sample sum of squares, the prior sum of squares, and additional uncertainty due to the difference between the sample mean and the prior mean.

Conditional posterior of μ: as before, μ|σ², y ~ N(μₙ, σ²/κₙ).

Marginal posterior of σ²: as before, σ²|y ~ Inv-χ²(νₙ, σₙ²).

Marginal posterior of μ: as before, μ|y ~ t_{νₙ}(μ|μₙ, σₙ²/κₙ).

Two ways to sample from the joint posterior distribution:
1. Sample μ from the t and σ² from the Inv-χ²
2. Sample σ² from the Inv-χ² and, given σ², sample μ from the normal

Semi-conjugate prior for the normal model

Consider setting independent priors for μ and σ²:

  μ ~ N(μ₀, τ₀²)
  σ² ~ Inv-χ²(ν₀, σ₀²)

Example: mean weight of students; visual inspection shows weights between 100 and 200 pounds.

p(μ, σ²) = p(μ) p(σ²) is not conjugate and does not lead to a posterior of known form.

We can factor as earlier:

  μ|σ², y ~ N(μₙ, τₙ²)

with

  μₙ = (μ₀/τ₀² + nȳ/σ²) / (1/τ₀² + n/σ²),   τₙ² = 1 / (1/τ₀² + n/σ²)

NOTE: even though μ and σ² are independent a priori, they are not independent in the posterior.

Semi-conjugate prior and p(σ²|y)

The marginal posterior p(σ²|y) can be obtained by integrating the joint p(μ, σ²|y) with respect to μ:

  p(σ²|y) ∝ ∫ N(μ|μ₀, τ₀²) Inv-χ²(σ²|ν₀, σ₀²) ∏ᵢ N(yᵢ|μ, σ²) dμ

The integration can be performed by noting that the integrand, as a function of μ, is proportional to a normal density. But keeping track of the normalizing constants that depend on σ² is messy.

It is easier to note that:

  p(σ²|y) = p(μ, σ²|y) / p(μ|σ², y)

so that

  p(σ²|y) ∝ N(μ|μ₀, τ₀²) Inv-χ²(σ²|ν₀, σ₀²) ∏ᵢ N(yᵢ|μ, σ²) / N(μ|μₙ, τₙ²)

which is still a mess.

Semi-conjugate prior and p(σ²|y)

From the earlier page:

  p(σ²|y) ∝ N(μ|μ₀, τ₀²) Inv-χ²(σ²|ν₀, σ₀²) ∏ᵢ N(yᵢ|μ, σ²) / N(μ|μₙ, τₙ²)

Factors that depend on μ must cancel, and therefore we know that p(σ²|y) does not depend on μ, in the sense that we can evaluate p(σ²|y) for a grid of values of σ² and any arbitrary value of μ.

Choose μ = μₙ and then the denominator simplifies to something proportional to τₙ^{−1}. Then

  p(σ²|y) ∝ τₙ N(μₙ|μ₀, τ₀²) Inv-χ²(σ²|ν₀, σ₀²) ∏ᵢ N(yᵢ|μₙ, σ²)

which can be evaluated for a grid of values of σ².

Implementing the inverse CDF method

We often need to draw values from distributions that do not have a nice standard form. An example is expression 3.14 in the text, for p(σ²|y) in the normal model with semi-conjugate priors for μ and σ².

One approach is the inverse cdf method (see earlier notes), implemented numerically.

We assume that we can evaluate the pdf (even in unnormalized form) for a grid of values of θ in the appropriate range.


Implementing the inverse CDF method

To generate 1000 draws from a distribution p(θ|y) with parameter θ, do (see the sketch below):

1. Evaluate p(θ|y) for a grid of m values of θ. Let the evaluations be denoted by (p₁, p₂, ..., pₘ).
2. Compute the CDF over the grid: (p₁, p₁ + p₂, ...., Σ_{i=1}^{m−1} pᵢ, 1) and denote those (f₁, f₂, ..., fₘ).
3. Generate M uniform random variables u in [0, 1].
4. If u ∈ [f_{i−1}, f_i], draw θᵢ.

Inverse CDF method - Example

  θ      Prob(θ)   CDF (F)
  1       0.03      0.03
  2       0.04      0.07
  3       0.08      0.15
  4       0.15      0.30
  5       0.20      0.50
  6       0.30      0.80
  7       0.10      0.90
  8       0.05      0.95
  9       0.03      0.98
  10      0.02      1.00

Consider a parameter θ with values in (1, 10) with probability distribution and cumulative distribution functions as above. Under this distribution, values of θ between 4 and 7 are more likely.

Inverse CDF method - Example

To implement the method, do:
1. Draw u ~ U(0, 1). For example, in the first draw u = 0.45.
2. For u ∈ (f_{i−1}, f_i), draw θᵢ.
3. For u = 0.45 ∈ (0.3, 0.5), draw θ = 5.
4. Alternative approach:
   (a) Flip another coin v ~ U(0, 1).
   (b) Pick θ_{i−1} if v ≤ 0.5 and pick θᵢ if v > 0.5.
5. In the example, for u = 0.45, the alternative approach would choose either θ = 4 or θ = 5, each with probability 1/2.
6. Repeat many times, M.

If M is very large, we expect that about 50% of our draws will be θ = 4, 5, or 6, about 2% will be θ = 10, etc.

Example: Football point spreads

Data dᵢ, i = 1, ..., 672 are differences between the predicted outcome of a football game and the actual score.

Normal model:

  dᵢ ~ N(μ, σ²)

Priors:

  μ ~ N(0, 2²)
  p(σ²) ∝ σ^{−2}

To draw values of (μ, σ²) from the posterior, do:

First draw σ² from p(σ²|y) using the inverse cdf method.
Then draw μ from p(μ|σ², y), which is normal.

To draw σ², evaluate p(σ²|y) on the grid [150, 250].

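A short R sketch of the grid-based inverse CDF sampler, using the discrete example distribution in the table above:

# Inverse CDF sampling on a grid (discrete example from the table)
theta <- 1:10
p <- c(0.03, 0.04, 0.08, 0.15, 0.20, 0.30, 0.10, 0.05, 0.03, 0.02)
f <- cumsum(p / sum(p))                  # CDF over the grid (normalize in case p is unnormalized)
M <- 2000
u <- runif(M)
draws <- theta[findInterval(u, f) + 1]   # u in (f[i-1], f[i]] -> draw theta[i]
table(draws) / M                         # should be close to p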

Example: Rounded measurements (prob. 3.5)

Sometimes, measurements are rounded, and we do not observe the true values.

yᵢ are the observed rounded values, and zᵢ are the unobserved true measurements.

If zᵢ ~ N(μ, σ²), then

  p(y|μ, σ²) ∝ ∏ᵢ [Φ((yᵢ + 0.5 − μ)/σ) − Φ((yᵢ − 0.5 − μ)/σ)]

Prior: p(μ, σ²) ∝ σ^{−2}

We are interested in posterior inference about (μ, σ²) and in the differences between the rounded and the exact analysis.

[Figure: joint posterior contours of (μ, σ²) under the exact normal analysis and under the rounded analysis.]

Multinomial model - Intro

Generalization of the binomial model, for the case where observations can have more than two possible values.

Sampling distribution: multinomial with parameters (θ₁, ..., θ_k), the probabilities associated with each of the k possible outcomes.

Example: in a survey, respondents may Strongly Agree, Agree, Disagree, Strongly Disagree, or have No Opinion when presented with a statement such as "Instructor for Stat 544 is spectacular."

Multinomial model - Sampling distribution

Formally:

y = k × 1 vector of counts of the numbers of observations in each outcome
θⱼ: probability of the jth outcome
Σⱼ θⱼ = 1 and Σⱼ yⱼ = n

Sampling distribution:

  p(y|θ) ∝ ∏ⱼ θⱼ^{yⱼ}

Multinomial model - Prior

The conjugate prior for (θ₁, ..., θ_k) is the Dirichlet distribution, a multivariate generalization of the Beta:

  p(θ|α) ∝ ∏ⱼ θⱼ^{αⱼ−1}

with

  αⱼ > 0 ∀j, and α₀ = Σⱼ αⱼ
  θⱼ > 0 ∀j, and Σⱼ θⱼ = 1

The αⱼ can be thought of as prior counts associated with the jth outcome, so that α₀ would then be a prior sample size.

For the Dirichlet:

  E(θⱼ) = αⱼ/α₀,   Var(θⱼ) = αⱼ(α₀ − αⱼ) / [α₀²(α₀ + 1)]
  Cov(θᵢ, θⱼ) = −αᵢαⱼ / [α₀²(α₀ + 1)]

Dirichlet distribution

The Dirichlet distribution is the conjugate prior for the parameters of the multinomial model.

If (θ₁, θ₂, ..., θ_K) ~ Dirichlet(α₁, α₂, ..., α_K) then

  p(θ₁, ..., θ_K) = [Γ(α₀) / ∏ⱼ Γ(αⱼ)] ∏ⱼ θⱼ^{αⱼ−1},

where θⱼ ≥ 0, Σⱼ θⱼ = 1, αⱼ ≥ 0 and α₀ = Σⱼ αⱼ.

Some properties:

  E(θⱼ) = αⱼ/α₀
  Var(θⱼ) = αⱼ(α₀ − αⱼ) / [α₀²(α₀ + 1)]
  Cov(θᵢ, θⱼ) = −αᵢαⱼ / [α₀²(α₀ + 1)]

Note that the θⱼ are negatively correlated.

Because of the sum-to-one restriction, the pdf of the K-dimensional random vector can be written in terms of K − 1 random variables.

The marginal distributions can be shown to be Beta.

Proof for K = 3:

  p(θ₁, θ₂) = [Γ(α₀) / (Γ(α₁)Γ(α₂)Γ(α₃))] θ₁^{α₁−1} θ₂^{α₂−1} (1 − θ₁ − θ₂)^{α₃−1}

To get the marginal for θ₁, integrate p(θ₁, θ₂) with respect to θ₂ with limits of integration 0 and 1 − θ₁. Call the normalizing constant Q and use the change of variable:

  v = θ₂ / (1 − θ₁)

so that θ₂ = v(1 − θ₁) and dθ₂ = (1 − θ₁) dv.

The marginal is then:

  p(θ₁) = Q ∫₀¹ θ₁^{α₁−1} (v(1 − θ₁))^{α₂−1} (1 − θ₁ − v(1 − θ₁))^{α₃−1} (1 − θ₁) dv
        = Q θ₁^{α₁−1} (1 − θ₁)^{α₂+α₃−1} ∫₀¹ v^{α₂−1} (1 − v)^{α₃−1} dv
        = Q θ₁^{α₁−1} (1 − θ₁)^{α₂+α₃−1} Γ(α₂)Γ(α₃)/Γ(α₂ + α₃)

Then, θ₁ ~ Beta(α₁, α₂ + α₃).

This generalizes to what is called the clumping property of the Dirichlet. In general, if (θ₁, ..., θ_K) ~ Dirichlet(α₁, ..., α_K):

  θᵢ ~ Beta(αᵢ, α₀ − αᵢ)

Multinomial model - Posterior

The posterior must have Dirichlet form also:

  p(θ|y) ∝ ∏ⱼ θⱼ^{αⱼ−1} θⱼ^{yⱼ}
         ∝ ∏ⱼ θⱼ^{yⱼ+αⱼ−1}

α_n = Σⱼ (αⱼ + yⱼ) = α₀ + n is the total number of (prior plus observed) counts.

The posterior mean (a point estimate) of θⱼ is:

  E(θⱼ|y) = (αⱼ + yⱼ) / (α₀ + n)

i.e., the proportion of counts of the jth outcome among the total counts.

For αⱼ = 1 ∀j: a uniform non-informative prior on all vectors of θ such that Σⱼ θⱼ = 1.

For αⱼ = 0 ∀j: the uniform prior is on the log(θⱼ), with the same restriction. In either case, the posterior is proper if yⱼ ≥ 1 ∀j.

Multinomial model - samples from p(θ|y)

Gamma method: two steps (see the sketch after the election example below):
1. Draw x₁, x₂, ..., x_k independently, with xⱼ ~ Gamma(αⱼ + yⱼ, β) for any common β.
2. Set θⱼ = xⱼ / Σᵢ xᵢ.

Beta method: relies on properties of the Dirichlet:
Marginal p(θⱼ|y) = Beta(αⱼ + yⱼ, α_n − (αⱼ + yⱼ))
Conditional p(θⱼ|θ_{−j}, y) is again Dirichlet

Multinomial model - Example

Pre-election polling in 1988:
n = 1,447 adults in the US
y₁ = 727 supported G. Bush (the elder)
y₂ = 583 supported M. Dukakis
y₃ = 137 supported other or had no opinion

If no other information is available, we can assume that observations are exchangeable given θ. (However, if information on party affiliation is available, it is unreasonable to assume exchangeability.)

Polling is done under a complex survey design. Ignore this for now and assume simple random sampling. Then, (y₁, y₂, y₃) ~ Mult(θ₁, θ₂, θ₃).

Of interest: θ₁ − θ₂, the difference in population support for the two major candidates.

Multinomial model - Example

Non-informative prior, for now: set α₁ = α₂ = α₃ = 1.

The posterior distribution is Dirichlet(728, 584, 138). Then:

  E(θ₁|y) = 0.502,   E(θ₂|y) = 0.403

Other quantities are obtained by simulation (see next).

To derive p(θ₁ − θ₂|y) do:
1. Draw m values of (θ₁, θ₂, θ₃) from the posterior.
2. For each draw, compute θ₁ − θ₂.

Results and program: see the quantiles of the posterior distributions of θ₁ and θ₂, the credible set for θ₁ − θ₂, and Prob(θ₁ > θ₂|y).

[Figures: histograms of the posterior draws of θ₁ (proportion voting for Bush the elder), θ₂ (proportion voting for Dukakis), and of the difference θ₁ − θ₂.]

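A short R sketch of the Gamma method for this example, drawing from the Dirichlet(728, 584, 138) posterior and summarizing θ₁ − θ₂:

# Draw from Dirichlet(728, 584, 138) via independent Gammas (common rate 1)
a <- c(727, 583, 137) + 1                    # alpha_j + y_j with alpha_j = 1
m <- 5000
x <- matrix(rgamma(m * 3, shape = rep(a, each = m)), nrow = m)
theta <- x / rowSums(x)                      # each row is a draw of (theta1, theta2, theta3)
diff <- theta[, 1] - theta[, 2]
quantile(diff, c(0.025, 0.5, 0.975))         # credible set for theta1 - theta2
mean(diff > 0)                               # Prob(theta1 > theta2 | y)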
Example: Bioassay experiment

Does mortality in lab animals increase with increased dose of some drug?

Experiment: 20 animals randomly allocated to four doses (treatments), and the number of dead animals within each dose recorded.

  Dose xᵢ    nᵢ    No. of deaths yᵢ
  -0.863      5          0
  -0.296      5          1
  -0.053      5          3
   0.727      5          5

Animals are exchangeable within dose.

Bioassay example (contd)

Model:

  yᵢ ~ Bin(nᵢ, θᵢ)

The θᵢ are not exchangeable because the probability of death depends on dose.

One way to model this is with a linear relationship:

  θᵢ = α + βxᵢ.

Not a good idea because θᵢ ∈ (0, 1).

Transform θᵢ:

  logit(θᵢ) = log(θᵢ/(1 − θᵢ))

Since logit(θᵢ) ∈ (−∞, +∞), we can use a linear model (logistic regression):

  E[logit(θᵢ)] = α + βxᵢ.

Likelihood: if yᵢ ~ Bin(nᵢ, θᵢ), then

  p(yᵢ|nᵢ, θᵢ) ∝ (θᵢ)^{yᵢ} (1 − θᵢ)^{nᵢ−yᵢ}.

But recall that

  log(θᵢ/(1 − θᵢ)) = α + βxᵢ,

so that

  θᵢ = exp(α + βxᵢ) / (1 + exp(α + βxᵢ)).

Then

  p(yᵢ|α, β, nᵢ, xᵢ) ∝ [exp(α + βxᵢ)/(1 + exp(α + βxᵢ))]^{yᵢ} [1 − exp(α + βxᵢ)/(1 + exp(α + βxᵢ))]^{nᵢ−yᵢ}

Prior: p(α, β) ∝ 1.

Posterior:

  p(α, β|y, n, x) ∝ ∏ᵢ p(yᵢ|α, β, nᵢ, xᵢ).

We first evaluate p(α, β|y, n, x) on the grid

  (α, β) ∈ [−5, 10] × [−10, 40]

and use the inverse cdf method to sample from the posterior.

Bioassay example (contd)

The posterior is evaluated on a 200 × 200 grid of values of (α, β). Entry (j, i) in the grid is p(α, β | α = αᵢ, β = βⱼ, y).

The sum of the column j entries is proportional to p(βⱼ|y) because:

  p(βⱼ|y) = ∫ p(βⱼ, α|y) dα = ∫ p(βⱼ|α, y) p(α|y) dα ≈ Σᵢ p(βⱼ|αᵢ, y) p(αᵢ|y)

Similarly, the sum of the row i entries is proportional to p(αᵢ|y).

We sample from p(α|y), and then from p(β|α, y) (or the other way around).

To sample from p(α|y):
1. Obtain the empirical p(αᵢ|y), i = 1, ..., 200, by summing over the β grid values.
2. Use the inverse cdf method to sample from p(α|y).

To sample from p(β|α, y):
1. Given a draw α*, choose the appropriate column in the grid.
2. Use the inverse cdf method on p(β|α = α*, y).

Bioassay example (contd)

LD50 is the dose at which the probability of death is 0.5, or

  E(yᵢ/nᵢ) = θᵢ = 0.5

so that

  0.5 = logit^{−1}(α + βxᵢ)
  logit(0.5) = α + βxᵢ
  0 = α + βxᵢ

Then the LD50 is computed as xᵢ = −α/β.

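A compact R sketch of this grid evaluation and grid-based sampling for the bioassay posterior under the flat prior on (α, β); the grid limits follow the slide, and sample() with grid probabilities is used in place of an explicit inverse cdf step (the two are equivalent here).

# Grid approximation to p(alpha, beta | y) for the bioassay data, flat prior
x <- c(-0.863, -0.296, -0.053, 0.727); n <- rep(5, 4); y <- c(0, 1, 3, 5)
alpha <- seq(-5, 10, length = 200); beta <- seq(-10, 40, length = 200)
logpost <- matrix(0, 200, 200)
for (i in 1:200) for (j in 1:200) {
  th <- plogis(alpha[i] + beta[j] * x)
  logpost[i, j] <- sum(dbinom(y, n, th, log = TRUE))   # flat prior: posterior = likelihood
}
post <- exp(logpost - max(logpost)); post <- post / sum(post)
# sample alpha from its grid marginal, then beta given alpha
ia <- sample(200, 1000, replace = TRUE, prob = rowSums(post))
ib <- sapply(ia, function(i) sample(200, 1, prob = post[i, ]))
ld50 <- -alpha[ia] / beta[ib]                          # posterior draws of the LD50
quantile(ld50, c(0.025, 0.5, 0.975))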
[Figures: posterior distributions (box plots) of the probability of death θᵢ for each dose, and histograms of the 2000 posterior draws of α and β.]

Posterior distributions of the regression parameters:
Alpha: posterior mean = 1.297, with 95% credible set [-0.365, 3.755]
Beta: posterior mean = 11.52, with 95% credible set [3.572, 25.79]

Posterior distribution of the LD50 in the log scale (histogram of 2000 draws):
Posterior mean: -0.1064
95% credible set: [-0.268, 0.1141]

Multivariate normal model - Intro

Generalization of the univariate normal model, where now observations are d × 1 vectors of measurements.

For the ith sample unit: yᵢ = (y_{i1}, y_{i2}, ..., y_{id})′, and yᵢ ~ N(μ, Σ), with μ and Σ of dimension d × 1 and d × d respectively.

For a sample of n iid observations:

  p(y|μ, Σ) ∝ |Σ|^{−n/2} exp[−(1/2) Σᵢ (yᵢ − μ)′ Σ^{−1} (yᵢ − μ)]
            ∝ |Σ|^{−n/2} exp[−(1/2) Σᵢ tr((yᵢ − μ)′ Σ^{−1} (yᵢ − μ))]
            ∝ |Σ|^{−n/2} exp[−(1/2) tr(Σ^{−1} S)]

with

  S = Σᵢ (yᵢ − μ)(yᵢ − μ)′

S is the d × d matrix of sample squared deviations and cross-deviations from μ.

Multivariate normal model - Known Σ

We want the posterior distribution of the mean, p(μ|y), under the assumption that the variance matrix Σ is known.

The conjugate prior for μ is p(μ|μ₀, Λ₀) = N(μ₀, Λ₀), as in the univariate case.

Posterior for μ:

  p(μ|y, Σ) ∝ exp{−(1/2)[(μ − μ₀)′Λ₀^{−1}(μ − μ₀) + Σᵢ (yᵢ − μ)′Σ^{−1}(yᵢ − μ)]}

which is a quadratic function of μ. Completing the square in the exponent:

  p(μ|y, Σ) = N(μₙ, Λₙ)


with

  μₙ = (Λ₀^{−1} + nΣ^{−1})^{−1} (Λ₀^{−1}μ₀ + nΣ^{−1}ȳ)
  Λₙ = (Λ₀^{−1} + nΣ^{−1})^{−1}

Multivariate normal model - Known Σ

The posterior precision is equal to the sum of prior and sample precisions:

  Λₙ^{−1} = Λ₀^{−1} + nΣ^{−1}

The posterior mean is a weighted average of prior and sample means:

  μₙ = (Λ₀^{−1} + nΣ^{−1})^{−1} (Λ₀^{−1}μ₀ + nΣ^{−1}ȳ)

From the usual properties of the multivariate normal distribution, we can derive the marginal posterior distribution of any subvector of μ or the conditional posterior distribution of μ^(1) given μ^(2).

Let μ′ = (μ^(1)′, μ^(2)′) and, correspondingly, let

  μₙ = (μₙ^(1), μₙ^(2))′   and   Λₙ = [Λₙ^(1,1) Λₙ^(1,2); Λₙ^(2,1) Λₙ^(2,2)]

Multivariate normal model - Known Σ

The marginal of the subvector μ^(1) is

  p(μ^(1)|Σ, y) = N(μₙ^(1), Λₙ^(1,1)).

The conditional of the subvector μ^(1) given μ^(2) is

  p(μ^(1)|μ^(2), Σ, y) = N(μₙ^(1) + β^{1|2}(μ^(2) − μₙ^(2)), Λ^{1|2})

where

  β^{1|2} = Λₙ^(1,2) (Λₙ^(2,2))^{−1}
  Λ^{1|2} = Λₙ^(1,1) − Λₙ^(1,2) (Λₙ^(2,2))^{−1} Λₙ^(2,1)

We recognize β^{1|2} as the regression coefficient in a regression of μ^(1) on μ^(2).


Multivariate normal - Posterior predictive

We seek p(ỹ|y), and note that

  p(ỹ, μ|y) = N(ỹ; μ, Σ) N(μ; μₙ, Λₙ)

The exponential in p(ỹ, μ|y) is a quadratic form in (ỹ, μ), so (ỹ, μ) have a normal joint posterior distribution.

The marginal p(ỹ|y) is also normal, with mean and variance:

  E(ỹ|y) = E(E(ỹ|μ, y)|y) = E(μ|y) = μₙ
  var(ỹ|y) = E(var(ỹ|μ, y)|y) + var(E(ỹ|μ, y)|y)
           = E(Σ|y) + var(μ|y)
           = Σ + Λₙ.

Multivariate normal - Sampling from p(ỹ|y)

With Σ known, to draw a value ỹ from p(ỹ|y) note that

  p(ỹ|y) = ∫ p(ỹ|μ, y) p(μ|y) dμ

Then:
1. Draw μ from p(μ|y) = N(μₙ, Λₙ)
2. Draw ỹ from p(ỹ|μ, y) = N(μ, Σ)

Alternatively (better), draw ỹ directly from

  p(ỹ|y) = N(μₙ, Σ + Λₙ)

Sampling from a multivariate normal

We want to sample y, d × 1, from N(μ, Σ). Two common approaches (see the sketch below).

Using the Cholesky decomposition of Σ:
1. Get A, a lower triangular matrix such that AA′ = Σ
2. Draw (z₁, z₂, ..., z_d) iid N(0, 1) and let z = (z₁, ..., z_d)′
3. Compute y = μ + Az

Using sequential conditional sampling: use the fact that all conditionals in a multivariate normal are also normal.
1. Draw y₁ from N(μ₍₁₎, Σ^(11))
2. Then draw y₂|y₁ from N(μ_{2|1}, Σ_{2|1})
3. Etc.

For example, if d = 2:
1. Draw y₁ from N(μ^(1), σ²_(1))
2. Draw y₂ from N(μ^(2) + (σ^(12)/σ²_(1))(y₁ − μ^(1)), σ²_(2) − (σ^(12))²/σ²_(1))

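A minimal R sketch of the Cholesky approach (the mean vector and covariance matrix below are only illustrative):

# Sample from N(mu, Sigma) via the Cholesky factorization Sigma = A %*% t(A)
mu <- c(1, 2)                              # assumed mean vector
Sigma <- matrix(c(2, 0.8, 0.8, 1), 2, 2)   # assumed covariance matrix
A <- t(chol(Sigma))                        # chol() returns the upper triangle; transpose to get A
z <- rnorm(2)                              # iid N(0, 1) draws
y <- mu + A %*% z                          # one draw from N(mu, Sigma)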

Non-informative prior for μ

A non-informative prior for μ is obtained by letting the prior precision Λ₀^{−1} go to zero.

With the uniform prior, the posterior for μ is proportional to the likelihood.

The posterior is proper only if n > d; otherwise, S is not of full column rank.

If n > d,

  p(μ|Σ, y) = N(ȳ, Σ/n)

Multivariate normal - unknown μ, Σ

The conjugate family of priors for (μ, Σ) is the normal-inverse Wishart family.

The inverse Wishart is the multivariate generalization of the inverse χ² distribution.

If Σ|ν₀, Λ₀ ~ Inv-Wishart_{ν₀}(Λ₀^{−1}) then

  p(Σ|ν₀, Λ₀) ∝ |Σ|^{−(ν₀+d+1)/2} exp(−(1/2) tr(Λ₀Σ^{−1}))

for Σ positive definite and symmetric.

For the Inv-Wishart with ν₀ degrees of freedom and scale Λ₀: E(Σ) = (ν₀ − d − 1)^{−1} Λ₀.

Multivariate normal - unknown μ, Σ

Then the conjugate prior is

  Σ ~ Inv-Wishart_{ν₀}(Λ₀^{−1})
  μ|Σ ~ N(μ₀, Σ/κ₀)

so that

  p(μ, Σ) ∝ |Σ|^{−((ν₀+d)/2+1)} exp(−(1/2) tr(Λ₀Σ^{−1}) − (κ₀/2)(μ − μ₀)′Σ^{−1}(μ − μ₀))

The posterior must also be in the normal-inverse Wishart form. The results from the univariate normal generalize directly:

  Σ|y ~ Inv-Wishart_{νₙ}(Λₙ^{−1})
  μ|Σ, y ~ N(μₙ, Σ/κₙ)

which corresponds to

  μ|y ~ Mult-t_{νₙ−d+1}(μₙ, Λₙ/(κₙ(νₙ − d + 1)))
  ỹ|y ~ Mult-t

Here:

  μₙ = (κ₀/(κ₀ + n)) μ₀ + (n/(κ₀ + n)) ȳ
  κₙ = κ₀ + n
  νₙ = ν₀ + n


  Λₙ = Λ₀ + S + (κ₀n/(κ₀ + n)) (ȳ − μ₀)(ȳ − μ₀)′

  S = Σᵢ (yᵢ − ȳ)(yᵢ − ȳ)′

Multivariate normal - sampling μ, Σ

To sample from p(μ, Σ|y) do: (1) sample Σ from p(Σ|y), (2) sample μ from p(μ|Σ, y).

To sample from p(ỹ|y), use the drawn (μ, Σ) and obtain a draw ỹ from N(μ, Σ).

Sampling from Wishart_ν(S):
1. Draw ν independent d × 1 vectors α₁, α₂, ..., α_ν from a N(0, S)
2. Let Q = Σ_{i=1}^{ν} αᵢαᵢ′

The method works when ν > d. If Q ~ Wishart then Q^{−1} ~ Inv-Wishart.

We have already seen how to sample from a multivariate normal given the mean vector and covariance matrix.

Multivariate normal - Jeffreys prior

If we start from the conjugate normal-inverse Wishart prior and let

  κ₀ → 0,  ν₀ → −1,  |Λ₀| → 0

then the resulting prior is the Jeffreys prior for (μ, Σ):

  p(μ, Σ) ∝ |Σ|^{−(d+1)/2}

Example: bullet lead concentrations

In the US, four major manufacturers produce all bullets fired. One of them is Cascade.

A sample of 200 round-nosed .38 caliber bullets with lead tips was obtained from Cascade.

The concentration of five elements (antimony, copper, arsenic, bismuth, and silver) in the lead alloy was obtained. The data for Cascade are stored in federal.data on the course's web site.

Assuming that the 200 5 × 1 observation vectors for Cascade can be represented by a multivariate normal distribution (perhaps after transformation), we are interested in the posterior distribution of the mean vector and of functions of the covariance parameters.


Particular quantities of interest for inference include:
The mean trace element concentrations (μ₁, ..., μ₅)
The correlations between trace element concentrations (ρ₁₂, ρ₁₃, ..., ρ₄₅)
The largest eigenvalue of the covariance matrix

In the data file, columns correspond to antimony, copper, arsenic, bismuth, and silver, in that order. For this example, the antimony concentration was divided by 100.

Bullet lead example - Model and prior

We concentrate on the correlations that involve copper (four of them).

Sampling distribution:

  y|μ, Σ ~ N(μ, Σ)

We use the conjugate Normal-Inv-Wishart family of priors, and choose parameters for p(Σ|ν₀, Λ₀) first:

  Λ₀,ii = (100, 5000, 15000, 1000, 100)
  Λ₀,ij = 0 ∀ i ≠ j
  ν₀ = 7

For p(μ|Σ, μ₀, κ₀):

  μ₀ = (200, 200, 200, 100, 50)
  κ₀ = 10

Low values for ν₀ and κ₀ suggest little confidence in the prior guesses μ₀ and Λ₀.

We set Λ₀ to be diagonal a priori: we have some information about the variances of the five element concentrations, but no information about their correlations.

Sampling from the posterior distribution: we follow the usual approach:
1. Draw Σ from Inv-Wishart_{νₙ}(Λₙ^{−1})
2. Draw μ from N(μₙ, Σ/κₙ)
3. For each draw, compute:
   (a) the ratio between λ₁ (the largest eigenvalue of Σ) and λ₂ (the second largest eigenvalue of Σ)
   (b) the four correlations ρ_{copper,j} = Σ_{copper,j}/(σ_copper σ_j), for j in {antimony, arsenic, bismuth, silver}

We are interested in the posterior distributions of the five means, the eigenvalue ratio, and the four correlation coefficients.


[Figure: scatter plot (pairs plot) of the posterior draws of the mean concentrations.]

Results from R program

# antimony
c(mean(sample.mu[,1]),sqrt(var(sample.mu[,1])))
[1] 265.237302   1.218594
quantile(sample.mu[,1],probs=c(0.025,0.05,0.5,0.95,0.975))
    2.5%      5.0%     50.0%     95.0%     97.5%
 262.875  263.3408  265.1798   267.403  267.7005
# copper
c(mean(sample.mu[,2]),sqrt(var(sample.mu[,2])))
[1] 259.36335   5.20864
quantile(sample.mu[,2],probs=c(0.025,0.05,0.5,0.95,0.975))
    2.5%      5.0%     50.0%     95.0%     97.5%
249.6674  251.1407  259.5157  268.0606  269.3144
# arsenic
c(mean(sample.mu[,3]),sqrt(var(sample.mu[,3])))
[1] 231.196891   8.390751
quantile(sample.mu[,3],probs=c(0.025,0.05,0.5,0.95,0.975))
    2.5%      5.0%     50.0%     95.0%     97.5%
213.5079  216.7256  231.6443  243.5567  247.1686
# bismuth
c(mean(sample.mu[,4]),sqrt(var(sample.mu[,4])))
[1] 127.248741   1.553327
quantile(sample.mu[,4],probs=c(0.025,0.05,0.5,0.95,0.975))
    2.5%      5.0%     50.0%     95.0%     97.5%
124.3257  124.6953  127.2468  129.7041   130.759
# silver
c(mean(sample.mu[,5]),sqrt(var(sample.mu[,5])))
[1] 38.2072916  0.7918199
quantile(sample.mu[,5],probs=c(0.025,0.05,0.5,0.95,0.975))
    2.5%      5.0%     50.0%     95.0%     97.5%
36.70916  36.93737   38.1971  39.54205  39.73169

[Figure: histogram of the posterior draws of the ratio λ₁/λ₂ of the eigenvalues of Σ.]

Summary statistics of the posterior distribution of the ratio

c(mean(ratio.l),sqrt(var(ratio.l)))
[1] 5.7183875 0.8522507
quantile(ratio.l,probs=c(0.025,0.05,0.5,0.95,0.975))
    2.5%      5.0%     50.0%     95.0%     97.5%
4.336424  4.455404  5.606359  7.203735  7.619699

[Figures: histograms of the posterior draws of the correlations of copper with the other elements.]

Summary statistics for correlations

# j = antimony
c(mean(sample.rho[,1]),sqrt(var(sample.rho[,1])))
[1] 0.50752063 0.05340247
quantile(sample.rho[,1],probs=c(0.025,0.05,0.5,0.95,0.975))
     2.5%       5.0%      50.0%      95.0%      97.5%
0.4014274  0.4191201  0.5076544  0.5901489  0.6047564
# j = arsenic
c(mean(sample.rho[,3]),sqrt(var(sample.rho[,3])))
[1] -0.56623609  0.04896403
quantile(sample.rho[,3],probs=c(0.025,0.05,0.5,0.95,0.975))
      2.5%        5.0%      50.0%       95.0%      97.5%
-0.6537461  -0.6448817  -0.570833  -0.4857224  -0.465808
# j = bismuth
c(mean(sample.rho[,4]),sqrt(var(sample.rho[,4])))
[1] 0.63909311 0.04087149
quantile(sample.rho[,4],probs=c(0.025,0.05,0.5,0.95,0.975))
     2.5%       5.0%      50.0%      95.0%      97.5%
0.5560137  0.5685715  0.6429459  0.7013519  0.7135556
# j = silver
c(mean(sample.rho[,5]),sqrt(var(sample.rho[,5])))
[1] 0.03642082 0.07232010
quantile(sample.rho[,5],probs=c(0.025,0.05,0.5,0.95,0.975))
       2.5%        5.0%       50.0%     95.0%      97.5%
-0.09464575  -0.0831816  0.03370379  0.164939  0.1765523


S-Plus code

# Cascade example
# We need to define two functions, one to generate random vectors from
# a multivariate normal distribution and the other to generate random
# matrices from a Wishart distribution.
# Note that a function that generates random matrices from an
# inverse Wishart distribution does not need to be defined,
# since if W ~ Wishart(S) then W^(-1) ~ Inv-Wishart(S^(-1))
#
# A function that generates random observations from a
# Multivariate Normal distribution: rmnorm(a,b)
# the parameters of the function are
# a = a column vector of kx1
# b = a positive definite kxk matrix
#
rmnorm <- function(a,b) {
  k <- nrow(b)
  zz <- t(chol(b)) %*% matrix(rnorm(k),nrow=k) + a
  zz }
# A function that generates random observations from a
# Wishart distribution: rwishart(a,b)
# the parameters of the function are
# a = df
# b = a positive definite kxk matrix
# Note a must be > k
rwishart <- function(a,b) {
  k <- ncol(b)
  m <- matrix(0,nrow=k,ncol=1)
  cc <- matrix(0,nrow=a,ncol=k)
  for (i in 1:a) { cc[i,] <- rmnorm(m,b) }
  w <- t(cc) %*% cc
  w }
#
# Read the data
#
y <- scan(file="cascade.data")
y <- matrix(y,ncol=5,byrow=T)
y[,1] <- y[,1]/100
means <- rep(0,5)
for (i in 1:5){ means[i] <- mean(y[,i]) }
means <- matrix(means,ncol=1,byrow=T)
n <- nrow(y)
#
# Assign values to the prior parameters
#
v0 <- 7
Delta0 <- diag(c(100,5000,15000,1000,100))
k0 <- 10
mu0 <- matrix(c(200,200,200,100,50),ncol=1,byrow=T)
#
# Calculate the values of the parameters of the posterior
#
mu.n <- (k0*mu0 + n*means)/(k0+n)
v.n <- v0 + n
k.n <- k0 + n
ones <- matrix(1,nrow=n,ncol=1)
S <- t(y - ones %*% t(means)) %*% (y - ones %*% t(means))
Delta.n <- Delta0 + S + (k0*n/(k0+n)) * (means-mu0) %*% t(means-mu0)
#
# Draw Sigma and mu from their posteriors
#
samplesize <- 200
lambda.samp <- matrix(0,nrow=samplesize,ncol=5)  # this matrix will store the
                                                 # eigenvalues of Sigma
rho.samp <- matrix(0,nrow=samplesize,ncol=5)
mu.samp <- matrix(0,nrow=samplesize,ncol=5)
for (j in 1:samplesize) {
  # Sigma
  SS <- solve(Delta.n)
  # The following makes sure that SS is symmetric
  for (pp in 1:5) { for (jj in 1:5) {
    if(pp<jj){SS[pp,jj] <- SS[jj,pp]} } }
  Sigma <- solve(rwishart(v.n,SS))
  # The following makes sure that Sigma is symmetric
  for (pp in 1:5) { for (jj in 1:5) {
    if(pp<jj){Sigma[pp,jj] <- Sigma[jj,pp]} } }
  # Eigenvalues of Sigma
  lambda.samp[j,] <- eigen(Sigma)$values
  # Correlation coefficients
  for (pp in 1:5){
    rho.samp[j,pp] <- Sigma[pp,2]/sqrt(Sigma[pp,pp]*Sigma[2,2]) }
  # mu
  mu.samp[j,] <- rmnorm(mu.n,Sigma/k.n)
}
# Rename the stored draws to the names used in the summaries below
sample.l <- lambda.samp
sample.rho <- rho.samp
sample.mu <- mu.samp

# Graphics and summary statistics
sink("cascade.output")
# Calculate the ratio between max(eigenvalues of Sigma)
# and max({eigenvalues of Sigma}\{max(eigenvalues of Sigma)})
#
ratio.l <- sample.l[,1]/sample.l[,2]
# "Histogram of the draws of the ratio"
postscript("cascade_lambda.eps",height=5,width=6)
hist(ratio.l,nclass=30,axes = F,
     xlab="eigenvalue1/eigenvalue2",xlim=c(3,9))
axis(1)
dev.off()
# Summary statistics of the ratio of the eigenvalues
c(mean(ratio.l),sqrt(var(ratio.l)))
quantile(ratio.l,probs=c(0.025,0.05,0.5,0.95,0.975))
#
# correlations with copper
#
postscript("cascade_corr.eps",height=8,width=8)
par(mfrow=c(2,2))
# "Histogram of the draws of corr(antimony,copper)"
hist(sample.rho[,1],nclass=30,axes = F,xlab="corr(antimony,copper)",
     xlim=c(0.3,0.7))
axis(1)
# "Histogram of the draws of corr(arsenic,copper)"
hist(sample.rho[,3],nclass=30,axes = F,xlab="corr(arsenic,copper)")
axis(1)
# "Histogram of the draws of corr(bismuth,copper)"
hist(sample.rho[,4],nclass=30,axes = F,xlab="corr(bismuth,copper)",
     xlim=c(0.5,.8))
axis(1)
# "Histogram of the draws of corr(silver,copper)"
hist(sample.rho[,5],nclass=25,axes = F,xlab="corr(silver,copper)",
     xlim=c(-.2,.3))
axis(1)
dev.off()
# Summary statistics of the dsn. of correlation of copper with
# antimony
c(mean(sample.rho[,1]),sqrt(var(sample.rho[,1])))
quantile(sample.rho[,1],probs=c(0.025,0.05,0.5,0.95,0.975))
# arsenic
c(mean(sample.rho[,3]),sqrt(var(sample.rho[,3])))
quantile(sample.rho[,3],probs=c(0.025,0.05,0.5,0.95,0.975))
# bismuth
c(mean(sample.rho[,4]),sqrt(var(sample.rho[,4])))
quantile(sample.rho[,4],probs=c(0.025,0.05,0.5,0.95,0.975))
# silver
c(mean(sample.rho[,5]),sqrt(var(sample.rho[,5])))
quantile(sample.rho[,5],probs=c(0.025,0.05,0.5,0.95,0.975))
#
# Means
#
mulabels <- c("Antimony","Copper","Arsenic","Bismuth","Silver")
postscript("cascade_means.eps",height=8,width=8)
pairs(sample.mu,labels=mulabels)
dev.off()
# Summary statistics of the dsn. of mean of
# antimony
c(mean(sample.mu[,1]),sqrt(var(sample.mu[,1])))
quantile(sample.mu[,1],probs=c(0.025,0.05,0.5,0.95,0.975))
# copper
c(mean(sample.mu[,2]),sqrt(var(sample.mu[,2])))
quantile(sample.mu[,2],probs=c(0.025,0.05,0.5,0.95,0.975))
# arsenic
c(mean(sample.mu[,3]),sqrt(var(sample.mu[,3])))
quantile(sample.mu[,3],probs=c(0.025,0.05,0.5,0.95,0.975))
# bismuth
c(mean(sample.mu[,4]),sqrt(var(sample.mu[,4])))
quantile(sample.mu[,4],probs=c(0.025,0.05,0.5,0.95,0.975))
# silver
c(mean(sample.mu[,5]),sqrt(var(sample.mu[,5])))
quantile(sample.mu[,5],probs=c(0.025,0.05,0.5,0.95,0.975))

q()


Advanced Computation

Approximations based on posterior modes
Simulation from posterior distributions
Markov chain simulation

Why do we need advanced computational methods? Except for simpler cases, computation is not possible with available methods:
Logistic regression with random effects
Normal-normal model with unknown sampling variances σⱼ²
Poisson-lognormal hierarchical model for counts

Strategy for computation

If possible, work in the log-posterior scale.

Factor the posterior distribution: p(γ, φ|y) = p(γ|φ, y) p(φ|y)
Reduces to a lower-dimensional problem
May be able to sample on a grid
Helps to identify parameters most influenced by the prior

Re-parametrizing sometimes helps:
Create parameters with easier interpretation
Permit normal approximation (e.g., log of variance or log of Poisson rate or logit of probability)

Normal approximation to the posterior

It is often reasonable to approximate a complicated posterior distribution using a normal (or a mixture of normals) approximation. The approximation may be a good starting point for more sophisticated methods.

Computational strategy:
1. Find the joint posterior mode or the mode of the marginal posterior distributions (a better strategy, if possible)
2. Fit normal approximations at the mode (or use a mixture of normals if the posterior is multimodal)

Notation:
p(θ|y): joint posterior of interest (target distribution)
q(θ|y): un-normalized density, typically p(θ)p(y|θ)
θ = (γ, φ), with φ typically lower dimensional than γ

Practical advice: computations are often easier with log posteriors than with posteriors.


Finding posterior modes

To find the mode of a density, we maximize the function with respect to the parameter(s): an optimization problem.

Modes are not interesting per se, but provide the first step for analytically approximating a density.

Modes are easier to find than means: no integration, and we can work with the un-normalized density.

Multi-modal posteriors pose a problem: the only way to find multiple modes is to run mode-finding algorithms several times, starting from different locations in parameter space.

Whenever possible, shoot for finding the modes of the marginal and conditional posteriors. If θ = (γ, φ), and

  p(γ, φ|y) = p(γ|φ, y) p(φ|y)

then
1. Find the mode φ̂
2. Then find γ̂ by maximizing p(γ|φ = φ̂, y)

Many algorithms exist to find modes. The most popular include Newton-Raphson and Expectation-Maximization (EM).

Taylor approximation to p(θ|y)

A second-order expansion of log p(θ|y) around the mode θ̂ is

  log p(θ|y) ≈ log p(θ̂|y) + (1/2)(θ − θ̂)′ [d²/dθ² log p(θ|y)]_{θ=θ̂} (θ − θ̂).

The linear term in the expansion vanishes because the log posterior has a zero derivative at the mode.

Considering the log p(θ|y) approximation as a function of θ, we see that:
1. log p(θ̂|y) is constant, and
2. (1/2)(θ − θ̂)′ [d²/dθ² log p(θ|y)]_{θ=θ̂} (θ − θ̂) is proportional to the logarithm of a normal density.

Then, for large n and θ̂ in the interior of the parameter space:

  p(θ|y) ≈ N(θ̂, [I(θ̂)]^{−1})

with

  I(θ) = −(d²/dθ²) log p(θ|y),

the observed information matrix.

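A small R sketch of this normal approximation: find the mode of a log posterior numerically and use the inverse of the negative Hessian at the mode as the variance. The log posterior below (a Gamma(8, 2) density for a rate, on the log scale) is only an illustrative stand-in for a real target.

# Normal approximation at the posterior mode (illustrative log posterior)
neg.log.post <- function(theta) -dgamma(theta, shape = 8, rate = 2, log = TRUE)  # assumed target
fit <- optim(par = 3, fn = neg.log.post, method = "BFGS", hessian = TRUE)
theta.hat <- fit$par                 # posterior mode
post.var <- solve(fit$hessian)       # [I(theta.hat)]^{-1}: inverse observed information
c(theta.hat, sqrt(post.var))         # mode and approximate posterior standard deviation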

Normal and normal-mixture approximation

For one mode, we saw that

  p_normal-approx(θ|y) = N(θ̂, V_θ)

with

  V_θ = [−L''(θ̂)]^{−1} or [I(θ̂)]^{−1}

L''(θ̂) is called the curvature of the log posterior at the mode.

Suppose now that p(θ|y) has K modes. The approximation to p(θ|y) is now a mixture of normals:

  p_normal-approx(θ|y) ∝ Σ_k ω_k N(θ̂_k, V_k)

The ω_k reflect the relative mass associated with each mode. It can be shown that ω_k = q(θ̂_k|y) |V_k|^{1/2}.

The mixture-normal approximation is then

  p_normal-approx(θ|y) ∝ Σ_k q(θ̂_k|y) exp{−(1/2)(θ − θ̂_k)′ V_k^{−1} (θ − θ̂_k)}.

A more robust approximation can be obtained by using a t kernel instead of the normal kernel:

  p_t-approx(θ|y) ∝ Σ_k q(θ̂_k|y) [ν + (θ − θ̂_k)′ V_k^{−1} (θ − θ̂_k)]^{−(d+ν)/2},

with d the dimension of θ and ν relatively low. For most problems, ν = 4 works well.

Sampling from a mixture approximation

To sample from the normal-mixture approximation:
1. First choose one of the K normal components, using the relative probability masses ωk as multinomial probabilities.
2. Given a component, draw a value θ from either the normal or the t density.

Reasons not to use the normal approximation:
  When the mode of a parameter is near the edge of the parameter space (e.g., τ in the SAT example)
  When even a transformation of the parameter does not make the normal approximation reasonable.
  We can do better than the normal approximation using more advanced methods.

Simulation from posterior - Rejection sampling

The idea is to draw values from p(θ|y), perhaps by making use of an instrumental or auxiliary distribution g(θ|y) from which it is easier to sample.

The target density p(θ|y) need only be known up to the normalizing constant.

Let q(θ|y) = p(θ)p(y|θ) be the un-normalized posterior distribution, so that

  p(θ|y) = q(θ|y) / ∫ p(θ)p(y|θ) dθ.

Simpler notation:
1. Target density: f(x)



2. Instrumental density: g(x)
3. Constant M such that f(x) ≤ M g(x) for all x

The following algorithm produces a variable Y that is distributed according to f:
1. Generate X ~ g and U ~ U[0,1]
2. Set Y = X if U ≤ f(X) / [M g(X)]
3. Reject the draw otherwise.

Proof: The distribution function of Y is given by:

  P(Y ≤ y) = P( X ≤ y | U ≤ f(X)/[M g(X)] )
           = P( X ≤ y, U ≤ f(X)/[M g(X)] ) / P( U ≤ f(X)/[M g(X)] ).

Then

  P(Y ≤ y) = [ ∫_{-∞}^{y} ∫_0^{f(x)/[M g(x)]} du g(x) dx ] / [ ∫_{-∞}^{∞} ∫_0^{f(x)/[M g(x)]} du g(x) dx ]
           = [ M⁻¹ ∫_{-∞}^{y} f(x) dx ] / [ M⁻¹ ∫ f(x) dx ].

Since the last expression equals ∫_{-∞}^{y} f(x) dx, we have proven the result.

In the Bayesian context, q(θ|y) (the un-normalized posterior) plays the role of f(x) above.

When both f and g are normalized densities:
1. The probability of accepting a draw is 1/M, and the expected number of draws until one is accepted is M.


2. For each f , there will be many instrumental densities g1, g2, ....
Choose the g that requires smallest bound M
3. M is necessarily larger than 1, and will approach minimum value 1
when g closely imitates f .

In general, g needs to have thicker tails than f for f /g to remain


bounded for all x.

Cannot use a normal g to generate values from a Cauchy f .

Can do the opposite, however.

Rejection sampling can be used within other algorithms, such as the


Gibbs sampler (see later).
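As an illustration (not part of the original notes), a minimal R sketch of rejection sampling: drawing from a Beta(3,7)-shaped un-normalized target using a uniform instrumental density; the target and bound here are made up for the example.

  # Sketch: rejection sampling with an un-normalized target q() and uniform g().
  q <- function(th) th^2 * (1 - th)^6          # un-normalized Beta(3,7) kernel
  g_draw <- function(n) runif(n)               # instrumental density g = U(0,1)
  M <- optimize(q, c(0, 1), maximum = TRUE)$objective  # bound: max of q/g, with g = 1

  rejection_sample <- function(n) {
    out <- numeric(0)
    while (length(out) < n) {
      x <- g_draw(n)
      u <- runif(n)
      out <- c(out, x[u <= q(x) / M])          # accept if u <= q(x) / (M g(x))
    }
    out[1:n]
  }
  draws <- rejection_sample(2000)
  mean(draws)     # should be close to 3/10, the Beta(3,7) mean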



Rejection sampling - Getting M

Finding the bound M can be a problem, but consider the following implementation approach, recalling that q(θ|y) = p(θ)p(y|θ).

  Let g(θ) = p(θ) and draw a value θ* from the prior.
  Draw U ~ U(0, 1).
  Let M = p(y|θ̂), where θ̂ is the mode of p(y|θ).
  Accept the draw if

    u ≤ q(θ*|y) / [M p(θ*)] = p(θ*)p(y|θ*) / [p(y|θ̂)p(θ*)] = p(y|θ*) / p(y|θ̂).

  Those θ* from the prior that are likely under the likelihood are kept in the posterior sample.

Importance sampling

Also known as SIR (Sampling Importance Resampling), the method is no more than a weighted bootstrap.

Suppose we have a sample of draws (θ1, θ2, ..., θn) from the proposal distribution g(θ). Can we convert it to a sample from q(θ|y)?

For each θi, compute

  ωi = q(θi|y) / g(θi)   and   wi = ωi / Σj ωj.

Draw θ* from the discrete distribution over {θ1, ..., θn} with weight wi on θi, without replacement.


The sample of re-sampled θ's is approximately a sample from q(θ|y).

The size of the re-sample can be as large as desired.

The more g resembles q, the smaller the sample size that is needed for the re-sample to approximate the target distribution p(θ|y) well.

A consistent estimator of the normalizing constant is n⁻¹ Σi ωi.

Proof: suppose that θ is univariate (for convenience). Then:

  Pr(θ* ≤ a) = Σ_{i=1}^{n} wi 1_{(-∞,a]}(θi)
             = [ n⁻¹ Σi ωi 1_{(-∞,a]}(θi) ] / [ n⁻¹ Σi ωi ]
             → E_g[ (q(θ|y)/g(θ)) 1_{(-∞,a]}(θ) ] / E_g[ q(θ|y)/g(θ) ]
             = ∫_{-∞}^{a} q(θ|y) dθ / ∫ q(θ|y) dθ
             = ∫_{-∞}^{a} p(θ|y) dθ.



Importance sampling in a different context

Suppose that we wish to estimate E[h(θ)] for θ ~ p(θ|y) (e.g., a posterior mean).

If N values θi can be drawn from p(θ|y), we can compute the Monte Carlo estimate:

  E[h(θ)] ≈ (1/N) Σi h(θi).

Sampling from p(θ|y) may be difficult. But note:

  E[h(θ)|y] = ∫ [ h(θ) p(θ|y) / g(θ) ] g(θ) dθ.

Generate values θ1, ..., θN from g(θ), and estimate

  E[h(θ)] ≈ (1/N) Σi h(θi) p(θi|y) / g(θi).

The w(θi) = p(θi|y)/g(θi) are the same importance weights from earlier.

This will not work if the tails of g are short relative to the tails of p.


Importance sampling (cont'd)

Importance sampling is most often used to improve the normal approximation to the posterior.

Example: suppose that you have a normal (or t) approximation to the posterior and use it to draw a sample that you hope approximates a sample from p(θ|y).

Importance sampling can be used to improve the sample (see the sketch after this list):
1. Obtain a large sample of size L: (θ1, ..., θL) from the approximation g(θ).
2. Construct importance weights w(θl) = q(θl|y)/g(θl).
3. Sample k < L values from (θ1, ..., θL) with probability proportional to the weights, without replacement.

Why without replacement? If the weights are approximately constant, we can re-sample with replacement. If some weights are large, the re-sample would favor values of θ with large weights repeatedly unless we sample without replacement.

It is difficult to determine whether importance sampling draws approximate draws from the posterior.

Diagnostic: monitor the weights and look for outliers.

Note: draws from g(θ) can be reused for other posteriors p(θ|y)! Given draws (θ1, ..., θm) from g(θ), we can investigate sensitivity to the prior by re-computing the importance weights. For posteriors p1(θ|y) and p2(θ|y), compute (for the same draws of θ)

  wj(θi) = pj(θi|y) / g(θi),   j = 1, 2.

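A minimal R sketch (not from the original notes) of importance re-sampling, using a t approximation as g() and a made-up un-normalized target q(); the target and all names are illustrative only.

  # Sketch: sampling importance re-sampling (SIR) with a t proposal.
  # log_q() below is an arbitrary un-normalized log target chosen for illustration.
  log_q <- function(th) dnorm(th, 1, 0.7, log = TRUE) + dnorm(th, -1, 0.7, log = TRUE)
  L <- 10000
  prop <- rt(L, df = 4)                              # draws from g = t4
  log_w <- log_q(prop) - dt(prop, df = 4, log = TRUE)
  w <- exp(log_w - max(log_w))                       # stabilize before normalizing
  w <- w / sum(w)
  keep <- sample(seq_len(L), size = 1000, replace = FALSE, prob = w)  # re-sample
  post_draws <- prop[keep]
  mean(post_draws)                                   # approximate posterior mean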


Markov chain Monte Carlo Markov chains

Methods based on stochastic process theory, useful for approximating A process Xt in discrete time t = 0, 1, 2, ..., T where
target posterior distributions.
E(Xt|X0, X1, ..., Xt1) = E(Xt|Xt1)
Iterative: must decide when convergence has happened
is called a Markov chain.
Quite general, can be implemented where other methods fail

Can be used even in high dimensional examples A Markov chain is irreducible if it is possible to reach all states from
any other state:
Three methods: Metropolis-Hastings, Metropolis and Gibbs sampler.
  p_n(j|i) > 0 and p_m(i|j) > 0 for some m, n > 0,
M and GS special cases of M-H.
where pn(j|i) denotes probability of getting to state j from state i in
n steps.


A Markov chain is periodic with period d if pn(i|i) = 0 unless n = kd Properties of Markov chains (contd)
for some integer k. If d = 2, the chain returns to i in cycles of 2k
steps.
Ergodicity: If a MC is irreducible and aperiodic and has stationary
If d = 1, the chain is aperiodic. distribution then we have ergodicity:

Theorem: Finite-state, irreducible, aperiodic Markov chains have a 1X


an = a(Xt)
limiting distribution: limn pn(j|i) = , with pn(j|i) the probability n t
that we reach j from i after n steps or transitions. E{a(X)} as n .

an is an ergodic average.

Also, rate of convergence can be calculated and is geometric.



Numerical standard error

The sequence {X1, X2, ..., Xn} is not iid.

The asymptotic standard error of ā_n is approximately

  sqrt{ (σ_a²/n) [ 1 + 2 Σi ρi(a) ] },

where ρi is the ith lag autocorrelation of the function a(Xt).

The first term σ_a²/n is the usual sampling variance under iid sampling.

The second term is a penalty for the fact that the sample is not iid and is usually bigger than 1.

Often, the standard error is computed assuming a finite number of lags.

Markov chain simulation

Idea: Suppose that sampling from p(θ|y) is hard, but that we can generate (somehow) a Markov chain {θ(t), t ∈ T} with stationary distribution p(θ|y).

The situation is different from the usual stochastic process case:
  Here we know the stationary distribution.
  We seek an algorithm to transition from θ(t) to θ(t+1) that will take us to the stationary distribution.

Idea: start from some initial guess θ0 and let the chain run for n steps (n large), so that it reaches its stationary distribution.

After convergence, all additional steps in the chain are draws from the stationary distribution p(θ|y).

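A small R sketch (not in the original notes) of the autocorrelation-adjusted standard error described above, truncating the sum at a finite number of lags; the AR(1) series is a made-up stand-in for MCMC output.

  # Sketch: numerical standard error of an ergodic average, truncated at K lags.
  set.seed(1)
  x <- arima.sim(list(ar = 0.8), n = 5000)      # stand-in for correlated MCMC draws
  K <- 50
  rho <- acf(x, lag.max = K, plot = FALSE)$acf[-1]   # lags 1..K
  n <- length(x)
  se_iid  <- sd(x) / sqrt(n)                    # ignores autocorrelation
  se_mcmc <- sqrt(var(x) / n * (1 + 2 * sum(rho)))   # adjusted standard error
  c(se_iid = se_iid, se_mcmc = se_mcmc)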

MCMC methods are all based on the same idea; the difference is just in how the transitions in the MC are created.

In MCMC simulation, we generate at least one MC for each parameter in the model. Often, more than one (independent) chain for each parameter.

The Gibbs Sampler

An iterative algorithm that produces Markov chains with joint stationary distribution p(θ|y) by cycling through all possible conditional posterior distributions.

Example: suppose that θ = (θ1, θ2, θ3), and that the target distribution is p(θ|y). The steps in the Gibbs sampler are:
1. Start with a guess (θ1(0), θ2(0), θ3(0))
2. Draw θ1(1) from p(θ1 | θ2 = θ2(0), θ3 = θ3(0), y)
3. Draw θ2(1) from p(θ2 | θ1 = θ1(1), θ3 = θ3(0), y)
4. Draw θ3(1) from p(θ3 | θ1 = θ1(1), θ2 = θ2(1), y)

The steps above complete one iteration of the GS.

Repeat the steps above n times; after convergence (see later), the draws (θ(n+1), θ(n+2), ...) are a sample from the stationary distribution p(θ|y).



The Gibbs Sampler (cont'd) - Example: Gibbs sampling in the bivariate normal

Baby example: just as an illustration, we consider the trivial problem of sampling from the posterior distribution of a bivariate normal mean vector.

Suppose that we have a single observation vector (y1, y2), where

  (y1, y2)' ~ N( (θ1, θ2)', [ 1  ρ ; ρ  1 ] ).     (1)

With a uniform prior on (θ1, θ2), the joint posterior distribution is

  (θ1, θ2)' | y ~ N( (y1, y2)', [ 1  ρ ; ρ  1 ] ).     (2)

The conditional distributions are

  θ1 | θ2, y ~ N( y1 + ρ(θ2 - y2), 1 - ρ² )
  θ2 | θ1, y ~ N( y2 + ρ(θ1 - y1), 1 - ρ² )

In this case we would not need the GS; this is just for illustration.

See figure: y1 = 0, y2 = 0, and ρ = 0.8. We generate four independent chains, starting from four different corners of the 2-dimensional parameter space. [Figure: chain paths after 50 iterations.]

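A minimal R sketch (not from the original notes) of this Gibbs sampler with y1 = y2 = 0 and ρ = 0.8:

  # Sketch: Gibbs sampler for the bivariate normal posterior above.
  rho <- 0.8; y1 <- 0; y2 <- 0
  n_iter <- 1000
  theta <- matrix(NA, n_iter, 2)
  th1 <- 2.5; th2 <- -2.5                      # one arbitrary starting corner
  for (t in 1:n_iter) {
    th1 <- rnorm(1, y1 + rho * (th2 - y2), sqrt(1 - rho^2))  # theta1 | theta2, y
    th2 <- rnorm(1, y2 + rho * (th1 - y1), sqrt(1 - rho^2))  # theta2 | theta1, y
    theta[t, ] <- c(th1, th2)
  }
  colMeans(theta[-(1:200), ])   # posterior means after a burn-in of 200 draws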


[Figure: draws after 1000 iterations, with the path lines removed.]

The non-conjugate normal example

Let y ~ N(μ, σ²), with (μ, σ²) unknown.

Consider the semi-conjugate prior:

  μ ~ N(μ0, τ0²)
  σ² ~ Inv-χ²(ν0, σ0²).

The joint posterior distribution is

  p(μ, σ²|y) ∝ (σ²)^{-n/2} exp{ -(1/(2σ²)) Σi (yi - μ)² }
              × exp{ -(1/(2τ0²)) (μ - μ0)² }
              × (σ²)^{-(ν0/2 + 1)} exp{ -ν0 σ0² / (2σ²) }.


Non-conjugate example (cont'd)

Derive the full conditional distributions p(μ|σ², y) and p(σ²|μ, y).

For μ:

  p(μ|σ², y) ∝ exp{ -(1/2) [ (1/σ²) Σi (yi - μ)² + (1/τ0²) (μ - μ0)² ] }
             ∝ exp{ -(1/(2σ²τ0²)) [ (nτ0² + σ²) μ² - 2 (nτ0² ȳ + σ² μ0) μ ] }
             ∝ exp{ -((nτ0² + σ²)/(2σ²τ0²)) [ μ - (nτ0² ȳ + σ² μ0)/(nτ0² + σ²) ]² }
             = N(μn, τn²),

where

  μn = [ (n/σ²) ȳ + (1/τ0²) μ0 ] / [ n/σ² + 1/τ0² ]   and   τn² = 1 / [ n/σ² + 1/τ0² ].



Non-conjugate normal example (cont'd)

The full conditional for σ² is

  p(σ²|μ, y) ∝ (σ²)^{-((n + ν0)/2 + 1)} exp{ -(1/(2σ²)) [ Σi (yi - μ)² + ν0 σ0² ] }
             ∝ Inv-χ²(νn, σn²),

where

  νn = ν0 + n,   σn² = (n S + ν0 σ0²)/νn,   and   S = n⁻¹ Σi (yi - μ)².

For the non-conjugate normal model, Gibbs sampling consists of the following steps:
1. Start with a guess for σ², call it σ²(0).
2. Draw μ(1) from a normal with mean μn(σ²(0)) and variance τn²(σ²(0)), where

     μn(σ²(0)) = [ (n/σ²(0)) ȳ + (1/τ0²) μ0 ] / [ n/σ²(0) + 1/τ0² ]
     τn²(σ²(0)) = 1 / [ n/σ²(0) + 1/τ0² ].

3. Next draw σ²(1) from Inv-χ²(νn, σn²(μ(1))), where

     σn²(μ(1)) = [ n S²(μ(1)) + ν0 σ0² ] / νn,   S²(μ(1)) = n⁻¹ Σi (yi - μ(1))².

4. Repeat many times.

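A minimal R sketch (not from the original notes) of this Gibbs sampler; the data and prior settings below are made up for illustration.

  # Sketch: Gibbs sampler for y ~ N(mu, sigma2) with the semi-conjugate prior above.
  set.seed(2)
  y <- rnorm(30, mean = 5, sd = 2)                 # made-up data
  n <- length(y); ybar <- mean(y)
  mu0 <- 0; tau0sq <- 100; nu0 <- 1; s0sq <- 1      # illustrative prior settings
  n_iter <- 5000
  mu <- numeric(n_iter); sig2 <- numeric(n_iter)
  sig2_cur <- var(y)                                # starting guess for sigma^2
  for (t in 1:n_iter) {
    # mu | sigma^2, y
    prec <- n / sig2_cur + 1 / tau0sq
    mu_cur <- rnorm(1, (n * ybar / sig2_cur + mu0 / tau0sq) / prec, sqrt(1 / prec))
    # sigma^2 | mu, y (scaled inverse chi-square drawn via a chi-square variate)
    nun <- nu0 + n
    s2n <- (nu0 * s0sq + sum((y - mu_cur)^2)) / nun
    sig2_cur <- nun * s2n / rchisq(1, df = nun)
    mu[t] <- mu_cur; sig2[t] <- sig2_cur
  }
  c(mean(mu), mean(sig2))                           # posterior means (after discarding a burn-in, ideally)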

A more interesting example

Poisson counts with a change point: consider a sample of n counts (y1, ..., yn) distributed as Poisson with some rate, and suppose that the rate changes, so that

  For i = 1, ..., m :       yi ~ Poisson(λ)
  For i = m + 1, ..., n :   yi ~ Poisson(φ)

and m is unknown.

Priors on (λ, φ, m):

  λ ~ Gamma(α, β)
  φ ~ Gamma(γ, δ)
  m ~ Uniform on {1, 2, ..., n}



Joint posterior distribution:

  p(λ, φ, m|y) ∝ Πi=1..m e^{-λ} λ^{yi} · Πi=m+1..n e^{-φ} φ^{yi} · λ^{α-1} e^{-βλ} · φ^{γ-1} e^{-δφ}
              ∝ λ^{y1* + α - 1} e^{-λ(m+β)} · φ^{y2* + γ - 1} e^{-φ(n-m+δ)},

with y1* = Σi=1..m yi and y2* = Σi=m+1..n yi.

Note: if we knew m, the problem would be trivial to solve. Thus Gibbs, where we condition on all other parameters to construct the Markov chains, appears to be the right approach.

To implement Gibbs sampling, we need the full conditional distributions. Pick and choose the pieces of the joint posterior that depend on each parameter.

Conditional of λ:

  p(λ|m, φ, y) ∝ λ^{Σi=1..m yi + α - 1} exp(-λ(m + β))
               ∝ Gamma(α + Σi=1..m yi, m + β)

Conditional of φ:

  p(φ|λ, m, y) ∝ φ^{Σi=m+1..n yi + γ - 1} exp(-φ(n - m + δ))
               ∝ Gamma(γ + Σi=m+1..n yi, n - m + δ)

Conditional of m = 1, 2, ..., n:

  p(m|λ, φ, y) = c⁻¹ q(m|λ, φ)



  q(m|λ, φ) = λ^{y1* + α - 1} exp(-λ(m + β)) φ^{y2* + γ - 1} exp(-φ(n - m + δ)).

Note that all terms in the joint posterior depend on m, so the conditional of m is proportional to the joint posterior.

The distribution does not look like any standard form, so we cannot sample from it directly. We need the normalizing constant, which is easy to obtain for relatively small n in this discrete case:

  c = Σk=1..n q(k|λ, φ, y).

To sample from p(m|λ, φ, y) we can use the inverse cdf method on a grid, or other methods.

Non-standard distributions

It may happen that one or more of the full conditionals is not a standard distribution. What to do then?
  Try direct simulation: grid approximation, rejection sampling
  Try an approximation: normal or t approximation, which needs the mode at each iteration (see later)
  Try more general Markov chain algorithms: Metropolis or Metropolis-Hastings.

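A minimal R sketch (not from the original notes) of the Poisson change-point Gibbs sampler described above; the simulated data and hyperparameters are made up for illustration, and m is restricted to 1..n-1 so both segments are non-empty.

  # Sketch: Gibbs sampler for the Poisson change-point model.
  set.seed(3)
  y <- c(rpois(40, 3), rpois(60, 8))        # made-up data: true change point at m = 40
  n <- length(y)
  a <- b <- g <- d <- 1                      # Gamma hyperparameters (alpha, beta, gamma, delta)
  n_iter <- 5000
  out <- matrix(NA, n_iter, 3, dimnames = list(NULL, c("lambda", "phi", "m")))
  m <- n %/% 2                               # starting guess
  for (t in 1:n_iter) {
    lam <- rgamma(1, a + sum(y[1:m]), m + b)
    phi <- rgamma(1, g + sum(y[(m + 1):n]), n - m + d)
    # discrete conditional for m, computed on the log scale for stability
    lq <- sapply(1:(n - 1), function(k)
      sum(dpois(y[1:k], lam, log = TRUE)) + sum(dpois(y[(k + 1):n], phi, log = TRUE)))
    m <- sample(1:(n - 1), 1, prob = exp(lq - max(lq)))
    out[t, ] <- c(lam, phi, m)
  }
  apply(out[-(1:1000), ], 2, mean)           # posterior means after burn-in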


Metropolis-Hastings algorithm

More flexible transition kernel: rather than requiring sampling from conditional distributions, M-H permits many other proposal densities.

Idea: instead of drawing sequentially from the conditionals as in Gibbs, M-H jumps around the parameter space.

The algorithm is the following:
1. Given a draw θ(t) in iteration t, sample a candidate draw θ* from a proposal distribution J(θ*|θ(t)).
2. Accept the draw with probability

     r = [ p(θ*|y) / J(θ*|θ(t)) ] / [ p(θ(t)|y) / J(θ(t)|θ*) ].

3. Stay in place (do not accept the draw) with probability 1 - r, i.e., θ(t+1) = θ(t).

Remarkably, the proposal distribution (the text calls it the jumping distribution) can have just about any form.

When the proposal distribution is symmetric, i.e.,

  J(θ*|θ) = J(θ|θ*),

the acceptance probability reduces to

  r = p(θ*|y) / p(θ(t)|y).

This is the Metropolis algorithm.


Proposal distributions

Convergence does not depend on J, but the rate of convergence does.

The optimal J is p(θ*|y), in which case r = 1.

Else, how do we choose J? We want a J such that:
1. It is easy to get samples from J
2. It is easy to compute r
3. It leads to rapid convergence and mixing: jumps should be large enough to take us everywhere in the parameter space, but not so large that draws are rarely accepted (see figure from Gilks et al. (1995)).

Three main approaches: random walk M-H (most popular), independence sampler, and approximation M-H.

Independence sampler

The proposal distribution Jt does not depend on θ(t): just find a distribution g(θ) and generate values from it.

Can work very well if g(·) is a good approximation to p(θ|y) and g has heavier tails than p.

Can work awfully badly otherwise.

The acceptance probability is

  r = [ p(θ*|y) / g(θ*) ] / [ p(θ(t)|y) / g(θ(t)) ].



With large samples (central limit theorem operating), the proposal could be a normal distribution centered at the mode of p(θ|y) and with variance larger than the inverse of the Fisher information evaluated at the mode.

Random walk M-H

Most popular, easy to use.

Idea: generate the candidate using a random walk.

The proposal is often normal centered at the current draw:

  J(θ*|θ(t-1)) = N(θ* | θ(t-1), V).

Symmetric: think of drawing the increments θ* - θ(t-1) from a N(0, V). Thus, r simplifies.

It is difficult to choose V:
1. V too small: takes long to explore the parameter space
2. V too large: jumps to extremes are less likely to be accepted; the chain stays in the same place too long
3. Ideal V: the posterior variance. Unknown, so one might do some trial and error runs.

Optimal acceptance rates (from some theory results) are between 25% and 50% for this type of proposal distribution, getting lower with higher dimensional problems.

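A minimal R sketch (not from the original notes) of a random-walk Metropolis sampler for a one-dimensional target; the un-normalized target is arbitrary and chosen only for illustration.

  # Sketch: random-walk Metropolis with a normal proposal of variance V.
  log_q <- function(th) -abs(th)^3 / 3          # arbitrary un-normalized log target
  n_iter <- 10000; V <- 1
  theta <- numeric(n_iter)
  cur <- 0; acc <- 0
  for (t in 1:n_iter) {
    prop <- rnorm(1, cur, sqrt(V))              # candidate from N(current, V)
    log_r <- log_q(prop) - log_q(cur)           # symmetric proposal: J cancels
    if (log(runif(1)) < log_r) { cur <- prop; acc <- acc + 1 }
    theta[t] <- cur
  }
  acc / n_iter                                  # acceptance rate; tune V toward ~25-50%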


Approximation M-H

The idea is to improve the approximation of J to p(θ|y) as we learn more about θ.

E.g., in random walk M-H, we can perhaps increase the acceptance rate by considering

  J(θ*|θ(t-1)) = N(θ* | θ(t-1), V_{θ(t-1)}),

where the variance also depends on the current draw.

Proposals here are typically not symmetric, which requires the full r expression.

Starting values

If the chain is irreducible, the choice of starting value θ0 will not affect convergence.

With multiple chains (see later), we can choose over-dispersed starting values for each chain. A possible algorithm:
1. Find posterior modes
2. Create an over-dispersed approximation to the posterior at the mode (e.g., a t4)
3. Sample starting values from that distribution

Not much research on this topic.


How many chains? Convergence

Even after convergence, draws from stationary distribution are


correlated. Sample is not i.i.d. Impossible to decide whether chain has converged, can only monitor
behavior.
An iid sample of size n can be obtained by keeping only last draw from
each of n independent chains. Too inefficient.
Easiest approach: graphical displays (trace plots in WinBUGS).
Compromise: If the autocorrelation at lag k and larger is negligible, then we can generate an almost i.i.d. sample by keeping only every kth draw in the chain after convergence. Needs a very long chain.

A bit more formal (and most popular in terms of use): the potential scale reduction factor √R̂ of Gelman and Rubin (1992).

To check convergence (see later) multiple chains can be generated To use the G-R diagnostic, must generate multiple independent chains
independently (in parallel). for each parameter.
Burn-in: iterations required to reach stationary distribution
(approximately). Not used for inference. The G-R diagnostic compares the within-chain sd to the between-chain
sd.



Before convergence, the within-chain sd is an underestimate of sd(θ|y), and the between-chain sd is an overestimate.

If the chains converge, all draws are from the stationary distribution, so within- and between-chain variances should be similar.

Consider a scalar parameter θ and let θij be the ith draw in the jth chain, with i = 1, ..., n and j = 1, ..., J.

Between-chain variance:

  B = (n/(J-1)) Σj (θ̄.j - θ̄..)²,   θ̄.j = (1/n) Σi θij,   θ̄.. = (1/J) Σj θ̄.j

Within-chain variance:

  W = (1/(J(n-1))) Σj Σi (θij - θ̄.j)².

An unbiased estimate of var(θ|y) is the weighted average:

  var̂(θ|y) = ((n-1)/n) W + (1/n) B.

Early in the iterations, var̂(θ|y) overestimates the true posterior variance.

The diagnostic measures the potential reduction in the scale if the iterations are continued:

  √R̂ = √( var̂(θ|y) / W ),

which goes to 1 as n → ∞.

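A small R sketch (not from the original notes) computing √R̂ from a matrix of draws (rows = iterations, columns = chains); the simulated chains are only a stand-in.

  # Sketch: Gelman-Rubin potential scale reduction from J parallel chains.
  gelman_rubin <- function(draws) {          # draws: n iterations x J chains
    n <- nrow(draws); J <- ncol(draws)
    chain_means <- colMeans(draws)
    B <- n * var(chain_means)                # between-chain variance
    W <- mean(apply(draws, 2, var))          # within-chain variance
    var_hat <- (n - 1) / n * W + B / n
    sqrt(var_hat / W)
  }
  set.seed(4)
  chains <- sapply(1:4, function(j) cumsum(rnorm(1000)) / sqrt(1:1000) + rnorm(1000))
  gelman_rubin(chains)                       # values near 1 suggest convergence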


Hierarchical models - Introduction Neither approach appealing.

Hierarchical approach: view the s as a sample from a common


Consider the following study:
population distribution indexed by a parameter .
J counties are sampled from the population of 99 counties in the
state of Iowa. Observed data yij can be used to draw inferences about s even though
Within each county, nj farms are selected to participate in a fertilizer the j are not observed.
trial
Outcome yij corresponds to ith farm in jth county, and is modeled A simple hierarchical model:
as p(yij |j ).
We are interested in estimating the j . p(y, , ) = p(y|)p(|)p().

Two obvious approaches:


p(y|) = Usual sampling distribution
1. Conduct J separate analyses and obtain p(j |yj ): all s are
independent p(|) = Population dist. for or prior
2. Get one posterior p(|y1, y2, ..., yJ ): point estimate is same for all p() = Hyperprior
s.


Hierarchical models - Rat tumor example Posterior for is

  p(θ|y) = Beta(α + 4, β + 10)


Imagine a single toxicity experiment performed on rats.
Where can we get good guesses for and ?
is probability that a rat receiving no treatment develops a tumor.
One possibility: from the literature, look at data from similar
experiments. In this example, information from 70 other experiments
Data: n = 14, and y = 4 develop a tumor.
is available.

From the sample: tumor rate is 4/14 = 0.286. In jth study, yj is the number of rats with tumors and nj is the sample
size, j = 1, .., 70.
Simple analysis:
See Table 5.1 and Figure 5.1 in Gelman et al..
  y|θ ~ Bin(n, θ)
  θ|α, β ~ Beta(α, β)

Model the yj as independent binomial data given the study-specific θj and sample size nj.



From Gelman et al.: hierarchical representation of the tumor experiments (Figure 5.1).

The representation of the hierarchical model in Figure 5.1 assumes that the Beta(α, β) is a good population distribution for the θj.

Empirical Bayes as a first step to estimating the mean tumor rate in experiment number 71:
1. Get point estimates of α, β from the earlier 70 experiments as follows.
2. The mean tumor rate is r̄ = (70)⁻¹ Σj=1..70 yj/nj and the standard deviation is [(69)⁻¹ Σj (yj/nj - r̄)²]^{1/2}. The values are 0.136 and 0.103, respectively.

Using the method of moments:

  α/(α + β) = 0.136,   so   α + β = α/0.136
  αβ / [ (α + β)² (α + β + 1) ] = 0.103²

The resulting estimate for (α, β) is (1.4, 8.6).


Then the posterior is

  p(θ|y) = Beta(5.4, 18.6),

with posterior mean 0.223, lower than the sample mean, and posterior standard deviation 0.083.

The posterior point estimate is lower than the crude estimate; this indicates that in the current experiment the number of tumors was unusually high.

Why not go back now and use the prior to obtain better estimates for the tumor rates in the earlier 70 experiments? We should not do that because:
1. We can't use the data twice: we used the historical data to get the prior, and cannot now combine that prior with the data from the same experiments for inference.
2. Using a point estimate for (α, β) suggests that there is no uncertainty about (α, β): not true.
3. If the prior Beta(α, β) is the appropriate prior, shouldn't we know the values of its parameters prior to observing any data?

Approach: keep the Beta(α, β) population distribution for the tumor rates θj, and place a hyperprior on (α, β).

We can still use all the data to estimate the hyperparameters.

Idea: Bayesian analysis on the joint distribution of all parameters, (θ1, ..., θ71, α, β | y).



Hierarchical models - Exchangeability

J experiments. In experiment j, yj ~ p(yj|θj).

To create a joint probability model, we use the idea of exchangeability.

Recall: a set of random variables (z1, ..., zK) are exchangeable if their joint distribution p(z1, ..., zK) is invariant to permutations of their labels.

In our set-up, we have two opportunities for assuming exchangeability:
1. At the level of the data: conditional on θj, the yij's are exchangeable if we cannot distinguish between them.
2. At the level of the parameters: unless something (other than the data) distinguishes the θ's, we assume that they are exchangeable as well.

In the rat tumor example, we have no information to distinguish between the 71 experiments, except sample size. Since nj is presumably not related to θj, we choose an exchangeable model for the θj.

The simplest exchangeable distribution for the θj is

  p(θ|φ) = Πj=1..J p(θj|φ).

The hyperparameter φ is typically unknown, so the marginal for θ must be obtained by averaging:

  p(θ) = ∫ Πj=1..J p(θj|φ) p(φ) dφ.

The exchangeable distribution for (θ1, ..., θJ) is written in the form of a mixture distribution.


The mixture model characterizes the parameters (θ1, ..., θJ) as independent draws from a superpopulation that is determined by unknown parameters φ.

Exchangeability does not hold when we have additional information on covariates xj to distinguish between the θj.

We can still model exchangeability with covariates:

  p(θ1, ..., θJ | x1, ..., xJ) = ∫ Πj=1..J p(θj | xj, φ) p(φ|x) dφ,

with x = (x1, ..., xJ). In the Iowa example, we might know that different counties have very different soil quality.

Exchangeability does not imply that the θj are all the same; just that they can be assumed to be draws from some common superpopulation distribution.

Hierarchical models - Bayesian treatment

Since φ is unknown, the posterior is now p(φ, θ|y).

Model formulation:

  p(φ, θ) = p(φ) p(θ|φ) = joint prior
  p(φ, θ|y) ∝ p(φ, θ) p(y|φ, θ) = p(φ, θ) p(y|θ)

The hyperparameter φ gets its own prior distribution.

It is important to check whether the posterior is proper when using improper priors in hierarchical models!


Posterior predictive distributions

There are two posterior predictive distributions potentially of interest:
1. The distribution of future observations ỹ corresponding to an existing θj. Draw ỹ from the posterior predictive given existing draws for θj.
2. The distribution of new observations ỹ corresponding to new θj's drawn from the same superpopulation. First draw φ from its posterior, then draw θ̃ for a new experiment, and then draw ỹ from the posterior predictive given the simulated θ̃.

In the rat tumor example:
1. More rats from experiment #71, for example
2. Experiment #72, and then rats from experiment #72

Hierarchical models - Computation

Harder than before because we have more parameters.

Easiest when the population distribution p(θ|φ) is conjugate to the likelihood.

In non-conjugate models, we must use more advanced computation.

Usual steps:
1. Write p(θ, φ|y) ∝ p(φ) p(θ|φ) p(y|θ)
2. Analytically determine p(θ|y, φ). Easy for conjugate models.
3. Derive the marginal p(φ|y) by

     p(φ|y) = ∫ p(θ, φ|y) dθ,


or, if convenient, by p(φ|y) = p(θ, φ|y) / p(θ|y, φ). The normalizing constant in the denominator may depend on φ as well as y.

Steps for computation in hierarchical models are:
1. Draw the vector φ from p(φ|y). If φ is low-dimensional, we can use the inverse cdf method as before. Else, we need more advanced methods.
2. Draw θ from the conditional p(θ|φ, y). If the θj are conditionally independent, p(θ|φ, y) factors as

     p(θ|φ, y) = Πj p(θj|φ, y),

   so the components of θ can be drawn one at a time.
3. Draw ỹ from the appropriate posterior predictive distribution.

Rat tumor example

Sampling distribution for data from experiments j = 1, ..., 71:

  yj ~ Bin(nj, θj)

Tumor rates θj are assumed to be independent draws from a Beta:

  θj ~ Beta(α, β)

Choose a non-informative prior for (α, β) to indicate prior ignorance.

Since the hyperprior will be non-informative, and perhaps improper, we must check integrability of the posterior.
check integrability of posterior.



Defer the choice of p(α, β) until a bit later.

Joint posterior distribution:

  p(θ, α, β|y) ∝ p(α, β) p(θ|α, β) p(y|θ, α, β)
              ∝ p(α, β) Πj [ Γ(α+β) / (Γ(α)Γ(β)) ] θj^{α-1} (1-θj)^{β-1} · Πj θj^{yj} (1-θj)^{nj-yj}.

Conditional of θ: notice that given (α, β), the θj are independent, with Beta distributions:

  p(θj|α, β, y) = [ Γ(α+β+nj) / (Γ(α+yj) Γ(β+nj-yj)) ] θj^{α+yj-1} (1-θj)^{β+nj-yj-1}.

The marginal posterior distribution of the hyperparameters is obtained using

  p(α, β|y) = p(θ, α, β|y) / p(θ|y, α, β).

Substituting the expressions above, we get

  p(α, β|y) ∝ p(α, β) Πj [ Γ(α+β) / (Γ(α)Γ(β)) ] [ Γ(α+yj) Γ(β+nj-yj) / Γ(α+β+nj) ].

Not a standard distribution, but easy to evaluate and only in two dimensions.

We can evaluate p(α, β|y) over a grid of values of (α, β), and then use the inverse cdf method to draw from the marginal and from the conditional.


But what is p(α, β)? As it turns out, many of the obvious choices for the prior lead to an improper posterior. (See the solution to Exercise 5.7, on the course web site.)

Some obvious choices such as p(α, β|y) ∝ 1 lead to a non-integrable posterior.

Integrability can be checked analytically by evaluating the behavior of the posterior as α, β (or functions of them) go to ∞.

An empirical assessment can be made by looking at the contour plot of p(α, β|y) over a grid. Significant mass extending towards infinity suggests that the posterior will not integrate to a constant.

Other choices leading to non-integrability of p(α, β|y) include:
  A flat prior on the prior guess and degrees of freedom: p( α/(α+β), α+β ) ∝ 1.
  A flat prior on the log(mean) and log(degrees of freedom): p( log(α/β), log(α+β) ) ∝ 1.

A reasonable choice for the prior is a flat distribution on the prior mean and the square root of the inverse of the degrees of freedom:

  p( α/(α+β), (α+β)^{-1/2} ) ∝ 1.

This is equivalent to

  p(α, β) ∝ (α+β)^{-5/2}.



Also equivalent to

  p( log(α/β), log(α+β) ) ∝ αβ (α+β)^{-5/2}.

We use the parameterization (log(α/β), log(α+β)).

The idea for drawing values of α and β from their posterior is the usual one: evaluate p(log(α/β), log(α+β)|y) over a grid of values and then use the inverse cdf method.

For this problem, it is easier to evaluate the log posterior and then exponentiate.

Grid: From the earlier estimates, potential centers for the grid are α = 1.4, β = 8.6. In the new parameterization, this translates into u = log(α/β) = -1.8, v = log(α+β) = 2.3.

A grid that is too narrow leaves mass outside. Try [-2.3, -1.3] × [1, 5].

Steps for computation (see the sketch below):
1. Given draws (u, v), transform back to (α, β).
2. Finally, sample θj from Beta(α + yj, β + nj - yj).

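A minimal R sketch (not from the original notes) of the grid computation; yj and nj below are a small made-up subset standing in for the 71 experiments.

  # Sketch: grid sampling of (alpha, beta) in the rat tumor model,
  # parameterized as u = log(alpha/beta), v = log(alpha + beta).
  yj <- c(0, 1, 2, 4, 4, 5)            # made-up tumor counts (stand-in data)
  nj <- c(20, 20, 19, 19, 18, 14)
  log_post <- function(u, v) {
    a <- exp(u + v) / (1 + exp(u)); b <- exp(v) / (1 + exp(u))
    log(a) + log(b) - 2.5 * log(a + b) +               # prior, including Jacobian
      sum(lgamma(a + b) - lgamma(a) - lgamma(b) +
          lgamma(a + yj) + lgamma(b + nj - yj) - lgamma(a + b + nj))
  }
  u <- seq(-2.3, -1.3, length = 100); v <- seq(1, 5, length = 100)
  lp <- outer(u, v, Vectorize(log_post))
  p <- exp(lp - max(lp))                               # un-normalized grid posterior
  idx <- sample(length(p), 1000, replace = TRUE, prob = p)
  us <- u[row(p)[idx]]; vs <- v[col(p)[idx]]           # sampled (u, v) pairs
  alpha <- exp(us + vs) / (1 + exp(us)); beta <- exp(vs) / (1 + exp(us))
  theta1 <- rbeta(1000, alpha + yj[1], beta + nj[1] - yj[1])  # draws for theta_1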

From Gelman et al.: Posterior means and 95% credible sets for the θi.
From Gelman et al.: Contours of the joint posterior distribution of alpha and beta on the reparametrized scale.

Normal hierarchical model

What we know as one-way random effects models.

Set-up: J independent experiments; in each we want to estimate a mean θj.

Sampling model:

  yij | θj, σ² ~ N(θj, σ²),   i = 1, ..., nj,   j = 1, ..., J.

Assume for now that σ² is known.

If ȳ.j = nj⁻¹ Σi yij, then

  ȳ.j | θj ~ N(θj, σj²),   with σj² = σ²/nj.

The sampling model for ȳ.j is quite general: for nj large, ȳ.j is normal even if yij is not.

We now need to think of priors for the θj. What type of posterior estimates for θj might be reasonable?
1. θ̂j = ȳ.j : reasonable if nj is large
2. θ̂j = ȳ.. = (Σj σj⁻²)⁻¹ Σj σj⁻² ȳ.j : pooled estimate, reasonable if we believe that all means are the same.

To decide between those two choices, one could do an F-test for differences between groups. ANOVA approach (for nj = n):


  Source           df       SS    MS             E(MS)
  Between groups   J-1      SSB   SSB/(J-1)      nτ² + σ²
  Within groups    J(n-1)   SSE   SSE/[J(n-1)]   σ²

where τ², the variance of the group means, can be estimated as

  τ̂² = (MSB - MSE) / n.

If MSB >>> MSE then τ̂² > 0 and the F-statistic is significant: do not pool, and use θ̂j = ȳ.j.

Else, the F-test cannot reject H0: τ² = 0, and we must pool.

Alternative: why not a more general estimator

  θ̂j = λj ȳ.j + (1 - λj) ȳ..

for λj ∈ [0, 1]? The factor (1 - λj) is a shrinkage factor.

All three estimates have a Bayesian justification:
1. θ̂j = ȳ.j is the posterior mean of θj if the sample means are normal and p(θj) ∝ 1.
2. θ̂j = ȳ.. is the posterior mean if θ1 = ... = θJ and p(θ) ∝ 1.
3. θ̂j = λj ȳ.j + (1 - λj) ȳ.. is the posterior mean if θj ~ N(μ, τ²) independently of the other θ's and the sampling distribution for ȳ.j is normal.

The latter is called the Normal-Normal model.



Normal-normal model

Set-up (for σ² known):

  ȳ.j | θj ~ N(θj, σj²)
  θ1, ..., θJ | μ, τ ~ N(μ, τ²)
  p(μ, τ)

It follows that

  p(θ1, ..., θJ) = ∫ Πj N(θj; μ, τ²) p(μ, τ²) dμ dτ².

The joint prior can be written as p(μ, τ) = p(μ|τ)p(τ). For μ, we will consider a conditional flat prior, so that

  p(μ, τ) ∝ p(τ).

Joint posterior:

  p(θ, μ, τ|y) ∝ p(μ, τ) p(θ|μ, τ) p(y|θ)
              ∝ p(μ, τ) Πj N(θj; μ, τ²) Πj N(ȳ.j; θj, σj²)

Conditional distribution of the group means

For p(μ|τ) ∝ 1, the conditional distributions of the θj given μ, τ, ȳ.j are independent, and

  p(θj|μ, τ, ȳ.j) = N(θ̂j, Vj)


with

  θ̂j = (σj² μ + τ² ȳ.j) / (σj² + τ²),   Vj⁻¹ = 1/σj² + 1/τ².

Marginal distribution of hyperparameters

To get p(μ, τ|y) we would typically either integrate p(θ, μ, τ|y) with respect to θ1, ..., θJ, or use the algebraic approach

  p(μ, τ|y) = p(θ, μ, τ|y) / p(θ|μ, τ, y).

In the normal-normal model, the data provide information about μ and τ, as shown below.

Consider the marginal posterior:

  p(μ, τ|y) ∝ p(μ, τ) p(y|μ, τ),

where p(y|μ, τ) is the marginal likelihood:

  p(y|μ, τ) = ∫ p(y|θ, μ, τ) p(θ|μ, τ) dθ
            = ∫ Πj N(ȳ.j; θj, σj²) Πj N(θj; μ, τ²) dθ1 ... dθJ.

The integrand above is, in the exponent, a product of quadratic functions of ȳ.j and θj, so they are jointly normal.

Then the ȳ.j | μ, τ are also normal, with mean and variance:

  E(ȳ.j|μ, τ) = E[ E(ȳ.j|θj, μ, τ) | μ, τ ] = E(θj|μ, τ) = μ
  var(ȳ.j|μ, τ) = E[ var(ȳ.j|θj, μ, τ) | μ, τ ] + var[ E(ȳ.j|θj, μ, τ) | μ, τ ]



+var[E(y.j |j , , )|, ] which is the only parameter.
= E(j2|, ) + var(j |, )
Recall that p(| ) 1.
= j2 + 2
From earlier, we know that p(|y, ) = N(, V) where
Therefore Y
p(, |y) p( ) N(y.j ; , j2 + 2) P
y.j /(j2 + 2) 1
j
X
j = P 2 , V =
j 1/(j + 2) j
j2 + 2
We know that
To get p( |y) we use the old trick:
p(, |y) p( )p(| )
" #
Y
2 2 1/2 1 p(, |y)
(j + ) exp (y.j )2 p( |y) =
j
2(j2 + 2) p(|, y)
p( ) j N(y.j ; , j2 + 2)
Q

Now fix , and think of p(y.j |) as the likelihood in a problem in N(; , V)


The expression holds for any μ, so set μ = μ̂. The denominator is then just Vμ^{-1/2}.

Then

  p(τ|y) ∝ p(τ) Vμ^{1/2} Πj N(ȳ.j; μ̂, σj² + τ²).

What do we use for p(τ)?

A safe choice is always a proper prior. For example, consider

  p(τ) = Inv-χ²(ν0, τ0²),

where τ0² can be a best guess for τ² and ν0 can be small to reflect prior uncertainty.

Non-informative (and improper) choice:
  Beware: an improper prior in a hierarchical model can easily lead to a non-integrable posterior.
  In the normal-normal model, the natural non-informative choice p(log τ) ∝ 1 (equivalently, p(τ²) ∝ τ⁻²) results in an improper posterior.
  p(τ) ∝ 1 leads to a proper posterior p(τ|y).



Normal-normal model - Computation

1. Evaluate the one-dimensional p(τ|y) on a grid.
2. Sample τ from p(τ|y) using the inverse cdf method.
3. Sample μ from N(μ̂, Vμ).
4. Sample θj from N(θ̂j, Vj).

(See the sketch following the prediction steps below.)

Normal-normal model - Prediction

Predicting future data ỹ from the current experiments with means θ = (θ1, ..., θJ):
1. Obtain draws of (τ, μ, θ1, ..., θJ)
2. Draw ỹ from N(θj, σj²)

Predicting future data ỹ from a future experiment with mean θ̃ and sample size ñ:
1. Draw μ, τ from their posterior
2. Draw θ̃ from the population distribution p(θ̃|μ, τ) (also known as the prior for θ)
3. Draw ỹ from N(θ̃, σ̃²), where σ̃² = σ²/ñ

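A minimal R sketch (not in the original notes) of computation steps 1-4 using a grid over τ; the group means ȳ.j and sampling variances σj² below are made up for illustration.

  # Sketch: simulation from the normal-normal model with known sigma_j^2.
  ybar <- c(28, 8, -3, 7, -1, 1, 18, 12)        # made-up group means
  sj2  <- c(15, 10, 16, 11, 9, 11, 10, 18)^2    # made-up sampling variances
  tau_grid <- seq(0.01, 40, length = 500)
  log_ptau <- sapply(tau_grid, function(tau) {   # log p(tau|y), with p(tau) propto 1
    v <- sj2 + tau^2
    Vmu <- 1 / sum(1 / v)
    muhat <- sum(ybar / v) * Vmu
    0.5 * log(Vmu) + sum(dnorm(ybar, muhat, sqrt(v), log = TRUE))
  })
  ptau <- exp(log_ptau - max(log_ptau))
  S <- 2000
  tau <- sample(tau_grid, S, replace = TRUE, prob = ptau)   # step 2
  theta <- matrix(NA, S, length(ybar)); mu <- numeric(S)
  for (s in 1:S) {
    v <- sj2 + tau[s]^2
    Vmu <- 1 / sum(1 / v); muhat <- sum(ybar / v) * Vmu
    mu[s] <- rnorm(1, muhat, sqrt(Vmu))                     # step 3
    Vj <- 1 / (1 / sj2 + 1 / tau[s]^2)
    thetahat <- (ybar / sj2 + mu[s] / tau[s]^2) * Vj
    theta[s, ] <- rnorm(length(ybar), thetahat, sqrt(Vj))   # step 4
  }
  colMeans(theta)                                           # posterior means of the group means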

Example: effect of diet on coagulation times

Normal hierarchical model for coagulation times of 24 animals randomized to four diets. Model:
  Observations: yij with i = 1, ..., nj, j = 1, ..., J.
  Given θ, coagulation times yij are exchangeable
  Treatment means are normal N(μ, τ²), and the variance σ² is constant across treatments
  Prior: p(μ, log σ, log τ) ∝ τ. (A uniform prior on log τ leads to an improper posterior.)

  p(θ, μ, log σ, log τ|y) ∝ τ Πj N(θj|μ, τ²) Πj Πi N(yij|θj, σ²)

Crude initial estimates: θ̂j = nj⁻¹ Σi yij = ȳ.j,  μ̂ = J⁻¹ Σj ȳ.j = ȳ..,  σ̂j² = (nj - 1)⁻¹ Σi (yij - ȳ.j)²,  σ̂² = J⁻¹ Σj σ̂j²,  and  τ̂² = (J - 1)⁻¹ Σj (ȳ.j - ȳ..)².

Data

The following table contains data representing coagulation time in seconds for blood drawn from 24 animals randomly allocated to four different diets. Different treatments have different numbers of observations because the randomization was unrestricted.

  Diet  Measurements
  A     62, 60, 63, 59
  B     63, 67, 71, 64, 65, 66
  C     68, 66, 71, 67, 68, 68
  D     56, 62, 60, 61, 63, 64, 63, 59



Conditional maximization for the joint mode

Conjugacy makes conditional maximization easy.

Conditional modes of the treatment means are

  θj | μ, σ, τ, y ~ N(θ̂j, Vj)

with

  θ̂j = ( μ/τ² + nj ȳ.j/σ² ) / ( 1/τ² + nj/σ² ),   Vj⁻¹ = 1/τ² + nj/σ².

For j = 1, ..., J, maximize the conditional posteriors by using θ̂j in place of the current estimate of θj.

Conditional mode of μ:

  μ | θ, σ, τ, y ~ N(μ̂, τ²/J),   μ̂ = J⁻¹ Σj θj.

Conditional maximization: replace the current estimate of μ with μ̂.


Conditional maximization (cont'd)

Conditional mode of log σ: first derive the conditional posterior for σ²:

  σ² | θ, μ, τ, y ~ Inv-χ²(n, σ̂²),   with σ̂² = n⁻¹ Σj Σi (yij - θj)².

The mode of the Inv-χ² is σ̂² n/(n + 2). To get the mode for log σ, use the transformation; the term n/(n + 2) disappears with the Jacobian, so the conditional mode of log σ is log σ̂.

Conditional mode of log τ: same reasoning. Note that

  τ² | θ, μ, σ, y ~ Inv-χ²(J - 1, τ̂²),   with τ̂² = (J - 1)⁻¹ Σj (θj - μ)².

After accounting for the Jacobian of the transformation, the conditional mode of log τ is log τ̂.

Starting from the crude estimates, conditional maximization required only three iterations to converge approximately (see table).

The log posterior increased at each step.

The values in the final iteration are the approximate joint mode.

When J is large relative to the nj, the joint mode may not provide a good summary of the posterior. Try to get marginal modes instead.

In this problem, factor:

  p(θ, μ, log σ, log τ|y) = p(θ|μ, log σ, log τ, y) p(μ, log σ, log τ|y).

The marginal of (μ, log σ, log τ) is three-dimensional regardless of J and nj.



Conditional maximization results

The table shows the stepwise ascent (conditional maximization), starting from the crude estimates.

  Parameter           Crude      First      Second     Third      Fourth
                      estimate   iteration  iteration  iteration  iteration
  θ1                  61.000     61.282     61.288     61.290     61.290
  θ2                  66.000     65.871     65.869     65.868     65.868
  θ3                  68.000     67.742     67.737     67.736     67.736
  θ4                  61.000     61.148     61.152     61.152     61.152
  μ                   64.000     64.010     64.011     64.011     64.011
  σ                   2.291      2.160      2.160      2.160      2.160
  τ                   3.559      3.318      3.318      3.312      3.312
  log p(params.|y)    -61.604    -61.420    -61.420    -61.420    -61.420

Marginal maximization

Recall the algebraic trick:

  p(μ, log σ, log τ|y) = p(θ, μ, log σ, log τ|y) / p(θ|μ, log σ, log τ, y)
                       ∝ [ τ Πj N(θj|μ, τ²) Πj Πi N(yij|θj, σ²) ] / [ Πj N(θj|θ̂j, Vj) ].

Using θ̂ in place of θ:

  p(μ, log σ, log τ|y) ∝ τ Πj N(θ̂j|μ, τ²) Πj Πi N(yij|θ̂j, σ²) Πj Vj^{1/2},

with θ̂j and Vj as earlier.


EM algorithm for the marginal mode

We can maximize p(μ, log σ, log τ|y) using EM. Here, the (θj) are the missing data.

Steps in EM:
1. Average over the missing data θ in the E-step
2. Maximize over (μ, log σ, log τ) in the M-step

Log joint posterior:

  log p(θ, μ, log σ, log τ|y) ≈ -n log σ - (J - 1) log τ
      - (1/(2τ²)) Σj (θj - μ)² - (1/(2σ²)) Σj Σi (yij - θj)²

E-step: Average over θ using the conditioning trick (and p(θ|rest)). We need two expectations:

  E_old[(θj - μ)²] = E[(θj - μ)² | μ_old, σ_old, τ_old, y]
                   = [E_old(θj) - μ]² + var_old(θj)
                   = (θ̂j - μ)² + Vj

Similarly:

  E_old[(yij - θj)²] = (yij - θ̂j)² + Vj.



EM algorithm for marginal mode (contd) EM algorithm for marginal mode (contd)

M-step: Maximize the expected log posterior (expectations just taken Beginning from joint mode, algorithm converged in three iterations.
in E-step) with respect to (, log , log ). Differentiate and equate to
zero to get maximizing values (new , log new , log new ). Expressions Important to check that log posterior increases at each step. Else,
are: programming or formulation error!
1X
new = j .
J j
Results:
1/2
1 X X Parameter Value at First Second Third
new = [(yij j )2 + Vj ]
n j i joint mode iteration iteration iteration
64.01 64.01 64.01 64.01
and 1/2 2.17 2.33 2.36 2.36
1 X 3.31 3.46 3.47 3.47
new = [(j new )2 + Vj ] .
J 1 j


Using EM results in simulation Gibbs sampling in normal-normal case


In a problem with standard form such as normal-normal, can do the
following: The Gibbs sampler can be easily implemented in the normal-normal
example because all posterior conditionals are of standard form.
1. Use EM to find marginal and conditional modes
2. Approximate posterior at the mode using N or t approximation
Refer back to Gibbs sampler discussion, and see derivation of
3. Draw values of parameters from papprox, and act as if they were
conditionals in hierarchical normal example.
from p.

Importance re-sampling can be used to improve the accuracy of the draws. For each draw, compute importance weights, and re-sample draws (without replacement) using probability proportional to weight.

In the diet and coagulation example, the approximate approach above is likely to produce quite reasonable results.

The full conditionals are:
  θj | all others ~ Normal   (J of them)
  μ  | all others ~ Normal
  σ² | all others ~ Inv-χ²
  τ² | all others ~ Inv-χ².

But must be careful, specially with scale parameters. Starting values for the Gibbs sampler can be drawn from, e.g., a t4
approximation to the marginal and conditional mode.



Multiple parallel chains for each parameter permit monitoring Results from Gibbs sampler
convergence using potential scale reduction statistic.

Summary of posterior distributions in coagulation example.

Posterior quantiles and estimated potential scale reductions computed


from the second halves of then Gibbs sampler sequences, each of length
1000.

Potential scale reductions for and were computed on the log scale.

The hierarchical variance 2, is estimated less precisely than the unit-


level variance, 2, as is typical in hierarchical models with a small
number of batches.


Posterior quantiles from the Gibbs sampler

                                 Posterior quantiles
  Parameter                      2.5%     25.0%    50.0%    75.0%    97.5%
  θ1                             58.92    60.44    61.23    62.08    63.69
  θ2                             63.96    65.26    65.91    66.57    67.94
  θ3                             65.72    67.11    67.77    68.44    69.75
  θ4                             59.39    60.56    61.14    61.71    62.89
  μ                              55.64    62.32    63.99    65.69    73.00
  σ                              1.83     2.17     2.41     2.7      3.47
  τ                              1.98     3.45     4.97     7.76     24.60
  log p(μ, log σ, log τ|y)       -70.79   -66.87   -65.36   -64.20   -62.71
  log p(θ, μ, log σ, log τ|y)    -71.07   -66.88   -65.25   -64.00   -62.42

Checking convergence

Burn-in = 50% of the chain length.

  Parameter                      R̂       upper
  θ1                             1.001   1.003
  θ2                             1.001   1.003
  θ3                             1.000   1.002
  θ4                             1.000   1.001
  μ                              1.005   1.005
  σ                              1.000   1.001
  τ                              1.012   1.013
  log p(μ, log σ, log τ|y)       1.000   1.000
  log p(θ, μ, log σ, log τ|y)    1.000   1.001



[Figures: trace plots of the chains for the four diet means θ1-θ4 (iterations 500-1000) and histograms of their posterior distributions.]

[Figures: trace plots of the chains for μ, σ², and τ² (iterations 500-1000) and histograms of the posterior distributions of μ, σ, and τ.]


Hierarchical modeling for meta-analysis

Idea: summarize and integrate the results of research studies in a specific area.

Example 5.6 in Gelman et al.: 22 clinical trials conducted to study the effect of beta-blockers on reducing mortality after cardiac infarction.

Considering the studies separately, there is no obvious effect of beta-blockers.

Data are 22 2x2 tables: in the jth study, n0j and n1j are the numbers of individuals assigned to the control and treatment groups, respectively, and y0j and y1j are the numbers of deaths in each group.

Sampling model for the jth experiment: two independent Binomials, with probabilities of death p0j and p1j.

Possible quantities of interest:
1. difference p1j - p0j
2. probability ratio p1j / p0j
3. odds ratio

     ρj = [ p1j/(1 - p1j) ] / [ p0j/(1 - p0j) ]

We parametrize in terms of the log odds ratios, θj = log ρj, because the posterior is almost normal even for small samples.


Normal approximation to the likelihood

Consider estimating θj by the empirical logits:

  yj = log[ y1j / (n1j - y1j) ] - log[ y0j / (n0j - y0j) ]

Approximate sampling variance (e.g., using a Taylor expansion):

  σj² = 1/y1j + 1/(n1j - y1j) + 1/y0j + 1/(n0j - y0j)

See Table 5.4 for the estimated log-odds ratios and their estimated standard errors.

Raw data (deaths / total), empirical log-odds yj, and sd σj:

  Study   Control          Treated          Log-odds   sd
  j       deaths  total    deaths  total    yj         σj
  1       3       39       3       38       0.0282     0.8503
  2       14      116      7       114      -0.7410    0.4832
  3       11      93       5       69       -0.5406    0.5646
  4       127     1520     102     1533     -0.2461    0.1382
  5       27      365      28      355      0.0695     0.2807
  6       6       52       4       59       -0.5842    0.6757
  7       152     939      98      945      -0.5124    0.1387
  8       48      471      60      632      -0.0786    0.2040
  9       37      282      25      278      -0.4242    0.2740
  10      188     1921     138     1916     -0.3348    0.1171
  11      52      583      64      873      -0.2134    0.1949
  12      47      266      45      263      -0.0389    0.2295
  13      16      293      9       291      -0.5933    0.4252
  14      45      883      57      858      0.2815     0.2054
  15      31      147      25      154      -0.3213    0.2977
  16      38      213      33      207      -0.1353    0.2609
  17      12      122      28      251      0.1406     0.3642
  18      6       154      8       151      0.3220     0.5526
  19      3       134      6       174      0.4444     0.7166
  20      40      218      32      209      -0.2175    0.2598
  21      43      364      27      391      -0.5911    0.2572
  22      39      674      22      680      -0.6081    0.2724



Goals of analysis

Goal 1: if the studies can be assumed to be exchangeable, we wish to estimate the mean of the distribution of effect sizes, i.e., the overall average effect.

Goal 2: the average effect size in each of the exchangeable studies.

Goal 3: the effect size that could be expected if a new, exchangeable study were to be conducted.

Normal-normal model for meta-analysis

First level: sampling distribution   yj | θj, σj² ~ N(θj, σj²)

Second level: population distribution   θj | μ, τ ~ N(μ, τ²)

Third level: prior for the hyperparameters μ, τ:   p(μ, τ) = p(μ|τ) p(τ) ∝ 1, mostly for convenience. We can incorporate information if available.


Posterior quantiles of effect θj (normal approximation, on the log-odds scale)

  Study j   2.5%    25.0%   50.0%   75.0%   97.5%
  1         -0.58   -0.32   -0.24   -0.15   0.13
  2         -0.63   -0.36   -0.28   -0.20   -0.03
  3         -0.64   -0.34   -0.26   -0.18   0.06
  4         -0.44   -0.31   -0.25   -0.18   -0.04
  5         -0.44   -0.28   -0.21   -0.12   0.13
  6         -0.62   -0.36   -0.27   -0.19   0.04
  7         -0.62   -0.44   -0.36   -0.27   -0.16
  8         -0.43   -0.28   -0.20   -0.13   0.08
  9         -0.56   -0.36   -0.27   -0.20   -0.05
  10        -0.48   -0.35   -0.29   -0.23   -0.12
  11        -0.47   -0.31   -0.24   -0.17   -0.01
  12        -0.42   -0.28   -0.21   -0.12   0.09
  13        -0.65   -0.37   -0.27   -0.20   0.02
  14        -0.34   -0.22   -0.12   0.00    0.30
  15        -0.54   -0.32   -0.26   -0.17   0.00
  16        -0.50   -0.30   -0.23   -0.15   0.06
  17        -0.46   -0.28   -0.21   -0.11   0.15
  18        -0.53   -0.30   -0.22   -0.13   0.15
  19        -0.51   -0.31   -0.22   -0.13   0.17
  20        -0.51   -0.32   -0.24   -0.17   0.04
  21        -0.67   -0.40   -0.30   -0.23   -0.09
  22        -0.69   -0.40   -0.30   -0.22   -0.07

Posterior quantiles for the hyperparameters and predicted effect

  Estimand                  2.5%    25.0%   50.0%   75.0%   97.5%
  Mean, μ                   -0.38   -0.29   -0.25   -0.21   -0.12
  Standard deviation, τ     0.01    0.08    0.13    0.18    0.32
  Predicted effect, θ̃j      -0.57   -0.33   -0.25   -0.17   0.08

[Figure: marginal posterior density p(τ|y), for τ between 0 and 2.]



[Figure: conditional posterior means E(θj|τ, y) for the 22 studies, plotted as functions of τ (0 to 2).]

[Figure: conditional posterior standard deviations sd(θj|τ, y) for the 22 studies, plotted as functions of τ (0 to 2).]

[Figure: histograms of 1000 simulations of θ2, θ18, θ7, and θ14.]

[Figure: overall mean effect E(μ|τ, y) as a function of τ (0 to 2).]


Example: Mixed model analysis

Sheffield Food Company produces dairy products.

The government cited the company because the actual content of fat in yogurt sold by Sheffield appears to be higher than the labeled amount.

Sheffield believes that the discrepancy is due to the method employed by the government to measure fat content, and conducted a multi-laboratory study to investigate.

Four laboratories in the United States were randomly chosen.

Each laboratory received 12 carefully mixed samples of yogurt, with instructions to analyze 6 using the government method and 6 using Sheffield's method.

Fat content in all samples was known to be very close to 3%.

Because of technical difficulties, none of the labs managed to analyze all six samples using the government method within the allotted time.

  Method       Lab 1   Lab 2   Lab 3   Lab 4
  Government   5.19    4.09    4.62    3.71
               5.09    3.0     4.32    3.86
               3.75    4.35    3.79
               4.04    4.59    3.63
               4.06
  Sheffield    3.26    3.02    3.08    2.98
               3.38    3.32    2.95    2.89
               3.24    2.83    2.98    2.75
               3.41    2.96    2.74    3.04
               3.35    3.23    3.07    2.88
               3.04    3.07    2.70    3.20

(For the incomplete Government rows, the lab-to-column alignment was lost in extraction and is not recoverable from the text.)

We fitted a mixed linear model to the data, where
  Method is a fixed effect with two levels
  Laboratory is a random effect with four levels
  Method by laboratory interaction is random with six levels

  yijk = αi + βj + (αβ)ij + eijk,

with αi = μ + τi and eijk ~ N(0, σe²).

Priors:

  p(αi) ∝ 1
  p(βj | σβ²) ~ N(0, σβ²)
  p((αβ)ij | σαβ²) ~ N(0, σαβ²)

The three variance components were assigned diffuse inverted gamma priors.

[Figures: WinBUGS trace plots (three chains, iterations 5001-10000) for alpha[1] and alpha[2].]
[Figures: WinBUGS posterior density plots, autocorrelation plots, and trace plots (three chains, 15000 samples) for beta[1]-beta[4], and box plots of the posterior distributions of alpha and beta.]
Model checking

What do we need to check?
  Model fit: does the model fit the data?
  Sensitivity to prior and other assumptions
  Model selection: is this the best model?
  Robustness: do conclusions change if we change the data?

Remember: models are never true; they may just fit the data well and allow for useful inference.

Model checking strategies must address various parts of models:
  priors
  sampling distribution
  hierarchical structure
  other model characteristics such as covariates, form of dependence between response variable and covariates, etc.

Classical approaches to model checking:
  Do parameter estimates make sense
  Does the model generate data like the observed sample
  Are predictions reasonable
  Is the model best in some sense (e.g., AIC, likelihood ratio, etc.)

Bayesian approach to model checking:
  Does the posterior distribution of parameters correspond with what we know from subject-matter knowledge?
  The predictive distribution for future data must also be consistent with substantive knowledge
  Future data generated by the predictive distribution are compared to the current sample


Sensitivity to prior and other model components Posterior predictive model checking

Most popular model checking approach has a frequentist flavor: we Basic idea is that data generated from the model must look like
generate replications of the sample from posterior and observe the observed data.
behavior of sample summaries over repeated sampling.
Posterior predictive model checks based on replicated data y rep
generated from the posterior predictive distribution:
  p(y^rep|y) = ∫ p(y^rep|θ) p(θ|y) dθ.

y rep is not the same as y. Predictive outcomes y can be anything


(e.g., a regression prediction using different covariates x). But y rep is
a replication of y.

y rep are data that could be observed if we repeated the exact same
experiment again tomorrow (if in fact the s in our analysis gave rise
to the data y).



Definition of a replicate in a hierarchical model:

  p(φ|y) → p(θ|φ, y) → p(y^rep|θ) : replicates from the same units
  p(φ|y) → p(θ|φ)    → p(y^rep|θ) : replicates from new units

Test quantity: T(y, θ) is a discrepancy statistic used as a standard.

We use T(y, θ) to determine the discrepancy between model and data on some specific aspect we wish to check.

Example of T(y, θ): the proportion of standardized residuals outside of (-3, 3) in a regression model, to check for outliers.

In classical statistics, we use T(y), a test statistic that depends only on the data; this is a special case of the Bayesian T(y, θ).

For model checking:
  Determine an appropriate T(y, θ)
  Compare the posterior predictive distribution of T(y^rep, θ) to the posterior distribution of T(y, θ)


Bayes p-values

p-values attempt to measure tail-area probabilities.

Classical definition:

  classical p-value = Pr( T(y^rep) ≥ T(y) | θ )

The probability is taken over the distribution of y^rep with θ fixed. A point estimate of θ is typically used to compute the p-value.

Posterior predictive p-values:

  Bayes p-value = Pr( T(y^rep, θ) ≥ T(y, θ) | y )

The probability is taken over the joint posterior distribution of (θ, y^rep):

  Bayes p-value = ∫∫ I_{ T(y^rep,θ) ≥ T(y,θ) } p(θ|y) p(y^rep|θ) dθ dy^rep




Relation between p-values

Small example:
  Sample y1, ..., yn ~ N(μ, σ²)
  Fit N(0, σ²) and check whether μ = 0 is a good fit
  In real life, we would fit the more general N(μ, σ²) and decide whether μ = 0 is plausible.

Classical approach:
  The test statistic is the sample mean T(y) = ȳ

  p-value = Pr( T(y^rep) ≥ T(y) | σ² )
          = Pr( ȳ^rep ≥ ȳ | σ² )
          = Pr( √n ȳ^rep / S ≥ √n ȳ / S | σ² )
          = P( t_{n-1} ≥ √n ȳ / S )

This is a special case: it is not always possible to get rid of nuisance parameters this way.

Bayes approach:

  p-value = Pr( T(y^rep) ≥ T(y) | y )
          = ∫∫ I_{T(y^rep) ≥ T(y)} p(y^rep|σ²) p(σ²|y) dy^rep dσ²

Note that

  ∫ I_{T(y^rep) ≥ T(y)} p(y^rep|σ²) dy^rep = P( T(y^rep) ≥ T(y) | σ² )


  = classical p-value.

Then:

  Bayes p-value = E{ classical p-value | y },

where the expectation is taken with respect to p(σ²|y).

In this example, the classical p-value and the Bayes p-value are the same.

In general, the Bayes approach can easily handle nuisance parameters.

Interpreting posterior predictive p-values

We look for tail-area probabilities that are not too small or too large.

The ideal posterior predictive p-value is 0.5: the test quantity falls right in the middle of its posterior predictive distribution.

Posterior predictive p-values are actual posterior probabilities.

Wrong interpretation: Pr(model is true | data).



Example: Independence of Bernoulli trials

A sequence of binary outcomes y1, ..., yn is modeled as iid Bernoulli trials with probability of success θ.

A uniform prior on θ leads to p(θ|y) ∝ θ^s (1 - θ)^{n-s}, with s = Σ yi.

Sample:

  1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0

Is the assumption of independence warranted?

Consider the discrepancy statistic

  T(y, θ) = T(y) = number of switches between 0 and 1.

In the sample, T(y) = 3.

The posterior is Beta(8, 14).

To test the assumption of independence, do:
1. For j = 1, ..., M draw θ^j from Beta(8, 14).
2. Draw {y1^{rep,j}, ..., y20^{rep,j}}, independent Bernoulli variables with probability θ^j.
3. In each of the M replicate samples, compute T(y^rep), the number of switches between 0 and 1.

  p_B = Prob( T(y^rep) ≥ T(y) | y ) = 0.98.

If the observed trials were independent, the expected number of switches in 20 trials is approximately 8, and 3 is very unlikely.

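A minimal R sketch (not from the original notes) of this posterior predictive check:

  # Sketch: posterior predictive p-value for the number of switches.
  y <- c(1,1,0,0,0,0,0,1,1,1,1,1,0,0,0,0,0,0,0,0)
  n_switch <- function(x) sum(abs(diff(x)))        # T(y): number of 0/1 switches
  T_obs <- n_switch(y)                             # 3 in the observed sample
  M <- 10000
  T_rep <- replicate(M, {
    theta <- rbeta(1, 8, 14)                       # draw from the Beta(8, 14) posterior
    n_switch(rbinom(length(y), 1, theta))          # replicate data and its T
  })
  mean(T_rep >= T_obs)                             # posterior predictive p-value (~0.98)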

[Figure: posterior predictive distribution of the number of switches.]

Example: Newcomb's speed of light

Newcomb obtained 66 measurements of the speed of light (Chapter 3). One measurement of -44 is a potential outlier under a normal model.

From the data: ȳ = 26.21 and s = 10.75.

Model: yi|μ, σ² ~ N(μ, σ²), with p(μ, σ²) ∝ σ⁻².

Posteriors:

  p(σ²|y) = Inv-χ²(65, s²)
  p(μ|σ², y) = N(ȳ, σ²/66)

or

  p(μ|y) = t65(ȳ, s²/66).



For posterior predictive checks, do for i = 1, ..., M:
1. Generate (μ^(i), σ²^(i)) from p(μ, σ²|y)
2. Generate y1^{rep(i)}, ..., y66^{rep(i)} from N(μ^(i), σ²^(i))

[Figure: Example - Newcomb's data (histogram of the 66 measurements).]

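A minimal R sketch (not from the original notes) of the check of T(y) = min(yi); since the raw data are not reproduced in the notes, only the stated summaries (n = 66, ȳ = 26.21, s = 10.75, observed minimum -44) are used.

  # Sketch: posterior predictive check of T(y) = min(y) for the Newcomb example.
  n <- 66; ybar <- 26.21; s <- 10.75; T_obs <- -44
  M <- 5000
  T_rep <- replicate(M, {
    sigma2 <- (n - 1) * s^2 / rchisq(1, n - 1)     # sigma^2 | y ~ Inv-chi^2(65, s^2)
    mu <- rnorm(1, ybar, sqrt(sigma2 / n))         # mu | sigma^2, y
    min(rnorm(n, mu, sqrt(sigma2)))                # T(y_rep) = smallest replicated value
  })
  mean(T_rep <= T_obs)        # probability of a replicated minimum as extreme as -44
  hist(T_rep)                 # compare with the observed minimum of -44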

[Figure: Example - replicated datasets (histograms of replicated samples).]

Is the observed minimum value -44 consistent with the model?
  Define T(y) = min{yi}
  Get the distribution of T(y^rep).
  The large negative value -44 is inconsistent with the distribution of T(y^rep) = min{yi^rep}: the observed T(y) is very unlikely under that distribution.
  The model is inadequate; it must account for the long tail to the left.



[Figure: posterior predictive distribution of the smallest observation.]

Is the sampling variance captured by the model?
  Define

    T(y) = s²_y = (1/65) Σi (yi - ȳ)²

  and obtain the distribution of T(y^rep).
  The observed variance of 115 is very consistent with the distribution (across reps) of sample variances.
  The probability of observing a larger sample variance is 0.48: the observed variance sits in the middle of the distribution of T(y^rep).
  But this is meaningless: T(y) is a sufficient statistic for σ², so the model MUST have fit it well.

Symmetry in the center of the distribution
  Look at the difference in the distance of the 10th and 90th percentiles from the mean.
  The approximate 10th percentile is the 6th order statistic y(6), and the approximate 90th percentile is y(61).
percentile is y(61).


Define

  T(y, θ) = | y(61) - θ | - | y(6) - θ |

We need the joint distribution of T(y, θ) and of T(y^rep, θ).
  The model accounts for symmetry in the center of the distribution.
  The joint distribution of T(y, θ) and T(y^rep, θ) is about evenly distributed along the line T(y, θ) = T(y^rep, θ).
  The probability that T(y^rep, θ) > T(y, θ) is about 0.26, plausible under sampling variability.

[Figure: posterior predictive scatterplots for the variance and symmetry checks.]



Choice of discrepancy measure

Often, we choose more than one measure, to investigate different attributes of the model.

Can smoking behavior in adolescents be predicted using information on covariates? (Example on page 172 of Gelman et al.)

Data: six observations of smoking habits in 2,000 adolescents, every six months.

Two models:
  Logistic regression model
  Latent-class model: propensity to smoke modeled as a nonlinear function of predictors. Conditional on smoking, fit the same logistic regression model as above.

Three different test statistics chosen:
  % who never smoked
  % who always smoked
  % of incident smokers: began not smoking but then switched to smoking and continued doing so.

Generate replicate datasets from both models. Compute the three test statistics in each of the replicated datasets. Compare the posterior predictive distributions of each of the three statistics with the value of the statistic in the observed dataset.

Results: Model 2 is slightly better than Model 1 at predicting the % of always-smokers, but both models fail to predict the % of incident smokers.

                                  Model 1                         Model 2
  Test variable        T(y)       95% CI for T(y^rep)  p-value    95% CI for T(y^rep)  p-value
  % never smokers      77.3       [75.5, 78.2]         0.27       [74.8, 79.9]         0.53
  % always-smokers     5.1        [5.0, 6.5]           0.95       [3.8, 6.3]           0.44
  % incident smokers   8.4        [5.3, 7.9]           0.005      [4.9, 7.8]           0.004

ENAR - March 26, 2006 25 ENAR - March 26, 2006 26

Omnibus tests

It is often useful to also consider summary statistics:

χ² discrepancy: T(y, θ) = Σi (yi − E(yi|θ))² / var(yi|θ)

Deviance: T(y, θ) = −2 log p(y|θ)

The deviance is proportional to the mean squared error if the model is normal with constant variance.

In the classical approach, insert the null value or the MLE in place of θ in T(y, θ) and compare the test statistic to a reference distribution derived under asymptotic arguments. In the Bayesian approach, the reference distribution for the test statistic is automatically calculated from posterior predictive simulations.

A Bayesian χ² test is carried out as follows:
1. Compute the distribution of T(y, θ) for many draws of θ from the posterior and for the observed data.
2. Compute the distribution of T(y^rep, θ) for replicated datasets and posterior draws of θ.
3. Compute the Bayesian p-value: the probability of observing a more extreme value of T(y^rep, θ).
4. Calculate the p-value empirically from the simulations over θ and y^rep.
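A minimal R sketch of the Bayesian χ² check for a toy Poisson model, yi ~ Poisson(θ) with a Gamma(a, b) prior, so θ | y ~ Gamma(a + Σyi, b + n); the data and prior values are illustrative, not from the course examples:

# Bayesian chi-square discrepancy check for a simple Poisson model.
set.seed(1)
y <- c(0, 1, 3, 2, 8, 1, 0, 2, 1, 4)        # illustrative data
n <- length(y); a <- 1; b <- 1               # illustrative Gamma(a, b) prior
chisq.disc <- function(y, theta) sum((y - theta)^2 / theta)  # E(y|theta) = var(y|theta) = theta
M <- 5000
T.obs <- T.rep <- numeric(M)
for (m in 1:M) {
  theta    <- rgamma(1, a + sum(y), b + n)   # posterior draw
  y.rep    <- rpois(n, theta)                # replicated data
  T.obs[m] <- chisq.disc(y, theta)
  T.rep[m] <- chisq.disc(y.rep, theta)
}
mean(T.rep >= T.obs)                         # Bayesian p-value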


Criticisms of posterior predictive checks

Too conservative, because the data get used twice.

Posterior predictive p-values are not really p-values: the distribution under the null hypothesis is not uniform, as it should be. What counts as high or low if pp p-values are not uniform?

Some Bayesians object to the frequentist slant of using unobserved data for anything.

There has been a lot of recent work on improving posterior predictive p-values; Bayarri and Berger, JASA, 2000, is a good reference.

In spite of the criticisms, pp checks are easy to use in very general cases, and intuitively appealing.

Pp checks can be conservative

One criticism of pp model checks is that they tend to reject models only in the face of extreme evidence.

Example (from Stern):
y ~ N(θ, 1) and θ ~ N(0, 9).
Observation: y_obs = 10.
Posterior: p(θ|y) = N(0.9 y_obs, 0.9) = N(9, 0.9).
Posterior predictive distribution: N(9, 1.9).
The posterior predictive p-value is 0.23, so we would not reject the model.

The effect of the prior is minimized because the prior variance (9) is much larger than the sampling variance (1), so the posterior predictive mean is close to the observed datapoint. To reject, we would need to observe y ≈ 23.
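A quick R check of the numbers in Stern's example (the one-sided 0.05 cutoff used for the last statement is my assumption):

# Posterior predictive p-value in Stern's conservativeness example.
y.obs <- 10
post.mean <- 0.9 * y.obs              # N(0, 9) prior, N(theta, 1) likelihood
pred.sd   <- sqrt(0.9 + 1)            # posterior predictive sd
1 - pnorm(y.obs, post.mean, pred.sd)  # about 0.23

# Smallest y.obs that gives a one-sided p-value below 0.05:
p.val <- function(y) 1 - pnorm(y, 0.9 * y, sqrt(1.9))
uniroot(function(y) p.val(y) - 0.05, c(10, 50))$root   # about 23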

Graphical posterior predictive checks

The idea is to display the data together with simulated data.

Three kinds of checks:
Direct data display
Display of data summaries or parameter estimates
Graphs of residuals or other measures of discrepancy

Model comparison

Which model fits the data best?

Often, models are nested: a model with parameters θ is nested within a model with parameters (θ, φ). Comparison involves deciding whether adding φ to the model improves its fit. Improvement in fit may not justify the additional complexity.

It is also possible to compare non-nested models.

We focus on predictive performance and on model posterior probabilities to compare models.


Expected deviance

We compare the observed data to several models to see which predicts more accurately.

We summarize model fit using the deviance

D(y, θ) = −2 log p(y|θ).

It can be shown that the model with the lowest expected deviance is best in the sense of minimizing the (Kullback-Leibler) distance between the model p(y|θ) and the true distribution of y, f(y).

We compute the expected deviance by simulation:

D_avg(y) = E_{θ|y}[ D(y, θ) | y ].

An estimate is

D̂_avg(y) = (1/M) Σj D(y, θ^(j)).

Deviance information criterion - DIC

The idea is to estimate the error that would be expected when applying the model to future data.

Expected mean squared predictive error:

D_avg^pred = E[ (1/n) Σi (yi^rep − E(yi^rep|y))² ].

A model that minimizes the expected predictive deviance is best in the sense of out-of-sample predictive power.

An approximation to D_avg^pred is the Deviance Information Criterion, or DIC:

DIC = D̂_avg^pred = 2 D̂_avg(y) − D_θ̂(y),

where D_θ̂(y) = D(y, θ̂(y)) and θ̂(y) is a point estimator of θ, such as the posterior mean.
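A minimal R sketch of the DIC computation from posterior draws, for a toy normal model with known variance (the data, prior, and use of the posterior mean as θ̂ are illustrative assumptions):

# DIC = 2 * Davg.hat - D(y, theta.hat) for y_i ~ N(theta, 1), theta ~ N(0, 100).
set.seed(1)
y <- rnorm(30, mean = 1.5, sd = 1)                     # illustrative data
n <- length(y)
post.var  <- 1 / (1 / 100 + n)                         # conjugate normal posterior
post.mean <- post.var * sum(y)
theta.draws <- rnorm(2000, post.mean, sqrt(post.var))  # posterior draws
deviance <- function(theta) -2 * sum(dnorm(y, theta, 1, log = TRUE))
Davg.hat <- mean(sapply(theta.draws, deviance))        # estimated expected deviance
D.at.hat <- deviance(mean(theta.draws))                # deviance at posterior mean
DIC <- 2 * Davg.hat - D.at.hat
DIC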


Example: SAT study

We compare three models: no pooling, complete pooling, and the hierarchical model (partial pooling).

The deviance is

D(y, θ) = −2 Σj log N(yj | θj, σj²)
        = Σj [ log(2π σj²) + (yj − θj)²/σj² ].

The DIC for the three models was: 70.3 (no pooling), 61.5 (complete pooling), 63.4 (hierarchical model).

Based on DIC, we would pick the complete-pooling model. We still prefer the hierarchical model, because the assumption that all schools have an identical mean is strong.

Bayes Factors

Suppose that we wish to decide between two models M1 and M2 (different prior, sampling distribution, parameters).

Priors on the models are p(M1) and p(M2) = 1 − p(M1).

The posterior odds favoring M1 over M2 are

p(M1|y) / p(M2|y) = [ p(y|M1) / p(y|M2) ] [ p(M1) / p(M2) ].

The ratio p(y|M1)/p(y|M2) is called a Bayes factor. It tells us how much the data support one model over the other.

The Bayes factor BF12 is computed as

BF12 = ∫ p(y|θ1, M1) p(θ1|M1) dθ1 / ∫ p(y|θ2, M2) p(θ2|M2) dθ2.

Note that we need the marginal distribution of the data p(y) under each model. Therefore, the BF is only defined when p(y) is proper, and hence when the prior distribution in each model is proper.

For example, if y ~ N(θ, 1) with p(θ) ∝ 1, we get

p(y) ∝ ∫ (2π)^(−1/2) exp{ −(1/2)(y − θ)² } dθ = 1,

which is constant for any y and thus is not a proper distribution. This creates an indeterminacy, because we can increase or decrease the BF simply by using a different constant in the numerator and in the denominator.


Computation of Bayes Factors

The difficulty lies in the computation of p(y):

p(y) = ∫ p(y|θ) p(θ) dθ.

The simplest approach is to draw many values of θ from p(θ) and get a Monte Carlo approximation to the integral. This may not work well, because the θ's drawn from the prior may not fall in the region of the parameter space where the sampling distribution has most of its mass.

A better Monte Carlo approximation is similar to importance sampling. Note that, for any density h(θ) that integrates to 1,

1/p(y) = ∫ [ h(θ)/p(y) ] dθ = ∫ [ h(θ) / (p(y|θ) p(θ)) ] p(θ|y) dθ.

h(θ) can be anything (e.g., a normal approximation to the posterior). Draw values of θ from the posterior and evaluate the integral numerically. Computations can get tricky because the denominator can be small.

As n → ∞,

log(BF12) ≈ log p(y|θ̂1, M1) − log p(y|θ̂2, M2) − (1/2)(d1 − d2) log(n),

where θ̂i is the posterior mode under model Mi and di is the dimension of θi. Notice that the criterion penalizes a model for additional parameters.

Ranking models by log(BF) is then equivalent to ranking them by the BIC criterion, given by

BIC = log p(y|θ̂, M) − (1/2) d log(n).
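A minimal R sketch of the two Monte Carlo estimates of p(y), using a conjugate normal example so the two estimates can be compared (all data and prior values are illustrative):

# Two Monte Carlo estimates of the marginal likelihood p(y)
# for y_i ~ N(theta, 1), theta ~ N(0, 4).
set.seed(1)
y <- rnorm(20, 1, 1); n <- length(y)
loglik <- function(theta) sum(dnorm(y, theta, 1, log = TRUE))

# (1) Simple prior Monte Carlo: average the likelihood over prior draws.
th.prior <- rnorm(50000, 0, 2)
p.y.prior <- mean(exp(sapply(th.prior, loglik)))

# (2) Importance-sampling-style identity, with h = the (conjugate) posterior density:
post.var <- 1 / (1/4 + n); post.mean <- post.var * sum(y)
th.post <- rnorm(50000, post.mean, sqrt(post.var))
h <- dnorm(th.post, post.mean, sqrt(post.var))
inv.py <- mean(h / (exp(sapply(th.post, loglik)) * dnorm(th.post, 0, 2)))
p.y.is <- 1 / inv.py
c(prior.MC = p.y.prior, importance = p.y.is)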


Ordinary Linear Regression - Introduction

Question of interest: how does an outcome y vary as a function of a vector of covariates X?

We want the conditional distribution p(y|θ, x), where the observations (y, x)i are assumed exchangeable.

Covariates (x1, x2, ..., xk) can be discrete or continuous, and typically x1 = 1 for all n units.

The most common version of the model is the normal linear model

E(yi|β, X) = Σ_{j=1}^{k} βj xij.

Ordinary Linear Regression - Model

In ordinary linear regression models, the conditional variance is equal across observations: var(yi|θ, X) = σ². Thus θ = (β1, β2, ..., βk, σ²).

Likelihood: for the ordinary normal linear regression model,

p(y|X, β, σ²) = N(Xβ, σ²I),

where X is the n × k model matrix and I is the n × n identity matrix.

Priors: a non-informative prior distribution for (β, σ²) is

p(β, σ²) ∝ σ⁻²

(more on informative priors later).

Joint posterior:

p(β, σ²|y) ∝ (σ²)^(−n/2−1) exp{ −(1/(2σ²)) (y − Xβ)'(y − Xβ) }.

Consider the factorization p(β, σ²|y) = p(β|σ², y) p(σ²|y).

Conditional posterior for β: expand and complete the square in p(β|σ², y), viewed as a function of β. We get

p(β|σ², y) ∝ exp{ −(1/(2σ²)) [ β'X'Xβ − 2β'X'Xβ̂ ] }
           ∝ exp{ −(1/(2σ²)) (β − β̂)'X'X(β − β̂) },

for

β̂ = (X'X)⁻¹X'y.

Then

β | σ², y ~ N(β̂, σ²(X'X)⁻¹).


Ordinary Linear Regression - p(σ²|y)

The marginal posterior distribution of σ² is obtained by integrating the joint posterior with respect to β:

p(σ²|y) = ∫ p(β, σ²|y) dβ
        ∝ ∫ (σ²)^(−n/2−1) exp{ −(1/(2σ²)) (y − Xβ)'(y − Xβ) } dβ.

By expanding the square in the integrand, and adding and subtracting 2y'X(X'X)⁻¹X'y, we can write

(y − Xβ)'(y − Xβ) = (n − k)S² + (β − β̂)'X'X(β − β̂),

where S² = (y − Xβ̂)'(y − Xβ̂)/(n − k). Then

p(σ²|y) ∝ (σ²)^(−n/2−1) exp{ −(1/(2σ²)) (n − k)S² } ∫ exp{ −(1/(2σ²)) (β − β̂)'X'X(β − β̂) } dβ.

The integrand is the kernel of a k-dimensional normal, so the result of the integration is proportional to (σ²)^(k/2). Then

p(σ²|y) ∝ (σ²)^(−(n−k)/2−1) exp{ −(n − k)S² / (2σ²) },

proportional to an Inv-χ²(n − k, S²).

Ordinary Linear Regression - Notes

Note that β̂ and S² are the classical least-squares estimates of β and σ².

When is the posterior proper? p(β, σ²|y) is proper if
1. n > k
2. rank(X) = k: the columns of X are linearly independent, i.e., |X'X| ≠ 0.

To sample from the joint posterior:
1. Draw σ² from Inv-χ²(n − k, S²)
2. Given σ², draw the vector β from N(β̂, σ²(X'X)⁻¹)

For efficiency, compute β̂, S², and (X'X)⁻¹ once, before starting the repeated drawing.

Regression - Posterior predictive distribution

In regression, we often want to predict the outcome ỹ for a new set of covariates x̃. Thus, we wish to draw values from p(ỹ|y, X).

By simulation:
1. Draw σ² from Inv-χ²(n − k, S²)
2. Draw β from N(β̂, σ²(X'X)⁻¹)
3. Draw ỹi for i = 1, ..., m from N(x̃i'β, σ²)
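A minimal R sketch of posterior and posterior predictive simulation for this model with the noninformative prior (simulated data; the new covariate vector x.new is illustrative):

# Draws from p(beta, sigma^2 | y) and p(y.tilde | y) for the normal linear model
# with prior p(beta, sigma^2) proportional to 1/sigma^2.
set.seed(1)
n <- 50; X <- cbind(1, rnorm(n)); k <- ncol(X)
y <- X %*% c(2, -1) + rnorm(n, sd = 0.5)            # simulated data
XtX.inv  <- solve(crossprod(X))
beta.hat <- XtX.inv %*% crossprod(X, y)
S2 <- sum((y - X %*% beta.hat)^2) / (n - k)
x.new <- c(1, 0.3)                                   # illustrative new covariate vector
M <- 5000
draws <- t(replicate(M, {
  sigma2 <- (n - k) * S2 / rchisq(1, n - k)          # Inv-chi^2(n - k, S^2)
  beta   <- beta.hat + t(chol(sigma2 * XtX.inv)) %*% rnorm(k)
  y.new  <- rnorm(1, sum(x.new * beta), sqrt(sigma2))
  c(beta, sigma2 = sigma2, y.new = y.new)
}))
colMeans(draws)                                      # posterior and predictive means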


Regression - Posterior predictive distribution

We can derive p(ỹ|y) in steps, first considering p(ỹ|y, σ²).

Note that

p(ỹ|y, σ²) = ∫ p(ỹ|β, σ²) p(β|σ², y) dβ

is normal, because the exponential in the integrand is a quadratic function of (β, ỹ).

To get the mean and variance, use the conditioning trick:

E(ỹ|σ², y) = E[ E(ỹ|β, σ², y) | σ², y ] = E(X̃β|σ², y) = X̃β̂,

where the inner expectation averages over ỹ conditional on β and the outer expectation averages over β.

var(ỹ|σ², y) = E[ var(ỹ|β, σ², y) | σ², y ] + var[ E(ỹ|β, σ², y) | σ², y ]
             = E[ σ²I | σ², y ] + var[ X̃β | σ², y ]
             = σ²( I + X̃(X'X)⁻¹X̃' ).

The variance has two terms: σ²I is sampling variation, and σ²X̃(X'X)⁻¹X̃' is due to uncertainty about β.

To complete the specification of p(ỹ|y), we must integrate p(ỹ|y, σ²) with respect to the marginal posterior distribution of σ². The result is

p(ỹ|y) = t_{n−k}( X̃β̂, S²[ I + X̃(X'X)⁻¹X̃' ] ).

Regression example: radon measurements in Minnesota

Radon measurements yi were taken in three counties in Minnesota: Blue Earth, Clay and Goodhue. 14 houses in each of Blue Earth and Clay county and 13 houses in Goodhue were sampled. Measurements were taken in the basement and on the first floor.

We fit an ordinary regression model to the log radon measurements, without an intercept.

We define dummy variables as follows: X1 = 1 if the county is Blue Earth and 0 otherwise. Similarly, X2 and X3 are dummies for Clay and Goodhue counties, respectively. X4 = 1 if the measurement was taken on the first floor.

The model is

log(yi) = β1 x1i + β2 x2i + β3 x3i + β4 x4i + ei,

with ei ~ N(0, σ²). Thus

E(y|Blue Earth, basement)    = exp(β1)
E(y|Blue Earth, first floor) = exp(β1 + β4)
E(y|Clay, basement)          = exp(β2)
E(y|Clay, first floor)       = exp(β2 + β4)
E(y|Goodhue, basement)       = exp(β3)
E(y|Goodhue, first floor)    = exp(β3 + β4)

We used noninformative N(0, 1000) priors for the regression coefficients and a noninformative Gamma(0.01, 0.01) prior for the error precision.

[WinBUGS trace plots (iterations 3001-5000) and box plots of the posterior draws of beta[1]-beta[4] omitted.]
Regression - Posterior predictive checks

[Posterior predictive densities of expected radon levels, by county and floor (basement and first floor), omitted.]

For regression models, there are well-known methods for checking the model and its assumptions using estimated residuals. Residuals:

εi = yi − xi'β.

Two useful test statistics are:
Proportion of outliers among the residuals
Correlation between the squared residuals and the fitted values ŷ

Posterior predictive distributions of both statistics can be obtained via simulation.

Define a standardized residual as

ei = (yi − xi'β)/σ.

If the normal model is correct, the standardized residuals should be approximately N(0, 1), and therefore |ei| > 3 suggests that the ith observation may be an outlier.

To derive the posterior predictive distribution of the proportion of outliers q and of ρ, the correlation between (e², ŷ), do (see the R sketch below):
1. Draw (σ², β) from the joint posterior distribution
2. Draw y^rep from N(Xβ, σ²I) given the existing X
3. Run the regression of y^rep on X, and save the residuals
4. Compute ρ
5. Compute the proportion q of large standardized residuals
6. Repeat for another y^rep

The approach is frequentist in nature: we act as if we could repeat the experiment many times.

Inspection of the posterior predictive distributions of ρ and of q provides information about model adequacy.
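A minimal R sketch of this check, reusing the posterior-sampling steps shown earlier (the data are simulated; the |e| > 3 cutoff follows the slides):

# Posterior predictive check of the proportion of outliers q and of the
# correlation rho between squared standardized residuals and fitted values.
set.seed(1)
n <- 50; X <- cbind(1, rnorm(n)); k <- ncol(X)
y <- X %*% c(2, -1) + rnorm(n, sd = 0.5)
XtX.inv <- solve(crossprod(X)); beta.hat <- XtX.inv %*% crossprod(X, y)
S2 <- sum((y - X %*% beta.hat)^2) / (n - k)
M <- 2000
q.rep <- rho.rep <- numeric(M)
for (m in 1:M) {
  sigma2 <- (n - k) * S2 / rchisq(1, n - k)
  beta   <- beta.hat + t(chol(sigma2 * XtX.inv)) %*% rnorm(k)
  y.rep  <- drop(X %*% beta) + rnorm(n, sd = sqrt(sigma2))
  fit    <- lm(y.rep ~ X - 1)                  # regression of y.rep on X
  e      <- resid(fit) / sqrt(sigma2)          # standardized residuals
  q.rep[m]   <- mean(abs(e) > 3)
  rho.rep[m] <- cor(e^2, fitted(fit))
}
quantile(q.rep, c(.025, .5, .975)); quantile(rho.rep, c(.025, .5, .975))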


Results of the posterior predictive checks suggest that there are no outliers and that the correlation between the squared residuals and the fitted values is negligible.

The 95% credible set for the proportion of absolute standardized residuals (among the 41 observations) above 3 was (0, 4.88), with a mean of 0.59% and a median of 0. The 95% credible set for ρ was (−0.31, 0.32), with a mean of 0.011.

Notice that the posterior distribution of the proportion of outliers is skewed. This sometimes happens when the quantity of interest is bounded and most of the mass of its distribution is close to the boundary.

Regression with unequal variances

Consider now the case where y ~ N(Xβ, Σy), with Σy ≠ σ²I.

The covariance matrix Σy is n × n and has n(n + 1)/2 distinct parameters, which cannot be estimated from n observations. We must either specify Σy or assign it an informative prior distribution. Typically, some structure is imposed on Σy to reduce the number of free parameters.

Regression - known Σy

As before, let p(β) ∝ 1 be the non-informative prior.

Since Σy is positive definite and symmetric, it has an upper-triangular square-root matrix (Cholesky factor) Σy^(1/2) such that Σy^(1/2) (Σy^(1/2))' = Σy, so that if

y = Xβ + e,  e ~ N(0, Σy),

then

Σy^(−1/2) y = Σy^(−1/2) X β + Σy^(−1/2) e,  Σy^(−1/2) e ~ N(0, I).

With Σy known, just proceed as in ordinary linear regression, but use the transformed y and X as above and fix σ² = 1. Algebraically, this is equivalent to computing

β̂ = (X'Σy⁻¹X)⁻¹ X'Σy⁻¹ y
Vβ = (X'Σy⁻¹X)⁻¹.

Note: by using the Cholesky factor you avoid computing the n × n inverse Σy⁻¹.


Prediction of new ỹ with known Σy

Even if we know Σy, prediction of new observations is more complicated: we must know the covariance matrix of the old and new data.

Example: heights of children from the same family are correlated. To predict the height of a new child when a brother is in the old dataset, we must include that correlation in the prediction.

Let ỹ be ñ new observations given an ñ × k matrix of regressors X̃. The joint distribution of (y, ỹ) given X, X̃, β is multivariate normal with mean (Xβ, X̃β) and block covariance matrix with blocks Σy, Σyỹ, Σỹy and Σỹ. Then

ỹ | y, β, Σ ~ N(ŷ̃, Vỹ),

where

ŷ̃ = X̃β + Σỹy Σy⁻¹ (y − Xβ)
Vỹ = Σỹ − Σỹy Σy⁻¹ Σyỹ.

Regression - unknown Σy

To draw inferences about (β, Σy), we proceed in steps: derive p(β|Σy, y) and then p(Σy|y).

Let p(β) ∝ 1 as before.

We know that

p(Σy|y) = p(β, Σy|y) / p(β|Σy, y) ∝ p(Σy) p(y|β, Σy) / N(β|β̂, Vβ),

where (β̂, Vβ) depend on Σy. The expression must hold for any β, so we set β = β̂. Since p(β|Σy, y) = N(β̂, Vβ) ∝ |Vβ|^(−1/2) at β = β̂, we get

p(Σy|y) ∝ p(Σy) |Vβ|^(1/2) |Σy|^(−1/2) exp[ −(1/2) (y − Xβ̂)'Σy⁻¹(y − Xβ̂) ].

In principle, p(Σy|y) could be evaluated for a range of values of Σy. However:
1. It is difficult to determine a prior for an n × n unstructured matrix.
2. β̂ and Vβ depend on Σy, and it is very difficult to draw values of Σy from p(Σy|y).

We need to put some structure on Σy.


Regression - Σy = σ²Qy

Suppose that we know Σy up to a scalar factor σ², i.e., Σy = σ²Qy with Qy known.

The non-informative prior on (β, σ²) is p(β, σ²) ∝ σ⁻².

Results follow directly from ordinary linear regression, by using the transformations Qy^(−1/2) y and Qy^(−1/2) X, which is equivalent to computing

β̂ = (X'Qy⁻¹X)⁻¹ X'Qy⁻¹ y
Vβ = (X'Qy⁻¹X)⁻¹
s² = (n − k)⁻¹ (y − Xβ̂)'Qy⁻¹(y − Xβ̂).

To estimate the joint posterior distribution p(β, σ²|y), do:
1. Draw σ² from Inv-χ²(n − k, s²)
2. Draw β from N(β̂, Vβ σ²)

In large datasets, use the Cholesky factor transformation and unweighted regression to avoid computing Qy⁻¹.

Regression - other covariance structures

Weighted regression: in some applications, Σy = diag(σ²/wi), for known weights wi and σ² unknown. Inference is the same as before, but now Qy⁻¹ = diag(wi). In practice, to uncouple σ² from the scale of the weights, we multiply the weights by a factor so that their product equals 1.

Parametric model for unequal variances: the variances may depend on the weights in a non-linear fashion, Σii = σ² v(wi, φ), for an unknown parameter φ ∈ (0, 1) and a known function v such as v = wi^(−φ).

Note that for that function v:
φ = 0 implies Σii = σ² for all i
φ = 1 implies Σii = σ²/wi for all i.


Parametric model for unequal variances

Three parameters to estimate: β, σ², φ.

A possible prior for φ is uniform on [0, 1]. The non-informative prior for (β, σ²) is p(β, σ²) ∝ σ⁻².

Joint posterior distribution:

p(β, σ², φ|y) ∝ p(φ) p(β, σ²) Π_{i=1}^{n} N(yi; (Xβ)i, σ² v(wi, φ))
             ∝ p(φ) σ⁻² Πi N(yi | (Xβ)i, σ² v(wi, φ)).

For a given φ, we are back in the earlier weighted regression case, with Qy = diag(v(w1, φ), ..., v(wn, φ)).

This suggests the following scheme:
1. Draw φ from the marginal posterior p(φ|y)
2. Compute Qy^(−1/2) y and Qy^(−1/2) X
3. Draw σ² from p(σ²|φ, y) as in ordinary regression
4. Draw β from p(β|σ², φ, y) as in ordinary regression

Marginal posterior distribution of φ:

p(φ|y) = p(β, σ², φ|y) / p(β, σ²|φ, y)
       = p(β, σ², φ|y) / [ p(β|σ², φ, y) p(σ²|φ, y) ]
       ∝ p(φ) Πi N(yi | (Xβ)i, σ² v(wi, φ)) / [ N(β; β̂, Vβ σ²) Inv-χ²(σ²; n − k, s²) ].

The expression must hold for any (β, σ²), so we set β = β̂ and σ² = s². Recall that β̂ and s² depend on φ, and that the weights are scaled to have product equal to 1. Then

p(φ|y) ∝ p(φ) |Vβ|^(1/2) (s²)^(−(n−k)/2).

To carry out the computation, first sample φ from p(φ|y):
1. Evaluate p(φ|y) for a range of values of φ in [0, 1]
2. Use the inverse-cdf method to sample φ

Given φ, compute ỹ = Qy^(−1/2) y and X̃ = Qy^(−1/2) X, and compute

β̂ = (X̃'X̃)⁻¹X̃'ỹ
Vβ = (X̃'X̃)⁻¹
s² = (n − k)⁻¹ (ỹ − X̃β̂)'(ỹ − X̃β̂).

Then draw σ² from Inv-χ²(n − k, s²) and draw β from N(β̂, σ²Vβ).
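A minimal R sketch of the grid/inverse-cdf step for φ and the subsequent draws of (σ², β), assuming v(w, φ) = w^(−φ) and simulated data (everything here is illustrative):

# Weighted regression with variance sigma^2 * w_i^(-phi), phi in [0, 1].
set.seed(1)
n <- 60; X <- cbind(1, runif(n)); k <- ncol(X)
w <- runif(n, 0.5, 2); w <- w / exp(mean(log(w)))      # scale weights: product = 1
y <- drop(X %*% c(1, 2)) + rnorm(n, sd = sqrt(w^(-0.7)))

log.post.phi <- function(phi) {
  v  <- w^(-phi)                                       # v(w_i, phi)
  Xt <- X / sqrt(v); yt <- y / sqrt(v)                 # Q^(-1/2) transform
  Vb <- solve(crossprod(Xt)); bh <- Vb %*% crossprod(Xt, yt)
  s2 <- sum((yt - Xt %*% bh)^2) / (n - k)
  0.5 * determinant(Vb)$modulus - (n - k) / 2 * log(s2)  # flat prior on phi
}

grid <- seq(0, 1, length = 201)
lp   <- sapply(grid, log.post.phi)
p    <- exp(lp - max(lp)); p <- p / sum(p)
phi  <- sample(grid, 1, prob = p)                      # draw phi from the discretized posterior

v <- w^(-phi); Xt <- X / sqrt(v); yt <- y / sqrt(v)
Vb <- solve(crossprod(Xt)); bh <- Vb %*% crossprod(Xt, yt)
s2 <- sum((yt - Xt %*% bh)^2) / (n - k)
sigma2 <- (n - k) * s2 / rchisq(1, n - k)
beta   <- bh + t(chol(sigma2 * Vb)) %*% rnorm(k)
c(phi = phi, sigma2 = sigma2, beta = beta)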


Including prior information

Suppose we wish to add prior information about a single regression coefficient βj of the form

βj ~ N(βj0, σ²_βj),

with βj0 and σ²_βj known.

Prior information can be added in the form of an additional data point. An ordinary observation y is normal with mean xβ and variance σ². As a function of βj, the prior can be viewed as an observation with value βj0, zeroes on all x's except xj, and variance σ²_βj.

To include the prior information, do the following:
1. Append one more data point to the vector y, with value βj0.
2. Add one row to X with zeroes except in the jth column.
3. Add a diagonal element with value σ²_βj to Σy.

Now apply the computational methods for the non-informative prior: given Σy, the posterior for β is obtained by weighted linear regression.

Adding prior information for several β's

Suppose that for the entire vector β,

β ~ N(β0, Σβ).

Proceed as before: add k data points and draw posterior inference by weighted linear regression applied to observations y*, explanatory variables X*, and variance matrix Σ*:

y* = (y, β0)',  X* = (X, Ik)',  Σ* = block-diag(Σy, Σβ).

Computation can be carried out conditional on σ² first and then using the inverse cdf for σ², or by Gibbs sampling.

Prior information about σ²

Typically we do not wish to include prior information about σ². If we do, we can use the conjugate prior

σ² ~ Inv-χ²(n0, σ0²).

The marginal posterior of σ² is then

σ² | y ~ Inv-χ²( n0 + n, (n0σ0² + nS²)/(n0 + n) ).

If prior information on β is also incorporated, S² is replaced by the corresponding value from the regression of y* on X* and Σ*, and n is replaced by the length of y*.


Inequality constraints on β

Sometimes we wish to impose inequality constraints such as

β1 > 0

or

β2 < β3 < β4.

The easiest way is to ignore the constraint until the end:
Simulate (β, σ²) from the posterior.
Discard all the draws that do not satisfy the constraint.

This is typically a reasonably efficient way to proceed, unless the constraint eliminates a large portion of the unconstrained posterior distribution. If so, the data tend to contradict the model.


Generalized Linear Models

Generalized linear models are an extension of linear models to the case where the relationship between E(y|X) and X is not linear, or the normal assumption is not appropriate.

Sometimes a transformation suffices to return to the linear setup. Consider the multiplicative model

yi = x_{i1}^{b1} x_{i2}^{b2} x_{i3}^{b3} εi.

A simple log transformation leads to

log(yi) = b1 log(x_{i1}) + b2 log(x_{i2}) + b3 log(x_{i3}) + ei.

When simple approaches do not work, we use GLIMs.

There are three main components in the model:
1. Linear predictor η = Xβ.
2. Link function g(.) relating the linear predictor to the mean of the outcome variable: E(y|X) = μ = g⁻¹(η) = g⁻¹(Xβ).
3. Distribution of the outcome variable y with mean μ = E(y|X). The distribution can also depend on a dispersion parameter φ:

p(y|X, β, φ) = Π_{i=1}^{n} p(yi | (Xβ)i, φ).

In standard GLIMs for Poisson and binomial data, φ = 1. In many applications, however, excess dispersion is present.

Some standard GLIMs

Linear model: the simplest GLIM, with identity link function g(μ) = μ.

Poisson model: mean and variance μ, and link function log(μ) = Xβ, so that

μ = exp(Xβ) = exp(η).

For y = (y1, ..., yn):

p(y|β) = Π_{i=1}^{n} (1/yi!) exp(−exp(ηi)) (exp(ηi))^{yi},

with ηi = (Xβ)i.

Binomial model: suppose that yi ~ Bin(ni, μi), with ni known. The standard link function is the logit of the probability of success:

g(μi) = log( μi / (1 − μi) ) = (Xβ)i = ηi.

For a vector of data y:

p(y|β) = Π_{i=1}^{n} (ni choose yi) [ exp(ηi)/(1 + exp(ηi)) ]^{yi} [ 1/(1 + exp(ηi)) ]^{ni − yi}.

Another link used in econometrics is the probit link:

Φ⁻¹(μi) = ηi,

with Φ(.) the normal cdf.


In practice, inference from logit and probit models is almost the same, except in the extreme tails of the distribution.

Overdispersion

In many applications, the model can be formulated to allow for extra variability or overdispersion. E.g., in the Poisson model, the variance is constrained to be equal to the mean.

As an example, suppose that the data are the number of fatal car accidents at K intersections over T years. Covariates might include intersection characteristics and traffic control devices (stop lights, etc).

To accommodate overdispersion, we model the (log) rate as a linear combination of covariates and add a random effect for intersection, with its own population distribution.

Setting up GLIMs

Canonical link functions: the canonical link is the function of the mean that appears in the exponent of the exponential-family form of the sampling distribution. All links discussed so far are canonical except for the probit.

Offset: arises when counts are obtained from different population sizes, volumes, or time periods, and we need to use an exposure. An offset is a covariate with a known coefficient.

Example: the number of incidents in a given exposure time T is Poisson with rate λ per unit of time, so the mean number of incidents is λT. The link function would be log(λ) = ηi, but here the mean of y is not λ but λT.

To apply the Poisson GLIM, add a column to X with values log(T) and fix its coefficient to 1. This is an offset.


Interpreting GLIMs

In linear models, βj represents the change in the outcome when xj is changed by one unit. Here, βj reflects changes in g(E(y|x)) when xj is changed, so the effect of changing xj depends on the current value of x.

To translate effects onto the scale of y, measure changes relative to a baseline

y0 = g⁻¹(x0β).

A change in x of Δx takes the outcome from y0 to y1, where

g(y0) = x0β,  y0 = g⁻¹(x0β)

and

y1 = g⁻¹( g(y0) + (Δx)β ).

Priors in GLIM

We focus on priors for β, although sometimes φ is present and has its own prior.

Non-informative prior for β: with p(β) ∝ 1, the posterior mode equals the MLE of β. Approximate posterior inference can be based on a normal approximation to the posterior at the mode.

Conjugate prior for β: as in regression, express prior information about β in terms of hypothetical data obtained under the same model. Augment the data vector and model matrix with n0 hypothetical observations y0 and an n0 × k matrix of hypothetical predictors X0, and use the non-informative prior for β in the augmented model.

Non-conjugate priors: it is often more natural to model p(β|β0, Σ0) = N(β0, Σ0) with (β0, Σ0) known. Approximate computation based on the normal approximation (see next) is particularly suitable.

Hierarchical GLIM: same approach as in linear models. Model some of the β's as exchangeable with a common population distribution with unknown parameters, and place hyperpriors on those parameters.

Computation

Posterior distributions of the parameters can be estimated using MCMC methods in WinBUGS or other software. Metropolis-within-Gibbs will often be necessary: in GLIMs, the full conditionals most often do not have a standard form.

An alternative is to approximate the sampling distribution with a cleverly chosen normal approximation. Idea:
Find the mode (β̂, φ̂) of the likelihood, perhaps conditional on hyperparameters.
Create pseudo-data with their pseudo-variances (see below).
Model the pseudo-data as normal with known (pseudo-)variances.

Normal approximation to the likelihood

Objective: find zi and σi² such that the normal likelihood

N(zi | (Xβ)i, σi²)

is a good approximation to the GLIM likelihood p(yi | (Xβ)i, φ).

Let (β̂, φ̂) be the mode of (β, φ), so that η̂i is the mode of ηi.

For L the log-likelihood, write

p(y1, ..., yn | β, φ) = Πi p(yi|ηi, φ) = Πi exp( L(yi|ηi, φ) ).

Approximate each factor in the exponent by a normal density in ηi:

L(yi|ηi, φ) ≈ L(yi|η̂i, φ) − (1/(2σi²)) (zi − ηi)²,

where (zi, σi²) depend on (yi, η̂i, φ).

To find expressions for (zi, σi²), match the first- and second-order terms of a Taylor expansion of L around η̂i and solve for zi and σi². Let L' = ∂L/∂ηi and L'' = ∂²L/∂ηi². Matching

L' = (1/σi²)(zi − ηi),  L'' = −1/σi²

gives

zi = η̂i − L'(yi|η̂i, φ) / L''(yi|η̂i, φ)
σi² = −1 / L''(yi|η̂i, φ).

Example: binomial model with logit link.

L(yi|ηi) = yi log( exp(ηi)/(1 + exp(ηi)) ) + (ni − yi) log( 1/(1 + exp(ηi)) )
         = yi ηi − ni log(1 + exp(ηi)).

Then

L' = yi − ni exp(ηi)/(1 + exp(ηi))
L'' = −ni exp(ηi)/(1 + exp(ηi))².

Pseudo-data and pseudo-variances:

zi = η̂i + [ (1 + exp(η̂i))² / (ni exp(η̂i)) ] ( yi − ni exp(η̂i)/(1 + exp(η̂i)) )
σi² = (1/ni) (1 + exp(η̂i))² / exp(η̂i).

Models for multinomial responses

Multinomial data: the outcomes y = (y1, ..., yk) are counts in k categories.

Examples:
Number of students receiving grades A, B, C, D or F
Number of alligators that prefer to eat reptiles, birds, fish, invertebrate animals, or other (see example later)
Number of survey respondents who prefer Coke, Pepsi or tap water

In Chapter 3, we saw non-hierarchical multinomial models:

p(y|θ) ∝ Π_{j=1}^{k} θj^{yj},

with θj the probability of the jth outcome, Σ_{j=1}^{k} θj = 1 and Σ_{j=1}^{k} yj = n.
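A minimal R sketch of the pseudo-data construction for the binomial-logit case (the data are simulated, and taking the mode η̂ from a standard glm() fit is my shortcut):

# Pseudo-data z_i and pseudo-variances sigma_i^2 for a binomial-logit GLIM,
# evaluated at the likelihood mode (here taken from a standard glm fit).
set.seed(1)
n.i <- rep(20, 30); x <- runif(30)
y   <- rbinom(30, n.i, plogis(-1 + 2 * x))            # illustrative data

fit <- glm(cbind(y, n.i - y) ~ x, family = binomial)
eta <- fit$linear.predictors                          # eta.hat at the mode
p   <- exp(eta) / (1 + exp(eta))

z      <- eta + (1 + exp(eta))^2 / (n.i * exp(eta)) * (y - n.i * p)  # pseudo-data
sigma2 <- (1 + exp(eta))^2 / (n.i * exp(eta))                        # pseudo-variances

# The normal approximation treats z_i ~ N((X beta)_i, sigma2_i), i.e. a weighted
# linear regression of z on x with weights 1/sigma2:
lm(z ~ x, weights = 1 / sigma2)$coef                  # close to coef(fit)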

Here we model the θj as functions of covariates (or predictors) X, with corresponding regression coefficients βj. For a full hierarchical structure, the βj are modeled as exchangeable with some common population distribution p(β|μ, τ).

The model can be developed as an extension of either the binomial or the Poisson model.

Logit model for multinomial data

Here i = 1, ..., I indexes covariate patterns. E.g., in the alligator example, 2 sizes × four lakes = 8 covariate categories.

Let yi be a multinomial random variable with sample size ni and k possible outcomes. Then

yi ~ Mult(ni; αi1, ..., αik),

with Σj yij = ni and Σ_{j=1}^{k} αij = 1. αij is the probability of the jth outcome for the ith covariate combination.

Standard parametrization: the log of the probability of the jth outcome relative to the baseline category j = 1 is

log( αij / αi1 ) = ηij = (Xβj)i,

with βj a vector of regression coefficients for the jth category.

Sampling distribution:

p(y|β) ∝ Π_{i=1}^{I} Π_{j=1}^{k} [ exp(ηij) / Σ_{l=1}^{k} exp(ηil) ]^{yij}.

For identifiability, β1 = 0 and thus ηi1 = 0 for all i. βj is the effect of changing X on the probability of category j relative to category 1.

Typically, indicators for each outcome category are added to the predictors to indicate the relative frequency of each category when X = 0. Then

ηij = δj + (Xβj)i,

with δ1 = β1 = 0 typically.

Example from WinBUGS - Alligators

Agresti (1990) analyzes the feeding choices of 221 alligators.

The response is one of five categories: fish, invertebrate, reptile, bird, other.

Two covariates: length of alligator (less than 2.3 meters or larger than 2.3 meters) and lake (Hancock, Oklawaha, Trafford, George). There are 2 × 4 = 8 covariate combinations (see data).

For each combination of lake i and size j, we have counts in the five possible categories, yij = (yij1, ..., yij5). The model is

p(yij | pij, nij) = Mult(yij | nij, pij1, ..., pij5),

with

pijk = exp(ηijk) / Σ_{l=1}^{5} exp(ηijl)

and

ηijk = αk + βik + γjk.

Here,
αk is the baseline indicator for category k,
βik is the coefficient of the indicator for lake i,
γjk is the coefficient of the indicator for size j.


Posterior summaries (box plots of alpha, beta and gamma omitted):

           Mean     Std     2.5%     Median   97.5%
alpha[2]  -1.838   0.5278  -2.935   -1.823   -0.8267
alpha[3]  -2.655   0.706   -4.261   -2.593   -1.419
alpha[4]  -2.188   0.58    -3.382   -2.153   -1.172
alpha[5]  -0.7844  0.3691  -1.531   -0.7687  -0.1168

           Mean     Std     2.5%     Median   97.5%
beta[2,2]  2.706   0.6431   1.51     2.659    4.008
beta[2,3]  1.398   0.8571  -0.2491   1.367    3.171
beta[2,4] -1.799   1.413   -5.1     -1.693    0.5832
beta[2,5] -0.9353  0.7692  -2.555   -0.867    0.4413
beta[3,2]  2.932   0.6864   1.638    2.922    4.297
beta[3,3]  1.935   0.8461   0.37     1.886    3.842
beta[3,4]  0.3768  0.7936  -1.139    0.398    1.931
beta[3,5]  0.7328  0.5679  -0.3256   0.7069   1.849
beta[4,2]  1.75    0.6116   0.636    1.73     3.023
beta[4,3] -1.595   1.447   -4.847   -1.433    0.9117
beta[4,4] -0.7617  0.8026  -2.348   -0.7536   0.746
beta[4,5] -0.843   0.5717  -1.97    -0.8492   0.281

            Mean     Std     2.5%     Median   97.5%
gamma[2,2] -1.523   0.4101  -2.378   -1.523   -0.7253
gamma[2,3]  0.342   0.5842  -0.788    0.3351   1.476
gamma[2,4]  0.7098  0.6808  -0.651    0.7028   2.035
gamma[2,5] -0.353   0.4679  -1.216   -0.3461   0.518

[Posterior summaries and box plots of the cell probabilities p[lake, size, food] omitted.]

Interpretation of results

Because we set several model parameters to zero, interpreting the results is tricky. For example:

beta[2,2] has posterior mean 2.706. This is the effect of lake Oklawaha (relative to Hancock) on the alligators' preference for invertebrates relative to fish. Since beta[2,2] > 0, we conclude that alligators in Oklawaha eat more invertebrates than do alligators in Hancock (even though both may prefer fish!).

gamma[2,2] is the effect of size 2 relative to size 1 on the relative preference for invertebrates. Since gamma[2,2] < 0, we conclude that large alligators prefer fish more than do small alligators.

The alpha's are baseline effects for each type of food relative to fish.

Hierarchical Poisson model

Count data are often modeled using a Poisson model. If y ~ Poisson(λ), then E(y) = var(y) = λ.

When the counts are assumed exchangeable given λ, and the rates can also be assumed to be exchangeable, a Gamma population model for the rates is often chosen. The hierarchical model is then

yi ~ Poisson(λi)
λi ~ Gamma(α, β).
Priors for the hyperparameters are often taken to be Gamma (or exponential):

α ~ Gamma(a, b),  β ~ Gamma(c, d),

with (a, b, c, d) known.

The joint posterior distribution is

p(λ, α, β|y) ∝ Πi [ λi^{yi} exp{−λi} ] Πi [ (β^α/Γ(α)) λi^{α−1} exp{−βλi} ] α^{a−1} exp{−bα} β^{c−1} exp{−dβ}.

To carry out Gibbs sampling we need the full conditional distributions.

The conditional for λi is

p(λi | all) ∝ λi^{yi+α−1} exp{−λi(β + 1)},

which is proportional to a Gamma with parameters (yi + α, β + 1).

The full conditional for α is

p(α | all) ∝ (β^α/Γ(α))^n (Πi λi)^{α−1} α^{a−1} exp{−bα},

which does not have a standard form.

For β:

p(β | all) ∝ β^{nα} Πi exp{−βλi} β^{c−1} exp{−dβ}
           ∝ β^{nα+c−1} exp{ −β(Σi λi + d) },

which is proportional to a Gamma with parameters (nα + c, Σi λi + d).
Computation:
Given α, β, draw each λi from the corresponding Gamma conditional.
Draw α using a Metropolis step, rejection sampling, or the inverse-cdf method.
Draw β from the Gamma conditional.

See the Italian marriages example.

Example: Italian marriages

Gill (2002) collected data on the number of marriages per 1,000 people in Italy during 1936-1951.

Question: did the number of marriages decrease during the WWII years (1939-1945)?

Model:
The numbers of marriages yi are Poisson with year-specific means λi.
Assuming that the rates of marriage are exchangeable across years, we model the λi as Gamma(α, β).
To complete the model specification, place independent Gamma priors on (α, β), with known hyper-parameter values.
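A minimal R sketch of this Gibbs sampler (with a random-walk Metropolis step for α), using the marriage counts and the Gamma(1, 1) hyperpriors that appear in the WinBUGS code below:

# Gibbs sampler for y_i ~ Poisson(lambda_i), lambda_i ~ Gamma(alpha, beta),
# alpha ~ Gamma(1, 1), beta ~ Gamma(1, 1); Metropolis step for alpha.
set.seed(1)
y <- c(7,9,8,7,7,6,6,5,5,7,9,10,8,8,8,7); n <- length(y)
a <- b <- cc <- d <- 1
M <- 5000
alpha <- 1; beta <- 1
lam.out <- matrix(NA, M, n); ab.out <- matrix(NA, M, 2)

log.cond.alpha <- function(alpha, lam, beta)
  n * (alpha * log(beta) - lgamma(alpha)) + (alpha - 1) * sum(log(lam)) +
  (a - 1) * log(alpha) - b * alpha

for (m in 1:M) {
  lam  <- rgamma(n, y + alpha, beta + 1)           # lambda_i | rest
  beta <- rgamma(1, n * alpha + cc, sum(lam) + d)  # beta | rest
  alpha.prop <- abs(alpha + rnorm(1, 0, 0.3))      # reflecting random-walk proposal
  if (log(runif(1)) < log.cond.alpha(alpha.prop, lam, beta) -
                      log.cond.alpha(alpha, lam, beta)) alpha <- alpha.prop
  lam.out[m, ] <- lam; ab.out[m, ] <- c(alpha, beta)
}
colMeans(lam.out[-(1:1000), ])                     # posterior means of the yearly rates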


WinBUGS code:

model {
for (i in 1:16) {
y[i] ~ dpois(l[i])
l[i] ~ dgamma(alpha, beta)
}
alpha ~ dgamma(1,1)
beta ~ dgamma(1,1)
warave <- (l[4]+l[5]+l[6]+l[7]+l[8]+l[9]+l[10]) / 7
nonwarave <- (l[1]+l[2]+l[3]+l[11]+l[12]+l[13]+l[14]+l[15]+l[16]) / 9
diff <- nonwarave - warave
}

list(y = c(7,9,8,7,7,6,6,5,5,7,9,10,8,8,8,7))

Results

[95% credible sets and box plots for lambda[1]-lambda[16], the histogram of the difference between the non-war and war-years marriage rates, and the posterior samples of the overall rate and standard deviation (3 chains, 3000 draws) omitted.]

Overall marriage rate: if λi ~ Gamma(α, β), then E(λi) = α/β.

              mean    sd     2.5%   median  97.5%
overallrate   7.362   1.068  5.508  7.285   9.665
overallstd    2.281   0.394  1.59   2.266   3.125
Poisson regression

When the rates are not exchangeable, we need to incorporate covariates into the model. Often we are interested in the association between one or more covariates and the outcome.

It is possible (but not easy) to incorporate covariates into the Poisson-Gamma model. Christiansen and Morris (1997, JASA) propose the following model.

Sampling distribution, where ei is a known exposure:

yi | λi ~ Poisson(λi ei).

Under the model, E(yi/ei) = λi.

Population distribution for the rates:

λi | α ~ Gamma(ζ, ζ/μi),

with log(μi) = xi'β and α = (β0, β1, ..., βk−1, ζ). ζ is thought of as an unobserved prior count.

Under the population model,

E(λi) = ζ / (ζ/μi) = μi
CV²(λi) = (μi²/ζ)(1/μi²) = 1/ζ.

For k = 0, μi is known. For k = 1, the λi are exchangeable. For k ≥ 2, the λi are (unconditionally) nonexchangeable. In all cases, the standardized rates λi/μi are Gamma(ζ, ζ), are exchangeable, and have expectation 1.

The covariates can include random effects.

To complete the specification of the model, we need priors on α. Christensen and Morris (1997) suggest:
β and ζ independent a priori.
A non-informative prior on the β's associated with fixed effects.
For ζ, a proper prior of the form

p(ζ|y0) ∝ y0 / (ζ + y0)²,

where y0 is a prior guess for the median of ζ. Small values of y0 (for example, y0 smaller than the MLE of ζ) provide less information.

When the rates cannot be assumed to be exchangeable, it is common to choose a generalized linear model of the form

p(y|β) ∝ Πi exp{−λi} λi^{yi} ∝ Πi exp{−exp(ηi)} [exp(ηi)]^{yi},

for ηi = xi'β and log(λi) = ηi.

The vector of covariates can include one or more random effects to accommodate additional dispersion (see the epilepsy example).


The second-level distribution for the β's will typically be flat (if the covariate is a fixed effect) or normal,

βj ~ Normal(βj0, σ²_βj),

if the jth covariate is a random effect. The variance σ²_βj represents the between-batch variability.

Epilepsy example

From Breslow and Clayton, 1993, JASA.

Fifty-nine epileptic patients in a clinical trial were randomized to a new drug: T = 1 is the drug and T = 0 is the placebo.

Covariates included:
Baseline data: number of seizures during the eight weeks preceding the trial
Age in years

Outcomes: number of seizures during the two weeks preceding each of four clinical visits.

The data suggest that the number of seizures was significantly lower prior to the fourth visit, so an indicator was used for V4 versus the other visits.

Two random effects in the model:
A patient-level effect to introduce between-patient variability.
A patient-by-visit effect to introduce between-visit, within-patient dispersion.

Epilepsy study - Program and results

model {
for(j in 1 : N) {
for(k in 1 : T) {
for(k in 1 : T) {
log(mu[j, k]) <- a0 + alpha.Base * (log.Base4[j] - log.Base4.bar)
+ alpha.Trt * (Trt[j] - Trt.bar)
+ alpha.BT * (BT[j] - BT.bar)
+ alpha.Age * (log.Age[j] - log.Age.bar)
+ alpha.V4 * (V4[k] - V4.bar)
+ b1[j] + b[j, k]
y[j, k] ~ dpois(mu[j, k])
b[j, k] ~ dnorm(0.0, tau.b); # subject*visit random effects
}
b1[j] ~ dnorm(0.0, tau.b1) # subject random effects
BT[j] <- Trt[j] * log.Base4[j] # interaction
log.Base4[j] <- log(Base[j] / 4)
log.Age[j] <- log(Age[j])
diff[j] <- mu[j,4] - mu[j,1]
}
# covariate means:
log.Age.bar <- mean(log.Age[])
Trt.bar <- mean(Trt[])
BT.bar <- mean(BT[])
log.Base4.bar <- mean(log.Base4[])
V4.bar <- mean(V4[])
# priors:

a0 ~ dnorm(0.0,1.0E-4)
alpha.Base ~ dnorm(0.0,1.0E-4)
alpha.Trt ~ dnorm(0.0,1.0E-4);
alpha.BT ~ dnorm(0.0,1.0E-4)
alpha.Age ~ dnorm(0.0,1.0E-4)
alpha.V4 ~ dnorm(0.0,1.0E-4)
tau.b1 ~ dgamma(1.0E-3,1.0E-3); sigma.b1 <- 1.0 / sqrt(tau.b1)
tau.b ~ dgamma(1.0E-3,1.0E-3); sigma.b <- 1.0/ sqrt(tau.b)

# re-calculate intercept on original scale:
alpha0 <- a0 - alpha.Base * log.Base4.bar - alpha.Trt * Trt.bar
- alpha.BT * BT.bar - alpha.Age * log.Age.bar - alpha.V4 * V4.bar
}

[Box plot of the individual (patient-level) random effects b1[1]-b1[59] omitted.]

Results

Parameter    Mean     Std      2.5th     Median    97.5th
alpha.Age    0.4677   0.3557   -0.2407   0.4744    1.172
alpha.Base   0.8815   0.1459   0.5908    0.8849    1.165
alpha.Trt    -0.9587  0.4557   -1.794    -0.9637   -0.06769
alpha.V4     -0.1013  0.08818  -0.273    -0.09978  0.07268
alpha.BT     0.3778   0.2427   -0.1478   0.3904    0.7886
sigma.b1     0.4983   0.07189  0.3704    0.4931    0.6579
sigma.b      0.3641   0.04349  0.2871    0.362     0.4552

Patients 5 and 49 are different. For #5, the number of events increases from visits 1 through 4. For #49, there is a significant decrease.

[Caterpillar plot of the differences in seizure counts between V4 and V1 for the 59 patients, and box plots of the visit-level random effects b[5, ] and b[49, ], omitted.]
Convergence and autocorrelation

We ran three parallel chains for 10,000 iterations, discarded the first 5,000 iterations from each chain as burn-in, and thinned by 10, so there are 1,500 draws for inference.

[Autocorrelation and trace plots for sigma.b and sigma.b1 (chains 1:3) omitted.]

Example: hierarchical Poisson model

The USDA collects data on the number of farmers who adopt conservation tillage practices.

In a recent survey, 10 counties in a midwestern state were randomly sampled, and within each county a random number of farmers were interviewed and asked whether they had adopted conservation practices. The number of farmers interviewed in each county varied from a low of 2 to a high of 100.

Of interest to USDA is the estimation of the overall adoption rate in the state.

We fitted the following model:

yi ~ Poisson(λi)
λi = θi Ei
θi ~ Gamma(α, β),

where yi is the number of adopters in the ith county, Ei is the number of farmers interviewed, and θi is the expected adoption rate; the expected number of adopters in a county can be obtained by multiplying θi by the number of farms in the county.

The hierarchical structure establishes that even though counties may vary significantly in terms of the rate of adoption, they are still exchangeable, so the rates are generated from a common population distribution.

Information about the overall rate of adoption across the state is contained in the posterior distribution of (α, β).

We fitted the model using WinBUGS and chose non-informative priors for the hyperparameters.

Observed data:

County   Ei    yi     County   Ei    yi
1        94    5      6        31    19
2        15    1      7        2     1
3        62    5      8        2     1
4        126   14     9        9     4
5        5     3      10       100   22
Posterior distribution of the rate of adoption in each county (box plot omitted):

node       mean     sd       2.5%      median   97.5%
theta[1]   0.06033  0.02414  0.02127   0.05765  0.1147
theta[2]   0.1074   0.08086  0.009868  0.08742  0.306
theta[3]   0.09093  0.03734  0.03248   0.08618  0.1767
theta[4]   0.1158   0.03085  0.06335   0.1132   0.1831
theta[5]   0.5482   0.2859   0.131     0.4954   1.242
theta[6]   0.6028   0.1342   0.3651    0.5962   0.8932
theta[7]   0.4586   0.3549   0.0482    0.3646   1.343
theta[8]   0.4658   0.3689   0.04501   0.3805   1.41
theta[9]   0.4378   0.2055   0.1352    0.3988   0.9255
theta[10]  0.2238   0.0486   0.1386    0.2211   0.33

Posterior distribution of the number of adopters in each county, given the exposures (box plot omitted):

node        mean    sd      2.5%     median  97.5%
lambda[1]   5.671   2.269   2.0      5.419   10.78
lambda[2]   1.611   1.213   0.148    1.311   4.59
lambda[3]   5.638   2.315   2.014    5.343   10.95
lambda[4]   14.59   3.887   7.982    14.26   23.07
lambda[5]   2.741   1.43    0.6551   2.477   6.21
lambda[6]   18.69   4.16    11.32    18.48   27.69
lambda[7]   0.9171  0.7097  0.09641  0.7292  2.687
lambda[8]   0.9317  0.7377  0.09002  0.761   2.819
lambda[9]   3.94    1.85    1.217    3.589   8.33
lambda[10]  22.38   0.1028  13.86    22.11   33.0

node   mean    sd      2.5%    median  97.5%
alpha  0.8336  0.3181  0.3342  0.7989  1.582
beta   2.094   1.123   0.4235  1.897   4.835
mean   0.4772  0.2536  0.1988  0.4227  1.14
Poisson model for small area deaths

Taken from Bayesian Statistical Modeling, Peter Congdon, 2001.

Congdon considers the incidence of heart disease mortality in 758 electoral wards in the Greater London area over three years (1990-92). These small areas are grouped administratively into 33 boroughs.

Regressors:
at ward level: xij, an index of socio-economic deprivation;
at borough level: wj, where wj = 1 for inner London boroughs and wj = 0 for outer suburban boroughs.

We assume borough-level variation in the intercepts and in the impacts of deprivation; this variation is linked to the category of borough (inner vs outer).

Model

First level of the hierarchy:

Oij | μij ~ Poisson(μij)
log(μij) = log(Eij) + β1j + β2j (xij − x̄) + δij
δij ~ N(0, σ²)

Death counts Oij are Poisson with means μij. log(Eij) is an offset, i.e., an explanatory variable with known coefficient (equal to 1). βj = (β1j, β2j)' are random coefficients for the intercept and the impact of deprivation at the borough level. δij is a random error for Poisson over-dispersion; we fitted two models, one without and one with this random error term.

Second level of the hierarchy:

βj = (β1j, β2j)' ~ N2(μβj, Σβ),

where

Σβ = ( σ11  σ12 )
     ( σ21  σ22 )

μβ1j = γ11 + γ12 wj
μβ2j = γ21 + γ22 wj.

γ11, γ12, γ21, and γ22 are the population coefficients for the intercept, the impact of borough category, the impact of deprivation, and the interaction between the level-2 and level-1 regressors, respectively.

Hyperparameters:

Σβ⁻¹ ~ Wishart(R, ρ)
γij ~ N(0, 0.1),  i, j ∈ {1, 2}
σ² ~ Inv-Gamma(a, b).
Computation of Eij

Suppose that p is an overall disease rate. Then Eij = nij p and μij = p θij.

The μ's are said to be:
1. externally standardized if p is obtained from another data source (such as a standard reference table);
2. internally standardized if p is obtained from the given dataset, e.g.

p̂ = Σij Oij / Σij nij.

In our example we rely on the latter approach.

Under (1), the joint distribution of the Oij is product Poisson; under (2), it is multinomial. However, since likelihood inference is unaffected by whether we condition on Σij Oij, the product Poisson likelihood is commonly retained.

Model 1: without the overdispersion term δij

                  std.           Posterior quantiles
node      mean    dev.     2.5%     median   97.5%
γ11      -0.075   0.074   -0.224   -0.075    0.070
γ12       0.078   0.106   -0.128    0.078    0.290
γ21       0.621   0.138    0.354    0.620    0.896
γ22       0.104   0.197   -0.282    0.103    0.486
σ1        0.294   0.038    0.231    0.290    0.380
σ2        0.445   0.077    0.318    0.437    0.618
Deviance  945.800 11.320  925.000  945.300  969.400
Model 2: with the overdispersion term δij

                  std.           Posterior quantiles
node      mean    dev.     2.5%     median   97.5%
γ11      -0.069   0.074   -0.218   -0.069    0.076
γ12       0.068   0.106   -0.141    0.068    0.278
γ21       0.616   0.141    0.335    0.615    0.892
γ22       0.105   0.198   -0.276    0.104    0.499
σ1        0.292   0.037    0.228    0.289    0.376
σ2        0.431   0.075    0.306    0.423    0.600
Deviance  802.400 40.290  726.600  800.900  883.500

Including the ward-level random variability δij reduces the average Poisson GLM deviance to 802, with a 95% credible interval from 726 to 883. This is in line with the expected value of the GLM deviance for N = 758 areas if the Poisson model is appropriate.
Hierarchical models for spatial data

Based on the book by Banerjee, Carlin and Gelfand, Hierarchical Modeling and Analysis for Spatial Data, 2004. We focus on Chapters 1, 2 and 5.

Geo-referenced data arise in agriculture, climatology, economics, epidemiology, transportation and many other areas.

What does geo-referenced mean? In a nutshell, we know the geographic location at which an observation was collected.

Why does it matter? Sometimes, relative location can provide information about an outcome beyond that provided by covariates. Example: infant mortality is typically higher in high-poverty areas. Even after incorporating poverty as a covariate, residuals may still be spatially correlated due to other factors such as nearness to pollution sites, distance to pre- and post-natal care centers, etc.

Often data are also collected over time, so models that include spatial and temporal correlations of outcomes are needed. We focus on spatial, rather than spatio-temporal, models. We also focus on models for univariate rather than multivariate outcomes.

Types of spatial data

Point-referenced data: Y(s) is a random outcome (perhaps vector-valued) at location s, where s varies continuously over some region D. The location s is typically two-dimensional (latitude and longitude) but may also include altitude. Known as geostatistical data.

Areal data: the outcome Yi is an aggregate value over an areal unit with well-defined boundaries. Here, D is divided into a finite collection of areal units. Known as lattice data, even though lattices can be irregular.

Point-pattern data: the outcome Y(s) is the occurrence or not of an event, and the locations s are random. Example: locations of trees of a species in a forest, or addresses of persons with a particular disease. Interest is often in deciding whether points occur independently in space or whether there is clustering.

Marked point process data: if covariate information is available, we talk about a marked point process. The covariate value at each site marks the site as belonging to a certain covariate batch or group.

Combinations: e.g., daily ozone levels collected at monitoring stations with precise locations, and the number of children in a zip code reporting to the ER on that day. These require data re-alignment to combine outcomes and covariates.


Models for point-level data

The basics:

The location index s varies continuously over the region D.

We often assume that the covariance between two observations at locations si and sj depends only on the distance dij between the points.

The spatial covariance is often modeled as exponential:

Cov(Y(si), Y(si')) = C(dii') = σ² e^(−φ dii'),

where (σ², φ) > 0 are the partial sill and decay parameters, respectively.

Covariogram: a plot of C(dii') against dii'.
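A minimal R sketch of the exponential covariance model and covariogram for a handful of illustrative locations (all parameter values are arbitrary):

# Exponential spatial covariance: C(d) = sigma2 * exp(-phi * d).
set.seed(1)
coords <- cbind(runif(10), runif(10))      # illustrative 2-d locations
d      <- as.matrix(dist(coords))          # pairwise distances d_ii'
sigma2 <- 2; phi <- 3; tau2 <- 0.5         # partial sill, decay, nugget
C <- sigma2 * exp(-phi * d)                # exponential covariogram
Sigma <- C + tau2 * diag(nrow(coords))     # nugget added on the diagonal; sill = sigma2 + tau2

# One realization of a mean-zero Gaussian process at these locations:
y <- drop(t(chol(Sigma)) %*% rnorm(nrow(coords)))
plot(d[lower.tri(d)], C[lower.tri(C)], xlab = "distance", ylab = "C(d)")  # covariogram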


For i = i', dii' = 0 and C(dii') = var(Y(si)). Sometimes var(Y(si)) = τ² + σ², with τ² the nugget effect and τ² + σ² the sill.

Models for point-level data - covariance structure

Suppose that the outcomes are normally distributed and that we choose an exponential model for the covariance matrix. Then

Y | μ, θ ~ N(μ, Σ(θ)),

with

Y = {Y(s1), Y(s2), ..., Y(sn)}
Σ(θ)ii' = cov(Y(si), Y(si'))
θ = (σ², τ², φ).

Then

Σ(θ)ii' = σ² exp(−φ dii') + τ² I_{i=i'},

with (σ², τ², φ) > 0. This is an example of an isotropic covariance function: the spatial correlation is only a function of the distance d.

Models for point-level data - details

Basic model:

Y(s) = μ(s) + w(s) + e(s),

where μ(s) = x'(s)β, and the residual is divided into two components: w(s) is a realization of a zero-centered stationary Gaussian process, and e(s) is uncorrelated pure error.

The w(s) are functions of the partial sill σ² and the decay parameter φ. The e(s) introduce the nugget effect τ².

τ² is interpreted as pure sampling variability or as microscale variability, i.e., spatial variability at distances smaller than the distance between two outcomes: the e(s) are sometimes viewed as spatial processes with rapid decay.


The variogram and semivariogram

A spatial process is said to be:
Strictly stationary if the distributions of Y(s) and Y(s + h) are equal, for h a separation vector.
Weakly stationary if μ(s) = μ and Cov(Y(s), Y(s + h)) = C(h).
Intrinsically stationary if

E[Y(s + h) − Y(s)] = 0, and
E[Y(s + h) − Y(s)]² = Var[Y(s + h) − Y(s)] = 2γ(h),

defined for differences and depending only on the separation h.

2γ(h) is the variogram and γ(h) is the semivariogram.

Examples of semivariograms

[Semivariograms for the linear, spherical and exponential models omitted.]

Stationarity

Strict stationarity implies weak stationarity, but the converse is not true except for Gaussian processes.

Weak stationarity implies intrinsic stationarity, but the converse is not true in general.

Notice that intrinsic stationarity is defined on the differences between outcomes at two locations and thus says nothing about the joint distribution of the outcomes.

Semivariogram (contd)

If γ(h) depends on h only through its length ||h||, then the spatial process is isotropic; otherwise it is anisotropic.

There are many choices of isotropic models. The exponential model is popular and has good properties. For t = ||h||:

γ(t) = τ² + σ²(1 − exp(−φt)) if t > 0,
     = 0 otherwise.

See figures, page 24.

The powered exponential model has an extra parameter κ for smoothness:

γ(t) = τ² + σ²(1 − exp(−|φt|^κ)) if t > 0.


Another popular choice is the Gaussian variogram model, equal to the exponential except for the exponent term, which is exp(−φ²t²).

Fitting of the variogram has traditionally been done by eye:
Plot an empirical estimate of the variogram, akin to the sample variance estimate or the autocorrelation function in time series.
Choose a theoretical functional form to fit the empirical variogram.
Choose values of (τ², σ², φ) that fit the data well.

If a distribution for the outcomes is assumed and a functional form for the variogram is chosen, the parameters can be estimated via some likelihood-based method. Of course, we can also be Bayesians.

Point-level data (contd)

For point-referenced data, frequentists focus on spatial prediction using kriging.

Problem: given observations {Y(s1), ..., Y(sn)}, how do we predict Y(s0) at a new site s0?

Consider the model

Y = Xβ + ε,  ε ~ N(0, Σ),

where

Σ = σ²H(φ) + τ²I,  with H(φ)ii' = ρ(φ, dii').

Kriging consists in finding the function f(y) of the observations that minimizes the MSE of prediction

Q = E[ (Y(s0) − f(y))² | y ].

Classical kriging (contd)

A (not surprising!) result: the f(y) that minimizes Q is the conditional mean of Y(s0) given the observations y (see pages 50-52 for a proof):

E[Y(s0)|y] = x0'β̂ + γ'Σ⁻¹(y − Xβ̂)
Var[Y(s0)|y] = σ² + τ² − γ'Σ⁻¹γ,

where

γ = (σ²ρ(φ, d01), ..., σ²ρ(φ, d0n))'
β̂ = (X'Σ⁻¹X)⁻¹X'Σ⁻¹y
Σ = σ²H(φ) + τ²I.

The solution assumes that we have observed the covariates x0 at the new site.
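A minimal R sketch of these kriging equations, using the exponential correlation ρ(φ, d) = exp(−φd) and simulated data (all values are illustrative):

# Classical kriging predictor at a new site s0.
set.seed(1)
n <- 40; coords <- cbind(runif(n), runif(n))
X <- cbind(1, coords[, 1]); sigma2 <- 1; tau2 <- 0.2; phi <- 4
d <- as.matrix(dist(coords))
Sigma <- sigma2 * exp(-phi * d) + tau2 * diag(n)
y <- drop(X %*% c(1, 2) + t(chol(Sigma)) %*% rnorm(n))       # simulated outcome

s0 <- c(0.5, 0.5); x0 <- c(1, s0[1])                         # new site and its covariates
d0 <- sqrt(colSums((t(coords) - s0)^2))                      # distances to s0
gam <- sigma2 * exp(-phi * d0)                               # cov(Y(s0), Y)

Sinv <- solve(Sigma)
beta.hat <- solve(t(X) %*% Sinv %*% X, t(X) %*% Sinv %*% y)  # GLS estimate
pred <- drop(x0 %*% beta.hat + gam %*% Sinv %*% (y - X %*% beta.hat))
pvar <- sigma2 + tau2 - drop(gam %*% Sinv %*% gam)
c(prediction = pred, variance = pvar)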


If not, then in the classical framework Y(s0) and x0 are jointly estimated using an EM-type iterative algorithm.

Bayesian methods for estimation

The Gaussian isotropic kriging model is just a general linear model, similar to those in Chapter 15 of the textbook. We just need to define the appropriate covariance structure.

For an exponential covariance structure with a nugget effect, the parameters to be estimated are θ = (β, σ², τ², φ).

Steps:
Choose priors and define the sampling distribution.
Obtain the posterior for all parameters, p(θ|y).
Bayesian kriging: get the posterior predictive distribution for the outcome at a new location, p(y0|y, X, x0).

Sampling distribution (marginal data model):

y | θ ~ N(Xβ, σ²H(φ) + τ²I).

Priors: typically chosen so that the parameters are independent a priori. As in the linear model:
A non-informative prior for β is uniform, or a normal prior can be used.
Conjugate priors for the variances σ², τ² are inverse gamma priors.
For φ, the appropriate prior depends on the covariance model. For the simple exponential, where ρ(si − sj; φ) = exp(−φ||si − sj||), a Gamma prior can be a good choice.

Be cautious with improper priors for anything but β.


Hierarchical representation of the model

Hierarchical model representation: first condition on the spatial random effects W = {w(s1), ..., w(sn)}:

y | θ, W ~ N(Xβ + W, τ²I)
W | σ², φ ~ N(0, σ²H(φ)).

The model specification is then completed by choosing priors for β, τ² and for σ², φ (hyperparameters).

Note that the hierarchical model has n more parameters (the w(si)) than the marginal model. Computation with the marginal model is preferable, because σ²H(φ) + τ²I tends to be better behaved than σ²H(φ) at small distances.

Estimation of the spatial surface W | y

Interest is sometimes in estimating the spatial surface using p(W|y). If the marginal model is fitted, we can still get the marginal posterior for W as

p(W|y) = ∫ p(W|σ², φ, y) p(σ², φ|y) dσ² dφ.

Given draws (σ²(g), φ(g)) from the Gibbs sampler on the marginal model, we can generate W from the multivariate normal p(W | σ²(g), φ(g), y).

Analytical marginalization over W is possible only if the model has a Gaussian form.

Bayesian kriging

Let Y0 = Y(s0) and x0 = x(s0). Kriging is accomplished by obtaining the posterior predictive distribution

p(y0|x0, X, y) = ∫ p(y0, θ|y, X, x0) dθ
             = ∫ p(y0|θ, y, x0) p(θ|y, X) dθ.

Since (Y0, Y) are jointly multivariate normal (see expressions 2.18 and 2.19 on page 51), p(y0|θ, y, x0) is a conditional normal distribution.

Given MCMC draws of the parameters (θ(1), ..., θ(G)) from the posterior distribution p(θ|y, X), we draw a value y0(g) for each θ(g) as

y0(g) ~ p(y0|θ(g), y, x0).

Draws {y0(1), y0(2), ..., y0(G)} are a sample from the posterior predictive distribution of the outcome at the new location s0.

To predict Y at a set of m new locations s01, ..., s0m, it is best to do joint prediction, to be able to estimate the posterior association among the m predictions.

Beware of joint prediction at many new locations with WinBUGS. It can take forever.

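A minimal sketch of this composition-sampling idea in R, under stated assumptions: theta.draws is a hypothetical list of saved posterior draws (each a list with elements beta, sigma2, tau2, phi), and cond.moments() is a hypothetical helper that returns the conditional normal moments m = x0'β + γ'Σ^{-1}(y − Xβ) and v = σ² + τ² − γ'Σ^{-1}γ evaluated at that draw.

y0.draws <- sapply(theta.draws, function(th) {
  mv <- cond.moments(y, X, coords, x0, s0, th)   # hypothetical helper; see the formulas above
  rnorm(1, mean = mv$m, sd = sqrt(mv$v))         # one draw from p(y0 | theta, y, x0)
})
quantile(y0.draws, c(0.025, 0.5, 0.975))         # posterior predictive summary at s0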


Kriging example from WinBUGS

Data were first published by Davis (1973) and consist of heights at 52 locations in a 310-foot square area.

We have 52 s = (x, y) coordinates and outcomes (heights).

The unit of distance is 50 feet and the unit of elevation is 10 feet.

The model is

height = μ + ε, where ε ~ N(0, Σ),

and where

Σ = σ²H(φ).

Here, H(φ)_{ij} = ρ(si − sj; φ, κ) = exp(−(φ||si − sj||)^κ).

Priors on (μ, φ, κ).

We predict elevations at 225 new locations.


model {

# Spatially structured multivariate normal likelihood
height[1:N] ~ spatial.exp(mu[], x[], y[], tau, phi, kappa)   # exponential correlation function

for(i in 1:N) {
  mu[i] <- beta
}

# Priors
beta ~ dflat()
tau ~ dgamma(0.001, 0.001)
sigma2 <- 1/tau

# priors for spatial.exp parameters
phi ~ dunif(0.05, 20)      # prior decay for correlation at min distance (0.2 x 50 ft) is 0.02 to 0.99
                           # prior range for correlation at max distance (8.3 x 50 ft) is 0 to 0.66
kappa ~ dunif(0.05, 1.95)

# Spatial prediction

# Single site prediction
for(j in 1:M) {
  height.pred[j] ~ spatial.unipred(beta, x.pred[j], y.pred[j], height[])
}

# Only use joint prediction for small subset of points, due to length of time it takes to run
for(j in 1:10) { mu.pred[j] <- beta }
height.pred.multi[1:10] ~ spatial.pred(mu.pred[], x.pred[1:10], y.pred[1:10], height[])

}

Data and initial values for the spatial.exp model are provided with the WinBUGS example document.

[Trace plots of beta, phi, and kappa over 2000 iterations of the sampler]
[Map of posterior mean predicted heights ((samples) means for height.pred) over the 225 prediction locations]

Hierarchical models for areal data

Areal data: often aggregate outcomes over a well-defined area.

Example: number of cancer cases per county, or proportion of people living in poverty in a set of census tracts.

What are the inferential issues?

1. Identifying a spatial pattern and its strength. If data are spatially correlated, measurements from areas that are close will be more alike.
2. Smoothing, and to what degree. Observed measurements often present extreme values due to small samples in small areas. Maximal smoothing: substitute observed measurements by the overall mean in the region. Something less extreme is what we discuss later.
3. Prediction: for a new area, how would we predict Y given measurements in other areas?
Defining neighbors

A proximity matrix W with entries wij spatially connects areas i and j in some fashion. Typically, wii = 0.

There are many choices for the wij:
Binary: wij = 1 if areas i, j share a common boundary, and 0 otherwise.
Continuous: decreasing function of intercentroidal distance.
Combo: wij = 1 if areas are within a certain distance.

W need not be symmetric.

Entries are often standardized by dividing by wi+ = Σj wij. If entries are standardized, W will often be asymmetric.

The wij can be thought of as weights and provide a means to introduce spatial structure into the model. Areas that are closer in some sense are more alike.

For any problem, we can define first, second, third, etc. order neighbors. For distance bins (0, d1], (d1, d2], (d2, d3], ... we can define (see the sketch after this list):

1. W(1), the first-order proximity matrix, with wij(1) = 1 if the distance between i and j is less than d1.
2. W(2), the second-order proximity matrix, with wij(2) = 1 if the distance between i and j is more than d1 but less than d2.
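A minimal R sketch of distance-based proximity matrices, assuming cent is an I x 2 matrix of area centroids and the cut points d1, d2 are hypothetical:

D  <- as.matrix(dist(cent))                        # intercentroidal distances
d1 <- 50; d2 <- 100                                # hypothetical bin cut points
W1 <- (D > 0 & D <= d1) * 1                        # first-order proximity matrix
W2 <- (D > d1 & D <= d2) * 1                       # second-order proximity matrix
Wstd <- sweep(W1, 1, pmax(rowSums(W1), 1), "/")    # row-standardized weights (divide row i by w_{i+})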

Areal data models

Because these models are used mostly in epidemiology, we begin with an application in disease mapping to introduce concepts.

Typical data:
Yi: observed number of cases in area i, i = 1, ..., I.
Ei: expected number of cases in area i.

The Y's are assumed to be random, and the E's are assumed to be known and to depend on the number of persons ni at risk.

An internally standardized estimate of Ei is

Ei = ni r̄ = ni (Σi yi / Σi ni),

corresponding to a constant disease rate across areas.

An externally standardized estimate is

Ei = Σj nij rj,

where rj is the risk for persons of age group j (from some existing table of risks by age) and nij is the number of persons of age j in area i.

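For instance, with hypothetical vectors y (observed counts) and n (persons at risk) by area, internal standardization in R is just:

rbar <- sum(y) / sum(n)     # overall disease rate in the region
E    <- n * rbar            # expected count in each area under a constant rate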


Standard frequentist approach

Yi|ηi ~ Poisson(Ei ηi),

with ηi the true relative risk in area i.

The MLE is the standardized mortality ratio

η̂i = SMRi = Yi / Ei.

The variance of the SMRi is

var(SMRi) = var(Yi)/Ei² = ηi/Ei,

estimated by plugging in η̂i to obtain

var̂(SMRi) = Yi/Ei²,

which can be large for small Ei.

To get a confidence interval for ηi, first assume that log(SMRi) is approximately normal. From a Taylor expansion:

Var[log(SMRi)] ≈ (1/SMRi²) Var(SMRi) = (Ei²/Yi²)(Yi/Ei²) = 1/Yi.


An approximate 95% CI for log(ηi) is

log(SMRi) ± 1.96/(Yi)^{1/2}.

Transforming back, an approximate 95% CI for ηi is

(SMRi exp(−1.96/(Yi)^{1/2}), SMRi exp(1.96/(Yi)^{1/2})).

Suppose we wish to test whether the risk in area i is high relative to other areas. Then test

H0: ηi = 1 versus Ha: ηi > 1.

This is a one-sided test.

Under H0, Yi ~ Poisson(Ei), so the p-value for the test is

p = Prob(X ≥ Yi | Ei)
  = 1 − Prob(X ≤ Yi − 1 | Ei)
  = 1 − Σ_{x=0}^{Yi−1} exp(−Ei) Ei^x / x!.

If p < 0.05 we reject H0.

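A minimal R sketch of this frequentist analysis, with hypothetical count vectors y and E (the interval is undefined for areas with yi = 0):

smr   <- y / E                           # standardized mortality ratios
ci.lo <- smr * exp(-1.96 / sqrt(y))      # approximate 95% CI, back-transformed from the log scale
ci.hi <- smr * exp( 1.96 / sqrt(y))
pval  <- 1 - ppois(y - 1, lambda = E)    # Prob(X >= y_i | E_i): one-sided test of eta_i = 1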


Hierarchical models for areal data

To estimate and map the underlying relative risks, we might wish to fit a random effects model.

Assumption: the true risks come from a common underlying distribution.

Random effects models permit borrowing strength across areas to obtain better area-level estimates.

Alas, models may be complex:
High-dimensional: one random effect for each area.
Non-normal if data are counts or binomial proportions.

We have already discussed hierarchical Poisson models, so the material in the next few transparencies is a review.

Poisson-Gamma model

Consider

Yi|ηi ~ Poisson(Ei ηi)
ηi|a, b ~ Gamma(a, b).

Since E(ηi) = a/b and Var(ηi) = a/b², we can fix a, b as follows:
A priori, let E(ηi) = 1, the null value.
Let Var(ηi) = (0.5)², large on that scale.
The resulting prior is Gamma(4, 4).

The posterior is also Gamma:

p(ηi|yi) = Gamma(yi + a, Ei + b).

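A minimal R sketch of this prior choice and the conjugate update for a single area with hypothetical counts; the last two lines verify the weighted-average (shrinkage) form derived next:

a <- 1^2 / 0.5^2                                   # E(eta)^2 / Var(eta) = 4
b <- 1   / 0.5^2                                   # E(eta)   / Var(eta) = 4
yi <- 12; Ei <- 8.5                                # hypothetical counts for one area
post.mean <- (yi + a) / (Ei + b)                   # posterior mean of eta_i
post.ci   <- qgamma(c(0.025, 0.975), shape = yi + a, rate = Ei + b)
wi <- Ei / (Ei + b)                                # shrinkage weight
wi * (yi / Ei) + (1 - wi) * (a / b)                # equals post.mean: weighted average of SMR and prior mean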

A point estimate of ηi is

E(ηi|y) = E(ηi|yi) = (yi + a)/(Ei + b)
        = [yi + E(ηi)²/Var(ηi)] / [Ei + E(ηi)/Var(ηi)]
        = [Ei/(Ei + E(ηi)/Var(ηi))] (yi/Ei) + [(E(ηi)/Var(ηi))/(Ei + E(ηi)/Var(ηi))] E(ηi)
        = wi SMRi + (1 − wi) E(ηi),

where wi = Ei/[Ei + (E(ηi)/Var(ηi))].

The Bayesian point estimate is a weighted average of the data-based SMRi and the prior mean E(ηi).

Poisson-lognormal models with spatial errors

The Poisson-Gamma model does not allow (easily) for spatial correlation among the ηi.

Instead, consider the Poisson-lognormal model, where in the second stage we model the log-relative risks log(ηi) = ψi:

Yi|ψi ~ Poisson(Ei exp(ψi))
ψi = x'iβ + θi + φi,

where the xi are area-level covariates.

The θi are assumed to be exchangeable and model between-area variability:

θi ~ N(0, 1/τh).



The θi incorporate global extra-Poisson variability in the log-relative risks (across the entire region).

The φi are the spatial parameters; they capture regional clustering. They model extra-Poisson variability in the log-relative risks at the local level, so that neighboring areas have similar risks.

One way to model the φi is to proceed as in the point-referenced data case. For φ = (φ1, ..., φI)', consider

φ|μ, λ ~ N_I(μ, H(λ)),

with H(λ)_{ii'} = cov(φi, φi') and hyperparameters λ.

Possible models for H(λ) include the exponential, the powered exponential, etc.

While sensible, this model is difficult to fit because:
Lots of matrix inversions are required.
The distance between areas i and i' may not be obvious.


CAR model

More reasonable to think of a neighbor-based proximity measure and consider a conditionally autoregressive model for φ:

φi ~ N(φ̄i, 1/(τc mi)),

where

φ̄i = (1/mi) Σ_{j≠i} wij φj,

the weighted average of the neighboring effects, and mi is the number of neighbors of area i. Earlier we called this wi+.

The weights wij are (typically) 0 if areas i and j are not neighbors and 1 if they are.

CAR models lend themselves to the Gibbs sampler. Each φi can be sampled from its conditional distribution, so no matrix inversion is needed:

p(φi | all) ∝ Poisson(yi | Ei exp(x'iβ + θi + φi)) × N(φi | φ̄i, 1/(mi τc)).

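As an illustration (WinBUGS's car.normal handles these updates for you), a minimal R sketch of one such univariate update, written as a random-walk Metropolis step because the Poisson-normal full conditional is not a standard distribution. Binary weights are assumed, so φ̄i is the average of the neighboring effects; eta.fix[i] is a hypothetical name for x'iβ + θi, held fixed during this update.

update.phi.i <- function(phi, i, y, E, eta.fix, W, tau.c, sd.prop = 0.2) {
  nbrs   <- which(W[i, ] == 1)                 # neighbors of area i (binary weights)
  m.i    <- length(nbrs)
  phibar <- mean(phi[nbrs])                    # conditional prior mean
  logpost <- function(p)
    dpois(y[i], E[i] * exp(eta.fix[i] + p), log = TRUE) +
    dnorm(p, phibar, sqrt(1 / (m.i * tau.c)), log = TRUE)
  prop <- rnorm(1, phi[i], sd.prop)            # random-walk proposal
  if (log(runif(1)) < logpost(prop) - logpost(phi[i])) phi[i] <- prop
  phi                                          # return the updated vector
}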


Difficulties with CAR model

The CAR prior is improper: it is a pairwise-difference prior, identified only up to a constant.

The posterior will still be proper, but to identify an intercept β0 for the log-relative risks we need to impose a constraint: Σi φi = 0. In simulation, the constraint is imposed numerically by recentering each vector φ around its own mean.

τh and τc cannot be too large, because θi and φi become unidentifiable. We observe only one Yi in each area yet we try to fit two random effects. Very little data information.

Hyperpriors for τh, τc need to be chosen carefully. Consider

τh ~ Gamma(ah, bh),  τc ~ Gamma(ac, bc).

To place equal emphasis on heterogeneity and spatial clustering, it is tempting to make ah = ac and bh = bc. This is not correct, because:
1. The τh prior is defined marginally, whereas the τc prior is conditional.
2. The conditional prior precision is τc mi. Thus, a scale that satisfies

sd(θi) = 1/√τh ≈ 1/(0.7 √(m̄ τc)) ≈ sd(φi),

with m̄ the average number of neighbors, is more fair (Bernardinelli et al. 1995, Statistics in Medicine).

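Solving the display above for τh gives τh ≈ (0.7)² m̄ τc. A tiny R illustration (the value of τc is hypothetical):

mbar  <- 4.8                       # average number of neighbors (e.g., the London example below)
tau.c <- 1                         # a chosen scale for the clustering precision
tau.h <- 0.7^2 * mbar * tau.c      # then sd(theta_i) = 1/sqrt(tau.h) roughly matches sd(phi_i)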

Example: Incidence of lip cancer in 56 areas in Scotland

Data on the number of lip cancer cases in 56 counties in Scotland were obtained. Expected lip cancer counts Ei were available.

The covariate was the proportion of the population in each county working in agriculture.

The model included only one random effect bi, to introduce spatial association between counties. Two counties are neighbors if they have a common border (i.e., they are adjacent).

Three counties that are islands have no neighbors, and for those WinBUGS sets the random spatial effect to be 0. The relative risk for the islands is thus based on the baseline rate α0 and on the value of the covariate xi.

We wish to smooth and map the relative risks RR.

Model

model {
  # Likelihood
  for (i in 1 : N) {
    O[i] ~ dpois(mu[i])
    log(mu[i]) <- log(E[i]) + alpha0 + alpha1 * X[i]/10 + b[i]
    RR[i] <- exp(alpha0 + alpha1 * X[i]/10 + b[i])   # Area-specific relative risk (for maps)
  }

  # CAR prior distribution for random effects:
  b[1:N] ~ car.normal(adj[], weights[], num[], tau)
  for(k in 1:sumNumNeigh) {
    weights[k] <- 1
  }

  # Other priors:
  alpha0 ~ dflat()
  alpha1 ~ dnorm(0.0, 1.0E-5)
  tau ~ dgamma(0.5, 0.0005)   # prior on precision
  sigma <- sqrt(1 / tau)      # standard deviation
}
Data 4, 4, 4, 5, 6, 5),
adj = c(
list(N = 56, 19, 9, 5,
O = c( 9, 39, 11, 9, 15, 8, 26, 7, 6, 20, 10, 7,
13, 5, 3, 8, 17, 9, 2, 7, 9, 7, 12,
16, 31, 11, 7, 19, 15, 7, 10, 16, 11, 28, 20, 18,
5, 3, 7, 8, 11, 9, 11, 8, 6, 4, 19, 12, 1,
10, 8, 2, 6, 19, 3, 2, 3, 28, 6, 17, 16, 13, 10, 2,
1, 1, 1, 1, 0, 0), 29, 23, 19, 17, 1,
E = c( 1.4, 8.7, 3.0, 2.5, 4.3, 2.4, 8.1, 2.3, 2.0, 6.6, 22, 16, 7, 2,
4.4, 1.8, 1.1, 3.3, 7.8, 4.6, 1.1, 4.2, 5.5, 4.4, 5, 3,
10.5,22.7, 8.8, 5.6,15.5,12.5, 6.0, 9.0,14.4,10.2, 19, 17, 7,
4.8, 2.9, 7.0, 8.5,12.3,10.1,12.7, 9.4, 7.2, 5.3, 35, 32, 31,
18.8,15.8, 4.3,14.6,50.7, 8.2, 5.6, 9.3,88.7,19.6, 29, 25,
3.4, 3.6, 5.7, 7.0, 4.2, 1.8), 29, 22, 21, 17, 10, 7,
X = c(16,16,10,24,10,24,10, 7, 7,16, 29, 19, 16, 13, 9, 7,
7,16,10,24, 7,16,10, 7, 7,10, 56, 55, 33, 28, 20, 4,
7,16,10, 7, 1, 1, 7, 7,10,10, 17, 13, 9, 5, 1,
7,24,10, 7, 7, 0,10, 1,16, 0, 56, 18, 4,
1,16,16, 0, 1, 7, 1, 1, 0, 1, 50, 29, 16,
1, 0, 1, 1,16,10), 16, 10,
num = c(3, 2, 1, 3, 3, 0, 5, 0, 5, 4, 39, 34, 29, 9,
0, 2, 3, 3, 2, 6, 6, 6, 5, 3, 56, 55, 48, 47, 44, 31, 30, 27,
3, 2, 4, 8, 3, 3, 4, 4, 11, 6, 29, 26, 15,
7, 3, 4, 9, 4, 2, 4, 6, 3, 4, 43, 29, 25,
5, 5, 4, 5, 4, 6, 6, 4, 9, 2, 56, 32, 31, 24,

45, 33, 18, 4, 52, 51, 49, 38, 34,


50, 43, 34, 26, 25, 23, 21, 17, 16, 15, 9, 56, 45, 33, 30, 24, 18,
55, 45, 44, 42, 38, 24, 55, 27, 24, 20, 18),
47, 46, 35, 32, 27, 24, 14, sumNumNeigh = 234)
31, 27, 14,
55, 45, 28, 18,
54, 52, 51, 43, 42, 40, 39, 29, 23,
46, 37, 31, 14,
41, 37,
46, 41, 36, 35,
54, 51, 49, 44, 42, 30,
40, 34, 23,
52, 49, 39, 34,
53, 49, 46, 37, 36,
51, 43, 38, 34, 30,
42, 34, 29, 26,
49, 48, 38, 30, 24,
55, 33, 30, 28,
53, 47, 41, 37, 35, 31,
53, 49, 48, 46, 31, 24,
49, 47, 44, 24,
54, 53, 52, 48, 47, 44, 41, 40, 38,
29, 21,
54, 42, 38, 34,
54, 49, 40, 34,
49, 47, 46, 41,

Results

The proportion of individuals working in agriculture appears to be associated with the incidence of cancer.

[Posterior density of alpha1, based on 10000 samples]

There appears to be spatial correlation among counties:

[Box plots of the posterior distributions of the spatial random effects b[1], ..., b[56]]

[Map of posterior mean relative risks ((samples) means for RR) across the 56 counties]
Example: Lung cancer in London

[Map of observed lung cancer counts O across the 44 wards]

Data were obtained from an annual report by the London Health Authority in which the association between health and poverty was investigated.

The population under study are men 65 years of age and older.

Available were:
- Observed and expected lung cancer counts in 44 census ward districts.
- A ward-level index of socio-economic deprivation.

A model with two random effects was fitted to the data. One random effect introduces region-wide heterogeneity (between-ward variability) and the other one introduces regional clustering.

The priors for the two components of random variability are sometimes known as convolution priors.

Interest is in:
- Determining the association between health outcomes and poverty.
- Smoothing and mapping relative risks.
- Assessing the degree of spatial clustering in these data.
Model

model {
  for (i in 1 : N) {
    # Likelihood
    O[i] ~ dpois(mu[i])
    log(mu[i]) <- log(E[i]) + alpha + beta * depriv[i] + b[i] + h[i]
    RR[i] <- exp(alpha + beta * depriv[i] + b[i] + h[i])   # Area-specific relative risk (for maps)

    # Exchangeable prior on unstructured random effects
    h[i] ~ dnorm(0, tau.h)
  }

  # CAR prior distribution for spatial random effects:
  b[1:N] ~ car.normal(adj[], weights[], num[], tau.b)
  for(k in 1:sumNumNeigh) {
    weights[k] <- 1
  }

  # Other priors:
  alpha ~ dflat()
  beta ~ dnorm(0.0, 1.0E-5)
  tau.b ~ dgamma(0.5, 0.0005)
  sigma.b <- sqrt(1 / tau.b)
  tau.h ~ dgamma(0.5, 0.0005)
  sigma.h <- sqrt(1 / tau.h)
  propstd <- sigma.b / (sigma.b + sigma.h)
}

Note that the priors on the precisions for the exchangeable and the spatial random effects are Gamma(0.5, 0.0005). That means that, a priori, the expected value of the standard deviations is approximately 0.03, with a relatively large prior standard deviation.

This is not a fair prior, as discussed in class. The average number of neighbors is 4.8, and this is not taken into account in the choice of priors.
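A quick simulation check of this claim in R (the 0.03 figure presumably corresponds to the prior mean of the precision, 0.5/0.0005 = 1000; the induced prior on the standard deviation itself is very spread out):

tau <- rgamma(100000, shape = 0.5, rate = 0.0005)   # prior draws of a precision
sd.draws <- 1 / sqrt(tau)                           # implied prior on the standard deviation
1 / sqrt(0.5 / 0.0005)                              # sd at the prior mean precision: about 0.03
quantile(sd.draws, c(0.025, 0.5, 0.975))            # wide spread around that value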

Results

Parameter                      Mean     Std     2.5th perc.   97.5th perc.
Alpha                         -0.208    0.1      -0.408        -0.024
Beta                           0.0474   0.0179    0.0133        0.0838
Relative size of spatial std   0.358    0.243     0.052         0.874

[Posterior densities of sigma.b and sigma.h (1000 samples each)]

Posterior distribution of RR for the first 15 wards

node     mean    sd      2.5%    median  97.5%
RR[1]    0.828   0.1539  0.51    0.8286  1.148
RR[2]    1.18    0.2061  0.7593  1.188   1.6
RR[3]    0.8641  0.1695  0.5621  0.8508  1.247
RR[4]    0.8024  0.1522  0.5239  0.7873  1.154
RR[5]    0.7116  0.1519  0.379   0.7206  0.9914
RR[6]    1.05    0.2171  0.7149  1.015   1.621
RR[7]    1.122   0.1955  0.7556  1.116   1.589
RR[8]    0.821   0.1581  0.4911  0.8284  1.132
RR[9]    1.112   0.2167  0.7951  1.07    1.702
RR[10]   1.546   0.2823  1.072   1.505   2.146
RR[11]   0.7697  0.1425  0.4859  0.7788  1.066
RR[12]   0.865   0.16    0.6027  0.8464  1.26
RR[13]   1.237   0.3539  0.8743  1.117   2.172
RR[14]   0.8359  0.1807  0.5279  0.8084  1.3
RR[15]   0.7876  0.1563  0.489   0.7869  1.141
[Maps of the observed counts O, the posterior means of the spatial random effects b, and the posterior mean relative risks RR across the 44 wards]
