
Monte Carlo Statistical Methods: Introduction [1]

Monte Carlo Statistical Methods


George Casella
University of Florida
January 3, 2008
casella@stat.ufl.edu
1
Monte Carlo Statistical Methods: Introduction [2]
Based on
Monte Carlo Statistical Methods,
Christian Robert and George Casella,
2004, Springer-Verlag
Programming in R (available as a free download from
http://www.r-project.org)
Also WinBugs, available free from
http://www.mrc-bsu.cam.ac.uk/bugs/
R programs for the course available at
http://www.stat.ufl.edu/~casella/mcsm/
2
Monte Carlo Statistical Methods: Introduction [3]
Introduction
Statistical Models
Likelihood Models
Bayesian Models
Deterministic Numerical Models
Simulation vs. Numerical Methods
3
Monte Carlo Statistical Methods: Introduction [4]
1.1 Statistical Models
In a typical statistical model we observe
$$Y_1, Y_2, \ldots, Y_n \sim f(y|\theta)$$
The distribution of the sample is given by the product, the likelihood function
$$\prod_{i=1}^n f(y_i|\theta).$$
Inference about $\theta$ is based on this likelihood.
In many situations the likelihood can be complicated
4
Monte Carlo Statistical Methods: Introduction [5]
Example 1.1: Censored Random Variables
If
$$X_1 \sim N(\theta, \sigma^2), \qquad X_2 \sim N(\mu, \rho^2),$$
the distribution of $Y = \min\{X_1, X_2\}$ is
$$\left[1 - \Phi\left(\frac{y-\theta}{\sigma}\right)\right] \frac{1}{\rho}\,\varphi\left(\frac{y-\mu}{\rho}\right)
+ \left[1 - \Phi\left(\frac{y-\mu}{\rho}\right)\right] \frac{1}{\sigma}\,\varphi\left(\frac{y-\theta}{\sigma}\right),$$
where $\Phi$ and $\varphi$ are the cdf and pdf of the standard normal distribution.
This results in a complex likelihood.
5
Monte Carlo Statistical Methods: Introduction [6]
Example 1.2: Mixture Models
Models of mixtures of distributions:
$$X \sim f_j \quad \text{with probability } p_j,$$
for $j = 1, 2, \ldots, k$, with overall density
$$X \sim p_1 f_1(x) + \cdots + p_k f_k(x).$$
For a sample of independent random variables $(X_1, \ldots, X_n)$, the sample density is
$$\prod_{i=1}^n \left\{p_1 f_1(x_i) + \cdots + p_k f_k(x_i)\right\}.$$
Expanding this product involves $k^n$ elementary terms: prohibitive to compute in large samples.
6
Monte Carlo Statistical Methods: Introduction [7]
Example 1.2 : Normal Mixtures
For a mixture of two normal distributions,
$$p\, N(\mu, \tau^2) + (1-p)\, N(\theta, \sigma^2),$$
the likelihood is proportional to
$$\prod_{i=1}^n \left[p\, \tau^{-1} \varphi\left(\frac{x_i - \mu}{\tau}\right) + (1-p)\, \sigma^{-1} \varphi\left(\frac{x_i - \theta}{\sigma}\right)\right],$$
containing $2^n$ terms.
Standard maximization techniques often fail to find the global maximum
because of the multimodality of the likelihood function.
R program normal-mixture1
7
Monte Carlo Statistical Methods: Introduction [8]
#This gives the distribution of the mixture of two normals#
e<-.3; nsim<-1000;m<-2;s<-1;
u<-(runif(nsim)<e);z<-rnorm(nsim)
z1<-rnorm(nsim,mean=m,sd=s)
#This plots histogram and density#
hist(u*z+(1-u)*z1,xlab="x",xlim=c(-5,5),freq=F,
col="green",breaks=50)
mix<-function(x)e*dnorm(x)+(1-e)*dnorm(x,mean=m,sd=s)
xplot<-c(-50:50)/10
par(new=T)
plot(xplot,mix(xplot), xlim=c(-5,5),type="l",yaxt="n",ylab="")
8
Monte Carlo Statistical Methods: Introduction [9]
[Figure 1: Histogram and density of the normal mixture.]
9
Monte Carlo Statistical Methods: Introduction [10]
1.2: Likelihood Methods
Maximum Likelihood Methods
For an iid sample $X_1, \ldots, X_n$ from a population with density $f(x|\theta_1, \ldots, \theta_k)$,
the likelihood function is
$$L(\theta|\mathbf{x}) = L(\theta_1, \ldots, \theta_k | x_1, \ldots, x_n) = \prod_{i=1}^n f(x_i | \theta_1, \ldots, \theta_k).$$
Global justifications from asymptotics
10
Monte Carlo Statistical Methods: Introduction [11]
Example 1.9: Student's t distribution
A reasonable alternative to normal errors is Student's t distribution, denoted by
$$\mathcal{T}(p, \theta, \sigma),$$
which is more robust against possible modelling errors.
The density of $\mathcal{T}(p, \theta, \sigma)$ is proportional to
$$\sigma^{-1}\left[1 + \frac{(x-\theta)^2}{p\sigma^2}\right]^{-(p+1)/2}.$$
11
Monte Carlo Statistical Methods: Introduction [12]
Example 1.9: Student's t distribution
When $p$ is known and $\theta$ and $\sigma$ are both unknown, the likelihood is proportional to
$$\sigma^{-n}\prod_{i=1}^n \left[1 + \frac{(x_i-\theta)^2}{p\sigma^2}\right]^{-(p+1)/2},$$
which may have up to $n$ local maxima, each of which needs to be examined to
determine the global maximum.
12
Monte Carlo Statistical Methods: Introduction [13]
[Figure: illustration of the multiplicity of modes of the likelihood from a Cauchy
distribution $C(\theta, 1)$ ($p = 1$) when $n = 3$ and $X_1 = 0$, $X_2 = 5$, and $X_3 = 9$.]
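A minimal sketch (not part of the original slides) that reproduces this picture by
evaluating the Cauchy likelihood on a grid:
x <- c(0, 5, 9)                      # the three observations
theta <- seq(-5, 15, by = 0.01)      # grid of location values
lik <- sapply(theta, function(t) prod(dcauchy(x, location = t, scale = 1)))
plot(theta, lik, type = "l", xlab = "theta", ylab = "likelihood")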
13
Monte Carlo Statistical Methods: Introduction [14]
Section 1.3 Bayesian Methods
In the Bayesian paradigm, information brought by the data $x$, a realization of
$$X \sim f(x|\theta),$$
is combined with prior information specified by the prior distribution with density $\pi(\theta)$.
14
Monte Carlo Statistical Methods: Introduction [15]
Bayesian Methods
Summary in a probability distribution, $\pi(\theta|x)$, called the posterior distribution
Derived from the joint distribution $f(x|\theta)\pi(\theta)$, according to
$$\pi(\theta|x) = \frac{f(x|\theta)\,\pi(\theta)}{\int f(x|\theta)\,\pi(\theta)\, d\theta},$$
[Bayes Theorem]
where
$$m(x) = \int f(x|\theta)\,\pi(\theta)\, d\theta$$
is the marginal density of $X$.
15
Monte Carlo Statistical Methods: Introduction [16]
Example 1.11: Binomial Bayes Estimator
For an observation $X$ from the binomial distribution $\text{Binomial}(n, p)$, the
(so-called) conjugate prior is the family of beta distributions $\text{Beta}(a, b)$.
The classical Bayes estimator $\delta^\pi$ is the posterior mean
$$\delta^\pi = \frac{\Gamma(a+b+n)}{\Gamma(a+x)\,\Gamma(n-x+b)} \int_0^1 p\; p^{x+a-1}(1-p)^{n-x+b-1}\, dp
= \frac{n}{a+b+n}\left(\frac{x}{n}\right) + \frac{a+b}{a+b+n}\left(\frac{a}{a+b}\right).$$
A biased estimator of $p$.
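A quick numerical check of the posterior-mean formula; the prior parameters and
data below are illustrative only:
a <- 2; b <- 4; n <- 20; x <- 7                                  # hypothetical values
bayes.est <- (n/(a+b+n))*(x/n) + ((a+b)/(a+b+n))*(a/(a+b))       # formula above
bayes.est
(x + a)/(n + a + b)                  # same value: the Beta(a+x, b+n-x) posterior mean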
16
Monte Carlo Statistical Methods: Introduction [17]
The Variance/Bias Trade-off
Bayes estimators are biased
Mean Squared Error (MSE) = Variance + Bias$^2$
$$\text{MSE} = E(\delta^\pi - p)^2$$
Measures average closeness to the parameter
Small Bias can yield large Variance.
$$\delta^\pi = \frac{n}{a+b+n}\left(\frac{x}{n}\right) + \frac{a+b}{a+b+n}\left(\frac{a}{a+b}\right)$$
$$\text{Var}\,\delta^\pi = \left(\frac{n}{a+b+n}\right)^2 \text{Var}\left(\frac{x}{n}\right)$$
17
Monte Carlo Statistical Methods: Introduction [18]
Conjugate Priors
A prior is conjugate if
$\pi(\theta)$ (the prior) and $\pi(\theta|x)$ (the posterior)
are in the same family of distributions.
Examples
$\pi(\theta)$ normal, $\pi(\theta|x)$ normal
$\pi(\theta)$ beta, $\pi(\theta|x)$ beta
Restricts the choice of prior
Typically non-robust
Originally used for computational ease
18
Monte Carlo Statistical Methods: Introduction [19]
Example 1.13: Logistic Regression
Standard regression model for binary (0-1) responses: the logit model,
where the distribution of $Y$ is modelled by
$$P(Y = 1) = p = \frac{\exp(x^t\beta)}{1 + \exp(x^t\beta)}.$$
Equivalently, the logit transform of $p$, $\text{logit}(p) = \log[p/(1-p)]$, satisfies
$$\text{logit}(p) = x^t\beta.$$
Computation of a confidence region on $\beta$ is quite delicate when $\pi(\beta|x)$ is not explicit.
In particular, when the confidence region involves only one component of
a vector parameter, calculation of $\pi(\beta|x)$ requires the integration of the
joint distribution over all the other parameters.
19
Monte Carlo Statistical Methods: Introduction [20]
Challenger Data
In 1986, the space shuttle Challenger exploded during take-off, killing the
seven astronauts aboard.
The explosion was the result of an O-ring failure.
Flight No. 14 9 23 10 1 5 13 15 4 3 8 17
Failure 1 1 1 1 0 0 0 0 0 0 0 0
Temp. 53 57 58 63 66 67 67 67 68 69 70 70
Flight No. 2 11 6 7 16 21 19 22 12 20 18
Failure 1 1 0 0 0 1 0 0 0 0 0
Temp. 70 70 72 73 75 75 76 76 78 79 81
It is reasonable to fit a logistic regression, with p = probability of an
O-ring failure and x = temperature.
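A minimal sketch (not from the course programs) fitting this model with R's built-in
glm; the data vectors are transcribed from the table above:
failure <- c(1,1,1,1,0,0,0,0,0,0,0,0, 1,1,0,0,0,1,0,0,0,0,0)
temp    <- c(53,57,58,63,66,67,67,67,68,69,70,70, 70,70,72,73,75,75,76,76,78,79,81)
fit <- glm(failure ~ temp, family = binomial)     # maximum likelihood logistic fit
summary(fit)
# predicted failure probabilities at 65 and 45 degrees Fahrenheit
predict(fit, newdata = data.frame(temp = c(65, 45)), type = "response")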
20
Monte Carlo Statistical Methods: Introduction [21]
[Figure: three panels summarizing the fitted logistic regression for the Challenger data.]
The left panel shows the average logistic function and variation.
The middle panel shows predictions of failure probabilities at 65° Fahrenheit.
The right panel shows predictions of failure probabilities at 45° Fahrenheit.
21
Monte Carlo Statistical Methods: Introduction [22]
Section 1.4: Deterministic Numerical Methods
To solve an equation of the form
$$f(x) = 0,$$
the Newton-Raphson algorithm produces a sequence $(x_n)$:
$$x_{n+1} = x_n - \left(\frac{\partial f}{\partial x}\Big|_{x=x_n}\right)^{-1} f(x_n)$$
that converges to a solution of $f(x) = 0$.
Note that $\frac{\partial f}{\partial x}$ is a matrix in multidimensional settings.
22
Monte Carlo Statistical Methods: Introduction [23]
Example 1.17: Newton-Raphson
The Newton-Raphson algorithm can be used to find the square root of a number.
If we are interested in the square root of $b$, this is equivalent to solving the equation
$$f(x) = x^2 - b = 0.$$
This results in the iterations
$$x^{(j+1)} = x^{(j)} - \frac{f(x^{(j)})}{f'(x^{(j)})}
= x^{(j)} - \frac{(x^{(j)})^2 - b}{2x^{(j)}}
= \frac{1}{2}\left(x^{(j)} + \frac{b}{x^{(j)}}\right).$$
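A minimal sketch of this iteration (starting point and number of iterations are arbitrary):
newton.sqrt <- function(b, x0 = 1, niter = 20) {
  x <- x0
  for (j in 1:niter) x <- 0.5 * (x + b / x)   # x_{j+1} = (x_j + b/x_j)/2
  x
}
newton.sqrt(2)                                # converges to 1.414214...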
23
Monte Carlo Statistical Methods: Introduction [24]
Example 1.17: Newton-Raphson -2
[Figure. Left: $x^2$; Right: $f(x) = x^2 - 2$ along the runs.]
Rapid convergence from different starting points.
Three runs are shown, starting at $x = 0.5,\ 2,\ 4$.
24
Monte Carlo Statistical Methods: Introduction [25]
Example 1.17: Newton-Raphson -3
[Figure: $h(x)$ and the values of $h(x)$ along Newton-Raphson runs.]
Problems with the function $h(x) = [\cos(50x) + \sin(20x)]^2$.
Greediness of the Newton-Raphson algorithm pushes it to the nearest mode.
25
Monte Carlo Statistical Methods: Introduction [26]
Variants of Newton-Raphson
The steepest descent method, where each iteration results in a unidimensional
optimizing problem for $F(x_n + t d_n)$ ($t \in \mathbb{R}$), $d_n$ being an acceptable
direction, namely such that
$$\frac{d^2 F}{dt^2}(x_n + t d_n)\Big|_{t=0}$$
is of the proper sign.
The direction $d_n$ is often chosen as $\nabla F$ or as
$$\left[\nabla\nabla^t F(x_n) + \lambda I\right]^{-1} \nabla F(x_n)$$
in the Levenberg-Marquardt version.
26
Monte Carlo Statistical Methods: Introduction [27]
Section 1.4.2: Integration
The numerical computation of an integral
$$I = \int_a^b h(x)\, dx$$
can be done by simple Riemann integration.
By improved techniques such as the trapezoidal rule
$$\hat{I} = \frac{1}{2}\sum_{i=1}^{n-1} (x_{i+1} - x_i)\left(h(x_i) + h(x_{i+1})\right),$$
where the $x_i$'s constitute an ordered partition of $[a, b]$.
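A minimal sketch of the trapezoidal rule (the test function is illustrative):
trapezoid <- function(h, a, b, n = 100) {
  x <- seq(a, b, length.out = n)
  sum(0.5 * diff(x) * (h(x[-n]) + h(x[-1])))   # sum of trapezoid areas
}
trapezoid(function(x) x^2, 0, 1)               # approximately 1/3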
27
Monte Carlo Statistical Methods: Introduction [28]
Section 1.4.2: Integration
By Simpson's rule, whose formula is
$$\hat{I} = \frac{\delta}{3}\left[h(a) + 4\sum_{i=1}^{n} h(x_{2i-1}) + 2\sum_{i=1}^{n} h(x_{2i}) + h(b)\right]$$
in the case of equally spaced samples with $(x_{i+1} - x_i) = \delta$.
Other approaches involve orthogonal polynomials (Gram-Charlier, Legendre, etc.)
Splines
However, these methods may not work well in high dimensions
28
Monte Carlo Statistical Methods: Introduction [29]
Comparison
Advantages of Simulation
Integration may focus on areas of low probability
Simulation can avoid these
Local modes are a problem for deterministic methods
Advantages of Deterministic Methods
Simulation doesn't consider the form of the function
Deterministic Methods can be much faster for smooth functions.
In low dimensions Riemann Sums or Quadrature are very fast
29
Monte Carlo Statistical Methods: Introduction [30]
Comparison
When the statistician
needs to study the details of a likelihood surface or posterior distribution
needs to simultaneously estimate several features of these functions
when the distributions are highly multimodal
it is preferable to use a simulation-based approach.
fruitless to advocate the superiority of one method over the other
More reasonable to justify the use of simulation-based methods by the
statistician in terms of expertise.
The intuition acquired by a statistician in his or her every-day processing
of random models can be directly exploited in the implementation of
simulation techniques
30
Monte Carlo Statistical Methods: Random Variable Generation [31]
Chapter 2: Random Variable Generation
Rely on the possibility of producing (with a computer) a supposedly endless
flow of random variables (usually iid) for well-known distributions.
Although we are not directly concerned with the mechanics of produc-
ing uniform random variables, we are concerned with the statistics of
producing uniform and other random variables.
We look at some basic methodology that can, starting from these sim-
ulated uniform random variables, produce random variables from both
standard and nonstandard distributions.
31
Monte Carlo Statistical Methods: Random Variable Generation [32]
Uniform Random Numbers
A uniform pseudo-random number generator is an algorithm which, starting
from an initial value $u_0$ and a transformation $D$, produces a sequence
$(u_i) = (D^i(u_0))$ of values in $[0, 1]$.
For all $n$, the values $(u_1, \ldots, u_n)$ reproduce the behavior of an iid sample
$(V_1, \ldots, V_n)$ of uniform random variables when compared through a usual
set of tests.
32
Monte Carlo Statistical Methods: Random Variable Generation [33]
Uniform Random Numbers
This definition is clearly restricted to testable aspects of the random variable
generation, which are connected through the deterministic transformation
$u_i = D(u_{i-1})$.
The validity of the algorithm consists in the verification that the sequence
$U_1, \ldots, U_n$ leads to acceptance of the hypothesis
$$H_0: U_1, \ldots, U_n \text{ are iid } \mathcal{U}_{[0,1]}.$$
The set of tests used is generally of some consequence.
Kolmogorov-Smirnov
Nonparametric
Time Series
Die Hard (Marsaglia)
Our definition is functional: an algorithm that generates uniform numbers
is acceptable if it is not rejected by a set of tests.
33
Monte Carlo Statistical Methods: Random Variable Generation [34]
KISS Algorithm
A preferred algorithm
A congruential generator $D(x) = ax + b \pmod{M + 1}$
Register shifts to break patterns
Period of order $2^{95}$
Successfully tested on Die Hard
34
Monte Carlo Statistical Methods: Random Variable Generation [35]
The Inverse Transform
Lemma 2.4: If $X$ has the cdf $F(x)$, then the random variable $F(X)$ has
the $\mathcal{U}_{[0,1]}$ distribution.
Thus, formally, in order to generate a random variable $X \sim F$, it suffices
to generate $U$ according to $\mathcal{U}_{[0,1]}$ and then make the transformation
$x = F^{-}(u)$.
In other words, simulate $U \sim \mathcal{U}_{[0,1]}$ then solve for $X$ in
$$U = F(X) = \int_{-\infty}^X f(x)\, dx$$
35
Monte Carlo Statistical Methods: Random Variable Generation [36]
Example 2.5: Exponential variable generation
If $X \sim \text{Exp}(1)$, so $F(x) = 1 - e^{-x}$, then solving for $x$ in $u = 1 - e^{-x}$ gives
$x = -\log(1 - u)$.
Therefore, if $U \sim \mathcal{U}_{[0,1]}$, the random variable $X = -\log U$ has the
exponential distribution.
R program
36
Monte Carlo Statistical Methods: Random Variable Generation [37]
Exponentials from Uniforms
#This generates exponentials from uniforms#
nsim<-10000;u<-runif(nsim);
y<--log(u);
hist(y,main="Exponential",freq=F,col="green",breaks=50)
par(new=T)
plot(function(x)dexp(x), 0,10,xlab="",ylab="",xaxt="n",yaxt="n")
[Figure: histogram of y with the Exp(1) density overlaid.]
37
Monte Carlo Statistical Methods: Random Variable Generation [38]
Example 2.7: Building on exponential random variables
Some of the random variables that can be generated starting from an
exponential distribution.
If the $X_i$'s are iid $\text{Exp}(1)$ random variables,
$$Y = 2\sum_{j=1}^{\nu} X_j \sim \chi^2_{2\nu}, \qquad \nu \in \{1, 2, \ldots\}$$
$$Y = \beta\sum_{j=1}^{a} X_j \sim \text{Ga}(a, \beta), \qquad a \in \{1, 2, \ldots\}$$
$$Y = \frac{\sum_{j=1}^{a} X_j}{\sum_{j=1}^{a+b} X_j} \sim \text{Be}(a, b), \qquad a, b \in \{1, 2, \ldots\}$$
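A minimal sketch of these three transformations (parameter values are illustrative;
the gamma is written in the scale parametrization):
nsim <- 10000
a <- 3; b <- 4; nu <- 2; beta <- 2
X <- matrix(rexp(nsim * (a + b)), nsim, a + b)   # iid Exp(1) draws
Y.chisq <- 2 * rowSums(X[, 1:nu])                # ~ chi-square with 2*nu df
Y.gamma <- beta * rowSums(X[, 1:a])              # ~ Ga(a, beta), scale beta
Y.beta  <- rowSums(X[, 1:a]) / rowSums(X)        # ~ Be(a, b)
c(mean(Y.chisq), mean(Y.gamma), mean(Y.beta))    # compare to 2*nu, a*beta, a/(a+b)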
38
Monte Carlo Statistical Methods: Random Variable Generation [39]
Limitations
These transformations are quite simple to use and, hence, will often be a favorite.
There are limits to their usefulness
In the scope of variables that can be generated
Efficiency of generation
There are more efficient algorithms for gamma and beta random variables.
We cannot use exponentials to generate gamma random variables with a
non-integer shape parameter.
We cannot get a $\chi^2_1$ variable, which would, in turn, get us a $N(0, 1)$ variable.
39
Monte Carlo Statistical Methods: Random Variable Generation [40]
Example 2.8: Box-Muller
If $r$ and $\theta$ are the polar coordinates of $(X_1, X_2)$, then
$$r^2 = X_1^2 + X_2^2 \sim \chi^2_2 = \text{Exp}(1/2), \qquad \theta \sim \mathcal{U}_{[0, 2\pi]}.$$
If $U_1$ and $U_2$ are iid $\mathcal{U}_{[0,1]}$, the variables $X_1$ and $X_2$ defined by
$$X_1 = \sqrt{-2\log(U_1)}\,\cos(2\pi U_2), \qquad X_2 = \sqrt{-2\log(U_1)}\,\sin(2\pi U_2),$$
are then iid $N(0, 1)$.
40
Monte Carlo Statistical Methods: Random Variable Generation [41]
Box-Muller Algorithm
1. Generate $U_1, U_2$ iid $\mathcal{U}_{[0,1]}$;
2. Define
$$x_1 = \sqrt{-2\log(u_1)}\,\cos(2\pi u_2), \qquad x_2 = \sqrt{-2\log(u_1)}\,\sin(2\pi u_2);$$
3. Take $x_1$ and $x_2$ as two independent draws from $N(0, 1)$.
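A minimal sketch of the algorithm (sample size is arbitrary):
nsim <- 5000
u1 <- runif(nsim); u2 <- runif(nsim)
x1 <- sqrt(-2 * log(u1)) * cos(2 * pi * u2)
x2 <- sqrt(-2 * log(u1)) * sin(2 * pi * u2)
hist(c(x1, x2), freq = FALSE, breaks = 50)
curve(dnorm(x), add = TRUE)                 # overlay the N(0,1) density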
41
Monte Carlo Statistical Methods: Random Variable Generation [42]
Note on Box-Muller
In comparison with algorithms based on the Central Limit Theorem, this
algorithm is exact
It produces two normal random variables from two uniform random vari-
ables
The only drawback (in speed) being the necessity of calculating functions
such as log, cos, and sin.
Devroye (1985) gives faster alternatives that avoid the use of these functions.
42
Monte Carlo Statistical Methods: Random Variable Generation [43]
Poisson Random Variables
Discrete Random Variables can always be generated using the Probability
Integral Transform.
For example, to generate $X \sim \text{Poisson}(\lambda)$ calculate
$$p_0 = P_\lambda(X \le 0),\quad p_1 = P_\lambda(X \le 1),\quad p_2 = P_\lambda(X \le 2), \ldots$$
Then generate $U \sim \text{Uniform}[0, 1]$ and take
$$X = k \quad \text{if} \quad p_{k-1} < U < p_k.$$
There are more ecient algorithms, but this is OK
R Program DiscreteX
43
Monte Carlo Statistical Methods: Random Variable Generation [44]
Discrete Random Variables
p<-c(.1,.2,.3,.3,.1) #P(X=0), P(X=1), etc
sum(p) #check
cp<-c(0,cumsum(p))
nsim<-5000
X<-array(0,c(nsim,1))
for(i in 1:nsim)
{
u<-runif(1)
X[i]<-sum(cp<u)-1
}
hist(X)
See also Logarithmic
44
Monte Carlo Statistical Methods: Random Variable Generation [45]
Negative Binomial Random Variables
A Poisson generator can be used to get Negative Binomial random variables since
$$Y \sim \text{Gamma}(n, (1-p)/p) \quad \text{and} \quad X|y \sim \text{Poisson}(y)$$
implies
$$X \sim \text{Negative Binomial}(n, p)$$
45
Monte Carlo Statistical Methods: Random Variable Generation [46]
Negative Binomial
nsim<-10000;n<-6;p<-.3;
y<-rgamma(nsim,n,p/(1-p));x<-rpois(nsim,y);
hist(x,main="Negative Binomial",freq=F,col="green",breaks=40)
par(new=T)
lines(1:50,dnbinom(1:50,n,p))
[Figure: histogram of x with the Negative Binomial(n, p) mass function overlaid.]
46
Monte Carlo Statistical Methods: Random Variable Generation [47]
Mixture Representation
The representation of the Negative Binomial is a particular case of a
mixture distribution.
A mixture represents a density as the marginal of another distribution:
$$f(x) = \sum_i p_i f_i(x)$$
To generate from $f(x)$:
Choose $f_i$ with probability $p_i$
Generate an observation from $f_i$
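A minimal sketch of this two-step recipe for a two-component normal mixture
(weights and component parameters are illustrative):
nsim <- 5000
p <- c(.3, .7)                                       # component weights
comp <- sample(1:2, nsim, replace = TRUE, prob = p)  # choose f_i with prob p_i
x <- ifelse(comp == 1, rnorm(nsim, 0, 1), rnorm(nsim, 3, 0.5))
hist(x, freq = FALSE, breaks = 50)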
47
Monte Carlo Statistical Methods: Random Variable Generation [48]
Section 2.3: Accept-Reject Methods
There are many distributions from which it is difficult, or even impossible,
to directly simulate by an inverse transform.
Moreover, in some cases, we are not even able to represent the distribution
in a usable form, such as a transformation or a mixture.
We thus turn to another class of methods that only requires us to know
the functional form of the density $f$ of interest up to a multiplicative constant.
The key to this method is to use a simpler (simulation-wise) density $g$
from which the simulation is actually done. For a given density $g$ (called
the instrumental or candidate density) there are thus many densities $f$
(called the target densities) which can be simulated this way.
48
Monte Carlo Statistical Methods: Random Variable Generation [49]
The Accept-Reject Algorithm
1. Generate $X \sim g$, $U \sim \mathcal{U}_{[0,1]}$;
2. Accept $Y = X$ if $U \le \frac{1}{M}\,\frac{f(X)}{g(X)}$;
3. Return to 1. otherwise.
49
Monte Carlo Statistical Methods: Random Variable Generation [50]
Accept-Reject: Produces Y f exactly.
Generate $X \sim g$, $U \sim \text{Uniform}[0, 1]$. Accept $Y = X$ if $U \le f(X)/Mg(X)$.
$$P\left(Y \le y \,\Big|\, U \le \tfrac{f(X)}{Mg(X)}\right)
= \frac{P\left(X \le y,\; U \le \tfrac{f(X)}{Mg(X)}\right)}{P\left(U \le \tfrac{f(X)}{Mg(X)}\right)}
= \frac{\int_{-\infty}^{y}\int_0^{f(x)/Mg(x)} du\; g(x)\,dx}{\int_{-\infty}^{\infty}\int_0^{f(x)/Mg(x)} du\; g(x)\,dx}
= \frac{\int_{-\infty}^{y} \tfrac{f(x)}{Mg(x)}\, g(x)\,dx}{\int_{-\infty}^{\infty} \tfrac{f(x)}{Mg(x)}\, g(x)\,dx}
= P(Y \le y)$$
50
Monte Carlo Statistical Methods: Random Variable Generation [51]
Two Interesting Properties of AR
We can simulate from any density known up to a multiplicative constant
This is important in Bayesian calculations, where the posterior distribution
$$\pi(\theta|x) \propto f(x|\theta)\,\pi(\theta)$$
is only specified up to a normalizing constant
The probability of acceptance is $1/M$, and the expected number of trials
until acceptance is $M$
51
Monte Carlo Statistical Methods: Random Variable Generation [52]
Example: Beta Accept-Reject
Generate $Y \sim \text{Beta}(a, b)$.
No direct method if $a$ and $b$ are not integers.
Use a uniform candidate.
For $a = 2.7$ and $b = 6.3$:
Put the beta density $f_Y(y)$ inside a box
Box has sides 1 and $c$, where $c \ge \max_y f_Y(y)$.
If $(U, V)$ are independent uniform$(0, 1)$ random variables,
$$P\left(V \le y \,\Big|\, U \le \tfrac{1}{c} f_Y(V)\right) = P(Y \le y)$$
52
Monte Carlo Statistical Methods: Random Variable Generation [53]
Example: Beta Accept-Reject - Uniform Candidate
Acceptance Rate = 37%
[Figure: histogram of the uniform candidate v and histogram of the accepted
values Y, with the Beta(2.7, 6.3) density overlaid.]
53
Monte Carlo Statistical Methods: Random Variable Generation [54]
Example: Beta Accept-Reject - Uniform Candidate
R program: BetaAR-1
a<-2.7; b<-6.3; c<-2.669;nsim<-2500;
#Generate u and v#
u<-runif(nsim);v<-runif(nsim);
#---------Generate Y, the beta random variable--------------#
test<-dbeta(v, a, b)/c; #density ratio
Y<-v*(u<test) #accepted values
Y<-Y[Y!=0] #eliminate zeros
length(Y)/nsim #percent accepted
#----------Plot---------------------------------------------#
par(mfrow=c(1,2))
hist(v)
hist(Y)
par(new=T)
plot(function(x)(dbeta(x, a, b)));
#------------------------------------------------------------
54
Monte Carlo Statistical Methods: Random Variable Generation [55]
Properties
For c=2.669 the acceptance probability is 1/2.669 = .37 , so we accept
37%
If we simulate from a beta(2,6), the bound is 1.67, so we accept 60%
55
Monte Carlo Statistical Methods: Random Variable Generation [56]
Example: Beta Accept-Reject - Beta Candidate
Acceptance Rate with better candidate
Direct generation of Beta(2, 6)
Acceptance Rate = 60%
[Figure: histogram of the Beta(2, 6) candidate v and histogram of the accepted
values Y, with the Beta(2.7, 6.3) density overlaid.]
56
Monte Carlo Statistical Methods: Random Variable Generation [57]
Example: Beta Accept-Reject - Beta Candidate
R program: BetaAR-2
a<-2.7; b<-6.3; c<-1.67;nsim<-2500;
#Generate u and v#
u<-runif(nsim);
v<-rbeta(nsim,2,6) #beta candidate
#---------Generate Y, the beta random variable--------------#
test<-dbeta(v, a, b)/(c*dbeta(v, 2, 6)); #density ratio
Y<-v*(u<test) #accepted values
Y<-Y[Y!=0] #eliminate zeros
length(Y)/nsim #percent accepted
#----------Plot---------------------------------------------#
par(mfrow=c(1,2))
hist(v)
par(new=T)
plot(function(x)(dbeta(x, 2, 6)))
hist(Y)
par(new=T)
plot(function(x)(dbeta(x, a, b)));
57
Monte Carlo Statistical Methods: Random Variable Generation [58]
Beta AR Generation - Some Intuition
Uniform Candidate
Accepted Values are Under Density
[Figure: scatter of candidate points; the accepted values lie under the density curve.]
58
Monte Carlo Statistical Methods: Random Variable Generation [59]
Example: Normal from Cauchy
Normal: $f(x) = \frac{1}{\sqrt{2\pi}}\exp(-x^2/2)$
Cauchy: $g(x) = \frac{1}{\pi}\,\frac{1}{1+x^2}$
$$f/g = \sqrt{\frac{\pi}{2}}\,(1 + x^2)\,\exp(-x^2/2) \le \sqrt{\frac{2\pi}{e}} = 1.52,$$
attained at $x = \pm 1$.
Probability of acceptance $= 1/1.52 = .66$
Mean number of trials to success $= 1.52$
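A minimal sketch of this accept-reject scheme:
nsim <- 10000
M <- sqrt(2 * pi / exp(1))                       # the bound 1.52 derived above
x <- rcauchy(nsim); u <- runif(nsim)
y <- x[u <= dnorm(x) / (M * dcauchy(x))]         # accepted values
length(y) / nsim                                 # observed acceptance rate, about .66
hist(y, freq = FALSE, breaks = 50); curve(dnorm(x), add = TRUE)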
59
Monte Carlo Statistical Methods: Random Variable Generation [60]
Example 2.18: Normals from Double Exponential
Generate $N(0, 1)$ from a double exponential with density
$$g(x|\alpha) = (\alpha/2)\exp(-\alpha|x|)$$
Minimum bound at $\alpha = 1$
Acceptance probability $= .76$
60
Monte Carlo Statistical Methods: Random Variable Generation [61]
Example 2.19: Gamma Random Variables - Non Integer Shape
Illustrates the power of AR
Gamma = sum of exponentials only if the shape is an integer - no Chi Squared.
Generate $f(x) = \frac{1}{\Gamma(\alpha)}\, x^{\alpha-1} e^{-x}$,
$\beta = 1$ without loss of generality.
Candidate density $g(x) = \frac{1}{\Gamma(a)\, b^a}\, x^{a-1} e^{-x/b}$
Then if $\alpha > a$ and $b > 1$,
$$\frac{f(x)}{g(x)} \propto \frac{x^{\alpha-a}}{b^a}\, e^{-(1 - 1/b)x} < \infty.$$
Take $a = \lfloor\alpha\rfloor$. Then $b = \alpha/a$ minimizes $M$.
61
Monte Carlo Statistical Methods: Random Variable Generation [62]
Example 2.20: Truncated Normal distributions
Truncated normal distributions are very useful (censoring).
For the constraint $x > a$, the density $f_a(x)$ is proportional to
$$f_a(x) \propto e^{-\frac{1}{2\sigma^2}(x-\mu)^2}\, I(x > a)$$
Naive method: generate $Y \sim N(\mu, \sigma^2)$ until $Y > a$
Can sometimes work, but requires, on the average, $1/\Phi((\mu - a)/\sigma)$ simulations
to get one random variable.
For $a = \mu + 2\sigma$, need 44 simulations for each acceptance.
62
Monte Carlo Statistical Methods: Random Variable Generation [63]
Truncated Normal distributions
Better: use a translated exponential distribution
$$g(x) = \alpha\, e^{-\alpha(x-a)}\, I(x > a)$$
For $a = \mu + 2\sigma$, need less than 1.2 simulations for each acceptance.
63
Monte Carlo Statistical Methods: Random Variable Generation [64]
Truncated Normal - Some Details
The Accept-Reject ratio is
$$\frac{f(x)}{g(x)} = \frac{e^{-\frac{1}{2\sigma^2}(x-\mu)^2}\, I(x > a)}{\alpha\, e^{-\alpha(x-a)}\, I(x > a)}$$
These are unnormalized densities
We don't need to worry about the constants
Taking $\mu = 0$ and $\sigma = 1$: if $\alpha > a$,
$$M = \max_{x > a} \frac{f(x)}{g(x)} = \frac{1}{\alpha}\, e^{\frac{1}{2}(\alpha^2 - 2\alpha a)},$$
attained at $x = \alpha$.
Can further optimize by minimizing in $\alpha$
Monte Carlo Statistical Methods: Random Variable Generation [65]
Truncated Normal - Some Details
For simplicity, we will take $\alpha = a$, so that
$$M = \frac{1}{a}\, e^{-\frac{1}{2}a^2}
\qquad \text{and} \qquad
\frac{f(x)}{Mg(x)} = e^{-\frac{1}{2}(x-a)^2}.$$
Now let's compare AR to naive simulation
Generate 100 random variables
Take a = 1 and a = 3.5
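The course's Truncated program is not reproduced here; a minimal sketch of the
translated-exponential accept-reject with $\alpha = a$ is:
truncnorm.ar <- function(n, a) {
  out <- numeric(0)
  while (length(out) < n) {
    x <- a + rexp(n, rate = a)                       # candidate draws, x > a
    u <- runif(n)
    out <- c(out, x[u <= exp(-0.5 * (x - a)^2)])     # f/(M g) = exp(-(x-a)^2/2)
  }
  out[1:n]
}
z <- truncnorm.ar(100, a = 3.5)
hist(z, breaks = 20)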
65
Monte Carlo Statistical Methods: Random Variable Generation [66]
Example: Truncated Normal
Samples generated Naively and with AR
Acceptance Rate very high for AR
[Figure: histograms of samples generated naively and with AR.]
R Program Truncated
66
Monte Carlo Statistical Methods: Monte Carlo Integration [67]
Chapter 3: Monte Carlo Integration
Two major classes of numerical problems that arise in statistical inference
optimization problems
integration problems
Although optimization is generally associated with the likelihood approach,
and integration with the Bayesian approach, these are not strict classifications.
67
Monte Carlo Statistical Methods: Monte Carlo Integration [68]
Example 3.1 Bayes Estimator
In general, the Bayes estimate under the loss function $L(\theta, \delta)$ and the
prior $\pi$ is the solution of the minimization program
$$\min_\delta \int_\Theta L(\theta, \delta)\, \pi(\theta)\, f(x|\theta)\, d\theta.$$
Only when the loss function is the quadratic function $\|\theta - \delta\|^2$ will the
Bayes estimator be a posterior expectation.
For $L(\theta, \delta) = |\theta - \delta|$, the Bayes estimator associated with $\pi$ is the posterior
median of $\pi(\theta|x)$, $\delta^\pi(x)$, which is the solution to the equation
$$\int_{-\infty}^{\delta^\pi(x)} \pi(\theta)\, f(x|\theta)\, d\theta = \int_{\delta^\pi(x)}^{\infty} \pi(\theta)\, f(x|\theta)\, d\theta.$$
68
Monte Carlo Statistical Methods: Monte Carlo Integration [69]
Section 3.2: Classical Monte Carlo Integration
Generic problem of evaluating the integral
$$E_f[h(X)] = \int_{\mathcal{X}} h(x)\, f(x)\, dx.$$
Based on previous developments, it is natural to propose using a sample
$(X_1, \ldots, X_m)$ generated from the density $f$.
Approximate the integral by the empirical average.
This approach is often referred to as the Monte Carlo method.
69
Monte Carlo Statistical Methods: Monte Carlo Integration [70]
Strong Law
For a sample $(X_1, \ldots, X_m)$, the empirical average
$$\overline{h}_m = \frac{1}{m}\sum_{j=1}^m h(x_j)$$
converges almost surely to $E_f[h(X)]$.
This is the Strong Law of Large Numbers.
70
Monte Carlo Statistical Methods: Monte Carlo Integration [71]
Central Limit Theorem
Estimate the variance with
$$\text{var}(\overline{h}_m) = \frac{1}{m}\int_{\mathcal{X}} \left(h(x) - E_f[h(X)]\right)^2 f(x)\, dx.$$
For $m$ large,
$$\frac{\overline{h}_m - E_f[h(X)]}{\sqrt{v_m}}$$
is therefore approximately distributed as a $N(0, 1)$ variable.
This leads to the construction of a convergence test and of confidence
bounds on the approximation of $E_f[h(X)]$.
71
Monte Carlo Statistical Methods: Monte Carlo Integration [72]
Example 3.4: Monte Carlo Integration
Recall the function that we saw in the Newton-Raphson example:
$$h(x) = [\cos(50x) + \sin(20x)]^2.$$
To calculate the integral, we generate $U_1, U_2, \ldots, U_n$ iid $\mathcal{U}(0, 1)$ random
variables, and approximate $\int h(x)\,dx$ with $\sum_i h(U_i)/n$.
It is clear that the Monte Carlo average is converging, with value 0.963
after 10,000 iterations.
72
Monte Carlo Statistical Methods: Monte Carlo Integration [73]
nsim<-10000;u<-runif(nsim);
#The function to be integrated
mci.ex <- function(x){(cos(50*x)+sin(20*x))^2}
plot(function(x)mci.ex(x), xlim=c(0,1),ylim=c(0,4))
#The monte carlo sum
sum(mci.ex(u))/nsim
[Figure: the function h(x), generated values of the function, and the running mean
with standard error bounds.]
73
Monte Carlo Statistical Methods: Monte Carlo Integration [74]
Example 3.5: Normal CDF
The approximation of
$$\Phi(t) = \int_{-\infty}^{t} \frac{1}{\sqrt{2\pi}}\, e^{-y^2/2}\, dy$$
by the Monte Carlo method is
$$\hat{\Phi}(t) = \frac{1}{n}\sum_{i=1}^n I_{x_i \le t},$$
with (exact) variance $\Phi(t)(1 - \Phi(t))/n$.
The variables $I_{x_i \le t}$ are independent Bernoulli with success probability $\Phi(t)$.
The method breaks down for tail probabilities.
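A minimal sketch (the value of t is illustrative):
nsim <- 10000; t <- 1.5
x <- rnorm(nsim)
phi.hat <- mean(x <= t)                        # Monte Carlo estimate of Phi(t)
se <- sqrt(phi.hat * (1 - phi.hat) / nsim)     # binomial standard error
c(phi.hat, se, pnorm(t))                       # estimate, s.e., exact value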
74
Monte Carlo Statistical Methods: Monte Carlo Integration [75]
Section 3.3 Importance Sampling
Simulation from the true density $f$ is not necessarily optimal.
The method of importance sampling is an evaluation of $E_f[h(X)]$ based
on the alternative representation
$$E_f[h(X)] = \int_{\mathcal{X}} h(x)\, f(x)\, dx = \int_{\mathcal{X}} h(x)\, \frac{f(x)}{g(x)}\, g(x)\, dx.$$
We generate a sample $X_1, \ldots, X_m$ from a given distribution $g$ and approximate
$$E_f[h(X)] \approx \frac{1}{m}\sum_{j=1}^m \frac{f(X_j)}{g(X_j)}\, h(X_j).$$
The Strong Law guarantees
$$\frac{1}{m}\sum_{j=1}^m \frac{f(X_j)}{g(X_j)}\, h(X_j) \longrightarrow E_f[h(X)]$$
75
Monte Carlo Statistical Methods: Monte Carlo Integration [76]
Simple Example
Gamma(3, 2/3) from Exponential(1)
nsim<-10000;
target <- function(x)((27/16)*(x^2)*exp(-3*x/2))
candidate<-function(x)(exp(-x))
plot(function(x)target(x),xlim=c(0,10))
par(new=T)
plot(function(x)candidate(x),xlab="",ylab="",xaxt="n",yaxt="n")
#Compute the mean and variance
y<-rexp(nsim);
m1<-sum(y*target(y)/candidate(y))/nsim
m2<-sum((y^2)*target(y)/candidate(y))/nsim
m1;m2-m1^2;
Calculate: mean = 1.998896, var = 1.295856
True: mean = 2, var = 1.33
76
Monte Carlo Statistical Methods: Monte Carlo Integration [77]
Example - Normal Tail Probabilities
For $a = 3.5, 4.5, 5.5$, calculate $P(Z > a) = \int_a^\infty \varphi(x)\, dx$
Naive approach: $X_i \sim N(0, 1)$,
$$\int_a^\infty \varphi(x)\, dx = E[I(X > a)],$$
so
$$\frac{1}{n}\sum_{i=1}^n I(X_i > a) \longrightarrow \int_a^\infty \varphi(x)\, dx$$
77
Monte Carlo Statistical Methods: Monte Carlo Integration [78]
Example - Normal Tail Probabilities - 2
Importance sampling: $X_i \sim g(x) = e^{-(x-a)}$, $x > a$,
$$\int_a^\infty \varphi(x)\, dx = \int_a^\infty \left[\frac{\varphi(x)}{g(x)}\right] g(x)\, dx,$$
so
$$\frac{1}{n}\sum_{i=1}^n \frac{\varphi(X_i)}{e^{-(X_i - a)}} \longrightarrow \int_a^\infty \varphi(x)\, dx$$
And one more....
78
Monte Carlo Statistical Methods: Monte Carlo Integration [79]
Example - Normal Tail Probabilities - 3
Transform to Uniform:
$$\int_a^\infty \varphi(x)\, dx = \int_0^{1/a} \frac{\varphi(1/y)}{y^2}\, dy, \qquad y = 1/x.$$
For $U_i \sim \text{Uniform}(0, 1/a)$ with density $g(x) = a$,
$$\frac{1}{n}\sum_{i=1}^n \frac{\varphi(1/U_i)}{a\, U_i^2} \longrightarrow \int_a^\infty \varphi(x)\, dx$$
Can monitor convergence with standard deviation
R Program TruncatedIS
Also - Multivariate Normal Tails
R Program MultivariateTruncatedIS
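The TruncatedIS program is not reproduced here; a minimal sketch comparing the
three estimators above (the value of a is illustrative):
nsim <- 10000; a <- 4.5
naive <- mean(rnorm(nsim) > a)
x  <- a + rexp(nsim)                             # candidate density e^{-(x-a)}, x > a
is1 <- mean(dnorm(x) / exp(-(x - a)))
u  <- runif(nsim, 0, 1/a)                        # Uniform(0, 1/a), density a
is2 <- mean(dnorm(1/u) / (a * u^2))
c(naive, is1, is2, pnorm(a, lower.tail = FALSE)) # compare with the exact value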
79
Monte Carlo Statistical Methods: Monte Carlo Integration [80]
Importance Sampling Facts
The candidate $g$ needs to have heavier tails than the target $f$
The same sample from $g$ can be used for many targets $f$
This cuts down error in Monte Carlo comparisons
Alternative form
$$\sum_{j=1}^m \left[\frac{f(X_j)/g(X_j)}{\sum_{j'} f(X_{j'})/g(X_{j'})}\right] h(X_j)$$
Biased, but with smaller variance
Often beats the unbiased estimator in MSE
Strong Law applies
80
Monte Carlo Statistical Methods: Monte Carlo Integration [81]
Example 3.13: Student's t
$X \sim \mathcal{T}(\nu, \theta, \sigma^2)$, with density
$$f(x) = \frac{\Gamma((\nu+1)/2)}{\sigma\sqrt{\nu\pi}\;\Gamma(\nu/2)} \left(1 + \frac{(x-\theta)^2}{\nu\sigma^2}\right)^{-(\nu+1)/2}.$$
Take $\theta = 0$ and $\sigma = 1$.
Estimate
$$\int_{2.1}^\infty x^5 f(x)\, dx$$
Candidates
f itself
Cauchy
Normal
Uniform(0, 1/2.1)
81
Monte Carlo Statistical Methods: Monte Carlo Integration [82]
Importance Sampling Comparisons
f (solid), Cauchy (short dash), Normal (dots), Uniform(long dash)
Uniform candidate the best
[Figure: running means of the four importance sampling estimates over 10,000 iterations.]
R program Students-t-moment
82
Monte Carlo Statistical Methods: Monte Carlo Optimization [83]
Chapter 5: Monte Carlo Optimization
Differences between the numerical approach and the simulation approach
to the problem
$$\max_\theta h(\theta)$$
lie in the treatment of the function $h$.
In an optimization problem using deterministic numerical methods,
the analytical properties of the target function (convexity, boundedness,
smoothness) are often paramount.
For the simulation approach,
we are concerned with $h$ from a probabilistic (rather than analytical)
point of view.
83
Monte Carlo Statistical Methods: Monte Carlo Optimization [84]
Monte Carlo Optimization
The problem
$$\max_\theta h(\theta)$$
Deterministic numerical methods: analytical properties
Simulation approach: probabilistic view.
This dichotomy is somewhat artificial
Some simulation approaches have no probabilistic interpretation
Nonetheless, the use of the analytical properties of $h$ plays a lesser role
in the simulation approach.
84
Monte Carlo Statistical Methods: Monte Carlo Optimization [85]
Two Simulation Approaches
Exploratory Approach
Goal: To optimize h by describing its entire range
Actual properties of h play a lesser role
Probabilistic Approach
Monte Carlo exploits probabilistic properties of h
This approach tied to missing data methods
85
Monte Carlo Statistical Methods: Monte Carlo Optimization [86]
Section 5.2: Stochastic Exploration
A first approach is to simulate from a uniform distribution on $\Theta$,
$u_1, \ldots, u_m \sim \mathcal{U}_\Theta$,
and use the approximation
$$h^*_m = \max(h(u_1), \ldots, h(u_m)).$$
This method converges (as $m$ goes to $\infty$), but it may be very slow since
it does not take into account any specific feature of $h$.
Distributions other than the uniform, which can possibly be related to $h$,
may then do better.
In particular, in setups where the likelihood function is extremely costly
to compute, the number of evaluations of the function $h$ is best kept to a
minimum.
86
Monte Carlo Statistical Methods: Monte Carlo Optimization [87]
Example 5.2: A rst Monte Carlo maximization
Recall the function h(x) = [cos(50x) + sin(20x)]
2
.
we try our nave strategy and simulate u
1
, . . . , u
m
U(0, 1), and use the
approximation h

m
= max(h(u
1
), . . . , h(u
m
))
0.0 0.2 0.4 0.6 0.8 1.0
0
1
2
3
4
x
f
u
n
c
t
i
o
n
(
x
)

m
c
i
(
x
)

(
x
)
0.0 0.2 0.4 0.6 0.8 1.0
0
1
2
3
u
m
c
i
(
u
)
87
Monte Carlo Statistical Methods: Monte Carlo Optimization [88]
Example 5.2: A first Monte Carlo maximization
#simple monte carlo optimization#
par(mfrow=c(1,2))
#The function to be optimized
mci <- function(x){(cos(50*x)+sin(20*x))^2}
plot(function(x)mci(x), xlim=c(0,1),ylim=c(0,4),lwd=2)
optimize(mci, c(0, 1), tol = 0.0001, maximum=TRUE)
#The monte carlo maximum
nsim<-5000;u<-runif(nsim);
max(mci(u))
plot(u,mci(u))
#The "exact" value is 3.8325#
88
Monte Carlo Statistical Methods: Monte Carlo Optimization [89]
A Probabilistic Approach
If $h$ is positive with $\int h < \infty$,
finding $\max h$ is the same as
finding the modes of $h$.
Replacing $h$ by $\exp(h)$ makes $h$ positive.
89
Monte Carlo Statistical Methods: Monte Carlo Optimization [90]
A Tough Minimization
Consider minimizing
$$h(x, y) = (x\sin(20y) + y\sin(20x))^2 \cosh(\sin(10x)\,x)
+ (x\cos(10y) - y\sin(10x))^2 \cosh(\cos(20y)\,y),$$
whose global minimum is 0, attained at $(x, y) = (0, 0)$.
[Figure: perspective plot of h(x, y) over $[-1, 1]^2$.]
90
Monte Carlo Statistical Methods: Monte Carlo Optimization [91]
Properties
Many local minima
Standard methods may not find the global minimum
We can simulate from $\exp(-h(x, y))$
Get the minimum from $\min_i h(x_i, y_i)$ (a sketch follows below)
Can use other methods...
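A minimal sketch of a naive uniform search for the minimum (the sample size is
arbitrary); sampling from $\exp(-h)$ would refine this:
h <- function(x, y) {
  (x * sin(20 * y) + y * sin(20 * x))^2 * cosh(sin(10 * x) * x) +
  (x * cos(10 * y) - y * sin(10 * x))^2 * cosh(cos(20 * y) * y)
}
nsim <- 1e5
x <- runif(nsim, -1, 1); y <- runif(nsim, -1, 1)
vals <- h(x, y)
c(min(vals), x[which.min(vals)], y[which.min(vals)])   # should approach (0, 0, 0)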
91
Monte Carlo Statistical Methods: Monte Carlo Optimization [92]
Deterministic Gradient Methods
The gradient method is a deterministic numerical approach to the problem
$$\max_\theta h(\theta).$$
It produces a sequence $(\theta_j)$ that converges to the maximum when
the domain $\Theta \subset \mathbb{R}^d$ and the function $(-h)$ are both convex.
The sequence $(\theta_j)$ is constructed in a recursive manner through
$$\theta_{j+1} = \theta_j + \alpha_j \nabla h(\theta_j), \qquad \alpha_j > 0.$$
Here
$\nabla h$ is the gradient of $h$
$\alpha_j$ is chosen to aid convergence
92
Monte Carlo Statistical Methods: Monte Carlo Optimization [93]
Stochastic Variant
There are stochastic variants of the gradient method
They do not always go along the steepest slope
This is an advantage, as it can avoid local maxima and saddlepoints
The best, and simple version is Simulated Annealing/Metropolis Algo-
rithm
93
Monte Carlo Statistical Methods: Monte Carlo Optimization [94]
Simulated Annealing
This name is borrowed from Metallurgy:
A metal manufactured by a slow decrease of temperature (annealing) is
stronger than a metal manufactured by a fast decrease of temperature.
The fundamental idea of simulated annealing methods is that a change
of scale, called temperature, allows for faster moves on the surface of the
function h to maximize.
Rescaling partially avoids the trapping attraction of local maxima.
As T decreases toward 0, the values simulated from this distribution
become concentrated in a narrower and narrower neighborhood of
the local maxima of h
94
Monte Carlo Statistical Methods: Monte Carlo Optimization [95]
Metropolis Algorithm/Simulated Annealing
Simulation method proposed by Metropolis et al. (1953)
Starting from $\theta_0$, $\zeta$ is generated from a
uniform distribution in a neighborhood of $\theta_0$.
The new value of $\theta$ is generated as
$$\theta_1 = \begin{cases} \zeta & \text{with probability } \rho = \exp(\Delta h/T) \wedge 1 \\ \theta_0 & \text{with probability } 1 - \rho, \end{cases}$$
where $\Delta h = h(\zeta) - h(\theta_0)$.
If $h(\zeta) \ge h(\theta_0)$, $\zeta$ is accepted
If $h(\zeta) < h(\theta_0)$, $\zeta$ may still be accepted
This allows escape from local maxima
95
Monte Carlo Statistical Methods: Monte Carlo Optimization [96]
Metropolis/Simulated Annealing Algorithm
In its most usual implementation, the simulated annealing algorithm
modifies the temperature $T$ at each iteration.
It has the form
1. Simulate $\zeta$ from an instrumental distribution
with density $g(|\zeta - \theta_i|)$;
2. Accept $\theta_{i+1} = \zeta$ with probability
$\rho_i = \exp\{\Delta h_i/T_i\} \wedge 1$;
take $\theta_{i+1} = \theta_i$ otherwise.
3. Update $T_i$ to $T_{i+1}$.
96
Monte Carlo Statistical Methods: Monte Carlo Optimization [97]
Metropolis/Simulated Annealing Algorithm - Comments
1. Simulate $\zeta$ from an instrumental distribution
with density $g(|\zeta - \theta_i|)$;
2. Accept $\theta_{i+1} = \zeta$ with probability
$\rho_i = \exp\{\Delta h_i/T_i\} \wedge 1$;
take $\theta_{i+1} = \theta_i$ otherwise.
3. Update $T_i$ to $T_{i+1}$.
All positive moves are accepted
As $T \downarrow 0$:
Harder to accept downward moves
No big downward moves
Not a Markov Chain - difficult to analyze
97
Monte Carlo Statistical Methods: Monte Carlo Optimization [98]
Simple Example Revisited
Recall the function $h(x) = [\cos(50x) + \sin(20x)]^2$.
The specific algorithm we use is:
Starting at iteration $t$, the iteration is at $(x^{(t)}, h(x^{(t)}))$:
1. Simulate $u \sim \mathcal{U}(a_t, b_t)$ where $a_t = \max(x^{(t)} - r, 0)$ and $b_t = \min(x^{(t)} + r, 1)$
2. Accept $x^{(t+1)} = u$ with probability
$$\rho^{(t)} = \min\left\{\exp\left(\frac{h(u) - h(x^{(t)})}{T_t}\right), 1\right\},$$
take $x^{(t+1)} = x^{(t)}$ otherwise.
3. Update $T_t$ to $T_{t+1}$.
The value of $r$ controls the size of the interval around the current point
(staying in $(0, 1)$).
The value of $T_t$ controls the cooling.
98
Monte Carlo Statistical Methods: Monte Carlo Optimization [99]
The Trajectory
Left Panel is the function
Right Panel is the Simulated Annealing Trajectory
99
Monte Carlo Statistical Methods: Monte Carlo Optimization [100]
R Program
par(mfrow=c(1,2))
#The function to be optimized
mci <- function(x){(cos(50*x)+sin(20*x))^2}
plot(function(x)mci(x), xlim=c(0,1),ylim=c(0,4),lwd=2)
#optimize(mci, c(0, 1), tol = 0.0001, maximum=TRUE)
#The monte carlo maximum
nsim<-2500
u<-runif(nsim)
#Simulated annealing
xval<-array(0,c(nsim,1));r<-.5
for(i in 2:nsim){
test<-runif(1, min=max(xval[i-1]-r,0),max=min(xval[i-1]+r,1));
delta<-mci(test)-mci(xval[i-1]);
rho<-min(exp(delta*log(i)/1),1);
xval[i]<-test*(u[i]<rho)+xval[i-1]*(u[i]>rho)
}
mci(xval[nsim])
plot(xval,mci(xval),type="l",lwd=2)
100
Monte Carlo Statistical Methods: Monte Carlo Optimization [101]
Simulated Annealing Property
Theorem 5.7: Under mild assumptions, the Simulated Annealing algorithm
is guaranteed to find the global maximum.
101
Monte Carlo Statistical Methods: Monte Carlo Optimization [102]
Return to the difficult minimization
Apply simulated annealing
Different choices of $T_i$
Results are dependent on the choice of $T_i$
$T_i \propto 1/\log(i + 1)$ preferred
102
Monte Carlo Statistical Methods: Monte Carlo Optimization [103]
Simulated Annealing Runs
$g \sim$ Uniform$(-.1, .1)$
Starting point $(0.5, 0.4)$

  Case   T_i              theta_T          h(theta_T)   min_t h(theta_t)   Accept. rate
  1      1/10i            (1.94, 0.480)    0.198        4.02 x 10^-7       0.9998
  2      1/log(1+i)       (1.99, 0.133)    3.408        3.823 x 10^-7      0.96
  3      100/log(1+i)     (0.575, 0.430)   0.0017       4.708 x 10^-9      0.6888
  4      1/10 log(1+i)    (0.121, 0.150)   0.0359       2.382 x 10^-7      0.71

Case 3 explores the valley near the minimum
Recommended: $T_i \propto \Gamma/\log(i + 1)$ for $\Gamma$ large
103
Monte Carlo Statistical Methods: Monte Carlo Optimization [104]
$T_i = 1/10i$
[Figure: simulated annealing sample path for this temperature schedule.]
104
Monte Carlo Statistical Methods: Monte Carlo Optimization [105]
$T_i = 1/\log(i + 1)$
[Figure: simulated annealing sample path for this temperature schedule.]
105
Monte Carlo Statistical Methods: Monte Carlo Optimization [106]
$T_i = 100/\log(i + 1)$
[Figure: simulated annealing sample path for this temperature schedule.]
106
Monte Carlo Statistical Methods: Monte Carlo Optimization [107]
Section 5.3: Missing Data
Methods that work directly with the objective function are less concerned
with fast exploration of the space.
We need to be careful when approximating an objective function: we may
introduce an additional level of error.
Many of these methods work well in missing data models, where the
likelihood $g(x|\theta)$ can be expressed as
$$g(x|\theta) = \int_{\mathcal{Z}} f(x, z|\theta)\, dz$$
More generally, the function $h(x)$ to be optimized can be expressed as the
expectation
$$h(x) = E[H(x, Z)]$$
107
Monte Carlo Statistical Methods: Monte Carlo Optimization [108]
Example 5.14: Censored data likelihood
Observe $Y_1, \ldots, Y_n$, iid, from $f(y - \theta)$.
Order the observations so that $\mathbf{y} = (y_1, \ldots, y_m)$ are uncensored and
$(y_{m+1}, \ldots, y_n)$ are censored (and equal to $a$).
108
Monte Carlo Statistical Methods: Monte Carlo Optimization [109]
The observed likelihood function is
$$L(\theta|\mathbf{y}) = [1 - F(a - \theta)]^{n-m} \prod_{i=1}^m f(y_i - \theta),$$
where $F$ is the cdf associated with $f$.
If we had observed the last $n - m$ values, say $\mathbf{z} = (z_{m+1}, \ldots, z_n)$, with
$z_i > a$ ($i = m+1, \ldots, n$), we could have constructed the complete data
likelihood
$$L^c(\theta|\mathbf{y}, \mathbf{z}) = \prod_{i=1}^m f(y_i - \theta) \prod_{i=m+1}^n f(z_i - \theta),$$
with which it often is easier to work.
Note that
$$L(\theta|\mathbf{y}) = E[L^c(\theta|\mathbf{y}, \mathbf{Z})] = \int_{\mathcal{Z}} L^c(\theta|\mathbf{y}, \mathbf{z})\, f(\mathbf{z}|\mathbf{y}, \theta)\, d\mathbf{z},$$
where $f(\mathbf{z}|\mathbf{y}, \theta)$ is the density of the missing data conditional on the
observed data.
109
Monte Carlo Statistical Methods: Monte Carlo Optimization [110]
Three Likelihoods
For $f(y - \theta) = N(\theta, 1)$ three likelihoods are shown:
leftmost (dotted): values greater than 4.5 are replaced by the value 4.5
center (solid): observed data likelihood
rightmost (dashed): the actual data.
Right panel: EM/MCEM algorithms
[Figure: the three likelihoods (left) and the EM/MCEM estimates by iteration (right).]
110
Monte Carlo Statistical Methods: Monte Carlo Optimization [111]
Section 5.3.2: The EM Algorithm
Dempster, Laird and Rubin (1977)
Takes advantage of the representation
$$g(x|\theta) = \int_{\mathcal{Z}} f(x, z|\theta)\, dz$$
Solves a sequence of easier maximization problems
Limit is the answer to the original problem
111
Monte Carlo Statistical Methods: Monte Carlo Optimization [112]
EM Details
Observe $X_1, \ldots, X_n$, iid from $g(x|\theta)$, and want to compute
$$\hat{\theta} = \arg\max L(\theta|\mathbf{x}) = \prod_{i=1}^n g(x_i|\theta)$$
We augment the data with $\mathbf{z}$, where $X, Z \sim f(x, z|\theta)$
Note the basic EM identity
$$k(z|\theta, x) = \frac{f(x, z|\theta)}{g(x|\theta)},$$
where $k(z|\theta, x)$ is the conditional distribution of the missing data $Z$
given the observed data $x$.
112
Monte Carlo Statistical Methods: Monte Carlo Optimization [113]
EM Details - continued
The identity leads to the following relationship between the
complete-data likelihood $L^c(\theta|\mathbf{x}, \mathbf{z})$ and the
observed data likelihood $L(\theta|\mathbf{x})$:
for any value $\theta_0$,
$$\log L(\theta|\mathbf{x}) = E_{\theta_0}[\log L^c(\theta|\mathbf{x}, \mathbf{z})] - E_{\theta_0}[\log k(\mathbf{z}|\theta, \mathbf{x})],$$
where the expectation is with respect to $k(\mathbf{z}|\theta_0, \mathbf{x})$.
To maximize $\log L(\theta|\mathbf{x})$, we only have to deal with the first term on the
right side, as the other term can be ignored.
113
Monte Carlo Statistical Methods: Monte Carlo Optimization [114]
EM Details - continued
Note that
$$E_{\theta_0}[\log L^c(\theta|\mathbf{x}, \mathbf{z})] = \int \log L^c(\theta|\mathbf{x}, \mathbf{z})\, k(\mathbf{z}|\theta_0, \mathbf{x})\, d\mathbf{z}.$$
Given $\theta_0$, we then maximize $E_{\theta_0}[\log L^c(\theta|\mathbf{x}, \mathbf{z})]$ in $\theta$.
A sequence of estimators $\hat{\theta}_{(j)}$, $j = 1, 2, \ldots$, is obtained iteratively:
$$E_{\hat{\theta}_{(j-1)}}[\log L^c(\hat{\theta}_{(j)}|\mathbf{x}, \mathbf{z})] = \max_\theta E_{\hat{\theta}_{(j-1)}}[\log L^c(\theta|\mathbf{x}, \mathbf{z})].$$
114
Monte Carlo Statistical Methods: Monte Carlo Optimization [115]
EM Details - continued
The iteration contains both an expectation step and a maximization step,
giving the algorithm its name.
1. Compute
$$E_{\hat{\theta}_{(m)}}[\log L^c(\theta|\mathbf{x}, \mathbf{z})],$$
where the expectation is with respect to $k(\mathbf{z}|\hat{\theta}_{(m)}, \mathbf{x})$ (the E-step).
2. Maximize $E_{\hat{\theta}_{(m)}}[\log L^c(\theta|\mathbf{x}, \mathbf{z})]$ in $\theta$ and take (the M-step)
$$\hat{\theta}_{(m+1)} = \arg\max_\theta E_{\hat{\theta}_{(m)}}[\log L^c(\theta|\mathbf{x}, \mathbf{z})].$$
The iterations are conducted until a fixed point is obtained.
115
Monte Carlo Statistical Methods: Monte Carlo Optimization [116]
EM Theorem
Theoretical core of the EM Algorithm:
by maximizing $E_{\hat{\theta}_{(m)}}[\log L^c(\theta|\mathbf{x}, \mathbf{z})]$ at each step,
the observed data likelihood on the left is increased at each step.
Theorem 5.15
The sequence $(\hat{\theta}_{(j)})$ satisfies
$$L(\hat{\theta}_{(j+1)}|\mathbf{x}) \ge L(\hat{\theta}_{(j)}|\mathbf{x}).$$
116
Monte Carlo Statistical Methods: Monte Carlo Optimization [117]
Genetic Linkage
The classic missing data example:
197 animals are distributed into four categories
$$(x_1, x_2, x_3, x_4) = (125, 18, 20, 34)$$
and modeled with the multinomial distribution
$$\mathcal{M}\left(n;\ \frac{1}{2} + \frac{\theta}{4},\ \frac{1}{4}(1-\theta),\ \frac{1}{4}(1-\theta),\ \frac{\theta}{4}\right).$$
Estimation is easier if the $x_1$ cell is split into two cells, so we create the
augmented model
$$(z_1, z_2, x_2, x_3, x_4) \sim \mathcal{M}\left(n;\ \frac{1}{2},\ \frac{\theta}{4},\ \frac{1}{4}(1-\theta),\ \frac{1}{4}(1-\theta),\ \frac{\theta}{4}\right),$$
with $x_1 = z_1 + z_2$.
117
Monte Carlo Statistical Methods: Monte Carlo Optimization [118]
Genetic Linkage
The observed likelihood function is proportional to
$$\left(\frac{1}{2} + \frac{\theta}{4}\right)^{x_1} \left[\frac{1}{4}(1-\theta)\right]^{x_2+x_3} \left(\frac{\theta}{4}\right)^{x_4}
\propto (2+\theta)^{x_1}\, (1-\theta)^{x_2+x_3}\, \theta^{x_4},$$
and the complete-data likelihood function is
$$\left(\frac{1}{2}\right)^{z_1} \left(\frac{\theta}{4}\right)^{z_2} \left[\frac{1}{4}(1-\theta)\right]^{x_2+x_3} \left(\frac{\theta}{4}\right)^{x_4}
\propto \theta^{z_2+x_4}\, (1-\theta)^{x_2+x_3}.$$
The missing data density is
$$\text{missing data density} = \frac{\text{complete-data likelihood function}}{\text{observed likelihood function}}.$$
118
Monte Carlo Statistical Methods: Monte Carlo Optimization [119]
Genetic Linkage
The observed likelihood function is $\propto (2+\theta)^{x_1}\, (1-\theta)^{x_2+x_3}\, \theta^{x_4}$,
and the complete-data likelihood function is $\propto \theta^{z_2+x_4}\, (1-\theta)^{x_2+x_3}$.
The missing data density is
$$\frac{\theta^{z_2+x_4}\, (1-\theta)^{x_2+x_3}}{(2+\theta)^{x_1}\, (1-\theta)^{x_2+x_3}\, \theta^{x_4}}
\propto \left(\frac{\theta}{2+\theta}\right)^{z_2} \left(\frac{2}{2+\theta}\right)^{x_1 - z_2},$$
so $Z_2 \sim \text{binomial}\left(x_1, \frac{\theta}{2+\theta}\right)$.
Note that $x_2$, $x_3$, $x_4$ cancel.
119
Monte Carlo Statistical Methods: Monte Carlo Optimization [120]
Genetic Linkage
For the EM algorithm, the expected complete log-likelihood function is
$$E_{\theta_0}[(Z_2 + x_4)\log\theta + (x_2+x_3)\log(1-\theta)]
= \left(\frac{\theta_0}{2+\theta_0}\, x_1 + x_4\right)\log\theta + (x_2+x_3)\log(1-\theta),$$
and the EM iterates are
$$\hat{\theta}_{j+1} = \arg\max_\theta \left\{\left(\frac{\hat{\theta}_j}{2+\hat{\theta}_j}\, x_1 + x_4\right)\log\theta + (x_2+x_3)\log(1-\theta)\right\}
= \frac{\frac{\hat{\theta}_j}{2+\hat{\theta}_j}\, x_1 + x_4}{\frac{\hat{\theta}_j}{2+\hat{\theta}_j}\, x_1 + x_2 + x_3 + x_4}.$$
R program GeneticEM
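The GeneticEM program is not reproduced here; a minimal sketch of the iteration above is:
x <- c(125, 18, 20, 34)
theta <- 0.5                                           # starting value
for (j in 1:25) {
  ez2 <- x[1] * theta / (2 + theta)                    # E-step: E[Z2 | x, theta]
  theta <- (ez2 + x[4]) / (ez2 + x[2] + x[3] + x[4])   # M-step
}
theta                                                  # converges to about 0.6268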
120
Monte Carlo Statistical Methods: Monte Carlo Optimization [121]
EM Sequence (and standard errors)
[Figure: EM sequence for the genetic linkage data, with standard error bands, by iteration.]
121
Monte Carlo Statistical Methods: Monte Carlo Optimization [122]
Example 5.17: EM for censored data
For $Y_i \sim N(\theta, 1)$, with censoring at $a$, the complete-data likelihood is
$$L^c(\theta|\mathbf{y}, \mathbf{z}) \propto \prod_{i=1}^m \exp\{-(y_i - \theta)^2/2\} \prod_{i=m+1}^n \exp\{-(z_i - \theta)^2/2\}.$$
The density of the missing data $\mathbf{z} = (z_{m+1}, \ldots, z_n)$ is a truncated normal,
$$\mathbf{Z} \sim k(\mathbf{z}|\theta, \mathbf{y}) = \frac{1}{(2\pi)^{(n-m)/2}} \exp\left\{-\sum_{i=m+1}^n (z_i - \theta)^2/2\right\},$$
restricted to $z_i > a$.
122
Monte Carlo Statistical Methods: Monte Carlo Optimization [123]
Censored EM - continued
Complete-data log likelihood:
$$-\frac{1}{2}\sum_{i=1}^m (y_i - \theta)^2 - \frac{1}{2}\sum_{i=m+1}^n E[(Z_i - \theta)^2].$$
Differentiate and set equal to zero, solving for the EM estimate
$$\hat{\theta} = \frac{m\bar{y} + (n-m)E(Z_1)}{n}.$$
Evaluate the expectation to get the EM sequence
$$\hat{\theta}^{(j+1)} = \frac{m\bar{y} + (n-m)\left[\hat{\theta}^{(j)} + \dfrac{\varphi(a - \hat{\theta}^{(j)})}{1 - \Phi(a - \hat{\theta}^{(j)})}\right]}{n},$$
where $\varphi$ and $\Phi$ are the normal pdf and cdf, respectively.
123
Monte Carlo Statistical Methods: Monte Carlo Optimization [124]
Section 5.3.3: Monte Carlo EM
A difficulty with the implementation of the EM algorithm is that each
E-step requires the computation of the expected log likelihood
$$E_{\theta_0}[\log L^c(\theta|\mathbf{x}, \mathbf{z})].$$
To overcome this difficulty:
simulate $Z_1, \ldots, Z_m \sim k(\mathbf{z}|\mathbf{x}, \theta)$
maximize the approximate complete data log-likelihood
$$\hat{E}_{\theta_0}[\log L^c(\theta|\mathbf{x}, \mathbf{z})] = \frac{1}{m}\sum_{i=1}^m \log L^c(\theta|\mathbf{x}, \mathbf{z}_i).$$
124
Monte Carlo Statistical Methods: Monte Carlo Optimization [125]
Monte Carlo EM -2
Maximize the approximate complete data log-likelihood
$$\hat{E}_{\theta_0}[\log L^c(\theta|\mathbf{x}, \mathbf{z})] = \frac{1}{m}\sum_{i=1}^m \log L^c(\theta|\mathbf{x}, \mathbf{z}_i).$$
When $m$ goes to infinity, this quantity converges to $E_{\theta_0}[\log L^c(\theta|\mathbf{x}, \mathbf{z})]$.
Thus, Monte Carlo EM converges to regular EM.
125
Monte Carlo Statistical Methods: Monte Carlo Optimization [126]
Genetic Linkage
For the Monte Carlo EM algorithm, we average the complete-data log
likelihood over $z_2$:
$$\frac{1}{m}\sum_{i=1}^m \log\left[\theta^{z_{2i}+x_4}(1-\theta)^{x_2+x_3}\right]
= \left(\frac{1}{m}\sum_{i=1}^m z_{2i} + x_4\right)\log\theta + (x_2+x_3)\log(1-\theta)
= (\bar{z}_2 + x_4)\log\theta + (x_2+x_3)\log(1-\theta),$$
where $\bar{z}_2 = \frac{1}{m}\sum_{i=1}^m z_{2i}$, $z_{2i} \sim \text{Binomial}(x_1, \theta_0/(2+\theta_0))$.
126
Monte Carlo Statistical Methods: Monte Carlo Optimization [127]
Genetic Linkage
The Monte Carlo MLE in $\theta$ is then the Beta MLE
$$\hat{\theta} = \frac{\bar{z}_2 + x_4}{\bar{z}_2 + x_2 + x_3 + x_4}.$$
For the EM sequence
$$\hat{\theta}^{(j+1)} = \frac{m\bar{y} + (n-m)E_{\hat{\theta}^{(j)}}(Z_1)}{n},$$
the MCEM solution replaces $E_{\hat{\theta}^{(j)}}(Z_1)$ with
$$\frac{1}{M}\sum_{i=1}^M Z_i, \qquad Z_i \sim k(z|\hat{\theta}^{(j)}, \mathbf{y}).$$
127
Monte Carlo Statistical Methods: Monte Carlo Optimization [128]
Censored MCEM
Complete-data log likelihood:
$$-\frac{1}{2}\sum_{i=1}^m (y_i - \theta)^2 - \frac{1}{2}\sum_{i=m+1}^n E[(Z_i - \theta)^2].$$
Differentiate and set equal to zero, solving for the EM estimate
$$\hat{\theta} = \frac{m\bar{y} + (n-m)E(Z_1)}{n}.$$
Evaluate the expectation to get the MCEM sequence
$$\hat{\theta}^{(j+1)} = \frac{m\bar{y} + (n-m)\bar{Z}}{n},$$
where $\bar{Z}$ is the mean of $(Z_1, \ldots, Z_M)$, and $(Z_1, \ldots, Z_M)$ are truncated
normal with mean $\hat{\theta}^{(j)}$ (restricted to values greater than $a$).
128
Monte Carlo Statistical Methods: Monte Carlo Optimization [129]
EM and MCEM Sequence for censored data
[Figure: EM and MCEM sequences for the censored data example, by iteration.]
129
Monte Carlo Statistical Methods: Monte Carlo Optimization [130]
R program
xdata<-c(3.64, 2.78, 2.91,2.85,2.54,2.62,3.16,2.21,4.05,2.19,2.97,4.32,
3.56,3.39,3.59,4.13,4.21,1.68,3.88,4.33)
n<-25;m<-20;t0<-4;a<-4.5;nt<-50
xbar<-mean(xdata);that<-array(xbar,dim=c(nt,1));
for (j in 2:nt) {
that[j] <-(m/n)*xbar+(1-m/n)*(that[j-1]+dnorm(a-that[j-1])
/(1-pnorm(a-that[j-1])))}
#now do MCEM, z=missing data, nz=size of MC sample
tmc<-array(xbar,dim=c(nt,1));nz<-500;
for (j in 2:nt) {
z<-array(a-1,dim=c(nz,1));
for (k in 1:nz) {while(z[k] <a) z[k] <- rnorm(1,mean=tmc[j-1],sd=1)}
zbar<-mean(z)
tmc[j] <-(m/n)*xbar+(1-m/n)*zbar}
plot(that,type="l",xlim=c(0,nt),ylim=c(3.3,3.7),lwd=2)
par(new=T)
plot(tmc,type="l",xlim=c(0,nt),ylim=c(3.3,3.7),xlab="",
ylab="",xaxt="n",yaxt="n",lwd=2)
130
Monte Carlo Statistical Methods: Monte Carlo Optimization [131]
EM Standard Errors
Recall that the variance of the MLE is approximated by
$$\text{Var}\,\hat{\theta} \approx \left[-\frac{\partial^2}{\partial\theta^2} E(\log L(\theta|\mathbf{x}))\right]^{-1}.$$
We estimate this with
$$\text{Var}\,\hat{\theta} \approx \left[-\frac{\partial^2}{\partial\theta^2} \log L(\theta|\mathbf{x})\Big|_{\theta=\hat{\theta}}\right]^{-1}.$$
For Genetic Linkage, the observed likelihood function is
$$(2+\theta)^{x_1}\, (1-\theta)^{x_2+x_3}\, \theta^{x_4}.$$
131
Monte Carlo Statistical Methods: Monte Carlo Optimization [132]
EM Standard Errors -2
For Genetic Linkage, the observed likelihood function is
$$(2+\theta)^{x_1}\, (1-\theta)^{x_2+x_3}\, \theta^{x_4}.$$
The variance is estimated with
$$\left[-\frac{d^2}{d\theta^2} \log\left\{(2+\theta)^{x_1}\, (1-\theta)^{x_2+x_3}\, \theta^{x_4}\right\}\Big|_{\theta=\hat{\theta}}\right]^{-1}$$
132
Monte Carlo Statistical Methods: Monte Carlo Optimization [133]
EM Sequence (and standard errors)
[Figure: EM sequence for the genetic linkage data, with standard errors, by iteration.]
133
Monte Carlo Statistical Methods: Monte Carlo Optimization [134]
MCEM Standard Errors
The variance of the MLE is approximated with the observed data likelihood:
$$\text{Var}\,\hat{\theta} \approx \left[-\frac{\partial^2}{\partial\theta^2}\log L(\theta|\mathbf{x})\right]^{-1}.$$
Oakes (1999) expressed this with only the complete-data likelihood:
$$\frac{\partial^2 \log L(\theta|\mathbf{x})}{\partial\theta^2}
= \left\{\frac{\partial^2}{\partial\theta'^2} E[\log L^c(\theta'|\mathbf{x}, \mathbf{z})\,|\,\theta]
+ \frac{\partial^2}{\partial\theta'\,\partial\theta} E[\log L^c(\theta'|\mathbf{x}, \mathbf{z})\,|\,\theta]\right\}\Big|_{\theta'=\theta},$$
with the expectation under the missing data distribution.
This expression only involves the complete data likelihood!
134
Monte Carlo Statistical Methods: Monte Carlo Optimization [135]
But the expression is not good for simulation.
With effort, we can write this as
$$\frac{\partial^2 \log L(\theta|\mathbf{x})}{\partial\theta^2}
= E\left[\frac{\partial^2 \log L(\theta|\mathbf{x}, \mathbf{z})}{\partial\theta^2}\,\Big|\,\theta\right]
+ \text{var}\left[\frac{\partial \log L(\theta|\mathbf{x}, \mathbf{z})}{\partial\theta}\,\Big|\,\theta\right].$$
This allows the Monte Carlo evaluation
$$\frac{\partial^2 \log L(\theta|\mathbf{x})}{\partial\theta^2}
\approx \frac{1}{M}\sum_{j=1}^M \frac{\partial^2 \log L(\theta|\mathbf{x}, \mathbf{z}^{(j)})}{\partial\theta^2}
+ \frac{1}{M}\sum_{j=1}^M \left[\frac{\partial \log L(\theta|\mathbf{x}, \mathbf{z}^{(j)})}{\partial\theta}
- \frac{1}{M}\sum_{j'=1}^M \frac{\partial \log L(\theta|\mathbf{x}, \mathbf{z}^{(j')})}{\partial\theta}\right]^2,$$
where $\mathbf{z}^{(j)}$, $j = 1, \ldots, M$, are generated from the missing data distribution
(and have already been generated to do MCEM).
135
Monte Carlo Statistical Methods: Markov Chains [136]
Chapter 6: Markov Chains
A Markov chain is a sequence of random variables that can be thought
of as evolving over time
The probability of a transition depends on the particular set the chain is
in.
We define a Markov chain by its transition kernel

When X is discrete, the transition kernel is simply a (transition) matrix K with elements

P_{xy} = P(X_n = y | X_{n−1} = x) ,   x, y ∈ X.

In the continuous case, the kernel also denotes the conditional density K(x, x′):

P(X ∈ A | x) = ∫_A K(x, x′) dx′ = ∫_A f(x′ | x) dx′.
136
Monte Carlo Statistical Methods: Markov Chains [137]
Section 6.1: Essentials of MCMC
In the setup of MCMC algorithms, Markov chains are constructed from a transition kernel K, a conditional probability density

X_{n+1} ~ K(X_n, X_{n+1}).

An example is a random walk

X_{n+1} = X_n + ε_n

where ε_n is generated independently of X_n, X_{n−1}, . . .

If ε_n is symmetric about zero, the sequence is called a symmetric random walk
137
Monte Carlo Statistical Methods: Markov Chains [138]
Example 6.6: AR(1) Models
AR(1) models provide a simple illustration of Markov chains on continuous state-space

Here

X_n = θ X_{n−1} + ε_n ,

with ε_n ~ N(0, σ²)

If the ε_n's are independent, X_n is independent from X_{n−2}, X_{n−3}, . . . conditionally on X_{n−1}.
138
Monte Carlo Statistical Methods: Markov Chains [139]
Essentials of MCMC - continued
The chains encountered in MCMC settings enjoy a very strong stability
property
The stationary distribution, or the marginal distribution always exists.
The stationary distribution satisfies

X_n ~ π   ⟹   X_{n+1} ~ π,
139
Monte Carlo Statistical Methods: Markov Chains [140]
AR(1) Stationary Distribution
The stationary distribution φ(x | μ, τ²) must satisfy

∫ φ(x_n | x_{n−1}, σ²) φ(x_{n−1} | μ, τ²) dx_{n−1} = φ(x_n | μ, τ²)

Evaluating the integral yields

E(X_n) = μ = θμ   and   Var(X_n) = τ² = σ² + θ²τ²

Therefore

μ = 0   and   τ² = σ²/(1 − θ²)

which requires |θ| < 1.
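The stationary variance is easy to check by simulation; the sketch below uses an arbitrary θ and run length (illustrative assumptions) and compares the sample variance with σ²/(1 − θ²).

# minimal sketch: simulate the AR(1) chain and check its stationary variance
set.seed(1)
theta <- 0.8; sigma <- 1; N <- 10000
x <- numeric(N)
for (n in 2:N) x[n] <- theta*x[n-1] + rnorm(1, 0, sigma)
c(var(x), sigma^2/(1 - theta^2))      # both should be close to 2.78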
140
Monte Carlo Statistical Methods: Markov Chains [141]
Essentials - continued
If the kernel allows for free moves over the entire state space, the chain is irreducible

This also ensures that the chains are positive recurrent, that is, they visit every set infinitely often.

The stationary distribution is also a limiting distribution, in the sense that π is the limiting distribution of X_{n+1}
141
Monte Carlo Statistical Methods: Markov Chains [142]
Essentials - continued
An irreducible, positive recurrent Markov chain is ergodic, that is, it
converges.
In a simulation setup, a consequence of this convergence property is that
the average
(1/N) Σ_{n=1}^N h(X_n)  →  E_π[h(X)]
almost surely.
Under a slightly stronger assumption a Central Limit Theorem also holds
for this average
142
Monte Carlo Statistical Methods: Markov Chains [143]
Essentials - continued
As a final essential, we associate the probabilistic language of Markov
chains with the statistical language of data analysis.
Statistics Markov Chain
marginal distribution invariant distribution
proper marginals positive recurrent
If the marginals are not proper, or if they do not exist, then the chain is
not positive recurrent. It is either null recurrent or transient, and both
are bad.
143
Monte Carlo Statistical Methods: Markov Chains [144]
AR(1) Recurrent and Transient - Note the Scale

[Figure: AR(1) sample paths for θ = 0.4, 0.8, 0.95, and 1.001; note how the scale of the paths grows as θ approaches and exceeds 1.]
144
Monte Carlo Statistical Methods: The Metropolis-Hastings Algorithm
[145]
Chapter 7: The Metropolis-Hastings Algorithm
Section 7.1: The MCMC Principle
It is not necessary to directly simulate from f to calculate

∫ h(x) f(x) dx

Now we obtain X_1, . . . , X_n approximately from f without simulating from f
Use an ergodic Markov Chain
145
Monte Carlo Statistical Methods: The Metropolis-Hastings Algorithm
[146]
Working Principle of MCMC Algorithms
For an arbitrary starting value x^{(0)}, a chain (X^{(t)}) is generated using a transition kernel with stationary distribution f

This ensures the convergence in distribution of (X^{(t)}) to a random variable from f

Given that the chain is ergodic, the starting value x^{(0)} is, in principle, unimportant.

Definition: A Markov chain Monte Carlo (MCMC) method for the simulation of a distribution f is any method producing an ergodic Markov chain (X^{(t)}) whose stationary distribution is f.
146
Monte Carlo Statistical Methods: The Metropolis-Hastings Algorithm
[147]
Section 7.3: The Metropolis-Hastings Algorithm
The algorithm starts with a target density f

A candidate density q(y|x)

The ratio f(x)/q(y|x) must be known up to a constant.
147
Monte Carlo Statistical Methods: The Metropolis-Hastings Algorithm
[148]
The Algorithm
The Metropolis–Hastings algorithm associated with the objective (target) density f and the conditional density q produces a Markov chain (X^{(t)}) through the following transition:

1. Generate Y_t ~ q(y | x^{(t)}).

2. Take

   X^{(t+1)} = Y_t       with probability ρ(x^{(t)}, Y_t),
   X^{(t+1)} = x^{(t)}   with probability 1 − ρ(x^{(t)}, Y_t),

where

   ρ(x, y) = min{ [f(y)/f(x)] [q(x|y)/q(y|x)] , 1 }.
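For concreteness, a generic sketch of this transition in R follows (an illustration only; the target f, candidate generator rq, and candidate density dq are placeholders supplied by the user).

# minimal sketch of one Metropolis-Hastings transition
mh.step <- function(x, f, rq, dq) {
  y   <- rq(x)                                     # candidate Y ~ q(.|x)
  rho <- min(1, f(y)*dq(x, y) / (f(x)*dq(y, x)))   # acceptance probability
  if (runif(1) < rho) y else x
}
# example: N(0,1) target with a N(x,1) random walk candidate
f  <- dnorm
rq <- function(x) rnorm(1, x, 1)
dq <- function(a, b) dnorm(a, b, 1)                # density of a given current point b
x  <- numeric(5000)
for (t in 2:5000) x[t] <- mh.step(x[t-1], f, rq, dq)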
148
Monte Carlo Statistical Methods: The Metropolis-Hastings Algorithm
[149]
MH Properties
This algorithm always accepts values y_t such that the ratio f(y_t)/q(y_t | x^{(t)}) is increased

It may accept values y_t such that the ratio is decreased, similar to stochastic optimization

Like the Accept–Reject method, the Metropolis–Hastings algorithm only depends on the ratios

f(y_t)/f(x^{(t)})   and   q(x^{(t)} | y_t)/q(y_t | x^{(t)})

and is, therefore, independent of normalizing constants
149
Monte Carlo Statistical Methods: The Metropolis-Hastings Algorithm
[150]
MH Properties - continued
There are similarities between MH and the Accept–Reject methods

A sample produced by MH differs from an iid sample.

For one thing, such a sample may involve repeated occurrences of the same value

Rejection of Y_t leads to repetition of X^{(t)} at time t + 1
150
Monte Carlo Statistical Methods: The Metropolis-Hastings Algorithm
[151]
MH Properties - continued
It is necessary to impose minimal regularity conditions on both f and the conditional distribution q for f to be the limiting distribution of the chain (X^{(t)})

The support of f should be connected

It is better that sup_x f(x)/q(x | x′) < ∞
151
Monte Carlo Statistical Methods: The Metropolis-Hastings Algorithm
[152]
MH Convergence
Under mild conditions, MH is a reversible, ergodic Markov Chain, hence
it converges
The empirical sums (1/M) Σ h(X_i) converge

The CLT is satisfied
152
Monte Carlo Statistical Methods: The Metropolis-Hastings Algorithm
[153]
Section 7.4: The Independent MH Algorithm
the instrumental distribution q is independent of X^{(t)}, and is denoted g by analogy. Given x^{(t)},

(a) Generate Y_t ~ g(y).

(b) Take

    X^{(t+1)} = Y_t       with probability  min{ [f(Y_t) g(x^{(t)})] / [f(x^{(t)}) g(Y_t)] , 1 }
    X^{(t+1)} = x^{(t)}   otherwise.

Although the Y_t's are generated independently, the resulting sample is not iid, if only because the probability of acceptance of Y_t depends on X^{(t)}
153
Monte Carlo Statistical Methods: The Metropolis-Hastings Algorithm
[154]
Example 7.10: Generating Gamma Variables
Generate Ga(α, β) using a Gamma Ga([α], b) candidate (where [a] denotes the integer part of a).

Take β = 1

1. Generate Y_t ~ Ga([α], [α]/α).

2. Take

   X^{(t+1)} = Y_t       with probability ρ_t
   X^{(t+1)} = x^{(t)}   otherwise,

where

   ρ_t = min{ [ (Y_t / x^{(t)}) exp( (x^{(t)} − Y_t)/α ) ]^{α−[α]} , 1 }.
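A short sketch of this sampler in R follows; the value of α and the chain length are illustrative assumptions.

# minimal sketch of Example 7.10: Ga(alpha, 1) via an independent Ga([alpha], [alpha]/alpha) candidate
alpha <- 2.7; a <- floor(alpha)
nsim  <- 5000; x <- numeric(nsim); x[1] <- alpha
for (t in 2:nsim) {
  y    <- rgamma(1, shape = a, rate = a/alpha)
  rho  <- min(1, ((y/x[t-1]) * exp((x[t-1] - y)/alpha))^(alpha - a))
  x[t] <- if (runif(1) < rho) y else x[t-1]
}
mean(x)                                # should be close to alpha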
154
Monte Carlo Statistical Methods: The Metropolis-Hastings Algorithm
[155]
Example 7.11: Logistic Regression
Return to the Challenger Data
We observe (x_i, y_i), i = 1, . . . , n according to the model

Y_i ~ Bernoulli(p(x_i)),   p(x) = exp(α + βx) / [1 + exp(α + βx)],

where p(x) is the probability of an O-ring failure at temperature x.

The likelihood is

L(α, β | y) ∝ Π_{i=1}^n [ exp(α + βx_i)/(1 + exp(α + βx_i)) ]^{y_i} [ 1/(1 + exp(α + βx_i)) ]^{1−y_i}

and we take the prior to be

π_α(α | b) π_β(β) dα dβ = (1/b) e^{α} e^{−e^{α}/b} dα dβ,
155
Monte Carlo Statistical Methods: The Metropolis-Hastings Algorithm
[156]
Logistic Regression - continued
The prior

π_α(α | b) π_β(β) dα dβ = (1/b) e^{α} e^{−e^{α}/b} dα dβ,

puts an exponential prior on e^{α}

a flat prior on β

and ensures propriety of the posterior distribution

Choose b so that E[α] = α̂, where α̂ is the MLE of α
156
Monte Carlo Statistical Methods: The Metropolis-Hastings Algorithm
[157]
Logistic Regression - continued
The posterior distribution is proportional to L(α, β | y) π(α, β)

To simulate from this distribution we take an independent candidate

g(α, β) = π_α(α | b̂) φ(β),

where φ(β) is a normal distribution with mean β̂ and variance σ̂²_β, the MLEs.

Note that although basing the prior distribution on the data is somewhat in violation of the formal Bayesian paradigm, nothing is violated if the candidate depends on the data.

In fact, this will usually result in a more effective simulation, as the candidate is placed close to the target.
157
Monte Carlo Statistical Methods: The Metropolis-Hastings Algorithm
[158]
Logistic Regression - continued
Generating a random variable from g(α, β) is straightforward

If we are at the point (α_0, β_0) in the Markov chain, and we generate (α′, β′) from g(α, β), we accept the candidate with probability

min{ [ L(α′, β′ | y) / L(α_0, β_0 | y) ] [ φ(β_0) / φ(β′) ] , 1 }.
158
Monte Carlo Statistical Methods: The Metropolis-Hastings Algorithm
[159]
Logistic Regression - continued
Estimation of the slope and intercept from the Challenger logistic regression. The top panels show histograms of the distribution of the coefficients, while the bottom panels show the convergence of the means.

[Figure: histograms of the intercept (roughly 10 to 16) and slope (roughly −0.25 to −0.15), with running-mean plots over 10,000 iterations.]
159
Monte Carlo Statistical Methods: The Metropolis-Hastings Algorithm
[160]
Logistic Regression - continued
Estimation of the failure probabilities from the Challenger logistic regression. The left panel is for 65° Fahrenheit and the right panel is for 40°.

We can run the programs

[Figure: posterior densities of the failure probability at 65°F (spread over roughly 0.2 to 0.8) and at 40°F (concentrated near 1).]
160
Monte Carlo Statistical Methods: The Metropolis-Hastings Algorithm
[161]
Section 7.5 Random Walk Metropolis
Take into account the value previously simulated to generate the following
value
This idea is already used in algorithms such as simulated annealing

Since the candidate g in the MH algorithm is allowed to depend on the current state X^{(t)}, a first choice to consider is to simulate Y_t according to

Y_t = X^{(t)} + ε_t ,

where ε_t is a random perturbation with distribution g, independent of X^{(t)}.

q(y|x) is now of the form g(y − x)
The Markov chain associated with q is a random walk
161
Monte Carlo Statistical Methods: The Metropolis-Hastings Algorithm
[162]
Random Walk Metropolis - continued
The choice of a symmetric function g (that is, such that g(−t) = g(t)) leads to the following random walk MH algorithm

Given x^{(t)},

(a) Generate Y_t ~ g(|y − x^{(t)}|).

(b) Take

    X^{(t+1)} = Y_t       with probability  min{ 1 , f(Y_t)/f(x^{(t)}) }
    X^{(t+1)} = x^{(t)}   otherwise.
162
Monte Carlo Statistical Methods: The Metropolis-Hastings Algorithm
[163]
Random Walk Metropolis - continued
Hastings (1970) considers the generation of the normal distribution N(0, 1) based on the uniform distribution on [−δ, δ]

The algorithm: At time t

(a) Generate Y = X_t + U, with U ~ U(−δ, δ)

(b) ρ = min{ exp(−0.5(Y² − X_t²)) , 1 }

(c) X_{t+1} = Y with probability ρ; X_{t+1} = X_t otherwise

Three samples of 20,000 points produced by this method for δ = 0.1, 0.5, and 1.
R program Hastings
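The course's Hastings program is not reproduced here; a minimal sketch of what such a program might look like follows (the δ values and run length are assumptions).

# minimal sketch (not the course's program): random walk Metropolis for N(0,1)
# with a Uniform(-delta, delta) perturbation
rwmet <- function(delta, nsim = 20000) {
  x <- numeric(nsim)
  for (t in 2:nsim) {
    y    <- x[t-1] + runif(1, -delta, delta)
    rho  <- min(1, exp(-0.5*(y^2 - x[t-1]^2)))
    x[t] <- if (runif(1) < rho) y else x[t-1]
  }
  x
}
hist(rwmet(1), freq = FALSE, breaks = 50)   # compare with the N(0,1) density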
163
Monte Carlo Statistical Methods: The Metropolis-Hastings Algorithm
[164]
Random Walk Metropolis - continued
[Figure: for δ = 0.1, 0.5, 1, histograms of the simulated values against the N(0,1) density, and the running means cumsum(x[i,])/(1:nsim); the running mean settles near 0 only for the larger ranges.]
Note the convergence for larger ranges
R program RandomWalkMet
164
Monte Carlo Statistical Methods: The Metropolis-Hastings Algorithm
[165]
Random Walk Metropolis - 3
Explaining the behavior
The Random Walk
Y = X_t + U ,   U ~ U(−δ, δ)

has high autocorrelation for small δ

High Autocorrelation ⟹ Poor Mixing

Look at Autocorrelation for δ = 0.1, 0.5, and 1.
R program RandomWalkMetAC
165
Monte Carlo Statistical Methods: The Metropolis-Hastings Algorithm
[166]
Random Walk Metropolis - 4
[Figure: histograms of the simulated values and autocorrelation functions (ACF plots) for δ = 0.1, 0.5, 1; the autocorrelation dies out much faster for the larger ranges.]
Smaller Autocorrelation for larger ranges
R program RandomWalkMetAC
166
Monte Carlo Statistical Methods: The Two Stage Gibbs Sampler [167]
Chapter 9: The Two Stage Gibbs Sampler
The implementation of the two-stage Gibbs sampler is straightforward.
Suppose that the random variables X and Y have joint density f(x, y)

The two-stage Gibbs sampler generates a Markov chain (X_t, Y_t) according to the following steps:

Take X_0 = x_0. For t = 1, 2, . . . , generate

1. Y_t ~ f_{Y|X}(· | x_{t−1});
2. X_t ~ f_{X|Y}(· | y_t),

where f_{Y|X} and f_{X|Y} are the conditional distributions associated with f

Then (X_t, Y_t) → (X, Y) ~ f(x, y)

X_t → X ~ f(x)

Y_t → Y ~ f(y)
167
Monte Carlo Statistical Methods: The Two Stage Gibbs Sampler [168]
Example 9.1: Normal Bivariate Gibbs
For the special case of the bivariate normal density,

(X, Y) ~ N_2( 0, [ 1 ρ ; ρ 1 ] ),

The Gibbs sampler is: Given y_t, generate

X_{t+1} | y_t ~ N(ρ y_t, 1 − ρ²),
Y_{t+1} | x_{t+1} ~ N(ρ x_{t+1}, 1 − ρ²).

The Gibbs sampler is obviously not necessary in this particular case

The marginal Markov chain in X is defined by the AR(1) relation

X_{t+1} = ρ² X_t + σ ε_t ,   ε_t ~ N(0, 1),

with σ² = 1 − ρ² + ρ²(1 − ρ²) = 1 − ρ⁴.

The stationary distribution of this chain is N( 0, (1 − ρ⁴)/(1 − ρ⁴) ), that is, the N(0, 1) marginal.
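A direct simulation of this two-stage sampler is sketched below; ρ and the run length are illustrative choices.

# minimal sketch of Example 9.1: two-stage Gibbs sampler for a bivariate normal
rho <- 0.8; nsim <- 5000
x <- y <- numeric(nsim)
for (t in 2:nsim) {
  x[t] <- rnorm(1, rho*y[t-1], sqrt(1 - rho^2))
  y[t] <- rnorm(1, rho*x[t],   sqrt(1 - rho^2))
}
c(mean(x), var(x))                   # should approach 0 and 1, the N(0,1) marginal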
168
Monte Carlo Statistical Methods: The Two Stage Gibbs Sampler [169]
Gibbs Sampler: Missing Data
Gibbs works well in missing data models
We start with a marginal density f_X(x) and construct (or complete) a joint density to aid in simulation

Like the case of the EM algorithm

In missing data models we write

g(x|θ) = ∫_Z f(x, z|θ) dz

Which results in the Gibbs sampler

θ ~ f(x, z|θ) / ∫ f(x, z|θ) dθ       Z ~ f(x, z|θ) / ∫_Z f(x, z|θ) dz
169
Monte Carlo Statistical Methods: The Two Stage Gibbs Sampler [170]
Example 9.7: Grouped counting data
For 360 consecutive time units, consider recording the number of passages
of individuals, per unit time, past some sensor.
the number of cars observed at a crossroad
number of leucocytes in a region of a blood sample
Hypothetical results are
Number of passages:       0     1     2     3     4 or more
Number of observations:   139   128   55    25    13
170
Monte Carlo Statistical Methods: The Two Stage Gibbs Sampler [171]
Poisson Bayes completion
Assume a Poisson P(λ) model

The observed data likelihood is

ℓ(λ | x_1, . . . , x_5) ∝ e^{−347λ} λ^{128 + 55·2 + 25·3} [ 1 − e^{−λ} Σ_{i=0}^{3} λ^i/i! ]^{13},

for x_1 = 139, . . . , x_5 = 13.

Complete the data with

z = (z_1, . . . , z_{13})
171
Monte Carlo Statistical Methods: The Two Stage Gibbs Sampler [172]
Poisson Bayes completion
Start with x = observed data, z = missing data, then

X | λ ~ Π_{i=1}^{347} e^{−λ} λ^{x_i} / x_i!

Z | λ, x ~ Π_{i=1}^{13} e^{−λ} λ^{z_i} / z_i!  I(z_i ≥ 4).

The joint distribution is

∝ e^{−360λ} λ^{Σ_i x_i + Σ_i z_i} / [ Π_i x_i! Π_i z_i! ] × Π_i I(z_i ≥ 4)
172
Monte Carlo Statistical Methods: The Two Stage Gibbs Sampler [173]
Poisson Bayes completion
For π(λ) ∝ 1/λ, the full conditionals are

Z | λ, x ~ Truncated Poisson(λ)

λ | z, x ~ Gamma( Σ_i x_i + Σ_i z_i + 1, 1/360 )

A Gibbs sampler in λ and z can do the calculations

Given λ^{(t−1)},

1. Simulate Y_i^{(t)} ~ P(λ^{(t−1)}) I_{y≥4}   (i = 1, . . . , 13)

2. Simulate λ^{(t)} ~ Ga( 313 + Σ_{i=1}^{13} y_i^{(t)}, 360 ).
173
Monte Carlo Statistical Methods: The Two Stage Gibbs Sampler [174]
Poisson Bayes Gibbs Sampler R code
nsim<-500;lam<-array(313/360,dim=c(nsim,1));y<-array(0,dim=c(13,1));
for (j in 2:nsim) {
for(i in 1:13){while(y[i] < 4) y[i] <- rpois(1,lam[j -1])};a<-313+sum(y);
lam[j]<-rgamma(1,a,scale=1/360);
y<-y*0;
}
den<-1:(nsim)
meanlam<-cumsum(lam)/den;
par(mfrow=c(1,2))
plot(meanlam,type="l",ylim=c(.9,1.1),xlab="iteration",
ylab="estimate",col="red")
hist(lam,main="Mean",freq=F,col="green")
R program PoissonCompletion
174
Monte Carlo Statistical Methods: The Two Stage Gibbs Sampler [175]
Poisson Bayes Gibbs Sampler Output
[Figure: running mean of the λ draws over the iterations (settling near 1.0) and histogram of the simulated λ values (roughly 0.9 to 1.2).]
175
Monte Carlo Statistical Methods: The Two Stage Gibbs Sampler [176]
Two Estimators of Lambda
Output from the Gibbs Sampler
Z | λ, x ~ Truncated Poisson(λ)

λ | z, x ~ Gamma( Σ_i x_i + Σ_i z_i + 1, 1/360 )

Estimate λ with the Empirical Average

(1/M) Σ_{j=1}^M λ^{(j)}

or the Conditional Expectation

λ̂_rb = (1/M) Σ_{j=1}^M E[ λ | x, z^{(j)} ] = (1/(360 M)) Σ_{j=1}^M ( 313 + Σ_{i=1}^{13} z_i^{(j)} ),
176
Monte Carlo Statistical Methods: The Two Stage Gibbs Sampler [177]
Two Estimators of Lambda -2
or the Conditional Expectation

λ̂_rb = (1/M) Σ_{j=1}^M E[ λ | x, z^{(j)} ] = (1/(360 M)) Σ_{j=1}^M ( 313 + Σ_{i=1}^{13} z_i^{(j)} ),

Rao-Blackwellized

Typically Smoother

Convergence Diagnostic: Both estimators converge
R program PoissonCompletion2
See R program PoissonCompletion3 to eliminate while
177
Monte Carlo Statistical Methods: The Two Stage Gibbs Sampler [178]
Poisson Gibbs Sampler - Convergence of Estimators
[Figure: the empirical-average and Rao-Blackwellized estimates of λ over 500 iterations (both stabilizing between 0.9 and 1.1), together with the histogram of the λ draws.]
178
Monte Carlo Statistical Methods: The Two Stage Gibbs Sampler [179]
Poisson EM Algorithm
There is a corresponding EM algorithm: For the observed data likelihood

L(λ | x_1, . . . , x_5) ∝ e^{−347λ} λ^{313} [ 1 − e^{−λ} Σ_{i=0}^{3} λ^i/i! ]^{13},

We have the complete data likelihood

L(λ | x_1, . . . , x_5, z) ∝ e^{−347λ} λ^{313} [ e^{−13λ} Π_{i=1}^{13} λ^{z_i} / z_i! ],

With expected log likelihood

log L ≈ −360λ + ( 313 + E[ Σ_i z_i ] ) log λ
179
Monte Carlo Statistical Methods: The Two Stage Gibbs Sampler [180]
Poisson EM Algorithm
from the expected log likelihood

−360λ + ( 313 + E[ Σ_i z_i ] ) log λ

We get the Monte Carlo EM iteration

λ^{(t+1)} = (1/360) ( 313 + 13 E_{λ^{(t)}}[Z_i] )

where

Z_i ~ P(λ^{(t)}).
180
Monte Carlo Statistical Methods: The Two Stage Gibbs Sampler [181]
Poisson EM Algorithm R code
nsim<-50;lam<-array(313/360,dim=c(nsim,1));ybar<-array(4,dim=c(nsim,1))
#Use m values for the mean
m<-25;y<-array(0,dim=c(m,1));
for (j in 2:nsim) {
for(i in 1:m){while(y[i] < 4) y[i] <- rpois(1,lam[j -1])};
ybar[j]<-mean(y);
lam[j]<-(313+13*ybar[j])/360;
y<-y*0;
}
par(mfrow=c(1,2))
hist(ybar,col="green",breaks=10)
plot(lam,col="red",type="l",ylim=c(.95,1.05))
181
Monte Carlo Statistical Methods: The Two Stage Gibbs Sampler [182]
Poisson EM Output
[Figure: histogram of the simulated ȳ values (around 4.0 to 4.4) and the EM sequence for λ over 50 iterations, settling near 1.]
182
Monte Carlo Statistical Methods: The Two Stage Gibbs Sampler [183]
Section 9.4: The EMGibbs Connection
There is a general EM/Gibbs relationship

X ~ g(x|θ) is the observed data

(X, Z) ~ f(x, z|θ) is the augmented data

We have the complete-data and incomplete-data likelihoods

L_c(θ | x, z) = f(x, z|θ)   and   L(θ | x) = g(x|θ),

with the missing data density

k(z | x, θ) = L_c(θ | x, z) / L(θ | x).
183
Monte Carlo Statistical Methods: The Two Stage Gibbs Sampler [184]
The EMGibbs Connection
If we can normalize the complete-data likelihood in θ, that is, if

∫ L_c(θ | x, z) dθ < ∞,

define

L*(θ | x, z) = L_c(θ | x, z) / ∫ L_c(θ | x, z) dθ

and create the two-stage Gibbs sampler:

1. z | θ ~ k(z | x, θ)
2. θ | z ~ L*(θ | x, z)

Note the direct connection to an EM algorithm based on L_c and k.
184
Monte Carlo Statistical Methods: The Two Stage Gibbs Sampler [185]
Remember Genetic Linkage
The observed likelihood function is proportional to

( 1/2 + θ/4 )^{x_1} [ (1/4)(1 − θ) ]^{x_2+x_3} ( θ/4 )^{x_4}  ∝  (2 + θ)^{x_1} (1 − θ)^{x_2+x_3} θ^{x_4},

and the complete-data likelihood function is

( 1/2 )^{z_1} ( θ/4 )^{z_2} [ (1/4)(1 − θ) ]^{x_2+x_3} ( θ/4 )^{x_4}  ∝  θ^{z_2+x_4} (1 − θ)^{x_2+x_3}.

The missing data density is

missing data density = complete-data likelihood function / observed likelihood function.
185
Monte Carlo Statistical Methods: The Two Stage Gibbs Sampler [186]
Genetic Linkage
To Gibbs sample this (with a uniform prior on θ) use

θ | x, z ∝ θ^{z_2+x_4} (1 − θ)^{x_2+x_3}  =  Beta( z_2 + x_4 + 1, x_2 + x_3 + 1 )

z_2 | x, θ ~ Binomial( x_1, θ/(2 + θ) )
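Put together, this two-step sampler is only a few lines of R; the count vector below is an assumed, illustrative data set (x1, x2, x3, x4).

# minimal sketch: Gibbs sampler for the genetic linkage model, uniform prior on theta
x <- c(125, 18, 20, 34)                      # assumed illustrative counts
nsim <- 5000; theta <- numeric(nsim); theta[1] <- 0.5
for (t in 2:nsim) {
  z2       <- rbinom(1, x[1], theta[t-1]/(2 + theta[t-1]))
  theta[t] <- rbeta(1, z2 + x[4] + 1, x[2] + x[3] + 1)
}
mean(theta)                                  # posterior mean of theta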
186
Monte Carlo Statistical Methods: The Two Stage Gibbs Sampler [187]
Example 9.21: Censored Data Gibbs
For the censored data example, the distribution of the missing data is

Z_i ~ φ(z − θ) / [ 1 − Φ(a − θ) ]

and the distribution of θ | x, z is

L(θ | x, z) ∝ Π_{i=1}^m e^{−(x_i − θ)²/2} Π_{i=m+1}^n e^{−(z_i − θ)²/2},

which corresponds to a

N( [ m x̄ + (n − m) z̄ ] / n ,  1/n )

distribution, and so we immediately have that L* exists and that we can run a Gibbs sampler
R program censoredGibbs
Generate Z with Accept-Reject
187
Monte Carlo Statistical Methods: The Two Stage Gibbs Sampler [188]
Example 9.21: Censored Data Gibbs
[Figure: running mean of θ over 10,000 Gibbs iterations (around 3.6 to 3.8) and histogram of the θ draws (roughly 3.0 to 4.2).]
188
Monte Carlo Statistical Methods: The Two Stage Gibbs Sampler [189]
Examples 5.18 and 9.22: Cellular Phone Plans
It is typical for cellular phone companies to offer plans of options, bundling together four or five options

One cellular company had offered a four-option plan in some areas, and a five-option plan (which included the four, plus one more) in another area

In each area, customers were asked to choose their favorite option, and the results were tabulated. In some areas they chose their favorite from four plans, and in some areas from five plans.
The phone company is interested in knowing which are the popular plans,
to help them set future prices.
189
Monte Carlo Statistical Methods: The Two Stage Gibbs Sampler [190]
Cellular Phone: The Data
Cellular phone plan preferences in 37 areas: Data are the number of customers who chose the particular plan as their favorite.
Plan Plan
1 2 3 4 5 1 2 3 4 5
1 26 63 20 0 20 56 18 29 5
2 31 14 16 51 21 27 53 10 0
3 41 28 34 10 22 47 29 4 11
4 27 29 19 25 23 43 66 6 1
5 26 48 41 0 24 14 30 23 23 6
6 30 45 12 14 25 4 24 24 32 7
7 53 39 12 11 26 11 30 22 23 8
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
190
Monte Carlo Statistical Methods: The Two Stage Gibbs Sampler [191]
Cellular Phones: EM - 1
We can model the complete data as follows. In area i, there are n_i customers, each of whom chooses their favorite plan from Plans 1–5.

The observation for customer i is

Y_i = (Y_{i1}, . . . , Y_{i5}),   where Y_i ~ M(1, (p_1, p_2, . . . , p_5)).

If we assume the customers are independent, in area i the data are

T_i = (T_{i1}, . . . , T_{i5}) = Σ_{j=1}^{n_i} Y_j ~ M(n_i, (p_1, p_2, . . . , p_5))
191
Monte Carlo Statistical Methods: The Two Stage Gibbs Sampler [192]
Cellular Phones: EM - 2
If the first m observations have the Y_{i5} missing, denote the missing data by z_i and then we have the complete data likelihood

L(p | T, z) = Π_{i=1}^m ( n_i + z_i choose T_{i1}, . . . , T_{i4}, z_i ) p_1^{T_{i1}} · · · p_4^{T_{i4}} p_5^{z_i}
              × Π_{i=m+1}^n ( n_i choose T_{i1}, . . . , T_{i5} ) Π_{j=1}^5 p_j^{T_{ij}}

where

p = (p_1, p_2, . . . , p_5),
T = (T_1, T_2, . . . , T_5),
z = (z_1, z_2, . . . , z_m), and

( n choose n_1, n_2, . . . , n_k ) is the multinomial coefficient n! / (n_1! n_2! · · · n_k!).
192
Monte Carlo Statistical Methods: The Two Stage Gibbs Sampler [193]
Cellular Phones: EM - 3
The observed data likelihood can be calculated as

L(p | T) = Σ_z L(p | T, z)

leading to the missing data distribution

k(z | T, p) = Π_{i=1}^m ( n_i + z_i choose z_i ) p_5^{z_i} (1 − p_5)^{n_i + 1},
a product of negative binomial distributions.
193
Monte Carlo Statistical Methods: The Two Stage Gibbs Sampler [194]
Cellular Phones: EM - 4
Define

W_j = Σ_{i=1}^n T_{ij} for j = 1, . . . , 4, and

W_5 = Σ_{i=1}^m T_{i5} for j = 5.

The expected complete data log likelihood is

Σ_{j=1}^4 W_j log p_j + [ W_5 + Σ_{i=1}^m E(Z_i | p′) ] log( 1 − p_1 − p_2 − p_3 − p_4 ).
194
Monte Carlo Statistical Methods: The Two Stage Gibbs Sampler [195]
Cellular Phones: EM -5
The expected complete data log likelihood is

Σ_{j=1}^4 W_j log p_j + [ W_5 + Σ_{i=1}^m E(Z_i | p′) ] log( 1 − p_1 − p_2 − p_3 − p_4 ),

leading to the EM iterations

E(Z_i | p^{(t)}) = (n_i + 1) p_5^{(t)} / (1 − p_5^{(t)}),
p_j^{(t+1)} = W_j / [ Σ_{i=1}^m E(Z_i | p^{(t)}) + Σ_{j′=1}^5 W_{j′} ]   for j = 1, . . . , 4.
195
Monte Carlo Statistical Methods: The Two Stage Gibbs Sampler [196]
Cellular Phones: EM
The MLE of p is (0.273, 0.329, 0.148, 0.125, 0.125); convergence is very
rapid.
EM sequence for cellular phone data, 25 iterations
[Figure: EM sequences for the five components of p over 25 iterations; each settles quickly between roughly 0.10 and 0.35.]
196
Monte Carlo Statistical Methods: The Two Stage Gibbs Sampler [197]
Cellular phone Gibbs
Now we use the Gibbs sampler to get our solution. From the complete data likelihood and the missing data distribution we have

p | W_1, W_2, . . . , W_5, Σ_i Z_i ~ D( W_1 + 1, W_2 + 1, . . . , W_5 + Σ_i Z_i + 1 )

Σ_i Z_i ~ Neg( Σ_{i=1}^m n_i + m, 1 − p_5 ).

The point estimates agree with those of the EM algorithm, p = (0.258, 0.313, 0.140, 0.118, 0.170), with the exception of p_5, which is larger than the MLE.
197
Monte Carlo Statistical Methods: The Two Stage Gibbs Sampler [198]
Cellular phone Gibbs
Gibbs output for cellular phone data, 5000 iterations
[Figure: posterior densities of p_1 through p_5 and the corresponding running-mean plots over the 5000 iterations.]
198
Monte Carlo Statistical Methods: The Two Stage Gibbs Sampler [199]
Section 9.1.4: The Hammersley–Clifford Theorem

A most surprising feature of the Gibbs sampler is that the conditional distributions contain sufficient information to produce a sample from the joint distribution.
This is the case for both two-stage and multi-stage Gibbs
The full conditional distributions perfectly summarize the joint density, although the set of marginal distributions obviously fails to do so
199
Monte Carlo Statistical Methods: The Two Stage Gibbs Sampler [200]
The Hammersley–Clifford Theorem

The following result then shows that the joint density can be directly and constructively derived from the conditional densities.

Theorem: The joint distribution associated with the conditional densities f_{Y|X}(y|x) and f_{X|Y}(x|y) has the joint density

f(x, y) = f_{Y|X}(y|x) / ∫ [ f_{Y|X}(y|x) / f_{X|Y}(x|y) ] dy .
Note that the joint is written using conditionals
200
Monte Carlo Statistical Methods: The Two Stage Gibbs Sampler [201]
The Hammersley–Clifford Theorem Proof

f(x, y) = f(x|y) f(y) = f(y|x) f(x), so

f(y)/f(x) = f(y|x)/f(x|y),   and

∫ [ f(y)/f(x) ] dy = 1/f(x) = ∫ [ f(y|x)/f(x|y) ] dy

So the marginal is written only with conditionals and

f(x, y) = f(y|x) f(x) = f(y|x) / ∫ [ f(y|x)/f(x|y) ] dy
201
Monte Carlo Statistical Methods: The Multi-Stage Gibbs Sampler [202]
The Multi-Stage Gibbs Sampler
Suppose that for some p > 1, the random variable X can be written as X = (X_1, . . . , X_p), where the X_i's are either uni- or multidimensional.

Moreover, suppose that we can simulate from the corresponding univariate conditional densities f_1, . . . , f_p, that is, we can simulate

X_i | x_1, x_2, . . . , x_{i−1}, x_{i+1}, . . . , x_p ~ f_i(x_i | x_1, x_2, . . . , x_{i−1}, x_{i+1}, . . . , x_p)

for i = 1, 2, . . . , p.
202
Monte Carlo Statistical Methods: The Multi-Stage Gibbs Sampler [203]
The Multi-Stage Gibbs Sampler
Given x^{(t)} = (x_1^{(t)}, . . . , x_p^{(t)}), generate

1. X_1^{(t+1)} ~ f_1(x_1 | x_2^{(t)}, . . . , x_p^{(t)});

2. X_2^{(t+1)} ~ f_2(x_2 | x_1^{(t+1)}, x_3^{(t)}, . . . , x_p^{(t)}),
   ...

p. X_p^{(t+1)} ~ f_p(x_p | x_1^{(t+1)}, . . . , x_{p−1}^{(t+1)}).

The densities f_1, . . . , f_p are called the full conditionals

These are the only densities used for simulation, even in a high-dimensional problem.
203
Monte Carlo Statistical Methods: The Multi-Stage Gibbs Sampler [204]
Hierarchical Models - Introduction
A hierarchical model is of the form

X ~ f(x|θ)
θ ~ g(θ|β)
β ~ h(β|λ)
λ ~ k(λ)

All hyperparameters specified at deepest level

Effect of deeper hyperparameters is lower

Easy to get joint distribution

Easy to pick off full conditionals
204
Monte Carlo Statistical Methods: The Multi-Stage Gibbs Sampler [205]
Hierarchical Models - Introduction - 2
Hierarchical Model

X ~ f(x|θ)
θ ~ π(θ|β)
β ~ π(β|λ)
λ ~ π(λ)

Joint distribution

f(x|θ) π(θ|β) π(β|λ) π(λ)

Full Conditionals

π(θ | x, β, λ) ∝ terms in joint involving θ

etc...
205
Monte Carlo Statistical Methods: The Multi-Stage Gibbs Sampler [206]
Hierarchical Models - Introduction - 3
Normal Hierarchical Model (Conjugate)

X ~ N(θ, σ²)
θ ~ N(θ_0, τ²σ²)
σ² ~ Inverted Gamma(a, b)

Here θ_0, τ², a, b are specified

Usual to take τ² = 10 (variance ratio)

Choose a, b to give prior a big variance
206
Monte Carlo Statistical Methods: The Multi-Stage Gibbs Sampler [207]
Normal Hierarchical Models
Normal Hierarchical Model

X_i ~ N(θ, σ²), i = 1, . . . , n
θ ~ N(θ_0, τ²σ²)
σ² ~ Inverted Gamma(a, b)

Joint Distribution

f(x, θ, σ²) ∝ [ σ^{−n} e^{−Σ_i (x_i − θ)²/(2σ²)} ] × [ (τσ)^{−1} e^{−(θ − θ_0)²/(2τ²σ²)} ] × [ (σ²)^{−(a+1)} e^{−1/(bσ²)} ]
207
Monte Carlo Statistical Methods: The Multi-Stage Gibbs Sampler [208]
Normal Hierarchical Models -2
Joint Distribution

f(x, θ, σ²) ∝ [ σ^{−n} e^{−Σ_i (x_i − θ)²/(2σ²)} ] × [ (τσ)^{−1} e^{−(θ − θ_0)²/(2τ²σ²)} ] × [ (σ²)^{−(a+1)} e^{−1/(bσ²)} ]

θ full conditional (keep the terms involving θ)

π(θ | x, σ²) ∝ e^{−Σ_i (x_i − θ)²/(2σ²)} × e^{−(θ − θ_0)²/(2τ²σ²)}  =  Normal

σ² full conditional (keep the terms involving σ²)

π(σ² | x, θ) ∝ [ σ^{−n} e^{−Σ_i (x_i − θ)²/(2σ²)} ] × [ (τσ)^{−1} e^{−(θ − θ_0)²/(2τ²σ²)} ] × [ (σ²)^{−(a+1)} e^{−1/(bσ²)} ]  =  Inverted Gamma
208
Monte Carlo Statistical Methods: The Multi-Stage Gibbs Sampler [209]
Normal Hierarchical Models -3
To estimate θ and σ²

X_i ~ N(θ, σ²), i = 1, . . . , n
θ ~ N(θ_0, τ²σ²)
σ² ~ Inverted Gamma(a, b)

Use a Gibbs sampler with

θ ~ N( θ_0/(1 + nτ²) + nτ² x̄/(1 + nτ²) ,  σ²τ²/(1 + nτ²) )

1/σ² ~ Gamma( (n + 1)/2 + a ,  1/[ Σ_i (x_i − θ)²/2 + (θ − θ_0)²/(2τ²) + 1/b ] )
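A sketch of this Gibbs sampler in R follows; the data and hyperparameter values are illustrative assumptions (the course uses the energy intake data and its own NormalHierarchy programs).

# minimal sketch: Gibbs sampler for the conjugate normal hierarchy
set.seed(1)
x <- rnorm(16, mean = 7, sd = 1)             # assumed illustrative data
n <- length(x); theta0 <- 6; tau2 <- 10; a <- 1; b <- 1
nsim <- 5000; theta <- sigma2 <- numeric(nsim)
theta[1] <- mean(x); sigma2[1] <- var(x)
for (t in 2:nsim) {
  theta[t]  <- rnorm(1, (theta0 + n*tau2*mean(x))/(1 + n*tau2),
                     sqrt(sigma2[t-1]*tau2/(1 + n*tau2)))
  B         <- sum((x - theta[t])^2)/2 + (theta[t] - theta0)^2/(2*tau2) + 1/b
  sigma2[t] <- 1/rgamma(1, shape = (n + 1)/2 + a, rate = B)   # rate = 1/scale
}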
209
Monte Carlo Statistical Methods: The Multi-Stage Gibbs Sampler [210]
Example
Energy Intake (Megajoules) over 24 hours, 15 year old females
91 504 557 609 693 727 764 803
857 929 970 1043 1089 1195 1384 1713
210
Monte Carlo Statistical Methods: The Multi-Stage Gibbs Sampler [211]
Example
Energy Intake (Megajoules) over 24 hours, 15 year old females
R program NormalHierarchy-1
[Figure: histograms of the simulated theta values (roughly 6.0 to 7.5) and sigma2 values (roughly 0.5 to 2.5).]
211
Monte Carlo Statistical Methods: The Multi-Stage Gibbs Sampler [212]
Normal Hierarchical Models -3a
To avoid specifying θ_0, use the hierarchy

X_i ~ N(θ, σ²), i = 1, . . . , n
θ ~ Uniform(−∞, ∞)
σ² ~ Inverted Gamma(a, b)

which gives a Gibbs sampler with

θ ~ N( x̄ ,  σ²/n )

1/σ² ~ Gamma( n/2 + a ,  1/[ Σ_i (x_i − θ)²/2 + 1/b ] )
212
Monte Carlo Statistical Methods: The Multi-Stage Gibbs Sampler [213]
Example
Energy Intake (Megajoules) over 24 hours, 15 year old females
R program NormalHierarchy-2
[Figure: histograms of the simulated theta and sigma2 values under the flat prior on theta.]
213
Monte Carlo Statistical Methods: The Multi-Stage Gibbs Sampler [214]
Normal Hierarchical Models -4
A bit more complicated - oneway anova: Y_{ij} = μ + α_i + ε_{ij}

A full hierarchical specification

Y_{ij} ~ N(μ + α_i, σ²), i = 1, . . . , k, j = 1, . . . , n_i
μ ~ Uniform(−∞, ∞)
α_i ~ N(0, τ²), i = 1, . . . , k
σ² ~ Inverted Gamma(a_1, b_1)
τ² ~ Inverted Gamma(a_2, b_2)
214
Monte Carlo Statistical Methods: The Multi-Stage Gibbs Sampler [215]
Normal Hierarchical Models -4a
Oneway anova: Y_{ij} = μ + α_i + ε_{ij} with Gibbs sampler

μ ~ N( ȳ ,  σ²/Σ_i n_i )

α_i ~ N( [ n_i τ²/(σ² + n_i τ²) ] ( ȳ_i − μ ) ,  σ²τ²/(σ² + n_i τ²) )

1/σ² ~ Gamma( Σ_i n_i/2 + a_1 ,  1/[ Σ_{ij} (y_{ij} − μ − α_i)²/2 + 1/b_1 ] )

1/τ² ~ Gamma( k/2 + a_2 ,  1/[ Σ_i α_i²/2 + 1/b_2 ] )
215
Monte Carlo Statistical Methods: The Multi-Stage Gibbs Sampler [216]
Example
Energy Intake (Megajoules) over 24 hours, 15 year old females and 15
year old males
R program NormalHierarchy-3
[Figure: histograms of the simulated mu, tau2, and sigma2 values for the two-group energy intake data.]
216
Monte Carlo Statistical Methods: The Multi-Stage Gibbs Sampler [217]
Age Distribution of Chinook Salmon - 1
Chinook salmon spawn in fresh water and the juveniles hatch and swim
out to sea
They return to their natal stream to spawn 3 to 7 years later.
Fish of multiple ages return to the stream
We want estimates of the age composition
Take scales from a sample of fish and count the annuli.
This is time-consuming and expensive
Use length as a proxy for age - easier and faster to obtain
Now we will use both length and age.
217
Monte Carlo Statistical Methods: The Multi-Stage Gibbs Sampler [218]
Age Distribution of Chinook Salmon - 2
Observe (y_i, x_i), y_i = Age, x_i = length, where

f(y_i, x_i | p, μ, σ) ∝ Π_{j=3}^7 p_j^{I(y_i = j)} × (1/σ_{y_i}) exp{ −(x_i − μ_{y_i})² / (2σ²_{y_i}) }

And we can write the full likelihood as

L(p, μ, σ | y, x) ∝ Π_{j=3}^7 p_j^{n_j} (1/σ_j^{n_j}) exp{ −[ n_j s_j² + n_j ( x̄_j − μ_j )² ] / (2σ_j²) }

n_j = #{ y_i = j },   Σ_j n_j = n

x̄_j = (1/n_j) Σ_{i: y_i = j} x_i

s_j² = (1/n_j) Σ_{i: y_i = j} (x_i − x̄_j)²
218
Monte Carlo Statistical Methods: The Multi-Stage Gibbs Sampler [219]
Age Distribution of Chinook Salmon -3
With no missing y_i the likelihood factors

L(p, μ, σ | y, x) ∝ [ Π_{j=3}^7 p_j^{n_j} ] [ Π_{j=3}^7 (1/σ_j^{n_j}) exp{ −[ n_j s_j² + n_j ( x̄_j − μ_j )² ] / (2σ_j²) } ]

p̂_j = n_j / Σ_j n_j     μ̂_j = x̄_j     σ̂_j² = s_j²
219
Monte Carlo Statistical Methods: The Multi-Stage Gibbs Sampler [220]
Age Distribution of Chinook Salmon - 4
A Bayesian Analysis

Prior specifications

p ~ Dirichlet(α_3, . . . , α_7)
μ_j ~ Normal(μ_{j0}, τ_j²)
σ_j² ~ Inverted Gamma(a, b)

Full conditionals for a Gibbs Sampler

p ~ Dirichlet(n_3 + α_3, . . . , n_7 + α_7)

μ_j ~ Normal( [ n_j τ_j²/(n_j τ_j² + σ_j²) ] x̄_j + [ σ_j²/(n_j τ_j² + σ_j²) ] μ_{j0} ,  σ_j² τ_j²/(n_j τ_j² + σ_j²) )

σ_j² ~ Inverted Gamma( n_j/2 + a ,  n_j s_j²/2 + 1/b )

Notice that p only depends on n_j.
220
Monte Carlo Statistical Methods: The Multi-Stage Gibbs Sampler [221]
Age Distribution of Chinook Salmon - 5
With missing y_i things are more interesting

Write n = n_obs + n_m = Observed + Missing

y = y_obs + y_m = Observed + Missing

Now the likelihood is

L(p, μ, σ | y, x) ∝ Π_{j=3}^7 p_j^{n_j} (1/σ_j^{n_j}) exp{ −[ n_j s_j² + n_j ( x̄_j − μ_j )² ] / (2σ_j²) }
                    × Π_{i=1}^{n_m} Π_{j=3}^7 [ p_j (1/σ_j) exp{ −(x_i − μ_j)²/(2σ_j²) } ]^{I(y_{mi} = j)}

where n_j, x̄_j, s_j² are defined for the observed data.
221
Monte Carlo Statistical Methods: The Multi-Stage Gibbs Sampler [222]
Age Distribution of Chinook Salmon - 7
The Gibbs sampler fills in the missing Age data

Y_{mi} ~ Multinomial with probabilities ∝ p_j (1/σ_j) exp{ −(x_i − μ_j)²/(2σ_j²) }

Then updates the parameters

p ~ Dirichlet(n_3 + α_3, . . . , n_7 + α_7)

μ_j ~ Normal( [ n_j τ_j²/(n_j τ_j² + σ_j²) ] x̄_j + [ σ_j²/(n_j τ_j² + σ_j²) ] μ_{j0} ,  σ_j² τ_j²/(n_j τ_j² + σ_j²) )

σ_j² ~ Inverted Gamma( n_j/2 + a ,  n_j s_j²/2 + 1/b )

where n_j, x̄_j and s_j² are recalculated for each new Y_m.
222
Monte Carlo Statistical Methods: The Multi-Stage Gibbs Sampler [223]
Age Distribution of Chinook Salmon - 8
[Figure: posterior histograms of the age proportions p_3 through p_7 and the corresponding running-mean plots over 10,000 iterations.]
223
Monte Carlo Statistical Methods: The Multi-Stage Gibbs Sampler [224]
A lazy hierarchical specification

Y_{ij} ~ N(μ + α_i, σ²), i = 1, . . . , k, j = 1, . . . , n_i
α_i ~ N(0, τ²), i = 1, . . . , k

The classical random effects model
We can set up a Gibbs sampler
224
Monte Carlo Statistical Methods: The Multi-Stage Gibbs Sampler [225]
Random Effects Model

Y_{ij} ~ N(μ + α_i, σ²), i = 1, . . . , k, j = 1, . . . , n_i
α_i ~ N(0, τ²), i = 1, . . . , k

with Gibbs sampler

α_i ~ N( [ n_i τ²/(n_i τ² + σ²) ] ( ȳ_i − μ ) ,  σ²τ²/(n_i τ² + σ²) )

μ ~ N( ȳ ,  σ²/Σ_i n_i )

1/σ² ~ Gamma( Σ_i n_i/2 − 1 ,  2/Σ_{ij} (y_{ij} − μ − α_i)² )

1/τ² ~ Gamma( k/2 − 1 ,  2/Σ_i α_i² )
225
Monte Carlo Statistical Methods: The Multi-Stage Gibbs Sampler [226]
Problem!!
This is not a Gibbs sampler
Conditional distributions do not exist!
Result of using improper priors
Improper priors sometimes OK
Sometimes: bad conditionals
Sometimes: good conditionals, bad posterior REAL BAD
Extremely hard to detect
Moral: Best to use proper priors
226
Monte Carlo Statistical Methods: The Multi-Stage Gibbs Sampler [227]
It's better to be lucky than good

Looking for simple example (Am. Statistician 1992)

X | Y = y ~ y e^{−yx} ,   Y | X = x ~ x e^{−xy}
227
Monte Carlo Statistical Methods: The Multi-Stage Gibbs Sampler [228]
It's better to be lucky than good

Looking for simple example (Am. Statistician 1992)

X | Y = y ~ y e^{−yx} ,   Y | X = x ~ x e^{−xy}

This is not a Gibbs sampler

No joint distribution exists!

Hammersley–Clifford:

f(x, y) = e^{−xy} / ∫_0^∞ (1/y) dy
228
Monte Carlo Statistical Methods: The Multi-Stage Gibbs Sampler [229]
Hierarchical Models: Animal epidemiology
Research in animal epidemiology sometimes uses data from groups of
animals, such as litters or herds.
Such data may not follow some of the usual assumptions of independence,
etc., and, as a result, variances of parameter estimates tend to be larger
(overdispersion)
Data on the number of cases of clinical mastitis in dairy cattle herds over
a one year period.
229
Monte Carlo Statistical Methods: The Multi-Stage Gibbs Sampler [230]
Hierarchical Models: Animal epidemiology
X_i ~ P(λ_i), where λ_i is the underlying rate of infection in herd i

To account for overdispersion, put a gamma prior distribution on the Poisson parameter. A complete hierarchical specification is

X_i ~ P(λ_i),
λ_i ~ Ga(α, β_i),
β_i ~ Ga(a, b),

where α, a, and b are specified.

The posterior density of λ_i, π(λ_i | x, α), can now be simulated via the Gibbs sampler

λ_i ~ π(λ_i | x, α, β_i) = Ga(x_i + α, 1 + β_i),
β_i ~ π(β_i | x, α, a, b, λ_i) = Ga(α + a, λ_i + b).
230
Monte Carlo Statistical Methods: The Multi-Stage Gibbs Sampler [231]
Animal Epidemiology R code
xdata <-c(0,0,1,1,2,2,2,2,2,2,4,4,4,5,5,5,5,5,5,6,6,8,8,8,9,9,9,
10,10,12,12,13,13,13,13,18,18,19,19,19,19,20,20,22,22,22,23,25)
nx<-length(xdata)
nsim<-1000;
lambda<-array(2,dim=c(nsim,nx));beta<-array(5,dim=c(nsim,nx));
alpha<-.1;a<-1;b<-1;
for(i in 2:nsim){
for(j in 1:nx){
beta[i,j]<-1/rgamma(1,shape=alpha+a,scale=1/(lambda[i-1,j]+(1/b)));
lambda[i,j]<-rgamma(1,shape=xdata[j]+alpha,scale=1/(1+(1/beta[i,j])))
}
}
231
Monte Carlo Statistical Methods: The Multi-Stage Gibbs Sampler [232]
Gibbs sampler output
Selected estimates of λ_i and β_i.

[Figure: posterior densities of λ for herds 5 and 15 and of β for herd 15, with the corresponding running-mean plots.]
232
Monte Carlo Statistical Methods: The Multi-Stage Gibbs Sampler [233]
Prediction - Introduction
For the simple model

X ~ f(x|θ)
θ ~ g(θ)

The predictive density of a new X is

π(x_new | x_old) = ∫ f(x_new | θ) π(θ | x_old) dθ

π(θ | x_old) is the posterior density

Averages over the parameter values

If θ_1, . . . , θ_M ~ π(θ | x_old)

π(x_new | x_old) ≈ (1/M) Σ_i f(x_new | θ_i)
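As a small illustration of this Monte Carlo sum (not a course example), suppose the model is X ~ N(θ, 1) and posterior draws of θ are already available; the predictive density at any new point is then just an average of normal densities.

# minimal sketch: Monte Carlo predictive density for a N(theta, 1) model
theta.post <- rnorm(2000, mean = 3, sd = 0.2)        # assumed posterior draws
pred <- function(xnew) mean(dnorm(xnew, theta.post, 1))
pred(3.5)                                             # predictive density at x_new = 3.5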
233
Monte Carlo Statistical Methods: The Multi-Stage Gibbs Sampler [234]
Prediction - Introduction -2
For the hierarchical model

X ~ f(x|θ)
θ ~ g(θ|β)
β ~ h(β|λ)
λ ~ k(λ)

the Gibbs sampler gives us (θ_i, β_i, λ_i), i = 1, . . . , M

A sample from the joint distribution.

Using Monte Carlo sums

π(x_new | x_old) ≈ (1/M) Σ_i f(x_new | θ_i)
A Conditionally Independent Hierarchical Model
234
Monte Carlo Statistical Methods: The Multi-Stage Gibbs Sampler [235]
Oneway Anova Predictive Density
Energy Intake - oneway anova: Y_{ij} = μ + α_i + ε_{ij}

A full hierarchical specification

Y_{ij} ~ N(μ + α_i, σ²), i = 1, . . . , k, j = 1, . . . , n_i
μ ~ Uniform(−∞, ∞)
α_i ~ N(0, τ²), i = 1, . . . , k
σ² ~ Inverted Gamma(a_1, b_1)
τ² ~ Inverted Gamma(a_2, b_2)

Predictive Density for Group i

π(y_new | y) = (1/M) Σ_{j=1}^M (1/√(2πσ_j²)) e^{−0.5 (y_new − μ_j − α_{ij})² / σ_j²}

where (μ_j, α_{ij}, σ_j²) are a sample from the posterior distribution.
235
Monte Carlo Statistical Methods: The Multi-Stage Gibbs Sampler [236]
Energy Intake - Predictive density for females
R program NormalPrediction-3
[Figure: the two densities plotted over roughly 5 to 9 MJ.]
solid = naive prediction dashed = predictive density
236
Monte Carlo Statistical Methods: The Multi-Stage Gibbs Sampler [237]
PKPD Medical Models
Pharmacokinetics is the modeling of the relationship between the dosage of a drug and the resulting concentration in the blood.

Gilks et al. (1993) approach:

Estimate pharmacokinetic parameters using a mixed-effects model and nonlinear structure

Also robust to the outliers common to clinical trials

For a given dose d_i administered at time 0 to patient i, the measured log concentration in the blood at time t_{ij}, X_{ij}, is assumed to follow a normal distribution

X_{ij} ~ N( log g_{ij}(λ_i), σ² ),
237
Monte Carlo Statistical Methods: The Multi-Stage Gibbs Sampler [238]
PKPD Medical Models
X_{ij} ~ N( log g_{ij}(λ_i), σ² ),

λ_i = (log C_i, log V_i)′ are parameters for the ith individual, σ² is the measurement error variance, and g_{ij} is given by

g_{ij}(λ_i) = (d_i / V_i) exp{ −C_i t_{ij} / V_i } .

C_i represents clearance

V_i represents volume for patient i.

We complete the hierarchical specification with

log C_i ~ N(μ_C, σ_C²)   and   log V_i ~ N(μ_V, σ_V²),

with μ_C, σ_C², μ_V, σ_V² fixed.
238
Monte Carlo Statistical Methods: The Multi-Stage Gibbs Sampler [239]
PKPD Medical Models
The posterior density is proportional to

π(C_i, V_i) ∝ Π_j exp{ −(x_{ij} − log g_{ij})² / (2σ²) } × exp{ −(log C_i − μ_C)² / (2σ_C²) } × exp{ −(log V_i − μ_V)² / (2σ_V²) },

The full conditional of C_i is

π(C_i) ∝ Π_j exp{ −(x_{ij} − log g_{ij})² / (2σ²) } × exp{ −(log C_i − μ_C)² / (2σ_C²) }

Note that to get the full conditional, we pick off all terms with C_i.
239
Monte Carlo Statistical Methods: The Multi-Stage Gibbs Sampler [240]
PKPD Medical Models
The full conditional of C_i is

π(C_i) ∝ Π_j exp{ −(x_{ij} − log g_{ij})² / (2σ²) } × exp{ −(log C_i − μ_C)² / (2σ_C²) }

We can write this as

π(C_i) ∝ exp{ −(C_i − V_i B/A)² / (2 V_i² σ²) } × exp{ −(log C_i − μ_C)² / (2σ_C²) }

with A = Σ_j t_{ij}² and B = Σ_j t_{ij} ( X_{ij} + log(d_i / V_i) ).

Sampling from this is a challenge
240
Monte Carlo Statistical Methods: The Multi-Stage Gibbs Sampler [241]
PKPD Medical Models
The full conditional is

π(C_i) ∝ exp{ −(C_i − V_i B/A)² / (2 V_i² σ²) } × exp{ −(log C_i − μ_C)² / (2σ_C²) }

Some options - use Metropolis

Candidate is

N( V_i B/A ,  V_i² σ² )

Use Taylor: log C_i ≈ log μ_C + (C_i − μ_C)/μ_C to get candidate

N( [ μ_C² σ_C² (V_i B/A) + σ² V_i² μ_C ] / [ μ_C² σ_C² + σ² V_i² ] ,  [ μ_C² σ_C² σ² V_i² ] / [ μ_C² σ_C² + σ² V_i² ] )

V_i is even harder
241
Monte Carlo Statistical Methods: The Multi-Stage Gibbs Sampler [242]
PKPD Medical Models
Plan B: Use WinBugs
Uses Metropolis with Adaptive Rejection Sampling
But.... Let's start simple with WinBugs
242
Monte Carlo Statistical Methods: The Multi-Stage Gibbs Sampler [243]
Specifying Models with WinBugs
There are three steps to producing an MCMC model in WinBugs:

Specify the distributional features of the model, and the quantities to be estimated.

Compile the instructions into the run-time program.

Run the sampler which produces Markov chains.

Remember that the first step must identify the full distributions for each variable in the model.
243
Monte Carlo Statistical Methods: The Multi-Stage Gibbs Sampler [244]
WinBugs - Starting Simple
Normal Hierarchical Model (Conjugate)

X ~ N(θ, σ²)
θ ~ N(θ_0, τ²σ²)
σ² ~ Inverted Gamma(a, b)

Here θ_0, τ², a, b are specified

Each variable must be specified, or have a distribution

NO improper priors allowed
244
Monte Carlo Statistical Methods: The Multi-Stage Gibbs Sampler [245]
To Run WinBugs
Model
Specification Tool: Highlight and Check Model
Data: Highlight and Load Data
Compile
Inits: Highlight and Load
Inference
Sample Monitor Tool: enter nodes (parameters) stats, trace density
Model Update Tool
245
Monte Carlo Statistical Methods: The Multi-Stage Gibbs Sampler [246]
To Run WinBugs - 2
X ~ N(θ, σ²)
θ ~ N(θ_0, τ²σ²)
σ² ~ Inverted Gamma(a, b)
model
{ for( i in 1 : N )
{
X[i] ~ dnorm(theta,sigma2)
}
theta ~ dnorm(theta0,v)
v <- tau2*sigma2
sigma2 ~ dgamma(1,1)
theta0 <- 6
tau2 <- 10
}
WinBugs - Simple.odc
246
Monte Carlo Statistical Methods: The Multi-Stage Gibbs Sampler [247]
WinBugs - Another Example
Logistic Regression
Y ~ Bernoulli(p(x))

logit p(x) = α_0 + α_1 x_1 + α_2 x_2

Y = emergency room use

x_1 = health category

x_2 = health care provider
Model
{
for( i in 1 : N ) {
logit(p[i]) <- alpha0 + alpha1 * metq[i] + alpha2 * np[i]
er[i] ~ dbern(p[i])
}
alpha0 ~ dnorm(0.0,0.1)
alpha1 ~ dnorm(0.0,0.1)
alpha2 ~ dnorm(0.0,0.1)
}
WinBugs ER.odc
247
Monte Carlo Statistical Methods: The Multi-Stage Gibbs Sampler [248]
Return to PKPD Medical Models
Model

X_{ij} ~ N( log g_{ij}(λ_i), σ² )

g_{ij}(λ_i) = (d_i / V_i) exp{ −C_i t_{ij} / V_i }

λ_i = (log C_i, log V_i)′

log C_i ~ N(μ_C, σ_C²),   log V_i ~ N(μ_V, σ_V²).
Model: WinBugs PKWinBugs.odc
for( i in 1 : N ) {
for( j in 1 : T ) {
X[i , j] ~ dnorm(g[i , j],sigma)
g[i , j] <- (30/V[i]) *exp(-C[i]*t[j]/V[i])}
C[i]<-exp(LC[i]); V[i]<-exp(LV[i])
LC[i] ~ dnorm(mC,sigmaC); LV[i]~ dnorm(mV,sigmaV)}
sigma ~ dgamma(0.01,0.01)
mC ~ dnorm(0.0,1.0E-3)
sigmaC ~ dgamma(0.01,0.01)
mV ~ dnorm(0.0,1.0E-3)
sigmaV ~ dgamma(0.01,0.01)
248
Monte Carlo Statistical Methods: The Multi-Stage Gibbs Sampler [249]
PKPD Medical Models
Alternative Specification:

X_{ij} − log g_{ij}(λ_i) ~ [ σ / √(ν/(ν − 2)) ] T_ν .

This is easy for the Gibbs sampler:

T_ν(x | θ, σ²) = ∫ N( x | θ, σ²ν/w ) Gamma( w | ν/2, 1/2 ) dw

So to generate X ~ T_ν(x | θ, σ²):

X | W ~ N( x | θ, σ²ν/W )
W ~ Gamma( w | ν/2, 1/2 )

which fits right in to the Gibbs sampler
WinBugs PKWinBugs2.odc
249
Monte Carlo Statistical Methods: The Multi-Stage Gibbs Sampler [250]
PKPD Models - Prediction
Model

X_{ij} ~ N( log g_{ij}(λ_i), σ² )

g_{ij}(λ_i) = (d_i / V_i) exp{ −C_i t_{ij} / V_i }

λ_i = (log C_i, log V_i)′

log C_i ~ N(μ_C, σ_C²),   log V_i ~ N(μ_V, σ_V²).

Predictive density for individual i at time j is

π(x | x) = ∫∫ (1/√(2πσ²)) e^{−(x − log g_{ij}(λ_i))²/(2σ²)} π(λ_i, σ² | x) dλ_i dσ²

         ≈ (1/M) Σ_{k=1}^M (1/√(2πσ²(k))) e^{−[x − log g_{ij}(λ_i^{(k)})]²/(2σ²(k))}

(λ_i^{(k)}, σ²(k)), k = 1, . . . , M output from WinBugs
250
Monte Carlo Statistical Methods: The Multi-Stage Gibbs Sampler [251]
PKPD Models - Prediction for individual 1
Average over (λ_i^{(k)}, σ²(k)) for individual 1

[Figure: naive (solid) and predictive (dashed) densities for individual 1 at times 2, 6, and 10.]
solid = naive prediction dashed = predictive density
251
Monte Carlo Statistical Methods: The Multi-Stage Gibbs Sampler [252]
PKPD Models - Prediction -2
Predictive density for any individual at time j is

π(x | x) = ∫∫ (1/√(2πσ²)) e^{−(x − log g_{ij}(λ))²/(2σ²)} π(λ, σ² | x) dλ dσ²

         ≈ (1/(nM)) Σ_{i=1}^n Σ_{k=1}^M (1/√(2πσ²(k))) e^{−[x − log g_{ij}(λ_i^{(k)})]²/(2σ²(k))}

(λ_i^{(k)}, σ²(k)), k = 1, . . . , M, i = 1, . . . , n output from WinBugs
Increased variability
Takes into account variation between individuals
Out-of-sample prediction
252
Monte Carlo Statistical Methods: The Multi-Stage Gibbs Sampler [253]
PKPD Models - Prediction for new individual
Average over (λ_i^{(k)}, σ²(k)) for all individuals

[Figure: naive (solid) and predictive (dashed) densities for a new individual at times 2, 6, and 10; the predictive densities are wider.]
solid = naive prediction dashed = predictive density
253
Monte Carlo Statistical Methods: The Multi-Stage Gibbs Sampler [254]
Age Distribution of Chinook Salmon - Winbugs
Recall the model

Sampling Model

Y_i ~ Categorical(p)
X_i ~ Normal( μ_{y_i}, σ²_{y_i} )

Prior Specifications

p ~ Dirichlet(α_3, . . . , α_7)
μ_j ~ Normal(μ_{j0}, τ_j²)
σ_j² ~ Inverted Gamma(a, b)
254
Monte Carlo Statistical Methods: The Multi-Stage Gibbs Sampler [255]
Age Distribution of Chinook Salmon - Winbugs - 2
Model: WinBugs Chinook.odc
model {
#Priors on Parameters
for(a in 1:A)
{ tau20[a]<-1/sigma20[a]
mu[a]~dnorm(mu0[a],tau20[a])
tau2[a]~dgamma(3,100)
}
pi[1:A]~ddirch(alpha[1:A])
#Sampling Model
for (i in 1:nfish)
{
age[i] ~ dcat(pi[1:A])
length[i] ~ dnorm(mu[age[i]],tau2[age[i]])
}
}
255
Monte Carlo Statistical Methods: The Multi-Stage Gibbs Sampler [256]
Age Distribution of Chinook Salmon - Winbugs Estimates
256
Monte Carlo Statistical Methods: The Multi-Stage Gibbs Sampler [257]
Age Distribution of Chinook Salmon - R Estimates
[Figure: R-based posterior histograms of p_3 through p_7 and running-mean plots, matching the WinBugs estimates.]
257
Monte Carlo Statistical Methods: Diagnosing Convergence [258]
Chapter 12: Diagnosing Convergence
Convergence Criteria
There are three (increasingly stringent) types of convergence
Convergence to the Stationary Distribution
Convergence of Averages
Convergence to iid Sampling
258
Monte Carlo Statistical Methods: Diagnosing Convergence [259]
Convergence to the Stationary Distribution
Minimal requirement
Theoretically, stationarity is only achieved asymptotically
Not the major issue. Rather,
Speed of exploration of the support of f
Degree of correlation between the θ^{(t)}'s.
259
Monte Carlo Statistical Methods: Diagnosing Convergence [260]
Convergence of Averages
Convergence of the empirical average

(1/T) Σ_{t=1}^T h(θ^{(t)})  →  E_f[h(θ)]

for an arbitrary function h.
Most relevant in the implementation of MCMC
Convergence related to the mixing speed (Brooks and Roberts)
260
Monte Carlo Statistical Methods: Diagnosing Convergence [261]
Convergence to iid Sampling
How close a sample (θ_1^{(t)}, . . . , θ_n^{(t)}) is to being iid.

Can use subsampling (or batch sampling) to reduce correlation between the successive points of the Markov chain.
261
Monte Carlo Statistical Methods: Diagnosing Convergence [262]
Multiple Chains
There are methods involving one chain, and those involving multiple
chains.
By simulating several chains, variability and dependence on the initial
values are reduced
Can control convergence to the stationary distribution by comparing the estimation, using different chains, of quantities of interest.
262
Monte Carlo Statistical Methods: Diagnosing Convergence [263]
Multiple Chains - some cautions
An initial distribution which is too concentrated around a local mode of f does not contribute significantly more than a single chain to the exploration of f
Slow algorithms, like Gibbs sampling, usually favor single chains
A unique chain with MT observations and a slow rate of mixing is
more likely to get closer to the stationary distribution than M chains
of size T
263
Monte Carlo Statistical Methods: Diagnosing Convergence [264]
Overall Cautions
It is somewhat of an illusion to think we can control the flow of a Markov chain and assess its convergence behavior from a few realizations of this chain.

The heart of the difficulty is the key problem of statistics, where the uncertainty due to the observations prohibits categorical conclusions and final statements.

But... We do our best!
264
Monte Carlo Statistical Methods: Diagnosing Convergence [265]
Monitoring Convergence of Averages
Example 12.10: Beta Generator
The Markov chain (X^{(t)})

X^{(t+1)} = Y ~ Be(α + 1, 1)   with probability x^{(t)}
X^{(t+1)} = x^{(t)}            otherwise

has stationary distribution

f(x) = α x^{α−1} ,

Can generate directly

Can also use Metropolis, which accepts y with probability x^{(t)}/y

Note E_f(X) = α/(α + 1)
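A short simulation makes the slow convergence visible; the value of α and the chain length below are illustrative assumptions.

# minimal sketch of Example 12.10: the Be(alpha, 1) chain and its running mean
alpha <- 2; nsim <- 5000
x <- numeric(nsim); x[1] <- runif(1)
for (t in 2:nsim) {
  x[t] <- if (runif(1) < x[t-1]) rbeta(1, alpha + 1, 1) else x[t-1]
}
plot(cumsum(x)/(1:nsim), type = "l")   # compare with alpha/(alpha + 1) = 2/3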
265
Monte Carlo Statistical Methods: Diagnosing Convergence [266]
Beta Generator
This is a very bad chain
CLT doesn't hold
Metropolis and Direct
[Figure: running means of several chains over 2500 iterations, wandering between roughly 0.6 and 0.8 without settling.]
266
Monte Carlo Statistical Methods: Diagnosing Convergence [267]
Recall Example 1.2 : Normal Mixtures
For a mixture of two normal distributions,

p N(μ, τ²) + (1 − p) N(θ, σ²) ,

The likelihood proportional to

Π_{i=1}^n [ p (1/τ) φ( (x_i − μ)/τ ) + (1 − p) (1/σ) φ( (x_i − θ)/σ ) ]

containing 2^n terms.

Standard maximization techniques often fail to find the global maximum because of multimodality of the likelihood function.
267
Monte Carlo Statistical Methods: Diagnosing Convergence [268]
Normal Mixture/Gibbs Sampling
Two components with equal known variance and fixed weights,

p N(μ_1, σ²) + (1 − p) N(μ_2, σ²) .

N(0, cσ²) prior distribution on both means μ_1 and μ_2

Latent Variable model assumes

unobserved component indicators z_i of the observations x_i,

P(Z_i = 1) = 1 − P(Z_i = 2) = p,

and

X_i | Z_i = k ~ N(μ_k, σ²) .
268
Monte Carlo Statistical Methods: Diagnosing Convergence [269]
Normal Mixture/Gibbs Sampling-2
The conditional distributions are

μ_j ~ N( (1/(1/c + n_j)) Σ_{z_i=j} x_i ,  σ²/(1/c + n_j) ),

z given (μ_1, μ_2) is a product of binomials, with

P(Z_i = 1 | x_i, μ_1, μ_2) =
  p exp{ −(x_i − μ_1)²/(2σ²) } / [ p exp{ −(x_i − μ_1)²/(2σ²) } + (1 − p) exp{ −(x_i − μ_2)²/(2σ²) } ] .
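The sketch below (illustrative data and starting values, not the course's NormalMixtureGibbs program) implements these two steps.

# minimal sketch: two-stage Gibbs sampler for the two-component normal mixture
set.seed(1)
p <- 0.25; sigma <- 1; cc <- 100                       # known weight, variance, prior scale c
x <- c(rnorm(250, -1), rnorm(750, 1))                  # data simulated from the model
nsim <- 1000; mu <- matrix(2, nsim, 2)                 # start both means in one mode
for (t in 2:nsim) {
  w1 <- p*dnorm(x, mu[t-1, 1], sigma)
  w2 <- (1 - p)*dnorm(x, mu[t-1, 2], sigma)
  z  <- 1 + (runif(length(x)) > w1/(w1 + w2))          # component indicators
  for (j in 1:2) {
    nj <- sum(z == j)
    mu[t, j] <- rnorm(1, sum(x[z == j])/(1/cc + nj), sigma/sqrt(1/cc + nj))
  }
}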
269
Monte Carlo Statistical Methods: Diagnosing Convergence [270]
Normal Mixture/Gibbs Sampling Example
Take μ_1 = −1, μ_2 = 1, and p = .25

Vary σ = .5, 1, 2
Start in one mode
R program NormalMixtureGibbs
270
Monte Carlo Statistical Methods: Diagnosing Convergence [271]
Normal Mixture, μ_1 = −1, μ_2 = 1, σ = 2

[Figure: histogram of the data with the fitted density, the (μ_1, μ_2) sample path, and the trace plots of the two means.]

Can't find underlying means
Appears close to convergence
Reasonable representation of density
271
Monte Carlo Statistical Methods: Diagnosing Convergence [272]
Normal Mixture, μ_1 = −1, μ_2 = 1, σ = 1

[Figure: histogram of the data with the fitted density, the (μ_1, μ_2) sample path, and the trace plots of the two means.]

Closer to finding underlying means
Appears close to convergence
Reasonable representation of density
272
Monte Carlo Statistical Methods: Diagnosing Convergence [273]
Normal Mixture, μ_1 = −1, μ_2 = 1, σ = .5

[Figure: histogram of the data with the fitted density, the (μ_1, μ_2) sample path, and the trace plots of the two means.]

Finds underlying means
Appears close to convergence
Reasonable representation of density
273
Monte Carlo Statistical Methods: Diagnosing Convergence [274]
Normal Mixture/Gibbs Sampling-Two Dimensions
In higher dimensions, the Gibbs sampler may not escape the attraction
of the local mode when initialized close to that mode
[Figure: contour plot of the two-dimensional likelihood in (μ_1, μ_2), with the Gibbs sample trapped around a single mode.]
274
Monte Carlo Statistical Methods: Diagnosing Convergence [275]
Normal Mixture/Gibbs Sampling-5
This problem is common to single chain monitoring methods
Difficult to detect the existence of other modes
Or of other unexplored regions of the space
275
Monte Carlo Statistical Methods: Diagnosing Convergence [276]
Multiple Estimates
In most cases, the graph of the raw sequence doesn't help in the detection
of stationarity or convergence.
A more helpful indicator is the behavior of the averages as a function of T.
Can use several convergent estimators of E_f[h(θ)] based on the same chain
Monitor until all estimators coincide
276
Monte Carlo Statistical Methods: Diagnosing Convergence [277]
Monitoring Convergence of Averages -Poisson/Gibbs Example
Two Estimators of Lambda
Empirical Average or the Conditional Expectation
Convergence Diagnostic: Both estimators converge
[Figure: the two running estimates of λ plotted against iteration (0-500), both settling between 0.90 and 1.10, and a density of the simulated λ values over 0.9-1.2]
277
Monte Carlo Statistical Methods: Diagnosing Convergence [278]
Common Estimates
The empirical average S_T
The conditional (or Rao-Blackwellized) version of this average
$$S_T^C = \frac{1}{T}\sum_{t=1}^{T} E[h(\theta) \mid \eta^{(t)}]\,,$$
Importance sampling:
$$S_T^P = \sum_{t=1}^{T} w_t\, h(\theta^{(t)})\,,$$
where $w_t \propto f(\theta^{(t)})/g_t(\theta^{(t)})$ and $g_t$ is the true density used for the simulation of $\theta^{(t)}$.
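As a toy illustration of comparing S_T with S_T^C on a single chain (the bivariate normal Gibbs sampler and correlation ρ below are assumptions chosen only so that E[h(θ) | η] has a closed form; they are not part of the course examples):
#Toy comparison of the empirical and Rao-Blackwellized averages#
#Gibbs sampler for a bivariate normal (theta, eta) with correlation rho#
set.seed(2)
rho<-.9; T<-2000
theta<-eta<-rep(0,T)
for (t in 2:T){
  theta[t]<-rnorm(1, rho*eta[t-1], sqrt(1-rho^2))
  eta[t]  <-rnorm(1, rho*theta[t], sqrt(1-rho^2))
}
ST <-cumsum(theta)/(1:T)       #empirical average of h(theta)=theta
STC<-cumsum(rho*eta)/(1:T)     #RB average: E[theta|eta]=rho*eta
plot(ST,type="l",xlab="iteration",ylab="estimate"); lines(STC,lty=2)
#monitor until the two estimators coincide (both should approach 0)#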
278
Monte Carlo Statistical Methods: Diagnosing Convergence [279]
Example 12.12: Cauchy Posterior
The hierarchical model
$$X_i \sim \text{Cauchy}(\theta), \quad i = 1, \ldots, 3, \qquad \theta \sim N(0, \sigma^2)$$
has posterior distribution
$$\pi(\theta \mid x_1, x_2, x_3) \propto e^{-\theta^2/2\sigma^2} \prod_{i=1}^{3} \frac{1}{1 + (\theta - x_i)^2}\,.$$
We can use a Gibbs sampler
$$\eta_i \mid \theta, x_i \sim \text{Exp}\!\left(\frac{1 + (\theta - x_i)^2}{2}\right),$$
$$\theta \mid x_1, x_2, x_3, \eta_1, \eta_2, \eta_3 \sim N\!\left(\frac{\eta_1 x_1 + \eta_2 x_2 + \eta_3 x_3}{\eta_1 + \eta_2 + \eta_3 + \sigma^{-2}}\,,\ \frac{1}{\eta_1 + \eta_2 + \eta_3 + \sigma^{-2}}\right),$$
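A minimal R sketch of this Gibbs sampler (the data values x and the prior scale σ below are illustrative assumptions):
#Sketch of the Gibbs sampler for the Cauchy posterior#
#Data values and prior scale are illustrative assumptions#
set.seed(3)
x<-c(-3,0,3); sigma<-5; nsim<-1000
theta<-rep(0,nsim)
for (t in 2:nsim){
  eta<-rexp(3, rate=(1+(theta[t-1]-x)^2)/2)   #latent exponentials given theta#
  prec<-sum(eta)+1/sigma^2
  theta[t]<-rnorm(1, sum(eta*x)/prec, 1/sqrt(prec))
}
plot(theta,type="l",xlab="Index",ylab="theta")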
279
Monte Carlo Statistical Methods: Diagnosing Convergence [280]
Example 12.12: Cauchy Posterior -2
The Gibbs sampler is based on the latent variables η_i, where
$$\int e^{-\frac{1}{2}\eta_i (1 + (x_i - \theta)^2)}\, d\eta_i = \frac{2}{1 + (x_i - \theta)^2}\,,$$
so
$$\eta_i \sim \text{Exponential}\!\left(\tfrac{1}{2}\,(1 + (x_i - \theta)^2)\right)$$
Monitor with three estimates of θ
Empirical Average
Rao-Blackwellized
Importance sample
280
Monte Carlo Statistical Methods: Diagnosing Convergence [281]
Monitor with three estimates of θ:
Empirical Average
$$\frac{1}{M}\sum_{j=1}^{M} \theta^{(j)}$$
Rao-Blackwellized
$$E[\theta \mid x] = C \int \theta\, e^{-\theta^2/2\sigma^2} \prod_{i=1}^{3} \frac{1}{1 + (\theta - x_i)^2}\, d\theta
= C \int\!\!\int \theta\, e^{-\theta^2/2\sigma^2}\, e^{-\frac{1}{2}\sum_i \eta_i (1 + (x_i - \theta)^2)}\, d\theta\, d\eta_1\, d\eta_2\, d\eta_3$$
And so
$$\theta \mid \eta_1, \eta_2, \eta_3 \sim N\left(\frac{\sum_i \eta_i x_i}{\frac{1}{\sigma^2} + \sum_i \eta_i}\,,\ \left(\frac{1}{\sigma^2} + \sum_i \eta_i\right)^{-1}\right)$$
Importance sampling with Cauchy candidate
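A sketch of the three running estimates in R (re-using the illustrative data values and prior scale from the sketch above; the Cauchy(0, 1) importance candidate is also an assumption):
#Sketch: three running estimates of theta from the same Gibbs chain#
#x, sigma, and the Cauchy(0,1) importance candidate are assumptions#
set.seed(4)
x<-c(-3,0,3); sigma<-5; M<-1000
theta<-condmean<-rep(0,M); th<-0
for (t in 1:M){
  eta<-rexp(3, rate=(1+(th-x)^2)/2)
  prec<-sum(eta)+1/sigma^2
  condmean[t]<-sum(eta*x)/prec              #E[theta|eta], used for the RB average#
  th<-theta[t]<-rnorm(1, condmean[t], 1/sqrt(prec))
}
post<-function(t) exp(-t^2/(2*sigma^2)) /
      ((1+(t-x[1])^2)*(1+(t-x[2])^2)*(1+(t-x[3])^2))
cand<-rcauchy(M)                            #Cauchy(0,1) importance candidate#
w<-post(cand)/dcauchy(cand)
emp<-cumsum(theta)/(1:M)                    #empirical average#
rb <-cumsum(condmean)/(1:M)                 #Rao-Blackwellized average#
imp<-cumsum(w*cand)/cumsum(w)               #self-normalized IS running estimate#
matplot(cbind(emp,rb,imp),type="l",lty=1:3,xlab="Index",ylab="estimate")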
281
Monte Carlo Statistical Methods: Diagnosing Convergence [282]
Cauchy Posterior Convergence
[Figure: histogram of the Gibbs sample of θ with the posterior density overlaid, and running Empirical Average, RB, and IS estimates plotted against iteration (0-1000)]
282
Monte Carlo Statistical Methods: Diagnosing Convergence [283]
Multiple estimates
Empirical Average and RB are similar - supports convergence
IS poor - not yet converged
[Figure: same display as the previous slide: Gibbs sample histogram with posterior overlay, and running Empirical Average, RB, and IS estimates]
283
Monte Carlo Statistical Methods: Diagnosing Convergence [284]
Multiple Estimate-Conclusions
Limitations:
The method does not always apply
Intrinsically conservative (since the speed of convergence is determined by the slower estimate)
Advantage: when applicable, a superior diagnostic to a single chain
284
Monte Carlo Statistical Methods: Diagnosing Convergence [285]
Within and Between Variances
Gelman/Rubin Criterion
Criterion based on the difference between a weighted estimator of the
variance and the variance of estimators from the different chains
Need good (dispersed) starting values
285
Monte Carlo Statistical Methods: Diagnosing Convergence [286]
Within and Between Variances - Some Details
Generate M chains, estimate ξ = h(θ)
Calculate
B_T = Between Variance
W_T = Pooled Within Variance
R_T = Adjusted Ratio of B_T/W_T
Convergence when R_T → 1
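A rough sketch of the between/within computation in R (toy AR(1) chains stand in for MCMC output, and the degrees-of-freedom adjustment of the full criterion is omitted; all numerical values are illustrative):
#Sketch: between- and within-chain variances for M parallel chains#
#Toy AR(1) chains stand in for MCMC output; not the full adjusted R_T#
set.seed(5)
M<-4; T<-1000; rho<-.9
chains<-matrix(0,T,M)
chains[1,]<-rnorm(M,0,5)                        #dispersed starting values#
for (t in 2:T) chains[t,]<-rho*chains[t-1,]+rnorm(M,0,sqrt(1-rho^2))
xbar<-colMeans(chains)                          #chain means#
B<-T*var(xbar)                                  #between-chain variance#
W<-mean(apply(chains,2,var))                    #pooled within-chain variance#
varhat<-(T-1)/T*W + B/T                         #pooled variance estimate#
R<-sqrt(varhat/W)                               #potential scale reduction (unadjusted)#
R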
286
Monte Carlo Statistical Methods: Diagnosing Convergence [287]
To run Gelman-Rubin in WinBugs
We need at least two chains in the Model Specification
Select the B-G-R diag in the Sample Monitor Tool
Modified by Brooks and Gelman (1998)
287
Monte Carlo Statistical Methods: Diagnosing Convergence [288]
Brooks-Gelman-Rubin
The plot shows
Green: Widths of pooled central 80% CI for R_T
Blue: Widths of average central 80% CI for R_T
Red: R_T
Want R_T → 1
Look at some examples
Simple.odc
PKWinBugs.odc
288
Monte Carlo Statistical Methods: Diagnosing Convergence [289]
Gelman/Rubin Comments
Method has enjoyed wide usage, in particular because of its simplicity
and its connections with standard tools
Gelman and Rubin (1992) suggest removing the first half of the simulated
sample to reduce the dependence on the initial distribution
The accurate construction of the initial distribution can be quite delicate
and time-consuming
The method relies on normal approximations
But it's not bad!
289