
Markov Chain Monte Carlo Methods

Christian P. Robert
Université Paris Dauphine and CREST-INSEE
http://www.ceremade.dauphine.fr/~xian
November 9, 2009
Outline
Motivation and leading example
Random variable generation
Monte Carlo Integration
Notions on Markov Chains
The Metropolis-Hastings Algorithm
The Gibbs Sampler
MCMC tools for variable dimension problems
Sequential importance sampling
New [2004] edition: [figure: book cover]

Motivation and leading example

Introduction
Likelihood methods
Missing variable models
Bayesian Methods
Bayesian troubles

Introduction
Latent structures make life harder!
Even simple models may lead to computational complications, as in latent variable models

  f(x \mid \theta) = \int f^*(x, x^* \mid \theta) \, dx^*

If (x, x^*) observed, fine!
If only x observed, trouble!
Example (Mixture models)
Models of mixtures of distributions:

  X \sim f_j  with probability  p_j,

for j = 1, 2, \ldots, k, with overall density

  X \sim p_1 f_1(x) + \cdots + p_k f_k(x).

For a sample of independent random variables (X_1, \ldots, X_n), sample density

  \prod_{i=1}^{n} \left[ p_1 f_1(x_i) + \cdots + p_k f_k(x_i) \right].

Expanding this product involves k^n elementary terms: prohibitive to compute in large samples.
[Figure: likelihood surface of a two-component mixture over (\mu_1, \mu_2)]
Likelihood methods
Maximum likelihood methods
Go Bayes!!
For an iid sample X_1, \ldots, X_n from a population with density f(x \mid \theta_1, \ldots, \theta_k), the likelihood function is

  L(\theta \mid x) = L(\theta_1, \ldots, \theta_k \mid x_1, \ldots, x_n) = \prod_{i=1}^{n} f(x_i \mid \theta_1, \ldots, \theta_k).

- Global justifications from asymptotics
- Computational difficulty depends on structure, e.g. latent variables
Example (Mixtures again)
For a mixture of two normal distributions,

  p\, N(\mu, \tau^2) + (1-p)\, N(\theta, \sigma^2),

the likelihood is proportional to

  \prod_{i=1}^{n} \left[ p\, \tau^{-1} \varphi\!\left(\frac{x_i - \mu}{\tau}\right) + (1-p)\, \sigma^{-1} \varphi\!\left(\frac{x_i - \theta}{\sigma}\right) \right],

containing 2^n terms.
Standard maximization techniques often fail to find the global maximum because of multimodality of the likelihood function.

Example
In the special case

  f(x \mid \theta, \sigma) = (1-\epsilon) \exp\{-x^2/2\} + \frac{\epsilon}{\sigma} \exp\{-(x-\theta)^2 / 2\sigma^2\}    (1)

with \epsilon > 0 known, whatever n, the likelihood is unbounded:

  \lim_{\sigma \to 0} \ell(\theta = x_1, \sigma \mid x_1, \ldots, x_n) = \infty
Missing variable models
The special case of missing variable models
Consider again a latent variable representation

  g(x \mid \theta) = \int_{\mathcal{Z}} f(x, z \mid \theta) \, dz

Define the completed (but unobserved) likelihood

  L^c(\theta \mid x, z) = f(x, z \mid \theta)

Useful for optimisation algorithms
The EM Algorithm
Algorithm (Expectation-Maximisation)
Iterate (in m)
1. (E-step) Compute

     Q(\theta \mid \theta^{(m)}, x) = E[\log L^c(\theta \mid x, Z) \mid \theta^{(m)}, x] ;

2. (M-step) Maximise Q(\theta \mid \theta^{(m)}, x) in \theta and take

     \theta^{(m+1)} = \arg\max_{\theta} Q(\theta \mid \theta^{(m)}, x) ;

until a fixed point [of Q] is reached.
Motivation and leading example
Missing variable models
Echantillon N(0,1)
2 1 0 1 2
0
.
0
0
.
1
0
.
2
0
.
3
0
.
4
0
.
5
0
.
6
Markov Chain Monte Carlo Methods
Motivation and leading example
Missing variable models
1 0 1 2 3

1
0
1
2
3

2
Likeliho
Markov Chain Monte Carlo Methods
Motivation and leading example
Bayesian Methods
The Bayesian Perspective
In the Bayesian paradigm, the information brought by the data x, realization of

  X \sim f(x \mid \theta),

is combined with prior information specified by a prior distribution with density \pi(\theta).
Central tool
Summary in a probability distribution, \pi(\theta \mid x), called the posterior distribution.
Derived from the joint distribution f(x \mid \theta)\pi(\theta), according to

  \pi(\theta \mid x) = \frac{f(x \mid \theta)\pi(\theta)}{\int f(x \mid \theta)\pi(\theta)\, d\theta},
                                            [Bayes Theorem]
where

  m(x) = \int f(x \mid \theta)\pi(\theta)\, d\theta

is the marginal density of X.
Central tool... central to Bayesian inference
Posterior defined up to a constant as

  \pi(\theta \mid x) \propto f(x \mid \theta)\, \pi(\theta)

- Operates conditional upon the observations
- Integrates simultaneously prior information and the information brought by x
- Avoids averaging over the unobserved values of x
- Coherent updating of the information available on \theta, independent of the order in which i.i.d. observations are collected
- Provides a complete inferential scope and a unique motor of inference
Bayesian troubles
Conjugate bonanza...

Example (Binomial)
For an observation X \sim B(n, p) the so-called conjugate prior is the family of beta Be(a, b) distributions.
The classical Bayes estimator \delta^\pi is the posterior mean

  \delta^\pi = \frac{\Gamma(a+b+n)}{\Gamma(a+x)\Gamma(n-x+b)} \int_0^1 p\; p^{x+a-1} (1-p)^{n-x+b-1}\, dp = \frac{x+a}{a+b+n}.
Example (Normal)
In the normal N(\mu, \sigma^2) case, with both \mu and \sigma unknown, conjugate prior on \theta = (\mu, \sigma^2) of the form

  (\sigma^2)^{-\lambda_\sigma} \exp\left\{ -\left( \lambda_\mu (\mu - \xi)^2 + \alpha \right) / 2\sigma^2 \right\}

since

  \pi((\mu, \sigma^2) \mid x_1, \ldots, x_n)
    \propto (\sigma^2)^{-\lambda_\sigma} \exp\left\{ -\left( \lambda_\mu (\mu - \xi)^2 + \alpha \right) / 2\sigma^2 \right\}
      \times (\sigma^2)^{-n} \exp\left\{ -\left( n(\mu - \bar{x})^2 + s_x^2 \right) / 2\sigma^2 \right\}
    \propto (\sigma^2)^{-\lambda_\sigma - n} \exp\left\{ -\left( (\lambda_\mu + n)(\mu - \xi_x)^2 + \alpha + s_x^2 + \frac{n \lambda_\mu}{n + \lambda_\mu} (\bar{x} - \xi)^2 \right) / 2\sigma^2 \right\}
...and conjugate curse
The use of conjugate priors for computational reasons
- implies a restriction on the modeling of the available prior information
- may be detrimental to the usefulness of the Bayesian approach
- gives an impression of subjective manipulation of the prior information disconnected from reality.
A typology of Bayes computational problems
(i). use of a complex parameter space, as for instance in constrained parameter sets like those resulting from imposing stationarity constraints in dynamic models;
(ii). use of a complex sampling model with an intractable likelihood, as for instance in missing data and graphical models;
(iii). use of a huge dataset;
(iv). use of a complex prior distribution (which may be the posterior distribution associated with an earlier sample);
(v). use of a complex inferential procedure, as for instance Bayes factors

  B_{01}^{\pi}(x) = \frac{P(\theta \in \Theta_0 \mid x)}{P(\theta \in \Theta_1 \mid x)} \Big/ \frac{\pi(\Theta_0)}{\pi(\Theta_1)}.
Example (Mixture once again)
Observations from

  x_1, \ldots, x_n \sim f(x \mid \theta) = p\, \varphi(x; \mu_1, \sigma_1) + (1-p)\, \varphi(x; \mu_2, \sigma_2)

Prior

  \mu_i \mid \sigma_i \sim N(\xi_i, \sigma_i^2 / n_i), \quad \sigma_i^2 \sim IG(\nu_i/2, s_i^2/2), \quad p \sim Be(\alpha, \beta)

Posterior

  \pi(\theta \mid x_1, \ldots, x_n) \propto \prod_{j=1}^{n} \left[ p\, \varphi(x_j; \mu_1, \sigma_1) + (1-p)\, \varphi(x_j; \mu_2, \sigma_2) \right] \pi(\theta)
    = \sum_{\ell=0}^{n} \sum_{(k_t)} \omega(k_t)\, \pi(\theta \mid (k_t))
                                            [O(2^n)]
Example (Mixture once again (contd))
For a given permutation (k_t), conditional posterior distribution

  \pi(\theta \mid (k_t)) = N\!\left( \xi_1(k_t), \frac{\sigma_1^2}{n_1 + \ell} \right) \times IG\!\left( (\nu_1 + \ell)/2,\; s_1(k_t)/2 \right)
    \times N\!\left( \xi_2(k_t), \frac{\sigma_2^2}{n_2 + n - \ell} \right) \times IG\!\left( (\nu_2 + n - \ell)/2,\; s_2(k_t)/2 \right)
    \times Be(\alpha + \ell, \beta + n - \ell)
Example (Mixture once again (contd))
where

  \bar{x}_1(k_t) = \frac{1}{\ell} \sum_{t=1}^{\ell} x_{k_t}, \qquad \hat{s}_1(k_t) = \sum_{t=1}^{\ell} (x_{k_t} - \bar{x}_1(k_t))^2,
  \bar{x}_2(k_t) = \frac{1}{n-\ell} \sum_{t=\ell+1}^{n} x_{k_t}, \qquad \hat{s}_2(k_t) = \sum_{t=\ell+1}^{n} (x_{k_t} - \bar{x}_2(k_t))^2

and

  \xi_1(k_t) = \frac{n_1 \xi_1 + \ell\, \bar{x}_1(k_t)}{n_1 + \ell}, \qquad \xi_2(k_t) = \frac{n_2 \xi_2 + (n-\ell)\, \bar{x}_2(k_t)}{n_2 + n - \ell},

  s_1(k_t) = s_1^2 + \hat{s}_1^2(k_t) + \frac{n_1 \ell}{n_1 + \ell} (\xi_1 - \bar{x}_1(k_t))^2,
  s_2(k_t) = s_2^2 + \hat{s}_2^2(k_t) + \frac{n_2 (n-\ell)}{n_2 + n - \ell} (\xi_2 - \bar{x}_2(k_t))^2,

posterior updates of the hyperparameters.
Example (Mixture once again)
Bayes estimator of \theta:

  \delta^\pi(x_1, \ldots, x_n) = \sum_{\ell=0}^{n} \sum_{(k_t)} \omega(k_t)\, E^\pi[\theta \mid x, (k_t)]

Too costly: 2^n terms
Example (Poly-t priors)
Normal observation x \sim N(\theta, 1), with conjugate prior

  \theta \sim N(\mu, \tau^2)

Closed form expression for the posterior mean

  \int_{-\infty}^{\infty} \theta\, f(x \mid \theta)\, \pi(\theta)\, d\theta \Big/ \int_{-\infty}^{\infty} f(x \mid \theta)\, \pi(\theta)\, d\theta = \frac{x + \tau^{-2}\mu}{1 + \tau^{-2}}.
Example (Poly-t priors (2))
More involved prior distribution: poly-t distribution
                                            [Bauwens, 1985]

  \pi(\theta) = \prod_{i=1}^{k} \left[ \alpha_i + (\theta - \beta_i)^2 \right]^{-\nu_i}, \qquad \alpha_i, \nu_i > 0

Computation of E[\theta \mid x] ???
Example (AR(p) model)
Auto-regressive representation of a time series,

  x_t = \sum_{i=1}^{p} \theta_i x_{t-i} + \epsilon_t

If the order p is unknown, the predictive distribution of x_{t+1} is given by

  \pi(x_{t+1} \mid x_t, \ldots, x_1) \propto \int f(x_{t+1} \mid x_t, \ldots, x_{t-p+1})\, \pi(\theta, p \mid x_t, \ldots, x_1)\, dp\, d\theta,
Example (AR(p) model (contd))
Integration over the parameters of all models

  \sum_{p=0}^{\infty} \int f(x_{t+1} \mid x_t, \ldots, x_{t-p+1})\, \pi(\theta \mid p, x_t, \ldots, x_1)\, d\theta\; \pi(p \mid x_t, \ldots, x_1).
Example (AR(p) model (contd))
Multiple layers of complexity:
(i). complex parameter space within each AR(p) model because of the stationarity constraint;
(ii). if p unbounded, infinity of models;
(iii). \theta varies between models AR(p) and AR(p+1), with a different stationarity constraint (except for root reparameterisation);
(iv). if prediction used sequentially, every tick/second/hour/day, the posterior distribution \pi(\theta, p \mid x_t, \ldots, x_1) must be re-evaluated.
Random variable generation

Basic methods
Uniform pseudo-random generator
Beyond Uniform distributions
Transformation methods
Accept-Reject Methods
Fundamental theorem of simulation
Log-concave densities

Random variable generation
- Rely on the possibility of producing (computer-wise) an endless flow of random variables (usually iid) from well-known distributions
- Given a uniform random number generator, illustration of methods that produce random variables from both standard and nonstandard distributions
Basic methods
The inverse transform method
For a function F on R, the generalized inverse of F, F^-, is defined by

  F^-(u) = \inf \{ x ;\; F(x) \geq u \}.

Definition (Probability Integral Transform)
If U \sim U_{[0,1]}, then the random variable F^-(U) has the distribution F.
The inverse transform method (2)
To generate a random variable X \sim F, simply generate

  U \sim U_{[0,1]}

and then make the transform

  x = F^-(u)
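As a quick illustration of my own (not in the slides), the exponential cdf F(x) = 1 - exp(-lambda x) has explicit inverse F^-(u) = -log(1-u)/lambda, so in R:

  lambda <- 2
  u <- runif(1e4)
  x <- -log(1 - u) / lambda    # inverse transform; -log(u)/lambda also works since 1-U ~ U
  mean(x)                      # close to 1/lambda = 0.5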
Uniform pseudo-random generator

Desiderata and limitations
- Production of a deterministic sequence of values in [0, 1] which imitates a sequence of iid uniform random variables U_{[0,1]}.
- Can't use the physical imitation of a random draw [no guarantee of uniformity, no reproducibility]
- Random sequence in the sense: having generated (X_1, \ldots, X_n), knowledge of X_n [or of (X_1, \ldots, X_n)] imparts no discernible knowledge of the value of X_{n+1}.
- Deterministic: given the initial value X_0, the sample (X_1, \ldots, X_n) is always the same
- Validity of a random number generator based on a single sample X_1, \ldots, X_n when n tends to +\infty, not on replications

    (X_{11}, \ldots, X_{1n}), (X_{21}, \ldots, X_{2n}), \ldots, (X_{k1}, \ldots, X_{kn})

  where n is fixed and k tends to infinity.
Uniform pseudo-random generator
Algorithm starting from an initial value 0 \leq u_0 \leq 1 and a transformation D, which produces a sequence

  (u_i) = (D^i(u_0))

in [0, 1]. For all n,

  (u_1, \ldots, u_n)

reproduces the behavior of an iid U_{[0,1]} sample (V_1, \ldots, V_n) when compared through usual tests.
Uniform pseudo-random generator (2)
Validity means the sequence U_1, \ldots, U_n leads to accept the hypothesis

  H : U_1, \ldots, U_n are iid U_{[0,1]}.

The set of tests used is generally of some consequence:
- Kolmogorov-Smirnov and other nonparametric tests
- Time series methods, for correlation between U_i and (U_{i-1}, \ldots, U_{i-k})
- Marsaglia's battery of tests called Die Hard (!)
Usual generators
In R and S-plus, procedure runif()
The Uniform Distribution
Description:
runif generates random deviates.
Example:
u <- runif(20)
.Random.seed is an integer vector, containing
the random number generator state for random
number generation in R. It can be saved and
restored, but should not be altered by users.
[Figure: sequence plot of a uniform sample and histogram of its values]
Usual generators (2)
In C, procedure rand() or random()
SYNOPSIS
#include <stdlib.h>
long int random(void);
DESCRIPTION
The random() function uses a non-linear additive
feedback random number generator employing a
default table of size 31 long integers to return
successive pseudo-random numbers in the range
from 0 to RAND_MAX. The period of this random
generator is very large, approximately
16*((2**31)-1).
RETURN VALUE
random() returns a value between 0 and RAND_MAX.
Usual generators (3)
In Scilab, procedure rand()
rand() : with no arguments gives a scalar whose
value changes each time it is referenced. By
default, random numbers are uniformly distributed
in the interval (0,1). rand('normal') switches to
a normal distribution with mean 0 and variance 1.
EXAMPLE
x=rand(10,10,'uniform')
Beyond Uniform distributions

Beyond Uniform generators
- Generation of any sequence of random variables can be formally implemented through a uniform generator
- Distributions with explicit F^- (for instance, exponential and Weibull distributions): use the probability integral transform
- Case-specific methods rely on properties of the distribution (for instance, normal distribution, Poisson distribution)
- More generic methods (for instance, accept-reject and ratio-of-uniform)
- Simulation of the standard distributions is accomplished quite efficiently by many numerical and statistical programming packages.
Transformation methods
Case where a distribution F is linked in a simple way to another distribution that is easy to simulate.

Example (Exponential variables)
If U \sim U_{[0,1]}, the random variable

  X = -\log U / \lambda

has distribution

  P(X \leq x) = P(-\log U \leq \lambda x) = P(U \geq e^{-\lambda x}) = 1 - e^{-\lambda x},

the exponential distribution Exp(\lambda).
Other random variables that can be generated starting from an exponential include

  Y = -2 \sum_{j=1}^{\nu} \log(U_j) \sim \chi^2_{2\nu}

  Y = -\frac{1}{\beta} \sum_{j=1}^{a} \log(U_j) \sim Ga(a, \beta)

  Y = \sum_{j=1}^{a} \log(U_j) \Big/ \sum_{j=1}^{a+b} \log(U_j) \sim Be(a, b)
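A quick empirical check of these identities in R (my own illustration), with nu = 2, a = 3, b = 2:

  set.seed(42)
  U <- matrix(runif(5e4 * 5), ncol = 5)                 # five uniforms per draw
  Y.chi2  <- -2 * rowSums(log(U[, 1:2]))                # chi^2 with 2*nu = 4 df
  Y.gamma <- -rowSums(log(U[, 1:3]))                    # Ga(3, 1)
  Y.beta  <- rowSums(log(U[, 1:3])) / rowSums(log(U))   # Be(3, 2)
  c(mean(Y.chi2), mean(Y.gamma), mean(Y.beta))          # approx 4, 3, 0.6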
Points to note
- Transformation quite simple to use
- There are more efficient algorithms for gamma and beta random variables
- Cannot generate gamma random variables with a non-integer shape parameter
- For instance, cannot get a \chi^2_1 variable, which would get us a N(0, 1) variable.
Box-Muller Algorithm

Example (Normal variables)
If (r, \theta) are the polar coordinates of (X_1, X_2), then

  r^2 = X_1^2 + X_2^2 \sim \chi^2_2 = Exp(1/2) \quad and \quad \theta \sim U_{[0, 2\pi]}

Consequence: if U_1, U_2 iid U_{[0,1]},

  X_1 = \sqrt{-2 \log(U_1)}\, \cos(2\pi U_2)
  X_2 = \sqrt{-2 \log(U_1)}\, \sin(2\pi U_2)

are iid N(0, 1).
Box-Muller Algorithm (2)
1. Generate U_1, U_2 iid U_{[0,1]} ;
2. Define

     x_1 = \sqrt{-2 \log(u_1)}\, \cos(2\pi u_2),
     x_2 = \sqrt{-2 \log(u_1)}\, \sin(2\pi u_2) ;

3. Take x_1 and x_2 as two independent draws from N(0, 1).
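A direct R transcription of these three steps (my own sketch):

  box.muller <- function(n) {
    u1 <- runif(n); u2 <- runif(n)
    r <- sqrt(-2 * log(u1))                        # radius: r^2 ~ Exp(1/2)
    c(r * cos(2 * pi * u2), r * sin(2 * pi * u2))  # 2n iid N(0,1) draws
  }
  x <- box.muller(5e3)
  c(mean(x), var(x))                               # approx 0 and 1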
Box-Muller Algorithm (3)
[Figure: scatterplot and histogram of Box-Muller normal draws]
- Unlike algorithms based on the CLT, this algorithm is exact
- Get two normals for the price of two uniforms
- Drawback (in speed) in calculating log, cos and sin.
More transforms

Example (Poisson generation)
Poisson-exponential connection:
If N \sim P(\lambda) and X_i \sim Exp(\lambda), i \in N^*,

  P_\lambda(N = k) = P_\lambda(X_1 + \cdots + X_k \leq 1 < X_1 + \cdots + X_{k+1}).
More Poisson
- A Poisson can be simulated by generating Exp(\lambda) variables until their sum exceeds 1.
- This method is simple, but is really practical only for smaller values of \lambda.
- On average, the number of exponential variables required is \lambda.
- Other approaches are more suitable for large \lambda's.
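A one-function R sketch of this connection (my own illustration):

  rpois.exp <- function(lambda) {
    k <- 0; s <- rexp(1, lambda)
    while (s <= 1) {              # accumulate Exp(lambda) draws until the sum passes 1
      k <- k + 1
      s <- s + rexp(1, lambda)
    }
    k                             # number of complete arrivals before time 1
  }
  mean(replicate(1e4, rpois.exp(4)))   # approx 4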
Atkinson's Poisson
To generate N \sim P(\lambda):
1. Define \beta = \pi/\sqrt{3\lambda}, \alpha = \lambda\beta and k = \log c - \lambda - \log \beta ;
2. Generate U_1 \sim U_{[0,1]} and calculate

     x = \{\alpha - \log((1 - u_1)/u_1)\}/\beta

   until x > -0.5 ;
3. Define N = \lfloor x + 0.5 \rfloor and generate U_2 \sim U_{[0,1]} ;
4. Accept N if

     \alpha - \beta x + \log\left( u_2 / \{1 + \exp(\alpha - \beta x)\}^2 \right) \leq k + N \log \lambda - \log N!
Negative extension
A generator of Poisson random variables can produce negative binomial random variables since

  Y \sim Ga(n, (1-p)/p), \quad X \mid y \sim P(y)

implies

  X \sim Neg(n, p)
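In R this two-stage representation is a one-liner (my own check, reading Ga(n, (1-p)/p) as shape n and scale (1-p)/p):

  n <- 5; p <- 0.3
  x <- rpois(1e5, rgamma(1e5, shape = n, scale = (1 - p)/p))
  c(mean(x), n * (1 - p)/p)       # both approx the Neg(n, p) mean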
Mixture representation
- The representation of the negative binomial is a particular case of a mixture distribution
- The principle of a mixture representation is to represent a density f as the marginal of another distribution, for example

    f(x) = \sum_{i \in \mathcal{Y}} p_i f_i(x),

- If the component distributions f_i(x) can be easily generated, X can be obtained by first choosing f_i with probability p_i and then generating an observation from f_i.
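A short R sketch of this two-step recipe (my own illustration), for a three-component normal mixture:

  p  <- c(0.5, 0.3, 0.2)                            # component weights
  mu <- c(-2, 0, 3)                                 # component means
  i  <- sample(1:3, 1e4, replace = TRUE, prob = p)  # first choose the component
  x  <- rnorm(1e4, mean = mu[i])                    # then draw from it
  mean(x)                                           # approx sum(p * mu) = -0.4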
Partitioned sampling
Special case of mixture sampling when

  f_i(x) = f(x)\, I_{A_i}(x) \Big/ \int_{A_i} f(x)\, dx

and

  p_i = \Pr(X \in A_i)

for a partition (A_i)_i.
Accept-Reject Methods
Accept-Reject algorithm
- There are many distributions from which it is difficult, or even impossible, to directly simulate.
- Another class of methods only requires us to know the functional form of the density f of interest up to a multiplicative constant.
- The key to this method is to use a simpler (simulation-wise) density g, the instrumental density, from which the simulation from the target density f is actually done.
Fundamental theorem of simulation

Lemma
Simulating

  X \sim f(x)

is equivalent to simulating

  (X, U) \sim U\{(x, u) : 0 < u < f(x)\}

[Figure: a density f(x) with the area under its graph uniformly filled with points]
The Accept-Reject algorithm
Given a density of interest f, find a density g and a constant M such that

  f(x) \leq M g(x)

on the support of f.
1. Generate X \sim g, U \sim U_{[0,1]} ;
2. Accept Y = X if U \leq f(X)/Mg(X) ;
3. Return to 1. otherwise.
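A generic R sketch of these three steps (my own illustration); the target/instrumental pair anticipates the Normal-from-a-Cauchy example below, with M = 1.53 taken slightly above the exact bound sqrt(2 pi / e) = 1.52:

  accept.reject <- function(n, f, g, rg, M) {
    y <- numeric(0)
    while (length(y) < n) {
      x <- rg(n)                                # 1. X ~ g
      u <- runif(n)
      y <- c(y, x[u <= f(x) / (M * g(x))])      # 2. keep the accepted draws
    }                                           # 3. otherwise loop
    y[1:n]
  }
  x <- accept.reject(1e4, dnorm, dcauchy, rcauchy, M = 1.53)
  c(mean(x), var(x))                            # approx 0 and 1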
Validation of the Accept-Reject method
Warranty: this algorithm produces a variable Y distributed according to f.

[Figure: accepted points under the graph of Mg, with f overlaid]
Two interesting properties
- First, it provides a generic method to simulate from any density f that is known up to a multiplicative factor. This property is particularly important in Bayesian calculations, where the posterior distribution

    \pi(\theta \mid x) \propto \pi(\theta)\, f(x \mid \theta)

  is specified up to a normalizing constant.
- Second, the probability of acceptance in the algorithm is 1/M, i.e., the expected number of trials until a variable is accepted is M.
More interesting properties
- In cases where f and g are both probability densities, the constant M is necessarily larger than 1.
- The size of M, and thus the efficiency of the algorithm, are functions of how closely g can imitate f, especially in the tails.
- For f/g to remain bounded, it is necessary for g to have tails thicker than those of f. It is therefore impossible to use the A-R algorithm to simulate a Cauchy distribution f using a normal distribution g; however, the reverse works quite well.
No Cauchy!

Example (Normal from a Cauchy)
Take

  f(x) = \frac{1}{\sqrt{2\pi}} \exp(-x^2/2)

and

  g(x) = \frac{1}{\pi} \frac{1}{1 + x^2},

the densities of the normal and Cauchy distributions. Then

  \frac{f(x)}{g(x)} = \sqrt{\frac{\pi}{2}}\, (1 + x^2)\, e^{-x^2/2} \leq \sqrt{\frac{2\pi}{e}} = 1.52,

attained at x = \pm 1.
Example (Normal from a Cauchy (2))
So the probability of acceptance is

  1/1.52 = 0.66,

and, on average, one out of every three simulated Cauchy variables is rejected.
No Double!

Example (Normal/Double Exponential)
Generate a N(0, 1) by using a double-exponential distribution with density

  g(x \mid \alpha) = (\alpha/2) \exp(-\alpha |x|)

Then

  \frac{f(x)}{g(x \mid \alpha)} \leq \sqrt{\frac{2}{\pi}}\, \alpha^{-1} e^{\alpha^2/2}

and the minimum of this bound (in \alpha) is attained for \alpha^* = 1.
Example (Normal/Double Exponential (2))
Probability of acceptance

  \sqrt{\pi/2e} = .76

To produce one normal random variable requires on average 1/.76 \approx 1.3 uniform variables.
Example (Gamma generation)
Illustrates a real advantage of the Accept-Reject algorithm:
the gamma distribution Ga(\alpha, \beta) can be represented as a sum of exponential random variables only if \alpha is an integer.

Example (Gamma generation (2))
Can use the Accept-Reject algorithm with instrumental distribution

  Ga(a, b), \quad with \quad a = \lfloor \alpha \rfloor.

(Without loss of generality, \beta = 1.)
Up to a normalizing constant,

  f/g_b = b^{-a} x^{\alpha - a} \exp\{-(1 - b)x\} \leq b^{-a} \left( \frac{\alpha - a}{(1 - b)e} \right)^{\alpha - a}

for b \leq 1. The maximum is attained at b = a/\alpha.
Cheng and Feast's Gamma generator
For a Gamma Ga(\alpha, 1), \alpha > 1, distribution:
1. Define c_1 = \alpha - 1, c_2 = (\alpha - 1/(6\alpha))/c_1, c_3 = 2/c_1, c_4 = 1 + c_3, and c_5 = 1/\sqrt{\alpha}.
2. Repeat
     generate U_1, U_2
     take U_1 = U_2 + c_5 (1 - 1.86 U_1) if \alpha > 2.5
   until 0 < U_1 < 1.
3. Set W = c_2 U_2 / U_1.
4. If c_3 U_1 + W + W^{-1} \leq c_4 or c_3 \log U_1 - \log W + W \leq 1, take c_1 W ;
   otherwise, repeat.
Truncated Normal simulation

Example (Truncated Normal distributions)
The constraint x \geq \mu^- produces a density proportional to

  e^{-(x - \mu)^2 / 2\sigma^2}\, I_{x \geq \mu^-}

for a bound \mu^- large compared with \mu.
There exist alternatives far superior to the naive method of generating a N(\mu, \sigma^2) until exceeding \mu^-, which requires an average number of

  1/\Phi((\mu - \mu^-)/\sigma)

simulations from N(\mu, \sigma^2) for a single acceptance.
Example (Truncated Normal distributions (2))
Instrumental distribution: translated exponential distribution, E(\alpha, \mu^-), with density

  g_\alpha(z) = \alpha\, e^{-\alpha (z - \mu^-)}\, I_{z \geq \mu^-}.

The ratio f/g_\alpha is bounded by

  f/g_\alpha \leq
    1/\alpha \exp(\alpha^2/2 - \alpha \mu^-)   if \alpha > \mu^-,
    1/\alpha \exp(-(\mu^-)^2/2)                otherwise.
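A hedged R sketch of this scheme (my own implementation) for a standard normal truncated to [mu.minus, infinity), using the translated exponential proposal with rate alpha = (mu.minus + sqrt(mu.minus^2 + 4))/2, for which the acceptance probability simplifies to exp(-(z - alpha)^2/2):

  rtnorm.tail <- function(n, mu.minus) {
    alpha <- (mu.minus + sqrt(mu.minus^2 + 4)) / 2   # exponential rate
    out <- numeric(0)
    while (length(out) < n) {
      z <- mu.minus + rexp(n, alpha)                 # proposal E(alpha, mu.minus)
      u <- runif(n)
      out <- c(out, z[u <= exp(-(z - alpha)^2 / 2)]) # accept with prob f/(M g)
    }
    out[1:n]
  }
  x <- rtnorm.tail(1e4, 3)
  min(x)      # every draw exceeds 3, with no wasted N(0,1) simulations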
Log-concave densities

Log-concave densities (1)
Densities f whose logarithm is concave, for instance Bayesian posterior distributions such that

  \log \pi(\theta \mid x) = \log \pi(\theta) + \log f(x \mid \theta) + c

is concave.
Log-concave densities (2)
Take

  S_n = \{ x_i, \; i = 0, 1, \ldots, n+1 \} \subset supp(f)

such that h(x_i) = \log f(x_i) is known up to the same constant.
By concavity of h, the line L_{i,i+1} through (x_i, h(x_i)) and (x_{i+1}, h(x_{i+1})) is
- below h in [x_i, x_{i+1}] and
- above this graph outside this interval

[Figure: \log f(x) with chords L_{2,3}(x) through points x_1, x_2, x_3, x_4]
Log-concave densities (3)
For x \in [x_i, x_{i+1}], if

  \overline{h}_n(x) = \min\{ L_{i-1,i}(x), L_{i+1,i+2}(x) \} \quad and \quad \underline{h}_n(x) = L_{i,i+1}(x),

the envelopes satisfy

  \underline{h}_n(x) \leq h(x) \leq \overline{h}_n(x)

uniformly on the support of f, with

  \underline{h}_n(x) = -\infty \quad and \quad \overline{h}_n(x) = \min( L_{0,1}(x), L_{n,n+1}(x) )

on [x_0, x_{n+1}]^c.
Log-concave densities (4)
Therefore, if

  \underline{f}_n(x) = \exp \underline{h}_n(x) \quad and \quad \overline{f}_n(x) = \exp \overline{h}_n(x)

then

  \underline{f}_n(x) \leq f(x) \leq \overline{f}_n(x) = \varpi_n g_n(x),

where \varpi_n is the normalizing constant of \overline{f}_n.
ARS Algorithm
1. Initialize n and S_n.
2. Generate X \sim g_n(x), U \sim U_{[0,1]}.
3. If U \leq \underline{f}_n(X) / \varpi_n g_n(X), accept X ;
   otherwise, if U \leq f(X) / \varpi_n g_n(X), accept X.
Example (Northern Pintail ducks)
Ducks captured at time i with probability p_i, with both the p_i and the population size N unknown.
Dataset

  (n_1, \ldots, n_{11}) = (32, 20, 8, 5, 1, 2, 0, 2, 1, 1, 0)

Number of recoveries over the years 1957-1968 of N = 1612 Northern Pintail ducks banded in 1956.
Example (Northern Pintail ducks (2))
Corresponding conditional likelihood

  L(p_1, \ldots, p_I \mid N, n_1, \ldots, n_I) = \frac{N!}{(N-r)!} \prod_{i=1}^{I} p_i^{n_i} (1 - p_i)^{N - n_i},

where I is the number of captures, n_i the number of captured animals during the i-th capture, and r is the total number of different captured animals.
Example (Northern Pintail ducks (3))
Prior selection: if

  N \sim P(\lambda)

and

  \alpha_i = \log\left( \frac{p_i}{1 - p_i} \right) \sim N(\mu_i, \sigma^2),

[Normal logistic]
Example (Northern Pintail ducks (4))
Posterior distribution

  \pi(\alpha, N \mid \lambda, n_1, \ldots, n_I) \propto \frac{N!}{(N-r)!} \frac{\lambda^N}{N!} \prod_{i=1}^{I} (1 + e^{\alpha_i})^{-N} \prod_{i=1}^{I} \exp\left\{ \alpha_i n_i - \frac{1}{2\sigma^2} (\alpha_i - \mu_i)^2 \right\}
Example (Northern Pintail ducks (5))
For the conditional posterior distribution

  \pi(\alpha_i \mid N, n_1, \ldots, n_I) \propto \exp\left\{ \alpha_i n_i - \frac{1}{2\sigma^2} (\alpha_i - \mu_i)^2 \right\} \Big/ (1 + e^{\alpha_i})^{N},

the ARS algorithm can be implemented since

  \alpha_i n_i - \frac{1}{2\sigma^2} (\alpha_i - \mu_i)^2 - N \log(1 + e^{\alpha_i})

is concave in \alpha_i.
[Figure: posterior distributions of capture log-odds ratios for the years 1957-1965]
[Figure: 1960 capture log-odds, true distribution versus histogram of simulated sample]
Monte Carlo Integration

Introduction
Monte Carlo integration
Importance Sampling
Acceleration methods
Bayesian importance sampling

Introduction
Quick reminder
Two major classes of numerical problems that arise in statistical inference:
- Optimization, generally associated with the likelihood approach
- Integration, generally associated with the Bayesian approach
Example (Bayesian decision theory)
Bayes estimators are not always posterior expectations, but rather solutions of the minimization problem

  \min_{\delta} \int_{\Theta} L(\theta, \delta)\, \pi(\theta)\, f(x \mid \theta)\, d\theta.

- Proper loss: for L(\theta, \delta) = (\theta - \delta)^2, the Bayes estimator is the posterior mean
- Absolute error loss: for L(\theta, \delta) = |\theta - \delta|, the Bayes estimator is the posterior median
- With no loss function: use the maximum a posteriori (MAP) estimator

    \arg\max_{\theta} \ell(\theta \mid x)\, \pi(\theta)
Monte Carlo integration

Theme: generic problem of evaluating the integral

  I = E_f[h(X)] = \int_{\mathcal{X}} h(x)\, f(x)\, dx

where X is uni- or multidimensional, f is a closed form, partly closed form, or implicit density, and h is a function.
Monte Carlo integration (2)
Monte Carlo solution: first use a sample (X_1, \ldots, X_m) from the density f to approximate the integral I by the empirical average

  \bar{h}_m = \frac{1}{m} \sum_{j=1}^{m} h(x_j),

which converges,

  \bar{h}_m \longrightarrow E_f[h(X)],

by the Strong Law of Large Numbers.
Monte Carlo precision
Estimate the variance with

  v_m = \frac{1}{m} \cdot \frac{1}{m-1} \sum_{j=1}^{m} [h(x_j) - \bar{h}_m]^2,

and for m large,

  \frac{\bar{h}_m - E_f[h(X)]}{\sqrt{v_m}} \sim N(0, 1).

Note: This can lead to the construction of a convergence test and of confidence bounds on the approximation of E_f[h(X)].
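A small R sketch of the estimate and its error bound (my own illustration), for h(x) = exp(-x^2) and X ~ N(0,1), where the exact value is 1/sqrt(3) = 0.577:

  m <- 1e4
  x <- rnorm(m)
  h <- exp(-x^2)
  hbar <- mean(h)                      # Monte Carlo estimate
  vm <- var(h) / m                     # estimated variance of the average
  hbar + c(-1.96, 1.96) * sqrt(vm)     # 95% confidence bounds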
Example (Cauchy prior/normal sample)
For estimating a normal mean, a robust prior is a Cauchy prior:

  X \sim N(\theta, 1), \quad \theta \sim C(0, 1).

Under squared error loss, the posterior mean is

  \delta^\pi(x) = \int_{-\infty}^{\infty} \frac{\theta}{1 + \theta^2}\, e^{-(x-\theta)^2/2}\, d\theta \Big/ \int_{-\infty}^{\infty} \frac{1}{1 + \theta^2}\, e^{-(x-\theta)^2/2}\, d\theta
Example (Cauchy prior/normal sample (2))
The form of \delta^\pi suggests simulating iid variables

  \theta_1, \ldots, \theta_m \sim N(x, 1)

and calculating

  \hat{\delta}^\pi_m(x) = \sum_{i=1}^{m} \frac{\theta_i}{1 + \theta_i^2} \Big/ \sum_{i=1}^{m} \frac{1}{1 + \theta_i^2}.

The Law of Large Numbers implies

  \hat{\delta}^\pi_m(x) \longrightarrow \delta^\pi(x) \quad as \quad m \to \infty.
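This estimator is three lines of R (my own transcription), e.g. at x = 10:

  x <- 10
  theta <- rnorm(1e4, mean = x)                        # theta_i ~ N(x, 1)
  sum(theta / (1 + theta^2)) / sum(1 / (1 + theta^2))  # delta_m(x)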
[Figure: range of Monte Carlo estimates of \delta^\pi(x) across iterations]

Importance Sampling
Importance sampling

Paradox: simulation from f (the true density) is not necessarily optimal!
An alternative to direct sampling from f is importance sampling, based on the alternative representation

  E_f[h(X)] = \int_{\mathcal{X}} \left[ h(x)\, \frac{f(x)}{g(x)} \right] g(x)\, dx,

which allows us to use other distributions than f.
Importance sampling algorithm
Evaluation of

  E_f[h(X)] = \int_{\mathcal{X}} h(x)\, f(x)\, dx

by
1. generating a sample X_1, \ldots, X_m from a distribution g ;
2. using the approximation

  \frac{1}{m} \sum_{j=1}^{m} \frac{f(X_j)}{g(X_j)}\, h(X_j)
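A compact R sketch (my own illustration): estimating the normal tail probability P(X > 4), with h the indicator of (4, infinity), using an exponential instrumental shifted to 4; direct sampling from f would almost never hit the region:

  m <- 1e5
  x <- 4 + rexp(m)                        # sample from g
  w <- dnorm(x) / dexp(x - 4)             # importance weights f/g
  mean(w)                                 # IS estimate of P(X > 4)
  pnorm(4, lower.tail = FALSE)            # exact value, 3.17e-05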
Same thing as before!!!
Convergence of the estimator:

  \frac{1}{m} \sum_{j=1}^{m} \frac{f(X_j)}{g(X_j)}\, h(X_j) \longrightarrow \int_{\mathcal{X}} h(x)\, f(x)\, dx

for any choice of the distribution g [as long as supp(g) \supset supp(f)].
Important details
- Instrumental distribution g chosen from distributions easy to simulate
- The same sample (generated from g) can be used repeatedly, not only for different functions h, but also for different densities f
- Even dependent proposals can be used, as seen later [PMC chapter]
Although g can be any density, some choices are better than others:
- Finite variance only when

    E_f\left[ h^2(X)\, \frac{f(X)}{g(X)} \right] = \int_{\mathcal{X}} h^2(x)\, \frac{f^2(x)}{g(x)}\, dx < \infty.

- Instrumental distributions with tails lighter than those of f (that is, with sup f/g = \infty) are not appropriate: the weights f(x_j)/g(x_j) vary widely, giving too much importance to a few values x_j.
- If sup f/g = M < \infty, the accept-reject algorithm can be used as well to simulate f directly.
Example (Cauchy target)
Case of a Cauchy distribution C(0, 1) when the importance function is Gaussian N(0, 1).
The ratio of the densities,

  \varrho(x) = \frac{p_\star(x)}{p_0(x)} = \sqrt{2/\pi}\; \frac{\exp\{x^2/2\}}{1 + x^2},

is very badly behaved: e.g.,

  \int_{-\infty}^{\infty} \varrho(x)^2\, p_0(x)\, dx = \infty.

Poor performances of the associated importance sampling estimator.
[Figure: range and average of 500 replications of the IS estimate of E[exp X] over 10,000 iterations]
Optimal importance function
The choice of g that minimizes the variance of the importance sampling estimator is

  g^*(x) = \frac{|h(x)|\, f(x)}{\int_{\mathcal{Z}} |h(z)|\, f(z)\, dz}.

Rather formal optimality result, since the optimal choice of g^*(x) requires the knowledge of I, the integral of interest!
Practical impact

  \sum_{j=1}^{m} h(X_j)\, f(X_j)/g(X_j) \Big/ \sum_{j=1}^{m} f(X_j)/g(X_j),

where f and g are known up to constants.
- Also converges to I by the Strong Law of Large Numbers.
- Biased, but the bias is quite small; in some settings it beats the unbiased estimator in squared error loss.
Using the optimal solution does not always work:

  \frac{\sum_{j=1}^{m} h(x_j)\, f(x_j) / |h(x_j)| f(x_j)}{\sum_{j=1}^{m} f(x_j) / |h(x_j)| f(x_j)} = \frac{\#\{positive\; h\} - \#\{negative\; h\}}{\sum_{j=1}^{m} 1/|h(x_j)|}
Self-normalised importance sampling
For the ratio estimator

  \delta^n_h = \sum_{i=1}^{n} \omega_i\, h(x_i) \Big/ \sum_{i=1}^{n} \omega_i

with X_i \sim g(y) and W_i such that

  E[W_i \mid X_i = x] = \kappa\, f(x)/g(x)

for some constant \kappa.
Self-normalised variance
Then

  var(\delta^n_h) \approx \frac{1}{n^2 \kappa^2} \left\{ var(S^n_h) - 2\, E^\pi[h]\, cov(S^n_h, S^n_1) + E^\pi[h]^2\, var(S^n_1) \right\},

for

  S^n_h = \sum_{i=1}^{n} W_i\, h(X_i), \qquad S^n_1 = \sum_{i=1}^{n} W_i.

Rough approximation:

  var(\delta^n_h) \approx \frac{1}{n}\, var_\pi(h(X)) \left\{ 1 + var_g(W) \right\}
Example (Student's t distribution)
X \sim T(\nu, \theta, \sigma^2), with density

  f_\nu(x) = \frac{\Gamma((\nu+1)/2)}{\sigma \sqrt{\nu\pi}\, \Gamma(\nu/2)} \left( 1 + \frac{(x-\theta)^2}{\nu \sigma^2} \right)^{-(\nu+1)/2}.

Without loss of generality, take \theta = 0, \sigma = 1.
Problem: calculate the integral

  \int_{2.1}^{\infty} \left( \frac{\sin(x)}{x} \right)^n f_\nu(x)\, dx.
Example (Student's t distribution (2))
Simulation possibilities (compared in the R sketch below):
- Directly from f_\nu, since f_\nu is the density of N(0,1)/\sqrt{\chi^2_\nu/\nu}
- Importance sampling using a Cauchy C(0, 1)
- Importance sampling using a normal N(0, 1) (expected to be nonoptimal)
- Importance sampling using a U([0, 1/2.1]) change of variables
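A minimal R comparison of the first three strategies (my own sketch, with the arbitrary choices nu = 12 and n = 3):

  nu <- 12; n <- 3; m <- 1e5
  h <- function(x) (sin(x)/x)^n * (x > 2.1)
  x1 <- rt(m, df = nu)                             # direct sampling from f
  x2 <- rcauchy(m); w2 <- dt(x2, nu) / dcauchy(x2) # Cauchy instrumental
  x3 <- rnorm(m);  w3 <- dt(x3, nu) / dnorm(x3)    # normal instrumental (unstable weights)
  c(mean(h(x1)), mean(h(x2) * w2), mean(h(x3) * w3))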
[Figure: convergence of the estimators over 50,000 iterations: sampling from f (solid lines), importance sampling with Cauchy instrumental (short dashes), U([0, 1/2.1]) instrumental (long dashes) and normal instrumental (dots)]
IS suffers from the curse of dimensionality
As the dimension increases, the discrepancy between the importance and the target distributions worsens.

Explanation:
Take a target distribution \mu and an instrumental distribution \nu.
Simulation of iid samples x_{1:n} of size n from \mu_n = \mu^{\otimes n}.
Importance sampling estimator for \mu_n(f_n) = \int f_n(x_{1:n})\, \mu_n(dx_{1:n}):

  \hat{\mu}_n(f_n) = \sum_{i=1}^{N} f_n(\xi^i_{1:n}) \prod_{j=1}^{n} W^i_j \Big/ \sum_{i=1}^{N} \prod_{j=1}^{n} W^i_j,

where W^i_k = (d\mu/d\nu)(\xi^i_k), and the \xi^i_j are iid with distribution \nu.
For \{V_k\}_{k \geq 0} a sequence of iid nonnegative random variables and for n \geq 1, \mathcal{F}_n = \sigma(V_k; k \leq n), set

  U_n = \prod_{k=1}^{n} V_k
IS suffers (2)
Since E[V_{n+1}] = 1 and V_{n+1} is independent from \mathcal{F}_n,

  E(U_{n+1} \mid \mathcal{F}_n) = U_n\, E(V_{n+1} \mid \mathcal{F}_n) = U_n,

and thus \{U_n\}_{n \geq 0} is a martingale.
Since x \mapsto \sqrt{x} is concave, by Jensen's inequality,

  E(\sqrt{U_{n+1}} \mid \mathcal{F}_n) \leq \sqrt{E(U_{n+1} \mid \mathcal{F}_n)} \leq \sqrt{U_n}

and thus \{\sqrt{U_n}\}_{n \geq 0} is a supermartingale.
Assume E(\sqrt{V_{n+1}}) < 1. Then

  E(\sqrt{U_n}) = \prod_{k=1}^{n} E(\sqrt{V_k}) \to 0, \quad n \to \infty.
IS suffers (3)
But \{\sqrt{U_n}\}_{n \geq 0} is a nonnegative supermartingale and thus \sqrt{U_n} converges a.s. to a random variable Z \geq 0. By Fatou's lemma,

  E(Z) = E\left( \lim_{n \to \infty} \sqrt{U_n} \right) \leq \liminf_{n \to \infty} E(\sqrt{U_n}) = 0.

Hence, Z = 0 and U_n \to 0 a.s., which implies that the martingale \{U_n\}_{n \geq 0} is not regular.
Apply these results to V_k = (d\mu/d\nu)(\xi^i_k), i \in \{1, \ldots, N\}:

  E\left( \sqrt{ \frac{d\mu}{d\nu}(\xi^i_k) } \right) \leq \sqrt{ E\left( \frac{d\mu}{d\nu}(\xi^i_k) \right) } = 1,

with equality iff d\mu/d\nu = 1, \nu-a.e., i.e. \mu = \nu.
Thus all importance weights converge to 0.
Too volatile!

Example (Stochastic volatility model)

  y_t = \exp(x_t/2)\, \epsilon_t, \qquad \epsilon_t \sim N(0, 1)

with AR(1) log-variance process (or volatility)

  x_{t+1} = \varphi x_t + \sigma u_t, \qquad u_t \sim N(0, 1)
[Figure: evolution of IBM stocks (corrected from trend and log-ratio-ed) over 500 days]
Example (Stochastic volatility model (2))
Observed likelihood unavailable in closed form.
The joint posterior (or conditional) distribution of the hidden state sequence \{X_k\}_{1 \leq k \leq K} can be evaluated explicitly as

  \prod_{k=2}^{K} \exp - \left\{ \sigma^{-2} (x_k - \varphi x_{k-1})^2 + \exp(-x_k)\, y_k^2 + x_k \right\} / 2, \qquad (2)

up to a normalizing constant.
Computational problems
Example (Stochastic volatility model (3))
Direct simulation from this distribution impossible because of
(a) dependence among the $X_k$'s,
(b) dimension of the sequence $\{X_k\}_{1 \le k \le K}$, and
(c) exponential term $\exp(-x_k) y_k^2$ within (2).
Markov Chain Monte Carlo Methods
Monte Carlo Integration
Importance Sampling
Importance sampling
Example (Stochastic volatility model (4))
Natural candidate: replace the exponential term with a quadratic approximation to preserve Gaussianity.
E.g., expand $\exp(-x_k)$ around its conditional expectation $\varphi x_{k-1}$ as
$$\exp(-x_k) \approx \exp(-\varphi x_{k-1}) \left\{ 1 - (x_k - \varphi x_{k-1}) + \frac{1}{2}(x_k - \varphi x_{k-1})^2 \right\}$$
Markov Chain Monte Carlo Methods
Monte Carlo Integration
Importance Sampling
Example (Stochastic volatility model (5))
Corresponding Gaussian importance distribution with mean
$$\mu_k = \frac{\varphi x_{k-1} \left\{ \sigma^{-2} + y_k^2 \exp(-\varphi x_{k-1})/2 \right\} - \left\{ 1 - y_k^2 \exp(-\varphi x_{k-1}) \right\}/2}{\sigma^{-2} + y_k^2 \exp(-\varphi x_{k-1})/2}$$
and variance
$$\sigma_k^2 = \left( \sigma^{-2} + y_k^2 \exp(-\varphi x_{k-1})/2 \right)^{-1}$$
Prior proposal on $X_1$,
$$X_1 \sim \mathcal{N}(0, \sigma^2)$$
Markov Chain Monte Carlo Methods
Monte Carlo Integration
Importance Sampling
Example (Stochastic volatility model (6))
Simulation starts with $X_1$ and proceeds forward to $X_n$, each $X_k$ being generated conditional on $Y_k$ and the previously generated $X_{k-1}$.
Importance weight computed sequentially as the product of
$$\frac{\exp - \left\{ \sigma^{-2} (x_k - \varphi x_{k-1})^2 + \exp(-x_k) y_k^2 + x_k \right\}/2}{\exp - \left\{ (x_k - \mu_k)^2 / 2\sigma_k^2 \right\} \, \sigma_k^{-1}} \qquad (1 \le k \le K)
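A sketch of this sequential sampler (added, not in the original slides), using the proposal mean and variance reconstructed above; the parameter values $\varphi = 0.9$, $\sigma = 0.5$, $K = 100$ and the simulated data are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(2)
    phi, sigma, K, N = 0.9, 0.5, 100, 10_000

    # simulate data from the model
    x_true = np.zeros(K)
    for k in range(1, K):
        x_true[k] = phi * x_true[k - 1] + sigma * rng.standard_normal()
    y = np.exp(x_true / 2) * rng.standard_normal(K)

    x = rng.normal(0.0, sigma, N)          # prior proposal on X_1
    logw = np.zeros(N)
    for k in range(1, K):
        prec = sigma**-2 + y[k]**2 * np.exp(-phi * x) / 2
        mu = (phi * x * prec - (1 - y[k]**2 * np.exp(-phi * x)) / 2) / prec
        s2 = 1 / prec
        xn = mu + np.sqrt(s2) * rng.standard_normal(N)
        # log target increment minus log proposal density (constants cancel)
        logw += -(sigma**-2 * (xn - phi * x)**2
                  + np.exp(-xn) * y[k]**2 + xn) / 2
        logw -= -(xn - mu)**2 / (2 * s2) - 0.5 * np.log(s2)
        x = xn

    w = np.exp(logw - logw.max())
    print("effective sample size:", w.sum()**2 / (w**2).sum())

The collapse of the effective sample size as K grows is exactly the weight degeneracy discussed above.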
Markov Chain Monte Carlo Methods
Monte Carlo Integration
Importance Sampling
weights
[Figure] Histogram of the logarithms of the importance weights (left) and comparison between the true volatility and the best fit, based on 10,000 simulated importance samples.
Markov Chain Monte Carlo Methods
Monte Carlo Integration
Importance Sampling
[Figure] Corresponding range of the simulated $\{X_k\}_{1 \le k \le 100}$, compared with the true value.
Markov Chain Monte Carlo Methods
Monte Carlo Integration
Acceleration methods
Correlated simulations
Negative correlation reduces variance
Special technique, but efficient when it applies
Two samples $(X_1, \ldots, X_m)$ and $(Y_1, \ldots, Y_m)$ from $f$ to estimate
$$I = \int_{\mathbb{R}} h(x) f(x) dx$$
by
$$\hat{I}_1 = \frac{1}{m} \sum_{i=1}^m h(X_i) \quad\text{and}\quad \hat{I}_2 = \frac{1}{m} \sum_{i=1}^m h(Y_i)$$
with mean $I$ and variance $\sigma^2$
Markov Chain Monte Carlo Methods
Monte Carlo Integration
Acceleration methods
Variance reduction
Variance of the average
$$\mathrm{var}\left( \frac{\hat{I}_1 + \hat{I}_2}{2} \right) = \frac{\sigma^2}{2} + \frac{1}{2} \, \mathrm{cov}(\hat{I}_1, \hat{I}_2) .$$
If the two samples are negatively correlated,
$$\mathrm{cov}(\hat{I}_1, \hat{I}_2) \le 0 ,$$
they improve on two independent samples of the same size
Markov Chain Monte Carlo Methods
Monte Carlo Integration
Acceleration methods
Antithetic variables
• If $f$ symmetric about $\mu$, take $Y_i = 2\mu - X_i$
• If $X_i = F^{-1}(U_i)$, take $Y_i = F^{-1}(1 - U_i)$ (see the numerical sketch below)
• If $(A_i)_i$ partition of $\mathcal{X}$, partitioned sampling by sampling the $X_j$'s in each $A_i$ (requires to know $\Pr(A_i)$)
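A minimal check of the second construction (added, not part of the slides); the target $E[\exp(X)]$ with $X \sim \mathcal{N}(0,1)$ is an illustrative assumption.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(3)
    reps, m = 500, 1000
    u = rng.uniform(size=(reps, m))
    x = stats.norm.ppf(u)            # X_i = F^{-1}(U_i)
    y = stats.norm.ppf(1 - u)        # antithetic Y_i = F^{-1}(1 - U_i) = -X_i here

    I_indep = np.exp(x).mean(axis=1)                      # m evaluations
    I_anti = (np.exp(x[:, :m // 2]).mean(axis=1)
              + np.exp(y[:, :m // 2]).mean(axis=1)) / 2   # also m evaluations
    print("var independent:", I_indep.var())
    print("var antithetic: ", I_anti.var())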
Markov Chain Monte Carlo Methods
Monte Carlo Integration
Acceleration methods
Control variates
out of control!
For
$$I = \int h(x) f(x) dx$$
unknown and
$$I_0 = \int h_0(x) f(x) dx$$
known, $I_0$ estimated by $\hat{I}_0$ and $I$ estimated by $\hat{I}$
Markov Chain Monte Carlo Methods
Monte Carlo Integration
Acceleration methods
Control variates (2)
Combined estimator
$$\hat{I}^* = \hat{I} + \beta (\hat{I}_0 - I_0)$$
$\hat{I}^*$ is unbiased for $I$ and
$$\mathrm{var}(\hat{I}^*) = \mathrm{var}(\hat{I}) + \beta^2 \, \mathrm{var}(\hat{I}_0) + 2\beta \, \mathrm{cov}(\hat{I}, \hat{I}_0)$$
Markov Chain Monte Carlo Methods
Monte Carlo Integration
Acceleration methods
Optimal control
Optimal choice of $\beta$:
$$\beta^* = - \frac{\mathrm{cov}(\hat{I}, \hat{I}_0)}{\mathrm{var}(\hat{I}_0)} ,$$
with
$$\mathrm{var}(\hat{I}^*) = (1 - \rho^2) \, \mathrm{var}(\hat{I}) ,$$
where $\rho$ is the correlation between $\hat{I}$ and $\hat{I}_0$.
Usual solution: regression coefficient of $h(x_i)$ over $h_0(x_i)$
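A short sketch of the regression choice of $\beta$ (added, not part of the slides); the pair $h(x) = e^x$, $h_0(x) = x$ with $I_0 = 0$ under $\mathcal{N}(0,1)$ is an illustrative assumption.

    import numpy as np

    rng = np.random.default_rng(4)
    x = rng.standard_normal(10_000)
    h, h0 = np.exp(x), x

    beta = -np.cov(h, h0)[0, 1] / h0.var()   # beta* = -cov/var, estimated
    I_plain = h.mean()
    I_cv = h.mean() + beta * (h0.mean() - 0.0)
    print(I_plain, I_cv)   # both near exp(1/2) ~ 1.6487, I_cv with smaller variance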
Markov Chain Monte Carlo Methods
Monte Carlo Integration
Acceleration methods
Example (Quantile Approximation)
Evaluate
$$\varrho = \Pr(X > a) = \int_a^\infty f(x) dx$$
by
$$\hat\varrho = \frac{1}{n} \sum_{i=1}^n \mathbb{I}(X_i > a) ,$$
with $X_i$ iid $f$.
If $\Pr(X > \mu) = \frac{1}{2}$ known
Markov Chain Monte Carlo Methods
Monte Carlo Integration
Acceleration methods
Example (Quantile Approximation (2))
Control variate
$$\tilde\varrho = \frac{1}{n} \sum_{i=1}^n \mathbb{I}(X_i > a) + \beta \left( \frac{1}{n} \sum_{i=1}^n \mathbb{I}(X_i > \mu) - \Pr(X > \mu) \right)$$
improves upon $\hat\varrho$ if
$$\beta < 0 \quad\text{and}\quad |\beta| < 2 \, \frac{\mathrm{cov}(\hat\varrho, \hat\varrho_0)}{\mathrm{var}(\hat\varrho_0)} \simeq 2 \, \frac{\Pr(X > a)}{\Pr(X > \mu)} .$$
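A numerical sketch of this control variate (added, not from the slides); the normal case with $a = 2$ and $\mu = 0$ is an illustrative assumption, and $\beta$ is set near the optimal value $-\mathrm{cov}/\mathrm{var} \approx -2\Pr(X > a)$.

    import numpy as np

    rng = np.random.default_rng(5)
    a, n, reps = 2.0, 10_000, 500
    x = rng.standard_normal((reps, n))

    rho_hat = (x > a).mean(axis=1)
    rho_0 = (x > 0.0).mean(axis=1)
    beta = -2 * 0.0228                   # ~ optimal beta* = -cov/var here
    rho_cv = rho_hat + beta * (rho_0 - 0.5)
    print("var plain:", rho_hat.var(), "var with control:", rho_cv.var())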
Markov Chain Monte Carlo Methods
Monte Carlo Integration
Acceleration methods
Integration by conditioning
Use the Rao-Blackwell Theorem:
$$\mathrm{var}(E[\delta(X) \mid Y]) \le \mathrm{var}(\delta(X))$$
Markov Chain Monte Carlo Methods
Monte Carlo Integration
Acceleration methods
Consequence
If $\hat{I}$ is an unbiased estimator of $I = E_f[h(X)]$, with $X$ simulated from a joint density $\tilde{f}(x, y)$, where
$$\int \tilde{f}(x, y) dy = f(x) ,$$
the estimator
$$\hat{I}^* = E_{\tilde{f}}[\hat{I} \mid Y_1, \ldots, Y_n]$$
dominates $\hat{I}(X_1, \ldots, X_n)$ variance-wise (and is unbiased)
Markov Chain Monte Carlo Methods
Monte Carlo Integration
Acceleration methods
skip expectation
Example (Student's t expectation)
For
$$E[h(X)] = E[\exp(-X^2)] \quad\text{with}\quad X \sim \mathcal{T}(\nu, 0, \sigma^2)$$
a Student's t distribution can be simulated as
$$X \mid y \sim \mathcal{N}(\mu, \sigma^2 y) \quad\text{and}\quad Y^{-1} \sim \chi^2_\nu / \nu .$$
Markov Chain Monte Carlo Methods
Monte Carlo Integration
Acceleration methods
Example (Student's t expectation (2))
Empirical distribution
$$\frac{1}{m} \sum_{j=1}^m \exp(-X_j^2) ,$$
can be improved from the joint sample
$$((X_1, Y_1), \ldots, (X_m, Y_m))$$
since
$$\frac{1}{m} \sum_{j=1}^m E[\exp(-X^2) \mid Y_j] = \frac{1}{m} \sum_{j=1}^m \frac{1}{\sqrt{2\sigma^2 Y_j + 1}}$$
is the conditional expectation.
In this example, precision ten times better
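A sketch of the two estimators (added, not in the slides), using the mixture representation above with $\mu = 0$; $(\nu, \sigma^2, m) = (4.6, 1, 10\,000)$ follows the figure's setting and is otherwise an assumption.

    import numpy as np

    rng = np.random.default_rng(6)
    nu, sigma2, m = 4.6, 1.0, 10_000
    y = nu / rng.chisquare(nu, m)              # Y with 1/Y ~ chi^2_nu / nu
    x = rng.normal(0.0, np.sqrt(sigma2 * y))   # X | Y ~ N(0, sigma^2 Y)

    emp = np.exp(-x**2).mean()                          # empirical average
    rao = (1 / np.sqrt(2 * sigma2 * y + 1)).mean()      # E[exp(-X^2) | Y] averaged
    print(emp, rao)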
Markov Chain Monte Carlo Methods
Monte Carlo Integration
Acceleration methods


[Figure] Estimators of $E[\exp(-X^2)]$: empirical average (full) and conditional expectation (dotted) for $(\nu, \mu, \sigma) = (4.6, 0, 1)$.
Markov Chain Monte Carlo Methods
Monte Carlo Integration
Bayesian importance sampling
Bayesian importance functions
directly Markovian
Recall algorithm:
1. Generate $\theta^{(1)}, \ldots, \theta^{(T)}$ from $cg(\theta)$, with
$$c^{-1} = \int g(\theta) d\theta$$
2. Take
$$\int f(x \mid \theta) \pi(\theta) d\theta \approx \frac{1}{T} \sum_{t=1}^T f(x \mid \theta^{(t)}) \frac{\pi(\theta^{(t)})}{c \, g(\theta^{(t)})} \approx \sum_{t=1}^T f(x \mid \theta^{(t)}) \frac{\pi(\theta^{(t)})}{g(\theta^{(t)})} \Bigg/ \sum_{t=1}^T \frac{\pi(\theta^{(t)})}{g(\theta^{(t)})} = m^{IS}(x)$$
Markov Chain Monte Carlo Methods
Monte Carlo Integration
Bayesian importance sampling
Choice of g
• $g(\theta) = \pi(\theta)$:
$$m^{IS}(x) = \frac{1}{T} \sum_t f(x \mid \theta^{(t)})$$
often inefficient if data informative
impossible if $\pi$ is improper
• $g(\theta) \propto f(x \mid \theta) \pi(\theta)$:
$c$ unknown
$$m^{IS}(x) = 1 \Bigg/ \frac{1}{T} \sum_{t=1}^T \frac{1}{f(x \mid \theta^{(t)})}$$
improper priors allowed
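A toy comparison of the two choices (added, not in the slides): the conjugate model $x \sim \mathcal{N}(\theta, 1)$, $\theta \sim \mathcal{N}(0, 1)$ is an illustrative assumption, chosen because $m(x)$ is then a $\mathcal{N}(0, 2)$ density and the answer is known.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(7)
    x_obs, T = 1.3, 100_000
    lik = lambda th: stats.norm.pdf(x_obs, loc=th)

    # g = prior: average the likelihood over prior draws
    th_prior = rng.standard_normal(T)
    m_prior = lik(th_prior).mean()

    # g proportional to posterior (here N(x/2, 1/2)): harmonic mean of the likelihood
    th_post = rng.normal(x_obs / 2, np.sqrt(0.5), T)
    m_harm = 1 / (1 / lik(th_post)).mean()

    print(m_prior, m_harm, stats.norm.pdf(x_obs, scale=np.sqrt(2)))

The harmonic mean version is notoriously unstable when the likelihood can get close to zero under the posterior; this toy case hides that weakness.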
Markov Chain Monte Carlo Methods
Monte Carlo Integration
Bayesian importance sampling
• $g(\theta) = \alpha \pi(\theta) + (1 - \alpha) \pi(\theta \mid x)$:
defensive mixture, $\alpha \ll 1$ Ok
[Hesterberg, 1998]
• $g(\theta) = \pi(\theta \mid x)$:
$$\hat m_h(x) = 1 \Bigg/ \frac{1}{T} \sum_{t=1}^T \frac{h(\theta^{(t)})}{f(x \mid \theta^{(t)}) \pi(\theta^{(t)})}$$
works for any density $h$
finite variance if
$$\int \frac{h^2(\theta)}{f(x \mid \theta) \pi(\theta)} d\theta < \infty$$
Markov Chain Monte Carlo Methods
Monte Carlo Integration
Bayesian importance sampling
Bridge sampling
[Chen & Shao, 1997]
Given two models $f_1(x \mid \theta_1)$ and $f_2(x \mid \theta_2)$,
$$\pi_1(\theta_1 \mid x) = \frac{\pi_1(\theta_1) f_1(x \mid \theta_1)}{m_1(x)} \qquad \pi_2(\theta_2 \mid x) = \frac{\pi_2(\theta_2) f_2(x \mid \theta_2)}{m_2(x)}$$
Bayes factor:
$$B_{12}(x) = \frac{m_1(x)}{m_2(x)}$$
ratio of normalising constants
Markov Chain Monte Carlo Methods
Monte Carlo Integration
Bayesian importance sampling
Bridge sampling (2)
(i) Missing normalising constants:
$$\tilde\pi_1(\theta \mid x) \propto \pi_1(\theta) f_1(x \mid \theta) \qquad \tilde\pi_2(\theta \mid x) \propto \pi_2(\theta) f_2(x \mid \theta)$$
$$B_{12} \approx \frac{1}{n} \sum_{i=1}^n \frac{\tilde\pi_1(\theta_i \mid x)}{\tilde\pi_2(\theta_i \mid x)} \qquad \theta_i \sim \pi_2(\cdot \mid x)$$
Markov Chain Monte Carlo Methods
Monte Carlo Integration
Bayesian importance sampling
Bridge sampling (3)
(ii) Still missing normalising constants: for any $\alpha(\cdot)$,
$$B_{12} = \frac{\int \tilde\pi_1(\theta) \alpha(\theta) \pi_2(\theta \mid x) d\theta}{\int \tilde\pi_2(\theta) \alpha(\theta) \pi_1(\theta \mid x) d\theta} \approx \frac{\dfrac{1}{n_2} \displaystyle\sum_{i=1}^{n_2} \tilde\pi_1(\theta_{2i}) \alpha(\theta_{2i})}{\dfrac{1}{n_1} \displaystyle\sum_{i=1}^{n_1} \tilde\pi_2(\theta_{1i}) \alpha(\theta_{1i})} \qquad \theta_{ji} \sim \pi_j(\cdot \mid x)$$
Markov Chain Monte Carlo Methods
Monte Carlo Integration
Bayesian importance sampling
Bridge sampling (4)
Optimal choice
$$\alpha^\star(\theta) = \frac{n_1 + n_2}{n_1 \pi_1(\theta \mid x) + n_2 \pi_2(\theta \mid x)}$$
[Chen, Meng & Wong, 2000]
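A minimal bridge-sampling sketch (added, not in the slides): the two unnormalised Gaussian targets and the simple (suboptimal) bridge function are illustrative assumptions; the true ratio of normalising constants here is 1.

    import numpy as np

    rng = np.random.default_rng(8)
    n1 = n2 = 50_000
    p1t = lambda x: np.exp(-x**2 / 2)            # unnormalised pi1
    p2t = lambda x: np.exp(-(x - 1)**2 / 2)      # unnormalised pi2
    x1 = rng.standard_normal(n1)                 # draws from pi1
    x2 = 1 + rng.standard_normal(n2)             # draws from pi2

    alpha = lambda x: 1 / (p1t(x) + p2t(x))      # simple bridge, not the optimum
    B12 = (p1t(x2) * alpha(x2)).mean() / (p2t(x1) * alpha(x1)).mean()
    print(B12)   # close to 1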
Markov Chain Monte Carlo Methods
Notions on Markov Chains
Notions on Markov Chains
Notions on Markov Chains
Basics
Irreducibility
Transience and Recurrence
Invariant measures
Ergodicity and convergence
Limit theorems
Quantitative convergence rates
Coupling
Renewal and CLT
Markov Chain Monte Carlo Methods
Notions on Markov Chains
Basics
Basics
Definition (Markov chain)
A sequence of random variables whose distribution evolves over time as a function of past realizations
Chain defined through its transition kernel, a function K defined on $\mathcal{X} \times \mathcal{B}(\mathcal{X})$ such that
• $\forall x \in \mathcal{X}$, $K(x, \cdot)$ is a probability measure;
• $\forall A \in \mathcal{B}(\mathcal{X})$, $K(\cdot, A)$ is measurable.
Markov Chain Monte Carlo Methods
Notions on Markov Chains
Basics
no discrete
When $\mathcal{X}$ is a discrete (finite or denumerable) set, the transition kernel simply is a (transition) matrix K with elements
$$P_{xy} = \Pr(X_n = y \mid X_{n-1} = x) , \qquad x, y \in \mathcal{X}$$
Since, for all $x \in \mathcal{X}$, $K(x, \cdot)$ is a probability, we must have
$$P_{xy} \ge 0 \quad\text{and}\quad K(x, \mathcal{X}) = \sum_{y \in \mathcal{X}} P_{xy} = 1$$
The matrix K is referred to as a Markov transition matrix or a stochastic matrix
Markov Chain Monte Carlo Methods
Notions on Markov Chains
Basics
In the continuous case, the kernel also denotes the conditional density $K(x, x')$ of the transition $K(x, \cdot)$:
$$\Pr(X \in A \mid x) = \int_A K(x, x') dx' .$$
Then, for any bounded $\varphi$, we may define
$$K\varphi(x) = \int_{\mathcal{X}} K(x, dy) \varphi(y) .$$
Note that
$$|K\varphi(x)| \le \int_{\mathcal{X}} K(x, dy) |\varphi(y)| \le \|\varphi\|_\infty = \sup_{x \in \mathcal{X}} |\varphi(x)| .$$
We may also associate to a probability measure $\mu$ the measure $\mu K$, defined as
$$\mu K(A) = \int_{\mathcal{X}} \mu(dx) K(x, A) .$$
Markov Chain Monte Carlo Methods
Notions on Markov Chains
Basics
Markov chains
skip denition
Given a transition kernel K, a sequence $X_0, X_1, \ldots, X_n, \ldots$ of random variables is a Markov chain, denoted by $(X_n)$, if, for any $t$, the conditional distribution of $X_t$ given $x_{t-1}, x_{t-2}, \ldots, x_0$ is the same as the distribution of $X_t$ given $x_{t-1}$. That is,
$$\Pr(X_{k+1} \in A \mid x_0, x_1, x_2, \ldots, x_k) = \Pr(X_{k+1} \in A \mid x_k) = \int_A K(x_k, dx)$$
Markov Chain Monte Carlo Methods
Notions on Markov Chains
Basics
Note that the entire structure of the chain only depends on
• The transition function K
• The initial state $x_0$ or initial distribution $X_0 \sim \mu$
Markov Chain Monte Carlo Methods
Notions on Markov Chains
Basics
Example (Random walk)
The normal random walk is the kernel $K(x, \cdot)$ associated with the distribution
$$\mathcal{N}_p(x, \tau^2 I_p)$$
which means
$$X_{t+1} = X_t + \tau \epsilon_t$$
$\epsilon_t$ being an iid additional noise
Markov Chain Monte Carlo Methods
Notions on Markov Chains
Basics
[Figure] 100 consecutive realisations of the random walk in $\mathbb{R}^2$ with $\tau = 1$
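A two-line sketch reproducing such a path (added, not in the slides):

    import numpy as np

    rng = np.random.default_rng(9)
    tau, T = 1.0, 100
    steps = tau * rng.standard_normal((T, 2))
    path = np.cumsum(steps, axis=0)        # X_{t+1} = X_t + tau * eps_t
    print(path[:5])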
Markov Chain Monte Carlo Methods
Notions on Markov Chains
Basics
bypass remarks
On a discrete state-space $\mathcal{X} = \{x_0, x_1, \ldots\}$,
• A function $\varphi$ on a discrete state space is uniquely defined by the (column) vector $\varphi = (\varphi(x_0), \varphi(x_1), \ldots)^T$ and
$$K\varphi(x) = \sum_{y \in \mathcal{X}} P_{xy} \varphi(y)$$
can be interpreted as the $x$th component of the product of the transition matrix K and of the vector $\varphi$.
• A probability distribution on $\mathcal{P}(\mathcal{X})$ is defined as a (row) vector $\mu = (\mu(x_0), \mu(x_1), \ldots)$ and the probability distribution $\mu K$ is defined, for each $y \in \mathcal{X}$, as
$$\mu K(y) = \sum_{x \in \mathcal{X}} \mu(x) P_{xy} ,$$
the $y$th component of the product of the vector $\mu$ and of the transition matrix K.
Markov Chain Monte Carlo Methods
Notions on Markov Chains
Basics
Composition of kernels
Let $Q_1$ and $Q_2$ be two probability kernels. Define, for any $x \in \mathcal{X}$ and any $A \in \mathcal{B}(\mathcal{X})$, the product of kernels $Q_1 Q_2$ as
$$Q_1 Q_2(x, A) = \int_{\mathcal{X}} Q_1(x, dy) Q_2(y, A)$$
When the state space $\mathcal{X}$ is discrete, the product of Markov kernels coincides with the product of matrices $Q_1 \times Q_2$.
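A two-state illustration of this matrix view (added, not in the slides):

    import numpy as np

    Q1 = np.array([[0.9, 0.1],
                   [0.2, 0.8]])
    Q2 = np.array([[0.5, 0.5],
                   [0.3, 0.7]])

    Q = Q1 @ Q2                      # "one step of Q1 then one step of Q2"
    mu = np.array([1.0, 0.0])        # initial distribution (row vector)
    print(Q)                         # rows of Q sum to 1
    print(mu @ Q1)                   # mu Q1 is again a distribution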
Markov Chain Monte Carlo Methods
Notions on Markov Chains
Irreducibility
Irreducibility
Irreducibility is one measure of the sensitivity of the Markov chain to initial conditions
It leads to a guarantee of convergence for MCMC algorithms
Definition (Irreducibility)
In the discrete case, the chain is irreducible if all states communicate, namely if
$$P_x(\tau_y < \infty) > 0 , \qquad \forall x, y \in \mathcal{X} ,$$
$\tau_y$ being the first (positive) time $y$ is visited
Markov Chain Monte Carlo Methods
Notions on Markov Chains
Irreducibility
Irreducibility for a continuous chain
In the continuous case, the chain is $\psi$-irreducible for some measure $\psi$ if, for some $n$,
$$K^n(x, A) > 0$$
• for all $x \in \mathcal{X}$
• for every $A \in \mathcal{B}(\mathcal{X})$ with $\psi(A) > 0$
Markov Chain Monte Carlo Methods
Notions on Markov Chains
Irreducibility
Minoration condition
Assume there exist a probability measure $\nu$ and $\epsilon > 0$ such that, for all $x \in \mathcal{X}$ and all $A \in \mathcal{B}(\mathcal{X})$,
$$K(x, A) \ge \epsilon \nu(A)$$
This is called a minoration condition.
When K is a Markov chain on a discrete state space, this is equivalent to saying that $P_{xy} > 0$ for all $x, y \in \mathcal{X}$.
Markov Chain Monte Carlo Methods
Notions on Markov Chains
Irreducibility
Small sets
Definition (Small set)
If there exist $C \in \mathcal{B}(\mathcal{X})$ with $\psi(C) > 0$, a probability measure $\nu$ and $\epsilon > 0$ such that, for all $x \in C$ and all $A \in \mathcal{B}(\mathcal{X})$,
$$K(x, A) \ge \epsilon \nu(A) ,$$
C is called a small set
For discrete state spaces, atoms are small sets.
Markov Chain Monte Carlo Methods
Notions on Markov Chains
Transience and Recurrence
Towards further stability
• Irreducibility: every set A has a chance to be visited by the Markov chain $(X_n)$
• This property is too weak to ensure that the trajectory of $(X_n)$ will enter A often enough.
• A Markov chain must enjoy good stability properties to guarantee an acceptable approximation of the simulated model.
• Formalizing this stability leads to different notions of recurrence
• For discrete chains, the recurrence of a state is equivalent to probability one of sure return.
• Always satisfied for irreducible chains on finite spaces
Markov Chain Monte Carlo Methods
Notions on Markov Chains
Transience and Recurrence
Transience and Recurrence
In a finite state space $\mathcal{X}$, denote the average number of visits to a state $\omega$ by
$$\eta_\omega = \sum_{i=1}^\infty \mathbb{I}_\omega(X_i)$$
• If $E_\omega[\eta_\omega] = \infty$, the state is recurrent
• If $E_\omega[\eta_\omega] < \infty$, the state is transient
For irreducible chains, recurrence/transience is a property of the chain, not of a particular state.
Similar definitions for the continuous case.
Markov Chain Monte Carlo Methods
Notions on Markov Chains
Transience and Recurrence
Harris recurrence
Stronger form of recurrence:
Definition (Harris recurrence)
A set A is Harris recurrent if
$$P_x(\eta_A = \infty) = 1 \quad\text{for all } x \in A .$$
The chain $(X_n)$ is Harris recurrent if it is
• $\psi$-irreducible
• for every set A with $\psi(A) > 0$, A is Harris recurrent.
Note that
$$P_x(\eta_A = \infty) = 1 \quad\text{implies}\quad E_x[\eta_A] = \infty$$
Markov Chain Monte Carlo Methods
Notions on Markov Chains
Invariant measures
Invariant measures
Stability increases for the chain $(X_n)$ if the marginal distribution of $X_n$ is independent of $n$.
Requires the existence of a probability distribution $\pi$ such that
$$X_{n+1} \sim \pi \quad\text{if}\quad X_n \sim \pi$$
Definition (Invariant measure)
A measure $\pi$ is invariant for the transition kernel $K(\cdot, \cdot)$ if
$$\pi(B) = \int_{\mathcal{X}} K(x, B) \, \pi(dx) , \qquad \forall B \in \mathcal{B}(\mathcal{X}) .$$
Markov Chain Monte Carlo Methods
Notions on Markov Chains
Invariant measures
Stability properties and invariance
• The chain is positive recurrent if $\pi$ is a probability measure.
• Otherwise it is null recurrent or transient.
• If $\pi$ is a probability measure, it is also called the stationary distribution, since
$$X_0 \sim \pi \quad\text{implies that}\quad X_n \sim \pi \quad\text{for every } n$$
• The stationary distribution is unique
Notions on Markov Chains
Invariant measures
Insights
no time for that!
Invariant probability measures are important not merely because they define stationary processes, but also because they turn out to be the measures which define the long-term or ergodic behavior of the chain.
To understand why, consider $P_\mu(X_n \in \cdot)$ for a starting distribution $\mu$. If a limiting measure $\gamma_\mu$ exists such that
$$P_\mu(X_n \in A) \to \gamma_\mu(A)$$
for all $A \in \mathcal{B}(\mathcal{X})$, then
$$\gamma_\mu(A) = \lim_{n\to\infty} \int_{\mathcal{X}} \mu(dx) P^n(x, A) = \lim_{n\to\infty} \int_{\mathcal{X}} \int_{\mathcal{X}} \mu(dx) P^{n-1}(x, dw) K(w, A) = \int_{\mathcal{X}} \gamma_\mu(dw) K(w, A) ,$$
since setwise convergence of $\int \mu(dx) P^n(x, \cdot)$ implies convergence of integrals of bounded measurable functions. Hence, if a limiting distribution exists, it is an invariant probability measure; and obviously, if there is a unique invariant probability measure, the limit $\gamma_\mu$ will be independent of $\mu$ whenever it exists.
Markov Chain Monte Carlo Methods
Notions on Markov Chains
Ergodicity and convergence
Ergodicity and convergence
We finally consider: to what is the chain converging?
The invariant distribution $\pi$ is a natural candidate for the limiting distribution.
A fundamental property is ergodicity, or independence of initial conditions. In the discrete case, a state $\omega$ is ergodic if
$$\lim_{n\to\infty} |K^n(\omega, \omega) - \pi(\omega)| = 0 .$$
Markov Chain Monte Carlo Methods
Notions on Markov Chains
Ergodicity and convergence
Norm and convergence
In general, we establish convergence using the total variation norm
$$\|\mu_1 - \mu_2\|_{TV} = \sup_A |\mu_1(A) - \mu_2(A)|$$
and we want
$$\left\| \int K^n(x, \cdot) \mu(dx) - \pi \right\|_{TV} = \sup_A \left| \int K^n(x, A) \mu(dx) - \pi(A) \right|$$
to be small.
skip minoration TV
Markov Chain Monte Carlo Methods
Notions on Markov Chains
Ergodicity and convergence
Total variation distance and minoration
Lemma
Let $\mu$ and $\mu'$ be two probability measures. Then,
$$1 - \inf \sum_i \mu(A_i) \wedge \mu'(A_i) = \|\mu - \mu'\|_{TV} ,$$
where the infimum is taken over all finite partitions $(A_i)_i$ of $\mathcal{X}$.
Markov Chain Monte Carlo Methods
Notions on Markov Chains
Ergodicity and convergence
Total variation distance and minoration (2)
Assume that there exist a probability $\nu$ and $\epsilon > 0$ such that, for all $A \in \mathcal{B}(\mathcal{X})$, we have
$$\mu(A) \wedge \mu'(A) \ge \epsilon \nu(A) .$$
Then, for all $I$ and all partitions $A_1, A_2, \ldots, A_I$,
$$\sum_{i=1}^I \mu(A_i) \wedge \mu'(A_i) \ge \epsilon$$
and the previous result thus implies that
$$\|\mu - \mu'\|_{TV} \le (1 - \epsilon) .$$
Markov Chain Monte Carlo Methods
Notions on Markov Chains
Ergodicity and convergence
Harris recurrence and ergodicity
Theorem
If $(X_n)$ is Harris positive recurrent and aperiodic, then
$$\lim_{n\to\infty} \left\| \int K^n(x, \cdot) \mu(dx) - \pi \right\|_{TV} = 0$$
for every initial distribution $\mu$.
We thus take "Harris positive recurrent and aperiodic" as equivalent to "ergodic".
[Meyn & Tweedie, 1993]
Convergence in total variation implies
$$\lim_{n\to\infty} |E_\mu[h(X_n)] - E_\pi[h(X)]| = 0$$
for every bounded function h.
no detail of convergence
Markov Chain Monte Carlo Methods
Notions on Markov Chains
Ergodicity and convergence
Convergences
There are different speeds of convergence:
• ergodic (fast enough)
• geometrically ergodic (faster)
• uniformly ergodic (fastest)
Markov Chain Monte Carlo Methods
Notions on Markov Chains
Ergodicity and convergence
Geometric ergodicity
A $\psi$-irreducible aperiodic Markov kernel P with invariant distribution $\pi$ is geometrically ergodic if there exist $V \ge 1$ and constants $\rho < 1$, $R < \infty$ such that, for all $n \ge 1$,
$$\|P^n(x, \cdot) - \pi(\cdot)\|_V \le R V(x) \rho^n$$
on $\{V < \infty\}$, which is full and absorbing.
Markov Chain Monte Carlo Methods
Notions on Markov Chains
Ergodicity and convergence
Geometric ergodicity implies a lot of important results:
• CLT for additive functionals $n^{-1/2} \sum_k g(X_k)$ and functions $|g| < V$
• Rosenthal's type inequalities
$$E_x \left| \sum_{k=1}^n g(X_k) \right|^p \le C(p) \, n^{p/2} , \qquad |g| < V , \quad p \ge 2$$
• exponential inequalities (for bounded functions and $\lambda$ small enough)
$$E_x \left[ \exp \left\{ \lambda \sum_{k=1}^n g(X_k) \right\} \right] < \infty$$
Markov Chain Monte Carlo Methods
Notions on Markov Chains
Ergodicity and convergence
Minoration condition and uniform ergodicity
Under the minoration condition, the kernel K is thus contractant and standard results in functional analysis show the existence and the unicity of a fixed point $\pi$. The previous relation implies that, for all $x \in \mathcal{X}$,
$$\|P^n(x, \cdot) - \pi\|_{TV} \le (1 - \epsilon)^n$$
Such Markov chains are called uniformly ergodic.
Markov Chain Monte Carlo Methods
Notions on Markov Chains
Ergodicity and convergence
Uniform ergodicity
Theorem (Uniform ergodicity)
The following conditions are equivalent:
• $(X_n)_n$ is uniformly ergodic,
• there exist $\rho < 1$ and $R < \infty$ such that, for all $x \in \mathcal{X}$,
$$\|P^n(x, \cdot) - \pi\|_{TV} \le R \rho^n ,$$
• for some $n > 0$,
$$\sup_{x \in \mathcal{X}} \|P^n(x, \cdot) - \pi(\cdot)\|_{TV} < 1 .$$
[Meyn and Tweedie, 1993]
Markov Chain Monte Carlo Methods
Notions on Markov Chains
Limit theorems
Limit theorems
Ergodicity determines the probabilistic properties of the average behavior of the chain.
But there is also a need for statistical inference, made by induction from the observed sample.
If $\|P_x^n - \pi\|$ is close to 0, this gives no direct information about the sample path average
$$\frac{1}{n} \sum_{i=1}^n h(X_i)$$
© We need LLNs and CLTs!!!
Classical LLNs and CLTs are not directly applicable due to:
• Markovian dependence structure between the observations $X_i$
• Non-stationarity of the sequence
Markov Chain Monte Carlo Methods
Notions on Markov Chains
Limit theorems
The Theorem
Theorem (Ergodic Theorem)
If the Markov chain $(X_n)$ is Harris recurrent, then for any function h with $E_\pi|h| < \infty$,
$$\lim_{n\to\infty} \frac{1}{n} \sum_i h(X_i) = \int h(x) \, d\pi(x) .$$
Markov Chain Monte Carlo Methods
Notions on Markov Chains
Limit theorems
Central Limit Theorem
To get a CLT, we need more assumptions.
skip conditions and results
For MCMC, the easiest is
Definition (reversibility)
A Markov chain $(X_n)$ is reversible if, for all $n$,
$$X_{n+1} \mid X_{n+2} = x \quad\sim\quad X_{n+1} \mid X_n = x$$
The direction of time does not matter.
Markov Chain Monte Carlo Methods
Notions on Markov Chains
Limit theorems
The CLT
Theorem
If the Markov chain $(X_n)$ is Harris recurrent and reversible,
$$\frac{1}{\sqrt{N}} \left( \sum_{n=1}^N \left( h(X_n) - E_\pi[h] \right) \right) \stackrel{\mathcal{L}}{\longrightarrow} \mathcal{N}(0, \gamma_h^2) ,$$
where, writing $\bar h = h - E_\pi[h]$,
$$0 < \gamma_h^2 = E_\pi[\bar h^2(X_0)] + 2 \sum_{k=1}^\infty E_\pi[\bar h(X_0) \bar h(X_k)] < +\infty .$$
[Kipnis & Varadhan, 1986]
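A practical sketch for estimating $\gamma_h^2$ (added, not in the slides): batch means on a reversible AR(1) chain, an illustrative assumption chosen because $\gamma_h^2 = (1+\rho)/(1-\rho)$ is known for $h(x) = x$.

    import numpy as np

    rng = np.random.default_rng(10)
    rho, N = 0.8, 2**16
    x = np.empty(N); x[0] = 0.0
    for t in range(1, N):                      # reversible AR(1), stationary N(0,1)
        x[t] = rho * x[t - 1] + np.sqrt(1 - rho**2) * rng.standard_normal()

    h = x                                      # h(x) = x, E_pi[h] = 0
    b = int(np.sqrt(N))                        # batch length
    means = h[: (N // b) * b].reshape(-1, b).mean(axis=1)
    gamma2_hat = b * means.var(ddof=1)
    print(gamma2_hat, (1 + rho) / (1 - rho))   # theoretical value = 9 here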
Markov Chain Monte Carlo Methods
Notions on Markov Chains
Quantitative convergence rates
Quantitative convergence rates
skip detailed results
Let P be a Markov transition kernel on $(\mathcal{X}, \mathcal{B}(\mathcal{X}))$, with P positive recurrent and $\pi$ its stationary distribution.
Convergence rate: determine, from the kernel, a sequence $B(\mu, n)$ such that
$$\|\mu P^n - \pi\|_V \le B(\mu, n)$$
where $V : \mathcal{X} \to [1, \infty)$ and, for any signed measure $\mu$,
$$\|\mu\|_V = \sup_{|\phi| \le V} |\mu(\phi)| .$$
Markov Chain Monte Carlo Methods
Notions on Markov Chains
Quantitative convergence rates
Practical purposes?
In the 90's, a wealth of contributions on quantitative bounds was triggered by MCMC algorithms, to answer questions like: what is the appropriate burn-in? or how long should the sampling continue after burn-in?
[Douc, Moulines and Rosenthal, 2001]
[Jones and Hobert, 2001]
Markov Chain Monte Carlo Methods
Notions on Markov Chains
Quantitative convergence rates
Tools at hand
For MCMC algorithms, kernels are explicitly known.
Type of quantities (more or less directly) available:
• Minoration constants
$$K^s(x, A) \ge \epsilon \nu(A) , \qquad \text{for all } x \in C ,$$
• Foster-Lyapunov drift conditions,
$$KV \le \lambda V + b \mathbb{I}_C$$
and the goal is to obtain a bound depending explicitly upon $\epsilon$, $\lambda$, $b$, etc.
Markov Chain Monte Carlo Methods
Notions on Markov Chains
Coupling
Coupling
skip coupling
If $X \sim \mu$ and $X' \sim \mu'$, one can construct two random variables $\tilde{X}$ and $\tilde{X}'$ such that
$$\tilde{X} \sim \mu , \qquad \tilde{X}' \sim \mu' \qquad\text{and}\qquad \tilde{X} = \tilde{X}' \ \text{with probability } \epsilon$$
The basic coupling construction:
• with probability $\epsilon$, draw Z according to $\nu$ and set $\tilde{X} = \tilde{X}' = Z$.
• with probability $1 - \epsilon$, draw $\tilde{X}$ and $\tilde{X}'$ independently under the distributions
$$(\mu - \epsilon\nu)/(1 - \epsilon) \quad\text{and}\quad (\mu' - \epsilon\nu)/(1 - \epsilon) ,$$
respectively.
[Thorisson, 2000]
Markov Chain Monte Carlo Methods
Notions on Markov Chains
Coupling
Coupling inequality
$X, X'$ r.v.'s with probability distributions $K(x, \cdot)$ and $K(x', \cdot)$, respectively, can be coupled with probability $\epsilon$ if
$$K(x, \cdot) \wedge K(x', \cdot) \ge \epsilon \, \nu_{x,x'}(\cdot)$$
where $\nu_{x,x'}$ is a probability measure, or, equivalently,
$$\|K(x, \cdot) - K(x', \cdot)\|_{TV} \le (1 - \epsilon)$$
Define an $\epsilon$-coupling set as a set $\bar{C} \subset \mathcal{X} \times \mathcal{X}$ satisfying:
$$\forall (x, x') \in \bar{C}, \ \forall A \in \mathcal{B}(\mathcal{X}), \qquad K(x, A) \wedge K(x', A) \ge \epsilon \, \nu_{x,x'}(A)$$
Markov Chain Monte Carlo Methods
Notions on Markov Chains
Coupling
Small set and coupling sets
$C \subset \mathcal{X}$ is a small set if there exist $\epsilon > 0$ and a probability measure $\nu$ such that, for all $A \in \mathcal{B}(\mathcal{X})$,
$$K(x, A) \ge \epsilon \nu(A) , \qquad x \in C . \qquad (3)$$
Small sets always exist when the MC is $\psi$-irreducible.
[Jain and Jamieson, 1967]
For MCMC kernels, small sets are in general easy to find.
If C is a small set, then $\bar{C} = C \times C$ is a coupling set:
$$\forall (x, x') \in \bar{C}, \ \forall A \in \mathcal{B}(\mathcal{X}), \qquad K(x, A) \wedge K(x', A) \ge \epsilon \nu(A) .$$
Markov Chain Monte Carlo Methods
Notions on Markov Chains
Coupling
Coupling for Markov chains
$\bar{P}$ Markov transition kernel on $\mathcal{X} \times \mathcal{X}$ such that, for all $(x, x') \notin \bar{C}$ (where $\bar{C}$ is an $\epsilon$-coupling set) and all $A \in \mathcal{B}(\mathcal{X})$:
$$\bar{P}(x, x'; A \times \mathcal{X}) = K(x, A) \quad\text{and}\quad \bar{P}(x, x'; \mathcal{X} \times A) = K(x', A)$$
For example, for $(x, x') \notin \bar{C}$,
$$\bar{P}(x, x'; A \times A') = K(x, A) K(x', A') .$$
For all $(x, x') \in \bar{C}$ and all $A, A' \in \mathcal{B}(\mathcal{X})$, define the residual kernel
$$\bar{R}(x, x'; A \times \mathcal{X}) = (1 - \epsilon)^{-1} \left( K(x, A) - \epsilon \, \nu_{x,x'}(A) \right)$$
$$\bar{R}(x, x'; \mathcal{X} \times A') = (1 - \epsilon)^{-1} \left( K(x', A') - \epsilon \, \nu_{x,x'}(A') \right) .$$
Markov Chain Monte Carlo Methods
Notions on Markov Chains
Coupling
Coupling algorithm
• Initialisation: Let $X_0 \sim \xi$ and $X_0' \sim \xi'$, and set $d_0 = 0$.
• After coupling: If $d_n = 1$, then draw $X_{n+1} \sim K(X_n, \cdot)$ and set $X_{n+1}' = X_{n+1}$.
• Before coupling: If $d_n = 0$ and $(X_n, X_n') \in \bar{C}$,
  – with probability $\epsilon$, draw $X_{n+1} = X_{n+1}' \sim \nu_{X_n, X_n'}$ and set $d_{n+1} = 1$;
  – with probability $1 - \epsilon$, draw $(X_{n+1}, X_{n+1}') \sim \bar{R}(X_n, X_n'; \cdot)$ and set $d_{n+1} = 0$.
• If $d_n = 0$ and $(X_n, X_n') \notin \bar{C}$, then draw $(X_{n+1}, X_{n+1}') \sim \bar{P}(X_n, X_n'; \cdot)$.
$(X_n, X_n', d_n)$ [where $d_n$ is the bell variable which indicates whether the chains have coupled or not] is a Markov chain on $\mathcal{X} \times \mathcal{X} \times \{0, 1\}$.
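A sketch of this construction on a two-state chain (added, not in the slides); here the whole space is a small set, so the chains can attempt to couple at every step.

    import numpy as np

    rng = np.random.default_rng(11)
    K = np.array([[0.7, 0.3],
                  [0.4, 0.6]])
    nu_raw = K.min(axis=0)            # componentwise minimum over rows
    eps = nu_raw.sum()                # minoration constant (0.7 here)
    nu = nu_raw / eps
    R = (K - eps * nu) / (1 - eps)    # residual kernel, still stochastic

    x, xp, T = 0, 1, None
    for n in range(1, 1000):
        if rng.uniform() < eps:       # draw from nu: the chains couple
            x = xp = rng.choice(2, p=nu)
            T = n
            break
        x = rng.choice(2, p=R[x])     # otherwise: independent residual moves
        xp = rng.choice(2, p=R[xp])
    print("coupling time:", T)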
Markov Chain Monte Carlo Methods
Notions on Markov Chains
Coupling
Coupling inequality (again!)
Define the coupling time T as
$$T = \inf\{ k \ge 1 , \ d_k = 1 \}$$
Coupling inequality:
$$\sup_A \left| \xi P^k(A) - \xi' P^k(A) \right| \le P_{\xi, \xi', 0}[T > k]$$
[Pitman, 1976; Lindvall, 1992]
Markov Chain Monte Carlo Methods
Notions on Markov Chains
Coupling
Drift conditions
To exploit the coupling construction, we need to control the hitting time.
Moments of the return time to a set C are most often controlled using a Foster-Lyapunov drift condition:
$$PV \le \lambda V + b \mathbb{I}_C , \qquad V \ge 1$$
$M_k = \lambda^{-k} V(X_k) \mathbb{I}(\tau_C \ge k)$, $k \ge 1$, is a supermartingale and thus
$$E_x[\lambda^{-\tau_C}] \le V(x) + b \lambda^{-1} \mathbb{I}_C(x) .$$
Conversely, if there exists a set C such that $E_x[\lambda^{-\tau_C}] < \infty$ for all $x$ (in a full and absorbing set), then there exists a drift function verifying the Foster-Lyapunov conditions.
[Meyn and Tweedie, 1993]
Markov Chain Monte Carlo Methods
Notions on Markov Chains
Coupling
If the drift condition is imposed directly on the joint transition kernel $\bar{P}$, there exist $\bar{V} \ge 1$, $0 < \bar\lambda < 1$ and a set $\bar{C}$ such that:
$$\bar{P}\bar{V}(x, x') \le \bar\lambda \bar{V}(x, x') \qquad (x, x') \notin \bar{C}$$
When $\bar{P}(x, x'; A \times A') = K(x, A) K(x', A')$, one may consider
$$\bar{V}(x, x') = (1/2) \left\{ V(x) + V(x') \right\}$$
where V is a drift function for P (but not necessarily the best choice).
Markov Chain Monte Carlo Methods
Notions on Markov Chains
Coupling
Explicit bound
Theorem
For any distributions $\xi$ and $\xi'$, and any $j \le k$:
$$\|\xi P^k(\cdot) - \xi' P^k(\cdot)\|_{TV} \le (1 - \epsilon)^j + \bar\lambda^k B^{j-1} E_{\xi, \xi', 0}[\bar{V}(X_0, X_0')]$$
where
$$B = 1 \vee \bar\lambda^{-1} (1 - \epsilon) \sup_{\bar{C}} \bar{R}\bar{V} .$$
[DMR, 2001]
Markov Chain Monte Carlo Methods
Notions on Markov Chains
Renewal and CLT
Renewal and CLT
Given a Markov chain $(X_n)_n$, how good an approximation of
$$I = \int g(x) \pi(x) dx$$
is
$$\bar{g}_n := \frac{1}{n} \sum_{i=0}^{n-1} g(X_i) \ ?$$
Standard MC if CLT
$$\sqrt{n} \left( \bar{g}_n - E_\pi[g(X)] \right) \stackrel{d}{\longrightarrow} \mathcal{N}(0, \gamma_g^2)$$
and there exists an easy-to-compute, consistent estimate of $\gamma_g^2$...
Markov Chain Monte Carlo Methods
Notions on Markov Chains
Renewal and CLT
Minoration
skip construction
Assume that the kernel density K satisfies, for some density $q(\cdot)$, $\epsilon \in (0, 1)$ and a small set $C$,
$$K(y \mid x) \ge \epsilon q(y) \qquad \text{for all } y \text{ and all } x \in C$$
Then split K into a mixture
$$K(y \mid x) = \epsilon q(y) + (1 - \epsilon) R(y \mid x)$$
where R is the residual kernel.
Markov Chain Monte Carlo Methods
Notions on Markov Chains
Renewal and CLT
Split chain
Let $\delta_0, \delta_1, \delta_2, \ldots$ be iid $\mathcal{B}(\epsilon)$. Then the split chain
$$(X_0, \delta_0), (X_1, \delta_1), (X_2, \delta_2), \ldots$$
is such that, when $X_i \in C$, $\delta_i$ determines $X_{i+1}$:
$$X_{i+1} \sim \begin{cases} q(x) & \text{if } \delta_i = 1 , \\ R(x \mid X_i) & \text{otherwise} \end{cases}$$
[Regeneration] When $(X_i, \delta_i) \in C \times \{1\}$, $X_{i+1} \sim q$.
Markov Chain Monte Carlo Methods
Notions on Markov Chains
Renewal and CLT
Renewals
For $X_0 \sim q$ and R successive renewals, define by $\tau_1 < \ldots < \tau_R$ the renewal times.
Then
$$\sqrt{R} \left( \bar{g}_{\tau_R} - E_\pi[g(X)] \right) = \frac{\sqrt{R}}{\bar N} \left( \frac{1}{R} \sum_{t=1}^R \left( S_t - N_t E_\pi[g(X)] \right) \right)$$
where $N_t$ is the length of the t-th tour, $S_t$ the sum of the $g(X_j)$'s over the t-th tour, and $\bar N = \tau_R / R$ the average tour length.
Since the $(N_t, S_t)$ are iid and $E_q[S_t - N_t E_\pi[g(X)]] = 0$, if $N_t$ and $S_t$ have finite 2nd moments,
$$\sqrt{R} \left( \bar{g}_{\tau_R} - E_\pi g \right) \stackrel{d}{\longrightarrow} \mathcal{N}(0, \sigma_g^2)$$
and there is a simple, consistent estimator of $\sigma_g^2$.
[Mykland & al., 1995; Robert, 1995]
Markov Chain Monte Carlo Methods
Notions on Markov Chains
Renewal and CLT
Moment conditions
We need to show that, for the minoration condition, $E_q[N_1^2]$ and $E_q[S_1^2]$ are finite.
If
1. the chain is geometrically ergodic, and
2. $E_\pi[|g|^{2+\alpha}] < \infty$ for some $\alpha > 0$,
then $E_q[N_1^2] < \infty$ and $E_q[S_1^2] < \infty$.
[Hobert & al., 2002]
Note that drift + minoration ensures geometric ergodicity.
[Rosenthal, 1995; Roberts & Tweedie, 1999]
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
The Metropolis-Hastings Algorithm
Motivation and leading example
Random variable generation
Monte Carlo Integration
Notions on Markov Chains
The Metropolis-Hastings Algorithm
Monte Carlo Methods based on Markov Chains
The MetropolisHastings algorithm
A collection of Metropolis-Hastings algorithms
Extensions
The Gibbs Sampler
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
Monte Carlo Methods based on Markov Chains
Running Monte Carlo via Markov Chains
It is not necessary to use a sample from the distribution f to approximate the integral
$$I = \int h(x) f(x) dx ;$$
we can obtain $X_1, \ldots, X_n \sim f$ (approx) without directly simulating from f, using an ergodic Markov chain with stationary distribution f.
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
Monte Carlo Methods based on Markov Chains
Running Monte Carlo via Markov Chains (2)
Idea
For an arbitrary starting value $x^{(0)}$, an ergodic chain $(X^{(t)})$ is generated using a transition kernel with stationary distribution f.
• This insures the convergence in distribution of $(X^{(t)})$ to a random variable from f.
• For a large enough $T_0$, $X^{(T_0)}$ can be considered as distributed from f.
• Produces a dependent sample $X^{(T_0)}, X^{(T_0+1)}, \ldots$, which is generated from f, sufficient for most approximation purposes.
Problem: How can one build a Markov chain with a given stationary distribution?
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
The MetropolisHastings algorithm
The MetropolisHastings algorithm
Basics
The algorithm uses the objective (target) density $f$ and a conditional density $q(y \mid x)$, called the instrumental (or proposal) distribution.
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
The MetropolisHastings algorithm
The MH algorithm
Algorithm (Metropolis–Hastings)
Given $x^{(t)}$,
1. Generate $Y_t \sim q(y \mid x^{(t)})$.
2. Take
$$X^{(t+1)} = \begin{cases} Y_t & \text{with prob. } \rho(x^{(t)}, Y_t) , \\ x^{(t)} & \text{with prob. } 1 - \rho(x^{(t)}, Y_t) , \end{cases}$$
where
$$\rho(x, y) = \min \left\{ \frac{f(y)}{f(x)} \, \frac{q(x \mid y)}{q(y \mid x)} , 1 \right\} .$$
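A minimal generic implementation (added, not part of the slides); the symmetric Gaussian proposal, for which the q-ratio cancels, is an illustrative choice.

    import numpy as np

    def metropolis_hastings(logf, x0, n_iter, rng, scale=1.0):
        """Generic MH with a symmetric Gaussian proposal q(y|x) = N(x, scale^2)."""
        x = np.empty(n_iter); x[0] = x0
        for t in range(1, n_iter):
            y = x[t - 1] + scale * rng.standard_normal()
            log_rho = logf(y) - logf(x[t - 1])   # + log q(x|y) - log q(y|x) = 0
            x[t] = y if np.log(rng.uniform()) < log_rho else x[t - 1]
        return x

    rng = np.random.default_rng(12)
    chain = metropolis_hastings(lambda x: -x**2 / 2, 0.0, 10_000, rng)
    print(chain.mean(), chain.var())   # roughly 0 and 1 for a N(0,1) target

Note that only an unnormalised log-density is needed, in line with the first feature listed below.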
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
The MetropolisHastings algorithm
Features
• Independent of normalizing constants for both f and $q(\cdot \mid x)$ (ie, those constants independent of x)
• Never move to values with $f(y) = 0$
• The chain $(x^{(t)})_t$ may take the same value several times in a row, even though f is a density wrt Lebesgue measure
• The sequence $(y_t)_t$ is usually not a Markov chain
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
The MetropolisHastings algorithm
Convergence properties
1. The M-H Markov chain is reversible, with invariant/stationary density f, since it satisfies the detailed balance condition
$$f(y) K(y, x) = f(x) K(x, y)$$
2. As f is a probability measure, the chain is positive recurrent
3. If
$$\Pr \left[ \frac{f(Y_t) \, q(X^{(t)} \mid Y_t)}{f(X^{(t)}) \, q(Y_t \mid X^{(t)})} \ge 1 \right] < 1 , \qquad (1)$$
that is, the event $\{X^{(t+1)} = X^{(t)}\}$ is possible, then the chain is aperiodic
The Metropolis-Hastings Algorithm
The MetropolisHastings algorithm
Convergence properties (2)
4. If
q(y[x) > 0 for every (x, y), (2)
the chain is irreducible
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
The MetropolisHastings algorithm
Convergence properties (2)
4. If
q(y[x) > 0 for every (x, y), (2)
the chain is irreducible
5. For M-H, f-irreducibility implies Harris recurrence
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
The MetropolisHastings algorithm
Convergence properties (2)
4. If
q(y[x) > 0 for every (x, y), (2)
the chain is irreducible
5. For M-H, f-irreducibility implies Harris recurrence
6. Thus, for M-H satisfying (1) and (2)
(i) For h, with E
f
[h(X)[ < ,
lim
T
1
T
T

t=1
h(X
(t)
) =
_
h(x)df(x) a.e. f.
(ii) and
lim
n
_
_
_
_
_
K
n
(x, )(dx) f
_
_
_
_
TV
= 0
for every initial distribution , where K
n
(x, ) denotes the
kernel for n transitions.
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
A collection of Metropolis-Hastings algorithms
The Independent Case
The instrumental distribution q is independent of $X^{(t)}$, and is denoted g by analogy with Accept-Reject.
Algorithm (Independent Metropolis-Hastings)
Given $x^{(t)}$,
(a) Generate $Y_t \sim g(y)$
(b) Take
$$X^{(t+1)} = \begin{cases} Y_t & \text{with prob. } \min \left\{ \dfrac{f(Y_t) \, g(x^{(t)})}{f(x^{(t)}) \, g(Y_t)} , 1 \right\} , \\ x^{(t)} & \text{otherwise.} \end{cases}$$
The Metropolis-Hastings Algorithm
A collection of Metropolis-Hastings algorithms
Properties
The resulting sample is not iid
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
A collection of Metropolis-Hastings algorithms
Properties
The resulting sample is not iid but there exist strong convergence
properties:
Theorem (Ergodicity)
The algorithm produces a uniformly ergodic chain if there exists a
constant M such that
f(x) Mg(x) , x supp f.
In this case,
|K
n
(x, ) f|
TV

_
1
1
M
_
n
.
[Mengersen & Tweedie, 1996]
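A sketch of the independent sampler under this domination (added, not in the slides); the Beta(2, 4) target with a U(0, 1) instrumental, for which $f \le M g$ with $M = \max f$, is an illustrative assumption.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(13)
    f = stats.beta(2, 4).pdf
    n = 10_000
    x = np.empty(n); x[0] = 0.5
    for t in range(1, n):
        y = rng.uniform()                          # Y_t ~ g = U(0,1)
        rho = min(f(y) / f(x[t - 1]), 1.0)         # g(x)/g(y) = 1
        x[t] = y if rng.uniform() < rho else x[t - 1]
    print(x.mean(), 1 / 3)    # mean of Beta(2,4) is 1/3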
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
A collection of Metropolis-Hastings algorithms
Example (Noisy AR(1))
Hidden Markov chain from a regular AR(1) model,
$$x_{t+1} = \varphi x_t + \epsilon_{t+1} , \qquad \epsilon_t \sim \mathcal{N}(0, \tau^2)$$
and observables
$$y_t \mid x_t \sim \mathcal{N}(x_t^2, \sigma^2)$$
The distribution of $x_t$ given $x_{t-1}$, $x_{t+1}$ and $y_t$ is
$$\exp \frac{-1}{2\tau^2} \left\{ (x_t - \varphi x_{t-1})^2 + (x_{t+1} - \varphi x_t)^2 + \frac{\tau^2}{\sigma^2} (y_t - x_t^2)^2 \right\} .$$
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
A collection of Metropolis-Hastings algorithms
Example (Noisy AR(1) too)
Use for proposal the $\mathcal{N}(\mu_t, \omega_t^2)$ distribution, with
$$\mu_t = \varphi \, \frac{x_{t-1} + x_{t+1}}{1 + \varphi^2} \quad\text{and}\quad \omega_t^2 = \frac{\tau^2}{1 + \varphi^2} .$$
Ratio
$$\pi(x) / q_{\mathrm{ind}}(x) \propto \exp \left\{ -(y_t - x_t^2)^2 / 2\sigma^2 \right\}$$
is bounded
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
A collection of Metropolis-Hastings algorithms
[Figure] (top) Last 500 realisations of the chain $\{X_k\}_k$ out of 10,000 iterations; (bottom) histogram of the chain, compared with the target distribution.
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
A collection of Metropolis-Hastings algorithms
Example (Cauchy by normal)
go random W
Given a Cauchy C(0, 1) target distribution, consider a normal $\mathcal{N}(0, 1)$ proposal.
The Metropolis–Hastings acceptance ratio is
$$\frac{\pi(\theta') / \nu(\theta')}{\pi(\theta) / \nu(\theta)} = \exp \left\{ \left( (\theta')^2 - \theta^2 \right) / 2 \right\} \, \frac{1 + \theta^2}{1 + (\theta')^2} .$$
Poor performances: the proposal distribution has lighter tails than the target Cauchy and convergence to the stationary distribution is not even geometric!
[Mengersen & Tweedie, 1996]
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
A collection of Metropolis-Hastings algorithms
[Figure] Histogram of the Markov chain $(\theta_t)_{1 \le t \le 5000}$ against the target C(0, 1) distribution; range and average of 1000 parallel runs when initialized with a normal $\mathcal{N}(0, 100^2)$ distribution.
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
A collection of Metropolis-Hastings algorithms
Random walk Metropolis–Hastings
Use of a local perturbation as proposal
$$Y_t = X^{(t)} + \epsilon_t ,$$
where $\epsilon_t \sim g$, independent of $X^{(t)}$.
The instrumental density is now of the form $g(y - x)$ and the Markov chain is a random walk if we take g to be symmetric: $g(x) = g(-x)$.
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
A collection of Metropolis-Hastings algorithms
Algorithm (Random walk Metropolis)
Given $x^{(t)}$,
1. Generate $Y_t \sim g(y - x^{(t)})$
2. Take
$$X^{(t+1)} = \begin{cases} Y_t & \text{with prob. } \min \left\{ 1, \dfrac{f(Y_t)}{f(x^{(t)})} \right\} , \\ x^{(t)} & \text{otherwise.} \end{cases}$$
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
A collection of Metropolis-Hastings algorithms
Example (Random walk and normal target)
forget History!
Generate $\mathcal{N}(0, 1)$ based on the uniform proposal $[-\delta, \delta]$
[Hastings (1970)]
The probability of acceptance is then
$$\rho(x^{(t)}, y_t) = \exp \left\{ \left( x^{(t)2} - y_t^2 \right) / 2 \right\} \wedge 1 .$$
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
A collection of Metropolis-Hastings algorithms
Example (Random walk & normal (2))
Sample statistics:

  δ          0.1     0.5     1.0
  mean       0.399  -0.111   0.10
  variance   0.698   1.11    1.06

© As δ increases, we get better histograms and a faster exploration of the support of f.
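A sketch reproducing this experiment (added, not in the slides), with the three scales of the table:

    import numpy as np

    rng = np.random.default_rng(14)

    def rw_chain(delta, n=15_000):
        x = np.empty(n); x[0] = 0.0
        for t in range(1, n):
            y = x[t - 1] + rng.uniform(-delta, delta)
            rho = min(np.exp((x[t - 1]**2 - y**2) / 2), 1.0)
            x[t] = y if rng.uniform() < rho else x[t - 1]
        return x

    for delta in (0.1, 0.5, 1.0):
        x = rw_chain(delta)
        print(delta, x.mean(), x.var())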
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
A collection of Metropolis-Hastings algorithms
[Figure] Three samples based on $\mathcal{U}[-\delta, \delta]$ with (a) $\delta = 0.1$, (b) $\delta = 0.5$ and (c) $\delta = 1.0$, superimposed with the convergence of the means (15,000 simulations).
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
A collection of Metropolis-Hastings algorithms
Example (Mixture models (again!))
$$\pi(\theta \mid x) \propto \prod_{j=1}^n \left\{ \sum_{\ell=1}^k p_\ell f(x_j \mid \theta_\ell) \right\} \pi(\theta)$$
Metropolis-Hastings proposal:
$$\theta^{(t+1)} = \begin{cases} \theta^{(t)} + \omega \epsilon^{(t)} & \text{if } u^{(t)} < \rho^{(t)} , \\ \theta^{(t)} & \text{otherwise} \end{cases}$$
where
$$\rho^{(t)} = \frac{\pi(\theta^{(t)} + \omega \epsilon^{(t)} \mid x)}{\pi(\theta^{(t)} \mid x)} \wedge 1$$
and $\omega$ scaled for good acceptance rate
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
A collection of Metropolis-Hastings algorithms
[Figure] Random walk sampling (50,000 iterations): general case of a 3 component normal mixture.
[Celeux & al., 2000]
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
A collection of Metropolis-Hastings algorithms
[Figure] Random walk MCMC output for $.7\,\mathcal{N}(\mu_1, 1) + .3\,\mathcal{N}(\mu_2, 1)$
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
A collection of Metropolis-Hastings algorithms
Example (probit model)
skip probit
Likelihood of the probit model
$$\prod_{i=1}^n \Phi(y_i^T \beta)^{x_i} \left( 1 - \Phi(y_i^T \beta) \right)^{1 - x_i}$$
Random walk proposal
$$\beta^{(t+1)} = \beta^{(t)} + \tau \epsilon_t , \qquad \epsilon_t \sim \mathcal{N}_p(0, \hat\Sigma)$$
where, for instance, $\hat\Sigma = (Y^T Y)^{-1}$
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
A collection of Metropolis-Hastings algorithms
[Figure] Likelihood surface and random walk Metropolis-Hastings steps
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
A collection of Metropolis-Hastings algorithms
Convergence properties
Uniform ergodicity prohibited by random walk structure.
At best, geometric ergodicity:
Theorem (Sufficient ergodicity)
For a symmetric density f, log-concave in the tails, and a positive and symmetric density g, the chain $(X^{(t)})$ is geometrically ergodic.
[Mengersen & Tweedie, 1996]
no tail effect
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
A collection of Metropolis-Hastings algorithms
Example (Comparison of tail effects)
[Figure] Random-walk Metropolis–Hastings algorithms based on a $\mathcal{N}(0, 1)$ instrumental for the generation of (a) a $\mathcal{N}(0, 1)$ distribution and (b) a distribution with density $\psi(x) \propto (1 + |x|)^{-3}$; 90% confidence envelopes of the means, derived from 500 parallel independent chains.
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
A collection of Metropolis-Hastings algorithms
Example (Cauchy by normal continued)
Again, Cauchy C(0, 1) target and Gaussian random walk proposal, $\theta' \sim \mathcal{N}(\theta, \sigma^2)$, with acceptance probability
$$\frac{1 + \theta^2}{1 + (\theta')^2} \wedge 1 .$$
Overall fit of the Cauchy density by the histogram is satisfactory, but exploration of the tails is poor: the 99% quantile of C(0, 1) equals 3, but no simulation exceeds 14 out of 10,000!
[Roberts & Tweedie, 2004]
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
A collection of Metropolis-Hastings algorithms
Again, lack of geometric ergodicity!
[Mengersen & Tweedie, 1996]
Slow convergence shown by the non-stable range after 10,000 iterations.
[Figure] Histogram of the 10,000 first steps of a random walk Metropolis–Hastings algorithm using a $\mathcal{N}(\theta, 1)$ proposal
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
A collection of Metropolis-Hastings algorithms
[Figure] Range of 500 parallel runs for the same setup
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
A collection of Metropolis-Hastings algorithms
Further convergence properties
Under assumptions
skip detailed convergence
• (A1) f is super-exponential, i.e. it is positive with positive continuous first derivative such that
$$\lim_{|x|\to\infty} n(x)' \nabla \log f(x) = -\infty , \qquad n(x) := x/|x| .$$
In words: exponential decay of f in every direction, with rate tending to $\infty$
• (A2) $\limsup_{|x|\to\infty} n(x)' m(x) < 0$, where $m(x) = \nabla f(x)/|\nabla f(x)|$.
In words: non-degeneracy of the contour manifold $C_{f(x)} = \{y : f(y) = f(x)\}$
Then Q is geometrically ergodic, and $V(x) \propto f(x)^{-1/2}$ verifies the drift condition.
[Jarner & Hansen, 2000]
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
A collection of Metropolis-Hastings algorithms
Further [further] convergence properties
skip hyperdetailed convergence
If P is $\psi$-irreducible and aperiodic, for $r = (r(n))_{n\in\mathbb{N}}$ a real-valued non-decreasing sequence such that, for all $n, m \in \mathbb{N}$,
$$r(n + m) \le r(n) r(m)$$
and $r(0) = 1$, for C a small set, $\tau_C = \inf\{ n \ge 1 , X_n \in C \}$, and $h \ge 1$, assume
$$\sup_{x \in C} E_x \left[ \sum_{k=0}^{\tau_C - 1} r(k) h(X_k) \right] < \infty ,$$
then
$$S(f, C, r) := \left\{ x \in \mathcal{X} , \ E_x \left[ \sum_{k=0}^{\tau_C - 1} r(k) h(X_k) \right] < \infty \right\}$$
is full and absorbing and, for $x \in S(f, C, r)$,
$$\lim_{n\to\infty} r(n) \|P^n(x, \cdot) - f\|_h = 0 .$$
[Tuominen & Tweedie, 1994]
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
A collection of Metropolis-Hastings algorithms
Comments
• [CLT, Rosenthal's inequality...] h-ergodicity implies a CLT for additive (possibly unbounded) functionals of the chain, Rosenthal's inequality and so on...
• [Control of the moments of the return-time] The condition implies (because $h \ge 1$) that
$$\sup_{x \in C} E_x[r_0(\tau_C)] \le \sup_{x \in C} E_x \left[ \sum_{k=0}^{\tau_C - 1} r(k) h(X_k) \right] < \infty ,$$
where $r_0(n) = \sum_{l=0}^n r(l)$. Can be used to derive bounds for the coupling time, an essential step to determine computable bounds, using coupling inequalities.
[Roberts & Tweedie, 1998; Fort & Moulines, 2000]
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
A collection of Metropolis-Hastings algorithms
Alternative conditions
The condition is not really easy to work with...
[Possible alternative conditions]
(a) [Tuominen, Tweedie, 1994] There exists a sequence $(V_n)_{n\in\mathbb{N}}$, $V_n \ge r(n) h$, such that
(i) $\sup_C V_0 < \infty$,
(ii) $\{V_0 = \infty\} = \{V_1 = \infty\}$ and
(iii) $PV_{n+1} \le V_n - r(n) h + b \, r(n) \mathbb{I}_C$.
(b) [Fort 2000] $\exists V \ge f \ge 1$ and $b < \infty$ such that $\sup_C V < \infty$ and
$$PV(x) + E_x \left[ \sum_{k=0}^{\sigma_C} \Delta r(k) f(X_k) \right] \le V(x) + b \mathbb{I}_C(x)$$
where $\sigma_C$ is the hitting time on C and
$$\Delta r(k) = r(k) - r(k - 1) , \ k \ge 1 , \qquad \Delta r(0) = r(0) .$$
Result: (a) $\Leftrightarrow$ (b) $\Leftrightarrow$ $\sup_{x \in C} E_x \left[ \sum_{k=0}^{\tau_C - 1} r(k) f(X_k) \right] < \infty$.
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
Extensions
Extensions
There are many other families of MH algorithms:
• Adaptive Rejection Metropolis Sampling
• Reversible Jump (later!)
• Langevin algorithms
to name just a few...
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
Extensions
Langevin Algorithms
Proposal based on the Langevin diffusion $L_t$, defined by the stochastic differential equation
$$dL_t = dB_t + \frac{1}{2} \nabla \log f(L_t) \, dt ,$$
where $B_t$ is the standard Brownian motion.
Theorem
The Langevin diffusion is the only non-explosive diffusion which is reversible with respect to f.
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
Extensions
Discretization
Instead, consider the sequence
$$x^{(t+1)} = x^{(t)} + \frac{\sigma^2}{2} \nabla \log f(x^{(t)}) + \sigma \epsilon_t , \qquad \epsilon_t \sim \mathcal{N}_p(0, I_p)$$
where $\sigma^2$ corresponds to the discretization step.
Unfortunately, the discretized chain may be transient, for instance when
$$\lim_{x\to\pm\infty} \left| \frac{\sigma^2}{2} \nabla \log f(x) \, |x|^{-1} \right| > 1$$
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
Extensions
MH correction
Accept the new value $Y_t$ with probability
$$\frac{f(Y_t)}{f(x^{(t)})} \cdot \frac{\exp \left\{ - \left\| x^{(t)} - Y_t - \frac{\sigma^2}{2} \nabla \log f(Y_t) \right\|^2 \Big/ 2\sigma^2 \right\}}{\exp \left\{ - \left\| Y_t - x^{(t)} - \frac{\sigma^2}{2} \nabla \log f(x^{(t)}) \right\|^2 \Big/ 2\sigma^2 \right\}} \wedge 1 .$$
Choice of the scaling factor $\sigma$:
should lead to an acceptance rate of 0.574 to achieve optimal convergence rates (when the components of x are uncorrelated).
[Roberts & Rosenthal, 1998]
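A sketch of this Metropolis-adjusted Langevin algorithm (added, not in the slides); the $\mathcal{N}(0,1)$ target, for which $\nabla \log f(x) = -x$, and the scale $\sigma = 1.2$ are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(15)
    logf = lambda x: -x**2 / 2
    grad = lambda x: -x
    sigma, n = 1.2, 10_000

    x = np.empty(n); x[0] = 0.0; acc = 0
    for t in range(1, n):
        xc = x[t - 1]
        y = xc + 0.5 * sigma**2 * grad(xc) + sigma * rng.standard_normal()
        # log densities of the forward and backward Langevin moves
        lq_fwd = -(y - xc - 0.5 * sigma**2 * grad(xc))**2 / (2 * sigma**2)
        lq_bwd = -(xc - y - 0.5 * sigma**2 * grad(y))**2 / (2 * sigma**2)
        if np.log(rng.uniform()) < logf(y) - logf(xc) + lq_bwd - lq_fwd:
            x[t] = y; acc += 1
        else:
            x[t] = xc
    print("acceptance rate:", acc / n)   # tune sigma toward ~0.574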
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
Extensions
Optimizing the Acceptance Rate
Problem of choice of the transition kernel from a practical point of view.
Most common alternatives:
(a) a fully automated algorithm like ARMS;
(b) an instrumental density g which approximates f, such that f/g is bounded, for uniform ergodicity to apply;
(c) a random walk
In cases (b) and (c), the choice of g is critical.
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
Extensions
Case of the independent Metropolis–Hastings algorithm
Choice of g that maximizes the average acceptance rate
$$\rho = E \left[ \min \left\{ \frac{f(Y) \, g(X)}{f(X) \, g(Y)} , 1 \right\} \right] = 2 \Pr \left( \frac{f(Y)}{g(Y)} \ge \frac{f(X)}{g(X)} \right) , \qquad X \sim f , \ Y \sim g ,$$
related to the speed of convergence of
$$\frac{1}{T} \sum_{t=1}^T h(X^{(t)})$$
to $E_f[h(X)]$ and to the ability of the algorithm to explore any complexity of f
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
Extensions
Case of the independent Metropolis–Hastings algorithm (2)
Practical implementation:
Choose a parameterized instrumental distribution $g(\cdot \mid \theta)$ and adjust the corresponding parameters $\theta$ based on the evaluated acceptance rate
$$\hat\rho(\theta) = \frac{2}{m} \sum_{i=1}^m \mathbb{I}_{\{ f(y_i) g(x_i) > f(x_i) g(y_i) \}} ,$$
where $x_1, \ldots, x_m$ is a sample from f and $y_1, \ldots, y_m$ an iid sample from g.
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
Extensions
Example (Inverse Gaussian distribution)
no inverse
Simulation from
$$f(z \mid \theta_1, \theta_2) \propto z^{-3/2} \exp \left\{ -\theta_1 z - \frac{\theta_2}{z} + 2\sqrt{\theta_1 \theta_2} + \log \sqrt{2\theta_2} \right\} \mathbb{I}_{\mathbb{R}_+}(z)$$
based on the Gamma distribution $\mathcal{G}a(\alpha, \beta)$ with $\beta = \sqrt{\theta_2 / \theta_1}$.
Since
$$\frac{f(x)}{g(x)} \propto x^{-\alpha - 1/2} \exp \left\{ (\beta - \theta_1) x - \frac{\theta_2}{x} \right\} ,$$
the maximum is attained at
$$x^*_\beta = \frac{(\alpha + 1/2) - \sqrt{(\alpha + 1/2)^2 + 4\theta_2 (\theta_1 - \beta)}}{2 (\beta - \theta_1)} .$$
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
Extensions
Example (Inverse Gaussian distribution (2))
The analytical optimization (in $\beta$) of
$$M(\beta) = (x^*_\beta)^{-\alpha - 1/2} \exp \left\{ (\beta - \theta_1) x^*_\beta - \frac{\theta_2}{x^*_\beta} \right\}$$
is impossible.

  β         0.2    0.5    0.8    0.9    1      1.1    1.2    1.5
  ρ̂(β)      0.22   0.41   0.54   0.56   0.60   0.63   0.64   0.71
  E[Z]      1.137  1.158  1.164  1.154  1.133  1.148  1.181  1.148
  E[1/Z]    1.116  1.108  1.116  1.115  1.120  1.126  1.095  1.115

($\theta_1 = 1.5$, $\theta_2 = 2$, and $m = 5000$).
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
Extensions
Case of the random walk
Dierent approach to acceptance rates
A high acceptance rate does not indicate that the algorithm is
moving correctly since it indicates that the random walk is moving
too slowly on the surface of f.
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
Extensions
Case of the random walk
Dierent approach to acceptance rates
A high acceptance rate does not indicate that the algorithm is
moving correctly since it indicates that the random walk is moving
too slowly on the surface of f.
If x
(t)
and y
t
are close, i.e. f(x
(t)
) f(y
t
) y is accepted with
probability
min
_
f(y
t
)
f(x
(t)
)
, 1
_
1 .
For multimodal densities with well separated modes, the negative
eect of limited moves on the surface of f clearly shows.
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
Extensions
Case of the random walk (2)
If the average acceptance rate is low, the successive values of f(y
t
)
tend to be small compared with f(x
(t)
), which means that the
random walk moves quickly on the surface of f since it often
reaches the borders of the support of f
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
Extensions
Rule of thumb
In small dimensions, aim at an average acceptance rate of 50%. In
large dimensions, at an average acceptance rate of 25%.
[Gelman,Gilks and Roberts, 1995]
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
Extensions
Rule of thumb
In small dimensions, aim at an average acceptance rate of 50%. In
large dimensions, at an average acceptance rate of 25%.
[Gelman,Gilks and Roberts, 1995]
This rule is to be taken with a pinch of salt!
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
Extensions
Example (Noisy AR(1) continued)
For a Gaussian random walk with scale small enough, the
random walk never jumps to the other mode. But if the scale is
suciently large, the Markov chain explores both modes and give a
satisfactory approximation of the target distribution.
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
Extensions
Markov chain based on a random walk with scale = .1.
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
Extensions
Markov chain based on a random walk with scale = .5.
Markov Chain Monte Carlo Methods
The Gibbs Sampler
The Gibbs Sampler
The Gibbs Sampler
General Principles
Completion
Convergence
The Hammersley-Cliord theorem
Hierarchical models
Data Augmentation
Improper Priors
Markov Chain Monte Carlo Methods
The Gibbs Sampler
General Principles
General Principles
A very specic simulation algorithm based on the target
distribution f:
1. Uses the conditional densities f
1
, . . . , f
p
from f
Markov Chain Monte Carlo Methods
The Gibbs Sampler
General Principles
General Principles
A very specic simulation algorithm based on the target
distribution f:
1. Uses the conditional densities f
1
, . . . , f
p
from f
2. Start with the random variable X = (X
1
, . . . , X
p
)
Markov Chain Monte Carlo Methods
The Gibbs Sampler
General Principles
General Principles
A very specic simulation algorithm based on the target
distribution f:
1. Uses the conditional densities f
1
, . . . , f
p
from f
2. Start with the random variable X = (X
1
, . . . , X
p
)
3. Simulate from the conditional densities,
X
i
[x
1
, x
2
, . . . , x
i1
, x
i+1
, . . . , x
p
f
i
(x
i
[x
1
, x
2
, . . . , x
i1
, x
i+1
, . . . , x
p
)
for i = 1, 2, . . . , p.
Markov Chain Monte Carlo Methods
The Gibbs Sampler
General Principles
Algorithm (Gibbs sampler)
Given x
(t)
= (x
(t)
1
, . . . , x
(t)
p
), generate
1. X
(t+1)
1
f
1
(x
1
[x
(t)
2
, . . . , x
(t)
p
);
2. X
(t+1)
2
f
2
(x
2
[x
(t+1)
1
, x
(t)
3
, . . . , x
(t)
p
),
. . .
p. X
(t+1)
p
f
p
(x
p
[x
(t+1)
1
, . . . , x
(t+1)
p1
)
X
(t+1)
X f
Markov Chain Monte Carlo Methods
The Gibbs Sampler
General Principles
Properties
The full conditionals densities f
1
, . . . , f
p
are the only densities used
for simulation. Thus, even in a high dimensional problem, all of
the simulations may be univariate
Markov Chain Monte Carlo Methods
The Gibbs Sampler
General Principles
Properties
The full conditionals densities f
1
, . . . , f
p
are the only densities used
for simulation. Thus, even in a high dimensional problem, all of
the simulations may be univariate
The Gibbs sampler is not reversible with respect to f. However,
each of its p components is. Besides, it can be turned into a
reversible sampler, either using the Random Scan Gibbs sampler
see section
or running instead the (double) sequence
f
1
f
p1
f
p
f
p1
f
1
Markov Chain Monte Carlo Methods
The Gibbs Sampler
General Principles
Example (Bivariate Gibbs sampler)
(X, Y ) f(x, y)
Generate a sequence of observations by
Set X
0
= x
0
For t = 1, 2, . . . , generate
Y
t
f
Y |X
([x
t1
)
X
t
f
X|Y
([y
t
)
where f
Y |X
and f
X|Y
are the conditional distributions
Markov Chain Monte Carlo Methods
The Gibbs Sampler
General Principles
A Very Simple Example: Independent ^(,
2
)
Observations
When Y
1
, . . . , Y
n
iid
^(y[,
2
) with both and unknown, the
posterior in (,
2
) is conjugate outside a standard familly
Markov Chain Monte Carlo Methods
The Gibbs Sampler
General Principles
A Very Simple Example: Independent ^(,
2
)
Observations
When Y
1
, . . . , Y
n
iid
^(y[,
2
) with both and unknown, the
posterior in (,
2
) is conjugate outside a standard familly
But...
[Y
0:n
,
2
^
_

1
n

n
i=1
Y
i
,

2
n
)

2
[Y
1:n
, 1(
_

n
2
1,
1
2

n
i=1
(Y
i
)
2
_
assuming constant (improper) priors on both and
2

Hence we may use the Gibbs sampler for simulating from the
posterior of (,
2
)
Markov Chain Monte Carlo Methods
The Gibbs Sampler
General Principles
R Gibbs Sampler for Gaussian posterior
n = length(Y);
S = sum(Y);
mu = S/n;
for (i in 1:500)
S2 = sum((Y-mu)^2);
sigma2 = 1/rgamma(1,n/2-1,S2/2);
mu = S/n + sqrt(sigma2/n)*rnorm(1);
Markov Chain Monte Carlo Methods
The Gibbs Sampler
General Principles
Example of results with n = 10 observations from the
^(0, 1) distribution
Number of Iterations 1
Markov Chain Monte Carlo Methods
The Gibbs Sampler
General Principles
Example of results with n = 10 observations from the
^(0, 1) distribution
Number of Iterations 1, 2
Markov Chain Monte Carlo Methods
The Gibbs Sampler
General Principles
Example of results with n = 10 observations from the
^(0, 1) distribution
Number of Iterations 1, 2, 3
Markov Chain Monte Carlo Methods
The Gibbs Sampler
General Principles
Example of results with n = 10 observations from the
^(0, 1) distribution
Number of Iterations 1, 2, 3, 4
Markov Chain Monte Carlo Methods
The Gibbs Sampler
General Principles
Example of results with n = 10 observations from the
^(0, 1) distribution
Number of Iterations 1, 2, 3, 4, 5
Markov Chain Monte Carlo Methods
The Gibbs Sampler
General Principles
Example of results with n = 10 observations from the
^(0, 1) distribution
Number of Iterations 1, 2, 3, 4, 5, 10
Markov Chain Monte Carlo Methods
The Gibbs Sampler
General Principles
Example of results with n = 10 observations from the
^(0, 1) distribution
Number of Iterations 1, 2, 3, 4, 5, 10, 25
Markov Chain Monte Carlo Methods
The Gibbs Sampler
General Principles
Example of results with n = 10 observations from the
^(0, 1) distribution
Number of Iterations 1, 2, 3, 4, 5, 10, 25, 50
Markov Chain Monte Carlo Methods
The Gibbs Sampler
General Principles
Example of results with n = 10 observations from the
^(0, 1) distribution
Number of Iterations 1, 2, 3, 4, 5, 10, 25, 50, 100
Markov Chain Monte Carlo Methods
The Gibbs Sampler
General Principles
Example of results with n = 10 observations from the
^(0, 1) distribution
Number of Iterations 1, 2, 3, 4, 5, 10, 25, 50, 100, 500
Markov Chain Monte Carlo Methods
The Gibbs Sampler
General Principles
Limitations of the Gibbs sampler
Formally, a special case of a sequence of 1-D M-H kernels, all with
acceptance rate uniformly equal to 1.
The Gibbs sampler
1. limits the choice of instrumental distributions
Markov Chain Monte Carlo Methods
The Gibbs Sampler
General Principles
Limitations of the Gibbs sampler
Formally, a special case of a sequence of 1-D M-H kernels, all with
acceptance rate uniformly equal to 1.
The Gibbs sampler
1. limits the choice of instrumental distributions
2. requires some knowledge of f
Markov Chain Monte Carlo Methods
The Gibbs Sampler
General Principles
Limitations of the Gibbs sampler
Formally, a special case of a sequence of 1-D M-H kernels, all with
acceptance rate uniformly equal to 1.
The Gibbs sampler
1. limits the choice of instrumental distributions
2. requires some knowledge of f
3. is, by construction, multidimensional
Markov Chain Monte Carlo Methods
The Gibbs Sampler
General Principles
Limitations of the Gibbs sampler
Formally, a special case of a sequence of 1-D M-H kernels, all with
acceptance rate uniformly equal to 1.
The Gibbs sampler
1. limits the choice of instrumental distributions
2. requires some knowledge of f
3. is, by construction, multidimensional
4. does not apply to problems where the number of parameters
varies as the resulting chain is not irreducible.
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Completion
Latent variables are back
The Gibbs sampler can be generalized in much wider generality
A density g is a completion of f if
_
Z
g(x, z) dz = f(x)
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Completion
Latent variables are back
The Gibbs sampler can be generalized in much wider generality
A density g is a completion of f if
_
Z
g(x, z) dz = f(x)
Note
The variable z may be meaningless for the problem
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Completion
Purpose
g should have full conditionals that are easy to simulate for a
Gibbs sampler to be implemented with g rather than f
For p > 1, write y = (x, z) and denote the conditional densities of
g(y) = g(y
1
, . . . , y
p
) by
Y
1
[y
2
, . . . , y
p
g
1
(y
1
[y
2
, . . . , y
p
),
Y
2
[y
1
, y
3
, . . . , y
p
g
2
(y
2
[y
1
, y
3
, . . . , y
p
),
. . . ,
Y
p
[y
1
, . . . , y
p1
g
p
(y
p
[y
1
, . . . , y
p1
).
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Completion
The move from Y
(t)
to Y
(t+1)
is dened as follows:
Algorithm (Completion Gibbs sampler)
Given (y
(t)
1
, . . . , y
(t)
p
), simulate
1. Y
(t+1)
1
g
1
(y
1
[y
(t)
2
, . . . , y
(t)
p
),
2. Y
(t+1)
2
g
2
(y
2
[y
(t+1)
1
, y
(t)
3
, . . . , y
(t)
p
),
. . .
p. Y
(t+1)
p
g
p
(y
p
[y
(t+1)
1
, . . . , y
(t+1)
p1
).
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Completion
Example (Mixtures all over again)
Hierarchical missing data structure:
If
X
1
, . . . , X
n

k

i=1
p
i
f(x[
i
),
then
X[Z f(x[
Z
), Z p
1
I(z = 1) +. . . +p
k
I(z = k),
Z is the component indicator associated with observation x
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Completion
Example (Mixtures (2))
Conditionally on (Z
1
, . . . , Z
n
) = (z
1
, . . . , z
n
) :
(p
1
, . . . , p
k
,
1
, . . . ,
k
[x
1
, . . . , x
n
, z
1
, . . . , z
n
)
p

1
+n
1
1
1
. . . p

k
+n
k
1
k
(
1
[y
1
+n
1
x
1
,
1
+n
1
) . . . (
k
[y
k
+n
k
x
k
,
k
+n
k
),
with
n
i
=

j
I(z
j
= i) and x
i
=

j; z
j
=i
x
j
/n
i
.
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Completion
Algorithm (Mixture Gibbs sampler)
1. Simulate

i
(
i
[y
i
+n
i
x
i
,
i
+n
i
) (i = 1, . . . , k)
(p
1
, . . . , p
k
) D(
1
+n
1
, . . . ,
k
+n
k
)
2. Simulate (j = 1, . . . , n)
Z
j
[x
j
, p
1
, . . . , p
k
,
1
, . . . ,
k

k

i=1
p
ij
I(z
j
= i)
with (i = 1, . . . , k)
p
ij
p
i
f(x
j
[
i
)
and update n
i
and x
i
(i = 1, . . . , k).
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Completion
-2 0 2 4 6 8
0
5
1
0
1
5


T = 500
-2 0 2 4 6 8
0
5
1
0
1
5


T = 1000
-2 0 2 4 6 8
0
5
1
0
1
5


T = 2000
-2 0 2 4 6 8
0
5
1
0
1
5


T = 3000
-2 0 2 4 6 8
0
5
1
0
1
5


T = 4000
-2 0 2 4 6 8
0
5
1
0
1
5


T = 5000
Estimation of the pluggin density for 3 components and T
iterations for 149 observations of acidity levels in US lakes
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Completion
10 15 20 25 30 35
0
.
0
0
0
.
0
5
0
.
1
0
0
.
1
5
0
.
2
0
0
.
2
5
Galaxy dataset (82 observations) with k = 2 components
average density (yellow), and pluggins:
average (tomato), marginal MAP (green), MAP (marroon)
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Completion
A wee problem
1 0 1 2 3 4

1
0
1
2
3
4

2
Gibbs started at random
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Completion
A wee problem
1 0 1 2 3 4

1
0
1
2
3
4

2
Gibbs started at random
Gibbs stuck at the wrong mode
1 0 1 2 3

1
0
1
2
3

2
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Completion
Random Scan Gibbs sampler
back to basics dont do random
Modication of the above Gibbs sampler where, with probability
1/p, the i-th component is drawn from f
i
(x
i
[X
i
), ie when the
components are chosen at random
Motivation
The Random Scan Gibbs sampler is reversible.
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Completion
Slice sampler as generic Gibbs
If f() can be written as a product
k

i=1
f
i
(),
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Completion
Slice sampler as generic Gibbs
If f() can be written as a product
k

i=1
f
i
(),
it can be completed as
k

i=1
I
0
i
f
i
()
,
leading to the following Gibbs algorithm:
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Completion
Algorithm (Slice sampler)
Simulate
1.
(t+1)
1
U
[0,f
1
(
(t)
)]
;
. . .
k.
(t+1)
k
U
[0,f
k
(
(t)
)]
;
k+1.
(t+1)
U
A
(t+1)
, with
A
(t+1)
= y; f
i
(y)
(t+1)
i
, i = 1, . . . , k.
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Completion
Example of results with a truncated ^(3, 1) distribution
0.0 0.2 0.4 0.6 0.8 1.0
0
.
0
0
0
0
.
0
0
2
0
.
0
0
4
0
.
0
0
6
0
.
0
0
8
0
.
0
1
0
x
y
Number of Iterations 2
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Completion
Example of results with a truncated ^(3, 1) distribution
0.0 0.2 0.4 0.6 0.8 1.0
0
.
0
0
0
0
.
0
0
2
0
.
0
0
4
0
.
0
0
6
0
.
0
0
8
0
.
0
1
0
x
y
Number of Iterations 2, 3
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Completion
Example of results with a truncated ^(3, 1) distribution
0.0 0.2 0.4 0.6 0.8 1.0
0
.
0
0
0
0
.
0
0
2
0
.
0
0
4
0
.
0
0
6
0
.
0
0
8
0
.
0
1
0
x
y
Number of Iterations 2, 3, 4
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Completion
Example of results with a truncated ^(3, 1) distribution
0.0 0.2 0.4 0.6 0.8 1.0
0
.
0
0
0
0
.
0
0
2
0
.
0
0
4
0
.
0
0
6
0
.
0
0
8
0
.
0
1
0
x
y
Number of Iterations 2, 3, 4, 5
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Completion
Example of results with a truncated ^(3, 1) distribution
0.0 0.2 0.4 0.6 0.8 1.0
0
.
0
0
0
0
.
0
0
2
0
.
0
0
4
0
.
0
0
6
0
.
0
0
8
0
.
0
1
0
x
y
Number of Iterations 2, 3, 4, 5, 10
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Completion
Example of results with a truncated ^(3, 1) distribution
0.0 0.2 0.4 0.6 0.8 1.0
0
.
0
0
0
0
.
0
0
2
0
.
0
0
4
0
.
0
0
6
0
.
0
0
8
0
.
0
1
0
x
y
Number of Iterations 2, 3, 4, 5, 10, 50
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Completion
Example of results with a truncated ^(3, 1) distribution
0.0 0.2 0.4 0.6 0.8 1.0
0
.
0
0
0
0
.
0
0
2
0
.
0
0
4
0
.
0
0
6
0
.
0
0
8
0
.
0
1
0
x
y
Number of Iterations 2, 3, 4, 5, 10, 50, 100
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Completion
Good slices
The slice sampler usually enjoys good theoretical properties (like
geometric ergodicity and even uniform ergodicity under bounded f
and bounded X).
As k increases, the determination of the set A
(t+1)
may get
increasingly complex.
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Completion
Example (Stochastic volatility core distribution)
Dicult part of the stochastic volatility model
(x) exp
_

2
(x )
2
+
2
exp(x)y
2
+x
_
/2 ,
simplied in exp
_
x
2
+exp(x)
_
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Completion
Example (Stochastic volatility core distribution)
Dicult part of the stochastic volatility model
(x) exp
_

2
(x )
2
+
2
exp(x)y
2
+x
_
/2 ,
simplied in exp
_
x
2
+exp(x)
_
Slice sampling means simulation from a uniform distribution on
A =
_
x; exp
_
x
2
+exp(x)
_
/2 u
_
=
_
x; x
2
+exp(x)
_
if we set = 2 log u.
Note Inversion of x
2
+exp(x) = needs to be done by
trial-and-error.
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Completion
0 10 20 30 40 50 60 70 80 90 100
0.1
0.05
0
0.05
0.1
Lag
C
o
r
r
e
l
a
t
i
o
n
1 0.5 0 0.5 1 1.5 2 2.5 3 3.5
0
0.2
0.4
0.6
0.8
1
D
e
n
s
i
t
y
Histogram of a Markov chain produced by a slice sampler
and target distribution in overlay.
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Convergence
Properties of the Gibbs sampler
Theorem (Convergence)
For
(Y
1
, Y
2
, , Y
p
) g(y
1
, . . . , y
p
),
if either
[Positivity condition]
(i) g
(i)
(y
i
) > 0 for every i = 1, , p, implies that
g(y
1
, . . . , y
p
) > 0, where g
(i)
denotes the marginal distribution
of Y
i
, or
(ii) the transition kernel is absolutely continuous with respect to g,
then the chain is irreducible and positive Harris recurrent.
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Convergence
Properties of the Gibbs sampler (2)
Consequences
(i) If
_
h(y)g(y)dy < , then
lim
nT
1
T
T

t=1
h
1
(Y
(t)
) =
_
h(y)g(y)dy a.e. g.
(ii) If, in addition, (Y
(t)
) is aperiodic, then
lim
n
_
_
_
_
_
K
n
(y, )(dx) f
_
_
_
_
TV
= 0
for every initial distribution .
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Convergence
Slice sampler
fast on that slice
For convergence, the properties of X
t
and of f(X
t
) are identical
Theorem (Uniform ergodicity)
If f is bounded and suppf is bounded, the simple slice sampler is
uniformly ergodic.
[Mira & Tierney, 1997]
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Convergence
A small set for a slice sampler
no slice detail
For

>

,
C = x A;

< f(x) <

is a small set:
Pr(x, )

()
where
(A) =
1

0
(A L())
(L())
d
if L() = x A; f(x) >
[Roberts & Rosenthal, 1998]
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Convergence
Slice sampler: drift
Under dierentiability and monotonicity conditions, the slice
sampler also veries a drift condition with V (x) = f(x)

, is
geometrically ergodic, and there even exist explicit bounds on the
total variation distance
[Roberts & Rosenthal, 1998]
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Convergence
Slice sampler: drift
Under dierentiability and monotonicity conditions, the slice
sampler also veries a drift condition with V (x) = f(x)

, is
geometrically ergodic, and there even exist explicit bounds on the
total variation distance
[Roberts & Rosenthal, 1998]
Example (Exponential cxp(1))
For n > 23,
[[K
n
(x, ) f()[[
TV
.054865 (0.985015)
n
(n 15.7043)
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Convergence
Slice sampler: convergence
no more slice detail
Theorem
For any density such that

(x A; f(x) > ) is non-increasing


then
[[K
523
(x, ) f()[[
TV
.0095
[Roberts & Rosenthal, 1998]
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Convergence
A poor slice sampler
Example
Consider
f(x) = exp[[x[[ x R
d
Slice sampler equivalent to
one-dimensional slice sampler on
(z) = z
d1
e
z
z > 0
or on
(u) = e
u
1/d
u > 0
Poor performances when d large
(heavy tails)
0 200 400 600 800 1000
-
2
-
1
0
1
1 dimensional run
c
o
r
r
e
la
tio
n
0 10 20 30 40
0
.0
0
.2
0
.4
0
.6
0
.8
1
.0
1 dimensional acf
0 200 400 600 800 1000
1
0
1
5
2
0
2
5
3
0
10 dimensional run
c
o
r
r
e
la
tio
n
0 10 20 30 40
0
.0
0
.2
0
.4
0
.6
0
.8
1
.0
10 dimensional acf
0 200 400 600 800 1000
0
2
0
4
0
6
0
20 dimensional run
c
o
r
r
e
la
tio
n
0 10 20 30 40
0
.0
0
.2
0
.4
0
.6
0
.8
1
.0
20 dimensional acf
0 200 400 600 800 1000
0
1
0
0
2
0
0
3
0
0
4
0
0
100 dimensional run
c
o
r
r
e
la
tio
n
0 10 20 30 40
0
.0
0
.2
0
.4
0
.6
0
.8
1
.0
100 dimensional acf
Sample runs of log(u) and
ACFs for log(u) (Roberts
& Rosenthal, 1999)
Markov Chain Monte Carlo Methods
The Gibbs Sampler
The Hammersley-Cliord theorem
Hammersley-Cliord theorem
An illustration that conditionals determine the joint distribution
Theorem
If the joint density g(y
1
, y
2
) have conditional distributions
g
1
(y
1
[y
2
) and g
2
(y
2
[y
1
), then
g(y
1
, y
2
) =
g
2
(y
2
[y
1
)
_
g
2
(v[y
1
)/g
1
(y
1
[v) dv
.
[Hammersley & Cliord, circa 1970]
Markov Chain Monte Carlo Methods
The Gibbs Sampler
The Hammersley-Cliord theorem
General HC decomposition
Under the positivity condition, the joint distribution g satises
g(y
1
, . . . , y
p
)
p

j=1
g

j
(y

j
[y

1
, . . . , y

j1
, y

j+1
, . . . , y

p
)
g

j
(y

j
[y

1
, . . . , y

j1
, y

j+1
, . . . , y

p
)
for every permutation on 1, 2, . . . , p and every y

Y .
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Hierarchical models
Hierarchical models
no hierarchy
The Gibbs sampler is particularly well suited to hierarchical models
Example (Animal epidemiology)
Counts of the number of cases of clinical mastitis in 127 dairy
cattle herds over a one year period
Number of cases in herd i
X
i
P(
i
) i = 1, , m
where
i
is the underlying rate of infection in herd i
Lack of independence might manifest itself as overdispersion.
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Hierarchical models
Example (Animal epidemiology (2))
Modied model
X
i
P(
i
)

i
Ga(,
i
)

i
IG(a, b),
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Hierarchical models
Example (Animal epidemiology (2))
Modied model
X
i
P(
i
)

i
Ga(,
i
)

i
IG(a, b),
The Gibbs sampler corresponds to conditionals

i
(
i
[x, ,
i
) = Ga(x
i
+, [1 + 1/
i
]
1
)

i
(
i
[x, , a, b,
i
) = IG( +a, [
i
+ 1/b]
1
)
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Hierarchical models
if you hate rats
Example (Rats)
Experiment where rats are intoxicated by a substance, then treated
by either a placebo or a drug:
x
ij
^(
i
,
2
c
), 1 j J
c
i
, control
y
ij
^(
i
+
i
,
2
a
), 1 j J
a
i
, intoxication
z
ij
^(
i
+
i
+
i
,
2
t
), 1 j J
t
i
, treatment
Additional variable w
i
, equal to 1 if the rat is treated with the
drug, and 0 otherwise.
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Hierarchical models
Example (Rats (2))
Prior distributions (1 i I),

i
^(

,
2

),
i
^(

,
2

),
and

i
^(
P
,
2
P
) or
i
^(
D
,
2
D
),
if ith rat treated with a placebo (P) or a drug (D)
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Hierarchical models
Example (Rats (2))
Prior distributions (1 i I),

i
^(

,
2

),
i
^(

,
2

),
and

i
^(
P
,
2
P
) or
i
^(
D
,
2
D
),
if ith rat treated with a placebo (P) or a drug (D)
Hyperparameters of the model,

,
P
,
D
,
c
,
a
,
t
,

,
P
,
D
,
associated with Jereys noninformative priors.
Alternative prior with two possible levels of intoxication

i
p^(
1
,
2
1
) + (1 p)^(
2
,
2
2
),
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Hierarchical models
Conditional decompositions
Easy decomposition of the posterior distribution
For instance, if
[
1

1
([
1
),
1

2
(
1
),
then
([x) =
_

1
([
1
, x)(
1
[x) d
1
,
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Hierarchical models
Conditional decompositions (2)
where
([
1
, x) =
f(x[)
1
([
1
)
m
1
(x[
1
)
,
m
1
(x[
1
) =
_

f(x[)
1
([
1
) d,
(
1
[x) =
m
1
(x[
1
)
2
(
1
)
m(x)
,
m(x) =
_

1
m
1
(x[
1
)
2
(
1
) d
1
.
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Hierarchical models
Conditional decompositions (3)
Moreover, this decomposition works for the posterior moments,
that is, for every function h,
E

[h()[x] = E
(
1
|x)
[E

1
[h()[
1
, x]] ,
where
E

1
[h()[
1
, x] =
_

h()([
1
, x) d.
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Hierarchical models
Example (Rats inc., continued
if you still hate rats
)
Posterior complete distribution given by
((
i
,
i
,
i
)
i
,

, . . . ,
c
, . . . [D)
I

i=1
_
exp (
i

)
2
/2
2

+ (
i

)
2
/2
2

J
c
i

j=1
exp (x
ij

i
)
2
/2
2
c

J
a
i

j=1
exp (y
ij

i

i
)
2
/2
2
a

J
t
i

j=1
exp (z
ij

i

i
)
2
/2
2
t

i
=0
exp (
i

P
)
2
/2
2
P

i
=1
exp (
i

D
)
2
/2
2
D

P
i
J
c
i
1
c

P
i
J
a
i
1
a

P
i
J
t
i
1
t
(

)
I1

I
D
1
D

I
P
1
P
,
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Hierarchical models
Local conditioning property
For the hierarchical model
() =
_

1
...
n

1
([
1
)
2
(
1
[
2
)
n+1
(
n
) d
1
d
n+1
.
we have
(
i
[x, ,
1
, . . . ,
n
) = (
i
[
i1
,
i+1
)
with the convention
0
= and
n+1
= 0.
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Hierarchical models
Example (Rats inc., terminated
still this zemmiphobia?!
)
The full conditional distributions correspond to standard
distributions and Gibbs sampling applies.


0 2000 4000 6000 8000 10000
1
.
6
0
1
.
7
0
1
.
8
0
1
.
9
0




0 2000 4000 6000 8000 10000
-
2
.
9
0
-
2
.
8
0
-
2
.
7
0
-
2
.
6
0




0 2000 4000 6000 8000 10000
0
.
4
0
0
.
5
0
0
.
6
0
0
.
7
0




0 2000 4000 6000 8000 10000
1
.
7
1
.
8
1
.
9
2
.
0
2
.
1


Convergence of the posterior means
1.0 1.2 1.4 1.6 1.8 2.0 2.2 2.4
0
2
0
4
0
6
0
8
0
1
0
0
1
2
0
control

-4.0 -3.5 -3.0 -2.5 -2.0 -1.5
0
2
0
4
0
6
0
8
0
1
0
0
1
4
0
intoxication

-1 0 1 2
0
5
0
1
0
0
1
5
0
2
0
0
2
5
0
3
0
0
placebo

0 1 2 3
0
5
0
1
0
0
1
5
0
2
0
0
2
5
0
drug

Posteriors of the eects
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Hierarchical models
Posterior Gibbs inference


D

P

D

P
Probability 1.00 0.9998 0.94 0.985
Condence [-3.48,-2.17] [0.94,2.50] [-0.17,1.24] [0.14,2.20]
Posterior probabilities of signicant eects
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Data Augmentation
Data Augmentation
The Gibbs sampler with only two steps is particularly useful
Algorithm (Data Augmentation)
Given y
(t)
,
1.. Simulate Y
(t+1)
1
g
1
(y
1
[y
(t)
2
) ;
2.. Simulate Y
(t+1)
2
g
2
(y
2
[y
(t+1)
1
) .
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Data Augmentation
Data Augmentation
The Gibbs sampler with only two steps is particularly useful
Algorithm (Data Augmentation)
Given y
(t)
,
1.. Simulate Y
(t+1)
1
g
1
(y
1
[y
(t)
2
) ;
2.. Simulate Y
(t+1)
2
g
2
(y
2
[y
(t+1)
1
) .
Theorem (Markov property)
Both (Y
(t)
1
) and (Y
(t)
2
) are Markov chains, with transitions
K
i
(x, x

) =
_
g
i
(y[x)g
3i
(x

[y) dy,
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Data Augmentation
Example (Grouped counting data)
360 consecutive records of the number of passages per unit time
Number of
passages 0 1 2 3 4 or more
Number of
observations 139 128 55 25 13
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Data Augmentation
Example (Grouped counting data (2))
Feature Observations with 4 passages and more are grouped
If observations are Poisson P(), the likelihood is
([x
1
, . . . , x
5
)
e
347

128+552+253
_
1 e

i=0

i
i!
_
13
,
which can be dicult to work with.
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Data Augmentation
Example (Grouped counting data (2))
Feature Observations with 4 passages and more are grouped
If observations are Poisson P(), the likelihood is
([x
1
, . . . , x
5
)
e
347

128+552+253
_
1 e

i=0

i
i!
_
13
,
which can be dicult to work with.
Idea With a prior () = 1/, complete the vector (y
1
, . . . , y
13
) of
the 13 units larger than 4
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Data Augmentation
Algorithm (Poisson-Gamma Gibbs)
a Simulate Y
(t)
i
P(
(t1)
) I
y4
i = 1, . . . , 13
b Simulate

(t)
(a
_
313 +
13

i=1
y
(t)
i
, 360
_
.
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Data Augmentation
Algorithm (Poisson-Gamma Gibbs)
a Simulate Y
(t)
i
P(
(t1)
) I
y4
i = 1, . . . , 13
b Simulate

(t)
(a
_
313 +
13

i=1
y
(t)
i
, 360
_
.
The Bayes estimator

=
1
360T
T

t=1
_
313 +
13

i=1
y
(t)
i
_
converges quite rapidly
to R& B


0 100 200 300 400 500
1
.
0
2
1
1
.
0
2
2
1
.
0
2
3
1
.
0
2
4
1
.
0
2
5
0.9 1.0 1.1 1.2
0
1
0
2
0
3
0
4
0
lambda
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Data Augmentation
Rao-Blackwellization
If (y
1
, y
2
, . . . , y
p
)
(t)
, t = 1, 2, . . . T is the output from a Gibbs
sampler

0
=
1
T
T

t=1
h
_
y
(t)
1
_

_
h(y
1
)g(y
1
)dy
1
and is unbiased.
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Data Augmentation
Rao-Blackwellization
If (y
1
, y
2
, . . . , y
p
)
(t)
, t = 1, 2, . . . T is the output from a Gibbs
sampler

0
=
1
T
T

t=1
h
_
y
(t)
1
_

_
h(y
1
)g(y
1
)dy
1
and is unbiased.
The Rao-Blackwellization replaces
0
with its conditional
expectation

rb
=
1
T
T

t=1
E
_
h(Y
1
)[y
(t)
2
, . . . , y
(t)
p
_
.
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Data Augmentation
Rao-Blackwellization (2)
Then
Both estimators converge to E[h(Y
1
)]
Both are unbiased,
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Data Augmentation
Rao-Blackwellization (2)
Then
Both estimators converge to E[h(Y
1
)]
Both are unbiased,
and
var
_
E
_
h(Y
1
)[Y
(t)
2
, . . . , Y
(t)
p
__
var(h(Y
1
)),
so
rb
is uniformly better (for Data Augmentation)
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Data Augmentation
Examples of Rao-Blackwellization
Example
Bivariate normal Gibbs sampler
X [ y N (y, 1
2
)
Y [ x N (x, 1
2
).
Then

0
=
1
T
T

i=1
X
(i)
and
1
=
1
T
T

i=1
E[X
(i)
[Y
(i)
] =
1
T
T

i=1
Y
(i)
,
estimate E[X] and
2

0
/
2

1
=
1

2
> 1.
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Data Augmentation
Examples of Rao-Blackwellization (2)
Example (Poisson-Gamma Gibbs contd)
Nave estimate

0
=
1
T
T

t=1

(t)
and Rao-Blackwellized version

=
1
T
T

t=1
E[
(t)
[x
1
, x
2
, . . . , x
5
, y
(i)
1
, y
(i)
2
, . . . , y
(i)
13
]
=
1
360T
T

t=1
_
313 +
13

i=1
y
(t)
i
_
,
back to graph
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Data Augmentation
NP Rao-Blackwellization & Rao-Blackwellized NP
Another substantial benet of Rao-Blackwellization is in the
approximation of densities of dierent components of y without
nonparametric density estimation methods.
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Data Augmentation
NP Rao-Blackwellization & Rao-Blackwellized NP
Another substantial benet of Rao-Blackwellization is in the
approximation of densities of dierent components of y without
nonparametric density estimation methods.
Lemma
The estimator
1
T
T

t=1
g
i
(y
i
[y
(t)
j
, j ,= i) g
i
(y
i
),
is unbiased.
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Data Augmentation
The Duality Principle
skip dual part
Ties together the properties of the two Markov chains in Data
Augmentation
Consider a Markov chain (X
(t)
) and a sequence (Y
(t)
) of random
variables generated from the conditional distributions
X
(t)
[y
(t)
(x[y
(t)
)
Y
(t+1)
[x
(t)
, y
(t)
f(y[x
(t)
, y
(t)
) .
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Data Augmentation
The Duality Principle
skip dual part
Ties together the properties of the two Markov chains in Data
Augmentation
Consider a Markov chain (X
(t)
) and a sequence (Y
(t)
) of random
variables generated from the conditional distributions
X
(t)
[y
(t)
(x[y
(t)
)
Y
(t+1)
[x
(t)
, y
(t)
f(y[x
(t)
, y
(t)
) .
Theorem (Duality properties)
If the chain (Y
(t)
) is ergodic then so is (X
(t)
) and the duality also
holds for geometric or uniform ergodicity.
Note
The chain (Y
(t)
) can be discrete, and the chain (X
(t)
) continuous.
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Improper Priors
Improper Priors
Unsuspected danger resulting from careless use of MCMC
algorithms:
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Improper Priors
Improper Priors
Unsuspected danger resulting from careless use of MCMC
algorithms:
It may happen that
all conditional distributions are well dened,
all conditional distributions may be simulated from, but...
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Improper Priors
Improper Priors
Unsuspected danger resulting from careless use of MCMC
algorithms:
It may happen that
all conditional distributions are well dened,
all conditional distributions may be simulated from, but...
the system of conditional distributions may not correspond to
any joint distribution
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Improper Priors
Improper Priors
Unsuspected danger resulting from careless use of MCMC
algorithms:
It may happen that
all conditional distributions are well dened,
all conditional distributions may be simulated from, but...
the system of conditional distributions may not correspond to
any joint distribution
Warning The problem is due to careless use of the Gibbs sampler
in a situation for which the underlying assumptions are violated
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Improper Priors
Example (Conditional exponential distributions)
For the model
X
1
[x
2
E xp(x
2
) , X
2
[x
1
E xp(x
1
)
the only candidate f(x
1
, x
2
) for the joint density is
f(x
1
, x
2
) exp(x
1
x
2
),
but
_
f(x
1
, x
2
)dx
1
dx
2
=
c _ These conditionals do not correspond to a joint
probability distribution
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Improper Priors
Example (Improper random eects)
Consider
Y
ij
= +
i
+
ij
, i = 1, . . . , I, j = 1, . . . , J,
where

i
N (0,
2
) and
ij
N (0,
2
),
the Jereys (improper) prior for the parameters , and is
(,
2
,
2
) =
1

2
.
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Improper Priors
Example (Improper random eects 2)
The conditional distributions

i
[y, ,
2
,
2
N
_
J( y
i
)
J +
2

2
, (J
2
+
2
)
1
_
,
[, y,
2
,
2
N ( y ,
2
/JI) ,

2
[, , y,
2
1(
_
I/2, (1/2)

2
i
_
,

2
[, , y,
2
1(
_
_
IJ/2, (1/2)

i,j
(y
ij

i
)
2
_
_
,
are well-dened and a Gibbs sampler can be easily implemented in
this setting.
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Improper Priors
-4 -3 -2 -1 0
0
5
1
0
1
5
2
0
2
5
3
0
(1000 iterations)
f
r
e
q
.

-
8
-
6
-
4
-
2
0
o
b
s
e
r
v
a
t
i
o
n
s
Example (Improper random
eects 2)
The gure shows the sequence of

(t)
s and its histogram over
1, 000 iterations. They both fail
to indicate that the
corresponding joint distribution
does not exist
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Improper Priors
Final notes on impropriety
The improper posterior Markov chain
cannot be positive recurrent
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Improper Priors
Final notes on impropriety
The improper posterior Markov chain
cannot be positive recurrent
The major task in such settings is to nd indicators that ag that
something is wrong. However, the output of an improper Gibbs
sampler may not dier from a positive recurrent Markov chain.
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Improper Priors
Final notes on impropriety
The improper posterior Markov chain
cannot be positive recurrent
The major task in such settings is to nd indicators that ag that
something is wrong. However, the output of an improper Gibbs
sampler may not dier from a positive recurrent Markov chain.
Example
The random eects model was initially treated in Gelfand et al.
(1990) as a legitimate model
Markov Chain Monte Carlo Methods
MCMC tools for variable dimension problems
MCMC tools for variable dimension problems
MCMC tools for variable dimension problems
Introduction
Greens method
Birth and Death processes
Markov Chain Monte Carlo Methods
MCMC tools for variable dimension problems
Introduction
A new brand of problems
There exist setups where
One of the things we do not know is the number
of things we do not know
[Peter Green]
Markov Chain Monte Carlo Methods
MCMC tools for variable dimension problems
Introduction
Bayesian Model Choice
Typical in model choice settings
- model construction (nonparametrics)
- model checking (goodness of t)
- model improvement (expansion)
- model prunning (contraction)
- model comparison
- hypothesis testing (Science)
- prediction (nance)
Markov Chain Monte Carlo Methods
MCMC tools for variable dimension problems
Introduction
Bayesian Model Choice II
Many areas of application

variable selection

change point(s) determination

image analysis

graphical models and expert systems

variable dimension models

causal inference
Markov Chain Monte Carlo Methods
MCMC tools for variable dimension problems
Introduction
Example (Mixture again, yes!)
Benchmark dataset: Speed of galaxies
[Roeder, 1990; Richardson & Green, 1997]
1.0 1.5 2.0 2.5 3.0 3.5
0
.
0
0
.
5
1
.
0
1
.
5
2
.
0
speeds
Markov Chain Monte Carlo Methods
MCMC tools for variable dimension problems
Introduction
Example (Mixture again (2))
Modelling by a mixture model
M
i
: x
j

i

=1
p
i
^(
i
,
2
i
) (j = 1, . . . , 82)
i?
Markov Chain Monte Carlo Methods
MCMC tools for variable dimension problems
Introduction
Bayesian variable dimension model
Denition
A variable dimension model is dened as a collection of models
(k = 1. . . . , K),
M
k
= f([
k
);
k

k
,
associated with a collection of priors on the parameters of these
models,

k
(
k
) ,
and a prior distribution on the indices of these models,
(k) , k = 1, . . . , K .
Alternative notation:
(M
k
,
k
) = (k)
k
(
k
)
Markov Chain Monte Carlo Methods
MCMC tools for variable dimension problems
Introduction
Bayesian solution
Formally over:
1. Compute
p(M
i
[x) =
p
i
_

i
f
i
(x[
i
)
i
(
i
)d
i

j
p
j
_

j
f
j
(x[
j
)
j
(
j
)d
j
2. Take largest p(M
i
[x) to determine model, or use

j
p
j
_

j
f
j
(x[
j
)
j
(
j
)d
j
as predictive
[Dierent decision theoretic perspectives]
Markov Chain Monte Carlo Methods
MCMC tools for variable dimension problems
Introduction
Diculties
Not at

(formal) inference level


[see above]

parameter space representation


=

k
,
[even if there are parameters common to several models]
Markov Chain Monte Carlo Methods
MCMC tools for variable dimension problems
Introduction
Diculties
Not at

(formal) inference level


[see above]

parameter space representation


=

k
,
[even if there are parameters common to several models]
Rather at

(practical) inference level:


model separation, interpretation, overtting, prior modelling,
prior coherence

computational level:
innity of models, moves between models, predictive
computation
Markov Chain Monte Carlo Methods
MCMC tools for variable dimension problems
Greens method
Greens resolution
Setting up a proper measuretheoretic framework for designing
moves between models M
k
[Green, 1995]
Markov Chain Monte Carlo Methods
MCMC tools for variable dimension problems
Greens method
Greens resolution
Setting up a proper measuretheoretic framework for designing
moves between models M
k
[Green, 1995]
Create a reversible kernel K on H =

k
k
k
such that
_
A
_
B
K(x, dy)(x)dx =
_
B
_
A
K(y, dx)(y)dy
for the invariant density [x is of the form (k,
(k)
)]
Markov Chain Monte Carlo Methods
MCMC tools for variable dimension problems
Greens method
Greens resolution (2)
Write K as
K(x, B) =

m=1
_

m
(x, y)q
m
(x, dy) +(x)I
B
(x)
where q
m
(x, dy) is a transition measure to model M
m
and

m
(x, y) the corresponding acceptance probability.
Markov Chain Monte Carlo Methods
MCMC tools for variable dimension problems
Greens method
Greens resolution (2)
Write K as
K(x, B) =

m=1
_

m
(x, y)q
m
(x, dy) +(x)I
B
(x)
where q
m
(x, dy) is a transition measure to model M
m
and

m
(x, y) the corresponding acceptance probability.
Introduce a symmetric measure
m
(dx, dy) on H
2
and impose on
(dx)q
m
(x, dy) to be absolutely continuous wrt
m
,
(dx)q
m
(x, dy)

m
(dx, dy)
= g
m
(x, y)
Markov Chain Monte Carlo Methods
MCMC tools for variable dimension problems
Greens method
Greens resolution (2)
Write K as
K(x, B) =

m=1
_

m
(x, y)q
m
(x, dy) +(x)I
B
(x)
where q
m
(x, dy) is a transition measure to model M
m
and

m
(x, y) the corresponding acceptance probability.
Introduce a symmetric measure
m
(dx, dy) on H
2
and impose on
(dx)q
m
(x, dy) to be absolutely continuous wrt
m
,
(dx)q
m
(x, dy)

m
(dx, dy)
= g
m
(x, y)
Then

m
(x, y) = min
_
1,
g
m
(y, x)
g
m
(x, y)
_
ensures reversibility
Markov Chain Monte Carlo Methods
MCMC tools for variable dimension problems
Greens method
Special case
When contemplating a move between two models, M
1
and M
2
,
the Markov chain being in state
1
M
1
, denote by K
12
(
1
, d)
and K
21
(
2
, d) the corresponding kernels, under the detailed
balance condition
(d
1
) K
12
(
1
, d) = (d
2
) K
21
(
2
, d) ,
and take, wlog, dim(M
2
) > dim(M
1
).
Markov Chain Monte Carlo Methods
MCMC tools for variable dimension problems
Greens method
Special case
When contemplating a move between two models, M
1
and M
2
,
the Markov chain being in state
1
M
1
, denote by K
12
(
1
, d)
and K
21
(
2
, d) the corresponding kernels, under the detailed
balance condition
(d
1
) K
12
(
1
, d) = (d
2
) K
21
(
2
, d) ,
and take, wlog, dim(M
2
) > dim(M
1
).
Proposal expressed as

2
=
12
(
1
, v
12
)
where v
12
is a random variable of dimension
dim(M
2
) dim(M
1
), generated as
v
12

12
(v
12
) .
Markov Chain Monte Carlo Methods
MCMC tools for variable dimension problems
Greens method
Special case (2)
In this case, q
12
(
1
, d
2
) has density

12
(v
12
)

12
(
1
, v
12
)
(
1
, v
12
)

1
,
by the Jacobian rule.
If probability
12
of choosing move to M
2
while in M
1
,
acceptance probability reduces to
(
1
, v
12
) = 1
(M
2
,
2
)
21
(M
1
,
1
)
12

12
(v
12
)

12
(
1
, v
12
)
(
1
, v
12
)

.
Markov Chain Monte Carlo Methods
MCMC tools for variable dimension problems
Greens method
Interpretation (1)
The representation puts us back in a xed dimension setting:

M
1
V
12
and M
2
are in one-to-one relation
Markov Chain Monte Carlo Methods
MCMC tools for variable dimension problems
Greens method
Interpretation (1)
The representation puts us back in a xed dimension setting:

M
1
V
12
and M
2
are in one-to-one relation

regular MetropolisHastings move from the couple (


1
, v
12
)
to
2
when stationary distributions are
(M
1
,
1
)
12
(v
12
)
and (M
2
,
2
), and when proposal distribution is deterministic
(??)
Markov Chain Monte Carlo Methods
MCMC tools for variable dimension problems
Greens method
Interpretation (2)
Consider, instead, the proposals

2
^(
12
(
1
, v
12
), ) and
12
(
1
, v
12
) ^(
2
, )
Markov Chain Monte Carlo Methods
MCMC tools for variable dimension problems
Greens method
Interpretation (2)
Consider, instead, the proposals

2
^(
12
(
1
, v
12
), ) and
12
(
1
, v
12
) ^(
2
, )
Reciprocal proposal has density
exp
_
(
2

12
(
1
, v
12
))
2
/2
_

12
(
1
, v
12
)
(
1
, v
12
)

by the Jacobian rule.


Markov Chain Monte Carlo Methods
MCMC tools for variable dimension problems
Greens method
Interpretation (2)
Consider, instead, the proposals

2
^(
12
(
1
, v
12
), ) and
12
(
1
, v
12
) ^(
2
, )
Reciprocal proposal has density
exp
_
(
2

12
(
1
, v
12
))
2
/2
_

12
(
1
, v
12
)
(
1
, v
12
)

by the Jacobian rule.


Thus MetropolisHastings acceptance probability is
1
(M
2
,
2
)
(M
1
,
1
)
12
(v
12
)

12
(
1
, v
12
)
(
1
, v
12
)

Does not depend on : Let go to 0


Markov Chain Monte Carlo Methods
MCMC tools for variable dimension problems
Greens method
Saturation
[Brooks, Giudici, Roberts, 2003]
Consider series of models M
i
(i = 1, . . . , k) such that
max
i
dim(M
i
) = n
max
<
Parameter of model M
i
then completed with an auxiliary variable
U
i
such that
dim(
i
, u
i
) = n
max
and U
i
q
i
(u
i
)
Posit the following joint distribution for [augmented] model M
i
(M
i
,
i
) q
i
(u
i
)
Markov Chain Monte Carlo Methods
MCMC tools for variable dimension problems
Greens method
Back to xed dimension
Saturation: no varying dimension anymore since (
i
, u
i
) of xed
dimension.
Markov Chain Monte Carlo Methods
MCMC tools for variable dimension problems
Greens method
Back to xed dimension
Saturation: no varying dimension anymore since (
i
, u
i
) of xed
dimension.
Algorithm (Three stage MCMC update)
1. Update the current value of the parameter,
i
;
2. Update u
i
conditional on
i
;
3. Update the current model from M
i
to M
j
using the bijection
(
j
, u
j
) =
ij
(
i
, u
i
)
Markov Chain Monte Carlo Methods
MCMC tools for variable dimension problems
Greens method
Example (Mixture of normal distributions)
M
k
:
k

j=1
p
jk
^(
jk
,
2
jk
)
[Richardson & Green, 1997]
Moves:
Markov Chain Monte Carlo Methods
MCMC tools for variable dimension problems
Greens method
Example (Mixture of normal distributions)
M
k
:
k

j=1
p
jk
^(
jk
,
2
jk
)
[Richardson & Green, 1997]
Moves:
(i) Split
_
_
_
p
jk
= p
j(k+1)
+p
(j+1)(k+1)
p
jk

jk
= p
j(k+1)

j(k+1)
+p
(j+1)(k+1)

(j+1)(k+1)
p
jk

2
jk
= p
j(k+1)

2
j(k+1)
+p
(j+1)(k+1)

2
(j+1)(k+1)
(ii) Merge (reverse)
Markov Chain Monte Carlo Methods
MCMC tools for variable dimension problems
Greens method
Example (Mixture (2))
Additional Birth and Death moves for empty components
(created from the prior distribution)
Equivalent
(i). Split
(T)
_

_
u
1
, u
2
, u
3
|(0, 1)
p
j(k+1)
= u
1
p
jk

j(k+1)
= u
2

jk

2
j(k+1)
= u
3

2
jk
Markov Chain Monte Carlo Methods
MCMC tools for variable dimension problems
Greens method
Histogram of k
k
1 2 3 4 5
0
.
0
0
.
1
0
.
2
0
.
3
0
.
4
0e+00 2e+04 4e+04 6e+04 8e+04 1e+05
1
2
3
4
5
Rawplot of k
k
Histogram and rawplot of
100, 000 ks under the
constraint k 5.
Normalised enzyme dataset
0.0 0.5 1.0 1.5 2.0 2.5 3.0
0
.
0
0
.
5
1
.
0
1
.
5
2
.
0
2
.
5
3
.
0
Markov Chain Monte Carlo Methods
MCMC tools for variable dimension problems
Greens method
Example (Hidden Markov model)
move to birth
Extension of the mixture model
P(X
t
+ 1 = j[X
t
= i) = w
ij
,
w
ij
=
ij
/

i
,
Y
t
[X
t
= i ^(
i
,
2
i
).
Markov Chain Monte Carlo Methods
MCMC tools for variable dimension problems
Greens method
. . .
-

Y
t
6

X
t
-

Y
t+1
6

X
t+1
-
. . .
Markov Chain Monte Carlo Methods
MCMC tools for variable dimension problems
Greens method
Example (Hidden Markov model (2))
Move to split component j

into j
1
and j
2
:

ij
1
=
ij

i
,
ij
2
=
ij

(1
i
),
i
|(0, 1);

j
1
j
=
j

j
,
j
2
j
=
j

j
/
j
,
j
log ^(0, 1);
similar ideas give
j
1
j
2
etc.;

j
1
=
j

3
j

,
j
2
=
j

+ 3
j

^(0, 1);

2
j
1
=
2
j

,
2
j
2
=
2
j

log ^(0, 1).


[Robert & al., 2000]
Markov Chain Monte Carlo Methods
MCMC tools for variable dimension problems
Greens method
0 10000 20000 30000 40000
1
2
3
4
5
0 50000 100000 150000 200000
0
0.5
1
0 5000 10000 15000 20000
0
0.005
0.01
0.015
Upper panel: First 40,000 values of k for S&P 500 data, plotted
every 20th sweep. Middle panel: estimated posterior distribution
of k for S&P 500 data as a function of number of sweeps. Lower
panel:
1
and
2
in rst 20,000 sweeps with k = 2 for S&P 500
data.
Markov Chain Monte Carlo Methods
MCMC tools for variable dimension problems
Greens method
Example (Autoregressive model)
move to birth
Typical setting for model choice: determine order p of AR(p)
model
Markov Chain Monte Carlo Methods
MCMC tools for variable dimension problems
Greens method
Example (Autoregressive model)
move to birth
Typical setting for model choice: determine order p of AR(p)
model
Consider the (less standard) representation
p

i=1
(1
i
B) X
t
=
t
,
t
^(0,
2
)
where the
i
s are within the unit circle if complex and within
[1, 1] if real.
[Huerta and West, 1998]
Markov Chain Monte Carlo Methods
MCMC tools for variable dimension problems
Greens method
AR(p) reversible jump algorithm
Example (Autoregressive (2))
Uniform priors for the real and complex roots
j
,
1
k/2 + 1

i
R
1
2
I
|
i
|<1

i
R
1

I
|
i
|<1
and (purely birth-and-death) proposals based on these priors

k k+1 [Creation of real root]

k k+2 [Creation of complex root]

k k-1 [Deletion of real root]

k k-2 [Deletion of complex root]


Markov Chain Monte Carlo Methods
MCMC tools for variable dimension problems
Birth and Death processes
Birth and Death processes
instant death!
Use of an alternative methodology based on a Birth&-Death
(point) process
[Preston, 1976; Ripley, 1977; Geyer & Mller, 1994; Stevens, 1999]
Markov Chain Monte Carlo Methods
MCMC tools for variable dimension problems
Birth and Death processes
Birth and Death processes
instant death!
Use of an alternative methodology based on a Birth&-Death
(point) process
[Preston, 1976; Ripley, 1977; Geyer & Mller, 1994; Stevens, 1999]
Idea: Create a Markov chain in continuous time, i.e. a Markov
jump process, moving between models M
k
, by births (to increase
the dimension), deaths (to decrease the dimension), and other
moves.
Markov Chain Monte Carlo Methods
MCMC tools for variable dimension problems
Birth and Death processes
Birth and Death processes
Time till next modication (jump) is exponentially distributed
with rate depending on current state
Remember: if
1
, . . . ,
v
are exponentially distributed,
i
c(
i
),
min
i
c
_

i
_
Markov Chain Monte Carlo Methods
MCMC tools for variable dimension problems
Birth and Death processes
Birth and Death processes
Time till next modication (jump) is exponentially distributed
with rate depending on current state
Remember: if
1
, . . . ,
v
are exponentially distributed,
i
c(
i
),
min
i
c
_

i
_
Dierence with MH-MCMC: Whenever a jump occurs, the
corresponding move is always accepted. Acceptance probabilities
replaced with holding times.
Implausible congurations
L()() 1
die quickly.
Markov Chain Monte Carlo Methods
MCMC tools for variable dimension problems
Birth and Death processes
Balance condition
Sucient to have detailed balance
L()()q(,

) = L(

)(

)q(

, ) for all ,

for () L()() to be stationary.


Here q(,

) rate of moving from state to

.
Possibility to add split/merge and xed-k processes if balance
condition satised.
Markov Chain Monte Carlo Methods
MCMC tools for variable dimension problems
Birth and Death processes
Example (Mixture contd)
Stephens original modelling:

Representation as a (marked) point process


=
_
p
j
, (
j
,
j
)
_
j

Birth rate
0
(constant)

Birth proposal from the prior

Death rate
j
() for removal of point j

Death proposal removes component and modies weights


Markov Chain Monte Carlo Methods
MCMC tools for variable dimension problems
Birth and Death processes
Example (Mixture contd (2))

Overall death rate


k

j=1

j
() = ()

Balance condition
(k+1) d(p, (, )) L(p, (, )) =
0
L()
(k)
(k + 1)
with
d( p
j
, (
j
,
j
)) =
j
()

Case of Poisson prior k Toi(


1
)

j
() =

0

1
L( p
j
, (
j
,
j
))
L()
Markov Chain Monte Carlo Methods
MCMC tools for variable dimension problems
Birth and Death processes
Stephens original algorithm
Algorithm (Mixture Birth& Death)
For v = 0, 1, , V
t v
Run till t > v + 1
1. Compute
j
() =
L([
j
)
L()

1
2. ()
k

j=1

j
(
j
),
0
+(), u |([0, 1])
3. t t ulog(u)
Markov Chain Monte Carlo Methods
MCMC tools for variable dimension problems
Birth and Death processes
Algorithm (Mixture Birth& Death (contd))
4. With probability ()/
Remove component j with probability
j
()/()
k k 1
p

/(1 p
j
) ( ,= j)
Otherwise,
Add component j from the prior (
j
,
j
) p
j
Be(, k)
p

(1 p
j
) ( ,= j)
k k + 1
5. Run I MCMC(k, , p)
Markov Chain Monte Carlo Methods
MCMC tools for variable dimension problems
Birth and Death processes
Rescaling time
move to HMM
In discrete-time RJMCMC, let the time unit be 1/N,
put

k
=
k
/N and
k
= 1
k
/N
As N , each birth proposal will be accepted, and having k components births occur according to a
Poisson process with rate
k
while component (w, ) dies with rate
lim
N
N
k+1

1
k + 1
min(A
1
, 1)
= lim
N
N
1
k + 1
likelihood ratio
1

k+1

b(w, )
(1 w)
k1
= likelihood ratio
1

k
k + 1

b(w, )
(1 w)
k1
.
Hence RJMCMCBDMCMC. This holds more generally.
Markov Chain Monte Carlo Methods
MCMC tools for variable dimension problems
Birth and Death processes
Example (HMM models (contd))
Implementation of the split-and-combine rule of Richardson and
Green (1997) in continuous time
Markov Chain Monte Carlo Methods
MCMC tools for variable dimension problems
Birth and Death processes
Example (HMM models (contd))
Implementation of the split-and-combine rule of Richardson and
Green (1997) in continuous time
Move to split component j

into j
1
and j
2
:

ij
1
=
ij

i
,
ij
2
=
ij

(1
i
),
i
|(0, 1);

j
1
j
=
j

j
,
j
2
j
=
j

j
/
j
,
j
log ^(0, 1);
similar ideas give
j
1
j
2
etc.;

j
1
=
j

3
j

,
j
2
=
j

+ 3
j

^(0, 1);

2
j
1
=
2
j

,
2
j
2
=
2
j

log ^(0, 1).


[Cappe & al, 2001]
Markov Chain Monte Carlo Methods
MCMC tools for variable dimension problems
Birth and Death processes
Wind intensity in Athens
5 0 5
0
.
0
0
.
1
0
.
2
0
.
3
0
.
4
0
.
5

2
0
2
4
6
Histogram and rawplot of 500 wind intensities in Athens
Markov Chain Monte Carlo Methods
MCMC tools for variable dimension problems
Birth and Death processes
Number of states
temp[, 1]
R
e
la
t
iv
e

F
r
e
q
u
e
n
c
y
2 4 6 8 10
0
.
0
0
.
2
0
.
4
0
.
6
0 500 1000 1500 2000 2500
1
2
3
4
5
instants
n
u
m
b
e
r

o
f

s
t
a
t
e
s
Log likelihood values
temp[, 2]
R
e
la
t
iv
e

F
r
e
q
u
e
n
c
y
1400 1200 1000 800 600 400 200
0
.
0
0
0
0
.
0
1
0
0
.
0
2
0
0
.
0
3
0
0 500 1000 1500 2000 2500

1
4
0
0

1
0
0
0

6
0
0

2
0
0
instants
lo
g

lik
e
lih
o
o
d
Number of moves
temp[, 3]
R
e
la
t
iv
e

F
r
e
q
u
e
n
c
y
5 10 15 20 25 30
0
.
0
0
0
.
0
5
0
.
1
0
0
.
1
5
0 500 1000 1500 2000 2500
5
1
0
2
0
3
0
instants
N
u
m
b
e
r

o
f

m
o
v
e
s
MCMC output on k (histogram and rawplot), corresponding
loglikelihood values (histogram and rawplot), and number of
moves (histogram and rawplot)
Markov Chain Monte Carlo Methods
MCMC tools for variable dimension problems
Birth and Death processes
0 500 1000 1500
0
.
0
0
.
2
0
.
4
0
.
6
0
.
8
1
.
0

0 500 1000 1500


0
1
2
3
4
5

MCMC sequence of the probabilities


j
of the stationary
distribution (top) and the parameters (bottom) of the
three components when conditioning on k = 3
Markov Chain Monte Carlo Methods
MCMC tools for variable dimension problems
Birth and Death processes
5 0 5
0
.
0
0
.
1
0
.
2
0
.
3
0
.
4
0
.
5
0
.
6

MCMC evaluation of the marginal density of the dataset
(dashes), compared with R nonparametric density estimate
(solid lines).
Markov Chain Monte Carlo Methods
Sequential importance sampling
Sequential importance sampling
basic importance
Sequential importance sampling
Adaptive MCMC
Importance sampling revisited
Dynamic extensions
Population Monte Carlo
Markov Chain Monte Carlo Methods
Sequential importance sampling
Adaptive MCMC
Adaptive MCMC is not possible
Algorithms trained on-line usually invalid:
Markov Chain Monte Carlo Methods
Sequential importance sampling
Adaptive MCMC
Adaptive MCMC is not possible
Algorithms trained on-line usually invalid:
using the whole past of the chain implies that this is not a
Markov chain any longer!
Markov Chain Monte Carlo Methods
Sequential importance sampling
Adaptive MCMC
Example (Poly t distribution)
Consider a t-distribution T (3, , 1) sample (x
1
, . . . , x
n
) with a at
prior () = 1
If we try t a normal proposal from empirical mean and variance of
the chain so far,

t
=
1
t
t

i=1

(i)
and
2
t
=
1
t
t

i=1
(
(i)

t
)
2
,
Markov Chain Monte Carlo Methods
Sequential importance sampling
Adaptive MCMC
Example (Poly t distribution)
Consider a t-distribution T (3, , 1) sample (x
1
, . . . , x
n
) with a at
prior () = 1
If we try t a normal proposal from empirical mean and variance of
the chain so far,

t
=
1
t
t

i=1

(i)
and
2
t
=
1
t
t

i=1
(
(i)

t
)
2
,
MetropolisHastings algorithm with acceptance probability
n

j=2
_
+ (x
j

(t)
)
2
+ (x
j
)
2
_
(+1)/2
exp (
t

(t)
)
2
/2
2
t
exp(
t
)
2
/2
2
t
,
where ^(
t
,
2
t
).
Markov Chain Monte Carlo Methods
Sequential importance sampling
Adaptive MCMC
Example (Poly t distribution (2))
Invalid scheme:

when range of initial values too small, the


(i)
s cannot
converge to the target distribution and concentrates on too
small a support.

long-range dependence on past values modies the


distribution of the sequence.

using past simulations to create a non-parametric


approximation to the target distribution does not work either
Markov Chain Monte Carlo Methods
Sequential importance sampling
Adaptive MCMC
0 1000 2000 3000 4000 5000

0
.4

0
.2
0
.0
0
.2
Iterations
x

1.5 1.0 0.5 0.0 0.5
0
1
2
3
0 1000 2000 3000 4000 5000

1
.5

1
.0

0
.5
0
.0
0
.5
1
.0
1
.5
Iterations
x

2 1 0 1 2
0
.0
0
.2
0
.4
0
.6
0 1000 2000 3000 4000 5000

1
0
1
2
Iterations
x

2 1 0 1 2 3
0
.0
0
.1
0
.2
0
.3
0
.4
0
.5
0
.6
0
.7
Adaptive scheme for a sample of 10 x
j
T

and initial
variances of (top) 0.1, (middle) 0.5, and (bottom) 2.5.
Markov Chain Monte Carlo Methods
Sequential importance sampling
Adaptive MCMC

2 1 0 1 2
0
.
0
0
.
1
0
.
2
0
.
3
0
.
4
0
.
5
0
.
6
Comparison of the distribution of an adaptive scheme sample
of 25, 000 points with initial variance of 2.5 and of the target
distribution.
Markov Chain Monte Carlo Methods
Sequential importance sampling
Adaptive MCMC
0 10000 30000 50000

1
.
0

0
.
5
0
.
0
0
.
5
1
.
0
1
.
5
Iterations
x

1.5 0.5 0.5 1.0 1.5
0
.
0
0
.
2
0
.
4
0
.
6
0
.
8
1
.
0
Sample produced by 50, 000 iterations of a nonparametric
adaptive MCMC scheme and comparison of its distribution
with the target distribution.
Markov Chain Monte Carlo Methods
Sequential importance sampling
Adaptive MCMC
Simply forget about it!
Warning:
One should not constantly adapt the proposal on past
performances
Either adaptation ceases after a period of burnin
or the adaptive scheme must be theoretically assessed on its own
right.
Markov Chain Monte Carlo Methods
Sequential importance sampling
Importance sampling revisited
Importance sampling revisited
Approximation of integrals
back to basic importance
I =
_
h(x)(x)dx
by unbiased estimators

I =
1
n
n

i=1

i
h(x
i
)
when
x
1
, . . . , x
n
iid
q(x) and
i
def
=
(x
i
)
q(x
i
)
Markov extension
For densities f and g, and importance weight
$$\omega(x) = f(x)/g(x)\,,$$
for any kernel K(x, x′) with stationary distribution f,
$$\int \omega(x)\, K(x, x')\, g(x)\, dx = f(x')\,.$$
[MacEachern, Clyde, and Liu, 1999]
Consequence: an importance sample transformed by MCMC transitions keeps its weights.
Unbiasedness preservation:
$$\mathbb{E}\left[\omega(X)\, h(X')\right] = \int \omega(x)\, h(x')\, K(x, x')\, g(x)\, dx\, dx' = \mathbb{E}_f[h(X)]$$
Not so exciting!
The weights do not change!
If x has a small weight ω(x) = f(x)/g(x), then x′ ∼ K(x, x′) keeps this small weight.
Pros and cons of importance sampling vs. MCMC
- Production of a sample (IS) vs. of a Markov chain (MCMC)
- Dependence on importance function (IS) vs. on previous value (MCMC)
- Unbiasedness (IS) vs. convergence to the true distribution (MCMC)
- Variance control (IS) vs. learning costs (MCMC)
- Recycling of past simulations (IS) vs. progressive adaptability (MCMC)
- Processing of moving targets (IS) vs. handling large dimensional problems (MCMC)
- Non-asymptotic validity (IS) vs. difficult asymptotia for adaptive algorithms (MCMC)
Dynamic extensions

Dynamic importance sampling
Idea: it is possible to generalise importance sampling using random weights ω_t such that
$$\mathbb{E}[\omega_t \mid x_t] = \pi(x_t)/g(x_t)$$
(a) Self-regenerative chains
[Sahu & Zhigljavsky, 1998; Gasemyr, 2002]
Proposal Y ∼ p(y) ∝ p̃(y) and target distribution π(y) ∝ π̃(y)
Ratios
$$\omega(x) = \pi(x)/p(x)\ \text{[unknown]} \qquad\text{and}\qquad \tilde\omega(x) = \tilde\pi(x)/\tilde{p}(x)\ \text{[known]}$$
Acceptance function
$$\alpha(x) = \frac{1}{1 + \kappa\,\tilde\omega(x)} > 0$$
Geometric jumps
Theorem
If Y ∼ p(y) and
$$W \mid Y = y \sim \mathcal{G}\!eo(\alpha(y))\,,$$
then
$$X_t = \cdots = X_{t+W-1} = Y \neq X_{t+W}$$
defines a Markov chain with stationary distribution π.
Plusses
- Valid for any choice of κ [κ small = large variance, and κ large = slow convergence]
- Only depends on the current value [difference with Metropolis]
- Random integer weight W [similarity with Metropolis]
- Saves on the rejections: always accept [difference with Metropolis]
- Introduces geometric noise compared with importance sampling:
$$\sigma^2_{SZ} = 2\,\sigma^2_{IS} + (1/\kappa)\,\sigma^2_{\pi}$$
- Can be used with a sequence of proposals p_k and constants κ_k [adaptivity]
A generalisation
[Gasemyr, 2002]
Proposal density p(y) and probability q(y) of accepting a jump.
Algorithm (Gasemyr's dynamic weights)
Generate a sequence of random weights W_n by
1. Generate Y_n ∼ p(y)
2. Generate V_n ∼ B(q(y_n))
3. Generate S_n ∼ Geo(α(y_n))
4. Take W_n = V_n S_n
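A minimal sketch of this weight-generation loop, with an illustrative proposal N(0, 2²) and hypothetical q(·) and α(·) chosen only so that both stay in (0, 1]:

import numpy as np

rng = np.random.default_rng(2)

def q_jump(y):                            # illustrative jump probability
    return 1.0 / (1.0 + 0.5 * y ** 2)

def alpha(y):                             # illustrative geometric parameter
    return min(1.0, 0.2 + 0.1 * y ** 2)

def dynamic_weights(n):
    """Generate n pairs (Y_k, W_k) with W_k = V_k * S_k."""
    pairs = []
    for _ in range(n):
        y = rng.normal(0.0, 2.0)          # 1. Y_n ~ p
        v = rng.binomial(1, q_jump(y))    # 2. V_n ~ B(q(y_n))
        s = rng.geometric(alpha(y))       # 3. S_n ~ Geo(alpha(y_n)), S >= 1
        pairs.append((y, v * s))          # 4. W_n = V_n S_n
    return pairs

The values with W_n = 0 are simply skipped by the induced chain (X_t) of the next slide.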
Validation
For
$$\nu(y) = \frac{p(y)\,q(y)}{\int p(y)\,q(y)\,dy}\,,$$
the chain (X_t) associated with the sequence (Y_n, W_n) by
$$Y_1 = X_1 = \cdots = X_{1+W_1-1}\,, \quad Y_2 = X_{1+W_1} = \cdots$$
is a Markov chain with transition
$$K(x, y) = \alpha(x)\,\nu(y)\,,$$
which has a point mass at y = x with weight 1 − α(x).
Ergodicity for Gasemyr's scheme
Necessary and sufficient condition: π is stationary for (X_t) iff
$$\alpha(y) = q(y)\big/\left(\kappa\,\pi(y)/p(y)\right) = q(y)/(\kappa\, w(y))$$
for some constant κ.
This implies that
$$\mathbb{E}[W_n \mid Y_n = y] = \kappa\, w(y)\,.$$
[Average importance sampling]
Special case: α(y) = 1/(1 + κ w(y)) of Sahu and Zhigljavsky (2001)
Properties
Constraint on κ: for α(y) ≤ 1, κ must be such that
$$p(y)\,q(y) \le \kappa\,\pi(y)\,.$$
Reverse of the accept-reject condition (!)
The variance of
$$\sum_n W_n\, h(Y_n) \Big/ \sum_n W_n \qquad (4)$$
is
$$2\int \left(h(y) - \mu\right)^2\, \frac{w(y)}{q(y)}\,\pi(y)\,dy \;-\; (1/\kappa)\,\sigma^2_{\pi}\,, \qquad \mu = \mathbb{E}_\pi[h]\,,$$
by Cramér–Wold/Slutsky.
Still worse than importance sampling.
(b) Dynamic weighting
[Wong & Liang, 1997; Liu, Liang & Wong, 2001; Liang, 2002]
Generalisation of the above: simultaneous generation of points and weights, (θ_t, ω_t), under the constraint
$$\mathbb{E}[\omega_t \mid \theta_t] \propto \pi(\theta_t) \qquad (5)$$
Same use as importance sampling weights
Algorithm (Liang's dynamic importance sampling)
1. Generate y ∼ K(x, y) and compute
$$\varrho = \omega\,\frac{\pi(y)\,K(y, x)}{\pi(x)\,K(x, y)}$$
2. Generate u ∼ U(0, 1) and take
$$(x', \omega') = \begin{cases} \left(y,\ (1+\delta)\varrho/a\right) & \text{if } u < a \\ \left(x,\ (1+\delta)\omega/(1-a)\right) & \text{otherwise} \end{cases}$$
where a = ϱ/(ϱ + θ), θ = θ(x, ω), and δ > 0 is a constant or an independent rv
Preservation of the equilibrium equation
If g₋ and g₊ denote the distributions of the augmented variable (X, W) before and after the step, respectively, then, using $\int_0^\infty \omega\, g_-(x, \omega)\, d\omega = c_0\,\pi(x)$,
$$\begin{aligned}
\int_0^\infty \omega'\, g_+(x', \omega')\, d\omega'
&= \int (1+\delta)\left[\varrho(\omega, x, x') + \theta\right] g_-(x, \omega)\, K(x, x')\, \frac{\varrho(\omega, x, x')}{\varrho(\omega, x, x') + \theta}\, dx\, d\omega \\
&\quad + \int (1+\delta)\, \frac{\omega\left(\varrho(\omega, x', z) + \theta\right)}{\theta}\, g_-(x', \omega)\, K(x', z)\, \frac{\theta}{\varrho(\omega, x', z) + \theta}\, dz\, d\omega \\
&= (1+\delta)\left[\iint g_-(x, \omega)\, \omega\, \frac{\pi(x')\,K(x', x)}{\pi(x)}\, dx\, d\omega + \int \omega\, g_-(x', \omega)\, K(x', z)\, dz\, d\omega\right] \\
&= (1+\delta)\left[\pi(x') \int c_0\, K(x', x)\, dx + c_0\,\pi(x')\right] = 2(1+\delta)\, c_0\,\pi(x')\,,
\end{aligned}$$
where c₀ is the proportionality constant.
Special case: R-move
[Liang, 2002]
δ = 0 and θ ≡ 1, and thus
$$(x', \omega') = \begin{cases} (y,\ \varrho + 1) & \text{if } u < \varrho/(\varrho + 1) \\ (x,\ \omega(\varrho + 1)) & \text{otherwise} \end{cases}$$
[Importance sampling]
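A minimal sketch of one R-move update, assuming a symmetric random-walk kernel (so the K terms cancel in ϱ) and an illustrative N(0, 1) target; neither choice comes from the slides:

import numpy as np

rng = np.random.default_rng(3)

def pi(x):                                 # illustrative target, unnormalised
    return np.exp(-x ** 2 / 2)

def r_move(x, w, scale=1.0):
    """One R-move on the weighted pair (x, w)."""
    y = x + rng.normal(0.0, scale)         # y ~ K(x, .), symmetric kernel
    rho = w * pi(y) / pi(x)                # K terms cancel by symmetry
    if rng.uniform() < rho / (rho + 1.0):  # jump with probability rho/(rho+1)
        return y, rho + 1.0
    return x, w * (rho + 1.0)              # stay and inflate the weight

x, w = 0.0, 1.0
for _ in range(1000):
    x, w = r_move(x, w)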
Special case: W-move
θ ≡ 0, thus a = 1 and
$$(x', \omega') = (y,\ \varrho)\,.$$
Q-move
[Liu & al, 2001]
$$(x', \omega') = \begin{cases} (y,\ \varrho \vee \theta) & \text{if } u < 1 \wedge \varrho/\theta\,, \\ (x,\ a\,\omega) & \text{otherwise,} \end{cases}$$
with a ≥ 1 either a constant or an independent random variable.
Notes
- The updating step in the Q and R schemes can be written as
$$(x_{t+1}, \omega_{t+1}) = \left(x_t,\ \omega_t/\Pr(R_t = 0)\right)$$
with probability Pr(R_t = 0), and
$$(x_{t+1}, \omega_{t+1}) = \left(y_{t+1},\ \omega_t\, r(x_t, y_{t+1})/\Pr(R_t = 1)\right)$$
with probability Pr(R_t = 1), where R_t is the move indicator, y_{t+1} ∼ K(x_t, y), and $r(x, y) = \pi(y)K(y, x)\big/\pi(x)K(x, y)$, so that ϱ = ω_t r(x_t, y_{t+1}).
Notes (2)
- Geometric structure of the weights:
$$\Pr(R_t = 0) = \frac{\omega_t}{\omega_{t+1}}$$
and
$$\Pr(R_t = 1) = \frac{\omega_t\, r(x_t, y_t)}{\omega_t\, r(x_t, y_t) + \theta}\,, \qquad \theta > 0\,,$$
for the R scheme
- The number of steps T before an acceptance (a jump) is such that
$$\Pr(T \ge t) = \Pr(R_1 = 0, \ldots, R_{t-1} = 0) = \mathbb{E}\left[\prod_{j=0}^{t-1} \frac{\omega_j}{\omega_{j+1}}\right] = \omega_0\, \mathbb{E}\left[1/\omega_t\right].$$
Alternative scheme
Preservation of the weight expectation:
$$(x_{t+1}, \omega_{t+1}) = \left(x_t,\ \varepsilon_t\,\omega_t/\Pr(R_t = 0)\right)$$
with probability Pr(R_t = 0), and
$$(x_{t+1}, \omega_{t+1}) = \left(y_{t+1},\ (1 - \varepsilon_t)\,\omega_t\, r(x_t, y_{t+1})/\Pr(R_t = 1)\right)$$
with probability Pr(R_t = 1).
Alternative scheme (2)
Then
$$\Pr(T = t) = \Pr(R_1 = 0, \ldots, R_{t-1} = 0,\ R_t = 1) = \mathbb{E}\left[\prod_{j=1}^{t-1} \varepsilon_j\ (1 - \varepsilon_t)\, \frac{\omega_0\, r(x_0, Y_t)}{\omega_t}\right],$$
which is equal to
$$\varepsilon^{t-1}(1 - \varepsilon)\, \mathbb{E}\left[\omega_0\, r(x_0, Y_t)/\omega_t\right]$$
when the ε_j's are constant and deterministic.
Example
Choose a function 0 < ε(·, ·) < 1 and take, while in (x_0, ω_0),
$$(x_1, \omega_1) = \left(y_1,\ \frac{(1 - \varepsilon(x_0, y_1))\,\omega_0\, r(x_0, y_1)}{\gamma(x_0, y_1)}\right)$$
with probability
$$\gamma(x_0, y_1) \overset{\text{def}}{=} \min\left(1,\ \omega_0\, r(x_0, y_1)\right)$$
and
$$(x_1, \omega_1) = \left(x_0,\ \frac{\varepsilon(x_0, y_1)\,\omega_0}{1 - \gamma(x_0, y_1)}\right)$$
with probability 1 − γ(x_0, y_1).
Population Monte Carlo
Idea: simulate from the product distribution
$$\pi^{\otimes n}(x_1, \ldots, x_n) = \prod_{i=1}^{n} \pi(x_i)$$
and apply dynamic importance sampling to the sample (a.k.a. population)
$$x^{(t)} = \left(x_1^{(t)}, \ldots, x_n^{(t)}\right)$$
Iterated importance sampling
As in Markov chain Monte Carlo (MCMC) algorithms, introduce a temporal dimension:
$$x_i^{(t)} \sim q_t\left(x \mid x_i^{(t-1)}\right)\,, \qquad i = 1, \ldots, n,\ t = 1, \ldots$$
and
$$\hat{I}_t = \frac{1}{n}\sum_{i=1}^{n} \varrho_i^{(t)}\, h\left(x_i^{(t)}\right)$$
is still unbiased for
$$\varrho_i^{(t)} = \frac{\pi\left(x_i^{(t)}\right)}{q_t\left(x_i^{(t)} \mid x_i^{(t-1)}\right)}\,, \qquad i = 1, \ldots, n$$
Fundamental importance equality
Preservation of unbiasedness:
$$\mathbb{E}\left[h\left(X^{(t)}\right)\frac{\pi\left(X^{(t)}\right)}{q_t\left(X^{(t)} \mid X^{(t-1)}\right)}\right] = \int h(x)\,\frac{\pi(x)}{q_t(x \mid y)}\; q_t(x \mid y)\, g(y)\, dx\, dy = \int h(x)\,\pi(x)\, dx$$
for any distribution g on X^{(t−1)}
Sequential variance decomposition
Furthermore,
$$\operatorname{var}\left(\hat{I}_t\right) = \frac{1}{n^2}\sum_{i=1}^{n} \operatorname{var}\left(\varrho_i^{(t)}\, h\left(x_i^{(t)}\right)\right),$$
if var(ϱ_i^{(t)}) exists, because the x_i^{(t)}'s are conditionally uncorrelated.
Note: this decomposition is still valid for [in i] correlated x_i^{(t)}'s when incorporating the weights ϱ_i^{(t)}
Simulation of a population
The importance distribution of the sample (a.k.a. particles)
$$x^{(t)} \sim q_t\left(x^{(t)} \mid x^{(t-1)}\right)$$
can depend on the previous sample x^{(t−1)} in any possible way, as long as the marginal distributions
$$q_{it}(x) = \int q_t\left(x^{(t)}\right)\, dx_{-i}^{(t)}$$
can be expressed, to build the importance weights
$$\varrho_{it} = \frac{\pi\left(x_i^{(t)}\right)}{q_{it}\left(x_i^{(t)}\right)}$$
Special case of the product proposal
If
$$q_t\left(x^{(t)} \mid x^{(t-1)}\right) = \prod_{i=1}^{n} q_{it}\left(x_i^{(t)} \mid x^{(t-1)}\right)$$
[independent proposals], then
$$\operatorname{var}\left(\hat{I}_t\right) = \frac{1}{n^2}\sum_{i=1}^{n} \operatorname{var}\left(\varrho_i^{(t)}\, h\left(x_i^{(t)}\right)\right)$$
Validation
$$\mathbb{E}\left[\varrho_i^{(t)} h\left(X_i^{(t)}\right)\, \varrho_j^{(t)} h\left(X_j^{(t)}\right)\right] = \int h(x_i)\,\frac{\pi(x_i)}{q_{it}\left(x_i \mid x^{(t-1)}\right)}\; h(x_j)\,\frac{\pi(x_j)}{q_{jt}\left(x_j \mid x^{(t-1)}\right)}\; q_{it}\left(x_i \mid x^{(t-1)}\right) q_{jt}\left(x_j \mid x^{(t-1)}\right) dx_i\, dx_j\; g\left(x^{(t-1)}\right) dx^{(t-1)} = \mathbb{E}_{\pi}[h(X)]^2$$
whatever the distribution g on x^{(t−1)}
Self-normalised version
In general, π is unscaled and the weight
$$\varrho_i^{(t)} \propto \frac{\pi\left(x_i^{(t)}\right)}{q_{it}\left(x_i^{(t)}\right)}\,, \qquad i = 1, \ldots, n\,,$$
is scaled so that
$$\sum_i \varrho_i^{(t)} = 1$$
Self-normalised version properties
- Loss of the unbiasedness property and of the variance decomposition
- The normalising constant can be estimated by
$$\varpi_t = \frac{1}{tn}\sum_{\tau=1}^{t}\sum_{i=1}^{n} \frac{\pi\left(x_i^{(\tau)}\right)}{q_{i\tau}\left(x_i^{(\tau)}\right)}$$
- The variance decomposition is (approximately) recovered if ϖ_{t−1} is used instead
Sampling importance resampling
Importance sampling from g can also produce samples from the target π
[Rubin, 1987]
Theorem (Bootstrapped importance sampling)
If a sample (x*_i)_{1≤i≤m} is derived from the weighted sample (x_i, ϱ_i)_{1≤i≤n} by multinomial sampling with weights ϱ_i, then
$$x_i^{\star} \sim \pi(x)$$
Note: obviously, the x*_i's are not iid
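A short numpy sketch of the multinomial resampling step, with an illustrative target N(0, 1) and proposal g = N(0, 3²), both hypothetical choices:

import numpy as np

rng = np.random.default_rng(4)

n = 50_000
xs = rng.normal(0.0, 3.0, size=n)        # x_i ~ g = N(0, 9)
log_w = -xs ** 2 / 2 + xs ** 2 / 18      # log pi - log g, up to constants
w = np.exp(log_w - log_w.max())
w /= w.sum()                             # normalised weights rho_i

idx = rng.choice(n, size=n, replace=True, p=w)
resampled = xs[idx]                      # approximately ~ pi, but not iid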
Iterated sampling importance resampling
This principle can be extended to iterated importance sampling: after each iteration, resampling produces a sample from π
[Again, not iid!]
Incentive: use previous sample(s) to learn about π and q
Generic Population Monte Carlo
Algorithm (Population Monte Carlo Algorithm)
For t = 1, . . . , T:
  For i = 1, . . . , n:
    1. Select the generating distribution $q_{it}(\cdot)$
    2. Generate $\tilde{x}_i^{(t)} \sim q_{it}(x)$
    3. Compute $\varrho_i^{(t)} = \pi(\tilde{x}_i^{(t)})\big/q_{it}(\tilde{x}_i^{(t)})$
  Normalise the $\varrho_i^{(t)}$'s into $\bar\varrho_i^{(t)}$'s
  Generate $J_{i,t} \sim \mathcal{M}\left((\bar\varrho_i^{(t)})_{1 \le i \le N}\right)$ and set $x_{i,t} = \tilde{x}_{J_{i,t}}^{(t)}$
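A compact sketch of this loop under purely illustrative choices, namely a N(0, 1) target and Gaussian random-walk proposals with a fixed scale (none of this is prescribed by the slides):

import numpy as np

rng = np.random.default_rng(5)

def pi(x):                                   # illustrative target N(0, 1)
    return np.exp(-x ** 2 / 2)

def norm_pdf(y, m, s):
    return np.exp(-(y - m) ** 2 / (2 * s ** 2)) / (s * np.sqrt(2 * np.pi))

N, T, scale = 1000, 10, 1.0
x = rng.normal(0.0, 5.0, size=N)             # initial population
for t in range(T):
    xt = rng.normal(x, scale)                # step 2: x~_i ~ q_it(.|x_i)
    rho = pi(xt) / norm_pdf(xt, x, scale)    # step 3: importance weights
    rho /= rho.sum()                         # normalisation
    idx = rng.choice(N, size=N, p=rho)       # J_it ~ M(rho-bar)
    x = xt[idx]                              # resampled population

Note that the weights are self-normalised here, as in the self-normalised version above, since the target is only known up to a constant.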
D-kernels in competition
A general adaptive construction: build $q_{i,t}$ as a mixture of D different transition kernels depending on $x_i^{(t-1)}$,
$$q_{i,t} = \sum_{\ell=1}^{D} p_{t,\ell}\, K_{\ell}\left(x_i^{(t-1)}, x\right)\,, \qquad \sum_{\ell=1}^{D} p_{t,\ell} = 1\,,$$
and adapt the weights $p_{t,\ell}$.
Example: take $p_{t,\ell}$ proportional to the survival rate of the points (a.k.a. particles) $x_i^{(t)}$ generated from $K_{\ell}$
Implementation
Algorithm (D-kernel PMC)
For t = 1, . . . , T:
  generate $(K_{i,t})_{1 \le i \le N} \overset{\text{iid}}{\sim} \mathcal{M}\left((p_{t,k})_{1 \le k \le D}\right)$
  for 1 ≤ i ≤ N, generate $\tilde{x}_{i,t} \sim K_{K_{i,t}}(x)$
  compute and renormalise the importance weights $\omega_{i,t}$
  generate $(J_{i,t})_{1 \le i \le N} \sim \mathcal{M}\left((\bar\omega_{i,t})_{1 \le i \le N}\right)$
  take $x_{i,t} = \tilde{x}_{J_{i,t},t}$ and $p_{t+1,d} = \sum_{i=1}^{N} \bar\omega_{i,t}\, \mathbb{I}_d(K_{i,t})$
Links with particle filters
- Usual setting where π = π_t changes with t: Population Monte Carlo also adapts to this case
- Can be traced back all the way to Hammersley and Morton (1954) and the self-avoiding random walk problem
- Gilks and Berzuini (2001) produce iterated samples with (SIR) resampling steps and add an MCMC step: this step must use a π_t-invariant kernel
- Chopin (2001) uses iterated importance sampling to handle large datasets: this is a special case of PMC where the q_{it}'s are the posterior distributions associated with a portion k_t of the observed dataset
Links with particle filters (2)
- Rubinstein and Kroese's (2004) cross-entropy method is parameterised importance sampling targeted at rare events
- Stavropoulos and Titterington's (1999) smooth bootstrap and Warnes' (2001) kernel coupler use nonparametric kernels on the previous importance sample to build an improved proposal: this is a special case of PMC
- West's (1992) mixture approximation is a precursor of smooth bootstrap
- Mengersen and Robert's (2002) pinball sampler is an MCMC attempt at population sampling
- Del Moral and Doucet's (2003) sequential Monte Carlo samplers also relate to PMC, with a Markovian dependence on the past sample x^{(t)} but (limited) stationarity constraints
Things can go wrong
Unexpected behaviour of the mixture weights when the number of particles increases:
$$\sum_{i=1}^{N} \bar\omega_{i,t}\, \mathbb{I}_{K_{i,t} = d} \overset{\mathbb{P}}{\longrightarrow} \frac{1}{D}$$
Conclusion: at each iteration, every weight converges to 1/D; the algorithm fails to learn from experience!!
Saved by Rao-Blackwell!!
Modification: Rao-Blackwellisation (= conditioning)
Use the whole mixture in the importance weight:
$$\omega_{i,t} = \frac{\pi(\tilde{x}_{i,t})}{\sum_{d=1}^{D} p_{t,d}\, K_d(x_{i,t-1}, \tilde{x}_{i,t})}$$
instead of
$$\omega_{i,t} = \frac{\pi(\tilde{x}_{i,t})}{K_{K_{i,t}}(x_{i,t-1}, \tilde{x}_{i,t})}$$
Adapted algorithm
Algorithm (Rao-Blackwellised D-kernel PMC)
At time t (t = 1, . . . , T):
  Generate $(K_{i,t})_{1 \le i \le N} \overset{\text{iid}}{\sim} \mathcal{M}\left((p_{t,d})_{1 \le d \le D}\right)$;
  Generate $(\tilde{x}_{i,t})_{1 \le i \le N} \overset{\text{ind}}{\sim} K_{K_{i,t}}(x_{i,t-1}, x)$ and set
$$\omega_{i,t} = \pi(\tilde{x}_{i,t}) \Big/ \sum_{d=1}^{D} p_{t,d}\, K_d(x_{i,t-1}, \tilde{x}_{i,t})\,;$$
  Generate $(J_{i,t})_{1 \le i \le N} \overset{\text{iid}}{\sim} \mathcal{M}\left((\bar\omega_{i,t})_{1 \le i \le N}\right)$ and set
$$x_{i,t} = \tilde{x}_{J_{i,t},t} \qquad\text{and}\qquad p_{t+1,d} = \sum_{i=1}^{N} \bar\omega_{i,t}\, \frac{p_{t,d}\, K_d(x_{i,t-1}, \tilde{x}_{i,t})}{\sum_{j=1}^{D} p_{t,j}\, K_j(x_{i,t-1}, \tilde{x}_{i,t})}\,.$$
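A sketch of this Rao-Blackwellised update, assuming for illustration a N(0, 1) target and D = 3 Gaussian random-walk kernels that differ only in scale (all numerical settings are hypothetical):

import numpy as np

rng = np.random.default_rng(6)

def pi(x):                                    # illustrative target N(0, 1)
    return np.exp(-x ** 2 / 2)

def k_pdf(y, x, s):                           # K_d(x, y) = N(y; x, s^2)
    return np.exp(-(y - x) ** 2 / (2 * s ** 2)) / (s * np.sqrt(2 * np.pi))

scales = np.array([0.1, 1.0, 5.0])            # D = 3 kernels
D, N, T = len(scales), 2000, 20
p = np.full(D, 1.0 / D)                       # initial kernel weights p_{1,d}
x = rng.normal(0.0, 3.0, size=N)
for t in range(T):
    k = rng.choice(D, size=N, p=p)            # K_{i,t} ~ M(p_t)
    xt = rng.normal(x, scales[k])             # x~_{i,t} ~ K_{K_{i,t}}
    terms = np.stack([p[d] * k_pdf(xt, x, scales[d]) for d in range(D)])
    w = pi(xt) / terms.sum(axis=0)            # Rao-Blackwellised weights
    w /= w.sum()
    post = terms / terms.sum(axis=0)          # P(K_{i,t} = d | x, x~)
    p = post @ w                              # p_{t+1,d}
    x = xt[rng.choice(N, size=N, p=w)]        # multinomial resampling

Because every particle contributes to every p_{t+1,d} through its posterior kernel probabilities, the weights keep discriminating between kernels instead of collapsing to 1/D.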
Convergence properties
Theorem (LLN)
Under regularity assumptions, for h ∈ L¹_π and for every t ≥ 1,
$$\frac{1}{N}\sum_{i=1}^{N} \omega_{i,t}\, h(x_{i,t}) \overset{\mathbb{P}}{\underset{N \to \infty}{\longrightarrow}} \pi(h) \qquad\text{and}\qquad p_{t,d} \overset{\mathbb{P}}{\underset{N \to \infty}{\longrightarrow}} \alpha_d^{t}$$
The limiting coefficients $(\alpha_d^{t})_{1 \le d \le D}$ are defined recursively as
$$\alpha_d^{t} = \alpha_d^{t-1} \int \left[\frac{K_d(x, x')}{\sum_{j=1}^{D} \alpha_j^{t-1} K_j(x, x')}\right] \pi \otimes \pi(dx, dx')\,.$$
Recursion on the weights
Define F as
$$F(\alpha) = \left(\alpha_d \int \left[\frac{K_d(x, x')}{\sum_{j=1}^{D} \alpha_j K_j(x, x')}\right] \pi \otimes \pi(dx, dx')\right)_{1 \le d \le D}$$
on the simplex
$$S = \left\{\alpha = (\alpha_1, \ldots, \alpha_D);\ \forall d \in \{1, \ldots, D\},\ \alpha_d \ge 0 \ \text{and}\ \sum_{d=1}^{D} \alpha_d = 1\right\},$$
and define the sequence
$$\alpha^{t+1} = F(\alpha^{t})$$
Kullback divergence
Definition (Kullback divergence)
For α ∈ S,
$$\mathrm{KL}(\alpha) = \int \left[\log\left(\frac{\pi(x)\,\pi(x')}{\pi(x) \sum_{d=1}^{D} \alpha_d K_d(x, x')}\right)\right] \pi \otimes \pi(dx, dx')\,,$$
the Kullback divergence between π and the mixture.
Goal: obtain the mixture closest to π, i.e., the one that minimises KL(α)
Connection with RBDPMCA
Theorem
Under the assumption
$$\forall d \in \{1, \ldots, D\}\,, \qquad -\infty < \int \log\left(K_d(x, x')\right) \pi \otimes \pi(dx, dx') < \infty\,,$$
for every α ∈ S,
$$\mathrm{KL}(F(\alpha)) \le \mathrm{KL}(\alpha)\,.$$
Conclusion: the Kullback divergence decreases at every iteration of RBDPMCA
An integrated EM interpretation
We have
$$\alpha^{\min} = \arg\min_{\alpha \in S} \mathrm{KL}(\alpha) = \arg\max_{\alpha \in S} \int \log p_{\alpha}(\bar{x})\, \pi(d\bar{x}) = \arg\max_{\alpha \in S} \int \log \int p_{\alpha}(\bar{x}, K)\, dK\; \pi(d\bar{x})$$
for $\bar{x} = (x, x')$ and $K \sim \mathcal{M}((\alpha_d)_{1 \le d \le D})$. Then $\alpha^{t+1} = F(\alpha^{t})$ means that
$$\alpha^{t+1} = \arg\max_{\alpha} \int \mathbb{E}_{\alpha^{t}}\left[\log p_{\alpha}(\bar{X}, K) \mid \bar{X} = \bar{x}\right] \pi(d\bar{x})$$
and
$$\lim_{t \to \infty} \alpha^{t} = \alpha^{\min}$$
Illustration
Example (A toy example)
Take the target
$$\tfrac{1}{4}\,\mathcal{N}(1, 0.3)(x) + \tfrac{1}{4}\,\mathcal{N}(0, 1)(x) + \tfrac{1}{2}\,\mathcal{N}(3, 2)(x)$$
and use 3 proposals: N(1, 0.3), N(0, 1) and N(3, 2)
[Surprise!!!]
The weight evolution is then:

t     p_{t,1}     p_{t,2}      p_{t,3}
1     0.0500000   0.05000000   0.9000000
2     0.2605712   0.09970292   0.6397259
6     0.2740816   0.19160178   0.5343166
10    0.2989651   0.19200904   0.5090259
16    0.2651511   0.24129039   0.4935585
[Figure: Target and mixture evolution.]
Example: PMC for mixtures
Observation of an iid sample x = (x_1, . . . , x_n) from
$$p\,\mathcal{N}(\mu_1, \sigma^2) + (1 - p)\,\mathcal{N}(\mu_2, \sigma^2)\,,$$
with p ≠ 1/2 and σ > 0 known.
Usual $\mathcal{N}(\theta, \sigma^2/\lambda)$ prior on μ_1 and μ_2:
$$\pi(\mu_1, \mu_2 \mid x) \propto f(x \mid \mu_1, \mu_2)\, \pi(\mu_1, \mu_2)$$
Algorithm (Mixture PMC)
Step 0: Initialisation
  For j = 1, . . . , n = pm, choose $(\mu_1)_j^{(0)}$ and $(\mu_2)_j^{(0)}$
  For k = 1, . . . , p, set r_k = m
Step i: Update (i = 1, . . . , I)
  For k = 1, . . . , p,
  1. generate a sample of size r_k as
$$(\mu_1)_j^{(i)} \sim \mathcal{N}\left((\mu_1)_j^{(i-1)}, v_k\right) \qquad\text{and}\qquad (\mu_2)_j^{(i)} \sim \mathcal{N}\left((\mu_2)_j^{(i-1)}, v_k\right)$$
  2. compute the weights
$$\varrho_j \propto \frac{f\left(x \mid (\mu_1)_j^{(i)}, (\mu_2)_j^{(i)}\right) \pi\left((\mu_1)_j^{(i)}, (\mu_2)_j^{(i)}\right)}{\varphi\left((\mu_1)_j^{(i)} - (\mu_1)_j^{(i-1)};\, v_k\right)\, \varphi\left((\mu_2)_j^{(i)} - (\mu_2)_j^{(i-1)};\, v_k\right)}\,,$$
  where φ(·; v) denotes the N(0, v) density.
  Resample the $\left((\mu_1)_j^{(i)}, (\mu_2)_j^{(i)}\right)_j$ using the weights ϱ_j.
Details
After an arbitrary initialisation, the previous (importance) sample (after resampling) is used to build random walk proposals,
$$\mathcal{N}\left((\mu)_j^{(i-1)}, v_j\right),$$
with a multiscale variance v_j within a predetermined set of p scales ranging from 10³ down to 10⁻³, whose importance is proportional to its survival rate in the resampling step.
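A condensed sketch of this scheme on simulated data, with hypothetical settings p = 0.3, σ = 1, λ = 1 and seven scales from 10³ down to 10⁻³; none of the numerical choices come from the slides:

import numpy as np

rng = np.random.default_rng(7)

# Simulated data from 0.3 N(0,1) + 0.7 N(2.5,1); all settings illustrative.
p_mix, lam = 0.3, 1.0
data = np.where(rng.uniform(size=200) < p_mix,
                rng.normal(0.0, 1.0, 200), rng.normal(2.5, 1.0, 200))
theta = data.mean()

def log_post(m1, m2):
    like = np.log(p_mix * np.exp(-(data[None, :] - m1[:, None]) ** 2 / 2)
                  + (1 - p_mix) * np.exp(-(data[None, :] - m2[:, None]) ** 2 / 2))
    prior = -lam * ((m1 - theta) ** 2 + (m2 - theta) ** 2) / 2
    return like.sum(axis=1) + prior

scales = 10.0 ** np.arange(3, -4, -1.0)      # 10^3 down to 10^-3
P, m = len(scales), 200
N = P * m
mu1 = rng.normal(theta, 2.0, N)
mu2 = rng.normal(theta, 2.0, N)
r = np.full(P, m)                            # particles allotted to each scale
for it in range(50):
    v = np.repeat(scales, r)                 # proposal variance per particle
    n1 = rng.normal(mu1, np.sqrt(v))         # random walk proposals
    n2 = rng.normal(mu2, np.sqrt(v))
    # log weight = log posterior - log of the two proposal densities
    log_w = (log_post(n1, n2) + np.log(v)
             + ((n1 - mu1) ** 2 + (n2 - mu2) ** 2) / (2 * v))
    w = np.exp(log_w - log_w.max())
    w /= w.sum()
    idx = rng.choice(N, size=N, p=w)         # multinomial resampling
    mu1, mu2 = n1[idx], n2[idx]
    grp = np.repeat(np.arange(P), r)         # scale of each parent particle
    surv = np.bincount(grp[idx], minlength=P).astype(float)
    r = np.maximum(rng.multinomial(N, surv / surv.sum()), 1)
    r[np.argmax(r)] -= r.sum() - N           # keep the total at N

The np.maximum(..., 1) guard keeps every scale alive with at least one particle, a pragmatic tweak rather than part of the scheme itself.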
[Figure: (upper left) Number of resampled points for v_1 = 5 (darker) and v_2 = 2; (upper right) number of resampled points for the other variances; (middle left) variance of the μ_1's along iterations; (middle right) average of the μ_1's over iterations; (lower left) variance of the μ_2's along iterations; (lower right) average of the simulated μ_2's over iterations.]
[Figure: Log-posterior distribution and sample of means.]