
Markov Chain Monte Carlo Methods

Christian P. Robert
Université Paris Dauphine and CREST-INSEE
http://www.ceremade.dauphine.fr/~xian
November 9, 2009
Outline
Motivation and leading example
Random variable generation
Monte Carlo Integration
Notions on Markov Chains
The Metropolis-Hastings Algorithm
The Gibbs Sampler
MCMC tools for variable dimension problems
Sequential importance sampling
New [2004] edition: [figure: book cover]

Motivation and leading example

Introduction
Likelihood methods
Missing variable models
Bayesian Methods
Bayesian troubles

Introduction
Latent structures make life harder!
Even simple models may lead to computational complications, as in latent variable models

  f(x \mid \theta) = \int f^*(x, x^* \mid \theta) \, dx^*

If (x, x^*) observed, fine!
If only x observed, trouble!
Example (Mixture models)
Models of mixtures of distributions:

  X \sim f_j  with probability  p_j,

for j = 1, 2, \ldots, k, with overall density

  X \sim p_1 f_1(x) + \cdots + p_k f_k(x).

For a sample of independent random variables (X_1, \ldots, X_n), sample density

  \prod_{i=1}^{n} \left[ p_1 f_1(x_i) + \cdots + p_k f_k(x_i) \right].

Expanding this product involves k^n elementary terms: prohibitive to compute in large samples.
[Figure: likelihood surface of a two-component mixture over (\mu_1, \mu_2)]
Likelihood methods
Maximum likelihood methods
Go Bayes!!
For an iid sample X_1, \ldots, X_n from a population with density f(x \mid \theta_1, \ldots, \theta_k), the likelihood function is

  L(\theta \mid x) = L(\theta_1, \ldots, \theta_k \mid x_1, \ldots, x_n) = \prod_{i=1}^{n} f(x_i \mid \theta_1, \ldots, \theta_k).

- Global justifications from asymptotics
- Computational difficulty depends on structure, e.g. latent variables
Example (Mixtures again)
For a mixture of two normal distributions,

  p\, N(\mu, \tau^2) + (1-p)\, N(\theta, \sigma^2),

the likelihood is proportional to

  \prod_{i=1}^{n} \left[ p\, \tau^{-1} \varphi\!\left(\frac{x_i - \mu}{\tau}\right) + (1-p)\, \sigma^{-1} \varphi\!\left(\frac{x_i - \theta}{\sigma}\right) \right],

containing 2^n terms.
Standard maximization techniques often fail to find the global maximum because of multimodality of the likelihood function.

Example
In the special case

  f(x \mid \theta, \sigma) = (1-\epsilon) \exp\{-x^2/2\} + \frac{\epsilon}{\sigma} \exp\{-(x-\theta)^2 / 2\sigma^2\}    (1)

with \epsilon > 0 known, whatever n, the likelihood is unbounded:

  \lim_{\sigma \to 0} \ell(\theta = x_1, \sigma \mid x_1, \ldots, x_n) = \infty
Missing variable models
The special case of missing variable models
Consider again a latent variable representation

  g(x \mid \theta) = \int_{\mathcal{Z}} f(x, z \mid \theta) \, dz

Define the completed (but unobserved) likelihood

  L^c(\theta \mid x, z) = f(x, z \mid \theta)

Useful for optimisation algorithms
The EM Algorithm
Algorithm (Expectation-Maximisation)
Iterate (in m)
1. (E-step) Compute

     Q(\theta \mid \theta^{(m)}, x) = E[\log L^c(\theta \mid x, Z) \mid \theta^{(m)}, x] ;

2. (M-step) Maximise Q(\theta \mid \theta^{(m)}, x) in \theta and take

     \theta^{(m+1)} = \arg\max_{\theta} Q(\theta \mid \theta^{(m)}, x) ;

until a fixed point [of Q] is reached.
Motivation and leading example
Missing variable models
Echantillon N(0,1)
2 1 0 1 2
0
.
0
0
.
1
0
.
2
0
.
3
0
.
4
0
.
5
0
.
6
Markov Chain Monte Carlo Methods
Motivation and leading example
Missing variable models
1 0 1 2 3

1
0
1
2
3

2
Likeliho
Markov Chain Monte Carlo Methods
Motivation and leading example
Bayesian Methods
The Bayesian Perspective
In the Bayesian paradigm, the information brought by the data x, realization of

  X \sim f(x \mid \theta),

is combined with prior information specified by a prior distribution with density \pi(\theta).
Central tool
Summary in a probability distribution, \pi(\theta \mid x), called the posterior distribution.
Derived from the joint distribution f(x \mid \theta)\pi(\theta), according to

  \pi(\theta \mid x) = \frac{f(x \mid \theta)\pi(\theta)}{\int f(x \mid \theta)\pi(\theta)\, d\theta},
                                            [Bayes Theorem]
where

  m(x) = \int f(x \mid \theta)\pi(\theta)\, d\theta

is the marginal density of X.
Central tool... central to Bayesian inference
Posterior defined up to a constant as

  \pi(\theta \mid x) \propto f(x \mid \theta)\, \pi(\theta)

- Operates conditional upon the observations
- Integrates simultaneously prior information and the information brought by x
- Avoids averaging over the unobserved values of x
- Coherent updating of the information available on \theta, independent of the order in which i.i.d. observations are collected
- Provides a complete inferential scope and a unique motor of inference
Bayesian troubles
Conjugate bonanza...

Example (Binomial)
For an observation X \sim B(n, p) the so-called conjugate prior is the family of beta Be(a, b) distributions.
The classical Bayes estimator \delta^\pi is the posterior mean

  \delta^\pi = \frac{\Gamma(a+b+n)}{\Gamma(a+x)\Gamma(n-x+b)} \int_0^1 p\; p^{x+a-1} (1-p)^{n-x+b-1}\, dp = \frac{x+a}{a+b+n}.
Example (Normal)
In the normal N(\mu, \sigma^2) case, with both \mu and \sigma unknown, conjugate prior on \theta = (\mu, \sigma^2) of the form

  (\sigma^2)^{-\lambda_\sigma} \exp\left\{ -\left( \lambda_\mu (\mu - \xi)^2 + \alpha \right) / 2\sigma^2 \right\}

since

  \pi((\mu, \sigma^2) \mid x_1, \ldots, x_n)
    \propto (\sigma^2)^{-\lambda_\sigma} \exp\left\{ -\left( \lambda_\mu (\mu - \xi)^2 + \alpha \right) / 2\sigma^2 \right\}
      \times (\sigma^2)^{-n} \exp\left\{ -\left( n(\mu - \bar{x})^2 + s_x^2 \right) / 2\sigma^2 \right\}
    \propto (\sigma^2)^{-\lambda_\sigma - n} \exp\left\{ -\left( (\lambda_\mu + n)(\mu - \xi_x)^2 + \alpha + s_x^2 + \frac{n \lambda_\mu}{n + \lambda_\mu} (\bar{x} - \xi)^2 \right) / 2\sigma^2 \right\}
...and conjugate curse
The use of conjugate priors for computational reasons
- implies a restriction on the modeling of the available prior information
- may be detrimental to the usefulness of the Bayesian approach
- gives an impression of subjective manipulation of the prior information disconnected from reality.
A typology of Bayes computational problems
(i). use of a complex parameter space, as for instance in constrained parameter sets like those resulting from imposing stationarity constraints in dynamic models;
(ii). use of a complex sampling model with an intractable likelihood, as for instance in missing data and graphical models;
(iii). use of a huge dataset;
(iv). use of a complex prior distribution (which may be the posterior distribution associated with an earlier sample);
(v). use of a complex inferential procedure, as for instance Bayes factors

  B_{01}^{\pi}(x) = \frac{P(\theta \in \Theta_0 \mid x)}{P(\theta \in \Theta_1 \mid x)} \Big/ \frac{\pi(\Theta_0)}{\pi(\Theta_1)}.
Example (Mixture once again)
Observations from

  x_1, \ldots, x_n \sim f(x \mid \theta) = p\, \varphi(x; \mu_1, \sigma_1) + (1-p)\, \varphi(x; \mu_2, \sigma_2)

Prior

  \mu_i \mid \sigma_i \sim N(\xi_i, \sigma_i^2 / n_i), \quad \sigma_i^2 \sim IG(\nu_i/2, s_i^2/2), \quad p \sim Be(\alpha, \beta)

Posterior

  \pi(\theta \mid x_1, \ldots, x_n) \propto \prod_{j=1}^{n} \left[ p\, \varphi(x_j; \mu_1, \sigma_1) + (1-p)\, \varphi(x_j; \mu_2, \sigma_2) \right] \pi(\theta)
    = \sum_{\ell=0}^{n} \sum_{(k_t)} \omega(k_t)\, \pi(\theta \mid (k_t))
                                            [O(2^n)]
Example (Mixture once again (contd))
For a given permutation (k_t), conditional posterior distribution

  \pi(\theta \mid (k_t)) = N\!\left( \xi_1(k_t), \frac{\sigma_1^2}{n_1 + \ell} \right) \times IG\!\left( (\nu_1 + \ell)/2,\; s_1(k_t)/2 \right)
    \times N\!\left( \xi_2(k_t), \frac{\sigma_2^2}{n_2 + n - \ell} \right) \times IG\!\left( (\nu_2 + n - \ell)/2,\; s_2(k_t)/2 \right)
    \times Be(\alpha + \ell, \beta + n - \ell)
Example (Mixture once again (contd))
where

  \bar{x}_1(k_t) = \frac{1}{\ell} \sum_{t=1}^{\ell} x_{k_t}, \qquad \hat{s}_1(k_t) = \sum_{t=1}^{\ell} (x_{k_t} - \bar{x}_1(k_t))^2,
  \bar{x}_2(k_t) = \frac{1}{n-\ell} \sum_{t=\ell+1}^{n} x_{k_t}, \qquad \hat{s}_2(k_t) = \sum_{t=\ell+1}^{n} (x_{k_t} - \bar{x}_2(k_t))^2

and

  \xi_1(k_t) = \frac{n_1 \xi_1 + \ell\, \bar{x}_1(k_t)}{n_1 + \ell}, \qquad \xi_2(k_t) = \frac{n_2 \xi_2 + (n-\ell)\, \bar{x}_2(k_t)}{n_2 + n - \ell},

  s_1(k_t) = s_1^2 + \hat{s}_1^2(k_t) + \frac{n_1 \ell}{n_1 + \ell} (\xi_1 - \bar{x}_1(k_t))^2,
  s_2(k_t) = s_2^2 + \hat{s}_2^2(k_t) + \frac{n_2 (n-\ell)}{n_2 + n - \ell} (\xi_2 - \bar{x}_2(k_t))^2,

posterior updates of the hyperparameters.
Example (Mixture once again)
Bayes estimator of \theta:

  \delta^\pi(x_1, \ldots, x_n) = \sum_{\ell=0}^{n} \sum_{(k_t)} \omega(k_t)\, E^\pi[\theta \mid x, (k_t)]

Too costly: 2^n terms
Example (Poly-t priors)
Normal observation x \sim N(\theta, 1), with conjugate prior

  \theta \sim N(\mu, \tau^2)

Closed form expression for the posterior mean

  \int_{-\infty}^{\infty} \theta\, f(x \mid \theta)\, \pi(\theta)\, d\theta \Big/ \int_{-\infty}^{\infty} f(x \mid \theta)\, \pi(\theta)\, d\theta = \frac{x + \tau^{-2}\mu}{1 + \tau^{-2}}.
Example (Poly-t priors (2))
More involved prior distribution: poly-t distribution
                                            [Bauwens, 1985]

  \pi(\theta) = \prod_{i=1}^{k} \left[ \alpha_i + (\theta - \beta_i)^2 \right]^{-\nu_i}, \qquad \alpha_i, \nu_i > 0

Computation of E[\theta \mid x] ???
Example (AR(p) model)
Auto-regressive representation of a time series,

  x_t = \sum_{i=1}^{p} \theta_i x_{t-i} + \epsilon_t

If the order p is unknown, the predictive distribution of x_{t+1} is given by

  \pi(x_{t+1} \mid x_t, \ldots, x_1) \propto \int f(x_{t+1} \mid x_t, \ldots, x_{t-p+1})\, \pi(\theta, p \mid x_t, \ldots, x_1)\, dp\, d\theta,
Example (AR(p) model (contd))
Integration over the parameters of all models

  \sum_{p=0}^{\infty} \int f(x_{t+1} \mid x_t, \ldots, x_{t-p+1})\, \pi(\theta \mid p, x_t, \ldots, x_1)\, d\theta\; \pi(p \mid x_t, \ldots, x_1).
Example (AR(p) model (contd))
Multiple layers of complexity:
(i). complex parameter space within each AR(p) model because of the stationarity constraint;
(ii). if p unbounded, infinity of models;
(iii). \theta varies between models AR(p) and AR(p+1), with a different stationarity constraint (except for root reparameterisation);
(iv). if prediction used sequentially, every tick/second/hour/day, the posterior distribution \pi(\theta, p \mid x_t, \ldots, x_1) must be re-evaluated.
Random variable generation

Basic methods
Uniform pseudo-random generator
Beyond Uniform distributions
Transformation methods
Accept-Reject Methods
Fundamental theorem of simulation
Log-concave densities

Random variable generation
- Rely on the possibility of producing (computer-wise) an endless flow of random variables (usually iid) from well-known distributions
- Given a uniform random number generator, illustration of methods that produce random variables from both standard and nonstandard distributions
Basic methods
The inverse transform method
For a function F on R, the generalized inverse of F, F^-, is defined by

  F^-(u) = \inf \{ x ;\; F(x) \geq u \}.

Definition (Probability Integral Transform)
If U \sim U_{[0,1]}, then the random variable F^-(U) has the distribution F.
The inverse transform method (2)
To generate a random variable X \sim F, simply generate

  U \sim U_{[0,1]}

and then make the transform

  x = F^-(u)
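As a quick illustration of my own (not in the slides), the exponential cdf F(x) = 1 - exp(-lambda x) has explicit inverse F^-(u) = -log(1-u)/lambda, so in R:

  lambda <- 2
  u <- runif(1e4)
  x <- -log(1 - u) / lambda    # inverse transform; -log(u)/lambda also works since 1-U ~ U
  mean(x)                      # close to 1/lambda = 0.5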
Uniform pseudo-random generator

Desiderata and limitations
- Production of a deterministic sequence of values in [0, 1] which imitates a sequence of iid uniform random variables U_{[0,1]}.
- Can't use the physical imitation of a random draw [no guarantee of uniformity, no reproducibility]
- Random sequence in the sense: having generated (X_1, \ldots, X_n), knowledge of X_n [or of (X_1, \ldots, X_n)] imparts no discernible knowledge of the value of X_{n+1}.
- Deterministic: given the initial value X_0, the sample (X_1, \ldots, X_n) is always the same
- Validity of a random number generator based on a single sample X_1, \ldots, X_n when n tends to +\infty, not on replications

    (X_{11}, \ldots, X_{1n}), (X_{21}, \ldots, X_{2n}), \ldots, (X_{k1}, \ldots, X_{kn})

  where n is fixed and k tends to infinity.
Uniform pseudo-random generator
Algorithm starting from an initial value 0 \leq u_0 \leq 1 and a transformation D, which produces a sequence

  (u_i) = (D^i(u_0))

in [0, 1]. For all n,

  (u_1, \ldots, u_n)

reproduces the behavior of an iid U_{[0,1]} sample (V_1, \ldots, V_n) when compared through usual tests.
Uniform pseudo-random generator (2)
Validity means the sequence U_1, \ldots, U_n leads to accept the hypothesis

  H : U_1, \ldots, U_n are iid U_{[0,1]}.

The set of tests used is generally of some consequence:
- Kolmogorov-Smirnov and other nonparametric tests
- Time series methods, for correlation between U_i and (U_{i-1}, \ldots, U_{i-k})
- Marsaglia's battery of tests called Die Hard (!)
Usual generators
In R and S-plus, procedure runif()
The Uniform Distribution
Description:
runif generates random deviates.
Example:
u <- runif(20)
.Random.seed is an integer vector, containing
the random number generator state for random
number generation in R. It can be saved and
restored, but should not be altered by users.
[Figure: sequence plot of a uniform sample and histogram of its values]
Usual generators (2)
In C, procedure rand() or random()
SYNOPSIS
#include <stdlib.h>
long int random(void);
DESCRIPTION
The random() function uses a non-linear additive
feedback random number generator employing a
default table of size 31 long integers to return
successive pseudo-random numbers in the range
from 0 to RAND_MAX. The period of this random
generator is very large, approximately
16*((2**31)-1).
RETURN VALUE
random() returns a value between 0 and RAND_MAX.
Usual generators (3)
In Scilab, procedure rand()
rand() : with no arguments gives a scalar whose
value changes each time it is referenced. By
default, random numbers are uniformly distributed
in the interval (0,1). rand('normal') switches to
a normal distribution with mean 0 and variance 1.
EXAMPLE
x=rand(10,10,'uniform')
Beyond Uniform distributions

Beyond Uniform generators
- Generation of any sequence of random variables can be formally implemented through a uniform generator
- Distributions with explicit F^- (for instance, exponential and Weibull distributions): use the probability integral transform
- Case-specific methods rely on properties of the distribution (for instance, normal distribution, Poisson distribution)
- More generic methods (for instance, accept-reject and ratio-of-uniform)
- Simulation of the standard distributions is accomplished quite efficiently by many numerical and statistical programming packages.
Transformation methods
Case where a distribution F is linked in a simple way to another distribution that is easy to simulate.

Example (Exponential variables)
If U \sim U_{[0,1]}, the random variable

  X = -\log U / \lambda

has distribution

  P(X \leq x) = P(-\log U \leq \lambda x) = P(U \geq e^{-\lambda x}) = 1 - e^{-\lambda x},

the exponential distribution Exp(\lambda).
Other random variables that can be generated starting from an exponential include

  Y = -2 \sum_{j=1}^{\nu} \log(U_j) \sim \chi^2_{2\nu}

  Y = -\frac{1}{\beta} \sum_{j=1}^{a} \log(U_j) \sim Ga(a, \beta)

  Y = \sum_{j=1}^{a} \log(U_j) \Big/ \sum_{j=1}^{a+b} \log(U_j) \sim Be(a, b)
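A quick empirical check of these identities in R (my own illustration), with nu = 2, a = 3, b = 2:

  set.seed(42)
  U <- matrix(runif(5e4 * 5), ncol = 5)                 # five uniforms per draw
  Y.chi2  <- -2 * rowSums(log(U[, 1:2]))                # chi^2 with 2*nu = 4 df
  Y.gamma <- -rowSums(log(U[, 1:3]))                    # Ga(3, 1)
  Y.beta  <- rowSums(log(U[, 1:3])) / rowSums(log(U))   # Be(3, 2)
  c(mean(Y.chi2), mean(Y.gamma), mean(Y.beta))          # approx 4, 3, 0.6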
Points to note
- Transformation quite simple to use
- There are more efficient algorithms for gamma and beta random variables
- Cannot generate gamma random variables with a non-integer shape parameter
- For instance, cannot get a \chi^2_1 variable, which would get us a N(0, 1) variable.
Box-Muller Algorithm

Example (Normal variables)
If (r, \theta) are the polar coordinates of (X_1, X_2), then

  r^2 = X_1^2 + X_2^2 \sim \chi^2_2 = Exp(1/2) \quad and \quad \theta \sim U_{[0, 2\pi]}

Consequence: if U_1, U_2 iid U_{[0,1]},

  X_1 = \sqrt{-2 \log(U_1)}\, \cos(2\pi U_2)
  X_2 = \sqrt{-2 \log(U_1)}\, \sin(2\pi U_2)

are iid N(0, 1).
Box-Muller Algorithm (2)
1. Generate U_1, U_2 iid U_{[0,1]} ;
2. Define

     x_1 = \sqrt{-2 \log(u_1)}\, \cos(2\pi u_2),
     x_2 = \sqrt{-2 \log(u_1)}\, \sin(2\pi u_2) ;

3. Take x_1 and x_2 as two independent draws from N(0, 1).
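A direct R transcription of these three steps (my own sketch):

  box.muller <- function(n) {
    u1 <- runif(n); u2 <- runif(n)
    r <- sqrt(-2 * log(u1))                        # radius: r^2 ~ Exp(1/2)
    c(r * cos(2 * pi * u2), r * sin(2 * pi * u2))  # 2n iid N(0,1) draws
  }
  x <- box.muller(5e3)
  c(mean(x), var(x))                               # approx 0 and 1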
Box-Muller Algorithm (3)
[Figure: scatterplot and histogram of Box-Muller normal draws]
- Unlike algorithms based on the CLT, this algorithm is exact
- Get two normals for the price of two uniforms
- Drawback (in speed) in calculating log, cos and sin.
More transforms

Example (Poisson generation)
Poisson-exponential connection:
If N \sim P(\lambda) and X_i \sim Exp(\lambda), i \in N^*,

  P_\lambda(N = k) = P_\lambda(X_1 + \cdots + X_k \leq 1 < X_1 + \cdots + X_{k+1}).
More Poisson
- A Poisson can be simulated by generating Exp(\lambda) variables until their sum exceeds 1.
- This method is simple, but is really practical only for smaller values of \lambda.
- On average, the number of exponential variables required is \lambda.
- Other approaches are more suitable for large \lambda's.
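A one-function R sketch of this connection (my own illustration):

  rpois.exp <- function(lambda) {
    k <- 0; s <- rexp(1, lambda)
    while (s <= 1) {              # accumulate Exp(lambda) draws until the sum passes 1
      k <- k + 1
      s <- s + rexp(1, lambda)
    }
    k                             # number of complete arrivals before time 1
  }
  mean(replicate(1e4, rpois.exp(4)))   # approx 4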
Atkinson's Poisson
To generate N \sim P(\lambda):
1. Define \beta = \pi/\sqrt{3\lambda}, \alpha = \lambda\beta and k = \log c - \lambda - \log \beta ;
2. Generate U_1 \sim U_{[0,1]} and calculate

     x = \{\alpha - \log((1 - u_1)/u_1)\}/\beta

   until x > -0.5 ;
3. Define N = \lfloor x + 0.5 \rfloor and generate U_2 \sim U_{[0,1]} ;
4. Accept N if

     \alpha - \beta x + \log\left( u_2 / \{1 + \exp(\alpha - \beta x)\}^2 \right) \leq k + N \log \lambda - \log N!
Negative extension
A generator of Poisson random variables can produce negative binomial random variables since

  Y \sim Ga(n, (1-p)/p), \quad X \mid y \sim P(y)

implies

  X \sim Neg(n, p)
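In R this two-stage representation is a one-liner (my own check, reading Ga(n, (1-p)/p) as shape n and scale (1-p)/p):

  n <- 5; p <- 0.3
  x <- rpois(1e5, rgamma(1e5, shape = n, scale = (1 - p)/p))
  c(mean(x), n * (1 - p)/p)       # both approx the Neg(n, p) mean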
Mixture representation
- The representation of the negative binomial is a particular case of a mixture distribution
- The principle of a mixture representation is to represent a density f as the marginal of another distribution, for example

    f(x) = \sum_{i \in \mathcal{Y}} p_i f_i(x),

- If the component distributions f_i(x) can be easily generated, X can be obtained by first choosing f_i with probability p_i and then generating an observation from f_i.
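A short R sketch of this two-step recipe (my own illustration), for a three-component normal mixture:

  p  <- c(0.5, 0.3, 0.2)                            # component weights
  mu <- c(-2, 0, 3)                                 # component means
  i  <- sample(1:3, 1e4, replace = TRUE, prob = p)  # first choose the component
  x  <- rnorm(1e4, mean = mu[i])                    # then draw from it
  mean(x)                                           # approx sum(p * mu) = -0.4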
Partitioned sampling
Special case of mixture sampling when

  f_i(x) = f(x)\, I_{A_i}(x) \Big/ \int_{A_i} f(x)\, dx

and

  p_i = \Pr(X \in A_i)

for a partition (A_i)_i.
Accept-Reject Methods
Accept-Reject algorithm
- There are many distributions from which it is difficult, or even impossible, to directly simulate.
- Another class of methods only requires us to know the functional form of the density f of interest up to a multiplicative constant.
- The key to this method is to use a simpler (simulation-wise) density g, the instrumental density, from which the simulation from the target density f is actually done.
Fundamental theorem of simulation

Lemma
Simulating

  X \sim f(x)

is equivalent to simulating

  (X, U) \sim U\{(x, u) : 0 < u < f(x)\}

[Figure: a density f(x) with the area under its graph uniformly filled with points]
The Accept-Reject algorithm
Given a density of interest f, find a density g and a constant M such that

  f(x) \leq M g(x)

on the support of f.
1. Generate X \sim g, U \sim U_{[0,1]} ;
2. Accept Y = X if U \leq f(X)/Mg(X) ;
3. Return to 1. otherwise.
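A generic R sketch of these three steps (my own illustration); the target/instrumental pair anticipates the Normal-from-a-Cauchy example below, with M = 1.53 taken slightly above the exact bound sqrt(2 pi / e) = 1.52:

  accept.reject <- function(n, f, g, rg, M) {
    y <- numeric(0)
    while (length(y) < n) {
      x <- rg(n)                                # 1. X ~ g
      u <- runif(n)
      y <- c(y, x[u <= f(x) / (M * g(x))])      # 2. keep the accepted draws
    }                                           # 3. otherwise loop
    y[1:n]
  }
  x <- accept.reject(1e4, dnorm, dcauchy, rcauchy, M = 1.53)
  c(mean(x), var(x))                            # approx 0 and 1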
Validation of the Accept-Reject method
Warranty: this algorithm produces a variable Y distributed according to f.

[Figure: accepted points under the graph of Mg, with f overlaid]
Two interesting properties
- First, it provides a generic method to simulate from any density f that is known up to a multiplicative factor. This property is particularly important in Bayesian calculations, where the posterior distribution

    \pi(\theta \mid x) \propto \pi(\theta)\, f(x \mid \theta)

  is specified up to a normalizing constant.
- Second, the probability of acceptance in the algorithm is 1/M, i.e., the expected number of trials until a variable is accepted is M.
More interesting properties
- In cases where f and g are both probability densities, the constant M is necessarily larger than 1.
- The size of M, and thus the efficiency of the algorithm, are functions of how closely g can imitate f, especially in the tails.
- For f/g to remain bounded, it is necessary for g to have tails thicker than those of f. It is therefore impossible to use the A-R algorithm to simulate a Cauchy distribution f using a normal distribution g; however, the reverse works quite well.
No Cauchy!

Example (Normal from a Cauchy)
Take

  f(x) = \frac{1}{\sqrt{2\pi}} \exp(-x^2/2)

and

  g(x) = \frac{1}{\pi} \frac{1}{1 + x^2},

the densities of the normal and Cauchy distributions. Then

  \frac{f(x)}{g(x)} = \sqrt{\frac{\pi}{2}}\, (1 + x^2)\, e^{-x^2/2} \leq \sqrt{\frac{2\pi}{e}} = 1.52,

attained at x = \pm 1.
Example (Normal from a Cauchy (2))
So the probability of acceptance is

  1/1.52 = 0.66,

and, on average, one out of every three simulated Cauchy variables is rejected.
No Double!

Example (Normal/Double Exponential)
Generate a N(0, 1) by using a double-exponential distribution with density

  g(x \mid \alpha) = (\alpha/2) \exp(-\alpha |x|)

Then

  \frac{f(x)}{g(x \mid \alpha)} \leq \sqrt{\frac{2}{\pi}}\, \alpha^{-1} e^{\alpha^2/2}

and the minimum of this bound (in \alpha) is attained for \alpha^* = 1.
Example (Normal/Double Exponential (2))
Probability of acceptance

  \sqrt{\pi/2e} = .76

To produce one normal random variable requires on average 1/.76 \approx 1.3 uniform variables.
Example (Gamma generation)
Illustrates a real advantage of the Accept-Reject algorithm:
the gamma distribution Ga(\alpha, \beta) can be represented as a sum of exponential random variables only if \alpha is an integer.

Example (Gamma generation (2))
Can use the Accept-Reject algorithm with instrumental distribution

  Ga(a, b), \quad with \quad a = \lfloor \alpha \rfloor.

(Without loss of generality, \beta = 1.)
Up to a normalizing constant,

  f/g_b = b^{-a} x^{\alpha - a} \exp\{-(1 - b)x\} \leq b^{-a} \left( \frac{\alpha - a}{(1 - b)e} \right)^{\alpha - a}

for b \leq 1. The maximum is attained at b = a/\alpha.
Cheng and Feast's Gamma generator
For a Gamma Ga(\alpha, 1), \alpha > 1, distribution:
1. Define c_1 = \alpha - 1, c_2 = (\alpha - 1/(6\alpha))/c_1, c_3 = 2/c_1, c_4 = 1 + c_3, and c_5 = 1/\sqrt{\alpha}.
2. Repeat
     generate U_1, U_2
     take U_1 = U_2 + c_5 (1 - 1.86 U_1) if \alpha > 2.5
   until 0 < U_1 < 1.
3. Set W = c_2 U_2 / U_1.
4. If c_3 U_1 + W + W^{-1} \leq c_4 or c_3 \log U_1 - \log W + W \leq 1, take c_1 W ;
   otherwise, repeat.
Truncated Normal simulation

Example (Truncated Normal distributions)
The constraint x \geq \mu^- produces a density proportional to

  e^{-(x - \mu)^2 / 2\sigma^2}\, I_{x \geq \mu^-}

for a bound \mu^- large compared with \mu.
There exist alternatives far superior to the naive method of generating a N(\mu, \sigma^2) until exceeding \mu^-, which requires an average number of

  1/\Phi((\mu - \mu^-)/\sigma)

simulations from N(\mu, \sigma^2) for a single acceptance.
Example (Truncated Normal distributions (2))
Instrumental distribution: translated exponential distribution, E(\alpha, \mu^-), with density

  g_\alpha(z) = \alpha\, e^{-\alpha (z - \mu^-)}\, I_{z \geq \mu^-}.

The ratio f/g_\alpha is bounded by

  f/g_\alpha \leq
    1/\alpha \exp(\alpha^2/2 - \alpha \mu^-)   if \alpha > \mu^-,
    1/\alpha \exp(-(\mu^-)^2/2)                otherwise.
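A hedged R sketch of this scheme (my own implementation) for a standard normal truncated to [mu.minus, infinity), using the translated exponential proposal with rate alpha = (mu.minus + sqrt(mu.minus^2 + 4))/2, for which the acceptance probability simplifies to exp(-(z - alpha)^2/2):

  rtnorm.tail <- function(n, mu.minus) {
    alpha <- (mu.minus + sqrt(mu.minus^2 + 4)) / 2   # exponential rate
    out <- numeric(0)
    while (length(out) < n) {
      z <- mu.minus + rexp(n, alpha)                 # proposal E(alpha, mu.minus)
      u <- runif(n)
      out <- c(out, z[u <= exp(-(z - alpha)^2 / 2)]) # accept with prob f/(M g)
    }
    out[1:n]
  }
  x <- rtnorm.tail(1e4, 3)
  min(x)      # every draw exceeds 3, with no wasted N(0,1) simulations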
Log-concave densities

Log-concave densities (1)
Densities f whose logarithm is concave, for instance Bayesian posterior distributions such that

  \log \pi(\theta \mid x) = \log \pi(\theta) + \log f(x \mid \theta) + c

is concave.
Log-concave densities (2)
Take

  S_n = \{ x_i, \; i = 0, 1, \ldots, n+1 \} \subset supp(f)

such that h(x_i) = \log f(x_i) is known up to the same constant.
By concavity of h, the line L_{i,i+1} through (x_i, h(x_i)) and (x_{i+1}, h(x_{i+1})) is
- below h in [x_i, x_{i+1}] and
- above this graph outside this interval

[Figure: \log f(x) with chords L_{2,3}(x) through points x_1, x_2, x_3, x_4]
Log-concave densities (3)
For x \in [x_i, x_{i+1}], if

  \overline{h}_n(x) = \min\{ L_{i-1,i}(x), L_{i+1,i+2}(x) \} \quad and \quad \underline{h}_n(x) = L_{i,i+1}(x),

the envelopes satisfy

  \underline{h}_n(x) \leq h(x) \leq \overline{h}_n(x)

uniformly on the support of f, with

  \underline{h}_n(x) = -\infty \quad and \quad \overline{h}_n(x) = \min( L_{0,1}(x), L_{n,n+1}(x) )

on [x_0, x_{n+1}]^c.
Log-concave densities (4)
Therefore, if

  \underline{f}_n(x) = \exp \underline{h}_n(x) \quad and \quad \overline{f}_n(x) = \exp \overline{h}_n(x)

then

  \underline{f}_n(x) \leq f(x) \leq \overline{f}_n(x) = \varpi_n g_n(x),

where \varpi_n is the normalizing constant of \overline{f}_n.
ARS Algorithm
1. Initialize n and S_n.
2. Generate X \sim g_n(x), U \sim U_{[0,1]}.
3. If U \leq \underline{f}_n(X) / \varpi_n g_n(X), accept X ;
   otherwise, if U \leq f(X) / \varpi_n g_n(X), accept X.
Example (Northern Pintail ducks)
Ducks captured at time i with probability p_i, with both the p_i and the population size N unknown.
Dataset

  (n_1, \ldots, n_{11}) = (32, 20, 8, 5, 1, 2, 0, 2, 1, 1, 0)

Number of recoveries over the years 1957-1968 of N = 1612 Northern Pintail ducks banded in 1956.
Example (Northern Pintail ducks (2))
Corresponding conditional likelihood

  L(p_1, \ldots, p_I \mid N, n_1, \ldots, n_I) = \frac{N!}{(N-r)!} \prod_{i=1}^{I} p_i^{n_i} (1 - p_i)^{N - n_i},

where I is the number of captures, n_i the number of captured animals during the i-th capture, and r is the total number of different captured animals.
Example (Northern Pintail ducks (3))
Prior selection: if

  N \sim P(\lambda)

and

  \alpha_i = \log\left( \frac{p_i}{1 - p_i} \right) \sim N(\mu_i, \sigma^2),

[Normal logistic]
Example (Northern Pintail ducks (4))
Posterior distribution

  \pi(\alpha, N \mid \lambda, n_1, \ldots, n_I) \propto \frac{N!}{(N-r)!} \frac{\lambda^N}{N!} \prod_{i=1}^{I} (1 + e^{\alpha_i})^{-N} \prod_{i=1}^{I} \exp\left\{ \alpha_i n_i - \frac{1}{2\sigma^2} (\alpha_i - \mu_i)^2 \right\}
Example (Northern Pintail ducks (5))
For the conditional posterior distribution

  \pi(\alpha_i \mid N, n_1, \ldots, n_I) \propto \exp\left\{ \alpha_i n_i - \frac{1}{2\sigma^2} (\alpha_i - \mu_i)^2 \right\} \Big/ (1 + e^{\alpha_i})^{N},

the ARS algorithm can be implemented since

  \alpha_i n_i - \frac{1}{2\sigma^2} (\alpha_i - \mu_i)^2 - N \log(1 + e^{\alpha_i})

is concave in \alpha_i.
[Figure: posterior distributions of capture log-odds ratios for the years 1957-1965]
[Figure: 1960 capture log-odds, true distribution versus histogram of simulated sample]
Monte Carlo Integration

Introduction
Monte Carlo integration
Importance Sampling
Acceleration methods
Bayesian importance sampling

Introduction
Quick reminder
Two major classes of numerical problems that arise in statistical inference:
- Optimization, generally associated with the likelihood approach
- Integration, generally associated with the Bayesian approach
Example (Bayesian decision theory)
Bayes estimators are not always posterior expectations, but rather solutions of the minimization problem

  \min_{\delta} \int_{\Theta} L(\theta, \delta)\, \pi(\theta)\, f(x \mid \theta)\, d\theta.

- Proper loss: for L(\theta, \delta) = (\theta - \delta)^2, the Bayes estimator is the posterior mean
- Absolute error loss: for L(\theta, \delta) = |\theta - \delta|, the Bayes estimator is the posterior median
- With no loss function: use the maximum a posteriori (MAP) estimator

    \arg\max_{\theta} \ell(\theta \mid x)\, \pi(\theta)
Monte Carlo integration

Theme: generic problem of evaluating the integral

  I = E_f[h(X)] = \int_{\mathcal{X}} h(x)\, f(x)\, dx

where X is uni- or multidimensional, f is a closed form, partly closed form, or implicit density, and h is a function.
Monte Carlo integration (2)
Monte Carlo solution: first use a sample (X_1, \ldots, X_m) from the density f to approximate the integral I by the empirical average

  \bar{h}_m = \frac{1}{m} \sum_{j=1}^{m} h(x_j),

which converges,

  \bar{h}_m \longrightarrow E_f[h(X)],

by the Strong Law of Large Numbers.
Monte Carlo precision
Estimate the variance with

  v_m = \frac{1}{m} \cdot \frac{1}{m-1} \sum_{j=1}^{m} [h(x_j) - \bar{h}_m]^2,

and for m large,

  \frac{\bar{h}_m - E_f[h(X)]}{\sqrt{v_m}} \sim N(0, 1).

Note: This can lead to the construction of a convergence test and of confidence bounds on the approximation of E_f[h(X)].
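A small R sketch of the estimate and its error bound (my own illustration), for h(x) = exp(-x^2) and X ~ N(0,1), where the exact value is 1/sqrt(3) = 0.577:

  m <- 1e4
  x <- rnorm(m)
  h <- exp(-x^2)
  hbar <- mean(h)                      # Monte Carlo estimate
  vm <- var(h) / m                     # estimated variance of the average
  hbar + c(-1.96, 1.96) * sqrt(vm)     # 95% confidence bounds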
Example (Cauchy prior/normal sample)
For estimating a normal mean, a robust prior is a Cauchy prior:

  X \sim N(\theta, 1), \quad \theta \sim C(0, 1).

Under squared error loss, the posterior mean is

  \delta^\pi(x) = \int_{-\infty}^{\infty} \frac{\theta}{1 + \theta^2}\, e^{-(x-\theta)^2/2}\, d\theta \Big/ \int_{-\infty}^{\infty} \frac{1}{1 + \theta^2}\, e^{-(x-\theta)^2/2}\, d\theta
Example (Cauchy prior/normal sample (2))
The form of \delta^\pi suggests simulating iid variables

  \theta_1, \ldots, \theta_m \sim N(x, 1)

and calculating

  \hat{\delta}^\pi_m(x) = \sum_{i=1}^{m} \frac{\theta_i}{1 + \theta_i^2} \Big/ \sum_{i=1}^{m} \frac{1}{1 + \theta_i^2}.

The Law of Large Numbers implies

  \hat{\delta}^\pi_m(x) \longrightarrow \delta^\pi(x) \quad as \quad m \to \infty.
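This estimator is three lines of R (my own transcription), e.g. at x = 10:

  x <- 10
  theta <- rnorm(1e4, mean = x)                        # theta_i ~ N(x, 1)
  sum(theta / (1 + theta^2)) / sum(1 / (1 + theta^2))  # delta_m(x)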
[Figure: range of Monte Carlo estimates of \delta^\pi(x) across iterations]

Importance Sampling
Importance sampling

Paradox: simulation from f (the true density) is not necessarily optimal!
An alternative to direct sampling from f is importance sampling, based on the alternative representation

  E_f[h(X)] = \int_{\mathcal{X}} \left[ h(x)\, \frac{f(x)}{g(x)} \right] g(x)\, dx,

which allows us to use other distributions than f.
Importance sampling algorithm
Evaluation of

  E_f[h(X)] = \int_{\mathcal{X}} h(x)\, f(x)\, dx

by
1. generating a sample X_1, \ldots, X_m from a distribution g ;
2. using the approximation

  \frac{1}{m} \sum_{j=1}^{m} \frac{f(X_j)}{g(X_j)}\, h(X_j)
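A compact R sketch (my own illustration): estimating the normal tail probability P(X > 4), with h the indicator of (4, infinity), using an exponential instrumental shifted to 4; direct sampling from f would almost never hit the region:

  m <- 1e5
  x <- 4 + rexp(m)                        # sample from g
  w <- dnorm(x) / dexp(x - 4)             # importance weights f/g
  mean(w)                                 # IS estimate of P(X > 4)
  pnorm(4, lower.tail = FALSE)            # exact value, 3.17e-05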
Same thing as before!!!
Convergence of the estimator:

  \frac{1}{m} \sum_{j=1}^{m} \frac{f(X_j)}{g(X_j)}\, h(X_j) \longrightarrow \int_{\mathcal{X}} h(x)\, f(x)\, dx

for any choice of the distribution g [as long as supp(g) \supset supp(f)].
Important details
- Instrumental distribution g chosen from distributions easy to simulate
- The same sample (generated from g) can be used repeatedly, not only for different functions h, but also for different densities f
- Even dependent proposals can be used, as seen later [PMC chapter]
Although g can be any density, some choices are better than others:
- Finite variance only when

    E_f\left[ h^2(X)\, \frac{f(X)}{g(X)} \right] = \int_{\mathcal{X}} h^2(x)\, \frac{f^2(x)}{g(x)}\, dx < \infty.

- Instrumental distributions with tails lighter than those of f (that is, with sup f/g = \infty) are not appropriate: the weights f(x_j)/g(x_j) vary widely, giving too much importance to a few values x_j.
- If sup f/g = M < \infty, the accept-reject algorithm can be used as well to simulate f directly.
Example (Cauchy target)
Case of a Cauchy distribution C(0, 1) when the importance function is Gaussian N(0, 1).
The ratio of the densities,

  \varrho(x) = \frac{p_\star(x)}{p_0(x)} = \sqrt{2/\pi}\; \frac{\exp\{x^2/2\}}{1 + x^2},

is very badly behaved: e.g.,

  \int_{-\infty}^{\infty} \varrho(x)^2\, p_0(x)\, dx = \infty.

Poor performances of the associated importance sampling estimator.
[Figure: range and average of 500 replications of the IS estimate of E[exp X] over 10,000 iterations]
Optimal importance function
The choice of g that minimizes the variance of the importance sampling estimator is

  g^*(x) = \frac{|h(x)|\, f(x)}{\int_{\mathcal{Z}} |h(z)|\, f(z)\, dz}.

Rather formal optimality result, since the optimal choice of g^*(x) requires the knowledge of I, the integral of interest!
Practical impact

  \sum_{j=1}^{m} h(X_j)\, f(X_j)/g(X_j) \Big/ \sum_{j=1}^{m} f(X_j)/g(X_j),

where f and g are known up to constants.
- Also converges to I by the Strong Law of Large Numbers.
- Biased, but the bias is quite small; in some settings it beats the unbiased estimator in squared error loss.
Using the optimal solution does not always work:

  \frac{\sum_{j=1}^{m} h(x_j)\, f(x_j) / |h(x_j)| f(x_j)}{\sum_{j=1}^{m} f(x_j) / |h(x_j)| f(x_j)} = \frac{\#\{positive\; h\} - \#\{negative\; h\}}{\sum_{j=1}^{m} 1/|h(x_j)|}
Self-normalised importance sampling
For the ratio estimator

  \delta^n_h = \sum_{i=1}^{n} \omega_i\, h(x_i) \Big/ \sum_{i=1}^{n} \omega_i

with X_i \sim g(y) and W_i such that

  E[W_i \mid X_i = x] = \kappa\, f(x)/g(x)

for some constant \kappa.
Self-normalised variance
Then

  var(\delta^n_h) \approx \frac{1}{n^2 \kappa^2} \left\{ var(S^n_h) - 2\, E^\pi[h]\, cov(S^n_h, S^n_1) + E^\pi[h]^2\, var(S^n_1) \right\},

for

  S^n_h = \sum_{i=1}^{n} W_i\, h(X_i), \qquad S^n_1 = \sum_{i=1}^{n} W_i.

Rough approximation:

  var(\delta^n_h) \approx \frac{1}{n}\, var_\pi(h(X)) \left\{ 1 + var_g(W) \right\}
Example (Student's t distribution)
X \sim T(\nu, \theta, \sigma^2), with density

  f_\nu(x) = \frac{\Gamma((\nu+1)/2)}{\sigma \sqrt{\nu\pi}\, \Gamma(\nu/2)} \left( 1 + \frac{(x-\theta)^2}{\nu \sigma^2} \right)^{-(\nu+1)/2}.

Without loss of generality, take \theta = 0, \sigma = 1.
Problem: calculate the integral

  \int_{2.1}^{\infty} \left( \frac{\sin(x)}{x} \right)^n f_\nu(x)\, dx.
Example (Student's t distribution (2))
Simulation possibilities (compared in the R sketch below):
- Directly from f_\nu, since f_\nu is the density of N(0,1)/\sqrt{\chi^2_\nu/\nu}
- Importance sampling using a Cauchy C(0, 1)
- Importance sampling using a normal N(0, 1) (expected to be nonoptimal)
- Importance sampling using a U([0, 1/2.1]) change of variables
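A minimal R comparison of the first three strategies (my own sketch, with the arbitrary choices nu = 12 and n = 3):

  nu <- 12; n <- 3; m <- 1e5
  h <- function(x) (sin(x)/x)^n * (x > 2.1)
  x1 <- rt(m, df = nu)                             # direct sampling from f
  x2 <- rcauchy(m); w2 <- dt(x2, nu) / dcauchy(x2) # Cauchy instrumental
  x3 <- rnorm(m);  w3 <- dt(x3, nu) / dnorm(x3)    # normal instrumental (unstable weights)
  c(mean(h(x1)), mean(h(x2) * w2), mean(h(x3) * w3))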
[Figure: convergence of the estimators over 50,000 iterations: sampling from f (solid lines), importance sampling with Cauchy instrumental (short dashes), U([0, 1/2.1]) instrumental (long dashes) and normal instrumental (dots)]
IS suffers from the curse of dimensionality
As the dimension increases, the discrepancy between the importance and the target distributions worsens.

Explanation:
Take a target distribution \mu and an instrumental distribution \nu.
Simulation of iid samples x_{1:n} of size n from \mu_n = \mu^{\otimes n}.
Importance sampling estimator for \mu_n(f_n) = \int f_n(x_{1:n})\, \mu_n(dx_{1:n}):

  \hat{\mu}_n(f_n) = \sum_{i=1}^{N} f_n(\xi^i_{1:n}) \prod_{j=1}^{n} W^i_j \Big/ \sum_{i=1}^{N} \prod_{j=1}^{n} W^i_j,

where W^i_k = (d\mu/d\nu)(\xi^i_k), and the \xi^i_j are iid with distribution \nu.
For \{V_k\}_{k \geq 0} a sequence of iid nonnegative random variables and for n \geq 1, \mathcal{F}_n = \sigma(V_k; k \leq n), set

  U_n = \prod_{k=1}^{n} V_k
IS suffers (2)
Since E[V_{n+1}] = 1 and V_{n+1} is independent from \mathcal{F}_n,

  E(U_{n+1} \mid \mathcal{F}_n) = U_n\, E(V_{n+1} \mid \mathcal{F}_n) = U_n,

and thus \{U_n\}_{n \geq 0} is a martingale.
Since x \mapsto \sqrt{x} is concave, by Jensen's inequality,

  E(\sqrt{U_{n+1}} \mid \mathcal{F}_n) \leq \sqrt{E(U_{n+1} \mid \mathcal{F}_n)} \leq \sqrt{U_n}

and thus \{\sqrt{U_n}\}_{n \geq 0} is a supermartingale.
Assume E(\sqrt{V_{n+1}}) < 1. Then

  E(\sqrt{U_n}) = \prod_{k=1}^{n} E(\sqrt{V_k}) \to 0, \quad n \to \infty.
IS suffers (3)
But \{\sqrt{U_n}\}_{n \geq 0} is a nonnegative supermartingale and thus \sqrt{U_n} converges a.s. to a random variable Z \geq 0. By Fatou's lemma,

  E(Z) = E\left( \lim_{n \to \infty} \sqrt{U_n} \right) \leq \liminf_{n \to \infty} E(\sqrt{U_n}) = 0.

Hence, Z = 0 and U_n \to 0 a.s., which implies that the martingale \{U_n\}_{n \geq 0} is not regular.
Apply these results to V_k = (d\mu/d\nu)(\xi^i_k), i \in \{1, \ldots, N\}:

  E\left( \sqrt{ \frac{d\mu}{d\nu}(\xi^i_k) } \right) \leq \sqrt{ E\left( \frac{d\mu}{d\nu}(\xi^i_k) \right) } = 1,

with equality iff d\mu/d\nu = 1, \nu-a.e., i.e. \mu = \nu.
Thus all importance weights converge to 0.
Too volatile!

Example (Stochastic volatility model)

  y_t = \exp(x_t/2)\, \epsilon_t, \qquad \epsilon_t \sim N(0, 1)

with AR(1) log-variance process (or volatility)

  x_{t+1} = \varphi x_t + \sigma u_t, \qquad u_t \sim N(0, 1)
[Figure: evolution of IBM stocks (corrected from trend and log-ratio-ed) over 500 days]
Example (Stochastic volatility model (2))
Observed likelihood unavailable in closed form.
The joint posterior (or conditional) distribution of the hidden state sequence \{X_k\}_{1 \leq k \leq K} can be evaluated explicitly as

  \prod_{k=2}^{K} \exp - \left\{ \sigma^{-2} (x_k - \varphi x_{k-1})^2 + \exp(-x_k)\, y_k^2 + x_k \right\} / 2, \qquad (2)

up to a normalizing constant.
Computational problems
Example (Stochastic volatility model (3))
Direct simulation from this distribution impossible because of
(a) dependence among the $X_k$'s,
(b) dimension of the sequence $\{X_k\}_{1 \le k \le K}$, and
(c) exponential term $\exp(-x_k) y_k^2$ within (2).
Markov Chain Monte Carlo Methods
Monte Carlo Integration
Importance Sampling
Importance sampling
Example (Stochastic volatility model (4))
Natural candidate: replace the exponential term with a quadratic approximation to preserve Gaussianity.
E.g., expand $\exp(-x_k)$ around its conditional expectation $\varphi x_{k-1}$ as
$$\exp(-x_k) \approx \exp(-\varphi x_{k-1}) \left\{ 1 - (x_k - \varphi x_{k-1}) + \frac{1}{2}(x_k - \varphi x_{k-1})^2 \right\}$$
Markov Chain Monte Carlo Methods
Monte Carlo Integration
Importance Sampling
Example (Stochastic volatility model (5))
Corresponding Gaussian importance distribution with mean
$$\mu_k = \frac{\varphi x_{k-1} \left\{ \sigma^{-2} + y_k^2 \exp(-\varphi x_{k-1})/2 \right\} - \left\{ 1 - y_k^2 \exp(-\varphi x_{k-1}) \right\}/2}{\sigma^{-2} + y_k^2 \exp(-\varphi x_{k-1})/2}$$
and variance
$$\sigma_k^2 = \left( \sigma^{-2} + y_k^2 \exp(-\varphi x_{k-1})/2 \right)^{-1}$$
Prior proposal on $X_1$,
$$X_1 \sim \mathcal{N}(0, \sigma^2)$$
Markov Chain Monte Carlo Methods
Monte Carlo Integration
Importance Sampling
Example (Stochastic volatility model (6))
Simulation starts with $X_1$ and proceeds forward to $X_n$, each $X_k$ being generated conditional on $Y_k$ and the previously generated $X_{k-1}$.
Importance weight computed sequentially as the product of
$$\frac{\exp - \left\{ \sigma^{-2} (x_k - \varphi x_{k-1})^2 + \exp(-x_k) y_k^2 + x_k \right\}/2}{\exp - \left\{ (x_k - \mu_k)^2 / 2\sigma_k^2 \right\} \, \sigma_k^{-1}} \qquad (1 \le k \le K)
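A sketch of this sequential sampler (added, not in the original slides), using the proposal mean and variance reconstructed above; the parameter values $\varphi = 0.9$, $\sigma = 0.5$, $K = 100$ and the simulated data are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(2)
    phi, sigma, K, N = 0.9, 0.5, 100, 10_000

    # simulate data from the model
    x_true = np.zeros(K)
    for k in range(1, K):
        x_true[k] = phi * x_true[k - 1] + sigma * rng.standard_normal()
    y = np.exp(x_true / 2) * rng.standard_normal(K)

    x = rng.normal(0.0, sigma, N)          # prior proposal on X_1
    logw = np.zeros(N)
    for k in range(1, K):
        prec = sigma**-2 + y[k]**2 * np.exp(-phi * x) / 2
        mu = (phi * x * prec - (1 - y[k]**2 * np.exp(-phi * x)) / 2) / prec
        s2 = 1 / prec
        xn = mu + np.sqrt(s2) * rng.standard_normal(N)
        # log target increment minus log proposal density (constants cancel)
        logw += -(sigma**-2 * (xn - phi * x)**2
                  + np.exp(-xn) * y[k]**2 + xn) / 2
        logw -= -(xn - mu)**2 / (2 * s2) - 0.5 * np.log(s2)
        x = xn

    w = np.exp(logw - logw.max())
    print("effective sample size:", w.sum()**2 / (w**2).sum())

The collapse of the effective sample size as K grows is exactly the weight degeneracy discussed above.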
Markov Chain Monte Carlo Methods
Monte Carlo Integration
Importance Sampling
weights
[Figure] Histogram of the logarithms of the importance weights (left) and comparison between the true volatility and the best fit, based on 10,000 simulated importance samples.
Markov Chain Monte Carlo Methods
Monte Carlo Integration
Importance Sampling
[Figure] Corresponding range of the simulated $\{X_k\}_{1 \le k \le 100}$, compared with the true value.
Markov Chain Monte Carlo Methods
Monte Carlo Integration
Acceleration methods
Correlated simulations
Negative correlation reduces variance
Special technique, but efficient when it applies
Two samples $(X_1, \ldots, X_m)$ and $(Y_1, \ldots, Y_m)$ from $f$ to estimate
$$I = \int_{\mathbb{R}} h(x) f(x) dx$$
by
$$\hat{I}_1 = \frac{1}{m} \sum_{i=1}^m h(X_i) \quad\text{and}\quad \hat{I}_2 = \frac{1}{m} \sum_{i=1}^m h(Y_i)$$
with mean $I$ and variance $\sigma^2$
Markov Chain Monte Carlo Methods
Monte Carlo Integration
Acceleration methods
Variance reduction
Variance of the average
$$\mathrm{var}\left( \frac{\hat{I}_1 + \hat{I}_2}{2} \right) = \frac{\sigma^2}{2} + \frac{1}{2} \, \mathrm{cov}(\hat{I}_1, \hat{I}_2) .$$
If the two samples are negatively correlated,
$$\mathrm{cov}(\hat{I}_1, \hat{I}_2) \le 0 ,$$
they improve on two independent samples of the same size
Markov Chain Monte Carlo Methods
Monte Carlo Integration
Acceleration methods
Antithetic variables
• If $f$ symmetric about $\mu$, take $Y_i = 2\mu - X_i$
• If $X_i = F^{-1}(U_i)$, take $Y_i = F^{-1}(1 - U_i)$ (see the numerical sketch below)
• If $(A_i)_i$ partition of $\mathcal{X}$, partitioned sampling by sampling the $X_j$'s in each $A_i$ (requires to know $\Pr(A_i)$)
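A minimal check of the second construction (added, not part of the slides); the target $E[\exp(X)]$ with $X \sim \mathcal{N}(0,1)$ is an illustrative assumption.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(3)
    reps, m = 500, 1000
    u = rng.uniform(size=(reps, m))
    x = stats.norm.ppf(u)            # X_i = F^{-1}(U_i)
    y = stats.norm.ppf(1 - u)        # antithetic Y_i = F^{-1}(1 - U_i) = -X_i here

    I_indep = np.exp(x).mean(axis=1)                      # m evaluations
    I_anti = (np.exp(x[:, :m // 2]).mean(axis=1)
              + np.exp(y[:, :m // 2]).mean(axis=1)) / 2   # also m evaluations
    print("var independent:", I_indep.var())
    print("var antithetic: ", I_anti.var())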
Markov Chain Monte Carlo Methods
Monte Carlo Integration
Acceleration methods
Control variates
out of control!
For
$$I = \int h(x) f(x) dx$$
unknown and
$$I_0 = \int h_0(x) f(x) dx$$
known, $I_0$ estimated by $\hat{I}_0$ and $I$ estimated by $\hat{I}$
Markov Chain Monte Carlo Methods
Monte Carlo Integration
Acceleration methods
Control variates (2)
Combined estimator
$$\hat{I}^* = \hat{I} + \beta (\hat{I}_0 - I_0)$$
$\hat{I}^*$ is unbiased for $I$ and
$$\mathrm{var}(\hat{I}^*) = \mathrm{var}(\hat{I}) + \beta^2 \, \mathrm{var}(\hat{I}_0) + 2\beta \, \mathrm{cov}(\hat{I}, \hat{I}_0)$$
Markov Chain Monte Carlo Methods
Monte Carlo Integration
Acceleration methods
Optimal control
Optimal choice of $\beta$:
$$\beta^* = - \frac{\mathrm{cov}(\hat{I}, \hat{I}_0)}{\mathrm{var}(\hat{I}_0)} ,$$
with
$$\mathrm{var}(\hat{I}^*) = (1 - \rho^2) \, \mathrm{var}(\hat{I}) ,$$
where $\rho$ is the correlation between $\hat{I}$ and $\hat{I}_0$.
Usual solution: regression coefficient of $h(x_i)$ over $h_0(x_i)$
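A short sketch of the regression choice of $\beta$ (added, not part of the slides); the pair $h(x) = e^x$, $h_0(x) = x$ with $I_0 = 0$ under $\mathcal{N}(0,1)$ is an illustrative assumption.

    import numpy as np

    rng = np.random.default_rng(4)
    x = rng.standard_normal(10_000)
    h, h0 = np.exp(x), x

    beta = -np.cov(h, h0)[0, 1] / h0.var()   # beta* = -cov/var, estimated
    I_plain = h.mean()
    I_cv = h.mean() + beta * (h0.mean() - 0.0)
    print(I_plain, I_cv)   # both near exp(1/2) ~ 1.6487, I_cv with smaller variance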
Markov Chain Monte Carlo Methods
Monte Carlo Integration
Acceleration methods
Example (Quantile Approximation)
Evaluate
$$\varrho = \Pr(X > a) = \int_a^\infty f(x) dx$$
by
$$\hat\varrho = \frac{1}{n} \sum_{i=1}^n \mathbb{I}(X_i > a) ,$$
with $X_i$ iid $f$.
If $\Pr(X > \mu) = \frac{1}{2}$ known
Markov Chain Monte Carlo Methods
Monte Carlo Integration
Acceleration methods
Example (Quantile Approximation (2))
Control variate
$$\tilde\varrho = \frac{1}{n} \sum_{i=1}^n \mathbb{I}(X_i > a) + \beta \left( \frac{1}{n} \sum_{i=1}^n \mathbb{I}(X_i > \mu) - \Pr(X > \mu) \right)$$
improves upon $\hat\varrho$ if
$$\beta < 0 \quad\text{and}\quad |\beta| < 2 \, \frac{\mathrm{cov}(\hat\varrho, \hat\varrho_0)}{\mathrm{var}(\hat\varrho_0)} \simeq 2 \, \frac{\Pr(X > a)}{\Pr(X > \mu)} .$$
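A numerical sketch of this control variate (added, not from the slides); the normal case with $a = 2$ and $\mu = 0$ is an illustrative assumption, and $\beta$ is set near the optimal value $-\mathrm{cov}/\mathrm{var} \approx -2\Pr(X > a)$.

    import numpy as np

    rng = np.random.default_rng(5)
    a, n, reps = 2.0, 10_000, 500
    x = rng.standard_normal((reps, n))

    rho_hat = (x > a).mean(axis=1)
    rho_0 = (x > 0.0).mean(axis=1)
    beta = -2 * 0.0228                   # ~ optimal beta* = -cov/var here
    rho_cv = rho_hat + beta * (rho_0 - 0.5)
    print("var plain:", rho_hat.var(), "var with control:", rho_cv.var())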
Markov Chain Monte Carlo Methods
Monte Carlo Integration
Acceleration methods
Integration by conditioning
Use the Rao-Blackwell Theorem:
$$\mathrm{var}(E[\delta(X) \mid Y]) \le \mathrm{var}(\delta(X))$$
Markov Chain Monte Carlo Methods
Monte Carlo Integration
Acceleration methods
Consequence
If $\hat{I}$ is an unbiased estimator of $I = E_f[h(X)]$, with $X$ simulated from a joint density $\tilde{f}(x, y)$, where
$$\int \tilde{f}(x, y) dy = f(x) ,$$
the estimator
$$\hat{I}^* = E_{\tilde{f}}[\hat{I} \mid Y_1, \ldots, Y_n]$$
dominates $\hat{I}(X_1, \ldots, X_n)$ variance-wise (and is unbiased)
Markov Chain Monte Carlo Methods
Monte Carlo Integration
Acceleration methods
skip expectation
Example (Student's t expectation)
For
$$E[h(X)] = E[\exp(-X^2)] \quad\text{with}\quad X \sim \mathcal{T}(\nu, 0, \sigma^2)$$
a Student's t distribution can be simulated as
$$X \mid y \sim \mathcal{N}(\mu, \sigma^2 y) \quad\text{and}\quad Y^{-1} \sim \chi^2_\nu / \nu .$$
Markov Chain Monte Carlo Methods
Monte Carlo Integration
Acceleration methods
Example (Student's t expectation (2))
Empirical distribution
$$\frac{1}{m} \sum_{j=1}^m \exp(-X_j^2) ,$$
can be improved from the joint sample
$$((X_1, Y_1), \ldots, (X_m, Y_m))$$
since
$$\frac{1}{m} \sum_{j=1}^m E[\exp(-X^2) \mid Y_j] = \frac{1}{m} \sum_{j=1}^m \frac{1}{\sqrt{2\sigma^2 Y_j + 1}}$$
is the conditional expectation.
In this example, precision ten times better
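A sketch of the two estimators (added, not in the slides), using the mixture representation above with $\mu = 0$; $(\nu, \sigma^2, m) = (4.6, 1, 10\,000)$ follows the figure's setting and is otherwise an assumption.

    import numpy as np

    rng = np.random.default_rng(6)
    nu, sigma2, m = 4.6, 1.0, 10_000
    y = nu / rng.chisquare(nu, m)              # Y with 1/Y ~ chi^2_nu / nu
    x = rng.normal(0.0, np.sqrt(sigma2 * y))   # X | Y ~ N(0, sigma^2 Y)

    emp = np.exp(-x**2).mean()                          # empirical average
    rao = (1 / np.sqrt(2 * sigma2 * y + 1)).mean()      # E[exp(-X^2) | Y] averaged
    print(emp, rao)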
Markov Chain Monte Carlo Methods
Monte Carlo Integration
Acceleration methods


[Figure] Estimators of $E[\exp(-X^2)]$: empirical average (full) and conditional expectation (dotted) for $(\nu, \mu, \sigma) = (4.6, 0, 1)$.
Markov Chain Monte Carlo Methods
Monte Carlo Integration
Bayesian importance sampling
Bayesian importance functions
directly Markovian
Recall algorithm:
1. Generate $\theta^{(1)}, \ldots, \theta^{(T)}$ from $cg(\theta)$, with
$$c^{-1} = \int g(\theta) d\theta$$
2. Take
$$\int f(x \mid \theta) \pi(\theta) d\theta \approx \frac{1}{T} \sum_{t=1}^T f(x \mid \theta^{(t)}) \frac{\pi(\theta^{(t)})}{c \, g(\theta^{(t)})} \approx \sum_{t=1}^T f(x \mid \theta^{(t)}) \frac{\pi(\theta^{(t)})}{g(\theta^{(t)})} \Bigg/ \sum_{t=1}^T \frac{\pi(\theta^{(t)})}{g(\theta^{(t)})} = m^{IS}(x)$$
Markov Chain Monte Carlo Methods
Monte Carlo Integration
Bayesian importance sampling
Choice of g
• $g(\theta) = \pi(\theta)$:
$$m^{IS}(x) = \frac{1}{T} \sum_t f(x \mid \theta^{(t)})$$
often inefficient if data informative
impossible if $\pi$ is improper
• $g(\theta) \propto f(x \mid \theta) \pi(\theta)$:
$c$ unknown
$$m^{IS}(x) = 1 \Bigg/ \frac{1}{T} \sum_{t=1}^T \frac{1}{f(x \mid \theta^{(t)})}$$
improper priors allowed
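A toy comparison of the two choices (added, not in the slides): the conjugate model $x \sim \mathcal{N}(\theta, 1)$, $\theta \sim \mathcal{N}(0, 1)$ is an illustrative assumption, chosen because $m(x)$ is then a $\mathcal{N}(0, 2)$ density and the answer is known.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(7)
    x_obs, T = 1.3, 100_000
    lik = lambda th: stats.norm.pdf(x_obs, loc=th)

    # g = prior: average the likelihood over prior draws
    th_prior = rng.standard_normal(T)
    m_prior = lik(th_prior).mean()

    # g proportional to posterior (here N(x/2, 1/2)): harmonic mean of the likelihood
    th_post = rng.normal(x_obs / 2, np.sqrt(0.5), T)
    m_harm = 1 / (1 / lik(th_post)).mean()

    print(m_prior, m_harm, stats.norm.pdf(x_obs, scale=np.sqrt(2)))

The harmonic mean version is notoriously unstable when the likelihood can get close to zero under the posterior; this toy case hides that weakness.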
Markov Chain Monte Carlo Methods
Monte Carlo Integration
Bayesian importance sampling
• $g(\theta) = \alpha \pi(\theta) + (1 - \alpha) \pi(\theta \mid x)$:
defensive mixture, $\alpha \ll 1$ Ok
[Hesterberg, 1998]
• $g(\theta) = \pi(\theta \mid x)$:
$$\hat m_h(x) = 1 \Bigg/ \frac{1}{T} \sum_{t=1}^T \frac{h(\theta^{(t)})}{f(x \mid \theta^{(t)}) \pi(\theta^{(t)})}$$
works for any density $h$
finite variance if
$$\int \frac{h^2(\theta)}{f(x \mid \theta) \pi(\theta)} d\theta < \infty$$
Markov Chain Monte Carlo Methods
Monte Carlo Integration
Bayesian importance sampling
Bridge sampling
[Chen & Shao, 1997]
Given two models $f_1(x \mid \theta_1)$ and $f_2(x \mid \theta_2)$,
$$\pi_1(\theta_1 \mid x) = \frac{\pi_1(\theta_1) f_1(x \mid \theta_1)}{m_1(x)} \qquad \pi_2(\theta_2 \mid x) = \frac{\pi_2(\theta_2) f_2(x \mid \theta_2)}{m_2(x)}$$
Bayes factor:
$$B_{12}(x) = \frac{m_1(x)}{m_2(x)}$$
ratio of normalising constants
Markov Chain Monte Carlo Methods
Monte Carlo Integration
Bayesian importance sampling
Bridge sampling (2)
(i) Missing normalising constants:
$$\tilde\pi_1(\theta \mid x) \propto \pi_1(\theta) f_1(x \mid \theta) \qquad \tilde\pi_2(\theta \mid x) \propto \pi_2(\theta) f_2(x \mid \theta)$$
$$B_{12} \approx \frac{1}{n} \sum_{i=1}^n \frac{\tilde\pi_1(\theta_i \mid x)}{\tilde\pi_2(\theta_i \mid x)} \qquad \theta_i \sim \pi_2(\cdot \mid x)$$
Markov Chain Monte Carlo Methods
Monte Carlo Integration
Bayesian importance sampling
Bridge sampling (3)
(ii) Still missing normalising constants: for any $\alpha(\cdot)$,
$$B_{12} = \frac{\int \tilde\pi_1(\theta) \alpha(\theta) \pi_2(\theta \mid x) d\theta}{\int \tilde\pi_2(\theta) \alpha(\theta) \pi_1(\theta \mid x) d\theta} \approx \frac{\dfrac{1}{n_2} \displaystyle\sum_{i=1}^{n_2} \tilde\pi_1(\theta_{2i}) \alpha(\theta_{2i})}{\dfrac{1}{n_1} \displaystyle\sum_{i=1}^{n_1} \tilde\pi_2(\theta_{1i}) \alpha(\theta_{1i})} \qquad \theta_{ji} \sim \pi_j(\cdot \mid x)$$
Markov Chain Monte Carlo Methods
Monte Carlo Integration
Bayesian importance sampling
Bridge sampling (4)
Optimal choice
$$\alpha^\star(\theta) = \frac{n_1 + n_2}{n_1 \pi_1(\theta \mid x) + n_2 \pi_2(\theta \mid x)}$$
[Chen, Meng & Wong, 2000]
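A minimal bridge-sampling sketch (added, not in the slides): the two unnormalised Gaussian targets and the simple (suboptimal) bridge function are illustrative assumptions; the true ratio of normalising constants here is 1.

    import numpy as np

    rng = np.random.default_rng(8)
    n1 = n2 = 50_000
    p1t = lambda x: np.exp(-x**2 / 2)            # unnormalised pi1
    p2t = lambda x: np.exp(-(x - 1)**2 / 2)      # unnormalised pi2
    x1 = rng.standard_normal(n1)                 # draws from pi1
    x2 = 1 + rng.standard_normal(n2)             # draws from pi2

    alpha = lambda x: 1 / (p1t(x) + p2t(x))      # simple bridge, not the optimum
    B12 = (p1t(x2) * alpha(x2)).mean() / (p2t(x1) * alpha(x1)).mean()
    print(B12)   # close to 1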
Markov Chain Monte Carlo Methods
Notions on Markov Chains
Notions on Markov Chains
Notions on Markov Chains
Basics
Irreducibility
Transience and Recurrence
Invariant measures
Ergodicity and convergence
Limit theorems
Quantitative convergence rates
Coupling
Renewal and CLT
Markov Chain Monte Carlo Methods
Notions on Markov Chains
Basics
Basics
Definition (Markov chain)
A sequence of random variables whose distribution evolves over time as a function of past realizations
Chain defined through its transition kernel, a function K defined on $\mathcal{X} \times \mathcal{B}(\mathcal{X})$ such that
• $\forall x \in \mathcal{X}$, $K(x, \cdot)$ is a probability measure;
• $\forall A \in \mathcal{B}(\mathcal{X})$, $K(\cdot, A)$ is measurable.
Markov Chain Monte Carlo Methods
Notions on Markov Chains
Basics
no discrete
When $\mathcal{X}$ is a discrete (finite or denumerable) set, the transition kernel simply is a (transition) matrix K with elements
$$P_{xy} = \Pr(X_n = y \mid X_{n-1} = x) , \qquad x, y \in \mathcal{X}$$
Since, for all $x \in \mathcal{X}$, $K(x, \cdot)$ is a probability, we must have
$$P_{xy} \ge 0 \quad\text{and}\quad K(x, \mathcal{X}) = \sum_{y \in \mathcal{X}} P_{xy} = 1$$
The matrix K is referred to as a Markov transition matrix or a stochastic matrix
Markov Chain Monte Carlo Methods
Notions on Markov Chains
Basics
In the continuous case, the kernel also denotes the conditional density $K(x, x')$ of the transition $K(x, \cdot)$:
$$\Pr(X \in A \mid x) = \int_A K(x, x') dx' .$$
Then, for any bounded $\varphi$, we may define
$$K\varphi(x) = \int_{\mathcal{X}} K(x, dy) \varphi(y) .$$
Note that
$$|K\varphi(x)| \le \int_{\mathcal{X}} K(x, dy) |\varphi(y)| \le \|\varphi\|_\infty = \sup_{x \in \mathcal{X}} |\varphi(x)| .$$
We may also associate to a probability measure $\mu$ the measure $\mu K$, defined as
$$\mu K(A) = \int_{\mathcal{X}} \mu(dx) K(x, A) .$$
Markov Chain Monte Carlo Methods
Notions on Markov Chains
Basics
Markov chains
skip denition
Given a transition kernel K, a sequence $X_0, X_1, \ldots, X_n, \ldots$ of random variables is a Markov chain, denoted by $(X_n)$, if, for any $t$, the conditional distribution of $X_t$ given $x_{t-1}, x_{t-2}, \ldots, x_0$ is the same as the distribution of $X_t$ given $x_{t-1}$. That is,
$$\Pr(X_{k+1} \in A \mid x_0, x_1, x_2, \ldots, x_k) = \Pr(X_{k+1} \in A \mid x_k) = \int_A K(x_k, dx)$$
Markov Chain Monte Carlo Methods
Notions on Markov Chains
Basics
Note that the entire structure of the chain only depends on
• The transition function K
• The initial state $x_0$ or initial distribution $X_0 \sim \mu$
Markov Chain Monte Carlo Methods
Notions on Markov Chains
Basics
Example (Random walk)
The normal random walk is the kernel $K(x, \cdot)$ associated with the distribution
$$\mathcal{N}_p(x, \tau^2 I_p)$$
which means
$$X_{t+1} = X_t + \tau \epsilon_t$$
$\epsilon_t$ being an iid additional noise
Markov Chain Monte Carlo Methods
Notions on Markov Chains
Basics
[Figure] 100 consecutive realisations of the random walk in $\mathbb{R}^2$ with $\tau = 1$
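A two-line sketch reproducing such a path (added, not in the slides):

    import numpy as np

    rng = np.random.default_rng(9)
    tau, T = 1.0, 100
    steps = tau * rng.standard_normal((T, 2))
    path = np.cumsum(steps, axis=0)        # X_{t+1} = X_t + tau * eps_t
    print(path[:5])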
Markov Chain Monte Carlo Methods
Notions on Markov Chains
Basics
bypass remarks
On a discrete state-space $\mathcal{X} = \{x_0, x_1, \ldots\}$,
• A function $\varphi$ on a discrete state space is uniquely defined by the (column) vector $\varphi = (\varphi(x_0), \varphi(x_1), \ldots)^T$ and
$$K\varphi(x) = \sum_{y \in \mathcal{X}} P_{xy} \varphi(y)$$
can be interpreted as the $x$th component of the product of the transition matrix K and of the vector $\varphi$.
• A probability distribution on $\mathcal{P}(\mathcal{X})$ is defined as a (row) vector $\mu = (\mu(x_0), \mu(x_1), \ldots)$ and the probability distribution $\mu K$ is defined, for each $y \in \mathcal{X}$, as
$$\mu K(y) = \sum_{x \in \mathcal{X}} \mu(x) P_{xy} ,$$
the $y$th component of the product of the vector $\mu$ and of the transition matrix K.
Markov Chain Monte Carlo Methods
Notions on Markov Chains
Basics
Composition of kernels
Let $Q_1$ and $Q_2$ be two probability kernels. Define, for any $x \in \mathcal{X}$ and any $A \in \mathcal{B}(\mathcal{X})$, the product of kernels $Q_1 Q_2$ as
$$Q_1 Q_2(x, A) = \int_{\mathcal{X}} Q_1(x, dy) Q_2(y, A)$$
When the state space $\mathcal{X}$ is discrete, the product of Markov kernels coincides with the product of matrices $Q_1 \times Q_2$.
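A two-state illustration of this matrix view (added, not in the slides):

    import numpy as np

    Q1 = np.array([[0.9, 0.1],
                   [0.2, 0.8]])
    Q2 = np.array([[0.5, 0.5],
                   [0.3, 0.7]])

    Q = Q1 @ Q2                      # "one step of Q1 then one step of Q2"
    mu = np.array([1.0, 0.0])        # initial distribution (row vector)
    print(Q)                         # rows of Q sum to 1
    print(mu @ Q1)                   # mu Q1 is again a distribution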
Markov Chain Monte Carlo Methods
Notions on Markov Chains
Irreducibility
Irreducibility
Irreducibility is one measure of the sensitivity of the Markov chain to initial conditions
It leads to a guarantee of convergence for MCMC algorithms
Definition (Irreducibility)
In the discrete case, the chain is irreducible if all states communicate, namely if
$$P_x(\tau_y < \infty) > 0 , \qquad \forall x, y \in \mathcal{X} ,$$
$\tau_y$ being the first (positive) time $y$ is visited
Markov Chain Monte Carlo Methods
Notions on Markov Chains
Irreducibility
Irreducibility for a continuous chain
In the continuous case, the chain is $\psi$-irreducible for some measure $\psi$ if, for some $n$,
$$K^n(x, A) > 0$$
• for all $x \in \mathcal{X}$
• for every $A \in \mathcal{B}(\mathcal{X})$ with $\psi(A) > 0$
Markov Chain Monte Carlo Methods
Notions on Markov Chains
Irreducibility
Minoration condition
Assume there exist a probability measure $\nu$ and $\epsilon > 0$ such that, for all $x \in \mathcal{X}$ and all $A \in \mathcal{B}(\mathcal{X})$,
$$K(x, A) \ge \epsilon \nu(A)$$
This is called a minoration condition.
When K is a Markov chain on a discrete state space, this is equivalent to saying that $P_{xy} > 0$ for all $x, y \in \mathcal{X}$.
Markov Chain Monte Carlo Methods
Notions on Markov Chains
Irreducibility
Small sets
Definition (Small set)
If there exist $C \in \mathcal{B}(\mathcal{X})$ with $\psi(C) > 0$, a probability measure $\nu$ and $\epsilon > 0$ such that, for all $x \in C$ and all $A \in \mathcal{B}(\mathcal{X})$,
$$K(x, A) \ge \epsilon \nu(A) ,$$
C is called a small set
For discrete state spaces, atoms are small sets.
Markov Chain Monte Carlo Methods
Notions on Markov Chains
Transience and Recurrence
Towards further stability
• Irreducibility: every set A has a chance to be visited by the Markov chain $(X_n)$
• This property is too weak to ensure that the trajectory of $(X_n)$ will enter A often enough.
• A Markov chain must enjoy good stability properties to guarantee an acceptable approximation of the simulated model.
• Formalizing this stability leads to different notions of recurrence
• For discrete chains, the recurrence of a state is equivalent to probability one of sure return.
• Always satisfied for irreducible chains on finite spaces
Markov Chain Monte Carlo Methods
Notions on Markov Chains
Transience and Recurrence
Transience and Recurrence
In a finite state space $\mathcal{X}$, denote the average number of visits to a state $\omega$ by
$$\eta_\omega = \sum_{i=1}^\infty \mathbb{I}_\omega(X_i)$$
• If $E_\omega[\eta_\omega] = \infty$, the state is recurrent
• If $E_\omega[\eta_\omega] < \infty$, the state is transient
For irreducible chains, recurrence/transience is a property of the chain, not of a particular state.
Similar definitions for the continuous case.
Markov Chain Monte Carlo Methods
Notions on Markov Chains
Transience and Recurrence
Harris recurrence
Stronger form of recurrence:
Definition (Harris recurrence)
A set A is Harris recurrent if
$$P_x(\eta_A = \infty) = 1 \quad\text{for all } x \in A .$$
The chain $(X_n)$ is Harris recurrent if it is
• $\psi$-irreducible
• for every set A with $\psi(A) > 0$, A is Harris recurrent.
Note that
$$P_x(\eta_A = \infty) = 1 \quad\text{implies}\quad E_x[\eta_A] = \infty$$
Markov Chain Monte Carlo Methods
Notions on Markov Chains
Invariant measures
Invariant measures
Stability increases for the chain $(X_n)$ if the marginal distribution of $X_n$ is independent of $n$.
Requires the existence of a probability distribution $\pi$ such that
$$X_{n+1} \sim \pi \quad\text{if}\quad X_n \sim \pi$$
Definition (Invariant measure)
A measure $\pi$ is invariant for the transition kernel $K(\cdot, \cdot)$ if
$$\pi(B) = \int_{\mathcal{X}} K(x, B) \, \pi(dx) , \qquad \forall B \in \mathcal{B}(\mathcal{X}) .$$
Markov Chain Monte Carlo Methods
Notions on Markov Chains
Invariant measures
Stability properties and invariance
• The chain is positive recurrent if $\pi$ is a probability measure.
• Otherwise it is null recurrent or transient.
• If $\pi$ is a probability measure, it is also called the stationary distribution, since
$$X_0 \sim \pi \quad\text{implies that}\quad X_n \sim \pi \quad\text{for every } n$$
• The stationary distribution is unique
Notions on Markov Chains
Invariant measures
Insights
no time for that!
Invariant probability measures are important not merely because they define stationary processes, but also because they turn out to be the measures which define the long-term or ergodic behavior of the chain.
To understand why, consider $P_\mu(X_n \in \cdot)$ for a starting distribution $\mu$. If a limiting measure $\gamma_\mu$ exists such that
$$P_\mu(X_n \in A) \to \gamma_\mu(A)$$
for all $A \in \mathcal{B}(\mathcal{X})$, then
$$\gamma_\mu(A) = \lim_{n\to\infty} \int_{\mathcal{X}} \mu(dx) P^n(x, A) = \lim_{n\to\infty} \int_{\mathcal{X}} \int_{\mathcal{X}} \mu(dx) P^{n-1}(x, dw) K(w, A) = \int_{\mathcal{X}} \gamma_\mu(dw) K(w, A) ,$$
since setwise convergence of $\int \mu(dx) P^n(x, \cdot)$ implies convergence of integrals of bounded measurable functions. Hence, if a limiting distribution exists, it is an invariant probability measure; and obviously, if there is a unique invariant probability measure, the limit $\gamma_\mu$ will be independent of $\mu$ whenever it exists.
Markov Chain Monte Carlo Methods
Notions on Markov Chains
Ergodicity and convergence
Ergodicity and convergence
We finally consider: to what is the chain converging?
The invariant distribution $\pi$ is a natural candidate for the limiting distribution.
A fundamental property is ergodicity, or independence of initial conditions. In the discrete case, a state $\omega$ is ergodic if
$$\lim_{n\to\infty} |K^n(\omega, \omega) - \pi(\omega)| = 0 .$$
Markov Chain Monte Carlo Methods
Notions on Markov Chains
Ergodicity and convergence
Norm and convergence
In general, we establish convergence using the total variation norm
$$\|\mu_1 - \mu_2\|_{TV} = \sup_A |\mu_1(A) - \mu_2(A)|$$
and we want
$$\left\| \int K^n(x, \cdot) \mu(dx) - \pi \right\|_{TV} = \sup_A \left| \int K^n(x, A) \mu(dx) - \pi(A) \right|$$
to be small.
skip minoration TV
Markov Chain Monte Carlo Methods
Notions on Markov Chains
Ergodicity and convergence
Total variation distance and minoration
Lemma
Let $\mu$ and $\mu'$ be two probability measures. Then,
$$1 - \inf \sum_i \mu(A_i) \wedge \mu'(A_i) = \|\mu - \mu'\|_{TV} ,$$
where the infimum is taken over all finite partitions $(A_i)_i$ of $\mathcal{X}$.
Markov Chain Monte Carlo Methods
Notions on Markov Chains
Ergodicity and convergence
Total variation distance and minoration (2)
Assume that there exist a probability $\nu$ and $\epsilon > 0$ such that, for all $A \in \mathcal{B}(\mathcal{X})$, we have
$$\mu(A) \wedge \mu'(A) \ge \epsilon \nu(A) .$$
Then, for all $I$ and all partitions $A_1, A_2, \ldots, A_I$,
$$\sum_{i=1}^I \mu(A_i) \wedge \mu'(A_i) \ge \epsilon$$
and the previous result thus implies that
$$\|\mu - \mu'\|_{TV} \le (1 - \epsilon) .$$
Markov Chain Monte Carlo Methods
Notions on Markov Chains
Ergodicity and convergence
Harris recurrence and ergodicity
Theorem
If $(X_n)$ is Harris positive recurrent and aperiodic, then
$$\lim_{n\to\infty} \left\| \int K^n(x, \cdot) \mu(dx) - \pi \right\|_{TV} = 0$$
for every initial distribution $\mu$.
We thus take "Harris positive recurrent and aperiodic" as equivalent to "ergodic".
[Meyn & Tweedie, 1993]
Convergence in total variation implies
$$\lim_{n\to\infty} |E_\mu[h(X_n)] - E_\pi[h(X)]| = 0$$
for every bounded function h.
no detail of convergence
Markov Chain Monte Carlo Methods
Notions on Markov Chains
Ergodicity and convergence
Convergences
There are different speeds of convergence:
• ergodic (fast enough)
• geometrically ergodic (faster)
• uniformly ergodic (fastest)
Markov Chain Monte Carlo Methods
Notions on Markov Chains
Ergodicity and convergence
Geometric ergodicity
A $\psi$-irreducible aperiodic Markov kernel P with invariant distribution $\pi$ is geometrically ergodic if there exist $V \ge 1$ and constants $\rho < 1$, $R < \infty$ such that, for all $n \ge 1$,
$$\|P^n(x, \cdot) - \pi(\cdot)\|_V \le R V(x) \rho^n$$
on $\{V < \infty\}$, which is full and absorbing.
Markov Chain Monte Carlo Methods
Notions on Markov Chains
Ergodicity and convergence
Geometric ergodicity implies a lot of important results:
• CLT for additive functionals $n^{-1/2} \sum_k g(X_k)$ and functions $|g| < V$
• Rosenthal's type inequalities
$$E_x \left| \sum_{k=1}^n g(X_k) \right|^p \le C(p) \, n^{p/2} , \qquad |g| < V , \quad p \ge 2$$
• exponential inequalities (for bounded functions and $\lambda$ small enough)
$$E_x \left[ \exp \left\{ \lambda \sum_{k=1}^n g(X_k) \right\} \right] < \infty$$
Markov Chain Monte Carlo Methods
Notions on Markov Chains
Ergodicity and convergence
Minoration condition and uniform ergodicity
Under the minoration condition, the kernel K is thus contractant and standard results in functional analysis show the existence and the unicity of a fixed point $\pi$. The previous relation implies that, for all $x \in \mathcal{X}$,
$$\|P^n(x, \cdot) - \pi\|_{TV} \le (1 - \epsilon)^n$$
Such Markov chains are called uniformly ergodic.
Markov Chain Monte Carlo Methods
Notions on Markov Chains
Ergodicity and convergence
Uniform ergodicity
Theorem (Uniform ergodicity)
The following conditions are equivalent:
• $(X_n)_n$ is uniformly ergodic,
• there exist $\rho < 1$ and $R < \infty$ such that, for all $x \in \mathcal{X}$,
$$\|P^n(x, \cdot) - \pi\|_{TV} \le R \rho^n ,$$
• for some $n > 0$,
$$\sup_{x \in \mathcal{X}} \|P^n(x, \cdot) - \pi(\cdot)\|_{TV} < 1 .$$
[Meyn and Tweedie, 1993]
Markov Chain Monte Carlo Methods
Notions on Markov Chains
Limit theorems
Limit theorems
Ergodicity determines the probabilistic properties of the average behavior of the chain.
But there is also a need for statistical inference, made by induction from the observed sample.
If $\|P_x^n - \pi\|$ is close to 0, this gives no direct information about the sample path average
$$\frac{1}{n} \sum_{i=1}^n h(X_i)$$
© We need LLNs and CLTs!!!
Classical LLNs and CLTs are not directly applicable due to:
• Markovian dependence structure between the observations $X_i$
• Non-stationarity of the sequence
Markov Chain Monte Carlo Methods
Notions on Markov Chains
Limit theorems
The Theorem
Theorem (Ergodic Theorem)
If the Markov chain $(X_n)$ is Harris recurrent, then for any function h with $E_\pi|h| < \infty$,
$$\lim_{n\to\infty} \frac{1}{n} \sum_i h(X_i) = \int h(x) \, d\pi(x) .$$
Markov Chain Monte Carlo Methods
Notions on Markov Chains
Limit theorems
Central Limit Theorem
To get a CLT, we need more assumptions.
skip conditions and results
For MCMC, the easiest is
Definition (reversibility)
A Markov chain $(X_n)$ is reversible if, for all $n$,
$$X_{n+1} \mid X_{n+2} = x \quad\sim\quad X_{n+1} \mid X_n = x$$
The direction of time does not matter.
Markov Chain Monte Carlo Methods
Notions on Markov Chains
Limit theorems
The CLT
Theorem
If the Markov chain $(X_n)$ is Harris recurrent and reversible,
$$\frac{1}{\sqrt{N}} \left( \sum_{n=1}^N \left( h(X_n) - E_\pi[h] \right) \right) \stackrel{\mathcal{L}}{\longrightarrow} \mathcal{N}(0, \gamma_h^2) ,$$
where, writing $\bar h = h - E_\pi[h]$,
$$0 < \gamma_h^2 = E_\pi[\bar h^2(X_0)] + 2 \sum_{k=1}^\infty E_\pi[\bar h(X_0) \bar h(X_k)] < +\infty .$$
[Kipnis & Varadhan, 1986]
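A practical sketch for estimating $\gamma_h^2$ (added, not in the slides): batch means on a reversible AR(1) chain, an illustrative assumption chosen because $\gamma_h^2 = (1+\rho)/(1-\rho)$ is known for $h(x) = x$.

    import numpy as np

    rng = np.random.default_rng(10)
    rho, N = 0.8, 2**16
    x = np.empty(N); x[0] = 0.0
    for t in range(1, N):                      # reversible AR(1), stationary N(0,1)
        x[t] = rho * x[t - 1] + np.sqrt(1 - rho**2) * rng.standard_normal()

    h = x                                      # h(x) = x, E_pi[h] = 0
    b = int(np.sqrt(N))                        # batch length
    means = h[: (N // b) * b].reshape(-1, b).mean(axis=1)
    gamma2_hat = b * means.var(ddof=1)
    print(gamma2_hat, (1 + rho) / (1 - rho))   # theoretical value = 9 here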
Markov Chain Monte Carlo Methods
Notions on Markov Chains
Quantitative convergence rates
Quantitative convergence rates
skip detailed results
Let P be a Markov transition kernel on $(\mathcal{X}, \mathcal{B}(\mathcal{X}))$, with P positive recurrent and $\pi$ its stationary distribution.
Convergence rate: determine, from the kernel, a sequence $B(\mu, n)$ such that
$$\|\mu P^n - \pi\|_V \le B(\mu, n)$$
where $V : \mathcal{X} \to [1, \infty)$ and, for any signed measure $\mu$,
$$\|\mu\|_V = \sup_{|\phi| \le V} |\mu(\phi)| .$$
Markov Chain Monte Carlo Methods
Notions on Markov Chains
Quantitative convergence rates
Practical purposes?
In the 90's, a wealth of contributions on quantitative bounds was triggered by MCMC algorithms, to answer questions like: what is the appropriate burn-in? or how long should the sampling continue after burn-in?
[Douc, Moulines and Rosenthal, 2001]
[Jones and Hobert, 2001]
Markov Chain Monte Carlo Methods
Notions on Markov Chains
Quantitative convergence rates
Tools at hand
For MCMC algorithms, kernels are explicitly known.
Type of quantities (more or less directly) available:
• Minoration constants
$$K^s(x, A) \ge \epsilon \nu(A) , \qquad \text{for all } x \in C ,$$
• Foster-Lyapunov drift conditions,
$$KV \le \lambda V + b \mathbb{I}_C$$
and the goal is to obtain a bound depending explicitly upon $\epsilon$, $\lambda$, $b$, etc.
Markov Chain Monte Carlo Methods
Notions on Markov Chains
Coupling
Coupling
skip coupling
If $X \sim \mu$ and $X' \sim \mu'$, one can construct two random variables $\tilde{X}$ and $\tilde{X}'$ such that
$$\tilde{X} \sim \mu , \qquad \tilde{X}' \sim \mu' \qquad\text{and}\qquad \tilde{X} = \tilde{X}' \ \text{with probability } \epsilon$$
The basic coupling construction:
• with probability $\epsilon$, draw Z according to $\nu$ and set $\tilde{X} = \tilde{X}' = Z$.
• with probability $1 - \epsilon$, draw $\tilde{X}$ and $\tilde{X}'$ independently under the distributions
$$(\mu - \epsilon\nu)/(1 - \epsilon) \quad\text{and}\quad (\mu' - \epsilon\nu)/(1 - \epsilon) ,$$
respectively.
[Thorisson, 2000]
Markov Chain Monte Carlo Methods
Notions on Markov Chains
Coupling
Coupling inequality
$X, X'$ r.v.'s with probability distributions $K(x, \cdot)$ and $K(x', \cdot)$, respectively, can be coupled with probability $\epsilon$ if
$$K(x, \cdot) \wedge K(x', \cdot) \ge \epsilon \, \nu_{x,x'}(\cdot)$$
where $\nu_{x,x'}$ is a probability measure, or, equivalently,
$$\|K(x, \cdot) - K(x', \cdot)\|_{TV} \le (1 - \epsilon)$$
Define an $\epsilon$-coupling set as a set $\bar{C} \subset \mathcal{X} \times \mathcal{X}$ satisfying:
$$\forall (x, x') \in \bar{C}, \ \forall A \in \mathcal{B}(\mathcal{X}), \qquad K(x, A) \wedge K(x', A) \ge \epsilon \, \nu_{x,x'}(A)$$
Markov Chain Monte Carlo Methods
Notions on Markov Chains
Coupling
Small set and coupling sets
$C \subset \mathcal{X}$ is a small set if there exist $\epsilon > 0$ and a probability measure $\nu$ such that, for all $A \in \mathcal{B}(\mathcal{X})$,
$$K(x, A) \ge \epsilon \nu(A) , \qquad x \in C . \qquad (3)$$
Small sets always exist when the MC is $\psi$-irreducible.
[Jain and Jamieson, 1967]
For MCMC kernels, small sets are in general easy to find.
If C is a small set, then $\bar{C} = C \times C$ is a coupling set:
$$\forall (x, x') \in \bar{C}, \ \forall A \in \mathcal{B}(\mathcal{X}), \qquad K(x, A) \wedge K(x', A) \ge \epsilon \nu(A) .$$
Markov Chain Monte Carlo Methods
Notions on Markov Chains
Coupling
Coupling for Markov chains
$\bar{P}$ Markov transition kernel on $\mathcal{X} \times \mathcal{X}$ such that, for all $(x, x') \notin \bar{C}$ (where $\bar{C}$ is an $\epsilon$-coupling set) and all $A \in \mathcal{B}(\mathcal{X})$:
$$\bar{P}(x, x'; A \times \mathcal{X}) = K(x, A) \quad\text{and}\quad \bar{P}(x, x'; \mathcal{X} \times A) = K(x', A)$$
For example, for $(x, x') \notin \bar{C}$,
$$\bar{P}(x, x'; A \times A') = K(x, A) K(x', A') .$$
For all $(x, x') \in \bar{C}$ and all $A, A' \in \mathcal{B}(\mathcal{X})$, define the residual kernel
$$\bar{R}(x, x'; A \times \mathcal{X}) = (1 - \epsilon)^{-1} \left( K(x, A) - \epsilon \, \nu_{x,x'}(A) \right)$$
$$\bar{R}(x, x'; \mathcal{X} \times A') = (1 - \epsilon)^{-1} \left( K(x', A') - \epsilon \, \nu_{x,x'}(A') \right) .$$
Markov Chain Monte Carlo Methods
Notions on Markov Chains
Coupling
Coupling algorithm
• Initialisation: Let $X_0 \sim \xi$ and $X_0' \sim \xi'$, and set $d_0 = 0$.
• After coupling: If $d_n = 1$, then draw $X_{n+1} \sim K(X_n, \cdot)$ and set $X_{n+1}' = X_{n+1}$.
• Before coupling: If $d_n = 0$ and $(X_n, X_n') \in \bar{C}$,
  – with probability $\epsilon$, draw $X_{n+1} = X_{n+1}' \sim \nu_{X_n, X_n'}$ and set $d_{n+1} = 1$;
  – with probability $1 - \epsilon$, draw $(X_{n+1}, X_{n+1}') \sim \bar{R}(X_n, X_n'; \cdot)$ and set $d_{n+1} = 0$.
• If $d_n = 0$ and $(X_n, X_n') \notin \bar{C}$, then draw $(X_{n+1}, X_{n+1}') \sim \bar{P}(X_n, X_n'; \cdot)$.
$(X_n, X_n', d_n)$ [where $d_n$ is the bell variable which indicates whether the chains have coupled or not] is a Markov chain on $\mathcal{X} \times \mathcal{X} \times \{0, 1\}$.
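A sketch of this construction on a two-state chain (added, not in the slides); here the whole space is a small set, so the chains can attempt to couple at every step.

    import numpy as np

    rng = np.random.default_rng(11)
    K = np.array([[0.7, 0.3],
                  [0.4, 0.6]])
    nu_raw = K.min(axis=0)            # componentwise minimum over rows
    eps = nu_raw.sum()                # minoration constant (0.7 here)
    nu = nu_raw / eps
    R = (K - eps * nu) / (1 - eps)    # residual kernel, still stochastic

    x, xp, T = 0, 1, None
    for n in range(1, 1000):
        if rng.uniform() < eps:       # draw from nu: the chains couple
            x = xp = rng.choice(2, p=nu)
            T = n
            break
        x = rng.choice(2, p=R[x])     # otherwise: independent residual moves
        xp = rng.choice(2, p=R[xp])
    print("coupling time:", T)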
Markov Chain Monte Carlo Methods
Notions on Markov Chains
Coupling
Coupling inequality (again!)
Define the coupling time T as
$$T = \inf\{ k \ge 1 , \ d_k = 1 \}$$
Coupling inequality:
$$\sup_A \left| \xi P^k(A) - \xi' P^k(A) \right| \le P_{\xi, \xi', 0}[T > k]$$
[Pitman, 1976; Lindvall, 1992]
Markov Chain Monte Carlo Methods
Notions on Markov Chains
Coupling
Drift conditions
To exploit the coupling construction, we need to control the hitting time.
Moments of the return time to a set C are most often controlled using a Foster-Lyapunov drift condition:
$$PV \le \lambda V + b \mathbb{I}_C , \qquad V \ge 1$$
$M_k = \lambda^{-k} V(X_k) \mathbb{I}(\tau_C \ge k)$, $k \ge 1$, is a supermartingale and thus
$$E_x[\lambda^{-\tau_C}] \le V(x) + b \lambda^{-1} \mathbb{I}_C(x) .$$
Conversely, if there exists a set C such that $E_x[\lambda^{-\tau_C}] < \infty$ for all $x$ (in a full and absorbing set), then there exists a drift function verifying the Foster-Lyapunov conditions.
[Meyn and Tweedie, 1993]
Markov Chain Monte Carlo Methods
Notions on Markov Chains
Coupling
If the drift condition is imposed directly on the joint transition kernel $\bar{P}$, there exist $\bar{V} \ge 1$, $0 < \bar\lambda < 1$ and a set $\bar{C}$ such that:
$$\bar{P}\bar{V}(x, x') \le \bar\lambda \bar{V}(x, x') \qquad (x, x') \notin \bar{C}$$
When $\bar{P}(x, x'; A \times A') = K(x, A) K(x', A')$, one may consider
$$\bar{V}(x, x') = (1/2) \left\{ V(x) + V(x') \right\}$$
where V is a drift function for P (but not necessarily the best choice).
Markov Chain Monte Carlo Methods
Notions on Markov Chains
Coupling
Explicit bound
Theorem
For any distributions $\xi$ and $\xi'$, and any $j \le k$:
$$\|\xi P^k(\cdot) - \xi' P^k(\cdot)\|_{TV} \le (1 - \epsilon)^j + \bar\lambda^k B^{j-1} E_{\xi, \xi', 0}[\bar{V}(X_0, X_0')]$$
where
$$B = 1 \vee \bar\lambda^{-1} (1 - \epsilon) \sup_{\bar{C}} \bar{R}\bar{V} .$$
[DMR, 2001]
Markov Chain Monte Carlo Methods
Notions on Markov Chains
Renewal and CLT
Renewal and CLT
Given a Markov chain $(X_n)_n$, how good an approximation of
$$I = \int g(x) \pi(x) dx$$
is
$$\bar{g}_n := \frac{1}{n} \sum_{i=0}^{n-1} g(X_i) \ ?$$
Standard MC if CLT
$$\sqrt{n} \left( \bar{g}_n - E_\pi[g(X)] \right) \stackrel{d}{\longrightarrow} \mathcal{N}(0, \gamma_g^2)$$
and there exists an easy-to-compute, consistent estimate of $\gamma_g^2$...
Markov Chain Monte Carlo Methods
Notions on Markov Chains
Renewal and CLT
Minoration
skip construction
Assume that the kernel density K satisfies, for some density $q(\cdot)$, $\epsilon \in (0, 1)$ and a small set $C$,
$$K(y \mid x) \ge \epsilon q(y) \qquad \text{for all } y \text{ and all } x \in C$$
Then split K into a mixture
$$K(y \mid x) = \epsilon q(y) + (1 - \epsilon) R(y \mid x)$$
where R is the residual kernel.
Markov Chain Monte Carlo Methods
Notions on Markov Chains
Renewal and CLT
Split chain
Let $\delta_0, \delta_1, \delta_2, \ldots$ be iid $\mathcal{B}(\epsilon)$. Then the split chain
$$(X_0, \delta_0), (X_1, \delta_1), (X_2, \delta_2), \ldots$$
is such that, when $X_i \in C$, $\delta_i$ determines $X_{i+1}$:
$$X_{i+1} \sim \begin{cases} q(x) & \text{if } \delta_i = 1 , \\ R(x \mid X_i) & \text{otherwise} \end{cases}$$
[Regeneration] When $(X_i, \delta_i) \in C \times \{1\}$, $X_{i+1} \sim q$.
Markov Chain Monte Carlo Methods
Notions on Markov Chains
Renewal and CLT
Renewals
For $X_0 \sim q$ and R successive renewals, define by $\tau_1 < \ldots < \tau_R$ the renewal times.
Then
$$\sqrt{R} \left( \bar{g}_{\tau_R} - E_\pi[g(X)] \right) = \frac{\sqrt{R}}{\bar N} \left( \frac{1}{R} \sum_{t=1}^R \left( S_t - N_t E_\pi[g(X)] \right) \right)$$
where $N_t$ is the length of the t-th tour, $S_t$ the sum of the $g(X_j)$'s over the t-th tour, and $\bar N = \tau_R / R$ the average tour length.
Since the $(N_t, S_t)$ are iid and $E_q[S_t - N_t E_\pi[g(X)]] = 0$, if $N_t$ and $S_t$ have finite 2nd moments,
$$\sqrt{R} \left( \bar{g}_{\tau_R} - E_\pi g \right) \stackrel{d}{\longrightarrow} \mathcal{N}(0, \sigma_g^2)$$
and there is a simple, consistent estimator of $\sigma_g^2$.
[Mykland & al., 1995; Robert, 1995]
Markov Chain Monte Carlo Methods
Notions on Markov Chains
Renewal and CLT
Moment conditions
We need to show that, for the minoration condition, $E_q[N_1^2]$ and $E_q[S_1^2]$ are finite.
If
1. the chain is geometrically ergodic, and
2. $E_\pi[|g|^{2+\alpha}] < \infty$ for some $\alpha > 0$,
then $E_q[N_1^2] < \infty$ and $E_q[S_1^2] < \infty$.
[Hobert & al., 2002]
Note that drift + minoration ensures geometric ergodicity.
[Rosenthal, 1995; Roberts & Tweedie, 1999]
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
The Metropolis-Hastings Algorithm
Motivation and leading example
Random variable generation
Monte Carlo Integration
Notions on Markov Chains
The Metropolis-Hastings Algorithm
Monte Carlo Methods based on Markov Chains
The MetropolisHastings algorithm
A collection of Metropolis-Hastings algorithms
Extensions
The Gibbs Sampler
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
Monte Carlo Methods based on Markov Chains
Running Monte Carlo via Markov Chains
It is not necessary to use a sample from the distribution f to approximate the integral
$$I = \int h(x) f(x) dx ;$$
we can obtain $X_1, \ldots, X_n \sim f$ (approx) without directly simulating from f, using an ergodic Markov chain with stationary distribution f.
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
Monte Carlo Methods based on Markov Chains
Running Monte Carlo via Markov Chains (2)
Idea
For an arbitrary starting value $x^{(0)}$, an ergodic chain $(X^{(t)})$ is generated using a transition kernel with stationary distribution f.
• This insures the convergence in distribution of $(X^{(t)})$ to a random variable from f.
• For a large enough $T_0$, $X^{(T_0)}$ can be considered as distributed from f.
• Produces a dependent sample $X^{(T_0)}, X^{(T_0+1)}, \ldots$, which is generated from f, sufficient for most approximation purposes.
Problem: How can one build a Markov chain with a given stationary distribution?
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
The MetropolisHastings algorithm
The MetropolisHastings algorithm
Basics
The algorithm uses the objective (target) density $f$ and a conditional density $q(y \mid x)$, called the instrumental (or proposal) distribution.
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
The MetropolisHastings algorithm
The MH algorithm
Algorithm (Metropolis–Hastings)
Given $x^{(t)}$,
1. Generate $Y_t \sim q(y \mid x^{(t)})$.
2. Take
$$X^{(t+1)} = \begin{cases} Y_t & \text{with prob. } \rho(x^{(t)}, Y_t) , \\ x^{(t)} & \text{with prob. } 1 - \rho(x^{(t)}, Y_t) , \end{cases}$$
where
$$\rho(x, y) = \min \left\{ \frac{f(y)}{f(x)} \, \frac{q(x \mid y)}{q(y \mid x)} , 1 \right\} .$$
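A minimal generic implementation (added, not part of the slides); the symmetric Gaussian proposal, for which the q-ratio cancels, is an illustrative choice.

    import numpy as np

    def metropolis_hastings(logf, x0, n_iter, rng, scale=1.0):
        """Generic MH with a symmetric Gaussian proposal q(y|x) = N(x, scale^2)."""
        x = np.empty(n_iter); x[0] = x0
        for t in range(1, n_iter):
            y = x[t - 1] + scale * rng.standard_normal()
            log_rho = logf(y) - logf(x[t - 1])   # + log q(x|y) - log q(y|x) = 0
            x[t] = y if np.log(rng.uniform()) < log_rho else x[t - 1]
        return x

    rng = np.random.default_rng(12)
    chain = metropolis_hastings(lambda x: -x**2 / 2, 0.0, 10_000, rng)
    print(chain.mean(), chain.var())   # roughly 0 and 1 for a N(0,1) target

Note that only an unnormalised log-density is needed, in line with the first feature listed below.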
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
The MetropolisHastings algorithm
Features
• Independent of normalizing constants for both f and $q(\cdot \mid x)$ (ie, those constants independent of x)
• Never move to values with $f(y) = 0$
• The chain $(x^{(t)})_t$ may take the same value several times in a row, even though f is a density wrt Lebesgue measure
• The sequence $(y_t)_t$ is usually not a Markov chain
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
The MetropolisHastings algorithm
Convergence properties
1. The M-H Markov chain is reversible, with invariant/stationary density f, since it satisfies the detailed balance condition
$$f(y) K(y, x) = f(x) K(x, y)$$
2. As f is a probability measure, the chain is positive recurrent
3. If
$$\Pr \left[ \frac{f(Y_t) \, q(X^{(t)} \mid Y_t)}{f(X^{(t)}) \, q(Y_t \mid X^{(t)})} \ge 1 \right] < 1 , \qquad (1)$$
that is, the event $\{X^{(t+1)} = X^{(t)}\}$ is possible, then the chain is aperiodic
The Metropolis-Hastings Algorithm
The MetropolisHastings algorithm
Convergence properties (2)
4. If
q(y[x) > 0 for every (x, y), (2)
the chain is irreducible
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
The MetropolisHastings algorithm
Convergence properties (2)
4. If
q(y[x) > 0 for every (x, y), (2)
the chain is irreducible
5. For M-H, f-irreducibility implies Harris recurrence
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
The MetropolisHastings algorithm
Convergence properties (2)
4. If
q(y[x) > 0 for every (x, y), (2)
the chain is irreducible
5. For M-H, f-irreducibility implies Harris recurrence
6. Thus, for M-H satisfying (1) and (2)
(i) For h, with E
f
[h(X)[ < ,
lim
T
1
T
T

t=1
h(X
(t)
) =
_
h(x)df(x) a.e. f.
(ii) and
lim
n
_
_
_
_
_
K
n
(x, )(dx) f
_
_
_
_
TV
= 0
for every initial distribution , where K
n
(x, ) denotes the
kernel for n transitions.
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
A collection of Metropolis-Hastings algorithms
The Independent Case
The instrumental distribution q is independent of $X^{(t)}$, and is denoted g by analogy with Accept-Reject.
Algorithm (Independent Metropolis-Hastings)
Given $x^{(t)}$,
(a) Generate $Y_t \sim g(y)$
(b) Take
$$X^{(t+1)} = \begin{cases} Y_t & \text{with prob. } \min \left\{ \dfrac{f(Y_t) \, g(x^{(t)})}{f(x^{(t)}) \, g(Y_t)} , 1 \right\} , \\ x^{(t)} & \text{otherwise.} \end{cases}$$
The Metropolis-Hastings Algorithm
A collection of Metropolis-Hastings algorithms
Properties
The resulting sample is not iid
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
A collection of Metropolis-Hastings algorithms
Properties
The resulting sample is not iid but there exist strong convergence
properties:
Theorem (Ergodicity)
The algorithm produces a uniformly ergodic chain if there exists a
constant M such that
f(x) Mg(x) , x supp f.
In this case,
|K
n
(x, ) f|
TV

_
1
1
M
_
n
.
[Mengersen & Tweedie, 1996]
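A sketch of the independent sampler under this domination (added, not in the slides); the Beta(2, 4) target with a U(0, 1) instrumental, for which $f \le M g$ with $M = \max f$, is an illustrative assumption.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(13)
    f = stats.beta(2, 4).pdf
    n = 10_000
    x = np.empty(n); x[0] = 0.5
    for t in range(1, n):
        y = rng.uniform()                          # Y_t ~ g = U(0,1)
        rho = min(f(y) / f(x[t - 1]), 1.0)         # g(x)/g(y) = 1
        x[t] = y if rng.uniform() < rho else x[t - 1]
    print(x.mean(), 1 / 3)    # mean of Beta(2,4) is 1/3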
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
A collection of Metropolis-Hastings algorithms
Example (Noisy AR(1))
Hidden Markov chain from a regular AR(1) model,
$$x_{t+1} = \varphi x_t + \epsilon_{t+1} , \qquad \epsilon_t \sim \mathcal{N}(0, \tau^2)$$
and observables
$$y_t \mid x_t \sim \mathcal{N}(x_t^2, \sigma^2)$$
The distribution of $x_t$ given $x_{t-1}$, $x_{t+1}$ and $y_t$ is
$$\exp \frac{-1}{2\tau^2} \left\{ (x_t - \varphi x_{t-1})^2 + (x_{t+1} - \varphi x_t)^2 + \frac{\tau^2}{\sigma^2} (y_t - x_t^2)^2 \right\} .$$
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
A collection of Metropolis-Hastings algorithms
Example (Noisy AR(1) too)
Use for proposal the $\mathcal{N}(\mu_t, \omega_t^2)$ distribution, with
$$\mu_t = \varphi \, \frac{x_{t-1} + x_{t+1}}{1 + \varphi^2} \quad\text{and}\quad \omega_t^2 = \frac{\tau^2}{1 + \varphi^2} .$$
Ratio
$$\pi(x) / q_{\mathrm{ind}}(x) \propto \exp \left\{ -(y_t - x_t^2)^2 / 2\sigma^2 \right\}$$
is bounded
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
A collection of Metropolis-Hastings algorithms
[Figure] (top) Last 500 realisations of the chain $\{X_k\}_k$ out of 10,000 iterations; (bottom) histogram of the chain, compared with the target distribution.
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
A collection of Metropolis-Hastings algorithms
Example (Cauchy by normal)
go random W
Given a Cauchy C(0, 1) target distribution, consider a normal $\mathcal{N}(0, 1)$ proposal.
The Metropolis–Hastings acceptance ratio is
$$\frac{\pi(\theta') / \nu(\theta')}{\pi(\theta) / \nu(\theta)} = \exp \left\{ \left( (\theta')^2 - \theta^2 \right) / 2 \right\} \, \frac{1 + \theta^2}{1 + (\theta')^2} .$$
Poor performances: the proposal distribution has lighter tails than the target Cauchy and convergence to the stationary distribution is not even geometric!
[Mengersen & Tweedie, 1996]
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
A collection of Metropolis-Hastings algorithms
[Figure] Histogram of the Markov chain $(\theta_t)_{1 \le t \le 5000}$ against the target C(0, 1) distribution; range and average of 1000 parallel runs when initialized with a normal $\mathcal{N}(0, 100^2)$ distribution.
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
A collection of Metropolis-Hastings algorithms
Random walk Metropolis–Hastings
Use of a local perturbation as proposal
$$Y_t = X^{(t)} + \epsilon_t ,$$
where $\epsilon_t \sim g$, independent of $X^{(t)}$.
The instrumental density is now of the form $g(y - x)$ and the Markov chain is a random walk if we take g to be symmetric: $g(x) = g(-x)$.
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
A collection of Metropolis-Hastings algorithms
Algorithm (Random walk Metropolis)
Given $x^{(t)}$,
1. Generate $Y_t \sim g(y - x^{(t)})$
2. Take
$$X^{(t+1)} = \begin{cases} Y_t & \text{with prob. } \min \left\{ 1, \dfrac{f(Y_t)}{f(x^{(t)})} \right\} , \\ x^{(t)} & \text{otherwise.} \end{cases}$$
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
A collection of Metropolis-Hastings algorithms
Example (Random walk and normal target)
forget History!
Generate $\mathcal{N}(0, 1)$ based on the uniform proposal $[-\delta, \delta]$
[Hastings (1970)]
The probability of acceptance is then
$$\rho(x^{(t)}, y_t) = \exp \left\{ \left( x^{(t)2} - y_t^2 \right) / 2 \right\} \wedge 1 .$$
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
A collection of Metropolis-Hastings algorithms
Example (Random walk & normal (2))
Sample statistics:

  δ          0.1     0.5     1.0
  mean       0.399  -0.111   0.10
  variance   0.698   1.11    1.06

© As δ increases, we get better histograms and a faster exploration of the support of f.
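A sketch reproducing this experiment (added, not in the slides), with the three scales of the table:

    import numpy as np

    rng = np.random.default_rng(14)

    def rw_chain(delta, n=15_000):
        x = np.empty(n); x[0] = 0.0
        for t in range(1, n):
            y = x[t - 1] + rng.uniform(-delta, delta)
            rho = min(np.exp((x[t - 1]**2 - y**2) / 2), 1.0)
            x[t] = y if rng.uniform() < rho else x[t - 1]
        return x

    for delta in (0.1, 0.5, 1.0):
        x = rw_chain(delta)
        print(delta, x.mean(), x.var())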
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
A collection of Metropolis-Hastings algorithms
[Figure] Three samples based on $\mathcal{U}[-\delta, \delta]$ with (a) $\delta = 0.1$, (b) $\delta = 0.5$ and (c) $\delta = 1.0$, superimposed with the convergence of the means (15,000 simulations).
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
A collection of Metropolis-Hastings algorithms
Example (Mixture models (again!))
$$\pi(\theta \mid x) \propto \prod_{j=1}^n \left\{ \sum_{\ell=1}^k p_\ell f(x_j \mid \theta_\ell) \right\} \pi(\theta)$$
Metropolis-Hastings proposal:
$$\theta^{(t+1)} = \begin{cases} \theta^{(t)} + \omega \epsilon^{(t)} & \text{if } u^{(t)} < \rho^{(t)} , \\ \theta^{(t)} & \text{otherwise} \end{cases}$$
where
$$\rho^{(t)} = \frac{\pi(\theta^{(t)} + \omega \epsilon^{(t)} \mid x)}{\pi(\theta^{(t)} \mid x)} \wedge 1$$
and $\omega$ scaled for good acceptance rate
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
A collection of Metropolis-Hastings algorithms
[Figure] Random walk sampling (50,000 iterations): general case of a 3 component normal mixture.
[Celeux & al., 2000]
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
A collection of Metropolis-Hastings algorithms
[Figure] Random walk MCMC output for $.7\,\mathcal{N}(\mu_1, 1) + .3\,\mathcal{N}(\mu_2, 1)$
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
A collection of Metropolis-Hastings algorithms
Example (probit model)
skip probit
Likelihood of the probit model
$$\prod_{i=1}^n \Phi(y_i^T \beta)^{x_i} \left( 1 - \Phi(y_i^T \beta) \right)^{1 - x_i}$$
Random walk proposal
$$\beta^{(t+1)} = \beta^{(t)} + \tau \epsilon_t , \qquad \epsilon_t \sim \mathcal{N}_p(0, \hat\Sigma)$$
where, for instance, $\hat\Sigma = (Y^T Y)^{-1}$
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
A collection of Metropolis-Hastings algorithms
[Figure] Likelihood surface and random walk Metropolis-Hastings steps
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
A collection of Metropolis-Hastings algorithms
Convergence properties
Uniform ergodicity prohibited by random walk structure.
At best, geometric ergodicity:
Theorem (Sufficient ergodicity)
For a symmetric density f, log-concave in the tails, and a positive and symmetric density g, the chain $(X^{(t)})$ is geometrically ergodic.
[Mengersen & Tweedie, 1996]
no tail effect
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
A collection of Metropolis-Hastings algorithms
Example (Comparison of tail effects)
[Figure] Random-walk Metropolis–Hastings algorithms based on a $\mathcal{N}(0, 1)$ instrumental for the generation of (a) a $\mathcal{N}(0, 1)$ distribution and (b) a distribution with density $\psi(x) \propto (1 + |x|)^{-3}$; 90% confidence envelopes of the means, derived from 500 parallel independent chains.
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
A collection of Metropolis-Hastings algorithms
Example (Cauchy by normal continued)
Again, Cauchy C(0, 1) target and Gaussian random walk proposal, $\theta' \sim \mathcal{N}(\theta, \sigma^2)$, with acceptance probability
$$\frac{1 + \theta^2}{1 + (\theta')^2} \wedge 1 .$$
Overall fit of the Cauchy density by the histogram is satisfactory, but exploration of the tails is poor: the 99% quantile of C(0, 1) equals 3, but no simulation exceeds 14 out of 10,000!
[Roberts & Tweedie, 2004]
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
A collection of Metropolis-Hastings algorithms
Again, lack of geometric ergodicity!
[Mengersen & Tweedie, 1996]
Slow convergence shown by the non-stable range after 10,000 iterations.
[Figure] Histogram of the 10,000 first steps of a random walk Metropolis–Hastings algorithm using a $\mathcal{N}(\theta, 1)$ proposal
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
A collection of Metropolis-Hastings algorithms
[Figure] Range of 500 parallel runs for the same setup
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
A collection of Metropolis-Hastings algorithms
Further convergence properties
Under assumptions
skip detailed convergence
• (A1) f is super-exponential, i.e. it is positive with positive continuous first derivative such that
$$\lim_{|x|\to\infty} n(x)' \nabla \log f(x) = -\infty , \qquad n(x) := x/|x| .$$
In words: exponential decay of f in every direction, with rate tending to $\infty$
• (A2) $\limsup_{|x|\to\infty} n(x)' m(x) < 0$, where $m(x) = \nabla f(x)/|\nabla f(x)|$.
In words: non-degeneracy of the contour manifold $C_{f(x)} = \{y : f(y) = f(x)\}$
Then Q is geometrically ergodic, and $V(x) \propto f(x)^{-1/2}$ verifies the drift condition.
[Jarner & Hansen, 2000]
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
A collection of Metropolis-Hastings algorithms
Further [further] convergence properties
skip hyperdetailed convergence
If P is $\psi$-irreducible and aperiodic, for $r = (r(n))_{n\in\mathbb{N}}$ a real-valued non-decreasing sequence such that, for all $n, m \in \mathbb{N}$,
$$r(n + m) \le r(n) r(m)$$
and $r(0) = 1$, for C a small set, $\tau_C = \inf\{ n \ge 1 , X_n \in C \}$, and $h \ge 1$, assume
$$\sup_{x \in C} E_x \left[ \sum_{k=0}^{\tau_C - 1} r(k) h(X_k) \right] < \infty ,$$
then
$$S(f, C, r) := \left\{ x \in \mathcal{X} , \ E_x \left[ \sum_{k=0}^{\tau_C - 1} r(k) h(X_k) \right] < \infty \right\}$$
is full and absorbing and, for $x \in S(f, C, r)$,
$$\lim_{n\to\infty} r(n) \|P^n(x, \cdot) - f\|_h = 0 .$$
[Tuominen & Tweedie, 1994]
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
A collection of Metropolis-Hastings algorithms
Comments
• [CLT, Rosenthal's inequality...] h-ergodicity implies a CLT for additive (possibly unbounded) functionals of the chain, Rosenthal's inequality and so on...
• [Control of the moments of the return-time] The condition implies (because $h \ge 1$) that
$$\sup_{x \in C} E_x[r_0(\tau_C)] \le \sup_{x \in C} E_x \left[ \sum_{k=0}^{\tau_C - 1} r(k) h(X_k) \right] < \infty ,$$
where $r_0(n) = \sum_{l=0}^n r(l)$. Can be used to derive bounds for the coupling time, an essential step to determine computable bounds, using coupling inequalities.
[Roberts & Tweedie, 1998; Fort & Moulines, 2000]
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
A collection of Metropolis-Hastings algorithms
Alternative conditions
The condition is not really easy to work with...
[Possible alternative conditions]
(a) [Tuominen, Tweedie, 1994] There exists a sequence $(V_n)_{n\in\mathbb{N}}$, $V_n \ge r(n) h$, such that
(i) $\sup_C V_0 < \infty$,
(ii) $\{V_0 = \infty\} = \{V_1 = \infty\}$ and
(iii) $PV_{n+1} \le V_n - r(n) h + b \, r(n) \mathbb{I}_C$.
(b) [Fort 2000] $\exists V \ge f \ge 1$ and $b < \infty$ such that $\sup_C V < \infty$ and
$$PV(x) + E_x \left[ \sum_{k=0}^{\sigma_C} \Delta r(k) f(X_k) \right] \le V(x) + b \mathbb{I}_C(x)$$
where $\sigma_C$ is the hitting time on C and
$$\Delta r(k) = r(k) - r(k - 1) , \ k \ge 1 , \qquad \Delta r(0) = r(0) .$$
Result: (a) $\Leftrightarrow$ (b) $\Leftrightarrow$ $\sup_{x \in C} E_x \left[ \sum_{k=0}^{\tau_C - 1} r(k) f(X_k) \right] < \infty$.
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
Extensions
Extensions
There are many other families of MH algorithms:
• Adaptive Rejection Metropolis Sampling
• Reversible Jump (later!)
• Langevin algorithms
to name just a few...
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
Extensions
Langevin Algorithms
Proposal based on the Langevin diffusion $L_t$, defined by the stochastic differential equation
$$dL_t = dB_t + \frac{1}{2} \nabla \log f(L_t) \, dt ,$$
where $B_t$ is the standard Brownian motion.
Theorem
The Langevin diffusion is the only non-explosive diffusion which is reversible with respect to f.
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
Extensions
Discretization
Instead, consider the sequence
$$x^{(t+1)} = x^{(t)} + \frac{\sigma^2}{2} \nabla \log f(x^{(t)}) + \sigma \epsilon_t , \qquad \epsilon_t \sim \mathcal{N}_p(0, I_p)$$
where $\sigma^2$ corresponds to the discretization step.
Unfortunately, the discretized chain may be transient, for instance when
$$\lim_{x\to\pm\infty} \left| \frac{\sigma^2}{2} \nabla \log f(x) \, |x|^{-1} \right| > 1$$
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
Extensions
MH correction
Accept the new value $Y_t$ with probability
$$\frac{f(Y_t)}{f(x^{(t)})} \cdot \frac{\exp \left\{ - \left\| x^{(t)} - Y_t - \frac{\sigma^2}{2} \nabla \log f(Y_t) \right\|^2 \Big/ 2\sigma^2 \right\}}{\exp \left\{ - \left\| Y_t - x^{(t)} - \frac{\sigma^2}{2} \nabla \log f(x^{(t)}) \right\|^2 \Big/ 2\sigma^2 \right\}} \wedge 1 .$$
Choice of the scaling factor $\sigma$:
should lead to an acceptance rate of 0.574 to achieve optimal convergence rates (when the components of x are uncorrelated).
[Roberts & Rosenthal, 1998]
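A sketch of this Metropolis-adjusted Langevin algorithm (added, not in the slides); the $\mathcal{N}(0,1)$ target, for which $\nabla \log f(x) = -x$, and the scale $\sigma = 1.2$ are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(15)
    logf = lambda x: -x**2 / 2
    grad = lambda x: -x
    sigma, n = 1.2, 10_000

    x = np.empty(n); x[0] = 0.0; acc = 0
    for t in range(1, n):
        xc = x[t - 1]
        y = xc + 0.5 * sigma**2 * grad(xc) + sigma * rng.standard_normal()
        # log densities of the forward and backward Langevin moves
        lq_fwd = -(y - xc - 0.5 * sigma**2 * grad(xc))**2 / (2 * sigma**2)
        lq_bwd = -(xc - y - 0.5 * sigma**2 * grad(y))**2 / (2 * sigma**2)
        if np.log(rng.uniform()) < logf(y) - logf(xc) + lq_bwd - lq_fwd:
            x[t] = y; acc += 1
        else:
            x[t] = xc
    print("acceptance rate:", acc / n)   # tune sigma toward ~0.574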
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
Extensions
Optimizing the Acceptance Rate
Problem of choice of the transition kernel from a practical point of view.
Most common alternatives:
(a) a fully automated algorithm like ARMS;
(b) an instrumental density g which approximates f, such that f/g is bounded, for uniform ergodicity to apply;
(c) a random walk
In cases (b) and (c), the choice of g is critical.
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
Extensions
Case of the independent Metropolis–Hastings algorithm
Choice of g that maximizes the average acceptance rate
$$\rho = E \left[ \min \left\{ \frac{f(Y) \, g(X)}{f(X) \, g(Y)} , 1 \right\} \right] = 2 \Pr \left( \frac{f(Y)}{g(Y)} \ge \frac{f(X)}{g(X)} \right) , \qquad X \sim f , \ Y \sim g ,$$
related to the speed of convergence of
$$\frac{1}{T} \sum_{t=1}^T h(X^{(t)})$$
to $E_f[h(X)]$ and to the ability of the algorithm to explore any complexity of f
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
Extensions
Case of the independent Metropolis–Hastings algorithm (2)
Practical implementation:
Choose a parameterized instrumental distribution $g(\cdot \mid \theta)$ and adjust the corresponding parameters $\theta$ based on the evaluated acceptance rate
$$\hat\rho(\theta) = \frac{2}{m} \sum_{i=1}^m \mathbb{I}_{\{ f(y_i) g(x_i) > f(x_i) g(y_i) \}} ,$$
where $x_1, \ldots, x_m$ is a sample from f and $y_1, \ldots, y_m$ an iid sample from g.
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
Extensions
Example (Inverse Gaussian distribution)
no inverse
Simulation from
$$f(z \mid \theta_1, \theta_2) \propto z^{-3/2} \exp \left\{ -\theta_1 z - \frac{\theta_2}{z} + 2\sqrt{\theta_1 \theta_2} + \log \sqrt{2\theta_2} \right\} \mathbb{I}_{\mathbb{R}_+}(z)$$
based on the Gamma distribution $\mathcal{G}a(\alpha, \beta)$ with $\beta = \sqrt{\theta_2 / \theta_1}$.
Since
$$\frac{f(x)}{g(x)} \propto x^{-\alpha - 1/2} \exp \left\{ (\beta - \theta_1) x - \frac{\theta_2}{x} \right\} ,$$
the maximum is attained at
$$x^*_\beta = \frac{(\alpha + 1/2) - \sqrt{(\alpha + 1/2)^2 + 4\theta_2 (\theta_1 - \beta)}}{2 (\beta - \theta_1)} .$$
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
Extensions
Example (Inverse Gaussian distribution (2))
The analytical optimization (in $\beta$) of
$$M(\beta) = (x^*_\beta)^{-\alpha - 1/2} \exp \left\{ (\beta - \theta_1) x^*_\beta - \frac{\theta_2}{x^*_\beta} \right\}$$
is impossible.

  β         0.2    0.5    0.8    0.9    1      1.1    1.2    1.5
  ρ̂(β)      0.22   0.41   0.54   0.56   0.60   0.63   0.64   0.71
  E[Z]      1.137  1.158  1.164  1.154  1.133  1.148  1.181  1.148
  E[1/Z]    1.116  1.108  1.116  1.115  1.120  1.126  1.095  1.115

($\theta_1 = 1.5$, $\theta_2 = 2$, and $m = 5000$).
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
Extensions
Case of the random walk
Dierent approach to acceptance rates
A high acceptance rate does not indicate that the algorithm is
moving correctly since it indicates that the random walk is moving
too slowly on the surface of f.
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
Extensions
Case of the random walk
Dierent approach to acceptance rates
A high acceptance rate does not indicate that the algorithm is
moving correctly since it indicates that the random walk is moving
too slowly on the surface of f.
If x
(t)
and y
t
are close, i.e. f(x
(t)
) f(y
t
) y is accepted with
probability
min
_
f(y
t
)
f(x
(t)
)
, 1
_
1 .
For multimodal densities with well separated modes, the negative
eect of limited moves on the surface of f clearly shows.
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
Extensions
Case of the random walk (2)
If the average acceptance rate is low, the successive values of f(y
t
)
tend to be small compared with f(x
(t)
), which means that the
random walk moves quickly on the surface of f since it often
reaches the borders of the support of f
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
Extensions
Rule of thumb
In small dimensions, aim at an average acceptance rate of 50%. In
large dimensions, at an average acceptance rate of 25%.
[Gelman,Gilks and Roberts, 1995]
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
Extensions
Rule of thumb
In small dimensions, aim at an average acceptance rate of 50%. In
large dimensions, at an average acceptance rate of 25%.
[Gelman,Gilks and Roberts, 1995]
This rule is to be taken with a pinch of salt!
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
Extensions
Example (Noisy AR(1) continued)
For a Gaussian random walk with scale small enough, the
random walk never jumps to the other mode. But if the scale is
suciently large, the Markov chain explores both modes and give a
satisfactory approximation of the target distribution.
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
Extensions
Markov chain based on a random walk with scale = .1.
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
Extensions
Markov chain based on a random walk with scale = .5.
Markov Chain Monte Carlo Methods
The Gibbs Sampler
The Gibbs Sampler
The Gibbs Sampler
General Principles
Completion
Convergence
The Hammersley-Cliord theorem
Hierarchical models
Data Augmentation
Improper Priors
Markov Chain Monte Carlo Methods
The Gibbs Sampler
General Principles
General Principles
A very specic simulation algorithm based on the target
distribution f:
1. Uses the conditional densities f
1
, . . . , f
p
from f
Markov Chain Monte Carlo Methods
The Gibbs Sampler
General Principles
General Principles
A very specic simulation algorithm based on the target
distribution f:
1. Uses the conditional densities f
1
, . . . , f
p
from f
2. Start with the random variable X = (X
1
, . . . , X
p
)
Markov Chain Monte Carlo Methods
The Gibbs Sampler
General Principles
General Principles
A very specic simulation algorithm based on the target
distribution f:
1. Uses the conditional densities f
1
, . . . , f
p
from f
2. Start with the random variable X = (X
1
, . . . , X
p
)
3. Simulate from the conditional densities,
X
i
[x
1
, x
2
, . . . , x
i1
, x
i+1
, . . . , x
p
f
i
(x
i
[x
1
, x
2
, . . . , x
i1
, x
i+1
, . . . , x
p
)
for i = 1, 2, . . . , p.
Markov Chain Monte Carlo Methods
The Gibbs Sampler
General Principles
Algorithm (Gibbs sampler)
Given x
(t)
= (x
(t)
1
, . . . , x
(t)
p
), generate
1. X
(t+1)
1
f
1
(x
1
[x
(t)
2
, . . . , x
(t)
p
);
2. X
(t+1)
2
f
2
(x
2
[x
(t+1)
1
, x
(t)
3
, . . . , x
(t)
p
),
. . .
p. X
(t+1)
p
f
p
(x
p
[x
(t+1)
1
, . . . , x
(t+1)
p1
)
X
(t+1)
X f
Markov Chain Monte Carlo Methods
The Gibbs Sampler
General Principles
Properties
The full conditionals densities f
1
, . . . , f
p
are the only densities used
for simulation. Thus, even in a high dimensional problem, all of
the simulations may be univariate
Markov Chain Monte Carlo Methods
The Gibbs Sampler
General Principles
Properties
The full conditionals densities f
1
, . . . , f
p
are the only densities used
for simulation. Thus, even in a high dimensional problem, all of
the simulations may be univariate
The Gibbs sampler is not reversible with respect to f. However,
each of its p components is. Besides, it can be turned into a
reversible sampler, either using the Random Scan Gibbs sampler
see section
or running instead the (double) sequence
f
1
f
p1
f
p
f
p1
f
1
Markov Chain Monte Carlo Methods
The Gibbs Sampler
General Principles
Example (Bivariate Gibbs sampler)
(X, Y ) f(x, y)
Generate a sequence of observations by
Set X
0
= x
0
For t = 1, 2, . . . , generate
Y
t
f
Y |X
([x
t1
)
X
t
f
X|Y
([y
t
)
where f
Y |X
and f
X|Y
are the conditional distributions
Markov Chain Monte Carlo Methods
The Gibbs Sampler
General Principles
A Very Simple Example: Independent ^(,
2
)
Observations
When Y
1
, . . . , Y
n
iid
^(y[,
2
) with both and unknown, the
posterior in (,
2
) is conjugate outside a standard familly
Markov Chain Monte Carlo Methods
The Gibbs Sampler
General Principles
A Very Simple Example: Independent ^(,
2
)
Observations
When Y
1
, . . . , Y
n
iid
^(y[,
2
) with both and unknown, the
posterior in (,
2
) is conjugate outside a standard familly
But...
[Y
0:n
,
2
^
_

1
n

n
i=1
Y
i
,

2
n
)

2
[Y
1:n
, 1(
_

n
2
1,
1
2

n
i=1
(Y
i
)
2
_
assuming constant (improper) priors on both and
2

Hence we may use the Gibbs sampler for simulating from the
posterior of (,
2
)
Markov Chain Monte Carlo Methods
The Gibbs Sampler
General Principles
R Gibbs Sampler for Gaussian posterior
n = length(Y);
S = sum(Y);
mu = S/n;
for (i in 1:500)
S2 = sum((Y-mu)^2);
sigma2 = 1/rgamma(1,n/2-1,S2/2);
mu = S/n + sqrt(sigma2/n)*rnorm(1);
Markov Chain Monte Carlo Methods
The Gibbs Sampler
General Principles
Example of results with n = 10 observations from the
^(0, 1) distribution
Number of Iterations 1
Markov Chain Monte Carlo Methods
The Gibbs Sampler
General Principles
Example of results with n = 10 observations from the
^(0, 1) distribution
Number of Iterations 1, 2
Markov Chain Monte Carlo Methods
The Gibbs Sampler
General Principles
Example of results with n = 10 observations from the
^(0, 1) distribution
Number of Iterations 1, 2, 3
Markov Chain Monte Carlo Methods
The Gibbs Sampler
General Principles
Example of results with n = 10 observations from the
^(0, 1) distribution
Number of Iterations 1, 2, 3, 4
Markov Chain Monte Carlo Methods
The Gibbs Sampler
General Principles
Example of results with n = 10 observations from the
^(0, 1) distribution
Number of Iterations 1, 2, 3, 4, 5
Markov Chain Monte Carlo Methods
The Gibbs Sampler
General Principles
Example of results with n = 10 observations from the
^(0, 1) distribution
Number of Iterations 1, 2, 3, 4, 5, 10
Markov Chain Monte Carlo Methods
The Gibbs Sampler
General Principles
Example of results with n = 10 observations from the
^(0, 1) distribution
Number of Iterations 1, 2, 3, 4, 5, 10, 25
Markov Chain Monte Carlo Methods
The Gibbs Sampler
General Principles
Example of results with n = 10 observations from the
^(0, 1) distribution
Number of Iterations 1, 2, 3, 4, 5, 10, 25, 50
Markov Chain Monte Carlo Methods
The Gibbs Sampler
General Principles
Example of results with n = 10 observations from the
^(0, 1) distribution
Number of Iterations 1, 2, 3, 4, 5, 10, 25, 50, 100
Markov Chain Monte Carlo Methods
The Gibbs Sampler
General Principles
Example of results with n = 10 observations from the
^(0, 1) distribution
Number of Iterations 1, 2, 3, 4, 5, 10, 25, 50, 100, 500
Markov Chain Monte Carlo Methods
The Gibbs Sampler
General Principles
Limitations of the Gibbs sampler
Formally, a special case of a sequence of 1-D M-H kernels, all with
acceptance rate uniformly equal to 1.
The Gibbs sampler
1. limits the choice of instrumental distributions
Markov Chain Monte Carlo Methods
The Gibbs Sampler
General Principles
Limitations of the Gibbs sampler
Formally, a special case of a sequence of 1-D M-H kernels, all with
acceptance rate uniformly equal to 1.
The Gibbs sampler
1. limits the choice of instrumental distributions
2. requires some knowledge of f
Markov Chain Monte Carlo Methods
The Gibbs Sampler
General Principles
Limitations of the Gibbs sampler
Formally, a special case of a sequence of 1-D M-H kernels, all with
acceptance rate uniformly equal to 1.
The Gibbs sampler
1. limits the choice of instrumental distributions
2. requires some knowledge of f
3. is, by construction, multidimensional
Markov Chain Monte Carlo Methods
The Gibbs Sampler
General Principles
Limitations of the Gibbs sampler
Formally, a special case of a sequence of 1-D M-H kernels, all with
acceptance rate uniformly equal to 1.
The Gibbs sampler
1. limits the choice of instrumental distributions
2. requires some knowledge of f
3. is, by construction, multidimensional
4. does not apply to problems where the number of parameters
varies as the resulting chain is not irreducible.
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Completion
Latent variables are back
The Gibbs sampler can be generalized in much wider generality
A density g is a completion of f if
_
Z
g(x, z) dz = f(x)
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Completion
Latent variables are back
The Gibbs sampler can be generalized in much wider generality
A density g is a completion of f if
_
Z
g(x, z) dz = f(x)
Note
The variable z may be meaningless for the problem
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Completion
Purpose
g should have full conditionals that are easy to simulate for a
Gibbs sampler to be implemented with g rather than f
For p > 1, write y = (x, z) and denote the conditional densities of
g(y) = g(y
1
, . . . , y
p
) by
Y
1
[y
2
, . . . , y
p
g
1
(y
1
[y
2
, . . . , y
p
),
Y
2
[y
1
, y
3
, . . . , y
p
g
2
(y
2
[y
1
, y
3
, . . . , y
p
),
. . . ,
Y
p
[y
1
, . . . , y
p1
g
p
(y
p
[y
1
, . . . , y
p1
).
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Completion
The move from Y
(t)
to Y
(t+1)
is dened as follows:
Algorithm (Completion Gibbs sampler)
Given (y
(t)
1
, . . . , y
(t)
p
), simulate
1. Y
(t+1)
1
g
1
(y
1
[y
(t)
2
, . . . , y
(t)
p
),
2. Y
(t+1)
2
g
2
(y
2
[y
(t+1)
1
, y
(t)
3
, . . . , y
(t)
p
),
. . .
p. Y
(t+1)
p
g
p
(y
p
[y
(t+1)
1
, . . . , y
(t+1)
p1
).
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Completion
Example (Mixtures all over again)
Hierarchical missing data structure:
If
X
1
, . . . , X
n

k

i=1
p
i
f(x[
i
),
then
X[Z f(x[
Z
), Z p
1
I(z = 1) +. . . +p
k
I(z = k),
Z is the component indicator associated with observation x
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Completion
Example (Mixtures (2))
Conditionally on (Z
1
, . . . , Z
n
) = (z
1
, . . . , z
n
) :
(p
1
, . . . , p
k
,
1
, . . . ,
k
[x
1
, . . . , x
n
, z
1
, . . . , z
n
)
p

1
+n
1
1
1
. . . p

k
+n
k
1
k
(
1
[y
1
+n
1
x
1
,
1
+n
1
) . . . (
k
[y
k
+n
k
x
k
,
k
+n
k
),
with
n
i
=

j
I(z
j
= i) and x
i
=

j; z
j
=i
x
j
/n
i
.
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Completion
Algorithm (Mixture Gibbs sampler)
1. Simulate

i
(
i
[y
i
+n
i
x
i
,
i
+n
i
) (i = 1, . . . , k)
(p
1
, . . . , p
k
) D(
1
+n
1
, . . . ,
k
+n
k
)
2. Simulate (j = 1, . . . , n)
Z
j
[x
j
, p
1
, . . . , p
k
,
1
, . . . ,
k

k

i=1
p
ij
I(z
j
= i)
with (i = 1, . . . , k)
p
ij
p
i
f(x
j
[
i
)
and update n
i
and x
i
(i = 1, . . . , k).
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Completion
-2 0 2 4 6 8
0
5
1
0
1
5


T = 500
-2 0 2 4 6 8
0
5
1
0
1
5


T = 1000
-2 0 2 4 6 8
0
5
1
0
1
5


T = 2000
-2 0 2 4 6 8
0
5
1
0
1
5


T = 3000
-2 0 2 4 6 8
0
5
1
0
1
5


T = 4000
-2 0 2 4 6 8
0
5
1
0
1
5


T = 5000
Estimation of the pluggin density for 3 components and T
iterations for 149 observations of acidity levels in US lakes
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Completion
10 15 20 25 30 35
0
.
0
0
0
.
0
5
0
.
1
0
0
.
1
5
0
.
2
0
0
.
2
5
Galaxy dataset (82 observations) with k = 2 components
average density (yellow), and pluggins:
average (tomato), marginal MAP (green), MAP (marroon)
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Completion
A wee problem
1 0 1 2 3 4

1
0
1
2
3
4

2
Gibbs started at random
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Completion
A wee problem
1 0 1 2 3 4

1
0
1
2
3
4

2
Gibbs started at random
Gibbs stuck at the wrong mode
1 0 1 2 3

1
0
1
2
3

2
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Completion
Random Scan Gibbs sampler
back to basics dont do random
Modication of the above Gibbs sampler where, with probability
1/p, the i-th component is drawn from f
i
(x
i
[X
i
), ie when the
components are chosen at random
Motivation
The Random Scan Gibbs sampler is reversible.
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Completion
Slice sampler as generic Gibbs
If f() can be written as a product
k

i=1
f
i
(),
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Completion
Slice sampler as generic Gibbs
If f() can be written as a product
k

i=1
f
i
(),
it can be completed as
k

i=1
I
0
i
f
i
()
,
leading to the following Gibbs algorithm:
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Completion
Algorithm (Slice sampler)
Simulate
1.
(t+1)
1
U
[0,f
1
(
(t)
)]
;
. . .
k.
(t+1)
k
U
[0,f
k
(
(t)
)]
;
k+1.
(t+1)
U
A
(t+1)
, with
A
(t+1)
= y; f
i
(y)
(t+1)
i
, i = 1, . . . , k.
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Completion
Example of results with a truncated ^(3, 1) distribution
0.0 0.2 0.4 0.6 0.8 1.0
0
.
0
0
0
0
.
0
0
2
0
.
0
0
4
0
.
0
0
6
0
.
0
0
8
0
.
0
1
0
x
y
Number of Iterations 2
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Completion
Example of results with a truncated ^(3, 1) distribution
0.0 0.2 0.4 0.6 0.8 1.0
0
.
0
0
0
0
.
0
0
2
0
.
0
0
4
0
.
0
0
6
0
.
0
0
8
0
.
0
1
0
x
y
Number of Iterations 2, 3
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Completion
Example of results with a truncated ^(3, 1) distribution
0.0 0.2 0.4 0.6 0.8 1.0
0
.
0
0
0
0
.
0
0
2
0
.
0
0
4
0
.
0
0
6
0
.
0
0
8
0
.
0
1
0
x
y
Number of Iterations 2, 3, 4
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Completion
Example of results with a truncated ^(3, 1) distribution
0.0 0.2 0.4 0.6 0.8 1.0
0
.
0
0
0
0
.
0
0
2
0
.
0
0
4
0
.
0
0
6
0
.
0
0
8
0
.
0
1
0
x
y
Number of Iterations 2, 3, 4, 5
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Completion
Example of results with a truncated ^(3, 1) distribution
0.0 0.2 0.4 0.6 0.8 1.0
0
.
0
0
0
0
.
0
0
2
0
.
0
0
4
0
.
0
0
6
0
.
0
0
8
0
.
0
1
0
x
y
Number of Iterations 2, 3, 4, 5, 10
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Completion
Example of results with a truncated ^(3, 1) distribution
0.0 0.2 0.4 0.6 0.8 1.0
0
.
0
0
0
0
.
0
0
2
0
.
0
0
4
0
.
0
0
6
0
.
0
0
8
0
.
0
1
0
x
y
Number of Iterations 2, 3, 4, 5, 10, 50
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Completion
Example of results with a truncated ^(3, 1) distribution
0.0 0.2 0.4 0.6 0.8 1.0
0
.
0
0
0
0
.
0
0
2
0
.
0
0
4
0
.
0
0
6
0
.
0
0
8
0
.
0
1
0
x
y
Number of Iterations 2, 3, 4, 5, 10, 50, 100
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Completion
Good slices
The slice sampler usually enjoys good theoretical properties (like
geometric ergodicity and even uniform ergodicity under bounded f
and bounded X).
As k increases, the determination of the set A
(t+1)
may get
increasingly complex.
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Completion
Example (Stochastic volatility core distribution)
Dicult part of the stochastic volatility model
(x) exp
_

2
(x )
2
+
2
exp(x)y
2
+x
_
/2 ,
simplied in exp
_
x
2
+exp(x)
_
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Completion
Example (Stochastic volatility core distribution)
Dicult part of the stochastic volatility model
(x) exp
_

2
(x )
2
+
2
exp(x)y
2
+x
_
/2 ,
simplied in exp
_
x
2
+exp(x)
_
Slice sampling means simulation from a uniform distribution on
A =
_
x; exp
_
x
2
+exp(x)
_
/2 u
_
=
_
x; x
2
+exp(x)
_
if we set = 2 log u.
Note Inversion of x
2
+exp(x) = needs to be done by
trial-and-error.
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Completion
0 10 20 30 40 50 60 70 80 90 100
0.1
0.05
0
0.05
0.1
Lag
C
o
r
r
e
l
a
t
i
o
n
1 0.5 0 0.5 1 1.5 2 2.5 3 3.5
0
0.2
0.4
0.6
0.8
1
D
e
n
s
i
t
y
Histogram of a Markov chain produced by a slice sampler
and target distribution in overlay.
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Convergence
Properties of the Gibbs sampler
Theorem (Convergence)
For
(Y
1
, Y
2
, , Y
p
) g(y
1
, . . . , y
p
),
if either
[Positivity condition]
(i) g
(i)
(y
i
) > 0 for every i = 1, , p, implies that
g(y
1
, . . . , y
p
) > 0, where g
(i)
denotes the marginal distribution
of Y
i
, or
(ii) the transition kernel is absolutely continuous with respect to g,
then the chain is irreducible and positive Harris recurrent.
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Convergence
Properties of the Gibbs sampler (2)
Consequences
(i) If
_
h(y)g(y)dy < , then
lim
nT
1
T
T

t=1
h
1
(Y
(t)
) =
_
h(y)g(y)dy a.e. g.
(ii) If, in addition, (Y
(t)
) is aperiodic, then
lim
n
_
_
_
_
_
K
n
(y, )(dx) f
_
_
_
_
TV
= 0
for every initial distribution .
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Convergence
Slice sampler
fast on that slice
For convergence, the properties of X
t
and of f(X
t
) are identical
Theorem (Uniform ergodicity)
If f is bounded and suppf is bounded, the simple slice sampler is
uniformly ergodic.
[Mira & Tierney, 1997]
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Convergence
A small set for a slice sampler
no slice detail
For

>

,
C = x A;

< f(x) <

is a small set:
Pr(x, )

()
where
(A) =
1

0
(A L())
(L())
d
if L() = x A; f(x) >
[Roberts & Rosenthal, 1998]
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Convergence
Slice sampler: drift
Under dierentiability and monotonicity conditions, the slice
sampler also veries a drift condition with V (x) = f(x)

, is
geometrically ergodic, and there even exist explicit bounds on the
total variation distance
[Roberts & Rosenthal, 1998]
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Convergence
Slice sampler: drift
Under dierentiability and monotonicity conditions, the slice
sampler also veries a drift condition with V (x) = f(x)

, is
geometrically ergodic, and there even exist explicit bounds on the
total variation distance
[Roberts & Rosenthal, 1998]
Example (Exponential cxp(1))
For n > 23,
[[K
n
(x, ) f()[[
TV
.054865 (0.985015)
n
(n 15.7043)
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Convergence
Slice sampler: convergence
no more slice detail
Theorem
For any density such that

(x A; f(x) > ) is non-increasing


then
[[K
523
(x, ) f()[[
TV
.0095
[Roberts & Rosenthal, 1998]
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Convergence
A poor slice sampler
Example
Consider
f(x) = exp[[x[[ x R
d
Slice sampler equivalent to
one-dimensional slice sampler on
(z) = z
d1
e
z
z > 0
or on
(u) = e
u
1/d
u > 0
Poor performances when d large
(heavy tails)
0 200 400 600 800 1000
-
2
-
1
0
1
1 dimensional run
c
o
r
r
e
la
tio
n
0 10 20 30 40
0
.0
0
.2
0
.4
0
.6
0
.8
1
.0
1 dimensional acf
0 200 400 600 800 1000
1
0
1
5
2
0
2
5
3
0
10 dimensional run
c
o
r
r
e
la
tio
n
0 10 20 30 40
0
.0
0
.2
0
.4
0
.6
0
.8
1
.0
10 dimensional acf
0 200 400 600 800 1000
0
2
0
4
0
6
0
20 dimensional run
c
o
r
r
e
la
tio
n
0 10 20 30 40
0
.0
0
.2
0
.4
0
.6
0
.8
1
.0
20 dimensional acf
0 200 400 600 800 1000
0
1
0
0
2
0
0
3
0
0
4
0
0
100 dimensional run
c
o
r
r
e
la
tio
n
0 10 20 30 40
0
.0
0
.2
0
.4
0
.6
0
.8
1
.0
100 dimensional acf
Sample runs of log(u) and
ACFs for log(u) (Roberts
& Rosenthal, 1999)
Markov Chain Monte Carlo Methods
The Gibbs Sampler
The Hammersley-Cliord theorem
Hammersley-Cliord theorem
An illustration that conditionals determine the joint distribution
Theorem
If the joint density g(y
1
, y
2
) have conditional distributions
g
1
(y
1
[y
2
) and g
2
(y
2
[y
1
), then
g(y
1
, y
2
) =
g
2
(y
2
[y
1
)
_
g
2
(v[y
1
)/g
1
(y
1
[v) dv
.
[Hammersley & Cliord, circa 1970]
Markov Chain Monte Carlo Methods
The Gibbs Sampler
The Hammersley-Cliord theorem
General HC decomposition
Under the positivity condition, the joint distribution g satises
g(y
1
, . . . , y
p
)
p

j=1
g

j
(y

j
[y

1
, . . . , y

j1
, y

j+1
, . . . , y

p
)
g

j
(y

j
[y

1
, . . . , y

j1
, y

j+1
, . . . , y

p
)
for every permutation on 1, 2, . . . , p and every y

Y .
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Hierarchical models
Hierarchical models
no hierarchy
The Gibbs sampler is particularly well suited to hierarchical models
Example (Animal epidemiology)
Counts of the number of cases of clinical mastitis in 127 dairy
cattle herds over a one year period
Number of cases in herd i
X
i
P(
i
) i = 1, , m
where
i
is the underlying rate of infection in herd i
Lack of independence might manifest itself as overdispersion.
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Hierarchical models
Example (Animal epidemiology (2))
Modied model
X
i
P(
i
)

i
Ga(,
i
)

i
IG(a, b),
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Hierarchical models
Example (Animal epidemiology (2))
Modied model
X
i
P(
i
)

i
Ga(,
i
)

i
IG(a, b),
The Gibbs sampler corresponds to conditionals

i
(
i
[x, ,
i
) = Ga(x
i
+, [1 + 1/
i
]
1
)

i
(
i
[x, , a, b,
i
) = IG( +a, [
i
+ 1/b]
1
)
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Hierarchical models
if you hate rats
Example (Rats)
Experiment where rats are intoxicated by a substance, then treated
by either a placebo or a drug:
x
ij
^(
i
,
2
c
), 1 j J
c
i
, control
y
ij
^(
i
+
i
,
2
a
), 1 j J
a
i
, intoxication
z
ij
^(
i
+
i
+
i
,
2
t
), 1 j J
t
i
, treatment
Additional variable w
i
, equal to 1 if the rat is treated with the
drug, and 0 otherwise.
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Hierarchical models
Example (Rats (2))
Prior distributions (1 i I),

i
^(

,
2

),
i
^(

,
2

),
and

i
^(
P
,
2
P
) or
i
^(
D
,
2
D
),
if ith rat treated with a placebo (P) or a drug (D)
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Hierarchical models
Example (Rats (2))
Prior distributions (1 i I),

i
^(

,
2

),
i
^(

,
2

),
and

i
^(
P
,
2
P
) or
i
^(
D
,
2
D
),
if ith rat treated with a placebo (P) or a drug (D)
Hyperparameters of the model,

,
P
,
D
,
c
,
a
,
t
,

,
P
,
D
,
associated with Jereys noninformative priors.
Alternative prior with two possible levels of intoxication

i
p^(
1
,
2
1
) + (1 p)^(
2
,
2
2
),
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Hierarchical models
Conditional decompositions
Easy decomposition of the posterior distribution
For instance, if
[
1

1
([
1
),
1

2
(
1
),
then
([x) =
_

1
([
1
, x)(
1
[x) d
1
,
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Hierarchical models
Conditional decompositions (2)
where
([
1
, x) =
f(x[)
1
([
1
)
m
1
(x[
1
)
,
m
1
(x[
1
) =
_

f(x[)
1
([
1
) d,
(
1
[x) =
m
1
(x[
1
)
2
(
1
)
m(x)
,
m(x) =
_

1
m
1
(x[
1
)
2
(
1
) d
1
.
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Hierarchical models
Conditional decompositions (3)
Moreover, this decomposition works for the posterior moments,
that is, for every function h,
E

[h()[x] = E
(
1
|x)
[E

1
[h()[
1
, x]] ,
where
E

1
[h()[
1
, x] =
_

h()([
1
, x) d.
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Hierarchical models
Example (Rats inc., continued
if you still hate rats
)
Posterior complete distribution given by
((
i
,
i
,
i
)
i
,

, . . . ,
c
, . . . [D)
I

i=1
_
exp (
i

)
2
/2
2

+ (
i

)
2
/2
2

J
c
i

j=1
exp (x
ij

i
)
2
/2
2
c

J
a
i

j=1
exp (y
ij

i

i
)
2
/2
2
a

J
t
i

j=1
exp (z
ij

i

i
)
2
/2
2
t

i
=0
exp (
i

P
)
2
/2
2
P

i
=1
exp (
i

D
)
2
/2
2
D

P
i
J
c
i
1
c

P
i
J
a
i
1
a

P
i
J
t
i
1
t
(

)
I1

I
D
1
D

I
P
1
P
,
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Hierarchical models
Local conditioning property
For the hierarchical model
() =
_

1
...
n

1
([
1
)
2
(
1
[
2
)
n+1
(
n
) d
1
d
n+1
.
we have
(
i
[x, ,
1
, . . . ,
n
) = (
i
[
i1
,
i+1
)
with the convention
0
= and
n+1
= 0.
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Hierarchical models
Example (Rats inc., terminated
still this zemmiphobia?!
)
The full conditional distributions correspond to standard
distributions and Gibbs sampling applies.


0 2000 4000 6000 8000 10000
1
.
6
0
1
.
7
0
1
.
8
0
1
.
9
0




0 2000 4000 6000 8000 10000
-
2
.
9
0
-
2
.
8
0
-
2
.
7
0
-
2
.
6
0




0 2000 4000 6000 8000 10000
0
.
4
0
0
.
5
0
0
.
6
0
0
.
7
0




0 2000 4000 6000 8000 10000
1
.
7
1
.
8
1
.
9
2
.
0
2
.
1


Convergence of the posterior means
1.0 1.2 1.4 1.6 1.8 2.0 2.2 2.4
0
2
0
4
0
6
0
8
0
1
0
0
1
2
0
control

-4.0 -3.5 -3.0 -2.5 -2.0 -1.5
0
2
0
4
0
6
0
8
0
1
0
0
1
4
0
intoxication

-1 0 1 2
0
5
0
1
0
0
1
5
0
2
0
0
2
5
0
3
0
0
placebo

0 1 2 3
0
5
0
1
0
0
1
5
0
2
0
0
2
5
0
drug

Posteriors of the eects
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Hierarchical models
Posterior Gibbs inference


D

P

D

P
Probability 1.00 0.9998 0.94 0.985
Condence [-3.48,-2.17] [0.94,2.50] [-0.17,1.24] [0.14,2.20]
Posterior probabilities of signicant eects
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Data Augmentation
Data Augmentation
The Gibbs sampler with only two steps is particularly useful
Algorithm (Data Augmentation)
Given y
(t)
,
1.. Simulate Y
(t+1)
1
g
1
(y
1
[y
(t)
2
) ;
2.. Simulate Y
(t+1)
2
g
2
(y
2
[y
(t+1)
1
) .
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Data Augmentation
Data Augmentation
The Gibbs sampler with only two steps is particularly useful
Algorithm (Data Augmentation)
Given y
(t)
,
1.. Simulate Y
(t+1)
1
g
1
(y
1
[y
(t)
2
) ;
2.. Simulate Y
(t+1)
2
g
2
(y
2
[y
(t+1)
1
) .
Theorem (Markov property)
Both (Y
(t)
1
) and (Y
(t)
2
) are Markov chains, with transitions
K
i
(x, x

) =
_
g
i
(y[x)g
3i
(x

[y) dy,
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Data Augmentation
Example (Grouped counting data)
360 consecutive records of the number of passages per unit time
Number of
passages 0 1 2 3 4 or more
Number of
observations 139 128 55 25 13
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Data Augmentation
Example (Grouped counting data (2))
Feature Observations with 4 passages and more are grouped
If observations are Poisson P(), the likelihood is
([x
1
, . . . , x
5
)
e
347

128+552+253
_
1 e

i=0

i
i!
_
13
,
which can be dicult to work with.
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Data Augmentation
Example (Grouped counting data (2))
Feature Observations with 4 passages and more are grouped
If observations are Poisson P(), the likelihood is
([x
1
, . . . , x
5
)
e
347

128+552+253
_
1 e

i=0

i
i!
_
13
,
which can be dicult to work with.
Idea With a prior () = 1/, complete the vector (y
1
, . . . , y
13
) of
the 13 units larger than 4
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Data Augmentation
Algorithm (Poisson-Gamma Gibbs)
a Simulate Y
(t)
i
P(
(t1)
) I
y4
i = 1, . . . , 13
b Simulate

(t)
(a
_
313 +
13

i=1
y
(t)
i
, 360
_
.
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Data Augmentation
Algorithm (Poisson-Gamma Gibbs)
a Simulate Y
(t)
i
P(
(t1)
) I
y4
i = 1, . . . , 13
b Simulate

(t)
(a
_
313 +
13

i=1
y
(t)
i
, 360
_
.
The Bayes estimator

=
1
360T
T

t=1
_
313 +
13

i=1
y
(t)
i
_
converges quite rapidly
to R& B


0 100 200 300 400 500
1
.
0
2
1
1
.
0
2
2
1
.
0
2
3
1
.
0
2
4
1
.
0
2
5
0.9 1.0 1.1 1.2
0
1
0
2
0
3
0
4
0
lambda
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Data Augmentation
Rao-Blackwellization
If (y
1
, y
2
, . . . , y
p
)
(t)
, t = 1, 2, . . . T is the output from a Gibbs
sampler

0
=
1
T
T

t=1
h
_
y
(t)
1
_

_
h(y
1
)g(y
1
)dy
1
and is unbiased.
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Data Augmentation
Rao-Blackwellization
If (y
1
, y
2
, . . . , y
p
)
(t)
, t = 1, 2, . . . T is the output from a Gibbs
sampler

0
=
1
T
T

t=1
h
_
y
(t)
1
_

_
h(y
1
)g(y
1
)dy
1
and is unbiased.
The Rao-Blackwellization replaces
0
with its conditional
expectation

rb
=
1
T
T

t=1
E
_
h(Y
1
)[y
(t)
2
, . . . , y
(t)
p
_
.
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Data Augmentation
Rao-Blackwellization (2)
Then
Both estimators converge to E[h(Y
1
)]
Both are unbiased,
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Data Augmentation
Rao-Blackwellization (2)
Then
Both estimators converge to E[h(Y
1
)]
Both are unbiased,
and
var
_
E
_
h(Y
1
)[Y
(t)
2
, . . . , Y
(t)
p
__
var(h(Y
1
)),
so
rb
is uniformly better (for Data Augmentation)
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Data Augmentation
Examples of Rao-Blackwellization
Example
Bivariate normal Gibbs sampler
X [ y N (y, 1
2
)
Y [ x N (x, 1
2
).
Then

0
=
1
T
T

i=1
X
(i)
and
1
=
1
T
T

i=1
E[X
(i)
[Y
(i)
] =
1
T
T

i=1
Y
(i)
,
estimate E[X] and
2

0
/
2

1
=
1

2
> 1.
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Data Augmentation
Examples of Rao-Blackwellization (2)
Example (Poisson-Gamma Gibbs contd)
Nave estimate

0
=
1
T
T

t=1

(t)
and Rao-Blackwellized version

=
1
T
T

t=1
E[
(t)
[x
1
, x
2
, . . . , x
5
, y
(i)
1
, y
(i)
2
, . . . , y
(i)
13
]
=
1
360T
T

t=1
_
313 +
13

i=1
y
(t)
i
_
,
back to graph
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Data Augmentation
NP Rao-Blackwellization & Rao-Blackwellized NP
Another substantial benet of Rao-Blackwellization is in the
approximation of densities of dierent components of y without
nonparametric density estimation methods.
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Data Augmentation
NP Rao-Blackwellization & Rao-Blackwellized NP
Another substantial benet of Rao-Blackwellization is in the
approximation of densities of dierent components of y without
nonparametric density estimation methods.
Lemma
The estimator
1
T
T

t=1
g
i
(y
i
[y
(t)
j
, j ,= i) g
i
(y
i
),
is unbiased.
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Data Augmentation
The Duality Principle
skip dual part
Ties together the properties of the two Markov chains in Data
Augmentation
Consider a Markov chain (X
(t)
) and a sequence (Y
(t)
) of random
variables generated from the conditional distributions
X
(t)
[y
(t)
(x[y
(t)
)
Y
(t+1)
[x
(t)
, y
(t)
f(y[x
(t)
, y
(t)
) .
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Data Augmentation
The Duality Principle
skip dual part
Ties together the properties of the two Markov chains in Data
Augmentation
Consider a Markov chain (X
(t)
) and a sequence (Y
(t)
) of random
variables generated from the conditional distributions
X
(t)
[y
(t)
(x[y
(t)
)
Y
(t+1)
[x
(t)
, y
(t)
f(y[x
(t)
, y
(t)
) .
Theorem (Duality properties)
If the chain (Y
(t)
) is ergodic then so is (X
(t)
) and the duality also
holds for geometric or uniform ergodicity.
Note
The chain (Y
(t)
) can be discrete, and the chain (X
(t)
) continuous.
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Improper Priors
Improper Priors
Unsuspected danger resulting from careless use of MCMC
algorithms:
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Improper Priors
Improper Priors
Unsuspected danger resulting from careless use of MCMC
algorithms:
It may happen that
all conditional distributions are well dened,
all conditional distributions may be simulated from, but...
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Improper Priors
Improper Priors
Unsuspected danger resulting from careless use of MCMC
algorithms:
It may happen that
all conditional distributions are well dened,
all conditional distributions may be simulated from, but...
the system of conditional distributions may not correspond to
any joint distribution
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Improper Priors
Improper Priors
Unsuspected danger resulting from careless use of MCMC
algorithms:
It may happen that
all conditional distributions are well dened,
all conditional distributions may be simulated from, but...
the system of conditional distributions may not correspond to
any joint distribution
Warning The problem is due to careless use of the Gibbs sampler
in a situation for which the underlying assumptions are violated
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Improper Priors
Example (Conditional exponential distributions)
For the model
X
1
[x
2
E xp(x
2
) , X
2
[x
1
E xp(x
1
)
the only candidate f(x
1
, x
2
) for the joint density is
f(x
1
, x
2
) exp(x
1
x
2
),
but
_
f(x
1
, x
2
)dx
1
dx
2
=
c _ These conditionals do not correspond to a joint
probability distribution
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Improper Priors
Example (Improper random eects)
Consider
Y
ij
= +
i
+
ij
, i = 1, . . . , I, j = 1, . . . , J,
where

i
N (0,
2
) and
ij
N (0,
2
),
the Jereys (improper) prior for the parameters , and is
(,
2
,
2
) =
1

2
.
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Improper Priors
Example (Improper random eects 2)
The conditional distributions

i
[y, ,
2
,
2
N
_
J( y
i
)
J +
2

2
, (J
2
+
2
)
1
_
,
[, y,
2
,
2
N ( y ,
2
/JI) ,

2
[, , y,
2
1(
_
I/2, (1/2)

2
i
_
,

2
[, , y,
2
1(
_
_
IJ/2, (1/2)

i,j
(y
ij

i
)
2
_
_
,
are well-dened and a Gibbs sampler can be easily implemented in
this setting.
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Improper Priors
-4 -3 -2 -1 0
0
5
1
0
1
5
2
0
2
5
3
0
(1000 iterations)
f
r
e
q
.

-
8
-
6
-
4
-
2
0
o
b
s
e
r
v
a
t
i
o
n
s
Example (Improper random
eects 2)
The gure shows the sequence of

(t)
s and its histogram over
1, 000 iterations. They both fail
to indicate that the
corresponding joint distribution
does not exist
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Improper Priors
Final notes on impropriety
The improper posterior Markov chain
cannot be positive recurrent
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Improper Priors
Final notes on impropriety
The improper posterior Markov chain
cannot be positive recurrent
The major task in such settings is to nd indicators that ag that
something is wrong. However, the output of an improper Gibbs
sampler may not dier from a positive recurrent Markov chain.
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Improper Priors
Final notes on impropriety
The improper posterior Markov chain
cannot be positive recurrent
The major task in such settings is to nd indicators that ag that
something is wrong. However, the output of an improper Gibbs
sampler may not dier from a positive recurrent Markov chain.
Example
The random eects model was initially treated in Gelfand et al.
(1990) as a legitimate model
Markov Chain Monte Carlo Methods
MCMC tools for variable dimension problems
MCMC tools for variable dimension problems
MCMC tools for variable dimension problems
Introduction
Greens method
Birth and Death processes
Markov Chain Monte Carlo Methods
MCMC tools for variable dimension problems
Introduction
A new brand of problems
There exist setups where
One of the things we do not know is the number
of things we do not know
[Peter Green]
Markov Chain Monte Carlo Methods
MCMC tools for variable dimension problems
Introduction
Bayesian Model Choice
Typical in model choice settings
- model construction (nonparametrics)
- model checking (goodness of t)
- model improvement (expansion)
- model prunning (contraction)
- model comparison
- hypothesis testing (Science)
- prediction (nance)
Markov Chain Monte Carlo Methods
MCMC tools for variable dimension problems
Introduction
Bayesian Model Choice II
Many areas of application

variable selection

change point(s) determination

image analysis

graphical models and expert systems

variable dimension models

causal inference
Markov Chain Monte Carlo Methods
MCMC tools for variable dimension problems
Introduction
Example (Mixture again, yes!)
Benchmark dataset: Speed of galaxies
[Roeder, 1990; Richardson & Green, 1997]
1.0 1.5 2.0 2.5 3.0 3.5
0
.
0
0
.
5
1
.
0
1
.
5
2
.
0
speeds
Markov Chain Monte Carlo Methods
MCMC tools for variable dimension problems
Introduction
Example (Mixture again (2))
Modelling by a mixture model
M
i
: x
j

i

=1
p
i
^(
i
,
2
i
) (j = 1, . . . , 82)
i?
Markov Chain Monte Carlo Methods
MCMC tools for variable dimension problems
Introduction
Bayesian variable dimension model
Denition
A variable dimension model is dened as a collection of models
(k = 1. . . . , K),
M
k
= f([
k
);
k

k
,
associated with a collection of priors on the parameters of these
models,

k
(
k
) ,
and a prior distribution on the indices of these models,
(k) , k = 1, . . . , K .
Alternative notation:
(M
k
,
k
) = (k)
k
(
k
)
Markov Chain Monte Carlo Methods
MCMC tools for variable dimension problems
Introduction
Bayesian solution
Formally over:
1. Compute
p(M
i
[x) =
p
i
_

i
f
i
(x[
i
)
i
(
i
)d
i

j
p
j
_

j
f
j
(x[
j
)
j
(
j
)d
j
2. Take largest p(M
i
[x) to determine model, or use

j
p
j
_

j
f
j
(x[
j
)
j
(
j
)d
j
as predictive
[Dierent decision theoretic perspectives]
Markov Chain Monte Carlo Methods
MCMC tools for variable dimension problems
Introduction
Diculties
Not at

(formal) inference level


[see above]

parameter space representation


=

k
,
[even if there are parameters common to several models]
Markov Chain Monte Carlo Methods
MCMC tools for variable dimension problems
Introduction
Diculties
Not at

(formal) inference level


[see above]

parameter space representation


=

k
,
[even if there are parameters common to several models]
Rather at

(practical) inference level:


model separation, interpretation, overtting, prior modelling,
prior coherence

computational level:
innity of models, moves between models, predictive
computation
Markov Chain Monte Carlo Methods
MCMC tools for variable dimension problems
Greens method
Greens resolution
Setting up a proper measuretheoretic framework for designing
moves between models M
k
[Green, 1995]
Markov Chain Monte Carlo Methods
MCMC tools for variable dimension problems
Greens method
Greens resolution
Setting up a proper measuretheoretic framework for designing
moves between models M
k
[Green, 1995]
Create a reversible kernel K on H =

k
k
k
such that
_
A
_
B
K(x, dy)(x)dx =
_
B
_
A
K(y, dx)(y)dy
for the invariant density [x is of the form (k,
(k)
)]
Markov Chain Monte Carlo Methods
MCMC tools for variable dimension problems
Greens method
Greens resolution (2)
Write K as
K(x, B) =

m=1
_

m
(x, y)q
m
(x, dy) +(x)I
B
(x)
where q
m
(x, dy) is a transition measure to model M
m
and

m
(x, y) the corresponding acceptance probability.
Markov Chain Monte Carlo Methods
MCMC tools for variable dimension problems
Greens method
Greens resolution (2)
Write K as
K(x, B) =

m=1
_

m
(x, y)q
m
(x, dy) +(x)I
B
(x)
where q
m
(x, dy) is a transition measure to model M
m
and

m
(x, y) the corresponding acceptance probability.
Introduce a symmetric measure
m
(dx, dy) on H
2
and impose on
(dx)q
m
(x, dy) to be absolutely continuous wrt
m
,
(dx)q
m
(x, dy)

m
(dx, dy)
= g
m
(x, y)
Markov Chain Monte Carlo Methods
MCMC tools for variable dimension problems
Greens method
Greens resolution (2)
Write K as
K(x, B) =

m=1
_

m
(x, y)q
m
(x, dy) +(x)I
B
(x)
where q
m
(x, dy) is a transition measure to model M
m
and

m
(x, y) the corresponding acceptance probability.
Introduce a symmetric measure
m
(dx, dy) on H
2
and impose on
(dx)q
m
(x, dy) to be absolutely continuous wrt
m
,
(dx)q
m
(x, dy)

m
(dx, dy)
= g
m
(x, y)
Then

m
(x, y) = min
_
1,
g
m
(y, x)
g
m
(x, y)
_
ensures reversibility
Markov Chain Monte Carlo Methods
MCMC tools for variable dimension problems
Greens method
Special case
When contemplating a move between two models, M
1
and M
2
,
the Markov chain being in state
1
M
1
, denote by K
12
(
1
, d)
and K
21
(
2
, d) the corresponding kernels, under the detailed
balance condition
(d
1
) K
12
(
1
, d) = (d
2
) K
21
(
2
, d) ,
and take, wlog, dim(M
2
) > dim(M
1
).
Markov Chain Monte Carlo Methods
MCMC tools for variable dimension problems
Greens method
Special case
When contemplating a move between two models, M
1
and M
2
,
the Markov chain being in state
1
M
1
, denote by K
12
(
1
, d)
and K
21
(
2
, d) the corresponding kernels, under the detailed
balance condition
(d
1
) K
12
(
1
, d) = (d
2
) K
21
(
2
, d) ,
and take, wlog, dim(M
2
) > dim(M
1
).
Proposal expressed as

2
=
12
(
1
, v
12
)
where v
12
is a random variable of dimension
dim(M
2
) dim(M
1
), generated as
v
12

12
(v
12
) .
Markov Chain Monte Carlo Methods
MCMC tools for variable dimension problems
Greens method
Special case (2)
In this case, q
12
(
1
, d
2
) has density

12
(v
12
)

12
(
1
, v
12
)
(
1
, v
12
)

1
,
by the Jacobian rule.
If probability
12
of choosing move to M
2
while in M
1
,
acceptance probability reduces to
(
1
, v
12
) = 1
(M
2
,
2
)
21
(M
1
,
1
)
12

12
(v
12
)

12
(
1
, v
12
)
(
1
, v
12
)

.
Markov Chain Monte Carlo Methods
MCMC tools for variable dimension problems
Greens method
Interpretation (1)
The representation puts us back in a xed dimension setting:

M
1
V
12
and M
2
are in one-to-one relation
Markov Chain Monte Carlo Methods
MCMC tools for variable dimension problems
Greens method
Interpretation (1)
The representation puts us back in a xed dimension setting:

M
1
V
12
and M
2
are in one-to-one relation

regular MetropolisHastings move from the couple (


1
, v
12
)
to
2
when stationary distributions are
(M
1
,
1
)
12
(v
12
)
and (M
2
,
2
), and when proposal distribution is deterministic
(??)
Markov Chain Monte Carlo Methods
MCMC tools for variable dimension problems
Greens method
Interpretation (2)
Consider, instead, the proposals

2
^(
12
(
1
, v
12
), ) and
12
(
1
, v
12
) ^(
2
, )
Markov Chain Monte Carlo Methods
MCMC tools for variable dimension problems
Greens method
Interpretation (2)
Consider, instead, the proposals

2
^(
12
(
1
, v
12
), ) and
12
(
1
, v
12
) ^(
2
, )
Reciprocal proposal has density
exp
_
(
2

12
(
1
, v
12
))
2
/2
_

12
(
1
, v
12
)
(
1
, v
12
)

by the Jacobian rule.


Markov Chain Monte Carlo Methods
MCMC tools for variable dimension problems
Greens method
Interpretation (2)
Consider, instead, the proposals

2
^(
12
(
1
, v
12
), ) and
12
(
1
, v
12
) ^(
2
, )
Reciprocal proposal has density
exp
_
(
2

12
(
1
, v
12
))
2
/2
_

12
(
1
, v
12
)
(
1
, v
12
)

by the Jacobian rule.


Thus MetropolisHastings acceptance probability is
1
(M
2
,
2
)
(M
1
,
1
)
12
(v
12
)

12
(
1
, v
12
)
(
1
, v
12
)

Does not depend on : Let go to 0


Markov Chain Monte Carlo Methods
MCMC tools for variable dimension problems
Greens method
Saturation
[Brooks, Giudici, Roberts, 2003]
Consider series of models M
i
(i = 1, . . . , k) such that
max
i
dim(M
i
) = n
max
<
Parameter of model M
i
then completed with an auxiliary variable
U
i
such that
dim(
i
, u
i
) = n
max
and U
i
q
i
(u
i
)
Posit the following joint distribution for [augmented] model M
i
(M
i
,
i
) q
i
(u
i
)
Markov Chain Monte Carlo Methods
MCMC tools for variable dimension problems
Greens method
Back to xed dimension
Saturation: no varying dimension anymore since (
i
, u
i
) of xed
dimension.
Markov Chain Monte Carlo Methods
MCMC tools for variable dimension problems
Greens method
Back to xed dimension
Saturation: no varying dimension anymore since (
i
, u
i
) of xed
dimension.
Algorithm (Three stage MCMC update)
1. Update the current value of the parameter,
i
;
2. Update u
i
conditional on
i
;
3. Update the current model from M
i
to M
j
using the bijection
(
j
, u
j
) =
ij
(
i
, u
i
)
Markov Chain Monte Carlo Methods
MCMC tools for variable dimension problems
Greens method
Example (Mixture of normal distributions)
M
k
:
k

j=1
p
jk
^(
jk
,
2
jk
)
[Richardson & Green, 1997]
Moves:
Markov Chain Monte Carlo Methods
MCMC tools for variable dimension problems
Greens method
Example (Mixture of normal distributions)
M
k
:
k

j=1
p
jk
^(
jk
,
2
jk
)
[Richardson & Green, 1997]
Moves:
(i) Split
_
_
_
p
jk
= p
j(k+1)
+p
(j+1)(k+1)
p
jk

jk
= p
j(k+1)

j(k+1)
+p
(j+1)(k+1)

(j+1)(k+1)
p
jk

2
jk
= p
j(k+1)

2
j(k+1)
+p
(j+1)(k+1)

2
(j+1)(k+1)
(ii) Merge (reverse)
Markov Chain Monte Carlo Methods
MCMC tools for variable dimension problems
Greens method
Example (Mixture (2))
Additional Birth and Death moves for empty components
(created from the prior distribution)
Equivalent
(i). Split
(T)
_

_
u
1
, u
2
, u
3
|(0, 1)
p
j(k+1)
= u
1
p
jk

j(k+1)
= u
2

jk

2
j(k+1)
= u
3

2
jk
Markov Chain Monte Carlo Methods
MCMC tools for variable dimension problems
Greens method
Histogram of k
k
1 2 3 4 5
0
.
0
0
.
1
0
.
2
0
.
3
0
.
4
0e+00 2e+04 4e+04 6e+04 8e+04 1e+05
1
2
3
4
5
Rawplot of k
k
Histogram and rawplot of
100, 000 ks under the
constraint k 5.
Normalised enzyme dataset
0.0 0.5 1.0 1.5 2.0 2.5 3.0
0
.
0
0
.
5
1
.
0
1
.
5
2
.
0
2
.
5
3
.
0
Markov Chain Monte Carlo Methods
MCMC tools for variable dimension problems
Greens method
Example (Hidden Markov model)
move to birth
Extension of the mixture model
P(X
t
+ 1 = j[X
t
= i) = w
ij
,
w
ij
=
ij
/

i
,
Y
t
[X
t
= i ^(
i
,
2
i
).
Markov Chain Monte Carlo Methods
MCMC tools for variable dimension problems
Greens method
. . .
-

Y
t
6

X
t
-

Y
t+1
6

X
t+1
-
. . .
Markov Chain Monte Carlo Methods
MCMC tools for variable dimension problems
Greens method
Example (Hidden Markov model (2))
Move to split component j

into j
1
and j
2
:

ij
1
=
ij

i
,
ij
2
=
ij

(1
i
),
i
|(0, 1);

j
1
j
=
j

j
,
j
2
j
=
j

j
/
j
,
j
log ^(0, 1);
similar ideas give
j
1
j
2
etc.;

j
1
=
j

3
j

,
j
2
=
j

+ 3
j

^(0, 1);

2
j
1
=
2
j

,
2
j
2
=
2
j

log ^(0, 1).


[Robert & al., 2000]
Markov Chain Monte Carlo Methods
MCMC tools for variable dimension problems
Greens method
0 10000 20000 30000 40000
1
2
3
4
5
0 50000 100000 150000 200000
0
0.5
1
0 5000 10000 15000 20000
0
0.005
0.01
0.015
Upper panel: First 40,000 values of k for S&P 500 data, plotted
every 20th sweep. Middle panel: estimated posterior distribution
of k for S&P 500 data as a function of number of sweeps. Lower
panel:
1
and
2
in rst 20,000 sweeps with k = 2 for S&P 500
data.
Markov Chain Monte Carlo Methods
MCMC tools for variable dimension problems
Greens method
Example (Autoregressive model)
move to birth
Typical setting for model choice: determine order p of AR(p)
model
Markov Chain Monte Carlo Methods
MCMC tools for variable dimension problems
Greens method
Example (Autoregressive model)
move to birth
Typical setting for model choice: determine order p of AR(p)
model
Consider the (less standard) representation
p

i=1
(1
i
B) X
t
=
t
,
t
^(0,
2
)
where the
i
s are within the unit circle if complex and within
[1, 1] if real.
[Huerta and West, 1998]
Markov Chain Monte Carlo Methods
MCMC tools for variable dimension problems
Greens method
AR(p) reversible jump algorithm
Example (Autoregressive (2))
Uniform priors for the real and complex roots
j
,
1
k/2 + 1

i
R
1
2
I
|
i
|<1

i
R
1

I
|
i
|<1
and (purely birth-and-death) proposals based on these priors

k k+1 [Creation of real root]

k k+2 [Creation of complex root]

k k-1 [Deletion of real root]

k k-2 [Deletion of complex root]


Markov Chain Monte Carlo Methods
MCMC tools for variable dimension problems
Birth and Death processes
Birth and Death processes
instant death!
Use of an alternative methodology based on a Birth&-Death
(point) process
[Preston, 1976; Ripley, 1977; Geyer & Mller, 1994; Stevens, 1999]
Markov Chain Monte Carlo Methods
MCMC tools for variable dimension problems
Birth and Death processes
Birth and Death processes
instant death!
Use of an alternative methodology based on a Birth&-Death
(point) process
[Preston, 1976; Ripley, 1977; Geyer & Mller, 1994; Stevens, 1999]
Idea: Create a Markov chain in continuous time, i.e. a Markov
jump process, moving between models M
k
, by births (to increase
the dimension), deaths (to decrease the dimension), and other
moves.
Markov Chain Monte Carlo Methods
MCMC tools for variable dimension problems
Birth and Death processes
Birth and Death processes
Time till next modication (jump) is exponentially distributed
with rate depending on current state
Remember: if
1
, . . . ,
v
are exponentially distributed,
i
c(
i
),
min
i
c
_

i
_
Markov Chain Monte Carlo Methods
MCMC tools for variable dimension problems
Birth and Death processes
Birth and Death processes
Time till next modication (jump) is exponentially distributed
with rate depending on current state
Remember: if
1
, . . . ,
v
are exponentially distributed,
i
c(
i
),
min
i
c
_

i
_
Dierence with MH-MCMC: Whenever a jump occurs, the
corresponding move is always accepted. Acceptance probabilities
replaced with holding times.
Implausible congurations
L()() 1
die quickly.
Markov Chain Monte Carlo Methods
MCMC tools for variable dimension problems
Birth and Death processes
Balance condition
Sucient to have detailed balance
L()()q(,

) = L(

)(

)q(

, ) for all ,

for () L()() to be stationary.


Here q(,

) rate of moving from state to

.
Possibility to add split/merge and xed-k processes if balance
condition satised.
Markov Chain Monte Carlo Methods
MCMC tools for variable dimension problems
Birth and Death processes
Example (Mixture contd)
Stephens original modelling:

Representation as a (marked) point process


=
_
p
j
, (
j
,
j
)
_
j

Birth rate
0
(constant)

Birth proposal from the prior

Death rate
j
() for removal of point j

Death proposal removes component and modies weights


Markov Chain Monte Carlo Methods
MCMC tools for variable dimension problems
Birth and Death processes
Example (Mixture contd (2))

Overall death rate


k

j=1

j
() = ()

Balance condition
(k+1) d(p, (, )) L(p, (, )) =
0
L()
(k)
(k + 1)
with
d( p
j
, (
j
,
j
)) =
j
()

Case of Poisson prior k Toi(


1
)

j
() =

0

1
L( p
j
, (
j
,
j
))
L()
Markov Chain Monte Carlo Methods
MCMC tools for variable dimension problems
Birth and Death processes
Stephens original algorithm
Algorithm (Mixture Birth& Death)
For v = 0, 1, , V
t v
Run till t > v + 1
1. Compute
j
() =
L([
j
)
L()

1
2. ()
k

j=1

j
(
j
),
0
+(), u |([0, 1])
3. t t ulog(u)
Markov Chain Monte Carlo Methods
MCMC tools for variable dimension problems
Birth and Death processes
Algorithm (Mixture Birth& Death (contd))
4. With probability ()/
Remove component j with probability
j
()/()
k k 1
p

/(1 p
j
) ( ,= j)
Otherwise,
Add component j from the prior (
j
,
j
) p
j
Be(, k)
p

(1 p
j
) ( ,= j)
k k + 1
5. Run I MCMC(k, , p)
Markov Chain Monte Carlo Methods
MCMC tools for variable dimension problems
Birth and Death processes
Rescaling time
move to HMM
In discrete-time RJMCMC, let the time unit be 1/N,
put

k
=
k
/N and
k
= 1
k
/N
As N , each birth proposal will be accepted, and having k components births occur according to a
Poisson process with rate
k
while component (w, ) dies with rate
lim
N
N
k+1

1
k + 1
min(A
1
, 1)
= lim
N
N
1
k + 1
likelihood ratio
1

k+1

b(w, )
(1 w)
k1
= likelihood ratio
1

k
k + 1

b(w, )
(1 w)
k1
.
Hence RJMCMCBDMCMC. This holds more generally.
Markov Chain Monte Carlo Methods
MCMC tools for variable dimension problems
Birth and Death processes
Example (HMM models (contd))
Implementation of the split-and-combine rule of Richardson and
Green (1997) in continuous time
Markov Chain Monte Carlo Methods
MCMC tools for variable dimension problems
Birth and Death processes
Example (HMM models (contd))
Implementation of the split-and-combine rule of Richardson and
Green (1997) in continuous time
Move to split component j

into j
1
and j
2
:

ij
1
=
ij

i
,
ij
2
=
ij

(1
i
),
i
|(0, 1);

j
1
j
=
j

j
,
j
2
j
=
j

j
/
j
,
j
log ^(0, 1);
similar ideas give
j
1
j
2
etc.;

j
1
=
j

3
j

,
j
2
=
j

+ 3
j

^(0, 1);

2
j
1
=
2
j

,
2
j
2
=
2
j

log ^(0, 1).


[Cappe & al, 2001]
Markov Chain Monte Carlo Methods
MCMC tools for variable dimension problems
Birth and Death processes
Wind intensity in Athens
5 0 5
0
.
0
0
.
1
0
.
2
0
.
3
0
.
4
0
.
5

2
0
2
4
6
Histogram and rawplot of 500 wind intensities in Athens
Markov Chain Monte Carlo Methods
MCMC tools for variable dimension problems
Birth and Death processes
Number of states
temp[, 1]
R
e
la
t
iv
e

F
r
e
q
u
e
n
c
y
2 4 6 8 10
0
.
0
0
.
2
0
.
4
0
.
6
0 500 1000 1500 2000 2500
1
2
3
4
5
instants
n
u
m
b
e
r

o
f

s
t
a
t
e
s
Log likelihood values
temp[, 2]
R
e
la
t
iv
e

F
r
e
q
u
e
n
c
y
1400 1200 1000 800 600 400 200
0
.
0
0
0
0
.
0
1
0
0
.
0
2
0
0
.
0
3
0
0 500 1000 1500 2000 2500

1
4
0
0

1
0
0
0

6
0
0

2
0
0
instants
lo
g

lik
e
lih
o
o
d
Number of moves
temp[, 3]
R
e
la
t
iv
e

F
r
e
q
u
e
n
c
y
5 10 15 20 25 30
0
.
0
0
0
.
0
5
0
.
1
0
0
.
1
5
0 500 1000 1500 2000 2500
5
1
0
2
0
3
0
instants
N
u
m
b
e
r

o
f

m
o
v
e
s
MCMC output on k (histogram and rawplot), corresponding
loglikelihood values (histogram and rawplot), and number of
moves (histogram and rawplot)
Markov Chain Monte Carlo Methods
MCMC tools for variable dimension problems
Birth and Death processes
0 500 1000 1500
0
.
0
0
.
2
0
.
4
0
.
6
0
.
8
1
.
0

0 500 1000 1500


0
1
2
3
4
5

MCMC sequence of the probabilities


j
of the stationary
distribution (top) and the parameters (bottom) of the
three components when conditioning on k = 3
Markov Chain Monte Carlo Methods
MCMC tools for variable dimension problems
Birth and Death processes
5 0 5
0
.
0
0
.
1
0
.
2
0
.
3
0
.
4
0
.
5
0
.
6

MCMC evaluation of the marginal density of the dataset
(dashes), compared with R nonparametric density estimate
(solid lines).
Markov Chain Monte Carlo Methods
Sequential importance sampling
Sequential importance sampling
basic importance
Sequential importance sampling
Adaptive MCMC
Importance sampling revisited
Dynamic extensions
Population Monte Carlo
Markov Chain Monte Carlo Methods
Sequential importance sampling
Adaptive MCMC
Adaptive MCMC is not possible
Algorithms trained on-line usually invalid:
Markov Chain Monte Carlo Methods
Sequential importance sampling
Adaptive MCMC
Adaptive MCMC is not possible
Algorithms trained on-line usually invalid:
using the whole past of the chain implies that this is not a
Markov chain any longer!
Markov Chain Monte Carlo Methods
Sequential importance sampling
Adaptive MCMC
Example (Poly t distribution)
Consider a t-distribution T (3, , 1) sample (x
1
, . . . , x
n
) with a at
prior () = 1
If we try t a normal proposal from empirical mean and variance of
the chain so far,

t
=
1
t
t

i=1

(i)
and
2
t
=
1
t
t

i=1
(
(i)

t
)
2
,
Markov Chain Monte Carlo Methods
Sequential importance sampling
Adaptive MCMC
Example (Poly t distribution)
Consider a t-distribution T (3, , 1) sample (x
1
, . . . , x
n
) with a at
prior () = 1
If we try t a normal proposal from empirical mean and variance of
the chain so far,

t
=
1
t
t

i=1

(i)
and
2
t
=
1
t
t

i=1
(
(i)

t
)
2
,
MetropolisHastings algorithm with acceptance probability
n

j=2
_
+ (x
j

(t)
)
2
+ (x
j
)
2
_
(+1)/2
exp (
t

(t)
)
2
/2
2
t
exp(
t
)
2
/2
2
t
,
where ^(
t
,
2
t
).
Markov Chain Monte Carlo Methods
Sequential importance sampling
Adaptive MCMC
Example (Poly t distribution (2))
Invalid scheme:

when range of initial values too small, the


(i)
s cannot
converge to the target distribution and concentrates on too
small a support.

long-range dependence on past values modies the


distribution of the sequence.

using past simulations to create a non-parametric


approximation to the target distribution does not work either
Markov Chain Monte Carlo Methods
Sequential importance sampling
Adaptive MCMC
0 1000 2000 3000 4000 5000

0
.4

0
.2
0
.0
0
.2
Iterations
x

1.5 1.0 0.5 0.0 0.5
0
1
2
3
0 1000 2000 3000 4000 5000

1
.5

1
.0

0
.5
0
.0
0
.5
1
.0
1
.5
Iterations
x

2 1 0 1 2
0
.0
0
.2
0
.4
0
.6
0 1000 2000 3000 4000 5000

1
0
1
2
Iterations
x

2 1 0 1 2 3
0
.0
0
.1
0
.2
0
.3
0
.4
0
.5
0
.6
0
.7
Adaptive scheme for a sample of 10 x
j
T

and initial
variances of (top) 0.1, (middle) 0.5, and (bottom) 2.5.
Markov Chain Monte Carlo Methods
Sequential importance sampling
Adaptive MCMC

2 1 0 1 2
0
.
0
0
.
1
0
.
2
0
.
3
0
.
4
0
.
5
0
.
6
Comparison of the distribution of an adaptive scheme sample
of 25, 000 points with initial variance of 2.5 and of the target
distribution.
Markov Chain Monte Carlo Methods
Sequential importance sampling
Adaptive MCMC
0 10000 30000 50000

1
.
0

0
.
5
0
.
0
0
.
5
1
.
0
1
.
5
Iterations
x

1.5 0.5 0.5 1.0 1.5
0
.
0
0
.
2
0
.
4
0
.
6
0
.
8
1
.
0
Sample produced by 50, 000 iterations of a nonparametric
adaptive MCMC scheme and comparison of its distribution
with the target distribution.
Markov Chain Monte Carlo Methods
Sequential importance sampling
Adaptive MCMC
Simply forget about it!
Warning:
One should not constantly adapt the proposal on past
performances
Either adaptation ceases after a period of burnin
or the adaptive scheme must be theoretically assessed on its own
right.
Markov Chain Monte Carlo Methods
Sequential importance sampling
Importance sampling revisited
Importance sampling revisited
Approximation of integrals
back to basic importance
I =
_
h(x)(x)dx
by unbiased estimators

I =
1
n
n

i=1

i
h(x
i
)
when
x
1
, . . . , x
n
iid
q(x) and
i
def
=
(x
i
)
q(x
i
)
Markov extension
For densities f and g, and importance weight
$$\omega(x) = f(x)/g(x)\,,$$
for any kernel K(x, x′) with stationary distribution f,
$$\int \omega(x)\, K(x, x')\, g(x)\, dx = f(x')\,.$$
[MacEachern, Clyde, and Liu, 1999]
Consequence: an importance sample transformed by MCMC transitions keeps its weights.
Unbiasedness preservation:
$$\mathbb{E}\left[\omega(X)\, h(X')\right] = \int \omega(x)\, h(x')\, K(x, x')\, g(x)\, dx\, dx' = \mathbb{E}_f[h(X)]$$
Not so exciting!
The weights do not change!
If x has a small weight ω(x) = f(x)/g(x), then x′ ∼ K(x, x′) keeps this small weight.
Pros and cons of importance sampling vs. MCMC
- Production of a sample (IS) vs. of a Markov chain (MCMC)
- Dependence on importance function (IS) vs. on previous value (MCMC)
- Unbiasedness (IS) vs. convergence to the true distribution (MCMC)
- Variance control (IS) vs. learning costs (MCMC)
- Recycling of past simulations (IS) vs. progressive adaptability (MCMC)
- Processing of moving targets (IS) vs. handling large dimensional problems (MCMC)
- Non-asymptotic validity (IS) vs. difficult asymptotia for adaptive algorithms (MCMC)
Dynamic extensions

Dynamic importance sampling
Idea: it is possible to generalise importance sampling using random weights ω_t such that
$$\mathbb{E}[\omega_t \mid x_t] = \pi(x_t)/g(x_t)$$
(a) Self-regenerative chains
[Sahu & Zhigljavsky, 1998; Gasemyr, 2002]
Proposal Y ∼ p(y) ∝ p̃(y) and target distribution π(y) ∝ π̃(y)
Ratios
$$\omega(x) = \pi(x)/p(x)\ \text{[unknown]} \qquad\text{and}\qquad \tilde\omega(x) = \tilde\pi(x)/\tilde{p}(x)\ \text{[known]}$$
Acceptance function
$$\alpha(x) = \frac{1}{1 + \kappa\,\tilde\omega(x)} > 0$$
Geometric jumps
Theorem
If Y ∼ p(y) and
$$W \mid Y = y \sim \mathcal{G}\!eo(\alpha(y))\,,$$
then
$$X_t = \cdots = X_{t+W-1} = Y \neq X_{t+W}$$
defines a Markov chain with stationary distribution π.
Plusses
- Valid for any choice of κ [κ small = large variance, and κ large = slow convergence]
- Only depends on the current value [difference with Metropolis]
- Random integer weight W [similarity with Metropolis]
- Saves on the rejections: always accept [difference with Metropolis]
- Introduces geometric noise compared with importance sampling:
$$\sigma^2_{SZ} = 2\,\sigma^2_{IS} + (1/\kappa)\,\sigma^2_{\pi}$$
- Can be used with a sequence of proposals p_k and constants κ_k [adaptivity]
A generalisation
[Gasemyr, 2002]
Proposal density p(y) and probability q(y) of accepting a jump.
Algorithm (Gasemyr's dynamic weights)
Generate a sequence of random weights W_n by
1. Generate Y_n ∼ p(y)
2. Generate V_n ∼ B(q(y_n))
3. Generate S_n ∼ Geo(α(y_n))
4. Take W_n = V_n S_n
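A minimal sketch of this weight-generation loop, with an illustrative proposal N(0, 2²) and hypothetical q(·) and α(·) chosen only so that both stay in (0, 1]:

import numpy as np

rng = np.random.default_rng(2)

def q_jump(y):                            # illustrative jump probability
    return 1.0 / (1.0 + 0.5 * y ** 2)

def alpha(y):                             # illustrative geometric parameter
    return min(1.0, 0.2 + 0.1 * y ** 2)

def dynamic_weights(n):
    """Generate n pairs (Y_k, W_k) with W_k = V_k * S_k."""
    pairs = []
    for _ in range(n):
        y = rng.normal(0.0, 2.0)          # 1. Y_n ~ p
        v = rng.binomial(1, q_jump(y))    # 2. V_n ~ B(q(y_n))
        s = rng.geometric(alpha(y))       # 3. S_n ~ Geo(alpha(y_n)), S >= 1
        pairs.append((y, v * s))          # 4. W_n = V_n S_n
    return pairs

The values with W_n = 0 are simply skipped by the induced chain (X_t) of the next slide.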
Validation
For
$$\nu(y) = \frac{p(y)\,q(y)}{\int p(y)\,q(y)\,dy}\,,$$
the chain (X_t) associated with the sequence (Y_n, W_n) by
$$Y_1 = X_1 = \cdots = X_{1+W_1-1}\,, \quad Y_2 = X_{1+W_1} = \cdots$$
is a Markov chain with transition
$$K(x, y) = \alpha(x)\,\nu(y)\,,$$
which has a point mass at y = x with weight 1 − α(x).
Ergodicity for Gasemyr's scheme
Necessary and sufficient condition: π is stationary for (X_t) iff
$$\alpha(y) = q(y)\big/\left(\kappa\,\pi(y)/p(y)\right) = q(y)/(\kappa\, w(y))$$
for some constant κ.
This implies that
$$\mathbb{E}[W_n \mid Y_n = y] = \kappa\, w(y)\,.$$
[Average importance sampling]
Special case: α(y) = 1/(1 + κ w(y)) of Sahu and Zhigljavsky (2001)
Properties
Constraint on κ: for α(y) ≤ 1, κ must be such that
$$p(y)\,q(y) \le \kappa\,\pi(y)\,.$$
Reverse of the accept-reject condition (!)
The variance of
$$\sum_n W_n\, h(Y_n) \Big/ \sum_n W_n \qquad (4)$$
is
$$2\int \left(h(y) - \mu\right)^2\, \frac{w(y)}{q(y)}\,\pi(y)\,dy \;-\; (1/\kappa)\,\sigma^2_{\pi}\,, \qquad \mu = \mathbb{E}_\pi[h]\,,$$
by Cramér–Wold/Slutsky.
Still worse than importance sampling.
(b) Dynamic weighting
[Wong & Liang, 1997; Liu, Liang & Wong, 2001; Liang, 2002]
Generalisation of the above: simultaneous generation of points and weights, (θ_t, ω_t), under the constraint
$$\mathbb{E}[\omega_t \mid \theta_t] \propto \pi(\theta_t) \qquad (5)$$
Same use as importance sampling weights
Algorithm (Liang's dynamic importance sampling)
1. Generate y ∼ K(x, y) and compute
$$\varrho = \omega\,\frac{\pi(y)\,K(y, x)}{\pi(x)\,K(x, y)}$$
2. Generate u ∼ U(0, 1) and take
$$(x', \omega') = \begin{cases} \left(y,\ (1+\delta)\varrho/a\right) & \text{if } u < a \\ \left(x,\ (1+\delta)\omega/(1-a)\right) & \text{otherwise} \end{cases}$$
where a = ϱ/(ϱ + θ), θ = θ(x, ω), and δ > 0 is a constant or an independent rv
Preservation of the equilibrium equation
If g₋ and g₊ denote the distributions of the augmented variable (X, W) before and after the step, respectively, then, using $\int_0^\infty \omega\, g_-(x, \omega)\, d\omega = c_0\,\pi(x)$,
$$\begin{aligned}
\int_0^\infty \omega'\, g_+(x', \omega')\, d\omega'
&= \int (1+\delta)\left[\varrho(\omega, x, x') + \theta\right] g_-(x, \omega)\, K(x, x')\, \frac{\varrho(\omega, x, x')}{\varrho(\omega, x, x') + \theta}\, dx\, d\omega \\
&\quad + \int (1+\delta)\, \frac{\omega\left(\varrho(\omega, x', z) + \theta\right)}{\theta}\, g_-(x', \omega)\, K(x', z)\, \frac{\theta}{\varrho(\omega, x', z) + \theta}\, dz\, d\omega \\
&= (1+\delta)\left[\iint g_-(x, \omega)\, \omega\, \frac{\pi(x')\,K(x', x)}{\pi(x)}\, dx\, d\omega + \int \omega\, g_-(x', \omega)\, K(x', z)\, dz\, d\omega\right] \\
&= (1+\delta)\left[\pi(x') \int c_0\, K(x', x)\, dx + c_0\,\pi(x')\right] = 2(1+\delta)\, c_0\,\pi(x')\,,
\end{aligned}$$
where c₀ is the proportionality constant.
Special case: R-move
[Liang, 2002]
δ = 0 and θ ≡ 1, and thus
$$(x', \omega') = \begin{cases} (y,\ \varrho + 1) & \text{if } u < \varrho/(\varrho + 1) \\ (x,\ \omega(\varrho + 1)) & \text{otherwise} \end{cases}$$
[Importance sampling]
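A minimal sketch of one R-move update, assuming a symmetric random-walk kernel (so the K terms cancel in ϱ) and an illustrative N(0, 1) target; neither choice comes from the slides:

import numpy as np

rng = np.random.default_rng(3)

def pi(x):                                 # illustrative target, unnormalised
    return np.exp(-x ** 2 / 2)

def r_move(x, w, scale=1.0):
    """One R-move on the weighted pair (x, w)."""
    y = x + rng.normal(0.0, scale)         # y ~ K(x, .), symmetric kernel
    rho = w * pi(y) / pi(x)                # K terms cancel by symmetry
    if rng.uniform() < rho / (rho + 1.0):  # jump with probability rho/(rho+1)
        return y, rho + 1.0
    return x, w * (rho + 1.0)              # stay and inflate the weight

x, w = 0.0, 1.0
for _ in range(1000):
    x, w = r_move(x, w)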
Special case: W-move
θ ≡ 0, thus a = 1 and
$$(x', \omega') = (y,\ \varrho)\,.$$
Q-move
[Liu & al, 2001]
$$(x', \omega') = \begin{cases} (y,\ \varrho \vee \theta) & \text{if } u < 1 \wedge \varrho/\theta\,, \\ (x,\ a\,\omega) & \text{otherwise,} \end{cases}$$
with a ≥ 1 either a constant or an independent random variable.
Notes
- The updating step in the Q and R schemes can be written as
$$(x_{t+1}, \omega_{t+1}) = \left(x_t,\ \omega_t/\Pr(R_t = 0)\right)$$
with probability Pr(R_t = 0), and
$$(x_{t+1}, \omega_{t+1}) = \left(y_{t+1},\ \omega_t\, r(x_t, y_{t+1})/\Pr(R_t = 1)\right)$$
with probability Pr(R_t = 1), where R_t is the move indicator, y_{t+1} ∼ K(x_t, y), and $r(x, y) = \pi(y)K(y, x)\big/\pi(x)K(x, y)$, so that ϱ = ω_t r(x_t, y_{t+1}).
Notes (2)
- Geometric structure of the weights:
$$\Pr(R_t = 0) = \frac{\omega_t}{\omega_{t+1}}$$
and
$$\Pr(R_t = 1) = \frac{\omega_t\, r(x_t, y_t)}{\omega_t\, r(x_t, y_t) + \theta}\,, \qquad \theta > 0\,,$$
for the R scheme
- The number of steps T before an acceptance (a jump) is such that
$$\Pr(T \ge t) = \Pr(R_1 = 0, \ldots, R_{t-1} = 0) = \mathbb{E}\left[\prod_{j=0}^{t-1} \frac{\omega_j}{\omega_{j+1}}\right] = \omega_0\, \mathbb{E}\left[1/\omega_t\right].$$
Alternative scheme
Preservation of the weight expectation:
$$(x_{t+1}, \omega_{t+1}) = \left(x_t,\ \varepsilon_t\,\omega_t/\Pr(R_t = 0)\right)$$
with probability Pr(R_t = 0), and
$$(x_{t+1}, \omega_{t+1}) = \left(y_{t+1},\ (1 - \varepsilon_t)\,\omega_t\, r(x_t, y_{t+1})/\Pr(R_t = 1)\right)$$
with probability Pr(R_t = 1).
Alternative scheme (2)
Then
$$\Pr(T = t) = \Pr(R_1 = 0, \ldots, R_{t-1} = 0,\ R_t = 1) = \mathbb{E}\left[\prod_{j=1}^{t-1} \varepsilon_j\ (1 - \varepsilon_t)\, \frac{\omega_0\, r(x_0, Y_t)}{\omega_t}\right],$$
which is equal to
$$\varepsilon^{t-1}(1 - \varepsilon)\, \mathbb{E}\left[\omega_0\, r(x_0, Y_t)/\omega_t\right]$$
when the ε_j's are constant and deterministic.
Example
Choose a function 0 < ε(·, ·) < 1 and take, while in (x_0, ω_0),
$$(x_1, \omega_1) = \left(y_1,\ \frac{(1 - \varepsilon(x_0, y_1))\,\omega_0\, r(x_0, y_1)}{\gamma(x_0, y_1)}\right)$$
with probability
$$\gamma(x_0, y_1) \overset{\text{def}}{=} \min\left(1,\ \omega_0\, r(x_0, y_1)\right)$$
and
$$(x_1, \omega_1) = \left(x_0,\ \frac{\varepsilon(x_0, y_1)\,\omega_0}{1 - \gamma(x_0, y_1)}\right)$$
with probability 1 − γ(x_0, y_1).
Population Monte Carlo
Idea: simulate from the product distribution
$$\pi^{\otimes n}(x_1, \ldots, x_n) = \prod_{i=1}^{n} \pi(x_i)$$
and apply dynamic importance sampling to the sample (a.k.a. population)
$$x^{(t)} = \left(x_1^{(t)}, \ldots, x_n^{(t)}\right)$$
Iterated importance sampling
As in Markov chain Monte Carlo (MCMC) algorithms, introduce a temporal dimension:
$$x_i^{(t)} \sim q_t\left(x \mid x_i^{(t-1)}\right)\,, \qquad i = 1, \ldots, n,\ t = 1, \ldots$$
and
$$\hat{I}_t = \frac{1}{n}\sum_{i=1}^{n} \varrho_i^{(t)}\, h\left(x_i^{(t)}\right)$$
is still unbiased for
$$\varrho_i^{(t)} = \frac{\pi\left(x_i^{(t)}\right)}{q_t\left(x_i^{(t)} \mid x_i^{(t-1)}\right)}\,, \qquad i = 1, \ldots, n$$
Fundamental importance equality
Preservation of unbiasedness:
$$\mathbb{E}\left[h\left(X^{(t)}\right)\frac{\pi\left(X^{(t)}\right)}{q_t\left(X^{(t)} \mid X^{(t-1)}\right)}\right] = \int h(x)\,\frac{\pi(x)}{q_t(x \mid y)}\; q_t(x \mid y)\, g(y)\, dx\, dy = \int h(x)\,\pi(x)\, dx$$
for any distribution g on X^{(t−1)}
Sequential variance decomposition
Furthermore,
$$\operatorname{var}\left(\hat{I}_t\right) = \frac{1}{n^2}\sum_{i=1}^{n} \operatorname{var}\left(\varrho_i^{(t)}\, h\left(x_i^{(t)}\right)\right),$$
if var(ϱ_i^{(t)}) exists, because the x_i^{(t)}'s are conditionally uncorrelated.
Note: this decomposition is still valid for [in i] correlated x_i^{(t)}'s when incorporating the weights ϱ_i^{(t)}
Simulation of a population
The importance distribution of the sample (a.k.a. particles)
$$x^{(t)} \sim q_t\left(x^{(t)} \mid x^{(t-1)}\right)$$
can depend on the previous sample x^{(t−1)} in any possible way, as long as the marginal distributions
$$q_{it}(x) = \int q_t\left(x^{(t)}\right)\, dx_{-i}^{(t)}$$
can be expressed, to build the importance weights
$$\varrho_{it} = \frac{\pi\left(x_i^{(t)}\right)}{q_{it}\left(x_i^{(t)}\right)}$$
Special case of the product proposal
If
$$q_t\left(x^{(t)} \mid x^{(t-1)}\right) = \prod_{i=1}^{n} q_{it}\left(x_i^{(t)} \mid x^{(t-1)}\right)$$
[independent proposals], then
$$\operatorname{var}\left(\hat{I}_t\right) = \frac{1}{n^2}\sum_{i=1}^{n} \operatorname{var}\left(\varrho_i^{(t)}\, h\left(x_i^{(t)}\right)\right)$$
Validation
$$\mathbb{E}\left[\varrho_i^{(t)} h\left(X_i^{(t)}\right)\, \varrho_j^{(t)} h\left(X_j^{(t)}\right)\right] = \int h(x_i)\,\frac{\pi(x_i)}{q_{it}\left(x_i \mid x^{(t-1)}\right)}\; h(x_j)\,\frac{\pi(x_j)}{q_{jt}\left(x_j \mid x^{(t-1)}\right)}\; q_{it}\left(x_i \mid x^{(t-1)}\right) q_{jt}\left(x_j \mid x^{(t-1)}\right) dx_i\, dx_j\; g\left(x^{(t-1)}\right) dx^{(t-1)} = \mathbb{E}_{\pi}[h(X)]^2$$
whatever the distribution g on x^{(t−1)}
Self-normalised version
In general, π is unscaled and the weight
$$\varrho_i^{(t)} \propto \frac{\pi\left(x_i^{(t)}\right)}{q_{it}\left(x_i^{(t)}\right)}\,, \qquad i = 1, \ldots, n\,,$$
is scaled so that
$$\sum_i \varrho_i^{(t)} = 1$$
Self-normalised version properties
- Loss of the unbiasedness property and of the variance decomposition
- The normalising constant can be estimated by
$$\varpi_t = \frac{1}{tn}\sum_{\tau=1}^{t}\sum_{i=1}^{n} \frac{\pi\left(x_i^{(\tau)}\right)}{q_{i\tau}\left(x_i^{(\tau)}\right)}$$
- The variance decomposition is (approximately) recovered if ϖ_{t−1} is used instead
Sampling importance resampling
Importance sampling from g can also produce samples from the target π
[Rubin, 1987]
Theorem (Bootstrapped importance sampling)
If a sample (x*_i)_{1≤i≤m} is derived from the weighted sample (x_i, ϱ_i)_{1≤i≤n} by multinomial sampling with weights ϱ_i, then
$$x_i^{\star} \sim \pi(x)$$
Note: obviously, the x*_i's are not iid
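A short numpy sketch of the multinomial resampling step, with an illustrative target N(0, 1) and proposal g = N(0, 3²), both hypothetical choices:

import numpy as np

rng = np.random.default_rng(4)

n = 50_000
xs = rng.normal(0.0, 3.0, size=n)        # x_i ~ g = N(0, 9)
log_w = -xs ** 2 / 2 + xs ** 2 / 18      # log pi - log g, up to constants
w = np.exp(log_w - log_w.max())
w /= w.sum()                             # normalised weights rho_i

idx = rng.choice(n, size=n, replace=True, p=w)
resampled = xs[idx]                      # approximately ~ pi, but not iid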
Iterated sampling importance resampling
This principle can be extended to iterated importance sampling: after each iteration, resampling produces a sample from π
[Again, not iid!]
Incentive: use previous sample(s) to learn about π and q
Generic Population Monte Carlo
Algorithm (Population Monte Carlo Algorithm)
For t = 1, . . . , T:
  For i = 1, . . . , n:
    1. Select the generating distribution $q_{it}(\cdot)$
    2. Generate $\tilde{x}_i^{(t)} \sim q_{it}(x)$
    3. Compute $\varrho_i^{(t)} = \pi(\tilde{x}_i^{(t)})\big/q_{it}(\tilde{x}_i^{(t)})$
  Normalise the $\varrho_i^{(t)}$'s into $\bar\varrho_i^{(t)}$'s
  Generate $J_{i,t} \sim \mathcal{M}\left((\bar\varrho_i^{(t)})_{1 \le i \le N}\right)$ and set $x_{i,t} = \tilde{x}_{J_{i,t}}^{(t)}$
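A compact sketch of this loop under purely illustrative choices, namely a N(0, 1) target and Gaussian random-walk proposals with a fixed scale (none of this is prescribed by the slides):

import numpy as np

rng = np.random.default_rng(5)

def pi(x):                                   # illustrative target N(0, 1)
    return np.exp(-x ** 2 / 2)

def norm_pdf(y, m, s):
    return np.exp(-(y - m) ** 2 / (2 * s ** 2)) / (s * np.sqrt(2 * np.pi))

N, T, scale = 1000, 10, 1.0
x = rng.normal(0.0, 5.0, size=N)             # initial population
for t in range(T):
    xt = rng.normal(x, scale)                # step 2: x~_i ~ q_it(.|x_i)
    rho = pi(xt) / norm_pdf(xt, x, scale)    # step 3: importance weights
    rho /= rho.sum()                         # normalisation
    idx = rng.choice(N, size=N, p=rho)       # J_it ~ M(rho-bar)
    x = xt[idx]                              # resampled population

Note that the weights are self-normalised here, as in the self-normalised version above, since the target is only known up to a constant.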
D-kernels in competition
A general adaptive construction: build $q_{i,t}$ as a mixture of D different transition kernels depending on $x_i^{(t-1)}$,
$$q_{i,t} = \sum_{\ell=1}^{D} p_{t,\ell}\, K_{\ell}\left(x_i^{(t-1)}, x\right)\,, \qquad \sum_{\ell=1}^{D} p_{t,\ell} = 1\,,$$
and adapt the weights $p_{t,\ell}$.
Example: take $p_{t,\ell}$ proportional to the survival rate of the points (a.k.a. particles) $x_i^{(t)}$ generated from $K_{\ell}$
Implementation
Algorithm (D-kernel PMC)
For t = 1, . . . , T:
  generate $(K_{i,t})_{1 \le i \le N} \overset{\text{iid}}{\sim} \mathcal{M}\left((p_{t,k})_{1 \le k \le D}\right)$
  for 1 ≤ i ≤ N, generate $\tilde{x}_{i,t} \sim K_{K_{i,t}}(x)$
  compute and renormalise the importance weights $\omega_{i,t}$
  generate $(J_{i,t})_{1 \le i \le N} \sim \mathcal{M}\left((\bar\omega_{i,t})_{1 \le i \le N}\right)$
  take $x_{i,t} = \tilde{x}_{J_{i,t},t}$ and $p_{t+1,d} = \sum_{i=1}^{N} \bar\omega_{i,t}\, \mathbb{I}_d(K_{i,t})$
Links with particle filters
- Usual setting where π = π_t changes with t: Population Monte Carlo also adapts to this case
- Can be traced back all the way to Hammersley and Morton (1954) and the self-avoiding random walk problem
- Gilks and Berzuini (2001) produce iterated samples with (SIR) resampling steps and add an MCMC step: this step must use a π_t-invariant kernel
- Chopin (2001) uses iterated importance sampling to handle large datasets: this is a special case of PMC where the q_{it}'s are the posterior distributions associated with a portion k_t of the observed dataset
Links with particle filters (2)
- Rubinstein and Kroese's (2004) cross-entropy method is parameterised importance sampling targeted at rare events
- Stavropoulos and Titterington's (1999) smooth bootstrap and Warnes' (2001) kernel coupler use nonparametric kernels on the previous importance sample to build an improved proposal: this is a special case of PMC
- West's (1992) mixture approximation is a precursor of smooth bootstrap
- Mengersen and Robert's (2002) pinball sampler is an MCMC attempt at population sampling
- Del Moral and Doucet's (2003) sequential Monte Carlo samplers also relate to PMC, with a Markovian dependence on the past sample x^{(t)} but (limited) stationarity constraints
Things can go wrong
Unexpected behaviour of the mixture weights when the number of particles increases:
$$\sum_{i=1}^{N} \bar\omega_{i,t}\, \mathbb{I}_{K_{i,t} = d} \overset{\mathbb{P}}{\longrightarrow} \frac{1}{D}$$
Conclusion: at each iteration, every weight converges to 1/D; the algorithm fails to learn from experience!!
Saved by Rao-Blackwell!!
Modification: Rao-Blackwellisation (= conditioning)
Use the whole mixture in the importance weight:
$$\omega_{i,t} = \frac{\pi(\tilde{x}_{i,t})}{\sum_{d=1}^{D} p_{t,d}\, K_d(x_{i,t-1}, \tilde{x}_{i,t})}$$
instead of
$$\omega_{i,t} = \frac{\pi(\tilde{x}_{i,t})}{K_{K_{i,t}}(x_{i,t-1}, \tilde{x}_{i,t})}$$
Adapted algorithm
Algorithm (Rao-Blackwellised D-kernel PMC)
At time t (t = 1, . . . , T):
  Generate $(K_{i,t})_{1 \le i \le N} \overset{\text{iid}}{\sim} \mathcal{M}\left((p_{t,d})_{1 \le d \le D}\right)$;
  Generate $(\tilde{x}_{i,t})_{1 \le i \le N} \overset{\text{ind}}{\sim} K_{K_{i,t}}(x_{i,t-1}, x)$ and set
$$\omega_{i,t} = \pi(\tilde{x}_{i,t}) \Big/ \sum_{d=1}^{D} p_{t,d}\, K_d(x_{i,t-1}, \tilde{x}_{i,t})\,;$$
  Generate $(J_{i,t})_{1 \le i \le N} \overset{\text{iid}}{\sim} \mathcal{M}\left((\bar\omega_{i,t})_{1 \le i \le N}\right)$ and set
$$x_{i,t} = \tilde{x}_{J_{i,t},t} \qquad\text{and}\qquad p_{t+1,d} = \sum_{i=1}^{N} \bar\omega_{i,t}\, \frac{p_{t,d}\, K_d(x_{i,t-1}, \tilde{x}_{i,t})}{\sum_{j=1}^{D} p_{t,j}\, K_j(x_{i,t-1}, \tilde{x}_{i,t})}\,.$$
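A sketch of this Rao-Blackwellised update, assuming for illustration a N(0, 1) target and D = 3 Gaussian random-walk kernels that differ only in scale (all numerical settings are hypothetical):

import numpy as np

rng = np.random.default_rng(6)

def pi(x):                                    # illustrative target N(0, 1)
    return np.exp(-x ** 2 / 2)

def k_pdf(y, x, s):                           # K_d(x, y) = N(y; x, s^2)
    return np.exp(-(y - x) ** 2 / (2 * s ** 2)) / (s * np.sqrt(2 * np.pi))

scales = np.array([0.1, 1.0, 5.0])            # D = 3 kernels
D, N, T = len(scales), 2000, 20
p = np.full(D, 1.0 / D)                       # initial kernel weights p_{1,d}
x = rng.normal(0.0, 3.0, size=N)
for t in range(T):
    k = rng.choice(D, size=N, p=p)            # K_{i,t} ~ M(p_t)
    xt = rng.normal(x, scales[k])             # x~_{i,t} ~ K_{K_{i,t}}
    terms = np.stack([p[d] * k_pdf(xt, x, scales[d]) for d in range(D)])
    w = pi(xt) / terms.sum(axis=0)            # Rao-Blackwellised weights
    w /= w.sum()
    post = terms / terms.sum(axis=0)          # P(K_{i,t} = d | x, x~)
    p = post @ w                              # p_{t+1,d}
    x = xt[rng.choice(N, size=N, p=w)]        # multinomial resampling

Because every particle contributes to every p_{t+1,d} through its posterior kernel probabilities, the weights keep discriminating between kernels instead of collapsing to 1/D.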
Convergence properties
Theorem (LLN)
Under regularity assumptions, for h ∈ L¹_π and for every t ≥ 1,
$$\frac{1}{N}\sum_{i=1}^{N} \omega_{i,t}\, h(x_{i,t}) \overset{\mathbb{P}}{\underset{N \to \infty}{\longrightarrow}} \pi(h) \qquad\text{and}\qquad p_{t,d} \overset{\mathbb{P}}{\underset{N \to \infty}{\longrightarrow}} \alpha_d^{t}$$
The limiting coefficients $(\alpha_d^{t})_{1 \le d \le D}$ are defined recursively as
$$\alpha_d^{t} = \alpha_d^{t-1} \int \left[\frac{K_d(x, x')}{\sum_{j=1}^{D} \alpha_j^{t-1} K_j(x, x')}\right] \pi \otimes \pi(dx, dx')\,.$$
Recursion on the weights
Define F as
$$F(\alpha) = \left(\alpha_d \int \left[\frac{K_d(x, x')}{\sum_{j=1}^{D} \alpha_j K_j(x, x')}\right] \pi \otimes \pi(dx, dx')\right)_{1 \le d \le D}$$
on the simplex
$$S = \left\{\alpha = (\alpha_1, \ldots, \alpha_D);\ \forall d \in \{1, \ldots, D\},\ \alpha_d \ge 0 \ \text{and}\ \sum_{d=1}^{D} \alpha_d = 1\right\},$$
and define the sequence
$$\alpha^{t+1} = F(\alpha^{t})$$
Kullback divergence
Definition (Kullback divergence)
For α ∈ S,
$$\mathrm{KL}(\alpha) = \int \left[\log\left(\frac{\pi(x)\,\pi(x')}{\pi(x) \sum_{d=1}^{D} \alpha_d K_d(x, x')}\right)\right] \pi \otimes \pi(dx, dx')\,,$$
the Kullback divergence between π and the mixture.
Goal: obtain the mixture closest to π, i.e., the one that minimises KL(α)
Connection with RBDPMCA
Theorem
Under the assumption
$$\forall d \in \{1, \ldots, D\}\,, \qquad -\infty < \int \log\left(K_d(x, x')\right) \pi \otimes \pi(dx, dx') < \infty\,,$$
for every α ∈ S,
$$\mathrm{KL}(F(\alpha)) \le \mathrm{KL}(\alpha)\,.$$
Conclusion: the Kullback divergence decreases at every iteration of RBDPMCA
An integrated EM interpretation
We have
$$\alpha^{\min} = \arg\min_{\alpha \in S} \mathrm{KL}(\alpha) = \arg\max_{\alpha \in S} \int \log p_{\alpha}(\bar{x})\, \pi(d\bar{x}) = \arg\max_{\alpha \in S} \int \log \int p_{\alpha}(\bar{x}, K)\, dK\; \pi(d\bar{x})$$
for $\bar{x} = (x, x')$ and $K \sim \mathcal{M}((\alpha_d)_{1 \le d \le D})$. Then $\alpha^{t+1} = F(\alpha^{t})$ means that
$$\alpha^{t+1} = \arg\max_{\alpha} \int \mathbb{E}_{\alpha^{t}}\left[\log p_{\alpha}(\bar{X}, K) \mid \bar{X} = \bar{x}\right] \pi(d\bar{x})$$
and
$$\lim_{t \to \infty} \alpha^{t} = \alpha^{\min}$$
Illustration
Example (A toy example)
Take the target
$$\tfrac{1}{4}\,\mathcal{N}(1, 0.3)(x) + \tfrac{1}{4}\,\mathcal{N}(0, 1)(x) + \tfrac{1}{2}\,\mathcal{N}(3, 2)(x)$$
and use 3 proposals: N(1, 0.3), N(0, 1) and N(3, 2)
[Surprise!!!]
The weight evolution is then:

t     p_{t,1}     p_{t,2}      p_{t,3}
1     0.0500000   0.05000000   0.9000000
2     0.2605712   0.09970292   0.6397259
6     0.2740816   0.19160178   0.5343166
10    0.2989651   0.19200904   0.5090259
16    0.2651511   0.24129039   0.4935585
[Figure: Target and mixture evolution.]
Example: PMC for mixtures
Observation of an iid sample x = (x_1, . . . , x_n) from
$$p\,\mathcal{N}(\mu_1, \sigma^2) + (1 - p)\,\mathcal{N}(\mu_2, \sigma^2)\,,$$
with p ≠ 1/2 and σ > 0 known.
Usual $\mathcal{N}(\theta, \sigma^2/\lambda)$ prior on μ_1 and μ_2:
$$\pi(\mu_1, \mu_2 \mid x) \propto f(x \mid \mu_1, \mu_2)\, \pi(\mu_1, \mu_2)$$
Algorithm (Mixture PMC)
Step 0: Initialisation
  For j = 1, . . . , n = pm, choose $(\mu_1)_j^{(0)}$ and $(\mu_2)_j^{(0)}$
  For k = 1, . . . , p, set r_k = m
Step i: Update (i = 1, . . . , I)
  For k = 1, . . . , p,
  1. generate a sample of size r_k as
$$(\mu_1)_j^{(i)} \sim \mathcal{N}\left((\mu_1)_j^{(i-1)}, v_k\right) \qquad\text{and}\qquad (\mu_2)_j^{(i)} \sim \mathcal{N}\left((\mu_2)_j^{(i-1)}, v_k\right)$$
  2. compute the weights
$$\varrho_j \propto \frac{f\left(x \mid (\mu_1)_j^{(i)}, (\mu_2)_j^{(i)}\right) \pi\left((\mu_1)_j^{(i)}, (\mu_2)_j^{(i)}\right)}{\varphi\left((\mu_1)_j^{(i)} - (\mu_1)_j^{(i-1)};\, v_k\right)\, \varphi\left((\mu_2)_j^{(i)} - (\mu_2)_j^{(i-1)};\, v_k\right)}\,,$$
  where φ(·; v) denotes the N(0, v) density.
  Resample the $\left((\mu_1)_j^{(i)}, (\mu_2)_j^{(i)}\right)_j$ using the weights ϱ_j.
Details
After an arbitrary initialisation, the previous (importance) sample (after resampling) is used to build random walk proposals,
$$\mathcal{N}\left((\mu)_j^{(i-1)}, v_j\right),$$
with a multiscale variance v_j within a predetermined set of p scales ranging from 10³ down to 10⁻³, whose importance is proportional to its survival rate in the resampling step.
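A condensed sketch of this scheme on simulated data, with hypothetical settings p = 0.3, σ = 1, λ = 1 and seven scales from 10³ down to 10⁻³; none of the numerical choices come from the slides:

import numpy as np

rng = np.random.default_rng(7)

# Simulated data from 0.3 N(0,1) + 0.7 N(2.5,1); all settings illustrative.
p_mix, lam = 0.3, 1.0
data = np.where(rng.uniform(size=200) < p_mix,
                rng.normal(0.0, 1.0, 200), rng.normal(2.5, 1.0, 200))
theta = data.mean()

def log_post(m1, m2):
    like = np.log(p_mix * np.exp(-(data[None, :] - m1[:, None]) ** 2 / 2)
                  + (1 - p_mix) * np.exp(-(data[None, :] - m2[:, None]) ** 2 / 2))
    prior = -lam * ((m1 - theta) ** 2 + (m2 - theta) ** 2) / 2
    return like.sum(axis=1) + prior

scales = 10.0 ** np.arange(3, -4, -1.0)      # 10^3 down to 10^-3
P, m = len(scales), 200
N = P * m
mu1 = rng.normal(theta, 2.0, N)
mu2 = rng.normal(theta, 2.0, N)
r = np.full(P, m)                            # particles allotted to each scale
for it in range(50):
    v = np.repeat(scales, r)                 # proposal variance per particle
    n1 = rng.normal(mu1, np.sqrt(v))         # random walk proposals
    n2 = rng.normal(mu2, np.sqrt(v))
    # log weight = log posterior - log of the two proposal densities
    log_w = (log_post(n1, n2) + np.log(v)
             + ((n1 - mu1) ** 2 + (n2 - mu2) ** 2) / (2 * v))
    w = np.exp(log_w - log_w.max())
    w /= w.sum()
    idx = rng.choice(N, size=N, p=w)         # multinomial resampling
    mu1, mu2 = n1[idx], n2[idx]
    grp = np.repeat(np.arange(P), r)         # scale of each parent particle
    surv = np.bincount(grp[idx], minlength=P).astype(float)
    r = np.maximum(rng.multinomial(N, surv / surv.sum()), 1)
    r[np.argmax(r)] -= r.sum() - N           # keep the total at N

The np.maximum(..., 1) guard keeps every scale alive with at least one particle, a pragmatic tweak rather than part of the scheme itself.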
[Figure: (upper left) Number of resampled points for v_1 = 5 (darker) and v_2 = 2; (upper right) number of resampled points for the other variances; (middle left) variance of the μ_1's along iterations; (middle right) average of the μ_1's over iterations; (lower left) variance of the μ_2's along iterations; (lower right) average of the simulated μ_2's over iterations.]
[Figure: Log-posterior distribution and sample of means.]