
Introduction to Markov chain Monte Carlo
with examples from Bayesian statistics

First winter school in eScience
Geilo, Wednesday January 31st 2007

Håkon Tjelmeland
Department of Mathematical Sciences
Norwegian University of Science and Technology
Trondheim, Norway
Introduction

- Mixed audience:
  - some with (almost) no knowledge about (Markov chain) Monte Carlo
  - some know a little about (Markov chain) Monte Carlo
  - some have used (Markov chain) Monte Carlo a lot
- Please ask questions/give comments!
- I will discuss topics also discussed by Morten and Laurant
  - Metropolis-Hastings algorithm and Bayesian statistics
  - will use different notation/terminology
- My goal: everyone should understand
  - almost all I discuss today
  - much of what I discuss tomorrow
  - the essence of what I talk about on Friday
- You should
  - understand the mathematics
  - get intuition
- The talk will be available on the web next week
- Remember to ask questions: we have time for it
Plan

- The Markov chain Monte Carlo (MCMC) idea
- Some Markov chain theory
- Implementation of the MCMC idea
- Metropolis-Hastings algorithm
- MCMC strategies
  - independent proposals
  - random walk proposals
  - combination of strategies
  - Gibbs sampler
- Convergence diagnostics
  - trace plots
  - autocorrelation functions
  - one chain or many chains?
- Typical MCMC problems and some remedies
  - high correlation between variables
  - multimodality
  - different scales
Plan (cont.)

- Bayesian statistics and hierarchical modelling
  - Bayes (1763) example
  - what is a probability?
  - Bayesian hierarchical modelling
- Examples
  - analysis of microarray data
  - history matching (petroleum application)
- More advanced MCMC techniques/ideas
  - reversible jump
  - adaptive Markov chain Monte Carlo
  - mode jumping proposals
  - parallelisation of MCMC algorithms
  - perfect simulation
Why (Markov chain) Monte Carlo?

- Given a probability distribution of interest
      π(x), x ∈ ℝ^N
- Usually this means: we have a formula for π(x)
- But the normalising constant is often not known:
      π(x) = c · h(x)
  we only have a formula for h(x)
- We want to understand π(x):
  - generate realisations from π(x) and look at them
  - compute mean values
      μ_f = E[f(x)] = ∫ f(x) π(x) dx
- Note: most things of interest in a stochastic model can be expressed as an expectation
  - probabilities
  - distributions
The Monte Carlo idea

- Probability distribution of interest: π(x), x ∈ ℝ^N
- π(x) is a high-dimensional, complex distribution
- Analytical calculations on π(x) are not possible
- Monte Carlo idea:
  - generate iid samples x_1, …, x_n from π(x)
  - estimate interesting quantities about π(x),
        μ_f = E[f(x)] = ∫ f(x) π(x) dx
    by the sample average
        μ̂_f = (1/n) Σ_{i=1}^n f(x_i)
- Unbiased estimator:
      E[μ̂_f] = (1/n) Σ_{i=1}^n E[f(x_i)] = (1/n) Σ_{i=1}^n μ_f = μ_f
- Estimation uncertainty:
      Var[μ̂_f] = (1/n²) Σ_{i=1}^n Var[f(x_i)] = Var[f(x)] / n,   SD[μ̂_f] = SD[f(x)] / √n
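To make this concrete, here is a minimal sketch (my addition, not from the slides) estimating μ_f for the hypothetical choice X ∼ Poisson(10) and f(x) = x², where the true value is E[X²] = Var[X] + E[X]² = 110:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical demo: estimate E[f(X)] for X ~ Poisson(10), f(x) = x^2.
n = 100_000
x = rng.poisson(lam=10, size=n)       # iid samples from pi(x)
fx = x.astype(float) ** 2

mu_hat = fx.mean()                    # (1/n) * sum f(x_i)
sd_hat = fx.std(ddof=1) / np.sqrt(n)  # estimated SD[f(x)] / sqrt(n)
print(f"estimate {mu_hat:.2f} +/- {sd_hat:.2f} (true value 110)")
```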
The Markov chain Monte Carlo idea

- Probability distribution of interest: π(x), x ∈ ℝ^N
- π(x) is a high-dimensional, complex distribution
- Analytical calculations on π(x) are not possible
- Direct sampling from π(x) is not possible
- Markov chain Monte Carlo idea:
  - construct a Markov chain, {X_i}_{i=0}^∞, so that
        lim_{i→∞} P(X_i = x) = π(x)
  - simulate the Markov chain for many iterations
  - for m large enough, x_m, x_{m+1}, x_{m+2}, … are (essentially) samples from π(x)
  - estimate interesting quantities about π(x),
        μ_f = E[f(x)] = ∫ f(x) π(x) dx
    by
        μ̂_f = (1/n) Σ_{i=m}^{m+n-1} f(x_i)
- Unbiased estimator (given convergence at iteration m):
      E[μ̂_f] = (1/n) Σ_{i=m}^{m+n-1} E[f(x_i)] = (1/n) Σ_{i=m}^{m+n-1} μ_f = μ_f
- What about the variance?
A (very) simple MCMC example

- Note: this is just for illustration, you should never never use MCMC for this distribution!
- Let
      π(x) = (10^x / x!) e^{-10},   x = 0, 1, 2, …

  [Figure: bar plot of the Poisson(10) pmf π(x) for x = 0, …, 30]

- Set x_0 to 0, 1 or 2 with probability 1/3 for each
- Markov chain kernel:
      P(x_{i+1} = x_i - 1 | x_i) = x_i/20                  if x_i ≤ 9,
                                   1/2                     if x_i > 9
      P(x_{i+1} = x_i | x_i)     = (10 - x_i)/20           if x_i ≤ 9,
                                   (x_i - 9)/(2(x_i + 1))  if x_i > 9
      P(x_{i+1} = x_i + 1 | x_i) = 1/2                     if x_i ≤ 9,
                                   5/(x_i + 1)             if x_i > 9
- This Markov chain has limiting distribution π(x) (will explain why later)
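Simulating this kernel directly is straightforward; the sketch below (my code, with made-up variable names) runs the chain and compares the empirical frequencies with the Poisson(10) target:

```python
import numpy as np

rng = np.random.default_rng(1)

def step(x):
    """One transition of the kernel defined above."""
    if x <= 9:
        p_down, p_up = x / 20, 1 / 2
    else:
        p_down, p_up = 1 / 2, 5 / (x + 1)
    u = rng.uniform()
    if u < p_down:
        return x - 1
    if u < p_down + p_up:
        return x + 1
    return x  # stay

x = int(rng.integers(0, 3))  # x_0 uniform on {0, 1, 2}
chain = []
for _ in range(100_000):
    x = step(x)
    chain.append(x)

# After convergence the empirical frequencies should match Poisson(10)
vals, counts = np.unique(chain[1000:], return_counts=True)
print(dict(zip(vals.tolist(), (counts / counts.sum()).round(3).tolist())))
```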
A (very) simple MCMC example (cont.)

Trace plots of three runs:

[Figure: three trace plots over 1000 iterations, with states ranging over 0–20]
A (very) simple MCMC example (cont.)

Convergence to the target distribution:

[Figure: distribution of the chain after 0, 1 and 2 iterations, each plotted against the Poisson(10) target for x = 0, …, 30]
A (very) simple MCMC example (cont.)

Convergence to the target distribution:

[Figure: distribution of the chain after 5, 10 and 20 iterations, each plotted against the Poisson(10) target for x = 0, …, 30]
A (very) simple MCMC example (cont.)

Convergence to the target distribution:

[Figure: distribution of the chain after 50, 100 and 150 iterations, each plotted against the Poisson(10) target for x = 0, …, 30]
Markov chain Monte Carlo

Note:
- the chain x_0, x_1, x_2, … is not converging!
- the distribution P(X_n = x) is converging
- we simulate/observe only the chain x_0, x_1, x_2, …

We therefore:
- need a (general) way to construct a Markov chain for a given target distribution π(x)
- must be able to simulate the Markov chain easily (or at least at all)
- need to decide when (we think) the chain has converged (well enough)
Some Markov chain theory

- A Markov chain (x discrete) is a discrete-time stochastic process {X_i}_{i=0}^∞ which fulfils the Markov assumption
      P(X_{i+1} = x_{i+1} | X_0 = x_0, …, X_i = x_i) = P(X_{i+1} = x_{i+1} | X_i = x_i)
- Thus a Markov chain can be specified by
  - the initial distribution P(X_0 = x_0) = g(x_0)
  - the transition kernel/matrix P(y|x) = P(X_{i+1} = y | X_i = x)
- Different notations are used:
      P_{ij},   P_{xy},   P(x, y),   P(y|x)
Some Markov chain theory

- A Markov chain (x discrete) is defined by
  - initial distribution: f(x_0)
  - transition kernel: P(y|x); note Σ_y P(y|x) = 1
- Unique limiting distribution π(x) = lim_{i→∞} f(x_i) if the chain is irreducible, aperiodic and positive recurrent
- If so, we have
      π(y) = Σ_x π(x) P(y|x)   for all y   (1)
- Note: a sufficient condition for (1) is the detailed balance condition
      π(x) P(y|x) = π(y) P(x|y)   for all x, y
  Proof:
      Σ_x π(x) P(y|x) = Σ_x π(y) P(x|y) = π(y) Σ_x P(x|y) = π(y)
- Note:
  - in a stochastic modelling setting: P(y|x) is given, want to find π(x)
  - in an MCMC setting: π(x) is given, need to find a P(y|x)
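As a sanity check (my addition, not from the slides), detailed balance can be verified numerically for the simple Poisson(10) chain from earlier: π(x)P(x+1|x) should equal π(x+1)P(x|x+1) for every x.

```python
import math

pi = lambda x: 10**x * math.exp(-10) / math.factorial(x)  # Poisson(10) pmf

def p_up(x):    # P(x+1 | x) from the kernel above
    return 0.5 if x <= 9 else 5 / (x + 1)

def p_down(x):  # P(x-1 | x) from the kernel above
    return x / 20 if x <= 9 else 0.5

# detailed balance: pi(x) P(x+1|x) == pi(x+1) P(x|x+1) for all x
for x in range(30):
    assert math.isclose(pi(x) * p_up(x), pi(x + 1) * p_down(x + 1))
print("detailed balance holds for x = 0, ..., 29")
```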
Implementation of the MCMC idea

- Given a (limiting) distribution π(x), x ∈ Ω
- Want a transition kernel so that
      π(y) = Σ_x π(x) P(y|x)   for all y
- Any solutions?
  - # of unknowns: |Ω|(|Ω| - 1)
  - # of equations: |Ω| - 1
- Difficult to construct P(y|x) from the above
- Require instead the detailed balance condition
      π(x) P(y|x) = π(y) P(x|y)   for all x, y
- Any solutions?
  - # of unknowns: |Ω|(|Ω| - 1)
  - # of equations: |Ω|(|Ω| - 1)/2
- Still many solutions
- Recall: we don't need all solutions, one is enough!
- A general (and easy) construction strategy for P(y|x) is available: the Metropolis-Hastings algorithm
Metropolis-Hastings algorithm

- Detailed balance condition:
      π(x) P(y|x) = π(y) P(x|y)   for all x, y
- Choose
      P(y|x) = Q(y|x) α(y|x)   for y ≠ x,
  where
  - Q(y|x) is a proposal kernel; we can choose this
  - α(y|x) ∈ [0, 1] is an acceptance probability; need to find a formula for this
- Recall: must have Σ_y P(y|x) = 1 for all x, so then
      P(x|x) = 1 - Σ_{y≠x} Q(y|x) α(y|x)
- Simulation algorithm:
  - generate initial state x_0 ∼ f(x_0)
  - for i = 1, 2, …
    - propose potential new state y_i ∼ Q(y_i | x_{i-1})
    - compute acceptance probability α(y_i | x_{i-1})
    - draw u_i ∼ Uniform(0, 1)
    - if u_i ≤ α(y_i | x_{i-1}) accept y_i, i.e. set x_i = y_i; otherwise reject y_i and set x_i = x_{i-1}
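In code the algorithm is only a few lines. Below is a generic sketch (my formulation; working with log π and log Q is a standard trick against numerical overflow, not something the slides prescribe):

```python
import numpy as np

def metropolis_hastings(log_pi, propose, log_q, x0, n_iter, rng):
    """Generic Metropolis-Hastings sampler following the algorithm above.

    log_pi(x)   -- log target density (normalising constant may be omitted)
    propose(x)  -- draws y ~ Q(. | x)
    log_q(y, x) -- log Q(y | x)
    """
    x, chain = x0, [x0]
    for _ in range(n_iter):
        y = propose(x)
        # log of the ratio  pi(y) Q(x|y) / (pi(x) Q(y|x))
        log_ratio = log_pi(y) + log_q(x, y) - log_pi(x) - log_q(y, x)
        if rng.uniform() < np.exp(min(0.0, log_ratio)):
            x = y            # accept the proposal
        chain.append(x)      # on rejection, the old state is repeated
    return chain
```

Any proposal Q can be plugged in; the ±1 random-walk proposal in the example revisited below is one instance.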
The acceptance probability

- Recall the detailed balance condition:
      π(x) P(y|x) = π(y) P(x|y)   for all x, y
- With the proposal form
      P(y|x) = Q(y|x) α(y|x)   for y ≠ x
  we must have
      π(x) Q(y|x) α(y|x) = π(y) Q(x|y) α(x|y)   for all x ≠ y
- General solution:
      α(y|x) = r(x, y) π(y) Q(x|y)   where r(x, y) = r(y, x)
- Recall: must have α ≤ 1, so
      α(y|x) = r(x, y) π(y) Q(x|y) ≤ 1  ⟹  r(x, y) ≤ 1 / (π(y) Q(x|y))
      α(x|y) = r(x, y) π(x) Q(y|x) ≤ 1  ⟹  r(x, y) ≤ 1 / (π(x) Q(y|x))
- Choose r(x, y) as large as possible:
      r(x, y) = min{ 1 / (π(x) Q(y|x)),  1 / (π(y) Q(x|y)) }
- Thus
      α(y|x) = min{ 1,  (π(y) Q(x|y)) / (π(x) Q(y|x)) }
Metropolis-Hastings algorithm

- Recall: for convergence it is sufficient that the chain satisfies
  - detailed balance
  - irreducibility
  - aperiodicity
  - positive recurrence
- Detailed balance: OK by construction
- Irreducibility: must be checked in each case (usually easy)
- Aperiodicity: sufficient that P(x|x) > 0 for one x, for example by α(y|x) < 1 for one pair x, y
- Positive recurrence: in a discrete state space, irreducibility plus a finite state space is sufficient
  - more difficult in general, but the Markov chain drifts off if it is not recurrent
  - usually not a problem in practice
Metropolis-Hastings algorithm

- Building blocks:
  - target distribution π(x) (given by the problem)
  - proposal distribution Q(y|x) (we choose)
  - acceptance probability
        α(y|x) = min{ 1,  (π(y) Q(x|y)) / (π(x) Q(y|x)) }
- Note: an unknown normalising constant in π(x) is OK, since it cancels in the ratio π(y)/π(x)
- A little history:
  - Metropolis et al. (1953). Equation of state calculations by fast computing machines. Journal of Chemical Physics.
  - Hastings (1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika.
  - Green (1995). Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika.
A simple MCMC example (revisited)

- Let
      π(x) = (10^x / x!) e^{-10},   x = 0, 1, 2, …

  [Figure: bar plot of the Poisson(10) pmf π(x) for x = 0, …, 30]

- Proposal distribution:
      Q(y|x) = 1/2 for y ∈ {x - 1, x + 1},  0 otherwise
- Acceptance probability (Q is symmetric, so the Q-terms cancel):
      y = x - 1:  α(x-1|x) = min{ 1, [(10^{x-1}/(x-1)!) e^{-10}] / [(10^x/x!) e^{-10}] } = min{ 1, x/10 }
      y = x + 1:  α(x+1|x) = min{ 1, [(10^{x+1}/(x+1)!) e^{-10}] / [(10^x/x!) e^{-10}] } = min{ 1, 10/(x+1) }
- P(y|x) then becomes as specified before
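The last point can be checked numerically; a small sketch (my addition) confirms that Q(y|x)·α(y|x) reproduces the kernel stated on the earlier slide:

```python
import math

alpha_down = lambda x: min(1.0, x / 10)        # alpha(x-1 | x)
alpha_up   = lambda x: min(1.0, 10 / (x + 1))  # alpha(x+1 | x)

for x in range(20):
    # P(y|x) = Q(y|x) * alpha(y|x), with Q(y|x) = 1/2 on {x-1, x+1}
    p_down, p_up = 0.5 * alpha_down(x), 0.5 * alpha_up(x)
    # the kernel as specified before
    p_down_ref = x / 20 if x <= 9 else 0.5
    p_up_ref   = 0.5 if x <= 9 else 5 / (x + 1)
    assert math.isclose(p_down, p_down_ref) and math.isclose(p_up, p_up_ref)
print("Q * alpha matches the kernel for x = 0, ..., 19")
```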
A (very) simple MCMC example

- Note: this is just for illustration, you should never use MCMC for this distribution!
- Let
      π(x) = (10^x / x!) e^{-10},   x = 0, 1, 2, …

  [Figure: bar plot of the Poisson(10) pmf π(x) for x = 0, …, 30]

- Set x_0 to 0, 1 or 2 with probability 1/3 for each
- Markov chain kernel (as before):
      P(x_{i+1} = x_i - 1 | x_i) = x_i/20                  if x_i ≤ 9,
                                   1/2                     if x_i > 9
      P(x_{i+1} = x_i | x_i)     = (10 - x_i)/20           if x_i ≤ 9,
                                   (x_i - 9)/(2(x_i + 1))  if x_i > 9
      P(x_{i+1} = x_i + 1 | x_i) = 1/2                     if x_i ≤ 9,
                                   5/(x_i + 1)             if x_i > 9
- This Markov chain has limiting distribution π(x), as the Metropolis-Hastings construction above shows
Another MCMC example: Ising

- 2D rectangular lattice of nodes
- Number the nodes from 1 to N

  [Figure: lattice with nodes numbered row by row: 1, 2, …, 10 in the first row, 11, 12, … in the second, and so on up to 100]

- x_i ∈ {0, 1}: value (colour) in node i; x = (x_1, …, x_N)
- First-order neighbourhood
- Probability distribution:
      π(x) = c exp{ β Σ_{i∼j} I(x_i = x_j) }
  where the sum is over pairs of neighbouring nodes i ∼ j
- β: parameter; c: normalising constant,
      c = [ Σ_x exp{ β Σ_{i∼j} I(x_i = x_j) } ]^{-1}
Ising example (cont.)

- Probability distribution:
      π(x) = c exp{ β Σ_{i∼j} I(x_i = x_j) }
- Proposal algorithm:
  - current state: x = (x_1, …, x_N)
  - draw a node k ∈ {1, …, N} at random
  - propose to reverse the value of node k, i.e.
        y = (x_1, …, x_{k-1}, 1 - x_k, x_{k+1}, …, x_N)
- Proposal kernel:
      Q(y|x) = 1/N if x and y differ in (exactly) one node,  0 otherwise
- Acceptance probability:
      α(y|x) = min{ 1,  (π(y) Q(x|y)) / (π(x) Q(y|x)) }
             = min{ 1,  exp{ β Σ_{j∼k} [ I(x_j = 1 - x_k) - I(x_j = x_k) ] } }
  (only the terms involving node k change, and Q is symmetric, so everything else cancels)
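A sketch of this single-site sampler (my implementation; free boundaries on the lattice are my assumption, since the slides do not specify the boundary conditions):

```python
import numpy as np

def ising_mh(n_side, beta, n_sweeps, rng):
    """Single-site Metropolis-Hastings for the Ising model above.

    First-order neighbourhood on an n_side x n_side lattice with
    free boundaries (an assumption); x[i, j] in {0, 1}.
    """
    x = np.zeros((n_side, n_side), dtype=int)      # x_0 = all 0s
    for _ in range(n_sweeps * n_side * n_side):
        i, j = rng.integers(n_side, size=2)        # node k drawn at random
        xk = x[i, j]
        # sum over neighbours j~k of I(x_j = 1 - x_k) - I(x_j = x_k)
        s = 0
        for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            ni, nj = i + di, j + dj
            if 0 <= ni < n_side and 0 <= nj < n_side:
                s += int(x[ni, nj] == 1 - xk) - int(x[ni, nj] == xk)
        if rng.uniform() < min(1.0, np.exp(beta * s)):
            x[i, j] = 1 - xk                       # accept the flip
    return x

rng = np.random.default_rng(0)
x = ising_mh(n_side=10, beta=0.87, n_sweeps=1000, rng=rng)
print("number of 1s:", int(x.sum()))
```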
Ising example (cont.)

β = 0.87, x_0 = 0 (all nodes 0)

[Figure: lattice configurations after 0, 200n, 400n, 600n, 800n, 1000n, 5000n, 10000n, 15000n, 20000n, 25000n and 30000n iterations]
Ising example (cont.)

- Trace plot of the number of 1s
- Three runs, with different initial states:
  - all 0s
  - all 1s
  - independent random value in each node

[Figure: trace plots of the number of 1s for the three runs, iterations 0–50000]
Continuous state space

- Target distribution
  - discrete: π(x), x ∈ Ω
  - continuous: π(x), x ∈ ℝ^N
- Proposal distribution
  - discrete: Q(y|x)
  - continuous: Q(y|x)
- Acceptance probability: in both cases
      α(y|x) = min{ 1,  (π(y) Q(x|y)) / (π(x) Q(y|x)) }
- Rejection probability
  - discrete:
        r(x) = 1 - Σ_{y≠x} Q(y|x) α(y|x)
  - continuous:
        r(x) = 1 - ∫_{ℝ^N} Q(y|x) α(y|x) dy
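In the continuous case the same acceptance formula applies unchanged; a minimal sketch (my addition) with a Gaussian random-walk proposal, which is symmetric so the Q-ratio cancels, and a standard normal demo target:

```python
import numpy as np

rng = np.random.default_rng(0)
log_pi = lambda x: -0.5 * x**2        # demo target: unnormalised N(0, 1)

x, sigma, chain, n_accept = 0.0, 1.0, [], 0
for _ in range(50_000):
    y = x + sigma * rng.normal()      # Q(y|x) = N(y; x, sigma^2), symmetric
    # Q(x|y) = Q(y|x), so alpha = min(1, pi(y)/pi(x))
    if rng.uniform() < min(1.0, np.exp(log_pi(y) - log_pi(x))):
        x, n_accept = y, n_accept + 1
    chain.append(x)

print("acceptance rate:", n_accept / len(chain))
print("sample mean and sd:", np.mean(chain), np.std(chain))
```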
Plan

- The Markov chain Monte Carlo (MCMC) idea
- Some Markov chain theory
- Implementation of the MCMC idea
- Metropolis-Hastings algorithm
- MCMC strategies
  - independent proposals
  - random walk proposals
  - combination of strategies
  - Gibbs sampler
- Convergence diagnostics
  - trace plots
  - autocorrelation functions
  - one chain or many chains?
- Typical MCMC problems and some remedies
  - high correlation between variables
  - multimodality
  - different scales