
Markov Chain Monte Carlo (MCMC)

Professor M Ali, Wits

1 Monte Carlo Methods


Monte Carlo (MC) methods were invented in the context of the development of the
atomic bomb in the 1940s. These methods can be applied to a vast range of problems.
Monte Carlo methods (or Monte Carlo simulations) are a class of computational al-
gorithms that rely on repeated random (independent & identically distributed, aka i.i.d)
sampling to compute results of some interest. Results of interest include numerical integration
(e.g. Monte Carlo integration), probability estimation or estimation of confidence inter-
vals, distribution properties (estimating mean and variance), error estimation (to estimate
the uncertainty on the mean and variance of a random variable), Bayesian inference and
global optimization, among many other applied problems. MC methods generally provide
approximate results (or solutions).
At this point I would like to distinguish between Monte Carlo and simulation. Sim-
ulation means producing i.i.d samples (aka an i.i.d sample) from a certain distribution
and simply looking at them, while MC simulation means carrying out some sort of estimation using the
samples. We often have a choice between Monte Carlo and deterministic methods. For
example, if X is a random variable, we can estimate some properties, e.g. E[h(X)], of the
distribution using analytical integration. This would certainly be more accurate than
Monte Carlo integration. However, analytical integration is not always possible. Hence
MC methods are used in cases where analytical or numerical solutions do not exist or are
too difficult to implement.
There are many Monte Carlo methods in which i.i.d samples are generated, using inver-
sion of the CDF (Cumulative Distribution Function), also called the transformation method, rejection sam-
pling, and importance sampling. They all have their strengths and weaknesses. Markov
Chain Monte Carlo (MCMC) is also a Monte Carlo method. MCMC is a particularly
flexible MC method which is often easy to implement. Samples generated by MCMC are
not necessarily independent samples. The Metropolis-Hastings (MH) algorithm is
one of the MCMC algorithms; it simulates a Markov chain with a desired target distri-
bution. Before describing MCMC in more detail I would like to shed some light on
sampling methods such as inversion and importance sampling.

1.1 Sampling using inverse transformation of CDF


Let us consider the random variable X with density f_X(x) and CDF F_X(x). If we want
to generate samples of X, we need to integrate f_X(x). Note that

F_X(x) = \Pr\{X \le x\} = \int_{-\infty}^{x} f_X(t)\,dt.

Let us assume another random variable U such that U = F_X(X). We assume that F_X^{-1}
exists. Now clearly, 0 ≤ U ≤ 1. Consider the CDF of the random variable (RV) U:

F_U(u) = \Pr\{U \le u\}
       = \Pr\{F_X(X) \le u\}
       = \Pr\{X \le F_X^{-1}(u)\}
       = F_X\big(F_X^{-1}(u)\big)
       = u.

Clearly U ∼ Unif(0, 1), where Unif(0, 1) denotes the uniform distribution on (0, 1). Hence we
can generate samples of X (which has density f_X(x) and CDF F_X(x)) by computing
X = F_X^{-1}(U). Suppose we wish to generate an i.i.d sample x^i, i = 1, 2, ..., N, from f_X(x).
The scheme for the i-th sample is as follows:

• Draw u, where U ∼ Unif(0, 1).

• Compute x^i such that F_X(x^i) = u.

• Take x^i to be the random number drawn from the distribution with CDF F_X(x)
(since X = F_X^{-1}(U) has distribution F_X(x)).

Consider the following examples. (i) X ∼ f_X(x) = e^{-x}, x ≥ 0, with F_X(x) = 1 − e^{-x};
setting u = F_X(x), u ∈ (0, 1), gives x = −ln(1 − u). (ii) X ∼ f_X(x) = \frac{3}{16}\sqrt{x},
0 ≤ x ≤ 4, with F_X(x) = \frac{1}{8} x^{3/2}; setting u = F_X(x), u ∈ (0, 1), gives x = 4u^{2/3}.
We can sample u randomly over (0, 1) and map these values to samples from the distribution F_X(x)
in which we are interested. However, this does not work well in higher dimensions, and even
in a single variable it is not always possible to integrate f_X(x) (or to invert F_X(x)) in closed form.
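As an illustration (my own sketch, not part of the original notes), here is a minimal Python implementation of inversion sampling for the exponential example above; only NumPy is assumed, and the sample mean should be close to E[X] = 1.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_exponential_by_inversion(n):
    """Draw n samples from f_X(x) = exp(-x), x >= 0, via X = F_X^{-1}(U).

    F_X(x) = 1 - exp(-x), so F_X^{-1}(u) = -log(1 - u).
    """
    u = rng.uniform(0.0, 1.0, size=n)   # U ~ Unif(0, 1)
    return -np.log(1.0 - u)             # X = F_X^{-1}(U)

x = sample_exponential_by_inversion(10_000)
print(x.mean())  # should be close to E[X] = 1
```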

1.2 Sampling using the importance sampling technique


The idea of selecting statistical samples to approximate a hard problem by a much simpler
one is at the heart of modern Monte Carlo simulation. We often want to generate random
samples from a particular distribution, for example, to approximate the distribution or to
compute an integral involving the distribution. Consider the following integral over a region Ω:

I = \int_{\Omega} h(x)\,dx.    (1)

For various reasons (one of the reasons is that one cannot always integrate h(x) analytically, or
one must take great pains to integrate it numerically, especially in high dimensions) we want to
find an approximate result for (1). For ease of explanation, let us take the 1-dimensional
case (the same idea holds in many dimensions), Ω = [a, b], and rewrite (1) as

I = \int_{a}^{b} w(x) f_X(x)\,dx,

where w(x) = (b − a) h(x) and f_X(x) = \frac{1}{b-a} is the uniform density on [a, b], so we can draw
samples x^i, i = 1, ..., N, from f_X(x). Here I = E_{f_X}[w(X)] with X ∼ Unif(a, b). We
can approximate the above integral as follows (the larger N is, the better the approximation):

I \approx \frac{1}{N} \sum_{i=1}^{N} w(x^i), \quad x^i \sim f_X(x)    (2)
  \approx (b - a)\, \frac{1}{N} \sum_{i=1}^{N} h(x^i), \quad x^i \sim f_X(x).

For independent samples, by the Law of Large Numbers,

\frac{1}{N} \sum_{i=1}^{N} w(x^i) \to E_{f_X}[w(X)], \qquad N \to \infty.    (3)
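As a hedged illustration (my own example, not from the notes), the following Python sketch applies (2) to the integral \int_0^{\pi} \sin(x)\,dx = 2 using uniform samples on [0, \pi].

```python
import numpy as np

rng = np.random.default_rng(1)

def mc_integral_uniform(h, a, b, n):
    """Estimate I = int_a^b h(x) dx as in (2):
    draw x^i ~ Unif(a, b) and average w(x^i) = (b - a) * h(x^i)."""
    x = rng.uniform(a, b, size=n)
    return (b - a) * np.mean(h(x))

# Example: int_0^pi sin(x) dx = 2
print(mc_integral_uniform(np.sin, 0.0, np.pi, 100_000))  # approx 2
```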

We can generalize this to compute integrals of the form

E_{f_X}[h(X)] = \int_{\Omega} h(x) f_X(x)\,dx,    (4)

and when h(x) = x, (4) gives the mean of f_X(x); when h(x) = x^2 we get the second
moment of f_X(x); when h(x) = 1_{[A]}(x) we get the probability of A under f_X(x) (where 1_{[A]}
is the indicator function: it equals 1 when x ∈ A and zero otherwise). In general,
h(x) can be any quantity of interest (for example, h(x) may be the profit of some investment x)
whose mean value under X ∼ f_X(x) we want. Recall that we can
approximate (4) as long as we have i.i.d samples x^i drawn from f_X(x). We will soon discuss
the situation in which we are unable to obtain i.i.d observations x^i, i.e. we cannot draw
samples from f_X(x), but first let us consider the integral:
I = \int_{0}^{2} h(x)\,dx, \qquad h(x) = x^3    (5)

  = \int_{0}^{2} w(x) f_X(x)\,dx, \qquad w(x) = 2x^3, \; f_X(x) = \tfrac{1}{2}    (6)

  = \int_{0}^{2} w(x) f_X(x)\,dx, \qquad w(x) = \tfrac{8}{3} x, \; f_X(x) = \tfrac{3}{8} x^2    (7)
One can see that there are two probability density functions in (6) and (7).
Homework 1: Find the analytical result of (5). Use (2) to find approximate results
for (6) and (7) with N = 10000. Which of the two f_X(x) gives the better approximation?
Why?

I now return to my earlier question: how can we generalize the above idea (nu-
merical approximation) when we cannot obtain i.i.d observations x^i from f_X(x)? This situation
arises because f_X(x) is often so complicated that we are not able to generate i.i.d samples from
it, and therefore we cannot perform Monte Carlo integration. Importance sampling is one
way out.
If we cannot get i.i.d samples from f_X(x), we can use an auxiliary (importance) distri-
bution q(x) that we can sample from, and use the alternative representation

E_{f_X}[h(X)] = \int_{\Omega} h(x) f_X(x)\,dx = \int_{\Omega} h(x) \frac{f_X(x)}{q(x)} q(x)\,dx = \int_{\Omega} w(x) q(x)\,dx,

where w(x) = h(x)\, \frac{f_X(x)}{q(x)}. We see that

E_{f_X}[h(X)] \approx \frac{1}{N} \sum_{i=1}^{N} w(x^i), \qquad x^i \sim q(x),

where q(x) is known as the proposal density. Importance sampling performs well as long
as supp(f_X) ⊂ supp(q), where supp(f_X) = \{x : f_X(x) > 0\} is the support
of the distribution f_X(x) (i.e., the set of points with non-zero probability). This condition
simply says that our proposal q(x) must have non-zero probability of visiting every state
that has non-zero probability under the target density f_X(x). Often q(x) is chosen to be a
Gaussian with large variance. The main difficulty with importance sampling is how to
find a good q(x) when dealing with a problem in high dimension. Moreover, if q(x)
is a poor choice, then importance sampling generates samples in regions of little or
no density under f_X(x). Hence, we seek a method that generates samples x^i
from the high-density regions of f_X(x). This is possible via MCMC (Markov Chain Monte
Carlo), where the samples x^i are generated in such a manner that they are correlated and
not independent, hence not i.i.d samples. Nevertheless, the approximation (3) still applies if
we generate the samples using a Markov chain.
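The following Python sketch (my own illustration, under the assumption that the target is a standard normal and the proposal a wider Gaussian, so that the support condition holds) estimates E_{f_X}[h(X)] with h(x) = x^2 by importance sampling; the true value is 1.

```python
import numpy as np

rng = np.random.default_rng(2)

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def importance_sampling_mean(h, n):
    """Estimate E_{f_X}[h(X)] for target f_X = N(0, 1)
    using proposal q = N(0, 3^2), whose support covers that of f_X."""
    x = rng.normal(0.0, 3.0, size=n)                      # x^i ~ q(x)
    w = h(x) * normal_pdf(x, 0.0, 1.0) / normal_pdf(x, 0.0, 3.0)
    return np.mean(w)

# E[X^2] under N(0, 1) equals 1
print(importance_sampling_mean(lambda x: x**2, 100_000))
```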

1.3 Monte Carlo integration


The general multidimensional integration problem is of the form:
I = \int_{\Omega} h(x) f_X(x)\,dx,
where Ω is the region over which the integration is defined and f_X(x) is the probability den-
sity of X. This can be computed numerically by generic Monte Carlo, where a random set
of N values x^i sampled uniformly from within the integration region Ω gives an estimate
of the integral:

I \approx V\, \frac{1}{N} \sum_{i=1}^{N} h(x^i) f_X(x^i),
where V is the volume of Ω. This is, however, very inefficient, particularly for functions that
vary significantly within the integration region (like posterior distributions in Bayesian
statistics). It would be much better if we could guarantee that the random set of values
x^i we use is distributed (at least asymptotically) in proportion to f_X(x), as the integral estimate
would then reduce to:

I \approx \frac{1}{N} \sum_{i=1}^{N} h(x^i).

The MCMC method is used to generate such correlated samples x^i with the help of a
Markov chain. Monte Carlo integration is the most common application of Monte Carlo
methods. MCMC methods are sophisticated and general algorithms for simulation
from complex probability models: high dimensional, highly non-Gaussian, highly non-
linear and possibly multi-modal. This will be discussed in the next section.

2 Markov Chain Monte Carlo
We first present the basic concept of MCMC before discussing its framework and mathe-
matical background. Consider a naive example. Suppose I want to estimate the probability that
a standard normal random variable X ∼ N(0, 1) is less than 0.5. I could generate ten
thousand independent observations x^t (realizations) from the standard normal distribution
and count the number less than 0.5; say I got 6905 that were less than 0.5 out of 10000
total samples; my estimate of Pr(X ≤ 0.5) would then be 0.6905, which is not far off
from the actual value (approximately 0.6915). With enough simulated random numbers, the esti-
mate is very good, but the process is still inherently random. That would be a Monte Carlo
estimate. Notice that in Monte Carlo methods we need i.i.d samples from a distribution.
Now imagine I could not draw independent normal random variates (realizations or
observations), e.g. X ∼ f_X(x). Instead I would start at 0 (the initial x^0), and then at
every step add some uniform random number between -0.5 and 0.5 to my current value,
i.e. propose x̂ = x^t + ε with ε ∼ Unif(−0.5, 0.5) (this defines the proposal density
q(x̂|x^t)), and then decide, based on a particular test, whether I like the
proposed value x̂ or not; if I like it, I take it as my new current value, x^{t+1} = x̂, and if not,
I reject it and stick with my old value, x^{t+1} = x^t. Because I only look at the proposed and current
values, this is a Markov chain. (I will elaborate on Markov chains shortly.) If I set up
the test for deciding whether or not to keep the new value correctly (this would be a random-walk
MCMC, and the details are a bit involved at this point), then even though I never generate
a single normal random variate directly, if I run this procedure for long enough, the list of
(accepted) values x^t, t = 1, 2, ..., N, I get from the procedure will be distributed
like a large number of draws from something that generates normal random variates. This
would give me a Markov Chain Monte Carlo simulation for a standard normal random
variable. If I use this to estimate probabilities (like the one in the previous paragraph, or the
tail probability Pr\{X > \alpha\} = \int_{\alpha}^{\infty} f_X(x)\,dx, X ∼ f_X(x)), that would be an MCMC
estimate. We see that in Markov Chain Monte Carlo methods one obtains samples x^t (not
i.i.d samples) using local information. The samples are dependent (correlated).
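A minimal Python sketch of the random-walk procedure just described (my own illustration; the acceptance test anticipates the Metropolis rule presented later, with target f_X = N(0, 1) and a Unif(−0.5, 0.5) proposal step):

```python
import numpy as np

rng = np.random.default_rng(3)

def target_unnormalized(x):
    """Proportional to the standard normal density N(0, 1)."""
    return np.exp(-0.5 * x**2)

def random_walk_chain(n_steps, x0=0.0):
    xs = np.empty(n_steps)
    x = x0
    for t in range(n_steps):
        x_hat = x + rng.uniform(-0.5, 0.5)          # propose a local move
        accept_prob = min(1.0, target_unnormalized(x_hat) / target_unnormalized(x))
        if rng.uniform() < accept_prob:             # Metropolis test
            x = x_hat                               # accept the proposal
        xs[t] = x                                   # otherwise keep the old value
    return xs

samples = random_walk_chain(100_000)
print((samples <= 0.5).mean())   # MCMC estimate of Pr(X <= 0.5), approx 0.69
```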
So, what is the goal? The goal of MCMC is to draw samples from some probability
distribution without having to know its exact height at any point. The way MCMC
achieves this is to 'wander around' on that distribution in such a way that the amount
of time spent in each location is proportional to the height of the distribution. If the
'wandering around' process is set up correctly, we can make sure that this proportionality
(between time spent and height of the distribution) is achieved.
Intuitively, what we want to do is to walk around on some (lumpy) surface in such a
way that the amount of time we spend (or the number of samples drawn) in each location
is proportional to the height of the surface at that location. So, we would like to spend
twice as much time on a hilltop that is at an altitude of 100m as we do on a nearby hill
that is at an altitude of 50m. The nice thing is that we can do this even if we do not know
the absolute heights of points on the surface: all we have to know are the relative heights.
For example, if one hilltop A is twice as high as hilltop B, then we would like to spend twice as
much time at A as we spend at B. This makes MCMC much simpler to implement and
allows us to draw samples (indirectly) from very complicated distributions (e.g. Bayesian
posterior distributions).
The simplest variant of MCMC (the Metropolis-Hastings algorithm) achieves this as
follows: assume that at every (discrete) time step t we pick a random new 'proposed'
location (selected uniformly across the entire surface). If the proposed location is higher
than where we are standing now, we move to it. If the proposed location is lower, then we move
to the new location with probability p, where p is the ratio of the height of that point to the
height of the current location. Keep a list of the locations you have visited at every time
step, and that list will (asymptotically) have the right proportion of time spent in each part
of the surface. For the A and B hills described above, you will end up with twice the
probability of moving from B to A as of moving from A to B.
There are more complicated schemes for proposing new locations and the rules for
accepting them, but the basic idea is still: (a) pick a new ‘proposed’ location; (b) figure
out how much higher or lower that location is compared to your current location; (c)
probabilistically stay put or move to that location in a way that respects the overall goal of
spending time proportional to the height of the location. Consider repeating (a)-(c) many
times (many time steps t), where at each step t the move to a new location at time (t + 1)
depends only on the (current) location at time t (the future depends only on the current state
of the process and not on the past, i.e. it is memoryless, and hence a Markov chain). Clearly,
all the locations generated in this way are outcomes of a Markov chain. Moreover, the
Markov chain soon forgets its initial position x^0.
So far I have only mentioned the use of a Markov chain in the Monte Carlo simulation,
but I have not said why we need a Markov chain. So why 'Markov chain'? Because under
certain technical conditions (yet to be discussed), one can generate a memoryless process
that has the same limiting distribution as the random variable that we are trying to sim-
ulate. Indeed, if we repeat (a)-(c) (which generate correlated random results or locations)
long enough, we are guaranteed that once we pool enough of the results, we will end up
with a pile of numbers (locations) that looks 'as if' we had somehow managed to take
independent samples (after sufficiently many time steps t) from the complicated
distribution we wanted to know about or sample from.
This means that MCMC simulates a Markov chain (a sequence of random variables
Xt ) such that for a large enough t, the last random variable Xt of the chain is distributed
according to fX (x), the target distribution. Notice that no samples were generated from
the target density fX (x) (samples are generated from the proposal density).
The idea is therefore to construct a Markov chain which converges to the desired
probability distribution after a number of steps. The state of the chain after a large number
of steps is then used as a sample from the desired or target distribution. There are many
different MCMC algorithms which use different techniques for generating the Markov
chain; common ones include Metropolis-Hastings. Clearly, there are some technical
conditions on the Markov chain which guarantee its convergence to the
desired or target distribution. These conditions will be presented in the next section.
MCMC methods are computational algorithms. They are a way of estimating some-
thing which is too difficult or time consuming to find deterministically. They are basically
a form of computer simulation of some mathematical or physical process. Here are some
of the problems that we will study through MCMC. Monte Carlo integration works well
on high-dimensional functions by taking a random (asymptotically random) sample of
points of the function and calculating some type of average at these various points. By
increasing the sample size, the law of large numbers (together with the central limit theorem) tells us
that we can increase the accuracy of our approximation by covering more and more of the
function.
In optimization, the objective is to compute the global optimum of some complicated
objective function (e.g. a maximum likelihood or a Bayesian posterior probability). In this
context, we will see how MCMC is combined with an algorithm called 'simulated
annealing'.
We end this section with the following motto: 'construct a Markov chain that
converges to a target distribution'. Before I present the Markov chain, I introduce a
number of notations and symbols. Both the time t and the sample space (state space) Ω
can be treated as continuous or discrete, but in this topic I consider time t as discrete
and Ω as either continuous or discrete (or a finite state space). Hence for a discrete or finite
state space, Ω = {1, 2, ..., n}. The random variable X_t (or any random variable
X) is discrete if it is defined on a discrete Ω, and continuous when Ω
is continuous. When the state space is continuous the corresponding (target) density will
be written as f_X(x) (or f_{X_t}(x)). For a discrete sample space the corresponding density
is commonly known as the probability mass function or simply the probability
distribution function (for which the cumulative distribution function
(CDF) can also be defined). I will refer to this (target) discrete density as the probability
distribution and denote it by π. Hence, for discrete time and discrete space Ω, X_t ∼ π,
and for discrete time and continuous space, X_t ∼ f_X(x). The starting or initial time
will be denoted t = 0.

2.1 Markov chain (MC)


To begin with I consider a Markov chain with discrete time and finite state space Ω which
has a more straightforward statistical analysis. Hence I will refer to the probability distri-
bution π which is a finite dimensional column vector. Whenever possible I will provide
examples of discrete time and continuous state space.
Before presenting the formal mathematical definition of a Markov chain, let me con-
sider a weather system. Assume that on each day the weather can be either sunny (sunny
state) or rainy (rainy state), i.e. Ω = {1, 2}. When I speak about time I mean
discrete time (not continuous time), i.e. t = every day (today, tomorrow and any other future
day), t = 0, 1, 2, ..., so there is always a next state and the process does not termi-
nate. The changes of state of the weather system are called transitions. We can associate
probabilities with the various state changes (from rainy to sunny and vice versa).
Clearly these probabilities can be denoted as p_{12} (the probability of being sunny today and
rainy the next day), p_{21}, p_{11} (the probability of being sunny today and staying sunny the next day) and p_{22}.
These probabilities are called the transition probabilities. What happens to the weather
system each day is a random process, so we can associate a random variable X_t with
each t. The random variable X_0 on the very first day can take two values (each with
a probability), e.g. X_0 ∼ (1, 0); the RV X_1 also takes two values, each with a probability
(possibly different from before), e.g. X_1 ∼ (1/3, 2/3), and so on. The possible values of
X_t form the set Ω. Suppose that what happens to the weather system tomorrow depends
only on the weather today (and not on all previous days). There are many problems
similar to the weather system I have just described. These problems can
be defined mathematically by a Markov chain (MC).
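As an illustration (my own example; the transition probabilities below are assumed, not given in the notes), a short Python sketch simulating the two-state weather chain and printing the empirical fraction of sunny days:

```python
import numpy as np

rng = np.random.default_rng(4)

# States: 0 = sunny, 1 = rainy. Assumed transition matrix P = (p_ij).
P = np.array([[0.8, 0.2],    # p11, p12
              [0.4, 0.6]])   # p21, p22

def simulate_weather(n_days, state=0):
    states = np.empty(n_days, dtype=int)
    for t in range(n_days):
        states[t] = state
        state = rng.choice(2, p=P[state])  # tomorrow depends only on today
    return states

path = simulate_weather(100_000)
print((path == 0).mean())  # long-run fraction of sunny days (approx 2/3 here)
```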
A MC is a random process that evolves in time, X_0, X_1, ..., with the property that the
FUTURE is independent of the PAST given the PRESENT.

A set of random variables {X_t; t = 0, 1, 2, ...} is a Markov chain if:

\Pr\{X_{t+1} = j \mid X_0, X_1, \dots, X_t = i\} = \Pr\{X_{t+1} = j \mid X_t = i\}, \quad i, j \in \Omega, \ \forall t.

A MC is specified by giving

• An initial distribution: λ(0) (a column vector) for which

λ(0) (i) = Pr{X0 = i}, i ∈ Ω

e.g. λ(0) (1) = 1 and λ(0) (2) = 0 in my example above where X0 ∼ λ(0) .

• Transition probabilities: P = (p_{ij}) (a matrix) such that

p_{ij} = \Pr\{X_{t+1} = j \mid X_t = i\}, \quad i, j \in \Omega, \ \forall t.

• We assume the transition probabilities do not change with time. A Markov chain is
time homogeneous if

\Pr\{X_{t+1} = j \mid X_t = i\} = \Pr\{X_{r+1} = j \mid X_r = i\}, \quad \forall \, t, r,

i.e. \Pr\{X_{t+1} = j \mid X_t = i\} is independent of t, to put it differently. Time-homoge-
neous Markov chains are also known as stationary Markov chains. Notice that

\sum_{j \in \Omega} p_{ij} = 1

since the chain must be in some state in the next step. Therefore the matrix P is a
stochastic matrix.

• Let P^{(j)} be the j-th column of P. Then

\lambda^{(0)T} P^{(j)} = \sum_{i} \lambda^{(0)}(i)\, p_{ij}    (8)
 = \sum_{i} \Pr\{X_0 = i\} \Pr\{X_1 = j \mid X_0 = i\}    (9)
 = \sum_{i} \Pr\{X_0 = i, X_1 = j\}    (10)
 = \Pr\{X_1 = j\}    (11)
 = \lambda^{(1)}(j).    (12)

We can consider all j ∈ Ω in the above and obtain another distribution, say λ^{(1)}, such
that X_1 ∼ λ^{(1)}. One can see that λ^{(1)T} = λ^{(0)T} P. Continuing in this way we obtain X_t ∼ λ^{(t)}.
We see that we are generating a MC {X_0, X_1, ..., X_t} induced by the transition matrix
P. It follows that λ^{(t)T} = λ^{(0)T} P^t. Clearly, the initial distribution λ^{(0)} and the matrix
P capture all the relevant information about the dynamics of the Markov chain. The
following questions are important in the study of Markov chains:

• Does there exist a distribution π such that π^T = π^T P? If such a π exists, it is called
a stationary distribution, invariant distribution or equilibrium distribution. In this
case

\sum_{i} \pi(i)\, p_{ij} = \pi(j),

i.e. π is the (normalized) left eigenvector of P with eigenvalue 1. We know that
X_t ∼ λ^{(t)}, but does X_t ∼ π hold as t → ∞?
• If there exists a unique stationary distribution π, does λ^{(t)} → π for all λ^{(0)}? In
other words, does the distribution of the Markov chain converge to the stationary
distribution starting from any initial distribution λ^{(0)}?

Given that a stationary distribution exists, does the second bullet hold true for all
Markov chains? It turns out that both irreducibility and aperiodicity of the MC are
important for achieving λ^{(t)} → π for every initial λ^{(0)}. In fact, a finite-state, irreducible
Markov chain has a unique stationary distribution π, and aperiodicity is the further
condition needed for convergence. A sufficient condition for a given distribution to be stationary
is reversibility of the MC with respect to it (see Definition 4 below); detailed balance is this
sufficient condition for stationarity. Thus, the stationary distribution is completely characterized.
If a MC is both irreducible and aperiodic then the MC is called ergodic. Informally, this means that
anything that can happen eventually will happen.
Definition 1: A probability distribution π (probability mass function) on Ω
is stationary (invariant) with respect to P if π^T = π^T P. □
Definition 2: Irreducibility. An irreducible MC is one in which any state can be reached from any
other state in a finite number of moves.
We say that a MC is irreducible if, for each pair of states i and j, there exists a finite integer k
(possibly depending upon i and j) such that p_{ij}^{(k)} > 0, where

p_{ij}^{(k)} = \Pr\{X_{t+k} = j \mid X_t = i\};    (14)

the superscript (k) is an index and not an exponent. In other words, a chain is irreducible
if it is possible to eventually get from any state i to any other state j in a finite number of
steps. □
Remark 1: Consider the transition matrix P for two states such that p_{11} = 1, p_{12} =
0, p_{21} = 0, p_{22} = 1. Then for any distribution π we have π^T P = π^T, and the stationary
distribution is not unique. The Markov chain in this example is such that if it starts in
one state, it remains in that state forever. In general, Markov chains in which
one state cannot be reached from some other state will not possess a unique stationary
distribution. This is because the MC is reducible.
Definition 3: Aperiodicity. A state i has period k if any return to state i occurs in multiples
of k time steps. Formally, the period of a state is defined as

k = \gcd\{t > 0 : \Pr\{X_t = i \mid X_0 = i\} > 0\}

(where gcd is the greatest common divisor over all such t), provided that this set is not empty. If
k = 1 then the state is said to be aperiodic, i.e. returns to state i can occur at irregular times/steps.
□
Remark 2: Consider the transition matrix P for two states such that p_{11} = 0, p_{12} =
1, p_{21} = 1, p_{22} = 0. The stationary distribution is π^T = (1/2, 1/2), but the system does not
converge to this stationary distribution starting from an arbitrary initial condition. To see this, if
we take λ^{(0)T} = (1, 0) then λ^{(t)} \neq π as t → ∞. This is because the MC is periodic.

The following example illustrates the computation of the steady-state (stationary) dis-
tribution of a Markov chain. Consider a three-state Markov chain with state space
Ω = {a, b, c}. If the Markov chain is in state a, it switches from the current state to one
of the other two states, each with probability 1/4, or remains in the same state. If it is in
state b, it switches to state c with probability 1/2 or remains in the same state. If it is in
state c, it switches to state a with probability 1. Construct the transition matrix P. This
Markov chain is irreducible since it can go from any state to any other state in finite time
with non-zero probability. Next, note that there is a non-zero probability of remaining in
state a if the Markov chain starts in state a; therefore p_{aa}^{(k)} > 0 (see Eqn. (14)) for all
k, and state a is aperiodic. Since the Markov chain is irreducible, this implies that all the
states are aperiodic. Thus, the finite-state Markov chain induced by P is irreducible and
aperiodic, which implies the existence of a stationary distribution, (1/2, 1/4, 1/4), to which the
probability distribution converges starting from any initial distribution.
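A short Python sketch (my own check, not part of the notes) that builds the transition matrix of this example and verifies the stationary distribution numerically by iterating λ^{(t)T} = λ^{(0)T} P^t:

```python
import numpy as np

# Transition matrix for states (a, b, c) as described above
P = np.array([[0.5, 0.25, 0.25],
              [0.0, 0.5, 0.5],
              [1.0, 0.0, 0.0]])

lam = np.array([1.0, 0.0, 0.0])   # arbitrary initial distribution lambda^(0)
for _ in range(200):              # lambda^(t)T = lambda^(0)T P^t
    lam = lam @ P

print(lam)  # approx [0.5, 0.25, 0.25], the stationary distribution
```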
Remark 3: If the state space is infinite (countably infinite, e.g. the set of integers, or
continuous), the existence of a stationary distribution is not guaranteed even if the Markov
chain is irreducible and aperiodic. Hence, for a general state space Ω the following sufficient
condition is used.
Definition 4: Reversibility. A Markov chain is reversible if there exists a stationary dis-
tribution π such that:

\pi(i) \Pr\{X_{t+1} = j \mid X_t = i\} = \pi(j) \Pr\{X_{t+1} = i \mid X_t = j\} \qquad \forall \, t, i, j. \quad □
A Markov chain can be irreducible but not reversible. The MC induced by the transi-
tion matrix

P = \begin{pmatrix} \frac{1}{3} & \frac{1}{3} & \frac{1}{3} \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix}

is not reversible. A possible sequence of states is 1 → 3 → 2 → 1; an impossible sequence is 1 → 2 →
3 → 1. There is thus a sequence of states for which it is possible to tell in which direction the
simulation has occurred, and so the chain is not reversible.
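A small Python check (my own illustration) that computes the stationary distribution of this matrix and shows that detailed balance π(i) p_{ij} = π(j) p_{ji} fails, confirming that the chain is not reversible:

```python
import numpy as np

P = np.array([[1/3, 1/3, 1/3],
              [1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])

# Stationary distribution: left eigenvector of P with eigenvalue 1
eigvals, eigvecs = np.linalg.eig(P.T)
pi = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
pi = pi / pi.sum()
print(pi)                                  # approx [1/2, 1/3, 1/6]

# Detailed balance check: pi_i * p_ij versus pi_j * p_ji
print(pi[0] * P[0, 1], pi[1] * P[1, 0])    # 1/6 vs 1/3  -> not reversible
```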

Hence P is reversible w.r.t. π if

\pi(i)\, p_{ij} = \pi(j)\, p_{ji}, \qquad i, j \in \Omega,

i.e. the MC looks the same running forward or backward (the detailed balance condition).
If a Markov chain is reversible, then:

\sum_{j} \pi(j) \Pr\{X_{t+1} = i \mid X_t = j\} = \sum_{j} \pi(i) \Pr\{X_{t+1} = j \mid X_t = i\},

\sum_{j} \pi(j) \Pr\{X_{t+1} = i \mid X_t = j\} = \pi(i).

This is the ergodic behaviour we want: no matter where we start, after some time we will
be in state i with probability π(i). π is the (normalized) left eigenvector of P with
eigenvalue 1, since in matrix notation

\pi^T P = \pi^T.    (15)

We need to solve π^T P = π^T for π to be stationary. As an illustration with |Ω| = 3 and
transitions only between neighbouring states, it is clear from Eq. (15) that

\pi(i) = \pi(i-1)\, p_{i-1,i} + \pi(i)\, p_{ii} + \pi(i+1)\, p_{i+1,i},

which is only possible when \pi(i)\, p_{i,i+1} = \pi(i+1)\, p_{i+1,i} (the detailed balance relation), since

p_{i,i-1} + p_{ii} + p_{i,i+1} = 1.
A MC with discrete time and continuous state space is reversible if there exists
a distribution f_X(x) such that

f_X(x) \Pr\{X_{t+1} = y \mid X_t = x\} = f_X(y) \Pr\{X_{t+1} = x \mid X_t = y\},

so that

\int_{x} f_X(x) \Pr\{X_{t+1} = y \mid X_t = x\}\,dx = \int_{x} f_X(y) \Pr\{X_{t+1} = x \mid X_t = y\}\,dx = f_X(y).
If a chain satisfies detailed balance, then π is its stationary distribution, i.e. it satisfies (15). In
MCMC we often deal with continuous state spaces, so we will write p(x, x^t) for the transition
probability (kernel) from x^t to x, instead of p_{ij}, and f_X(x) for the target stationary distribution,
instead of π. In this case, detailed balance means

f_X(x^t)\, p(x, x^t) = f_X(x)\, p(x^t, x),

so that

\int_{x^t} f_X(x^t)\, p(x, x^t)\,dx^t = \int_{x^t} f_X(x)\, p(x^t, x)\,dx^t
 = f_X(x) \int_{x^t} p(x^t, x)\,dx^t
 = f_X(x).
The conditional probability density, or proposal density, is q(x|x^t) (which need not
be conditioned on x^t, especially in the case of an independence sampler). Hence
q(x|x^t) is the probability of sampling x from the proposal density, and

p(x, x^t) = q(x|x^t) \times \alpha(x, x^t),    (16)

where p(x, x^t) is the transition probability from x^t to x and \alpha(x, x^t) is the probability
of accepting x while at x^t.
Remark 4: For a discrete state space Ω, we define the proposal distribution by q_{ij} =
Q(i, j) (or Q = (q_{ij})), the probability of generating state j while at state i, and α(j, i) is the
acceptance probability of state j while in state i. Hence

p_{ij} = \Pr\{X_{t+1} = j \mid X_t = i\} = Q(i, j) \times \alpha(j, i).    (17)
Homework 2: Prove that Reversibility → Stationarity
Remark 5: In Markov chain theory we are given a MC with transition matrix P, and we
find its equilibrium distribution π. In MCMC we are given a distribution, π or f_X(x), and
we construct a MC that is reversible with respect to it.

2.2 The Markov Chain Monte Carlo algorithm
At the beginning of this section I restate our main goals: (i) to sample from a target
distribution π (or f_X(x)), and (ii) to approximate E[h(X)] where X ∼ π (or X ∼ f_X(x)).
I have demonstrated earlier that we can solve many problems if we can achieve (i) and
(ii). Bad news: in many problems, methods are unavailable for direct simulation of an
i.i.d sample from f_X(x) (or π). Good news: in many problems, methods such as the
Metropolis-Hastings algorithm can be used to simulate a Markov chain {X_0, ..., X_t}
which converges in distribution to f_X(x) (or π). This was made clear in the previous
subsection.
We have discussed that, under some conditions (reversibility, irreducibility, aperiodicity),
the distribution of X_t for large enough t converges to the target π (or f_X(x))
and the chain is ergodic. The chain 'forgets' its initialization. Where does all this get
us? We still do not know how to obtain the matrix P. Indeed, designing a perfect transition
matrix P (a transition kernel for a continuous state space) is what will allow us to draw samples from
π (or f_X(x)).
So, can we build a Markov chain that has pretty much any requested equilibrium
distribution? YES!! The answer is Markov chain Monte-Carlo with a perfect transition
matrix. If we succeed in designing such a transition matrix then:

• MCMC can use a homogeneous, ergodic and reversible Markov chain to generate
consistent samples drawn from any given target distribution.

• No specific knowledge of the target distribution and no specific support function are
required.

To construct the stochastic matrix P = (p_{ij}) (or the transition kernel p(y, x^t) =
\Pr\{X_{t+1} = y \mid X_t = x^t\} for a continuous state space) we must define a (problem-de-
pendent) proposal matrix Q = (q_{ij}) (or proposal distribution q(x|x^t) for a continuous
state space) such that the resulting P satisfies detailed balance. We define

p_{ij} = q_{ij} \times \alpha(j, i), \qquad Q = (q_{ij}),    (18)

where the proposal matrix Q is also stochastic and problem dependent, and

\alpha(j, i) = \min\left\{1, \frac{q_{ji}\, \pi(j)}{q_{ij}\, \pi(i)}\right\}.

Since the generation probability q(y|x^t) and the acceptance probability \alpha(y, x^t) are inde-
pendent, the resulting macroscopic dynamics, i.e. the transition matrix (kernel), is

p(y, x^t) = \Pr\{X_{t+1} = y \mid X_t = x^t\} = q(y|x^t)\, \alpha(y, x^t), \qquad x^t \neq y,

and p(x^t, x^t) can be found from the requirement that

\int_{y} p(y, x^t)\,dy = 1.

Hence for the continuous state space we write

p(\hat{x}, x^t) = q(\hat{x}|x^t)\, \Pr\{\text{Accept } \hat{x} \mid \hat{x}, x^t\} = q(\hat{x}|x^t) \times \alpha(\hat{x}, x^t),    (19)

where q(\hat{x}|x^t) = Q(\hat{x}, x^t) is the probability of drawing \hat{x} from the proposal distribution
q(x|x^t), and

\alpha(\hat{x}, x^t) = \min\left\{1, \frac{f_X(\hat{x})\, Q(x^t, \hat{x})}{f_X(x^t)\, Q(\hat{x}, x^t)}\right\}
 = \min\left\{1, \frac{f_X(\hat{x})\, q(x^t|\hat{x})}{f_X(x^t)\, q(\hat{x}|x^t)}\right\}.    (20)

It can be shown that if the Markov chain {X_0, X_1, ..., X_t} induced by the stochastic
matrix Q is irreducible, then the Markov chain induced by the transition matrix P is also
irreducible (under some further conditions, since P depends on α). Moreover, if the MC
induced by the matrix Q is aperiodic, then the MC induced by the matrix P is also aperiodic.
The resulting (Metropolis-Hastings) Markov chain is reversible with respect to π (or
f_X(x)). If it is also irreducible and aperiodic, we have an ergodic Markov chain with a
unique stationary and limiting distribution π (or f_X(x)).

2.3 The Proposal distribution


It is now clear that the construction of the transition matrix P (the transition kernel for contin-
uous Ω) is completely determined by the matrix Q (or q(x|x^t) for a continuous state space).
We assume that we have a distribution π (or f_X(x)) that we want to sample from. We
need to define a proposal distribution q(x|x^t) and an initial value x^0. Common choices of the
proposal distribution are as follows:

• Random-walk Metropolis-Hastings MCMC: the proposal distribution depends
on the current position of the MC. For example, (a) Gaussian update: q(\hat{x}|x^t) = N(x^t, \sigma^2), i.e.
\hat{x} = x^t + \epsilon_t where \epsilon_t \sim N(0, \sigma^2) (with target, say, f_X(x) = N(0, 1)); (b) uniform update:
\hat{x} = x^t + \epsilon_t, \epsilon_t \sim Unif(-v, v).

• Independence Metropolis-Hastings MCMC: the proposal distribution does not de-
pend on the current position of the MC, i.e. the proposal is independent of X_t: q(\hat{x}|x^t) =
q(\hat{x}). For example, q(\hat{x}|x^t) = N(0, \sigma^2), i.e. white noise:
\hat{x} = \epsilon_t, \epsilon_t \sim N(0, \sigma^2).

Remark 6: The construction of the stochastic proposal matrix Q will be discussed thoroughly
for the case where Ω is discrete when we solve global optimization problems.
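A minimal Python sketch (my own illustration, with assumed step sizes) of the two proposal types above, written as functions that return a proposed point x̂ given the current point x^t:

```python
import numpy as np

rng = np.random.default_rng(5)

def random_walk_proposal(x_t, sigma=1.0):
    """Random-walk proposal: x_hat = x_t + eps, eps ~ N(0, sigma^2)."""
    return x_t + rng.normal(0.0, sigma)

def independence_proposal(x_t, sigma=1.0):
    """Independence proposal: x_hat ~ N(0, sigma^2), ignoring x_t."""
    return rng.normal(0.0, sigma)
```

Either function can be plugged into the Metropolis-Hastings loop of the next subsection; note that with the independence proposal the correction factor q(x^t|x̂)/q(x̂|x^t) in (20) no longer cancels.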

2.4 The Metropolis-Hastings MCMC algorithm


With the choice of the proposal density q(x) (or the stochastic matrix Q) we can con-
struct a Markov chain, induced by the transition matrix P (or the transition kernel for
continuous Ω), for which all three technical conditions (irreducibility, aperiodicity and re-
versibility) hold. MCMC can therefore generate a large enough sample x^0, x^1, ..., x^t
where x^t can be regarded as drawn from the target density, i.e. x^t ∼ f_X(x) when t is very large
(without ever sampling directly from the density f_X(x) or π). Given the simulated path of the Markov chain we
can compute Monte Carlo expectations of any quantities of interest by averages along
the sample path. Note that no specific knowledge of the target distribution and no specific
support function are required. Hence we can achieve the two goals stated at the start of section 2.2.
The Metropolis-Hastings MCMC algorithm is presented below.

The MCMC Algorithm

Step 0: Initialize x^0 and set t = 0.
Step 1: For t = 0, 1, ..., do the following operations:

• Sample u ∼ Unif(0, 1).

• Sample x̂ ∼ q(x|x^t).

• If u < α(x̂, x^t), with α given by Eq. (20) (or, in the discrete case, the formula below Eq. (18)), then x^{t+1} = x̂.

• Else x^{t+1} = x^t.
Step 2: Output x^0, x^1, ..., x^{N−1}.
Step 3: End MCMC.

In the above algorithm, if q(x|x^t) is chosen such that the Markov chain satisfies modest
conditions (e.g. irreducibility and aperiodicity), then convergence to f_X(x) is guaranteed.
However, the rate of convergence will depend on the relationship between q(x|x^t) and
f_X(x). The advantage of the Metropolis-Hastings algorithm is that it is only required to
sample from the proposal q(x|x^t), and we are free to choose any proposal density which
is easy to sample from, as long as the resulting chain is irreducible and aperiodic.
The proposals q(x) suggested in the previous subsection achieve these properties.
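A minimal, self-contained Python sketch of the algorithm above (my own illustration, using a Gaussian random-walk proposal; since the proposal is symmetric, the correction factor in (20) cancels):

```python
import numpy as np

rng = np.random.default_rng(6)

def metropolis_hastings(log_target, x0, n_steps, sigma=1.0):
    """Random-walk Metropolis-Hastings.

    log_target: log of the (possibly unnormalized) target density f_X(x).
    A symmetric proposal q(x_hat | x^t) = N(x^t, sigma^2) is used, so the
    acceptance probability reduces to min(1, f_X(x_hat) / f_X(x^t)).
    """
    xs = np.empty(n_steps)
    x = x0
    for t in range(n_steps):
        x_hat = x + rng.normal(0.0, sigma)          # Step 1: propose x_hat
        log_alpha = log_target(x_hat) - log_target(x)
        if np.log(rng.uniform()) < log_alpha:       # accept with prob alpha
            x = x_hat
        xs[t] = x                                   # else keep the current value
    return xs

# Example target: unnormalized density exp(-|x|) (a Laplace distribution)
samples = metropolis_hastings(lambda x: -abs(x), x0=0.0, n_steps=50_000)
print(samples.mean(), samples.var())   # mean approx 0, variance approx 2
```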
The ability to simulate from complicated multivariate probability distributions via MCMC
has had an impact in many areas of statistics, but most profoundly on Bayesian approaches to
statistical modeling. Suppose we want to draw samples from the posterior density

f_X(\theta|d) = \frac{f_0(\theta)\, L(d|\theta)}{\int_{\theta} f_0(\theta)\, L(d|\theta)\,d\theta},    (21)

where f_0(\theta) is the prior and L(d|\theta) is the model (or likelihood function), the data d is a
realization of the random variable D, and \theta^T = (\theta_1, \theta_2, ..., \theta_n). The most difficult part of the
posterior is the normalizing constant

\int_{\theta} f_0(\theta)\, L(d|\theta)\,d\theta.

However, in MCMC this does not need to be evaluated when we want to sample from the
complicated density f_X(\theta|d). Why? (You can use \theta = x and write f_X(\theta|d) = f_X(x).)
We complete this section with a note that the very first MCMC algorithm is the Metropolis
algorithm, see Metropolis et al. (1953) [1]. The only difference between the Metropolis-
Hastings MCMC algorithm and the Metropolis algorithm is that the latter uses a symmetric proposal
q in

\min\left\{1, \frac{f_X(\hat{x})\, q(x^t|\hat{x})}{f_X(x^t)\, q(\hat{x}|x^t)}\right\},

which then reduces to

\alpha = \min\left\{1, \frac{f_X(\hat{x})}{f_X(x^t)}\right\}

since q(x^t|\hat{x}) = q(\hat{x}|x^t).

2.5 MCMC for global optimization


Simulated annealing: Implementation
Simulated annealing requires:

• A configuration of all possible states Ω, i.e. the space of the problem

• An internal energy E, which we want to minimize

• ‘Options’ for the next state after the current state, similar to updates in Metropolis
algorithms

• A temperature T, which controls the probability of accepting an increase in energy: an
increase \Delta E > 0 is accepted if u < \exp\!\left(\frac{-\Delta E}{kT}\right), u \sim Unif(0, 1)

• A cooling schedule, which determines how T evolves with the number of iterations (see
the sketch below)
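A minimal Python sketch of simulated annealing (my own illustration; the energy function, step size and geometric cooling schedule are assumed for the example, not taken from the notes):

```python
import numpy as np

rng = np.random.default_rng(7)

def energy(x):
    """Assumed multimodal objective to minimize; global minimum near x = -0.3."""
    return x**2 + 2.0 * np.sin(5.0 * x)

def simulated_annealing(x0, n_steps, t0=5.0, cooling=0.999):
    x, T = x0, t0
    best = x
    for _ in range(n_steps):
        x_new = x + rng.normal(0.0, 0.5)            # propose a nearby state
        dE = energy(x_new) - energy(x)
        # accept downhill moves always, uphill moves with prob exp(-dE / T)
        if dE < 0 or rng.uniform() < np.exp(-dE / T):
            x = x_new
        if energy(x) < energy(best):
            best = x
        T *= cooling                                # cooling schedule
    return best

print(simulated_annealing(x0=3.0, n_steps=10_000))
```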

References
1. N. Metropolis et al., Equation of State Calculations by Fast Computing Machines,
Journal of Chemical Physics, Vol. 21, 1953, pp. 1087-1092.

