Let us define another random variable $U = F_X(X)$, and assume that $F_X^{-1}$ exists. Clearly, $0 \le U \le 1$. Consider the CDF of the random variable (RV) $U$:
\begin{align*}
F_U(u) &= \Pr\{U \le u\} \\
&= \Pr\{F_X(X) \le u\} \\
&= \Pr\{X \le F_X^{-1}(u)\} \\
&= F_X\left(F_X^{-1}(u)\right) \\
&= u .
\end{align*}
Clearly $U \sim \mathrm{Unif}(0, 1)$, where $\mathrm{Unif}(0, 1)$ is the uniform density on $(0, 1)$. Hence we can generate samples from $X$ (which has density $f_X(x)$ and CDF $F_X(x)$) by computing $X = F_X^{-1}(U)$. Suppose we want to generate an i.i.d. sample $x_i$, $i = 1, 2, \cdots, N$, from $f_X(x)$. The scheme for the $i$-th sample is as follows:
• Draw $u$ from $U \sim \mathrm{Unif}(0, 1)$.
• Take $x_i = F_X^{-1}(u)$ to be the random number drawn from the distribution described by $F_X$ (since $X = F_X^{-1}(U)$ has distribution $F_X(x)$).
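The two-step scheme above can be sketched in a few lines of Python. This is only an illustration under assumptions that are not in the notes: the target is taken to be an exponential distribution with rate `rate`, for which $F_X(x) = 1 - e^{-\mathrm{rate}\, x}$ and hence $F_X^{-1}(u) = -\ln(1 - u)/\mathrm{rate}$; the function name `sample_exponential` is hypothetical.

```python
import math
import random


def sample_exponential(rate, n, rng):
    """Draw n samples from Exp(rate) via X = F^{-1}(U), U ~ Unif(0, 1).

    Illustrative choice of target: F(x) = 1 - exp(-rate * x),
    so F^{-1}(u) = -log(1 - u) / rate.
    """
    samples = []
    for _ in range(n):
        u = rng.random()                            # step 1: u in [0, 1)
        samples.append(-math.log(1.0 - u) / rate)   # step 2: x = F^{-1}(u)
    return samples


rng = random.Random(0)
xs = sample_exponential(rate=2.0, n=100_000, rng=rng)
sample_mean = sum(xs) / len(xs)   # should be close to 1 / rate = 0.5
```

The sample mean is compared with the known mean $1/\mathrm{rate}$ only as a sanity check; any distribution with an invertible CDF could be substituted.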
For various reasons (for example, $h(x)$ cannot always be integrated analytically, and numerical integration is painful, especially in high dimensions) we want to find an approximate value of (1). For ease of explanation, let us take the one-dimensional case (the argument also holds in many dimensions) with $\Omega = [a, b]$. We can rewrite (1) as
\[
\int_\Omega w(x) f_X(x)\,dx,
\]
where $w(x) = (b - a)h(x)$ and $f_X(x) = \frac{1}{b-a}$ is the uniform density, and so we can draw samples $x_i$, $i = 1, \cdots, N$, from $f_X(x)$. Here $I = E_{f_X}[w(X)]$ with $X \sim \mathrm{Unif}(a, b)$. We can approximate the above integral as (the larger $N$, the better the approximation):
\begin{align}
I &\approx \frac{1}{N} \sum_{i=1}^{N} w(x_i), \qquad x_i \sim f_X(x) \tag{2} \\
&= (b - a) \frac{1}{N} \sum_{i=1}^{N} h(x_i), \qquad x_i \sim f_X(x). \notag
\end{align}
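As a minimal sketch of approximation (2), the following Python code averages $(b - a)h(x_i)$ over uniform draws. The integrand $h(x) = \sin x$ on $[0, \pi]$ (exact value 2) and the function name `mc_integrate` are illustrative assumptions, not taken from the notes.

```python
import math
import random


def mc_integrate(h, a, b, n, rng):
    """Estimate the integral of h over [a, b] as (b - a) * mean of h(x_i),
    with x_i drawn i.i.d. from Unif(a, b), as in approximation (2)."""
    total = 0.0
    for _ in range(n):
        x = rng.uniform(a, b)   # x_i ~ f_X(x) = 1 / (b - a)
        total += h(x)
    return (b - a) * total / n


rng = random.Random(1)
estimate = mc_integrate(math.sin, 0.0, math.pi, 100_000, rng)
# The exact value of the integral of sin(x) over [0, pi] is 2.
```

Increasing `n` shrinks the random error roughly like $1/\sqrt{N}$, which is the sense in which a larger $N$ gives a better approximation.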
and when $h(x) = x$, (4) gives the mean of $f_X(x)$; when $h(x) = x^2$ we get the second moment of $f_X(x)$; when $h(x) = 1[A]$ we get the probability of $A$ under $f_X(x)$ (where $1[A]$ is the indicator function: it is equal to 1 when $x \in A$, and zero otherwise). In general $h(x)$ can be any quantity of interest (for example, $h(x)$ is the profit of some investment $x$) whose mean value under $X \sim f_X(x)$ we want. Recall that we can approximate (4) as long as we have i.i.d. samples $x_i$ drawn from $f_X(x)$. We will soon discuss the situation in which we are unable to obtain i.i.d. observations $x_i$, i.e. we cannot draw samples from $f_X(x)$, but first let us consider the integral:
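\begin{align}
I &= \int_0^2 h(x)\,dx, \qquad h(x) = x^3 \tag{5} \\
&= \int_0^2 w(x) f_X(x)\,dx, \qquad w(x) = 2x^3, \quad f_X(x) = \frac{1}{2} \tag{6} \\
&= \int_0^2 w(x) f_X(x)\,dx, \qquad w(x) = \frac{8}{3}x, \quad f_X(x) = \frac{3}{8}x^2 \tag{7}
\end{align}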
One can see that two different probability density functions appear in (6) and (7).
Homework 1: Find the analytical value of (5). Use (2) with $N = 10000$ to approximate (6) and (7). Which of the two $f_X(x)$ gives the better approximation? Why?
I now return to my previous question. How can we generalize the above idea (numerical approximation) when we cannot obtain i.i.d. observations $x_i$ from $f_X(x)$? This happens because $f_X(x)$ is often so complicated that we are unable to generate i.i.d. samples from it, and therefore we cannot perform Monte Carlo integration directly. Importance sampling is one of the ways out.
If we cannot get i.i.d. samples from $f_X(x)$, we can use an auxiliary (importance) distribution $q(x)$ that we can sample from, and use the alternative representation
\[
E_{f_X}[h(X)] = \int_\Omega h(x) f_X(x)\,dx = \int_\Omega h(x) \frac{f_X(x)}{q(x)} q(x)\,dx = \int_\Omega w(x) q(x)\,dx,
\]
where $w(x) = h(x) \frac{f_X(x)}{q(x)}$. We see that
\[
E_{f_X}[h(X)] \approx \frac{1}{N} \sum_{i=1}^{N} w(x_i), \qquad x_i \sim q(x).
\]
Here $q(x)$ is known as the proposal density. Importance sampling works as long as $\mathrm{supp}(f_X) \subset \mathrm{supp}(q)$, where $\mathrm{supp}(f_X) = \{x : f_X(x) > 0\}$ is the support of the distribution $f_X(x)$ (i.e., the set of points with non-zero density). This condition just says that our proposal $q(x)$ must assign non-zero probability to the states that have non-zero probability under the target density $f_X(x)$. Generally, $q(x)$ is chosen as a Gaussian with large variance. The main difficulty with importance sampling is finding a good $q(x)$ for a high-dimensional problem. Moreover, if $q(x)$ is a poor choice, importance sampling generates samples in regions where $f_X(x)$ has little or no density. Hence we must look for a method that generates samples $x_i$ from the high-density regions of $f_X(x)$. This is possible via MCMC (Markov Chain Monte Carlo), where the samples $x_i$ are generated in such a manner that they are correlated, and hence are not i.i.d. samples. However, the approximation (3) still applies if we generate samples using a Markov chain.
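The importance sampling estimate can be sketched as follows. The concrete choices are assumptions made purely for illustration: the target $f_X$ is a standard normal density, $h(x) = x^2$ (so the exact expectation is 1), and the proposal $q$ is a Gaussian with standard deviation 2, in line with the "Gaussian with large variance" suggestion. The helper names `normal_pdf` and `importance_estimate` are hypothetical.

```python
import math
import random


def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def importance_estimate(h, f, q_pdf, q_sample, n, rng):
    """Estimate E_f[h(X)] as the mean of w(x_i) = h(x_i) f(x_i) / q(x_i),
    with x_i drawn i.i.d. from the proposal q."""
    total = 0.0
    for _ in range(n):
        x = q_sample(rng)                  # x_i ~ q(x)
        total += h(x) * f(x) / q_pdf(x)    # weight w(x_i)
    return total / n


rng = random.Random(2)
is_estimate = importance_estimate(
    h=lambda x: x * x,                        # second moment, exact value 1
    f=lambda x: normal_pdf(x, 0.0, 1.0),      # illustrative target density
    q_pdf=lambda x: normal_pdf(x, 0.0, 2.0),  # illustrative proposal density
    q_sample=lambda r: r.gauss(0.0, 2.0),
    n=100_000,
    rng=rng,
)
```

Because the proposal is wider than the target, the support condition holds and the weights stay bounded; a proposal narrower than the target would make the weights, and hence the estimate, far more variable.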
where $\Omega$ is the region over which the integration is defined and $f_X(x)$ is the probability density of $X$. This can be computed numerically by generic Monte Carlo, where a random set of $N$ values $x_i$ sampled uniformly from within the integration region $\Omega$ gives an estimate of the integral:
\[
I \approx V \frac{1}{N} \sum_{i=1}^{N} h(x_i) f_X(x_i)
\]
where $V$ is the volume of $\Omega$. This is, however, very inefficient, particularly for functions that vary significantly within the integration region (like posterior distributions in Bayesian statistics). It would be much better if we could guarantee that the random values $x_i$ we are using are (at least asymptotically) distributed in proportion to $f_X(x)$, as the integral estimate would then reduce to:
\[
I \approx \frac{1}{N} \sum_{i=1}^{N} h(x_i).
\]
The MCMC method is used to generate such correlated samples $x_i$ with the help of a Markov chain. Monte Carlo integration is the most common application of Monte Carlo methods. MCMC methods are sophisticated and general algorithms for simulation from complex probability models: high dimensional, highly non-Gaussian, highly non-linear and possibly multi-modal. This will be discussed in the next section.
2 Markov Chain Monte Carlo
We first present the basic concept of MCMC before discussing its framework and mathematical background. Consider a naive example. Suppose I want to estimate the probability that a standard normal random variable $X \sim N(0, 1)$ is less than 0.5. I could generate ten thousand independent observations $x_t$ (realizations) from the standard normal distribution and count the number less than 0.5. Say 6905 of the 10000 samples were less than 0.5; my estimate for $\Pr(X \le 0.5)$ would then be 0.6905, which is not far off from the actual value (about 0.6915). With enough simulated random numbers the estimate is very good, but the process is still inherently random. That would be a Monte Carlo estimate. Notice that in Monte Carlo methods we need i.i.d. samples from a distribution.
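The naive Monte Carlo estimate described above can be reproduced with a short Python sketch. The seed and variable names are arbitrary assumptions; the exact value is computed from the error function, $\Phi(0.5) = \frac{1}{2}\left(1 + \mathrm{erf}(0.5/\sqrt{2})\right)$.

```python
import math
import random

rng = random.Random(3)
n = 10_000

# Count how many i.i.d. standard normal draws fall below 0.5.
count = sum(1 for _ in range(n) if rng.gauss(0.0, 1.0) < 0.5)
mc_probability = count / n

# Exact value Pr(X <= 0.5) for X ~ N(0, 1), via the error function.
exact_probability = 0.5 * (1.0 + math.erf(0.5 / math.sqrt(2.0)))
```

With 10000 draws the standard error of the estimate is about 0.005, so estimates such as 0.6905 are typical.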
Now imagine I could not draw independent normal random variates (realizations or observations) from $X \sim f_X(x)$. Instead I would start at 0 (the initial $x_0$), and at every step propose a new value by adding a uniform random number between $-0.5$ and $0.5$ to my current value, i.e. $\hat{x} = x_t + \epsilon$, $\epsilon \sim \mathrm{Unif}(-0.5, 0.5)$ (this is the proposal density, i.e. $q(x|x_t) = \mathrm{Unif}(x_t - 0.5, x_t + 0.5)$), and then decide, based on a particular test, whether I like that new value or not. If I like it, I use the new value as my current one, $x_{t+1} = \hat{x}$; if not, I reject it and stick with my old value, $x_{t+1} = x_t$. Because I only look at the new and current values, this is a Markov chain. (I will elaborate on the Markov chain soon.) If I set up the test that decides whether or not I keep the new value correctly (this would be a random walk MCMC; the details are a bit complex for now), then even though I never generate a single normal random variate directly, if I run this procedure for long enough, the list of (accepted) values $x_t$, $t = 1, 2, \cdots, N$, that I get from the procedure will be distributed like a large number of draws from something that generates normal random variates. This would give me a Markov Chain Monte Carlo simulation for a standard normal random variable. If I use this to estimate probabilities (like the one in the previous paragraph, or the tail probability $\Pr\{X > \alpha\} = \int_\alpha^\infty f_X(x)\,dx$, $X \sim f_X(x)$), that would be an MCMC estimate. We see that in Markov Chain Monte Carlo methods one gets samples $x_t$ (not i.i.d. samples) using local information. The samples are dependent (correlated).
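The random walk just described can be sketched in Python. The text leaves the 'particular test' unspecified at this point, so the sketch assumes the usual Metropolis test: accept the proposed value with probability $\min\{1, f_X(\hat{x})/f_X(x_t)\}$, where only the unnormalised standard normal density $e^{-x^2/2}$ is needed. The function name `random_walk_chain` and the chain length are illustrative choices.

```python
import math
import random


def random_walk_chain(n_steps, rng, x0=0.0, half_width=0.5):
    """Random walk chain targeting N(0, 1), using only density ratios.

    Assumed acceptance test (Metropolis): accept the proposal with
    probability min(1, f(x_new) / f(x)), f(x) proportional to exp(-x^2 / 2).
    """
    chain = []
    x = x0
    for _ in range(n_steps):
        x_new = x + rng.uniform(-half_width, half_width)   # proposal
        accept_prob = math.exp(min(0.0, 0.5 * (x * x - x_new * x_new)))
        if rng.random() < accept_prob:
            x = x_new            # like the new value: move
        chain.append(x)          # on rejection the old value is recorded again
    return chain


rng = random.Random(4)
chain = random_walk_chain(200_000, rng)
mcmc_probability = sum(1 for x in chain if x < 0.5) / len(chain)
# Compare with the exact value Pr(X <= 0.5), about 0.6915.
```

Note that the current value is recorded at every step, including after a rejection; dropping the repeats would bias the estimate. Because successive values are correlated, many more steps are needed than in the i.i.d. case for comparable accuracy.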
So, what is the goal? The goal of MCMC is to draw samples from some probability distribution without having to know its exact height at any point. MCMC achieves this by ‘wandering around’ on that distribution in such a way that the amount of time spent in each location is proportional to the height of the distribution. If the ‘wandering around’ process is set up correctly, we can make sure that this proportionality (between time spent and height of the distribution) is achieved.
Intuitively, what we want to do is to walk around on some (lumpy) surface in such a
way that the amount of time we spend (or the number of samples drawn) in each location
is proportional to the height of the surface at that location. So, we would like to spend
twice as much time on a hilltop that is at an altitude of 100m as we do on a nearby hill
that is at an altitude of 50m. The nice thing is that we can do this even if we do not know the absolute heights of points on the surface: all we have to know are the relative heights. For example, if hilltop A is twice as high as hilltop B, then we would like to spend twice as much time at A as we spend at B. This makes MCMC much simpler to implement, and lets us draw samples (indirectly) from very complicated distributions (e.g. a Bayesian posterior distribution).
The simplest variant of MCMC (the Metropolis-Hastings algorithm) achieves this as
follows: assume that in every (discrete) time-step t, we pick a random new ‘proposed’
location (selected uniformly across the entire surface). If the proposed location is higher
than where we are standing now, move to it. If the proposed location is lower, then move
to the new location with probability p, where p is the ratio of the height of that point to the
height of the current location. Keep a list of the locations you have been at on every time
step, and that list will (asymptotically) have the right proportion of time spent in each part
of the surface. For the A and B hills described above, you will end up with twice the
probability of moving from B to A as you have of moving from A to B.
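The wandering rule above can be tried on a toy surface. In this sketch the surface is assumed to consist of four discrete locations with made-up heights; the function name `wander` is hypothetical. Proposals are uniform over all locations, uphill moves are always taken, and downhill moves are taken with probability equal to the height ratio.

```python
import random
from collections import Counter


def wander(heights, n_steps, rng):
    """Return the fraction of time steps spent at each location."""
    current = 0
    visits = Counter()
    for _ in range(n_steps):
        proposed = rng.randrange(len(heights))      # uniform proposal
        p = heights[proposed] / heights[current]    # relative height only
        if p >= 1.0 or rng.random() < p:
            current = proposed                      # move
        visits[current] += 1                        # record location every step
    return [visits[i] / n_steps for i in range(len(heights))]


heights = [50.0, 100.0, 25.0, 75.0]   # made-up altitudes
fractions = wander(heights, 200_000, random.Random(5))
target = [h / sum(heights) for h in heights]   # [0.2, 0.4, 0.1, 0.3]
```

The observed fractions approach the normalised heights even though the rule only ever uses ratios of heights, never their sum.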
There are more complicated schemes for proposing new locations and rules for accepting them, but the basic idea is still: (a) pick a new ‘proposed’ location; (b) figure out how much higher or lower that location is compared to your current location; (c) probabilistically stay put or move to that location in a way that respects the overall goal of spending time proportional to the height of the location. Consider repeating (a)-(c) many times (many time steps $t$), where at each step the move to a new location at time $t + 1$ depends only on the (current) location at time $t$ (the future depends only on the current state of the process and not on the past, i.e. the process is memoryless and hence a Markov chain). Clearly, all the locations generated in this way are outcomes of a Markov chain. Moreover, the Markov chain soon forgets its initial position $x_0$.
So far I have only mentioned the use of a Markov chain in Monte Carlo simulation, but I have not said why we need one. So why ‘Markov chain’? Because under certain technical conditions (yet to be discussed), one can generate a memoryless process that has the same limiting distribution as the random variable that we are trying to simulate. Indeed, if we repeat (a)-(c) (which generate correlated random results or locations) long enough, we are guaranteed that once we pool enough of the results, we will end up with a pile of numbers (locations) that looks ‘as if’ we had somehow managed to take independent samples (after sufficiently many time steps $t$) from the complicated distribution we wanted to know about or sample from.
This means that MCMC simulates a Markov chain (a sequence of random variables $X_t$) such that, for large enough $t$, the last random variable $X_t$ of the chain is distributed according to $f_X(x)$, the target distribution. Notice that no samples are generated directly from the target density $f_X(x)$ (samples are generated from the proposal density).
The idea is therefore to construct a Markov chain which converges to the desired probability distribution after a number of steps. The state of the chain after a large number of steps is then used as a sample from the desired or target distribution. There are many different MCMC algorithms, which use different techniques for generating the Markov chain; a common one is the Metropolis-Hastings algorithm. There are some technical conditions on the Markov chain which guarantee its convergence to the desired or target distribution. These conditions will be presented in the next section.
MCMC methods are computational algorithms. They are a way of estimating something which is too difficult or time consuming to find deterministically. They are basically a form of computer simulation of some mathematical or physical process. Here are some of the problems that we will study through MCMC. Monte Carlo integration works well on high-dimensional functions by taking a random (or asymptotically random) sample of points of the function and calculating some type of average at these points. The law of large numbers tells us that by increasing the sample size we can increase the accuracy of our approximation, since we cover more and more of the function; the central limit theorem describes the size of the remaining error.
In optimization, the objective is to compute the global optimum of some complicated objective function (e.g. the maximum of a likelihood or of a Bayesian posterior probability). In this context, we will see how MCMC is combined with an algorithm called ‘the simulated annealing algorithm’.
We end this section with the following motto: ‘construct a Markov chain that converges to a target distribution’. Before I present the Markov chain, I introduce some notation. Both the time $t$ and the sample space (state space) $\Omega$ can be treated as continuous or discrete, but in this topic I take time $t$ to be discrete and $\Omega$ to be either continuous or discrete (finite). For the discrete (finite) state space, $\Omega = \{1, 2, \cdots, n\}$. The random variable $X_t$ (or any random variable $X$) is discrete if it is defined on a discrete $\Omega$, and continuous when $\Omega$ is continuous. When the state space is continuous, the corresponding (target) density is written $f_X(x)$ (or $f_{X_t}(x)$). For a discrete sample space, the corresponding density is commonly known as the probability mass function, or simply the probability distribution (for which the cumulative distribution function (CDF) can also be defined). I will refer to this discrete (target) density as the probability distribution and denote it by $\pi$. Hence, for discrete time and discrete space $\Omega$, $X_t \sim \pi$, and for discrete time and continuous space, $X_t \sim f_X(x)$. The starting (initial) time is denoted by $t = 0$.
A set of random variables $X_t$, $t = 0, 1, 2, \cdots$, is a Markov chain (MC) if:
\[
\Pr\{X_{t+1} = j \mid X_0, X_1, \cdots, X_t = i\} = \Pr\{X_{t+1} = j \mid X_t = i\}, \qquad i, j \in \Omega,\ \forall t.
\]
A MC is specified by giving
e.g. $\lambda^{(0)}(1) = 1$ and $\lambda^{(0)}(2) = 0$ in my example above, where $X_0 \sim \lambda^{(0)}$.
• We assume the transition probabilities do not change with time. A Markov chain is time homogeneous if
\[
\Pr\{X_{t+1} = j \mid X_t = i\} = \Pr\{X_{r+1} = j \mid X_r = i\}, \qquad \forall\, t, r,
\]
i.e. $\Pr\{X_{t+1} = j \mid X_t = i\} = p_{ij}$ is independent of $t$. Time homogeneous Markov chains are also known as stationary Markov chains. Notice that
\[
\sum_{j \in \Omega} p_{ij} = 1
\]
since the chain must be in some state at the next step. Therefore the matrix $P = (p_{ij})$ is a stochastic matrix.
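\begin{align}
\left(\lambda^{(0)T} P\right)(j) &= \sum_i \lambda^{(0)}(i)\, p_{ij} \tag{8} \\
&= \sum_i \Pr\{X_0 = i\} \Pr\{X_1 = j \mid X_0 = i\} \tag{9} \\
&= \sum_i \Pr\{X_0 = i, X_1 = j\} \tag{10} \\
&= \Pr\{X_1 = j\} \tag{11} \\
&= \lambda^{(1)}(j). \tag{12}
\end{align}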
Considering all $j \in \Omega$ in the above, we obtain another distribution, say $\lambda^{(1)}$, such that $X_1 \sim \lambda^{(1)}$. One can see that $\lambda^{(1)T} = \lambda^{(0)T} P$. Continuing in this way we obtain $X_t \sim \lambda^{(t)}$. We see that we are generating an MC $\{X_0, X_1, \cdots, X_t\}$ induced by the transition matrix $P$. It follows that $\lambda^{(t)T} = \lambda^{(0)T} P^t$. Clearly, the initial distribution $\lambda^{(0)}$ and the matrix $P$ capture all the relevant information about the dynamics of the Markov chain. The following questions are important in the study of Markov chains:
• Does there exist a distribution $\pi$ such that $\pi^T = \pi^T P$? If such a $\pi$ exists, it is called a stationary distribution, invariant distribution or equilibrium distribution. In this case
\[
\sum_i \pi(i)\, p_{ij} = \pi(j),
\]
i.e. $\pi$ is the (normalized) left eigenvector of $P$ with eigenvalue 1. We have $X_t \sim \lambda^{(t)}$, but does $X_t \sim \pi$ hold as $t \to \infty$?
• If there exists a unique stationary distribution $\pi$, does $\lambda^{(t)} \to \pi$ for all $\lambda^{(0)}$? In other words, does the distribution of the Markov chain converge to the stationary distribution starting from any initial distribution $\lambda^{(0)}$?
Given that there is a stationary distribution, does the second bullet hold true for all Markov chains? It turns out that both the irreducibility and the aperiodicity of the MC are important for achieving $\lambda^{(t)} \to \pi$ for all initial $\lambda^{(0)}$. In fact, an irreducible Markov chain on a finite state space has a unique stationary distribution $\pi$, and if it is also aperiodic then $\lambda^{(t)} \to \pi$. A sufficient condition for stationarity is the reversibility of the MC, i.e. detailed balance (see Definition 4 below). Thus, the stationary distribution is completely characterized.
If an MC is both irreducible and aperiodic, then the MC is called ergodic. This means that anything that can happen eventually will happen.
Definition 1: A probability distribution $\pi$ (probability mass function) on $\Omega$ is stationary (invariant) with respect to $P$ if $\pi^T = \pi^T P$.
Definition 2: Irreducibility. An irreducible MC is one in which any state can be reached from any other state in a finite number of moves.
We say that an MC is irreducible if, for each pair of states $i$ and $j$, there exists a finite integer $k$ (possibly depending upon $i$ and $j$) such that $p_{ij}^{(k)} > 0$, where
\[
p_{ij}^{(k)} = \Pr\{X_{t+k} = j \mid X_t = i\}, \tag{14}
\]
and the superscript $(k)$ is an index and not an exponent. In other words, a chain is irreducible if it is possible to eventually get from any state $i$ to any other state $j$ in a finite number of steps.
Remark 1: Consider the transition matrix $P$ for two states such that $p_{11} = 1$, $p_{12} = 0$, $p_{21} = 0$, $p_{22} = 1$. Then for any distribution $\pi$ we have $\pi^T P = \pi^T$, so the stationary distribution is not unique. The Markov chain in this example is such that if it starts in one state, it remains in that state forever. In general, Markov chains in which one state cannot be reached from some other state need not possess a unique stationary distribution. This is because the MC is reducible.
Definition 3: Aperiodicity. A state $i$ has period $k$ if any return to state $i$ occurs in a multiple of $k$ time steps. Formally, the period of a state is defined as
\[
k = \gcd\{t > 0 : \Pr\{X_t = i \mid X_0 = i\} > 0\}
\]
(where gcd is the greatest common divisor of all such $t$), provided that this set is not empty. If $k = 1$ then the state is said to be aperiodic, i.e. returns to state $i$ can occur at irregular times/steps.
Remark 2: Consider the transition matrix $P$ for two states such that $p_{11} = 0$, $p_{12} = 1$, $p_{21} = 1$, $p_{22} = 0$. The stationary distribution is $\pi^T = (\frac{1}{2}, \frac{1}{2})$, but the system does not converge to this stationary distribution from every initial condition. To see this, take $\lambda^{(0)T} = (1, 0)$; then $\lambda^{(t)}$ alternates between $(1, 0)$ and $(0, 1)$, so $\lambda^{(t)} \not\to \pi$ as $t \to \infty$. This is because the MC is periodic.
The following example illustrates the computation of the steady-state (stationary) distribution of a Markov chain. Consider a three-state Markov chain with the state space $\Omega = \{a, b, c\}$. If the Markov chain is in state $a$, it switches to one of the other two states, each with probability $\frac{1}{4}$, or remains in the same state. If it is in state $b$, it switches to state $c$ with probability $\frac{1}{2}$ or remains in the same state. If it is in state $c$, it switches to state $a$ with probability 1. Construct the transition matrix $P$. This Markov chain is irreducible since it can go from any state to any other state in finite time with non-zero probability. Next, note that there is a non-zero probability of remaining in state $a$ if the Markov chain starts in state $a$. Therefore $p_{aa}^{(k)} > 0$ (see Eqn. (14)) for all $k$, and state $a$ is aperiodic. Since the Markov chain is irreducible, this implies that all the states are aperiodic. Thus, the finite-state Markov chain induced by $P$ is irreducible and aperiodic, which implies the existence of a stationary distribution, here $(\frac{1}{2}, \frac{1}{4}, \frac{1}{4})$, to which the probability distribution converges starting from any initial distribution.
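The three-state example can be checked numerically by repeatedly applying $\lambda^{(t+1)T} = \lambda^{(t)T} P$. The matrix below is my reading of the verbal description (rows and columns ordered $a, b, c$), and the helper name `step` and the starting distribution are assumptions made for illustration.

```python
def step(dist, P):
    """One step of the chain on distributions: returns dist^T P."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]


# Transition matrix as read from the description, states ordered (a, b, c).
P = [
    [0.50, 0.25, 0.25],   # from a: stay 1/2, to b 1/4, to c 1/4
    [0.00, 0.50, 0.50],   # from b: stay 1/2, to c 1/2
    [1.00, 0.00, 0.00],   # from c: to a with probability 1
]

dist = [0.0, 0.0, 1.0]    # arbitrary initial distribution: start in c
for _ in range(100):
    dist = step(dist, P)
# dist is now numerically equal to the stationary distribution (1/2, 1/4, 1/4).
```

Starting instead from `[1.0, 0.0, 0.0]` or `[0.0, 1.0, 0.0]` leads to the same limit, which is what irreducibility and aperiodicity promise for a finite chain.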
Remark 3: If the state space is infinite, (countably infinite e.g. the set of integers or
continuous), the existence of a stationary distribution is not guaranteed even if the Markov
chain is irreducible and aperiodic. Hence, the following sufficient condition is required
for general state space Ω.
Definition 4: Reversibility. A Markov chain is reversible if there exists a stationary distribution $\pi$ such that:
\[
\pi(i) \Pr\{X_{t+1} = j \mid X_t = i\} = \pi(j) \Pr\{X_{t+1} = i \mid X_t = j\} \qquad \forall\, t, i, j.
\]
A Markov chain can be irreducible but not reversible. The MC induced by the transition matrix
\[
P = \begin{pmatrix} \frac{1}{3} & \frac{1}{3} & \frac{1}{3} \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix}
\]
is not reversible. Possible sequence: $1 \to 3 \to 2 \to 1$. Impossible sequence: $1 \to 2 \to 3 \to 1$. There is a sequence of states for which it is possible to tell in which direction the simulation has run, and thus the chain is not reversible.
\[
\pi(i)\, p_{ij} = \pi(j)\, p_{ji}, \qquad i, j \in \Omega,
\]
i.e. the MC looks the same running forward or backward (detailed balance condition).
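If a Markov chain is reversible, then summing over $j$ gives:
\begin{align*}
\sum_j \pi(j) \Pr\{X_{t+1} = i \mid X_t = j\} &= \sum_j \pi(i) \Pr\{X_{t+1} = j \mid X_t = i\} \\
\sum_j \pi(j) \Pr\{X_{t+1} = i \mid X_t = j\} &= \pi(i).
\end{align*}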
This is the ergodic behavior we want. No matter where we start, after a long enough time we will be at state $i$ with probability $\pi(i)$. $\pi$ is the (normalized) left eigenvector of $P$ with eigenvalue 1, since in matrix notation
\[
\pi^T P = \pi^T \tag{15}
\]
This stationarity property follows from detailed balance. We need to solve $\pi^T P = \pi^T$ for $\pi$ to be stationary. For a chain that moves only between neighbouring states, it follows from Eq. (15) that
\[
\pi(i) = \pi(i-1)\, p_{i-1,i} + \pi(i)\, p_{ii} + \pi(i+1)\, p_{i+1,i}
\]
which is only possible when $\pi(i)\, p_{i,i+1} = \pi(i+1)\, p_{i+1,i}$, since
\[
p_{i,i-1} + p_{ii} + p_{i,i+1} = 1,
\]
where $|\Omega| = 3$.
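Detailed balance and stationarity are easy to test numerically for small chains. The sketch below uses two examples: a made-up three-state chain that moves only between neighbouring states (the values of `pi_bd` and `P_bd` are illustrative assumptions), and the irreducible but non-reversible matrix with rows $(\frac13, \frac13, \frac13)$, $(1, 0, 0)$, $(0, 1, 0)$ given earlier, whose stationary distribution works out to $(\frac12, \frac13, \frac16)$. The function names are hypothetical.

```python
def is_stationary(pi, P, tol=1e-12):
    """Check pi^T P = pi^T component by component."""
    n = len(P)
    return all(
        abs(sum(pi[i] * P[i][j] for i in range(n)) - pi[j]) < tol
        for j in range(n)
    )


def satisfies_detailed_balance(pi, P, tol=1e-12):
    """Check pi(i) p_ij = pi(j) p_ji for every pair of states."""
    n = len(P)
    return all(
        abs(pi[i] * P[i][j] - pi[j] * P[j][i]) < tol
        for i in range(n)
        for j in range(n)
    )


# Made-up chain moving only between neighbouring states (reversible).
pi_bd = [0.2, 0.5, 0.3]
P_bd = [[0.5, 0.5, 0.0],
        [0.2, 0.5, 0.3],
        [0.0, 0.5, 0.5]]

# Irreducible but non-reversible chain from the text.
pi_cyc = [1 / 2, 1 / 3, 1 / 6]
P_cyc = [[1 / 3, 1 / 3, 1 / 3],
         [1.0, 0.0, 0.0],
         [0.0, 1.0, 0.0]]

bd_reversible = satisfies_detailed_balance(pi_bd, P_bd)      # True
bd_stationary = is_stationary(pi_bd, P_bd)                   # True
cyc_reversible = satisfies_detailed_balance(pi_cyc, P_cyc)   # False
cyc_stationary = is_stationary(pi_cyc, P_cyc)                # True
```

The second chain shows that stationarity does not imply detailed balance: the implication runs only from reversibility to stationarity.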
An MC with discrete time and continuous state space is reversible if there exists a distribution $f_X(x)$ such that
\[
f_X(x) \Pr\{X_{t+1} = y \mid X_t = x\} = f_X(y) \Pr\{X_{t+1} = x \mid X_t = y\},
\]
\[
\int_x f_X(x) \Pr\{X_{t+1} = y \mid X_t = x\}\,dx = \int_x f_X(y) \Pr\{X_{t+1} = x \mid X_t = y\}\,dx = f_X(y).
\]
If a chain satisfies detailed balance, then $\pi$ is its stationary distribution, i.e. (15) holds. In MCMC, we often deal with continuous state spaces, so we will write $p(x, x_t)$ for the transition probability from $x_t$ to $x$, instead of $p_{ij}$, and $f_X(x)$ for the target stationary distribution, instead of $\pi$. In this case, detailed balance means
\begin{align*}
f_X(x_t)\, p(x, x_t) &= f_X(x)\, p(x_t, x) \\
\int_{x_t} f_X(x_t)\, p(x, x_t)\,dx_t &= \int_{x_t} f_X(x)\, p(x_t, x)\,dx_t \\
&= f_X(x) \int_{x_t} p(x_t, x)\,dx_t \\
&= f_X(x).
\end{align*}
The conditional probability density, or proposal density, is $q(x|x_t)$ (which need not actually be conditioned on $x_t$, for instance in the case of the independence sampler). Hence $q(x|x_t)$ is the probability of sampling $x$ from the proposal density and
\[
p(x, x_t) = q(x|x_t) \times \alpha(x|x, x_t), \tag{16}
\]
where $p(x, x_t)$ is the transition probability from $x_t$ to $x$ and $\alpha(x|x, x_t)$ is the probability of accepting $x$ while at $x_t$.
Remark 4: For the discrete state space $\Omega$, we define the proposal distribution by $q_{ij} = Q(i, j)$ (or $Q = (q_{ij})$), which is the probability of generating the state $j$ while at state $i$, and $\alpha(j, i)$ is the acceptance probability of state $j$ while in state $i$. Hence
\[
p_{ij} = \Pr\{X_{t+1} = j \mid X_t = i\} = Q(i, j) \times \alpha(j, i) \tag{17}
\]
Homework 2: Prove that Reversibility → Stationarity
Remark 5: In Markov chain theory we are given an MC with transition matrix $P$, and we find its equilibrium distribution $\pi$. In MCMC theory we are given a distribution $\pi$ or $f_X(x)$, and we construct an MC that is reversible with respect to it.
2.2 The Markov Chain Monte Carlo algorithm
At the beginning of this section I redefine our main goals: (i) to sample from a target
distribution π (or fX (x)) and (ii) to approximate E [h(X)] where X ∼ π (or X ∼ fX (x)).
I have demonstrated earlier that we can solve many problems if we can achieve (i) and
(ii). Bad news: In many problems, methods are unavailable for direct simulation of an
i.i.d sample from fX (x) (or π). Good news: In many problems, methods such as the
Metropolis-Hastings algorithms can be used to simulate a Markov chain {X0 , · · · Xt }
which is converging in distribution to fX (x) (or π). This was made clear in the previous
subsection.
We have discussed that, under some conditions (reversibility, irreducibility, aperiodicity), the distribution of $X_t$ converges to the target $\pi$ (or $f_X(x)$) for large enough $t$, and the chain is ergodic. The chain ‘forgets’ its initialization. Where does all this get us? We still do not know where to get the matrix $P$. Indeed, designing a perfect transition matrix $P$ (transition kernel for a continuous state space) will allow us to draw samples from $\pi$ (or $f_X(x)$).
So, can we build a Markov chain that has pretty much any requested equilibrium
distribution? YES!! The answer is Markov chain Monte-Carlo with a perfect transition
matrix. If we succeed in designing such a transition matrix then:
• MCMC can use a homogeneous, ergodic and reversible Markov chain to generate
consistent samples drawn from any given target distribution.
To construct the stochastic matrix $P = (p_{ij})$ (or the transition kernel $p(y, x_t) = \Pr\{X_{t+1} = y \mid X_t = x_t\}$ for continuous state space) we must define a (problem dependent) proposal matrix $Q = (q_{ij})$ (or the proposal distribution $q(x|x_t)$ for continuous state space) such that the resulting chain satisfies detailed balance. As in (17), we define
\[
p_{ij} = q_{ij}\, \alpha(j, i), \qquad i \neq j,
\]
where the proposal matrix $Q$ is also stochastic and must be problem dependent, and
\[
\alpha(j, i) = \min\left\{1, \frac{q_{ji}\, \pi(j)}{q_{ij}\, \pi(i)}\right\}.
\]
Since the generation probability $q(y|x_t)$ and the acceptance probability $\alpha(y, x_t)$ are independent, the resulting macroscopic dynamics, i.e. the transition matrix (kernel), is
\begin{align*}
p(y, x_t) &= \Pr(X_{t+1} = y \mid X_t = x_t) \\
&= q(y|x_t)\, \alpha(y|y, x_t), \qquad x_t \neq y.
\end{align*}
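Hence for the continuous state space we write
\begin{align}
p(\hat{x}, x_t) &= q(\hat{x}|x_t) \Pr\{\text{Accept } \hat{x} \mid \hat{x}, x_t\} \notag \\
&= q(\hat{x}|x_t) \times \alpha(\hat{x}|\hat{x}, x_t) \tag{19}
\end{align}
where $q(\hat{x}|x_t) = Q(\hat{x}, x_t)$ is the probability of drawing $\hat{x}$ from the proposal distribution $q(x|x_t)$, and
\begin{align}
\alpha(\hat{x}|\hat{x}, x_t) &= \min\left\{1, \frac{f_X(\hat{x})}{f_X(x_t)} \frac{Q(x_t, \hat{x})}{Q(\hat{x}, x_t)}\right\} \notag \\
&= \min\left\{1, \frac{f_X(\hat{x})}{f_X(x_t)} \frac{q(x_t|\hat{x})}{q(\hat{x}|x_t)}\right\}. \tag{20}
\end{align}
The proposal ratio is the probability of the reverse move over that of the forward move, in agreement with the discrete acceptance probability $\alpha(j, i)$.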
It can be shown that if the Markov chain $\{X_0, X_1, \cdots, X_t\}$ induced by the stochastic matrix $Q$ is irreducible, then the Markov chain induced by the transition matrix $P$ is also irreducible (under some further conditions, since $P$ depends on $\alpha$). Moreover, if the MC induced by the matrix $Q$ is aperiodic, then the MC induced by the matrix $P$ is also aperiodic.
The resulting (Metropolis-Hastings) Markov chain is reversible with respect to $\pi$ (or $f_X(x)$). If it is also irreducible and aperiodic, we have an ergodic Markov chain with unique stationary and limiting distribution $\pi$ (or $f_X(x)$).
where $x_t$ can be regarded as drawn from the target density, i.e. $x_t \sim f_X(x)$, when $t$ is very large (without sampling directly from the density $f_X(x)$ or $\pi$). Given the simulated path of the Markov chain, we can compute Monte Carlo expectations for any quantities of interest by averaging along the sample path. Note that no specific knowledge of the target distribution and no specific support function is required. Hence we can achieve the two goals stated at the beginning of Section 2.2.
The Metropolis-Hastings MCMC Algorithm is presented below:
• Sample x̂ ∼ q(x|xt )
• Else xt+1 = xt
Step 2: Output x0 , x1 , · · · , xN −1
Step 3: End MCMC
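A minimal Python sketch of a Metropolis-Hastings sampler is given below. It is not a transcription of the listing above but one possible implementation under stated assumptions: the acceptance probability is the standard $\min\{1, \frac{f_X(\hat{x})\, q(x_t|\hat{x})}{f_X(x_t)\, q(\hat{x}|x_t)}\}$, computed in log form; the target is an unnormalised Gamma(2, 1) density $f_X(x) \propto x e^{-x}$ on $x > 0$ (mean 2); and the proposal is a multiplicative log-normal step, which is asymmetric so that the proposal ratio actually matters. All function names and constants are illustrative.

```python
import math
import random


def metropolis_hastings(log_f, propose, log_q, x0, n_steps, rng):
    """Return x_0, ..., x_{N-1}.

    log_f(x)             : log of the (unnormalised) target density
    propose(x, rng)      : draws x_hat ~ q(. | x)
    log_q(x_new, x_old)  : log q(x_new | x_old), up to an additive constant
    """
    x = x0
    chain = [x]
    for _ in range(n_steps - 1):
        x_hat = propose(x, rng)
        # log of [f(x_hat) q(x | x_hat)] / [f(x) q(x_hat | x)]
        log_ratio = (log_f(x_hat) + log_q(x, x_hat)) - (log_f(x) + log_q(x_hat, x))
        if rng.random() < math.exp(min(0.0, log_ratio)):
            x = x_hat            # accept: x_{t+1} = x_hat
        chain.append(x)          # else:   x_{t+1} = x_t
    return chain


SIGMA = 1.0   # illustrative proposal scale


def log_f(x):
    """Unnormalised Gamma(2, 1) target: f(x) proportional to x * exp(-x)."""
    return math.log(x) - x


def propose(x, rng):
    """Multiplicative log-normal step; keeps the chain on x > 0."""
    return x * math.exp(SIGMA * rng.gauss(0.0, 1.0))


def log_q(x_new, x_old):
    """Log-normal proposal density, dropping constants that cancel."""
    d = math.log(x_new) - math.log(x_old)
    return -math.log(x_new) - d * d / (2.0 * SIGMA ** 2)


rng = random.Random(6)
mh_chain = metropolis_hastings(log_f, propose, log_q, x0=1.0, n_steps=100_000, rng=rng)
kept = mh_chain[10_000:]                 # discard an initial stretch
mh_mean = sum(kept) / len(kept)          # Gamma(2, 1) has mean 2
```

Only ratios of $f_X$ enter the acceptance step, so the normalising constant of the target is never needed. Discarding the first part of the chain is a common practical choice, assumed here rather than taken from the notes.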
In the above algorithm, if $q(x|x_t)$ is chosen such that the Markov chain satisfies modest conditions (e.g. irreducibility and aperiodicity), then convergence to $f_X(x)$ is guaranteed. However, the rate of convergence will depend on the relationship between $q(x|x_t)$ and $f_X(x)$. The advantage of the Metropolis-Hastings algorithm is that we only need to sample from the proposal $q(x|x_t)$, and we are free to choose any proposal density which is easy to sample from, as long as the resulting chain is irreducible and aperiodic. A number of the proposals $q(x)$ suggested previously can achieve these properties.
The ability to simulate from complicated multivariate probability distributions via MCMC has had an impact in many areas of Statistics, but most profoundly on Bayesian approaches to statistical modeling. Suppose we want to draw samples from the posterior density
\[
f_X(\theta|d) = \frac{f_0(\theta) L(d|\theta)}{\int_\theta f_0(\theta) L(d|\theta)\,d\theta} \tag{21}
\]
where $f_0(\theta)$ is the prior and $L(d|\theta)$ is the model (or likelihood function), the data $d$ is the realization of the random variable $D$, and $\theta^T = (\theta_1, \theta_2, \cdots, \theta_n)$. The most difficult part of the posterior is the normalizing constant
\[
\int_\theta f_0(\theta) L(d|\theta)\,d\theta.
\]
However, in MCMC this does not need to be evaluated when we want to sample from the complicated density $f_X(\theta|d)$. Why? (You can use $\theta = x$ and write $f_X(\theta|d) = f_X(x)$.)
We complete this section with a note that the very first MCMC method is known as the Metropolis algorithm, see Metropolis et al. (1953) [1]. The only difference between the Metropolis-Hastings MCMC algorithm and the Metropolis algorithm is that the latter uses a symmetric proposal $q(x)$ in
\[
\min\left\{1, \frac{f_X(\hat{x})}{f_X(x_t)} \frac{q(x_t|\hat{x})}{q(\hat{x}|x_t)}\right\},
\]
which results in
\[
A = \min\left\{1, \frac{f_X(\hat{x})}{f_X(x_t)}\right\}
\]
since $q(x_t|\hat{x}) = q(\hat{x}|x_t)$.
• ‘Options’ for the next state after the current state, similar to updates in Metropolis
algorithms
• A cooling schedule, which determines how $T$ evolves with the number of iterations
References
1. N. Metropolis et al., Equation of State Calculations by Fast Computing Machines, J. Chemical Physics, Vol. 21, 1953, pp. 1087-1092.