
STAT G6103 Bayesian Data Analysis, HW12

Lichi Li

Project: Approximate Inference on GP Classification

In this project we study the empirical properties of several approximate inference techniques for Gaussian process classification, using synthetic data to examine their behavior. For convenience we conduct the experiments in GPML rather than Stan, comparing modal and distributional approximations against posterior draws obtained from sampling algorithms.
Our synthetic data is generated from two Gaussians in a two-dimensional space, with the following setting:








\[
\mu_1 = \begin{pmatrix} 0.75 \\ 0 \end{pmatrix}, \quad
\mu_2 = \begin{pmatrix} -0.75 \\ 0 \end{pmatrix}, \quad
S_1 = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \quad
S_2 = \begin{pmatrix} 1 & 0.95 \\ 0.95 & 1 \end{pmatrix}
\]
We then generate 80 samples from the first Gaussian and 40 samples from the second, and assign the label +1 to all samples from the first class and -1 to those from the second. We then use a Gaussian process classifier on this data and examine its characteristics under the different approximate inference techniques.
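For concreteness, a minimal NumPy sketch of generating such a dataset might look as follows (the actual experiments in this report use the GPML toolbox in MATLAB; the random seed and variable names here are our own):

    import numpy as np

    rng = np.random.default_rng(0)

    # class means and covariances as specified above
    m1, m2 = np.array([0.75, 0.0]), np.array([-0.75, 0.0])
    S1 = np.eye(2)
    S2 = np.array([[1.0, 0.95],
                   [0.95, 1.0]])

    # 80 samples from the first class, 40 from the second, labels +1 / -1
    X = np.vstack([rng.multivariate_normal(m1, S1, size=80),
                   rng.multivariate_normal(m2, S2, size=40)])
    y = np.concatenate([np.ones(80), -np.ones(40)])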

1.1 Expectation Propagation (EP)

Recall that expectation propagation (EP) is a structured, distributional approximation that can be applied to GP classification; it iteratively updates site functions, one per data point, to mimic the exact likelihood terms. In particular, EP looks for a Gaussian approximation q(f | D, θ) = N(f | m, A) to the posterior p(f | D, θ) by placing approximating site functions on the likelihood:
\[
p(f \mid D, \theta)
\;=\; \frac{p(f \mid X, \theta)\, \prod_{i=1}^{m} p(y_i \mid f_i)}{p(D \mid \theta)}
\;\approx\; \frac{p(f \mid X, \theta)\, \prod_{i=1}^{m} t(f(x_i), \tilde{\mu}_i, \tilde{\sigma}_i^2, \tilde{Z}_i)}{p(D \mid \theta)}
\;=\; q(f \mid D, \theta)
\]
where t(f(x_i), μ̃_i, σ̃_i², Z̃_i) = Z̃_i N(f(x_i) | μ̃_i, σ̃_i²), whose parameters are called the site parameters. The approximation is specified by the mean m = A Σ̃⁻¹ μ̃ and covariance A = (K⁻¹ + Σ̃⁻¹)⁻¹, where μ̃ = [μ̃(x_1), μ̃(x_2), ..., μ̃(x_m)]ᵀ and Σ̃ = diag(σ̃_1², σ̃_2², ..., σ̃_m²). The idea at each iteration is to take the current likelihood term p(y_i | f_i) out of p(f) to form the cavity distribution,
\[
p_{-i}(f(x_i)) \;=\; \int \prod_{j \neq i}^{m} p(y_j \mid f(x_j))\, p(f \mid X, \theta)\, df_{-i}
\]
where the integral is over all latent values except f(x_i). This marginalization is intractable to compute for GPC, so in practice the cavity distribution is approximated as well,
\[
q_{-i}(f(x_i)) \;=\; \int \prod_{j \neq i}^{m} t(f(x_j), \tilde{\mu}_j, \tilde{\sigma}_j^2, \tilde{Z}_j)\, p(f \mid X, \theta)\, df_{-i}
\;\propto\; \mathcal{N}(f(x_i) \mid \mu_{-i}, \sigma_{-i}^2)
\]

which in turn gives us an approximation to the posterior marginal through:

\[
q_{-i}(f(x_i))\, t(f(x_i), \tilde{\mu}_i, \tilde{\sigma}_i^2, \tilde{Z}_i)
\;=\; \int \mathcal{N}(f \mid 0, K)\, \prod_{j=1}^{m} t(f(x_j), \tilde{\mu}_j, \tilde{\sigma}_j^2, \tilde{Z}_j)\, df_{-i}
\;\propto\; \mathcal{N}(f(x_i) \mid m_i, A_{ii})
\]

with parameters

\[
\sigma_{-i}^2 = \big( (A_{ii})^{-1} - \tilde{\sigma}_i^{-2} \big)^{-1}
\qquad \text{and} \qquad
\mu_{-i} = \sigma_{-i}^2 \left( \frac{m_i}{A_{ii}} - \frac{\tilde{\mu}_i}{\tilde{\sigma}_i^2} \right)
\]

EP then adjusts the site parameters so that the approximate posterior marginal formed with the exact likelihood and the one formed with the site function are as close to identical as possible, by matching their zeroth, first, and second moments:

\[
q_{-i}(f(x_i))\, p(y_i \mid f_i) \;\approx\; q_{-i}(f(x_i))\, t(f(x_i), \tilde{\mu}_i, \tilde{\sigma}_i^2, \tilde{Z}_i)
\]
where the moment matching (implicitly) minimizes the Kullback-Leibler divergence between the targets:
\[
t_i^{\text{new}}(f(x_i), \tilde{\mu}_i, \tilde{\sigma}_i^2, \tilde{Z}_i):\qquad
t_i^{\text{new}} \;=\; \operatorname*{argmin}_{t_i}\;
\mathrm{KL}\!\left(
\frac{q(f)}{t_i^{\text{old}}}\, p(y_i \mid f_i)
\;\Big\|\;
\frac{q(f)}{t_i^{\text{old}}}\, t_i
\right)
\]
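Although we do not derive the moment update functions here, the EP loop can be sketched in a few dozen lines. The following simplified NumPy/SciPy version uses the standard closed-form moments of a probit likelihood against a Gaussian cavity; the RBF kernel helper, variable names, and the brute-force recomputation of the posterior are our own simplifications and not the GPML implementation:

    import numpy as np
    from scipy.stats import norm

    def rbf_kernel(X, lengthscale=1.0, variance=1.0, jitter=1e-6):
        """Squared-exponential kernel matrix with a small diagonal jitter."""
        d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
        return variance * np.exp(-0.5 * d2 / lengthscale ** 2) + jitter * np.eye(len(X))

    def ep_gpc(K, y, n_sweeps=10):
        """Simplified EP for GP classification with a probit likelihood.

        Returns the mean and covariance of the Gaussian approximation q(f) = N(m, A).
        The global approximation is recomputed from scratch after every site update
        (O(n^3) each time), which is wasteful but keeps the sketch transparent;
        no numerical safeguards are included.
        """
        n = len(y)
        nu_t = np.zeros(n)    # site natural parameters: nu~_i = mu~_i / sigma~_i^2
        tau_t = np.zeros(n)   # tau~_i = 1 / sigma~_i^2
        mu, Sigma = np.zeros(n), K.copy()
        for _ in range(n_sweeps):
            for i in range(n):
                # cavity distribution: remove site i from the current marginal
                tau_ni = 1.0 / Sigma[i, i] - tau_t[i]
                nu_ni = mu[i] / Sigma[i, i] - nu_t[i]
                s2_ni, mu_ni = 1.0 / tau_ni, nu_ni / tau_ni
                # moments of cavity * exact probit likelihood
                z = y[i] * mu_ni / np.sqrt(1.0 + s2_ni)
                ratio = norm.pdf(z) / norm.cdf(z)
                mu_hat = mu_ni + y[i] * s2_ni * ratio / np.sqrt(1.0 + s2_ni)
                s2_hat = s2_ni - s2_ni ** 2 * ratio * (z + ratio) / (1.0 + s2_ni)
                # moment matching -> new site parameters
                tau_t[i] = 1.0 / s2_hat - tau_ni
                nu_t[i] = mu_hat / s2_hat - nu_ni
                # refresh the global Gaussian approximation q(f) = N(mu, Sigma)
                Sigma = np.linalg.inv(np.linalg.inv(K) + np.diag(tau_t))
                mu = Sigma @ nu_t
        return mu, Sigma

On the synthetic data above this could be run as mu, Sigma = ep_gpc(rbf_kernel(X), y); a production implementation would instead use rank-one updates and a numerically stable parameterization.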

(Here we omit the derivations of the moment update functions for GPC; the sketch above only states the resulting expressions for the probit likelihood.) To conduct the comparison, we may treat MCMC samples as ground truth, having run the chains long enough to be confident of convergence. For example, we can compare against Hamiltonian Monte Carlo (HMC):

[Figure: EP posterior summaries compared against HMC samples.]
(The panel titles may be difficult to read due to resolution issues: the top left corresponds to the latent mean, the top right to the latent standard deviation, and the bottom left and bottom right to their observational counterparts.) As we can see, EP does a decent job of recovering all of these quantities except the observational standard deviation. One interesting phenomenon is the jump that occurs in f, the cause of which is not very clear to me; it is not observed when comparing against elliptical slice sampling (ESS).
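For completeness, a single ESS transition for the latent vector f can be written in a few lines. This is a generic sketch of the standard elliptical slice sampling update, not the sampler actually used in our GPML runs:

    import numpy as np

    def ess_step(f, chol_K, log_lik, rng):
        """One elliptical slice sampling transition for a latent vector f with
        prior N(0, K), where chol_K is the lower Cholesky factor of K and
        log_lik(f) returns the log-likelihood."""
        nu = chol_K @ rng.standard_normal(len(f))      # auxiliary draw from the prior
        log_y = log_lik(f) + np.log(rng.uniform())     # slice level
        theta = rng.uniform(0.0, 2.0 * np.pi)
        theta_min, theta_max = theta - 2.0 * np.pi, theta
        while True:
            f_new = f * np.cos(theta) + nu * np.sin(theta)
            if log_lik(f_new) > log_y:
                return f_new
            # shrink the angle bracket towards theta = 0 and retry
            if theta < 0.0:
                theta_min = theta
            else:
                theta_max = theta
            theta = rng.uniform(theta_min, theta_max)

A probit log-likelihood closure such as lambda f: norm.logcdf(y * f).sum() (with norm from scipy.stats) can be passed as log_lik, and chol_K obtained once via np.linalg.cholesky(K).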

1.2 Laplace Approximation

Recall that the Laplace approximation is a simple Gaussian approximation whose mean is placed at the mode of the target posterior and whose covariance is obtained from the curvature (second derivative) there; it deploys a second-order Taylor expansion of the log posterior around the mode, so that the approximation resembles the posterior curvature in that neighbourhood.
\[
q(f) \;=\; \mathcal{N}(f \mid \hat{f}, A),
\qquad \text{where } \hat{f} = \operatorname*{argmax}_{f} \Psi(f)
\ \text{ and } \ A = -\big(\nabla^2 \Psi(\hat{f})\big)^{-1}
\]

\[
\log p(f \mid D, \theta) \;=\; \Psi(f) + \text{const},
\qquad
\Psi(f) \;\triangleq\; \log p(D \mid f) \;-\; \tfrac{1}{2} f^{T} K^{-1} f \;-\; \tfrac{1}{2}\log|K| \;-\; \tfrac{n}{2}\log(2\pi)
\]

\[
\nabla \Psi(f) \;=\; \nabla \log p(D \mid f) \;-\; K^{-1} f,
\qquad
\nabla^2 \Psi(f) \;=\; \nabla^2 \log p(D \mid f) \;-\; K^{-1}
\]

(We omit the derivations of these two derivatives.) In practice its performance is often worse than that of the other approximations, but it is computationally much faster. Against HMC and ESS we easily see that Laplace's strong Gaussian assumption does suffer, showing up as higher variance.
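As an illustration, a minimal Newton-iteration sketch of the Laplace approximation for the probit GPC model could look as follows (again a simplified NumPy version rather than the GPML routine; it can reuse the rbf_kernel helper from the EP sketch):

    import numpy as np
    from scipy.stats import norm

    def laplace_gpc(K, y, n_iter=20):
        """Simplified Laplace approximation for GP classification (probit likelihood).

        Finds the mode f_hat of Psi(f) by Newton's method and returns (f_hat, A)
        with A = (K^{-1} + W)^{-1}, where W = -grad^2 log p(D|f) at the mode.
        No line search or convergence check; purely illustrative.
        """
        n = len(y)
        f = np.zeros(n)
        K_inv = np.linalg.inv(K)                 # acceptable for a small toy problem
        for _ in range(n_iter):
            z = y * f
            ratio = norm.pdf(z) / norm.cdf(z)
            grad_ll = y * ratio                  # d/df_i log Phi(y_i f_i)
            W = ratio * (z + ratio)              # -d^2/df_i^2 log Phi(y_i f_i)
            grad = grad_ll - K_inv @ f           # gradient of Psi
            hess = -(np.diag(W) + K_inv)         # Hessian of Psi
            f = f - np.linalg.solve(hess, grad)  # Newton step
        A = np.linalg.inv(np.diag(W) + K_inv)    # covariance of the Gaussian approximation
        return f, A

On the synthetic data above this would be called as f_hat, A = laplace_gpc(rbf_kernel(X), y).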

1.3 Variational Bayes (VB)

Variational Bayes (VB) partitions the variables and minimizes the Kullback-Leibler divergence from the approximating distribution q to the target posterior p by maximizing the evidence lower bound (ELBO), whose details we omit here. Since the KL divergence is not symmetric, the chosen direction makes the approximation pursue different goals: EP uses the reverse direction, KL(p || q), and obtains a more global approximation that often overestimates the posterior mass, while VB minimizes KL(q || p) and tends to fit a local mode well. The commonly used mean-field assumption underlying VB often prevents the approximation from capturing the full fidelity of the posterior, i.e. the correlations among the variables, which is especially limiting for a GP.
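To make the effect of the KL direction concrete, the following small numerical experiment (our own illustration, separate from the GPML runs) fits a single Gaussian to a bimodal target by a grid search under each direction; the VB-style objective KL(q || p) locks onto one mode, while the EP-style objective KL(p || q) yields a broad, mass-covering Gaussian:

    import numpy as np

    # a bimodal "posterior": equal mixture of N(-2, 0.5^2) and N(2, 0.5^2)
    x = np.linspace(-8.0, 8.0, 4001)
    dx = x[1] - x[0]

    def gauss(x, mu, sigma):
        return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

    p = 0.5 * gauss(x, -2.0, 0.5) + 0.5 * gauss(x, 2.0, 0.5)

    def kl(a, b):
        """Numerical KL(a || b) on the grid."""
        mask = a > 1e-300
        return np.sum(a[mask] * np.log(a[mask] / np.maximum(b[mask], 1e-300))) * dx

    # grid search over single-Gaussian approximations q = N(mu, sigma^2)
    best_qp, best_pq = None, None
    for mu in np.linspace(-4.0, 4.0, 81):
        for s in np.linspace(0.2, 4.0, 77):
            q = gauss(x, mu, s)
            kqp, kpq = kl(q, p), kl(p, q)
            if best_qp is None or kqp < best_qp[0]:
                best_qp = (kqp, mu, s)           # VB-style direction: KL(q || p)
            if best_pq is None or kpq < best_pq[0]:
                best_pq = (kpq, mu, s)           # EP-style direction: KL(p || q)

    print("argmin KL(q||p): mu=%.2f sigma=%.2f" % best_qp[1:])  # locks onto one mode
    print("argmin KL(p||q): mu=%.2f sigma=%.2f" % best_pq[1:])  # broad, covers both modes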

Conclusion

Unfortunately, due to severe time limitations we could only compare these methods. We noticed that the Laplace approximation, being a modal approximation, behaves differently from EP and VB, while VB seems to achieve a better (tighter) approximation to the posterior mean. VB appears to give a reasonably good approximation on our low-dimensional toy example, where the number of variables is very small. We hope that in the future we can examine these behaviors much more carefully.
