
STAT G6103 Bayesian Data Analysis, HW12

Lichi Li

Project: Approximate Inference on GP Classification

In this project we study the empirical properties of several approximate inference techniques for Gaussian process classification, using synthetic data to examine their behavior. For convenience we conduct the experiments in GPML rather than Stan, comparing modal and distributional approximations against posterior draws obtained from sampling algorithms.
Our synthetic data is generated from two Gaussians in a two-dimensional space, with the following setting:








\[
\mu_1 = \begin{pmatrix} 0.75 \\ 0 \end{pmatrix}, \quad
\mu_2 = \begin{pmatrix} -0.75 \\ 0 \end{pmatrix}, \quad
S_1 = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \quad
S_2 = \begin{pmatrix} 1 & 0.95 \\ 0.95 & 1 \end{pmatrix}
\]
We then generate 80 samples from the first Gaussian and 40 samples from the second, and assign the label +1 to all samples from the first class and -1 to those from the second. We then use a Gaussian process classifier on this data and examine its characteristics under the different approximate inference techniques.
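For concreteness, a minimal NumPy sketch of generating such a dataset might look as follows (the actual experiments in this report use the GPML toolbox in MATLAB; the random seed and variable names here are our own):

    import numpy as np

    rng = np.random.default_rng(0)

    # class means and covariances as specified above
    m1, m2 = np.array([0.75, 0.0]), np.array([-0.75, 0.0])
    S1 = np.eye(2)
    S2 = np.array([[1.0, 0.95],
                   [0.95, 1.0]])

    # 80 samples from the first class, 40 from the second, labels +1 / -1
    X = np.vstack([rng.multivariate_normal(m1, S1, size=80),
                   rng.multivariate_normal(m2, S2, size=40)])
    y = np.concatenate([np.ones(80), -np.ones(40)])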

1.1 Expectation Propagation (EP)

Recall that expectation propagation (EP) is a structured, distributional approximation that can be applied to GP classification; it iteratively updates site functions, one per data point, to mimic the exact likelihood terms. In particular, EP looks for a Gaussian approximation q(f | D, θ) = N(f | m, A) to the posterior p(f | D, θ) by placing approximating site functions on the likelihood:
\[
p(f \mid D, \theta)
\;=\; \frac{p(f \mid X, \theta)\, \prod_{i=1}^{m} p(y_i \mid f_i)}{p(D \mid \theta)}
\;\approx\; \frac{p(f \mid X, \theta)\, \prod_{i=1}^{m} t(f(x_i), \tilde{\mu}_i, \tilde{\sigma}_i^2, \tilde{Z}_i)}{p(D \mid \theta)}
\;=\; q(f \mid D, \theta)
\]
where t(f(x_i), μ̃_i, σ̃_i², Z̃_i) = Z̃_i N(f(x_i) | μ̃_i, σ̃_i²), whose parameters are called the site parameters. The approximation is specified by the mean m = A Σ̃⁻¹ μ̃ and covariance A = (K⁻¹ + Σ̃⁻¹)⁻¹, where μ̃ = [μ̃(x_1), μ̃(x_2), ..., μ̃(x_m)]ᵀ and Σ̃ = diag(σ̃_1², σ̃_2², ..., σ̃_m²). The idea at each iteration is to take the current likelihood term p(y_i | f_i) out of p(f) to form the cavity distribution,
\[
p_{-i}(f(x_i)) \;=\; \int \prod_{j \neq i}^{m} p(y_j \mid f(x_j))\, p(f \mid X, \theta)\, df_{-i}
\]
where the integral is over all latent values except f(x_i). This marginalization is intractable to compute for GPC, so in practice the cavity distribution is approximated as well,
\[
q_{-i}(f(x_i)) \;=\; \int \prod_{j \neq i}^{m} t(f(x_j), \tilde{\mu}_j, \tilde{\sigma}_j^2, \tilde{Z}_j)\, p(f \mid X, \theta)\, df_{-i}
\;\propto\; \mathcal{N}(f(x_i) \mid \mu_{-i}, \sigma_{-i}^2)
\]

which in turn gives us an approximation to the posterior marginal through:

\[
q_{-i}(f(x_i))\, t(f(x_i), \tilde{\mu}_i, \tilde{\sigma}_i^2, \tilde{Z}_i)
\;=\; \int \mathcal{N}(f \mid 0, K)\, \prod_{j=1}^{m} t(f(x_j), \tilde{\mu}_j, \tilde{\sigma}_j^2, \tilde{Z}_j)\, df_{-i}
\;\propto\; \mathcal{N}(f(x_i) \mid m_i, A_{ii})
\]

with parameters

\[
\sigma_{-i}^2 = \big( (A_{ii})^{-1} - \tilde{\sigma}_i^{-2} \big)^{-1}
\qquad \text{and} \qquad
\mu_{-i} = \sigma_{-i}^2 \left( \frac{m_i}{A_{ii}} - \frac{\tilde{\mu}_i}{\tilde{\sigma}_i^2} \right)
\]

EP then adjusts the site parameters so that the approximate posterior marginal formed with the exact likelihood and the one formed with the site function are as close to identical as possible, by matching their zeroth, first, and second moments:

\[
q_{-i}(f(x_i))\, p(y_i \mid f_i) \;\approx\; q_{-i}(f(x_i))\, t(f(x_i), \tilde{\mu}_i, \tilde{\sigma}_i^2, \tilde{Z}_i)
\]
where the moment matching (implicitly) minimizes the Kullback-Leibler divergence between the targets:
\[
t_i^{\text{new}}(f(x_i), \tilde{\mu}_i, \tilde{\sigma}_i^2, \tilde{Z}_i):\qquad
t_i^{\text{new}} \;=\; \operatorname*{argmin}_{t_i}\;
\mathrm{KL}\!\left(
\frac{q(f)}{t_i^{\text{old}}}\, p(y_i \mid f_i)
\;\Big\|\;
\frac{q(f)}{t_i^{\text{old}}}\, t_i
\right)
\]
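Although we do not derive the moment update functions here, the EP loop can be sketched in a few dozen lines. The following simplified NumPy/SciPy version uses the standard closed-form moments of a probit likelihood against a Gaussian cavity; the RBF kernel helper, variable names, and the brute-force recomputation of the posterior are our own simplifications and not the GPML implementation:

    import numpy as np
    from scipy.stats import norm

    def rbf_kernel(X, lengthscale=1.0, variance=1.0, jitter=1e-6):
        """Squared-exponential kernel matrix with a small diagonal jitter."""
        d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
        return variance * np.exp(-0.5 * d2 / lengthscale ** 2) + jitter * np.eye(len(X))

    def ep_gpc(K, y, n_sweeps=10):
        """Simplified EP for GP classification with a probit likelihood.

        Returns the mean and covariance of the Gaussian approximation q(f) = N(m, A).
        The global approximation is recomputed from scratch after every site update
        (O(n^3) each time), which is wasteful but keeps the sketch transparent;
        no numerical safeguards are included.
        """
        n = len(y)
        nu_t = np.zeros(n)    # site natural parameters: nu~_i = mu~_i / sigma~_i^2
        tau_t = np.zeros(n)   # tau~_i = 1 / sigma~_i^2
        mu, Sigma = np.zeros(n), K.copy()
        for _ in range(n_sweeps):
            for i in range(n):
                # cavity distribution: remove site i from the current marginal
                tau_ni = 1.0 / Sigma[i, i] - tau_t[i]
                nu_ni = mu[i] / Sigma[i, i] - nu_t[i]
                s2_ni, mu_ni = 1.0 / tau_ni, nu_ni / tau_ni
                # moments of cavity * exact probit likelihood
                z = y[i] * mu_ni / np.sqrt(1.0 + s2_ni)
                ratio = norm.pdf(z) / norm.cdf(z)
                mu_hat = mu_ni + y[i] * s2_ni * ratio / np.sqrt(1.0 + s2_ni)
                s2_hat = s2_ni - s2_ni ** 2 * ratio * (z + ratio) / (1.0 + s2_ni)
                # moment matching -> new site parameters
                tau_t[i] = 1.0 / s2_hat - tau_ni
                nu_t[i] = mu_hat / s2_hat - nu_ni
                # refresh the global Gaussian approximation q(f) = N(mu, Sigma)
                Sigma = np.linalg.inv(np.linalg.inv(K) + np.diag(tau_t))
                mu = Sigma @ nu_t
        return mu, Sigma

On the synthetic data above this could be run as mu, Sigma = ep_gpc(rbf_kernel(X), y); a production implementation would instead use rank-one updates and a numerically stable parameterization.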

(Here we omit the derivations of the moment update functions for GPC; the sketch above only states the resulting expressions for the probit likelihood.) To conduct the comparison, we may treat MCMC samples as ground truth, having run the chains long enough to be confident of convergence. For example, we can compare against Hamiltonian Monte Carlo (HMC):

[Figure: EP posterior summaries compared against HMC samples.]
(The panel titles may be difficult to read due to resolution issues: the top left corresponds to the latent mean, the top right to the latent standard deviation, and the bottom left and bottom right to their observational counterparts.) As we can see, EP does a decent job of recovering all of these quantities except the observational standard deviation. One interesting phenomenon is the jump that occurs in f, the cause of which is not very clear to me; it is not observed when comparing against elliptical slice sampling (ESS).
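For completeness, a single ESS transition for the latent vector f can be written in a few lines. This is a generic sketch of the standard elliptical slice sampling update, not the sampler actually used in our GPML runs:

    import numpy as np

    def ess_step(f, chol_K, log_lik, rng):
        """One elliptical slice sampling transition for a latent vector f with
        prior N(0, K), where chol_K is the lower Cholesky factor of K and
        log_lik(f) returns the log-likelihood."""
        nu = chol_K @ rng.standard_normal(len(f))      # auxiliary draw from the prior
        log_y = log_lik(f) + np.log(rng.uniform())     # slice level
        theta = rng.uniform(0.0, 2.0 * np.pi)
        theta_min, theta_max = theta - 2.0 * np.pi, theta
        while True:
            f_new = f * np.cos(theta) + nu * np.sin(theta)
            if log_lik(f_new) > log_y:
                return f_new
            # shrink the angle bracket towards theta = 0 and retry
            if theta < 0.0:
                theta_min = theta
            else:
                theta_max = theta
            theta = rng.uniform(theta_min, theta_max)

A probit log-likelihood closure such as lambda f: norm.logcdf(y * f).sum() (with norm from scipy.stats) can be passed as log_lik, and chol_K obtained once via np.linalg.cholesky(K).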

1.2 Laplace Approximation

Recall that the Laplace approximation is a simple Gaussian approximation whose mean is placed at the mode of the target posterior and whose covariance is obtained from the curvature (second derivative) there; it deploys a second-order Taylor expansion of the log posterior around the mode, so that the approximation resembles the posterior curvature in that neighbourhood.
\[
q(f) \;=\; \mathcal{N}(f \mid \hat{f}, A),
\qquad \text{where } \hat{f} = \operatorname*{argmax}_{f} \Psi(f)
\ \text{ and } \ A = -\big(\nabla^2 \Psi(\hat{f})\big)^{-1}
\]

\[
\log p(f \mid D, \theta) \;=\; \Psi(f) + \text{const},
\qquad
\Psi(f) \;\triangleq\; \log p(D \mid f) \;-\; \tfrac{1}{2} f^{T} K^{-1} f \;-\; \tfrac{1}{2}\log|K| \;-\; \tfrac{n}{2}\log(2\pi)
\]

\[
\nabla \Psi(f) \;=\; \nabla \log p(D \mid f) \;-\; K^{-1} f,
\qquad
\nabla^2 \Psi(f) \;=\; \nabla^2 \log p(D \mid f) \;-\; K^{-1}
\]

(We omit the derivations of these two derivatives.) In practice its performance is often worse than that of the other approximations, but it is computationally much faster. Against HMC and ESS we easily see that Laplace's strong Gaussian assumption does suffer, showing up as higher variance.
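As an illustration, a minimal Newton-iteration sketch of the Laplace approximation for the probit GPC model could look as follows (again a simplified NumPy version rather than the GPML routine; it can reuse the rbf_kernel helper from the EP sketch):

    import numpy as np
    from scipy.stats import norm

    def laplace_gpc(K, y, n_iter=20):
        """Simplified Laplace approximation for GP classification (probit likelihood).

        Finds the mode f_hat of Psi(f) by Newton's method and returns (f_hat, A)
        with A = (K^{-1} + W)^{-1}, where W = -grad^2 log p(D|f) at the mode.
        No line search or convergence check; purely illustrative.
        """
        n = len(y)
        f = np.zeros(n)
        K_inv = np.linalg.inv(K)                 # acceptable for a small toy problem
        for _ in range(n_iter):
            z = y * f
            ratio = norm.pdf(z) / norm.cdf(z)
            grad_ll = y * ratio                  # d/df_i log Phi(y_i f_i)
            W = ratio * (z + ratio)              # -d^2/df_i^2 log Phi(y_i f_i)
            grad = grad_ll - K_inv @ f           # gradient of Psi
            hess = -(np.diag(W) + K_inv)         # Hessian of Psi
            f = f - np.linalg.solve(hess, grad)  # Newton step
        A = np.linalg.inv(np.diag(W) + K_inv)    # covariance of the Gaussian approximation
        return f, A

On the synthetic data above this would be called as f_hat, A = laplace_gpc(rbf_kernel(X), y).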

1.3 Variational Bayes (VB)

Variational Bayes (VB) partitions the variables and minimizes the Kullback-Leibler divergence from the approximating distribution q to the target posterior p by maximizing the evidence lower bound (ELBO), whose details we omit here. Since the KL divergence is not symmetric, the chosen direction makes the approximation pursue different goals: EP uses the reverse direction, KL(p || q), and obtains a more global approximation that often overestimates the posterior mass, while VB minimizes KL(q || p) and tends to fit a local mode well. The commonly used mean-field assumption underlying VB often prevents the approximation from capturing the full fidelity of the posterior, i.e. the correlations among the variables, which is especially limiting for a GP.
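To make the effect of the KL direction concrete, the following small numerical experiment (our own illustration, separate from the GPML runs) fits a single Gaussian to a bimodal target by a grid search under each direction; the VB-style objective KL(q || p) locks onto one mode, while the EP-style objective KL(p || q) yields a broad, mass-covering Gaussian:

    import numpy as np

    # a bimodal "posterior": equal mixture of N(-2, 0.5^2) and N(2, 0.5^2)
    x = np.linspace(-8.0, 8.0, 4001)
    dx = x[1] - x[0]

    def gauss(x, mu, sigma):
        return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

    p = 0.5 * gauss(x, -2.0, 0.5) + 0.5 * gauss(x, 2.0, 0.5)

    def kl(a, b):
        """Numerical KL(a || b) on the grid."""
        mask = a > 1e-300
        return np.sum(a[mask] * np.log(a[mask] / np.maximum(b[mask], 1e-300))) * dx

    # grid search over single-Gaussian approximations q = N(mu, sigma^2)
    best_qp, best_pq = None, None
    for mu in np.linspace(-4.0, 4.0, 81):
        for s in np.linspace(0.2, 4.0, 77):
            q = gauss(x, mu, s)
            kqp, kpq = kl(q, p), kl(p, q)
            if best_qp is None or kqp < best_qp[0]:
                best_qp = (kqp, mu, s)           # VB-style direction: KL(q || p)
            if best_pq is None or kpq < best_pq[0]:
                best_pq = (kpq, mu, s)           # EP-style direction: KL(p || q)

    print("argmin KL(q||p): mu=%.2f sigma=%.2f" % best_qp[1:])  # locks onto one mode
    print("argmin KL(p||q): mu=%.2f sigma=%.2f" % best_pq[1:])  # broad, covers both modes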

Conclusion

Unfortunately, due to severe time limitations we could only compare these methods. We noticed that the Laplace approximation, being a modal approximation, behaves differently from EP and VB, while VB seems to achieve a better (tighter) approximation to the posterior mean. VB appears to give a reasonably good approximation on our low-dimensional toy example, where the number of variables is very small. We hope that in the future we can examine these behaviors much more carefully.
