
Variational Inference:

Foundations and Modern Methods

David Blei, Rajesh Ranganath, Shakir Mohamed

NIPS 2016 Tutorial · December 5, 2016


Communities discovered in a 3.7M node network of U.S. Patents
[Gopalan and Blei, PNAS 2013]
Topics found in 1.8M articles from the New York Times
[Hoffman, Blei, Wang, Paisley, JMLR 2013]
[Figure: fifteen example topics, each shown as a list of its top words]
Scenes, concepts, and control
[Eslami et al., 2016; Lake et al., 2015]
[Figure: 3D scene understanding results from the Attend, Infer, Repeat (AIR) framework]

Population analysis of 2 billion genetic measurements
[Gopalan, Hao, Blei, Storey, Nature Genetics (in press)]
[Figure: inferred admixture proportions for individuals from worldwide populations]

Neuroscience analysis of 220 million fMRI measurements
[Manning et al., PLOS ONE 2014]
Compression and content generation
[Van den Oord et al., 2016; Gregor et al., 2016]

Analysis of 1.7M taxi trajectories, in Stan
[Kucukelbir et al., 2016]
The probabilistic pipeline

[Diagram: knowledge, data, and a question feed the pipeline — make assumptions, discover patterns, predict & explore]

Customized data analysis is important to many fields.

„ Pipeline separates assumptions, computation, application

„ Eases collaborative solutions to statistics problems


The probabilistic pipeline

[Diagram: the same pipeline — make assumptions, discover patterns, predict & explore]

Inference is the key algorithmic problem.

„ Answers the question: What does this model say about this data?

„ Our goal: General and scalable approaches to inference


The probabilistic pipeline

[Diagram: the full loop — make assumptions, discover patterns, predict & explore; criticize the model; revise]

[Box, 1980; Rubin, 1984; Gelman et al., 1996; Blei, 2014]


PART I

Main ideas and historical context


Probabilistic Machine Learning

„ A probabilistic model is a joint distribution of hidden variables z and observed variables x,

    p(z, x).

„ Inference about the unknowns is through the posterior, the conditional distribution of the hidden variables given the observations,

    p(z | x) = p(z, x) / p(x).

„ For most interesting models, the denominator is not tractable. We appeal to approximate posterior inference.
Variational Inference
[Figure: the variational family q(z; ν); optimization moves ν from ν_init to the ν* minimizing KL(q(z; ν*) || p(z | x))]

„ VI turns inference into optimization.


„ Posit a variational family of distributions over the latent variables,

q(z; ν)

„ Fit the variational parameters ν to be close (in KL) to the exact posterior.
(There are alternative divergences, which connect to algorithms like EP, BP, and others.)
Example: Mixture of Gaussians

[Figure: the variational approximation at initialization and at iterations 20, 28, 35, and 50, alongside the evidence lower bound and the average log predictive over iterations; images by Alp Kucukelbir]
History

[Figures from Peterson and Anderson (1987), Jordan et al. (1999), and Hinton and van Camp (1993)]

„ Variational inference adapts ideas from statistical physics to probabilistic inference.

„ Arguably, it began in the late eighties with Peterson and Anderson (1987), who used mean-field methods to fit a neural network.

„ This idea was picked up by Jordan's lab in the early 1990s (Tommi Jaakkola, Lawrence Saul, Zoubin Ghahramani), who generalized it to many probabilistic models. (A review paper is Jordan et al., 1999.)

„ In parallel, Hinton and Van Camp (1993) also developed mean-field methods for neural networks. Neal and Hinton (1993) connected this idea to the EM algorithm, which led to further variational methods for mixtures of experts (Waterhouse et al., 1996) and HMMs (MacKay, 1997).
Today

[Figures: samples and reconstructions from deep generative models (NORB, CIFAR-10 patches, Frey faces), MNIST imputations, and a two-dimensional latent embedding of MNIST]
[Kingma and Welling 2013] [Rezende et al. 2014] [Kucukelbir et al. 2015]

A simple nonconjugate probability model specified in Stan [Kucukelbir et al. 2015]:

    data {
      int N;                    // number of observations
      int x[N];                 // discrete-valued observations
    }
    parameters {
      real<lower=0> theta;      // latent variable, must be positive
    }
    model {
      theta ~ weibull(1.5, 1);  // nonconjugate prior for latent variable
      for (n in 1:N)
        x[n] ~ poisson(theta);  // likelihood
    }

„ There is now a flurry of new work on variational inference, making it scalable, easier to derive, faster, more accurate, and applying it to more complicated models and applications.

„ Modern VI touches many important areas: probabilistic programming, reinforcement learning, neural networks, convex optimization, Bayesian statistics, and myriad applications.

„ Our goal today is to teach you the basics, explain some of the newer ideas, and to suggest open areas of new research.
Variational Inference:
Foundations and Modern Methods
Part II: Mean-field VI and stochastic VI
Jordan+, Introduction to Variational Methods for Graphical Models, 1999
Ghahramani and Beal, Propagation Algorithms for Variational Bayesian Learning, 2001
Hoffman+, Stochastic Variational Inference, 2013

Part III: Stochastic gradients of the ELBO


Kingma and Welling, Auto-Encoding Variational Bayes, 2014
Ranganath+, Black Box Variational Inference, 2014
Rezende+, Stochastic Backpropagation and Approximate Inference in Deep Generative Models, 2014

Part IV: Beyond the mean field


Agakov and Barber, An Auxiliary Variational Method, 2004
Gregor+, DRAW: A recurrent neural network for image generation, 2015
Rezende+, Variational Inference with Normalizing Flows, 2015
Ranganath+, Hierarchical Variational Models, 2015
Maaløe+, Auxiliary Deep Generative Models, 2016
Variational Inference:
Foundations and Modern Methods

[Figure: q(z; ν) moving from ν_init toward the member closest in KL to p(z | x)]

VI approximates difficult quantities from complex models.


With stochastic optimization we can
„ scale up VI to massive data
„ enable VI on a wide class of difficult models
„ enable VI with elaborate and flexible families of approximations
PART II

Mean-field variational inference


and stochastic variational inference
Motivation: Topic Modeling

Topic models use posterior inference to discover the hidden thematic


structure in a large collection of documents.
Example: Latent Dirichlet Allocation (LDA)

Documents exhibit multiple topics.


Example: Latent Dirichlet Allocation (LDA)

[Figure: topics as distributions over words (e.g. gene/dna/genetic, life/evolve/organism, brain/neuron/nerve, data/number/computer), a document, and its topic proportions and per-word topic assignments]

„ Each topic is a distribution over words


„ Each document is a mixture of corpus-wide topics
„ Each word is drawn from one of those topics
Example: Latent Dirichlet Allocation (LDA)
[Figure: the same cartoon, with the topics, proportions, and assignments now hidden]

„ But we only observe the documents; everything else is hidden.


„ So we want to calculate the posterior

p(topics, proportions, assignments | documents)

(Note: millions of documents; billions of latent variables)


LDA as a Graphical Model

[Graphical model: proportions parameter α, per-document topic proportions θ_d, per-word topic assignment z_{d,n}, observed word w_{d,n}, topics β_k, topic parameter η; plates N, D, K]

„ Encodes assumptions about data with a factorization of the joint


„ Connects assumptions to algorithms for computing with data
„ Defines the posterior (through the joint)
Posterior Inference

[Graphical model: α, θ_d, z_{d,n}, w_{d,n}, β_k, η; plates N, D, K]

„ The posterior of the latent variables given the documents is

    p(β, θ, z | w) = p(β, θ, z, w) / ∫_β ∫_θ Σ_z p(β, θ, z, w).

„ We can’t compute the denominator, the marginal p(w).


„ We use approximate inference.
Topics found in 1.8M articles from the New York Times
[Hoffman, Blei, Wang, Paisley, JMLR 2013]
[Figure: fifteen example topics, each shown as a list of its top words]
Mean-field VI and Stochastic VI

Subsample data → Infer local structure → Update global structure

Road map:

„ Define the generic class of conditionally conjugate models


„ Derive classical mean-field VI
„ Derive stochastic VI, which scales to massive data
A Generic Class of Models

[Graphical model: global variables β; local variables z_i and observations x_i in a plate of size n]

    p(β, z, x) = p(β) ∏_{i=1}^n p(z_i, x_i | β)

„ The observations are x = x1:n .


„ The local variables are z = z1:n .
„ The global variables are β.
„ The ith data point xi only depends on zi and β.

Compute p(β, z | x).


A Generic Class of Models

[Graphical model: global variables β; local variables z_i and observations x_i in a plate of size n]

    p(β, z, x) = p(β) ∏_{i=1}^n p(z_i, x_i | β)

„ A complete conditional is the conditional of a latent variable given the


observations and other latent variables.

„ Assume each complete conditional is in the exponential family,

p(zi | β, xi ) = h(zi ) exp{η` (β, xi )> zi − a(η` (β, xi ))}


p(β | z, x) = h(β) exp{ηg (z, x)> β − a(ηg (z, x))}.
A Generic Class of Models

[Graphical model: global variables β; local variables z_i and observations x_i in a plate of size n]

    p(β, z, x) = p(β) ∏_{i=1}^n p(z_i, x_i | β)

„ A complete conditional is the conditional of a latent variable given the observations and the other latent variables.

„ The global parameter comes from conjugacy [Bernardo and Smith, 1994]:

    η_g(z, x) = α + Σ_{i=1}^n t(z_i, x_i),

where α is a hyperparameter and t(·) are sufficient statistics for [z_i, x_i].


A Generic Class of Models

[Graphical model: global variables β; local variables z_i and observations x_i in a plate of size n]

    p(β, z, x) = p(β) ∏_{i=1}^n p(z_i, x_i | β)

„ Bayesian mixture models
„ Time series models (HMMs, linear dynamical systems)
„ Factorial models
„ Matrix factorization (factor analysis, PCA, CCA)
„ Dirichlet process mixtures, HDPs
„ Multilevel regression (linear, probit, Poisson)
„ Stochastic block models
„ Mixed-membership models (LDA and some variants)
Variational Inference

[Figure: the VI cartoon — ν_init moving toward the ν* minimizing KL(q(z; ν*) || p(z | x))]

Minimize KL between q(β, z; ν) and the posterior p(β, z | x).


The Evidence Lower Bound

L (ν) = Eq [log p(β, z, x)] − Eq [log q(β, z; ν)]

„ KL is intractable; VI optimizes the evidence lower bound (ELBO) instead.


ƒ It is a lower bound on log p(x).
ƒ Maximizing the ELBO is equivalent to minimizing the KL.

„ The ELBO trades off two terms.


ƒ The first term prefers q(·) to place its mass on the MAP estimate.
ƒ The second term encourages q(·) to be diffuse.

„ Caveat: The ELBO is not convex.
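A minimal numpy sketch (an illustration of mine, not code from the tutorial) makes the lower-bound property concrete on a toy conjugate model, p(z) = N(0, 1) and p(x | z) = N(z, 1), where log p(x) = log N(x; 0, 2) is available in closed form; a Monte Carlo estimate of the ELBO under any Gaussian q stays below it and is tight at the exact posterior N(x/2, 1/2).

    import numpy as np

    def log_normal(x, mean, var):
        return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

    def elbo_estimate(x, m, s2, num_samples=100000, seed=0):
        # Monte Carlo estimate of E_q[log p(z, x) - log q(z)] with q = N(m, s2).
        rng = np.random.default_rng(seed)
        z = m + np.sqrt(s2) * rng.standard_normal(num_samples)
        log_p = log_normal(z, 0.0, 1.0) + log_normal(x, z, 1.0)  # log p(z) + log p(x | z)
        log_q = log_normal(z, m, s2)
        return np.mean(log_p - log_q)

    x = 1.3
    print("log p(x)      =", log_normal(x, 0.0, 2.0))       # exact evidence
    print("ELBO, loose q =", elbo_estimate(x, 0.0, 1.0))    # strictly below log p(x)
    print("ELBO, exact q =", elbo_estimate(x, x / 2, 0.5))  # matches log p(x); KL = 0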


Mean-field Variational Inference

[Figure: the model p(β, z, x) and the factorized variational family, connected through the ELBO]

„ We need to specify the form of q(β, z).

„ The mean-field family is fully factorized,


    q(β, z; λ, φ) = q(β; λ) ∏_{i=1}^n q(z_i; φ_i).

„ Each factor is the same family as the model’s complete conditional,

p(β | z, x) = h(β) exp{ηg (z, x)> β − a(ηg (z, x))}


q(β; λ) = h(β) exp{λ> β − a(λ)}.
Mean-field Variational Inference

[Figure: the model p(β, z, x) and the factorized variational family, connected through the ELBO]

„ Optimize the ELBO,

L (λ, φ) = Eq [log p(β, z, x)] − Eq [log q(β, z)] .

„ Traditional VI uses coordinate ascent [Ghahramani and Beal, 2001]:

    λ* = E_φ[η_g(z, x)];    φ_i* = E_λ[η_ℓ(β, x_i)]

„ Iteratively update each parameter, holding others fixed.


ƒ Notice the relationship to Gibbs sampling [Gelfand and Smith, 1990] .
ƒ Caveat: The ELBO is not convex.
Mean-field Variational Inference for LDA

[Graphical model: variational parameters γ_d, φ_{d,n}, λ_k attached to θ_d, z_{d,n}, β_k; model variables α, θ_d, z_{d,n}, w_{d,n}, β_k, η; plates N, D, K]

„ The local variables are the per-document variables θd and zd .


„ The global variables are the topics β1 , . . . , βK .
„ The variational distribution is
    q(β, θ, z) = ∏_{k=1}^K q(β_k; λ_k) ∏_{d=1}^D q(θ_d; γ_d) ∏_{n=1}^N q(z_{d,n}; φ_{d,n})
k=1 d=1 n=1
Mean-field Variational Inference for LDA

[Figure: inferred topic proportions for a single document — probability against topic index, with a few dominant topics]
Mean-field Variational Inference for LDA

“Genetics” “Evolution” “Disease” “Computers”


human evolution disease computer
genome evolutionary host models
dna species bacteria information
genetic organisms diseases data
genes life resistance computers
sequence origin bacterial system
gene biology new network
molecular groups strains systems
sequencing phylogenetic control model
map living infectious parallel
information diversity malaria methods
genetics group parasite networks
mapping new parasites software
project two united new
sequences common tuberculosis simulations
Classical Variational Inference

Input: data x, model p(β, z, x).


Initialize λ randomly.

repeat
for each data point i do
Set local parameter φi ← Eλ [η` (β, xi )].
end

Set global parameter


    λ ← α + Σ_{i=1}^n E_{φ_i}[t(Z_i, x_i)].

until the ELBO has converged
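To make these updates concrete, here is a minimal numpy sketch (my own illustration, not the tutorial's code) of coordinate ascent for a toy conditionally conjugate model: a mixture of K unit-variance Gaussians with a N(0, σ0²) prior on each component mean, q(µ_k) = N(m_k, s²_k) and q(c_i) = Categorical(φ_i).

    import numpy as np

    def cavi_gmm(x, K=3, sigma0_sq=10.0, iters=100, seed=0):
        # Coordinate ascent VI for a mixture of unit-variance Gaussians.
        # q(mu_k) = N(m[k], s2[k]),  q(c_i) = Categorical(phi[i]).
        rng = np.random.default_rng(seed)
        m = rng.standard_normal(K)      # global variational means
        s2 = np.ones(K)                 # global variational variances
        for _ in range(iters):
            # Local step: update assignment probabilities for every data point.
            logits = np.outer(x, m) - 0.5 * (s2 + m ** 2)          # shape (n, K)
            logits -= logits.max(axis=1, keepdims=True)
            phi = np.exp(logits)
            phi /= phi.sum(axis=1, keepdims=True)
            # Global step: update the Gaussian factors on the component means.
            precision = 1.0 / sigma0_sq + phi.sum(axis=0)
            m = (phi * x[:, None]).sum(axis=0) / precision
            s2 = 1.0 / precision
        return m, s2, phi

    rng = np.random.default_rng(1)
    x = np.concatenate([rng.normal(-4, 1, 100), rng.normal(0, 1, 100), rng.normal(4, 1, 100)])
    m, s2, phi = cavi_gmm(x)
    print("estimated component means:", np.sort(m))

Note that the local step touches every data point before each global update; this is exactly the inefficiency that stochastic variational inference removes below.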


A Generic Class of Models

[Graphical model: global variables β; local variables z_i and observations x_i in a plate of size n]

    p(β, z, x) = p(β) ∏_{i=1}^n p(z_i, x_i | β)

„ Bayesian mixture models
„ Time series models (HMMs, linear dynamical systems)
„ Factorial models
„ Matrix factorization (factor analysis, PCA, CCA)
„ Dirichlet process mixtures, HDPs
„ Multilevel regression (linear, probit, Poisson)
„ Stochastic block models
„ Mixed-membership models (LDA and some variants)
Stochastic Variational Inference

[Graphical model: variational parameters γ_d, φ_{d,n}, λ_k attached to θ_d, z_{d,n}, β_k; model variables α, θ_d, z_{d,n}, w_{d,n}, β_k, η; plates N, D, K]

„ Classical VI is inefficient:
ƒ Do some local computation for each data point.
ƒ Aggregate these computations to re-estimate global structure.
ƒ Repeat.
„ This cannot handle massive data.
„ Stochastic variational inference (SVI) scales VI to massive data.
Stochastic Variational Inference

MASSIVE DATA → GLOBAL HIDDEN STRUCTURE
[Figure: population structure inferred from the TGP data with the TeraStructure algorithm at K = 7, 8, 9]

Subsample data → Infer local structure → Update global structure
Stochastic Optimization

„ Replace the gradient with cheaper noisy estimates [Robbins and Monro, 1951]

„ Guaranteed to converge to a local optimum [Bottou, 1996]

„ Has enabled modern machine learning


Stochastic Optimization

„ With noisy gradients, update

    ν_{t+1} = ν_t + ρ_t ∇̂_ν L(ν_t)

„ Requires unbiased gradients, E[∇̂_ν L(ν)] = ∇_ν L(ν)

„ Requires the step size sequence ρt follows the Robbins-Monro conditions


Stochastic Variational Inference

„ The natural gradient of the ELBO [Amari, 1998; Sato, 2001]

    ∇_λ^nat L(λ) = (α + Σ_{i=1}^n E_{φ_i}[t(Z_i, x_i)]) − λ.

„ Construct a noisy natural gradient,

    j ∼ Uniform(1, . . . , n)
    ∇̂_λ^nat L(λ) = α + n E_{φ_j*}[t(Z_j, x_j)] − λ.

„ This is a good noisy gradient.


ƒ Its expectation is the exact gradient (unbiased).
ƒ It only depends on optimized parameters of one data point (cheap).
Stochastic Variational Inference

Input: data x, model p(β, z, x).


Initialize λ randomly. Set ρt appropriately.

repeat
Sample j ∼ Unif(1, . . . , n).
Set local parameter φ ← E_λ[η_ℓ(β, x_j)].

Set intermediate global parameter

λ̂ = α + nEφ [t(Zj , xj )].

Set global parameter

λ = (1 − ρt )λ + ρt λ̂.

until forever
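A minimal numpy sketch (my own) of this loop for the toy Gaussian mixture used earlier: the global factor is summarized by the sufficient statistics (Σ_i φ_ik x_i, Σ_i φ_ik), the intermediate estimate scales a single data point's statistics by n, and the update blends old and new with a Robbins-Monro step size; hyperparameters are illustrative.

    import numpy as np

    def svi_gmm(x, K=3, sigma0_sq=10.0, steps=5000, tau=1.0, kappa=0.7, seed=0):
        rng = np.random.default_rng(seed)
        n = len(x)
        # Global sufficient statistics: t1[k] ~ sum_i phi_ik x_i,  t2[k] ~ sum_i phi_ik.
        # Blending these is equivalent to blending the natural parameters,
        # since the constant prior term is added when forming (m, s2).
        t1 = rng.choice(x, size=K, replace=False)       # init means near random data points
        t2 = np.ones(K)
        for step in range(1, steps + 1):
            precision = 1.0 / sigma0_sq + t2
            m, s2 = t1 / precision, 1.0 / precision
            j = rng.integers(n)                         # subsample one data point
            logits = x[j] * m - 0.5 * (s2 + m ** 2)     # local step for point j
            phi = np.exp(logits - logits.max())
            phi /= phi.sum()
            rho = (step + tau) ** (-kappa)              # Robbins-Monro step size
            t1 = (1 - rho) * t1 + rho * n * phi * x[j]  # intermediate estimate scales by n
            t2 = (1 - rho) * t2 + rho * n * phi
        precision = 1.0 / sigma0_sq + t2
        return t1 / precision, 1.0 / precision

    rng = np.random.default_rng(1)
    x = np.concatenate([rng.normal(-4, 1, 1000), rng.normal(0, 1, 1000), rng.normal(4, 1, 1000)])
    print("estimated component means:", np.sort(svi_gmm(x)[0]))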
Stochastic Variational Inference

MASSIVE DATA → GLOBAL HIDDEN STRUCTURE
[Figure: population structure inferred from the TGP data with the TeraStructure algorithm at K = 7, 8, 9]

Subsample data → Infer local structure → Update global structure
Stochastic Variational Inference in LDA

[Graphical model: variational parameters γ_d, φ_{d,n}, λ_k attached to θ_d, z_{d,n}, β_k; model variables α, θ_d, z_{d,n}, w_{d,n}, β_k, η; plates N, D, K]

„ Sample a document
„ Estimate the local variational parameters using the current topics
„ Form intermediate topics from those local parameters
„ Update topics as a weighted average of intermediate and current topics
Stochastic Variational Inference in LDA

[Figure: held-out perplexity against the number of documents seen (log scale); online VI on 98K and on 3.3M documents reaches lower perplexity than batch VI on 98K documents]

Top eight words of one topic as more documents are analyzed:
2048: systems, road, made, service, announced, national, west, language
4096: systems, health, communication, service, billion, language, care, road
8192: service, systems, health, companies, market, communication, company, billion
12288: service, systems, companies, business, company, billion, health, industry
16384: service, companies, systems, business, company, industry, market, billion
32768: business, service, companies, industry, company, management, systems, services
49152: business, service, companies, industry, services, company, management, public
65536: business, industry, service, companies, services, company, management, public

[Hoffman et al., 2010]


Topics using the HDP, found in 1.8M articles from the New York Times
Modified from Hoffman et al. (2013).
[Figure: fifteen example topics, each shown as a list of its top words]
SVI scales many models

Subsample data → Infer local structure → Update global structure

„ Bayesian mixture models
„ Time series models (HMMs, linear dynamical systems)
„ Factorial models
„ Matrix factorization (factor analysis, PCA, CCA)
„ Dirichlet process mixtures, HDPs
„ Multilevel regression (linear, probit, Poisson)
„ Stochastic block models
„ Mixed-membership models (LDA and some variants)
[Figure: topics discovered in a large corpus of scientific articles, shown as clusters of related terms]

[Figure: inferred admixture proportions for individuals from worldwide populations]
PART III

Stochastic Gradients of the ELBO


Review: The Promise

[Diagram: the probabilistic pipeline — make assumptions, discover patterns, predict & explore]

„ Realized for conditionally conjugate models

„ What about the general case?


The Variational Inference Recipe

Start with a model:

p(z, x)
The Variational Inference Recipe

Choose a variational approximation:

q(z; ν)
The Variational Inference Recipe

Write down the ELBO:

L (ν) = Eq(z;ν) [log p(x, z) − log q(z; ν)]


The Variational Inference Recipe

Compute the expectation (the integral):

Example: L (ν) = xν2 + log ν


The Variational Inference Recipe

Take derivatives:

Example: ∇_ν L(ν) = 2xν + 1/ν
The Variational Inference Recipe

Optimize:

νt+1 = νt + ρt ∇ν L
The Variational Inference Recipe

[Diagram: the recipe — the model p(x, z) and the approximation q(z; ν) enter the expectation ∫ (···) q(z; ν) dz, which we then differentiate, ∇_ν]
Example: Bayesian Logistic Regression

„ Data pairs yi , xi
„ xi are covariates
„ yi are labels
„ z is the regression coefficient
„ Generative process

p(z) ∼ N(0, 1)
p(yi | xi , z) ∼ Bernoulli(σ(zxi ))
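For use with the gradient estimators later on, here is a minimal numpy sketch (mine) of the log joint log p(z) + log p(y | x, z) for this model, in the one-data-point, scalar-covariate setting of the next slides.

    import numpy as np

    def log_joint(z, x, y):
        # log p(z) + log p(y | x, z) for Bayesian logistic regression:
        # z ~ N(0, 1),  y | x, z ~ Bernoulli(sigmoid(z * x)).
        log_prior = -0.5 * (np.log(2 * np.pi) + z ** 2)
        logits = z * x
        log_lik = y * logits - np.log1p(np.exp(logits))   # Bernoulli log-likelihood
        return log_prior + log_lik

    print(log_joint(z=0.5, x=2.0, y=1))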
VI for Bayesian Logistic Regression

Assume:
„ We have one data point (y, x)
„ x is a scalar
„ The approximating family q is the normal; ν = (µ, σ2 )
The ELBO is

L (µ, σ2 ) = Eq [log p(z) + log p(y | x, z) − log q(z)]


VI for Bayesian Logistic Regression

    L(µ, σ²) = E_q[log p(z) − log q(z) + log p(y | x, z)]
             = −½(µ² + σ²) + ½ log σ² + E_q[log p(y | x, z)] + C
             = −½(µ² + σ²) + ½ log σ² + E_q[yxz − log(1 + exp(xz))]
             = −½(µ² + σ²) + ½ log σ² + yxµ − E_q[log(1 + exp(xz))]

We are stuck.
1. We cannot analytically take that expectation.
2. The expectation hides the objective's dependence on the variational parameters. This makes it hard to directly optimize.
Options?

„ Derive a model-specific bound:

[Jordan and Jaakkola; 1996], [Braun and McAuliffe; 2008], others

„ More general approximations that require model-specific analysis:


[Wang and Blei; 2013], [Knowles and Minka; 2011]
Nonconjugate Models

„ Nonlinear Time Series Models
„ Deep Latent Gaussian Models
„ Models with Attention (such as DRAW)
„ Generalized Linear Models (Poisson regression)
„ Stochastic Volatility Models
„ Discrete Choice Models
„ Bayesian Neural Networks
„ Deep Exponential Families (e.g. Sparse Gamma or Poisson)
„ Correlated Topic Models (including nonparametric variants)
„ Sigmoid Belief Networks

We need a solution that does not entail model-specific work


Black Box Variational Inference (BBVI)

[Diagram: reusable variational families + any model p(β, z | x) + massive data → black box variational inference]

„ Sample from q(·)
„ Form noisy gradients without model-specific computation
„ Use stochastic optimization


The Problem in the Classical VI Recipe

[Diagram: compute the expectation ∫ (···) q(z; ν) dz analytically, then differentiate — the integral is the bottleneck]
The New VI Recipe

[Diagram: move the gradient inside — estimate ∇_ν ∫ (···) q(z; ν) dz directly]

Use stochastic optimization!


Computing Gradients of Expectations

„ Define

g(z, ν) = log p(x, z) − log q(z; ν)

„ What is ∇_ν L?

    ∇_ν L = ∇_ν ∫ q(z; ν) g(z, ν) dz
          = ∫ [∇_ν q(z; ν)] g(z, ν) + q(z; ν) ∇_ν g(z, ν) dz
          = ∫ q(z; ν) [∇_ν log q(z; ν)] g(z, ν) + q(z; ν) ∇_ν g(z, ν) dz
          = E_{q(z;ν)}[∇_ν log q(z; ν) g(z, ν) + ∇_ν g(z, ν)]

using ∇_ν log q = ∇_ν q / q.
Roadmap

„ Score Function Gradients

„ Pathwise Gradients

„ Amortized Inference
Score Function Gradients of the ELBO
Score Function Estimator

Recall

∇ν L = Eq(z;ν) [∇ν log q(z; ν)g(z, ν) + ∇ν g(z, ν)]

Simplify, using ∇_ν g(z, ν) = −∇_ν log q(z; ν) and E_q[∇_ν log q(z; ν)] = 0:

    E_q[∇_ν g(z, ν)] = 0

Gives the gradient:

∇ν L = Eq(z;ν) [∇ν log q(z; ν)(log p(x, z) − log q(z; ν))]

Sometimes called likelihood ratio or REINFORCE gradients


[Glynn 1990; Williams, 1992; Wingate+ 2013; Ranganath+ 2014; Mnih+ 2014]
Noisy Unbiased Gradients

Gradient: Eq(z;ν) [∇ν log q(z; ν)(log p(x, z) − log q(z; ν))]

Noisy unbiased gradients with Monte Carlo!

    (1/S) Σ_{s=1}^S ∇_ν log q(z_s; ν) (log p(x, z_s) − log q(z_s; ν)),    where z_s ∼ q(z; ν)
Basic BBVI

Algorithm 1: Basic Black Box Variational Inference


Input : Model log p(x, z),
Variational approximation q(z; ν)
Output : Variational Parameters: ν

while not converged do


z[s] ∼ q // Draw S samples from q
ρ = t-th value of a Robbins Monro sequence
ν = ν + ρ (1/S) Σ_{s=1}^S ∇_ν log q(z[s]; ν) (log p(x, z[s]) − log q(z[s]; ν))
t=t+1
end
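Here is a minimal numpy sketch (my own, not the authors' implementation) of this loop for the one-data-point Bayesian logistic regression example, with q(z; ν) = N(µ, σ²) and ν = (µ, log σ); the step sizes and sample count are illustrative.

    import numpy as np

    def log_joint(z, x, y):
        return -0.5 * (np.log(2 * np.pi) + z ** 2) + y * z * x - np.log1p(np.exp(z * x))

    def bbvi(x, y, steps=20000, S=32, seed=0):
        rng = np.random.default_rng(seed)
        mu, log_sigma = 0.0, 0.0                       # variational parameters nu
        for t in range(1, steps + 1):
            sigma = np.exp(log_sigma)
            z = mu + sigma * rng.standard_normal(S)    # draw S samples from q
            log_q = -0.5 * np.log(2 * np.pi) - log_sigma - 0.5 * ((z - mu) / sigma) ** 2
            f = log_joint(z, x, y) - log_q             # log p(x, z) - log q(z; nu)
            score_mu = (z - mu) / sigma ** 2           # grad of log q w.r.t. mu
            score_ls = ((z - mu) / sigma) ** 2 - 1.0   # grad of log q w.r.t. log sigma
            rho = 0.1 * t ** (-0.7)                    # Robbins-Monro step size
            mu += rho * np.mean(score_mu * f)
            log_sigma += rho * np.mean(score_ls * f)
        return mu, np.exp(log_sigma)

    print(bbvi(x=2.0, y=1))   # the posterior mean shifts positive, since y = 1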
The requirements for inference

The noisy gradient:

    (1/S) Σ_{s=1}^S ∇_ν log q(z_s; ν) (log p(x, z_s) − log q(z_s; ν)),    where z_s ∼ q(z; ν)

To compute the noisy gradient of the ELBO we need


„ Sampling from q(z)
„ Evaluating ∇ν log q(z; ν)
„ Evaluating log p(x, z) and log q(z)

There is no model specific work: black box criteria are satisfied


Black Box Variational Inference

[Diagram: reusable variational families + any model p(β, z | x) + massive data → black box variational inference]

„ Sample from q(·)
„ Form noisy gradients without model-specific computation
„ Use stochastic optimization


Problem: Basic BBVI doesn’t work

Variance of the gradient can be a problem

Var_{q(z;ν)}[∇̂_ν L] = E_{q(z;ν)}[(∇_ν log q(z; ν)(log p(x, z) − log q(z; ν)) − ∇_ν L)²].

[Figure: a density (PDF), |z − µ|, and the score function plotted over z — the score grows large for rare values]
Intuition:
Sampling rare values can lead to large scores and thus high variance
Solution: Control Variates

Replace f with f̂, where E[f̂(z)] = E[f(z)]. A general such class:

    f̂(z) ≜ f(z) − a (h(z) − E[h(z)])

[Figure: the density, f = x + x², and control-variate versions f̂ with h = x² and with h = f]

„ h is a function of our choice
„ a is chosen to minimize the variance
„ Good h have high correlation with the original function f

„ For variational inference we need functions with a known expectation under q
„ Set h to ∇_ν log q(z; ν)
„ This is simple because E_q[∇_ν log q(z; ν)] = 0 for any q

Many of the other techniques from Monte Carlo can also help:
„ Importance sampling, quasi-Monte Carlo, Rao-Blackwellization

[Ruiz+ 2016; Ranganath+ 2014; Titsias+ 2015; Mnih+ 2016]
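A minimal numpy sketch (mine) of the recipe for a single variational parameter µ: the control variate is h(z) = ∇_µ log q(z; ν), the scaling a is estimated from the same samples as Cov(hf, h)/Var(h), and the controlled estimator keeps the same mean with lower variance.

    import numpy as np

    rng = np.random.default_rng(0)
    x_obs, y_obs, mu, sigma = 2.0, 1, 0.0, 1.0

    def log_joint(z):
        return -0.5 * (np.log(2 * np.pi) + z ** 2) + y_obs * z * x_obs - np.log1p(np.exp(z * x_obs))

    z = mu + sigma * rng.standard_normal(100000)
    log_q = -0.5 * np.log(2 * np.pi) - np.log(sigma) - 0.5 * ((z - mu) / sigma) ** 2
    h = (z - mu) / sigma ** 2                 # score w.r.t. mu; E_q[h] = 0
    f = log_joint(z) - log_q
    plain = h * f                             # per-sample score-function gradient
    a = np.cov(plain, h)[0, 1] / np.var(h)    # scaling that minimizes the variance
    controlled = plain - a * h
    print("same mean:          ", plain.mean(), controlled.mean())
    print("variance reduction: ", plain.var(), "->", controlled.var())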


Nonconjugate Models

„ Nonlinear Time Series Models
„ Deep Latent Gaussian Models
„ Models with Attention (such as DRAW)
„ Generalized Linear Models (Poisson regression)
„ Stochastic Volatility Models
„ Discrete Choice Models
„ Bayesian Neural Networks
„ Deep Exponential Families (e.g. Sparse Gamma or Poisson)
„ Correlated Topic Models (including nonparametric variants)
„ Sigmoid Belief Networks

We can design models based on data rather than inference.


More Assumptions?

The current black box criteria


„ Sampling from q(z)
„ Evaluating ∇ν log q(z; ν)
„ Evaluating log p(x, z) and log q(z)
Can we make additional assumptions that are not too restrictive?
Pathwise Gradients of the ELBO
Pathwise Estimator

Assume
1. z = t(ε, ν) for ε ∼ s(ε) implies z ∼ q(z; ν)
Example:

ε ∼ Normal(0, 1)
z = εσ + µ
→ z ∼ Normal(µ, σ2 )

2. log p(x, z) and log q(z) are differentiable with respect to z


Pathwise Estimator

Recall

∇ν L = Eq(z;ν) [∇ν log q(z; ν)g(z, ν) + ∇ν g(z, ν)]

Rewrite using z = t(ε, ν):

∇ν L = Es(ε) [∇ν log s(ε)g(t(ε, ν), ν) + ∇ν g(t(ε, ν), ν)]

To differentiate:

∇L (ν) = Es(ε) [∇ν g(t(ε, ν), ν)]


= Es(ε) [∇z [log p(x, z) − log q(z; ν)]∇ν t(ε, ν) − ∇ν log q(z; ν)]
= Es(ε) [∇z [log p(x, z) − log q(z; ν)]∇ν t(ε, ν)]

This is also known as the reparameterization gradient.


[Glasserman 1991; Fu 2006; Kingma+ 2014; Rezende+ 2014; Titsias+ 2014]
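A minimal numpy sketch (mine) of the reparameterization estimator on the same logistic regression example: z = µ + σε with ε ∼ N(0, 1), and the gradient is the average of ∇_z[log p(x, z) − log q(z; ν)] pushed through ∂z/∂µ = 1 and ∂z/∂log σ = σε.

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def pathwise_grad(mu, log_sigma, x, y, S, rng):
        sigma = np.exp(log_sigma)
        eps = rng.standard_normal(S)
        z = mu + sigma * eps                          # z = t(eps, nu)
        dlogp_dz = -z + x * (y - sigmoid(x * z))      # grad of log p(z) + log p(y | x, z)
        dlogq_dz = -(z - mu) / sigma ** 2             # grad of log q(z; nu) in z
        dz = dlogp_dz - dlogq_dz
        return np.mean(dz), np.mean(dz * sigma * eps)  # chain rule through dz/dmu, dz/dlog_sigma

    rng = np.random.default_rng(0)
    mu, log_sigma = 0.0, 0.0
    for t in range(1, 5001):
        g_mu, g_ls = pathwise_grad(mu, log_sigma, x=2.0, y=1, S=8, rng=rng)
        rho = 0.1 * t ** (-0.7)                       # Robbins-Monro step size
        mu, log_sigma = mu + rho * g_mu, log_sigma + rho * g_ls
    print(mu, np.exp(log_sigma))

Even with only S = 8 samples per step, this estimator is far less noisy than the score-function version above, which is the point of the comparison on the next slide.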
Variance Comparison

[Figure: gradient-estimator variance against the number of samples for a multivariate nonlinear regression model — the pathwise estimator has much lower variance than the score-function estimator, with or without a control variate; Kucukelbir+ 2016]
Score Function Estimator vs. Pathwise Estimator

Score function:
„ Differentiates the density, ∇_ν q(z; ν)
„ Works for discrete and continuous models
„ Works for a large class of variational approximations
„ Variance can be a big problem

Pathwise:
„ Differentiates the function, ∇_z[log p(x, z) − log q(z; ν)]
„ Requires differentiable models
„ Requires the variational approximation to have the form z = t(ε, ν)
„ Generally better-behaved variance
Amortized Inference
Hierarchical Models

A generic class of models:

[Graphical model: global variables β; local variables z_i and observations x_i in a plate of size n]

    p(β, z, x) = p(β) ∏_{i=1}^n p(z_i, x_i | β)

Ñ Bayesian mixture models, Dirichlet process mixtures, HDPs
Ñ Time series models, multilevel regression

Mean Field Variational Approximation

[Figure: the model and the fully factorized variational family, connected through the ELBO]
SVI: Revisited

Input: data x, model p(β, z, x).


Initialize λ randomly. Set ρt appropriately.

repeat
Sample j ∼ Unif(1, . . . , n).
Set local parameter φ ← E_λ[η_ℓ(β, x_j)].

Set intermediate global parameter

λ̂ = α + nEφ [t(Zj , xj )].

Set global parameter


λ = (1 − ρt )λ + ρt λ̂.

until forever
SVI: The problem
Input: data x, model p(β, z, x).
Initialize λ randomly. Set ρt appropriately.

repeat
Sample j ∼ Unif(1, . . . , n).
Set local parameter φ ← E_λ[η_ℓ(β, x_j)].
Set intermediate global parameter

λ̂ = α + nEφ [t(Zj , xj )].

Set global parameter


λ = (1 − ρt )λ + ρt λ̂.

until forever
„ These expectations are no longer tractable
„ Inner stochastic optimization needed for each data point.
SVI: The problem
Input: data x, model p(β, z, x).
Initialize λ randomly. Set ρt appropriately.

repeat
Sample j ∼ Unif(1, . . . , n).
Set local parameter φ ← E_λ[η_ℓ(β, x_j)].
Set intermediate global parameter

λ̂ = α + nEφ [t(Zj , xj )].

Set global parameter


λ = (1 − ρt )λ + ρt λ̂.

until forever

Idea: Learn a mapping f from xi to φi


Amortizing Inference

ELBO:

    L(λ, φ_{1:n}) = E_q[log p(β, z, x)] − E_q[ log q(β; λ) + Σ_{i=1}^n log q(z_i; φ_i) ]

Amortizing the ELBO with an inference network f:

    L(λ, θ) = E_q[log p(β, z, x)] − E_q[ log q(β; λ) + Σ_{i=1}^n log q(z_i | x_i; φ_i = f_θ(x_i)) ]

[Dayan+ 1995; Heess+ 2013; Gershman+ 2014, many others]
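A minimal numpy sketch (mine, with a hypothetical one-hidden-layer network) of what f_θ can look like: a single set of weights maps every observation x_i to the mean and log-variance of q(z_i | x_i), so no per-data-point parameters φ_i are stored. In practice θ is trained with the amortized gradients on the next slide.

    import numpy as np

    class InferenceNetwork:
        # Maps each observation x_i to local variational parameters (mean, log-variance).
        def __init__(self, hidden=16, seed=0):
            rng = np.random.default_rng(seed)
            self.W1 = 0.1 * rng.standard_normal((1, hidden))
            self.b1 = np.zeros(hidden)
            self.W2 = 0.1 * rng.standard_normal((hidden, 2))
            self.b2 = np.zeros(2)

        def __call__(self, x):
            x = np.atleast_1d(np.asarray(x, dtype=float)).reshape(-1, 1)
            h = np.tanh(x @ self.W1 + self.b1)
            out = h @ self.W2 + self.b2
            return out[:, 0], out[:, 1]        # mean_i, log_var_i for every x_i

    f = InferenceNetwork()
    means, log_vars = f(np.array([0.3, -1.2, 2.5]))   # shared weights, one pass per data point
    print(means, np.exp(log_vars))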


Amortized SVI
Input: data x, model p(β, z, x).
Initialize λ randomly. Set ρt appropriately.

repeat
Sample β ∼ q(β; λ).
Sample j ∼ Unif(1, . . . , n).
Sample z_j ∼ q(z_j | x_j; f_θ(x_j)).
Compute stochastic gradients

    ∇̂_λ L = ∇_λ log q(β; λ) (log p(β) + n log p(x_j, z_j | β) − log q(β; λ))
    ∇̂_θ L = n ∇_θ log q(z_j | x_j; θ) (log p(x_j, z_j | β) − log q(z_j | x_j; θ))

Update

    λ = λ + ρ_t ∇̂_λ
    θ = θ + ρ_t ∇̂_θ.

until forever
A computational-statistical tradeoff

„ Amortized inference is faster, but admits a smaller class of approximations


„ The size of the smaller class depends on the flexibility of f

    ∏_{i=1}^n q(z_i; φ_i)        vs.        ∏_{i=1}^n q(z_i | x_i; f_θ(x_i))
Example: Variational Autoencoder (VAE)

[Graphical model: z → x]

    p(z) = Normal(0, 1)
    p(x | z) = Normal(µ_β(z), σ²_β(z))

µ_β and σ²_β are deep networks with parameters β.


[Kingma+ 2014; Rezende+ 2014]
Example: Variational Autoencoder (VAE)

[Diagram: the model p(x | z) and the inference network q(z | x); z ~ q(z | x), x ~ p(x | z)]

    q(z | x) = Normal(f_θ^µ(x), f_θ^σ²(x))

All functions are deep networks.


Example: Variational Autoencoder (VAE)

Analogies
Analogy-making
Rules of Thumb for a New Model

If log p(x, z) is differentiable in z:
„ Try out an approximation q that is reparameterizable

If log p(x, z) is not differentiable in z:
„ Use the score function estimator with control variates
„ Add further variance reductions based on experimental evidence

General advice:
„ Use coordinate-specific learning rates (e.g. RMSProp, AdaGrad)
„ Annealing and tempering
„ Consider parallelizing across samples from q
Software

Systems with Variational Inference:


„ Venture, WebPPL, Edward, Stan, PyMC3, Infer.net, Anglican
Good for trying out lots of models

Differentiation Tools:
„ Theano, Torch, Tensorflow, Stan Math, Caffe
Can lead to more scalable implementations of individual models
PART IV

Beyond the Mean Field


Review: Variational Bound and Optimisation

[Diagram: reusable variational families + any model p(β, z | x) + massive data → black box variational inference]

„ Probabilistic modelling and variational inference.
„ Scalable inference through stochastic optimisation.
„ Black-box variational inference: non-conjugate models, Monte Carlo gradient estimators, and amortised inference.

These advances empower us with new ways to design
more flexible approximate posterior distributions q(z)
Mean-field Approximations

Fully-factorised

[Figure: the VI cartoon, and a factor graph over z_1, z_2, z_3 with no edges between them]

    q_MF(z | x) = ∏_k q(z_k)

A key part of the algorithm is the choice of the approximate posterior q(z).

    log p(x) ≥ L = E_{q(z|x)}[log p(x, z)] − E_{q(z|x)}[log q(z | x)]
                    (expected likelihood)     (entropy)
Mean-Field Posterior Approximations

[Figure: a deep latent Gaussian model — a latent variable model p(x, z) with prior p(z) and likelihood p(x | z) — and its true posterior]

Mean-field or fully-factorised posterior is usually not sufficient


Real-world Posterior Distributions

[Figure: the same deep latent Gaussian model and examples of its true posterior over z]

Complex dependencies · Non-Gaussian distributions · Multiple modes


Families of Approximate Posteriors

Two high-level goals:
„ Build richer approximate posterior distributions.
„ Maintain computational efficiency and scalability.

[Figure: a spectrum from most expressive to least expressive — the true posterior at one end, the fully-factorised family at the other]

    q*(z | x) ∝ p(x | z) p(z)        q_MF(z | x) = ∏_k q(z_k)

Same as the problem of specifying a model of the data itself.


Structured Posterior Approximations

[Figure: the spectrum — true posterior, structured approximation, fully-factorised]

    q*(z | x) ∝ p(x | z) p(z)        q(z) = ∏_k q_k(z_k | {z_j}_{j≠k})        q_MF(z | x) = ∏_k q(z_k)

Structured mean field: introduce any form of dependency to provide a richer approximating class of distributions.
[Saul and Jordan, 1996]
Gaussian Approximate Posteriors

Use a correlated Gaussian:

    q_G(z; ν) = N(z | µ, Σ),    with variational parameters ν = {µ, Σ}

Covariance models: the structure of the covariance Σ describes the dependency. Full covariance is the richest, but computationally expensive.

„ Mean-field: Σ = diag(α_1, . . . , α_K)
„ Rank-1: Σ = diag(α_1, . . . , α_K) + u uᵀ
„ Rank-J: Σ = diag(α_1, . . . , α_K) + Σ_j u_j u_jᵀ
„ Full: Σ = U Uᵀ

[Figure: test negative marginal likelihood comparing rank-1, diagonal, wake-sleep, and factor-analysis approximations]

The approximate posterior is always Gaussian.
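A minimal numpy sketch (mine) of why the rank-1 parameterization stays cheap: a draw with covariance diag(α) + uuᵀ needs only K + 1 standard normals and never forms or factorizes a K × K matrix.

    import numpy as np

    def sample_rank1_gaussian(mu, alpha, u, num_samples, rng):
        # Draws from N(mu, diag(alpha) + u u^T) using K + 1 standard normals per sample.
        K = len(mu)
        eps1 = rng.standard_normal((num_samples, K))
        eps2 = rng.standard_normal((num_samples, 1))
        return mu + np.sqrt(alpha) * eps1 + eps2 * u

    rng = np.random.default_rng(0)
    mu, alpha, u = np.zeros(3), np.array([1.0, 2.0, 0.5]), np.array([1.0, -1.0, 0.5])
    z = sample_rank1_gaussian(mu, alpha, u, 200000, rng)
    print(np.cov(z.T))                        # close to diag(alpha) + np.outer(u, u)
    print(np.diag(alpha) + np.outer(u, u))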


Beyond Gaussian Approximations

Autoregressive distributions: impose an ordering and a non-linear dependency on all preceding variables.

    q_AR(z; ν) = ∏_k q_k(z_k | z_{<k}; ν_k)

Compare DLGMs: Gaussian mean field (VAE) vs. an autoregressive posterior (DRAW) in fully-connected DLGMs on CIFAR-10:

    VAE: ≤ 86.6        DRAW: ≤ 80.9

[Gregor et al., 2015]

The joint distribution is non-Gaussian, although the conditionals are.


More Structured Posteriors

Linking functions
Mixture model

C(z)
y

z1 z2 z3 z1 z2 z3
!
Y
X qlm (z; ⌫) = qk (zk |⌫ k ) C(z; ⌫ k+1 )
qmm (z; ⌫) = ⇢r qr (zr |⌫ r ) k
r

[Saul and Jordan, 1996, Tran et al., 2016]


More Structured Posteriors

[Diagrams: a mixture model with indicator y over z1, z2, z3, and a linking function C(z) coupling z1, z2, z3.]

Mixture model:      q_mm(z; ν) = ∑_r ρ_r q_r(z | ν_r)
Linking functions:  q_lm(z; ν) = (∏_k q_k(z_k | ν_k)) C(z; ν_{K+1})

[Saul and Jordan, 1996; Tran et al., 2016]

Suggests a general way to improve posterior approximations:
introduce additional variables that induce dependencies, but that remain tractable and efficient.
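A minimal sketch (hypothetical weights, means and scales) of the mixture posterior q_mm(z; ν) = ∑_r ρ_r q_r(z | ν_r) with diagonal-Gaussian components: the log-density is a log-sum-exp over components, and sampling first picks a component.

import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm

rng = np.random.default_rng(0)
R, K = 3, 2                                # mixture components, latent dimension
rho = np.array([0.5, 0.3, 0.2])            # mixing weights, sum to 1
mus = rng.normal(size=(R, K))              # component means (hypothetical)
sigmas = np.exp(rng.normal(size=(R, K)))   # component scales

def log_q_mm(z):
    """log q_mm(z) = logsumexp_r [ log rho_r + log q_r(z) ]."""
    comp = norm.logpdf(z, mus, sigmas).sum(axis=1)   # log q_r(z), diagonal Gaussians
    return logsumexp(np.log(rho) + comp)

def sample_q_mm():
    r = rng.choice(R, p=rho)                         # pick a component
    return rng.normal(mus[r], sigmas[r])

z = sample_q_mm()
print(z, log_q_mm(z))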
Designing Richer Posteriors
1. Introduce new variables ω that help to form a richer approximate posterior distribution.

   q(z; ν) = ∫ q(z, ω; ν) dω

2. Adapt the bound: compute the entropy term exactly, or bound it (a Monte Carlo sketch of the standard bound follows below).

   log p(x) ≥ L = E_q(z|x)[log p(x, z)] − E_q(z|x)[log q(z|x)]
                  (expected likelihood)   (entropy)

3. Maintain computational efficiency: linear in the number of latent variables.

[Diagram: an inference path x → z0 → z1 → … → zK with auxiliary variables ω.]

Look at two different approaches:
„ Change-of-variables: normalising flows and invertible transforms.
„ Auxiliary variables: entropy bounds, Monte Carlo sampling.
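As a concrete toy illustration of step 2, a Monte Carlo estimate of the standard bound with a reparameterised Gaussian q. The model p(x, z) and all parameter values below are hypothetical stand-ins; for this Gaussian q the entropy term is available in closed form.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = np.array([0.8, -0.3])                  # observed data (hypothetical)

def log_joint(x, z):
    """Toy model: z ~ N(0, 1), x | z ~ N(z, 0.5) for each coordinate."""
    return norm.logpdf(z, 0.0, 1.0) + norm.logpdf(x, z, 0.5).sum()

# Variational parameters of a Gaussian q(z | x) = N(mu, sigma^2).
mu, log_sigma = 0.2, -0.5

def elbo_estimate(n_samples=100):
    sigma = np.exp(log_sigma)
    eps = rng.normal(size=n_samples)
    z = mu + sigma * eps                                  # reparameterised samples
    expected_ll = np.mean([log_joint(x, zi) for zi in z])  # E_q[log p(x, z)]
    entropy = 0.5 * np.log(2 * np.pi * np.e * sigma**2)    # Gaussian entropy, exact
    return expected_ll + entropy

print(elbo_estimate())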
Approximations using Change-of-variables

[Diagram: an inference path x → z0 → z1 → … → zK.]

Exploit the rule for the change of variables of random variables:
„ Begin with an initial distribution q0(z0|x).
„ Apply a sequence of K invertible functions f_k.

Sampling and entropy:

z_K = f_K ∘ … ∘ f_2 ∘ f_1(z_0)

log q_K(z_K) = log q_0(z_0) − ∑_{k=1}^{K} log |det ∂f_k/∂z_{k−1}|

q(z′) = q(z) |det ∂f/∂z|^{−1}   for z′ = f(z)

[Figure: a distribution flowing through a sequence of invertible transforms, t = 0, 1, …, T.]

Distribution flows through a sequence of invertible transforms.
[Rezende and Mohamed, 2015]
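A minimal sketch of this density-update rule using deliberately simple invertible element-wise maps f_k(z) = a_k ⊙ z + b_k (illustrative choices only, not the transforms used in practice): the log-density of the final sample is the base log-density minus the accumulated log-determinants.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
D, K = 2, 3                                   # latent dimension, number of transforms
a = np.exp(rng.normal(size=(K, D)))           # positive scales (hypothetical)
b = rng.normal(size=(K, D))                   # shifts

# z0 ~ q0 = N(0, I)
z = rng.normal(size=D)
log_q = norm.logpdf(z, 0.0, 1.0).sum()        # log q0(z0)

# Flow z_k = f_k(z_{k-1}) = a_k * z_{k-1} + b_k, an invertible element-wise map
# whose Jacobian is diag(a_k), so log|det| = sum(log a_k).
for k in range(K):
    z = a[k] * z + b[k]
    log_q -= np.sum(np.log(a[k]))             # log q_K = log q_0 - sum_k log|det|

print(z, log_q)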
Normalising Flows

[Figure: a planar normalising flow applied to a unit Gaussian and to a uniform base distribution q0, shown after K = 1, 2, and 10 transformations.]
Choice of Transformation Function

L = E_q0(z0)[log p(x, z_K)] − E_q0(z0)[log q_0(z_0)] − E_q0(z0)[ ∑_{k=1}^{K} log |det ∂f_k/∂z_{k−1}| ]

„ Begin with a fully-factorised Gaussian and improve it by change of variables.
„ Triangular Jacobians allow for computational efficiency.

[Diagrams of the three transformations.]

Planar flow:                  z_k = z_{k−1} + u h(wᵀ z_{k−1} + b)
Real NVP (coupling layer):    y_{1:d}   = z_{k−1,1:d}
                              y_{d+1:D} = t(z_{k−1,1:d}) + z_{k−1,d+1:D} ⊙ exp(s(z_{k−1,1:d}))
Inverse autoregressive flow:  z_k = (z_{k−1} − µ_k(z_<k, x)) / σ_k(z_<k, x)

[Rezende and Mohamed, 2015; Dinh et al., 2016; Kingma et al., 2016]

Linear time computation of the determinant and its gradient.
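A minimal sketch of the planar flow above with arbitrary (hypothetical) parameters: the Jacobian determinant is 1 + uᵀψ(z), with ψ(z) = h′(wᵀz + b) w, so it costs linear time. Invertibility additionally requires a constraint such as uᵀw ≥ −1, which is not enforced here.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
D = 2
u = rng.normal(size=D) * 0.1        # flow parameters (hypothetical; in practice one
w = rng.normal(size=D)              # constrains u^T w >= -1 for invertibility)
b = 0.3

def planar_flow(z):
    """Return f(z) = z + u * tanh(w.z + b) and log|det df/dz|."""
    a = w @ z + b
    f = z + u * np.tanh(a)
    psi = (1.0 - np.tanh(a) ** 2) * w              # h'(a) * w
    log_det = np.log(np.abs(1.0 + u @ psi))        # det = 1 + u^T psi(z)
    return f, log_det

# Push a standard-normal sample through the flow and track its log-density.
z0 = rng.normal(size=D)
log_q = norm.logpdf(z0, 0.0, 1.0).sum()
z1, log_det = planar_flow(z0)
log_q -= log_det                                    # log q1(z1) = log q0(z0) - log|det|
print(z1, log_q)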


Modelling Improvements

VAE-type algorithms on the MNIST benchmark:

VAE: ≤ 86.6    DRAW: ≤ 80.9    IAF: ≈ 79.1

[Figure: samples generated from the model on CIFAR-10 images.]
Hierarchical Approximate Posteriors

We can use 'latent variables' ω to enrich the approximate posterior distribution, just as we do for our density models:

q(z|x) = ∫ q(z|ω, x) q(ω|x) dω

„ Use a hierarchical model for the approximate posterior.
„ The variables ω are stochastic, rather than the deterministic transformations of the change-of-variables approach.
„ Both continuous and discrete latent variables can be modelled.

[Diagram: an inference path x → z0 → z1 → … → zK with auxiliary variables ω.]

[Ranganath et al., 2016]
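A minimal sketch of sampling from such a hierarchical posterior: draw ω from q(ω|x), then z from q(z|ω, x). The dimensions, the diagonal Gaussians, and the tanh mapping from ω to the mean of z are all hypothetical; the point is that the marginal q(z|x) is a continuous mixture, hence non-Gaussian even though both conditionals are Gaussian.

import numpy as np

rng = np.random.default_rng(0)
D_omega, D_z = 2, 3

# q(omega | x): a diagonal Gaussian (parameters would come from an inference network).
mu_omega = np.zeros(D_omega)
sigma_omega = np.ones(D_omega)

# q(z | omega, x): Gaussian whose mean is a (hypothetical) nonlinear function of omega.
W = rng.normal(size=(D_z, D_omega))
sigma_z = 0.5 * np.ones(D_z)

def sample_q(n):
    omega = mu_omega + sigma_omega * rng.normal(size=(n, D_omega))
    mean_z = np.tanh(omega @ W.T)                  # nonlinear dependence on omega
    z = mean_z + sigma_z * rng.normal(size=(n, D_z))
    return z

# The marginal q(z|x) is a continuous mixture of Gaussians, hence non-Gaussian.
samples = sample_q(10000)
print(samples.mean(axis=0), samples.std(axis=0))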


Auxiliary-variable Methods

Modify the model to include ω = (z_0, …, z_{K−1}).

[Diagrams: the latent variable model p(x, z), with prior p(z) and likelihood p(x|z), next to the auxiliary latent variable model p(x, z, ω), which adds the factor r(ω|x, z).]

„ Auxiliary variables leave the original model unchanged.
„ They capture the structure of correlated variables because they turn the posterior into a mixture of distributions q(z|x, ω).

[Agakov and Barber, 2004; Maaløe et al., 2016]


Auxiliary Variational Lower Bounds

Standard bound:
log p(x) ≥ L = E_q(z|x)[log p(x, z)] − E_q(z|x)[log q(z|x)]
               (expected likelihood)   (entropy)

[Diagrams: the auxiliary latent variable model p(x, z, ω), with p(z), p(x|z) and r(ω|x, z), next to the inference model q(z, ω), with q(z|x, ω) and q(ω|x).]

Auxiliary variational bound: bound the entropy for tractability.

log p(x) ≥ E_q(ω,z|x)[log p(x, z) + log r(ω|z, x)] − E_q(ω,z|x)[log q(z, ω|x)]
         = L − E_q(z|x)[ KL[ q(ω|z, x) ‖ r(ω|z, x) ] ]

[Salimans et al., 2015; Ranganath et al., 2016; Maaløe et al., 2016]
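A minimal Monte Carlo sketch of the auxiliary bound for a toy setup in which p(x, z), q(ω|x), q(z|ω, x) and the auxiliary term r(ω|z, x) are all simple Gaussians with made-up parameters: the estimator averages log p(x, z) + log r(ω|z, x) − log q(z, ω|x) over samples from the inference model.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = np.array([1.0, -0.5])                       # observed data (hypothetical)

def log_p(x, z):
    """Toy model: z ~ N(0, 1), x | z ~ N(z, 1)."""
    return norm.logpdf(z, 0, 1) + norm.logpdf(x, z, 1).sum()

def aux_bound(n=1000):
    est = 0.0
    for _ in range(n):
        # inference model q(omega, z | x) = q(omega | x) q(z | omega, x)
        omega = rng.normal(0.1, 1.0)
        z = rng.normal(0.5 * omega, 0.8)
        log_q = norm.logpdf(omega, 0.1, 1.0) + norm.logpdf(z, 0.5 * omega, 0.8)
        # auxiliary term r(omega | z, x): here an input-dependent Gaussian
        log_r = norm.logpdf(omega, 0.3 * z, 1.0)
        est += log_p(x, z) + log_r - log_q
    return est / n

print(aux_bound())    # a (stochastic) lower bound on log p(x)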


Auxiliary Variational Methods

Choose an auxiliary prior r(ω|z, x) and an auxiliary posterior q(ω|x, z).

[Diagrams: the auxiliary latent variable model p(x, z, ω) and the inference model q(z, ω), as before.]

„ Hamiltonian flow: r(ω) = N(ω|0, M)
„ Input-dependent Gaussian: r(ω|x, z)
„ Auto-regressive: r(ω|x, z) = ∏_t r(ω_t | f_θ(ω_<t, x))
„ q(ω|x, z) can be a mixture model, a normalising flow, or a Gaussian process.

VAE: ≤ 86.6    DRAW: ≤ 80.9    IAF: ≈ 79.1    DRAW-VGP: ≤ 79.8

[Tran et al., 2016]

Easy sampling; easy evaluation of the bound and its gradients.


Summary

[Diagram: families of posterior approximations arranged on a spectrum from the true posterior (most expressive) to the fully-factorised approximation (least expressive): normalising flows, auxiliary variables, structured mean-field, mixtures, and covariance models.]

True posterior (most expressive):      q*(z|x) ∝ p(x|z) p(z)
Fully-factorised (least expressive):   q_MF(z|x) = ∏_k q(z_k)
Choosing your Approximation

[Diagram: the probabilistic pipeline — knowledge & data and a question feed the model; make assumptions, discover patterns, predict & explore, criticize the model, and revise.]

[Figure: population structure inferred from the TGP data set using the TeraStructure algorithm at K = 7, 8 and 9 populations. The visualization of the θ's shows patterns consistent with the major geographical regions. Some clusters identify a specific region (e.g. red for Africa) while others represent admixture between regions (e.g. green for Europeans and Central/South Americans). Shared clusters across regions demonstrate the more continuous nature of the structure. The new cluster from K = 7 to K = 8 matches structure differentiating between American groups. For K = 9, the new cluster is unpopulated.]
Summary
Variational Inference: Foundations and Modern Methods

[Diagram: the variational family q(z; ν), the optimisation path from ν_init to the optimum ν*, and the remaining gap KL(q(z; ν*) ‖ p(z|x)) to the true posterior p(z|x).]

VI approximates difficult quantities from complex models.

With stochastic optimization we can
„ scale up VI to massive data,
„ enable VI on a wide class of difficult models,
„ enable VI with elaborate and flexible families of approximations.
Bibliography

Introductory Variational Inference


„ Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., and Saul, L. K. (1999). An introduction to
variational methods for graphical models. Machine learning, 37(2), 183-233.
„ Beal, Matthew James. Variational algorithms for approximate Bayesian inference. Diss.
University of London, 2003.
„ Wainwright, Martin J., and Michael I. Jordan. "Graphical models, exponential families, and
variational inference." Foundations and Trends in Machine Learning 1, no. 1-2 (2008): 1-305.
Bibliography

Applications of Variational Inference


„ Frey, Brendan J., and Geoffrey E. Hinton. "Variational learning in nonlinear Gaussian belief
networks." Neural Computation 11, no. 1 (1999): 193-213.
„ Eslami, S. M., Heess, N., Weber, T., Tassa, Y., Kavukcuoglu, K., and Hinton, G. E. Attend, Infer,
Repeat: Fast Scene Understanding with Generative Models. NIPS (2016).
„ Rezende, Danilo Jimenez, Shakir Mohamed, Ivo Danihelka, Karol Gregor, and Daan Wierstra.
"One-Shot Generalization in Deep Generative Models." ICML (2016).
„ Kingma, Diederik P., Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling.
"Semi-supervised learning with deep generative models." In Advances in Neural Information
Processing Systems, pp. 3581-3589. 2014.
Bibliography

Monte Carlo Gradient Estimation


„ Pierre L’Ecuyer, Note: On the interchange of derivative and expectation for likelihood ratio
derivative estimators, Management Science, 1995
„ Peter W Glynn, Likelihood ratio gradient estimation for stochastic systems, Communications
of the ACM, 1990
„ Michael C Fu, Gradient estimation, Handbooks in operations research and management
science, 2006
„ Ronald J Williams, Simple statistical gradient-following algorithms for connectionist
reinforcement learning, Machine learning, 1992
„ Paul Glasserman, Monte Carlo methods in financial engineering, 2003
„ Omiros Papaspiliopoulos, Gareth O Roberts, Martin Skold, A general framework for the
parametrization of hierarchical models, Statistical Science, 2007
„ Rajesh Ranganath, Sean Gerrish, and David M. Blei. "Black Box Variational Inference." In
AISTATS, pp. 814-822. 2014.
„ Andriy Mnih, and Karol Gregor. "Neural variational inference and learning in belief
networks." arXiv preprint arXiv:1402.0030 (2014).
Bibliography

Monte Carlo Gradient Estimation (cont.)


„ Michalis Titsias and Miguel Lázaro-Gredilla. "Doubly stochastic variational Bayes for
non-conjugate inference." (2014).
„ David Wingate and Theophane Weber. "Automated variational inference in probabilistic
programming." arXiv preprint arXiv:1301.1299 (2013).
„ John Paisley, David Blei, and Michael Jordan. "Variational Bayesian inference with stochastic
search." arXiv preprint arXiv:1206.6430 (2012).
„ Durk Kingma and Max Welling. "Auto-encoding Variational Bayes." ICLR (2014).
„ Danilo Jimenez Rezende, Shakir Mohamed, Daan Wierstra. "Stochastic Backpropagation and
Approximate Inference in Deep Generative Models." ICML (2014).
Bibliography

Amortized Inference
„ Dayan, Peter, Geoffrey E. Hinton, Radford M. Neal, and Richard S. Zemel. "The helmholtz
machine." Neural computation 7, no. 5 (1995): 889-904.
„ Gershman, Samuel J., and Noah D. Goodman. "Amortized inference in probabilistic
reasoning." In Proceedings of the 36th Annual Conference of the Cognitive Science Society.
2014.
„ Heess, Nicolas, Daniel Tarlow, and John Winn. "Learning to pass expectation propagation
messages." In Advances in Neural Information Processing Systems, pp. 3219-3227. 2013.
„ Jitkrittum, Wittawat, Arthur Gretton, Nicolas Heess, S. M. Eslami, Balaji Lakshminarayanan,
Dino Sejdinovic, and Zoltán Szabó. "Kernel-based just-in-time learning for passing
expectation propagation messages." arXiv preprint arXiv:1503.02551 (2015).
„ Korattikara, Anoop, Vivek Rathod, Kevin Murphy, and Max Welling. "Bayesian dark
knowledge." arXiv preprint arXiv:1506.04416 (2015).
Bibliography

Structured Mean Field


„ Jaakkola, T. S., and Jordan, M. I. (1998). Improving the mean field approximation via the use
of mixture distributions. In Learning in graphical models (pp. 163-173). Springer
Netherlands.
„ Saul, L.K. and Jordan, M.I., 1996. Exploiting tractable substructures in intractable networks.
Advances in neural information processing systems, pp.486-492.
„ Gregor, Karol, Ivo Danihelka, Alex Graves, Danilo Jimenez Rezende, and Daan Wierstra.
"DRAW: A recurrent neural network for image generation." ICML (2015).
„ Gershman, S., Hoffman, M. and Blei, D., 2012. Nonparametric variational inference. arXiv
preprint arXiv:1206.4665.
Bibliography

Change-of-variables and Normalising Flows


„ Tabak, E. G., and Cristina V. Turner. "A family of nonparametric density estimation
algorithms." Communications on Pure and Applied Mathematics 66, no. 2 (2013): 145-164.
„ Rezende, Danilo Jimenez, and Shakir Mohamed. "Variational inference with normalizing
flows." ICML (2015).
„ Kingma, D.P., Salimans, T. and Welling, M., 2016. Improving variational inference with
inverse autoregressive flow. arXiv preprint arXiv:1606.04934.
„ Dinh, L., Sohl-Dickstein, J. and Bengio, S., 2016. Density estimation using Real NVP. arXiv
preprint arXiv:1605.08803.
Bibliography

Auxiliary Variational Methods


„ Felix V. Agakov, and David Barber. "An auxiliary variational method." NIPS (2004).
„ Rajesh Ranganath, Dustin Tran, and David M. Blei. "Hierarchical Variational Models." ICML
(2016).
„ Lars Maaløe et al. "Auxiliary Deep Generative Models." ICML (2016).
„ Tim Salimans, Durk Kingma, Max Welling. "Markov chain Monte Carlo and variational
inference: Bridging the gap." ICML (2015).
Bibliography

Related Variational Objectives


„ Yuri Burda, Roger Grosse, Ruslan Salakhutdinov. "Importance weighted autoencoders." ICLR
(2015).
„ Yingzhen Li, Richard E. Turner. "Rényi divergence variational inference." NIPS (2016).
„ Guillaume Bouchard and Balaji Lakshminarayanan. "Approximate Inference with the Variational Holder
Bound." ArXiv (2015).
„ José Miguel Hernández-Lobato, Yingzhen Li, Daniel Hernández-Lobato, Thang Bui, and
Richard E. Turner. Black-box α-divergence Minimization. ICML (2016).
„ Rajesh Ranganath, Jaan Altosaar, Dustin Tran, David M. Blei. Operator Variational Inference.
NIPS (2016).
Bibliography

Discrete Latent Variable Models and Posterior Approximations


„ Radford Neal. "Learning stochastic feedforward networks." Tech. Rep. CRG-TR-90-7:
Department of Computer Science, University of Toronto (1990).
„ Lawrence K. Saul, Tommi Jaakkola, and Michael I. Jordan. "Mean field theory for sigmoid
belief networks." Journal of artificial intelligence research 4, no. 1 (1996): 61-76.
„ Karol Gregor, Ivo Danihelka, Andriy Mnih, Charles Blundell, and Daan Wierstra. "Deep
autoregressive networks." ICML (2014).
„ Rajesh Ranganath, Linpeng Tang, Laurent Charlin, and David M. Blei. "Deep Exponential
Families." AISTATS (2015).
„ Rajesh Ranganath, Dustin Tran, and David M. Blei. "Hierarchical Variational Models." ICML
(2016).
