
Variational Inference:

Foundations and Modern Methods

David Blei, Rajesh Ranganath, Shakir Mohamed

NIPS 2016 Tutorial · December 5, 2016


Communities discovered in a 3.7M node network of U.S. Patents
[Gopalan and Blei, PNAS 2013]
Topics found in 1.8M articles from the New York Times
[Hoffman, Blei, Wang, Paisley, JMLR 2013]
[Figure: fifteen example topics, each shown as a list of its top words]
Scenes, concepts, and control
[Eslami et al., 2016; Lake et al., 2015]
[Figure: 3D scene understanding results from the Attend, Infer, Repeat (AIR) framework]

Population analysis of 2 billion genetic measurements
[Gopalan, Hao, Blei, Storey, Nature Genetics (in press)]
[Figure: inferred admixture proportions for individuals from worldwide populations]

Neuroscience analysis of 220 million fMRI measurements
[Manning et al., PLOS ONE 2014]
Compression and content generation
[Van den Oord et al., 2016; Gregor et al., 2016]

Analysis of 1.7M taxi trajectories, in Stan
[Kucukelbir et al., 2016]
The probabilistic pipeline

[Diagram: knowledge, data, and a question feed the pipeline — make assumptions, discover patterns, predict & explore]

Customized data analysis is important to many fields.

„ Pipeline separates assumptions, computation, application

„ Eases collaborative solutions to statistics problems


The probabilistic pipeline

[Diagram: the same pipeline — make assumptions, discover patterns, predict & explore]

Inference is the key algorithmic problem.

„ Answers the question: What does this model say about this data?

„ Our goal: General and scalable approaches to inference


The probabilistic pipeline

[Diagram: the full loop — make assumptions, discover patterns, predict & explore; criticize the model; revise]

[Box, 1980; Rubin, 1984; Gelman et al., 1996; Blei, 2014]


PART I

Main ideas and historical context


Probabilistic Machine Learning

„ A probabilistic model is a joint distribution of hidden variables z and observed variables x,

    p(z, x).

„ Inference about the unknowns is through the posterior, the conditional distribution of the hidden variables given the observations,

    p(z | x) = p(z, x) / p(x).

„ For most interesting models, the denominator is not tractable. We appeal to approximate posterior inference.
Variational Inference
[Figure: the variational family q(z; ν); optimization moves ν from ν_init to the ν* minimizing KL(q(z; ν*) || p(z | x))]

„ VI turns inference into optimization.


„ Posit a variational family of distributions over the latent variables,

q(z; ν)

„ Fit the variational parameters ν to be close (in KL) to the exact posterior.
(There are alternative divergences, which connect to algorithms like EP, BP, and others.)
Example: Mixture of Gaussians

[Figure: the variational approximation at initialization and at iterations 20, 28, 35, and 50, alongside the evidence lower bound and the average log predictive over iterations; images by Alp Kucukelbir]
History

[Figures from Peterson and Anderson (1987), Jordan et al. (1999), and Hinton and van Camp (1993)]

„ Variational inference adapts ideas from statistical physics to probabilistic inference.

„ Arguably, it began in the late eighties with Peterson and Anderson (1987), who used mean-field methods to fit a neural network.

„ This idea was picked up by Jordan's lab in the early 1990s (Tommi Jaakkola, Lawrence Saul, Zoubin Ghahramani), who generalized it to many probabilistic models. (A review paper is Jordan et al., 1999.)

„ In parallel, Hinton and Van Camp (1993) also developed mean-field methods for neural networks. Neal and Hinton (1993) connected this idea to the EM algorithm, which led to further variational methods for mixtures of experts (Waterhouse et al., 1996) and HMMs (MacKay, 1997).
Today

[Figures: samples and reconstructions from deep generative models (NORB, CIFAR-10 patches, Frey faces), MNIST imputations, and a two-dimensional latent embedding of MNIST]
[Kingma and Welling 2013] [Rezende et al. 2014] [Kucukelbir et al. 2015]

A simple nonconjugate probability model specified in Stan [Kucukelbir et al. 2015]:

    data {
      int N;                    // number of observations
      int x[N];                 // discrete-valued observations
    }
    parameters {
      real<lower=0> theta;      // latent variable, must be positive
    }
    model {
      theta ~ weibull(1.5, 1);  // nonconjugate prior for latent variable
      for (n in 1:N)
        x[n] ~ poisson(theta);  // likelihood
    }

„ There is now a flurry of new work on variational inference, making it scalable, easier to derive, faster, more accurate, and applying it to more complicated models and applications.

„ Modern VI touches many important areas: probabilistic programming, reinforcement learning, neural networks, convex optimization, Bayesian statistics, and myriad applications.

„ Our goal today is to teach you the basics, explain some of the newer ideas, and to suggest open areas of new research.
Variational Inference:
Foundations and Modern Methods
Part II: Mean-field VI and stochastic VI
Jordan+, Introduction to Variational Methods for Graphical Models, 1999
Ghahramani and Beal, Propagation Algorithms for Variational Bayesian Learning, 2001
Hoffman+, Stochastic Variational Inference, 2013

Part III: Stochastic gradients of the ELBO


Kingma and Welling, Auto-Encoding Variational Bayes, 2014
Ranganath+, Black Box Variational Inference, 2014
Rezende+, Stochastic Backpropagation and Approximate Inference in Deep Generative Models, 2014

Part IV: Beyond the mean field


Agakov and Barber, An Auxiliary Variational Method, 2004
Gregor+, DRAW: A recurrent neural network for image generation, 2015
Rezende+, Variational Inference with Normalizing Flows, 2015
Ranganath+, Hierarchical Variational Models, 2015
Maaløe+, Auxiliary Deep Generative Models, 2016
Variational Inference:
Foundations and Modern Methods

[Figure: q(z; ν) moving from ν_init toward the member closest in KL to p(z | x)]

VI approximates difficult quantities from complex models.


With stochastic optimization we can
„ scale up VI to massive data
„ enable VI on a wide class of difficult models
„ enable VI with elaborate and flexible families of approximations
PART II

Mean-field variational inference


and stochastic variational inference
Motivation: Topic Modeling

Topic models use posterior inference to discover the hidden thematic


structure in a large collection of documents.
Example: Latent Dirichlet Allocation (LDA)

Documents exhibit multiple topics.


Example: Latent Dirichlet Allocation (LDA)

[Figure: topics as distributions over words (e.g. gene/dna/genetic, life/evolve/organism, brain/neuron/nerve, data/number/computer), a document, and its topic proportions and per-word topic assignments]

„ Each topic is a distribution over words


„ Each document is a mixture of corpus-wide topics
„ Each word is drawn from one of those topics
Example: Latent Dirichlet Allocation (LDA)
[Figure: the same cartoon, with the topics, proportions, and assignments now hidden]

„ But we only observe the documents; everything else is hidden.


„ So we want to calculate the posterior

p(topics, proportions, assignments | documents)

(Note: millions of documents; billions of latent variables)


LDA as a Graphical Model

[Graphical model: proportions parameter α, per-document topic proportions θ_d, per-word topic assignment z_{d,n}, observed word w_{d,n}, topics β_k, topic parameter η; plates N, D, K]

„ Encodes assumptions about data with a factorization of the joint


„ Connects assumptions to algorithms for computing with data
„ Defines the posterior (through the joint)
Posterior Inference

[Graphical model: α, θ_d, z_{d,n}, w_{d,n}, β_k, η; plates N, D, K]

„ The posterior of the latent variables given the documents is

    p(β, θ, z | w) = p(β, θ, z, w) / ∫_β ∫_θ Σ_z p(β, θ, z, w).

„ We can’t compute the denominator, the marginal p(w).


„ We use approximate inference.
Topics found in 1.8M articles from the New York Times
[Hoffman, Blei, Wang, Paisley, JMLR 2013]
[Figure: fifteen example topics, each shown as a list of its top words]
Mean-field VI and Stochastic VI

Subsample data → Infer local structure → Update global structure

Road map:

„ Define the generic class of conditionally conjugate models


„ Derive classical mean-field VI
„ Derive stochastic VI, which scales to massive data
A Generic Class of Models

[Graphical model: global variables β; local variables z_i and observations x_i in a plate of size n]

    p(β, z, x) = p(β) ∏_{i=1}^n p(z_i, x_i | β)

„ The observations are x = x1:n .


„ The local variables are z = z1:n .
„ The global variables are β.
„ The ith data point xi only depends on zi and β.

Compute p(β, z | x).


A Generic Class of Models

[Graphical model: global variables β; local variables z_i and observations x_i in a plate of size n]

    p(β, z, x) = p(β) ∏_{i=1}^n p(z_i, x_i | β)

„ A complete conditional is the conditional of a latent variable given the


observations and other latent variables.

„ Assume each complete conditional is in the exponential family,

p(zi | β, xi ) = h(zi ) exp{η` (β, xi )> zi − a(η` (β, xi ))}


p(β | z, x) = h(β) exp{ηg (z, x)> β − a(ηg (z, x))}.
A Generic Class of Models

[Graphical model: global variables β; local variables z_i and observations x_i in a plate of size n]

    p(β, z, x) = p(β) ∏_{i=1}^n p(z_i, x_i | β)

„ A complete conditional is the conditional of a latent variable given the observations and the other latent variables.

„ The global parameter comes from conjugacy [Bernardo and Smith, 1994]:

    η_g(z, x) = α + Σ_{i=1}^n t(z_i, x_i),

where α is a hyperparameter and t(·) are sufficient statistics for [z_i, x_i].


A Generic Class of Models

[Graphical model: global variables β; local variables z_i and observations x_i in a plate of size n]

    p(β, z, x) = p(β) ∏_{i=1}^n p(z_i, x_i | β)

„ Bayesian mixture models
„ Time series models (HMMs, linear dynamical systems)
„ Factorial models
„ Matrix factorization (factor analysis, PCA, CCA)
„ Dirichlet process mixtures, HDPs
„ Multilevel regression (linear, probit, Poisson)
„ Stochastic block models
„ Mixed-membership models (LDA and some variants)
Variational Inference

[Figure: the VI cartoon — ν_init moving toward the ν* minimizing KL(q(z; ν*) || p(z | x))]

Minimize KL between q(β, z; ν) and the posterior p(β, z | x).


The Evidence Lower Bound

L (ν) = Eq [log p(β, z, x)] − Eq [log q(β, z; ν)]

„ KL is intractable; VI optimizes the evidence lower bound (ELBO) instead.


ƒ It is a lower bound on log p(x).
ƒ Maximizing the ELBO is equivalent to minimizing the KL.

„ The ELBO trades off two terms.


ƒ The first term prefers q(·) to place its mass on the MAP estimate.
ƒ The second term encourages q(·) to be diffuse.

„ Caveat: The ELBO is not convex.
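A minimal numpy sketch (an illustration of mine, not code from the tutorial) makes the lower-bound property concrete on a toy conjugate model, p(z) = N(0, 1) and p(x | z) = N(z, 1), where log p(x) = log N(x; 0, 2) is available in closed form; a Monte Carlo estimate of the ELBO under any Gaussian q stays below it and is tight at the exact posterior N(x/2, 1/2).

    import numpy as np

    def log_normal(x, mean, var):
        return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

    def elbo_estimate(x, m, s2, num_samples=100000, seed=0):
        # Monte Carlo estimate of E_q[log p(z, x) - log q(z)] with q = N(m, s2).
        rng = np.random.default_rng(seed)
        z = m + np.sqrt(s2) * rng.standard_normal(num_samples)
        log_p = log_normal(z, 0.0, 1.0) + log_normal(x, z, 1.0)  # log p(z) + log p(x | z)
        log_q = log_normal(z, m, s2)
        return np.mean(log_p - log_q)

    x = 1.3
    print("log p(x)      =", log_normal(x, 0.0, 2.0))       # exact evidence
    print("ELBO, loose q =", elbo_estimate(x, 0.0, 1.0))    # strictly below log p(x)
    print("ELBO, exact q =", elbo_estimate(x, x / 2, 0.5))  # matches log p(x); KL = 0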


Mean-field Variational Inference

[Figure: the model p(β, z, x) and the factorized variational family, connected through the ELBO]

„ We need to specify the form of q(β, z).

„ The mean-field family is fully factorized,


    q(β, z; λ, φ) = q(β; λ) ∏_{i=1}^n q(z_i; φ_i).

„ Each factor is the same family as the model’s complete conditional,

p(β | z, x) = h(β) exp{ηg (z, x)> β − a(ηg (z, x))}


q(β; λ) = h(β) exp{λ> β − a(λ)}.
Mean-field Variational Inference

[Figure: the model p(β, z, x) and the factorized variational family, connected through the ELBO]

„ Optimize the ELBO,

L (λ, φ) = Eq [log p(β, z, x)] − Eq [log q(β, z)] .

„ Traditional VI uses coordinate ascent [Ghahramani and Beal, 2001]:

    λ* = E_φ[η_g(z, x)];    φ_i* = E_λ[η_ℓ(β, x_i)]

„ Iteratively update each parameter, holding others fixed.


ƒ Notice the relationship to Gibbs sampling [Gelfand and Smith, 1990] .
ƒ Caveat: The ELBO is not convex.
Mean-field Variational Inference for LDA

[Graphical model: variational parameters γ_d, φ_{d,n}, λ_k attached to θ_d, z_{d,n}, β_k; model variables α, θ_d, z_{d,n}, w_{d,n}, β_k, η; plates N, D, K]

„ The local variables are the per-document variables θd and zd .


„ The global variables are the topics β1 , . . . , βK .
„ The variational distribution is
    q(β, θ, z) = ∏_{k=1}^K q(β_k; λ_k) ∏_{d=1}^D q(θ_d; γ_d) ∏_{n=1}^N q(z_{d,n}; φ_{d,n})
k=1 d=1 n=1
Mean-field Variational Inference for LDA

[Figure: inferred topic proportions for a single document — probability against topic index, with a few dominant topics]
Mean-field Variational Inference for LDA

“Genetics” “Evolution” “Disease” “Computers”


human evolution disease computer
genome evolutionary host models
dna species bacteria information
genetic organisms diseases data
genes life resistance computers
sequence origin bacterial system
gene biology new network
molecular groups strains systems
sequencing phylogenetic control model
map living infectious parallel
information diversity malaria methods
genetics group parasite networks
mapping new parasites software
project two united new
sequences common tuberculosis simulations
Classical Variational Inference

Input: data x, model p(β, z, x).


Initialize λ randomly.

repeat
for each data point i do
Set local parameter φi ← Eλ [η` (β, xi )].
end

Set global parameter


    λ ← α + Σ_{i=1}^n E_{φ_i}[t(Z_i, x_i)].

until the ELBO has converged
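To make these updates concrete, here is a minimal numpy sketch (my own illustration, not the tutorial's code) of coordinate ascent for a toy conditionally conjugate model: a mixture of K unit-variance Gaussians with a N(0, σ0²) prior on each component mean, q(µ_k) = N(m_k, s²_k) and q(c_i) = Categorical(φ_i).

    import numpy as np

    def cavi_gmm(x, K=3, sigma0_sq=10.0, iters=100, seed=0):
        # Coordinate ascent VI for a mixture of unit-variance Gaussians.
        # q(mu_k) = N(m[k], s2[k]),  q(c_i) = Categorical(phi[i]).
        rng = np.random.default_rng(seed)
        m = rng.standard_normal(K)      # global variational means
        s2 = np.ones(K)                 # global variational variances
        for _ in range(iters):
            # Local step: update assignment probabilities for every data point.
            logits = np.outer(x, m) - 0.5 * (s2 + m ** 2)          # shape (n, K)
            logits -= logits.max(axis=1, keepdims=True)
            phi = np.exp(logits)
            phi /= phi.sum(axis=1, keepdims=True)
            # Global step: update the Gaussian factors on the component means.
            precision = 1.0 / sigma0_sq + phi.sum(axis=0)
            m = (phi * x[:, None]).sum(axis=0) / precision
            s2 = 1.0 / precision
        return m, s2, phi

    rng = np.random.default_rng(1)
    x = np.concatenate([rng.normal(-4, 1, 100), rng.normal(0, 1, 100), rng.normal(4, 1, 100)])
    m, s2, phi = cavi_gmm(x)
    print("estimated component means:", np.sort(m))

Note that the local step touches every data point before each global update; this is exactly the inefficiency that stochastic variational inference removes below.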


A Generic Class of Models

[Graphical model: global variables β; local variables z_i and observations x_i in a plate of size n]

    p(β, z, x) = p(β) ∏_{i=1}^n p(z_i, x_i | β)

„ Bayesian mixture models
„ Time series models (HMMs, linear dynamical systems)
„ Factorial models
„ Matrix factorization (factor analysis, PCA, CCA)
„ Dirichlet process mixtures, HDPs
„ Multilevel regression (linear, probit, Poisson)
„ Stochastic block models
„ Mixed-membership models (LDA and some variants)
Stochastic Variational Inference

[Graphical model: variational parameters γ_d, φ_{d,n}, λ_k attached to θ_d, z_{d,n}, β_k; model variables α, θ_d, z_{d,n}, w_{d,n}, β_k, η; plates N, D, K]

„ Classical VI is inefficient:
ƒ Do some local computation for each data point.
ƒ Aggregate these computations to re-estimate global structure.
ƒ Repeat.
„ This cannot handle massive data.
„ Stochastic variational inference (SVI) scales VI to massive data.
Stochastic Variational Inference

MASSIVE DATA → GLOBAL HIDDEN STRUCTURE
[Figure: population structure inferred from the TGP data with the TeraStructure algorithm at K = 7, 8, 9]

Subsample data → Infer local structure → Update global structure
Stochastic Optimization

„ Replace the gradient with cheaper noisy estimates [Robbins and Monro, 1951]

„ Guaranteed to converge to a local optimum [Bottou, 1996]

„ Has enabled modern machine learning


Stochastic Optimization

„ With noisy gradients, update

    ν_{t+1} = ν_t + ρ_t ∇̂_ν L(ν_t)

„ Requires unbiased gradients, E[∇̂_ν L(ν)] = ∇_ν L(ν)

„ Requires the step size sequence ρt follows the Robbins-Monro conditions


Stochastic Variational Inference

„ The natural gradient of the ELBO [Amari, 1998; Sato, 2001]

    ∇_λ^nat L(λ) = (α + Σ_{i=1}^n E_{φ_i}[t(Z_i, x_i)]) − λ.

„ Construct a noisy natural gradient,

    j ∼ Uniform(1, . . . , n)
    ∇̂_λ^nat L(λ) = α + n E_{φ_j*}[t(Z_j, x_j)] − λ.

„ This is a good noisy gradient.


ƒ Its expectation is the exact gradient (unbiased).
ƒ It only depends on optimized parameters of one data point (cheap).
Stochastic Variational Inference

Input: data x, model p(β, z, x).


Initialize λ randomly. Set ρt appropriately.

repeat
Sample j ∼ Unif(1, . . . , n).
Set local parameter φ ← E_λ[η_ℓ(β, x_j)].

Set intermediate global parameter

λ̂ = α + nEφ [t(Zj , xj )].

Set global parameter

λ = (1 − ρt )λ + ρt λ̂.

until forever
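A minimal numpy sketch (my own) of this loop for the toy Gaussian mixture used earlier: the global factor is summarized by the sufficient statistics (Σ_i φ_ik x_i, Σ_i φ_ik), the intermediate estimate scales a single data point's statistics by n, and the update blends old and new with a Robbins-Monro step size; hyperparameters are illustrative.

    import numpy as np

    def svi_gmm(x, K=3, sigma0_sq=10.0, steps=5000, tau=1.0, kappa=0.7, seed=0):
        rng = np.random.default_rng(seed)
        n = len(x)
        # Global sufficient statistics: t1[k] ~ sum_i phi_ik x_i,  t2[k] ~ sum_i phi_ik.
        # Blending these is equivalent to blending the natural parameters,
        # since the constant prior term is added when forming (m, s2).
        t1 = rng.choice(x, size=K, replace=False)       # init means near random data points
        t2 = np.ones(K)
        for step in range(1, steps + 1):
            precision = 1.0 / sigma0_sq + t2
            m, s2 = t1 / precision, 1.0 / precision
            j = rng.integers(n)                         # subsample one data point
            logits = x[j] * m - 0.5 * (s2 + m ** 2)     # local step for point j
            phi = np.exp(logits - logits.max())
            phi /= phi.sum()
            rho = (step + tau) ** (-kappa)              # Robbins-Monro step size
            t1 = (1 - rho) * t1 + rho * n * phi * x[j]  # intermediate estimate scales by n
            t2 = (1 - rho) * t2 + rho * n * phi
        precision = 1.0 / sigma0_sq + t2
        return t1 / precision, 1.0 / precision

    rng = np.random.default_rng(1)
    x = np.concatenate([rng.normal(-4, 1, 1000), rng.normal(0, 1, 1000), rng.normal(4, 1, 1000)])
    print("estimated component means:", np.sort(svi_gmm(x)[0]))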
Stochastic Variational Inference

MASSIVE DATA → GLOBAL HIDDEN STRUCTURE
[Figure: population structure inferred from the TGP data with the TeraStructure algorithm at K = 7, 8, 9]

Subsample data → Infer local structure → Update global structure
Stochastic Variational Inference in LDA

[Graphical model: variational parameters γ_d, φ_{d,n}, λ_k attached to θ_d, z_{d,n}, β_k; model variables α, θ_d, z_{d,n}, w_{d,n}, β_k, η; plates N, D, K]

„ Sample a document
„ Estimate the local variational parameters using the current topics
„ Form intermediate topics from those local parameters
„ Update topics as a weighted average of intermediate and current topics
Stochastic Variational Inference in LDA

[Figure: held-out perplexity against the number of documents seen (log scale); online VI on 98K and on 3.3M documents reaches lower perplexity than batch VI on 98K documents]

Top eight words of one topic as more documents are analyzed:
2048: systems, road, made, service, announced, national, west, language
4096: systems, health, communication, service, billion, language, care, road
8192: service, systems, health, companies, market, communication, company, billion
12288: service, systems, companies, business, company, billion, health, industry
16384: service, companies, systems, business, company, industry, market, billion
32768: business, service, companies, industry, company, management, systems, services
49152: business, service, companies, industry, services, company, management, public
65536: business, industry, service, companies, services, company, management, public

[Hoffman et al., 2010]


Topics using the HDP, found in 1.8M articles from the New York Times
Modified from Hoffman et al. (2013).
[Figure: fifteen example topics, each shown as a list of its top words]
SVI scales many models

Subsample data → Infer local structure → Update global structure

„ Bayesian mixture models
„ Time series models (HMMs, linear dynamical systems)
„ Factorial models
„ Matrix factorization (factor analysis, PCA, CCA)
„ Dirichlet process mixtures, HDPs
„ Multilevel regression (linear, probit, Poisson)
„ Stochastic block models
„ Mixed-membership models (LDA and some variants)
[Figure: topics discovered in a large corpus of scientific articles, shown as clusters of related terms]

[Figure: inferred admixture proportions for individuals from worldwide populations]
PART III

Stochastic Gradients of the ELBO


Review: The Promise

[Diagram: the probabilistic pipeline — make assumptions, discover patterns, predict & explore]

„ Realized for conditionally conjugate models

„ What about the general case?


The Variational Inference Recipe

Start with a model:

p(z, x)
The Variational Inference Recipe

Choose a variational approximation:

q(z; ν)
The Variational Inference Recipe

Write down the ELBO:

L (ν) = Eq(z;ν) [log p(x, z) − log q(z; ν)]


The Variational Inference Recipe

Compute the expectation (the integral):

Example: L (ν) = xν2 + log ν


The Variational Inference Recipe

Take derivatives:

Example: ∇_ν L(ν) = 2xν + 1/ν
The Variational Inference Recipe

Optimize:

νt+1 = νt + ρt ∇ν L
The Variational Inference Recipe

[Diagram: the recipe — the model p(x, z) and the approximation q(z; ν) enter the expectation ∫ (···) q(z; ν) dz, which we then differentiate, ∇_ν]
Example: Bayesian Logistic Regression

„ Data pairs yi , xi
„ xi are covariates
„ yi are labels
„ z is the regression coefficient
„ Generative process

p(z) ∼ N(0, 1)
p(yi | xi , z) ∼ Bernoulli(σ(zxi ))
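For use with the gradient estimators later on, here is a minimal numpy sketch (mine) of the log joint log p(z) + log p(y | x, z) for this model, in the one-data-point, scalar-covariate setting of the next slides.

    import numpy as np

    def log_joint(z, x, y):
        # log p(z) + log p(y | x, z) for Bayesian logistic regression:
        # z ~ N(0, 1),  y | x, z ~ Bernoulli(sigmoid(z * x)).
        log_prior = -0.5 * (np.log(2 * np.pi) + z ** 2)
        logits = z * x
        log_lik = y * logits - np.log1p(np.exp(logits))   # Bernoulli log-likelihood
        return log_prior + log_lik

    print(log_joint(z=0.5, x=2.0, y=1))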
VI for Bayesian Logistic Regression

Assume:
„ We have one data point (y, x)
„ x is a scalar
„ The approximating family q is the normal; ν = (µ, σ2 )
The ELBO is

L (µ, σ2 ) = Eq [log p(z) + log p(y | x, z) − log q(z)]


VI for Bayesian Logistic Regression

    L(µ, σ²) = E_q[log p(z) − log q(z) + log p(y | x, z)]
             = −½(µ² + σ²) + ½ log σ² + E_q[log p(y | x, z)] + C
             = −½(µ² + σ²) + ½ log σ² + E_q[yxz − log(1 + exp(xz))]
             = −½(µ² + σ²) + ½ log σ² + yxµ − E_q[log(1 + exp(xz))]

We are stuck.
1. We cannot analytically take that expectation.
2. The expectation hides the objective's dependence on the variational parameters. This makes it hard to directly optimize.
Options?

„ Derive a model-specific bound:

[Jordan and Jaakkola; 1996], [Braun and McAuliffe; 2008], others

„ More general approximations that require model-specific analysis:


[Wang and Blei; 2013], [Knowles and Minka; 2011]
Nonconjugate Models

„ Nonlinear Time Series Models
„ Deep Latent Gaussian Models
„ Models with Attention (such as DRAW)
„ Generalized Linear Models (Poisson regression)
„ Stochastic Volatility Models
„ Discrete Choice Models
„ Bayesian Neural Networks
„ Deep Exponential Families (e.g. Sparse Gamma or Poisson)
„ Correlated Topic Models (including nonparametric variants)
„ Sigmoid Belief Networks

We need a solution that does not entail model-specific work


Black Box Variational Inference (BBVI)

[Diagram: reusable variational families + any model p(β, z | x) + massive data → black box variational inference]

„ Sample from q(·)
„ Form noisy gradients without model-specific computation
„ Use stochastic optimization


The Problem in the Classical VI Recipe

[Diagram: compute the expectation ∫ (···) q(z; ν) dz analytically, then differentiate — the integral is the bottleneck]
The New VI Recipe

[Diagram: move the gradient inside — estimate ∇_ν ∫ (···) q(z; ν) dz directly]

Use stochastic optimization!


Computing Gradients of Expectations

„ Define

g(z, ν) = log p(x, z) − log q(z; ν)

„ What is ∇_ν L?

    ∇_ν L = ∇_ν ∫ q(z; ν) g(z, ν) dz
          = ∫ [∇_ν q(z; ν)] g(z, ν) + q(z; ν) ∇_ν g(z, ν) dz
          = ∫ q(z; ν) [∇_ν log q(z; ν)] g(z, ν) + q(z; ν) ∇_ν g(z, ν) dz
          = E_{q(z;ν)}[∇_ν log q(z; ν) g(z, ν) + ∇_ν g(z, ν)]

using ∇_ν log q = ∇_ν q / q.
Roadmap

„ Score Function Gradients

„ Pathwise Gradients

„ Amortized Inference
Score Function Gradients of the ELBO
Score Function Estimator

Recall

∇ν L = Eq(z;ν) [∇ν log q(z; ν)g(z, ν) + ∇ν g(z, ν)]

Simplify, using ∇_ν g(z, ν) = −∇_ν log q(z; ν) and E_q[∇_ν log q(z; ν)] = 0:

    E_q[∇_ν g(z, ν)] = 0

Gives the gradient:

∇ν L = Eq(z;ν) [∇ν log q(z; ν)(log p(x, z) − log q(z; ν))]

Sometimes called likelihood ratio or REINFORCE gradients


[Glynn 1990; Williams, 1992; Wingate+ 2013; Ranganath+ 2014; Mnih+ 2014]
Noisy Unbiased Gradients

Gradient: Eq(z;ν) [∇ν log q(z; ν)(log p(x, z) − log q(z; ν))]

Noisy unbiased gradients with Monte Carlo!

    (1/S) Σ_{s=1}^S ∇_ν log q(z_s; ν) (log p(x, z_s) − log q(z_s; ν)),    where z_s ∼ q(z; ν)
Basic BBVI

Algorithm 1: Basic Black Box Variational Inference


Input : Model log p(x, z),
Variational approximation q(z; ν)
Output : Variational Parameters: ν

while not converged do


z[s] ∼ q // Draw S samples from q
ρ = t-th value of a Robbins Monro sequence
ν = ν + ρ (1/S) Σ_{s=1}^S ∇_ν log q(z[s]; ν) (log p(x, z[s]) − log q(z[s]; ν))
t=t+1
end
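Here is a minimal numpy sketch (my own, not the authors' implementation) of this loop for the one-data-point Bayesian logistic regression example, with q(z; ν) = N(µ, σ²) and ν = (µ, log σ); the step sizes and sample count are illustrative.

    import numpy as np

    def log_joint(z, x, y):
        return -0.5 * (np.log(2 * np.pi) + z ** 2) + y * z * x - np.log1p(np.exp(z * x))

    def bbvi(x, y, steps=20000, S=32, seed=0):
        rng = np.random.default_rng(seed)
        mu, log_sigma = 0.0, 0.0                       # variational parameters nu
        for t in range(1, steps + 1):
            sigma = np.exp(log_sigma)
            z = mu + sigma * rng.standard_normal(S)    # draw S samples from q
            log_q = -0.5 * np.log(2 * np.pi) - log_sigma - 0.5 * ((z - mu) / sigma) ** 2
            f = log_joint(z, x, y) - log_q             # log p(x, z) - log q(z; nu)
            score_mu = (z - mu) / sigma ** 2           # grad of log q w.r.t. mu
            score_ls = ((z - mu) / sigma) ** 2 - 1.0   # grad of log q w.r.t. log sigma
            rho = 0.1 * t ** (-0.7)                    # Robbins-Monro step size
            mu += rho * np.mean(score_mu * f)
            log_sigma += rho * np.mean(score_ls * f)
        return mu, np.exp(log_sigma)

    print(bbvi(x=2.0, y=1))   # the posterior mean shifts positive, since y = 1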
The requirements for inference

The noisy gradient:

    (1/S) Σ_{s=1}^S ∇_ν log q(z_s; ν) (log p(x, z_s) − log q(z_s; ν)),    where z_s ∼ q(z; ν)

To compute the noisy gradient of the ELBO we need


„ Sampling from q(z)
„ Evaluating ∇ν log q(z; ν)
„ Evaluating log p(x, z) and log q(z)

There is no model specific work: black box criteria are satisfied


Black Box Variational Inference

[Diagram: reusable variational families + any model p(β, z | x) + massive data → black box variational inference]

„ Sample from q(·)
„ Form noisy gradients without model-specific computation
„ Use stochastic optimization


Problem: Basic BBVI doesn’t work

Variance of the gradient can be a problem

Var_{q(z;ν)}[∇̂_ν L] = E_{q(z;ν)}[(∇_ν log q(z; ν)(log p(x, z) − log q(z; ν)) − ∇_ν L)²].

[Figure: a density (PDF), |z − µ|, and the score function plotted over z — the score grows large for rare values]
Intuition:
Sampling rare values can lead to large scores and thus high variance
Solution: Control Variates

Replace f with f̂, where E[f̂(z)] = E[f(z)]. A general such class:

    f̂(z) ≜ f(z) − a (h(z) − E[h(z)])

[Figure: the density, f = x + x², and control-variate versions f̂ with h = x² and with h = f]

„ h is a function of our choice
„ a is chosen to minimize the variance
„ Good h have high correlation with the original function f

„ For variational inference we need functions with a known expectation under q
„ Set h to ∇_ν log q(z; ν)
„ This is simple because E_q[∇_ν log q(z; ν)] = 0 for any q

Many of the other techniques from Monte Carlo can also help:
„ Importance sampling, quasi-Monte Carlo, Rao-Blackwellization

[Ruiz+ 2016; Ranganath+ 2014; Titsias+ 2015; Mnih+ 2016]
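A minimal numpy sketch (mine) of the recipe for a single variational parameter µ: the control variate is h(z) = ∇_µ log q(z; ν), the scaling a is estimated from the same samples as Cov(hf, h)/Var(h), and the controlled estimator keeps the same mean with lower variance.

    import numpy as np

    rng = np.random.default_rng(0)
    x_obs, y_obs, mu, sigma = 2.0, 1, 0.0, 1.0

    def log_joint(z):
        return -0.5 * (np.log(2 * np.pi) + z ** 2) + y_obs * z * x_obs - np.log1p(np.exp(z * x_obs))

    z = mu + sigma * rng.standard_normal(100000)
    log_q = -0.5 * np.log(2 * np.pi) - np.log(sigma) - 0.5 * ((z - mu) / sigma) ** 2
    h = (z - mu) / sigma ** 2                 # score w.r.t. mu; E_q[h] = 0
    f = log_joint(z) - log_q
    plain = h * f                             # per-sample score-function gradient
    a = np.cov(plain, h)[0, 1] / np.var(h)    # scaling that minimizes the variance
    controlled = plain - a * h
    print("same mean:          ", plain.mean(), controlled.mean())
    print("variance reduction: ", plain.var(), "->", controlled.var())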


Nonconjugate Models

„ Nonlinear Time Series Models
„ Deep Latent Gaussian Models
„ Models with Attention (such as DRAW)
„ Generalized Linear Models (Poisson regression)
„ Stochastic Volatility Models
„ Discrete Choice Models
„ Bayesian Neural Networks
„ Deep Exponential Families (e.g. Sparse Gamma or Poisson)
„ Correlated Topic Models (including nonparametric variants)
„ Sigmoid Belief Networks

We can design models based on data rather than inference.


More Assumptions?

The current black box criteria


„ Sampling from q(z)
„ Evaluating ∇ν log q(z; ν)
„ Evaluating log p(x, z) and log q(z)
Can we make additional assumptions that are not too restrictive?
Pathwise Gradients of the ELBO
Pathwise Estimator

Assume
1. z = t(ε, ν) for ε ∼ s(ε) implies z ∼ q(z; ν)
Example:

ε ∼ Normal(0, 1)
z = εσ + µ
→ z ∼ Normal(µ, σ2 )

2. log p(x, z) and log q(z) are differentiable with respect to z


Pathwise Estimator

Recall

∇ν L = Eq(z;ν) [∇ν log q(z; ν)g(z, ν) + ∇ν g(z, ν)]

Rewrite using z = t(ε, ν):

∇ν L = Es(ε) [∇ν log s(ε)g(t(ε, ν), ν) + ∇ν g(t(ε, ν), ν)]

To differentiate:

∇L (ν) = Es(ε) [∇ν g(t(ε, ν), ν)]


= Es(ε) [∇z [log p(x, z) − log q(z; ν)]∇ν t(ε, ν) − ∇ν log q(z; ν)]
= Es(ε) [∇z [log p(x, z) − log q(z; ν)]∇ν t(ε, ν)]

This is also known as the reparameterization gradient.


[Glasserman 1991; Fu 2006; Kingma+ 2014; Rezende+ 2014; Titsias+ 2014]
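A minimal numpy sketch (mine) of the reparameterization estimator on the same logistic regression example: z = µ + σε with ε ∼ N(0, 1), and the gradient is the average of ∇_z[log p(x, z) − log q(z; ν)] pushed through ∂z/∂µ = 1 and ∂z/∂log σ = σε.

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def pathwise_grad(mu, log_sigma, x, y, S, rng):
        sigma = np.exp(log_sigma)
        eps = rng.standard_normal(S)
        z = mu + sigma * eps                          # z = t(eps, nu)
        dlogp_dz = -z + x * (y - sigmoid(x * z))      # grad of log p(z) + log p(y | x, z)
        dlogq_dz = -(z - mu) / sigma ** 2             # grad of log q(z; nu) in z
        dz = dlogp_dz - dlogq_dz
        return np.mean(dz), np.mean(dz * sigma * eps)  # chain rule through dz/dmu, dz/dlog_sigma

    rng = np.random.default_rng(0)
    mu, log_sigma = 0.0, 0.0
    for t in range(1, 5001):
        g_mu, g_ls = pathwise_grad(mu, log_sigma, x=2.0, y=1, S=8, rng=rng)
        rho = 0.1 * t ** (-0.7)                       # Robbins-Monro step size
        mu, log_sigma = mu + rho * g_mu, log_sigma + rho * g_ls
    print(mu, np.exp(log_sigma))

Even with only S = 8 samples per step, this estimator is far less noisy than the score-function version above, which is the point of the comparison on the next slide.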
Variance Comparison

[Figure: gradient-estimator variance against the number of samples for a multivariate nonlinear regression model — the pathwise estimator has much lower variance than the score-function estimator, with or without a control variate; Kucukelbir+ 2016]
Score Function Estimator vs. Pathwise Estimator

Score function:
„ Differentiates the density, ∇_ν q(z; ν)
„ Works for discrete and continuous models
„ Works for a large class of variational approximations
„ Variance can be a big problem

Pathwise:
„ Differentiates the function, ∇_z[log p(x, z) − log q(z; ν)]
„ Requires differentiable models
„ Requires the variational approximation to have the form z = t(ε, ν)
„ Generally better-behaved variance
Amortized Inference
Hierarchical Models

A generic class of models:

[Graphical model: global variables β; local variables z_i and observations x_i in a plate of size n]

    p(β, z, x) = p(β) ∏_{i=1}^n p(z_i, x_i | β)

Ñ Bayesian mixture models, Dirichlet process mixtures, HDPs
Ñ Time series models, multilevel regression

Mean Field Variational Approximation

[Figure: the model and the fully factorized variational family, connected through the ELBO]
SVI: Revisited

Input: data x, model p(β, z, x).


Initialize λ randomly. Set ρt appropriately.

repeat
Sample j ∼ Unif(1, . . . , n).
Set local parameter φ ← E_λ[η_ℓ(β, x_j)].

Set intermediate global parameter

λ̂ = α + nEφ [t(Zj , xj )].

Set global parameter


λ = (1 − ρt )λ + ρt λ̂.

until forever
SVI: The problem
Input: data x, model p(β, z, x).
Initialize λ randomly. Set ρt appropriately.

repeat
Sample j ∼ Unif(1, . . . , n).
Set local parameter φ ← E_λ[η_ℓ(β, x_j)].
Set intermediate global parameter

λ̂ = α + nEφ [t(Zj , xj )].

Set global parameter


λ = (1 − ρt )λ + ρt λ̂.

until forever
„ These expectations are no longer tractable
„ Inner stochastic optimization needed for each data point.
SVI: The problem
Input: data x, model p(β, z, x).
Initialize λ randomly. Set ρt appropriately.

repeat
Sample j ∼ Unif(1, . . . , n).
Set local parameter φ ← E_λ[η_ℓ(β, x_j)].
Set intermediate global parameter

λ̂ = α + nEφ [t(Zj , xj )].

Set global parameter


λ = (1 − ρt )λ + ρt λ̂.

until forever

Idea: Learn a mapping f from xi to φi


Amortizing Inference

ELBO:

    L(λ, φ_{1:n}) = E_q[log p(β, z, x)] − E_q[ log q(β; λ) + Σ_{i=1}^n log q(z_i; φ_i) ]

Amortizing the ELBO with an inference network f:

    L(λ, θ) = E_q[log p(β, z, x)] − E_q[ log q(β; λ) + Σ_{i=1}^n log q(z_i | x_i; φ_i = f_θ(x_i)) ]

[Dayan+ 1995; Heess+ 2013; Gershman+ 2014, many others]
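A minimal numpy sketch (mine, with a hypothetical one-hidden-layer network) of what f_θ can look like: a single set of weights maps every observation x_i to the mean and log-variance of q(z_i | x_i), so no per-data-point parameters φ_i are stored. In practice θ is trained with the amortized gradients on the next slide.

    import numpy as np

    class InferenceNetwork:
        # Maps each observation x_i to local variational parameters (mean, log-variance).
        def __init__(self, hidden=16, seed=0):
            rng = np.random.default_rng(seed)
            self.W1 = 0.1 * rng.standard_normal((1, hidden))
            self.b1 = np.zeros(hidden)
            self.W2 = 0.1 * rng.standard_normal((hidden, 2))
            self.b2 = np.zeros(2)

        def __call__(self, x):
            x = np.atleast_1d(np.asarray(x, dtype=float)).reshape(-1, 1)
            h = np.tanh(x @ self.W1 + self.b1)
            out = h @ self.W2 + self.b2
            return out[:, 0], out[:, 1]        # mean_i, log_var_i for every x_i

    f = InferenceNetwork()
    means, log_vars = f(np.array([0.3, -1.2, 2.5]))   # shared weights, one pass per data point
    print(means, np.exp(log_vars))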


Amortized SVI
Input: data x, model p(β, z, x).
Initialize λ randomly. Set ρt appropriately.

repeat
Sample β ∼ q(β; λ).
Sample j ∼ Unif(1, . . . , n).
Sample z_j ∼ q(z_j | x_j; f_θ(x_j)).
Compute stochastic gradients

    ∇̂_λ L = ∇_λ log q(β; λ) (log p(β) + n log p(x_j, z_j | β) − log q(β; λ))
    ∇̂_θ L = n ∇_θ log q(z_j | x_j; θ) (log p(x_j, z_j | β) − log q(z_j | x_j; θ))

Update

    λ = λ + ρ_t ∇̂_λ
    θ = θ + ρ_t ∇̂_θ.

until forever
A computational-statistical tradeoff

„ Amortized inference is faster, but admits a smaller class of approximations


„ The size of the smaller class depends on the flexibility of f

    ∏_{i=1}^n q(z_i; φ_i)        vs.        ∏_{i=1}^n q(z_i | x_i; f_θ(x_i))
Example: Variational Autoencoder (VAE)

[Graphical model: z → x]

    p(z) = Normal(0, 1)
    p(x | z) = Normal(µ_β(z), σ²_β(z))

µ_β and σ²_β are deep networks with parameters β.


[Kingma+ 2014; Rezende+ 2014]
Example: Variational Autoencoder (VAE)

[Diagram: the model p(x | z) and the inference network q(z | x); z ~ q(z | x), x ~ p(x | z)]

    q(z | x) = Normal(f_θ^µ(x), f_θ^σ²(x))

All functions are deep networks.


Example: Variational Autoencoder (VAE)

Analogies
Analogy-making
Rules of Thumb for a New Model

If log p(x, z) is differentiable in z:
„ Try out an approximation q that is reparameterizable

If log p(x, z) is not differentiable in z:
„ Use the score function estimator with control variates
„ Add further variance reductions based on experimental evidence

General advice:
„ Use coordinate-specific learning rates (e.g. RMSProp, AdaGrad)
„ Annealing and tempering
„ Consider parallelizing across samples from q
Software

Systems with Variational Inference:


„ Venture, WebPPL, Edward, Stan, PyMC3, Infer.net, Anglican
Good for trying out lots of models

Differentiation Tools:
„ Theano, Torch, Tensorflow, Stan Math, Caffe
Can lead to more scalable implementations of individual models
PART IV

Beyond the Mean Field


Review: Variational Bound and Optimisation

[Diagram: reusable variational families + any model p(β, z | x) + massive data → black box variational inference]

„ Probabilistic modelling and variational inference.
„ Scalable inference through stochastic optimisation.
„ Black-box variational inference: non-conjugate models, Monte Carlo gradient estimators, and amortised inference.

These advances empower us with new ways to design
more flexible approximate posterior distributions q(z)
Mean-field Approximations

Fully-factorised

[Figure: the VI cartoon, and a factor graph over z_1, z_2, z_3 with no edges between them]

    q_MF(z | x) = ∏_k q(z_k)

A key part of the algorithm is the choice of the approximate posterior q(z).

    log p(x) ≥ L = E_{q(z|x)}[log p(x, z)] − E_{q(z|x)}[log q(z | x)]
                    (expected likelihood)     (entropy)
Mean-Field Posterior Approximations

[Figure: a deep latent Gaussian model — a latent variable model p(x, z) with prior p(z) and likelihood p(x | z) — and its true posterior]

Mean-field or fully-factorised posterior is usually not sufficient


Real-world Posterior Distributions

[Figure: the same deep latent Gaussian model and examples of its true posterior over z]

Complex dependencies · Non-Gaussian distributions · Multiple modes


Families of Approximate Posteriors

Two high-level goals:
„ Build richer approximate posterior distributions.
„ Maintain computational efficiency and scalability.

[Figure: a spectrum from most expressive to least expressive — the true posterior at one end, the fully-factorised family at the other]

    q*(z | x) ∝ p(x | z) p(z)        q_MF(z | x) = ∏_k q(z_k)

Same as the problem of specifying a model of the data itself.


Structured Posterior Approximations

[Figure: the spectrum — true posterior, structured approximation, fully-factorised]

    q*(z | x) ∝ p(x | z) p(z)        q(z) = ∏_k q_k(z_k | {z_j}_{j≠k})        q_MF(z | x) = ∏_k q(z_k)

Structured mean field: introduce any form of dependency to provide a richer approximating class of distributions.
[Saul and Jordan, 1996]
Gaussian Approximate Posteriors

Use a correlated Gaussian:

    q_G(z; ν) = N(z | µ, Σ),    with variational parameters ν = {µ, Σ}

Covariance models: the structure of the covariance Σ describes the dependency. Full covariance is the richest, but computationally expensive.

„ Mean-field: Σ = diag(α_1, . . . , α_K)
„ Rank-1: Σ = diag(α_1, . . . , α_K) + u uᵀ
„ Rank-J: Σ = diag(α_1, . . . , α_K) + Σ_j u_j u_jᵀ
„ Full: Σ = U Uᵀ

[Figure: test negative marginal likelihood comparing rank-1, diagonal, wake-sleep, and factor-analysis approximations]

The approximate posterior is always Gaussian.
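A minimal numpy sketch (mine) of why the rank-1 parameterization stays cheap: a draw with covariance diag(α) + uuᵀ needs only K + 1 standard normals and never forms or factorizes a K × K matrix.

    import numpy as np

    def sample_rank1_gaussian(mu, alpha, u, num_samples, rng):
        # Draws from N(mu, diag(alpha) + u u^T) using K + 1 standard normals per sample.
        K = len(mu)
        eps1 = rng.standard_normal((num_samples, K))
        eps2 = rng.standard_normal((num_samples, 1))
        return mu + np.sqrt(alpha) * eps1 + eps2 * u

    rng = np.random.default_rng(0)
    mu, alpha, u = np.zeros(3), np.array([1.0, 2.0, 0.5]), np.array([1.0, -1.0, 0.5])
    z = sample_rank1_gaussian(mu, alpha, u, 200000, rng)
    print(np.cov(z.T))                        # close to diag(alpha) + np.outer(u, u)
    print(np.diag(alpha) + np.outer(u, u))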


Beyond Gaussian Approximations

Autoregressive distributions: impose an ordering and a non-linear dependency on all preceding variables.

    q_AR(z; ν) = ∏_k q_k(z_k | z_{<k}; ν_k)

Compare DLGMs: Gaussian mean field (VAE) vs. an autoregressive posterior (DRAW) in fully-connected DLGMs on CIFAR-10:

    VAE: ≤ 86.6        DRAW: ≤ 80.9

[Gregor et al., 2015]

The joint distribution is non-Gaussian, although the conditionals are.


More Structured Posteriors

Linking functions
Mixture model

C(z)
y

z1 z2 z3 z1 z2 z3
!
Y
X qlm (z; ⌫) = qk (zk |⌫ k ) C(z; ⌫ k+1 )
qmm (z; ⌫) = ⇢r qr (zr |⌫ r ) k
r

[Saul and Jordan, 1996, Tran et al., 2016]


More Structured Posteriors

[Diagrams: a mixture model with indicator y over z1, z2, z3, and a linking function C(z) coupling z1, z2, z3.]

Mixture model:      q_mm(z; ν) = ∑_r ρ_r q_r(z | ν_r)
Linking functions:  q_lm(z; ν) = (∏_k q_k(z_k | ν_k)) C(z; ν_{K+1})

[Saul and Jordan, 1996; Tran et al., 2016]

Suggests a general way to improve posterior approximations:
introduce additional variables that induce dependencies, but that remain tractable and efficient.
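A minimal sketch (hypothetical weights, means and scales) of the mixture posterior q_mm(z; ν) = ∑_r ρ_r q_r(z | ν_r) with diagonal-Gaussian components: the log-density is a log-sum-exp over components, and sampling first picks a component.

import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm

rng = np.random.default_rng(0)
R, K = 3, 2                                # mixture components, latent dimension
rho = np.array([0.5, 0.3, 0.2])            # mixing weights, sum to 1
mus = rng.normal(size=(R, K))              # component means (hypothetical)
sigmas = np.exp(rng.normal(size=(R, K)))   # component scales

def log_q_mm(z):
    """log q_mm(z) = logsumexp_r [ log rho_r + log q_r(z) ]."""
    comp = norm.logpdf(z, mus, sigmas).sum(axis=1)   # log q_r(z), diagonal Gaussians
    return logsumexp(np.log(rho) + comp)

def sample_q_mm():
    r = rng.choice(R, p=rho)                         # pick a component
    return rng.normal(mus[r], sigmas[r])

z = sample_q_mm()
print(z, log_q_mm(z))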
Designing Richer Posteriors
1. Introduce new variables ω that help to form a richer approximate posterior distribution.

   q(z; ν) = ∫ q(z, ω; ν) dω

2. Adapt the bound: compute the entropy term exactly, or bound it (a Monte Carlo sketch of the standard bound follows below).

   log p(x) ≥ L = E_q(z|x)[log p(x, z)] − E_q(z|x)[log q(z|x)]
                  (expected likelihood)   (entropy)

3. Maintain computational efficiency: linear in the number of latent variables.

[Diagram: an inference path x → z0 → z1 → … → zK with auxiliary variables ω.]

Look at two different approaches:
„ Change-of-variables: normalising flows and invertible transforms.
„ Auxiliary variables: entropy bounds, Monte Carlo sampling.
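As a concrete toy illustration of step 2, a Monte Carlo estimate of the standard bound with a reparameterised Gaussian q. The model p(x, z) and all parameter values below are hypothetical stand-ins; for this Gaussian q the entropy term is available in closed form.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = np.array([0.8, -0.3])                  # observed data (hypothetical)

def log_joint(x, z):
    """Toy model: z ~ N(0, 1), x | z ~ N(z, 0.5) for each coordinate."""
    return norm.logpdf(z, 0.0, 1.0) + norm.logpdf(x, z, 0.5).sum()

# Variational parameters of a Gaussian q(z | x) = N(mu, sigma^2).
mu, log_sigma = 0.2, -0.5

def elbo_estimate(n_samples=100):
    sigma = np.exp(log_sigma)
    eps = rng.normal(size=n_samples)
    z = mu + sigma * eps                                  # reparameterised samples
    expected_ll = np.mean([log_joint(x, zi) for zi in z])  # E_q[log p(x, z)]
    entropy = 0.5 * np.log(2 * np.pi * np.e * sigma**2)    # Gaussian entropy, exact
    return expected_ll + entropy

print(elbo_estimate())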
Approximations using Change-of-variables

[Diagram: an inference path x → z0 → z1 → … → zK.]

Exploit the rule for the change of variables of random variables:
„ Begin with an initial distribution q0(z0|x).
„ Apply a sequence of K invertible functions f_k.

Sampling and entropy:

z_K = f_K ∘ … ∘ f_2 ∘ f_1(z_0)

log q_K(z_K) = log q_0(z_0) − ∑_{k=1}^{K} log |det ∂f_k/∂z_{k−1}|

q(z′) = q(z) |det ∂f/∂z|^{−1}   for z′ = f(z)

[Figure: a distribution flowing through a sequence of invertible transforms, t = 0, 1, …, T.]

Distribution flows through a sequence of invertible transforms.
[Rezende and Mohamed, 2015]
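A minimal sketch of this density-update rule using deliberately simple invertible element-wise maps f_k(z) = a_k ⊙ z + b_k (illustrative choices only, not the transforms used in practice): the log-density of the final sample is the base log-density minus the accumulated log-determinants.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
D, K = 2, 3                                   # latent dimension, number of transforms
a = np.exp(rng.normal(size=(K, D)))           # positive scales (hypothetical)
b = rng.normal(size=(K, D))                   # shifts

# z0 ~ q0 = N(0, I)
z = rng.normal(size=D)
log_q = norm.logpdf(z, 0.0, 1.0).sum()        # log q0(z0)

# Flow z_k = f_k(z_{k-1}) = a_k * z_{k-1} + b_k, an invertible element-wise map
# whose Jacobian is diag(a_k), so log|det| = sum(log a_k).
for k in range(K):
    z = a[k] * z + b[k]
    log_q -= np.sum(np.log(a[k]))             # log q_K = log q_0 - sum_k log|det|

print(z, log_q)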
Normalising Flows

[Figure: a planar normalising flow applied to a unit Gaussian and to a uniform base distribution q0, shown after K = 1, 2, and 10 transformations.]
Choice of Transformation Function

L = E_q0(z0)[log p(x, z_K)] − E_q0(z0)[log q_0(z_0)] − E_q0(z0)[ ∑_{k=1}^{K} log |det ∂f_k/∂z_{k−1}| ]

„ Begin with a fully-factorised Gaussian and improve it by change of variables.
„ Triangular Jacobians allow for computational efficiency.

[Diagrams of the three transformations.]

Planar flow:                  z_k = z_{k−1} + u h(wᵀ z_{k−1} + b)
Real NVP (coupling layer):    y_{1:d}   = z_{k−1,1:d}
                              y_{d+1:D} = t(z_{k−1,1:d}) + z_{k−1,d+1:D} ⊙ exp(s(z_{k−1,1:d}))
Inverse autoregressive flow:  z_k = (z_{k−1} − µ_k(z_<k, x)) / σ_k(z_<k, x)

[Rezende and Mohamed, 2015; Dinh et al., 2016; Kingma et al., 2016]

Linear time computation of the determinant and its gradient.
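A minimal sketch of the planar flow above with arbitrary (hypothetical) parameters: the Jacobian determinant is 1 + uᵀψ(z), with ψ(z) = h′(wᵀz + b) w, so it costs linear time. Invertibility additionally requires a constraint such as uᵀw ≥ −1, which is not enforced here.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
D = 2
u = rng.normal(size=D) * 0.1        # flow parameters (hypothetical; in practice one
w = rng.normal(size=D)              # constrains u^T w >= -1 for invertibility)
b = 0.3

def planar_flow(z):
    """Return f(z) = z + u * tanh(w.z + b) and log|det df/dz|."""
    a = w @ z + b
    f = z + u * np.tanh(a)
    psi = (1.0 - np.tanh(a) ** 2) * w              # h'(a) * w
    log_det = np.log(np.abs(1.0 + u @ psi))        # det = 1 + u^T psi(z)
    return f, log_det

# Push a standard-normal sample through the flow and track its log-density.
z0 = rng.normal(size=D)
log_q = norm.logpdf(z0, 0.0, 1.0).sum()
z1, log_det = planar_flow(z0)
log_q -= log_det                                    # log q1(z1) = log q0(z0) - log|det|
print(z1, log_q)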


Modelling Improvements

VAE-type algorithms on the MNIST benchmark:

VAE: ≤ 86.6    DRAW: ≤ 80.9    IAF: ≈ 79.1

[Figure: samples generated from the model on CIFAR-10 images.]
Hierarchical Approximate Posteriors

We can use 'latent variables' ω to enrich the approximate posterior distribution, just as we do for our density models:

q(z|x) = ∫ q(z|ω, x) q(ω|x) dω

„ Use a hierarchical model for the approximate posterior.
„ The variables ω are stochastic, rather than the deterministic transformations of the change-of-variables approach.
„ Both continuous and discrete latent variables can be modelled.

[Diagram: an inference path x → z0 → z1 → … → zK with auxiliary variables ω.]

[Ranganath et al., 2016]
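A minimal sketch of sampling from such a hierarchical posterior: draw ω from q(ω|x), then z from q(z|ω, x). The dimensions, the diagonal Gaussians, and the tanh mapping from ω to the mean of z are all hypothetical; the point is that the marginal q(z|x) is a continuous mixture, hence non-Gaussian even though both conditionals are Gaussian.

import numpy as np

rng = np.random.default_rng(0)
D_omega, D_z = 2, 3

# q(omega | x): a diagonal Gaussian (parameters would come from an inference network).
mu_omega = np.zeros(D_omega)
sigma_omega = np.ones(D_omega)

# q(z | omega, x): Gaussian whose mean is a (hypothetical) nonlinear function of omega.
W = rng.normal(size=(D_z, D_omega))
sigma_z = 0.5 * np.ones(D_z)

def sample_q(n):
    omega = mu_omega + sigma_omega * rng.normal(size=(n, D_omega))
    mean_z = np.tanh(omega @ W.T)                  # nonlinear dependence on omega
    z = mean_z + sigma_z * rng.normal(size=(n, D_z))
    return z

# The marginal q(z|x) is a continuous mixture of Gaussians, hence non-Gaussian.
samples = sample_q(10000)
print(samples.mean(axis=0), samples.std(axis=0))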


Auxiliary-variable Methods

Modify the model to include ω = (z_0, …, z_{K−1}).

[Diagrams: the latent variable model p(x, z), with prior p(z) and likelihood p(x|z), next to the auxiliary latent variable model p(x, z, ω), which adds the factor r(ω|x, z).]

„ Auxiliary variables leave the original model unchanged.
„ They capture the structure of correlated variables because they turn the posterior into a mixture of distributions q(z|x, ω).

[Agakov and Barber, 2004; Maaløe et al., 2016]


Auxiliary Variational Lower Bounds

Standard bound:
log p(x) ≥ L = E_q(z|x)[log p(x, z)] − E_q(z|x)[log q(z|x)]
               (expected likelihood)   (entropy)

[Diagrams: the auxiliary latent variable model p(x, z, ω), with p(z), p(x|z) and r(ω|x, z), next to the inference model q(z, ω), with q(z|x, ω) and q(ω|x).]

Auxiliary variational bound: bound the entropy for tractability.

log p(x) ≥ E_q(ω,z|x)[log p(x, z) + log r(ω|z, x)] − E_q(ω,z|x)[log q(z, ω|x)]
         = L − E_q(z|x)[ KL[ q(ω|z, x) ‖ r(ω|z, x) ] ]

[Salimans et al., 2015; Ranganath et al., 2016; Maaløe et al., 2016]
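A minimal Monte Carlo sketch of the auxiliary bound for a toy setup in which p(x, z), q(ω|x), q(z|ω, x) and the auxiliary term r(ω|z, x) are all simple Gaussians with made-up parameters: the estimator averages log p(x, z) + log r(ω|z, x) − log q(z, ω|x) over samples from the inference model.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = np.array([1.0, -0.5])                       # observed data (hypothetical)

def log_p(x, z):
    """Toy model: z ~ N(0, 1), x | z ~ N(z, 1)."""
    return norm.logpdf(z, 0, 1) + norm.logpdf(x, z, 1).sum()

def aux_bound(n=1000):
    est = 0.0
    for _ in range(n):
        # inference model q(omega, z | x) = q(omega | x) q(z | omega, x)
        omega = rng.normal(0.1, 1.0)
        z = rng.normal(0.5 * omega, 0.8)
        log_q = norm.logpdf(omega, 0.1, 1.0) + norm.logpdf(z, 0.5 * omega, 0.8)
        # auxiliary term r(omega | z, x): here an input-dependent Gaussian
        log_r = norm.logpdf(omega, 0.3 * z, 1.0)
        est += log_p(x, z) + log_r - log_q
    return est / n

print(aux_bound())    # a (stochastic) lower bound on log p(x)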


Auxiliary Variational Methods

Choose an auxiliary prior r(ω|z, x) and an auxiliary posterior q(ω|x, z).

[Diagrams: the auxiliary latent variable model p(x, z, ω) and the inference model q(z, ω), as before.]

„ Hamiltonian flow: r(ω) = N(ω|0, M)
„ Input-dependent Gaussian: r(ω|x, z)
„ Auto-regressive: r(ω|x, z) = ∏_t r(ω_t | f_θ(ω_<t, x))
„ q(ω|x, z) can be a mixture model, a normalising flow, or a Gaussian process.

VAE: ≤ 86.6    DRAW: ≤ 80.9    IAF: ≈ 79.1    DRAW-VGP: ≤ 79.8

[Tran et al., 2016]

Easy sampling; easy evaluation of the bound and its gradients.


Summary

[Diagram: families of posterior approximations arranged on a spectrum from the true posterior (most expressive) to the fully-factorised approximation (least expressive): normalising flows, auxiliary variables, structured mean-field, mixtures, and covariance models.]

True posterior (most expressive):      q*(z|x) ∝ p(x|z) p(z)
Fully-factorised (least expressive):   q_MF(z|x) = ∏_k q(z_k)
Choosing your Approximation

[Diagram: the probabilistic pipeline — knowledge & data and a question feed the model; make assumptions, discover patterns, predict & explore, criticize the model, and revise.]

[Figure: population structure inferred from the TGP data set using the TeraStructure algorithm at K = 7, 8 and 9 populations. The visualization of the θ's shows patterns consistent with the major geographical regions. Some clusters identify a specific region (e.g. red for Africa) while others represent admixture between regions (e.g. green for Europeans and Central/South Americans). Shared clusters across regions demonstrate the more continuous nature of the structure. The new cluster from K = 7 to K = 8 matches structure differentiating between American groups. For K = 9, the new cluster is unpopulated.]
Summary
Variational Inference: Foundations and Modern Methods

[Diagram: the variational family q(z; ν), the optimisation path from ν_init to the optimum ν*, and the remaining gap KL(q(z; ν*) ‖ p(z|x)) to the true posterior p(z|x).]

VI approximates difficult quantities from complex models.

With stochastic optimization we can
„ scale up VI to massive data,
„ enable VI on a wide class of difficult models,
„ enable VI with elaborate and flexible families of approximations.
Bibliography

Introductory Variational Inference


„ Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., and Saul, L. K. (1999). An introduction to
variational methods for graphical models. Machine learning, 37(2), 183-233.
„ Beal, Matthew James. Variational algorithms for approximate Bayesian inference. Diss.
University of London, 2003.
„ Wainwright, Martin J., and Michael I. Jordan. "Graphical models, exponential families, and
variational inference." Foundations and Trends in Machine Learning 1, no. 1-2 (2008): 1-305.
Bibliography

Applications of Variational Inference


„ Frey, Brendan J., and Geoffrey E. Hinton. "Variational learning in nonlinear Gaussian belief
networks." Neural Computation 11, no. 1 (1999): 193-213.
„ Eslami, S. M., Heess, N., Weber, T., Tassa, Y., Kavukcuoglu, K., and Hinton, G. E. Attend, Infer,
Repeat: Fast Scene Understanding with Generative Models. NIPS (2016).
„ Rezende, Danilo Jimenez, Shakir Mohamed, Ivo Danihelka, Karol Gregor, and Daan Wierstra.
"One-Shot Generalization in Deep Generative Models." ICML (2016).
„ Kingma, Diederik P., Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling.
"Semi-supervised learning with deep generative models." In Advances in Neural Information
Processing Systems, pp. 3581-3589. 2014.
Bibliography

Monte Carlo Gradient Estimation


„ Pierre L’Ecuyer, Note: On the interchange of derivative and expectation for likelihood ratio
derivative estimators, Management Science, 1995
„ Peter W Glynn, Likelihood ratio gradient estimation for stochastic systems, Communications
of the ACM, 1990
„ Michael C Fu, Gradient estimation, Handbooks in operations research and management
science, 2006
„ Ronald J Williams, Simple statistical gradient-following algorithms for connectionist
reinforcement learning, Machine learning, 1992
„ Paul Glasserman, Monte Carlo methods in financial engineering, 2003
„ Omiros Papaspiliopoulos, Gareth O Roberts, Martin Skold, A general framework for the
parametrization of hierarchical models, Statistical Science, 2007
„ Rajesh Ranganath, Sean Gerrish, and David M. Blei. "Black Box Variational Inference." In
AISTATS, pp. 814-822. 2014.
„ Andriy Mnih, and Karol Gregor. "Neural variational inference and learning in belief
networks." arXiv preprint arXiv:1402.0030 (2014).
Bibliography

Monte Carlo Gradient Estimation (cont.)


„ Michalis Titsias and Miguel Lázaro-Gredilla. "Doubly stochastic variational Bayes for
non-conjugate inference." (2014).
„ David Wingate and Theophane Weber. "Automated variational inference in probabilistic
programming." arXiv preprint arXiv:1301.1299 (2013).
„ John Paisley, David Blei, and Michael Jordan. "Variational Bayesian inference with stochastic
search." arXiv preprint arXiv:1206.6430 (2012).
„ Durk Kingma and Max Welling. "Auto-encoding Variational Bayes." ICLR (2014).
„ Danilo Jimenez Rezende, Shakir Mohamed, Daan Wierstra. "Stochastic Backpropagation and
Approximate Inference in Deep Generative Models." ICML (2014).
Bibliography

Amortized Inference
„ Dayan, Peter, Geoffrey E. Hinton, Radford M. Neal, and Richard S. Zemel. "The helmholtz
machine." Neural computation 7, no. 5 (1995): 889-904.
„ Gershman, Samuel J., and Noah D. Goodman. "Amortized inference in probabilistic
reasoning." In Proceedings of the 36th Annual Conference of the Cognitive Science Society.
2014.
„ Heess, Nicolas, Daniel Tarlow, and John Winn. "Learning to pass expectation propagation
messages." In Advances in Neural Information Processing Systems, pp. 3219-3227. 2013.
„ Jitkrittum, Wittawat, Arthur Gretton, Nicolas Heess, S. M. Eslami, Balaji Lakshminarayanan,
Dino Sejdinovic, and Zoltán Szabó. "Kernel-based just-in-time learning for passing
expectation propagation messages." arXiv preprint arXiv:1503.02551 (2015).
„ Korattikara, Anoop, Vivek Rathod, Kevin Murphy, and Max Welling. "Bayesian dark
knowledge." arXiv preprint arXiv:1506.04416 (2015).
Bibliography

Structured Mean Field


„ Jaakkola, T. S., and Jordan, M. I. (1998). Improving the mean field approximation via the use
of mixture distributions. In Learning in graphical models (pp. 163-173). Springer
Netherlands.
„ Saul, L.K. and Jordan, M.I., 1996. Exploiting tractable substructures in intractable networks.
Advances in neural information processing systems, pp.486-492.
„ Gregor, Karol, Ivo Danihelka, Alex Graves, Danilo Jimenez Rezende, and Daan Wierstra.
"DRAW: A recurrent neural network for image generation." ICML (2015).
„ Gershman, S., Hoffman, M. and Blei, D., 2012. Nonparametric variational inference. arXiv
preprint arXiv:1206.4665.
Bibliography

Change-of-variables and Normalising Flows


„ Tabak, E. G., and Cristina V. Turner. "A family of nonparametric density estimation
algorithms." Communications on Pure and Applied Mathematics 66, no. 2 (2013): 145-164.
„ Rezende, Danilo Jimenez, and Shakir Mohamed. "Variational inference with normalizing
flows." ICML (2015).
„ Kingma, D.P., Salimans, T. and Welling, M., 2016. Improving variational inference with
inverse autoregressive flow. arXiv preprint arXiv:1606.04934.
„ Dinh, L., Sohl-Dickstein, J. and Bengio, S., 2016. Density estimation using Real NVP. arXiv
preprint arXiv:1605.08803.
Bibliography

Auxiliary Variational Methods


„ Felix V. Agakov, and David Barber. "An auxiliary variational method." NIPS (2004).
„ Rajesh Ranganath, Dustin Tran, and David M. Blei. "Hierarchical Variational Models." ICML
(2016).
„ Lars Maaløe et al. "Auxiliary Deep Generative Models." ICML (2016).
„ Tim Salimans, Durk Kingma, Max Welling. "Markov chain Monte Carlo and variational
inference: Bridging the gap." ICML (2015).
Bibliography

Related Variational Objectives


„ Yuri Burda, Roger Grosse, Ruslan Salakhutdinov. "Importance weighted autoencoders." ICLR
(2015).
„ Yingzhen Li, Richard E. Turner. "Rényi divergence variational inference." NIPS (2016).
„ Guillaume Bouchard and Balaji Lakshminarayanan. "Approximate Inference with the Variational Holder
Bound." ArXiv (2015).
„ José Miguel Hernández-Lobato, Yingzhen Li, Daniel Hernández-Lobato, Thang Bui, and
Richard E. Turner. Black-box α-divergence Minimization. ICML (2016).
„ Rajesh Ranganath, Jaan Altosaar, Dustin Tran, David M. Blei. Operator Variational Inference.
NIPS (2016).
Bibliography

Discrete Latent Variable Models and Posterior Approximations


„ Radford Neal. "Learning stochastic feedforward networks." Tech. Rep. CRG-TR-90-7:
Department of Computer Science, University of Toronto (1990).
„ Lawrence K. Saul, Tommi Jaakkola, and Michael I. Jordan. "Mean field theory for sigmoid
belief networks." Journal of artificial intelligence research 4, no. 1 (1996): 61-76.
„ Karol Gregor, Ivo Danihelka, Andriy Mnih, Charles Blundell, and Daan Wierstra. "Deep
autoregressive networks." ICML (2014).
„ Rajesh Ranganath, Linpeng Tang, Laurent Charlin, and David M. Blei. "Deep Exponential
Families." AISTATS (2015).
„ Rajesh Ranganath, Dustin Tran, and David M. Blei. "Hierarchical Variational Models." ICML
(2016).
