a particular movie), our prediction of the rating depends on a linear combination of the user's embedding and the movie's embedding. We can also use these inferred representations to find groups of users that have similar tastes and groups of movies that are enjoyed by the same kinds of users. [Hoffman, Blei, Wang, Paisley, JMLR 2013]
[Figure: admixture proportions for worldwide human populations (Italian, Japanese, Kalash, Karitiana, ..., Yakut, Yi, Yoruba) across 7 inferred groups]
[Figure: ancestry proportions for the 1000 Genomes populations (LWK, YRI, ACB, ASW, CDX, CHB, CHS, JPT, KHV, CEU, FIN, GBR, IBS, TSI, MXL, PUR, CLM, PEL, GIH) at K = 7, 8, and 9]
Figure S2: Population structure inferred from the TGP data set using the TeraStructure algorithm at three values for the number of populations K. The visualization of the θ's in the figure shows patterns consistent with the major geographical regions. Some of the clusters identify a specific region (e.g. red for Africa) while others represent admixture between regions (e.g. green for Europeans and Central/South Americans). The presence of clusters that are shared between different regions demonstrates the more continuous nature of the structure. The new cluster from K = 7 to K = 8 matches structure differentiating between American groups. For K = 9, the new cluster is unpopulated.
Answers the question: What does this model say about this data?
Criticize model
Revise
p(z | x) = p(z, x) / p(x)
[Figure: the variational family q(z; ν), with optimization moving ν from ν_init toward the member closest to p(z | x)]
Fit the variational parameters ν to be close (in KL) to the exact posterior.
(There are alternative divergences, which connect to algorithms like EP, BP, and others.)
Example: Mixture of Gaussians
[Figure: (a) initialization and (b) iteration 20 of variational inference on a mixture of Gaussians]
Variational inference adapts ideas from statistical physics to probabilistic inference.
Arguably, it began in the late eighties with Peterson and Anderson (1987), who used mean-field methods to fit a neural network.
This idea was picked up by Jordan's lab in the early 1990s (Tommi Jaakkola, among others) and generalized to many models, such as mixtures of experts (Waterhouse et al., 1996) and HMMs (MacKay, 1997).
[Background: scanned pages from this early literature, including Figure 22 of Jordan et al.: a node S_i in a sigmoid belief network with its Markov blanket. The mean field equations yield a deterministic relationship between the variational parameters µ_i and µ_j for nodes j in the Markov blanket of node i, so the consistency equations can be interpreted as a local message-passing algorithm.]
Today
[Background: results from recent papers: samples and sampled pixel means on the NORB, CIFAR10, and Frey faces data sets, and models with two-dimensional latent spaces visualized by mapping linearly spaced coordinates through the inverse CDF of the Gaussian.]
[Graphical model: α = 1.5, σ = 1 → θ → x_n, n = 1, ..., N]
data {
  int N;     // number of observations
  int x[N];  // discrete-valued observations
}
parameters {
  // latent variable, must be positive
  real<lower=0> theta;
}
model {
  // non-conjugate prior for latent variable
  theta ~ weibull(1.5, 1);
  // likelihood
  for (n in 1:N)
    x[n] ~ poisson(theta);
}
Figure 2: Specifying a simple nonconjugate probability model in Stan.
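The same joint density is easy to write down outside Stan. Below is a minimal sketch in Python (my own illustration, not part of the tutorial), using scipy for the Weibull prior and Poisson likelihood; the function name log_joint is an assumption.

import numpy as np
from scipy import stats

def log_joint(theta, x, shape=1.5, scale=1.0):
    # log p(x, theta) = log Weibull(theta; 1.5, 1) + sum_n log Poisson(x_n; theta)
    if theta <= 0:                    # outside supp(p(theta)) = R+
        return -np.inf
    log_prior = stats.weibull_min.logpdf(theta, c=shape, scale=scale)
    log_lik = stats.poisson.logpmf(x, mu=theta).sum()
    return log_prior + log_lik

# Toy usage: simulate counts and scan the joint over a few rates.
rng = np.random.default_rng(0)
x = rng.poisson(lam=2.0, size=50)
for theta in [0.5, 1.0, 2.0, 4.0]:
    print(theta, log_joint(theta, x))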
[Background: figure captions from Kingma and Welling 2013 and Rezende et al. 2014: imputation results on MNIST digits under 60% and 80% random missingness and a missing 5x5 patch, with imputations improving over 15 iterations of a resampling procedure, and a two-dimensional latent-space embedding of MNIST in which the digit classes separate into different regions.]
There is now a flurry of new work on variational inference, making it scalable, easier to derive, faster, more accurate, and applying it to more complicated models and applications.
[Background: text from Kucukelbir et al. 2015: a Bayesian analysis posits a prior p(θ), giving the joint density p(X, θ) = p(X | θ) p(θ). Differentiable probability models have continuous latent variables θ and a gradient ∇_θ log p(X, θ) valid within the support of the prior, supp(p(θ)) = {θ | θ ∈ R^K and p(θ) > 0} ⊆ R^K. No conjugacy is assumed, either full or conditional: a Poisson likelihood with a Weibull prior on its rate (Figure 2) is nonconjugate, so the posterior is not a Weibull, which challenges classical variational inference. Many models are differentiable (linear and logistic regression, matrix factorization, linear dynamical systems, Gaussian processes); mixture models, HMMs, and topic models become differentiable once their discrete variables are marginalized out, though marginalization is intractable for models such as the Ising model, sigmoid belief networks, and untruncated Bayesian nonparametric models. Bayesian inference requires the posterior density p(θ | X); many posteriors are intractable because their normalization constants lack closed forms, so we seek to approximate the posterior.]
Modern VI touches many important areas: probabilistic programming, reinforcement learning, neural networks, convex optimization, Bayesian statistics, and myriad applications.
Our goal today is to teach you the basics, explain some of the newer ideas, and to suggest open areas of new research.
Variational Inference:
Foundations and Modern Methods
Part II: Mean-field VI and stochastic VI
Jordan+, Introduction to Variational Methods for Graphical Models, 1999
Ghahramani and Beal, Propagation Algorithms for Variational Bayesian Learning, 2001
Hoffman+, Stochastic Variational Inference, 2013
[Example topics with per-word probabilities: "life 0.02, evolve 0.01, organism 0.01, ..."; "brain 0.04, neuron 0.02, nerve 0.01, ..."; "data 0.02, number 0.02, computer 0.01, ..."]
LDA's graphical model: α (proportions parameter) → θ_d (per-document topic proportions) → z_{d,n} (per-word topic assignment) → w_{d,n} (observed word) ← β_k (topics) ← η (topic parameter), with plates over N words, D documents, and K topics.
p(β, θ, z | w) = p(β, θ, z, w) / ∫_β ∫_θ Σ_z p(β, θ, z, w)
Mean-field VI and Stochastic VI
Road map:
Global variables β; local variables z_i, x_i:
p(β, z, x) = p(β) ∏_{i=1}^n p(z_i, x_i | β)
The global parameter comes from conjugacy [Bernardo and Smith, 1994]:
η_g(z, x) = α + Σ_{i=1}^n t(z_i, x_i)
[Figure: the model p(β, z, x) and the mean-field family q(β; λ) ∏_i q(z_i; φ_i), linked by the ELBO]
[Graphical model: LDA (α → θ_d → z_{d,n} → w_{d,n} ← β_k ← η) with variational parameters γ_d, φ_{d,n}, λ_k; plates N, D, K]
[Figure: inferred topic proportions for one document: probability (0.0 to 0.4) against topic index (1 to 100)]
Mean-field Variational Inference for LDA
repeat
  for each data point i do
    Set local parameter φ_i ← E_λ[η_ℓ(β, x_i)].
  end
  Set global parameter λ ← α + Σ_{i=1}^n E_{φ_i}[t(Z_i, x_i)].
until the ELBO has converged
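To make the loop concrete, here is a sketch of this coordinate ascent for a toy conditionally conjugate model of my own choosing (not the tutorial's code, and not LDA): a two-component mixture of known unit-variance Gaussians with a Beta prior on the mixing weight, so the global parameter λ is a pair of Beta pseudo-counts and the local step is exact.

import numpy as np
from scipy.special import digamma
from scipy.stats import norm

# Toy model: pi ~ Beta(a, a); z_i in {0, 1}; x_i | z_i = k ~ Normal(mu[k], 1).
# Global variable: pi. Local variables: the z_i.
rng = np.random.default_rng(1)
mu = np.array([-2.0, 2.0])            # known component means
a = 1.0                               # Beta prior pseudo-counts
n = 500
z_true = rng.integers(0, 2, size=n)
x = rng.normal(mu[z_true], 1.0)

lam = np.array([1.0, 1.0])            # global variational parameter (Beta)
for it in range(100):
    # Local step: phi_i(k) proportional to exp(E_lam[log pi_k]) * N(x_i; mu_k, 1)
    e_log_pi = digamma(lam) - digamma(lam.sum())
    log_phi = e_log_pi + norm.logpdf(x[:, None], loc=mu, scale=1.0)
    phi = np.exp(log_phi - log_phi.max(axis=1, keepdims=True))
    phi /= phi.sum(axis=1, keepdims=True)
    # Global step: aggregate sufficient statistics from every data point.
    lam = a + phi.sum(axis=0)

print("posterior Beta parameters:", lam)   # roughly (a + n/2, a + n/2)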
Classical VI is inefficient:
Do some local computation for each data point.
Aggregate these computations to re-estimate global structure.
Repeat.
This cannot handle massive data.
Stochastic variational inference (SVI) scales VI to massive data.
Stochastic Variational Inference
[Figure: the TGP population-structure analysis (Figure S2, K = 7, 8, 9) shown again, overlaid with the words MASSIVE DATA]
Replace the gradient with cheaper noisy estimates [Robbins and Monro, 1951]:
ν_{t+1} = ν_t + ρ_t ∇̂_ν L(ν_t)
This requires unbiased gradients, E[∇̂_ν L(ν)] = ∇_ν L(ν).
Sample j ∼ Uniform(1, …, n); the noisy natural gradient is
∇̂_λ^nat L(λ) = α + n E_{φ_j*}[t(Z_j, x_j)] − λ.
repeat
  Sample j ∼ Unif(1, …, n).
  Set local parameter φ ← E_λ[η_ℓ(β, x_j)].
  Set intermediate global parameter λ̂ = α + n E_φ[t(Z_j, x_j)].
  Set global parameter λ = (1 − ρ_t) λ + ρ_t λ̂.
until forever
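Here is a runnable sketch of this loop for the same toy Beta-weighted mixture used in the coordinate-ascent sketch above (again my own illustration). It touches one data point per iteration and uses Robbins-Monro step sizes ρ_t = (t + τ)^(−κ).

import numpy as np
from scipy.special import digamma
from scipy.stats import norm

# Same toy model as before: Beta prior on the mixing weight of two
# known unit-variance Gaussians; lam is the global Beta parameter.
rng = np.random.default_rng(2)
mu = np.array([-2.0, 2.0])
a, n = 1.0, 5000
x = rng.normal(mu[rng.integers(0, 2, size=n)], 1.0)

lam = np.array([1.0, 1.0])
tau, kappa = 1.0, 0.7                 # step-size schedule hyperparameters
for t in range(1, 20001):
    j = rng.integers(n)               # Sample j ~ Unif(1, ..., n)
    # Local step for the sampled point only.
    e_log_pi = digamma(lam) - digamma(lam.sum())
    log_phi = e_log_pi + norm.logpdf(x[j], loc=mu, scale=1.0)
    phi = np.exp(log_phi - log_phi.max())
    phi /= phi.sum()
    # Intermediate global parameter: pretend the data set is n copies of x_j.
    lam_hat = a + n * phi
    # Robbins-Monro averaging of current and intermediate parameters.
    rho = (t + tau) ** (-kappa)
    lam = (1.0 - rho) * lam + rho * lam_hat

print("SVI estimate of Beta parameters:", lam)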
Stochastic Variational Inference in LDA
  Sample a document.
  Estimate the local variational parameters using the current topics.
  Form intermediate topics from those local parameters.
  Update topics as a weighted average of intermediate and current topics.
[Figure: held-out perplexity (600 to 900) for Online 98K, Batch 98K, and Online 3.3M: online inference converges faster, and online inference on 3.3M documents reaches the lowest perplexity]
[Figure 5: topics found with the HDP in a corpus of 1.8 million articles from the New York Times. Modified from Hoffman et al. (2013).]
SVI scales many models
[Figure: admixture proportions inferred for the HGDP populations (Adygei, Balochi, BantuKenya, BantuSouthAfrica, Basque, Bedouin, ..., Yakut, Yi, Yoruba) across 7 groups]
PART III
The Variational Inference Recipe
Start with a model p(z, x) and a variational family q(z; ν).
Take derivatives. Example: ∇_ν L(ν) = 2xν + 1/ν.
Optimize: ν_{t+1} = ν_t + ρ_t ∇_ν L.
Every step involves gradients of integrals against q, of the form ∇_ν ∫ (···) q(z; ν) dz, with integrands built from p(x, z) and q(z; ν).
Example: Bayesian Logistic Regression
Data pairs (y_i, x_i):
  x_i are covariates
  y_i are labels
  z is the regression coefficient
Generative process:
  z ∼ N(0, 1)
  y_i | x_i, z ∼ Bernoulli(σ(z x_i))
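A small sketch of this generative process and its log-joint in Python (my own illustration; the helper names are assumptions):

import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def log_joint(z, x, y):
    # log p(z) + sum_i log p(y_i | x_i, z) for a scalar coefficient z
    log_prior = -0.5 * z**2 - 0.5 * np.log(2 * np.pi)   # N(0, 1)
    logits = z * x
    # Bernoulli log-likelihood, written stably: y*logit - log(1 + e^logit)
    log_lik = np.sum(y * logits - np.logaddexp(0.0, logits))
    return log_prior + log_lik

rng = np.random.default_rng(3)
z_true = 1.5
x = rng.normal(size=100)                      # covariates
y = rng.binomial(1, sigmoid(z_true * x))      # labels
print(log_joint(1.5, x, y), log_joint(-1.5, x, y))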
VI for Bayesian Logistic Regression
Assume:
  We have one data point (y, x).
  x is a scalar.
  The approximating family q is normal, with ν = (µ, σ²).
The ELBO is
L(µ, σ²)
  = E_q[log p(z) − log q(z) + log p(y | x, z)]
  = −(1/2)(µ² + σ²) + (1/2) log σ² + E_q[log p(y | x, z)] + C
  = −(1/2)(µ² + σ²) + (1/2) log σ² + E_q[yxz − log(1 + exp(xz))]
  = −(1/2)(µ² + σ²) + (1/2) log σ² + yxµ − E_q[log(1 + exp(xz))]
We are stuck.
1. We cannot analytically take that expectation.
2. The expectation hides the objective's dependence on the variational parameters, which makes it hard to optimize directly.
Options?
BLACK BOX VARIATIONAL INFERENCE
Any model p(β, z | x); massive data; reusable variational families.
- Sample from q(·)
- Form noisy gradients without model-specific computation
The New VI Recipe
Compute ∇_ν ∫ (···) q(z; ν) dz directly, where the integrand is built from p(x, z) and q(z; ν).
Define g(z, ν) so that L(ν) = ∫ q(z; ν) g(z, ν) dz. What is ∇_ν L?
∇_ν L = ∇_ν ∫ q(z; ν) g(z, ν) dz
      = ∫ [∇_ν q(z; ν)] g(z, ν) + q(z; ν) ∇_ν g(z, ν) dz
      = ∫ q(z; ν) [∇_ν log q(z; ν)] g(z, ν) + q(z; ν) ∇_ν g(z, ν) dz
using ∇_ν log q = ∇_ν q / q.
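A quick numerical check of this log-derivative identity (my own illustration): for q = N(µ, 1) and f(z) = z², E[f] = µ² + 1, so ∇_µ E[f] = 2µ exactly; the score-function form E[f(z) ∇_µ log q(z)] should match.

import numpy as np

rng = np.random.default_rng(4)
mu, S = 0.7, 2_000_000
z = rng.normal(mu, 1.0, size=S)

# Score of q = N(mu, 1) with respect to mu: d/dmu log q(z) = z - mu.
score_estimate = np.mean(z**2 * (z - mu))
print(score_estimate, "vs exact", 2 * mu)   # both approximately 1.4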
Roadmap
Pathwise Gradients
Amortized Inference
Score Function Gradients of the ELBO
Score Function Estimator
Recall the ELBO and simplify its gradient:
∇_ν L = E_{q(z;ν)}[∇_ν log q(z; ν) (log p(x, z) − log q(z; ν))]
Monte Carlo estimate:
(1/S) Σ_{s=1}^S ∇_ν log q(z_s; ν) (log p(x, z_s) − log q(z_s; ν)), where z_s ∼ q(z; ν)
Basic BBVI
Step ν along the noisy gradient
(1/S) Σ_{s=1}^S ∇_ν log q(z_s; ν) (log p(x, z_s) − log q(z_s; ν)), where z_s ∼ q(z; ν)
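As a sketch of basic BBVI in practice (my own illustration, not the tutorial's code), here is the score-function estimator applied to the Bayesian logistic regression example from earlier, with q = Normal(µ, σ²) and σ parameterized on the log scale:

import numpy as np

rng = np.random.default_rng(5)

# Bayesian logistic regression data, as in the earlier example.
z_true = 1.5
x = rng.normal(size=100)
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-z_true * x)))

def log_p(z):
    # log p(x, z) = log N(z; 0, 1) + sum_i log Bernoulli(y_i; sigmoid(z x_i))
    logits = z * x
    return -0.5 * z**2 + np.sum(y * logits - np.logaddexp(0.0, logits))

mu, log_sigma = 0.0, 0.0
S, rho = 64, 0.01
for t in range(2000):
    sigma = np.exp(log_sigma)
    zs = rng.normal(mu, sigma, size=S)
    log_q = -0.5 * np.log(2 * np.pi) - log_sigma - 0.5 * ((zs - mu) / sigma) ** 2
    f = np.array([log_p(z) for z in zs]) - log_q
    # Score of the Gaussian with respect to (mu, log sigma).
    score_mu = (zs - mu) / sigma**2
    score_ls = ((zs - mu) / sigma) ** 2 - 1.0
    # Noisy gradient step; the variance is high, which is exactly what
    # motivates the control variates discussed next.
    mu += rho * np.mean(score_mu * f)
    log_sigma += rho * np.mean(score_ls * f)

print("q(z) is approximately Normal(%.2f, %.2f^2)" % (mu, np.exp(log_sigma)))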
The estimator's variance is
Var_{q(z;ν)} = E_{q(z;ν)}[(∇_ν log q(z; ν) (log p(x, z) − log q(z; ν)) − ∇_ν L)²].
[Figure: the density q against |∇_µ log q|: the score is largest in the tails, where q is small]
Intuition:
Sampling rare values can lead to large scores and thus high variance
Solution: Control Variates
Replace f with f̂, where E[f̂(z)] = E[f(z)]. A general such class:
f̂(z) = f(z) − a (h(z) − E[h(z)]), for any h with a known expectation.
[Figure: distributions of f = x + x² and of f̂ built with h = x²: the same mean with much less spread]
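The slide's own example can be run directly (a minimal sketch of my own): under q = N(0, 1), take f = x + x² and h = x² with known E[h] = 1. The scaling a* = Cov(f, h)/Var(h) is the standard variance-minimizing choice.

import numpy as np

rng = np.random.default_rng(6)
z = rng.normal(size=1_000_000)        # q = N(0, 1)

f = z + z**2                          # original integrand
h = z**2                              # control variate with known E[h] = 1
a = np.cov(f, h)[0, 1] / np.var(h)    # a* = Cov(f, h) / Var(h), about 1 here
f_hat = f - a * (h - 1.0)             # same mean, lower variance

print("means:", f.mean(), f_hat.mean())       # both about E[f] = 1
print("variances:", f.var(), f_hat.var())     # about 3 versus about 1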
Pathwise Gradient Estimator
Assume:
1. z = t(ε, ν) for ε ∼ s(ε) implies z ∼ q(z; ν).
   Example: ε ∼ Normal(0, 1) and z = εσ + µ give z ∼ Normal(µ, σ²).
2. log p(x, z) and log q(z; ν) are differentiable with respect to z.
To differentiate, push the gradient inside the base distribution:
∇_ν L = E_{s(ε)}[∇_z(log p(x, z) − log q(z; ν)) ∇_ν t(ε, ν)]
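Here is the pathwise estimator on the same Bayesian logistic regression example (again my own sketch, with analytic gradients in place of autodiff). Note how few samples it needs per step compared with the score-function version.

import numpy as np

rng = np.random.default_rng(7)
z_true = 1.5
x = rng.normal(size=100)
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-z_true * x)))

def grad_z_log_p(z):
    # d/dz [log N(z; 0, 1) + sum_i log Bernoulli(y_i; sigmoid(z x_i))]
    return -z + np.sum(x * (y - 1.0 / (1.0 + np.exp(-z * x))))

mu, log_sigma = 0.0, 0.0
S, rho = 8, 0.05
for t in range(2000):
    sigma = np.exp(log_sigma)
    eps = rng.normal(size=S)
    zs = mu + sigma * eps                        # z = t(eps, nu)
    # g(z) = d/dz [log p(x, z) - log q(z; nu)]; the second term is
    # -(d/dz log q) = (z - mu) / sigma^2.
    g = np.array([grad_z_log_p(z) for z in zs]) + (zs - mu) / sigma**2
    # Chain rule through t: dz/dmu = 1; dz/d(log sigma) = eps * sigma.
    mu += rho * np.mean(g)
    log_sigma += rho * np.mean(g * eps) * sigma

print("q(z) is approximately Normal(%.2f, %.2f^2)" % (mu, np.exp(log_sigma)))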
[Figure: gradient variance against the number of samples (1 to 1000, log-log) for the pathwise, score function, and score function with control variate estimators on a multivariate nonlinear regression model. [Kucukelbir+ 2016]]
Score Function:
  Differentiates the density: ∇_ν q(z; ν)
  Works for discrete and continuous models
  Works for a large class of variational approximations
  Variance can be a big problem
Pathwise:
  Differentiates the function: ∇_z[log p(x, z) − log q(z; ν)]
  Requires differentiable models
  Requires the variational approximation to have the form z = t(ε, ν)
  Generally better-behaved variance
Amortized Inference
Hierarchical Models
Global variables β; local variables z_i, x_i:
p(β, z, x) = p(β) ∏_{i=1}^n p(z_i, x_i | β)
[Figure: the model and the mean-field variational family, linked by the ELBO]
SVI: Revisited
repeat
  Sample j ∼ Unif(1, …, n).
  Set local parameter φ ← E_λ[η_ℓ(β, x_j)].
until forever
SVI: The problem
Input: data x, model p(β, z, x).
Initialize λ randomly. Set ρ_t appropriately.
repeat
  Sample j ∼ Unif(1, …, n).
  Set local parameter φ ← E_λ[η_ℓ(β, x_j)].
  Set intermediate global parameter λ̂ = α + n E_φ[t(Z_j, x_j)].
until forever
For non-conjugate models, these expectations are no longer tractable, and an inner stochastic optimization is needed for each data point.
ELBO:
L(λ, φ_{1…n}) = E_q[log p(β, z, x)] − E_q[log q(β; λ) + Σ_{i=1}^n log q(z_i; φ_i)]
repeat
  Sample β ∼ q(β; λ).
  Sample j ∼ Unif(1, …, n).
  Sample z_j ∼ q(z_j | x_j; φ_θ(x_j)).
  Compute stochastic gradients and update
    λ = λ + ρ_t ∇̂_λ L
    θ = θ + ρ_t ∇̂_θ L.
until forever
A computational-statistical tradeoff:
∏_{i=1}^n q(z_i; ν_i)   versus   ∏_{i=1}^n q(z_i | x_i; f_θ(x_i))
Free per-data-point parameters ν_i are flexible but costly; an inference network f_θ shares its parameters across all data points.
Example: Variational Autoencoder (VAE)
Model:
  z ∼ p(z) = Normal(0, 1)
  x ∼ p(x | z) = Normal(µ(z), σ²(z))
Inference network:
  q(z | x) = Normal(f_θ^µ(x), f_θ^σ²(x))
[Diagram: data x → q(z | x) → z ∼ q(z | x) → p(x | z) → x ∼ p(x | z)]
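A minimal sketch of the amortized ELBO computation (my own illustration; linear maps stand in for the encoder and decoder networks, and the weights are left untrained):

import numpy as np

rng = np.random.default_rng(8)
D, K, n = 5, 2, 1000
x = rng.normal(size=(n, D))

# Decoder (model): p(z) = N(0, I), p(x|z) = N(W z + b, I).
W, b = rng.normal(size=(D, K)), np.zeros(D)
# Encoder (inference network): q(z|x) = N(A x + c, diag(exp(2 s))).
A, c, s = rng.normal(size=(K, D)) * 0.1, np.zeros(K), np.zeros(K)

def elbo_estimate(xi):
    # Single-sample reparameterized ELBO estimate for one data point.
    mu, sigma = A @ xi + c, np.exp(s)
    z = mu + sigma * rng.normal(size=K)          # reparameterization
    log_pz = -0.5 * np.sum(z**2 + np.log(2 * np.pi))
    log_px = -0.5 * np.sum((xi - (W @ z + b))**2 + np.log(2 * np.pi))
    log_qz = -0.5 * np.sum(((z - mu) / sigma)**2 + np.log(2 * np.pi)) - np.sum(s)
    return log_pz + log_px - log_qz

# Amortization: one set of encoder weights serves every data point.
print(np.mean([elbo_estimate(xi) for xi in x[:100]]))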
[Figure: analogy-making with the learned latent representation]
Rules of Thumb for a New Model
General Advice:
  Use coordinate-specific learning rates (e.g. RMSProp, AdaGrad)
  Annealing + Tempering
  Consider parallelizing across samples from q
Software
Differentiation Tools:
  Theano, Torch, TensorFlow, Stan Math, Caffe
  These can lead to more scalable implementations of individual models.
PART IV
Probabilistic modelling and variational inference.
Scalable inference through stochastic optimisation.
Black-box variational inference: non-conjugate models, Monte Carlo gradient estimators, and amortised inference.
Fully-factorised
q_MF(z | x) = ∏_k q(z_k)
[Figure: optimization moves ν from ν_init to the ν* minimizing KL(q(z; ν*) || p(z | x)); the fully-factorised family treats z1, z2, z3 as independent]
k
Deep Latent
Gaussian Model
Latent variable
model p(x,z)
p(z)
x
p(x|z)
Deep Latent
Gaussian Model
Latent variable
model p(x,z)
p(z)
x
p(x|z)
[Figure: dependency structures among z1, z2, z3 under the true posterior and under factorised approximations]
Gaussian Approximate Posteriors
Use a correlated Gaussian:
Σ = diag(α₁, …, α_K) + UUᵀ = diag(α₁, …, α_K) + Σ_j u_j u_jᵀ
Mean-field (diagonal) at one extreme; rank-1, …, rank-J in between; full covariance at the other.
[Figure: performance comparison of Rank-1, Diag, Wake-Sleep, and FA approximations]
Negative log-likelihood bounds: VAE ≤ 86.6, DRAW ≤ 80.9.
Mixture model:
  q_mm(z; ν) = Σ_r ρ_r q_r(z | ν_r)
Linking functions (a copula C(z)):
  q_lm(z; ν) = (∏_k q_k(z_k | ν_k)) C(z; ν_{k+1})
[Diagram: z1, z2, z3 coupled through a mixture indicator y, or through the linking function C(z)]
Designing Richer Posteriors
Introduce new variables ω that help to form a richer approximate posterior distribution:
  q(z; ν) = ∫ q(z, ω; ν) dω
Maintain computational efficiency: linear in the number of latent variables.
[Diagram: a chain of latent variables z, …, z_K generating x]
Normalising Flows
Transform a sample through an invertible function f; the density follows the change-of-variables formula:
q(z′) = q(z) |det ∂f/∂z|⁻¹
[Figure: planar flows of length K = 1, 2, 10 applied to a unit Gaussian and to a uniform base density q0]
Choice of Transformation Function
L = E_{q0(z0)}[log p(x, z_K)] − E_{q0(z0)}[log q0(z0)] − E_{q0(z0)}[Σ_{k=1}^K log |det ∂f_k/∂z_{k−1}|]
Planar flow: z_k = z_{k−1} + u h(wᵀ z_{k−1} + b)
Coupling layers: y_{1:d} = z_{k−1,1:d}; y_{d+1:D} = t(z_{k−1,1:d}) + z_{k−1,d+1:D} ⊙ exp(s(z_{k−1,1:d}))
Autoregressive: z_k = (z_{k−1} − µ_k(z_{<k}, x)) / σ_k(z_{<k}, x)
[Rezende and Mohamed, 2016; Dinh et al., 2016; Kingma et al., 2016]
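A sketch of a single planar transformation and its change-of-variables update (my own illustration). The Jacobian determinant follows from the matrix determinant lemma; the paper constrains uᵀw ≥ −1 so the map stays invertible, which this sketch does not enforce.

import numpy as np

rng = np.random.default_rng(9)

def planar_flow(z, u, w, b):
    # f(z) = z + u * tanh(w.z + b), applied to each row of z, with
    # log |det df/dz| = log |1 + (u.w) * (1 - tanh^2(w.z + b))|.
    a = np.tanh(z @ w + b)
    z_new = z + np.outer(a, u)
    psi = (1.0 - a**2)[:, None] * w          # derivative of tanh times w
    log_det = np.log(np.abs(1.0 + psi @ u))
    return z_new, log_det

# Start from a unit Gaussian base density q0 and apply one flow step.
z0 = rng.normal(size=(5, 2))
log_q = -0.5 * np.sum(z0**2 + np.log(2 * np.pi), axis=1)

u, w, b = np.array([1.0, 0.5]), np.array([0.8, -0.6]), 0.1
zK, log_det = planar_flow(z0, u, w, b)
log_q = log_q - log_det                      # change-of-variables update
print(zK, log_q)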
Auxiliary-variable Methods
[Diagram: the latent variable model p(x | z), extended with an auxiliary variable ω and conditional r(ω | x, z)]
Auxiliary variables leave the original model unchanged. They capture structure between correlated variables because they turn the posterior into a mixture of distributions q(z | x, ω).
Auxiliary Variational Lower Bounds
Standard bound: log p(x) ≥ L = E_{q(z|x)}[log p(x, z)] − E_{q(z|x)}[log q(z|x)], i.e. expected likelihood plus entropy.
With auxiliary variables:
log p(x) ≥ E_{q(ω,z|x)}[log p(x, z) + log r(ω|z, x)] − E_{q(ω,z|x)}[log q(z, ω|x)]
        = L − E_{q(z|x)}[KL(q(ω|z, x) ‖ r(ω|z, x))]
Auxiliary Variational Methods
Choose an auxiliary prior r(ω|z, x) and auxiliary posterior q(ω|x, z).
Auxiliary latent variable model p(x, z, ω): prior p(z), likelihood p(x|z), and r(ω|x, z), for example:
  Hamiltonian flow: r(ω) = N(ω | 0, M)
  Input-dependent Gaussian: r(ω|x, z)
  Auto-regressive: r(ω|x, z) = ∏_t r(ω_t | f_θ(ω_{<t}, x))
q(ω|x, z) can be a mixture model, normalising flow, or Gaussian process.
Inference model q(z, ω): built from q(z|x, ω) and q(ω|x).
[Summary figure: negative log-likelihood bounds improve as the posterior family gets richer: ≤ 86.6, ≤ 80.9, ≃ 79.1, ≤ 79.8]
Criticize model
Revise
Summary
Variational Inference:
Foundations and Modern Methods
Amortized Inference
Dayan, Peter, Geoffrey E. Hinton, Radford M. Neal, and Richard S. Zemel. "The Helmholtz machine." Neural Computation 7, no. 5 (1995): 889-904.
Gershman, Samuel J., and Noah D. Goodman. "Amortized inference in probabilistic reasoning." In Proceedings of the 36th Annual Conference of the Cognitive Science Society. 2014.
Heess, Nicolas, Daniel Tarlow, and John Winn. "Learning to pass expectation propagation messages." In Advances in Neural Information Processing Systems, pp. 3219-3227. 2013.
Jitkrittum, Wittawat, Arthur Gretton, Nicolas Heess, S. M. Eslami, Balaji Lakshminarayanan, Dino Sejdinovic, and Zoltán Szabó. "Kernel-based just-in-time learning for passing expectation propagation messages." arXiv preprint arXiv:1503.02551 (2015).
Korattikara, Anoop, Vivek Rathod, Kevin Murphy, and Max Welling. "Bayesian dark knowledge." arXiv preprint arXiv:1506.04416 (2015).
Bibliography