Dirichlet Process
Mo Chen
Outline
• Preliminary
– Bayesian Inference
– Exponential Family
– Directed Graphical Model
– Gibbs Sampling
– Finite Mixture Model
• Dirichlet Process
– Dirichlet Distribution
– Dirichlet Process
– Infinite Mixture Model
• Representation of DP
– Chinese Restaurant Process
– Stick Breaking Construction
Glossary
• Observation: X = {x_n}_{n=1}^N
• Latent variables: z, Z
• Parameter: θ
Glossary
• Parameterized distribution: x_n ∼ p(x|θ)
– with density function p(x|θ)
• Likelihood: p(X|θ) = ∏_{n=1}^N p(x_n|θ)
• Prior: p(θ)
• Posterior: p(θ|X)
Inference: Frequentist
• (1) Fitting (training)
– ML: θ̂ = argmax_θ p(X|θ)
– MAP: θ̂ = argmax_θ p(θ|X)
Gaussian: Frequentist (ML)
• (1) Fitting (training): θ̂ = argmax_θ p(X|θ)

max_{µ,Σ} Σ_{n=1}^N ln N(x_n|µ, Σ)

µ̂ = (1/N) Σ_{n=1}^N x_n
Σ̂ = (1/N) Σ_{n=1}^N (x_n − µ̂)(x_n − µ̂)^T
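A minimal NumPy sketch of these two estimators (the function name `gaussian_ml` is just for illustration; `X` is assumed to be an N×d array of observations):

```python
import numpy as np

def gaussian_ml(X):
    """ML estimates for a multivariate Gaussian:
    the sample mean and the (1/N) sample covariance."""
    N = X.shape[0]
    mu = X.mean(axis=0)                # mu-hat = (1/N) sum_n x_n
    diff = X - mu
    Sigma = diff.T @ diff / N          # Sigma-hat = (1/N) sum_n (x_n-mu)(x_n-mu)^T
    return mu, Sigma
```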
Gaussian: MAP Fitting
• Prior: p(µ, Λ)
• Posterior: p(µ, Λ|X) ∝ p(X|µ, Λ)p(µ, Λ)
• MAP: max_{µ,Λ} [ln p(X|µ, Λ) + ln p(µ, Λ)]
Inference: Bayesian
• Prior: p(θ)
• Posterior: p(θ|X) = p(X|θ)p(θ) / p(X)
» where p(X) = ∫ p(X|θ)p(θ) dθ
• Inference: p(x|X) = ∫ p(x|θ)p(θ|X) dθ
Gaussian: Bayesian
• Prior (conjugate)
p(µ, Λ) = N(µ|m°, (κ°Λ)^{-1}) W(Λ|(T°)^{-1}, ν°)
• Posterior
p(µ, Λ|X) = N(µ|m, (κΛ)^{-1}) W(Λ|T^{-1}, ν)
» where, with x̄ the sample mean and S = (1/N) Σ_{n=1}^N (x_n − x̄)(x_n − x̄)^T,
κ = κ° + N
ν = ν° + N
m = (κ°m° + N x̄) / (κ° + N)
T = T° + N S + κ°N (x̄ − m°)(x̄ − m°)^T / (κ° + N)
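The four update equations can be checked numerically. This sketch assumes `X` is an N×d NumPy array, takes S as the (1/N) sample covariance, and uses illustrative names for the prior hyperparameters:

```python
import numpy as np

def normal_wishart_posterior(X, m0, kappa0, nu0, T0):
    """Normal-Wishart posterior updates:
    kappa = kappa0 + N, nu = nu0 + N,
    m = (kappa0*m0 + N*xbar) / (kappa0 + N),
    T = T0 + N*S + kappa0*N/(kappa0+N) * (xbar-m0)(xbar-m0)^T."""
    N = X.shape[0]
    xbar = X.mean(axis=0)
    diff = X - xbar
    S = diff.T @ diff / N              # (1/N) sample covariance
    kappa = kappa0 + N
    nu = nu0 + N
    m = (kappa0 * m0 + N * xbar) / kappa
    d = xbar - m0
    T = T0 + N * S + (kappa0 * N / kappa) * np.outer(d, d)
    return m, kappa, nu, T
```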
Gaussian: Bayesian
• Inference
p(x|X) = ∫∫ p(x|µ, Λ)p(µ, Λ|X) dµ dΛ
= ∫ [∫ N(x|µ, Λ^{-1}) N(µ|m, (κΛ)^{-1}) dµ] W(Λ|T^{-1}, ν) dΛ
= T(x|m, (κ + 1)T / (κ(ν − d + 1)), ν − d + 1)
Student’s t-Distribution
Student’s t-Distribution
Robustness to outliers: Gaussian vs. t-distribution.
Bayesian Inference
• Methodology: Integral
• Intractable: Approximation
– Monte Carlo (Gibbs Sampling)
– Taylor Expansion (Laplace approximation)
– Variational methods (VB, EP)
The Exponential Family
• Density
p(x|θ) = h(x) g(θ) exp(θ^T φ(x))
• Conjugate prior
p(θ|χ°, ν°) = f(χ°, ν°) g(θ)^{ν°} exp(ν° θ^T χ°)
• Posterior
p(θ|X, χ°, ν°) = p(θ|χ, ν) = f(χ, ν) g(θ)^ν exp(ν θ^T χ)
» where
ν = ν° + N
χ = (ν°χ° + Σ_{n=1}^N φ(x_n)) / (ν° + N)
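A tiny numeric sketch of this conjugate update, assuming the sufficient statistics φ(x_n) are already computed and stacked as rows of an array (the function name is illustrative):

```python
import numpy as np

def conjugate_update(phi_X, chi0, nu0):
    """Exponential-family conjugate update:
    nu = nu0 + N, chi = (nu0*chi0 + sum_n phi(x_n)) / (nu0 + N),
    i.e. chi is a convex combination of the prior pseudo-statistic
    chi0 and the average sufficient statistic of the data."""
    phi_X = np.atleast_2d(phi_X)
    N = phi_X.shape[0]
    nu = nu0 + N
    chi = (nu0 * chi0 + phi_X.sum(axis=0)) / nu
    return chi, nu
```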
The Exponential Family
• Marginalization
p(X) = ∫ p(X|θ) p(θ|χ°, ν°) dθ
= (f(χ°, ν°) / f(χ, ν)) ∏_{n=1}^N h(x_n)
Bayesian Networks
Directed Acyclic Graph (DAG)
Bayesian Networks
p(x_1, …, x_7) = p(x_1)p(x_2)p(x_3)p(x_4|x_1, x_2, x_3)p(x_5|x_1, x_3)p(x_6|x_4)p(x_7|x_4, x_5)

General Factorization
p(x) = ∏_{i=1}^m p(x_i|π(x_i))
» where π(x_i) denotes the parents of x_i in the DAG
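The general factorization can be evaluated directly as a product of local conditionals. A sketch using a hypothetical `cpds` representation that maps each variable to its parent list and conditional probability function:

```python
# Joint probability of a Bayesian network, p(x) = prod_i p(x_i | pi(x_i)).
# `cpds` maps each variable name to (parent list, conditional function);
# this dict-of-lambdas representation is chosen just for this sketch.
def joint_prob(cpds, assignment):
    p = 1.0
    for var, (parents, cond) in cpds.items():
        p *= cond(assignment[var], *(assignment[pa] for pa in parents))
    return p

# Example: a two-node network x1 -> x2 with binary variables.
example_cpds = {
    "x1": ([], lambda v: 0.6 if v else 0.4),
    "x2": (["x1"], lambda v, pa: (0.9 if v else 0.1) if pa else (0.2 if v else 0.8)),
}
```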
Gibbs Sampling
• Sequentially sample each variable from its full conditional given the current values of all the others (one variable at a time): z_i ∼ p(z_i|z_{−i})
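A classic illustration: for a zero-mean bivariate Gaussian with unit variances and correlation ρ, each full conditional is x_i | x_j ∼ N(ρ x_j, 1 − ρ²), so the sampler alternates between the two coordinates. A minimal sketch:

```python
import numpy as np

def gibbs_bivariate_gaussian(rho, n_samples, seed=0):
    """Gibbs sampler for a zero-mean bivariate Gaussian with
    correlation rho: sample each coordinate from its full
    conditional, one at a time."""
    rng = np.random.default_rng(seed)
    x1, x2 = 0.0, 0.0
    sd = np.sqrt(1 - rho**2)
    samples = np.empty((n_samples, 2))
    for t in range(n_samples):
        x1 = rng.normal(rho * x2, sd)   # sample x1 | x2
        x2 = rng.normal(rho * x1, sd)   # sample x2 | x1
        samples[t] = x1, x2
    return samples
```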
Mixtures of Gaussians
Old Faithful data set
Mixtures of Gaussians
Combine simple models into a complex model:
p(x) = Σ_{k=1}^K π_k N(x|µ_k, Σ_k)
with Gaussian components N(x|µ_k, Σ_k) and mixing coefficients π_k (here K = 3).
Mixture of Gaussians

p(z|π) = ∏_{k=1}^K π_k^{z_k}
p(z_k = 1) = π_k

p(x|z) = ∏_{k=1}^K N(x|µ_k, Σ_k)^{z_k}

p(x) = Σ_z p(x, z) = Σ_z p(x|z)p(z) = Σ_{k=1}^K π_k N(x|µ_k, Σ_k)

Likelihood for the complete data:
p(X, Z|µ, Σ, π) = ∏_{n=1}^N ∏_{k=1}^K π_k^{z_{nk}} N(x_n|µ_k, Σ_k)^{z_{nk}}
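The mixture density corresponds to the generative process "draw a component z from π, then draw x from that component". A 1-D sketch with illustrative parameters:

```python
import numpy as np

def sample_mog(pis, mus, sigmas, n, seed=0):
    """Sample n points from a 1-D mixture of Gaussians:
    z_n ~ Categorical(pi), then x_n ~ N(mu_{z_n}, sigma_{z_n}^2)."""
    rng = np.random.default_rng(seed)
    z = rng.choice(len(pis), size=n, p=pis)              # component indicators
    x = rng.normal(np.array(mus)[z], np.array(sigmas)[z])
    return x, z
```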
Bayesian MoG
• Conjugate priors
– Dirichlet for π:
p(π) = D(π|α°)
– Gaussian-Wishart for µ, Λ:
p(µ, Λ) = ∏_{k=1}^K N(µ_k|m°, (κ°Λ_k)^{-1}) W(Λ_k|(T°)^{-1}, ν°)
• Joint distribution
p(x, z, π, µ, Λ) = p(x|z, µ, Λ)p(z|π)p(π)p(µ|Λ)p(Λ)
Gibbs Sampling for MoG
• Sample indicator variables
z_{nk} ∼ (1/C_n) π_k p(x_n|µ_k, Σ_k)
» where C_n = Σ_{k=1}^K π_k p(x_n|µ_k, Σ_k)
• Sample mixture weights
π ∼ D(π|N_1 + α°, …, N_K + α°)
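One full sweep of these two sampling steps, sketched for a 1-D mixture with the component means and variances held fixed (all names are illustrative):

```python
import numpy as np

def gibbs_sweep(x, pis, mus, sigmas, alpha0, rng):
    """One Gibbs sweep for a 1-D mixture of Gaussians with fixed
    component parameters: resample each indicator z_n from its
    normalized responsibilities, then resample pi from the
    Dirichlet posterior D(N_1 + alpha0, ..., N_K + alpha0)."""
    K = len(pis)
    # pi_k * N(x_n | mu_k, sigma_k^2), normalized per data point
    lik = np.exp(-0.5 * ((x[:, None] - mus) / sigmas) ** 2) / (
        np.sqrt(2 * np.pi) * sigmas)
    resp = pis * lik
    resp /= resp.sum(axis=1, keepdims=True)
    z = np.array([rng.choice(K, p=r) for r in resp])     # indicators
    counts = np.bincount(z, minlength=K)                 # N_1, ..., N_K
    pis_new = rng.dirichlet(counts + alpha0)             # mixture weights
    return z, pis_new
```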
The Multinomial Distribution
p(z|µ) = M(z|µ) = ∏_{k=1}^K µ_k^{z_k}
» where Σ_{k=1}^K µ_k = 1
The Dirichlet Distribution
Dirichlet Posterior
p(µ|Z, α) ∝ p(Z|µ)p(µ|α) ∝ ∏_{k=1}^K µ_k^{α_k + m_k − 1}
p(µ|Z, α) = D(µ|α + m)
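Since the posterior is again a Dirichlet, it is easy to draw from directly. A sketch with made-up counts m and a symmetric prior α:

```python
import numpy as np

alpha = np.array([1.0, 1.0, 1.0])   # symmetric Dirichlet prior
m = np.array([10, 3, 7])            # observed counts per category
post = alpha + m                    # posterior is D(mu | alpha + m)

samples = np.random.default_rng(0).dirichlet(post, size=5000)
post_mean = samples.mean(axis=0)    # approaches post / post.sum()
```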
Dirichlet