Dirichlet Process
Mo Chen
Outline
• Preliminary
– Bayesian Inference
– Exponential Family
– Directed Graphical Model
– Gibbs Sampling
– Finite Mixture Model
• Dirichlet Process
– Dirichlet Distribution
– Dirichlet Process
– Infinite Mixture Model
• Representation of DP
– Chinese Restaurant Process
– Stick Breaking Construction
Glossary
• Observation: X = {x_n}_{n=1}^N
• Latent variables: z, Z
• Parameter: θ
Glossary
• Parameterized distribution: x_n ∼ p(x|θ)
– with density function p(x|θ)
• Likelihood: p(X|θ) = ∏_{n=1}^N p(x_n|θ)
• Prior: p(θ)
• Posterior: p(θ|X)
Inference: Frequentist
• (1) Fitting (training)
– ML: θ̂ = argmax_θ p(X|θ)
– MAP: θ̂ = argmax_θ p(θ|X)
Gaussian: Frequentist (ML)
• (1) Fitting (training): θ̂ = argmax_θ p(X|θ)

max_{µ,Σ} Σ_{n=1}^N ln N(x_n|µ, Σ)

µ̂ = (1/N) Σ_{n=1}^N x_n
Σ̂ = (1/N) Σ_{n=1}^N (x_n − µ̂)(x_n − µ̂)^T
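A minimal NumPy sketch of these two estimators (the function name `gaussian_ml` is just for illustration; `X` is assumed to be an N×d array of observations):

```python
import numpy as np

def gaussian_ml(X):
    """ML estimates for a multivariate Gaussian:
    the sample mean and the (1/N) sample covariance."""
    N = X.shape[0]
    mu = X.mean(axis=0)                # mu-hat = (1/N) sum_n x_n
    diff = X - mu
    Sigma = diff.T @ diff / N          # Sigma-hat = (1/N) sum_n (x_n-mu)(x_n-mu)^T
    return mu, Sigma
```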
Gaussian: MAP Fitting
• Prior: p(µ, Λ)
• Posterior: p(µ, Λ|X) ∝ p(X|µ, Λ)p(µ, Λ)
• MAP: max_{µ,Λ} [ln p(X|µ, Λ) + ln p(µ, Λ)]
Inference: Bayesian
• Prior: p(θ)
• Posterior: p(θ|X) = p(X|θ)p(θ) / p(X)
» where p(X) = ∫ p(X|θ)p(θ) dθ
• Inference: p(x|X) = ∫ p(x|θ)p(θ|X) dθ
Gaussian: Bayesian
• Prior (conjugate)
p(µ, Λ) = N(µ|m°, (κ°Λ)^{-1}) W(Λ|(T°)^{-1}, ν°)
• Posterior
p(µ, Λ|X) = N(µ|m, (κΛ)^{-1}) W(Λ|T^{-1}, ν)
» where, with x̄ the sample mean and S = (1/N) Σ_{n=1}^N (x_n − x̄)(x_n − x̄)^T,
κ = κ° + N
ν = ν° + N
m = (κ°m° + N x̄) / (κ° + N)
T = T° + N S + κ°N (x̄ − m°)(x̄ − m°)^T / (κ° + N)
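The four update equations can be checked numerically. This sketch assumes `X` is an N×d NumPy array, takes S as the (1/N) sample covariance, and uses illustrative names for the prior hyperparameters:

```python
import numpy as np

def normal_wishart_posterior(X, m0, kappa0, nu0, T0):
    """Normal-Wishart posterior updates:
    kappa = kappa0 + N, nu = nu0 + N,
    m = (kappa0*m0 + N*xbar) / (kappa0 + N),
    T = T0 + N*S + kappa0*N/(kappa0+N) * (xbar-m0)(xbar-m0)^T."""
    N = X.shape[0]
    xbar = X.mean(axis=0)
    diff = X - xbar
    S = diff.T @ diff / N              # (1/N) sample covariance
    kappa = kappa0 + N
    nu = nu0 + N
    m = (kappa0 * m0 + N * xbar) / kappa
    d = xbar - m0
    T = T0 + N * S + (kappa0 * N / kappa) * np.outer(d, d)
    return m, kappa, nu, T
```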
Gaussian: Bayesian
• Inference
p(x|X) = ∫∫ p(x|µ, Λ)p(µ, Λ|X) dµ dΛ
= ∫ [∫ N(x|µ, Λ^{-1}) N(µ|m, (κΛ)^{-1}) dµ] W(Λ|T^{-1}, ν) dΛ
= T(x|m, (κ + 1)T / (κ(ν − d + 1)), ν − d + 1)
Student’s t-Distribution
Student’s t-Distribution
Robustness to outliers: Gaussian vs. t-distribution.
Bayesian Inference
• Methodology: Integral
• Intractable: Approximation
– Monte Carlo (Gibbs Sampling)
– Taylor Expansion (Laplace approximation)
– Variational methods (VB, EP)
The Exponential Family
• Density
p(x|θ) = h(x) g(θ) exp(θ^T φ(x))
• Conjugate prior
p(θ|χ°, ν°) = f(χ°, ν°) g(θ)^{ν°} exp(ν° θ^T χ°)
• Posterior
p(θ|X, χ°, ν°) = p(θ|χ, ν) = f(χ, ν) g(θ)^ν exp(ν θ^T χ)
» where
ν = ν° + N
χ = (ν°χ° + Σ_{n=1}^N φ(x_n)) / (ν° + N)
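A tiny numeric sketch of this conjugate update, assuming the sufficient statistics φ(x_n) are already computed and stacked as rows of an array (the function name is illustrative):

```python
import numpy as np

def conjugate_update(phi_X, chi0, nu0):
    """Exponential-family conjugate update:
    nu = nu0 + N, chi = (nu0*chi0 + sum_n phi(x_n)) / (nu0 + N),
    i.e. chi is a convex combination of the prior pseudo-statistic
    chi0 and the average sufficient statistic of the data."""
    phi_X = np.atleast_2d(phi_X)
    N = phi_X.shape[0]
    nu = nu0 + N
    chi = (nu0 * chi0 + phi_X.sum(axis=0)) / nu
    return chi, nu
```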
The Exponential Family
• Marginalization
p(X) = ∫ p(X|θ) p(θ|χ°, ν°) dθ
= (f(χ°, ν°) / f(χ, ν)) ∏_{n=1}^N h(x_n)
Bayesian Networks
Directed Acyclic Graph (DAG)
Bayesian Networks
p(x_1, …, x_7) = p(x_1)p(x_2)p(x_3)p(x_4|x_1, x_2, x_3)p(x_5|x_1, x_3)p(x_6|x_4)p(x_7|x_4, x_5)

General Factorization
p(x) = ∏_{i=1}^m p(x_i|π(x_i))
» where π(x_i) denotes the parents of x_i in the DAG
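The general factorization can be evaluated directly as a product of local conditionals. A sketch using a hypothetical `cpds` representation that maps each variable to its parent list and conditional probability function:

```python
# Joint probability of a Bayesian network, p(x) = prod_i p(x_i | pi(x_i)).
# `cpds` maps each variable name to (parent list, conditional function);
# this dict-of-lambdas representation is chosen just for this sketch.
def joint_prob(cpds, assignment):
    p = 1.0
    for var, (parents, cond) in cpds.items():
        p *= cond(assignment[var], *(assignment[pa] for pa in parents))
    return p

# Example: a two-node network x1 -> x2 with binary variables.
example_cpds = {
    "x1": ([], lambda v: 0.6 if v else 0.4),
    "x2": (["x1"], lambda v, pa: (0.9 if v else 0.1) if pa else (0.2 if v else 0.8)),
}
```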
Gibbs Sampling
• Sequentially sample each variable from its full conditional given the current values of all the others (one variable at a time): z_i ∼ p(z_i|z_{−i})
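A classic illustration: for a zero-mean bivariate Gaussian with unit variances and correlation ρ, each full conditional is x_i | x_j ∼ N(ρ x_j, 1 − ρ²), so the sampler alternates between the two coordinates. A minimal sketch:

```python
import numpy as np

def gibbs_bivariate_gaussian(rho, n_samples, seed=0):
    """Gibbs sampler for a zero-mean bivariate Gaussian with
    correlation rho: sample each coordinate from its full
    conditional, one at a time."""
    rng = np.random.default_rng(seed)
    x1, x2 = 0.0, 0.0
    sd = np.sqrt(1 - rho**2)
    samples = np.empty((n_samples, 2))
    for t in range(n_samples):
        x1 = rng.normal(rho * x2, sd)   # sample x1 | x2
        x2 = rng.normal(rho * x1, sd)   # sample x2 | x1
        samples[t] = x1, x2
    return samples
```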
Mixtures of Gaussians
Old Faithful data set
Mixtures of Gaussians
Combine simple models into a complex model:
p(x) = Σ_{k=1}^K π_k N(x|µ_k, Σ_k)
with Gaussian components N(x|µ_k, Σ_k) and mixing coefficients π_k (here K = 3).
Mixture of Gaussians

p(z|π) = ∏_{k=1}^K π_k^{z_k}
p(z_k = 1) = π_k

p(x|z) = ∏_{k=1}^K N(x|µ_k, Σ_k)^{z_k}

p(x) = Σ_z p(x, z) = Σ_z p(x|z)p(z) = Σ_{k=1}^K π_k N(x|µ_k, Σ_k)

Likelihood for the complete data:
p(X, Z|µ, Σ, π) = ∏_{n=1}^N ∏_{k=1}^K π_k^{z_{nk}} N(x_n|µ_k, Σ_k)^{z_{nk}}
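The mixture density corresponds to the generative process "draw a component z from π, then draw x from that component". A 1-D sketch with illustrative parameters:

```python
import numpy as np

def sample_mog(pis, mus, sigmas, n, seed=0):
    """Sample n points from a 1-D mixture of Gaussians:
    z_n ~ Categorical(pi), then x_n ~ N(mu_{z_n}, sigma_{z_n}^2)."""
    rng = np.random.default_rng(seed)
    z = rng.choice(len(pis), size=n, p=pis)              # component indicators
    x = rng.normal(np.array(mus)[z], np.array(sigmas)[z])
    return x, z
```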
Bayesian MoG
• Conjugate priors
– Dirichlet for π:
p(π) = D(π|α°)
– Gaussian-Wishart for µ, Λ:
p(µ, Λ) = ∏_{k=1}^K N(µ_k|m°, (κ°Λ_k)^{-1}) W(Λ_k|(T°)^{-1}, ν°)
• Joint distribution
p(x, z, π, µ, Λ) = p(x|z, µ, Λ)p(z|π)p(π)p(µ|Λ)p(Λ)
Gibbs Sampling for MoG
• Sample indicator variables
z_{nk} ∼ (1/C_n) π_k p(x_n|µ_k, Σ_k)
» where C_n = Σ_{k=1}^K π_k p(x_n|µ_k, Σ_k)
• Sample mixture weights
π ∼ D(π|N_1 + α°, …, N_K + α°)
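One full sweep of these two sampling steps, sketched for a 1-D mixture with the component means and variances held fixed (all names are illustrative):

```python
import numpy as np

def gibbs_sweep(x, pis, mus, sigmas, alpha0, rng):
    """One Gibbs sweep for a 1-D mixture of Gaussians with fixed
    component parameters: resample each indicator z_n from its
    normalized responsibilities, then resample pi from the
    Dirichlet posterior D(N_1 + alpha0, ..., N_K + alpha0)."""
    K = len(pis)
    # pi_k * N(x_n | mu_k, sigma_k^2), normalized per data point
    lik = np.exp(-0.5 * ((x[:, None] - mus) / sigmas) ** 2) / (
        np.sqrt(2 * np.pi) * sigmas)
    resp = pis * lik
    resp /= resp.sum(axis=1, keepdims=True)
    z = np.array([rng.choice(K, p=r) for r in resp])     # indicators
    counts = np.bincount(z, minlength=K)                 # N_1, ..., N_K
    pis_new = rng.dirichlet(counts + alpha0)             # mixture weights
    return z, pis_new
```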
The Multinomial Distribution
p(z|µ) = M(z|µ) = ∏_{k=1}^K µ_k^{z_k}
» where Σ_{k=1}^K µ_k = 1
The Dirichlet Distribution
Dirichlet Posterior
p(µ|Z, α) ∝ p(Z|µ)p(µ|α) ∝ ∏_{k=1}^K µ_k^{α_k + m_k − 1}
p(µ|Z, α) = D(µ|α + m)
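Since the posterior is again a Dirichlet, it is easy to draw from directly. A sketch with made-up counts m and a symmetric prior α:

```python
import numpy as np

alpha = np.array([1.0, 1.0, 1.0])   # symmetric Dirichlet prior
m = np.array([10, 3, 7])            # observed counts per category
post = alpha + m                    # posterior is D(mu | alpha + m)

samples = np.random.default_rng(0).dirichlet(post, size=5000)
post_mean = samples.mean(axis=0)    # approaches post / post.sum()
```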
Dirichlet