
11 Density Estimation with Gaussian Mixture Models

In earlier chapters, we have already covered two fundamental problems in machine learning: regression (Chapter 9) and dimensionality reduction (Chapter 10). In this chapter, we will have a look at a third pillar of machine learning: density estimation. On our journey, we introduce important concepts, such as the EM algorithm and a latent-variable perspective of density estimation with mixture models.

When we apply machine learning to data, we often aim to represent the data in some way. A straightforward way is to take the data points themselves as the representation of the data; see Figure 11.1 for an example. However, this approach may be unhelpful if the dataset is huge or if we are interested in representing characteristics of the data. In density estimation, we represent the data compactly using a density, e.g., a Gaussian or Beta distribution. For example, we may be looking for the mean and variance of a dataset in order to represent the data compactly using a Gaussian distribution. The mean and variance can be found using tools we discussed in Section 8.2: maximum likelihood or maximum a posteriori estimation. We can then use the mean and variance of this Gaussian to represent the distribution underlying the data, i.e., we think of the dataset as a typical realization from this distribution if we were to sample from it.

Figure 11.1 Two-dimensional dataset that cannot be meaningfully represented by a Gaussian.


In practice, the Gaussian distribution (and similarly all other distributions we have encountered so far) has limited modeling capabilities. For example, a Gaussian approximation of the density that generated the data in Figure 11.1 would be a poor approximation. In the following, we will look at a more expressive family of distributions, which we can use for density estimation: mixture models.

Mixture models can be used to describe a distribution p(x) by a convex combination of K simple (base) distributions:

p(x) = \sum_{k=1}^{K} \pi_k \, p_k(x) ,   (11.1)

0 \leq \pi_k \leq 1 , \quad \sum_{k=1}^{K} \pi_k = 1 ,   (11.2)

where the components pk are members of a family of basic distributions, e.g., Gaussians, Bernoulli or Gamma distributions, and the πk are the mixture weights. Mixture models are more expressive than the corresponding base distributions because they allow for multimodal data representations, i.e., they can describe datasets with multiple “clusters”, such as the example in Figure 11.1.

In the following, we will focus on Gaussian mixture models (GMMs), where the basic distributions are Gaussians. For a given dataset, we aim to maximize the likelihood of the model parameters to train the GMM. For this purpose, we will use results from Chapter 5, Section 7.2, and Chapter 6. However, unlike the other applications we discussed earlier (linear regression or PCA), we will not find a closed-form maximum likelihood solution. Instead, we will arrive at a set of dependent simultaneous equations, which we can only solve iteratively.

11.1 Gaussian Mixture Model


A Gaussian mixture model is a density model in which we combine a finite number K of Gaussian distributions \mathcal{N}(\mu_k, \Sigma_k) so that

p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k) ,   (11.3)

0 \leq \pi_k \leq 1 , \quad \sum_{k=1}^{K} \pi_k = 1 .   (11.4)

This convex combination of Gaussian distributions gives us significantly more flexibility for modeling complex densities than a simple Gaussian distribution (which we recover from (11.3) for K = 1). An illustration is given in Figure 11.2. Here, the mixture density is given as

p(x) = 0.5 \, \mathcal{N}(x \mid -2, \tfrac{1}{2}) + 0.2 \, \mathcal{N}(x \mid 1, 2) + 0.3 \, \mathcal{N}(x \mid 4, 1) .   (11.5)


Figure 11.2 Gaussian mixture model. The Gaussian mixture distribution (black) is composed of a convex combination of Gaussian distributions and is more expressive than any individual component. Dashed lines represent the weighted Gaussian components.
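As a quick numerical illustration of (11.3) and (11.5), the following Python/NumPy sketch evaluates the mixture density at a few points. The function and variable names, and the use of SciPy's `norm.pdf`, are choices made for this illustration only; they are not part of the text.

```python
import numpy as np
from scipy.stats import norm

# Mixture from (11.5): weights, means, and variances of the three components.
weights = np.array([0.5, 0.2, 0.3])
means = np.array([-2.0, 1.0, 4.0])
variances = np.array([0.5, 2.0, 1.0])

def gmm_density(x):
    """Evaluate the Gaussian mixture density (11.3) at the points x."""
    x = np.atleast_1d(x)
    # norm.pdf expects a standard deviation, hence the square root of the variance.
    components = np.stack([w * norm.pdf(x, loc=m, scale=np.sqrt(v))
                           for w, m, v in zip(weights, means, variances)])
    return components.sum(axis=0)   # convex combination of the K component densities

print(gmm_density(np.array([-2.0, 0.0, 4.0])))
```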

11.2 Parameter Learning via Maximum Likelihood


Assume we are given a dataset X = {x1, ..., xN}, where xn, n = 1, ..., N, are drawn i.i.d. from an unknown distribution p(x). Our objective is to find a good approximation/representation of this unknown distribution p(x) by means of a Gaussian mixture model (GMM) with K mixture components. The parameters of the GMM are the K means µk, the covariances Σk, and the mixture weights πk. We summarize all these free parameters in θ := {πk, µk, Σk : k = 1, ..., K}.

Example 11.1 (Initial setting)
Throughout this chapter, we will have a simple running example that helps us illustrate and visualize important concepts. We will look at a one-dimensional dataset X = [−3, −2.5, −1, 0, 2, 4, 5] consisting of seven data points. We wish to find a GMM with K = 3 components that models the data. We initialize the individual components as

p_1(x) = \mathcal{N}(x \mid -4, 1) ,   (11.6)
p_2(x) = \mathcal{N}(x \mid 0, 0.2) ,   (11.7)
p_3(x) = \mathcal{N}(x \mid 8, 3) ,   (11.8)

and assign them equal weights π1 = π2 = π3 = 1/3. The corresponding model (and the data points) are shown in Figure 11.3.

Figure 11.3 Initial setting: GMM (black) with three mixture components (dashed) and seven data points (discs).

In the following, we detail how to obtain a maximum likelihood estimate θML of the model parameters θ. We start by writing down the likelihood, i.e., the probability of the data given the parameters. We exploit our i.i.d. assumption, which leads to the factorized likelihood

p(X \mid \theta) = \prod_{n=1}^{N} p(x_n \mid \theta) , \quad p(x_n \mid \theta) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k) ,   (11.9)

where every individual likelihood term p(xn | θ) is a Gaussian mixture density. Then, we obtain the log-likelihood as

\log p(X \mid \theta) = \sum_{n=1}^{N} \log p(x_n \mid \theta) = \underbrace{\sum_{n=1}^{N} \log \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}_{=: L} .   (11.10)

We aim to find parameters θ*ML that maximize the log-likelihood L defined in (11.10). Our “normal” procedure would be to compute the gradient dL/dθ of the log-likelihood with respect to the model parameters θ, set it to 0, and solve for θ. However, unlike our previous examples for maximum likelihood estimation (e.g., when we discussed linear regression in Section 9.2), we cannot obtain a closed-form solution. If we were to consider a single Gaussian as the desired density, the sum over k in (11.10) vanishes, and the log can be applied directly to the Gaussian component, such that we get

\log \mathcal{N}(x \mid \mu, \Sigma) = -\frac{D}{2} \log(2\pi) - \frac{1}{2} \log \det(\Sigma) - \frac{1}{2} (x - \mu)^\top \Sigma^{-1} (x - \mu) .   (11.11)

This simple form allows us to find closed-form maximum likelihood estimates of µ and Σ, as discussed in Chapter 8. However, in (11.10), we cannot move the log into the sum over k, so we cannot obtain a simple closed-form maximum likelihood solution. Instead, we can exploit an iterative scheme to find good model parameters θML: the EM algorithm.
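Before turning to that iterative scheme, it can be instructive to evaluate the log-likelihood (11.10) numerically. The sketch below does this for the dataset and initial parameters from Example 11.1; the use of the log-sum-exp trick for numerical stability is an implementation detail not discussed in the text.

```python
import numpy as np
from scipy.stats import norm
from scipy.special import logsumexp

X = np.array([-3.0, -2.5, -1.0, 0.0, 2.0, 4.0, 5.0])   # dataset from Example 11.1
pi = np.array([1 / 3, 1 / 3, 1 / 3])                    # mixture weights
mu = np.array([-4.0, 0.0, 8.0])                         # means from (11.6)-(11.8)
var = np.array([1.0, 0.2, 3.0])                         # variances from (11.6)-(11.8)

def gmm_log_likelihood(X, pi, mu, var):
    """Log-likelihood (11.10) of a one-dimensional GMM, summed over all data points."""
    # log_comp[n, k] = log(pi_k) + log N(x_n | mu_k, var_k)
    log_comp = np.log(pi) + norm.logpdf(X[:, None], loc=mu, scale=np.sqrt(var))
    # log p(x_n | theta) = logsumexp over k of log_comp[n, k]; summing over n gives L.
    return logsumexp(log_comp, axis=1).sum()

print(gmm_log_likelihood(X, pi, mu, var))   # roughly -28.3 for this initialization
```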
Any local optimum of a function exhibits the property that its gradient with respect to the parameters must vanish (necessary condition); see Chapter 7. In our case, we obtain the following necessary conditions when we optimize the log-likelihood in (11.10) with respect to the GMM parameters µk, Σk, πk:

\frac{\partial L}{\partial \mu_k} = 0 \iff \sum_{n=1}^{N} \frac{\partial \log p(x_n \mid \theta)}{\partial \mu_k} = 0 ,   (11.12)

\frac{\partial L}{\partial \Sigma_k} = 0 \iff \sum_{n=1}^{N} \frac{\partial \log p(x_n \mid \theta)}{\partial \Sigma_k} = 0 ,   (11.13)

\frac{\partial L}{\partial \pi_k} = 0 \iff \sum_{n=1}^{N} \frac{\partial \log p(x_n \mid \theta)}{\partial \pi_k} = 0 .   (11.14)

For all three necessary conditions, by applying the chain rule (see Section 5.2.2), we require partial derivatives of the form

\frac{\partial \log p(x_n \mid \theta)}{\partial \theta} = \frac{1}{p(x_n \mid \theta)} \frac{\partial p(x_n \mid \theta)}{\partial \theta} ,   (11.15)

where θ = {µk, Σk, πk : k = 1, ..., K} comprises all model parameters, and

\frac{1}{p(x_n \mid \theta)} = \frac{1}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)} .   (11.16)

In the following, we will compute the partial derivatives (11.12)–(11.14).

Theorem 11.1 (Update of the GMM Means). The update of the mean parameters µk, k = 1, ..., K, of the GMM is given by

\mu_k = \frac{\sum_{n=1}^{N} r_{nk} x_n}{\sum_{n=1}^{N} r_{nk}} = \frac{1}{N_k} \sum_{n=1}^{N} r_{nk} x_n ,   (11.17)

where we define

r_{nk} := \frac{\pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)} ,   (11.18)

N_k := \sum_{n=1}^{N} r_{nk} .   (11.19)

Proof. From (11.15), we see that the gradient of the log-likelihood with respect to the mean parameters µk, k = 1, ..., K, requires us to compute the partial derivative

\frac{\partial p(x_n \mid \theta)}{\partial \mu_k} = \sum_{j=1}^{K} \pi_j \frac{\partial \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}{\partial \mu_k} = \pi_k \frac{\partial \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\partial \mu_k}   (11.20)

= \pi_k (x_n - \mu_k)^\top \Sigma_k^{-1} \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k) ,   (11.21)

where we exploited that only the kth mixture component depends on µk.


We use our result from (11.21) in (11.15) and put everything together so that the desired partial derivative of L with respect to µk is given as

\frac{\partial L}{\partial \mu_k} = \sum_{n=1}^{N} \frac{\partial \log p(x_n \mid \theta)}{\partial \mu_k} = \sum_{n=1}^{N} \frac{1}{p(x_n \mid \theta)} \frac{\partial p(x_n \mid \theta)}{\partial \mu_k}   (11.22)

= \sum_{n=1}^{N} (x_n - \mu_k)^\top \Sigma_k^{-1} \underbrace{\frac{\pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}}_{= r_{nk}}   (11.23)

= \sum_{n=1}^{N} r_{nk} (x_n - \mu_k)^\top \Sigma_k^{-1} .   (11.24)

Here, we used the identity from (11.16) and the result of the partial derivative in (11.21) to get to the second row. The values rnk are often called responsibilities.

Remark. The responsibility rnk of the kth mixture component for data point xn is proportional to the likelihood

p(x_n \mid \pi_k, \mu_k, \Sigma_k) = \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)   (11.25)

of the mixture component given the data point (the denominator in the definition of rnk is constant for all mixture components and serves as a normalizer). Therefore, mixture components have a high responsibility for a data point when the data point could be a plausible sample from that mixture component. ♦

From the definition of rnk in (11.18), it is clear that [r_{n1}, \dots, r_{nK}]^\top is a probability vector, i.e., \sum_k r_{nk} = 1 with r_{nk} > 0. This probability vector distributes probability mass among the K mixture components, and, intuitively, every rnk expresses the probability that xn has been generated by the kth mixture component.
We now solve for µk so that \partial L / \partial \mu_k = 0^\top and obtain

\sum_{n=1}^{N} r_{nk} x_n = \sum_{n=1}^{N} r_{nk} \mu_k \iff \mu_k = \frac{\sum_{n=1}^{N} r_{nk} x_n}{\sum_{n=1}^{N} r_{nk}} = \frac{1}{N_k} \sum_{n=1}^{N} r_{nk} x_n ,   (11.26)

where we defined

N_k := \sum_{n=1}^{N} r_{nk}   (11.27)

as the total responsibility of the kth mixture component across the entire dataset. This concludes the proof of Theorem 11.1.
Intuitively, (11.17) can be interpreted as a Monte-Carlo estimate of the mean of weighted data points xn, where every xn is weighted by the responsibility rnk of the kth cluster for xn. Therefore, the mean µk is pulled toward a data point xn with strength given by rnk. Intuitively, the means are pulled more strongly toward data points for which the corresponding mixture component has a high responsibility, i.e., a high likelihood. Figure 11.4 illustrates this. We can also interpret the mean update in (11.17) as the expected value of all data points under the distribution given by

r_k := [r_{1k}, \dots, r_{Nk}]^\top / N_k ,   (11.28)

which is a normalized probability vector, i.e.,

\mu_k \leftarrow \mathbb{E}_{r_k}[X] .   (11.29)

Figure 11.4 Update of the mean parameter of a mixture component in a GMM. The mean µ is being pulled toward individual data points with the weights given by the corresponding responsibilities. The mean update is then a weighted average of the data points.

Example 11.2 (Responsibilities)
For our example from Figure 11.3, we compute the responsibilities rnk as

\begin{pmatrix}
1.0 & 0.0 & 0.0 \\
1.0 & 0.0 & 0.0 \\
0.057 & 0.943 & 0.0 \\
0.001 & 0.999 & 0.0 \\
0.0 & 0.066 & 0.934 \\
0.0 & 0.0 & 1.0 \\
0.0 & 0.0 & 1.0
\end{pmatrix} \in \mathbb{R}^{N \times K} .   (11.30)

Here, the nth row tells us the responsibilities of all mixture components for xn. The sum of all K responsibilities for a data point (the sum of every row) is 1. The kth column gives us an overview of the responsibility of the kth mixture component. We can see that the third mixture component (third column) is not responsible for any of the first four data points, but takes much responsibility for the remaining data points. The sum of all entries of a column gives us the value Nk, i.e., the total responsibility of the kth mixture component. In our example, we get N1 = 2.057, N2 = 2.009, N3 = 2.934.
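The responsibility matrix in (11.30) can be reproduced with a few lines of code. The following sketch computes (11.18) for the running example under the initial parameters from Example 11.1; apart from rounding, it should match the matrix above.

```python
import numpy as np
from scipy.stats import norm

X = np.array([-3.0, -2.5, -1.0, 0.0, 2.0, 4.0, 5.0])
pi = np.array([1 / 3, 1 / 3, 1 / 3])
mu = np.array([-4.0, 0.0, 8.0])
var = np.array([1.0, 0.2, 3.0])

# Unnormalized responsibilities pi_k * N(x_n | mu_k, var_k), shape (N, K).
weighted = pi * norm.pdf(X[:, None], loc=mu, scale=np.sqrt(var))
# Normalize every row, cf. (11.18); each row then sums to 1.
r = weighted / weighted.sum(axis=1, keepdims=True)

print(np.round(r, 3))   # approximately the matrix in (11.30)
print(r.sum(axis=0))    # column sums N_k, cf. (11.19)
```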

Example 11.3 (Mean Updates)
In our example from Figure 11.3, the mean values are updated as follows:

\mu_1 : -4 \to -2.7 ,   (11.31)
\mu_2 : 0 \to -0.4 ,   (11.32)
\mu_3 : 8 \to 3.7 .   (11.33)

Here, we see that the means of the first and third mixture components move toward the regime of the data, whereas the mean of the second component does not change so dramatically. Figure 11.5 illustrates this change, where Figure 11.5(a) shows the GMM density prior to updating the means and Figure 11.5(b) shows the GMM density after updating the mean values µk.

Figure 11.5 Effect of updating the mean values in a GMM. (a) GMM before updating the mean values; (b) GMM after updating the mean values µk while retaining the variances and mixture weights.
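Given the responsibilities, the mean update (11.17) amounts to a responsibility-weighted average. The following self-contained sketch recomputes the responsibilities under the initial parameters and then applies (11.17); it should approximately reproduce (11.31)–(11.33).

```python
import numpy as np
from scipy.stats import norm

X = np.array([-3.0, -2.5, -1.0, 0.0, 2.0, 4.0, 5.0])
pi, mu, var = np.full(3, 1 / 3), np.array([-4.0, 0.0, 8.0]), np.array([1.0, 0.2, 3.0])

# E-step responsibilities (11.18) under the initial parameters.
weighted = pi * norm.pdf(X[:, None], loc=mu, scale=np.sqrt(var))
r = weighted / weighted.sum(axis=1, keepdims=True)
N_k = r.sum(axis=0)                              # total responsibilities (11.19)

# Mean update (11.17): responsibility-weighted average of the data points.
mu_new = (r * X[:, None]).sum(axis=0) / N_k
print(np.round(mu_new, 1))                       # approximately [-2.7, -0.4, 3.7]
```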

The update of the mean parameters in (11.17) looks fairly straightforward. However, note that the responsibilities rnk are a function of πj, µj, Σj for all j = 1, ..., K, such that the updates in (11.17) depend on all parameters of the GMM, and a closed-form solution, as we obtained for linear regression in Section 9.2 or PCA in Chapter 10, cannot be obtained.

Theorem 11.2 (Updates of the GMM Covariances). The update of the covariance parameters Σk, k = 1, ..., K, of the GMM is given by

\Sigma_k = \frac{1}{N_k} \sum_{n=1}^{N} r_{nk} (x_n - \mu_k)(x_n - \mu_k)^\top ,   (11.34)

where rnk and Nk are defined in (11.18) and (11.19), respectively.

Proof. To prove Theorem 11.2, our approach is to compute the partial derivatives of the log-likelihood L with respect to the covariances Σk, set them to 0, and solve for Σk. We start with our general approach

\frac{\partial L}{\partial \Sigma_k} = \sum_{n=1}^{N} \frac{\partial \log p(x_n \mid \theta)}{\partial \Sigma_k} = \sum_{n=1}^{N} \frac{1}{p(x_n \mid \theta)} \frac{\partial p(x_n \mid \theta)}{\partial \Sigma_k} .   (11.35)

We already know 1/p(xn | θ) from (11.16). To obtain the remaining partial derivative ∂p(xn | θ)/∂Σk, we write down the definition of the Gaussian distribution p(xn | θ) (see (11.9)) and drop all terms but the kth. We then obtain

\frac{\partial p(x_n \mid \theta)}{\partial \Sigma_k}   (11.36a)

= \frac{\partial}{\partial \Sigma_k} \Bigl( \pi_k (2\pi)^{-\frac{D}{2}} \det(\Sigma_k)^{-\frac{1}{2}} \exp\bigl( -\tfrac{1}{2} (x_n - \mu_k)^\top \Sigma_k^{-1} (x_n - \mu_k) \bigr) \Bigr)   (11.36b)

= \pi_k (2\pi)^{-\frac{D}{2}} \Bigl[ \frac{\partial}{\partial \Sigma_k} \det(\Sigma_k)^{-\frac{1}{2}} \exp\bigl( -\tfrac{1}{2} (x_n - \mu_k)^\top \Sigma_k^{-1} (x_n - \mu_k) \bigr) + \det(\Sigma_k)^{-\frac{1}{2}} \frac{\partial}{\partial \Sigma_k} \exp\bigl( -\tfrac{1}{2} (x_n - \mu_k)^\top \Sigma_k^{-1} (x_n - \mu_k) \bigr) \Bigr] .   (11.36c)

We now use the identities

\frac{\partial}{\partial \Sigma_k} \det(\Sigma_k)^{-\frac{1}{2}} = -\tfrac{1}{2} \det(\Sigma_k)^{-\frac{1}{2}} \Sigma_k^{-1} ,   (11.37)

\frac{\partial}{\partial \Sigma_k} (x_n - \mu_k)^\top \Sigma_k^{-1} (x_n - \mu_k) = -\Sigma_k^{-1} (x_n - \mu_k)(x_n - \mu_k)^\top \Sigma_k^{-1}   (11.38)

and obtain (after some re-arranging) the desired partial derivative required in (11.35) as

\frac{\partial p(x_n \mid \theta)}{\partial \Sigma_k} = \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \cdot \bigl[ -\tfrac{1}{2} \bigl( \Sigma_k^{-1} - \Sigma_k^{-1} (x_n - \mu_k)(x_n - \mu_k)^\top \Sigma_k^{-1} \bigr) \bigr] .   (11.39)

Putting everything together, the partial derivative of the log-likelihood with respect to Σk is given by

\frac{\partial L}{\partial \Sigma_k} = \sum_{n=1}^{N} \frac{\partial \log p(x_n \mid \theta)}{\partial \Sigma_k} = \sum_{n=1}^{N} \frac{1}{p(x_n \mid \theta)} \frac{\partial p(x_n \mid \theta)}{\partial \Sigma_k}   (11.40a)

= \sum_{n=1}^{N} \underbrace{\frac{\pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}}_{= r_{nk}} \cdot \bigl[ -\tfrac{1}{2} \bigl( \Sigma_k^{-1} - \Sigma_k^{-1} (x_n - \mu_k)(x_n - \mu_k)^\top \Sigma_k^{-1} \bigr) \bigr]   (11.40b)

= -\frac{1}{2} \sum_{n=1}^{N} r_{nk} \bigl( \Sigma_k^{-1} - \Sigma_k^{-1} (x_n - \mu_k)(x_n - \mu_k)^\top \Sigma_k^{-1} \bigr)   (11.40c)

= -\frac{1}{2} \Sigma_k^{-1} \underbrace{\sum_{n=1}^{N} r_{nk}}_{= N_k} + \frac{1}{2} \Sigma_k^{-1} \Bigl( \sum_{n=1}^{N} r_{nk} (x_n - \mu_k)(x_n - \mu_k)^\top \Bigr) \Sigma_k^{-1} .   (11.40d)

We see that the responsibilities rnk also appear in this partial derivative. Setting this partial derivative to 0, we obtain the necessary optimality condition

N_k \Sigma_k^{-1} = \Sigma_k^{-1} \Bigl( \sum_{n=1}^{N} r_{nk} (x_n - \mu_k)(x_n - \mu_k)^\top \Bigr) \Sigma_k^{-1}   (11.41a)

\iff N_k I = \Bigl( \sum_{n=1}^{N} r_{nk} (x_n - \mu_k)(x_n - \mu_k)^\top \Bigr) \Sigma_k^{-1}   (11.41b)

\iff \Sigma_k = \frac{1}{N_k} \sum_{n=1}^{N} r_{nk} (x_n - \mu_k)(x_n - \mu_k)^\top ,   (11.41c)

which gives us a simple update rule for Σk for k = 1, ..., K and proves Theorem 11.2.

Similar to the update of µk in (11.17), we can interpret the update of the covariance in (11.34) as an expected value

\mathbb{E}_{r_k}[\tilde{X}_k^\top \tilde{X}_k] ,   (11.42)

where \tilde{X}_k := [x_1 - \mu_k, \dots, x_N - \mu_k]^\top is the data matrix X centered at µk, and r_k is the probability vector defined in (11.28).

Example 11.4 (Variance Updates)
In our example from Figure 11.3, the variances are updated as follows:

\sigma_1^2 : 1 \to 0.14 ,   (11.43)
\sigma_2^2 : 0.2 \to 0.44 ,   (11.44)
\sigma_3^2 : 3 \to 1.53 .   (11.45)

Here, we see that the variances of the first and third components shrink significantly, while the variance of the second component increases slightly. Figure 11.6 illustrates this setting. Figure 11.6(a) is identical (but zoomed in) to Figure 11.5(b) and shows the GMM density and its individual components prior to updating the variances. Figure 11.6(b) shows the GMM density after updating the variances.

Figure 11.6 Effect of updating the variances in a GMM. (a) GMM before updating the variances; (b) GMM after updating the variances while retaining the means and mixture weights.

Similar to the update of the mean parameters, we can interpret (11.34) as a Monte-Carlo estimate of the weighted covariance of data points xn associated with the kth mixture component, where the weights are the responsibilities rnk. As with the updates of the mean parameters, this update depends on all πj, µj, Σj, j = 1, ..., K, through the responsibilities rnk, which prohibits a closed-form solution.
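In code, the covariance update (11.34) mirrors the mean update. The sketch below recomputes the responsibilities under the initial parameters, updates the means, and then updates the variances; in one dimension, (11.34) reduces to a responsibility-weighted mean of squared deviations. It should approximately reproduce the values in Example 11.4.

```python
import numpy as np
from scipy.stats import norm

X = np.array([-3.0, -2.5, -1.0, 0.0, 2.0, 4.0, 5.0])
pi, mu, var = np.full(3, 1 / 3), np.array([-4.0, 0.0, 8.0]), np.array([1.0, 0.2, 3.0])

# E-step responsibilities (11.18) under the initial parameters.
weighted = pi * norm.pdf(X[:, None], loc=mu, scale=np.sqrt(var))
r = weighted / weighted.sum(axis=1, keepdims=True)
N_k = r.sum(axis=0)

# M-step: mean update (11.17) followed by the variance update (11.34).
mu_new = (r * X[:, None]).sum(axis=0) / N_k
var_new = (r * (X[:, None] - mu_new) ** 2).sum(axis=0) / N_k
print(np.round(var_new, 2))    # approximately [0.14, 0.44, 1.53], cf. (11.43)-(11.45)
```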


Theorem 11.3 (Update of the GMM Mixture Weights). The mixture weights of the GMM are updated as

\pi_k = \frac{N_k}{N}   (11.46)

for k = 1, ..., K, where N is the number of data points and Nk is defined in (11.19).

Proof. To find the partial derivative of the log-likelihood with respect to the weight parameters πk, k = 1, ..., K, we take the constraint \sum_k \pi_k = 1 into account by using Lagrange multipliers (see Section 7.2). The Lagrangian is

\mathfrak{L} = L + \lambda \Bigl( \sum_{k=1}^{K} \pi_k - 1 \Bigr)   (11.47a)

= \sum_{n=1}^{N} \log \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k) + \lambda \Bigl( \sum_{k=1}^{K} \pi_k - 1 \Bigr) ,   (11.47b)

where L is the log-likelihood from (11.10) and the second term encodes the equality constraint that all the mixture weights need to sum up to 1. We obtain the partial derivatives with respect to πk and the Lagrange multiplier λ:

\frac{\partial \mathfrak{L}}{\partial \pi_k} = \sum_{n=1}^{N} \frac{\mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)} + \lambda = \frac{N_k}{\pi_k} + \lambda ,   (11.48)

\frac{\partial \mathfrak{L}}{\partial \lambda} = \sum_{k=1}^{K} \pi_k - 1 .   (11.49)

Setting both partial derivatives to 0 (necessary condition for an optimum) yields the system of equations

\pi_k = -\frac{N_k}{\lambda} ,   (11.50)

1 = \sum_{k=1}^{K} \pi_k .   (11.51)

Using (11.51) in (11.50) and solving for λ, we obtain

\sum_{k=1}^{K} \pi_k = 1 \iff -\sum_{k=1}^{K} \frac{N_k}{\lambda} = 1 \iff -\frac{N}{\lambda} = 1 \iff \lambda = -N .   (11.52)

This allows us to substitute −N for λ in (11.50) to obtain

\pi_k = \frac{N_k}{N} ,   (11.53)


which gives us the update for the weight parameters πk and proves Theorem 11.3.

We can identify the mixture weight in (11.46) as the ratio of the total responsibility of the kth cluster and the number of data points. Since N = \sum_k N_k, the number of data points can also be interpreted as the total responsibility of all mixture components together, such that πk is the relative importance of the kth mixture component for the dataset.

Remark. Since N_k = \sum_{n=1}^{N} r_{nk}, the update equation (11.46) for the mixture weights πk also depends on all πj, µj, Σj, j = 1, ..., K, via the responsibilities rnk. ♦

Example 11.5 (Weight Parameter Updates)
In our running example from Figure 11.3, the mixture weights are updated as follows:

\pi_1 : 1/3 \to 0.29 ,   (11.54)
\pi_2 : 1/3 \to 0.29 ,   (11.55)
\pi_3 : 1/3 \to 0.42 .   (11.56)

Here we see that the third component gets more weight/importance, while the other components become slightly less important. Figure 11.7 illustrates the effect of updating the mixture weights. Figure 11.7(a) is identical to Figure 11.6(b) and shows the GMM density and its individual components prior to updating the mixture weights. Figure 11.7(b) shows the GMM density after updating the mixture weights.

Figure 11.7 Effect of updating the mixture weights in a GMM. (a) GMM before updating the mixture weights; (b) GMM after updating the mixture weights while retaining the means and variances. Note the different scalings of the vertical axes.

Overall, having updated the means, the variances, and the weights once, we obtain the GMM shown in Figure 11.7(b). Compared with the initialization shown in Figure 11.3, we can see that the parameter updates caused the GMM density to shift some of its mass toward the data points. After updating the means, variances, and weights once, the GMM fit in Figure 11.7(b) is already remarkably better than its initialization from Figure 11.3. This is also evidenced by the negative log-likelihood, which decreased from 28.3 (initialization) to 14.4 (after one complete update cycle).

11.3 EM Algorithm


Unfortunately, the updates in (11.17), (11.34), and (11.46) do not constitute a closed-form solution for the updates of the parameters µk, Σk, πk of the mixture model because the responsibilities rnk depend on those parameters in a complex way. However, the results suggest a simple iterative scheme for finding a solution to the parameter estimation problem via maximum likelihood. The Expectation Maximization algorithm (EM algorithm) was proposed by Dempster et al. (1977) and is a general iterative scheme for learning parameters (maximum likelihood or MAP) in mixture models and, more generally, latent-variable models.

In our example of the Gaussian mixture model, we choose initial values for µk, Σk, πk and alternate until convergence between

• E-step: Evaluate the responsibilities rnk (the posterior probability of data point xn belonging to mixture component k).
• M-step: Use the updated responsibilities to re-estimate the parameters µk, Σk, πk.

Every step in the EM algorithm increases the log-likelihood function (Neal and Hinton, 1999). For convergence, we can check the log-likelihood or the parameters directly. A concrete instantiation of the EM algorithm for estimating the parameters of a GMM is as follows (a code sketch of this procedure is given after the listing):

1. Initialize µk, Σk, πk.

2. E-step: Evaluate the responsibilities rnk for every data point xn using the current parameters πk, µk, Σk:

   r_{nk} = \frac{\pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j} \pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)} .   (11.57)

3. M-step: Re-estimate the parameters πk, µk, Σk using the current responsibilities rnk (from the E-step):

   \mu_k = \frac{1}{N_k} \sum_{n=1}^{N} r_{nk} x_n ,   (11.58)

   \Sigma_k = \frac{1}{N_k} \sum_{n=1}^{N} r_{nk} (x_n - \mu_k)(x_n - \mu_k)^\top ,   (11.59)

   \pi_k = \frac{N_k}{N} .   (11.60)
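The following is a minimal sketch of this EM loop for a general D-dimensional GMM using NumPy and SciPy. The function name, the optional initialization arguments, the convergence tolerance, and the small jitter added to the covariances for numerical stability are implementation choices of this sketch, not part of the algorithm as stated above.

```python
import numpy as np
from scipy.stats import multivariate_normal


def fit_gmm_em(X, K, mu0=None, Sigma0=None, pi0=None, n_iters=100, tol=1e-6, seed=0):
    """Fit a K-component GMM to the data X (shape N x D) with the EM algorithm."""
    N, D = X.shape
    rng = np.random.default_rng(seed)

    # Initialization: use the provided values or a simple default
    # (random data points as means, unit covariances, uniform weights).
    mu = np.array(mu0, float) if mu0 is not None else X[rng.choice(N, K, replace=False)]
    Sigma = np.array(Sigma0, float) if Sigma0 is not None else np.stack([np.eye(D)] * K)
    pi = np.array(pi0, float) if pi0 is not None else np.full(K, 1.0 / K)

    log_liks = []
    for _ in range(n_iters):
        # E-step (11.57): responsibilities r[n, k] under the current parameters.
        weighted = np.stack([pi[k] * multivariate_normal.pdf(X, mean=mu[k], cov=Sigma[k])
                             for k in range(K)], axis=1)          # shape (N, K)
        r = weighted / weighted.sum(axis=1, keepdims=True)

        # M-step (11.58)-(11.60).
        N_k = r.sum(axis=0)
        mu = (r.T @ X) / N_k[:, None]                             # (11.58)
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (r[:, k, None] * diff).T @ diff / N_k[k]   # (11.59)
            Sigma[k] += 1e-6 * np.eye(D)                          # jitter for stability
        pi = N_k / N                                              # (11.60)

        # Log-likelihood (11.10) under the parameters used in the E-step above;
        # stop once it no longer improves noticeably.
        log_liks.append(np.log(weighted.sum(axis=1)).sum())
        if len(log_liks) > 1 and abs(log_liks[-1] - log_liks[-2]) < tol:
            break

    return pi, mu, Sigma, log_liks


# Running example from Example 11.1, treated as a dataset with D = 1.
X = np.array([-3.0, -2.5, -1.0, 0.0, 2.0, 4.0, 5.0])[:, None]
pi, mu, Sigma, log_liks = fit_gmm_em(
    X, K=3,
    mu0=[[-4.0], [0.0], [8.0]],                 # initialization (11.6)-(11.8)
    Sigma0=[[[1.0]], [[0.2]], [[3.0]]],
    pi0=[1 / 3, 1 / 3, 1 / 3])
print(np.round(pi, 2), np.round(mu.ravel(), 2), np.round(Sigma.ravel(), 2))
# Expected to come close to the fit reported in (11.61).
```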


Example 11.6 (GMM Fit)
When we run EM on our example from Figure 11.3, we obtain the final result shown in Figure 11.8(a) after five iterations, and Figure 11.8(b) shows how the negative log-likelihood evolves as a function of the EM iterations. The final GMM is given as

p(x) = 0.29 \, \mathcal{N}(x \mid -2.75, 0.06) + 0.28 \, \mathcal{N}(x \mid -0.50, 0.25) + 0.43 \, \mathcal{N}(x \mid 3.64, 1.63) .   (11.61)

Figure 11.8 EM algorithm applied to the GMM from Figure 11.3. (a) Final GMM fit: after five iterations, the EM algorithm converges and returns this mixture density. (b) Negative log-likelihood as a function of the EM iterations.

Figure 11.9 illustrates a few steps of the EM algorithm when applied to the two-dimensional dataset shown in Figure 11.1 with K = 3 mixture components.

Figure 11.10 visualizes the final responsibilities of the mixture components for the data points. It becomes clear that there are data points that cannot be uniquely assigned to a single (either blue or yellow) component, such that the responsibilities of these two clusters for those points are around 0.5.

11.4 Latent Variable Perspective


We can look at the GMM from the perspective of a discrete latent-variable model, i.e., where the latent variable z can attain only a finite set of values. This is in contrast to PCA, where the latent variables were continuous-valued numbers in R^M. Let us assume that we have a mixture model with K components and that a data point x is generated by exactly one component. We can then use a discrete indicator variable zk ∈ {0, 1} that indicates whether the kth mixture component generated that data point, so that

p(x \mid z_k = 1) = \mathcal{N}(x \mid \mu_k, \Sigma_k) .   (11.62)

Figure 11.9 Illustration of the EM algorithm for fitting a Gaussian mixture model with three components to a two-dimensional dataset. (a) Dataset; (b) negative log-likelihood (lower is better) as a function of the EM iterations; the red dots indicate the iterations for which the corresponding GMM fits are shown in (c)–(f), and the yellow discs indicate the means of the Gaussian distributions. (c) EM initialization; (d) EM after 1 iteration; (e) EM after 10 iterations; (f) EM after 63 iterations.

We define z := [z_1, \dots, z_K]^\top \in \mathbb{R}^K as a vector consisting of K − 1 many 0s and exactly one 1. Because of this, it also holds that \sum_{k=1}^{K} z_k = 1. Therefore, z is a one-hot encoding (also: 1-of-K representation). This allows us to write the conditional distribution as

p(x \mid z) = \prod_{k=1}^{K} \mathcal{N}(x \mid \mu_k, \Sigma_k)^{z_k} ,   (11.63)

where zk ∈ {0, 1}. Thus far, we assumed that the indicator variables zk are known. However, in practice, this is not the case, and we place a prior distribution on z.

In the following, we will discuss the prior p(z), the marginal p(x), and the posterior p(z | x) for the case of observing a single data point x (the corresponding graphical model is shown in Figure 11.11) before extending the concepts to the general case where the dataset consists of N data points.

Figure 11.11 Graphical model for a GMM with a single data point.
Figure 11.10 Dataset colored according to the responsibilities of the mixture components when EM converges. While a single mixture component is clearly responsible for the data on the left, the overlap of the two data clusters on the right could have been generated by two mixture components.

11.4.1 Prior

Given that we do not know which mixture component generated the data point, we treat the indicators z as a latent variable and place a prior p(z) on it. Then the prior p(zk = 1) = πk describes the probability that the kth mixture component generated data point x. To ensure that our probability distribution is normalized, we require that \sum_{k=1}^{K} \pi_k = 1. We summarize the prior p(z) in the probability vector π = [π1, ..., πK]^⊤. Because of the one-hot encoding of z, we can write the prior in the less obvious form

p(z) = \prod_{k=1}^{K} \pi_k^{z_k} , \quad z_k \in \{0, 1\} ,   (11.64)

but this form will become handy later on.

Remark (Sampling from a GMM). The construction of this latent-variable model lends itself to a very simple sampling procedure to generate data:

1. Sample z_i ∼ p(z | π).
2. Sample x_i ∼ p(x | z_i).

In the first step, we would select a mixture component at random according to π; in the second step, we would draw a sample from a single mixture component. This kind of sampling, where samples of random variables depend on samples from the variables’ parents in the graphical model, is called ancestral sampling. This means we can generate data from the mixture model by generating a latent variable value k ∈ {1, ..., K} to identify a single mixture component, and then generating a data point x_i by sampling from this mixture component. We can discard the samples of the latent variable so that we are left with the x_i, which are valid samples from our mixture model. ♦
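A minimal sketch of ancestral sampling from a one-dimensional GMM is given below, using the fitted parameters from (11.61). The random seed and the representation of z by its component index (rather than a one-hot vector) are implementation choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Parameters of the fitted GMM from (11.61): weights, means, variances.
pi = np.array([0.29, 0.28, 0.43])
mu = np.array([-2.75, -0.50, 3.64])
var = np.array([0.06, 0.25, 1.63])

def sample_gmm(n_samples):
    """Ancestral sampling: first z_i ~ p(z | pi), then x_i ~ p(x | z_i)."""
    # Step 1: sample the component index k (the one-hot z_i is represented by its index).
    k = rng.choice(len(pi), size=n_samples, p=pi)
    # Step 2: sample x_i from the selected Gaussian component.
    x = rng.normal(loc=mu[k], scale=np.sqrt(var[k]))
    # The latent samples k can be discarded; x are valid samples from the mixture.
    return x

print(sample_gmm(5))
```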


11.4.2 Marginal

If we marginalize out the latent variables z (by summing over all possible one-hot encodings), we obtain the marginal distribution

p(x) = \sum_{z} p(x, z \mid \pi) = \sum_{z} \prod_{k=1}^{K} \bigl( \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k) \bigr)^{z_k}   (11.65a)

= \pi_1 \, \mathcal{N}(x \mid \mu_1, \Sigma_1) + \cdots + \pi_K \, \mathcal{N}(x \mid \mu_K, \Sigma_K)   (11.65b)

= \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k) ,   (11.65c)

which is identical to the GMM we introduced in (11.3): the marginal distribution p(x) is a Gaussian mixture model. Therefore, the latent-variable model with latent indicators zk is an equivalent way of thinking about a Gaussian mixture model.
11.4.3 Posterior

Let us have a brief look at the posterior distribution on the latent variable z. According to Bayes' theorem, the posterior is

p(z \mid x) = \frac{p(x, z)}{p(x)} ,   (11.66)

where p(x) is given in (11.65c). With (11.63) and (11.64), we get the joint distribution as

p(x, z) = p(x \mid z) \, p(z) = \prod_{k=1}^{K} \mathcal{N}(x \mid \mu_k, \Sigma_k)^{z_k} \prod_{k=1}^{K} \pi_k^{z_k}   (11.67a)

= \prod_{k=1}^{K} \bigl( \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k) \bigr)^{z_k} .   (11.67b)

Here, we identify p(x \mid z) = \prod_{k=1}^{K} \mathcal{N}(x \mid \mu_k, \Sigma_k)^{z_k} as the likelihood. This yields the posterior distribution for the kth indicator variable zk:

p(z_k = 1 \mid x) = \frac{p(x \mid z_k = 1) \, p(z_k = 1)}{\sum_{j=1}^{K} p(z_j = 1) \, p(x \mid z_j = 1)} = \frac{\pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x \mid \mu_j, \Sigma_j)} ,   (11.68)

which we identify as the responsibility of the kth mixture component for data point x. Note that we omitted the explicit conditioning on the GMM parameters πk, µk, Σk, k = 1, ..., K.

11.4.4 Extension to a Full Dataset

Thus far, we have only discussed the case where the dataset consists of a single data point x. However, the concepts can be directly extended to the case of N data points x_1, ..., x_N, which we collect in the data matrix X. Every data point x_n possesses its own latent variable

z_n = [z_{n1}, \dots, z_{nK}]^\top \in \mathbb{R}^K .   (11.69)

Previously (when we only considered a single data point x), we omitted the index n, but now this becomes important. We collect all of these latent variables in the matrix Z. We share the same prior distribution π across all data points. The corresponding graphical model is shown in Figure 11.12, where we use the plate notation.

Figure 11.12 Graphical model for a GMM with N data points.

The likelihood p(X | Z) factorizes over the data points, such that the joint distribution (11.67b) is given as

p(X, Z) = p(X \mid Z) \, p(Z) = \prod_{n=1}^{N} \prod_{k=1}^{K} \bigl( \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \bigr)^{z_{nk}} .   (11.70)

Generally, the posterior distribution p(z_{nk} = 1 | x_n) is the probability that the kth mixture component generated data point x_n and corresponds to the responsibility rnk we introduced in (11.18). Now, the responsibilities also have not only an intuitive but also a mathematically justified interpretation as posterior probabilities.
11.4.5 EM Algorithm Revisited

The EM algorithm that we introduced as an iterative scheme for maximum likelihood estimation can be derived in a principled way from the latent-variable perspective. Given a current setting θ^{(t)} of model parameters, the E-step calculates the expected log-likelihood

Q(\theta \mid \theta^{(t)}) = \mathbb{E}_{Z \mid X, \theta^{(t)}}[\log p(X, Z \mid \theta)]   (11.71a)

= \int \log p(X, Z \mid \theta) \, p(Z \mid X, \theta^{(t)}) \, dZ ,   (11.71b)

where the expectation of the log-joint distribution of the latent variables Z and the observations X is taken with respect to the posterior p(Z | X, θ^{(t)}) of the latent variables. The M-step selects an updated set of model parameters θ^{(t+1)} by maximizing (11.71b).

Although an EM iteration does increase the log-likelihood, there are no guarantees that EM converges to the maximum likelihood solution. It is possible that the EM algorithm converges to a local maximum of the log-likelihood. Different initializations of the parameters θ could be used in multiple EM runs to reduce the risk of ending up in a bad local optimum. We do not go into further details here, but refer to the excellent expositions by Rogers and Girolami (2016) and Bishop (2006).
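For the GMM, the expectation in (11.71a) can be written out explicitly: with responsibilities r_{nk}^{(t)} computed under θ^{(t)}, one obtains Q(θ | θ^{(t)}) = \sum_n \sum_k r_{nk}^{(t)} [\log \pi_k + \log \mathcal{N}(x_n \mid \mu_k, \Sigma_k)]. The sketch below evaluates this quantity for the running example; the comparison at the end (using the rounded parameter values from Examples 11.3–11.5) is only meant to illustrate that the M-step increases Q.

```python
import numpy as np
from scipy.stats import norm

def gmm_q(X, r_old, pi, mu, var):
    """Expected complete-data log-likelihood Q(theta | theta_old) for a 1-D GMM.

    r_old are the responsibilities (11.57) computed under the old parameters;
    (pi, mu, var) are the parameters at which Q is evaluated.
    """
    log_joint = np.log(pi) + norm.logpdf(X[:, None], loc=mu, scale=np.sqrt(var))
    return np.sum(r_old * log_joint)    # sum over data points n and components k

X = np.array([-3.0, -2.5, -1.0, 0.0, 2.0, 4.0, 5.0])
pi0, mu0, var0 = np.full(3, 1 / 3), np.array([-4.0, 0.0, 8.0]), np.array([1.0, 0.2, 3.0])

# Responsibilities under the initial parameters (the E-step at theta^(0)).
weighted = pi0 * norm.pdf(X[:, None], loc=mu0, scale=np.sqrt(var0))
r_old = weighted / weighted.sum(axis=1, keepdims=True)

# The M-step maximizes Q over (pi, mu, var); the (rounded) updated parameters from
# Examples 11.3-11.5 should therefore yield a larger Q than the initial ones.
print(gmm_q(X, r_old, pi0, mu0, var0))
print(gmm_q(X, r_old, np.array([0.29, 0.29, 0.42]),
            np.array([-2.7, -0.4, 3.7]), np.array([0.14, 0.44, 1.53])))
```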

11.5 Further Reading


The GMM can be considered a generative model in the sense that it is straightforward to generate new data using ancestral sampling (Bishop, 2006). For given GMM parameters πk, µk, Σk, k = 1, ..., K, we sample an index k from the probability vector [π1, ..., πK]^⊤ and then sample a data point x ∼ N(µk, Σk). If we repeat this N times, we obtain a dataset that has been generated by a GMM. Figure 11.1 was generated using this procedure.

Throughout this chapter, we assumed that the number of components K is known. In practice, this is often not the case. However, we could use cross-validation, as discussed in Section 8.4, to find good models.

Gaussian mixture models are closely related to the K-means clustering algorithm. K-means also uses the EM algorithm to assign data points to clusters. If we treat the means in the GMM as cluster centers and ignore the covariances, we arrive at K-means. As also nicely described by MacKay (2003), K-means makes a “hard” assignment of data points to cluster centers µk, whereas a GMM makes a “soft” assignment via the responsibilities.

We only touched upon the latent-variable perspective of GMMs and the EM algorithm. Note that EM can be used for parameter learning in general latent-variable models, e.g., nonlinear state-space models (Ghahramani and Roweis, 1999; Roweis and Ghahramani, 1999), and for reinforcement learning as discussed by Barber (2012). Therefore, the latent-variable perspective of a GMM is useful to derive the corresponding EM algorithm in a principled way (Bishop, 2006; Barber, 2012; Murphy, 2012).

We only discussed maximum likelihood estimation (via the EM algorithm) for finding GMM parameters. The standard criticisms of maximum likelihood also apply here:

• As in linear regression, maximum likelihood can suffer from severe overfitting. In the GMM case, this happens when the mean of a mixture component is identical to a data point and the covariance tends to 0. Then, the likelihood approaches infinity. Bishop (2006) and Barber (2012) discuss this issue in detail.
• We only obtain a point estimate of the parameters πk, µk, Σk for k = 1, ..., K, which does not give any indication of uncertainty in the parameter values. A Bayesian approach would place a prior on the parameters, which can be used to obtain a posterior distribution on the parameters. This posterior allows us to compute the model evidence (marginal likelihood), which can be used for model comparison, giving us a principled way to determine the number of mixture components. Unfortunately, closed-form inference is not possible in this setting because there is no conjugate prior for this model. However, approximations, such as variational inference, can be used to obtain an approximate posterior (Bishop, 2006).

In this chapter, we discussed mixture models for density estimation. There is a plethora of density estimation techniques available. In practice, we often use histograms and kernel density estimation.

Figure 11.13 Histogram (orange bars) and kernel density estimate (blue line). The kernel density estimator (with a Gaussian kernel) produces a smooth estimate of the underlying density, whereas the histogram is simply an unsmoothed count of how many data points (black) fall into each bin.

Histograms provide a non-parametric way to represent continuous densities and were proposed by Pearson (1895). A histogram is constructed by “binning” the data space and counting how many data points fall into each bin. Then a bar is drawn at the center of each bin, and the height of the bar is proportional to the number of data points within that bin. The bin size is a critical hyper-parameter, and a bad choice can lead to overfitting or underfitting. Cross-validation, as discussed in Section 8.4.1, can be used to determine a good bin size.

Kernel density estimation, independently proposed by Rosenblatt (1956) and Parzen (1962), is a nonparametric way for density estimation. Given N i.i.d. samples, the kernel density estimator represents the underlying distribution as

p(x) = \frac{1}{Nh} \sum_{n=1}^{N} k\!\left( \frac{x - x_n}{h} \right) ,   (11.72)

where k is a kernel function, i.e., a non-negative function that integrates to 1, and h > 0 is a smoothing/bandwidth parameter, which plays a similar role as the bin size in histograms. Note that we place a kernel on every single data point xn in the dataset. Commonly used kernel functions are the uniform distribution and the Gaussian distribution. Kernel density estimates are closely related to histograms, but by choosing a suitable kernel, we can guarantee the smoothness of the density estimate. Figure 11.13 illustrates the difference between a histogram and a kernel density estimator (with a Gaussian-shaped kernel) for a given dataset of 250 data points.
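Both the histogram and the kernel density estimator (11.72) are straightforward to implement. The sketch below uses a synthetic dataset of 250 points drawn from the fitted GMM (11.61) as a stand-in for the data in Figure 11.13, a Gaussian kernel, and an arbitrarily chosen bandwidth h = 0.3; the actual data and bandwidth behind Figure 11.13 are not specified in the text.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
# Synthetic stand-in data: 250 samples from the fitted GMM in (11.61).
k = rng.choice(3, size=250, p=[0.29, 0.28, 0.43])
data = rng.normal(loc=np.array([-2.75, -0.50, 3.64])[k],
                  scale=np.sqrt(np.array([0.06, 0.25, 1.63])[k]))

# Histogram density estimate: count the points per bin and normalize (density=True).
hist, bin_edges = np.histogram(data, bins=20, density=True)

def kde(x, data, h=0.3):
    """Kernel density estimate (11.72) with a standard Gaussian kernel and bandwidth h."""
    x = np.atleast_1d(x)
    # Place a kernel on every data point x_n and average: (1 / (N h)) * sum_n k((x - x_n) / h).
    return norm.pdf((x[:, None] - data[None, :]) / h).sum(axis=1) / (len(data) * h)

grid = np.linspace(-5, 8, 200)
print(kde(grid, data)[:5])      # smooth density estimate on a grid of evaluation points
```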
