11 Density Estimation with Gaussian Mixture Models
[Figure 11.1: Two-dimensional data set that cannot be meaningfully represented by a Gaussian.]
11.1 Gaussian Mixture Model
The distributions we have encountered so far have limited modeling capabilities. For example, a Gaussian approximation of the density that generated the data in Figure 11.1 would be a poor approximation. In the following, we will look at a more expressive family of distributions, which we can use for density estimation: mixture models.
Mixture models can be used to describe a distribution p(x) by a convex combination of K simple (base) distributions

p(x) = \sum_{k=1}^{K} \pi_k\, p_k(x) ,   (11.1)

0 \leq \pi_k \leq 1 , \quad \sum_{k=1}^{K} \pi_k = 1 ,   (11.2)

where the \pi_k are called mixture weights. A Gaussian mixture model (GMM) is a mixture model whose components are Gaussian distributions, i.e.,

p(x) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x \mid \mu_k, \Sigma_k) ,   (11.3)

0 \leq \pi_k \leq 1 , \quad \sum_{k=1}^{K} \pi_k = 1 .   (11.4)
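To make the density in (11.3) concrete, here is a minimal sketch (Python with NumPy/SciPy; the function name gmm_density and the univariate parameterization are our own illustrative choices, not from the text) that evaluates a one-dimensional GMM as a convex combination of weighted Gaussian components:

```python
import numpy as np
from scipy.stats import norm

def gmm_density(x, weights, means, stds):
    """Evaluate a 1D GMM density p(x) = sum_k pi_k N(x | mu_k, sigma_k^2), cf. eq. (11.3)."""
    x = np.atleast_1d(x)
    # Sum the weighted component densities (convex combination).
    return sum(pi * norm.pdf(x, loc=mu, scale=sd)
               for pi, mu, sd in zip(weights, means, stds))

# Example: three components with equal weights (cf. the running example below).
weights = [1/3, 1/3, 1/3]
means = [-4.0, 0.0, 8.0]
stds = [np.sqrt(1.0), np.sqrt(0.2), np.sqrt(3.0)]
print(gmm_density(np.array([-3.0, 0.0, 5.0]), weights, means, stds))
```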
[Figure 11.2: A Gaussian mixture density is composed of Gaussian distributions and is more expressive than any individual component. Dashed lines represent the weighted Gaussian components.]

[Figure 11.3: Initial setting: GMM (black) with three mixture components.]
11.2 Parameter Learning via Maximum Likelihood
We initialize the mixture components as

p_1(x) = \mathcal{N}(x \mid -4, 1)   (11.6)
p_2(x) = \mathcal{N}(x \mid 0, 0.2)   (11.7)
p_3(x) = \mathcal{N}(x \mid 8, 3)   (11.8)

and assign them equal weights \pi_1 = \pi_2 = \pi_3 = \tfrac{1}{3}. The corresponding model (and the data points) are shown in Figure 11.3.
The log-likelihood of the GMM for a dataset X = {x_1, ..., x_N} is

\mathcal{L} = \log p(X \mid \theta) = \sum_{n=1}^{N} \log \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k) .   (11.10)

If we were to model the data with a single Gaussian instead of a mixture, the log would act directly on the Gaussian density, i.e.,

\log p(X \mid \theta) = \sum_{n=1}^{N} \log \mathcal{N}(x_n \mid \mu, \Sigma) .   (11.11)

This simple form allows us to find closed-form maximum likelihood estimates of μ and Σ, as discussed in Chapter 8. However, in (11.10), we cannot move the log into the sum over k, so that we cannot obtain a simple closed-form maximum likelihood solution. Instead, we can exploit an iterative scheme to find good model parameters θ_ML: the EM algorithm.
Any local optimum of a function exhibits the property that its gradient with respect to the parameters must vanish (necessary condition); see Chapter 7. In our case, we obtain the following necessary conditions when we optimize the log-likelihood with respect to the GMM parameters:

\frac{\partial \mathcal{L}}{\partial \mu_k} = \mathbf{0}^\top , \quad \frac{\partial \mathcal{L}}{\partial \Sigma_k} = \mathbf{0} , \quad \frac{\partial \mathcal{L}}{\partial \pi_k} = 0 , \quad k = 1, \dots, K .
For all three necessary conditions, by applying the chain rule (see Section 5.2.2), we require partial derivatives of the form

\frac{\partial \log p(x_n \mid \theta)}{\partial \theta} = \frac{1}{p(x_n \mid \theta)} \frac{\partial p(x_n \mid \theta)}{\partial \theta} ,   (11.15)

where θ = {μ_k, Σ_k, π_k : k = 1, ..., K} comprises all model parameters and

\frac{1}{p(x_n \mid \theta)} = \frac{1}{\sum_{j=1}^{K} \pi_j\, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)} .   (11.16)
Theorem 11.1 (Update of the GMM Means). The update of the mean parameters μ_k, k = 1, ..., K, of the GMM is given by

\mu_k = \frac{1}{N_k} \sum_{n=1}^{N} r_{nk}\, x_n ,   (11.17)

where we define

r_{nk} := \frac{\pi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j\, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)} ,   (11.18)

N_k := \sum_{n=1}^{N} r_{nk} .   (11.19)
Proof. From (11.15), we see that the gradient of the log-likelihood with respect to the mean parameters μ_k, k = 1, ..., K, requires us to compute the partial derivative

\frac{\partial p(x_n \mid \theta)}{\partial \mu_k} = \sum_{j=1}^{K} \pi_j \frac{\partial \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}{\partial \mu_k} = \pi_k \frac{\partial \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\partial \mu_k}   (11.20)

= \pi_k (x_n - \mu_k)^\top \Sigma_k^{-1} \mathcal{N}(x_n \mid \mu_k, \Sigma_k) ,   (11.21)
where we exploited that only the kth mixture component depends on μ_k.
We use our result from (11.21) in (11.15) and put everything together so that the desired partial derivative of L with respect to μ_k is given as

\frac{\partial \mathcal{L}}{\partial \mu_k} = \sum_{n=1}^{N} \frac{\partial \log p(x_n \mid \theta)}{\partial \mu_k} = \sum_{n=1}^{N} \frac{1}{p(x_n \mid \theta)} \frac{\partial p(x_n \mid \theta)}{\partial \mu_k}   (11.22)

= \sum_{n=1}^{N} (x_n - \mu_k)^\top \Sigma_k^{-1} \underbrace{\frac{\pi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j\, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}}_{= r_{nk}}   (11.23)

= \sum_{n=1}^{N} r_{nk} (x_n - \mu_k)^\top \Sigma_k^{-1} .   (11.24)
Here, we used the identity from (11.16) and the result of the partial derivative in (11.21) to get to the second row. The values r_nk are often called responsibilities.
Remark. The responsibility r_nk of the kth mixture component for data point x_n is proportional to the likelihood

p(x_n \mid \pi_k, \mu_k, \Sigma_k) = \pi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)   (11.25)

of the mixture component given the data point (the denominator in the definition of r_nk is constant for all mixture components and serves as a normalizer). Therefore, mixture components have a high responsibility for a data point when the data point could be a plausible sample from that mixture component. ♦
From the definition of r_nk in (11.18) it is clear that [r_{n1}, ..., r_{nK}]^\top is a probability vector, i.e., \sum_k r_{nk} = 1 with r_{nk} \geq 0. This probability vector distributes probability mass among the K mixture components, and, intuitively, every r_nk expresses the probability that x_n has been generated by the kth mixture component.
We now solve for μ_k so that \partial \mathcal{L} / \partial \mu_k = \mathbf{0}^\top and obtain

\sum_{n=1}^{N} r_{nk}\, x_n = \sum_{n=1}^{N} r_{nk}\, \mu_k \iff \mu_k = \frac{\sum_{n=1}^{N} r_{nk}\, x_n}{\sum_{n=1}^{N} r_{nk}} = \frac{1}{N_k} \sum_{n=1}^{N} r_{nk}\, x_n ,   (11.26)

where we defined

N_k := \sum_{n=1}^{N} r_{nk}   (11.27)
as the total responsibility of the kth mixture component across the entire dataset. This concludes the proof of Theorem 11.1.
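As a small illustration of the mean update just derived in (11.26), the following sketch computes all K means at once from a responsibility matrix. The function name and array conventions (X is N×D, R is N×K) are our own, not from the text:

```python
import numpy as np

def update_means(X, R):
    """M-step update of the GMM means (a sketch of eq. (11.26)).

    X: (N, D) data matrix; R: (N, K) responsibility matrix r_nk.
    Returns a (K, D) matrix whose k-th row is mu_k = (1/N_k) sum_n r_nk x_n.
    """
    Nk = R.sum(axis=0)                 # total responsibility N_k per component, eq. (11.27)
    return (R.T @ X) / Nk[:, None]     # responsibility-weighted means, one row per component
```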
Intuitively, (11.17) can be interpreted as a Monte-Carlo estimate of the
mean of weighted data points xn where every xn is weighted by the
responsibility r_nk of the kth cluster for x_n. Therefore, the mean μ_k is pulled toward a data point x_n with strength given by r_nk. Intuitively, the means are pulled stronger toward data points for which the corresponding mixture component has a high responsibility, i.e., a high likelihood. Figure 11.4 illustrates this. We can also interpret the mean update in (11.17) as the expected value of all data points under the distribution given by

r_k := [r_{1k}, \dots, r_{Nk}]^\top / N_k ,   (11.28)

which is a normalized probability vector, i.e.,

\mu_k \leftarrow \mathbb{E}_{r_k}[X] .   (11.29)

[Figure 11.4: Update of the mean parameter of a mixture component in a GMM. The mean μ is being pulled toward individual data points with the weights given by the corresponding responsibilities. The mean update is then a weighted average of the data points.]

Example 11.2 (Responsibilities)
For our example from Figure 11.3, we compute the responsibilities r_nk:

r = \begin{pmatrix}
1.0 & 0.0 & 0.0 \\
1.0 & 0.0 & 0.0 \\
0.057 & 0.943 & 0.0 \\
0.001 & 0.999 & 0.0 \\
0.0 & 0.066 & 0.934 \\
0.0 & 0.0 & 1.0 \\
0.0 & 0.0 & 1.0
\end{pmatrix} \in \mathbb{R}^{N \times K} .   (11.30)

Here, the nth row tells us the responsibilities of all mixture components for x_n. The sum of all K responsibilities for a data point (the sum of every row) is 1. The kth column gives us an overview of the responsibility of the kth mixture component. We can see that the third mixture component (third column) is not responsible for any of the first four data points, but takes on much of the responsibility for the remaining data points. The sum of all entries of a column gives us the value N_k, i.e., the total responsibility of the kth mixture component. In our example, we get N_1 = 2.057, N_2 = 2.009, N_3 = 2.934.
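The responsibility matrix in (11.30) can be reproduced with a few lines of NumPy/SciPy. The seven data point values below are our assumption of the dataset plotted in Figure 11.3 (the text does not list them); with the initialization from (11.6)-(11.8), the rounded result closely matches (11.30):

```python
import numpy as np
from scipy.stats import norm

# Responsibilities r_nk from eq. (11.18) for the running example.
# X below is our assumed version of the dataset shown in Figure 11.3.
X = np.array([-3.0, -2.5, -1.0, 0.0, 2.0, 4.0, 5.0])
weights = np.array([1/3, 1/3, 1/3])
means = np.array([-4.0, 0.0, 8.0])
variances = np.array([1.0, 0.2, 3.0])

# Weighted likelihood of each point under each component: shape (N, K).
lik = weights * norm.pdf(X[:, None], loc=means, scale=np.sqrt(variances))
R = lik / lik.sum(axis=1, keepdims=True)   # normalize each row, eq. (11.18)
Nk = R.sum(axis=0)                         # total responsibilities, eq. (11.19)

print(np.round(R, 3))   # approximately reproduces the matrix in (11.30)
print(Nk)
```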
[Figure 11.5: (a) GMM density and individual components prior to updating the mean values; (b) GMM density and individual components after updating the mean values.]
In our example from Figure 11.3, the mean values are updated as follows:
µ1 : −4 → −2.7 (11.31)
µ2 : 0 → −0.4 (11.32)
µ3 : 8 → 3.7 (11.33)
Here, we see that the means of the first and third mixture component
move toward the regime of the data, whereas the mean of the second
component does not change so dramatically. Figure 11.5 illustrates this
change, where Figure 11.5(a) shows the GMM density prior to updating
the means and Figure 11.5(b) shows the GMM density after updating the
mean values µk .
The update of the mean parameters in (11.17) looks fairly straightforward. However, note that the responsibilities r_nk are a function of π_j, μ_j, Σ_j for all j = 1, ..., K, such that the updates in (11.17) depend on all parameters of the GMM, and a closed-form solution, as we obtained for linear regression in Section 9.2 or PCA in Chapter 10, is not possible.
Theorem 11.2 (Update of the GMM Covariances). The update of the covariance parameters Σ_k, k = 1, ..., K, of the GMM is given by

\Sigma_k = \frac{1}{N_k} \sum_{n=1}^{N} r_{nk} (x_n - \mu_k)(x_n - \mu_k)^\top ,   (11.34)

where r_nk and N_k are defined in (11.18) and (11.19), respectively.
Proof. Analogously to (11.15), we require the partial derivative \partial p(x_n \mid \theta) / \partial \Sigma_k (cf. (11.35)). Writing out the Gaussian density and applying the product rule yields

\frac{\partial p(x_n \mid \theta)}{\partial \Sigma_k} = \pi_k (2\pi)^{-\frac{D}{2}} \Big[ \frac{\partial}{\partial \Sigma_k} \det(\Sigma_k)^{-\frac{1}{2}} \exp\!\big(-\tfrac{1}{2}(x_n - \mu_k)^\top \Sigma_k^{-1} (x_n - \mu_k)\big) + \det(\Sigma_k)^{-\frac{1}{2}} \frac{\partial}{\partial \Sigma_k} \exp\!\big(-\tfrac{1}{2}(x_n - \mu_k)^\top \Sigma_k^{-1} (x_n - \mu_k)\big) \Big] .   (11.36c)
We now use the identities

\frac{\partial}{\partial \Sigma_k} \det(\Sigma_k)^{-\frac{1}{2}} = -\tfrac{1}{2} \det(\Sigma_k)^{-\frac{1}{2}} \Sigma_k^{-1} ,   (11.37)

\frac{\partial}{\partial \Sigma_k} (x_n - \mu_k)^\top \Sigma_k^{-1} (x_n - \mu_k) = -\Sigma_k^{-1} (x_n - \mu_k)(x_n - \mu_k)^\top \Sigma_k^{-1}   (11.38)
and obtain (after some re-arranging) the desired partial derivative required in (11.35) as

\frac{\partial p(x_n \mid \theta)}{\partial \Sigma_k} = \pi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \cdot \left(-\tfrac{1}{2}\right) \left(\Sigma_k^{-1} - \Sigma_k^{-1}(x_n - \mu_k)(x_n - \mu_k)^\top \Sigma_k^{-1}\right) .   (11.39)
Putting this together with (11.15), the partial derivative of L with respect to Σ_k is

\frac{\partial \mathcal{L}}{\partial \Sigma_k} = \sum_{n=1}^{N} \frac{\pi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j\, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)} \cdot \left(-\tfrac{1}{2}\right) \left(\Sigma_k^{-1} - \Sigma_k^{-1}(x_n - \mu_k)(x_n - \mu_k)^\top \Sigma_k^{-1}\right)   (11.40b)

= -\frac{1}{2} \sum_{n=1}^{N} r_{nk} \left(\Sigma_k^{-1} - \Sigma_k^{-1}(x_n - \mu_k)(x_n - \mu_k)^\top \Sigma_k^{-1}\right)   (11.40c)

= -\frac{1}{2} \Sigma_k^{-1} \underbrace{\sum_{n=1}^{N} r_{nk}}_{= N_k} + \frac{1}{2} \Sigma_k^{-1} \left( \sum_{n=1}^{N} r_{nk} (x_n - \mu_k)(x_n - \mu_k)^\top \right) \Sigma_k^{-1} .   (11.40d)
We see that the responsibilities r_nk also appear in this partial derivative. Setting this partial derivative to 0, we obtain the necessary optimality condition

N_k \Sigma_k^{-1} = \Sigma_k^{-1} \left( \sum_{n=1}^{N} r_{nk} (x_n - \mu_k)(x_n - \mu_k)^\top \right) \Sigma_k^{-1}   (11.41a)

\iff N_k I = \left( \sum_{n=1}^{N} r_{nk} (x_n - \mu_k)(x_n - \mu_k)^\top \right) \Sigma_k^{-1}   (11.41b)
\iff \Sigma_k = \frac{1}{N_k} \sum_{n=1}^{N} r_{nk} (x_n - \mu_k)(x_n - \mu_k)^\top ,   (11.41c)
which gives us a simple update rule for Σ_k for k = 1, ..., K and proves Theorem 11.2.
Similar to the update of μ_k in (11.17), we can interpret the update of the covariance in (11.34) as an expected value

\mathbb{E}_{r_k}[\tilde{X}_k^\top \tilde{X}_k] ,   (11.42)

where \tilde{X}_k := [x_1 - \mu_k, \dots, x_N - \mu_k]^\top is the data matrix X centered at μ_k, and r_k is the probability vector defined in (11.28).
In our example from Figure 11.3, the variances are updated as follows:
\sigma_1^2: 1 \to 0.14   (11.43)
\sigma_2^2: 0.2 \to 0.44   (11.44)
\sigma_3^2: 3 \to 1.53   (11.45)
Here, we see that the variances of the first and third components shrink significantly, whereas the variance of the second component increases slightly. Figure 11.6 illustrates this setting. Figure 11.6(a) is identical to Figure 11.5(b) (but zoomed in) and shows the GMM density and its individual components prior to updating the variances; Figure 11.6(b) shows the GMM density after updating the variances.
Similar to the update of the mean parameters, we can interpret (11.34) as a Monte-Carlo estimate of the weighted covariance of data points x_n associated with the kth mixture component, where the weights are the responsibilities r_nk. As with the updates of the mean parameters, this update depends on all π_j, μ_j, Σ_j, j = 1, ..., K, through the responsibilities r_nk, which prohibits a closed-form solution.
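A corresponding sketch of the covariance update (11.34), using the same assumed array conventions as in the earlier mean-update sketch (X is N×D, R is the N×K responsibility matrix; the function name is ours):

```python
import numpy as np

def update_covariances(X, R, means):
    """M-step update of the GMM covariances (a sketch of eq. (11.34)).

    X: (N, D) data, R: (N, K) responsibilities, means: (K, D) current means.
    Returns a (K, D, D) array of covariance matrices.
    """
    N, D = X.shape
    K = R.shape[1]
    Nk = R.sum(axis=0)                                   # total responsibilities, eq. (11.19)
    covs = np.zeros((K, D, D))
    for k in range(K):
        Xc = X - means[k]                                # center data at mu_k
        covs[k] = (R[:, k, None] * Xc).T @ Xc / Nk[k]    # responsibility-weighted outer products
    return covs
```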
Theorem 11.3 (Update of the GMM Mixture Weights). The mixture weights of the GMM are updated as

\pi_k = \frac{N_k}{N}   (11.46)

for k = 1, ..., K, where N is the number of data points and N_k is defined in (11.19).
Proof. To find the partial derivative of the log-likelihood with respect to the weight parameters π_k, k = 1, ..., K, we take the constraint \sum_k \pi_k = 1 into account by using Lagrange multipliers (see Section 7.2). The Lagrangian is

\mathfrak{L} = \mathcal{L} + \lambda \left( \sum_{k=1}^{K} \pi_k - 1 \right)   (11.47a)

= \sum_{n=1}^{N} \log \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k) + \lambda \left( \sum_{k=1}^{K} \pi_k - 1 \right) ,   (11.47b)

where L is the log-likelihood from (11.10) and the second term encodes the equality constraint that all the mixture weights need to sum up to 1. We obtain the partial derivatives with respect to π_k and the Lagrange multiplier λ:

\frac{\partial \mathfrak{L}}{\partial \pi_k} = \sum_{n=1}^{N} \frac{\mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j\, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)} + \lambda = \frac{N_k}{\pi_k} + \lambda ,   (11.48)

\frac{\partial \mathfrak{L}}{\partial \lambda} = \sum_{k=1}^{K} \pi_k - 1 .   (11.49)
Setting both partial derivatives to 0 and solving for λ and π_k yields λ = -N and π_k = N_k/N, which gives us the update for the weight parameters π_k and proves Theorem 11.3.

We can identify the mixture weight in (11.46) as the ratio of the total responsibility of the kth cluster and the number of data points. Since N = \sum_k N_k, the number of data points can also be interpreted as the total responsibility of all mixture components together, such that π_k is the relative importance of the kth mixture component for the dataset.
Remark. Since N_k = \sum_{n=1}^{N} r_{nk}, the update equation (11.46) for the mixture weights π_k also depends on all π_j, μ_j, Σ_j, j = 1, ..., K, via the responsibilities r_nk. ♦
[Figure 11.7: (a) GMM density and individual components prior to updating the mixture weights; (b) GMM density and individual components after updating the mixture weights while retaining the means and variances. Note the different scalings of the vertical axes.]
In our running example from Figure 11.3, the mixture weights are updated as follows:

\pi_1: \tfrac{1}{3} \to 0.29   (11.54)
\pi_2: \tfrac{1}{3} \to 0.29   (11.55)
\pi_3: \tfrac{1}{3} \to 0.42   (11.56)
Here we see that the third component gets more weight/importance,
while the other components become slightly less important. Figure 11.7
illustrates the effect of updating the mixture weights. Figure 11.7(a) is
identical to Figure 11.6(b) and shows the GMM density and its individual
components prior to updating the mixture weights. Figure 11.7(b) shows
the GMM density after updating the mixture weights.
Overall, having updated the means, the variances and the weights once,
we obtain the GMM shown in Figure 11.7(b). Compared with the ini-
tialization shown in Figure 11.3, we can see that the parameter updates
caused the GMM density to shift some of its mass toward the data points.
After updating the means, variances and weights once, the GMM fit in Figure 11.7(b) is already remarkably better than its initialization from Figure 11.3.
[Figure 11.8: EM algorithm applied to the GMM from Figure 11.2. (a) Final GMM fit: after five iterations, the EM algorithm converges and returns this mixture density. (b) Negative log-likelihood as a function of the EM iterations.]
When we run EM on our example from Figure 11.3, we obtain the final result shown in Figure 11.8(a) after five iterations, and Figure 11.8(b) shows how the negative log-likelihood evolves as a function of the EM iterations. The final GMM is given as

p(x) = 0.29\, \mathcal{N}(x \mid -2.75, 0.06) + 0.28\, \mathcal{N}(x \mid -0.50, 0.25) + 0.43\, \mathcal{N}(x \mid 3.64, 1.63) .   (11.61)
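Putting the E-step (11.18) and the M-step updates (11.17), (11.34), and (11.46) together gives a compact EM loop. The sketch below is for the univariate running example; the data values are again our assumption, so the printed parameters should only roughly match (11.61):

```python
import numpy as np
from scipy.stats import norm

def em_gmm_1d(X, means, variances, weights, n_iters=5):
    """EM for a univariate GMM: E-step (11.18), M-step (11.17), (11.34), (11.46). A sketch."""
    X = np.asarray(X, dtype=float)
    N = len(X)
    for _ in range(n_iters):
        # E-step: responsibilities r_nk, shape (N, K).
        lik = weights * norm.pdf(X[:, None], loc=means, scale=np.sqrt(variances))
        R = lik / lik.sum(axis=1, keepdims=True)
        Nk = R.sum(axis=0)
        # M-step: update means, variances, and mixture weights.
        means = (R * X[:, None]).sum(axis=0) / Nk
        variances = (R * (X[:, None] - means) ** 2).sum(axis=0) / Nk
        weights = Nk / N
    return means, variances, weights

# Initialization from (11.6)-(11.8); the seven data points are our assumption.
X = [-3.0, -2.5, -1.0, 0.0, 2.0, 4.0, 5.0]
means, variances, weights = em_gmm_1d(X, np.array([-4.0, 0.0, 8.0]),
                                      np.array([1.0, 0.2, 3.0]),
                                      np.array([1/3, 1/3, 1/3]), n_iters=5)
print(weights, means, variances)   # should roughly match the mixture in (11.61)
```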
Figure 11.9 illustrates a few steps of the EM algorithm when applied to the two-dimensional dataset shown in Figure 11.1 with K = 3 mixture components.

Figure 11.10 visualizes the final responsibilities of the mixture components for the data points. It becomes clear that there are data points that cannot be uniquely assigned to a single (either blue or yellow) component, such that the responsibilities of these two clusters for those points are around 0.5.
[Figure 11.9: Illustration of the EM algorithm for fitting a Gaussian mixture model with three components to a two-dimensional data set. (a) Data set; (b) negative log-likelihood (lower is better) as a function of the EM iterations. The red dots indicate the iterations for which the corresponding GMM fits are shown.]
An indicator z = [z_1, ..., z_K]^\top in one-hot (1-of-K) encoding selects the mixture component that generated a data point, where z_k ∈ {0, 1}. Thus far, we assumed that the indicator variables z_k are known. However, in practice, this is not the case, and we place a prior distribution on z.

[Figure 11.11: Graphical model for a GMM with a single data point x.]

In the following, we will discuss the prior p(z), the joint p(x, z) and the posterior p(z | x) for the case of observing a single data point x (the corresponding graphical model is shown in Figure 11.11) before
extending the concepts to the general case where the dataset consists of N data points.

[Figure 11.10: Final responsibilities of the mixture components after EM converges. While a single mixture component is clearly responsible for the data on the left, the overlap of the two data clusters on the right could have been generated by two mixture components.]
The marginal distribution p(x) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x \mid \mu_k, \Sigma_k) is identical to the GMM we introduced in (11.3). Therefore, the latent variable model with latent indicators z_k is an equivalent way of thinking about a Gaussian mixture model.
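A short sketch of ancestral sampling from this latent variable view: draw the indicator z from a categorical prior with probabilities π_k (the standard choice consistent with the mixture weights in (11.3); the prior is not spelled out in this excerpt), then sample x from the selected Gaussian component. Function and variable names are our own:

```python
import numpy as np

def sample_gmm(n_samples, weights, means, covs, rng=None):
    """Ancestral sampling from the latent variable view of a GMM (a sketch):
    draw z from the categorical prior p(z_k = 1) = pi_k, then draw
    x from the selected Gaussian component N(mu_z, Sigma_z)."""
    rng = np.random.default_rng(rng)
    K = len(weights)
    z = rng.choice(K, size=n_samples, p=weights)   # latent component indicators
    samples = np.stack([rng.multivariate_normal(means[k], covs[k]) for k in z])
    return samples, z
```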
11.5 Further Reading
[Figure 11.13: Histogram (orange bars) and kernel density estimation (blue line). The kernel density estimator (with a Gaussian kernel) produces a smooth estimate of the underlying density, whereas the histogram is simply an unsmoothed count measure of how many data points (black) fall into a single bin.]

Histograms provide a non-parametric way to represent continuous densities and were proposed by Pearson (1895). A histogram is constructed by "binning" the data space and counting how many data points fall into each bin. Then a bar is drawn at the center of each bin, and the height of the bar is proportional to the number of data points within that bin. The bin size is a critical hyper-parameter, and a bad choice can lead to overfitting and underfitting. Cross-validation, as discussed in Section 8.4.1, can be used to determine a good bin size.
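A minimal sketch of the binning step in NumPy (the synthetic data and the choice of 20 bins are arbitrary, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=250)                          # some one-dimensional data

counts, edges = np.histogram(data, bins=20)          # count data points per bin
density = counts / (counts.sum() * np.diff(edges))   # normalize so the bars integrate to 1
centers = 0.5 * (edges[:-1] + edges[1:])             # bar positions at the bin centers
```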
Kernel density estimation, independently proposed by Rosenblatt (1956) and Parzen (1962), is a nonparametric method for density estimation. Given N i.i.d. samples, the kernel density estimator represents the underlying distribution as

p(x) = \frac{1}{N h} \sum_{n=1}^{N} k\!\left(\frac{x - x_n}{h}\right) ,   (11.72)

where k is a kernel function, i.e., a non-negative function that integrates to 1, and h > 0 is a smoothing/bandwidth parameter, which plays a similar role as the bin size in histograms. Note that we place a kernel on every single data point x_n in the dataset. Commonly used kernel functions are the uniform distribution and the Gaussian distribution. Kernel density estimates are closely related to histograms, but by choosing a suitable kernel, we can guarantee smoothness of the density estimate. Figure 11.13 illustrates the difference between a histogram and a kernel density estimator (with a Gaussian-shaped kernel) for a given data set of 250 data points.
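The kernel density estimator (11.72) with a Gaussian kernel can be sketched as follows; the data, grid, and bandwidth value are arbitrary illustrative choices:

```python
import numpy as np
from scipy.stats import norm

def kde_gaussian(x_query, data, h):
    """Kernel density estimate (eq. (11.72)) with a Gaussian kernel; a sketch.

    Places a Gaussian kernel of bandwidth h on every data point and averages them.
    """
    x_query = np.atleast_1d(x_query)
    # norm.pdf((x - x_n)/h) / h is a Gaussian kernel with bandwidth h.
    return norm.pdf((x_query[:, None] - data[None, :]) / h).sum(axis=1) / (len(data) * h)

# Usage: evaluate the estimate on a grid for some 1D data.
data = np.random.default_rng(0).normal(size=250)
grid = np.linspace(-4, 4, 200)
density = kde_gaussian(grid, data, h=0.5)
```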