Vector Calculus
[Figure 5.1: Vector calculus plays a central role in (a) regression (curve fitting) and (b) density estimation, i.e., modeling data distributions. (a) Regression problem: find parameters such that the curve explains the observations (circles) well. (b) Density estimation with a Gaussian mixture model: find means and covariances such that the data (dots) can be explained well.]
Draft chapter (July 2, 2018) from "Mathematics for Machine Learning", © 2018 by Marc Peter Deisenroth, A. Aldo Faisal, and Cheng Soon Ong. To be published by Cambridge University Press. Report errata and feedback to http://mml-book.com. Please do not post or distribute this file; please link to http://mml-book.com.
[Figure: A mind map of the concepts introduced in this chapter (partial derivatives, Jacobian, Hessian, Taylor series), along with where they are used in other parts of the book: Chapter 7 (Optimization), Chapter 9 (Regression), Chapter 10 (Dimensionality Reduction), Chapter 11 (Density Estimation), and Chapter 12 (Classification).]
Definition 5.2 (Derivative). More formally, for h > 0 the derivative of f at x is defined as the limit

$$\frac{\mathrm{d}f}{\mathrm{d}x} := \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}\,, \tag{5.2}$$

and the secant in Figure 5.3 becomes a tangent.
$$= \lim_{h \to 0} \frac{\sum_{i=0}^{n} \binom{n}{i} x^{n-i} h^{i} - x^{n}}{h}\,. \tag{5.5}$$

We see that $x^{n} = \binom{n}{0} x^{n-0} h^{0}$. By starting the sum at 1, the $x^{n}$-term cancels, and we obtain

$$\frac{\mathrm{d}f}{\mathrm{d}x} = \lim_{h \to 0} \frac{\sum_{i=1}^{n} \binom{n}{i} x^{n-i} h^{i}}{h} \tag{5.6}$$
$$= \lim_{h \to 0} \sum_{i=1}^{n} \binom{n}{i} x^{n-i} h^{i-1} \tag{5.7}$$
$$= \lim_{h \to 0} \left( \binom{n}{1} x^{n-1} + \underbrace{\sum_{i=2}^{n} \binom{n}{i} x^{n-i} h^{i-1}}_{\to\, 0 \text{ as } h \to 0} \right) \tag{5.8}$$
$$= \frac{n!}{1!\,(n-1)!}\, x^{n-1} = n x^{n-1}\,. \tag{5.9}$$
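To see (5.2) and (5.9) in action numerically, here is a minimal sketch (not from the book; the helper name and the values of n, x, and h are illustrative): the forward difference quotient of f(x) = x^n approaches n x^(n-1) as h shrinks.

```python
# Minimal numerical sketch of (5.2) and (5.9); helper name and the chosen
# n, x, h are illustrative, not from the book.

def difference_quotient(f, x, h):
    """Forward difference quotient (f(x + h) - f(x)) / h from (5.2)."""
    return (f(x + h) - f(x)) / h

n, x = 5, 1.5
f = lambda t: t**n
for h in (1e-2, 1e-4, 1e-6):
    print(f"h={h:.0e}  quotient={difference_quotient(f, x, h):.6f}  "
          f"n*x**(n-1)={n * x**(n - 1):.6f}")
```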
[Figure: Taylor polynomials around x0 = 0. Higher-order Taylor polynomials approximate the function f better and more globally; T10 is already similar to f in [−4, 4].]
We consider the function f(x) = sin(x) + cos(x) and seek a Taylor series expansion of f at x0 = 0, which is the Maclaurin series expansion of f. We obtain the following derivatives:

$$f(0) = \sin(0) + \cos(0) = 1\,, \quad f'(0) = \cos(0) - \sin(0) = 1\,,$$
$$f''(0) = -\sin(0) - \cos(0) = -1\,, \quad f'''(0) = -\cos(0) + \sin(0) = -1\,,$$
$$f^{(4)}(0) = \sin(0) + \cos(0) = f(0) = 1\,, \quad \ldots$$

We can see a pattern here: The coefficients in our Taylor series are only ±1 (since sin(0) = 0), each of which occurs twice before switching to the other one. Furthermore, $f^{(k+4)}(0) = f^{(k)}(0)$.
Therefore, the full Taylor series expansion of f at x0 = 0 is given by
$$f(x) = \sum_{k=0}^{\infty} \frac{f^{(k)}(x_0)}{k!}(x - x_0)^k \tag{5.30}$$
$$= 1 + x - \frac{1}{2!}x^2 - \frac{1}{3!}x^3 + \frac{1}{4!}x^4 + \frac{1}{5!}x^5 - \cdots \tag{5.31}$$
$$= 1 - \frac{1}{2!}x^2 + \frac{1}{4!}x^4 \mp \cdots + x - \frac{1}{3!}x^3 + \frac{1}{5!}x^5 \mp \cdots \tag{5.32}$$
$$= \sum_{k=0}^{\infty} \frac{(-1)^k}{(2k)!} x^{2k} + \sum_{k=0}^{\infty} \frac{(-1)^k}{(2k+1)!} x^{2k+1} \tag{5.33}$$
$$= \cos(x) + \sin(x)\,. \tag{5.34}$$
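As a quick numerical sanity check (our own sketch; the truncation order K is an arbitrary choice), the truncated series (5.33) indeed reproduces cos(x) + sin(x):

```python
import math

# Evaluate the truncated Maclaurin series (5.33) and compare with
# cos(x) + sin(x); the truncation order K is an arbitrary choice.
def taylor_sin_plus_cos(x, K=10):
    even = sum((-1)**k * x**(2 * k) / math.factorial(2 * k) for k in range(K))
    odd = sum((-1)**k * x**(2 * k + 1) / math.factorial(2 * k + 1) for k in range(K))
    return even + odd

x = 1.3
print(taylor_sin_plus_cos(x), math.cos(x) + math.sin(x))  # nearly identical
```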
$$\frac{\partial f}{\partial x_1} = \lim_{h \to 0} \frac{f(x_1 + h, x_2, \ldots, x_n) - f(\boldsymbol{x})}{h}$$
$$\vdots \tag{5.47}$$
$$\frac{\partial f}{\partial x_n} = \lim_{h \to 0} \frac{f(x_1, \ldots, x_{n-1}, x_n + h) - f(\boldsymbol{x})}{h}$$

and collect them in the row vector

$$\nabla_{\boldsymbol{x}} f = \operatorname{grad} f = \frac{\mathrm{d}f}{\mathrm{d}\boldsymbol{x}} = \begin{bmatrix} \frac{\partial f(\boldsymbol{x})}{\partial x_1} & \frac{\partial f(\boldsymbol{x})}{\partial x_2} & \cdots & \frac{\partial f(\boldsymbol{x})}{\partial x_n} \end{bmatrix} \in \mathbb{R}^{1 \times n}\,, \tag{5.48}$$

where n is the number of variables and 1 is the dimension of the image/range of f. Here, we used the compact vector notation $\boldsymbol{x} = [x_1, \ldots, x_n]^\top$. The row vector in (5.48) is called the gradient of f or the Jacobian and is the generalization of the derivative from Section 5.1.

Remark. This definition of the Jacobian is a special case of the general definition of the Jacobian for vector-valued functions as the collection of partial derivatives. We will get back to this in Section 5.3. ♦
Example 5.7
Consider $f(x_1, x_2) = x_1^2 + 2x_2$, where $x_1 = \sin t$ and $x_2 = \cos t$. Then

$$\frac{\mathrm{d}f}{\mathrm{d}t} = \frac{\partial f}{\partial x_1}\frac{\partial x_1}{\partial t} + \frac{\partial f}{\partial x_2}\frac{\partial x_2}{\partial t} \tag{5.58}$$
$$= 2\sin t \,\frac{\partial \sin t}{\partial t} + 2\,\frac{\partial \cos t}{\partial t} \tag{5.59}$$
$$= 2\sin t \cos t - 2\sin t = 2\sin t\,(\cos t - 1) \tag{5.60}$$

is the corresponding derivative of f with respect to t.
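A short numerical check of (5.60) (our own sketch; the evaluation point t and step size h are arbitrary): a central difference of t ↦ f(sin t, cos t) agrees with the chain-rule result.

```python
import math

# Compare a central difference of g(t) = f(sin t, cos t) with the
# chain-rule result 2*sin(t)*(cos(t) - 1) from (5.60).
f = lambda x1, x2: x1**2 + 2 * x2
g = lambda t: f(math.sin(t), math.cos(t))

t, h = 0.7, 1e-6
numeric = (g(t + h) - g(t - h)) / (2 * h)
analytic = 2 * math.sin(t) * (math.cos(t) - 1)
print(numeric, analytic)  # agree up to O(h^2)
```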
This compact way of writing the chain rule as a matrix multiplication only makes sense if the gradient is defined as a row vector. Otherwise, we will need to start transposing gradients for the matrix dimensions to match. This may still be straightforward as long as the gradient is a vector or a matrix; however, when the gradient becomes a tensor (we will discuss this in the following), the transpose is no longer a triviality.

Remark (Verifying the Correctness of a Gradient Implementation). The definition of the partial derivatives as the limit of the corresponding difference quotient, see (5.47), can be exploited when numerically checking the correctness of gradients in computer programs: When we compute
gradients and implement them, we can use finite differences to numerically test our computation and implementation: We choose the value h to be small (e.g., h = 10^{-4}) and compare the finite-difference approximation from (5.47) with our (analytic) implementation of the gradient. If the error is small, our gradient implementation is probably correct. "Small" could mean that

$$\sqrt{\frac{\sum_i (dh_i - df_i)^2}{\sum_i (dh_i + df_i)^2}} < 10^{-6}\,,$$

where $dh_i$ is the finite-difference approximation and $df_i$ is the analytic gradient of f with respect to the i-th variable $x_i$. ♦
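A minimal sketch of such a gradient check (the function names and the test function ||x||² with gradient 2x are our own illustrations, not the book's): we build the finite-difference approximation dh from (5.47) with central differences and compare it with an analytic gradient df using the relative error above.

```python
import numpy as np

def finite_difference_grad(f, x, h=1e-4):
    """Approximate the gradient of f at x with central differences."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        grad[i] = (f(x + e) - f(x - e)) / (2 * h)
    return grad

def gradient_check(f, analytic_grad, x, tol=1e-6):
    """Relative error sqrt(sum (dh_i - df_i)^2 / sum (dh_i + df_i)^2)."""
    dh = finite_difference_grad(f, x)
    df = analytic_grad(x)
    rel_err = np.sqrt(np.sum((dh - df)**2) / np.sum((dh + df)**2))
    return rel_err < tol, rel_err

# Illustrative test: f(x) = ||x||^2 has gradient 2x.
f = lambda x: float(x @ x)
print(gradient_check(f, lambda x: 2 * x, np.array([1.0, -2.0, 0.5])))
```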
$$\boldsymbol{f}(\boldsymbol{x}) = \begin{bmatrix} f_1(\boldsymbol{x}) \\ \vdots \\ f_m(\boldsymbol{x}) \end{bmatrix} \in \mathbb{R}^m\,. \tag{5.64}$$

Writing the vector-valued function in this way allows us to view a vector-valued function $\boldsymbol{f}: \mathbb{R}^n \to \mathbb{R}^m$ as a vector of functions $[f_1, \ldots, f_m]^\top$, $f_i: \mathbb{R}^n \to \mathbb{R}$ that map onto $\mathbb{R}$. The differentiation rules for every $f_i$ are exactly the ones we discussed in Section 5.2. Therefore, the partial derivative of a vector-valued function $\boldsymbol{f}: \mathbb{R}^n \to \mathbb{R}^m$ with respect to $x_i \in \mathbb{R}$, $i = 1, \ldots, n$, is given as the vector

$$\frac{\partial \boldsymbol{f}}{\partial x_i} = \begin{bmatrix} \frac{\partial f_1}{\partial x_i} \\ \vdots \\ \frac{\partial f_m}{\partial x_i} \end{bmatrix} \in \mathbb{R}^m\,. \tag{5.65}$$
We collect these partial derivatives and obtain the gradient of $\boldsymbol{f}: \mathbb{R}^n \to \mathbb{R}^m$ with respect to $\boldsymbol{x} \in \mathbb{R}^n$ as the matrix of partial derivatives:

$$\frac{\mathrm{d}\boldsymbol{f}(\boldsymbol{x})}{\mathrm{d}\boldsymbol{x}} = \begin{bmatrix} \frac{\partial \boldsymbol{f}(\boldsymbol{x})}{\partial x_1} & \cdots & \frac{\partial \boldsymbol{f}(\boldsymbol{x})}{\partial x_n} \end{bmatrix} = \begin{bmatrix} \frac{\partial f_1(\boldsymbol{x})}{\partial x_1} & \cdots & \frac{\partial f_1(\boldsymbol{x})}{\partial x_n} \\ \vdots & & \vdots \\ \frac{\partial f_m(\boldsymbol{x})}{\partial x_1} & \cdots & \frac{\partial f_m(\boldsymbol{x})}{\partial x_n} \end{bmatrix} \in \mathbb{R}^{m \times n}\,. \tag{5.66}$$
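As a hedged sketch (the helper and the example function are our own, not the book's), the m × n layout of (5.66) can be assembled column by column with central differences:

```python
import numpy as np

def jacobian_fd(f, x, h=1e-6):
    """Finite-difference Jacobian of f: R^n -> R^m, laid out as in (5.66)."""
    m, n = f(x).size, x.size
    J = np.zeros((m, n))
    for i in range(n):
        e = np.zeros(n)
        e[i] = h
        J[:, i] = (f(x + e) - f(x - e)) / (2 * h)  # i-th column: df/dx_i
    return J

f = lambda x: np.array([x[0]**2 + x[1], np.sin(x[2]) * x[0]])  # R^3 -> R^2
print(jacobian_fd(f, np.array([1.0, 2.0, 0.5])))  # shape (2, 3)
```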
If we now take a parallelogram with the sides $\boldsymbol{c}_1 = [-2, 1]^\top$, $\boldsymbol{c}_2 = [1, 1]^\top$ (orange in Figure 5.5), its area is given as the absolute value of the determinant

$$\left|\det\begin{bmatrix} -2 & 1 \\ 1 & 1 \end{bmatrix}\right| = |-3| = 3\,, \tag{5.71}$$
i.e., the area of this parallelogram is exactly three times the area of the unit square. We can find this scaling factor by finding a mapping that transforms the unit square into the other square. In linear algebra terms, we effectively perform a variable transformation from $(\boldsymbol{b}_1, \boldsymbol{b}_2)$ to $(\boldsymbol{c}_1, \boldsymbol{c}_2)$. In our case, the mapping is linear, and the absolute value of the determinant of this mapping gives us exactly the scaling factor we are looking for.

We will describe two approaches to identify this mapping. First, we exploit the fact that the mapping is linear so that we can use the tools from Chapter 2 to identify it. Second, we will find the mapping using partial derivatives, using the tools we have been discussing in this chapter.
Approach 1. To get started with the linear algebra approach, we identify both $\{\boldsymbol{b}_1, \boldsymbol{b}_2\}$ and $\{\boldsymbol{c}_1, \boldsymbol{c}_2\}$ as bases of $\mathbb{R}^2$ (see Section 2.6.1 for a recap). What we effectively perform is a change of basis from $(\boldsymbol{b}_1, \boldsymbol{b}_2)$ to $(\boldsymbol{c}_1, \boldsymbol{c}_2)$, and we are looking for the transformation matrix that implements the basis change. Using results from Section 2.7.2, we identify the desired basis change matrix as

$$\boldsymbol{J} = \begin{bmatrix} -2 & 1 \\ 1 & 1 \end{bmatrix}\,, \tag{5.72}$$

such that $\boldsymbol{J}\boldsymbol{b}_1 = \boldsymbol{c}_1$ and $\boldsymbol{J}\boldsymbol{b}_2 = \boldsymbol{c}_2$. The absolute value of the determinant of $\boldsymbol{J}$, which yields the scaling factor we are looking for, is given as $|\det(\boldsymbol{J})| = 3$, i.e., the area of the square spanned by $(\boldsymbol{c}_1, \boldsymbol{c}_2)$ is three times greater than the area spanned by $(\boldsymbol{b}_1, \boldsymbol{b}_2)$.
Approach 2. The linear algebra approach works nicely for linear transformations; for nonlinear transformations (which become relevant in Chapter 6), we can follow a more general approach using partial derivatives.

For this approach, we consider a function $\boldsymbol{f}: \mathbb{R}^2 \to \mathbb{R}^2$ that performs a variable transformation. In our example, $\boldsymbol{f}$ maps the coordinate representation of any vector $\boldsymbol{x} \in \mathbb{R}^2$ with respect to $(\boldsymbol{b}_1, \boldsymbol{b}_2)$ onto the coordinate representation with respect to $(\boldsymbol{c}_1, \boldsymbol{c}_2)$, which we will denote by $\boldsymbol{y} \in \mathbb{R}^2$. We now want to identify the mapping so that we can compute how an area (or volume) changes when it is being transformed by $\boldsymbol{f}$. For this, we need to find out how $\boldsymbol{f}(\boldsymbol{x})$ changes if we modify $\boldsymbol{x}$ a bit. This question is exactly answered by the Jacobian matrix $\frac{\mathrm{d}\boldsymbol{f}}{\mathrm{d}\boldsymbol{x}} \in \mathbb{R}^{2\times 2}$.

Since we can write

$$y_1 = -2x_1 + x_2 \tag{5.73}$$
$$y_2 = x_1 + x_2\,, \tag{5.74}$$

we now have the functional relationship between $\boldsymbol{x}$ and $\boldsymbol{y}$, which allows us to compute the partial derivatives $\frac{\partial y_1}{\partial x_1} = -2$, $\frac{\partial y_1}{\partial x_2} = 1$, $\frac{\partial y_2}{\partial x_1} = 1$, $\frac{\partial y_2}{\partial x_2} = 1$ and collect them in the Jacobian

$$\boldsymbol{J} = \begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \frac{\partial y_1}{\partial x_2} \\ \frac{\partial y_2}{\partial x_1} & \frac{\partial y_2}{\partial x_2} \end{bmatrix} = \begin{bmatrix} -2 & 1 \\ 1 & 1 \end{bmatrix}\,. \tag{5.76}$$
The Jacobian represents the coordinate transformation we are looking for and is exact if the coordinate transformation is linear (as in our case), and (5.76) recovers exactly the basis change matrix in (5.72). If the coordinate transformation is nonlinear, the Jacobian approximates this nonlinear transformation locally with a linear one. The absolute value of the Jacobian determinant $|\det(\boldsymbol{J})|$ is the factor by which areas or volumes are scaled when coordinates are transformed. In our case, we obtain $|\det(\boldsymbol{J})| = 3$. Geometrically, the Jacobian determinant gives the magnification/scaling factor when we transform an area or volume.

The Jacobian determinant and variable transformations will become relevant in Section 6.5 when we transform random variables and probability distributions. These transformations are extremely relevant in machine learning in the context of training deep neural networks using the reparametrization trick, also called infinite perturbation analysis. ♦
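A minimal check of the example above (our own sketch): because the mapping (5.73)-(5.74) is linear, its Jacobian is the constant matrix in (5.76), and |det(J)| recovers the scaling factor 3.

```python
import numpy as np

# Jacobian of y1 = -2*x1 + x2, y2 = x1 + x2 from (5.73)-(5.74).
J = np.array([[-2.0, 1.0],   # [dy1/dx1, dy1/dx2]
              [ 1.0, 1.0]])  # [dy2/dx1, dy2/dx2]
print(abs(np.linalg.det(J)))  # 3.0, the area scaling factor
```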
[Figure 5.6: Overview of the dimensionality of (partial) derivatives ∂f/∂x for inputs x and function values f(x).]

Throughout this chapter, we have encountered derivatives of functions. Figure 5.6 summarizes the dimensions of those gradients. If $f: \mathbb{R} \to \mathbb{R}$, the gradient is simply a scalar (top-left entry). For $f: \mathbb{R}^D \to \mathbb{R}$, the gradient is a $1 \times D$ row vector (top-right entry). For $\boldsymbol{f}: \mathbb{R} \to \mathbb{R}^E$, the gradient is an $E \times 1$ column vector, and for $\boldsymbol{f}: \mathbb{R}^D \to \mathbb{R}^E$ the gradient is an $E \times D$ matrix.
Finally, we collect the partial derivatives in the Jacobian and obtain the gradient as

$$\frac{\mathrm{d}\boldsymbol{f}}{\mathrm{d}\boldsymbol{x}} = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_N} \\ \vdots & & \vdots \\ \frac{\partial f_M}{\partial x_1} & \cdots & \frac{\partial f_M}{\partial x_N} \end{bmatrix} = \begin{bmatrix} A_{11} & \cdots & A_{1N} \\ \vdots & & \vdots \\ A_{M1} & \cdots & A_{MN} \end{bmatrix} = \boldsymbol{A} \in \mathbb{R}^{M \times N}\,. \tag{5.78}$$
We seek $\frac{\partial L}{\partial \boldsymbol{\theta}}$, and we will use the chain rule for this purpose. L is called a least-squares loss function.

Remark. We would have obtained the same result without using the chain rule by immediately looking at the function

$$L_2(\boldsymbol{\theta}) := \lVert \boldsymbol{y} - \boldsymbol{\Phi}\boldsymbol{\theta} \rVert^2 = (\boldsymbol{y} - \boldsymbol{\Phi}\boldsymbol{\theta})^\top (\boldsymbol{y} - \boldsymbol{\Phi}\boldsymbol{\theta})\,. \tag{5.97}$$

This approach is still practical for simple functions like $L_2$ but becomes impractical if we consider deep function compositions. ♦
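Differentiating (5.97) gives $\frac{\mathrm{d}L_2}{\mathrm{d}\boldsymbol{\theta}} = -2(\boldsymbol{y} - \boldsymbol{\Phi}\boldsymbol{\theta})^\top\boldsymbol{\Phi}$, the same result as via the chain rule. The following sketch (our own; the sizes N = 5, D = 3 and the random data are illustrative) checks this expression against finite differences:

```python
import numpy as np

# Check dL2/dtheta = -2 (y - Phi @ theta)^T Phi against finite differences.
# Sizes and data are illustrative.
rng = np.random.default_rng(0)
Phi = rng.normal(size=(5, 3))
y = rng.normal(size=5)
theta = rng.normal(size=3)

L2 = lambda th: float(np.sum((y - Phi @ th)**2))
grad = -2 * (y - Phi @ theta) @ Phi  # analytic gradient (row vector)

h = 1e-6
fd = np.array([(L2(theta + h * np.eye(3)[i]) - L2(theta - h * np.eye(3)[i])) / (2 * h)
               for i in range(3)])
print(np.allclose(grad, fd, atol=1e-4))  # True
```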
[Figure 5.7: Visualization of gradient computation of a matrix A ∈ R^{4×2} with respect to a vector x ∈ R^3. We know that the gradient dA/dx ∈ R^{4×2×3}. We follow two equivalent approaches to arrive there: (a) Approach 1: we compute the partial derivatives ∂A/∂x1, ∂A/∂x2, ∂A/∂x3, each of which is a 4 × 2 matrix, and collate them in a 4 × 2 × 3 tensor. (b) Approach 2: we flatten A ∈ R^{4×2} into a vector Ã ∈ R^8, compute the Jacobian matrix dÃ/dx ∈ R^{8×3}, and re-shape it into the Jacobian tensor dA/dx ∈ R^{4×2×3}.]
$$\partial_{pqij} = \begin{cases} L_{iq} & \text{if } j = p,\ p \neq q \\ L_{ip} & \text{if } j = q,\ p \neq q \\ 2L_{iq} & \text{if } j = p,\ p = q \\ 0 & \text{otherwise} \end{cases} \tag{5.112}$$

From (5.107), we know that the desired gradient has the dimension $(n \times n) \times (m \times n)$, and every single entry of this tensor is given by $\partial_{pqij}$ in (5.112), where $p, q, j = 1, \ldots, n$ and $i = 1, \ldots, m$.
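The surrounding example (only partly reproduced here) is consistent with computing the gradient of K = L^⊤L ∈ R^{n×n} with respect to L ∈ R^{m×n}; treating that setup as an assumption, the entries (5.112) can be checked numerically with a sketch like the following (index choices are arbitrary):

```python
import numpy as np

# Assumed setup: K = L^T L with L in R^{m x n}; compare (5.112) with a
# finite-difference perturbation of a single entry L[i, j].
rng = np.random.default_rng(2)
m, n = 4, 3
L = rng.normal(size=(m, n))

def dKpq_dLij(L, p, q, i, j):
    if j == p and p != q:
        return L[i, q]
    if j == q and p != q:
        return L[i, p]
    if j == p and p == q:
        return 2 * L[i, q]
    return 0.0

h = 1e-6
p, q, i, j = 1, 2, 3, 1
L_pert = L.copy()
L_pert[i, j] += h
fd = ((L_pert.T @ L_pert)[p, q] - (L.T @ L)[p, q]) / h
print(fd, dKpq_dLij(L, p, q, i, j))  # agree up to O(h)
```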
gradient with respect to the model parameters using calculus and applying the chain rule, see Section 5.2.2. We already had a taste in Section 5.3 when we looked at the gradient of a squared loss with respect to the parameters of a linear regression model.
Consider the function

$$f(x) = \sqrt{x^2 + \exp(x^2)} + \cos\!\big(x^2 + \exp(x^2)\big)\,. \tag{5.123}$$

Applying the chain rule, its derivative is

$$\frac{\mathrm{d}f}{\mathrm{d}x} = \frac{2x + 2x\exp(x^2)}{2\sqrt{x^2 + \exp(x^2)}} - \sin\!\big(x^2 + \exp(x^2)\big)\big(2x + 2x\exp(x^2)\big)$$
$$= 2x\left(\frac{1}{2\sqrt{x^2 + \exp(x^2)}} - \sin\!\big(x^2 + \exp(x^2)\big)\right)\big(1 + \exp(x^2)\big)\,. \tag{5.124}$$
Writing out the gradient in this explicit way is often impractical since it often results in a very lengthy expression for a derivative. In practice, it means that, if we are not careful, the implementation of the gradient could be significantly more expensive than computing the function, which is an unnecessary overhead. For training deep neural network models, the backpropagation algorithm (Kelley, 1960; Bryson, 1961; Dreyfus, 1962; Rumelhart et al., 1986) is an efficient way to compute the gradient of an error function with respect to the parameters of the model.
where x are the inputs (e.g., images), y are the observations (e.g., class labels), and every function $f_i$, $i = 1, \ldots, K$, possesses its own parameters.

In neural networks with multiple layers, we have functions $f_i(\boldsymbol{x}_{i-1}) = \sigma(\boldsymbol{A}_i \boldsymbol{x}_{i-1} + \boldsymbol{b}_i)$ in the i-th layer. Here $\boldsymbol{x}_{i-1}$ is the output of layer $i-1$ and $\sigma$ an activation function, such as the logistic sigmoid $\frac{1}{1+e^{-x}}$, tanh, or a rectified linear unit (ReLU). (To avoid cluttering the notation, we discuss the case where the activation functions are identical in all layers.) In order to train these models, we require the gradient of a loss function L with respect to all model parameters $\boldsymbol{A}_j, \boldsymbol{b}_j$ for $j = 0, \ldots, K-1$. This also requires us to compute the gradient of L with respect to the inputs of each layer. For example, if we have inputs x
and observations y, we compute the layer outputs as

$$\boldsymbol{f}_0 := \boldsymbol{x}\,, \tag{5.126}$$
$$\boldsymbol{f}_i := \sigma_i(\boldsymbol{A}_{i-1}\boldsymbol{f}_{i-1} + \boldsymbol{b}_{i-1})\,, \quad i = 1, \ldots, K\,, \tag{5.127}$$

and the partial derivatives of a loss L with respect to the parameters $\boldsymbol{\theta}_i$ of the individual layers follow from the chain rule:

$$\frac{\partial L}{\partial \boldsymbol{\theta}_{K-3}} = \frac{\partial L}{\partial \boldsymbol{f}_K} \frac{\partial \boldsymbol{f}_K}{\partial \boldsymbol{f}_{K-1}} \frac{\partial \boldsymbol{f}_{K-1}}{\partial \boldsymbol{f}_{K-2}} \frac{\partial \boldsymbol{f}_{K-2}}{\partial \boldsymbol{\theta}_{K-3}} \tag{5.131}$$
$$\frac{\partial L}{\partial \boldsymbol{\theta}_{i}} = \frac{\partial L}{\partial \boldsymbol{f}_K} \frac{\partial \boldsymbol{f}_K}{\partial \boldsymbol{f}_{K-1}} \cdots \frac{\partial \boldsymbol{f}_{i+2}}{\partial \boldsymbol{f}_{i+1}} \frac{\partial \boldsymbol{f}_{i+1}}{\partial \boldsymbol{\theta}_{i}} \tag{5.132}$$
The orange terms are partial derivatives of the output of a layer with respect to its inputs, whereas the blue terms are partial derivatives of the output of a layer with respect to its parameters. Assuming we have already computed the partial derivatives $\partial L/\partial \boldsymbol{\theta}_{i+1}$, then most of the computation can be reused to compute $\partial L/\partial \boldsymbol{\theta}_i$. The additional terms that we need to compute are indicated by the boxes. Figure 5.9 visualizes that the gradients are passed backward through the network.
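To make (5.126)-(5.132) concrete, here is a small sketch for two layers (our own assumptions: σ = tanh in every layer, a squared-error loss, and arbitrary layer sizes). The backward pass reuses the already-computed derivative with respect to each layer's output when moving one layer further back.

```python
import numpy as np

# Illustrative 2-layer network with sigma = tanh and squared-error loss.
rng = np.random.default_rng(1)
A0, b0 = rng.normal(size=(4, 3)), np.zeros(4)
A1, b1 = rng.normal(size=(2, 4)), np.zeros(2)
x, y = rng.normal(size=3), rng.normal(size=2)

# Forward pass (5.126)-(5.127), keeping the intermediate values f_i.
f0 = x
f1 = np.tanh(A0 @ f0 + b0)
f2 = np.tanh(A1 @ f1 + b1)
loss = np.sum((f2 - y)**2)

# Backward pass: derivatives w.r.t. layer outputs are reused (cf. (5.132)).
dL_df2 = 2 * (f2 - y)
dL_dz2 = dL_df2 * (1 - f2**2)   # tanh'(z) = 1 - tanh(z)^2
dL_dA1 = np.outer(dL_dz2, f1)   # gradient w.r.t. parameters of layer 2
dL_db1 = dL_dz2
dL_df1 = A1.T @ dL_dz2          # gradient w.r.t. the input of layer 2
dL_dz1 = dL_df1 * (1 - f1**2)
dL_dA0 = np.outer(dL_dz1, f0)   # gradient w.r.t. parameters of layer 1
dL_db0 = dL_dz1
```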
the labels, the reverse mode is computationally significantly cheaper than the forward mode. Let us start with an instructive example.
Example 5.13
Consider the function

$$f(x) = \sqrt{x^2 + \exp(x^2)} + \cos\!\big(x^2 + \exp(x^2)\big) \tag{5.136}$$

from (5.123). If we were to implement a function f on a computer, we would be able to save some computation by using intermediate variables:

$$a = x^2\,, \tag{5.137}$$
$$b = \exp(a)\,, \tag{5.138}$$
$$c = a + b\,, \tag{5.139}$$
$$d = \sqrt{c}\,, \tag{5.140}$$
$$e = \cos(c)\,, \tag{5.141}$$
$$f = d + e\,. \tag{5.142}$$
[Figure 5.11: Computation graph with inputs x, function values f, and intermediate variables a, b, c, d, e.]
This is the same kind of thinking process that occurs when applying the chain rule. Observe that the above set of equations requires fewer operations than a direct naive implementation of the function f(x) as defined
in (5.123). The corresponding computation graph in Figure 5.11 shows
the flow of data and computations required to obtain the function value
f.
The set of equations that include intermediate variables can be thought
of as a computation graph, a representation that is widely used in imple-
mentations of neural network software libraries. We can directly compute
the derivatives of the intermediate variables with respect to their corre-
sponding inputs by recalling the definition of the derivative of elementary
functions. We obtain:
$$\frac{\partial a}{\partial x} = 2x\,, \tag{5.143}$$
$$\frac{\partial b}{\partial a} = \exp(a)\,, \tag{5.144}$$
$$\frac{\partial c}{\partial a} = 1 = \frac{\partial c}{\partial b}\,, \tag{5.145}$$
$$\frac{\partial d}{\partial c} = \frac{1}{2\sqrt{c}}\,, \tag{5.146}$$
$$\frac{\partial e}{\partial c} = -\sin(c)\,, \tag{5.147}$$
$$\frac{\partial f}{\partial d} = 1 = \frac{\partial f}{\partial e}\,. \tag{5.148}$$
By looking at the computation graph in Figure 5.11, we can compute $\partial f/\partial x$ by working backward from the output, and we obtain the following relations:

$$\frac{\partial f}{\partial c} = \frac{\partial f}{\partial d}\frac{\partial d}{\partial c} + \frac{\partial f}{\partial e}\frac{\partial e}{\partial c}\,, \tag{5.149}$$
$$\frac{\partial f}{\partial b} = \frac{\partial f}{\partial c}\frac{\partial c}{\partial b}\,, \tag{5.150}$$
$$\frac{\partial f}{\partial a} = \frac{\partial f}{\partial b}\frac{\partial b}{\partial a} + \frac{\partial f}{\partial c}\frac{\partial c}{\partial a}\,, \tag{5.151}$$
$$\frac{\partial f}{\partial x} = \frac{\partial f}{\partial a}\frac{\partial a}{\partial x}\,. \tag{5.152}$$
Note that we have implicitly applied the chain rule to obtain $\partial f/\partial x$. By substituting the results of the derivatives of the elementary functions, we get

$$\frac{\partial f}{\partial c} = 1 \cdot \frac{1}{2\sqrt{c}} + 1 \cdot (-\sin(c))\,, \tag{5.153}$$
$$\frac{\partial f}{\partial b} = \frac{\partial f}{\partial c} \cdot 1\,, \tag{5.154}$$
$$\frac{\partial f}{\partial a} = \frac{\partial f}{\partial b} \exp(a) + \frac{\partial f}{\partial c} \cdot 1\,, \tag{5.155}$$
$$\frac{\partial f}{\partial x} = \frac{\partial f}{\partial a} \cdot 2x\,. \tag{5.156}$$
By thinking of each of the derivatives above as a variable, we observe that the computation required for calculating the derivative is of similar complexity as the computation of the function itself. This is quite counterintuitive since the mathematical expression for the derivative $\partial f/\partial x$ in (5.124) is significantly more complicated than the mathematical expression of the function f(x) in (5.123).
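The example translates directly into code. The following sketch (our own; variable names follow (5.137)-(5.142)) performs the forward pass and then accumulates (5.153)-(5.156) in reverse:

```python
import math

def f_and_grad(x):
    # Forward pass through the intermediate variables (5.137)-(5.142).
    a = x * x
    b = math.exp(a)
    c = a + b
    d = math.sqrt(c)
    e = math.cos(c)
    f = d + e
    # Reverse pass (5.153)-(5.156): each step reuses derivatives already computed.
    df_dc = 1.0 / (2 * math.sqrt(c)) - math.sin(c)
    df_db = df_dc * 1.0
    df_da = df_db * math.exp(a) + df_dc * 1.0
    df_dx = df_da * 2 * x
    return f, df_dx

print(f_and_grad(0.5))
```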
where $g_i(\cdot)$ are elementary functions and $x_{\mathrm{Pa}(x_i)}$ are the parent nodes of the variable $x_i$ in the graph. Given a function defined in this way, we can use the chain rule to compute the derivative of the function in a step-by-step fashion. Recall that by definition $f = x_D$ and hence

$$\frac{\partial f}{\partial x_D} = 1\,. \tag{5.158}$$

For other variables $x_i$, we apply the chain rule

$$\frac{\partial f}{\partial x_i} = \sum_{x_j : x_i \in \mathrm{Pa}(x_j)} \frac{\partial f}{\partial x_j}\frac{\partial x_j}{\partial x_i} = \sum_{x_j : x_i \in \mathrm{Pa}(x_j)} \frac{\partial f}{\partial x_j}\frac{\partial g_j}{\partial x_i}\,, \tag{5.159}$$
where $\mathrm{Pa}(x_j)$ is the set of parent nodes of $x_j$ in the computation graph. Equation (5.157) is the forward propagation of a function, whereas (5.159) is the backpropagation of the gradient through the computation graph. (Automatic differentiation in reverse mode requires a parse tree.) For neural network training, we backpropagate the error of the prediction with respect to the label.

The automatic differentiation approach above works whenever we have a function that can be expressed as a computation graph, where the elementary functions are differentiable. In fact, the function may not even be a mathematical function but a computer program. However, not all computer programs can be automatically differentiated, e.g., if we cannot find differentiable elementary functions. Programming structures, such as for loops and if statements, require more care as well.
Definition 5.7 (Multivariate Taylor Series). For the multivariate Taylor series, we consider a function $f: \mathbb{R}^D \to \mathbb{R}$, $\boldsymbol{x} \mapsto f(\boldsymbol{x})$, that is smooth at $\boldsymbol{x}_0 \in \mathbb{R}^D$. With the difference vector $\boldsymbol{\delta} := \boldsymbol{x} - \boldsymbol{x}_0$, the multivariate Taylor series of f at $\boldsymbol{x}_0$ is defined as

$$f(\boldsymbol{x}) = \sum_{k=0}^{\infty} \frac{D_{\boldsymbol{x}}^k f(\boldsymbol{x}_0)}{k!}\,\boldsymbol{\delta}^k\,, \tag{5.165}$$
[Figure 5.13: Visualizing outer products. Outer products of vectors increase the dimensionality of the array by 1 per term. (a) Given a vector δ ∈ R^4, we obtain the outer product δ² := δ ⊗ δ = δδ^⊤ ∈ R^{4×4} as a matrix.]
where $D_{\boldsymbol{x}}^k f(\boldsymbol{x}_0)$ is the k-th (total) derivative of f with respect to $\boldsymbol{x}$, evaluated at $\boldsymbol{x}_0$.
Definition 5.8 (Taylor Polynomial). The Taylor polynomial of degree n of f at $\boldsymbol{x}_0$ contains the first n + 1 components of the series in (5.165) and is defined as

$$T_n = \sum_{k=0}^{n} \frac{D_{\boldsymbol{x}}^k f(\boldsymbol{x}_0)}{k!}\,\boldsymbol{\delta}^k\,. \tag{5.166}$$
Figure 5.13 visualizes two such outer products. In general, we obtain the following terms in the Taylor series:

$$D_{\boldsymbol{x}}^k f(\boldsymbol{x}_0)\,\boldsymbol{\delta}^k = \sum_{a} \cdots \sum_{k} D_{\boldsymbol{x}}^k f(\boldsymbol{x}_0)[a, \ldots, k]\,\delta[a] \cdots \delta[k]\,, \tag{5.169}$$
Therefore, we obtain

$$D_{x,y}^1 f(1, 2) = \nabla_{x,y} f(1, 2) = \begin{bmatrix} \frac{\partial f}{\partial x}(1, 2) & \frac{\partial f}{\partial y}(1, 2) \end{bmatrix} = \begin{bmatrix} 6 & 14 \end{bmatrix} \in \mathbb{R}^{1 \times 2} \tag{5.179}$$

such that

$$\frac{D_{x,y}^1 f(1, 2)}{1!}\,\boldsymbol{\delta} = \begin{bmatrix} 6 & 14 \end{bmatrix} \begin{bmatrix} x - 1 \\ y - 2 \end{bmatrix} = 6(x - 1) + 14(y - 2)\,. \tag{5.180}$$

Note that $D_{x,y}^1 f(1, 2)\,\boldsymbol{\delta}$ contains only linear terms, i.e., first-order polynomials.
The second-order partial derivatives are given by

$$\frac{\partial^2 f}{\partial x^2} = 2 \implies \frac{\partial^2 f}{\partial x^2}(1, 2) = 2\,, \tag{5.181}$$
$$\frac{\partial^2 f}{\partial y^2} = 6y \implies \frac{\partial^2 f}{\partial y^2}(1, 2) = 12\,, \tag{5.182}$$
$$\frac{\partial^2 f}{\partial y \partial x} = 2 \implies \frac{\partial^2 f}{\partial y \partial x}(1, 2) = 2\,, \tag{5.183}$$
$$\frac{\partial^2 f}{\partial x \partial y} = 2 \implies \frac{\partial^2 f}{\partial x \partial y}(1, 2) = 2\,. \tag{5.184}$$
When we collect the second-order partial derivatives, we obtain the Hessian

$$\boldsymbol{H} = \begin{bmatrix} \frac{\partial^2 f}{\partial x^2} & \frac{\partial^2 f}{\partial x \partial y} \\ \frac{\partial^2 f}{\partial y \partial x} & \frac{\partial^2 f}{\partial y^2} \end{bmatrix} = \begin{bmatrix} 2 & 2 \\ 2 & 6y \end{bmatrix}\,, \tag{5.185}$$

such that

$$\boldsymbol{H}(1, 2) = \begin{bmatrix} 2 & 2 \\ 2 & 12 \end{bmatrix} \in \mathbb{R}^{2 \times 2}\,. \tag{5.186}$$
Therefore, the next term of the Taylor-series expansion is given by

$$\frac{D_{x,y}^2 f(1, 2)}{2!}\,\boldsymbol{\delta}^2 = \frac{1}{2}\boldsymbol{\delta}^\top \boldsymbol{H}(1, 2)\,\boldsymbol{\delta} \tag{5.187}$$
$$= \frac{1}{2}\begin{bmatrix} x - 1 & y - 2 \end{bmatrix} \begin{bmatrix} 2 & 2 \\ 2 & 12 \end{bmatrix} \begin{bmatrix} x - 1 \\ y - 2 \end{bmatrix} \tag{5.188}$$
$$= (x - 1)^2 + 2(x - 1)(y - 2) + 6(y - 2)^2\,. \tag{5.189}$$

Here, $D_{x,y}^2 f(1, 2)\,\boldsymbol{\delta}^2$ contains only quadratic terms, i.e., second-order polynomials.
The third-order derivatives are obtained as

$$D_{x,y}^3 f = \begin{bmatrix} \frac{\partial \boldsymbol{H}}{\partial x} & \frac{\partial \boldsymbol{H}}{\partial y} \end{bmatrix} \in \mathbb{R}^{2 \times 2 \times 2}\,, \tag{5.190}$$

$$D_{x,y}^3 f[:, :, 1] = \frac{\partial \boldsymbol{H}}{\partial x} = \begin{bmatrix} \frac{\partial^3 f}{\partial x^3} & \frac{\partial^3 f}{\partial x^2 \partial y} \\ \frac{\partial^3 f}{\partial x \partial y \partial x} & \frac{\partial^3 f}{\partial x \partial y^2} \end{bmatrix}\,, \tag{5.191}$$

$$D_{x,y}^3 f[:, :, 2] = \frac{\partial \boldsymbol{H}}{\partial y} = \begin{bmatrix} \frac{\partial^3 f}{\partial y \partial x^2} & \frac{\partial^3 f}{\partial y \partial x \partial y} \\ \frac{\partial^3 f}{\partial y^2 \partial x} & \frac{\partial^3 f}{\partial y^3} \end{bmatrix}\,. \tag{5.192}$$
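The derivatives above are consistent with f(x, y) = x² + 2xy + y³ (the function itself is defined in the part of the example not reproduced here, so treat this as an assumption). With that f, the linear and quadratic Taylor terms (5.180) and (5.189) can be checked numerically around (x₀, y₀) = (1, 2):

```python
# Assumption: f(x, y) = x**2 + 2*x*y + y**3, consistent with the derivatives above.
f = lambda x, y: x**2 + 2 * x * y + y**3
x0, y0 = 1.0, 2.0
dx, dy = 0.01, -0.02

linear = 6 * dx + 14 * dy                    # (5.180)
quadratic = dx**2 + 2 * dx * dy + 6 * dy**2  # (5.189)
approx = f(x0, y0) + linear + quadratic
print(approx, f(x0 + dx, y0 + dy))  # agree up to the cubic remainder
```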
Even if p(x) is in a convenient form (e.g., Gaussian), this integral generally cannot be solved analytically. The Taylor series expansion of f is one way of finding an approximate solution: Assuming $p(\boldsymbol{x}) = \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ is Gaussian, then the first-order Taylor series expansion around $\boldsymbol{\mu}$ locally linearizes the nonlinear function f. For linear functions, we can compute the mean (and the covariance) exactly if p(x) is Gaussian distributed (see Section 6.6). This property is heavily exploited by the Extended Kalman Filter (Maybeck, 1979) for online state estimation in nonlinear dynamical systems (also called "state-space models"). Other deterministic ways to approximate the integral in (5.202) are the unscented transform (Julier and Uhlmann, 1997), which does not require any gradients, or the Laplace approximation (Bishop, 2006), which uses the Hessian for a local Gaussian approximation of p(x) at the posterior mean.
Exercises
5.1 Consider the following functions:

$$f_1(\boldsymbol{x}) = \sin(x_1)\cos(x_2)\,, \quad \boldsymbol{x} \in \mathbb{R}^2 \tag{5.203}$$
$$f_2(\boldsymbol{x}, \boldsymbol{y}) = \boldsymbol{x}^\top \boldsymbol{y}\,, \quad \boldsymbol{x}, \boldsymbol{y} \in \mathbb{R}^n \tag{5.204}$$
$$f_3(\boldsymbol{x}) = \boldsymbol{x}\boldsymbol{x}^\top\,, \quad \boldsymbol{x} \in \mathbb{R}^n \tag{5.205}$$

1. What are the dimensions of $\frac{\partial f_i}{\partial \boldsymbol{x}}$?
2. Compute the Jacobians.
5.2 Differentiate f with respect to t and g with respect to X, where

$$f(\boldsymbol{t}) = \sin(\log(\boldsymbol{t}^\top \boldsymbol{t}))\,, \quad \boldsymbol{t} \in \mathbb{R}^D \tag{5.206}$$
$$g(\boldsymbol{X}) = \operatorname{tr}(\boldsymbol{A}\boldsymbol{X}\boldsymbol{B})\,, \quad \boldsymbol{A} \in \mathbb{R}^{D \times E},\ \boldsymbol{X} \in \mathbb{R}^{E \times F},\ \boldsymbol{B} \in \mathbb{R}^{F \times D}\,, \tag{5.207}$$

where tr denotes the trace.
5.3 Compute the derivatives df/dx of the following functions by using the chain rule. Provide the dimensions of every single partial derivative. Describe your steps in detail.

1. $f(z) = \log(1 + z)\,, \quad z = \boldsymbol{x}^\top\boldsymbol{x}\,, \quad \boldsymbol{x} \in \mathbb{R}^D$
2. $f(\boldsymbol{z}) = \sin(\boldsymbol{z})\,, \quad \boldsymbol{z} = \boldsymbol{A}\boldsymbol{x} + \boldsymbol{b}\,, \quad \boldsymbol{A} \in \mathbb{R}^{E \times D},\ \boldsymbol{x} \in \mathbb{R}^D,\ \boldsymbol{b} \in \mathbb{R}^E$
Here tr(A) is the trace of A, i.e., the sum of the diagonal elements $A_{ii}$. Hint: Explicitly write out the outer product.

3. Use the chain rule. Provide the dimensions of every single partial derivative. You do not need to compute the product of the partial derivatives explicitly.

$$\boldsymbol{f} = \tanh(\boldsymbol{z}) \in \mathbb{R}^M\,, \quad \boldsymbol{z} = \boldsymbol{A}\boldsymbol{x} + \boldsymbol{b}\,, \quad \boldsymbol{x} \in \mathbb{R}^N,\ \boldsymbol{A} \in \mathbb{R}^{M \times N},\ \boldsymbol{b} \in \mathbb{R}^M\,.$$