
Vector Calculus

Many algorithms in machine learning are inherently based on optimizing an objective function with respect to a set of desired model parameters that control how well a model explains the data: Finding good parameters can be phrased as an optimization problem. Examples include linear regression (see Chapter 9), where we look at curve-fitting problems, and we optimize linear weight parameters to maximize the likelihood; neural-network auto-encoders for dimensionality reduction and data compression, where the parameters are the weights and biases of each layer, and where we minimize a reconstruction error by repeated application of the chain rule; Gaussian mixture models (see Chapter 11) for modeling data distributions, where we optimize the location and shape parameters of each mixture component to maximize the likelihood of the model. Figure 5.1 illustrates some of these problems, which we typically solve by using optimization algorithms that exploit gradient information (first-order methods). Figure 5.2 gives an overview of how concepts in this chapter are related and how they are connected to other chapters of the book.
In this chapter, we will discuss how to compute gradients of functions, which is often essential to facilitate learning in machine learning models. Therefore, vector calculus is one of the fundamental mathematical tools we need in machine learning.

[Figure 5.1 Vector calculus plays a central role in (a) regression (curve fitting) and (b) density estimation, i.e., modeling data distributions. (a) Regression problem: find parameters such that the curve (here a polynomial of degree 4) explains the observations (circles) well. (b) Density estimation with a Gaussian mixture model: find means and covariances such that the data (dots) can be explained well (maximum likelihood estimate).]


[Figure 5.2 A mind map of the concepts introduced in this chapter (difference quotient, partial derivatives, Jacobian, Hessian, Taylor series), along with when they are used in other parts of the book: Chapter 7 (Optimization), Chapter 9 (Regression), Chapter 10 (Dimensionality Reduction), Chapter 11 (Density Estimation), Chapter 12 (Classification).]

[Figure 5.3 The average incline of a function f between x0 and x0 + δx is the incline of the secant (blue) through f(x0) and f(x0 + δx) and given by δy/δx.]

5.1 Differentiation of Univariate Functions

In the following, we briefly revisit differentiation of a univariate function, which we may already know from school. We start with the difference quotient of a univariate function y = f(x), x, y ∈ R, which we will subsequently use to define derivatives.

Definition 5.1 (Difference Quotient). The difference quotient

    δy/δx := (f(x + δx) − f(x)) / δx                                    (5.1)

computes the slope of the secant line through two points on the graph of f. In Figure 5.3 these are the points with x-coordinates x0 and x0 + δx. The difference quotient can also be considered the average slope of f between x and x + δx if we assume f to be a linear function. In the limit for δx → 0, we obtain the tangent of f at x, if f is differentiable. The tangent is then the derivative of f at x.

Definition 5.2 (Derivative). More formally, for h > 0 the derivative of f at x is defined as the limit

    df/dx := lim_{h→0} (f(x + h) − f(x)) / h ,                          (5.2)

and the secant in Figure 5.3 becomes a tangent.


Example 5.1 (Derivative of a Polynomial)


We want to compute the derivative of f(x) = x^n, n ∈ N. We may already know that the answer will be nx^{n−1}, but we want to derive this result using the definition of the derivative as the limit of the difference quotient.
Using the definition of the derivative in (5.2) we obtain

    df/dx = lim_{h→0} (f(x + h) − f(x)) / h                             (5.3)
          = lim_{h→0} ((x + h)^n − x^n) / h                             (5.4)
          = lim_{h→0} ( Σ_{i=0}^{n} \binom{n}{i} x^{n−i} h^i − x^n ) / h .   (5.5)

We see that x^n = \binom{n}{0} x^{n−0} h^0. By starting the sum at 1 the x^n-term cancels, and we obtain

    df/dx = lim_{h→0} ( Σ_{i=1}^{n} \binom{n}{i} x^{n−i} h^i ) / h      (5.6)
          = lim_{h→0} Σ_{i=1}^{n} \binom{n}{i} x^{n−i} h^{i−1}          (5.7)
          = lim_{h→0} ( \binom{n}{1} x^{n−1} + Σ_{i=2}^{n} \binom{n}{i} x^{n−i} h^{i−1} )   (5.8)
          = n!/(1!(n − 1)!) x^{n−1} = nx^{n−1} ,                        (5.9)

where the sum starting at i = 2 vanishes as h → 0.

5.1.1 Taylor Series

The Taylor series is a representation of a function f as an infinite sum of terms. These terms are determined using derivatives of f, evaluated at x0.
Definition 5.3 (Taylor Polynomial). The Taylor polynomial of degree n of f : R → R at x0 is defined as

    Tn(x) := Σ_{k=0}^{n} (f^{(k)}(x0)/k!) (x − x0)^k ,                  (5.10)

where f^{(k)}(x0) is the k-th derivative of f at x0 (which we assume exists) and f^{(k)}(x0)/k! are the coefficients of the polynomial.
Definition 5.4 (Taylor Series). For a smooth function f ∈ C^∞, f : R → R, the Taylor series of f at x0 is defined as

    f(x) = T∞(x) ,                                                      (5.11)


where the Taylor polynomial T∞ is defined in (5.10). For x0 = 0, we obtain the Maclaurin series as a special instance of the Taylor series. If f(x) = T∞(x), then f is called analytic. (Here f ∈ C^∞ means that f is continuously differentiable infinitely many times.)

Remark. In general, a Taylor polynomial of degree n is an approximation of a function, which does not need to be a polynomial. The Taylor polynomial is similar to f in a neighborhood around x0. However, a Taylor polynomial of degree n is an exact representation of a polynomial f of degree k ≤ n since all derivatives f^{(i)}, i > k vanish. ♦

Example 5.2 (Taylor Polynomial)


We consider the polynomial

    f(x) = x^4                                                          (5.12)

and seek the Taylor polynomial T6, evaluated at x0 = 1. We start by computing the coefficients f^{(k)}(1) for k = 0, . . . , 6:

    f(1) = 1                                                            (5.13)
    f'(1) = 4                                                           (5.14)
    f''(1) = 12                                                         (5.15)
    f^{(3)}(1) = 24                                                     (5.16)
    f^{(4)}(1) = 24                                                     (5.17)
    f^{(5)}(1) = 0                                                      (5.18)
    f^{(6)}(1) = 0                                                      (5.19)

Therefore, the desired Taylor polynomial is

    T6(x) = Σ_{k=0}^{6} (f^{(k)}(x0)/k!) (x − x0)^k                     (5.20)
          = 1 + 4(x − 1) + 6(x − 1)^2 + 4(x − 1)^3 + (x − 1)^4 + 0 .    (5.21)

Multiplying out and re-arranging yields

    T6(x) = (1 − 4 + 6 − 4 + 1) + x(4 − 12 + 12 − 4)
            + x^2(6 − 12 + 6) + x^3(4 − 4) + x^4                        (5.22)
          = x^4 = f(x) ,                                                (5.23)
i.e., we obtain an exact representation of the original function.

Example 5.3 (Taylor Series)


Consider the function
f (x) = sin(x) + cos(x) ∈ C ∞ . (5.24)


[Figure 5.4 Taylor polynomials. The original function f(x) = sin(x) + cos(x) (black, solid) is approximated by Taylor polynomials T0, T1, T5, T10 (dashed) around x0 = 0. Higher-order Taylor polynomials approximate the function f better and more globally. T10 is already similar to f in [−4, 4].]
We seek a Taylor series expansion of f at x0 = 0, which is the Maclaurin
series expansion of f . We obtain the following derivatives:

    f(0) = sin(0) + cos(0) = 1                                          (5.25)
    f'(0) = cos(0) − sin(0) = 1                                         (5.26)
    f''(0) = −sin(0) − cos(0) = −1                                      (5.27)
    f^{(3)}(0) = −cos(0) + sin(0) = −1                                  (5.28)
    f^{(4)}(0) = sin(0) + cos(0) = f(0) = 1                             (5.29)
    ⋮

We can see a pattern here: The coefficients in our Taylor series are only
±1 (since sin(0) = 0), each of which occurs twice before switching to the
other one. Furthermore, f (k+4) (0) = f (k) (0).
Therefore, the full Taylor series expansion of f at x0 = 0 is given by

    f(x) = Σ_{k=0}^{∞} (f^{(k)}(x0)/k!) (x − x0)^k                      (5.30)
         = 1 + x − (1/2!) x^2 − (1/3!) x^3 + (1/4!) x^4 + (1/5!) x^5 − · · ·   (5.31)
         = 1 − (1/2!) x^2 + (1/4!) x^4 ∓ · · · + x − (1/3!) x^3 + (1/5!) x^5 ∓ · · ·   (5.32)
         = Σ_{k=0}^{∞} (−1)^k (1/(2k)!) x^{2k} + Σ_{k=0}^{∞} (−1)^k (1/(2k+1)!) x^{2k+1}   (5.33)
         = cos(x) + sin(x) ,                                            (5.34)


where we used the power series representations

    cos(x) = Σ_{k=0}^{∞} (−1)^k (1/(2k)!) x^{2k} ,                      (5.35)
    sin(x) = Σ_{k=0}^{∞} (−1)^k (1/(2k+1)!) x^{2k+1} .                  (5.36)

Figure 5.4 shows the corresponding first Taylor polynomials Tn for n = 0, 1, 5, 10.
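
The Taylor polynomials Tn can be evaluated directly from these coefficients. The following sketch (illustrative only; it assumes NumPy and the standard math module, and the helper name taylor_sin_plus_cos is not from the text) builds Tn for f(x) = sin(x) + cos(x) around x0 = 0 and measures how the approximation improves with n:

    import math
    import numpy as np

    def taylor_sin_plus_cos(x, n):
        """Evaluate the degree-n Maclaurin polynomial of f(x) = sin(x) + cos(x).

        The derivatives at 0 cycle through 1, 1, -1, -1 (period 4), so the
        coefficient of x^k is f^(k)(0) / k!.
        """
        derivs_at_0 = [1.0, 1.0, -1.0, -1.0]
        coeffs = [derivs_at_0[k % 4] / math.factorial(k) for k in range(n + 1)]
        # np.polyval expects the highest-order coefficient first.
        return np.polyval(coeffs[::-1], x)

    x = np.linspace(-4, 4, 101)
    for n in (0, 1, 5, 10):
        err = np.max(np.abs(taylor_sin_plus_cos(x, n) - (np.sin(x) + np.cos(x))))
        print(f"T_{n}: max |T_n(x) - f(x)| on [-4, 4] = {err:.3f}")

The printed errors shrink as n grows, mirroring what Figure 5.4 shows visually.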

5.1.2 Differentiation Rules

In the following, we briefly state basic differentiation rules, where we denote the derivative of f by f'.

    Product rule:  (f(x) g(x))' = f'(x) g(x) + f(x) g'(x)               (5.37)
    Quotient rule: (f(x)/g(x))' = (f'(x) g(x) − f(x) g'(x)) / (g(x))^2  (5.38)
    Sum rule:      (f(x) + g(x))' = f'(x) + g'(x)                       (5.39)
    Chain rule:    (g(f(x)))' = (g ∘ f)'(x) = g'(f(x)) f'(x)            (5.40)

Here, g ∘ f is a function composition x ↦ f(x) ↦ g(f(x)).

Example 5.4 (Chain rule)


Let us compute the derivative of the function h(x) = (2x + 1)^4 using the chain rule. With

    h(x) = (2x + 1)^4 = g(f(x)) ,                                       (5.41)
    f(x) = 2x + 1 ,                                                     (5.42)
    g(f) = f^4                                                          (5.43)

we obtain the derivatives of f and g as

    f'(x) = 2 ,                                                         (5.44)
    g'(f) = 4f^3 ,                                                      (5.45)

such that the derivative of h is given as

    h'(x) = g'(f) f'(x) = (4f^3) · 2 = 4(2x + 1)^3 · 2 = 8(2x + 1)^3 ,  (5.46)

where we used the chain rule, see (5.40), and substituted the definition of f in (5.42) in g'(f).
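
As a quick sanity check (an illustrative sketch, assuming plain Python; the variable names are ours), we can compare (5.46) with a central difference quotient in the spirit of (5.1):

    h_func = lambda x: (2 * x + 1) ** 4           # h(x) = (2x + 1)^4
    dh_analytic = lambda x: 8 * (2 * x + 1) ** 3  # derivative from the chain rule (5.46)

    x0, eps = 0.7, 1e-5
    dh_numeric = (h_func(x0 + eps) - h_func(x0 - eps)) / (2 * eps)
    print(dh_analytic(x0), dh_numeric)            # both values are approximately 110.592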


5.2 Partial Differentiation and Gradients

Differentiation as discussed in Section 5.1 applies to functions f of a scalar variable x ∈ R. In the following, we consider the general case where the function f depends on one or more variables x ∈ R^n, e.g., f(x) = f(x1, x2). The generalization of the derivative to functions of several variables is the gradient.
We find the gradient of the function f with respect to x by varying one variable at a time and keeping the others constant. The gradient is then the collection of these partial derivatives.
Definition 5.5 (Partial Derivative). For a function f : R^n → R, x ↦ f(x), x ∈ R^n of n variables x1, . . . , xn we define the partial derivatives as

    ∂f/∂x1 = lim_{h→0} (f(x1 + h, x2, . . . , xn) − f(x)) / h
      ⋮                                                                 (5.47)
    ∂f/∂xn = lim_{h→0} (f(x1, . . . , xn−1, xn + h) − f(x)) / h

and collect them in the row vector

    ∇x f = grad f = df/dx = [∂f(x)/∂x1  ∂f(x)/∂x2  · · ·  ∂f(x)/∂xn] ∈ R^{1×n} ,   (5.48)

where n is the number of variables and 1 is the dimension of the image/range of f. Here, we used the compact vector notation x = [x1, . . . , xn]^T. The row vector in (5.48) is called the gradient of f or the Jacobian and is the generalization of the derivative from Section 5.1.

Remark. This definition of the Jacobian is a special case of the general definition of the Jacobian for vector-valued functions as the collection of partial derivatives. We will get back to this in Section 5.3. ♦

Example 5.5 (Partial Derivatives using the Chain Rule)


For f(x, y) = (x + 2y^3)^2, we obtain the partial derivatives

    ∂f(x, y)/∂x = 2(x + 2y^3) ∂/∂x (x + 2y^3) = 2(x + 2y^3) ,           (5.49)
    ∂f(x, y)/∂y = 2(x + 2y^3) ∂/∂y (x + 2y^3) = 12(x + 2y^3) y^2 ,      (5.50)

where we used the chain rule (5.40) to compute the partial derivatives. (We can use results from scalar differentiation: each partial derivative is a derivative with respect to a scalar.)

Remark (Gradient as a Row Vector). It is not uncommon in the literature to define the gradient vector as a column vector, following the convention that vectors are generally column vectors. The reason why we define the gradient vector as a row vector is twofold: First, we can consistently generalize the gradient to a setting where f : R^n → R^m no longer maps onto the real line (then the gradient becomes a matrix). Second, we can immediately apply the multi-variate chain rule without paying attention to the dimension of the gradient. We will discuss both points later. ♦

Example 5.6 (Gradient)


For f(x1, x2) = x1^2 x2 + x1 x2^3 ∈ R, the partial derivatives (i.e., the derivatives of f with respect to x1 and x2) are

    ∂f(x1, x2)/∂x1 = 2 x1 x2 + x2^3                                     (5.51)
    ∂f(x1, x2)/∂x2 = x1^2 + 3 x1 x2^2                                   (5.52)

and the gradient is then

    df/dx = [∂f(x1, x2)/∂x1  ∂f(x1, x2)/∂x2] = [2 x1 x2 + x2^3   x1^2 + 3 x1 x2^2] ∈ R^{1×2} .   (5.53)

5.2.1 Basic Rules of Partial Differentiation

In the multivariate case, where x ∈ R^n, the basic differentiation rules that we know from school (e.g., sum rule, product rule, chain rule; see also Section 5.1.2) still apply. However, when we compute derivatives with respect to vectors x ∈ R^n we need to pay attention: Our gradients now involve vectors and matrices, and matrix multiplication is no longer commutative (see Section 2.2.1), i.e., the order matters.
Here are the general product rule, sum rule and chain rule:

    Product rule:  ∂/∂x (f(x) g(x)) = (∂f/∂x) g(x) + f(x) (∂g/∂x)       (5.54)
    Sum rule:      ∂/∂x (f(x) + g(x)) = ∂f/∂x + ∂g/∂x                   (5.55)
    Chain rule:    ∂/∂x (g ∘ f)(x) = ∂/∂x g(f(x)) = (∂g/∂f)(∂f/∂x)      (5.56)

Let us have a closer look at the chain rule. The chain rule (5.56) resembles to some degree the rules for matrix multiplication, where we said that neighboring dimensions have to match for matrix multiplication to be defined, see Section 2.2.1. If we go from left to right, the chain rule exhibits similar properties: ∂f shows up in the "denominator" of the first factor and in the "numerator" of the second factor. If we multiply the factors together, multiplication is defined, i.e., the dimensions of ∂f match, and ∂f "cancels", such that ∂g/∂x remains. (This is only an intuition, not mathematically correct, since the partial derivative is not a fraction.)


5.2.2 Chain Rule

Consider a function f : R^2 → R of two variables x1, x2. Furthermore, x1(t) and x2(t) are themselves functions of t. To compute the gradient of f with respect to t, we need to apply the chain rule (5.56) for multivariate functions as

    df/dt = [∂f/∂x1  ∂f/∂x2] [∂x1(t)/∂t ; ∂x2(t)/∂t] = (∂f/∂x1)(∂x1/∂t) + (∂f/∂x2)(∂x2/∂t) ,   (5.57)

where d denotes the gradient and ∂ partial derivatives.

Example 5.7
Consider f(x1, x2) = x1^2 + 2 x2, where x1 = sin t and x2 = cos t, then

    df/dt = (∂f/∂x1)(∂x1/∂t) + (∂f/∂x2)(∂x2/∂t)                         (5.58)
          = 2 sin t · ∂ sin t/∂t + 2 · ∂ cos t/∂t                       (5.59)
          = 2 sin t cos t − 2 sin t = 2 sin t (cos t − 1)               (5.60)
is the corresponding derivative of f with respect to t.

If f(x1, x2) is a function of x1 and x2, where x1(s, t) and x2(s, t) are themselves functions of two variables s and t, the chain rule yields the partial derivatives

    ∂f/∂s = (∂f/∂x1)(∂x1/∂s) + (∂f/∂x2)(∂x2/∂s) ,                       (5.61)
    ∂f/∂t = (∂f/∂x1)(∂x1/∂t) + (∂f/∂x2)(∂x2/∂t) ,                       (5.62)

and the gradient is obtained by the matrix multiplication

    df/d(s, t) = (∂f/∂x)(∂x/∂(s, t)) = [∂f/∂x1  ∂f/∂x2] [∂x1/∂s  ∂x1/∂t ; ∂x2/∂s  ∂x2/∂t] .   (5.63)
This compact way of writing the chain rule as a matrix multiplication only makes sense if the gradient is defined as a row vector. Otherwise, we will need to start transposing gradients for the matrix dimensions to match. This may still be straightforward as long as the gradient is a vector or a matrix; however, when the gradient becomes a tensor (we will discuss this in the following), the transpose is no longer a triviality.

Remark (Verifying the Correctness of a Gradient Implementation). The definition of the partial derivatives as the limit of the corresponding difference quotient, see (5.47), can be exploited when numerically checking the correctness of gradients in computer programs: When we compute gradients and implement them, we can use finite differences to numerically test our computation and implementation: We choose the value h to be small (e.g., h = 10^{−4}) and compare the finite-difference approximation from (5.47) with our (analytic) implementation of the gradient. If the error is small, our gradient implementation is probably correct. "Small" could mean that

    √( Σ_i (dh_i − df_i)^2 / Σ_i (dh_i + df_i)^2 ) < 10^{−6} ,

where dh_i is the finite-difference approximation and df_i is the analytic gradient of f with respect to the i-th variable x_i. ♦
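
A minimal sketch of such a gradient check (assuming NumPy; the helper names finite_difference_gradient and relative_gradient_error are illustrative choices, and the test function is the one from Example 5.6):

    import numpy as np

    def finite_difference_gradient(f, x, h=1e-4):
        """Approximate the gradient of f: R^n -> R at x, entry by entry.

        Central differences are used here, a slightly more accurate variant
        of the one-sided quotient in (5.47).
        """
        grad = np.zeros_like(x, dtype=float)
        for i in range(x.size):
            e = np.zeros_like(x, dtype=float)
            e[i] = h
            grad[i] = (f(x + e) - f(x - e)) / (2 * h)
        return grad

    def relative_gradient_error(dh, df):
        """Error measure from the remark: sqrt(sum (dh-df)^2 / sum (dh+df)^2)."""
        return np.sqrt(np.sum((dh - df) ** 2) / np.sum((dh + df) ** 2))

    # Test function f(x1, x2) = x1^2 x2 + x1 x2^3 with its analytic gradient (5.53).
    f = lambda x: x[0] ** 2 * x[1] + x[0] * x[1] ** 3
    analytic = lambda x: np.array([2 * x[0] * x[1] + x[1] ** 3,
                                   x[0] ** 2 + 3 * x[0] * x[1] ** 2])

    x = np.array([1.2, -0.5])
    err = relative_gradient_error(finite_difference_gradient(f, x), analytic(x))
    print("gradient implementation looks correct:", err < 1e-6)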

5.3 Gradients of Vector-Valued Functions

Thus far, we discussed partial derivatives and gradients of functions f : R^n → R mapping to the real numbers. In the following, we will generalize the concept of the gradient to vector-valued functions (vector fields) f : R^n → R^m, where n, m > 1.
For a function f : R^n → R^m and a vector x = [x1, . . . , xn]^T ∈ R^n, the corresponding vector of function values is given as

    f(x) = [f1(x) ; . . . ; fm(x)] ∈ R^m .                              (5.64)

Writing the vector-valued function in this way allows us to view a vector-valued function f : R^n → R^m as a vector of functions [f1, . . . , fm]^T, fi : R^n → R that map onto R. The differentiation rules for every fi are exactly the ones we discussed in Section 5.2.
Therefore, the partial derivative of a vector-valued function f : R^n → R^m with respect to x_i ∈ R, i = 1, . . . , n, is given as the vector

    ∂f/∂x_i = [∂f1/∂x_i ; . . . ; ∂fm/∂x_i]
            = [lim_{h→0} (f1(x1, . . . , x_{i−1}, x_i + h, x_{i+1}, . . . , xn) − f1(x)) / h ;
                 . . . ;
               lim_{h→0} (fm(x1, . . . , x_{i−1}, x_i + h, x_{i+1}, . . . , xn) − fm(x)) / h] ∈ R^m .   (5.65)

From (5.48), we know that we obtain the gradient of f with respect


to a vector as the row vector of the partial derivatives. In (5.65), every
partial derivative ∂f /∂xi is a column vector. Therefore, we obtain the
gradient of f : Rn → Rm with respect to x ∈ Rn by collecting these

partial derivatives:

    df(x)/dx = [∂f(x)/∂x1  · · ·  ∂f(x)/∂xn]
             = [∂f1(x)/∂x1  · · ·  ∂f1(x)/∂xn ;
                     ⋮                  ⋮     ;
                ∂fm(x)/∂x1  · · ·  ∂fm(x)/∂xn] ∈ R^{m×n} .              (5.66)

[Figure 5.5 The determinant of the Jacobian of f can be used to compute the magnifier between the blue and orange area.]

Definition 5.6 (Jacobian). The collection of all first-order partial derivatives of a vector-valued function f : R^n → R^m is called the Jacobian. The Jacobian J is an m × n matrix, which we define and arrange as follows:

    J = ∇x f = df(x)/dx = [∂f(x)/∂x1  · · ·  ∂f(x)/∂xn]                 (5.67)
      = [∂f1(x)/∂x1  · · ·  ∂f1(x)/∂xn ;
               ⋮                 ⋮     ;
         ∂fm(x)/∂x1  · · ·  ∂fm(x)/∂xn] ,                               (5.68)

    x = [x1, . . . , xn]^T ,   J(i, j) = ∂f_i/∂x_j .                    (5.69)

The gradient of a function f : R^n → R^m is thus a matrix of size m × n. In particular, a function f : R^n → R^1, which maps a vector x ∈ R^n onto a scalar (e.g., f(x) = Σ_{i=1}^{n} x_i), possesses a Jacobian that is a row vector (matrix of dimension 1 × n), see (5.48).

Remark (Variable Transformation and Jacobian Determinant). In Section 4.1, we saw that the determinant can be used to compute the area of a parallelogram. If we are given two vectors b1 = [1, 0]^T, b2 = [0, 1]^T as the sides of the unit square (blue, see Figure 5.5), the area of this square is

    det([1  0 ; 0  1]) = 1 .                                            (5.70)

If we now take a parallelogram with the sides c1 = [−2, 1]^T, c2 = [1, 1]^T (orange in Figure 5.5), its area is given as the absolute value of the determinant

    |det([−2  1 ; 1  1])| = |−3| = 3 ,                                  (5.71)

i.e., the area of this parallelogram is exactly 3 times the area of the unit square. We can find this scaling factor by finding a mapping that transforms the unit square into the other square. In linear algebra terms, we effectively perform a variable transformation from (b1, b2) to (c1, c2). In our case, the mapping is linear and the absolute value of the determinant of this mapping gives us exactly the scaling factor we are looking for.
We will describe two approaches to identify this mapping. First, we exploit the fact that the mapping is linear so that we can use the tools from Chapter 2 to identify this mapping. Second, we will find the mapping using partial derivatives using the tools we have been discussing in this chapter.
Approach 1. To get started with the linear algebra approach, we identify both {b1, b2} and {c1, c2} as bases of R^2 (see Section 2.6.1 for a recap). What we effectively perform is a change of basis from (b1, b2) to (c1, c2), and we are looking for the transformation matrix that implements the basis change. Using results from Section 2.7.2, we identify the desired basis change matrix as

    J = [−2  1 ; 1  1] ,                                                (5.72)

such that J b1 = c1 and J b2 = c2. The absolute value of the determinant of J, which yields the scaling factor we are looking for, is given as |det(J)| = 3, i.e., the area of the square spanned by (c1, c2) is three times greater than the area spanned by (b1, b2).
Approach 2. The linear algebra approach works nicely for linear transformations; for nonlinear transformations (which become relevant in Chapter 6), we can follow a more general approach using partial derivatives.
For this approach, we consider a function f : R^2 → R^2 that performs a variable transformation. In our example, f maps the coordinate representation of any vector x ∈ R^2 with respect to b1, b2 onto the coordinate representation with respect to c1, c2, which we will denote by y ∈ R^2. We now want to identify the mapping so that we can compute how an area (or volume) changes when it is being transformed by f. For this we need to find out how f(x) changes if we modify x a bit. This question is exactly answered by the Jacobian matrix df/dx ∈ R^{2×2}. Since we can write

    y1 = −2 x1 + x2                                                     (5.73)
    y2 = x1 + x2                                                        (5.74)
we now have the functional relationship between x and y, which allows us to get the partial derivatives

    ∂y1/∂x1 = −2 ,   ∂y1/∂x2 = 1 ,   ∂y2/∂x1 = 1 ,   ∂y2/∂x2 = 1        (5.75)

and compose the Jacobian as

    J = [∂y1/∂x1  ∂y1/∂x2 ; ∂y2/∂x1  ∂y2/∂x2] = [−2  1 ; 1  1] .        (5.76)

The Jacobian represents the coordinate transformation we are looking for and is exact if the coordinate transformation is linear (as in our case), and (5.76) recovers exactly the basis change matrix in (5.72). If the coordinate transformation is nonlinear, the Jacobian approximates this nonlinear transformation locally with a linear one. The absolute value of the Jacobian determinant |det(J)| is the factor areas or volumes are scaled by when coordinates are transformed. In our case, we obtain |det(J)| = 3. (Geometrically, the Jacobian determinant gives the magnification/scaling factor when we transform an area or volume.)
The Jacobian determinant and variable transformations will become relevant in Section 6.5 when we transform random variables and probability distributions. These transformations are extremely relevant in machine learning in the context of training deep neural networks using the reparametrization trick, also called infinite perturbation analysis.
♦
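
For this particular example, the scaling factor and the basis change can be checked in a few lines (an illustrative sketch assuming NumPy):

    import numpy as np

    J = np.array([[-2.0, 1.0],
                  [ 1.0, 1.0]])         # Jacobian of the mapping, see (5.76)
    b1, b2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])

    print(J @ b1, J @ b2)               # [-2. 1.] and [1. 1.], i.e., c1 and c2
    print(abs(np.linalg.det(J)))        # 3.0: the area scaling factor of the unit square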
[Figure 5.6 Overview of the dimensionality of (partial) derivatives ∂f/∂x for different domains and ranges of f.]

Throughout this chapter, we have encountered derivatives of functions. Figure 5.6 summarizes the dimensions of those gradients. If f : R → R, the gradient is simply a scalar (top-left entry). For f : R^D → R, the gradient is a 1 × D row vector (top-right entry). For f : R → R^E, the gradient is an E × 1 column vector, and for f : R^D → R^E the gradient is an E × D matrix.

Example 5.8 (Gradient of a Vector-Valued Function)
We are given

    f(x) = Ax ,   f(x) ∈ R^M ,   A ∈ R^{M×N} ,   x ∈ R^N .

To compute the gradient df/dx we first determine the dimension of df/dx: Since f : R^N → R^M, it follows that df/dx ∈ R^{M×N}. Second, to compute the gradient we determine the partial derivatives of f with respect to every x_j:

    f_i(x) = Σ_{j=1}^{N} A_ij x_j   ⟹   ∂f_i/∂x_j = A_ij .              (5.77)

Finally, we collect the partial derivatives in the Jacobian and obtain the gradient as

    df/dx = [∂f1/∂x1  · · ·  ∂f1/∂xN ;
                 ⋮               ⋮   ;
             ∂fM/∂x1  · · ·  ∂fM/∂xN]
          = [A11  · · ·  A1N ;
               ⋮            ⋮ ;
             AM1  · · ·  AMN] = A ∈ R^{M×N} .                           (5.78)

Example 5.9 (Chain Rule)
Consider the function h : R → R, h(t) = (f ∘ g)(t) with

    f : R^2 → R                                                         (5.79)
    g : R → R^2                                                         (5.80)
    f(x) = exp(x1 x2^2) ,                                               (5.81)
    x = [x1 ; x2] = g(t) = [t cos t ; t sin t]                          (5.82)

and compute the gradient of h with respect to t.
Since f : R^2 → R and g : R → R^2 we note that

    ∂f/∂x ∈ R^{1×2} ,                                                   (5.83)
    ∂g/∂t ∈ R^{2×1} .                                                   (5.84)

The desired gradient is computed by applying the chain rule:

    dh/dt = (∂f/∂x)(∂x/∂t) = [∂f/∂x1  ∂f/∂x2] [∂x1/∂t ; ∂x2/∂t]         (5.85)
          = [exp(x1 x2^2) x2^2   2 exp(x1 x2^2) x1 x2] [cos t − t sin t ; sin t + t cos t]   (5.86)
          = exp(x1 x2^2) ( x2^2 (cos t − t sin t) + 2 x1 x2 (sin t + t cos t) ) ,   (5.87)

where x1 = t cos t and x2 = t sin t, see (5.82).
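
A small numerical check of (5.87) (an illustrative sketch assuming NumPy; the function names are ours):

    import numpy as np

    def h(t):
        x1, x2 = t * np.cos(t), t * np.sin(t)     # g(t) from (5.82)
        return np.exp(x1 * x2 ** 2)               # f(x) from (5.81)

    def dh_dt(t):
        x1, x2 = t * np.cos(t), t * np.sin(t)
        # chain rule result (5.87)
        return np.exp(x1 * x2 ** 2) * (x2 ** 2 * (np.cos(t) - t * np.sin(t))
                                       + 2 * x1 * x2 * (np.sin(t) + t * np.cos(t)))

    t0, eps = 0.3, 1e-6
    print(dh_dt(t0), (h(t0 + eps) - h(t0 - eps)) / (2 * eps))  # the two values agree closely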

Example 5.10 (Gradient of a Least-Squares Loss in a Linear Model)
Let us consider the linear model

    y = Φθ ,                                                            (5.88)

where θ ∈ R^D is a parameter vector, Φ ∈ R^{N×D} are input features and y ∈ R^N are the corresponding observations. We define the functions

    L(e) := ‖e‖^2 ,                                                     (5.89)
    e(θ) := y − Φθ .                                                    (5.90)

We seek ∂L/∂θ, and we will use the chain rule for this purpose. L is called a least-squares loss function.

Remark. We will return to this setting in Chapter 9 when we discuss linear regression and require the derivatives of the least-squares loss with respect to the model parameters θ. ♦

Before we start our calculation, we determine the dimensionality of the gradient as

    ∂L/∂θ ∈ R^{1×D} .                                                   (5.91)

The chain rule allows us to compute the gradient as

    ∂L/∂θ = (∂L/∂e)(∂e/∂θ) ,                                            (5.92)

i.e., every element is given by

    ∂L/∂θ[1, d] = Σ_{n=1}^{N} ∂L/∂e[n] ∂e/∂θ[n, d] .                    (5.93)

(In code: dLdtheta = np.einsum('n,nd', dLde, dedtheta).)
We know that ‖e‖^2 = e^T e (see Section 3.2) and determine

    ∂L/∂e = 2 e^T ∈ R^{1×N} .                                           (5.94)

Furthermore, we obtain

    ∂e/∂θ = −Φ ∈ R^{N×D} ,                                              (5.95)

such that our desired derivative is

    ∂L/∂θ = −2 e^T Φ = −2 (y^T − θ^T Φ^T) Φ ∈ R^{1×D} .                 (5.96)

Remark. We would have obtained the same result without using the chain rule by immediately looking at the function

    L2(θ) := ‖y − Φθ‖^2 = (y − Φθ)^T (y − Φθ) .                         (5.97)

This approach is still practical for simple functions like L2 but becomes impractical if we consider deep function compositions. ♦
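
The margin note above already hints at an einsum implementation of (5.93). A small sketch (illustrative; it assumes NumPy and randomly generated data) that also compares the result against (5.96):

    import numpy as np

    rng = np.random.default_rng(0)
    N, D = 5, 3
    Phi = rng.normal(size=(N, D))
    y = rng.normal(size=N)
    theta = rng.normal(size=D)

    e = y - Phi @ theta                          # e(theta) from (5.90)
    dLde = 2 * e                                 # dL/de = 2 e^T, stored as a length-N array (5.94)
    dedtheta = -Phi                              # de/dtheta = -Phi (5.95)

    # Chain rule (5.92)/(5.93) as an einsum, as in the margin note:
    dLdtheta = np.einsum('n,nd->d', dLde, dedtheta)
    print(np.allclose(dLdtheta, -2 * e @ Phi))   # matches -2 e^T Phi from (5.96): True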

5.4 Gradients of Matrices

We will encounter situations where we need to take gradients of matrices with respect to vectors (or other matrices), which results in a multi-dimensional tensor. For example, if we compute the gradient of an m × n matrix with respect to a p × q matrix, the resulting Jacobian would be (p × q) × (m × n), i.e., a four-dimensional tensor (or array). Since matrices represent linear mappings, we can exploit the fact that there is a vector-space isomorphism (linear, invertible mapping) between the space R^{m×n} of m × n matrices and the space R^{mn} of mn vectors. Therefore, we can re-shape our matrices into vectors of lengths mn and pq, respectively. The gradient using these mn vectors results in a Jacobian of size pq × mn. Figure 5.7 visualizes both approaches. (Matrices can be transformed into vectors by stacking the columns of the matrix, "flattening".)
In practical applications, it is often desirable to re-shape the matrix into a vector and continue working with this Jacobian matrix: The chain rule (5.56) boils down to simple matrix multiplication, whereas in the case of a Jacobian tensor, we will need to pay more attention to what dimensions we need to sum out.

[Figure 5.7 Visualization of gradient computation of a matrix A ∈ R^{4×2} with respect to a vector x ∈ R^3, for which the gradient dA/dx ∈ R^{4×2×3}. Two equivalent approaches: (a) compute the partial derivatives ∂A/∂x1, ∂A/∂x2, ∂A/∂x3, each of which is a 4 × 2 matrix, and collate them into a 4 × 2 × 3 Jacobian tensor; (b) re-shape (flatten) A into a vector Ã ∈ R^8, compute the Jacobian dÃ/dx ∈ R^{8×3}, and re-shape it into the 4 × 2 × 3 Jacobian tensor.]

Example 5.11 (Gradient of Vectors with Respect to Matrices)
Let us consider the following example, where

    f = Ax ,   f ∈ R^M ,   A ∈ R^{M×N} ,   x ∈ R^N                      (5.98)

and where we seek the gradient df/dA. Let us start again by determining the dimension of the gradient as

    df/dA ∈ R^{M×(M×N)} .                                               (5.99)

By definition, the gradient is the collection of the partial derivatives:

    df/dA = [∂f1/∂A ; . . . ; ∂fM/∂A] ,   ∂f_i/∂A ∈ R^{1×(M×N)} .       (5.100)

To compute the partial derivatives, it will be helpful to explicitly write out the matrix-vector multiplication:

    f_i = Σ_{j=1}^{N} A_ij x_j ,   i = 1, . . . , M ,                   (5.101)

and the partial derivatives are then given as

    ∂f_i/∂A_iq = x_q .                                                  (5.102)

This allows us to compute the partial derivatives of f_i with respect to a row of A, which is given as

    ∂f_i/∂A_{i,:} = x^T ∈ R^{1×1×N} ,                                   (5.103)
    ∂f_i/∂A_{k≠i,:} = 0^T ∈ R^{1×1×N} ,                                 (5.104)

where we have to pay attention to the correct dimensionality. Since f_i maps onto R and each row of A is of size 1 × N, we obtain a 1 × 1 × N-sized tensor as the partial derivative of f_i with respect to a row of A.
We stack the partial derivatives to obtain the desired gradient as

    ∂f_i/∂A = [0^T ; . . . ; 0^T ; x^T ; 0^T ; . . . ; 0^T] ∈ R^{1×(M×N)} ,   (5.105)

with x^T in the i-th position.
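
To make the tensor shape concrete, here is a small sketch (illustrative, assuming NumPy and the indexing convention dfdA[i, k, q] = ∂f_i/∂A_{kq}) that builds df/dA from (5.103)-(5.105) and checks one entry by finite differences:

    import numpy as np

    rng = np.random.default_rng(1)
    M, N = 4, 3
    A = rng.normal(size=(M, N))
    x = rng.normal(size=N)

    # Analytic gradient: dfdA[i, k, q] = x_q if k == i else 0, see (5.103)-(5.105).
    dfdA = np.zeros((M, M, N))
    for i in range(M):
        dfdA[i, i, :] = x

    # Finite-difference check of a single entry d f_1 / d A_{1,2}:
    h = 1e-6
    A_pert = A.copy()
    A_pert[1, 2] += h
    fd = ((A_pert @ x) - (A @ x))[1] / h
    print(np.isclose(dfdA[1, 1, 2], fd))    # True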

Example 5.12 (Gradient of Matrices with Respect to Matrices)
Consider a matrix L ∈ R^{m×n} and f : R^{m×n} → R^{n×n} with

    f(L) = L^T L =: K ∈ R^{n×n} ,                                       (5.106)

where we seek the gradient dK/dL. To solve this hard problem, let us first write down what we already know: We know that the gradient has the dimensions

    dK/dL ∈ R^{(n×n)×(m×n)} ,                                           (5.107)

which is a tensor. If we compute the partial derivative of f with respect to a single entry L_ij, i ∈ {1, . . . , m}, j ∈ {1, . . . , n}, of L, we obtain an n × n matrix

    ∂K/∂L_ij ∈ R^{n×n} .                                                (5.108)

Furthermore, we know that

    dK_pq/dL ∈ R^{1×m×n}                                                (5.109)

for p, q = 1, . . . , n, where K_pq = f_pq(L) is the (p, q)-th entry of K = f(L).
Denoting the i-th column of L by l_i, we see that every entry of K is given by an inner product of two columns of L, i.e.,

    K_pq = l_p^T l_q = Σ_{k=1}^{m} L_kp L_kq .                          (5.110)

When we now compute the partial derivative ∂K_pq/∂L_ij, we obtain

    ∂K_pq/∂L_ij = Σ_{k=1}^{m} ∂/∂L_ij (L_kp L_kq) = ∂_pqij ,            (5.111)

    ∂_pqij = L_iq    if j = p, p ≠ q
             L_ip    if j = q, p ≠ q
             2 L_iq  if j = p, p = q
             0       otherwise .                                        (5.112)

From (5.107), we know that the desired gradient has the dimension (n × n) × (m × n), and every single entry of this tensor is given by ∂_pqij in (5.112), where p, q, j = 1, . . . , n and i = 1, . . . , m.

5.5 Useful Identities for Computing Gradients

In the following, we list some useful gradients that are frequently required in a machine learning context (Petersen and Pedersen, 2012):

    ∂/∂X f(X)^T = (∂f(X)/∂X)^T                                          (5.113)
    ∂/∂X tr(f(X)) = tr(∂f(X)/∂X)                                        (5.114)
    ∂/∂X det(f(X)) = det(f(X)) tr( f^{−1}(X) ∂f(X)/∂X )                 (5.115)
    ∂/∂X f^{−1}(X) = −f^{−1}(X) (∂f(X)/∂X) f^{−1}(X)                    (5.116)
    ∂(a^T X^{−1} b)/∂X = −(X^{−1})^T a b^T (X^{−1})^T                   (5.117)
    ∂(x^T a)/∂x = a^T                                                   (5.118)
    ∂(a^T x)/∂x = a^T                                                   (5.119)
    ∂(a^T X b)/∂X = a b^T                                               (5.120)
    ∂(x^T B x)/∂x = x^T (B + B^T)                                       (5.121)
    ∂/∂s (x − A s)^T W (x − A s) = −2 (x − A s)^T W A   for symmetric W (5.122)
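
Identities like these are easy to spot-check numerically. A minimal sketch (illustrative, assuming NumPy) that verifies (5.121) with central finite differences:

    import numpy as np

    rng = np.random.default_rng(2)
    D = 4
    B = rng.normal(size=(D, D))
    x = rng.normal(size=D)

    f = lambda v: v @ B @ v                      # f(x) = x^T B x

    h = 1e-6
    numeric = np.array([(f(x + h * e) - f(x - h * e)) / (2 * h) for e in np.eye(D)])
    analytic = x @ (B + B.T)                     # identity (5.121), as a row vector
    print(np.allclose(numeric, analytic))        # True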

5.6 Backpropagation and Automatic Differentiation

In many machine learning applications, we find good model parameters by performing gradient descent (Chapter 7), which relies on the fact that we can compute the gradient of a learning objective with respect to the parameters of the model. For a given objective function, we can obtain the gradient with respect to the model parameters using calculus and applying the chain rule, see Section 5.2.2. We already had a taste in Section 5.3 when we looked at the gradient of a squared loss with respect to the parameters of a linear regression model.
Consider the function

    f(x) = √(x^2 + exp(x^2)) + cos(x^2 + exp(x^2)) .                    (5.123)

By application of the chain rule, and noting that differentiation is linear, we compute the gradient

    df/dx = (2x + 2x exp(x^2)) / (2 √(x^2 + exp(x^2))) − sin(x^2 + exp(x^2)) (2x + 2x exp(x^2))
          = 2x ( 1 / (2 √(x^2 + exp(x^2))) − sin(x^2 + exp(x^2)) ) (1 + exp(x^2)) .   (5.124)

Writing out the gradient in this explicit way is often impractical since it often results in a very lengthy expression for a derivative. In practice, it means that, if we are not careful, the implementation of the gradient could be significantly more expensive than computing the function, which is an unnecessary overhead. For training deep neural network models, the backpropagation algorithm (Kelley, 1960; Bryson, 1961; Dreyfus, 1962; Rumelhart et al., 1986) is an efficient way to compute the gradient of an error function with respect to the parameters of the model.

5.6.1 Gradients in a Deep Network

In machine learning, the chain rule plays an important role when optimizing parameters of a hierarchical model (e.g., for maximum likelihood estimation). An area where the chain rule is used to an extreme is Deep Learning where the function value y is computed as a deep function composition

    y = (f_K ∘ f_{K−1} ∘ · · · ∘ f_1)(x) = f_K(f_{K−1}(· · · (f_1(x)) · · ·)) ,   (5.125)

where x are the inputs (e.g., images), y are the observations (e.g., class labels) and every function f_i, i = 1, . . . , K, possesses its own parameters. In neural networks with multiple layers, we have functions f_i(x_{i−1}) = σ(A_i x_{i−1} + b_i) in the i-th layer. Here x_{i−1} is the output of layer i − 1 and σ an activation function, such as the logistic sigmoid 1/(1 + e^{−x}), tanh or a rectified linear unit (ReLU). (We discuss the case where the activation functions are identical in each layer to avoid cluttering the notation.) In order to train these models, we require the gradient of a loss function L with respect to all model parameters A_j, b_j for j = 0, . . . , K − 1. This also requires us to compute the gradient of L with respect to the inputs of each layer. For example, if we have inputs x and observations y and a network structure defined by

    f_0 := x                                                            (5.126)
    f_i := σ_i(A_{i−1} f_{i−1} + b_{i−1}) ,   i = 1, . . . , K ,         (5.127)

see also Figure 5.8 for a visualization, we may be interested in finding A_j, b_j for j = 0, . . . , K − 1, such that the squared loss

    L(θ) = ‖y − f_K(θ, x)‖^2                                            (5.128)

is minimized, where θ = {A_0, b_0, . . . , A_{K−1}, b_{K−1}}.

[Figure 5.8 Forward pass in a multi-layer neural network to compute the loss L as a function of the inputs x and the parameters A_i, b_i.]
[Figure 5.9 Backward pass in a multi-layer neural network to compute the gradients of the loss function.]


To obtain the gradients with respect to the parameter set θ, we require the partial derivatives of L with respect to the parameters θ_j = {A_j, b_j} of each layer j = 0, . . . , K − 1. The chain rule allows us to determine the partial derivatives as

    ∂L/∂θ_{K−1} = ∂L/∂f_K · ∂f_K/∂θ_{K−1}                               (5.129)
    ∂L/∂θ_{K−2} = ∂L/∂f_K · ∂f_K/∂f_{K−1} · ∂f_{K−1}/∂θ_{K−2}           (5.130)
    ∂L/∂θ_{K−3} = ∂L/∂f_K · ∂f_K/∂f_{K−1} · ∂f_{K−1}/∂f_{K−2} · ∂f_{K−2}/∂θ_{K−3}   (5.131)
    ∂L/∂θ_i = ∂L/∂f_K · ∂f_K/∂f_{K−1} · · · ∂f_{i+2}/∂f_{i+1} · ∂f_{i+1}/∂θ_i   (5.132)

The orange terms are partial derivatives of the output of a layer with respect to its inputs, whereas the blue terms are partial derivatives of the output of a layer with respect to its parameters. Assuming we have already computed the partial derivatives ∂L/∂θ_{i+1}, most of the computation can be reused to compute ∂L/∂θ_i. The additional terms that we need to compute are indicated by the boxes. Figure 5.9 visualizes that the gradients are passed backward through the network. A more in-depth discussion about gradients of neural networks can be found at https://tinyurl.com/yalcxgtv.
There are efficient ways of implementing this repeated application of the chain rule using backpropagation (Kelley, 1960; Bryson, 1961; Dreyfus, 1962; Rumelhart et al., 1986). A good discussion about backpropagation and the chain rule is available at https://tinyurl.com/ycfm2yrw.

[Figure 5.10 Simple graph illustrating the flow of data from x to y via some intermediate variables a, b.]

5.6.2 Automatic Differentiation

It turns out that backpropagation is a special case of a general technique in numerical analysis called automatic differentiation. We can think of automatic differentiation as a set of techniques to numerically (in contrast to symbolically) evaluate the exact (up to machine precision) gradient of a function by working with intermediate variables and applying the chain rule. (Automatic differentiation is different from symbolic differentiation and numerical approximations of the gradient, e.g., by using finite differences.) Automatic differentiation applies a series of elementary arithmetic operations, e.g., addition and multiplication, and elementary functions, e.g., sin, cos, exp, log. By applying the chain rule to these operations, the gradient of quite complicated functions can be computed automatically. Automatic differentiation applies to general computer programs and has forward and reverse modes.
Figure 5.10 shows a simple graph representing the data flow from inputs x to outputs y via some intermediate variables a, b. If we were to compute the derivative dy/dx, we would apply the chain rule and obtain

    dy/dx = (dy/db)(db/da)(da/dx) .                                     (5.133)
Intuitively, the forward and reverse mode differ in the order of multiplication. Due to the associativity of matrix multiplication we can choose between

    dy/dx = ( (dy/db)(db/da) ) (da/dx) ,                                (5.134)
    dy/dx = (dy/db) ( (db/da)(da/dx) ) .                                (5.135)

(In the general case, we work with Jacobians, which can be vectors, matrices or tensors.) Equation (5.134) would be the reverse mode because gradients are propagated backward through the graph, i.e., reverse to the data flow. Equation (5.135) would be the forward mode, where the gradients flow with the data from left to right through the graph.
In the following, we will focus on reverse mode automatic differentiation, which is backpropagation. In the context of neural networks, where the input dimensionality is often much higher than the dimensionality of the labels, the reverse mode is computationally significantly cheaper than the forward mode. Let us start with an instructive example.

Example 5.13
Consider the function

    f(x) = √(x^2 + exp(x^2)) + cos(x^2 + exp(x^2))                      (5.136)

from (5.123). If we were to implement a function f on a computer, we would be able to save some computation by using intermediate variables:

    a = x^2 ,                                                           (5.137)
    b = exp(a) ,                                                        (5.138)
    c = a + b ,                                                         (5.139)
    d = √c ,                                                            (5.140)
    e = cos(c) ,                                                        (5.141)
    f = d + e .                                                         (5.142)


[Figure 5.11 Computation graph with inputs x, function values f, and intermediate variables a, b, c, d, e.]

This is the same kind of thinking process that occurs when applying the chain rule. Observe that the above set of equations requires fewer operations than a direct naive implementation of the function f(x) as defined in (5.123). The corresponding computation graph in Figure 5.11 shows the flow of data and computations required to obtain the function value f.
The set of equations that include intermediate variables can be thought of as a computation graph, a representation that is widely used in implementations of neural network software libraries. We can directly compute the derivatives of the intermediate variables with respect to their corresponding inputs by recalling the definition of the derivative of elementary functions. We obtain:

    ∂a/∂x = 2x ,                                                        (5.143)
    ∂b/∂a = exp(a) ,                                                    (5.144)
    ∂c/∂a = 1 = ∂c/∂b ,                                                 (5.145)
    ∂d/∂c = 1/(2√c) ,                                                   (5.146)
    ∂e/∂c = −sin(c) ,                                                   (5.147)
    ∂f/∂d = 1 = ∂f/∂e .                                                 (5.148)
By looking at the computation graph in Figure 5.11, we can compute ∂f/∂x by working backward from the output, and we obtain the following relations:

    ∂f/∂c = (∂f/∂d)(∂d/∂c) + (∂f/∂e)(∂e/∂c) ,                           (5.149)
    ∂f/∂b = (∂f/∂c)(∂c/∂b) ,                                            (5.150)
    ∂f/∂a = (∂f/∂b)(∂b/∂a) + (∂f/∂c)(∂c/∂a) ,                           (5.151)
    ∂f/∂x = (∂f/∂a)(∂a/∂x) .                                            (5.152)

Note that we have implicitly applied the chain rule to obtain ∂f/∂x. By substituting the results of the derivatives of the elementary functions, we get

    ∂f/∂c = 1 · 1/(2√c) + 1 · (−sin(c)) ,                               (5.153)
    ∂f/∂b = ∂f/∂c · 1 ,                                                 (5.154)
    ∂f/∂a = ∂f/∂b · exp(a) + ∂f/∂c · 1 ,                                (5.155)
    ∂f/∂x = ∂f/∂a · 2x .                                                (5.156)

By thinking of each of the derivatives above as a variable, we observe that the computation required for calculating the derivative is of similar complexity as the computation of the function itself. This is quite counter-intuitive since the mathematical expression for the derivative ∂f/∂x in (5.124) is significantly more complicated than the mathematical expression of the function f(x) in (5.123).
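
The forward and backward passes of this example translate almost line by line into code. The following sketch (illustrative; it assumes only the standard math module, and the function names are ours) computes f(x) and ∂f/∂x via the intermediate variables and compares with the closed-form gradient (5.124):

    import math

    def f_and_grad(x):
        # Forward pass (5.137)-(5.142)
        a = x ** 2
        b = math.exp(a)
        c = a + b
        d = math.sqrt(c)
        e = math.cos(c)
        f = d + e

        # Backward pass (5.149)-(5.156): accumulate df/d(variable) from output to input
        df_dd, df_de = 1.0, 1.0                                          # (5.148)
        df_dc = df_dd * (1 / (2 * math.sqrt(c))) + df_de * (-math.sin(c))  # (5.149)
        df_db = df_dc * 1.0                                              # (5.150)
        df_da = df_db * math.exp(a) + df_dc * 1.0                        # (5.151)
        df_dx = df_da * 2 * x                                            # (5.152)
        return f, df_dx

    def closed_form_grad(x):
        # Explicit derivative (5.124)
        s = x ** 2 + math.exp(x ** 2)
        return 2 * x * (0.5 / math.sqrt(s) - math.sin(s)) * (1 + math.exp(x ** 2))

    value, grad = f_and_grad(0.8)
    print(grad, closed_form_grad(0.8))           # the two gradients agree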

Automatic differentiation is a formalization of the example above. Let x1, . . . , x_d be the input variables to the function, x_{d+1}, . . . , x_{D−1} be the intermediate variables and x_D the output variable. Then the computation graph can be expressed as an equation

    For i = d + 1, . . . , D :   x_i = g_i(x_{Pa(x_i)}) ,               (5.157)

where g_i(·) are elementary functions and x_{Pa(x_i)} are the parent nodes of the variable x_i in the graph. Given a function defined in this way, we can use the chain rule to compute the derivative of the function in a step-by-step fashion. Recall that by definition f = x_D and hence

    ∂f/∂x_D = 1 .                                                       (5.158)

For other variables x_i, we apply the chain rule

    ∂f/∂x_i = Σ_{x_j : x_i ∈ Pa(x_j)} (∂f/∂x_j)(∂x_j/∂x_i) = Σ_{x_j : x_i ∈ Pa(x_j)} (∂f/∂x_j)(∂g_j/∂x_i) ,   (5.159)

where Pa(x_j) is the set of parent nodes of x_j in the computation graph. Equation (5.157) is the forward propagation of a function, whereas (5.159) is the backpropagation of the gradient through the computation graph. For neural network training we backpropagate the error of the prediction with respect to the label. (Auto-differentiation in reverse mode requires a parse tree.)
The automatic differentiation approach above works whenever we have a function that can be expressed as a computation graph, where the elementary functions are differentiable. In fact, the function may not even be a mathematical function but a computer program. However, not all computer programs can be automatically differentiated, e.g., if we cannot find differentiable elementary functions. Programming structures, such as for loops and if statements, require more care as well.

5.7 Higher-order Derivatives

So far, we discussed gradients, i.e., first-order derivatives. Sometimes, we are interested in derivatives of higher order, e.g., when we want to use Newton's Method for optimization, which requires second-order derivatives (Nocedal and Wright, 2006). In Section 5.1.1, we discussed the Taylor series to approximate functions using polynomials. In the multivariate case, we can do exactly the same. In the following, we will do exactly this. But let us start with some notation.
Consider a function f : R^2 → R of two variables x, y. We use the following notation for higher-order partial derivatives (and for gradients):

• ∂^2 f/∂x^2 is the second partial derivative of f with respect to x
• ∂^n f/∂x^n is the n-th partial derivative of f with respect to x
• ∂^2 f/∂y∂x = ∂/∂y (∂f/∂x) is the partial derivative obtained by first partial differentiating with respect to x and then with respect to y
• ∂^2 f/∂x∂y is the partial derivative obtained by first partial differentiating with respect to y and then with respect to x

The Hessian is the collection of all second-order partial derivatives.


[Figure 5.12 Linear approximation of a function. The original function f is linearized at x0 = −2 using a first-order Taylor series expansion, f(x0) + f'(x0)(x − x0).]

If f(x, y) is a twice (continuously) differentiable function then

    ∂^2 f/∂x∂y = ∂^2 f/∂y∂x ,                                           (5.160)

i.e., the order of differentiation does not matter, and the corresponding Hessian matrix

    H = [∂^2 f/∂x^2   ∂^2 f/∂x∂y ;
         ∂^2 f/∂x∂y   ∂^2 f/∂y^2]                                       (5.161)

is symmetric. Generally, for x ∈ R^n and f : R^n → R, the Hessian is an n × n matrix. The Hessian measures the local geometry of curvature.

Remark (Hessian of a Vector Field). If f : R^n → R^m is a vector field, the Hessian is an (m × n × n)-tensor. ♦
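
A small numerical sketch (illustrative, assuming NumPy; the test function is an arbitrary smooth choice of ours) that builds the Hessian of a two-variable function from central second differences and confirms its symmetry:

    import numpy as np

    f = lambda x, y: x ** 2 * y + x * y ** 3     # arbitrary smooth test function

    def hessian(f, x, y, h=1e-4):
        """Approximate the 2x2 Hessian by central second differences."""
        fxx = (f(x + h, y) - 2 * f(x, y) + f(x - h, y)) / h ** 2
        fyy = (f(x, y + h) - 2 * f(x, y) + f(x, y - h)) / h ** 2
        fxy = (f(x + h, y + h) - f(x + h, y - h)
               - f(x - h, y + h) + f(x - h, y - h)) / (4 * h ** 2)
        return np.array([[fxx, fxy], [fxy, fyy]])

    H = hessian(f, 1.0, 2.0)
    print(H)                                     # close to [[4., 14.], [14., 12.]], and symmetric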

5.8 Linearization and Multivariate Taylor Series

The gradient ∇f of a function f is often used for a locally linear approximation of f around x0:

    f(x) ≈ f(x0) + (∇x f)(x0) (x − x0) .                                (5.162)

Here (∇x f)(x0) is the gradient of f with respect to x, evaluated at x0. Figure 5.12 illustrates the linear approximation of a function f at an input x0. The original function is approximated by a straight line. This approximation is locally accurate, but the further we move away from x0 the worse the approximation gets. Equation (5.162) is a special case of a multivariate Taylor series expansion of f at x0, where we consider only the first two terms. We discuss the more general case in the following, which will allow for better approximations.

Definition 5.7 (Multivariate Taylor Series). For the multivariate Taylor series, we consider a function

    f : R^D → R                                                         (5.163)
    x ↦ f(x) ,   x ∈ R^D ,                                              (5.164)

that is smooth at x0. When we define the difference vector δ := x − x0, the Taylor series of f at x0 is defined as

    f(x) = Σ_{k=0}^{∞} (D_x^k f(x0)/k!) δ^k ,                           (5.165)

where D_x^k f(x0) is the k-th (total) derivative of f with respect to x, evaluated at x0.

[Figure 5.13 Visualizing outer products. Outer products of vectors increase the dimensionality of the array by 1 per term. (a) Given a vector δ ∈ R^4, we obtain the outer product δ^2 := δ ⊗ δ = δδ^T ∈ R^{4×4} as a matrix. (b) An outer product δ^3 := δ ⊗ δ ⊗ δ ∈ R^{4×4×4} results in a third-order tensor ("three-dimensional matrix"), i.e., an array with three indexes.]

Definition 5.8 (Taylor Polynomial). The Taylor polynomial of degree n of f at x0 contains the first n + 1 components of the series in (5.165) and is defined as

    Tn = Σ_{k=0}^{n} (D_x^k f(x0)/k!) δ^k .                             (5.166)

Remark (Notation). In (5.165) and (5.166), we used the slightly sloppy notation of δ^k, which is not defined for vectors x ∈ R^D, D > 1, and k > 1. Note that both D_x^k f and δ^k are k-th order tensors, i.e., k-dimensional arrays. The k-th order tensor δ^k ∈ R^{D×D×···×D} (k times) is obtained as a k-fold outer product, denoted by ⊗, of the vector δ ∈ R^D. For example,

    δ^2 = δ ⊗ δ = δδ^T ,        δ^2[i, j] = δ[i] δ[j] ,                  (5.167)
    δ^3 = δ ⊗ δ ⊗ δ ,           δ^3[i, j, k] = δ[i] δ[j] δ[k] .          (5.168)

(A vector can be implemented as a 1-dimensional array, a matrix as a 2-dimensional array.)

Figure 5.13 visualizes two such outer products. In general, we obtain the following terms in the Taylor series:

    D_x^k f(x0) δ^k = Σ_a · · · Σ_k D_x^k f(x0)[a, . . . , k] δ[a] · · · δ[k] ,   (5.169)

where D_x^k f(x0) δ^k contains k-th order polynomials.
Now that we defined the Taylor series for vector fields, let us explicitly write down the first terms D_x^k f(x0) δ^k of the Taylor series expansion for k = 0, . . . , 3 and δ := x − x0:

    k = 0 :  D_x^0 f(x0) δ^0 = f(x0) ∈ R                                (5.170)
    k = 1 :  D_x^1 f(x0) δ^1 = ∇x f(x0) δ = Σ_i ∇x f(x0)[i] δ[i] ∈ R    (5.171)
    k = 2 :  D_x^2 f(x0) δ^2 = tr(H δ δ^T) = δ^T H δ                    (5.172)
                             = Σ_i Σ_j H[i, j] δ[i] δ[j] ∈ R            (5.173)
    k = 3 :  D_x^3 f(x0) δ^3 = Σ_i Σ_j Σ_k D_x^3 f(x0)[i, j, k] δ[i] δ[j] δ[k] ∈ R   (5.174)

Here ∇x f(x0) ∈ R^{1×D}, δ ∈ R^{D×1} and H ∈ R^{D×D}. (In code: np.einsum('i,i', Df1, d), np.einsum('ij,i,j', Df2, d, d), np.einsum('ijk,i,j,k', Df3, d, d, d).) ♦

Example 5.14 (Taylor-Series Expansion of a Function with Two Variables)
Consider the function

    f(x, y) = x^2 + 2xy + y^3 .                                         (5.175)

We want to compute the Taylor series expansion of f at (x0, y0) = (1, 2). Before we start, let us discuss what to expect: The function in (5.175) is a polynomial of degree 3. We are looking for a Taylor series expansion, which itself is a linear combination of polynomials. Therefore, we do not expect the Taylor series expansion to contain terms of fourth or higher order to express a third-order polynomial. This means, it should be sufficient to determine the first four terms of (5.165) for an exact alternative representation of (5.175).
To determine the Taylor series expansion, we start off with the constant term and the first-order derivatives, which are given by

    f(1, 2) = 13                                                        (5.176)
    ∂f/∂x = 2x + 2y    ⟹   ∂f/∂x (1, 2) = 6                             (5.177)
    ∂f/∂y = 2x + 3y^2  ⟹   ∂f/∂y (1, 2) = 14 .                          (5.178)

Therefore, we obtain

    D^1_{x,y} f(1, 2) = ∇_{x,y} f(1, 2) = [∂f/∂x (1, 2)  ∂f/∂y (1, 2)] = [6  14] ∈ R^{1×2}   (5.179)

such that

    (D^1_{x,y} f(1, 2)/1!) δ = [6  14] [x − 1 ; y − 2] = 6(x − 1) + 14(y − 2) .   (5.180)

Note that D^1_{x,y} f(1, 2) δ contains only linear terms, i.e., first-order polynomials.
The second-order partial derivatives are given by

    ∂^2 f/∂x^2 = 2    ⟹   ∂^2 f/∂x^2 (1, 2) = 2                          (5.181)
    ∂^2 f/∂y^2 = 6y   ⟹   ∂^2 f/∂y^2 (1, 2) = 12                         (5.182)
    ∂^2 f/∂y∂x = 2    ⟹   ∂^2 f/∂y∂x (1, 2) = 2                          (5.183)
    ∂^2 f/∂x∂y = 2    ⟹   ∂^2 f/∂x∂y (1, 2) = 2 .                        (5.184)

When we collect the second-order partial derivatives, we obtain the Hessian

    H = [∂^2 f/∂x^2   ∂^2 f/∂x∂y ;
         ∂^2 f/∂y∂x   ∂^2 f/∂y^2] = [2  2 ; 2  6y] ,                     (5.185)

such that

    H(1, 2) = [2  2 ; 2  12] ∈ R^{2×2} .                                 (5.186)
Therefore, the next term of the Taylor-series expansion is given by

    (D^2_{x,y} f(1, 2)/2!) δ^2 = ½ δ^T H(1, 2) δ                         (5.187)
        = ½ [x − 1  y − 2] [2  2 ; 2  12] [x − 1 ; y − 2]                (5.188)
        = (x − 1)^2 + 2(x − 1)(y − 2) + 6(y − 2)^2 .                     (5.189)

Here, D^2_{x,y} f(1, 2) δ^2 contains only quadratic terms, i.e., second-order polynomials.
The third-order derivatives are obtained as

    D^3_{x,y} f = [∂H/∂x  ∂H/∂y] ∈ R^{2×2×2} ,                           (5.190)


    D^3_{x,y} f[:, :, 1] = ∂H/∂x = [∂^3 f/∂x^3      ∂^3 f/∂x^2∂y ;
                                    ∂^3 f/∂x∂y∂x    ∂^3 f/∂x∂y^2] ,      (5.191)
    D^3_{x,y} f[:, :, 2] = ∂H/∂y = [∂^3 f/∂y∂x^2    ∂^3 f/∂y∂x∂y ;
                                    ∂^3 f/∂y^2∂x    ∂^3 f/∂y^3] .        (5.192)

Since most second-order partial derivatives in the Hessian in (5.185) are constant, the only non-zero third-order partial derivative is

    ∂^3 f/∂y^3 = 6   ⟹   ∂^3 f/∂y^3 (1, 2) = 6 .                         (5.193)

Higher-order derivatives and the mixed derivatives of degree 3 (e.g., ∂^3 f/∂x^2∂y) vanish, such that

    D^3_{x,y} f[:, :, 1] = [0  0 ; 0  0] ,   D^3_{x,y} f[:, :, 2] = [0  0 ; 0  6]   (5.194)

and

    (D^3_{x,y} f(1, 2)/3!) δ^3 = (y − 2)^3 ,                             (5.195)

which collects all cubic terms (third-order polynomials) of the Taylor series.
Overall, the (exact) Taylor series expansion of f at (x0, y0) = (1, 2) is

    f(x) = f(1, 2) + D^1_{x,y} f(1, 2) δ + (D^2_{x,y} f(1, 2)/2!) δ^2 + (D^3_{x,y} f(1, 2)/3!) δ^3   (5.196)
         = f(1, 2) + ∂f(1, 2)/∂x (x − 1) + ∂f(1, 2)/∂y (y − 2)           (5.197)
           + 1/2! [ ∂^2 f(1, 2)/∂x^2 (x − 1)^2 + ∂^2 f(1, 2)/∂y^2 (y − 2)^2   (5.198)
           + 2 ∂^2 f(1, 2)/∂x∂y (x − 1)(y − 2) ] + 1/6 ∂^3 f(1, 2)/∂y^3 (y − 2)^3   (5.199)
         = 13 + 6(x − 1) + 14(y − 2)                                     (5.200)
           + (x − 1)^2 + 6(y − 2)^2 + 2(x − 1)(y − 2) + (y − 2)^3 .      (5.201)

In this case, we obtained an exact Taylor series expansion of the polynomial in (5.175), i.e., the polynomial in (5.201) is identical to the original polynomial in (5.175). In this particular example, this result is not surprising since the original function was a third-order polynomial, which we expressed through a linear combination of constant terms, first-order, second-order and third-order polynomials in (5.201).
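
The margin hints from (5.170)-(5.174) suggest evaluating the terms D^k f(x0) δ^k with np.einsum. A sketch for this example (illustrative, assuming NumPy; the variable names Df1, Df2, Df3 follow those hints) that checks the expansion at an arbitrary test point:

    import numpy as np

    f = lambda x, y: x ** 2 + 2 * x * y + y ** 3       # (5.175)
    x0 = np.array([1.0, 2.0])

    Df1 = np.array([6.0, 14.0])                        # gradient at (1, 2), (5.179)
    Df2 = np.array([[2.0, 2.0], [2.0, 12.0]])          # Hessian at (1, 2), (5.186)
    Df3 = np.zeros((2, 2, 2)); Df3[1, 1, 1] = 6.0      # third derivatives at (1, 2), (5.194)

    x = np.array([1.7, 2.4])                           # arbitrary test point
    d = x - x0                                         # delta = x - x0

    taylor = (f(*x0)
              + np.einsum('i,i', Df1, d)
              + np.einsum('ij,i,j', Df2, d, d) / 2.0
              + np.einsum('ijk,i,j,k', Df3, d, d, d) / 6.0)
    print(taylor, f(*x))                               # both values are 24.874: the expansion is exact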


5.9 Further Reading

In machine learning (and other disciplines), we often need to compute expectations, i.e., we need to solve integrals of the form

    E_x[f(x)] = ∫ f(x) p(x) dx .                                        (5.202)

Even if p(x) is in a convenient form (e.g., Gaussian), this integral generally cannot be solved analytically. The Taylor series expansion of f is one way of finding an approximate solution: Assuming p(x) = N(µ, Σ) is Gaussian, then the first-order Taylor series expansion around µ locally linearizes the nonlinear function f. For linear functions, we can compute the mean (and the covariance) exactly if p(x) is Gaussian distributed (see Section 6.6). This property is heavily exploited by the Extended Kalman Filter (Maybeck, 1979) for online state estimation in nonlinear dynamical systems (also called "state-space models"). Other deterministic ways to approximate the integral in (5.202) are the unscented transform (Julier and Uhlmann, 1997), which does not require any gradients, or the Laplace approximation (Bishop, 2006), which uses the Hessian for a local Gaussian approximation of p(x) at the posterior mean.

Exercises
5.1 Consider the following functions:

        f1(x) = sin(x1) cos(x2) ,   x ∈ R^2                             (5.203)
        f2(x, y) = x^T y ,   x, y ∈ R^n                                 (5.204)
        f3(x) = x x^T ,   x ∈ R^n                                       (5.205)

    1. What are the dimensions of ∂f_i/∂x?
    2. Compute the Jacobians.
5.2 Differentiate f with respect to t and g with respect to X, where

        f(t) = sin(log(t^T t)) ,   t ∈ R^D                              (5.206)
        g(X) = tr(AXB) ,   A ∈ R^{D×E} ,  X ∈ R^{E×F} ,  B ∈ R^{F×D} ,  (5.207)

    where tr denotes the trace.
5.3 Compute the derivatives df/dx of the following functions by using the chain rule. Provide the dimensions of every single partial derivative. Describe your steps in detail.
    1. f(z) = log(1 + z) ,   z = x^T x ,   x ∈ R^D
    2. f(z) = sin(z) ,   z = Ax + b ,   A ∈ R^{E×D} ,  x ∈ R^D ,  b ∈ R^E ,
       where sin(·) is applied to every element of z.

5.4 Compute the derivatives df/dx of the following functions. Describe your steps in detail.
    1. Use the chain rule. Provide the dimensions of every single partial derivative.

           f(z) = exp(−½ z)
           z = g(y) = y^T S^{−1} y
           y = h(x) = x − µ

       where x, µ ∈ R^D, S ∈ R^{D×D}.
    2. f(x) = tr(x x^T + σ^2 I) ,   x ∈ R^D.
       Here tr(A) is the trace of A, i.e., the sum of the diagonal elements A_ii. Hint: Explicitly write out the outer product.
    3. Use the chain rule. Provide the dimensions of every single partial derivative. You do not need to compute the product of the partial derivatives explicitly.

           f = tanh(z) ∈ R^M
           z = Ax + b ,   x ∈ R^N ,  A ∈ R^{M×N} ,  b ∈ R^M .

       Here, tanh is applied to every component of z.