Linear Algebra
1. Geometric vectors. This example of a vector may be familiar from school. Geometric vectors are directed segments, which can be drawn; see Figure 2.1(a). Two geometric vectors \vec{x}, \vec{y} can be added, such that \vec{x} + \vec{y} = \vec{z} is another geometric vector. Furthermore, multiplication by a scalar λ\vec{x}, λ ∈ R, is also a geometric vector. In fact, it is the original vector scaled by λ. Therefore, geometric vectors are instances of the vector concept introduced above.
2. Polynomials are also vectors; see Figure 2.1(b): Two polynomials can be added together, which results in another polynomial; and they can be multiplied by a scalar λ ∈ R, and the result is a polynomial as well. Therefore, polynomials are (rather unusual) instances of vectors.

[Figure 2.1: Different types of vectors. Vectors can be surprising objects, including (a) geometric vectors and (b) polynomials.]
Note that polynomials are very different from geometric vectors. While geometric vectors are concrete "drawings", polynomials are abstract concepts. However, they are both vectors in the sense described above.
3. Audio signals are vectors. Audio signals are represented as a series of numbers. We can add audio signals together, and their sum is a new audio signal. If we scale an audio signal, we also obtain an audio signal. Therefore, audio signals are a type of vector, too.
4. Elements of R^n are vectors. In other words, we can consider each element of R^n (the tuple of n real numbers) to be a vector. R^n is more abstract than polynomials, and it is the concept we focus on in this book. For example,

    a = \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix} \in \mathbb{R}^3    (2.1)

is an example of a triplet of numbers. Adding two vectors a, b ∈ R^n component-wise results in another vector: a + b = c ∈ R^n. Moreover, multiplying a ∈ R^n by λ ∈ R results in a scaled vector λa ∈ R^n.
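These operations are directly available in numerical software. A minimal NumPy sketch (our own illustration, not part of the original text) of vector addition and scalar multiplication in R^3:

    import numpy as np

    a = np.array([1.0, 2.0, 3.0])   # a vector in R^3, cf. (2.1)
    b = np.array([4.0, 5.0, 6.0])

    c = a + b       # component-wise addition: again a vector in R^3
    d = 0.5 * a     # scalar multiplication: a scaled vector in R^3

    print(c)        # [5. 7. 9.]
    print(d)        # [0.5 1.  1.5]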
Linear algebra focuses on the similarities between these vector concepts: we can add them together and multiply them by scalars. We will largely focus on vectors in R^n since most algorithms in linear algebra are formulated in R^n. Recall that in machine learning, we often consider data to be represented as vectors in R^n. In this book, we will focus on finite-dimensional vector spaces, in which case there is a 1:1 correspondence between any kind of (finite-dimensional) vector and R^n. By studying R^n, we implicitly study all other vectors such as geometric vectors and polynomials. Although R^n is rather abstract, it is most useful.
One major idea in mathematics is the idea of "closure". This is the question: What is the set of all things that can result from my proposed operations? In the case of vectors: What is the set of vectors that can result by starting with a small set of vectors, and adding them to each other and scaling them? This results in a vector space (Section 2.4). The concept of a vector space and its properties underlie much of machine learning.
A closely related concept is a matrix, which can be thought of as a collection of vectors. As can be expected, when talking about properties of a collection of vectors, we can use matrices as a representation.
The concepts introduced in this chapter are shown in Figure 2.2. This chapter is largely based on the lecture notes and books by Drumm and Weil (2001); Strang (2003); Hogben (2013); Liesen and Mehrmann (2015), as well as Pavel Grinfeld's Linear Algebra series (http://tinyurl.com/nahclwm). Another excellent source is Gilbert Strang's Linear Algebra course at MIT (http://tinyurl.com/29p5q8j).

[Figure 2.2: A mind map of the concepts introduced in this chapter, along with when they are used in other parts of the book.]

Linear algebra plays an important role in machine learning and general mathematics. In Chapter 5, we will discuss vector calculus, where a principled knowledge of matrix operations is essential. In Chapter 10,
we will use projections (to be introduced in Section 3.7) for dimensionality reduction with Principal Component Analysis (PCA). In Chapter 9, we will discuss linear regression, where linear algebra plays a central role for solving least-squares problems.
2.1 Systems of Linear Equations

Example 2.1
A company produces products N_1, ..., N_n for which resources R_1, ..., R_m are required. To produce a unit of product N_j, a_{ij} units of resource R_i are needed, where i = 1, ..., m and j = 1, ..., n.
The objective is to find an optimal production plan, i.e., a plan of how many units x_j of product N_j should be produced if a total of b_i units of resource R_i are available and (ideally) no resources are left over.
If we produce x_1, ..., x_n units of the corresponding products, we need a total of

    a_{i1} x_1 + \cdots + a_{in} x_n    (2.2)

units of resource R_i. An optimal production plan (x_1, ..., x_n) ∈ R^n therefore has to satisfy the following system of equations:

    a_{11} x_1 + \cdots + a_{1n} x_n = b_1
                  \vdots                       (2.3)
    a_{m1} x_1 + \cdots + a_{mn} x_n = b_m
Equation (2.3) is the general form of a system of linear equations, and x_1, ..., x_n are the unknowns of this system of linear equations. Every n-tuple (x_1, ..., x_n) ∈ R^n that satisfies (2.3) is a solution of the linear equation system.
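To make Example 2.1 concrete, the following NumPy sketch (the numbers are invented for illustration) sets up a small instance of (2.3) and solves it for the case of a square, invertible coefficient matrix:

    import numpy as np

    # Hypothetical requirements: A[i, j] = units of resource R_i needed
    # per unit of product N_j (values made up for illustration).
    A = np.array([[1.0, 2.0],
                  [3.0, 1.0]])
    b = np.array([8.0, 9.0])        # available units of each resource

    x = np.linalg.solve(A, b)       # production plan using up all resources
    print(x)                        # [2. 3.]: 2 units of N_1, 3 units of N_2
    print(np.allclose(A @ x, b))    # True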
Example 2.2
The system of linear equations

    x_1 + x_2 + x_3 = 3    (1)
    x_1 − x_2 + 2x_3 = 2   (2)         (2.4)
    2x_1 + 3x_3 = 1        (3)

has no solution: Adding the first two equations yields 2x_1 + 3x_3 = 5, which contradicts the third equation (3).
Let us have a look at the system of linear equations

    x_1 + x_2 + x_3 = 3    (1)
    x_1 − x_2 + 2x_3 = 2   (2)         (2.5)
    x_2 + x_3 = 2          (3)

From the first and third equation, it follows that x_1 = 1. From (1)+(2), we get 2 + 3x_3 = 5, i.e., x_3 = 1. From (3), we then get that x_2 = 1. Therefore, (1, 1, 1) is the only possible and unique solution (verify that (1, 1, 1) is a solution by plugging in).
As a third example, we consider

    x_1 + x_2 + x_3 = 3    (1)
    x_1 − x_2 + 2x_3 = 2   (2)         (2.6)
    2x_1 + 3x_3 = 5        (3)

Since (1)+(2)=(3), we can omit the third equation (redundancy). From (1) and (2), we get 2x_1 = 5 − 3x_3 and 2x_2 = 1 + x_3. We define x_3 = a ∈ R as a free variable, such that any triplet

    ( \tfrac{5}{2} − \tfrac{3}{2}a, \tfrac{1}{2} + \tfrac{1}{2}a, a ),  a ∈ R    (2.7)

is a solution of the system of linear equations, i.e., we obtain a solution set that contains infinitely many solutions.
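The three situations in Example 2.2 can also be detected numerically via the solvability criterion rk(A) = rk([A | b]). A sketch, assuming NumPy:

    import numpy as np

    def classify(A, b):
        """Classify the solution set of Ax = b by comparing ranks."""
        rank_A = np.linalg.matrix_rank(A)
        rank_Ab = np.linalg.matrix_rank(np.column_stack([A, b]))
        if rank_A < rank_Ab:
            return "no solution"
        if rank_A == A.shape[1]:
            return "unique solution"
        return "infinitely many solutions"

    A = np.array([[1, 1, 1], [1, -1, 2], [2, 0, 3]])
    print(classify(A, np.array([3, 2, 1])))    # no solution, cf. (2.4)
    print(classify(A, np.array([3, 2, 5])))    # infinitely many, cf. (2.6)
    A2 = np.array([[1, 1, 1], [1, -1, 2], [0, 1, 1]])
    print(classify(A2, np.array([3, 2, 2])))   # unique solution, cf. (2.5)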
In a system of linear equations with two variables x_1, x_2, each linear equation defines a line in the x_1x_2-plane. Since a solution to a system of linear equations must satisfy all equations simultaneously, the solution set is the intersection of these lines. This intersection can be a line (if the linear equations describe the same line), a point, or empty (when the lines are parallel). An illustration is given in Figure 2.3. Similarly, for three variables, each linear equation determines a plane in three-dimensional space. When we intersect these planes, i.e., satisfy all linear equations at the same time, we can end up with a solution set that is a plane, a line, a point, or empty (when the planes are parallel). ♦
For a systematic approach to solving systems of linear equations, we will introduce a useful compact notation. We will write the system from (2.3) in the following form:

    x_1 \begin{bmatrix} a_{11} \\ \vdots \\ a_{m1} \end{bmatrix} + x_2 \begin{bmatrix} a_{12} \\ \vdots \\ a_{m2} \end{bmatrix} + \cdots + x_n \begin{bmatrix} a_{1n} \\ \vdots \\ a_{mn} \end{bmatrix} = \begin{bmatrix} b_1 \\ \vdots \\ b_m \end{bmatrix}    (2.8)

    \iff \begin{bmatrix} a_{11} & \cdots & a_{1n} \\ \vdots & & \vdots \\ a_{m1} & \cdots & a_{mn} \end{bmatrix} \begin{bmatrix} x_1 \\ \vdots \\ x_n \end{bmatrix} = \begin{bmatrix} b_1 \\ \vdots \\ b_m \end{bmatrix} .    (2.9)
In the following, we will have a close look at these matrices and define computation rules.
2.2 Matrices

The sum of two matrices A ∈ R^{m×n}, B ∈ R^{m×n} is defined as the element-wise sum, i.e.,

    A + B := \begin{bmatrix} a_{11} + b_{11} & \cdots & a_{1n} + b_{1n} \\ \vdots & & \vdots \\ a_{m1} + b_{m1} & \cdots & a_{mn} + b_{mn} \end{bmatrix} \in \mathbb{R}^{m \times n} .    (2.11)
This means, to compute element c_{ij} we multiply the elements of the i-th row of A with the j-th column of B and sum them up. (There are n columns in A and n rows in B so that we can compute a_{il} b_{lj} for l = 1, ..., n.) Later in Section 3.2, we will call this the dot product of the corresponding row and column.

Remark. Matrices can only be multiplied if their "neighboring" dimensions match. For instance, an n×k matrix A can be multiplied with a k×m matrix B, but only from the left side:

    \underbrace{A}_{n \times k} \underbrace{B}_{k \times m} = \underbrace{C}_{n \times m}    (2.13)
Example 2.3
For A = \begin{bmatrix} 1 & 2 & 3 \\ 3 & 2 & 1 \end{bmatrix} ∈ R^{2×3}, B = \begin{bmatrix} 0 & 2 \\ 1 & -1 \\ 0 & 1 \end{bmatrix} ∈ R^{3×2}, we obtain

    AB = \begin{bmatrix} 1 & 2 & 3 \\ 3 & 2 & 1 \end{bmatrix} \begin{bmatrix} 0 & 2 \\ 1 & -1 \\ 0 & 1 \end{bmatrix} = \begin{bmatrix} 2 & 3 \\ 2 & 5 \end{bmatrix} \in \mathbb{R}^{2 \times 2} ,    (2.14)

    BA = \begin{bmatrix} 0 & 2 \\ 1 & -1 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 2 & 3 \\ 3 & 2 & 1 \end{bmatrix} = \begin{bmatrix} 6 & 4 & 2 \\ -2 & 0 & 2 \\ 3 & 2 & 1 \end{bmatrix} \in \mathbb{R}^{3 \times 3} .    (2.15)
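A quick numerical check of Example 2.3 (a sketch, assuming NumPy) also demonstrates that matrix multiplication is not commutative:

    import numpy as np

    A = np.array([[1, 2, 3],
                  [3, 2, 1]])          # 2x3
    B = np.array([[0, 2],
                  [1, -1],
                  [0, 1]])             # 3x2

    print(A @ B)   # [[2 3] [2 5]],              cf. (2.14)
    print(B @ A)   # [[6 4 2] [-2 0 2] [3 2 1]], cf. (2.15)
    # AB is 2x2 while BA is 3x3, so AB and BA differ even in shape.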
Now that we have defined matrix multiplication, matrix addition, and the identity matrix, let us have a look at some properties of matrices, where we will omit the "·" for matrix multiplication:
• Associativity:

    ∀A ∈ R^{m×n}, B ∈ R^{n×p}, C ∈ R^{p×q} : (AB)C = A(BC)    (2.17)

• Distributivity:

    ∀A, B ∈ R^{m×n}, C, D ∈ R^{n×p} : (A + B)C = AC + BC    (2.18a)
                                      A(C + D) = AC + AD    (2.18b)

• Neutral element:

    ∀A ∈ R^{m×n} : I_m A = A I_n = A    (2.19)
Definition 2.4 (Transpose). For A ∈ R^{m×n}, the matrix B ∈ R^{n×m} with b_{ij} = a_{ji} is called the transpose of A. We write B = A^⊤.

For a square matrix, A^⊤ is the matrix we obtain when we "mirror" A on its main diagonal. (The main diagonal, sometimes called "principal diagonal", "primary diagonal", "leading diagonal", or "major diagonal", of a matrix A is the collection of entries A_{ij} where i = j.) In general, A^⊤ can be obtained by writing the columns of A as the rows of A^⊤.
Some important properties of inverses and transposes are:

    AA^{-1} = I = A^{-1}A    (2.25)
    (AB)^{-1} = B^{-1}A^{-1}    (2.26)
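These identities are easy to sanity-check numerically. A small sketch (assuming NumPy) that verifies (2.25) and (2.26) on random matrices:

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.standard_normal((3, 3))    # random matrices are invertible
    B = rng.standard_normal((3, 3))    # with probability 1

    A_inv = np.linalg.inv(A)
    print(np.allclose(A @ A_inv, np.eye(3)))                  # (2.25)
    print(np.allclose(np.linalg.inv(A @ B),
                      np.linalg.inv(B) @ np.linalg.inv(A)))   # (2.26)
    print(np.allclose(np.linalg.inv(A.T), A_inv.T))           # (A^-1)^T = (A^T)^-1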
Moreover, if A is invertible, then so is A^⊤, and (A^{-1})^⊤ = (A^⊤)^{-1} =: A^{-⊤}. Note, however, that the inverse of a sum is not the sum of the inverses; already in the scalar case, 1/(2+4) = 1/6 ≠ 1/2 + 1/4.
A matrix A is symmetric if A = A^⊤. Note that this can only hold for (n, n)-matrices, which we also call square matrices because they possess the same number of rows and columns.
2.3 Solving Systems of Linear Equations

This system of equations is in a particularly easy form, where the first two columns consist of a 1 and a 0. Remember that we want to find scalars x_1, ..., x_4 such that \sum_{i=1}^{4} x_i c_i = b, where we define c_i to be the i-th column of the matrix and b the right-hand side of (2.37). A solution to the problem in (2.37) can be found immediately by taking 42 times the first column and 8 times the second column, so that

    b = \begin{bmatrix} 42 \\ 8 \end{bmatrix} = 42 \begin{bmatrix} 1 \\ 0 \end{bmatrix} + 8 \begin{bmatrix} 0 \\ 1 \end{bmatrix} .    (2.38)

Therefore, a solution vector is [42, 8, 0, 0]^⊤. This solution is called a particular solution or special solution. However, this is not the only solution of this system of linear equations. To capture all the other solutions, we need to be creative in generating 0 in a non-trivial way using the columns of the matrix: Adding 0 to our special solution does not change the special solution. To do so, we express the third column using the first two columns (which are of this very simple form)

    \begin{bmatrix} 8 \\ 2 \end{bmatrix} = 8 \begin{bmatrix} 1 \\ 0 \end{bmatrix} + 2 \begin{bmatrix} 0 \\ 1 \end{bmatrix}    (2.39)

so that 0 = 8c_1 + 2c_2 − 1c_3 + 0c_4 and (x_1, x_2, x_3, x_4) = (8, 2, −1, 0). In fact, any scaling of this solution by λ_1 ∈ R produces the 0 vector, i.e.,

    \begin{bmatrix} 1 & 0 & 8 & -4 \\ 0 & 1 & 2 & 12 \end{bmatrix} \left( \lambda_1 \begin{bmatrix} 8 \\ 2 \\ -1 \\ 0 \end{bmatrix} \right) = \lambda_1 (8 c_1 + 2 c_2 - c_3) = 0 .    (2.40)

Following the same line of reasoning, we express the fourth column of the matrix in (2.37) using the first two columns and generate another set of non-trivial versions of 0 as

    \begin{bmatrix} 1 & 0 & 8 & -4 \\ 0 & 1 & 2 & 12 \end{bmatrix} \left( \lambda_2 \begin{bmatrix} -4 \\ 12 \\ 0 \\ -1 \end{bmatrix} \right) = \lambda_2 (-4 c_1 + 12 c_2 - c_4) = 0    (2.41)

for any λ_2 ∈ R. Putting everything together, we obtain all solutions of the equation system in (2.37), which is called the general solution, as the set

    \left\{ x \in \mathbb{R}^4 : x = \begin{bmatrix} 42 \\ 8 \\ 0 \\ 0 \end{bmatrix} + \lambda_1 \begin{bmatrix} 8 \\ 2 \\ -1 \\ 0 \end{bmatrix} + \lambda_2 \begin{bmatrix} -4 \\ 12 \\ 0 \\ -1 \end{bmatrix} ,\ \lambda_1, \lambda_2 \in \mathbb{R} \right\} .    (2.42)
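Every member of the set (2.42) solves the system, which we can verify numerically. A sketch, assuming NumPy:

    import numpy as np

    A = np.array([[1.0, 0.0, 8.0, -4.0],
                  [0.0, 1.0, 2.0, 12.0]])
    b = np.array([42.0, 8.0])

    x_part = np.array([42.0, 8.0, 0.0, 0.0])   # particular solution
    n1 = np.array([8.0, 2.0, -1.0, 0.0])       # generates 0, cf. (2.40)
    n2 = np.array([-4.0, 12.0, 0.0, -1.0])     # generates 0, cf. (2.41)

    for lam1, lam2 in [(0.0, 0.0), (1.5, -2.0), (-3.0, 7.0)]:
        x = x_part + lam1 * n1 + lam2 * n2
        assert np.allclose(A @ x, b)           # every member solves Ax = b
    print("all checks passed")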
Example 2.6
For a ∈ R, we seek all solutions of the following system of equations:

    −2x_1 + 4x_2 − 2x_3 − x_4 + 4x_5 = −3
     4x_1 − 8x_2 + 3x_3 − 3x_4 + x_5 = 2
      x_1 − 2x_2 + x_3 − x_4 + x_5 = 0          (2.43)
      x_1 − 2x_2 − 3x_4 + 4x_5 = a

We start by converting this system of equations into the compact matrix notation Ax = b. We no longer mention the variables x explicitly and build the augmented matrix

    [ −2   4  −2  −1   4 | −3 ]   ← swap with R_3
    [  4  −8   3  −3   1 |  2 ]
    [  1  −2   1  −1   1 |  0 ]   ← swap with R_1
    [  1  −2   0  −3   4 |  a ]

where we used the vertical line to separate the left-hand side from the right-hand side in (2.43). We use ⇝ to indicate a transformation of the left-hand side into the right-hand side using elementary transformations.
A matrix is in row-echelon form if:
• All rows that contain only zeros are at the bottom of the matrix; correspondingly, all rows that contain at least one non-zero element are on top of rows that contain only zeros.
• Looking at non-zero rows only, the first non-zero number from the left (also called the pivot or the leading coefficient) is always strictly to the right of the pivot of the row above it. (In other books, it is sometimes required that the pivot is 1.)
Remark (Basic and Free Variables). The variables corresponding to the pivots in the row-echelon form are called basic variables; the other variables are free variables. For example, in (2.44), x_1, x_3, x_4 are basic variables, whereas x_2, x_5 are free variables. ♦
Remark (Obtaining a Particular Solution). The row-echelon form makes our lives easier when we need to determine a particular solution. To do this, we express the right-hand side of the equation system using the pivot columns, such that b = \sum_{i=1}^{P} \lambda_i p_i, where p_i, i = 1, ..., P, are the pivot columns. The λ_i are determined easiest if we start with the rightmost pivot column and work our way to the left.
In the above example, we would try to find λ_1, λ_2, λ_3 such that

    \lambda_1 \begin{bmatrix} 1 \\ 0 \\ 0 \\ 0 \end{bmatrix} + \lambda_2 \begin{bmatrix} 1 \\ 1 \\ 0 \\ 0 \end{bmatrix} + \lambda_3 \begin{bmatrix} -1 \\ -1 \\ 1 \\ 0 \end{bmatrix} = \begin{bmatrix} 0 \\ -2 \\ 1 \\ 0 \end{bmatrix} .    (2.47)

From here, we find relatively directly that λ_3 = 1, λ_2 = −1, λ_1 = 2. When we put everything together, we must not forget the non-pivot columns, for which we set the coefficients implicitly to 0. Therefore, we get the particular solution x = [2, 0, −1, 1, 0]^⊤. ♦
Remark (Reduced Row-Echelon Form). An equation system is in reduced row-echelon form (also: row-reduced echelon form or row canonical form) if
• it is in row-echelon form;
• every pivot is 1;
• the pivot is the only non-zero entry in its column.
The reduced row-echelon form will play an important role later in Section 2.3.3 because it allows us to determine the general solution of a system of linear equations in a straightforward way.
Remark (Gaussian Elimination). Gaussian elimination is an algorithm that performs elementary transformations to bring a system of linear equations into reduced row-echelon form. ♦
where ∗ can be an arbitrary real number, with the constraints that the first non-zero entry per row must be 1 and all other entries in the corresponding column must be 0. The columns j_1, ..., j_k with the pivots (marked in bold) are the standard unit vectors e_1, ..., e_k ∈ R^k. We extend this matrix to an n×n matrix Ã by adding n − k rows of the form

    [ 0 \cdots 0 \ {-1} \ 0 \cdots 0 ]    (2.51)

so that the diagonal of the augmented matrix Ã contains either 1 or −1. Then, the columns of Ã that contain −1 as a pivot are solutions of the homogeneous equation system Ax = 0. To be more precise, these columns form a basis (Section 2.6.1) of the solution space of Ax = 0, which we will later call the kernel or null space (see Section 2.7.3).
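This construction (the "Minus-1 Trick") is mechanical enough to automate. A sketch (our own implementation; it assumes the input is already in reduced row-echelon form, with pivot columns equal to standard unit vectors):

    import numpy as np

    def minus_one_trick(R):
        """Return a matrix whose columns form a basis of the solution
        space of Rx = 0, for R in reduced row-echelon form."""
        k, n = R.shape
        pivots = [int(np.argmax(row != 0)) for row in R if np.any(row != 0)]
        A_tilde = np.zeros((n, n))
        A_tilde[pivots, :] = R          # place rows so pivots hit the diagonal
        free = [j for j in range(n) if j not in pivots]
        for j in free:
            A_tilde[j, j] = -1.0        # the added rows [0 ... -1 ... 0]
        return A_tilde[:, free]         # columns containing the -1 pivots

    R = np.array([[1.0, 3.0, 0.0, 0.0,  3.0],
                  [0.0, 0.0, 1.0, 0.0,  9.0],
                  [0.0, 0.0, 0.0, 1.0, -4.0]])
    N = minus_one_trick(R)
    print(np.allclose(R @ N, 0))        # True: each column solves Rx = 0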
    [A \,|\, I_n]  ⇝ \cdots ⇝  [I_n \,|\, A^{-1}] .    (2.55)

This means that if we bring the augmented equation system into reduced row-echelon form, we can read out the inverse on the right-hand side of the equation system. Hence, determining the inverse of a matrix is equivalent to solving systems of linear equations.
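A compact sketch of this procedure (our own implementation; for clarity it assumes an invertible matrix and performs no row pivoting):

    import numpy as np

    def inverse_by_gauss_jordan(A):
        """Invert A by reducing the augmented matrix [A | I] to [I | A^-1]."""
        n = A.shape[0]
        M = np.hstack([A.astype(float), np.eye(n)])
        for i in range(n):
            M[i] = M[i] / M[i, i]                  # normalize the pivot to 1
            for r in range(n):
                if r != i:
                    M[r] = M[r] - M[r, i] * M[i]   # clear column i elsewhere
        return M[:, n:]

    A = np.array([[2.0, 1.0],
                  [1.0, 1.0]])
    print(inverse_by_gauss_jordan(A))      # [[ 1. -1.] [-1.  2.]]
    print(np.allclose(inverse_by_gauss_jordan(A) @ A, np.eye(2)))  # True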
and use the Moore-Penrose pseudo-inverse (A^⊤A)^{-1}A^⊤ to determine the solution (2.58) that solves Ax = b, which also corresponds to the minimum-norm least-squares solution. A disadvantage of this approach is that it requires many computations for the matrix-matrix product and computing the inverse of A^⊤A. Moreover, for reasons of numerical precision, it is generally not recommended to compute the inverse or pseudo-inverse. In the following, we therefore briefly discuss alternative approaches to solving systems of linear equations.
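In code, one would therefore rely on a dedicated least-squares routine rather than forming (A^⊤A)^{-1} explicitly. A sketch, assuming NumPy:

    import numpy as np

    A = np.array([[1.0, 1.0],
                  [1.0, 2.0],
                  [1.0, 3.0]])    # more equations than unknowns
    b = np.array([1.0, 2.0, 2.0])

    # Textbook formula (numerically fragile for ill-conditioned A):
    x_normal = np.linalg.inv(A.T @ A) @ A.T @ b

    # Preferred in practice:
    x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)

    print(np.allclose(x_normal, x_lstsq))   # True here: both give [2/3, 1/2]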
Gaussian elimination plays an important role when computing determinants (Section 4.1), checking whether a set of vectors is linearly independent (Section 2.5), computing the inverse of a matrix (Section 2.2.2), computing the rank of a matrix (Section 2.6.2), and determining a basis of a vector space (Section 2.6.1). We will discuss all these topics later on. Gaussian elimination is an intuitive and constructive way to solve a system of linear equations with thousands of variables. However, for systems with millions of variables, it is impractical as the required number of arithmetic operations scales cubically in the number of simultaneous equations.
In practice, systems of many linear equations are solved indirectly, by either stationary iterative methods, such as the Richardson method, the Jacobi method, the Gauß-Seidel method, and the successive over-relaxation method, or Krylov subspace methods, such as conjugate gradients, generalized minimal residual, or biconjugate gradients.
Let x∗ be a solution of Ax = b. The key idea of these iterative methods is to set up an iteration of the form x^{(k+1)} = C x^{(k)} + d, for suitable C and d,
that reduces the residual error ‖x^{(k+1)} − x∗‖ in every iteration and finally converges to x∗. We will introduce norms ‖·‖, which allow us to compute similarities between vectors, in Section 3.1.
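As a minimal illustration of such a stationary iterative method, here is a Jacobi iteration (our own sketch; it converges, for example, for strictly diagonally dominant A):

    import numpy as np

    def jacobi(A, b, num_iters=50):
        """Iterate x^(k+1) = D^-1 (b - (A - D) x^(k)), D = diag(A)."""
        R = A - np.diag(np.diag(A))         # off-diagonal part of A
        x = np.zeros_like(b)
        for _ in range(num_iters):
            x = (b - R @ x) / np.diag(A)
        return x

    A = np.array([[4.0, 1.0],
                  [2.0, 5.0]])              # strictly diagonally dominant
    b = np.array([9.0, 8.0])
    x = jacobi(A, b)
    print(np.allclose(A @ x, b))            # True: the iteration converged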
2.4 Vector Spaces

The elements x ∈ V are called vectors. The neutral element of (V, +) is the zero vector 0 = [0, ..., 0]^⊤, and the inner operation + is called vector addition. The elements λ ∈ R are called scalars, and the outer operation · is a multiplication by scalars. Note that a scalar product is something different, and we will get to this in Section 3.2.
2.5 Linear Independence

The 0-vector can always be written as a linear combination of k vectors x_1, ..., x_k because 0 = \sum_{i=1}^{k} 0 x_i is always true. In the following, we are interested in non-trivial linear combinations of a set of vectors to represent 0, i.e., linear combinations of vectors x_1, ..., x_k, where not all coefficients λ_i in (2.64) are 0.
Definition 2.11 (Linear (In)dependence). Let us consider a vector space V with k ∈ N and x_1, ..., x_k ∈ V. If there is a non-trivial linear combination such that 0 = \sum_{i=1}^{k} \lambda_i x_i with at least one λ_i ≠ 0, the vectors
x_1, ..., x_k are linearly dependent. If only the trivial solution exists, i.e., λ_1 = ... = λ_k = 0, the vectors x_1, ..., x_k are linearly independent.
Linear independence is one of the most important concepts in linear algebra. Intuitively, a set of linearly independent vectors consists of vectors that have no redundancy, i.e., if we remove any of those vectors from the set, we will lose something. Throughout the next sections, we will formalize this intuition more.
[Figure 2.7: Geographic example (with crude approximations to cardinal directions) of linearly dependent vectors in a two-dimensional space (plane): the vectors "506 km Northwest", "374 km Southwest", and "751 km West" connecting Kampala, Kigali, and Nairobi.]
In this example, the “506 km Northwest” vector (blue) and the “374 km
Southwest” vector (purple) are linearly independent. This means the
Southwest vector cannot be described in terms of the Northwest vector,
and vice versa. However, the third “751 km West” vector (black) is a lin-
ear combination of the other two vectors, and it makes the set of vectors
linearly dependent.
Remark. The following properties are useful to find out whether vectors are linearly independent.
• If at least one of the vectors x_1, ..., x_k is 0, then they are linearly dependent. The same holds if two vectors are identical.
• The vectors {x_1, ..., x_k : x_i ≠ 0, i = 1, ..., k}, k ≥ 2, are linearly dependent if and only if (at least) one of them is a linear combination of the others. In particular, if one vector is a multiple of another vector, i.e., x_i = λx_j, λ ∈ R, then the set {x_1, ..., x_k : x_i ≠ 0, i = 1, ..., k} is linearly dependent.
• A practical way of checking whether vectors x_1, ..., x_k ∈ V are linearly independent is to use Gaussian elimination: Write all vectors as columns of a matrix A and perform Gaussian elimination until the matrix is in row-echelon form (the reduced row-echelon form is not necessary here).
  – The pivot columns indicate the vectors that are linearly independent of the vectors on their left. Note that there is an ordering of vectors when the matrix is built.
  – The non-pivot columns can be expressed as linear combinations of the pivot columns on their left. For instance, the row-echelon form

    \begin{bmatrix} 1 & 3 & 0 \\ 0 & 0 & 2 \end{bmatrix}    (2.65)

  tells us that the first and third columns are pivot columns. The second column is a non-pivot column because it is 3 times the first column.
All column vectors are linearly independent if and only if all columns are pivot columns. If there is at least one non-pivot column, the columns (and, therefore, the corresponding vectors) are linearly dependent. ♦
Example 2.14
Consider R^4 with

    x_1 = \begin{bmatrix} 1 \\ 2 \\ -3 \\ 4 \end{bmatrix} ,  x_2 = \begin{bmatrix} 1 \\ 1 \\ 0 \\ 2 \end{bmatrix} ,  x_3 = \begin{bmatrix} -1 \\ -2 \\ 1 \\ 1 \end{bmatrix} .    (2.66)

To check whether they are linearly dependent, we follow the general approach and solve

    \lambda_1 x_1 + \lambda_2 x_2 + \lambda_3 x_3 = \lambda_1 \begin{bmatrix} 1 \\ 2 \\ -3 \\ 4 \end{bmatrix} + \lambda_2 \begin{bmatrix} 1 \\ 1 \\ 0 \\ 2 \end{bmatrix} + \lambda_3 \begin{bmatrix} -1 \\ -2 \\ 1 \\ 1 \end{bmatrix} = 0    (2.67)

for λ_1, ..., λ_3. We write the vectors x_i, i = 1, 2, 3, as the columns of a matrix and apply elementary row operations until we identify the pivot columns:

    \begin{bmatrix} 1 & 1 & -1 \\ 2 & 1 & -2 \\ -3 & 0 & 1 \\ 4 & 2 & 1 \end{bmatrix}  ⇝ \cdots ⇝  \begin{bmatrix} 1 & 1 & -1 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 0 \end{bmatrix}    (2.68)

Here, every column of the matrix is a pivot column. Therefore, there is no non-trivial solution, and we require λ_1 = 0, λ_2 = 0, λ_3 = 0 to solve the equation system. Hence, the vectors x_1, x_2, x_3 are linearly independent.
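The recipe of writing the vectors as columns and checking for non-pivot columns translates directly into code. A sketch (assuming NumPy) that applies the rank criterion to the vectors of Example 2.14:

    import numpy as np

    def linearly_independent(*vectors):
        """Columns are independent iff the rank equals their number."""
        A = np.column_stack(vectors)
        return np.linalg.matrix_rank(A) == A.shape[1]

    x1 = np.array([1, 2, -3, 4])
    x2 = np.array([1, 1, 0, 2])
    x3 = np.array([-1, -2, 1, 1])
    print(linearly_independent(x1, x2, x3))        # True, cf. Example 2.14
    print(linearly_independent(x1, x2, x1 + x2))   # False: clear redundancy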
This means that {x_1, ..., x_m} are linearly independent if and only if the column vectors {λ_1, ..., λ_m} are linearly independent. ♦
Remark. In a vector space V, m linear combinations of k vectors x_1, ..., x_k are linearly dependent if m > k. ♦
Example 2.15
Consider a set of linearly independent vectors b_1, b_2, b_3, b_4 ∈ R^n and

    x_1 = b_1 − 2b_2 + b_3 − b_4
    x_2 = −4b_1 − 2b_2 + 4b_4
    x_3 = 2b_1 + 3b_2 − b_3 − 3b_4         (2.72)
    x_4 = 17b_1 − 10b_2 + 11b_3 + b_4

Are the vectors x_1, ..., x_4 ∈ R^n linearly independent? To answer this question, we investigate whether the column vectors

    \begin{bmatrix} 1 \\ -2 \\ 1 \\ -1 \end{bmatrix} ,  \begin{bmatrix} -4 \\ -2 \\ 0 \\ 4 \end{bmatrix} ,  \begin{bmatrix} 2 \\ 3 \\ -1 \\ -3 \end{bmatrix} ,  \begin{bmatrix} 17 \\ -10 \\ 11 \\ 1 \end{bmatrix}    (2.73)

are linearly independent. The reduced row-echelon form of the corresponding linear equation system with coefficient matrix

    A = \begin{bmatrix} 1 & -4 & 2 & 17 \\ -2 & -2 & 3 & -10 \\ 1 & 0 & -1 & 11 \\ -1 & 4 & -3 & 1 \end{bmatrix}    (2.74)

is given as

    \begin{bmatrix} 1 & 0 & 0 & -7 \\ 0 & 1 & 0 & -15 \\ 0 & 0 & 1 & -18 \\ 0 & 0 & 0 & 0 \end{bmatrix} .    (2.75)

We see that the corresponding linear equation system is non-trivially solvable: The last column is not a pivot column, and x_4 = −7x_1 − 15x_2 − 18x_3. Therefore, x_1, ..., x_4 are linearly dependent, as x_4 can be expressed as a linear combination of x_1, ..., x_3.
Example 2.16
• The set

    A = \left\{ \begin{bmatrix} 1 \\ 2 \\ 3 \\ 4 \end{bmatrix} ,  \begin{bmatrix} 2 \\ -1 \\ 0 \\ 2 \end{bmatrix} ,  \begin{bmatrix} 1 \\ 1 \\ 0 \\ -4 \end{bmatrix} \right\}    (2.79)
2.6 Basis and Rank

Remark. Every vector space V possesses a basis B. The examples above show that there can be many bases of a vector space V, i.e., there is no unique basis. However, all bases possess the same number of elements, the basis vectors. ♦
We only consider finite-dimensional vector spaces V. In this case, the dimension of V is the number of basis vectors, and we write dim(V). If U ⊆ V is a subspace of V, then dim(U) ≤ dim(V), and dim(U) = dim(V) if and only if U = V. Intuitively, the dimension of a vector space can be thought of as the number of independent directions in this vector space.
Remark. A basis of a subspace U = span[x_1, ..., x_m] ⊆ R^n can be found by executing the following steps:
1. Write the spanning vectors as columns of a matrix A.
2. Determine the row-echelon form of A.
3. The spanning vectors associated with the pivot columns are a basis of U.
♦
• A = \begin{bmatrix} 1 & 0 & 1 \\ 0 & 1 & 1 \\ 0 & 0 & 0 \end{bmatrix}. A possesses two linearly independent rows (and columns). Therefore, rk(A) = 2.

• A = \begin{bmatrix} 1 & 2 & 1 \\ -2 & -3 & 1 \\ 3 & 5 & 0 \end{bmatrix}. We use Gaussian elimination to determine the rank:

    \begin{bmatrix} 1 & 2 & 1 \\ -2 & -3 & 1 \\ 3 & 5 & 0 \end{bmatrix}  ⇝ \cdots ⇝  \begin{bmatrix} 1 & 2 & 1 \\ 0 & 1 & 3 \\ 0 & 0 & 0 \end{bmatrix} .    (2.83)

  Here, we see that the number of linearly independent rows and columns is 2, such that rk(A) = 2.
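Numerically, the rank is usually computed from a matrix decomposition rather than by Gaussian elimination; NumPy's matrix_rank, for instance, uses the singular value decomposition. A sketch reproducing both examples:

    import numpy as np

    A1 = np.array([[1, 0, 1],
                   [0, 1, 1],
                   [0, 0, 0]])
    A2 = np.array([[ 1,  2, 1],
                   [-2, -3, 1],
                   [ 3,  5, 0]])

    print(np.linalg.matrix_rank(A1))   # 2
    print(np.linalg.matrix_rank(A2))   # 2, cf. (2.83)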
2.7 Linear Mappings

If Φ is injective, then it can also be "undone", i.e., there exists a mapping Ψ : W → V so that Ψ ◦ Φ(x) = x. If Φ is surjective, then every element in W can be "reached" from V using Φ.
With these definitions, we introduce the following special cases of linear mappings between vector spaces V and W:
• Isomorphism: Φ : V → W linear and bijective
• Endomorphism: Φ : V → V linear
• Automorphism: Φ : V → V linear and bijective
• We define id_V : V → V, x ↦ x as the identity mapping in V.
♦
Remark (Notation). We are at the point where notation gets a bit tricky. Therefore, we summarize some parts here. B = (b_1, ..., b_n) is an ordered basis, B = {b_1, ..., b_n} is an (unordered) basis, and B = [b_1, ..., b_n] is a matrix whose columns are the vectors b_1, ..., b_n. ♦
Definition 2.17 (Coordinates). Consider a vector space V and an ordered basis B = (b_1, ..., b_n) of V. For any x ∈ V we obtain a unique representation (linear combination)

    x = α_1 b_1 + \cdots + α_n b_n    (2.89)

of x with respect to B. Then α_1, ..., α_n are the coordinates of x with respect to B, and the vector

    α = \begin{bmatrix} α_1 \\ \vdots \\ α_n \end{bmatrix} \in \mathbb{R}^n    (2.90)

is the coordinate vector/coordinate representation of x with respect to the ordered basis B.
Remark. Intuitively, the basis vectors can be thought of as being equipped with units (including common units such as "kilograms" or "seconds"). Let us have a look at a geometric vector x ∈ R^2 with coordinates [2, 3]^⊤ with respect to the standard basis (e_1, e_2) of R^2. This means, we can write x = 2e_1 + 3e_2. However, we do not have to choose the standard basis to represent this vector. If we use the basis vectors b_1 = [1, −1]^⊤, b_2 = [1, 1]^⊤, we will obtain the coordinates ½[−1, 5]^⊤ to represent the same vector with respect to (b_1, b_2) (see Figure 2.8). ♦

[Figure 2.8: Different coordinate representations of a vector x, depending on the choice of basis: x = 2e_1 + 3e_2 and x = −½b_1 + (5/2)b_2.]
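Computing coordinates with respect to a non-standard basis amounts to solving a linear system: if we stack the basis vectors into a matrix B, the coordinate vector α satisfies Bα = x. A sketch, assuming NumPy:

    import numpy as np

    x = np.array([2.0, 3.0])            # coordinates w.r.t. (e1, e2)
    B = np.column_stack([[1.0, -1.0],   # b1
                         [1.0,  1.0]])  # b2

    alpha = np.linalg.solve(B, x)       # solve B @ alpha = x
    print(alpha)                        # [-0.5  2.5], i.e., (1/2)[-1, 5]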
Remark. For an n-dimensional vector space V and an ordered basis B of V, the mapping Φ : R^n → V, Φ(e_i) = b_i, i = 1, ..., n, is linear (and because of Theorem 2.16 an isomorphism), where (e_1, ..., e_n) is the standard basis of R^n. ♦
Now we are ready to make an explicit connection between matrices and linear mappings between finite-dimensional vector spaces.
the transformation matrix of Φ (with respect to the ordered bases B of V and C of W).

    ŷ = A_Φ x̂ .    (2.93)

This means that the transformation matrix can be used to map coordinates with respect to an ordered basis in V to coordinates with respect to an ordered basis in W.
[Figure 2.9: Three examples of linear transformations of the vectors shown as dots in (a): (b) rotation by 45°; (c) stretching of the horizontal coordinates by 2; (d) combination of reflection, rotation, and stretching.]
Proof. Following Drumm and Weil (2001), we can write the vectors of the new basis B̃ of V as a linear combination of the basis vectors of B, such that

    b̃_j = s_{1j} b_1 + \cdots + s_{nj} b_n = \sum_{i=1}^{n} s_{ij} b_i ,   j = 1, ..., n .    (2.105)
where we first expressed the new basis vectors c̃_k ∈ W as linear combinations of the basis vectors c_l ∈ W and then swapped the order of summation.
Alternatively, when we express the b̃_j ∈ V as linear combinations of the b_i ∈ V, we arrive at

    Φ(b̃_j) = Φ\left( \sum_{i=1}^{n} s_{ij} b_i \right) = \sum_{i=1}^{n} s_{ij} Φ(b_i) = \sum_{i=1}^{n} s_{ij} \sum_{l=1}^{m} a_{li} c_l    (2.108a)
            = \sum_{l=1}^{m} \left( \sum_{i=1}^{n} a_{li} s_{ij} \right) c_l ,   j = 1, ..., n ,    (2.108b)

where we used (2.105) in the first step, and, therefore,
Note that the execution order in (2.115) is from right to left because vectors are multiplied at the right-hand side so that x ↦ Sx ↦ A_Φ(Sx) ↦ T^{-1}(A_Φ(Sx)) = Ã_Φ x.
    B̃ = \left( \begin{bmatrix} 1 \\ 1 \\ 0 \end{bmatrix}, \begin{bmatrix} 0 \\ 1 \\ 1 \end{bmatrix}, \begin{bmatrix} 1 \\ 0 \\ 1 \end{bmatrix} \right) \subseteq \mathbb{R}^3 ,   C̃ = \left( \begin{bmatrix} 1 \\ 1 \\ 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 1 \\ 0 \\ 1 \\ 0 \end{bmatrix}, \begin{bmatrix} 0 \\ 1 \\ 1 \\ 0 \end{bmatrix}, \begin{bmatrix} 1 \\ 0 \\ 0 \\ 1 \end{bmatrix} \right) .    (2.118)

Then,

    S = \begin{bmatrix} 1 & 0 & 1 \\ 1 & 1 & 0 \\ 0 & 1 & 1 \end{bmatrix} ,   T = \begin{bmatrix} 1 & 1 & 0 & 1 \\ 1 & 0 & 1 & 0 \\ 0 & 1 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} ,    (2.119)

where the i-th column of S is the coordinate representation of b̃_i in terms of the basis vectors of B. (Since B is the standard basis, the coordinate representation is straightforward to find. For a general basis B, we would need to solve a linear equation system to find the λ_i such that \sum_{i=1}^{3} λ_i b_i = b̃_j, j = 1, ..., 3.) Similarly, the j-th column of T is the coordinate representation of c̃_j in terms of the basis vectors of C.
Therefore, we obtain

    Ã_Φ = T^{-1} A_Φ S = \frac{1}{2} \begin{bmatrix} 1 & 1 & -1 & -1 \\ 1 & -1 & 1 & -1 \\ -1 & 1 & 1 & 1 \\ 0 & 0 & 0 & 2 \end{bmatrix} \begin{bmatrix} 3 & 2 & 1 \\ 0 & 4 & 2 \\ 10 & 8 & 4 \\ 1 & 6 & 3 \end{bmatrix}    (2.120a)

        = \begin{bmatrix} -4 & -4 & -2 \\ 6 & 0 & 0 \\ 4 & 8 & 4 \\ 1 & 6 & 3 \end{bmatrix} .    (2.120b)
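The arithmetic of this example is easy to double-check numerically. A sketch (assuming NumPy) that recomputes (2.120b) from T and the product A_Φ S displayed in (2.120a):

    import numpy as np

    T = np.array([[1.0, 1.0, 0.0, 1.0],
                  [1.0, 0.0, 1.0, 0.0],
                  [0.0, 1.0, 1.0, 0.0],
                  [0.0, 0.0, 0.0, 1.0]])
    APhi_S = np.array([[ 3.0, 2.0, 1.0],      # A_Phi @ S, as in (2.120a)
                       [ 0.0, 4.0, 2.0],
                       [10.0, 8.0, 4.0],
                       [ 1.0, 6.0, 3.0]])

    A_tilde = np.linalg.inv(T) @ APhi_S
    print(A_tilde)   # [[-4 -4 -2] [6 0 0] [4 8 4] [1 6 3]], cf. (2.120b)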
[Figure 2.11: The kernel ker(Φ) and the image Im(Φ) of a linear mapping Φ : V → W; the kernel contains 0_V and the image contains 0_W.]
We also call V and W the domain and codomain of Φ, respectively.
Intuitively, the kernel is the set of vectors v ∈ V that Φ maps onto the neutral element 0_W ∈ W. The image is the set of vectors w ∈ W that can be "reached" by Φ from any vector in V. An illustration is given in Figure 2.11.
Remark. Consider a linear mapping Φ : V → W, where V, W are vector spaces.
i.e., the image is the span of the columns of A, also called the column space. Therefore, the column space (image) is a subspace of R^m, where m is the "height" of the matrix.
• rk(A) = dim(Im(Φ)).
• The kernel/null space ker(Φ) is the general solution to the homogeneous system of linear equations Ax = 0 and captures all possible linear combinations of the elements in R^n that produce 0 ∈ R^m.
• The kernel is a subspace of R^n, where n is the "width" of the matrix.
• The kernel focuses on the relationship among the columns, and we can use it to determine whether/how we can express a column as a linear combination of other columns.
• The purpose of the kernel is to determine whether a solution of the system of linear equations is unique and, if not, to capture all possible solutions.
♦
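Computationally, an orthonormal basis of the kernel can be read off a singular value decomposition; SciPy wraps this as scipy.linalg.null_space. A sketch using the matrix from (2.40):

    import numpy as np
    from scipy.linalg import null_space

    A = np.array([[1.0, 0.0, 8.0, -4.0],
                  [0.0, 1.0, 2.0, 12.0]])

    N = null_space(A)             # columns: orthonormal basis of ker(A)
    print(N.shape)                # (4, 2): the kernel is two-dimensional
    print(np.allclose(A @ N, 0))  # True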
Exercises
2.1 We consider (R \ {−1}, ⋆), where

    a ⋆ b := ab + a + b ,   a, b ∈ R \ {−1}    (2.134)
2.5 Find the set S of all solutions in x of the following inhomogeneous linear systems Ax = b, where A and b are defined below:
1.

    A = \begin{bmatrix} 1 & 1 & -1 & -1 \\ 2 & 5 & -7 & -5 \\ 2 & -1 & 1 & 3 \\ 5 & 2 & -4 & 2 \end{bmatrix} ,   b = \begin{bmatrix} 1 \\ -2 \\ 4 \\ 6 \end{bmatrix}

2.

    A = \begin{bmatrix} 1 & -1 & 0 & 0 & 1 \\ 1 & 1 & 0 & -3 & 0 \\ 2 & -1 & 0 & 1 & -1 \\ -1 & 2 & 0 & -2 & -1 \end{bmatrix} ,   b = \begin{bmatrix} 3 \\ 6 \\ 5 \\ -1 \end{bmatrix}

and \sum_{i=1}^{3} x_i = 1.
1.

    x_1 = \begin{bmatrix} 2 \\ -1 \\ 3 \end{bmatrix} ,   x_2 = \begin{bmatrix} 1 \\ 1 \\ -2 \end{bmatrix} ,   x_3 = \begin{bmatrix} 3 \\ -3 \\ 8 \end{bmatrix}

2.

    x_1 = \begin{bmatrix} 1 \\ 2 \\ 1 \\ 0 \\ 0 \end{bmatrix} ,   x_2 = \begin{bmatrix} 1 \\ 1 \\ 0 \\ 1 \\ 1 \end{bmatrix} ,   x_3 = \begin{bmatrix} 1 \\ 0 \\ 0 \\ 1 \\ 1 \end{bmatrix}
2.10 Write

    y = \begin{bmatrix} 1 \\ -2 \\ 5 \end{bmatrix}

as a linear combination of

    x_1 = \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix} ,   x_2 = \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix} ,   x_3 = \begin{bmatrix} 2 \\ -1 \\ 1 \end{bmatrix}
1.
    Φ : L^1([a, b]) → R
    f ↦ Φ(f) = \int_a^b f(x) dx ,
where L^1([a, b]) denotes the set of integrable functions on [a, b].
2.
    Φ : C^1 → C^0
    f ↦ Φ(f) = f' ,
where for k ≥ 1, C^k denotes the set of k times continuously differentiable functions, and C^0 denotes the set of continuous functions.
3.
    Φ : R → R
    x ↦ Φ(x) = cos(x)
4.
    Φ : R^3 → R^2
    x ↦ \begin{bmatrix} 1 & 2 & 3 \\ 1 & 4 & 3 \end{bmatrix} x
5.
    Φ : R^2 → R^2
    x ↦ \begin{bmatrix} \cos(θ) & \sin(θ) \\ -\sin(θ) & \cos(θ) \end{bmatrix} x
6.
    Φ : R^3 → R^4
    Φ\left( \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} \right) = \begin{bmatrix} 3x_1 + 2x_2 + x_3 \\ x_1 + x_2 + x_3 \\ x_1 - 3x_2 \\ 2x_1 + 3x_2 + x_3 \end{bmatrix}
2.16 Let E be a vector space. Let f and g be two endomorphisms on E such that f ◦ g = id_E (i.e., f ◦ g is the identity isomorphism). Show that ker(f) = ker(g ◦ f), Im(g) = Im(g ◦ f), and that ker(f) ∩ Im(g) = {0_E}.
2.17 Consider an endomorphism Φ : R^3 → R^3 whose transformation matrix (with respect to the standard basis in R^3) is

    A_Φ = \begin{bmatrix} 1 & 1 & 0 \\ 1 & -1 & 0 \\ 1 & 1 & 1 \end{bmatrix} .
and let us define two ordered bases B = (b_1, b_2) and B' = (b'_1, b'_2) of R^2.
1. Show that B and B' are two bases of R^2 and draw those basis vectors.
2. Compute the matrix P_1 that performs a basis change from B' to B.
3. We consider three vectors c_1, c_2, c_3 of R^3, defined in the standard basis of R^3 as

    c_1 = \begin{bmatrix} 1 \\ 2 \\ -1 \end{bmatrix} ,   c_2 = \begin{bmatrix} 0 \\ -1 \\ 2 \end{bmatrix} ,   c_3 = \begin{bmatrix} 1 \\ 0 \\ -1 \end{bmatrix}    (2.138)