Notation
Examples
Inner product
Complexity
I other conventions: g, ~a
I in ai , i is the index
I two vectors a and b of the same size are equal if ai = bi for all i
I stacked vector a = (b, c, d), where b, c, d are vectors of sizes m, n, p
I a has size m + n + p:
  a = (b1 , b2 , . . . , bm , c1 , c2 , . . . , cn , d1 , d2 , . . . , dp )
  e1 = (1, 0, 0),   e2 = (0, 1, 0),   e3 = (0, 0, 1)
[figure: a 2-vector x shown as an arrow and as a point in the (x1 , x2 ) plane]
I color: (R, G, B)
I quantities of n different commodities (or resources), e.g., bill of materials
I portfolio: entries give shares (or $ value or fraction) held in each of n
assets, with negative meaning short positions
I cash flow: xi is payment in period i to us
I audio: xi is the acoustic pressure at sample time i
(sample times are spaced 1/44100 seconds apart)
I features: xi is the value of ith feature or attribute of an entity
I customer purchase: xi is the total $ purchase of product i by a customer
over some period
I word count: xi is the number of times word i appears in a document
I a short document:
Word count vectors are used in computer based document
analysis. Each entry of the word count vector is the number of
times the associated dictionary word appears in the document.
word 3
in 2
number 1
horse 0
the 4
document 2
  (0, 7, 3) + (1, 2, 0) = (1, 9, 3)
I subtraction is similar
I commutative: a + b = b + a
I associative: (a + b) + c = a + (b + c)
(so we can write both as a + b + c)
I a+0=0+a=a
I a−a=0
[figures: vector addition a + b shown as a parallelogram, and the displacement p − q between points p and q]
βa = ( βa1 , . . . , βan )
I also denoted a β
I example:
  (−2)(1, 9, 6) = (−2, −18, −12)
these equations look innocent, but be sure you understand them perfectly
β1 a1 + · · · + βm am
b = b1 e1 + · · · + bn en
[figure: b expressed as the linear combination b = 0.75 a1 + 1.5 a2 ]
I we have replicated a two period loan from two one period loans
aT b = a1 b1 + a2 b2 + · · · + an bn
I example:
  (−1, 2, 2)T (1, 0, −3) = (−1)(1) + (2)(0) + (2)(−3) = −7
I aT b = bT a
I (γa) T b = γ(aT b)
I (a + b) T c = aT c + bT c
(a + b) T (c + d) = aT c + aT d + bT c + bT d
I 1T a = a1 + · · · + an (sum of entries)
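a minimal numerical check of these properties (a sketch assuming NumPy; it reuses the 3-vector example above):

```python
import numpy as np

a = np.array([-1.0, 2.0, 2.0])
b = np.array([1.0, 0.0, -3.0])

print(a @ b)                                  # inner product a^T b = -7.0
print(np.isclose(a @ b, b @ a))               # symmetry: a^T b = b^T a
print(np.isclose(np.ones(3) @ a, a.sum()))    # 1^T a is the sum of entries
```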
Taylor approximation
Regression model
f (x) = aT x = a1 x1 + a2 x2 + · · · + an xn
I suppose f : Rn → R is linear
I specifically: ai = f (ei )
I follows from
f (x) = f (x1 e1 + x2 e2 + · · · + xn en )
= x1 f (e1 ) + x2 f (e2 ) + · · · + xn f (en )
I suppose f : Rn → R
I first-order Taylor approximation of f , near point z:
  f̂ (x) = f (z) + (∂f /∂x1 )(z)(x1 − z1 ) + · · · + (∂f /∂xn )(z)(xn − zn )
I f̂ (x) is very close to f (x) when xi are all near zi
I f̂ is an affine function of x
I can write using inner product as f̂ (x) = f (z) + ∇f (z)T (x − z)
[figure: f (x) and its first-order Taylor approximation f̂ (x) near z]
ŷ = xT β + v
I y is selling price of house in $1000 (in some location, over some period)
I regressor is
x = (house area, # bedrooms)
(house area in 1000 sq.ft.)
[scatter plot: regression model prediction ŷ versus actual price y (thousand dollars) for five houses]
3. Norm and distance
Outline
Norm
Distance
Standard deviation
Angle
I nonnegativity: kxk ≥ 0
I so we have
  k(a, b, c)k = ( kak 2 + kbk 2 + kck 2 ) 1/2 = k(kak, kbk, kck)k
dist(a, b) = ka − bk
I by triangle inequality
ka − ck = k(a − b) + (b − c)k ≤ ka − bk + kb − ck
[figure: points a, b, c with distances ka − bk, kb − ck, ka − ck illustrating the triangle inequality]
Feature distance and nearest neighbors
I if x and y are feature vectors for two entities, kx − yk is the feature distance
I if z1 , . . . , zm is a list of vectors, zj is the nearest neighbor of x if
kx − zj k ≤ kx − zi k, i = 1, . . . , m
[figure: points z1 , . . . , z6 and x, with the nearest neighbor of x highlighted]
I standard deviation of x is
  std(x) = rms(x̃) = kx − (1T x/n)1k / √n
I std(x) gives ‘typical’ amount xi vary from avg(x)
I std(x) = 0 only if x = α1 for some α
I greek letters µ, σ commonly used for mean, standard deviation
I a basic formula:
rms(x) 2 = avg(x) 2 + std(x) 2
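a sketch checking these formulas directly (NumPy assumed; the vector x is an arbitrary example):

```python
import numpy as np

x = np.array([1.0, -2.2, 0.3, 3.4, -1.5])
n = len(x)

avg = np.ones(n) @ x / n                       # avg(x) = 1^T x / n
rms = np.linalg.norm(x) / np.sqrt(n)           # rms(x)
std = np.linalg.norm(x - avg) / np.sqrt(n)     # std(x) = ||x - avg(x)1|| / sqrt(n)

print(np.isclose(rms**2, avg**2 + std**2))     # rms(x)^2 = avg(x)^2 + std(x)^2
```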
I avg(x) is the mean return over the period, usually just called return
I std(x) measures how variable the return is over the period, and is called
the risk
I multiple investments (with different return time series) are often compared
in terms of return and risk
[figures: four return time series a, b, c, d, and a scatter plot of (mean) return versus risk for each]
I for return time series with mean 8% and standard deviation 3%, loss
(xi ≤ 0) can occur in no more than (3/8) 2 = 14.1% of periods
I written out,
  |a1 b1 + · · · + an bn | ≤ ( a1² + · · · + an² ) 1/2 ( b1² + · · · + bn² ) 1/2
I we have
0 ≤ k βa − αbk 2
= k βak 2 − 2( βa) T (αb) + kαbk 2
= β 2 kak 2 − 2 βα(aT b) + α 2 kbk 2
= 2kak 2 kbk 2 − 2kak kbk(aT b) (taking β = kbk, α = kak)
  ∠(a, b) = arccos( aT b / (kak kbk) )
I ∠(a, b) is the number θ in [0, π] that satisfies kak kbk cos θ = aT b
I classification of angles θ = ∠(a, b):
I θ = π/2 = 90◦ : a and b are orthogonal, written a ⊥ b (aT b = 0)
I θ = 0: a and b are aligned (aT b = kak kbk )
I θ = π = 180◦ : a and b are anti-aligned (aT b = −kak kbk )
I θ ≤ π/2 = 90◦ : a and b make an acute angle (aT b ≥ 0)
I θ ≥ π/2 = 90◦ : a and b make an obtuse angle (aT b ≤ 0)
ã = a − avg(a)1, b̃ = b − avg(b)1
  ρ = ãT b̃ / ( k ãk k b̃k )
I ρ = cos ∠(ã, b̃)
– ρ = 0: a and b are uncorrelated
– ρ > 0.8 (or so): a and b are highly correlated
– ρ < −0.8 (or so): a and b are highly anti-correlated
I very roughly: highly correlated means ai and bi are typically both above
(below) their means together
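a minimal sketch of computing ρ from the definition above (NumPy assumed; the two series are made-up examples):

```python
import numpy as np

a = np.array([1.0, 3.0, 2.0, 5.0, 4.0])
b = np.array([2.1, 2.9, 2.4, 5.2, 4.1])

at = a - a.mean()              # a tilde = a - avg(a) 1
bt = b - b.mean()              # b tilde = b - avg(b) 1
rho = (at @ bt) / (np.linalg.norm(at) * np.linalg.norm(bt))
print(rho)                     # close to +1: highly correlated
```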
[figures: three pairs of time series ak , bk and their scatter plots, with correlation coefficients ρ = 97%, ρ = −99%, and ρ = 0.4%]
Clustering
Algorithm
Examples
Applications
I given N n-vectors x1 , . . . , xN
I patient clustering
– xi are patient attributes, test results, symptoms
I financial sectors
– xi are n-vectors of financial attributes of company i
I clustering objective is
  J clust = (1/N) ∑_{i=1}^{N} kxi − zci k 2
J clust = J1 + · · · + Jk , where Jj = (1/N) ∑_{i∈Gj } kxi − zj k 2
I this is the mean (or average or centroid) of the points in the partition:
  zj = (1/|Gj |) ∑_{i∈Gj } xi
given x1 , . . . , xN ∈ Rn and z1 , . . . , zk ∈ Rn
repeat
Update partition: assign i to Gj , where j = argmin_{j′} kxi − zj′ k 2
Update centroids: zj = (1/|Gj |) ∑_{i∈Gj } xi
I the final partition (and its value of J clust ) can depend on the initial
representatives
I common approach:
– run k-means 10 times, with different (often random) initial representatives
– take as final partition the one with the smallest value of J clust
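a short sketch of the k-means iteration described above (NumPy assumed; the data and number of clusters are placeholders, and ties or empty groups are not handled):

```python
import numpy as np

def kmeans(x, z, iters=20):
    # x: N x n data matrix, z: k x n initial representatives
    for _ in range(iters):
        # update partition: assign each x_i to the nearest representative
        d = np.linalg.norm(x[:, None, :] - z[None, :, :], axis=2)
        c = d.argmin(axis=1)
        # update representatives: centroid of each group
        z = np.array([x[c == j].mean(axis=0) for j in range(len(z))])
    Jclust = (np.linalg.norm(x - z[c], axis=1) ** 2).mean()
    return c, z, Jclust

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 2))
c, z, J = kmeans(x, x[:3].copy())   # first 3 points as initial representatives
print(J)
```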
[figure: clustering objective J clust versus iteration of the k-means algorithm]
[figures: J clust versus iteration for two larger examples]
Topics discovered (clusters 1–3)
Linear independence
Basis
Orthonormal vectors
Gram–Schmidt algorithm
β1 a1 + · · · + βk ak = 0
I the vectors a1 = (0.2, −7, 8.6), a2 = (−0.1, 2, −1), a3 = (0, −1, 2.2) are linearly dependent, since a1 + 2a2 − 3a3 = 0
I can express any of them as linear combination of the other two, e.g.,
a2 = (−1/2)a1 + (3/2)a3
β1 a1 + · · · + βk ak = 0
x = β1 a1 + · · · + βk ak
x = γ1 a1 + · · · + γk ak
then βi = γi for i = 1, . . . , k
I this means that (in principle) we can deduce the coefficients from x
( β1 − γ1 )a1 + · · · + ( βk − γk )ak = 0
b = β1 a1 + · · · + βn an
for some β1 , . . . , βn
b = b1 e1 + · · · + bn en
  aTi aj = 1 if i = j,   aTi aj = 0 if i ≠ j
I the 3-vectors (0, 0, −1), (1/√2)(1, 1, 0), (1/√2)(1, −1, 0) are orthonormal
given n-vectors a1 , . . . , ak
for i = 1, . . . , k
1. Orthogonalization: q̃i = ai − (qT1 ai )q1 − · · · − (qTi−1 ai )qi−1
2. Test for linear dependence: if q̃i = 0, quit
3. Normalization: qi = q̃i /k q̃i k
I if G–S does not stop early (in step 2), a1 , . . . , ak are linearly independent
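a compact sketch of the Gram–Schmidt algorithm above (NumPy assumed; the tolerance used to test q̃i = 0 is a choice, not part of the slides):

```python
import numpy as np

def gram_schmidt(a, tol=1e-10):
    # a: list of n-vectors; returns orthonormal q1, ..., qk, or stops early
    q = []
    for ai in a:
        qt = ai - sum((qi @ ai) * qi for qi in q)   # orthogonalization
        if np.linalg.norm(qt) <= tol:               # test for linear dependence
            return q, False
        q.append(qt / np.linalg.norm(qt))           # normalization
    return q, True

a = [np.array([1.0, 0.0, 1.0]), np.array([1.0, 1.0, 0.0])]
q, independent = gram_schmidt(a)
print(independent, np.round(q[0] @ q[1], 12))       # True, 0.0
```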
[figures: Gram–Schmidt algorithm applied to two 2-vectors a1 , a2 , showing q̃1 , q1 , q̃2 = a2 − (qT1 a2 )q1 , and q2 ]
I so qi ⊥ q1 , . . . , qi ⊥ qi−1
I ai is a linear combination of q1 , . . . , qi :
  ai = (qT1 ai )q1 + · · · + (qTi−1 ai )qi−1 + k q̃i k qi
Matrices
Matrix-vector multiplication
Examples
example of a 3 × 4 matrix:
  [ 0    1   −2.3   0.1 ]
  [ 1.3  4   −0.1   0   ]
  [ 4.1 −1    0     1.7 ]
an m × n matrix A is
I tall if m > n
I wide if m < n
I square if m = n
1.2
−0.3
1.4
2.6
      [ B  C ]
  A = [ D  E ]
where B, C, D, and E are matrices (called submatrices or blocks of A)
I matrices in each block row must have same height (row dimension)
I matrices in each block column must have same width (column dimension)
I example: if
  B = [ 0  2  3 ] ,   C = [ −1 ] ,   D = [ 2  2  1 ] ,   E = [ 4 ]
                                         [ 1  3  5 ]         [ 4 ]
then
  [ B  C ]   [ 0  2  3  −1 ]
  [ D  E ] = [ 2  2  1   4 ]
             [ 1  3  5   4 ]
I A is an m × n matrix
I can express as block matrix with its (m-vector) columns a1 , . . . , an
  A = [ a1  a2  · · ·  an ]
I similarly, we can express A as a block matrix with its (1 × n) rows b1 , . . . , bm :
       [ b1 ]
  A  = [ b2 ]
       [ ⋮  ]
       [ bm ]
I contingency table: Aij is number of objects with first attribute i and second
attribute j
I can be represented as n × n matrix with Aij = 1 if (i, j) ∈ R
       [ 0  1  1  0 ]
  A  = [ 1  0  0  1 ]
       [ 0  0  0  1 ]
       [ 1  0  0  0 ]
Special matrices
I identity matrix is square matrix with Iii = 1 and Iij = 0 for i ≠ j, e.g.,
  [ 1  0 ]     [ 1  0  0  0 ]
  [ 0  1 ] ,   [ 0  1  0  0 ]
               [ 0  0  1  0 ]
               [ 0  0  0  1 ]
I sparse matrix: most entries are zero
– examples: 0 and I
– can be stored and manipulated efficiently
– nnz(A) is number of nonzero entries
I examples:
  [ 1   −1    0.7 ]                        [ −0.6   0  ]
  [ 0    1.2 −1.1 ]  (upper triangular),   [ −0.3  3.5 ]  (lower triangular)
  [ 0    0    3.2 ]
I for example,
  [ 0  4 ] T
  [ 7  0 ]    =  [ 0  7  3 ]
  [ 3  1 ]       [ 4  0  1 ]
I transpose converts column to row vectors (and vice versa)
I (AT ) T = A
I (just like vectors) we can add or subtract matrices of the same size:
  (A + B)ij = Aij + Bij , i = 1, . . . , m, j = 1, . . . , n (subtraction is similar)
I scalar multiplication:
(αA)ij = αAij , i = 1, . . . , m, j = 1, . . . , n
A + B = B + A, α(A + B) = αA + αB, (A + B) T = AT + BT
kαAk = |α|kAk
kA + Bk ≤ kAk + kBk
kAk ≥ 0
kAk = 0 only if A = 0
I distance between two matrices: kA − Bk
I (there are other matrix norms, which we won’t use)
yi = Ai1 x1 + · · · + Ain xn , i = 1, . . . , m
I for example,
  [  0  2  −1 ] [  2 ]   [  3 ]
  [ −2  1   1 ] [  1 ] = [ −4 ]
                [ −1 ]
I y = Ax can be expressed as
yi = bTi x, i = 1, . . . , m
I y = Ax can be expressed as
y = x1 a1 + x2 a2 + · · · + xn an
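a small sketch checking the row and column interpretations on the example above (NumPy assumed):

```python
import numpy as np

A = np.array([[0.0, 2.0, -1.0],
              [-2.0, 1.0, 1.0]])
x = np.array([2.0, 1.0, -1.0])

y = A @ x                                                    # (3, -4)
y_rows = np.array([A[i] @ x for i in range(A.shape[0])])     # y_i = (row i of A)^T x
y_cols = sum(x[j] * A[:, j] for j in range(A.shape[1]))      # x1 a1 + ... + xn an
print(y, np.allclose(y, y_rows), np.allclose(y, y_cols))
```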
I (n − 1) × n difference matrix is
       [ −1   1   0   · · ·   0   0   0 ]
       [  0  −1   1   · · ·   0   0   0 ]
  D  = [              ⋱                 ]
       [  0   0   0   · · ·  −1   1   0 ]
       [  0   0   0   · · ·   0  −1   1 ]
y = Dx is (n − 1) -vector of differences of consecutive entries of x:
  Dx = (x2 − x1 , x3 − x2 , . . . , xn − xn−1 )
I Dirichlet energy: kDxk 2 is measure of wiggliness for x a time series
Return matrix – portfolio vector
I A is m × n matrix
I y = Ax
I n-vector x is input or action
I m-vector y is output or result
I Aij is the factor by which yi depends on xj
I Aij is the gain from input j to output i
I e.g., if A is lower triangular, then yi only depends on x1 , . . . , xi
Geometric transformations
Selectors
Incidence matrix
Convolution
I many geometric transformations and mappings of 2-D and 3-D vectors can
be represented via matrix multiplication y = Ax
I rotation by angle θ:
      [ cos θ   − sin θ ]
  y = [ sin θ     cos θ ] x
  D(v) = ∑_{edges (k,l)} (vl − vk ) 2
c1 = a1 b1
c2 = a1 b2 + a2 b1
c3 = a1 b3 + a2 b2 + a3 b1
c4 = a2 b3 + a3 b2 + a4 b1
c5 = a3 b3 + a4 b2
c6 = a4 b3
a∗b=b∗a
(a ∗ b) ∗ c = a ∗ (b ∗ c)
a ∗ b = 0 only if a = 0 or b = 0
I can express convolution as a matrix-vector product, c = a ∗ b = T (b)a, with the Toeplitz matrix
           [ b1   0    0    0  ]
           [ b2   b1   0    0  ]
  T (b)  = [ b3   b2   b1   0  ]
           [ 0    b3   b2   b1 ]
           [ 0    0    b3   b2 ]
           [ 0    0    0    b3 ]
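a hedged sketch checking the convolution entries listed above and the Toeplitz form a ∗ b = T(b)a (NumPy assumed; the particular a and b are arbitrary):

```python
import numpy as np

a = np.array([1.0, 0.0, -1.0, 2.0])    # n = 4
b = np.array([2.0, 1.0, -1.0])         # m = 3

c = np.convolve(a, b)                  # (n + m - 1)-vector a * b

# Toeplitz matrix T(b), so that a * b = T(b) a
T = np.zeros((len(a) + len(b) - 1, len(a)))
for j in range(len(a)):
    T[j:j + len(b), j] = b
print(np.allclose(c, T @ a))           # True
```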
[figures: a time series xk and the smoothed series (a ∗ x)k versus k]
Input-output convolution system
I h3 is the factor by which current output depends on what the input was 2
time steps before
Linear functions
Linear equations
I f is linear:
f (x) = f (x1 e1 + x2 e2 + · · · + xn en )
= x1 f (e1 ) + x2 f (e2 ) + · · · + xn f (en )
= Ax
  with A = [ f (e1 )  f (e2 )  · · ·  f (en ) ]
Examples
I reverser matrix (f (x) = (xn , . . . , x1 )):
       [ 0  · · ·  0  1 ]
  A  = [ 0  · · ·  1  0 ]
       [ ⋮          ⋮   ]
       [ 1  · · ·  0  0 ]
I running-sum matrix (f (x) = (x1 , x1 + x2 , . . . , x1 + · · · + xn )):
       [ 1  0  · · ·  0  0 ]
       [ 1  1  · · ·  0  0 ]
  A  = [ ⋮     ⋱        ⋮  ]
       [ 1  1  · · ·  1  0 ]
       [ 1  1  · · ·  1  1 ]
f (x) = Ax + b
I same as:
f (αx + βy) = αf (x) + βf (y)
holds for all x, y, and α , β with α + β = 1
I n goods or services
I prices given by n-vector p, demand given as n-vector d
I δi^price = (pi^new − pi )/pi is fractional change in prices
I δi^dem = (di^new − di )/di is fractional change in demands
I price-demand elasticity model: δ^dem = E δ^price
I suppose f : Rn → Rm is differentiable
I first order Taylor approximation f̂ of f near z:
  f̂i (x) = fi (z) + (∂fi /∂x1 )(z)(x1 − z1 ) + · · · + (∂fi /∂xn )(z)(xn − zn )
          = fi (z) + ∇fi (z)T (x − z)
I regression model: ŷ = xT β + v
– x is n-vector of features or regressors
– β is n-vector of model parameters; v is offset parameter
– (scalar) ŷ is our prediction of y
I x is called a solution if Ax = b
I depending on A and b, there can be
– no solution
– one solution
– many solutions
I expressed as
a1 R1 + · · · + ap Rp −→ b1 P1 + · · · + bq Pq
I R1 , . . . , Rp are reactants
I P1 , . . . , Pq are products
I a1 , . . . , ap , b1 , . . . , bq are positive coefficients
I coefficients usually integers, but can be scaled
– e.g., multiplying all coefficients by 1/2 doesn’t change the reaction
2H2 O −→ 2H2 + O2
I for each atom, total number on LHS must equal total on RHS
I e.g., electrolysis reaction is balanced:
– 4 units of H on LHS and RHS
– 2 units of O on LHS and RHS
for i = 1, . . . , m and j = 1, . . . , p
I conservation of mass is
  [ R   −P ] [ a ]
             [ b ] = 0
I simple solution is a = b = 0
  [ R    −P ] [ a ]
  [ e1T   0 ] [ b ] = em+1
  a1 Cr2O7^{2−} + a2 Fe^{2+} + a3 H^{+} −→ b1 Cr^{3+} + b2 Fe^{3+} + b3 H2O
I 5 atoms/charge: Cr, O, Fe, H, charge
I reactant and product matrices (rows correspond to Cr, O, Fe, H, charge):
       [  2   0   0 ]         [ 1   0   0 ]
       [  7   0   0 ]         [ 0   0   1 ]
  R  = [  0   1   0 ] ,  P  = [ 0   1   0 ]
       [  0   0   1 ]         [ 0   0   2 ]
       [ −2   2   1 ]         [ 3   3   0 ]
I balancing equations (including a1 = 1 constraint)
  [  2   0   0  −1    0    0 ] [ a1 ]   [ 0 ]
  [  7   0   0   0    0   −1 ] [ a2 ]   [ 0 ]
  [  0   1   0   0   −1    0 ] [ a3 ] = [ 0 ]
  [  0   0   1   0    0   −2 ] [ b1 ]   [ 0 ]
  [ −2   2   1  −3   −3    0 ] [ b2 ]   [ 0 ]
  [  1   0   0   0    0    0 ] [ b3 ]   [ 1 ]
Balancing equations example
  Cr2O7^{2−} + 6 Fe^{2+} + 14 H^{+} −→ 2 Cr^{3+} + 6 Fe^{3+} + 7 H2O
Population dynamics
Epidemic dynamics
I sequence of n-vectors x1 , x2 , . . .
I examples: xt represents
– age distribution in a population
– economic output in n sectors
– mechanical variables
xt+1 = At xt , t = 1, 2, . . .
xt+1 = At xt + Bt ut + ct , t = 1, 2, . . .
– ut is an input m-vector
– Bt is n × m input matrix
– ct is offset
I K -Markov model:
xt+1 = A1 xt + · · · + AK xt−K+1 , t = K, K + 1, . . .
[figure: population (millions) versus age, 0–100]
I b and d can vary with time, but we’ll assume they are constant
[figures: birth rate and death rate versus age, 0–100]
(xt+1 )1 = bT xt
I number of i-year-olds next year is number of (i − 1) -year-olds this year, minus those who die:
  (xt+1 )i = (1 − di−1 )(xt )i−1 , i = 2, . . . , 100
[figure: predicted population (millions) versus age, 0–100]
[figure: fractions of susceptible, infected, recovered, and deceased individuals (entries of xt ) versus time t]
Matrix multiplication
Matrix powers
QR factorization
for i = 1, . . . , m, j = 1, . . . , n
I example:
  [ −1.5   3   2 ] [ −1  −1 ]   [  3.5  −4.5 ]
  [  1    −1   0 ] [  0  −2 ] = [ −1     1   ]
                   [  1   0 ]
I inner product aT b
I matrix-vector multiplication Ax
          [ a1 b1   a1 b2   · · ·   a1 bn ]
  ab T  = [ a2 b1   a2 b2   · · ·   a2 bn ]
          [   ⋮       ⋮               ⋮   ]
          [ am b1   am b2   · · ·   am bn ]
  [ A  B ] [ E  F ]   [ AE + BG   AF + BH ]
  [ C  D ] [ G  H ] = [ CE + DG   CF + DH ]
I denote columns of B by bi :
  B = [ b1  b2  · · ·  bn ]
I then we have
  AB = A [ b1  b2  · · ·  bn ] = [ Ab1  Ab2  · · ·  Abn ]
Axi = bi , i = 1, . . . , k
I A is an m × p matrix, B is p × n
I define f : Rp → Rm and g : Rn → Rp as f (u) = Au, g(v) = Bv
I we have
h(x) = f (g(x)) = A(Bx) = (AB)x
I composition of linear functions is linear
I Dn is (n − 1) × n difference matrix:
  Dn x = (x2 − x1 , . . . , xn − xn−1 )
I Dn−1 is (n − 2) × (n − 1) difference matrix
I ∆ = Dn−1 Dn is the (n − 2) × n second difference matrix; for n = 5,
  [ 1 −2  1  0  0 ]   [ −1  1  0  0 ]  [ −1  1  0  0  0 ]
  [ 0  1 −2  1  0 ] = [  0 −1  1  0 ]  [  0 −1  1  0  0 ]
  [ 0  0  1 −2  1 ]   [  0  0 −1  1 ]  [  0  0 −1  1  0 ]
                                       [  0  0  0 −1  1 ]
I example:
       [ 0  1  0  0  1 ]
       [ 1  0  1  0  0 ]
  A  = [ 0  0  1  1  1 ]
       [ 1  0  0  0  0 ]
       [ 0  0  0  1  0 ]
I QT Q = I
I from Gram–Schmidt algorithm
I A = QR is called QR factorization of A
I factors satisfy QT Q = I , R upper triangular with positive diagonal entries
Inverse
Examples
Pseudo-inverse
I example: the 3 × 2 matrix A given below has two different left inverses:
  B = (1/9) [ −11  −10   16 ] ,   C = (1/2) [ 0  −1   6 ]
            [   7    8  −11 ]               [ 0   1  −4 ]
0 = C0 = C(Ax) = (CA)x = Ix = x
I matrix generalization of
a number is invertible if and only if it is nonzero
I example continued: with
        [ −3  −4 ]         [  1 ]
   A =  [  4   6 ] ,  b =  [ −2 ]
        [  1   1 ]         [  0 ]
  we get Bb = (1, −1)
I and also Cb = (1, −1)
Right inverses
AX = I ⇐⇒ (AX) T = I ⇐⇒ X T AT = I
I so we conclude
A is right-invertible if and only if its rows are linearly independent
I right-invertible matrices are wide or square
I x = Bb is a solution:
Ax = A(Bb) = (AB)b = Ib = b
I if A has a left and a right inverse, they are unique and equal
(and we say that A is invertible)
I so A must be square
I to see this: if AX = I , YA = I
X = IX = (YA)X = Y (AX) = YI = Y
A−1 A = AA−1 = I
I suppose A is invertible
x = A−1 b
I A is invertible
I columns of A are linearly independent
I I −1 = I
I if Q is orthogonal, i.e., square with QT Q = I , then Q−1 = QT
  A −1 = (1/(A11 A22 − A12 A21 )) [  A22   −A12 ]
                                  [ −A21    A11 ]
       [  1  −2   3 ]
  A  = [  0   2   2 ]
       [ −3  −4  −4 ]
L11 x1 = 0
L21 x1 + L22 x2 = 0
..
.
Ln1 x1 + Ln2 x2 + · · · + Ln,n−1 xn−1 + Lnn xn = 0
I so we have
A−1 = (QR) −1 = R−1 Q−1 = R−1 QT
I write out Rx = b as
  R11 x1 + R12 x2 + · · · + R1n xn = b1
  ⋮
  Rn−1,n−1 xn−1 + Rn−1,n xn = bn−1
  Rnn xn = bn
I back substitution solves for xn , xn−1 , . . . , x1 in that order; it computes x = R−1 b
I complexity:
– first step requires 1 flop (division)
– 2nd step needs 3 flops
– ith step needs 2i − 1 flops
total is 1 + 3 + · · · + (2n − 1) = n2 flops
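a minimal sketch of back substitution for Rx = b (NumPy assumed; R is the upper triangular example from an earlier slide, b is arbitrary):

```python
import numpy as np

def back_substitution(R, b):
    # solve Rx = b with R upper triangular and nonzero diagonal entries
    n = len(b)
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):
        x[i] = (b[i] - R[i, i + 1:] @ x[i + 1:]) / R[i, i]
    return x

R = np.array([[1.0, -1.0, 0.7],
              [0.0, 1.2, -1.1],
              [0.0, 0.0, 3.2]])
b = np.array([1.0, 2.0, 3.0])
print(np.allclose(back_substitution(R, b), np.linalg.solve(R, b)))  # True
```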
p(x) = c1 + c2 x + c3 x2 + c4 x3
that satisfies
I write as Ac = b, with
       [ 1   −1.1   (−1.1) 2   (−1.1) 3 ]
  A  = [ 1   −0.4   (−0.4) 2   (−0.4) 3 ]
       [ 1    0.1    (0.1) 2    (0.1) 3 ]
       [ 1    0.8    (0.8) 2    (0.8) 3 ]
[figures: the interpolating polynomial p(x) versus x, and related example plots]
so Ax = 0
I when A has linearly independent columns, the pseudo-inverse A† = (AT A)−1 AT is a left inverse of A
I when A has linearly independent rows, the pseudo-inverse A† = AT (AAT )−1 is a right inverse of A
I residual is r = Ax − b
kAx̂ − bk 2 ≤ kAx − bk 2
I then
kAx − bk 2 = k(x1 a1 + · · · + xn an ) − bk 2
I so least squares problem is to find a linear combination of columns of A
that is closest to b
I example:
       [  2  0 ]         [  1 ]
  A =  [ −1  1 ] ,  b =  [  0 ]
       [  0  2 ]         [ −1 ]
I Ax = b has no solution
I least squares problem is to choose x to minimize kAx − bk 2
  [figure: level curves of f (x) = kAx − bk 2 in the (x1 , x2 ) plane, with minimizer x̂]
x̂ = (AT A) −1 AT b = A† b
I define
  f (x) = kAx − bk 2 = ∑_{i=1}^{m} ( ∑_{j=1}^{n} Aij xj − bi ) 2
I solution x̂ satisfies
  ∂f /∂xk (x̂) = (∇f (x̂))k = 0, k = 1, . . . , n
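a sketch of solving the least squares problem numerically (NumPy assumed); it compares the pseudo-inverse formula with a library solver on the small example from these slides:

```python
import numpy as np

A = np.array([[2.0, 0.0],
              [-1.0, 1.0],
              [0.0, 2.0]])
b = np.array([1.0, 0.0, -1.0])

# x_hat = (A^T A)^{-1} A^T b  (pseudo-inverse formula)
x_hat = np.linalg.solve(A.T @ A, A.T @ b)

# same result from a least squares solver
x_ls, *_ = np.linalg.lstsq(A, b, rcond=None)
print(x_hat, np.allclose(x_hat, x_ls))      # (1/3, -1/3), True
```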
I m = 10 groups, n = 3 channels
I target views vector vdes = 103 × 1
I optimal spending is ŝ = (62, 100, 1443)
[figures: impressions per group for each channel, and impressions per group achieved by the optimal spending ŝ]
Illumination
I n lamps illuminate an area divided in m regions
I Aij is illumination in region i if lamp j is on with power 1, other lamps are off
I xj is power of lamp j
I (Ax)i is illumination level at region i
I bi is target illumination level at region i
[figure: positions and heights of the 10 lamps in a 25m × 25m area; example with m = 25² = 625 regions and n = 10 lamps]
Illumination
I equal lamp powers (x = 1)
[histogram: number of regions versus illumination intensity, for equal lamp powers x = 1]
I least squares solution x̂, with b = 1
[histogram: number of regions versus illumination intensity, for the least squares solution x̂]
13. Least squares data fitting
Outline
Validation
Feature engineering
y ≈ f (x)
I A = 1, so
θ̂ 1 = (1T 1) −1 1T yd = (1/N)1T yd = avg(yd )
I the mean of y (1) , . . . , y (N) is the least squares fit by a constant
I we can plot the data (xi , yi ) and the model function ŷ = f̂ (x)
  f̂ (x) = avg(yd ) + ρ (std(yd )/std(xd )) (x − avg(xd ))
[figures: petroleum consumption (million barrels per day) versus year, 1980–2015, with its straight-line fit, and a second time series with its straight-line fit]
I fi (x) = xi−1 , i = 1, . . . , p
I model is a polynomial of degree less than p
f̂ (x) = θ 1 + θ 2 x + · · · + θ p xp−1
I A is Vandermonde matrix
       [ 1   x (1)    · · ·   (x (1) ) p−1 ]
  A  = [ 1   x (2)    · · ·   (x (2) ) p−1 ]
       [ ⋮     ⋮                  ⋮        ]
       [ 1   x (N)    · · ·   (x (N) ) p−1 ]
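a sketch of polynomial least squares fitting via the Vandermonde matrix (NumPy assumed; the data is synthetic and the choice p = 4 is an arbitrary placeholder):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-1, 1, 30)
y = 1 - 2 * x + 0.5 * x**3 + 0.1 * rng.normal(size=x.size)   # synthetic data

p = 4                                     # number of coefficients (degree p - 1)
A = np.vander(x, p, increasing=True)      # columns 1, x, x^2, ..., x^{p-1}
theta, *_ = np.linalg.lstsq(A, y, rcond=None)
print(theta)                              # fitted coefficients theta_1, ..., theta_p
```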
[figures: least squares polynomial fits f̂ (x) of degree 10 and degree 15]
Regression as general data fitting
so model is
ŷ = θ 1 + θ 2 x1 + · · · + θ n+1 xn = xT θ 2:n+1 + θ 1
I β = θ 2:n+1 , v = θ 1
I time series z1 , z2 , . . .
ẑt+1 = θ 1 zt + · · · + θ M zt−M+1 , t = M, M + 1, . . .
I M is memory of model
I ẑt+1 is prediction of next value, based on previous M values
I we'll choose the coefficients θ 1 , . . . , θ M to minimize the sum of squares of the prediction errors ẑt+1 − zt+1 (see the sketch below)
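a hedged sketch of setting up this least squares problem (NumPy assumed; the synthetic series and memory M are placeholders):

```python
import numpy as np

rng = np.random.default_rng(2)
z = np.cumsum(rng.normal(size=200))      # synthetic time series z_1, z_2, ...
M = 8                                    # memory of the model

# each row of A holds (z_t, z_{t-1}, ..., z_{t-M+1}); each entry of b is z_{t+1}
A = np.array([z[i - M:i][::-1] for i in range(M, len(z))])
b = z[M:]

theta, *_ = np.linalg.lstsq(A, b, rcond=None)
pred = A @ theta                          # one-step-ahead predictions
print(np.sqrt(np.mean((pred - b) ** 2)))  # RMS prediction error on this data
```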
solid line shows one-hour ahead predictions from AR model, first 5 days
[figure: temperature (°F) versus sample k, with the AR model predictions]
basic idea:
I goal of model is not to predict outcome for the given data
I instead it is to predict the outcome on new, unseen data
I (can also compare RMS prediction error on train and test data)
models fit using training set of 100 points; plots show test set of 100 points
[figures: polynomial fits f̂ (x) of degrees 2, 6, 10, and 15, shown on the test set]
Example
[figure: relative RMS error on the train and test sets versus polynomial degree]
ŷ = θ 1 f1 (x) + · · · + θ p fp (x)
(xi − bi )/ai
– bi ≈ mean value of the feature across the data
– ai ≈ standard deviation of the feature across the data
new features are called z-scores
log(1 + xi )
Classification
Multi-class classifiers
I disease detection
– x contains patient features, results of medical tests
I the errors have many other names, like Type I and Type II
           ŷ = +1       ŷ = −1       Total
y = +1     Ntp          Nfn          Np
y = −1     Nfp          Ntn          Nn
All        Ntp + Nfp    Nfn + Ntn    N
I fit model f̃ to encoded (±1) y (i) values using standard least squares data
fitting
training set (N = 60000):
           ŷ = +1    ŷ = −1    Total
y = +1     5158      765       5923
y = −1     167       53910     54077
All        5325      54675     60000

test set (N = 10000):
           ŷ = +1    ŷ = −1    Total
y = +1     864       116       980
y = −1     42        8978      9020
All        906       9094      10000
[figure: histograms of f̃ (x (i) ) for the positive and negative examples]
  f̂ (x) = +1 if f̃ (x) ≥ α,   f̂ (x) = −1 if f̃ (x) < α
I for negative α , false positive rate is higher but so is true positive rate
I trade off curve of true positive versus false positive rates is called receiver
operating characteristic (ROC)
[figures: error rates for the positive and negative classes versus α, and the ROC curve of true positive rate versus false positive rate, with points marked for α = −0.25, 0, 0.25]
I predictor is f̂ : Rn → {1, . . . , K}
I disease diagnosis
– guess diagnosis from among a set of candidates, from test results, patient
features
I create a least squares classifier for each label versus the others
I take as classifier
  f̂ (x) = argmax_{ℓ ∈ {1, . . . , K}} f̃ℓ (x)
I example: if f̃3 (x) is the largest of f̃1 (x), . . . , f̃K (x), we choose f̂ (x) = 3
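a hedged sketch of the one-versus-others least squares classifier (NumPy assumed; the data here is random placeholder data, just to show the shape of the computation):

```python
import numpy as np

rng = np.random.default_rng(3)
K, n, N = 3, 5, 200
X = rng.normal(size=(N, n))
y = rng.integers(1, K + 1, size=N)       # labels in {1, ..., K}

A = np.hstack([np.ones((N, 1)), X])      # prepend a constant feature
Theta = np.column_stack([
    np.linalg.lstsq(A, np.where(y == l, 1.0, -1.0), rcond=None)[0]
    for l in range(1, K + 1)
])                                        # one least squares fit per label

f_tilde = A @ Theta                       # f~_l(x_i) for every i and l
y_hat = f_tilde.argmax(axis=1) + 1        # f_hat(x) = argmax_l f~_l(x)
print((y_hat == y).mean())                # training accuracy (near chance on random data)
```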
Prediction
Digit 0 1 2 3 4 5 6 7 8 9 Total
0 944 0 1 2 2 8 13 2 7 1 980
1 0 1107 2 2 3 1 5 1 14 0 1135
2 18 54 815 26 16 0 38 22 39 4 1032
3 4 18 22 884 5 16 10 22 20 9 1010
4 0 22 6 0 883 3 9 1 12 46 982
5 24 19 3 74 24 656 24 13 38 17 892
6 17 9 10 0 22 17 876 0 7 0 958
7 5 43 14 6 25 1 1 883 1 49 1028
8 14 48 11 31 26 40 17 13 756 18 974
9 16 10 3 17 80 0 1 75 4 803 1009
All 1042 1330 887 1042 1086 742 994 1032 898 947 10000
Prediction
Digit 0 1 2 3 4 5 6 7 8 9 Total
0 972 0 0 2 0 1 1 1 3 0 980
1 0 1126 3 1 1 0 3 0 1 0 1135
2 6 0 998 3 2 0 4 7 11 1 1032
3 0 0 3 977 0 13 0 5 8 4 1010
4 2 1 3 0 953 0 6 3 1 13 982
5 2 0 1 5 0 875 5 0 3 1 892
6 8 3 0 0 4 6 933 0 4 0 958
7 0 8 12 0 2 0 1 992 3 10 1028
8 3 1 3 6 4 3 2 2 946 4 974
9 4 3 1 12 11 7 1 3 3 964 1009
All 997 1142 1024 1006 977 905 956 1013 983 997 10000
J1 = kA1 x − b1 k 2 , . . . , Jk = kAk x − bk k 2
I Ai is an mi × n matrix, bi is an mi -vector, i = 1, . . . , k
I Ji are the objectives in a multi-objective optimization problem
(also called a multi-criterion problem)
I could choose x to minimize any one Ji , but we want one x that makes them
all small
J = λ 1 J1 + · · · + λ k Jk = λ 1 kA1 x − b1 k 2 + · · · + λ k kAk x − bk k 2
  J = k ( √λ1 (A1 x − b1 ), . . . , √λk (Ak x − bk ) ) k 2
I so we have J = k Ãx − b̃k 2 , with
        [ √λ1 A1 ]           [ √λ1 b1 ]
  Ã  =  [    ⋮   ] ,   b̃  =  [    ⋮   ]
        [ √λk Ak ]           [ √λk bk ]
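a small sketch of this stacking trick (NumPy assumed; the matrices, vectors, and weights are random placeholders matching the 10 × 5 example below):

```python
import numpy as np

rng = np.random.default_rng(4)
A1, b1 = rng.normal(size=(10, 5)), rng.normal(size=10)
A2, b2 = rng.normal(size=(10, 5)), rng.normal(size=10)
lam = [1.0, 0.1]                          # weights lambda_1, lambda_2

A_til = np.vstack([np.sqrt(lam[0]) * A1, np.sqrt(lam[1]) * A2])
b_til = np.concatenate([np.sqrt(lam[0]) * b1, np.sqrt(lam[1]) * b2])
x, *_ = np.linalg.lstsq(A_til, b_til, rcond=None)

J1 = np.linalg.norm(A1 @ x - b1) ** 2
J2 = np.linalg.norm(A2 @ x - b2) ** 2
print(J1, J2)                             # the two objective values at the weighted solution
```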
A1 and A2 both 10 × 5
[figures: J1 (λ) and J2 (λ) versus λ, and the trade-off curve of J2 versus J1 with points marked at λ = 0.1, 1, 10]
Control
y = Ax + b
I trades off deviation from target demand and price change magnitude
  (1/K) ∑_{k=1}^{K} ky (k) − y des k 2
I measurement model: y = Ax + v
I can get far better results by incorporating prior information about x into
estimation, e.g.,
– x should be not too large
– x should be smooth
I we minimize J1 + λJ2
I x is an image
I A is a blurring operator
I y = Ax + v is a blurred, noisy image
I least squares de-blurring: choose x to minimize kAx − yk 2 + λ J2 (x), where J2 penalizes non-smoothness of x
[figure: de-blurred images for λ = 10−6 , 10−4 , 10−2 , and 1]
I Aij is the length of the intersection of the line in measurement i with voxel j
[figure: the line in measurement i crossing voxels x1 , x2 , . . . , x6 ]
I for example, if x varies smoothly over region, use Dirichlet energy for graph
that connects each voxel to its neighbors
[figure: reconstructions for λ = 5, 10, and 100]
with f1 (x) = 1
kAθ − yk 2 + λ kθ 2:p k 2
kX T β + v1 − yk 2 + λ k βk 2
I solid line is signal used to generate synthetic (simulated) data
I 10 blue points are used as training set; 20 red points are used as test set
I we fit a model with five parameters θ 1 , . . . , θ 5 :
  f̂ (x) = θ 1 + ∑_{k=1}^{4} θ k+1 cos(ωk x + φk ) (with given ωk , φk )
[figures: train and test RMS error versus λ, and the estimated coefficients versus λ, for λ from 10−5 to 105 ]
I minimum test RMS error is for λ around 0.08
I increasing λ ‘shrinks’ the coefficients θ 2 , . . . , θ 5
I dashed lines show coefficients used to generate the data
I for λ near 0.08, estimated coefficients are close to these ‘true’ values
16. Constrained least squares
Outline
minimize kAx − bk 2
subject to Cx = d
  f̂ (x) = p(x) = θ 1 + θ 2 x + θ 3 x 2 + θ 4 x 3   for x ≤ a
  f̂ (x) = q(x) = θ 5 + θ 6 x + θ 7 x 2 + θ 8 x 3   for x > a
(a is given)
  ∑_{i=1}^{N} ( f̂ (xi ) − yi ) 2
[figure: piecewise-polynomial fit f̂ (x), equal to p(x) for x ≤ a and q(x) for x > a]
  θ 1 + θ 2 a + θ 3 a 2 + θ 4 a 3 − θ 5 − θ 6 a − θ 7 a 2 − θ 8 a 3 = 0
  θ 2 + 2θ 3 a + 3θ 4 a 2 − θ 6 − 2θ 7 a − 3θ 8 a 2 = 0
I least-norm problem:
minimize kxk 2
subject to Cx = d
i.e., find the smallest vector that satisfies a set of linear equations
  v fin = f1 + f2 + · · · + f10 ,   p fin = (19/2)f1 + (17/2)f2 + · · · + (1/2)f10
[figures: force versus time and resulting position versus time]
I least-norm problem:
  minimize    kf k 2
  subject to  [ 1      1      · · ·   1     1   ]       [ 0 ]
              [ 19/2   17/2   · · ·   3/2   1/2 ] f  =  [ 1 ]
  with variable f
[figures: least-norm force versus time and resulting position versus time]
  ∂L/∂xi (x̂, z) = 0, i = 1, . . . , n,    ∂L/∂zi (x̂, z) = 0, i = 1, . . . , p
I ∂L/∂zi (x̂, z) = cTi x̂ − di = 0, which we already knew
I first n equations are more interesting:
  ∂L/∂xi (x̂, z) = 2 ∑_{j=1}^{n} (AT A)ij x̂j − 2(AT b)i + ∑_{j=1}^{p} zj (cj )i = 0
  [ 2AT A   CT ] [ x̂ ]   [ 2AT b ]
  [ C        0 ] [ z ] = [ d     ]
I conditions: C has linearly independent rows, and the stacked matrix
    [ A ]
    [ C ]
  has linearly independent columns
I implies m + p ≥ n, p ≤ n
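a sketch of solving the constrained least squares problem through this KKT system (NumPy assumed; the problem data is a random placeholder):

```python
import numpy as np

rng = np.random.default_rng(5)
m, n, p = 8, 5, 2
A, b = rng.normal(size=(m, n)), rng.normal(size=m)
C, d = rng.normal(size=(p, n)), rng.normal(size=p)

# KKT matrix [[2 A^T A, C^T], [C, 0]] and right-hand side [2 A^T b, d]
KKT = np.block([[2 * A.T @ A, C.T], [C, np.zeros((p, p))]])
rhs = np.concatenate([2 * A.T @ b, d])
sol = np.linalg.solve(KKT, rhs)
x_hat, z = sol[:n], sol[n:]

print(np.allclose(C @ x_hat, d))          # the constraint Cx = d holds
```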
I for the least-norm problem (minimize kxk 2 subject to Cx = d) we have A = I , b = 0
I the stacked matrix
    [ I ]
    [ C ]
  always has independent columns
I we assume that C has independent rows
I optimality condition reduces to
    [ 2I   CT ] [ x̂ ]   [ 0 ]
    [ C     0 ] [ z ] = [ d ]
I so x̂ = CT (CCT )−1 d = C† d
Portfolio optimization
Vt+1 = V1 (1 + r1 )(1 + r2 ) · · · (1 + rt )
I define T × n (asset) return matrix, with Rtj the return of asset j in period t
I row t of R is r̃tT , where r̃t is the asset return vector over period t
I if last asset is risk-free, the last column of R is µrf 1, where µrf is the
risk-free per-period interest rate
I risk is std(r)
VT+1 = V1 (1 + r1 ) · · · (1 + rT )
≈ V1 + V1 (r1 + · · · + rT )
= V1 + T avg(r)V1
I mean return and risk are often expressed in annualized form (i.e., per year)
(the squareroot in risk annualization comes from the assumption that the
fluctuations in return around the mean are independent)
I . . . and hope it will work well going forward (just like data fitting)
I we are really asking what would have been the best constant allocation,
had we known future returns
  [ w  ]   [ 2RT R   1   µ ] −1 [ 2ρT µ ]
  [ z1 ] = [ 1T      0   0 ]    [   1   ]
  [ z2 ]   [ µT      0   0 ]    [   ρ   ]
  (equivalently, since RT 1 = T µ, the top right-hand-side entry can be written 2ρRT 1)
[figures: annualized return versus risk for the optimized portfolios, the risk-free asset, and the 1/n portfolio]
Portfolio                 Return (Train)   Return (Test)   Risk (Train)   Risk (Test)   Leverage
risk-free                 0.01             0.01            0.00           0.00          1.00
ρ = 10%                   0.10             0.08            0.09           0.07          1.96
ρ = 20%                   0.20             0.15            0.18           0.15          3.03
ρ = 40%                   0.40             0.30            0.38           0.31          5.48
1/n (uniform weights)     0.10             0.21            0.23           0.13          1.00
[figures: total portfolio value (thousand dollars) versus day over the train and test periods, for the risk-free, 1/n, and ρ = 10%, 20%, 40% portfolios]
xt+1 = At xt + Bt ut , yt = Ct xt , t = 1, 2, . . .
I can be written as
minimize k Ãz − b̃k 2
subject to C̃z = d̃
I vector z contains the Tn + (T − 1)m variables:
z = (x1 , . . . , xT , u1 , . . . , uT−1 )
        [ C1   · · ·   0     0      · · ·   0    ]
        [  ⋮            ⋮     ⋮              ⋮    ]
        [ 0    · · ·   CT    0      · · ·   0    ]
  Ã  =  [ 0    · · ·   0     √ρ I   · · ·   0    ] ,   b̃ = 0
        [  ⋮            ⋮     ⋮              ⋮    ]
        [ 0    · · ·   0     0      · · ·   √ρ I ]
I T = 100
[figure: trade-off curve of Joutput versus Jinput , with points marked for ρ = 0.05, 0.2, and 1]
[figures: input ut and output yt versus t for ρ = 0.05, 0.2, and 1]
Linear state feedback control
ut = Kxt , t = 1, 2, . . .
I one choice for K : solve linear quadratic control problem with xdes = 0
I solution ut is a linear function of xinit , hence u1 can be written as
u1 = Kxinit
[figures: input ut and output yt versus t for the optimal and state feedback controllers]
xt+1 = At xt + Bt wt , yt = Ct xt + vt , t = 1, 2, . . .
I xt is state (n-vector)
I yt is measurement (p-vector)
I wt is input or process noise (m-vector)
I vt is measurement noise or measurement residual (p-vector)
I we know At , Bt , Ct , and measurements y1 , . . . , yT
I can be written as
minimize k Ãz − b̃k 2
subject to C̃z = d̃
I vector z contains the Tn + (T − 1)m variables:
z = (x1 , . . . , xT , w1 , . . . , wT−1 )
        [ C1   0    · · ·   0     0     · · ·   0    ]        [ y1 ]
        [ 0    C2   · · ·   0     0     · · ·   0    ]        [ y2 ]
        [  ⋮   ⋮             ⋮     ⋮             ⋮    ]        [  ⋮ ]
  Ã  =  [ 0    0    · · ·   CT    0     · · ·   0    ] ,  b̃ = [ yT ]
        [ 0    0    · · ·   0     √λ I   · · ·   0    ]        [ 0  ]
        [  ⋮   ⋮             ⋮     ⋮             ⋮    ]        [  ⋮ ]
        [ 0    0    · · ·   0     0     · · ·   √λ I ]        [ 0  ]
        [ A1   −I   0    · · ·   0      0    B1   0    · · ·   0    ]
  C̃  =  [ 0     A2  −I   · · ·   0      0    0    B2   · · ·   0    ] ,   d̃ = 0
        [  ⋮    ⋮    ⋮            ⋮      ⋮    ⋮    ⋮            ⋮    ]
        [ 0     0    0   · · ·   AT−1   −I   0    0    · · ·   BT−1 ]
  Jmeas = ∑_{t∈T } kCt xt − yt k 2
        [ 1  0  1  0 ]         [ 0  0 ]
  At  = [ 0  1  0  1 ] ,  Bt = [ 0  0 ] ,  Ct = [ 1  0  0  0 ]
        [ 0  0  1  0 ]         [ 1  0 ]         [ 0  1  0  0 ]
        [ 0  0  0  1 ]         [ 0  1 ]
[figure: estimated trajectory of the position (xt )1 , (xt )2 from t = 1 to t = 100]
I randomly remove 20% (say) of the measurements and use as test set
[figure: train and test RMS error versus λ]
I cross-validation method applied to previous example
I remove 20 of the 100 measurements
I suggests using λ ≈ 103
Examples
Levenberg–Marquardt algorithm
fi (x1 , . . . , xn ) = 0, i = 1, . . . , m
ρi = kx − ai k + vi , i = 1, . . . , m
I we want k f̂ (x; x (k) )k 2 small, but we don’t want to move too far from x (k) ,
where f̂ (x; x (k) ) ≈ f (x) no longer holds
I solution is
  x (k+1) = x (k) − ( Df (x (k) )T Df (x (k) ) + λ (k) I )−1 Df (x (k) )T f (x (k) )
idea:
I if λ (k) is too big, x (k+1) is too close to x (k) , and progress is slow
I if too small, x (k+1) may be far from x (k) and affine approximation is poor
update mechanism:
I if kf (x (k+1) )k 2 < kf (x (k) )k 2 , accept new x and reduce λ
I otherwise, increase λ and do not update x (keep x (k+1) = x (k) )
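a compact sketch of the Levenberg–Marquardt iteration with this update rule (NumPy assumed; the test function is borrowed from a later example in these slides, and the initial λ, the update factors, and the iteration count are placeholder choices):

```python
import numpy as np

def f(x):       # small nonlinear residual function (example from a later slide)
    return np.array([x[0] + np.exp(-x[1]), x[0]**2 + 2*x[1] + 1])

def Df(x):      # its Jacobian
    return np.array([[1.0, -np.exp(-x[1])], [2*x[0], 2.0]])

x, lam = np.array([1.0, 1.0]), 1.0
for _ in range(50):
    J = Df(x)
    step = np.linalg.solve(J.T @ J + lam * np.eye(len(x)), J.T @ f(x))
    x_new = x - step
    if np.linalg.norm(f(x_new))**2 < np.linalg.norm(f(x))**2:
        x, lam = x_new, 0.8 * lam      # accept the step, reduce lambda
    else:
        lam = 2.0 * lam                # reject the step, increase lambda
print(x, np.linalg.norm(f(x))**2)
```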
[figures: contour lines and graph of kf (x)k 2 over the (x1 , x2 ) plane]
Levenberg–Marquardt from three initial points
[figures: Levenberg–Marquardt iterates from three initial points, and kf (x (k) )k 2 and λ (k) versus iteration k]
  minimize   ∑_{i=1}^{N} ( f̂ (x (i) ; θ) − y (i) ) 2
  f̂ (x; θ) = θ 1 exp(θ 2 x) cos(θ 3 x + θ 4 )
  [figure: data points and the fitted model f̂ (x; θ) versus x]
Nonlinear least squares formulation
  minimize   ∑_{i=1}^{N} ( ( f̂ (u (i) ; θ) − y (i) ) 2 + ku (i) − x (i) k 2 )
  [figure: data point (x (i) , y (i) ) at distance di from the graph of f̂ , with di 2 = ( f̂ (u (i) ; θ) − y (i) ) 2 + ku (i) − x (i) k 2 ]
  φ(u) = (e u − e −u )/(e u + e −u )
  [figure: graph of the sigmoid function φ(u) for −4 ≤ u ≤ 4, with values between −1 and 1]
[figure: train and test classification error (%) versus λ]
Feature engineering
minimize kf (x)k 2
subject to g(x) = 0
  ∂L/∂xi (x̂, ẑ) = 0, i = 1, . . . , n,    ∂L/∂zi (x̂, ẑ) = 0, i = 1, . . . , p
(provided the gradients ∇g1 (x̂) , . . . , ∇gp (x̂) are linearly independent)
minimize kAx − bk 2
subject to Cx = d
  [ 2AT A   CT ] [ x̂ ]   [ 2AT b ]
  [ C        0 ] [ ẑ ] = [ d     ]
Penalty method
" f (x) #
2
2 2
minimize kf (x)k + µkg(x)k =
√
µg(x)
I feasibility g(x (k) ) = 0 is only satisfied approximately for µ(k−1) large enough
  f (x1 , x2 ) = ( x1 + exp(−x2 ), x1² + 2x2 + 1 ),   g(x1 , x2 ) = x1 + x1³ + x2 + x2²
[figure: contour lines of kf (x)k 2 (solid), the curves g(x) = −1, 0, 1, and the solution x̂]
[figures: penalty method iterates x (2) , . . . , x (7) with µ (4) = 8, µ (5) = 16, µ (6) = 32; feasibility and optimality residuals and penalty parameter µ versus cumulative Levenberg–Marquardt iterations]
I µ(k) increases rapidly and must become large to drive g(x) to (near) zero
I for large µ(k) , nonlinear least squares subproblem becomes harder
µ(k+1) = µ(k) if kg(x (k+1) )k < 0.25kg(x (k) )k, µ(k+1) = 2µ(k) otherwise
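a hedged sketch of the penalty method loop on the earlier example, combined with the µ update rule quoted above (SciPy assumed; the initial point, inner solver, and iteration count are placeholder choices, and the Lagrange multiplier update of the full augmented Lagrangian method is omitted):

```python
import numpy as np
from scipy.optimize import least_squares

def f(x):
    return np.array([x[0] + np.exp(-x[1]), x[0]**2 + 2*x[1] + 1])

def g(x):
    return np.array([x[0] + x[0]**3 + x[1] + x[1]**2])

x, mu = np.zeros(2), 1.0
for _ in range(20):
    # minimize ||f(x)||^2 + mu ||g(x)||^2 by stacking (f(x), sqrt(mu) g(x))
    stacked = lambda v: np.concatenate([f(v), np.sqrt(mu) * g(v)])
    x_new = least_squares(stacked, x).x
    # mu update: keep mu if feasibility improved enough, else double it
    if np.linalg.norm(g(x_new)) >= 0.25 * np.linalg.norm(g(x)):
        mu = 2.0 * mu
    x = x_new
print(x, g(x))       # g(x) should be close to zero
```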
[figures: iterates x (2) , . . . , x (7) with µ (4) = 4, z (4) = −1.898, µ (5) = 4, z (5) = −1.976, µ (6) = 4, z (6) = −1.994; feasibility and optimality residuals and penalty parameter µ versus cumulative Levenberg–Marquardt iterations]
  dp1 /dt = s(t) cos θ(t)
  dp2 /dt = s(t) sin θ(t)
  dθ/dt = (s(t)/L) tan φ(t)
  [figure: simple model of a car with position (p1 , p2 ), orientation θ, wheelbase L, and steering angle φ]
I move car from given initial to desired final position and orientation
  minimize    ∑_{k=1}^{N} kuk k 2 + γ ∑_{k=1}^{N−1} kuk+1 − uk k 2
  subject to  x2 = f (0, u1 )
              xk+1 = f (xk , uk ), k = 2, . . . , N − 1
              x final = f (xN , uN )
I variables are u1 , . . . , uN , x2 , . . . , xN
[figures: computed car trajectories for several maneuvers, and the corresponding speed and steering angle inputs uk versus k]