Properties of the Trace and Matrix Derivatives

John Duchi
Contents
1 Notation
2 Matrix multiplication
3 Gradient of linear function
4 Derivative in a trace
5 Derivative of product in trace
6 Derivative of function of a matrix
7 Derivative of linear transformed input to function
8 Funky trace derivative
9 Symmetric Matrices and Eigenvectors
1 Notation

A few things on notation (which may not be very consistent, actually): The columns of a matrix $A \in \mathbb{R}^{m \times n}$ are $a_1$ through $a_n$, while the rows are given (as vectors) by $\tilde{a}_1^T$ through $\tilde{a}_m^T$.
2 Matrix multiplication

For a matrix $A \in \mathbb{R}^{n \times n}$, we have that
\[
AA^T = \sum_{i=1}^n a_i a_i^T,
\]
that is, the product $AA^T$ is the sum of the outer products of the columns of $A$. To see this, consider that
\[
(AA^T)_{ij} = \sum_{p=1}^n a_{pi} a_{pj}
\]
because the $i,j$ element is the $i$th row of $A$, which is the vector $\langle a_{1i}, a_{2i}, \ldots, a_{ni} \rangle$, dotted with the $j$th column of $A^T$, which is $\langle a_{1j}, \ldots, a_{nj} \rangle$.

If we look at the full matrix $AA^T$, we see that
\[
AA^T =
\begin{bmatrix}
\sum_{p=1}^n a_{p1} a_{p1} & \cdots & \sum_{p=1}^n a_{p1} a_{pn} \\
\vdots & \ddots & \vdots \\
\sum_{p=1}^n a_{pn} a_{p1} & \cdots & \sum_{p=1}^n a_{pn} a_{pn}
\end{bmatrix}
= \sum_{i=1}^n a_i a_i^T.
\]
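The outer-product identity above is easy to sanity-check numerically. The following sketch (using NumPy, which the notes themselves do not assume) compares the two sides on a random matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))

# Sum of outer products of the columns of A: sum_i a_i a_i^T.
outer_sum = sum(np.outer(A[:, i], A[:, i]) for i in range(A.shape[1]))

# This should match the ordinary matrix product A A^T.
print(np.allclose(A @ A.T, outer_sum))  # True
```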
3 Gradient of linear function

Consider $Ax$, where $A \in \mathbb{R}^{m \times n}$ and $x \in \mathbb{R}^n$. We have
\[
\nabla_x Ax =
\begin{bmatrix} \nabla(\tilde{a}_1^T x) & \nabla(\tilde{a}_2^T x) & \cdots & \nabla(\tilde{a}_m^T x) \end{bmatrix}
=
\begin{bmatrix} \tilde{a}_1 & \tilde{a}_2 & \cdots & \tilde{a}_m \end{bmatrix}
= A^T.
\]
Now consider the quadratic form $x^T A x$ for $A \in \mathbb{R}^{n \times n}$, which we can write as $x^T A x = \sum_{i=1}^n x_i \tilde{a}_i^T x$. If we take the derivative with respect to one of the $x_l$'s, we have the $l$th component for each $\tilde{a}_i^T x$, which is to say $a_{il}$, and the term for $x_l \tilde{a}_l^T x$, which gives us
\[
\frac{\partial}{\partial x_l} x^T A x = \sum_{i=1}^n x_i a_{il} + \tilde{a}_l^T x = a_l^T x + \tilde{a}_l^T x.
\]
In the end, we see that
\[
\nabla_x x^T A x = Ax + A^T x.
\]
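The gradient of the quadratic form can be checked by finite differences. This sketch (NumPy assumed; not part of the original derivation) compares a central-difference gradient against the closed form $Ax + A^T x$:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
A = rng.standard_normal((n, n))
x = rng.standard_normal(n)

f = lambda x: x @ A @ x  # the quadratic form x^T A x

# Central-difference approximation to the gradient.
eps = 1e-6
grad = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps) for e in np.eye(n)])

print(np.allclose(grad, A @ x + A.T @ x, atol=1e-5))  # True
```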
4 Derivative in a trace

Recall (as in "Old and New Matrix Algebra Useful for Statistics") that we can define the differential of a function $f(x)$ to be the part of $f(x + dx) - f(x)$ that is linear in $dx$, i.e. is a constant (that depends on $x$) times $dx$. Then, for example, for a vector-valued function $f$, we can have
\[
f(x + dx) = f(x) + f'(x)\,dx + (\text{higher order terms}).
\]
In the above, $f'$ is the derivative (or Jacobian). Note that the gradient is the transpose of the Jacobian.

Consider an arbitrary matrix $A$. Letting $dx_i$ denote the $i$th column of $dX$, the $i$th diagonal entry of $A\,dX$ is $\tilde{a}_i^T dx_i$, so we see that
\[
\frac{\partial\, \mathrm{tr}(A\,dX)}{\partial\, dX}
= \frac{\partial \sum_{i=1}^n \tilde{a}_i^T dx_i}{\partial\, dX}.
\]
Thus, we have
\[
\left[ \frac{\partial\, \mathrm{tr}(A\,dX)}{\partial\, dX} \right]_{ij}
= \frac{\partial \sum_{k=1}^n \tilde{a}_k^T dx_k}{\partial\, dX_{ij}}
= a_{ji},
\]
so that
\[
\frac{\partial\, \mathrm{tr}(A\,dX)}{\partial\, dX} = A^T.
\]
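Since $dX$ here is just an arbitrary matrix argument, the identity amounts to $\partial\, \mathrm{tr}(AX) / \partial X = A^T$. A quick entrywise finite-difference check (NumPy assumed; an illustration, not part of the notes):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 3
A = rng.standard_normal((n, n))
X = rng.standard_normal((n, n))

# Entrywise central differences of X -> tr(AX).
eps = 1e-6
grad = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        E = np.zeros((n, n))
        E[i, j] = eps
        grad[i, j] = (np.trace(A @ (X + E)) - np.trace(A @ (X - E))) / (2 * eps)

print(np.allclose(grad, A.T, atol=1e-5))  # True
```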
5 Derivative of product in trace

In this section we prove that
\[
\frac{\partial}{\partial A} \mathrm{tr}(AB) = B^T.
\]
To see this, write out the product for $A \in \mathbb{R}^{n \times m}$ and $B \in \mathbb{R}^{m \times n}$:
\[
AB =
\begin{bmatrix} \tilde{a}_1^T \\ \tilde{a}_2^T \\ \vdots \\ \tilde{a}_n^T \end{bmatrix}
\begin{bmatrix} b_1 & b_2 & \cdots & b_n \end{bmatrix}
=
\begin{bmatrix}
\tilde{a}_1^T b_1 & \tilde{a}_1^T b_2 & \cdots & \tilde{a}_1^T b_n \\
\tilde{a}_2^T b_1 & \tilde{a}_2^T b_2 & \cdots & \tilde{a}_2^T b_n \\
\vdots & \vdots & \ddots & \vdots \\
\tilde{a}_n^T b_1 & \tilde{a}_n^T b_2 & \cdots & \tilde{a}_n^T b_n
\end{bmatrix},
\]
so that
\[
\mathrm{tr}(AB)
= \sum_{i=1}^m a_{1i} b_{i1} + \sum_{i=1}^m a_{2i} b_{i2} + \cdots + \sum_{i=1}^m a_{ni} b_{in}.
\]
Thus
\[
\left[ \frac{\partial}{\partial A} \mathrm{tr}(AB) \right]_{ij} = b_{ji},
\quad \text{so that} \quad
\frac{\partial}{\partial A} \mathrm{tr}(AB) = B^T.
\]
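The identity $\partial\, \mathrm{tr}(AB) / \partial A = B^T$ can be verified numerically for rectangular matrices as well; a finite-difference sketch (NumPy assumed, not part of the original notes):

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 3, 4
A = rng.standard_normal((n, m))
B = rng.standard_normal((m, n))

# Entrywise central differences of A -> tr(AB).
eps = 1e-6
grad = np.zeros((n, m))
for i in range(n):
    for j in range(m):
        E = np.zeros((n, m))
        E[i, j] = eps
        grad[i, j] = (np.trace((A + E) @ B) - np.trace((A - E) @ B)) / (2 * eps)

print(np.allclose(grad, B.T, atol=1e-5))  # True
```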
6 Derivative of function of a matrix

Here we prove that
\[
\frac{\partial f(A)}{\partial A^T} = \left( \frac{\partial f(A)}{\partial A} \right)^T.
\]
This follows by writing out the matrix of partial derivatives: the $i,j$ entry of $\partial f(A) / \partial A^T$ is the derivative of $f$ with respect to $(A^T)_{ij} = A_{ji}$, so
\[
\frac{\partial f(A)}{\partial A^T} =
\begin{bmatrix}
\frac{\partial f(A)}{\partial A_{11}} & \frac{\partial f(A)}{\partial A_{21}} & \cdots & \frac{\partial f(A)}{\partial A_{n1}} \\
\frac{\partial f(A)}{\partial A_{12}} & \frac{\partial f(A)}{\partial A_{22}} & \cdots & \frac{\partial f(A)}{\partial A_{n2}} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial f(A)}{\partial A_{1n}} & \frac{\partial f(A)}{\partial A_{2n}} & \cdots & \frac{\partial f(A)}{\partial A_{nn}}
\end{bmatrix}
= \left( \nabla_A f(A) \right)^T.
\]
7 Derivative of linear transformed input to function

Consider a function $f : \mathbb{R}^n \to \mathbb{R}$. Suppose we have a matrix $A \in \mathbb{R}^{n \times m}$ and a vector $x \in \mathbb{R}^m$. We wish to compute $\nabla_x f(Ax)$. By the chain rule, we have
\[
\frac{\partial f(Ax)}{\partial x_i}
= \sum_{k=1}^n \frac{\partial f(Ax)}{\partial (Ax)_k} \cdot \frac{\partial (\tilde{a}_k^T x)}{\partial x_i}
= \sum_{k=1}^n \frac{\partial f(Ax)}{\partial (Ax)_k}\, a_{ki}
= \sum_{k=1}^n \nabla_k f(Ax)\, a_{ki}
= a_i^T \nabla f(Ax).
\]
As such,
\[
\nabla_x f(Ax) = A^T \nabla f(Ax).
\]
Now, if we would like to get the second derivative of this function (third derivatives would be a little nice, but I do not like tensors), we have
\[
\frac{\partial^2 f(Ax)}{\partial x_i \partial x_j}
= \frac{\partial}{\partial x_j}\, a_i^T \nabla f(Ax)
= \frac{\partial}{\partial x_j} \sum_{k=1}^n a_{ki} \frac{\partial f(Ax)}{\partial (Ax)_k}
= \sum_{l=1}^n \sum_{k=1}^n a_{ki} \frac{\partial^2 f(Ax)}{\partial (Ax)_k\, \partial (Ax)_l}\, a_{lj}
= a_i^T \left( \nabla^2 f(Ax) \right) a_j,
\]
so that
\[
\nabla_x^2 f(Ax) = A^T \nabla^2 f(Ax)\, A.
\]
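Both identities, $\nabla_x f(Ax) = A^T \nabla f(Ax)$ and $\nabla_x^2 f(Ax) = A^T \nabla^2 f(Ax) A$, can be checked numerically. In this sketch (NumPy assumed; the particular $f$ is an arbitrary choice, not from the notes) we take $f(y) = \sum_k y_k^3$, whose gradient and Hessian are known in closed form:

```python
import numpy as np

rng = np.random.default_rng(4)
n, m = 4, 3
A = rng.standard_normal((n, m))
x = rng.standard_normal(m)

# A concrete smooth f with known derivatives (an arbitrary choice).
f      = lambda y: np.sum(y ** 3)   # f : R^n -> R
grad_f = lambda y: 3 * y ** 2       # gradient of f
hess_f = lambda y: np.diag(6 * y)   # Hessian of f

g = lambda x: f(A @ x)              # the composed function x -> f(Ax)

# Check the gradient identity by central differences.
eps = 1e-5
I = np.eye(m)
num_grad = np.array([(g(x + eps * e) - g(x - eps * e)) / (2 * eps) for e in I])
print(np.allclose(num_grad, A.T @ grad_f(A @ x), atol=1e-4))  # True

# Check the Hessian identity by second-order central differences.
h = 1e-4
num_hess = np.array([[(g(x + h * (I[i] + I[j])) - g(x + h * (I[i] - I[j]))
                       - g(x - h * (I[i] - I[j])) + g(x - h * (I[i] + I[j]))) / (4 * h * h)
                      for j in range(m)] for i in range(m)])
print(np.allclose(num_hess, A.T @ hess_f(A @ x) @ A, atol=1e-3))  # True
```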
8 Funky trace derivative

In this section, we prove that
\[
\nabla_A \mathrm{tr}(ABA^T C) = CAB + C^T A B^T.
\]
In this bit, let us have $AB = f(A)$, where $f$ is matrix-valued. Using the product rule, where $\circ$ marks the copy of $A$ being differentiated (the other copy held fixed), together with the result of Section 5, the result of Section 6, and the fact that $\mathrm{tr}(XY) = \mathrm{tr}(YX)$:
\[
\begin{aligned}
\nabla_A \mathrm{tr}(ABA^T C)
&= \nabla_A \mathrm{tr}\!\left( f(A)\, A^T C \right) \\
&= \nabla_{\circ}\, \mathrm{tr}\!\left( \circ\, B A^T C \right) + \nabla_{\circ}\, \mathrm{tr}\!\left( f(A)\, \circ^T C \right) \\
&= \left( B A^T C \right)^T + \left( \nabla_{\circ^T}\, \mathrm{tr}\!\left( f(A)\, \circ^T C \right) \right)^T \\
&= C^T A B^T + \left( \nabla_{\circ^T}\, \mathrm{tr}\!\left( \circ^T C f(A) \right) \right)^T \\
&= C^T A B^T + \left( \left( C f(A) \right)^T \right)^T \\
&= C^T A B^T + CAB.
\end{aligned}
\]
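The "funky" identity $\nabla_A \mathrm{tr}(ABA^T C) = CAB + C^T A B^T$ is a good candidate for a numerical sanity check, since the derivation is easy to get wrong by a transpose. A finite-difference sketch (NumPy assumed, not part of the original notes):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 3
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, n))
C = rng.standard_normal((n, n))

t = lambda A: np.trace(A @ B @ A.T @ C)  # the "funky" trace, both copies of A vary

# Entrywise central differences of A -> tr(A B A^T C).
eps = 1e-6
grad = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        E = np.zeros((n, n))
        E[i, j] = eps
        grad[i, j] = (t(A + E) - t(A - E)) / (2 * eps)

print(np.allclose(grad, C @ A @ B + C.T @ A @ B.T, atol=1e-4))  # True
```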
9 Symmetric Matrices and Eigenvectors

In this section we prove that for a symmetric matrix $A \in \mathbb{R}^{n \times n}$, all the eigenvalues are real, and that the eigenvectors of $A$ form an orthonormal basis of $\mathbb{R}^n$.

First, we prove that the eigenvalues are real. Suppose one is complex: letting $Ax = \lambda x$ with $x \neq 0$, and noting that conjugating this equation (since $A$ is real) gives $A\bar{x} = \bar{\lambda}\bar{x}$, we have
\[
\bar{\lambda}\, \bar{x}^T x = (A\bar{x})^T x = \bar{x}^T A^T x = \bar{x}^T A x = \lambda\, \bar{x}^T x.
\]
Since $\bar{x}^T x > 0$, this forces $\bar{\lambda} = \lambda$. Thus, all the eigenvalues are real.

Now, we suppose we have at least one eigenvector $v \neq 0$ of $A$, with $Av = \lambda v$. Consider the space $W$ of vectors orthogonal to $v$. We then have that, for $w \in W$,
\[
(Aw)^T v = w^T A^T v = w^T A v = \lambda\, w^T v = 0.
\]
Thus, we have a set of vectors $W$ that, when transformed by $A$, are still orthogonal to $v$, so if we have an original eigenvector $v$ of $A$, then a simple inductive argument (applying the same reasoning to $A$ restricted to the lower-dimensional space $W$) shows that there is an orthonormal set of eigenvectors.

To see that there is at least one eigenvector, consider the characteristic polynomial of $A$:
\[
\chi_A(\lambda) = \det(A - \lambda I).
\]
The field $\mathbb{C}$ is algebraically closed, so there is at least one complex root $r$; hence $A - rI$ is singular and there is a vector $v \neq 0$ that is an eigenvector of $A$. By the argument above, $r$ is a real eigenvalue, so we have the base case for our induction, and the proof is complete.
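The conclusion of this section, that a real symmetric matrix has a real eigendecomposition $A = V \Lambda V^T$ with orthonormal $V$, is exactly what symmetric eigensolvers compute. A short illustration (NumPy assumed, not part of the original notes):

```python
import numpy as np

rng = np.random.default_rng(6)
M = rng.standard_normal((4, 4))
A = (M + M.T) / 2  # a random symmetric matrix

# np.linalg.eigh is specialized to symmetric (Hermitian) matrices: it
# returns real eigenvalues and an orthonormal matrix of eigenvectors.
vals, V = np.linalg.eigh(A)

print(np.isrealobj(vals))                       # True: eigenvalues are real
print(np.allclose(V.T @ V, np.eye(4)))          # True: eigenvectors are orthonormal
print(np.allclose(A, V @ np.diag(vals) @ V.T))  # True: A = V diag(vals) V^T
```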