
Lecture 28

Playing with Permutations


"I want a clean cup," interrupted the Hatter: "let's all move one place on."
He moved on as he spoke, and the Dormouse followed him:
the March Hare moved into the Dormouse's place,
and Alice rather unwillingly took the place of the March Hare.

- 10/6
Goals: We are going to prove that any sequence of transpositions that corrects the
ordering of a permutation has the same parity. Even though this seems unimportant,
this will be a key step in the derivation of the determinant formula.
28.1 Insight on Invariance
Consider the following problem (SUMaC 2013):
Pennies are placed on an 8 × 8 checkerboard in an alternating pattern of heads and tails.
You are allowed to make moves where in each move you turn over exactly two pennies
that lie next to each other in the same row or column. Can you take a sequence of moves
that leaves just one penny face up?
[Figure: an 8 × 8 checkerboard of pennies in an alternating pattern of heads (H) and tails (T).]
The answer is no. In the initial position, the total number of heads is even. Inductively, on the i-th
move, we can make one of these types of flips:
TH → HT    HT → TH    TT → HH    HH → TT
Regardless of the move, the total number of heads is still even. Therefore a situation with only one
head is impossible.
The moral of the story:
Math Mantra: Look for some FIXED NON-VARYING QUANTITY.
We call such a quantity an invariant. In this case, our invariant was the parity of the number of heads.
Regardless of what moves we made, the parity of heads remained the same (even). You are looking
past the smoke and mirrors and latching onto some known truth.
So how does this apply to Math 51H?
One of the most important quantities in mathematics and engineering is the determinant. It has tons
of applications:

• Calculating volume.
• Changing variables in integration.
• Checking invertibility of a matrix.
• Proving that every natural number can be written as a sum of four squares.¹
To derive the determinant formula, we have to perform a number of swaps. And like the checkerboard
problem, we have an unlimited number of choices for swaps. But it turns out that only the parity
of the swaps matters. This will be the linchpin in our derivation of the determinant.
28.2 Permutations
Back in the day, you were asked the problem:
How many ways can you rearrange
ABC?
¹ My favorite application.
You made a little chart
ABC BAC CAB
ACB BCA CBA
Now that you are older, you realize it is far more kosher to use numbers and n-tuples instead of letters
and concatenation:
(1, 2, 3) (2, 1, 3) (3, 1, 2)
(1, 3, 2) (2, 3, 1) (3, 2, 1).
Generally,
Definition. A permutation of (1, 2, . . . , n) is an n-tuple

$$(i_1, i_2, \ldots, i_n)$$

such that each number between 1 and n appears exactly once:

$$\{i_1, i_2, \ldots, i_n\} = \{1, 2, \ldots, n\}.$$
Now consider the following scenario: you have the 7 Harry Potter Books on a shelf in some order
(4, 2, 1, 5, 6, 3, 7)
and you want to correct the ordering. But you are restricted to only swapping two books at a time.
Is it possible to correct the ordering?
Of course! Just consider the sequence
(4, 2, 1, 5, 6, 3, 7)
(4, 2, 1, 5, 3, 6, 7)
(1, 2, 4, 5, 3, 6, 7)
(1, 2, 3, 5, 4, 6, 7)
(1, 2, 3, 4, 5, 6, 7)
Because we're math people, we like to give swaps a more formal name. We also like to think of a
swap as a function on permutations.
Definition. A transposition $\tau_{j,k}$ is a function that maps permutations to permutations by swapping
the values in the $j$-th and $k$-th positions:

$$\tau_{j,k}(i_1, i_2, \ldots, i_j, \ldots, i_k, \ldots, i_n) = (i_1, i_2, \ldots, i_k, \ldots, i_j, \ldots, i_n).$$
With this definition, we can precisely describe the above sequence:
$$(4, 2, 1, 5, 6, 3, 7) = (4, 2, 1, 5, 6, 3, 7)$$
$$(\tau_{5,6})(4, 2, 1, 5, 6, 3, 7) = (4, 2, 1, 5, 3, 6, 7)$$
$$(\tau_{1,3}\,\tau_{5,6})(4, 2, 1, 5, 6, 3, 7) = (1, 2, 4, 5, 3, 6, 7)$$
$$(\tau_{3,5}\,\tau_{1,3}\,\tau_{5,6})(4, 2, 1, 5, 6, 3, 7) = (1, 2, 3, 5, 4, 6, 7)$$
$$(\tau_{4,5}\,\tau_{3,5}\,\tau_{1,3}\,\tau_{5,6})(4, 2, 1, 5, 6, 3, 7) = (1, 2, 3, 4, 5, 6, 7)$$
Note that we are applying a function, so composition is on the left.
Also notice that our choice of transpositions could have been smarter. We could have been completely
methodical and gone from left to right, correcting one place at a time:
$$(4, 2, 1, 5, 6, 3, 7) = (4, 2, 1, 5, 6, 3, 7)$$
$$(\tau_{1,3})(4, 2, 1, 5, 6, 3, 7) = (1, 2, 4, 5, 6, 3, 7)$$
$$(\tau_{3,6}\,\tau_{1,3})(4, 2, 1, 5, 6, 3, 7) = (1, 2, 3, 5, 6, 4, 7)$$
$$(\tau_{4,6}\,\tau_{3,6}\,\tau_{1,3})(4, 2, 1, 5, 6, 3, 7) = (1, 2, 3, 4, 6, 5, 7)$$
$$(\tau_{5,6}\,\tau_{4,6}\,\tau_{3,6}\,\tau_{1,3})(4, 2, 1, 5, 6, 3, 7) = (1, 2, 3, 4, 5, 6, 7)$$
As a simple exercise on induction we can prove,
Theorem. For any permutation

$$(i_1, i_2, \ldots, i_n)$$

there exists a sequence of transpositions $\tau_1, \tau_2, \ldots, \tau_k$ such that

$$(\tau_k\,\tau_{k-1}\cdots\tau_1)(i_1, i_2, \ldots, i_n) = (1, 2, \ldots, n).$$
28.3 The Trouble with Transposition
But there's a catch. In each case, we used 4 transpositions to restore the ordering to the identity:
$$(\tau_{4,5}\,\tau_{3,5}\,\tau_{1,3}\,\tau_{5,6})(4, 2, 1, 5, 6, 3, 7) = (1, 2, 3, 4, 5, 6, 7)$$
$$(\tau_{5,6}\,\tau_{4,6}\,\tau_{3,6}\,\tau_{1,3})(4, 2, 1, 5, 6, 3, 7) = (1, 2, 3, 4, 5, 6, 7).$$
That's because we were smart. However, we could have used far more than 4:

$$(\tau_{3,4}\,\tau_{1,4}\,\tau_{3,4}\,\tau_{1,3}\,\tau_{5,6}\,\tau_{4,6}\,\tau_{3,6}\,\tau_{1,3})(4, 2, 1, 5, 6, 3, 7) = (1, 2, 3, 4, 5, 6, 7).$$
Even worse, if you had one too many Amaretto Sours, you could have flipped the first two coordinates
a hundred times:

$$(\underbrace{\tau_{1,2}\,\tau_{1,2}\cdots\tau_{1,2}}_{100 \text{ times}}\,\tau_{5,6}\,\tau_{4,6}\,\tau_{3,6}\,\tau_{1,3})(4, 2, 1, 5, 6, 3, 7) = (1, 2, 3, 4, 5, 6, 7).$$
In fact, the number of transpositions used to restore a permutation to the identity ordering can be
arbitrarily large!
Luckily, even though there is no fixed number of transpositions, the parity of the number of transpositions is
always the same. And this will be the key step in deriving the determinant formula.
How are we going to prove this?
We are going to prove that the number of transpositions has the same parity as a
nicer, fixed number.
Namely,
Definition. The number of inversions of a permutation,

$$N(i_1, i_2, \ldots, i_n),$$

is the number of pairs (j, k) where j < k and $i_k < i_j$.
Don't be afraid: it is a very simple interpretation. The values of the identity permutation

$$(1, 2, 3, 4, 5, \ldots, n)$$

increase as you go to the right.
So for any pair, the left number is less than the right number:
[Diagram: the entries 1, 2, 3, 4, 5 with arcs comparing each entry to the entries on its right.]
The number of inversions simply counts how many times this fails, i.e., when an element to the left
is bigger than an element to the right:
[Diagram: the entries $i_1, i_2, i_3, i_4, i_5$ with arcs marking the pairs where a left entry is bigger than a right entry.]
Think of the number of inversions as a way to measure how messed up a permutation is.
Example: Calculate
N(4, 2, 1, 5, 6, 3, 7)
Directly, we see that

$$(4, 2, 1, 5, 6, 3, 7)$$

has 6 inversions, one for each of the out-of-order value pairs

$$(4, 2),\ (4, 1),\ (4, 3),\ (2, 1),\ (5, 3),\ (6, 3).$$
The number of inversions is a very nice number to work with. This is because we can break
it into a sum of smaller calculations by fixing j and defining

$$N_j(i_1, i_2, \ldots, i_n)$$

as the number of indices k where j < k and $i_k < i_j$. Visually, think of this as fixing a leftmost element:
[Diagram: the tuple $i_1, i_2, \ldots, i_j, i_{j+1}, i_{j+2}, \ldots, i_n$ with $i_j$ highlighted.]
and then comparing it to the elements on the right, one at a time:
[Diagram: $i_j$ compared in turn with $i_{j+1}$, then $i_{j+2}$, then $i_{j+3}$, and so on.]
This means

$$N(i_1, i_2, \ldots, i_n) = N_1(i_1, i_2, \ldots, i_n) + N_2(i_1, i_2, \ldots, i_n) + \cdots + N_{n-1}(i_1, i_2, \ldots, i_n),$$
and in our example,

$$N(4, 2, 1, 5, 6, 3, 7) =
\underbrace{N_1(4, 2, 1, 5, 6, 3, 7)}_{3}
+ \underbrace{N_2(4, 2, 1, 5, 6, 3, 7)}_{1}
+ \underbrace{N_3(4, 2, 1, 5, 6, 3, 7)}_{0}
+ \underbrace{N_4(4, 2, 1, 5, 6, 3, 7)}_{1}
+ \underbrace{N_5(4, 2, 1, 5, 6, 3, 7)}_{1}
+ \underbrace{N_6(4, 2, 1, 5, 6, 3, 7)}_{0}
= 6.$$
This decomposition will be the key step in the next proof.
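As a quick numerical check (my own sketch, not from the lecture), here are the inversion count N and its decomposition into the $N_j$'s for the running example; the function names are my own.

```python
def N(perm):
    """Total number of inversions: pairs (j, k) with j < k and perm[k] < perm[j]."""
    return sum(1 for j in range(len(perm))
                 for k in range(j + 1, len(perm)) if perm[k] < perm[j])

def N_j(perm, j):
    """Inversions relative to the j-th coordinate (j is 1-indexed, as in the notes)."""
    return sum(1 for k in range(j, len(perm)) if perm[k] < perm[j - 1])

p = (4, 2, 1, 5, 6, 3, 7)
print(N(p))                                    # 6
print([N_j(p, j) for j in range(1, len(p))])   # [3, 1, 0, 1, 1, 0]
```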
Now, to prove
    The number of inversions has the same parity as the number of transpositions
    needed to correct¹ an ordering,

we need to show that applying a transposition to a permutation changes the parity of the number of
inversions. First consider the case of swapping two consecutive elements:
Lemma.

$$N(\tau_{j,j+1}(i_1, i_2, \ldots, i_n)) \qquad \text{and} \qquad N(i_1, i_2, \ldots, i_n)$$

have opposite parity.
Proof Summary:

• Assume $i_j < i_{j+1}$.
• Split $N(i_1, \ldots, i_{j+1}, i_j, \ldots, i_n)$ into the sum
  $$N_1(i_1, \ldots, i_{j+1}, i_j, \ldots, i_n) + N_2(i_1, \ldots, i_{j+1}, i_j, \ldots, i_n) + \cdots + N_{n-1}(i_1, \ldots, i_{j+1}, i_j, \ldots, i_n).$$
• Observe that for $s \neq j, j+1$,
  $$N_s(i_1, \ldots, i_{j+1}, i_j, \ldots, i_n) = N_s(i_1, \ldots, i_j, i_{j+1}, \ldots, i_n)$$
  and
  $$N_j(i_1, \ldots, i_{j+1}, i_j, \ldots, i_n) = N_{j+1}(i_1, \ldots, i_j, i_{j+1}, \ldots, i_n) + 1,$$
  $$N_{j+1}(i_1, \ldots, i_{j+1}, i_j, \ldots, i_n) = N_j(i_1, \ldots, i_j, i_{j+1}, \ldots, i_n).$$
• Replace each term in the summation and recombine to get $N(i_1, \ldots, i_j, i_{j+1}, \ldots, i_n) + 1$.
Proof: Since the argument is virtually the same in the other case, assume $i_j < i_{j+1}$. Notice that

$$\tau_{j,j+1}(i_1, \ldots, i_j, i_{j+1}, \ldots, i_n) = (i_1, i_2, \ldots, i_{j+1}, i_j, \ldots, i_n),$$

where $i_{j+1}$ is now in the $j$-th position and $i_j$ is now in the $(j+1)$-th position:

[Diagram: the tuple $i_1, i_2, \ldots, i_{j+1}, i_j, \ldots, i_{n-1}, i_n$ with positions $1, 2, \ldots, j, j+1, \ldots, n-1, n$ labeled underneath.]

¹ To correct means to return to the identity ordering. We think of the identity ordering as the correct one.
The key is to rewrite $N(i_1, \ldots, i_{j+1}, i_j, \ldots, i_n)$ in terms of $N(i_1, \ldots, i_j, i_{j+1}, \ldots, i_n)$. To do this, we'll
rewrite each term of the decomposition

$$N(i_1, \ldots, i_{j+1}, i_j, \ldots, i_n) = N_1(i_1, \ldots, i_{j+1}, i_j, \ldots, i_n) + N_2(i_1, \ldots, i_{j+1}, i_j, \ldots, i_n) + \cdots + N_{n-1}(i_1, \ldots, i_{j+1}, i_j, \ldots, i_n)$$

in terms of the quantities $N_s(i_1, \ldots, i_j, i_{j+1}, \ldots, i_n)$.
Consider an arbitrary position number s. For s < j, we must have

$$N_s(i_1, \ldots, i_{j+1}, i_j, \ldots, i_n) = N_s(i_1, \ldots, i_j, i_{j+1}, \ldots, i_n).$$
This is because nothing changed: recall that $N_s(i_1, \ldots, i_{j+1}, i_j, \ldots, i_n)$ compares the s-th element
to all the elements to the right:
[Diagram: $i_s$ compared with everything to its right in $i_s, i_{s+1}, \ldots, i_{j+1}, i_j, \ldots, i_{n-1}, i_n$.]
But in $N_s(i_1, \ldots, i_j, i_{j+1}, \ldots, i_n)$, you are still making the same comparisons:
[Diagram: $i_s$ compared with everything to its right in $i_s, i_{s+1}, \ldots, i_j, i_{j+1}, \ldots, i_{n-1}, i_n$.]

All the elements on the right are the same. Order of comparison doesn't matter!
Likewise, it follows that

$$N_s(i_1, \ldots, i_{j+1}, i_j, \ldots, i_n) = N_s(i_1, \ldots, i_j, i_{j+1}, \ldots, i_n)$$

for s > j + 1, so we only need to consider s = j or s = j + 1.

Case s = j:
When we compute $N_j(i_1, \ldots, i_{j+1}, i_j, \ldots, i_n)$, we get one inversion from comparing the first pair:
[Diagram: the pair $i_{j+1}, i_j$ contributes +1, followed by $i_{j+2}, i_{j+3}, i_{j+4}, \ldots$]
The remaining portion counts the number of inversions by comparing $i_{j+1}$ to $i_{j+2}, i_{j+3}, \ldots, i_n$:

[Diagram: $i_{j+1}$ compared with $i_{j+2}, i_{j+3}, i_{j+4}, \ldots$, skipping over $i_j$.]
But by definition, this is exactly the same as $N_{j+1}(i_1, \ldots, i_j, i_{j+1}, \ldots, i_n)$! Thus,

$$N_j(i_1, \ldots, i_{j+1}, i_j, \ldots, i_n) = N_{j+1}(i_1, \ldots, i_j, i_{j+1}, \ldots, i_n) + 1.$$
Case s = j + 1:

When we compute $N_{j+1}(i_1, \ldots, i_{j+1}, i_j, \ldots, i_n)$, we are looking at:

[Diagram: $i_j$ compared with $i_{j+2}, i_{j+3}, i_{j+4}, \ldots$]

But we assumed $i_j < i_{j+1}$, so adding an extra comparison with $i_{j+1}$ doesn't change the number
of inversions:

[Diagram: $i_j$ compared with $i_{j+1}, i_{j+2}, i_{j+3}, i_{j+4}, \ldots$]
This is exactly $N_j(i_1, \ldots, i_j, i_{j+1}, \ldots, i_n)$! Thus,

$$N_{j+1}(i_1, \ldots, i_{j+1}, i_j, \ldots, i_n) = N_j(i_1, \ldots, i_j, i_{j+1}, \ldots, i_n).$$
Now we can rewrite each term in our decomposition:
$$N(\ldots, i_{j+1}, i_j, \ldots) =
\underbrace{N_1(\ldots, i_{j+1}, i_j, \ldots)}_{N_1(\ldots, i_j, i_{j+1}, \ldots)}
+ \underbrace{N_2(\ldots, i_{j+1}, i_j, \ldots)}_{N_2(\ldots, i_j, i_{j+1}, \ldots)}
+ \cdots
+ \underbrace{N_j(\ldots, i_{j+1}, i_j, \ldots)}_{N_{j+1}(\ldots, i_j, i_{j+1}, \ldots) + 1}
+ \underbrace{N_{j+1}(\ldots, i_{j+1}, i_j, \ldots)}_{N_j(\ldots, i_j, i_{j+1}, \ldots)}
+ \cdots
+ \underbrace{N_{n-1}(\ldots, i_{j+1}, i_j, \ldots)}_{N_{n-1}(\ldots, i_j, i_{j+1}, \ldots)}
= N(\ldots, i_j, i_{j+1}, \ldots) + 1.$$
Or in other words,
$$N(i_1, \ldots, i_{j+1}, i_j, \ldots, i_n) \quad \text{and} \quad N(i_1, \ldots, i_j, i_{j+1}, \ldots, i_n) \text{ have opposite parity.}$$
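A brute-force sanity check of this lemma (my own sketch, reusing the N function from the earlier sketch): over all permutations of (1, . . . , 5), swapping any two consecutive entries always flips the parity of the inversion count.

```python
import itertools

def N(perm):
    return sum(1 for a in range(len(perm))
                 for b in range(a + 1, len(perm)) if perm[b] < perm[a])

for perm in itertools.permutations(range(1, 6)):
    for j in range(len(perm) - 1):
        swapped = list(perm)
        swapped[j], swapped[j + 1] = swapped[j + 1], swapped[j]
        assert (N(perm) + N(swapped)) % 2 == 1   # opposite parity
print("every adjacent swap flips the parity of N")
```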
Now we extend this result to work for any transposition. We will do this by rewriting any transposition
as a composition of transpositions between consecutive coordinates.
Lemma.

$$N(\tau_{j,k}(i_1, i_2, \ldots, i_n)) \qquad \text{and} \qquad N(i_1, i_2, \ldots, i_n)$$

have opposite parity.
Proof Summary:

• WLOG, assume j < k.
• Apply $k - j$ consecutive transpositions to move $i_j$ to position k (so it ends up to the right of $i_k$).
• Apply $k - j - 1$ consecutive transpositions to move $i_k$ to position j.
• Apply the preceding lemma to each of the total of $2(k - j) - 1$ consecutive transpositions.
Proof: Without loss of generality, we may assume j < k (for otherwise, we can just relabel the
transposition as $\tau_{k,j}$, since switching j and k is the same as switching k and j). Applying only
consecutive transpositions, we want to take

$$i_j,\ i_{j+1},\ i_{j+2},\ \ldots,\ i_{k-1},\ i_k$$

and swap $i_j$ and $i_k$:

$$i_k,\ i_{j+1},\ i_{j+2},\ \ldots,\ i_{k-1},\ i_j.$$

First, apply consecutive transpositions to move $i_j$ to the right:
$$i_j, i_{j+1}, i_{j+2}, \ldots, i_{k-1}, i_k
\;\longrightarrow\;
i_{j+1}, i_j, i_{j+2}, \ldots, i_{k-1}, i_k
\;\longrightarrow\;
i_{j+1}, i_{j+2}, i_j, \ldots, i_{k-1}, i_k
\;\longrightarrow\;
\cdots
\;\longrightarrow\;
i_{j+1}, i_{j+2}, \ldots, i_{k-1}, i_j, i_k.$$

After applying the transpositions

$$\underbrace{\tau_{k-1,k}\,\tau_{k-2,k-1}\cdots\tau_{j+1,j+2}\,\tau_{j,j+1}}_{k-j \text{ transpositions}}$$

we have

$$i_{j+1}, i_{j+2}, \ldots, i_{k-1}, i_k, i_j.$$
Now, starting with $i_k$ directly to the left of $i_j$, we use consecutive transpositions to shift $i_k$ to the j-th position:

$$i_{j+1}, i_{j+2}, \ldots, i_{k-2}, i_{k-1}, i_k, i_j
\;\longrightarrow\;
i_{j+1}, i_{j+2}, \ldots, i_{k-2}, i_k, i_{k-1}, i_j
\;\longrightarrow\;
\cdots
\;\longrightarrow\;
i_{j+1}, i_k, i_{j+2}, \ldots, i_{k-1}, i_j
\;\longrightarrow\;
i_k, i_{j+1}, i_{j+2}, \ldots, i_{k-1}, i_j.$$

So after applying the additional transpositions

$$\underbrace{\tau_{j,j+1}\,\tau_{j+1,j+2}\cdots\tau_{k-3,k-2}\,\tau_{k-2,k-1}}_{k-j-1 \text{ transpositions}}$$

we have the desired result of the transposition:

$$i_k, i_{j+1}, i_{j+2}, \ldots, i_{k-1}, i_j.$$
To swap positions j and k, we used a total of

$$\underbrace{k-j}_{\text{moving right}} + \underbrace{k-j-1}_{\text{moving left}} = 2(k-j)-1$$

transpositions. Applying the previous lemma to each of these $2(k-j)-1$ consecutive transpositions, the parity
changes an odd number of times (namely, $2(k-j)-1$ times), so in the end the parity changes. We
conclude that

$$N(\tau_{j,k}(i_1, i_2, \ldots, i_n)) \quad \text{and} \quad N(i_1, i_2, \ldots, i_n) \text{ have opposite parity.}$$
Finally, we can prove that any two sequences of transpositions that restore the normal ordering always
have the same parity:
Theorem. Let $(i_1, \ldots, i_n)$ be a permutation. For any two sequences of transpositions

$$\tau_1, \tau_2, \ldots, \tau_q \qquad \text{and} \qquad t_1, t_2, \ldots, t_r$$

where

$$(\tau_q\,\tau_{q-1}\cdots\tau_1)(i_1, i_2, \ldots, i_n) = (1, 2, \ldots, n)$$
$$(t_r\,t_{r-1}\cdots t_1)(i_1, i_2, \ldots, i_n) = (1, 2, \ldots, n),$$

q and r have the same parity.
Proof Summary:

• Apply inverse transpositions to get
  $$(i_1, i_2, \ldots, i_n) = (\tau_1\,\tau_2\cdots\tau_q)(1, 2, \ldots, n)$$
  $$(i_1, i_2, \ldots, i_n) = (t_1\,t_2\cdots t_r)(1, 2, \ldots, n).$$
• Inductively apply the preceding theorem.
Proof: First notice that any transposition is its own inverse:

$$(\tau_{i,j}\,\tau_{i,j})(i_1, i_2, \ldots, i_n) = (i_1, i_2, \ldots, i_n).$$
From

$$(\tau_q\,\tau_{q-1}\cdots\tau_1)(i_1, i_2, \ldots, i_n) = (1, 2, \ldots, n)$$
$$(t_r\,t_{r-1}\cdots t_1)(i_1, i_2, \ldots, i_n) = (1, 2, \ldots, n)$$
we can apply a series of inverse transpositions to both sides,

$$(\tau_1\,\tau_2\cdots\tau_{q-1}\,\tau_q\,\tau_q\,\tau_{q-1}\cdots\tau_2\,\tau_1)(i_1, i_2, \ldots, i_n) = (\tau_1\,\tau_2\cdots\tau_q)(1, 2, \ldots, n)$$
$$(t_1\,t_2\cdots t_{r-1}\,t_r\,t_r\,t_{r-1}\cdots t_2\,t_1)(i_1, i_2, \ldots, i_n) = (t_1\,t_2\cdots t_r)(1, 2, \ldots, n),$$
to get

$$(i_1, i_2, \ldots, i_n) = (\tau_1\,\tau_2\cdots\tau_q)(1, 2, \ldots, n)$$
$$(i_1, i_2, \ldots, i_n) = (t_1\,t_2\cdots t_r)(1, 2, \ldots, n).$$
Since

$$N(1, 2, \ldots, n) = 0,$$

we can apply the previous theorem q times on

$$N(i_1, i_2, \ldots, i_n) = N\big((\tau_1\,\tau_2\cdots\tau_q)(1, 2, \ldots, n)\big)$$

to get that $N(i_1, i_2, \ldots, i_n)$ has the same parity as q.¹
Likewise, we can apply the theorem r times on

$$N(i_1, i_2, \ldots, i_n) = N\big((t_1\,t_2\cdots t_r)(1, 2, \ldots, n)\big)$$

to get that $N(i_1, i_2, \ldots, i_n)$ has the same parity as r. Therefore, q and r have the same parity.
New Notation

Symbol: $(i_1, i_2, \ldots, i_n)$
Reading: the permutation $(i_1, i_2, \ldots, i_n)$
Example: $(3, 2, 1)$ — "The permutation (3, 2, 1)."

Symbol: $\tau_{j,k}$
Reading: the transposition that swaps the coordinates j and k
Example: $\tau_{1,2}(3, 2, 1)$ — "The transposition that swaps coordinates 1 and 2, applied to the permutation (3, 2, 1)."

Symbol: $N(i_1, i_2, \ldots, i_n)$
Reading: the number of inversions of the permutation $(i_1, i_2, \ldots, i_n)$
Example: $N(3, 2, 1) = 3$ — "The number of inversions of (3, 2, 1) is 3."

Symbol: $N_j(i_1, i_2, \ldots, i_n)$
Reading: the number of inversions, relative to the j-th coordinate, of the permutation $(i_1, i_2, \ldots, i_n)$
Example: $N_2(3, 2, 1) = 1$ — "The number of inversions, relative to the 2nd coordinate, of (3, 2, 1) is 1."
¹ To see this, note that starting from 0, the parity is switched q times. If q is even, $N(i_1, i_2, \ldots, i_n)$ will have the
same parity as 0, i.e. even. If q is odd, then $N(i_1, i_2, \ldots, i_n)$ will have the opposite parity from 0, i.e. odd.
Lecture 29
Determining Determinant
ONE function to determine them all.
- Lord of Z, R, Q, and C.
Goals: Today, we build the determinant from a multilinear function D. Remarkably, we
can prove that this D is the UNIQUE multilinear function that outputs 1 at the ordered
standard basis and switches sign whenever two inputs are swapped. Finally, we prove
column and row reduction properties to efficiently compute the determinant.
29.1 The Magic of Multilinearity
Unless you slept through the first seven weeks of Math 51H, you've probably realized
Math Mantra: LINEARITY IS AWESOME!
We've used this property a gazillion times. Namely when

• Distributing an integral across a sum of functions.
• Representing a function as a matrix multiplication.
• Calculating the directional derivative.
But why restrict this awesomeness to a function with a single input? Why not consider a function
that has more than one input,

$$f(x_1, x_2),$$

and make it linear with respect to each of these inputs? That means scaling any one input scales the
output by the same factor,

$$f(c\,x_1, x_2) = f(x_1, c\,x_2) = c\,f(x_1, x_2),$$

and if we have a sum in any component, we fix the other components and perform normal linearity:

$$f(a + b, x_2) = f(a, x_2) + f(b, x_2)$$
$$f(x_1, a + b) = f(x_1, a) + f(x_1, b).$$
As with linear maps, there are lots of examples of multilinear maps. For example, consider a function
that inputs three variables and returns their product:

$$f(x, y, z) = xyz.$$

But let's consider only multilinear functions that input n vectors in $\mathbb{R}^n$ and output a real number:

$$f: \underbrace{\mathbb{R}^n \times \mathbb{R}^n \times \cdots \times \mathbb{R}^n}_{n \text{ inputs}} \to \mathbb{R}.$$
Suppose we add two seemingly innocuous conditions:

• If we swap any two inputs, the value of our function is negated:
  $$f(x_1, \ldots, x_i, \ldots, x_j, \ldots, x_n) = -f(x_1, \ldots, x_j, \ldots, x_i, \ldots, x_n)$$
• If we input the ordered standard basis vectors, the function returns 1:
  $$f(e_1, e_2, \ldots, e_n) = 1$$

The remarkable fact is that there is one and only one function that satisfies this! And we will
call¹ this almighty function D.
29.2 Uniqueness of D
How do we prove that there is exactly one function that satisfies the aforementioned properties? First,
we have to prove such a satisfying function actually exists. To do this, we normally

• prove that the function exists, and then
• using the fact that it exists, derive properties on what it must look like.

We've used this trick a million times, especially in the calculation of limits. Most recently, in Lecture
24, this was the key step in deriving the formula for the arc-length of a curve.
But that's not going to cut it here. We have to cheat. We are going to assume D exists to derive its
formula. Generally,
Math Mantra: To deduce an object exists, we first ASSUME it actually does
exist. Then, by exploiting its properties, we derive an explicit formula for
that object. We then VERIFY that this explicit formula satisfies all the
properties we need.
¹ When we apply D to the specific case of matrix columns, then we call it the determinant.
This was the strategy in the proof of the Cauchy-Schwarz inequality. We also exploited this thinking with
Lagrange multipliers: we found necessary conditions on what a maximum must look like. Then, we
formed a guess and verified that our guess was indeed a maximum.
Once we assume D exists, the proof is easy (though there is quite a bit of notation). We are going to
do the same trick we used when we proved that any linear function from $\mathbb{R}^n$ to $\mathbb{R}$ can be written as
a matrix multiplication. Rewrite x in terms of standard basis vectors and apply linearity!
Before we proceed to the proof, we prove an extremely easy lemma.
Lemma. Let $D: \underbrace{\mathbb{R}^n \times \mathbb{R}^n \times \cdots \times \mathbb{R}^n}_{n \text{ inputs}} \to \mathbb{R}$ and suppose D changes sign whenever we interchange any two inputs:

$$D(x_1, \ldots, x_j, \ldots, x_k, \ldots, x_n) = -D(x_1, \ldots, x_k, \ldots, x_j, \ldots, x_n).$$

Then D must evaluate to 0 if two of its inputs are the same:

$$D(x_1, \ldots, a, \ldots, a, \ldots, x_n) = 0.$$
Proof: Suppose that the inputs in the j-th and k-th coordinates are equal:

$$D(x_1, \ldots, \underbrace{a}_{j\text{-th}}, \ldots, \underbrace{a}_{k\text{-th}}, \ldots, x_n).$$

Interchanging the j-th and k-th components, we have

$$D(x_1, \ldots, a, \ldots, a, \ldots, x_n) = -D(x_1, \ldots, a, \ldots, a, \ldots, x_n),$$

implying

$$D(x_1, \ldots, a, \ldots, a, \ldots, x_n) = 0.$$
Even though this is an incredibly easy lemma, do not underestimate it! When we try to derive the
determinant formula, we will use this lemma to kill unnecessary terms.
Theorem. If there exists some function $D: \underbrace{\mathbb{R}^n \times \mathbb{R}^n \times \cdots \times \mathbb{R}^n}_{n \text{ inputs}} \to \mathbb{R}$ that satisfies the
following properties:

• D is linear in each component: for any j,
  $$D(x_1, \ldots, \underbrace{c\vec{a} + \vec{b}}_{j\text{-th component}}, \ldots, x_n) = c\,D(x_1, \ldots, \underbrace{\vec{a}}_{j\text{-th}}, \ldots, x_n) + D(x_1, \ldots, \underbrace{\vec{b}}_{j\text{-th}}, \ldots, x_n)$$
• D changes sign whenever we interchange any two inputs: for any $j \neq k$,
  $$D(x_1, \ldots, x_j, \ldots, x_k, \ldots, x_n) = -D(x_1, \ldots, x_k, \ldots, x_j, \ldots, x_n).$$
• D evaluates to 1 on the ordered standard basis vectors:
  $$D(e_1, e_2, \ldots, e_n) = 1.$$

Then D must satisfy

$$D(x_1, x_2, \ldots, x_n) = \sum_{\substack{\text{permutation } (i_1, i_2, \ldots, i_n) \\ \text{of } (1, 2, \ldots, n)}} (-1)^{N(i_1, i_2, \ldots, i_n)}\, x_{i_1 1}\, x_{i_2 2} \cdots x_{i_n n}.$$
Proof Summary:

• Assume such a D exists.
• Write each input in terms of standard basis vectors. Be sure to index the sums.
• Apply linearity on each component and condense the summations into one.
• By the preceding lemma, any term over non-distinct indices is 0. The summation simplifies to a sum over permutations of (1, 2, . . . , n).
• In each term, apply the swapping property on $D(e_{i_1}, e_{i_2}, \ldots, e_{i_n})$ to restore it to $\underbrace{D(e_1, e_2, \ldots, e_n)}_{=1}$. From the last lecture, the number of times you apply the swapping property has the same parity as the number of inversions. This introduces a $(-1)^{N(i_1, i_2, \ldots, i_n)}$ factor in each term.
Proof: Assume that there exists a function D with the desired properties:

$$D(x_1, x_2, \ldots, x_n).$$

Now we rewrite each input in terms of the standard basis vectors. So a first attempt would be to
expand as

$$D\Big(\sum_{i=1}^{n} x_{i1}\,e_i,\ \sum_{i=1}^{n} x_{i2}\,e_i,\ \ldots,\ \sum_{i=1}^{n} x_{in}\,e_i\Big).$$
But we have n summations! Reusing the same dummy variable is completely boneheaded!
To keep everything straight, we decide to index the indexing terms:

$$D\Big(\sum_{i_1=1}^{n} x_{i_1 1}\,e_{i_1},\ \sum_{i_2=1}^{n} x_{i_2 2}\,e_{i_2},\ \ldots,\ \sum_{i_n=1}^{n} x_{i_n n}\,e_{i_n}\Big).$$
Now, we use repeated applications of linearity to pull out the first component's summation:

$$\sum_{i_1=1}^{n} x_{i_1 1}\, D\Big(e_{i_1},\ \sum_{i_2=1}^{n} x_{i_2 2}\,e_{i_2},\ \ldots,\ \sum_{i_n=1}^{n} x_{i_n n}\,e_{i_n}\Big).$$
Then, we can do the same trick to pull out the summation from the second component:

$$\sum_{i_1=1}^{n} \sum_{i_2=1}^{n} x_{i_1 1}\, x_{i_2 2}\, D\Big(e_{i_1},\ e_{i_2},\ \ldots,\ \sum_{i_n=1}^{n} x_{i_n n}\,e_{i_n}\Big).$$
Inductively, after going through all components, we have

$$\sum_{i_1=1}^{n} \sum_{i_2=1}^{n} \cdots \sum_{i_n=1}^{n} x_{i_1 1}\, x_{i_2 2} \cdots x_{i_n n}\, D(e_{i_1}, e_{i_2}, \ldots, e_{i_n}). \tag{$*$}$$

Chances are, this is the first time you've seen n nested sums. But it is just like a double summation:
a summation over a summation, except you are doing this n times (giving us a total of $n^n$ terms)!
I know this is scary:¹ you are summing over sums of sums of sums, etc., etc. And because we have so
many indexing terms, we need to introduce ugly subscripts. But just to put your mind at ease, let's
write out the expansion for the first two components:
Expanding the first component of

$$D\Big(\sum_{i_1=1}^{n} x_{i_1 1}\,e_{i_1},\ \sum_{i_2=1}^{n} x_{i_2 2}\,e_{i_2},\ \ldots,\ \sum_{i_n=1}^{n} x_{i_n n}\,e_{i_n}\Big),$$

we have

$$D\Big(x_{11}\,e_1 + x_{21}\,e_2 + \cdots + x_{n1}\,e_n,\ \sum_{i_2=1}^{n} x_{i_2 2}\,e_{i_2},\ \ldots,\ \sum_{i_n=1}^{n} x_{i_n n}\,e_{i_n}\Big).$$
By repeated application of linearity on the first component, this becomes

$$x_{11}\, D\Big(e_1,\ \sum_{i_2=1}^{n} x_{i_2 2}\,e_{i_2},\ \ldots\Big)
+ x_{21}\, D\Big(e_2,\ \sum_{i_2=1}^{n} x_{i_2 2}\,e_{i_2},\ \ldots\Big)
+ \cdots
+ x_{n1}\, D\Big(e_n,\ \sum_{i_2=1}^{n} x_{i_2 2}\,e_{i_2},\ \ldots\Big).$$
But we can expand each of these terms. Consider the $i_1$-th term:

$$x_{i_1 1}\, D\Big(e_{i_1},\ \sum_{i_2=1}^{n} x_{i_2 2}\,e_{i_2},\ \ldots,\ \sum_{i_n=1}^{n} x_{i_n n}\,e_{i_n}\Big).$$
¹ It's Inception all over again.
Now we apply repeated linearity on the second component of this term to get

$$x_{i_1 1}\, x_{12}\, D\Big(e_{i_1}, e_1, \sum_{i_3=1}^{n} x_{i_3 3}\,e_{i_3}, \ldots\Big)
+ x_{i_1 1}\, x_{22}\, D\Big(e_{i_1}, e_2, \sum_{i_3=1}^{n} x_{i_3 3}\,e_{i_3}, \ldots\Big)
+ \cdots
+ x_{i_1 1}\, x_{n2}\, D\Big(e_{i_1}, e_n, \sum_{i_3=1}^{n} x_{i_3 3}\,e_{i_3}, \ldots\Big).$$
Then you can play the same game by expanding the $i_2$-th term of this sum:

$$x_{i_1 1}\, x_{i_2 2}\, D\Big(e_{i_1}, e_{i_2}, \sum_{i_3=1}^{n} x_{i_3 3}\,e_{i_3}, \sum_{i_4=1}^{n} x_{i_4 4}\,e_{i_4}, \ldots, \sum_{i_n=1}^{n} x_{i_n n}\,e_{i_n}\Big).$$

If you are still unsure, I recommend practicing with smaller cases.
Returning to ($*$), we try to make the notation easier on the eyes by rewriting it with a single sum symbol:

$$\sum_{i_1, i_2, \ldots, i_n = 1}^{n} x_{i_1 1}\, x_{i_2 2} \cdots x_{i_n n}\, D(e_{i_1}, e_{i_2}, \ldots, e_{i_n}).$$
By the preceding lemma, we know that whenever two components of D are equal, the term is 0.
Therefore, we only need to consider terms where the indices are all distinct:

$$\sum_{\substack{i_1, i_2, \ldots, i_n = 1 \\ i_1, i_2, \ldots, i_n \text{ distinct}}}^{n} x_{i_1 1}\, x_{i_2 2} \cdots x_{i_n n}\, D(e_{i_1}, e_{i_2}, \ldots, e_{i_n}).$$
Notice that

$$i_1, i_2, \ldots, i_n$$

are distinct numbers from 1 to n. So we are really looking at all permutations of (1, 2, . . . , n):

$$\sum_{\substack{\text{permutation } (i_1, i_2, \ldots, i_n) \\ \text{of } (1, 2, \ldots, n)}} x_{i_1 1}\, x_{i_2 2} \cdots x_{i_n n}\, D(e_{i_1}, e_{i_2}, \ldots, e_{i_n}).$$
Next, use our swapping property to rearrange

$$D(e_{i_1}, e_{i_2}, \ldots, e_{i_n})$$

into

$$D(e_1, e_2, \ldots, e_n).$$

Each swap multiplies D by $-1$. Moreover, each swap is a transposition; thus, the power of $-1$ is the
number of transpositions that rearrange

$$(i_1, i_2, \ldots, i_n)$$

into

$$(1, 2, \ldots, n).$$
For each specific term

$$x_{i_1 1}\, x_{i_2 2} \cdots x_{i_n n}\, D(e_{i_1}, e_{i_2}, \ldots, e_{i_n})$$

we can find transpositions $\tau_1, \tau_2, \ldots, \tau_q$ to rewrite this term as

$$(-1)^q\, x_{i_1 1}\, x_{i_2 2} \cdots x_{i_n n}\, \underbrace{D(e_1, e_2, \ldots, e_n)}_{=1}.$$
Now stare at q: the only thing that matters is its parity. And we proved last lecture that the number
of transpositions has the same parity as the number of inversions! So this term is just

$$(-1)^{N(i_1, i_2, \ldots, i_n)}\, x_{i_1 1}\, x_{i_2 2} \cdots x_{i_n n}$$

and our summation is

$$\sum_{\substack{\text{permutation } (i_1, i_2, \ldots, i_n) \\ \text{of } (1, 2, \ldots, n)}} (-1)^{N(i_1, i_2, \ldots, i_n)}\, x_{i_1 1}\, x_{i_2 2} \cdots x_{i_n n}.$$
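A direct, brute-force implementation of the formula just derived (my own sketch, O(n!) and only meant to illustrate the formula, not to be used for real computation):

```python
from itertools import permutations

def N(perm):
    """Number of inversions of a permutation given as a tuple of 0-indexed values."""
    return sum(1 for a in range(len(perm))
                 for b in range(a + 1, len(perm)) if perm[b] < perm[a])

def D(columns):
    """columns[j][i] is x_{(i+1)(j+1)}: the i-th entry of the (j+1)-th input vector."""
    n = len(columns)
    total = 0
    for perm in permutations(range(n)):       # perm[j] plays the role of i_{j+1} - 1
        term = (-1) ** N(perm)
        for j in range(n):
            term *= columns[j][perm[j]]       # the factor x_{i_{j+1}, j+1}
        total += term
    return total

# Columns (1, 3) and (2, 4) of the matrix [[1, 2], [3, 4]] give 1*4 - 2*3 = -2.
print(D([[1, 3], [2, 4]]))    # -2
```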
There is a very crucial philosophical step in this proof that you must mull over. Particularly, think
about the step in which you found transpositions $\tau_1, \tau_2, \ldots, \tau_q$ to rewrite

$$x_{i_1 1}\, x_{i_2 2} \cdots x_{i_n n}\, D(e_{i_1}, e_{i_2}, \ldots, e_{i_n})$$

as

$$(-1)^q\, x_{i_1 1}\, x_{i_2 2} \cdots x_{i_n n}\, D(e_1, e_2, \ldots, e_n).$$

Suppose the parity of q were not unique, and that we could find both odd and even sequences of transpositions
that correct the ordering of

$$(i_1, i_2, \ldots, i_n).$$

Then the very same term would equal both

$$+\,x_{i_1 1}\, x_{i_2 2} \cdots x_{i_n n}$$

and

$$-\,x_{i_1 1}\, x_{i_2 2} \cdots x_{i_n n}.$$

FAIL!
To avoid catastrophes like this, we must make sure that our operations are well-defined.
What does it mean for a function to be well-defined? Suppose you have multiple ways to represent
an input. A function is well-defined if it returns the same output independent of how you choose to
represent an input.
For example, the following function is ill-defined:

    For an integer x, f(x) = s where x = s·t.

For x = 20, we can represent

$$x = 5 \cdot 4,$$

which implies f(20) = 5. But we could have also written x as

$$x = 2 \cdot 10,$$

giving us f(20) = 2.
You will see the issue of well-definedness again in Math 120 when you talk about quotient groups
and in Math 171 when you define the Lebesgue integral. For now, keep in mind that
Now that we've proven

    IF D exists, then it must look like
    $$\sum_{\substack{\text{permutation } (i_1, i_2, \ldots, i_n) \\ \text{of } (1, 2, \ldots, n)}} (-1)^{N(i_1, i_2, \ldots, i_n)}\, x_{i_1 1}\, x_{i_2 2} \cdots x_{i_n n},$$

we must verify that this expression satisfies the three properties of D.
Theorem.

$$D(x_1, x_2, \ldots, x_n) = \sum_{\substack{\text{permutation } (i_1, i_2, \ldots, i_n) \\ \text{of } (1, 2, \ldots, n)}} (-1)^{N(i_1, i_2, \ldots, i_n)}\, x_{i_1 1}\, x_{i_2 2} \cdots x_{i_n n}$$

satisfies the following properties:

• D is linear in each component: for any j,
  $$D(x_1, \ldots, \underbrace{c\vec{a} + \vec{b}}_{j\text{-th component}}, \ldots, x_n) = c\,D(x_1, \ldots, \underbrace{\vec{a}}_{j\text{-th}}, \ldots, x_n) + D(x_1, \ldots, \underbrace{\vec{b}}_{j\text{-th}}, \ldots, x_n).$$
• D changes sign whenever we interchange any two inputs: for any $j \neq k$,
  $$D(x_1, \ldots, x_j, \ldots, x_k, \ldots, x_n) = -D(x_1, \ldots, x_k, \ldots, x_j, \ldots, x_n).$$
• D evaluates to 1 on the ordered standard basis vectors:
  $$D(e_1, e_2, \ldots, e_n) = 1.$$
Proof Summary:

• Linearity: directly apply the definition and distribute the summation.
• Swapping: starting from $D(x_1, \ldots, x_j, \ldots, x_k, \ldots, x_n)$, apply a transposition inside N and swap the factors $x_{i_k k}$ and $x_{i_j j}$. The names of the indexing variables do not matter, so swap $i_j$ and $i_k$. Compare the inner summation to $D(x_1, \ldots, x_k, \ldots, x_j, \ldots, x_n)$: the terms are exactly the same. Even though the indexing set is listed in a different order, it still runs over all permutations, so the summations are equal.
• Ordered standard basis: directly apply the definition. All terms are zero unless $(i_1, i_2, \ldots, i_n) = (1, 2, \ldots, n)$.
Proof:

Linearity:

Expand the definition of

$$D(x_1, \ldots, \underbrace{c\vec{a} + \vec{b}}_{j\text{-th component}}, \ldots, x_n)$$

to get

$$\sum_{\substack{\text{permutation } (i_1, i_2, \ldots, i_n) \\ \text{of } (1, 2, \ldots, n)}} (-1)^{N(i_1, i_2, \ldots, i_n)}\, x_{i_1 1} \cdots \underbrace{(c\,a_{i_j j} + b_{i_j j})}_{j\text{-th factor}} \cdots x_{i_n n}.$$

Then distribute the sum:

$$c\!\!\sum_{\substack{\text{perm. } (i_1, \ldots, i_n) \\ \text{of } (1, \ldots, n)}}\!\! (-1)^{N(i_1, \ldots, i_n)}\, x_{i_1 1} \cdots a_{i_j j} \cdots x_{i_n n}
\;+\;
\sum_{\substack{\text{perm. } (i_1, \ldots, i_n) \\ \text{of } (1, \ldots, n)}}\!\! (-1)^{N(i_1, \ldots, i_n)}\, x_{i_1 1} \cdots b_{i_j j} \cdots x_{i_n n}.$$

But this is just

$$c\,D(x_1, \ldots, \underbrace{\vec{a}}_{j\text{-th}}, \ldots, x_n) + D(x_1, \ldots, \underbrace{\vec{b}}_{j\text{-th}}, \ldots, x_n).$$
Swapping:

Expand

$$D(x_1, \ldots, x_j, \ldots, x_k, \ldots, x_n)$$

to get

$$\sum_{\substack{\text{permutation } (i_1, \ldots, i_j, \ldots, i_k, \ldots, i_n) \\ \text{of } (1, 2, \ldots, n)}} (-1)^{N(i_1, \ldots, i_j, \ldots, i_k, \ldots, i_n)}\, x_{i_1 1} \cdots x_{i_j j} \cdots x_{i_k k} \cdots x_{i_n n}.$$

Switch the factors $x_{i_k k}$ and $x_{i_j j}$,

$$\sum_{\substack{\text{permutation } (i_1, \ldots, i_j, \ldots, i_k, \ldots, i_n) \\ \text{of } (1, 2, \ldots, n)}} (-1)^{N(i_1, \ldots, i_j, \ldots, i_k, \ldots, i_n)}\, x_{i_1 1} \cdots x_{i_k k} \cdots x_{i_j j} \cdots x_{i_n n},$$

and apply a single transposition to switch $i_j$ and $i_k$ in the exponent, pulling out a $-1$:

$$-\left(\sum_{\substack{\text{permutation } (i_1, \ldots, i_j, \ldots, i_k, \ldots, i_n) \\ \text{of } (1, 2, \ldots, n)}} (-1)^{N(i_1, \ldots, i_k, \ldots, i_j, \ldots, i_n)}\, x_{i_1 1} \cdots x_{i_k k} \cdots x_{i_j j} \cdots x_{i_n n}\right).$$

But the names of the dummy indexing variables do not matter! So switch $i_j$ with $i_k$:

$$-\left(\sum_{\substack{\text{permutation } (i_1, \ldots, i_k, \ldots, i_j, \ldots, i_n) \\ \text{of } (1, 2, \ldots, n)}} (-1)^{N(i_1, \ldots, i_j, \ldots, i_k, \ldots, i_n)}\, x_{i_1 1} \cdots x_{i_j k} \cdots x_{i_k j} \cdots x_{i_n n}\right).$$

Compare the inside to $D(x_1, \ldots, x_k, \ldots, x_j, \ldots, x_n)$:

$$\sum_{\substack{\text{permutation } (i_1, \ldots, i_j, \ldots, i_k, \ldots, i_n) \\ \text{of } (1, 2, \ldots, n)}} (-1)^{N(i_1, \ldots, i_j, \ldots, i_k, \ldots, i_n)}\, x_{i_1 1} \cdots x_{i_j k} \cdots x_{i_k j} \cdots x_{i_n n}.$$

They are summing the same terms over all permutations. The only difference is that the
permutations are indexed by different dummy variables. Therefore,

$$D(x_1, \ldots, x_j, \ldots, x_k, \ldots, x_n) = -D(x_1, \ldots, x_k, \ldots, x_j, \ldots, x_n).$$
Ordered Standard Basis:

Expand

$$D(e_1, e_2, \ldots, e_n)$$

to get

$$\sum_{\substack{\text{permutation } (i_1, i_2, \ldots, i_n) \\ \text{of } (1, 2, \ldots, n)}} (-1)^{N(i_1, i_2, \ldots, i_n)}\, e_{i_1 1}\, e_{i_2 2} \cdots e_{i_n n}.$$

But

$$e_{jk} = 1$$

only when j = k (and 0 otherwise). Thus every term in this sum is 0, except when

$$i_1 = 1,\quad i_2 = 2,\quad \ldots,\quad i_n = n.$$

This leaves us with

$$(-1)^{N(1, 2, \ldots, n)}\, e_{11}\, e_{22} \cdots e_{nn} = 1.$$
Combining the last two theorems, we have:

Theorem. There exists one and only one function $D: \underbrace{\mathbb{R}^n \times \mathbb{R}^n \times \cdots \times \mathbb{R}^n}_{n \text{ inputs}} \to \mathbb{R}$ such that

• D is linear in each component,
• D changes sign whenever we interchange any two inputs,
• D evaluates to 1 on the ordered standard basis vectors.
29.3 Computing Determinants
As you may have already guessed, we like to evaluate the function D on the columns of a square
matrix:

Definition. The determinant of an n × n matrix A is the value of D applied to the columns of A:

$$\det(A) = D(\vec{a}_1, \vec{a}_2, \ldots, \vec{a}_n)$$

where $\vec{a}_1, \vec{a}_2, \ldots, \vec{a}_n$ are the columns of A.
Back in high school, you already learned how to compute determinants of 2 × 2 and 3 × 3 matrices
using some mnemonic. You drew some diagonals through the array

$$\begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{pmatrix}
\begin{matrix} a_{11} & a_{12} \\ a_{21} & a_{22} \\ a_{31} & a_{32} \end{matrix}$$

and said,

    "Main diagonals minus anti-diagonals."
But,
Math Mantra: Just because a formula is true for a specific case DOES NOT MEAN
it is true for all cases!
We cannot apply this mnemonic for higher n. For example, in the case of n = 5, the diagonal
process would only give us 10 terms. But in the actual determinant definition, our sum has 5! = 120
terms.
The truth is: no one ever uses the direct definition

$$\det(A) = \sum_{\substack{\text{permutation } (i_1, i_2, \ldots, i_n) \\ \text{of } (1, 2, \ldots, n)}} (-1)^{N(i_1, i_2, \ldots, i_n)}\, a_{i_1 1}\, a_{i_2 2} \cdots a_{i_n n}.$$

It is a completely unwieldy mess of n! terms. No one in their right mind wants to expand a sum that big!
Here's a smarter idea:
Math Mantra: Instead of appealing to the original definition,
you can save yourself a lot of trouble by using theorems you've already proved.
The determinant immediately inherits some sweet simplification properties from D. Namely,

Theorem. For an n × n matrix A,

• Adding a scaling of one column to another does not change the determinant:
  $$\det(\ldots, \vec{a}_i, \ldots, \vec{a}_j, \ldots) = \det(\ldots, \vec{a}_i, \ldots, c\vec{a}_i + \vec{a}_j, \ldots)$$
• Scaling a column by c scales the determinant by c:
  $$\det(\ldots, c\vec{a}_i, \ldots) = c\,\det(\ldots, \vec{a}_i, \ldots)$$
• Swapping two columns switches the sign of the determinant:
  $$\det(\ldots, \vec{a}_j, \ldots, \vec{a}_i, \ldots) = -\det(\ldots, \vec{a}_i, \ldots, \vec{a}_j, \ldots)$$
But there's more. We also have the same properties for the rows. And this will be a simple corollary
of the product property for determinants. Note that this proof follows the exact same method as
the derivation of D:

Theorem. For n × n matrices A and B,

$$\det(AB) = \det(A)\det(B).$$
Proof Summary:

• Write each input $A\vec{b}_j$ in terms of the columns of A. Be sure to index the sums.
• Apply linearity on each component and condense the summations into one.
• Any term over non-distinct indices is 0; therefore the summation is now over all permutations of (1, . . . , n).
• In each term, apply the swapping property on $D(\vec{a}_{i_1}, \vec{a}_{i_2}, \ldots, \vec{a}_{i_n})$ to restore it to $\underbrace{D(\vec{a}_1, \vec{a}_2, \ldots, \vec{a}_n)}_{\det(A)}$. This introduces a $(-1)^{N(i_1, i_2, \ldots, i_n)}$ factor in each term.
• Pull out det(A); the remaining sum is det(B).
Proof: Recall that the columns of AB are linear combinations of the columns of A:

$$AB = \Big(A\vec{b}_1 \;\; A\vec{b}_2 \;\; \cdots \;\; A\vec{b}_n\Big),$$

where the j-th column is

$$A\vec{b}_j = \sum_{i=1}^{n} b_{ij}\,\vec{a}_i.$$

When we apply the definition of the determinant,

$$\det(AB) = D\big(A\vec{b}_1, A\vec{b}_2, \ldots, A\vec{b}_n\big),$$

we expand each input as a linear combination of columns of A:

$$D\Big(\sum_{i_1=1}^{n} b_{i_1 1}\,\vec{a}_{i_1},\ \sum_{i_2=1}^{n} b_{i_2 2}\,\vec{a}_{i_2},\ \ldots,\ \sum_{i_n=1}^{n} b_{i_n n}\,\vec{a}_{i_n}\Big).$$
Remember to index the sums!
Applying linearity on each component, we can, once more, pull out each sum,

$$\sum_{i_1=1}^{n} \sum_{i_2=1}^{n} \cdots \sum_{i_n=1}^{n} b_{i_1 1}\, b_{i_2 2} \cdots b_{i_n n}\, D(\vec{a}_{i_1}, \vec{a}_{i_2}, \ldots, \vec{a}_{i_n}),$$

and rewrite as

$$\sum_{i_1, i_2, \ldots, i_n = 1}^{n} b_{i_1 1}\, b_{i_2 2} \cdots b_{i_n n}\, D(\vec{a}_{i_1}, \vec{a}_{i_2}, \ldots, \vec{a}_{i_n}).$$
Then, kill off terms where components are non-distinct. The summation is now over permutations:

$$\sum_{\substack{\text{permutation } (i_1, i_2, \ldots, i_n) \\ \text{of } (1, 2, \ldots, n)}} b_{i_1 1}\, b_{i_2 2} \cdots b_{i_n n}\, D(\vec{a}_{i_1}, \vec{a}_{i_2}, \ldots, \vec{a}_{i_n}).$$

The swapping property yields

$$\sum_{\substack{\text{permutation } (i_1, i_2, \ldots, i_n) \\ \text{of } (1, 2, \ldots, n)}} (-1)^{N(i_1, \ldots, i_n)}\, b_{i_1 1}\, b_{i_2 2} \cdots b_{i_n n}\, \underbrace{D(\vec{a}_1, \vec{a}_2, \ldots, \vec{a}_n)}_{=\det(A)}.$$
Yet the inner determinant term is a constant! Pull it out to get

$$\det(A)\underbrace{\left(\sum_{\substack{\text{permutation } (i_1, i_2, \ldots, i_n) \\ \text{of } (1, 2, \ldots, n)}} (-1)^{N(i_1, \ldots, i_n)}\, b_{i_1 1}\, b_{i_2 2} \cdots b_{i_n n}\right)}_{=\det(B)}.$$

This is the same as

$$\det(A)\det(B).$$
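A quick numerical spot-check of the product property (my own check, using NumPy; not part of the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.integers(-3, 4, size=(4, 4)).astype(float)
B = rng.integers(-3, 4, size=(4, 4)).astype(float)

# det(AB) should equal det(A) * det(B) up to floating-point error.
print(np.isclose(np.linalg.det(A @ B), np.linalg.det(A) * np.linalg.det(B)))   # True
```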
To prove the row reduction properties, we express the row operations as matrix multiplications and
use the preceding theorem:
Theorem. For an n n matrix A,
• Adding a scaling of one row to another does not change the determinant:
  $$\det\begin{pmatrix} \vdots \\ \vec{A}_i \\ \vdots \\ \vec{A}_j \\ \vdots \end{pmatrix} = \det\begin{pmatrix} \vdots \\ \vec{A}_i \\ \vdots \\ c\vec{A}_i + \vec{A}_j \\ \vdots \end{pmatrix}$$
• Scaling a row by c scales the determinant by c:
  $$\det\begin{pmatrix} \vdots \\ c\vec{A}_i \\ \vdots \end{pmatrix} = c\,\det\begin{pmatrix} \vdots \\ \vec{A}_i \\ \vdots \end{pmatrix}$$
• Swapping two rows switches the sign of the determinant:
  $$\det\begin{pmatrix} \vdots \\ \vec{A}_j \\ \vdots \\ \vec{A}_i \\ \vdots \end{pmatrix} = -\det\begin{pmatrix} \vdots \\ \vec{A}_i \\ \vdots \\ \vec{A}_j \\ \vdots \end{pmatrix}$$
Proof Summary:

• Adding a scaling of one row to another: represent the operation as the product $E_{\mathrm{add}}A$. Then
  $$\det(E_{\mathrm{add}}A) = \underbrace{\det(E_{\mathrm{add}})}_{=1}\det(A) = \det(A).$$
• Scaling: represent the operation as the product $E_{\mathrm{scale}}A$. Then
  $$\det(E_{\mathrm{scale}}A) = \underbrace{\det(E_{\mathrm{scale}})}_{=c}\det(A) = c\det(A).$$
• Swapping: represent the operation as the product $E_{\mathrm{swap}}A$. Then
  $$\det(E_{\mathrm{swap}}A) = \underbrace{\det(E_{\mathrm{swap}})}_{=-1}\det(A) = -\det(A).$$
Proof:

Adding a Scaling of One Row to Another:

Let $E_{\mathrm{add}}$ be the identity matrix with a constant c in position (j, i) for $i \neq j$: it has 1's on the
diagonal, the entry c in row j and column i, and 0's everywhere else. Then

$$E_{\mathrm{add}}A = \begin{pmatrix} \vdots \\ \vec{A}_i \\ \vdots \\ c\vec{A}_i + \vec{A}_j \\ \vdots \end{pmatrix}.$$

Notice that

$$\det(E_{\mathrm{add}}) = 1.$$

This is because we can apply the column properties of the determinant: subtracting c times the
j-th column from the i-th column clears the lone c, leaving the identity matrix, whose determinant is 1.
Thus,

$$\det(E_{\mathrm{add}}A) = \det(E_{\mathrm{add}})\det(A) = \det(A).$$
Scaling:

Let $E_{\mathrm{scale}}$ be the identity matrix with c at coordinate (i, i). Then multiplying on the left by
$E_{\mathrm{scale}}$ scales the i-th row:

$$E_{\mathrm{scale}}A = \begin{pmatrix} \vdots \\ c\vec{A}_i \\ \vdots \end{pmatrix}.$$

By the column scaling property of the determinant,

$$\det(E_{\mathrm{scale}}) = c,$$

giving us

$$\det(E_{\mathrm{scale}}A) = \det(E_{\mathrm{scale}})\det(A) = c\det(A).$$
Row Swapping:

Let $E_{\mathrm{swap}}$ be the identity matrix with the i-th and j-th columns swapped. Then multiplying on the
left by $E_{\mathrm{swap}}$ swaps the i-th and j-th rows:

$$E_{\mathrm{swap}}A = \begin{pmatrix} \vdots \\ \vec{A}_j \\ \vdots \\ \vec{A}_i \\ \vdots \end{pmatrix}.$$

By the column swapping property of the determinant,

$$\det(E_{\mathrm{swap}}) = -1,$$

allowing us to conclude

$$\det(E_{\mathrm{swap}}A) = \det(E_{\mathrm{swap}})\det(A) = -\det(A).$$
By exploiting the row and column properties, we can easily¹ compute determinants.
Example. Compute the determinant of

$$\begin{pmatrix}
0 & 0 & 5 & 0 & 0 \\
0 & 4 & 0 & 5 & 0 \\
0 & 0 & 5 & 0 & 5 \\
0 & 5 & 0 & 6 & 0 \\
1 & 0 & 0 & 0 & 1
\end{pmatrix}.$$

¹ This is especially easy with SPARSE matrices (matrices with lots of zeros). You will see these in your engineering courses (e.g. EE263).
First, pull out constants from the first and third rows:

$$\det\begin{pmatrix}
0 & 0 & 5 & 0 & 0 \\
0 & 4 & 0 & 5 & 0 \\
0 & 0 & 5 & 0 & 5 \\
0 & 5 & 0 & 6 & 0 \\
1 & 0 & 0 & 0 & 1
\end{pmatrix}
= 25\,\det\begin{pmatrix}
0 & 0 & 1 & 0 & 0 \\
0 & 4 & 0 & 5 & 0 \\
0 & 0 & 1 & 0 & 1 \\
0 & 5 & 0 & 6 & 0 \\
1 & 0 & 0 & 0 & 1
\end{pmatrix}.$$

Then, subtract the second column from the fourth,

$$25\,\det\begin{pmatrix}
0 & 0 & 1 & 0 & 0 \\
0 & 4 & 0 & 1 & 0 \\
0 & 0 & 1 & 0 & 1 \\
0 & 5 & 0 & 1 & 0 \\
1 & 0 & 0 & 0 & 1
\end{pmatrix},$$

and then four times the fourth column from the second:

$$25\,\det\begin{pmatrix}
0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 \\
0 & 0 & 1 & 0 & 1 \\
0 & 1 & 0 & 1 & 0 \\
1 & 0 & 0 & 0 & 1
\end{pmatrix}.$$

By subtracting rows from each other,

$$25\,\det\begin{pmatrix}
0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 1 \\
0 & 1 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 & 1
\end{pmatrix}
= 25\,\det\begin{pmatrix}
0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 1 \\
0 & 1 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 & 0
\end{pmatrix},$$

which after permuting rows (an odd number of swaps, e.g. 7 of them) gives us

$$25 \cdot (-1)^7 \det\begin{pmatrix}
1 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 1
\end{pmatrix} = -25.$$
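A one-line numerical confirmation of this example (my own check, using NumPy): the determinant is indeed −25, consistent with the odd number of row swaps in the last step.

```python
import numpy as np

A = np.array([
    [0, 0, 5, 0, 0],
    [0, 4, 0, 5, 0],
    [0, 0, 5, 0, 5],
    [0, 5, 0, 6, 0],
    [1, 0, 0, 0, 1],
], dtype=float)

print(round(np.linalg.det(A)))   # -25
```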
New Notation

Symbol: det(A)
Reading: the determinant of the matrix A
Example: det(I) = 1 — "The determinant of the identity matrix is 1."
Lecture 30
Flirting with Inverting
In the dating game, sometimes it's better to stay singular.
- Q. Principal Ideal Domain
Goals: After giving a formal definition of a matrix inverse, we prove necessary and
sufficient conditions for invertibility. Namely, a matrix is invertible if and only if the
determinant is non-zero. To prove this, we derive the cofactor expansion formula. We
then use this formula to explicitly construct the left and right inverses.
30.1 A Revision of Algebra II
A long time ago, in a galaxy far, far away (Algebra II), you were given a definition of determinant.
Your teachers gave you a mnemonic and told you how to calculate it. They also gave you some
concatenated matrices and a methodology to compute an inverse:
$$\left(\begin{array}{ccc|ccc} 0 & 0 & 4 & 1 & 0 & 0 \\ 0 & 1 & 4 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 & 0 & 1 \end{array}\right)
\xrightarrow{\text{ADD}}
\left(\begin{array}{ccc|ccc} 0 & 0 & 4 & 1 & 0 & 0 \\ 0 & 1 & 0 & -1 & 1 & 0 \\ 1 & 0 & 0 & 0 & 0 & 1 \end{array}\right)
\xrightarrow{\text{SWAP}}
\left(\begin{array}{ccc|ccc} 1 & 0 & 0 & 0 & 0 & 1 \\ 0 & 1 & 0 & -1 & 1 & 0 \\ 0 & 0 & 4 & 1 & 0 & 0 \end{array}\right)
\xrightarrow{\text{SCALE}}
\left(\begin{array}{ccc|ccc} 1 & 0 & 0 & 0 & 0 & 1 \\ 0 & 1 & 0 & -1 & 1 & 0 \\ 0 & 0 & 1 & .25 & 0 & 0 \end{array}\right)$$
By now, you understand
Math Mantra: We don't care about the methodology. Methodology can always be
looked up or programmed on a computer. We care about the MEANING.
From last lecture, you should realize that the determinant is a complicated little bugger. It deserves
a lot more credit than some measly mnemonic. And now that we have different eyes, we can go back
to Algebra II and understand what we did and why we did it.
So like John Smith in Pocahontas, we are going to listen to the mathematics that surround us and
learn things we never knew that we never knew.
30.2 Left and Right Inverse
The first thing we need to do is define inverses:

Definition. Let A be an n × n matrix. The right inverse of A, if it exists, is the right multiplicative
inverse of A. Formally, it is the n × n matrix B that satisfies

$$AB = I.$$

Likewise, the left inverse of A, if it exists, is the n × n matrix C that satisfies

$$CA = I.$$
Notice we are making two major points.

• The inverse need not exist. For example, it may be the case that for every choice of B,
  $$\begin{pmatrix} 1 & 1 & 1 \\ 1 & 1 & 0 \\ 0 & 0 & 0 \end{pmatrix} B \neq I.$$
• We have to consider the left and right inverse separately. This is because matrix multiplication
  is not commutative!
Let's address the second issue first. Using the same algebraic shenanigans as in Lecture 5, we can
prove that there is a unique matrix that is both the left and the right inverse, provided that the left
and right inverse both exist.

Theorem. Let A be an n × n matrix. Suppose left and right inverses exist.
• The right inverse is unique: if
  $$AB = I \quad\text{and}\quad AB' = I,$$
  then $B = B'$.
• The right inverse equals the left inverse: if
  $$AB = I \quad\text{and}\quad CA = I,$$
  then $B = C$.
Proof:

Right inverse is unique:

Suppose there exist B, B′ such that

$$AB = I \quad\text{and}\quad AB' = I.$$

Then

$$AB = AB'.$$

Since we also assumed that the left inverse of A exists, multiply both sides by a left inverse C:

$$C(AB) = C(AB').$$

Then by associativity,

$$(CA)B = (CA)B',$$

and since C is a left inverse of A, this reduces to

$$B = B'.$$

Right inverse equals left inverse:

Assume B, C satisfy

$$AB = I \quad\text{and}\quad CA = I.$$

First,

$$B = IB.$$

Substituting,

$$IB = (CA)B.$$

By associativity,

$$(CA)B = C(AB) = C.$$

Thus,

$$B = C.$$
Did we forget to prove that the left inverse is unique? Nope. Notice that this follows from the fact
that any left inverse must equal the right inverse, which was proven to be unique. Finally, we can
formally define an inverse:

Definition. Let A be an n × n matrix. If the left and right inverse both exist, the inverse of
A is the n × n matrix $A^{-1}$ that satisfies

$$AA^{-1} = A^{-1}A = I.$$
Now, we can explain the methodology behind our inverse computation. In truth,

$$\left(\begin{array}{ccc|ccc} 0 & 0 & 4 & 1 & 0 & 0 \\ 0 & 1 & 4 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 & 0 & 1 \end{array}\right)$$

was really shorthand for the system

$$AX = I.$$

When we solved for X, we were multiplying both sides, on the left, by elementary matrices.
What is an elementary matrix? Last lecture, we used the matrices $E_{\mathrm{add}}$, $E_{\mathrm{swap}}$, $E_{\mathrm{scale}}$:
$E_{\mathrm{add}}$ is the identity with an extra constant c in a single off-diagonal position, $E_{\mathrm{scale}}$ is the identity
with one diagonal entry replaced by c, and $E_{\mathrm{swap}}$ is the identity with two columns interchanged.
An elementary matrix is simply a matrix of one of these forms. So we multiplied matrices on both
sides to reduce the left hand side of

$$E_3 E_2 E_1 A X = E_3 E_2 E_1 I,$$
$$\underbrace{\begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & .25 \end{pmatrix}}_{E_3}
\underbrace{\begin{pmatrix} 0 & 0 & 1 \\ 0 & 1 & 0 \\ 1 & 0 & 0 \end{pmatrix}}_{E_2}
\underbrace{\begin{pmatrix} 1 & 0 & 0 \\ -1 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}}_{E_1}
\underbrace{\begin{pmatrix} 0 & 0 & 4 \\ 0 & 1 & 4 \\ 1 & 0 & 0 \end{pmatrix}}_{A}
X = E_3 E_2 E_1 I,$$

to just

$$X = E_3 E_2 E_1 I = \begin{pmatrix} 0 & 0 & 1 \\ -1 & 1 & 0 \\ .25 & 0 & 0 \end{pmatrix}.$$
To reiterate, this computation required the assumption that the inverse exists, and for the umpteenth
time,

    The inverse need not exist!

Can we come up with a nice condition that guarantees that the inverse exists?
Absolutely. To derive the condition, we follow the first rule of Math Fight Club:

Math Mantra: You should always play around with new theorems and definitions!

We have already done this a million times:

• Applying the Cauchy-Schwarz inequality to derive new inequalities.
• Applying Rolle's Theorem to prove the Mean Value Theorem.
• Using the standard basis vectors in the definition of the directional derivative.
• etc., etc., etc.

Therefore, let's apply the determinant to the inverse definition,

$$AA^{-1} = I,$$

to get

$$\det(AA^{-1}) = \det(I),$$

which by our determinant properties is

$$\det(A)\det(A^{-1}) = 1.$$

This means that if the inverse exists, then

$$\det(A) \neq 0.$$

That's great. But it doesn't give us a condition on whether the inverse exists.
However, consider the converse:

    If $\det(A) \neq 0$, then $A^{-1}$ exists.

Remarkably, this is true!
30.3 Cofactor Expansions
Before we can prove the preceding assertion, we will need an unintuitive way to rewrite a determinant
in terms of smaller matrices. Specifically, let $A_{ij}$ denote the $(n-1)\times(n-1)$ sub-matrix of A with
the i-th row and j-th column removed:

$$A_{ij} = \begin{pmatrix}
a_{11} & a_{12} & \cdots & a_{1j} & \cdots & a_{1n} \\
a_{21} & a_{22} & \cdots & a_{2j} & \cdots & a_{2n} \\
\vdots & \vdots & & \vdots & & \vdots \\
a_{i1} & a_{i2} & \cdots & a_{ij} & \cdots & a_{in} \\
\vdots & \vdots & & \vdots & & \vdots \\
a_{n1} & a_{n2} & \cdots & a_{nj} & \cdots & a_{nn}
\end{pmatrix}
\quad\text{with row } i \text{ and column } j \text{ crossed out.}$$
det(A) =
n

i=1
(1)
i+j
a
ij
det(A
ij
)
Intuitively, you are choosing a column number and expanding the determinant along that column. So
in the case of a 3 3,
det
_
_
a
11
a
12
a
13
a
21
a
22
a
23
a
31
a
32
a
33
_
_
we can rewrite it as an expansion along the 2nd column:
a
12
det
_
_
a
11
a
12
a
13
a
21
a
22
a
23
a
31
a
32
a
33
_
_
+ a
22
det
_
_
a
11
a
12
a
13
a
21
a
22
a
23
a
31
a
32
a
33
_
_
a
32
det
_
_
a
11
a
12
a
13
a
21
a
22
a
23
a
31
a
32
a
33
_
_
We call this sum a cofactor expansion.
Theorem. For any column number j,

$$\det(A) = \sum_{i=1}^{n} (-1)^{i+j}\, a_{ij}\, \det(A_{ij}),$$

where $A_{ij}$ is the sub-matrix of A with the i-th row and j-th column removed.
Proof Summary:

• Perform swaps to move the j-th input to the last input of $D(\vec{a}_1, \vec{a}_2, \ldots, \vec{a}_n)$.
• Use multilinearity on the last component of $D(\vec{a}_1, \ldots, \vec{a}_{j-1}, \vec{a}_{j+1}, \ldots, \vec{a}_n, \vec{a}_j)$ to expand it into a summation.
• For the i-th term of this summation, look at $D(\vec{a}_1, \ldots, \vec{a}_{j-1}, \vec{a}_{j+1}, \ldots, \vec{a}_n, e_i)$ and perform swaps to move the i-th row to the last row.
• In the last column, the only non-zero component is in the last row. Therefore, when you evaluate D directly, $i_n = n$ and $\det(A_{ij})$ pops out.
Proof: By definition of the determinant,

$$\det(A) = D(\vec{a}_1, \vec{a}_2, \ldots, \vec{a}_n).$$

First, shift the j-th column, $\vec{a}_j$, all the way to the right:

$$D(\vec{a}_1, \ldots, \vec{a}_{j-1}, \vec{a}_{j+1}, \vec{a}_{j+2}, \ldots, \vec{a}_n, \vec{a}_j).$$

This requires $n - j$ swaps, so we have

$$\det(A) = (-1)^{n-j}\, D(\vec{a}_1, \vec{a}_2, \ldots, \vec{a}_{j-1}, \vec{a}_{j+1}, \ldots, \vec{a}_n, \vec{a}_j).$$

Expand only the last component,

$$\det(A) = (-1)^{n-j}\, D\Big(\vec{a}_1, \vec{a}_2, \ldots, \vec{a}_{j-1}, \vec{a}_{j+1}, \ldots, \vec{a}_n,\ \sum_{i=1}^{n} a_{ij}\,e_i\Big),$$

and apply multilinearity:

$$\det(A) = (-1)^{n-j} \sum_{i=1}^{n} a_{ij}\, D(\vec{a}_1, \vec{a}_2, \ldots, \vec{a}_{j-1}, \vec{a}_{j+1}, \ldots, \vec{a}_n, e_i). \tag{$*$}$$
Let's look at the D in the i-th term of this sum:

$$D(\vec{a}_1, \vec{a}_2, \ldots, \vec{a}_{j-1}, \vec{a}_{j+1}, \ldots, \vec{a}_n, e_i).$$

We have a 1 in the last column of the i-th row:

$$\det\begin{pmatrix}
a_{11} & \cdots & a_{1(j-1)} & a_{1(j+1)} & \cdots & 0 \\
a_{21} & \cdots & a_{2(j-1)} & a_{2(j+1)} & \cdots & 0 \\
\vdots & & \vdots & \vdots & & \vdots \\
a_{i1} & \cdots & a_{i(j-1)} & a_{i(j+1)} & \cdots & 1 \\
\vdots & & \vdots & \vdots & & \vdots \\
a_{n1} & \cdots & a_{n(j-1)} & a_{n(j+1)} & \cdots & 0
\end{pmatrix}.$$
Applying the same trick we did with the columns, shift the i-th row to the last row:
This requires $n - i$ transpositions, so now we have

$$(-1)^{n-i}\det\underbrace{\begin{pmatrix}
a_{11} & \cdots & a_{1(j-1)} & a_{1(j+1)} & \cdots & 0 \\
\vdots & & \vdots & \vdots & & \vdots \\
a_{(i-1)1} & \cdots & a_{(i-1)(j-1)} & a_{(i-1)(j+1)} & \cdots & 0 \\
a_{(i+1)1} & \cdots & a_{(i+1)(j-1)} & a_{(i+1)(j+1)} & \cdots & 0 \\
\vdots & & \vdots & \vdots & & \vdots \\
a_{n1} & \cdots & a_{n(j-1)} & a_{n(j+1)} & \cdots & 0 \\
a_{i1} & \cdots & a_{i(j-1)} & a_{i(j+1)} & \cdots & 1
\end{pmatrix}}_{B}.$$

permutation (i
1
, i
2
, . . . , i
n
)
of (1, 2, . . . , n)
(1)
N(i
1
,i
2
,...,in)
b
i
1
1
b
i
2
2
. . . b
inn
.
The last column is e
n
, so b
inn
is zero when i
n
= n. This means
i
1
, i
2
, . . . , i
n1
must be a permutation among the remaining numbers
1, 2 . . . , n 1.
Moreover, since n is the largest number,

$$N(i_1, i_2, \ldots, i_{n-1}) = N(i_1, i_2, \ldots, i_{n-1}, n),$$

and our sum reduces to

$$\sum_{\substack{\text{permutation } (i_1, i_2, \ldots, i_{n-1}) \\ \text{of } (1, 2, \ldots, n-1)}} (-1)^{N(i_1, i_2, \ldots, i_{n-1})}\, b_{i_1 1}\, b_{i_2 2} \cdots b_{i_{n-1}(n-1)},$$

which is just the determinant of B restricted to the first $n-1$ rows and $n-1$ columns:
$$\begin{pmatrix}
a_{11} & \cdots & a_{1(j-1)} & a_{1(j+1)} & \cdots & a_{1n} \\
\vdots & & \vdots & \vdots & & \vdots \\
a_{(i-1)1} & \cdots & a_{(i-1)(j-1)} & a_{(i-1)(j+1)} & \cdots & a_{(i-1)n} \\
a_{(i+1)1} & \cdots & a_{(i+1)(j-1)} & a_{(i+1)(j+1)} & \cdots & a_{(i+1)n} \\
\vdots & & \vdots & \vdots & & \vdots \\
a_{n1} & \cdots & a_{n(j-1)} & a_{n(j+1)} & \cdots & a_{nn}
\end{pmatrix}$$

Lo and behold, what is this? The i-th row is missing and the j-th column is missing. This is precisely
the submatrix $A_{ij}$!
Therefore,

$$\det(B) = \det(A_{ij}),$$

and thus

$$D(\vec{a}_1, \vec{a}_2, \ldots, \vec{a}_{j-1}, \vec{a}_{j+1}, \ldots, \vec{a}_n, e_i) = (-1)^{n-i}\det(A_{ij}).$$
Plugging back into ($*$),

$$\det(A) = (-1)^{n-j} \sum_{i=1}^{n} a_{ij}\, \underbrace{(-1)^{n-i}\det(A_{ij})}_{D(\vec{a}_1, \ldots, \vec{a}_{j-1}, \vec{a}_{j+1}, \ldots, \vec{a}_n, e_i)}.$$

Of course, we can distribute $(-1)^{n-j}$ and combine

$$(-1)^{n-i}(-1)^{n-j} = (-1)^{2n}\big((-1)^{-1}\big)^{i+j} = (-1)^{i+j},$$

so

$$\det(A) = \sum_{i=1}^{n} (-1)^{i+j}\, a_{ij}\, \det(A_{ij}).$$
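A short recursive sketch of this cofactor expansion (my own code, not from the notes). The 0-indexed sign $(-1)^{i+j}$ has the same parity as the 1-indexed formula, since shifting both indices by one changes the exponent by 2.

```python
def minor(A, i, j):
    """The sub-matrix A_ij with the i-th row and j-th column removed (0-indexed)."""
    return [row[:j] + row[j + 1:] for r, row in enumerate(A) if r != i]

def det(A, j=0):
    """Determinant via cofactor expansion along column j."""
    n = len(A)
    if n == 1:
        return A[0][0]
    return sum((-1) ** (i + j) * A[i][j] * det(minor(A, i, j)) for i in range(n))

A = [[0, 0, 4],
     [0, 1, 4],
     [1, 0, 0]]
print(det(A))        # -4
```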
30.4 Constructing the Left Inverse
Armed with a neat way to write the determinant, we would like to use this new formula to explicitly
construct an inverse. To do this, we will write the matrix

$$\det(A)\,I = \begin{pmatrix}
\det A & 0 & 0 & \cdots & 0 \\
0 & \det A & 0 & \cdots & 0 \\
0 & 0 & \det A & \cdots & 0 \\
0 & 0 & 0 & \ddots & 0 \\
0 & 0 & 0 & \cdots & \det A
\end{pmatrix}$$

as a product

$$\det(A)\,I = MA$$

for some matrix M. This means that, as long as $\det(A) \neq 0$,

$$\frac{1}{\det(A)}\,M$$

is a left inverse of A.
Theorem. If $\det(A) \neq 0$, then there exists a left inverse of A.

Proof Summary:

• Define a two-input function
  $$F(r, c) = \sum_{i=1}^{n} (-1)^{i+r}\, a_{ic}\, \det(A_{ir}).$$
• By the cofactor expansion theorem, $F(c, c) = \det(A)$.
• $F(r, c) = 0$ for $r \neq c$:
  – Consider the matrix A′ where the r-th column has been replaced by the c-th column.
  – Cofactor-expand A′ along the c-th column and rewrite the expression as $F(r, c) = 0$.
• The matrix with components F(r, c) is $\det(A)\,I$.
• This matrix can also be written as a product MA.
• $\frac{1}{\det(A)}M$ is a left inverse of A.
Proof: We would like to construct a function F(r, c) that is det(A) only when the inputs are equal (and
0 otherwise):

$$F(r, c) = \begin{cases} \det(A) & \text{if } r = c \\ 0 & \text{otherwise.} \end{cases}$$

This is because we want to rewrite $\det(A)\,I$ as

$$\begin{pmatrix}
\det A & 0 & \cdots & 0 \\
0 & \det A & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \det A
\end{pmatrix}
=
\begin{pmatrix}
F(1,1) & F(1,2) & \cdots & F(1,n) \\
F(2,1) & F(2,2) & \cdots & F(2,n) \\
\vdots & \vdots & \ddots & \vdots \\
F(n,1) & F(n,2) & \cdots & F(n,n)
\end{pmatrix}.$$
But what should our magic function F be?
When we expanded the determinant, we chose the column number j. This is a function¹ of one
variable:

$$f(j) = \sum_{i=1}^{n} (-1)^{i+j}\, a_{ij}\, \det(A_{ij}).$$

But notice that j occurs multiple times in the above formula. So why not split the inputs:

¹ Though this is a boneheaded function that spits out det(A) for j = 1, 2, . . . , n.
$$F(r, c) = \sum_{i=1}^{n} (-1)^{i+r}\, a_{ic}\, \det(A_{ir})\,?$$

Immediately we have, when r = c,

$$F(c, c) = \det(A).$$
This is just the normal expansion formula for the choice j = c. But what happens when $r \neq c$?
This takes some creativity: stare at the matrix A′ where the r-th column of A has been REPLACED
by the c-th column:

$$A' = \Big(\vec{a}_1 \;\; \vec{a}_2 \;\; \cdots \;\; \underbrace{\vec{a}_c}_{r\text{-th column}} \;\; \cdots \;\; \underbrace{\vec{a}_c}_{c\text{-th column}} \;\; \cdots\Big).$$

We know the determinant of A′ is 0 (two columns are the same, duh). Therefore, when we do a
cofactor expansion along the c-th column,
_

_
a
11
. . . a
1c
. . . a
1c
. . . a
1n
a
21
. . . a
2c
. . . a
2c
. . . a
2n
a
31
. . . a
3c
. . . a
3c
. . . a
3n
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
a
n1
. . . a
nc
. . . a
nc
. . . a
nn
_

_
r c
we have
0 =
n

i=1
(1)
i+c
a
ic
det(A

ic
).
Take a look at $A'_{ic}$ and compare it to $A_{ir}$. The columns of $A'_{ic}$ are

$$\vec{a}_1, \ldots, \vec{a}_{r-1},\ \vec{a}_c,\ \vec{a}_{r+1}, \ldots, \vec{a}_{c-1},\ \vec{a}_{c+1}, \ldots, \vec{a}_n \quad (\text{each with the } i\text{-th row removed}),$$

while the columns of $A_{ir}$ are

$$\vec{a}_1, \ldots, \vec{a}_{r-1},\ \vec{a}_{r+1}, \ldots, \vec{a}_{c-1},\ \vec{a}_c,\ \vec{a}_{c+1}, \ldots, \vec{a}_n \quad (\text{each with the } i\text{-th row removed}).$$

$A_{ir}$ has the same columns as $A'_{ic}$, just in a different order! So, after performing some number q of
column transpositions, we have
$$0 = \sum_{i=1}^{n} (-1)^{i+c}\, a_{ic}\, \underbrace{(-1)^q \det(A_{ir})}_{\det(A'_{ic})}.$$
Multiply both sides by $(-1)^{r-c-q}$. This fixes up the power of $-1$:

$$0 = \sum_{i=1}^{n} (-1)^{i+r}\, a_{ic}\, \det(A_{ir}).$$

But the right-hand side is just F(r, c)! So when $r \neq c$,

$$F(r, c) = 0.$$
Now notice that F(r, c) is simply a matrix product involving the c-th column of A:

$$F(r, c) = \Big((-1)^{1+r}\det(A_{1r}) \;\; (-1)^{2+r}\det(A_{2r}) \;\; \cdots \;\; (-1)^{n+r}\det(A_{nr})\Big)
\begin{pmatrix} a_{1c} \\ a_{2c} \\ a_{3c} \\ \vdots \\ a_{nc} \end{pmatrix}.$$
Therefore, the product

$$MA = \underbrace{\begin{pmatrix}
\det A & 0 & \cdots & 0 \\
0 & \det A & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \det A
\end{pmatrix}}_{[F(r,c)]},$$

where

$$M = \begin{pmatrix}
(-1)^{1+1}\det(A_{11}) & (-1)^{2+1}\det(A_{21}) & \cdots & (-1)^{n+1}\det(A_{n1}) \\
\vdots & \vdots & & \vdots \\
(-1)^{1+r}\det(A_{1r}) & (-1)^{2+r}\det(A_{2r}) & \cdots & (-1)^{n+r}\det(A_{nr}) \\
\vdots & \vdots & & \vdots \\
(-1)^{1+n}\det(A_{1n}) & (-1)^{2+n}\det(A_{2n}) & \cdots & (-1)^{n+n}\det(A_{nn})
\end{pmatrix}.$$
Thus, $\frac{1}{\det(A)}M$ is a left inverse of A.
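Here is a small sketch of this construction in code (my own, not from the notes): the matrix M of signed cofactors satisfies MA = det(A) I, so M / det(A) is a left inverse whenever det(A) ≠ 0. The helpers are redefined here so the snippet stands alone; as in the cofactor sketch, 0-indexed signs have the same parity as the 1-indexed formula.

```python
def minor(A, i, j):
    return [row[:j] + row[j + 1:] for r, row in enumerate(A) if r != i]

def det(A, j=0):
    n = len(A)
    if n == 1:
        return A[0][0]
    return sum((-1) ** (i + j) * A[i][j] * det(minor(A, i, j)) for i in range(n))

def left_inverse(A):
    n = len(A)
    d = det(A)
    assert d != 0, "det(A) = 0: no inverse"
    # M[r][i] = (-1)^(i+r) det(A_ir), exactly the matrix M built in the proof above.
    M = [[(-1) ** (i + r) * det(minor(A, i, r)) for i in range(n)] for r in range(n)]
    return [[M[r][i] / d for i in range(n)] for r in range(n)]

A = [[0, 0, 4],
     [0, 1, 4],
     [1, 0, 0]]
print(left_inverse(A))    # approximately [[0, 0, 1], [-1, 1, 0], [0.25, 0, 0]]
```

This agrees with the X computed by row reduction earlier in the lecture.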
Did we finish proving

    If $\det(A) \neq 0$, then $A^{-1}$ exists?

Absolutely not! We only proved that the left inverse exists. Remember, the inverse does not exist
unless both the left inverse and the right inverse exist!
To complete the proof, we will need to prove that we can calculate the cofactor expansion of det(A)
along a row. This requires proving one more fundamental fact about determinants.
30.5. PROVING DET(A) = DET(A
T
) 589
30.5 Proving det(A) = det(A
T
)
To the astute 51H veteran and careful reader: you probably noticed I completely sidestepped the proof that

    det(A) = det(A^T)

Instead of proving this fact first, I introduced elementary matrices and used the product property to justify row reduction of determinants. I did this because

    I wanted to explain the computation of inverses, which uses elementary matrices.

    I wanted to warn students:

Math Mantra: When consulting other sources, be careful of circularity!

In Math, there are many different ways to introduce a subject. One source can

    Use Theorem A to prove Theorem B,

whereas another source can

    Use Theorem B to prove Theorem A.

If you mindlessly borrow from another source, you can accidentally

    Use Theorem A to prove Theorem A.

Complete and utter fail!
Particularly, you cannot use the following proof on Khan Academy:

Theorem (Bad Proof). For any n × n matrix A,

    det(A) = det(A^T)

Bad Proof: We proceed by induction on the matrix size n × n. The base case n = 1 is obvious. For the inductive step, cofactor expand det(A) along the first column:

    det(A) = a_{11} det(A_{11}) - a_{21} det(A_{21}) + . . . + (-1)^{n+1} a_{n1} det(A_{n1})

By the induction hypothesis, we know the (RHS) is

    a_{11} det((A_{11})^T) - a_{21} det((A_{21})^T) + . . . + (-1)^{n+1} a_{n1} det((A_{n1})^T)

which is the same as

    a_{11} det(A^T_{11}) - a_{21} det(A^T_{12}) + . . . + (-1)^{n+1} a_{n1} det(A^T_{1n})

But this is the formula for the cofactor expansion of det(A^T) along the first row:

    det(A^T) = a_{11} det(A^T_{11}) - a_{21} det(A^T_{12}) + . . . + (-1)^{n+1} a_{n1} det(A^T_{1n})

However, we cannot use this proof because we will use det(A) = det(A^T) to derive the cofactor expansion row formula!
Instead, we give the following proof:

Theorem. For any n × n matrix A,

    det(A) = det(A^T)

Proof Summary:

    Rewrite the determinant definition as

        det(A) = Σ over permutations σ = (i_1, i_2, . . . , i_n) of (1, 2, . . . , n) of
                 (-1)^{N(j^σ_1, j^σ_2, . . . , j^σ_n)} x_{1 j^σ_1} x_{2 j^σ_2} · · · x_{n j^σ_n}

      where

        (j^σ_1, j^σ_2, . . . , j^σ_n) = τ_k τ_{k-1} · · · τ_1 (1, 2, . . . , n)

      for

        σ = τ_1 τ_2 · · · τ_k (1, 2, . . . , n).

    Show every term of the preceding summation is contained in

        det(A^T) = Σ over permutations (i_1, i_2, . . . , i_n) of (1, 2, . . . , n) of
                   (-1)^{N(i_1, i_2, . . . , i_n)} x_{1 i_1} x_{2 i_2} · · · x_{n i_n}

      and vice versa.
Proof: Starting with the formula

    det(A) = Σ over permutations (i_1, i_2, . . . , i_n) of (1, 2, . . . , n) of
             (-1)^{N(i_1, i_2, . . . , i_n)} x_{i_1 1} x_{i_2 2} · · · x_{i_n n}                     (⋆)

the goal is to rewrite it into the form

    det(A^T) = Σ over permutations (i_1, i_2, . . . , i_n) of (1, 2, . . . , n) of
               (-1)^{N(i_1, i_2, . . . , i_n)} x_{1 i_1} x_{2 i_2} · · · x_{n i_n}.

Particularly, we need to make the right indices of

    x_{i_1 1} x_{i_2 2} · · · x_{i_n n}

appear on the left:

    x_{1 i_1} x_{2 i_2} · · · x_{n i_n}.

Notice that (i_1, i_2, . . . , i_n) is a permutation of (1, 2, . . . , n), so we can always reorder

    x_{i_1 1} x_{i_2 2} · · · x_{i_n n}

to be

    x_{1 j_1} x_{2 j_2} · · · x_{n j_n}.

But what are the new j's?

Consider again the left indices (i_1, i_2, . . . , i_n). When we moved the i's around, we were really applying transpositions that restored the i's to the original ordering:

    τ_k τ_{k-1} · · · τ_1 (i_1, i_2, . . . , i_n) = (1, 2, . . . , n)

At the same time, we were also applying these transpositions to the right components. They got tacked on for the ride:

    τ_k τ_{k-1} · · · τ_1 (1, 2, . . . , n) = (j_1, j_2, . . . , j_n).

Now each term of (⋆) is of the form

    (-1)^{N(i_1, i_2, . . . , i_n)} x_{1 j_1} x_{2 j_2} · · · x_{n j_n}.

Moreover, we know

    N(j_1, j_2, . . . , j_n) = N(i_1, i_2, . . . , i_n).

This is because from

    τ_k τ_{k-1} · · · τ_1 (1, 2, . . . , n) = (j_1, j_2, . . . , j_n),

we can apply transpositions to both sides to get

    τ_1 · · · τ_{k-1} τ_k τ_k τ_{k-1} · · · τ_1 (1, 2, . . . , n) = τ_1 · · · τ_{k-1} τ_k (j_1, j_2, . . . , j_n).

Therefore,

    (1, 2, . . . , n) = τ_1 · · · τ_{k-1} τ_k (j_1, j_2, . . . , j_n)

i.e., the number of transpositions needed to restore the j's to the correct ordering is also k. Thus, each term of (⋆) is of the form

    (-1)^{N(j_1, j_2, . . . , j_n)} x_{1 j_1} x_{2 j_2} · · · x_{n j_n}
Note that (j_1, j_2, . . . , j_n) is a function of the permutation σ = (i_1, i_2, . . . , i_n), so to signify this fact, we write

    (j^σ_1, j^σ_2, . . . , j^σ_n)

and thus (⋆) becomes

    det(A) = Σ over permutations σ = (i_1, i_2, . . . , i_n) of (1, 2, . . . , n) of
             (-1)^{N(j^σ_1, j^σ_2, . . . , j^σ_n)} x_{1 j^σ_1} x_{2 j^σ_2} · · · x_{n j^σ_n}.          (⋆⋆)

I claim that this sum is exactly the same as

    det(A^T) = Σ over permutations (i_1, i_2, . . . , i_n) of (1, 2, . . . , n) of
               (-1)^{N(i_1, i_2, . . . , i_n)} x_{1 i_1} x_{2 i_2} · · · x_{n i_n}                     (⋆⋆⋆)

First notice that every term in (⋆⋆) appears in (⋆⋆⋆), since (j^σ_1, j^σ_2, . . . , j^σ_n) is indeed a permutation. Moreover, every term in (⋆⋆⋆) appears in (⋆⋆):

Consider a term

    (-1)^{N(q_1, q_2, . . . , q_n)} x_{1 q_1} x_{2 q_2} · · · x_{n q_n}

in (⋆⋆⋆). Then,

    (q_1, q_2, . . . , q_n) = t_s t_{s-1} · · · t_1 (1, 2, . . . , n)

for some transpositions t_i. If we choose

    σ = t_1 t_2 · · · t_s (1, 2, . . . , n),

we have

    (j^σ_1, j^σ_2, . . . , j^σ_n) = t_s t_{s-1} · · · t_1 (1, 2, . . . , n)

and thus

    (q_1, q_2, . . . , q_n) = (j^σ_1, j^σ_2, . . . , j^σ_n).

Since each term of (⋆⋆) is contained in (⋆⋆⋆) and vice versa, we conclude

    det(A) = det(A^T).
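As a quick sanity check of the theorem on a random matrix (this is only an illustration, not part of the proof; NumPy assumed):

```python
import numpy as np

A = np.random.rand(5, 5)
# The determinant of a matrix equals the determinant of its transpose.
print(np.isclose(np.linalg.det(A), np.linalg.det(A.T)))   # True
```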
30.6 Constructing the Right Inverse

Consider the cofactor expansion of det(A^T) along a column. Using det(A) = det(A^T), we know this is the same as the cofactor expansion of det(A) across a row. Thus, we automatically have

Theorem. For any row number i,

    det(A) = Σ_{j=1}^{n} (-1)^{i+j} a_{ij} det(A_{ij})

where A_{ij} is the sub-matrix of A with the i-th row and j-th column removed.
Now we can do a similar trick to prove a right inverse exists:

Theorem. If det(A) ≠ 0, then there exists a right inverse of A.

Proof Summary:

    Define a two-input function

        F(r, c) = Σ_{j=1}^{n} (-1)^{c+j} a_{rj} det(A_{cj})

    By the cofactor expansion theorem, F(c, c) = det(A).

    F(r, c) = 0 for r ≠ c:

        Consider the matrix A' where the c-th row has been replaced by the r-th row.

        Cofactor expand A' along the r-th row and rewrite the expression as F(r, c) = 0.

    The matrix with components F(r, c) is det(A)I.

    This matrix can also be written as a product AM.

    (1/det(A)) M is the right inverse of A.
Proof: We do the same trick, but use the row expansion formula instead of the column expansion. Define

    F(r, c) = Σ_{j=1}^{n} (-1)^{c+j} a_{rj} det(A_{cj}).

Immediately we have, if r = c,

    F(c, c) = Σ_{j=1}^{n} (-1)^{c+j} a_{cj} det(A_{cj}) = det(A).

If r ≠ c, consider the matrix A' where the c-th row is replaced by the r-th row:

    A' = [ A_1 ]
         [ A_2 ]
         [  :  ]
         [ A_r ]   (r-th row)
         [  :  ]
         [ A_r ]   (c-th row)
         [  :  ]

Again, since the two rows are the same we know

    det(A') = 0.

Cofactor-expand along the r-th row of

    [ a_11  a_12  a_13  . . .  a_1n ]
    [  :     :     :             :  ]
    [ a_r1  a_r2  a_r3  . . .  a_rn ]   (r-th row)
    [  :     :     :             :  ]
    [ a_r1  a_r2  a_r3  . . .  a_rn ]   (c-th row)
    [  :     :     :             :  ]
    [ a_n1  a_n2  a_n3  . . .  a_nn ]

to get

    0 = det(A') = Σ_{j=1}^{n} (-1)^{r+j} a_{rj} det(A'_{rj})

Compare A'_{rj} to A_{cj}:
Both are obtained by deleting the j-th column, and both contain exactly the rows A_1, . . . , A_n of A other than A_c (with their j-th entries removed): in A'_{rj} the extra copy of A_r sits in the c-th slot, while in A_{cj} the rows appear in their usual order.

So A_{cj} has the same rows as A'_{rj}, except in a different order! Therefore, after performing some number of transpositions q, we have

    0 = Σ_{j=1}^{n} (-1)^{r+j} a_{rj} (-1)^q det(A_{cj})

since (-1)^q det(A_{cj}) = det(A'_{rj}).

Multiplying both sides by (-1)^{c-r-q}, we can replace the power of -1:

    0 = Σ_{j=1}^{n} (-1)^{c+j} a_{rj} det(A_{cj})

But this is just F(r, c)! Therefore, when r ≠ c,

    F(r, c) = 0.
Because F(r, c) is simply a matrix product involving the r-th row of A,

    F(r, c) = [ a_{r1}  a_{r2}  a_{r3}  . . .  a_{rn} ] · ( (-1)^{c+1} det(A_{c1}), (-1)^{c+2} det(A_{c2}), (-1)^{c+3} det(A_{c3}), . . . , (-1)^{c+n} det(A_{cn}) )^T,
we conclude that the product

    AM = [ det A    0      0    . . .    0   ]
         [   0    det A    0    . . .    0   ]
         [   0      0    det A  . . .    0   ]      (this is the matrix [F(r, c)])
         [   :      :      :      .      :   ]
         [   0      0      0    . . .  det A ]

where

    M = [ (-1)^{1+1} det(A_{11})   . . .   (-1)^{c+1} det(A_{c1})   . . .   (-1)^{n+1} det(A_{n1}) ]
        [ (-1)^{1+2} det(A_{12})   . . .   (-1)^{c+2} det(A_{c2})   . . .   (-1)^{n+2} det(A_{n2}) ]
        [           :                              :                                  :           ]
        [ (-1)^{1+n} det(A_{1n})   . . .   (-1)^{c+n} det(A_{cn})   . . .   (-1)^{n+n} det(A_{nn}) ]

Thus, (1/det(A)) M is the right inverse¹ of A.

Since det(A) ≠ 0 implies a left and right inverse exists, we can finally conclude:

Theorem. If det(A) ≠ 0, then there exists an inverse of A.

¹ Notice our M is the same in both proofs, as it should be.
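Here is a short, self-contained numerical sketch of the whole story (NumPy assumed; the helper minor is my own name): the matrix of signed sub-determinants divided by det(A) inverts A from both sides and agrees with NumPy's own inverse.

```python
import numpy as np

def minor(A, i, j):
    """Sub-matrix of A with the i-th row and j-th column removed (0-indexed)."""
    return np.delete(np.delete(A, i, axis=0), j, axis=1)

A = np.array([[4.0, -2.0],
              [1.0,  1.0]])
n = A.shape[0]
# M[r, i] = (-1)^(i+r) det(A_{ir}), exactly the matrix built in both proofs.
M = np.array([[(-1) ** (i + r) * np.linalg.det(minor(A, i, r)) for i in range(n)]
              for r in range(n)])
A_inv = M / np.linalg.det(A)

print(np.allclose(A_inv, np.linalg.inv(A)))                                  # True
print(np.allclose(A @ A_inv, np.eye(n)), np.allclose(A_inv @ A, np.eye(n)))  # True True
```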
New Notation

    Symbol   Reading                                                    Example         Example Translation
    A^{-1}   The inverse of A.                                          AA^{-1} = I     The product of A with its inverse is the identity.
    A_{ij}   The sub-matrix of A with the i-th row and j-th column      A_{23}          The sub-matrix of A with the 2nd row and 3rd
             removed.                                                                   column removed.
Lecture 31

Gram-Schmidt Style

    Heeeeeeeeeeeyyyyyy, Sexy Basis!
    Or, Or, Or, Or, OrthoGonal Style!

Goals: Today, we give the definition of an orthonormal basis and prove basic properties. Then, we use the Gram-Schmidt process to prove that every vector space can be written as the span of an orthonormal basis. We will use this fact in the proof of the Spectral Theorem.
31.1 Extra Structure

In our discussion on vector spaces, we mentioned that bases are not unique. For example, consider

    V = span{ (1, 1, 1, 1, 0), (0, 1, 1, 1, 0), (0, 1, 1, 1, 1) }

(written as column vectors). The sets

    { (1, 0, 0, 0, 0), (0, 1, 1, 1, 0), (0, 0, 0, 0, 1) }

    { (1, 0, 0, 0, 1), (0, 1, 1, 1, 0), (0, 0, 0, 0, 1) }

    { (1, 1, 1, 1, 0), (0, 0, 0, 0, 1), (0, 1, 1, 1, 1) }

are all bases for V. In fact, there are infinitely many choices of bases at our disposal.

A natural question to ask is

    What is the best basis we can choose?

Before we can answer this question, we should ask ourselves,

    Why are we even bothering with choosing a different basis?

If it's not broken, why fix it? Does it even make a difference what basis we choose as long as they all span V?
It absolutely does make a difference. Generally,

Math Mantra: Suppose we can always rewrite our objects in some form that has additional structure. Then we can EXPLOIT this extra structure in our proofs.

A good example of this is in Number Theory. By the Fundamental Theorem of Arithmetic, given a natural number (≥ 2), we can always find a unique prime factorization:

    n = p_1^{α_1} p_2^{α_2} · · · p_k^{α_k}

We can focus on this form and exploit the properties of its prime components.

With subspaces, we will prove that we can always find a basis with a special structure. And next lecture, we will exploit this special structure to prove the infamous Spectral Theorem.

31.2 The Best Basis

Going back to the original question, we must first clarify what we mean by best. What properties should the best basis have? To solve this,

Math Mantra: Look for an object with nice properties and try to GENERALIZE it.

What is the best basis in the world? The standard basis, of course!

    {e_1, e_2, . . . , e_n}

In particular, the incredible property we have used a billion times (at least) is that we can instantly write any vector as a linear combination of the standard basis vectors:

    v = v_1 e_1 + v_2 e_2 + . . . + v_n e_n

Calculating the scaling coefficients is completely trivial!
But what if I gave you the subspace

    V = span{ (47, 43, 41, 37, 31), (29, 23, 19, 17, 13), (11, 7, 5, 3, 2) }

and asked you to solve for the scaling coefficients c_1, c_2, c_3 of

    (91, 90, 81, 80, 64) = c_1 (47, 43, 41, 37, 31) + c_2 (29, 23, 19, 17, 13) + c_3 (11, 7, 5, 3, 2)?

This is not obvious! Instead of an immediate answer, you would have to solve a system of the form Ax = b.

Luckily, it is indeed possible to rewrite the basis in such a way that, for any vector in that space, we can easily solve for the corresponding linear combination of basis vectors. To do so, we study the standard basis even further. Namely, we focus on the properties that

    The norm of each vector in the basis is 1.

    The dot product of any two distinct vectors is 0.

We call such a set of vectors orthonormal.
Definition. A set of vectors

    {u_1, u_2, . . . , u_n}

is orthonormal if, for any two vectors in this set,

    u_i · u_j = { 1  if i = j
                { 0  if i ≠ j

Remarkably, if the basis for V is orthonormal, then for any vector v ∈ V, we can easily solve for the linear combination of basis vectors that equals v. Specifically, let

    v = c_1 u_1 + c_2 u_2 + . . . + c_n u_n.

To solve for the i-th coefficient c_i, all we need to do is take the dot product of v with the i-th basis vector u_i.

Theorem. Let

    {u_1, u_2, . . . , u_k}

be an orthonormal basis for V ⊆ R^n. Then for any v ∈ V,

    v = c_1 u_1 + c_2 u_2 + . . . + c_k u_k

where

    c_i = v · u_i.
Proof Summary:

    Expand v as some arbitrary linear combination of the basis vectors u_i.

    Take the dot product with u_i and apply orthonormality.

Proof: By the definition of basis, there are some constants c_1, . . . , c_k such that

    v = c_1 u_1 + c_2 u_2 + . . . + c_k u_k

To find the i-th coefficient, take the dot product of both sides with u_i:

    v · u_i = (c_1 u_1 + c_2 u_2 + . . . + c_k u_k) · u_i

and distribute:

    v · u_i = c_1 u_1 · u_i + c_2 u_2 · u_i + . . . + c_k u_k · u_i.

By definition of an orthonormal basis, only u_i · u_i is non-zero, so every term on the right vanishes except c_i u_i · u_i = c_i. Thus

    c_i = v · u_i.
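A tiny numerical sketch of this theorem (NumPy assumed; the particular orthonormal pair below is just an illustration):

```python
import numpy as np

u1 = np.array([1.0, 0.0, 0.0])
u2 = np.array([0.0, 1.0, 1.0]) / np.sqrt(2)    # an orthonormal pair in R^3
v = 2.0 * u1 + 5.0 * u2                        # a vector in V = span{u1, u2}

# With an orthonormal basis, the coefficients are just dot products: c_i = v . u_i.
print(v @ u1, v @ u2)                          # 2.0 5.0
```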
We just showed that for any v ∈ V,

    v = Σ_{i=1}^{k} (v · u_i) u_i

where the u_i are members of an orthonormal basis.

That's cool. But it gets even better! For any x in the entire space R^n, the summation

    Σ_{i=1}^{k} (x · u_i) u_i

is actually the projection of x onto V!

Theorem. Let

    {u_1, u_2, . . . , u_k}

be an orthonormal basis for V ⊆ R^n. Then, for any x ∈ R^n,

    P_V(x) = Σ_{i=1}^{k} (x · u_i) u_i.
Proof Summary:

    By definition of a projection map, we must show

        Σ_{i=1}^{k} (x · u_i) u_i ∈ V        and        x - Σ_{i=1}^{k} (x · u_i) u_i ∈ V^⊥.

    The first inclusion follows from closure.

    To prove the second, take the dot product with u_i and apply orthonormality.

Proof: Recall that P_V(x) is uniquely defined by the property that, for any x,

    P_V(x) ∈ V        and        x - P_V(x) ∈ V^⊥.

Therefore, to show that

    f(x) = Σ_{i=1}^{k} (x · u_i) u_i

is the projection P_V(x), we just have to show that f(x) satisfies this projection property. Automatically,

    Σ_{i=1}^{k} (x · u_i) u_i ∈ V

by closure. So we only need to show

    x - Σ_{i=1}^{k} (x · u_i) u_i ∈ V^⊥

Equivalently, we need to check

    ( x - Σ_{i=1}^{k} (x · u_i) u_i ) · u_i = 0

for any basis vector u_i. Distribute the dot product on the (LHS) and again use orthonormality to kill terms: every (x · u_j) u_j · u_i with j ≠ i is 0, and the j = i term is (x · u_i) u_i · u_i = x · u_i, so we are left with

    x · u_i - x · u_i = 0.
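A minimal sketch of the projection formula (NumPy assumed; the basis and x below are hypothetical examples):

```python
import numpy as np

u1 = np.array([1.0, 0.0, 0.0])
u2 = np.array([0.0, 1.0, 1.0]) / np.sqrt(2)    # orthonormal basis of V
x = np.array([3.0, 1.0, -1.0])

# P_V(x) = sum of (x . u_i) u_i over the orthonormal basis.
P_V_x = (x @ u1) * u1 + (x @ u2) * u2
print(P_V_x)                                                   # [3. 0. 0.]
# The residual x - P_V(x) is orthogonal to both basis vectors, as the proof requires.
print(np.isclose((x - P_V_x) @ u1, 0), np.isclose((x - P_V_x) @ u2, 0))   # True True
```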
31.3 Gram-Schmidt Process

As you can see, orthonormal bases are awesome. But how do you convert any basis into an orthonormal basis? The process is actually pretty simple.

Suppose you had finished writing a book and you wanted to edit it. In particular, you wanted to make sure that

    No chapter overlaps with any material from the previous chapters.

    Each chapter is completely edited.

You can do the following procedure:

    Start with Chapter 1 and edit it to get Edited Chapter 1.

    With Chapter 2, look for any overlap of material with the Edited Chapter 1 and remove it. Then, edit the remains to get the Edited Chapter 2.

    For Chapter 3, look for any overlap of material with the Edited Chapter 1 and the Edited Chapter 2 and remove it. Then edit the remains to get the Edited Chapter 3.

    Generally, we remove all the overlap of Chapter j with the previous Edited Chapters 1, 2, . . . , j - 1. Then we edit what is left over to get the Edited Chapter j.
After we go through all the chapters, our book is completely edited with no overlapping material.

Simple, right? The process to convert a basis into an orthonormal basis follows exactly the same reasoning!

Given some basis for V:

    {v_1, v_2, . . . , v_k}

    Start with v_1 and normalize it to get vector u_1.

    From v_2, subtract the projection of v_2 onto span{u_1}. Normalize the difference to get vector u_2.

    From v_3, subtract the projection of v_3 onto span{u_1, u_2}. Normalize the difference to get vector u_3.

    Generally, from v_j, subtract the projection of v_j onto span{u_1, u_2, . . . , u_{j-1}}. Normalize the difference to get vector u_j.

When this process terminates, we have a set

    {u_1, u_2, . . . , u_k}

It turns out that this set is an orthonormal basis for V!
Theorem (Gram-Schmidt Process). Let

    {v_1, v_2, . . . , v_k}

be a basis for V. Set

    u_1 = v_1 / ||v_1||

and for i > 1, define

    u_i = ( v_i - P_{span{u_1, u_2, . . . , u_{i-1}}}(v_i) ) / || v_i - P_{span{u_1, u_2, . . . , u_{i-1}}}(v_i) ||

Then the set

    {u_1, u_2, . . . , u_k}

is an orthonormal basis for V.

Proof Summary:

    Spanning and Existence (Induction):

        Base Case: Obvious.

        Inductive Step, Existence: Suppose not. Then

            v_i - P_{span{u_1, u_2, . . . , u_{i-1}}}(v_i) = 0.

          Use projection properties and the inductive hypothesis to show

            v_i ∈ span{v_1, v_2, . . . , v_{i-1}}.

          This contradicts that the v's form a basis.

        Inductive Step, Spanning: By the inductive hypothesis, it suffices to prove

            u_i ∈ span{v_1, v_2, . . . , v_i}        and        v_i ∈ span{u_1, u_2, . . . , u_i}

          Expand the definition of u_i in each case and argue by closure.

    Basis: We know the dimension is k and we have k spanning vectors.

    Orthonormal:

        In our construction, we divide by the norm in each step. So ||u_i|| = 1.

        For i < j, expand the higher index term of u_i · u_j. Look at the numerator

            u_i · v_j - u_i · P_{span{u_1, u_2, . . . , u_{j-1}}}(v_j)

          Use the swapping property of projections to rewrite this as

            u_i · v_j - P_{span{u_1, u_2, . . . , u_{j-1}}}(u_i) · v_j.

          But this is 0 since P_{span{u_1, u_2, . . . , u_{j-1}}}(u_i) = u_i.
Proof: We have to be really careful! This is because the above construction may not make sense! Namely, we have to make sure we never divide by 0!

Instead of immediately assuming that

    u_1, u_2, . . . , u_k

already exist, we are going to do induction on each step of the construction and show that the next u_i exists. For this to work, we interweave it with our spanning proof.

Spanning and Existence

We do induction on the i-th step of the construction to prove that the property Q(i),

    Q(i):  u_1, u_2, . . . , u_i exist¹  and  span{v_1, v_2, . . . , v_i} = span{u_1, u_2, . . . , u_i},

holds for each i ≤ k.

¹ If you understand strong induction, then you can simplify this line to "u_i exists."

Base Case, i = 1

Since v_1 is a member of a basis, v_1 ≠ 0, so

    u_1 = v_1 / ||v_1||

exists and

    span{v_1} = span{u_1}.

Thus, Q(1) is true.
Inductive Step, Existence

Assume Q(i - 1): u_1, u_2, . . . , u_{i-1} exist and

    span{v_1, v_2, . . . , v_{i-1}} = span{u_1, u_2, . . . , u_{i-1}}.

First, we need to show u_i exists. Suppose not. That means

    || v_i - P_{span{u_1, u_2, . . . , u_{i-1}}}(v_i) || = 0.

Then the argument of the norm must be zero. This implies

    v_i = P_{span{u_1, u_2, . . . , u_{i-1}}}(v_i).

Remember, by the definition of projection, this means

    v_i ∈ span{u_1, u_2, . . . , u_{i-1}}.

By our induction hypothesis,

    v_i ∈ span{v_1, v_2, . . . , v_{i-1}}

But we assumed the v's formed a basis, a contradiction. Therefore, u_i exists.
Inductive Step, Spanning

By our inductive hypothesis, to prove

    span{v_1, v_2, . . . , v_i} = span{u_1, u_2, . . . , u_i}

we only need to show that the new vectors satisfy

    u_i ∈ span{v_1, v_2, . . . , v_i}        and        v_i ∈ span{u_1, u_2, . . . , u_i}.

To prove the first set inclusion, consider

    u_i = ( v_i - P_{span{u_1, u_2, . . . , u_{i-1}}}(v_i) ) / || v_i - P_{span{u_1, u_2, . . . , u_{i-1}}}(v_i) ||

and use the inductive hypothesis to rewrite the numerator as

    v_i - P_{span{v_1, . . . , v_{i-1}}}(v_i),

a difference of two vectors in span{v_1, . . . , v_i}. By closure,

    u_i ∈ span{v_1, v_2, . . . , v_i}.

To prove the second set inclusion, consider again

    u_i = ( v_i - P_{span{u_1, u_2, . . . , u_{i-1}}}(v_i) ) / || v_i - P_{span{u_1, u_2, . . . , u_{i-1}}}(v_i) ||

and isolate v_i:

    v_i = || v_i - P_{span{u_1, u_2, . . . , u_{i-1}}}(v_i) || · u_i  +  P_{span{u_1, u_2, . . . , u_{i-1}}}(v_i),

a sum of two vectors in span{u_1, u_2, . . . , u_i}. By closure,

    v_i ∈ span{u_1, u_2, . . . , u_i}.
Basis

Since the dimension of V is k and we proved

    span{v_1, v_2, . . . , v_k} = span{u_1, u_2, . . . , u_k},

we automatically know

    {u_1, u_2, . . . , u_k}

is a basis by our basis properties.

Orthonormal

Because we divide by the norm in each step, we automatically know that

    u_1, u_2, . . . , u_k

are unit vectors. Therefore, we only need to argue that they are pairwise orthogonal.
Let i < j and consider

    u_i · u_j.

Expand the definition of only the vector with the higher index (i.e., u_j):

    u_i · ( v_j - P_{span{u_1, u_2, . . . , u_{j-1}}}(v_j) ) / || v_j - P_{span{u_1, u_2, . . . , u_{j-1}}}(v_j) ||

Now we need only show

    u_i · ( v_j - P_{span{u_1, u_2, . . . , u_{j-1}}}(v_j) ) = 0.

Distribute the dot product

    u_i · v_j - u_i · P_{span{u_1, u_2, . . . , u_{j-1}}}(v_j)

and use the swapping property of projections over dot products to get

    u_i · v_j - P_{span{u_1, u_2, . . . , u_{j-1}}}(u_i) · v_j.

But i < j, implying

    u_i ∈ span{u_1, u_2, . . . , u_{j-1}}

Thus, the projection map simply spits out u_i, and the expression becomes

    u_i · v_j - u_i · v_j = 0.
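Before moving on, here is a minimal Gram-Schmidt sketch following the theorem verbatim: subtract the projection onto the previous u's, then normalize (NumPy assumed; the starting basis is just an example):

```python
import numpy as np

def gram_schmidt(vectors):
    """vectors: linearly independent 1-D arrays; returns an orthonormal list."""
    us = []
    for v in vectors:
        w = v - sum((v @ u) * u for u in us)   # remove the projection onto span of earlier u's
        us.append(w / np.linalg.norm(w))       # normalize (nonzero since the v's are independent)
    return us

basis = [np.array([1.0, 1.0, 0.0]),
         np.array([1.0, 0.0, 1.0]),
         np.array([0.0, 1.0, 1.0])]
U = gram_schmidt(basis)
G = np.array([[u @ w for w in U] for u in U])
print(np.allclose(G, np.eye(3)))               # True: the u's are orthonormal
```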
Lecture 32

Spooky Spectral Theorem

    Right now I'm looking at you and I can't believe,
    I now know, oh oh,
    I now know you're beautiful.
    Whoa, oh oh,
    But that's what makes math beautiful!
        - Same Direction

Goals: First, we introduce eigen-decompositions and their applications. We then show how to find such a decomposition: to do this, we introduce eigenvectors and eigenvalues. Unfortunately, not all square matrices have such a decomposition. Thus, we give two different sufficient conditions that guarantee the existence of this decomposition. The first condition is that the matrix has distinct eigenvalues. The second is that the matrix is symmetric (Spectral Theorem).
32.1 Rewriting Matrices

Last lecture, we talked about putting a basis in a nice form. Particularly, we can always rewrite a vector space as a span of an orthonormal basis. But how about matrices? Do we have nice ways to rewrite them?

Absolutely! And our method of decomposition follows the same philosophy as prime factorization:

    n = p_1^{α_1} p_2^{α_2} · · · p_k^{α_k}.

Here, we break n into a product of numbers that have nice properties (namely, being prime). Likewise, we can decompose a matrix into a product of matrix factors, where each factor has a nice property. Here are a few famous matrix decompositions:

    QR Factorization
    LU Decomposition
    Singular Value Decomposition
    Jordan Canonical Form
    Eigen-decomposition

The one we will focus on today will be the eigen-decomposition. Namely, we can break a matrix A into

    A = SDS^{-1}

where S is invertible and D is diagonal.¹ But there is a catch. Not all matrices have an eigen-decomposition!

Definition. We call an n × n matrix A diagonalizable if there exists an invertible matrix S and a diagonal matrix D such that

    A = SDS^{-1}

In this case, we say that SDS^{-1} is the eigen-decomposition of A.

But why in the world would we care about eigen-decompositions?

This decomposition is incredibly important, especially when you hit Math 53H. In particular, you will use this method to solve linear systems of differential equations of the form²

    x'(t) = Ax

where A is some fixed matrix.

However, Math 53H is at least another 5 months away, so here is a more accessible application: suppose you wanted to compute the product of A multiplied by itself n times:

    A^n = A · A · A · · · A    (n times)

Normally, this takes quite a bit of work. But suppose A is diagonalizable:

    A = SDS^{-1}

Then

    A^n = SDS^{-1} SDS^{-1} SDS^{-1} · · · SDS^{-1}    (n times)

¹ Remember, we say a matrix D = (d_{ij}) is diagonal if all of its entries outside the main diagonal are zero, i.e. d_{ij} = 0 when i ≠ j.
² FYI, this is simply the multivariable extension of the differential equation dx/dt = ax.
Cancelling each of the inner S^{-1}S terms we get

    A^n = SD^nS^{-1}.

And the power of D in the middle is easy to deal with. In general, taking powers of a diagonal matrix is very easy! Simply take the corresponding powers of the diagonal entries!

Example. We can easily calculate A^{100} where

    A = [ 4  -2 ]
        [ 1   1 ]

We can find an eigen-decomposition for A:

    A = [ 4  -2 ]  =  [ 1  2 ] [ 2  0 ] [ -1   2 ]
        [ 1   1 ]     [ 1  1 ] [ 0  3 ] [  1  -1 ]
                         S        D        S^{-1}

Then,

    A^{100} = [ 1  2 ] [ 2^{100}     0     ] [ -1   2 ]  =  [ 2·3^{100} - 2^{100}    2^{101} - 2·3^{100} ]
              [ 1  1 ] [    0     3^{100}  ] [  1  -1 ]     [   3^{100} - 2^{100}    2^{101} - 3^{100}   ]
                 S           D^{100}           S^{-1}
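A quick numerical check of this example (NumPy assumed; a smaller power is used only to keep the floating-point comparison comfortable):

```python
import numpy as np

A = np.array([[4.0, -2.0],
              [1.0,  1.0]])
S = np.array([[1.0, 2.0],
              [1.0, 1.0]])
D = np.diag([2.0, 3.0])
S_inv = np.linalg.inv(S)

print(np.allclose(A, S @ D @ S_inv))                         # the decomposition itself
print(np.allclose(np.linalg.matrix_power(A, 20),
                  S @ np.diag([2.0**20, 3.0**20]) @ S_inv))  # A^20 = S D^20 S^(-1)
```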
But don't think that eigen-decomposition is only useful for practical calculations. We can also use it to prove cool theoretical results. For example, eigen-decompositions can be used to derive an explicit formula for the Fibonacci numbers.
Example. Consider the Fibonacci sequence

    F_n = { 0                  if n = 0
          { 1                  if n = 1
          { F_{n-1} + F_{n-2}  if n > 1

Then the n-th Fibonacci number is explicitly

    F_n = (1/√5) [ ((1 + √5)/2)^n - ((1 - √5)/2)^n ].

The key idea is to consider vectors of consecutive Fibonacci numbers

    (F_1, F_0), (F_2, F_1), (F_3, F_2), (F_4, F_3), (F_5, F_4), . . . , (F_{i+1}, F_i), . . .

(written as column vectors).
By the Fibonacci relation, to go from one pair to the next we just need to do a matrix multiplication:

    [ 1  1 ] [ F_{i+1} ]  =  [ F_{i+2} ]
    [ 1  0 ] [  F_i    ]     [ F_{i+1} ]

In particular, starting from

    [ F_1 ]  =  [ 1 ]
    [ F_0 ]     [ 0 ]

we can repeatedly multiply on the left by

    [ 1  1 ]
    [ 1  0 ]

to get (F_{n+1}, F_n). In fact, we only need to do this n times:

    [ 1  1 ]^n [ 1 ]  =  [ F_{n+1} ]                                                           (⋆)
    [ 1  0 ]   [ 0 ]     [  F_n    ]
Using the eigen-decomposition

    [ 1  1 ]  =  [ (1-√5)/2   (1+√5)/2 ] [ (1-√5)/2      0     ] [ -1/√5    (1+√5)/(2√5) ]
    [ 1  0 ]     [    1          1     ] [    0      (1+√5)/2  ] [  1/√5   -(1-√5)/(2√5) ]
                           S                        D                      S^{-1}

we can simplify the left side of (⋆) as

    S [ ((1-√5)/2)^n        0        ] S^{-1} [ 1 ]
      [      0        ((1+√5)/2)^n  ]        [ 0 ].

Multiplying from the right, this can be simplified further as

    S [ ((1-√5)/2)^n        0        ] [ -1/√5 ]  =  S [ -(1/√5) ((1-√5)/2)^n ]  =  [ (1/√5) ( ((1+√5)/2)^{n+1} - ((1-√5)/2)^{n+1} ) ]
      [      0        ((1+√5)/2)^n  ] [  1/√5 ]       [  (1/√5) ((1+√5)/2)^n ]     [ (1/√5) ( ((1+√5)/2)^n     - ((1-√5)/2)^n     ) ]

Therefore, (⋆) says

    [ (1/√5) ( ((1+√5)/2)^{n+1} - ((1-√5)/2)^{n+1} ) ]     [ F_{n+1} ]
    [ (1/√5) ( ((1+√5)/2)^n     - ((1-√5)/2)^n     ) ]  =  [  F_n    ]

Equating the second component,

    F_n = (1/√5) [ ((1 + √5)/2)^n - ((1 - √5)/2)^n ].
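A short sketch checking that the matrix recursion (⋆) and the closed form agree (plain Python; the function names are my own):

```python
def fib_matrix(n):
    """F_n via n multiplications by [[1,1],[1,0]] applied to (F_1, F_0) = (1, 0)."""
    a, b = 1, 0                      # (a, b) plays the role of (F_{i+1}, F_i)
    for _ in range(n):
        a, b = a + b, a              # one multiplication by [[1, 1], [1, 0]]
    return b

def fib_closed(n):
    """F_n from the explicit formula, rounded to the nearest integer."""
    s5 = 5 ** 0.5
    return round((((1 + s5) / 2) ** n - ((1 - s5) / 2) ** n) / s5)

print(all(fib_matrix(n) == fib_closed(n) for n in range(30)))   # True
```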
32.2 Eigenvectors

Hopefully, you are convinced that eigen-decompositions are useful. But,

    How do you even calculate the eigen-decomposition (assuming that it exists)?

For that, we need a discussion of eigenvectors and eigenvalues.

Definition. Let v be a non-zero vector. Then we say that v is an eigenvector of A corresponding to eigenvalue λ ∈ R if

    Av = λv.

Eigenvectors and eigenvalues have a very simple geometric interpretation. Intuitively, eigenvectors are the vectors that have the same direction after multiplication by A. The eigenvalue is just the scale factor.

Suppose you can find eigenvectors v_1, v_2, . . . , v_n corresponding to eigenvalues λ_1, λ_2, . . . , λ_n, respectively, such that

    v_1, v_2, . . . , v_n

are linearly independent. Then, when we concatenate the v's into a single matrix and multiply on the left by A,

    A [ v_1  v_2  . . .  v_n ]  =  [ Av_1  Av_2  . . .  Av_n ]
By the definition of eigenvectors, the (RHS) is

    [ λ_1 v_1   λ_2 v_2   . . .   λ_n v_n ].

After pulling out a diagonal matrix, this is

    [ v_1  v_2  . . .  v_n ] [ λ_1   0    0   . . .   0  ]
                             [  0   λ_2   0   . . .   0  ]
                             [  0    0   λ_3  . . .   0  ]
                             [  :    :    :     .     :  ]
                             [  0    0    0   . . .  λ_n ]

so we can rewrite our equality as

    AS = SD

where S = [ v_1  v_2  . . .  v_n ] and D is the diagonal matrix above. Because the columns (eigenvectors) v_1, . . . , v_n are linearly independent, S is invertible! That means we can multiply both sides on the right by S^{-1} to get

    A = SDS^{-1}

AWESOME!
But how do we find the eigenvectors and eigenvalues?

The trick is to solve for the eigenvalues first. To do this, we use the determinant to reduce the problem to solving for the roots of a polynomial.

Theorem. There exists an eigenvector v of A corresponding to eigenvalue λ if and only if

    det(A - λI) = 0

Proof Summary:

    Rewrite as (A - λI)v = 0.

    v must be non-zero so (A - λI) is non-invertible.

    (A - λI) non-invertible is equivalent to det(A - λI) = 0.

Proof: Suppose there exist v and λ such that

    Av = λv.

Equivalently,

    Av - λv = 0.

Pulling out the v, we have

    (A - λI)v = 0.

This means (A - λI) is non-invertible! Why? Suppose (A - λI)^{-1} did exist. Then multiplying both sides on the left by (A - λI)^{-1},

    (A - λI)^{-1}(A - λI)v = (A - λI)^{-1} 0,

would give us

    v = 0.

However, v is an eigenvector and eigenvectors are non-zero by definition, a contradiction.

By determinant properties, non-invertibility of (A - λI) is equivalent to

    det(A - λI) = 0

Now, the existence of eigenvalues and eigenvectors is equivalent to solving

    det(A - λI) = 0

Observe that this is an n-degree polynomial in terms of the variable λ:

    λ^n - b_{n-1} λ^{n-1} + . . . + b_2 λ^2 + b_1 λ + b_0.

By solving for the roots of this polynomial, you can get values for λ and plug them back into

    (A - λI)v = 0.

But we know how to solve a system of this form!
Example. Solve for all eigenvectors of

    A = [ 4  -2 ]
        [ 1   1 ].

Expand

    det(A - λI) = det [ 4-λ   -2  ]  =  0
                      [  1   1-λ ]

to get

    λ^2 - 5λ + 6 = 0.

Therefore, λ = 2 or λ = 3. These are the eigenvalues.

λ = 2

Plugging into (A - λI)v = 0,

    [ 2  -2 ] [ v_1 ]  =  [ 0 ]
    [ 1  -1 ] [ v_2 ]     [ 0 ].

Solving,

    v_1 = v_2

so

    v = [ v_2 ]  =  v_2 [ 1 ]
        [ v_2 ]         [ 1 ]

Since v_2 can be any real number, we conclude that any vector of the form

    t [ 1 ]
      [ 1 ]

(with t ≠ 0) is an eigenvector of A corresponding to the eigenvalue λ = 2.

λ = 3

Plugging into (A - λI)v = 0,

    [ 1  -2 ] [ v_1 ]  =  [ 0 ]
    [ 1  -2 ] [ v_2 ]     [ 0 ]

Solving,

    v_1 = 2v_2.

This allows us to reduce:

    v = [ 2v_2 ]  =  v_2 [ 2 ]
        [ v_2  ]         [ 1 ].

Thus, any vector of the form

    t [ 2 ]
      [ 1 ]

(with t ≠ 0) is an eigenvector of A corresponding to the eigenvalue λ = 3.

If you go back to the first example of this lecture, this explains how we found the eigen-decomposition

    [ 4  -2 ]  =  [ 1  2 ] [ 2  0 ] [ -1   2 ]
    [ 1   1 ]     [ 1  1 ] [ 0  3 ] [  1  -1 ]
       A             S        D        S^{-1}

We just used the most obvious eigenvectors corresponding to the eigenvalues 2 and 3, respectively.
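A quick numerical check of the example (NumPy assumed): the computed eigenpairs satisfy Av = λv, and NumPy's own routine finds the same eigenvalues.

```python
import numpy as np

A = np.array([[4.0, -2.0],
              [1.0,  1.0]])
for lam, v in [(2.0, np.array([1.0, 1.0])), (3.0, np.array([2.0, 1.0]))]:
    print(lam, np.allclose(A @ v, lam * v))    # True for both pairs

# np.linalg.eig returns eigenvalues 2 and 3 (in some order);
# its eigenvectors agree with ours up to scaling.
print(np.linalg.eig(A)[0])
```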
32.3 Seeking Sufficiency

Even though we now know how to compute eigenvectors and eigenvalues, an arbitrary matrix is not necessarily diagonalizable!

Take another look at the eigen-decomposition process. For this to work, we required that the eigenvectors

    v_1, v_2, . . . , v_n

are linearly independent. But it may be impossible to find such a set! So what do we do?

Math Mantra: If you want some property to hold, look for sufficient conditions that GUARANTEE it does.

Luckily, there do exist sufficient conditions that will guarantee that we can find a linearly independent set of eigenvectors. Today, we will focus on two conditions:

    A has n distinct eigenvalues.

    A is symmetric.

The first is a pretty easy proof, involving a simple algebraic shenanigan. The second will involve a ton more work. But each will use induction.

NOTE: In both proofs, the n = 1 case is immediate. I will note this in the proof summary. However, I will prove the case n = 2 so you can see the main ideas.
Theorem. Let A be an n × n matrix. If A has n distinct eigenvalues, then there exists a set of n linearly independent eigenvectors of A.

Proof Summary:

    Base Case, k = 1: Immediate.

    Inductive Step:

        Suppose a linear combination of k + 1 of the eigenvectors equals 0. We want to show that each coefficient c_i = 0.

        Multiply this linear combination by the (k + 1)-th eigenvalue λ_{k+1} to get one equation.

        Multiply the original linear combination by A to get a second equation.

        Subtract the first equation from the second to kill the v_{k+1} term.

        Use the inductive hypothesis and the distinctness of the eigenvalues to show

            c_1 = c_2 = . . . = c_k = 0.

        Conclude c_{k+1} = 0.
Proof: We will use induction to prove:

    P(k): If A has eigenvectors v_1, v_2, . . . , v_k such that their corresponding λ_i are all distinct, then v_1, v_2, . . . , v_k are linearly independent.

For then P(n) yields the theorem.

Base Case, k = 2

Consider two eigenvectors v_1 and v_2 corresponding to distinct eigenvalues λ_1 and λ_2, respectively. Assume

    c_1 v_1 + c_2 v_2 = 0.                                                                     (⋆)

Multiplying both sides of (⋆) by λ_2 yields

    c_1 λ_2 v_1 + c_2 λ_2 v_2 = 0.

We can also multiply both sides of (⋆) by A to get

    c_1 Av_1 + c_2 Av_2 = 0,

which by definition of an eigenvector is

    c_1 λ_1 v_1 + c_2 λ_2 v_2 = 0.

Subtracting the second equation from the first,

    c_1 (λ_2 - λ_1) v_1 = 0

Because λ_1 ≠ λ_2 and v_1 is non-zero (by definition of an eigenvector), we must have c_1 = 0. Plugging back into (⋆),

    c_1 v_1 + c_2 v_2 = 0   with   c_1 v_1 = 0,

we conclude c_2 = 0. Thus, P(2) holds.
Inductive Step:

Assume P(k) and consider k + 1 eigenvectors v_1, . . . , v_{k+1} with corresponding distinct eigenvalues λ_1, . . . , λ_{k+1}. Assume

    c_1 v_1 + c_2 v_2 + . . . + c_{k+1} v_{k+1} = 0.                                           (⋆)

Multiplying both sides of (⋆) by λ_{k+1} yields

    c_1 λ_{k+1} v_1 + c_2 λ_{k+1} v_2 + . . . + c_{k+1} λ_{k+1} v_{k+1} = 0.

We can also multiply both sides of (⋆) by A

    c_1 Av_1 + c_2 Av_2 + . . . + c_{k+1} Av_{k+1} = 0

and apply the eigenvector definition to get

    c_1 λ_1 v_1 + c_2 λ_2 v_2 + . . . + c_{k+1} λ_{k+1} v_{k+1} = 0.

Subtracting the second equation from the first, the v_{k+1} term dies and we are left with

    c_1 (λ_{k+1} - λ_1) v_1 + c_2 (λ_{k+1} - λ_2) v_2 + . . . + c_k (λ_{k+1} - λ_k) v_k = 0.

By the inductive hypothesis, v_1, v_2, . . . , v_k are linearly independent. Therefore,

    c_1 (λ_{k+1} - λ_1) = c_2 (λ_{k+1} - λ_2) = . . . = c_k (λ_{k+1} - λ_k) = 0.

But λ_{k+1} is distinct from the other λ_i's; thus,

    c_1 = c_2 = . . . = c_k = 0.

Plugging back into (⋆) yields

    c_{k+1} = 0.
32.4 Spectral Theorem

Unlike the previous sufficient condition, this one is not as easy to prove. However, it is a very fundamental result that

    If an n × n matrix A is symmetric, then there exists a linearly independent set of n eigenvectors.

In fact, it gets even better:

    If an n × n matrix A is symmetric, then there exists an orthonormal¹ set of n eigenvectors.

This is known as the Spectral Theorem. But how do we prove it?

First you need to notice that we are dealing with a symmetric matrix. We've first seen these matrices in Lecture 23: it is a required condition in the definition of a quadratic form:

    Q(x) = Σ_{i,j=1}^{n} a_{ij} x_i x_j

Moreover, in that lecture, we showed that quadratic forms have a minimum on the unit sphere. We all know that the gradient is 0 at a minimum.

The first step to proving the Spectral Theorem is observing a magical fact: when you differentiate

    Q( x / ||x|| )

and plug in the minimum, out pops the eigenvector relationship! In fact, the minimum is achieved at the eigenvector and the minimum value is the corresponding eigenvalue.

This gives you the first eigenvector. We need n - 1 more, so we use induction.

For the inductive step, the trick is to remove the first eigenvector from the space: formally, we look at the orthogonal complement. By Gram-Schmidt, the complement has a basis of n - 1 orthonormal vectors in R^n:

    w_1, w_2, . . . , w_{n-1}.

Using this basis, we are going to build a smaller (n - 1) × (n - 1) symmetric matrix and apply the inductive hypothesis. This will give us n - 1 eigenvectors in R^{n-1}.

Here's the kicker: consider an eigenvector of the smaller matrix:

    (v_1, v_2, . . . , v_{n-1})

We can use this eigenvector's coordinates to form a linear combination of the basis vectors:

    v_1 w_1 + v_2 w_2 + . . . + v_{n-1} w_{n-1}

Remarkably, this is an eigenvector of the bigger matrix!

Before we begin, here are a few notes:

¹ As an exercise, prove that orthonormality implies linear independence.
Rewriting Q(x)

We will need to compute the k-th partial derivative of the quadratic form

    Q(x) = Σ_{i,j=1}^{n} a_{ij} x_i x_j.

If you expand it directly, the derivative is obvious. However, we are professionals now! As professionals, we would like to rewrite the quadratic so that the differentiation is immediate. Therefore, split Q(x) into a sum over three cases:

    Q(x) = Σ_{i=j} a_{ij} x_i x_j + Σ_{i<j} a_{ij} x_i x_j + Σ_{j<i} a_{ij} x_i x_j.

Exploiting the symmetry of A, we can rewrite the last term as

    Σ_{j<i} a_{ij} x_i x_j = Σ_{j<i} a_{ji} x_i x_j

Switching the indexing variables, this is

    Σ_{j<i} a_{ji} x_i x_j = Σ_{i<j} a_{ij} x_i x_j.

Moreover the first term in the sum simply goes from 1 to n. Therefore, we can rewrite Q(x) as

    Q(x) = Σ_{i=1}^{n} a_{ii} x_i^2 + Σ_{i<j} 2 a_{ij} x_i x_j.

This form of Q(x) makes differentiation a breeze! Differentiating the first term

    Σ_{i=1}^{n} a_{ii} x_i^2

with respect to x_k kills everything except the k-th term, giving us

    2 a_{kk} x_k.

To differentiate the second,

    Σ_{i<j} 2 a_{ij} x_i x_j

notice that all terms are killed except the ones over pairs involving k:

    (1, k)   (2, k)   (3, k)   . . .   (k - 1, k)
    (k, k + 1)   (k, k + 2)   (k, k + 3)   . . .   (k, n)

Therefore, differentiation reduces this sum to

    Σ_{i=1, i≠k}^{n} 2 a_{ik} x_i.
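Taken together, the two partials say ∂Q/∂x_k = 2a_{kk}x_k + Σ_{i≠k} 2a_{ik}x_i, i.e. ∇Q(x) = 2Ax for symmetric A. Here is a small numerical sanity check of that fact against a finite-difference gradient (NumPy assumed; the matrix and point are arbitrary examples):

```python
import numpy as np

A = np.array([[2.0, 1.0, 0.5],
              [1.0, 3.0, 1.0],
              [0.5, 1.0, 1.0]])          # a symmetric matrix
Q = lambda x: x @ A @ x
x = np.array([0.3, -1.2, 0.7])

h = 1e-6
grad_fd = np.array([(Q(x + h * e) - Q(x - h * e)) / (2 * h) for e in np.eye(3)])
print(np.allclose(grad_fd, 2 * A @ x))   # True: the computed partials are 2(Ax)_k
```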
Base case

Again, you can skip the base case in the following proof since the result is immediate with n = 1. However, I advise you to understand the n = 2 case first. It gives some (but not all!) of the main ideas and will help paint a clearer picture. When you are ready, move on to the inductive step.

Theorem. Let A be an n × n matrix. If A is symmetric, then there exists a set of n orthonormal eigenvectors of A.

Proof Summary:

    Base Case, n = 1: Immediate.

    Inductive Step, First Eigenvector:

        Consider the corresponding quadratic form over the unit sphere and extend to a function S defined on R^n \ {0}.

        S achieves a minimum value of m at the point α.

        The gradient of S is 0 at α. Expand this system to show that α is an eigenvector corresponding to eigenvalue m.

    Inductive Step, Remaining Eigenvectors:

        Consider the orthogonal complement V^⊥ of V = span{α}. It has an orthonormal basis w_1, w_2, . . . , w_{n-1} by Gram-Schmidt.

        Take the image of the w's under left multiplication by A. By the homework, each of these mapped vectors can be written as a linear combination of the original w's.

        Show that the matrix of coefficients B is symmetric and apply the inductive hypothesis to get eigenvectors in R^{n-1}.

        Use the coefficients of each eigenvector of B to get a corresponding linear combination of w's. Call this combination q_i.

        Show that the q_i are eigenvectors of the larger matrix A: directly multiply on the left by A and use summation shenanigans. Take particular note that the innermost summation will collapse into the eigenvector relation for the k-th component of B.

        Directly show that

            α, q_1, . . . , q_{n-1}

          is orthonormal.
Proof: We will use induction to prove:

    P(n): If A is an n × n symmetric matrix, then there exists a set of n orthonormal eigenvectors of A.

Base Case, n = 2. First Eigenvector:

Consider the corresponding quadratic form

    Q(x) = a_{11} x_1^2 + 2 a_{12} x_1 x_2 + a_{22} x_2^2

over the unit sphere, and extend it to a new function S : R^2 \ {0} → R that normalizes the input vector x and then applies Q to the resulting vector on the unit sphere:

    S(x) = Q( x / ||x|| ) = ( a_{11} x_1^2 + 2 a_{12} x_1 x_2 + a_{22} x_2^2 ) / ||x||^2
         = ( a_{11} x_1^2 + 2 a_{12} x_1 x_2 + a_{22} x_2^2 ) / ( x_1^2 + x_2^2 ).

Since the unit sphere is closed and bounded, it follows by the Extreme Value Theorem that Q achieves its minimum value of m at some unit vector α. Since the images of S and Q are the same, m is also the minimum value of S. In particular, S achieves this minimum value m at α (and more generally, at any positive scalar multiple of α). Therefore, the gradient of S is 0 at α, giving us the system

    ∂S/∂x_1 (α) = 0
    ∂S/∂x_2 (α) = 0                                                                            (⋆)

We would like to expand this system. First, calculate the derivatives on the left by applying the single-variable quotient rule:

    ∂S/∂x_1 (x) = [ (x_1^2 + x_2^2)(2a_{11}x_1 + 2a_{12}x_2) - (a_{11}x_1^2 + 2a_{12}x_1x_2 + a_{22}x_2^2)(2x_1) ] / (x_1^2 + x_2^2)^2

    ∂S/∂x_2 (x) = [ (x_1^2 + x_2^2)(2a_{22}x_2 + 2a_{12}x_1) - (a_{11}x_1^2 + 2a_{12}x_1x_2 + a_{22}x_2^2)(2x_2) ] / (x_1^2 + x_2^2)^2

By construction, α is on the unit sphere, so

    ||α||^2 = α_1^2 + α_2^2 = 1.

Since S and Q agree on the unit sphere,

    S(α) = Q(α) = a_{11} α_1^2 + 2 a_{12} α_1 α_2 + a_{22} α_2^2 = m
Plugging α into the derivatives, the factor α_1^2 + α_2^2 is 1 and the quadratic form in the numerator is m, so the partials simplify to

    ∂S/∂x_1 (α) = (2a_{11}α_1 + 2a_{12}α_2) - m(2α_1)
    ∂S/∂x_2 (α) = (2a_{22}α_2 + 2a_{12}α_1) - m(2α_2).

Substituting into the original system (⋆), isolate the equations

    a_{11}α_1 + a_{12}α_2 = mα_1
    a_{22}α_2 + a_{12}α_1 = mα_2

But this is just

    A [ α_1 ]  =  m [ α_1 ]
      [ α_2 ]       [ α_2 ],

so α is an eigenvector of A corresponding to the eigenvalue m.
Base Case, n = 2. Remaining Eigenvectors:

Let

    V = span{α}

By the Gram-Schmidt process, we know we can find an orthonormal basis for V^⊥. Particularly in the case n = 2, V^⊥ has dimension 1 so we have the single unit vector

    w_1.

Consider the operation of multiplication on the left by A:

    x ↦ Ax

On this week's homework, you will prove that

    Homework. Given a symmetric matrix A, if x ∈ V^⊥ then Ax ∈ V^⊥.

Therefore, we can write the image of the basis of V^⊥ (under left multiplication by A) in terms of the basis of V^⊥:

    A w_1 = b_{11} w_1.

Thus, w_1 is an eigenvector corresponding to eigenvalue b_{11}.

But α ∈ V and w_1 ∈ V^⊥, so they are orthogonal. Moreover, ||α|| = 1 by construction and ||w_1|| = 1 by Gram-Schmidt. In conclusion,

    α, w_1

is an orthonormal set of eigenvectors of A.
Inductive step, First Eigenvector:

By the note preceding this proof, we can rewrite the quadratic form as

    Q(x) = Σ_{i=1}^{n} a_{ii} x_i^2 + Σ_{i<j} 2 a_{ij} x_i x_j

As before, consider this quadratic form on the unit sphere and extend to a function S : R^n \ {0} → R defined by

    S(x) = Q( x / ||x|| ) = ( Σ_{i=1}^{n} a_{ii} x_i^2 + Σ_{i<j} 2 a_{ij} x_i x_j ) / ||x||^2
         = ( Σ_{i=1}^{n} a_{ii} x_i^2 + Σ_{i<j} 2 a_{ij} x_i x_j ) / ( x_1^2 + x_2^2 + . . . + x_n^2 ).

As before, Q achieves its minimum m at some unit vector α. It follows that the minimum of S is also m, and S achieves its minimum m at α. Thus, the gradient of S must be 0 at α, giving the system

    ∂S/∂x_1 (α) = 0
    ∂S/∂x_2 (α) = 0
        :
    ∂S/∂x_n (α) = 0                                                                            (⋆)

The goal is to expand this system: once we do, an eigenvector magically pops out!
First, use the single-variable quotient rule to calculate the k-th partial:

    ∂S/∂x_k (x) = [ (x_1^2 + . . . + x_n^2)( 2a_{kk}x_k + Σ_{i=1, i≠k}^{n} 2a_{ik}x_i )
                    - (2x_k)( Σ_{i=1}^{n} a_{ii}x_i^2 + Σ_{i<j} 2a_{ij}x_ix_j ) ] / (x_1^2 + . . . + x_n^2)^2.

Observe that ||α|| = 1 since it lies on the unit sphere. Moreover,

    Σ_{i=1}^{n} a_{ii}α_i^2 + Σ_{i<j} 2a_{ij}α_iα_j = m

This is because, by definition, S takes the value m at α and

    S(α) = ( Σ_{i=1}^{n} a_{ii}α_i^2 + Σ_{i<j} 2a_{ij}α_iα_j ) / ( α_1^2 + α_2^2 + . . . + α_n^2 ) = m,

the denominator being 1. Plugging into the k-th partial,

    ∂S/∂x_k (α) = ( 2a_{kk}α_k + Σ_{i=1, i≠k}^{n} 2a_{ik}α_i ) - (2α_k) m.

Notice that the 2a_{kk}α_k is the missing term in the summation! Therefore, recombine to get

    ∂S/∂x_k (α) = ( Σ_{i=1}^{n} 2a_{ik}α_i ) - (2α_k) m.

Substituting into system (⋆), isolate the mα_i terms to get

    Σ_{i=1}^{n} a_{i1}α_i = mα_1
    Σ_{i=1}^{n} a_{i2}α_i = mα_2
        :
    Σ_{i=1}^{n} a_{in}α_i = mα_n

which is the matrix multiplication

    [ a_{11}  a_{21}  . . .  a_{n1} ] [ α_1 ]       [ α_1 ]
    [ a_{12}  a_{22}  . . .  a_{n2} ] [ α_2 ]  =  m [ α_2 ]
    [   :       :              :    ] [  :  ]       [  :  ]
    [ a_{1n}  a_{2n}  . . .  a_{nn} ] [ α_n ]       [ α_n ]

But the left matrix is still A since it is symmetric. Therefore, α is an eigenvector of A corresponding to eigenvalue m.
Inductive step, Remaining Eigenvectors:

Let

    V = span{α}

By the Gram-Schmidt process, we know we can find an orthonormal basis for V^⊥:

    w_1, w_2, . . . , w_{n-1}.

Consider the operation of multiplying on the left by A:

    x ↦ Ax

On this week's homework, you will prove that this multiplication maps vectors of V^⊥ directly into V^⊥. In particular, this means we can write the image of the basis of V^⊥ (under left multiplication by A) in terms of the basis of V^⊥:

    A w_1     = b_{11} w_1 + b_{21} w_2 + . . . + b_{(n-1)1} w_{n-1}
    A w_2     = b_{12} w_1 + b_{22} w_2 + . . . + b_{(n-1)2} w_{n-1}
    A w_3     = b_{13} w_1 + b_{23} w_2 + . . . + b_{(n-1)3} w_{n-1}
        :
    A w_{n-1} = b_{1(n-1)} w_1 + b_{2(n-1)} w_2 + . . . + b_{(n-1)(n-1)} w_{n-1}

Consider the matrix of these coefficients

    B = [ b_{11}      b_{12}      . . .  b_{1(n-1)}     ]
        [ b_{21}      b_{22}      . . .  b_{2(n-1)}     ]
        [   :            :                   :          ]
        [ b_{(n-1)1}  b_{(n-1)2}  . . .  b_{(n-1)(n-1)} ]

Since it is (n - 1) × (n - 1), if we can prove it is symmetric, then we can apply the inductive hypothesis.

To show B is symmetric, consider another theorem from this week's homework:

    Homework. For a symmetric matrix A,

        (A w_j) · w_i = (A w_i) · w_j.

Using

    A w_i = b_{1i} w_1 + b_{2i} w_2 + . . . + b_{(n-1)i} w_{n-1},

dot product both sides with w_j to get

    (A w_i) · w_j = b_{ji}

by orthonormality. Applying the same trick with

    A w_j = b_{1j} w_1 + b_{2j} w_2 + . . . + b_{(n-1)j} w_{n-1},

dot product both sides with w_i to get

    (A w_j) · w_i = b_{ij}.

Therefore,

    b_{ij} = (A w_j) · w_i = (A w_i) · w_j = b_{ji}.

In other words, B is symmetric.
Now we can apply the induction hypothesis to B. This gives us orthonormal eigenvectors

    u_1, u_2, . . . , u_{n-1}

of B with corresponding eigenvalues λ_1, λ_2, . . . , λ_{n-1}, respectively. For any one of these eigenvectors

    u_i = (u_{1i}, u_{2i}, . . . , u_{(n-1)i})

we can check that

    q_i = u_{1i} w_1 + u_{2i} w_2 + . . . + u_{(n-1)i} w_{n-1} = Σ_{j=1}^{n-1} u_{ji} w_j

is an eigenvector of the original matrix A: multiply on the left by A

    A q_i = A ( Σ_{j=1}^{n-1} u_{ji} w_j ) = Σ_{j=1}^{n-1} u_{ji} A w_j.

Expanding our definition of A w_j, rewrite the sum as

    Σ_{j=1}^{n-1} u_{ji} ( Σ_{k=1}^{n-1} b_{kj} w_k ).

Pull in u_{ji}, switch the order of summation, and pull out w_k:

    Σ_{j=1}^{n-1} ( Σ_{k=1}^{n-1} b_{kj} u_{ji} w_k )  =  Σ_{k=1}^{n-1} ( Σ_{j=1}^{n-1} b_{kj} u_{ji} w_k )  =  Σ_{k=1}^{n-1} w_k ( Σ_{j=1}^{n-1} b_{kj} u_{ji} ).
But we have a cute way to rewrite the inner sum! Look at the definition of the i-th eigenvector of B and focus on the k-th component of the product B u_i = λ_i u_i:

    [    :       :       :              :        ] [ u_{1i} ]     [     :      ]
    [ b_{k1}  b_{k2}  b_{k3}  . . .  b_{k(n-1)}  ] [ u_{2i} ]  =  [ λ_i u_{ki} ]
    [    :       :       :              :        ] [   :    ]     [     :      ]
                       B                              u_i            λ_i u_i

This means

    Σ_{j=1}^{n-1} b_{kj} u_{ji} = λ_i u_{ki}.

Therefore,

    Σ_{k=1}^{n-1} w_k ( Σ_{j=1}^{n-1} b_{kj} u_{ji} ) = λ_i Σ_{k=1}^{n-1} u_{ki} w_k = λ_i q_i

Thus, q_i is an eigenvector of A corresponding to λ_i. So

    α, q_1, q_2, . . . , q_{n-1}

is a set of eigenvectors of A.

All that's left is to check that this set is orthonormal. But ||α|| = 1, and α ∈ V whereas each q_i ∈ V^⊥. Therefore we only need to check that the q_i are orthonormal.
Directly compute the dot product. First, expand only one term and distribute:

    q_i · q_j = ( Σ_{r=1}^{n-1} u_{ri} w_r ) · q_j = Σ_{r=1}^{n-1} ( u_{ri} w_r · q_j )

Then expand the other term and distribute:

    Σ_{r=1}^{n-1} ( u_{ri} w_r · ( Σ_{t=1}^{n-1} u_{tj} w_t ) ) = Σ_{r=1}^{n-1} Σ_{t=1}^{n-1} u_{ri} u_{tj} (w_r · w_t)

Since the w's are orthonormal, only the terms where r = t remain:

    Σ_{r=1}^{n-1} u_{ri} u_{rj}

which is just

    u_i · u_j.

But the u's are orthonormal! This means that the q's are orthonormal.
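A numerical illustration of the Spectral Theorem (NumPy assumed): a symmetric matrix has an orthonormal set of eigenvectors, which np.linalg.eigh returns as the columns of Q.

```python
import numpy as np

B = np.random.rand(4, 4)
A = (B + B.T) / 2                        # build a symmetric matrix
eigvals, Q = np.linalg.eigh(A)

print(np.allclose(Q.T @ Q, np.eye(4)))               # the columns are orthonormal
print(np.allclose(A @ Q, Q @ np.diag(eigvals)))      # each column is an eigenvector
```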
Lecture 33

Keeping up with Contractions

    Take a map of Texas and throw it into Texas.
    There's going to be some place in Texas where it lands
    that aligns with its corresponding point on the map.

Goals: Today, we study functions that map into themselves. In particular, we prove the Contraction Mapping Theorem, which guarantees, under certain conditions, the existence of a fixed point. Not only does this theorem give us a constructive method to find such a point, but it also serves as a vital step in the proof of the Inverse Function Theorem.
33.1 It's Inception All Over Again

One of the key steps in the proof of the Spectral Theorem was considering a mapping whose image was contained in its domain: f maps E into f(E) ⊆ E.

But what if we mapped the image again? Then the image of the image will lie within the original domain! Because we love Inception, we keep on mapping the images under f. This gives us a nested sequence of sets:

    E ⊇ f(E) ⊇ f(f(E)) ⊇ f(f(f(E))) ⊇ f(f(f(f(E)))) ⊇ f(f(f(f(f(E))))) ⊇ . . .

Can we prove anything about sequences of nested sets? Absolutely! Namely, we are going to prove the Contraction Mapping Theorem.

33.2 Non-Triviality of Non-Emptiness

Before we prove the Contraction Mapping Theorem, let's do an easier proof that has the same flavor. Ignore the role of f and consider just the sets. Then,

    Given a nested sequence of closed, bounded, and non-empty sets, their intersection is also closed, bounded, and non-empty.

We already know the intersection is bounded and closed. Strangely, the only thing we really have to prove is that the intersection is non-empty! Weird!

But what possible applications could this theorem have? Who cares about whether the intersection is non-empty or not?

When you continue your studies in analysis, you will see that this can actually be used to prove the incredible Heine-Borel Theorem. Also, in the final lecture, we will use this to prove that the reals are uncountable.

To prove our nested sequence is non-empty, we need to find a point x with the property that

    x is in every set of the sequence.

We use a nice trick:

Math Mantra: Suppose you are given a sequence of nested closed sets, and you want to find an x that satisfies some property. Form a sequence by taking a point from each set. As long as our sequence of points is bounded, we can apply Bolzano-Weierstrass. Hopefully, the limit will satisfy the property you need.

This trick will also be used to prove the Contraction Mapping Theorem.
Theorem. Let C_1, C_2, C_3, . . . be a nested sequence of closed, bounded, and non-empty sets in R^n:

    C_1 ⊇ C_2 ⊇ C_3 ⊇ . . .

Then the intersection

    ∩_{i∈N} C_i

is closed, bounded, and non-empty.

Proof Summary:

    Construct a sequence (x_k) where x_k ∈ C_k. Apply Bolzano-Weierstrass to get x_{n_j} → x.

    For arbitrary i,

        x_{n_i}, x_{n_{i+1}}, x_{n_{i+2}}, . . .

      is a subsequence in C_i. Thus, x ∈ C_i.

    Since i was arbitrary,

        x ∈ ∩_{i∈N} C_i.
Proof: To reiterate, the fact that the intersection is closed and bounded is immediate. We only need to prove it's non-empty.

First construct a sequence by choosing a point in each set:

    x_k ∈ C_k.

Since the sets are bounded, we automatically know (x_k) is bounded. Applying Bolzano-Weierstrass, there is a subsequence

    x_{n_j} → x.

Now consider any set C_i and notice that i ≤ n_i. Then

    x_{n_i}, x_{n_{i+1}}, x_{n_{i+2}}, . . .

is a subsequence in C_i since

    C_{n_i} ⊆ C_i
    C_{n_{i+1}} ⊆ C_i
    C_{n_{i+2}} ⊆ C_i
        :

But this sequence still converges to x, and since C_i is closed,

    x ∈ C_i.

Since C_i was arbitrary,

    x ∈ ∩_{i∈N} C_i.
33.3 Contraction Mapping Theorem

Consider again a function f on a closed set¹ E that maps into itself:

    f(E) ⊆ E

Suppose f satisfies the following property: for any two points in E, the distance between those two points is strictly smaller after being mapped by f. Moreover, suppose that the ratio² of the distance after mapping to the distance before mapping is bounded by a universal fixed scale factor α < 1:

    ||f(x) - f(y)|| ≤ α ||x - y||.

Remarkably, it follows that there must exist some point z ∈ E that gets mapped to itself:

    f(z) = z

This is pretty cool! But don't think it is as intuitively obvious as the last proof. Specifically, you should not be thinking:

    There is a point in common in all the mapped sets, so this must be true.

The fact that z is in all the images is a necessary condition. The Contraction Mapping Theorem actually says something stronger. It says that even though our f could rotate and shift our points around, there is at least one point that stays fixed by the mapping. In fact, this point is unique.

So how do we prove this? It's going to be similar to the previous proof. We are going to construct a sequence of points such that each point lies in the next nested image. However, there is a big difference: E need not be bounded! So we cannot use boundedness to instantly apply Bolzano-Weierstrass.

But there is a way around this. We do what Bane did in Dark Knight Rises: we throw a body into the river and let it ride the current as it gets closer to a full stop. To prove the Contraction Mapping Theorem, we are going to choose an arbitrary initial point and keep mapping it as the successive images get closer to a full stop.

¹ We do not need to assume E is bounded.
² Generally, for an unrestricted constant α, this type of function is called Lipschitz. In Math 52H you will learn that these are great functions to work with.
Theorem (Contraction Mapping Theorem). Let f : R^n → R^n and let E ⊆ R^n be a closed set such that f(E) ⊆ E. Suppose there exists some α ∈ (0, 1) such that, for any x, y ∈ E,

    ||f(x) - f(y)|| ≤ α ||x - y||.

Then there exists a fixed point z ∈ E:

    f(z) = z.

Proof Summary:

    Choose x_0 ∈ E. Construct the sequence (x_k) where

        x_{k+1} = f(x_k)

    Inductively apply the contraction mapping property to get the bound

        ||x_r - x_{r-1}|| ≤ α^{r-1} ||x_1 - x_0||.

    Apply this bound to a geometric series to obtain the key inequality

        ||x_l - x_k|| ≤ α^k ||x_1 - x_0|| / (1 - α)

      for l ≥ k.

    Use the key inequality to show (x_k) is bounded. Apply Bolzano-Weierstrass to get x_{n_i} → z.

    By continuity of f,

        f(x_{n_i}) → f(z).

    Use the key inequality to show

        f(x_{n_i}) → z.

    Conclude z = f(z).
Proof: Choose some x_0 ∈ E and form a sequence (x_i) by repeatedly mapping x_0 under f:

    x_1 = f(x_0)
    x_2 = f(x_1)
    x_3 = f(x_2)
        :

Generally,

    x_{k+1} = f(x_k).

Particularly, our contraction mapping tells us

    ||x_{k+1} - x_k|| = ||f(x_k) - f(x_{k-1})|| ≤ α ||x_k - x_{k-1}||.

First, we will prove that this sequence gets closer together. Let's look at the distance between two terms of the sequence,

    ||x_l - x_k||.

WLOG let l ≥ k. We can repeatedly add 0,

    || x_l - x_{l-1} + x_{l-1} - x_{l-2} + x_{l-2} - . . . + x_{k+1} - x_k ||,

and bound this by the repeated triangle inequality:

    ≤ ||x_l - x_{l-1}|| + ||x_{l-1} - x_{l-2}|| + . . . + ||x_{k+1} - x_k||

Observe that each term of this sum is bounded: by inductively applying the contraction mapping property,

    ||x_r - x_{r-1}|| ≤ α ||x_{r-1} - x_{r-2}|| ≤ α^2 ||x_{r-2} - x_{r-3}|| ≤ . . .

we have

    ||x_r - x_{r-1}|| ≤ α^{r-1} ||x_1 - x_0||.

Now we can further bound

    ||x_l - x_{l-1}|| + ||x_{l-1} - x_{l-2}|| + . . . + ||x_{k+1} - x_k||
        ≤ α^{l-1} ||x_1 - x_0|| + α^{l-2} ||x_1 - x_0|| + . . . + α^k ||x_1 - x_0||
        = ( α^{l-1} + α^{l-2} + . . . + α^k ) ||x_1 - x_0||
        = ( α^{l-k-1} + . . . + α + 1 ) α^k ||x_1 - x_0||.

The last line contains a geometric sum with 0 < α < 1, so we can bound this sum by the full infinite geometric sum:

    ( α^{l-k-1} + . . . + α + 1 ) α^k ||x_1 - x_0||  ≤  (1 / (1 - α)) α^k ||x_1 - x_0||  =  α^k ||x_1 - x_0|| / (1 - α).
This gives us the key inequality: for l ≥ k,

    ||x_l - x_k|| ≤ α^k ||x_1 - x_0|| / (1 - α)                                                (⋆)

Using this key inequality, we can show that our sequence (x_k) is bounded: for arbitrary s,

    ||x_s|| = ||x_s - x_1 + x_1|| ≤ ||x_s - x_1|| + ||x_1||.

Plugging l = s, k = 1 into (⋆), this sum is bounded by

    α ||x_1 - x_0|| / (1 - α) + ||x_1||,

which are all constants. Thus the sequence (x_k) is bounded.

This allows us to apply Bolzano-Weierstrass: there exists a convergent subsequence

    x_{n_i} → z.

Since E is closed, z ∈ E.

I claim z is a fixed point. To prove this, first note that f is a continuous map (just choose δ = ε/α). Therefore,

    f(x_{n_i}) → f(z).

Thus, if we can show that ( f(x_{n_i}) ) converges to z as well, then

    f(x_{n_i}) → f(z)   and   f(x_{n_i}) → z,

and by uniqueness of limits,

    f(z) = z.

Let's look at

    || f(x_{n_i}) - z ||.

By our recursive construction, this is just

    || x_{n_i + 1} - z ||.

Then,

    ||x_{n_i + 1} - z|| = ||x_{n_i + 1} - x_{n_i} + x_{n_i} - z|| ≤ ||x_{n_i + 1} - x_{n_i}|| + ||x_{n_i} - z||.

By convergence, we know we can find an N_1 such that for i ≥ N_1,

    ||x_{n_i} - z|| < ε/2.

Moreover, by our key inequality,

    ||x_{n_i + 1} - x_{n_i}|| ≤ α^{n_i} ||x_1 - x_0|| / (1 - α)

and since α ∈ (0, 1), we can find an N_2 such that for i ≥ N_2,

    ||x_{n_i + 1} - x_{n_i}|| < ε/2.

For i ≥ max{N_1, N_2},

    ||x_{n_i + 1} - z|| ≤ ||x_{n_i + 1} - x_{n_i}|| + ||x_{n_i} - z|| < ε/2 + ε/2 = ε,

i.e.,

    f(x_{n_i}) → z.
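A minimal sketch of the constructive idea: pick any starting point and keep applying f. Here f(x) = cos(x) is used as the contraction; it maps the closed set E = [0, 1] into itself with scale factor α = sin(1) < 1 (this particular f is just an illustrative choice, not from the text).

```python
import math

def fixed_point(f, x0, tol=1e-12, max_iter=10_000):
    """Iterate x_{k+1} = f(x_k) until successive terms stop moving."""
    x = x0
    for _ in range(max_iter):
        x_next = f(x)
        if abs(x_next - x) < tol:
            return x_next
        x = x_next
    return x

z = fixed_point(math.cos, 0.5)
print(z, abs(math.cos(z) - z) < 1e-10)   # ~0.739085..., True
```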
Here are a few important observations:

    This is a constructive proof. To find the¹ fixed point, just take any point in E and keep applying f.

    Our key inequality shows that our sequence is a Cauchy sequence. Intuitively, a Cauchy sequence is a sequence that eventually bunches up. Cauchy sequences are very important: in upper level analysis courses, you will use these to construct the reals from the rationals. Essentially, we tell the Completeness Axiom to back off. We don't need to take it as an axiom. Instead, the Completeness Property will be a corollary of our construction.

Lastly, we can easily show that our fixed point is actually unique:

¹ We will justify why this is "the" and not "a" very soon.
Theorem. Let $f : \mathbb{R}^n \to \mathbb{R}^n$ and let $E \subseteq \mathbb{R}^n$ be a closed set such that $f(E) \subseteq E$. Suppose there exists some $\gamma \in (0, 1)$ such that for any $x, y \in E$,
$$\|f(x) - f(y)\| \le \gamma\,\|x - y\|.$$
If for $z_1, z_2 \in E$,
$$f(z_1) = z_1 \quad \text{and} \quad f(z_2) = z_2,$$
then
$$z_1 = z_2.$$

Proof: Rewrite the distance between $z_1$ and $z_2$ in terms of $f$ and apply the contraction mapping property:
$$\|z_1 - z_2\| = \|f(z_1) - f(z_2)\| \le \gamma\,\|z_1 - z_2\|.$$
Subtracting $\gamma\,\|z_1 - z_2\|$ from both sides,
$$(1 - \gamma)\,\|z_1 - z_2\| \le 0.$$
Since $(1 - \gamma) > 0$,
$$\|z_1 - z_2\| = 0,$$
i.e.,
$$z_1 = z_2. \qquad \blacksquare$$
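The constructive nature of this proof invites a quick numerical experiment. The following sketch is my own illustration (the example map and helper name are hypothetical, not from the text): it iterates a contraction of $\mathbb{R}^2$ and watches the iterates bunch up at the unique fixed point.

```python
import numpy as np

def fixed_point_iteration(f, x0, tol=1e-12, max_iter=1000):
    """Iterate x_{k+1} = f(x_k); stop when successive iterates are within tol."""
    x = np.asarray(x0, dtype=float)
    for k in range(max_iter):
        x_next = f(x)
        if np.linalg.norm(x_next - x) < tol:
            return x_next, k + 1
        x = x_next
    return x, max_iter

# Example: f(x) = (1/2) R x + b is a contraction with gamma = 1/2,
# since R is a rotation and ||f(x) - f(y)|| = (1/2)||x - y||.
theta = 0.3
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
b = np.array([1.0, 2.0])
f = lambda x: 0.5 * R @ x + b

z, iters = fixed_point_iteration(f, x0=[10.0, -7.0])
print(z, iters)                     # the fixed point, reached in a few dozen steps
print(np.linalg.norm(f(z) - z))     # ~0: indeed f(z) = z
```

Any starting point works, exactly as the proof promises; only the number of iterations changes.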
Lecture 34

Intimidating Inverse Function Theorem

The only thing we have to fear is fear itself.
- Franklin Roosevelt

Goals: After giving a brief review of inverses, we prove the infamous Inverse Function Theorem.
34.1 Introspection on Inverse

In my experience, there is a lot of confusion among high school students when it comes to inverses. Particularly, they confuse the algebraic inverse with the functional inverse.¹ Just in case:

- An algebraic inverse is an object $x^{-1}$ that you apply, under some operation, to an object $x$ to get the identity element:
  $$x^{-1} * x = x * x^{-1} = e.$$
- A functional inverse is a function $f^{-1}$ such that inputting $f(x)$ into $f^{-1}$ returns $x$:
  $$f^{-1}(f(x)) = x.$$

Today, we are going to focus on functional inverses. In fact, we are going to prove the most difficult theorem in this course, the Inverse Function Theorem.

But before you jump headfirst into a difficult yet fundamental result, you must know the basics. By now, you should have realized that

Math Mantra: NEW mathematics is built on PREVIOUS mathematics. You have to have a SOLID GRASP before continuing.

You already know this: the material in the second half of 51H relies on a solid understanding of both single variable calculus and linear algebra.

Unfortunately, you may have had a lousy introduction to inverses. Instead of being taught concepts like 1 : 1, your theory of inverses (and functions) may have been reduced to a mindless methodology:

- Apply the Vertical Line Test.
- Swap x and y and solve for y.
- Fold a paper in half.

Yuck!

I am going to fill this gap, and when you are ready, you can move on to the Inverse Function Theorem.

¹ Of course, you can think of a functional inverse as an algebraic inverse under the operation of composition, where the identity element is the identity function $I(x) = x$. But the typical high school student wouldn't know this.
34.2 Inverse Basics

Let's start with some function $f$. For every input $x$, we know there is some output $f(x)$. Using this $f$, we want to construct some function $f^{-1}$ such that when you input $f(x)$, you get $x$. So what does such a function have to satisfy?

- The domain of $f^{-1}$ must contain the image of $f$. Otherwise, $f^{-1}(f(x)) = x$ wouldn't make sense!
- No two inputs of $f$ can map to the same output. Consider, for example, the case $y = x^2$: the inputs $1, -1$ would map to the output $1$,
  $$f(1) = 1, \qquad f(-1) = 1.$$
  So if an inverse did exist,
  $$f^{-1}(1) = 1 \quad \text{and} \quad f^{-1}(1) = -1,$$
  meaning $f^{-1}$ is not a function!

To avoid the case where two inputs are competing for the same output, we define the condition

Definition. A function is 1 : 1 (read: one-to-one) or injective if, for any $x, y$ such that
$$f(x) = f(y)$$
it must be the case that
$$x = y.$$

This condition, along with the fact that $f^{-1}$ is defined on $\mathrm{Im}(f)$, makes the existence of the inverse immediate. Just switch the range and domain, and show that it is a function.
Theorem. If $f$ is 1 : 1, then there exists a function $f^{-1}$ with domain $\mathrm{Im}(f)$ such that
$$f^{-1}(f(x)) = x.$$

Proof: Define $f^{-1} : \mathrm{Im}(f) \to \mathrm{dom}(f)$ by
$$f^{-1}(x) = y,$$
where $y$ is some value such that
$$f(y) = x.$$
Suppose $f^{-1}$ is ill-defined (i.e., it is not a function). Then there exists a point $z$ that is mapped to two outputs:
$$f^{-1}(z) = a, \qquad f^{-1}(z) = b,$$
where $a \ne b$. Then, by construction,
$$f(a) = z, \qquad f(b) = z,$$
which contradicts the fact that $f$ is 1 : 1. $\blacksquare$
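As a quick illustration of this construction in code (my own example, not from the text), for a 1 : 1 function on a finite domain we can literally "switch range and domain":

```python
def invert(f, domain):
    """Build f^{-1} as a lookup table for a 1:1 function f on a finite domain."""
    table = {}
    for x in domain:
        y = f(x)
        if y in table:                    # two inputs competing for one output
            raise ValueError("f is not 1:1, so no inverse exists")
        table[y] = x
    return table

f = lambda x: 2 * x + 3                   # 1:1 on the integers
f_inv = invert(f, range(10))
print(all(f_inv[f(x)] == x for x in range(10)))   # True: f^{-1}(f(x)) = x
```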
That's all the theory you need to understand the meaning of the Inverse Function Theorem. But before you embark on this great proof, I want you to know that the concepts of 1 : 1 and inverses are pretty darn important. Particularly in the fields of

- Cryptography: Every encrypted message can be decrypted to precisely one corresponding text.
- Combinatorics: If we can produce 1 : 1 mappings between two finite sets, then they must have the same size. This lets us prove some pretty nifty results.
- Set Theory: We can create a notion of size for infinite sets. Intuitively, the size of $A$ is less than (or equal to) the size of $B$ if there is a 1 : 1 mapping of $A$ into $B$.

We shall save the first item for courses like Math 110 and CS 255. The last two items will be discussed in our final lecture.
34.3 An Overall Schematic

The Inverse Function Theorem is a difficult theorem to prove. There is no question about that. This is because it contains a lot of intricacies. But so do some of the greatest works: Les Misérables, Memento, and Pulp Fiction. So let's breathe slowly, and take this one step at a time.

First, let's give a simple example. Focus on only the universe of functions that are at least $C^1$. For example, consider the sine function. If you tried to directly invert this function, you would completely and utterly fail. Instead, we cheat. We restrict the domain so that it is locally invertible. This was the entire point of $\arcsin(x)$, $\arccos(x)$, and $\arctan(x)$!

Of course, where we center the restricted interval makes a difference. If I centered around a turning point, say at $x_0 = \frac{\pi}{2}$, then no matter how much we restrict the interval, we cannot find a local inverse! Notice that this problem arose because the derivative at $x_0$ is zero. However,

For a $C^1$ function, at a point where the Jacobian is invertible, we can always restrict the function to an open set so that it is invertible. In fact, we can show that this inverse is also $C^1$ and defined on an open set (meaning that $\mathrm{Im}(f)$ is open).

This is the Inverse Function Theorem.

So how do we prove this theorem?

Start with an open ball around $x_0$. Then, we add mysterious conditions: shrink the ball so that for all of its points $x$, the matrix inverse $[Df(x)]^{-1}$ exists. Moreover, for any point in this ball, when we evaluate the Jacobian, the error from the Jacobian at $x_0$ must be less than a fixed $\epsilon \in \left(0, \frac{1}{2}\right)$:
$$[Df(x)]^{-1} \text{ exists}, \qquad \|Df(x) - Df(x_0)\| < \epsilon.$$
We then take the image of the ball under $f$. Incredibly, this $f$ is 1 : 1 and its image is open. To prove this, we use the mysterious restrictions to prove a magical inequality
$$\left\|\big(x - f(x)\big) - \big(y - f(y)\big)\right\| \le \epsilon\,\|y - x\|.$$
Almost every step in our proof is going to require this inequality.

After we prove that $f$ is 1 : 1 and the image is open, we know that $f^{-1}$ exists. Then, we have a Majora's Mask moment. Like Link in the Stone Tower Temple, flip the world and reverse $f$. Then we rewrite our magical inequality to get a magical reverse inequality:
$$\|f^{-1}(u) - f^{-1}(v)\| \le \frac{\|u - v\|}{1 - \epsilon}.$$
Using this inequality, we can check that $f^{-1}$ is indeed $C^1$.

That is the overall schematic. But there are a lot of details. Luckily we can make our lives easier:
34.4 A Much Needed Simplification

In the proof of the Inverse Function Theorem, it turns out that we can actually assume
$$Df(x_0) = I.$$
To explain why, first we need to talk about open sets.

Suppose that you take an open set $V$ and multiply every point by an invertible matrix:
$$AV = \{Av \mid v \in V\}.$$
It turns out that $AV$ is still open.

Careful! The proof is not a straightforward definition check! It's going to need an idea. Namely,

Math Mantra: If you cannot prove the result directly, try to come up with an INTERMEDIATE step.

Particularly, instead of directly proving
$$A \subseteq C$$
we add an intermediate set inclusion
$$A \subseteq B \subseteq C$$
and prove instead that
$$A \subseteq B \quad \text{and} \quad B \subseteq C.$$
You must understand this trick: in fact, it's a fundamental step in the proof of the Inverse Function Theorem.

Lemma. Let $V$ be an open set and $A$ an invertible matrix. Then
$$AV = \{Av \mid v \in V\}$$
is also open.
Proof Summary:

- Choose $\sigma$ such that
  $$B_\sigma(v) \subseteq V.$$
- It suffices to find a $\delta$ such that
  $$B_\delta(Av) \subseteq A\big(B_\sigma(v)\big) \subseteq AV.$$
- $A\big(B_\sigma(v)\big) \subseteq AV$: Immediate.
- $B_\delta(Av) \subseteq A\big(B_\sigma(v)\big)$: Expand the definition and choose
  $$\delta = \frac{\sigma}{\|A^{-1}\|}.$$
Proof: For a point $Av \in AV$, we want to find a $\delta > 0$ such that
$$B_\delta(Av) \subseteq AV.$$
Proving this directly is too difficult. Instead, we throw in an intermediate set inclusion. First, use openness to find a ball centered around $v$ contained in $V$: thus $\sigma$ is some radius such that
$$B_\sigma(v) \subseteq V.$$
I claim that, with the right choice of $\delta$ given $\sigma$, when we map this ball under $A$, it contains our set $B_\delta(Av)$ and lies in $AV$. The goal now is to find a $\delta$ such that
$$B_\delta(Av) \subseteq A\big(B_\sigma(v)\big) \subseteq AV.$$
Notice that the right inclusion
$$A\big(B_\sigma(v)\big) \subseteq AV$$
is automatically true. Namely, we chose $\sigma$ so that
$$B_\sigma(v) \subseteq V,$$
thus
$$A\big(B_\sigma(v)\big) \subseteq AV,$$
and all we have to show is
$$B_\delta(Av) \subseteq A\big(B_\sigma(v)\big).$$
In other words, we have to show there exists a $\delta$ such that for all $x$ where
$$\|x - Av\| < \delta,$$
there exists a $q \in B_\sigma(v)$ such that
$$Aq = x.$$
By invertibility, we can solve for
$$q = A^{-1}x,$$
so we just need to show $A^{-1}x \in B_\sigma(v)$:
$$\|A^{-1}x - v\| < \sigma.$$
Pulling out an $A^{-1}$ and applying the Cauchy-like inequality for matrices, we get an upper bound:
$$\big\|A^{-1}x - \underbrace{A^{-1}A}_{I}\,v\big\| = \big\|A^{-1}(x - Av)\big\| \le \|A^{-1}\|\,\|x - Av\|.$$
Therefore, if we choose
$$\delta = \frac{\sigma}{\|A^{-1}\|},$$
we would get
$$\|A^{-1}\|\,\underbrace{\|x - Av\|}_{< \delta} < \|A^{-1}\| \cdot \frac{\sigma}{\|A^{-1}\|} = \sigma. \qquad \blacksquare$$
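A quick numerical sanity check of this choice of $\delta$ (my own sketch, not part of the text; here $\|\cdot\|$ is taken to be the operator 2-norm, which also satisfies the Cauchy-like inequality): sample points in $B_\delta(Av)$ and verify their preimages under $A$ land inside $B_\sigma(v)$.

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[2.0, 1.0],
              [0.5, 3.0]])                 # an invertible matrix
A_inv = np.linalg.inv(A)
v = np.array([1.0, -1.0])
sigma = 0.7
delta = sigma / np.linalg.norm(A_inv, 2)   # delta = sigma / ||A^{-1}||

for _ in range(10_000):
    d = rng.normal(size=2)
    x = A @ v + delta * rng.uniform(0, 1) * d / np.linalg.norm(d)   # ||x - Av|| < delta
    q = A_inv @ x                                                   # the preimage of x
    assert np.linalg.norm(q - v) < sigma     # q lies in B_sigma(v), as the lemma predicts
print("all sampled points passed")
```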
Suppose you can prove the Inverse Function Theorem in the case $Df(x_0) = I$. Then in particular, the theorem is true for the function
$$\tilde{f}(x) = [Df(x_0)]^{-1}f(x)$$
since
$$D\tilde{f}(x_0) = \underbrace{[Df(x_0)]^{-1}}_{\text{constant}}\,Df(x_0) = I.$$
Thus, there exist open sets $U, V$ such that $\tilde{f} : U \to V$ is 1 : 1 and $\tilde{f}(U) = V$. Since $f = Df(x_0)\,\tilde{f}$ and $Df(x_0)$ is invertible, $f$ is also 1 : 1 on $U$. By definition,
$$\tilde{f}(U) = V \iff [Df(x_0)]^{-1}f(U) = V,$$
so
$$f(U) = Df(x_0)\,V.$$
But $V$ is open, so by our lemma, $Df(x_0)V$ is open. The inverse of $\tilde{f}$ is also explicitly
$$\tilde{f}^{-1}(x) = f^{-1}\big(Df(x_0)\,x\big),$$
thus
$$f^{-1}(x) = \tilde{f}^{-1}\big([Df(x_0)]^{-1}x\big),$$
and since $\tilde{f}^{-1}$ is $C^1$ and $x \mapsto [Df(x_0)]^{-1}x$ is linear, ergo $f^{-1}$ is also $C^1$.

Therefore, if the Inverse Function Theorem is true in the case $Df(x_0) = I$, then it is in fact true when $Df(x_0)$ is any arbitrary invertible matrix.
34.5 The Intricate Inverse Function Theorem

Now that we know we can assume that $Df(x_0) = I$, we begin the legendary proof:

Theorem (Inverse Function Theorem). Let $f : \mathbb{R}^n \to \mathbb{R}^n$ be a $C^1$ function, let $x_0 \in \mathbb{R}^n$, and suppose $Df(x_0)$ is invertible. Then there exists an open set $U$ containing $x_0$ and an open set $V$ such that $f : U \to V$ is 1 : 1 and $f(U) = V$ (so $(f|_U)^{-1} : V \to U$ exists). Moreover, $(f|_U)^{-1}$ is $C^1$.
Proof Summary:

- By the preceding section, assume $Df(x_0) = I$.
- Construct an open set $U = B_\rho(x_0)$ and define $\epsilon$ such that for every $x \in U$,
  $$\|Df(x) - I\| < \epsilon < \tfrac{1}{2}$$
  and $Df(x)$ is invertible.
- Prove the magical inequality
  $$\left\|\big(x - f(x)\big) - \big(y - f(y)\big)\right\| \le \epsilon\,\|y - x\|$$
  by applying the Fundamental Theorem of Calculus to $F\big(x + t(y - x)\big)$.
- Prove $V = f\big(B_\rho(x_0)\big)$ is open: it suffices to show
  $$B_\beta\big(f(z)\big) \subseteq f\big(B_{2\beta}(z)\big) \subseteq f\big(B_\rho(x_0)\big), \qquad \text{where } \beta = \frac{\rho - \|z - x_0\|}{2}.$$
  - $f\big(B_{2\beta}(z)\big) \subseteq f\big(B_\rho(x_0)\big)$: directly check the definition.
  - $B_\beta\big(f(z)\big) \subseteq f\big(B_{2\beta}(z)\big)$: show
    $$F(x) = x - f(x) + c$$
    is a contraction mapping on $\overline{B}_{2\beta}(z)$ that maps into $B_{2\beta}(z)$. Use the magical inequality.
- Prove $f$ is 1 : 1 on $U$ by using the magical inequality. Conclude that $f^{-1}$ exists.
- Prove the magical reverse inequality for $f^{-1}$,
  $$\|f^{-1}(u) - f^{-1}(v)\| \le \frac{\|u - v\|}{1 - \epsilon},$$
  by rewriting the magical inequality in terms of $f^{-1}$.
- Prove $f^{-1}$ is differentiable by checking that
  $$Df^{-1}(v_0) = [Df(u_0)]^{-1}$$
  for $f^{-1}(v_0) = u_0$. Apply the magical reverse inequality to $f$'s differentiability definition.
- Prove $Df^{-1}$ is continuous as a consequence of composition properties and the explicit formula for the matrix inverse.
Proof:

Building our open set $U = B_\rho(x_0)$ and defining $\epsilon$

As we mentioned earlier in our schematic, we would like $U$ to have two properties:

- For every $x \in U$, $\|Df(x) - I\| < \epsilon < \frac{1}{2}$.

  Since $f$ is $C^1$, we know that for a fixed $\epsilon \in \left(0, \frac{1}{2}\right)$, there is a $\rho > 0$ such that for all
  $$\|x - x_0\| < \rho$$
  we have
  $$\|Df(x) - Df(x_0)\| < \epsilon < \tfrac{1}{2}.$$
  But notice that this condition really describes all points in a $\rho$-ball, and by the preceding section, we can assume $Df(x_0) = I$. Hence, for all $x \in B_\rho(x_0)$ we have
  $$\|Df(x) - I\| < \epsilon < \tfrac{1}{2}.$$

- For every $x \in U$, $Df(x)$ is invertible.

  Let $x \in B_\rho(x_0)$. To prove this matrix is invertible, it suffices to show that for any non-zero vector $v$,
  $$\|Df(x)v\| > 0.$$
  This would mean that the null space of $Df(x)$ is trivial and thus the matrix is invertible. Adding $\vec{0}$, we have
  $$Df(x)v = Df(x)v + v - v = v + [Df(x) - I]v.$$
  To construct a lower bound, we will use the reverse¹ triangle inequality:
  $$\|a - b\| \ge \|a\| - \|b\|.$$
  Rewrite the (RHS) as a difference and apply this inequality:
  $$\big\|v - [I - Df(x)]v\big\| \ge \|v\| - \big\|[I - Df(x)]v\big\|.$$
  By our Cauchy-like inequality for matrices, we know that the (RHS) is minimally
  $$\|v\| - \|I - Df(x)\|\,\|v\|.$$
  Recall that for all points in our ball,
  $$\|Df(x) - I\| < \tfrac{1}{2}.$$
  Therefore, we can shrink our lower bound to
  $$\|v\| - \underbrace{\tfrac{1}{2}}_{>\,\|I - Df(x)\|}\|v\| = \tfrac{1}{2}\|v\|.$$
  Now we have
  $$\|Df(x)v\| \ge \tfrac{1}{2}\|v\|.$$
  Thus, for any non-zero vector $v$,
  $$\|Df(x)v\| > 0.$$

¹ Just use the triangle inequality with $x = a - b$ and $y = b$.
Proving the Magical Inequality: $\left\|\big(x - f(x)\big) - \big(y - f(y)\big)\right\| \le \epsilon\,\|y - x\|$

We do the same first step as we had in the proof of the Second Derivative Test. Consider some function $F$ where the input goes from $x$ to $y$ as $t$ goes from $0$ to $1$:
$$F\big(x + t(y - x)\big).$$
By the Fundamental Theorem of Calculus,
$$\underbrace{F(y)}_{F(x + 1\cdot(y-x))} - \underbrace{F(x)}_{F(x + 0\cdot(y-x))} = \int_0^1 \frac{d}{dt}\,F\big(x + t(y - x)\big)\,dt.$$
Applying the chain rule, expand the (RHS) as
$$\int_0^1 \Big[DF\big(x + t(y - x)\big)\Big](y - x)\,dt.$$
Now we have the integral equation
$$F(y) - F(x) = \int_0^1 \Big[DF\big(x + t(y - x)\big)\Big](y - x)\,dt.$$
For our choice of $F$, let's consider plugging in the function that spits out the difference of the mapped point from the original vector:
$$F(x) = x - f(x).$$
In particular,
$$DF(x) = I - Df(x).$$
Plugging $F$ into our integral equation,
$$\big(y - f(y)\big) - \big(x - f(x)\big) = \int_0^1 \Big[\underbrace{I - Df\big(x + t(y - x)\big)}_{DF(x + t(y-x))}\Big](y - x)\,dt.$$
Now, take the norm of both sides (the norm of a vector equals the norm of its negative):
$$\left\|\big(x - f(x)\big) - \big(y - f(y)\big)\right\| = \left\|\int_0^1 \Big[I - Df\big(x + t(y - x)\big)\Big](y - x)\,dt\right\|,$$
and from the right we can compute an upper bound. First, we know the integral is bounded by the integral with the norm pulled inside:
$$\le \int_0^1 \left\|\Big[I - Df\big(x + t(y - x)\big)\Big](y - x)\right\|dt.$$
Then, by our Cauchy-like inequality for matrices, we can bound this by
$$\int_0^1 \big\|I - Df\big(x + t(y - x)\big)\big\|\,\|y - x\|\,dt.$$
But the integrand looks like the bound we arranged on $B_\rho(x_0)$! To apply it, formally, you need to prove that for $t \in (0, 1)$, the line segment point
$$x + t(y - x)$$
lies in $B_\rho(x_0)$. A set with this property is known as convex. I leave it to you to prove that balls are convex.¹

Using the fact that balls are convex, we can apply our property of $U$ to get our final upper bound,
$$\int_0^1 \underbrace{\big\|I - Df\big(x + t(y-x)\big)\big\|}_{< \epsilon}\,\|y - x\|\,dt \le \epsilon\,\|y - x\|.$$
This gives us
$$\left\|\big(x - f(x)\big) - \big(y - f(y)\big)\right\| \le \epsilon\,\|y - x\|. \qquad (\dagger)$$

¹ I think it's only fair for you to prove one fact (when I have to prove a gazillion of them)! Don't worry: just check the definition!
Proving the image $V = f\big(B_\rho(x_0)\big)$ is open

We want to show that for any $y \in f\big(B_\rho(x_0)\big)$, we can find a ball centered around $y$ that is still contained in $f\big(B_\rho(x_0)\big)$.

Proving this directly isn't easy! Instead, we insert an intermediate set inclusion and try to prove that instead. But what should this intermediate step be?

First, since $y \in f\big(B_\rho(x_0)\big)$, we know
$$y = f(z)$$
for some $z \in B_\rho(x_0)$. Therefore, the inclusion we are trying to prove is
$$B_\beta\big(f(z)\big) \subseteq f\big(B_\rho(x_0)\big).$$
Now consider the ball around this $z$ with some radius $2\beta$, namely $B_{2\beta}(z)$. I claim that when we map this ball under $f$, it will contain our set $B_\beta(y)$ and lie in $f\big(B_\rho(x_0)\big)$:
$$B_\beta\big(f(z)\big) \subseteq f\big(B_{2\beta}(z)\big) \subseteq f\big(B_\rho(x_0)\big).$$
Great! But what should our $\beta$ be?

We do the same trick we did in Chapter 14 when we proved that open balls are open. Draw the radius of $B_\rho(x_0)$ through $z$ and calculate the distances. Therefore, let
$$\beta = \frac{\rho - \|z - x_0\|}{2}.$$
Now that we have the game plan, let's prove:
$$B_\beta\big(f(z)\big) \subseteq f\big(B_{2\beta}(z)\big) \subseteq f\big(B_\rho(x_0)\big).$$
$f\big(B_{2\beta}(z)\big) \subseteq f\big(B_\rho(x_0)\big)$

Let $x \in f\big(B_{2\beta}(z)\big)$. Then $x = f(y)$ for some $y \in B_{2\beta}(z)$. The goal is to show¹
$$y \in B_\rho(x_0),$$
since this implies
$$\underbrace{x}_{f(y)} \in f\big(B_\rho(x_0)\big).$$
By our usual triangle shenanigans, introduce $z$,
$$\|y - x_0\| = \|y - x_0 \underbrace{-\,z + z}_{=\,\vec{0}}\| \le \|y - z\| + \|x_0 - z\|,$$
and then bound the (RHS) by using the fact $y \in B_{2\beta}(z)$:
$$\|y - z\| + \|x_0 - z\| < \underbrace{2\beta}_{\rho - \|z - x_0\|} + \|x_0 - z\| = \rho.$$
Therefore, $y \in B_\rho(x_0)$.

¹ This should be intuitively obvious. Look at the previous picture: this is how we chose $\beta$!
$B_\beta\big(f(z)\big) \subseteq f\big(B_{2\beta}(z)\big)$

Let $c \in B_\beta\big(f(z)\big)$. The goal is to show that there exists some $\vec{d} \in B_{2\beta}(z)$ such that
$$c = f(\vec{d}\,).$$
Showing $\vec{d}$ exists takes a little bit of creativity. Consider a variation of our earlier $F$:
$$F(x) = x - f(x) + c.$$
Suppose we can find a fixed point $x_{\mathrm{FIX}} \in B_{2\beta}(z)$:
$$F(x_{\mathrm{FIX}}) = x_{\mathrm{FIX}}.$$
Then, plugging into the above,
$$\underbrace{x_{\mathrm{FIX}}}_{F(x_{\mathrm{FIX}})} = x_{\mathrm{FIX}} - f(x_{\mathrm{FIX}}) + c,$$
giving us
$$f(x_{\mathrm{FIX}}) = c.$$
Thus, we can use $x_{\mathrm{FIX}}$ as our $\vec{d}$!

Before we apply last lecture's work to prove $F$ is a contraction mapping, notice that there is a catch:

The Contraction Mapping Theorem only applies to contraction mappings on closed sets.

Therefore, we're going to show $F$ is a contraction mapping on the closed ball $\overline{B}_{2\beta}(z)$. Moreover, we add the additional proviso that $F$ maps this closed ball into the open ball $B_{2\beta}(z)$. This implies, particularly, that the fixed point $x_{\mathrm{FIX}} \in B_{2\beta}(z)$.

Now we must prove:

- $F\big(\overline{B}_{2\beta}(z)\big) \subseteq B_{2\beta}(z)$.
- There is a constant $\gamma \in (0, 1)$ such that for any $x, y \in \overline{B}_{2\beta}(z)$,
  $$\|F(x) - F(y)\| \le \gamma\,\|x - y\|.$$

But we've seen the second item before. Namely, it's our magical inequality $(\dagger)$:
$$\left\|\big(x - f(x)\big) - \big(y - f(y)\big)\right\| \le \epsilon\,\|x - y\|,$$
where $\epsilon \in \left(0, \frac{1}{2}\right)$. By adding $c - c$, we get the inequality we need:
$$\big\|\underbrace{(x - f(x) + c)}_{F(x)} - \underbrace{(y - f(y) + c)}_{F(y)}\big\| \le \underbrace{\epsilon}_{\gamma}\,\|x - y\|.$$
Therefore, we only need to check the first item:
$$F\big(\overline{B}_{2\beta}(z)\big) \subseteq B_{2\beta}(z).$$
Let $q \in \overline{B}_{2\beta}(z)$. We want to show that $\|F(q) - z\| < 2\beta$. Expanding and doing our usual tricks,
$$F(q) - z = q - f(q) + c - z + \underbrace{f(z) - f(z)}_{=\,0} = \big(q - f(q)\big) - \big(z - f(z)\big) + \big(c - f(z)\big).$$
By the triangle inequality, this is bounded by
$$\left\|\big(q - f(q)\big) - \big(z - f(z)\big)\right\| + \|c - f(z)\|.$$
But we've just seen the left summand: it is again just an application of the magical inequality $(\dagger)$ with $x = q$ and $y = z$! Thus, we can bound our sum:
$$\underbrace{\left\|\big(q - f(q)\big) - \big(z - f(z)\big)\right\|}_{\le\, \epsilon\|q - z\| \,<\, \frac{1}{2}\|q - z\|} + \|c - f(z)\| \le \tfrac{1}{2}\|q - z\| + \|c - f(z)\|.$$
By definition, $c \in B_\beta\big(f(z)\big)$ and $q \in \overline{B}_{2\beta}(z)$, so in fact
$$\tfrac{1}{2}\|q - z\| + \|c - f(z)\| < \tfrac{1}{2}\cdot 2\beta + \beta = 2\beta,$$
giving us
$$\|F(q) - z\| < 2\beta.$$
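The fixed-point construction in this step is, in effect, an algorithm for solving $f(x) = c$ near $z$. Here is a small numerical sketch of that idea (my own example; the particular map $f$ and target $c$ are hypothetical), iterating $F(x) = x - f(x) + c$:

```python
import numpy as np

def solve_by_contraction(f, c, x_start, n_iter=200):
    """Iterate F(x) = x - f(x) + c; if Df stays close to I nearby, F is a contraction."""
    x = np.asarray(x_start, dtype=float)
    for _ in range(n_iter):
        x = x - f(x) + c
    return x

# A map close to the identity near the origin, so ||Df(x) - I|| stays small there.
f = lambda x: x + 0.1 * np.array([np.sin(x[0]), x[1] ** 2])

c = np.array([0.30, -0.20])              # a target value near f(0)
d = solve_by_contraction(f, c, x_start=np.zeros(2))
print(d)                                 # the point d with f(d) = c
print(np.linalg.norm(f(d) - c))          # ~0: the residual f(d) - c vanishes
```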
$f$ is a 1 : 1 map from $U = B_\rho(x_0)$ to $V = f\big(B_\rho(x_0)\big)$

Again, use the magical inequality $(\dagger)$:
$$\left\|\big(x - f(x)\big) - \big(y - f(y)\big)\right\| \le \epsilon\,\|x - y\|.$$
This tells us that if $f(x) - f(y) = 0$, then
$$\|x - y\| \le \epsilon\,\|x - y\|.$$
Thus
$$(1 - \epsilon)\,\|x - y\| \le 0,$$
which implies $x = y$.

In particular, this tells us $(f|_{B_\rho(x_0)})^{-1} : f\big(B_\rho(x_0)\big) \to B_\rho(x_0)$ exists.
Magical reverse inequality for $f^{-1}$: $\displaystyle \|f^{-1}(u) - f^{-1}(v)\| \le \frac{\|u - v\|}{1 - \epsilon}$

Starting from our magical inequality $(\dagger)$,
$$\left\|\big(x - f(x)\big) - \big(y - f(y)\big)\right\| \le \epsilon\,\|x - y\|,$$
rewrite it as
$$\left\|\big(x - y\big) - \big(f(x) - f(y)\big)\right\| \le \epsilon\,\|y - x\|$$
and apply the reverse triangle inequality on the left to get
$$\|x - y\| - \|f(x) - f(y)\| \le \left\|\big(x - y\big) - \big(f(x) - f(y)\big)\right\|.$$
Now,
$$\|x - y\| - \|f(x) - f(y)\| \le \epsilon\,\|y - x\|,$$
and by moving terms, we get
$$(1 - \epsilon)\,\|x - y\| \le \|f(x) - f(y)\|.$$
For $u, v \in V$, plug in
$$x = f^{-1}(u), \qquad y = f^{-1}(v).$$
This gives us
$$(1 - \epsilon)\,\|f^{-1}(u) - f^{-1}(v)\| \le \underbrace{\|u - v\|}_{\|f(f^{-1}(u)) - f(f^{-1}(v))\|},$$
so
$$\|f^{-1}(u) - f^{-1}(v)\| \le \frac{\|u - v\|}{1 - \epsilon}. \qquad (\dagger\dagger)$$
$f^{-1}$ is differentiable

We directly check the definition¹ of differentiability at any point $v_0 \in V$. First, we guess that

The Jacobian of $f^{-1}$ at $v_0$ is the inverse of the Jacobian matrix of $f$ evaluated at the inverse of $v_0$ under $f$.

Symbolically,
$$Df^{-1}(v_0) = [Df(u_0)]^{-1} \qquad \text{where } f^{-1}(v_0) = u_0.$$
Now we need to check that, for any $\epsilon_1 > 0$, there exists a $\delta_1 > 0$ such that if
$$\|v - v_0\| < \delta_1$$
then
$$\left\|f^{-1}(v) - f^{-1}(v_0) - [Df(u_0)]^{-1}(v - v_0)\right\| < \epsilon_1\,\|v - v_0\|.$$
Starting from
$$\left\|f^{-1}(v) - f^{-1}(v_0) - [Df(u_0)]^{-1}(v - v_0)\right\|,$$
rewrite it as
$$\Big\|\underbrace{[Df(u_0)]^{-1}[Df(u_0)]}_{I}\,f^{-1}(v) - \underbrace{[Df(u_0)]^{-1}[Df(u_0)]}_{I}\,f^{-1}(v_0) - [Df(u_0)]^{-1}(v - v_0)\Big\|$$
and pull out the inverse:
$$\left\|[Df(u_0)]^{-1}\Big([Df(u_0)]\big(f^{-1}(v) - f^{-1}(v_0)\big) - (v - v_0)\Big)\right\|.$$
But this is bounded by
$$\big\|[Df(u_0)]^{-1}\big\|\,\left\|[Df(u_0)]\big(f^{-1}(v) - f^{-1}(v_0)\big) - (v - v_0)\right\|.$$
Therefore, if we can find a condition on $\delta_1$ such that
$$\left\|[Df(u_0)]\big(f^{-1}(v) - f^{-1}(v_0)\big) - (v - v_0)\right\| < \frac{\epsilon_1}{\big\|[Df(u_0)]^{-1}\big\|}\,\|v - v_0\|,$$
then we are done.

To find this condition, apply the differentiability definition of $f$ at $f^{-1}(v_0)$ with the choice of
$$\epsilon_2 = \frac{\epsilon_1(1 - \epsilon)}{\big\|[Df(u_0)]^{-1}\big\|}:$$
there exists a $\delta_2$ such that if
$$\|q - f^{-1}(v_0)\| < \delta_2$$
then
$$\left\|[Df(u_0)]\big(q - f^{-1}(v_0)\big) - \big(f(q) - v_0\big)\right\| < \frac{\epsilon_1(1 - \epsilon)}{\big\|[Df(u_0)]^{-1}\big\|}\,\|q - f^{-1}(v_0)\|.$$
In particular, by adding a provision on $\delta_1$, this $\delta$-hypothesis holds for the choice
$$q = f^{-1}(v).$$
This is because, by the magical reverse inequality $(\dagger\dagger)$,
$$\|f^{-1}(v) - f^{-1}(v_0)\| \le \frac{\|v - v_0\|}{1 - \epsilon} < \delta_2$$
when $\|v - v_0\| < \delta_2(1 - \epsilon)$. Therefore, by choosing
$$\delta_1 = \delta_2(1 - \epsilon),$$
we have
$$\Big\|[Df(u_0)]\big(\underbrace{f^{-1}(v)}_{q} - f^{-1}(v_0)\big) - \big(\underbrace{v}_{f(q)} - v_0\big)\Big\| < \frac{\epsilon_1(1 - \epsilon)}{\big\|[Df(u_0)]^{-1}\big\|}\,\big\|\underbrace{f^{-1}(v)}_{q} - f^{-1}(v_0)\big\|.$$
But we can increase this upper bound by applying the magical reverse inequality $(\dagger\dagger)$,
$$\|f^{-1}(v) - f^{-1}(v_0)\| \le \frac{\|v - v_0\|}{1 - \epsilon},$$
and multiplying both sides by $\frac{\epsilon_1(1 - \epsilon)}{\|[Df(u_0)]^{-1}\|}$:
$$\frac{\epsilon_1(1 - \epsilon)}{\big\|[Df(u_0)]^{-1}\big\|}\,\|f^{-1}(v) - f^{-1}(v_0)\| \le \frac{\epsilon_1}{\big\|[Df(u_0)]^{-1}\big\|}\,\|v - v_0\|.$$
In conclusion,
$$\left\|[Df(u_0)]\big(f^{-1}(v) - f^{-1}(v_0)\big) - (v - v_0)\right\| < \frac{\epsilon_1}{\big\|[Df(u_0)]^{-1}\big\|}\,\|v - v_0\|$$
for the choice $\delta_1 = \delta_2(1 - \epsilon)$.

¹ Rather, we use an equivalent definition, in which we replace $\vec{h}$ with the difference $v - u$.
$Df^{-1}(v)$ is continuous¹

We just showed that
$$Df^{-1}(v) = [Df(u)]^{-1} \qquad \text{where } f(u) = v,$$
so
$$Df^{-1}(v) = \Big[Df\big(f^{-1}(v)\big)\Big]^{-1}.$$
But $Df$ and $f^{-1}$ are continuous, so their composition $Df\big(f^{-1}(v)\big)$ is also continuous. Moreover, the inverse of a continuous matrix-valued function is also continuous: apply the explicit formula for the inverse from Lecture 30,
$$A^{-1} = \frac{1}{\det(A)}
\begin{pmatrix}
(-1)^{1+1}\det(A_{11}) & \cdots & (-1)^{c+1}\det(A_{c1}) & \cdots & (-1)^{n+1}\det(A_{n1}) \\
(-1)^{1+2}\det(A_{12}) & \cdots & (-1)^{c+2}\det(A_{c2}) & \cdots & (-1)^{n+2}\det(A_{n2}) \\
\vdots & & \vdots & & \vdots \\
(-1)^{1+n}\det(A_{1n}) & \cdots & (-1)^{c+n}\det(A_{cn}) & \cdots & (-1)^{n+n}\det(A_{nn})
\end{pmatrix},$$
where
$$A = Df\big(f^{-1}(v)\big).$$
Determinants are polynomials, so each component is a composition of continuous functions. $\blacksquare$

Without a doubt, this is one of the most difficult results to prove. But the mathematics itself isn't hard. The only obstacle is that there are a lot of things that need checking.

In truth, I didn't really understand the Inverse Function Theorem's proof until Math 148: Analysis on Manifolds (Professor Wieczorek is an excellent teacher)! If you're worried about the final, just memorize the statement and learn the proof during Winter Break.

¹ In other words, each component function is continuous.
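To make the formula $Df^{-1}(v_0) = [Df(u_0)]^{-1}$ concrete, here is a small numerical sketch (my own example, not from the text). The map and the contraction-based inverter below are hypothetical choices; the Jacobian of $f^{-1}$ is estimated by finite differences and compared against the inverse Jacobian of $f$.

```python
import numpy as np

def f(p):
    x, y = p
    return np.array([x + 0.1 * np.sin(y), y + 0.1 * x**2])

def Df(p):
    x, y = p
    return np.array([[1.0, 0.1 * np.cos(y)],
                     [0.2 * x, 1.0]])

def f_inverse(v, n_iter=200):
    """Invert f near the origin by the contraction iteration F(x) = x - f(x) + v."""
    x = np.zeros(2)
    for _ in range(n_iter):
        x = x - f(x) + v
    return x

u0 = np.array([0.2, -0.3])
v0 = f(u0)

h = 1e-6
num_jac = np.column_stack([(f_inverse(v0 + h * e) - f_inverse(v0 - h * e)) / (2 * h)
                           for e in np.eye(2)])
print(num_jac)                    # finite-difference Jacobian of f^{-1} at v0
print(np.linalg.inv(Df(u0)))      # [Df(u0)]^{-1}: agrees to roughly 1e-9
```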
New Notation

Symbol: $1 : 1$ — Reading: one to one — Example: $f(x)$ is $1 : 1$ — Translation: $f(x)$ is a 1 : 1 function.
Symbol: $\overline{B}_\sigma(a)$ — Reading: the closed ball of radius $\sigma$ centered at $a$ — Example: $\overline{B}_1(\vec{0}\,)$ — Translation: the unit closed ball centered at $\vec{0}$.
Lecture 35

Implying Implicit Function Theorem

The Math is lovely, dark and deep.
But I have promises to keep.
And miles to go before I sleep.
- Robert Frost

Goals: We prove the Implicit Function Theorem as an application of the Inverse Function Theorem. Then, using the Implicit Function Theorem, we finally complete the proof of the Lagrange Multiplier Theorem.

35.1 On Keeping One's Word

Right before the second midterm, we proved the Lagrange Multiplier Theorem. Unfortunately, we had to assume one lemma without proof:

Theorem. For $C^1$ functions $g_1, g_2, \ldots, g_k$, the set
$$M = \left\{ x \in \mathbb{R}^n \;\middle|\; g_1(x) = g_2(x) = \ldots = g_k(x) = 0 \text{ and } \nabla g_1(x), \nabla g_2(x), \ldots, \nabla g_k(x) \text{ are linearly independent} \right\}$$
is an $(n - k)$-manifold.

Today, we make good on our promise. After many miles of intense mathematics, we prove this theorem as a corollary of the Implicit Function Theorem.
35.2 Intuition on Implicit Function Theorem

Recall way back in high school, when you tried to graph the circle
$$x^2 + y^2 = 4$$
using your trusty TI-89: you couldn't plot it directly. Instead, you solved for $y$ in terms of $x$ and got
$$y = \sqrt{4 - x^2} \quad \text{or} \quad y = -\sqrt{4 - x^2},$$
so you plugged each equation into your calculator separately. The full circle was the combination of the two graphs.

That's all familiar, but now let's look at this under a mathematical lens.

Consider the function
$$G(x, y) = x^2 + y^2 - 4.$$
The circle is simply the set of points where this function is $0$:
$$S = \left\{ \begin{pmatrix} x \\ y \end{pmatrix} \in \mathbb{R}^2 \;\middle|\; G\begin{pmatrix} x \\ y \end{pmatrix} = 0 \right\}.$$
For any point $\begin{pmatrix} a \\ b \end{pmatrix} \in S$, we want to write the $y$-coordinate as a function of $x$. But we have to consider two cases: for a point in the upper half of the circle, it is contained in the graph
$$H_1 = \left\{ \begin{pmatrix} x \\ f_1(x) \end{pmatrix} \;\middle|\; -2 \le x \le 2 \right\} \quad \text{where } f_1(x) = \sqrt{4 - x^2}.$$
If the point is in the lower half, it is contained in the graph
$$H_2 = \left\{ \begin{pmatrix} x \\ f_2(x) \end{pmatrix} \;\middle|\; -2 \le x \le 2 \right\} \quad \text{where } f_2(x) = -\sqrt{4 - x^2}.$$
Easy, right? This almost expresses the idea behind the Implicit Function Theorem. But because we are in Math 51H, we are going to add another spin.

As you may know, we are in love with open sets. So let's ask,

Is it true that any point $\begin{pmatrix} a \\ b \end{pmatrix} \in S$ is contained in some graph
$$H = \left\{ \begin{pmatrix} x \\ f(x) \end{pmatrix} \;\middle|\; x \in U \right\}$$
where $U$ is an open set?

Absolutely not! Let $\begin{pmatrix} x_0 \\ 0 \end{pmatrix}$ be one of the two points on the $x$-axis, say $x_0 = 2$. Consider any open interval $U$ containing $x_0$. If you tried to define a function $f(x)$, it would only give points on one side of $x_0$: complete and utter fail!¹

But why does it fail? Well, it's because at $\begin{pmatrix} x_0 \\ 0 \end{pmatrix}$ the graph has a turning point along the $y$ direction. If we look at the Jacobian
$$DG(x, y) = \begin{pmatrix} 2x & 2y \end{pmatrix}$$
and restrict the matrix to only the entry corresponding to differentiation with respect to the $y$ variable,
$$\begin{pmatrix} 2y \end{pmatrix},$$
we have a non-invertible matrix (since the matrix is $1 \times 1$, invertible just means the entry is a non-zero real number) precisely when $y = 0$.

In summary,

- Start with some set of points $S$ where the function $G$ is zero (the circle).
- For any point of $S$, we want to write $S$ locally as a graph on some open set (an open interval on the $x$-axis).
- We can do this if the Jacobian restricted to the graph variables is invertible (the derivative of $G$ with respect to $y$ is not zero).

This is the essence of the Implicit Function Theorem.

¹ This should remind you of how, in our proof that the unit circle is a 1-manifold, the two points on the x-axis required special treatment.
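As a tiny sanity check of this picture (my own sketch, not from the text), we can verify symbolically that $D_yG$ is invertible away from $y = 0$ and that the upper-half graph really does satisfy $G(x, h(x)) = 0$:

```python
import sympy as sp

x, y = sp.symbols('x y', real=True)
G = x**2 + y**2 - 4

DyG = sp.diff(G, y)                 # the 1x1 "Jacobian restricted to y": 2y
print(DyG)                          # 2*y, invertible exactly when y != 0

h = sp.sqrt(4 - x**2)               # the graph function for the upper half
print(sp.simplify(G.subs(y, h)))    # 0: G(x, h(x)) = 0 on the interval (-2, 2)
```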
35.3 Formalization

To give a formal description of the Implicit Function Theorem, we will need new notation. Consider a function with more inputs than outputs:
$$G : \mathbb{R}^n \to \mathbb{R}^m, \qquad n > m.$$
When we look at the inputs, we can consider the first $n - m$ variables and the remaining $m$ variables separately:
$$G\begin{pmatrix} x_1 \\ \vdots \\ x_{n-m} \\ x_{n-m+1} \\ \vdots \\ x_n \end{pmatrix}.$$
To make our lives easier, we relabel variables and denote the input by a concatenation of vectors $x \in \mathbb{R}^{n-m}$ and $y \in \mathbb{R}^m$:
$$G\begin{pmatrix} x_1 \\ \vdots \\ x_{n-m} \\ y_1 \\ \vdots \\ y_m \end{pmatrix} = G\begin{pmatrix} x \\ y \end{pmatrix}.$$
Using this notation, we can formally define a Jacobian restricted to the variables $x$, $y$:

Definition. Let $G : \mathbb{R}^n \to \mathbb{R}^m$ where $n > m$, and denote the input vector of $G$ as the concatenation $\begin{pmatrix} x \\ y \end{pmatrix}$ of $x \in \mathbb{R}^{n-m}$ and $y \in \mathbb{R}^m$.

The Jacobian restricted to the variable $x$ is the first $n - m$ columns of the Jacobian:
$$D_x G(q) = \begin{pmatrix} D_1 G(q) & D_2 G(q) & \ldots & D_{n-m} G(q) \end{pmatrix}.$$
The Jacobian restricted to the variable $y$ is the last $m$ columns of the Jacobian:
$$D_y G(q) = \begin{pmatrix} D_{n-m+1} G(q) & D_{n-m+2} G(q) & \ldots & D_n G(q) \end{pmatrix}.$$
Observe that $D_y G(q)$ is an $m \times m$ matrix, so it makes sense to talk about its inverse (if it exists).

We will also use sub-vector notation:
$$\big[x\big]_a^b = \begin{pmatrix} x_a \\ x_{a+1} \\ \vdots \\ x_b \end{pmatrix}.$$
Armed with the proper notation, let's describe the theorem:
For a function $G : \mathbb{R}^n \to \mathbb{R}^m$ where $n > m$, let $S$ be the set where $G$ is $\vec{0}$:
$$S = \left\{ \begin{pmatrix} x \\ y \end{pmatrix} \in \mathbb{R}^n \;\middle|\; G\begin{pmatrix} x \\ y \end{pmatrix} = \vec{0} \right\},$$
and consider an element $\begin{pmatrix} a \\ \vec{b} \end{pmatrix} \in S$. Suppose that
$$D_y G\begin{pmatrix} a \\ \vec{b} \end{pmatrix} \text{ is invertible.}$$
Then for some open set $V$ containing $\begin{pmatrix} a \\ \vec{b} \end{pmatrix}$, consider $S \cap V$: we can write this local region of $S$ as a graph¹ of a $C^1$ function $h$ over $U$,
$$S \cap V = \left\{ \begin{pmatrix} x \\ h(x) \end{pmatrix} \;\middle|\; x \in U \right\},$$
where $U$ is an open set in $\mathbb{R}^{n-m}$ containing $a$.

¹ If you don't see the connections to manifolds, you gotta go back to Lecture 26!
35.4 The Proof

Proving the Implicit Function Theorem is going to be easy. Why? Because we did all the work when we proved the Inverse Function Theorem! Generally,

Math Mantra: Don't reprove a theorem from scratch if you can build off previous work!

Much like the Mean Value Theorem is a cute application of Rolle's Theorem, the Implicit Function Theorem is a cute application of the Inverse Function Theorem. The key is to consider the function
$$f\begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} x \\ G\begin{pmatrix} x \\ y \end{pmatrix} \end{pmatrix}.$$
Note that $f$ maps $\begin{pmatrix} a \\ \vec{b} \end{pmatrix} \in G_{\mathrm{zero}}$ to the vector $a$ concatenated with the zero vector:
$$\begin{pmatrix} a \\ \vec{b} \end{pmatrix} \;\xmapsto{\;f\;}\; \begin{pmatrix} a \\ \vec{0} \end{pmatrix}.$$
So, if the inverse did exist, it would give us the reverse mapping
$$\begin{pmatrix} a \\ \vec{0} \end{pmatrix} \;\xmapsto{\;f^{-1}\;}\; \begin{pmatrix} a \\ \vec{b} \end{pmatrix}.$$
Stare at the first $n - m$ inputs and the last $m$ outputs of $f^{-1}$: this is exactly what we need, a mapping that inputs $a$ and spits out $\vec{b}$. Therefore, form the function
$$h(x) = \left[ f^{-1}\begin{pmatrix} x \\ \vec{0} \end{pmatrix} \right]_{n-m+1}^{n}.$$
Now, all we have to do is verify that this is our graph map. Luckily, this is going to be an easy consequence of the Inverse Function Theorem!
Theorem (Implicit Function Theorem). Let $G : \mathbb{R}^n \to \mathbb{R}^m$ where $n > m$, and denote the input vector of $G$ as the concatenation $\begin{pmatrix} x \\ y \end{pmatrix}$ of $x \in \mathbb{R}^{n-m}$ and $y \in \mathbb{R}^m$. Consider the set
$$G_{\mathrm{zero}} = \left\{ \begin{pmatrix} x \\ y \end{pmatrix} \in \mathbb{R}^n \;\middle|\; G\begin{pmatrix} x \\ y \end{pmatrix} = \vec{0} \right\}.$$
Then for any element $\begin{pmatrix} a \\ \vec{b} \end{pmatrix} \in G_{\mathrm{zero}}$, provided that $D_y G\begin{pmatrix} a \\ \vec{b} \end{pmatrix}$ is invertible, there exist a $C^1$ function $h$ and open sets $V \subseteq \mathbb{R}^n$, $U \subseteq \mathbb{R}^{n-m}$ such that $a \in U$ and
$$G_{\mathrm{zero}} \cap V = \left\{ \begin{pmatrix} x \\ h(x) \end{pmatrix} \;\middle|\; x \in U \right\}.$$

Proof Summary:

- Directly check that
  $$f\begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} x \\ G\begin{pmatrix} x \\ y \end{pmatrix} \end{pmatrix}$$
  satisfies the conditions of the Inverse Function Theorem.
- By the Inverse Function Theorem, there exist open sets $U^* \subseteq \mathbb{R}^n$, $V \subseteq \mathbb{R}^n$ such that $f^{-1} : U^* \to V$ exists.
- Define
  $$h(x) = \left[ f^{-1}\begin{pmatrix} x \\ \vec{0} \end{pmatrix} \right]_{n-m+1}^{n} \quad \text{and} \quad U = \left\{ [x]_1^{n-m} \;\middle|\; x \in U^* \right\}.$$
- Verify
  $$G_{\mathrm{zero}} \cap V = \left\{ \begin{pmatrix} x \\ h(x) \end{pmatrix} \;\middle|\; x \in U \right\}.$$
Proof: Let $\begin{pmatrix} a \\ \vec{b} \end{pmatrix} \in G_{\mathrm{zero}}$ such that $D_y G\begin{pmatrix} a \\ \vec{b} \end{pmatrix}$ is invertible. Consider the function
$$f\begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} x \\ G\begin{pmatrix} x \\ y \end{pmatrix} \end{pmatrix}.$$
First, we show that $f^{-1}$ exists by verifying the conditions of the Inverse Function Theorem. Immediately, we know that $f$ is $C^1$ since $G$ is $C^1$. Also, we can directly compute the Jacobian of $f$ as
$$Df\begin{pmatrix} x \\ y \end{pmatrix} =
\begin{pmatrix}
1 & 0 & \cdots & 0 & \cdots & 0 \\
0 & 1 & \cdots & 0 & \cdots & 0 \\
\vdots & & \ddots & & & \vdots \\
0 & 0 & \cdots & 1 & \cdots & 0 \\
D_1 G\begin{pmatrix} x \\ y \end{pmatrix} & D_2 G\begin{pmatrix} x \\ y \end{pmatrix} & \cdots & D_{n-m} G\begin{pmatrix} x \\ y \end{pmatrix} & \cdots & D_n G\begin{pmatrix} x \\ y \end{pmatrix}
\end{pmatrix}.$$
Condensed, this is just
$$Df\begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} I_{n-m} & 0 \\ D_x G\begin{pmatrix} x \\ y \end{pmatrix} & D_y G\begin{pmatrix} x \\ y \end{pmatrix} \end{pmatrix},$$
where $I_{n-m}$ and $0$ are the $(n - m) \times (n - m)$ identity matrix and the $(n-m) \times m$ zero matrix, respectively. Because of the identity block, when we compute the determinant of $Df$, we are forced to choose
$$i_1 = 1, \quad i_2 = 2, \quad \ldots, \quad i_{n-m} = n - m,$$
so in fact,
$$\det\left(Df\begin{pmatrix} a \\ \vec{b} \end{pmatrix}\right) = \det\left(D_y G\begin{pmatrix} a \\ \vec{b} \end{pmatrix}\right) \ne 0.$$
Thus, we can apply the Inverse Function Theorem: there exist open sets $U^* \subseteq \mathbb{R}^n$, $V \subseteq \mathbb{R}^n$ such that $f^{-1} : U^* \to V$ exists. Define
$$h(x) = \left[ f^{-1}\begin{pmatrix} x \\ \vec{0} \end{pmatrix} \right]_{n-m+1}^{n}$$
and take $U$ to be the first $n - m$ components of $U^*$:
$$U = \left\{ [x]_1^{n-m} \;\middle|\; x \in U^* \right\}.$$
Using a simple proof by contradiction, we can verify that $U$ is indeed open.

Now, all we need to show is
$$G_{\mathrm{zero}} \cap V = \left\{ \begin{pmatrix} x \\ h(x) \end{pmatrix} \;\middle|\; x \in U \right\}:$$

$\subseteq$: Let $\begin{pmatrix} c \\ \vec{d} \end{pmatrix} \in G_{\mathrm{zero}} \cap V$. Since $\begin{pmatrix} c \\ \vec{d} \end{pmatrix} \in G_{\mathrm{zero}}$,
$$f\begin{pmatrix} c \\ \vec{d} \end{pmatrix} = \begin{pmatrix} c \\ \vec{0} \end{pmatrix}.$$
Moreover, $\begin{pmatrix} c \\ \vec{d} \end{pmatrix} \in V$, so we can invert the map:
$$f^{-1}\begin{pmatrix} c \\ \vec{0} \end{pmatrix} = \begin{pmatrix} c \\ \vec{d} \end{pmatrix}.$$
But $\begin{pmatrix} c \\ \vec{0} \end{pmatrix} \in U^*$ implies $c \in U$, and so
$$\begin{pmatrix} c \\ h(c) \end{pmatrix} = \begin{pmatrix} c \\ \left[ f^{-1}\begin{pmatrix} c \\ \vec{0} \end{pmatrix} \right]_{n-m+1}^{n} \end{pmatrix} = \begin{pmatrix} c \\ \vec{d} \end{pmatrix}.$$

$\supseteq$: Let $c \in U$, and consider $\begin{pmatrix} c \\ h(c) \end{pmatrix}$. The key observation is to notice¹ that the first $n - m$ components of $f$'s output equal the first $n - m$ components of its input, so the same is true of $f^{-1}$; in fact,
$$f^{-1}\begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} x \\ \left[ f^{-1}\begin{pmatrix} x \\ y \end{pmatrix} \right]_{n-m+1}^{n} \end{pmatrix}.$$
Thus
$$f^{-1}\begin{pmatrix} c \\ \vec{0} \end{pmatrix} = \begin{pmatrix} c \\ \left[ f^{-1}\begin{pmatrix} c \\ \vec{0} \end{pmatrix} \right]_{n-m+1}^{n} \end{pmatrix} = \begin{pmatrix} c \\ h(c) \end{pmatrix}.$$
But $f^{-1} : U^* \to V$, so
$$\begin{pmatrix} c \\ h(c) \end{pmatrix} \in V.$$
Moreover, by definition of $f$,
$$f\begin{pmatrix} c \\ h(c) \end{pmatrix} = \begin{pmatrix} c \\ G\begin{pmatrix} c \\ h(c) \end{pmatrix} \end{pmatrix}.$$
But we also know
$$f\begin{pmatrix} c \\ h(c) \end{pmatrix} = \begin{pmatrix} c \\ \vec{0} \end{pmatrix},$$
therefore
$$\begin{pmatrix} c \\ G\begin{pmatrix} c \\ h(c) \end{pmatrix} \end{pmatrix} = \begin{pmatrix} c \\ \vec{0} \end{pmatrix}.$$
Equating components,
$$G\begin{pmatrix} c \\ h(c) \end{pmatrix} = \vec{0}.$$
Thus $\begin{pmatrix} c \\ h(c) \end{pmatrix} \in G_{\mathrm{zero}}$, and so
$$\begin{pmatrix} c \\ h(c) \end{pmatrix} \in G_{\mathrm{zero}} \cap V. \qquad \blacksquare$$

¹ The inverse of the identity $I(x) = x$ is just itself!
Now, we can finally fulfill our promise:

Theorem. For $C^1$ functions $g_1, g_2, \ldots, g_k$, the set
$$M = \left\{ x \in \mathbb{R}^n \;\middle|\; g_1(x) = g_2(x) = \ldots = g_k(x) = 0 \text{ and } \nabla g_1(x), \nabla g_2(x), \ldots, \nabla g_k(x) \text{ are linearly independent} \right\}$$
is an $(n - k)$-manifold.

Proof Summary:

- Define $G : \mathbb{R}^n \to \mathbb{R}^k$ as
  $$G\begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} g_1(x, y) \\ g_2(x, y) \\ \vdots \\ g_k(x, y) \end{pmatrix}$$
  for input variables $x \in \mathbb{R}^{n-k}$ and $y \in \mathbb{R}^k$.
- Compute $DG\begin{pmatrix} a \\ \vec{b} \end{pmatrix}$. There are $k$ linearly independent rows; thus, there must exist $k$ linearly independent columns. WLOG, assume the last $k$ columns are linearly independent.
- Apply the Implicit Function Theorem: there exist a $C^1$ function $h$ and open sets $V \subseteq \mathbb{R}^n$, $U \subseteq \mathbb{R}^{n-k}$ such that $a \in U$ and
  $$G_{\mathrm{zero}} \cap V = \left\{ \begin{pmatrix} x \\ h(x) \end{pmatrix} \;\middle|\; x \in U \right\}.$$
- Check that $G_{\mathrm{zero}}$ can be replaced by $M$.

Proof: By definition of a manifold, we need to consider an arbitrary point
$$\begin{pmatrix} a \\ \vec{b} \end{pmatrix} \in M,$$
where $a \in \mathbb{R}^{n-k}$, $\vec{b} \in \mathbb{R}^k$. Then, we need to show there exist open sets $U \subseteq \mathbb{R}^{n-k}$, $V \subseteq \mathbb{R}^n$ such that $a \in U$ and $M \cap V$ is a (permuted) graph of some function $h$ over $U$.

The key is to apply the Implicit Function Theorem. First, define $G : \mathbb{R}^n \to \mathbb{R}^k$,
$$G\begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} g_1(x, y) \\ g_2(x, y) \\ \vdots \\ g_k(x, y) \end{pmatrix}$$
for input variables $x \in \mathbb{R}^{n-k}$ and $y \in \mathbb{R}^k$. Then,
$$DG\begin{pmatrix} a \\ \vec{b} \end{pmatrix} = \begin{pmatrix} \big(\nabla g_1(a, \vec{b}\,)\big)^T \\ \big(\nabla g_2(a, \vec{b}\,)\big)^T \\ \vdots \\ \big(\nabla g_k(a, \vec{b}\,)\big)^T \end{pmatrix}.$$
By definition of $M$, this matrix has $k$ linearly independent rows. Furthermore, by our theorem on rank, there must exist $k$ linearly independent columns. Without loss of generality,¹ we can reorder the input variables of $G$ so that the last $k$ columns of $DG\begin{pmatrix} a \\ \vec{b} \end{pmatrix}$ are linearly independent. Thus, $D_y G\begin{pmatrix} a \\ \vec{b} \end{pmatrix}$ is invertible.

So by the Implicit Function Theorem, there exist a $C^1$ function $h$ and open sets $V \subseteq \mathbb{R}^n$, $U \subseteq \mathbb{R}^{n-k}$ such that $a \in U$ and
$$G_{\mathrm{zero}} \cap V = \left\{ \begin{pmatrix} x \\ h(x) \end{pmatrix} \;\middle|\; x \in U \right\}.$$
If we can replace $G_{\mathrm{zero}}$ by $M$, then we are done.

But recall that in our construction of $V$ in the original Inverse Function Theorem, we designed $D_y G\begin{pmatrix} x \\ y \end{pmatrix}$ to be non-singular on $V$. Therefore the rank of $D_y G\begin{pmatrix} x \\ y \end{pmatrix}$ is $k$ on $V$. This implies the full matrix
$$DG\begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} \big(\nabla g_1(x, y)\big)^T \\ \big(\nabla g_2(x, y)\big)^T \\ \vdots \\ \big(\nabla g_k(x, y)\big)^T \end{pmatrix}$$
has rank at least $k$ on $V$. But there are only $k$ rows, implying the rows are linearly independent. These rows are the $\nabla g_i(x, y)$, so the gradients are linearly independent on $V$, allowing us to conclude
$$M \cap V = \left\{ \begin{pmatrix} x \\ h(x) \end{pmatrix} \;\middle|\; x \in U \right\}. \qquad \blacksquare$$

¹ Reordering will give us a permuted graph, which is all we need. But to avoid the headache of carrying the permutation around, let's make this assumption.
New Notation

Symbol: $G\begin{pmatrix} x \\ y \end{pmatrix}$ — Reading: the function $G$ with a concatenated input formed from $x$, $y$ — Example: $G\begin{pmatrix} e_1 \\ e_2 \end{pmatrix}$ — Translation: the concatenation of the first two standard basis vectors inputted into $G$.
Symbol: $D_x G(q)$ — Reading: the Jacobian of $G$ restricted to $x$, evaluated at $q$ — Example: $D_x G(q)$ is invertible — Translation: the Jacobian of $G$ restricted to $x$ evaluated at $q$ is invertible.
Symbol: $[x]_a^b$ — Reading: the sub-vector of $x$ from $a$ to $b$ — Example: $[x]_1^3$ — Translation: the sub-vector formed from the first 3 components of $x$.
Simon's Secret Lecture 36

Proving FTA: An Analytic Way

When I read papers in the humanities, sometimes I see cool ideas and nod my head. I completely understand where the author is coming from. But only with mathematical proofs have I ever found myself in complete and total awe, wondering how these ideas could have been pulled out of the ether.

Goals: In the first of two optional lectures, we prove the Fundamental Theorem of Algebra. The proof will be purely analytic, relying on complex numbers and harmonic functions.

36.1 Journey to Another Plane: Preparations

When you studied polynomials in Algebra II, you especially focused on solving for the roots, i.e., the values $r$ such that
$$P(r) = 0.$$
But why this emphasis? Why should we care?

The more obvious, practical reason is that a real world phenomenon could be modeled by some polynomial $P(x)$. Solving for a particular value $b$ is equivalent to finding the root of
$$Q(x) = P(x) - b.$$
But we also have a theoretical reason. Remember,

Math Mantra: Suppose we can always rewrite our objects in some form that has additional structure. Then we can EXPLOIT this extra structure in our proofs.

The key idea is that we can use roots to factorize polynomials:
Lemma. If $r \in \mathbb{C}$ is a root of a monic polynomial
$$P(x) = x^n + a_{n-1}x^{n-1} + \ldots + a_1 x + a_0,$$
then
$$P(x) = (x - r)Q(x),$$
where $Q$ is an $(n - 1)$-degree monic polynomial.

Proof: Since
$$P(r) = 0,$$
we know
$$P(x) = P(x) - P(r).$$
Expanding the (RHS),
$$(x^n + a_{n-1}x^{n-1} + \ldots + a_1 x + a_0) - (r^n + a_{n-1}r^{n-1} + \ldots + a_1 r + a_0),$$
which we can regroup as
$$\big(x^n - r^n\big) + a_{n-1}\big(x^{n-1} - r^{n-1}\big) + \ldots + a_2\big(x^2 - r^2\big) + a_1\big(x - r\big) + \underbrace{(a_0 - a_0)}_{=0}.$$
But we can pull out $x - r$ from each of these terms using the infamous identity:¹
$$x^j - r^j = (x - r)\underbrace{\big(x^{j-1} + x^{j-2}r + x^{j-3}r^2 + \ldots + x^2 r^{j-3} + x\,r^{j-2} + r^{j-1}\big)}_{P_j(x)}.$$
Now we have
$$(x - r)P_n(x) + a_{n-1}(x - r)P_{n-1}(x) + \ldots + a_2(x - r)P_2(x) + a_1(x - r)P_1(x),$$
giving us
$$P(x) = (x - r)\underbrace{\big(P_n(x) + a_{n-1}P_{n-1}(x) + \ldots + a_1 P_1(x)\big)}_{Q(x)},$$
where $Q(x)$ is a degree $n - 1$ monic polynomial (since only $P_n$ contributes the $x^{n-1}$ term). $\blacksquare$

Suppose we can prove

Theorem (Fundamental Theorem of Algebra). Every non-constant polynomial has at least one complex² root.

Then, we can inductively apply the preceding lemma to rewrite every polynomial as a product
$$P(x) = (x - r_1)(x - r_2)\cdots(x - r_n).$$
AWESOME!

This important fact is used in tons of proofs. For example, recall the polynomial
$$x^2 + x + 41$$
that always spits out a prime number for
$$x = 1, 2, \ldots, 39.$$
Using the Fundamental Theorem of Algebra (FTA), you can easily prove that there does not exist a non-constant polynomial that spits out a prime for every integer input. But for this proof, you will need to wait. Eventually you will hear the SOUND-Kararajan of mathematics.

But how do we prove the FTA?

Math Mantra: You don't have to limit yourself to a single branch of Mathematics!

We are going to jump out of the Algebra zone and appeal to the gods of Analysis. But not just any Analysis. We will need Complex Analysis. But before we take a journey to another plane, we need to talk about harmonic functions.

¹ Expand it yourself to check!
² This is a VERY important distinction. For example, $x^2 + 1$ has NO real roots.
36.2 No Harm in Harmonics

First, we define an important class of functions:

Definition. Consider a $C^2$ function $f$ that maps into $\mathbb{C}$. We say that $f$ is harmonic if, at every point in its domain, the sum of the pure second derivatives is zero:
$$\sum_{j=1}^{n} D_j D_j f(x) = 0.$$
For example,
$$x^3 - 3xy^2$$
is a harmonic function.

Harmonic functions have a key property:

Recall that a continuous function over a closed and bounded set always achieves its extrema. If our function is also harmonic, the extrema over a closed and bounded set are always achieved on the boundary of the set.

Of course, this only makes sense if the values of the function over this set are real (how do you maximize a complex number?).

For this lecture, we only need to consider the case of a closed ball centered around $\vec{0}$. And as always, it suffices to prove the maximum case:

Theorem. Let $f : \overline{B}_R(\vec{0}\,) \to \mathbb{R}$ be continuous. If $f|_{B_R(\vec{0})}$ is harmonic, then the extrema of $f$ are achieved on the boundary of the ball (denoted $\partial B_R(\vec{0}\,)$). In other words, if an extremum is achieved at $a$, then
$$\|a\| = R.$$

Proof Summary:

- It suffices to show that for every $\epsilon > 0$,
  $$g(x) = f(x) + \epsilon\,\|x\|^2$$
  achieves its maximum on the boundary.
- Suppose $g$ achieves a maximum at $a$ where $a \notin \partial B_R(\vec{0}\,)$.
- Show
  $$\sum_{j=1}^{n} D_j D_j g(a) > 0 \quad \text{and} \quad \sum_{j=1}^{n} D_j D_j g(a) \le 0.$$
  - $\sum_{j=1}^{n} D_j D_j g(a) > 0$: directly compute $D_j D_j g(a)$.
  - $\sum_{j=1}^{n} D_j D_j g(a) \le 0$: define
    $$s(t) = g\begin{pmatrix} a_1 \\ \vdots \\ a_j + t \\ \vdots \\ a_n \end{pmatrix}$$
    and use 1D calculus to show that for all $j$,
    $$s''(0) = D_j D_j g(a) \le 0.$$
Proof: First, define an auxiliary function
$$g(x) = f(x) + \epsilon\,\|x\|^2$$
for an arbitrary $\epsilon > 0$. If we can prove that the maximum of $g$ over $\overline{B}_R(\vec{0}\,)$ is achieved on the boundary, then this implies that the maximum of $f$ is also achieved on the boundary. To see this, note that if $g$ achieves its maximum at $a \in \partial B_R(\vec{0}\,)$, then for all $x \in \overline{B}_R(\vec{0}\,)$,
$$\underbrace{f(a) + \epsilon R^2}_{g(a)} \ge \underbrace{f(x) + \epsilon\,\|x\|^2}_{g(x)} \ge f(x),$$
and since $\epsilon$ is arbitrary, we may take the limit as $\epsilon \to 0$ to get
$$f(a) \ge f(x).$$
Suppose that $g$ achieves a maximum at $a$ where $a \notin \partial B_R(\vec{0}\,)$. We will derive a contradiction by showing
$$\sum_{j=1}^{n} D_j D_j g(a) > 0 \quad \text{and} \quad \sum_{j=1}^{n} D_j D_j g(a) \le 0.$$

$\sum_{j=1}^{n} D_j D_j g(a) > 0$:

Directly differentiate
$$g(x) = f(x) + \epsilon\big(x_1^2 + x_2^2 + \ldots + x_n^2\big).$$
Then
$$D_j g(x) = D_j f(x) + \epsilon(2x_j)$$
and thus
$$D_j D_j g(x) = D_j D_j f(x) + 2\epsilon.$$
Summing across all $j$ and evaluating at $a$ yields
$$\sum_{j=1}^{n} D_j D_j g(a) = \sum_{j=1}^{n} D_j D_j f(a) + \sum_{j=1}^{n} 2\epsilon.$$
But $\epsilon > 0$ and $\sum_{j=1}^{n} D_j D_j f(a) = 0$ since $f$ is harmonic. Thus,
$$\sum_{j=1}^{n} D_j D_j g(a) > 0.$$

$\sum_{j=1}^{n} D_j D_j g(a) \le 0$:

We do our usual trick: build a single variable function and apply 1D Calculus. Define
$$s(t) = g\begin{pmatrix} a_1 \\ \vdots \\ a_j + t \\ \vdots \\ a_n \end{pmatrix}.$$
Because $g$ achieves a maximum at $a$, $s$ achieves a maximum at $0$. By 1D Calculus,
$$s''(0) \le 0.$$
But
$$s'(t) = \lim_{h \to 0}\frac{g\begin{pmatrix} a_1 \\ \vdots \\ a_j + t + h \\ \vdots \\ a_n \end{pmatrix} - g\begin{pmatrix} a_1 \\ \vdots \\ a_j + t \\ \vdots \\ a_n \end{pmatrix}}{h} = D_j g\begin{pmatrix} a_1 \\ \vdots \\ a_j + t \\ \vdots \\ a_n \end{pmatrix},$$
implying
$$s''(0) = \lim_{h \to 0}\frac{s'(h) - s'(0)}{h} = \lim_{h \to 0}\frac{D_j g\begin{pmatrix} a_1 \\ \vdots \\ a_j + h \\ \vdots \\ a_n \end{pmatrix} - D_j g\begin{pmatrix} a_1 \\ \vdots \\ a_j \\ \vdots \\ a_n \end{pmatrix}}{h} = D_j D_j g(a).$$
Thus
$$D_j D_j g(a) \le 0,$$
and summing over all $j$ yields
$$\sum_{j=1}^{n} D_j D_j g(a) \le 0. \qquad \blacksquare$$
One question you should be asking yourself is

What do harmonic functions have to do with the Fundamental Theorem of Algebra?

It isn't obvious. You can't just mindlessly apply the preceding lemma to polynomials because

For a single real variable $x$, generally a polynomial $P(x)$ is not harmonic.

This is because $P(x)$ is harmonic if and only if its second derivative is $0$. Unless the degree is less than three, this is absolutely untrue.

However,

For a complex variable $z$, a polynomial $P(z)$ is harmonic.¹

Namely, it is a harmonic function with respect to its real and imaginary variables! But you still can't use the preceding lemma with this fact: $P(z)$ maps into $\mathbb{C}$, not $\mathbb{R}$, so "maximum" doesn't even make sense!

Instead, we will use the fact that $P(z)$ is harmonic to prove

The real and imaginary parts of $\dfrac{1}{P(z)}$ are harmonic,

which is the lynchpin of the FTA!

36.3 Getting Complex in Here

In case you were too busy studying for Midterm 2, let's discuss some of the philosophy behind the first problem of Homework 8. But I highly recommend that, before going on, you go back and prove for yourself that $P(z)$ is indeed harmonic. It's a pretty cool result!

Consider a complex function
$$f(z)$$
where we take $z$ to be a complex input
$$z = x + iy$$
and the variables $x, y$ correspond to the real and imaginary components, respectively. We are going to think of $f(z)$ as a function from $\mathbb{R}^2$ to $\mathbb{C}$:
$$\tilde{f}(x, y) = f(x + iy).$$
For ease, we suppress the binary input and leave it as $f(z)$, with the convention that $z = x + iy$ and that $f$ is, in fact, a function of two variables.

The definition of harmonic boils down to showing
$$D_x D_x f(z) + D_y D_y f(z) = 0.$$
But because we are dealing with complex functions, we actually have an easier criterion.

For a complex function $f(z)$, think of its output as a sum of two functions:
$$f(z) = u(x, y) + iv(x, y),$$
where $u(x, y), v(x, y)$ correspond to the real and imaginary components of $f(z)$, respectively:

Definition. Let $f$ be a complex function,
$$f(z) = u(x, y) + iv(x, y),$$
where $u$ and $v$ are $C^2$. We say that $f$ satisfies the Cauchy-Riemann Equations on $S \subseteq \mathbb{R}^2$ if, for all $(x, y) \in S$,
$$\frac{\partial u}{\partial x}(x, y) = \frac{\partial v}{\partial y}(x, y), \qquad \frac{\partial u}{\partial y}(x, y) = -\frac{\partial v}{\partial x}(x, y).$$
It turns out that

Lemma. Let $f$ be a complex function with $C^2$ real and imaginary components $u$ and $v$, respectively. If $f$ satisfies the Cauchy-Riemann Equations on $S$, then $f$ is harmonic on $S$.

¹ You proved this on the first problem of Homework 8.
Proof: We simply apply the Cauchy-Riemann Equations and use the fact that for $C^2$ functions, mixed partials commute.

Differentiate with respect to $x$,
$$D_x f(z) = \frac{\partial u}{\partial x}(x, y) + i\frac{\partial v}{\partial x}(x, y),$$
and apply the Cauchy-Riemann Equations:
$$D_x f(z) = \frac{\partial v}{\partial y}(x, y) - i\frac{\partial u}{\partial y}(x, y).$$
Then differentiate again:
$$D_x D_x f(z) = \frac{\partial^2 v}{\partial x\,\partial y}(x, y) - i\frac{\partial^2 u}{\partial x\,\partial y}(x, y).$$
Likewise, when we differentiate with respect to $y$,
$$D_y f(z) = \frac{\partial u}{\partial y}(x, y) + i\frac{\partial v}{\partial y}(x, y),$$
we can apply the Cauchy-Riemann Equations to get
$$D_y f(z) = -\frac{\partial v}{\partial x}(x, y) + i\frac{\partial u}{\partial x}(x, y).$$
Thus,
$$D_y D_y f(z) = -\frac{\partial^2 v}{\partial y\,\partial x}(x, y) + i\frac{\partial^2 u}{\partial y\,\partial x}(x, y),$$
so
$$D_x D_x f(z) + D_y D_y f(z) = \frac{\partial^2 v}{\partial x\,\partial y}(x, y) - i\frac{\partial^2 u}{\partial x\,\partial y}(x, y) - \frac{\partial^2 v}{\partial y\,\partial x}(x, y) + i\frac{\partial^2 u}{\partial y\,\partial x}(x, y).$$
Since mixed partials commute,
$$D_x D_x f(z) + D_y D_y f(z) = 0. \qquad \blacksquare$$
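Here is a quick symbolic check of this criterion (my own sketch, not from the homework): take $P(z) = z^3$, split it into real and imaginary parts, and verify both the Cauchy-Riemann equations and harmonicity. Note that the real part is exactly the $x^3 - 3xy^2$ example from above.

```python
import sympy as sp

x, y = sp.symbols('x y', real=True)
z = x + sp.I * y
P = sp.expand(z**3)

u, v = sp.re(P), sp.im(P)          # u = x^3 - 3*x*y^2,  v = 3*x^2*y - y^3

# Cauchy-Riemann: u_x = v_y and u_y = -v_x
print(sp.simplify(sp.diff(u, x) - sp.diff(v, y)))   # 0
print(sp.simplify(sp.diff(u, y) + sp.diff(v, x)))   # 0

# Hence u and v (and P itself) are harmonic: the Laplacians vanish.
print(sp.simplify(sp.diff(u, x, 2) + sp.diff(u, y, 2)))   # 0
print(sp.simplify(sp.diff(v, x, 2) + sp.diff(v, y, 2)))   # 0
```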
This should all be familiar: on Homework 8, you used this criterion to prove that $P(z)$ is harmonic. We will use this criterion again to prove that $\frac{1}{P(z)}$ is also harmonic. From this, we will then conclude that the real and imaginary components of $\frac{1}{P(z)}$ are harmonic.

To minimize distractions and make this proof easier to understand, we are going to use the notation of Leon Simon's text. We drop the $(x, y)$ input symbol and use subscripts to denote partial derivatives. Don't forget that $u, v, S, T$ are functions!

Theorem. Let $P(z)$ be a polynomial and let¹
$$A = \{(x, y) \in \mathbb{R}^2 : P(x + iy) = 0\}.$$
Then define $S, T : \mathbb{R}^2 \setminus A \to \mathbb{R}$ to be the real and imaginary components, respectively, of $\frac{1}{P(z)}$:
$$\frac{1}{P(z)} = S(x, y) + iT(x, y).$$
Then $S$ and $T$ are harmonic on $\mathbb{R}^2 \setminus A$.

Proof Summary:

- It suffices to show that $\frac{1}{P(z)}$ is harmonic.
- For $P(z) = u + iv$, solve
  $$S = \frac{u}{u^2 + v^2}, \qquad T = \frac{-v}{u^2 + v^2}.$$
- Use the fact that $u, v$ satisfy the Cauchy-Riemann Equations to show that $S, T$ also satisfy the Cauchy-Riemann Equations.

Proof: To show that $S$ and $T$ are harmonic on $\mathbb{R}^2 \setminus A$, it suffices to show that $\frac{1}{P(z)}$ is harmonic on $\mathbb{R}^2 \setminus A$. This is because, for a complex function
$$f(z) = a + ib,$$
if
$$f_{xx} + f_{yy} = 0,$$
then expanding the partial derivatives gives
$$\big(a_{xx} + ib_{xx}\big) + \big(a_{yy} + ib_{yy}\big) = 0.$$
Regrouping,
$$\big(a_{xx} + a_{yy}\big) + \big(b_{xx} + b_{yy}\big)i = 0 + 0i.$$
Equating real and imaginary coefficients,
$$a_{xx} + a_{yy} = 0, \qquad b_{xx} + b_{yy} = 0.$$
Thus, $a$ and $b$ are harmonic.

Now we just need to check that $\frac{1}{P(z)}$ is harmonic by showing that $S, T$ satisfy the Cauchy-Riemann Equations.

On the homework, you already proved that the real and imaginary components of $P(z)$ satisfy these equations: for
$$P(z) = u + iv$$
we have
$$u_x = v_y, \qquad u_y = -v_x.$$
Rewrite $S$, $T$ in terms of $u$, $v$:
$$\frac{1}{P(z)} = \frac{1}{u + iv}.$$
By our old trick of multiplying by conjugates,
$$\frac{1}{u + iv} = \frac{1}{u + iv}\cdot\frac{u - iv}{u - iv} = \frac{u - iv}{u^2 + v^2} = \frac{u}{u^2 + v^2} - i\,\frac{v}{u^2 + v^2}.$$
Equating real and imaginary coefficients,
$$S = \frac{u}{u^2 + v^2}, \qquad T = \frac{-v}{u^2 + v^2}.$$
Checking that $\frac{1}{P(z)}$ is harmonic is now easy:

¹ In the actual proof of the Fundamental Theorem of Algebra, $A = \emptyset$. However, we include $A$ here to be rigorous.
$S_x = T_y$:

Apply the quotient and chain rules to compute
$$S_x = \frac{(u^2 + v^2)u_x - u(2uu_x + 2vv_x)}{(u^2 + v^2)^2} = \frac{(v^2 - u^2)u_x - (2uv)v_x}{(u^2 + v^2)^2},$$
$$T_y = \frac{-(u^2 + v^2)v_y + v(2uu_y + 2vv_y)}{(u^2 + v^2)^2} = \frac{(v^2 - u^2)v_y + (2uv)u_y}{(u^2 + v^2)^2}.$$
By the Cauchy-Riemann equations for $u, v$ (substitute $v_y = u_x$ and $u_y = -v_x$),
$$T_y = \frac{(v^2 - u^2)u_x - (2uv)v_x}{(u^2 + v^2)^2} = S_x.$$

$S_y = -T_x$:

Apply the quotient and chain rules to compute
$$S_y = \frac{(u^2 + v^2)u_y - u(2uu_y + 2vv_y)}{(u^2 + v^2)^2} = \frac{(v^2 - u^2)u_y - (2uv)v_y}{(u^2 + v^2)^2},$$
$$T_x = \frac{-(u^2 + v^2)v_x + v(2uu_x + 2vv_x)}{(u^2 + v^2)^2} = \frac{(v^2 - u^2)v_x + (2uv)u_x}{(u^2 + v^2)^2}.$$
By the Cauchy-Riemann equations for $u, v$ (substitute $v_x = -u_y$ and $u_x = v_y$),
$$-T_x = \frac{(v^2 - u^2)u_y - (2uv)v_y}{(u^2 + v^2)^2} = S_y. \qquad \blacksquare$$
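A symbolic spot-check of this theorem (my own sketch, not from the text): for the specific polynomial $P(z) = z^2 + 1$, compute $S$ and $T$ and verify that both Laplacians simplify to zero away from the zero set $A$.

```python
import sympy as sp

x, y = sp.symbols('x y', real=True)
z = x + sp.I * y
P = z**2 + 1                         # zeros at z = i and z = -i, i.e. (x, y) = (0, +-1)

S, T = (1 / P).as_real_imag()        # S = Re(1/P), T = Im(1/P)

lap_S = sp.simplify(sp.diff(S, x, 2) + sp.diff(S, y, 2))
lap_T = sp.simplify(sp.diff(T, x, 2) + sp.diff(T, y, 2))
print(lap_S, lap_T)                  # both 0: S and T are harmonic on R^2 \ A
```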
36.4 The Proof

The proof of the Fundamental Theorem of Algebra will be by contradiction. And the final punchline is that we will show, for
$$\frac{1}{P(z)} = S(x, y) + iT(x, y),$$
that
$$S(x, y) = T(x, y) = 0.$$
This would mean
$$\frac{1}{P(z)} = 0,$$
which is complete and utter nonsense: a reciprocal is never zero!

The way that we reach this contradiction is, visually, very beautiful.

First,¹ for an arbitrary $\epsilon > 0$, we show that there is some $R_0$ such that for all $|z| > R_0$,
$$\left|\frac{1}{P(z)}\right| < \epsilon.$$
Geometrically, this means that outside a ball of radius $R_0$ centered at the origin, $\frac{1}{P(z)}$'s magnitude is bounded by $\epsilon$.

But $\frac{1}{P(z)}$'s magnitude is greater than (or equal to) the magnitudes of its real and imaginary parts. So in fact, we have for all $\|(x, y)\| > R_0$,
$$|S| < \epsilon \quad \text{and} \quad |T| < \epsilon.$$
In other words, outside a ball of radius $R_0$ centered at the origin, $S$'s and $T$'s magnitudes are also bounded by $\epsilon$.

Let's focus on $S$. Consider any ball with radius $R$ greater than $R_0$. We know that $S$ is harmonic, so the extrema of $S$ over this ball must occur on the boundary. Therefore the magnitudes of the maximum value $M$ and minimum value $m$ of $S$ over this ball must be less than $\epsilon$ (since they are attained on the boundary, which lies outside $B_{R_0}$):
$$|M| < \epsilon, \qquad |m| < \epsilon.$$
In fact, for all $(x, y) \in \overline{B}_R(\vec{0}\,)$,
$$-\epsilon < m \le S(x, y) \le M < \epsilon,$$
i.e.
$$|S(x, y)| < \epsilon \quad \text{for all } (x, y) \in \overline{B}_R(\vec{0}\,),$$
meaning that we can plug up this white hole and conclude that the magnitude is bounded by $\epsilon$ everywhere:
$$|S(x, y)| < \epsilon \quad \text{for all } (x, y) \in \mathbb{R}^2.$$
Likewise we can show
$$|T(x, y)| < \epsilon \quad \text{for all } (x, y) \in \mathbb{R}^2.$$
Since $\epsilon$ was arbitrary, we must have
$$S(x, y) = T(x, y) = 0,$$
which, as aforementioned, is a big no-no.

The preceding schematic is correct, but we have to be really careful. One of my favorite Leon Simonisms is

Math Mantra: You must always know the TYPE OF ANIMAL you are working with.

I used the word "magnitude" and expressions like
$$|P(z)|, \qquad \left|\frac{1}{P(z)}\right|,$$
but $|\ldots|$ is not the normal absolute value symbol! This is because the argument is a complex number!

For a complex number, we overload notation and define a new norm:

Definition. The complex norm of $z = a + ib$ is
$$|z| = \sqrt{a^2 + b^2}.$$
Remember the discussion on metrics all the way back in Lecture 1? The complex norm can indeed be used to form a metric on $\mathbb{C}$:
$$d(z_1, z_2) = |z_1 - z_2|.$$
In particular, we are going to use the properties:

Lemma. The complex norm satisfies

- Triangle inequality:
  $$|z_1 + z_2| \le |z_1| + |z_2|.$$
- Product property:
  $$|z_1 z_2| = |z_1|\,|z_2|.$$
- The absolute values of the real and imaginary parts are bounded by the full norm: for $z = a + ib$ we have
  $$\max\{|a|, |b|\} \le |z|.$$

After 36 lectures of 51H, these should all be easy to check.

Now that we understand all the pieces, let's write out this legendary proof:

¹ Recall that $\frac{1}{P(z)}$ is only defined on $\mathbb{R}^2 \setminus A$, but in the actual proof, we will assume for a contradiction that $P(z)$ has no zeros. Hence, $A = \emptyset$.
Theorem (Fundamental Theorem of Algebra). Every non-constant polynomial has at least one complex root.

Proof Summary:

- Suppose $P(z)$ is never zero. Show that
  $$\left|\frac{1}{P(z)}\right| < \epsilon$$
  for all $|z| > R_0$, where
  $$R_0 = \max\left\{1,\; 2\big(|a_{n-1}| + \ldots + |a_2| + |a_1| + |a_0|\big),\; \sqrt[n]{\tfrac{2}{\epsilon}}\right\}.$$
- Then for all $\|(x, y)\| > R_0$,
  $$|S(x, y)| < \epsilon, \qquad |T(x, y)| < \epsilon,$$
  where $\frac{1}{P(z)} = S(x, y) + iT(x, y)$.
- Let $R > R_0$. Use the fact that $S, T$ are harmonic on $B_R(\vec{0}\,)$ to show
  $$|S(x, y)| < \epsilon, \qquad |T(x, y)| < \epsilon$$
  for all $\|(x, y)\| \le R$.
- Conclude that
  $$S(x, y) = 0, \qquad T(x, y) = 0,$$
  giving us
  $$\left|\frac{1}{P(z)}\right| = 0,$$
  a contradiction.

Proof: Suppose not. Then $P(z)$ is never zero and, in fact, $\frac{1}{P(z)}$ is always defined. (We may assume $P$ is monic of degree $n \ge 1$, since dividing by the leading coefficient changes neither the zeros nor the claim.)

Let $\epsilon > 0$. First, we need to find an $R_0 > 0$ such that for all $|z| > R_0$,
$$\left|\frac{1}{P(z)}\right| < \epsilon.$$
This condition is equivalent to showing
$$|P(z)| > \frac{1}{\epsilon}.$$
Starting from the left, expand
$$|P(z)| = \left|z^n + \big(a_{n-1}z^{n-1} + \ldots + a_2 z^2 + a_1 z + a_0\big)\right|.$$
Then, apply the reverse triangle inequality to get
$$\left|z^n + \big(a_{n-1}z^{n-1} + \ldots + a_1 z + a_0\big)\right| \ge |z^n| - \left|a_{n-1}z^{n-1} + \ldots + a_2 z^2 + a_1 z + a_0\right|.$$
Using multiple applications of the normal triangle inequality on the subtracted term, we get a smaller bound:
$$\ge |z^n| - \big(|a_{n-1}z^{n-1}| + \ldots + |a_2 z^2| + |a_1 z| + |a_0|\big).$$
By repeatedly applying the product property of norms, this bound is equal to
$$|z|^n - \big(|a_{n-1}|\,|z|^{n-1} + \ldots + |a_2|\,|z|^2 + |a_1|\,|z| + |a_0|\big).$$
For the next step, add the restriction that
$$|z| > 1.$$
This will allow us to shrink the lower bound by increasing the powers of $|z|$ in the subtracted term. We could increase each $|z|^i$ to $|z|^n$, but you can check this leads to a possible divide-by-zero case. Instead, we shrink the bound by increasing each $|z|^i$ to $|z|^{n-1}$:
$$\ge |z|^n - \big(|a_{n-1}|\,|z|^{n-1} + \ldots + |a_2|\,|z|^{n-1} + |a_1|\,|z|^{n-1} + |a_0|\,|z|^{n-1}\big) = |z|^n\left(1 - \frac{|a_{n-1}| + \ldots + |a_2| + |a_1| + |a_0|}{|z|}\right).$$
Further restricting
$$|z| > 2\big(|a_{n-1}| + \ldots + |a_2| + |a_1| + |a_0|\big),$$
we can shrink our bound to
$$|z|^n\left(1 - \frac{|a_{n-1}| + \ldots + |a_2| + |a_1| + |a_0|}{2\big(|a_{n-1}| + \ldots + |a_2| + |a_1| + |a_0|\big)}\right) = \frac{|z|^n}{2}.$$
Finally, we know
$$\frac{|z|^n}{2} > \frac{1}{\epsilon}$$
if we have the restriction
$$|z| > \sqrt[n]{\frac{2}{\epsilon}}.$$
In summary, if we choose
$$R_0 = \max\left\{1,\; 2\big(|a_{n-1}| + \ldots + |a_2| + |a_1| + |a_0|\big),\; \sqrt[n]{\tfrac{2}{\epsilon}}\right\},$$
then for all $|z| > R_0$,
$$|P(z)| > \frac{1}{\epsilon},$$
which is the same as
$$\left|\frac{1}{P(z)}\right| < \epsilon.$$
Now, for
$$\frac{1}{P(z)} = S(x, y) + iT(x, y)$$
we immediately have, for all $\|(x, y)\| > R_0$,
$$|S(x, y)| < \epsilon, \qquad |T(x, y)| < \epsilon.$$
Let $R > R_0$. Because we proved $S$ is harmonic, we know that $S$ achieves its extrema on the boundary of $B_R(\vec{0}\,)$. Thus, for minimum $m$ and maximum $M$, there exist points $(x_{\min}, y_{\min})$ and $(x_{\max}, y_{\max})$ such that
$$S(x_{\min}, y_{\min}) = m, \qquad S(x_{\max}, y_{\max}) = M$$
and
$$\|(x_{\min}, y_{\min})\| = R, \qquad \|(x_{\max}, y_{\max})\| = R.$$
Of course,
$$\|(x_{\min}, y_{\min})\| > R_0, \qquad \|(x_{\max}, y_{\max})\| > R_0,$$
implying
$$|m| = |S(x_{\min}, y_{\min})| < \epsilon, \qquad |M| = |S(x_{\max}, y_{\max})| < \epsilon.$$
So in fact, for all $(x, y) \in \overline{B}_R(\vec{0}\,)$,
$$-\epsilon < m \le S(x, y) \le M < \epsilon.$$
Since $R > R_0$ was arbitrary, this means, for all $(x, y) \in \mathbb{R}^2$,
$$|S(x, y)| < \epsilon.$$
Likewise, for all $(x, y) \in \mathbb{R}^2$, we can show
$$|T(x, y)| < \epsilon.$$
Since $\epsilon$ was arbitrary,
$$S(x, y) = 0, \qquad T(x, y) = 0,$$
giving us
$$0 = \left|\frac{1}{P(z)}\right| = \frac{1}{|P(z)|}.$$
This is a contradiction, since the reciprocal of a positive real number (in particular $1/|P(z)|$) is never zero. $\blacksquare$
The complex universe is a very rich field (pun intended), and this proof demonstrates that fact. Namely, we delved into the complex world to extract two real valued functions $S$ and $T$. Then, we applied our theory from the real universe to these functions. To quote Professor Maydanskiy,

Math Mantra: If you want to prove a fact about the reals, you can delve into the complex plane to extract the information you need.

You will see this remarkable strategy again in Math 106 and Math 116.
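As a numerical aside (my own sketch, not from the text), we can watch the lower bound $|P(z)| \ge |z|^n/2$ kick in for a sample polynomial once $|z|$ exceeds the radius $R_0$ constructed in the proof; the value of $\epsilon$ below is chosen only to pin down $R_0$.

```python
import numpy as np

# P(z) = z^3 + 2z^2 - z + 5, given by its coefficients (highest degree first).
coeffs = [1, 2, -1, 5]
n = len(coeffs) - 1
P = lambda z: np.polyval(coeffs, z)

eps = 0.01
R0 = max(1.0, 2 * sum(abs(c) for c in coeffs[1:]), (2 / eps) ** (1 / n))

rng = np.random.default_rng(1)
for _ in range(1000):
    r = R0 * (1 + rng.uniform(0, 3))            # a random radius strictly beyond R0
    theta = rng.uniform(0, 2 * np.pi)
    z = r * np.exp(1j * theta)
    assert abs(P(z)) > r**n / 2                 # the lower bound from the proof
    assert abs(1 / P(z)) < eps                  # hence |1/P(z)| < eps out there
print("bound verified on all samples, R0 =", R0)
```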
Simon's Secret Lecture 37

An Animal Farm of Infinities

All infinite sets are infinite,
but some sets are more infinite than others.
- George Orwell (paraphrased)

Goals: In the final optional lecture, we discuss the notion of size for infinite sets. Particularly, we notice that in the finite case, two sets have the same size if there exists a bijection between them. This inspires us to define an extension of "same size" between infinite sets. We then proceed to prove the classic results that the rationals are countable and the reals are uncountable. Finally, we end the lecture with the statement of the Continuum Hypothesis.

37.1 A Little More Complicated than it Looks

An underlying theme in Math 51H (and generally, Calculus) is that we are working with infinity. Primarily, we have captured infinity through limits, which we defined with the word "arbitrary":

- $N$ can be arbitrarily big.
- $\epsilon$ can be arbitrarily small.
- Consider an arbitrary union of sets.
- The sequence gets arbitrarily close to some limit.

However, like the novel Animal Farm, infinity is a lot more complicated than it seems.

Consider, for example, the natural numbers:
$$1, \quad 2, \quad 3, \quad 4, \quad 5, \quad 6, \quad \ldots$$
There are infinitely many natural numbers, and in between each consecutive pair $n$ and $n + 1$, there are infinitely many real numbers. In fact, there are more reals than natural numbers. However, there are also infinitely many rational numbers between each consecutive pair. Yet the set of rationals has the same size as the set of natural numbers.

Before we can define size for infinite sets, we need to talk about bijections.
37.2 Being a Bit Bijective
Suppose the function f is 1 : 1 on a finite set X. Notice¹ that the image f(X) has the same size as X.
In particular, if
f(X) = A
then X has the same size as A. In fact, if we are given f : X → A, we automatically know f(X) ⊆ A. Therefore, to show f(X) = A, we just have to show A ⊆ f(X).
Definition. For f : X → A, we say f maps X onto A if
A ⊆ f(X).
If this function is also 1 : 1, we call it a bijection.
One of the key ideas about bijections is
Math Mantra: To count the size of a certain set, produce a bijection with a
set that is easier to count.
¹Formally, we can show this via induction, but I shall spare you the proof.
This is a pretty cool proof technique, so it's worth giving a few classic examples.
The first is an alternate proof that the number of subsets of a set of size n is 2^n.
Theorem. The number of subsets of a finite set
S = {x_1, x_2, ..., x_n}
is 2^n.
Proof: Consider the set P(S) of all subsets of S and the set B of binary n-tuples (i.e., n-tuples that have only 1 and 0 as elements). Define the mapping
f : B → P(S)
where
f(a_1, a_2, ..., a_n) = {x_i | a_i = 1}.
For example,
f(1, 0, 0, 1, 1) = {x_1, x_4, x_5}.
We can show that our f is a bijection:
1 : 1
Let a, b ∈ B,
a = (a_1, a_2, ..., a_n)
b = (b_1, b_2, ..., b_n)
and
f(a) = f(b).
Suppose a ≠ b. Then there is some component a_i ≠ b_i. WLOG, say a_i = 1, b_i = 0. Then,
x_i ∈ f(a)
x_i ∉ f(b).
So f(b) ≠ f(a), a contradiction.
Onto
Let s ∈ P(S),
s = {x_{n_1}, x_{n_2}, ..., x_{n_k}}.
We need to find some b ∈ B such that
f(b) = s.
But we can choose
b = (b_1, b_2, ..., b_n)
where
b_i = 1 if i = n_j for some j, and b_i = 0 otherwise.
Thus, P(S) ⊆ f(B). Since the other inclusion is clear, P(S) = f(B).
Since the size of B is 2^n, the size of P(S) is 2^n as well, so the number of subsets of S is 2^n.
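Since the proof is really an algorithm, here is a small computational sketch (my own illustration, not from the lecture): it builds the map f from binary n-tuples to subsets and confirms that distinct tuples give distinct subsets, so a 4-element set has 2^4 = 16 subsets. The element names x1, ..., x4 are placeholders.

```python
from itertools import product

# Sketch of the bijection in the proof above: a binary n-tuple (a_1, ..., a_n)
# corresponds to the subset {x_i : a_i = 1}.  Element names are placeholders.

def f(bits, elements):
    """Map a binary tuple to the subset it encodes."""
    return {x for x, a in zip(elements, bits) if a == 1}

n = 4
S = [f"x{i}" for i in range(1, n + 1)]
subsets = [f(bits, S) for bits in product([0, 1], repeat=n)]

# Distinct tuples give distinct subsets (1:1), and the 2^n distinct images
# exhaust the power set, so S has exactly 2^n subsets.
assert len(subsets) == 2 ** n
assert all(subsets[i] != subsets[j] for i in range(len(subsets)) for j in range(i))
print(f"a set with {n} elements has {len(subsets)} subsets = 2^{n}")
```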
In the second example, we calculate the number of ways to break an integer n into a sum of positive integers. Formally, for a given n, it counts the number of tuples
(a_1, a_2, ..., a_k)
with k ≥ 1 and each a_i > 0 such that
a_1 + a_2 + ⋯ + a_k = n.
For example, the number 4 can be broken up into 8 such tuples:
(1, 1, 1, 1)   (1, 1, 2)   (1, 3)   (4)
               (1, 2, 1)   (2, 2)
               (2, 1, 1)   (3, 1)
Theorem. For a positive integer n, consider the set P_n of all ordered sequences
(a_1, a_2, ..., a_k)
with k ≥ 1 and each a_i > 0 such that
a_1 + a_2 + ⋯ + a_k = n.
There are 2^{n−1} elements in this set.
Proof Summary:
Define
f : B → P_n
where
f(x_1, x_2, ..., x_{n−1}) = (1 *_1 1 *_2 1 *_3 ⋯ *_{n−1} 1)
and
*_i = "," if x_i = 1, and "+" if x_i = 0.
1 : 1
Suppose not: f(a) = f(b) but a ≠ b. Let N be the first time a_i ≠ b_i. WLOG assume
a_N = 1, b_N = 0.
If N is the first occurrence of 1, the first component of f(b) is strictly bigger than the first component of f(a).
If N is the m-th occurrence of 1 where m > 1, the m-th component of f(b) is strictly bigger than the m-th component of f(a).
Onto
The (n − 1)-tuple corresponding to
0 0 ... 0 (a_1 − 1 zeros)  1  0 0 ... 0 (a_2 − 1 zeros)  1  ⋯  1  0 0 ... 0 (a_k − 1 zeros)
maps to (a_1, a_2, ..., a_k).
Proof: Let B be the set of binary (n − 1)-tuples and define the function
f : B → P_n
by
f(x_1, x_2, ..., x_{n−1}) = (1 *_1 1 *_2 1 *_3 ⋯ *_{n−1} 1)
where
*_i = "," if x_i = 1, and "+" if x_i = 0.
For example, in the case n = 5,
f(1, 0, 1, 0) = (1, 1 + 1, 1 + 1) = (1, 2, 2).
Even though f involves mapping to some strange symbols, don't be afraid! It's extremely intuitive if you think of it in computer science terms. You begin with a single 1 on the first line. Then, keep hitting the 0 button until that 1 grows to the desired size a_1. Then hit 1 to break off the part a_1 and play the same game with the second line.
1 : 1
Let a, b ∈ B,
a = (a_1, a_2, ..., a_{n−1})
b = (b_1, b_2, ..., b_{n−1})
and
f(a) = f(b).
Suppose a ≠ b. Define N to be the first component where a and b differ:
a_i = b_i for i < N
a_N ≠ b_N
WLOG, let
a_N = 1
b_N = 0.
We now show a contradiction, and we split into cases to build intuition:¹
N is the position of the first occurrence of 1 in a.
Compare the number of zeros to the left of position N:
a = 0 0 0 ... 0 1 ...
b = 0 0 0 ... 0 0 ...
(the displayed last digit sits in position N)
Since there are N − 1 zeros to the left of a_N, the first component of f(a) is N. But there are also N − 1 zeros to the left of b_N as well as an additional zero at position N. Therefore, the first component of f(b) is strictly greater than N, a contradiction.
N is the position of the m-th occurrence of 1 in a where m > 1.
The (m − 1)-th occurrence of 1 must occur at some position J < N in both a and b:
a = ... 1 0 0 ... 0 1 ...
b = ... 1 0 0 ... 0 0 ...
(the 1 on the left sits in position J; the displayed last digit sits in position N)
Since there are N − J − 1 zeros between the (m − 1)-th and m-th occurrences of 1, the m-th² component of f(a) is N − J. However, the m-th component of f(b) is strictly greater than N − J, a contradiction.

¹We could just consider a single case for m ≥ 1 and the proof would be correct, but even after 37 chapters, this is still a book for underdogs.
²Not the (m − 1)-th! Be careful. Since the first component comes after zero commas and the second component comes after the first comma, the m-th component comes after the (m − 1)-th comma.
Onto
Let
(a_1, a_2, ..., a_k) ∈ P_n.
Then the (n − 1)-tuple
0 0 ... 0 (a_1 − 1 zeros)  1  0 0 ... 0 (a_2 − 1 zeros)  1  ⋯  0 0 ... 0 (a_k − 1 zeros)
maps to (a_1, a_2, ..., a_k). Formally, for
x_i = 1 if i = a_1, or i = a_1 + a_2, ..., or i = a_1 + a_2 + ⋯ + a_{k−1}
x_i = 0 otherwise
we have
f(x_1, x_2, ..., x_{n−1}) = (a_1, a_2, ..., a_k).
Thus, f(B) ⊇ P_n.
Since the size of B is 2^{n−1}, there are 2^{n−1} elements in P_n.
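Again, the bijection is easy to animate in code. The sketch below (my own naming, not from the text) reads a binary (n − 1)-tuple left to right, treating 1 as a comma and 0 as a plus sign, and checks that the 2^{n−1} tuples produce exactly 2^{n−1} distinct compositions of n.

```python
from itertools import product

# Sketch of the bijection from binary (n-1)-tuples to compositions of n:
# a 0 acts like "+" (grow the current part) and a 1 acts like "," (close it),
# exactly the rule used in the proof above.

def composition_from_bits(bits):
    parts, current = [], 1
    for x in bits:            # one slot between each consecutive pair of 1's
        if x == 1:            # comma: close the current part
            parts.append(current)
            current = 1
        else:                 # plus: grow the current part
            current += 1
    parts.append(current)
    return tuple(parts)

n = 5
comps = {composition_from_bits(bits) for bits in product([0, 1], repeat=n - 1)}
assert all(sum(c) == n for c in comps)   # every image really is a composition of n
assert len(comps) == 2 ** (n - 1)        # distinct tuples give distinct compositions
print(sorted(comps))                     # the 16 compositions of 5
```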
If we can produce a bijection between two finite sets, then they must have the same size. That's great! But what about infinite sets?
37.3 Counting on Countability
We are inspired to make an infinite extension:
Definition. We say that two sets A and B have the same cardinality (i.e. the same size) if we can produce a bijection between A and B.
In particular,
Definition. For an infinite set A, if A has the same cardinality as the natural numbers, then we say A is countable.
For example, we know the positive evens, positive odds, and the integers are countable since we can respectively construct the invertible maps
E(n) = 2n
O(n) = 2n − 1
Z(n) = m if n = 2m, and −(m − 1) if n = 2m − 1.
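As a quick sanity check, here is how these three maps look when you evaluate them (a tiny illustration; the names E, O, Z are just the ones used above, and the Python code is my own):

```python
# Sketch of the three enumerations above; n ranges over the naturals 1, 2, 3, ...
E = lambda n: 2 * n                                    # positive evens: 2, 4, 6, ...
O = lambda n: 2 * n - 1                                # positive odds:  1, 3, 5, ...
Z = lambda n: n // 2 if n % 2 == 0 else -(n - 1) // 2  # integers: 0, 1, -1, 2, -2, ...

print([E(n) for n in range(1, 6)])   # [2, 4, 6, 8, 10]
print([O(n) for n in range(1, 6)])   # [1, 3, 5, 7, 9]
print([Z(n) for n in range(1, 8)])   # [0, 1, -1, 2, -2, 3, -3]
```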
Don't think of this as anything new. You've worked with mappings defined on N a million times. This is because a sequence
(a_n)
is a mapping from N. Therefore, a bijection from N onto A is really just a sequence that enumerates each element of A precisely once. This is exactly the reason why such a set is called countable.
The classic result you need to know about countable sets is that
The rationals are countable.
To prove this fact, you can think of each rational number (in reduced form) as a pair of natural numbers:
p/q ↦ (p, q)
As an easy exercise, you can show that an infinite subset of a countable set is still countable (this is merely a subsequence)! Therefore, to prove that the rationals are countable it suffices to prove that N × N is countable.
This is the fastest and cutest proof of this fact:
Theorem. N × N is countable.
Proof: Since the inverse of a bijection is still a bijection, it suffices to find a bijection from N × N to N. Consider
f(i, j) = 2^{i−1}(2j − 1).
Now we check
1 : 1
Suppose
2^{i_1 − 1}(2j_1 − 1) = 2^{i_2 − 1}(2j_2 − 1).
Notice that (2j_1 − 1), (2j_2 − 1) are odd numbers and thus contain no power of 2. By the Fundamental Theorem of Arithmetic, we know the powers of 2 are equal:
i_1 − 1 = i_2 − 1
implying
i_1 = i_2.
Dividing out 2^{i_1 − 1}, we get
2j_1 − 1 = 2j_2 − 1
thus,
j_1 = j_2.
Onto
First note that
f(1, 1) = 1.
For n ≥ 2, by the Fundamental Theorem of Arithmetic, n has the prime factorization
n = 2^{α_1} · p_2^{α_2} ⋯ p_s^{α_s}
where p_2^{α_2} ⋯ p_s^{α_s} is odd. Choosing
i = α_1 + 1
j = (p_2^{α_2} ⋯ p_s^{α_s} + 1)/2
we have
f(i, j) = n.
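The "factor out the largest power of 2" idea in the onto step doubles as an algorithm for inverting f. Here is a short sketch (my own code, not from the lecture) that recovers (i, j) from n and checks f(i, j) = n for the first 64 natural numbers.

```python
# Sketch of the bijection f(i, j) = 2^(i-1) * (2j - 1) and its inverse, which
# recovers (i, j) by stripping off the power of 2, as in the "onto" argument.

def f(i, j):
    return 2 ** (i - 1) * (2 * j - 1)

def f_inverse(n):
    alpha = 0
    while n % 2 == 0:                # factor out the power of 2
        n //= 2
        alpha += 1
    return alpha + 1, (n + 1) // 2   # i = alpha + 1, j = (odd part + 1) / 2

# Every n in 1..64 comes from exactly one pair (i, j):
for n in range(1, 65):
    i, j = f_inverse(n)
    assert f(i, j) == n

print(sorted(f(i, j) for i in range(1, 4) for j in range(1, 4)))   # sample values of f
```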
Here, our choice of f relied more on arithmetic intuition than physical intuition. A more classic proof that N × N is countable is to visualize it as a grid of pairs
(1, 1)  (1, 2)  (1, 3)  (1, 4)  ...
(2, 1)  (2, 2)  (2, 3)  (2, 4)  ...
(3, 1)  (3, 2)  (3, 3)  (3, 4)  ...
(4, 1)  (4, 2)  (4, 3)  (4, 4)  ...
and count along diagonals:
1 2 4 7
3 5 8
6 9
10
It's a very simple idea; in fact, we can cook up an explicit mapping:
Theorem. N × N is countable.
Proof Summary:
Define
f(i, j) = (i + j − 2)(i + j − 1)/2 + i
Onto
Substitute
k = i + j
and rewrite f using Gauss' formula.
For any n ∈ N, choose a particular k such that
1 + 2 + ⋯ + (k − 2) < n ≤ 1 + 2 + ⋯ + (k − 2) + (k − 1)
f(i, j) maps to n, where
i = n − (1 + 2 + ⋯ + (k − 2))
j = k − i
1 : 1
Suppose
(i_1 + j_1 − 2)(i_1 + j_1 − 1)/2 + i_1 = (i_2 + j_2 − 2)(i_2 + j_2 − 1)/2 + i_2.
It suffices to show that f maps to the same diagonal:
i_1 + j_1 = i_2 + j_2.
Derive contradiction using Gauss' formula.
Proof: Define the mapping
f(i, j) = (i + j − 2)(i + j − 1)/2 + i.
Onto
The key is to notice that along any diagonal, the sum i + j is constant. To make life easier, define a new variable
k = i + j.
Then,
f(i, j) = (k − 2)(k − 1)/2 + i
where 1 ≤ i ≤ k − 1.
But lo and behold, what is this? The fraction is an application of Gauss' formula:
1 + 2 + 3 + ⋯ + n = n(n + 1)/2.
Therefore,
f(i, j) = 1 + 2 + ⋯ + (k − 2) + i
where 1 ≤ i ≤ k − 1.
For any n ∈ N, choose a particular k such that
1 + 2 + ⋯ + (k − 2) < n ≤ 1 + 2 + ⋯ + (k − 2) + (k − 1).
Then, for the choice¹
i = n − (1 + 2 + ⋯ + (k − 2))
j = k − i
we have
f(i, j) = 1 + 2 + ⋯ + (k − 2) + [n − (1 + 2 + ⋯ + (k − 2))] = n.
1 : 1
Suppose
(i_1 + j_1 − 2)(i_1 + j_1 − 1)/2 + i_1 = (i_2 + j_2 − 2)(i_2 + j_2 − 1)/2 + i_2.
¹This is analogous to the first step of converting a number into binary: subtract off the biggest power 2^i to get a remainder. Subtract off the biggest (k − 2)(k − 1)/2 less than n to get a remainder i. Think of it as "Base Gaussian."
I claim that f must map to the same diagonal, i.e., the sum i_1 + j_1 (call it k_1) equals i_2 + j_2 (call it k_2).
Rewrite our equation as
(k_1 − 2)(k_1 − 1)/2 + i_1 = (k_2 − 2)(k_2 − 1)/2 + i_2.
WLOG, suppose k_1 < k_2. By Gauss' formula again, we can bound the LHS:
(k_1 − 2)(k_1 − 1)/2 + i_1 = 1 + 2 + ⋯ + (k_1 − 2) + i_1 ≤ 1 + 2 + ⋯ + (k_1 − 1) = (k_1 − 1)k_1/2,
using i_1 ≤ k_1 − 1. But by basic integer properties, k_1 < k_2 implies
k_1 ≤ k_2 − 1
k_1 − 1 ≤ k_2 − 2.
Therefore, we can further bound
(k_1 − 1)k_1/2 ≤ (k_2 − 2)(k_2 − 1)/2.
Moreover,
(k_2 − 2)(k_2 − 1)/2 < (k_2 − 2)(k_2 − 1)/2 + i_2,
using i_2 ≥ 1. This gives us
(k_1 − 2)(k_1 − 1)/2 + i_1 < (k_2 − 2)(k_2 − 1)/2 + i_2,
which is a contradiction. Thus,
k_1 = k_2.
Now the proof is easy. Our equality becomes
(k_1 − 2)(k_1 − 1)/2 + i_1 = (k_1 − 2)(k_1 − 1)/2 + i_2
which immediately implies
i_1 = i_2
and then
j_1 = k_1 − i_1 = k_2 − i_2 = j_2.
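The same "Base Gaussian" idea gives an algorithm for inverting the diagonal map: keep advancing k until the triangular number 1 + 2 + ⋯ + (k − 1) reaches n, then read off i and j. A short sketch (my own illustration, not from the text):

```python
# Sketch of the diagonal-counting map and its inverse.  Given n, find the
# diagonal k by locating the largest triangular number 1 + ... + (k - 2) that
# is still below n, then read off i and j.

def f(i, j):
    k = i + j
    return (k - 2) * (k - 1) // 2 + i

def f_inverse(n):
    k = 2
    while (k - 1) * k // 2 < n:       # advance until 1 + ... + (k - 1) >= n
        k += 1
    i = n - (k - 2) * (k - 1) // 2    # remainder past the previous diagonal
    return i, k - i

for n in range(1, 100):
    i, j = f_inverse(n)
    assert i >= 1 and j >= 1 and f(i, j) == n

print([f_inverse(n) for n in range(1, 11)])
# [(1, 1), (1, 2), (2, 1), (1, 3), (2, 2), (3, 1), (1, 4), (2, 3), (3, 2), (4, 1)]
```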
37.4 Realistically Uncountable
Not all infinite sets are countable! In analogy with the definition of irrational, we define
Definition. For an infinite set A, if A is not countable then we say A is uncountable.
One of the most fundamental results is
The reals are uncountable.
This is the cutest proof that I know:
Theorem. R is uncountable.
Proof Summary:
Suppose I_1 = [0, 1] is countable and is enumerated by (a_n).
Construct a nested sequence of closed, non-empty sets
I_1 ⊇ I_2 ⊇ I_3 ⊇ ⋯
where I_j contains only points of (a_n) with index n ≥ j.
Let
a ∈ ⋂_{k∈N} I_k.
a ∈ [0, 1] yet a is not an element of (a_n). Contradiction.
Proof: It suffices to show that the closed interval [0, 1] is uncountable. Suppose that there does exist a bijection f from N to [0, 1]. Then, consider the image sequence (a_n) where
a_n = f(n).
First, we are going to construct a sequence of nested closed sets. Starting with
I_1 = [0, 1]
split the interval into three equal closed sets I_1^L, I_1^M, I_1^R. At least one of these sets contains only points of (a_n) with index n ≥ 2. Call that set I_2
and play the same game: split I_2 into three sets I_2^L, I_2^M, I_2^R and choose I_3 to be one of the subsets that contains only points of (a_n) with index n ≥ 3.
Formally,¹ we define the recursive relation
I_{k+1} = I_k^L if I_k^L contains only points of (a_n) with index n ≥ k + 1,
          I_k^M else if I_k^M contains only points of (a_n) with index n ≥ k + 1,
          I_k^R else if I_k^R contains only points of (a_n) with index n ≥ k + 1,
where I_k = [a, b] and
I_k^L = [a, a + (b − a)/3]
I_k^M = [a + (b − a)/3, a + 2(b − a)/3]
I_k^R = [a + 2(b − a)/3, b].
Now that we have a nested sequence of closed, non-empty sets,
I_1 ⊇ I_2 ⊇ I_3 ⊇ ⋯
by our work in Lecture 33, we know that the arbitrary intersection is non-empty. Therefore, let
a ∈ ⋂_{k∈N} I_k
and since a ∈ [0, 1], by our countability assumption,
a_S = a
for some S. But by construction,
a_S (which is a) ∉ I_{S+1}
and so
a ∉ ⋂_{k∈N} I_k,
a contradiction.

¹There really is no need to be formal. You only need to understand that we trisect the interval, and each time, we choose either left, middle, or right.
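The trisection step is concrete enough to run. The sketch below (my own illustration, using made-up sample values a_n and exact fractions) carries the construction out to a finite depth: after k steps the interval avoids a_1, ..., a_k, which is all the recursion needs.

```python
from fractions import Fraction

# Finite-depth sketch of the trisection argument: given the first N terms of a
# purported enumeration of [0, 1], build nested closed thirds I_1, I_2, ... so
# that I_{k+1} avoids a_1, ..., a_k.  The sample sequence is arbitrary.

def next_interval(a, b, forbidden):
    """Return the first closed third of [a, b] that misses the forbidden point."""
    w = (b - a) / 3
    for lo, hi in [(a, a + w), (a + w, a + 2 * w), (a + 2 * w, b)]:
        if not (lo <= forbidden <= hi):
            return lo, hi
    raise AssertionError("a single point cannot meet all three closed thirds")

a_n = [Fraction(1, 2), Fraction(1, 3), Fraction(9, 10), Fraction(0), Fraction(2, 7)]
lo, hi = Fraction(0), Fraction(1)                 # I_1 = [0, 1]
for k, point in enumerate(a_n, start=1):
    lo, hi = next_interval(lo, hi, point)         # I_{k+1} avoids a_1, ..., a_k
    assert all(not (lo <= a_n[m] <= hi) for m in range(k))
print(f"I_{len(a_n) + 1} = [{lo}, {hi}] avoids every listed a_n")
```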
This is a pretty cool proof and it uses the awesome nested sequence property. But it's not the classic argument (although it does have the same essence of diagonalization). Here is the diagonalization proof¹ that every mathematics student must know:
Theorem. R is uncountable.
Proof Summary:
Suppose that (0, 1) is countable and is enumerated by (a_n).
Expand each a_n in terms of its decimal expansion and arrange in a list.
Define s to have the decimal expansion
s = .s_1 s_2 s_3 s_4 s_5 ...
where the digit
s_i = 3 if a_{ii} ≠ 3, and 5 if a_{ii} = 3.
s ∈ (0, 1) yet s is not an element of (a_n). Contradiction.
Proof: Consider the open interval (0, 1) and suppose that there does exist a bijection f from N to (0, 1). Then, consider the image sequence (a_n) where
a_n = f(n).
We can write each term a_n as a decimal expansion
a_n = 0.a_{n1} a_{n2} a_{n3} ...
where each a_{ni} is a digit:
a_{ni} ∈ {0, 1, 2, ..., 9}.
Line the a_n up in a grid:
a_1 = 0.a_{11} a_{12} a_{13} a_{14} a_{15} ...
a_2 = 0.a_{21} a_{22} a_{23} a_{24} a_{25} ...
a_3 = 0.a_{31} a_{32} a_{33} a_{34} a_{35} ...
a_4 = 0.a_{41} a_{42} a_{43} a_{44} a_{45} ...
a_5 = 0.a_{51} a_{52} a_{53} a_{54} a_{55} ...
⋮
¹This proof is almost valid. First, you should prove that every real has a decimal expansion. More importantly, decimal expansions are not unique. For your Math 171 WIM, you will need to make the proper adjustments.
Using the diagonal digits a_{ii},
a_1 = 0.[a_{11}] a_{12} a_{13} a_{14} a_{15} ...
a_2 = 0.a_{21} [a_{22}] a_{23} a_{24} a_{25} ...
a_3 = 0.a_{31} a_{32} [a_{33}] a_{34} a_{35} ...
a_4 = 0.a_{41} a_{42} a_{43} [a_{44}] a_{45} ...
a_5 = 0.a_{51} a_{52} a_{53} a_{54} [a_{55}] ...
⋮
we construct an element that is not on this list. Specifically, we construct s such that its i-th digit is different from a_{ii}:
s = 0.s_1 s_2 s_3 s_4 s_5 ...
a_1 = 0.[a_{11}] a_{12} a_{13} a_{14} a_{15} ...
a_2 = 0.a_{21} [a_{22}] a_{23} a_{24} a_{25} ...
a_3 = 0.a_{31} a_{32} [a_{33}] a_{34} a_{35} ...
a_4 = 0.a_{41} a_{42} a_{43} [a_{44}] a_{45} ...
a_5 = 0.a_{51} a_{52} a_{53} a_{54} [a_{55}] ...
⋮
Formally, we can define s to have the decimal expansion
s = .s_1 s_2 s_3 s_4 s_5 ...
where the digit
s_i = 3 if a_{ii} ≠ 3, and 5 if a_{ii} = 3.
Then s ∈ (0, 1) yet s is not an element of (a_n). Because if it were, then for some K,
a_K = s.
But a_K and s differ in the K-th digit by construction:
s_K ≠ a_{KK},
a contradiction. Thus, the reals are uncountable.
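To see the diagonal construction in action, here is a finite sketch (my own illustration with made-up digits): from the first five digits of five listed numbers we already get a prefix of s that differs from each a_i in the i-th digit.

```python
# Finite sketch of the diagonal construction: from the first N digits of the
# first N listed numbers, build a prefix of s that already differs from each.
# The listed decimal expansions below are arbitrary illustrative data.

digits = [
    [1, 4, 1, 5, 9],   # a_1 = 0.14159...
    [3, 3, 3, 3, 3],   # a_2 = 0.33333...
    [5, 0, 0, 0, 0],   # a_3 = 0.50000...
    [7, 1, 8, 2, 8],   # a_4 = 0.71828...
    [9, 9, 9, 9, 9],   # a_5 = 0.99999...
]

# s_i = 3 if the diagonal digit a_ii is not 3, and 5 if it is 3.
s = [3 if digits[i][i] != 3 else 5 for i in range(len(digits))]
print("s = 0." + "".join(map(str, s)))   # 0.35333

for i, row in enumerate(digits):
    assert s[i] != row[i]                # s differs from a_i in the i-th digit
```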
37.5 Last Words on the Continuum
There are many levels of uncountability! In fact, for a given set S, S is smaller¹ than the set of all subsets of S. This means we can have arbitrarily large infinite sets.
Of course, I cannot explain that here. If you want to understand how to classify the sizes of infinite sets, I highly recommend taking Math 161: Set Theory. You will be exposed to some philosophical and interesting topics. Also, Professor Sommer is an excellent lecturer.
I do feel, however, that it is fitting to end with the statement of a famous unsolved problem. Even though we can have arbitrarily large uncountable sets, it is unknown if there is a smallest:
Conjecture (Continuum Hypothesis). R is the smallest uncountable set.
Once you finish the Math 51H final, I highly recommend mulling over this statement. Even though it is way beyond the scope of this course, ask yourself what it means. How do you formalize it? And as always, think about how one can possibly go about proving it.
¹We say A is smaller than B if there is a 1 : 1 mapping from A to B but no bijection between them.
The Final: Acing Analysis
The Last Run
You've finally made it. Ten weeks of Math 51H. And now you are faced with the final.
My first advice is to
COMPLETE HOMEWORK 10.
I know it's not graded. And you are probably swamped with work. That's why they call it dead-week. But problems are the only way to truly test your understanding of the material. Moreover, Professor Simon may put a variation of one of these problems on the final.
Secondly,
DO THE PRACTICE FINAL AND MAKE SURE YOU TIME YOURSELF.
To succeed, it is not enough to understand the material. You must make sure you have mastered it. This means being able to write out complete proofs quickly in a limited amount of time.
In fact, while you're at it,
REDO ALL THE PREVIOUS MIDTERMS.
On the final, all previous material is fair game. In fact, you cannot progress without full comprehension of earlier concepts!
Lastly, I repeat for the final time, the correct way to study:
Open up Chapter 1 and head to the first theorem. Then,
Read the theorem statement.
Close the book.
Start re-deriving the proof.
If you get stuck, glance at the proof summary for a hint.
Close the book again.
Rinse and repeat.
Do this for every proof in the book. This is the only way you can be sure that you understand the material. Always remember Professor Simon's saying,
The human capacity for self-delusion is limitless
In addition to the first 7 weeks, here are the topics you need to have mastered:
Week 8
1. Do you know the definition of permutations, transpositions, and number of inversions? Can you calculate the number of inversions of a given permutation?
2. Can you prove that applying a transposition to a permutation switches the parity of the number of inversions?
3. Do you know the defining properties of D? Do you know the explicit formula for D? Can you prove that it is unique?
4. Can you prove basic determinant properties? Can you prove det(AB) = det(A) det(B)? Can you quickly calculate determinants using row and column reduction properties?
5. Do you know how to invert a matrix? Do you know how to apply a cofactor expansion? Can you prove the cofactor expansion formulas? Can you prove det(A) = det(A^T)? Do you know the explicit formula for A^{−1}? Can you derive the explicit formula for A^{−1}?
6. Do you know the definition of orthonormal? Do you know how to apply the Gram-Schmidt process?
Week 9
1. Do you know how to compute eigenvectors and eigenvalues? Do you know how to use eigenvectors and eigenvalues to diagonalize a matrix?
2. Do you know the statement of the Spectral Theorem? Can you apply the Spectral Theorem?
3. Do you know the statement of the Contraction Mapping Theorem? Can you prove the Contraction Mapping Theorem? Can you use the Contraction Mapping Theorem on some system of equations to show the existence of a solution?
4. Do you know the definition of 1 : 1? Do you know the definition of onto? Do you know the statement of the Inverse Function Theorem?
5. Do you know the statement of the Implicit Function Theorem? Can you use the Inverse Function Theorem to prove the Implicit Function Theorem? Can you use the Implicit Function Theorem to complete the proof of the Lagrange Multiplier Theorem?
Last Words on Math 51H
Even though it is a challenging course, I hope you enjoyed Math 51H. Without a doubt,
Professor Simon is an incredible lecturer who is second to none.
He has truly exposed you to some beautiful proofs and given you a preview of the different branches of mathematics. In just ten weeks, you have learned more mathematics than you have in the last ten years.
And by "learn", I mean you have acquired more than just content. You've acquired a completely new way of thinking. And like the Golden Snitch,
Your mind is now open at the close of Math 51H.
The After Math
In life, unlike chess, the game continues after checkmate.
-Isaac Asimov
A Message to the Underdog
If your experience in Math 51H was anything like mine, after you see your grades, you're going to be shocked. For the first time in your life, you didn't get an A. For me, that was just the tip of the iceberg. By the end of that year, I scored a 1 on the Putnam and tanked the remaining H-series. Almost every job rejected me that summer.
But the scores and rejections weren't the worst part. It was the feeling of resignation. The feeling that no matter how hard you tried, you would never catch up. That everyone in that room will always be smarter than you and that you didn't belong.
When people talk about the Stanford Duck Syndrome, I feel it is especially true with the math majors. There is an influx of brilliant students each year. Students who are already properly trained in mathematical thinking. Because they've already seen the algebraic and analytic concepts, they will be the ones constantly raising their hands.
And these students are also the ones the professors take time to notice.
From someone who has been at the bottom of every math class, I assure you, it doesn't get any easier. TAs will roll their eyes at you. Professors will call your questions trivial.¹ When you give an incorrect answer in class, people will snicker at you. But worst of all, ten weeks of persistent effort will be quantized into three hours of testing.
So the big question you need to ask yourself is
Do you still want to pursue theoretical math?
You don't have to study pure math. You can try engineering, economics, or even philosophy. From personal experience, even someone with a mediocre mathematics background can excel in these subjects. If you want acknowledgement, then maybe that's where you should go.
However, the H-series could have left you with a gnawing feeling. Because even amidst dark confusion and struggle, you saw something. A glimpse of pure creativity. An idea that is paradoxically so
¹God I hate that word. If there was any word that captures mathematical arrogance, this would be it.
difficult yet so simple. An idea that was magically pulled out of thin air. And even if it was for a brief moment, you understood. You were rapt in awe.
If this is how you feel, then you have to ignore the grades and keep chasing mathematics. To quote Professor Vakil,
Of course, the best three on the Putnam will go off to become fantastic mathematicians. But the people who are going to take over the world are those fighting to get the 1s.
Indeed, it's going to be a tough fight. And the only one who is going to care (or even notice) is you. But be sure to go at your own pace! You don't have to jump into Hardy, Dummit, and Pollack. Read Jones, Armstrong, and Munkres. In fact, read as many yellow UTM and SUMS books as you can get a hold of. Don't just stop there. Devour the vast online resources that are available. Become an expert in Coursera, Wikipedia, and Stack Exchange.
In the meanwhile, ignore the pompous people in your class. Find the right ones to collaborate with. Learn from bad grades and move on. Try harder next time. There is no shame in failing: you can always retake a course. Even if a professor laughs at you for sitting in for the hundredth time, just grin back. You gotta keep fighting, even if the world's telling you to give up. Because that's what it means to live.
And when you completely understand a beautiful proof and are teeming with excitement, you have to teach it to other people. Just be sure that, when you explain the proof, it's not to make yourself look smarter. Deliver the punch line in the right way so that even the struggling underdog walks away with a smile. Shout it out to the world so that everyone can know it's beautiful. Because that's what it means to love.
Only the Beginning
As for me, I've decided to stop pursuing the dream of becoming a professor. Not because I've given up on mathematics, but because professors teach the best. And I have no interest in teaching the best. I want to inspire the best.
To inspire students, I intend on keeping a promise. Before my best friend overdosed, we made a pact to take on the world. Even though she's gone, I still intend on keeping my end of that promise. I am going to find the world's most infamous introductory math courses and I am going to translate them. Because I feel that all students who have a passion for mathematics, yet lack the problem solving and proof techniques, deserve to learn from these epic courses.
That's enough about my future. As for you, it is inevitable that as you grow up, you will hit some hard times. When that happens, you just have to keep living, laughing, and loving. Because, in the end,
Let Hercules do as he may. Cats will mew. Dogs will have their day.
Dogs will have their day.
Index
1 : 1, 643
T
a
M, 519
, 41
C, 4
Q, 4
R, 4
Z, 4
, 218
, 34
, 34
abelian group, 97
absolute convergence, 373
Aladdin, 138
arbitrary union, 324
Axiom of Choice, 124, 134
Axler, Sheldon, 364
Banach-Tarski Paradox, 135
Baptisma Pyros, viii
basis, 124
Basis Extension Theorem, 127
Basis Theorem, 125
applications of, 132
Batman, x, 239, 636
bijection, 698
Bolzano-Weierstrass, 197, 210
alternate proof, 213
multivariable, 330
Boyd, Stephen, 171
cardinality, 703
Cauchy Sequences, 639
Cauchy-Riemann Equations, 686
Cauchy-Schwarz Equality, 57
applications of, 59, 60
Cauchy-Schwarz Inequality, 21, 25
alternate proof, 79
applications of, 2931
matrices, 152
chain rule, 409
Change of Base-Point Theorem, 438
nite, 430
choice function, 134
closed set, 312
cofactor expansion, 582
column rank, 159
column space, 158
basis, 260
Completeness Axiom, 107, 639
Conrad, Brian, vii
constructivism, 51, 239
continuous function
applications, 294
multivariable, 338
properties, 280
single variable, 275
Continuum Hypothesis, 713
Contraction Mapping Theorem, 636, 657
convex, 654
countable, 703
N N, 704, 705
rationals, 704
critical point
manifold, 531
curve, 474
length, 479
Depp, Johnny, 205
derivative
directional, 351
partial, 353
single variable, 289
determinant, 567
Devlin, Keith, x, 267
diagonalizable, 610
dierentiable, 358
dimension, 129
directional derivative, 351
721
722
discretizing the delta, 320
distance function, 7
Division Algorithm, 93
dot product, 19
EE263: Linear Dynamical Systems, 171
eigen-decomposition, 610
eigenvalue, 613
eigenvector, 613
Everclear-190, 8
Field Axiom, 101
elds, 101
Frost, Robert, 663
Fundamental Theorem of Algebra, 680, 693
Fundamental Theorem of Arithmetic, 54
Fundamental Theorem of Calculus, 485
Fundamental Theorem of Linear Algebra, 159
Galatius, Soren, 135
Gaussian Elimination
full, 241
step one, 65
geometric series, 366, 377
gradient, 389
Gram-Schmidt Process, 602, 604
graph, 509
Guy, Richard, 12
Hardy, G.H., xi
harmonic function, 681
Harry Potter, xiii, 81, 417, 494, 545, 715, 717
Hilbert, xii, 89, 267
Hitler Learns Topology, 301
homogeneous equations, 83
Implicit Function Theorem, 671
Inception, 631
indexing set, 323
inhomogeneous equations, 263
injective, 643
invariant, 544
inverse, 580
Inverse Function Theorem, 650
Jacobian, 358, 388
Jepsen, Carly Rae, 47
Johari, Ramesh, 271
Jones, Indiana, 200
Kafka, Franz, 71
Lagrange Multipliers, 674
multiple constraints, 530
proof, 536
single constraint, 527
Law of Excluded Middle, 47
Law of Small Numbers, 12
left inverse, 578
Lehrer, Tom
Tropic of Calculus, 396
limit
function, 273, 274
multivariable, 337
sequence, 176
sequence of vectors, 312
uniqueness, 193
linear combination, 45
Linear Dependence Lemma, 86
linear function, 38
local maximum, 392
manifold, 531
local minimum, 392
manifold, 531
logical quantiers, 177
Lovasz, L., 47, 305
Malibu Light Rum, 396
manifold, 505, 514
Martini, 347
Math 106: Introduction to Complex Analysis, 696
Math 108: Introduction to Combinatorics, 33, 445
Math 114: Linear Algebra II, 38
Math 116: Introduction to Complex Analysis, 696
Math 120: Modern Algebra I, 110
Math 121: Modern Algebra II, 38, 110
Math 171: Introduction to Analysis, 38, 116
Maydanskiy, Maksim, 180, 457, 696
Mean Value Theorem, 293
application of, 51
multivariable application, 398
Mojito, 37, 443
Monotone Convergence Property, 200
MS&E 246: Game Theory with Engineering Ap-
plications, 271
723
multilinear function, 557
N-th term test, 365
non-pivot column, 244
norm
complex, 692
matrix, 151
vector, 15
Norris, Chuck, 193
null space, 159
basis, 256
nullity, 159
number of inversions, 547
onto, 698
open ball, 302
open set, 304
orthogonal, 218
orthogonal complement, 218
properties, 230
orthonormal, 599
Orwell, George, 697
Osgood, Brad, 37
Pang, Amy, xiii
parity, 544
partial derivative, 353
partition, 478
permutation, 545
Phil 151: Introduction to Logic, 102
Pina Colada, xiii
pivot column, 243
index, 241
Pocahontas, 578
Popov, 539
power series, 419
dierentiation, 491
projection map, 234
properties, 234
Proof Technique
7-10 split, 33
contradiction, 47
i, 52
cases, 67
existence, 109
induction, 71
set equality, 116
uniqueness, 90
universal statements, 8
quadratic form, 467
rank, 166
Rank-Nullity Theorem, 159, 162
rearrangement, 379
right inverse, 578
Rolles Theorem, 290
Ross, Kenneth, 407
row space, 165
Sandwich Theorem, 205
multivariable, 341
Sawin, Stephen, 197
Second Derivative Test, 470
Sher, David, 48
Shoham, Yoav, 134
Shyamalan, M. Night, 7, 363
Simon, Leon, vii, xxiii, 40, 50, 63, 115, 124, 127,
167, 173, 203, 250, 270, 273, 301, 304, 324,
347, 363, 395, 397, 442, 512, 519, 540, 542,
715717
Smirno, 539
Sommer, Professor, 102, 713
Sound, K., 110, 681
span, 45
Sparrow, Jack, 200
Spectral Theorem, 619
standard basis vector, 121
Steele, Michael, 17
Stolichnaya, 539
Strang, Gilbert, 159
sub-matrix, 582
subsequence, 198
subspace, 39
SUMaC, viii, 267, 543
symmetric matrix, 467
tangent space, 519
tangential gradient, 531
Taylor Series, 494
Taylors Theorem, 495
Titanic, 126
transpose, 165
transposition, 545
triangle inequality, 7, 27
724
uncountable, 708
reals, 708, 710
Under-determined Systems Lemma, 83
Vakil, Ravi, 720
vector space, 38, 418
Weinberger, 503
well-dened, 563
Wieczorek, W., 662
