
EECS 16B Designing Information Devices and Systems II

Fall 2017 Miki Lustig and Michel Maharbiz Homework 9


This homework is due November 7, 2017, at noon.

1. The Moore-Penrose pseudoinverse for “fat” matrices


Say we have a set of linear equations described as A~x = ~y. If A is invertible, we know that the solution
is ~x = A−1~y. However, what if A is not a square matrix? In 16A, you saw how this problem could be
approached for tall matrices A where it really wasn’t possible to find a solution that exactly matches all
the measurements. The Linear Least-Squares solution gives us a reasonable answer that asks for the “best”
match in terms of reducing the norm of the error vector.
This problem deals with the other case — when the matrix A is short and fat. In this case, there are generally
going to be lots of possible solutions — so which should we choose? Why? We will walk you through
the Moore-Penrose Pseudoinverse that generalizes the idea of the matrix inverse and is derived from the
singular value decomposition.

(a) Say you have the following matrix.  


$$A = \begin{bmatrix} 1 & -1 & 1 \\ 1 & 1 & -1 \end{bmatrix}$$

Calculate the SVD of A. That is to say, calculate U, Σ, V such that

$$A = U\Sigma V^\top$$

What are the dimensions of U, Σ, and V?


Note. Do NOT use a computer to calculate the SVD. You may be asked to solve similar questions on your own on the exam.
Solution:

$$AA^\top = \begin{bmatrix} 3 & -1 \\ -1 & 3 \end{bmatrix},$$

which has characteristic polynomial λ^2 − 6λ + 8 = 0, producing eigenvalues 4 and 2. Solving AA^⊤~u = λ_i~u produces eigenvectors ~u_1 = [1/√2, −1/√2]^⊤ and ~u_2 = [1/√2, 1/√2]^⊤ associated with eigenvalues 4 and 2 respectively. The singular values are the square roots of the eigenvalues of AA^⊤, so

$$\Sigma = \begin{bmatrix} 2 & 0 & 0 \\ 0 & \sqrt{2} & 0 \end{bmatrix}$$

and

$$U = \begin{bmatrix} \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} \\ -\frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} \end{bmatrix}.$$

We can then solve for the ~v vectors using A^⊤~u_i = σ_i~v_i, producing ~v_1 = [0, −1/√2, 1/√2]^⊤ and ~v_2 = [1, 0, 0]^⊤. The last ~v must be orthonormal to the other two, so we can pick ~v_3 = [0, 1/√2, 1/√2]^⊤.



The SVD is:

$$A = \underbrace{\begin{bmatrix} \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} \\ -\frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} \end{bmatrix}}_{U}
\underbrace{\begin{bmatrix} 2 & 0 & 0 \\ 0 & \sqrt{2} & 0 \end{bmatrix}}_{\Sigma}
\underbrace{\begin{bmatrix} 0 & -\frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} \\ 1 & 0 & 0 \\ 0 & \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} \end{bmatrix}}_{V^\top}$$

Here U is 2 × 2, Σ is 2 × 3, and V is 3 × 3.
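As a sanity check only (the note above asks you to do this computation by hand), here is a small numpy sketch that confirms the decomposition; numpy may return singular vectors with signs flipped relative to the hand computation, so we only verify the singular values and the reconstruction.

```python
import numpy as np

A = np.array([[1., -1., 1.],
              [1.,  1., -1.]])

# Full SVD: U is 2x2, S holds the singular values, Vt is 3x3.
U, S, Vt = np.linalg.svd(A, full_matrices=True)
print(S)                                 # [2.0, 1.414...], matching sigma_1 = 2, sigma_2 = sqrt(2)

# Rebuild the 2x3 Sigma and confirm A = U Sigma V^T.
Sigma = np.zeros((2, 3))
Sigma[:2, :2] = np.diag(S)
print(np.allclose(A, U @ Sigma @ Vt))    # True
```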

(b) Let us think about what the SVD does. Let us look at matrix A acting on some vector ~x to give the
result ~y. We have,
A~x = UΣV^⊤~x = ~y

Observe that V^⊤~x rotates the vector, Σ scales it, and U rotates it again. We will try to "reverse" these operations one at a time and then put them together.
If U "rotates" the vector ΣV^⊤~x, what operator can we derive that will undo the rotation?
Solution: By orthonormality, we know that U^⊤U = UU^⊤ = I. Therefore, U^⊤ undoes the rotation.
(c) Derive a matrix that will "unscale", i.e., undo the effect of Σ where it is possible to do so. Recall that Σ has the same dimensions as A. Ignore any divisions by zero (that is to say, let those entries stay zero).
Solution: If you observe the equation
Σ~x = ~y, (1)
you can see that σ_i x_i = y_i for i = 0, ..., m − 1, which means that to obtain x_i from y_i, we need to multiply y_i by 1/σ_i. For any i > m − 1, the information in x_i is lost because it is multiplied by 0, so the reasonable guess for x_i is 0 in this case. That is why we padded 0s at the bottom of Σ̃, given below:
 
If
$$\Sigma = \begin{bmatrix}
\sigma_0 & 0 & \cdots & 0 & 0 & \cdots & 0 \\
0 & \sigma_1 & \cdots & 0 & 0 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots & \vdots & & \vdots \\
0 & 0 & \cdots & \sigma_{m-1} & 0 & \cdots & 0
\end{bmatrix}
\quad\text{then}\quad
\tilde{\Sigma} = \begin{bmatrix}
\frac{1}{\sigma_0} & 0 & \cdots & 0 \\
0 & \frac{1}{\sigma_1} & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \frac{1}{\sigma_{m-1}} \\
0 & 0 & \cdots & 0 \\
\vdots & \vdots & & \vdots \\
0 & 0 & \cdots & 0
\end{bmatrix}$$
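A minimal numpy sketch of this construction, assuming a generic m × n Σ stored as a dense array (the function name sigma_tilde is ours, not from the notebook):

```python
import numpy as np

def sigma_tilde(Sigma, tol=1e-12):
    """Invert the nonzero singular values and transpose the shape; keep zeros as zeros."""
    St = np.zeros(Sigma.shape[::-1])            # n x m
    for i in range(min(Sigma.shape)):
        if Sigma[i, i] > tol:
            St[i, i] = 1.0 / Sigma[i, i]
    return St

Sigma = np.array([[2., 0., 0.],
                  [0., np.sqrt(2), 0.]])
print(sigma_tilde(Sigma))
# [[0.5  0.      ]
#  [0.   0.7071..]
#  [0.   0.      ]]
```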

(d) Derive an operator that would "unrotate" by V^⊤.

Solution: By orthonormality, we know that V^⊤V = VV^⊤ = I. Therefore, V undoes the rotation.
(e) Try to use this idea of "unrotating" and "unscaling" to derive an "inverse" (which we will use A† to
denote). That is to say,
~x = A†~y
The reason the word inverse is in quotes (or why this is called a pseudo-inverse) is that we are ignoring the "divisions" by zero.
Solution: We can use the unrotation and unscaling matrices we derived above to "undo" the effect of
A and get the required solution. Of course, nothing can possibly be done for the information that was
destroyed by the nullspace of A — there is no way to recover any component of the true ~x that was in
the nullspace of A. However, we can get back everything else.

$$\begin{aligned}
\vec{y} = A\vec{x} &= U\Sigma V^\top \vec{x} \\
U^\top \vec{y} &= \Sigma V^\top \vec{x} && \text{unrotating by } U \\
\tilde{\Sigma} U^\top \vec{y} &= V^\top \vec{x} && \text{unscaling by } \tilde{\Sigma} \\
V \tilde{\Sigma} U^\top \vec{y} &= \vec{x} && \text{unrotating by } V
\end{aligned}$$



Therefore, we have A† = V Σ̃ U^⊤, where Σ̃ is given in part (c).
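A short numerical sketch of this formula, assuming numpy; for this particular A all singular values are nonzero, so Σ̃ simply inverts them. The result should match numpy's built-in np.linalg.pinv.

```python
import numpy as np

A = np.array([[1., -1., 1.],
              [1.,  1., -1.]])

U, S, Vt = np.linalg.svd(A, full_matrices=True)

# Sigma-tilde: transposed shape, nonzero singular values inverted.
St = np.zeros((A.shape[1], A.shape[0]))
St[:len(S), :len(S)] = np.diag(1.0 / S)          # assumes all singular values are nonzero

A_dagger = Vt.T @ St @ U.T                       # A-dagger = V Sigma-tilde U^T
print(np.allclose(A_dagger, np.linalg.pinv(A)))  # True
```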
(f) Use A† to solve for ~x in the following system of equations.

$$\begin{bmatrix} 1 & -1 & 1 \\ 1 & 1 & -1 \end{bmatrix} \vec{x} = \begin{bmatrix} 2 \\ 4 \end{bmatrix}$$

Solution: From the above, we have the solution given by:

$$\vec{x} = A^\dagger \vec{y} = V \tilde{\Sigma} U^\top \vec{y}
= \begin{bmatrix} 0 & 1 & 0 \\ -\frac{1}{\sqrt{2}} & 0 & \frac{1}{\sqrt{2}} \\ \frac{1}{\sqrt{2}} & 0 & \frac{1}{\sqrt{2}} \end{bmatrix}
\begin{bmatrix} \frac{1}{2} & 0 \\ 0 & \frac{1}{\sqrt{2}} \\ 0 & 0 \end{bmatrix}
\begin{bmatrix} \frac{1}{\sqrt{2}} & -\frac{1}{\sqrt{2}} \\ \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} \end{bmatrix}
\begin{bmatrix} 2 \\ 4 \end{bmatrix}
= \begin{bmatrix} 3 \\ \frac{1}{2} \\ -\frac{1}{2} \end{bmatrix}$$

Therefore, the solution to the system of equations is:

$$\vec{x} = \begin{bmatrix} 3 \\ \frac{1}{2} \\ -\frac{1}{2} \end{bmatrix}$$
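A quick check of this hand computation, assuming numpy:

```python
import numpy as np

A = np.array([[1., -1., 1.],
              [1.,  1., -1.]])
y = np.array([2., 4.])

x = np.linalg.pinv(A) @ y
print(x)                      # [ 3.   0.5 -0.5]
print(np.allclose(A @ x, y))  # True: the pseudoinverse solution satisfies the equations
```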

(g) (Optional) Now we will see why this matrix is a useful proxy for the matrix inverse in such circumstances. Show that the solution given by the Moore-Penrose pseudoinverse satisfies the minimality property that if ~x̂ is the pseudo-inverse solution to A~x = ~y, then ‖~x̂‖ ≤ ‖~z‖ for all other vectors ~z satisfying A~z = ~y.
(Hint: look at the vectors involved in the V basis. Think about the relevant nullspace and how it is connected to all this.)
This minimality property is useful in both control applications (as you will see in the next problem) and in communications applications.
Solution: Since ~x̂ is the pseudo-inverse solution, we know that,

$$\hat{\vec{x}} = V \tilde{\Sigma} U^\top \vec{y}$$

Let us write down what ~x̂ is with respect to the columns of V . Let there be k non-zero singular values.
The following expression comes from expanding the matrix multiplication.

$$\hat{\vec{x}}\big|_V = V^\top \hat{\vec{x}} = V^\top A^\dagger \vec{y} = V^\top V \tilde{\Sigma} U^\top \vec{y} = \tilde{\Sigma} U^\top \vec{y}
= \left[ \frac{\langle \vec{y}, \vec{u}_0 \rangle}{\sigma_0}, \frac{\langle \vec{y}, \vec{u}_1 \rangle}{\sigma_1}, \ldots, \frac{\langle \vec{y}, \vec{u}_{k-1} \rangle}{\sigma_{k-1}}, 0, \ldots, 0 \right]^\top$$

The n − k zeros at the end come from the fact that there are only k non-zero singular values. Therefore,
by construction, ~x̂ is a linear combination of the first k columns of V .



Since any other ~z is also a solution to the original problem, we have

$$A\vec{z} = U\Sigma V^\top \vec{z} = U\Sigma\, \vec{z}\big|_V = \vec{y}, \tag{2}$$

where ~z|_V is the projection of ~z onto the V basis. Unrotating by U^⊤ and then unscaling the first k elements (where the unscaling is clearly invertible), we see that the first k elements of ~z|_V must be identical to the first k elements of ~x̂|_V.
However, since the information in the last n − k elements of ~z|_V is lost by multiplication with 0s, any values α_ℓ there are unconstrained as weights on the last part of the V basis, namely the weights on the basis for the nullspace of A. Therefore,
$$\vec{z}\big|_V = \left[ \frac{\langle \vec{y}, \vec{u}_0 \rangle}{\sigma_0}, \frac{\langle \vec{y}, \vec{u}_1 \rangle}{\sigma_1}, \ldots, \frac{\langle \vec{y}, \vec{u}_{k-1} \rangle}{\sigma_{k-1}}, \alpha_k, \alpha_{k+1}, \ldots, \alpha_{n-1} \right]^\top.$$
Now, since the columns of V are orthonormal, observe that,
$$\|\hat{\vec{x}}\|^2 = \sum_{i=0}^{k-1} \left( \frac{\langle \vec{y}, \vec{u}_i \rangle}{\sigma_i} \right)^2$$

and that,
$$\|\vec{z}\|^2 = \sum_{i=0}^{k-1} \left( \frac{\langle \vec{y}, \vec{u}_i \rangle}{\sigma_i} \right)^2 + \sum_{i=k}^{n-1} |\alpha_i|^2$$

Therefore,
$$\|\vec{z}\|^2 = \|\hat{\vec{x}}\|^2 + \sum_{i=k}^{n-1} |\alpha_i|^2$$

This tells us that

$$\|\vec{z}\| \geq \|\hat{\vec{x}}\|$$
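A small numerical illustration of this minimality property for the matrix from part (a), assuming numpy: ~v_3 = [0, 1/√2, 1/√2]^⊤ spans the nullspace of A, so adding any multiple of it gives another solution with larger norm.

```python
import numpy as np

A = np.array([[1., -1., 1.],
              [1.,  1., -1.]])
y = np.array([2., 4.])

x_hat = np.linalg.pinv(A) @ y                    # minimum-norm solution
v3 = np.array([0., 1., 1.]) / np.sqrt(2)         # nullspace direction of A

z = x_hat + 5.0 * v3                             # another solution to A z = y
print(np.allclose(A @ z, y))                     # True
print(np.linalg.norm(x_hat), np.linalg.norm(z))  # x_hat has the smaller norm
```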

2. Eigenfaces
In this problem, we will be exploring the use of PCA to compress and visualize pictures of human faces. We use the images from the data set Labeled Faces in the Wild. Specifically, we use a set of 13,232 images aligned using deep funneling to ensure that the faces are centered in each photo. Each image is a 100×100 image with the face aligned in the center. To turn the image into a vector, we stack each column of pixels in the image on top of each other, and we normalize each pixel value to be between 0 and 1. Thus, a single image of a face is represented by a 10,000-dimensional vector. A vector this size is a bit challenging to
work with directly. We combine the vectors from each image into a single matrix so that we can run PCA.
For this problem, we will provide you with the first 1000 principal components, but you can explore how
well the images are compressed with fewer components. Please refer to the IPython notebook to answer the
following questions.

(a) We provide you with a randomly selected subset of 1000 faces from the training set, the first 1000 principal components, all 13,232 singular values, and the average of all of the faces. What do we need
the average of the faces for?
Solution: We need to zero-center the data by subtracting out the average before running PCA. During
the reconstruction, we need to add the average back in.
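A minimal sketch of these two steps, assuming numpy; the variable X stands in for the (n_images × 10000) matrix of flattened faces from the notebook (smaller random data is used here just so the snippet runs on its own).

```python
import numpy as np

# Stand-in for the face matrix; in the notebook, X would be (n_images, 10000).
rng = np.random.default_rng(0)
X = rng.random((200, 400))

mean_face = X.mean(axis=0)      # the "average face"
X_centered = X - mean_face      # zero-center the data before running PCA

# PCA via SVD of the centered data; rows of Vt are the principal components.
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
# When reconstructing a face from its component weights, add mean_face back in.
```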



(b) We provide you with a set of faces from the training set and compress them using the first 100 principal
components. You can adjust the number of principal components used to do the compression between
1 and 1000. What changes do you see in the compressed images when you use a small number of components, and what changes do you see when you use a large number?
Solution: When fewer principal components are used, the images do not differ much from the average face and do not contain many distinguishing features. This is to be expected, since very small numbers of components will not account for much of the variation found in faces. When more principal components are used, the images more closely resemble the originals.
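A sketch of this compression step, assuming numpy and stand-in data as before (the function name compress is ours, not from the notebook):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((200, 400))                    # stand-in for the flattened face images
mean_face = X.mean(axis=0)
Xc = X - mean_face
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)

def compress(face, k):
    """Project one face onto the first k principal components, then reconstruct it."""
    components = Vt[:k]                       # (k, n_pixels)
    weights = components @ (face - mean_face)
    return mean_face + components.T @ weights

reconstruction = compress(X[0], k=100)
print(np.linalg.norm(X[0] - reconstruction))  # the error shrinks as k grows
```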
(c) You can visualize each principal component to see what each dimension “adds” to the high-dimensional
image. What visual differences do you see in the first few components compared to the last few com-
ponents?
Solution: The first few principal components are blurry images capturing low-frequency data. This low-frequency data captures some of the broad variation across faces, like lighting. The last few components contain high-frequency data, where small details vary from face to face.
(d) By using PCA on the face images, we obtain orthogonal vectors that point in directions of high variance
in the original images. We can use these vectors to transform the data into a lower dimensional space
and plot the data points. In the notebook, we provide you with code to plot a subset of 1000 images
using the first two principal components. Try plotting other components of the data, and see how the shape of the points changes. What difference do you see in the plot when you use the first two principal
components compared with the last two principal components? What do you think is the cause of this
difference?
Solution: The variance of the points in the plot is larger for the first two components compared to
the last two components. We can also confirm that the variance is larger for the first few components
because the singular values are large while the singular values for the last few components are small.
This happens because PCA orders the principal components by the singular values, which can be used
to measure the variability in the data for each component.
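A sketch of this kind of plot, assuming numpy and matplotlib (again with stand-in data; in the notebook, the provided faces and components would be used):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
X = rng.random((200, 400))                  # stand-in for the face matrix
Xc = X - X.mean(axis=0)
_, S, Vt = np.linalg.svd(Xc, full_matrices=False)

i, j = 0, 1                                 # compare with e.g. the last two components
coords = Xc @ Vt[[i, j]].T                  # coordinates of each image along components i and j
plt.scatter(coords[:, 0], coords[:, 1], s=5)
plt.xlabel(f"component {i}")
plt.ylabel(f"component {j}")
plt.show()
```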
(e) We can use the principal components to generate new faces randomly. We accomplish this by pick-
ing a random point in the low-dimensional space and then multiplying it by the matrix of principal
components. In the notebook, we provide you with code to generate faces using the first 1000 princi-
pal components. You can adjust the number of components used. How does this affect the resulting
images?
Solution: When fewer components are used, the faces appear more similar and when a very small
number of principal components are used, they are almost indistinguishable. When we use more
principal components, the synthesized faces appear more distinct. This happens because we are adding
more degrees of freedom to our “face” vector when we add more principal components. This allows us
to generate faces with more variety because we have more parameters that control how the face looks.
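A sketch of the generation step, assuming numpy; scaling the random weights by the singular values is our choice here (it roughly matches the per-component spread of the training data) and not necessarily what the notebook does.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((200, 400))                  # stand-in for the face matrix
mean_face = X.mean(axis=0)
Xc = X - mean_face
_, S, Vt = np.linalg.svd(Xc, full_matrices=False)

def random_face(k):
    """Sample random weights for the first k components and map them back to pixel space."""
    weights = rng.normal(scale=S[:k] / np.sqrt(len(X)), size=k)
    return mean_face + Vt[:k].T @ weights

face = random_face(k=50)                    # using more components gives more varied faces
```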

3. Image Processing by Clustering

In this homework problem, you will learn how to use the k-means algorithm to solve two image processing problems: (1) color quantization and (2) image segmentation.
Digital images are composed of pixels (you can think of pixels as small points shown on your screen). Each pixel is a data point (sample) of an original image. The intensity of each pixel is its feature. If we use an 8-bit integer Var_i to represent the intensity of one pixel, Var_i = 0 means black, while Var_i = 255 means white. Images expressed only by pixel intensities are called grayscale images.
In color image systems, the color of a pixel is typically represented by three component intensities (features) such as Red, Green, and Blue. The features of one pixel are a list of three 8-bit integers: [Var_r, Var_g, Var_b]. Here [0, 0, 0] means black, while [255, 255, 255] represents white. You can find tools like this website to



visualize RGB colors:
https://www.colorspire.com/rgb-color-wheel/.
Now think about your own experience: Have you needed to delete some photos on your cell phone because you ran out of memory? Have you felt "why does it take forever to upload my photos"? We need ways to reduce the memory size of each photo by image compression.
In computer graphics, there are two types of image compression techniques: lossless and lossy image com-
pression. The former technique is preferred for archival purposes, which aim to perfectly record an image
but reduce the required memory to store it. The latter technique tries to remove some irrelevant and redundant
features without destroying the main features of the image. In this problem, you will learn how to use the
k-means algorithm to perform lossy image compression by color quantization.
Color quantization is one way to reduce the memory size of an image. In real schemes, it can be used in
combination with other techniques, but here we apply it directly to pixels on its own for simplicity. The
target of color quantization is to reduce the number of colors used in an image, while trying to make the
new image visually similar to the original image. Graphics Interchange Format (GIF) is one format that uses
color quantization.
For example, consider an image of the size 800 × 600, where each pixel takes 24 bits (3 bytes - one each for
red, green, and blue) to store its color intensities. This raw image takes 800 × 600 × 3 bytes, about 1.4MB
to store. If we use only 8 colors to represent all pixels in this image, we could include a color map, which
stores the RGB values for these representation colors, and then use 3 bits for each pixel to indicate which
one is its representation color. In that case, the compressed image will take 8 × 24 + 800 × 600 × 3 bits,
about 0.2MB, to store it. This compression has saved a considerable amount of memory.
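A quick arithmetic check of the sizes quoted above (Python):

```python
raw_bytes = 800 * 600 * 3                        # 1,440,000 bytes, about 1.4 MB
colormap_bits = 8 * 24                           # 8 representation colors, 24 bits each
index_bits = 800 * 600 * 3                       # 3 bits per pixel to pick one of 8 colors
compressed_bytes = (colormap_bits + index_bits) / 8
print(raw_bytes / 1e6, compressed_bytes / 1e6)   # roughly 1.44 MB vs 0.18 MB
```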
There are two main tasks in color quantization: (1) decide the representation colors, and (2) determine which
representation color each pixel should be assigned. Both tasks can be done by the k-means algorithm: It
groups data points (pixels) into k different clusters (representation colors), and the centroids of the clusters
will be the representation colors for those pixels inside the cluster.
Here is another thing we can do with this strategy: image segmentation. Image segmentation partitions a
digital image into multiple segments (sets of pixels). It is typically used to locate boundaries of objects in
images. Once we isolate objects from images, we can perform object detection and recognition, which play essential roles in artificial intelligence. This can be done by clustering pixels with similar features and labeling each cluster (object) with an indicator color.

(a) Please look at the ipython notebook file, where you will find a 4 by 4 grayscale image. Perform the
k-means algorithm on the 16 data points with k = 4. What are the representation colors (centroids)?
Show the image after color quantization.
Solution: See sol3.ipynb. There are multiple local optima. Here we chose [0, 135, 175, 230] as the initial centroids. The final centroids are [23.5, 119.67, 174, 230.67]. The pixel values assigned to each cluster are {0, 18, 27, 49} for centroid 23.5, {111, 113, 115, 122, 123, 134} for centroid 119.67, {169, 175, 178} for centroid 174, and {223, 234, 235} for centroid 230.67.
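A minimal numpy sketch of plain Lloyd's iterations on these 16 pixel values, using the same initialization as above (the pixel list is read off from the cluster assignments in the solution):

```python
import numpy as np

pixels = np.array([0, 18, 27, 49, 111, 113, 115, 122, 123, 134,
                   169, 175, 178, 223, 234, 235], dtype=float)
centroids = np.array([0., 135., 175., 230.])     # initial centroids

for _ in range(20):
    # Assign each pixel to its nearest centroid, then recompute the centroids.
    labels = np.argmin(np.abs(pixels[:, None] - centroids[None, :]), axis=1)
    centroids = np.array([pixels[labels == j].mean() for j in range(len(centroids))])

print(centroids)                 # approximately [23.5, 119.67, 174.0, 230.67]
quantized = centroids[labels]    # each pixel replaced by its representation color
```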
(b) See the ipython notebook. Apply the k-means algorithm to the grayscale image with different values of
k. Observe the distortion. Choose a good value for k, which should be the minimum value for keeping
the compressed image visually similar to the original image. Calculate the memory we need for the
compressed image.
Solution: As you adjust the value of k_gray, a higher k_gray makes the image more similar to the original image, which means the value of distortion_gray is smaller. For this image, 8 different colors might be enough, but some might feel we need more. The image we use is of size 597 ×



640. If you set k_gray = 8, we need 8 × 8 + 597 × 640 × 3 bits to store it: 8 bits for each representation color in grayscale for the color map, and 3 bits for each pixel to indicate its representation color.
(c) See the ipython notebook. Apply the k-means algorithm to the color image with different values of k.
Observe the distortion. Choose a good value for k, and calculate the memory we need for the compressed
image.
Solution: As you adjust the value of k_color, a higher k_color makes the image more similar to the original image, which means the value of distortion_color is smaller. It is difficult to express this image properly with only 8 or 16 colors, because there are color blends (gradients) in the sky of this image. The image we use is of size 384 × 256. If you set k_color = 64, we need 24 × 64 + 384 × 256 × 6 bits to store it: 24 bits for the RGB values of each representation color, and 6 bits for each pixel to indicate its representation color.
(d) See the ipython notebook. Here we combine the three features of each pixel into a single feature, the intensity, as in a grayscale image. Use the k-means algorithm to locate the boundaries of objects. How many objects
do you expect from this image? Adjust the value of k and describe your observations. If you are
interested in this, you could learn more algorithms for image segmentation from EECS courses in
image processing and computer vision.
Solution: See sol3.ipynb. There are 3 objects in this image: the fish, sea and coral. If you run
the k-means algorithm with k_segment_intensity = 3, you will see the boundary between the sea and
the two other objects is clear. However, we cannot extract the fish and coral perfectly because pixels
for each object do not have similar intensities. This issue cannot be resolved by increasing the value
of k_segment_intensity. In advanced courses, you will learn how to properly incorporate information
from neighboring pixels into features of one pixel.
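A sketch of intensity-based segmentation with k-means, assuming scikit-learn is available (random stand-in data here; in the notebook, img would be the provided grayscale image):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(60, 80)).astype(float)   # stand-in grayscale image

k = 3                                                     # one cluster per expected object
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(img.reshape(-1, 1))
segments = km.labels_.reshape(img.shape)                  # each pixel labeled by its cluster
```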

4. Brain-machine interface
The IPython notebook pca_brain_machine_interface.ipynb will guide you through the process of analyzing brain-machine interface data using principal component analysis (PCA). This will help you to prepare for
the project, where you will need to use PCA as part of a classifier that will allow you to use voice or music
inputs to control your car.
Please complete the notebook by following the instructions given.
Solution: The notebook pca_brain_machine_interface_sol.ipynb contains solutions to this exercise.

Contributors:

• Siddharth Iyer.

• Justin Yim.

• Stephen Bailey.

• Yu-Yun Dai.

