compiled by Alvin Wan from Professor Benjamin Recht's lecture, Professor Jonathan Shewchuk's
notes, and Elements of Statistical Learning
1 Overview
The outline for our algorithm is the following: we take in an input $X$, which is $d \times n$, and a target dimension $r$.

1. De-mean $X$: subtract the column mean $\bar{x}$ from each column to obtain $X_c$.
2. Compute the SVD of $X_c = USV^T$.
3. Return $S_r V_r^T$, $U_r$, and $\bar{x}$.
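The outline above can be sketched in a few lines of NumPy. This is a minimal sketch, not the course's reference code; the names `pca`, `coords`, and `mu` are our own.

```python
import numpy as np

def pca(X, r):
    """Minimal sketch of the outline above (names are our own, not the
    course's): de-mean X (d x n, columns are points), take the SVD,
    and return the rank-r pieces plus the mean."""
    mu = X.mean(axis=1, keepdims=True)        # column mean
    Xc = X - mu                               # step 1: de-mean
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)  # step 2: SVD
    coords = np.diag(S[:r]) @ Vt[:r]          # S_r V_r^T, the r x n coordinates
    return coords, U[:, :r], mu               # step 3: coords, basis U_r, mean

# Usage: a rank-r reconstruction is U_r (S_r V_r^T) + mean.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 100))
coords, Ur, mu = pca(X, r=2)
X_approx = Ur @ coords + mu
```

With $r = d$ the reconstruction is exact, since no singular values are discarded.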
We will explore 3 different views of PCA, showing how each approach gives us the same
result: the maximum eigenvalue and its corresponding eigenvector. In the last section, we
will then explore an application of PCA to Latent Semantic Indexing (or Latent Factor
Analysis).
View 1: Maximizing Variance

$$\begin{aligned}
\text{Maximize}_{u : \|u\|_2 = 1} \ \operatorname{var}(u^T x_i)
&= \text{Maximize}_{u : \|u\|_2 = 1} \ \frac{1}{n} \sum_{i=1}^n (u^T x_i)^2 \\
&= \text{Maximize}_{u : \|u\|_2 = 1} \ \frac{1}{n} \sum_{i=1}^n u^T x_i x_i^T u \\
&= \text{Maximize}_{u : \|u\|_2 = 1} \ u^T \left( \frac{1}{n} \sum_{i=1}^n x_i x_i^T \right) u \\
&= \text{Maximize}_{u : \|u\|_2 = 1} \ \frac{1}{n} u^T X X^T u
\end{aligned}$$
Why is $u$ normalized? When $u$ is normalized, $u^T x_i$ corresponds to the (signed) length of the projection of $x_i$ onto $u$. Why is $X$ de-meaned? Only then is $XX^T$ (up to the factor $\frac{1}{n}$) the covariance matrix. Recall that the entries along its diagonal take the form $\operatorname{cov}(x_i, x_i) = \operatorname{var}(x_i)$. Additionally, if we do not de-mean, the first eigenvector will point toward the mean.
$$u^T X X^T u = u^T S u = u^T (P D P^T) u = w^T D w = \sum_{i=1}^d \lambda_i w_i^2$$

where $S = XX^T = PDP^T$, the columns of $P$ are the eigenvectors of $S$, $D = \operatorname{diag}(\lambda_1, \ldots, \lambda_d)$, and $w = P^T u$ with $\|w\|_2 = \|P^T u\|_2 = 1$ since $P$ is orthogonal.
Since the $\lambda_i$'s are listed in descending order, and $\|w\|_2 = 1$, $\sum_{i=1}^d \lambda_i w_i^2$ is maximized when $w_1 = 1$ and all other $w_i = 0$. As a result, $w = e_1$ and

$$u = P w = P e_1 = u_1$$
where $u_1$ is the first eigenvector. Our direction of maximum variance is $u_1$. How do we find subsequent principal components? Deflate: take

$$x_i \leftarrow x_i - (u_1^T x_i) u_1.$$

Note that the deflated $X$ is still centered and is orthogonal to $u_1$. To find the next direction of maximum variance, we need the maximum singular value of the deflated matrix. Note that this is the second principal component for our original $X$, with singular value $\sigma_2(X)$, in the direction $u_2$.
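A quick numeric check of this view: the top eigenvector of $XX^T$ beats any other unit direction on empirical variance, and deflating by $u_1$ exposes $u_2$. All variable names and the random data here are illustrative.

```python
import numpy as np

# Numeric check of View 1 (illustrative names, random data): the top
# eigenvector u1 of X X^T is the direction of maximum variance, and
# deflating by u1 exposes the second principal direction u2.
rng = np.random.default_rng(1)
X = rng.normal(size=(3, 500))
X = X - X.mean(axis=1, keepdims=True)          # de-mean first

evals, evecs = np.linalg.eigh(X @ X.T)         # eigenvalues ascending
u1, u2 = evecs[:, -1], evecs[:, -2]            # top two eigenvectors

# u1 beats a random unit direction on (empirical) variance.
var_u1 = np.mean((u1 @ X) ** 2)
v = rng.normal(size=3)
v /= np.linalg.norm(v)
assert var_u1 >= np.mean((v @ X) ** 2)

# Deflation: subtract the projection onto u1 from every column.
X2 = X - np.outer(u1, u1 @ X)
# The deflated data is orthogonal to u1, and its top eigenvector
# matches u2 up to sign.
_, evecs2 = np.linalg.eigh(X2 @ X2.T)
top2 = evecs2[:, -1]
assert min(np.linalg.norm(top2 - u2), np.linalg.norm(top2 + u2)) < 1e-6
```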
View 2: Maximizing Likelihood
The second view is fitting a Gaussian model to our data. We will assume the following, where $\alpha \in \mathbb{R}^d$, each $z_i$ is a scalar, and $\epsilon_i$ is noise:

$$x_i = \alpha z_i + \epsilon_i$$

Note that this is equivalent to the following, where we chain all of the $x_i$ together as the columns of $X$. We wish to find the best rank-one matrix that models our data:

$$\text{Minimize}_{\alpha, z} \ \|X - \alpha z^T\|_F^2$$

Writing the SVD entry-wise as $X = \sum_{i=1}^k \sigma_i u_i v_i^T$, the minimizer is the top term: $\alpha z^T = \sigma_1 u_1 v_1^T$. Does this uniquely define a $z$ and an $\alpha$? It doesn't, because we can scale $\alpha$ by some constant $c$ and $z$ by $c^{-1}$. Even if $z$ were constrained to have norm 1, $z$ and $\alpha$ are not unique.
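A short sketch of this non-uniqueness, assuming the standard SVD solution to the rank-one problem (the top singular triple); `alpha`, `z`, and the random data are illustrative.

```python
import numpy as np

# Sketch of the non-uniqueness of (alpha, z), assuming the standard
# SVD solution to the rank-one problem: alpha z^T = sigma_1 u_1 v_1^T.
# Variable names mirror the symbols above; the data is illustrative.
rng = np.random.default_rng(2)
X = rng.normal(size=(4, 50))
U, S, Vt = np.linalg.svd(X, full_matrices=False)

alpha = S[0] * U[:, 0]             # one valid choice of alpha
z = Vt[0]                          # and of z
best = np.outer(alpha, z)          # best rank-one fit to X

# Eckart-Young: no rank-one matrix fits better than the top SVD term.
random_rank_one = np.outer(rng.normal(size=4), rng.normal(size=50))
assert np.linalg.norm(X - best) <= np.linalg.norm(X - random_rank_one)

# Rescaling alpha by c and z by 1/c leaves the fit unchanged,
# so (alpha, z) is not unique.
c = 3.7
assert np.allclose(np.outer(c * alpha, z / c), best)
```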
This is reminiscent of regression: when we run linear regression, the error runs along the y-axis. For PCA, the error is the minimum distance from the point to the line, making the error perpendicular to our line. In other words, our line is the span of $w$,

$$L(w) = \{\lambda w : \lambda \in \mathbb{R}\}$$
We have that distance is defined to be the following. Note that the projection of $x_i$ onto $w$ is

$$\operatorname{proj}_w x_i = \frac{\langle w, x_i \rangle}{\langle w, w \rangle} w = \frac{w^T x_i}{\|w\|^2} w.$$

To take the distance to our line, we then consider the magnitude of the difference between the projection and $x_i$:

$$\operatorname{Dist}(x_i, L(w)) = \left\| x_i - \frac{w^T x_i}{\|w\|^2} w \right\|$$
As a result, our objective is simply the sum of these squared distances to the line. In the first step, we expand using $(a - b)^2 = a^2 - 2ab + b^2$. In the second, we use the fact that $w^T x_i$ is a scalar and that $x_i^T w = (w^T x_i)^T$ is the same scalar. In the fourth, we know that $\frac{w}{\|w\|}$ is a unit vector and that $\frac{w^T x_i}{\|w\|}$ is a scalar.

$$\begin{aligned}
\text{Minimize}_w \ \sum_{i=1}^n \left\| x_i - \frac{w^T x_i}{\|w\|^2} w \right\|^2
&= \sum_{i=1}^n \|x_i\|^2 - 2 x_i^T \frac{w^T x_i}{\|w\|^2} w + \left\| \frac{w^T x_i}{\|w\|^2} w \right\|^2 \\
&= \sum_{i=1}^n \|x_i\|^2 - 2 \frac{(w^T x_i)^2}{\|w\|^2} + \left\| \frac{w^T x_i}{\|w\|^2} w \right\|^2 \\
&= \sum_{i=1}^n \|x_i\|^2 - 2 \left( \frac{w^T x_i}{\|w\|} \right)^2 + \left\| \frac{w^T x_i}{\|w\|} \frac{w}{\|w\|} \right\|^2 \\
&= \sum_{i=1}^n \|x_i\|^2 - 2 \left( \frac{w^T x_i}{\|w\|} \right)^2 + \left( \frac{w^T x_i}{\|w\|} \right)^2 \\
&= \sum_{i=1}^n \|x_i\|^2 - \left( \frac{w^T x_i}{\|w\|} \right)^2
\end{aligned}$$
Since the $x_i$ are fixed, the $\|x_i\|_2^2$ are fixed. We can therefore reformulate this as a maximization problem, considering only the second term.
$$\begin{aligned}
\text{Maximize}_w \ \sum_i \left( \frac{w^T x_i}{\|w\|} \right)^2
&= \frac{\sum_i w^T x_i x_i^T w}{\|w\|^2} \\
&= \frac{w^T X X^T w}{\|w\|^2} \\
&= \left( \frac{w}{\|w\|} \right)^T X X^T \left( \frac{w}{\|w\|} \right)
\end{aligned}$$

Since $\frac{w}{\|w\|}$ is a unit vector, we can equivalently optimize over $u \in \mathbb{R}^d$, where $\|u\|_2 = 1$:

$$\text{Maximize}_{u : \|u\|_2 = 1} \ u^T X X^T u$$
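As a sanity check that this distance view agrees with the variance view, we can evaluate the projection objective at the top eigenvector of $XX^T$ and compare it against random directions (the names and data below are our own):

```python
import numpy as np

# Sanity check (illustrative names): the projection objective
# sum_i (w^T x_i / ||w||)^2 is maximized by the top eigenvector of
# X X^T, matching the variance-maximization view.
rng = np.random.default_rng(3)
X = rng.normal(size=(3, 200))
X = X - X.mean(axis=1, keepdims=True)

def objective(w):
    # sum_i (w^T x_i / ||w||)^2 for a (not necessarily unit) direction w
    return np.sum((w @ X) ** 2) / (w @ w)

_, evecs = np.linalg.eigh(X @ X.T)
u1 = evecs[:, -1]                   # top eigenvector
for _ in range(100):
    w = rng.normal(size=3)
    assert objective(u1) >= objective(w)
```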
2.1 Applications
1. Fuzzy search: the reduced-rank $X'$ clusters synonyms, by SVD.
2. Denoising: reducing dimensionality may improve classification, as it removes noise. We model $x_i = \alpha z_i + \epsilon_i$, signal plus noise, rather than pure signal $x_i = \alpha z_i$. In other words, $X$ may be a noisy measurement of some low-rank matrix. Thus, the reduced-rank $X'$ may be a better estimator.
3. Collaborative filtering: LFA may allow us to fill in missing values (i.e., matrix completion). Just as $X'$ clusters synonyms, it groups users with similar tastes.
2.2 Relation to SVD
Recall that SVD gives us a decomposition $X = USV^T$, which can be written entry-wise as

$$X = \sum_{i=1}^d \sigma_i u_i v_i^T,$$

where the $\sigma_i$'s are ordered from greatest to least. Consider each $i$ as a genre. $u_i$ then tells us which documents are associated with that genre, and $v_i$ tells us which terms are associated with that genre.
This is a form of clustering, where we notice that similar genres or books see stronger cluster
memberships. However, clusters can overlap, meaning that documents selected by u1 are
not necessarily disjoint from documents selected by u2 .
Formally, we're looking to factor $X$ into $AB^T$. This yields the following objective function:

$$\text{Minimize}_{(A,B)} \ \|X - AB^T\|_F^2$$

As it turns out, the solution is $A = U_r S_r^{1/2}$, $B = V_r S_r^{1/2}$, where $U_r$ contains the first $r$ columns of $U$, $S_r$ is the diagonal matrix with the first $r$ singular values, and $V_r$ contains the first $r$ columns of $V$. In this case, we note that $\|A\|_F = \|B\|_F$. Consider the rank-$r$ approximation $X'$ of $X$. Per PCA, we select the $r$ pairs $u_i, v_i$ with the largest singular values $\sigma_i$.
$$X' = \sum_{i=1}^r \sigma_i u_i v_i^T$$
$$\|X - X'\|_F^2 = \sum_{i,j} (X_{ij} - X'_{ij})^2$$
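A small numeric check of this factorization, with illustrative names and data: build `A` and `B` from the SVD as above, then verify $\|A\|_F = \|B\|_F$ and that the squared Frobenius error equals the sum of the discarded squared singular values.

```python
import numpy as np

# Check of the claimed solution (illustrative data): A = U_r S_r^{1/2},
# B = V_r S_r^{1/2}, so X' = A B^T is the best rank-r approximation.
rng = np.random.default_rng(4)
X = rng.normal(size=(6, 8))
U, S, Vt = np.linalg.svd(X, full_matrices=False)

r = 2
A = U[:, :r] * np.sqrt(S[:r])      # U_r S_r^{1/2}, columns scaled
B = Vt[:r].T * np.sqrt(S[:r])      # V_r S_r^{1/2}
X_prime = A @ B.T

# ||A||_F = ||B||_F, and the squared error is the sum of the
# discarded squared singular values sigma_{r+1}^2 + ... .
assert np.isclose(np.linalg.norm(A), np.linalg.norm(B))
assert np.isclose(np.sum((X - X_prime) ** 2), np.sum(S[r:] ** 2))
```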