
PRINCIPAL COMPONENT ANALYSIS

The idea: find which features are strongly correlated with each other; if the correlation is high, then
(at least) one of those features can be eliminated.
Features with the highest variances are the most interesting.
Example: calculating covariance between two features X and Y, where mx = avg(X) and my = avg(Y).
Sample      X       Y        X-mx     Y-my
1           2.5     2.4       0.69     0.49
2           0.5     0.7      -1.31    -1.21
3           2.2     2.9       0.39     0.99
4           1.9     2.2       0.09     0.29
5           3.1     3.0       1.29     1.09
6           2.3     2.7       0.49     0.79
7           2.0     1.6       0.19    -0.31
8           1.0     1.1      -0.81    -0.81
9           1.5     1.6      -0.31    -0.31
10          1.1     0.9      -0.71    -1.01
avg=        1.81    1.91
stdev=      0.785211  0.846496

The formula for covariance is:

Cov(X,Y) = (1/n)*[(x1-mx)(y1-my) + (x2-mx)(y2-my) + ... + (xn-mx)(yn-my)]
Notice that Cov(X,X) is the same thing as variance.

Cov(X,Y) = 0.5539
Cov(X,X) = 0.5549
Cov(Y,Y) = 0.6449

Cov(Y,X) = Cov(X,Y)
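
As a quick sanity check, here is a minimal Python/NumPy sketch (mine, not part of the original notes) that recomputes these values from the table above using the 1/n formula:

import numpy as np

# Feature values from the worked example above
x = np.array([2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1])
y = np.array([2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9])
n = len(x)

def cov(a, b):
    # Covariance with the 1/n normalization used in these notes
    return np.sum((a - a.mean()) * (b - b.mean())) / n

print(cov(x, y))  # ~ 0.5539
print(cov(x, x))  # ~ 0.5549
print(cov(y, y))  # ~ 0.6449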

For easy reference, tabulate the results by putting all the covariances into a covariance matrix Cov.
Elements of Cov are Cij, where Cij = Cov(Xi, Xj). In our case, X1 = X and X2 = Y.

Cov = | 0.5549  0.5539 |
      | 0.5539  0.6449 |

At this point, we could program this manually. However, what if we have many features?
Since the covariance can be represented as a matrix, we can use matrix math.
If we can come up with covariance formulas that use matrices, that will make our life easier because
tried-and-tested matrix code is readily available.
We would feed matrices into a software package/library function, which would do the matrix math.
So let's work out the matrix math.

In general, the input data is a flat file, i.e. a matrix of n samples (i.e. rows) and f features (i.e. columns):

X = | x11  x12  x13  ...  x1f |
    | x21  x22  x23  ...  x2f |
    | ...                     |
    | xn1  xn2  xn3  ...  xnf |

To transpose a matrix means to switch rows and columns; i.e. if the original matrix had elements Xij,
the transposed matrix has elements Xji.

Transpose(X) = | x11  x21  ...  xn1 |
               | x12  x22  ...  xn2 |
               | x13  x23  ...  xn3 |
               | ...                |
               | x1f  x2f  ...  xnf |

Obviously, transposing a transpose gives back the original: Transpose(Transpose(X)) = X.

The covariance matrix is calculated using matrix multiplication as:

Cov = (1/n) * [Transpose(X - mx) * (X - mx)]

Here mx is a matrix of size n x f that contains the averages for all columns:
each row of mx is [avg(1st column) avg(2nd column) avg(3rd column) ... avg(last column)].

Covariance is a square matrix of dimension f x f (because we are comparing each feature to all other features).
Cov has elements Cij, i = 1, ..., f and j = 1, ..., f.
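
Here is a sketch of how this matrix formula looks in code (again a NumPy illustration of mine; the variable names are assumptions, not from the notes):

import numpy as np

# Input: n samples (rows) x f features (columns), from the example above
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
              [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])
n = X.shape[0]

Xc = X - X.mean(axis=0)   # X - mx: subtract each column's average
Cov = (Xc.T @ Xc) / n     # Cov = (1/n) * Transpose(X - mx) * (X - mx)
print(Cov)                # [[0.5549 0.5539]
                          #  [0.5539 0.6449]]

Note that NumPy's own np.cov(X, rowvar=False, bias=True) uses the same 1/n normalization and returns the same matrix.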
------------------------------------------------------------------------------------------------------------------------------------
At this point, we can just let the software calculate the covariance. But let's go one more step and calculate Cij.
We can calculate Cij using column vectors.
PS - this way may look "upside down", but it states the problem correctly - we are comparing
columns (i.e. features).
A vector is a matrix of only 1 column.
For example, if we have a vector V = [1 2], then its transpose is:

Transpose(V) = | 1 |
               | 2 |

i.e. just "flipped over" V.
Obviously, transposing a transpose gives back the original: Transpose(Transpose(V)) = V.
Let us assume that we label the column vectors Xk, where each Xk is a vector representing the k-th feature.
For example, in the example above:

X1 = Transpose[2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1]
X2 = Transpose[2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9]

Cij can be obtained using scalar multiplication of two vectors as:

Cij = (1/n) * [Transpose(Xi - mi) * (Xj - mj)]

Cij = (1/n) * SUM[k=1,n] {(Xki - mi) * (Xkj - mj)}


where Xki and Xkj are the elements of the input matrix as shown above,
and mi and mj are the averages of ith and jth column, respectively.
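
Continuing the NumPy sketch from above, a single element such as C12 comes out of one dot product of centered columns:

# X1 and X2 are the two feature columns of X from the earlier sketch
X1, X2 = X[:, 0], X[:, 1]
C12 = (X1 - X1.mean()) @ (X2 - X2.mean()) / n
print(C12)  # ~ 0.5539, matching Cov(X,Y) above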
-----------------------------------------------------------------------------------------------------------------------------------------------------------
So far, we have calculated the covariance, but we are not done with PCA yet: we need to find the eigenvalues of the
covariance matrix Cov.
The eigenvalues are the solutions to the following equation:

Cov * Ei = Li * Ei

Ei are the eigenvectors (each Ei has dimension f x 1).
Li are the corresponding eigenvalues (i.e. the variances along the corresponding principal directions), i = 1, ..., f.
I is the identity matrix (a square matrix with all 1's on the diagonal and 0's elsewhere).
We will ask the software to do that for us. Most likely, it will solve the following equation to find Li:

determinant(Cov - L*I) = 0
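
Continuing the NumPy sketch, the eigen-decomposition can be delegated to the library (np.linalg.eigh handles symmetric matrices such as Cov and returns eigenvalues in ascending order, so we reverse them):

# Solve Cov * Ei = Li * Ei for the symmetric matrix Cov
L, E = np.linalg.eigh(Cov)
L, E = L[::-1], E[:, ::-1]   # reorder to decreasing eigenvalues
print(L)                     # ~ [1.1556, 0.0442]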
Once you find the eigenvalues, sort them in decreasing order. Ideally, the first few values are prominently the highest
and can be considered the most important. The smallest values represent features that can be discarded.
To find out how many features to keep, pick the m <= f highest eigenvalues and calculate:

R = SUM[i=1,m] Li / SUM[i=1,f] Li

If R > Threshold, keeping only those m features and discarding the others gives
a good representation of the f-dimensional data set.
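
A short sketch of this last step, continuing the code above (the 0.95 threshold is an assumed value; pick one appropriate for your application):

threshold = 0.95             # assumed; not prescribed by the notes
total = L.sum()
for m in range(1, len(L) + 1):
    R = L[:m].sum() / total  # fraction of total variance retained
    if R > threshold:
        break
print(m, R)  # with the example data: m = 1, R ~ 0.963

With the example data, the first eigenvalue alone already accounts for about 96% of the total variance, so one dimension suffices at this threshold.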