2018/2019
Frederico Cruz Jesus
fjesus@novaims.unl.pt
Chapter 2 – Principal Components Analysis
1. Introduction
2. Geometry of PCA
3. Analytical Approach
4. PCA application
5. Issues relating to the use of PCA
Introduction
Objectives of PCA
Imagine the following hypothetical situation:
• A financial analyst is interested in determining the financial health of firms in a
specific market sector. To do so, he/she has a data set consisting of 1,000 firms,
with information on 50 ratios for each. The analyst would be dealing with 50,000
pieces of information. However, this task would be made simpler if he/she could
reduce the number of ratios from 50 to, say, three.
Principal Components Analysis is the appropriate technique for achieving this
objective. PCA is a technique for forming new variables which are linear composites of
the original variables. The maximum number of new variables that can be formed is
equal to the number of original variables, and the new variables are uncorrelated among
themselves.
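As a minimal sketch of this kind of reduction (not from the original slides; the firm data here is random, purely for illustration, and scikit-learn is assumed to be available):

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data: 1,000 firms x 50 financial ratios (random, for illustration)
rng = np.random.default_rng(0)
ratios = rng.normal(size=(1000, 50))

# Reduce the 50 ratios to 3 uncorrelated composite variables
pca = PCA(n_components=3)
scores = pca.fit_transform(ratios)

print(scores.shape)  # (1000, 3): 3 new variables instead of 50
```

The 1,000 firms are now described by three principal component scores each, and those three new variables are uncorrelated by construction.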
Introduction
Objectives of PCA
Principal Components Analysis can be considered as:
• A technique which allows us to generate k new variables, where k is less than or equal
to the number of original variables (p);
The variances of the two variables are 23.091 and 21.091, respectively, so the total
variance is 44.182. Note that the variance of X1 represents 52.26% of the total
variance, whereas the variance of X2 represents 47.74%.
The original and mean-corrected data, as well as their projection in the variable space,
can be seen in the next slides.
Geometry of PCA
One could create a new dimension, or axis, X*1, making an angle of θ degrees with X1
and, naturally, 90−θ degrees with X2. The observations could then also be projected onto
X*1, which can be considered a new variable.
Geometry of PCA
It becomes clear that the percentage of variance accounted for by X*1 increases as the
angle θ between X*1 and X1 increases and then, after a certain maximum value, the
variance accounted for by X*1 starts to decrease. Hence, there is one and only one new
axis that results in a new variable accounting for the maximum variance in the data.
Geometry of PCA
For θ = 43.216 degrees, the two new variables are defined by the following rotation
equations:

x*1 = cos θ · x1 + sin θ · x2
x*2 = −sin θ · x1 + cos θ · x2
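The idea that the variance along X*1 peaks at one particular angle can be checked numerically. The sketch below (not from the original slides) draws a two-variable point cloud and scans θ; the covariance 16.455 used to generate the cloud is an assumption inferred from the variance shares quoted later in the chapter, and the original variances 23.091 and 21.091 are taken from the text:

```python
import numpy as np

# Simulated mean-corrected data with the variances from the example
# (23.091 and 21.091); the covariance 16.455 is an inferred assumption.
rng = np.random.default_rng(1)
x = rng.multivariate_normal([0, 0],
                            [[23.091, 16.455],
                             [16.455, 21.091]], size=500)

def variance_along(theta_deg):
    """Sample variance of the data projected onto an axis at angle theta to X1."""
    t = np.radians(theta_deg)
    proj = x @ np.array([np.cos(t), np.sin(t)])  # x*1 scores
    return proj.var(ddof=1)

thetas = np.arange(0.0, 180.0, 0.5)
best = thetas[np.argmax([variance_along(t) for t in thetas])]
print(best)  # close to the theta that defines the first principal component
```

With this population covariance the maximizing angle is near 43 degrees, matching the value quoted on the slide (up to sampling noise).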
Geometry of PCA
The first new axis X*1 results in a new variable x*1, such that this new variable accounts
for the maximum of the total variance. After this, a second axis, orthogonal to the first,
is identified such that the corresponding new variable, x*2, accounts for the
maximum of the variance that has not been accounted for by the first new variable x*1;
x*1 and x*2 are uncorrelated. This procedure is carried out until all p new axes have
been identified, such that the new variables x*1, x*2, …, x*p account for successive
maximum variances and are uncorrelated.
Note that once p−1 axes have been identified, the pth axis is fixed by the condition that all
the axes must be orthogonal. Note also that the maximum number of new variables,
i.e., principal components, is equal to the number of original variables.
Geometry of PCA
Let us consider the case where, instead of using both original variables, we use only
X*1 to represent most of the information in the data. Geometrically, we would be
representing the data in a one-dimensional space. In the case of p variables one may
want to represent the data in a lower k-dimensional space, where k << p.
For each situation where PCA is used, the question whether the loss of information is
substantial or not depends on the purpose or objective of the study.
Geometry of PCA
Objectives of PCA
From a geometric point of view, PCA aims to identify a new set of orthogonal axes such
that:
• The coordinates of the observations with respect to each of the axes give the values for
the new variables. As mentioned previously, the new axes or the variables are called
principal components and the values of the new variables are called principal
components scores;
• Each new variable is a linear combination of the original variables;
• The first new variable accounts for the maximum variance in the data;
• The second new variable accounts for the maximum variance that has not been
accounted for by the first variable;
• The pth new variable accounts for the variance that has not been accounted for by the
first p-1 variables.
• The p new variables are uncorrelated.
Analytical Approach
The objective is to represent the data in a lower k-dimensional space, with k << p. Each
new variable ξi is a linear combination of the p original variables:

ξi = wi1 x1 + wi2 x2 + … + wip xp

subject to the normalization wi1² + wi2² + … + wip² = 1.
Analytical Approach
• The first principal component, ξ1, accounts for the maximum variance in the data; the
second principal component, ξ2, accounts for the maximum variance that has not been
accounted for by the first principal component, and so on;
The mathematical problem is then: how do we obtain the weights such that these
conditions are satisfied? It can be shown that PCA reduces to finding the eigenstructure
of the covariance matrix S of the original data.
Alternatively, PCA can also be done by finding the singular value decomposition (SVD) of
the data matrix or a spectral decomposition of the S matrix.
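The equivalence between the two routes can be shown in a few lines. This sketch (not from the original slides) computes the component variances first from the eigenstructure of S and then from the SVD of the mean-corrected data matrix; the data are random, purely for illustration:

```python
import numpy as np

# PCA via eigendecomposition of the covariance matrix S,
# and equivalently via the SVD of the mean-corrected data matrix X.
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 5))
Xc = X - X.mean(axis=0)                 # mean-correct the data
S = np.cov(Xc, rowvar=False)            # p x p covariance matrix

# Route 1: eigenstructure of S (eigenvalues = component variances)
eigvals = np.linalg.eigvalsh(S)[::-1]   # sort in decreasing order

# Route 2: SVD of the mean-corrected data matrix
_, sing, _ = np.linalg.svd(Xc, full_matrices=False)
svd_vars = sing**2 / (X.shape[0] - 1)   # singular values -> variances

print(np.allclose(eigvals, svd_vars))   # True: the two routes agree
```

The squared singular values of the centred data, divided by n − 1, are exactly the eigenvalues of S; in practice the SVD route is often preferred for numerical stability.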
Interpreting PCA
Descriptive Statistics
This part of the output gives basic descriptive statistics such as the mean and the
standard deviation. As can be seen, the means of the variables are 8 and 3, and the
standard deviations are 4.805 and 4.592.
Interpreting PCA
Eigenvalues
As we have two variables, the Var-Cov matrix is a 2×2 matrix. Hence, it will be possible
to form two principal components, which is the same as saying that two eigenvectors,
each associated with one eigenvalue, will be formed.
The following part of the SAS output contains the eigenvalues of the Var-Cov matrix.
Note that the sum of the eigenvalues equals the total variance in the data. This means
that the first principal component represents 87.3% of the total variance of the data,
whereas the second principal component “only” represents 12.7%.
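These percentages can be reproduced directly from the 2×2 Var-Cov matrix. In the sketch below (not from the original slides) the diagonal entries 23.091 and 21.091 come from the text; the off-diagonal covariance 16.455 is an assumption chosen to be consistent with the variance shares just quoted:

```python
import numpy as np

# 2x2 Var-Cov matrix of the example; the off-diagonal 16.455 is an
# assumption inferred from the 87.3% / 12.7% variance shares.
S = np.array([[23.091, 16.455],
              [16.455, 21.091]])

eigvals = np.sort(np.linalg.eigvalsh(S))[::-1]
total = eigvals.sum()                 # equals the total variance, 44.182
shares = eigvals / total

print(np.round(eigvals, 3))           # approx [38.576  5.606]
print(np.round(shares * 100, 1))      # approx [87.3  12.7]
```

The sum of the two eigenvalues equals the total variance 44.182, and the first eigenvalue alone carries about 87.3% of it.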
Interpreting PCA
Eigenvectors
The eigenvalues represent the variance of each principal component.
The following part of the SAS output shows the two eigenvectors extracted:
Loadings
As already mentioned, the linear correlation between each pair of principal
components is zero, as the principal components are uncorrelated, i.e., they are
orthogonal.
The correlations between each original variable and the extracted principal components
are called loadings. The loadings show the extent to which an original variable is
important in the formation of a principal component: the higher the absolute value of a
loading, the more important the variable is in that principal component's formation.
Loadings are one of the most important tools for understanding how the principal
components are formed.
Interpreting PCA
Loadings
Loadings can be obtained from the following equation:

lij = wij √λi / ŝj

where lij is the loading of the jth variable on the ith principal component; wij is the weight of the jth
variable for the ith principal component; λi is the eigenvalue (i.e., the variance) of the ith principal
component; and ŝj is the standard deviation of the jth variable.
In the specific case of our example data, the loading between X1 and the first principal
component is given by:
l11 = 0.728 × √38.576 / 4.805 = 0.941
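Plugging in the figures from the slides, the computation is a one-liner; the sketch below (not from the original slides) simply reproduces it:

```python
import numpy as np

# Loading of X1 on the first principal component, using the figures
# from the slides: weight w11 = 0.728, eigenvalue lambda1 = 38.576,
# standard deviation s1 = 4.805.
w11, lam1, s1 = 0.728, 38.576, 4.805
l11 = w11 * np.sqrt(lam1) / s1
print(round(l11, 3))  # 0.941
```

A loading of 0.941 indicates that X1 is strongly correlated with, and highly influential in, the first principal component.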
Interpreting PCA
Until this point it was demonstrated that PCA is the formation of new variables that are
linear combinations of the original variables. However, as a data analytic technique, the
use of PCA raises a number of issues that need to be addressed.
* This statement is, in fact, used very loosely, as there are many cases where the first
principal component (the one with the highest explained variance) may not be the most
important/interesting one.
Issues relating PCA
In general, the weight assigned to a variable is affected by the relative variance of the
variable. If we do not want the relative variances to affect the weights, then the data
should be standardized so that the variance of each variable is the same (i.e., one).
The choice between the analysis obtained from the mean-corrected and standardized
data also depends on other factors. In cases for which there is reason to believe that the
variances of the variables do indicate the importance of a given variable, then mean-
corrected data should be used.
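The effect of this choice can be seen numerically. In the sketch below (not from the original slides, data random and purely for illustration), one variable has a much larger variance than the others; on mean-corrected data it dominates the first component's weights, while on standardized data the weights are balanced:

```python
import numpy as np

# Three independent variables with very unequal variances.
rng = np.random.default_rng(3)
X = rng.normal(size=(300, 3)) * np.array([1.0, 10.0, 100.0])
Xc = X - X.mean(axis=0)

# PCA on mean-corrected data: eigenvectors of the covariance matrix.
_, cov_vecs = np.linalg.eigh(np.cov(Xc, rowvar=False))

# PCA on standardized data: equivalent to using the correlation matrix.
Z = Xc / Xc.std(axis=0, ddof=1)
_, cor_vecs = np.linalg.eigh(np.cov(Z, rowvar=False))

# Weights of the first (largest-eigenvalue) component in each case:
print(np.abs(cov_vecs[:, -1]))  # dominated by the high-variance variable
print(np.abs(cor_vecs[:, -1]))  # much more balanced weights
```

On mean-corrected data the weight on the high-variance third variable is close to one; standardizing removes that dominance, which is why the choice between the two analyses should reflect whether the variances are believed to carry real information.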
Issues relating PCA
On the other hand, if the objective is to reduce the number of variables in the data set to a
few variables (principal components) that are linear combinations of the original
variables, then it is imperative that the number of principal components be less than the
original variables. In such a case, PCA should only be performed if the data can be
represented by a fewer number of principal components without a substantial loss of
information, a notion that depends on the context of the study.
Issues relating PCA
On the other hand, if the variables are answers to a customer survey questionnaire, five
principal components explaining 99% of the variation in the 100 questions may well be
considered excellent. Hence, whether PCA is the appropriate technique or not depends to
a great extent on the context of the problem.
In any case, we know that if the variables are perfectly correlated, one principal
component will be enough to explain all the variation in the data. Some statistical tests
may be used to assess the level of correlation in the data; however, these have some
significant drawbacks.
Issues relating PCA
The most popular decision rules are based on the following:
1. Kaiser's criterion – retain the principal components with eigenvalues greater than
one;
2. Pearson's criterion – retain principal components until 80% of the variance is
explained;
3. Scree plot method – plot the percentage of variance accounted for by each principal
component and look for an elbow.
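The first two rules are mechanical and easy to code; the scree plot requires visual judgment. A minimal sketch (not from the original slides), applied to a hypothetical eigenvalue vector from a PCA on standardized data (so the greater-than-one rule is meaningful):

```python
import numpy as np

# Hypothetical eigenvalues from a PCA on standardized data.
eigvals = np.array([3.2, 1.4, 0.9, 0.3, 0.2])

# 1. Kaiser's criterion: keep components with eigenvalue > 1.
kaiser = int((eigvals > 1).sum())

# 2. Pearson's criterion: keep components until 80% of variance is explained.
cum_share = np.cumsum(eigvals) / eigvals.sum()
pearson = int(np.searchsorted(cum_share, 0.80) + 1)

# 3. Scree plot: plot eigvals against component number and look for the elbow.
print(kaiser, pearson)  # 2 3
```

Note that the rules need not agree: here Kaiser retains two components while Pearson retains three, which is one reason the choice should be checked against the study's context.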
Issues relating PCA
The higher (in absolute terms) the loading of a variable, the more influence it has in the
formation of the principal component score and vice versa. Therefore, one can use the
loadings to determine which variables are influential in the formation of principal
components, and one can then assign a meaning or label to the principal component.
Issues relating PCA
The advantage of using principal component scores is that the new variables are
uncorrelated, so the problem of multicollinearity is avoided. It should be noted,
nevertheless, that although multicollinearity is avoided, a new problem arises from this
approach: the difficulty of interpreting the principal components.
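The sketch below (not from the original slides, random data purely for illustration) makes the point concrete: two nearly collinear variables are almost perfectly correlated, while their principal component scores are uncorrelated:

```python
import numpy as np

# Two nearly collinear variables: a textbook multicollinearity problem.
rng = np.random.default_rng(4)
x1 = rng.normal(size=400)
x2 = x1 + rng.normal(scale=0.05, size=400)
X = np.column_stack([x1, x2])
Xc = X - X.mean(axis=0)

# Principal component scores: project onto the eigenvectors of S.
_, vecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
scores = Xc @ vecs

print(abs(np.corrcoef(X, rowvar=False)[0, 1]))       # close to 1: collinear
print(abs(np.corrcoef(scores, rowvar=False)[0, 1]))  # ~0: uncorrelated
```

Using the score columns as regressors removes the multicollinearity, at the cost that each regressor is now a blend of the original variables and must be interpreted through its loadings.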
Issues relating PCA
Summary
Students should read Sharma, S. (1996), Applied Multivariate Techniques, Wiley, p. 58-89
if they want to extend their knowledge on these subjects.
Thank you!