
Principal Components Analysis

Outline
Data reduction
PCA vs. FA
Assumptions and other issues
Multivariate analysis in terms of eigenanalysis
PCA basics
Examples

Ockham's Razor
One of the hallmarks of good science is parsimony and elegance of theory. In analysis, however, it is also often desirable to reduce the data in some fashion, in order to get a better understanding of it or simply for ease of analysis. In multiple regression (MR) this was done more implicitly: reduce the predictors to a single composite, the sum of the weighted variables. Note the correlation (Multiple R) and its square.

Principal Components Analysis


Conceptually, the goal of PCA is to reduce the number of variables of interest into a smaller set of components. PCA analyzes all the variance in the variables and reorganizes it into a new set of components equal in number to the original variables. Regarding the new components:
They are independent
They decrease in the amount of the original variables' variance they account for
Only some will be retained for further study (dimension reduction)

The first component captures the most variance, the second the next most, and so on until all the variance is accounted for. Since the first few capture most of the variance, they are typically the focus.

PCA vs. Factor Analysis


It is easy to make the mistake of assuming that these are the same technique, and in some ways exploratory factor analysis and PCA are similar; in general both can be seen as factor analytic techniques. However, they are typically used for different reasons, are not mechanically the same, nor do they have the same underlying linear model.

PCA/FA
Principal Components Analysis
Extracts all the factors underlying a set of variables
The number of factors = the number of variables
Completely explains the variance in each variable

Factor Analysis
Analyzes only the shared variance
Error is estimated apart from shared variance

FA vs. PCA conceptually


FA produces factors; PCA produces components. Factors cause variables; components are aggregates of the variables. The underlying causal model is fundamentally distinct between the two.
Some do not consider PCA as part of the FA family.

[Diagram: in the FA model, a latent factor points to the indicators I1, I2, I3; in the PCA model, the indicators I1, I2, I3 point to the component.]

Contrasting the underlying models


PCA
Extraction is the process of forming PCs as linear combinations of the measured variables as we have done with our other techniques
PC1 = b11X1 + b21X2 + … + bk1Xk
PC2 = b12X1 + b22X2 + … + bk2Xk
…
PCf = b1fX1 + b2fX2 + … + bkfXk

Common factor model


X1 = λ1ξ + δ1
X2 = λ2ξ + δ2
…
Xf = λfξ + δf

Each measure X has two contributing sources of variation: the common factor ξ and the specific or unique factor δ.

FA vs. PCA
PCA
PCA is mathematically precise in orthogonalizing dimensions PCA redistributes all variance into orthogonal components PCA uses all variable variance and treats it as true variance

FA
FA distributes common variance into orthogonal factors FA is conceptually realistic in identifying common factors FA recognizes measurement error and true factor variance

FA vs. PCA
In some sense, PCA and FA are not so different conceptually from what we have been doing since multiple regression: creating linear combinations. PCA especially falls along the lines of what we've already been doing.

What we do have different from previous methods is that there is no IV/DV distinction
Just a single set of variables

FA vs. PCA Summary


The goal of PCA is to analyze variance and reduce the observed variables; FA analyzes covariance (communality).
PCA reproduces the R matrix perfectly; FA is a close approximation to the R matrix.
In PCA the goal is to extract as much variance as possible with the fewest components; in FA the goal is to explain as much of the covariance as possible with a minimum number of factors that are tied specifically to assumed constructs.
PCA gives a unique solution; FA can give multiple solutions depending on the method and the estimates of communality.

Questions Regarding PCA


Which components account for the most variance?
How well does the component structure fit a given theory?
What would each subject's score be if they could be measured directly on the components?
What is the percentage of variance in the data accounted for by the components?

Assumptions/Issues
Assumes reliable variables/correlations
Very much affected by missing data, outlying cases, and truncated data. Data screening methods (e.g. transformations) may improve poor factor analytic results.

Normality
Univariate: normally distributed variables make the solution stronger, but normality is not necessary if we are using the analysis in a purely descriptive manner. Multivariate normality is assumed when assessing the number of factors.

Assumptions/Issues
No outliers
Influence on correlations would bias results

Variables as outliers
Some variables don't work:
Explain very little variance
Relate poorly with primary components
Low squared multiple correlations as a DV with other items as predictors
Low loadings

Assumptions/Issues
Factorable R matrix
Need inter-item/variable correlations > .30 or PCA/FA isn't going to do much for you. Large inter-item correlations do not guarantee a solution either:
While two variables may be highly correlated with each other, they may not be correlated with the others.

A matrix of partial correlations (each adjusted for the other variables) and Kaiser's measure of sampling adequacy can help assess factorability.
Kaiser's measure is the ratio of the sum of squared correlations to the sum of squared correlations plus the sum of squared partial correlations.
It approaches 1 if the partials are small; values of about .6 or higher are typically desired.
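As a rough illustration, Kaiser's measure could be computed directly from its definition in R (the psych package's KMO() function provides the same thing along with per-item values); the function name and the `mydata` object below are just placeholders:

```r
# Minimal sketch of Kaiser's measure of sampling adequacy (KMO),
# computed from its definition; assumes R is a full-rank correlation matrix.
kmo <- function(R) {
  Rinv <- solve(R)                                  # inverse of the correlation matrix
  # partial correlations, each adjusted for all other variables
  P <- -Rinv / sqrt(outer(diag(Rinv), diag(Rinv)))
  r2 <- sum(R[upper.tri(R)]^2)                      # sum of squared correlations
  p2 <- sum(P[upper.tri(P)]^2)                      # sum of squared partial correlations
  r2 / (r2 + p2)                                    # approaches 1 when the partials are small
}

# usage (mydata is a placeholder): kmo(cor(mydata))   # ~.6 or higher typically desired
```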

Multicollinearity/Singularity
In traditional PCA multicollinearity is not a problem; no matrix inversion is necessary.
As such, PCA is one solution for dealing with collinearity in regression.

Investigate tolerances, det(R)

Assumptions/Issues
Sample Size and Missing Data
True missing data are handled in the usual ways. Factor analysis via maximum likelihood needs large samples, and that is one of its few drawbacks.

The more reliable the correlations are, the smaller the number of subjects needed Need enough subjects for stable estimates

How many? It depends on the nature of the data and the number of parameters to be estimated.
For example, a simple setting with few variables and clean data might not need as many cases. Conversely, several hundred data points for a more complex solution, with messy data and lower correlations among the variables, might still not provide a meaningful result (PCA) or even converge upon a solution (FA).

Other issues
There are no readily defined criteria by which to judge the outcome
Before, we had R² for example

The choice of rotation depends entirely on the researcher's estimation of interpretability. These analyses are often used when other outcomes/analyses are not so hot, just to have something to talk about.

Extraction Methods
PCA
Extracts maximum variance with each component. The first component is a linear combination of variables that maximizes component score variance for the cases. The second (etc.) extracts the maximum variance from the residual matrix left over after extracting the first component (and is therefore orthogonal to the first). If all components are retained, all the variance is explained.

PCA
Components are linear combinations of variables. As we will see later PCA is not much different than canonical correlation in terms of generating canonical variates from linear combinations of variables
These combinations are based on weights (eigenvectors) developed by the analysis

The loading for each item/variable is the correlation between it and the component (i.e., the underlying shared variance). However, unlike many of the analyses you have been exposed to, there is no statistical criterion to compare the linear combination to.
In MANOVA we create linear combinations that maximally differentiate groups; in canonical correlation one linear combination is used to maximally correlate with another. PCA is a form of unsupervised learning.

In PCA there are no "sides of the equation", and you're not necessarily correlating the factors, components, variates, etc.

PCA
With multivariate research we come to eigenvalues and eigenvectors.
Eigenvalues
Conceptually, an eigenvalue can be considered a measure of the strength (relative length) of an axis in N-dimensional space. It is derived via eigenanalysis of a square symmetric matrix,
the covariance or correlation matrix.

Eigenvector
Each eigenvalue has an associated eigenvector. While an eigenvalue is the length of an axis, the eigenvector determines its orientation in space. The values in an eigenvector are not unique because any coordinates that described the same orientation would be acceptable.

Data
Example data: women's heights (inches) and weights (pounds), with their z-scores.

height  weight  Zheight  Zweight
57      93      -1.77    -1.97
58      110     -1.47    -0.87
60      99      -0.86    -1.58
59      111     -1.17    -0.81
61      115     -0.56    -0.55
60      122     -0.86    -0.10
62      110     -0.26    -0.87
61      116     -0.56    -0.49
62      122     -0.26    -0.10
63      128      0.05     0.28
62      134     -0.26     0.67
64      117      0.35    -0.42
63      123      0.05    -0.04
65      129      0.65     0.35
64      135      0.35     0.73
66      128      0.96     0.28
67      135      1.26     0.73
66      148      0.96     1.57
68      142      1.56     1.18
69      155      1.87     2.02

Data transformation
Consider two variables, height and weight. X would be our data matrix and w our eigenvector (a vector of coefficients). Multiplying our original data by these weights results in a column vector of values; multiplying a matrix by a vector produces a linear combination. The variance of this linear combination is the eigenvalue.
z1 = Xw
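A minimal R sketch of this transformation, using the height/weight data shown earlier (the object names are assumptions):

```r
# Project the standardized height/weight data onto the first eigenvector
# of the correlation matrix; data values are those from the earlier table.
height <- c(57,58,60,59,61,60,62,61,62,63,62,64,63,65,64,66,67,66,68,69)
weight <- c(93,110,99,111,115,122,110,116,122,128,134,117,123,129,135,128,135,148,142,155)

Z  <- scale(cbind(height, weight))   # z-scores (Zheight, Zweight)
e  <- eigen(cor(Z))                  # eigenanalysis of the correlation matrix
w1 <- e$vectors[, 1]                 # first eigenvector, roughly (.707, .707) up to sign

z1 <- Z %*% w1                       # the linear combination z1 = Xw
var(z1)                              # its variance is the first eigenvalue
e$values[1]
```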

Data transformation
Consider a woman 5 feet tall (60 inches) who weighs 122 pounds. She is -.86 sd from the mean height and -.10 sd from the mean weight for these data.

a'b = (a1 a2)(b1, b2)' = a1b1 + a2b2


The first eigenvector associated with the normalized data is [.707, .707], so the resulting value for that data point is (.707)(-.86) + (.707)(-.10) = -.68. With the top graph we have taken the original data point and projected it onto a new axis, -.68 units from the origin. If we do this for all data points we will have projected them onto a new axis/component/dimension/factor/linear combination. The length of the new axis is the eigenvalue.

Data transformation
Suppose we have more than one dimension/factor. In our discussion of the techniques thus far, we have said that each component or dimension is independent of the previous one. What does independent mean?
r = 0

What does this mean geometrically in the multivariate sense? It means that the next axis specified is perpendicular to the previous one. Note how r is represented even here: the cosine of the 90° angle formed by the two axes is 0. Had the lines been on top of each other (i.e. perfectly correlated), the angle formed by them would be zero, whose cosine is 1.
r = 1

Data transformation
The other eigenvector associated with the data is (-.707, .707). Doing as we did before, we'd create that second axis, and could then plot the data points along these new axes. We now have two linear combinations, each of which is interpretable as the vector of projections of the original data points onto a directed line segment. Note how the basic shape of the original data has been perfectly maintained; the effect has been to rotate the configuration (45°) to a new orientation while preserving its essential size and shape.
It is an orthogonal transformation. Note that we have been talking of specifying/rotating axes, but rotating the points themselves would give us the same result.
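Continuing the sketch above (it reuses Z and e from that code), the two eigenvectors together form a 45° rotation matrix, and the resulting components are uncorrelated:

```r
# Z and e are assumed to come from the previous sketch.
W <- e$vectors                       # columns ~ (.707, .707) and (-.707, .707), up to sign
theta <- 45 * pi / 180
rot45 <- matrix(c(cos(theta), sin(theta),
                  -sin(theta), cos(theta)), 2, 2)   # a 45-degree rotation matrix
round(W, 3); round(rot45, 3)         # same columns, possibly differing in sign/order

scores <- Z %*% W                    # project the data onto both new axes
round(cor(scores), 10)               # off-diagonals ~ 0: the components are orthogonal (r = 0)
```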

Meaning of Principal Components


Component analyses are those that are based on the full correlation matrix
1.00s in the diagonal

Principal analyses are those for which each successive factor...


accounts for maximum available variance
is orthogonal to (uncorrelated with, independent of) all prior factors
the full solution (as many factors as variables) accounts for all the variance

Application of PC analysis
Components analysis is a kind of data reduction
start with an inter-related set of measured variables
identify a smaller set of composite variables that can be constructed from the measured variables and that carry as much of their information as possible

A Full components solution ...


has as many components as variables
accounts for 100% of the variables' variance
each variable has a final communality of 1.00

A Truncated components solution


has fewer components than variables
accounts for <100% of the variables' variance
each variable has a communality < 1.00

The steps of a PC analysis


Compute the correlation matrix
Extract a full components solution
Determine the number of components to keep, based on:
total variance accounted for
variable communalities
interpretability
replicability

Rotate the components and interpret (name) them
Compute component scores
Apply the components solution:
theoretically -- understand the meaning of the data reduction
statistically -- use the component scores in other analyses
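A hedged end-to-end sketch of these steps in R, using the psych package's principal() function; the data frame name `mydata` and the choice to keep two components are purely illustrative:

```r
# Sketch of the PC analysis steps; mydata is a placeholder data frame of numeric items.
library(psych)

R <- cor(mydata, use = "pairwise")                # 1. correlation matrix

full <- principal(R, nfactors = ncol(mydata),     # 2. full components solution
                  rotate = "none")
full$values                                       #    eigenvalues, to decide how many to keep

keep <- 2                                         # 3. number to retain (a judgment call)
pc <- principal(mydata, nfactors = keep,          # 4. truncated, rotated solution
                rotate = "varimax", scores = TRUE)
pc$loadings                                       #    loadings used to name the components
pc$communality                                    #    variance accounted for in each item

head(pc$scores)                                   # 5. component scores for later analyses
```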

PC Extraction
Extraction is the process of forming PCs as linear combinations of the measured variables as we have done with our other techniques
PC1 = b11X1 + b21X2 + … + bk1Xk
PC2 = b12X1 + b22X2 + … + bk2Xk
…
PCf = b1fX1 + b2fX2 + … + bkfXk

The goal is to reproduce as much of the information in the measured variables with as few PCs as possible. Here's the thing to remember: we usually perform factor analyses to find out how many groups of related variables there are; however, the mathematical goal of extraction is to reproduce the variables' variance efficiently.

3 variable example
Consider 3 variables with the correlations displayed. In a 3D sense we might envision their relationship like this, with the shadows showing roughly what the scatterplots would look like for each bivariate relationship.
[Figure: 3D scatterplot of X1, X2, and X3, with bivariate scatterplot "shadows" projected onto each plane.]

The first component identified

The variance of this component, its eigenvalue, is 2.063. In other words it accounts for about twice as much variance as any single variable. Note: with 3 variables, 2.063/3 = .688, i.e. 68.8% of the variance is accounted for by this first component.

PCA
In principal components, we extract as many components as there are variables. As mentioned previously, each component by default is uncorrelated with the previous ones. If we saved the component scores and looked at their graph, it would resemble something like this.

How do we interpret the components?


The component loadings can inform us as to their interpretation: they are the original variables' correlations with the component. In this case, all variables load nicely on the first component, which, since the others do not account for nearly as much variance, is probably the only one to interpret. Depending on the type of PCA, the rotation, etc., you may see different loadings, although the general pattern will often remain. With PCA it is as much the overall pattern that should be considered as the sign or absolute values of particular loadings.
Which variables load on to which components in general?

Here is an example of magazine readership from the chapter handout. Underlined loadings are > .30. How might this be interpreted?

Applied example
Six items
Three sadness items, three relationship quality items; N = 300

PCA

Start with the Correlation Matrix

Communalities are Estimated


A communality is a measure of how much variance of the original variables is accounted for by the observed components/factors. Uniqueness is 1 - communality. With PCA retaining all components (as opposed to a truncated solution), communality will always equal 1. Why 1.0? Because
PCA analyzes all the variance for each variable.

FA, as we'll see, takes a different approach: it analyzes only shared variance, and the initial value is the multiple R² for the association between an item and all the other items in the model.

What are we looking for?


Any factor whose eigenvalue is less than 1.0 is in most cases not going to be retained for interpretation, unless it is very close to 1 or has a readily understood and interesting meaning.
Loadings that are:
more than .5 are typically considered strong
between .3 and .5 are acceptable
less than .3 are typically considered weak
All the information about the correlation matrix is maintained; in PCA the correlations can be reproduced exactly (via the sums of cross loadings).

Matrix reproduction

Assessing the variance accounted for


Proportion of variance = eigenvalue / number of items or variables

Eigenvalue is an index of the strength of the component, the amount of variance it accounts for. It is also the sum of the squared loadings for that component

Loadings

Eigenvalue of factor 1 = .609² + .614² + .593² + .728² + .767² + .764² = 2.80

Reproducing the correlation matrix (R)


Sum the products of the loadings for the two variables across all factors.
For RQ1 and RQ2: (.61 * .61) + (.61 * .57) + (-.12 * -.41) + (-.45 * .33) + (.06 * .05) + (.20 * -.16) = .59
If we kept just the first two factors, the reproduced correlation = .72

Note that an index of the quality of a factor analysis (as opposed to PCA) is the extent to which the factor loadings can reproduce the correlation matrix. With PCA, the correlation matrix is reproduced exactly if all components are retained; when we don't retain them all, we can use a similar approach to assess fit.

Original correlation

Variance Accounted For


For Items
The sum of the squared loadings (i.e., weights) across the components is the amount of variance accounted for in each item.
Item 1 (RQ1), across all six components: .61² + .61² + (-.12)² + (-.45)² + .06² + .20² = .37 + .37 + .015 + .20 + .004 + .04 ≈ 1.0
For the first two factors only: .61² + .61² = .74

For components

How much variance is accounted for by the components that will be retained?
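A small R sketch of these computations, using the two rows of (rounded) loadings reported above for RQ1 and RQ2:

```r
# Squared-loading bookkeeping for the two items shown above (rounded slide values).
rq1 <- c(.61, .61, -.12, -.45, .06, .20)   # loadings of RQ1 on the six components
rq2 <- c(.61, .57, -.41,  .33, .05, -.16)  # loadings of RQ2 on the six components

sum(rq1 * rq2)            # ~ .59: reproduces the RQ1-RQ2 correlation (all six components)
sum(rq1[1:2] * rq2[1:2])  # ~ .72: reproduced correlation using only the first two components

sum(rq1^2)                # ~ 1.0: communality of RQ1 with all components retained
sum(rq1[1:2]^2)           # ~ .74: variance in RQ1 accounted for by the first two components

# More generally, for a full loading matrix L: the eigenvalues are colSums(L^2),
# the item communalities are rowSums(L^2), and the reproduced R matrix is L %*% t(L).
```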

When is it appropriate to use PCA?


PCA is largely a descriptive procedure. In our examples, we are looking at variables with decent correlations. However, if the variables are largely uncorrelated, PCA won't do much for you:
It may just provide components that reflect each individual variable, i.e. nothing is gained.

One may use Bartlett's sphericity test to determine whether such an approach is appropriate. It tests the null hypothesis that the R matrix is an identity matrix (ones on the diagonal, zeros off the diagonal). When the determinant of R is small (recall from before that this implies strong correlation), the chi-square statistic will be large: reject H0, and PCA would be appropriate for data reduction. One should note, though, that it is a powerful test and will usually result in rejection at typical sample sizes; one may instead refer to an estimate of practical effect rather than a statistical test.
Are the correlations worthwhile?

χ² = -[(n - 1) - (2p + 5)/6] ln|R|,   df = (p² - p)/2

where p = number of variables, n = number of observations, and ln|R| = the natural log of the determinant of R
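A minimal sketch of this test in R, computed directly from the formula above (psych::cortest.bartlett() implements the same test); the function name and `mydata` are placeholders:

```r
# Bartlett's test of sphericity: H0 is that R is an identity matrix.
bartlett_sphericity <- function(R, n) {
  p     <- ncol(R)
  chisq <- -((n - 1) - (2 * p + 5) / 6) * log(det(R))
  df    <- (p^2 - p) / 2
  c(chisq = chisq, df = df,
    p.value = pchisq(chisq, df, lower.tail = FALSE))
}

# usage: bartlett_sphericity(cor(mydata), n = nrow(mydata))
```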

How should the data be scaled?


In most of our examples we have been using the R (correlation) matrix instead of the variance-covariance matrix. As PCA seeks to maximize variance, it can be sensitive to scale differences across variables: variables with a larger range of scores would have more of an impact on the linear combinations created. As such, the R matrix will typically be used, except perhaps in cases where the items are on the same scale (e.g. Likert). The values involved will change (e.g. the eigenvalues), though the general interpretation may not.

How many components should be retained?


There are many ways to determine this.
"Solving the number of factors problem is easy, I do it everyday before breakfast. But knowing the right solution is harder" (Kaiser)

Kaiser's Rule
What we've already suggested, i.e. retain components with eigenvalues over 1. The idea is that any component retained should account for at least as much variance as a single variable.

Chi-square
The null hypothesis is that X number of components is sufficient; we want a nonsignificant result.

Horn's Procedure (parallel analysis)
This is a different approach, which suggests creating a set of random data of the same size (N cases, p variables). The idea is that, in maximizing the variance accounted for, PCA has a good chance of capitalizing on chance; even with random data, the first eigenvalue will be > 1. As such, retain only components whose eigenvalues exceed those obtained from the random data (a code sketch follows this list).

Variance accounted for
Another perspective is to retain as many components as will account for X% of the variance in the original variables.

Scree Plot
A practical approach: look for the "elbow", the point after which the remaining eigenvalues decrease in a roughly linear fashion, and retain only those above the elbow. Not really a good primary approach, though it may be consistent with others.
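A hedged sketch of Horn's procedure in base R (psych::fa.parallel offers a fuller version); the function name and `mydata` are placeholders:

```r
# Parallel analysis sketch: compare observed eigenvalues with those from
# random data of the same size (n cases, p variables).
parallel_pca <- function(mydata, nreps = 100) {
  n <- nrow(mydata); p <- ncol(mydata)
  obs  <- eigen(cor(mydata))$values
  rand <- replicate(nreps, eigen(cor(matrix(rnorm(n * p), n, p)))$values)
  crit <- apply(rand, 1, quantile, probs = .95)   # 95th percentile of random eigenvalues
  data.frame(component = 1:p, observed = obs,
             random95 = crit, retain = obs > crit)
}

# usage: parallel_pca(mydata)   # retain components whose observed eigenvalue
#                               # exceeds the corresponding random one
```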

Rotation
Sometimes our loadings will be a little difficult to interpret initially. In such a case we can rotate the solution so that the loadings perhaps make more sense.
This is typically done in factor analysis but is possible here too.

An orthogonal rotation is just a shift to a new set of coordinate axes in the same space spanned by the principal components

Rotation
You can think of it as shifting the axes, or as rotating the "egg" in our previous graphic. The gist is that the relations among the items are maintained, while maximizing their more natural loadings and minimizing off-loadings. Note that since PCA initially creates independent components, orthogonal rotations that maintain this independence are typically used. Varimax is the most common rotation utilized:
Loadings will be either large or small, with little in between
It maximizes the variance of the squared loadings within each component
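A brief sketch of a varimax rotation in R: it can be requested directly from psych::principal(), or applied to the unrotated loadings with base R's varimax(); `mydata` and the two-component choice are assumptions:

```r
# Orthogonal (varimax) rotation of a truncated PCA solution; mydata is a placeholder.
library(psych)

pc_unrot <- principal(mydata, nfactors = 2, rotate = "none")
pc_rot   <- principal(mydata, nfactors = 2, rotate = "varimax")

pc_unrot$loadings     # initial loadings
pc_rot$loadings       # rotated: loadings tend toward large or small, little in between

# Equivalently, rotate the unrotated loading matrix directly:
varimax(unclass(pc_unrot$loadings))
```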

Other issues: How do we assess validity?


Usual suspects, plus:

Cross-validation
Holdout sample, as we have discussed before; about a 2/3, 1/3 split. Using the eigenvectors from the original components, we can create new components with the new data and see how much variance each accounts for. Hope it is similar to the original solution (a sketch follows this list).

Jackknife
With smaller samples, conduct the PCA multiple times, each time with a specific case held out. Using the eigenvectors, calculate the component scores for the case held out, and compare the eigenvalues for the components involved.

Bootstrap
In the absence of a holdout sample, we can create bootstrapped samples to perform the same function.
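A rough base-R sketch of the holdout idea mentioned above (the 2/3 split and the `mydata` object are assumptions):

```r
# Cross-validating a PCA solution with a holdout sample; mydata is a placeholder.
set.seed(123)
idx   <- sample(nrow(mydata), size = round(2/3 * nrow(mydata)))
train <- scale(mydata[idx, ])
test  <- scale(mydata[-idx, ],                         # scale holdout cases using the
               center = attr(train, "scaled:center"),  # training means and SDs
               scale  = attr(train, "scaled:scale"))

e      <- eigen(cor(train))     # eigenvectors from the training sample
scores <- test %*% e$vectors    # project the holdout cases onto them

apply(scores, 2, var)           # variance each component accounts for in the holdout data
e$values                        # compare with the training eigenvalues
```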

Other issues: Factoring items vs. factoring scales


Items are often factored as part of the process of scale development, to check whether the items go together like the scale's author thinks. Scales (composites of items) are factored to
examine construct validity of new scales test theory about what constructs are interrelated

Remember, the reason we have scales is that individual items are typically unreliable and have limited validity

Other issues: Factoring items vs. factoring scales


The limited reliability and validity of items means that they are measured with less precision, and so their intercorrelations for any one sample will be fraught with error. Since factoring starts with R, factoring of items is likely to yield spurious solutions -- replication of item-level factoring is very important! Is the issue really items vs. scales?
No -- it is really the reliability and validity of the things being factored, and scales have these properties more than items do.

Other issues: When is it appropriate to use PCA?


Another reason to use PCA, which obviously isn't a great one, is that the maximum likelihood estimation involved in an exploratory factor analysis does not always converge. PCA will always give a result (it does not require matrix inversion) and so can be used in such a situation. We'll talk more on this later, but in data reduction situations EFA is typically to be preferred for social scientists and others who use imprecise measures.

Other issues: Selecting Variables for Analysis


Sometimes a researcher has access to a data set that someone else has collected -- an opportunistic data set. While this can be a real money/time saver, be sure to recognize the possible limitations. Be sure the sample represents a population you want to talk about. Carefully consider the variables that aren't included and the possible effects their absence has on the resulting factors. You should plan to replicate any results obtained from opportunistic data --

this is especially true if the data set was chosen to be "efficient", with variables chosen to cover several domains.

Other issues:
Selecting the Sample for Analysis
How many? Keep in mind that the factor solution depends only on R, not directly on how many cases were used to compute it -- so the point is the representativeness and stability of the correlations. Advice about the subject/variable ratio varies pretty dramatically:
5-10 cases per variable
300 cases minimum (maybe + the # of items)

Consider that like for other statistics, your standard error for correlation decreases with increasing sample size

A note about SPSS


SPSS does provide a means for principal components analysis. However, its presentation (much like that of many textbooks, for that matter) blurs the distinction between PCA and FA, such that they are easily confused. Although they are both data dimension reduction techniques, they go about the process differently, have different implications regarding the results, and can even come to different conclusions.

A note about SPSS


In SPSS, the menu is Factor Analysis (even though principal components is the default extraction setting). Unlike in other programs, PCA isn't even a separate procedure (it's all in the FACTOR syntax). In order to perform PCA, make sure you have principal components selected as your extraction method, analyze the correlation matrix, and specify that the number of factors to be extracted equals the number of variables. Even then, your loadings will be different from those of other programs, which scale the weights such that the sum of their squared values = 1. In general, be cautious when using SPSS.
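To see the scaling issue concretely: component loadings are commonly reported as the eigenvectors scaled by the square root of their eigenvalues, whereas the raw eigenvector weights have squared values summing to 1. A small base-R check (`mydata` is a placeholder):

```r
# Raw eigenvector weights vs. scaled loadings; mydata is a placeholder.
e <- eigen(cor(mydata))

weights  <- e$vectors                             # each column's squared values sum to 1
loadings <- e$vectors %*% diag(sqrt(e$values))    # variable-component correlations

colSums(weights^2)     # all 1
colSums(loadings^2)    # the eigenvalues
```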

PCA in R

Package       Function
base          princomp
psych         principal, VSS
pcaMethods    pca (Q2 for cross-validation)
FactoMineR    PCA (R-commander plugin available)

pcaMethods: as the name implies, this package is all about PCA, and from a modern approach. It will automatically estimate missing values (via traditional, robust, or Bayesian methods) and is useful just for that in any analysis.
