Lesson 7: Principal Components Analysis (PCA)
Introduction
Sometimes data are collected on a large number of variables from a single population. As an
example, consider the Places Rated dataset below.
Example: Places Rated
In the Places Rated Almanac, Boyer and Savageau rated 329 communities according to the following
nine criteria:
1. Climate and Terrain
2. Housing
3. Health Care & the Environment
4. Crime
5. Transportation
6. Education
7. The Arts
8. Recreation
9. Economics
Note that within the dataset, except for housing and crime, the higher the score the better. For housing
and crime, the lower the score the better. Where some communities might do better in the arts, other
communities might be rated better in other areas such as having a lower crime rate and good educational
opportunities.
Objective
With a large number of variables, the dispersion matrix may be too large to study and interpret properly.
There would be too many pairwise correlations between the variables to consider. Graphical display of
data may also not be of particular help in case the data set is very large. With 12 variables, for example,
there will be more than 200 three-dimensional scatterplots to be studied!
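The counts behind this claim can be checked with a short Python sketch. The variable names are illustrative only; the sketch simply counts the distinct pairs and triples of 12 variables.

```python
from math import comb

p = 12  # number of variables, as in the example above

pairwise_correlations = comb(p, 2)   # distinct pairs of variables
scatterplots_3d = comb(p, 3)         # distinct triples of variables

print(pairwise_correlations, scatterplots_3d)
```

Both counts grow quickly with p, which is the practical motivation for reducing the data to a few components.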
To interpret the data in a more meaningful form, it is therefore necessary to reduce the number of
variables to a few, interpretable linear combinations of the data. Each linear combination will correspond
to a principal component.
(There is another very useful data reduction technique called Factor Analysis, which will be taken up in a
subsequent lesson.)
Learning objectives & outcomes
Upon completion of this lesson, you should be able to do the following:
Carry out a principal components analysis using SAS and Minitab
Assess how many principal components should be considered in an analysis
https://onlinecourses.science.psu.edu/stat505/book/export/html/49
24/6/2015
Interpret principal component scores. Be able to describe a subject with a high or low score
Determine when a principal component analysis may be based on the variance-covariance matrix,
and when the correlation matrix should be used
Understand how principal component scores may be used in further analyses.
7.1 Principal Component Analysis (PCA) Procedure
Suppose that we have a random vector X.
\(\textbf{X}=\left(\begin{array}{c}X_1\\X_2\\\vdots\\X_p\end{array}\right)\)
with population variance-covariance matrix
\(\text{var}(\textbf{X})=\Sigma=\left(\begin{array}{cccc}\sigma^2_1&\sigma_{12}&\dots
&\sigma_{1p}\\\sigma_{21}&\sigma^2_2&\dots&\sigma_{2p}\\\vdots&\vdots&\ddots&\vdots
\\\sigma_{p1}&\sigma_{p2}&\dots&\sigma^2_p\end{array}\right)\)
Consider the linear combinations
\(\begin{array}{lll}Y_1&=&e_{11}X_1+e_{12}X_2+\dots+e_{1p}X_p\\Y_2&=&
e_{21}X_1+e_{22}X_2+\dots+e_{2p}X_p\\&&\vdots\\Y_p&=&e_{p1}X_1+e_{p2}X_2+
\dots+e_{pp}X_p\end{array}\)
Each of these can be thought of as a linear regression, predicting \(Y_i\) from \(X_1,X_2,\dots,X_p\). There is no
intercept, but \(e_{i1},e_{i2},\dots,e_{ip}\) can be viewed as regression coefficients.
Note that \(Y_i\) is a function of our random data, and so is also random. Therefore it has a population
variance
\[\text{var}(Y_i)=\sum_{k=1}^{p}\sum_{l=1}^{p}e_{ik}e_{il}\sigma_{kl}=
\mathbf{e}'_i\Sigma\mathbf{e}_i\]
Moreover, \(Y_i\) and \(Y_j\) will have a population covariance
\[\text{cov}(Y_i,Y_j)=\sum_{k=1}^{p}\sum_{l=1}^{p}e_{ik}e_{jl}\sigma_{kl}=
\mathbf{e}'_i\Sigma\mathbf{e}_j\]
Here the coefficients \(e_{ij}\) are collected into the vector
\(\mathbf{e}_i=\left(\begin{array}{c}e_{i1}\\e_{i2}\\\vdots\\e_{ip}\end{array}\right)\)
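As a rough numerical check of the variance and covariance formulas above, the following Python sketch compares the double-sum form, the matrix form, and an estimate from simulated data. The matrix `Sigma` and the coefficient vectors are hypothetical values chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3-variable covariance matrix (illustrative values only).
Sigma = np.array([[4.0, 1.2, 0.5],
                  [1.2, 2.0, 0.3],
                  [0.5, 0.3, 1.0]])
e1 = np.array([0.6, 0.8, 0.0])   # coefficients of Y_1
e2 = np.array([0.0, 0.6, 0.8])   # coefficients of Y_2

# Double-sum form and matrix form of var(Y_1), and matrix form of cov(Y_1, Y_2).
var_sum = sum(e1[k] * e1[l] * Sigma[k, l] for k in range(3) for l in range(3))
var_mat = e1 @ Sigma @ e1
cov_mat = e1 @ Sigma @ e2

# Empirical check on simulated data whose covariance matrix is Sigma.
X = rng.multivariate_normal(np.zeros(3), Sigma, size=100_000)
Y1, Y2 = X @ e1, X @ e2
print(var_mat, Y1.var(ddof=1))
print(cov_mat, np.cov(Y1, Y2)[0, 1])
```

The simulated values agree with the formulas only up to sampling error, so the two numbers on each printed line should be close but not identical.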
First Principal Component (PCA1): \(Y_1\)
The first principal component is the linear combination of x-variables that has maximum variance (among all
linear combinations), so it accounts for as much variation in the data as possible.
Specifically, we will define coefficients \(e_{11},e_{12},\dots,e_{1p}\) for that component in such a way that its
variance is maximized, subject to the constraint that the sum of the squared coefficients is equal to one.
This constraint is required so that a unique answer may be obtained.
More formally, select \(e_{11},e_{12},\dots,e_{1p}\) that maximizes
\[\text{var}(Y_1)=\sum_{k=1}^{p}\sum_{l=1}^{p}e_{1k}e_{1l}\sigma_{kl}=
\mathbf{e}'_1\Sigma\mathbf{e}_1\]
subject to the constraint that
\[\mathbf{e}'_1\mathbf{e}_1=\sum_{j=1}^{p}e^2_{1j}=1\]
Second Principal Component (PCA2): \(Y_2\)
The second principal component is the linear combination of x-variables that accounts for as much of the
remaining variation as possible, with the constraint that the correlation between the first and second
component is 0.
Select \(e_{21},e_{22},\dots,e_{2p}\) that maximizes the variance of this new component...
\[\text{var}(Y_2)=\sum_{k=1}^{p}\sum_{l=1}^{p}e_{2k}e_{2l}\sigma_{kl}=
\mathbf{e}'_2\Sigma\mathbf{e}_2\]
subject to the constraint that the sum of squared coefficients adds up to one,
\[\mathbf{e}'_2\mathbf{e}_2=\sum_{j=1}^{p}e^2_{2j}=1\]
along with the additional constraint that these two components will be uncorrelated with one another.
\[\text{cov}(Y_1,Y_2)=\sum_{k=1}^{p}\sum_{l=1}^{p}e_{1k}e_{2l}\sigma_{kl}=
\mathbf{e}'_1\Sigma\mathbf{e}_2=0\]
All subsequent principal components have this same property: they are linear combinations that account for as
much of the remaining variation as possible, and they are not correlated with the other principal components.
We will do this in the same way with each additional component. For instance:
ith Principal Component (PCAi): \(Y_i\)
We select \(e_{i1},e_{i2},\dots,e_{ip}\) that maximizes
\[\text{var}(Y_i)=\sum_{k=1}^{p}\sum_{l=1}^{p}e_{ik}e_{il}\sigma_{kl}=
\mathbf{e}'_i\Sigma\mathbf{e}_i\]
subject to the constraint that the sum of squared coefficients adds up to one, along with the additional
constraint that this new component will be uncorrelated with all the previously defined components.
\(\mathbf{e}'_i\mathbf{e}_i=\sum_{j=1}^{p}e^2_{ij}=1\)
\(\text{cov}(Y_1,Y_i)=\sum_{k=1}^{p}\sum_{l=1}^{p}e_{1k}e_{il}\sigma_{kl}=
\mathbf{e}'_1\Sigma\mathbf{e}_i=0\),
\(\text{cov}(Y_2,Y_i)=\sum_{k=1}^{p}\sum_{l=1}^{p}e_{2k}e_{il}\sigma_{kl}=
\mathbf{e}'_2\Sigma\mathbf{e}_i=0\),
\(\vdots\)
\(\text{cov}(Y_{i-1},Y_i)=\sum_{k=1}^{p}\sum_{l=1}^{p}e_{i-1,k}e_{il}\sigma_{kl}=
\mathbf{e}'_{i-1}\Sigma\mathbf{e}_i=0\)
Therefore all principal components are uncorrelated with one another.
7.2 How do we find the coefficients?
How do we find the coefficients \(e_{ij}\) for a principal component?
The solution involves the eigenvalues and eigenvectors of the variance-covariance matrix.
Solution:
We are going to let \(\lambda_1\) through \(\lambda_p\) denote the eigenvalues of the variance-covariance matrix. These are
ordered so that \(\lambda_1\) is the largest eigenvalue and \(\lambda_p\) is the smallest.
\(\lambda_1\ge\lambda_2\ge\dots\ge\lambda_p\)
We are also going to let the vectors \(\mathbf{e}_1\) through \(\mathbf{e}_p\)
\(\mathbf{e}_1,\mathbf{e}_2,\dots,\mathbf{e}_p\)
denote the corresponding eigenvectors. It turns out that the elements of these eigenvectors will be the
coefficients of our principal components.
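One way to see why eigenvectors appear is the standard Lagrange multiplier argument, sketched here for the first component. The multiplier is written \(\lambda\), which is an added symbol at this point rather than something defined in the text above. Maximizing the variance subject to the unit-length constraint means finding a stationary point of

\[L(\mathbf{e}_1,\lambda)=\mathbf{e}'_1\Sigma\mathbf{e}_1-\lambda(\mathbf{e}'_1\mathbf{e}_1-1)\]

Setting the derivative with respect to \(\mathbf{e}_1\) to zero gives

\[2\Sigma\mathbf{e}_1-2\lambda\mathbf{e}_1=\mathbf{0}\quad\Longrightarrow\quad\Sigma\mathbf{e}_1=\lambda\mathbf{e}_1\]

so any candidate \(\mathbf{e}_1\) must be an eigenvector of \(\Sigma\), and for such a vector

\[\text{var}(Y_1)=\mathbf{e}'_1\Sigma\mathbf{e}_1=\lambda\mathbf{e}'_1\mathbf{e}_1=\lambda\]

The variance is therefore largest when \(\lambda\) is the largest eigenvalue, \(\lambda_1\).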
The variance for the ith principal component is equal to the ith eigenvalue.
\(\text{var}(Y_i)=\text{var}(e_{i1}X_1+e_{i2}X_2+\dots+e_{ip}X_p)=\lambda_i\)
Moreover, the principal components are uncorrelated with one another.
\(\text{cov}(Y_i,Y_j)=0\)
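These two properties can be illustrated numerically. The sketch below is a minimal example using NumPy's `numpy.linalg.eigh` on a hypothetical covariance matrix `Sigma` (the values are made up for illustration). It also tries many random unit-length coefficient vectors to show that none gives a larger variance than the first eigenvector.

```python
import numpy as np

# Hypothetical covariance matrix (illustrative values only).
Sigma = np.array([[4.0, 1.2, 0.5],
                  [1.2, 2.0, 0.3],
                  [0.5, 0.3, 1.0]])

# eigh is meant for symmetric matrices; it returns eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(Sigma)
order = np.argsort(eigvals)[::-1]       # largest eigenvalue first
lam = eigvals[order]
E = eigvecs[:, order]                   # column i holds the eigenvector e_{i+1}

# Covariance matrix of the components: diagonal, with the eigenvalues on the diagonal.
cov_Y = E.T @ Sigma @ E

# Variances e' Sigma e for 1000 random unit-length coefficient vectors.
rng = np.random.default_rng(1)
trial = rng.normal(size=(1000, 3))
trial /= np.linalg.norm(trial, axis=1, keepdims=True)
trial_var = np.einsum("ij,jk,ik->i", trial, Sigma, trial)

print(np.round(lam, 4))
print(np.round(cov_Y, 4))
print(trial_var.max() <= lam[0])
```

The sign of each eigenvector is arbitrary, so software packages may report a component with all of its coefficients negated.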
The variance-covariance matrix may be written as a function of the eigenvalues and their corresponding
eigenvectors. This is determined by using the Spectral Decomposition Theorem. This will become useful
later when we investigate topics under factor analysis.
Spectral Decomposition Theorem
The variance-covariance matrix can be written as the sum over the p eigenvalues, each multiplied by the
product of the corresponding eigenvector times its transpose, as shown in the first expression below:
\[\begin{array}{lll}\Sigma&=&\sum_{i=1}^{p}\lambda_i\mathbf{e}_i\mathbf{e}_i'\\&\cong&
\sum_{i=1}^{k}\lambda_i\mathbf{e}_i\mathbf{e}_i'\end{array}\]
The second expression is a useful approximation if \(\lambda_{k+1},\lambda_{k+2},\dots,
\lambda_{p}\) are small. We might approximate \(\Sigma\) by
\[\sum_{i=1}^{k}\lambda_i\mathbf{e}_i\mathbf{e}_i'\]
Again, this will become more useful when we talk about factor analysis.
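The decomposition and its truncated approximation can be sketched in Python as follows. The matrix `Sigma` and the helper `spectral_sum` are hypothetical names introduced only for this illustration.

```python
import numpy as np

# Hypothetical covariance matrix (illustrative values only).
Sigma = np.array([[4.0, 1.2, 0.5],
                  [1.2, 2.0, 0.3],
                  [0.5, 0.3, 1.0]])

lam, E = np.linalg.eigh(Sigma)
lam, E = lam[::-1], E[:, ::-1]          # reorder so the largest eigenvalue is first


def spectral_sum(lam, E, k):
    """Sum of lambda_i * e_i e_i' over the first k eigenpairs."""
    return sum(lam[i] * np.outer(E[:, i], E[:, i]) for i in range(k))


full = spectral_sum(lam, E, k=3)        # all p = 3 terms: recovers Sigma
approx = spectral_sum(lam, E, k=2)      # drops the term for the smallest eigenvalue
max_err = np.abs(Sigma - approx).max()

print(np.allclose(full, Sigma))
print(max_err, lam[2])
```

The largest entry-wise error of the truncated sum cannot exceed the dropped eigenvalue here, which is why the approximation is good when the trailing eigenvalues are small.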
Earlier in the course we defined the total variation of X as the trace of the variance-covariance matrix, or
if you like, the sum of the variances of the individual variables. This is also equal to the sum of the
eigenvalues, as shown below:
\(\begin{array}{lll}trace(\Sigma)&=&\sigma^2_1+\sigma^2_2+\dots+\sigma^2_p\\&=&
\lambda_1+\lambda_2+\dots+\lambda_p\end{array}\)
This will give us an interpretation of the components in terms of the amount of the full variation
explained by each component. The proportion of variation explained by the ith principal component is
then defined to be the eigenvalue for that component divided by the sum of the eigenvalues.
In other words, the ith principal component explains the following proportion of the total variation:
\[\frac{\lambda_i}{\lambda_1+\lambda_2+\dots+\lambda_p}\]
A related quantity is the proportion of variation explained by the first k principal components. This is
the sum of the first k eigenvalues divided by the total variation.
\[\frac{\lambda_1+\lambda_2+\dots+\lambda_k}{\lambda_1+\lambda_2+\dots+\lambda_p}\]
Naturally, if the proportion of variation explained by the first k principal components is large, then not
much information is lost by considering only the first k principal components.
Why It May Be Possible to Reduce Dimensions
When we have correlations (multicollinearity) between the x-variables, the data may more or less fall on a line or
plane in a lower number of dimensions. For instance, imagine a plot of two x-variables that have a nearly perfect
correlation. The data points will fall close to a straight line. That line could be used as a new (one-dimensional)
axis to represent the variation among data points. As another example, suppose that we have verbal, math, and
total SAT scores for a sample of students. We have three variables, but really (at most) two dimensions to the
data because total = verbal + math, meaning the third variable is completely determined by the first two. The
reason for saying at most two dimensions is that if there is a strong correlation between verbal and math, then it
may be possible that there is only one true dimension to the data.
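The SAT example can be imitated with simulated data. In this sketch the scores are generated from made-up means and spreads (they are not real SAT data), with verbal and math deliberately correlated; the names `verbal_score`, `math_score`, and `total_score` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Simulated scores with hypothetical means and spreads (not real SAT data).
verbal_score = rng.normal(500, 100, size=n)
math_score = 500 + 0.8 * (verbal_score - 500) + rng.normal(0, 60, size=n)
total_score = verbal_score + math_score      # completely determined by the other two

S = np.cov(np.column_stack([verbal_score, math_score, total_score]), rowvar=False)
lam = np.linalg.eigvalsh(S)[::-1]            # eigenvalues, largest first

print(lam / lam.sum())
```

The third proportion is zero up to rounding error because total is an exact linear function of the other two variables, and the first proportion dominates because verbal and math were simulated to be strongly correlated.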
Note
All of this is defined in terms of the population variance-covariance matrix \(\Sigma\), which is unknown.
However, we may estimate \(\Sigma\) by the sample variance-covariance matrix, which is given in the standard
formula here:
\[\textbf{S}=\frac{1}{n-1}\sum_{i=1}^{n}(\mathbf{X}_i-\bar{\textbf{x}})(\mathbf{X}_i-
\bar{\textbf{x}})'\]
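As a small check of this formula, the sketch below builds S directly from centered data and compares it with NumPy's `numpy.cov`. The data matrix `X` is randomly generated and purely hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))            # hypothetical data: n = 50 units, p = 4 variables

n = X.shape[0]
xbar = X.mean(axis=0)                   # vector of sample means
centered = X - xbar
S = centered.T @ centered / (n - 1)     # sum of outer products, divided by n - 1

print(np.allclose(S, np.cov(X, rowvar=False)))
```

Note that `numpy.cov` treats rows as variables by default, so `rowvar=False` is needed when each row is an observation.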
Procedure
Compute the eigenvalues \(\hat{\lambda}_1,\hat{\lambda}_2,\dots,\hat{\lambda}_p\) of the sample
variance-covariance matrix S, and the corresponding eigenvectors \(\hat{\mathbf{e}}_1,
\hat{\mathbf{e}}_2,\dots,\hat{\mathbf{e}}_p\).
Then we will define our estimated principal components using the eigenvectors as our coefficients:
\(\begin{array}{lll}\hat{Y}_1&=&\hat{e}_{11}X_1+\hat{e}_{12}X_2+\dots+\hat{e}_{1p}X_p
\\\hat{Y}_2&=&\hat{e}_{21}X_1+\hat{e}_{22}X_2+\dots+\hat{e}_{2p}X_p\\&&\vdots\\
\hat{Y}_p&=&\hat{e}_{p1}X_1+\hat{e}_{p2}X_2+\dots+\hat{e}_{pp}X_p\\\end{array}\)
Generally, we only retain the first k principal components. Here we must balance two conflicting desires:
1. To obtain the simplest possible interpretation, we want k to be as small as possible. If we can explain
most of the variation with just two principal components, then this would give us a much simpler
description of the data. However, the smaller k is, the smaller the amount of variation explained by the
first k components.
2. To avoid loss of information, we want the proportion of variation explained by the first k principal
components to be large, ideally as close to one as possible; i.e., we want
\[\frac{\hat{\lambda}_1+\hat{\lambda}_2+\dots+\hat{\lambda}_k}{\hat{\lambda}_1+
\hat{\lambda}_2+\dots+\hat{\lambda}_p}\cong1\]
7.3 Example: Places Rated
We will use the Places Rated Almanac data (Boyer and Savageau), which rates 329 communities
according to nine criteria:
1. Climate and Terrain
2. Housing
3. Health Care & Environment
4. Crime
5. Transportation
6. Education
7. The Arts
8. Recreation
9. Economics
Notes:
The data for many of the variables are strongly skewed to the right.
The log transformation was used to normalize the data.
Using SAS
The SAS program places.sas will implement the principal component procedures.
When you examine the output, the first thing that SAS does is give us summary information.
There are 329 observations representing the 329 communities in our dataset and 9 variables. This is
followed by simple statistics that report the means and standard deviations for each variable.
Below this is the variance-covariance matrix for the data. You should be able to see that the variance
reported for climate is 0.01289.
What we really need to draw our attention to here are the eigenvalues of the variance-covariance
matrix. In the SAS output the eigenvalues are listed in ranked order from largest to smallest. These values have
been copied into Table 1 below for discussion.
Data Analysis:
Step 1: We examine the eigenvalues to determine how many principal components should be considered:
Table 1. Eigenvalues, and the proportion of variation explained by the principal components.
Component   Eigenvalue   Proportion   Cumulative
1           0.3775       0.7227       0.7227
2           0.0511       0.0977       0.8204
3           0.0279       0.0535       0.8739
4           0.0230       0.0440       0.9178
5           0.0168       0.0321       0.9500
6           0.0120       0.0229       0.9728
7           0.0085       0.0162       0.9890
8           0.0039       0.0075       0.9966
9           0.0018       0.0034       1.0000
Total       0.5225
If you take all of these eigenvalues and add them up, you get the total variance of 0.5225.
The proportion of variation explained by each eigenvalue is given in the third column. For example,
0.3775 divided by 0.5225 is approximately 0.72 (0.7227 in the table), so about 72% of the variation is
explained by this first eigenvalue. The cumulative percentage explained is obtained by adding the successive
proportions of variation explained to obtain the running total. For instance, 0.7227 plus 0.0977 equals 0.8204,
and so forth. Therefore, about 82% of the variation is explained by the first two eigenvalues together.
Next we need to look at successive differences between the eigenvalues. Subtracting the second
eigenvalue, 0.051, from the first eigenvalue, 0.377, we get a difference of 0.326. The difference between
the second and third eigenvalues is 0.0232; the next difference is 0.0049. Subsequent differences are
even smaller. A sharp drop from one eigenvalue to the next may serve as another indicator of how many
eigenvalues to consider.
The first three principal components explain 87% of the variation. This is an acceptably large percentage.
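The arithmetic in this step can be reproduced from the eigenvalues in Table 1. The following is a minimal sketch; because the table entries are rounded to four decimals, the computed proportions may differ from the SAS output in the last decimal place.

```python
import numpy as np

# Eigenvalues copied from Table 1 (rounded to four decimals).
eigenvalues = np.array([0.3775, 0.0511, 0.0279, 0.0230, 0.0168,
                        0.0120, 0.0085, 0.0039, 0.0018])

total = eigenvalues.sum()                 # total variation
proportion = eigenvalues / total          # share explained by each component
cumulative = np.cumsum(proportion)        # running total of the shares
differences = -np.diff(eigenvalues)       # lambda_i - lambda_(i+1)

for i, (lam, prop, cum) in enumerate(zip(eigenvalues, proportion, cumulative), 1):
    print(f"{i}  {lam:.4f}  {prop:.4f}  {cum:.4f}")
print("successive differences:", np.round(differences, 4))
```

The large first difference followed by much smaller ones is the "sharp drop" described above.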
An alternative method to determine the number of principal components is to look at a scree plot.
With the eigenvalues ordered from largest to smallest, a scree plot is the plot of \(\hat{\lambda}_i\) versus i. The
number of components is determined at the point beyond which the remaining eigenvalues are all
relatively small and of comparable size. The following plot is made in Minitab.
The scree plot for the variables without standardization (covariance matrix)
As you see, we could have stopped at the second principal component, but we continued till the third
component. Relatively speaking, the contribution of the third component is small compared to the second
component.
Step 2: Next, we will compute the principal component scores. For example, the first principal
component can be computed using the elements of the first eigenvector:
\(\begin{array}{lll}\hat{Y}_1&=&0.0351\times(\text{climate})+0.0933\times(\text{housing})+
0.4078\times(\text{health})\\&&+0.1004\times(\text{crime})+0.1501\times(\text{transportation})
+0.0321\times(\text{education})\\&&+0.8743\times(\text{arts})+0.1590\times(\text{recreation})+
0.0195\times(\text{economy})\end{array}\)
In order to complete this formula and compute the principal component for the individual community of
interest, plug in that community's values for each of these variables. A fairly standard procedure is, rather
than using the raw data here, to use the difference between the variables and their sample means. This is
known as translation of the random variables. Translation does not affect the interpretations because the
variances of the original variables are the same as those of the translated variables.
The magnitudes of the coefficients give the contributions of each variable to that component. However, the
magnitudes of the coefficients also depend on the variances of the corresponding variables.
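The effect of translation on the scores can be illustrated with a short sketch. The data matrix `X` below is simulated and hypothetical (it is not the Places Rated data); the point is only that raw and mean-centered scores have the same variances, while the centered scores have mean zero.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 100 units, 3 correlated variables, shifted away from zero.
mixing = np.array([[2.0, 0.5, 0.0],
                   [0.0, 1.0, 0.3],
                   [0.0, 0.0, 0.5]])
X = rng.normal(size=(100, 3)) @ mixing + 10.0

S = np.cov(X, rowvar=False)
lam, E = np.linalg.eigh(S)
lam, E = lam[::-1], E[:, ::-1]               # largest eigenvalue first

scores_raw = X @ E                           # plug in the raw values
scores_centered = (X - X.mean(axis=0)) @ E   # plug in deviations from the sample means

print(np.round(scores_raw.var(axis=0, ddof=1), 4))
print(np.round(scores_centered.var(axis=0, ddof=1), 4))
print(np.round(lam, 4))
```

All three printed rows should agree, since the sample variance of each score column equals the corresponding eigenvalue regardless of translation.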
7.4 Interpretation of the Principal Components
Step 3: To interpret each component, we must compute the correlations between the original data for
each variable and each principal component.
These correlations are obtained using the correlation procedure. In the variable statement we will include
the first three principal components, "prin1, prin2, and prin3", in addition to all nine of the original
variables. We will use these correlations between the principal components and the original variables to
interpret these principal components.
Because of standardization (the variables are centered at their means), all principal components will have
mean 0. The standard deviation is also given for each of the components, and these will be the square roots
of the eigenvalues.
More important for our current purposes are the correlations between the principal components and the
original variables. These have been copied into the following table. You will also note that if you look at
the principal components themselves, there is zero correlation between the components.
                  Principal Component
Variable          1         2         3
Climate          0.190     0.017     0.207
Housing          0.544*    0.020     0.204
Health           0.782*   -0.605*    0.144
Crime            0.365     0.294     0.585*
Transportation   0.585*    0.085     0.234
Education        0.394     0.273     0.027
Arts             0.985*    0.126     0.111
Recreation       0.520*    0.402     0.519*
Economy          0.142     0.150     0.239
Interpretation of the principal components is based on finding which variables are most strongly
correlated with each component, i.e., which of these numbers are large in magnitude, the farthest from
zero in either the positive or negative direction. Which numbers we consider to be large or small is of course
a subjective decision. You need to determine at what level the correlation value will be of importance.
Here a correlation value above 0.5 in magnitude is deemed important. These larger correlations are marked
with an asterisk in the table above.
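Correlations of this kind can also be computed outside SAS. The sketch below uses simulated, hypothetical data (not the Places Rated data) and compares a direct calculation with the closed form \(\text{corr}(X_k,Y_i)=\hat{e}_{ik}\sqrt{\hat{\lambda}_i}/s_k\), which holds when the components come from the covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))   # hypothetical correlated data

S = np.cov(X, rowvar=False)
lam, E = np.linalg.eigh(S)
lam, E = lam[::-1], E[:, ::-1]                 # largest eigenvalue first
scores = (X - X.mean(axis=0)) @ E              # principal component scores

# Direct computation: correlate every variable (rows) with every component (columns).
p = X.shape[1]
corr_direct = np.corrcoef(X, scores, rowvar=False)[:p, p:]

# Closed form: corr(X_k, Y_i) = e_ik * sqrt(lambda_i) / s_k.
corr_formula = E * np.sqrt(lam) / np.sqrt(np.diag(S))[:, None]

important = np.abs(corr_direct) > 0.5          # the cutoff used in this lesson
score_corr = np.corrcoef(scores, rowvar=False) # correlations among the components

print(np.round(corr_direct, 3))
print(important)
```

The boolean array `important` plays the role of the flagged entries in the table, and `score_corr` is the identity matrix up to rounding, reflecting the zero correlation between components.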
We will now interpret the principal component results with respect to the value that we have deemed
significant.
First Principal Component Analysis - PCA1
The first principal component is strongly correlated with five of the original variables. The first principal
component increases with increasing Arts, Health, Transportation, Housing and Recreation scores. This
suggests that these five criteria vary together. If one increases, then the remaining ones tend to increase as
well. This component can be viewed as a measure of the quality of Arts, Health, Transportation, and Recreation,
and the lack of quality in Housing (recall that high values for Housing are bad). Furthermore, we see that
the first principal component correlates most strongly with the Arts. In fact, we could state that, based on
the correlation of 0.985, this principal component is primarily a measure of the Arts. It would follow
that communities with high values would tend to have a lot of arts available, in terms of theaters,
orchestras, etc., whereas communities with small values would have very few of these types of
opportunities.
Second Principal Component Analysis - PCA2
The second principal component is strongly correlated with only one of the variables: it increases with
decreasing Health. This component can be viewed as a measure of how unhealthy the location is in terms of
available health care, including doctors, hospitals, etc.
Third Principal Component Analysis - PCA3
The third principal component increases with increasing Crime and Recreation. This suggests that places
with high crime also tend to have better recreation facilities.
To complete the analysis we often would like to produce a scatter plot of the component scores.
In looking at the program, you will see a gplot procedure at the bottom where we are plotting the second
component against the first component. A similar plot can also be prepared in Minitab, but is not shown
here.
Each dot in this plot represents one community. So if you were looking at the red dot out by itself to the
right, you may conclude that this particular dot has a very high value for the first principal component, and
we would expect this community to have high values for the Arts, Health, Housing, Transportation and
Recreation. Whereas if you look at the red dot at the left of the spectrum, you would expect it to have low
values for each of those variables.
The top dot in blue has a high value for the second component. So you would expect that this community
would rate poorly for Health. Conversely, if you were to look at the blue dot on the bottom, the
corresponding community would have high values for Health.
Further analyses may include:
Scatter plots of principal component scores. In the present context, we may wish to identify the
locations of each point in the plot to see if places with high levels of a given component tend to be
clustered in a particular region of the country, while sites with low levels of that component are
clustered in another region of the country.
Principal components are often treated as dependent variables for regression and analysis of
variance.
7.5 Alternative: Standardize the Variables
In the previous example we looked at principal components analysis applied to the raw data. In our
earlier discussion we noted that if the raw data are used, principal component analysis will tend to give
more emphasis to those variables that have higher variances than to those variables that have very low
variances. In effect, the results of the analysis will depend on what units of measurement are used to
measure each variable. That would imply that a principal component analysis should only be used with
the raw data if all variables have the same units of measure. And even in this case, only if you wish to
give those variables which have higher variances more weight in the analysis.
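The dependence on units can be seen in a small simulation. In this sketch the data and the units are hypothetical: the first column is imagined to be a length recorded in feet and then re-expressed in inches, and the helper `first_pc_share` is an illustrative name.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical measurements: column 0 in feet, column 1 in pounds.
X = rng.normal(size=(300, 2)) @ np.array([[1.0, 0.6],
                                          [0.0, 0.8]])
X_inches = X * np.array([12.0, 1.0])     # same data, first column converted to inches


def first_pc_share(data, use_correlation):
    """Proportion of total variation carried by the first principal component."""
    if use_correlation:
        M = np.corrcoef(data, rowvar=False)
    else:
        M = np.cov(data, rowvar=False)
    lam = np.linalg.eigvalsh(M)[::-1]
    return lam[0] / lam.sum()


cov_feet = first_pc_share(X, use_correlation=False)
cov_inches = first_pc_share(X_inches, use_correlation=False)
corr_feet = first_pc_share(X, use_correlation=True)
corr_inches = first_pc_share(X_inches, use_correlation=True)

print(cov_feet, cov_inches)    # changes with the units
print(corr_feet, corr_inches)  # unchanged by the units
```

With the covariance matrix, the change of units inflates the variance of the first column and the first component becomes dominated by it; with the correlation matrix the result does not change.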
A unique example of this type of implementation might be in an ecological setting where you are looking
at counts of different species of organisms at a number of different sample sites. Here, one may want to
give more weight to the more common species that are observed. By analysing the raw data you will tend
to find that more common species will also show higher variances and will be given more emphasis. If
you were to do a principal component analysis on standardized counts, all species would be weighted
equally regardless of how abundant they are, and hence you may find some very rare species entering in
as significant contributors in the analysis. This may or may not be desirable. These types of decisions
need to be made with the scientific foundation and questions in mind.
Summary
The results of principal component analysis depend on the scales at which the variables are
measured.
Variables with the highest sample variances will tend to be emphasized in the first few principal
components.
Principal component analysis using the covariance matrix should only be considered if all of the
variables have the same units of measurement.
If the variables either have different units of measurement (e.g., pounds, feet, gallons, etc.), or if we wish
each variable to receive equal weight in the analysis, then the variables should be standardized before a
principal components analysis is carried out. Standardize each variable by subtracting its mean and
dividing by its standard deviation:
\[Z_{ij}=\frac{X_{ij}-\bar{x}_j}{s_j}\]
where
\(X_{ij}\) = Data for variable j in sample unit i
\(\bar{x}_{j}\) = Sample mean for variable j
\(s_j\) = Sample standard deviation for variable j
We will now perform the principal component analysis using the standardized data.
Note: the variance-covariance matrix of the standardized data is equal to the correlation matrix for the
unstandardized data. Therefore, principal component analysis using the standardized data is equivalent to
principal component analysis using the correlation matrix.
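This equivalence is easy to confirm numerically. The sketch below standardizes a hypothetical data matrix whose columns are on very different scales and compares the covariance matrix of the standardized data with the correlation matrix of the original data.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data whose three columns are on very different scales.
scales = np.array([1.0, 50.0, 0.01])
shifts = np.array([0.0, 200.0, 5.0])
X = rng.normal(size=(100, 3)) * scales + shifts

# Z_ij = (X_ij - xbar_j) / s_j, using the sample standard deviation (ddof=1).
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

cov_Z = np.cov(Z, rowvar=False)       # variance-covariance matrix of standardized data
R = np.corrcoef(X, rowvar=False)      # correlation matrix of the original data

print(np.allclose(cov_Z, R))
print(np.trace(cov_Z))                # total variation equals the number of variables
```

The trace equals the number of variables because every standardized variable has variance one.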
Principal Component Analysis Procedure
The principal components are first calculated by obtaining the eigenvalues of the correlation matrix:
\(\hat{\lambda}_1,\hat{\lambda}_2,\dots,\hat{\lambda}_p\)
These denote the eigenvalues of the sample correlation matrix R. The corresponding eigenvectors are
denoted by
\(\mathbf{\hat{e}}_1,\mathbf{\hat{e}}_2,\dots,\mathbf{\hat{e}}_p\)
Then the estimated principal component scores are calculated using formulas similar to before, but
instead of using the raw data we will use the standardized data in the formulae below:
\(\begin{array}{lll}\hat{Y}_1&=&\hat{e}_{11}Z_1+\hat{e}_{12}Z_2+\dots+\hat{e}_{1p}Z_p
\\\hat{Y}_2&=&\hat{e}_{21}Z_1+\hat{e}_{22}Z_2+\dots+\hat{e}_{2p}Z_p\\&&\vdots\\
\hat{Y}_p&=&\hat{e}_{p1}Z_1+\hat{e}_{p2}Z_2+\dots+\hat{e}_{pp}Z_p\\\end{array}\)
The rest of the procedure and the interpretations are as discussed before.
7.6 Example: Places Rated after Standardization
The previous analysis is repeated after standardizing the variables.
Using SAS
The SAS program places1.sas will implement the principal component procedures using the
standardized data.
The output begins with descriptive information including the means and standard deviations for the
individual variables being presented.
This is followed by the Correlation Matrix for the data. For example, the correlation between the
housing and climate data was only 0.273. No hypothesis tests are presented for whether these correlations
are equal to zero. We will use this correlation matrix instead to obtain our eigenvalues and
eigenvectors.
We need to focus on the eigenvalues of the correlation matrix that correspond to each of the principal
components. In this case, the total variation of the standardized variables is going to be equal to p, the
number of variables. After standardization each variable has variance equal to one, and the total variation
is the sum of these variances; in this case the total variation will be 9.
The eigenvalues of the correlation matrix are given in the second column in the table below. Note also
the proportion of variation explained by each of the principal components, as well as the cumulative
proportion of the variation explained.
Step 1
Examine the eigenvalues to determine how many principal components should be considered:
Component   Eigenvalue   Proportion   Cumulative
1           3.2978       0.3664       0.3664
2           1.2136       0.1348       0.5013
3           1.1055       0.1228       0.6241
4           0.9073       0.1008       0.7249
5           0.8606       0.0956       0.8205
6           0.5622       0.0625       0.8830
7           0.4838       0.0538       0.9368
8           0.3181       0.0353       0.9721
9           0.2511       0.0279       1.0000
The first principal component explains about 37% of the variation. Furthermore, the first four principal
components explain 72%, while the first five principal components explain 82% of the variation.
Compare these proportions with those obtained using non-standardized variables. This analysis is going
to require a larger number of components to explain the same amount of variation as the original analysis
using the variance-covariance matrix. This is not unusual.
In most cases, the required cut-off is pre-specified, i.e., how much of the variation is to be explained is
predetermined. For instance, I might state that I would be satisfied if I could explain 70% of the variation. If
we do this, then we would select components until we get up to 70% of the variation. This
would be one approach. This type of judgment is arbitrary and hard to make if you are not experienced
with these types of analysis. The goal to some extent also depends on the type of problem at hand.
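The cut-off rule can be sketched in a few lines, using the eigenvalues reported in the two tables of this lesson. The helper `components_needed` is a hypothetical name for this illustration, and the 70% and 80% targets are examples rather than recommendations.

```python
import numpy as np

# Eigenvalues from the covariance-matrix analysis (Table 1) and from the
# correlation-matrix analysis (table above), as reported in this lesson.
cov_eigs = np.array([0.3775, 0.0511, 0.0279, 0.0230, 0.0168,
                     0.0120, 0.0085, 0.0039, 0.0018])
corr_eigs = np.array([3.2978, 1.2136, 1.1055, 0.9073, 0.8606,
                      0.5622, 0.4838, 0.3181, 0.2511])


def components_needed(eigenvalues, target):
    """Smallest k whose cumulative proportion of variation reaches target."""
    cumulative = np.cumsum(eigenvalues) / eigenvalues.sum()
    return int(np.argmax(cumulative >= target)) + 1


for target in (0.70, 0.80):
    k_cov = components_needed(cov_eigs, target)
    k_corr = components_needed(corr_eigs, target)
    print(f"target {target:.0%}: covariance k = {k_cov}, correlation k = {k_corr}")
```

For the same target, the standardized analysis needs more components, which matches the comparison made above.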
Another approach would be to plot the differences between the ordered values and look for a break or a
sharp drop. The only sharp drop that is noticeable in this case is after the first component. One might,
based on this, select only one component. However, one component is probably too few, particularly
because we have only explained 37% of the variation. Consider the scree plot based on the standardized
variables.
The scree plot for standardized variables (correlation matrix)
Step 2
Next, we can compute the principal component scores using the eigenvectors. This is a formula for the
first principal component:
\(\begin{array}{lll}\hat{Y}_1&=&0.158\times Z_{\text{climate}}+0.384\times Z_{\text{housing}}+
0.410\times Z_{\text{health}}\\&&+0.259\times Z_{\text{crime}}+0.375\times
Z_{\text{transportation}}+0.274\times Z_{\text{education}}\\&&+0.474\times Z_{\text{arts}}+
0.353\times Z_{\text{recreation}}+0.164\times Z_{\text{economy}}\end{array}\)
And remember, this is now going to be a function, not of the raw data, but of the standardized data.
The magnitudes of the coefficients give the contributions of each variable to that component. Since the
data have been standardized, they do not depend on the variances of the corresponding variables.
Step 3
Next, we can look at the coefficients for the principal components. In this case, since the data are
standardized, within a column the relative magnitude of those coefficients can be directly assessed. Each
column here corresponds with a column in the output of the program labeled Eigenvectors.
[Table: eigenvector coefficients by principal component, with one row per variable (Climate, Housing,
Health, Crime, Transportation, Education, Arts, Recreation, Economy); the values are not preserved in this copy.]
Interpretation of the principal components is based on finding which variables are most strongly
correlated with each component. In other words, we need to decide which numbers are large within each
column. In the first column we will decide that Health and Arts are large. This is very arbitrary. Other
variables might have also been included as part of this first principal component.
Component Summaries
First Principal Component Analysis - PCA1
The first principal component is a measure of the quality of Health and the Arts, and to some extent
Housing, Transportation and Recreation. Health increases with increasing values in the Arts. If any of
these variables goes up, so do the remaining ones. They are all positively related, as they all have positive
signs.
Second Principal Component Analysis - PCA2
The second principal component is a measure of the severity of crime, the quality of the economy, and
the lack of quality in education. Crime and Economy increase with decreasing Education. Here we can
see that cities with high levels of crime and good economies also tend to have poor educational systems.
Third Principal Component Analysis - PCA3
The third principal component is a measure of the quality of the climate and poorness of the economy.
Climate increases with decreasing Economy. The inclusion of economy within this component will add a
bit of redundancy within our results. This component is primarily a measure of climate, and to a lesser
extent the economy.
Fourth Principal Component Analysis - PCA4
The fourth principal component is a measure of the quality of education and the economy and the
poorness of the transportation network and recreational opportunities. Education and Economy increase
with decreasing Transportation and Recreation.
Fifth Principal Component Analysis - PCA5
The fifth principal component is a measure of the severity of crime and the quality of housing. Crime
increases with decreasing housing.
7.7 Once the Components Have Been Calculated
One can interpret these component by component. One method of deciding how many components to include
is to choose only those that give unambiguous results, i.e., where no variable appears in two different
columns as a significant contribution.
Note that the primary purpose of this analysis is descriptive; it is not hypothesis testing! So your
decision in many respects needs to be made based on what provides you with a good, concise description
of the data.
We have to make a decision as to what is an important correlation, not necessarily from a statistical
hypothesis testing perspective, but from, in this case, an urban sociological perspective. You have to
decide what is important in the context of the problem at hand. This decision may differ from discipline
to discipline. In some disciplines such as sociology and ecology the data tend to be inherently 'noisy', and
in this case you would expect 'messier' interpretations. If you are looking in a discipline such as
engineering where everything has to be precise, you might put higher demands on the analysis. You
would want to have very high correlations. Principal components analysis is mostly implemented in
sociological and ecological types of applications as well as in marketing research.
As before, you can plot the principal components against one another and explore where the data
for certain observations lie.
Sometimes the principal component scores will be used as explanatory variables in a regression.
Sometimes in regression settings you might have a very large number of potential explanatory variables
to work with, and you may not have much of an idea as to which ones are important.
What you might do is perform a principal components analysis first and then perform a regression
predicting the variable of interest from the principal components themselves. The nice thing about this
analysis is that the regression coefficients will be independent of one another, since the components are
uncorrelated with one another. In this case you can actually say how much of the variation in the variable of
interest is explained by each of the individual components. This is something that you cannot normally
do in multiple regression.
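A minimal sketch of this idea follows, using simulated data. The predictors `X`, the response `y`, and the coefficients used to generate them are all hypothetical; the point is that with uncorrelated component scores, the overall R-squared splits into one piece per component.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 5
X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))   # correlated predictors
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=n)        # hypothetical response

# Principal component scores from the standardized predictors.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
lam, E = np.linalg.eigh(np.corrcoef(X, rowvar=False))
lam, E = lam[::-1], E[:, ::-1]
scores = Z @ E                                          # mutually uncorrelated columns

# Least-squares fit of the centered response on all component scores.
yc = y - y.mean()
beta, *_ = np.linalg.lstsq(scores, yc, rcond=None)

# Each coefficient equals the slope from regressing y on that component alone.
beta_single = (scores * yc[:, None]).sum(axis=0) / (scores ** 2).sum(axis=0)

# Share of the variation in y explained by each component, and the overall R-squared.
r2_each = np.array([np.corrcoef(scores[:, i], y)[0, 1] ** 2 for i in range(p)])
r2_total = 1.0 - np.sum((yc - scores @ beta) ** 2) / np.sum(yc ** 2)

print(np.round(r2_each, 4), round(r2_total, 4))
```

In practice only the first few components would usually be kept as predictors; all p are used here so that the per-component shares add up to the full-model R-squared.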
One of the problems that we have with this analysis is that because of all of the numbers involved, the
analysis is not as 'clean' as one would like. For example, in looking at the second and third components,
the economy is considered to be significant for both of those components. As you can see, this will lead
to an ambiguous interpretation in our analysis.
An alternative method of data reduction is Factor Analysis, where factor rotations are used to reduce the
complexity and obtain a cleaner interpretation of the data.
7.8 Summary
In this lesson we learned about:
The definition of a principal components analysis
How to interpret the principal components
How to select the number of principal components to be considered
How to choose between doing the analysis based on the variance-covariance matrix or the
correlation matrix.
Look for this lesson's homework problems that will give you a chance to put what you have learned to
use...