
STA4702/5701

STA 4702 Multivariate Statistical Methods
STA 5701 Applied Multivariate Methods

Principal Components Analysis
[Chapter 2 MDRD]

K. M. Portier, 2001

Overview of PCA

One of the major objectives in exploratory data analysis of multivariate data is dimension reduction.
> To screen data for obvious outliers.
> To select low-dimensional projections of the data for graphing.
> To search for structure in the data.

The primary statistical tool for accomplishing this is the creation of principal components.

A principal component is defined as a linear combination (projection) of optimally weighted observed variables. In appearance, this linear function is similar to a multiple regression equation, except that there is no intercept term.

Ordination

A procedure for adapting a multidimensional swarm of data points in such a way that, when it is projected onto a two-dimensional surface, any intrinsic pattern possessed by the swarm becomes apparent.

Approach - Assign suitably chosen weights to the different variables (original dimensions) so that a score can be calculated for each individual. Each individual can then be ordered by its score.

How do we determine the OPTIMAL WEIGHTS?

Projection of 2-D to 1-D

$$z = \mathbf{a}'\mathbf{y} = \begin{bmatrix} a_1 & a_2 \end{bmatrix} \begin{bmatrix} y_1 \\ y_2 \end{bmatrix}$$

[Figure: scatterplot of Variable 2 (y2) against Variable 1 (y1); each point is projected onto the line z1.]

The histogram of z1 tells us almost as much as the scatterplot of y1 and y2.
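The projection of 2-D points onto a line can be sketched in a few lines of Python. This is a minimal illustration only: the four observations and the weights (1, 1) are hypothetical and are not the optimal weights that PCA would choose; the function name `project_onto_line` is likewise an assumption made for this sketch.

```python
import math

def project_onto_line(points, weights):
    """Score each (y1, y2) point as z = a1*y1 + a2*y2, with a'a = 1."""
    norm = math.sqrt(sum(w * w for w in weights))
    a = [w / norm for w in weights]  # rescale the weights to unit length
    return [sum(ai * yi for ai, yi in zip(a, y)) for y in points]

# Hypothetical observations (y1, y2) and hypothetical, non-optimal weights.
points = [(4.0, 3.8), (1.0, 1.2), (3.0, 3.1), (2.0, 1.9)]
scores = project_onto_line(points, (1.0, 1.0))

# Order the individuals by their score, smallest first.
ranking = sorted(range(len(points)), key=lambda i: scores[i])
print(scores)
print(ranking)
```

Each individual is reduced to a single score, and the ranking of those scores is the ordination.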

Axis Rotation

Rotate the axes so that z1 goes through the scatter in the direction of most variability. z2 is perpendicular to z1 and in the direction of next most variability.

$$\mathbf{z} = A\mathbf{y} = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix} \begin{bmatrix} y_1 \\ y_2 \end{bmatrix}$$

[Figure: scatterplot of Variable 2 (y2) against Variable 1 (y1) with the rotated axes z1 and z2 at 90° to each other; Var(z1) > Var(z2).]

Rigid Rotation in P Space

$$\begin{aligned}
z_1 &= a_{11}x_1 + a_{12}x_2 + \cdots + a_{1p}x_p = X\mathbf{a}_1 \\
z_2 &= a_{21}x_1 + a_{22}x_2 + \cdots + a_{2p}x_p = X\mathbf{a}_2 \\
&\;\;\vdots \\
z_p &= a_{p1}x_1 + a_{p2}x_2 + \cdots + a_{pp}x_p = X\mathbf{a}_p
\end{aligned}$$

If all we were doing is rotating the axes of the data, we can define p projections from the p original variables. The projection coefficients must have certain properties if they are to describe a true (rigid) axis rotation.

Properties:

$$\sum_{k=1}^{p} a_{ik}a_{jk} = 0 \quad \text{for all } i \neq j \qquad \text{(orthogonal projections)}$$

Note: each z_i is of order n x 1.
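The orthogonality property of a rigid rotation, and the fact that such a rotation leaves the total variance unchanged, can be checked numerically. The sketch below assumes a 2-D rotation by an arbitrary 30 degrees and a small hypothetical data set; the helper names are illustrative only.

```python
import math

def rotation_matrix(theta):
    """2 x 2 rigid rotation matrix A for an angle theta in radians."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, s], [-s, c]]

def sample_variance(values):
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)

A = rotation_matrix(math.radians(30))  # the angle is an arbitrary choice

# Orthogonality: sum_k a_1k * a_2k = 0, and each row has unit length.
row_dot = sum(A[0][k] * A[1][k] for k in range(2))
row_norms = [sum(a * a for a in row) for row in A]

# Hypothetical (y1, y2) observations, rotated to z = A y.
data = [(1.0, 2.0), (2.0, 2.5), (3.0, 4.5), (4.0, 4.0), (5.0, 6.0)]
rotated = [tuple(sum(A[i][k] * y[k] for k in range(2)) for i in range(2))
           for y in data]

tv_before = sum(sample_variance([y[j] for y in data]) for j in range(2))
tv_after = sum(sample_variance([z[j] for z in rotated]) for j in range(2))
print(row_dot, row_norms, tv_before, tv_after)
```

Any angle gives the same total variance; PCA picks the particular rotation that loads as much of it as possible onto z1.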

Coordinate Shift and Scale

After rotation, we shift the center of the new coordinate system to the centroid of the scatter, and rescale the axes.

$$\mathbf{z} = A(\mathbf{y} - \bar{\mathbf{y}}) = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix} \begin{bmatrix} y_1 - \bar{y}_1 \\ y_2 - \bar{y}_2 \end{bmatrix}$$

[Figure: scatterplot of Variable 2 (y2) against Variable 1 (y1) with the axes z1 and z2 centered on the centroid of the scatter.]

Eigenvalues

Let S(k,k) be a square matrix and I(k,k) an identity matrix. The scalars λ1, λ2, ..., λk that satisfy the polynomial equation

$$|S - \lambda I| = 0 \qquad \text{(characteristic equation)}$$

are called the eigenvalues or characteristic roots of S.

Example:

$$S = \begin{bmatrix} 1 & 0 \\ 1 & 3 \end{bmatrix}, \qquad |S - \lambda I| = \begin{vmatrix} 1-\lambda & 0 \\ 1 & 3-\lambda \end{vmatrix} = (1-\lambda)(3-\lambda) = 0$$

Eigenvalues: λ1 = 1, λ2 = 3.
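For a 2 x 2 matrix the characteristic equation is a quadratic, λ² - tr(S)λ + det(S) = 0, so the eigenvalues can be computed directly. The sketch below applies this to the example matrix S with rows (1, 0) and (1, 3); the function name `eigenvalues_2x2` is an assumption for illustration, and the approach covers only the 2 x 2 case with real roots.

```python
import math

def eigenvalues_2x2(S):
    """Real roots of |S - lambda*I| = 0 for a 2 x 2 matrix, smallest first."""
    (a, b), (c, d) = S
    trace = a + d
    det = a * d - b * c
    disc = trace * trace - 4.0 * det  # discriminant of the quadratic
    if disc < 0:
        raise ValueError("eigenvalues are complex")
    root = math.sqrt(disc)
    return ((trace - root) / 2.0, (trace + root) / 2.0)

S = [[1.0, 0.0],
     [1.0, 3.0]]
eigs = eigenvalues_2x2(S)
print(eigs)
```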

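Eigenvectors

Let S(k,k) be a square matrix with eigenvalues λ1, λ2, ..., λk. For each λi, the non-zero vector a_i(k,1) constructed such that

$$S\mathbf{a}_i = \lambda_i \mathbf{a}_i$$

is called the ith eigenvector or characteristic vector of S.

Example: eigenvector for λ1 = 1.

$$S = \begin{bmatrix} 1 & 0 \\ 1 & 3 \end{bmatrix}, \qquad \lambda_1 = 1, \quad \lambda_2 = 3$$

$$\begin{bmatrix} 1 & 0 \\ 1 & 3 \end{bmatrix} \begin{bmatrix} a_{11} \\ a_{21} \end{bmatrix} = 1 \cdot \begin{bmatrix} a_{11} \\ a_{21} \end{bmatrix} \;\Rightarrow\; \begin{cases} a_{11} = a_{11} \\ a_{11} + 3a_{21} = a_{21} \end{cases} \;\Rightarrow\; a_{11} = -2a_{21}$$

$$a_{21} = 1 \;\Rightarrow\; a_{11} = -2, \qquad \mathbf{a}_1 = \begin{bmatrix} -2 \\ 1 \end{bmatrix}$$

Sample Principal Components

Given the data matrix:

$$X_{(p,n)} = \begin{bmatrix} \mathbf{x}_1 & \mathbf{x}_2 & \cdots & \mathbf{x}_n \end{bmatrix}$$

Mean vector:

$$\bar{\mathbf{x}}_{(p,1)} = \begin{bmatrix} \bar{x}_1 \\ \bar{x}_2 \\ \vdots \\ \bar{x}_p \end{bmatrix}$$

Sample variances:

$$S_{(p,p)} = \begin{bmatrix} s_{11} & s_{12} & \cdots & s_{1p} \\ s_{21} & s_{22} & \cdots & s_{2p} \\ \vdots & \vdots & & \vdots \\ s_{p1} & s_{p2} & \cdots & s_{pp} \end{bmatrix}$$

Let $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_p$ be the eigenvalues of S.

Let $\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_p$ be the eigenvectors of S.

Principal components:

$$z_i = \mathbf{a}_i'(\mathbf{x} - \bar{\mathbf{x}})$$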
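For completeness, the same steps can be applied to the second eigenvalue of the example matrix S with rows (1, 0) and (1, 3). This worked derivation is an addition to the slides; it writes the components of the second eigenvector as a12 and a22, following the subscript convention used for a1.

```latex
\begin{aligned}
S\,\mathbf{a}_2 = \lambda_2 \mathbf{a}_2:\quad
\begin{bmatrix} 1 & 0 \\ 1 & 3 \end{bmatrix}
\begin{bmatrix} a_{12} \\ a_{22} \end{bmatrix}
&= 3 \begin{bmatrix} a_{12} \\ a_{22} \end{bmatrix} \\
a_{12} = 3a_{12} &\;\Rightarrow\; a_{12} = 0 \\
a_{12} + 3a_{22} = 3a_{22} &\;\Rightarrow\; a_{22} \text{ is free} \\
a_{22} = 1 &\;\Rightarrow\; \mathbf{a}_2 = \begin{bmatrix} 0 \\ 1 \end{bmatrix}
\end{aligned}
```

Eigenvectors are determined only up to a scale factor; PCA removes that ambiguity by normalizing each one to unit length.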

Principal Components

$$\begin{aligned}
z_1 &= a_{11}(x_1 - \bar{x}_1) + a_{12}(x_2 - \bar{x}_2) + \cdots + a_{1p}(x_p - \bar{x}_p) && \text{First PC} \\
z_2 &= a_{21}(x_1 - \bar{x}_1) + a_{22}(x_2 - \bar{x}_2) + \cdots + a_{2p}(x_p - \bar{x}_p) && \text{Second PC} \\
&\;\;\vdots \\
z_p &= a_{p1}(x_1 - \bar{x}_1) + a_{p2}(x_2 - \bar{x}_2) + \cdots + a_{pp}(x_p - \bar{x}_p) && p\text{th PC}
\end{aligned}$$

For z1 to define the first principal component we require the following additional parameter constraints.

$$\sum_{k=1}^{p} a_{1k}^2 = \mathbf{a}_1'\mathbf{a}_1 = 1 \qquad \text{(normalized coefficients)}$$

Var(z1) is maximum among all {z1, z2, ..., zp}.

Second Principal Component

For z2 to define the second principal component we require the following additional parameter constraints.

$$\sum_{k=1}^{p} a_{2k}^2 = \mathbf{a}_2'\mathbf{a}_2 = 1 \qquad \text{(normalized coefficients)}$$

Var(z2) is maximum among all {z2, z3, ..., zp}.

Remember z1 and z2 are orthogonal.

Subsequent Principal Components

$$\begin{aligned}
z_1 &= a_{11}(x_1 - \bar{x}_1) + a_{12}(x_2 - \bar{x}_2) + \cdots + a_{1p}(x_p - \bar{x}_p) \\
z_2 &= a_{21}(x_1 - \bar{x}_1) + a_{22}(x_2 - \bar{x}_2) + \cdots + a_{2p}(x_p - \bar{x}_p) \\
&\;\;\vdots \\
z_p &= a_{p1}(x_1 - \bar{x}_1) + a_{p2}(x_2 - \bar{x}_2) + \cdots + a_{pp}(x_p - \bar{x}_p)
\end{aligned}$$

Apply the rules for the first and second principal components to all the components to obtain the one rigid projection which has the properties:

$$\mathrm{Var}(z_1) \geq \mathrm{Var}(z_2) \geq \mathrm{Var}(z_3) \geq \cdots \geq \mathrm{Var}(z_p)$$

$$\mathrm{Cov}(z_i, z_j) = 0 \quad \text{for all } i \neq j$$

The a_i are orthogonal and normalized, hence orthonormal.

The mean of each z_i is zero, hence all components are centered.

Total Variance

$$\mathrm{COV}(Z) = \Sigma = \begin{bmatrix} \sigma_{11} & 0 & \cdots & 0 \\ 0 & \sigma_{22} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_{pp} \end{bmatrix}, \qquad \sigma_{ii} = \mathrm{var}(z_i)$$

$$\text{Total Variance} = TV = \mathrm{Trace}(\Sigma) = \mathrm{tr}(\Sigma) = \sum_{k=1}^{p} \sigma_{kk} = \sum_{k=1}^{p} \sigma_k^2$$

Var(z1) + Var(z2) + Var(z3) + ... + Var(zp) = TV
Var(x1) + Var(x2) + Var(x3) + ... + Var(xp) = TV

In a rigid rotation, total variation is maintained!
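The preservation of total variance can be checked against the numbers reported in the covariance-based SAS output for the gator data later in these notes. The sketch below simply sums the nine sample variances (the diagonal of the covariance matrix) and the nine eigenvalues as printed there; the two totals should agree up to the rounding of the printed values.

```python
# Diagonal of the gator covariance matrix (variances of cl, sw, sl, dcl,
# ow, iow, ol, lcr, wn) as printed in the SAS output.
variances = [29618.47, 5365.94, 14376.15, 25754.56, 94.72,
             453.76, 333.50, 406.07, 302.41]

# Eigenvalues of the same covariance matrix, as printed (one decimal place).
eigenvalues = [76225.1, 294.1, 65.3, 46.6, 25.7, 24.6, 13.7, 6.9, 3.6]

tv_from_variables = sum(variances)     # Var(x1) + ... + Var(xp)
tv_from_components = sum(eigenvalues)  # Var(z1) + ... + Var(zp)
print(round(tv_from_variables, 2), round(tv_from_components, 2))
```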

Proportion of Variance Explained

Typically the first couple of principal components will explain the largest fraction of total variance.

$$\frac{\mathrm{var}(z_1)}{TV} \geq \frac{\mathrm{var}(z_2)}{TV} \geq \cdots \geq \frac{\mathrm{var}(z_p)}{TV}$$

If most of the total variation can be associated with the first couple of principal components, a plot of the data in these two coordinates should demonstrate most of the useful information about structures in the data.

If there are dependencies among the variables in the data, some of the principal components may have zero variance and explain nothing of the total variation.

Principal Component Scores

Evaluate the first principal component for the complete dataset.

$$\begin{aligned}
z_{11} &= a_{11}(x_{11} - \bar{x}_1) + a_{12}(x_{21} - \bar{x}_2) + \cdots + a_{1p}(x_{p1} - \bar{x}_p) \\
z_{12} &= a_{11}(x_{12} - \bar{x}_1) + a_{12}(x_{22} - \bar{x}_2) + \cdots + a_{1p}(x_{p2} - \bar{x}_p) \\
&\;\;\vdots \\
z_{1n} &= a_{11}(x_{1n} - \bar{x}_1) + a_{12}(x_{2n} - \bar{x}_2) + \cdots + a_{1p}(x_{pn} - \bar{x}_p)
\end{aligned}$$

$$\mathbf{z}_{1(n,1)} = X_{(n,p)}\,\mathbf{a}_{1(p,1)} \qquad \text{(1st PCA scores)}$$

$$\mathrm{corr}(x_i, z_j) = \frac{a_{ji}\sqrt{\lambda_j}}{\sqrt{\mathrm{var}(x_i)}}$$

Variables with coefficients of larger magnitude in a principal component have larger contribution to that component.
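The proportion of variance explained is simply each component's variance divided by the total. As a minimal sketch, the code below recomputes the Proportion and Cumulative columns from the covariance-based eigenvalues printed in the gator SAS output later in these notes; small differences from the printed columns are expected because the eigenvalues are rounded to one decimal place.

```python
# Eigenvalues of the gator covariance matrix, as printed in the SAS output.
eigenvalues = [76225.1, 294.1, 65.3, 46.6, 25.7, 24.6, 13.7, 6.9, 3.6]

total_variance = sum(eigenvalues)
proportions = [ev / total_variance for ev in eigenvalues]

# Running total of the proportions.
cumulative = []
running = 0.0
for p in proportions:
    running += p
    cumulative.append(running)

for k, (p, c) in enumerate(zip(proportions, cumulative), start=1):
    print(f"PRIN{k}: proportion={p:.6f} cumulative={c:.5f}")
```

On the covariance scale the first component alone carries more than 99% of the total variance, which reflects the very large variances of the length measurements.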

Collinear Data

If y1 and y2 are highly correlated (collinear), then the projection z1 will describe all of the interesting structure in the scatter, leaving nothing for z2.

[Figure: scatterplot of Variable 2 (y2) against Variable 1 (y1) for collinear data, with axes z1 and z2 at 90° to each other; Var(z1) >> Var(z2).]

Covariance vs Correlation

What happens if you use the eigenvalues and eigenvectors of the correlation matrix, R, instead of the covariance matrix, S?

[Figure: comparison of the two analyses, with annotations SAME and DIFFERENT.]

Conceptual Model

Principal Components (Latent Factors): z1, z2, z3, ..., zp
Observed Responses: x1, x2, x3, ..., xp

[Figure: diagram linking the principal components to the observed responses; annotation: Percent Variation Explained.]

Principal Component Bi-Plot

A Bi-Plot is a graphical tool for displaying multivariate data in such a way that an ordination of observations is presented along with the relationship of the ordination axes to the original variables.

Let's overlay on a scatter plot of the first two principal components a representation of where the original variable axes lie.

[Figure: scatterplot of z2 against z1 with the original variable axes x1 and x2 overlaid.]

Constructing a Bi-Plot
R-mode decomposition

The covariance matrix S is a measure of the distance between observations in p-dimensional space.

$$S_{p \times p} = \frac{1}{n-1}\, X'\left(I_n - \frac{1}{n}\mathbf{1}\mathbf{1}'\right)X, \qquad \mathbf{1}_{n \times 1} = \begin{bmatrix} 1 \\ 1 \\ \vdots \end{bmatrix}$$

In PCA, this matrix is decomposed into two matrices in what is called a spectral decomposition.

$$S = P\Lambda P', \qquad P_{p \times p} = [\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_p], \qquad \Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_p), \qquad P'P = PP' = I$$

This is referred to as R-mode analysis (response mode); it provides for an ordering of observations in the component space.

Constructing a Bi-Plot
Q-mode decomposition

An alternate analysis is to examine an ordering of the components in the subject space, called Q-mode analysis (quantifier or variable mode). In this case the covariance among subjects is used.

$$\Omega_{n \times n} = \frac{1}{p-1}\, X\left(I_p - \frac{1}{p}\mathbf{1}\mathbf{1}'\right)X', \qquad \mathbf{1}_{p \times 1} = \begin{bmatrix} 1 \\ 1 \\ \vdots \end{bmatrix}$$

The spectral decomposition of this matrix can also be computed.

$$\Omega = Q\Lambda Q', \qquad Q_{n \times p} = [\mathbf{e}_1, \mathbf{e}_2, \ldots, \mathbf{e}_p], \qquad \Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_p), \qquad Q'Q = I$$

Here Λ is the same as on the previous slide.

Since the matrix X is of rank at most p (with p < n), there are at most p nonzero eigenvalues and corresponding eigenvectors.

Reformulation of the Data Matrix

Using the spectral decompositions of S (R-mode) and of the subject covariance matrix (Q-mode), we can rewrite the original data matrix as follows.

$$X_{(n,p)} = Q_{(n,p)}\, \Lambda^{1/2}_{(p,p)}\, P'_{(p,p)}$$

Or:

$$X = H_{(n,p)}\, G'_{(p,p)}, \qquad G = P\Lambda^{(1-\alpha)/2}, \qquad H = Q\Lambda^{\alpha/2}, \qquad \alpha = 0,\ 1,\ \text{or}\ \tfrac{1}{2}$$

H holds the observation-scaled principal components (one row per observation) and G holds the original variable scales (one row per variable).

A Bi-Plot is simply a plot of the first two columns of H as points and the first two columns of G as vectors on a scatterplot.

BiPlot of Gator PCA

[Figure: biplot of the gator PCA. Annotation: projection of original axes (assuming length 1) onto the principal component axes.]

PCA Issues

Sample Size: No fewer than 50 observations; 100 is better. A rule of thumb is to have at least 20 observations per variable.

Measurement Scales: Theory assumes continuous variables. If all variables are binary, use correspondence analysis.

Original or Standardized Variables: Principal component analysis can be performed on standardized variables (i.e. using the correlation matrix) or on unstandardized values (i.e. using the covariance matrix). Standardized scores aid comparisons among different variables, especially when those variables have quite different variances. The difference in variances can be very important in the definition of components.

Number of Variables: PCA can be performed on any number of variables. With large numbers of variables there is a much higher chance that some of the components will have zero or very small eigenvalues, indicating exact or near collinearity. Variables that do not weigh highly in the more significant components may be dropped and the components recomputed. Removed variables can be analyzed in their own separate analysis.

Number of Components to Plot

How many components to plot will depend on the relative values of the eigenvalues and on the analyst's criteria as to how much of the total variation must be explained. Typical criteria are described below.

Latent Root Criterion: Plot combinations of components having eigenvalues > 1. Use fewer factors if the number of variables is less than 50 and more if the number of variables is greater than 50.

Percentage of Variance Criterion: Consider components important until the fraction of explained variance exceeds some pre-specified level, say 95% in the natural sciences or 60% in the social sciences, or until the last component added contributes less than 5%.

Scree Test Criterion: Examine the scree plot to identify the number of components at which the curve first begins to straighten out.
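The latent root and percentage of variance criteria are easy to apply once the eigenvalues are known. The sketch below uses the eigenvalues of the gator correlation matrix printed in the SAS output later in these notes; the function names and the 95% threshold are illustrative choices, not part of any SAS procedure.

```python
# Eigenvalues of the gator correlation matrix, as printed in the SAS output.
eigenvalues = [8.39494, 0.34324, 0.11491, 0.06456, 0.03947,
               0.02908, 0.01174, 0.00179, 0.00027]

def latent_root_count(eigs, cutoff=1.0):
    """Number of components whose eigenvalue exceeds the cutoff."""
    return sum(1 for ev in eigs if ev > cutoff)

def count_for_variance_level(eigs, level):
    """Smallest number of leading components whose cumulative
    proportion of variance reaches the requested level."""
    total = sum(eigs)
    running = 0.0
    for count, ev in enumerate(eigs, start=1):
        running += ev / total
        if running >= level:
            return count
    return len(eigs)

n_latent_root = latent_root_count(eigenvalues)
n_for_95_percent = count_for_variance_level(eigenvalues, 0.95)
print(n_latent_root, n_for_95_percent)
```

For these data the two criteria disagree slightly (one component versus two), which illustrates the subjectivity involved in choosing the number of components.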

Scree Plot

[Figure: scree plot.]

PCA Summary

> Principal components analysis is a powerful and useful multivariate technique for reducing the dimensions of data sets with large numbers of continuous variables.
> Because PCA utilizes only linear combinations, the ordering of the distances between points in p-dimensional space is not changed.
> It can be used to develop measures capable of representing a number of observed variables (the first couple of components).
> There is a high degree of subjectivity in deciding on the number of components to examine and in interpreting those components.
> PCA is usually applied to the correlation matrix associated with a set of variables (R-mode) or observations (Q-mode). It may also be applied to the covariance matrix of observations that have been centered but not scaled.
> PCA may not be able to demonstrate all important structures in the original data set. Non-linear relationships (e.g. patterns of points that are arranged along one wall of a hyper-cube or which drape around a hyper-cube) will not be easily seen.

[Figure: scatter of points illustrating a non-linear pattern.]

Outliers

Observations having a unique combination of characteristics (variable values) that make the observation distinctly different from other observations.
> May be beneficial in that they identify unique situations not normally observed: the jewel of a find that leads to new directions of research and insight.
> May be uncharacteristic of the population of direct interest and hence can distort any subsequent statistical analysis of the data unless removed or corrected.
> Can be quite influential in some statistical analyses (especially those using linear combinations: regression, factor analysis, etc.).

Types of Outliers

Result of an implementation error: Data entry errors, coding errors, errors in reading the measurement device, etc. Should be corrected if possible, and discarded if the true value cannot be determined.

Extraordinary Event: Retain the observation if the researcher decides that the extraordinary event should be represented in the analysis.

Unexplained Event: If the observation is the result of an extraordinary event that the researcher cannot explain, typically the observation will be removed from consideration.

Unique Combination: Here the values for individual variables are not unusual; it is the combination of values that makes the observation unique. Typically this observation will be retained in the analysis unless other information dictates that it be removed.

Detecting Outliers

Univariate Distributions: Potential outliers are those that fall in the tails of the distribution for a number of variables. Non-interactive analysis suggests finding all observations in the tails (say 2.5 or more standard deviations from the mean) of each variable and looking for the observations that occur most often. Interactive analysis involves linked histograms.

Bivariate Detection: Examine linked bivariate scatter plots in a draftsman's display.
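The non-interactive univariate screen can be sketched as follows: flag every observation lying 2.5 or more standard deviations from the mean of each variable, then count how often each observation is flagged. The two variables and their values below are hypothetical, and the function name is an assumption for illustration.

```python
import statistics
from collections import Counter

def tail_observations(values, cutoff=2.5):
    """Indices of observations at least `cutoff` standard deviations
    from the mean of a single variable."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    return [i for i, v in enumerate(values) if abs(v - mean) / sd >= cutoff]

# Hypothetical data: two variables measured on the same 12 observations.
data = {
    "var_a": [50, 52, 49, 51, 48, 50, 53, 47, 51, 49, 50, 90],
    "var_b": [10, 11, 9, 10, 12, 10, 11, 9, 10, 11, 10, 30],
}

# Count how many variables place each observation in a tail.
flag_counts = Counter()
for name, values in data.items():
    flag_counts.update(tail_observations(values))

print(flag_counts.most_common())
```

Observations flagged on several variables are the strongest outlier candidates. Note that in small samples a single extreme value inflates the standard deviation, so the 2.5 cutoff can be hard to reach.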

Outlier Detection - Multivariate Data

Multivariate Detection:
> Compute and rank observations on their Mahalanobis distance from the centroid (typically computed leaving the observation out).
> Examine scatterplots of principal components. Observations having PCA scores that are large, especially for the first component, are good outlier candidates.

Euclidean distance from centroid:

$$d_E(\mathbf{x}_i, \bar{\mathbf{x}}) = \sqrt{(\mathbf{x}_i - \bar{\mathbf{x}})'(\mathbf{x}_i - \bar{\mathbf{x}})} = \sqrt{\sum_{k=1}^{p} (x_{ik} - \bar{x}_k)^2}$$

Mahalanobis distance:

$$d_M(\mathbf{x}_i, \bar{\mathbf{x}}) = \sqrt{(\mathbf{x}_i - \bar{\mathbf{x}})'\, S^{-1} (\mathbf{x}_i - \bar{\mathbf{x}})}$$

Outliers in the Gator Data

[Figure: plot of the gator data with one observation marked as a possible outlier.]
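The difference between the two distances shows up when the variables are correlated. The sketch below works in two dimensions so that the inverse of S can be written out by hand. The seven observations are hypothetical, and, as a simplifying assumption, the centroid and covariance are computed from all observations rather than leaving each one out in turn.

```python
import math

# Hypothetical bivariate data: six points near a line, plus one point
# (the last) that breaks the correlation pattern.
rows = [(1.0, 1.0), (2.0, 2.1), (3.0, 2.9), (4.0, 4.2),
        (5.0, 4.8), (6.0, 6.1), (3.0, 5.0)]

n = len(rows)
mean = [sum(r[j] for r in rows) / n for j in range(2)]

# Sample covariance matrix entries s11, s12, s22.
s11 = sum((r[0] - mean[0]) ** 2 for r in rows) / (n - 1)
s22 = sum((r[1] - mean[1]) ** 2 for r in rows) / (n - 1)
s12 = sum((r[0] - mean[0]) * (r[1] - mean[1]) for r in rows) / (n - 1)
det = s11 * s22 - s12 * s12

def euclidean(x):
    return math.sqrt((x[0] - mean[0]) ** 2 + (x[1] - mean[1]) ** 2)

def mahalanobis(x):
    d1, d2 = x[0] - mean[0], x[1] - mean[1]
    # (x - mean)' S^{-1} (x - mean), with the 2 x 2 inverse written out.
    q = (s22 * d1 * d1 - 2.0 * s12 * d1 * d2 + s11 * d2 * d2) / det
    return math.sqrt(q)

d_e = [euclidean(r) for r in rows]
d_m = [mahalanobis(r) for r in rows]

farthest_euclidean = max(range(n), key=lambda i: d_e[i])
farthest_mahalanobis = max(range(n), key=lambda i: d_m[i])
print(farthest_euclidean, farthest_mahalanobis)
```

The Euclidean distance singles out a point at the end of the main trend, while the Mahalanobis distance singles out the point whose combination of values is unusual given the correlation.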

Clustering using PCA

Scatter plots of the first couple of principal components can be used to identify clusters of observations.

[Figure: scatterplot of z2 against z1 showing clusters of observations.]

Other PCA-like Methods

Canonical Correlations Analysis: A statistical technique to identify and measure the association between two sets of variables.

Correspondence Analysis: A weighted principal component analysis of a contingency table.

Canonical Correspondence Analysis: The equivalent of canonical correlations analysis for measuring the associations between two sets of categorical variables.

Crocodilia Skull Morphology

Iordansky, N.N., 1973, The Skull of Crocodilia, Chapter 3 in Biology of the Reptilia, Volume 4 Morphology D, C. Gans and T. S. Parsons (eds), Academic Press, London, p201-262.

Species: Crocodylus niloticus, Crocodylus porosus, Osteolaemus tetraspis, Alligator mississippiensis.

Cranial Measurements
cl - cranial length (anterior tip of snout to posterior surface of occipital condyle)
cw - cranial width (between lateral surfaces of mandibular condyles of quadrates)
sw - basal width of snout (on level with anterior orbital borders)
sl - snout length (anterior tip of snout to middle of posterior margin of cranial roof)
dcl - dorsal cranial length
ow - maximal orbital width
ol - maximal orbital length
oiw - minimal interorbital width
lcr - length of postorbital cranial roof
wcr - posterior width of cranial roof
wn - maximal width of external nares

Crocodilia Skull

[Figure: skull diagram marking the measurements cl, wn, dcl, sl, ol, ow, oiw, sw and cw.]
Species
Crocodylus niloticus

Crocodilia
Skull Dataset

Crocodylus porosus

Osteolaemus tetraspis


iow
9
13
16
16
42
50
48
58
90

lcr
30
32
42
42
68
70
82
76
76

wcr
39
48
.
65
105
120
145
.
164

wn
9
13
15
15
42
48
54
57
56

76
238
408
548
565
672
800

30
74
.
200
300
292
384
416

41
.
154
274
364
405
452
516

73
.
230
390
513
550
620
740

13
23
29
38
46
45
50
63

3.5 17 16
10 29 26
12 36 30
36 57 54
55 68 65
64 70 90
70 90 85
82 100 105

20
44
55
110
150
160
185
204

4
.
.
32
.
48
64
75

164
.
170
173
175
185
185
188
188
190
194
194
203
210
225
240

90 70
90
.
71
92
98 72
98
.
70 100
102 73 102
105 77 105
105 78 105
.
82 108
104 80 110
108 80 112
110 82 114
117 92 117
108 88 116
.
91 124
128 105 128
136 91 133

160
160
165
165
165
175
175
180
178
180
182
180
193
.
215
222

36
29
31
33
32
32
33
33
34
32
34
34
35
36
40
38

16
13
14
12
14
14
16
16
15
16
15
18
16
19
20
19

57
.
60
60
64
61
61
65
64
65
67
70
69
.
75
76

20
18
20
22
24
22
22
24
24
24
24
23
26
26
28
27

40.0
112
.
138
148
150
150
150
178
186
236

70.5 16.7 5.2 20.0 15.0 24.6 10.5


216 30 16 46 36 64
31
220 32 17 52 37
.
30
262 24 25 54 44 78
38
275 40 22 58 42 82
40
270 40 20 54 46 82
40
284 49 26 56 48 86
.
310 40 25 62 46 80
38
337 42 25 69 50 89
51
348 39 32 68 54 98
53
358 52 27 63 63 120 64

Alligator mississippiensis 72.3


220
225
272
288
.
292
320
354
366
380

37.3
98
89
120
126
117
127
124
137
160
210

35.0
138
140
175
180
183
166
203
240
232
238

wcr
38


Cranial Measurements (in mm)


cl
cw sw
sl
dcl ow
160 64 46 100 153 20
198 94 70 121 186 25
248
.
76 159 235 30
254 114 71 158 235 28
420 235 170 270 400 37
440 250 170 280 420 42
525 290 220 360 495 45
582 336 218 382 554 48
610 345 268 400 564 46
22
56
68
148
210
216
302
324

lcr

ol
22
31
41
40
60
65
72
72
85

42
38
42
40
42
44
40
40
44
45
44
43
46
48
52
51

32
35
35
35
38
40
40
40
40
38
38
42
40
.
45
46

Gator Example

Sample Covariance Matrix, S

        cl        sw        sl        dcl       ow        iow      ol       lcr      wn
cl      29618.5   12329.55  20581.08  27602.5   1410.027  3528.36  2938.29  3363.80  2693.47
sw      12329.6   5365.95   8531.87   11469.3   621.857   1480.52  1270.27  1409.79  1202.12
sl      20581.1   8531.87   14376.16  19206.4   958.091   2448.47  2023.33  2336.60  1855.84
dcl     27602.5   11469.28  19206.37  25754.6   1312.784  3284.38  2738.27  3140.15  2515.82
ow      1410.0    621.86    958.09    1312.8    94.726    161.24   162.41   170.97   151.23
iow     3528.4    1480.52   2448.47   3284.4    161.241   453.76   348.01   399.78   310.20
ol      2938.3    1270.27   2023.33   2738.3    162.413   348.01   333.51   337.42   303.07
lcr     3363.8    1409.79   2336.60   3140.1    170.970   399.78   337.42   406.08   311.25
wn      2693.5    1202.12   1855.84   2515.8    151.233   310.20   303.07   311.25   302.41

Sample Correlation Matrix, R

        cl    sw    sl    dcl   ow    iow   ol    lcr   wn
cl      1.00  0.98  1.00  1.00  0.84  0.96  0.93  0.97  0.90
sw      0.98  1.00  0.97  0.98  0.87  0.95  0.95  0.96  0.94
sl      1.00  0.97  1.00  1.00  0.82  0.96  0.92  0.97  0.89
dcl     1.00  0.98  1.00  1.00  0.84  0.96  0.93  0.97  0.90
ow      0.84  0.87  0.82  0.84  1.00  0.78  0.91  0.87  0.89
iow     0.96  0.95  0.96  0.96  0.78  1.00  0.89  0.93  0.84
ol      0.93  0.95  0.92  0.93  0.91  0.89  1.00  0.92  0.95
lcr     0.97  0.96  0.97  0.97  0.87  0.93  0.92  1.00  0.89
wn      0.90  0.94  0.89  0.90  0.89  0.84  0.95  0.89  1.00

Many high correlations: a sign that most measurements are probably measuring the same thing.
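The correlation matrix is the covariance matrix rescaled by the standard deviations: r_ij = s_ij / sqrt(s_ii s_jj). As a small check, the sketch below applies this to the cl and sw entries of the gator covariance matrix, using the values printed in the SAS output; the function name `cov_to_corr` is an assumption for illustration.

```python
import math

def cov_to_corr(S):
    """Convert a covariance matrix (list of lists) to a correlation matrix."""
    sd = [math.sqrt(S[i][i]) for i in range(len(S))]
    return [[S[i][j] / (sd[i] * sd[j]) for j in range(len(S))]
            for i in range(len(S))]

# 2 x 2 block of the gator covariance matrix for cl and sw.
S_cl_sw = [[29618.47, 12329.55],
           [12329.55, 5365.94]]

R_cl_sw = cov_to_corr(S_cl_sw)
print(round(R_cl_sw[0][1], 4))
```

The result agrees with the cl-sw correlation of about 0.98 in the correlation matrix.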

SAS Program

options ls=120 ps=41 nodate nocenter;

/* Read the skull measurements from a fixed-column text file. */
data gator;
  infile 'e:\portier\research\workshop\multivarws\sasdata\gator\reptile2.prn'
         lrecl=165 firstobs=2;
  input id $ 1-4 species $ 9-35 cl cw sw sl dcl ow iow ol lcr wcr wn;
run;

/* Drop out all cases with missing data. */
/* First flag (del = 1) any case with a missing analysis variable. */
data gator;
  set gator;
  array vars(9) cl sw sl dcl ow iow ol lcr wn;
  del = 0;
  do i = 1 to 9;
    if (vars(i) eq .) then del = 1;
  end;
run;

/* Then keep only the complete cases. */
data gator;
  set gator;
  if del eq 0;
run;

SAS Program (cont)

/* Run principal components on covariance matrix */
proc princomp data=gator cov out=gat_prn;
  var cl sw sl dcl ow iow ol lcr wn;
  title1 'Alligator Data';
  title2 'Principal Components - Covariance Based';
run;

/* Plot the first two component scores, labelled by id. */
proc plot data=gat_prn;
  plot prin2*prin1=id / vpos=30;
  title2 'Plot of First Two Principal Components';
run;

/* List the scores in order of the first component. */
proc sort data=gat_prn;
  by prin1;
run;

proc print data=gat_prn;
  var id species prin1 prin2;
  title2 'Principal Component Scores';
run;

Alligator Data
Principal Components - Covariance Based

Principal Component Analysis

Simple Statistics

        CL            SW            SL            DCL           OW
Mean    300.7918919   123.3054054   190.1891892   285.0405405   35.7486486
StD     172.1001878   73.2526145    119.9006158   160.4822757   9.7327517

        IOW           OL            LCR           WN
Mean    27.37027027   52.13513514   48.75675676   32.90540541
StD     21.30168049   18.26222903   20.15137906   17.39006654

Covariance Matrix

        CL        SW        SL        DCL       OW       IOW      OL       LCR      WN
CL      29618.47  12329.55  20581.08  27602.51  1410.02  3528.36  2938.29  3363.80  2693.47
SW      12329.55  5365.94   8531.87   11469.28  621.85   1480.51  1270.27  1409.79  1202.11
SL      20581.08  8531.87   14376.15  19206.36  958.09   2448.46  2023.33  2336.60  1855.83
DCL     27602.51  11469.28  19206.36  25754.56  1312.78  3284.37  2738.27  3140.14  2515.81
OW      1410.02   621.85    958.09    1312.78   94.72    161.24   162.41   170.97   151.23
IOW     3528.36   1480.51   2448.46   3284.37   161.24   453.76   348.00   399.77   310.20
OL      2938.29   1270.27   2023.33   2738.27   162.41   348.00   333.50   337.42   303.06
LCR     3363.80   1409.79   2336.60   3140.14   170.97   399.77   337.42   406.07   311.25
WN      2693.47   1202.11   1855.83   2515.81   151.23   310.20   303.06   311.25   302.41

Eigenvalues of the Covariance Matrix

        Eigenvalue  Difference  Proportion  Cumulative
PRIN1   76225.1     75931.0     0.993735    0.99374
PRIN2   294.1       228.8       0.003834    0.99757
PRIN3   65.3        18.7        0.000852    0.99842
PRIN4   46.6        20.9        0.000608    0.99903
PRIN5   25.7        1.1         0.000335    0.99936
PRIN6   24.6        10.9        0.000321    0.99969
PRIN7   13.7        6.9         0.000179    0.99986
PRIN8   6.9         3.3         0.000089    0.99995
PRIN9   3.6         .           0.000047    1.00000

Eigenvectors

        PRIN1    PRIN2    PRIN3    PRIN4    PRIN5    PRIN6    PRIN7    PRIN8    PRIN9
CL      0.6231   -0.0697  0.0659   0.4434   -0.0696  -0.3956  0.2656   0.3155   0.272433
SW      0.2599   0.8280   -0.4137  -0.0804  0.1052   -0.1096  0.0159   -0.0918  -0.193020
SL      0.4334   -0.2808  -0.2957  -0.6573  0.0198   0.2611   0.3234   -0.0426  0.196368
DCL     0.5810   -0.1597  0.2593   0.0090   -0.0140  0.0976   -0.5311  -0.3254  -0.414402
OW      0.0297   0.1787   0.3956   0.1109   0.2863   0.0727   0.3048   -0.6882  0.380048
IOW     0.0742   0.0132   -0.3808  0.4870   -0.1902  0.6776   -0.1667  -0.0533  0.293120
OL      0.0619   0.2324   0.4439   0.0058   -0.3869  0.4242   0.4770   0.2031   -0.382608
LCR     0.0708   0.0353   0.1577   0.0502   0.8161   0.3252   -0.0357  0.4351   -0.061446
WN      0.0569   0.3429   0.3822   -0.3355  -0.2224  0.0584   -0.4390  0.2776   0.545631

The columns PRIN1, PRIN2, PRIN3 and PRIN4 are the eigenvectors a1, a2, a3 and a4.

SAS Program (cont)


/* Rerun the analysis on correlations */
proc princomp data=gator out=gat_prn;
var cl sw sl dcl ow iow ol lcr wn ;
title1 'Alligator Data';
title2 'Principal Components - Correlation Based';
run;
proc plot data=gat_prn;
plot prin2*prin1=id / vpos=30;
title2 'Plot of First Two Principal Components';
run;
proc sort data=gat_prn;
by prin1;
run;
proc print data=gat_prn;
var id species prin1 prin2;
title2 'Principal Component Scores';
run;

Correlation Matrix

        CL      SW      SL      DCL     OW      IOW     OL      LCR     WN
CL      1.0000  0.9780  0.9974  0.9994  0.8418  0.9624  0.9349  0.9699  0.9000
SW      0.9780  1.0000  0.9714  0.9756  0.8722  0.9488  0.9496  0.9551  0.9437
SL      0.9974  0.9714  1.0000  0.9982  0.8210  0.9586  0.9240  0.9671  0.8901
DCL     0.9994  0.9756  0.9982  1.0000  0.8405  0.9608  0.9343  0.9710  0.9015
OW      0.8418  0.8722  0.8210  0.8405  1.0000  0.7777  0.9138  0.8717  0.8935
IOW     0.9624  0.9488  0.9586  0.9608  0.7777  1.0000  0.8946  0.9313  0.8374
OL      0.9349  0.9496  0.9240  0.9343  0.9138  0.8946  1.0000  0.9169  0.9543
LCR     0.9699  0.9551  0.9671  0.9710  0.8717  0.9313  0.9169  1.0000  0.8882
WN      0.9000  0.9437  0.8901  0.9015  0.8935  0.8374  0.9543  0.8882  1.0000

Eigenvalues of the Correlation Matrix

        Eigenvalue  Difference  Proportion  Cumulative
PRIN1   8.39494     8.05171     0.932772    0.93277
PRIN2   0.34324     0.22833     0.038137    0.97091
PRIN3   0.11491     0.05035     0.012768    0.98368
PRIN4   0.06456     0.02509     0.007173    0.99085
PRIN5   0.03947     0.01039     0.004386    0.99524
PRIN6   0.02908     0.01735     0.003232    0.99847
PRIN7   0.01174     0.00994     0.001304    0.99977
PRIN8   0.00179     0.00152     0.000199    0.99997
PRIN9   0.00027     .           0.000030    1.00000

Gator PCA 1 and 2 Scores

Alligator Data
Principal Components

OBS  ID    SPECIES                     PRIN1     PRIN2
1    am1   Alligator_mississippiensis  -364.559  2.5237
2    cp1   Crocodylus_porosus          -362.899  -16.0635
3    cn1   Crocodylus_niloticus        -230.047  -26.7002
4    ot1   Osteolaemus_tetraspis       -218.580  6.0300
5    ot3   Osteolaemus_tetraspis       -208.032  3.4081
6    ot4   Osteolaemus_tetraspis       -205.915  1.5330
7    ot5   Osteolaemus_tetraspis       -202.452  4.4206
8    ot6   Osteolaemus_tetraspis       -187.918  4.4455
9    ot7   Osteolaemus_tetraspis       -187.727  4.5490
10   ot9   Osteolaemus_tetraspis       -181.110  5.8937
11   ot8   Osteolaemus_tetraspis       -180.498  6.6968
12   ot10  Osteolaemus_tetraspis       -177.899  4.6904
13   ot11  Osteolaemus_tetraspis       -172.934  5.2983
14   cn2   Crocodylus_niloticus        -170.476  -16.1641
15   ot22  Osteolaemus_tetraspis       -169.809  12.6619
16   ot23  Osteolaemus_tetraspis       -158.023  8.7339
17   am2   Alligator_mississippiensis  -122.075  6.6561
18   ot25  Osteolaemus_tetraspis       -120.623  17.5970
19   am3   Alligator_mississippiensis  -117.589  -0.8882
20   ot26  Osteolaemus_tetraspis       -108.862  1.5243
21   cn3   Crocodylus_niloticus        -90.998   -28.8843
22   cn4   Crocodylus_niloticus        -89.114   -33.7525
23   am4   Alligator_mississippiensis  -39.231   7.0979
24   am5   Alligator_mississippiensis  -17.506   11.8351
25   am8   Alligator_mississippiensis  32.864    -3.6779
26   am9   Alligator_mississippiensis  90.677    -3.4041
27   am10  Alligator_mississippiensis  107.824   15.4461
28   am11  Alligator_mississippiensis  138.929   57.7779
29   cp4   Crocodylus_porosus          171.879   -25.8088
30   cn5   Crocodylus_niloticus        191.317   -4.3792
31   cn6   Crocodylus_niloticus        221.273   -7.4872
32   cn7   Crocodylus_niloticus        367.068   -1.8396
33   cp6   Crocodylus_porosus          443.711   -31.3928
34   cn8   Crocodylus_niloticus        446.466   -21.5890
35   cn9   Crocodylus_niloticus        493.593   13.9546
36   cp7   Crocodylus_porosus          596.188   18.9111
37   cp8   Crocodylus_porosus          783.087   0.3463

$$\begin{aligned}
z_1 &= a_{11}x_1 + a_{12}x_2 + \cdots + a_{1p}x_p = X\mathbf{a}_1 \\
z_2 &= a_{21}x_1 + a_{22}x_2 + \cdots + a_{2p}x_p = X\mathbf{a}_2
\end{aligned}$$

Plot of the First Two PCAs

[Figure: plot of the first two principal components; annotation: Gator Size. Note the differences in scales between the two axes.]

Alternate Plot

[Figure: the same plot drawn with the same scales on both axes.]

BiPlot of Gator PCA


%include 'c:/sasdata/biplot.sas' ;
filename gsasfile "c:/sasdata/gator/gatPCA.gif";
goptions reset=all gaccess=gsasfile autofeed dev=gif;
%biplot(data=gator, var=sw sl dcl ow iow ol lcr wn,
id=id, factype=SYM, std=STD );
run;
proc gplot data=BIPLOT;
plot dim2*dim1 /anno=BIANNO frame
href=0 vref=0 lvref=3 lhref=3
vaxis=axis2 haxis=axis1 vminor=1 hminor=1;
axis1 length=6in order=(-.8 to .8 by .1)
offset=(2) LABEL=(H=1.3 'Dimension 1');
axis2 length=6in order=(-.8 to .8 by .1)
offset=(2) LABEL=(H=1.3 a=90 r=0 'Dimension 2');
symbol v=none;
title2 h=1.5 'Biplot';
run;

SAS Biplot of Gator Data

Copy the BIPLOT SAS macro from the course datasets web page and store it on your system.