STA4702/5701
Overview of PCA
One of the major objectives in exploratory data analysis of
multivariate data is dimension reduction.
> To screen data for obvious outliers.
> To select low-dimensional projections of the data for graphing.
> To search for structure in the data.
Principal Components
Analysis
[Chapter 2 MDRD]
K. M. Portier, 2001
Ordination
[Figure: scatter plot of Variable 2 (y2) against Variable 1 (y1)]
Axis Rotation
Rotate the axes so that z1 passes through the scatter in the direction
of greatest variability. z2 is perpendicular to z1 and lies in the
direction of the next greatest variability.
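[Figure: scatter plot of Variable 2 (y2) against Variable 1 (y1) with the rotated axes z1 and z2 drawn through the scatter]

\[ z = Ay = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix} \begin{pmatrix} y_1 \\ y_2 \end{pmatrix} \]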
If all we are doing is rotating the axes of the data, we can define p
projections from the p original variables. The projection coefficients
must have certain properties if they are to describe a true (rigid)
axis rotation.
Properties:
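\[ \sum_{k=1}^{p} a_{ik} a_{jk} = 0 \quad \text{for all } i \neq j \]
Orthogonal projections.
[Figure: scatter plot of Variable 2 (y2) against Variable 1 (y1) with the rotated axes meeting at 90 degrees]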
Note: Zi is of order n x 1.
Var(z1)>Var(z2)
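A minimal numerical sketch of a rigid axis rotation in two dimensions. The 30 degree angle, the random seed and the simulated data are illustrative assumptions, not values from the slides. The sketch checks that the rows of A are orthogonal and of unit length, and that the rotation leaves the total variance unchanged.

```python
import numpy as np

# Hypothetical rotation angle; any angle gives a rigid rotation.
theta = np.deg2rad(30.0)
A = np.array([[np.cos(theta), np.sin(theta)],
              [-np.sin(theta), np.cos(theta)]])

# Hypothetical correlated data; each row is one observation (y1, y2).
rng = np.random.default_rng(0)
y = rng.multivariate_normal([0.0, 0.0], [[4.0, 3.0], [3.0, 4.0]], size=200)

# z = A y, applied to every observation at once.
z = y @ A.T

# Orthogonal projections: sum_k a_1k * a_2k = 0.
rows_orthogonal = np.isclose(A[0] @ A[1], 0.0)
# Unit-length coefficient vectors: sum_k a_ik^2 = 1.
rows_unit_length = np.allclose((A ** 2).sum(axis=1), 1.0)

# A rigid rotation redistributes variance between axes but keeps the total.
total_var_y = y.var(axis=0, ddof=1).sum()
total_var_z = z.var(axis=0, ddof=1).sum()
```

PCA chooses the particular rotation for which Var(z1) is as large as possible.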
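\[ z = A(y - \bar{y}) = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix} \begin{pmatrix} y_1 - \bar{y}_1 \\ y_2 - \bar{y}_2 \end{pmatrix} \]

[Figure: scatter plot of Variable 2 (y2) against Variable 1 (y1) with the rotated axes z1 and z2]
Eigenvalues
Let S(k,k) be a square matrix and I(k,k) an identity matrix.
The scalars $\lambda_1, \lambda_2, \ldots, \lambda_k$ that satisfy the polynomial equation

\[ |S - \lambda I| = 0 \qquad \text{(characteristic equation)} \]

are the eigenvalues of S.

Example:

\[ S = \begin{pmatrix} 1 & 0 \\ 1 & 3 \end{pmatrix} \]

\[ |S - \lambda I| = \begin{vmatrix} 1 - \lambda & 0 \\ 1 & 3 - \lambda \end{vmatrix} = (1 - \lambda)(3 - \lambda) = 0 \]

Eigenvalues: $\lambda_1 = 1$, $\lambda_2 = 3$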
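The 2 x 2 eigenvalue example (S with rows 1, 0 and 1, 3) can be checked numerically. This is a small sketch using NumPy: `np.poly` returns the coefficients of the characteristic polynomial of a square matrix, and `np.linalg.eigvals` computes the eigenvalues directly.

```python
import numpy as np

# The example matrix from the slide.
S = np.array([[1.0, 0.0],
              [1.0, 3.0]])

# Coefficients of det(S - lambda*I) = lambda^2 - 4*lambda + 3.
char_poly = np.poly(S)

# Roots of the characteristic equation, sorted in ascending order.
roots = np.sort(np.roots(char_poly).real)

# The same values from a direct eigenvalue routine.
eigenvalues = np.sort(np.linalg.eigvals(S).real)
```

Both routes give the eigenvalues 1 and 3, matching the factorisation (1 - lambda)(3 - lambda) = 0.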
Eigenvectors
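\[ S a_i = \lambda_i a_i \]

Example: eigenvector for $\lambda_1 = 1$, with

\[ S = \begin{pmatrix} 1 & 0 \\ 1 & 3 \end{pmatrix}, \qquad \lambda_1 = 1, \quad \lambda_2 = 3 \]

\[ \begin{pmatrix} 1 & 0 \\ 1 & 3 \end{pmatrix} \begin{pmatrix} a_{11} \\ a_{21} \end{pmatrix} = 1 \cdot \begin{pmatrix} a_{11} \\ a_{21} \end{pmatrix} \]

\[ a_{11} = a_{11} \]
\[ a_{11} + 3a_{21} = a_{21} \;\Rightarrow\; a_{11} = -2a_{21} \]
\[ a_{21} = 1 \;\Rightarrow\; a_{11} = -2, \qquad a_1 = \begin{pmatrix} -2 \\ 1 \end{pmatrix} \]

Principal Components

Data matrix:
\[ X_{(p,n)} = [x_1 \; x_2 \; \cdots \; x_n] \]

Mean vector:
\[ \bar{x}_{(p,1)} = \begin{pmatrix} \bar{x}_1 \\ \bar{x}_2 \\ \vdots \\ \bar{x}_p \end{pmatrix} \]

Sample variances:
\[ S_{(p,p)} = \begin{pmatrix} \sigma_{11} & \sigma_{12} & \cdots & \sigma_{1p} \\ \sigma_{21} & \sigma_{22} & \cdots & \sigma_{2p} \\ \vdots & \vdots & & \vdots \\ \sigma_{p1} & \sigma_{p2} & \cdots & \sigma_{pp} \end{pmatrix} \]

Let $\lambda_1, \lambda_2, \ldots, \lambda_p$ be the eigenvalues of S.
Let $a_1, a_2, \ldots, a_p$ be the eigenvectors of S.

\[ z_i = a_i'(x - \bar{x}) \]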
Principal Components
\[ \sum_{k=1}^{p} a_{1k}^2 = a_1' a_1 = 1 \qquad \text{(normalized coefficients)} \]
\[ \sum_{k=1}^{p} a_{2k}^2 = a_2' a_2 = 1 \qquad \text{(normalized coefficients)} \]

\[ z_1 = a_{11}(x_1 - \bar{x}_1) + a_{12}(x_2 - \bar{x}_2) + \cdots + a_{1p}(x_p - \bar{x}_p) \]
\[ z_2 = a_{21}(x_1 - \bar{x}_1) + a_{22}(x_2 - \bar{x}_2) + \cdots + a_{2p}(x_p - \bar{x}_p) \]
\[ \vdots \]
\[ z_p = a_{p1}(x_1 - \bar{x}_1) + a_{p2}(x_2 - \bar{x}_2) + \cdots + a_{pp}(x_p - \bar{x}_p) \]
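The definition of the principal components as z_i = a_i'(x - x̄), with normalized eigenvectors of S as coefficients, can be sketched as follows. The simulated data, its mean and covariance, and the seed are illustrative assumptions. Observations are stored as rows here (n x p), the transpose of the p x n layout used for X above.

```python
import numpy as np

# Hypothetical data: n = 100 observations (rows) on p = 3 variables.
rng = np.random.default_rng(1)
X = rng.multivariate_normal([10.0, 5.0, 0.0],
                            [[4.0, 2.0, 1.0],
                             [2.0, 3.0, 0.5],
                             [1.0, 0.5, 2.0]], size=100)

x_bar = X.mean(axis=0)            # mean vector
S = np.cov(X, rowvar=False)       # sample covariance matrix (divisor n - 1)

# eigh is meant for symmetric matrices and returns ascending eigenvalues,
# so reorder to put the largest eigenvalue first.
eigvals, eigvecs = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]
lam = eigvals[order]
A = eigvecs[:, order].T           # row i of A holds the coefficients a_i'

# Principal component scores: z_i = a_i'(x - x_bar) for every observation.
Z = (X - x_bar) @ A.T
cov_Z = np.cov(Z, rowvar=False)
```

The rows of A are orthonormal, each satisfies S a_i = lambda_i a_i, and the covariance matrix of the scores is diagonal with the eigenvalues on the diagonal.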
Total Variance
Apply the rules for the first and second principal components to all the
components to obtain the one rigid projection which has the property:

\[ COV(Z) = \Lambda = \begin{pmatrix} \lambda_{11} & 0 & \cdots & 0 \\ 0 & \lambda_{22} & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & \lambda_{pp} \end{pmatrix}, \qquad \lambda_{ii} = \lambda_i = var(z_i) \]

\[ \text{Total Variance} = TV = \mathrm{Trace}(S) = \mathrm{tr}(\Lambda) = \sum_{k=1}^{p} \sigma_{kk} = \sum_{k=1}^{p} \lambda_k \]

Fraction of the total variance associated with each component:

\[ \frac{var(z_1)}{TV}, \quad \frac{var(z_2)}{TV}, \quad \ldots, \quad \frac{var(z_p)}{TV} \]

If most of the total variation can be associated with the first couple of
principal components, a plot of the data in these two coordinates should
show most of the useful information about structure in the data.

\[ corr(x_i, z_j) = \frac{a_{ji} \sqrt{\lambda_j}}{\sqrt{var(x_i)}} \]

If there are dependencies among the variables in the data, some of the
principal components may have zero variance and explain nothing of
the total variation.
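A short sketch of the total variance identity, the fraction of variance per component, and the correlation between an original variable and a component. The simulated data and seed are illustrative assumptions. The correlation formula is compared against correlations computed directly from the scores.

```python
import numpy as np

# Hypothetical data: 150 observations (rows) on 3 variables.
rng = np.random.default_rng(2)
X = rng.multivariate_normal([0.0, 0.0, 0.0],
                            [[9.0, 3.0, 1.0],
                             [3.0, 4.0, 0.5],
                             [1.0, 0.5, 1.0]], size=150)

S = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]
lam, P = eigvals[order], eigvecs[:, order]   # column j of P is a_j

# Total variance is the trace of S, which equals the sum of the eigenvalues.
total_variance = np.trace(S)
proportion = lam / total_variance
cumulative = np.cumsum(proportion)

# corr(x_i, z_j) = a_ji * sqrt(lambda_j) / sqrt(var(x_i)); entry [i, j].
loadings_corr = P * np.sqrt(lam) / np.sqrt(np.diag(S))[:, None]

# Direct check: correlate the original variables with the scores.
Z = (X - X.mean(axis=0)) @ P
direct = np.corrcoef(X, Z, rowvar=False)[:3, 3:]
```

The array `proportion` holds var(z_j)/TV for each component, and `loadings_corr` agrees with the directly computed correlations.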
Collinear Data
Covariance vs Correlation
[Figure: scatter plot of Variable 2 (y2) against Variable 1 (y1) for nearly collinear data, with the rotated axes at 90 degrees; Var(z1) >> Var(z2). A panel label reads SAME.]
Conceptual Model
[Diagram: principal components (latent factors) z1, z2, z3, ..., zp linked to observed responses x1, x2, x3, ..., xp. Annotation: Percent Variation Explained.]
[Figure: scatter plots of x2 against x1 with the z1 axis drawn through the points. A panel label reads DIFFERENT.]
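Constructing a Bi-Plot
R-mode decomposition

\[ S_{p \times p} = \frac{1}{n-1} X' \left( I - \frac{1}{n} 1_n 1_n' \right) X = P \Lambda P' \]

\[ P_{p \times p} = [a_1, a_2, \ldots, a_p] \]
\[ \Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_p) \]
\[ P'P = PP' = I \]
\[ 1_n = (1, 1, \ldots, 1)' \quad (n \times 1) \]

Constructing a Bi-Plot
Q-mode decomposition

\[ \frac{1}{p-1} X \left( I - \frac{1}{p} 1_p 1_p' \right) X' = Q \Lambda Q' \qquad (n \times n) \]

\[ Q_{n \times p} = [e_1, e_2, \ldots, e_p] \]
\[ \Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_p) \quad \text{(same as on previous slide)} \]
\[ Q'Q = QQ' = I \]
\[ 1_p = (1, 1, \ldots, 1)' \quad (p \times 1) \]

\[ X_{(n,p)} = Q_{(n,p)} \, \Lambda^{1/2}_{(p,p)} \, P'_{(p,p)} \]

Or:

\[ X = H_{(n,p)} G'_{(p,p)}, \qquad H = Q \Lambda^{\alpha/2}, \qquad G = P \Lambda^{(1-\alpha)/2}, \qquad \alpha = 0, \; 1, \; \text{or} \; \tfrac{1}{2} \]

Annotations on the factors: Observation; Scaled Principal Components; Original Variable Scales.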
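One common way to obtain the two bi-plot factors is through the singular value decomposition of the centered data matrix. The sketch below assumes that route. The simulated data, the seed and the helper name `biplot_factors` are hypothetical, and the scale parameter alpha is applied to the singular values, which is one of several equivalent conventions. In terms of the slides, the columns of U play the role of Q, the columns of V play the role of P, and the squared singular values divided by n - 1 are the eigenvalues of S.

```python
import numpy as np

# Hypothetical 20 x 4 data matrix, centered column by column.
rng = np.random.default_rng(3)
X_raw = rng.normal(size=(20, 4)) @ rng.normal(size=(4, 4))
X = X_raw - X_raw.mean(axis=0)
n = X.shape[0]

# Thin SVD: X = U diag(d) V'.
U, d, Vt = np.linalg.svd(X, full_matrices=False)

def biplot_factors(alpha):
    """Split the singular values between row markers H and column markers G."""
    H = U * d ** alpha              # one row per observation
    G = Vt.T * d ** (1.0 - alpha)   # one row per original variable
    return H, G

# alpha = 0, 1 or 1/2 all reproduce X exactly as H G'.
H, G = biplot_factors(0.5)

# Link to the R-mode decomposition: eigenvalues of S from the singular values.
lam = np.sort(np.linalg.eigvalsh(np.cov(X, rowvar=False)))[::-1]
```

Plotting the first two columns of H as points and the first two columns of G as arrows gives the bi-plot. The choice of alpha only changes how the scale is shared between the two sets of markers.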
PCA Issues
How many components to plot depends on the relative values of the
eigenvalues and on the analyst's criteria for how much of the total
variation must be explained. Typical criteria are described below.
Latent Root Criterion: Plot combinations of components having
eigenvalues > 1. Use fewer factors if the number of variables
is less than 50 and more if the number of variables is greater
than 50.
Percentage of Variance Criterion: Consider components
important until the fraction of explained variance exceeds
some pre-specified level, say 95% in the natural sciences or
60% in the social sciences, or until the last component
added contributes less than 5%.
Scree Test Criterion: Examine the scree plot to identify
the number of components at which the curve first begins to
straighten out.
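The three criteria can be sketched in a few lines. The eigenvalues used here are the correlation-based eigenvalues reported for the alligator data later in these slides. The helper name `n_for_variance` is hypothetical, and the scree test is reduced to listing the successive drops, since judging where the curve straightens remains a visual decision.

```python
import numpy as np

# Correlation-based eigenvalues for the alligator data (9 variables).
eigenvalues = np.array([8.39494, 0.34324, 0.11491, 0.06456, 0.03947,
                        0.02908, 0.01174, 0.00179, 0.00027])

proportion = eigenvalues / eigenvalues.sum()
cumulative = np.cumsum(proportion)

# Latent root criterion: count components with eigenvalue > 1.
n_latent_root = int((eigenvalues > 1.0).sum())

def n_for_variance(level):
    """Smallest number of components whose cumulative fraction reaches level."""
    return int(np.searchsorted(cumulative, level) + 1)

# Percentage of variance criterion at the two levels named in the text.
n_95 = n_for_variance(0.95)
n_60 = n_for_variance(0.60)

# Scree test aid: drop in eigenvalue from each component to the next.
scree_drops = -np.diff(eigenvalues)
```

For these eigenvalues the latent root criterion keeps one component, the 60% level also keeps one, and the 95% level keeps two.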
Scree Plot
PCA Summary
Outliers
[Figure: scatter plot of points marked x]
Types of Outliers
Detecting Outliers
Multivariate Detection:
Compute and rank observations on their Mahalanobis distance
from the centroid (typically computed leaving the observation
out).
Examine scatterplots of principal components. Observations
having PCA scores that are large, especially for the first
component, are good outlier candidates.
Euclidean Distance from centroid:

\[ d_E(x_i, \bar{x}) = (x_i - \bar{x})'(x_i - \bar{x}) = \sum_{k=1}^{p} (x_{ik} - \bar{x}_k)^2 \]

Mahalanobis Distance:

\[ d_M(x_i, \bar{x}) = (x_i - \bar{x})' S^{-1} (x_i - \bar{x}) \]
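A minimal sketch of ranking observations by their distance from the centroid, using the two distance formulas as written (both are in squared form). The simulated data, the seed and the planted outlier are illustrative assumptions. For simplicity the centroid and S are computed from all observations, without the leave-one-out refinement mentioned above.

```python
import numpy as np

# Hypothetical positively correlated data with one planted outlier that
# lies against the direction of the correlation.
rng = np.random.default_rng(4)
X = rng.multivariate_normal([0.0, 0.0], [[4.0, 3.0], [3.0, 4.0]], size=50)
X[0] = [6.0, -6.0]

x_bar = X.mean(axis=0)
S = np.cov(X, rowvar=False)
diff = X - x_bar

# Euclidean distance (squared form): sum_k (x_ik - x_bar_k)^2.
d_euclid = (diff ** 2).sum(axis=1)

# Mahalanobis distance (squared form): (x_i - x_bar)' S^{-1} (x_i - x_bar).
d_mahal = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(S), diff)

# Observations ranked from most to least distant.
ranking = np.argsort(d_mahal)[::-1]
```

Because the Mahalanobis distance accounts for the correlation between the variables, the planted point ranks first even though its coordinates are not extreme on either variable alone.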
Correspondence Analysis
Crocodilia Skull
Iordansky, N.N., 1973, The Skull of Crocodilia, Chapter 3 in Biology of the Reptilia,
Volume 4 Morphology D, C. Gans and T. S. Parsons (eds), Academic Press,
London, p201-262.
Species: Crocodylus niloticus, Crocodylus porosus, Osteolaemus tetraspis, Alligator
mississippiensis.
Cranial Measurements
cl - cranial length (anterior tip of snout to posterior surface of occipital condyle)
cw - cranial width (between lateral surfaces of mandibular condyles of quadrates)
sw - basal width of snout (on level with anterior orbital borders)
sl - snout length (anterior tip of snout to middle of posterior margin cranial roof)
dcl - dorsal cranial length
ow - maximal orbital width
ol - maximal orbital length
oiw - minimal interorbital width
lcr - length of postorbital cranial roof
wcr - posterior width of cranial roof
wn - maximal width of external nares
[Figure: skull diagram labelled with the measurements cl, wn, dcl, sl, ol, ow, oiw, sw, cw]
Species
Crocodylus niloticus
Crocodilia
Skull Dataset
Crocodylus porosus
Osteolaemus tetraspis
iow
9
13
16
16
42
50
48
58
90
lcr
30
32
42
42
68
70
82
76
76
wcr
39
48
.
65
105
120
145
.
164
wn
9
13
15
15
42
48
54
57
56
76
238
408
548
565
672
800
30
74
.
200
300
292
384
416
41
.
154
274
364
405
452
516
73
.
230
390
513
550
620
740
13
23
29
38
46
45
50
63
3.5 17 16
10 29 26
12 36 30
36 57 54
55 68 65
64 70 90
70 90 85
82 100 105
20
44
55
110
150
160
185
204
4
.
.
32
.
48
64
75
164
.
170
173
175
185
185
188
188
190
194
194
203
210
225
240
90 70
90
.
71
92
98 72
98
.
70 100
102 73 102
105 77 105
105 78 105
.
82 108
104 80 110
108 80 112
110 82 114
117 92 117
108 88 116
.
91 124
128 105 128
136 91 133
160
160
165
165
165
175
175
180
178
180
182
180
193
.
215
222
36
29
31
33
32
32
33
33
34
32
34
34
35
36
40
38
16
13
14
12
14
14
16
16
15
16
15
18
16
19
20
19
57
.
60
60
64
61
61
65
64
65
67
70
69
.
75
76
20
18
20
22
24
22
22
24
24
24
24
23
26
26
28
27
40.0
112
.
138
148
150
150
150
178
186
236
37.3
98
89
120
126
117
127
124
137
160
210
35.0
138
140
175
180
183
166
203
240
232
238
wcr
lcr
ol
22
31
41
40
60
65
72
72
85
42
38
42
40
42
44
40
40
44
45
44
43
46
48
52
51
32
35
35
35
38
40
40
40
40
38
38
42
40
.
45
46
Gator Example
Sample Covariance Matrix, S
        cl        sw        sl        dcl       ow        iow      ol       lcr      wn
cl      29618.5   12329.55  20581.08  27602.5   1410.027  3528.36  2938.29  3363.80  2693.47
sw      12329.6   5365.95   8531.87   11469.3   621.857   1480.52  1270.27  1409.79  1202.12
sl      20581.1   8531.87   14376.16  19206.4   958.091   2448.47  2023.33  2336.60  1855.84
dcl     27602.5   11469.28  19206.37  25754.6   1312.784  3284.38  2738.27  3140.15  2515.82
ow      1410.0    621.86    958.09    1312.8    94.726    161.24   162.41   170.97   151.23
iow     3528.4    1480.52   2448.47   3284.4    161.241   453.76   348.01   399.78   310.20
ol      2938.3    1270.27   2023.33   2738.3    162.413   348.01   333.51   337.42   303.07
lcr     3363.8    1409.79   2336.60   3140.1    170.970   399.78   337.42   406.08   311.25
wn      2693.5    1202.12   1855.84   2515.8    151.233   310.20   303.07   311.25   302.41

Sample Correlation Matrix
        cl    sw    sl    dcl   ow    iow   ol    lcr   wn
cl      1.00  0.98  1.00  1.00  0.84  0.96  0.93  0.97  0.90
sw      0.98  1.00  0.97  0.98  0.87  0.95  0.95  0.96  0.94
sl      1.00  0.97  1.00  1.00  0.82  0.96  0.92  0.97  0.89
dcl     1.00  0.98  1.00  1.00  0.84  0.96  0.93  0.97  0.90
ow      0.84  0.87  0.82  0.84  1.00  0.78  0.91  0.87  0.89
iow     0.96  0.95  0.96  0.96  0.78  1.00  0.89  0.93  0.84
ol      0.93  0.95  0.92  0.93  0.91  0.89  1.00  0.92  0.95
lcr     0.97  0.96  0.97  0.97  0.87  0.93  0.92  1.00  0.89
wn      0.90  0.94  0.89  0.90  0.89  0.84  0.95  0.89  1.00

Many high correlations are a sign that most measurements are probably
measuring the same thing.
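The correlation matrix follows from the covariance matrix by dividing each covariance by the product of the two standard deviations. A small sketch using the cl, sw and sl entries of the sample covariance matrix S for the gator example (a 3 x 3 subset chosen only to keep the example short):

```python
import numpy as np

# cl, sw, sl block of the sample covariance matrix S from the gator example.
names = ["cl", "sw", "sl"]
S = np.array([[29618.5, 12329.55, 20581.08],
              [12329.55, 5365.95, 8531.87],
              [20581.08, 8531.87, 14376.16]])

# Standard deviations are the square roots of the diagonal of S.
d = np.sqrt(np.diag(S))

# r_ij = s_ij / (sd_i * sd_j)
R = S / np.outer(d, d)
R_rounded = np.round(R, 2)
```

Rounded to two decimals, the result reproduces the corresponding correlations for these three variables (0.98 for cl and sw, 1.00 for cl and sl, 0.97 for sw and sl).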
SAS Program
Alligator Data
Principal Components - Covariance Based

Simple Statistics
        CL            SW            SL            DCL           OW
Mean    300.7918919   123.3054054   190.1891892   285.0405405   35.7486486
StD     172.1001878   73.2526145    119.9006158   160.4822757   9.7327517

        IOW           OL            LCR           WN
Mean    27.37027027   52.13513514   48.75675676   32.90540541
StD     21.30168049   18.26222903   20.15137906   17.39006654

Covariance Matrix
        CL        SW        SL        DCL       OW       IOW      OL       LCR      WN
CL      29618.47  12329.55  20581.08  27602.51  1410.02  3528.36  2938.29  3363.80  2693.47
SW      12329.55  5365.94   8531.87   11469.28  621.85   1480.51  1270.27  1409.79  1202.11
SL      20581.08  8531.87   14376.15  19206.36  958.09   2448.46  2023.33  2336.60  1855.83
DCL     27602.51  11469.28  19206.36  25754.56  1312.78  3284.37  2738.27  3140.14  2515.81
OW      1410.02   621.85    958.09    1312.78   94.72    161.24   162.41   170.97   151.23
IOW     3528.36   1480.51   2448.46   3284.37   161.24   453.76   348.00   399.77   310.20
OL      2938.29   1270.27   2023.33   2738.27   162.41   348.00   333.50   337.42   303.06
LCR     3363.80   1409.79   2336.60   3140.14   170.97   399.77   337.42   406.07   311.25
WN      2693.47   1202.11   1855.83   2515.81   151.23   310.20   303.06   311.25   302.41

        Eigenvalue  Difference  Proportion  Cumulative
PRIN1   76225.1     75931.0     0.993735    0.99374
PRIN2   294.1       228.8       0.003834    0.99757
PRIN3   65.3        18.7        0.000852    0.99842
PRIN4   46.6        20.9        0.000608    0.99903
PRIN5   25.7        1.1         0.000335    0.99936
PRIN6   24.6        10.9        0.000321    0.99969
PRIN7   13.7        6.9         0.000179    0.99986
PRIN8   6.9         3.3         0.000089    0.99995
PRIN9   3.6         .           0.000047    1.00000

Eigenvectors (columns PRIN1 to PRIN4 are marked a1, a2, a3, a4 on the slide)
        PRIN1    PRIN2    PRIN3    PRIN4    PRIN5    PRIN6    PRIN7    PRIN8    PRIN9
CL      0.6231   -.0697   0.0659   0.4434   -.0696   -.3956   0.2656   0.3155   0.272433
SW      0.2599   0.8280   -.4137   -.0804   0.1052   -.1096   0.0159   -.0918   -.193020
SL      0.4334   -.2808   -.2957   -.6573   0.0198   0.2611   0.3234   -.0426   0.196368
DCL     0.5810   -.1597   0.2593   0.0090   -.0140   0.0976   -.5311   -.3254   -.414402
OW      0.0297   0.1787   0.3956   0.1109   0.2863   0.0727   0.3048   -.6882   0.380048
IOW     0.0742   0.0132   -.3808   0.4870   -.1902   0.6776   -.1667   -.0533   0.293120
OL      0.0619   0.2324   0.4439   0.0058   -.3869   0.4242   0.4770   0.2031   -.382608
LCR     0.0708   0.0353   0.1577   0.0502   0.8161   0.3252   -.0357   0.4351   -.061446
WN      0.0569   0.3429   0.3822   -.3355   -.2224   0.0584   -.4390   0.2776   0.545631
Correlation Matrix
        CL      SW      SL      DCL     OW      IOW     OL      LCR     WN
CL      1.0000  0.9780  0.9974  0.9994  0.8418  0.9624  0.9349  0.9699  0.9000
SW      0.9780  1.0000  0.9714  0.9756  0.8722  0.9488  0.9496  0.9551  0.9437
SL      0.9974  0.9714  1.0000  0.9982  0.8210  0.9586  0.9240  0.9671  0.8901
DCL     0.9994  0.9756  0.9982  1.0000  0.8405  0.9608  0.9343  0.9710  0.9015
OW      0.8418  0.8722  0.8210  0.8405  1.0000  0.7777  0.9138  0.8717  0.8935
IOW     0.9624  0.9488  0.9586  0.9608  0.7777  1.0000  0.8946  0.9313  0.8374
OL      0.9349  0.9496  0.9240  0.9343  0.9138  0.8946  1.0000  0.9169  0.9543
LCR     0.9699  0.9551  0.9671  0.9710  0.8717  0.9313  0.9169  1.0000  0.8882
WN      0.9000  0.9437  0.8901  0.9015  0.8935  0.8374  0.9543  0.8882  1.0000

        Eigenvalue  Difference  Proportion  Cumulative
PRIN1   8.39494     8.05171     0.932772    0.93277
PRIN2   0.34324     0.22833     0.038137    0.97091
PRIN3   0.11491     0.05035     0.012768    0.98368
PRIN4   0.06456     0.02509     0.007173    0.99085
PRIN5   0.03947     0.01039     0.004386    0.99524
PRIN6   0.02908     0.01735     0.003232    0.99847
PRIN7   0.01174     0.00994     0.001304    0.99977
PRIN8   0.00179     0.00152     0.000199    0.99997
PRIN9   0.00027     .           0.000030    1.00000
Principal component scores, sorted by PRIN1 (ID and species labels for observations 1 to 17 did not survive extraction)

Obs  PRIN1      PRIN2
1    -364.559   2.5237
2    -362.899   -16.0635
3    -230.047   -26.7002
4    -218.580   6.0300
5    -208.032   3.4081
6    -205.915   1.5330
7    -202.452   4.4206
8    -187.918   4.4455
9    -187.727   4.5490
10   -181.110   5.8937
11   -180.498   6.6968
12   -177.899   4.6904
13   -172.934   5.2983
14   -170.476   -16.1641
15   -169.809   12.6619
16   -158.023   8.7339
17   -122.075   6.6561

Obs  ID    Species                      PRIN1      PRIN2
18   ot25  Osteolaemus_tetraspis        -120.623   17.5970
19   am3   Alligator_mississippiensis   -117.589   -0.8882
20   ot26  Osteolaemus_tetraspis        -108.862   1.5243
21   cn3   Crocodylus_niloticus         -90.998    -28.8843
22   cn4   Crocodylus_niloticus         -89.114    -33.7525
23   am4   Alligator_mississippiensis   -39.231    7.0979
24   am5   Alligator_mississippiensis   -17.506    11.8351
25   am8   Alligator_mississippiensis   32.864     -3.6779
26   am9   Alligator_mississippiensis   90.677     -3.4041
27   am10  Alligator_mississippiensis   107.824    15.4461
28   am11  Alligator_mississippiensis   138.929    57.7779
29   cp4   Crocodylus_porosus           171.879    -25.8088
30   cn5   Crocodylus_niloticus         191.317    -4.3792
31   cn6   Crocodylus_niloticus         221.273    -7.4872
32   cn7   Crocodylus_niloticus         367.068    -1.8396
33   cp6   Crocodylus_porosus           443.711    -31.3928
34   cn8   Crocodylus_niloticus         446.466    -21.5890
35   cn9   Crocodylus_niloticus         493.593    13.9546
36   cp7   Crocodylus_porosus           596.188    18.9111
37   cp8   Crocodylus_porosus           783.087    0.3463
Alternate Plot
[Figure: plots of the principal component scores. Annotations: "Same Scales"; "Note differences in scales."; "Gator Size".]
Copy the BIPLOT SAS macro from the course datasets web page and store
it on your system.