The $(k, h)$ entry of the matrix product $C = AB$ (with $A$ of dimension $K \times L$ and $B$ of dimension $L \times H$) is
$c_{kh} = \sum_{l=1}^{L} a_{kl} b_{lh}, \qquad k = 1, \ldots, K;\ h = 1, \ldots, H.$
Properties: Let $A_{m \times n}$, $B_{n \times p}$, $C_{p \times q}$, $D_{n \times p}$, $E_{n \times n}$ and $F_{n \times n}$:
$(AB)C = A(BC)$
$A(B + D) = AB + AD$
$(B + D)C = BC + DC$
$EF \neq FE$ in general
The square matrix $A_{K \times K}$ is idempotent if $A^2 = A$.
$A_{K \times K}$ is orthogonal if $A'A = I$.
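These two definitions are easy to check numerically; a minimal sketch in NumPy with two hypothetical matrices (an averaging matrix and a rotation):

```python
import numpy as np

# Hypothetical idempotent matrix: projecting twice changes nothing, A @ A == A.
A = np.array([[0.5, 0.5],
              [0.5, 0.5]])
print(np.allclose(A @ A, A))            # True

# Hypothetical orthogonal matrix: a rotation satisfies Q' Q == I.
theta = np.pi / 3
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
print(np.allclose(Q.T @ Q, np.eye(2)))  # True
```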
CHAPTER 1. BACKGROUND MATHEMATICS 10
The rank of a matrix

$Q$ vectors of the same dimension $y_1, \ldots, y_Q$ are said to be linearly independent if
$\sum_{q=1}^{Q} \alpha_q y_q = 0$
holds only for $\alpha_1 = \alpha_2 = \ldots = \alpha_Q = 0$.

Let $A$ be an $n \times p$ matrix.
The column rank is the maximum number of linearly independent columns.
The row rank is the maximum number of linearly independent rows.
The two ranks are equal; their common value is called the rank and denoted by $r(A)$.
$r(A) \leq \min(n, p)$
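As an illustration, a hypothetical $4 \times 3$ matrix whose third column is the sum of the first two has only 2 linearly independent columns, so its rank falls below $\min(n, p)$:

```python
import numpy as np

# Third column = first column + second column -> linearly dependent columns.
A = np.array([[1., 0., 1.],
              [0., 1., 1.],
              [2., 3., 5.],
              [1., 1., 2.]])
print(np.linalg.matrix_rank(A))   # 2
print(min(A.shape))               # r(A) <= min(n, p) = 3
```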
The determinant of $A_{K \times K}$

The determinant of a square matrix $A_{K \times K}$ is a scalar, denoted $|A|$, given by:

$K = 1$: if $A = a$, then $|A| = a$;

$K = 2$: if $A = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}$, then $|A| = a_{11} a_{22} - a_{21} a_{12}$;

$K = 3$: if $A = \begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{pmatrix}$, then
$|A| = a_{11} a_{22} a_{33} + a_{12} a_{23} a_{31} + a_{13} a_{21} a_{32} - a_{11} a_{23} a_{32} - a_{13} a_{22} a_{31} - a_{12} a_{21} a_{33}$;

If $K > 3$ then
$|A| = \sum_{l=1}^{K} a_{kl} A_{kl} \qquad \forall k \in \{1, \ldots, K\}$
where $A_{kl} = (-1)^{k+l} |M_{kl}|$, with $M_{kl}$ the square sub-matrix of $A$ without row $k$ and column $l$.
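The cofactor expansion can be turned directly into code. A minimal sketch (a hypothetical recursive helper, not an efficient method) compared against NumPy's determinant:

```python
import numpy as np

def det_cofactor(A, k=0):
    """Determinant by cofactor expansion along row k (0-based)."""
    K = A.shape[0]
    if K == 1:
        return A[0, 0]
    total = 0.0
    for l in range(K):
        # M_kl: sub-matrix of A without row k and column l
        M = np.delete(np.delete(A, k, axis=0), l, axis=1)
        total += (-1) ** (k + l) * A[k, l] * det_cofactor(M)
    return total

A = np.array([[2., 1., 0.],
              [1., 3., 1.],
              [0., 1., 2.]])
print(det_cofactor(A))                                   # 8.0
print(np.isclose(det_cofactor(A), np.linalg.det(A)))     # True
```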
The trace of $A_{K \times K}$

The trace of a square $K \times K$ matrix $A$ is the sum of its diagonal elements:
$\mathrm{tr}(A) = \sum_{i=1}^{K} a_{ii}$

Example:
$A = \begin{pmatrix} 3 & 2 \\ 1 & 2 \end{pmatrix} \Rightarrow \mathrm{tr}(A) = 3 + 2 = 5$

Properties: Let $A_{m \times m}$, $B_{m \times m}$:
$\mathrm{tr}(A + B) = \mathrm{tr}(A) + \mathrm{tr}(B)$
$\mathrm{tr}(\alpha A) = \alpha\,\mathrm{tr}(A)$ where $\alpha$ is a scalar
$\mathrm{tr}(A') = \mathrm{tr}(A)$
$\mathrm{tr}(AB) = \mathrm{tr}(BA)$
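The slide's example and the four properties can be verified numerically (the matrix $B$ below is a hypothetical random matrix used only for the check):

```python
import numpy as np

# Slide example: tr(A) = 3 + 2 = 5.
A = np.array([[3., 2.],
              [1., 2.]])
print(np.trace(A))  # 5.0

# Verify the four trace properties on a hypothetical second matrix B.
rng = np.random.default_rng(0)
B = rng.normal(size=(2, 2))
print(np.isclose(np.trace(A + B), np.trace(A) + np.trace(B)))  # True
print(np.isclose(np.trace(2.5 * A), 2.5 * np.trace(A)))        # True
print(np.isclose(np.trace(A.T), np.trace(A)))                  # True
print(np.isclose(np.trace(A @ B), np.trace(B @ A)))            # True
```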
Quadratic forms

Let $x$ be a $K \times 1$ vector and $A$ a $K \times K$ symmetric matrix; then the double sum of the form
$F(x_1, x_2, \ldots, x_K) = \sum_{i=1}^{K} \sum_{j=1}^{K} x_i x_j a_{ij} = x'Ax$
can be written as this matrix product, called a quadratic form in $x$:
$\begin{pmatrix} x_1 & x_2 & \ldots & x_K \end{pmatrix} \begin{pmatrix} a_{11} & \ldots & a_{1K} \\ a_{21} & \ldots & a_{2K} \\ \vdots & & \vdots \\ a_{K1} & \ldots & a_{KK} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_K \end{pmatrix}$

We say that $A$ is:
positive definite if $x'Ax > 0 \quad \forall x \neq 0$
positive semidefinite if $x'Ax \geq 0 \quad \forall x \neq 0$
negative definite if $x'Ax < 0 \quad \forall x \neq 0$
negative semidefinite if $x'Ax \leq 0 \quad \forall x \neq 0$
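A quick numerical check that the double sum and the matrix form agree, on a hypothetical symmetric matrix and vector:

```python
import numpy as np

# Hypothetical symmetric matrix A and vector x.
A = np.array([[2., 1.],
              [1., 2.]])
x = np.array([1., -1.])

# Double sum  sum_i sum_j x_i x_j a_ij  equals the matrix form x'Ax.
K = len(x)
double_sum = sum(x[i] * x[j] * A[i, j] for i in range(K) for j in range(K))
quad_form = x @ A @ x
print(double_sum, quad_form)  # 2.0 2.0
```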
1.2 Geometric point of view in $\mathbb{R}^P$

Consider the column vector
$a = \begin{pmatrix} a_1 \\ a_2 \\ \vdots \\ a_P \end{pmatrix} = \begin{pmatrix} a_1 & a_2 & \ldots & a_P \end{pmatrix}'.$
Geometrically, $a$ can be represented in $\mathbb{R}^P$ by the line segment $\vec{OA}$ from the origin $O$ to the point $A$ with coordinates given by the vector $a$.

$\vec{OE}_1, \vec{OE}_2, \ldots, \vec{OE}_P$ are the vectors defining $\mathbb{R}^P$, associated with
$e_1 = (1, 0, 0, \ldots, 0, 0)', \quad e_2 = (0, 1, 0, \ldots, 0, 0)', \quad \ldots, \quad e_P = (0, 0, 0, \ldots, 0, 1)'.$
Then for an observation $A$ in $\mathbb{R}^P$ with associated vector $a = (a_1\ a_2\ \cdots\ a_P)'$:
$\vec{OA} = a_1 \vec{OE}_1 + a_2 \vec{OE}_2 + \ldots + a_P \vec{OE}_P$
The scalar product $\langle \vec{OA}, \vec{OB} \rangle$ between two vectors is defined by:
$\langle \vec{OA}, \vec{OB} \rangle = a'b = (a_1, \ldots, a_P)(b_1, \ldots, b_P)' = \sum_{p=1}^{P} a_p b_p$

The Euclidean norm $\|\vec{OA}\|$ measures the length of the vector:
$\|\vec{OA}\|^2 = \langle \vec{OA}, \vec{OA} \rangle = a'a = \sum_{p=1}^{P} a_p^2$

A unit vector is a vector with unit length.
The Euclidean distance $d(A, B)$ between two points $A$ and $B$ is defined by:
$d^2(A, B) = \|\vec{AB}\|^2 = \|\vec{OA} - \vec{OB}\|^2 = \sum_{p=1}^{P} (a_p - b_p)^2$
$d(O, A) = \|\vec{OA}\|$

The cosine of the angle between the vectors $\vec{OA}$ and $\vec{OB}$ is defined by:
$\cos(\vec{OA}, \vec{OB}) = \dfrac{\langle \vec{OA}, \vec{OB} \rangle}{\|\vec{OA}\| \, \|\vec{OB}\|}$

The vectors $\vec{OA}$ and $\vec{OB}$ are orthogonal iff
$\cos(\vec{OA}, \vec{OB}) = \cos(90°) = 0$
that is, iff
$\langle \vec{OA}, \vec{OB} \rangle = a'b = \sum_{p=1}^{P} a_p b_p = 0$
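All of these quantities are one-liners in NumPy; a sketch with two hypothetical orthogonal vectors in $\mathbb{R}^3$:

```python
import numpy as np

# Two hypothetical vectors in R^3, chosen to be orthogonal.
a = np.array([3., 4., 0.])
b = np.array([4., -3., 0.])

dot = a @ b                                  # scalar product <OA, OB>
norm_a = np.sqrt(a @ a)                      # norm ||OA||
dist = np.linalg.norm(a - b)                 # Euclidean distance d(A, B)
cos_ab = dot / (norm_a * np.linalg.norm(b))  # cosine of the angle

print(dot, norm_a, cos_ab)  # 0.0 5.0 0.0 -> a and b are orthogonal
```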
1.2.1 Orthogonal projection in $\mathbb{R}^1$

Orthogonal projection of observation $A$ in $\mathbb{R}^P$ on an axis $\Delta$ passing through the origin:

[Figure: the observation $A$, its projection $P_\Delta(A)$ on the axis directed by the unit vector $u$, and the length $\|\vec{OP}_\Delta(A)\|$.]

The direction is generated by the unit vector $\vec{OU}$, noted for simplicity $u$, with coordinates $u = (u_1, \ldots, u_P)'$.
The point $P_\Delta(A)$ is such that
$\|\vec{OP}_\Delta(A)\| = \cos(\theta)\,\|\vec{OA}\|$
where $\theta$ is the angle between $\vec{OA}$ and the axis. Moreover, since $\cos(\theta) = \dfrac{\langle \vec{OA}, u \rangle}{\|\vec{OA}\|}$, we obtain that:
$\|\vec{OP}_\Delta(A)\| = \langle \vec{OA}, u \rangle = \sum_{p=1}^{P} a_p u_p$
1.2.2 Orthogonal projection in a subspace of dimension $H$

A normalized orthogonal system $u_1, \ldots, u_H$ is such that:
$\|u_h\| = 1 \quad \forall h \in \{1, \ldots, H\}$
$\langle u_h, u_l \rangle = 0 \quad \forall h \neq l \in \{1, \ldots, H\}$

These vectors generate a subspace of $\mathbb{R}^P$, called $L$, of dimension $H$. This subspace contains all the linear combinations:
$\sum_{h=1}^{H} \alpha_h u_h$
The orthogonal projection of observation $A$ in $\mathbb{R}^P$ on the subspace $L$ is given by $P_L(A) \in L$. Among all the points in the subspace $L$, this point is the closest to $A$. It is given by:
$\vec{OP}_L(A) = \sum_{h=1}^{H} \langle \vec{OA}, u_h \rangle \, u_h$
$\|\vec{OP}_L(A)\|^2 = \sum_{h=1}^{H} \langle \vec{OA}, u_h \rangle^2$

[Figure: a point $A$, the plane spanned by $u_1$ and $u_2$, and the projections $P_{\Delta_1}(A)$, $P_{\Delta_2}(A)$ and $P_{(\Delta_1, \Delta_2)}(A)$.]
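A minimal sketch of this projection formula, using a hypothetical orthonormal system spanning a plane in $\mathbb{R}^3$:

```python
import numpy as np

# Hypothetical orthonormal system u1, u2 spanning a plane L in R^3.
u1 = np.array([1., 0., 0.])
u2 = np.array([0., 1., 0.])
a = np.array([2., 3., 4.])                # vector OA of observation A

# P_L(A) = sum_h <OA, u_h> u_h
proj = (a @ u1) * u1 + (a @ u2) * u2
print(proj)                               # [2. 3. 0.]
print((a @ u1) ** 2 + (a @ u2) ** 2)      # ||OP_L(A)||^2 = 13.0
```

The residual $\vec{OA} - \vec{OP}_L(A)$ is orthogonal to $L$, which is what makes $P_L(A)$ the closest point of $L$ to $A$.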
1.3 Eigenvalues and eigenvectors

Let
- $A$ be a matrix of dimension $P \times P$
- $u$ be a column vector of dimension $P \times 1$

Transformation of the space $\mathbb{R}^P$ by $A$:
$A : \mathbb{R}^P \to \mathbb{R}^P : u \mapsto Au$

$u$ is an eigenvector (non-null) of $A$ associated with the eigenvalue $\lambda$ iff:
$Au = \lambda u$
$Au - \lambda u = 0$
$(A - \lambda I)u = 0$

$\lambda$ is an eigenvalue of $A$ iff
$\det(A - \lambda I) = 0$

Comments:
- If $u$ is an eigenvector of $A$ associated with $\lambda$, then $\alpha u$ ($\alpha \in \mathbb{R} \setminus \{0\}$) is also an eigenvector associated with the same eigenvalue.
- The equation $\det(A - \lambda I) = 0$ can have no real solution. In this case, the transformation of $\mathbb{R}^P$ by the matrix $A$ has no fixed direction.
- Each matrix $A$ has at most $P$ distinct eigenvalues.
- If two real eigenvalues are the same $\Rightarrow$ there exists a plane of eigenvectors.
- Eigenvectors associated with distinct eigenvalues are linearly independent.
- Let $\lambda_1, \ldots, \lambda_P$ be the eigenvalues of $A$:
$\sum_{p=1}^{P} \lambda_p = \mathrm{trace}(A) \quad \text{and} \quad \prod_{p=1}^{P} \lambda_p = \det(A)$
Comments:
- A real symmetric matrix has only real eigenvalues.
- A singular matrix has at least one eigenvalue equal to zero.
- A symmetric matrix is positive definite if and only if all its eigenvalues are positive.
- A symmetric matrix is positive semidefinite if and only if all its eigenvalues are non-negative.

In practice, we take the eigenvectors $u_1, \ldots, u_P$ so as to have an orthonormal basis. Therefore, $A$ can be written as follows:
$A = \sum_{p=1}^{P} \lambda_p u_p u_p'$
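These facts can be checked numerically on a hypothetical symmetric matrix; `np.linalg.eigh` returns real eigenvalues and an orthonormal set of eigenvectors:

```python
import numpy as np

# Hypothetical symmetric matrix: real eigenvalues, orthonormal eigenvectors.
A = np.array([[2., 1.],
              [1., 2.]])
lam, U = np.linalg.eigh(A)                       # eigenvalues 1 and 3

print(np.isclose(lam.sum(), np.trace(A)))        # sum of eigenvalues = trace
print(np.isclose(lam.prod(), np.linalg.det(A)))  # product = determinant

# Spectral decomposition A = sum_p lambda_p u_p u_p'
recon = sum(lam[p] * np.outer(U[:, p], U[:, p]) for p in range(2))
print(np.allclose(recon, A))                     # True
```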
The particular case of the correlation matrix

The correlation matrix ($P \times P$) is given by
$R = \frac{1}{n} (X^*)' X^*$
where $X^*$ is the matrix of standardized data.

$x'Rx = \frac{1}{n} x'(X^*)'X^* x = \frac{1}{n} (X^* x)'(X^* x) = \frac{1}{n} \|X^* x\|^2 \geq 0 \quad \forall x \neq 0$

- $R$ is positive definite iff the columns are linearly independent (the matrix $X^*$ is of rank $P$).
- The number of non-zero eigenvalues is equal to the rank of $R$.
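A minimal sketch of this construction on hypothetical data, standardizing with the population standard deviation (divisor $n$) as in these notes:

```python
import numpy as np

# Hypothetical data: n = 50 observations of P = 3 variables.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))

# Standardize, then R = (1/n) (X*)' X*.
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
R = Xs.T @ Xs / X.shape[0]

print(np.allclose(np.diag(R), 1.0))       # unit diagonal
print(np.all(np.linalg.eigvalsh(R) > 0))  # positive definite here: columns independent
```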
1.4 References

Magnus, J.R., Neudecker, H. (1999), Matrix Differential Calculus with Applications in Statistics and Econometrics, Wiley Series in Probability and Statistics, England.
Chapter 2
Principal Component Analysis
(PCA)
2.1 Introduction

- Basic tool to reduce the dimension of a multivariate data matrix
- Descriptive technique using a geometrical approach to reduce the dimension
- The output consists of:
  - graphical representations of individuals showing similarities and dissimilarities
  - graphical representations of variables based on correlations
CHAPTER 2. PRINCIPAL COMPONENT ANALYSIS (PCA) 27
2.1.1 Example: Academic Ranking of World Universities (2007)

Question: Can a single indicator accurately sum up research excellence?

- Alumni (10%): Alumni recipients of the Nobel Prize or the Fields Medal;
- Award (20%): Current faculty Nobel laureates and Fields Medal winners;
- HiCi (20%): Highly cited researchers in 21 broad subject categories;
- N&S (20%): Articles published in Nature and Science;
- PUB (20%): Articles in the Science Citation Index-Expanded and the Social Science Citation Index;
- PCP (10%): The weighted score of the previous 5 indicators divided by the number of full-time academic staff members.
Case study on the TOP 50 (overall score relative to rank)

[Figure: scatterplot of overall score (0-100) against rank (0-500), with universities such as Harvard Univ., California Inst. Tech., Yale Univ., Univ. Washington - Seattle, Univ. Michigan - Ann Arbor, Univ. Paris 11, Univ. Bonn, Univ. Mainz and Univ. Auckland labeled.]
Universities                              Alumni  Award  HiCi  N&S  SCI  Size
1. Harvard Univ. 100 100 100 100 100 73
2. Stanford Univ. 42 78.7 86.1 69.6 70.3 65.7
3. Univ. California, Berkeley 72.5 77.1 67.9 72.9 69.2 52.6
4. Univ. Cambridge 93.6 91.5 54 58.2 65.4 65.1
5. Massachusetts Inst. Tech. (MIT) 74.6 80.6 65.9 68.4 61.7 53.4
6. California Inst. Tech. 55.5 69.1 58.4 67.6 50.3 100
7. Columbia Univ. 76 65.7 56.5 54.3 69.6 46.4
8. Princeton Univ. 62.3 80.4 59.3 42.9 46.5 58.9
9. Univ. Chicago 70.8 80.2 50.8 42.8 54.1 41.3
10. Univ. Oxford 60.3 57.9 46.3 52.3 65.4 44.7
Universities                              Alumni  Award  HiCi  N&S  SCI  Size
11. Yale Univ. 50.9 43.6 57.9 57.2 63.2 48.9
12. Cornell Univ. 43.6 51.3 54.5 51.4 65.1 39.9
13. Univ. California, Los Angeles 25.6 42.8 57.4 49.1 75.9 35.5
14. Univ. California, San Diego 16.6 34 59.3 55.5 64.6 46.6
15. Univ. Pennsylvania 33.3 34.4 56.9 40.3 70.8 38.7
16. Univ. Washington, Seattle 27 31.8 52.4 49 74.1 27.4
17. Univ. Wisconsin, Madison 40.3 35.5 52.9 43.1 67.2 28.6
18. Univ. California, San Francisco 0 36.8 54 53.7 59.8 46.7
19. Johns Hopkins Univ. 48.1 27.8 41.3 50.9 67.9 24.7
20. Tokyo Univ. 33.8 14.1 41.9 52.7 80.9 34
21. Univ. Michigan, Ann Arbor 40.3 0 60.7 40.8 77.1 30.7
22. Kyoto Univ. 37.2 33.4 38.5 35.1 68.6 30.6
23. Imperial Coll. London 19.5 37.4 40.6 39.7 62.2 39.4
24. Univ. Toronto 26.3 19.3 39.2 37.7 77.6 44.4
25. Univ. Coll. London 28.8 32.2 38.5 42.9 63.2 33.8
26. Univ. Illinois, Urbana Champaign 39 36.6 44.5 36.4 57.6 26.2
27. Swiss Fed. Inst. Tech. - Zurich 37.7 36.3 35.5 39.9 38.4 50.5
28. Washington Univ., St. Louis 23.5 26 39.2 43.2 53.4 39.3
29. Northwestern Univ. 20.4 18.9 46.9 34.2 57 36.9
30. New York Univ. 35.8 24.5 41.3 34.4 53.9 25.9
31. Rockefeller Univ. 21.2 58.6 27.7 45.6 23.2 37.8
32. Duke Univ. 19.5 0 46.9 43.6 62 39.2
33. Univ. Minnesota, Twin Cities 33.8 0 48.6 35.9 67 23.5
34. Univ. Colorado, Boulder 15.6 30.8 39.9 38.8 45.7 30
35. Univ. California, Santa Barbara 0 35.3 42.6 36.2 42.7 35.1
36. Univ. British Columbia 19.5 18.9 31.4 31 63.1 36.3
37. Univ. Maryland, Coll. Park 24.3 20 40.6 31.2 53.3 25.9
38. Univ. Texas, Austin 20.4 16.7 46.9 28 54.8 21.3
39. Univ. Paris VI 38.4 23.6 23.4 27.2 54.2 33.5
40. Univ. Texas Southwestern Med. Center 22.8 33.2 30.6 35.5 38 31.9
41. Vanderbilt Univ. 19.5 29.6 31.4 23.8 51 36
42. Univ. Utrecht 28.8 20.9 27.7 29.9 56.6 26.6
43. Pennsylvania State Univ. - Univ. Park 13.2 0 45.1 37.7 58 23.7
44. Univ. California, Davis 0 0 46.9 33.1 64.2 30
45. Univ. California , Irvine 0 29.4 35.5 28 48.9 32.1
46. Univ. Copenhagen 28.8 24.2 25.7 25.2 51.4 31.7
47. Rutgers State Univ., New Brunswick 14.4 20 39.9 32.1 44.8 24.2
48. Univ. Manchester 25.6 18.9 24.6 28.3 56.9 28.4
49. Univ. Pittsburgh, Pittsburgh 23.5 0 39.9 23.6 65.6 28.5
50. Univ. Southern California 0 26.8 37.1 23.4 52.7 25.9
Univariate and bivariate analysis

The first step of any statistical analysis is the univariate and bivariate analysis.

Univariate statistics:

Statistics   Alumni (X_1)  Award (X_2)  HiCi (X_3)  N&S (X_4)  SCI (X_5)  Size (X_6)
Mean 34.09 36.10 46.62 43.09 60.10 38.63
Median 38.80 32 44.80 40.10 61.85 35.30
Min 0 0 23.40 23.40 23.20 21.30
Max 100 100 100 100 100 100
Variance 525.74 625.57 207.82 217.51 156.63 212.33
Correlation matrix:
$R = \begin{pmatrix}
1.00 & 0.75 & 0.56 & 0.68 & 0.40 & 0.58 \\
0.75 & 1.00 & 0.59 & 0.73 & 0.09 & 0.74 \\
0.56 & 0.59 & 1.00 & 0.84 & 0.60 & 0.60 \\
0.68 & 0.73 & 0.84 & 1.00 & 0.49 & 0.74 \\
0.40 & 0.09 & 0.60 & 0.49 & 1.00 & 0.16 \\
0.58 & 0.74 & 0.60 & 0.74 & 0.16 & 1.00
\end{pmatrix}.$

Variables are positively correlated $\Rightarrow$ size factor
Graphics

Univariate graphs: boxplots to detect outliers.

[Figure: boxplots of the six variables, scores from 0 to 100.]

Scatterplots to detect bivariate structure.

[Figure: scatterplot of HiCi scores against SCI scores, with universities such as Harvard, Stanford, Berkeley, Cambridge, MIT, CalTech, Princeton, Chicago, Kyoto, Tokyo, Toronto, Texas Med. Center, Rockefeller and Pittsburgh labeled.]
Radar-type graph based on the TOP 10 to detect multivariate structure.

[Figure: radar chart of the six scores (Alumni, Award, HiCi, N&S, SCI, Size) for the top 10 universities.]

Visualization is not easy when the data contains a large number of individuals.
2.1.2 The geometric point of view

Data matrix $X$ ($n \times P$) is composed of $n$ observations (or individuals) and $P$ variables.

             X_1    ...   X_p    ...   X_P
1            x_11   ...   x_1p   ...   x_1P    (row: x_1')
i            x_i1   ...   x_ip   ...   x_iP    (row: x_i')
n            x_n1   ...   x_np   ...   x_nP    (row: x_n')
Mean         x̄_1   ...   x̄_p   ...   x̄_P
Variance     s²_1   ...   s²_p   ...   s²_P
Column       v_1    ...   v_p    ...   v_P

Examples:
- ARWU scores of universities on research variables
- indicators of corruption for countries, ...
Cloud of $n$ points in $\mathbb{R}^P$:
Proximity between two individuals (observations) reflects a similar behavior on the $P$ variables.

Cloud of $P$ points in $\mathbb{R}^n$:
Proximity between two variables reflects a similar behavior on the $n$ individuals.

BUT ... when $n$ and/or $P$ are large (larger than 2 or 3), we cannot produce interpretable graphs of these clouds of points.

$\Rightarrow$ Develop methods to reduce the dimension without losing too much information, namely the information about the variation and structure of the clouds in both spaces.
Simplest way of dimension reduction:
Take just one variable - not a very reasonable approach.

Alternative method:
Consider the simple average - all the elements are considered with equal importance.

Other solution:
Use a weighted average with fixed weights - the choice of weights is arbitrary.

Example: ARWU (2007)
- Take only the variable measuring the number of articles published in Nature and Science
- Summarize the 6 variables using the mean
- Use the weights proposed by the rankers
Question:
How to project the point cloud onto a space of lower dimension without losing too much information?

How to construct new uncorrelated variables $\psi_1, \psi_2, \ldots, \psi_M$ (where $M$ is small) summarizing in the best way the structure of the initial point cloud?

These new variables will be given as weighted averages, but how to choose the optimal weights?

The new variables will be called principal components.
Several criteria exist in the literature to obtain principal components:

- Inertia criterion (Pearson, 1901).
This point of view is based on a geometric approach facilitating the understanding and the interpretation of the output. Moreover, correspondence analysis for qualitative variables is a generalization of this method. This approach is extensively used in French textbooks and software.

- Correlation and variance criteria (Hotelling, 1933).
Methods used in several English textbooks and software.
2.2 The geometric approach of Pearson

2.2.1 The n-dimensional point cloud

Each individual $i$, denoted $I_i$, in $\mathbb{R}^P$ is associated with the vector $x_i = (x_{i1}, \ldots, x_{iP})'$.

$\Rightarrow$ Cloud of $n$ points: $\mathcal{C} = \{I_1, \ldots, I_n\}$.

Center of gravity $G$ of $\mathcal{C}$:
$g = (\bar{x}_1, \ldots, \bar{x}_P)'$

In the example on ranking, where the variables are Alumni, Award, HiCi, N&S, SCI and PCP, $G$ characterizes a university with the mean profile.
The total inertia of the cloud $\mathcal{C}$ with respect to its center of gravity $G$ measures its dispersion:
$I(\mathcal{C}, G) = \frac{1}{n} \sum_{i=1}^{n} d^2(I_i, G)
= \frac{1}{n} \sum_{i=1}^{n} \left( \sum_{p=1}^{P} (x_{ip} - \bar{x}_p)^2 \right)
= \sum_{p=1}^{P} \left( \frac{1}{n} \sum_{i=1}^{n} (x_{ip} - \bar{x}_p)^2 \right)
= \sum_{p=1}^{P} s_p^2$

$\Rightarrow$ The total inertia is the sum of the variances.
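This identity is straightforward to verify on a hypothetical data matrix:

```python
import numpy as np

# Hypothetical data matrix: n = 30 individuals, P = 4 variables.
rng = np.random.default_rng(2)
X = rng.normal(size=(30, 4))

g = X.mean(axis=0)                               # center of gravity G
inertia = np.mean(np.sum((X - g) ** 2, axis=1))  # (1/n) sum_i d^2(I_i, G)
print(np.isclose(inertia, X.var(axis=0).sum()))  # True: inertia = sum of variances
```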
For the ranking example:
$I(\mathcal{C}, G) = 525.7 + 625.6 + 207.8 + 217.5 + 156.6 + 212.3 = 1945.5$

The largest part of the total inertia is due to the Nobel variables.
$\Rightarrow$ The choice of units clearly has an impact. Solution: normalize the PCA.

Normalized PCA is independent of the choice of units because it uses the standardized variables:
$x^*_{ip} = \frac{x_{ip} - \bar{x}_p}{s_p} \qquad \forall i \in \{1, \ldots, n\};\ p \in \{1, \ldots, P\}$

Data matrix $X^*$ of standardized observations
$\Rightarrow$ Point cloud $\mathcal{C}^* = \{I^*_1, \ldots, I^*_n\}$ centered at the origin $O$, with total inertia $I(\mathcal{C}^*, O) = P$.
Example ARWU (2007) on two variables:

Universities                          X*_1 (HiCi*)   X*_2 (SCI*)
1. Harvard Univ.                      3.70           3.19
2. Stanford Univ.                     2.74           0.81
3. Univ. California, Berkeley         1.48           0.73
4. Univ. Cambridge                    0.51           0.42
5. Massachusetts Inst. Tech. (MIT)    1.34           0.13
...
31. Rockefeller Univ.                 1.31           2.95
...
49. Univ. Pittsburgh, Pittsburgh      0.47           0.44
50. Univ. Southern California         0.66           0.59
Mean                                  0              0
Variance                              1              1
[Figure: scatterplot of standardized HiCi* scores against standardized SCI* scores for the 50 universities, each point labeled. Caption: two standardized research-evaluation criteria (HiCi and SCI).]
2.2.2 First principal component

Projection of $\mathcal{C}^* = \{I_1, \ldots, I_n\} \subset \mathbb{R}^P$ on a subspace of dimension one ($\mathbb{R}^1$).

First projecting direction

Find a projecting direction $\Delta_1$ that adjusts the point cloud in the best way, by minimizing the residual inertia:
$I(\mathcal{C}^*, \Delta_1) = \frac{1}{n} \sum_{i=1}^{n} d^2(I_i, P_{\Delta_1}(I_i))$
where $P_{\Delta_1}(I_i)$ is the orthogonal projection of $I_i$ on the direction $\Delta_1$.
PROBLEM:
Find the direction $\Delta_1$ passing through the origin such that:
$I(\mathcal{C}^*, \Delta_1) = \min_{\Delta \text{ through } O} I(\mathcal{C}^*, \Delta)$

[Figure: a two-dimensional cloud with two candidate axes $\Delta$ and $\Delta'$, a point $I_i$ and its projections $P_\Delta(I_i)$ and $P_{\Delta'}(I_i)$.]

The direction $\Delta_1$ is called the first principal axis.

Let $u_1$ be the vector of norm 1 associated with the direction $\Delta_1$:
$u_1 = (u_{1,1}, \ldots, u_{1,P})'$

More generally, let $u$ be the vector of norm 1 from the origin associated with the direction $\Delta$:
$u = (u_1, \ldots, u_P)'$
RESOLUTION:

[Figure: in $\mathbb{R}^P$, the point $I_i$ with vector $x^*_i$, the axis $\Delta$ directed by the unit vector $u$, the projection $P_\Delta(I_i)$, the distance $d_i(u)$ and the projected length $p_i(u)$.]

Let:
$d_i(u) = \|I_i - P_\Delta(I_i)\|$
$p_i(u) = \|\vec{OP}_\Delta(I_i)\|$

Find the vector $u_1$ of norm 1 such that:
$u_1 = \arg\min_{u \text{ s.t. } \|u\| = 1} \frac{1}{n} \sum_{i=1}^{n} d_i^2(u)$

By the Pythagorean theorem:
$\|\vec{OI}_i\|^2 = p_i(u)^2 + d_i(u)^2$
Then
$u_1 = \arg\min_{u \text{ s.t. } \|u\| = 1} \frac{1}{n} \sum_{i=1}^{n} d_i^2(u)$
is equivalent to
$u_1 = \arg\max_{u \text{ s.t. } \|u\| = 1} \frac{1}{n} \sum_{i=1}^{n} p_i^2(u)$

Using the scalar product:
$p_i(u) = \langle u, \vec{OI}_i \rangle = u' x^*_i = \sum_{p=1}^{P} u_p x^*_{ip}$
it follows that:
$u_1 = \arg\max_{u \text{ s.t. } u'u = 1} \frac{1}{n} \sum_{i=1}^{n} (u' x^*_i)^2.$
Using matrices in the formulation:
$\sum_{i=1}^{n} (u' x^*_i)^2 = \sum_{i=1}^{n} u' x^*_i (x^*_i)' u = u' \left( \sum_{i=1}^{n} x^*_i (x^*_i)' \right) u = u' (X^*)' X^* u$

so the problem becomes:
$u_1 = \arg\max_u \frac{1}{n} u'(X^*)'X^* u \quad \text{subject to} \quad u'u = 1$
$\Rightarrow$ To solve this problem, we introduce the Lagrange function:
$L(u, \lambda) = \frac{1}{n} u'(X^*)'X^* u - \lambda(u'u - 1)$

The solution of this problem is given by the resolution of a system of $P + 1$ equations:
$\partial_{u_1} L = 0$
$\ldots$
$\partial_{u_P} L = 0$
$\partial_{\lambda} L = 0$

The last equation gives back the constraint.
Let us derive componentwise with respect to $u_p$, $\forall p \in \{1, \ldots, P\}$:
$\partial_{u_p} L = \partial_{u_p} \left[ \frac{1}{n} u'(X^*)'X^* u - \lambda(u'u - 1) \right]$
$= \partial_{u_p} \left[ \frac{1}{n} \sum_{i=1}^{n} (u'x^*_i)^2 - \lambda \left( \sum_{l=1}^{P} u_l^2 - 1 \right) \right]$
$= \partial_{u_p} \left[ \frac{1}{n} \sum_{i=1}^{n} \left( \sum_{l=1}^{P} u_l x^*_{il} \right)^2 - \lambda \left( \sum_{l=1}^{P} u_l^2 - 1 \right) \right]$
$= \frac{2}{n} \sum_{i=1}^{n} \left( \sum_{l=1}^{P} u_l x^*_{il} \right) x^*_{ip} - 2\lambda u_p$
Putting together the first $P$ equations leads to:
$\begin{pmatrix} \partial_{u_1} L \\ \vdots \\ \partial_{u_p} L \\ \vdots \\ \partial_{u_P} L \end{pmatrix}
= 2 \begin{pmatrix}
\frac{1}{n} \sum_{i=1}^{n} \left( \sum_{l=1}^{P} u_l x^*_{il} \right) x^*_{i1} - \lambda u_1 \\
\vdots \\
\frac{1}{n} \sum_{i=1}^{n} \left( \sum_{l=1}^{P} u_l x^*_{il} \right) x^*_{ip} - \lambda u_p \\
\vdots \\
\frac{1}{n} \sum_{i=1}^{n} \left( \sum_{l=1}^{P} u_l x^*_{il} \right) x^*_{iP} - \lambda u_P
\end{pmatrix}
= 2 \left( \frac{1}{n} \sum_{i=1}^{n} x^*_i (x^*_i)' u - \lambda u \right)
= 2 \left( \frac{1}{n} (X^*)'X^* u - \lambda u \right)$
The system of $P + 1$ equations is then equivalent to the following system:
$\frac{1}{n} (X^*)'X^* u = \lambda u, \qquad u'u = 1$

SOLUTION: The first principal axis $\Delta_1$ through the origin is given by the eigenvector $u_1$ of the correlation matrix $R = \frac{1}{n}(X^*)'X^*$ of the variables $X^*_p$ ($p \in \{1, \ldots, P\}$) associated with the largest eigenvalue $\lambda_1$.
Remarks:
- Multiplying the eigen-equation by $u'$ gives
$\lambda = \lambda u'u = \frac{1}{n} u'(X^*)'X^* u,$
so the maximized criterion equals the eigenvalue, which is why $u_1$ must be associated with the largest one. For the ARWU example, $\lambda_1 = 3.94$.
- The norm of $u_1$,
$\|u_1\| = \sqrt{\sum_{p=1}^{P} u^2_{1,p}} = \sqrt{0.42^2 + \ldots + 0.41^2} = 1,$
is indeed equal to one.
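The whole resolution boils down to an eigendecomposition of the correlation matrix. A minimal sketch on hypothetical data (not the ARWU numbers):

```python
import numpy as np

# Hypothetical data with some induced correlation between two columns.
rng = np.random.default_rng(3)
X = rng.normal(size=(50, 4))
X[:, 1] += X[:, 0]

Xs = (X - X.mean(axis=0)) / X.std(axis=0)   # standardized matrix X*
n, P = Xs.shape
R = Xs.T @ Xs / n                           # correlation matrix

lam, U = np.linalg.eigh(R)                  # eigenvalues in ascending order
lam1, u1 = lam[-1], U[:, -1]                # largest eigenvalue, first axis

psi1 = Xs @ u1                              # first principal component
print(np.isclose(u1 @ u1, 1.0))             # unit norm, True
print(np.isclose(psi1.var(), lam1))         # var(psi1) = lambda1, True
```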
First principal component

Orthogonal projection of the point cloud $\mathcal{C}^*$ on the axis $\Delta_1$:
$P_{\Delta_1}(\mathcal{C}^*) = \{P_{\Delta_1}(I_1), \ldots, P_{\Delta_1}(I_n)\}$

The coordinates of the projected points $P_{\Delta_1}(I_i)$ define the values of the $n$ individuals on the new variable $\psi_1$. This variable, the best compromise to summarize the information in dimension one, is called the first principal component:
$\psi_{i1} = \|\vec{OP}_{\Delta_1}(I_i)\| = \langle u_1, \vec{OI}_i \rangle = u_1' x^*_i = \sum_{p=1}^{P} u_{1,p} x^*_{ip}$

Let $\psi_1$ be the vector that contains the $n$ coordinates on the first principal component:
$\psi_1 = X^* u_1$
The first principal component is a linear combination of the initial variables, that is to say a weighted average.

Example: ARWU (2007)
$\psi_1 = 0.42\,\text{Alumni}^* + 0.42\,\text{Award}^* + 0.44\,\text{HiCi}^* + 0.47\,\text{NS}^* + 0.26\,\text{SCI}^* + 0.41\,\text{PCP}^*$

Universities                          ψ_1    CTR_1   cos²
1. Harvard Univ.                      7.50   0.29    0.95
2. Stanford Univ.                     3.88   0.08    0.84
3. Univ. California, Berkeley         3.57   0.06    0.96
4. Univ. Cambridge                    3.58   0.07    0.78
5. Massachusetts Inst. Tech. (MIT)    3.33   0.06    0.92
6. California Inst. Tech.             3.61   0.07    0.53
7. Columbia Univ.                     2.34   0.03    0.82
8. Princeton Univ.                    1.93   0.02    0.44
9. Univ. Chicago                      1.48   0.01    0.36
10. Univ. Oxford                      1.41   0.01    0.71
...
Properties of $\psi_1$

$\psi_1$ is centered (weighted mean of centered variables):
$\bar{\psi}_1 = \frac{1}{n} \sum_{i=1}^{n} \psi_{i1} = \frac{1}{n} \sum_{i=1}^{n} \sum_{p=1}^{P} u_{1,p} x^*_{ip} = \sum_{p=1}^{P} u_{1,p} \left( \frac{1}{n} \sum_{i=1}^{n} x^*_{ip} \right) = \sum_{p=1}^{P} u_{1,p}\, \bar{x}^*_p = 0$

The variance of $\psi_1$ is equal to $\lambda_1$:
$s^2_{\psi_1} = \frac{1}{n} \sum_{i=1}^{n} (\psi_{i1} - \bar{\psi}_1)^2 = \frac{1}{n} \sum_{i=1}^{n} \psi^2_{i1} = \frac{1}{n} \psi_1' \psi_1 = \frac{1}{n} u_1'(X^*)'X^* u_1 = u_1' \left( \frac{1}{n} (X^*)'X^* \right) u_1 = u_1' \lambda_1 u_1 = \lambda_1 u_1' u_1 = \lambda_1$
The variance of $\psi_1$ is equal to the inertia of the point cloud projected on $\Delta_1$:
$s^2_{\psi_1} = \frac{1}{n} \sum_{i=1}^{n} \psi^2_{i1} = \frac{1}{n} \sum_{i=1}^{n} \|\vec{OP}_{\Delta_1}(I_i)\|^2 = I(P_{\Delta_1}(\mathcal{C}^*), O)$

The correlation between $X_p$ and $\psi_1$ is given by
$r_{X_p, \psi_1} = \sqrt{\lambda_1}\, u_{1,p}$

Indeed, the associated covariance is given by
$s_{X^*_p, \psi_1} = \frac{1}{n} \sum_{i=1}^{n} x^*_{ip}\, \psi_{i1} \qquad \forall p \in \{1, \ldots, P\}$
It follows that (with $v_p$ denoting the $p$-th column of $X^*$)
$\begin{pmatrix} s_{X^*_1, \psi_1} \\ \vdots \\ s_{X^*_p, \psi_1} \\ \vdots \\ s_{X^*_P, \psi_1} \end{pmatrix}
= \begin{pmatrix} \frac{1}{n} \sum_{i=1}^{n} x^*_{i1} \psi_{i1} \\ \vdots \\ \frac{1}{n} \sum_{i=1}^{n} x^*_{ip} \psi_{i1} \\ \vdots \\ \frac{1}{n} \sum_{i=1}^{n} x^*_{iP} \psi_{i1} \end{pmatrix}
= \begin{pmatrix} \frac{1}{n} (v_1)' \psi_1 \\ \vdots \\ \frac{1}{n} (v_p)' \psi_1 \\ \vdots \\ \frac{1}{n} (v_P)' \psi_1 \end{pmatrix}
= \frac{1}{n} \begin{pmatrix} (v_1)' \\ \vdots \\ (v_p)' \\ \vdots \\ (v_P)' \end{pmatrix} \psi_1
= \frac{1}{n} (X^*)' \psi_1
= \frac{1}{n} (X^*)' X^* u_1
= \lambda_1 u_1$

Leading to:
$s_{X^*_p, \psi_1} = \lambda_1 u_{1,p} \qquad \forall p \in \{1, \ldots, P\}$
Hence, since $s_{X^*_p} = 1$ and $s_{\psi_1} = \sqrt{\lambda_1}$:
$r_{X_p, \psi_1} = r_{X^*_p, \psi_1} = \frac{s_{X^*_p, \psi_1}}{s_{X^*_p}\, s_{\psi_1}} = \frac{\lambda_1 u_{1,p}}{\sqrt{\lambda_1}} = \sqrt{\lambda_1}\, u_{1,p}$
Example: ARWU (2007)

r_{X_k, ψ_h}    ψ_1    ψ_2    ψ_3    ψ_4    ψ_5    ψ_6
Alumni          0.83   0.09   0.52   0.06   0.05   0.16
Award           0.84   0.44   0.13   0.17   0.01   0.24
HiCi            0.86   0.29   0.26   0.25   0.19   0.08
N&S             0.94   0.06   0.16   0.07   0.29   0.08
SCI             0.51   0.82   0.11   0.16   0.01   0.15
Size            0.81   0.35   0.28   0.36   0.075  0.00
$\psi_1$ is positively correlated with all the variables.

The average proximity of $\psi_1$ to the initial variables is given by:
$\frac{1}{P} \sum_{p=1}^{P} r^2_{X_p, \psi_1} = \frac{1}{P} \sum_{p=1}^{P} \lambda_1 u^2_{1,p} = \frac{\lambda_1}{P} \sum_{p=1}^{P} u^2_{1,p} = \frac{\lambda_1}{P} = \frac{3.94}{6} = 66\%$
Global quality of the first principal component

Using the decomposition of the total inertia, we capture the percentage of information taken into account by the first principal component:
$\|\vec{OI}_i\|^2 = \|\vec{OP}_{\Delta_1}(I_i)\|^2 + \|I_i - P_{\Delta_1}(I_i)\|^2$
$\frac{1}{n} \sum_{i=1}^{n} \|\vec{OI}_i\|^2 = \frac{1}{n} \sum_{i=1}^{n} \|\vec{OP}_{\Delta_1}(I_i)\|^2 + \frac{1}{n} \sum_{i=1}^{n} \|I_i - P_{\Delta_1}(I_i)\|^2$
$I(\mathcal{C}^*, O) = I(P_{\Delta_1}(\mathcal{C}^*), O) + I(\mathcal{C}^*, \Delta_1)$

Total inertia = inertia explained by $\Delta_1$ + residual inertia

Global quality is given by $\dfrac{\lambda_1}{P}$.

Example: ARWU (2007): $\dfrac{\lambda_1}{P} = \dfrac{3.94}{6} = 66\%$
Quality of the representation of each individual on the first axis

The quality of the representation of each individual $I_i$ on the axis $\Delta_1$ is measured by the squared cosine of the angle between the vector $\vec{OI}_i$ and the axis $\Delta_1$:
$\cos^2(\vec{OI}_i, \Delta_1) = \cos^2(\vec{OI}_i, \vec{OP}_{\Delta_1}(I_i)) = \frac{\|\vec{OP}_{\Delta_1}(I_i)\|^2}{\|\vec{OI}_i\|^2} = \frac{\psi^2_{i1}}{\|\vec{OI}_i\|^2}.$

The representation of individual $i$ is satisfactory on the first axis if $\cos^2(\vec{OI}_i, \Delta_1)$ is close to 1.
[Figure: two points $I_i$ and $I_j$ with their projections $P_{\Delta_1}(I_i)$ and $P_{\Delta_1}(I_j)$ on the axis directed by $u_1$.]

Example: ARWU (2007)
$\|\vec{OI}_{\text{Harvard}}\|^2 = d^2(O, I_{\text{Harvard}}) = (3.70)^2 + (3.19)^2 + \ldots = 59.21$
$\cos^2(\vec{OI}_{\text{Harvard}}, \Delta_1) = \frac{(7.50)^2}{59.21} = 0.95$
Contribution of each individual to the construction of the first axis

Note that:
$\lambda_1 = I(P_{\Delta_1}(\mathcal{C}^*), O) = s^2_{\psi_1} = \frac{1}{n} \sum_{i=1}^{n} \psi^2_{i1}$

The contribution of each individual $i$ to the variance $\lambda_1$ is then given by
$CTR_1(i) = \frac{\frac{1}{n} \psi^2_{i1}}{\lambda_1}$

Each contribution gives a percentage since
$\sum_{i=1}^{n} CTR_1(i) = 1$

Interpretation: an individual is important in the construction of the first axis if its contribution is large. The construction of the first principal component is based essentially on individuals far away from the center of gravity.
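The contributions and squared cosines can be computed in a few lines; a sketch on hypothetical standardized data:

```python
import numpy as np

# Hypothetical standardized data: n = 20 individuals, P = 3 variables.
rng = np.random.default_rng(4)
Xs = rng.normal(size=(20, 3))
Xs = (Xs - Xs.mean(axis=0)) / Xs.std(axis=0)
n = Xs.shape[0]

lam, U = np.linalg.eigh(Xs.T @ Xs / n)
lam1, u1 = lam[-1], U[:, -1]
psi1 = Xs @ u1                                 # first principal component

ctr1 = (psi1 ** 2 / n) / lam1                  # CTR_1(i) = (1/n) psi_i1^2 / lambda1
cos2 = psi1 ** 2 / np.sum(Xs ** 2, axis=1)     # cos^2(OI_i, Delta_1)

print(np.isclose(ctr1.sum(), 1.0))             # contributions sum to one, True
print(cos2.min() >= 0.0)                       # squared cosines lie in [0, 1], True
```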
Universities                                First axis              Second axis
                                            ψ_1    CTR_1   cos²    ψ_2    CTR_2   cos²
1. Harvard Univ. 7.50 0.29 0.95 1.65 0.05 0.05
2. Stanford Univ. 3.88 0.08 0.84 0.13 0.00 0.00
3. Univ. California, Berkeley 3.57 0.06 0.96 0.06 0.00 0.00
4. Univ. Cambridge 3.58 0.07 0.78 1.23 0.03 0.09
5. Massachusetts Inst. Tech. (MIT) 3.33 0.06 0.92 0.67 0.01 0.04
6. California Inst. Tech. 3.61 0.07 0.53 2.35 0.10 0.23
7. Columbia Univ. 2.34 0.03 0.82 0.00 0.00 0.00
8. Princeton Univ. 1.93 0.02 0.44 1.94 0.07 0.44
9. Univ. Chicago 1.48 0.01 0.36 1.24 0.03 0.26
10. Univ. Oxford 1.41 0.01 0.71 0.24 0.00 0.02
11. Yale Univ. 1.58 0.01 0.92 0.04 0.00 0.00
12. Cornell Univ. 1.07 0.01 0.87 0.18 0.00 0.02
13. Univ. California, Los Angeles 0.71 0.00 0.20 1.21 0.03 0.57
14. Univ. California, San Diego 0.74 0.00 0.22 0.49 0.00 0.10
15. Univ. Pennsylvania 0.40 0.00 0.13 0.89 0.01 0.62
16. Univ. Washington, Seattle 0.14 0.00 0.01 1.37 0.03 0.82
17. Univ. Wisconsin, Madison 0.16 0.00 0.02 0.79 0.01 0.58
18. Univ. California, San Francisco 0.17 0.00 0.01 0.09 0.00 0.00
19. Johns Hopkins Univ. 0.03 0.00 0.00 0.83 0.01 0.32
...
31. Rockefeller Univ. 1.13 0.01 0.11 2.99 0.16 0.77
32. Duke Univ. 0.80 0.00 0.25 0.78 0.01 0.24
33. Univ. Minnesota, Twin Cities 1.07 0.01 0.31 1.40 0.04 0.53
34. Univ. Colorado, Boulder 1.31 0.01 0.64 0.70 0.01 0.18
35. Univ. California, Santa Barbara 1.44 0.01 0.46 0.98 0.02 0.21
36. Univ. British Columbia 1.41 0.01 0.72 0.25 0.00 0.02
37. Univ. Maryland, Coll. Park 1.51 0.01 0.92 0.01 0.00 0.00
38. Univ. Texas, Austin 1.65 0.01 0.76 0.39 0.00 0.04
39. Univ. Paris VI 1.61 0.01 0.59 0.56 0.01 0.07
40. Univ. Texas Southwestern Med. Center 1.63 0.01 0.52 1.48 0.04 0.43
41. Vanderbilt Univ. 1.71 0.01 0.76 0.72 0.01 0.13
42. Univ. Utrecht 1.76 0.02 0.83 0.08 0.00 0.00
43. Pennsylvania State Univ., Univ. Park 1.67 0.01 0.68 0.85 0.01 0.17
44. Univ. California, Davis 1.70 0.01 0.55 1.16 0.02 0.26
45. Univ. California, Irvine 1.97 0.02 0.79 0.59 0.01 0.07
46. Univ. Copenhagen 1.88 0.02 0.77 0.64 0.01 0.09
47. Rutgers State Univ., New Brunswick 1.91 0.02 0.83 0.46 0.00 0.05
48. Univ. Manchester 1.94 0.02 0.83 0.12 0.00 0.00
49. Univ. Pittsburgh, Pittsburgh 1.80 0.02 0.66 1.02 0.02 0.21
50. Univ. Southern California 2.21 0.02 0.86 0.15 0.00 0.00
2.2.3 Second principal component

Second projecting direction

The second projecting axis $\Delta_2$ is
- an axis through the origin of $\mathbb{R}^P$ (the gravity center of the point cloud $\mathcal{C}^*$)
- orthogonal to $\Delta_1$
- minimizing the residual inertia $I(\mathcal{C}^*, (\Delta_1, \Delta_2))$

In practice, we can show that $\Delta_2$ is given by the direction $u_2$, the eigenvector with unit norm of the correlation matrix $R$ associated with the second largest eigenvalue $\lambda_2$.

The subspace $(\Delta_1, \Delta_2)$ of dimension 2 is called the first principal plane.
Decomposition of the total inertia

[Figure: a point $I_i$, the plane spanned by $u_1$ and $u_2$, and the projections $P_{(\Delta_1, \Delta_2)}(I_i)$, $P_{\Delta_1}(I_i)$ and $P_{\Delta_2}(I_i)$.]

Let:
- $P_{\Delta_1}(I_i)$ be the orthogonal projection of $I_i$ on the axis $\Delta_1$
- $P_{\Delta_2}(I_i)$ be the orthogonal projection of $I_i$ on the axis $\Delta_2$
- $P_{(\Delta_1, \Delta_2)}(I_i)$ be the orthogonal projection of $I_i$ on the plane $(\Delta_1, \Delta_2)$.
By the Pythagorean theorem:
$\|\vec{OI}_i\|^2 = \|\vec{OP}_{(\Delta_1, \Delta_2)}(I_i)\|^2 + \|I_i - P_{(\Delta_1, \Delta_2)}(I_i)\|^2$

Moreover,
- $P_{\Delta_1}(I_i)$ is the orthogonal projection of $P_{(\Delta_1, \Delta_2)}(I_i)$ on the axis $\Delta_1$
- $P_{\Delta_2}(I_i)$ is the orthogonal projection of $P_{(\Delta_1, \Delta_2)}(I_i)$ on the axis $\Delta_2$,

$\Rightarrow \|\vec{OI}_i\|^2 = \|\vec{OP}_{\Delta_1}(I_i)\|^2 + \|\vec{OP}_{\Delta_2}(I_i)\|^2 + \|I_i - P_{(\Delta_1, \Delta_2)}(I_i)\|^2$

$\Rightarrow \frac{1}{n} \sum_{i=1}^{n} \|\vec{OI}_i\|^2 = \frac{1}{n} \sum_{i=1}^{n} \|\vec{OP}_{\Delta_1}(I_i)\|^2 + \frac{1}{n} \sum_{i=1}^{n} \|\vec{OP}_{\Delta_2}(I_i)\|^2 + \frac{1}{n} \sum_{i=1}^{n} \|I_i - P_{(\Delta_1, \Delta_2)}(I_i)\|^2$

$I(\mathcal{C}^*, O) = I(P_{\Delta_1}(\mathcal{C}^*), O) + I(P_{\Delta_2}(\mathcal{C}^*), O) + I(\mathcal{C}^*, (\Delta_1, \Delta_2)).$
Second principal component

Orthogonal projection of the point cloud $\mathcal{C}^*$ on the axis $\Delta_2$:
$P_{\Delta_2}(\mathcal{C}^*) = \{P_{\Delta_2}(I_1), \ldots, P_{\Delta_2}(I_n)\}$

In the same way as for the first direction, define:
$\psi_{i2} = \|\vec{OP}_{\Delta_2}(I_i)\| \qquad i = 1, \ldots, n$
where $\psi_{i2}$ gives the value of individual $i$ on the second principal component $\psi_2$.

The second principal component is also a weighted average of the initial variables:
$\psi_{i2} = \langle u_2, \vec{OI}_i \rangle = u_2' x^*_i = \sum_{p=1}^{P} u_{2,p}\, x^*_{ip}.$
Let $\psi_2$ be the vector that contains the $n$ coordinates on the second principal component, $\psi_2 = (\psi_{12}, \ldots, \psi_{n2})'$:
$\psi_2 = X^* u_2.$

The second new variable $\psi_2$ is a linear combination of the initial variables $X^*_1, \ldots, X^*_P$:
$\psi_2 = \sum_{p=1}^{P} u_{2,p}\, X^*_p.$

Example: ARWU (2007)
$\psi_2 = 0.08\,\text{Alumni}^* - 0.42\,\text{Award}^* + 0.27\,\text{HiCi}^* + 0.06\,\text{NS}^* + 0.79\,\text{SCI}^* - 0.34\,\text{PCP}^*$
$\lambda_2 = s^2_{\psi_2} = \frac{1}{n} \sum_{i=1}^{n} \psi^2_{i2} = \frac{1}{n} \sum_{i=1}^{n} \|\vec{OP}_{\Delta_2}(I_i)\|^2 = I(P_{\Delta_2}(\mathcal{C}^*), O).$
The correlation between $\psi_1$ and $\psi_2$ is equal to zero:
$s_{\psi_1, \psi_2} = \frac{1}{n} \sum_{i=1}^{n} \psi_{i1} \psi_{i2} = \frac{1}{n} \psi_1' \psi_2 = \frac{1}{n} u_1'(X^*)'X^* u_2 = u_1' \lambda_2 u_2 = \lambda_2\, u_1' u_2 = 0$
$\Rightarrow r_{\psi_1, \psi_2} = 0.$
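This orthogonality of the components can be checked empirically; a sketch on hypothetical standardized data:

```python
import numpy as np

# Hypothetical standardized data: n = 40 individuals, P = 3 variables.
rng = np.random.default_rng(5)
Xs = rng.normal(size=(40, 3))
Xs = (Xs - Xs.mean(axis=0)) / Xs.std(axis=0)
n = Xs.shape[0]

lam, U = np.linalg.eigh(Xs.T @ Xs / n)
psi1 = Xs @ U[:, -1]                       # first principal component
psi2 = Xs @ U[:, -2]                       # second principal component

print(np.isclose(psi1 @ psi2 / n, 0.0))    # s_{psi1, psi2} = 0, True
print(np.isclose(psi2.var(), lam[-2]))     # var(psi2) = lambda2, True
```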
Correlation between the second component and the initial variables (exercise):
$r_{X_p, \psi_2} = \sqrt{\lambda_2}\, u_{2,p} \qquad p = 1, \ldots, P.$

Example: ARWU (2007)

r_{X_k, ψ_h}    ψ_1    ψ_2    ψ_3    ψ_4    ψ_5    ψ_6
Alumni          0.83   0.09   0.52   0.06   0.05   0.16
Award           0.84   0.44   0.13   0.17   0.01   0.24
HiCi            0.86   0.29   0.26   0.25   0.19   0.08
N&S             0.94   0.06   0.16   0.07   0.29   0.08
SCI             0.51   0.82   0.11   0.16   0.01   0.15
Size            0.81   0.35   0.28   0.36   0.075  0.00
$\psi_2$ discriminates, for universities with globally the same level on $\psi_1$, between 2 behaviors:
- the volume of publication dominates the number of Nobel Prizes: $\psi_{\text{Michigan},2} = 2.10$;
- the Nobel Prizes dominate the score on the volume of publication: $\psi_{\text{Rockefeller},2} = 2.99$.
Global quality of the second principal component

Percentage of inertia explained by $\Delta_2$: $\dfrac{\lambda_2}{P}$

Percentage of inertia explained by the first principal plane $(\Delta_1, \Delta_2)$: $\dfrac{\lambda_1 + \lambda_2}{P}$

Example: ARWU (2007)
$\Delta_2$ explains $\frac{1.09}{6} = 18.17\%$ of the total inertia.
Then $(\Delta_1, \Delta_2)$ explains $\frac{3.94 + 1.09}{6} = 83.83\%$ of the total inertia.
Quality of the representation of each individual on the second axis

The quality of the representation of each point $I_i$ on the axis $\Delta_2$ is measured by the squared cosine of the angle between the vector $\vec{OI}_i$ and the direction $\Delta_2$:
$\cos^2(\vec{OI}_i, \Delta_2) = \frac{\|\vec{OP}_{\Delta_2}(I_i)\|^2}{\|\vec{OI}_i\|^2} = \frac{\psi^2_{i2}}{\|\vec{OI}_i\|^2}.$

[Figure: a point $I_i$, the plane $(\Delta_1, \Delta_2)$, and the projections $P_{\Delta_1}(I_i)$, $P_{\Delta_2}(I_i)$ and $P_{(\Delta_1, \Delta_2)}(I_i)$ with coordinates $\psi_{i1}$ and $\psi_{i2}$.]
The quality of the representation of each point $I_i$ on the plane $(\Delta_1, \Delta_2)$ is measured by the squared cosine of the angle between the vector $\vec{OI}_i$ and the plane $(\Delta_1, \Delta_2)$:
$\cos^2(\vec{OI}_i, (\Delta_1, \Delta_2)) = \frac{\|\vec{OP}_{(\Delta_1, \Delta_2)}(I_i)\|^2}{\|\vec{OI}_i\|^2} = \frac{\|\vec{OP}_{\Delta_1}(I_i)\|^2 + \|\vec{OP}_{\Delta_2}(I_i)\|^2}{\|\vec{OI}_i\|^2} = \frac{\psi^2_{i1} + \psi^2_{i2}}{\|\vec{OI}_i\|^2} = \cos^2(\vec{OI}_i, \Delta_1) + \cos^2(\vec{OI}_i, \Delta_2).$
Contribution of each individual to the construction of the second axis Δ_2

Note that:
λ_2 = I(P_{Δ_2}(Ω*), 0) = s²_{ψ_2} = (1/n) Σ_{i=1}^n ψ²_{i2}.

The contribution of each individual i to the variance λ_2 is given by:
CTR_2(i) = (1/n) ψ²_{i2} / λ_2.
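The two quality measures can be sketched together; the coordinates and squared norms below are illustrative toy numbers, not the ARWU data.

```python
# Sketch: contribution (CTR) and squared cosine of individuals on an axis.

def ctr(psi, n, lam):
    """CTR(i) = (1/n) * psi_i^2 / lambda; contributions sum to 1."""
    return [(p * p) / (n * lam) for p in psi]

def cos2(psi_axis, sq_norms):
    """cos^2(OI_i, axis) = psi_i^2 / ||OI_i||^2."""
    return [p * p / d for p, d in zip(psi_axis, sq_norms)]

psi2 = [1.0, -2.0, 1.0, 0.0]                  # hypothetical coordinates
lam2 = sum(p * p for p in psi2) / len(psi2)   # variance of the component
print(ctr(psi2, len(psi2), lam2))             # sums to 1
print(cos2([1.0, -2.0], [2.0, 5.0]))
```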
Universities                          First axis           Second axis
                                      ψ_1   CTR_1  cos²    ψ_2   CTR_2  cos²
1. Harvard Univ. 7.50 0.29 0.95 1.65 0.05 0.05
2. Stanford Univ. 3.88 0.08 0.84 0.13 0.00 0.00
3. Univ. California, Berkeley 3.57 0.06 0.96 0.06 0.00 0.00
4. Univ. Cambridge 3.58 0.07 0.78 1.23 0.03 0.09
5. Massachusetts Inst. Tech. (MIT) 3.33 0.06 0.92 0.67 0.01 0.04
6. California Inst. Tech. 3.61 0.07 0.53 2.35 0.10 0.23
7. Columbia Univ. 2.34 0.03 0.82 0.00 0.00 0.00
8. Princeton Univ. 1.93 0.02 0.44 1.94 0.07 0.44
9. Univ. Chicago 1.48 0.01 0.36 1.24 0.03 0.26
10. Univ. Oxford 1.41 0.01 0.71 0.24 0.00 0.02
11. Yale Univ. 1.58 0.01 0.92 0.04 0.00 0.00
12. Cornell Univ. 1.07 0.01 0.87 0.18 0.00 0.02
13. Univ. California, Los Angeles 0.71 0.00 0.20 1.21 0.03 0.57
14. Univ. California, San Diego 0.74 0.00 0.22 0.49 0.00 0.10
15. Univ. Pennsylvania 0.40 0.00 0.13 0.89 0.01 0.62
16. Univ. Washington, Seattle 0.14 0.00 0.01 1.37 0.03 0.82
17. Univ. Wisconsin, Madison 0.16 0.00 0.02 0.79 0.01 0.58
18. Univ. California, San Francisco 0.17 0.00 0.01 0.09 0.00 0.00
19. Johns Hopkins Univ. 0.03 0.00 0.00 0.83 0.01 0.32
…
31. Rockefeller Univ. 1.13 0.01 0.11 2.99 0.16 0.77
32. Duke Univ. 0.80 0.00 0.25 0.78 0.01 0.24
33. Univ. Minnesota, Twin Cities 1.07 0.01 0.31 1.40 0.04 0.53
34. Univ. Colorado, Boulder 1.31 0.01 0.64 0.70 0.01 0.18
35. Univ. California, Santa Barbara 1.44 0.01 0.46 0.98 0.02 0.21
36. Univ. British Columbia 1.41 0.01 0.72 0.25 0.00 0.02
37. Univ. Maryland, Coll. Park 1.51 0.01 0.92 0.01 0.00 0.00
38. Univ. Texas, Austin 1.65 0.01 0.76 0.39 0.00 0.04
39. Univ. Paris VI 1.61 0.01 0.59 0.56 0.01 0.07
40. Univ. Texas Southwestern Med. Center 1.63 0.01 0.52 1.48 0.04 0.43
41. Vanderbilt Univ. 1.71 0.01 0.76 0.72 0.01 0.13
42. Univ. Utrecht 1.76 0.02 0.83 0.08 0.00 0.00
43. Pennsylvania State Univ., Univ. Park 1.67 0.01 0.68 0.85 0.01 0.17
44. Univ. California, Davis 1.70 0.01 0.55 1.16 0.02 0.26
45. Univ. California, Irvine 1.97 0.02 0.79 0.59 0.01 0.07
46. Univ. Copenhagen 1.88 0.02 0.77 0.64 0.01 0.09
47. Rutgers State Univ., New Brunswick 1.91 0.02 0.83 0.46 0.00 0.05
48. Univ. Manchester 1.94 0.02 0.83 0.12 0.00 0.00
49. Univ. Pittsburgh, Pittsburgh 1.80 0.02 0.66 1.02 0.02 0.21
50. Univ. Southern California 2.21 0.02 0.86 0.15 0.00 0.00
2.2.4 Extended dimensions

The h-th projecting axis Δ_h is:
• an axis passing through the origin of IR^P (the gravity center of the point cloud Ω*)
• orthogonal to Δ_1, …, Δ_{h−1}
• minimizing the residual inertia

In practice, we can show that Δ_h is given by the direction u_h, which is the eigenvector (with unit norm) of the correlation matrix R associated with the h-th largest eigenvalue λ_h.

It is clear that if h is equal to the rank of X*, the residual inertia is zero.

Projection of the data cloud Ω* on the axis Δ_h:

P_{Δ_h}(Ω*) = {P_{Δ_h}(I_1), …, P_{Δ_h}(I_n)}

In the same way as for the other directions, define:

ψ_{ih} = ‖OP_{Δ_h}(I_i)‖,  i = 1, …, n,

where ψ_{ih} gives the value of individual i on the principal component ψ_h.

The principal component is also a weighted average of the initial variables:

ψ_{ih} = < u_h, OI_i > = u_h′ x*_i = Σ_{p=1}^P u_{h,p} x*_{ip}.
Properties of ψ_h

• ψ_h has zero mean (exercise)
• ψ_h has a variance equal to λ_h (exercise)
• The correlation between ψ_l (l ∈ {1, …, h−1}) and ψ_h is equal to zero:

s_{ψ_l, ψ_h} = (1/n) Σ_{i=1}^n ψ_{il} ψ_{ih} = (1/n) ψ_l′ ψ_h = (1/n) u_l′ (X*)′ X* u_h = u_l′ λ_h u_h = λ_h u_l′ u_h = 0
⟹ r_{ψ_l, ψ_h} = 0.

• The correlation between the h-th component and the initial variables (exercise):
r_{X_p, ψ_h} = √λ_h · u_{h,p},  p = 1, …, P.
Correlations and eigenvectors
By linear algebra:
R = (1/n) (X*)′ X* = Σ_{h=1}^H λ_h u_h u_h′.

Then, for each p ≠ l ∈ {1, …, P}:

r_{X_p, X_l} = Σ_{h=1}^H λ_h u_{h,p} u_{h,l}.
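The spectral decomposition can be verified by hand in the 2×2 case, where R = [[1, r], [r, 1]] has eigenvalues 1+r and 1−r with unit eigenvectors (1, 1)/√2 and (1, −1)/√2; the value r = 0.6 below is an arbitrary illustration.

```python
# Sketch: r_{X_p,X_l} = sum_h lambda_h * u_{h,p} * u_{h,l} for a 2x2
# correlation matrix R = [[1, r], [r, 1]].
import math

r = 0.6
lams = [1 + r, 1 - r]
us = [(1 / math.sqrt(2), 1 / math.sqrt(2)),
      (1 / math.sqrt(2), -1 / math.sqrt(2))]

def reconstruct(p, l):
    """Recover entry (p, l) of R from its spectral decomposition."""
    return sum(lam * u[p] * u[l] for lam, u in zip(lams, us))

print(reconstruct(0, 1))   # recovers r
print(reconstruct(0, 0))   # recovers 1
```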
Question: how many principal components are needed?

Stopping rules for determining the number of principal components:

• Classical rule, based on π_h, the percentage of variance explained by the first h principal components, h ∈ {1, …, H}:
π_h = (λ_1 + … + λ_h)/(λ_1 + … + λ_H) = (λ_1 + … + λ_h)/P.
If π_h is big enough (close to one), h is the number of factors to choose. But this rule is rather subjective.

• Keep the principal component ψ_h if λ_h > 1 (the mean of the eigenvalues).

• Examine the scree plot, which shows the fraction of the total variance in the data explained by each principal component.
2.2.5 Graphical representations

The principal components are used to represent individuals and variables graphically.

Map of individuals

Projection of the data cloud Ω* on the first principal plane (Δ_1, Δ_2):
∀i = 1, …, n, the projection P_{(Δ_1,Δ_2)}(I_i) of individual I_i on the first plane has coordinates (ψ_{i1}, ψ_{i2}) on the axes Δ_1 and Δ_2.

This graph makes the interpretation of the axes easier, as well as the comparison between individuals.
Example: ARWU (2007)

Well-represented individuals can be interpreted.

[Scatter plot of the first principal plane (ψ_1, ψ_2) with labeled universities — Harvard, Stanford, Berkeley, Cambridge, MIT, CalTech, Princeton, Chicago, Michigan, Kyoto, Tokyo, Zurich, Texas Med. Center, Rockefeller, San Francisco — and regions AMER, EU, ASIA.]

• The first axis orders the universities from lowest to highest quality in terms of research.
• The second axis discriminates between volume of publication and Nobel prizes.
• Harvard seems to be an outlier.

If the principal plane is not sufficient, the (Δ_1, Δ_3) and (Δ_2, Δ_3) planes can also be analyzed.
Correlations circle

The representation of variables is based on the projection of the cloud of P variables X*_p in IR^n on the principal components. The coordinates on the first principal plane are

B_p = (r_{X_p, ψ_1}, r_{X_p, ψ_2}).

[Figure: the correlation circle of radius 1, with a variable X_k plotted at the point B_k = (r_{X_k, ψ_1}, r_{X_k, ψ_2}).]

This graph makes it easier to visualize:
• the correlations between old and new variables
• the quality of the representation of X_p, given by the norm of the vector OB_p.
Example: ARWU (2007)

[Correlation circle for the ARWU variables SCI, Award, HiCi, N&S, Size, Alumni.]

• All variables have a good quality of representation in IR².
• The first principal component is positively correlated with all variables (quality factor).
• The second principal component discriminates between volume and prizes ⟹ type of research quality.
2.3 Additional variables or individuals

Additional individuals i_s

- Step 1: Standardize the coordinates of the new individual i_s using the means and standard deviations calculated on the active individuals.
- Step 2: Project the new standardized individual on the principal axes:
ψ^{i_s}_1 = Σ_{p=1}^P u_{1,p} x*_{i_s p},  ψ^{i_s}_2 = Σ_{p=1}^P u_{2,p} x*_{i_s p},  etc.
- Step 3: Project this observation on the first plane.
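The three steps above can be sketched as follows; the means, standard deviations and eigenvectors are hypothetical illustrative values, not estimates from the ARWU data.

```python
# Sketch: projecting a supplementary individual. Standardize with the
# active individuals' means/sds (step 1), then take inner products with
# the eigenvectors u_1, u_2 (step 2).

def project_supplementary(x_new, means, sds, axes):
    z = [(x - m) / s for x, m, s in zip(x_new, means, sds)]      # step 1
    return [sum(u_p * z_p for u_p, z_p in zip(u, z)) for u in axes]  # step 2

means, sds = [10.0, 5.0], [2.0, 1.0]          # hypothetical active stats
axes = [[0.8, 0.6], [-0.6, 0.8]]              # hypothetical unit vectors
coords = project_supplementary([14.0, 6.0], means, sds, axes)
print(coords)
```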
Additional continuous variable X_s

The information on the additional continuous variable X_s is given by the correlations circle, where the coordinates are r_{X_s, ψ_1} and r_{X_s, ψ_2}.

Example: ARWU (2007) — representation of the ranking given in the Shanghai ranking:

[Correlation circle with the ARWU variables SCI, Award, HiCi, N&S, Size, Alumni and the additional variable Rank.]
Additional qualitative variable X_s

If the variable is qualitative, the correlation cannot be used. […]

Σ_{j=1}^J [ (1/P) Σ_{p=1}^P r²_{X_p, Z_j} ].

It is possible to prove that the maximum is reached by the reduced (standardized) principal components,
Z_j = ψ*_j,
and the maximum is given by (λ_1 + … + λ_J)/P.
Variance criteria

Find J new uncorrelated variables Z_1, …, Z_J such that

Z_j = Σ_{p=1}^P α_{j,p} X_p

where the vectors α_j = (α_{j,1}, …, α_{j,P})′ maximize the criterion

Σ_{j=1}^J s²_{Z_j}.

The maximum is given by λ_1 + … + λ_J and is reached for orthogonal eigenvectors of the covariance matrix.

If the standardized variables are used, then Z_j = ψ_j and the maximum is given by λ_1 + … + λ_J.
2.5 References

Dehon, C., Droesbeke, J.-J. et Vermandele, C. (2008), Éléments de statistique, Bruxelles, Éditions de l'Université de Bruxelles.

Jolliffe, I.T. (1986), Principal Component Analysis, New York: Springer.

Hotelling, H. (1933), Analysis of a complex of statistical variables into principal components, Journal of Educational Psychology, Vol. 24, 417-441 and 498-520.

Pearson, K. (1901), On lines and planes of closest fit to systems of points in space, Philosophical Magazine, 2, 11, 559-572.

Rao, C.R. (1964), The use and interpretation of principal components analysis in applied research, Sankhyā, Series A, Vol. 26, 329-357.
Chapter 3
A short introduction on robust
statistics
3.1 Why robust statistics ?
Develop procedures (in estimation, in testing problems, in regression, in time series, …) that remain valid (bias, efficiency) under small deviations from the underlying model.
All models are wrong, but some are useful.
(Box, 1979)
CHAPTER 3. A SHORT INTRODUCTION ON ROBUST STATISTICS 91
• Non-parametric hypothesis: P belongs to a large family of distributions.
• Robust hypothesis: P is close to one element of a parametric family.

Important remarks:
• Robust statistics does not replace classical statistics.
• The two-step procedure, where classical methods are used in the second step after having deleted outliers, requires robust methods in the first step.
• The word robust is used in various contexts, with different meanings.
New concepts linked to robustness

Bias and efficiency are well known in statistics, but robust statistics needs new measures:

• Influence function (IF): local stability
• Breakdown point: global validity
• Maxbias curve: a theoretical summary

Important: trade-off between robustness and efficiency.
3.2 Detection

Example: Cushny and Peebles

Cushny and Peebles reported the results of a clinical trial of the effect of various drugs on the duration of sleep:

Sample: 0, 0.8, 1, 1.2, 1.3, 1.3, 1.4, 1.8, 2.4, 4.6

The last observation, 4.6, seems to be an outlier relative to the other nine observations.

[Figure: index plot and boxplot of the Cushny and Peebles data.]
The rejection rule: the 3σ rule

If X ~ N(μ, σ²), it is well known that:

P(μ − 3σ < X < μ + 3σ) ≈ 0.999

Tchebyshev's rule (valid for all distributions): at least (1 − 1/k²) of the observations lie in (μ ± kσ).
Example: if k = 3, at least 89% of the observations lie in (μ ± 3σ).

But μ and σ are unknown!

Classical rule: an observation x_i is considered as an outlier if

x_i ∉ (x̄ ± 3s) = (−2.11; 5.27)

PROBLEM: MASKING EFFECT !!!!
The robust 3σ rule

An observation x_i is considered as an outlier if

x_i ∉ [med(x) − 3 MAD(x), med(x) + 3 MAD(x)] = (−0.48; 3.08)

A robust estimator of scale is given by the median absolute deviation MAD, which is the median of the n distances to the median:

MAD(x) = c · med(|x_i − med(x)|)

where c = 1/Φ⁻¹(3/4) ≈ 1.4826 in order to obtain Fisher consistency at the normal distribution.

The estimate based on the rejection rule is then:

(0 + 0.8 + 1.0 + 1.2 + 1.3 + 1.3 + 1.4 + 1.8 + 2.4)/9 = 1.24
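The robust rule can be sketched directly on the Cushny and Peebles data, using the normalized MAD with c ≈ 1.4826:

```python
# Sketch of the robust 3-sigma rule on the Cushny and Peebles data.
from statistics import median

def mad(x, c=1.4826):
    """Normalized median absolute deviation (c = 1/Phi^{-1}(3/4))."""
    m = median(x)
    return c * median(abs(v - m) for v in x)

def robust_outliers(x):
    m, s = median(x), mad(x)
    lo, hi = m - 3 * s, m + 3 * s
    return [v for v in x if not lo <= v <= hi], (lo, hi)

data = [0, 0.8, 1, 1.2, 1.3, 1.3, 1.4, 1.8, 2.4, 4.6]
outliers, interval = robust_outliers(data)
print(outliers, interval)   # 4.6 flagged; interval ≈ (-0.48, 3.08)
```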
Bivariate simulated example

Univariate analysis:
[Boxplots of X and of Y separately.]

Bivariate analysis:
[Scatter plot of X against Y.]

⟹ Outliers in the two-dimensional space, but not in any single one-dimensional space.
Multivariate example

Stack loss data (Rousseeuw & Leroy, 1987):

 i  x1  x2  x3   y     i  x1  x2  x3   y
 1  80  27  89  42    12  58  17  88  13
 2  80  27  88  37    13  58  18  82  11
 3  75  25  90  37    14  58  19  93  12
 4  62  24  87  28    15  50  18  89   8
 5  62  22  87  18    16  50  18  86   7
 6  62  23  87  18    17  50  19  72   8
 7  62  24  93  19    18  50  19  79   8
 8  62  24  93  20    19  50  20  80   9
 9  58  23  87  15    20  56  20  82  15
10  58  18  80  14    21  70  20  91  15
11  58  18  89  14

x1: air flow, x2: cooling water inlet temperature, x3: acid concentration;
y: stack loss, defined as the percentage of ingoing ammonia that escapes unabsorbed (response).

BUT: it is not possible to visualize all the information in one figure.
Mahalanobis distances

Let X be the data matrix of dimension n×p and x_i the vector of dimension p×1.

Classical Mahalanobis distances are defined by:

MD_i = √( (x_i − T(X))′ C(X)⁻¹ (x_i − T(X)) )

where T(X) is the mean vector:
T(X) = (1/n) Σ_i x_i

and C(X) is the empirical covariance matrix:
C(X) = (1/n) Σ_i (x_i − T(X))(x_i − T(X))′

T(X) and C(X) are not robust ⟹ MASKING EFFECT
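A minimal sketch of the classical distances for p = 2, inverting the 2×2 covariance matrix by hand; the four points are arbitrary illustrative data.

```python
# Sketch: classical Mahalanobis distances for p = 2.
import math

def mahalanobis_2d(X):
    n = len(X)
    mx = sum(x for x, _ in X) / n
    my = sum(y for _, y in X) / n
    sxx = sum((x - mx) ** 2 for x, _ in X) / n
    syy = sum((y - my) ** 2 for _, y in X) / n
    sxy = sum((x - mx) * (y - my) for x, y in X) / n
    det = sxx * syy - sxy ** 2
    inv = ((syy / det, -sxy / det), (-sxy / det, sxx / det))
    dists = []
    for x, y in X:
        dx, dy = x - mx, y - my
        q = dx * (inv[0][0] * dx + inv[0][1] * dy) \
            + dy * (inv[1][0] * dx + inv[1][1] * dy)
        dists.append(math.sqrt(q))
    return dists

X = [(0, 0), (1, 1), (2, 0), (1, -1)]   # symmetric toy data
print(mahalanobis_2d(X))                # all points equally distant here
```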
Robust multivariate estimators

Let b be a constant p×1 vector and A a non-singular p×p matrix.
Let X = {x_1, …, x_n}, Y = {x_1 + b, …, x_n + b} = X + b, Z = AX + b.

Equivariance for the location estimator T(X):
• translation equivariance: T(Y) = T(X) + b
• affine equivariance: T(Z) = A T(X) + b

Equivariance for the covariance estimator C(X):
• translation invariance: C(Y) = C(X)
• affine equivariance: C(Z) = A C(X) A′
Generalization of the univariate median

The median is a univariate location estimator with BDP = 50%, defined by the minimization problem:

med(x) = argmin_t Σ_{i=1}^n |x_i − t|

First proposition: the L1 estimator minimizes Σ_{i=1}^n ‖x_i − T‖.
Problem: not affine equivariant.

Second proposition: the coordinatewise median:
T = (med_i x_{i1}, …, med_i x_{ip})′
Problem: for p ≥ 3 the coordinatewise median is not always in the convex hull of the sample.
Several propositions of affine equivariant estimators:

• Multivariate M-estimators (Maronna, 76)
• Convex peeling (Barnett, 76; Bebbington, 78)
• Ellipsoid peeling (Titterington, 78; Helbling, 83)
• Iterative trimming (Gnanadesikan and Kettenring, 78)
• Generalized median (Oja, 83)
• …

PROBLEM: all these estimators have a BDP ≤ 1/(p+1).

[Weighted location and covariance estimators:]

T(x) = Σ_i w(u_i) x_i / Σ_i w(u_i)
C(x) = Σ_i w(u_i)(x_i − T(x))(x_i − T(x))′ / Σ_i w(u_i)
Minimum Covariance Determinant (MCD)

Suppose that p = 2 for simplicity: Z = (X, Y)′ ∈ IR², with

Σ = [ σ²_X  σ_XY ; σ_YX  σ²_Y ],  ρ = σ_XY / (σ_X σ_Y).

The generalized variance, defined as

det(Σ) = σ²_X σ²_Y − σ²_YX,

can be seen as a generalization of the variance.

T(X): mean of the 50% of points of X for which the determinant of the empirical covariance matrix is minimal;
C(X): given by the covariance matrix of the same points, multiplied by a factor to obtain consistency.

Properties:
• affine equivariant, BDP = 50%
• asymptotic normality (Butler and Jhun, 1988)
S-estimators

The classical estimators (t_n, C_n) can be obtained by minimizing det(C) under the constraint:

(1/n) Σ_{i=1}^n ( √((x_i − t)′ C⁻¹ (x_i − t)) )² = p,

(t, C) ∈ IR^p × PSD(p), where PSD(p) is the set of all symmetric and positive definite matrices of dimension p×p.

S-estimators (t_n, C_n) can be obtained by minimizing det(C) under the constraint:

(1/n) Σ_{i=1}^n ρ( √((x_i − t)′ C⁻¹ (x_i − t)) ) = b,

(t, C) ∈ IR^p × PSD(p).

[Figure: four ρ functions — Classical, S-median, Biweight S, Most Robust S.]
Robust distances

RD_i = √( (x_i − T(X))′ C(X)⁻¹ (x_i − T(X)) )

where T(X) is a robust multivariate estimator of location and C(X) is a robust estimator of the covariance matrix.

Idea: represent the robust distances graphically. Outliers can be detected by large distances.

How to find the cutoff? Suppose that X ~ N_p(μ, Σ); then Σ^{−1/2}(X − μ) ~ N(0, I). It follows that (x_i − μ)′ Σ⁻¹ (x_i − μ) is the sum of p independent squared standardized normals:

(x_i − μ)′ Σ⁻¹ (x_i − μ) ~ χ²_p

The cut-off is then approximated by the square root of the 0.975 quantile of the χ²_p distribution.
QUANTIFYING ACADEMIC EXCELLENCE: WHAT DOES THE SHANGHAI RANKING MEASURE?

C. Dehon, A. McCathie & V. Verardi
Université libre de Bruxelles, ECARES - CKE
September 2009
Increased competition in Higher Education

ψ_1 = 0.42 Alumni + 0.44 Awards + 0.48 HiCi + 0.50 N&S + 0.38 PUB

What does this component measure? The quality of research?

Variable      Corr(ψ_1, ·)
Alumni        78%
Awards        81%
HiCi          89%
N&S           92%
PUB           70%
Total score   99%

BUT ...
Harvard is an outlier: 18% of ψ_1 is due solely to Harvard. The Top 10 universities account for over 60% of ψ_1!
[Plots of the individual contribution and the cumulated contribution to ψ_1 against the ranking.]
DETECTION OF OUTLIERS — robust distances:

RD_i = √( (x_i − T(X))′ C(X)⁻¹ (x_i − T(X)) )

[Plot of robust Mahalanobis distances against the ranking; Harvard Univ., Stanford Univ., Univ. Cambridge, Princeton Univ. and Univ. Chicago stand out.]
ROBUST PCA based on RMCD ESTIMATORS (Croux and Haesbroeck, 2000)

IDEA: robustify the matrix of correlations by working with robust estimators (MCD, RMCD).

Suppose that p = 2 for simplicity: Z = (X, Y)′ ∈ IR², with

Σ = [ σ²_X  σ_XY ; σ_YX  σ²_Y ],  ρ = σ_XY / (σ_X σ_Y).

The generalized variance (Wilks, 1932), defined as

det(Σ) = σ²_X σ²_Y − σ²_YX,

can be seen as a generalization of the variance.
Minimum Covariance Determinant estimator (Rousseeuw, 1985):

MCD estimators T_n and C_n: for the sample {z_1, …, z_n}, select the subsample {z_{i_1}, …, z_{i_h}} of size h (h ≤ n) with minimum determinant of its covariance matrix; then compute the sample covariance estimator over that subsample. Take h ≈ n/2.

RMCD estimators are defined by

T^R_n = Σ_{i=1}^n w_i z_i / Σ_{i=1}^n w_i

C^R_n = c_2 · Σ_{i=1}^n w_i (z_i − T^R_n)(z_i − T^R_n)ᵗ / Σ_{i=1}^n w_i

where c_2 is a consistency constant and the weights are given by

w_i = 1 if (z_i − T_n)ᵗ C_n⁻¹ (z_i − T_n) ≤ q, and w_i = 0 otherwise.
Two underlying factors are uncovered:

• ψ^R_1 explains 38% of the inertia
• ψ^R_2 explains 28% of the inertia

But what do these two factors represent?

Variable      Corr(ψ_1, ·)   Corr(ψ_2, ·)
Alumni        -20%           80%
Awards        -25%           82%
HiCi           87%            7%
N&S            77%           22%
PUB            68%           -1%
Total score    75%           64%
High sensitivity to the weights attributed to the variables:

SCORE_i = w_i (Alumni + Award) + (1 − w_i)(HiCi + N&S + PUB), with w_i = 0, 0.1, …, 1

Example 1: TOP 10

[Plot of rank against the weight w for Harvard, Stanford, Berkeley, Cambridge, MIT, Caltech, Columbia, Princeton, Chicago and Oxford.]
Example 2: some European universities

[Plot of rank against the weight w for ENS Paris, Moscow, VU Amsterdam, Liverpool, Geneva and Frankfurt.]
USE RANKINGS WITH CAUTION!!
3.2.1 References

Cook, R.D., and Weisberg, S. (1999), Applied Regression including Computing and Graphics, John Wiley and Sons, NY.

Hampel, F.R., Ronchetti, E.M., Rousseeuw, P.J., and Stahel, W.A. (1986), Robust Statistics, John Wiley and Sons, NY.

Heritier, S., Cantoni, E., Copt, S. and Victoria-Feser, M.-P. (2009), Robust Methods in Biostatistics, Chichester, UK: John Wiley and Sons.

Huber, P.J. (1981), Robust Statistics, New York: John Wiley and Sons.

Maronna, R.A., Martin, R.D., and Yohai, V.J. (2006), Robust Statistics, John Wiley and Sons, NY.

Rousseeuw, P.J., and Leroy, A.M. (1987), Robust Regression and Outlier Detection, John Wiley and Sons, NY.
Chapter 4
Correspondence analysis (CA)
4.1 Introduction
Method that displays and summarizes the information contained in a dataset with qualitative variables
CA is conceptually similar to PCA
Can be divided into 2 areas:
Binary correspondence analysis (BCA): Tech-
nique that displays the rows and the columns
of a two-way contingency table
Multiple correspondence analysis (MCA):
Extension of BCA to more than 2 variables
CHAPTER 4. CORRESPONDENCE ANALYSIS (CA) 128
Goals of BCA

Study the associations between the categories of two qualitative variables using the two-way contingency table.

2 qualitative (categorical) variables X and Y:
- X has J categories (or modalities): A_1, …, A_J
- Y has K categories (or modalities): B_1, …, B_K.
Examples

1. In education, can we suppose that the variables concerning the work/study habits of students (regularity and work during the exam period) are coherent?

2. In educational research, can we suppose that the father's level of education tends to be very close to the mother's level of education?
For the students in ULB, the answer is positive:
The methodology can be summed up as follows:

Step 1: Perform PCA on the table of row profiles, where the A_j (j ∈ {1, …, J}) play the role of individuals and the B_k (k ∈ {1, …, K}) the role of variables.

Step 2: Perform PCA on the table of column profiles, where the B_k (k ∈ {1, …, K}) play the role of individuals and the A_j (j ∈ {1, …, J}) the role of variables.

Step 3: Study the links between both PCAs.

Step 4: Plot graphs to show the proximity between row profiles and the proximity between column profiles, and to put forward the relationship between rows and columns.
Generalization of PCA in two directions:

• The weight associated with each individual (category) depends on the observed frequencies:
  - Step 1: the weight allocated to the individual (category) A_j is equal to the frequency of this category (f_{j.})
  - Step 2: the weight assigned to the individual (category) B_k is equal to the frequency of this category (f_{.k})

• In PCA, the distance between observations corresponds to the Euclidean distance. In correspondence analysis, the distance between modalities corresponds to a chi-square type of distance.
4.2 Example

Survey of 1000 workers:

Variable X: Diploma — 3 categories: A_1, A_2, A_3 (primary school, high school, university)
Variable Y: Salary — 3 categories: B_1, B_2, B_3 (low, middle, high)

Two-way contingency table:

n_jk    B_1   B_2   B_3   n_{j.}
A_1     150    40    10    200
A_2     190   350    60    600
A_3      10   110    80    200
n_{.k}  350   500   150   1000
Notations

2 qualitative (categorical) variables X and Y:
- X has J categories (or modalities): A_1, …, A_J
- Y has K categories (or modalities): B_1, …, B_K.

A sample of size n leads to the following two-way contingency table:

X\Y     B_1  …  B_k  …  B_K    Σ_k
A_1     n_11 …  n_1k …  n_1K   n_{1.}
…
A_j     n_j1 …  n_jk …  n_jK   n_{j.}
…
A_J     n_J1 …  n_Jk …  n_JK   n_{J.}
Σ_j     n_{.1} … n_{.k} … n_{.K}   n

where n_jk counts the number of individuals that are in category A_j for the variable X and in category B_k for the variable Y.

Remark: n_{j.} = Σ_{k=1}^K n_jk and n_{.k} = Σ_{j=1}^J n_jk
4.3 Exploratory analysis

Two-way contingency table of relative frequencies F:

Proportion of individuals that belong to category A_j for the variable X and to category B_k for the variable Y:

f_jk = n_jk / n   (j = 1, …, J; k = 1, …, K).

f_jk    B_1    B_2    B_3    f_{j.}
A_1     0.15   0.04   0.01   0.20
A_2     0.19   0.35   0.06   0.60
A_3     0.01   0.11   0.08   0.20
f_{.k}  0.35   0.50   0.15   1

The marginal frequencies are given by:

f_{j.} = n_{j.} / n (j = 1, …, J) and f_{.k} = n_{.k} / n (k = 1, …, K).
To formalize the notion of independence between the two variables X and Y, let us consider that:

• f_jk is the estimate of π_jk = P(X ∈ A_j, Y ∈ B_k)
• f_{j.} is the estimate of π_{j.} = P(X ∈ A_j)
• f_{.k} is the estimate of π_{.k} = P(Y ∈ B_k)
Tables of conditional frequencies

Table of row profiles:

Proportion of individuals that belong to category B_k for the variable Y among the individuals that have the modality A_j for the variable X:

f_{k|j} = n_jk / n_{j.} = (n_jk/n) / (n_{j.}/n) = f_jk / f_{j.}   (j fixed; k = 1, …, K).

f_{k|j} is the estimate of P(Y ∈ B_k | X ∈ A_j).

f_jk/f_{j.}   B_1    B_2    B_3
A_1          0.75   0.20   0.05   1
A_2          0.32   0.58   0.10   1
A_3          0.05   0.55   0.40   1
f_{.k}       0.35   0.50   0.15   1
Table of column profiles:

Proportion of individuals that belong to category A_j for the variable X among the individuals that have the modality B_k for the variable Y:

f_{j|k} = n_jk / n_{.k} = (n_jk/n) / (n_{.k}/n) = f_jk / f_{.k}   (j = 1, …, J; k fixed).

f_{j|k} is the estimate of P(X ∈ A_j | Y ∈ B_k).

f_jk/f_{.k}   B_1    B_2    B_3    f_{j.}
A_1          0.43   0.08   0.07   0.20
A_2          0.54   0.70   0.40   0.60
A_3          0.03   0.22   0.53   0.20
              1      1      1      1
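Both profile tables can be computed directly from the contingency table of the example:

```python
# Sketch: row and column profiles of the diploma-by-salary table.
N = [[150, 40, 10],
     [190, 350, 60],
     [10, 110, 80]]
row_tot = [sum(row) for row in N]                               # n_{j.}
col_tot = [sum(N[j][k] for j in range(3)) for k in range(3)]    # n_{.k}

row_profiles = [[N[j][k] / row_tot[j] for k in range(3)] for j in range(3)]
col_profiles = [[N[j][k] / col_tot[k] for k in range(3)] for j in range(3)]
print(row_profiles[0])   # profile of A_1: [0.75, 0.20, 0.05]
print([round(v, 2) for v in col_profiles[0]])
```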
Independence between X and Y

Two random variables X and Y are independent iff, ∀j ∈ {1, …, J} and ∀k ∈ {1, …, K}:

a) P(X ∈ A_j, Y ∈ B_k) = P(X ∈ A_j) P(Y ∈ B_k)
b) P(Y ∈ B_k | X ∈ A_j) = P(Y ∈ B_k)
c) P(X ∈ A_j | Y ∈ B_k) = P(X ∈ A_j)

At the sample level, these equalities can be estimated by:

a′) f_jk ≈ f_{j.} f_{.k}  ∀j ∈ {1, …, J}, ∀k ∈ {1, …, K}
b′) f_{k|j} = f_jk / f_{j.} ≈ f_{.k}  ∀j, k
c′) f_{j|k} = f_jk / f_{.k} ≈ f_{j.}  ∀j, k.
We can therefore define the theoretical frequencies and relative frequencies under the assumption of independence as follows:

f*_jk = f_{j.} f_{.k}  and  n*_jk = n f*_jk = n_{j.} n_{.k} / n

Observed frequencies:

n_jk    B_1   B_2   B_3   n_{j.}
A_1     150    40    10    200
A_2     190   350    60    600
A_3      10   110    80    200
n_{.k}  350   500   150   1000

Theoretical frequencies under independence:

n*_jk   B_1   B_2   B_3   n_{j.}
A_1      70   100    30    200
A_2     210   300    90    600
A_3      70   100    30    200
n_{.k}  350   500   150   1000
Observed relative frequencies:

f_jk    B_1    B_2    B_3    f_{j.}
A_1     0.15   0.04   0.01   0.20
A_2     0.19   0.35   0.06   0.60
A_3     0.01   0.11   0.08   0.20
f_{.k}  0.35   0.50   0.15   1

Theoretical relative frequencies under independence:

f*_jk   B_1    B_2    B_3    f_{j.}
A_1     0.07   0.10   0.03   0.20
A_2     0.21   0.30   0.09   0.60
A_3     0.07   0.10   0.03   0.20
f_{.k}  0.35   0.50   0.15   1
Attraction/repulsion matrix D

The element (j, k) of the attraction/repulsion matrix D (J×K) is defined by:

d_jk = n_jk / n*_jk = f_jk / f*_jk = f_jk / (f_{j.} f_{.k})

Interpretations:

• d_jk > 1 ⟺ f_jk > f_{j.} f_{.k} ⟺ f_{k|j} > f_{.k} and f_{j|k} > f_{j.}
  The modalities (categories) A_j and B_k are attracted to each other.

• d_jk < 1 ⟺ f_jk < f_{j.} f_{.k} ⟺ f_{k|j} < f_{.k} and f_{j|k} < f_{j.}
  The modalities (categories) A_j and B_k repel each other.
Example

f_jk   B_1    B_2    B_3        f*_jk  B_1    B_2    B_3
A_1    0.15   0.04   0.01       A_1    0.07   0.10   0.03
A_2    0.19   0.35   0.06       A_2    0.21   0.30   0.09
A_3    0.01   0.11   0.08       A_3    0.07   0.10   0.03

d_jk   B_1    B_2    B_3
A_1    2.14   0.40   0.33
A_2    0.90   1.16   0.67
A_3    0.14   1.10   2.67

• A high salary is more frequent for people with a university diploma.
• A high salary is less frequent for people with at most a primary-school diploma.
• A low salary is less frequent for people with a university diploma.
• …
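The matrix D above can be recomputed from the counts:

```python
# Sketch: attraction/repulsion matrix d_jk = f_jk / (f_{j.} * f_{.k}).
N = [[150, 40, 10],
     [190, 350, 60],
     [10, 110, 80]]
n = sum(map(sum, N))
fj = [sum(row) / n for row in N]
fk = [sum(N[j][k] for j in range(3)) / n for k in range(3)]
D = [[(N[j][k] / n) / (fj[j] * fk[k]) for k in range(3)] for j in range(3)]
for row in D:
    print([round(d, 2) for d in row])
```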
Measures of association

The χ² statistic.

Conditions for application:
• n ≥ 30
• n*_jk ≥ 1 ∀j, k
• at least 80% of the n*_jk ≥ 5

If these conditions are not met ⟹ group classes (modalities).

Test statistic:

χ² = Σ_{j=1}^J Σ_{k=1}^K (n_jk − n*_jk)² / n*_jk

Reject the null hypothesis (independence between X and Y) at the level α if

χ² > χ²_{(J−1)(K−1); 1−α}
The statistic φ² = χ²/n:

φ² = Σ_{j=1}^J Σ_{k=1}^K (f_jk − f*_jk)² / f*_jk = Σ_{j=1}^J Σ_{k=1}^K (n_jk/n − n*_jk/n)² / (n*_jk/n)

Remark: using the weights f*_jk for the attraction/repulsion indices (Σ_j Σ_k f*_jk = 1):

d̄ = Σ_j Σ_k f*_jk d_jk = Σ_j Σ_k f*_jk (f_jk / f*_jk) = Σ_j Σ_k f_jk = 1

s²_d = Σ_j Σ_k f*_jk (d_jk − 1)² = χ²/n = φ²

⟹ The dispersion of the attraction/repulsion indices (around their mean) is given by φ².
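Both statistics can be evaluated on the diploma-by-salary example:

```python
# Sketch: chi-square statistic and phi^2 = chi^2 / n for the example.
N = [[150, 40, 10],
     [190, 350, 60],
     [10, 110, 80]]
n = sum(map(sum, N))
nj = [sum(row) for row in N]
nk = [sum(N[j][k] for j in range(3)) for k in range(3)]

chi2 = sum((N[j][k] - nj[j] * nk[k] / n) ** 2 / (nj[j] * nk[k] / n)
           for j in range(3) for k in range(3))
phi2 = chi2 / n
print(round(chi2, 2), round(phi2, 4))   # ≈ 296.76 and 0.2968
```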
4.4 Analysis of row profiles

The point cloud Ω_l of row profiles

To each line A_j of the table of row profiles is associated a point L_j in IR^K with coordinates:

l_j = (f_{1|j}, …, f_{k|j}, …, f_{K|j})′.

A weight f_{j.} (the % of individuals that have the modality A_j) is associated with the row profile l_j (j ∈ {1, …, J}).

⟹ The point cloud Ω_l of observations in IR^K contains the J weighted row profiles:

Ω_l = {(L_1; f_{1.}), (L_2; f_{2.}), …, (L_J; f_{J.})}.
Center of gravity of Ω_l

The coordinates of the center of gravity are given by a weighted mean of the J row profiles:

g_l = Σ_{j=1}^J f_{j.} l_j

Consequently, the k-th coordinate of g_l is:

Σ_{j=1}^J f_{j.} f_{k|j} = Σ_{j=1}^J f_{j.} (f_jk / f_{j.}) = Σ_{j=1}^J f_jk = f_{.k}

g_l = (f_{.1}, …, f_{.K})′

The center of gravity G_l of the J (weighted) row profiles is equal to the marginal profile (the % of individuals having the modality B_k).
The χ² distance in IR^K

Definition: the χ² distance in IR^K between two points X and Y with coordinates (x_1, …, x_K) and (y_1, …, y_K) is given by:

d²_χ²(X, Y) = Σ_{k=1}^K (x_k − y_k)² / f_{.k}

The Euclidean distance gives the same weight to each column. The χ² distance weights column k by 1/f_{.k}, giving each column an importance inversely proportional to the frequency of B_k.
Total inertia of Ω_l

Total inertia based on the χ² distance and the weighted row profiles in IR^K:

I_χ²(Ω_l, G_l) = Σ_{j=1}^J f_{j.} d²_χ²(L_j, G_l)
 = Σ_j f_{j.} Σ_k (1/f_{.k}) (f_{k|j} − f_{.k})²
 = Σ_j f_{j.} Σ_k (1/f_{.k}) (f_jk/f_{j.} − f_{.k})²
 = Σ_j Σ_k (f_{j.}/f_{.k}) ((f_jk − f_{j.} f_{.k}) / f_{j.})²
 = Σ_j Σ_k (f_jk − f_{j.} f_{.k})² / (f_{j.} f_{.k})
 = φ² = χ²/n

⟹ This explains why this distance is called the chi-square distance!
Interpretation of the inertia:

• It measures the dependence between the two qualitative variables X and Y.
• This measure is independent of the sample size n.
• I_χ²(Ω_l, G_l) = 0 means that all row profiles L_1, …, L_J are equal to the center of gravity G_l:
  ∀k ∈ {1, …, K} and ∀j ∈ {1, …, J},
  f_{k|j} = f_{.k} ⟺ f_jk / f_{j.} = f_{.k} ⟺ f_jk = f_{j.} f_{.k},
  leading to the independence of X and Y.
4.5 Step 1: PCA on the row profiles Ω_l

Same methodology as PCA applied to quantitative variables, with two modifications:

• The weights of the individuals (categories) are not all equal: the weight of A_j is equal to f_{j.}
• The distance used to measure the proximity between two individuals is the χ² distance.

PCA is not performed directly on Ω_l = {(L_1, f_{1.}), …, (L_J, f_{J.})} but on a normalized point cloud Ω*_l:

Ω*_l = {(L*_1, f_{1.}), …, (L*_J, f_{J.})}

where the coordinates of L*_j are given by:

l*_j = ( (f_j1/f_{j.} − f_{.1}) / √f_{.1}, …, (f_jK/f_{j.} − f_{.K}) / √f_{.K} )′

The center of gravity of Ω*_l is the origin.
First projecting direction $\Delta_1$

The first projecting direction $\Delta_1$ is the direction passing through the origin that fits the point cloud $\mathcal{N}_l^*$ in an optimal way in terms of inertia:

$I(\mathcal{N}_l^*, \Delta_1) = \min_{\Delta\,:\,\text{direction through the origin}} I(\mathcal{N}_l^*, \Delta)$

where $I(\mathcal{N}_l^*, \Delta) = \sum_{j=1}^{J} f_{j.}\, d^2(L_j^*, P_{\Delta}(L_j^*))$.

Problem: Find the direction given by the vector $u_1$ such that $I(0, P_{\Delta_1}(L_j^*))$ is maximized:

$\max \sum_{j=1}^{J} f_{j.}\, d^2(0, P_{\Delta_1}(L_j^*))$

under the constraint $\|u_1\| = 1$.
It is again a problem of maximization under constraint and, as in PCA, the solution is given by the eigenvalues and eigenvectors of the matrix:

$V = \sum_{j=1}^{J} f_{j.}\, l_j^* (l_j^*)'$

⟹ $u_1$ is the eigenvector associated with the largest eigenvalue $\lambda_1 = I(0, P_{\Delta_1}(L_j^*))$.

Note that the element $(k, k')$ of the matrix V (K × K) is given by:

$v_{kk'} = \sum_{j=1}^{J} \left(\frac{f_{jk} - f_{j.} f_{.k}}{\sqrt{f_{j.}}\sqrt{f_{.k}}}\right)\left(\frac{f_{jk'} - f_{j.} f_{.k'}}{\sqrt{f_{j.}}\sqrt{f_{.k'}}}\right)$

which yields $V = X'X$ with the elements of X (J × K) given by:

$x_{jk} = \frac{f_{jk} - f_{j.}\, f_{.k}}{\sqrt{f_{j.}\, f_{.k}}}$
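The matrix X and the eigendecomposition of $V = X'X$ can be sketched as follows (same made-up table as before; this is an illustration, not the course's own code):

```python
import numpy as np

# Illustrative contingency table (made-up counts).
N = np.array([[20., 10., 10.],
              [10., 30., 20.],
              [ 5.,  5., 40.]])
F = N / N.sum()
f_row, f_col = F.sum(axis=1), F.sum(axis=0)

# x_jk = (f_jk - f_j. f_.k) / sqrt(f_j. f_.k)
X = (F - np.outer(f_row, f_col)) / np.sqrt(np.outer(f_row, f_col))

V = X.T @ X                       # K x K matrix to diagonalize
eigvals = np.linalg.eigvalsh(V)   # ascending order
lam = eigvals[::-1]               # lambda_1 >= lambda_2 >= ...
```

The eigenvalues sum to the total inertia $\Phi^2$ (the trace of V), and V always has a zero eigenvalue because X annihilates the vector $(\sqrt{f_{.1}}, \ldots, \sqrt{f_{.K}})'$, which is why $H \le \min(J-1, K-1)$.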
First principal component

To create the first principal component $\gamma_1$, the point cloud $\mathcal{N}_l^*$ is projected on $\Delta_1$:

$P_{\Delta_1}(\mathcal{N}_l^*) = \{P_{\Delta_1}(L_1^*), \ldots, P_{\Delta_1}(L_J^*)\}$

The coordinate for each point associated with modality $A_j$ ($j = 1, \ldots, J$) is given by:

$\gamma_{1,j} = \|OP_{\Delta_1}(L_j^*)\| = \langle OL_j^*, u_1 \rangle = \sum_{k=1}^{K} u_{1,k} (l_j^*)_k = u_{1,1}(l_j^*)_1 + u_{1,2}(l_j^*)_2 + \ldots + u_{1,K}(l_j^*)_K$

Then $\gamma_{1,j}$ is the value of the row profile j (associated with $A_j$) on the first principal component.

It can be proven that:
- $\gamma_1$ is centered: $\sum_{j=1}^{J} f_{j.}\, \gamma_{1,j} = 0$
- the variance of $\gamma_1$ is equal to $\lambda_1$
Global quality of the first principal component

Using the decomposition of the total inertia, it can be shown that the percentage of inertia that is kept by projecting on $\Delta_1$ is given by $\lambda_1 / \Phi^2$, since

$I(\mathcal{N}_l^*, 0) = I(\mathcal{N}_l^*, \Delta_1) + I(0, P_{\Delta_1}(L_j^*))$

Contribution of modality $A_j$ ($j = 1, \ldots, J$)

Knowing that
$\lambda_1 = s^2_{\gamma_1} = \sum_{j=1}^{J} f_{j.}\, \gamma^2_{1,j} = \sum_{j=1}^{J} f_{j.}\, d^2(0, P_{\Delta_1}(L_j^*))$
the contribution of the modality $A_j$ is given by:

$CTR_1(A_j) = \frac{f_{j.}\, \gamma^2_{1,j}}{\lambda_1}$

⟹ The interpretation of $\gamma_1$ is mainly based on the modalities $A_j$ that have a high contribution.
Quality of representation on the first axis

The quality of representation of the row profile $L_j^*$ on the first axis $\Delta_1$ is measured by the squared cosine of the angle formed by the vector $OL_j^*$ and the axis $\Delta_1$:

$\cos^2(OL_j^*, \Delta_1) = \left(\frac{\langle OL_j^*, u_1 \rangle}{\|OL_j^*\|\,\|u_1\|}\right)^2 = \frac{\gamma^2_{1,j}}{\|OL_j^*\|^2}$

This formula does not contain the weight $f_{j.}$ ⟹ one modality can be:
- close to the axis $\Delta_1$ and therefore be well represented (well explained),
- yet have a low contribution to the axis because of a low weight $f_{j.}$.
Extended dimensions

The second projecting axis $\Delta_2$ is defined by the vector $u_2$:
- through the origin (the center of gravity),
- orthogonal to $u_1$ ($u_2 \perp u_1$),
- minimizing the residual inertia.

⟹ $u_2$ is the eigenvector of V associated with the second largest eigenvalue $\lambda_2$.

In the same way, we can find the other projecting axes $\Delta_3, \Delta_4, \ldots$
How many principal components?

$\mathcal{N}_l^*$ is contained in a space of dimension
$H \le \min(J - 1, K - 1)$
where H is equal to the rank of the matrix V (K × K).

4.6 Step 2: PCA on the column profiles

The point cloud of column profiles is
$\mathcal{N}_c = \{(C_1; f_{.1}), (C_2; f_{.2}), \ldots, (C_K; f_{.K})\}$
where the point $C_k$ in $\mathbb{R}^J$ has coordinates:

$c_k = (f_{1|k}, \ldots, f_{j|k}, \ldots, f_{J|k})'$

Instead of working directly with this point cloud, we prefer to transform it such that the center of gravity is the origin:
$\mathcal{N}_c^* = \{(C_1^*, f_{.1}), \ldots, (C_K^*, f_{.K})\}$
where $C_k^*$ has the coordinates:

$c_k^* = \left(\frac{f_{1|k} - f_{1.}}{\sqrt{f_{1.}}}, \ldots, \frac{f_{J|k} - f_{J.}}{\sqrt{f_{J.}}}\right)'$
Projecting directions

The projecting directions $\Theta_1, \ldots, \Theta_H$ of $\mathcal{N}_c^*$ are defined by the orthogonal eigenvectors $v_1, \ldots, v_H$ of the matrix

$W = XX'$

associated with the $H\ (= \min(J - 1, K - 1))$ non-zero eigenvalues $\lambda_1, \ldots, \lambda_H$; $v_1$ is associated with the largest eigenvalue, and so on.

The elements of the matrix X (J × K) are defined as:

$x_{jk} = \frac{f_{jk} - f_{j.}\, f_{.k}}{\sqrt{f_{j.}\, f_{.k}}}$

The eigenvalues of W are the same as the eigenvalues of V.
Principal components

The principal components $\delta_1, \ldots, \delta_H$ are defined, for $k = 1, \ldots, K$, by:

$\delta_{h,k} = \|OP_{\Theta_h}(C_k^*)\| = \langle OC_k^*, v_h \rangle = \sum_{j=1}^{J} v_{h,j} (c_k^*)_j = v_{h,1}(c_k^*)_1 + v_{h,2}(c_k^*)_2 + \ldots + v_{h,J}(c_k^*)_J$

Properties of the principal components $\delta_1, \delta_2, \ldots, \delta_H$: $\forall h \in \{1, \ldots, H\}$
- Principal components are centered: $\sum_{k=1}^{K} f_{.k}\, \delta_{h,k} = 0$
- The variance of $\delta_h$ is given by $\lambda_h$
- Principal components are uncorrelated.
Global quality of $\delta_h$

The percentage of inertia that is kept when projecting on $\Theta_h$ is given by $\lambda_h / \Phi^2$.

Contribution of modality $B_k$ ($k = 1, \ldots, K$)

Knowing that
$\lambda_h = s^2_{\delta_h} = \sum_{k=1}^{K} f_{.k}\, \delta^2_{h,k}$
the contribution of the modality $B_k$ is given by:

$CTR_h(B_k) = \frac{f_{.k}\, \delta^2_{h,k}}{\lambda_h}$

Quality of the representation of $C_k^*$ on $\Theta_h$:

$\cos^2(OC_k^*, \Theta_h) = \left(\frac{\langle OC_k^*, v_h \rangle}{\|OC_k^*\|\,\|v_h\|}\right)^2 = \frac{\delta^2_{h,k}}{\|OC_k^*\|^2}$
4.7 Step 3: Links between both PCAs

The analysis of the point cloud $\mathcal{N}_c^*$ can be deduced from the analysis of the point cloud $\mathcal{N}_l^*$, and vice versa.

⟹ The possibility to study the associations between the two variables is due to the links between the two analyses.
Row profiles $\mathcal{N}_l^*$: $\mathbb{R}^K$ — Column profiles $\mathcal{N}_c^*$: $\mathbb{R}^J$

$(\lambda_h, u_h)$, $h = 1, \ldots, H$, and $(\lambda_h, v_h)$, $h = 1, \ldots, H$, are the eigenvalues and eigenvectors of
$V = X'X$ and $W = XX'$
respectively, leading to the relations

$V u_h = \lambda_h u_h$ and $W v_h = \lambda_h v_h$

Hence we have
$X'X u_h = \lambda_h u_h$ and $XX' v_h = \lambda_h v_h$
$XX'X u_h = \lambda_h X u_h$ and $X'XX' v_h = \lambda_h X' v_h$
$W (X u_h) = \lambda_h (X u_h)$ and $V (X' v_h) = \lambda_h (X' v_h)$

⟹ $X u_h$ is an eigenvector of W and $X' v_h$ is an eigenvector of V.

The norm of these vectors is given by
$\|X u_h\| = \sqrt{\lambda_h}$ and $\|X' v_h\| = \sqrt{\lambda_h}$
so the normed eigenvectors associated to $\lambda_h$ are:
$\frac{1}{\sqrt{\lambda_h}} X u_h$ and $\frac{1}{\sqrt{\lambda_h}} X' v_h$

To conclude, we have the following transition relations:

$v_h = \frac{1}{\sqrt{\lambda_h}} X u_h$  and  $u_h = \frac{1}{\sqrt{\lambda_h}} X' v_h$
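These transition relations are exactly the structure delivered by a singular value decomposition of X, which can be verified numerically (same made-up table as in the earlier sketches):

```python
import numpy as np

N = np.array([[20., 10., 10.],
              [10., 30., 20.],
              [ 5.,  5., 40.]])
F = N / N.sum()
f_row, f_col = F.sum(axis=1), F.sum(axis=0)
X = (F - np.outer(f_row, f_col)) / np.sqrt(np.outer(f_row, f_col))

# SVD X = U diag(s) Vt: the rows of Vt are the u_h (eigenvectors of V = X'X),
# the columns of U are the v_h (eigenvectors of W = XX'), s_h = sqrt(lambda_h).
U, s, Vt = np.linalg.svd(X, full_matrices=False)
lam1 = s[0] ** 2
v1 = X @ Vt[0] / np.sqrt(lam1)      # transition: v_1 = X u_1 / sqrt(lambda_1)
u1 = X.T @ U[:, 0] / np.sqrt(lam1)  # transition: u_1 = X' v_1 / sqrt(lambda_1)
```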
These relations between both PCAs lead (after some developments) to a relation between the attraction/repulsion index and the coordinates of the modalities in the two new systems.

The distance of the couple $(A_j, B_k)$ to the independence situation is measured by:

$\frac{f_{jk}}{f_{j.}\, f_{.k}} = 1 + \sum_{h=1}^{H} \frac{1}{\sqrt{\lambda_h}}\, \gamma_{h,j}\, \delta_{h,k}$

that is,

$d_{jk} = 1 + \sum_{h=1}^{H} \frac{1}{\sqrt{\lambda_h}}\, \gamma_{h,j}\, \delta_{h,k}$
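This reconstitution of the attraction/repulsion index can be checked numerically. The coordinates $\gamma$ and $\delta$ are computed from the SVD of X; the table is made up for illustration:

```python
import numpy as np

N = np.array([[20., 10., 10.],
              [10., 30., 20.],
              [ 5.,  5., 40.]])
F = N / N.sum()
f_row, f_col = F.sum(axis=1), F.sum(axis=0)
X = (F - np.outer(f_row, f_col)) / np.sqrt(np.outer(f_row, f_col))

U, s, Vt = np.linalg.svd(X, full_matrices=False)
pos = s > 1e-12                    # keep the H non-zero singular values
# row coordinates gamma[j, h] and column coordinates delta[k, h]
gamma = (U[:, pos] * s[pos]) / np.sqrt(f_row)[:, None]
delta = (Vt[pos].T * s[pos]) / np.sqrt(f_col)[:, None]

# d_jk = 1 + sum_h gamma_{h,j} delta_{h,k} / sqrt(lambda_h)
d = 1 + (gamma / s[pos]) @ delta.T
```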
- The point cloud of row profiles $\mathcal{N}_l^*$ is projected on the first factorial plane $(\Delta_1, \Delta_2)$.
- The point cloud of column profiles $\mathcal{N}_c^*$ is projected on the first factorial plane $(\Theta_1, \Theta_2)$.

⟹ Simultaneous representation of the modalities $A_1, \ldots, A_J$ and $B_1, \ldots, B_K$:

The modality $A_j$ is associated to $A_j^*$, which has coordinates $(\gamma_{1,j}, \gamma_{2,j})'$, and the modality $B_k$ is associated to $B_k^*$, which has coordinates $(\delta_{1,k}, \delta_{2,k})'$.
Interpretation of the projections on $\Delta_1$, $\Theta_1$

If $\cos^2(OL_j^*, \Delta_1)$ is close to one ⟹ the profile $L_j^*$ is close to its projection $P_{\Delta_1}(L_j^*)$ on $\Delta_1$:

$l_j^* = \sum_{h=1}^{H} \gamma_{h,j}\, u_h \implies l_j^* \approx \gamma_{1,j}\, u_1$

This implies that $\forall k \in \{1, \ldots, K\}$:

$d_{jk} = \frac{f_{jk}}{f_{j.}\, f_{.k}} \approx 1 + \frac{1}{\sqrt{\lambda_1}}\, \gamma_{1,j}\, \delta_{1,k}$

We can therefore say that:
- The modalities $A_j$ and $B_k$ attract each other ($d_{jk} > 1$)
  - if $\gamma_{1,j} > 0$ and $\delta_{1,k} > 0$,
  - or if $\gamma_{1,j} < 0$ and $\delta_{1,k} < 0$.
- The modalities $A_j$ and $B_k$ repulse each other ($d_{jk} < 1$)
  - if $\gamma_{1,j} > 0$ and $\delta_{1,k} < 0$,
  - or if $\gamma_{1,j} < 0$ and $\delta_{1,k} > 0$.
Interpretation of the first principal map

If $\cos^2(OL_j^*, (\Delta_1, \Delta_2))$ is close to one ⟹ the profile $L_j^*$ is close to its projection $P_{(\Delta_1, \Delta_2)}(L_j^*)$:

$l_j^* = \sum_{h=1}^{H} \gamma_{h,j}\, u_h \implies l_j^* \approx \gamma_{1,j}\, u_1 + \gamma_{2,j}\, u_2$

This implies that $\forall k \in \{1, \ldots, K\}$:

$d_{jk} = \frac{f_{jk}}{f_{j.}\, f_{.k}} \approx 1 + \frac{1}{\sqrt{\lambda_1}}\, \gamma_{1,j}\, \delta_{1,k} + \frac{1}{\sqrt{\lambda_2}}\, \gamma_{2,j}\, \delta_{2,k}$

Therefore:
- The modalities $A_j$ and $B_k$ attract each other ($d_{jk} > 1$) if $A_j^*$ and $B_k^*$ belong to the same quadrant.
- The modalities $A_j$ and $B_k$ repulse each other ($d_{jk} < 1$) if $A_j^*$ and $B_k^*$ are in opposite quadrants.
- We cannot conclude if $A_j^*$ and $B_k^*$ belong to adjacent quadrants.
[Figure: three panels in the first factorial plane (axes Gamma1/Delta1 and Gamma2/Delta2), showing positions of $A_j^*$ and $B_k^*$: "Attraction ($d_{jk} > 1$)", "Repulsion ($d_{jk} < 1$)", "No conclusion".]
If a modality $A_j^*$ is well represented on the first factorial plane, it is possible to determine graphically whether this modality is attracted or repulsed by some modalities $B_k^*$.
4.8.2 Barycentric representation

In case of uncertainty about the attraction/repulsion between modalities, this representation can give an answer.

The attraction/repulsion indices are given by:

$d_{jk} = 1 + \sum_{h=1}^{H} \frac{1}{\sqrt{\lambda_h}}\, \gamma_{h,j}\, \delta_{h,k}$

⟹ we are going to use the standardized principal components $\tilde{\delta}_h$ instead of $\delta_h$:

$\tilde{\delta}_h = \frac{\delta_h}{\sqrt{\lambda_h}}$

⟹ Superposition of both PCAs:
- the row profile $A_j$ is associated to $A_j^*$, which has coordinates $(\gamma_{1,j}, \gamma_{2,j})'$;
- the column profile $B_k$ is associated to $\tilde{B}_k^*$, which has coordinates $(\tilde{\delta}_{1,k}, \tilde{\delta}_{2,k})' = \left(\frac{\delta_{1,k}}{\sqrt{\lambda_1}}, \frac{\delta_{2,k}}{\sqrt{\lambda_2}}\right)'$.
Interpretation for the first factorial plane

If a modality $A_j$ is well represented on the first principal plane $(\Delta_1, \Delta_2)$:

$d_{jk} \approx 1 + \gamma_{1,j}\,\tilde{\delta}_{1,k} + \gamma_{2,j}\,\tilde{\delta}_{2,k} = 1 + \langle OA_j^*, O\tilde{B}_k^* \rangle$

where $\langle \cdot, \cdot \rangle$ is the usual scalar product in $\mathbb{R}^2$.

We can therefore say that:
- The modalities $A_j$ and $B_k$ attract each other ($d_{jk} > 1$) if the angle between $OA_j^*$ and $O\tilde{B}_k^*$ is acute ($\langle OA_j^*, O\tilde{B}_k^* \rangle$ is then positive).
- The modalities $A_j$ and $B_k$ repulse each other ($d_{jk} < 1$) if the angle between $OA_j^*$ and $O\tilde{B}_k^*$ is obtuse ($\langle OA_j^*, O\tilde{B}_k^* \rangle$ is then negative).
[Figure: two panels in the first factorial plane (axes Gamma1/Delta1 and Gamma2/Delta2), showing $A_j^*$ and $\tilde{B}_k^*$: "Attraction (acute angle)", "Repulsion (obtuse angle)".]
These are examples where no conclusion can be drawn with the pseudo-barycentric representation. With the barycentric representation, however, the rule is:

Draw the line $\Delta_{A_j^*}$ which passes through the origin and which is orthogonal to $OA_j^*$. This line separates the space into two parts: the modalities $\tilde{B}_k^*$ that are on the same side as $A_j^*$ are attracted by it, and the modalities on the other side are repulsed by $A_j^*$.
4.8.3 Biplot

The angles between the modalities and the factors yield most of the information. We therefore introduce a new representation in which the coordinates of the row profiles are divided by $\sqrt{\lambda_h}$. This leads to a better visibility of the first principal plane.

⟹ Simultaneous representation of the modalities $A_1, \ldots, A_J$ and $B_1, \ldots, B_K$ in the first principal map:
- The modality $A_j$ is associated to $\tilde{A}_j^*$, which has coordinates $(\tilde{\gamma}_{1,j}, \tilde{\gamma}_{2,j})' = \left(\frac{\gamma_{1,j}}{\sqrt{\lambda_1}}, \frac{\gamma_{2,j}}{\sqrt{\lambda_2}}\right)'$.
- The modality $B_k$ is associated to $B_k^*$, which has coordinates $(\delta_{1,k}, \delta_{2,k})'$.

This type of standardization is called BIPLOT.
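The three scalings (pseudo-barycentric, barycentric, biplot) differ only in which set of coordinates is divided by $\sqrt{\lambda_h}$. A sketch, using the same made-up table as the earlier examples:

```python
import numpy as np

N = np.array([[20., 10., 10.],
              [10., 30., 20.],
              [ 5.,  5., 40.]])
F = N / N.sum()
f_row, f_col = F.sum(axis=1), F.sum(axis=0)
X = (F - np.outer(f_row, f_col)) / np.sqrt(np.outer(f_row, f_col))
U, s, Vt = np.linalg.svd(X, full_matrices=False)
pos = s > 1e-12
gamma = (U[:, pos] * s[pos]) / np.sqrt(f_row)[:, None]   # row coordinates
delta = (Vt[pos].T * s[pos]) / np.sqrt(f_col)[:, None]   # column coordinates

# pseudo-barycentric plot uses (gamma, delta);
# barycentric rescales the columns: delta_tilde = delta / sqrt(lambda_h);
# biplot rescales the rows:        gamma_tilde = gamma / sqrt(lambda_h).
delta_tilde = delta / s[pos]
gamma_tilde = gamma / s[pos]
```

In the barycentric scaling the scalar product $\langle OA_j^*, O\tilde{B}_k^* \rangle$ reproduces $d_{jk} - 1$, which is what makes the acute/obtuse-angle rule exact over all H axes.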
Chapter 5
Multiple correspondence analysis (MCA)

Extension of BCA to more than 2 variables.

Goal: Analysis of an n × P table of individuals × qualitative variables.

Method: apply BCA to a table called the complete disjunctive table.
CHAPTER 5. MULTIPLE CORRESPONDENCE ANALYSIS (MCA) 175
5.1 Data, tables and distances

5.1.1 The complete disjunctive table

Example
- 4 individuals: n = 4
- 3 variables: P = 3
  - $Y_1$: gender, 2 modalities: $K_1 = 2$ (male = 1, female = 2)
  - $Y_2$: civil status, 3 modalities: $K_2 = 3$ (single = 1, married = 2, divorced or widowed = 3)
  - $Y_3$: level of education, 2 modalities: $K_3 = 2$ (primary or secondary school = 1, higher or university diploma = 2)
- $K = K_1 + K_2 + K_3 = 2 + 3 + 2 = 7$
Logic table (the modalities are coded):

  n\P | Y1  Y2  Y3
   1  |  2   1   1
   2  |  2   1   2
   3  |  1   3   2
   4  |  2   2   1

Complete disjunctive table (CDT):

        |      X1   |      X2      |    X3   |
        | X11  X12  | X21 X22 X23  | X31 X32 | Σ
   1    |  0    1   |  1   0   0   |  1   0  | 3
   2    |  0    1   |  1   0   0   |  0   1  | 3
   3    |  1    0   |  0   0   1   |  0   1  | 3
   4    |  0    1   |  0   1   0   |  1   0  | 3
  n_pl  |  1    3   |  2   1   1   |  2   2  | 12
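Building the CDT from the logic table is a one-hot encoding of each variable. A minimal sketch on the slide's 4 × 3 example:

```python
import numpy as np

# Logic table from the example: 4 individuals x 3 variables (coded modalities)
Y = np.array([[2, 1, 1],
              [2, 1, 2],
              [1, 3, 2],
              [2, 2, 1]])
K_p = [2, 3, 2]   # number of modalities per variable

# Complete disjunctive table: one dummy column per modality of each variable
blocks = []
for p, K in enumerate(K_p):
    block = np.zeros((Y.shape[0], K), dtype=int)
    block[np.arange(Y.shape[0]), Y[:, p] - 1] = 1   # modality codes start at 1
    blocks.append(block)
X = np.hstack(blocks)   # 4 x 7 table
```

Each row of X sums to P = 3 (one modality per variable) and the column sums reproduce the counts $n_{pl}$ of the table above.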
Notations:
- n individuals, P variables: $Y_1, \ldots, Y_P$.
- The variable $Y_p$ has $K_p$ modalities ⟹ $K = \sum_{p=1}^{P} K_p$ is the total number of modalities in the dataset.
- $n_{pl}$: number of individuals having the modality l of the variable $Y_p$.
- $x_{ipl} = 1$ if individual i has modality l of $Y_p$, and 0 otherwise.
- $X_{pl}$: dummy (binary) variable associated with modality l of $Y_p$.
- $X_p = (X_{p1}, \ldots, X_{pK_p})$: vector of the dummy variables of $Y_p$.

The following relations hold:

$\sum_{l=1}^{K_p} n_{pl} = n$  and  $\sum_{p=1}^{P} \sum_{l=1}^{K_p} n_{pl} = nP$
Table of dummy variables $X_p$ associated to $Y_p$ (n rows, $K_p$ columns):

          1     ...   l     ...   K_p   | row sum
  1   | x_{1p1} ... x_{1pl} ... x_{1pK_p} |   1
  ... |
  i   | x_{ip1} ... x_{ipl} ... x_{ipK_p} |   1
  ... |
  n   | x_{np1} ... x_{npl} ... x_{npK_p} |   1
 Σ_i  |  n_{p1} ...  n_{pl} ...  n_{pK_p} |   n

Complete disjunctive table $X = (X_1, \ldots, X_P)$, obtained by placing the P blocks side by side: block p is the n × $K_p$ table $x_p$ above. Each row of X sums to P and the grand total is nP.
5.1.2 Row and column profiles, attraction/repulsion indices

MCA on $Y_1, \ldots, Y_P$ = BCA on the complete disjunctive table.

Relative frequencies of the complete disjunctive table:

$f_{ipl} = \frac{x_{ipl}}{nP}$

where the marginal relative frequencies are given by:

$f_{i..} = \frac{1}{n}$  and  $f_{.pl} = \frac{n_{pl}}{nP}$
Row profile $L_i$ of individual i: $l_i$ (1 × K); the coordinate pl of the row profile i is:

$(l_i)_{pl} = \frac{f_{ipl}}{f_{i..}} = \frac{x_{ipl}/nP}{1/n} = \frac{x_{ipl}}{P}$,  $p = 1, \ldots, P$; $l = 1, \ldots, K_p$

Column profile $C_{pl}$ associated to the modality l of $Y_p$: $c_{pl}$ (n × 1); the coordinate i of the column profile pl is:

$(c_{pl})_i = \frac{f_{ipl}}{f_{.pl}} = \frac{x_{ipl}/nP}{n_{pl}/nP} = \frac{x_{ipl}}{n_{pl}}$,  $i = 1, \ldots, n$

Notations:
- $(l_i)_{pl}$: coordinate pl of the row profile i
- $(c_{pl})_i$: coordinate i of the column profile pl
Example

Row profiles table:

        | X11  X12  X21  X22  X23  X31  X32 | Σ
   1    |  0   1/3  1/3   0    0   1/3   0  | 1
   2    |  0   1/3  1/3   0    0    0   1/3 | 1
   3    | 1/3   0    0    0   1/3   0   1/3 | 1
   4    |  0   1/3   0   1/3   0   1/3   0  | 1
 f_.pl  | 1/12 3/12 2/12 1/12 1/12 2/12 2/12 | 1

Column profiles table:

        | X11  X12  X21  X22  X23  X31  X32 | f_i..
   1    |  0   1/3  1/2   0    0   1/2   0  | 1/4
   2    |  0   1/3  1/2   0    0    0   1/2 | 1/4
   3    |  1    0    0    0    1    0   1/2 | 1/4
   4    |  0   1/3   0    1    0   1/2   0  | 1/4
   Σ    |  1    1    1    1    1    1    1  |
Attraction/repulsion index between individual i and modality l of $Y_p$:

$d_{i,pl} = \frac{f_{ipl}}{f_{i..}\, f_{.pl}} = \frac{x_{ipl}/nP}{\frac{1}{n} \cdot \frac{n_{pl}}{nP}} = \frac{x_{ipl}}{n_{pl}/n}$

As $x_{ipl} \in \{0, 1\}$ and $n_{pl}/n \le 1$, we have:
- $d_{i,pl} = 0$ if $x_{ipl} = 0$
- $d_{i,pl} = \frac{n}{n_{pl}} \ge 1$ if $x_{ipl} = 1$

Interpretation: If an individual i has the modality l of the variable $Y_p$, then the attraction/repulsion index $d_{i,pl}$ increases as the modality l of the variable $Y_p$ becomes rare ($n_{pl}$ small).
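These indices are immediate from the CDT. A sketch on the running 4 × 7 example:

```python
import numpy as np

# CDT of the running example (4 individuals x 7 dummy columns)
X = np.array([[0, 1, 1, 0, 0, 1, 0],
              [0, 1, 1, 0, 0, 0, 1],
              [1, 0, 0, 0, 1, 0, 1],
              [0, 1, 0, 1, 0, 1, 0]])
n = X.shape[0]
n_pl = X.sum(axis=0)

# d_{i,pl} = x_ipl / (n_pl / n): zero where x_ipl = 0, n / n_pl where x_ipl = 1
d = X * (n / n_pl)
```

The rarer a modality, the larger the index for the individuals that hold it.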
5.1.3 Point cloud and distances between row profiles

Point cloud:
- n row profiles $L_1, \ldots, L_n$
- in $\mathbb{R}^K$ where $K = \sum_{p=1}^{P} K_p$
- with weight 1/n
- and the $\chi^2$ distance.

The center of gravity $G_l$ has coordinate pl ($p = 1, \ldots, P$; $l = 1, \ldots, K_p$) given by:

$\sum_{i=1}^{n} \frac{1}{n} (l_i)_{pl} = \frac{1}{nP} \sum_{i=1}^{n} x_{ipl} = \frac{n_{pl}}{nP}$

⟹ $G_l$ is the marginal profile (marginal relative profile).
Properties

Distance between individuals (row profiles):

$d^2_{\chi^2}(L_{i_1}, L_{i_2}) = \sum_{p=1}^{P} \sum_{l=1}^{K_p} \frac{1}{f_{.pl}} \left((l_{i_1})_{pl} - (l_{i_2})_{pl}\right)^2$
$= \sum_{p=1}^{P} \sum_{l=1}^{K_p} \frac{1}{\frac{n_{pl}}{nP}} \left(\frac{x_{i_1 pl}}{P} - \frac{x_{i_2 pl}}{P}\right)^2$
$= \frac{n}{P} \sum_{p=1}^{P} \sum_{l=1}^{K_p} \frac{1}{n_{pl}} (x_{i_1 pl} - x_{i_2 pl})^2$

Interpretation: The distance between 2 individuals is small if they have many modalities in common.
Example

Distance between individual 1 (female, single, with a primary or secondary diploma) and individual 2 (female, single, with a higher or university diploma):

$d^2_{\chi^2}(L_1, L_2) = \sum_{p=1}^{3} \sum_{l=1}^{K_p} \frac{1}{f_{.pl}} \left((l_1)_{pl} - (l_2)_{pl}\right)^2$
$= 12(0-0)^2 + 4\left(\tfrac{1}{3}-\tfrac{1}{3}\right)^2 + 6\left(\tfrac{1}{3}-\tfrac{1}{3}\right)^2 + 12(0-0)^2 + 12(0-0)^2 + 6\left(\tfrac{1}{3}-0\right)^2 + 6\left(0-\tfrac{1}{3}\right)^2 = \frac{4}{3} \approx 1.33$

Another way to compute it:

$d^2_{\chi^2}(L_1, L_2) = \frac{n}{P} \sum_{p=1}^{3} \sum_{l=1}^{K_p} \frac{1}{n_{pl}} (x_{1pl} - x_{2pl})^2$
$= \frac{4}{3}\left(1(0-0)^2 + \tfrac{1}{3}(1-1)^2 + \tfrac{1}{2}(1-1)^2 + 1(0-0)^2 + 1(0-0)^2 + \tfrac{1}{2}(1-0)^2 + \tfrac{1}{2}(0-1)^2\right) = \frac{4}{3} \approx 1.33$
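The second version of the formula is the easiest to code; on the example it reproduces the value 4/3 ≈ 1.33:

```python
import numpy as np

X = np.array([[0, 1, 1, 0, 0, 1, 0],
              [0, 1, 1, 0, 0, 0, 1],
              [1, 0, 0, 0, 1, 0, 1],
              [0, 1, 0, 1, 0, 1, 0]])
n, P = X.shape[0], 3
n_pl = X.sum(axis=0)

def d2_chi2(i1, i2):
    """Squared chi-2 distance between two row profiles of the CDT:
    (n/P) * sum_pl (x_{i1,pl} - x_{i2,pl})^2 / n_pl"""
    return (n / P) * np.sum((X[i1] - X[i2]) ** 2 / n_pl)

d2_12 = d2_chi2(0, 1)   # individuals 1 and 2
```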
Matrix of squared distances and matrix of distances between individuals (row profiles):

 d²(L_i, L_j) |  L1    L2    L3    L4
  L1          |  -    1.33  5.11  2.00
  L2          | 1.33   -    3.78  3.33
  L3          | 5.11  3.78   -    5.78
  L4          | 2.00  3.33  5.78   -

 d(L_i, L_j)  |  L1    L2    L3    L4
  L1          |  -    1.15  2.26  1.41
  L2          | 1.15   -    1.94  1.83
  L3          | 2.26  1.94   -    2.40
  L4          | 1.41  1.83  2.40   -

Conclusions:
- individuals 1 and 2 are close to each other (both are female and single);
- individuals 1 and 3 are very different (all the modalities of these two individuals differ).
Distance between the row profile $L_i$ and the center of gravity:

$d^2_{\chi^2}(L_i, G_l) = \sum_{p=1}^{P} \sum_{l=1}^{K_p} \frac{1}{f_{.pl}} \left((l_i)_{pl} - \frac{n_{pl}}{nP}\right)^2$
$= \sum_{p=1}^{P} \sum_{l=1}^{K_p} \frac{nP}{n_{pl}} \left(\frac{x_{ipl}}{P} - \frac{n_{pl}}{nP}\right)^2$
$= \sum_{p=1}^{P} \sum_{l=1}^{K_p} \frac{n}{P n_{pl}} \left(x^2_{ipl} + \frac{n^2_{pl}}{n^2} - 2 x_{ipl} \frac{n_{pl}}{n}\right)$
$= \frac{n}{P} \sum_{p=1}^{P} \sum_{l=1}^{K_p} \frac{x_{ipl}}{n_{pl}} + \frac{1}{nP} \sum_{p=1}^{P} \sum_{l=1}^{K_p} n_{pl} - \frac{2}{P} \sum_{p=1}^{P} \sum_{l=1}^{K_p} x_{ipl}$
$= \frac{n}{P} \sum_{p=1}^{P} \sum_{l=1}^{K_p} \frac{x_{ipl}}{n_{pl}} + \frac{nP}{nP} - \frac{2P}{P}$
$= \frac{n}{P} \sum_{p=1}^{P} \sum_{l=1}^{K_p} \frac{x_{ipl}}{n_{pl}} - 1$

(using $x^2_{ipl} = x_{ipl}$)

⟹ The distance between the individual i and the center of gravity $G_l$ increases as the modalities taken by the individual i become rare ($x_{ipl} = 1$ and $n_{pl}$ small).
Total inertia of the point cloud $\mathcal{N}_l$ around $G_l$:

$I_{\chi^2}(\mathcal{N}_l, G_l) = \sum_{i=1}^{n} f_{i..}\, d^2_{\chi^2}(L_i, G_l)$
$= \sum_{i=1}^{n} \frac{1}{n} \left(\frac{n}{P} \sum_{p=1}^{P} \sum_{l=1}^{K_p} \frac{x_{ipl}}{n_{pl}} - 1\right)$
$= \frac{1}{P} \sum_{p=1}^{P} \sum_{l=1}^{K_p} \frac{\sum_{i=1}^{n} x_{ipl}}{n_{pl}} - \frac{1}{n} \sum_{i=1}^{n} 1$
$= \frac{1}{P} \sum_{p=1}^{P} \sum_{l=1}^{K_p} \frac{n_{pl}}{n_{pl}} - 1$
$= \frac{K}{P} - 1$

where $\frac{K}{P}$ is the average number of modalities per variable.
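This closed form can be checked directly on the example (K = 7, P = 3, so the inertia should equal 7/3 − 1 = 4/3):

```python
import numpy as np

X = np.array([[0, 1, 1, 0, 0, 1, 0],
              [0, 1, 1, 0, 0, 0, 1],
              [1, 0, 0, 0, 1, 0, 1],
              [0, 1, 0, 1, 0, 1, 0]])
n, K = X.shape
P = 3
n_pl = X.sum(axis=0)

# d^2(L_i, G_l) = (n/P) * sum_pl x_ipl / n_pl  -  1, averaged with weight 1/n
d2_to_G = (n / P) * (X / n_pl).sum(axis=1) - 1
inertia = d2_to_G.mean()
```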
The total inertia depends only on the number of variables and on the number of modalities. It does not depend at all on the relations between the variables. From a statistical point of view, this quantity cannot be interpreted (as in PCA).

$\forall i \in \{1, \ldots, n\}$, the row profile $l_i$ satisfies the P linear constraints:

$\sum_{l=1}^{K_p} (l_i)_{pl} = \sum_{l=1}^{K_p} \frac{x_{ipl}}{P} = \frac{1}{P}$,  $p = 1, \ldots, P$

⟹ the point cloud $\mathcal{N}_l$ lies inside a sub-space of at most $K - P$ dimensions.
5.1.4 Point cloud and distances between column profiles

Point cloud:
- $K = \sum_{p=1}^{P} K_p$ column profiles $C_{pl}$
- in $\mathbb{R}^n$
- with weight $f_{.pl} = \frac{n_{pl}}{nP}$
- and the $\chi^2$ distance.

The $i$-th coordinate of the center of gravity $G_c$ is given by:

$\sum_{p=1}^{P} \sum_{l=1}^{K_p} f_{.pl}\, (c_{pl})_i = \sum_{p=1}^{P} \sum_{l=1}^{K_p} \frac{n_{pl}}{nP} \cdot \frac{x_{ipl}}{n_{pl}} = \frac{1}{n}$

⟹ $G_c$ is the marginal profile (marginal relative profile).
Properties

Distance between modalities (column profiles)

The $\chi^2$ distance between modality $l_1$ of variable $Y_{p_1}$ and modality $l_2$ of variable $Y_{p_2}$ is:

$d^2_{\chi^2}(c_{p_1 l_1}, c_{p_2 l_2}) = \sum_{i=1}^{n} \frac{1}{f_{i..}} \left((c_{p_1 l_1})_i - (c_{p_2 l_2})_i\right)^2 = \sum_{i=1}^{n} \frac{1}{1/n} \left(\frac{x_{i p_1 l_1}}{n_{p_1 l_1}} - \frac{x_{i p_2 l_2}}{n_{p_2 l_2}}\right)^2 = n \sum_{i=1}^{n} \left(\frac{x_{i p_1 l_1}}{n_{p_1 l_1}} - \frac{x_{i p_2 l_2}}{n_{p_2 l_2}}\right)^2$

Interpretation:
- if the same individuals take these 2 modalities, the distance between the 2 modalities is small;
- if a modality is rare, it is far away from the other modalities.
Example

Distance between modality 1 of $Y_1$ (male) and modality 2 of $Y_2$ (married):

$d^2_{\chi^2}(c_{11}, c_{22}) = \sum_{i=1}^{n} \frac{1}{f_{i..}} \left((c_{11})_i - (c_{22})_i\right)^2 = 4\left[(0-0)^2 + (0-0)^2 + (1-0)^2 + (0-1)^2\right] = 8$

Matrix of distances $d_{\chi^2}(\cdot, \cdot)$ between the modalities:

     |  11    12    21    22    23    31    32
  11 |  -    2.31  2.45  2.83   0    2.45   1
  12 |        -    0.67  0.94  2.31  0.67  1.37
  21 |              -    2.45  2.45  1.41  1.41
  22 |                    -    2.83   1    2.45
  23 |                          -    2.45   1
  31 |                                -     2
  32 |                                      -

- 12 and 21 are close to each other (50% of the individuals have chosen these two modalities).
Distance between the column profile $C_{pl}$ and the center of gravity:

$d^2_{\chi^2}(C_{pl}, G_c) = \sum_{i=1}^{n} n\left((c_{pl})_i - \frac{1}{n}\right)^2 = \sum_{i=1}^{n} n\left(\frac{x_{ipl}}{n_{pl}} - \frac{1}{n}\right)^2$
$= \sum_{i=1}^{n} n\, \frac{x^2_{ipl}}{n^2_{pl}} + \sum_{i=1}^{n} n\, \frac{1}{n^2} - 2 \sum_{i=1}^{n} \frac{x_{ipl}}{n_{pl}}$
$= \frac{n}{n^2_{pl}} \sum_{i=1}^{n} x_{ipl} + 1 - \frac{2}{n_{pl}} \sum_{i=1}^{n} x_{ipl}$
$= \frac{n}{n_{pl}} - 1$

⟹ The distance between the modality l of $Y_p$ and the center of gravity $G_c$ increases as the modality becomes more rare ($n_{pl}$ small).
Total inertia of the point cloud $\mathcal{N}_c$ around $G_c$:

$I_{\chi^2}(\mathcal{N}_c, G_c) = \sum_{p=1}^{P} \sum_{l=1}^{K_p} f_{.pl}\, d^2_{\chi^2}(C_{pl}, G_c)$
$= \sum_{p=1}^{P} \sum_{l=1}^{K_p} \frac{n_{pl}}{nP} \left(\frac{n}{n_{pl}} - 1\right)$
$= \sum_{p=1}^{P} \sum_{l=1}^{K_p} \frac{1}{P} \left(1 - \frac{n_{pl}}{n}\right)$
$= \sum_{p=1}^{P} \frac{1}{P} (K_p - 1) = \frac{1}{P}(K - P) = \frac{K}{P} - 1$

Notice that $I_{\chi^2}(\mathcal{N}_c, G_c) = 1$ if all the variables have exactly two modalities.
Contribution of the modality l of the variable $Y_p$ to the total inertia of the point cloud $\mathcal{N}_c$:

$f_{.pl}\, d^2_{\chi^2}(C_{pl}, G_c) = \frac{n_{pl}}{nP}\left(\frac{n}{n_{pl}} - 1\right) = \frac{1}{P} - \frac{n_{pl}}{nP} = \frac{1}{P}\left(1 - \frac{n_{pl}}{n}\right)$

⟹ The contribution of the modality l of the variable $Y_p$ increases when $n_{pl}$ decreases. A rare modality therefore has a larger impact than a common modality.

The contribution of the variable $Y_p$ (sum of the contributions of its modalities) is given by:

$\sum_{l=1}^{K_p} \frac{1}{P}\left(1 - \frac{n_{pl}}{n}\right) = \frac{1}{P}(K_p - 1)$

⟹ The contribution of a variable increases with its number of modalities.
Summary of the two point clouds:

Row profiles:
$\mathcal{N}_l = \{(L_1; \tfrac{1}{n}), \ldots, (L_n; \tfrac{1}{n})\}$ with $\chi^2$ distances in $\mathbb{R}^K$, where $L_i$ has coordinates:
$(l_i)_{pl} = \frac{x_{ipl}}{P}$,  $p = 1, \ldots, P$; $l = 1, \ldots, K_p$

Column profiles:
$\mathcal{N}_c = \{(C_{pl}; f_{.pl} = \tfrac{n_{pl}}{nP})\}$, $p = 1, \ldots, P$; $l = 1, \ldots, K_p$, with $\chi^2$ distances in $\mathbb{R}^n$, where $C_{pl}$ has coordinates:
$(c_{pl})_i = \frac{x_{ipl}}{n_{pl}}$,  $i = 1, \ldots, n$
Row profiles $\mathcal{N}_l$: $\mathbb{R}^K$ — Column profiles $\mathcal{N}_c$: $\mathbb{R}^n$

$(\lambda_h, u_h)$, $h = 1, \ldots, H$, and $(\lambda_h, v_h)$, $h = 1, \ldots, H$, are the eigenvalues and eigenvectors of
$V = T'T$ and $W = TT'$
respectively. Hence we have

$V u_h = \lambda_h u_h$ and $W v_h = \lambda_h v_h$

where T is an n × K matrix with coordinates:

$t_{i,pl} = \frac{f_{ipl} - f_{i..}\, f_{.pl}}{\sqrt{f_{i..}\, f_{.pl}}} = \frac{x_{ipl} - \frac{n_{pl}}{n}}{\sqrt{P\, n_{pl}}}$

Construction of the principal components (projection of the row and column profiles):

$\gamma_{h,i} = \|OP_{\Delta_h}(L_i)\| = \langle OL_i, u_h \rangle = \sum_{k=1}^{K} u_{h,k} (l_i)_k$
$\delta_{h,pl} = \|OP_{\Theta_h}(C_{pl})\| = \langle OC_{pl}, v_h \rangle = \sum_{i=1}^{n} v_{h,i} (c_{pl})_i$
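The whole MCA can be sketched as an SVD of the matrix T built from the CDT. On the running example, the eigenvalues must sum to the total inertia $K/P - 1$:

```python
import numpy as np

X = np.array([[0, 1, 1, 0, 0, 1, 0],
              [0, 1, 1, 0, 0, 0, 1],
              [1, 0, 0, 0, 1, 0, 1],
              [0, 1, 0, 1, 0, 1, 0]])
n, K = X.shape
P = 3
n_pl = X.sum(axis=0)

# t_{i,pl} = (x_ipl - n_pl/n) / sqrt(P * n_pl)
T = (X - n_pl / n) / np.sqrt(P * n_pl)

U, s, Vt = np.linalg.svd(T, full_matrices=False)
lam = s ** 2   # eigenvalues of V = T'T (and of W = TT')
```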
How many principal components?

Stopping rule in PCA: keep principal component h if the associated eigenvalue is larger than 1 (the mean of the eigenvalues).

This rule is adapted to MCA as follows: keep principal component h if the associated eigenvalue is larger than $\frac{1}{P}$.

Indeed, suppose that $H = K - P$ (the usual situation); then the mean of all non-zero eigenvalues is given by:

$\frac{1}{K - P} \sum_{h=1}^{H} \lambda_h = \frac{1}{K - P}\left(\frac{K}{P} - 1\right) = \frac{1}{P}$

The quality of the representation of $C_{pl}$ on the axis $\Theta_h$ is measured by the squared cosine of the angle $\theta_{h,pl}$ (between $OC_{pl}$ and the axis $\Theta_h$):

$\cos^2(\theta_{h,pl}) = \frac{\delta^2_{h,pl}}{\|OC_{pl}\|^2}$

It can be proven that:
$\cos(\theta_{h,pl}) = r_{X_{pl},\, \gamma_h}$

⟹ As in PCA, it is possible to construct a correlation circle with the modalities.
5.2.3 Contribution of each modality

Contribution of the modality l of $Y_p$ to the variance of the principal component:

$CTR_h(X_{pl}) = \frac{f_{.pl}\, \delta^2_{h,pl}}{\lambda_h} = \frac{n_{pl}}{nP\, \lambda_h}\, \delta^2_{h,pl}$

The contribution of the modality $X_{pl}$ increases with the correlation between $\gamma_h$ and the modality. It also increases as the modality becomes more rare ($n_{pl}$ small).

Global contribution of the variable $Y_p$ (sum over all its modalities) to the variance of $\gamma_h$:

$CTR_h(Y_p) = \sum_{l=1}^{K_p} CTR_h(X_{pl})$
5.2.4 Reconstitution formula

The formula introduced for BCA becomes:

$f_{ipl} = f_{i..}\, f_{.pl} \left(1 + \sum_{h=1}^{H} \frac{1}{\sqrt{\lambda_h}}\, \gamma_{h,i}\, \delta_{h,pl}\right)$
⟹ $\frac{x_{ipl}}{nP} = \frac{1}{n} \cdot \frac{n_{pl}}{nP} \left(1 + \sum_{h=1}^{H} \frac{1}{\sqrt{\lambda_h}}\, \gamma_{h,i}\, \delta_{h,pl}\right)$
⟹ $x_{ipl} = \frac{n_{pl}}{n} \left(1 + \sum_{h=1}^{H} \frac{1}{\sqrt{\lambda_h}}\, \gamma_{h,i}\, \delta_{h,pl}\right)$

The distance between the observed probability that individual i has modality l of variable $Y_p$ ($x_{ipl}$) and the mean probability of having this modality ($\frac{n_{pl}}{n}$) is given as a function of the principal components.

Moreover,

$\sum_{i=1}^{n} x_{ipl}\, x_{ip'l'} = \sum_{i=1}^{n} \frac{n_{pl}}{n}\left(1 + \sum_{h=1}^{H} \frac{\gamma_{h,i}\, \delta_{h,pl}}{\sqrt{\lambda_h}}\right) \frac{n_{p'l'}}{n}\left(1 + \sum_{h=1}^{H} \frac{\gamma_{h,i}\, \delta_{h,p'l'}}{\sqrt{\lambda_h}}\right)$
$= \ldots = \frac{n_{pl}\, n_{p'l'}}{n}\left(1 + \sum_{h=1}^{H} \delta_{h,pl}\, \delta_{h,p'l'}\right)$

⟹ Comparison between modalities.
But the attraction/repulsion index $d_{pl,p'l'}$ between the modality l of $Y_p$ and the modality l' of $Y_{p'}$ is given by:

$d_{pl,p'l'} = \frac{n_{pl,p'l'}/n}{\frac{n_{pl}}{n} \cdot \frac{n_{p'l'}}{n}} = \frac{n\, n_{pl,p'l'}}{n_{pl}\, n_{p'l'}}$

⟹ $d_{pl,p'l'} = 1 + \sum_{h=1}^{H} \delta_{h,pl}\, \delta_{h,p'l'}$

The proximity between two individuals i and i' is defined by:

$p_{i,i'} = 1 + \sum_{h=1}^{H} \gamma_{h,i}\, \gamma_{h,i'}$

⟹ Two individuals are close (same behaviour) if they have, in general, the same modalities.
5.3 Graphical representations

Two types of graphical representations:
- Pseudo-barycentric representation (standard)
- Biplot representation (barycentric)

5.3.1 Standard representation (pseudo-barycentric)

We focus on the first principal plane, but more dimensions can be analyzed with the same methodology.

The first principal plane is constructed using both PCAs:
- individual $A_i$ ($i = 1, \ldots, n$) is projected on the first factorial plane, leading to the coordinates $(\gamma_{1,i}, \gamma_{2,i})$;
- modality $B_{pl}$ ($p = 1, \ldots, P$; $l = 1, \ldots, K_p$) is projected on the first factorial plane, leading to the coordinates $(\delta_{1,pl}, \delta_{2,pl})$.
[Figure: first principal plane (axes Delta 1/Gamma 1 and Delta 2/Gamma 2) showing projected individuals $A_i^*$ and modalities $B_{pl}^*$.]

This representation is the closest representation of the simultaneous information inside the point clouds $\mathcal{N}_l$ and $\mathcal{N}_c$.
Interpretation:

The well-represented modalities on the first principal plane are compared using the following approximate formula:

$d_{pl,p'l'} \approx 1 + \sum_{h=1}^{2} \delta_{h,pl}\, \delta_{h,p'l'} = 1 + \langle OB_{pl}^*, OB_{p'l'}^* \rangle = 1 + \|OB_{pl}^*\|\, \|OB_{p'l'}^*\| \cos(OB_{pl}^*, OB_{p'l'}^*)$

Draw the line $\Delta_{B_{pl}^*}$ which passes through the origin and which is orthogonal to $OB_{pl}^*$. This line separates the space into two parts:
- the modalities that are on the same side as $B_{pl}^*$ are attracted by it;
- the modalities on the other side are repulsed by $B_{pl}^*$.

The attraction/repulsion index increases with $|\langle OB_{pl}^*, OB_{p'l'}^* \rangle|$.
[Figure: first principal plane (axes Gamma 1, Gamma 2) showing a modality $B_{pl}^*$, the orthogonal separating line, and two other modalities on either side of it.]

If the modalities pl, p'l' and p''l'' are well represented on the first principal plane, we can conclude that pl and p'l' are attracted to each other, and that the modalities pl and p''l'' repulse each other.
The well-represented individuals on the first principal plane are compared using the following approximate formula:

$p_{i,i'} \approx 1 + \sum_{h=1}^{2} \gamma_{h,i}\, \gamma_{h,i'} = 1 + \langle OA_i^*, OA_{i'}^* \rangle = 1 + \|OA_i^*\|\, \|OA_{i'}^*\| \cos(OA_i^*, OA_{i'}^*)$

Draw the line $\Delta_{A_i^*}$ which passes through the origin and which is orthogonal to $OA_i^*$. This line separates the space into two parts:
- the individuals on the same side as $A_i^*$ share a set of modalities with individual i, and the common set increases with $\langle OA_i^*, OA_{i'}^* \rangle$;
- the individuals on the other side have few characteristics in common with individual i.
[Figure: first principal plane (axes Delta 1, Delta 2) showing an individual $A_i^*$, the orthogonal separating line, and two other individuals on either side of it.]

If the individuals i, i' and i'' are well represented on the first principal plane, we can conclude that individual i is close to individual i' and has few characteristics in common with individual i''.
The well-represented modalities and individuals on the first principal plane are compared using the following approximate formula:

$x_{ipl} \approx \frac{n_{pl}}{n}\left(1 + \sum_{h=1}^{2} \frac{1}{\sqrt{\lambda_h}}\, \gamma_{h,i}\, \delta_{h,pl}\right)$

The coefficient $\frac{1}{\sqrt{\lambda_h}}$ implies some difficulties in the interpretation. If $A_i^*$ and $B_{pl}^*$ are well represented on the first principal plane:
- The probability that the individual $A_i$ has modality l of variable $Y_p$ is high if they belong to the same quadrant.
- The probability that the individual $A_i$ has modality l of variable $Y_p$ is low if they are in opposite quadrants.
- We cannot conclude if they belong to adjacent quadrants.
5.3.2 Biplot

The biplot representation leads to a better visibility of the first principal plane for comparing the individuals with the modalities.

The individual i is associated to $\tilde{A}_i^*$, which has coordinates:

$(\tilde{\gamma}_{1,i}, \tilde{\gamma}_{2,i})' = \left(\frac{\gamma_{1,i}}{\sqrt{\lambda_1}}, \frac{\gamma_{2,i}}{\sqrt{\lambda_2}}\right)'$

The modality l of variable $Y_p$ ($p = 1, \ldots, P$; $l = 1, \ldots, K_p$) is associated with $B_{pl}^*$, which has coordinates $(\delta_{1,pl}, \delta_{2,pl})'$.
Reconstitution formula to compare the individuals with the modalities:

$x_{ipl} \approx \frac{n_{pl}}{n}\left(1 + \sum_{h=1}^{2} \tilde{\gamma}_{h,i}\, \delta_{h,pl}\right) = \frac{n_{pl}}{n}\left(1 + \langle O\tilde{A}_i^*, OB_{pl}^* \rangle\right) = \frac{n_{pl}}{n}\left(1 + \|O\tilde{A}_i^*\|\, \|OB_{pl}^*\| \cos(O\tilde{A}_i^*, OB_{pl}^*)\right)$

Draw the line $\Delta_{B_{pl}^*}$ which passes through the origin and which is orthogonal to $OB_{pl}^*$. This line separates the space into two parts:
- the individuals on the same side as $B_{pl}^*$ have, with high probability, the modality l of variable $Y_p$;
- the individuals on the other side have, with low probability, the modality l of variable $Y_p$.
[Figure: first principal plane (Axis 1, Axis 2) showing a modality $B_{pl}^*$, the orthogonal separating line, and two individuals $\tilde{A}_i^*$ and $\tilde{A}_{i'}^*$ on either side of it.]

If the modality l of variable $Y_p$ is well represented on the first principal plane, then the probability that individual i has modality l of variable $Y_p$ is high, and the probability that individual i' has modality l of variable $Y_p$ is low.
5.4 The Burt table (BT)

When is the use of the BT more appropriate than the use of the CDT?
- If n is large, the simultaneous representation of individuals and modalities is unreadable.
- If the individuals are anonymous, the interest lies only in the modalities.

The Burt table is the K × K symmetric table built from the complete disjunctive table X as $B = X'X$. It stacks all two-way cross-tabulations of the P variables: the off-diagonal block $(Y_p, Y_{p'})$ contains the counts $n_{pl,p'l'}$, the diagonal block of $Y_p$ is the diagonal matrix with entries $n_{p1}, \ldots, n_{pK_p}$, the margins are $P\, n_{pl}$, and the grand total is $nP^2$.

We use the BCA on the Burt table instead of applying the BCA on the complete disjunctive table (CDT).

Remark: The row profiles and the column profiles are identical since the Burt table is symmetric.
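On the running example, the Burt table and its structural properties can be checked in two lines:

```python
import numpy as np

# CDT of the running example (4 individuals x 7 dummy columns), P = 3 variables
X = np.array([[0, 1, 1, 0, 0, 1, 0],
              [0, 1, 1, 0, 0, 0, 1],
              [1, 0, 0, 0, 1, 0, 1],
              [0, 1, 0, 1, 0, 1, 0]])
P = 3

B = X.T @ X   # Burt table: all two-way cross-tabulations of the P variables
```

The diagonal holds the counts $n_{pl}$, the table is symmetric, and its grand total is $nP^2$.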
5.4.1 Links between MCA on CDT and MCA on BT

The inertias obtained by MCA on the BT are the squares of the inertias obtained by MCA on the CDT:

$\lambda_{BT,h} = \lambda_h^2$,  $h = 1, \ldots, H$

The variances of the principal components $\gamma_{BT,h}$ obtained by MCA on the BT are given by the squared variances of the principal components obtained by MCA on the CDT:

$s^2_{\gamma_h} = \lambda_h$  and  $s^2_{\gamma_{BT,h}} = \lambda_{BT,h} = \lambda_h^2$

It also holds that, $\forall h = 1, \ldots, H$:

$\gamma_{BT,h} = \sqrt{\lambda_h}\, \gamma_h$
5.5 Practical example
Research question:
Determining whether, inside the PS electorate, Muslims behave differently from non-believers and Catholics.
Database:
Votes for the PS in the regional elections of June
2004 in the Brussels Region
Method:
To this end, we will look into the answers given
to society-oriented questions using multiple cor-
respondence analysis.
5.5.1 Society-oriented questions:
Mail services should be privatized;
Trade Unions should weigh heavily in major
economic decisions;
Homosexual couples should be allowed to adopt
children;
Consumption of cannabis should be forbidden;
People don't feel at home in Belgium anymore;
Abolishing the death penalty was the right
decision.
The answers proposed to these questions are:
Total agreement (1),
Rather in agreement (2),
Rather opposed (3),
Totally opposed (4),
No opinion (5).
The questionnaire also includes a question con-
cerning a subjective judgment of the individual
about his general behavior on a left-right scale:
Here is a political left-right scale. 0 is the most
left-wing position 9 the most right-wing. Where
would you locate yourself?
The variable Belief, with three categories (Muslims, non-believers and Catholics), is also available.
5.5.2 χ² independence test
First, we analyze each society-oriented question separately by testing its dependence with respect to the belief variable using a χ² independence test.

χ² test     Mail     Trade Union   Homosexual
statistic   26.78    27.13         144.82
p-value     (0.00)   (0.00)        (0.00)

χ² test     Cannabis   Home     D. Penalty
statistic   86.98      27.94    11.75
p-value     (0.00)     (0.00)   (0.16)

The assumption of independence between the society-oriented questions and the belief variable is rejected for all of the questions (at the 5% level) except for the question on the death penalty (very small variation inside the question).
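A χ² independence test of this kind can be reproduced with scipy. The contingency table below is purely illustrative (the actual survey counts are not given in the text), so only the mechanics carry over.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 5x3 contingency table: answer categories (rows,
# from "total agreement" to "no opinion") by belief group (columns).
# These counts are invented for illustration only.
table = np.array([
    [30, 25, 40],
    [50, 30, 35],
    [40, 35, 20],
    [35, 40, 15],
    [20, 15, 45],
])

chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p-value = {p_value:.4f}")
# Reject independence at the 5% level when p_value < 0.05
```

With a 5x3 table the test has (5-1)(3-1) = 8 degrees of freedom, matching the setup of the tests reported above.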
5.5.3 Attraction-repulsion indexes
The links between each pair of modalities of two variables are measured with the attraction-repulsion indexes d_jk, defined as
d_jk = f_jk / (f_j. f_.k)
where f_jk is the observed frequency and f_j. f_.k is the theoretical frequency under the independence hypothesis.
Interpretation:
d_jk > 1: the two modalities attract each other
d_jk < 1: the two modalities push each other away
d_jk ≈ 1: the two modalities are close to being independent
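As a minimal sketch, the indexes d_jk can be computed from any contingency table of counts; for a perfectly independent table every index equals 1 exactly.

```python
import numpy as np

def attraction_repulsion(counts):
    """Attraction-repulsion indexes d_jk = f_jk / (f_j. * f_.k),
    where f_jk are observed relative frequencies and f_j. * f_.k
    is the theoretical frequency under independence."""
    counts = np.asarray(counts, dtype=float)
    f = counts / counts.sum()                  # relative frequencies f_jk
    row = f.sum(axis=1, keepdims=True)         # marginals f_j.
    col = f.sum(axis=0, keepdims=True)         # marginals f_.k
    return f / (row @ col)

# A table built as an outer product is exactly independent,
# so every attraction-repulsion index is 1.
indep = np.outer([10, 20, 30], [1, 2, 3])
d = attraction_repulsion(indep)
```

Values above 1 in a real table flag attracting pairs, as in the Mail/Belief tables that follow.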
Mail services should be privatized
Attraction Index Non-believer Catholic Muslim
Total agreement 0.712 1.411 1.196
Rather in agreement 1.055 0.707 1.113
Rather opposed 1.080 1.001 0.866
Totally opposed 1.119 1.062 0.757
No opinion 0.779 0.857 1.472
The proportion of Muslim PS-voters who declare having no opinion on the subject is much higher than the corresponding proportions of Catholic and non-believer PS-voters.
The proportion of Catholics who are in total agreement with a privatization of mail services is much higher.
Trade Unions should weigh heavily in
major economic decisions
Attraction Index Non-believer Catholic Muslim
Total agreement 0.878 0.920 1.261
Rather in agreement 1.117 0.930 0.853
Rather opposed 1.203 1.102 0.588
Totally opposed 0.953 1.779 0.534
No opinion 0.847 0.953 1.290
As for the influence of Trade Unions in major economic decisions, Muslim PS-voters are more prone than the others to agree with the necessity of more influence, while Catholics seem to be very opposed to it.
Homosexual couples should be allowed
to adopt children
Attraction Index Non-believer Catholic Muslim
Total agreement 1.311 0.886 0.558
Rather in agreement 1.470 0.959 0.240
Rather opposed 1.101 1.220 0.676
Totally opposed 0.468 1.104 1.821
No opinion 1.240 0.674 0.825
The answers to the question of allowing adoption by homosexual couples are very clear-cut.
Non-believers are proportionally much more in agreement with the assertion than the others; Catholics generally seem to oppose or totally oppose it.
A vast majority of Muslims declare themselves totally opposed to the proposition.
Consumption of cannabis should be for-
bidden
Attraction Index Non-believer Catholic Muslim
Total agreement 0.626 1.116 1.548
Rather in agreement 0.748 1.176 1.300
Rather opposed 1.341 0.948 0.463
Totally opposed 1.371 0.680 0.601
No opinion 1.024 1.186 0.830
A majority of Muslims agree with the proposal; a majority of non-believers declare themselves opposed to it.
People don't feel at home in Belgium anymore
Attraction Index Non-believer Catholic Muslim
Total agreement 0.786 1.433 1.056
Rather in agreement 0.677 1.330 1.311
Rather opposed 0.937 1.207 0.962
Totally opposed 1.178 0.738 0.885
No opinion 0.867 1.082 1.166
Strong opposition between non-believers and Catholics: Catholics are proportionally more prone to agree with the assertion than non-believers.
Muslims also seem to agree that they don't feel at home in Belgium anymore.
Abolishing the death penalty was the
right decision
Attraction Index Non-believer Catholic Muslim
Total agreement 1.069 0.881 0.967
Rather in agreement 1.020 0.926 1.019
Rather opposed 0.735 1.486 1.105
Totally opposed 0.762 1.390 1.127
No opinion 0.932 1.178 0.989
A high number of voters are in total agreement with abolishing the death penalty.
Muslims don't really show a tendency one way or another with respect to the others.
Catholics seem to be more prone than non-believers to be against the abolition of the death penalty.
5.5.4 Multiple correspondence analysis (MCA)
Multivariate vision of the set of society-oriented
questions (active variables)
[Figure: first factorial plane (first factor x second factor) displaying the modalities of the six active questions (POSTE, OG, HOMO, CAN, BEL, PM, with answer levels 1-4) together with the illustrative points NON BELIEVER, MUSLIM, CATHOLIC and the political-scale positions POL1-POL7.]
Figure 5.1: Multiple Correspondence Analysis on society-oriented questions. Belief and the
political scale are added as illustrative variables.
Two illustrative variables: belief and the political scale.
The first axis represents a left-right dimension.
To improve readability, we deleted the modality "no opinion" of the society-oriented questions.
Inertia explained by the rst plane: 20%
Contributors on rst factorial axis:
24.8% feeling at home in Belgium
22.7% the death penalty
17.9% adoption by homosexual couples
17% prohibition of cannabis consumption
10.4% privatization of mail services
7.2% Trade Unions in political decisions
Contributors on second factorial axis:
24.2% privatization of mail services
19.3% adoption by homosexual couples
16.5% prohibition of cannabis consumption
14.7% the death penalty
13.6% feeling at home in Belgium
11.8% Trade Unions in political decisions
5.5.5 Econometric Model
Multivariate data analysis doesn't take into account the influence of other variables, which may strongly influence the results.
Dependent variable: the left-right indicator built
on the basis of the six society-oriented questions
Regression 1 Regression 2
Variable Coecient Std. Error Coecient Std. Error
C -0.166*** (0.027) -0.457*** (0.078)
NONCROYANT -0.319*** (0.050) -0.225*** (0.048)
MUSULMAN 0.089 (0.055) 0.152*** (0.055)
AGE 0.008*** (0.001)
AUCUN 0.371*** (0.112)
PRIMAIRE 0.421*** (0.094)
PROFESSIONNEL 0.310*** (0.083)
SECINF 0.416*** (0.068)
SECSUP 0.274*** (0.053)
SUPNONUNIV 0.163*** (0.054)
TECHNIQUE 0.151 (0.096)
R-squared: 12.6 % R-squared: 24.4 %
Sample size: 676. *, **, ***: statistically different from zero at the 10%, 5% and 1% levels.
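A regression of this form can be sketched with ordinary least squares in numpy. All data below are simulated and the coefficients are invented; only the design (a constant, belief dummies with Catholics as the reference category, and age) mirrors Regression 2.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 676  # sample size reported in the table

# Hypothetical regressors: NONCROYANT and MUSULMAN dummies
# (mutually exclusive, Catholics as reference) and AGE.
noncroyant = rng.integers(0, 2, n)
musulman = (1 - noncroyant) * rng.integers(0, 2, n)
age = rng.integers(18, 80, n)
X = np.column_stack([np.ones(n), noncroyant, musulman, age])

# Simulated left-right indicator; these "true" coefficients are
# invented for the sketch, not taken from the table.
beta_true = np.array([-0.45, -0.22, 0.15, 0.008])
y = X @ beta_true + rng.normal(0, 0.5, n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat
r2 = 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))
```

With n = 676 the OLS estimates recover the simulated coefficients closely, which is the point of reporting standard errors next to each estimate.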
Chapter 6
Canonical correlation analysis
6.1 Introduction
Objective: characterize the linear relation between two sets of quantitative variables.
Canonical correlation analysis seeks to identify and quantify the associations between two sets of variables.
Key reference:
Hotelling, H. (1936), Relations between two sets of variates, Biometrika, 28, 321-377.
CHAPTER 6. CANONICAL CORRELATION ANALYSIS 233
EXAMPLES:
Relationships between job evaluation ratings and self-ratings of job characteristics (Dunham, 1977)
Measures of job characteristics:
X1: Task feedback
X2: Task significance
X3: Task variety
X4: Task identity
X5: Autonomy
Self-ratings of job characteristics:
Y1: Supervision satisfaction
Y2: Career future satisfaction
Y3: Financial satisfaction
Y4: Amount of work satisfaction
Y5: Company identification
Y6: Kind of work satisfaction
Y7: General satisfaction
Determine associations between socio-economic variables and consumption behaviors
Socio-economic variables:
X1: Household income
X2: Number of school years of the husband
X3: Number of school years of the wife
X4: Age of the husband
X5: Age of the wife
X6: Number of children
Consumption behaviors:
Y1: Number of times that the family goes to a restaurant (per year)
Y2: Number of times that the family goes to the cinema (per year)
6.2 Canonical variates and canonical correlations
Let X = (X_1, X_2, . . . , X_p)′ and Y = (Y_1, Y_2, . . . , Y_q)′.
IDEA: Find linear combinations (canonical variates)
U_k = a_k′ X and V_k = b_k′ Y
with maximal
|corr(U_k, V_k)|
subject to the following constraints:
- Var(U_k) = Var(V_k) = 1
- U_k and V_k uncorrelated with the previously found canonical variates.
Canonical vectors: a_k and b_k (k ≤ min(p, q))
Canonical correlations: ρ_k = |corr(U_k, V_k)|.
To solve this maximization problem under constraints, denote Z = (X′, Y′)′ ∈ IR^(p+q), where
Cov(Z) = ( Σ_XX  Σ_XY
           Σ_YX  Σ_YY ) := Σ.
Solution of the canonical analysis problem at the population level (proof page 546, Johnson and Wichern):
a_k are the eigenvectors of M_X = Σ_XX^(-1) Σ_XY Σ_YY^(-1) Σ_YX
b_k are the eigenvectors of M_Y = Σ_YY^(-1) Σ_YX Σ_XX^(-1) Σ_XY
(we also get the following link: b_k = (1/ρ_k) Σ_YY^(-1) Σ_YX a_k)
ρ_k² are the eigenvalues of M_X or M_Y.
The first couple (a_1, b_1) is associated with the largest eigenvalue, and so on.
Remark: In practice it is sometimes more relevant to apply canonical correlation analysis to the correlation matrix instead of the covariance matrix (i.e. to use standardized variables):
R(Z) = ( R_XX  R_XY
         R_YX  R_YY )
Using the correlation matrix instead of the covariance matrix, the canonical correlations are the same but the canonical vectors are modified. Nevertheless, a simple relation exists between both formulations:
a_k^R = D_X^(1/2) a_k and b_k^R = D_Y^(1/2) b_k
where D_X is the diagonal matrix with the variances of X on the diagonal and D_Y the diagonal matrix with the variances of Y on the diagonal.
6.3 Estimation
QUESTION: How to estimate the canonical variates U_k = a_k′ X and V_k = b_k′ Y?
ANSWER: Estimate the covariance matrix
Σ = ( Σ_XX  Σ_XY
      Σ_YX  Σ_YY )
by the sample covariance matrix
S = ( S_XX  S_XY
      S_YX  S_YY )
Solution to the problem at the sample level:
â_k are the eigenvectors of M_X = S_XX^(-1) S_XY S_YY^(-1) S_YX
b̂_k are the eigenvectors of M_Y = S_YY^(-1) S_YX S_XX^(-1) S_XY
ρ̂_k² are the eigenvalues of M_X or M_Y.
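The sample solution can be sketched directly from the eigen-decomposition of M_X. The data below are simulated so that the first canonical correlation is close to 1 and the second close to 0; the helper name `sample_cca` is ours, not from the course.

```python
import numpy as np

def sample_cca(X, Y):
    """Sample canonical correlations: rho_k^2 are the eigenvalues
    of M_X = Sxx^-1 Sxy Syy^-1 Syx built from the blocks of the
    sample covariance matrix S."""
    Z = np.column_stack([X, Y])
    S = np.cov(Z, rowvar=False)
    p = X.shape[1]
    Sxx, Sxy = S[:p, :p], S[:p, p:]
    Syx, Syy = S[p:, :p], S[p:, p:]
    Mx = np.linalg.solve(Sxx, Sxy) @ np.linalg.solve(Syy, Syx)
    eigvals = np.linalg.eigvals(Mx)
    rho = np.sqrt(np.sort(eigvals.real.clip(0))[::-1])
    return rho[:min(p, Y.shape[1])]

rng = np.random.default_rng(1)
n = 500
X = rng.normal(size=(n, 3))
# Y's first column nearly reproduces X's first column, so the
# leading canonical correlation should be near 1.
Y = np.column_stack([X[:, 0] + 0.1 * rng.normal(size=n),
                     rng.normal(size=n)])
rho = sample_cca(X, Y)
```

The number of non-trivial canonical correlations is min(p, q), here min(3, 2) = 2.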
6.4 Interpreting the sample canonical variables
The canonical variables are artificial and based on X and Y ⇒ try to identify the meaning of these new variables. Two schools of thought are opposed in this field.
Contribution in the construction of U_k and V_k: Rencher (1998) proposed to use the coordinates of the canonical vectors, which measure the marginal impact of each variable in the construction of the canonical variables ⇒ multivariate approach.
Correlations with the initial variables (as in PCA): Tenenhaus (1998, page 18) preferred to use the correlations between the initial variables and the canonical variates ⇒ easy but bivariate.
Writing Â = [â_1, â_2, . . . , â_p]′ and B̂ = [b̂_1, b̂_2, . . . , b̂_q]′, it follows that
X = Â^(-1) Û and Y = B̂^(-1) V̂
Hence the sample covariance matrices can be written on the basis of the canonical variates:
S_XY = (Â^(-1)) cov(Û, V̂) (B̂^(-1))′ = ρ̂_1 a^(1) b^(1)′ + . . . + ρ̂_p a^(p) b^(p)′
S_XX = (Â^(-1)) (Â^(-1))′ = a^(1) a^(1)′ + . . . + a^(p) a^(p)′
S_YY = (B̂^(-1)) (B̂^(-1))′ = b^(1) b^(1)′ + . . . + b^(q) b^(q)′
where a^(i) and b^(i) are the i-th columns of the inverse matrices Â^(-1) and B̂^(-1), respectively.
QUESTION: Which proportion of the information in S_XX, S_YY and S_XY is lost when only r (< p) canonical variates are used?
S_XY ≈ ρ̂_1 a^(1) b^(1)′ + . . . + ρ̂_r a^(r) b^(r)′, with error ρ̂_(r+1) a^(r+1) b^(r+1)′ + . . . + ρ̂_p a^(p) b^(p)′
S_XX ≈ a^(1) a^(1)′ + . . . + a^(r) a^(r)′, with error a^(r+1) a^(r+1)′ + . . . + a^(p) a^(p)′
S_YY ≈ b^(1) b^(1)′ + . . . + b^(r) b^(r)′, with error b^(r+1) b^(r+1)′ + . . . + b^(q) b^(q)′
It is straightforward to note that most of the time S_XY is better explained than S_XX and S_YY.
6.5.2 Proportions of explained sample variances
When the observations are standardized, the sample covariance matrices are correlation matrices.
Proportions of total sample variances explained by the first r canonical variates:
R²_(X | Û_1, . . . , Û_r) = [ Σ_{i=1}^{r} Σ_{k=1}^{p} r²(Û_i, X_k) ] / p
R²_(Y | V̂_1, . . . , V̂_r) = [ Σ_{i=1}^{r} Σ_{k=1}^{q} r²(V̂_i, Y_k) ] / q
6.6 Large sample inferences
Suppose that Z = (X′, Y′)′ ∈ IR^(p+q) ~ N_(p+q)(μ, Σ).
6.6.1 Testing procedure on Σ_XY
Idea: perform a testing procedure looking at the association between the two groups of variables (proof in Kshirsagar, 1972):
H_0: Σ_XY = 0 (ρ_1 = . . . = ρ_p = 0)
H_1: Σ_XY ≠ 0
Test statistic: MV = −n ln Π_{i=1}^{p} (1 − ρ̂_i²)
(equivalently, MV = n ln[ det(S_XX) det(S_YY) / det(S) ])
Distribution under H_0: MV ~ χ²_pq
Reject H_0 at significance level α = 5% if MV > χ²_{pq; 0.95}
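Using the χ²_pq approximation above, the test can be sketched as follows (the sample canonical correlations and dimensions are illustrative inputs).

```python
import numpy as np
from scipy.stats import chi2

def mv_test(rho_hat, n, p, q, alpha=0.05):
    """Test H0: Sigma_XY = 0 with MV = -n * ln prod(1 - rho_i^2),
    compared with the chi-square quantile on p*q degrees of freedom."""
    rho_hat = np.asarray(rho_hat, dtype=float)
    mv = -n * np.log(np.prod(1 - rho_hat**2))
    critical = chi2.ppf(1 - alpha, p * q)
    return mv, critical, mv > critical

# A strong first sample canonical correlation: H0 should be rejected.
mv, crit, reject = mv_test([0.8, 0.1], n=100, p=2, q=2, alpha=0.05)
```

Here MV is far above the 5% critical value χ²_{4; 0.95} ≈ 9.49, so independence of the two sets is rejected.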
6.6.2 Individual tests on canonical correlations
If H_0: Σ_XY = 0 is rejected, it is natural to examine the significance of the individual canonical correlations. First step (ρ_1 ≠ 0):
H_0^1: ρ_1 ≠ 0, ρ_2 = ρ_3 = . . . = ρ_p = 0
H_1^1: ρ_i ≠ 0 for some i ≥ 2
If H_0^1 is rejected, the next step is:
H_0^2: ρ_1 ≠ 0, ρ_2 ≠ 0, ρ_3 = ρ_4 = . . . = ρ_p = 0
H_1^2: ρ_i ≠ 0 for some i ≥ 3
and so on, for k ∈ {2, . . . , p − 1}:
H_0^k: ρ_1 ≠ 0, . . . , ρ_k ≠ 0, ρ_(k+1) = . . . = ρ_p = 0
H_1^k: ρ_i ≠ 0 for some i ≥ k + 1
Decision rule: reject H_0^k at significance level α if
−(n − 1 − (p + q + 1)/2) ln Π_{i=k+1}^{p} (1 − ρ̂_i²) > χ²_{(p−k)(q−k); 1−α}
6.7 Example: Relationships between job evaluation ratings and self-ratings of job characteristics (Dunham, 1977; see Johnson & Wichern (2002))
Measures of job characteristics:
X1: Task feedback
X2: Task significance
X3: Task variety
X4: Task identity
X5: Autonomy
Self-ratings of job characteristics:
Y1: Supervision satisfaction
Y2: Career future satisfaction
Y3: Financial satisfaction
Y4: Amount of work satisfaction
Y5: Company identification
Y6: Kind of work satisfaction
Y7: General satisfaction
Chapter 7
Discriminant and classification
7.1 Introduction
OBJECTIVES:
1. Discrimination or separation: separate two (or more) classes of objects; describe the different characteristics of observations arising from different known populations.
2. Classification or allocation: define rules that assign an individual to a certain class.
There is overlap between the two approaches, since the variables that discriminate can also be used to allocate new observations to one group and vice versa.
CHAPTER 7. DISCRIMINANT AND CLASSIFICATION 247
EXAMPLES
Populations π_1 and π_2 | Measured variables
Good and poor credit risks | Income, age, number of credit cards, family size
Successful and unsuccessful students | Socio-economic variables, secondary path, gender
Males and females | Anthropological measurements
Purchasers of a new product and laggards | Income, education, family size, amount of previous brand switching
Papers written by two authors | Frequencies of different words and lengths of sentences
Two species of flowers | Sepal and petal length, pollen diameter
Remark: In the sequel we present the problem using two populations, but the generalization to more than two populations is straightforward.
THEORETICAL CONTEXT:
Denote the two populations by π_1 and π_2. The information on observations can be summarized in p variables:
X′ = [X_1, . . . , X_p]
The behavior of the variables is different in the two populations.
Costs of misclassification:
             classified as π_1   classified as π_2
true π_1           0                  c(2|1)
true π_2         c(1|2)                 0
Expected cost of misclassification (ECM):
ECM = c(2|1) P(2|1) p_1 + c(1|2) P(1|2) p_2
RESULT: The regions R_1 and R_2 that minimize the ECM are defined by the values of x for which the following inequalities hold:
R_1: f_1(x)/f_2(x) ≥ [c(1|2)/c(2|1)] (p_2/p_1)
R_2: f_1(x)/f_2(x) < [c(1|2)/c(2|1)] (p_2/p_1)
Proof: Johnson & Wichern (2002), page 647.
Particular cases:
Equal prior probabilities:
R_1: f_1(x)/f_2(x) ≥ c(1|2)/c(2|1) and R_2: f_1(x)/f_2(x) < c(1|2)/c(2|1)
Equal misclassification costs:
R_1: f_1(x)/f_2(x) ≥ p_2/p_1 and R_2: f_1(x)/f_2(x) < p_2/p_1
Equal prior probabilities and misclassification costs:
R_1: f_1(x)/f_2(x) ≥ 1 and R_2: f_1(x)/f_2(x) < 1.
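The ECM-minimizing regions amount to a single threshold on the density ratio. A minimal sketch with two illustrative univariate normal densities (the populations and parameters are invented for the example):

```python
import numpy as np

def ecm_rule(f1, f2, x, c12, c21, p1, p2):
    """Allocate x to population 1 iff
    f1(x)/f2(x) >= (c(1|2)/c(2|1)) * (p2/p1), i.e. x in R_1;
    otherwise to population 2 (x in R_2)."""
    threshold = (c12 / c21) * (p2 / p1)
    return 1 if f1(x) / f2(x) >= threshold else 2

def normal_pdf(mu, sigma):
    # Univariate normal density, used as an illustrative f_i.
    return lambda x: np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

f1 = normal_pdf(0.0, 1.0)
f2 = normal_pdf(3.0, 1.0)

# Equal costs and equal priors: the threshold is 1 and the
# boundary is the midpoint x = 1.5 between the two means.
g0 = ecm_rule(f1, f2, 0.0, c12=1, c21=1, p1=0.5, p2=0.5)
g3 = ecm_rule(f1, f2, 3.0, c12=1, c21=1, p1=0.5, p2=0.5)
```

Raising c(1|2) or p_2 raises the threshold and shrinks R_1, exactly as the particular cases above suggest.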
Other criteria to derive an optimal classification procedure:
Minimize the total probability of misclassification (TPM):
TPM = p_1 P(2|1) + p_2 P(1|2)
Mathematically, this problem is equivalent to minimizing the ECM when the costs of misclassification are equal.
Allocate a new observation x_0 to the population with the largest posterior probability P(π_i | x_0). By Bayes' rule, we obtain:
P(π_1 | x_0) = p_1 f_1(x_0) / [p_1 f_1(x_0) + p_2 f_2(x_0)]
P(π_2 | x_0) = p_2 f_2(x_0) / [p_1 f_1(x_0) + p_2 f_2(x_0)]
7.3 Classification with two multivariate normal populations
Normal populations are often used in theory and practice because of their simplicity and reasonably high efficiency across a wide variety of population models.
HYPOTHESES:
f_1(x) = N_p(μ_1, Σ_1) and f_2(x) = N_p(μ_2, Σ_2)
If X ~ N_p(μ, Σ) then:
f(x) = (2π)^(-p/2) det(Σ)^(-1/2) exp[ −(1/2)(x − μ)′ Σ^(-1) (x − μ) ]
Before using these rules, it is necessary to test the normality hypothesis (e.g. with a QQ-plot). If the data reject the Gaussianity assumption, we can try to attain it by a transformation of the data (e.g. a logarithm transformation).
Linear classification: Σ_1 = Σ_2 = Σ
RESULT: The regions R_1 and R_2 that minimize the ECM are defined by the values of x for which the following inequalities hold:
R_1: f_1(x)/f_2(x) ≥ [c(1|2)/c(2|1)] (p_2/p_1)
R_2: f_1(x)/f_2(x) < [c(1|2)/c(2|1)] (p_2/p_1)
which gives, after simplification:
R_1: (μ_1 − μ_2)′ Σ^(-1) x − (1/2)(μ_1 − μ_2)′ Σ^(-1) (μ_1 + μ_2) ≥ ln[ (c(1|2)/c(2|1)) (p_2/p_1) ]
R_2: (μ_1 − μ_2)′ Σ^(-1) x − (1/2)(μ_1 − μ_2)′ Σ^(-1) (μ_1 + μ_2) < ln[ (c(1|2)/c(2|1)) (p_2/p_1) ]
But in practice μ_1, μ_2 and Σ are unknown.
Estimate μ_1 and Σ_1 using the sample from π_1 of size n_1:
μ̂_1 = (x̄_1^(1), x̄_2^(1), . . . , x̄_p^(1))′ and Σ̂_1 = S_1 = (S_ij^(1))_{i,j = 1, . . . , p}
Estimate μ_2 and Σ_2 using the sample from π_2 of size n_2:
μ̂_2 = (x̄_1^(2), x̄_2^(2), . . . , x̄_p^(2))′ and Σ̂_2 = S_2 = (S_ij^(2))_{i,j = 1, . . . , p}
Under the hypothesis Σ_1 = Σ_2, we can use an unbiased pooled estimator of Σ:
Σ̂ = S_pooled = [ (n_1 − 1) S_1 + (n_2 − 1) S_2 ] / [ (n_1 − 1) + (n_2 − 1) ]
The estimated rule minimizing the ECM is then:
R_1: (x̄_1 − x̄_2)′ S_pooled^(-1) x − (1/2)(x̄_1 − x̄_2)′ S_pooled^(-1) (x̄_1 + x̄_2) ≥ ln[ (c(1|2)/c(2|1)) (p_2/p_1) ]
R_2: (x̄_1 − x̄_2)′ S_pooled^(-1) x − (1/2)(x̄_1 − x̄_2)′ S_pooled^(-1) (x̄_1 + x̄_2) < ln[ (c(1|2)/c(2|1)) (p_2/p_1) ]
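The estimated linear rule can be sketched end-to-end on simulated data (the two Gaussian samples below are illustrative, not from the course):

```python
import numpy as np

def linear_rule(x, xbar1, xbar2, S_pooled, c12=1.0, c21=1.0, p1=0.5, p2=0.5):
    """Estimated ECM-minimizing linear rule: allocate x to pi_1 iff
    (xbar1-xbar2)' S^-1 x - 0.5 (xbar1-xbar2)' S^-1 (xbar1+xbar2)
    >= ln[(c(1|2)/c(2|1)) (p2/p1)]."""
    d = np.linalg.solve(S_pooled, xbar1 - xbar2)
    score = d @ x - 0.5 * d @ (xbar1 + xbar2)
    return 1 if score >= np.log((c12 / c21) * (p2 / p1)) else 2

rng = np.random.default_rng(2)
n1 = n2 = 200
X1 = rng.normal([0, 0], 1.0, size=(n1, 2))   # sample from pi_1
X2 = rng.normal([3, 3], 1.0, size=(n2, 2))   # sample from pi_2

xbar1, xbar2 = X1.mean(axis=0), X2.mean(axis=0)
S1 = np.cov(X1, rowvar=False)
S2 = np.cov(X2, rowvar=False)
# Unbiased pooled covariance estimator under Sigma_1 = Sigma_2.
S_pooled = ((n1 - 1) * S1 + (n2 - 1) * S2) / (n1 + n2 - 2)

g = linear_rule(np.array([0.2, -0.1]), xbar1, xbar2, S_pooled)
```

A point near the first sample mean is allocated to π_1, a point near the second to π_2.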
Quadratic classification: Σ_1 ≠ Σ_2
RESULT: The regions R_1 and R_2 that minimize the ECM are defined by the values of x for which the following inequalities hold:
R_1: f_1(x)/f_2(x) ≥ [c(1|2)/c(2|1)] (p_2/p_1) and R_2: f_1(x)/f_2(x) < [c(1|2)/c(2|1)] (p_2/p_1)
which gives, after simplification:
R_1: −(1/2) x′ (Σ_1^(-1) − Σ_2^(-1)) x + (μ_1′ Σ_1^(-1) − μ_2′ Σ_2^(-1)) x − k ≥ ln[ (c(1|2)/c(2|1)) (p_2/p_1) ]
R_2: −(1/2) x′ (Σ_1^(-1) − Σ_2^(-1)) x + (μ_1′ Σ_1^(-1) − μ_2′ Σ_2^(-1)) x − k < ln[ (c(1|2)/c(2|1)) (p_2/p_1) ]
where
k = (1/2) ln[ det(Σ_1)/det(Σ_2) ] + (1/2) (μ_1′ Σ_1^(-1) μ_1 − μ_2′ Σ_2^(-1) μ_2)
The estimated rule minimizing the ECM is then:
R_1: −(1/2) x′ (S_1^(-1) − S_2^(-1)) x + (x̄_1′ S_1^(-1) − x̄_2′ S_2^(-1)) x − k ≥ ln[ (c(1|2)/c(2|1)) (p_2/p_1) ]
R_2: −(1/2) x′ (S_1^(-1) − S_2^(-1)) x + (x̄_1′ S_1^(-1) − x̄_2′ S_2^(-1)) x − k < ln[ (c(1|2)/c(2|1)) (p_2/p_1) ]
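The quadratic rule can be sketched the same way; the means and covariances below are invented for illustration (with S_2 deliberately different from S_1 so the rule is genuinely quadratic):

```python
import numpy as np

def quadratic_rule(x, xbar1, xbar2, S1, S2, c12=1.0, c21=1.0, p1=0.5, p2=0.5):
    """Estimated quadratic rule: allocate x to pi_1 iff
    -0.5 x'(S1^-1 - S2^-1)x + (xbar1'S1^-1 - xbar2'S2^-1)x - k
    >= ln[(c(1|2)/c(2|1)) (p2/p1)], with
    k = 0.5 ln(det S1/det S2)
        + 0.5 (xbar1'S1^-1 xbar1 - xbar2'S2^-1 xbar2)."""
    S1i, S2i = np.linalg.inv(S1), np.linalg.inv(S2)
    k = 0.5 * np.log(np.linalg.det(S1) / np.linalg.det(S2)) \
        + 0.5 * (xbar1 @ S1i @ xbar1 - xbar2 @ S2i @ xbar2)
    score = -0.5 * x @ (S1i - S2i) @ x + (xbar1 @ S1i - xbar2 @ S2i) @ x - k
    return 1 if score >= np.log((c12 / c21) * (p2 / p1)) else 2

xbar1 = np.array([0.0, 0.0])
S1 = np.eye(2)
xbar2 = np.array([4.0, 4.0])
S2 = 2.0 * np.eye(2)
g = quadratic_rule(np.array([0.1, 0.1]), xbar1, xbar2, S1, S2)
```

When S_1 = S_2 the quadratic term vanishes and the rule reduces to the linear one above.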
7.4 Evaluation of classification rules
Total probability of misclassification (TPM):
TPM = p_1 ∫_{R_2} f_1(x) dx + p_2 ∫_{R_1} f_2(x) dx
The lowest value of this quantity is called the optimum error rate (OER).
Suppose that p_1 = p_2, c(2|1) = c(1|2), f_1(x) = N(μ_1, Σ) and f_2(x) = N(μ_2, Σ); then the regions minimizing the TPM are:
R_1: (μ_1 − μ_2)′ Σ^(-1) x − (1/2)(μ_1 − μ_2)′ Σ^(-1) (μ_1 + μ_2) ≥ 0
R_2: (μ_1 − μ_2)′ Σ^(-1) x − (1/2)(μ_1 − μ_2)′ Σ^(-1) (μ_1 + μ_2) < 0
RESULT: The optimum error rate is:
OER = Φ(−Δ/2), where Δ² = (μ_1 − μ_2)′ Σ^(-1) (μ_1 − μ_2)
Example: if Δ² = 2.56 then OER = Φ(−0.8) = 0.2119; hence the optimal classification rule fails in about 21% of cases.
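The OER formula is easy to check numerically; the means and covariance below are chosen so that Δ² = 2.56, reproducing the example above.

```python
import numpy as np
from scipy.stats import norm

def optimum_error_rate(mu1, mu2, Sigma):
    """OER = Phi(-Delta/2), where the squared Mahalanobis distance
    is Delta^2 = (mu1 - mu2)' Sigma^-1 (mu1 - mu2)."""
    diff = np.asarray(mu1, dtype=float) - np.asarray(mu2, dtype=float)
    delta2 = diff @ np.linalg.solve(Sigma, diff)
    return norm.cdf(-np.sqrt(delta2) / 2)

# Delta^2 = 1.6^2 = 2.56, so OER = Phi(-0.8) ~ 0.2119.
oer = optimum_error_rate([1.6, 0.0], [0.0, 0.0], np.eye(2))
```

Larger separation Δ between the populations drives the OER toward zero.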
But the rule is generally based on estimators; the corresponding actual error rate (AER) is
AER = p_1 ∫_{R̂_2} f_1(x) dx + p_2 ∫_{R̂_1} f_2(x) dx
where
R̂_1: (x̄_1 − x̄_2)′ S_pooled^(-1) x − (1/2)(x̄_1 − x̄_2)′ S_pooled^(-1) (x̄_1 + x̄_2) ≥ 0
R̂_2: (x̄_1 − x̄_2)′ S_pooled^(-1) x − (1/2)(x̄_1 − x̄_2)′ S_pooled^(-1) (x̄_1 + x̄_2) < 0
But the calculations needed to obtain the AER are difficult and depend on f_1(x) and f_2(x).
Apparent error rate (APER):
APER = % of observations in the sample that are misclassified
⇒ Easy to calculate and does not require knowledge of the density functions.
But the APER underestimates the AER even if the n_i are large.
Solution: the problem comes from the fact that the same sample is used to construct the rule and also to test the quality of the classification.

d(x, y) = ( Σ_i |x_i − y_i|^m )^(1/m)
For m = 1, d(x, y) is the city-block distance, and for m = 2 we recover the Euclidean distance.
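This family of distances is one line of code; the points below are illustrative.

```python
import numpy as np

def minkowski(x, y, m):
    """Minkowski distance d(x, y) = (sum_i |x_i - y_i|^m)^(1/m);
    m = 1 gives the city-block distance, m = 2 the Euclidean one."""
    return np.sum(np.abs(np.asarray(x) - np.asarray(y)) ** m) ** (1.0 / m)

x, y = [0.0, 0.0], [3.0, 4.0]
d1 = minkowski(x, y, 1)   # city-block
d2 = minkowski(x, y, 2)   # Euclidean
```

For the pair above the city-block distance is 3 + 4 = 7 while the Euclidean distance is 5, illustrating that the choice of m changes which individuals count as "close".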
CHAPTER 8. CLUSTERING 266
Similarity measures for variables
Quantitative variables:
sample correlation coefficients
absolute values of correlation coefficients
. . .
Binary variables:
φ² = r² = χ²/n
frequencies
. . .
Qualitative variables:
χ² statistics
φ² = χ²/n
. . .
There are many ways to measure similarity between individuals or variables.
Stepwise algorithms:
Two families of algorithms:
Nonhierarchical clustering methods: direct partition into a fixed number of groups (clusters)
Moving centers method
K-means method
Hierarchical clustering methods
Agglomerative hierarchical methods: start with individual objects, then the most similar objects are grouped first, and so on.
Divisive hierarchical methods: work in the opposite direction
A large literature exists on this subject.
8.2 Nonhierarchical clustering methods
Mainly used for large databases.
Goal: find q (fixed) groups of the n individuals with
- homogeneity within each group
- heterogeneity between groups
⇒ Find a criterion to measure the proximity among individuals of the same group and compare this measure for all possible partitions. BUT . . .
Example: 4 groups for 14 individuals: more than 10 million partitions!
It is then impossible to find the best partition ⇒ use an algorithm to find a partition close to the best partition.
8.2.1 Algorithm: Moving centers method
Consider a set of n individuals with P characteristics. Let d be a distance in IR^P (Euclidean, χ², . . .). The number of groups is fixed to q.
Step 0: Choose q starting centers (random selection of q individuals):
C_1^0, . . . , C_k^0, . . . , C_q^0
Assigning each individual to the nearest center creates a partition P^0: I_1^0, . . . , I_k^0, . . . , I_q^0.
Step 1: Creation of a partition P^1 in q groups, using the same distance rule, of the n individuals: I_1^1, . . . , I_k^1, . . . , I_q^1.
. . .
Step m: Let the new centers of the q groups be
C_1^m, . . . , C_k^m, . . . , C_q^m
calculated as the centers of gravity of the q groups obtained in step m − 1: I_1^(m−1), . . . , I_k^(m−1), . . . , I_q^(m−1).
⇒ Creation of a new partition P^m using the same methodology: I_1^m, . . . , I_k^m, . . . , I_q^m.
. . .
Final step: Stop the iterations
if the number of iterations exceeds a given number chosen a priori (security);
if two consecutive steps give the same partition;
if a statistical criterion (the intra-class variance) doesn't decrease sufficiently.
:-) This algorithm converges, since one can prove that the intra-class variance never increases from step m to step m + 1.
:-( The final partition depends on the initial centers, chosen randomly in step 0.
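The steps above can be sketched as follows, with the Euclidean distance and the "same partition twice" stopping rule; the two simulated clouds are illustrative.

```python
import numpy as np

def moving_centers(X, q, max_iter=100, seed=0):
    """Moving-centers sketch: random starting centers taken among the
    individuals, assignment of each individual to its nearest center,
    recomputation of centers as centroids, stop when two consecutive
    partitions coincide (or after max_iter iterations)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=q, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for it in range(max_iter):
        # Distance of every individual to every current center.
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dist.argmin(axis=1)
        if it > 0 and np.array_equal(new_labels, labels):
            break  # two consecutive steps give the same partition
        labels = new_labels
        for k in range(q):
            members = X[labels == k]
            if len(members) > 0:
                centers[k] = members.mean(axis=0)  # center of gravity
    return labels, centers

# Two well-separated simulated clouds: the partition should recover them.
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(5, 0.3, (30, 2))])
labels, centers = moving_centers(X, q=2)
```

Re-running with a different `seed` for the starting centers can yield a different final partition, which is exactly the drawback noted above.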
8.2.2 Stable groups
The moving centers algorithm converges to a local optimum, since the final partition depends on the initial centers chosen randomly in step 0.