04 Multivariate Vectors and Samples 2008 1203418099646432 3

Multivariate Samples
- Recall some very basic concepts of univariate and bivariate statistics
- Describe multivariate samples
- Analyze multivariate samples from a geometrical perspective
- Describe distances in the Euclidean space
The data we will consider

Example 1. Innovation and Research in Europe (Source: Eurostat)

Variable            Description
Geo                 Country code
Country             Country name
Region              European region
Educ_Exp            Spending on human resources (total public expenditure on education), % of GDP
GERD                Gross domestic expenditure on R&D (GERD), % of GDP
GERD_industry       % of GERD financed by industry
GERD_govern         % of GERD financed by government
GERD_abroad         % of GERD financed from abroad
Internet_Acc        Level of Internet access: % of households with Internet access at home
ST_grad             Tertiary graduates in science and technology per 1000 persons aged 20-29
ST_grad_f           Female tertiary graduates in S&T per 1000 females aged 20-29
ST_grad_m           Male tertiary graduates in S&T per 1000 males aged 20-29
EPO                 Number of patent applications to the European Patent Office per million inhabitants
USTPO               Number of patents granted by the US Patent and Trademark Office per million inhabitants
IT_Expenditure      Expenditure on information technology, % of GDP
Telec_Expenditure   Expenditure on telecommunications, % of GDP
Y_Educ_Lev          Youth education attainment level, total: % of the population aged 20-24 having completed at least upper secondary education
Y_Educ_Lev_f        % of females aged 20-24 having completed at least upper secondary education
Y_Educ_Lev_m        % of males aged 20-24 having completed at least upper secondary education
E_gov_avail         E-government online availability: online availability of 20 basic public services
HT_Exports          Exports of high-technology products as a share of total exports
Some basic concepts of Univariate and Bivariate statistics

Back to basics. Considering one variable
Let us consider one variable of interest, say EPO.

Country          Region     Internet_Acc      EPO
Romania          Eastern            6.00     1.31
Czech Republic   Eastern           19.00    12.04
Lithuania        Northern          12.00     2.78
Ireland          Northern          40.00    79.87
Norway           Northern          60.00   135.77
UK               Northern          56.00   124.19
Finland          Northern          51.00   309.09
Sweden           Northern          73.00   293.32
Greece           Southern          17.00     9.87
Italy            Southern          34.00    84.14
Spain            Southern          34.00    30.64
Netherlands      Western           67.00   246.15
Belgium          Western           50.00   141.80
Germany          Western           60.00   299.99
France           Western           34.00   144.52
In statistics, a commonly used measure of position is the arithmetic (sample) mean, obtained by summing all the observed values and dividing the result by the number of observations.
Mean of EPO: $\overline{epo} = 127.6987$
[Plot: EPO values by country with the mean line at 127.6987; Netherlands and Spain are labelled.]
Variable of interest: EPO

The mean can be used to make a prediction about EPO for a generic country without any further information.
To evaluate the reliability of the mean as a synthesis of the observed data, we can consider, for each observed value, the error incurred when replacing it with the sample mean.
[Plot: errors incurred when replacing the values observed for Netherlands and Spain with the mean (127.6987).]
The TOTAL SUM OF SQUARES is the sum of the squared errors:
$$\text{Total SS}_{EPO} = \sum_{i=1}^{15} (epo_i - \overline{epo})^2$$

A synthesis of the errors, and a measure of the reliability of the mean as a synthesis of the observed data, is the (sample) variance:
$$\text{Var of EPO} = \frac{1}{14}\sum_{i=1}^{15} (epo_i - \overline{epo})^2 = \frac{1}{14}\sum_{i=1}^{15} (epo_i - 127.6987)^2$$
This is the average of the squared errors we incur when replacing the observed values with the sample mean. It is obtained by dividing the Total SS by the number of observations minus 1.
The variance of EPO turns out to be 12646.5814. Hence the error we can expect to incur for a generic observation is the square root of the variance, which is called the standard deviation:
$$\text{Std. Dev. of EPO} = \sqrt{\text{Var of EPO}} = \sqrt{12646.5814} = 112.457$$
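The figures above can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the variable names (`epo`, `total_ss`, and so on) are illustrative and do not come from the slides.

```python
import math

# EPO values for the 15 countries, in the order of the table above
epo = [1.31, 12.04, 2.78, 79.87, 135.77, 124.19, 309.09, 293.32,
       9.87, 84.14, 30.64, 246.15, 141.80, 299.99, 144.52]

n = len(epo)
mean = sum(epo) / n                            # arithmetic (sample) mean
total_ss = sum((v - mean) ** 2 for v in epo)   # total sum of squares
variance = total_ss / (n - 1)                  # sample variance, divisor n - 1
std_dev = math.sqrt(variance)                  # standard deviation

print(f"mean={mean:.4f} total_ss={total_ss:.4f} "
      f"variance={variance:.4f} std_dev={std_dev:.3f}")
```

Dividing by n - 1 rather than n is what makes this the sample variance used throughout the slides.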
In statistics we are mainly concerned with the explanation of variance, i.e., we are interested in explaining why a phenomenon varies and, also, in finding predictive tools characterized by low prediction errors.
So the question now is: can we do better than the mean? That is, can we use external information (other variables) related to EPO, and hence useful to predict the values of EPO with a lower error?
In the following we will consider two supporting variables having different characteristics:
- the Region (a categorical variable)
- Internet_Access (a numerical variable)
and we will show how to evaluate the extent to which one external variable provides information about the variable of interest.

If we consider the region, can our prediction of EPO be improved?

Values observed within the regions:

Country          Region        EPO
Romania          Eastern      1.31
Czech Republic   Eastern     12.04
Lithuania        Northern     2.78
Ireland          Northern    79.87
Norway           Northern   135.77
UK               Northern   124.19
Finland          Northern   309.09
Sweden           Northern   293.32
Greece           Southern     9.87
Italy            Southern    84.14
Spain            Southern    30.64
Netherlands      Western    246.15
Belgium          Western    141.80
Germany          Western    299.99
France           Western    144.52

General mean = 127.6987
We can use the conditional means (the means of EPO within each region) rather than the general one.
This is worthwhile only if the prediction error is considerably lower (it can be shown that, by construction, it can never be higher).
[Plot: EPO values by region with the general mean line; Netherlands and Spain are labelled.]
Consider the region to improve prediction on EPO

Use the conditional means.
To evaluate the reliability of the conditional means as syntheses of the observed EPO data, we can consider the squared difference between each value and the conditional mean of its own region.
[Plot: errors for Netherlands and Spain with respect to the conditional means of their regions.]
The WITHIN SUM OF SQUARES of EPO given Region is the sum of the squared errors incurred when using the conditional means (by region) to predict EPO.
Compare general mean / conditional means as predictors of EPO

Country          Region        EPO   General mean   Squared error   Conditional mean   Squared error
Romania          Eastern      1.31       127.6987      15974.1035             6.6750         28.7832
Czech Republic   Eastern     12.04       127.6987      13376.9349             6.6750         28.7832
Lithuania        Northern     2.78       127.6987      15604.6816           157.5033      23939.3
Ireland          Northern    79.87       127.6987       2287.5845           157.5033       6026.929
Norway           Northern   135.77       127.6987         65.1459           157.5033        472.3363
UK               Northern   124.19       127.6987         12.311            157.5033       1109.776
Finland          Northern   309.09       127.6987      32902.8037           157.5033      22978.53
Sweden           Northern   293.32       127.6987      27430.415            157.5033      18446.18
Greece           Southern     9.87       127.6987      13883.6025            41.5500       1003.622
Italy            Southern    84.14       127.6987       1897.3603            41.5500       1813.908
Spain            Southern    30.64       127.6987       9420.3912            41.5500        119.0281
Netherlands      Western    246.15       127.6987      14030.7105           208.1150       1446.661
Belgium          Western    141.80       127.6987        198.8467           208.1150       4397.679
Germany          Western    299.99       127.6987      29684.2921           208.1150       8441.016
France           Western    144.52       127.6987        282.9561           208.1150       4044.324

TOTAL SS_EPO = 177052.1395          WITHIN SS_EPO|REGION = 94296.85
If we use the region, our improvement as compared to the general mean is
$$R^2_{EPO|REG} = 1 - \frac{\text{WITHIN SS}_{EPO|REG}}{\text{TOTAL SS}_{EPO}} = 1 - \frac{94296.85}{177052.1395} = 0.467$$
This is the % of variance of EPO accounted for by Region.
The $R^2$ ranges from 0 to 1. It measures the ability of the categorical variable to act as a predictor of the numerical one.
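The comparison between the general mean and the conditional means can be reproduced as follows. This is a sketch in plain Python; the names `cond_mean`, `within_ss` and `r2` are illustrative choices, not notation from the slides.

```python
from collections import defaultdict

epo = [1.31, 12.04, 2.78, 79.87, 135.77, 124.19, 309.09, 293.32,
       9.87, 84.14, 30.64, 246.15, 141.80, 299.99, 144.52]
region = ["Eastern"] * 2 + ["Northern"] * 6 + ["Southern"] * 3 + ["Western"] * 4

# Total sum of squares around the general mean
general_mean = sum(epo) / len(epo)
total_ss = sum((v - general_mean) ** 2 for v in epo)

# Conditional means: the mean of EPO within each region
groups = defaultdict(list)
for r, v in zip(region, epo):
    groups[r].append(v)
cond_mean = {r: sum(vs) / len(vs) for r, vs in groups.items()}

# Within sum of squares: squared errors around the mean of the own region
within_ss = sum((v - cond_mean[r]) ** 2 for r, v in zip(region, epo))

# Share of the variance of EPO accounted for by Region
r2 = 1 - within_ss / total_ss
print(cond_mean, round(within_ss, 2), round(r2, 3))
```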

If we consider Internet_Access, can our prediction of EPO be improved?

Country          Internet_Acc      EPO
Romania                  6.00     1.31
Czech Republic          19.00    12.04
Lithuania               12.00     2.78
Ireland                 40.00    79.87
Norway                  60.00   135.77
UK                      56.00   124.19
Finland                 51.00   309.09
Sweden                  73.00   293.32
Greece                  17.00     9.87
Italy                   34.00    84.14
Spain                   34.00    30.64
Netherlands             67.00   246.15
Belgium                 50.00   141.80
Germany                 60.00   299.99
France                  34.00   144.52
When considering numerical variables, we are interested in evaluating the existence of a linear association between them.
To evaluate whether a linear relationship exists and to determine its direction, we refer to the sample covariance (an absolute measure of linear association):
$$\text{Cov}(EPO, Int\_Acc) = \frac{1}{14}\sum_{i=1}^{15} (epo_i - \overline{epo})(int\_acc_i - \overline{int\_acc})$$
The covariance between the two variables is reported as Cov(EPO, Int_Acc) = 1868.5152. (On the data of the table this value corresponds to the divisor n = 15; with the divisor n - 1 = 14 of the formula above, the covariance is about 2001.98. The correlation is unaffected, provided the same divisor is used for the variances and the covariance.)
The positive sign only indicates that a linear relationship exists and that it is direct (an inspection of the scatter plot confirms this). The value of the covariance, however, depends upon the units of measurement of the variables considered.
A relative measure of linear association is the correlation coefficient:
$$\text{Corr}(EPO, Int\_Acc) = \frac{\text{Cov}(EPO, Int\_Acc)}{\sqrt{\text{Var}(EPO)\,\text{Var}(Int\_Acc)}}$$
The correlation coefficient ranges from -1 to +1. Values close to +1 indicate a strong direct linear association, values close to -1 denote a strong inverse association. Values close to zero indicate no linear relationship.
Here we have Corr(EPO, Int_Acc) = 0.8527 (strong association).
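The covariance and the correlation can be computed directly from the two columns of the table. The sketch below uses plain Python and illustrative names; it evaluates the covariance with both divisors (n - 1 and n), because the value 1868.5152 reported above matches the divisor n on these data.

```python
import math

epo = [1.31, 12.04, 2.78, 79.87, 135.77, 124.19, 309.09, 293.32,
       9.87, 84.14, 30.64, 246.15, 141.80, 299.99, 144.52]
int_acc = [6, 19, 12, 40, 60, 56, 51, 73, 17, 34, 34, 67, 50, 60, 34]

n = len(epo)
mean_epo = sum(epo) / n
mean_acc = sum(int_acc) / n

# Sums of squared deviations and of cross-products of deviations
s_yy = sum((y - mean_epo) ** 2 for y in epo)
s_xx = sum((x - mean_acc) ** 2 for x in int_acc)
s_xy = sum((x - mean_acc) * (y - mean_epo) for x, y in zip(int_acc, epo))

cov_sample = s_xy / (n - 1)   # divisor n - 1, as in the formula
cov_n = s_xy / n              # divisor n, matching the reported 1868.5152

# The divisor cancels out in the correlation coefficient
corr = s_xy / math.sqrt(s_xx * s_yy)
print(round(cov_sample, 4), round(cov_n, 4), round(corr, 4))
```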

EPO = -60.018 + 4.5934*Int_Acc
The high value of the correlation tells us that observations tend to cluster around a line having a positive slope. This line, highlighted in the scatterplot, is called the regression line.
Its analytical expression can be easily determined.

Consider Internet_Access to improve prediction on EPO

Use the regression line
EPO = -60.018 + 4.5934*Int_Acc
For each observation we can calculate the difference between the observed EPO value and the value predicted using the regression line.
[Plot: scatter of EPO against Int_Acc with the regression line; the error is highlighted for Spain.]
The MODEL SUM OF SQUARES of EPO given Int_Acc is the sum of the squared errors incurred when using the line to predict EPO (in most textbooks this quantity is called the residual, or error, sum of squares).
Compare general mean / regression line as predictors of EPO

Prediction using the line: 4.5934*Int_Acc - 60.018

Country          Int_Acc      EPO   Gen mean   Squared errors   Prediction   Squared errors
Romania             6.00     1.31   127.6987       15974.1035     -32.4576        1140.251
Czech Republic     19.00    12.04   127.6987       13376.9349      27.2566         231.5449
Lithuania          12.00     2.78   127.6987       15604.6816      -4.8972          58.9394
Ireland            40.00    79.87   127.6987        2287.5845     123.7180        1922.647
Norway             60.00   135.77   127.6987          65.1459     215.5860        6370.594
UK                 56.00   124.19   127.6987          12.311      197.2124        5332.271
Finland            51.00   309.09   127.6987       32902.8037     174.2454       18183.07
Sweden             73.00   293.32   127.6987       27430.415      275.3002         324.7132
Greece             17.00     9.87   127.6987       13883.6025      18.0698          67.2367
Italy              34.00    84.14   127.6987        1897.3603      96.1576         144.4227
Spain              34.00    30.64   127.6987        9420.3912      96.1576        4292.556
Netherlands        67.00   246.15   127.6987       14030.7105     247.7398           2.5275
Belgium            50.00   141.80   127.6987         198.8467     169.6520         775.7339
Germany            60.00   299.99   127.6987       29684.2921     215.5860        7124.035
France             34.00   144.52   127.6987         282.9561      96.1576        2338.922

TOTAL SS_EPO = 177052.1395          MODEL SS_EPO|Int_Acc = 48309.46
Notice that we have a considerable decrease in the prediction errors.

If we use the line (a function of Int_Acc), our improvement as compared to the general mean is
$$R^2_{EPO|Int\_Acc} = 1 - \frac{\text{MODEL SS}_{EPO|Int\_Acc}}{\text{TOTAL SS}_{EPO}} = 1 - \frac{48309.46}{177052.1395} = 0.7271$$
This is the % of variance of EPO accounted for by Int_Acc.
The $R^2$ index ranges from 0 to 1 and it measures the ability of one numerical variable to predict the other one.
It can be shown that the index coincides with the squared correlation coefficient ($0.8527^2 \approx 0.7271$).
Hence the correlation measures the extent of linear association, whereas its square measures the percentage of the variance of one variable which can be explained by the other (numerical) variable.
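The regression line and the R2 index can be recovered from the data with ordinary least squares. The following is a minimal sketch in plain Python (names such as `slope` and `resid_ss` are illustrative); `resid_ss` is what the slides call the MODEL SS.

```python
import math

epo = [1.31, 12.04, 2.78, 79.87, 135.77, 124.19, 309.09, 293.32,
       9.87, 84.14, 30.64, 246.15, 141.80, 299.99, 144.52]
int_acc = [6, 19, 12, 40, 60, 56, 51, 73, 17, 34, 34, 67, 50, 60, 34]

n = len(epo)
mean_y = sum(epo) / n
mean_x = sum(int_acc) / n
s_xx = sum((x - mean_x) ** 2 for x in int_acc)
s_yy = sum((y - mean_y) ** 2 for y in epo)          # TOTAL SS of EPO
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(int_acc, epo))

# Least-squares line: EPO = intercept + slope * Int_Acc
slope = s_xy / s_xx
intercept = mean_y - slope * mean_x

# Squared errors incurred when predicting EPO with the line
predicted = [intercept + slope * x for x in int_acc]
resid_ss = sum((y - p) ** 2 for y, p in zip(epo, predicted))

r2 = 1 - resid_ss / s_yy
corr = s_xy / math.sqrt(s_xx * s_yy)
print(round(intercept, 3), round(slope, 4), round(resid_ss, 2), round(r2, 4))
```

The last two quantities illustrate the statement above: `r2` and `corr ** 2` agree up to rounding error.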
Data Matrices
(Numerical variables only)

Data matrices
Example 1 (continued). Innovation and Research in Europe. For the sake of simplicity, we limit attention to a few variables and a few observations.

Country          Region     GERD   GERD_industry   GERD_govern   Internet_Acc   ST_grad      EPO   E_gov_avail
Romania          Eastern    0.39           47.60         43.00           6.00      5.80     1.31         25.00
Czech Republic   Eastern    1.20           52.50         43.60          19.00      6.00    12.04         30.00
Lithuania        Northern   0.67           37.10         56.30          12.00     14.60     2.78         40.00
Ireland          Northern   1.10           66.70         25.60          40.00     20.50    79.87         50.00
Norway           Northern   1.60           51.60         39.80          60.00      7.70   135.77         56.00
UK               Northern   1.83           45.60         28.80          56.00     20.30   124.19         59.00
Finland          Northern   3.30           70.80         25.50          51.00     17.40   309.09         67.00
Sweden           Northern   4.25           71.50         21.30          73.00     13.30   293.32         74.00
Greece           Southern   0.64           33.00         46.60          17.00      8.00     9.87         32.00
Italy            Southern   1.09           47.20         46.80          34.00      7.40    84.14         53.00
Spain            Southern   0.91           47.20         39.90          34.00     11.90    30.64         55.00
Netherlands      Western    1.80           51.90         35.80          67.00      6.60   246.15         32.00
Belgium          Western    2.08           63.40         22.00          50.00     10.50   141.80         35.00
Germany          Western    2.46           65.70         31.40          60.00      8.10   299.99         47.00
France           Western    2.20           54.20         36.90          34.00     19.50   144.52         50.00

The country variable is useful to identify the statistical units, but it is not an object of analysis.
For the moment we consider only numerical variables.
The data matrix contains the information available for the n cases (rows) on the p variables (columns).
For each observation we have information collected on p variables.
For each variable we have information collected on n observations.
Here we have 15 rows (cases, n) and 7 columns (variables, p).
Data matrices
Example 1 (continued). Innovation and Research in Europe (subset): the same data matrix as before, with 15 countries (rows) and the 7 numerical variables GERD, GERD_industry, GERD_govern, Internet_Acc, ST_grad, EPO and E_gov_avail (columns).
To each observation a collection of p values is associated. These values are the realizations observed on each variable for the observation considered.
Similarly, to each variable a collection of n values can be associated (the values observed for all the cases).
A collection of k values is usually called a vector. To avoid confusion, we will only consider column vectors, with dimension (k x 1), i.e., collections of values arranged in k rows and 1 column.
A row (1 x k) vector can always be seen as the transpose of a column (k x 1) vector.
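Data matrices
Data matrix (n individuals and p variables):
$$X = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{bmatrix} = \begin{bmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_n^T \end{bmatrix} = \begin{bmatrix} x_{(1)} & x_{(2)} & \cdots & x_{(p)} \end{bmatrix}$$
A data matrix can be seen as a collection of n row (transposed) vectors (cases) and/or as a collection of p column vectors (variables).
$x_i$ = vector (p x 1) containing the measurements on the p variables for the i-th case:
$$x_i = \begin{bmatrix} x_{i1} \\ x_{i2} \\ \vdots \\ x_{ip} \end{bmatrix} \qquad x_i^T = \begin{bmatrix} x_{i1} & x_{i2} & \cdots & x_{ip} \end{bmatrix} \quad \text{(transposition operation)}$$
$x_{(j)}$ = vector (n x 1) containing the n measurements on the j-th variable.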
Data matrices
Example 1 (continued). Innovation and Research in Europe (subset): the same 15 x 7 data matrix as before, now read by columns and by rows.
The column vector $x_{(6)}$ is associated with EPO (measurements on the 15 observations).
The element in the i-th row and in the j-th column, $x_{ij}$, is the value observed for the i-th case on the j-th variable.
In this simple example, $x_{13,6} = 141.80$ is the value of EPO (6th variable) for Belgium (13th observation).
The row vector $x_{13}^T$ is associated with Belgium (measurements on the 7 variables).
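To make the row/column reading of the data matrix concrete, the sketch below stores the 15 x 7 matrix as a list of rows in plain Python. The helper `transpose` and the variable names are illustrative assumptions; note that Python indices start at 0, so the 13th case and 6th variable are `X[12][5]`.

```python
variables = ["GERD", "GERD_industry", "GERD_govern", "Internet_Acc",
             "ST_grad", "EPO", "E_gov_avail"]
countries = ["Romania", "Czech Republic", "Lithuania", "Ireland", "Norway",
             "UK", "Finland", "Sweden", "Greece", "Italy", "Spain",
             "Netherlands", "Belgium", "Germany", "France"]

# Data matrix X: one row per country, one column per variable
X = [
    [0.39, 47.60, 43.00,  6.0,  5.80,   1.31, 25.0],
    [1.20, 52.50, 43.60, 19.0,  6.00,  12.04, 30.0],
    [0.67, 37.10, 56.30, 12.0, 14.60,   2.78, 40.0],
    [1.10, 66.70, 25.60, 40.0, 20.50,  79.87, 50.0],
    [1.60, 51.60, 39.80, 60.0,  7.70, 135.77, 56.0],
    [1.83, 45.60, 28.80, 56.0, 20.30, 124.19, 59.0],
    [3.30, 70.80, 25.50, 51.0, 17.40, 309.09, 67.0],
    [4.25, 71.50, 21.30, 73.0, 13.30, 293.32, 74.0],
    [0.64, 33.00, 46.60, 17.0,  8.00,   9.87, 32.0],
    [1.09, 47.20, 46.80, 34.0,  7.40,  84.14, 53.0],
    [0.91, 47.20, 39.90, 34.0, 11.90,  30.64, 55.0],
    [1.80, 51.90, 35.80, 67.0,  6.60, 246.15, 32.0],
    [2.08, 63.40, 22.00, 50.0, 10.50, 141.80, 35.0],
    [2.46, 65.70, 31.40, 60.0,  8.10, 299.99, 47.0],
    [2.20, 54.20, 36.90, 34.0, 19.50, 144.52, 50.0],
]
n, p = len(X), len(X[0])

def transpose(m):
    """Swap rows and columns of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]

x_col_6 = transpose(X)[5]   # column vector x_(6): EPO on the 15 cases
x_row_13 = X[12]            # row vector x_13^T: Belgium on the 7 variables
x_13_6 = X[12][5]           # element x_{13,6}: EPO for Belgium

print(n, p, countries[12], variables[5], x_13_6)
```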
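Data matrices: Vectors

A (K x 1) vector can be seen as an oriented line in a K-dimensional space.
A one-dimensional vector (scalar): $v = [v_1]$
A two-dimensional vector: $v = \begin{bmatrix} v_1 \\ v_2 \end{bmatrix}$
A three-dimensional vector: $v = \begin{bmatrix} v_1 \\ v_2 \\ v_3 \end{bmatrix}$
[Figures: a scalar on the v1 axis, a two-dimensional vector in the (v1, v2) plane, and a three-dimensional vector with axes v1, v2, v3.]
Vectors of higher dimension cannot be represented in this way.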
Data matrices: Vectors (length)

For a given vector in the K-dimensional space, we define its length as:
$$\|v\| = \sqrt{v_1^2 + v_2^2 + \dots + v_K^2}$$
It is the length of the line connecting v to the origin, 0.
[Figures: the length of a one-, two- and three-dimensional vector, drawn on the axes v1, v2, v3.]
Data matrices: Vectors (distance)

Given two vectors, v and u, in the k-dimensional space, we define the Euclidean distance between v and u as the length of the line connecting v to u:
$$D_E(v, u) = \|v - u\| = \sqrt{(v_1 - u_1)^2 + (v_2 - u_2)^2 + \dots + (v_k - u_k)^2}$$
Note: the length of a vector v coincides with its distance from the origin, 0:
$$\|v\| = \sqrt{v_1^2 + v_2^2 + \dots + v_k^2} = D_E(v, 0)$$
[Figure: example in the two-dimensional space, showing v, u and the two legs |v1 - u1| and |v2 - u2| of the right triangle whose hypotenuse is D_E(v, u).]
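The two definitions translate almost literally into code. The following sketch uses plain Python; the function names `length` and `euclidean_distance` are illustrative.

```python
import math

def length(v):
    """Length (norm) of a vector: square root of the sum of squared elements."""
    return math.sqrt(sum(c * c for c in v))

def euclidean_distance(v, u):
    """Euclidean distance between two vectors of the same dimension."""
    if len(v) != len(u):
        raise ValueError("vectors must have the same dimension")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v, u)))

v = [3.0, 4.0]
u = [0.0, 0.0]
print(length(v), euclidean_distance(v, u))   # both are the distance of v from the origin
```

With u equal to the origin, the two functions return the same value, as stated above.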
Analyze multivariate samples from a geometrical perspective
Describe distances in the Euclidean space
Data matrices
A data matrix can be seen as a collection of two kinds of vectors:
- Row vectors $x_i$, which lie in the p-dimensional space
- Column vectors $x_{(j)}$, which lie in the n-dimensional space
Hence two spaces can be considered to analyze/describe a data matrix.
Of course, these spaces are related to each other.
For the sake of simplicity, we will analyze in depth only the space of the observations.
Syntheses of variables
$$X = \begin{bmatrix} x_{(1)} & x_{(2)} & \cdots & x_{(p)} \end{bmatrix}$$
How to arrange syntheses of the p variables, i.e., how to synthesize the elements of the column vectors?
The position. The sample mean (unbiased estimator for the population mean) for the j-th variable (column) is:
$$\bar{x}_j = \frac{1}{n}\sum_{i=1}^{n} x_{ij}$$
The vector of the sample means (centroid) is:
$$\bar{x} = \begin{bmatrix} \bar{x}_1 \\ \vdots \\ \bar{x}_p \end{bmatrix}$$
It may be seen as the vector associated with the artificial "mean case", an unobserved case that is average with respect to all the variables.
Remember: the mean is not robust (it is sensitive to extreme values).
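The space of the observations

Consider a graphical representation we are used to: the 2-dimensional space.
[Plot: scatter of the observations, Internet_Acc on the horizontal axis and E_gov_indiv on the vertical axis, with the means of the two variables marked. Note: axes adjusted to have the same scale.]
The centroid (the vector whose elements are the sample means) is the centre of gravity of the cloud.
It is the point which is globally least distant from all the points, in the sense that it minimizes the sum of the squared distances from them.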
Synthesis of variables
The dispersion around the mean.
The sample variance (unbiased estimator for the population variance) for the j-th variable (column) is:
$$s_{jj} = \frac{1}{n-1}\sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2$$
It is the average of the squared errors we incur when replacing the observed values with the sample mean.
Notice that it is also the average of the squared distances between the observed values and the sample mean.
The sample standard deviation for the j-th variable (column) is $\sqrt{s_{jj}}$.
The Std. Dev. has the same unit of measurement as the variable taken into account. It measures the error (below or above the mean) we can expect to incur when replacing a generic case with the mean.
Moreover, it can be regarded as the typical distance between a generic value and the mean.
Being based upon averages, both the variance and the standard deviation are not robust (they are sensitive to extreme values).

Consider again the 2-dimensional space
Let us consider the distance from Iceland (IS) to the centroid:
$$D_E(IS, \bar{x}) = \sqrt{(x_{IS,egov\_ind} - \bar{x}_{egov\_ind})^2 + (x_{IS,int\_acc} - \bar{x}_{int\_acc})^2}$$
The first term involves the absolute difference between the Iceland E_gov_Indiv value and the mean of E_gov_Indiv; the second, the absolute difference between the Iceland Internet_Acc value and the mean of Internet_Acc.
[Plot: scatter with the centroid and the segment joining Iceland to it. Note: axes adjusted to have the same scale.]

Consider, in the 2-dimensional space, ALL THE DISTANCES FROM POINTS TO THE CENTROID.
Recall that
$$s_{jj} = \frac{1}{n-1}\sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2$$
Hence
$$\text{Var}(E\_gov\_indiv) + \text{Var}(Internet\_Acc) = \frac{1}{n-1}\sum_{i=1}^{n} D_E(x_i, \bar{x})^2$$
The SUM of the variances of THE TWO VARIABLES is proportional to the sum of the squared distances from the observations to the centroid.
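The proportionality can be verified numerically. The sketch below is an illustration under an assumption: the plots in the slides use a larger sample and E_gov_indiv, whereas here the 15-country subset with Internet_Acc and E_gov_avail is used, since these are the values listed in the data matrix.

```python
int_acc = [6, 19, 12, 40, 60, 56, 51, 73, 17, 34, 34, 67, 50, 60, 34]
e_gov = [25, 30, 40, 50, 56, 59, 67, 74, 32, 53, 55, 32, 35, 47, 50]
n = len(int_acc)

def mean(values):
    return sum(values) / len(values)

def variance(values):
    """Sample variance with divisor n - 1."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)

# Centroid: the vector of the sample means
centroid = (mean(int_acc), mean(e_gov))

# Squared Euclidean distance of every observation from the centroid
sq_dist = [(x - centroid[0]) ** 2 + (y - centroid[1]) ** 2
           for x, y in zip(int_acc, e_gov)]

sum_sq_dist = sum(sq_dist)
sum_of_variances = variance(int_acc) + variance(e_gov)

# The two quantities differ only by the factor n - 1
print(centroid, sum_sq_dist, (n - 1) * sum_of_variances)
```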
Synthesis of association between variables

The linear association.
The sample covariance for the j-th and the h-th variables (columns) is
$$s_{jh} = \frac{1}{n-1}\sum_{i=1}^{n} (x_{ij} - \bar{x}_j)(x_{ih} - \bar{x}_h)$$
(absolute measure of linear association).
The sample correlation coefficient for the j-th and the h-th variables is
$$r_{jh} = \frac{s_{jh}}{\sqrt{s_{jj}\, s_{hh}}}$$
(relative measure of linear association; it ranges from -1 to +1).
Remember: being based upon averages, the correlation coefficient is not robust (it is sensitive to extreme values).

Consider again the 2-dimensional space
Since the covariance and the correlation actually measure the concentration of the points around a line, both indices give us information about the ORIENTATION of the scatter.
Variance and Covariance Matrix

Variances and covariances are arranged in the so-called variance and covariance matrix:
$$S = \begin{bmatrix} s_{11} & s_{12} & \cdots & s_{1p} \\ s_{21} & s_{22} & \cdots & s_{2p} \\ \vdots & \vdots & & \vdots \\ s_{p1} & s_{p2} & \cdots & s_{pp} \end{bmatrix}$$
S is a square matrix (the number of rows equals the number of columns).
The diagonal elements of S, $s_{jj}$, are the variances (notice that the variance can be regarded as the covariance between one variable and itself).
The off-diagonal elements of S, $s_{jh}$, are the covariances.
Since $s_{jh} = s_{hj}$, S is a symmetric matrix.
Correlation Matrix
Correlations are arranged in the correlation matrix:
$$R = \begin{bmatrix} 1 & r_{12} & \cdots & r_{1p} \\ r_{21} & 1 & \cdots & r_{2p} \\ \vdots & \vdots & & \vdots \\ r_{p1} & r_{p2} & \cdots & 1 \end{bmatrix}$$
R is also a square matrix, and its diagonal elements are 1s (the correlation between one variable and itself is 1).
Its off-diagonal elements, $r_{jh}$, are the correlations, and of course R is a symmetric matrix.
Due to the relationship between covariances and correlations,
$$r_{jh} = \frac{s_{jh}}{\sqrt{s_{jj}\, s_{hh}}}$$
R can be simply obtained from the variance and covariance matrix.
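As an illustration, S and R can be built for three of the variables of the example (Internet_Acc, EPO, E_gov_avail). This is a sketch in plain Python with illustrative names; the choice of three variables is only meant to keep the output small.

```python
import math

data = {
    "Internet_Acc": [6, 19, 12, 40, 60, 56, 51, 73, 17, 34, 34, 67, 50, 60, 34],
    "EPO": [1.31, 12.04, 2.78, 79.87, 135.77, 124.19, 309.09, 293.32,
            9.87, 84.14, 30.64, 246.15, 141.80, 299.99, 144.52],
    "E_gov_avail": [25, 30, 40, 50, 56, 59, 67, 74, 32, 53, 55, 32, 35, 47, 50],
}
names = list(data)
p = len(names)

def covariance(a, b):
    """Sample covariance with divisor n - 1 (the variance when a is b)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (n - 1)

# Variance and covariance matrix S (p x p)
S = [[covariance(data[j], data[h]) for h in names] for j in names]

# Correlation matrix R, obtained from S alone: r_jh = s_jh / sqrt(s_jj * s_hh)
R = [[S[j][h] / math.sqrt(S[j][j] * S[h][h]) for h in range(p)] for j in range(p)]

for name, row in zip(names, R):
    print(f"{name:13s}", [round(r, 4) for r in row])
```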

The centroid (the vector whose elements are the sample means) is the centre of gravity of the p-dimensional cloud.
The elements of the variance and covariance matrix give us information about the dispersion around the centroid (remember the 2-dimensional example) and about the orientation of the cloud.
Measuring dispersion
How to synthesize the dispersion of the n cases in the p-dimensional space? Two proposals.
TOTAL VARIANCE
As we saw before, the sum of all the variances is proportional to the sum of the squared distances from the points to the centroid. Thus, a first method to evaluate the dispersion of the points in the p-dimensional space is the so-called Total Variance:
$$\text{Total Variance} = \sum_{j=1}^{p} s_{jj}$$
Notice that we are not taking into account the interrelationships between the variables, i.e., the orientation of the cloud.
The Total Variance is the sum of the diagonal elements of the var/cov matrix, S. The sum of the diagonal elements of a square matrix is defined to be its trace. Hence, we have:
$$\text{Total Variance} = \sum_{j=1}^{p} s_{jj} = \text{Trace}(S) = \text{tr}(S)$$
To motivate the second measure of multivariate dispersion, consider the portion of the space which is occupied by the data (the area of the ellipse). We will come back to this concept later, but we can intuitively understand that the area of the ellipse (in a higher-dimensional space, the volume of an ellipsoid) is somehow related to the variances and to the covariances, i.e., to all the entries of the var/cov matrix, S.
Measuring dispersion
THE GENERALIZED VARIANCE
The volume of the ellipsoid containing the points in the p-dimensional space can be shown to be related to a particular synthesis of the elements of S, the so-called determinant of S, |S|.
The determinant is a number which can be calculated for a square matrix. It equals zero if two columns of the matrix are proportional, i.e., if they carry the same information.
This measure is called the Generalized Variance:
Generalized Variance = det(S) = |S|

Hence, to synthesize the dispersion of points in a p-dimensional space, two measures can be used, both related to the elements of the variance and covariance matrix, S.
The Total Variance takes into account only the diagonal elements of S, whilst the Generalized Variance is calculated by referring to all the elements of S.
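For two variables both measures are easy to compute by hand. The sketch below (plain Python, illustrative names) uses Internet_Acc and EPO from the 15-country subset; the helpers `trace` and `det2` are written out explicitly and `det2` handles 2 x 2 matrices only.

```python
x = [6, 19, 12, 40, 60, 56, 51, 73, 17, 34, 34, 67, 50, 60, 34]   # Internet_Acc
y = [1.31, 12.04, 2.78, 79.87, 135.77, 124.19, 309.09, 293.32,
     9.87, 84.14, 30.64, 246.15, 141.80, 299.99, 144.52]           # EPO
n = len(x)
mx, my = sum(x) / n, sum(y) / n

s_xx = sum((a - mx) ** 2 for a in x) / (n - 1)
s_yy = sum((b - my) ** 2 for b in y) / (n - 1)
s_xy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
S = [[s_xx, s_xy], [s_xy, s_yy]]

def trace(m):
    """Sum of the diagonal elements of a square matrix."""
    return sum(m[i][i] for i in range(len(m)))

def det2(m):
    """Determinant of a 2 x 2 matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

total_variance = trace(S)        # uses the variances only
generalized_variance = det2(S)   # uses variances and covariance

# For p = 2 the determinant equals s_xx * s_yy * (1 - r^2):
# the stronger the correlation, the smaller the generalized variance.
r = s_xy / (s_xx * s_yy) ** 0.5
print(total_variance, generalized_variance, s_xx * s_yy * (1 - r ** 2))

# Two proportional columns give a zero determinant
print(det2([[1.0, 2.0], [2.0, 4.0]]))
```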

The variance and covariance matrix contains relevant information to describe the points in a p-dimensional space and, also, information about their distances. We now consider different measures of distance between cases in the p-dimensional space, related to particular transformations of the original variables.
Notice first that if the variables are centred on their means, nothing changes as concerns the dispersion of the points. This operation only consists of a change of the origin.
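Multivariate Samples - Transformations

TRANSFORMATION: VARIABLES CENTRED ON THEIR MEANS
Original Data Matrix:
$$X = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{bmatrix}$$
Centroid = $\bar{x}$
Var/Cov Matrix: S
Corr Matrix: R
Centred Data Matrix:
$$\tilde{X} = \begin{bmatrix} (x_{11} - \bar{x}_1) & \cdots & (x_{1p} - \bar{x}_p) \\ \vdots & & \vdots \\ (x_{n1} - \bar{x}_1) & \cdots & (x_{np} - \bar{x}_p) \end{bmatrix}$$
Centroid = Origin = 0
Var/Cov Matrix: S
Corr Matrix: R
The centred matrix is obtained by subtracting from each observation on a given variable the mean of the variable itself. This means that from all the observations in a given column, say the j-th, the mean of the j-th variable is subtracted.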
A closer look at the distance

The Euclidean distance of a point from the origin is the length of the line connecting the point to the origin.
Consider, in the plot of the centred variables, Cyprus and Italy: their distance from the origin, 0, is (almost) the same.
This similar distance is due to different combinations of x- and y-deviations from 0. Should the x- and y-deviations be evaluated in the same manner?
Notice that the distance of Slovakia from the origin is higher. We will consider this later.
[Plot: scatter of the centred variables with Cyprus, Italy and Slovakia highlighted.]
Remember: the standard deviation of a variable is the typical deviation from the mean. Here Std.Dev.(E_gov_Avail) = 15 and Std.Dev.(Int_Acc) = 21.31.
To compare the deviations from the origin adequately (the data are centred), we should take into account the Std.Dev. (of course, squared deviations should be compared with variances).
Internet_Acc has a higher std. dev. Hence, a deviation D from the origin along the horizontal axis should count less than a deviation D from the origin along the vertical axis.
In the Euclidean distance, the deviations are considered in absolute terms.
When we are considering variables having different Std.Dev., we should consider relative deviations. To remove the effect of the Std.Dev., thus obtaining comparable deviations, we have to standardize the variables.
Standardization of the j-th variable:
$$z_{ij} = \frac{x_{ij} - \bar{x}_j}{\sqrt{s_{jj}}}$$
The Euclidean distance between two standardized observations is:
$$D_E(z_i, z_h) = \sqrt{(z_{i1} - z_{h1})^2 + (z_{i2} - z_{h2})^2 + \dots + (z_{ip} - z_{hp})^2} = \sqrt{\frac{(x_{i1} - x_{h1})^2}{s_{11}} + \frac{(x_{i2} - x_{h2})^2}{s_{22}} + \dots + \frac{(x_{ip} - x_{hp})^2}{s_{pp}}} = D_S(x_i, x_h)$$
Statistical Distance: a different weight ($1/s_{jj}$) is assigned to the squared deviation of each variable in the calculation of the distance. The statistical distance is proportional to the Euclidean one only if the variances are all equal.
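The equivalence between the Euclidean distance on standardized data and the weighted (statistical) distance on the original data can be checked numerically. This sketch assumes the 15-country subset with Internet_Acc and E_gov_avail (the plots in the slides are based on a larger sample, so the standard deviations differ); names are illustrative.

```python
import math

int_acc = [6, 19, 12, 40, 60, 56, 51, 73, 17, 34, 34, 67, 50, 60, 34]
e_gov = [25, 30, 40, 50, 56, 59, 67, 74, 32, 53, 55, 32, 35, 47, 50]
columns = [int_acc, e_gov]
n = len(int_acc)

means = [sum(c) / n for c in columns]
variances = [sum((v - m) ** 2 for v in c) / (n - 1) for c, m in zip(columns, means)]

def standardize(column, mean, variance):
    """z = (x - mean) / standard deviation."""
    sd = math.sqrt(variance)
    return [(v - mean) / sd for v in column]

z_columns = [standardize(c, m, s) for c, m, s in zip(columns, means, variances)]

def statistical_distance(i, h):
    """Squared deviations on the original scale, each weighted by 1 / s_jj."""
    return math.sqrt(sum((c[i] - c[h]) ** 2 / s for c, s in zip(columns, variances)))

def euclidean_on_z(i, h):
    """Plain Euclidean distance between standardized observations."""
    return math.sqrt(sum((z[i] - z[h]) ** 2 for z in z_columns))

italy, sweden = 9, 7   # 0-based row positions in the data matrix
print(statistical_distance(italy, sweden), euclidean_on_z(italy, sweden))
```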

The statistical distance (visualization in the original/centred space).
x-deviations are penalized less than y-deviations, since the x-axis is characterized by a higher dispersion.
Hence Cyprus, which shows a higher y-deviation from the origin as compared to Italy, is characterized by a statistical distance from the origin which is higher than that of Italy.
Notice that Slovakia has a statistical distance from 0 which is now similar to that of Cyprus.
[Plot: centred scatter with ellipses joining the points having the same statistical distance from the origin.]

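TRANSFORMATION: STANDARDIZED VARIABLES
Original Data Matrix:
$$X = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{bmatrix}$$
Centroid = $\bar{x}$
Var/Cov Matrix: S
Corr Matrix: R
Standardized Data Matrix:
$$Z = \begin{bmatrix} \dfrac{x_{11} - \bar{x}_1}{\sqrt{s_{11}}} & \cdots & \dfrac{x_{1p} - \bar{x}_p}{\sqrt{s_{pp}}} \\ \vdots & & \vdots \\ \dfrac{x_{n1} - \bar{x}_1}{\sqrt{s_{11}}} & \cdots & \dfrac{x_{np} - \bar{x}_p}{\sqrt{s_{pp}}} \end{bmatrix}$$
Centroid = Origin = 0
Var/Cov Matrix: R
Corr Matrix: R
The standardized matrix is obtained by subtracting from each observation on a given variable the mean of the variable itself and by dividing this difference by the Std.Dev. Like the centred variables, the standardized variables have null means; in addition, their variances are all equal to 1 (the unit of measurement is removed). Since Variance = Std.Dev. = 1 for each variable, the covariances coincide with the correlations (Corr = Cov / product of the Std.Devs.).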

In the statistical distance, deviations are adjusted by taking into account the dispersions of the variables. But no attention is paid to the coherence between each point and the cloud of points (standardization does not involve correlations).
Slovakia and Cyprus are equally statistically distant from the origin.
Notice that Lithuania is more statistically distant from the origin.
Consider the orientation of the cloud: the line connecting Lithuania to 0 has the same direction as the cloud. This is less true for Slovakia. The line connecting Cyprus to the origin runs counter to the orientation of the cloud.
In the statistical distance, the coherence with the orientation of the cloud is not considered. A transformation of the data which removes the effect of the Std.Dev. and also penalizes deviations by considering the orientation of the cloud of points is the so-called Mahalanobis transformation. We do not enter into details here.
Mahalanobis transformation of the j-th variable:
$$z_{ij}^M = \text{Mahal}(x_{ij}) = M(x_{ij})$$
The Mahalanobis transformation is a particular linear combination of the considered variables.
The so-called Mahalanobis distance is defined as the Euclidean distance calculated on the Mahalanobis-transformed observations:
$$D_E(z_i^M, z_h^M) = \sqrt{(z_{i1}^M - z_{h1}^M)^2 + (z_{i2}^M - z_{h2}^M)^2 + \dots + (z_{ip}^M - z_{hp}^M)^2} = \sqrt{[M(x_{i1}) - M(x_{h1})]^2 + \dots + [M(x_{ip}) - M(x_{hp})]^2} = D_M(x_i, x_h)$$
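The slides do not spell out the transformation M. As a hedged illustration, the sketch below uses the standard textbook form of the Mahalanobis distance, D_M(x_i, x_h)^2 = (x_i - x_h)^T S^{-1} (x_i - x_h), which is an assumption with respect to this document. It is written for p = 2 in plain Python, on the 15-country subset (Internet_Acc and E_gov_avail), with illustrative names.

```python
import math

int_acc = [6, 19, 12, 40, 60, 56, 51, 73, 17, 34, 34, 67, 50, 60, 34]
e_gov = [25, 30, 40, 50, 56, 59, 67, 74, 32, 53, 55, 32, 35, 47, 50]
points = list(zip(int_acc, e_gov))
n, p = len(points), 2

mx, my = sum(int_acc) / n, sum(e_gov) / n
s_xx = sum((x - mx) ** 2 for x in int_acc) / (n - 1)
s_yy = sum((y - my) ** 2 for y in e_gov) / (n - 1)
s_xy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)

# Inverse of the 2 x 2 var/cov matrix S
det = s_xx * s_yy - s_xy ** 2
S_inv = [[s_yy / det, -s_xy / det],
         [-s_xy / det, s_xx / det]]

def mahalanobis(a, b, s_inv):
    """Square root of (a - b)^T S^{-1} (a - b) for two-dimensional points."""
    d0, d1 = a[0] - b[0], a[1] - b[1]
    q = (d0 * (s_inv[0][0] * d0 + s_inv[0][1] * d1)
         + d1 * (s_inv[1][0] * d0 + s_inv[1][1] * d1))
    return math.sqrt(q)

centroid = (mx, my)
dist_to_centroid = [mahalanobis(pt, centroid, S_inv) for pt in points]

# Known identity: the squared distances to the centroid sum to (n - 1) * p
print(sum(d ** 2 for d in dist_to_centroid), (n - 1) * p)
```

When the covariance is zero, S^{-1} is diagonal with entries 1/s_jj and the expression reduces to the statistical distance; with S equal to the identity it reduces to the Euclidean distance.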

TRANSFORMATION: MAHALANOBIS
The Mahalanobis distance is the Euclidean distance evaluated after transforming the data according to the Mahalanobis transformation.
Original Data Matrix:
$$X = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{bmatrix}$$
Centroid = $\bar{x}$
Var/Cov Matrix: S
Corr Matrix: R
Mahalanobis Data Matrix:
$$Z^M = \begin{bmatrix} M(x_{11}) & \cdots & M(x_{1p}) \\ \vdots & & \vdots \\ M(x_{n1}) & \cdots & M(x_{np}) \end{bmatrix}$$
Var/Cov Matrix: I
Corr Matrix: I
The variables transformed according to the Mahalanobis transformation have null means, variances all equal to 1 (the unit of measurement is removed), and null correlations (the orientation of the cloud is removed).
Mahalanobis Distance: deviations from the origin are adjusted by taking into account both the dispersions of the variables and their correlations (orientation).
Now Cyprus, which runs counter to the orientation of the cloud, is characterized by a Mahalanobis distance from 0 which is higher than that of Slovakia.
Notice that Lithuania has a Mahalanobis distance from 0 similar to that of Slovakia.
[Plot: centred scatter with ellipses joining the points having the same Mahalanobis distance from the origin.]
Multivariate samples: Transformations

                     ORIGINAL      CENTRED ON MEAN      STANDARDIZATION      MAHALANOBIS
Data matrix          X             X~                   Z                    Z^M
Means                xbar_j        0                    0                    0
Variances            s_jj          s_jj                 1                    1
Covariances          s_jk          s_jk                 r_jk                 0
Correlations         r_jk          r_jk                 r_jk                 0
Euclidean distance   Euclidean     Euclidean            Statistical          Mahalanobis

(Last row: the distance on the original data that the Euclidean distance on the transformed data coincides with.)

Conclusion: by transforming the data via standardization or via the Mahalanobis transformation, we are simply defining a new space such that the Euclidean distance calculated on the transformed points coincides respectively with:
- the Statistical distance (standardization): deviations are evaluated differently depending on their Std.Dev.;
- the Mahalanobis distance (Mahalanobis transformation): deviations are evaluated differently depending on the Std.Devs. and on the orientation of the cloud (correlations/covariances).
So far the latter transformation has not been explicitly defined, due to its analytical complexity, but we will see later how to obtain Mahalanobis-transformed data.
