
Multivariate Samples

Recall some very basic concepts of univariate and bivariate statistics
Describe multivariate samples
Analyze multivariate samples from a geometrical perspective
Describe distances in the Euclidean space

The data we will consider


Example 1. Innovation and Research in Europe (Source: Eurostat)

Variable            Description
Geo                 Country code
Country             Country name
Region              European region
Educ_Exp            Spending on human resources (total public expenditure on education), % of GDP
GERD                Gross domestic expenditure on R&D (GERD), % of GDP
GERD_industry       % of GERD financed by industry
GERD_govern         % of GERD financed by government
GERD_abroad         % of GERD financed from abroad
Internet_Acc        Level of Internet access: % of households with Internet access at home
ST_grad             Tertiary graduates in science and technology (S&T) per 1000 persons aged 20-29
ST_grad_f           Female tertiary graduates in S&T per 1000 females aged 20-29
ST_grad_m           Male tertiary graduates in S&T per 1000 males aged 20-29
EPO                 Number of patent applications to the European Patent Office per million inhabitants
USTPO               Number of patents granted by the US Patent and Trademark Office per million inhabitants
IT_Expenditure      Expenditure on information technology, % of GDP
Telec_Expenditure   Expenditure on telecommunications, % of GDP
Y_Educ_Lev          Youth education attainment level (total): % of the population aged 20-24 who completed at least upper secondary education
Y_Educ_Lev_f        % of females aged 20-24 who completed at least upper secondary education
Y_Educ__Lev_m       % of males aged 20-24 who completed at least upper secondary education
E_gov_avail         E-government online availability: online availability of 20 basic public services
HT_Exports          Exports of high-technology products as a share of total exports

Some basic concepts of Univariate and Bivariate statistics

Back to basics. Considering one variable

Let us consider one variable of interest, say EPO.

Country          Region     Internet_Acc      EPO
Romania          Eastern            6.00     1.31
Czech Republic   Eastern           19.00    12.04
Lithuania        Northern          12.00     2.78
Ireland          Northern          40.00    79.87
Norway           Northern          60.00   135.77
UK               Northern          56.00   124.19
Finland          Northern          51.00   309.09
Sweden           Northern          73.00   293.32
Greece           Southern          17.00     9.87
Italy            Southern          34.00    84.14
Spain            Southern          34.00    30.64
Netherlands      Western           67.00   246.15
Belgium          Western           50.00   141.80
Germany          Western           60.00   299.99
France           Western           34.00   144.52

In statistics, a commonly used measure of position is the arithmetic (sample) mean, obtained by summing all the observed values and dividing the result by the number of observations:

$$\overline{epo} = \frac{1}{15}\sum_{i=1}^{15} epo_i = 127.6987$$

[Figure: plot of the observed EPO values with the mean (127.6987) marked; Netherlands and Spain are labelled.]

Back to basics. Considering one variable

Variable of interest: EPO


The mean can be used to make a prediction
about EPO for a generic country without any
further information.
To evaluate the reliability of the mean as a
synthesis of the observed data, we can
consider for each observed value the error
incurred when substituting it with the
sample mean.
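In the plot: errors incurred when substituting the mean for the values observed for Netherlands and Spain, respectively.
The TOTAL SUM OF SQUARES is the sum of the squared errors:

$$\text{Total SS}_{EPO} = \sum_{i=1}^{15} (epo_i - \overline{epo})^2$$

[Figure: plot of the observed EPO values with the mean (127.6987) marked; the errors for Netherlands and Spain are highlighted.]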

Back to basics. Considering one variable


Variable of interest: EPO
A synthesis of the errors, and a measure of the reliability of the mean as a synthesis of the observed data, is the (sample) variance:

$$\text{Var}(EPO) = \frac{1}{14}\sum_{i=1}^{15} (epo_i - \overline{epo})^2 = \frac{1}{14}\sum_{i=1}^{15} (epo_i - 127.6987)^2$$

This is the average of the squared errors we incur when substituting the observed values with the sample mean. It is obtained by dividing the Total SS by the number of observations minus 1.
The variance of EPO turns out to be 12646.5814. Hence the error we can expect to incur for a generic observation is the square root of the variance, which is called the standard deviation:

$$\text{Std. Dev.}(EPO) = \sqrt{\text{Var}(EPO)} = \sqrt{12646.5814} = 112.457$$
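The quantities on these slides can be checked with a few lines of Python. The sketch below only assumes the 15 EPO values listed in the table above; the variable names are illustrative.

```python
# EPO values for the 15 countries, copied from the table in the text.
epo = [1.31, 12.04, 2.78, 79.87, 135.77, 124.19, 309.09, 293.32,
       9.87, 84.14, 30.64, 246.15, 141.80, 299.99, 144.52]

n = len(epo)
mean = sum(epo) / n                            # sample mean
total_ss = sum((x - mean) ** 2 for x in epo)   # total sum of squares
variance = total_ss / (n - 1)                  # sample variance (divisor n - 1)
std_dev = variance ** 0.5                      # standard deviation

print(f"mean={mean:.4f}  total_ss={total_ss:.4f}  "
      f"variance={variance:.4f}  std_dev={std_dev:.3f}")
```

The printed values can be compared with 127.6987, 177052.1395, 12646.5814 and 112.457 reported in the slides.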

Back to basics. Considering one variable

Variable of interest: EPO

In statistics we are mainly concerned with the explanation of variance, i.e., we are interested in explaining why a phenomenon varies, and we look for predictive tools characterized by low prediction errors.
So the question now is: can we do better than the mean? That is, can we use external information (other variables) related to EPO, and hence useful to predict the values of EPO with a lower error?
In the following we will consider two supporting variables having different characteristics:
The Region (a categorical variable)
Internet_Access (a numerical variable)
and we will show how to evaluate the extent to which an external variable provides information about the variable of interest.

Back to basics. Considering one variable


If we consider the region, can our prediction of EPO improve?

EPO values observed within the regions:
Eastern:  Romania 1.31, Czech Republic 12.04
Northern: Lithuania 2.78, Ireland 79.87, Norway 135.77, UK 124.19, Finland 309.09, Sweden 293.32
Southern: Greece 9.87, Italy 84.14, Spain 30.64
Western:  Netherlands 246.15, Belgium 141.80, Germany 299.99, France 144.52

[Figure: values observed within the regions, with the general mean (127.6987) marked; Netherlands and Spain are labelled.]

We can use the conditional means (the mean of EPO within each region) rather than the general one.
This is worthwhile only if the prediction error is considerably lower (it can be shown that, by construction, it is never higher).

Back to basics. Considering one variable

Consider the region to improve prediction on EPO


Use the conditional means.
To evaluate the reliability of the conditional means as syntheses of the observed EPO data, we can consider the squared difference between each value and the proper conditional mean.
In the plot: errors for Netherlands and Spain.
The WITHIN SUM OF SQUARES of EPO given Region is the sum of the squared errors incurred when using the conditional means (by region) to predict EPO.

[Figure: EPO values by region with the conditional means; the errors for Netherlands and Spain are highlighted.]

Back to basics. Considering one variable

Compare general mean / conditional means as predictors of EPO


General mean of EPO = 127.6987 for every country.

Country          Region        EPO   Sq. error (general mean)   Conditional mean   Sq. error (conditional mean)
Romania          Eastern      1.31                 15974.1035              6.675                        28.7832
Czech Republic   Eastern     12.04                 13376.9349              6.675                        28.7832
Lithuania        Northern     2.78                 15604.6816           157.5033                       23939.3
Ireland          Northern    79.87                  2287.5845           157.5033                       6026.929
Norway           Northern   135.77                    65.1459           157.5033                       472.3363
UK               Northern   124.19                    12.311            157.5033                       1109.776
Finland          Northern   309.09                 32902.8037           157.5033                       22978.53
Sweden           Northern   293.32                 27430.415            157.5033                       18446.18
Greece           Southern     9.87                 13883.6025              41.55                       1003.622
Italy            Southern    84.14                  1897.3603              41.55                       1813.908
Spain            Southern    30.64                  9420.3912              41.55                       119.0281
Netherlands      Western    246.15                 14030.7105            208.115                       1446.661
Belgium          Western    141.8                    198.8467            208.115                       4397.679
Germany          Western    299.99                 29684.2921            208.115                       8441.016
France           Western    144.52                   282.9561            208.115                       4044.324

TOTAL SS_EPO = 177052.1395          WITHIN SS_EPO|REGION = 94296.85

If we use the region, our improvement as compared to the general mean is

$$R^2_{EPO|REG} = 1 - \frac{\text{WITHIN SS}_{EPO|REG}}{\text{TOTAL SS}_{EPO}} = 0.467$$

This is the % of variance of EPO accounted for by Region.
The $R^2$ ranges from 0 to 1. It measures the ability of the categorical variable as a predictor of the numerical one.
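As a check on the table above, the conditional means, the within sum of squares and the resulting R² can be recomputed from the (region, EPO) pairs. This is a minimal sketch using only the data reported in the slides; the names `data`, `groups` and `cond_means` are illustrative.

```python
from collections import defaultdict

# (region, EPO) pairs in the order of the table in the text.
data = [("Eastern", 1.31), ("Eastern", 12.04),
        ("Northern", 2.78), ("Northern", 79.87), ("Northern", 135.77),
        ("Northern", 124.19), ("Northern", 309.09), ("Northern", 293.32),
        ("Southern", 9.87), ("Southern", 84.14), ("Southern", 30.64),
        ("Western", 246.15), ("Western", 141.80), ("Western", 299.99),
        ("Western", 144.52)]

values = [v for _, v in data]
general_mean = sum(values) / len(values)
total_ss = sum((v - general_mean) ** 2 for v in values)

# Conditional mean of EPO within each region.
groups = defaultdict(list)
for region, v in data:
    groups[region].append(v)
cond_means = {region: sum(vs) / len(vs) for region, vs in groups.items()}

# Within SS: squared errors when each value is predicted by its regional mean.
within_ss = sum((v - cond_means[region]) ** 2 for region, v in data)
r2 = 1 - within_ss / total_ss

print(cond_means)
print(f"within_ss={within_ss:.2f}  r2={r2:.3f}")
```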

Back to basics. Considering one variable


If we consider Internet_Access, can our prediction of EPO improve?

Country          Internet_Acc      EPO
Romania                  6.00     1.31
Czech Republic          19.00    12.04
Lithuania               12.00     2.78
Ireland                 40.00    79.87
Norway                  60.00   135.77
UK                      56.00   124.19
Finland                 51.00   309.09
Sweden                  73.00   293.32
Greece                  17.00     9.87
Italy                   34.00    84.14
Spain                   34.00    30.64
Netherlands             67.00   246.15
Belgium                 50.00   141.80
Germany                 60.00   299.99
France                  34.00   144.52

When considering numerical variables, we are interested in evaluating the existence of a linear association between them.

To evaluate whether a linear relationship exists and to determine its direction, we refer to the sample covariance (an absolute measure of linear association):

$$\text{Cov}(EPO, Int\_Acc) = \frac{1}{14}\sum_{i=1}^{15} (epo_i - \overline{epo})(int\_acc_i - \overline{int\_acc})$$

Back to basics. Considering one variable


If we consider Internet_Access, can our prediction of EPO improve?
The covariance between the two variables is:
Cov(EPO, Int_Acc) = 2001.98 (dividing by n - 1 = 14; the value 1868.5152 is obtained when dividing by n = 15)
This measure only indicates that a linear relationship exists and that it is direct (an inspection of the scatter plot confirms this). Moreover, the value of the covariance depends upon the units of measurement of the variables considered.
A relative measure of linear association is the correlation coefficient:

$$\text{Corr}(EPO, Int\_Acc) = \frac{\text{Cov}(EPO, Int\_Acc)}{\sqrt{\text{Var}(EPO)\,\text{Var}(Int\_Acc)}}$$

The correlation coefficient ranges from -1 to +1. Values close to +1 indicate a strong direct linear association, values close to -1 denote a strong inverse association. Values close to zero indicate no linear relationship.
Here we have Corr(EPO, Int_Acc) = 0.8527 (strong association).
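The covariance and the correlation can be recomputed from the two columns of the table. The sketch below uses only the slide data; it prints the covariance with both divisors, since with these 15 observations the divisor n - 1 gives about 2001.98 while the divisor n gives 1868.5152. The correlation is the same in either case, as long as the variances use the same divisor.

```python
# Internet_Acc and EPO for the 15 countries, in the order of the table.
int_acc = [6, 19, 12, 40, 60, 56, 51, 73, 17, 34, 34, 67, 50, 60, 34]
epo = [1.31, 12.04, 2.78, 79.87, 135.77, 124.19, 309.09, 293.32,
       9.87, 84.14, 30.64, 246.15, 141.80, 299.99, 144.52]

n = len(epo)
mx = sum(int_acc) / n
my = sum(epo) / n

# Sums of squared deviations and of cross-products of deviations.
sxx = sum((x - mx) ** 2 for x in int_acc)
syy = sum((y - my) ** 2 for y in epo)
sxy = sum((x - mx) * (y - my) for x, y in zip(int_acc, epo))

cov_n1 = sxy / (n - 1)            # sample covariance, divisor n - 1
cov_n = sxy / n                   # same cross-products, divisor n
corr = sxy / (sxx * syy) ** 0.5   # correlation (the divisor cancels out)

print(f"cov (n-1)={cov_n1:.4f}  cov (n)={cov_n:.4f}  corr={corr:.4f}")
```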

Back to basics. Considering one variable


If we consider Internet_Access, can our prediction of EPO improve?

EPO = -60.018 + 4.5934*Int_Acc

The high value of the correlation tells us that observations tend to cluster around a line having a positive slope. This line, shown in the scatterplot, is called the regression line.
Its analytical expression can be easily determined.

[Figure: scatterplot of EPO against Int_Acc with the regression line.]

Back to basics. Considering one variable

Consider Internet_Access to improve prediction on EPO


Use the regression line

EPO = -60.018 + 4.5934*Int_Acc

For each observation we can calculate the difference between the observed EPO value and the value predicted using the regression line.
In the plot the error is shown for Spain.
The MODEL SUM OF SQUARES of EPO given Int_Acc is the sum of the squared errors incurred when using the line to predict EPO (this quantity is more commonly called the residual, or error, sum of squares).

[Figure: scatterplot of EPO against Int_Acc with the regression line; the error for Spain is highlighted.]

Back to basics. Considering one variable

Compare general mean / regression line as predictors of EPO


General mean of EPO = 127.6987 for every country. Prediction using the line = 4.5934*Int_Acc - 60.018.

Country          Int_Acc      EPO   Sq. error (general mean)   Prediction (line)   Sq. error (line)
Romania             6.00     1.31                 15974.1035            -32.4576           1140.251
Czech Republic     19.00    12.04                 13376.9349             27.2566           231.5449
Lithuania          12.00     2.78                 15604.6816             -4.8972            58.9394
Ireland            40.00    79.87                  2287.5845            123.718            1922.647
Norway             60.00   135.77                    65.1459            215.586            6370.594
UK                 56.00   124.19                    12.311             197.2124           5332.271
Finland            51.00   309.09                 32902.8037            174.2454           18183.07
Sweden             73.00   293.32                 27430.415             275.3002           324.7132
Greece             17.00     9.87                 13883.6025             18.0698            67.2367
Italy              34.00    84.14                  1897.3603             96.1576           144.4227
Spain              34.00    30.64                  9420.3912             96.1576           4292.556
Netherlands        67.00   246.15                 14030.7105            247.7398             2.5275
Belgium            50.00   141.8                    198.8467            169.652            775.7339
Germany            60.00   299.99                 29684.2921            215.586            7124.035
France             34.00   144.52                   282.9561             96.1576           2338.922

TOTAL SS_EPO = 177052.1395          MODEL SS_EPO|Int_Acc = 48309.46

Notice that we have a considerable decrease in the prediction errors.

Back to basics. Considering one variable


If we consider Internet_Access, can our prediction of EPO improve?
If we use the line (a function of Int_Acc), our improvement as compared to the general mean is

$$R^2_{EPO|Int\_Acc} = 1 - \frac{\text{MODEL SS}_{EPO|Int\_Acc}}{\text{TOTAL SS}_{EPO}} = 0.7271$$

This is the % of variance of EPO accounted for by Int_Acc.

The $R^2$ index ranges from 0 to 1 and measures the ability of one numerical variable to predict the other.
It can be shown that the index coincides with the squared correlation coefficient.
Hence the correlation measures the extent of linear association, whereas its square measures the percentage of the variance of one variable which can be explained by the other (numerical) variable.
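The regression line, the sum of squared errors around it and the R² index can be reproduced from the data. The sketch below assumes the usual least-squares formulas (slope = cross-products over squared deviations of Int_Acc, intercept from the two means), which reproduce the coefficients shown in the slides, and also checks that R² equals the squared correlation.

```python
int_acc = [6, 19, 12, 40, 60, 56, 51, 73, 17, 34, 34, 67, 50, 60, 34]
epo = [1.31, 12.04, 2.78, 79.87, 135.77, 124.19, 309.09, 293.32,
       9.87, 84.14, 30.64, 246.15, 141.80, 299.99, 144.52]

n = len(epo)
mx = sum(int_acc) / n
my = sum(epo) / n
sxx = sum((x - mx) ** 2 for x in int_acc)
syy = sum((y - my) ** 2 for y in epo)          # total SS of EPO
sxy = sum((x - mx) * (y - my) for x, y in zip(int_acc, epo))

# Least-squares line: EPO = intercept + slope * Int_Acc
slope = sxy / sxx
intercept = my - slope * mx

# Squared errors incurred when predicting EPO with the line.
predictions = [intercept + slope * x for x in int_acc]
model_ss = sum((y - p) ** 2 for y, p in zip(epo, predictions))

r2 = 1 - model_ss / syy
corr = sxy / (sxx * syy) ** 0.5

print(f"EPO = {intercept:.3f} + {slope:.4f} * Int_Acc")
print(f"model_ss={model_ss:.2f}  r2={r2:.4f}  corr^2={corr ** 2:.4f}")
```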

Data Matrices
(Numerical variables only)


Data matrices
Example 1 (continued). Innovation and Research in Europe. For the sake of simplicity, we limit attention to a few variables and a few observations.

Country          Region     GERD   GERD_industry   GERD_govern   Internet_Acc   ST_grad      EPO   E_gov_avail
Romania          Eastern    0.39           47.60         43.00           6.00      5.80     1.31         25.00
Czech Republic   Eastern    1.20           52.50         43.60          19.00      6.00    12.04         30.00
Lithuania        Northern   0.67           37.10         56.30          12.00     14.60     2.78         40.00
Ireland          Northern   1.10           66.70         25.60          40.00     20.50    79.87         50.00
Norway           Northern   1.60           51.60         39.80          60.00      7.70   135.77         56.00
UK               Northern   1.83           45.60         28.80          56.00     20.30   124.19         59.00
Finland          Northern   3.30           70.80         25.50          51.00     17.40   309.09         67.00
Sweden           Northern   4.25           71.50         21.30          73.00     13.30   293.32         74.00
Greece           Southern   0.64           33.00         46.60          17.00      8.00     9.87         32.00
Italy            Southern   1.09           47.20         46.80          34.00      7.40    84.14         53.00
Spain            Southern   0.91           47.20         39.90          34.00     11.90    30.64         55.00
Netherlands      Western    1.80           51.90         35.80          67.00      6.60   246.15         32.00
Belgium          Western    2.08           63.40         22.00          50.00     10.50   141.80         35.00
Germany          Western    2.46           65.70         31.40          60.00      8.10   299.99         47.00
France           Western    2.20           54.20         36.90          34.00     19.50   144.52         50.00

The country variable is useful to identify the statistical units, but it is not an object of analysis.
At the moment we consider only numerical variables.
The data matrix contains the information available for the n cases (rows) on the p variables (columns).

For each observation we have information collected on p variables.
For each variable we have information collected on n observations.
Here we have 15 rows (cases, n) and 7 columns (variables, p).

Data matrices
Example 1 (continued). Innovation and Research in Europe. (subset)

(Same data matrix as in the previous slide: 15 countries by 7 numerical variables, GERD, GERD_industry, GERD_govern, Internet_Acc, ST_grad, EPO, E_gov_avail.)

To each observation a collection of p values is associated. These values are the realizations observed for each variable on the observation considered.
Similarly, to each variable a collection of n values can be associated (the values observed for all the cases).

A collection of k values is usually called a vector. To avoid confusion, we will only consider column vectors, with dimension (k × 1), i.e., a collection of values arranged in k rows and 1 column.
A row (1 × k) vector can always be seen as the transpose of a column (k × 1) vector.

Data matrices
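Data matrix (n individuals and p variables)

$$X = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{bmatrix} = \begin{bmatrix} \mathbf{x}_1^T \\ \mathbf{x}_2^T \\ \vdots \\ \mathbf{x}_n^T \end{bmatrix} = \begin{bmatrix} \mathbf{x}_{(1)} & \mathbf{x}_{(2)} & \cdots & \mathbf{x}_{(p)} \end{bmatrix}$$

A data matrix can be seen as a collection of n row (transposed) vectors (cases) and/or as a collection of p column vectors (variables).

$\mathbf{x}_i$ = vector (p × 1) containing the measurements on the p variables for the i-th case:

$$\mathbf{x}_i = \begin{bmatrix} x_{i1} \\ x_{i2} \\ \vdots \\ x_{ip} \end{bmatrix}, \qquad \mathbf{x}_i^T = \begin{bmatrix} x_{i1} & x_{i2} & \cdots & x_{ip} \end{bmatrix} \quad \text{(transposition operation)}$$

$\mathbf{x}_{(j)}$ = vector (n × 1) containing the n measurements on the j-th variable.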

Data matrices
Example 1 (continued). Innovation and Research in Europe. (subset)

(Same data matrix as before: 15 countries by 7 numerical variables, GERD, GERD_industry, GERD_govern, Internet_Acc, ST_grad, EPO, E_gov_avail.)

Column vector associated with EPO (measurements on 15 observations): $\mathbf{x}_{(6)}$

The element in the i-th row and in the j-th column, $x_{ij}$, is the value observed for the i-th case on the j-th variable.
In this simple example, $x_{13,6} = 141.80$ is the value of EPO (6th variable) for Belgium (13th observation).

Row vector associated with Belgium (measurements on 7 variables): $\mathbf{x}_{13}^T$
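The row/column reading of the data matrix can be illustrated with a plain list of lists. This is only a sketch: the values are copied from the 15 × 7 table of Example 1, indices in Python start from 0 (so the 13th case is index 12 and the 6th variable is index 5), and the names `X`, `x_col_6` and `x_row_13` are illustrative.

```python
countries = ["Romania", "Czech Republic", "Lithuania", "Ireland", "Norway",
             "UK", "Finland", "Sweden", "Greece", "Italy", "Spain",
             "Netherlands", "Belgium", "Germany", "France"]
variables = ["GERD", "GERD_industry", "GERD_govern", "Internet_Acc",
             "ST_grad", "EPO", "E_gov_avail"]

# Data matrix X: one row per country (case), one column per variable.
X = [
    [0.39, 47.60, 43.00,  6.00,  5.80,   1.31, 25.00],
    [1.20, 52.50, 43.60, 19.00,  6.00,  12.04, 30.00],
    [0.67, 37.10, 56.30, 12.00, 14.60,   2.78, 40.00],
    [1.10, 66.70, 25.60, 40.00, 20.50,  79.87, 50.00],
    [1.60, 51.60, 39.80, 60.00,  7.70, 135.77, 56.00],
    [1.83, 45.60, 28.80, 56.00, 20.30, 124.19, 59.00],
    [3.30, 70.80, 25.50, 51.00, 17.40, 309.09, 67.00],
    [4.25, 71.50, 21.30, 73.00, 13.30, 293.32, 74.00],
    [0.64, 33.00, 46.60, 17.00,  8.00,   9.87, 32.00],
    [1.09, 47.20, 46.80, 34.00,  7.40,  84.14, 53.00],
    [0.91, 47.20, 39.90, 34.00, 11.90,  30.64, 55.00],
    [1.80, 51.90, 35.80, 67.00,  6.60, 246.15, 32.00],
    [2.08, 63.40, 22.00, 50.00, 10.50, 141.80, 35.00],
    [2.46, 65.70, 31.40, 60.00,  8.10, 299.99, 47.00],
    [2.20, 54.20, 36.90, 34.00, 19.50, 144.52, 50.00],
]

n, p = len(X), len(X[0])            # 15 cases, 7 variables
x_row_13 = X[12]                    # row vector for the 13th case (Belgium)
x_col_6 = [row[5] for row in X]     # column vector for the 6th variable (EPO)
x_13_6 = X[12][5]                   # element x_{13,6}

print(n, p, countries[12], variables[5], x_13_6)
```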

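Data matrices: Vectors

A (K × 1) vector can be seen as an oriented line in a K-dimensional space.

A one-dimensional vector (scalar): $\mathbf{v} = [v_1]$

A two-dimensional vector: $\mathbf{v} = \begin{bmatrix} v_1 \\ v_2 \end{bmatrix}$

A three-dimensional vector: $\mathbf{v} = \begin{bmatrix} v_1 \\ v_2 \\ v_3 \end{bmatrix}$

[Figure: the three vectors drawn as oriented segments in 1, 2 and 3 dimensions, with axes $v_1$, $v_2$, $v_3$.]

Vectors of higher dimension cannot be represented in this way.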

Data matrices: Vectors (length)

For a given vector in the K-dimensional space, we define its length as:

$$\|\mathbf{v}\| = \sqrt{v_1^2 + v_2^2 + \dots + v_K^2}$$

It is the length of the line connecting $\mathbf{v}$ to the origin, $\mathbf{0}$.

[Figure: length of a vector in 1, 2 and 3 dimensions, with axes $v_1$, $v_2$, $v_3$.]

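Data matrices: Vectors (Distance)

Given two vectors, $\mathbf{v}$ and $\mathbf{u}$, in the k-dimensional space, we define the Euclidean distance between $\mathbf{v}$ and $\mathbf{u}$ as the length of the line connecting $\mathbf{v}$ to $\mathbf{u}$:

$$D_E(\mathbf{v}, \mathbf{u}) = \|\mathbf{v} - \mathbf{u}\| = \sqrt{(v_1 - u_1)^2 + (v_2 - u_2)^2 + \dots + (v_k - u_k)^2}$$

Note: the length of a vector $\mathbf{v}$ coincides with its distance from the origin, $\mathbf{0}$:

$$\|\mathbf{v}\| = \sqrt{v_1^2 + v_2^2 + \dots + v_k^2} = D_E(\mathbf{v}, \mathbf{0})$$

[Figure: example in the two-dimensional space; the segment connecting $\mathbf{v}$ and $\mathbf{u}$ is the hypotenuse of a right triangle with sides $|v_1 - u_1|$ and $|v_2 - u_2|$.]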
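The two definitions translate directly into code. The sketch below uses two hypothetical vectors (they are not taken from the data) and works for any dimension k; the function names are illustrative.

```python
import math

def length(v):
    """Length (norm) of a vector: square root of the sum of squared elements."""
    return math.sqrt(sum(c * c for c in v))

def euclidean_distance(v, u):
    """Euclidean distance: length of the difference vector v - u."""
    return length([a - b for a, b in zip(v, u)])

# Hypothetical two-dimensional vectors, chosen only for illustration.
v = [3.0, 4.0]
u = [0.0, 1.0]
origin = [0.0, 0.0]

print(length(v))                      # length of v
print(euclidean_distance(v, origin))  # same value: distance of v from the origin
print(euclidean_distance(v, u))       # distance between v and u
```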

Analyze multivariate samples from a geometrical perspective
Describe distances in the Euclidean space

Data matrices
A data matrix can be seen as a collection of two kinds of vectors:
Row vectors $\mathbf{x}_i$, which lie in the p-dimensional space
Column vectors $\mathbf{x}_{(j)}$, which lie in the n-dimensional space

Hence two spaces can be considered to analyze/describe a data matrix.
Of course, these spaces are related to each other.
For the sake of simplicity, we will analyze in depth only the space of the observations.

Syntheses of variables

$$X = \begin{bmatrix} \mathbf{x}_{(1)} & \mathbf{x}_{(2)} & \cdots & \mathbf{x}_{(p)} \end{bmatrix}$$

How to arrange syntheses of the p variables, i.e., how to synthesize the elements of the column vectors?

The position. The sample mean (unbiased estimator of the population mean) for the j-th variable (column) is:

$$\bar{x}_j = \frac{1}{n}\sum_{i=1}^{n} x_{ij}$$

Vector of the sample means (centroid):

$$\bar{\mathbf{x}} = \begin{bmatrix} \bar{x}_1 \\ \vdots \\ \bar{x}_p \end{bmatrix}$$

It may be seen as the vector associated with an artificial "mean" case: an unobserved case which is average with respect to all the variables.
Remember: the mean is not robust (it is sensitive to extreme values).

The space of the observations


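Consider a graphical representation we are used to: the 2-dimensional space.

The centroid (the vector whose elements are the sample means) is the centre of gravity of the cloud.
It is the point which is globally least distant from all the points (it minimizes the sum of the squared distances).

[Figure: scatterplot of E_gov_indiv against Internet_Acc, with the mean of E_gov_indiv and the mean of Internet_Acc marked; their intersection is the centroid. Note: axes adjusted to have the same scale.]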

Synthesis of variables
The dispersion around the mean.
The sample variance (unbiased estimator of the population variance) for the j-th variable (column) is:

$$s_{jj} = \frac{1}{n-1}\sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2$$

It is the average of the squared errors we incur when substituting the observed values with the sample mean.

Notice that it is the average of the squared distances between the observed values and the sample mean.
The sample standard deviation for the j-th variable (column) is

$$\sqrt{s_{jj}}$$

The Std. Dev. has the same unit of measurement as the variable taken into account. It measures the expected error (below or above the mean) we incur when substituting the mean for a generic case.
Moreover, it can be considered as the average distance between a generic value and the mean. It is the expected distance from the mean.
Being based upon averages, both the variance and the standard deviation are not robust (they are sensitive to extreme values).

The space of the observations


Consider again the 2-dimensional space.
Let us consider the distance from Iceland (IS) to the centroid:

$$D_E(IS, \bar{\mathbf{x}}) = \sqrt{(x_{IS,egov\_ind} - \bar{x}_{egov\_ind})^2 + (x_{IS,int\_acc} - \bar{x}_{int\_acc})^2}$$

The first term is based on the absolute difference between the Iceland E_gov_Indiv value and the mean of E_gov_Indiv; the second on the absolute difference between the Iceland Internet_Acc value and the mean of Internet_Acc.

[Figure: scatterplot with the segment connecting Iceland to the centroid. Note: axes adjusted to have the same scale.]

The space of the observations


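Consider, in the 2-dimensional space, ALL THE DISTANCES FROM POINTS TO THE CENTROID.

$$s_{jj} = \frac{1}{n-1}\sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2$$

Var(E_gov_indiv) + Var(Internet_Acc), i.e., the SUM of the variances of THE TWO VARIABLES, is proportional to the sum of the squared distances from the observations to the centroid:

$$s_{11} + s_{22} = \frac{1}{n-1}\sum_{i=1}^{n} D_E(\mathbf{x}_i, \bar{\mathbf{x}})^2$$

[Figure: scatterplot with the segments connecting each point to the centroid. Note: axes adjusted to have the same scale.]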

Synthesis of association between vars


The linear association.
The sample covariance for the j-th and the h-th variables (columns) is

$$s_{jh} = \frac{1}{n-1}\sum_{i=1}^{n} (x_{ij} - \bar{x}_j)(x_{ih} - \bar{x}_h)$$

(absolute measure of linear association).

The sample correlation coefficient for the j-th and the h-th variables is

$$r_{jh} = \frac{s_{jh}}{\sqrt{s_{jj}\, s_{hh}}}$$

(relative measure of linear association; it ranges from -1 to +1).

Remember: being based upon averages, the correlation coefficient is not robust (it is sensitive to extreme values).

The space of the observations


Consider again the 2-dimensional space.
Since the covariance and the correlation actually measure the concentration of points around a line, both indices give us information about the ORIENTATION of the scatter.

[Figure: scatterplot showing the orientation of the cloud of points. Note: axes adjusted to have the same scale.]

Variance and Covariance Matrix


Variances and covariances are arranged in the so-called variance and covariance matrix:

$$S = \begin{bmatrix} s_{11} & s_{12} & \cdots & s_{1p} \\ s_{21} & s_{22} & \cdots & s_{2p} \\ \vdots & \vdots & & \vdots \\ s_{p1} & s_{p2} & \cdots & s_{pp} \end{bmatrix}$$

S is a square matrix (the number of rows equals the number of columns).
The diagonal elements of S, $s_{jj}$, are the variances (notice that the variance can be regarded as the covariance between one variable and itself).
The extra-diagonal elements of S, $s_{jh}$, are the covariances.
Since $s_{jh} = s_{hj}$, S is a symmetric matrix.

Correlation Matrix
Correlations are arranged in the correlation matrix:

$$R = \begin{bmatrix} 1 & r_{12} & \cdots & r_{1p} \\ r_{21} & 1 & \cdots & r_{2p} \\ \vdots & \vdots & & \vdots \\ r_{p1} & r_{p2} & \cdots & 1 \end{bmatrix}$$

R is also a square matrix, and its diagonal elements are 1s (the correlation between one variable and itself is 1).
Its extra-diagonal elements, $r_{jh}$, are the correlations and, of course, R is a symmetric matrix.
Due to the relationship between covariances and correlations,

$$r_{jh} = \frac{s_{jh}}{\sqrt{s_{jj}\, s_{hh}}}$$

R can be simply obtained from the variance and covariance matrix.
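To make the construction concrete, the sketch below builds the centroid, S and R for three of the variables of Example 1 (GERD, Internet_Acc, EPO), and obtains R from S with the relationship above. The choice of these three columns and all names are illustrative assumptions; only the data come from the slides.

```python
# Three columns of the Example 1 data matrix, in the order of the table.
columns = {
    "GERD": [0.39, 1.20, 0.67, 1.10, 1.60, 1.83, 3.30, 4.25,
             0.64, 1.09, 0.91, 1.80, 2.08, 2.46, 2.20],
    "Internet_Acc": [6, 19, 12, 40, 60, 56, 51, 73, 17, 34, 34, 67, 50, 60, 34],
    "EPO": [1.31, 12.04, 2.78, 79.87, 135.77, 124.19, 309.09, 293.32,
            9.87, 84.14, 30.64, 246.15, 141.80, 299.99, 144.52],
}
names = list(columns)
n = len(columns["EPO"])
p = len(names)

# Centroid: vector of the sample means.
centroid = [sum(columns[name]) / n for name in names]

def covariance(j, h):
    """Sample covariance s_jh between the j-th and h-th variables."""
    xj, xh = columns[names[j]], columns[names[h]]
    return sum((a - centroid[j]) * (b - centroid[h])
               for a, b in zip(xj, xh)) / (n - 1)

# Variance and covariance matrix S (variances on the diagonal).
S = [[covariance(j, h) for h in range(p)] for j in range(p)]

# Correlation matrix R obtained from S: r_jh = s_jh / sqrt(s_jj * s_hh).
R = [[S[j][h] / (S[j][j] * S[h][h]) ** 0.5 for h in range(p)] for j in range(p)]

for row in R:
    print([round(r, 4) for r in row])
```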

The space of the observations


The centroid (the vector whose elements are the sample means) is the centre of gravity of the p-dimensional cloud.
The elements of the variance and covariance matrix give us information about the dispersion around the centroid (remember the 2-dimensional example) and about the orientation of the cloud.

Measuring dispersion

How to synthesize the dispersion of the n cases in the p-dimensional space? Two proposals.
TOTAL VARIANCE
As we saw before, the sum of all the variances is proportional to the sum of the squared distances from the points to the centroid. Thus, a first method to evaluate the dispersion of the points in the p-dimensional space is the so-called Total Variance:

$$\text{Total Variance} = \sum_{j=1}^{p} s_{jj}$$

Notice that we are not taking into account the interrelationships between variables, i.e., the orientation of the cloud.
The Total Variance is the sum of the diagonal elements of the var/cov matrix, S. The sum of the diagonal elements of a square matrix is defined to be its trace. Hence, we have:

$$\text{Total Variance} = \sum_{j=1}^{p} s_{jj} = \text{Trace}(S) = \text{tr}(S)$$

The space of the observations

To motivate the second measure of multivariate dispersion, consider the portion of the space which is occupied by data (the area of the ellipse). We will come back to this concept later, but we can intuitively understand that the area of the ellipse (in a higher-dimensional space, the volume of an ellipsoid) is somehow related to the variances and to the covariances, i.e., to all the entries of the var/cov matrix, S.

[Figure: scatterplot with an ellipse enclosing the cloud of points.]

Measuring dispersion
THE GENERALIZED VARIANCE
The volume of the ellipsoid containing the points in the p-dimensional space can be shown to be related to a particular synthesis of the elements of S, the so-called determinant of S, |S|.
The determinant is a number which can be calculated for a square matrix. It equals zero if two columns of the matrix are proportional, i.e., if they carry the same information.
This measure is called the Generalized Variance:

Generalized Variance = det(S) = |S|

Hence, to synthesize the dispersion of points in a p-dimensional space, two measures can be used, both related to the elements of the variance and covariance matrix, S.
The Total Variance takes into account only the diagonal elements of S, whilst the Generalized Variance is calculated by referring to all the elements of S.
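For p = 2 both measures are easy to compute by hand: the trace is s11 + s22 and the determinant is s11·s22 − s12². The sketch below does this for Internet_Acc and EPO from Example 1 (an illustrative choice of variables), and also shows that the determinant vanishes when one column is proportional to the other, using a hypothetical second variable equal to twice Internet_Acc.

```python
int_acc = [6, 19, 12, 40, 60, 56, 51, 73, 17, 34, 34, 67, 50, 60, 34]
epo = [1.31, 12.04, 2.78, 79.87, 135.77, 124.19, 309.09, 293.32,
       9.87, 84.14, 30.64, 246.15, 141.80, 299.99, 144.52]

def cov(x, y):
    """Sample covariance (divisor n - 1); cov(x, x) is the variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)

def total_and_generalized_variance(x, y):
    s11, s22, s12 = cov(x, x), cov(y, y), cov(x, y)
    total = s11 + s22                  # trace of S
    generalized = s11 * s22 - s12 ** 2 # determinant of the 2 x 2 matrix S
    return total, generalized

total_var, gen_var = total_and_generalized_variance(int_acc, epo)
print(f"total variance={total_var:.2f}  generalized variance={gen_var:.2f}")

# Hypothetical variable proportional to Internet_Acc: it adds no information,
# so the determinant of the corresponding S is zero.
doubled = [2 * x for x in int_acc]
_, gen_var_prop = total_and_generalized_variance(int_acc, doubled)
print(f"generalized variance with proportional columns={gen_var_prop:.6f}")
```

The generalized variance can also be written as s11·s22·(1 − r²), which makes explicit that, for given variances, it shrinks as the correlation grows in absolute value.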

The space of the observations


The variance and covariance matrix contains relevant information to describe the points in a p-dimensional space and, also, information about their distances. We now consider different measures of distance between cases in the p-dimensional space, related to particular transformations of the original variables.
Notice first that if the variables are centred on their means, nothing changes as concerns the dispersion of the points. This operation only consists in a change of the origin.

[Figure: scatterplot of the centred variables; the centroid coincides with the origin.]

Multivariate Samples - Transformations


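TRANSFORMATION: VARIABLES CENTRED ON THEIR MEANS

Original Data Matrix

$$X = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{bmatrix}$$

Centroid = $\bar{\mathbf{x}}$
Var/Cov Matrix: S
Corr Matrix: R

Centred Data Matrix

$$\tilde{X} = \begin{bmatrix} (x_{11} - \bar{x}_1) & \cdots & (x_{1p} - \bar{x}_p) \\ \vdots & & \vdots \\ (x_{n1} - \bar{x}_1) & \cdots & (x_{np} - \bar{x}_p) \end{bmatrix}$$

Centroid = Origin = 0
Var/Cov Matrix: S
Corr Matrix: R

The centred matrix is obtained by subtracting from each observation on a given variable the mean of the variable itself. This means that the mean of the j-th variable is subtracted from all the observations in the j-th column.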

A closer look at the distance


The Euclidean distance of a point from the origin is the length of the line connecting the point to the origin.
Consider, in the plot of the centred variables, Cyprus and Italy: their distance from the origin, 0, is (almost) the same.
Notice that the distance of Slovakia from the origin is higher. We will consider this later.

This similar distance is due to different combinations of x- and y-deviations from 0. Should the x- and y-deviations be evaluated in the same manner?

[Figure: scatterplot of the centred variables with Cyprus, Italy and Slovakia labelled.]

A closer look at the distance

Remember: the standard deviation of a variable is the typical deviation from the mean. Here Std.Dev.(E_gov_Avail) = 15 and Std.Dev.(Int_Acc) = 21.31.
To compare the deviations from the origin adequately (data are centred), we should take into account the Std.Dev. (of course, squared deviations should be compared with variances).
Internet_Acc has a higher Std.Dev. Hence, a deviation D from the origin along the horizontal axis should count less than a deviation D from the origin along the vertical axis.

[Figure: scatterplot of the centred variables, with Internet_Acc on the horizontal axis and E_gov_Avail on the vertical axis.]

A closer look at the distance

In the Euclidean distance, the deviations are considered in absolute terms. When we are considering variables having different Std.Dev., we should consider relative deviations. To remove the effect of the Std.Dev., thus obtaining comparable deviations, we have to standardize the variables.

Standardization of the j-th variable:

$$z_{ij} = \frac{x_{ij} - \bar{x}_j}{\sqrt{s_{jj}}}$$

The Euclidean Distance between two standardized observations is:

$$D_E(\mathbf{z}_i, \mathbf{z}_h) = \sqrt{(z_{i1} - z_{h1})^2 + (z_{i2} - z_{h2})^2 + \dots + (z_{ip} - z_{hp})^2}$$

$$= \sqrt{\frac{(x_{i1} - x_{h1})^2}{s_{11}} + \frac{(x_{i2} - x_{h2})^2}{s_{22}} + \dots + \frac{(x_{ip} - x_{hp})^2}{s_{pp}}} = D_S(\mathbf{x}_i, \mathbf{x}_h)$$

Statistical Distance: a different weight ($1/s_{jj}$) is assigned to the squared deviation of each variable in the calculation of the distance. The statistical distance is proportional to the Euclidean one only if the variances are all equal.
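The equality between the two expressions above can be verified numerically. The sketch below uses Internet_Acc and EPO from Example 1 and compares Romania with Sweden; the pair of countries and the variable names are arbitrary illustrative choices.

```python
import math

int_acc = [6, 19, 12, 40, 60, 56, 51, 73, 17, 34, 34, 67, 50, 60, 34]
epo = [1.31, 12.04, 2.78, 79.87, 135.77, 124.19, 309.09, 293.32,
       9.87, 84.14, 30.64, 246.15, 141.80, 299.99, 144.52]

def mean(x):
    return sum(x) / len(x)

def variance(x):
    m = mean(x)
    return sum((a - m) ** 2 for a in x) / (len(x) - 1)

def standardize(x):
    """z = (x - mean) / standard deviation."""
    m, sd = mean(x), math.sqrt(variance(x))
    return [(a - m) / sd for a in x]

i, h = 0, 7   # positions of Romania and Sweden in the lists

# Statistical distance on the original values: squared deviations weighted by 1/s_jj.
d_stat = math.sqrt((int_acc[i] - int_acc[h]) ** 2 / variance(int_acc)
                   + (epo[i] - epo[h]) ** 2 / variance(epo))

# Euclidean distance on the standardized values.
z1, z2 = standardize(int_acc), standardize(epo)
d_eucl_z = math.sqrt((z1[i] - z1[h]) ** 2 + (z2[i] - z2[h]) ** 2)

# Plain Euclidean distance on the original values, for comparison.
d_eucl = math.sqrt((int_acc[i] - int_acc[h]) ** 2 + (epo[i] - epo[h]) ** 2)

print(f"statistical={d_stat:.4f}  euclidean on z={d_eucl_z:.4f}  euclidean on x={d_eucl:.4f}")
```

In the plain Euclidean distance the EPO deviation dominates simply because EPO has a much larger variance; the statistical distance removes this effect.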

A closer look at the distance


The statistical distance (visualization in the original/centred space).
x-deviations are penalized less than y-deviations, since the x-axis is characterized by a higher dispersion.
Hence Cyprus, which shows a higher y-deviation from the origin than Italy, is characterized by a statistical distance from the origin which is higher than that of Italy.
Notice that Slovakia has a statistical distance from 0 which is now similar to that of Cyprus.

[Figure: scatterplot of the centred variables with an ellipse of points having the same statistical distance from the origin.]

Multivariate Samples - Transformations


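TRANSFORMATION: STANDARDIZED VARIABLES

Original Data Matrix

$$X = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{bmatrix}$$

Centroid = $\bar{\mathbf{x}}$
Var/Cov Matrix: S
Corr Matrix: R

Standardized Data Matrix

$$Z = \begin{bmatrix} \dfrac{x_{11} - \bar{x}_1}{\sqrt{s_{11}}} & \cdots & \dfrac{x_{1p} - \bar{x}_p}{\sqrt{s_{pp}}} \\ \vdots & & \vdots \\ \dfrac{x_{n1} - \bar{x}_1}{\sqrt{s_{11}}} & \cdots & \dfrac{x_{np} - \bar{x}_p}{\sqrt{s_{pp}}} \end{bmatrix}$$

Centroid = Origin = 0
Var/Cov Matrix: R
Corr Matrix: R

The standardized matrix is obtained by subtracting from each observation on a given variable the mean of the variable itself and by dividing this difference by the Std.Dev. The centred variables have null mean; the standardized variables also have variances all equal to 1 (the unit of measurement is removed). Since Variance = Std.Dev. = 1 for each variable, the covariances coincide with the correlations (Corr = Cov / Product of Std.Devs).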
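The effect of the two transformations on means, variances and covariances can be checked on two variables of Example 1. This is a minimal sketch using Internet_Acc and EPO (an illustrative choice): centring leaves the covariance unchanged, while after standardization the covariance equals the correlation of the original variables.

```python
int_acc = [6, 19, 12, 40, 60, 56, 51, 73, 17, 34, 34, 67, 50, 60, 34]
epo = [1.31, 12.04, 2.78, 79.87, 135.77, 124.19, 309.09, 293.32,
       9.87, 84.14, 30.64, 246.15, 141.80, 299.99, 144.52]

def mean(x):
    return sum(x) / len(x)

def cov(x, y):
    """Sample covariance (divisor n - 1); cov(x, x) is the variance."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)

def centre(x):
    m = mean(x)
    return [a - m for a in x]

def standardize(x):
    sd = cov(x, x) ** 0.5
    return [a / sd for a in centre(x)]

# Original variables.
s12 = cov(int_acc, epo)
r12 = s12 / (cov(int_acc, int_acc) * cov(epo, epo)) ** 0.5

# Centred variables: means become 0, variances and covariances do not change.
c1, c2 = centre(int_acc), centre(epo)
s12_centred = cov(c1, c2)

# Standardized variables: means 0, variances 1, covariance equal to r12.
z1, z2 = standardize(int_acc), standardize(epo)
s12_standardized = cov(z1, z2)

print(f"cov original={s12:.4f}  cov centred={s12_centred:.4f}")
print(f"corr original={r12:.4f}  cov standardized={s12_standardized:.4f}")
```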

A closer look at the distance


In the statistical distance, deviations are adjusted by taking into account the dispersions of the variables. But no attention is paid to the coherence between each point and the cloud of points (standardization does not involve correlations).
Slovakia and Cyprus are equally statistically distant from the origin.
Notice that Lithuania is more statistically distant from the origin.
Consider the orientation of the cloud: the line connecting Lithuania to 0 has the same direction as the cloud. This is less true for Slovakia. The line connecting Cyprus to the origin runs against the tendency of the cloud.

[Figure: scatterplot of the centred variables with Lithuania, Slovakia and Cyprus labelled.]

A closer look at the distance

In the statistical distance, the coherence with the orientation of the cloud is not considered. A transformation of data which removes the effect of the Std.Dev., and also penalizes deviations by considering the orientation of the cloud of points, is the so-called Mahalanobis transformation. We do not enter into details here.

Mahalanobis transformation of the j-th variable:

$$z^M_{ij} = \text{Mahal}(x_{ij}) = M(x_{ij})$$

The Mahalanobis transformation is a particular linear combination of the considered variables.

The so-called Mahalanobis distance is defined as the Euclidean distance calculated on Mahalanobis-transformed observations:

$$D_E(\mathbf{z}^M_i, \mathbf{z}^M_h) = \sqrt{(z^M_{i1} - z^M_{h1})^2 + (z^M_{i2} - z^M_{h2})^2 + \dots + (z^M_{ip} - z^M_{hp})^2}$$

$$= \sqrt{[M(x_{i1}) - M(x_{h1})]^2 + \dots + [M(x_{ip}) - M(x_{hp})]^2} = D_M(\mathbf{x}_i, \mathbf{x}_h)$$

Multivariate Samples - Transformations


TRANSFORMATION: MAHALANOBIS
The Mahalanobis distance is the Euclidean distance evaluated after transforming the data according to the Mahalanobis transformation.

Original Data Matrix

$$X = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{bmatrix}$$

Centroid = $\bar{\mathbf{x}}$
Var/Cov Matrix: S
Corr Matrix: R

Mahalanobis Data Matrix

$$Z^M = \begin{bmatrix} M(x_{11}) & \cdots & M(x_{1p}) \\ \vdots & & \vdots \\ M(x_{n1}) & \cdots & M(x_{np}) \end{bmatrix}$$

Centroid = Origin = 0
Var/Cov Matrix: I
Corr Matrix: I

The variables transformed according to the Mahalanobis transformation have null means, variances all equal to 1 (the unit of measurement is removed), and null correlations (the orientation of the cloud is removed).
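The slides do not define the transformation M explicitly, so the sketch below should be read as one possible choice, not as the definition intended here. It assumes a Cholesky factorization S = L Lᵀ and transforms each centred observation as z = L⁻¹(x − x̄); any transformation that yields an identity var/cov matrix gives the same distances. For two variables (Internet_Acc and EPO from Example 1, an illustrative choice) everything can be written out by hand, and the result is compared with the standard closed form sqrt(dᵀ S⁻¹ d) of the Mahalanobis distance.

```python
import math

int_acc = [6, 19, 12, 40, 60, 56, 51, 73, 17, 34, 34, 67, 50, 60, 34]
epo = [1.31, 12.04, 2.78, 79.87, 135.77, 124.19, 309.09, 293.32,
       9.87, 84.14, 30.64, 246.15, 141.80, 299.99, 144.52]
n = len(epo)

def cov(x, y):
    """Sample covariance (divisor n - 1); cov(x, x) is the variance."""
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)

s11, s22, s12 = cov(int_acc, int_acc), cov(epo, epo), cov(int_acc, epo)

# Cholesky factor of the 2 x 2 matrix S (assumed choice of transformation).
l11 = math.sqrt(s11)
l21 = s12 / l11
l22 = math.sqrt(s22 - l21 ** 2)

# Transformed data: z = L^{-1} (x - mean), solved by forward substitution.
m1, m2 = sum(int_acc) / n, sum(epo) / n
z1 = [(a - m1) / l11 for a in int_acc]
z2 = [((b - m2) - l21 * u) / l22 for b, u in zip(epo, z1)]

# Euclidean distance between two transformed observations (Romania, Sweden).
i, h = 0, 7
d_on_z = math.sqrt((z1[i] - z1[h]) ** 2 + (z2[i] - z2[h]) ** 2)

# Closed form on the original data: sqrt(d' S^{-1} d) with the 2 x 2 inverse.
d1, d2 = int_acc[i] - int_acc[h], epo[i] - epo[h]
det = s11 * s22 - s12 ** 2
d_mahal = math.sqrt((s22 * d1 ** 2 - 2 * s12 * d1 * d2 + s11 * d2 ** 2) / det)

print(f"var z1={cov(z1, z1):.4f}  var z2={cov(z2, z2):.4f}  cov={cov(z1, z2):.4f}")
print(f"distance on transformed data={d_on_z:.4f}  closed form={d_mahal:.4f}")
```

The transformed variables have unit variances and zero covariance, in line with the Var/Cov Matrix: I stated on the slide.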

A closer look at the distance

Mahalanobis Distance: deviations from the origin are adjusted by taking into account both the dispersions of the variables and their correlations (orientation).
Now Cyprus, which runs against the orientation of the cloud, is characterized by a Mahalanobis distance from 0 which is higher than that of Slovakia.
Notice that Lithuania has a Mahalanobis distance from 0 similar to that of Slovakia.

[Figure: scatterplot of the centred variables with an ellipse of points having the same Mahalanobis distance from the origin.]

Multivariate samples Transformations


                                  ORIGINAL     CENTRED ON MEAN   STANDARDIZATION   MAHALANOBIS
Data matrix                       X            X~                Z                 Z^M
Means                             mean x_j     0                 0                 0
Variances                         s_jj         s_jj              1                 1
Covariances                       s_jk         s_jk              r_jk              0
Correlations                      r_jk         r_jk              r_jk              0
Euclidean distance corresponds to Euclidean    Euclidean         Statistical       Mahalanobis

Conclusion: by transforming data via standardization or the Mahalanobis transformation we are simply defining a new space such that the Euclidean distance calculated on the transformed points coincides respectively with:
Statistical distance (standardization): deviations are evaluated differently depending on their Std.Dev.
Mahalanobis distance (Mahalanobis transformation): deviations are evaluated differently depending on the Std.Dev.s and on the orientation of the cloud (correlations/covariances).
For now the latter transformation has not been explicitly defined, due to its analytical complexity, but we will see later how to obtain Mahalanobis-transformed data.
