Country code
Country name
european region
Spending on Human Resources (total public expen. on education) - % of GDP
Gross domestic expenditure on R&D (GERD) - As a % of GDP
GERD - industry - % of GERD financed by industry
GERD - government - % of GERD financed by government
GERD - abroad - % of GERD financed by abroad
Level of Internet access - % of households who have Internet access at home
Science and technology - Tertiary graduates in S&T x 1000 persons aged 20-29
Female tertiary graduates in S&T per 1000 of females aged 20-29
Male tertiary graduates in S&T per 1000 of males aged 20-29
Number of patent applications to the European Patent Office per million inhabitants
Number of patents granted by the US Patent and Trademark Office per million inhabitants
Expenditure on Information Technology as a % of GDP
Expenditure on Telecommunications as a % of GDP
Youth education attainment level - total - % of the population 20-24 who completed at least upper secondary education
% of females 20-24 who completed at least upper secondary education
% of males 20-24 who completed at least upper secondary education
E-government on-line availability - Online availability of 20 basic public services
Exports of high technology products as a share of total exports
Country          region     Internet_Acc      EPO
Romania          Eastern            6.00     1.31
Czech Republic   Eastern           19.00    12.04
Lithuania        Northern          12.00     2.78
Ireland          Northern          40.00    79.87
Norway           Northern          60.00   135.77
UK               Northern          56.00   124.19
Finland          Northern          51.00   309.09
Sweden           Northern          73.00   293.32
Greece           Southern          17.00     9.87
Italy            Southern          34.00    84.14
Spain            Southern          34.00    30.64
Netherlands      Western           67.00   246.15
Belgium          Western           50.00   141.80
Germany          Western           60.00   299.99
France           Western           34.00   144.52
[Figure: plot of EPO with the overall mean marked (Mean = 127.6987); Netherlands and Spain are labelled.]
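Country          region          EPO   Squared errors
Romania          Eastern        1.31          28.7832
Czech Republic   Eastern       12.04          28.7832
Lithuania        Northern       2.78          23939.3
Ireland          Northern      79.87         6026.929
Norway           Northern     135.77         472.3363
UK               Northern     124.19         1109.776
Finland          Northern     309.09         22978.53
Sweden           Northern     293.32         18446.18
Greece           Southern       9.87         1003.622
Italy            Southern      84.14         1813.908
Spain            Southern      30.64         119.0281
Netherlands      Western      246.15         1446.661
Belgium          Western      141.80         4397.679
Germany          Western      299.99         8441.016
France           Western      144.52         4044.324
Squared errors: squared difference between each EPO value and the mean EPO of its region.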
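The squared errors by region can be checked with a minimal sketch. It assumes that each error is the squared deviation of EPO from the mean EPO of its region, which matches the tabulated values; the names `observations`, `region_means` and `squared_errors` are illustrative and do not come from the slides.

```python
from collections import defaultdict

# (region, EPO) pairs for the 15 countries, in table order.
observations = [
    ("Eastern", 1.31), ("Eastern", 12.04),
    ("Northern", 2.78), ("Northern", 79.87), ("Northern", 135.77),
    ("Northern", 124.19), ("Northern", 309.09), ("Northern", 293.32),
    ("Southern", 9.87), ("Southern", 84.14), ("Southern", 30.64),
    ("Western", 246.15), ("Western", 141.80), ("Western", 299.99),
    ("Western", 144.52),
]


def region_means(pairs):
    """Return the mean EPO within each region."""
    groups = defaultdict(list)
    for region, epo in pairs:
        groups[region].append(epo)
    return {region: sum(values) / len(values) for region, values in groups.items()}


means = region_means(observations)

# Squared error incurred when each EPO value is replaced by its region mean.
squared_errors = [(epo - means[region]) ** 2 for region, epo in observations]

for (region, epo), err in zip(observations, squared_errors):
    print(f"{region:<9} {epo:>7.2f} {err:>12.4f}")
```

Summing `squared_errors` gives the total error incurred when region is used to predict EPO, in the same way the regression line is used later.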
$$\mathrm{Cov}(EPO, Int\_Acc) = \frac{1}{14}\sum_{i=1}^{15}(epo_i - \overline{epo})(int\_acc_i - \overline{int\_acc})$$
For each observation we can calculate the difference between the observed EPO value and the value predicted using the regression line.
In the plot the error is highlighted for Spain.
The MODEL SUM OF SQUARES of EPO given Int_Acc is the sum of the squared errors incurred when using the line to predict EPO.
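A minimal sketch of this calculation, assuming the regression line is the ordinary least-squares line (slope = covariance / variance of Int_Acc). The variable names are illustrative; the data are the 15 Int_Acc and EPO values from the slides.

```python
int_acc = [6.0, 19.0, 12.0, 40.0, 60.0, 56.0, 51.0, 73.0,
           17.0, 34.0, 34.0, 67.0, 50.0, 60.0, 34.0]
epo = [1.31, 12.04, 2.78, 79.87, 135.77, 124.19, 309.09, 293.32,
       9.87, 84.14, 30.64, 246.15, 141.80, 299.99, 144.52]

n = len(epo)
mean_x = sum(int_acc) / n
mean_y = sum(epo) / n

# Sample covariance and variance (both divided by n - 1).
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(int_acc, epo)) / (n - 1)
s_xx = sum((x - mean_x) ** 2 for x in int_acc) / (n - 1)

# Least-squares line: EPO predicted from Int_Acc.
slope = s_xy / s_xx
intercept = mean_y - slope * mean_x

predicted = [intercept + slope * x for x in int_acc]
squared_errors = [(y - p) ** 2 for y, p in zip(epo, predicted)]

# "MODEL SUM OF SQUARES" in the slides' terminology.
model_ss = sum(squared_errors)
print(f"slope={slope:.4f} intercept={intercept:.4f} model SS={model_ss:.2f}")
```

The per-country values in `predicted` and `squared_errors` can be compared with the last two columns of the table that follows.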
Country          Int_Acc      EPO   Gen mean   Squared errors   Line prediction   Squared errors
                                                 (gen mean)                           (line)
Romania             6.00     1.31   127.6987       15974.1035          -32.4576         1140.251
Czech Republic     19.00    12.04   127.6987       13376.9349           27.2566         231.5449
Lithuania          12.00     2.78   127.6987       15604.6816           -4.8972          58.9394
Ireland            40.00    79.87   127.6987        2287.5845           123.718         1922.647
Norway             60.00   135.77   127.6987          65.1459           215.586         6370.594
UK                 56.00   124.19   127.6987           12.311          197.2124         5332.271
Finland            51.00   309.09   127.6987       32902.8037          174.2454         18183.07
Sweden             73.00   293.32   127.6987        27430.415          275.3002         324.7132
Greece             17.00     9.87   127.6987       13883.6025           18.0698          67.2367
Italy              34.00    84.14   127.6987        1897.3603           96.1576         144.4227
Spain              34.00    30.64   127.6987        9420.3912           96.1576         4292.556
Netherlands        67.00   246.15   127.6987       14030.7105          247.7398           2.5275
Belgium            50.00   141.80   127.6987         198.8467           169.652         775.7339
Germany            60.00   299.99   127.6987       29684.2921           215.586         7124.035
France             34.00   144.52   127.6987         282.9561           96.1576         2338.922
$$R^2_{EPO|Int\_Acc} = 1 - \frac{\text{MODEL SS}_{EPO|Int\_Acc}}{\text{TOTAL SS}_{EPO|Int\_Acc}} = 0.7271$$
The R2 index ranges from 0 to 1 and measures how well one numerical variable predicts the other.
It can be shown that the index coincides with the squared correlation coefficient.
Hence the correlation measures the extent of linear association, whereas its square measures the proportion of the variance of one variable that can be explained by the other (numerical) variable.
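The coincidence between R2 and the squared correlation can be checked numerically. The sketch below assumes a least-squares line and uses hypothetical helper names (`covariance`, `correlation`); it computes R2 from the two sums of squares and, separately, the squared correlation coefficient.

```python
from math import sqrt

int_acc = [6.0, 19.0, 12.0, 40.0, 60.0, 56.0, 51.0, 73.0,
           17.0, 34.0, 34.0, 67.0, 50.0, 60.0, 34.0]
epo = [1.31, 12.04, 2.78, 79.87, 135.77, 124.19, 309.09, 293.32,
       9.87, 84.14, 30.64, 246.15, 141.80, 299.99, 144.52]


def mean(values):
    return sum(values) / len(values)


def covariance(x, y):
    """Sample covariance with the n - 1 denominator."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)


def correlation(x, y):
    return covariance(x, y) / sqrt(covariance(x, x) * covariance(y, y))


# Least-squares line and the two sums of squares.
slope = covariance(int_acc, epo) / covariance(int_acc, int_acc)
intercept = mean(epo) - slope * mean(int_acc)
model_ss = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(int_acc, epo))
total_ss = sum((y - mean(epo)) ** 2 for y in epo)

r_squared = 1 - model_ss / total_ss
r = correlation(int_acc, epo)
print(f"R2 = {r_squared:.4f}, squared correlation = {r ** 2:.4f}")
```

Both quantities agree up to floating-point rounding, which is the claim made above for a line fitted by least squares.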
Data Matrices
(Numerical variables only)
Data matrices
Example 1 (continued). Innovation and Research in Europe. For the sake of simplicity, we limit attention to a few variables and a few observations.
country          region     GERD   [header lost]   Internet_Acc   ST_grad      EPO   E_gov_avail
Romania          Eastern    0.39           43.00           6.00      5.80     1.31         25.00
Czech Republic   Eastern    1.20           43.60          19.00      6.00    12.04         30.00
Lithuania        Northern   0.67           56.30          12.00     14.60     2.78         40.00
Ireland          Northern   1.10           25.60          40.00     20.50    79.87         50.00
Norway           Northern   1.60           39.80          60.00      7.70   135.77         56.00
UK               Northern   1.83           28.80          56.00     20.30   124.19         59.00
Finland          Northern   3.30           25.50          51.00     17.40   309.09         67.00
Sweden           Northern   4.25           21.30          73.00     13.30   293.32         74.00
Greece           Southern   0.64           46.60          17.00      8.00     9.87         32.00
Italy            Southern   1.09           46.80          34.00      7.40    84.14         53.00
Spain            Southern   0.91           39.90          34.00     11.90    30.64         55.00
Netherlands      Western    1.80           35.80          67.00      6.60   246.15         32.00
Belgium          Western    2.08           22.00          50.00     10.50   141.80         35.00
Germany          Western    2.46           31.40          60.00      8.10   299.99         47.00
France           Western    2.20           36.90          34.00     19.50   144.52         50.00
At the moment we consider only numerical variables. The data matrix contains the information available for the n cases (rows) on the p variables (columns).
Data matrices
Example 1 (continued). Innovation and Research in Europe. (subset)
                 GERD   [header lost]   Internet_Acc   ST_grad      EPO   E_gov_avail
Romania          0.39           43.00           6.00      5.80     1.31         25.00
Czech Republic   1.20           43.60          19.00      6.00    12.04         30.00
Lithuania        0.67           56.30          12.00     14.60     2.78         40.00
Ireland          1.10           25.60          40.00     20.50    79.87         50.00
Norway           1.60           39.80          60.00      7.70   135.77         56.00
UK               1.83           28.80          56.00     20.30   124.19         59.00
Finland          3.30           25.50          51.00     17.40   309.09         67.00
Sweden           4.25           21.30          73.00     13.30   293.32         74.00
Greece           0.64           46.60          17.00      8.00     9.87         32.00
Italy            1.09           46.80          34.00      7.40    84.14         53.00
Spain            0.91           39.90          34.00     11.90    30.64         55.00
Netherlands      1.80           35.80          67.00      6.60   246.15         32.00
Belgium          2.08           22.00          50.00     10.50   141.80         35.00
Germany          2.46           31.40          60.00      8.10   299.99         47.00
France           2.20           36.90          34.00     19.50   144.52         50.00
To each observation a collection of p values is associated: the values observed on each variable for that observation.
Similarly, to each variable a collection of n values can be associated (the values observed for all the cases).
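Data matrices
Data matrix (n individuals and p variables)
$$X = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{bmatrix} = \begin{bmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_n^T \end{bmatrix} = \begin{bmatrix} x_{(1)} & x_{(2)} & \cdots & x_{(p)} \end{bmatrix}$$
$$x_i = \begin{bmatrix} x_{i1} \\ x_{i2} \\ \vdots \\ x_{ip} \end{bmatrix}, \qquad x_i^T = \begin{bmatrix} x_{i1} & x_{i2} & \cdots & x_{ip} \end{bmatrix} \quad \text{(transposition operation)}$$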
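The row and column views of a data matrix can be sketched with plain lists. This is only an illustration: it uses the first three countries and three of the variables from the example, and the helper names `row`, `column` and `transpose` are hypothetical.

```python
# First three observations; columns are GERD, Internet_Acc, EPO.
countries = ["Romania", "Czech Republic", "Lithuania"]
variables = ["GERD", "Internet_Acc", "EPO"]
X = [
    [0.39, 6.00, 1.31],
    [1.20, 19.00, 12.04],
    [0.67, 12.00, 2.78],
]

n, p = len(X), len(X[0])  # n observations (rows), p variables (columns)


def row(matrix, i):
    """Values of all p variables for observation i (0-based index)."""
    return list(matrix[i])


def column(matrix, j):
    """Values of variable j (0-based index) for all n observations."""
    return [r[j] for r in matrix]


def transpose(matrix):
    """Swap rows and columns: element (i, j) becomes element (j, i)."""
    return [list(col) for col in zip(*matrix)]


print(countries[1], row(X, 1))
print(variables[2], column(X, 2))
print(transpose(X))
```

Row `j` of the transposed matrix is column `j` of the original, which is how the column vectors x(j) relate to the rows of the transpose of X.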
Data matrices
Example 1 (continued). Innovation and Research in Europe. (subset)
Same data matrix as above, with one column vector, $x_{(6)}$, and one row vector, $x_{13}^T$, highlighted.
A two-dimensional vector: $v = (v_1, v_2)^T$
A three-dimensional vector: $v = (v_1, v_2, v_3)^T$
[Figures: the vectors plotted against the axes $v_1$, $v_2$ and $v_3$.]
The length of a vector v coincides with its distance from the origin, 0:
$$\|v\| = \sqrt{v_1^2 + v_2^2 + \dots + v_K^2} = D_E(v, 0)$$
[Figure: two points u and v in the plane, with the coordinate differences $|v_1 - u_1|$ and $|v_2 - u_2|$ marked.]
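A short sketch of the length and of the Euclidean distance between two vectors. The distance between two generic points u and v is assumed to be the usual Euclidean one (square root of the sum of squared coordinate differences); the function names are illustrative.

```python
from math import sqrt


def norm(v):
    """Length of v: square root of the sum of its squared coordinates."""
    return sqrt(sum(c * c for c in v))


def euclidean_distance(u, v):
    """Euclidean distance between two vectors of the same dimension."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


v = [3.0, 4.0]
origin = [0.0, 0.0]
print(norm(v), euclidean_distance(v, origin))
```

The two printed values are equal, since the length of a vector is its distance from the origin.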
Data matrices
A data matrix can be seen as a collection of two kinds of vectors:
Row vectors: $x_i$
Column vectors: $x_{(j)}$
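Syntheses of variables
$$X = [\, x_{(1)} \;\; x_{(2)} \;\; \cdots \;\; x_{(p)} \,]$$
The position. The sample mean (unbiased estimator of the population mean) for the j-th variable (column) $x_{(j)}$ is:
$$\bar{x}_j = \frac{1}{n}\sum_{i=1}^{n} x_{ij}$$
The sample means are collected in the centroid:
$$\bar{x} = (\bar{x}_1, \dots, \bar{x}_p)^T$$
[Figure: cloud of points for E_gov_indiv and Internet_Acc, with the mean of E_gov_indiv and the mean of Internet_Acc marked.]
The centroid (the vector whose elements are the sample means) is the centre of gravity of the cloud. It is the point that is, overall, least distant from all the points.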
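A minimal sketch of the centroid for two variables. The figure uses E_gov_indiv, whose values are not listed in these slides, so E_gov_avail from the subset is used instead as a stand-in; "least distant" is read here as minimizing the sum of squared Euclidean distances. All names are illustrative.

```python
internet_acc = [6.0, 19.0, 12.0, 40.0, 60.0, 56.0, 51.0, 73.0,
                17.0, 34.0, 34.0, 67.0, 50.0, 60.0, 34.0]
e_gov_avail = [25.0, 30.0, 40.0, 50.0, 56.0, 59.0, 67.0, 74.0,
               32.0, 53.0, 55.0, 32.0, 35.0, 47.0, 50.0]

# One point (row vector) per country.
points = list(zip(internet_acc, e_gov_avail))


def centroid(pts):
    """Vector of the sample means, one mean per variable."""
    n = len(pts)
    return tuple(sum(coords) / n for coords in zip(*pts))


def sum_squared_distances(pts, centre):
    """Sum over all points of the squared Euclidean distance to `centre`."""
    return sum(sum((a - b) ** 2 for a, b in zip(pt, centre)) for pt in pts)


c = centroid(points)
print("centroid:", c)
print("sum of squared distances to centroid:", sum_squared_distances(points, c))
```

Replacing `c` with any other point, for example one of the observed countries, never gives a smaller sum of squared distances.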
Synthesis of variables
The dispersion around the mean.
The sample variance (unbiased estimator of the population variance) for the j-th variable (column) is:
$$s_{jj} = \frac{1}{n-1}\sum_{i=1}^{n}(x_{ij} - \bar{x}_j)^2$$
It is the average of the squared errors we incur when substituting the observed values with the sample mean.
The standard deviation is $\sqrt{s_{jj}}$.
The Std. Dev has the same unit of measurement as the variable taken into account. It measures the expected error (below or above the mean) we incur when substituting the mean for a generic case.
Moreover, it can be considered as the average distance between a generic value and the mean: it is the expected distance from the mean.
Being based upon averages, both the variance and the standard deviation are not robust (they are sensitive to extreme values).
[Figure: cloud of points with two segments marked: the absolute difference between the Iceland E_gov_Indiv value and the mean of E_gov_Indiv, and the absolute difference between the Iceland Internet_Acc value and the mean of Internet_Acc. Note: axes adjusted to have the same scale.]
$$s_{jj} = \frac{1}{n-1}\sum_{i=1}^{n}(x_{ij} - \bar{x}_j)^2$$
Var(E_gov_indiv) + Var(Internet_Acc), i.e. the sum of the variances of the two variables, is proportional to the sum of the squared distances from the observations to the centroid.
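The proportionality can be verified numerically: the sum of the squared distances to the centroid equals (n - 1) times the sum of the two variances. The sketch below again uses E_gov_avail as a stand-in for E_gov_indiv (an assumption, since the latter is not listed) and relies on `statistics.variance`, which uses the n - 1 denominator.

```python
from statistics import variance

internet_acc = [6.0, 19.0, 12.0, 40.0, 60.0, 56.0, 51.0, 73.0,
                17.0, 34.0, 34.0, 67.0, 50.0, 60.0, 34.0]
e_gov_avail = [25.0, 30.0, 40.0, 50.0, 56.0, 59.0, 67.0, 74.0,
               32.0, 53.0, 55.0, 32.0, 35.0, 47.0, 50.0]

n = len(internet_acc)
centre_x = sum(internet_acc) / n
centre_y = sum(e_gov_avail) / n

# Sum of squared Euclidean distances from each observation to the centroid.
sum_sq_dist = sum((x - centre_x) ** 2 + (y - centre_y) ** 2
                  for x, y in zip(internet_acc, e_gov_avail))

# Sum of the two sample variances.
total_var = variance(internet_acc) + variance(e_gov_avail)

print(sum_sq_dist, (n - 1) * total_var)
```

The constant of proportionality is n - 1, the denominator of the sample variance.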
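The sample covariance for the j-th and the h-th variables is
$$s_{jh} = \frac{1}{n-1}\sum_{i=1}^{n}(x_{ij} - \bar{x}_j)(x_{ih} - \bar{x}_h)$$
The sample correlation coefficient for the j-th and the h-th variables is
$$r_{jh} = \frac{s_{jh}}{\sqrt{s_{jj}\, s_{hh}}}$$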
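Variances and covariances are arranged in the var/cov matrix
$$S = \begin{bmatrix} s_{11} & s_{12} & \cdots & s_{1p} \\ s_{21} & s_{22} & \cdots & s_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ s_{p1} & s_{p2} & \cdots & s_{pp} \end{bmatrix}$$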
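Correlation Matrix
Correlations are arranged in the correlation matrix
$$R = \begin{bmatrix} 1 & r_{12} & \cdots & r_{1p} \\ r_{21} & 1 & \cdots & r_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ r_{p1} & r_{p2} & \cdots & 1 \end{bmatrix}, \qquad r_{jh} = \frac{s_{jh}}{\sqrt{s_{jj}\, s_{hh}}}$$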
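As an illustration, the var/cov matrix and the correlation matrix can be built for three of the variables in the example. This is a sketch with hypothetical names (`data`, `covariance`, `S`, `R`); matrices are stored as nested lists.

```python
from math import sqrt

data = {
    "GERD": [0.39, 1.20, 0.67, 1.10, 1.60, 1.83, 3.30, 4.25,
             0.64, 1.09, 0.91, 1.80, 2.08, 2.46, 2.20],
    "Internet_Acc": [6.0, 19.0, 12.0, 40.0, 60.0, 56.0, 51.0, 73.0,
                     17.0, 34.0, 34.0, 67.0, 50.0, 60.0, 34.0],
    "EPO": [1.31, 12.04, 2.78, 79.87, 135.77, 124.19, 309.09, 293.32,
            9.87, 84.14, 30.64, 246.15, 141.80, 299.99, 144.52],
}
names = list(data)
p = len(names)


def covariance(x, y):
    """Sample covariance with the n - 1 denominator."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)


# Var/cov matrix: variances on the diagonal, covariances elsewhere.
S = [[covariance(data[a], data[b]) for b in names] for a in names]

# Correlation matrix: each covariance divided by the two standard deviations.
R = [[S[j][h] / sqrt(S[j][j] * S[h][h]) for h in range(p)] for j in range(p)]

for name, r_row in zip(names, R):
    print(f"{name:<13}", " ".join(f"{r:7.4f}" for r in r_row))
```

Both matrices are symmetric, and the diagonal of R is made of ones because each variable is perfectly correlated with itself.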
Measuring dispersion
$$\text{Total Variance} = \sum_{j=1}^{p} s_{jj}$$
Notice that we are not taking into account the interrelationships between variables, i.e. the orientation of the cloud.
The Total Variance is the sum of the diagonal elements of the var/cov matrix, S. The sum of the diagonal elements of a square matrix is defined to be its trace. Hence, we have:
$$\text{Total Variance} = \sum_{j=1}^{p} s_{jj} = \mathrm{tr}(S)$$
Measuring dispersion
THE GENERALIZED VARIANCE
The volume of the ellipsoid containing the points in the p-dimensional space can be shown to be related to a particular synthesis of the elements of S, the so-called determinant of S, |S|.
The determinant is a number which can be calculated for a square matrix. It equals zero if two columns of the matrix are proportional, i.e., if they share the same information.
This measure is called the Generalized Variance.
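The two syntheses of dispersion can be compared on a pair of variables. The sketch below is restricted to p = 2, so that the determinant has a simple closed form; it uses Internet_Acc and E_gov_avail from the example, plus an artificial variable equal to twice Internet_Acc to show a zero determinant. Function names are illustrative.

```python
def covariance(x, y):
    """Sample covariance with the n - 1 denominator."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)


def cov_matrix_2(x, y):
    """2 x 2 var/cov matrix of two variables."""
    return [[covariance(x, x), covariance(x, y)],
            [covariance(y, x), covariance(y, y)]]


def trace(matrix):
    """Sum of the diagonal elements (Total Variance when applied to S)."""
    return sum(matrix[i][i] for i in range(len(matrix)))


def det2(matrix):
    """Determinant of a 2 x 2 matrix (Generalized Variance when applied to S)."""
    return matrix[0][0] * matrix[1][1] - matrix[0][1] * matrix[1][0]


internet_acc = [6.0, 19.0, 12.0, 40.0, 60.0, 56.0, 51.0, 73.0,
                17.0, 34.0, 34.0, 67.0, 50.0, 60.0, 34.0]
e_gov_avail = [25.0, 30.0, 40.0, 50.0, 56.0, 59.0, 67.0, 74.0,
               32.0, 53.0, 55.0, 32.0, 35.0, 47.0, 50.0]

S = cov_matrix_2(internet_acc, e_gov_avail)
print("total variance:", trace(S), "generalized variance:", det2(S))

# A second variable that carries no new information: twice the first one.
doubled = [2.0 * v for v in internet_acc]
S_prop = cov_matrix_2(internet_acc, doubled)
print("generalized variance, proportional variables:", det2(S_prop))
```

With proportional variables the columns of S are proportional too, so the determinant vanishes (up to rounding) even though the total variance is large.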
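$$X = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{bmatrix} \quad\longrightarrow\quad \tilde{X} = \begin{bmatrix} (x_{11} - \bar{x}_1) & \cdots & (x_{1p} - \bar{x}_p) \\ \vdots & & \vdots \\ (x_{n1} - \bar{x}_1) & \cdots & (x_{np} - \bar{x}_p) \end{bmatrix}$$
X: Centroid = $\bar{x}$; Var/Cov Matrix: S; Corr Matrix: R
$\tilde{X}$: Centroid = Origin = 0; Var/Cov Matrix: S; Corr Matrix: R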
This similar distance is due to different combinations of x- and y-deviations from 0. Should the x- and y-deviations be evaluated in the same manner?
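$$z_{ij} = \frac{x_{ij} - \bar{x}_j}{\sqrt{s_{jj}}}$$
$$D_S(x_i, x_h) = \sqrt{\frac{(x_{i1} - x_{h1})^2}{s_{11}} + \frac{(x_{i2} - x_{h2})^2}{s_{22}} + \dots + \frac{(x_{ip} - x_{hp})^2}{s_{pp}}}$$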
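A small sketch linking the two formulas: the statistical distance between two observations equals the Euclidean distance between their standardized values. It uses Internet_Acc and EPO for Romania and Sweden; `standardize` and `statistical_distance` are hypothetical helper names, and `statistics.stdev` uses the n - 1 denominator.

```python
from math import sqrt
from statistics import mean, stdev, variance

internet_acc = [6.0, 19.0, 12.0, 40.0, 60.0, 56.0, 51.0, 73.0,
                17.0, 34.0, 34.0, 67.0, 50.0, 60.0, 34.0]
epo = [1.31, 12.04, 2.78, 79.87, 135.77, 124.19, 309.09, 293.32,
       9.87, 84.14, 30.64, 246.15, 141.80, 299.99, 144.52]


def standardize(column):
    """z-scores: subtract the mean, divide by the standard deviation."""
    m, s = mean(column), stdev(column)
    return [(v - m) / s for v in column]


def statistical_distance(xi, xh, variances):
    """Euclidean distance with each squared difference divided by its variance."""
    return sqrt(sum((a - b) ** 2 / s for a, b, s in zip(xi, xh, variances)))


variances = [variance(internet_acc), variance(epo)]
romania = (internet_acc[0], epo[0])
sweden = (internet_acc[7], epo[7])
d_stat = statistical_distance(romania, sweden, variances)

# Same distance computed as a Euclidean distance between z-scores.
z_acc, z_epo = standardize(internet_acc), standardize(epo)
d_z = sqrt((z_acc[0] - z_acc[7]) ** 2 + (z_epo[0] - z_epo[7]) ** 2)

print(d_stat, d_z)
```

Dividing by the variances puts the two variables on a common scale, so EPO (with its much larger spread) no longer dominates the distance.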
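$$X = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{bmatrix} \quad\longrightarrow\quad Z = \begin{bmatrix} \dfrac{x_{11} - \bar{x}_1}{\sqrt{s_{11}}} & \cdots & \dfrac{x_{1p} - \bar{x}_p}{\sqrt{s_{pp}}} \\ \vdots & & \vdots \\ \dfrac{x_{n1} - \bar{x}_1}{\sqrt{s_{11}}} & \cdots & \dfrac{x_{np} - \bar{x}_p}{\sqrt{s_{pp}}} \end{bmatrix}$$
X: Centroid = $\bar{x}$; Var/Cov Matrix: S; Corr Matrix: R
Z: Centroid = Origin = 0; Var/Cov Matrix: R; Corr Matrix: R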
In the Statistical distance, coherence with the orientation of the cloud is not considered. A transformation of the data which removes the effect of the Std. Dev and also penalizes deviations by considering the orientation of the cloud of points is the so-called Mahalanobis transformation. We do not enter into details here. The corresponding distance is denoted $D_M(x_i, x_h)$.
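$$X = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{bmatrix} \quad\longrightarrow\quad Z_M = \begin{bmatrix} M(x_{11}) & \cdots & M(x_{1p}) \\ \vdots & & \vdots \\ M(x_{n1}) & \cdots & M(x_{np}) \end{bmatrix}$$
X: Centroid = $\bar{x}$; Var/Cov Matrix: S; Corr Matrix: R
$Z_M$: Centroid = Origin = 0; Var/Cov Matrix: I; Corr Matrix: I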
Mahalanobis Distance: deviations from the origin are adjusted by taking into account both the dispersions of the variables and their correlations (orientation).
Now Cyprus, which runs counter to the orientation of the cloud, is characterized by a Mahalanobis distance from 0 which is higher than that characterizing Slovakia.
Notice that Lithuania has a Mahalanobis distance from 0 similar to that of Slovakia.
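The slides do not give the formula, so the sketch below assumes the standard definition of the Mahalanobis distance, the square root of (u - v)' S^-1 (u - v), written out by hand for two variables. The matrix and the two test points are made up for illustration and are not the Cyprus or Slovakia data.

```python
from math import sqrt


def mahalanobis_2d(u, v, S):
    """Mahalanobis distance between 2-D points u and v given a 2 x 2 var/cov matrix S.

    Uses the closed-form inverse S^-1 = (1 / det) * [[d, -b], [-c, a]]
    for S = [[a, b], [c, d]].
    """
    (a, b), (c, d) = S
    det = a * d - b * c
    if det == 0:
        raise ValueError("S is singular: the generalized variance is zero")
    dx, dy = u[0] - v[0], u[1] - v[1]
    quadratic_form = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return sqrt(quadratic_form)


origin = (0.0, 0.0)
# Illustrative matrix: unit variances, positive correlation 0.8.
S = [[1.0, 0.8], [0.8, 1.0]]

along = (1.0, 1.0)     # deviation coherent with the orientation of the cloud
against = (1.0, -1.0)  # deviation running counter to the orientation

print(mahalanobis_2d(along, origin, S), mahalanobis_2d(against, origin, S))
```

The two points have the same Euclidean and statistical distance from the origin, but the one running counter to the orientation of the cloud gets the larger Mahalanobis distance, as described for Cyprus above.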
                     X           X~          Z             Z_M (MAHALANOBIS)
Means                xj          0           0             0
Variances            sjj         sjj         1             1
Covariances          sjk         sjk         rjk           0
Correlations         rjk         rjk         rjk           0
Euclidean distance   Euclidean   Euclidean   Statistical   Mahalanobis