
LESSON 4. ANALYSIS OF THE RELATIONSHIP BETWEEN VARIABLES

4.1. Simple regression

An objective of any researcher is to find relationships between the phenomena taking place in his/her field of knowledge. When dealing with two variables, X and Y, some kind of relationship or association may exist between them.

The study of the statistical relationship between two variables X and Y focuses on two basic issues:

- To determine the function that best explains the behaviour of the variable Y (dependent or endogenous) by means of a set of variables X1, X2, ..., Xp (independent or exogenous); a relationship is assumed to exist among all these variables (in the two-dimensional case there is only one explanatory variable, X). This is the so-called regression function or curve.

- To provide a measure of the intensity of the relationship that may exist between the variables, that is, of the explanatory power of the regression function; this is the so-called correlation.


Given two variables X and Y with joint frequency distribution $\{(x_i, y_j);\ n_{ij}\}$, the regression of Y on X (Y/X) is the function that explains the variable Y for each value of the variable X. In the same way, the regression of X on Y (X/Y) determines the behaviour of X as a function of Y.

In order to determine these functions, we may use two criteria: type I regression
and type II regression.


4.1.1. Type I regression

Type I regression of Y on X

Let's consider the following two-dimensional frequency distribution:

 X \ Y |  y_1   y_2   ...   y_k
-------+-------------------------
  x_1  |  n_11  n_12  ...   n_1k
  x_2  |  n_21  n_22  ...   n_2k
  ...  |  ...
  x_h  |  n_h1  n_h2  ...   n_hk

If we think about the value we should assign to Y when X = x_1, we would take the mean of Y when X is x_1, that is, the conditional mean of Y given X = x_1. Applying the same criterion to x_2, we would take the conditional mean of Y given x_2, and so on. Therefore, the type I regression of Y on X consists of the pairs:

$(x_1,\ \bar{y}|x_1),\ (x_2,\ \bar{y}|x_2),\ (x_3,\ \bar{y}|x_3),\ \ldots,\ (x_h,\ \bar{y}|x_h)$
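For illustration only, here is a minimal Python sketch of this criterion; the small frequency table and the variable names (x_values, y_values, n_ij) are hypothetical, not taken from the notes.

```python
# Minimal sketch: type I regression of Y on X as the pairs (x_i, mean of Y | X = x_i).
# The frequency table below is hypothetical and only illustrates the mechanics.

x_values = [1, 2]                 # values of X (rows)
y_values = [10, 20]               # values of Y (columns)
n_ij = [[3, 1],                   # joint absolute frequencies n_ij
        [1, 3]]

type_I_pairs = []
for row, x in zip(n_ij, x_values):
    n_i = sum(row)                                             # marginal frequency of x_i
    cond_mean = sum(y * n for y, n in zip(y_values, row)) / n_i
    type_I_pairs.append((x, cond_mean))

print(type_I_pairs)               # [(1, 12.5), (2, 17.5)]
```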

EXAMPLE 2

We have the following two-dimensional frequency distribution:

 X \ Y |  1   2   4   5
-------+-----------------
   2   |  9   0   4   0
   3   |  0   1   0   5
   5   |  0   1   0   0        n = 20


Type I regression of Y on X

Observed values (marginal distribution of Y):

 y_j | n_j
-----+------
  1  |   9
  2  |   2
  4  |   4
  5  |   5
     | n = 20

$\bar{y} = \dfrac{1\cdot 9 + 2\cdot 2 + 4\cdot 4 + 5\cdot 5}{20} = 2.7$

$S_y^2 = \dfrac{(1-2.7)^2\cdot 9 + (2-2.7)^2\cdot 2 + (4-2.7)^2\cdot 4 + (5-2.7)^2\cdot 5}{20} = 3.01$

Type I regression: the line connecting the conditional means $\bar{y}|x=2$, $\bar{y}|x=3$ and $\bar{y}|x=5$.

Conditional distribution of Y given X = 2:

 y_j | n_{j|i}
-----+---------
  1  |    9
  2  |    0
  4  |    4
  5  |    0
     |   13

$\bar{y}|x=2 = \dfrac{1\cdot 9 + 4\cdot 4}{13} = 1.923$

Conditional distribution of Y given X = 3:

 y_j | n_{j|i}
-----+---------
  1  |    0
  2  |    1
  4  |    0
  5  |    5
     |    6

$\bar{y}|x=3 = \dfrac{2\cdot 1 + 5\cdot 5}{6} = 4.5$

Conditional distribution of Y given X = 5:

 y_j | n_{j|i}
-----+---------
  1  |    0
  2  |    1
  4  |    0
  5  |    0
     |    1

$\bar{y}|x=5 = \dfrac{2\cdot 1}{1} = 2$

Theoretical values

Marginal distribution of X and type I regression:

 x_i | n_i         x_i | ŷ(x_i)
-----+------      -----+---------
  2  |  13          2  |  1.923
  3  |   6          3  |  4.5
  5  |   1          5  |  2
     |  20

Distribution of the theoretical values, $\hat{y}(x_i) = \bar{y}|x = x_i$:

 ŷ(x_i) | n_i
---------+------
  1.923  |  13
  4.5    |   6
  2      |   1
         |  20

$\bar{\hat{y}} = \dfrac{1.923\cdot 13 + 4.5\cdot 6 + 2\cdot 1}{20} = 2.7 = \bar{y}$

$S_{\hat{y}}^2 = \dfrac{(1.923-2.7)^2\cdot 13 + (4.5-2.7)^2\cdot 6 + (2-2.7)^2\cdot 1}{20} = 1.389$
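As a check, the figures above can be reproduced with a short Python sketch (plain lists, no external libraries; it simply re-applies the definitions to the frequency table of Example 2):

```python
# Check of the type I regression figures for Example 2.
x_values = [2, 3, 5]
y_values = [1, 2, 4, 5]
n_ij = [[9, 0, 4, 0],   # X = 2
        [0, 1, 0, 5],   # X = 3
        [0, 1, 0, 0]]   # X = 5
n = sum(sum(row) for row in n_ij)                                        # 20

# Marginal distribution of Y: mean and variance of the observed values
n_j = [sum(n_ij[i][j] for i in range(len(x_values))) for j in range(len(y_values))]
y_mean = sum(y * nj for y, nj in zip(y_values, n_j)) / n                 # 2.7
s2_y = sum((y - y_mean) ** 2 * nj for y, nj in zip(y_values, n_j)) / n   # 3.01

# Conditional means (type I regression) and their frequencies
n_i = [sum(row) for row in n_ij]                                         # 13, 6, 1
y_hat = [sum(y * nij for y, nij in zip(y_values, row)) / ni
         for row, ni in zip(n_ij, n_i)]                                  # ≈ 1.923, 4.5, 2.0

# Mean and variance of the theoretical values
yhat_mean = sum(yh * ni for yh, ni in zip(y_hat, n_i)) / n               # 2.7
s2_yhat = sum((yh - yhat_mean) ** 2 * ni for yh, ni in zip(y_hat, n_i)) / n  # ≈ 1.389

print(y_mean, s2_y, y_hat, yhat_mean, round(s2_yhat, 3))
```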

Key idea

Type I regression offers the best possible explanation of Y as a function of X; it provides the true (intrinsic) regression between the variables.

- Advantage: it is the best possible fit.
- Disadvantage: the function can adopt any form (it is the result of connecting dots).
- Solution: force the function to have a certain form; this is the objective of type II regression.


4.1.2. Type II regression

Type II regression of Y on X

In order to obtain the regression curve of Y on X, we first plot one point for each pair of values of the two variables in a coordinate system (cluster of dots or scatter plot) and select the type of function that best fits those pairs. Second, we determine this function by minimizing the sum of the squared errors or residuals, $e_{ij}$; these are the differences between the observed value of the dependent variable, $y_j$, and the theoretical value, $\hat{y}_j$, obtained by replacing X with $x_i$ in the selected function. Thus, $e_{ij} = y_j - \hat{y}_j$.

$\min \sum_i \sum_j (y_j - \hat{y}_j)^2\, n_{ij}$   (least squares method)
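A sketch of the criterion in practice, assuming NumPy is available: the joint frequency table of Example 2 is expanded into individual observations and a straight line is fitted by minimizing the sum of squared residuals (np.polyfit with degree 1 does exactly that). The coefficients obtained this way coincide with the analytic formulas applied later in the lesson.

```python
import numpy as np

# Least squares sketch: expand the joint frequency table of Example 2 into
# individual (x, y) observations, then fit a straight line by minimizing the
# sum of squared residuals.
x_values = [2, 3, 5]
y_values = [1, 2, 4, 5]
n_ij = [[9, 0, 4, 0],
        [0, 1, 0, 5],
        [0, 1, 0, 0]]

xs, ys = [], []
for i, x in enumerate(x_values):
    for j, y in enumerate(y_values):
        xs.extend([x] * n_ij[i][j])
        ys.extend([y] * n_ij[i][j])

b, a = np.polyfit(xs, ys, 1)        # slope b and intercept a of y = a + b x
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))   # minimized criterion
print(round(a, 3), round(b, 3), round(sse, 3))
```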

Comparisons between type I and type II regressions

- Type I regression is the best possible explanation of a variable as a function of the other one (intrinsic regression); the type II regression criterion is a procedure that approximates type I regression.

- Type I regression is always the result of connecting dots, not a continuous curve, which makes it harder to use for our purposes. For this reason, type II regressions have become widespread.

- At a practical level, the sort of function is not set a priori in type I regression, whilst this decision is the first step in the type II regression procedure.

A particular case: linear type II regression

When the function that best fits the cluster of dots is a straight line, we have a linear regression. Its form is

$\hat{y} = a + bx$  for Y/X   ($\hat{x} = a' + b'y$ for X/Y)


Coefficients b and b' are called regression coefficients; a and a' are the points of intersection with the respective axes.

Let's see an example where a straight line is a good fit for the cluster of dots representing the pairs of values (X, Y):

[Figure: scatter plot with fitted regression line. Source: http://www.idlcoyote.com/documents/cg_programs.php]

Linear type II regression of Y on X

Following the least squares technique to fit a straight line, parameters a and b are calculated in the following manner:

$b = \dfrac{S_{xy}}{S_x^2}$ ;  $a = \bar{y} - b\,\bar{x}$

The linear regression coefficient b is the slope of the regression line. Its interpretation is very important: b measures the change in the variable Y for every unit change in the variable X.

The sign of b equals the sign of the covariance.


EXAMPLE 2 - CONTINUATION

Obtain the linear type II regression of Y on X:

$\hat{Y} = a + bX$, with $b = \dfrac{S_{xy}}{S_x^2}$ and $a = \bar{y} - b\,\bar{x}$

$\bar{x} = \dfrac{\sum_{i=1}^{r} x_i n_i}{n} = \dfrac{2\cdot 13 + 3\cdot 6 + 5\cdot 1}{20} = 2.45$ ;  $\bar{y} = 2.7$

$S_x^2 = \dfrac{\sum_{i=1}^{r} x_i^2 n_i}{n} - \bar{x}^2 = \dfrac{2^2\cdot 13 + 3^2\cdot 6 + 5^2\cdot 1}{20} - 2.45^2 = 0.5475$

$S_{xy} = \dfrac{\sum_{i=1}^{r}\sum_{j=1}^{c} x_i y_j n_{ij}}{n} - \bar{x}\,\bar{y} = \dfrac{2\cdot 1\cdot 9 + \ldots + 5\cdot 5\cdot 0}{20} - (2.7\cdot 2.45) = 0.435$

$b = \dfrac{0.435}{0.5475} = 0.795$ ;  $a = 2.7 - (0.795\cdot 2.45) = 0.752$

$\hat{Y} = 0.752 + 0.795X$
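The same result can be checked numerically; the sketch below (assuming NumPy) computes the grouped-data moments with the formulas above and then b and a:

```python
import numpy as np

# Grouped-data moments for Example 2 and the least squares coefficients.
x = np.array([2, 3, 5])
y = np.array([1, 2, 4, 5])
n_ij = np.array([[9, 0, 4, 0],
                 [0, 1, 0, 5],
                 [0, 1, 0, 0]])
n = n_ij.sum()                        # 20
n_i = n_ij.sum(axis=1)                # marginal frequencies of X
n_j = n_ij.sum(axis=0)                # marginal frequencies of Y

x_mean = (x * n_i).sum() / n                                  # 2.45
y_mean = (y * n_j).sum() / n                                  # 2.7
s2_x = (x**2 * n_i).sum() / n - x_mean**2                     # 0.5475
s_xy = (n_ij * np.outer(x, y)).sum() / n - x_mean * y_mean    # 0.435

b = s_xy / s2_x                       # ≈ 0.795
a = y_mean - b * x_mean               # ≈ 0.753 (0.752 in the notes, where b is rounded first)
print(round(b, 3), round(a, 3))       # regression line: y = a + b x
```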

4.1.3. Goodness of fit

The attempt to explain a variable as a function of another one rests on the a priori assumption that the information provided by the latter improves the knowledge of the former. In other words, in the regression of Y on X, Y is supposed to be better explained through X than by using only the marginal distribution of Y.

In order to determine to what extent the description of a variable improves when we take the other one into account, we define the following concept: variance due to regression. To do so, we consider three variables derived from the regression:

- $y_j$, the observed values of Y

- $\hat{y}_j$, the theoretical values assigned to each $x_i$ in the regression of Y on X

- $e_j$, the errors or residuals from the regression, $e_j = y_j - \hat{y}_j$

Their mean values are:


- the mean of the observed series: $\bar{y} = \sum_i \sum_j y_j \dfrac{n_{ij}}{n}$

- the mean of the residuals from the regression Y/X: $\bar{e} = \sum_i \sum_j e_j \dfrac{n_{ij}}{n} = \sum_i \sum_j (y_j - \hat{y}_j) \dfrac{n_{ij}}{n} = 0$

- the mean of the theoretical values: $\bar{\hat{y}} = \sum_i \sum_j \hat{y}_j \dfrac{n_{ij}}{n} = \sum_i \sum_j (y_j - e_j) \dfrac{n_{ij}}{n} = \bar{y} - \bar{e} = \bar{y}$

Their variances are:

- total variance of the observed values: it measures the variation of Y in the observed marginal distribution,

$S_y^2 = \sum_j (y_j - \bar{y})^2 \dfrac{n_j}{n} = \dfrac{SST}{n}$

- variance of the errors: it measures the deviations between the theoretical and the observed values, that is, the dispersion left out of the regression line,

$S_e^2 = \sum_i \sum_j (y_j - \hat{y}_j)^2 \dfrac{n_{ij}}{n} = \dfrac{SSE}{n}$

- variance due to regression, or variance of the theoretical values: it measures the dispersion of the values obtained from the regression,

$S_{\hat{y}}^2 = S_R^2 = \sum_i (\hat{y}_i - \bar{y})^2 \dfrac{n_i}{n} = \dfrac{SSR}{n}$

There is a relationship between these three variances, both in type I regression and in the linear fit. It is as follows:

$S_y^2 = S_e^2 + S_{\hat{y}}^2 \iff SST = SSE + SSR$


The total variability in a regression analysis (SST) can be decomposed into two
components: one explained by the regression (SSR) and the other due to the
unexplained error (SSE).
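A quick numerical illustration of this decomposition, using the type I regression of Example 2 (plain Python; it reuses the conditional means and the marginal distribution of Y computed earlier):

```python
# Variance decomposition S²_y = S²_e + S²_ŷ, checked on the type I regression
# of Example 2 (no libraries required).
x_values = [2, 3, 5]
y_values = [1, 2, 4, 5]
n_ij = [[9, 0, 4, 0],
        [0, 1, 0, 5],
        [0, 1, 0, 0]]
n = 20

n_i = [sum(row) for row in n_ij]
n_j = [sum(n_ij[i][j] for i in range(3)) for j in range(4)]
y_hat = [sum(y * nij for y, nij in zip(y_values, row)) / ni
         for row, ni in zip(n_ij, n_i)]                                     # conditional means

y_mean = sum(y * nj for y, nj in zip(y_values, n_j)) / n
s2_y = sum((y - y_mean) ** 2 * nj for y, nj in zip(y_values, n_j)) / n      # SST/n
s2_e = sum((y - y_hat[i]) ** 2 * n_ij[i][j]
           for i in range(3) for j, y in enumerate(y_values)) / n           # SSE/n
s2_r = sum((yh - y_mean) ** 2 * ni for yh, ni in zip(y_hat, n_i)) / n       # SSR/n

print(round(s2_y, 3), round(s2_e, 3), round(s2_r, 3))   # 3.01 = 1.621 + 1.389
```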


Situations:

- If the fitted function (from the regression) passes through all the points, there is a maximum degree of dependence between the variables.

- The more distant the points are from the function, the more intensity is lost in the association.

Measure for goodness of fit: the general coefficient of determination (R²), which indicates the percentage of the variability in Y that is explained by the regression equation:

$R^2 = \dfrac{S_{\hat{y}}^2}{S_y^2} = \dfrac{SSR}{SST}$

Accounting for the relationship between the variances, it follows that:

$R^2 = 1 - \dfrac{S_e^2}{S_y^2} = 1 - \dfrac{SSE}{SST}$

In linear type II regression, this coefficient of determination is called the linear coefficient of determination (r²). In this concrete case, taking into account that

$S_e^2 = \sum_i \sum_j (y_j - \hat{y}_j)^2 \dfrac{n_{ij}}{n} = S_y^2 - \dfrac{S_{xy}^2}{S_x^2}$

it follows that

$r^2 = 1 - \dfrac{S_e^2}{S_y^2} = \dfrac{S_{xy}^2}{S_x^2\, S_y^2}$

Comments

- R² and r² range of values: 0 ≤ R² ≤ 1, where 0 corresponds to a very bad fit and 1 to the best possible fit.

- R² ≥ r²

- The coefficient that measures the degree of linear correlation between the variables (r, linear coefficient of correlation) is:

$r = \dfrac{S_{xy}}{S_x\, S_y}$


Why do we use r? To add to r² the nature of the statistical dependence (positive or negative association).

r range of values: −1 ≤ r ≤ 1, where −1 indicates a perfect negative linear relationship and +1 a perfect positive linear relationship.

EXAMPLE 2 - CONTINUATION

In type I regression:

$R^2 = \dfrac{S_{\hat{y}}^2}{S_y^2} = \dfrac{1.389}{3.01} = 0.461$

In linear type II regression:

$r^2 = \dfrac{S_{xy}^2}{S_x^2\, S_y^2} = \dfrac{(0.435)^2}{0.5475\cdot 3.01} = 0.115$

It must be fulfilled that $R^2 \geq r^2$: indeed, 0.461 ≥ 0.115.

In linear type II regression, it is always true that:

$\bar{y} = \bar{\hat{y}} = 2.7$ ;  $\bar{e} = 0$

$S_{\hat{y}}^2 = S_R^2 = r^2 S_y^2 = 0.115\cdot 3.01 = 0.346$

$S_e^2 = (1 - r^2)\, S_y^2 = (1 - 0.115)\cdot 3.01 = 2.664$

$S_y^2 = S_{\hat{y}}^2 + S_e^2: \quad 0.346 + 2.664 = 3.01$
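All of these goodness-of-fit figures can be reproduced with a short sketch (assuming NumPy; r is obtained from the covariance and the standard deviations, following the definition given above):

```python
import numpy as np

# Goodness of fit for Example 2: R² of the type I regression, r² and r of the
# linear type II regression, plus the variance decomposition of the linear fit.
x = np.array([2, 3, 5])
y = np.array([1, 2, 4, 5])
n_ij = np.array([[9, 0, 4, 0],
                 [0, 1, 0, 5],
                 [0, 1, 0, 0]])
n = n_ij.sum()
n_i, n_j = n_ij.sum(axis=1), n_ij.sum(axis=0)

y_mean = (y * n_j).sum() / n
s2_y = ((y - y_mean) ** 2 * n_j).sum() / n                        # 3.01
x_mean = (x * n_i).sum() / n
s2_x = ((x - x_mean) ** 2 * n_i).sum() / n                        # 0.5475
s_xy = (n_ij * np.outer(x, y)).sum() / n - x_mean * y_mean        # 0.435

# Type I regression: R² = S²_ŷ / S²_y
y_hat = (n_ij * y).sum(axis=1) / n_i                              # conditional means
s2_yhat = ((y_hat - y_mean) ** 2 * n_i).sum() / n                 # ≈ 1.389
R2 = s2_yhat / s2_y                                               # ≈ 0.461

# Linear type II regression: r², r and the variance split
r2 = s_xy ** 2 / (s2_x * s2_y)                                    # ≈ 0.115
r = s_xy / (s2_x ** 0.5 * s2_y ** 0.5)                            # ≈ +0.339
s2_R = r2 * s2_y                                                  # ≈ 0.346
s2_e = (1 - r2) * s2_y                                            # ≈ 2.664
print(round(R2, 3), round(r2, 3), round(r, 3), round(s2_R, 3), round(s2_e, 3))
```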

Degree in Economics / Estefanía Mourelle Espasandín, 2016/2017