
LESSON 4. ANALYSIS OF THE RELATIONSHIP BETWEEN VARIABLES

4.1. Simple regression

An objective of any researcher is to find relationships between the phenomena taking place in his/her field of knowledge. When dealing with two variables, X and Y, some kind of relationship or association may exist between them.

The study of the statistical relationship between two variables X and Y focuses on two basic issues:

- To determine the function that best explains the behaviour of the variable Y (dependent or endogenous) by means of a set of variables X1, X2, ..., Xp (independent or exogenous); a relationship is assumed to exist among all these variables (in the two-dimensional case there is only one explanatory variable, X). This is the so-called regression function or curve.

- To provide a measure of the intensity of the relationship that may exist between the variables, that is, of the explanatory power of the regression function; this is the so-called correlation.


Given two variables X and Y with joint frequency distribution $\{(x_i, y_j);\ n_{ij}\}$, the regression of Y on X (Y/X) is the function that explains the variable Y for each value of the variable X. In the same way, the regression of X on Y (X/Y) determines the behaviour of X as a function of Y.

In order to determine these functions, we may use two criteria: type I regression
and type II regression.


4.1.1. Type I regression

Type I regression of Y on X

Let's consider the following two-dimensional frequency distribution:

 X \ Y |  y_1   y_2   ...   y_k
-------+-------------------------
  x_1  |  n_11  n_12  ...   n_1k
  x_2  |  n_21  n_22  ...   n_2k
  ...  |  ...
  x_h  |  n_h1  n_h2  ...   n_hk

If we think about the value we should assign to Y when X = x_1, we would take the mean of Y when X is x_1, that is, the conditional mean of Y given X = x_1. Applying the same criterion to x_2, we would take the conditional mean of Y given x_2, and so on. Therefore, the type I regression of Y on X consists of the pairs:

$(x_1,\ \bar{y}|x_1),\ (x_2,\ \bar{y}|x_2),\ (x_3,\ \bar{y}|x_3),\ \ldots,\ (x_h,\ \bar{y}|x_h)$
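For illustration only, here is a minimal Python sketch of this criterion; the small frequency table and the variable names (x_values, y_values, n_ij) are hypothetical, not taken from the notes.

```python
# Minimal sketch: type I regression of Y on X as the pairs (x_i, mean of Y | X = x_i).
# The frequency table below is hypothetical and only illustrates the mechanics.

x_values = [1, 2]                 # values of X (rows)
y_values = [10, 20]               # values of Y (columns)
n_ij = [[3, 1],                   # joint absolute frequencies n_ij
        [1, 3]]

type_I_pairs = []
for row, x in zip(n_ij, x_values):
    n_i = sum(row)                                             # marginal frequency of x_i
    cond_mean = sum(y * n for y, n in zip(y_values, row)) / n_i
    type_I_pairs.append((x, cond_mean))

print(type_I_pairs)               # [(1, 12.5), (2, 17.5)]
```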

EXAMPLE 2

We have the following two-dimensional frequency distribution:

 X \ Y |  1   2   4   5
-------+-----------------
   2   |  9   0   4   0
   3   |  0   1   0   5
   5   |  0   1   0   0        n = 20


Type I regression of Y on X

Observed values (marginal distribution of Y):

 y_j | n_j
-----+------
  1  |   9
  2  |   2
  4  |   4
  5  |   5
     | n = 20

$\bar{y} = \dfrac{1\cdot 9 + 2\cdot 2 + 4\cdot 4 + 5\cdot 5}{20} = 2.7$

$S_y^2 = \dfrac{(1-2.7)^2\cdot 9 + (2-2.7)^2\cdot 2 + (4-2.7)^2\cdot 4 + (5-2.7)^2\cdot 5}{20} = 3.01$

Type I regression: the line connecting the conditional means $\bar{y}|x=2$, $\bar{y}|x=3$ and $\bar{y}|x=5$.

Conditional distribution of Y given X = 2:

 y_j | n_{j|i}
-----+---------
  1  |    9
  2  |    0
  4  |    4
  5  |    0
     |   13

$\bar{y}|x=2 = \dfrac{1\cdot 9 + 4\cdot 4}{13} = 1.923$

Conditional distribution of Y given X = 3:

 y_j | n_{j|i}
-----+---------
  1  |    0
  2  |    1
  4  |    0
  5  |    5
     |    6

$\bar{y}|x=3 = \dfrac{2\cdot 1 + 5\cdot 5}{6} = 4.5$

Conditional distribution of Y given X = 5:

 y_j | n_{j|i}
-----+---------
  1  |    0
  2  |    1
  4  |    0
  5  |    0
     |    1

$\bar{y}|x=5 = \dfrac{2\cdot 1}{1} = 2$

Theoretical values

Marginal distribution of X and type I regression:

 x_i | n_i         x_i | ŷ(x_i)
-----+------      -----+---------
  2  |  13          2  |  1.923
  3  |   6          3  |  4.5
  5  |   1          5  |  2
     |  20

Distribution of the theoretical values, $\hat{y}(x_i) = \bar{y}|x = x_i$:

 ŷ(x_i) | n_i
---------+------
  1.923  |  13
  4.5    |   6
  2      |   1
         |  20

$\bar{\hat{y}} = \dfrac{1.923\cdot 13 + 4.5\cdot 6 + 2\cdot 1}{20} = 2.7 = \bar{y}$

$S_{\hat{y}}^2 = \dfrac{(1.923-2.7)^2\cdot 13 + (4.5-2.7)^2\cdot 6 + (2-2.7)^2\cdot 1}{20} = 1.389$
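As a check, the figures above can be reproduced with a short Python sketch (plain lists, no external libraries; it simply re-applies the definitions to the frequency table of Example 2):

```python
# Check of the type I regression figures for Example 2.
x_values = [2, 3, 5]
y_values = [1, 2, 4, 5]
n_ij = [[9, 0, 4, 0],   # X = 2
        [0, 1, 0, 5],   # X = 3
        [0, 1, 0, 0]]   # X = 5
n = sum(sum(row) for row in n_ij)                                        # 20

# Marginal distribution of Y: mean and variance of the observed values
n_j = [sum(n_ij[i][j] for i in range(len(x_values))) for j in range(len(y_values))]
y_mean = sum(y * nj for y, nj in zip(y_values, n_j)) / n                 # 2.7
s2_y = sum((y - y_mean) ** 2 * nj for y, nj in zip(y_values, n_j)) / n   # 3.01

# Conditional means (type I regression) and their frequencies
n_i = [sum(row) for row in n_ij]                                         # 13, 6, 1
y_hat = [sum(y * nij for y, nij in zip(y_values, row)) / ni
         for row, ni in zip(n_ij, n_i)]                                  # ≈ 1.923, 4.5, 2.0

# Mean and variance of the theoretical values
yhat_mean = sum(yh * ni for yh, ni in zip(y_hat, n_i)) / n               # 2.7
s2_yhat = sum((yh - yhat_mean) ** 2 * ni for yh, ni in zip(y_hat, n_i)) / n  # ≈ 1.389

print(y_mean, s2_y, y_hat, yhat_mean, round(s2_yhat, 3))
```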

Key idea

Type I regression offers the best possible explanation of Y as a function of X; it provides the true (intrinsic) regression between the variables.

- Advantage: it is the best possible fit.
- Disadvantage: the function can adopt any form (it is the result of connecting dots).
- Solution: force the function to have a certain form; this is the objective of type II regression.


4.1.2. Type II regression

Type II regression of Y on X

In order to obtain the regression curve of Y on X, we first plot one point for each pair of values of the two variables in a coordinate system (cluster of dots or scatter plot) and select the type of function that best fits those pairs. Second, we determine this function by minimizing the sum of the squared errors or residuals, $e_{ij}$; these are the differences between the observed value of the dependent variable, $y_j$, and the theoretical value, $\hat{y}_j$, obtained by replacing X with $x_i$ in the selected function. Thus, $e_{ij} = y_j - \hat{y}_j$.

$\min \sum_i \sum_j (y_j - \hat{y}_j)^2\, n_{ij}$   (least squares method)
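A sketch of the criterion in practice, assuming NumPy is available: the joint frequency table of Example 2 is expanded into individual observations and a straight line is fitted by minimizing the sum of squared residuals (np.polyfit with degree 1 does exactly that). The coefficients obtained this way coincide with the analytic formulas applied later in the lesson.

```python
import numpy as np

# Least squares sketch: expand the joint frequency table of Example 2 into
# individual (x, y) observations, then fit a straight line by minimizing the
# sum of squared residuals.
x_values = [2, 3, 5]
y_values = [1, 2, 4, 5]
n_ij = [[9, 0, 4, 0],
        [0, 1, 0, 5],
        [0, 1, 0, 0]]

xs, ys = [], []
for i, x in enumerate(x_values):
    for j, y in enumerate(y_values):
        xs.extend([x] * n_ij[i][j])
        ys.extend([y] * n_ij[i][j])

b, a = np.polyfit(xs, ys, 1)        # slope b and intercept a of y = a + b x
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))   # minimized criterion
print(round(a, 3), round(b, 3), round(sse, 3))
```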

Comparisons between type I and type II regressions

- Type I regression is the best possible explanation of a variable as a function of the other one (intrinsic regression); the type II regression criterion is a procedure that approximates type I regression.

- Type I regression is always the result of connecting dots, not a continuous curve, which makes it harder to use for our purposes. For this reason, type II regressions have become widespread.

- At a practical level, the sort of function is not set a priori in type I regression, whilst this decision is the first step in the type II regression procedure.

A particular case: linear type II regression

When the function that best fits the cluster of dots is a straight line, we have a linear regression. Its form is

$\hat{y} = a + bx$  for Y/X   ($\hat{x} = a' + b'y$ for X/Y)


Coefficients b and b' are called regression coefficients; a and a' are the points of intersection with the respective axes.

Let's see an example where a straight line is a good fit for the cluster of dots representing the pairs of values (X, Y):

[Figure: scatter plot with fitted regression line. Source: http://www.idlcoyote.com/documents/cg_programs.php]

Linear type II regression of Y on X

Following the least squares technique to fit a straight line, parameters a and b are calculated in the following manner:

$b = \dfrac{S_{xy}}{S_x^2}$ ;  $a = \bar{y} - b\,\bar{x}$

The linear regression coefficient b is the slope of the regression line. Its interpretation is very important: b measures the change in the variable Y for every unit change in the variable X.

The sign of b equals the sign of the covariance.


EXAMPLE 2 - CONTINUATION

Obtain the linear type II regression of Y on X:

$\hat{Y} = a + bX$, with $b = \dfrac{S_{xy}}{S_x^2}$ and $a = \bar{y} - b\,\bar{x}$

$\bar{x} = \dfrac{\sum_{i=1}^{r} x_i n_i}{n} = \dfrac{2\cdot 13 + 3\cdot 6 + 5\cdot 1}{20} = 2.45$ ;  $\bar{y} = 2.7$

$S_x^2 = \dfrac{\sum_{i=1}^{r} x_i^2 n_i}{n} - \bar{x}^2 = \dfrac{2^2\cdot 13 + 3^2\cdot 6 + 5^2\cdot 1}{20} - 2.45^2 = 0.5475$

$S_{xy} = \dfrac{\sum_{i=1}^{r}\sum_{j=1}^{c} x_i y_j n_{ij}}{n} - \bar{x}\,\bar{y} = \dfrac{2\cdot 1\cdot 9 + \ldots + 5\cdot 5\cdot 0}{20} - (2.7\cdot 2.45) = 0.435$

$b = \dfrac{0.435}{0.5475} = 0.795$ ;  $a = 2.7 - (0.795\cdot 2.45) = 0.752$

$\hat{Y} = 0.752 + 0.795X$
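The same result can be checked numerically; the sketch below (assuming NumPy) computes the grouped-data moments with the formulas above and then b and a:

```python
import numpy as np

# Grouped-data moments for Example 2 and the least squares coefficients.
x = np.array([2, 3, 5])
y = np.array([1, 2, 4, 5])
n_ij = np.array([[9, 0, 4, 0],
                 [0, 1, 0, 5],
                 [0, 1, 0, 0]])
n = n_ij.sum()                        # 20
n_i = n_ij.sum(axis=1)                # marginal frequencies of X
n_j = n_ij.sum(axis=0)                # marginal frequencies of Y

x_mean = (x * n_i).sum() / n                                  # 2.45
y_mean = (y * n_j).sum() / n                                  # 2.7
s2_x = (x**2 * n_i).sum() / n - x_mean**2                     # 0.5475
s_xy = (n_ij * np.outer(x, y)).sum() / n - x_mean * y_mean    # 0.435

b = s_xy / s2_x                       # ≈ 0.795
a = y_mean - b * x_mean               # ≈ 0.753 (0.752 in the notes, where b is rounded first)
print(round(b, 3), round(a, 3))       # regression line: y = a + b x
```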

4.1.3. Goodness of fit

The attempt to explain a variable as a function of another one rests on the a priori assumption that the information provided by the latter improves the knowledge of the former. In other words, in the regression of Y on X, Y is supposed to be better explained through X than by using only the marginal distribution of Y.

In order to determine to what extent the description of a variable improves when we take the other one into account, we define the following concept: variance due to regression. To do so, we consider three variables derived from the regression:

- $y_j$, the observed values of Y

- $\hat{y}_j$, the theoretical values assigned to each $x_i$ in the regression of Y on X

- $e_j$, the errors or residuals from the regression, $e_j = y_j - \hat{y}_j$

Their mean values are:


- the mean of the observed series: $\bar{y} = \sum_i \sum_j y_j \dfrac{n_{ij}}{n}$

- the mean of the residuals from the regression Y/X: $\bar{e} = \sum_i \sum_j e_j \dfrac{n_{ij}}{n} = \sum_i \sum_j (y_j - \hat{y}_j) \dfrac{n_{ij}}{n} = 0$

- the mean of the theoretical values: $\bar{\hat{y}} = \sum_i \sum_j \hat{y}_j \dfrac{n_{ij}}{n} = \sum_i \sum_j (y_j - e_j) \dfrac{n_{ij}}{n} = \bar{y} - \bar{e} = \bar{y}$

Their variances are:

- total variance of the observed values: it measures the variation of Y in the observed marginal distribution,

$S_y^2 = \sum_j (y_j - \bar{y})^2 \dfrac{n_j}{n} = \dfrac{SST}{n}$

- variance of the errors: it measures the deviations between the theoretical and the observed values, that is, the dispersion left out of the regression line,

$S_e^2 = \sum_i \sum_j (y_j - \hat{y}_j)^2 \dfrac{n_{ij}}{n} = \dfrac{SSE}{n}$

- variance due to regression, or variance of the theoretical values: it measures the dispersion of the values obtained from the regression,

$S_{\hat{y}}^2 = S_R^2 = \sum_i (\hat{y}_i - \bar{y})^2 \dfrac{n_i}{n} = \dfrac{SSR}{n}$

There is a relationship between these three variances, both in type I regression and in the linear fit. It is as follows:

$S_y^2 = S_e^2 + S_{\hat{y}}^2 \iff SST = SSE + SSR$


The total variability in a regression analysis (SST) can be decomposed into two
components: one explained by the regression (SSR) and the other due to the
unexplained error (SSE).
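A quick numerical illustration of this decomposition, using the type I regression of Example 2 (plain Python; it reuses the conditional means and the marginal distribution of Y computed earlier):

```python
# Variance decomposition S²_y = S²_e + S²_ŷ, checked on the type I regression
# of Example 2 (no libraries required).
x_values = [2, 3, 5]
y_values = [1, 2, 4, 5]
n_ij = [[9, 0, 4, 0],
        [0, 1, 0, 5],
        [0, 1, 0, 0]]
n = 20

n_i = [sum(row) for row in n_ij]
n_j = [sum(n_ij[i][j] for i in range(3)) for j in range(4)]
y_hat = [sum(y * nij for y, nij in zip(y_values, row)) / ni
         for row, ni in zip(n_ij, n_i)]                                     # conditional means

y_mean = sum(y * nj for y, nj in zip(y_values, n_j)) / n
s2_y = sum((y - y_mean) ** 2 * nj for y, nj in zip(y_values, n_j)) / n      # SST/n
s2_e = sum((y - y_hat[i]) ** 2 * n_ij[i][j]
           for i in range(3) for j, y in enumerate(y_values)) / n           # SSE/n
s2_r = sum((yh - y_mean) ** 2 * ni for yh, ni in zip(y_hat, n_i)) / n       # SSR/n

print(round(s2_y, 3), round(s2_e, 3), round(s2_r, 3))   # 3.01 = 1.621 + 1.389
```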


Situations:

- If the fitted function (from the regression) passes through all the points, there is a maximum degree of dependence between the variables.

- The more distant the points are from the function, the more intensity is lost in the association.

Measure for goodness of fit: the general coefficient of determination (R²), which indicates the percentage of the variability in Y that is explained by the regression equation:

$R^2 = \dfrac{S_{\hat{y}}^2}{S_y^2} = \dfrac{SSR}{SST}$

Accounting for the relationship between the variances, it follows that:

$R^2 = 1 - \dfrac{S_e^2}{S_y^2} = 1 - \dfrac{SSE}{SST}$

In linear type II regression, this coefficient of determination is called the linear coefficient of determination (r²). In this concrete case, taking into account that

$S_e^2 = \sum_i \sum_j (y_j - \hat{y}_j)^2 \dfrac{n_{ij}}{n} = S_y^2 - \dfrac{S_{xy}^2}{S_x^2}$

it follows that

$r^2 = 1 - \dfrac{S_e^2}{S_y^2} = \dfrac{S_{xy}^2}{S_x^2\, S_y^2}$

Comments

- R² and r² range of values: 0 ≤ R² ≤ 1, where 0 corresponds to a very bad fit and 1 to the best possible fit.

- R² ≥ r²

- The coefficient that measures the degree of linear correlation between the variables (r, linear coefficient of correlation) is:

$r = \dfrac{S_{xy}}{S_x\, S_y}$


Why do we use r? To add to r² the nature of the statistical dependence (positive or negative association).

r range of values: −1 ≤ r ≤ 1, where −1 indicates a perfect negative linear relationship and +1 a perfect positive linear relationship.

EXAMPLE 2 - CONTINUATION

In type I regression:

$R^2 = \dfrac{S_{\hat{y}}^2}{S_y^2} = \dfrac{1.389}{3.01} = 0.461$

In linear type II regression:

$r^2 = \dfrac{S_{xy}^2}{S_x^2\, S_y^2} = \dfrac{(0.435)^2}{0.5475\cdot 3.01} = 0.115$

It must be fulfilled that $R^2 \geq r^2$: indeed, 0.461 ≥ 0.115.

In linear type II regression, it is always true that:

$\bar{y} = \bar{\hat{y}} = 2.7$ ;  $\bar{e} = 0$

$S_{\hat{y}}^2 = S_R^2 = r^2 S_y^2 = 0.115\cdot 3.01 = 0.346$

$S_e^2 = (1 - r^2)\, S_y^2 = (1 - 0.115)\cdot 3.01 = 2.664$

$S_y^2 = S_{\hat{y}}^2 + S_e^2: \quad 0.346 + 2.664 = 3.01$
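All of these goodness-of-fit figures can be reproduced with a short sketch (assuming NumPy; r is obtained from the covariance and the standard deviations, following the definition given above):

```python
import numpy as np

# Goodness of fit for Example 2: R² of the type I regression, r² and r of the
# linear type II regression, plus the variance decomposition of the linear fit.
x = np.array([2, 3, 5])
y = np.array([1, 2, 4, 5])
n_ij = np.array([[9, 0, 4, 0],
                 [0, 1, 0, 5],
                 [0, 1, 0, 0]])
n = n_ij.sum()
n_i, n_j = n_ij.sum(axis=1), n_ij.sum(axis=0)

y_mean = (y * n_j).sum() / n
s2_y = ((y - y_mean) ** 2 * n_j).sum() / n                        # 3.01
x_mean = (x * n_i).sum() / n
s2_x = ((x - x_mean) ** 2 * n_i).sum() / n                        # 0.5475
s_xy = (n_ij * np.outer(x, y)).sum() / n - x_mean * y_mean        # 0.435

# Type I regression: R² = S²_ŷ / S²_y
y_hat = (n_ij * y).sum(axis=1) / n_i                              # conditional means
s2_yhat = ((y_hat - y_mean) ** 2 * n_i).sum() / n                 # ≈ 1.389
R2 = s2_yhat / s2_y                                               # ≈ 0.461

# Linear type II regression: r², r and the variance split
r2 = s_xy ** 2 / (s2_x * s2_y)                                    # ≈ 0.115
r = s_xy / (s2_x ** 0.5 * s2_y ** 0.5)                            # ≈ +0.339
s2_R = r2 * s2_y                                                  # ≈ 0.346
s2_e = (1 - r2) * s2_y                                            # ≈ 2.664
print(round(R2, 3), round(r2, 3), round(r, 3), round(s2_R, 3), round(s2_e, 3))
```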

Degree in Economics / Estefanía Mourelle Espasandín, 2016/2017