
Module 7: Probability and Statistics

Lecture 6: Regression Analyses and Correlation (contd.)


1. Introduction
In the previous lecture, formulations and problems on linear regression were discussed. The
basic formulation of multiple linear regression was also discussed. This lecture presents the
solving technique of multiple linear regression equations. Theory as well as an example
problem on non-linear regression is also presented here. The lecture concludes with a
discussion on correlation coefficient and its sample statistic.
2. Estimation of the Coefficients for Multiple Linear Regression
Assuming that $\mathrm{Var}(Y \mid x_1, x_2, \ldots, x_m)$ is constant, the sum of squared errors over the $n$ data points can be calculated as

$$\Delta^2 = \sum_{i=1}^{n} \left( y_i - y_i' \right)^2 = \sum_{i=1}^{n} \left[ y_i - \alpha - \beta_1 (x_{1i} - \bar{x}_1) - \cdots - \beta_m (x_{mi} - \bar{x}_m) \right]^2$$

Estimates of the regression coefficients are obtained by minimizing $\Delta^2$.
$$\frac{\partial \Delta^2}{\partial \alpha} = \sum_{i=1}^{n} 2 \left[ y_i - \alpha - \beta_1 (x_{1i} - \bar{x}_1) - \cdots - \beta_m (x_{mi} - \bar{x}_m) \right] (-1) = 0$$

and similarly

$$\frac{\partial \Delta^2}{\partial \beta_1} = \frac{\partial \Delta^2}{\partial \beta_2} = \cdots = \frac{\partial \Delta^2}{\partial \beta_m} = 0$$
From this set of equations,

$$\sum_{i=1}^{n} y_i - n\alpha - \beta_1 \sum_{i=1}^{n} (x_{1i} - \bar{x}_1) - \cdots - \beta_m \sum_{i=1}^{n} (x_{mi} - \bar{x}_m) = 0$$

Since $\sum_{i=1}^{n} (x_{ji} - \bar{x}_j) = 0$ for each variable $j$, it follows that

$$\alpha = \frac{1}{n} \sum_{i=1}^{n} y_i = \bar{y}$$
And, substituting the value of $\alpha$,

$$\beta_1 \sum_{i=1}^{n} (x_{1i} - \bar{x}_1)^2 + \beta_2 \sum_{i=1}^{n} (x_{1i} - \bar{x}_1)(x_{2i} - \bar{x}_2) + \cdots + \beta_m \sum_{i=1}^{n} (x_{1i} - \bar{x}_1)(x_{mi} - \bar{x}_m) = \sum_{i=1}^{n} (x_{1i} - \bar{x}_1)(y_i - \bar{y})$$

$$\vdots$$

$$\beta_1 \sum_{i=1}^{n} (x_{mi} - \bar{x}_m)(x_{1i} - \bar{x}_1) + \beta_2 \sum_{i=1}^{n} (x_{mi} - \bar{x}_m)(x_{2i} - \bar{x}_2) + \cdots + \beta_m \sum_{i=1}^{n} (x_{mi} - \bar{x}_m)^2 = \sum_{i=1}^{n} (x_{mi} - \bar{x}_m)(y_i - \bar{y})$$
Thus, we have $m$ linear simultaneous equations in the $m$ unknowns $\beta_i$, which can be solved to obtain the least squares regression equation

$$E(Y \mid x_1, x_2, \ldots, x_m) = \alpha + \beta_1 (x_1 - \bar{x}_1) + \cdots + \beta_m (x_m - \bar{x}_m) = \beta_o + \beta_1 x_1 + \cdots + \beta_m x_m$$

where $\beta_o = \alpha - \beta_1 \bar{x}_1 - \cdots - \beta_m \bar{x}_m$.
The conditional variance is calculated by

$$S_{Y \mid x_1, \ldots, x_m}^2 = \frac{\Delta^2}{n - m - 1} = \frac{1}{n - m - 1} \sum_{i=1}^{n} \left[ y_i - \alpha - \beta_1 (x_{1i} - \bar{x}_1) - \cdots - \beta_m (x_{mi} - \bar{x}_m) \right]^2 \quad \text{(unbiased estimate)}$$
And so, the corresponding standard deviation is obtained as

$$s_{Y \mid x_1, \ldots, x_m} = \sqrt{\frac{\Delta^2}{n - m - 1}}$$

where $n$ is the sample size and $m$ is the number of independent variables.
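The least squares solution above amounts to solving the $m$ normal equations after centring the regressors. A minimal numpy sketch, using a small hypothetical data set (not from the lecture), is:

```python
import numpy as np

# Hypothetical data set: n = 6 observations, m = 2 independent variables.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
y = np.array([3.2, 3.9, 7.1, 7.8, 11.2, 11.9])
n, m = len(y), 2

# Centre the regressors; with centred variables, alpha = y-bar and the
# beta_j solve the m simultaneous normal equations.
X = np.column_stack([x1 - x1.mean(), x2 - x2.mean()])
alpha = y.mean()
beta = np.linalg.solve(X.T @ X, X.T @ (y - alpha))

# Unbiased conditional variance: Delta^2 / (n - m - 1).
resid = y - (alpha + X @ beta)
s2 = np.sum(resid ** 2) / (n - m - 1)
print(alpha, beta, np.sqrt(s2))
```

The same coefficients could also be obtained with `np.linalg.lstsq`; the explicit normal-equations form is kept here only to mirror the derivation in the text.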
3. Nonlinear Regression
Sometimes, predictions based on linear relationships may overestimate (in certain ranges of the variables) or underestimate (in other ranges) the expected result. In such cases, a nonlinear relationship between the variables may be more appropriate. Determining such nonlinear relationships on the basis of observational data involves nonlinear regression analysis.
Here, let us consider $E(Y \mid x) = \alpha + \beta \, g(x)$, where $g(x)$ is a predetermined function of $x$. For example, if $g(x) = \ln x$, then we define a new variable $x' = g(x)$, so that $E(Y \mid x') = \alpha + \beta x'$, which is again a linear regression equation and can be solved accordingly.
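The transformation trick can be sketched in a few lines. The data below are hypothetical, generated noise-free from $Y = 2 + 3 \ln x$ so that the linear fit on $x' = \ln x$ recovers the coefficients exactly:

```python
import numpy as np

# Hypothetical noise-free data from Y = 2 + 3 ln x.
x = np.array([1.0, 2.0, 5.0, 10.0, 20.0])
y = 2.0 + 3.0 * np.log(x)

# Transform: x' = g(x) = ln x, then fit the ordinary linear model
# E(Y | x') = alpha + beta * x'.
xp = np.log(x)
beta = np.sum((xp - xp.mean()) * (y - y.mean())) / np.sum((xp - xp.mean()) ** 2)
alpha = y.mean() - beta * xp.mean()
print(alpha, beta)  # recovers alpha = 2, beta = 3 up to floating point
```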
4. Problem on Nonlinear Regression
The average all-day parking cost in various cities of India is expressed in terms of the logarithm of the urban population, modelled with the following nonlinear regression equation:

$$E(Y \mid x) = \alpha + \beta \ln x$$

with a constant $\mathrm{Var}(Y \mid x)$, where

$Y$ = average all-day parking cost (in hundreds of Indian Rupees)
$X$ = urban population (in thousands)

Estimate $\alpha$, $\beta$ and $\mathrm{Var}(Y \mid x)$ on the basis of the observed data shown in Table 1.

Table 1: Observed data for urban population and average cost of all-day parking

City   Population, x_i (thousands)   All-day parking cost, y_i (hundreds of INR)
  1              300                                0.51
  2              280                                0.47
  3              330                                0.57
  4              450                                0.59
  5              370                                0.65
  6              540                                0.83
  7              450                                0.87
  8             1990                                0.96
  9             3360                                1.50
 10             3560                                1.13





Solution:
The calculations are shown in tabular form as follows:

City    x_i     y_i    x'_i = ln x_i   x'_i y_i   (x'_i)^2    y_i^2   y'_i = α + β x'_i   (y_i − y'_i)^2
  1     300    0.51        5.704         2.909      32.533    0.260         0.563              0.003
  2     280    0.47        5.635         2.648      31.751    0.221         0.543              0.005
  3     330    0.57        5.799         3.305      33.629    0.325         0.591              0.000
  4     450    0.59        6.109         3.604      37.323    0.348         0.681              0.008
  5     370    0.65        5.914         3.844      34.970    0.423         0.624              0.001
  6     540    0.83        6.292         5.222      39.584    0.689         0.734              0.009
  7     450    0.87        6.109         5.315      37.323    0.757         0.681              0.036
  8    1990    0.96        7.596         7.292      57.698    0.922         1.114              0.024
  9    3360    1.50        8.120        12.180      65.929    2.250         1.266              0.055
 10    3560    1.13        8.178         9.241      66.872    1.277         1.283              0.023
  Σ            8.080      65.454        55.560     437.611    7.471                            0.164

Therefore,

$$\bar{x}' = \frac{65.454}{10} = 6.545, \qquad \bar{y} = \frac{8.080}{10} = 0.808$$

$$\beta = \frac{55.56 - 10 \times 6.545 \times 0.808}{437.611 - 10 \times (6.545)^2} = 0.291$$

$$\alpha = 0.808 - 0.291 \times 6.545 = -1.097$$
$$s_Y^2 = \frac{1}{9} \left[ 7.471 - 10 \times (0.808)^2 \right] = 0.1047$$
$$s_{Y \mid x}^2 = \frac{0.164}{10 - 2} = 0.0205, \qquad s_{Y \mid x} = \sqrt{0.0205} = 0.143$$

$$r^2 = 1 - \frac{s_{Y \mid x}^2}{s_Y^2} = 1 - \frac{0.0205}{0.1047} = 0.804$$

The mean value function and the standard deviation are

$$E(Y \mid x) = -1.097 + 0.291 \ln x, \qquad s_{Y \mid x} = 0.143$$
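The estimates above can be reproduced directly from the data of Table 1. The sketch below fits the transformed model $E(Y \mid x) = \alpha + \beta \ln x$ with numpy; small differences from the hand calculation arise only from rounding in the tabulated sums:

```python
import numpy as np

# Data from Table 1: population (thousands) and parking cost (hundreds of INR).
x = np.array([300, 280, 330, 450, 370, 540, 450, 1990, 3360, 3560], float)
y = np.array([0.51, 0.47, 0.57, 0.59, 0.65, 0.83, 0.87, 0.96, 1.50, 1.13])

xp = np.log(x)   # x' = ln x
n = len(x)

# Least squares estimates for E(Y | x) = alpha + beta * ln x.
beta = (np.sum(xp * y) - n * xp.mean() * y.mean()) / (np.sum(xp ** 2) - n * xp.mean() ** 2)
alpha = y.mean() - beta * xp.mean()

# Conditional variance with m = 1 regressor: sum of squared residuals / (n - 2).
resid = y - (alpha + beta * xp)
s_cond = np.sqrt(np.sum(resid ** 2) / (n - 2))
print(round(beta, 3), round(alpha, 3), round(s_cond, 3))  # approx 0.291, -1.097, 0.143
```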
5. Correlation
Correlation is a statistical technique that can show whether and how strongly pairs of
variables are associated with each other. The study of the degree of linear association
between two random variables is called correlation analysis. The accuracy of a linear
prediction will depend on the correlation between the variables.
In a two-dimensional plot, the degree of correlation between the values on the two axes is
quantified by the correlation coefficient, which is given by
$$\rho = \frac{\mathrm{Cov}(X, Y)}{\sigma_x \sigma_y} = \frac{E\left[ (X - \mu_x)(Y - \mu_y) \right]}{\sigma_x \sigma_y}$$

where $E$ is the expected value operator and $\mathrm{Cov}$ denotes covariance.
The sample correlation coefficient may be estimated by
$$r = \frac{1}{n - 1} \sum_{i=1}^{n} \frac{(x_i - \bar{x})(y_i - \bar{y})}{s_x s_y} = \frac{\sum_{i=1}^{n} x_i y_i - n \bar{x} \bar{y}}{(n - 1) \, s_x s_y}$$

where $\bar{x}$, $\bar{y}$, $s_x$ and $s_y$ are respectively the sample means and standard deviations of $X$ and $Y$.
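The two forms of the sample estimate are algebraically equivalent, which a quick numerical check confirms. The paired sample below is hypothetical:

```python
import numpy as np

# Hypothetical paired sample.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.8, 3.6, 3.9, 5.1])

n = len(x)
sx = x.std(ddof=1)   # sample standard deviations (divisor n - 1)
sy = y.std(ddof=1)

# Form 1: centred products.
r1 = np.sum((x - x.mean()) * (y - y.mean())) / ((n - 1) * sx * sy)
# Form 2: raw sums.
r2 = (np.sum(x * y) - n * x.mean() * y.mean()) / ((n - 1) * sx * sy)
print(r1, r2)  # identical up to floating point
```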
In the linear regression equation,

$$\beta = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$
Using the above two equations, we can rewrite the correlation coefficient as

$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \cdot \frac{s_x}{s_y} = \beta \, \frac{s_x}{s_y}$$
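This rescaling relation can be verified numerically on any sample; the data below are hypothetical:

```python
import numpy as np

# Hypothetical paired sample.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.8, 3.6, 3.9, 5.1])

# Regression slope beta and sample correlation coefficient r.
beta = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
r = np.corrcoef(x, y)[0, 1]

# Rescaling the slope by s_x / s_y reproduces r.
print(beta * x.std(ddof=1) / y.std(ddof=1), r)
```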
Also, we have

$$S_{Y \mid x}^2 = \frac{1}{n - 2} \left[ \sum_{i=1}^{n} (y_i - \bar{y})^2 - \beta^2 \sum_{i=1}^{n} (x_i - \bar{x})^2 \right]$$
Further, by substituting the value of $\beta$ into the above relation, we can write

$$\mathrm{Var}(Y \mid x) = \frac{1}{n - 2} \left[ \sum_{i=1}^{n} (y_i - \bar{y})^2 - r^2 \frac{s_y^2}{s_x^2} \sum_{i=1}^{n} (x_i - \bar{x})^2 \right] = \frac{n - 1}{n - 2} \, s_y^2 \left( 1 - r^2 \right)$$

from which we can write

$$r^2 = 1 - \frac{n - 2}{n - 1} \cdot \frac{S_{Y \mid x}^2}{s_y^2}$$

which can be approximated by $r^2 \approx 1 - \dfrac{S_{Y \mid x}^2}{s_y^2}$ for large $n$.
6. Concluding Remarks
In this lecture, solving techniques for multiple linear regression equations were discussed. Basics of nonlinear regression, along with an example problem, were also presented, followed by a discussion of the correlation coefficient and its sample statistic.
