
Lecture 17 (Sections 12.5, 12.1-12.2)


Correlation and Linear Regression
So far, we have discussed methods for analyzing univariate data. Our focus now shifts to analyzing bivariate data, that is, data obtained on two variables.
Bivariate Data
The data (x_1, y_1), ..., (x_n, y_n) obtained on two random variables are called bivariate data.
For example, let x = height of a student and y = weight of a student. The heights and weights of all students in a class constitute bivariate data.
The aim is to investigate whether the random variables are associated/correlated and whether the relationship is linear or non-linear.

Scatter Plot
A scatter plot is the graphical display of bivariate data, taking the x_i-values along the x-axis and the y_i-values along the y-axis. Just plot the points (x_1, y_1), ..., (x_n, y_n).
The resulting graph is called the scatter plot.
Usually, x = explanatory (or independent) variable and y = response (or dependent) variable.
Often we want to predict the response variable.

Examine the scatter plot for the kind of association:
(i) Direction (negative or positive)
(ii) Strength (none, moderate, strong)
(iii) Form (linear or not)
Look at the following scatter plots:






Example 1 (Ex 4): A study to assess the capacity of subsurface flow wetland systems to remove biochemical oxygen demand (BOD) and various other chemical constituents resulted in the accompanying data on x = BOD mass loading (kg/ha/d) and y = BOD mass removal (kg/ha/d).

X 3 8 10 11 13 16 27 30 35 37 38 44 103 142
Y 4 7 8 8 10 11 16 26 21 9 31 30 75 90

a. Construct a scatter plot of the data, and comment on any interesting features.
Solution:
a. Scatter plot of the data:
[Scatter plot: BOD mass loading (x) vs BOD mass removal (y)]
There is a strong linear relationship between BOD mass loading and BOD mass removal. There is one observation that appears not to match the linear pattern. This value is (37, 9).

Example 3: The following data are on x = number of hours studied and y = the score on a test. Examine the relationship using a scatter plot.
X 0 1 2 3 3 4 4 5 5 5 5 6 6 6 7 7 8 8
Y 40 41 51 58 49 48 64 55 69 58 75 68 63 93 84 67 90 76



A way to observe such relationships is to construct a scatter plot.

The scatter plot is given below:


[Scatterplot of score (y) vs time (x)]


The plot shows that the variables x and y are strongly associated: the values of y increase as the values of x increase.
Sample Correlation Coefficient
A scatter plot gives only a visual impression of the relationship between x and y; sometimes the eye may be fooled. There is a need for a precise measure, and it is given by Karl Pearson's correlation coefficient, which measures the strength of the linear association between x and y.

Definition. The correlation measures the direction and the strength of the linear relationship between x and y. It is given by

r = (1/(n − 1)) Σ_{i=1}^n ((x_i − x̄)/s_x)((y_i − ȳ)/s_y),

where x̄ and s_x are the sample mean and sample standard deviation of x_1, ..., x_n, and similarly for ȳ and s_y.

A useful formula for computational purposes is

r = S_xy / √(S_xx S_yy),

where

S_xx = Σ_{i=1}^n x_i^2 − (Σ x_i)^2 / n = (n − 1) s_x^2,

S_yy = Σ_{i=1}^n y_i^2 − (Σ y_i)^2 / n = (n − 1) s_y^2,

S_xy = Σ_{i=1}^n x_i y_i − (Σ x_i)(Σ y_i) / n.
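For quick numerical work, the computational formula translates directly into Python. The following is a minimal sketch (not part of the original notes; the function name pearson_r and the use of the Example 3 hours-studied data are illustrative assumptions):

    from math import sqrt

    def pearson_r(x, y):
        # Sample correlation via r = Sxy / sqrt(Sxx * Syy)
        n = len(x)
        sx, sy = sum(x), sum(y)
        Sxx = sum(xi * xi for xi in x) - sx * sx / n
        Syy = sum(yi * yi for yi in y) - sy * sy / n
        Sxy = sum(xi * yi for xi, yi in zip(x, y)) - sx * sy / n
        return Sxy / sqrt(Sxx * Syy)

    # Demonstration with the hours-studied data of Example 3
    hours = [0, 1, 2, 3, 3, 4, 4, 5, 5, 5, 5, 6, 6, 6, 7, 7, 8, 8]
    score = [40, 41, 51, 58, 49, 48, 64, 55, 69, 58, 75, 68, 63, 93, 84, 67, 90, 76]
    print(pearson_r(hours, score))   # roughly 0.83: a fairly strong positive linear association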


Remarks:
(i) The standardized scores (x_i − x̄)/s_x and (y_i − ȳ)/s_y say how many SDs an observation lies above or below its mean.
(ii) The correlation r has no units.
(iii) The measure r is called Pearson's sample correlation coefficient.

Properties of Correlation
Note that both variables have to be quantitative.
- r(x, y) = r(y, x).
- −1 ≤ r ≤ 1, and r has no units. (Hint: the proof uses the covariance inequality.)
- r measures the extent of the linear relationship between x and y; it does not capture non-linear relationships.
- Variables may be strongly associated but still have a small r if the association is not linear.
- The sign of the correlation gives the direction of the association: r < 0 indicates negative association and r > 0 indicates positive association.
- The value of r does not depend on the units of measurement of either variable; that is, it is not affected by shifting or scaling the variables. This is because r(ax + b, cy + d) = r(x, y) for a, c positive and b, d real.
- r is strongly affected by a few outlying observations.
- r = 1 only when all the points (x_i, y_i) lie on a straight line with positive slope (and r = −1 when they lie on a line with negative slope).

Example 5 (Ex 12.15): An accurate assessment of soil productivity is critical to rational land-use planning. The following data give the corn yield x and the peanut yield y (mT/Ha) for eight types of soil.

X 2.4 3.4 4.6 3.7 2.2 3.3 4.0 2.1
Y 1.33 2.12 1.80 1.65 2.00 1.76 2.11 1.63

Find if there is any association between the corn yield (x) and the peanut yield (y).
Solution: With Σx_i = 25.7, Σy_i = 14.40, Σx_i^2 = 88.31, Σy_i^2 = 26.4324, and Σx_i y_i = 46.856, we get
S_xx = 88.31 − (25.7)^2/8 = 5.749, S_yy = 26.4324 − (14.40)^2/8 = 0.512, and S_xy = 46.856 − (25.7)(14.40)/8 = 0.596.
Hence r = 0.596 / √((5.749)(0.512)) = 0.347, indicating only a weak positive linear association between the two yields.
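A quick numerical check of this result (a sketch, not part of the original notes; it uses numpy's built-in corrcoef rather than the hand formula):

    import numpy as np

    corn   = [2.4, 3.4, 4.6, 3.7, 2.2, 3.3, 4.0, 2.1]
    peanut = [1.33, 2.12, 1.80, 1.65, 2.00, 1.76, 2.11, 1.63]
    print(np.corrcoef(corn, peanut)[0, 1])   # roughly 0.347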



Example 5. The following data give the marks on the first midterm (x) and the second midterm (y) of 9 students from 3 sections:
8 A.M.: (70, 60), (72, 83), (94, 85)
Noon: (80, 72), (60, 74), (55, 58)
Evening: (45, 63), (50, 40), (35, 54)
(a) Find the correlation coefficient between x and y.
(b) Find the correlation coefficient between the section averages x̄ and ȳ.



Solution.
(a) S_xx = 37695 − (561)^2/9 = 2726, S_yy = 40223 − (589)^2/9 = 1676.222, and
S_xy = 38281 − (561)(589)/9 = 1566.667.
So, r = 1566.667 / √((2726)(1676.222)) = .733.
(b) Now x̄_1 = (70 + 72 + 94)/3 = 78.667, ȳ_1 = (60 + 83 + 85)/3 = 76,
x̄_2 = (80 + 60 + 55)/3 = 65, ȳ_2 = (72 + 74 + 58)/3 = 68,
x̄_3 = (45 + 50 + 35)/3 = 43.333, ȳ_3 = (63 + 40 + 54)/3 = 52.333.
S_xx = (78.667)^2 + (65)^2 + (43.333)^2 − (78.667 + 65 + 43.333)^2/3 = 634.913,
S_yy = (76)^2 + (68)^2 + (52.333)^2 − (76 + 68 + 52.333)^2/3 = 289.923,
S_xy = (78.667)(76) + (65)(68) + (43.333)(52.333) − (187)(196.333)/3 = 428.348.
So, r = 428.348 / √((634.913)(289.923)) = .9984.

Population Correlation Coefficient
The population correlation coefficient between X and Y, denoted by ρ, is defined by

ρ = Cov(X, Y) / (σ_X σ_Y).

We now only look at its properties:

(i) |ρ| ≤ 1.
(ii) ρ = ±1 if all (x_i, y_i) in the population lie on a straight line. The sample correlation coefficient r can be used to decide whether ρ = 0 (no linear relationship between X and Y) or not.
(iii) Also, ρ = ±1 for a bivariate distribution means that the variables X and Y are linearly related.

A test for ρ = 0
To test the hypothesis H_0: ρ = 0 versus H_a: ρ ≠ 0, use the test statistic

t = r √(n − 2) / √(1 − r^2),

which has a t-distribution with (n − 2) df when H_0 is true.
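As an illustration, the test statistic and its two-sided p-value can be computed as follows (a sketch assuming scipy is available; the function name corr_t_test is not from the notes):

    from math import sqrt
    from scipy import stats

    def corr_t_test(r, n):
        # t statistic for H0: rho = 0 and its two-sided p-value from the t(n-2) distribution
        t = r * sqrt(n - 2) / sqrt(1 - r * r)
        p = 2 * stats.t.sf(abs(t), df=n - 2)
        return t, p

    t, p = corr_t_test(0.5778, 24)   # values from the example below (df = 22)
    print(t, p)                      # t is about 3.32 and p is well below .01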


Example (Ex 60): The following summary statistics relate to a study of re-vegetation of soil at mine reclamation sites. Here,
x = KCl-extractable aluminum and
y = amount of lime required to bring soil pH to 7.0.
The summary quantities give n = 24 pairs with sample correlation r = .5778.
Carry out a test at significance level 0.01 to see whether ρ ≠ 0, i.e., whether there is a linear association in the population.

Solution:
H_0: ρ = 0 vs H_a: ρ ≠ 0.
Test statistic: t = r √(n − 2) / √(1 − r^2). Reject H_0 at level .01 if either t ≥ t_{.005, 22} = 2.819 or t ≤ −2.819.
Here r = .5778, so t = 3.32 and H_0 should be rejected. There appears to be a non-zero correlation in the population.


12.1 Linear Regression








Least squares







Coefficient of determination





Details of Linear Regression
Fitting a straight line
Often, one is interested not only in studying the relationship, but also in predicting the value of the dependent variable y based on the independent (predictor or explanatory) variable x.

When the scatter plot suggests a linear relationship, it is natural to find a straight line which is as close as possible to the points.
The equation of a straight line is y = a + bx.

A particular equation is y = 5 + x. Here a = 5 and b = 1.
To draw a line, we need two quantities, namely the intercept (with the y-axis) a and the slope b.
We are given the data (x_1, y_1), ..., (x_n, y_n) on (x, y).
Aim: To find the straight line y = a + bx which fits the data well.
2. Method of Least Squares
Here x = explanatory (predictor) variable and y = response variable.
Let e_i = y_i − (a + bx_i) = error = deviation from the line. Then

Σ_{i=1}^n e_i^2 = Σ_{i=1}^n (y_i − a − bx_i)^2 = sum of squares of errors.

The principle of least squares says: choose the line (i.e., find a and b) such that Σ e_i^2 is minimum. The resulting equation is called the Sample Regression Line.

3. The Derivation
Let

f(a, b) = Σ_{i=1}^n (y_i − a − bx_i)^2.    (*)

For fixed b, treating f as a function of a, we have

0 = ∂f/∂a = −2 Σ_{i=1}^n (y_i − a − bx_i),

so n ȳ − n a − n b x̄ = 0, which gives

a = ȳ − b x̄ (= â, say).

Also, substituting â into (*) and treating f as a function of b,

0 = ∂f/∂b = −2 Σ_{i=1}^n (y_i − a − bx_i) x_i,

so

Σ_{i=1}^n x_i y_i − n a x̄ − b Σ_{i=1}^n x_i^2 = 0,

and, substituting a = ȳ − b x̄,

Σ_{i=1}^n x_i y_i − n x̄ ȳ + n b x̄^2 − b Σ_{i=1}^n x_i^2 = 0.

Solving now for b, we obtain

b = (Σ_{i=1}^n x_i y_i − n x̄ ȳ) / (Σ_{i=1}^n x_i^2 − n x̄^2) = S_xy / S_xx (= b̂, say).


Then the line ŷ = â + b̂ x is called the fitted least-squares (regression) line.
The slope of the least-squares (regression) line is b = S_xy / S_xx, and the intercept of the line is a = ȳ − b x̄. Therefore, the (sample) regression line is ŷ = a + bx.
The value ŷ_i = a + bx_i is called the fitted value of y, and y_i is called the observed value of y.
The quantity e_i = y_i − ŷ_i is called the residual.
If e_i > 0, the model underestimates the data value; if e_i < 0, the model overestimates the data value.
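The least-squares formulas above translate into a short Python routine. This is an illustrative sketch only (the helper names least_squares_fit and residuals are not from the notes):

    def least_squares_fit(x, y):
        # Slope b = Sxy/Sxx and intercept a = ybar - b*xbar
        n = len(x)
        xbar, ybar = sum(x) / n, sum(y) / n
        Sxx = sum((xi - xbar) ** 2 for xi in x)
        Sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
        b = Sxy / Sxx
        a = ybar - b * xbar
        return a, b

    def residuals(x, y, a, b):
        # e_i = y_i - (a + b*x_i); a positive residual means the line underestimates y_i
        return [yi - (a + b * xi) for xi, yi in zip(x, y)]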

Example 1. The following data give the mean height of a group of children in Kalama, an Egyptian village that was the site of a study of nutrition in developing countries. The data were obtained on 161 children each month from 18 to 29 months of age.
Here, x = age (in months) = explanatory variable;
y = height (in cm) = response variable.



x (age, months):   18    19    20    21    22    23    24    25    26    27    28    29
y (height, cm):  76.1  77.0  78.1  78.2  78.8  79.7  79.9  81.1  81.2  81.8  82.8  83.5

For the above data,
x̄ = 23.5, ȳ = 79.85, s_x = 3.606, s_y = 2.302.
Also, r = r(x, y) = .9944.
Hence,
b = r (s_y / s_x) = (.9944)(2.302 / 3.606) = .6348,
and
a = ȳ − b x̄ = 79.85 − (.6348)(23.5) = 64.932.
Therefore, the least-squares line is

ŷ = 64.932 + 0.6348 x.


Interpretation
The slope b = .6348 cm/month is the rate of change in mean height as age increases. Though r does not change with the units of measurement, the equation of the least-squares line does change.
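A quick check of these numbers (an illustrative sketch, not part of the original notes):

    r, s_x, s_y = 0.9944, 3.606, 2.302
    xbar, ybar = 23.5, 79.85

    b = r * s_y / s_x        # slope, about 0.6348
    a = ybar - b * xbar      # intercept, about 64.932
    print(a, b)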

Genesis of Regression. Note that the slope

b = S_xy / S_xx = (S_xy / √(S_xx S_yy)) √(S_yy / S_xx) = r (s_y / s_x).

Hence the regression line can be written as

ŷ = ȳ + r (s_y / s_x)(x − x̄).

Put x = x̄ + s_x; then ŷ = ȳ + r s_y.

When r = 1, ŷ = ȳ + s_y; when r = 1/2, ŷ = ȳ + (1/2) s_y.
For any x-value, the predicted value ŷ will be closer (in terms of SDs) to ȳ than x is to x̄. That is, ŷ is pulled toward (regressed toward) ȳ.
This regression effect was first noticed by Sir Francis Galton, who observed that the predicted height of a son (y_i) was always closer to ȳ than his father's height (x_i) was to x̄.

Assessing the Fit

To assess the effectiveness of the fit, the residuals can be used. Note that
e_i = y_i − ŷ_i > 0 if y_i > ŷ_i,
and
e_i = y_i − ŷ_i < 0 if y_i < ŷ_i.
Also, Σ_i (y_i − ŷ_i)^2 = 0 only when ŷ_i = y_i for all i, that is, when all observed values lie on a straight line. Thus Σ_{i=1}^n e_i^2 can be used as a measure of the fit. Another one is the total variation in the y_i's, namely Σ_{i=1}^n (y_i − ȳ)^2.
Definition:
The residual sum of squares, SSE, is

SSE = Σ_{i=1}^n (y_i − ŷ_i)^2 = Σ_{i=1}^n e_i^2,

and the total sum of squares is defined as

SST = Σ_i (y_i − ȳ)^2 = S_yy.

Note: y_i − ŷ_i = y_i − (a + bx_i) = (y_i − ȳ) − b(x_i − x̄) (substituting a).
Hence Σ e_i = 0 and

SSE = SST + b^2 S_xx − 2 b S_xy = SST − b S_xy (since b = S_xy / S_xx),

which shows SSE can be calculated without the e_i's.

Note:
i. SSE is used as a measure of the variation left unexplained by the regression line.
ii. Similarly, SST is used as a measure of the total variation.
iii. SSE/SST = fraction of the total variation that is unexplained by the line.
Definition: The coefficient of determination, denoted by R^2, is

R^2 = 1 − SSE/SST.

It is the proportion of the variation in y explained by the regression.
Result. R^2 = r^2, where r is the sample correlation coefficient.
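These quantities are easy to compute directly. The following sketch (illustrative, not from the notes) returns SSE, SST, and R^2 using SSE = SST − b S_xy:

    def regression_summary(x, y):
        n = len(x)
        xbar, ybar = sum(x) / n, sum(y) / n
        Sxx = sum((xi - xbar) ** 2 for xi in x)
        Syy = sum((yi - ybar) ** 2 for yi in y)
        Sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
        b = Sxy / Sxx
        SST = Syy                        # total sum of squares
        SSE = SST - b * Sxy              # residual sum of squares, no residuals needed
        return SSE, SST, 1 - SSE / SST   # the last entry also equals r**2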

Definition: The quantity

s_e^2 = Σ_{i=1}^n e_i^2 / (n − 2) = SSE / (n − 2)

is the variance of the residuals, and s_e = √(s_e^2) is called the SD of the residuals about the least-squares line. It is the estimator of the error variance σ^2 in the simple linear regression model.

Plotting the Residuals (Residual Plot)
Definition: A scatter plot of the e_i's against the x_i's is called a residual plot.
(i) It is used for checking whether any unusual, highly influential observations or revealing patterns are present in the data.
(ii) If there is no particular pattern, such as curvature, the least-squares fit is a good fit. Also, the residuals will be centered around the x-axis.
(iii) Looking at the residual plot is equivalent to examining y after removing the linear dependence on x. This may sometimes show the existence of a non-linear relationship.
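For example, a residual plot can be drawn as follows (a sketch assuming matplotlib is available; the hours-studied data of Example 3 are reused for illustration):

    import matplotlib.pyplot as plt

    x = [0, 1, 2, 3, 3, 4, 4, 5, 5, 5, 5, 6, 6, 6, 7, 7, 8, 8]
    y = [40, 41, 51, 58, 49, 48, 64, 55, 69, 58, 75, 68, 63, 93, 84, 67, 90, 76]

    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sum((xi - xbar) ** 2 for xi in x)
    a = ybar - b * xbar
    e = [yi - (a + b * xi) for xi, yi in zip(x, y)]   # residuals e_i

    plt.scatter(x, e)
    plt.axhline(0, linestyle="--")    # residuals should scatter around this line with no pattern
    plt.xlabel("x (hours studied)")
    plt.ylabel("residual")
    plt.title("Residual plot")
    plt.show()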

Example 2 (Ex 9): The flow rate y (m^3/min) in a device used for air-quality measurement depends on the pressure drop x (in. of water) across the device's filter. Suppose that for x values between 5 and 20, the two variables are related according to the simple linear regression model with true regression line y = −.12 + .095x.
a. What is the expected change in flow rate associated with a 1-in. increase in pressure drop? Explain.
b. What change in flow rate can be expected when pressure drop decreases by 5 in.?
c. What is the expected flow rate for a pressure drop of 10 in.? A drop of 15 in.?
d. Suppose σ = .025 and consider a pressure drop of 10 in. What is the probability that the observed value of flow rate will exceed .835? That the observed flow rate will exceed .840?
e. What is the probability that an observation on flow rate when the pressure drop is 10 in. will exceed an observation on flow rate made when the pressure drop is 11 in.?
Solution:
a. β_1 = expected change in flow rate (y) associated with a one-inch increase in pressure drop (x) = .095.
b. We expect the flow rate to decrease by 5β_1 = .475.
c. The expected flow rate at x = 10 is −.12 + .095(10) = .83, and at x = 15 it is −.12 + .095(15) = 1.305.
d. P(Y > .835) = P(Z > (.835 − .830)/.025) = P(Z > .20) = .4207, and
P(Y > .840) = P(Z > (.840 − .830)/.025) = P(Z > .40) = .3446.

e. Let Y_1 and Y_2 denote the observed flow rates when the pressure drop is 10 in. and 11 in., respectively. The expected flow rate at x = 11 is .925, so Y_1 − Y_2 has expected value .830 − .925 = −.095 and standard deviation √((.025)^2 + (.025)^2) = .035355. Thus
P(Y_1 > Y_2) = P(Y_1 − Y_2 > 0) = P(Z > (0 − (−.095))/.035355) = P(Z > 2.69) = .0036.
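The probabilities in parts (d) and (e) can be checked numerically (a sketch assuming scipy is available; not part of the original solution):

    from math import sqrt
    from scipy.stats import norm

    sigma = 0.025
    mean_10 = -0.12 + 0.095 * 10          # expected flow rate when x = 10
    mean_11 = -0.12 + 0.095 * 11          # expected flow rate when x = 11

    print(norm.sf(0.835, loc=mean_10, scale=sigma))   # P(Y > .835), about .4207
    print(norm.sf(0.840, loc=mean_10, scale=sigma))   # P(Y > .840), about .3446

    sd_diff = sqrt(sigma**2 + sigma**2)               # sd of Y1 - Y2
    print(norm.sf(0, loc=mean_10 - mean_11, scale=sd_diff))   # P(Y1 > Y2), about .0036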

Example 3 (Ex 13): The accompanying data on x = current density (mA/cm^2) and y = rate of deposition (µm/min) appeared in an article. Do you agree with the article's author that "a linear relationship was obtained from the tin-lead rate of deposition as a function of current density"? Explain your reasoning.
X 20 40 60 80
y .24 1.20 1.71 2.22
Solution: For this data set, n = 4, Σx_i = 200, Σy_i = 5.37, Σx_i^2 = 12,000, Σy_i^2 = 9.3501, and Σx_i y_i = 333.
S_xx = 12,000 − (200)^2/4 = 2000, S_yy = 9.3501 − (5.37)^2/4 = 2.140875, and S_xy = 333 − (200)(5.37)/4 = 64.5.
The slope is b = S_xy / S_xx = 64.5/2000 = .03225 and the intercept is a = 5.37/4 − (.03225)(200/4) = −.270.
SSE = S_yy − b S_xy = 2.140875 − (.03225)(64.5) = .060750.
r^2 = 1 − SSE/SST = 1 − .060750/2.140875 = .972. This is a very high value of r^2, which confirms the author's claim that there is a strong linear relationship between the two variables.
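The computations in this example can be reproduced with a few lines of Python (an illustrative sketch):

    x = [20, 40, 60, 80]
    y = [0.24, 1.20, 1.71, 2.22]

    n = len(x)
    Sxx = sum(xi * xi for xi in x) - sum(x) ** 2 / n
    Syy = sum(yi * yi for yi in y) - sum(y) ** 2 / n
    Sxy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n

    b = Sxy / Sxx                      # slope, about .03225
    a = sum(y) / n - b * sum(x) / n    # intercept, about -.270
    SSE = Syy - b * Sxy
    print(1 - SSE / Syy)               # r squared, about .972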

Example 4 (Ex 19): The following data are representative of those reported in an article, with x = burner area liberation rate (MBtu/hr-ft^2) and y = NOx emission rate (ppm):
X 100 125 125 150 150 200 200 250 250 300 300 350 400 400
Y 150 140 180 210 190 320 280 400 430 440 390 600 610 670
a. Assuming that the simple linear regression model is valid,
obtain the least squares estimate of the true regression line.
b. What is the estimate of expected NOx emission rate when
burner area liberation rate equals 225?
c. Estimate the amount by which you expect the NOx emission rate to change when the burner area liberation rate is decreased by 50.
d. Would you use the estimated regression line to predict emission
rate for a liberation rate of 500? Why or why not?
Solution:
n = 14, Σx_i = 3300, Σy_i = 5010, Σx_i^2 = 913,750, Σy_i^2 = 2,207,100, and Σx_i y_i = 1,413,500.
a. b = (n Σx_i y_i − (Σx_i)(Σy_i)) / (n Σx_i^2 − (Σx_i)^2) = 3,256,000/1,902,500 = 1.71143233 and a = ȳ − b x̄ = −45.55190543, so we use the equation ŷ = −45.5519 + 1.7114x.
b. ŷ(225) = −45.5519 + 1.7114(225) = 339.51.
c. Estimated expected change = −50 b = −85.57.
d. No; the value 500 is outside the range of x values for which observations were available (the danger of extrapolation).
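A short script reproducing parts (a)-(c) (an illustrative sketch, not part of the original solution):

    x = [100, 125, 125, 150, 150, 200, 200, 250, 250, 300, 300, 350, 400, 400]
    y = [150, 140, 180, 210, 190, 320, 280, 400, 430, 440, 390, 600, 610, 670]

    n = len(x)
    b = (n * sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y)) / (n * sum(xi * xi for xi in x) - sum(x) ** 2)
    a = sum(y) / n - b * sum(x) / n

    print(a, b)            # about -45.55 and 1.7114
    print(a + b * 225)     # estimated NOx emission rate at x = 225, about 339.5
    print(-50 * b)         # expected change when x decreases by 50, about -85.6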
Homework:
Sec 12.1: 3, 8, 9
Sec 12.2: 12, 14, 16
Sec 12.5: 58, 62, 65
