
Jimma University,

College of Health Sciences,


Department of Epidemiology

Graduate program in Public Health


(Addis Ababa-ABH Campus)

Analysis of Continuous Outcome Data:
Linear Correlation and Regression Analysis
By Teshome Kabeta (BSc, MPH)

Correlation Analysis
Correlation is the method of analysis to use when studying the possible association between two continuous variables.
The standard method (Pearson correlation) leads to a quantity called r that can take any value from -1 to +1.
The correlation coefficient r measures the degree of 'straight-line' association between the values of two variables.

Correlation Analysis
The correlation between two variables is
positive if higher values of one variable are associated with higher values of the other, and
negative if one variable tends to be lower as the other gets higher.
A correlation of around zero indicates that there is no linear relation between the values of the two variables.

Correlation Analysis
It is important to note that a correlation between variables shows that they are associated, but does not necessarily imply a cause and effect relationship.
In essence, r is a measure of the scatter of the points around an underlying linear trend: the greater the spread of the points, the lower the correlation.

Scatter Plots and Correlation
Correlation analysis is used to measure the strength of the association (linear relationship) between two variables.
A scatter plot is used to show the relationship between two variables.
Correlation is only concerned with the strength of the relationship and its direction.
We consider the two variables equally; as a result no causal effect is implied.

Fig. 1: Systolic blood pressure against age
(Scatter plot omitted)

If we have two variables X and Y, the correlation between them, denoted r(X, Y), is given by:

r = Σ(xᵢ − x̄)(yᵢ − ȳ) / √[Σ(xᵢ − x̄)² · Σ(yᵢ − ȳ)²]
  = [ΣXY − (ΣX)(ΣY)/n] / √{[ΣX² − (ΣX)²/n] [ΣY² − (ΣY)²/n]}

where xᵢ and yᵢ are the values of X and Y for the ith individual.
The equation is clearly symmetric: it does not matter which variable is X and which is Y.

Calculate r for SBP and age
Let X be age and Y be SBP.
From the data we have the following:
mean of X = 40.41, mean of Y = 130.54,
Σ(x − x̄)² = 58,956.549, Σ(y − ȳ)² = 129,177.971, Σ(x − x̄)(y − ȳ) = 26,871.501,
Sx = 10.980, Sy = 16.253, Sxy = 54.958 and n = 490
r = 0.308
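As a quick check, r can be computed directly from these sums of squared deviations. A minimal sketch in plain Python, using the values from the slide:

```python
import math

# Sums of squared deviations from the age/SBP example (n = 490)
sxy = 26871.501   # sum of (x - mean_x)(y - mean_y)
sxx = 58956.549   # sum of (x - mean_x)^2
syy = 129177.971  # sum of (y - mean_y)^2

r = sxy / math.sqrt(sxx * syy)
print(round(r, 3))  # 0.308
```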

Hypothesis testing on ρ
Under the null hypothesis that there is no association in the population (ρ = 0), it can be shown that the quantity

t = r √[(n − 2) / (1 − r²)]

has a t distribution with n − 2 degrees of freedom.
For the age and SBP data, t = 6.99, df = 488, p < 0.001.

Interpretation of correlation
Correlation coefficients lie within the range −1 to +1, with the mid-point of zero indicating no linear association between the two variables.
A very small correlation does not necessarily indicate that two variables are not associated, however.
To be sure of this we should study a plot of the data, because it is possible that the two variables display a non-linear relationship (for example, cyclical or curved).

Interpretation of correlation
In such cases r will underestimate the association, as it is a measure of linear association alone. Consider transforming the data to obtain a linear relation before calculating r.
Very small r values may be statistically significant in moderately large samples, but whether they are clinically relevant must be considered on the merits of each case.

Interpretation of correlation
One way of looking at the correlation that helps to temper over-enthusiasm is to calculate 100r² (the coefficient of determination), which is the percentage of variability in the data that is 'explained' by the association.
So a correlation of 0.7 implies that just about half (49%) of the variability may be put down to the observed association, and so on.
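The arithmetic behind that statement is a one-liner; a minimal sketch:

```python
r = 0.7
coef_determination = 100 * r ** 2  # percentage of variability 'explained'
print(round(coef_determination))  # 49
```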

Exercise: The following data show the respective weights of a sample of 12 fathers and their oldest sons. Compute the correlation coefficient between the two weight measurements.

Wt of father (X)   Wt of son (Y)   X²     Y²     XY
65                 68              4225   4624   4420
63                 66              3969   4356   4158
67                 68              4489   4624   4556
64                 65              4096   4225   4160
68                 69              4624   4761   4692
62                 66              3844   4356   4092
70                 68              4900   4624   4760
66                 65              4356   4225   4290
68                 71              4624   5041   4828
67                 67              4489   4489   4489
69                 68              4761   4624   4692
71                 70              5041   4900   4970

Scatter Plot
(Scatter plot of sons' weight against fathers' weight omitted)

Calculating r
The correlation coefficient for the data on fathers and sons will be:
Basic values from the data:
ΣX = 800, ΣX² = 53,418, ΣY = 811, ΣY² = 54,849, ΣXY = 54,107

Σ(x − x̄)(y − ȳ) = ΣXY − (ΣX)(ΣY)/n = 54,107 − (800 × 811)/12 = 40.33
Σ(x − x̄)² = ΣX² − (ΣX)²/n = 53,418 − (800)²/12 = 84.67
Σ(y − ȳ)² = ΣY² − (ΣY)²/n = 54,849 − (811)²/12 = 38.92

r = 40.33 / √[(84.67)(38.92)] = 0.703
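The same result can be reproduced from the raw weights in the exercise table; a minimal sketch in plain Python:

```python
import math

# Father and son weights from the exercise table (n = 12)
x = [65, 63, 67, 64, 68, 62, 70, 66, 68, 67, 69, 71]
y = [68, 66, 68, 65, 69, 66, 68, 65, 71, 67, 68, 70]
n = len(x)

# Corrected sums of squares and products
sxy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n
sxx = sum(xi ** 2 for xi in x) - sum(x) ** 2 / n
syy = sum(yi ** 2 for yi in y) - sum(y) ** 2 / n

r = sxy / math.sqrt(sxx * syy)
print(round(r, 3))  # 0.703
```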

Significance test
We need to check that the correlation is unlikely to have arisen due to sampling variation.
Testing whether the calculated Pearson's correlation coefficient is significantly different from zero proceeds as follows:

Significance test
For the fathers and sons weight data:
Ho: ρ = 0
HA: ρ ≠ 0
Test statistic:

t = r √[(n − 2) / (1 − r²)] = 0.703 × √[(12 − 2) / (1 − (0.703)²)] = 3.12

p < 0.01, i.e., the correlation coefficient is significantly different from 0.
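Assuming the r just computed, the t statistic can be checked with a short sketch (carrying full precision gives a value a hair above the slide's 3.12, which used rounded intermediate values):

```python
import math

r, n = 0.703, 12
t = r * math.sqrt((n - 2) / (1 - r ** 2))
print(round(t, 2))
```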

Inference on Correlation Coefficient
(Illustrative scatter plots omitted.) The sign of r matches the sign of the regression slope b: r < 0 when b < 0, r = 0 when b = 0, and r > 0 when b > 0.


Pearson's r Correlation
As a rule of thumb, the following guidelines on strength of relationship are often useful (though many experts would somewhat disagree on the choice of boundaries).

Correlation value   Interpretation
0.70 or higher      Very strong relationship
0.40 to 0.69        Strong relationship
0.30 to 0.39        Moderate relationship
0.20 to 0.29        Weak relationship
0.01 to 0.19        No or negligible relationship
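These boundaries can be encoded directly; a sketch (the function name is our own, and the cut-offs are the rule-of-thumb values above applied to |r|):

```python
def interpret_r(r):
    """Map |r| to the rule-of-thumb strength labels in the table above."""
    a = abs(r)
    if a >= 0.70:
        return "Very strong relationship"
    if a >= 0.40:
        return "Strong relationship"
    if a >= 0.30:
        return "Moderate relationship"
    if a >= 0.20:
        return "Weak relationship"
    return "No or negligible relationship"

print(interpret_r(0.703))  # Very strong relationship
```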

Simple linear regression
Data are frequently given in pairs where one variable is dependent on the other, e.g.:
weight and height
house rent and income
yield and fertilizer
It is usually desirable to express their relationship by finding an appropriate mathematical equation. To form the equation, collect the data on these two variables. Let the observations be denoted by (X1, Y1), (X2, Y2), (X3, Y3), . . ., (Xn, Yn).
However, before trying to quantify this relationship, plot the data and get an idea of their nature.
Plot these points on the XY plane and obtain the scatter diagram.

Simple linear regression
(Scatter diagram figure omitted)

Simple linear regression
Simple regression uses the relationship between the two variables to obtain information about one variable from the values of the other.
The equation showing this type of relationship is called the simple linear regression equation.

Simple linear regression
The scatter diagram helps to choose the curve that best fits the data. The simplest type of curve is a straight line, whose equation is given by ŷ = a + bxᵢ.
This equation is a point estimate of Y = α + βXᵢ.
b = the sample regression coefficient of Y on X
β = the population regression coefficient of Y on X
"Y on X" means Y is the dependent variable and X is the independent one.

Simple linear regression
a is the estimated average value of y when the value of x is zero (provided that 0 is inside the data range considered).
Otherwise a has no direct practical interpretation; it simply fixes the position of the fitted line.

Simple linear regression
Regression is a method of estimating the numerical relationship between variables.
For example, we would like to know what the mean or expected weight is for factory workers of a given height, and what increase in weight is associated with a unit increase in height.
The purpose of a regression equation is to use one variable to predict another.
How is the regression equation determined?

The method of least squares
The model Y = α + βX + ε refers to the population from which the sample was drawn.
The regression line ŷ = a + bx is an estimate of the population regression line that was found using ordinary least squares (OLS).
The difference between the given score Y and the predicted score ŷ is known as the error of estimation.

The method of least squares
The regression line, or the line which best fits the given pairs of scores, is the line for which the sum of the squares of these errors of estimation (Σεᵢ²) is minimized.
That is, of all the curves, the one with minimum Σεᵢ² is the least squares regression line which best fits the given data.
The least squares regression line for the set of observations (X1, Y1), (X2, Y2), (X3, Y3), . . ., (Xn, Yn) has the equation ŷ = a + bxᵢ.

The method of least squares
The values a and b in the equation are constants, i.e., their values are fixed.
The constant a indicates the value of y when x = 0. It is also called the y intercept.
The value of b shows the slope of the regression line and gives us a measure of the change in y for a unit change in x. This slope (b) is frequently termed the regression coefficient of Y on X.
If we know the values of a and b, we can easily compute the value of ŷ for any given value of X.

The method of least squares
The constants a and b are determined by solving simultaneously the normal equations:

ΣY = an + bΣX
ΣXY = aΣX + bΣX²

which give:

a = Ȳ − bX̄
b = Σ(X − X̄)(Y − Ȳ) / Σ(X − X̄)² = [nΣXY − ΣXΣY] / [nΣX² − (ΣX)²]

SLR-example 1
Heights of 10 fathers (X) together with their oldest sons (Y) are given below (in inches). Find the regression of Y on X.

Father (X)   Oldest son (Y)   Product (XY)   X²
63           65               4095           3969
64           67               4288           4096
70           69               4830           4900
72           70               5040           5184
65           64               4160           4225
67           68               4556           4489
68           71               4828           4624
66           63               4158           4356
70           70               4900           4900
71           72               5112           5041
Total 676    679              45967          45784

SLR-example 1
b = [nΣXY − ΣXΣY] / [nΣX² − (ΣX)²]
  = [10(45967) − (676 × 679)] / [10(45784) − (676)²]
  = (459670 − 459004) / (457840 − 456976)
  = 666 / 864
  = 0.77
a = Ȳ − bX̄ = 679/10 − 0.77 × (676/10) = 67.9 − 52.05 = 15.85
Therefore, ŷ = 15.85 + 0.77X
The regression coefficient of Y on X (i.e., 0.77) tells us the change in Y due to a unit change in X.

SLR-example 1
Estimate the height of the oldest son for a father's height of 70 inches:
ŷ = 15.85 + 0.77(70) = 69.75 inches
NB: 1) n is the number of pairs of X and Y scores which are used in determining the regression line. In the above example, n = 10.
2) Be careful to distinguish between ΣX² and (ΣX)².
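The whole fit-and-predict calculation for example 1 can be sketched in plain Python:

```python
# Heights of fathers (x) and oldest sons (y) from SLR example 1
x = [63, 64, 70, 72, 65, 67, 68, 66, 70, 71]
y = [65, 67, 69, 70, 64, 68, 71, 63, 70, 72]
n = len(x)

# Least squares estimates from the normal equations
b = (n * sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y)) / \
    (n * sum(xi ** 2 for xi in x) - sum(x) ** 2)
a = sum(y) / n - b * sum(x) / n

y_hat_70 = a + b * 70  # predicted son's height for a 70-inch father
print(round(b, 2), round(y_hat_70, 2))  # 0.77 69.75
```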

Standard error of regression coefficients
The calculated values for a and b are sample estimates of the values of the intercept and slope from the regression line describing the linear association between x and y in the whole population.
Therefore, they are subject to sampling variation, and their precision is measured by their standard errors.

Standard errors of regression coefficients
The SEs of the regression coefficients are given by:

se(a) = S √[ΣX² / (n Σ(x − x̄)²)]  and  se(b) = S / √[Σ(x − x̄)²]

where

S² = [Σ(y − ȳ)² − b Σ(x − x̄)(y − ȳ)] / (n − 2)

S is the standard deviation of the points about the line. It has (n − 2) degrees of freedom.
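A sketch computing S, se(a) and se(b) for the example 1 height data, using the formulas above (the numeric results are our own computation, not taken from the slides):

```python
import math

# Example 1 data: fathers' and sons' heights (inches)
x = [63, 64, 70, 72, 65, 67, 68, 66, 70, 71]
y = [65, 67, 69, 70, 64, 68, 71, 63, 70, 72]
n = len(x)
mx, my = sum(x) / n, sum(y) / n

sxx = sum((xi - mx) ** 2 for xi in x)
syy = sum((yi - my) ** 2 for yi in y)
sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))

b = sxy / sxx
s = math.sqrt((syy - b * sxy) / (n - 2))          # SD about the line, n-2 df
se_b = s / math.sqrt(sxx)
se_a = s * math.sqrt(sum(xi ** 2 for xi in x) / (n * sxx))
print(round(s, 2), round(se_a, 2), round(se_b, 3))
```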

Example: (1 − α)100% CI for regression coefficient
Consider the age and SBP data and the fitted regression model:
SBP = 112.12 + 0.456(Age)
S = 15.48, se(a) = 2.67, se(b) = 0.064
A 95% confidence interval for the slope is:
estimated slope ± t₁₋α/₂ × (SE of slope)
0.456 ± 1.96 × 0.064 = (0.331, 0.581)
The 95% CI does not include 0 => there is sufficient evidence that age affects SBP.
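The interval arithmetic itself is trivial to check; a sketch using the slide's numbers:

```python
b, se_b = 0.456, 0.064   # slope and its SE from the age/SBP model
z = 1.96                  # large-sample critical value used on the slide
ci = (b - z * se_b, b + z * se_b)
print(tuple(round(v, 3) for v in ci))  # (0.331, 0.581)
```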

Significance test for β
Ho: β = 0
H1: β ≠ 0
If the null hypothesis is true, then the statistic

t = (observed slope − 0) / (SE of observed slope)

will follow a t distribution with (n − 2) df.

Significance test for β: example
For the age and SBP data,
b = 0.456 and se(b) = 0.064, then
t = 7.15 with (n − 2) = 488 df and p < 0.001.
Decision: reject Ho.

Exercise
What do you say about the relationship between r and b?
Hint: Use the formulas for r and b in terms of sums of squares (Sx, Sy and Sxy).

Simple linear regression
Explained, unexplained (error), and total variations:
If all the points on the scatter diagram fell on the regression line, we could say that the entire variance of Y is due to variations in X.
Explained variation = Σ(ŷ − Ȳ)²
The measure of the scatter of points away from the regression line gives an idea of the variance in Y that is not explained with the help of the regression equation.
Unexplained variation = Σ(Y − ŷ)²

Simple linear regression
The variation of the Y's about their mean can also be computed. The quantity Σ(Y − Ȳ)² is called the total variation.
Explained variation + unexplained variation = total variation
The ratio of the explained variation to the total variation measures how well the linear regression line fits the given pairs of scores.
It is called the coefficient of determination and is denoted by r²:

r² = explained variation / total variation
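The decomposition can be verified numerically on the example 1 height data; a sketch (our own computation under the formulas above):

```python
# Example 1 data: fathers' and sons' heights (inches)
x = [63, 64, 70, 72, 65, 67, 68, 66, 70, 71]
y = [65, 67, 69, 70, 64, 68, 71, 63, 70, 72]
n = len(x)
mx, my = sum(x) / n, sum(y) / n

# Least squares fit
b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
    sum((xi - mx) ** 2 for xi in x)
a = my - b * mx
y_hat = [a + b * xi for xi in x]

explained = sum((yh - my) ** 2 for yh in y_hat)
unexplained = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
total = sum((yi - my) ** 2 for yi in y)

r_squared = explained / total
print(round(explained + unexplained, 6) == round(total, 6))  # True
```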

Simple linear regression
The explained variation is never negative and is never larger than the total variation.
Therefore, r² is always between 0 and 1. If the explained variation equals 0, r² = 0.
If r² is known, then r = ±√r². The sign of r is the same as the sign of b from the regression equation.

Simple linear regression model
The relationship y = α + βx is not expected to hold exactly for every individual, but the average value of y for a given value of x is E(Y|x) = α + βx.
An error term ε, which represents the variance of the dependent variable among all individuals with a given x, is introduced into the model.
The full linear regression model then takes the form y = α + βx + ε.

Simple linear regression model
ε is the residual, normally distributed with mean 0 and variance σ².
One interpretation of the regression line is that for a subject with independent value x, the corresponding dependent value y will be normally distributed with mean α + βx and variance σ².
If σ² were 0, then every point would fall exactly on the regression line, whereas the larger σ² is, the more scatter occurs about the regression line.

Assumptions
The assumptions made when using this method are:
The relationship between the outcome and the explanatory variable is linear, or at least approximately linear;
At each value of the explanatory variable, the outcomes follow a normal distribution;
The variance of the outcome is constant for all values of the explanatory variable.

Assumptions of Linear Regression
1. Linear relationship between outcome (y) and explanatory variable (x)
2. Outcome variable (y) should be normally distributed for each value of the explanatory variable (x)
3. Standard deviation of y should be approximately the same for each value of x
4. Fixed, independent observations (e.g., only one point per person)
5. No outlier distortion

Assumptions of linear regression (illustrative scatter plots omitted)
Assumption 1: linear relationship
Assumption 2: Y normally distributed at each value of x
Assumption 3: same variance at each value of x

Diagnostic Tests for the Regression Assumptions
Linearity: regression curve fitting; no level shifts
Independence of observations: runs test
Normality of the residuals: Shapiro-Wilk or Kolmogorov-Smirnov test
Homogeneity of variance of the residuals: White's general specification test
No autocorrelation of residuals: Durbin-Watson or ACF of residuals
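Some of these checks are easy to sketch by hand. For instance, the Durbin-Watson statistic for autocorrelation of residuals is Σ(eₜ − eₜ₋₁)² / Σeₜ², where values near 2 suggest no autocorrelation. A sketch on the example 1 residuals (our own illustrative computation):

```python
# Example 1 data: fathers' and sons' heights (inches)
x = [63, 64, 70, 72, 65, 67, 68, 66, 70, 71]
y = [65, 67, 69, 70, 64, 68, 71, 63, 70, 72]
n = len(x)
mx, my = sum(x) / n, sum(y) / n

# Least squares fit and residuals
b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
    sum((xi - mx) ** 2 for xi in x)
a = my - b * mx
e = [yi - (a + b * xi) for xi, yi in zip(x, y)]  # residuals

# Durbin-Watson statistic: always between 0 and 4; near 2 = no autocorrelation
dw = sum((e[i] - e[i - 1]) ** 2 for i in range(1, n)) / sum(ei ** 2 for ei in e)
print(round(dw, 2))
```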

Diagnostic Tests for the Regression Assumptions
Plot residuals and look for high-leverage points
Lists of standardized residuals
Lists of studentized residuals
Cook's distance or leverage statistics

Testing Assumptions:
Assumption 1: linear relationship
Plot y against x to check for linearity. (Scatter plot of blood pressure against weight omitted)

Testing Assumptions:
Assumption 2: Normality
Y should be normally distributed at each value of x; equivalently, the residuals need to be normally distributed. (Scatter plot of blood pressure against weight with fitted line, R² linear = 0.16, omitted)

Testing Assumptions:
Assumption 2: Normality
A histogram of the residuals and a normal P-P plot of the regression standardized residuals can be used (dependent variable: systolic BP; Std. Dev = 1.00, Mean = 0.00, N = 127). (Figures omitted)

Testing Assumptions:
Assumption 3: Spread of y values constant over range of x values
Plot the (unstandardized) residuals against the x values to check. (Residual plot against weight omitted)

Linear Correlation & Simple Linear Regression?

Thank You!
