Analysis of Continuous Outcome Data
Linear Correlation and Regression Analysis
By Teshome Kabeta (BSc, MPH)
Correlation Analysis
Correlation is the method of analysis to use when studying the possible association between two continuous variables.
The standard method (Pearson correlation) leads to a quantity called r that can take on any value from -1 to +1.
The correlation coefficient r measures the degree of 'straight-line' association between the values of the two variables.
Correlation Analysis
The correlation between two variables is positive if higher values of one variable are associated with higher values of the other, and negative if one variable tends to be lower as the other gets higher.
Correlation Analysis
It is important to note that a correlation between variables shows that they are associated, but does not necessarily imply a cause-and-effect relationship.
In essence, r is a measure of the scatter of the points around an underlying linear trend: the greater the spread of the points, the lower the correlation.
In symbols, with n pairs (xᵢ, yᵢ):

r = Σ(xᵢ − x̄)(yᵢ − ȳ) / √[Σ(xᵢ − x̄)² · Σ(yᵢ − ȳ)²]

or, in the computational form:

r = [ΣXY − (ΣX ΣY)/n] / √{[ΣX² − (ΣX)²/n] [ΣY² − (ΣY)²/n]}
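The two forms of the formula give the same value, which can be confirmed numerically. A minimal Python sketch, using made-up data (not from the slides):

```python
from math import sqrt

# Made-up illustrative data
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)
mx, my = sum(x) / n, sum(y) / n

# Deviation form: cross-products over the root of the sums of squares
sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
sxx = sum((xi - mx) ** 2 for xi in x)
syy = sum((yi - my) ** 2 for yi in y)
r_dev = sxy / sqrt(sxx * syy)

# Computational form using raw sums
num = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
den = sqrt((sum(a * a for a in x) - sum(x) ** 2 / n)
           * (sum(b * b for b in y) - sum(y) ** 2 / n))
r_raw = num / den
```

Both computations produce the same r up to floating-point rounding.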
[Scatter plot example: r = 0.308]
Hypothesis testing on ρ
Under the null hypothesis that there is no association in the population (ρ = 0), it can be shown that the quantity

t = r √[(n − 2)/(1 − r²)]

follows a t distribution with n − 2 degrees of freedom.
Interpretation of correlation
Correlation coefficients lie within the range -1 to +1, with the mid-point of zero indicating no linear association between the two variables.
A very small correlation does not necessarily indicate that two variables are not associated, however. To be sure of this we should study a plot of the data, because it is possible that the two variables display a non-linear relationship (for example cyclical or curved).
Interpretation of correlation
In such cases r will underestimate the association, as it is a measure of linear association alone. Consider transforming the data to obtain a linear relation before calculating r.
Very small r values may be statistically significant in moderately large samples, but whether they are clinically relevant must be considered on the merits of each case.
Interpretation of correlation
One way of looking at the correlation that helps to temper over-enthusiasm is to calculate 100r² (the coefficient of determination, also called goodness of fit), which is the percentage of variability in the data that is 'explained' by the association.
So a correlation of 0.7 implies that just about half (49%) of the variability may be put down to the observed association, and so on.
Weights of fathers (X) and their sons (Y):

| Wt of father (X) | Wt of son (Y) | X² | Y² | XY |
|---|---|---|---|---|
| 65 | 68 | 4225 | 4624 | 4420 |
| 63 | 66 | 3969 | 4356 | 4158 |
| 67 | 68 | 4489 | 4624 | 4556 |
| 64 | 65 | 4096 | 4225 | 4160 |
| 68 | 69 | 4624 | 4761 | 4692 |
| 62 | 66 | 3844 | 4356 | 4092 |
| 70 | 68 | 4900 | 4624 | 4760 |
| 66 | 65 | 4356 | 4225 | 4290 |
| 68 | 71 | 4624 | 5041 | 4828 |
| 67 | 67 | 4489 | 4489 | 4489 |
| 69 | 68 | 4761 | 4624 | 4692 |
| 71 | 70 | 5041 | 4900 | 4970 |
| Total 800 | 811 | 53418 | 54849 | 54107 |
Scatter Plot
[Scatter plot of sons' weights against fathers' weights]
Calculating r
The correlation coefficient for the data on fathers and sons is calculated from the basic sums:

ΣX = 800, ΣX² = 53,418, ΣY = 811, ΣY² = 54,849, ΣXY = 54,107, n = 12

r = [54,107 − (800)(811)/12] / √{[53,418 − 800²/12] [54,849 − 811²/12]} = 40.33 / √(84.67 × 38.92) = 0.703
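The same arithmetic can be reproduced from these sums in a few lines of Python:

```python
from math import sqrt

# Sums from the fathers'/sons' weight table (n = 12)
n = 12
Sx, Sxx = 800, 53418
Sy, Syy = 811, 54849
Sxy = 54107

num = Sxy - Sx * Sy / n
den = sqrt((Sxx - Sx ** 2 / n) * (Syy - Sy ** 2 / n))
r = num / den  # ≈ 0.703, as on the slide
```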
Significance test
We need to check that the correlation is unlikely to have arisen due to sampling variation. Testing whether the calculated Pearson's correlation coefficient is significant or not proceeds as follows.
Significance test
For the fathers' and sons' weight data:

H₀: ρ = 0
Hₐ: ρ ≠ 0

Test statistic:

t = r √[(n − 2)/(1 − r²)] = 0.703 × √[(12 − 2)/(1 − 0.703²)] = 3.12
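As a quick check, the t statistic can be recomputed from r = 0.703 and n = 12:

```python
from math import sqrt

# Values from the fathers'/sons' example
r, n = 0.703, 12
t = r * sqrt((n - 2) / (1 - r ** 2))
# t is compared with the t distribution on n - 2 = 10 degrees of freedom
```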
[Plots illustrating the sign of r and the regression slope b: r = 0 when b = 0, r > 0 when b > 0, and r < 0 when b < 0]
Pearson's r Correlation
As a rule of thumb, the following guidelines on strength of relationship are often useful (though many experts would somewhat disagree on the choice of boundaries):

| Correlation value | Interpretation |
|---|---|
| 0.70 or higher | Very strong relationship |
| 0.40 to 0.69 | Strong relationship |
| 0.30 to 0.39 | Moderate relationship |
| 0.20 to 0.29 | Weak relationship |
| 0.01 to 0.19 | No or negligible relationship |
The least-squares estimates a and b satisfy the normal equations

ΣY = an + bΣX
ΣXY = aΣX + bΣX²

which give

a = Ȳ − bX̄
b = Σ(X − X̄)(Y − Ȳ) / Σ(X − X̄)² = [nΣXY − ΣX ΣY] / [nΣX² − (ΣX)²]
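The normal equations can also be solved directly as a 2×2 linear system and checked against the closed-form b. A sketch with made-up numbers:

```python
# Made-up data for illustration
x = [2.0, 4.0, 6.0, 8.0]
y = [3.0, 7.0, 8.0, 12.0]
n = len(x)
Sx, Sy = sum(x), sum(y)
Sxx = sum(v * v for v in x)
Sxy = sum(u * v for u, v in zip(x, y))

# Closed-form estimates
b = (n * Sxy - Sx * Sy) / (n * Sxx - Sx ** 2)
a = Sy / n - b * Sx / n

# Cramer's rule on  Sy = a*n + b*Sx ;  Sxy = a*Sx + b*Sxx
det = n * Sxx - Sx * Sx
a_chk = (Sy * Sxx - Sx * Sxy) / det
b_chk = (n * Sxy - Sx * Sy) / det
```

Both routes give identical a and b, since the closed forms are exactly the solution of the normal equations.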
SLR-example 1
Heights of 10 fathers (X) together with their oldest sons (Y) are given below (in inches). Find the regression of Y on X.

| Father (X) | Son (Y) | XY | X² |
|---|---|---|---|
| 63 | 65 | 4095 | 3969 |
| 64 | 67 | 4288 | 4096 |
| 70 | 69 | 4830 | 4900 |
| 72 | 70 | 5040 | 5184 |
| 65 | 64 | 4160 | 4225 |
| 67 | 68 | 4556 | 4489 |
| 68 | 71 | 4828 | 4624 |
| 66 | 63 | 4158 | 4356 |
| 70 | 70 | 4900 | 4900 |
| 71 | 72 | 5112 | 5041 |
| Total 676 | 679 | 45967 | 45784 |
SLR-example 1

b = [nΣXY − ΣX ΣY] / [nΣX² − (ΣX)²] = (459,670 − 459,004) / (457,840 − 456,976) = 666 / 864 = 0.77

a = Ȳ − bX̄ = 679/10 − 0.77 × (676/10) = 67.9 − 52.05 = 15.85
The regression coefficient of Y on X (i.e., 0.77) tells us the change in Y due to a unit change in X.
SLR-example 1
Estimate the height of the oldest son for a father's height of 70 inches:

Ŷ = 15.85 + 0.77 × (70) = 69.75 inches

NB: 1) n is the number of pairs of X and Y scores which are used in determining the regression line. In the above example, n = 10.
2) Be careful to distinguish between ΣX² and (ΣX)².
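The whole of example 1 can be reproduced in a few lines. The sons' heights Y are recovered here from the table's XY column as XY / X for each pair:

```python
# X and XY columns from the example 1 table
X = [63, 64, 70, 72, 65, 67, 68, 66, 70, 71]
XY = [4095, 4288, 4830, 5040, 4160, 4556, 4828, 4158, 4900, 5112]
Y = [p // x for p, x in zip(XY, X)]  # sons' heights, recovered from XY / X
n = len(X)
Sx, Sy, Sxy = sum(X), sum(Y), sum(XY)
Sxx = sum(v * v for v in X)

b = (n * Sxy - Sx * Sy) / (n * Sxx - Sx ** 2)  # 666 / 864 ≈ 0.77
a = Sy / n - b * Sx / n
y_hat_70 = a + b * 70  # predicted son's height for a 70-inch father
```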
To test H₀: β = 0, the test statistic is

t = b / se(b), where se(b) = s / √Σ(x − x̄)²

and

s² = [Σ(y − ȳ)² − b² Σ(x − x̄)²] / (n − 2)

Decision: Reject H₀
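A sketch of this slope test for the example 1 data, using the formulae as reconstructed above (the Y values are those implied by the table's XY column):

```python
from math import sqrt

# Example 1 data: fathers' heights X and sons' heights Y
X = [63, 64, 70, 72, 65, 67, 68, 66, 70, 71]
Y = [65, 67, 69, 70, 64, 68, 71, 63, 70, 72]
n = len(X)
mx, my = sum(X) / n, sum(Y) / n

Sxx = sum((x - mx) ** 2 for x in X)                      # Σ(x − x̄)²
Syy = sum((y - my) ** 2 for y in Y)                      # Σ(y − ȳ)²
Sxy = sum((x - mx) * (y - my) for x, y in zip(X, Y))     # Σ(x − x̄)(y − ȳ)

b = Sxy / Sxx
s2 = (Syy - b ** 2 * Sxx) / (n - 2)  # residual variance s²
se_b = sqrt(s2) / sqrt(Sxx)
t = b / se_b  # compared with t on n - 2 = 8 degrees of freedom
```

Here t works out to about 3.5, which exceeds the usual 5% critical value on 8 df, consistent with "Reject H₀".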
Exercise
What do you say about the relationship between r and b?
Hint: Use the formulae for r and b in terms of sums of squares (Sx, Sy and Sxy).
Assumptions of Linear Regression
The assumptions made when using this method are:

1. Linear relationship between the outcome (y) and the explanatory variable (x)
2. Outcome variable (y) should be Normally distributed for each value of the explanatory variable (x)
3. Standard deviation of y should be approximately the same for each value of x
4. Fixed independent observations (e.g. only one point per person)
5. No outlier distortion
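Assumptions 2 and 3 are usually checked through the residuals, eᵢ = yᵢ − (a + bxᵢ). A minimal sketch of computing them, with made-up data:

```python
# Fit a least-squares line and extract the residuals; these are what the
# diagnostic plots that follow examine (a histogram for normality, and
# residuals against x for constant spread). Data are made up.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.2, 2.9, 4.1, 4.8, 6.2, 6.8]
n = len(x)
mx, my = sum(x) / n, sum(y) / n
b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) \
    / sum((xi - mx) ** 2 for xi in x)
a = my - b * mx
resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
# Least-squares residuals always sum to (numerically) zero
```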
Assumptions of linear regression
[Diagrams illustrating Assumption 1 (linear relationship), Assumption 2 (Y normally distributed at each value of x), and Assumption 3 (same variance at each value of x)]
Testing Assumptions:
Assumption 1: linear relationship
[Scatter plot of blood pressure against weight]
Testing Assumptions:
Assumption 2: Normality
[Scatter plot of blood pressure against weight with fitted line; R² (linear) = 0.16]
Y should be normally distributed at each value of x; equivalently, the residuals need to be normally distributed.
Testing Assumptions:
Assumption 2: Normality
[Histogram of the residuals]
Testing Assumptions:
Assumption 3: spread of y values constant over the range of x values
[Plot of unstandardized residuals against weight]
Thank You!