
Jimma University,

College of Health Sciences,


Department of Epidemiology

Graduate program in Public Health


(Addis Ababa-ABH Campus)

Analysis of Continuous Outcome Data:
Linear Correlation and Regression Analysis
By Teshome Kabeta (BSc, MPH)

Correlation Analysis
Correlation is the method of analysis to use when studying the possible association between two continuous variables.
The standard method (Pearson correlation) leads to a quantity called r that can take any value from -1 to +1.
The correlation coefficient r measures the degree of 'straight-line' association between the values of two variables.

Correlation Analysis
The correlation between two variables is
positive if higher values of one variable are associated with higher values of the other, and
negative if one variable tends to be lower as the other gets higher.
A correlation of around zero indicates that there is no linear relation between the values of the two variables.

Correlation Analysis
It is important to note that a correlation between variables shows that they are associated, but does not necessarily imply a cause and effect relationship.
In essence, r is a measure of the scatter of the points around an underlying linear trend: the greater the spread of the points, the lower the correlation.

Scatter Plots and Correlation
Correlation analysis is used to measure the strength of the association (linear relationship) between two variables.
A scatter plot is used to show the relationship between two variables.
Correlation is only concerned with the strength of the relationship and its direction.
We consider the two variables equally; as a result no causal effect is implied.

Fig. 1: Systolic blood pressure against age
(Scatter plot omitted)

If we have two variables X and Y, the correlation between them, denoted r(X, Y), is given by:

r = Σ(xᵢ − x̄)(yᵢ − ȳ) / √[Σ(xᵢ − x̄)² · Σ(yᵢ − ȳ)²]
  = [ΣXY − (ΣX)(ΣY)/n] / √{[ΣX² − (ΣX)²/n] [ΣY² − (ΣY)²/n]}

where xᵢ and yᵢ are the values of X and Y for the ith individual.
The equation is clearly symmetric: it does not matter which variable is X and which is Y.

Calculate r for SBP and age
Let X be age and Y be SBP.
From the data we have the following:
mean of X = 40.41, mean of Y = 130.54,
Σ(x − x̄)² = 58,956.549, Σ(y − ȳ)² = 129,177.971, Σ(x − x̄)(y − ȳ) = 26,871.501,
Sx = 10.980, Sy = 16.253, Sxy = 54.958 and n = 490
r = 0.308
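As a quick check, r can be computed directly from these sums of squared deviations. A minimal sketch in plain Python, using the values from the slide:

```python
import math

# Sums of squared deviations from the age/SBP example (n = 490)
sxy = 26871.501   # sum of (x - mean_x)(y - mean_y)
sxx = 58956.549   # sum of (x - mean_x)^2
syy = 129177.971  # sum of (y - mean_y)^2

r = sxy / math.sqrt(sxx * syy)
print(round(r, 3))  # 0.308
```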

Hypothesis testing on ρ
Under the null hypothesis that there is no association in the population (ρ = 0), it can be shown that the quantity

t = r √[(n − 2) / (1 − r²)]

has a t distribution with n − 2 degrees of freedom.
For the age and SBP data, t = 6.99, df = 488, p < 0.001.

Interpretation of correlation
Correlation coefficients lie within the range −1 to +1, with the mid-point of zero indicating no linear association between the two variables.
A very small correlation does not necessarily indicate that two variables are not associated, however.
To be sure of this we should study a plot of the data, because it is possible that the two variables display a non-linear relationship (for example, cyclical or curved).

Interpretation of correlation
In such cases r will underestimate the association, as it is a measure of linear association alone. Consider transforming the data to obtain a linear relation before calculating r.
Very small r values may be statistically significant in moderately large samples, but whether they are clinically relevant must be considered on the merits of each case.

Interpretation of correlation
One way of looking at the correlation that helps to temper over-enthusiasm is to calculate 100r² (the coefficient of determination), which is the percentage of variability in the data that is 'explained' by the association.
So a correlation of 0.7 implies that just about half (49%) of the variability may be put down to the observed association, and so on.
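The arithmetic behind that statement is a one-liner; a minimal sketch:

```python
r = 0.7
coef_determination = 100 * r ** 2  # percentage of variability 'explained'
print(round(coef_determination))  # 49
```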

Exercise: The following data show the respective weights of a sample of 12 fathers and their oldest sons. Compute the correlation coefficient between the two weight measurements.

Wt of father (X)   Wt of son (Y)   X²     Y²     XY
65                 68              4225   4624   4420
63                 66              3969   4356   4158
67                 68              4489   4624   4556
64                 65              4096   4225   4160
68                 69              4624   4761   4692
62                 66              3844   4356   4092
70                 68              4900   4624   4760
66                 65              4356   4225   4290
68                 71              4624   5041   4828
67                 67              4489   4489   4489
69                 68              4761   4624   4692
71                 70              5041   4900   4970

Scatter Plot
(Scatter plot of sons' weight against fathers' weight omitted)

Calculating r
The correlation coefficient for the data on fathers and sons will be:
Basic values from the data:
ΣX = 800, ΣX² = 53,418, ΣY = 811, ΣY² = 54,849, ΣXY = 54,107

Σ(x − x̄)(y − ȳ) = ΣXY − (ΣX)(ΣY)/n = 54,107 − (800 × 811)/12 = 40.33
Σ(x − x̄)² = ΣX² − (ΣX)²/n = 53,418 − (800)²/12 = 84.67
Σ(y − ȳ)² = ΣY² − (ΣY)²/n = 54,849 − (811)²/12 = 38.92

r = 40.33 / √[(84.67)(38.92)] = 0.703
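The same result can be reproduced from the raw weights in the exercise table; a minimal sketch in plain Python:

```python
import math

# Father and son weights from the exercise table (n = 12)
x = [65, 63, 67, 64, 68, 62, 70, 66, 68, 67, 69, 71]
y = [68, 66, 68, 65, 69, 66, 68, 65, 71, 67, 68, 70]
n = len(x)

# Corrected sums of squares and products
sxy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n
sxx = sum(xi ** 2 for xi in x) - sum(x) ** 2 / n
syy = sum(yi ** 2 for yi in y) - sum(y) ** 2 / n

r = sxy / math.sqrt(sxx * syy)
print(round(r, 3))  # 0.703
```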

Significance test
We need to check that the correlation is unlikely to have arisen due to sampling variation.
Testing whether the calculated Pearson's correlation coefficient is significantly different from zero proceeds as follows:

Significance test
For the fathers and sons weight data:
Ho: ρ = 0
HA: ρ ≠ 0
Test statistic:

t = r √[(n − 2) / (1 − r²)] = 0.703 × √[(12 − 2) / (1 − (0.703)²)] = 3.12

p < 0.01, i.e., the correlation coefficient is significantly different from 0.
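Assuming the r just computed, the t statistic can be checked with a short sketch (carrying full precision gives a value a hair above the slide's 3.12, which used rounded intermediate values):

```python
import math

r, n = 0.703, 12
t = r * math.sqrt((n - 2) / (1 - r ** 2))
print(round(t, 2))
```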

Inference on Correlation Coefficient
(Illustrative scatter plots omitted.) The sign of r matches the sign of the regression slope b: r < 0 when b < 0, r = 0 when b = 0, and r > 0 when b > 0.


Pearson's r Correlation
As a rule of thumb, the following guidelines on strength of relationship are often useful (though many experts would somewhat disagree on the choice of boundaries).

Correlation value   Interpretation
0.70 or higher      Very strong relationship
0.40 to 0.69        Strong relationship
0.30 to 0.39        Moderate relationship
0.20 to 0.29        Weak relationship
0.01 to 0.19        No or negligible relationship
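These boundaries can be encoded directly; a sketch (the function name is our own, and the cut-offs are the rule-of-thumb values above applied to |r|):

```python
def interpret_r(r):
    """Map |r| to the rule-of-thumb strength labels in the table above."""
    a = abs(r)
    if a >= 0.70:
        return "Very strong relationship"
    if a >= 0.40:
        return "Strong relationship"
    if a >= 0.30:
        return "Moderate relationship"
    if a >= 0.20:
        return "Weak relationship"
    return "No or negligible relationship"

print(interpret_r(0.703))  # Very strong relationship
```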

Simple linear regression
Data are frequently given in pairs where one variable is dependent on the other, e.g.:
weight and height
house rent and income
yield and fertilizer
It is usually desirable to express their relationship by finding an appropriate mathematical equation. To form the equation, collect the data on these two variables. Let the observations be denoted by (X1, Y1), (X2, Y2), (X3, Y3), . . ., (Xn, Yn).
However, before trying to quantify this relationship, plot the data and get an idea of their nature.
Plot these points on the XY plane and obtain the scatter diagram.

Simple linear regression
(Scatter diagram figure omitted)

Simple linear regression
Simple regression uses the relationship between the two variables to obtain information about one variable from the values of the other.
The equation showing this type of relationship is called the simple linear regression equation.

Simple linear regression
The scatter diagram helps to choose the curve that best fits the data. The simplest type of curve is a straight line, whose equation is given by ŷ = a + bxᵢ.
This equation is a point estimate of Y = α + βXᵢ.
b = the sample regression coefficient of Y on X
β = the population regression coefficient of Y on X
"Y on X" means Y is the dependent variable and X is the independent one.

Simple linear regression
a is the estimated average value of y when the value of x is zero (provided that 0 is inside the data range considered).
Otherwise a has no direct practical interpretation; it simply fixes the position of the fitted line.

Simple linear regression
Regression is a method of estimating the numerical relationship between variables.
For example, we would like to know what the mean or expected weight is for factory workers of a given height, and what increase in weight is associated with a unit increase in height.
The purpose of a regression equation is to use one variable to predict another.
How is the regression equation determined?

The method of least squares
The model Y = α + βX + ε refers to the population from which the sample was drawn.
The regression line ŷ = a + bx is an estimate of the population regression line that was found using ordinary least squares (OLS).
The difference between the given score Y and the predicted score ŷ is known as the error of estimation.

The method of least squares
The regression line, or the line which best fits the given pairs of scores, is the line for which the sum of the squares of these errors of estimation (Σεᵢ²) is minimized.
That is, of all the curves, the one with minimum Σεᵢ² is the least squares regression line which best fits the given data.
The least squares regression line for the set of observations (X1, Y1), (X2, Y2), (X3, Y3), . . ., (Xn, Yn) has the equation ŷ = a + bxᵢ.

The method of least squares
The values a and b in the equation are constants, i.e., their values are fixed.
The constant a indicates the value of y when x = 0. It is also called the y intercept.
The value of b shows the slope of the regression line and gives us a measure of the change in y for a unit change in x. This slope (b) is frequently termed the regression coefficient of Y on X.
If we know the values of a and b, we can easily compute the value of ŷ for any given value of X.

The method of least squares
The constants a and b are determined by solving simultaneously the normal equations:

ΣY = an + bΣX
ΣXY = aΣX + bΣX²

which give:

a = Ȳ − bX̄
b = Σ(X − X̄)(Y − Ȳ) / Σ(X − X̄)² = [nΣXY − ΣXΣY] / [nΣX² − (ΣX)²]

SLR-example 1
Heights of 10 fathers (X) together with their oldest sons (Y) are given below (in inches). Find the regression of Y on X.

Father (X)   Oldest son (Y)   Product (XY)   X²
63           65               4095           3969
64           67               4288           4096
70           69               4830           4900
72           70               5040           5184
65           64               4160           4225
67           68               4556           4489
68           71               4828           4624
66           63               4158           4356
70           70               4900           4900
71           72               5112           5041
Total 676    679              45967          45784

SLR-example 1
b = [nΣXY − ΣXΣY] / [nΣX² − (ΣX)²]
  = [10(45967) − (676 × 679)] / [10(45784) − (676)²]
  = (459670 − 459004) / (457840 − 456976)
  = 666 / 864
  = 0.77
a = Ȳ − bX̄ = 679/10 − 0.77 × (676/10) = 67.9 − 52.05 = 15.85
Therefore, ŷ = 15.85 + 0.77X
The regression coefficient of Y on X (i.e., 0.77) tells us the change in Y due to a unit change in X.

SLR-example 1
Estimate the height of the oldest son for a father's height of 70 inches:
ŷ = 15.85 + 0.77(70) = 69.75 inches
NB: 1) n is the number of pairs of X and Y scores which are used in determining the regression line. In the above example, n = 10.
2) Be careful to distinguish between ΣX² and (ΣX)².
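The whole fit-and-predict calculation for example 1 can be sketched in plain Python:

```python
# Heights of fathers (x) and oldest sons (y) from SLR example 1
x = [63, 64, 70, 72, 65, 67, 68, 66, 70, 71]
y = [65, 67, 69, 70, 64, 68, 71, 63, 70, 72]
n = len(x)

# Least squares estimates from the normal equations
b = (n * sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y)) / \
    (n * sum(xi ** 2 for xi in x) - sum(x) ** 2)
a = sum(y) / n - b * sum(x) / n

y_hat_70 = a + b * 70  # predicted son's height for a 70-inch father
print(round(b, 2), round(y_hat_70, 2))  # 0.77 69.75
```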

Standard error of regression coefficients
The calculated values for a and b are sample estimates of the values of the intercept and slope from the regression line describing the linear association between x and y in the whole population.
Therefore, they are subject to sampling variation, and their precision is measured by their standard errors.

Standard errors of regression coefficients
The SEs of the regression coefficients are given by:

se(a) = S √[ΣX² / (n Σ(x − x̄)²)]  and  se(b) = S / √[Σ(x − x̄)²]

where

S² = [Σ(y − ȳ)² − b Σ(x − x̄)(y − ȳ)] / (n − 2)

S is the standard deviation of the points about the line. It has (n − 2) degrees of freedom.
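A sketch computing S, se(a) and se(b) for the example 1 height data, using the formulas above (the numeric results are our own computation, not taken from the slides):

```python
import math

# Example 1 data: fathers' and sons' heights (inches)
x = [63, 64, 70, 72, 65, 67, 68, 66, 70, 71]
y = [65, 67, 69, 70, 64, 68, 71, 63, 70, 72]
n = len(x)
mx, my = sum(x) / n, sum(y) / n

sxx = sum((xi - mx) ** 2 for xi in x)
syy = sum((yi - my) ** 2 for yi in y)
sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))

b = sxy / sxx
s = math.sqrt((syy - b * sxy) / (n - 2))          # SD about the line, n-2 df
se_b = s / math.sqrt(sxx)
se_a = s * math.sqrt(sum(xi ** 2 for xi in x) / (n * sxx))
print(round(s, 2), round(se_a, 2), round(se_b, 3))
```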

Example: (1 − α)100% CI for regression coefficient
Consider the age and SBP data and the fitted regression model:
SBP = 112.12 + 0.456(Age)
S = 15.48, se(a) = 2.67, se(b) = 0.064
A 95% confidence interval for the slope is:
estimated slope ± t₁₋α/₂ × (SE of slope)
0.456 ± 1.96 × 0.064 = (0.331, 0.581)
The 95% CI does not include 0 => there is sufficient evidence that age affects SBP.
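The interval arithmetic itself is trivial to check; a sketch using the slide's numbers:

```python
b, se_b = 0.456, 0.064   # slope and its SE from the age/SBP model
z = 1.96                  # large-sample critical value used on the slide
ci = (b - z * se_b, b + z * se_b)
print(tuple(round(v, 3) for v in ci))  # (0.331, 0.581)
```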

Significance test for β
Ho: β = 0
H1: β ≠ 0
If the null hypothesis is true, then the statistic

t = (observed slope − 0) / (SE of observed slope)

will follow a t distribution with (n − 2) df.

Significance test for β: example
For the age and SBP data,
b = 0.456 and se(b) = 0.064, then
t = 7.15 with (n − 2) = 488 df and p < 0.001.
Decision: reject Ho.

Exercise
What do you say about the relationship between r and b?
Hint: Use the formulas for r and b in terms of sums of squares (Sx, Sy and Sxy).

Simple linear regression
Explained, unexplained (error), and total variations:
If all the points on the scatter diagram fell on the regression line, we could say that the entire variance of Y is due to variations in X.
Explained variation = Σ(ŷ − Ȳ)²
The measure of the scatter of points away from the regression line gives an idea of the variance in Y that is not explained with the help of the regression equation.
Unexplained variation = Σ(Y − ŷ)²

Simple linear regression
The variation of the Y's about their mean can also be computed. The quantity Σ(Y − Ȳ)² is called the total variation.
Explained variation + unexplained variation = total variation
The ratio of the explained variation to the total variation measures how well the linear regression line fits the given pairs of scores.
It is called the coefficient of determination and is denoted by r²:

r² = explained variation / total variation
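The decomposition can be verified numerically on the example 1 height data; a sketch (our own computation under the formulas above):

```python
# Example 1 data: fathers' and sons' heights (inches)
x = [63, 64, 70, 72, 65, 67, 68, 66, 70, 71]
y = [65, 67, 69, 70, 64, 68, 71, 63, 70, 72]
n = len(x)
mx, my = sum(x) / n, sum(y) / n

# Least squares fit
b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
    sum((xi - mx) ** 2 for xi in x)
a = my - b * mx
y_hat = [a + b * xi for xi in x]

explained = sum((yh - my) ** 2 for yh in y_hat)
unexplained = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
total = sum((yi - my) ** 2 for yi in y)

r_squared = explained / total
print(round(explained + unexplained, 6) == round(total, 6))  # True
```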

Simple linear regression
The explained variation is never negative and is never larger than the total variation.
Therefore, r² is always between 0 and 1. If the explained variation equals 0, r² = 0.
If r² is known, then r = ±√r². The sign of r is the same as the sign of b from the regression equation.

Simple linear regression model
The relationship y = α + βx is not expected to hold exactly for every individual, but the average value of y for a given value of x is E(Y|x) = α + βx.
An error term ε, which represents the variance of the dependent variable among all individuals with a given x, is introduced into the model.
The full linear regression model then takes the form y = α + βx + ε.

Simple linear regression model
ε is the residual, normally distributed with mean 0 and variance σ².
One interpretation of the regression line is that for a subject with independent value x, the corresponding dependent value y will be normally distributed with mean α + βx and variance σ².
If σ² were 0, then every point would fall exactly on the regression line, whereas the larger σ² is, the more scatter occurs about the regression line.

Assumptions
The assumptions made when using this method are:
The relationship between the outcome and the explanatory variable is linear, or at least approximately linear;
At each value of the explanatory variable, the outcomes follow a normal distribution;
The variance of the outcome is constant for all values of the explanatory variable.

Assumptions of Linear Regression
1. Linear relationship between outcome (y) and explanatory variable (x)
2. Outcome variable (y) should be normally distributed for each value of the explanatory variable (x)
3. Standard deviation of y should be approximately the same for each value of x
4. Fixed, independent observations (e.g., only one point per person)
5. No outlier distortion

Assumptions of linear regression (illustrative scatter plots omitted)
Assumption 1: linear relationship
Assumption 2: Y normally distributed at each value of x
Assumption 3: same variance at each value of x

Diagnostic Tests for the Regression Assumptions
Linearity: regression curve fitting; no level shifts
Independence of observations: runs test
Normality of the residuals: Shapiro-Wilk or Kolmogorov-Smirnov test
Homogeneity of variance of the residuals: White's general specification test
No autocorrelation of residuals: Durbin-Watson or ACF of residuals
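Some of these checks are easy to sketch by hand. For instance, the Durbin-Watson statistic for autocorrelation of residuals is Σ(eₜ − eₜ₋₁)² / Σeₜ², where values near 2 suggest no autocorrelation. A sketch on the example 1 residuals (our own illustrative computation):

```python
# Example 1 data: fathers' and sons' heights (inches)
x = [63, 64, 70, 72, 65, 67, 68, 66, 70, 71]
y = [65, 67, 69, 70, 64, 68, 71, 63, 70, 72]
n = len(x)
mx, my = sum(x) / n, sum(y) / n

# Least squares fit and residuals
b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
    sum((xi - mx) ** 2 for xi in x)
a = my - b * mx
e = [yi - (a + b * xi) for xi, yi in zip(x, y)]  # residuals

# Durbin-Watson statistic: always between 0 and 4; near 2 = no autocorrelation
dw = sum((e[i] - e[i - 1]) ** 2 for i in range(1, n)) / sum(ei ** 2 for ei in e)
print(round(dw, 2))
```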

Diagnostic Tests for the Regression Assumptions
Plot residuals and look for high-leverage points
Lists of standardized residuals
Lists of studentized residuals
Cook's distance or leverage statistics

Testing Assumptions:
Assumption 1: linear relationship
Plot y against x to check for linearity. (Scatter plot of blood pressure against weight omitted)

Testing Assumptions:
Assumption 2: Normality
Y should be normally distributed at each value of x; equivalently, the residuals need to be normally distributed. (Scatter plot of blood pressure against weight with fitted line, R² linear = 0.16, omitted)

Testing Assumptions:
Assumption 2: Normality
A histogram of the residuals and a normal P-P plot of the regression standardized residuals can be used (dependent variable: systolic BP; Std. Dev = 1.00, Mean = 0.00, N = 127). (Figures omitted)

Testing Assumptions:
Assumption 3: Spread of y values constant over range of x values
Plot the (unstandardized) residuals against the x values to check. (Residual plot against weight omitted)

Linear Correlation & Simple Linear Regression?

Thank You!
