
Correlation and Simple Linear Regression

Department of Health Informatics


BINF 5210
Spring 2011

Correlation Analysis
• Correlation analysis measures the linear association (the degree
to which they are related) between two quantitative variables
measured on the same subjects.
• For example, if you want to examine the relationship between
the height and weight of a group of children ages 8 to 10 to
investigate physical growth, correlation analysis is an
appropriate choice.
• Plotting the variables of interest in a scatter plot and then
examining the relationship visually is one way of examining
correlation. It is a recommended practice.
• Pearson’s product-moment correlation (Pearson’s correlation)
is the most commonly used measure of correlation between
two quantitative variables.

Pearson’s Correlation
• Pearson’s product-moment correlation measured on a
population is ρ (Greek letter rho), the degree to which the
two quantitative variables of interest are linearly related.
When estimated from a sample, it is denoted by r (Pearson’s r).
• It measures the extent (degree) to which the points in
a scatter plot of the variables of interest fall on a
straight line (linear relationship).
• The value of Pearson’s correlation ranges from -1 to +1 (+1
for perfect positive correlation, -1 for perfect negative
correlation, and 0 for no linear correlation).

Calculating Pearson’s Correlation
• Let’s say we want to calculate the correlation
between variable X and variable Y (both
quantitative variables) in a sample dataset.
The formula to calculate Pearson’s correlation
is:
$$ r = \frac{\sum XY - \dfrac{(\sum X)(\sum Y)}{N}}{\sqrt{\left(\sum X^2 - \dfrac{(\sum X)^2}{N}\right)\left(\sum Y^2 - \dfrac{(\sum Y)^2}{N}\right)}} $$
where N is the number of elements (paired observations or subjects).
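As a rough check of the formula, the sketch below computes r directly from the sums on a tiny made-up dataset. The dataset name toy, the variables x and y, and the five data points are hypothetical and only serve as an illustration. For these values ∑X = 15, ∑Y = 20, ∑XY = 66, ∑X2 = 55, ∑Y2 = 86 and N = 5, so r = 6 / √60 ≈ 0.7746.

```sas
/* Hypothetical toy data, for illustration only */
data toy;
   input x y;
   datalines;
1 2
2 4
3 5
4 4
5 5
;
run;

/* Pearson's r computed term by term from the formula above */
proc sql;
   select (sum(x*y) - sum(x)*sum(y)/count(*))
          / sqrt( (sum(x*x) - sum(x)**2/count(*))
                * (sum(y*y) - sum(y)**2/count(*)) ) as r
   from toy;
quit;
```

Running PROC CORR on the same dataset (VAR x y;) should report the same value, which is a convenient way to confirm the hand calculation.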

Correlation in SAS
• SAS provides a procedure called PROC CORR for
estimating the correlation coefficient between two
quantitative variables.
• It tests the hypotheses:
H0 (Null Hypothesis): There is no linear relationship
between the two variables of interest (population correlation ρ = 0)
Ha (Alternative Hypothesis): There is a linear
relationship between the two variables of interest
(ρ ≠ 0)
and determines whether the estimated correlation coefficient r
differs significantly from 0.
PROC CORR Assumptions
• The data are a random sample drawn from a bivariate
normal population.
• If the population is not normal, use a nonparametric
correlation procedure (the most common is Spearman’s rho).
• PROC CORR also provides Spearman’s rho, but you have to
request it with the SPEARMAN option on the PROC CORR statement.
• Spearman’s correlation can be calculated by ranking
the values of each variable of interest and then applying
the Pearson’s correlation coefficient formula to the ranks.
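The rank-then-Pearson idea can be sketched in SAS as follows. This assumes a dataset named mydata with numeric variables AGE and SYSTOLIC and no missing values; the output dataset ranked and the rank variables r_age and r_systolic are hypothetical names chosen for this illustration.

```sas
/* Step 1: replace each value by its rank (ties get the average rank by default) */
proc rank data=mydata out=ranked;
   var age systolic;
   ranks r_age r_systolic;
run;

/* Step 2: Pearson's correlation computed on the ranks */
proc corr data=ranked pearson nosimple;
   var r_age r_systolic;
run;

/* For comparison: Spearman's rho requested directly */
proc corr data=mydata spearman nosimple;
   var age systolic;
run;
```

Under these assumptions the coefficient from step 2 should agree with the Spearman coefficient from the last step.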

PROC CORR Structure
• PROC CORR <<option(s)>>;
<<statement(s)>>;

• <<option(s)>>: commonly used ones are

DATA=your_dataset_name
SPEARMAN (to request the nonparametric correlation for a non-normal population)
NOSIMPLE (not to display simple statistics), NOPROB (not to display the
probability value, p-value)

• Statements could be:

VAR variables of interest;
BY variable; /* Optional (for a categorical variable, it produces output
separately for each category level; the data must be sorted by this variable) */
WITH variables; /* Optional (when you want the correlations between the variables
in the VAR list and other variables (listed in the WITH list)) */
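Putting the options and statements together, a call might look like the sketch below. It assumes a dataset named mydata with the variables GENDER, AGE, SYSTOLIC and DIASTOLIC (the names used in the example later in these slides); the sorted copy mydata_sorted is a hypothetical name.

```sas
/* BY-group processing expects the data to be sorted by the BY variable */
proc sort data=mydata out=mydata_sorted;
   by gender;
run;

/* Pearson and Spearman correlations of AGE with each blood pressure
   measure, reported separately for each level of GENDER */
proc corr data=mydata_sorted pearson spearman nosimple;
   var systolic diastolic;   /* columns of the correlation table */
   with age;                 /* rows of the correlation table */
   by gender;
run;
```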

PROC CORR Example
• Consider the data set for this assignment (external text file smoke_drug in my document). All
column values are tab-delimited, and all data are numeric.
First column is Gender (1=male and 2= female)
Second column is Age
Third column is Race of subjects (1=white, 2= black, 3= Hispanic, 4= other)
Fourth column is smoker? (1=yes 2=no)
Fifth column is Systolic blood pressure
Sixth column is diastolic blood pressure
• As an investigator, you are interested in examining the relationship between age and
(systolic and diastolic) blood pressure of randomly selected subjects as part of a
clinical trial.

PROC CORR in SAS
• First we read the data into SAS:
DATA MYDATA;
   INFILE "C:\smoke_drug.txt" DLM='09'x; /* '09'x is the tab character */
   INPUT GENDER AGE RACE SMOKER SYSTOLIC DIASTOLIC;
RUN;
• Then we run PROC CORR on the variables of interest:
ODS HTML;
PROC CORR DATA=MYDATA;
   VAR AGE SYSTOLIC DIASTOLIC; /* list of variables we are interested in; this generates
                                  pairwise correlations (3 pairs) */
   TITLE 'CORRELATION OF AGE AND SYSTOLIC, AGE AND DIASTOLIC, AND SYSTOLIC AND DIASTOLIC BLOOD PRESSURE';
RUN;
ODS HTML CLOSE;

PROC CORR OUTPUT

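[Figure: PROC CORR output with numbered tables (1 to 3) and the callouts below]
• Table of basic (simple) statistics: you can tell SAS not to display this table
by using the NOSIMPLE option in PROC CORR.
• Table 3: this is the correlation matrix containing the pairwise Pearson
correlations between each of the 3 variables. Each cell reports the value of r
and its significance level (p-value).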


PROC CORR Output Interpretation
• Table 3 is the one of interest in this example.
• We can see that the correlation between AGE and
SYSTOLIC pressure is 0.511150 (a positive
relationship, but not a perfect one) and the p-value
is <.0001 (so we reject the null hypothesis
that there is no linear relationship between AGE
and SYSTOLIC).
• Read the correlations the same way for the other
combinations of variables (i.e., AGE and DIASTOLIC, and
so on).
PROC CORR
• If your population is non-normal, use Spearman’s correlation
by specifying SPEARMAN as a PROC CORR option, either together with
PEARSON or by itself.
• Using the WITH statement: sometimes we want to examine the
correlations of one or more variables with other variables. The WITH
statement is handy in such cases.
• Let’s say in our example you want to examine the correlations of AGE
with multiple measures of systolic blood pressure (say 4
measures: Sys1, Sys2, Sys3, Sys4). In this case you include a
WITH statement, for example,
PROC CORR DATA=data_set_name;
VAR Sys1-Sys4;
WITH AGE;
RUN;
This will produce correlations between AGE and each of Sys1, Sys2,
Sys3, and Sys4.

PROC CORR - Plot
• You should always produce a scatter plot of
the variables of interest to check the
correlation between them visually.
• One option is to enable ODS GRAPHICS and request plots
(the PLOTS= option) in PROC CORR. This generates the graphs and
plots associated with the PROC CORR output.
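A minimal sketch of the ODS GRAPHICS approach, assuming SAS 9.2 or later and the dataset mydata with the variables AGE, SYSTOLIC and DIASTOLIC from the earlier example:

```sas
ods graphics on;

/* Scatter plot matrix of all variable pairs,
   with a histogram of each variable on the diagonal */
proc corr data=mydata nosimple plots=matrix(histogram);
   var age systolic diastolic;
run;

ods graphics off;
```

PLOTS=SCATTER can be used instead to get one scatter plot per pair of variables.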

Linear Regression
• Correlation measures the strength of the linear
relationship between two variables; regression
analysis uses this relationship to predict the
dependent variable from the independent variable.
• Simple linear regression is appropriate when you want to
predict the value of a dependent variable from
a given value of a single independent variable.
• For example, as part of an investigation of the effects
of physical exercise (amount of time spent
exercising daily) on BMI, a simple linear regression can
be used to predict BMI from the amount of time
spent exercising each day.

Simple Linear Regression Model Basics
• The following mathematical equation of a (theoretical) line
describes the association (relationship) between an
independent variable x and a dependent variable Y:
Y = α + βx + ε
(α is the Y intercept, β is the slope of the line, and ε is the error,
whose mean is 0 and whose variance is fixed. If the slope
β = 0, then there is no predictive relationship between the
variables.)
When we perform a regression analysis on data to predict a
variable, we actually calculate a regression line that describes
the relationship between the variables of interest. This
regression line is an estimate of the theoretical line
above.

Simple Linear Regression Model Basics

The regression line we calculate has the following equation:

Y’ = a + bx
where a and b are the least-squares estimates of the
parameters α and β respectively, x is the given value of the
independent variable, and Y’ is the value of the dependent variable
we are trying to predict.
Note: They are called least-squares estimates because the regression line
minimizes the sum of the squared errors of the
predictions (the squared differences between the actual values
of the outcome variable and the predicted values of the
outcome variable). Please check the Lane textbook, chapter 15,
for details.
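For reference, the quantity being minimized and the resulting estimates can be written out as below. These are the standard least-squares formulas for a single predictor, written with the same sums that appear in Pearson’s formula; the subscript i (indexing the N observations) and the bars for sample means are notation added here.

```latex
\text{SSE} = \sum_{i=1}^{N} \left( Y_i - a - b\,x_i \right)^2

b = \frac{\sum xY - \dfrac{(\sum x)(\sum Y)}{N}}{\sum x^2 - \dfrac{(\sum x)^2}{N}},
\qquad
a = \bar{Y} - b\,\bar{x}
```

The numerator of b is the same as the numerator of r, so the slope and the correlation always have the same sign.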

Simple Linear Regression in SAS
• SAS provides a procedure called PROC REG for regression
analysis of data.
• When we specify the regression model in SAS by naming
the dependent variable and the independent variable, SAS
fits a regression line (the same equation as in the previous
slide) to the given dataset and predicts the value of the dependent
variable.
• The first step is to check whether any relationship exists
between the variables specified in SAS.
H0 (Null Hypothesis): There is no predictable linear
relationship between the two variables (that is,
the slope of the equation, β = 0).
Ha (Alternative Hypothesis): There is a predictable linear
relationship between the two variables of interest (the slope
of the equation is not 0).
Simple Linear Regression in SAS
Therefore, H0: β = 0 and
Ha: β ≠ 0
If we have a small p-value (usually < 0.05), then we
can reject the null hypothesis and conclude that β
≠ 0 (there is a predictable relationship between
the variables).
In other words, knowing the value of
the independent variable helps to
predict the value of the dependent variable.
• The next step is to use this linear relationship
to predict the value of the dependent variable.
Simple Linear Regression Using PROC REG

• SAS PROC REG takes the following structure:

PROC REG <<option(s)>>;
<<statement(s)>>;

• <<option(s)>>: such as DATA= and SIMPLE (to include basic statistics in the output)
• <<statement(s)>>: MODEL, BY (for the group variable)
• The MODEL statement has the structure:
MODEL dependent_var = independent_var / options;
• Some of the MODEL statement options are (check the
SAS manual and know their functions):
P (requests a table of predicted values), R (residual analysis), CLM (confidence limits for the expected
value), CLI (prediction limits for individual values of the dependent variable), INCLUDE, SELECTION,
SLSTAY, SLENTRY
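As an illustration of the structure above, the sketch below requests basic statistics plus predicted values, residuals and both kinds of limits. It assumes the dataset mydata with the variables DIASTOLIC and SYSTOLIC from the correlation example.

```sas
proc reg data=mydata simple;            /* SIMPLE: basic statistics for each variable */
   model diastolic = systolic / p r clm cli;
   /* P   : table of predicted values
      R   : residual analysis
      CLM : confidence limits for the expected (mean) value
      CLI : prediction limits for an individual value */
run;
quit;                                   /* PROC REG is interactive; QUIT ends it */
```

INCLUDE, SELECTION, SLSTAY and SLENTRY control variable selection, so they matter mainly when the model has several candidate predictors.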

Simple linear regression using PROC REG
Example
• Let’s consider the data we used for the correlation
analysis example. In this example we are
interested in whether systolic blood pressure can be
used to predict diastolic blood pressure. After
reading the dataset into SAS, we run the
following PROC REG:
ODS HTML;
TITLE 'SIMPLE LINEAR REGRESSION EXAMPLE';
PROC REG DATA=MYDATA;
   /* Specify the outcome (dependent) variable and the predictor (independent)
      variable for the regression model: what you want to predict from what */
   MODEL DIASTOLIC = SYSTOLIC;
RUN;
ODS HTML CLOSE;

PROC REG Output
[Figure: PROC REG output for MODEL DIASTOLIC = SYSTOLIC, with numbered tables (1 to 4) and the callouts below]
• We cannot predict DIASTOLIC from SYSTOLIC because there is no
significant relationship between them, so we do not go any further.
Only if we could reject the null hypothesis would we have continued.
• Table 3 tells you about the strength of the relationship. R-square is the
measure of how strong the relationship between the variables is. The closer
it is to 1, the stronger the relationship. In this example the value is
very small, 0.0141 (about 0.01; there is barely a relationship). This is how
to interpret this value: only 1% of the variability in the DIASTOLIC variable
can be explained by the SYSTOLIC variable.
• Table 4 is associated with the regression model Y’ = a + bx. The Intercept
row gives the least-squares estimate of a and the SYSTOLIC row gives the
least-squares estimate of b.
• The statistical test on the SYSTOLIC row is the test of β = 0. We cannot
reject the null hypothesis, so there is no relationship (the slope is not
significantly different from 0).

PROC REG Output
• When we read PROC REG output, 3 things are usually of interest for understanding the results
(also shown in the output on the last slide):
1- R-square (tells you the strength of the relationship)
2- Slope test (check the row for the independent variable in the regression table
and check whether its p-value is significant. This is the
test of whether the slope = 0)
3- Parameter estimates: Intercept and independent variable (estimates of a and b for the regression
equation used for prediction)

• From this example, we conclude that there is no significant predictive linear relationship between
Diastolic and Systolic blood pressure in our dataset, because the t-test of slope = 0 is not significant.
(The slope relates the dependent and the independent variables; the intercept is not involved. So we check
the row for the independent variable in the regression table and see whether its p-value is
significant.)
• Therefore, we cannot predict Diastolic from Systolic.
• So we stop the analysis at this conclusion; we do not need to examine the parameter estimates and the regression
equation for prediction.

How to Interpret PROC REG Output
(When slope is not zero)
• Now consider the following output. Think of it as
the same table but for different data values,
and suppose that this time the slope test produced a
smaller p-value (let’s say .04).
• This is a made-up output in which only the p-value of
Systolic is changed to make it significant, so that the
null hypothesis is rejected. It is only meant to show how to
read and interpret regression output when the
slope is not 0 and how to predict the dependent
variable from the value of the independent variable using
the regression line equation.
(Again, this is just for explanation; it is not correct output for a dataset.)

PROC REG Output (Slope≠0)
[Figure: made-up PROC REG output with numbered tables (1 to 4) and the callouts below]
• Check R-square to see the strength of the relationship. Then check the
slope test p-value (Pr > |t|) in the last column of the regression table (4),
in the row for the independent variable, in this case SYSTOLIC. Then report
the parameter estimates (a and b) from the third column of the regression
table (4). The test for the Intercept is not of interest, but its value is.
• Table 3 tells you about the strength of the relationship. R-square is the
measure of how strong the relationship between the variables is. The closer
it is to 1, the stronger the relationship. In this example the value is
0.0141. This is how to interpret this value: only 1% of the variability in
the DIASTOLIC variable can be explained by the SYSTOLIC variable.
• Table 4 is associated with the regression model Y’ = a + bx. The Intercept
row gives the least-squares estimate of a and the SYSTOLIC row gives the
least-squares estimate of b.
• The statistical test on the SYSTOLIC row is the test of β = 0. We can
reject the null hypothesis, so there is a relationship (the slope is
significantly different from 0).
• DIASTOLIC = 1.47628 + 0.00110 * SYSTOLIC (this is the predictive equation)

PROC REG Output (Slope≠0)
• In this case, the parameter estimates are 1.47628
and 0.00110 for the Intercept and SYSTOLIC
(remember these are the least-squares estimates
of a and b in the regression line equation).
• So we can write the equation of the
regression line as:
Y’ = a + b * (value of x)

DIASTOLIC = 1.47628 + 0.00110 * SYSTOLIC

where DIASTOLIC is the outcome variable and SYSTOLIC is the predictor.
In this situation, we would use this regression equation to predict the dependent variable from the
values of the independent variable.
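Applying the equation is plain arithmetic, as the sketch below shows with a DATA step. The systolic values 110, 120 and 130 and the names predicted and pred_diastolic are hypothetical, and the coefficients come from the made-up output, so the predictions are not realistic blood pressures. For SYSTOLIC = 120 the equation gives 1.47628 + 0.00110 * 120 = 1.60828.

```sas
/* Hypothetical new systolic values to predict from */
data predicted;
   input systolic;
   /* Y' = a + b * x with the estimates from the made-up output */
   pred_diastolic = 1.47628 + 0.00110 * systolic;
   datalines;
110
120
130
;
run;

proc print data=predicted noobs;
run;
```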

Simple Linear Regression Plot
• It is always recommended to create a plot of
the variables of interest to visually inspect the
linear relationship in the data. The regression
line shows how the predicted
value of the dependent variable changes for each unit
change in the independent variable.
• You can plot the variables by adding a PLOT
statement after the MODEL statement in PROC REG
(PLOT DEPENDENTVARIABLE * INDEPENDENTVARIABLE;) or by running
PROC GPLOT as a separate step.
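Both approaches might look like the sketch below. It assumes the dataset mydata from the earlier example and that SAS/GRAPH is available in your installation; the SYMBOL1 settings are one possible choice, not the only one.

```sas
/* Option 1: PLOT statement inside PROC REG, after the MODEL statement */
proc reg data=mydata;
   model diastolic = systolic;
   plot diastolic * systolic;
run;
quit;

/* Option 2: PROC GPLOT as a separate step;
   INTERPOL=RL draws a fitted linear regression line through the points */
symbol1 value=dot interpol=rl;
proc gplot data=mydata;
   plot diastolic * systolic;
run;
quit;
```

In both PLOT statements the variable before the asterisk goes on the vertical axis and the one after it on the horizontal axis.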
Assignment
• Read The Little SAS Book, chapter 8 (8.4 for
correlation, 8.5 and 8.6 for regression).
• Learn how to create a plot for regression, how
to read the plot elements, and how to read
regression output.
• An assignment on these topics is coming
soon.
