BINF 5210

Spring 2011

Correlation Analysis

• Correlation analysis measures the linear association (the degree to which they are related) between two quantitative variables measured on the same subjects.

• For example, to investigate physical growth you might want to see the relationship between the height and weight of a group of children aged 8 to 10; correlation analysis is a suitable option for this.

• Plotting the variables of interest in a scatter plot and examining the relationship visually is one way of examining correlation. It is a recommended practice.

• Pearson’s product-moment correlation (Pearson’s correlation) is the most commonly used measure of correlation between two quantitative variables.

Pearson’s Correlation

• Pearson’s product-moment correlation measured on a population is denoted ρ (Greek letter rho); it measures the degree to which the two quantitative variables of interest are linearly related. When estimated from a sample, it is denoted r (Pearson’s r).

• It measures the extent to which the points in a scatter plot of the variables of interest fall on a straight line (linear relationship).

• Pearson’s correlation ranges from -1 to +1: +1 is a perfect positive correlation, -1 is a perfect negative correlation, and 0 means no linear correlation (zero correlation).

Calculating Pearson’s Correlation

• Let’s say we want to measure the correlation between variable X and variable Y (both quantitative variables) in a sample dataset. The formula for Pearson’s correlation is:

$$ r = \frac{\sum XY - \dfrac{(\sum X)(\sum Y)}{N}}{\sqrt{\left(\sum X^2 - \dfrac{(\sum X)^2}{N}\right)\left(\sum Y^2 - \dfrac{(\sum Y)^2}{N}\right)}} $$

where N is the number of elements (observations or subjects).
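As a rough check of the formula for Pearson’s r, the sums can be accumulated in a DATA step and compared with what PROC CORR reports. This is only a sketch: the data set TOY and its five (X, Y) pairs are made up for illustration and are not part of the course data.

```sas
/* Hypothetical toy data, made up for illustration only */
data toy;
  input x y;
  datalines;
1 2
2 4
3 5
4 4
5 5
;
run;

/* Accumulate N, sum(X), sum(Y), sum(XY), sum(X^2), sum(Y^2) with sum statements */
data _null_;
  set toy end=last;
  n + 1;
  sx + x;
  sy + y;
  sxy + x*y;
  sxx + x*x;
  syy + y*y;
  if last then do;
    /* Same computational formula as on the slide */
    r = (sxy - sx*sy/n) / sqrt((sxx - sx**2/n) * (syy - sy**2/n));
    put 'Pearson r computed by hand: ' r;   /* 6 / sqrt(60), about 0.775 */
  end;
run;

/* PROC CORR should report the same value for the X-Y pair */
proc corr data=toy nosimple;
  var x y;
run;
```

The hand-computed value is written to the SAS log, while PROC CORR writes its table to the output destination.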

Correlation in SAS

• SAS provides a procedure called PROC CORR for computing the correlation coefficient between two quantitative variables.

• It tests the hypotheses:

H0 (Null Hypothesis): There is no linear relationship between the two variables of interest (population correlation ρ = 0)

Ha (Alternative Hypothesis): There is a linear relationship between the two variables of interest (population correlation ρ ≠ 0)

and determines whether the estimated correlation coefficient (Pearson’s r) differs significantly from 0.

PROC CORR Assumptions

• The data are a random sample drawn from a bivariate normal population.

• If the population is not normal, use a nonparametric correlation measure (the most common is Spearman’s rho).

• PROC CORR also provides Spearman’s rho, but you have to request it with the SPEARMAN option on the PROC CORR statement.

• Spearman’s correlation can be calculated by ranking the values of each variable of interest and then applying Pearson’s correlation formula to the ranks.
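The description of Spearman’s correlation as Pearson’s correlation applied to ranks can be tried out directly. The following is a minimal sketch, assuming a made-up data set TOY and hypothetical rank variable names X_RANK and Y_RANK; the two PROC CORR steps are expected to show the same coefficient.

```sas
/* Hypothetical toy data, made up for illustration only */
data toy;
  input x y;
  datalines;
1 2
2 4
3 5
4 4
5 5
;
run;

/* Replace each value by its rank; tied values get the mean rank (TIES=MEAN) */
proc rank data=toy out=ranked ties=mean;
  var x y;
  ranks x_rank y_rank;
run;

/* Pearson's correlation applied to the ranks ... */
proc corr data=ranked pearson nosimple;
  var x_rank y_rank;
run;

/* ... compared with Spearman's rho requested directly on the raw values */
proc corr data=toy spearman nosimple;
  var x y;
run;
```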

PROC CORR Structure

• PROC CORR <<option(s)>>;

<<statement(s)>>;

Options on the PROC CORR statement:

DATA=your_dataset_name

SPEARMAN (requests the nonparametric Spearman correlation, for a non-normal population)

NOSIMPLE (do not display the simple statistics), NOPROB (do not display the probability value, p-value)

Statements:

VAR variables of interest;

BY variable; /* Optional (for a categorical variable, it produces output separately for each category level; the data must be sorted by that variable) */

WITH variables; /* Optional (when you want the correlations between the variables in the VAR list and other variables, listed in the WITH list) */
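To see how these pieces fit together, here is a small sketch that combines several of the options and the BY statement. The data set TOY and the categorical variable GROUP are hypothetical, made up for illustration.

```sas
/* Hypothetical toy data: GROUP is a made-up categorical variable */
data toy;
  input group x y;
  datalines;
1 1 2
1 2 4
1 3 5
1 4 4
2 1 5
2 2 3
2 3 4
2 4 1
;
run;

/* BY-group processing expects the data to be sorted by the BY variable */
proc sort data=toy;
  by group;
run;

/* Pearson and Spearman correlations, without the simple statistics table,
   reported separately for each level of GROUP */
proc corr data=toy pearson spearman nosimple;
  var x y;
  by group;
run;
```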

PROC CORR Example

• Consider the data set for this assignment (external text file smoke_drug in my document). The column values are tab delimited, and all data are numeric.

First column is Gender (1=male, 2=female)

Second column is Age

Third column is Race of subjects (1=white, 2=black, 3=Hispanic, 4=other)

Fourth column is Smoker (1=yes, 2=no)

Fifth column is Systolic blood pressure

Sixth column is Diastolic blood pressure

• As an investigator, you want to examine the relationship between age and blood pressure (systolic and diastolic) of randomly selected subjects as part of a clinical trial.

PROC CORR in SAS

• First we read the data into SAS:

DATA MYDATA;
INFILE "C:\smoke_drug.txt" DLM='09'x; /* '09'x is the tab character */
INPUT GENDER AGE RACE SMOKER SYSTOLIC DIASTOLIC;
RUN;

• Then we run PROC CORR on the variables of interest:

ODS HTML;
PROC CORR DATA=MYDATA;
VAR AGE SYSTOLIC DIASTOLIC; /* list of variables we are interested in; this generates the pairwise correlations (3 pairs) */
TITLE 'CORRELATION OF AGE AND SYSTOLIC, AGE AND DIASTOLIC, AND SYSTOLIC AND DIASTOLIC BLOOD PRESSURE';
RUN;
ODS HTML CLOSE;

PROC CORR OUTPUT

[Figure: PROC CORR output with callouts]

• Table of basic (simple) statistics: you can tell SAS not to display this table by using the NOSIMPLE option in PROC CORR.

• Correlation matrix: contains the 3 pairwise Pearson correlations among the 3 variables.

PROC CORR Output Interpretation

• Table 3 is the one of interest in this example.

• We can see that the correlation between AGE and SYSTOLIC pressure is 0.511150 (a positive relationship, but not a perfect one) and the p-value is <.0001, so we reject the null hypothesis that there is no linear relationship between AGE and SYSTOLIC.

• Read the correlations the same way for the other combinations of variables (i.e., AGE and DIASTOLIC, and so on).

PROC CORR

• If your population is not normal, use Spearman’s correlation by specifying the SPEARMAN option on the PROC CORR statement, either together with PEARSON or by itself.

• Using the WITH statement: sometimes we want to examine the correlations of one or more variables with other variables. The WITH statement is handy in such cases.

• Let’s say in our example you want to examine the correlations of AGE with multiple measures of systolic blood pressure (say 4 measures: Sys1, Sys2, Sys3, Sys4). In this case you have to include a WITH statement, for example:

PROC CORR DATA=data_set_name;
VAR Sys1-Sys4;
WITH AGE;
RUN;

This will produce the correlations between AGE and each of Sys1, Sys2, Sys3, and Sys4.
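A self-contained version of the WITH example might look like the sketch below. The data set BP and all of its values are hypothetical, invented only so that the step can be run; the SPEARMAN and PEARSON options are requested together to show both coefficients in one run.

```sas
/* Hypothetical data: one AGE value and four systolic measurements per subject */
data bp;
  input age sys1 sys2 sys3 sys4;
  datalines;
34 118 120 117 121
45 125 128 124 127
52 131 129 133 130
61 140 138 142 139
70 146 149 145 150
;
run;

/* Correlate AGE (WITH list) against each of Sys1-Sys4 (VAR list) only,
   rather than producing every pairwise correlation among the five variables */
proc corr data=bp pearson spearman nosimple;
  var sys1-sys4;
  with age;
run;
```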

PROC CORR -Plot

• You should always produce a scatter plot of the variables of interest to check the correlations between them.

• One option is to enable ODS GRAPHICS when running PROC CORR. This generates the graphs and plots associated with the PROC CORR output.
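One way this could look in practice is sketched below. It assumes a made-up data set TOY, and it assumes that the plots are requested through the PLOTS= option of PROC CORR while ODS Graphics is switched on; check the SAS documentation for the plot types available in your release.

```sas
/* Hypothetical toy data, made up for illustration only */
data toy;
  input x y;
  datalines;
1 2
2 4
3 5
4 4
5 5
;
run;

ods graphics on;

/* PLOTS=SCATTER requests a scatter plot for each pair of variables;
   PLOTS=MATRIX would request a scatter plot matrix instead */
proc corr data=toy nosimple plots=scatter;
  var x y;
run;

ods graphics off;
```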

Linear Regression

• Correlation measures the linear relationship between two variables, and regression analysis uses this relationship to predict the dependent variable from the independent variable.

• Simple linear regression is appropriate for predicting the value of a dependent variable from a given value of an independent variable.

• For example, as part of an investigation of the effects of physical exercise (amount of time spent exercising daily) on BMI, a simple linear regression can be used to predict BMI from the amount of time spent exercising daily.

Simple Linear Regression Model Basics

• The following mathematical equation of a (theoretical) line describes the association (relationship) between an independent variable X and a dependent variable Y:

Y = α + βx + ε

(α is the Y intercept, β is the slope of the line, and ε is the error, whose mean is 0 and whose variance is fixed. If the slope β = 0, then there is no predictive relationship between the variables.)

When we perform a regression analysis on data to predict a variable, we actually calculate a regression line that describes the relationship between the variables of interest. This regression line is an estimate of the theoretical line above.

Simple Linear Regression Model Basics

Y’ = a + bx

where a and b are the least squares estimates of the parameters α and β respectively, x is the given value of the independent variable, and Y’ is the value of the dependent variable we are trying to predict.

Note: They are called least squares estimates because the regression line minimizes the sum of the squared errors of prediction (the squared differences between the actual values of the outcome variable and the predicted values of the outcome variable). Please check the Lane textbook, chapter 15, for details.
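To make the least squares estimates concrete, the sketch below computes b and a by hand and then lets PROC REG fit the same line. It relies on the standard closed-form expressions for simple linear regression, which are not given on the slides: b = (∑XY – (∑X)(∑Y)/N) / (∑X² – (∑X)²/N) and a = mean(Y) – b × mean(X). The data set TOY is made up for illustration.

```sas
/* Hypothetical toy data, made up for illustration only */
data toy;
  input x y;
  datalines;
1 2
2 4
3 5
4 4
5 5
;
run;

/* Accumulate the sums, then apply the closed-form least squares formulas */
data _null_;
  set toy end=last;
  n + 1;
  sx + x;
  sy + y;
  sxy + x*y;
  sxx + x*x;
  if last then do;
    b = (sxy - sx*sy/n) / (sxx - sx**2/n);   /* slope: 6/10 = 0.6 for this data */
    a = sy/n - b*sx/n;                       /* intercept: 4 - 0.6*3 = 2.2      */
    put 'slope b = ' b ' intercept a = ' a;
  end;
run;

/* The Parameter Estimates table should show the same intercept and slope */
proc reg data=toy;
  model y = x;
run;
quit;
```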

Simple Linear Regression in SAS

• SAS provides a procedure called PROC REG for regression analysis of data.

• When we specify the regression model in SAS by naming the dependent variable and the independent variable, SAS fits a regression line (the same equation as in the previous slide) to the given dataset and predicts the value of the dependent variable.

• The first step is to check whether any relationship exists between the variables specified in SAS.

This is done by testing the null hypothesis that there is no predictive linear relationship between the variables (that is, the slope of the equation, β = 0).

Ha (Alternative Hypothesis): There is a predictive linear relationship between the two variables of interest (the slope of the equation is not 0).

Simple Linear Regression in SAS

Therefore, H0: β = 0 and Ha: β ≠ 0

If we have a small p-value (usually < .05), then we can reject the null hypothesis and conclude that β ≠ 0 (there is a predictive relationship between the variables).

In other words, knowing the value of the independent variable helps to predict the value of the dependent variable.

• The next step is to use this linear relationship to predict the value of the dependent variable.

Simple Linear Regression Using PROC REG

PROC REG <<option(s)>>;

Options include DATA= and SIMPLE (to include basic statistics in the output).

• The MODEL statement has the structure:

MODEL dependent_var = independent_var / options;

• Some of the MODEL statement options are (check the SAS manual to learn their functions):

P (requests a table of predicted values), R (residual analysis), CLM (confidence limits for the expected value), CLI (confidence limits for individual values of the dependent variable), INCLUDE, SELECTION, SLSTAY, SLENTRY
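As an illustration of how these options are written, the following sketch requests basic statistics, predicted values, residual analysis, and both kinds of confidence limits. The data set TOY is hypothetical; the model selection options (INCLUDE, SELECTION, SLSTAY, SLENTRY) are left out because they only matter when there are several candidate predictors.

```sas
/* Hypothetical toy data, made up for illustration only */
data toy;
  input x y;
  datalines;
1 2
2 4
3 5
4 4
5 5
;
run;

/* SIMPLE adds basic statistics; the MODEL options follow the slash */
proc reg data=toy simple;
  model y = x / p r clm cli;
run;
quit;
```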

Simple linear regression using PROC REG

Example

• Let’s consider the data we used for the correlation analysis example. In this example we want to see whether systolic blood pressure can be used to predict diastolic blood pressure. After reading the dataset into SAS, we run the following PROC REG:

ODS HTML;
TITLE 'SIMPLE LINEAR REGRESSION EXAMPLE';
PROC REG DATA=MYDATA;
MODEL DIASTOLIC = SYSTOLIC;
/* Specifies the outcome (dependent) variable and the predictor (independent) variable for the regression model: what you want to predict, from what */
RUN;
QUIT; /* PROC REG is interactive; QUIT ends the procedure */
ODS HTML CLOSE;

PROC REG Output

[Figure: PROC REG output with numbered callouts (1 to 4)]

• We cannot predict DIASTOLIC from SYSTOLIC because there is no significant relationship between them, so we do not go any further. Only if we could reject the null hypothesis would we have continued here.

• This table tells you about the strength of the relationship. R-square is the measure of how strong the relationship between the variables is. The closer it is to 1, the stronger the relationship. In this example the value is very small, 0.0141 (0.01, there is barely a relationship). This is how to interpret this value: only 1% of the variability in the DIASTOLIC variable can be explained by the SYSTOLIC variable.

• This table (4) is associated with the regression model Y’ = a + bx. The Intercept row gives the least squares estimate of a, and the SYSTOLIC row gives the least squares estimate of b.

• The statistical test on the SYSTOLIC row is for β = 0. We cannot reject the null hypothesis, so there is no relationship (the slope is not significantly different from 0).

PROC REG Output

• When we read PROC REG output, three things are usually of interest for understanding the results (also shown in the output on the last slide):

1- R-square (tells you the strength of the relationship)

2- Slope (in the regression table, check the row for the independent variable and see whether the p-value for the test is significant. This is the test of whether the slope = 0)

3- Parameter estimates: Intercept and independent variable (the estimates of a and b for the regression equation used for prediction)

• From this example, we conclude that there is no significant predictive linear relationship between Diastolic and Systolic blood pressure according to our dataset, because the t-test of slope = 0 is not significant. (The slope relates the dependent and the independent variables; the intercept is not involved. So we check the row for the independent variable in the regression table and see whether its p-value is significant.)

• Therefore, we cannot predict Diastolic from Systolic.

• So we stop the analysis here; we do not need to examine the parameter estimates or the regression equation for prediction.

How to Interpret PROC REG Output

(When slope is not zero)

• Now consider the following output. Think of it as based on the same table but for different data values, and suppose that this time the slope test produced a smaller p-value (let’s say .04).

• This is a made-up output in which only the p-value of Systolic is changed to make it significant, so that the null hypothesis is rejected. It is only meant to show how to read and interpret the output of a regression when the slope is not 0, and how to predict the dependent variable from the value of the independent variable using the regression line equation.

(Again, this is just for explanation, not correct output from a dataset.)

PROC REG Output (Slope≠0)

[Figure: made-up PROC REG output with numbered callouts (1 to 4)]

• Check R-square to see the strength of the relationship. Then check the slope test p-value (Pr > |t|) in the last column of the regression table (4), in the row for the independent variable, in this case SYSTOLIC. Then report the parameter estimates (a and b) from the third column of the regression table (4). The test for the Intercept is not of interest, but its value is.

• This table tells you about the strength of the relationship. R-square is the measure of how strong the relationship between the variables is. The closer it is to 1, the stronger the relationship. In this example the value is 0.0141. This is how to interpret this value: only 1% of the variability in the DIASTOLIC variable can be explained by the SYSTOLIC variable.

• This table (4) is associated with the regression equation Y’ = a + bx (this is the predictive equation). The Intercept row gives the least squares estimate of a, and the SYSTOLIC row gives the least squares estimate of b.

• The statistical test on the SYSTOLIC row is for β = 0. We can reject the null hypothesis, so there is a relationship (the slope is significantly different from 0).

PROC REG Output (Slope≠0)

• In this case, the parameter estimates are 1.47628 and 0.00110 for the Intercept and Systolic (remember these are the least squares estimates of a and b in the regression line equation).

• So we can write the equation of the regression line as:

Y’ = a + b × (value of x)

where Y’ is the outcome variable and x is the predictor, that is, DIASTOLIC’ = 1.47628 + 0.00110 × SYSTOLIC.

In this situation, we would use this regression equation to predict the dependent variable from the values of the independent variable.
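The prediction step can be sketched in a short DATA step. The two coefficients are the ones quoted from the made-up output; the systolic values 120 and 140 and the names PREDICTED and DIASTOLIC_HAT are hypothetical, chosen only to show the arithmetic.

```sas
/* Apply Y' = a + b * x to two illustrative systolic values */
data predicted;
  a = 1.47628;   /* Intercept estimate from the made-up output */
  b = 0.00110;   /* SYSTOLIC estimate from the made-up output  */
  input systolic;
  diastolic_hat = a + b*systolic;   /* 120 -> 1.60828, 140 -> 1.63028 */
  datalines;
120
140
;
run;

proc print data=predicted noobs;
  var systolic diastolic_hat;
run;
```

Because the output is made up, the predicted values are not realistic blood pressures; the point is only how the equation is applied.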

Simple Linear Regression Plot

• It is always recommended to plot the variables of interest to visually inspect the linear relationship in the data. The regression line gives you an idea of the predicted change in the dependent variable for each unit change of the independent variable.

• You can plot the variables by adding a PLOT statement after the MODEL statement in PROC REG (PLOT DEPENDENTVARIABLE * INDEPENDENTVARIABLE;) or by running PROC GPLOT as a separate step after PROC REG.
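Both plotting routes might look like the sketch below. The data set TOY is made up, PROC GPLOT assumes that SAS/GRAPH is installed, and ODS Graphics is switched off first on the assumption that the PLOT statement of PROC REG produces traditional graphics.

```sas
/* Hypothetical toy data, made up for illustration only */
data toy;
  input x y;
  datalines;
1 2
2 4
3 5
4 4
5 5
;
run;

ods graphics off;

/* Route 1: PLOT statement after the MODEL statement (dependent * independent) */
proc reg data=toy;
  model y = x;
  plot y*x;
run;
quit;

/* Route 2: a separate PROC GPLOT step; INTERPOL=RL overlays a fitted linear
   regression line on the plotted points */
symbol1 value=dot interpol=rl;
proc gplot data=toy;
  plot y*x;
run;
quit;
```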

Assignment

• Read The Little SAS Book, chapter 8 (8.4 for correlation, 8.5 and 8.6 for regression).

• Learn how to create a plot for regression, how to read the plot elements, and how to read regression output.

• There will be an assignment on these topics coming soon.
