
PSY 202 Spring 2015 / İnci Ayhan

Correlation and Regression Worksheet


When to use?
Correlation: To test whether there is a relationship between 2 quantitative continuous variables when the linear model assumptions are met (>> the 2 variables are NOT categorised as predictor and response).
Regression: To build a linear model to make predictions about the response variable (>> so we treat one variable as predictor and the other as response).
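To make the distinction concrete, here is a minimal Python sketch (not part of the worksheet, which works in SPSS), assuming numpy and scipy are available; the data are invented:

```python
# A minimal sketch; the digit-span / maths scores below are hypothetical.
import numpy as np
from scipy import stats

digit_span = np.array([4, 5, 6, 6, 7, 8, 9])         # hypothetical scores
math_score = np.array([55, 60, 66, 70, 74, 80, 92])  # hypothetical scores

# Correlation: association only, no predictor/response distinction
r, p = stats.pearsonr(digit_span, math_score)

# Regression: digit_span treated as predictor, math_score as response
fit = stats.linregress(digit_span, math_score)
print(r, p, fit.slope, fit.intercept)
```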

Is the statement "Correlation does not imply causation" correct?


No. The correct statement is: an association between two variables in a correlational design does not imply causation, because whether you can make causal inferences depends on the design of your study (rather than the way you analyse the data).
For example, you might be looking at the effect of age on reaction time and analyse your data using a between-subjects t-test, BUT still you CANNOT make causal inferences. WHY? Because you would have selected subjects (on the basis of their age groups) rather than randomly assigning them to experimental groups, so your design would be quasi-experimental rather than a true experiment. >> Just because someone analysed the data using a t-test or ANOVA doesn't mean they can imply causation.

Similarly, sometimes you might have a strong argument that one variable predicts the other and construct a hypothesis around it. Though you'd use a regression analysis, there would be some directionality, if not causation.
Have a data set at hand. What to do first?
Scatterplots!
Before running any analysis on your data >> it's important to look at the data points graphically. Check the scatter plot of the variables you measure:
- Linearity: You can use Pearson's correlation and linear regression only if the relationship between the two variables in your data set is linear:

(Please note that the red lines in the figure belong to the regression model. For the purposes of correlation, what we do is just to plot the diagram with the data scattered along the X and Y axes.)
>> If we check correlation: either variable could be plotted on either axis.
>> If we run a regression analysis: the predictor variable would be plotted on the X axis, and the response variable on the Y axis.
What if my data is not linear? Well, then, you need some other model (e.g. a curvilinear fit).
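For reference, a scatter plot like the ones above takes only a few lines in Python with matplotlib (a sketch outside the worksheet, which uses SPSS; the data are invented):

```python
# A minimal sketch, assuming matplotlib is available; the data are invented.
import matplotlib.pyplot as plt

digit_span = [4, 5, 6, 6, 7, 8, 9]          # hypothetical predictor (X)
math_score = [55, 60, 66, 70, 74, 80, 92]   # hypothetical response (Y)

plt.scatter(digit_span, math_score)   # for correlation, either axis would do
plt.xlabel("Digit span task performance")
plt.ylabel("Mathematical ability")
plt.show()   # roughly a straight band of points suggests linearity
```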
What other assumptions do I need to check?
- The sample is independently and randomly selected from the population of interest (this is crucial particularly IF you will make inferences about the population).

- No outliers present >> This is critical because outliers can radically change your correlation coefficient value and might lead to misleading conclusions:

How to check outliers?


In SPSS, under the REGRESSION routine, the CASEWISE DIAGNOSTICS option indicates which cases are extreme outliers. This is particularly useful in that you can see which cases stand out even after all IVs have been controlled for. An example output:

In this case, it shows that the effect of the IV would drop by .136 if case 9 were dropped. With a standardized dfbeta, values of 1 or larger are generally considered important; it has also been suggested that standardized dfbetas > 2/Sqrt(N) should be checked. >> If this is the case, look at this data point on the scatter diagram (and perhaps rerun the analysis without it and compare how the results change >> if it makes a big difference, you might decide to exclude it).
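The same dfbeta check can also be scripted outside SPSS; here is a sketch in Python using statsmodels (not part of the worksheet; the data and variable names are invented):

```python
# A minimal sketch of the 2/Sqrt(N) dfbeta rule of thumb; data are invented.
import numpy as np
import statsmodels.api as sm

x = np.array([2.0, 4, 5, 5, 7, 8, 9, 10, 11, 30])         # hypothetical IV
y = np.array([10.0, 19, 22, 25, 30, 33, 38, 41, 45, 20])  # hypothetical DV

model = sm.OLS(y, sm.add_constant(x)).fit()
dfbetas = model.get_influence().dfbetas   # standardized dfbetas, one row per case

threshold = 2 / np.sqrt(len(x))           # the 2/Sqrt(N) cut-off from above
flagged = np.where(np.abs(dfbetas).max(axis=1) > threshold)[0]
print("Cases worth inspecting:", flagged)
```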
- Bivariate normality: The population distributions of X and Y are such that their joint distribution represents a bivariate normal distribution.
>> This means the distribution of Y at a given value of X (or the distribution of X at a given value of Y) is assumed to be normal with equal variances (this assumption is robust to violation if your sample size is large enough).
(Unfortunately) SPSS does not have a ready-made button to check multivariate normality >> you need to write syntax.

- Neither distribution should be highly skewed: Have a look at the univariate distributions of the variables X and Y separately >> If they are skewed, you would never be able to get a very high correlation coefficient r. In fact, unless the X and Y distributions have identical shapes, it is impossible to attain a perfect r of +1 or -1.
>> For each variable, check normality using the procedures below:
How to check normality?
1. P-P plots: You could use P-P (probability-probability) plots to check for normality
>> they plot the cumulative probability of a variable against the cumulative probability of a normal distribution
>> if the distribution is normal, you'd get a nice straight diagonal line
>> if the data points are S-shaped, you've got a problem with skewness

(Figure: different types of P-P plots you could obtain)
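A P-P plot can also be produced in Python via statsmodels (a sketch outside the worksheet; the data are invented):

```python
# A minimal sketch of a P-P plot against the normal distribution.
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
scores = rng.normal(loc=50, scale=10, size=200)  # invented, normal by design

pp = sm.ProbPlot(scores, fit=True)  # fit=True estimates the mean and SD
pp.ppplot(line="45")                # points should hug the 45-degree line
plt.show()
```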


2. Kolmogorov-Smirnov & Shapiro-Wilk tests: These compare the scores in the sample to a normally distributed set of scores with the same mean and standard deviation.
>> We want these tests to be statistically non-significant, which tells us that the distribution of the sample is not statistically different from a normal distribution.
BUT if you have small samples, please do NOT use these as a measure of normality, as they would lack the power to detect violations of the assumptions.
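In Python these tests look as follows (a sketch outside the worksheet; the Lilliefors variant is used here because a plain K-S test assumes the mean and SD are known rather than estimated from the sample):

```python
# A minimal sketch; the scores are invented.
import numpy as np
from scipy import stats
from statsmodels.stats.diagnostic import lilliefors

rng = np.random.default_rng(1)
scores = rng.normal(50, 10, size=100)  # invented data

w, p_shapiro = stats.shapiro(scores)          # Shapiro-Wilk
ks, p_ks = lilliefors(scores, dist="norm")    # K-S with estimated parameters
print(p_shapiro, p_ks)   # non-significant p-values are what we hope for
```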
If the distributions are not normal, is it the end of the world? NO. You might employ a transformation to correct for non-normality and inequality of variances. Transforming the Y values to remedy non-normality often also corrects heteroscedasticity (unequal variances). Occasionally, both the X and Y variables are transformed.
How to check linearity? ZPRED vs ZRESID plots (under the REGRESSION command in SPSS):
Create a scatterplot of the values of the residuals against the values of the outcome predicted by our model
>> so that we can check whether there is a systematic relationship between what comes out of the model (predicted values) and the errors in the model.
>> If the relationship is linear, then we expect no systematic relationship between the errors in the model and what the model predicts (this is what we want in order to go on with linear correlation and regression):

(X Axis: Standardised residual, Y Axis: Standardised Predicted Value)
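Outside SPSS, the same plot can be built by hand; here is a Python sketch (not part of the worksheet; the data are invented, and the axes follow the figure above):

```python
# A minimal sketch of a standardized residual vs predicted plot.
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 100)
y = 3 + 2 * x + rng.normal(0, 1, 100)   # linear by construction

fit = sm.OLS(y, sm.add_constant(x)).fit()
zpred = (fit.fittedvalues - fit.fittedvalues.mean()) / fit.fittedvalues.std()
zresid = fit.get_influence().resid_studentized_internal  # standardized residuals

plt.scatter(zresid, zpred)   # axes as in the figure above
plt.xlabel("Standardized residual")
plt.ylabel("Standardized predicted value")
plt.show()   # a structureless cloud is what we want to see
```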


! If the data have outliers and are NOT normal, then you need to use another kind of correlation, called Spearman's rho OR Kendall's tau, which are used on ranked data (rather than the interval or ratio data that we analyse with Pearson's correlation r).
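Both rank-based coefficients are available in scipy (a sketch outside the worksheet; the data are invented, with one deliberate extreme value):

```python
# A minimal sketch: rank-based correlations are robust to the extreme pair.
import numpy as np
from scipy import stats

x = np.array([1, 2, 3, 4, 5, 100])   # invented data with an outlier
y = np.array([2, 1, 4, 3, 6, 50])

rho, p_rho = stats.spearmanr(x, y)
tau, p_tau = stats.kendalltau(x, y)
print(rho, tau)   # both operate on ranks, so the outlier matters far less
```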

What does residual mean?


>> Remember that in regression, we fit a linear model (a line) to our data points using a least-squares rule, so that the squared vertical distances of the data points to their corresponding model values are minimised. But this means that our model did not do a perfect job and that there is some error in our prediction, which we call the residual:
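In symbols (just restating the sentence above): for a data point with observed value $y_i$ and model value $\hat{y}_i$,

$$e_i = y_i - \hat{y}_i \qquad \text{and least squares picks the line minimising } \sum_i e_i^2.$$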

What is Covariance?
By looking at how 2 variables covary, we are looking at whether these two variables are associated
or not.
>> In order to understand covariance, let's go back to the concept of variance:

$$s^2 = \frac{\sum_{i=1}^{N}(x_i - \bar{x})^2}{N - 1}$$

where:
x̄ (x-bar): the mean of the sample
xᵢ: the data point in question
N: the number of observations
>> SO the variance of a variable represents the average amount that the data vary from the mean.
Similarly, if two variables are related, then changes in one variable should be met by similar changes in the other variable >> when one variable deviates from its mean, we would expect the other variable to deviate from its mean in a similar way. In fact, the formula for covariance is:

$$\mathrm{cov}(x, y) = \frac{\sum_{i=1}^{N}(x_i - \bar{x})(y_i - \bar{y})}{N - 1}$$

(The numerator, $\sum_i (x_i - \bar{x})(y_i - \bar{y})$, is the bit called the Sum of Cross Products.)

>> If both deviations for the same data point are positive or both negative, their product would be positive (indicative of deviations being in the same direction), but if one deviation is positive and the other is negative, the resulting product would be negative (indicative of deviations being in opposite directions).
>> Positive covariance: as one variable deviates from its mean, the other variable deviates from its mean in the same direction.
>> Negative covariance: as one variable deviates from its mean, the other variable deviates from its mean in the opposite direction.
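Numerically (a sketch outside the worksheet; the height/weight values are invented), this is what numpy reports, and it already hints at the scale problem discussed next:

```python
# A minimal sketch: covariance sign and its unit-dependence; data invented.
import numpy as np

height_m = np.array([1.60, 1.70, 1.75, 1.80, 1.90])
weight_kg = np.array([55.0, 65.0, 70.0, 78.0, 90.0])

cov = np.cov(height_m, weight_kg)[0, 1]   # np.cov divides by N - 1
print(cov)                                # positive: they deviate together

# Switching height to centimetres multiplies the covariance by 100:
print(np.cov(height_m * 100, weight_kg)[0, 1])
```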
BUT
covariance is NOT a scale-independent measure: let's say you check the relationship between height and weight, using metres for height and kilograms for weight. When you make your calculations using the formula above, you'd get a certain value. Now let's change the unit of height from metres to centimetres. The value you'd obtain as the covariance would also change!
SO we need some other, more objective measure to compare different data sets to each other on a common metric. Here is where the Pearson correlation coefficient steps in:

$$r = \frac{\mathrm{cov}(x, y)}{s_x s_y}$$
Remember standard scores (i.e. z-scores): what we were doing there was figuring out how much a score deviates from the mean in terms of standard deviation units. Here, we do exactly the same: we are trying to find out how much the two variables vary together in terms of their standard deviation units.
BUT
we already know that concept: x's deviation from the mean over the standard deviation is $z_x$, and y's deviation from the mean over the standard deviation is $z_y$. In fact, another formula for the correlation coefficient r is:

$$r = \frac{\sum_{i=1}^{N} z_{x_i} z_{y_i}}{N - 1}$$

(N - 1 in the denominator if you use degrees of freedom, and N if you do not make this correction)

Now, if we substitute $\sqrt{\sum_i (x_i - \bar{x})^2 / (N - 1)}$ for $s_x$ and $\sqrt{\sum_i (y_i - \bar{y})^2 / (N - 1)}$ for $s_y$ in the first r formula we discussed above, we'd obtain yet another formula:

$$r = \frac{\sum_{i}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i}(x_i - \bar{x})^2 \sum_{i}(y_i - \bar{y})^2}}$$
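All three versions of r are algebraically identical, which is easy to confirm numerically (a sketch outside the worksheet; data invented):

```python
# A minimal sketch: the three r formulas above give the same number.
import numpy as np

x = np.array([4.0, 5, 6, 6, 7, 8, 9])        # invented data
y = np.array([55.0, 60, 66, 70, 74, 80, 92])
n = len(x)

r1 = np.cov(x, y)[0, 1] / (x.std(ddof=1) * y.std(ddof=1))   # cov / (sx * sy)

zx = (x - x.mean()) / x.std(ddof=1)
zy = (y - y.mean()) / y.std(ddof=1)
r2 = np.sum(zx * zy) / (n - 1)                              # z-score form

r3 = np.sum((x - x.mean()) * (y - y.mean())) / np.sqrt(
    np.sum((x - x.mean()) ** 2) * np.sum((y - y.mean()) ** 2))  # raw-score form

print(r1, r2, r3)   # identical up to floating-point error
```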

What is the coefficient of determination (r²)?

It is an index of the amount of variability in one variable that is shared by the other. 1 - r², then, is the amount of variability still to be accounted for by other variables.
Note that r² does not imply causation, either!
Some useful reminders:
The magnitude of the correlation can be estimated by examining the width of the oval: the narrower the oval, the higher the correlation (remember, the slope does not give a clue about the magnitude):

The correlation coefficient r can take a value in the range of -1 to +1. Whereas 0 means no correlation, -1 means a perfect negative correlation and +1 means a perfect positive correlation:

Be careful when you extrapolate for prediction in a regression analysis. Similarly, the Pearson correlation coefficient represents the extent to which two variables approximate a linear relationship over the range of the variables included in its calculation. The nature of the relationship outside that range might well be different:

REF: I heavily relied upon "Discovering Statistics Using SPSS" by Andy Field.
A Working Example on the Calculations

Scatter Plot

Pearson Correlation Coefficient

The interpretation is that as performance in the digit span task increases, mathematical ability also increases.
INDEX: r = 0.00 - 0.30 (weak relationship)
0.40 - 0.60 (moderate relationship)
0.70 - 1.00 (strong relationship)
Effect size r²
Eta-squared = r² = (0.85) × (0.85) = 0.7225
>> 72% of the variance in mathematical ability can be accounted for by performance in the digit span task.
1 - r² = 1 - 0.7225 = 0.2775
>> ~28% unexplained variance (caused by some other factors)
Hypothesis Testing
We found the correlation coefficient value above, but in order to say something about the population, we need to conduct a t test:

$$t = r\sqrt{\frac{N - 2}{1 - r^2}}, \qquad df = N - 2$$

H0: ρ = 0 in the population
H1: ρ ≠ 0 in the population

Note the notation in the hypothesis section: ρ (rho) stands for the correlation in the population, whereas r stands for the correlation in the sample.
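The arithmetic of that test is easy to script (a sketch outside the worksheet, using the example's r = 0.85 and a hypothetical sample size of N = 20, since the worksheet's N is in the omitted output):

```python
# A minimal sketch of the t test for a correlation coefficient.
import numpy as np
from scipy import stats

r, n = 0.85, 20                          # r from the example; N is assumed
t = r * np.sqrt((n - 2) / (1 - r**2))    # t statistic with df = N - 2
p = 2 * stats.t.sf(abs(t), df=n - 2)     # two-tailed p-value
print(t, p)
```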

Construct a Regression Line

Suppose we had rejected the null hypothesis; we would then like to construct a regression line to obtain a model for making predictions:

$$\hat{y} = a + b x, \qquad b = r\,\frac{s_y}{s_x}, \qquad a = \bar{y} - b\,\bar{x}$$

Now that we have constructed our regression line, we can make predictions: if we had tested someone whose score is 7 on the digit span task, their mathematical ability would be expected to be:
ŷ = 23.8 + 7.7 × (7) = 23.8 + 53.9 = 77.7
(Note that we treat digit span task performance as the predictor and mathematical ability as the response variable in this example.)
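As a quick check (a sketch outside the worksheet, reusing the coefficients from the example above):

```python
# A minimal sketch: prediction from the example's fitted line.
intercept, slope = 23.8, 7.7             # from the regression line above

def predict_math_ability(digit_span: float) -> float:
    """Predicted mathematical ability for a given digit span score."""
    return intercept + slope * digit_span

print(predict_math_ability(7))           # 23.8 + 7.7 * 7 = 77.7
```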
Estimated Standard Error of Estimate
>> How well the function represents the data: the average amount of predictive error when predicting scores on Y across the population:

$$s_{Y \cdot X} = \sqrt{\frac{\sum_i (y_i - \hat{y}_i)^2}{N - 2}}$$
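And computed from raw data (a sketch outside the worksheet; the data are invented):

```python
# A minimal sketch of the standard error of estimate.
import numpy as np

x = np.array([4.0, 5, 6, 6, 7, 8, 9])        # invented data
y = np.array([55.0, 60, 66, 70, 74, 80, 92])

slope, intercept = np.polyfit(x, y, 1)        # least-squares line
y_hat = intercept + slope * x
see = np.sqrt(np.sum((y - y_hat) ** 2) / (len(x) - 2))
print(see)   # average predictive error, in the units of Y
```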

https://www.youtube.com/watch?v=r-txC-dpI-E
https://www.khanacademy.org/math/probability/regression/regression-correlation/v/r-squared-or-coefficient-of-determination
(PLEASE watch the links above)
