
Lecture 10: Correlation and Simple Linear Regression

Bivariate Data & Scatterplots

Bivariate Data
When data come in linked pairs (i.e., one data value is tied to another), say X is height and Y is weight, and we take both measurements on each of n individuals, then our data would be as follows:
(X1, Y1), (X2, Y2), ..., (Xn, Yn). This type of data is called bivariate.

Scatterplots
Scatterplots, which we discussed in Lecture 2, are, like histograms, a good visual means of understanding patterns in bivariate numerical data.
Recall from the Lecture 2 notes:
Scatterplot - A graph of two numerical variables (X vs. Y). This graph is also known as a scattergram or scatter diagram. A scatterplot is a two-dimensional graph representing two variables measured from the same set of subjects or elements. That is, for each point being graphed, there are two pieces of information. For example, the height and weight of a group of 10 teenagers would result in a scatterplot of 10 paired (X=height, Y=weight) points on the graph. The graph could then be studied to determine if a relationship exists between the two variables. If the relationship between the two variables forms a straight line (linear), the variables are considered to be correlated. In practice, this correlated pattern usually looks like a roadway or band around a straight line rather than an exact straight line.
Typically, the more controlled (explanatory) variable is plotted on the X axis and the response variable is plotted on the Y axis.
Note that there are many other relationships beyond linear, such as quadratic, cubic, exponential, etc.
Here are some examples of X,Y paired data:
X-Cloud Density vs. Y-Rainfall
X-Advertising Dollars Spent vs. Y-Sales
X-Speed vs. Y-Braking Distance

Here are two examples of scatter plots that show a linear relationship (correlation):

The scatterplot gives us a means of observing relationships between the two variables. We call this
relationship positive if an increase in one variable corresponds to an increase in the other. This results
in an upward trend of the data points (1st graph above). When one variable increases while the other
decreases, we call the relationship negative, which results in a downward trend of the data points (2nd
graph above).
What does a scatterplot show? In general terms, it gives us an idea of what kind of relationships (or
patterns) the bivariate data has. We may have
1. Positive (or negative) linear relationship
2. Positive (or negative) curved relationship
3. Other relationships
4. No relationship
Scatterplots can also give us visual evidence of outliers or suspicious observations.
Here is another scatterplot, of the diameter of oak trees vs. the age of the trees.

Correlation, Causation & Confounding

Correlation
To get a measure of how strongly X and Y values are related, we will use Pearson's Correlation
Coefficient. Correlation is concerned with linear trends: if X increases, does Y tend to increase or
decrease? How much? How strong is this tendency?

The Pearson Correlation Coefficient


The Pearson correlation coefficient measures the strength and direction of a linear relationship between the x and y variables. Like other numerical measures, the population correlation coefficient is denoted by the Greek letter rho (ρ), and the sample correlation coefficient is denoted by r. The formula for the Pearson sample correlation coefficient is:

r = Sxy / √(Sxx × Syy)
Note: Sxy measures the covariation of X and Y, while Sxx and Syy measure the internal variation of X and Y, respectively.
Sxx = Sum of squares of the X data (same as SSx): Σ(xi − x̄)²
Syy = Sum of squares of the Y data (same as SSy): Σ(yi − ȳ)²
Sxy = Sum of cross-products of the (X,Y) pairs (same as SSxy): Σ(xi − x̄)(yi − ȳ)
Sx = Standard deviation of the X data
Sy = Standard deviation of the Y data
sxy = Sample covariance of the (X,Y) pairs, Sxy/(n − 1); equivalently, r = sxy / (Sx Sy)
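As an illustration, here is a minimal sketch of this computation in plain Python, using the first five (X, Y) pairs from the ski-race example later in these notes:

    from math import sqrt

    # First five (treadmill time, ski time) pairs from the example later on
    x = [7.7, 8.4, 8.7, 9.0, 9.6]
    y = [71.0, 71.4, 65.0, 68.7, 64.4]

    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n

    # Sums of squares and cross-products, exactly as defined above
    Sxx = sum((xi - x_bar) ** 2 for xi in x)
    Syy = sum((yi - y_bar) ** 2 for yi in y)
    Sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

    r = Sxy / sqrt(Sxx * Syy)
    print(f"r = {r:.3f}")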
Properties of Pearson's Correlation
1. The value of r does not depend upon the units of measurement. For example, if X is the weight
in pounds and Y is the height in inches of a person, then the correlation between X and Y would
be the same if we measured X in kilograms and Y in centimeters.
2. The value of r does not depend upon which variable is labeled X and which variable is labeled
Y. In other words, the correlation of X and Y = correlation of Y and X.
3. −1 ≤ r ≤ +1. A positive value of r means a positive linear relationship; a negative value of r means a negative linear relationship.
4. r = +1 is a perfect positive linear relationship and r = −1 is a perfect negative linear relationship. This is extremely rare.
5. Values of r close to 0 imply no linear relationship. But note that "no linear relationship" doesn't mean "no relationship" exists; a strong non-linear relationship might still exist when r is close to 0. Always remember that r measures only the linear relationship between X and Y.
Strength of Correlation
We can generally define the strength of correlation as follows: (This is only a general guide!)
Strong: |r| ≥ 0.8
Moderate: 0.5 ≤ |r| < 0.8
Weak: |r| < 0.5
NOTE: Since correlation measures only a linear relationship, an r close to or equal to zero does not mean that there is no relationship between x and y.

Causation and Confounding


We must be very careful in interpreting correlation coefficients. Just because two variables are
highly correlated does not mean that one variable causes the other to change. In statistical terms,
we say that correlation does not imply causation. There are many good examples of correlated
variables which are nonsensical (not causal) when interpreted in terms of causation.
1. Ice cream sales and the number of shark attacks on swimmers are positively correlated.
2. Skirt lengths and stock prices are correlated (as stock prices go up, skirt lengths get shorter).
3. The number of cavities in elementary school children and vocabulary size have a strong
positive correlation.
Three relationships which can be taken (or mistaken) for causation are:
1. Causation: Changes in X cause changes in Y. For example, football weekends cause heavier
traffic, more food sales, etc.

2. Common response: Both X and Y respond to changes in some unobserved variable. All three of
our examples above are examples of common response.
a. Ice cream sales and shark attacks both increase during summer.
b. Skirt lengths and stock prices are both controlled by the general attitude of the country,
liberal or conservative.
c. The number of cavities and children's vocabulary are both related to a child's age.
3. Confounding: The effect of X on Y is hopelessly mixed up with the effects of other
explanatory variables on Y. For example, if we are studying the effects of Tylenol on reducing
pain, and we give a group of pain-sufferers Tylenol and record how much their pain is reduced,
we are confounding the effect of giving them Tylenol with giving them any pill. Many people
report a reduction in pain by simply being given a sugar pill (placebo) with no medication in it
at all. This is called the placebo effect. To establish causation, a designed experiment must be
run.
The Regression Model & Regression Analysis

Regression Analysis
Regression analysis is a statistical tool that uses the relation between two or more quantitative variables so that one variable (the dependent variable) can be predicted from the others (the independent variables). For example, if one knows the relationship between advertising expenditures and sales, one can predict sales by regression analysis once the level of advertising expenditures has been set. In simple linear regression, we consider specifically the case where a single independent variable is used to predict the dependent variable and the two variables are linearly related.

The Straight Line Regression Model


The model can be stated as:

Y = α + βX + e

where

Y is the response variable (also called the dependent variable)
X is the predictor (also called the independent variable)
α (alpha) and β (beta) are the parameters representing the y-intercept and slope
e is the error (also called random deviation or random noise)

Y = α + βX is called the population regression line.

Assumptions about Error (e):


1. e has mean zero and constant variance σ² for every value of x.
2. e is normally distributed.
3. e1, e2, e3,..., en are independent of one another.
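To make these assumptions concrete, here is a small simulation sketch (assuming numpy is available; the parameter values are hypothetical) that generates data satisfying all three:

    import numpy as np

    rng = np.random.default_rng(seed=1)
    alpha, beta, sigma = 2.0, 0.5, 1.0       # hypothetical y-intercept, slope, error SD

    x = np.linspace(0.0, 10.0, 50)
    # e: mean 0 with constant variance sigma^2 (1), normal (2), independent draws (3)
    e = rng.normal(loc=0.0, scale=sigma, size=x.size)
    y = alpha + beta * x + e                 # the straight-line model Y = alpha + beta*X + e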

Least Squares Line


Recall the equation of a line from algebra (Y = mX + b) and apply it to the following notation:

Ŷ = a + bX

Ŷ (Y-hat) is the estimated response, E(Y), for a given value of X.
b is called the slope of the line.
a is the y-intercept.

The slope "b" measures the amount Y increases when X increases by one unit.
The y-intercept is the value of Y when X = 0.
The objective of simple linear regression is to fit a straight line through the points on a scatterplot that
best represents all the points. So we want to find a and b such that the line [Y = a + bX] fits the points
as well as possible. To do this, we first define what we mean by a "best fit" line. This line, in some
sense, is closest to all of the data points simultaneously. In statistics, we define a residual, ei, as the
vertical distance between a point and the line,

ei = Yi - (a + bXi)
Since residuals can be positive or negative, we square them to remove the sign. By adding up all of the squared residuals, we get a total measure of how far our line is from the data. This sum is called SSresid, the sum of squared residuals. Thus, the "best fit" line is defined as the one whose sum of squared residuals is a minimum. This method of finding a line is called least squares. This is a very important point and is explained further on pp. 199-201 of your textbook!
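For concreteness, here is a sketch of the computation in plain Python; the standard closed-form least squares estimates b = Sxy/Sxx and a = ȳ − b·x̄ are the values that minimize SSresid. The toy data are made up:

    def least_squares(x, y):
        """Return (a, b) for the best-fit line y-hat = a + b*x."""
        n = len(x)
        x_bar = sum(x) / n
        y_bar = sum(y) / n
        Sxx = sum((xi - x_bar) ** 2 for xi in x)
        Sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
        b = Sxy / Sxx                 # slope
        a = y_bar - b * x_bar         # y-intercept
        return a, b

    x = [1.0, 2.0, 3.0, 4.0]
    y = [2.1, 3.9, 6.2, 7.8]          # made-up data
    a, b = least_squares(x, y)

    # Residuals and their sum of squares (SSresid), which least squares minimizes
    residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    ss_resid = sum(e ** 2 for e in residuals)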
This is expressed in the following diagram:

Estimation and Prediction


Given a least squares line, we can use it for estimation and prediction. The equation for prediction is
simply the equation for a line with a and b replaced by their estimates. The predicted value of Y is
traditionally denoted Y-hat. Thus, suppose we are given the least squares equation:

where X is the age of a child in months and Y is the height of that child, and let's further assume that
the X data values range from 1 to 24 months. To predict the height of an 18 month old child, we just
plug in X=18 to get:

This is called a point estimate or prediction of Y when X = 18.


What if we wanted to know the height of a child at age 32 months? From our least squares equation, we could get a prediction. However, we would be predicting outside the range of our raw X data values. This is called extrapolation, and it is not good statistical practice unless you are confident the estimated regression line over the range of the X data [1, 24] is also valid over the range you're predicting. When we predict within the range of our X values, this is known as interpolation; this is the way we normally want to predict.
Note: The major problem presented by extrapolation is that there is no supporting data to
ensure the relationship continues to exist outside of the range of the X data.
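One practical safeguard is to have prediction code flag extrapolation explicitly. This is a sketch only; the helper name and warning policy are my own, and a and b stand in for the fitted intercept and slope from the child-height example:

    def predict(a, b, x_new, x_min, x_max):
        """Return y-hat = a + b*x_new, warning if x_new is outside the observed X range."""
        if not (x_min <= x_new <= x_max):
            print(f"Warning: x = {x_new} is outside [{x_min}, {x_max}]; extrapolating.")
        return a + b * x_new

    # With the X data observed on [1, 24] months:
    # predict(a, b, 18, 1, 24)   -> interpolation, no warning
    # predict(a, b, 32, 1, 24)   -> prints an extrapolation warning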

Residual Plots and Regression Assumptions


Recall that there are three basic assumptions about the random deviations (errors): they are independent, normally distributed, and have constant variance. In simple linear regression, we also assume that y and x are linearly related. We shall consider the use of residual plots for examining the following types of departures from the assumed model.
1. The regression function is not linear.
2. The error terms do not have a constant variance.
3. The model fits all but one or a few outlying observations.
4. The error terms are not normally distributed.
5. The error terms are not independent.
The common graphical tools for assumption checking include:
1. Residual Plot - scatterplot of the residuals against X or the fitted value (Y-hat).
2. Absolute Residual Plot - scatterplot of the absolute values of the residuals against X or the fitted value.
3. Normal Probability Plot of the Residuals.
4. Time Series Plot of the Residuals - scatterplot of the residuals against time or index.
A time series plot of the residuals is strongly recommended whenever data are obtained in a time sequence. The purpose is to see if there is any correlation between the error terms over time (i.e., whether the error terms are not independent). When the error terms are independent, we expect the residuals to fluctuate in a more or less random pattern around the baseline 0.
We will concern ourselves mostly with (1) and (3). If the residual plot (1) shows a pattern, then there is a more complex relationship between X and Y than a simple straight line; thus a more complex regression analysis is needed to study the relationship between X and Y.
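Here is a sketch of the two plots we focus on, a residual plot (1) and a normal probability plot of the residuals (3), assuming numpy, matplotlib, and scipy are available (the data are made up):

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy import stats

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    y = np.array([2.2, 3.8, 6.1, 7.9, 10.2, 11.8])   # made-up data

    b, a = np.polyfit(x, y, deg=1)                   # slope, intercept
    fitted = a + b * x
    resid = y - fitted

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 4))

    # (1) Residual plot: look for a random scatter around 0 with constant spread
    ax1.scatter(fitted, resid)
    ax1.axhline(0.0, linestyle="--")
    ax1.set_xlabel("Fitted value (y-hat)")
    ax1.set_ylabel("Residual")

    # (3) Normal probability plot: an approximate straight line supports normality
    stats.probplot(resid, dist="norm", plot=ax2)
    plt.show()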
Linear Strength and Inferences
Recall the Strength of Pearson's Correlation Coefficient:

Strong: |r| ≥ 0.8
Moderate: 0.5 ≤ |r| < 0.8
Weak: |r| < 0.5

And remember, since correlation measures only a linear relationship, an r close to or equal to zero does not mean that there is no relationship between x and y.
Let's consider a second measure of the strength of a linear relationship.
Coefficient of Determination
A statistic that is widely used to determine how well a regression line fits a set of (X, Y) data pairs is the coefficient of determination, R². The coefficient of determination represents the proportion of the variability in Y that can be explained by the linear relationship between X and Y. In other words, R² tells us how much of the variability in the Y's can be explained by the fact that they are related to X (i.e., how close the points are to the least squares regression line).


In the simple linear regression case, the coefficient of determination is equal to Pearson's correlation coefficient squared: R² = r².
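A small sketch in plain Python, on made-up data, confirms that the usual definitional form of R², 1 − SSresid/SStotal (not shown explicitly in these notes), agrees with r²:

    from math import sqrt

    x = [1.0, 2.0, 3.0, 4.0, 5.0]
    y = [2.0, 4.1, 5.9, 8.2, 9.8]        # made-up data

    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    Sxx = sum((xi - x_bar) ** 2 for xi in x)
    Syy = sum((yi - y_bar) ** 2 for yi in y)          # SStotal
    Sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

    b = Sxy / Sxx
    a = y_bar - b * x_bar
    ss_resid = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

    r_squared = 1 - ss_resid / Syy                    # definitional form
    r = Sxy / sqrt(Sxx * Syy)
    assert abs(r_squared - r ** 2) < 1e-9             # R^2 = r^2 in simple linear regression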

Inferences for Regression Parameters


In many situations, a general form of a C% confidence interval for a parameter is:
Point estimate ± (critical tabled value) × Standard error(statistic)
Using this method, a confidence interval can be determined for alpha (the true y-intercept) and beta (the true slope). We will not discuss the detailed formulas in this class, but rather let the computer software determine these values.
Likewise, we can test the significance of the linear relationship by determining if "0" is a possible
value in the confidence interval for the population slope (Beta) or the population correlation (rho).
There are formal testing procedures as well that yield p-values. We'll see this later in the simple linear
regression example.
Estimating Y for a given value of X
There are two types of intervals we will consider for estimating Y at a given value of X:
1. Confidence (Estimation) Interval for the mean response, E(Y).
2. Prediction Interval for a single future response, Y.
NOTE: The prediction interval is wider than the estimation interval because it is an inference about a future individual observation, whereas the estimation interval is an inference about the mean of the population the sample data represent.
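If you want to compute both intervals in software other than MINITAB, here is a sketch assuming the statsmodels Python package; in its output, the mean_ci columns give the confidence interval for E(Y) and the obs_ci columns give the (wider) prediction interval:

    import numpy as np
    import statsmodels.api as sm

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    y = np.array([2.1, 4.0, 6.2, 7.8, 10.1, 12.2])   # made-up data

    X = sm.add_constant(x)                  # adds the intercept column
    model = sm.OLS(y, X).fit()

    X_new = np.array([[1.0, 3.5]])          # [intercept, x] for the new X value
    pred = model.get_prediction(X_new)
    print(pred.summary_frame(alpha=0.05))   # mean_ci_* = confidence, obs_ci_* = prediction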
Correlation and Simple Linear Regression - An Example

"Athletic Performance and Cardiovascular Fitness"


(This is Example 13.4 from your textbook, pp. 564-565.)
A study was performed relating Cardiovascular Fitness and an athlete's performance in a 20-km ski
race.
Let X = Treadmill run time to exhaustion (A measure of Cardiovascular Fitness)
Let Y = 20-km ski time (A measure of performance)
The following data was acquired from the article "Physiological Characteristics and Performance of
Top U.S. Biathletes" (Medicine and Science in Sports and Exercise (1995): 1302-1310)
X:   7.7   8.4   8.7   9.0   9.6   9.6  10.0  10.2  10.4  11.0  11.7
Y:  71.0  71.4  65.0  68.7  64.4  69.4  63.0  64.6  66.9  62.6  61.7
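If you want to follow along without MINITAB, the sketch below (assuming scipy is available) reproduces the key numbers from the six-step analysis that follows:

    from scipy import stats

    x = [7.7, 8.4, 8.7, 9.0, 9.6, 9.6, 10.0, 10.2, 10.4, 11.0, 11.7]       # treadmill time
    y = [71.0, 71.4, 65.0, 68.7, 64.4, 69.4, 63.0, 64.6, 66.9, 62.6, 61.7] # ski time

    fit = stats.linregress(x, y)
    print(f"slope = {fit.slope:.2f}, intercept = {fit.intercept:.1f}")  # about -2.33 and 88.8
    print(f"r = {fit.rvalue:.2f}, R^2 = {fit.rvalue ** 2:.3f}")         # about -0.80 and 0.634
    print(f"p-value for H0: slope = 0 -> {fit.pvalue:.4f}")             # tests the slope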

The 6 STEPS in a Linear Regression Analysis are:


1. Determine if a possible linear relationship exists between X and Y.
This is done either visually, by use of a scatterplot, or analytically, by calculating the correlation.
If it is clear a non-linear pattern exists, you would not do a linear regression analysis, but
rather move to a higher-order regression analysis (quadratic, cubic, exponential, etc.).
2. Calculate the Least-Squares Regression Line.
To determine this line, you need the slope and y-intercept.
3. Determine if the Linear Relationship is statistically significant.
This is done through a test of the population slope or test of the population correlation.
You can also calculate confidence intervals for the population slope and y-intercept.
4. Measure the strength of the linear relationship.
We've discussed two measures of this type: 1. Correlation (r) and 2. Coefficient of
Determination (R²). It is also not uncommon to calculate a confidence interval for the
population correlation.
5. Check your model assumptions.
This is a very important step to ensure your model is adequate.
6. Make any desired estimates and predictions.
This is done through confidence intervals for the mean response, E(Y), and prediction
intervals for individual responses, at a given value of X.
STEP 1: SCATTERPLOT

It appears that a negative linear relationship exists between treadmill time (X) and ski time (Y).
STEP 2: ESTIMATE THE LEAST-SQUARES REGRESSION LINE

From the MINITAB output:


The estimated slope = -2.33 and the estimated y-intercept = 88.8
The least-squares line is y-hat = 88.8 - 2.33 X
(notice MINITAB doesn't show the y-hat symbol)
And here's a graph of the least squares line through the points.

STEP 3: IS THE LINEAR RELATIONSHIP SIGNIFICANT?


Refer to the highlighted line in the above output. This shows the results of a test involving the "true slope":
Ho: β = 0
Ha: β ≠ 0

Since the p-value is less than 0.05, Ho is rejected and you can conclude that the slope is significantly different from 0; thus a statistically significant linear relationship exists between X and Y. Remember, if the true slope could reasonably equal 0 (failing to reject Ho), then there is no statistically significant linear relationship between X and Y. If we reject Ho, then we conclude that a statistically significant linear relationship does exist between X and Y.
STEP 4: STRENGTH OF THE LINEAR RELATIONSHIP
Again, from the above output, R² = 63.4%; thus you can conclude that 63.4% of the variability in Y can be "explained" by the linear relationship between X and Y.
The magnitude of the correlation (r) can be determined by taking the square root of R²:
SQRT(0.634) = 0.80
But since the linear relationship is negative (the slope is negative), we attach a negative sign, so...
r = −0.80
This is considered a strong negative linear relationship between X and Y.
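As a quick arithmetic check in Python (the 0.634 and the sign of the slope come from the output above):

    from math import sqrt, copysign

    r_squared = 0.634
    slope = -2.33                            # the sign of r matches the sign of the slope
    r = copysign(sqrt(r_squared), slope)
    print(round(r, 2))                       # -0.8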
STEP 5: CHECK ASSUMPTIONS

The residuals appear to be normal since the Normality plot shows an approximate straight line.

Since there seems to be a random pattern of the standardized residuals, it appears the linear relationship
between X and Y is appropriate, and there are no higher order relationships here. In addition, there
appears to be constant variance across the graph. In other words, the points are approximately equally
spread up and down across the graph.
STEP 6: CONFIDENCE AND PREDICTION INTERVALS FOR THE MEAN RESPONSE
The following graph shows the confidence band (for the mean response) and the prediction band (for individual responses) across the range of X values.

Notice how far the confidence and prediction bands are from the estimated regression line. This is partly due to the small number of points we used in this example. Remember: more data, less error!
This is how the output looks if you want an estimate and prediction for a specific value of X, for
example, let X*=8.5.

As a result, when X=8.5, y-hat=68.961. This is a point estimate.


The 95% confidence interval is (66.805, 71.117) and the 95% prediction interval is (63.561, 74.360).
When I chose X=8.5, did I extrapolate or interpolate? You tell me!
