
Project: Linear Correlation and Regression

You may very well have studied linear regression before; I know many instructors discuss it in their classes. If the
word regression means nothing to you...great! This project will explain it to you from the ground up. For those of you
who have seen linear regression before...great! This project will hopefully demystify what is going on when you run
the command LinReg(ax+b) on your TI.
Much of the data we deal with in this course are univariate; that is, only one characteristic is measured and
studied. For example, we can study the average age of houses in, say, Oklahoma. The one variable? Age. This project
will deal with bivariate data, where two characteristics are measured simultaneously. Our main idea is to discover
whether or not there is a correlation between these two variables.
A correlation is a measure of how well two variables are related. If two variables are related well, we say they
are highly correlated. If not, we say they are not highly correlated. For example, height and weight are well correlated:
taller people tend to be heavier than shorter people. The relationship isn't perfect; people of the same height vary in
weight, and you can probably think of two people you know where the shorter one is heavier than the taller one.
Nonetheless, the average weight of people 5'5'' is less than the average weight of people 5'6'', and their average
weight is less than that of people 5'7'', and so forth.
OK...so, let's begin with a straightforward, context-free example. Here are some data points:

x | 10     8      13     9      11     14     6      4      12      7      5
y | 8.04   6.95   7.58   8.81   8.33   9.96   7.24   4.26   10.84   4.82   5.68

It's often hard to see, when data is presented in this
form, whether or not a linear correlation exists between
the variables. To assist us, we'll use a scatter plot¹...there
it is at right.
Now that we have our scatter plot, we can observe a
few things more easily. It seems as though, as x increases,
y increases, as well...roughly linearly. Can you (kind of) see
the line that goes through them?

This line (y = 0.5x + 3) is called the least squares
regression line, or best fit line (more on how to find it
using Excel in a bit, and, if you're interested in why it's called
the least squares regression line, this will explain:
http://www.youtube.com/watch?v=jEEJNz0RK4Q ). It is the
line that best describes the data onto which it's superimposed.
From this line, you may (if you're careful) make predictions.
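If you'd like to check that line without Excel, here is a minimal Python sketch. Python is my own choice for illustration (the project itself only needs Excel), and the function name best_fit_line is made up for this example. It applies the standard least squares slope and intercept formulas to the eleven points listed above.

```python
# The eleven (x, y) points from the table above
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]


def best_fit_line(xs, ys):
    """Return (slope, intercept) of the least squares line (hypothetical helper)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sum of cross-deviations and sum of squared x-deviations
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(xs, ys))
    sxx = sum((xi - x_bar) ** 2 for xi in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


slope, intercept = best_fit_line(x, y)
print(f"y = {slope:.2f}x + {intercept:.2f}")  # prints: y = 0.50x + 3.00
```

The slope and intercept come out a hair above 0.5 and 3, which is why the line is quoted as y = 0.5x + 3.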
But, wait just a minute...who says that line should even
be there?
I mean, I could have just as easily drawn a parabola, or some other curve, over that data. Who says a line even
makes sense?

¹ If you need a little tutorial for creating an Excel scatter plot: http://www.youtube.com/watch?v=-SeCPLC30_g

Apparently this idea also upset a statistician by the name of Frank Anscombe. Back in 1973, he published a
paper titled "Graphs in Statistical Analysis" where he presented four data sets: the one I showed you above, and three
more. When you graph all four, with their corresponding best fit lines, you get this:

Yup...all FOUR of those data sets have the same best fit line. And, you see, that's the problem: Just because we
can apply a best fit line to data doesn't mean we should. In our first data set (left to right), the line seems reasonable.
In the second, I sure see parabolic data, don't you? In the third, you have almost perfectly linear data with an outlier
severely affecting the best fit line; the rest of the data follows a shallower line, but the best fit line is tilted more
steeply upward by that outlier (the same kind of thing happens in the fourth data set).
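You can confirm the "same best fit line" claim numerically. The sketch below is in Python (my choice for illustration, not a project requirement). Only the first data set appears in this handout; the other three are typed in from Anscombe's published quartet, so treat them as outside material. The names quartet and best_fit_line are made up for this example.

```python
# Anscombe's quartet. Set I is the data set shown earlier in this handout;
# sets II-IV are taken from Anscombe's 1973 paper, not from this handout.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def best_fit_line(xs, ys):
    """Return (slope, intercept) of the least squares line (hypothetical helper)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(xs, ys))
    sxx = sum((xi - x_bar) ** 2 for xi in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


lines = {name: best_fit_line(xs, ys) for name, (xs, ys) in quartet.items()}
for name, (slope, intercept) in lines.items():
    print(f"Set {name}: y = {slope:.2f}x + {intercept:.2f}")
```

All four lines agree to two decimal places, even though the four scatter plots look nothing alike. The numbers alone can't tell you that; the plots can.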
So, what we need is a measure of how linearly correlated data is...and statisticians have given us that with a little
measure called r. It's formally called the Pearson Product Moment Correlation Coefficient². Here's a look at some
different datasets and their corresponding r values (thanks to Wikipedia):

As you can see, the sign of the r value indicates the direction of the slope of the data, and, the closer the absolute
value of the r value is to 1, the better the linear fit (positive r values imply a positive slope in the data, and negative
r values a negative slope). So the r values of 1 and 0.8 show strong linear evidence, but how about the 0.4s? The 0s? Not
so much, eh?
We need a decision mechanism...a way, from an r value, to decide if the data's linear enough. I'll show you
how, with an example. Suppose a pediatrician collects the following data from 11 infants:

Height (in.)     | 27.75   24.5   25.5   26     25     27.75   26.5   27     26.75   26.75   27.5
Head Girth (in.) | 17.1    17.1   17.1   17.3   16.9   17.6    17.3   17.5   17.3    17.5    17.5

She wants to see if there is a linear relation between the infants' heights and their head girths.
She starts (wisely) by constructing a scatterplot of the
data, shown at right. Once she's plotted the data, she
seems satisfied that the data looks roughly linear; she
doesn't see a parabolic curve to the data, or any other
nonlinear patterns. But...are they linear enough?
Let's check...open the spreadsheet marked
correg.xlsx, and select the tab marked "infant height wrt
head girth". In that sheet, you'll see the data across the top,
as well as the scatterplot of the data.

² Which makes you wonder why they called it r, eh? Why not the PPMCC? Nah...looks too much like PMRC, and
we all know how Rage Against the Machine and Dee Snider feel about them. Why r, then? Maybe they were
pirates. By the way, its computational formula is below. And no, you never have to do it by hand. Arrrr, mateys...

$$ r = \frac{\sum \left( \frac{x - \bar{x}}{s_x} \right) \left( \frac{y - \bar{y}}{s_y} \right)}{n - 1} $$

You'll also see some orange decision boxes that I've set up for you...let's
begin: to find r, in cell C5, type =CORREL(C2:M2,C3:M3) in the fx bar and then
press Enter. Your spreadsheet will kick back the r value of 0.69.
Next, input your sample size in cell C6. Once you press Enter, your decision
will be made in cell C7. In this case, the decision is yes...the data is linear
enough.³ This means that, as an infant's height increases, their head girth
tends to increase linearly, as well.
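For the curious, the r value that CORREL reports can be reproduced by hand. The Python sketch below (Python is my choice for illustration; the name pearson_r is made up) standardizes each x and each y with its sample mean and sample standard deviation, multiplies the pairs, adds them up, and divides by n - 1, which is the computational formula for r mentioned in the footnote.

```python
from statistics import mean, stdev

# Infant data from the table above
height = [27.75, 24.5, 25.5, 26, 25, 27.75, 26.5, 27, 26.75, 26.75, 27.5]
girth = [17.1, 17.1, 17.1, 17.3, 16.9, 17.6, 17.3, 17.5, 17.3, 17.5, 17.5]


def pearson_r(xs, ys):
    """Correlation coefficient from standardized products (hypothetical helper)."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations (n - 1 divisor)
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return total / (n - 1)


r = pearson_r(height, girth)
print(f"r = {r:.2f}")  # prints: r = 0.69
```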
Now, once you've decided that your data is linear enough, it's time to get the
equation that best defines your data. Technically, you'd have to find an equation of the form y = mx + b, where

$$ m = \frac{n \sum xy - \left( \sum x \right) \left( \sum y \right)}{n \sum x^2 - \left( \sum x \right)^2} \quad \text{and} \quad b = \bar{y} - m\bar{x} $$

...but no one does these by hand anymore⁴. Excel, as you saw earlier, saves the day again:


a) Left click on the scatterplot, then click the Chart Tools/Layout tab.
b) Left click on the selection Trendline, then, from the bottom of the list of options, click on More Trendline Options.
c) Make sure that the Linear radio button is selected. Everything else can stay the same, but make sure that the box
next to Display Equation on Chart is selected.
Click OK, and you'll get something like the one at right above (you can make some style changes, too, if you like).
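If you want to cross-check the equation Excel puts on the chart, the formulas for m and b can be evaluated directly. This is a minimal Python sketch (my choice of language, not part of the project); the variable names are made up, and the data are the infant heights and head girths from the example.

```python
# Infant data from the example: x = height (in.), y = head girth (in.)
x = [27.75, 24.5, 25.5, 26, 25, 27.75, 26.5, 27, 26.75, 26.75, 27.5]
y = [17.1, 17.1, 17.1, 17.3, 16.9, 17.6, 17.3, 17.5, 17.3, 17.5, 17.5]

n = len(x)
sum_x = sum(x)
sum_y = sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_x2 = sum(xi ** 2 for xi in x)

# Slope: m = (n*Sum(xy) - Sum(x)*Sum(y)) / (n*Sum(x^2) - (Sum(x))^2)
m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
# Intercept: b = y-bar - m * x-bar
b = sum_y / n - m * (sum_x / n)

print(f"y = {m:.4f}x + {b:.3f}")
```

The result should agree with the trendline equation on your chart, up to rounding.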
Now that we have the regression (best fit) equation, we can (carefully) use it to
forecast. For example, suppose we wanted to know what the head girth of an infant will
be, given their height is 29 inches. Here's how you could reckon that piece of info:

$$ \hat{y} = 13.601 + 0.1395x = 13.601 + 0.1395(29) \approx 17.6 $$

So, an infant with a height of 29 inches should have a head circumference of
roughly 17.6 inches.
Did you notice how I called that result ŷ? That was
intentional...since it's not a parameter, but a statistic, it should
carry a margin of error. MEs for regressions are not discussed in
most intro stat courses⁵, but, if you're interested, here is how to get
the 95% CI for the last example:

$$ \text{95\% CI for } y = \hat{y} \pm ME = \hat{y} \pm t \sqrt{MSE \left( \frac{1}{n} + \frac{(x - \bar{x})^2}{\sum (x - \bar{x})^2} \right)} $$

$$ = 17.6 \pm 2.26 \sqrt{0.028 \left( \frac{1}{11} + \frac{(29 - 26.45)^2}{11.98} \right)} = 17.6 \pm 0.30 = (17.30, 17.90) \text{ inches} $$

If you ever need to calculate MEs for regression equations,
you'll be using SPSS, or Minitab, or some other (better) stat
software than Excel. So, in this project, I will not be asking you to
forecast...just to get the best fit equations.
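Here is a hedged Python sketch of the forecast and its margin of error, in case you want to see the pieces computed rather than typed in. A few assumptions of mine: MSE is taken to be the sum of squared residuals divided by n - 2, the t value (2.262, for 95% confidence and 9 degrees of freedom) is hard-coded from a standard t table, and all variable names are made up for illustration.

```python
from math import sqrt

# Infant data from the example: x = height (in.), y = head girth (in.)
x = [27.75, 24.5, 25.5, 26, 25, 27.75, 26.5, 27, 26.75, 26.75, 27.5]
y = [17.1, 17.1, 17.1, 17.3, 16.9, 17.6, 17.3, 17.5, 17.3, 17.5, 17.5]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

m = sxy / sxx            # slope of the best fit line
b = y_bar - m * x_bar    # intercept of the best fit line

x0 = 29                  # height (in.) at which we forecast
y_hat = b + m * x0

# Assumption: MSE = (sum of squared residuals) / (n - 2)
sse = sum((yi - (b + m * xi)) ** 2 for xi, yi in zip(x, y))
mse = sse / (n - 2)

# Assumption: two-tailed 95% t value for n - 2 = 9 degrees of freedom, from a t table
t_crit = 2.262

me = t_crit * sqrt(mse * (1 / n + (x0 - x_bar) ** 2 / sxx))
print(f"y-hat = {y_hat:.2f}, ME = {me:.2f}")
print(f"95% CI: ({y_hat - me:.2f}, {y_hat + me:.2f})")
```

Because the code carries more decimal places than a hand calculation, its endpoints can differ slightly from ones built on a rounded forecast.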

³ This had better feel like black magic to you. Hang on, we'll get to what it's doing, soon enough...also, when it says the data is linear enough,
Excel is trusting that you have already decided that a line makes sense. More on that in question #5 below.
⁴ Although, if you're interested, I've placed a justification of these formulas on the enrichment page of the website.
⁵ Most intro stat courses also treat linear regression as a hypothesis test, which is unnecessarily cumbersome, in my humble opinion. That's why
we're doing regression the way we are, and not the way most texts (yours included) do it.

A few closing notes:


1) Correlation does not imply causation. Just because two variables move together in a predictable fashion does
not mean that one is causing the other to do so. There are, at times, other factors (called lurking, or
confounding, variables) at work causing the correlation you see, but your regression will most likely not identify
them. For example, in the baby height vs. head girth work above, we can't say that the baby's increasing height
is causing the baby's increasing head girth...most likely, it's due to the fact that all parts of the baby are growing
proportionately, and we're only looking at two of them (I'd bet, if you studied their foot lengths, for example,
they would correlate with the heights as well).
2) Your dataset should be an SRS (simple random sample) with no outliers. Outliers can wildly affect your regression
analyses and pollute your results.
3) More on the r value:
a. To keep with our conventions in class (and the journals), I based the r value decisions on 95% confidence (5%
significance) in this project, but they can allow for any confidence, in general.
b. A given r value might imply linearity in a larger data set, but not imply linearity in a smaller one. As your
sample size goes up, your correlation coefficient is allowed to be farther from 1 (in absolute value), yet still be
significant. This allows for the natural variations that will occur in a larger set of data. Here's an example of what I mean:

Both of those data sets are arguably linear, and they carry the same r value (0.75589). However, the left
one, where n = 5, is not linear enough to pass the test, while the right one (n = 20) is.
c. I don't think I can emphasize this enough: the r value you find does not tell you "Yeah! Your data's linear!"
It can't see your data at all. It's basically a second pass filter: once you've looked at your data, it tells you,
"OK...you saw the data, and you think it looks linear. I'll tell you if it's statistically linear enough." It's sort of
a conditional probability, like a P value. It doesn't tell you P(you should use a line); it tells you P(you
should use a line | you have decided a line will fit the data well).
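To see note 3b in numbers, here is a minimal Python sketch of one common way to decide whether an r value is "linear enough": convert r to a t statistic and compare it to a critical t value at 95% confidence. This is an assumption on my part about how such a decision can be made; the handout does not show the formula behind the orange decision boxes. The function name linear_enough and the little table T_CRIT_95 (values copied from a standard t table, only for the sample sizes used here) are made up for illustration.

```python
from math import sqrt

# Two-tailed critical t values at 95% confidence, keyed by degrees of freedom (n - 2).
# Copied from a standard t table; only the sizes needed below are included.
T_CRIT_95 = {3: 3.182, 9: 2.262, 18: 2.101}


def linear_enough(r, n):
    """Return True if r is statistically significant at the 5% level for sample size n."""
    df = n - 2
    t_stat = abs(r) * sqrt(df / (1 - r ** 2))
    return t_stat > T_CRIT_95[df]


small = linear_enough(0.75589, 5)    # the n = 5 data set from note 3b
large = linear_enough(0.75589, 20)   # the n = 20 data set from note 3b
infant = linear_enough(0.69, 11)     # the infant example

print(small)   # False
print(large)   # True
print(infant)  # True
```

Same r, different verdicts: with only 5 points, an r of about 0.76 could easily happen by chance, while with 20 points it is much harder to explain away.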

*******
Please do this entire project within Excel: download the spreadsheet to your computer, and save it. Then, complete
questions 1 through 5 (save frequently!). Email it to me (srule@cocc.edu) when you're finished. I must receive it by the
beginning of class on the date it's due or you will receive no credit. Each of the following 5 questions is worth 5 points,
and they will be graded all or none.
• For each of the questions 1 through 4:
a. Create the Excel scatter plot for your data, right in the spreadsheet. Make sure your axes have clear labels
(with units, if applicable), and make sure your plot has a title.
b. If your data appears linear (3 out of the 4 will), complete the orange decision boxes like we did in the
example above. For the nonlinear data set, leave the decision boxes blank. Hint: one's clearly parabolic.
c. If your decision box tells you your data's statistically linear enough (2 of the 3 should), place the best fit
line, with its equation, on your scatter plot. For the other one, don't.
d. If you correctly find a best fit line, replace the blank in the question asked in the textbox in each sheet with
"increase" or "decrease". Otherwise, leave that box the way it is.
• For question 5: complete the 4 decision boxes shown (regardless of the look of the data), and answer the
question in the sheet.

1. (Data Set "rate my professor") I randomly selected 16 of the faculty members from COCC and checked their
Rate My Professor profiles. On this tab, you'll see the data. I'd like you to see if there is a linear enough
relationship between the professors' easiness rating (the x variable) and their overall rating (the y variable).

2. (Data Set "cops and offenses") I randomly selected 15 years on record for Bend's police force (size of force given
as x), and cross-referenced those years for offenses reported (number of offenses per 1,000 Bendites given
as y). I'd like you to see if there is a linear enough relationship between the number of cops and the number of
offenses per 1,000 population.

3. (Data Set "revenue") A stadium's operations office is trying to decide how much to charge for a ticket to a
midweek, midday event. With each rise in ticket price comes a greater chance of turning away more folks, but
also a rise in revenue (from more expensive tickets) that could offset the lost patrons. Based on randomized
survey results, they forecast the amount of revenue (y) as a function of ticket price (x). Does there appear to be
a linear enough relationship between the ticket price and revenue?

4. (Data Set "earnings wrt education") The data in this sheet are averages taken from the Bureau of Labor
Statistics' 2008 Current Population Survey (the BLS CPS 2008). I'd like you to see if there is a linear enough
relationship between the years of schooling (x) and median weekly salary (y).

You may have to scroll over to get to the last data set:

5. (Data Set "Anscombe's Quartet") We'll finish with the datasets to which I alluded on page 2 of this project.
Remember...the ones that all had the same best fit line? Those four data sets are listed in this tab, each next to
its corresponding scatter plot. A student carelessly filled in all 4 decision boxes without looking first at the
scatterplots (you might recall, from the 2nd page of this project, that, mathematically, these 4 data sets have the
same best fit line...you now see that they also have the same r value). Answer the questions in the sheet (you
may have to scroll down to see them).
