
Summary

How to look for relationships between continuous variables using correlation and regression

Functions

Introduction

This help sheet covers correlation and regression. The main difference between these approaches is

the issue of causality: correlation does not examine causality and simply describes whether and how

a change in one variable is related to another. For example, we might ask how CO2 levels relate to air temperature, without implying that CO2 drives temperature changes or vice-versa; we simply want to know “does temperature increase (or decrease) as CO2 increases?” We get this information from the p-value of the correlation test, which indicates the strength of the evidence that a correlation exists. We can also ask “how strong is the correlation between temperature and CO2?”, which indicates how closely the two variables are associated (see below).

In contrast, we could also model temperature as a function of CO2 levels using linear regression. Here, we assume temperature responds to changes in CO2 levels. This allows us to say how much of a temperature increase we would expect from a given CO2 increase; in other words, we can use CO2 levels to predict temperature.

Our example dataset gives the speed (in mph) and stopping distance (dist, in feet) of 50 cars. We’ll start by having a look at the data:
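The plotting command isn’t shown here; a minimal sketch, assuming the data are R’s built-in cars dataset (whose values match those quoted throughout this sheet), would be:

```r
# Have a look at the data, then plot stopping distance against speed
# (assumes R's built-in cars dataset: 50 cars, speed in mph, dist in feet)
head(cars)
plot(cars$speed, cars$dist, xlab = "speed", ylab = "dist")
```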

[Scatter plot of dist (stopping distance, 0 to 120 feet) against speed (5 to 25 mph)]

Looks like stopping distance is pretty closely related to speed; let’s test this using Pearson’s

correlation:
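A sketch of the test, again assuming the built-in cars dataset:

```r
# Pearson's correlation between stopping distance and speed;
# "pearson" is the default method, but it is given explicitly here
cor.test(cars$dist, cars$speed, method = "pearson")
```

The output reports the t statistic, degrees of freedom, p-value and the correlation estimate (cor) discussed below.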

Note the method = “pearson” argument, which tells R to use a Pearson’s correlation. Remember also that the order of dist and speed doesn’t actually matter here, as we aren’t assuming any causality.

We can see that there is strong evidence for a correlation (t₄₈ = 9.46, p = 1.5×10⁻¹²). We can also see that there is a strong positive correlation; distance and speed are closely related (the correlation statistic, cor, is 0.81). The strength of correlation can vary from 1 (perfectly positively correlated: dist and speed fall on a perfect straight line, with no scatter) through 0 (correlation so weak that dist and speed are virtually unrelated) to -1 (perfectly negatively correlated: stopping distance decreases as speed goes up, and vice-versa). Remember that it is possible to get a significant weak correlation or a non-significant strong correlation; there is a difference between the evidence for a correlation and the strength of that correlation.

Let’s say we discover that one (or both) of our variables isn’t normally distributed; what do we do

then? If we can’t transform the data to normalise it (help sheet 8), we need to use a non-parametric

alternative. The most common option here is Spearman’s rank correlation, which ranks both variables separately and then checks whether, for example, the cars with the highest speeds also tend to have the highest stopping distances:
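One way to run this, assuming the cars dataset as before:

```r
# Spearman's rank correlation: a non-parametric alternative that works
# on the ranks of the data (assumes the cars dataset)
cor.test(cars$dist, cars$speed, method = "spearman")
```

With these data, R also warns that it cannot compute an exact p-value because of tied ranks; the p-value reported is an approximation.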

The Spearman test also supports a correlation between speed and stopping distance (S = 3532, n = 50, p = 8.83×10⁻¹⁴; note that we quote the sample size n here instead of the degrees of freedom, because the test doesn’t estimate any parameters). The estimate of the strength of the correlation, rho, is similar to that estimated by the Pearson correlation (0.83, indicating a strong positive correlation).

Linear Regression

We know that stopping distance is positively correlated with the speed of the car: but how much does stopping distance increase for every extra mph of speed? To answer this, we need to use linear regression:
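A sketch of the model fit, assuming the cars dataset:

```r
# Fit a straight line dist = a + b * speed and store it in "model"
model <- lm(dist ~ speed, data = cars)
model  # prints the intercept (about -17.6) and the gradient (about 3.9)
```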

Our linear model (lm, assigned to the object model) predicts that for every 1 mph increase in speed, the stopping distance increases by around 3.9 feet. It also predicts a stopping distance of -17.5 feet at a speed of 0 mph, as indicated by the intercept with the y-axis – not a particularly sensible prediction!

To make this clearer, let’s check what our modelled relationship looks like on a graph. Plot the graph from earlier, this time using ylim to extend the y-axis from -20 to 120 and xlim to extend the x-axis from 0 to 25 (see help sheet 9 for more on graphical functions). Next, use abline (a function for plotting a straight line with intercept a and gradient b) to add a line based on our model:
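Putting those steps together (assuming the cars dataset; the model is refitted here so the sketch is self-contained):

```r
# Re-plot the data with extended axes, then add the fitted line
model <- lm(dist ~ speed, data = cars)
plot(cars$speed, cars$dist, xlab = "speed", ylab = "dist",
     xlim = c(0, 25), ylim = c(-20, 120))
abline(a = coef(model)[[1]], b = coef(model)[[2]])  # intercept a, gradient b
```

abline(model) is an equivalent shorthand that reads the intercept and gradient directly from the model object.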

[Scatter plot of dist against speed with the fitted regression line; y-axis extended from -20 to 120, x-axis from 0 to 25]

We can see that our line crosses x = 0 (the y-axis) at just above -20 (at -17.5 feet), and that every 10 mph increase in speed gives an increase in stopping distance of around 40 feet (10 × 3.9 = 39 feet).

But to know whether this fitted relationship is actually any good, we need to test the model to see if it explains a significant amount of variation. This is done using the summary command:
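Assuming the model fitted above from the cars dataset:

```r
# summary() reports the coefficients, residuals, R-squared,
# F-statistic and p-value for the fitted model
model <- lm(dist ~ speed, data = cars)
summary(model)
```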

Lots of information! However, there isn’t that much here that you haven’t met before. Let’s start at the bottom. The test statistic is the F-statistic (89.57). The p-value is 1.49×10⁻¹². Here, we have two degrees of freedom: 1 for the line and 48 for the data*. This means we would report our result like so: there was a significant increase in stopping distance with speed (F₁,₄₈ = 89.57, p = 1.49×10⁻¹²).

Another way of saying this is that fitting our relationship between stopping distance and speed

provides a significantly better explanation of the stopping distance than just using a mean stopping

distance (intercept), with no relationship between stopping distance and speed.

How well does our model explain the variation in stopping distance? To test this, we can look at our

R-squared value, which tells us what proportion of the variation in the data is explained by our

model. Here, R-squared = 0.65 or 65%, so we’re doing a decent job of explaining stopping distance,

but there’s still quite a bit (35%) of unexplained variation in the data. This can be seen in the scatter

around the line in our graph above; if the points lined up perfectly on the line, our R-squared value

would be 100%. The adjusted R-squared value accounts for the fact that the more complicated you

make your model, the more of the data you can explain. We’ll be using pretty simple models though,

so it doesn’t really matter which R-squared value you choose to use.

The table of Coefficients just tells you the same information we looked at earlier (estimates of the intercept and gradient of the model), together with information on how accurate those estimates are. You don’t need to worry too much about the information on the residuals; it just tells you a bit more about the scatter around the line.

We can use our modelled relationship to predict stopping distances based on speed. For example,

our model predicts that the stopping distance at 150 mph would be as follows:
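One way to make the prediction, assuming the cars dataset and the model fitted above:

```r
# Predict the stopping distance at 150 mph from the fitted line
# (note we are extrapolating far beyond the measured speeds!)
model <- lm(dist ~ speed, data = cars)
predict(model, newdata = data.frame(speed = 150))
# equivalently, intercept + gradient * speed: -17.58 + 3.93 * 150
```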

This comes to 572.3 feet, or nearly 175 metres – about one and a half football pitches! However, here we’re extrapolating beyond the range of our data, which is often ill-advised: we didn’t measure the stopping distances of cars at speeds any higher than 25 mph.

Linear regression makes a number of assumptions about the data, and isn’t valid if these

assumptions aren’t met - see help sheet 8 for details.

*The degrees of freedom thing is a little complex, but is to do with the way the test is being done. We’ve fitted a line with two parameters: an intercept (-17.5 feet at speed = 0 mph) and a gradient (3.9 feet for every mph increase in speed). We’re comparing this line to the null hypothesis of no relationship between stopping distance and speed; this hypothesis explains stopping distance using only a single parameter, a mean stopping distance, which doesn’t vary with speed. So our more complicated model, with df = 2 (intercept and gradient), is being compared to the simpler null model, with df = 1 (intercept, but no gradient). So our treatment degrees of freedom = 2 - 1 = 1. Since we have 50 datapoints and have fitted two parameters (intercept and gradient), we’re left with 50 - 2 = 48 freely varying datapoints: so our error degrees of freedom = 48.
