
In-Class Exercise: Linear Regression in R

You'll need two files to do this exercise: linearRegression.r (the R script file) and mtcars.csv
(the data file¹). Both of those files can be found on the course site. The data was extracted
from the 1974 Motor Trend US magazine and comprises fuel consumption and 10 aspects of
automobile design and performance for 32 automobiles.
Download both files and save them to the folder where you keep your R files.

Part 1: Look at the Data File


1) Start RStudio.
2) Open the mtcars.csv data file. You'll see something like this:

This is the raw data for our analysis. It is a comma-separated values (CSV) file, which
just means that each data value is separated by a comma.
Now look at the contents of the file. The first line contains the names of the fields (think
of them like columns in a spreadsheet). You can see the first field is called model, the
second field is called mpg, the third field is called cyl, and so on. The remaining lines of
the file contain the data for each car model.
Here is the full list of the variables:

Variable Name    Variable Description
mpg              Miles/(US) gallon (i.e., fuel efficiency)
cyl              Number of cylinders
disp             Displacement (cu. in.)
hp               Gross horsepower
drat             Rear axle ratio
wt               Weight (lb/1000)
qsec             1/4 mile time
vs               V/S
am               Transmission (0 = automatic, 1 = manual)
gear             Number of forward gears
carb             Number of carburetors

We will use this data set to predict the miles per gallon (mpg) based on any combination of
the remaining variables (e.g., cyl, wt, etc.).

¹ Adapted from the built-in R data set.

mpg is a typical outcome variable for regression analysis because it describes a continuous
value.
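If you'd like to peek at the data from inside R as well, here is a minimal sketch using base R (it assumes mtcars.csv is in your current working directory; the cars variable name is just for illustration):

# Read the CSV file into a data frame
cars = read.csv("mtcars.csv")
# Show the first few rows and the type of each column
head(cars)
str(cars)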
3) Close the mtcars.csv file by selecting File/Close. If it asks you to save the file,
choose Don't Save.

Part 2: Explore the linearRegression.r Script


1) Open the linearRegression.r file. This contains the R script that performs the linear
regression analysis.
2) Look at lines 8 through 14. These contain the parameters for the script. Here's a
rundown:
Variable Name in R   Value                  Description
INPUT_FILENAME       mtcars.csv             The data is contained in mtcars.csv
OUTPUT_FILENAME      RegressionOutput.txt   The text output of the analysis

3) Good news: this analysis does not require installing any additional packages, because
the lm() function is part of base R.
4) Now let's look at the simple linear regression model with only one predictor. Scroll down
to lines 31 through 37:

fit = lm(mpg ~ wt, data = mtcars)

You can see a few things at work:

The lm() function is used to fit linear regression models.

The general formula syntax is outcome ~ predictor1 + predictor2 + etc.; a simple
linear regression has just one predictor.

mpg is the outcome variable you're trying to predict (i.e., fuel efficiency).

Variable(s) to the right of the ~ are used to predict the outcome. Here we have only
one predictor, wt (the car's weight).
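As a quick check (not part of the provided script), you can pull the fitted coefficients straight out of the model object:

# Intercept (b0) and slope (b1) estimated by lm()
coef(fit)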

5) Now let's look at a multiple linear regression model with more than one predictor.
Scroll down to lines 43 through 47:


mfit = lm(mpg ~ wt + disp + cyl, data = mtcars)

The only change compared to the previous model is that now we have more than one
predictor (i.e., wt, disp, and cyl). Specifically, we are now looking at the effect on
fuel efficiency of not just the weight, but also the number of cylinders and the engine
volume, or displacement.
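As an aside (not shown in the provided script), R's formula shorthand . means "all other columns in the data frame", so a model using every remaining variable as a predictor could be sketched as:

# Hypothetical: regress mpg on all other columns.
# If your data frame has a model-name column, drop it first,
# or R will try to treat the 32 car names as a factor.
mall = lm(mpg ~ ., data = mtcars)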

Part 3: Execute the linearRegression.r Script


1) Select Session/Set Working Directory/To Source File Location to change the working
directory to the location of your R script.
2) Select Code/Run Region/Run All. The script should take only a few seconds to run.
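If you prefer typing commands, the same thing can be done from the console with base R's source() function (an alternative to the menu, not a required step):

# Run the whole script; echo = TRUE prints each line as it executes
source("linearRegression.r", echo = TRUE)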

Part 4: Interpreting the Output


We fit a model to our data. That's great! But the important question is: is it any good?
There are lots of ways to evaluate model fit, and the summary() function consolidates some
of the most popular ones. You can invoke summary() on any model you've fit with lm() and
get some metrics indicating the quality of the fit.
Now we can look at the details of this fit with the summary function:
> summary(fit)

Call:
lm(formula = mpg ~ wt, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max
-4.5432 -2.3647 -0.1252  1.4096  6.8727

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
wt           -5.3445     0.5591  -9.559 1.29e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.046 on 30 degrees of freedom
Multiple R-squared: 0.7528,	Adjusted R-squared: 0.7446
F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10

This output contains a lot of information. Let's look at a few parts of it.
(1) Briefly, it first shows the call; that's the way the function was called: miles per
gallon (y) explained by weight (x) using the mtcars data. The regression equation we
would like to fit is

$\widehat{mpg} = b_0 + b_1 \cdot wt$

(2) This next part summarizes the residuals: that's how much the model got each of
its predictions wrong, i.e., how different the predictions were from the actual results.
(3) This table, the most interesting part, shows the coefficients: the actual
predictors and the significance of each.

First, we have our estimate of the intercept ($b_0$): the estimated value of
$b_0$ is 37.2851. Hypothetically, if we had a car with a weight of 0, the
predicted miles per gallon of the car based on our linear model would be
37.2851.

Then we can see the effect of the weight variable on miles per gallon ($b_1$):
the estimated value of $b_1$ is -5.3445. This is the coefficient, or slope, for
weight. It shows that there's a negative relationship, where increasing the
weight decreases the miles per gallon. In particular, it shows that increasing
the weight by 1000 pounds decreases the efficiency by about 5.3 miles per gallon.

So the table gives us the fitted regression line:

$\widehat{mpg} = 37.2851 - 5.3445 \cdot wt$

You can then use this equation to predict the gas mileage of a car that has a
weight of, say, 4500 pounds (see the sketch at the end of this section).

This second column is called the standard error: we won't examine it here, but
in short, it represents the amount of uncertainty in our estimate of the slope.

The third column is the t-value of the coefficient estimate, a mathematically
relevant value that is used to compute the last column, the p-value, which
describes whether this relationship could be due to chance alone. Smaller
p-values (typically p-value < 0.05) indicate that the relationship is
statistically significant.

(4) Multiple R-squared ($R^2$): used to evaluate the goodness of fit of your model.
Higher is better, with 1 being the best. It corresponds to the amount of variability in
what you're predicting that is explained by the model. In this instance, about 75% of the
variation in mpg can be explained by the car's weight.
(5) Adjusted R-squared: similar to multiple R-squared, but it applies a small
penalty as you include more variables.
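To make these numbers concrete, here is a minimal sketch using base R (the wt = 4.5 value is the 4500-pound example from above, since wt is measured in thousands of pounds):

# Predict mpg for a 4500-pound car
predict(fit, newdata = data.frame(wt = 4.5))
# By hand: 37.2851 - 5.3445 * 4.5, i.e., about 13.2 mpg

# Pull the fit metrics out of the summary object
summary(fit)$r.squared       # multiple R-squared
summary(fit)$adj.r.squared   # adjusted R-squared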


Here is the list of output values returned by the summary() function.

1. Residuals: The residuals are the difference between the actual values of the outcome
variable and the values predicted by your regression, $y - \hat{y}$. For most
regressions you want your residuals to look like a normal distribution when plotted.
If your residuals are normally distributed, this indicates that the mean of the
difference between your predictions and the actual values is close to 0 (good).

2. Significance codes: The stars are shorthand for significance levels, with the number
of asterisks displayed according to the p-value computed: *** for high significance and
* for low significance. In this case, *** indicates that it's unlikely that no
relationship exists between weight and mpg.

3. Coefficient Estimates: The coefficient estimates are the values calculated by the
regression. With a regression model $\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots$, the
$b_0, b_1, b_2$ are the coefficients that we would like to estimate. These values
measure the marginal effect of each predictor variable on the outcome variable.

4. Standard Error of the Coefficient Estimate (Std. Error): A measure of the variability
in the estimate for the coefficient. Lower is better, but this number is relative to the
value of the coefficient. As a rule of thumb, you'd like this value to be at least an
order of magnitude less than the coefficient estimate.

5. t value of the Coefficient Estimate: A score that measures whether or not the
coefficient for this variable is meaningful for the model. You probably won't use this
value itself, but know that it is used to calculate the p-value and the significance
levels.

6. Pr(>|t|) (i.e., Variable p-value): Another score that measures whether or not the
coefficient for this variable is meaningful for the model. You want this number to be as
small as possible. If the number is really small, R will display it in scientific
notation. In our example, < 2e-16 means that the odds that this coefficient is
meaningless are about 1 in 5,000,000,000,000,000.

7. Significance Legend: The more punctuation there is next to your variables, the better.
Blank = bad, dots = pretty good, stars = good, more stars = very good.

8. Residual Std Error / Degrees of Freedom: The Residual Std Error is just the standard
deviation of your residuals. You'd like this number to be proportional to the quantiles
of the residuals in #1. For a normal distribution, the 1st and 3rd quantiles should be
1.5 +/- the std error.
The Degrees of Freedom is the difference between the number of observations included in
your training sample and the number of variables used in your model (the intercept
counts as a variable).

9. R-squared: A metric for evaluating the goodness of fit of your model. Higher is
better, with 1 being the best. It corresponds to the amount of variability in what
you're predicting that is explained by the model. In this instance, ~75% of the
variation in mpg is explained by the car's weight.
WARNING: While a high R-squared indicates good correlation, correlation does not always
imply causation.

10. F-statistic & resulting p-value: Performs an F-test on the model. This takes the
parameters of our model (in our case we only have 1) and compares it to a model that
has fewer parameters. In theory, the model with more parameters should fit better. If
the model with more parameters (your model) doesn't perform better than the model with
fewer parameters, the F-test will have a high p-value (the extra parameters probably do
NOT provide a significant boost). If the model with more parameters is better than the
model with fewer parameters, you will have a lower p-value.
The DF, or degrees of freedom, pertains to how many variables are in the model. In our
case there is one variable, so there is one degree of freedom.
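Row 1 above suggests plotting the residuals to see whether they look normal; here is a minimal sketch of how you might do that in base R (these plotting calls are not part of the provided script):

# Histogram of the residuals: roughly bell-shaped is good
hist(residuals(fit), main = "Residuals of fit", xlab = "Residual")
# Normal Q-Q plot: points close to the line suggest normality
qqnorm(residuals(fit))
qqline(residuals(fit))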

Try it:
Look at the results returned by summary(mfit) and try to interpret the output.
> summary(mfit)

Call:
lm(formula = mpg ~ wt + disp + cyl, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max
-4.4035 -1.4028 -0.4955  1.3387  6.0722

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 41.107678   2.842426  14.462 1.62e-14 ***
wt          -3.635677   1.040138  -3.495  0.00160 **
disp         0.007473   0.011845   0.631  0.53322
cyl         -1.784944   0.607110  -2.940  0.00651 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.595 on 28 degrees of freedom
Multiple R-squared: 0.8326,	Adjusted R-squared: 0.8147
F-statistic: 46.42 on 3 and 28 DF,  p-value: 5.399e-11

Questions:
(1) Which predictor variables are statistically significant in predicting mpg?
(2) How does this model's fit compare to that of the simple linear regression model?
Answers:
(1) wt and cyl (their p-values are below 0.05; disp's is not).
(2) The multiple R-squared is 0.8326, which is larger than 0.7528, indicating a better fit.
The adjusted R-squared (0.8147 vs. 0.7446) is also larger, so the improvement holds even
after penalizing the extra predictors.
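If you want a formal comparison of the two nested models rather than just eyeballing R-squared, base R's anova() runs the F-test for you (not part of the provided script):

# Does adding disp and cyl significantly improve on the wt-only model?
anova(fit, mfit)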
