Erick Farias
Executive summary
This report aims to answer the question: "Is an automatic or manual transmission better for MPG and, if there is a difference, how large is it?"
The assessment of this question started with an exploratory analysis, followed by the fitting of a linear model.
The variables for the initial model were selected by the criterion of least multicollinearity. This initial model was then
tested against others (adding further variables), and the best model was selected by evaluating the ANOVA (nested
model testing), the predicted R-squared and the residuals.
After this testing, the most adequate model was found to be MPG ~ Cylinders + Transmission Type +
Horse Power + Weight, with a predicted R-squared of 80%.
Interpreting the coefficients, it is thus concluded that, leaving all the rest unchanged, there is no significant difference in MPG
between automatic and manual transmission (see the section Coefficient Interpretation for a deeper explanation).
In this plot we see that a manual car apparently runs more miles per gallon. We will proceed to a regression model to check
whether underlying variables explain the MPG change that we see when looking at the transmission factor alone. This way we
can isolate the effect of transmission on MPG as much as possible and understand its influence.
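The exploratory plot is not reproduced here; a minimal sketch of what it presumably shows, assuming the standard built-in mtcars dataset, is:

```r
## Exploratory look: MPG by transmission type (am: 0 = automatic, 1 = manual),
## using the built-in mtcars dataset.
data <- mtcars
boxplot(mpg ~ factor(am), data = data,
        names = c("Automatic", "Manual"),
        xlab = "Transmission", ylab = "Miles per gallon")
## The group means suggest manual cars run more miles per gallon on average:
tapply(data$mpg, data$am, mean)
```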
Variables picked for the initial model: 1. Transmission (am), since it is the variable that answers the question of interest;
2. Horse Power (hp), since it has a strong linear relationship with mpg and a weak one with transmission type, so there is
little overlap in the variance explained.
Initially we also expect the variables qsec (1/4 mile time) and gear (number of forward gears) not to be significant for
the model, as they have little correlation with MPG.
I understand that these two variables explain the largest amount of variation, and that adding others will add little
explanatory power to the model, since their variance overlaps. Nevertheless, we will test the others, adding them one by
one to individual models, and then compare them all in a nested model test through ANOVA.
1: mpg ~ factor(am)
2: mpg ~ factor(am) + hp
3: mpg ~ factor(am) + hp + factor(cyl)
4: mpg ~ factor(am) + hp + factor(cyl) + wt
5: mpg ~ factor(am) + hp + factor(cyl) + wt + disp
6: mpg ~ factor(am) + hp + factor(cyl) + wt + disp + qsec
7: mpg ~ factor(am) + hp + factor(cyl) + wt + disp + qsec + drat
8-10: mpg ~ factor(am) + hp + factor(cyl) + wt + disp + drat + qsec + (remaining variables)
## (output of the nested-model ANOVA comparing models 1-10)
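The nested-model comparison described above can be sketched as follows (a sketch assuming the built-in mtcars dataset; the fit names are illustrative):

```r
## Nested model testing via ANOVA on the built-in mtcars dataset.
data <- mtcars
fit1 <- lm(mpg ~ factor(am), data = data)
fit2 <- lm(mpg ~ factor(am) + hp, data = data)
fit3 <- lm(mpg ~ factor(am) + hp + factor(cyl), data = data)
fit4 <- lm(mpg ~ factor(am) + hp + factor(cyl) + wt, data = data)
## Each row of the table tests whether the terms added at that step
## significantly reduce the residual sum of squares.
anova(fit1, fit2, fit3, fit4)
```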
For the first three added variables (Horse Power [hp], Weight [wt] and Number of Cylinders [cyl]) we had a p-value <= 0.05,
showing that these variables are significant for the model in terms of variance explained versus complexity added. We will
therefore keep them, selecting the model fit4.
For models 5 through 10 the p-values are > 0.05, showing that we should not add those variables, since they are insignificant
by the same criterion.
Analyzing the R-squared
We chose to analyze both the adjusted R-squared and the predicted R-squared as measures of a good fit.
The adjusted R-squared increases only if a new term improves the model more than would be expected by chance, and it can
decrease with poor-quality predictors.
The predicted R-squared is a form of cross-validation, and it can also decrease. Cross-validation determines how well the
model generalizes to other data sets by partitioning the data.
##The predicted R-squared
pred_r_squared <- function(linear.model) {
    lm.anova <- anova(linear.model)
    tss <- sum(lm.anova$"Sum Sq")  ## total sum of squares
    ## predictive R^2
    pred.r.squared <- 1 - PRESS(linear.model)/tss
    return(pred.r.squared)
}
PRESS <- function(linear.model) {
    ## leave-one-out prediction residuals, via the hat (leverage) values
    pr <- residuals(linear.model)/(1 - lm.influence(linear.model)$hat)
    PRESS <- sum(pr^2)
    return(PRESS)
}
summary(fit4)$r.squared ##Multiple R squared
## [1] 0.8658799
summary(fit4)$adj.r.squared ## Adjusted R squared
## [1] 0.8400875
pred_r_squared(fit4) ## Predictive R squared
## [1] 0.8015456
Analyzing the predicted R-squared, we see that it is a little smaller than the adjusted R-squared. One way to think of this is
that 6.5% (86.6% - 80.1%) of the model is explained by too many factors and random correlations, which we would have
wrongly attributed to the model if we were using only the multiple R-squared.
When the model is good and has few terms the difference is small, which is the case here.
So we have further evidence to stay with the model fit4: we can say with some certainty that about 80% of the variance is
explained by this model.
Residual analysis
Now, presupposing that this is a good model, let's take a look at the residuals, checking:
1. that the mean of the residuals is approximately zero;
2. that the residuals are uncorrelated with the predictors;
3. the diagnostic plots (residuals vs. fitted, normal Q-Q, scale-location and residuals vs. leverage).
##Residuals analysis
##1.
mean(resid(fit4))
## [1] 1.12757e-16
##2.
cor(resid(fit4), data$hp)
## [1] -1.560353e-16
cor(resid(fit4), data$cyl)
## [1] 2.350835e-17
cor(resid(fit4), data$am)
## [1] 7.500521e-18
cor(resid(fit4), data$wt)
## [1] -1.027771e-16
##3.
par(mfrow=c(2,2))
plot(fit4)
We see from 1. that the mean of the residuals is ~0, and from 2. that the residuals have no significant correlation with the predictors.
From 3. we see that there is no relevant pattern in the behavior of the residuals and that they are approximately normally distributed.
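The visual normality check can be complemented with a formal Shapiro-Wilk test; a sketch, assuming the model is refit on the built-in mtcars dataset:

```r
## Shapiro-Wilk test of normality on the residuals of fit4.
data <- mtcars
fit4 <- lm(mpg ~ factor(am) + hp + factor(cyl) + wt, data = data)
shapiro.test(resid(fit4))
## A large p-value means we cannot reject the hypothesis that the
## residuals are normally distributed.
```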
Coefficient Interpretation
###Coefficient interpretation
summary(fit4)
## Call:
## lm(formula = mpg ~ factor(am) + hp + factor(cyl) + wt, data = data)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -3.9387 -1.2560 -0.4013  1.1253  5.0513
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)  33.70832    2.60489  12.940 7.73e-13 ***
## factor(am)1   1.80921    1.39630   1.296  0.20646
## hp           -0.03211    0.01369  -2.345  0.02693 *
## factor(cyl)6 -3.03134    1.40728  -2.154  0.04068 *
## factor(cyl)8 -2.16368    2.28425  -0.947  0.35225
## wt           -2.49683    0.88559  -2.819  0.00908 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.41 on 26 degrees of freedom
## Multiple R-squared:  0.8659, Adjusted R-squared:  0.8401
## F-statistic: 33.57 on 5 and 26 DF,  p-value: 1.506e-10
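The "no significant difference" conclusion for the transmission coefficient (p = 0.206) can be made concrete with a confidence interval; a sketch, assuming the model is refit on the built-in mtcars dataset:

```r
## 95% confidence interval for the manual-transmission coefficient.
data <- mtcars
fit4 <- lm(mpg ~ factor(am) + hp + factor(cyl) + wt, data = data)
confint(fit4)["factor(am)1", ]
## The interval contains zero, consistent with the p-value of 0.206:
## holding cyl, hp and wt fixed, we cannot claim that a manual
## transmission changes mpg.
```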