
Spring 2014

Stat 431
Solution to Homework 5
(100 points total)

1. (15 points) Plots are shown in Figure 1.

[Figure 1: fitted regression for (a), (b) and (c); x in (0, 80), y in (150, 190)]


R-code:
# (a) linear and (b) quadratic fits, drawn on one plot
curve(161 + 2.6*x/10, 0, 80, ylim = c(150, 190))
curve(96.2 + 33.6*x/10 - 3.2*(x/10)^2, 0, 80, add = TRUE)
# (c) piecewise-constant fit; "+ x - x" makes curve() draw a
# horizontal segment at the given level over each interval
curve(157.2 + x - x, 0, 29, xlim = c(0, 80), ylim = c(150, 190))
curve(157.2 + 19.1 + x - x, 30, 44, xlim = c(0, 80), ylim = c(150, 190), add = TRUE)
curve(157.2 + 27.2 + x - x, 45, 64, xlim = c(0, 80), ylim = c(150, 190), add = TRUE)
curve(157.2 + 8.5 + x - x, 65, 80, xlim = c(0, 80), ylim = c(150, 190), add = TRUE)
2. (20 points)
(a) (5 points) Output of regression:
Call:
lm(formula = Mailings ~ HoursOfEffort * Aware)

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)
(Intercept)               2.454      2.502   0.981   0.3286
HoursOfEffort            13.821      1.088  12.705   <2e-16 ***
AwareYES                  1.707      3.991   0.428   0.6697
HoursOfEffort:AwareYES    4.308      1.683   2.560   0.0117 *

Residual standard error: 11.16 on 121 degrees of freedom


Multiple R-squared: 0.7665,
Adjusted R-squared: 0.7607
F-statistic: 132.4 on 3 and 121 DF, p-value: < 2.2e-16

The interaction coefficient is 4.308: for each one-unit increase in HoursOfEffort, the predicted increase in Mailings is 4.308 greater for a customer who was aware of FedEx than for a customer who was not.
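The two group slopes can be read directly off the coefficient vector. A minimal sketch (the data frame name fedex is an assumption; fit1 is the model object used in part (d)):
R-code:
# slope of Mailings on HoursOfEffort within each Aware group
fit1 <- lm(Mailings ~ HoursOfEffort * Aware, data = fedex)
b <- coef(fit1)
b["HoursOfEffort"]                                # Aware = NO slope: 13.821
b["HoursOfEffort"] + b["HoursOfEffort:AwareYES"]  # Aware = YES slope: 18.129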
(b) (5 points) We use an ANOVA (partial F) test of $H_0: \beta_{Aware} = 0$ vs. $H_1: \beta_{Aware} \neq 0$. Output:
Analysis of Variance Table

Model 1: Mailings ~ HoursOfEffort
Model 2: Mailings ~ HoursOfEffort + Aware
  Res.Df   RSS Df Sum of Sq      F    Pr(>F)
1    123 19188
2    122 15896  1    3292.7 25.272 1.729e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We can see the P-value is smaller than 0.05, which means Aware does explain a significant extra amount of variation in the number of shipments per month by customers. (Full credit can also be earned by comparing the model Mailings ~ HoursOfEffort to the model Mailings ~ HoursOfEffort * Aware, which includes the interaction term.)
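The comparison above can be reproduced with anova(); a minimal sketch, again assuming the data frame is named fedex:
R-code:
# partial F test for Aware after HoursOfEffort
fit.red  <- lm(Mailings ~ HoursOfEffort, data = fedex)
fit.full <- lm(Mailings ~ HoursOfEffort + Aware, data = fedex)
anova(fit.red, fit.full)  # F = 25.272, p = 1.729e-06, as in the table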
(c) (5 points) We split the data by Aware and fit the regression separately in each group. Output:
Call:
lm(formula = MailingYES ~ HoursOfEffortYES)

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)         4.161      3.464   1.201    0.236
HoursOfEffortYES   18.129      1.430  12.679   <2e-16 ***

Residual standard error: 12.43 on 48 degrees of freedom


Multiple R-squared: 0.7701,
Adjusted R-squared: 0.7653
F-statistic: 160.8 on 1 and 48 DF, p-value: < 2.2e-16
Call:
lm(formula = MailingNO ~ HoursOfEffortNO)

Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)       2.4541     2.2955   1.069    0.289
HoursOfEffortNO  13.8214     0.9982  13.846   <2e-16 ***
Residual standard error: 10.24 on 73 degrees of freedom
Multiple R-squared: 0.7242,
Adjusted R-squared: 0.7204
F-statistic: 191.7 on 1 and 73 DF, p-value: < 2.2e-16
We can see that the coefficient of HoursOfEffort in the No group is the same as in the regression of (a), while the coefficient of HoursOfEffort in the Yes group equals the coefficient of HoursOfEffort plus the interaction coefficient from (a). Likewise, the intercept in the No group is the same as the intercept in (a), and the intercept in the Yes group equals the intercept in (a) plus the coefficient of Aware in (a).
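A sketch of this split-sample check (the data frame name fedex is an assumption):
R-code:
# refit the simple regression separately within each Aware group
fit.no  <- lm(Mailings ~ HoursOfEffort, data = subset(fedex, Aware == "NO"))
fit.yes <- lm(Mailings ~ HoursOfEffort, data = subset(fedex, Aware == "YES"))
coef(fit.no)   # (2.4541, 13.8214): the intercept and slope from (a)
coef(fit.yes)  # (4.161, 18.129): the (a) values plus AwareYES and the interaction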
(d) (5 points)
predict(fit1, newdata = data.frame(HoursOfEffort = 4, Aware = "YES"),
        interval = "prediction", level = 0.95)
       fit      lwr      upr
1 76.67738 53.83333 99.52142
So the prediction interval is [53.83, 99.52].
3. (25 points)
(a) (5 points)
Call:
lm(formula = GP1000M.City ~ Weight + Horsepower, data = car)
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.168e+01  1.727e+00   6.765 6.91e-10 ***
Weight      8.918e-03  8.822e-04  10.109  < 2e-16 ***
Horsepower  8.838e-02  1.226e-02   7.207 7.88e-11 ***
Residual standard error: 3.5 on 109 degrees of freedom
Multiple R-squared: 0.841,
Adjusted R-squared: 0.8381
F-statistic: 288.3 on 2 and 109 DF, p-value: < 2.2e-16
The LS coefficients are 11.68, 0.008918, and 0.08838. $R^2$ is 0.841 and the RMSE is 3.5.
(b) (5 points)
> vif(fit.car)
    Weight Horsepower
  2.202488   2.202488
The VIFs for these two variables are both 2.202488. They are identical because there are only two predictors in the regression: the VIF is a function of the $R^2$ from the (simple) regression of one predictor on the other, and since $R^2$ in simple linear regression is the squared correlation between the response and the predictor, the $R^2$ for regressing Weight on Horsepower is equal to the $R^2$ for regressing Horsepower on Weight.
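This is easy to verify numerically, since $VIF = 1/(1 - R^2)$ and here $R^2$ is the squared pairwise correlation:
R-code:
# with two predictors, each VIF equals 1/(1 - r^2) for their correlation r
r <- cor(car$Weight, car$Horsepower)  # 0.7388965, as in part (c)
1 / (1 - r^2)                         # 2.202488, matching vif(fit.car)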
(c) (5 points) The pairwise correlations are shown below.
> cor(Weight, Horsepower)
[1] 0.7388965
> cor(Weight, H.per.P)
[1] 0.2403865
> cor(Horsepower, H.per.P)
[1] 0.8227992

(d) (5 points)
Call:
lm(formula = GP1000M.City ~ Weight + H.per.P)
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.703e-01  2.047e+00   0.327    0.744
Weight      1.253e-02  6.051e-04  20.707  < 2e-16 ***
H.per.P     2.707e+02  3.623e+01   7.474 2.08e-11 ***
Residual standard error: 3.458 on 109 degrees of freedom
Multiple R-squared: 0.8448,
Adjusted R-squared: 0.842
F-statistic: 296.7 on 2 and 109 DF, p-value: < 2.2e-16
The LS coefficients are 0.6703, 0.01253, and 270.7; $R^2$ is 0.8448 and the RMSE is 3.458. The VIFs of the two predictors are both 1.06133.
(e) (5 points) Comparing the results in (d) and (a), we see that $R^2$ is larger and the RMSE is smaller in (d) than in (a), which means the response is explained better in (d). The collinearity in (d) is also weaker than in (a), as the VIFs in (d) are smaller.
4. (20 points)
(a) (10 points) We use the following two relations:

$$MSR_{X_j} = \frac{MSE_{MLR}}{MSE_{SLR}}, \qquad SE(\hat{\beta}_j^{MLR}) = SE(\hat{\beta}_j^{SLR}) \sqrt{VIF_{X_j} \cdot MSR_{X_j}}.$$

From the two regression summaries we know $MSE_{SLR} = 21.21^2$, $MSE_{MLR} = 19.79^2$, $SE(\hat{\beta}_{MRI}^{MLR}) = 0.5634$, and $SE(\hat{\beta}_{MRI}^{SLR}) = 0.4806$. Solving for the VIF,

$$VIF_{X_{MRI}} = (0.5634/0.4806)^2 \times (21.21^2/19.79^2) = 1.5785.$$
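As a quick arithmetic check in R, plugging in the reported values:
R-code:
# VIF of MRI recovered from the two standard errors and the two MSEs
(0.5634 / 0.4806)^2 * (21.21^2 / 19.79^2)  # approximately 1.5785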


(b) (10 points) The RMSE for PIQ ~ MRI is 21.21 with 36 degrees of freedom, and the RMSE for PIQ ~ MRI + Height + Weight is 19.79 with 34 degrees of freedom. Hence

$$F = \frac{(21.21^2 \times 36 - 19.79^2 \times 34)/2}{19.79^2} = 3.676.$$

Under the null, $F$ follows an $F_{2,34}$ distribution, which gives a P-value of about 0.036. So we reject the null hypothesis $H_0: \beta_{Height} = \beta_{Weight} = 0$ at significance level 0.05.
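The same computation in R, using the RMSEs and degrees of freedom reported above:
R-code:
# partial F statistic and P-value for adding Height and Weight
rss.red  <- 21.21^2 * 36  # RSS of PIQ ~ MRI
rss.full <- 19.79^2 * 34  # RSS of PIQ ~ MRI + Height + Weight
F.stat <- ((rss.red - rss.full) / 2) / (rss.full / 34)
F.stat                                 # approximately 3.676
pf(F.stat, 2, 34, lower.tail = FALSE)  # approximately 0.036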
5. (20 points)
(a) (8 points) Output of the regression summary:
Call:
lm(formula = courseevaluation ~ beauty + tenuretrack + age +
female)

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  4.444818   0.156449  28.411  < 2e-16 ***
beauty       0.134947   0.032932   4.098 4.94e-05 ***
tenuretrack -0.197480   0.060341  -3.273  0.00115 **
age         -0.003809   0.002764  -1.378  0.16883
female      -0.228929   0.052567  -4.355 1.64e-05 ***
Residual standard error: 0.5318 on 458 degrees of freedom
Multiple R-squared: 0.08938,
Adjusted R-squared: 0.08143
F-statistic: 11.24 on 4 and 458 DF, p-value: 1.046e-08


From the summary, a one-unit increase in beauty is associated with a 0.1349 higher course evaluation on average; a tenure-track teacher's course evaluation is predicted to be 0.197480 lower than a non-tenure-track teacher's; each additional year of age is associated with a 0.003809 lower course evaluation on average; and a female teacher's course evaluation is predicted to be 0.228929 lower than a male's. Under the assumption of normally distributed residuals, roughly 95% of the observations will fall within 2(0.5318) of their predicted values.

[Figure 2: Residuals vs fitted values for (a)]
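The residual plot in Figure 2 can be produced along the following lines (a minimal sketch; the object name fit.eval is an assumption):
R-code:
# residuals vs. fitted values for the model in (a)
fit.eval <- lm(courseevaluation ~ beauty + tenuretrack + age + female)
plot(fitted(fit.eval), resid(fit.eval),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)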


(b) (6 points) Output of the regression summary.
Call:
lm(formula = courseevaluation ~ beauty * female)
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)    4.10364    0.03359 122.158  < 2e-16 ***
beauty         0.20027    0.04333   4.622 4.95e-06 ***
female        -0.20505    0.05103  -4.018 6.85e-05 ***
beauty:female -0.11266    0.06398  -1.761   0.0789 .
Residual standard error: 0.5361 on 459 degrees of freedom
Multiple R-squared: 0.07256,
Adjusted R-squared: 0.0665

F-statistic: 11.97 on 3 and 459 DF, p-value: 1.471e-07

From the summary, in our model a female teacher's course evaluation is predicted to be 0.20505 lower than a male's; a one-unit increase in beauty for males is associated with a 0.20027 higher course evaluation on average; and a one-unit increase in beauty for females is associated with a (0.20027 - 0.11266) = 0.08761 higher course evaluation on average.
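The two slopes come straight from the coefficient vector; a minimal sketch (the object name fit.int is an assumption, and female is taken to be the 0/1 indicator used in the output above):
R-code:
# beauty slope for male (female = 0) and female (female = 1) instructors
fit.int <- lm(courseevaluation ~ beauty * female)
b <- coef(fit.int)
b["beauty"]                       # male slope: 0.20027
b["beauty"] + b["beauty:female"]  # female slope: 0.20027 - 0.11266 = 0.08761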


(c) (6 points) The plot is in Figure 3. The black dots are male instructors and red dots are
female instructors.

[Figure 3: course evaluation vs. beauty, with fitted lines for male (black) and female (red) instructors]


R-code:
# male instructors: scatterplot (black) and fitted line
plot(beauty.male, courseevaluation.male, xlab = "beauty", ylab = "course evaluation")
abline(lm(courseevaluation.male ~ beauty.male))
# female instructors: overlay points and fitted line in red
points(beauty.female, courseevaluation.female, col = "red")
abline(lm(courseevaluation.female ~ beauty.female), col = "red")
