Multiple Regression
STAT 3022
School of Statistics, University of Minnesota
Spring 2014
Introduction
Example: Consider the following study, designed to investigate how
to elevate meadowfoam production to a profitable crop.
Explanatory variables: two light-related factors
light intensity (150, 300, 450, 600, 750, and 900 μmol/m²/sec)
timing of the onset of the light treatment (at PFI or 24 days
before PFI)
Response variable: number of flowers per meadowfoam plant
What are the effects of differing light intensity levels? What is the
effect of the timing? Does the effect of intensity depend on the
timing?
PFI = photoperiodic floral induction
Graphical Summary
[Scatterplot of flowers vs. intensity; x-axis: Light intensity,
y-axis: Number of flowers per plant]
Graphical Summary
> plot(Flowers ~ Intens, data = case0901,
+ col = as.numeric(case0901$Time), pch = as.numeric(case0901$Time),
+ xlab="Light intensity", ylab="Number of flowers per plant",
+ main="Scatterplot of flowers vs. intensity")
> legend(800, 75, c("Late", "Early"), col = c(1,2), pch = c(1,2))
[Scatterplot of flowers vs. intensity, points distinguished by timing;
legend: Late, Early]
Including Time Variable
The scatterplot on the previous slide suggests that two regression
lines, one for the Late treatment (at PFI) and one for the Early
treatment (24 days before PFI), might be appropriate.
> data_late <- subset(case0901, Time == "Late")[, c("Flowers", "Intens")]
> data_early <- subset(case0901, Time == "Early")[, c("Flowers", "Intens")]
> head(data_late)
Flowers Intens
1 62.3 150
2 77.4 150
3 55.3 300
4 54.2 300
5 49.6 450
6 61.9 450
> head(data_early)
Flowers Intens
13 77.8 150
14 75.6 150
15 69.1 300
16 78.0 300
17 57.0 450
18 71.1 450
Including Time Variable (2)
> m_late <- lm(Flowers ~ Intens, data = data_late)
> m_early <- lm(Flowers ~ Intens, data = data_early)
> m_late$coefficients
(Intercept) Intens
71.62333349 -0.04107619
> m_early$coefficients
(Intercept) Intens
83.14666684 -0.03986667
Consider plotting both lines on the scatterplot from before.
Two Regression Lines
> plot(Flowers ~ Intens, data = case0901, xlab="", ylab="",
+ col = as.numeric(case0901$Time), pch = as.numeric(case0901$Time))
> abline(m_late, col = 1, lty = 1)
> abline(m_early, col = 2, lty = 2)
> legend(750, 80, col = c(1,2), pch = c(1,2), lty = c(1,2),
+ legend=c("Late","Early"))
[Scatterplot with the two fitted regression lines overlaid;
legend: Late, Early]
Summary
Including separate regression lines seems to suggest parallel lines,
i.e., only their intercepts differ.
What if we wanted to include both the variables Intensity and
Time in our model?

μ{Y | X₁, X₂} = β₀ + β₁X₁ + β₂X₂

where X₁ = intensity and X₂ = timing (an indicator variable).
β₀ = height of the plane when X₁ = X₂ = 0
β₁ = slope of the plane as a function of X₁ for any fixed value of X₂
β₂ = slope of the plane as a function of X₂ for any fixed value of X₁
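As a quick sketch of what fitting such a plane looks like in R, consider simulated data (the variable names and true coefficient values below are made up for illustration and mirror the model above, not case0901):

```r
# Simulated sketch of fitting mu{Y | X1, X2} = beta0 + beta1*X1 + beta2*X2.
set.seed(1)
X1 <- runif(100, 150, 900)   # an "intensity"-like numeric predictor
X2 <- rbinom(100, 1, 0.5)    # a 0/1 "timing"-like predictor
Y  <- 70 - 0.04 * X1 + 12 * X2 + rnorm(100, sd = 5)

# lm() estimates the plane; coefficients should land near 70, -0.04, 12
coef(lm(Y ~ X1 + X2))
```

With two predictors, each slope is interpreted holding the other predictor fixed, which is exactly the reading of β₁ and β₂ above.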
Effects of Explanatory Variables
Definition
The effect of an explanatory variable is the change in the mean
response that is associated with a one-unit increase in that variable
while holding all other explanatory variables fixed.
Example
In the meadowfoam study:

light effect = μ{flower | light + 1, time} − μ{flower | light, time}
             = (β₀ + β₁(light + 1) + β₂ time) − (β₀ + β₁ light + β₂ time)
             = β₁
Causal vs. Associative Effects
If the regression analysis involves results of a randomized
experiment, interpretation of the effect of an explanatory variable
implies causation:
A one-unit increase in light intensity causes the mean
number of flowers to increase by β₁.
[Scatterplot matrix of body weight, gestation, and litter size, each
on the log scale]
Updated Scatterplot Matrix
[Scatterplot matrix of Brain, Body, Gestation, and Litter, all on the
log scale]
Notes About Scatterplots
Pronounced relationship between log brain weight and each of
the explanatory variables.
Gestation and litter size also related to body weight.
Is there an association between gestation and brain weight,
after accounting for the effect of body weight?
Is there an association between litter size and brain weight,
after accounting for the effect of body weight?
Scatterplots do not resolve these questions.
Next course of action: fit a regression model for log brain
weight on log body weight, log gestation, and log litter size.
Regression Output
> m <- lm(log(Brain) ~ log(Body) + log(Gestation) + log(Litter), data = case0902)
> summary(m)
Call:
lm(formula = log(Brain) ~ log(Body) + log(Gestation) + log(Litter),
data = case0902)
Residuals:
Min 1Q Median 3Q Max
-0.95415 -0.29639 -0.03105 0.28111 1.57491
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.85482 0.66167 1.292 0.19962
log(Body) 0.57507 0.03259 17.647 < 2e-16 ***
log(Gestation) 0.41794 0.14078 2.969 0.00381 **
log(Litter) -0.31007 0.11593 -2.675 0.00885 **
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Residual standard error: 0.4748 on 92 degrees of freedom
Multiple R-squared: 0.9537, Adjusted R-squared: 0.9522
F-statistic: 631.6 on 3 and 92 DF, p-value: < 2.2e-16
Conclusions
Controlling for body weight and litter size, an increase in
gestation length of one unit on the log scale is associated with
an increase in mean log brain weight of 0.418.
Controlling for body weight and gestation length, an increase
in litter size of one unit on the log scale is associated with a
decrease in mean log brain weight of 0.310.
In the next chapter we will discuss inferential procedures for
multiple regression in depth.
Introduction
We can include specially constructed explanatory variables in order
to exhibit
curvature in the regression model
interactive effects of explanatory variables
effects of categorical variables
We accomplish these goals by including
quadratic terms (e.g., X₁²)
product terms (e.g., X₁X₂)
indicator terms, e.g.,
X₃ = 0, if group A; 1, if group B
A Squared Term for Curvature
Consider the scatterplot of yearly corn yield vs. rainfall (1890-1927)
in six U.S. states:
[Scatterplot of Yield vs. Rainfall]
Incorporating Curvature
A straight-line regression model is not adequate.
One model for incorporating curvature includes squared rainfall:

μ{yield | rain} = β₀ + β₁ rain + β₂ rain²

This allows the effect of rainfall to be different at different levels of
rainfall:

μ{yield | rain + 1} − μ{yield | rain}
  = (β₀ + β₁(rain + 1) + β₂(rain + 1)²) − (β₀ + β₁ rain + β₂ rain²)
  = β₁ + β₂(2 rain + 1)

As rainfall increases, its effect changes.
Squared Term in R
In R, there is no need to create a new variable to include a squared
term:
> m <- lm(Yield ~ Rainfall + I(Rainfall^2), data=ex0915)
> summary(m)
Call:
lm(formula = Yield ~ Rainfall + I(Rainfall^2), data = ex0915)
Residuals:
Min 1Q Median 3Q Max
-8.4642 -2.3236 -0.1265 3.5151 7.1597
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -5.01466 11.44158 -0.438 0.66387
Rainfall 6.00428 2.03895 2.945 0.00571 **
I(Rainfall^2) -0.22936 0.08864 -2.588 0.01397 *
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Residual standard error: 3.763 on 35 degrees of freedom
Multiple R-squared: 0.2967, Adjusted R-squared: 0.2565
F-statistic: 7.382 on 2 and 35 DF, p-value: 0.002115
Plotting the Fitted Model
> plot(ex0915$Yield ~ ex0915$Rainfall, pch=16,
+ xlab="Rainfall (inches)", ylab="Corn Yield (bu/acre)")
> xx <- seq(min(ex0915$Rainfall), max(ex0915$Rainfall), 1/1000)
> yy <- m$coef %*% rbind(1,xx,xx^2)
> lines(xx, yy, lty=3, col=2, lwd=2)
[Scatterplot of corn yield (bu/acre) vs. rainfall (inches) with the
fitted quadratic curve overlaid]
Interpretations
Effect of rainfall:
An increase from 8 to 9 inches is associated with an increase in
mean yield of 2.1 bushels of corn per acre.
An increase from 14 to 15 inches is associated with a decrease in
mean yield of 0.6 bushels of corn per acre.
Interpretation of individual coefficients is difficult and
unnecessary.
The fitted model suggests that increasing rainfall is associated
with increasing yield only up to a point.
In many situations, the squared term is just there to incorporate
curvature; its coefficient need not be interpreted.
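The two numbers quoted above can be recovered directly from the fitted coefficients (a quick sketch; the coefficient values are copied from the summary(m) output):

```r
# Coefficients copied from the summary(m) output above
b1 <- 6.00428    # Rainfall
b2 <- -0.22936   # I(Rainfall^2)

# Effect of one additional inch of rain, starting from `rain` inches:
# beta1 + beta2 * (2*rain + 1)
rain_effect <- function(rain) b1 + b2 * (2 * rain + 1)
rain_effect(8)    # about +2.1 bu/acre (8 -> 9 inches)
rain_effect(14)   # about -0.6 bu/acre (14 -> 15 inches)
```

This makes concrete how the quadratic term lets the effect of rainfall change sign as rainfall grows.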
Distinguishing Between Groups
Definition
An indicator variable (or dummy variable) takes on one of two
values:
1 indicates that an attribute is present
0 indicates that the attribute is absent
Example
In the meadowfoam study, consider the variable:
early = 1, if time = 24 (light begun 24 days before PFI)
      = 0, if time = 0 (light begun at PFI)
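A minimal sketch of building such an indicator by hand in R (the short vector below is a toy stand-in for the actual Time column of case0901):

```r
# Toy stand-in for the Time column of case0901
Time <- c("Late", "Late", "Early", "Early")

# 1 if the light treatment began 24 days before PFI, 0 otherwise
early <- ifelse(Time == "Early", 1, 0)
early   # 0 0 1 1
```

In practice, letting R treat the variable as a factor (as on the next slides) accomplishes the same coding automatically.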
Indicators in R
In R, a variable that should be coded as a factor may be coded as
numeric at first.
> case0901$early <- with(case0901, factor(as.numeric(Time) - 1))
> case0901
Flowers Time Intens early
1 62.3 Late 150 0
2 77.4 Late 150 0
3 55.3 Late 300 0
4 54.2 Late 300 0
5 49.6 Late 450 0
6 61.9 Late 450 0
7 39.4 Late 600 0
8 45.7 Late 600 0
9 31.3 Late 750 0
10 44.9 Late 750 0
11 36.8 Late 900 0
12 41.9 Late 900 0
13 77.8 Early 150 1
14 75.6 Early 150 1
15 69.1 Early 300 1
16 78.0 Early 300 1
17 57.0 Early 450 1
18 71.1 Early 450 1
19 62.9 Early 600 1
20 52.2 Early 600 1
21 60.3 Early 750 1
22 45.6 Early 750 1
23 52.6 Early 900 1
24 44.4 Early 900 1
Indicators in R (2)
> summary(case0901)
Flowers Time Intens early
Min. :31.30 Late :12 Min. :150 0:12
1st Qu.:45.42 Early:12 1st Qu.:300 1:12
Median :54.75 Median :525
Mean :56.14 Mean :525
3rd Qu.:64.45 3rd Qu.:750
Max. :78.00 Max. :900
Note that early is a factor with 2 levels, not a numeric.
Modeling an Indicator
Consider the regression model

μ{flowers | light, early} = β₀ + β₁ light + β₂ early

If time = 0, then early = 0, and the regression line is

μ{flowers | light, early = 0} = β₀ + β₁ light

If time = 24, then early = 1, and the regression line is

μ{flowers | light, early = 1} = β₀ + β₁ light + β₂

Slope of both lines is β₁.
Intercept for timing at PFI (late) is β₀.
Intercept for timing 24 days prior to PFI (early) is β₀ + β₂.
This is the parallel lines model.
Fitting Parallel Lines Model
> m_parallel <- lm(Flowers ~ Intens + early, data = case0901)
> summary(m_parallel)
Call:
lm(formula = Flowers ~ Intens + early, data = case0901)
Residuals:
Min 1Q Median 3Q Max
-9.652 -4.139 -1.558 5.632 12.165
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 71.305834 3.273772 21.781 6.77e-16 ***
Intens -0.040471 0.005132 -7.886 1.04e-07 ***
early1 12.158333 2.629557 4.624 0.000146 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Residual standard error: 6.441 on 21 degrees of freedom
Multiple R-squared: 0.7992, Adjusted R-squared: 0.78
F-statistic: 41.78 on 2 and 21 DF, p-value: 4.786e-08
Interpretations
This regression model states that:
the mean number of flowers is a straight-line function of light
intensity for both levels of timing.
the slope of both lines is estimated to be β₁ = −0.0405
flowers per plant per μmol/m²/sec.
β₂ = 12.158 means that the mean number of flowers with
prior timing at 24 days exceeds the mean number of flowers
with no prior timing by about 12.158 flowers per plant.
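The two fitted lines can be written out explicitly (a sketch; the coefficient values are copied from the summary(m_parallel) output above):

```r
# Coefficients copied from the summary(m_parallel) output above
b0 <- 71.305834   # (Intercept)
b1 <- -0.040471   # Intens
b2 <- 12.158333   # early1

# Late (at PFI):              mean flowers = b0        + b1 * light
# Early (24 days before PFI): mean flowers = (b0 + b2) + b1 * light
b0        # late intercept, about 71.31
b0 + b2   # early intercept, about 83.46
```

The common slope b1 and the constant vertical gap b2 are exactly what "parallel lines" means here.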
Sets of Indicator Variables
What if an explanatory variable has more than two categories?
Definition
When a categorical variable is used in regression it is called a
factor, and the individual categories are called the levels of the
factor.
If there are k levels, then k − 1 indicator variables are needed as
explanatory variables.
Example
In the meadowfoam study, light intensity can be viewed as a
categorical variable with 6 levels. How many indicator variables will
be associated with this factor?
Factorizing Intensity
Since Intens is numeric, we need to create a new factor:
> case0901$light <- with(case0901, factor(Intens))
> summary(case0901)
Flowers Time Intens early light
Min. :31.30 Late :12 Min. :150 0:12 150:4
1st Qu.:45.42 Early:12 1st Qu.:300 1:12 300:4
Median :54.75 Median :525 450:4
Mean :56.14 Mean :525 600:4
3rd Qu.:64.45 3rd Qu.:750 750:4
Max. :78.00 Max. :900 900:4
Note that light is a factor with 6 levels.
Modeling a k-Level Factor
With 6 levels, we can set the first level, 150 μmol/m²/sec, as the
reference level; the multiple linear regression model is:

μ{flowers | light, early} = β₀ + β₁ L300 + β₂ L450 + β₃ L600
                          + β₄ L750 + β₅ L900 + β₆ early

By reference level, we mean that when
L300 = L450 = L600 = L750 = L900 = 0, we have the estimate
for light = 150.
Consider the regression output on the following slide. In practice
we would treat light as a numerical variable, but we briefly treat it
as a factor for the sake of illustration.
Fitting Model with Factors
> m_light <- lm(Flowers ~ light + early, data = case0901)
> summary(m_light)
Call:
lm(formula = Flowers ~ light + early, data = case0901)
Residuals:
Min 1Q Median 3Q Max
-8.979 -4.308 -1.342 5.204 10.204
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 67.196 3.629 18.518 1.05e-12 ***
light300 -9.125 4.751 -1.921 0.071715 .
light450 -13.375 4.751 -2.815 0.011919 *
light600 -23.225 4.751 -4.888 0.000138 ***
light750 -27.750 4.751 -5.841 1.97e-05 ***
light900 -29.350 4.751 -6.178 1.01e-05 ***
early1 12.158 2.743 4.432 0.000365 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Residual standard error: 6.719 on 17 degrees of freedom
Multiple R-squared: 0.8231, Adjusted R-squared: 0.7606
F-statistic: 13.18 on 6 and 17 DF, p-value: 1.427e-05
Q: Can you use this model to estimate the flowers per
meadowfoam plant for light intensity = 800?
A Product Term for Interaction
Definition
Two explanatory variables are said to interact if the effect of one
of them depends on the value of the other.
In multiple regression, an explanatory variable for interaction is
constructed as the product of the two explanatory variables thought
to interact.
Example
Recall a question of interest from the meadowfoam study:
Does the effect of light intensity on mean number of flowers
depend on the timing of the light treatment?
Answer this question by including a product term for interaction.
Interaction Model
In the meadowfoam study, consider the product variable
light × early (where light is numeric, but early is a factor).
Consider the model

μ{flowers | light, early} = β₀ + β₁ light + β₂ early + β₃ (light × early)

When early = 0, what is the slope? What is the intercept?
slope = β₁, intercept = β₀
When early = 1, what is the slope? What is the intercept?
slope = β₁ + β₃, intercept = β₀ + β₂
If β₃ ≠ 0, then the model is not parallel lines.
Interaction in R
> m_interaction <- lm(Flowers ~ Intens * early, data = case0901)
> summary(m_interaction)
Call:
lm(formula = Flowers ~ Intens * early, data = case0901)
Residuals:
Min 1Q Median 3Q Max
-9.516 -4.276 -1.422 5.473 11.938
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 71.623333 4.343305 16.491 4.14e-13 ***
Intens -0.041076 0.007435 -5.525 2.08e-05 ***
early1 11.523333 6.142361 1.876 0.0753 .
Intens:early1 0.001210 0.010515 0.115 0.9096
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Residual standard error: 6.598 on 20 degrees of freedom
Multiple R-squared: 0.7993, Adjusted R-squared: 0.7692
F-statistic: 26.55 on 3 and 20 DF, p-value: 3.549e-07
Interpretations
Consider rearranging the model as

μ{flowers | light, early} = (β₀ + β₂ early) + (β₁ + β₃ early) light

Both the intercept and the slope depend on the timing.
the effect of light intensity is (β₁ + β₃ early)
the effect of timing is (β₂ + β₃ light)
So there are 3 different fitted models:
1. separate lines (β₂ ≠ 0, β₃ ≠ 0)
2. parallel lines (β₂ ≠ 0, β₃ = 0)
3. equal lines (β₂ = 0, β₃ = 0)
Further Interpretations
It is often difficult to interpret individual coefficients in an
interaction model.
The coefficient of light, β₁, changes from being a global slope
to being the slope when time = 0.
The coefficient of the product term, β₃, is the difference between
the slope when time = 24 and the slope when time = 0.
To test for the presence of an interaction effect, consider
testing

H₀: β₃ = 0 vs. Hₐ: β₃ ≠ 0

In the meadowfoam example, the p-value for this test is
0.9096. What can we conclude?
When to Include Interaction Terms
We do not routinely include interaction terms in regression models.
We include them:
when a question of interest pertains to an interaction (as in
the meadowfoam study)
when good reason exists to suspect interaction
when interactions are proposed as a more general model for
the purpose of examining the goodness of fit of a model
without interaction (i.e., does the model with interaction
terms fit better than the one without interaction terms?)
Also, if we include a product term in a model, we should also
include the individual terms unless otherwise specified.
If we have a light × time interaction, make sure both light
and time are in the model.
Strategy for Data Analysis
After defining the questions of interest, reviewing the study design
and model assumptions, and correcting any errors in the data:
1. Explore the data graphically.
   Look for initial answers to questions.
   Consider transformations.
   Check for outliers.
2. Formulate an inferential model.
   Word questions of interest in terms of model parameters.
3. Check the model.
   Check for nonconstant variance and outliers.
   If appropriate, fit interactions or curvature.
   See if extra terms can be dropped from the model.
4. Infer the answers to the questions of interest, using
   appropriate inferential tools.
   Confidence intervals and/or tests for regression coefficients.
   Prediction intervals and/or confidence intervals for the mean.
Then present your results in the context of the problem.