
Chapter 9

Multiple Regression
STAT 3022
School of Statistics, University of Minnesota
Spring 2014
Introduction
Example: Consider the following study, designed to investigate how
to elevate meadowfoam production to a profitable crop.
Explanatory variables: two light-related factors
light intensity (150, 300, 450, 600, 750, and 900 μmol/m²/sec)
timing of the onset of the light treatment (at PFI or 24 days
before PFI)
Response variable: number of flowers per meadowfoam plant
What are the effects of differing light intensity levels? What is the
effect of the timing? Does the effect of intensity depend on the
timing?
PFI = photoperiodic floral induction
Graphical Summary
[Figure: Scatterplot of flowers vs. intensity — x-axis: Light intensity (200 to 800), y-axis: Number of flowers per plant (30 to 70)]
Graphical Summary
> plot(Flowers ~ Intens, data = case0901,
+ col = as.numeric(case0901$Time), pch = as.numeric(case0901$Time),
+ xlab="Light intensity", ylab="Number of flowers per plant",
+ main="Scatterplot of flowers vs. intensity")
> legend(800, 75, c("Late", "Early"), col = c(1,2), pch = c(1,2))
[Figure: Scatterplot of flowers vs. intensity, with the Late and Early groups marked by separate colors and symbols — x-axis: Light intensity, y-axis: Number of flowers per plant]
Including Time Variable
The scatterplot on the previous slide suggests that two regression
lines, one for the Late treatment (at PFI) and one for the Early
treatment (24 days before PFI), might be appropriate.
> data_late <- subset(case0901, Time == "Late")[, c("Flowers", "Intens")]
> data_early <- subset(case0901, Time == "Early")[, c("Flowers", "Intens")]
> head(data_late)
Flowers Intens
1 62.3 150
2 77.4 150
3 55.3 300
4 54.2 300
5 49.6 450
6 61.9 450
> head(data_early)
Flowers Intens
13 77.8 150
14 75.6 150
15 69.1 300
16 78.0 300
17 57.0 450
18 71.1 450
Including Time Variable (2)
> m_late <- lm(Flowers ~ Intens, data = data_late)
> m_early <- lm(Flowers ~ Intens, data = data_early)
> m_late$coefficients
(Intercept) Intens
71.62333349 -0.04107619
> m_early$coefficients
(Intercept) Intens
83.14666684 -0.03986667
Consider plotting both lines on the scatterplot from before.
Two Regression Lines
> plot(Flowers ~ Intens, data = case0901, xlab="", ylab="",
+ col = as.numeric(case0901$Time), pch = as.numeric(case0901$Time))
> abline(m_late, col = 1, lty = 1)
> abline(m_early, col = 2, lty = 2)
> legend(750, 80, col = c(1,2), pch = c(1,2), lty = c(1,2),
+ legend=c("Late","Early"))
[Figure: Scatterplot with the two separately fitted regression lines — Late (solid) and Early (dashed)]
Summary
Including separate regression lines seems to suggest parallel lines,
i.e., only their intercepts differ.
What if we wanted to include both the variables Intensity and
Time in our model?

μ{Y | X₁, X₂} = β₀ + β₁X₁ + β₂X₂

where X₁ = intensity, and
X₂ = 0 for Late (at PFI), 1 for Early (24 days before PFI)

Q: Using this model, how would we interpret β₂?
A: β₂ is the difference in the intercepts, or the vertical difference
between the two regression lines.
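As a self-contained sanity check on this coding, the sketch below (data values transcribed from the case0901 listing shown later in these slides; variable names are my own) fits the indicator model directly with lm:

```r
# Meadowfoam data transcribed from the case0901 listing in these slides
flowers_late  <- c(62.3, 77.4, 55.3, 54.2, 49.6, 61.9, 39.4, 45.7, 31.3, 44.9, 36.8, 41.9)
flowers_early <- c(77.8, 75.6, 69.1, 78.0, 57.0, 71.1, 62.9, 52.2, 60.3, 45.6, 52.6, 44.4)
intens <- rep(c(150, 300, 450, 600, 750, 900), each = 2)  # same design in both groups

dat <- data.frame(
  Flowers = c(flowers_late, flowers_early),
  Intens  = c(intens, intens),
  X2      = rep(c(0, 1), each = 12)   # indicator: 0 = Late, 1 = Early
)

fit <- lm(Flowers ~ Intens + X2, data = dat)
coef(fit)

# Because both timing groups use the identical set of intensity levels,
# the indicator coefficient equals the raw difference in group means
mean(flowers_early) - mean(flowers_late)
```

The indicator coefficient comes out near 12.16 flowers per plant, the vertical shift between the two parallel lines.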
R
> m <- lm(Flowers ~ Intens + Time, data = case0901)
> summary(m)
Call:
lm(formula = Flowers ~ Intens + Time, data = case0901)
Residuals:
Min 1Q Median 3Q Max
-9.652 -4.139 -1.558 5.632 12.165
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 71.305834 3.273772 21.781 6.77e-16 ***
Intens -0.040471 0.005132 -7.886 1.04e-07 ***
TimeEarly 12.158333 2.629557 4.624 0.000146 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Residual standard error: 6.441 on 21 degrees of freedom
Multiple R-squared: 0.7992, Adjusted R-squared: 0.78
F-statistic: 41.78 on 2 and 21 DF, p-value: 4.786e-08
R - Summary
> cf <- m$coefficients
> m1 <- c(cf[1], cf[2])
> m2 <- c(cf[1] + cf[3], cf[2])
> plot(Flowers ~ Intens, data = case0901, xlab="", ylab="",
+ col = as.numeric(case0901$Time), pch = as.numeric(case0901$Time))
> abline(m1, col=1, lty=1)
> abline(m2, col=2, lty=2)
> legend(750, 80, col=c(1,2), pch=c(1,2), lty=c(1,2), legend=c("Late","Early"))
[Figure: Scatterplot with the two lines drawn from the multiple regression coefficients — Late (solid) and Early (dashed)]
case0902: Why Do Some Mammals Have Large Brains?
Example
Brain size is an interesting variable for studying evolution. Bigger
brains are not always better: they are associated with fewer
offspring and longer pregnancies.
After controlling for body size, what characteristics are associated
with large brains?
Data: For 96 species of mammals, data consists of average values
for
brain weight
body weight
gestation lengths (length of pregnancy)
litter size
The Multiple Regression Model
Definition: The regression of Y on X₁ and X₂, μ{Y | X₁, X₂}, is an
equation that describes the mean of Y for particular values of X₁
and X₂.
Some examples of multiple linear regression models are

μ{Y | X₁, X₂} = β₀ + β₁X₁ + β₂X₂
μ{Y | X₁, X₂} = β₀ + β₁X₁ + β₂X₂ + β₃X₁X₂
μ{Y | X₁, X₂} = β₀ + β₁log(X₁) + β₂log(X₂)
μ{Y | X₁} = β₀ + β₁X₁ + β₂X₁²

In multiple regression there is a single response Y and two or more
explanatory variables, X₁, X₂, . . . , Xₚ.
Note that the constant term β₀ is included in all models, unless a
specific reason for excluding it exists.
Constant Variance
The ideal regression model assumes constant variation:

Var{Y | X₁, X₂, . . . , Xₚ} = σ²
For the meadowfoam example, this means that the variation of
points about the regression lines is the same for all values of light
and time.
The constant variance assumption is important for two reasons:
the regression interpretation is more straightforward when
explanatory variables are only associated with the mean of the
response distribution
the assumption justifies the standard inferential tools
Regression Coecients
Regression analysis involves:
finding a model for the response mean that fits well
wording the questions of interest in terms of model parameters
estimating the parameters from the available data
employing appropriate inferential tools for answering the
questions of interest
First we must discuss the meaning of regression coefficients. What
questions can they help answer?
Regression Surfaces
Consider the multiple linear regression model with two explanatory
variables

μ{Y | X₁, X₂} = β₀ + β₁X₁ + β₂X₂

The model describes the regression surface as a plane, rather than
a line.
Imagine a 3-dimensional space with Y as the vertical axis, X₁ as
the horizontal axis, and X₂ as the out-of-page axis:
β₀ = height of the plane when X₁ = X₂ = 0
β₁ = slope of the plane as a function of X₁ for any fixed value of X₂
β₂ = slope of the plane as a function of X₂ for any fixed value of X₁
Effects of Explanatory Variables
Definition
The effect of an explanatory variable is the change in the mean
response that is associated with a one-unit increase in that variable
while holding all other explanatory variables fixed.
Example
In the meadowfoam study:
light effect = μ{flowers | light + 1, time} − μ{flowers | light, time}
= (β₀ + β₁(light + 1) + β₂time) − (β₀ + β₁light + β₂time)
= β₁
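The same calculation can be done numerically. A minimal sketch, plugging in the estimates reported in the summary(m) output earlier in these slides (function and variable names are my own):

```r
# Coefficient estimates as reported in the earlier summary(m) output
b0 <- 71.305834   # intercept
b1 <- -0.040471   # light intensity coefficient
b2 <- 12.158333   # timing (Early) coefficient

# Mean function implied by the model mu{flowers | light, early}
mu <- function(light, early) b0 + b1 * light + b2 * early

# The light effect: raise intensity by one unit while holding timing fixed
light_effect <- mu(501, 0) - mu(500, 0)
light_effect   # equals b1, no matter which light level or timing we pick
```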
Causal vs. Associative Effects
If regression analysis involves results of a randomized experiment,
interpretation of the effect of an explanatory variable implies
causation.
"A one-unit increase in light intensity causes the mean
number of flowers to increase by β₁."
For observational studies, interpretation is less straightforward.
We cannot make causal conclusions from statistical
association.
The Xs cannot be held fixed independently of one another
because they were not controlled.
"For any subpopulation of mammal species with the same
body weight and litter size, a one-day increase in the species'
gestation length is associated with an increase in mean brain
weight of β₂ grams." (read case0902)
Interpretation of Coefficients
Interpretation of β₁ in the model

μ{brain | gestation} = β₀ + β₁gestation

differs from the interpretation of β₁ in the model

μ{brain | gestation, body} = β₀ + β₁gestation + β₂body

First model: β₁ measures the rate of change in mean brain weight
with changes in gestation length in the population of all
mammal species.
Second model: β₁ measures the rate of change in mean brain
weight with changes in gestation length within subpopulations
of fixed body size.
Furthermore, the coefficients themselves will likely change depending
on which Xs are included (unless the correlation between gestation
and body is 0).
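This shift in coefficients is easy to reproduce. A simulated illustration (not the case0902 data; all names and numbers here are invented for the demonstration) in which two correlated explanatory variables both drive the response:

```r
# Simulated illustration: when explanatory variables are correlated,
# a variable's coefficient depends on what else is in the model.
set.seed(1)
n <- 500
body <- rnorm(n)                          # stand-in for log body weight
gest <- 0.8 * body + rnorm(n, sd = 0.6)   # correlated with body
brain <- 1 + 0.5 * gest + 0.7 * body + rnorm(n, sd = 0.3)

b_alone <- coef(lm(brain ~ gest))["gest"]          # population-wide slope
b_adj   <- coef(lm(brain ~ gest + body))["gest"]   # slope within fixed body size

c(alone = b_alone, adjusted = b_adj)
# The one-variable slope absorbs part of body's effect through the
# correlation, so it is noticeably larger than the adjusted slope
# (which sits near 0.5, the value used to generate the data).
```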
Mammals Brains
> head(case0902)
Species Brain Body Gestation Litter
1 Quokka 17.50 3.500 26 1.0
2 Hedgehog 3.50 0.930 34 4.6
3 Tree shrew 3.15 0.150 46 3.0
4 Elephant shrew I 1.14 0.049 51 1.5
5 Elephant shrew II 1.37 0.064 46 1.5
6 Lemur 22.00 2.100 135 1.0
> summary(case0902[, -1])
Brain Body Gestation Litter
Min. : 0.45 Min. : 0.017 Min. : 16.0 Min. :1.00
1st Qu.: 12.60 1st Qu.: 2.075 1st Qu.: 63.0 1st Qu.:1.00
Median : 74.00 Median : 8.900 Median :133.5 Median :1.20
Mean : 218.98 Mean : 108.328 Mean :151.3 Mean :2.31
3rd Qu.: 260.00 3rd Qu.: 94.750 3rd Qu.:226.2 3rd Qu.:3.20
Max. :4480.00 Max. :2800.000 Max. :655.0 Max. :8.00
We are interested in the regression of Brain on Body,
Gestation and Litter.
A Matrix of Pairwise Scatterplots
Definition
A scatterplot matrix is a consolidation of all possible pairwise
scatterplots from a set of variables.
The variable that determines each row is represented on the
vertical axis of each scatterplot in that row, while the variable
that determines each column is represented on the horizontal
axis of each scatterplot in that column.
Do any relationships appear to be linear?
Which relationships are the strongest?
Are there any outliers?
Typically we first compare the response to each explanatory variable.
Graphical Summary
pairs(case0902[, -1])
[Figure: Scatterplot matrix of Brain, Body, Gestation, and Litter]
Scatterplots for Brain Weight Data
Consider the top row first. The plot of brain weight versus body
weight is not helpful: the data are clustered in the bottom-left
corner because of an outlier (African elephant).
Mammals differ in size by orders of magnitude (differences are
bigger for bigger mammals), so we use a log transformation for
brain and body weight.
Notice that gestation and litter size are also positive variables
whose observations become more spread out for larger values.
We will consider log transformations for all 4 variables.
Before Transformations
[Figure: Marginal plots of the four untransformed variables — Brain weight (g), Body weight (kg), Gestation (days), Litter size]
After Transformations
[Figure: Marginal plots of the four transformed variables — Brain weight (log scale), Body weight (log scale), Gestation (log scale), Litter size (log scale)]
Updated Scatterplot Matrix
[Figure: Scatterplot matrix of Brain, Body, Gestation, and Litter on log scales]
Notes About Scatterplots
Pronounced relationship between log brain weight and each of
the explanatory variables.
Gestation and litter size are also related to body weight.
Is there an association between gestation and brain weight,
after accounting for the effect of body weight?
Is there an association between litter size and brain weight,
after accounting for the effect of body weight?
Scatterplots do not resolve these questions.
Next course of action: fit a regression model for log brain
weight on log body weight, log gestation, and log litter size.
Regression Output
> m <- lm(log(Brain) ~ log(Body) + log(Gestation) + log(Litter), data = case0902)
> summary(m)
Call:
lm(formula = log(Brain) ~ log(Body) + log(Gestation) + log(Litter),
data = case0902)
Residuals:
Min 1Q Median 3Q Max
-0.95415 -0.29639 -0.03105 0.28111 1.57491
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.85482 0.66167 1.292 0.19962
log(Body) 0.57507 0.03259 17.647 < 2e-16 ***
log(Gestation) 0.41794 0.14078 2.969 0.00381 **
log(Litter) -0.31007 0.11593 -2.675 0.00885 **
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Residual standard error: 0.4748 on 92 degrees of freedom
Multiple R-squared: 0.9537, Adjusted R-squared: 0.9522
F-statistic: 631.6 on 3 and 92 DF, p-value: < 2.2e-16
Conclusions
Controlling for body weight and litter size, an increase in
gestation length of one unit on the log scale is associated with
an increase in mean log brain weight of 0.418.
Controlling for body weight and gestation length, an increase
in litter size of one unit on the log scale is associated with a
decrease in mean log brain weight of 0.310.
In the next chapter we will discuss inferential procedures for
multiple regression in-depth.
Introduction
We can include specially constructed explanatory variables in order
to exhibit
curvature in the regression model
interactive effects of explanatory variables
effects of categorical variables
We accomplish these goals by including
quadratic terms (e.g., X₁²)
product terms (e.g., X₁X₂)
indicator terms, e.g.,
X₃ = 0 if group A, 1 if group B
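All three kinds of constructed variables can be built straight from an R model formula. A small sketch on toy data (names are illustrative only) showing what design matrix each formula produces:

```r
# Toy data to show how R expands formula terms into columns
d <- data.frame(x1 = 1:4,
                x2 = c(2, 1, 4, 3),
                g  = factor(c("A", "A", "B", "B")))

mq <- model.matrix(~ x1 + I(x1^2), d)  # quadratic term via I()
mp <- model.matrix(~ x1 * x2, d)       # x1, x2, and the product x1:x2
mi <- model.matrix(~ g, d)             # indicator gB: 0 for group A, 1 for B

colnames(mq)
colnames(mp)
colnames(mi)
```

The `x1:x2` column is literally the elementwise product of `x1` and `x2`, and `gB` is the 0/1 indicator for group B.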
A Squared Term for Curvature
Consider the scatterplot of yearly corn yield vs. rainfall (1890 -
1927) in six U.S. states:
[Figure: Scatterplot of Yield (roughly 20 to 35) vs. Rainfall (roughly 8 to 16)]
Incorporating Curvature
A straight-line regression model is not adequate.
One model for incorporating curvature includes squared rainfall:

μ{yield | rain} = β₀ + β₁rain + β₂rain²

This allows the effect of rainfall to be different at different levels of
rainfall:

μ{yield | rain + 1} − μ{yield | rain}
= (β₀ + β₁(rain + 1) + β₂(rain + 1)²) − (β₀ + β₁rain + β₂rain²)
= β₁ + β₂(2·rain + 1)

As rainfall increases, its effect changes.
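Plugging the fitted estimates from the regression output on the following slides into this effect formula reproduces the interpretations given there. A small numeric sketch (estimates copied from that summary output):

```r
# Estimates from the fitted quadratic model for corn yield
b1 <- 6.00428    # Rainfall coefficient
b2 <- -0.22936   # I(Rainfall^2) coefficient

# Effect of one more inch of rain, starting from a given rainfall level
rain_effect <- function(rain) b1 + b2 * (2 * rain + 1)

round(rain_effect(8), 1)    #  2.1: going from 8 to 9 inches raises mean yield
round(rain_effect(14), 1)   # -0.6: going from 14 to 15 inches lowers it
```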
Squared Term in R
In R, there is no need to create a new variable to include a squared
term:
> m <- lm(Yield ~ Rainfall + I(Rainfall^2), data=ex0915)
> summary(m)
Call:
lm(formula = Yield ~ Rainfall + I(Rainfall^2), data = ex0915)
Residuals:
Min 1Q Median 3Q Max
-8.4642 -2.3236 -0.1265 3.5151 7.1597
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -5.01466 11.44158 -0.438 0.66387
Rainfall 6.00428 2.03895 2.945 0.00571 **
I(Rainfall^2) -0.22936 0.08864 -2.588 0.01397 *
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Residual standard error: 3.763 on 35 degrees of freedom
Multiple R-squared: 0.2967, Adjusted R-squared: 0.2565
F-statistic: 7.382 on 2 and 35 DF, p-value: 0.002115
Plotting the Fitted Model
> plot(ex0915$Yield ~ ex0915$Rainfall, pch=16,
+ xlab="Rainfall (inches)", ylab="Corn Yield (bu/acre)")
> xx <- seq(min(ex0915$Rainfall), max(ex0915$Rainfall), 1/1000)
> yy <- m$coef %*% rbind(1,xx,xx^2)
> lines(xx, yy, lty=3, col=2, lwd=2)
[Figure: Scatterplot of corn yield vs. rainfall with the fitted quadratic curve — x-axis: Rainfall (inches), y-axis: Corn Yield (bu/acre)]
Interpretations
Effect of rainfall:
An increase from 8 to 9 inches is associated with an increase in
mean yield of 2.1 bushels of corn per acre.
An increase from 14 to 15 inches is associated with a decrease
in mean yield of 0.6 bushels of corn per acre.
Interpretation of individual coefficients is difficult and
unnecessary.
The fitted model suggests that increasing rainfall is associated
with increasing yield only up to a point.
In many situations, the squared term is just there to incorporate
curvature; its coefficient need not be interpreted.
Distinguishing Between Groups
Definition
An indicator variable (or dummy variable) takes on one of two
values:
1 indicates that an attribute is present
0 indicates that the attribute is absent
Example
In the meadowfoam study, consider the variable:
early = 1 if time = 24 (Early, 24 days before PFI), 0 if time = 0 (Late, at PFI)
Indicators in R
In R, a variable that should be coded as a factor may be coded as
numeric at first.
> case0901$early <- with(case0901, factor(as.numeric(Time) - 1))
> case0901
Flowers Time Intens early
1 62.3 Late 150 0
2 77.4 Late 150 0
3 55.3 Late 300 0
4 54.2 Late 300 0
5 49.6 Late 450 0
6 61.9 Late 450 0
7 39.4 Late 600 0
8 45.7 Late 600 0
9 31.3 Late 750 0
10 44.9 Late 750 0
11 36.8 Late 900 0
12 41.9 Late 900 0
13 77.8 Early 150 1
14 75.6 Early 150 1
15 69.1 Early 300 1
16 78.0 Early 300 1
17 57.0 Early 450 1
18 71.1 Early 450 1
19 62.9 Early 600 1
20 52.2 Early 600 1
21 60.3 Early 750 1
22 45.6 Early 750 1
23 52.6 Early 900 1
24 44.4 Early 900 1
Indicators in R (2)
> summary(case0901)
Flowers Time Intens early
Min. :31.30 Late :12 Min. :150 0:12
1st Qu.:45.42 Early:12 1st Qu.:300 1:12
Median :54.75 Median :525
Mean :56.14 Mean :525
3rd Qu.:64.45 3rd Qu.:750
Max. :78.00 Max. :900
Note that early is a factor with 2 levels, not a numeric.
Modeling an Indicator
Consider the regression model

μ{flowers | light, early} = β₀ + β₁light + β₂early

If time = 0, then early = 0, and the regression line is

μ{flowers | light, early = 0} = β₀ + β₁light

If time = 24, then early = 1, and the regression line is

μ{flowers | light, early = 1} = β₀ + β₁light + β₂

The slope of both lines is β₁
The intercept for timing at PFI (late) is β₀
The intercept for timing 24 days prior to PFI (early) is β₀ + β₂
A "parallel lines" model
Fitting Parallel Lines Model
> m_parallel <- lm(Flowers ~ Intens + early, data = case0901)
> summary(m_parallel)
Call:
lm(formula = Flowers ~ Intens + early, data = case0901)
Residuals:
Min 1Q Median 3Q Max
-9.652 -4.139 -1.558 5.632 12.165
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 71.305834 3.273772 21.781 6.77e-16 ***
Intens -0.040471 0.005132 -7.886 1.04e-07 ***
early1 12.158333 2.629557 4.624 0.000146 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Residual standard error: 6.441 on 21 degrees of freedom
Multiple R-squared: 0.7992, Adjusted R-squared: 0.78
F-statistic: 41.78 on 2 and 21 DF, p-value: 4.786e-08
Interpretations
This regression model states that:
the mean number of flowers is a straight-line function of light
intensity for both levels of timing.
the slope of both lines is estimated to be β̂₁ = −0.0405
flowers per plant per μmol/m²/sec.
β̂₂ = 12.158 means that the mean number of flowers with
prior timing at 24 days exceeds the mean number of flowers
with no prior timing by about 12.158 flowers per plant.
Sets of Indicator Variables
What if an explanatory variable has more than two categories?
Definition
When a categorical variable is used in regression it is called a
factor, and the individual categories are called the levels of the
factor.
If there are k levels, then k − 1 indicator variables are needed as
explanatory variables.
Example
In the meadowfoam study, light intensity can be viewed as a
categorical variable with 6 levels. How many indicator variables will
be associated with this factor?
Factorizing Intensity
Since Intens is numeric, we need to create a new factor:
> case0901$light <- with(case0901, factor(Intens))
> summary(case0901)
Flowers Time Intens early light
Min. :31.30 Late :12 Min. :150 0:12 150:4
1st Qu.:45.42 Early:12 1st Qu.:300 1:12 300:4
Median :54.75 Median :525 450:4
Mean :56.14 Mean :525 600:4
3rd Qu.:64.45 3rd Qu.:750 750:4
Max. :78.00 Max. :900 900:4
Note that light is a factor with 6 levels.
Modeling k-Level Factor
With 6 levels, we can set the first level, 150 μmol/m²/sec, as the
reference level. The multiple linear regression model is:

μ{flowers | light, early} = β₀ + β₁L300 + β₂L450 + β₃L600
+ β₄L750 + β₅L900 + β₆early

By reference level, we mean that when
L300 = L450 = L600 = L750 = L900 = 0, we have the estimate
for light = 150.
Consider the regression output on the following slide. In practice
we would treat light as a numerical variable, but we briefly treat it
as a factor for the sake of illustration.
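The k − 1 indicator expansion can be inspected directly. A minimal sketch building the 6-level light factor and looking at the design matrix R constructs for it:

```r
# A 6-level factor like light in the meadowfoam study (4 plants per level)
light <- factor(rep(c(150, 300, 450, 600, 750, 900), each = 4))
X <- model.matrix(~ light)

ncol(X)        # 6 columns: intercept plus k - 1 = 5 indicators
colnames(X)    # no "light150" column: that level is absorbed by the intercept
X[1, ]         # a light = 150 row: all five indicators are 0
```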
Fitting Model with Factors
> m_light <- lm(Flowers ~ light + early, data = case0901)
> summary(m_light)
Call:
lm(formula = Flowers ~ light + early, data = case0901)
Residuals:
Min 1Q Median 3Q Max
-8.979 -4.308 -1.342 5.204 10.204
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 67.196 3.629 18.518 1.05e-12 ***
light300 -9.125 4.751 -1.921 0.071715 .
light450 -13.375 4.751 -2.815 0.011919 *
light600 -23.225 4.751 -4.888 0.000138 ***
light750 -27.750 4.751 -5.841 1.97e-05 ***
light900 -29.350 4.751 -6.178 1.01e-05 ***
early1 12.158 2.743 4.432 0.000365 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Residual standard error: 6.719 on 17 degrees of freedom
Multiple R-squared: 0.8231, Adjusted R-squared: 0.7606
F-statistic: 13.18 on 6 and 17 DF, p-value: 1.427e-05
Q: Can you use this model to estimate the flowers per
meadowfoam plant for light intensity = 800?
A Product Term for Interaction
Definition
Two explanatory variables are said to interact if the effect of one
of them depends on the value of the other.
In multiple regression, an explanatory variable for interaction is
constructed as the product of the two explanatory variables thought
to interact.
Example
Recall a question of interest from the meadowfoam study:
Does the effect of light intensity on mean number of flowers
depend on the timing of the light treatment?
We answer this question by including a product term for interaction.
Interaction Model
In the meadowfoam study, consider the product variable
light × early (where light is numeric, but early is a factor).
Consider the model

μ{flowers | light, early} = β₀ + β₁light + β₂early + β₃(light × early)

When early = 0, what is the slope? What is the intercept?
slope = β₁, intercept = β₀
When early = 1, what is the slope? What is the intercept?
slope = β₁ + β₃, intercept = β₀ + β₂
If β₃ ≠ 0, then the model is not parallel lines.
Interaction in R
> m_interaction <- lm(Flowers ~ Intens * early, data = case0901)
> summary(m_interaction)
Call:
lm(formula = Flowers ~ Intens * early, data = case0901)
Residuals:
Min 1Q Median 3Q Max
-9.516 -4.276 -1.422 5.473 11.938
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 71.623333 4.343305 16.491 4.14e-13 ***
Intens -0.041076 0.007435 -5.525 2.08e-05 ***
early1 11.523333 6.142361 1.876 0.0753 .
Intens:early1 0.001210 0.010515 0.115 0.9096
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Residual standard error: 6.598 on 20 degrees of freedom
Multiple R-squared: 0.7993, Adjusted R-squared: 0.7692
F-statistic: 26.55 on 3 and 20 DF, p-value: 3.549e-07
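As a cross-check, the interaction coefficient should equal the difference between the two separately fitted slopes from earlier in these slides. A quick numeric sketch using those reported values:

```r
# Slopes from the separate fits earlier in these slides
slope_late  <- -0.04107619   # m_late$coefficients, Intens
slope_early <- -0.03986667   # m_early$coefficients, Intens

slope_early - slope_late
# about 0.001210, matching the Intens:early1 estimate in the output above
```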
Interpretations
Consider rearranging the model as

μ{flowers | light, early} = (β₀ + β₂early) + (β₁ + β₃early)light

Both the intercept and the slope depend on the timing.
the effect of light intensity is (β₁ + β₃early)
the effect of timing is (β₂ + β₃light)
So there are 3 different fitted models:
1. separate lines (β₂ ≠ 0, β₃ ≠ 0)
2. parallel lines (β₂ ≠ 0, β₃ = 0)
3. equal lines (β₂ = 0, β₃ = 0)
Further Interpretations
It is often difficult to interpret individual coefficients in an
interaction model.
The coefficient of light, β₁, changes from being a global slope
to being the slope when time = 0.
The coefficient of the product term, β₃, is the difference between
the slope when time = 24 and the slope when time = 0.
To test for the presence of an interaction effect, consider
testing

H₀: β₃ = 0 vs. Hₐ: β₃ ≠ 0

In the meadowfoam example, the p-value for this test is
0.9096. What can we conclude?
When to Include Interaction Terms
We do not routinely include interaction terms in regression models.
We include them:
when a question of interest pertains to an interaction (as in the
meadowfoam study)
when good reason exists to suspect an interaction
when interactions are proposed as a more general model for
the purpose of examining the goodness of fit of a model
without interaction (i.e., does the model with interaction
terms fit better than the one without interaction terms?)
Also, if we include a product term in a model, we should also
include the individual terms unless otherwise specified.
If we have a light × time interaction, make sure both light
and time are in the model.
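R's formula interface enforces this convention for us: writing `light * time` automatically includes both main effects along with the interaction. A small sketch on toy data (names illustrative only):

```r
# The * operator in a formula expands to main effects plus interaction
d <- data.frame(y = rnorm(8), light = rnorm(8),
                time = factor(rep(c("Late", "Early"), 4)))

attr(terms(y ~ light * time, data = d), "term.labels")
# "light" "time" "light:time" -- the individual terms come along for free
```

Using `light:time` alone, by contrast, would add only the product term, which is rarely what we want.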
Strategy for Data Analysis
After defining the questions of interest, reviewing the study design
and model assumptions, and correcting any errors in the data:
1. Explore the data graphically.
Look for initial answers to questions.
Consider transformations.
Check for outliers.
2. Formulate an inferential model.
Word questions of interest in terms of model parameters.
3. Check the model.
Check for nonconstant variance and outliers.
If appropriate, fit interactions or curvature.
See if extra terms can be dropped from the model.
4. Infer the answers to the questions of interest, using
appropriate inferential tools.
Confidence intervals and/or tests for regression coefficients.
Prediction intervals and/or confidence intervals for the mean.
Then present your results, in the context of the problem.
